`DataFrame` data structure
The DataFrame data structure is the heart of the pandas library. It is the primary object used in data analysis and cleaning tasks. Conceptually, a DataFrame is a two-dimensional series object with an index and multiple columns of content, each column having a label. Essentially, a DataFrame is a two-axis labeled array.
`DataFrame` Indexing and Loading
Introduction to CSV Files
`Series` Data Structure
A Series in pandas is a one-dimensional array-like object that can hold various data types, similar to a list in Python, but with additional features. It combines elements of lists and dictionaries, storing items in order and allowing access via labels (index).
About
This section serves as my personal repository of CS notes, aimed at reviewing and reinforcing fundamental concepts in computer science, including algorithms, data structures, and data science.
Activity Selection
The Activity Selection problem is a classic example of a greedy algorithm application. It is concerned with selecting the maximum number of activities that don't overlap in time from a given set of activities.
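The greedy idea described above can be sketched as follows. This is a minimal illustration, assuming each activity is a `(start, finish)` tuple and that an activity may begin exactly when the previous one ends; the function name `select_activities` and the sample data are hypothetical.

```python
def select_activities(activities):
    """Return a maximum-size list of non-overlapping (start, finish) pairs."""
    chosen = []
    last_finish = float("-inf")
    # Greedy rule: always consider the activity that finishes earliest next.
    for start, finish in sorted(activities, key=lambda a: a[1]):
        if start >= last_finish:  # does not overlap the last chosen activity
            chosen.append((start, finish))
            last_finish = finish
    return chosen


# Hypothetical sample data for illustration only.
sample_activities = [(1, 4), (3, 5), (0, 6), (5, 7), (8, 9), (5, 9)]
selected = select_activities(sample_activities)
print(selected)
```

Sorting by finish time is what makes the greedy choice safe: finishing as early as possible leaves the most room for the remaining activities.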
Algorithms
It is really hard to organize this section, Algorithms
Array
An array is a basic data structure used in programming to store a collection of elements, all of the same data type. These elements are stored in contiguous memory locations, which allows for efficient indexing and iteration. Arrays are widely used due to their simplicity and performance benefits.
Attention Mechanism
Attention mechanisms address limitations of traditional neural network models such as CNNs and RNNs, which typically rely on fixed-size inputs or fixed-size internal representations. They offer a flexible approach to handling inputs of varying size and content, such as long text sequences. This flexibility is achieved through mechanisms that enable dynamic focus on different parts of the input.
Backtracking
Backtracking algorithms are a type of exhaustive search technique used to solve problems. The core approach involves starting from an initial state and exploring all possible solutions by trying each choice once. Only correct solutions are recorded, and the search continues until a solution is found or all choices are tried. This method utilizes depth-first search to traverse the solution space.
Basic Syntax
Learning Python Syntax
Basics
Functions
Basics
R is a powerful language used for data analysis, statistical computing, and graphical representation. This guide will introduce the basic syntax, data structures, control structures, and functions in R, providing a solid foundation for beginners.
Bellman-Ford Algorithm
The Bellman-Ford algorithm is a graph-searching algorithm that calculates the shortest paths from a single source vertex to all other vertices in a weighted graph. It is particularly useful for graphs containing negative weight edges.
Best Practices for Python Type Conventions
Python is a dynamically typed language, but with the introduction of type hints (PEP 484) and the growing adoption of static type checkers like mypy, type annotations have become essential for writing clear, maintainable code. This guide covers the best practices for using type hints effectively in Python.
Binary Search
Binary search is a highly efficient searching algorithm used to find the position of a target value within a sorted array. This algorithm operates on the divide and conquer principle, which significantly reduces the time complexity compared to linear search methods.
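A minimal iterative sketch of the idea, assuming the input is a list sorted in ascending order. The function name `binary_search` and the convention of returning `-1` for a missing target are choices made for this illustration.

```python
def binary_search(items, target):
    """Return the index of target in the sorted list items, or -1 if absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1   # target can only be in the right half
        else:
            hi = mid - 1   # target can only be in the left half
    return -1


print(binary_search([1, 3, 5, 7, 9, 11], 7))
```

Each iteration halves the search range, which is where the logarithmic running time comes from.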
Breadth-First Search (BFS)
Breadth-first search (BFS) is a widely used algorithm for traversing or searching tree and graph data structures. It starts at a selected node (often referred to as the 'root' in the context of trees), and explores all of the neighbor nodes at the present depth prior to moving on to nodes at the next depth level. This characteristic allows BFS to provide the shortest path to a target node in an unweighted graph, which is one of its most significant advantages.
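The shortest-path property can be illustrated with a short sketch. It assumes the graph is stored as a dictionary mapping each node to a list of neighbors; the function name `bfs_shortest_path` and the sample graph are hypothetical.

```python
from collections import deque


def bfs_shortest_path(graph, start, goal):
    """Return a shortest path (fewest edges) from start to goal, or None."""
    parents = {start: None}      # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical unweighted, directed sample graph.
sample_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}
print(bfs_shortest_path(sample_graph, "A", "F"))
```

Because nodes are dequeued in order of their distance from the start, the first time the goal is reached is along a path with the fewest edges.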
Bubble Sort
Bubble Sort is one of the simplest sorting algorithms in computer science. It repeatedly steps through the list to be sorted, compares adjacent elements, and swaps them if they are in the wrong order. This process continues until the list is sorted.
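A minimal sketch of the procedure described above. The early exit when a pass makes no swaps is a common optimization added here, and returning a sorted copy rather than sorting in place is a choice made for this illustration.

```python
def bubble_sort(items):
    """Return a new list with the elements of items in ascending order."""
    result = list(items)
    for end in range(len(result) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if result[i] > result[i + 1]:
                # Adjacent pair is out of order, so swap it.
                result[i], result[i + 1] = result[i + 1], result[i]
                swapped = True
        if not swapped:
            break  # no swaps in a full pass means the list is sorted
    return result


print(bubble_sort([5, 1, 4, 2, 8]))
```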
Conditionals
Python allows us to compare values using comparison operators, enabling us to make decisions based on these comparisons. This is fundamental for controlling the flow of a program.
Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are specialized neural networks designed primarily for processing structured grid data such as images. CNNs leverage the inherent properties of data like spatial relationships and locality to reduce the complexity and computational cost associated with learning from high-dimensional data.
Data Cleaning with Pandas
In this lecture, we covered a basic data cleaning process using pandas. We focused on cleaning a dataset containing a list of US presidents from Wikipedia. The key operations included importing the dataset, cleaning up names, and converting date formats. Below are the detailed steps and methods used.
Data Import in R
In the world of data analysis, most data is not created within R itself but comes from various data collection software, hardware, and channels such as Excel and the internet. This chapter focuses on how to import data into R to begin data analysis. Readers can either systematically go through the chapter or select topics based on their actual needs and time constraints.
Data Science
Courses
Data Structure
Linear Data Structures
Datasets
| Dataset Source | Description |
Date Functionality
In today's lecture, we explored time series and date functionality in pandas. Manipulating dates and times in pandas is highly flexible, enabling us to conduct advanced analysis such as time series analysis.
Decorators
Python decorators are a very powerful and useful tool that allows programmers to modify the behavior of a function or class. Decorators allow for the extension or modification of the function's behavior without permanently modifying it. Here’s how to use and create decorators in Python:
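As one possible illustration, the sketch below defines a decorator that counts how many times a function is called. The names `count_calls` and `greet` are hypothetical and not taken from the original notes.

```python
import functools


def count_calls(func):
    """Decorator that records how many times func has been called."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls
def greet(name):
    return f"Hello, {name}!"
```

Calling `greet("Ada")` returns the greeting as usual while `greet.calls` is incremented; the body of `greet` itself is left unchanged.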
Deep Learning
Useful Links
Depth-First Search (DFS)
Depth-First Search (DFS) is a fundamental algorithm used in graph theory to traverse or search through the nodes of a graph in a systematic manner. DFS explores as deep as possible along each branch before backtracking, making it an efficient algorithm for tasks that need to explore all the nodes in a graph thoroughly.
Dictionaries
Dictionaries in Python are data structures that store data in key-value pairs, allowing efficient data retrieval and modification based on unique keys.
Diffusion Models
Diffusion models are a class of generative models that learn data distributions by iteratively adding and removing noise from data. They have gained prominence for their ability to generate high-quality samples in domains like image and audio synthesis.
Dijkstra's Algorithm
Dijkstra's algorithm is a graph search algorithm that finds the shortest paths from a single source node to all other nodes in a weighted graph with non-negative edge weights. It was conceived by computer scientist Edsger W. Dijkstra in 1956 and published three years later.
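A compact sketch using a binary heap from the standard library. It assumes the graph is a dictionary mapping each node to a list of `(neighbor, weight)` pairs with non-negative weights; the function name `dijkstra` and the sample graph are hypothetical.

```python
import heapq


def dijkstra(graph, source):
    """Return a dict of shortest distances from source to each reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            new_d = d + weight
            if new_d < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_d
                heapq.heappush(heap, (new_d, neighbor))
    return dist


# Hypothetical weighted, directed sample graph.
weighted_graph = {
    "A": [("B", 4), ("C", 1)],
    "B": [("D", 1)],
    "C": [("B", 2), ("D", 5)],
    "D": [],
}
print(dijkstra(weighted_graph, "A"))
```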
Divide and Conquer Algorithms
Divide and conquer, also known as "divide and rule", is a fundamental algorithm strategy based on recursion, involving two main phases: divide and conquer.
Dynamic Programming
Dynamic Programming (DP) is a technique used to optimize recursive algorithms by storing intermediate results, avoiding redundant calculations, and thus significantly improving computational efficiency. It is particularly useful for problems that exhibit overlapping subproblems and optimal substructure.
Errors and Exceptions
The Try-Except Construct
Fibonacci Series
The Fibonacci series is a sequence of numbers where each number is the sum of the two preceding ones, usually starting with 0 and 1. Mathematically, it is defined as:
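The rule stated in the sentence above (each number is the sum of the two preceding ones, starting with 0 and 1) can be sketched iteratively. The zero-based indexing, where `fibonacci(0)` is 0, is an assumption of this illustration.

```python
def fibonacci(n):
    """Return the n-th Fibonacci number, counting from fibonacci(0) == 0."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b  # slide the window of the two most recent values
    return a


print([fibonacci(i) for i in range(10)])
```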
File Paths and Managing Files
File Paths
Floyd-Warshall Algorithm
The Floyd-Warshall Algorithm is a dynamic programming method used to find the shortest paths between all pairs of nodes in a weighted graph. This algorithm can handle both directed and undirected graphs and works with graphs that have negative weight edges, though it cannot handle graphs with negative weight cycles.
Fractional Knapsack Problem
The Fractional Knapsack Problem is a variant of the knapsack problem where it's permissible to take fractional parts of items rather than having to make a binary choice for each item (all or nothing). This problem allows for a greedy approach to find an optimal solution.
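A minimal sketch of the greedy approach, assuming each item is a `(value, weight)` pair with positive weight. The function name `fractional_knapsack` and the sample items are hypothetical.

```python
def fractional_knapsack(items, capacity):
    """Return the maximum total value that fits in capacity, allowing fractions."""
    total = 0.0
    remaining = capacity
    # Greedy rule: take items in order of value per unit weight, highest first.
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if remaining <= 0:
            break
        take = min(weight, remaining)      # whole item, or the part that fits
        total += value * take / weight
        remaining -= take
    return total


# Hypothetical sample: (value, weight) pairs and a capacity of 50.
best_value = fractional_knapsack([(60, 10), (100, 20), (120, 30)], 50)
print(best_value)
```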
Functions
Defining Functions
Fundamentals
Courses
General Concepts
In machine learning, the evaluation of models is a crucial step to understand their performance. Two key concepts in this process are the hypothesis and the loss functions.
Generative Models
Courses
Graph Algorithms
Graph algorithms are a set of instructions or procedures designed to perform specific tasks on graph data structures. A graph is a collection of nodes (also known as vertices) connected by edges. Graphs are used to model various types of relationships and processes in physical, biological, social, and information systems. The study and application of graph algorithms are central to numerous fields, including computer science, mathematics, network analysis, and social sciences, due to their ability to efficiently solve problems related to connectivity, flow, and routing within complex networks.
Greedy Algorithms
Greedy algorithms are a class of algorithms used to solve optimization problems by making the locally optimal choice at each step with the hope of finding the global optimum.
Grouping Data
Grouping data is an essential task for data analysis, allowing us to understand and manipulate data at a group level. Pandas provides the groupby() function to facilitate this, implementing the split-apply-combine pattern. This pattern involves splitting the data into groups, applying a function to each group, and then combining the results.
Heap Sort
Heap Sort is a comparison-based sorting algorithm that uses a binary heap data structure to efficiently sort elements. It's particularly effective for data sets stored in random access structures like arrays.
Huffman Coding
Huffman Coding is a widely used method for data compression that involves creating variable-length codes for input characters, with the lengths based on the frequencies of the characters. It's an efficient form of lossless compression, which means no information is lost in the compression process.
Idioms
In Python programming, certain idioms are considered more appropriate and efficient than others. Pandas, which works almost like a sub-language within Python, has its own idioms, often referred to as "pandorable." These idioms improve code readability and performance. Here are some key features to make your code pandorable.
index
Introduction to Testing in Python
Indexing `DataFrame`
In pandas, both Series and DataFrame objects can have indices applied to them. An index serves as a row-level label, corresponding to axis zero. Indices can be autogenerated or explicitly set. This guide covers various methods for handling indices in pandas, including setting, resetting, and using multi-level indices.
Insertion Sort
Insertion Sort is a fundamental comparison-based sorting algorithm that mirrors the way you might sort playing cards in your hand. By building a sorted array one element at a time, it offers simplicity and efficiency, especially for small or nearly sorted datasets.
Job Sequencing Problem
The Job Sequencing Problem is a classic problem in computer science, often solved using greedy algorithms. The problem involves scheduling jobs to maximize profit when each job has a deadline and associated profit if it is completed on or before its deadline.
Knapsack Problem
The Knapsack Problem is a classic optimization problem that can be efficiently solved using dynamic programming. The goal is to choose a subset of items with maximum total value, subject to a weight constraint on the total weight of the chosen items.
Kruskal's Algorithm
Kruskal's algorithm is a popular method used to find the minimum spanning tree (MST) for a connected, weighted, undirected graph. The goal of the algorithm is to find a subset of the graph's edges that forms a tree that includes every vertex, where the total weight of all the edges in the tree is minimized. This algorithm is categorized under greedy algorithms because it finds a local optimum with the hope that this leads to a global optimum.
Lambda Functions
Lambda functions in Python are small, anonymous functions defined by the lambda keyword. They can have any number of arguments but only one expression. The syntax is simple and intended for short functions that are convenient to use inside other functions, particularly those that require a simple function as an argument.
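Two typical uses are sketched below: a lambda as the `key` argument of `sorted` and as the function passed to `map`. The variable names and sample data are hypothetical.

```python
# Sort (name, count) pairs by the count, using a lambda as the sort key.
pairs = [("pear", 3), ("apple", 7), ("fig", 1)]
by_count = sorted(pairs, key=lambda pair: pair[1])

# Square each number, using a lambda as the function passed to map.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

print(by_count)
print(squares)
```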
Languages
Languages of Data Science
Libraries
Scientific Computing Libraries in Python
Linear and Quadratic Discriminant Analysis - LDA & QDA
Overview
Linear Data Structures
Linear Data Structures are those where data elements are arranged sequentially, with each element connected to the next and previous one (except for the first and last). This arrangement forms a linear sequence, making it easy to traverse the data in a single run.
Linear Models
| Model | Mathematical Expression | Description |
Linear Regression
Regression analysis is a statistical method used for predicting numerical values based on input features. Common applications include predicting home prices, stock values, patient hospital stays, and retail sales forecasts.
Linear Search
Linear search, also known as sequential search, is a fundamental algorithm used to find a specific element within a list. It operates by starting at the beginning of the list and examining each element one by one until the target element is found or the end of the list is reached. This method is straightforward and does not require the list to be sorted, but it can be inefficient for large datasets.
Linked List
A linked list is a fundamental data structure that represents a sequence of elements. Each element, known as a node, contains:
Lists
Lists in Python are versatile data structures that allow efficient manipulation and storage of a collection of items. Unlike basic data types like integers, floats, Booleans, and strings, lists can hold multiple items, which can be of different data types, all within a single variable.
Longest Common Subsequence
The Longest Common Subsequence (LCS) problem is a classic computer science problem used to find the longest subsequence common to all sequences in a set of sequences (often just two sequences). A subsequence is a sequence derived from another sequence where some elements might be deleted without changing the order of the remaining elements.
Loops
Loops are fundamental constructs in programming that allow the execution of a block of code multiple times. They are essential for automating repetitive tasks, making code efficient and concise.
Machine Learning
Introduction
Managing Data and Processes
Reading Data Interactively
Merge Sort
Merge Sort is an efficient, general-purpose, comparison-based sorting algorithm. It follows the divide and conquer paradigm, which involves breaking a problem into smaller subproblems, solving them independently, and then combining their solutions to solve the original problem.
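The divide, solve, and combine steps can be sketched as follows. Returning a new sorted list instead of sorting in place is a simplification chosen for this illustration.

```python
def merge_sort(items):
    """Return a new list with the elements of items in ascending order."""
    if len(items) <= 1:
        return list(items)
    # Divide: split the list in half and sort each half independently.
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    # Combine: merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:   # <= keeps equal elements in original order
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge_sort([38, 27, 43, 3, 9, 82, 10]))
```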
Merging DataFrames
Relational Theory Concepts
Minimum Spanning Tree (MST)
In graph theory, a Minimum Spanning Tree (MST) for a weighted graph is a spanning tree with the minimum weight among all the spanning trees. It connects all the vertices together without any cycles and with the minimum possible total edge weight.
Missing Values
Missing values are common in data cleaning activities and can occur for various reasons. Understanding the nature and handling of missing data is crucial for accurate data analysis.
Multilayer Perceptron
Multilayer Perceptrons (MLPs) are a class of deep neural networks characterized by their layered structure, which includes an input layer, one or more hidden layers, and an output layer. Each layer comprises neurons that are fully connected to neurons in the subsequent layer through weighted connections.
NumPy
Official Site
Object-oriented Programming
Object-Oriented Programming (OOP) is a programming paradigm that uses "objects" to design software. It allows developers to create classes that encapsulate data and functions, promoting code reusability and modularity.
Open-Source Tools
Data Management Tools
Ordinary Least Squares
Linear models aim to predict a target value as a linear combination of input features. Commonly represented with:
Other Test Concepts
Black Box vs. White Box Testing
Pandas
Official Website
Permutation Problem
The permutation problem involves finding all possible arrangements of elements from a given set using a backtracking algorithm. This problem can be approached considering either arrays without duplicate elements or arrays that include duplicates.
Pivot Tables
A pivot table is a way of summarizing data in a DataFrame for a particular purpose. It makes heavy use of aggregation functions. A pivot table is itself a DataFrame where the rows represent one variable, the columns represent another, and the cells contain some aggregate value. Pivot tables often include marginal values, which are the sums for each column and row. This allows for an easy visual representation of the relationship between two variables.
Prim's Algorithm
Prim's Algorithm is a popular and efficient greedy algorithm used to find the Minimum Spanning Tree (MST) of a weighted, undirected graph. This algorithm helps to find a subset of the graph's edges that forms a tree including every vertex, where the total weight of all the edges in the tree is minimized.
Processing Log Files
Log file analysis is a crucial task in system administration and software development, allowing experts to monitor system activities, debug issues, and extract valuable information. This guide demonstrates how to parse log files using Python to count the occurrences of usernames in CRON job entries.
Programming Languages
Data Analyst
PyTorch
Official Website
Querying `DataFrame`
Boolean Masking
Querying `Series`
A pandas Series can be queried either by index position or by index label. If no index is specified when the Series is created, the position and the label are effectively the same values.
Queue
A queue is a fundamental linear data structure in computer science that follows the First-In-First-Out (FIFO) principle. Just like a real-world queue (e.g., people lining up at a ticket counter), the first element added to the queue will be the first one to be removed.
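In Python, one common way to get FIFO behavior is `collections.deque`, as in the sketch below. The variable names and sample values are hypothetical.

```python
from collections import deque

line = deque()
line.append("alice")   # enqueue at the back
line.append("bob")
line.append("carol")

first = line.popleft()   # dequeue from the front: the earliest arrival
second = line.popleft()

print(first, second, list(line))
```

A `deque` is used here instead of a plain list because removing from the front of a list takes time proportional to its length.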
Quick Sort
Quick Sort is a highly efficient sorting algorithm that employs the divide and conquer strategy. It is widely used due to its efficiency and general applicability. The core operation involves pivot partitioning, where an element called the pivot is selected, and the array is rearranged so that elements less than the pivot are moved to its left, and elements greater than the pivot are placed to its right.
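The pivot partitioning step can be sketched as follows. This illustration uses the Lomuto scheme with the last element as the pivot, which is one of several common choices and not necessarily the variant used in the full note.

```python
def partition(items, lo, hi):
    """Partition items[lo..hi] around items[hi]; return the pivot's final index."""
    pivot = items[hi]
    i = lo  # next position for an element smaller than the pivot
    for j in range(lo, hi):
        if items[j] < pivot:
            items[i], items[j] = items[j], items[i]
            i += 1
    items[i], items[hi] = items[hi], items[i]  # put the pivot in its place
    return i


def quick_sort(items, lo=0, hi=None):
    """Sort items in place in ascending order and return the same list."""
    if hi is None:
        hi = len(items) - 1
    if lo < hi:
        p = partition(items, lo, hi)
        quick_sort(items, lo, p - 1)   # elements left of the pivot
        quick_sort(items, p + 1, hi)   # elements right of the pivot
    return items


print(quick_sort([2, 4, 1, 0, 3, 5]))
```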
R
Official Site
Reading and Writing Files
Interacting with files is a fundamental aspect of programming, especially in automation and data processing tasks. Python provides robust capabilities to manipulate files and directories, making it a powerful tool for IT specialists and system administrators.
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are a type of neural network designed for processing sequences by leveraging hidden states to capture temporal information. They are particularly well-suited for tasks like language modeling, where the goal is to predict the next token based on the historical sequence of previous tokens.
Recursion
Recursion is the repeated application of the same procedure to a smaller problem. It allows complex tasks to be broken down into simpler, more manageable sub-tasks.
regex
Regular expressions (regexes) are patterns used to match character combinations in strings. They are essential tools for text processing in programming and data science, allowing for:
Running System Commands in Python
Python allows interaction with the operating system by executing system commands directly from scripts using the subprocess module.
Scales
Creating a DataFrame with Letter Grades
Search Algorithms
Search algorithms are essential for locating elements within various data structures such as arrays, linked lists, trees, and graphs. These can be broadly categorized into two types based on their implementation approaches:
Selection Sort
Selection Sort is a straightforward comparison-based sorting algorithm. It repeatedly selects the minimum element from the unsorted portion of the list and swaps it with the first unsorted element. This process gradually builds a sorted segment at the beginning of the list.
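A minimal sketch of the procedure described above. Returning a sorted copy rather than sorting in place is a choice made for this illustration.

```python
def selection_sort(items):
    """Return a new list with the elements of items in ascending order."""
    result = list(items)
    for i in range(len(result) - 1):
        # Find the smallest element in the unsorted portion result[i:].
        min_index = i
        for j in range(i + 1, len(result)):
            if result[j] < result[min_index]:
                min_index = j
        # Swap it with the first unsorted element, growing the sorted prefix.
        result[i], result[min_index] = result[min_index], result[i]
    return result


print(selection_sort([64, 25, 12, 22, 11]))
```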
Shortest Path Problems
Shortest path problems are key in optimizing routes in various applications like network design, transportation, and telecommunications. They focus on finding the shortest route from a start to a destination in a graph, which consists of nodes (vertices) and edges (connections between nodes).
Softmax Regression
In previous sections, we explored linear regression and its implementations, both from scratch and using high-level APIs. Regression models are typically used for quantitative outputs such as predicting prices, number of wins, or the number of days a patient might stay in the hospital. However, not all problems are best served by regression models due to the nature of their outputs. This leads to special cases like logarithmic regression or survival modeling.
Sorting Algorithms
Sorting algorithms are crucial for arranging data in a specific order, enhancing the efficiency of search, analysis, and processing tasks. They cater to various data types, such as integers, floating point numbers, characters, and strings, and can be customized based on different sorting rules, including numerical size and character ASCII order.
Space Complexity
Space complexity is a metric that evaluates the total memory space required by an algorithm in terms of the size of its input data. It is analogous to time complexity but focuses on memory usage instead of execution time.
Stack
A stack is a linear data structure that follows the Last-In-First-Out (LIFO) principle. Imagine a stack of plates: the last plate placed on top is the first one to be taken off. In a stack, all insertions and deletions occur at one end, called the "top" of the stack.
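The plate analogy maps directly onto a Python list, where `append` pushes onto the top and `pop` removes from it. The variable names and sample values below are hypothetical.

```python
plates = []
plates.append("plate 1")   # push
plates.append("plate 2")
plates.append("plate 3")

top = plates.pop()         # pop: the last plate added comes off first
next_top = plates[-1]      # peek at the new top without removing it

print(top, next_top, plates)
```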
Strings
Strings are an essential data type in Python, used to represent text data. They are sequences of characters enclosed within single (') or double (") quotes.
Subset Sum Problem
The Subset Sum Problem involves finding all possible combinations of elements in a given array such that their sum equals a specified target. This problem is a classic example of using backtracking for combinatorial exploration. It has variations based on whether duplicate elements are allowed in the input set and whether each element can be chosen more than once.
SymPy
SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python.
Tableau
Reference and Useful Links
Terms
Data Visualization Types
Time Complexity
Time complexity is a metric used to describe the efficiency of an algorithm by examining how the run time increases with the size of the input. It primarily focuses on the growth trend of the runtime rather than precise execution times, which can vary across different hardware and software environments.
Transformers
The Transformer model, introduced by Vaswani et al. (2017), is a deep architecture solely based on attention mechanisms, omitting traditional convolutional or recurrent layers. It is designed for sequence-to-sequence learning and has been widely applied in language, vision, speech, and reinforcement learning applications. The architecture supports parallel computation and features a short path length between input and output, making it highly efficient for tasks involving sequential data.
Unit Tests
The unittest module provides developers with a set of tools to construct and run tests on individual components, or units, of code to ensure their correctness. By running unit tests, developers can identify and fix bugs, creating more reliable code.
Variational Autoencoders
A Variational Autoencoder (VAE) is a generative model that uses neural networks to encode input data into a latent space and then decodes it back to reconstruct the original data. VAEs combine principles from deep learning and probabilistic graphical models, enabling unsupervised learning of complex data distributions.