`DataFrame` data structure
The DataFrame data structure is the heart of the pandas library. It is the primary object used in data analysis and cleaning tasks. Conceptually, a DataFrame is a two-dimensional Series object with an index and multiple columns of content, each column having a label. Essentially, a DataFrame is a two-axis labeled array.
`DataFrame` Indexing and Loading
Introduction to CSV Files
`Series` Data Structure
A Series in pandas is a one-dimensional array-like object that can hold various data types, similar to a list in Python, but with additional features. It combines elements of lists and dictionaries, storing items in order and allowing access via labels (index).
About
This section serves as my personal repository of CS notes, aimed at reviewing and reinforcing fundamental concepts in computer science, including algorithms, data structures, and data science.
Activity Selection
The Activity Selection problem is a classic example of a greedy algorithm application. It is concerned with selecting the maximum number of activities that don't overlap in time from a given set of activities.
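The greedy choice described above can be sketched as follows. This is a minimal illustration, assuming each activity is a `(start, finish)` tuple and that an activity may begin exactly when the previous one ends; the function name `select_activities` is made up for this example.

```python
def select_activities(activities):
    """Greedy selection: always keep the compatible activity that finishes earliest."""
    chosen = []
    last_finish = float("-inf")
    # Sorting by finish time leaves the most room for the remaining activities.
    for start, finish in sorted(activities, key=lambda a: a[1]):
        if start >= last_finish:  # does not overlap the last chosen activity
            chosen.append((start, finish))
            last_finish = finish
    return chosen


activities = [(1, 4), (3, 5), (0, 6), (5, 7), (3, 9), (5, 9),
              (6, 10), (8, 11), (8, 12), (2, 14), (12, 16)]
print(select_activities(activities))  # [(1, 4), (5, 7), (8, 11), (12, 16)]
```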
Algorithms
The Algorithms section is really hard to organize.
Backtracking
Backtracking algorithms are a type of exhaustive search technique used to solve problems. The core approach involves starting from an initial state and exploring all possible solutions by trying each choice once. Only correct solutions are recorded, and the search continues until a solution is found or all choices are tried. This method utilizes depth-first search to traverse the solution space.
Basic Syntax
Learning Python Syntax
Basics
Functions
Basics
R is a powerful language used for data analysis, statistical computing, and graphical representation. This guide will introduce the basic syntax, data structures, control structures, and functions in R, providing a solid foundation for beginners.
Best Practices for Python Type Conventions
Python is a dynamically typed language, but with the introduction of type hints (PEP 484) and the growing adoption of static type checkers like mypy, type annotations have become essential for writing clear, maintainable code. This guide covers the best practices for using type hints effectively in Python.
Conditionals
Python allows us to compare values using comparison operators, enabling us to make decisions based on these comparisons. This is fundamental for controlling the flow of a program.
Data Cleaning with Pandas
In this lecture, we covered a basic data cleaning process using pandas. We focused on cleaning a dataset containing a list of US presidents from Wikipedia. The key operations included importing the dataset, cleaning up names, and converting date formats. Below are the detailed steps and methods used.
Data Import in R
In the world of data analysis, most data is not created within R itself but comes from various data collection software, hardware, and channels such as Excel and the internet. This chapter focuses on how to import data into R to begin data analysis. Readers can either systematically go through the chapter or select topics based on their actual needs and time constraints.
Data Science
Courses
Datasets
| Dataset Source | Description |
Date Functionality
In today's lecture, we explored time series and date functionality in pandas. Manipulating dates and times in pandas is highly flexible, enabling us to conduct advanced analysis such as time series analysis.
Decorators
Python decorators are a very powerful and useful tool that allows programmers to modify the behavior of a function or class. Decorators allow for the extension or modification of the function's behavior without permanently modifying it. Here’s how to use and create decorators in Python:
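As a small sketch of the idea, the decorator below counts how often a function is called without touching the function's own body. The names `count_calls` and `greet` are hypothetical, chosen only for this example.

```python
import functools


def count_calls(func):
    """Wrap func so that every call increments a counter on the wrapper."""
    @functools.wraps(func)  # keep the original name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def greet(name):
    return f"Hello, {name}!"


print(greet("Ada"))
print(greet.calls)
```

Removing the `@count_calls` line restores the original behavior, which is what "without permanently modifying it" means in practice.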
Dictionaries
Dictionaries in Python are data structures that store data in key-value pairs, allowing efficient data retrieval and modification based on unique keys.
Divide and Conquer Algorithms
Divide and conquer, also known as "divide and rule", is a fundamental algorithm strategy based on recursion, involving two main phases: divide and conquer.
Dynamic Programming
Dynamic Programming (DP) is a technique used to optimize recursive algorithms by storing intermediate results, avoiding redundant calculations, and thus significantly improving computational efficiency. It is particularly useful for problems that exhibit overlapping subproblems and optimal substructure.
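A minimal sketch of "storing intermediate results", using the Fibonacci numbers as a stand-in problem (the choice of problem and the name `fib` are assumptions for illustration). Without the cache, the recursion recomputes the same subproblems exponentially often; with it, each value is computed once.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # memoize: each fib(n) is computed only once
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


print(fib(50))  # 12586269025
```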
Errors and Exceptions
The Try-Except Construct
Fibonacci Series
The Fibonacci series is a sequence of numbers where each number is the sum of the two preceding ones, usually starting with 0 and 1. Mathematically, it is defined as:
File Paths and Managing Files
File Paths
Floyd-Warshall Algorithm
The Floyd-Warshall Algorithm is a dynamic programming method used to find the shortest paths between all pairs of nodes in a weighted graph. This algorithm can handle both directed and undirected graphs and works with graphs that have negative weight edges, though it cannot handle graphs with negative weight cycles.
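The core of the method is a triple loop that tries every node `k` as an intermediate stop. The sketch below assumes nodes are numbered `0..n-1` and edges are given as directed `(u, v, weight)` tuples; it does not detect negative cycles.

```python
import math


def floyd_warshall(n, edges):
    """Return the matrix of shortest distances between all pairs of n nodes."""
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)  # keep the cheapest parallel edge

    for k in range(n):          # allow node k as an intermediate stop
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 1), (2, 3, 5)]
dist = floyd_warshall(4, edges)
print(dist[0][3])  # 4, via 0 -> 2 -> 1 -> 3
```

For an undirected graph, each edge would be added in both directions.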
Fractional Knapsack Problem
The Fractional Knapsack Problem is a variant of the knapsack problem where it's permissible to take fractional parts of items rather than having to make a binary choice for each item (all or nothing). This problem allows for a greedy approach to find an optimal solution.
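One way to sketch the greedy approach is to take items in decreasing order of value per unit weight and split the last item if needed. The input format, `(value, weight)` tuples with positive weights, is an assumption made for this example.

```python
def fractional_knapsack(items, capacity):
    """Return the maximum total value when items may be taken fractionally."""
    total = 0.0
    remaining = capacity
    # Highest value-per-weight first.
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if remaining <= 0:
            break
        take = min(weight, remaining)   # whole item, or whatever still fits
        total += value * take / weight
        remaining -= take
    return total


print(fractional_knapsack([(60, 10), (100, 20), (120, 30)], 50))  # 240.0
```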
Functions
Defining Functions
Graph algorithms
Graph algorithms are a set of instructions or procedures designed to perform specific tasks on graph data structures. A graph is a collection of nodes (also known as vertices) connected by edges. Graphs are used to model various types of relationships and processes in physical, biological, social, and information systems. The study and application of graph algorithms are central to numerous fields, including computer science, mathematics, network analysis, and social sciences, due to their ability to efficiently solve problems related to connectivity, flow, and routing within complex networks.
Greedy Algorithms
Greedy algorithms are a class of algorithms used to solve optimization problems by making the locally optimal choice at each step with the hope of finding the global optimum.
Grouping Data
Grouping data is an essential task for data analysis, allowing us to understand and manipulate data at a group level. Pandas provides the groupby() method to facilitate this, implementing the split-apply-combine pattern. This pattern involves splitting the data into groups, applying a function to each group, and then combining the results.
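A small sketch of split-apply-combine, assuming pandas is installed. The column names `team` and `score` and the data are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "b", "a", "b", "a"],
    "score": [10, 20, 30, 40, 20],
})

# Split rows by team, apply mean() to each group's scores,
# and combine the per-group results into one Series indexed by team.
mean_by_team = df.groupby("team")["score"].mean()
print(mean_by_team)
```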
Huffman Coding
Huffman Coding is a widely used method for data compression that involves creating variable-length codes for input characters, with the lengths based on the frequencies of the characters. It's an efficient form of lossless compression, which means no information is lost in the compression process.
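The sketch below builds a code table by repeatedly merging the two least frequent groups of symbols, so frequent symbols end up with shorter codes. It is a compact illustration rather than a full compressor: the function name `huffman_codes` and the decision to give a single-symbol input the code `"0"` are assumptions.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a dict mapping each character of text to its binary code string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}

    # Heap entries: (frequency, unique tie-breaker, {symbol: code so far}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one more bit to every code inside them.
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]


codes = huffman_codes("aaaabbc")
print(codes)
```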
Idioms
In Python programming, certain idioms are considered more appropriate and efficient. Pandas, which works almost like a sub-language within Python, has its own idioms, often referred to as "pandorable." These idioms improve code readability and performance. Here are some key features to make your code pandorable.
index
Introduction to Testing in Python
Indexing `DataFrame`
In pandas, both Series and DataFrame objects can have indices applied to them. An index serves as a row-level label, corresponding to axis zero. Indices can be autogenerated or explicitly set. This guide covers various methods for handling indices in pandas, including setting, resetting, and using multi-level indices.
Job Sequencing Problem
The Job Sequencing Problem is a classic problem in computer science, often solved using greedy algorithms. The problem involves scheduling jobs to maximize profit when each job has a deadline and associated profit if it is completed on or before its deadline.
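A common greedy sketch for this problem is shown below. It assumes every job takes exactly one unit of time and that deadlines are positive integers, neither of which is stated above; jobs are hypothetical `(job_id, deadline, profit)` tuples.

```python
def job_sequencing(jobs):
    """Return (schedule, total_profit) for unit-time jobs with deadlines."""
    max_deadline = max((deadline for _, deadline, _ in jobs), default=0)
    slots = [None] * (max_deadline + 1)  # slots[t] holds the job run in time slot t
    total = 0
    # Consider the most profitable jobs first.
    for job_id, deadline, profit in sorted(jobs, key=lambda j: j[2], reverse=True):
        # Place the job in the latest free slot that still meets its deadline.
        for t in range(deadline, 0, -1):
            if slots[t] is None:
                slots[t] = job_id
                total += profit
                break
    schedule = [job_id for job_id in slots[1:] if job_id is not None]
    return schedule, total


jobs = [("a", 2, 100), ("b", 1, 19), ("c", 2, 27), ("d", 1, 25), ("e", 3, 15)]
print(job_sequencing(jobs))  # (['c', 'a', 'e'], 142)
```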
Knapsack Problem
The Knapsack Problem is a classic optimization problem that can be efficiently solved using dynamic programming. The goal is to choose a subset of items with maximum total value, subject to a weight constraint on the total weight of the chosen items.
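A minimal dynamic programming sketch for the 0/1 variant, assuming integer weights and an integer capacity. `best[c]` holds the best value achievable with capacity `c` using the items seen so far.

```python
def knapsack(values, weights, capacity):
    """Return the maximum total value of a subset whose weight fits in capacity."""
    best = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]


print(knapsack([60, 100, 120], [10, 20, 30], 50))  # 220
```

The running time is proportional to the number of items times the capacity, so "efficient" here holds when the capacity is moderately sized.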
Kruskal's algorithm
Kruskal's algorithm is a popular method used to find the minimum spanning tree (MST) for a connected, weighted, undirected graph. The goal of the algorithm is to find a subset of the graph's edges that forms a tree that includes every vertex, where the total weight of all the edges in the tree is minimized. This algorithm is categorized under greedy algorithms because it finds a local optimum with the hope that this leads to a global optimum.
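A compact sketch of the algorithm, using a simple union-find structure to detect cycles. It assumes nodes are numbered `0..n-1` and edges are `(weight, u, v)` tuples; these conventions are chosen for the example.

```python
def kruskal(n, edges):
    """Return (total_weight, mst_edges) for a connected undirected graph."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    total = 0
    for w, u, v in sorted(edges):  # cheapest edge first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:       # edge joins two components, so no cycle
            parent[root_u] = root_v
            mst.append((u, v, w))
            total += w
    return total, mst


edges = [(1, 0, 1), (3, 0, 2), (2, 1, 2), (4, 2, 3), (5, 1, 3)]
print(kruskal(4, edges))  # (7, [(0, 1, 1), (1, 2, 2), (2, 3, 4)])
```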
Lambda Functions
Lambda functions in Python are small, anonymous functions defined by the lambda keyword. They can have any number of arguments but only one expression. The syntax is simple and intended for short functions that are convenient to use inside other functions, particularly those that require a simple function as an argument.
Languages
Languages of Data Science
Libraries
Scientific Computing Libraries in Python
Lists
Lists in Python are versatile data structures that allow efficient manipulation and storage of a collection of items. Unlike basic data types like integers, floats, Booleans, and strings, lists can hold multiple items, which can be of different data types, all within a single variable.
Longest Common Subsequence
The Longest Common Subsequence (LCS) problem is a classic computer science problem used to find the longest subsequence common to all sequences in a set of sequences (often just two sequences). A subsequence is a sequence derived from another sequence where some elements might be deleted without changing the order of the remaining elements.
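For the two-sequence case, the length of the LCS can be computed with a standard dynamic programming table, sketched here. `dp[i][j]` is the LCS length of the first `i` elements of `a` and the first `j` elements of `b`.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1  # extend a common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])  # drop one element
    return dp[m][n]


print(lcs_length("AGGTAB", "GXTXAYB"))  # 4, e.g. "GTAB"
```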
Loops
Loops are fundamental constructs in programming that allow the execution of a block of code multiple times. They are essential for automating repetitive tasks, making code efficient and concise.
Managing Data and Processes
Reading Data Interactively
Merging DataFrames
Relational Theory Concepts
Minimum Spanning Tree (MST)
In graph theory, a Minimum Spanning Tree (MST) for a weighted graph is a spanning tree with the minimum weight among all the spanning trees. It connects all the vertices together without any cycles and with the minimum possible total edge weight.
Missing Values
Missing values are common in data cleaning activities and can occur for various reasons. Understanding the nature and handling of missing data is crucial for accurate data analysis.
NumPy
Official Site
Object-oriented Programming
Object-Oriented Programming (OOP) is a programming paradigm that uses "objects" to design software. It allows developers to create classes that encapsulate data and functions, promoting code reusability and modularity.
Other Test Concepts
Black Box vs. White Box Testing
Pandas
Official Website
Permutation Problem
The permutation problem involves finding all possible arrangements of elements from a given set using a backtracking algorithm. This problem can be approached considering either arrays without duplicate elements or arrays that include duplicates.
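For the case without duplicate elements, the backtracking search can be sketched as below: choose an unused element, recurse, then undo the choice. The function name `permute` is an assumption for this example.

```python
def permute(items):
    """Return all orderings of items (assumed to contain no duplicates)."""
    result = []
    path = []
    used = [False] * len(items)

    def backtrack():
        if len(path) == len(items):
            result.append(path.copy())  # a complete arrangement
            return
        for i, item in enumerate(items):
            if used[i]:
                continue
            used[i] = True
            path.append(item)
            backtrack()
            path.pop()        # undo the choice before trying the next one
            used[i] = False

    backtrack()
    return result


print(permute([1, 2, 3]))
```

Handling duplicates usually needs an extra pruning step so that equal elements are not swapped into the same position twice.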
Pivot Tables
A pivot table is a way of summarizing data in a DataFrame for a particular purpose. It makes heavy use of the aggregation function. A pivot table is itself a DataFrame where the rows represent one variable, the columns represent another, and the cells contain some aggregate value. Pivot tables often include marginal values, which are the sums for each column and row. This allows for an easy visual representation of the relationship between two variables.
Prim's Algorithm
Prim's Algorithm is a popular and efficient greedy algorithm used to find the Minimum Spanning Tree (MST) of a weighted, undirected graph. This algorithm helps to find a subset of the graph's edges that forms a tree including every vertex, where the total weight of all the edges in the tree is minimized.
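A minimal sketch using a priority queue, assuming a connected graph given as an adjacency dict that maps each node to a list of `(weight, neighbor)` pairs. That representation is chosen for the example, not prescribed by the algorithm.

```python
import heapq


def prim(adj, start=0):
    """Return the total weight of a minimum spanning tree of a connected graph."""
    visited = {start}
    heap = list(adj[start])  # candidate edges leaving the tree
    heapq.heapify(heap)
    total = 0
    while heap and len(visited) < len(adj):
        weight, node = heapq.heappop(heap)  # cheapest candidate edge
        if node in visited:
            continue                        # would create a cycle
        visited.add(node)
        total += weight
        for edge in adj[node]:
            if edge[1] not in visited:
                heapq.heappush(heap, edge)
    return total


adj = {
    0: [(1, 1), (3, 2)],
    1: [(1, 0), (2, 2), (5, 3)],
    2: [(3, 0), (2, 1), (4, 3)],
    3: [(4, 2), (5, 1)],
}
print(prim(adj))  # 7
```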
Processing Log Files
Log file analysis is a crucial task in system administration and software development, allowing experts to monitor system activities, debug issues, and extract valuable information. This guide demonstrates how to parse log files using Python to count the occurrences of usernames in CRON job entries.
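A rough sketch of the counting step follows. The log line format and the regular expression are assumptions modeled on typical syslog output, where the username appears in parentheses after `CRON[pid]:`. Real logs may differ, and the sample lines are invented.

```python
import re
from collections import Counter

# Assumed format: "... CRON[12345]: (username) CMD (...)"
CRON_USER = re.compile(r"CRON\[\d+\]: \((\w+)\)")


def count_cron_users(lines):
    """Count how many CRON entries each username has in the given log lines."""
    counts = Counter()
    for line in lines:
        match = CRON_USER.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


sample = [
    "Jul  6 14:01:23 host CRON[29440]: (alice) CMD (run-backup)",
    "Jul  6 14:02:10 host sshd[1001]: Accepted password for bob",
    "Jul  6 14:03:01 host CRON[29441]: (alice) CMD (rotate-logs)",
    "Jul  6 14:04:01 host CRON[29442]: (root) CMD (cleanup)",
]
print(count_cron_users(sample))  # {'alice': 2, 'root': 1}
```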
Programming Languages
Data Analyst
PyTorch
Official Website
Querying `DataFrame`
Boolean Masking
Querying `Series`
A pandas Series can be queried either by index position or by index label. If no index is specified when the Series is created, pandas assigns the default integer index, so the position and the label are effectively the same values.
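A short sketch of the two query styles, assuming pandas is installed: `iloc` queries by position and `loc` queries by label. The names and numbers are made-up sample data.

```python
import pandas as pd

scores = pd.Series([90, 85, 70], index=["alice", "bob", "carol"])
by_position = scores.iloc[1]   # second element, whatever its label is
by_label = scores.loc["bob"]   # element whose label is "bob"

# With the default integer index, position and label coincide.
plain = pd.Series([90, 85, 70])
same = plain.iloc[2] == plain.loc[2]

print(by_position, by_label, same)
```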
R
Official Site
Reading and Writing Files
Interacting with files is a fundamental aspect of programming, especially in automation and data processing tasks. Python provides robust capabilities to manipulate files and directories, making it a powerful tool for IT specialists and system administrators.
Recursion
Recursion is the repeated application of the same procedure to a smaller problem. It allows complex tasks to be broken down into simpler, more manageable sub-tasks.
regex
Regular expressions (regexes) are patterns used to match character combinations in strings. They are essential tools for text processing in programming and data science, allowing for:
Running System Commands in Python
Python allows interaction with the operating system by executing system commands directly from scripts using the subprocess module.
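As a portable sketch, the example below uses `subprocess.run` to start a second Python interpreter rather than a shell command that exists only on some systems; the command itself is just a placeholder.

```python
import subprocess
import sys

# Pass the command as a list of arguments; no shell is involved.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode the output as str rather than bytes
    check=True,           # raise CalledProcessError on a non-zero exit code
)
print(result.returncode)
print(result.stdout.strip())
```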
Scales
Creating a DataFrame with Letter Grades
Shortest Path Problems
Shortest path problems are key in optimizing routes in various applications like network design, transportation, and telecommunications. They focus on finding the shortest route from a start to a destination in a graph, which consists of nodes (vertices) and edges (connections between nodes).
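Softmax Regression
In earlier chapters, we explored linear regression and its implementation, both from scratch and using high-level APIs. Regression models are typically used for quantitative outputs, such as predicting prices, the number of wins, or the number of days a patient stays in the hospital. However, not every problem is suited to a regression model; this depends on the nature of its output. This leads to special cases such as logarithmic regression or survival modeling.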
Space complexity
Space complexity is a metric that evaluates the total memory space required by an algorithm in terms of the size of its input data. It is analogous to time complexity but focuses on memory usage instead of execution time.
Strings
Strings are an essential data type in Python, used to represent text data. They are sequences of characters enclosed within single (') or double (") quotes.
Subset Sum Problem
The Subset Sum Problem involves finding all possible combinations of elements in a given array such that their sum equals a specified target. This problem is a classic example of using backtracking for combinatorial exploration. It has variations based on whether duplicate elements are allowed in the input set and whether each element can be chosen more than once.
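The sketch below covers one of the variations mentioned above: distinct positive integers, each usable at most once. These restrictions are assumptions of the example. Sorting the input lets the search stop early once an element exceeds the remaining target.

```python
def subset_sum(nums, target):
    """Return every subset of nums (distinct positive ints) that sums to target."""
    nums = sorted(nums)
    result = []
    path = []

    def backtrack(start, remaining):
        if remaining == 0:
            result.append(path.copy())
            return
        for i in range(start, len(nums)):
            if nums[i] > remaining:
                break  # later elements are larger still, so prune
            path.append(nums[i])
            backtrack(i + 1, remaining - nums[i])  # i + 1: use each element once
            path.pop()

    backtrack(0, target)
    return result


print(subset_sum([3, 4, 5, 2], 9))  # [[2, 3, 4], [4, 5]]
```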
SymPy
SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python.
Tableau
Reference and Useful Links
Terms
Data Visualization Types
Time Complexity
Time complexity is a metric used to describe the efficiency of an algorithm by examining how the run time increases with the size of the input. It primarily focuses on the growth trend of the runtime rather than precise execution times, which can vary across different hardware and software environments.
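Transformer Model
The Transformer model, proposed by Vaswani et al. (2017), is a deep architecture based entirely on attention mechanisms, omitting the traditional convolutional or recurrent layers. It is designed for sequence-to-sequence learning and has been widely applied in language, vision, speech, and reinforcement learning. The architecture supports parallel computation and has a short path length between inputs and outputs, which makes it highly efficient on sequence data tasks.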
Unit Tests
The unittest module provides developers with a set of tools to construct and run tests on individual components or units of code to ensure their correctness. By running unit tests, developers can identify and fix bugs, creating more reliable code.
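A minimal sketch of a test case follows. The function under test, `add`, is a hypothetical stand-in; the suite is run programmatically here so the example works when pasted into a script or notebook, whereas a test file would more commonly end with `unittest.main()`.

```python
import unittest


def add(a, b):
    """Toy function under test."""
    return a + b


class TestAdd(unittest.TestCase):
    def test_integers(self):
        self.assertEqual(add(2, 3), 5)

    def test_strings(self):
        self.assertEqual(add("a", "b"), "ab")


# Collect the test methods of TestAdd and run them.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAdd)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```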
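Convolutional Neural Networks
A convolutional neural network (CNN) is a specially designed neural network used mainly to process structured grid data such as images. CNNs exploit inherent properties of the data, such as spatial relationships and locality, to reduce the complexity and computational cost of learning from high-dimensional data.
Variational Autoencoders
A variational autoencoder (VAE) is a generative model that uses neural networks to encode input data into a latent space and then decode it to reconstruct the original data. VAEs combine principles from deep learning and probabilistic graphical models, enabling unsupervised learning of complex data distributions.
Multilayer Perceptrons
A multilayer perceptron (MLP) is a class of deep neural network characterized by a layered structure comprising an input layer, one or more hidden layers, and an output layer. Each layer contains neurons that are fully connected to the neurons in the following layer through weighted connections.
Open-Source Tools
Data Management Tools
Recurrent Neural Networks
Recurrent neural networks (RNNs) are neural networks designed to process sequential data by using a hidden state to capture temporal information. They are particularly well suited to tasks such as language modeling, where the goal is to predict the next token from the history of preceding tokens.
Diffusion Models
Diffusion models are a class of generative models that learn a data distribution by iteratively adding noise to data and then removing it. They have attracted considerable attention for their ability to generate high-quality samples in areas such as image and audio synthesis.
Ordinary Least Squares
Linear models aim to predict the target value as a linear combination of the input features. This is usually expressed as:
Machine Learning
Introduction
Attention Mechanisms
Attention mechanisms address the challenge that traditional neural network models (such as CNNs and RNNs) require fixed-size inputs. They provide a flexible way to handle inputs of varying size and content, such as long text sequences. This flexibility comes from mechanisms that can dynamically focus on different parts of the input.
Deep Learning
Useful Links
Generative Models
Courses
Linear and Quadratic Discriminant Analysis - LDA and QDA
Overview
Linear Regression
Regression analysis is a statistical method for predicting numerical values from input features. Common applications include predicting house prices, stock values, and the length of patient hospital stays, as well as retail sales forecasting.
Linear Models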
| Model | Mathematical Expression | Description |
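General Concepts
In machine learning, model evaluation is a key step in understanding a model's performance. Two key concepts in this process are the hypothesis and the loss function.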