91 docs tagged with "infotech"

`DataFrame` data structure

The DataFrame data structure is the heart of the pandas library. It is the primary object used in data analysis and cleaning tasks. Conceptually, a DataFrame is a two-dimensional Series object with an index and multiple columns of content, each column having a label. Essentially, a DataFrame is a two-axis labeled array.

`Series` Data Structure

A Series in pandas is a one-dimensional array-like object that can hold various data types, similar to a list in Python, but with additional features. It combines elements of lists and dictionaries, storing items in order and allowing access via labels (index).

About

This section serves as my personal repository of CS notes, aimed at reviewing and reinforcing fundamental concepts in computer science, including algorithms, data structures, and data science.

Activity Selection

The Activity Selection problem is a classic example of a greedy algorithm application. It is concerned with selecting the maximum number of activities that don't overlap in time from a given set of activities.
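As a minimal sketch of the greedy idea, assuming each activity is a `(start, finish)` tuple and that an activity may begin exactly when the previous one ends (the function name and sample data are illustrative):

```python
def select_activities(activities):
    """Greedy selection: repeatedly take the compatible activity that finishes earliest."""
    chosen = []
    last_finish = float("-inf")
    # Sorting by finish time leaves as much room as possible for later activities.
    for start, finish in sorted(activities, key=lambda a: a[1]):
        if start >= last_finish:  # does not overlap the last chosen activity
            chosen.append((start, finish))
            last_finish = finish
    return chosen


activities = [(1, 4), (3, 5), (0, 6), (5, 7), (3, 9), (5, 9),
              (6, 10), (8, 11), (8, 12), (2, 14), (12, 16)]
print(select_activities(activities))
```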

Algorithms

It is really hard to organize this section, Algorithms

Backtracking

Backtracking is an exhaustive search technique. Starting from an initial state, it builds candidate solutions one choice at a time, abandons (backtracks from) a partial candidate as soon as it cannot lead to a valid solution, and records only complete, valid solutions. The search continues until a solution is found or all choices have been tried, traversing the solution space depth-first.

Basics

R is a powerful language used for data analysis, statistical computing, and graphical representation. This guide will introduce the basic syntax, data structures, control structures, and functions in R, providing a solid foundation for beginners.

Best Practices for Python Type Conventions

Python is a dynamically typed language, but with the introduction of type hints (PEP 484) and the growing adoption of static type checkers like mypy, type annotations have become essential for writing clear, maintainable code. This guide covers the best practices for using type hints effectively in Python.

Conditionals

Python allows us to compare values using comparison operators, enabling us to make decisions based on these comparisons. This is fundamental for controlling the flow of a program.

Data Cleaning with Pandas

In this lecture, we covered a basic data cleaning process using pandas. We focused on cleaning a dataset containing a list of US presidents from Wikipedia. The key operations included importing the dataset, cleaning up names, and converting date formats. Below are the detailed steps and methods used.

Data Import in R

In the world of data analysis, most data is not created within R itself but comes from various data collection software, hardware, and channels such as Excel and the internet. This chapter focuses on how to import data into R to begin data analysis. Readers can either systematically go through the chapter or select topics based on their actual needs and time constraints.

Datasets

| Dataset Source | Description |

Date Functionality

In today's lecture, we explored time series and date functionality in pandas. Manipulating dates and times in pandas is highly flexible, enabling us to conduct advanced analysis such as time series analysis.

Decorators

Python decorators are a very powerful and useful tool that allows programmers to modify the behavior of a function or class. Decorators allow for the extension or modification of the function's behavior without permanently modifying it. Here’s how to use and create decorators in Python:
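A small illustrative sketch (the names `count_calls` and `greet` are made up for this example) of a decorator that wraps a function to count its calls while leaving the function itself unchanged:

```python
import functools


def count_calls(func):
    """Decorator that counts how many times the wrapped function is called."""
    @functools.wraps(func)  # keep the original name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def greet(name):
    return f"Hello, {name}!"


greet("Ada")
greet("Grace")
print(greet.calls)
```

Using `functools.wraps` is a common choice here because it preserves the decorated function's metadata, which helps with debugging and documentation tools.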

Dictionaries

Dictionaries in Python are data structures that store data in key-value pairs, allowing efficient data retrieval and modification based on unique keys.

Divide and Conquer Algorithms

Divide and conquer, also known as "divide and rule", is a fundamental algorithm strategy based on recursion, involving two main phases: divide and conquer.

Dynamic Programming

Dynamic Programming (DP) is a technique used to optimize recursive algorithms by storing intermediate results, avoiding redundant calculations, and thus significantly improving computational efficiency. It is particularly useful for problems that exhibit overlapping subproblems and optimal substructure.
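One way to see the effect of storing intermediate results is to compare a plain recursive function with a memoized one. This sketch uses Fibonacci numbers as the example problem and `functools.lru_cache` as the cache; both choices are illustrative assumptions rather than the only way to apply DP:

```python
from functools import lru_cache


def fib_naive(n):
    """Plain recursion: recomputes the same subproblems exponentially often."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)


@lru_cache(maxsize=None)
def fib_memo(n):
    """Same recurrence, but each subproblem is computed once and cached."""
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)


print(fib_naive(10), fib_memo(50))
```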

Fibonacci Series

The Fibonacci series is a sequence of numbers where each number is the sum of the two preceding ones, usually starting with 0 and 1. Mathematically, it is defined as:

Floyd-Warshall Algorithm

The Floyd-Warshall Algorithm is a dynamic programming method used to find the shortest paths between all pairs of nodes in a weighted graph. This algorithm can handle both directed and undirected graphs and works with graphs that have negative weight edges, though it cannot handle graphs with negative weight cycles.
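A compact sketch of the algorithm, assuming nodes are numbered `0..n-1` and edges are given as directed `(u, v, weight)` triples (an undirected edge would be listed in both directions). The negative-cycle check at the end is one simple way to reject inputs the algorithm cannot handle:

```python
import math


def floyd_warshall(n, edges):
    """Return the all-pairs shortest-distance matrix for a directed weighted graph."""
    dist = [[math.inf] * n for _ in range(n)]
    for v in range(n):
        dist[v][v] = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)  # keep the cheapest parallel edge
    # Allow each node k in turn to serve as an intermediate stop.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    # A negative distance from a node to itself means a negative-weight cycle.
    if any(dist[v][v] < 0 for v in range(n)):
        raise ValueError("graph contains a negative-weight cycle")
    return dist


dist = floyd_warshall(4, [(0, 1, 4), (0, 2, 1), (2, 1, -2), (1, 3, 3)])
print(dist[0])
```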

Fractional Knapsack Problem

The Fractional Knapsack Problem is a variant of the knapsack problem where it's permissible to take fractional parts of items rather than having to make a binary choice for each item (all or nothing). This problem allows for a greedy approach to find an optimal solution.
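The greedy approach can be sketched as follows, assuming items are `(value, weight)` pairs with positive weights; the function name is illustrative. Items are taken in decreasing order of value per unit weight, and the last item is split if it does not fit entirely:

```python
def fractional_knapsack(items, capacity):
    """Return the maximum total value when fractions of items may be taken."""
    total = 0.0
    # Best value-per-weight ratio first.
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if capacity <= 0:
            break
        take = min(weight, capacity)      # whole item, or whatever still fits
        total += value * take / weight
        capacity -= take
    return total


print(fractional_knapsack([(60, 10), (100, 20), (120, 30)], 50))
```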

Graph algorithms

Graph algorithms are a set of instructions or procedures designed to perform specific tasks on graph data structures. A graph is a collection of nodes (also known as vertices) connected by edges. Graphs are used to model various types of relationships and processes in physical, biological, social, and information systems. The study and application of graph algorithms are central to numerous fields, including computer science, mathematics, network analysis, and social sciences, due to their ability to efficiently solve problems related to connectivity, flow, and routing within complex networks.

Greedy Algorithms

Greedy algorithms are a class of algorithms used to solve optimization problems by making the locally optimal choice at each step with the hope of finding the global optimum.

Grouping Data

Grouping data is an essential task for data analysis, allowing us to understand and manipulate data at a group level. pandas provides the `groupby()` method to facilitate this, implementing the split-apply-combine pattern. This pattern involves splitting the data into groups, applying a function to each group, and then combining the results.

Huffman Coding

Huffman Coding is a widely used method for data compression that involves creating variable-length codes for input characters, with the lengths based on the frequencies of the characters. It's an efficient form of lossless compression, which means no information is lost in the compression process.
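A minimal sketch of building a Huffman code table with a priority queue. Representing each subtree as a dictionary of partial codes is a simplification chosen for brevity (a tree of nodes is the more usual structure), and the exact bit patterns depend on how ties are broken:

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return {symbol: bit string}; frequent symbols get shorter codes."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:                       # a single symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, unique tie-breaker, {symbol: code so far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]


text = "abracadabra"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(codes, len(encoded))
```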

Idioms

In Python programming, certain idioms are considered more appropriate and efficient. pandas, which behaves almost like a sub-language within Python, has its own idioms, often referred to as "pandorable." These idioms improve code readability and performance. Here are some key features to make your code pandorable.

index

Introduction to Testing in Python

Indexing `DataFrame`

In pandas, both Series and DataFrame objects can have indices applied to them. An index serves as a row-level label, corresponding to axis zero. Indices can be autogenerated or explicitly set. This guide covers various methods for handling indices in pandas, including setting, resetting, and using multi-level indices.

Job Sequencing Problem

The Job Sequencing Problem is a classic problem in computer science, often solved using greedy algorithms. The problem involves scheduling jobs to maximize profit when each job has a deadline and associated profit if it is completed on or before its deadline.
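A sketch of the usual greedy strategy, under the assumptions that every job takes exactly one unit of time and that deadlines are positive integers (the tuple layout `(name, deadline, profit)` is chosen for this example). Jobs are considered in order of decreasing profit and placed in the latest free slot before their deadline:

```python
def schedule_jobs(jobs):
    """Return (job names in execution order, total profit)."""
    if not jobs:
        return [], 0
    max_deadline = max(deadline for _, deadline, _ in jobs)
    slots = [None] * (max_deadline + 1)   # slots[t] holds the job run in time unit t (1-based)
    total = 0
    for name, deadline, profit in sorted(jobs, key=lambda j: j[2], reverse=True):
        # Try the latest free slot first so earlier slots stay open for tighter deadlines.
        for t in range(deadline, 0, -1):
            if slots[t] is None:
                slots[t] = name
                total += profit
                break
    return [name for name in slots[1:] if name is not None], total


jobs = [("a", 2, 100), ("b", 1, 19), ("c", 2, 27), ("d", 1, 25), ("e", 3, 15)]
print(schedule_jobs(jobs))
```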

Knapsack Problem

The Knapsack Problem is a classic optimization problem that can be efficiently solved using dynamic programming. The goal is to choose a subset of items with maximum total value, subject to a weight constraint on the total weight of the chosen items.
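A short sketch of the dynamic programming solution for the 0/1 variant (each item is either taken whole or left out), assuming items are `(value, weight)` pairs with non-negative integer weights. It uses a one-dimensional table indexed by capacity:

```python
def knapsack(items, capacity):
    """Return the maximum total value achievable within the weight capacity."""
    best = [0] * (capacity + 1)           # best[c] = best value using capacity c
    for value, weight in items:
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]


print(knapsack([(60, 10), (100, 20), (120, 30)], 50))
```

The running time is proportional to the number of items times the capacity, so the approach is practical when the capacity is a moderately sized integer.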

Kruskal's algorithm

Kruskal's algorithm is a popular method used to find the minimum spanning tree (MST) of a connected, weighted, undirected graph. The goal of the algorithm is to find a subset of the graph's edges that forms a tree that includes every vertex, where the total weight of all the edges in the tree is minimized. It is categorized as a greedy algorithm because at each step it adds the cheapest remaining edge that does not create a cycle; for MSTs, this locally optimal choice is guaranteed to produce a global optimum.
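A minimal sketch, assuming vertices are numbered `0..n-1` and edges are `(weight, u, v)` tuples. Cycle detection is done with a simple union-find structure, which is a common choice but not the only one:

```python
def kruskal(n, edges):
    """Return (list of MST edges as (u, v, weight), total weight)."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the set representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst, total = [], 0
    for w, u, v in sorted(edges):         # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:              # edge joins two components, so no cycle
            parent[root_u] = root_v
            mst.append((u, v, w))
            total += w
    return mst, total


edges = [(1, 0, 1), (3, 0, 2), (4, 1, 2), (2, 1, 3), (5, 2, 3)]
mst, total = kruskal(4, edges)
print(mst, total)
```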

Lambda Functions

Lambda functions in Python are small, anonymous functions defined by the lambda keyword. They can have any number of arguments but only one expression. The syntax is simple and intended for short functions that are convenient to use inside other functions, particularly those that require a simple function as an argument.

Libraries

Scientific Computing Libraries in Python

Lists

Lists in Python are versatile data structures that allow efficient manipulation and storage of a collection of items. Unlike basic data types like integers, floats, Booleans, and strings, lists can hold multiple items, which can be of different data types, all within a single variable.

Longest Common Subsequence

The Longest Common Subsequence (LCS) problem is a classic computer science problem used to find the longest subsequence common to all sequences in a set of sequences (often just two sequences). A subsequence is a sequence derived from another sequence where some elements might be deleted without changing the order of the remaining elements.
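For the two-sequence case, a standard dynamic programming table can be sketched like this (the function name is illustrative, and when several longest subsequences exist this version returns just one of them):

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("AGGTAB", "GXTXAYB"))
```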

Loops

Loops are fundamental constructs in programming that allow the execution of a block of code multiple times. They are essential for automating repetitive tasks, making code efficient and concise.

Minimum Spanning Tree (MST)

In graph theory, a Minimum Spanning Tree (MST) for a weighted graph is a spanning tree with the minimum weight among all the spanning trees. It connects all the vertices together without any cycles and with the minimum possible total edge weight.

Missing Values

Missing values are common in data cleaning activities and can occur for various reasons. Understanding the nature and handling of missing data is crucial for accurate data analysis.

Object-oriented Programming

Object-Oriented Programming (OOP) is a programming paradigm that uses "objects" to design software. It allows developers to create classes that encapsulate data and functions, promoting code reusability and modularity.

Permutation Problem

The permutation problem involves finding all possible arrangements of elements from a given set using a backtracking algorithm. This problem can be approached considering either arrays without duplicate elements or arrays that include duplicates.
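A backtracking sketch that covers both cases. Sorting the input and skipping a value when its identical left neighbour is unused at the same depth is one common way to avoid duplicate permutations; the function name is chosen for this example:

```python
def permutations(nums):
    """Return all distinct permutations of nums, which may contain duplicates."""
    nums = sorted(nums)
    result, path = [], []
    used = [False] * len(nums)

    def backtrack():
        if len(path) == len(nums):
            result.append(path[:])
            return
        for i, x in enumerate(nums):
            if used[i]:
                continue
            # Skip a duplicate value unless its identical predecessor is already in the path.
            if i > 0 and nums[i] == nums[i - 1] and not used[i - 1]:
                continue
            used[i] = True
            path.append(x)
            backtrack()
            path.pop()                    # undo the choice before trying the next one
            used[i] = False

    backtrack()
    return result


print(permutations([1, 1, 2]))
```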

Pivot Tables

A pivot table is a way of summarizing data in a DataFrame for a particular purpose. It makes heavy use of aggregation functions. A pivot table is itself a DataFrame where the rows represent one variable, the columns represent another, and the cells contain some aggregate value. Pivot tables often include marginal values, which are the sums for each column and row. This allows for an easy visual representation of the relationship between two variables.

Prim's Algorithm

Prim's Algorithm is a popular and efficient greedy algorithm used to find the Minimum Spanning Tree (MST) of a weighted, undirected graph. This algorithm helps to find a subset of the graph's edges that forms a tree including every vertex, where the total weight of all the edges in the tree is minimized.
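A sketch using a binary heap, assuming the graph is connected and given as an adjacency dictionary `{node: [(neighbor, weight), ...]}` with each undirected edge listed from both endpoints (this representation is an assumption made for the example):

```python
import heapq


def prim(graph, start):
    """Return (list of MST edges as (u, v, weight), total weight)."""
    visited = {start}
    heap = [(w, start, v) for v, w in graph[start]]
    heapq.heapify(heap)
    mst, total = [], 0
    while heap and len(visited) < len(graph):
        w, u, v = heapq.heappop(heap)     # cheapest edge leaving the current tree
        if v in visited:
            continue                      # both endpoints already in the tree
        visited.add(v)
        mst.append((u, v, w))
        total += w
        for nxt, nw in graph[v]:
            if nxt not in visited:
                heapq.heappush(heap, (nw, v, nxt))
    return mst, total


graph = {
    "A": [("B", 1), ("C", 3)],
    "B": [("A", 1), ("C", 4), ("D", 2)],
    "C": [("A", 3), ("B", 4), ("D", 5)],
    "D": [("B", 2), ("C", 5)],
}
mst, total = prim(graph, "A")
print(mst, total)
```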

Processing Log Files

Log file analysis is a crucial task in system administration and software development, allowing experts to monitor system activities, debug issues, and extract valuable information. This guide demonstrates how to parse log files using Python to count the occurrences of usernames in CRON job entries.
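A small sketch of the counting step. The log line layout used here (`CRON[pid]: USER (name)`) and the sample lines are assumptions made for illustration; a real log may need a different pattern:

```python
import re
from collections import Counter

# Assumed format: "... CRON[12345]: USER (username)"
CRON_USER = re.compile(r"CRON\[\d+\]: USER \((\w+)\)")


def count_cron_users(lines):
    """Count how often each username appears in CRON entries."""
    counts = Counter()
    for line in lines:
        match = CRON_USER.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


sample = [
    "Jul  6 14:01:23 host CRON[29440]: USER (good_user)",
    "Jul  6 14:02:08 host jam_tag=psim[29187]: (UUID:006)",
    "Jul  6 14:03:01 host CRON[29440]: USER (naughty_user)",
    "Jul  6 14:04:01 host CRON[29440]: USER (good_user)",
]
print(count_cron_users(sample))
```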

Querying `Series`

A pandas Series can be queried either by index position or by index label. If no index is specified when the Series is created, pandas assigns a default integer index, so the position and the label are effectively the same values.

R

Official Site

Reading and Writing Files

Interacting with files is a fundamental aspect of programming, especially in automation and data processing tasks. Python provides robust capabilities to manipulate files and directories, making it a powerful tool for IT specialists and system administrators.

Recursion

Recursion is the repeated application of the same procedure to a smaller problem. It allows complex tasks to be broken down into simpler, more manageable sub-tasks.

regex

Regular expressions (regexes) are patterns used to match character combinations in strings. They are essential tools for text processing in programming and data science, allowing for:

Scales

Creating a DataFrame with Letter Grades

Shortest Path Problems

Shortest path problems are key in optimizing routes in various applications like network design, transportation, and telecommunications. They focus on finding the shortest route from a start to a destination in a graph, which consists of nodes (vertices) and edges (connections between nodes).

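Softmax Regression

In previous chapters, we explored linear regression and its implementation, both from scratch and with high-level APIs. Regression models are typically used for quantitative outputs, such as predicting prices, numbers of wins, or the number of days a patient stays in hospital. However, not every problem is suited to a regression model; that depends on the nature of its output. This gives rise to special cases such as logarithmic regression or survival modeling.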

Space complexity

Space complexity is a metric that evaluates the total memory space required by an algorithm in terms of the size of its input data. It is analogous to time complexity but focuses on memory usage instead of execution time.

Strings

Strings are an essential data type in Python, used to represent text data. They are sequences of characters enclosed within single (') or double (") quotes.

Subset Sum Problem

The Subset Sum Problem involves finding all possible combinations of elements in a given array such that their sum equals a specified target. This problem is a classic example of using backtracking for combinatorial exploration. It has variations based on whether duplicate elements are allowed in the input set and whether each element can be chosen more than once.
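A backtracking sketch for one of the variations mentioned: the input may contain duplicates, each element may be used at most once, and all values are assumed to be positive integers. The function name is illustrative:

```python
def subset_sums(nums, target):
    """Return all unique combinations of elements of nums that sum to target."""
    nums = sorted(nums)
    result, path = [], []

    def backtrack(start, remaining):
        if remaining == 0:
            result.append(path[:])
            return
        for i in range(start, len(nums)):
            if nums[i] > remaining:
                break                     # sorted input: larger values cannot fit either
            if i > start and nums[i] == nums[i - 1]:
                continue                  # same value at the same depth gives a duplicate
            path.append(nums[i])
            backtrack(i + 1, remaining - nums[i])
            path.pop()

    backtrack(0, target)
    return result


print(subset_sums([10, 1, 2, 7, 6, 1, 5], 8))
```

To allow an element to be chosen more than once, the recursive call would pass `i` instead of `i + 1`.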

Sympy

SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python.

Terms

Data Visualization Types

Time Complexity

Time complexity is a metric used to describe the efficiency of an algorithm by examining how the run time increases with the size of the input. It primarily focuses on the growth trend of the runtime rather than precise execution times, which can vary across different hardware and software environments.

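Transformer Model

The Transformer model, proposed by Vaswani et al. (2017), is a deep architecture based entirely on attention mechanisms, dispensing with traditional convolutional or recurrent layers. It was designed for sequence-to-sequence learning and has since been widely applied in language, vision, speech, and reinforcement learning. The architecture supports parallel computation and has short path lengths between inputs and outputs, which makes it highly efficient on sequence-data tasks.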

Unit Tests

The `unittest` module provides developers with a set of tools to construct and run tests on individual components or units of code to ensure their correctness. By running unit tests, developers can identify and fix bugs, creating more reliable code.

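Convolutional Neural Networks

A convolutional neural network (CNN) is a specialized neural network used mainly to process data with a structured grid layout, such as images. CNNs exploit inherent properties of the data, such as spatial relationships and locality, to reduce the complexity and computational cost of learning from high-dimensional data.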

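Variational Autoencoders

A variational autoencoder (VAE) is a generative model that uses neural networks to encode input data into a latent space and then decode it to reconstruct the original data. VAEs combine principles from deep learning and probabilistic graphical models, enabling unsupervised learning of complex data distributions.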

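Multilayer Perceptrons

A multilayer perceptron (MLP) is a class of deep neural network characterized by a layered structure: an input layer, one or more hidden layers, and an output layer. Each layer contains neurons that are fully connected to the neurons in the following layer through weighted connections.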

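Recurrent Neural Networks

Recurrent neural networks (RNNs) are neural networks designed to process sequential data by using a hidden state to capture temporal information. They are particularly suited to tasks such as language modeling, where the goal is to predict the next token from the history of preceding tokens.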

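Diffusion Models

Diffusion models are a class of generative models that learn a data distribution by iteratively adding noise to the data and then removing it. They have attracted considerable attention for their ability to generate high-quality samples in areas such as image and audio synthesis.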

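Ordinary Least Squares

Linear models aim to predict the target value as a linear combination of the input features. This is usually written as: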

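Attention Mechanisms

Attention mechanisms address the challenge that traditional neural network models (such as CNNs and RNNs) require fixed-size inputs. They offer a flexible way to handle inputs that vary in size and content, such as long text sequences. This flexibility comes from the ability to focus dynamically on different parts of the input.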

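Linear Regression

Regression analysis is a statistical method for predicting numerical values from input features. Common applications include predicting house prices, stock values, the length of patient hospital stays, and retail sales.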

Linear Models

| Model | Mathematical Expression | Description |

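General Concepts

In machine learning, model evaluation is a key step in understanding model performance. Two key concepts in this process are the hypothesis and the loss function.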