91 documents tagged "infotech"

`DataFrame` data structure

The DataFrame data structure is the heart of the pandas library. It is a primary object used in data analysis and cleaning tasks. Conceptually, a DataFrame is a two-dimensional Series object with an index and multiple columns of content, each column having a label. Essentially, a DataFrame is a two-axis labeled array.

`DataFrame` Indexing and Loading

Introduction to CSV Files

`Series` Data Structure

A Series in pandas is a one-dimensional array-like object that can hold various data types, similar to a list in Python, but with additional features. It combines elements of lists and dictionaries, storing items in order and allowing access via labels (index).

About

This section serves as my personal repository of CS notes, aimed at reviewing and reinforcing fundamental concepts in computer science, including algorithms, data structures, and data science.

Activity Selection

The Activity Selection problem is a classic example of a greedy algorithm application. It is concerned with selecting the maximum number of activities that don't overlap in time from a given set of activities.
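As a minimal sketch of the greedy idea, assuming each activity is given as a `(start, finish)` tuple (a representation chosen here for illustration, not taken from the note), one common approach is to sort by finish time and keep every activity that starts no earlier than the last selected one finishes:

```python
def select_activities(activities):
    """Greedy selection: repeatedly take the activity that finishes earliest."""
    chosen = []
    last_finish = float("-inf")
    # Sorting by finish time leaves as much room as possible for later activities.
    for start, finish in sorted(activities, key=lambda a: a[1]):
        if start >= last_finish:  # compatible with everything chosen so far
            chosen.append((start, finish))
            last_finish = finish
    return chosen
```

Calling `select_activities([(1, 4), (3, 5), (5, 7)])` keeps `(1, 4)` and `(5, 7)` and skips the overlapping `(3, 5)`.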

Algorithms

It is really hard to organize this section, Algorithms

Backtracking

Backtracking algorithms are a type of exhaustive search technique used to solve problems. The core approach involves starting from an initial state and exploring all possible solutions by trying each choice once. Only correct solutions are recorded, and the search continues until a solution is found or all choices are tried. This method utilizes depth-first search to traverse the solution space.

Basic Syntax

Learning Python Syntax

Basics

Functions

Basics

R is a powerful language used for data analysis, statistical computing, and graphical representation. This guide will introduce the basic syntax, data structures, control structures, and functions in R, providing a solid foundation for beginners.

Best Practices for Python Type Conventions

Python is a dynamically typed language, but with the introduction of type hints (PEP 484) and the growing adoption of static type checkers like mypy, type annotations have become essential for writing clear, maintainable code. This guide covers the best practices for using type hints effectively in Python.

Conditionals

Python allows us to compare values using comparison operators, enabling us to make decisions based on these comparisons. This is fundamental for controlling the flow of a program.

Data Cleaning with Pandas

In this lecture, we covered a basic data cleaning process using pandas. We focused on cleaning a dataset containing a list of US presidents from Wikipedia. The key operations included importing the dataset, cleaning up names, and converting date formats. Below are the detailed steps and methods used.

Data Import in R

In the world of data analysis, most data is not created within R itself but comes from various data collection software, hardware, and channels such as Excel and the internet. This chapter focuses on how to import data into R to begin data analysis. Readers can either systematically go through the chapter or select topics based on their actual needs and time constraints.

Data Science

Courses

Datasets

| Dataset Source | Description |

Date Functionality

In today's lecture, we explored time series and date functionality in pandas. Manipulating dates and times in pandas is highly flexible, enabling us to conduct advanced analysis such as time series analysis.

Decorators

Python decorators are a very powerful and useful tool that allows programmers to modify the behavior of a function or class. Decorators allow for the extension or modification of the function's behavior without permanently modifying it. Here’s how to use and create decorators in Python:
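A small illustrative sketch: the decorator name `count_calls` and the function `greet` are hypothetical, chosen only to show the wrapping pattern. The original function is left untouched; the decorator returns a wrapper that adds behavior around it.

```python
import functools


def count_calls(func):
    """Hypothetical decorator that counts how often func is called."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def greet(name):
    return f"Hello, {name}!"
```

After one call to `greet("Ada")`, the attribute `greet.calls` is 1, while `greet` itself still returns its usual greeting.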

Dictionaries

Dictionaries in Python are data structures that store data in key-value pairs, allowing efficient data retrieval and modification based on unique keys.

Divide and Conquer Algorithms

Divide and conquer, also known as "divide and rule", is a fundamental algorithm strategy based on recursion, involving two main phases: divide and conquer.

Dynamic Programming

Dynamic Programming (DP) is a technique used to optimize recursive algorithms by storing intermediate results, avoiding redundant calculations, and thus significantly improving computational efficiency. It is particularly useful for problems that exhibit overlapping subproblems and optimal substructure.

Errors and Exceptions

The Try-Except Construct

Fibonacci Series

The Fibonacci series is a sequence of numbers where each number is the sum of the two preceding ones, usually starting with 0 and 1. Mathematically, it is defined as:
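The recurrence described above (start with 0 and 1, then add the two preceding numbers) can be sketched iteratively. The zero-based indexing, where `fib(0)` is 0, is an assumption of this example:

```python
def fib(n):
    """Return the n-th Fibonacci number, with fib(0) = 0 and fib(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        # Slide the window forward: (F(k), F(k+1)) -> (F(k+1), F(k+2)).
        a, b = b, a + b
    return a
```

This version runs in linear time and constant space, which avoids the repeated work of the naive recursive definition.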

File Paths and Managing Files

File Paths

Floyd-Warshall Algorithm

The Floyd-Warshall Algorithm is a dynamic programming method used to find the shortest paths between all pairs of nodes in a weighted graph. This algorithm can handle both directed and undirected graphs and works with graphs that have negative weight edges, though it cannot handle graphs with negative weight cycles.
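A compact sketch of the dynamic programming idea, assuming nodes are numbered `0..n-1` and edges are given as directed `(u, v, weight)` triples (an input format chosen here for illustration). Negative cycle detection is left out:

```python
import math


def floyd_warshall(n, edges):
    """All-pairs shortest path distances for a directed weighted graph."""
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)  # keep the cheapest parallel edge
    # Allow paths to pass through intermediate nodes 0..k, one k at a time.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```

Unreachable pairs keep the distance `math.inf`. For an undirected graph, each edge would be passed in both directions.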

Fractional Knapsack Problem

The Fractional Knapsack Problem is a variant of the knapsack problem where it's permissible to take fractional parts of items rather than having to make a binary choice for each item (all or nothing). This problem allows for a greedy approach to find an optimal solution.
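One way to sketch the greedy approach, assuming items are `(value, weight)` pairs with positive weights (a representation picked for this example): take items in decreasing order of value per unit weight, and split the last item if it does not fit whole.

```python
def fractional_knapsack(items, capacity):
    """Maximum total value when fractions of items may be taken."""
    total = 0.0
    # Best value-per-weight ratio first.
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if capacity <= 0:
            break
        take = min(weight, capacity)  # whole item, or whatever still fits
        total += value * take / weight
        capacity -= take
    return total
```

With items `[(60, 10), (100, 20), (120, 30)]` and capacity 50, the first two items are taken whole and two thirds of the third, for a total value of 240.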

Functions

Defining Functions

Graph algorithms

Graph algorithms are a set of instructions or procedures designed to perform specific tasks on graph data structures. A graph is a collection of nodes (also known as vertices) connected by edges. Graphs are used to model various types of relationships and processes in physical, biological, social, and information systems. The study and application of graph algorithms are central to numerous fields, including computer science, mathematics, network analysis, and social sciences, due to their ability to efficiently solve problems related to connectivity, flow, and routing within complex networks.

Greedy Algorithms

Greedy algorithms are a class of algorithms used to solve optimization problems by making the locally optimal choice at each step with the hope of finding the global optimum.

Grouping Data

Grouping data is an essential task for data analysis, allowing us to understand and manipulate data at a group level. Pandas provides the groupby() function to facilitate this, implementing the split-apply-combine pattern. This pattern involves splitting the data into groups, applying a function to each group, and then combining the results.

Huffman Coding

Huffman Coding is a widely used method for data compression that involves creating variable-length codes for input characters, with the lengths based on the frequencies of the characters. It's an efficient form of lossless compression, which means no information is lost in the compression process.

Idioms

In Python programming, certain idioms are considered more appropriate and efficient. Pandas, a sub-language within Python, has its own idioms, often referred to as "pandorable." These idioms improve code readability and performance. Here are some key features to make your code pandorable.

index

Introduction to Testing in Python

Indexing `DataFrame`

In pandas, both Series and DataFrame objects can have indices applied to them. An index serves as a row-level label, corresponding to axis zero. Indices can be autogenerated or explicitly set. This guide covers various methods for handling indices in pandas, including setting, resetting, and using multi-level indices.

Job Sequencing Problem

The Job Sequencing Problem is a classic problem in computer science, often solved using greedy algorithms. The problem involves scheduling jobs to maximize profit when each job has a deadline and associated profit if it is completed on or before its deadline.

Knapsack Problem

The Knapsack Problem is a classic optimization problem that can be efficiently solved using dynamic programming. The goal is to choose a subset of items with maximum total value, subject to a weight constraint on the total weight of the chosen items.
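A minimal sketch of the 0/1 variant, assuming non-negative integer weights and `(value, weight)` pairs (both assumptions of this example). The table `best[c]` holds the best value achievable with capacity `c`:

```python
def knapsack(items, capacity):
    """0/1 knapsack: each item is either taken whole or left out."""
    best = [0] * (capacity + 1)
    for value, weight in items:
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]
```

The running time is proportional to the number of items times the capacity, so this works well when the capacity is a moderately sized integer.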

Kruskal's algorithm

Kruskal's algorithm is a popular method used to find the minimum spanning tree (MST) for a connected, weighted, undirected graph. The goal of the algorithm is to find a subset of the graph's edges that forms a tree that includes every vertex, where the total weight of all the edges in the tree is minimized. This algorithm is categorized under greedy algorithms because it finds a local optimum with the hope that this leads to a global optimum.
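As an illustrative sketch, assuming vertices are numbered `0..n-1` and edges are `(weight, u, v)` tuples (a format chosen here so that plain sorting orders by weight), the algorithm can be written with a small union-find structure to detect cycles:

```python
def kruskal(n, edges):
    """Return the edges of a minimum spanning tree (or forest)."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the set representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):  # cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it creates no cycle
            parent[ru] = rv
            mst.append((w, u, v))
    return mst
```

For a connected graph the result has `n - 1` edges; summing the first element of each tuple gives the total weight of the tree.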

Lambda Functions

Lambda functions in Python are small, anonymous functions defined by the lambda keyword. They can have any number of arguments but only one expression. The syntax is simple and intended for short functions that are convenient to use inside other functions, particularly those that require a simple function as an argument.
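A brief illustration of passing a lambda as an argument to another function; the variable names are arbitrary examples:

```python
words = ["banana", "fig", "cherry"]

# Sort by length; the lambda supplies the sort key for each word.
by_length = sorted(words, key=lambda w: len(w))

# Apply a one-expression function to every element.
squares = list(map(lambda x: x * x, [1, 2, 3]))
```

Here `by_length` is `["fig", "banana", "cherry"]` (ties keep their original order) and `squares` is `[1, 4, 9]`.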

Languages

Languages of Data Science

Libraries

Scientific Computing Libraries in Python

Lists

Lists in Python are versatile data structures that allow efficient manipulation and storage of a collection of items. Unlike basic data types like integers, floats, Booleans, and strings, lists can hold multiple items, which can be of different data types, all within a single variable.

Longest Common Subsequence

The Longest Common Subsequence (LCS) problem is a classic computer science problem used to find the longest subsequence common to all sequences in a set of sequences (often just two sequences). A subsequence is a sequence derived from another sequence where some elements might be deleted without changing the order of the remaining elements.
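For the two-sequence case, a standard dynamic programming sketch computes the length of the LCS; the function name `lcs_length` is chosen for this example:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    # dp[i][j] is the LCS length of the first i items of a and first j of b.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]
```

For example, `lcs_length("ABCBDAB", "BDCABA")` is 4, with `"BCBA"` being one subsequence of that length.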

Loops

Loops are fundamental constructs in programming that allow the execution of a block of code multiple times. They are essential for automating repetitive tasks, making code efficient and concise.

Managing Data and Processes

Reading Data Interactively

Merging DataFrames

Relational Theory Concepts

Minimum Spanning Tree (MST)

In graph theory, a Minimum Spanning Tree (MST) for a weighted graph is a spanning tree with the minimum weight among all the spanning trees. It connects all the vertices together without any cycles and with the minimum possible total edge weight.

Missing Values

Missing values are common in data cleaning activities and can occur for various reasons. Understanding the nature and handling of missing data is crucial for accurate data analysis.

NumPy

Official Site

Object-oriented Programming

Object-Oriented Programming (OOP) is a programming paradigm that uses "objects" to design software. It allows developers to create classes that encapsulate data and functions, promoting code reusability and modularity.

Other Test Concepts

Black Box vs. White Box Testing

Pandas

Official Website

Permutation Problem

The permutation problem involves finding all possible arrangements of elements from a given set using a backtracking algorithm. This problem can be approached considering either arrays without duplicate elements or arrays that include duplicates.
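A backtracking sketch that handles both cases might look like the following. The per-level `seen` set is one common way to skip duplicate values, and is an implementation choice of this example rather than something the note prescribes:

```python
def permutations(nums):
    """All distinct permutations of nums, built by backtracking."""
    result, path = [], []
    used = [False] * len(nums)

    def backtrack():
        if len(path) == len(nums):
            result.append(path.copy())
            return
        seen = set()  # values already tried at this position
        for i, x in enumerate(nums):
            if used[i] or x in seen:
                continue
            seen.add(x)
            used[i] = True
            path.append(x)
            backtrack()
            path.pop()  # undo the choice before trying the next one
            used[i] = False

    backtrack()
    return result
```

For input without duplicates the `seen` check never triggers, so the same function covers both variants.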

Pivot Tables

A pivot table is a way of summarizing data in a DataFrame for a particular purpose. It makes heavy use of the aggregation function. A pivot table is itself a DataFrame where the rows represent one variable, the columns represent another, and the cells contain some aggregate value. Pivot tables often include marginal values, which are the sums for each column and row. This allows for an easy visual representation of the relationship between two variables.

Prim's Algorithm

Prim's Algorithm is a popular and efficient greedy algorithm used to find the Minimum Spanning Tree (MST) of a weighted, undirected graph. This algorithm helps to find a subset of the graph's edges that forms a tree including every vertex, where the total weight of all the edges in the tree is minimized.

Processing Log Files

Log file analysis is a crucial task in system administration and software development, allowing experts to monitor system activities, debug issues, and extract valuable information. This guide demonstrates how to parse log files using Python to count the occurrences of usernames in CRON job entries.

Programming Languages

Data Analyst

pytorch

Official Website

Querying `DataFrame`

Boolean Masking

Querying `Series`

A pandas Series can be queried either by index position or by index label. If no index is specified when the Series is created, the position and the label are effectively the same values.

R

Official Site

Reading and Writing Files

Interacting with files is a fundamental aspect of programming, especially in automation and data processing tasks. Python provides robust capabilities to manipulate files and directories, making it a powerful tool for IT specialists and system administrators.

Recursion

Recursion is the repeated application of the same procedure to a smaller problem. It allows complex tasks to be broken down into simpler, more manageable sub-tasks.

regex

Regular expressions (regexes) are patterns used to match character combinations in strings. They are essential tools for text processing in programming and data science, allowing for:

Running System Commands in Python

Python allows interaction with the operating system by executing system commands directly from scripts using the subprocess module.

Scales

Creating a DataFrame with Letter Grades

Shortest Path Problems

Shortest path problems are key in optimizing routes in various applications like network design, transportation, and telecommunications. They focus on finding the shortest route from a start to a destination in a graph, which consists of nodes (vertices) and edges (connections between nodes).

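Softmax Regression

In the previous sections, we explored linear regression and its implementation, both from scratch and using high-level APIs. Regression models are typically used for quantitative outputs, such as predicting prices, numbers of wins, or the number of days a patient stays in hospital. However, not every problem is suited to a regression model; that depends on the nature of its output. This leads to special cases such as log regression or survival modeling.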

Space complexity

Space complexity is a metric that evaluates the total memory space required by an algorithm in terms of the size of its input data. It is analogous to time complexity but focuses on memory usage instead of execution time.

Strings

Strings are an essential data type in Python, used to represent text data. They are sequences of characters enclosed within single (') or double (") quotes.

Subset Sum Problem

The Subset Sum Problem involves finding all possible combinations of elements in a given array such that their sum equals a specified target. This problem is a classic example of using backtracking for combinatorial exploration. It has variations based on whether duplicate elements are allowed in the input set and whether each element can be chosen more than once.

SymPy

SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python.

Tableau

Reference and Useful Links

Terms

Data Visualization Types

Time Complexity

Time complexity is a metric used to describe the efficiency of an algorithm by examining how the run time increases with the size of the input. It primarily focuses on the growth trend of the runtime rather than precise execution times, which can vary across different hardware and software environments.

Transformer Model

Unit Tests

Convolutional Neural Networks

Variational Autoencoders

Multilayer Perceptrons

Open-Source Tools