Open-Source Tools
Data Management Tools
Relational Databases:
- MySQL
- PostgreSQL
NoSQL Databases:
- MongoDB
- Apache CouchDB
- Apache Cassandra
File-Based Systems:
- Hadoop Distributed File System (HDFS)
- Ceph (Cloud File System)
Search Tools:
- Elasticsearch: Stores text data and builds search indexes for fast document retrieval.
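The idea of a search index mentioned above can be illustrated with a small inverted index, which maps each word to the documents that contain it. This is only a minimal sketch in plain Python, not how Elasticsearch is implemented or how its API looks; the function names (`tokenize`, `build_index`, `search`) and the sample documents are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Apache Kafka is a distributed event streaming platform",
    2: "Apache NiFi provides a visual editor for data flow",
    3: "Node-RED is a visual editor for the Internet of Things",
}
index = build_index(docs)
print(search(index, "visual editor"))  # [2, 3]
print(search(index, "apache"))         # [1, 2]
```

Lookups touch only the posting sets of the query words instead of scanning every document, which is why an index of this kind makes retrieval fast.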
Data Integration and Transformation Tools
ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform):
- Data Refinery and Cleansing
Common Tools:
- Apache Airflow: Workflow automation and scheduling tool, originally created by Airbnb.
- Kubeflow: Runs data science pipelines on Kubernetes.
- Apache Kafka: Distributed event streaming platform, developed by LinkedIn.
- Apache NiFi: Provides a visual editor for data flow automation.
- Apache Spark SQL: Lets you query data with ANSI SQL and scales out to large compute clusters.
- Node-RED: A visual editor for wiring together the Internet of Things (IoT); it also runs in low-resource environments such as the Raspberry Pi.
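The extract, transform, load pattern named in this section can be sketched with the Python standard library alone. This is an assumption-laden illustration, not the API of any tool listed above: the CSV content, the `people` table, and the `extract`, `transform`, and `load` functions are all hypothetical, and an in-memory SQLite database stands in for a real target system.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with typical quality problems:
# stray whitespace, inconsistent casing, and a missing value.
RAW_CSV = """name,city,age
 Alice ,berlin,34
Bob,PARIS,
carol,Berlin,29
"""


def extract(raw_text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: trim whitespace, normalize casing, drop rows without an age."""
    cleaned = []
    for row in rows:
        age = row["age"].strip()
        if not age:
            continue  # cleansing step: skip incomplete records
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        cleaned.append((name, city, int(age)))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people (name TEXT, city TEXT, age INTEGER)"
    )
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT name, city, age FROM people ORDER BY name").fetchall()
print(result)  # [('Alice', 'Berlin', 34), ('Carol', 'Berlin', 29)]
```

In an ELT variant, the raw rows would be loaded into the database first and the cleansing would then run inside the target system, for example as SQL statements.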
Data Visualization Tools
Programming Libraries vs. User Interface Tools:
Libraries with UI:
- PixieDust: Simplifies plotting in Python by adding a user interface.
- Hue: Creates visualizations from SQL queries.
Visualization Web Applications:
- Kibana: Works with Elasticsearch for data exploration and visualization.
- Apache Superset: A data exploration and visualization platform.
Model Tools
Model Deployment:
- Apache PredictionIO: Deploys Apache Spark ML models.
- Seldon: Supports TensorFlow, Apache SparkML, R, and scikit-learn; runs on Kubernetes and Red Hat OpenShift.
- MLeap: Deploys SparkML models.
- TensorFlow Serving: Serves TensorFlow models; related runtimes target embedded devices (TensorFlow Lite) and web browsers (TensorFlow.js).
Model Monitoring:
- ModelDB: Stores and queries machine learning model metadata; supports Apache Spark ML Pipelines and scikit-learn.
- Prometheus: A general-purpose monitoring tool that is also used for model performance monitoring.
Model Performance and Fairness:
- IBM AI Fairness 360: Detects and mitigates bias in machine learning models.
- IBM Adversarial Robustness 360 Toolbox: Detects and enhances robustness against adversarial attacks.
- IBM AI Explainability 360: Improves model transparency by explaining model decisions and training simpler models for comparison.
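To make "detecting bias" more concrete, two common group-fairness measures, statistical parity difference and disparate impact, can be computed by hand. The sketch below is plain Python and does not use or mirror the AI Fairness 360 API; the function names, the group labels `"a"` and `"b"`, and the sample predictions are all assumptions for illustration.

```python
def favorable_rate(outcomes, groups, group):
    """Share of favorable outcomes (1) among members of the given group."""
    selected = [o for o, g in zip(outcomes, groups) if g == group]
    if not selected:
        raise ValueError(f"no samples for group {group!r}")
    return sum(selected) / len(selected)


def statistical_parity_difference(outcomes, groups, unprivileged, privileged):
    """Favorable rate of the unprivileged group minus that of the privileged group.

    A value of 0 means both groups receive favorable outcomes at the same rate.
    """
    return (favorable_rate(outcomes, groups, unprivileged)
            - favorable_rate(outcomes, groups, privileged))


def disparate_impact(outcomes, groups, unprivileged, privileged):
    """Ratio of favorable rates; a value of 1 means equal rates.

    Raises ZeroDivisionError if the privileged group has no favorable outcomes.
    """
    return (favorable_rate(outcomes, groups, unprivileged)
            / favorable_rate(outcomes, groups, privileged))


# Hypothetical model predictions (1 = favorable) for two groups.
groups = ["a"] * 5 + ["b"] * 5
outcomes = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

spd = statistical_parity_difference(outcomes, groups, unprivileged="b", privileged="a")
di = disparate_impact(outcomes, groups, unprivileged="b", privileged="a")
print(f"statistical parity difference: {spd:.2f}")
print(f"disparate impact: {di:.2f}")
```

Here group "a" receives favorable outcomes at a rate of 0.8 and group "b" at 0.4, so both measures flag a gap that a mitigation step would then try to reduce.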
Code and Data Asset Management Tools
Code Asset Management (Version Control):
- Git: The de facto standard for version control.
- GitHub
- GitLab: Fully open-source and self-hostable.
- Bitbucket
Data Asset Management (Data Governance):
- Apache Atlas: Supports data governance and lineage.
- ODPi Egeria: Offers open APIs, types, and protocols for metadata repositories.
- Kylo: An open-source data management platform with extensive support for data asset management.
Development Environments
Jupyter:
- Jupyter Notebooks: Support interactive programming in Python and, through kernels, in more than 100 other programming languages.
- JupyterLab: The next-generation interface for Jupyter Notebooks; more modern and modular, with enhanced file handling capabilities.
Apache Zeppelin:
- Similar to Jupyter Notebooks with integrated plotting capabilities.
RStudio:
- Development environment for R and statistics, with integrated support for Python.
Spyder:
- A Python development environment inspired by RStudio that integrates code, documentation, and visualizations.
Cluster Execution Environments
Apache Spark:
- A batch data processing engine with linear scalability, widely used across industries.
Apache Flink:
- Focuses on real-time data stream processing, but supports both batch and stream processing.
Ray:
- Focuses on large-scale deep learning model training.
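The difference between the batch and stream processing styles mentioned in this section can be sketched on a single machine with a word count. This is a conceptual illustration in plain Python only; it uses none of the Spark, Flink, or Ray APIs, and the function names and sample events are assumptions.

```python
from collections import Counter


def batch_word_count(lines):
    """Batch style: the whole dataset is available before counting starts."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def streaming_word_count(line_stream):
    """Streaming style: update running state per event and emit a snapshot each time."""
    counts = Counter()
    for line in line_stream:
        counts.update(line.split())
        yield dict(counts)  # intermediate result is available immediately


# Hypothetical input events.
events = ["spark flink", "flink ray", "flink"]

batch_result = batch_word_count(events)
snapshots = list(streaming_word_count(iter(events)))

print(batch_result["flink"])  # 3
print(snapshots[0])           # {'spark': 1, 'flink': 1}
print(snapshots[-1] == dict(batch_result))  # True
```

The batch version produces one result at the end, while the streaming version yields an updated result after every event; once the stream is exhausted, both agree.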
Fully Integrated and Visual Tools
KNIME:
- Visual user interface with drag-and-drop capabilities for data integration, transformation, and visualization. Can be extended with R and Python, and supports Apache Spark.
Orange:
- Easier to use than KNIME but less flexible; still supports the most important data science tasks.