Open-Source Tools

Data Management Tools

Relational Databases:

  • MySQL
  • PostgreSQL

NoSQL Databases:

  • MongoDB
  • Apache CouchDB
  • Apache Cassandra

File-Based Systems:

  • Hadoop Distributed File System (HDFS)
  • Ceph (Cloud File System)

Search Tools:

  • Elasticsearch: Stores text data and builds search indexes for fast document retrieval.
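
The idea of a search index can be sketched in plain Python. This is a toy inverted index (a map from each word to the documents containing it), not Elasticsearch's API or its actual implementation; the names `build_index` and `search` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "open source tools for data science",
    2: "search tools for text data",
    3: "model deployment on kubernetes",
}
index = build_index(docs)
print(sorted(search(index, "tools data")))  # → [1, 2]
```

A lookup touches only the entries for the query words instead of scanning every document, which is why an index of this kind makes retrieval fast.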

Data Integration and Transformation Tools

ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform):

  • Data Refinery and Cleansing
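
The ETL order of operations can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the CSV content, the `sales` table, and the function names are made up, and an in-memory SQLite database stands in for a real target system.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with untidy names and one missing amount.
RAW_CSV = "name,amount\n alice ,10\nBOB,\ncarol,7\n"


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim and lowercase names, drop rows with no amount."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if amount:
            clean.append((row["name"].strip().lower(), int(amount)))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # → 17
```

In an ELT variant, the raw rows would be loaded first and the cleansing step would run afterwards as SQL inside the target system.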

Common Tools:

  • Apache Airflow: Workflow automation and scheduling tool, originally created by Airbnb.
  • Kubeflow: Runs data science pipelines on Kubernetes.
  • Apache Kafka: Distributed event streaming platform, originally developed at LinkedIn.
  • Apache NiFi: Provides a visual editor for data flow automation.
  • Apache Spark SQL: Runs ANSI SQL queries and scales out to large compute clusters.
  • Node-RED: A visual editor for wiring the Internet of Things (IoT) that also runs in low-resource environments such as a Raspberry Pi.
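
Workflow tools of this kind typically model a pipeline as a graph of tasks with dependencies and run each task only after its prerequisites. The sketch below illustrates that idea with the standard-library `graphlib` module (Python 3.9+). It does not use the API of Airflow or any other tool listed above, and the task names and `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "train": {"clean"},
    "report": {"clean"},
    "publish": {"train", "report"},
}


def run_pipeline(deps, actions):
    """Run every task after all of its dependencies; return the run order."""
    order = []
    for task in TopologicalSorter(deps).static_order():
        actions[task]()
        order.append(task)
    return order


# Each action just records its own name; real tasks would do actual work.
log = []
actions = {name: (lambda n=name: log.append(n)) for name in dependencies}
order = run_pipeline(dependencies, actions)
print(order)
```

The relative order of `train` and `report` is not fixed, because neither depends on the other; a scheduler is free to run such tasks in parallel.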

Data Visualization Tools

Programming Libraries vs. User Interface Tools:

Libraries with UI:

  • PixieDust: Simplifies plotting in Python by adding a user interface.
  • Hue: Creates visualizations from SQL queries.

Visualization Web Applications:

  • Kibana: Works with Elasticsearch for data exploration and visualization.
  • Apache Superset: A data exploration and visualization platform.

Model Tools

Model Deployment:

  • Apache PredictionIO: Deploys Apache Spark ML models.
  • Seldon: Supports TensorFlow, Apache Spark ML, R, and scikit-learn; runs on Kubernetes and Red Hat OpenShift.
  • MLeap: Deploys Spark ML models.
  • TensorFlow Serving: Serves TensorFlow models; related versions target embedded devices (TensorFlow Lite) and web browsers (TensorFlow.js).

Model Monitoring:

  • ModelDB: Stores machine learning model metadata and makes it queryable; supports Apache Spark ML Pipelines and scikit-learn.
  • Prometheus: A general-purpose monitoring tool that is also used for model performance monitoring.

Model Performance and Fairness:

  • IBM AI Fairness 360: Detects and mitigates bias in machine learning models.
  • IBM Adversarial Robustness 360 Toolbox: Helps detect vulnerability to adversarial attacks and improve model robustness against them.
  • IBM AI Explainability 360: Improves model transparency by explaining model decisions and training simpler models for comparison.
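
To make "detecting bias" concrete, the sketch below computes two common group-fairness measures from 0/1 predictions. It is plain Python and does not use the AI Fairness 360 API; the function names, the group labels, and the prediction lists are made up for illustration.

```python
def selection_rate(outcomes):
    """Fraction of favorable (1) outcomes in a list of 0/1 predictions."""
    return sum(outcomes) / len(outcomes)


def statistical_parity_difference(unprivileged, privileged):
    """Difference in selection rates; 0 means both groups are treated alike."""
    return selection_rate(unprivileged) - selection_rate(privileged)


def disparate_impact(unprivileged, privileged):
    """Ratio of selection rates; 1 means both groups are treated alike."""
    return selection_rate(unprivileged) / selection_rate(privileged)


# Made-up model predictions (1 = favorable outcome) for two groups.
privileged = [1, 1, 1, 0, 1, 0, 1, 1]    # selection rate 6/8 = 0.75
unprivileged = [1, 0, 0, 1, 0, 0, 1, 0]  # selection rate 3/8 = 0.375

print(statistical_parity_difference(unprivileged, privileged))  # → -0.375
print(disparate_impact(unprivileged, privileged))               # → 0.5
```

Here the unprivileged group receives the favorable outcome half as often as the privileged group, which is the kind of gap a bias-detection toolkit is meant to surface.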

Code and Data Asset Management Tools

Code Asset Management (Version Control):

  • Git: The de facto standard for version control.
    • GitHub
    • GitLab: Fully open-source and self-hostable.
    • Bitbucket

Data Asset Management (Data Governance):

  • Apache Atlas: Supports data governance and lineage.
  • ODPi Egeria: Offers open APIs, types, and protocols for metadata repositories.
  • Kylo: An open-source data management platform with extensive support for data asset management.

Development Environments

Jupyter:

  • Jupyter Notebooks: Supports interactive programming in Python and, through kernels, in over 100 other programming languages.
  • JupyterLab: The next-generation interface for Jupyter Notebooks; more modern and modular, with enhanced file handling capabilities.

Apache Zeppelin:

  • Similar to Jupyter Notebooks, with integrated plotting capabilities.

RStudio:

  • Development environment for R and statistics, with integrated support for Python.

Spyder:

  • A Python development environment inspired by RStudio that integrates code, documentation, and visualizations.

Cluster Execution Environments

Apache Spark:

  • A batch data processing engine with linear scalability, widely used across industries.

Apache Flink:

  • Focuses on real-time data stream processing, but supports both batch and stream processing.
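
The difference between batch and stream processing can be sketched with a tumbling-window count that emits a result as soon as each window closes, rather than after all data has been read. This is a plain-Python illustration, not Flink's API; the function name `tumbling_counts` and the timestamps are hypothetical, and the sketch assumes events arrive in timestamp order.

```python
def tumbling_counts(events, window_seconds):
    """Yield (window_start, count) each time a fixed-size window closes.

    Assumes timestamps arrive in increasing order. Real stream processors
    also handle late and out-of-order events, which this sketch ignores.
    Windows that receive no events are skipped.
    """
    current_start = None
    count = 0
    for timestamp in events:
        start = (timestamp // window_seconds) * window_seconds
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The previous window is complete, so emit it immediately.
            yield current_start, count
            current_start, count = start, 0
        count += 1
    if current_start is not None:
        yield current_start, count  # flush the last open window


# Made-up event timestamps in seconds, consumed one at a time.
timestamps = [1, 3, 4, 11, 12, 25]
results = list(tumbling_counts(iter(timestamps), 10))
print(results)  # → [(0, 3), (10, 2), (20, 1)]
```

Because the function is a generator, a consumer sees the count for window 0 as soon as the first event of window 10 arrives, without waiting for the input to end.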

Ray:

  • Focuses on large-scale deep learning model training.

Fully Integrated and Visual Tools

KNIME:

  • Visual user interface with drag-and-drop capabilities for data integration, transformation, and visualization. Can be extended with R and Python, and supports Apache Spark.

Orange:

  • Easier to use than KNIME but less flexible; still covers the important data science tasks.