
🔍 Exploratory Data Analysis

A comprehensive Python-based EDA framework with interactive visualizations, statistical techniques, and automated pattern detection for complex datasets.

Project Philosophy

A complete EDA workflow for discovering patterns, outliers, and relationships with modern Python tools.

Core Mission

Develop a robust EDA pipeline that enables deep data understanding through informative visualizations, descriptive statistics, and automated pattern detection for data-driven decision making.

Statistical Engine

Automated descriptive statistics, distribution analysis, and multi-method outlier detection (IQR, Z-score, IsolationForest).
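A minimal sketch of the three outlier methods named above, applied to a toy numeric series with two planted extremes (the data and thresholds here are illustrative, not the pipeline's actual defaults):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
values = np.concatenate([rng.normal(50, 5, 200), [120.0, -30.0]])  # 2 planted outliers

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std()
z_mask = np.abs(z) > 3

# IsolationForest: model-based anomaly flag (fit_predict returns -1 for outliers)
iso = IsolationForest(contamination=0.01, random_state=0)
iso_mask = iso.fit_predict(values.reshape(-1, 1)) == -1

print(iqr_mask.sum(), z_mask.sum(), iso_mask.sum())
```

Running several methods side by side is useful because they disagree on borderline points: IQR and Z-score are cheap univariate rules, while IsolationForest generalizes to multivariate data.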

Visualization Suite

Interactive plots and dashboards built with Matplotlib, Seaborn and Plotly supporting zoom, pan and hover tooltips for deeper exploration.

1. Data Ingestion: Load CSV/JSON/Excel with type inference and validation.

2. Profiling: Missing values, duplicates, and data-type report generation.

3. Univariate & Bivariate: Distribution plots, correlation heatmaps, and pairwise analysis.

4. Multivariate: Dimensionality reduction and clustering for pattern discovery.

5. Reporting: Automated HTML/interactive reports with recommendations.
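The profiling step (step 2 above) can be sketched in a few lines; the toy DataFrame and its column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [29, 41, None, 41, 35],
    "city": ["Rio", "SP", "SP", "SP", None],
})

# Build a compact profile: row count, duplicate rows, missingness, and dtypes
profile = {
    "n_rows": len(df),
    "n_duplicates": int(df.duplicated().sum()),
    "missing_pct": df.isnull().mean().round(3).to_dict(),
    "dtypes": df.dtypes.astype(str).to_dict(),
}
print(profile)
```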

System Architecture

A modular pipeline from ingestion to reporting designed for reproducibility and scale.

1. Data Collection: Flexible loaders for CSV, Excel, and JSON with validation and schema inference.

2. Preprocessing: Cleaning, imputation, and type handling with reproducible transforms.

3. Exploration: Statistical summaries, distributions, and pairwise relationships for hypothesis generation.

4. Visualization: Static & interactive plots (Matplotlib/Seaborn/Plotly) to communicate insights.

5. Reporting: Automated HTML/interactive reports and exportable artifacts for stakeholders.
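"Reproducible transforms" in the preprocessing stage means fitting a transform once on training data and reusing it unchanged on new batches. A minimal sketch with scikit-learn's `SimpleImputer` (the `fare` column is a hypothetical example):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"fare": [10.0, None, 30.0, 20.0]})
imputer = SimpleImputer(strategy="median")
imputer.fit(train[["fare"]])  # learns the training median (20.0) once

# New data is filled with the *training* median, not its own statistics
new_batch = pd.DataFrame({"fare": [None, 50.0]})
filled = imputer.transform(new_batch[["fare"]])
print(filled.ravel())
```

Fitting on training data only avoids leakage and makes the transform deterministic across reruns.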

Technology Stack

Libraries and tools chosen for performance, interactivity and reproducibility.

Python 3.10+

Core language with mature data ecosystem.

Plotly / Seaborn

Interactive and static plotting for exploratory analysis.

Pandas / NumPy

High-performance data handling and numeric routines.

Scikit-learn

Modeling primitives and evaluation utilities.

Technical Implementation

Modern Python stack optimized for analysis and reproducibility.

Pandas & NumPy

High-performance data manipulation and numerical routines for efficient EDA.

Matplotlib, Seaborn & Plotly

Static and interactive visualizations for distribution, correlation and time-series analysis.

Scipy & ML Tools

Advanced statistics and clustering algorithms for deeper insights.

# Example: Automated EDA Pipeline
import pandas as pd
import numpy as np
from eda_toolkit import EDAAnalyzer

# Initialize analyzer
analyzer = EDAAnalyzer(config='comprehensive')

# Load and analyze data
df = pd.read_csv('dataset.csv')
results = analyzer.analyze(df)

# Generate interactive report
analyzer.generate_report(output_format='html',
                         include_recommendations=True,
                         interactive=True)

# Visualization example (Plotly)
import plotly.express as px

# Sample plot
fig = px.histogram(df, x='age', nbins=30, title='Age Distribution')
fig.write_html('age_distribution.html')

Results & Performance

Key findings and system performance metrics.

Overview

This section summarizes how the model metrics and visualizations are generated in the Titanic Streamlit app and how to interpret them. The values shown on this page are snapshots produced by the app; for the most up-to-date numbers consult the live demo or the exported metrics file.

Key Metrics

Accuracy is the fraction of correct predictions (use cautiously on imbalanced data). Precision measures how many of the predicted positives were actually positive. Recall (sensitivity) measures how many of the actual positives were identified. F1 is the harmonic mean of precision and recall. ROC AUC captures model separability across thresholds.
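The five metrics above can all be computed with scikit-learn; the labels and probabilities below are a toy example, not the app's data:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]  # threshold probabilities at 0.5

acc = accuracy_score(y_true, y_pred)     # fraction of correct predictions
prec = precision_score(y_true, y_pred)   # of predicted positives, how many are real
rec = recall_score(y_true, y_pred)       # of actual positives, how many were found
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)      # uses probabilities, not hard labels
print(acc, prec, rec, f1, auc)
```

Note that ROC AUC takes the raw probabilities, while the other four are computed from thresholded labels.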

Confusion Matrix & ROC

The confusion matrix (Actual × Predicted) decomposes errors into true negatives, false positives, false negatives and true positives — inspect FN vs FP depending on your objectives. The ROC curve plots True Positive Rate vs False Positive Rate across thresholds; AUC close to 1.0 is desirable.
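The four cells of the confusion matrix can be unpacked directly from scikit-learn's output; the labels below are a toy example:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]

# ravel() flattens the 2x2 matrix in (TN, FP, FN, TP) order for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```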

Feature Importance

Feature importance (e.g., from RandomForest) shows which variables the model relied on most. Use it to guide feature engineering, data collection priorities, or to investigate potential bias in predictions.
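A sketch of extracting importances from a fitted RandomForest; the synthetic data and the Titanic-style feature names (`fare`, `age`, `sibsp`) are illustrative, with only the first feature actually informative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)  # target depends only on the first feature
names = ["fare", "age", "sibsp"]

model = RandomForestClassifier(
    n_estimators=100, max_features=None, random_state=0
).fit(X, y)

# Sort features by impurity-based importance, highest first
ranked = sorted(zip(names, model.feature_importances_), key=lambda t: -t[1])
print(ranked)
```

Because the target is fully determined by the first feature, its importance dominates; on real data the ranking is noisier and worth cross-checking with permutation importance.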

Classification Report & Practical Notes

The classification report lists per-class precision, recall, f1 and support; examine support to understand imbalance effects. Practical guidance: lower threshold to increase recall (catch more positives) or raise it to increase precision (fewer false positives). Combine numeric metrics with the confusion matrix and ROC for a complete view.
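The threshold trade-off described above can be seen by sweeping thresholds over the same probabilities; the data here is a toy example:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.2, 0.3, 0.45, 0.6, 0.4, 0.55, 0.7, 0.9]

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = [int(p >= threshold) for p in y_prob]
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    results[threshold] = (p, r)
    print(threshold, round(p, 2), round(r, 2))
```

Lowering the threshold to 0.3 drives recall to 1.0 at the cost of precision; raising it to 0.7 does the reverse.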

Integration & Maintenance

For maintainers: the app saves metrics to 'titanic_metrics.pkl'. To keep this page synchronized, export metrics as JSON (e.g. 'metrics.json') during the app pipeline and load them at build or runtime to inject numbers into the DOM.
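One way to implement that export step (a sketch; the metric values are the sample snapshots from this page, and the temp-directory path is for illustration only):

```python
import json
import os
import tempfile

# Sample snapshot values from this page; in the app these come from the pipeline
metrics = {"accuracy": 0.82, "precision": 0.78, "recall": 0.72,
           "f1": 0.75, "roc_auc": 0.86, "sample_size": 891}

# Write metrics.json during the app pipeline...
out_path = os.path.join(tempfile.gettempdir(), "metrics.json")
with open(out_path, "w") as fh:
    json.dump(metrics, fh, indent=2)

# ...then reload it at build or runtime to inject numbers into the page
with open(out_path) as fh:
    loaded = json.load(fh)
print(loaded["accuracy"])
```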

Accuracy: 82%
Precision: 78%
Recall: 72%
F1 Score: 75%
ROC AUC: 0.86
Sample Size: 891

Evaluation Framework

Approach to validate analyses, reproducibility and quality of insights.

Evaluation Script Example
# Example evaluation harness for EDA quality checks
import pandas as pd

def evaluate_report(report_path):
    # Load report metadata and run sample checks:
    # compute coverage of variables, missingness thresholds, and basic sanity tests
    pass

def run_checks(df):
    checks = {}
    checks['missing_pct'] = df.isnull().mean().to_dict()
    return checks

Sanity Checks

Automated tests for distributions, missingness and schema drift.
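The schema-drift check mentioned above can be sketched as a comparison of a new batch against a saved reference schema; the column names and the `check_schema` helper are hypothetical:

```python
import pandas as pd

reference_schema = {"age": "float64", "city": "object"}

def check_schema(df, expected):
    # Compare incoming columns and dtypes against the saved reference
    actual = df.dtypes.astype(str).to_dict()
    missing = sorted(set(expected) - set(actual))
    drifted = {c: (expected[c], actual[c])
               for c in expected if c in actual and actual[c] != expected[c]}
    return {"missing_columns": missing, "dtype_drift": drifted}

# A batch where 'age' arrived as strings instead of floats
batch = pd.DataFrame({"age": ["29", "41"], "city": ["Rio", "SP"]})
report = check_schema(batch, reference_schema)
print(report)
```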

Reproducibility

Deterministic transforms and saved artifacts for repeatable analysis.

Peer Review

Human-in-the-loop validation of anomalies and recommended actions.

Project Structure

Repository organization and extension points.

Directory Structure
eda-project/
├── data/              # raw and processed datasets
├── notebooks/         # exploratory notebooks
├── src/               # analysis & utils
├── reports/           # generated HTML / plot artifacts
└── requirements.txt   # pinned dependencies

Getting Started

Quick setup to run the EDA toolkit locally or in Colab.

Quick Setup Instructions
# Clone
git clone https://github.com/bcmaymonegalvao/eda-project.git
cd eda-project

# Create venv
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # macOS/Linux

# Install
pip install -r requirements.txt

Future Enhancements

Planned improvements and extensions to make the analysis even more powerful and accessible.

Data Loaders

Support for more formats (Parquet, databases) and scalable ingestion.

Interactive Dashboard

Streamlit dashboards with saved plots and drill-down capabilities.

Indexing & Caching

Cache computed summaries and pre-rendered visualizations for performance.