Data Source Support

sktime-mcp now supports loading data from multiple sources beyond the built-in demo datasets!

Supported Data Sources

  1. SQL Databases - PostgreSQL, MySQL, SQLite, MSSQL

  2. Files - CSV, TSV, Excel, Parquet

Quick Start

1. CSV File

result = executor.load_data_source({
    "type": "file",
    "path": "/path/to/data.csv",
    "time_column": "date",
    "target_column": "sales",
    "exog_columns": ["temperature", "promotion"],  # Optional
})

2. SQL Database

# SQLite
result = executor.load_data_source({
    "type": "sql",
    "connection_string": "sqlite:///path/to/database.db",
    "query": "SELECT date, sales FROM sales_table",
    "time_column": "date",
    "target_column": "sales",
})

# PostgreSQL
result = executor.load_data_source({
    "type": "sql",
    "connection_string": "postgresql://user:pass@host:5432/db",
    "query": "SELECT * FROM sales WHERE date >= '2020-01-01'",
    "time_column": "date",
    "target_column": "sales",
})

MCP Tools

load_data_source

Load data from any supported source.

Arguments:

  • config (dict): Data source configuration

Returns:

  • success (bool): Whether loading succeeded

  • data_handle (str): Handle to reference the loaded data

  • metadata (dict): Information about the data (rows, columns, frequency, etc.)

  • validation (dict): Data quality validation results

Example:

{
  "config": {
    "type": "pandas",
    "data": {"date": [...], "value": [...]},
    "time_column": "date",
    "target_column": "value"
  }
}

fit_predict (custom data)

After load_data_source, call fit_predict with data_handle set (and omit dataset, or pass an empty string). This is the same MCP tool used for demo datasets.

Arguments:

  • estimator_handle (str): Handle from instantiate_estimator

  • data_handle (str): Handle from load_data_source

  • horizon (int): Forecast horizon (default: 12)

Example:

{
  "estimator_handle": "est_abc123",
  "data_handle": "data_xyz789",
  "horizon": 7
}

list_data_sources

List all available data source types.

Returns:

  • sources (list): Available source types

  • descriptions (dict): Description for each source

list_available_data

List available data in a single response, including demo datasets and active handles.

Arguments (optional):

  • is_demo (bool):

    • True → only demo datasets

    • False → only active data handles

    • omitted → both

Returns:

  • system_demos (list): Built-in demo dataset names

  • active_handles (list): Active data handle information

  • total (int): Combined count of returned entries

release_data_handle

Release a data handle and free memory.

Arguments:

  • data_handle (str): Handle to release

Configuration Options

Pandas Adapter

{
    "type": "pandas",
    "data": df_or_dict,  # DataFrame or dict
    "time_column": "date",  # Optional, will try to detect
    "target_column": "sales",  # Optional, defaults to first column
    "exog_columns": ["temp", "promo"],  # Optional
    "frequency": "D"  # Optional, will try to infer
}

SQL Adapter

{
    "type": "sql",
    
    # Option 1: Connection string
    "connection_string": "postgresql://user:pass@host:5432/db",
    
    # Option 2: Individual components
    "dialect": "postgresql",  # postgresql, mysql, sqlite, mssql
    "host": "localhost",
    "port": 5432,
    "database": "mydb",
    "username": "user",
    "password": "pass",
    
    # Query
    "query": "SELECT * FROM sales",  # Direct SQL query
    # OR
    "table": "sales",  # Table name
    "filters": {"region": "North"},  # Optional filters
    
    # Column mapping
    "time_column": "date",
    "target_column": "sales",
    "exog_columns": ["temperature"],
    
    # Optional
    "parse_dates": ["date"],
    "frequency": "D"
}

File Adapter

{
    "type": "file",
    "path": "/path/to/data.csv",
    "format": "csv",  # csv, excel, parquet (auto-detected if not specified)
    
    # Column mapping
    "time_column": "date",
    "target_column": "sales",
    "exog_columns": ["temperature"],
    
    # CSV-specific options
    "csv_options": {
        "sep": ",",
        "header": 0,
        "encoding": "utf-8"
    },
    
    # Excel-specific options
    "excel_options": {
        "sheet_name": 0,
        "header": 0
    },
    
    # Common options
    "parse_dates": True,
    "frequency": "D"
}

Data Validation

All data sources are automatically validated for:

  • ✅ DatetimeIndex presence

  • ✅ No duplicate time indices

  • ✅ Sufficient data points

  • ✅ Missing value detection

  • ✅ Frequency inference

Validation results are included in the response:

{
    "valid": True,
    "errors": [],  # Critical issues that prevent usage
    "warnings": ["Missing values detected: {'sales': 5.0}"]  # Non-critical issues
}

Installation

Core (Pandas support included)

pip install -e .

With SQL support

pip install -e ".[sql]"

With file format support

pip install -e ".[files]"

With all optional dependencies

pip install -e ".[all]"

Examples

See the examples/ directory for complete working examples:

  • pandas_example.py - Loading from pandas DataFrames

  • csv_example.py - Loading from CSV/TSV files

  • sql_example.py - Loading from SQL databases (SQLite, PostgreSQL, MySQL)

Architecture

Data Source Layer
├── base.py                 # Abstract adapter interface
├── registry.py             # Adapter registry
└── adapters/
    ├── pandas_adapter.py   # In-memory DataFrames
    ├── sql_adapter.py      # SQL databases
    └── file_adapter.py     # CSV, Excel, Parquet

Each adapter implements:

  • load() - Load data from source

  • validate() - Check data quality

  • to_sktime_format() - Convert to (y, X) format

Metadata

Each loaded data source provides rich metadata:

{
    "source": "sql",
    "rows": 100,
    "columns": ["sales", "temperature"],
    "frequency": "D",
    "start_date": "2020-01-01",
    "end_date": "2020-04-09",
    "missing_values": {"sales": 0, "temperature": 2}
}

Error Handling

All operations return structured error information:

{
    "success": False,
    "error": "Could not convert index to datetime",
    "error_type": "ValueError",
    "validation": {
        "valid": False,
        "errors": ["Index must be DatetimeIndex"]
    }
}

💡 LLM Guidelines & Best Practices

When using these tools to handle user data, follow these best practices to avoid common execution errors:

1. Column Identification

  • Target vs. Features: Identify which column is the target (y) and which are exogenous features (X).

  • Numeric Only: Ensure the target column contains only numeric data. Forecasting models will fail if string/categorical columns are accidentally included in y.

  • User Confirmation: If the dataset has multiple columns and the user hasn’t specified which one to forecast, ask for clarification instead of guessing.

2. Strategic Loading

  • Use usecols: When loading from CSV or Excel, use csv_options: {"usecols": ["Column1", "Column2"]} to load only the necessary columns. This prevents string-based columns from being accidentally passed to the model.

  • Time Column Mapping: Always map the time_column if one exists. If it doesn’t, sktime will default to a RangeIndex, which is often safer for simple sequences.

3. Frequency Hurdles

  • Month-Start vs Month-End: If you encounter errors like <MonthBegin> is not supported as period frequency, it is likely due to statsmodels limitations. In such cases, consider dropping the timestamp index and using a simple integer index (RangeIndex) by not specifying a time_column.

4. Validation First

  • Status Check: Always inspect the validation object returned by load_data_source. If valid is false, read the errors list to understand why the data might break the model.

Troubleshooting

“No module named ‘sqlalchemy’”

pip install "sktime-mcp[sql]"

“No module named ‘openpyxl’” (for Excel files)

pip install "sktime-mcp[files]"

“No module named ‘pyarrow’” (for Parquet files)

pip install "sktime-mcp[files]"

“Could not infer frequency”

Specify frequency explicitly in config:

config = {
    ...
    "frequency": "D"  # Daily, "W" for weekly, "M" for monthly, etc.
}

“Index must be DatetimeIndex”

Ensure your time column is specified correctly:

config = {
    ...
    "time_column": "date",  # Name of your date/time column
    "parse_dates": True
}