Data Source Support
sktime-mcp now supports loading data from multiple sources beyond the built-in demo datasets!
Supported Data Sources
SQL Databases - PostgreSQL, MySQL, SQLite, MSSQL
Files - CSV, TSV, Excel, Parquet
Quick Start
1. CSV File
result = executor.load_data_source({
"type": "file",
"path": "/path/to/data.csv",
"time_column": "date",
"target_column": "sales",
"exog_columns": ["temperature", "promotion"], # Optional
})
2. SQL Database
# SQLite
result = executor.load_data_source({
"type": "sql",
"connection_string": "sqlite:///path/to/database.db",
"query": "SELECT date, sales FROM sales_table",
"time_column": "date",
"target_column": "sales",
})
# PostgreSQL
result = executor.load_data_source({
"type": "sql",
"connection_string": "postgresql://user:pass@host:5432/db",
"query": "SELECT * FROM sales WHERE date >= '2020-01-01'",
"time_column": "date",
"target_column": "sales",
})
MCP Tools
load_data_source
Load data from any supported source.
Arguments:
config(dict): Data source configuration
Returns:
success(bool): Whether loading succeededdata_handle(str): Handle to reference the loaded datametadata(dict): Information about the data (rows, columns, frequency, etc.)validation(dict): Data quality validation results
Example:
{
"config": {
"type": "pandas",
"data": {"date": [...], "value": [...]},
"time_column": "date",
"target_column": "value"
}
}
fit_predict (custom data)
After load_data_source, call fit_predict with data_handle set (and omit dataset, or pass an empty string). This is the same MCP tool used for demo datasets.
Arguments:
estimator_handle(str): Handle frominstantiate_estimatordata_handle(str): Handle fromload_data_sourcehorizon(int): Forecast horizon (default: 12)
Example:
{
"estimator_handle": "est_abc123",
"data_handle": "data_xyz789",
"horizon": 7
}
list_data_sources
List all available data source types.
Returns:
sources(list): Available source typesdescriptions(dict): Description for each source
list_available_data
List available data in a single response, including demo datasets and active handles.
Arguments (optional):
is_demo(bool):True→ only demo datasetsFalse→ only active data handlesomitted → both
Returns:
system_demos(list): Built-in demo dataset namesactive_handles(list): Active data handle informationtotal(int): Combined count of returned entries
release_data_handle
Release a data handle and free memory.
Arguments:
data_handle(str): Handle to release
Configuration Options
Pandas Adapter
{
"type": "pandas",
"data": df_or_dict, # DataFrame or dict
"time_column": "date", # Optional, will try to detect
"target_column": "sales", # Optional, defaults to first column
"exog_columns": ["temp", "promo"], # Optional
"frequency": "D" # Optional, will try to infer
}
SQL Adapter
{
"type": "sql",
# Option 1: Connection string
"connection_string": "postgresql://user:pass@host:5432/db",
# Option 2: Individual components
"dialect": "postgresql", # postgresql, mysql, sqlite, mssql
"host": "localhost",
"port": 5432,
"database": "mydb",
"username": "user",
"password": "pass",
# Query
"query": "SELECT * FROM sales", # Direct SQL query
# OR
"table": "sales", # Table name
"filters": {"region": "North"}, # Optional filters
# Column mapping
"time_column": "date",
"target_column": "sales",
"exog_columns": ["temperature"],
# Optional
"parse_dates": ["date"],
"frequency": "D"
}
File Adapter
{
"type": "file",
"path": "/path/to/data.csv",
"format": "csv", # csv, excel, parquet (auto-detected if not specified)
# Column mapping
"time_column": "date",
"target_column": "sales",
"exog_columns": ["temperature"],
# CSV-specific options
"csv_options": {
"sep": ",",
"header": 0,
"encoding": "utf-8"
},
# Excel-specific options
"excel_options": {
"sheet_name": 0,
"header": 0
},
# Common options
"parse_dates": True,
"frequency": "D"
}
Data Validation
All data sources are automatically validated for:
✅ DatetimeIndex presence
✅ No duplicate time indices
✅ Sufficient data points
✅ Missing value detection
✅ Frequency inference
Validation results are included in the response:
{
"valid": True,
"errors": [], # Critical issues that prevent usage
"warnings": ["Missing values detected: {'sales': 5.0}"] # Non-critical issues
}
Installation
Core (Pandas support included)
pip install -e .
With SQL support
pip install -e ".[sql]"
With file format support
pip install -e ".[files]"
With all optional dependencies
pip install -e ".[all]"
Examples
See the examples/ directory for complete working examples:
pandas_example.py- Loading from pandas DataFramescsv_example.py- Loading from CSV/TSV filessql_example.py- Loading from SQL databases (SQLite, PostgreSQL, MySQL)
Architecture
Data Source Layer
├── base.py # Abstract adapter interface
├── registry.py # Adapter registry
└── adapters/
├── pandas_adapter.py # In-memory DataFrames
├── sql_adapter.py # SQL databases
└── file_adapter.py # CSV, Excel, Parquet
Each adapter implements:
load()- Load data from sourcevalidate()- Check data qualityto_sktime_format()- Convert to (y, X) format
Metadata
Each loaded data source provides rich metadata:
{
"source": "sql",
"rows": 100,
"columns": ["sales", "temperature"],
"frequency": "D",
"start_date": "2020-01-01",
"end_date": "2020-04-09",
"missing_values": {"sales": 0, "temperature": 2}
}
Error Handling
All operations return structured error information:
{
"success": False,
"error": "Could not convert index to datetime",
"error_type": "ValueError",
"validation": {
"valid": False,
"errors": ["Index must be DatetimeIndex"]
}
}
💡 LLM Guidelines & Best Practices
When using these tools to handle user data, follow these best practices to avoid common execution errors:
1. Column Identification
Target vs. Features: Identify which column is the target (
y) and which are exogenous features (X).Numeric Only: Ensure the target column contains only numeric data. Forecasting models will fail if string/categorical columns are accidentally included in
y.User Confirmation: If the dataset has multiple columns and the user hasn’t specified which one to forecast, ask for clarification instead of guessing.
2. Strategic Loading
Use
usecols: When loading from CSV or Excel, usecsv_options: {"usecols": ["Column1", "Column2"]}to load only the necessary columns. This prevents string-based columns from being accidentally passed to the model.Time Column Mapping: Always map the
time_columnif one exists. If it doesn’t, sktime will default to aRangeIndex, which is often safer for simple sequences.
3. Frequency Hurdles
Month-Start vs Month-End: If you encounter errors like
<MonthBegin> is not supported as period frequency, it is likely due tostatsmodelslimitations. In such cases, consider dropping the timestamp index and using a simple integer index (RangeIndex) by not specifying atime_column.
4. Validation First
Status Check: Always inspect the
validationobject returned byload_data_source. Ifvalidisfalse, read theerrorslist to understand why the data might break the model.
Troubleshooting
“No module named ‘sqlalchemy’”
pip install "sktime-mcp[sql]"
“No module named ‘openpyxl’” (for Excel files)
pip install "sktime-mcp[files]"
“No module named ‘pyarrow’” (for Parquet files)
pip install "sktime-mcp[files]"
“Could not infer frequency”
Specify frequency explicitly in config:
config = {
...
"frequency": "D" # Daily, "W" for weekly, "M" for monthly, etc.
}
“Index must be DatetimeIndex”
Ensure your time column is specified correctly:
config = {
...
"time_column": "date", # Name of your date/time column
"parse_dates": True
}