Example usage
To use simplefit in a project, import the package as follows:
Imports
import simplefit
from simplefit.cleaner import cleaner
from simplefit.regressor import regressor
from simplefit.classifier import classifier
from simplefit.eda import plot_distributions, plot_corr, plot_splom
import warnings
import pandas as pd
warnings.filterwarnings('ignore')
import altair as alt
alt.renderers.enable('html') # render plot on html
RendererRegistry.enable('html')
print(simplefit.__version__)
0.1.5
# Specific altair rendering
# alt.renderers.enable('notebook') # to render plot on jupyter notebook
# alt.renderers.enable('mimetype') # to render plot on github
# to enable altair to plot graphs if your dataset has more than 5000 rows, try any of these:
# alt.data_transformers.enable("data_server")
# OR
# from altair import pipe, limit_rows, to_values
# t = lambda data: pipe(data, limit_rows(max_rows=1000000), to_values)
# alt.data_transformers.register('custom', t)
# alt.data_transformers.enable('custom')
Sample Data
We will be using the SpotifyFeatures.csv data as an example.
df = pd.read_csv("../tests/data/SpotifyFeatures.csv")
Clean data
Cleans the loaded dataset: removes rows with NA values, strips extra white space, etc., and returns a clean dataframe. The cleaner function is imported from the module simplefit.cleaner.
clean_df = cleaner(df,lower_case=False)
clean_df
| | genre | artist_name | track_name | track_id | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Movie | Henri Salvador | C'est beau de faire un Show | 0BRjO6ga9RKCKjfDqeFgWV | 0 | 0.61100 | 0.389 | 99373 | 0.910 | 0.000000 | C# | 0.3460 | -1.828 | Major | 0.0525 | 166.969 | 4/4 | 0.814 |
| 1 | Movie | Martin & les fées | Perdu d'avance (par Gad Elmaleh) | 0BjC1NfoEOOusryehmNudP | 1 | 0.24600 | 0.590 | 137373 | 0.737 | 0.000000 | F# | 0.1510 | -5.559 | Minor | 0.0868 | 174.003 | 4/4 | 0.816 |
| 2 | Movie | Joseph Williams | Don't Let Me Be Lonely Tonight | 0CoSDzoNIKCRs124s9uTVy | 3 | 0.95200 | 0.663 | 170267 | 0.131 | 0.000000 | C | 0.1030 | -13.879 | Minor | 0.0362 | 99.488 | 5/4 | 0.368 |
| 3 | Movie | Henri Salvador | Dis-moi Monsieur Gordon Cooper | 0Gc6TVm52BwZD07Ki6tIvf | 0 | 0.70300 | 0.240 | 152427 | 0.326 | 0.000000 | C# | 0.0985 | -12.178 | Major | 0.0395 | 171.758 | 4/4 | 0.227 |
| 4 | Movie | Fabien Nataf | Ouverture | 0IuslXpMROHdEPvSl1fTQK | 4 | 0.95000 | 0.331 | 82625 | 0.225 | 0.123000 | F | 0.2020 | -21.150 | Major | 0.0456 | 140.576 | 4/4 | 0.390 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 232720 | Soul | Slave | Son Of Slide | 2XGLdVl7lGeq8ksM6Al7jT | 39 | 0.00384 | 0.687 | 326240 | 0.714 | 0.544000 | D | 0.0845 | -10.626 | Major | 0.0316 | 115.542 | 4/4 | 0.962 |
| 232721 | Soul | Jr Thomas & The Volcanos | Burning Fire | 1qWZdkBl4UVPj9lK6HuuFM | 38 | 0.03290 | 0.785 | 282447 | 0.683 | 0.000880 | E | 0.2370 | -6.944 | Minor | 0.0337 | 113.830 | 4/4 | 0.969 |
| 232722 | Soul | Muddy Waters | (I'm Your) Hoochie Coochie Man | 2ziWXUmQLrXTiYjCg2fZ2t | 47 | 0.90100 | 0.517 | 166960 | 0.419 | 0.000000 | D | 0.0945 | -8.282 | Major | 0.1480 | 84.135 | 4/4 | 0.813 |
| 232723 | Soul | R.LUM.R | With My Words | 6EFsue2YbIG4Qkq8Zr9Rir | 44 | 0.26200 | 0.745 | 222442 | 0.704 | 0.000000 | A | 0.3330 | -7.137 | Major | 0.1460 | 100.031 | 4/4 | 0.489 |
| 232724 | Soul | Mint Condition | You Don't Have To Hurt No More | 34XO9RwPMKjbvRry54QzWn | 35 | 0.09730 | 0.758 | 323027 | 0.470 | 0.000049 | G# | 0.0836 | -6.708 | Minor | 0.0287 | 113.897 | 4/4 | 0.479 |
232725 rows × 18 columns
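The cleaning steps described above can be approximated in plain pandas. The sketch below is not simplefit's implementation; `basic_clean` is a hypothetical helper and the three-row frame is toy data, used only to show what dropping NA rows, stripping white space, and optional lower-casing look like.

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def basic_clean(df: pd.DataFrame, lower_case: bool = False) -> pd.DataFrame:
    """Hypothetical stand-in for the kind of cleaning described above."""
    # Drop rows that contain any NA value and renumber the index.
    out = df.dropna().reset_index(drop=True)
    # Trim stray white space from the column names.
    out.columns = [str(c).strip() for c in out.columns]
    for col in out.columns:
        if is_string_dtype(out[col]):
            # Trim surrounding white space in text columns.
            out[col] = out[col].str.strip()
            if lower_case:
                out[col] = out[col].str.lower()
    return out


# Toy data (not from SpotifyFeatures.csv).
raw = pd.DataFrame({"genre": [" Movie ", "Soul", None], "popularity": [0, 39, 12]})
cleaned = basic_clean(raw, lower_case=True)
print(cleaned)
```

The row with the missing genre is dropped and the remaining text values are trimmed and lower-cased; numeric columns pass through unchanged.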
Plot Distributions
Creates distribution plots for either all the numeric columns or only the ones passed to it. The plot_distributions function is imported from the module simplefit.eda.
clean_df = clean_df[:5000]
plot_distributions(clean_df, bins = 40, dist_cols=['danceability', 'duration_ms', 'energy'])
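A distribution plot with a `bins` setting is built from per-column bin counts. The sketch below computes those counts with NumPy on toy data. `bin_counts` is a hypothetical helper, and its fallback to every numeric column when no columns are given mirrors the description above rather than simplefit's actual code.

```python
import numpy as np
import pandas as pd


def bin_counts(df, cols=None, bins=40):
    """Hypothetical helper: {column: (counts, bin_edges)} for numeric columns."""
    if cols is None:
        # Assumption: with no columns given, use every numeric column.
        cols = df.select_dtypes(include="number").columns.tolist()
    return {c: np.histogram(df[c].dropna(), bins=bins) for c in cols}


# Toy data (not from SpotifyFeatures.csv).
demo = pd.DataFrame({"energy": [0.1, 0.2, 0.2, 0.9], "genre": ["a", "b", "c", "d"]})
hist = bin_counts(demo, bins=4)
counts, edges = hist["energy"]
print(counts, edges)
```

The non-numeric `genre` column is skipped, and `bins` equal-width intervals always produce `bins + 1` edges.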
Plot Correlation
Creates a correlation plot for the columns in the dataframe, here using Spearman correlation.
plot_corr(df, corr='spearman')
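The numbers behind a Spearman correlation plot can be inspected directly with pandas. This is a minimal sketch on toy values, assuming only numeric columns take part in the correlation; it does not reproduce how plot_corr prepares its data.

```python
import pandas as pd

# Toy values chosen so the rank relationships are obvious.
demo = pd.DataFrame({
    "energy": [0.1, 0.4, 0.6, 0.9],
    "loudness": [-20.0, -12.0, -8.0, -3.0],
    "acousticness": [0.95, 0.70, 0.30, 0.05],
    "genre": ["Movie", "Movie", "Soul", "Soul"],
})

# Keep numeric columns only, then compute rank-based (Spearman) correlation.
corr = demo.select_dtypes(include="number").corr(method="spearman")
print(corr)
```

Because Spearman works on ranks, any strictly increasing relationship scores 1 and any strictly decreasing one scores -1, even when the relationship is not linear.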
Plot SPLOM
Creates a scatter plot matrix (SPLOM) for all the numeric columns in the dataframe or only the ones passed by the user.
plot_splom(clean_df, pair_cols=["energy", "acousticness"])
Fit Regressor
Preprocesses the data, fits a baseline model (DummyRegressor) alongside Ridge, RidgeCV, and linear regression with default settings, and returns the model scores as a dataframe.
regressor(clean_df, target_col = 'popularity', numeric_feats = ['danceability', 'loudness'], categorical_feats=['genre'], cv=10)
| | DummyRegressor | Ridge | RidgeCV | linearRegression |
|---|---|---|---|---|
| fit_time | 0.000987 | 0.013444 | 0.111921 | 0.015127 |
| score_time | 0.000308 | 0.006379 | 0.006418 | 0.006230 |
| test_score | -2.980655 | -0.006898 | -0.000396 | -0.007836 |
| train_score | 0.000000 | 0.898819 | 0.897763 | 0.898831 |
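The row labels in the table (fit_time, score_time, test_score, train_score) match what scikit-learn's `cross_validate` returns, so a comparable table can be sketched with scikit-learn directly. This is an assumption about the general approach, not simplefit's actual code: the preprocessing choices (StandardScaler, OneHotEncoder) and the synthetic data are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with a clear linear signal (not the Spotify dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "loudness": rng.uniform(-30, 0, n),
    "genre": rng.choice(["Movie", "Soul"], n),
})
toy["popularity"] = 50 * toy["danceability"] + rng.normal(0, 5, n)

# Assumed preprocessing: scale numeric features, one-hot encode categoricals.
preprocess = make_column_transformer(
    (StandardScaler(), ["danceability", "loudness"]),
    (OneHotEncoder(handle_unknown="ignore"), ["genre"]),
)
X, y = toy.drop(columns="popularity"), toy["popularity"]

scores = {}
for name, model in {"DummyRegressor": DummyRegressor(), "Ridge": Ridge()}.items():
    cv = cross_validate(
        make_pipeline(preprocess, model), X, y, cv=10, return_train_score=True
    )
    # Average each metric over the 10 folds.
    scores[name] = pd.DataFrame(cv).mean()

results = pd.DataFrame(scores)
print(results)
```

Comparing each model's test_score against the DummyRegressor column shows how much it improves on simply predicting the mean, and a large gap between train_score and test_score is a sign of overfitting.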
Fit Classifier
Preprocesses the data, fits a baseline model (DummyClassifier) and LogisticRegression with default settings, and returns the model scores as a dataframe.
classification_df = pd.read_csv("../tests/data/adult.csv")
clean_df = cleaner(classification_df,lower_case=True)
First, the classifier is called with all the inputs:
classifier(clean_df, target_col = 'income', numeric_feats = ['age', 'fnlwgt'], categorical_feats=['occupation'], cv=10)
| | DummyClassifier | LogisticRegression |
|---|---|---|
| fit_time | 0.013652 | 0.275881 |
| score_time | 0.004547 | 0.008931 |
| test_score | 0.759190 | 0.768588 |
| train_score | 0.759190 | 0.768912 |
When no numeric features are passed (an empty list), the classifier trains on all numeric features:
classifier(clean_df, target_col = 'income', numeric_feats = [], categorical_feats=['occupation'], cv=10)
| | DummyClassifier | LogisticRegression |
|---|---|---|
| fit_time | 0.013802 | 0.573708 |
| score_time | 0.004554 | 0.017578 |
| test_score | 0.759190 | 0.792304 |
| train_score | 0.759190 | 0.821914 |
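The fallback described above, where an empty `numeric_feats` list means "use every numeric column", can be sketched as follows. `resolve_numeric_feats` is a hypothetical helper written for illustration, and excluding the target column is an assumption; simplefit's internal logic may differ.

```python
import pandas as pd


def resolve_numeric_feats(df, target_col, numeric_feats):
    """Hypothetical helper: fall back to every numeric column except the target."""
    if numeric_feats:
        return list(numeric_feats)
    numeric_cols = df.select_dtypes(include="number").columns
    return [c for c in numeric_cols if c != target_col]


# Toy frame with column names borrowed from the example above.
toy = pd.DataFrame({
    "age": [39, 50],
    "fnlwgt": [77516, 83311],
    "occupation": ["sales", "tech"],
    "income": ["<=50k", ">50k"],
})

print(resolve_numeric_feats(toy, "income", []))
print(resolve_numeric_feats(toy, "income", ["age"]))
```

Training on more numeric columns is consistent with the higher LogisticRegression scores in the second table compared with the first.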