Example usage

To use simplefit in a project, import the package as follows:

Imports

import simplefit
from simplefit.cleaner import cleaner
from simplefit.regressor import regressor
from simplefit.classifier import classifier
from simplefit.eda import plot_distributions, plot_corr, plot_splom

import warnings
import pandas as pd
warnings.filterwarnings('ignore')
import altair as alt

alt.renderers.enable('html')  # render plot on html
RendererRegistry.enable('html')
print(simplefit.__version__)
0.1.5
# Other Altair renderers
# alt.renderers.enable('notebook')   # to render plots in a Jupyter notebook
# alt.renderers.enable('mimetype')   # to render plots on GitHub

# to enable altair to plot graphs if your dataset has more than 5000 rows, try any of these:
# alt.data_transformers.enable("data_server")

# OR

# from altair import pipe, limit_rows, to_values
# t = lambda data: pipe(data, limit_rows(max_rows=1000000), to_values)
# alt.data_transformers.register('custom', t)
# alt.data_transformers.enable('custom')

Sample Data

We will be using the SpotifyFeatures.csv data as an example.

df = pd.read_csv("../tests/data/SpotifyFeatures.csv")

Clean data

Cleans the loaded dataset: removes rows with NA values, strips extra whitespace, among other steps, and returns a clean dataframe. The cleaner function is imported from the module simplefit.cleaner.

clean_df = cleaner(df,lower_case=False)
clean_df
genre artist_name track_name track_id popularity acousticness danceability duration_ms energy instrumentalness key liveness loudness mode speechiness tempo time_signature valence
0 Movie Henri Salvador C'est beau de faire un Show 0BRjO6ga9RKCKjfDqeFgWV 0 0.61100 0.389 99373 0.910 0.000000 C# 0.3460 -1.828 Major 0.0525 166.969 4/4 0.814
1 Movie Martin & les fées Perdu d'avance (par Gad Elmaleh) 0BjC1NfoEOOusryehmNudP 1 0.24600 0.590 137373 0.737 0.000000 F# 0.1510 -5.559 Minor 0.0868 174.003 4/4 0.816
2 Movie Joseph Williams Don't Let Me Be Lonely Tonight 0CoSDzoNIKCRs124s9uTVy 3 0.95200 0.663 170267 0.131 0.000000 C 0.1030 -13.879 Minor 0.0362 99.488 5/4 0.368
3 Movie Henri Salvador Dis-moi Monsieur Gordon Cooper 0Gc6TVm52BwZD07Ki6tIvf 0 0.70300 0.240 152427 0.326 0.000000 C# 0.0985 -12.178 Major 0.0395 171.758 4/4 0.227
4 Movie Fabien Nataf Ouverture 0IuslXpMROHdEPvSl1fTQK 4 0.95000 0.331 82625 0.225 0.123000 F 0.2020 -21.150 Major 0.0456 140.576 4/4 0.390
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
232720 Soul Slave Son Of Slide 2XGLdVl7lGeq8ksM6Al7jT 39 0.00384 0.687 326240 0.714 0.544000 D 0.0845 -10.626 Major 0.0316 115.542 4/4 0.962
232721 Soul Jr Thomas & The Volcanos Burning Fire 1qWZdkBl4UVPj9lK6HuuFM 38 0.03290 0.785 282447 0.683 0.000880 E 0.2370 -6.944 Minor 0.0337 113.830 4/4 0.969
232722 Soul Muddy Waters (I'm Your) Hoochie Coochie Man 2ziWXUmQLrXTiYjCg2fZ2t 47 0.90100 0.517 166960 0.419 0.000000 D 0.0945 -8.282 Major 0.1480 84.135 4/4 0.813
232723 Soul R.LUM.R With My Words 6EFsue2YbIG4Qkq8Zr9Rir 44 0.26200 0.745 222442 0.704 0.000000 A 0.3330 -7.137 Major 0.1460 100.031 4/4 0.489
232724 Soul Mint Condition You Don't Have To Hurt No More 34XO9RwPMKjbvRry54QzWn 35 0.09730 0.758 323027 0.470 0.000049 G# 0.0836 -6.708 Minor 0.0287 113.897 4/4 0.479

232725 rows × 18 columns
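To make the cleaning steps concrete, here is a minimal pandas sketch of the kind of work described above (dropping NA rows, stripping whitespace, optional lowercasing). It is not simplefit's implementation: the function name clean_sketch, the toy data, and the assumption that lower_case lowercases text columns are all illustrative.

```python
import pandas as pd


def clean_sketch(data, lower_case=False):
    """Illustrative cleaning routine; NOT simplefit's actual cleaner."""
    # Drop rows that contain any missing value and renumber the index.
    cleaned = data.dropna().reset_index(drop=True)
    for col in cleaned.columns:
        # Only text columns are stripped (and optionally lowercased).
        if pd.api.types.is_string_dtype(cleaned[col]):
            cleaned[col] = cleaned[col].str.strip()
            if lower_case:
                cleaned[col] = cleaned[col].str.lower()
    return cleaned


# Hypothetical toy data with stray whitespace and one missing value.
raw = pd.DataFrame(
    {
        "genre": ["  Movie", "Soul ", None],
        "popularity": [0, 39, 12],
    }
)
tidy = clean_sketch(raw, lower_case=True)
print(tidy)
```

The row with the missing genre is dropped, and the remaining genre values lose their surrounding spaces.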

Plot Distributions

Creates distribution plots for either all the numeric columns or only those passed in dist_cols. The plot_distributions function is imported from the module simplefit.eda.

clean_df = clean_df[:5000]
plot_distributions(clean_df, bins = 40, dist_cols=['danceability', 'duration_ms', 'energy'])
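The slice clean_df[:5000] keeps only the first 5000 rows, which keeps the data under Altair's default row limit. If the rows are grouped by a column (the table above suggests the file may be ordered by genre, though that is an assumption), a random sample may be more representative. The sketch below uses a hypothetical stand-in dataframe, big_df, to compare the two approaches.

```python
import pandas as pd

# Hypothetical stand-in for a large dataframe whose rows are grouped by genre.
big_df = pd.DataFrame(
    {
        "genre": ["Movie"] * 6000 + ["Soul"] * 6000,
        "energy": range(12000),
    }
)

# Taking the first 5000 rows only ever sees the first group.
head_df = big_df[:5000]

# A seeded random sample of the same size draws from the whole frame.
sample_df = big_df.sample(n=5000, random_state=42)

print(head_df["genre"].nunique())
print(sample_df["genre"].nunique())
```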

Plot Correlation

Creates a correlation plot for all the columns in the dataframe.

plot_corr(df, corr='spearman')
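For readers unfamiliar with the corr='spearman' option: Spearman correlation is Pearson correlation computed on ranks, so it measures monotonic rather than strictly linear association. The sketch below illustrates this with plain pandas on a toy dataframe; it does not show how plot_corr computes its values internally.

```python
import pandas as pd

# Toy data: y grows monotonically with x, but not linearly.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Spearman is equivalent to Pearson applied to the ranked values.
rank_based = toy.rank().corr(method="pearson").loc["x", "y"]

print(pearson, spearman, rank_based)
```

Here the Spearman coefficient is a perfect 1 because the ordering of x and y agrees exactly, while the Pearson coefficient falls below 1 because the relationship is curved.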

Plot SPLOM

Creates a SPLOM (scatter plot matrix) for all the numeric columns in the dataframe or only those passed in pair_cols.

plot_splom(clean_df, pair_cols=["energy", "acousticness"])

Fit Regressor

Preprocesses the data, fits a baseline model (DummyRegressor) alongside Ridge, RidgeCV and linear regression models with default settings, and returns the model scores as a dataframe.

regressor(clean_df, target_col = 'popularity', numeric_feats = ['danceability', 'loudness'], categorical_feats=['genre'], cv=10)
             DummyRegressor      Ridge    RidgeCV  linearRegression
fit_time           0.000987   0.013444   0.111921          0.015127
score_time         0.000308   0.006379   0.006418          0.006230
test_score        -2.980655  -0.006898  -0.000396         -0.007836
train_score        0.000000   0.898819   0.897763          0.898831
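The workflow described above (preprocess, fit a baseline and a Ridge model, tabulate cross-validation scores) can be sketched with scikit-learn. This is not simplefit's implementation: the function name baseline_regression_scores, the choice of StandardScaler and OneHotEncoder, and the toy data are assumptions. The row labels in the output above match the keys returned by scikit-learn's cross_validate, so the sketch assumes a similar call.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def baseline_regression_scores(data, target_col, numeric_feats, categorical_feats, cv=5):
    """Illustrative sketch; NOT simplefit's actual regressor."""
    X = data[numeric_feats + categorical_feats]
    y = data[target_col]
    # Assumed preprocessing: scale numeric columns, one-hot encode categoricals.
    preprocessor = make_column_transformer(
        (StandardScaler(), numeric_feats),
        (OneHotEncoder(handle_unknown="ignore"), categorical_feats),
    )
    models = {"DummyRegressor": DummyRegressor(), "Ridge": Ridge()}
    results = {}
    for name, model in models.items():
        pipe = make_pipeline(preprocessor, model)
        cv_scores = cross_validate(pipe, X, y, cv=cv, return_train_score=True)
        # Average each metric (fit_time, score_time, ...) across the folds.
        results[name] = pd.DataFrame(cv_scores).mean()
    return pd.DataFrame(results)


# Hypothetical toy data; column names are borrowed from the example above.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "danceability": rng.normal(size=200),
        "loudness": rng.normal(size=200),
        "genre": rng.choice(["Movie", "Soul"], size=200),
    }
)
toy["popularity"] = (
    3 * toy["danceability"] - 2 * toy["loudness"] + rng.normal(scale=0.1, size=200)
)

scores = baseline_regression_scores(
    toy, "popularity", ["danceability", "loudness"], ["genre"], cv=5
)
print(scores)
```

The scores are R² values, so the dummy baseline sits near zero on this toy data, and a useful model should score clearly above it on test_score.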

Fit Classifier

Preprocesses the data, fits a baseline model (DummyClassifier) and LogisticRegression with default settings, and returns the model scores as a dataframe.

classification_df = pd.read_csv("../tests/data/adult.csv")
clean_df = cleaner(classification_df,lower_case=True)

First, the classifier is called with all inputs specified:

classifier(clean_df, target_col = 'income', numeric_feats = ['age', 'fnlwgt'], categorical_feats=['occupation'], cv=10)
             DummyClassifier  LogisticRegression
fit_time            0.013652            0.275881
score_time          0.004547            0.008931
test_score          0.759190            0.768588
train_score         0.759190            0.768912

When numeric_feats is an empty list, the classifier trains on all numeric features in the dataframe:

classifier(clean_df, target_col = 'income', numeric_feats = [], categorical_feats=['occupation'], cv=10)
             DummyClassifier  LogisticRegression
fit_time            0.013802            0.573708
score_time          0.004554            0.017578
test_score          0.759190            0.792304
train_score         0.759190            0.821914
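The fallback to all numeric features when numeric_feats is empty can be sketched with pandas. The helper resolve_numeric_feats and the toy dataframe are hypothetical; how simplefit selects the columns internally, and whether it excludes the target this way, is an assumption.

```python
import pandas as pd


def resolve_numeric_feats(data, target_col, numeric_feats):
    """Illustrative helper; NOT part of simplefit's API."""
    # An explicit, non-empty list is used as given.
    if numeric_feats:
        return list(numeric_feats)
    # Otherwise fall back to every numeric column except the target.
    numeric_cols = data.select_dtypes(include="number").columns
    return [col for col in numeric_cols if col != target_col]


# Hypothetical toy frame loosely shaped like the adult data.
adult_like = pd.DataFrame(
    {
        "age": [25, 41],
        "fnlwgt": [1000, 2000],
        "hours": [40, 35],
        "occupation": ["sales", "tech"],
        "income": ["<=50k", ">50k"],
    }
)

print(resolve_numeric_feats(adult_like, "income", []))
print(resolve_numeric_feats(adult_like, "income", ["age", "fnlwgt"]))
```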