Example usage

To use simplefit in a project, import the package and its functions as follows:

Imports

import simplefit
from simplefit.cleaner import cleaner
from simplefit.regressor import regressor
from simplefit.classifier import classifier
from simplefit.eda import plot_distributions, plot_corr, plot_splom

import warnings
import pandas as pd
warnings.filterwarnings('ignore')
import altair as alt

alt.renderers.enable('html')  # render plot on html

RendererRegistry.enable('html')

print(simplefit.__version__)

0.1.5

# Specific altair rendering 
# alt.renderers.enable('notebook')   # to render plot on jupyter notebook
# alt.renderers.enable('mimetype')   # to render plot on github

# Altair refuses to plot datasets with more than 5000 rows by default; to lift that limit, try one of these:
# alt.data_transformers.enable("data_server")

# OR

# from altair import pipe, limit_rows, to_values
# t = lambda data: pipe(data, limit_rows(max_rows=1000000), to_values)
# alt.data_transformers.register('custom', t)
# alt.data_transformers.enable('custom')

Sample Data

We will be using the SpotifyFeatures.csv data as an example.

df = pd.read_csv("../tests/data/SpotifyFeatures.csv")

Clean data

The cleaner function, imported from the simplefit.cleaner module, cleans the loaded dataset: it removes rows with NA values, strips extra whitespace, and so on, then returns a clean dataframe. The lower_case argument is set to False in this example.

clean_df = cleaner(df,lower_case=False)

clean_df

	genre	artist_name	track_name	track_id	popularity	acousticness	danceability	duration_ms	energy	instrumentalness	key	liveness	loudness	mode	speechiness	tempo	time_signature	valence
0	Movie	Henri Salvador	C'est beau de faire un Show	0BRjO6ga9RKCKjfDqeFgWV	0	0.61100	0.389	99373	0.910	0.000000	C#	0.3460	-1.828	Major	0.0525	166.969	4/4	0.814
1	Movie	Martin & les fées	Perdu d'avance (par Gad Elmaleh)	0BjC1NfoEOOusryehmNudP	1	0.24600	0.590	137373	0.737	0.000000	F#	0.1510	-5.559	Minor	0.0868	174.003	4/4	0.816
2	Movie	Joseph Williams	Don't Let Me Be Lonely Tonight	0CoSDzoNIKCRs124s9uTVy	3	0.95200	0.663	170267	0.131	0.000000	C	0.1030	-13.879	Minor	0.0362	99.488	5/4	0.368
3	Movie	Henri Salvador	Dis-moi Monsieur Gordon Cooper	0Gc6TVm52BwZD07Ki6tIvf	0	0.70300	0.240	152427	0.326	0.000000	C#	0.0985	-12.178	Major	0.0395	171.758	4/4	0.227
4	Movie	Fabien Nataf	Ouverture	0IuslXpMROHdEPvSl1fTQK	4	0.95000	0.331	82625	0.225	0.123000	F	0.2020	-21.150	Major	0.0456	140.576	4/4	0.390
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
232720	Soul	Slave	Son Of Slide	2XGLdVl7lGeq8ksM6Al7jT	39	0.00384	0.687	326240	0.714	0.544000	D	0.0845	-10.626	Major	0.0316	115.542	4/4	0.962
232721	Soul	Jr Thomas & The Volcanos	Burning Fire	1qWZdkBl4UVPj9lK6HuuFM	38	0.03290	0.785	282447	0.683	0.000880	E	0.2370	-6.944	Minor	0.0337	113.830	4/4	0.969
232722	Soul	Muddy Waters	(I'm Your) Hoochie Coochie Man	2ziWXUmQLrXTiYjCg2fZ2t	47	0.90100	0.517	166960	0.419	0.000000	D	0.0945	-8.282	Major	0.1480	84.135	4/4	0.813
232723	Soul	R.LUM.R	With My Words	6EFsue2YbIG4Qkq8Zr9Rir	44	0.26200	0.745	222442	0.704	0.000000	A	0.3330	-7.137	Major	0.1460	100.031	4/4	0.489
232724	Soul	Mint Condition	You Don't Have To Hurt No More	34XO9RwPMKjbvRry54QzWn	35	0.09730	0.758	323027	0.470	0.000049	G#	0.0836	-6.708	Minor	0.0287	113.897	4/4	0.479

232725 rows × 18 columns
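To make the cleaning steps described above concrete, here is a minimal pandas sketch of the same idea. It is not simplefit's implementation: the function name sketch_cleaner is hypothetical, and the exact behaviour (which rows are dropped, how whitespace is handled, what lower_case affects) is an assumption based only on the description above.

```python
import pandas as pd


def sketch_cleaner(df: pd.DataFrame, lower_case: bool = False) -> pd.DataFrame:
    """Hypothetical stand-in for a cleaning step; NOT simplefit's actual code."""
    # Drop every row that contains at least one missing value.
    out = df.dropna().reset_index(drop=True)
    for col in out.columns:
        # Only text columns get whitespace (and optional case) handling.
        if pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].str.strip()
            if lower_case:
                out[col] = out[col].str.lower()
    return out


# Toy data for illustration only (not taken from SpotifyFeatures.csv).
raw = pd.DataFrame(
    {
        "genre": ["  Movie ", "Soul", None],
        "artist_name": ["Henri Salvador ", " Slave", "Unknown"],
        "popularity": [0, 39, 12],
    }
)
cleaned = sketch_cleaner(raw, lower_case=True)
print(cleaned)
```

The third toy row is dropped because its genre is missing, and the remaining text values come back trimmed and lower-cased.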

Plot Distributions

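The plot_distributions function, imported from the simplefit.eda module, creates distribution plots for either all numeric columns or only the columns passed in dist_cols. The dataframe is first cut to its first 5000 rows, which keeps it within the Altair row limit mentioned in the imports section.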

clean_df = clean_df[:5000]
plot_distributions(clean_df, bins = 40, dist_cols=['danceability', 'duration_ms', 'energy'])
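The slice above keeps the first 5000 rows. The preview of clean_df suggests the file is ordered by genre, so a random sample may give a more representative picture of the distributions. A minimal sketch with plain pandas follows; the helper name sample_rows and the seed are illustrative choices, not part of simplefit.

```python
import pandas as pd


def sample_rows(df: pd.DataFrame, n: int = 5000, seed: int = 42) -> pd.DataFrame:
    """Hypothetical helper: return a random subset of at most n rows."""
    # Never ask for more rows than exist, otherwise DataFrame.sample raises.
    n = min(n, len(df))
    # A fixed seed keeps the subset reproducible between runs.
    return df.sample(n=n, random_state=seed).reset_index(drop=True)


# Toy frame standing in for clean_df.
demo = pd.DataFrame({"energy": range(12000)})
subset = sample_rows(demo)
print(len(subset))
```

The result can be passed to plot_distributions in place of the sliced dataframe.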

Plot Correlation

Creates a correlation plot for all the columns in the dataframe. The corr argument selects the correlation method, here Spearman.

plot_corr(df, corr='spearman')
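For readers unfamiliar with the Spearman method selected above: it is the Pearson correlation of the ranked values, so it captures monotonic relationships even when they are nonlinear. The sketch below shows this with pandas only. The function spearman_matrix is hypothetical, and restricting it to numeric columns is an assumption; it says nothing about how plot_corr computes its values internally.

```python
import pandas as pd


def spearman_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlation computed as the Pearson correlation of ranks."""
    numeric = df.select_dtypes(include="number")
    # rank() gives tied values their average rank, the usual Spearman convention.
    return numeric.rank().corr(method="pearson")


# Toy data: x_cubed is monotonic but nonlinear in x, x_reversed decreases with x.
toy = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "x_cubed": [1.0, 8.0, 27.0, 64.0, 125.0],
        "x_reversed": [5.0, 4.0, 3.0, 2.0, 1.0],
    }
)
matrix = spearman_matrix(toy)
print(matrix.round(3))
```

In this toy example x and x_cubed are perfectly rank-correlated, while x and x_reversed are perfectly anti-correlated.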

Plot SPLOM

Creates a scatter plot matrix (SPLOM) for all the numeric columns in the dataframe, or only for the columns passed in pair_cols.

plot_splom(clean_df, pair_cols=["energy", "acousticness"])

Fit Regressor

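Preprocesses the data, fits a baseline model (DummyRegressor) and linear models with default settings, and returns the model scores as a dataframe. The output below has one column per model: DummyRegressor, Ridge, RidgeCV, and linearRegression.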

regressor(clean_df, target_col = 'popularity', numeric_feats = ['danceability', 'loudness'], categorical_feats=['genre'], cv=10)

	DummyRegressor	Ridge	RidgeCV	linearRegression
fit_time	0.000987	0.013444	0.111921	0.015127
score_time	0.000308	0.006379	0.006418	0.006230
test_score	-2.980655	-0.006898	-0.000396	-0.007836
train_score	0.000000	0.898819	0.897763	0.898831
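The row labels in the table above (fit_time, score_time, test_score, train_score) match the keys returned by scikit-learn's cross_validate, so the following sketch uses that function to build a similar comparison. It is an approximation under stated assumptions, not simplefit's code: the function name sketch_regressor, the choice of StandardScaler and OneHotEncoder for preprocessing, and the use of the default scorer (R² for scikit-learn regressors) are all guesses, and only two of the four models are included.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def sketch_regressor(df, target_col, numeric_feats, categorical_feats, cv=5):
    """Hypothetical baseline-vs-Ridge comparison; NOT simplefit's actual code."""
    X = df[numeric_feats + categorical_feats]
    y = df[target_col]
    # Assumed preprocessing: scale numeric columns, one-hot encode categorical ones.
    preprocessor = make_column_transformer(
        (StandardScaler(), numeric_feats),
        (OneHotEncoder(handle_unknown="ignore"), categorical_feats),
    )
    models = {"DummyRegressor": DummyRegressor(), "Ridge": Ridge()}
    scores = {}
    for name, model in models.items():
        pipe = make_pipeline(preprocessor, model)
        result = cross_validate(pipe, X, y, cv=cv, return_train_score=True)
        # Average fit_time, score_time, test_score and train_score over the folds.
        scores[name] = pd.DataFrame(result).mean()
    return pd.DataFrame(scores)


# Synthetic data for illustration only; column names mirror the example above.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "danceability": rng.random(200),
        "loudness": rng.normal(-8, 3, 200),
        "genre": rng.choice(["Movie", "Soul"], 200),
    }
)
toy["popularity"] = 50 * toy["danceability"] + rng.normal(0, 1, 200)
scores_df = sketch_regressor(toy, "popularity", ["danceability", "loudness"], ["genre"])
print(scores_df)
```

With the default R² scorer, a DummyRegressor that predicts the training mean scores 0 on the training folds, which is consistent with the train_score of 0.000000 in the table above.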

Fit Classifier

Preprocesses the data, fits a baseline model (DummyClassifier) and LogisticRegression with default settings, and returns the model scores as a dataframe.

classification_df = pd.read_csv("../tests/data/adult.csv")
clean_df = cleaner(classification_df,lower_case=True)

First, the classifier is called with all inputs specified:

classifier(clean_df, target_col = 'income', numeric_feats = ['age', 'fnlwgt'], categorical_feats=['occupation'], cv=10)

	DummyClassifier	LogisticRegression
fit_time	0.013652	0.275881
score_time	0.004547	0.008931
test_score	0.759190	0.768588
train_score	0.759190	0.768912

When numeric_feats is an empty list, the classifier trains on all numeric features in the dataframe:

classifier(clean_df, target_col = 'income', numeric_feats = [], categorical_feats=['occupation'], cv=10)

	DummyClassifier	LogisticRegression
fit_time	0.013802	0.573708
score_time	0.004554	0.017578
test_score	0.759190	0.792304
train_score	0.759190	0.821914
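The fallback described above (an empty numeric_feats list meaning "use every numeric column") can be sketched with pandas as follows. The helper resolve_numeric_feats is hypothetical, the toy column names are illustrative, and excluding the target column is an assumption; how simplefit resolves the list internally is not shown in this document.

```python
import pandas as pd


def resolve_numeric_feats(df, target_col, numeric_feats):
    """Hypothetical helper: fall back to all numeric columns when none are given."""
    if numeric_feats:
        return list(numeric_feats)
    numeric_cols = df.select_dtypes(include="number").columns
    # Assumption: the target is never used as a feature.
    return [col for col in numeric_cols if col != target_col]


# Toy rows for illustration only (not read from adult.csv).
toy = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "fnlwgt": [77516, 83311, 215646],
        "hours_per_week": [40, 13, 40],
        "occupation": ["adm-clerical", "exec-managerial", "handlers-cleaners"],
        "income": ["<=50k", "<=50k", ">50k"],
    }
)
all_numeric = resolve_numeric_feats(toy, "income", [])
explicit = resolve_numeric_feats(toy, "income", ["age"])
print(all_numeric)
print(explicit)
```

An empty list selects every numeric column, while an explicit list is returned unchanged.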