Xente Fraud Classification Challenge on Zindi

How a data scientist used datasist to quickly prototype a solution in a data science competition

Here is a link to the competition page, where you can get the dataset and better understand the problem.

Data Overview

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
#Import the Datasist Library
import datasist as ds


#Read in data set
train_data = pd.read_csv('training.csv')
test_data = pd.read_csv('test.csv')

Get a quick summary of the data set using the describe function in the structdata module:

ds.structdata.describe(train_data)

First five data points

| | TransactionId | BatchId | AccountId | SubscriptionId | CustomerId | CurrencyCode | CountryCode | ProviderId | ProductId | ProductCategory | ChannelId | Amount | Value | TransactionStartTime | PricingStrategy | FraudResult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TransactionId_76871 | BatchId_36123 | AccountId_3957 | SubscriptionId_887 | CustomerId_4406 | UGX | 256 | ProviderId_6 | ProductId_10 | airtime | ChannelId_3 | 1000.0 | 1000 | 2018-11-15T02:18:49Z | 2 | 0 |
| 1 | TransactionId_73770 | BatchId_15642 | AccountId_4841 | SubscriptionId_3829 | CustomerId_4406 | UGX | 256 | ProviderId_4 | ProductId_6 | financial_services | ChannelId_2 | -20.0 | 20 | 2018-11-15T02:19:08Z | 2 | 0 |
| 2 | TransactionId_26203 | BatchId_53941 | AccountId_4229 | SubscriptionId_222 | CustomerId_4683 | UGX | 256 | ProviderId_6 | ProductId_1 | airtime | ChannelId_3 | 500.0 | 500 | 2018-11-15T02:44:21Z | 2 | 0 |
| 3 | TransactionId_380 | BatchId_102363 | AccountId_648 | SubscriptionId_2185 | CustomerId_988 | UGX | 256 | ProviderId_1 | ProductId_21 | utility_bill | ChannelId_3 | 20000.0 | 21800 | 2018-11-15T03:32:55Z | 2 | 0 |
| 4 | TransactionId_28195 | BatchId_38780 | AccountId_4841 | SubscriptionId_3829 | CustomerId_988 | UGX | 256 | ProviderId_4 | ProductId_6 | financial_services | ChannelId_2 | -644.0 | 644 | 2018-11-15T03:34:21Z | 2 | 0 |

Last five data points

| | TransactionId | BatchId | AccountId | SubscriptionId | CustomerId | CurrencyCode | CountryCode | ProviderId | ProductId | ProductCategory | ChannelId | Amount | Value | TransactionStartTime | PricingStrategy | FraudResult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95657 | TransactionId_89881 | BatchId_96668 | AccountId_4841 | SubscriptionId_3829 | CustomerId_3078 | UGX | 256 | ProviderId_4 | ProductId_6 | financial_services | ChannelId_2 | -1000.0 | 1000 | 2019-02-13T09:54:09Z | 2 | 0 |
| 95658 | TransactionId_91597 | BatchId_3503 | AccountId_3439 | SubscriptionId_2643 | CustomerId_3874 | UGX | 256 | ProviderId_6 | ProductId_10 | airtime | ChannelId_3 | 1000.0 | 1000 | 2019-02-13T09:54:25Z | 2 | 0 |
| 95659 | TransactionId_82501 | BatchId_118602 | AccountId_4841 | SubscriptionId_3829 | CustomerId_3874 | UGX | 256 | ProviderId_4 | ProductId_6 | financial_services | ChannelId_2 | -20.0 | 20 | 2019-02-13T09:54:35Z | 2 | 0 |
| 95660 | TransactionId_136354 | BatchId_70924 | AccountId_1346 | SubscriptionId_652 | CustomerId_1709 | UGX | 256 | ProviderId_6 | ProductId_19 | tv | ChannelId_3 | 3000.0 | 3000 | 2019-02-13T10:01:10Z | 2 | 0 |
| 95661 | TransactionId_35670 | BatchId_29317 | AccountId_4841 | SubscriptionId_3829 | CustomerId_1709 | UGX | 256 | ProviderId_4 | ProductId_6 | financial_services | ChannelId_2 | -60.0 | 60 | 2019-02-13T10:01:28Z | 2 | 0 |

Shape of data set: (95662, 16)

Size of data set: 1530592

Data Types

Note: all non-numerical features are identified as objects in pandas

| Feature | Data Type |
|---|---|
| TransactionId | object |
| BatchId | object |
| AccountId | object |
| SubscriptionId | object |
| CustomerId | object |
| CurrencyCode | object |
| CountryCode | int64 |
| ProviderId | object |
| ProductId | object |
| ProductCategory | object |
| ChannelId | object |
| Amount | float64 |
| Value | int64 |
| TransactionStartTime | object |
| PricingStrategy | int64 |
| FraudResult | int64 |

Column(s) {'TransactionStartTime'} should be in Datetime format. Use the [to_date] function in datasist.feature_engineering to convert to Pandas Datetime format

Numerical Features in Data set

['CountryCode', 'Amount', 'Value', 'PricingStrategy', 'FraudResult']

Statistical Description of Columns

| | CountryCode | Amount | Value | PricingStrategy | FraudResult |
|---|---|---|---|---|---|
| count | 95662.0 | 9.566200e+04 | 9.566200e+04 | 95662.000000 | 95662.000000 |
| mean | 256.0 | 6.717846e+03 | 9.900584e+03 | 2.255974 | 0.002018 |
| std | 0.0 | 1.233068e+05 | 1.231221e+05 | 0.732924 | 0.044872 |
| min | 256.0 | -1.000000e+06 | 2.000000e+00 | 0.000000 | 0.000000 |
| 25% | 256.0 | -5.000000e+01 | 2.750000e+02 | 2.000000 | 0.000000 |
| 50% | 256.0 | 1.000000e+03 | 1.000000e+03 | 2.000000 | 0.000000 |
| 75% | 256.0 | 2.800000e+03 | 5.000000e+03 | 2.000000 | 0.000000 |
| max | 256.0 | 9.880000e+06 | 9.880000e+06 | 4.000000 | 1.000000 |

Categorical Features in Data set

['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId', 'CurrencyCode', 'ProviderId', 'ProductId', 'ProductCategory', 'ChannelId', 'TransactionStartTime']

Unique class Count of Categorical features

| | Feature | Unique Count |
|---|---|---|
| 0 | TransactionId | 95662 |
| 1 | BatchId | 94809 |
| 2 | AccountId | 3633 |
| 3 | SubscriptionId | 3627 |
| 4 | CustomerId | 3742 |
| 5 | CurrencyCode | 1 |
| 6 | ProviderId | 6 |
| 7 | ProductId | 23 |
| 8 | ProductCategory | 9 |
| 9 | ChannelId | 4 |
| 10 | TransactionStartTime | 94556 |

Missing Values in Data

| | features | missing_counts | missing_percent |
|---|---|---|---|
| 0 | TransactionId | 0 | 0.0 |
| 1 | BatchId | 0 | 0.0 |
| 2 | AccountId | 0 | 0.0 |
| 3 | SubscriptionId | 0 | 0.0 |
| 4 | CustomerId | 0 | 0.0 |
| 5 | CurrencyCode | 0 | 0.0 |
| 6 | CountryCode | 0 | 0.0 |
| 7 | ProviderId | 0 | 0.0 |
| 8 | ProductId | 0 | 0.0 |
| 9 | ProductCategory | 0 | 0.0 |
| 10 | ChannelId | 0 | 0.0 |
| 11 | Amount | 0 | 0.0 |
| 12 | Value | 0 | 0.0 |
| 13 | TransactionStartTime | 0 | 0.0 |
| 14 | PricingStrategy | 0 | 0.0 |
| 15 | FraudResult | 0 | 0.0 |
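As the describe output suggested, TransactionStartTime should be converted to a datetime. The extract_dates call later in this post takes care of this, but in plain pandas the conversion is a one-liner (a minimal sketch):

#Parse the ISO-8601 timestamp strings into pandas datetimes
train_data['TransactionStartTime'] = pd.to_datetime(train_data['TransactionStartTime'])
test_data['TransactionStartTime'] = pd.to_datetime(test_data['TransactionStartTime'])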

Remove features that contain only one unique value, as these features are redundant:

#Drop redundant features
ds.feature_engineering.drop_redundant(data=train_data)
ds.feature_engineering.drop_redundant(data=test_data)
Dropped ['CurrencyCode', 'CountryCode']
Dropped ['CurrencyCode', 'CountryCode']

You can also check for missing values in the data set with datasist's missing-value display function.
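A minimal sketch, assuming the display_missing helper in datasist's structdata module:

#Display missing-value counts and percentages per feature
ds.structdata.display_missing(train_data)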

Data Exploration

cat_feats = ds.structdata.get_cat_feats(train_data)
cat_feats

['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId', 'ProviderId', 'ProductId', 'ProductCategory', 'ChannelId', 'TransactionStartTime']

ds.structdata.get_unique_counts(train_data)

| | Feature | Unique Count |
|---|---|---|
| 0 | TransactionId | 95662 |
| 1 | BatchId | 94809 |
| 2 | AccountId | 3633 |
| 3 | SubscriptionId | 3627 |
| 4 | CustomerId | 3742 |
| 5 | ProviderId | 6 |
| 6 | ProductId | 23 |
| 7 | ProductCategory | 9 |
| 8 | ChannelId | 4 |
| 9 | TransactionStartTime | 94556 |

From the unique count output, we notice that TransactionId and BatchId contain too many classes, so we can drop them.

train_data.drop(['TransactionId', 'BatchId'], axis=1, inplace=True)
test_data.drop(['TransactionId', 'BatchId'], axis=1, inplace=True)

Visualization of Categorical Features

ds.visualizations.countplot(train_data)

Unique Values in AccountId is too large to plot

Unique Values in SubscriptionId is too large to plot

Unique Values in CustomerId is too large to plot

Unique Values in TransactionStartTime is too large to plot

ds.visualizations.class_count(train_data)

Unique classes in AccountId too large

Unique classes in SubscriptionId too large

Unique classes in CustomerId too large

Class Count for ProviderId

| ProviderId | Count |
|---|---|
| ProviderId_4 | 38189 |
| ProviderId_6 | 34186 |
| ProviderId_5 | 14542 |
| ProviderId_1 | 5643 |
| ProviderId_3 | 3084 |
| ProviderId_2 | 18 |

Unique classes in ProductId too large

Class Count for ProductCategory

| ProductCategory | Count |
|---|---|
| financial_services | 45405 |
| airtime | 45027 |
| utility_bill | 1920 |
| data_bundles | 1613 |
| tv | 1279 |
| ticket | 216 |
| movies | 175 |
| transport | 25 |
| other | 2 |

Class Count for ChannelId

| ChannelId | Count |
|---|---|
| ChannelId_3 | 56935 |
| ChannelId_2 | 37141 |
| ChannelId_5 | 1048 |
| ChannelId_1 | 538 |

Unique classes in TransactionStartTime too large

train_data.columns

Index(['AccountId', 'SubscriptionId', 'CustomerId', 'ProviderId', 'ProductId',
       'ProductCategory', 'ChannelId', 'Amount', 'Value',
       'TransactionStartTime', 'PricingStrategy', 'FraudResult'],
      dtype='object')

ds.visualizations.catbox(data=train_data, target='FraudResult', fig_size=(7,7))

TransactionId feature has too many categories and will not be plotted

BatchId feature has too many categories and will not be plotted

AccountId feature has too many categories and will not be plotted

SubscriptionId feature has too many categories and will not be plotted

CustomerId feature has too many categories and will not be plotted

ProductId feature has too many categories and will not be plotted

TransactionStartTime feature has too many categories and will not be plotted

Visualization of Numerical Features

ds.visualizations.histogram(train_data, fig_size=(5,5), bins=5)
ds.visualizations.boxplot(data=train_data, target='FraudResult', fig_size=(5,5))

Working on the Time Feature

date_feats = ds.structdata.get_date_cols(train_data)
date_feats

{'TransactionStartTime'}

train_data[date_feats].head()

| | TransactionStartTime |
|---|---|
| 0 | 2018-11-15T02:18:49Z |
| 1 | 2018-11-15T02:19:08Z |
| 2 | 2018-11-15T02:44:21Z |
| 3 | 2018-11-15T03:32:55Z |
| 4 | 2018-11-15T03:34:21Z |

num_feats = ds.structdata.get_num_feats(train_data)

ds.timeseries.num_timeplot(data=train_data, num_cols=num_feats, time_col='TransactionStartTime')

The extract_dates function in the timeseries module can extract date information from a datetime column automatically.

train_data = ds.timeseries.extract_dates(data=train_data, date_cols=['TransactionStartTime'])
test_data = ds.timeseries.extract_dates(data=test_data, date_cols=['TransactionStartTime'])

train_data.head(2).T

| | 0 | 1 |
|---|---|---|
| AccountId | AccountId_3957 | AccountId_4841 |
| SubscriptionId | SubscriptionId_887 | SubscriptionId_3829 |
| CustomerId | CustomerId_4406 | CustomerId_4406 |
| ProviderId | ProviderId_6 | ProviderId_4 |
| ProductId | ProductId_10 | ProductId_6 |
| ProductCategory | airtime | financial_services |
| ChannelId | ChannelId_3 | ChannelId_2 |
| Amount | 1000 | -20 |
| Value | 1000 | 20 |
| PricingStrategy | 2 | 2 |
| FraudResult | 0 | 0 |
| TransactionStartTime_dow | Thursday | Thursday |
| TransactionStartTime_doy | 319 | 319 |
| TransactionStartTime_dom | 15 | 15 |
| TransactionStartTime_hr | 2 | 2 |
| TransactionStartTime_min | 18 | 19 |
| TransactionStartTime_is_wkd | 0 | 0 |
| TransactionStartTime_yr | 2018 | 2018 |
| TransactionStartTime_qtr | 4 | 4 |
| TransactionStartTime_mth | 11 | 11 |

Merge Data and Encode Categorical Features

#Perform merge
all_data, ntrain, ntest = ds.structdata.join_train_and_test(train_data, test_data)

#Label encode large categorical features
large_cats = ['AccountId', 'SubscriptionId', 'CustomerId', 'ProductId']

from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
for cat in large_cats:
    all_data[cat] = lb.fit_transform(all_data[cat])

#One-hot encode small categorical features
all_data = pd.get_dummies(all_data, drop_first=True)

#Get train and test sets back (copy so the drop below works on a real frame, not a view)
train = all_data[:ntrain].copy()
test = all_data[ntrain:].copy()

#Get target and drop it
target = train['FraudResult']
train.drop('FraudResult', axis=1, inplace=True)
test.drop('FraudResult', axis=1, inplace=True)

print("Shape of training dataset is {}".format(train.shape))
print("Shape of test dataset is {}".format(test.shape))
print("Shape of target is {}".format(target.shape))

Modeling

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.3, random_state=2)

rf_model = RandomForestClassifier(n_estimators=100,random_state=232)
lg_model = LogisticRegression(max_iter=100, random_state=2, solver='lbfgs')

Train two classifiers and get a detailed report with datasist's get_classification_report function.

lg_model.fit(X_train, y_train)
pred = lg_model.predict(X_test)

#Get report from true and predicted values
ds.model.get_classification_report(y_test, pred)

Accuracy is 100.0

F1 score is 41.0

Precision is 58.0

Recall is 32.0

****************************************************************************************************

Confusion Matrix

| | Score positive | Score negative |
|---|---|---|
| Actual positive | 28629 | 13 |
| Actual negative | 39 | 18 |


rf_model.fit(X_train, y_train)
pred = rf_model.predict(X_test)

#Get report from true and predicted values
ds.model.get_classification_report(y_test, pred)

Accuracy is 100.0

F1 score is 90.0

Precision is 90.0

Recall is 91.0

****************************************************************************************************

Confusion Matrix

| | Score positive | Score negative |
|---|---|---|
| Actual positive | 28636 | 6 |
| Actual negative | 5 | 52 |
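Since f1_score was already imported from scikit-learn above, you can cross-check the rounded F1 in datasist's report directly:

#Sanity-check the reported F1 score against scikit-learn
print(f1_score(y_test, pred))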

From the metrics reported, we can see that the Random Forest model does better than the Logistic Regression model. Note that accuracy is uninformative here: with only about 0.2 percent of transactions labeled fraudulent, a model that always predicts the majority class would also score close to 100 percent accuracy, which is why the F1 score is the metric to watch. Next, we can use a cross-validation function to detect overfitting.

ds.model.train_classifier(train_data=train, target=target, model=rf_model, cross_validate=True, cv=3)

Accuracy is 99.8787

F1_score is 72.1679

Precision is 76.2845

Recall is 79.7035

Finally, we can plot the feature importances to see which features matter most.

feats = train.columns
ds.model.plot_feature_importance(estimator=rf_model, col_names=feats)
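The walkthrough stops at exploration and validation, but for completeness, scoring the competition's test set would look roughly like this (a sketch, not part of the original notebook; note that TransactionId was dropped earlier, so in practice you would set that column aside from test_data beforehand to build the actual submission file):

#Retrain on the full training set and predict on the prepared test set
rf_model.fit(train, target)
test_preds = rf_model.predict(test)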

Download the full notebook here.
