How to detect credit card fraudsters

This is a pretty long tutorial, and I know how hard it is to get through everything, so feel free to skip a few blocks of code if you need to.

One of the oldest problems in statistics is dealing with imbalanced data: survival data, credit risk, fraud, and so on.

Basically, in any dataset where the success rate is very high or very low, models judged by accuracy alone are almost meaningless. If accuracy is the criterion, a model that predicts 99% of the test set correctly (successes or failures, depending on what you are trying to do) looks like a hero.


What happens with imbalanced data is that the success event occurs only around 1% of the time (usually less than 10%), so a "model" with no information at all that simply predicts the majority class is already right about 99% of the time and passes the accuracy criterion.
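To make that concrete, here is a minimal sketch (on simulated labels, not the credit card data) showing that a "model" which always predicts the majority class already scores about 99% accuracy when only 1% of the records are positive:

import numpy as np
from sklearn.metrics import accuracy_score

# simulated labels: roughly 1% "fraud" (1) and 99% "not fraud" (0)
rng = np.random.RandomState(42)
y_sim = (rng.rand(100000) < 0.01).astype(int)

# a "model" that never flags anything as fraud
y_pred = np.zeros_like(y_sim)

print("Accuracy of always predicting 'not fraud': "+"{:.3%}".format(accuracy_score(y_sim, y_pred)))
# ~99% accuracy, yet it catches exactly zero fraudulent transactions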

In this post you will learn a 'trick' to deal with this type of data: oversampling. I will skip the descriptive analysis, since we all want to focus on the fraud analysis a bit more.

Importing Libraries

These are the libraries we are using in our project:


#to plot stuff
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import scikitplot as skplt
import numpy as np
#decision tree
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
#split training and testing
from sklearn.model_selection import train_test_split
#over/undersampling
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
#model evaluation
from sklearn.metrics import confusion_matrix
#decision tree plotter
from mlxtend.plotting import category_scatter
from mlxtend.plotting import plot_decision_regions
#logistic regression
import statsmodels.api as sm

Data Problem and Motivation

I have been working on a similar problem at work, and after a lot of research I finally finished my model deployment in Python, so I decided to write it up in case it helps others as well.

The data used here was made available in a Kaggle competition a couple of years ago: https://www.kaggle.com/mlg-ulb/creditcardfraud.

This is what the data looks like:


 df = pd.read_csv('creditcard.csv')
 df.head()
 
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

5 rows × 31 columns

Factoring the X matrix for Regression

For categorical variables, we need to encode them as dummy variables (factors) for the analysis to work.


#define our original variables
y = df['Class']
features = list(df.columns[1:len(df.columns)-1])
X = pd.get_dummies(df[features], drop_first=True)
p = sum(y)/len(y)
print("Percentage of Fraud: "+"{:.3%}".format(p))
And we should also split the data into training and testing sets:

#define training and testing sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
p_test = sum(y_test)/len(y_test)
p_train = sum(y_train)/len(y_train)
print("Percentage of Fraud in the test set: "+"{:.3%}".format(p_test))
print("Percentage of Fraud in the train set: "+"{:.3%}".format(p_train))

Percentage of Fraud in the test set: 0.172%
Percentage of Fraud in the train set: 0.173%

Now, because our data is so imbalanced, we need a dataset with a balanced success rate in order to fit and validate the model. There are two ways of getting one:

  • Oversampling: flooding the dataset with copies of the success events so that the percentage of successes is closer to 50% (balanced) than in the original set.
  • Undersampling: reducing the majority event, forcing the dataset to be balanced (both are illustrated in the quick sketch below).
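As a quick, self-contained illustration of what these two resamplers do (on a toy dataset built with make_classification, not the credit card data), you can compare the class counts before and after resampling:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# toy data with roughly a 1% minority class
X_toy, y_toy = make_classification(n_samples=10000, weights=[0.99], random_state=42)
print("original:", Counter(y_toy))

ros = RandomOverSampler(random_state=42)
X_o, y_o = ros.fit_resample(X_toy, y_toy)
print("oversampled:", Counter(y_o))    # minority class duplicated until ~50/50

rus = RandomUnderSampler(random_state=42)
X_u, y_u = rus.fit_resample(X_toy, y_toy)
print("undersampled:", Counter(y_u))   # majority class trimmed down to ~50/50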

Method: Over Sampling

First let's perform the Oversampling analysis:


#oversampling
ros = RandomOverSampler(random_state=42)
X_over, y_over = ros.fit_resample(X_train, y_train)
#undersampling
rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X_train, y_train)
#write the resampled sets as pandas dataframes and add the column names back
#column names were removed when we performed the resampling
#they will be useful down the road when we do feature selection
X = pd.DataFrame(X)
names = list(X.columns)
X_over = pd.DataFrame(X_over)
X_over.columns = names
X_under = pd.DataFrame(X_under)
X_under.columns = names
p_over = sum(y_over)/len(y_over)
p_under = sum(y_under)/len(y_under)
print("Percentage of Fraud in the oversampled set: "+"{:.3%}".format(p_over))
print("Percentage of Fraud in the undersampled set: "+"{:.3%}".format(p_under))

 

Percentage of Fraud in the oversampled set: 50.000%

Percentage of Fraud in the undersampled set: 50.000%

Now the data has the same number of successes and failures.

 

Modeling with Logistic Regression

In the snippet of code below we have three functions:

  • remove_pvalues(): a function to perform feature selection. It removes features with a p-value higher than 5%, where the p-value is the probability that the weight for that feature is actually 0. In a regression fit, each coefficient's p-value tests whether that individual feature is irrelevant for the model (null hypothesis) or not (alternative hypothesis). If you want to know more, the Wikipedia article on p-values is pretty reasonable (https://en.wikipedia.org/wiki/P-value).
  • stepwise_logistic(): repeats the process of removing the "irrelevant" features until there are no more features to remove. This function loops through model iterations until it stops removing features.
  • logit_score(): the output of a logistic regression model is actually a vector of probabilities between 0 and 1; the closer to 0, the less likely that record is to be a fraud given the current variables in the model, and the closer to 1, the more likely it is to be a fraud. For the purposes of this problem, if the probability is higher than 0.95 (a threshold I picked somewhat arbitrarily, as a gut call) I am calling it a 1. At the end, this function scores the predicted labels against the real values of the testing set.

Note: IMO, the threshold for calling something a fraud depends on how many false negatives/positives you are willing to accept. In statistics it is impossible to control both at the same time, so you need to pick and choose.

def remove_pvalues(my_list, alpha=0.05):
    #keep only the features whose p-value is at or below alpha
    features = []
    counter = 0
    for item in my_list:
        if my_list.iloc[counter] <= alpha:
            features.append(my_list.index[counter])
        counter += 1
    return features
def stepwise_logistic(X_res, y_res):
    #fit on all features first, then iteratively drop the non-significant ones
    lr = sm.Logit(y_res, X_res)
    lr = lr.fit()
    my_pvalues = lr.pvalues
    features = remove_pvalues(my_pvalues)
    n_old = len(features)
    while True:
        X_new = X_res[features]
        lr = sm.Logit(y_res, X_new)
        lr = lr.fit()
        new_features = remove_pvalues(lr.pvalues)
        n_new = len(new_features)

        if n_old - n_new == 0:
            break
        else:
            features = new_features
            n_old = n_new
    return new_features

def logit_score(model, X, y, threshold=0.95):
    #score = share of correct predictions at the chosen threshold
    y_t = model.predict(X)
    y_t = (y_t > threshold).astype(int)
    z = abs(y_t - y)
    score = 1 - sum(z)/len(z)
    return score

Performing Logistic Regression

Now that our guns are loaded, we just need to fire. Below I perform the feature selection and then fit the model on the final X matrix:


features = stepwise_logistic(X_over, y_over)
X_over_new = X_over[features]
lr=sm.Logit(y_over,X_over_new)
lr = lr.fit()

Scoring the model

Now for the most anticipated part of this tutorial: checking how many "rights and wrongs" we get, considering the model was trained on a dataset with 50% fraud occurrences and then tested on a set with 0.17% fraud occurrences.
Voila:

score = logit_score(lr, X_test[features], y_test)
y_t = lr.predict(X_test[features])
#round to 1 any likelihood higher than 0.95, otherwise assign 0
y_t = y_t > 0.95

print("Percentage of Rights and Wrongs in the testing set "+"{:.3%}".format(score));
 
Percentage of Rights and Wrongs in the testing set 99.261%

Now that we know the model is doing well, let's see where we are getting it wrong. The plot below shows how many false negatives and false positives we have. We see a lot more false positives (we say a transaction was fraudulent even though it was not). This comes from the 0.95 threshold above: if you increase that value to 0.99, for example, you will increase the number of false negatives as well. The statistician needs to decide what the optimal cutoff is.

skplt.metrics.plot_confusion_matrix(y_test, y_t)
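If you would rather see that trade-off numerically than eyeball a single confusion matrix, a small sketch like the one below (reusing the lr, features, X_test and y_test objects fitted above) sweeps a few cutoffs and counts the false positives and false negatives at each one:

from sklearn.metrics import confusion_matrix

def threshold_tradeoff(probs, y_true, thresholds=(0.5, 0.9, 0.95, 0.99)):
    #count false positives / false negatives for a few candidate cutoffs
    for t in thresholds:
        y_hat = (probs > t).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
        print("threshold {:.2f} -> false positives: {}, false negatives: {}".format(t, fp, fn))

threshold_tradeoff(lr.predict(X_test[features]), y_test)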

Let's see which variables are most important to the model in absolute value. In the chart below you can see the top 15 most relevant features:


#calculate the weights
weights = lr.params
#create a dataframe with the weights and the coefficient names
importance_list = pd.DataFrame(
    {'names': features,
     'weights': weights
     })
#normalized absolute weights
importance_list['abs_weights'] = np.abs(importance_list['weights'])
total = sum(importance_list['abs_weights'])
importance_list['norm_weights'] = importance_list['abs_weights']/total
#select the top 15 with the highest importance
importance_list = importance_list.sort_values(by='norm_weights', ascending=False)
importance_list = importance_list.iloc[0:15]
#plot them, tcharam!
ax = importance_list['norm_weights'].plot(kind='bar', title="Variable importance", figsize=(15,10), legend=True, fontsize=12)
ax.set_xticklabels(importance_list['names'], rotation=90)

plt.show()

Now, to visualize the signs of the weights, the plot below shows which variables decrease versus increase the likelihood of a fraudulent transaction.

ax = importance_list['weights'].plot(kind='bar', title="Variable importance", figsize=(15,10), legend=True, fontsize=12)
ax.set_xticklabels(importance_list['names'], rotation=90)
plt.show()
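One small extra step (not part of the original analysis, but often handy with logistic regressions) is to exponentiate the weights into odds ratios: values above 1 increase the odds of a transaction being flagged as fraud for a one-unit increase in that feature, and values below 1 decrease them.

#exponentiating a logit coefficient gives the multiplicative change in the
#odds of fraud for a one-unit increase in that feature
importance_list['odds_ratio'] = np.exp(importance_list['weights'])
print(importance_list[['names', 'weights', 'odds_ratio']].head(10))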

I know this is a pretty long tutorial, but hopefully you will not need to go through all the yak shaving I went through.

When should I use hits, visits or visitors in my Adobe Analytics Segments

This past week I got asked this question, and although I have no doubts when I am creating my Adobe Analytics (AA) segments, I still could not give a clear answer. Later that day, I stopped and thought about the logic I use to create a segment. This is how I choose.

Problem that I am trying to solve
Segments are a very powerful and useful tool in AA. They allow you to create variables that were not available before. And because they are so useful, some analysts forget that segments should be created to solve problems, not just to multiply variables.

For example, let's take a situation where I am asked to filter all conversions for a specific product, here called bazinga. The checkout URLs for my commerce variables are all the same after a certain point, but they differ in the product name. Needless to say, the product name is not shared in this data layer setup. This is what the confirmation URL would look like:

When someone buys the product cozinga:
www.companyfic.com/cozinga/checkout/orderconfirmation

And when someone buys the bazinga product:
www.companyfic.com/bazinga/checkout/orderconfirmation

This case is a no-brainer: I want to filter all pages that contain the string bazinga and the string orderconfirmation, and I care about the moment the user hits those pages. It does not matter to me what they do after, before, or on a different day. All I care about is that hit.

Basically, I defined my problem, defined the view I was trying to focus on, and then it became clear which option I should choose.

Variables involved in the report

Now that the problem is clear, looking at the variables you will be selecting in the report is also pretty useful. If the variables are more related to the hit (page views, time on page, scroll depth), then a hit segment is what you are looking for. If the variables are more related to the visit (such as revenue, basket size, day of the order), then you should definitely use visit. The same goes for visitor.

Hit: I am focused on the page and what happens on that page. Anything after or before does not influence my segment. Ex: page isolation, bounce rate on pages, product analysis.

Visits: I am focused on that visit; pages only matter at the visit level. Ex: what is the number of page views for the product bazinga coming from the SEM traffic source?

Visitors: I have a broad focus and I am interested in users. It does matter what users did in a previous visit. Ex. I want to know how many users from BC bought the product bazinga in the last 30 days.

 

Watchouts

Be careful when using segments that were created for a different purpose; they might skew your results. Ex: if you have your bazinga purchase segment and the business asks you a question related to traffic source, you might get a higher number than you should, because you are applying a hit segment to a problem that requires a visit segment.

 

 

 

 

Visitors versus Unique Visitors

One of the concepts that people usually misunderstand when speaking about digital data is the terminology used in the platforms. For example, I often hear these questions:

  • “Are you giving me unique visitors or visitors?”
  • Or even “do you have unique visits too?”.

They exemplify how these concepts are not crystal clear and, therefore, end up being misused.

Visitors

Visitors are defined as each and every person that accesses the website. If we were talking about a physical store, unique visitors would be the number of people visiting the store in a specific time frame. In Omniture this is explicit because it has a unique visitors metric broken down by month, quarter and year. No one says "I had 50 unique clients in my store this afternoon", because clients are all unique. But we do say unique visitors when we refer to the digital store – mostly because of Omniture terminology – even though both refer to the same concept: clients.

New Visitors

New visitors in Omniture are associated with the visit number: any visitor whose visit number equals 0 is considered brand new. And for that number to be 0, it has to be the absolute first time you have ever been to that site using that device. Well, it is cookie based.

Also, as a consequence of relying on the cookie, if the visitor deletes their cookies or clears the cache, we are no longer able to identify their history, so they would count as a new visitor again.

Returning Visitors

Because we have new visitors, we also have returning visitors. Basically, a returning visitor is anyone who does not have visit number 0 attached to their cookie, or anyone entering the store/website for any time other than the first.

Visits

Going back to the context of a physical store, visits are the number of times someone enters the store. If they leave and come back, it is considered a new visit even though it is the same client. In this context, every visit is unique too. What would a non-unique visit even be, if such a term existed? Exactly: there is no such thing as a returning visit, because what returns is the visitor.

So, putting it in context: a visitor comes to the site for the first time, and within the same day visits the site again, buys something, and leaves.

This is the table of key metrics for that day:

New visitors: 1
Unique visitors: 1
Visits: 2

 

So, in conclusion: the client (visitor/unique visitor) comes to (visits) the store for the first time (new visitor), and every time they come back afterwards they count as a returning visitor.

I hope this helps people understand these concepts better.


 

What's behind A/B tests

Well, I'm often asked how an A/B test works.

So my first point, and the most important thing I'm going to say here, is: an A/B test is not a simple comparison of frequencies. So don't ever use the term A/B test when you are just comparing two numbers. Like, EVER.

Now that we have drawn that line, let's get back to the post.
There are several tools around (most of them embedded in your web BI solution, like AdWords, Adobe Site Catalyst, etc.), but you don't really get to see how the result is calculated.

So here I'll show (and share in my github account) some methods to do A/B tests.

Using R
Because I don't have any real-life data that I can share, I'm going to use the ToothGrowth dataset from R (please find more details about the dataset here). Basically, 60 guinea pigs were given 6 variations of a treatment and the length of their teeth was measured afterwards. Now we are interested in understanding which treatment is best.

We can bring this example to any industry that is not medical/health related. For example, say you are running a campaign for sofas using 2 different digital channels, Facebook and Display, and you decided to have 3 types of creative: a small banner, a big banner and a flash animation. Then you have 6 variations of treatment, just like in our guinea pig example.

So going back to our dataset: we have 2 delivery methods for vitamin C (ascorbic acid and orange juice) and 3 different doses - 0.5mg, 1mg and 2mg. This is the reason for the 6 variations. To make it easier (and possible - see more here), we are assuming the whole dataset is normally distributed.

So first, let's find out whether there is a difference between the delivery methods.
It's always a good idea to check the stats for the A and B samples, like mean and variance. You can find these stats at the bottom of this page.

So the first question: is there a difference between guinea pigs that took ascorbic acid and those that took orange juice?

H_0 (null hypothesis): \mu_{orange juice} = \mu_{ascorbic acid}, i.e. \mu_{orange juice} - \mu_{ascorbic acid} = 0

H_1 (alternative hypothesis): \mu_{orange juice} \neq \mu_{ascorbic acid}, i.e. \mu_{orange juice} - \mu_{ascorbic acid} \neq 0

In most cases we use the p-value as the decision criterion for the test, with values higher than 0.05 usually considered too high to reject the null. If you're not familiar with the concept of a p-value, please check here.

The idea behind the test is: assuming our null hypothesis is true, what is the probability of observing a value as extreme as the one we got?

Basically, we create a new variable, OrangeJuice - AscorbicAcid, with mean
\mu_{orange juice} - \mu_{ascorbic acid} and variance
\frac{S_{orange juice}^2}{n_{orange juice}}+\frac{S_{ascorbic acid}^2}{n_{ascorbic acid}}

Roughly speaking, the number you should calculate is

\frac{\bar{X}_{orange juice} - \bar{X}_{ascorbic acid}}{\sqrt{\frac{S_{orange juice}^2}{n_{orange juice}}+\frac{S_{ascorbic acid}^2}{n_{ascorbic acid}}}}

which follows a t-distribution when n_{orange juice} + n_{ascorbic acid} < 40, and is approximately normal otherwise.

Back to our test: first we filter the observations by ascorbic acid (VC) and orange juice (OJ)


vc_subset <- subset(ToothGrowth, supp == "VC")
oj_subset <- subset(ToothGrowth, supp == "OJ")

then we can apply the t-test


t.test(vc_subset$len, oj_subset$len)

which results in the output shown in the figure test1_vc_oj.

The p-value is higher than 5%, so we fail to reject the null hypothesis: the data does not show a significant difference between the two delivery methods.

The rest of this dataset analysis you can see in the GitHub repository linked at the bottom of this post.

Now assume we have 2 campaigns running with different templates.

For the first campaign the click-through rate was 25%, and for the second one it was 40%. It's hard to tell, with only these two pieces of information, whether there is a real difference between the campaigns. Suppose, for example, that 100 people saw the first campaign but only 10 saw the second. Even with a much higher click-through rate, the second campaign has a very low number of impressions, so it's hard to say whether the campaigns really differ.
In the chart below you can see that, keeping the 40% rate, we would have to show the second campaign to at least 55 people to say that there is a difference between the samples, and to at least 40 to conclude that campaign 2 is better than campaign 1.

(chart: p-value as a function of the number of impressions for campaign 2, keeping the 40% click-through rate)
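For the curious, here is the arithmetic behind that kind of statement, using the standard two-proportion z-test on the illustrative numbers above (100 impressions at 25% versus 10 impressions at 40%):

\hat{p} = \frac{25 + 4}{100 + 10} \approx 0.264, \qquad z = \frac{0.40 - 0.25}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{100} + \frac{1}{10}\right)}} \approx \frac{0.15}{0.146} \approx 1.03

which corresponds to a two-sided p-value of roughly 0.30, nowhere near the 0.05 cutoff, so with only 10 impressions the 40% click-through rate is not enough evidence of a real difference.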

There are also several online calculators that can be used to run an A/B test, such as:

  • https://vwo.com/ab-split-test-significance-calculator/
  • http://www.hubspot.com/ab-test-calculator

I'm also uploading a spreadsheet with the proportion and the mean tests here.

 

And for more information about how to implement t-tests in R please access this link

Central Limit Theorem and Convergence

This is my first post (yaaay \o/).
Sometimes it's hard to understand the meaning of the CLT and of convergence. I've noticed that many people, when they first hear about these concepts, don't really know what they mean, and they stay that way until they learn (if they ever do) more advanced theory.

One of the first things we learn in school is the Central Limit Theorem. Basically, for a wide range of distributions, the distribution of the sample mean gets closer and closer to a normal distribution as the number of observations grows.

Some distributions converge faster than others, for example the exponential, Student's t and the binomial.

But those are all old news. Here I'm going to show an example of convergence in real life.

Exponential Distribution

The CLT says that if you have X_1, X_2, ..., X_n independent random variables with an exp(\lambda) distribution, then when n is large enough the mean of these identically distributed variables is approximately normal with mean 1/\lambda and variance 1/(n\lambda^2).
To check this, we can generate n = 40 observations from an exponential distribution with \lambda = 0.2, take their mean, and then replicate this experiment k times.
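Plugging the numbers above into the CLT statement (\lambda = 0.2 and n = 40), the sample means should be approximately

\bar{X} \sim N\left(\frac{1}{\lambda},\ \frac{1}{n\lambda^2}\right) = N\left(5,\ 0.625\right)

that is, centred at 5 with a standard deviation of 1/(\lambda\sqrt{n}) \approx 0.79.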

The Rcode for that is:


k = 10       # number of replications
n = 40       # size of each vector
lambda = 0.2
list_of_exponential = array(1:k)  # variable to keep all the calculated means
# loop to run the replications
for(i in 1:k){
  list_of_exponential[i] = mean(rexp(n, lambda))
}

As you can see, k here is very small (10), so what does the distribution of the mean of these iid variables look like?

(histogram of the sample means for k = 10)

As you can see, it's not very "normal". The reason is that we need a bigger k to get closer to a normal distribution.

If we run the same script with k=100 for example

(histogram of the sample means for k = 100)

And if we run for k=1000

(histogram of the sample means for k = 1000)

You finally can see that it does converge to a normal distribution.

There is a decent explanation on the Wikipedia page (if you want more theory), and I have also included the code on my GitHub page.

As this is my first post, please send me your feedback/suggestions so I can improve the blog and posts =D.
