1 Competition Overview
This is the Tianchi financial risk-control loan default prediction challenge: given an applicant's features, predict whether they will default.
Competition link: https://tianchi.aliyun.com/competition/entrance/531830/introduction
The evaluation metric is AUC.
2 Dataset Overview
Contestants are asked to build a model on the given dataset and predict the financial risk.
The training set contains 800,000 rows; test sets A and B contain 200,000 rows each. Many of the features are anonymized or desensitized.
Feature descriptions:
- id: unique identifier assigned to the loan listing
- loanAmnt: loan amount
- term: loan term (years)
- interestRate: loan interest rate
- installment: installment payment amount
- grade: loan grade
- subGrade: loan sub-grade
- employmentTitle: employment title
- employmentLength: employment length (years)
- homeOwnership: home-ownership status provided by the borrower at registration
- annualIncome: annual income
- verificationStatus: verification status
- issueDate: the month the loan was funded
- purpose: loan-purpose category given by the borrower in the application
- postCode: first 3 digits of the postal code given by the borrower in the application
- regionCode: region code
- dti: debt-to-income ratio
- delinquency_2years: number of 30+ days past-due delinquency events in the borrower's credit file over the past 2 years
- ficoRangeLow: lower bound of the borrower's FICO range at loan origination
- ficoRangeHigh: upper bound of the borrower's FICO range at loan origination
- openAcc: number of open credit lines in the borrower's credit file
- pubRec: number of derogatory public records
- pubRecBankruptcies: number of public-record bankruptcies
- revolBal: total revolving credit balance
- revolUtil: revolving line utilization rate, i.e. the amount of credit used relative to all available revolving credit
- totalAcc: total number of credit lines currently in the borrower's credit file
- initialListStatus: initial listing status of the loan
- applicationType: whether the loan is an individual application or a joint application with two co-borrowers
- earliesCreditLine: the month the borrower's earliest reported credit line was opened
- title: loan title provided by the borrower
- policyCode: publicly available policy_code = 1, new products not publicly available policy_code = 2
- n0-n14: anonymized features derived from counts of borrower behavior
3 Workflow
Overall data:
- Load the datasets and check their sizes;
- Understand the data types;
- Look at the summary statistics;
Values that need handling:
- Missing values
- inf values (infinities)
- Constant columns (only one unique value)
Data types:
- Categorical features
- Numerical features
  - Discrete numerical features
  - Continuous numerical features
Relationships in the data:
- Between features
- Between features and the target variable
- Data report generated with pandas_profiling
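The checklist above can be condensed into a small helper; this is only a sketch, and the function name quick_overview is my own, not part of the competition baseline:

```python
import numpy as np
import pandas as pd

def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, missing count, unique count, inf count."""
    num = df.select_dtypes(include=[np.number])
    report = pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isnull().sum(),
        'n_unique': df.nunique(),
    })
    # inf can only appear in numeric columns; non-numeric columns get 0
    report['n_inf'] = (num.apply(lambda s: np.isinf(s).sum())
                          .reindex(df.columns).fillna(0).astype(int))
    return report

# toy demonstration frame
demo = pd.DataFrame({'a': [1.0, np.inf, None], 'b': ['x', 'x', 'x']})
overview = quick_overview(demo)
```

Running `quick_overview(data_train)` gives the same information as the `shape`/`info()`/`isnull()` steps below in one table.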
4 Code
4.1 Import the libraries needed for analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
import pandas_profiling
warnings.filterwarnings('ignore')
4.2 Load the datasets
data_train = pd.read_csv('./train.csv')
data_test_a = pd.read_csv('./testA.csv')
4.3 Overall view
data_train.shape
(800000, 47)
data_test_a.shape
(200000, 46)
data_train.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
dtype='object')
4.4 Check the data types
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n3 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
4.5 Summary statistics
data_train.describe()
id | loanAmnt | term | interestRate | installment | employmentTitle | homeOwnership | annualIncome | verificationStatus | isDefault | ... | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 799999.000000 | 800000.000000 | 8.000000e+05 | 800000.000000 | 800000.000000 | ... | 759730.000000 | 759730.000000 | 759730.000000 | 759729.000000 | 759730.000000 | 766761.000000 | 730248.000000 | 759730.000000 | 759730.000000 | 759730.000000 |
mean | 399999.500000 | 14416.818875 | 3.482745 | 13.238391 | 437.947723 | 72005.351714 | 0.614213 | 7.613391e+04 | 1.009683 | 0.199513 | ... | 8.107937 | 8.575994 | 8.282953 | 14.622488 | 5.592345 | 11.643896 | 0.000815 | 0.003384 | 0.089366 | 2.178606 |
std | 230940.252015 | 8716.086178 | 0.855832 | 4.765757 | 261.460393 | 106585.640204 | 0.675749 | 6.894751e+04 | 0.782716 | 0.399634 | ... | 4.799210 | 7.400536 | 4.561689 | 8.124610 | 3.216184 | 5.484104 | 0.030075 | 0.062041 | 0.509069 | 1.844377 |
min | 0.000000 | 500.000000 | 3.000000 | 5.310000 | 15.690000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 199999.750000 | 8000.000000 | 3.000000 | 9.750000 | 248.450000 | 427.000000 | 0.000000 | 4.560000e+04 | 0.000000 | 0.000000 | ... | 5.000000 | 4.000000 | 5.000000 | 9.000000 | 3.000000 | 8.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
50% | 399999.500000 | 12000.000000 | 3.000000 | 12.740000 | 375.135000 | 7755.000000 | 1.000000 | 6.500000e+04 | 1.000000 | 0.000000 | ... | 7.000000 | 7.000000 | 7.000000 | 13.000000 | 5.000000 | 11.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
75% | 599999.250000 | 20000.000000 | 3.000000 | 15.990000 | 580.710000 | 117663.500000 | 1.000000 | 9.000000e+04 | 2.000000 | 0.000000 | ... | 11.000000 | 11.000000 | 10.000000 | 19.000000 | 7.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
max | 799999.000000 | 40000.000000 | 5.000000 | 30.990000 | 1715.420000 | 378351.000000 | 5.000000 | 1.099920e+07 | 2.000000 | 1.000000 | ... | 70.000000 | 132.000000 | 79.000000 | 128.000000 | 45.000000 | 82.000000 | 4.000000 | 4.000000 | 39.000000 | 30.000000 |
8 rows × 42 columns
pd.concat([data_train.head(3), data_train.tail(3)])  # DataFrame.append was removed in pandas 2.0
id | loanAmnt | term | interestRate | installment | grade | subGrade | employmentTitle | employmentLength | homeOwnership | ... | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 35000.0 | 5 | 19.52 | 917.97 | E | E2 | 320.0 | 2 years | 2 | ... | 9.0 | 8.0 | 4.0 | 12.0 | 2.0 | 7.0 | 0.0 | 0.0 | 0.0 | 2.0 |
1 | 1 | 18000.0 | 5 | 18.49 | 461.90 | D | D2 | 219843.0 | 5 years | 0 | ... | NaN | NaN | NaN | NaN | NaN | 13.0 | NaN | NaN | NaN | NaN |
2 | 2 | 12000.0 | 5 | 16.99 | 298.17 | D | D3 | 31698.0 | 8 years | 0 | ... | 0.0 | 21.0 | 4.0 | 5.0 | 3.0 | 11.0 | 0.0 | 0.0 | 0.0 | 4.0 |
799997 | 799997 | 6000.0 | 3 | 13.33 | 203.12 | C | C3 | 2582.0 | 10+ years | 1 | ... | 4.0 | 26.0 | 4.0 | 10.0 | 4.0 | 5.0 | 0.0 | 0.0 | 1.0 | 4.0 |
799998 | 799998 | 19200.0 | 3 | 6.92 | 592.14 | A | A4 | 151.0 | 10+ years | 0 | ... | 10.0 | 6.0 | 12.0 | 22.0 | 8.0 | 16.0 | 0.0 | 0.0 | 0.0 | 5.0 |
799999 | 799999 | 9000.0 | 3 | 11.06 | 294.91 | B | B3 | 13.0 | 5 years | 0 | ... | 3.0 | 4.0 | 4.0 | 8.0 | 3.0 | 7.0 | 0.0 | 0.0 | 0.0 | 2.0 |
4.6.1 Check missing values
print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')
There are 22 columns in train dataset with missing values.
Count the features whose missing ratio exceeds 50%:
have_null_fea_dict = (data_train.isnull().sum()/len(data_train)).to_dict()
fea_null_moreThanHalf = {}
for key, value in have_null_fea_dict.items():
    if value > 0.5:
        fea_null_moreThanHalf[key] = value
print(fea_null_moreThanHalf)
{}
Visualize the NaN ratios:
missing = data_train.isnull().sum()/len(data_train)
missing = missing[missing > 0]
missing = missing.sort_values()
missing.plot.bar()
4.6.2 Check infinite values
for col in data_train.columns.values:
    if data_train[col].dtype != object and np.isinf(data_train[col]).any():
        print("{} has infinity values".format(col))
This dataset has no infinite values. If it did, the code below could first convert them to NaN so they can be handled together with the other missing values:
# data_train = data_train.replace([np.inf, -np.inf], np.nan)
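As a runnable illustration of that inf-to-NaN step on a toy frame (note that replace does not modify the frame in place, so the result must be assigned back):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.inf, -np.inf, 4.0]})
# replace returns a new DataFrame; assign it back (or pass inplace=True)
df = df.replace([np.inf, -np.inf], np.nan)
n_missing = int(df['x'].isnull().sum())  # both inf values are now NaN
```

After this step the usual missing-value tools (isnull, fillna, the NaN bar plot above) also cover the former infinities.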
4.6.3 Check constant columns
one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1]
one_value_fea_test = [col for col in data_test_a.columns if data_test_a[col].nunique() <= 1]
one_value_fea,one_value_fea_test
(['policyCode'], ['policyCode'])
print(f'There are {len(one_value_fea)} columns in train dataset with one unique value.')
print(f'There are {len(one_value_fea_test)} columns in test dataset with one unique value.')
There are 1 columns in train dataset with one unique value.
There are 1 columns in test dataset with one unique value.
4.7 Separate numerical and categorical features
numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(data_train.columns)))
4.8 Inspect the categorical features
category_fea
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
for col in category_fea:
    print('Feature 【{}】 has 【{}】 unique classes'.format(col, data_train[col].nunique()))
    data_train[col].hist()
    plt.figure()
4.9.1 Split the numerical features into continuous and discrete
def get_numerical_serial_fea(data, feas):
    # continuous features
    numerical_serial_fea = []
    # discrete features
    numerical_noserial_fea = []
    for fea in feas:
        uniques = data[fea].nunique()
        # treat a feature with at most 10 unique values as discrete, otherwise continuous
        if uniques <= 10:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea, numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(data_train,numerical_fea)
numerical_serial_fea
['id',
'loanAmnt',
'interestRate',
...
'n10',
'n13',
'n14']
numerical_noserial_fea
['term',
'homeOwnership',
'verificationStatus',
'isDefault',
'initialListStatus',
'applicationType',
'policyCode',
'n11',
'n12']
4.9.2 Inspect the discrete features
data_train['term'].value_counts()
3 606902
5 193098
Name: term, dtype: int64
data_train['homeOwnership'].value_counts()
0 395732
1 317660
2 86309
3 185
5 81
4 33
Name: homeOwnership, dtype: int64
data_train['verificationStatus'].value_counts()
1 309810
2 248968
0 241222
Name: verificationStatus, dtype: int64
data_train['initialListStatus'].value_counts()
0 466438
1 333562
Name: initialListStatus, dtype: int64
data_train['applicationType'].value_counts()
0 784586
1 15414
Name: applicationType, dtype: int64
data_train['policyCode'].value_counts()  # only a single value, useless
1.0 800000
Name: policyCode, dtype: int64
data_train['n11'].value_counts()  # severely imbalanced; check the default rate per value below
0.0 729682
1.0 540
2.0 24
4.0 1
3.0 1
Name: n11, dtype: int64
data_train.groupby('n11').sum()['isDefault'] / data_train['n11'].value_counts()
0.0 0.199276
1.0 0.201852
2.0 0.083333
3.0 1.000000
4.0 0.000000
dtype: float64
data_train['n12'].value_counts()  # severely imbalanced; check the default rate per value below
0.0 757315
1.0 2281
2.0 115
3.0 16
4.0 3
Name: n12, dtype: int64
data_train.groupby('n12').sum()['isDefault'] / data_train['n12'].value_counts()
0.0 0.202093
1.0 0.225340
2.0 0.217391
3.0 0.375000
4.0 0.000000
dtype: float64
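The groupby trick used for n11 and n12 generalizes to any discrete feature. A small sketch (the helper name default_rate is mine); since isDefault is a 0/1 target, the group mean equals the default rate:

```python
import pandas as pd

def default_rate(df: pd.DataFrame, col: str, target: str = 'isDefault') -> pd.Series:
    """Fraction of positive targets within each category of `col`."""
    # mean of a 0/1 target per group == default rate per group
    return df.groupby(col)[target].mean()

# toy data standing in for data_train
toy = pd.DataFrame({'term': [3, 3, 3, 5, 5],
                    'isDefault': [0, 1, 0, 1, 1]})
rates = default_rate(toy, 'term')
```

With real data, `default_rate(data_train, 'n11')` reproduces the division shown above in one call.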
4.10.1 Visualize the distributions of the numerical features
f = pd.melt(data_train, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
- If a distribution is far from normal, taking a log can partly correct it.
- In some cases normality helps a model converge faster, and some models assume normally distributed inputs (e.g. GMM, KNN). The main goal is to keep features from being too heavily skewed, since severe skew can hurt the model's predictions.
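One way to check whether the log transform actually helps is to compare the skewness before and after. A sketch on synthetic data (np.log1p is used rather than np.log so that zero values stay defined):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# a strictly positive, right-skewed feature, similar in shape to loanAmnt
amounts = pd.Series(rng.lognormal(mean=9.0, sigma=0.8, size=10_000))

skew_before = amounts.skew()
skew_after = np.log1p(amounts).skew()  # log1p(x) = log(1 + x)
```

A skewness close to 0 after the transform suggests the log correction is appropriate for that feature.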
4.10.2 Plot loanAmnt after the log correction
plt.figure(figsize=(16,12))
plt.suptitle('Transaction Values Distribution', fontsize=22)
plt.subplot(221)
sub_plot_1 = sns.distplot(data_train['loanAmnt'])
sub_plot_1.set_title("loanAmnt Distribuition", fontsize=18)
sub_plot_1.set_xlabel("")
sub_plot_1.set_ylabel("Probability", fontsize=15)
plt.subplot(222)
sub_plot_2 = sns.distplot(np.log(data_train['loanAmnt']))
sub_plot_2.set_title("loanAmnt (Log) Distribuition", fontsize=18)
sub_plot_2.set_xlabel("")
sub_plot_2.set_ylabel("Probability", fontsize=15)
4.11 Distribution of the categorical features by default class
train_loan_fr = data_train.loc[data_train['isDefault'] == 1]
train_loan_nofr = data_train.loc[data_train['isDefault'] == 0]
fig, axes = plt.subplots(len(category_fea), 2, figsize=(15, 8))
for ax, col in zip(axes, category_fea):
train_loan_fr.groupby(col)[col].count().plot(kind='barh', ax=ax[0], title='Count of {} fraud'.format(col))
train_loan_nofr.groupby(col)[col].count().plot(kind='barh', ax=ax[1], title='Count of {} non-fraud'.format(col))
The following version cannot be used, because value_counts reorders the categories and the two panels can no longer be compared:
fig, axes = plt.subplots(len(category_fea), 2, figsize=(15, 8))
for ax, col in zip(axes, category_fea):
    train_loan_fr[col].value_counts().plot(kind='barh', ax=ax[0], title='Count of {} fraud'.format(col))
    train_loan_nofr[col].value_counts().plot(kind='barh', ax=ax[1], title='Count of {} non-fraud'.format(col))
for col in numerical_serial_fea:
    fig, ((ax1, ax2)) = plt.subplots(1, 2, figsize=(15, 6))
    data_train.loc[data_train['isDefault'] == 1][col] \
        .apply(lambda x: np.sign(x) * np.log(1 + np.abs(x))) \
        .plot(kind='hist',
              bins=100,
              title='Log {} - Fraud'.format(col),
              color='r',
              xlim=(-3, 10),
              ax=ax1)
    data_train.loc[data_train['isDefault'] == 0][col] \
        .apply(lambda x: np.sign(x) * np.log(1 + np.abs(x))) \
        .plot(kind='hist',
              bins=100,
              title='Log {} - Not Fraud'.format(col),
              color='b',
              xlim=(-3, 10),
              ax=ax2)
plt.figure()
4.12 Loan amount: defaulters vs. non-defaulters
percent = data_train.groupby('isDefault')['loanAmnt'].count()
percent = percent.reset_index()
sum_amnt = data_train.groupby('isDefault')['loanAmnt'].sum()
sum_amnt = sum_amnt.reset_index()
plt.figure(figsize=(12, 5))
plt.subplot(121)
plot_tr_1 = sns.barplot(x='isDefault', y='loanAmnt', dodge=True, data=percent)
plot_tr_1.set_title("Percent of Frauds \n 0: good user | 1: bad user", fontsize=14)
# annotate each bar with its percentage
for p in plot_tr_1.patches:
    height = p.get_height()
    plot_tr_1.text(p.get_x() + p.get_width() / 2.,
                   height + 3,
                   '{:1.2f}%'.format(height / len(data_train) * 100),
                   ha="center", fontsize=15)
plt.subplot(122)
plot_tr_2 = sns.barplot(x='isDefault', y='loanAmnt', dodge=True, data=sum_amnt)
plot_tr_2.set_title("Total Amount in loanAmnt \n 0: good user | 1: bad user", fontsize=14)
# annotate each bar with its percentage
for p in plot_tr_2.patches:
    height = p.get_height()
    plot_tr_2.text(p.get_x() + p.get_width() / 2.,
                   height + 3,
                   '{:1.2f}%'.format(height / data_train['loanAmnt'].sum() * 100),
                   ha="center", fontsize=15)
4.13 Date feature handling
# new feature issueDateDT: days from the earliest date in the dataset (2007-06-01) to the record's issue date
# convert to datetime
data_train['issueDate'] = pd.to_datetime(data_train['issueDate'], format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_train['issueDateDT'] = data_train['issueDate'].apply(lambda x: x - startdate).dt.days
# convert to datetime (note: read from data_test_a here, not data_train)
data_test_a['issueDate'] = pd.to_datetime(data_test_a['issueDate'], format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_test_a['issueDateDT'] = data_test_a['issueDate'].apply(lambda x: x - startdate).dt.days
plt.hist(data_train['issueDateDT'], label='train');
plt.hist(data_test_a['issueDateDT'], label='test');
plt.legend();
plt.title('Distribution of issueDateDT dates');
# the train and test issueDateDT ranges overlap, so a time-based split for validation would be unwise
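The overlap can also be checked programmatically instead of by eye. A sketch with a hypothetical helper ranges_overlap, shown here on toy day counts (with real data you would pass the two issueDateDT columns):

```python
import pandas as pd

def ranges_overlap(a: pd.Series, b: pd.Series) -> bool:
    """True if the [min, max] intervals of the two series intersect."""
    return bool(a.min() <= b.max() and b.min() <= a.max())

# toy day counts standing in for train/test issueDateDT
train_dt = pd.Series([0, 100, 4000])
test_dt = pd.Series([3500, 4100, 4200])
overlapping = ranges_overlap(train_dt, test_dt)
```

If this returns True, a purely chronological train/validation split cannot mimic the train/test relationship.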
4.14 Inspect the data with a pivot table
# pivot table: the index can be more than one column, columns is optional, and the aggregation function aggfunc is applied to the columns listed in values.
pivot = pd.pivot_table(data_train, index=['grade'], columns=['issueDateDT'], values=['loanAmnt'], aggfunc=np.sum)
loanAmnt | |||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
issueDateDT | 0 | 30 | 61 | 92 | 122 | 153 | 183 | 214 | 245 | 274 | ... | 3926 | 3957 | 3987 | 4018 | 4048 | 4079 | 4110 | 4140 | 4171 | 4201 |
grade | |||||||||||||||||||||
A | NaN | 53650.0 | 42000.0 | 19500.0 | 34425.0 | 63950.0 | 43500.0 | 168825.0 | 85600.0 | 101825.0 | ... | 13093850.0 | 11757325.0 | 11945975.0 | 9144000.0 | 7977650.0 | 6888900.0 | 5109800.0 | 3919275.0 | 2694025.0 | 2245625.0 |
B | NaN | 13000.0 | 24000.0 | 32125.0 | 7025.0 | 95750.0 | 164300.0 | 303175.0 | 434425.0 | 538450.0 | ... | 16863100.0 | 17275175.0 | 16217500.0 | 11431350.0 | 8967750.0 | 7572725.0 | 4884600.0 | 4329400.0 | 3922575.0 | 3257100.0 |
C | NaN | 68750.0 | 8175.0 | 10000.0 | 61800.0 | 52550.0 | 175375.0 | 151100.0 | 243725.0 | 393150.0 | ... | 17502375.0 | 17471500.0 | 16111225.0 | 11973675.0 | 10184450.0 | 7765000.0 | 5354450.0 | 4552600.0 | 2870050.0 | 2246250.0 |
D | NaN | NaN | 5500.0 | 2850.0 | 28625.0 | NaN | 167975.0 | 171325.0 | 192900.0 | 269325.0 | ... | 11403075.0 | 10964150.0 | 10747675.0 | 7082050.0 | 7189625.0 | 5195700.0 | 3455175.0 | 3038500.0 | 2452375.0 | 1771750.0 |
E | 7500.0 | NaN | 10000.0 | NaN | 17975.0 | 1500.0 | 94375.0 | 116450.0 | 42000.0 | 139775.0 | ... | 3983050.0 | 3410125.0 | 3107150.0 | 2341825.0 | 2225675.0 | 1643675.0 | 1091025.0 | 1131625.0 | 883950.0 | 802425.0 |
F | NaN | NaN | 31250.0 | 2125.0 | NaN | NaN | NaN | 49000.0 | 27000.0 | 43000.0 | ... | 1074175.0 | 868925.0 | 761675.0 | 685325.0 | 665750.0 | 685200.0 | 316700.0 | 315075.0 | 72300.0 | NaN |
G | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 24625.0 | NaN | NaN | ... | 56100.0 | 243275.0 | 224825.0 | 64050.0 | 198575.0 | 245825.0 | 53125.0 | 23750.0 | 25100.0 | 1000.0 |
4.15 Generate a report with pandas_profiling
# without minimal=True the report also computes correlations; with this many rows the Kendall correlation in particular is very slow
pfr = pandas_profiling.ProfileReport(data_train, minimal=True)
pfr.to_file("./example.html")
Alternative: disable the individual correlation calculations instead of using minimal:
profile = data_train.profile_report(
    title="Report without correlations",
    correlations={
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False},
    },
)