1 Competition Overview
This is the Tianchi financial risk-control loan default prediction challenge: given an applicant's features, predict whether they will default.
Competition link: https://tianchi.aliyun.com/competition/entrance/531830/introduction
The evaluation metric is AUC.
2 Dataset Overview
Contestants are asked to build a model on the given dataset and predict the financial risk.
The training set contains 800,000 rows; test sets A and B contain 200,000 rows each. Many of the features are anonymized or desensitized.
Feature descriptions:
- id: unique identifier assigned to the loan listing
- loanAmnt: loan amount
- term: loan term (years)
- interestRate: loan interest rate
- installment: installment payment amount
- grade: loan grade
- subGrade: loan sub-grade
- employmentTitle: employment title
- employmentLength: employment length (years)
- homeOwnership: home-ownership status provided by the borrower at registration
- annualIncome: annual income
- verificationStatus: verification status
- issueDate: the month the loan was funded
- purpose: loan-purpose category given by the borrower in the application
- postCode: first 3 digits of the postal code given by the borrower in the application
- regionCode: region code
- dti: debt-to-income ratio
- delinquency_2years: number of 30+ days past-due delinquency events in the borrower's credit file over the past 2 years
- ficoRangeLow: lower bound of the borrower's FICO range at loan origination
- ficoRangeHigh: upper bound of the borrower's FICO range at loan origination
- openAcc: number of open credit lines in the borrower's credit file
- pubRec: number of derogatory public records
- pubRecBankruptcies: number of public-record bankruptcies
- revolBal: total revolving credit balance
- revolUtil: revolving line utilization rate, i.e. the amount of credit used relative to all available revolving credit
- totalAcc: total number of credit lines currently in the borrower's credit file
- initialListStatus: initial listing status of the loan
- applicationType: whether the loan is an individual application or a joint application with two co-borrowers
- earliesCreditLine: the month the borrower's earliest reported credit line was opened
- title: loan title provided by the borrower
- policyCode: publicly available policy_code = 1, new products not publicly available policy_code = 2
- n0-n14: anonymized features derived from counts of borrower behavior
3 Workflow
Overall data:
- Load the datasets and check their sizes;
- Understand the data types;
- Look at the summary statistics;
Values that need handling:
- Missing values
- inf values (infinities)
- Constant columns (only one unique value)
Data types:
- Categorical features
- Numerical features
  - Discrete numerical features
  - Continuous numerical features
Relationships in the data:
- Between features
- Between features and the target variable
- Data report generated with pandas_profiling
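The checklist above can be condensed into a small helper; this is only a sketch, and the function name quick_overview is my own, not part of the competition baseline:

```python
import numpy as np
import pandas as pd

def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, missing count, unique count, inf count."""
    num = df.select_dtypes(include=[np.number])
    report = pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isnull().sum(),
        'n_unique': df.nunique(),
    })
    # inf can only appear in numeric columns; non-numeric columns get 0
    report['n_inf'] = (num.apply(lambda s: np.isinf(s).sum())
                          .reindex(df.columns).fillna(0).astype(int))
    return report

# toy demonstration frame
demo = pd.DataFrame({'a': [1.0, np.inf, None], 'b': ['x', 'x', 'x']})
overview = quick_overview(demo)
```

Running `quick_overview(data_train)` gives the same information as the `shape`/`info()`/`isnull()` steps below in one table.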
4 Code
4.1 Import the libraries needed for analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
import pandas_profiling
warnings.filterwarnings('ignore')
4.2 Load the datasets
data_train = pd.read_csv('./train.csv')
data_test_a = pd.read_csv('./testA.csv')
4.3 Overall view
data_train.shape
(800000, 47)
data_test_a.shape
(200000, 46)
data_train.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
dtype='object')
4.4 Check the data types
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n3 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
4.5 Summary statistics
data_train.describe()
id | loanAmnt | term | interestRate | installment | employmentTitle | homeOwnership | annualIncome | verificationStatus | isDefault | ... | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 799999.000000 | 800000.000000 | 8.000000e+05 | 800000.000000 | 800000.000000 | ... | 759730.000000 | 759730.000000 | 759730.000000 | 759729.000000 | 759730.000000 | 766761.000000 | 730248.000000 | 759730.000000 | 759730.000000 | 759730.000000 |
mean | 399999.500000 | 14416.818875 | 3.482745 | 13.238391 | 437.947723 | 72005.351714 | 0.614213 | 7.613391e+04 | 1.009683 | 0.199513 | ... | 8.107937 | 8.575994 | 8.282953 | 14.622488 | 5.592345 | 11.643896 | 0.000815 | 0.003384 | 0.089366 | 2.178606 |
std | 230940.252015 | 8716.086178 | 0.855832 | 4.765757 | 261.460393 | 106585.640204 | 0.675749 | 6.894751e+04 | 0.782716 | 0.399634 | ... | 4.799210 | 7.400536 | 4.561689 | 8.124610 | 3.216184 | 5.484104 | 0.030075 | 0.062041 | 0.509069 | 1.844377 |
min | 0.000000 | 500.000000 | 3.000000 | 5.310000 | 15.690000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 199999.750000 | 8000.000000 | 3.000000 | 9.750000 | 248.450000 | 427.000000 | 0.000000 | 4.560000e+04 | 0.000000 | 0.000000 | ... | 5.000000 | 4.000000 | 5.000000 | 9.000000 | 3.000000 | 8.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
50% | 399999.500000 | 12000.000000 | 3.000000 | 12.740000 | 375.135000 | 7755.000000 | 1.000000 | 6.500000e+04 | 1.000000 | 0.000000 | ... | 7.000000 | 7.000000 | 7.000000 | 13.000000 | 5.000000 | 11.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
75% | 599999.250000 | 20000.000000 | 3.000000 | 15.990000 | 580.710000 | 117663.500000 | 1.000000 | 9.000000e+04 | 2.000000 | 0.000000 | ... | 11.000000 | 11.000000 | 10.000000 | 19.000000 | 7.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
max | 799999.000000 | 40000.000000 | 5.000000 | 30.990000 | 1715.420000 | 378351.000000 | 5.000000 | 1.099920e+07 | 2.000000 | 1.000000 | ... | 70.000000 | 132.000000 | 79.000000 | 128.000000 | 45.000000 | 82.000000 | 4.000000 | 4.000000 | 39.000000 | 30.000000 |
8 rows × 42 columns
pd.concat([data_train.head(3), data_train.tail(3)])  # DataFrame.append was removed in pandas 2.0
id | loanAmnt | term | interestRate | installment | grade | subGrade | employmentTitle | employmentLength | homeOwnership | ... | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 35000.0 | 5 | 19.52 | 917.97 | E | E2 | 320.0 | 2 years | 2 | ... | 9.0 | 8.0 | 4.0 | 12.0 | 2.0 | 7.0 | 0.0 | 0.0 | 0.0 | 2.0 |
1 | 1 | 18000.0 | 5 | 18.49 | 461.90 | D | D2 | 219843.0 | 5 years | 0 | ... | NaN | NaN | NaN | NaN | NaN | 13.0 | NaN | NaN | NaN | NaN |
2 | 2 | 12000.0 | 5 | 16.99 | 298.17 | D | D3 | 31698.0 | 8 years | 0 | ... | 0.0 | 21.0 | 4.0 | 5.0 | 3.0 | 11.0 | 0.0 | 0.0 | 0.0 | 4.0 |
799997 | 799997 | 6000.0 | 3 | 13.33 | 203.12 | C | C3 | 2582.0 | 10+ years | 1 | ... | 4.0 | 26.0 | 4.0 | 10.0 | 4.0 | 5.0 | 0.0 | 0.0 | 1.0 | 4.0 |
799998 | 799998 | 19200.0 | 3 | 6.92 | 592.14 | A | A4 | 151.0 | 10+ years | 0 | ... | 10.0 | 6.0 | 12.0 | 22.0 | 8.0 | 16.0 | 0.0 | 0.0 | 0.0 | 5.0 |
799999 | 799999 | 9000.0 | 3 | 11.06 | 294.91 | B | B3 | 13.0 | 5 years | 0 | ... | 3.0 | 4.0 | 4.0 | 8.0 | 3.0 | 7.0 | 0.0 | 0.0 | 0.0 | 2.0 |
4.6.1 Check missing values
print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')
There are 22 columns in train dataset with missing values.
Count the features whose missing ratio exceeds 50%:
have_null_fea_dict = (data_train.isnull().sum()/len(data_train)).to_dict()
fea_null_moreThanHalf = {}
for key, value in have_null_fea_dict.items():
    if value > 0.5:
        fea_null_moreThanHalf[key] = value
print(fea_null_moreThanHalf)
{}
Visualize the NaN ratios:
missing = data_train.isnull().sum()/len(data_train)
missing = missing[missing > 0]
missing = missing.sort_values()
missing.plot.bar()
4.6.2 Check infinite values
for col in data_train.columns.values:
    if data_train[col].dtype != object and np.isinf(data_train[col]).any():
        print("{} has infinity values".format(col))
This dataset has no infinite values. If it did, the code below could first convert them to NaN so they can be handled together with the other missing values:
# data_train = data_train.replace([np.inf, -np.inf], np.nan)
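As a runnable illustration of that inf-to-NaN step on a toy frame (note that replace does not modify the frame in place, so the result must be assigned back):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.inf, -np.inf, 4.0]})
# replace returns a new DataFrame; assign it back (or pass inplace=True)
df = df.replace([np.inf, -np.inf], np.nan)
n_missing = int(df['x'].isnull().sum())  # both inf values are now NaN
```

After this step the usual missing-value tools (isnull, fillna, the NaN bar plot above) also cover the former infinities.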
4.6.3 Check constant columns
one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1]
one_value_fea_test = [col for col in data_test_a.columns if data_test_a[col].nunique() <= 1]
one_value_fea,one_value_fea_test
(['policyCode'], ['policyCode'])
print(f'There are {len(one_value_fea)} columns in train dataset with one unique value.')
print(f'There are {len(one_value_fea_test)} columns in test dataset with one unique value.')
There are 1 columns in train dataset with one unique value.
There are 1 columns in test dataset with one unique value.
4.7 Separate numerical and categorical features
numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(data_train.columns)))
4.8 Inspect the categorical features
category_fea
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
for col in category_fea:
    print('Feature 【{}】 has 【{}】 unique classes'.format(col, data_train[col].nunique()))
    data_train[col].hist()
    plt.figure()
4.9.1 Split the numerical features into continuous and discrete
def get_numerical_serial_fea(data, feas):
    # continuous features
    numerical_serial_fea = []
    # discrete features
    numerical_noserial_fea = []
    for fea in feas:
        uniques = data[fea].nunique()
        # treat a feature with at most 10 unique values as discrete, otherwise continuous
        if uniques <= 10:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea, numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(data_train,numerical_fea)
numerical_serial_fea
['id',
'loanAmnt',
'interestRate',
...
'n10',
'n13',
'n14']
numerical_noserial_fea
['term',
'homeOwnership',
'verificationStatus',
'isDefault',
'initialListStatus',
'applicationType',
'policyCode',
'n11',
'n12']
4.9.2 Inspect the discrete features
data_train['term'].value_counts()
3 606902
5 193098
Name: term, dtype: int64
data_train['homeOwnership'].value_counts()
0 395732
1 317660
2 86309
3 185
5 81
4 33
Name: homeOwnership, dtype: int64
data_train['verificationStatus'].value_counts()
1 309810
2 248968
0 241222
Name: verificationStatus, dtype: int64
data_train['initialListStatus'].value_counts()
0 466438
1 333562
Name: initialListStatus, dtype: int64
data_train['applicationType'].value_counts()
0 784586
1 15414
Name: applicationType, dtype: int64
data_train['policyCode'].value_counts()  # only a single value, useless
1.0 800000
Name: policyCode, dtype: int64
data_train['n11'].value_counts()  # severely imbalanced; check the default rate per value below
0.0 729682
1.0 540
2.0 24
4.0 1
3.0 1
Name: n11, dtype: int64
data_train.groupby('n11').sum()['isDefault'] / data_train['n11'].value_counts()
0.0 0.199276
1.0 0.201852
2.0 0.083333
3.0 1.000000
4.0 0.000000
dtype: float64
data_train['n12'].value_counts()  # severely imbalanced; check the default rate per value below
0.0 757315
1.0 2281
2.0 115
3.0 16
4.0 3
Name: n12, dtype: int64
data_train.groupby('n12').sum()['isDefault'] / data_train['n12'].value_counts()
0.0 0.202093
1.0 0.225340
2.0 0.217391
3.0 0.375000
4.0 0.000000
dtype: float64
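The groupby trick used for n11 and n12 generalizes to any discrete feature. A small sketch (the helper name default_rate is mine); since isDefault is a 0/1 target, the group mean equals the default rate:

```python
import pandas as pd

def default_rate(df: pd.DataFrame, col: str, target: str = 'isDefault') -> pd.Series:
    """Fraction of positive targets within each category of `col`."""
    # mean of a 0/1 target per group == default rate per group
    return df.groupby(col)[target].mean()

# toy data standing in for data_train
toy = pd.DataFrame({'term': [3, 3, 3, 5, 5],
                    'isDefault': [0, 1, 0, 1, 1]})
rates = default_rate(toy, 'term')
```

With real data, `default_rate(data_train, 'n11')` reproduces the division shown above in one call.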
4.10.1 Visualize the distributions of the numerical features
f = pd.melt(data_train, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
- If a distribution is far from normal, taking a log can partly correct it.
- In some cases normality helps a model converge faster, and some models assume normally distributed inputs (e.g. GMM, KNN). The main goal is to keep features from being too heavily skewed, since severe skew can hurt the model's predictions.
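One way to check whether the log transform actually helps is to compare the skewness before and after. A sketch on synthetic data (np.log1p is used rather than np.log so that zero values stay defined):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# a strictly positive, right-skewed feature, similar in shape to loanAmnt
amounts = pd.Series(rng.lognormal(mean=9.0, sigma=0.8, size=10_000))

skew_before = amounts.skew()
skew_after = np.log1p(amounts).skew()  # log1p(x) = log(1 + x)
```

A skewness close to 0 after the transform suggests the log correction is appropriate for that feature.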
4.10.2 Plot loanAmnt after the log correction
plt.figure(figsize=(16,12))
plt.suptitle('Transaction Values Distribution', fontsize=22)
plt.subplot(221)
sub_plot_1 = sns.distplot(data_train['loanAmnt'])
sub_plot_1.set_title("loanAmnt Distribuition", fontsize=18)
sub_plot_1.set_xlabel("")
sub_plot_1.set_ylabel("Probability", fontsize=15)
plt.subplot(222)
sub_plot_2 = sns.distplot(np.log(data_train['loanAmnt']))
sub_plot_2.set_title("loanAmnt (Log) Distribuition", fontsize=18)
sub_plot_2.set_xlabel("")
sub_plot_2.set_ylabel("Probability", fontsize=15)
4.11 Distribution of the categorical features by default class
train_loan_fr = data_train.loc[data_train['isDefault'] == 1]
train_loan_nofr = data_train.loc[data_train['isDefault'] == 0]
fig, axes = plt.subplots(len(category_fea), 2, figsize=(15, 8))
for ax, col in zip(axes, category_fea):
train_loan_fr.groupby(col)[col].count().plot(kind='barh', ax=ax[0], title='Count of {} fraud'.format(col))
train_loan_nofr.groupby(col)[col].count().plot(kind='barh', ax=ax[1], title='Count of {} non-fraud'.format(col))
The following version cannot be used, because value_counts reorders the categories and the two panels can no longer be compared:
fig, axes = plt.subplots(len(category_fea), 2, figsize=(15, 8))
for ax, col in zip(axes, category_fea):
    train_loan_fr[col].value_counts().plot(kind='barh', ax=ax[0], title='Count of {} fraud'.format(col))
    train_loan_nofr[col].value_counts().plot(kind='barh', ax=ax[1], title='Count of {} non-fraud'.format(col))
for col in numerical_serial_fea:
    fig, ((ax1, ax2)) = plt.subplots(1, 2, figsize=(15, 6))
    data_train.loc[data_train['isDefault'] == 1][col] \
        .apply(lambda x: np.sign(x) * np.log(1 + np.abs(x))) \
        .plot(kind='hist',
              bins=100,
              title='Log {} - Fraud'.format(col),
              color='r',
              xlim=(-3, 10),
              ax=ax1)
    data_train.loc[data_train['isDefault'] == 0][col] \
        .apply(lambda x: np.sign(x) * np.log(1 + np.abs(x))) \
        .plot(kind='hist',
              bins=100,
              title='Log {} - Not Fraud'.format(col),
              color='b',
              xlim=(-3, 10),
              ax=ax2)
plt.figure()
4.12 Loan amount: defaulters vs. non-defaulters
percent = data_train.groupby('isDefault')['loanAmnt'].count()
percent = percent.reset_index()
sum_amnt = data_train.groupby('isDefault')['loanAmnt'].sum()
sum_amnt = sum_amnt.reset_index()
plt.figure(figsize=(12, 5))
plt.subplot(121)
plot_tr_1 = sns.barplot(x='isDefault', y='loanAmnt', dodge=True, data=percent)
plot_tr_1.set_title("Percent of Frauds \n 0: good user | 1: bad user", fontsize=14)
# annotate each bar with its percentage
for p in plot_tr_1.patches:
    height = p.get_height()
    plot_tr_1.text(p.get_x() + p.get_width() / 2.,
                   height + 3,
                   '{:1.2f}%'.format(height / len(data_train) * 100),
                   ha="center", fontsize=15)
plt.subplot(122)
plot_tr_2 = sns.barplot(x='isDefault', y='loanAmnt', dodge=True, data=sum_amnt)
plot_tr_2.set_title("Total Amount in loanAmnt \n 0: good user | 1: bad user", fontsize=14)
# annotate each bar with its percentage
for p in plot_tr_2.patches:
    height = p.get_height()
    plot_tr_2.text(p.get_x() + p.get_width() / 2.,
                   height + 3,
                   '{:1.2f}%'.format(height / data_train['loanAmnt'].sum() * 100),
                   ha="center", fontsize=15)
4.13 Date feature handling
# new feature issueDateDT: days from the earliest date in the dataset (2007-06-01) to the record's issue date
# convert to datetime
data_train['issueDate'] = pd.to_datetime(data_train['issueDate'], format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_train['issueDateDT'] = data_train['issueDate'].apply(lambda x: x - startdate).dt.days
# convert to datetime (note: read from data_test_a here, not data_train)
data_test_a['issueDate'] = pd.to_datetime(data_test_a['issueDate'], format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_test_a['issueDateDT'] = data_test_a['issueDate'].apply(lambda x: x - startdate).dt.days
plt.hist(data_train['issueDateDT'], label='train');
plt.hist(data_test_a['issueDateDT'], label='test');
plt.legend();
plt.title('Distribution of issueDateDT dates');
# the train and test issueDateDT ranges overlap, so a time-based split for validation would be unwise
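The overlap can also be checked programmatically instead of by eye. A sketch with a hypothetical helper ranges_overlap, shown here on toy day counts (with real data you would pass the two issueDateDT columns):

```python
import pandas as pd

def ranges_overlap(a: pd.Series, b: pd.Series) -> bool:
    """True if the [min, max] intervals of the two series intersect."""
    return bool(a.min() <= b.max() and b.min() <= a.max())

# toy day counts standing in for train/test issueDateDT
train_dt = pd.Series([0, 100, 4000])
test_dt = pd.Series([3500, 4100, 4200])
overlapping = ranges_overlap(train_dt, test_dt)
```

If this returns True, a purely chronological train/validation split cannot mimic the train/test relationship.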
4.14 Inspect the data with a pivot table
# pivot table: the index can be more than one column, columns is optional, and the aggregation function aggfunc is applied to the columns listed in values.
pivot = pd.pivot_table(data_train, index=['grade'], columns=['issueDateDT'], values=['loanAmnt'], aggfunc=np.sum)
loanAmnt | |||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
issueDateDT | 0 | 30 | 61 | 92 | 122 | 153 | 183 | 214 | 245 | 274 | ... | 3926 | 3957 | 3987 | 4018 | 4048 | 4079 | 4110 | 4140 | 4171 | 4201 |
grade | |||||||||||||||||||||
A | NaN | 53650.0 | 42000.0 | 19500.0 | 34425.0 | 63950.0 | 43500.0 | 168825.0 | 85600.0 | 101825.0 | ... | 13093850.0 | 11757325.0 | 11945975.0 | 9144000.0 | 7977650.0 | 6888900.0 | 5109800.0 | 3919275.0 | 2694025.0 | 2245625.0 |
B | NaN | 13000.0 | 24000.0 | 32125.0 | 7025.0 | 95750.0 | 164300.0 | 303175.0 | 434425.0 | 538450.0 | ... | 16863100.0 | 17275175.0 | 16217500.0 | 11431350.0 | 8967750.0 | 7572725.0 | 4884600.0 | 4329400.0 | 3922575.0 | 3257100.0 |
C | NaN | 68750.0 | 8175.0 | 10000.0 | 61800.0 | 52550.0 | 175375.0 | 151100.0 | 243725.0 | 393150.0 | ... | 17502375.0 | 17471500.0 | 16111225.0 | 11973675.0 | 10184450.0 | 7765000.0 | 5354450.0 | 4552600.0 | 2870050.0 | 2246250.0 |
D | NaN | NaN | 5500.0 | 2850.0 | 28625.0 | NaN | 167975.0 | 171325.0 | 192900.0 | 269325.0 | ... | 11403075.0 | 10964150.0 | 10747675.0 | 7082050.0 | 7189625.0 | 5195700.0 | 3455175.0 | 3038500.0 | 2452375.0 | 1771750.0 |
E | 7500.0 | NaN | 10000.0 | NaN | 17975.0 | 1500.0 | 94375.0 | 116450.0 | 42000.0 | 139775.0 | ... | 3983050.0 | 3410125.0 | 3107150.0 | 2341825.0 | 2225675.0 | 1643675.0 | 1091025.0 | 1131625.0 | 883950.0 | 802425.0 |
F | NaN | NaN | 31250.0 | 2125.0 | NaN | NaN | NaN | 49000.0 | 27000.0 | 43000.0 | ... | 1074175.0 | 868925.0 | 761675.0 | 685325.0 | 665750.0 | 685200.0 | 316700.0 | 315075.0 | 72300.0 | NaN |
G | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 24625.0 | NaN | NaN | ... | 56100.0 | 243275.0 | 224825.0 | 64050.0 | 198575.0 | 245825.0 | 53125.0 | 23750.0 | 25100.0 | 1000.0 |
4.15 Generate a report with pandas_profiling
# without minimal=True the report also computes correlations; with this many rows the Kendall correlation in particular is very slow
pfr = pandas_profiling.ProfileReport(data_train, minimal=True)
pfr.to_file("./example.html")
Alternative: disable the individual correlation calculations instead of using minimal:
profile = data_train.profile_report(
    title="Report without correlations",
    correlations={
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False},
    },
)