1 Competition Overview

Loan Default Prediction challenge for financial risk control: given the features of an application, predict whether the applicant will default.

比赛链接:https://tianchi.aliyun.com/competition/entrance/531830/introduction

The evaluation metric is AUC.
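For reference, AUC can be read as the probability that a randomly chosen positive (defaulting) sample is scored above a randomly chosen negative one. A minimal pure-Python sketch of that definition (the `auc_score` helper is illustrative only, not part of the competition code):

```python
def auc_score(y_true, y_score):
    # AUC = probability that a random positive is ranked above a
    # random negative; tied scores count as 0.5
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In practice you would use sklearn.metrics.roc_auc_score, which implements the same quantity efficiently.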

2 Dataset Overview

Contestants are asked to build a model on the given dataset and predict financial risk.

The training set contains 800,000 rows; test sets A and B contain 200,000 rows each. The data also includes many anonymized and desensitized features.

Feature descriptions:

  • id: unique credit listing ID assigned to the loan
  • loanAmnt: loan amount
  • term: loan term (years)
  • interestRate: interest rate
  • installment: installment amount
  • grade: loan grade
  • subGrade: loan sub-grade
  • employmentTitle: employment title
  • employmentLength: employment length (years)
  • homeOwnership: home-ownership status provided by the borrower at registration
  • annualIncome: annual income
  • verificationStatus: verification status
  • issueDate: month the loan was issued
  • purpose: loan-purpose category stated by the borrower in the application
  • postCode: first 3 digits of the postal code the borrower provided in the application
  • regionCode: region code
  • dti: debt-to-income ratio
  • delinquency_2years: number of 30+-day past-due delinquency events in the borrower's credit file over the past 2 years
  • ficoRangeLow: lower bound of the borrower's FICO range at loan issuance
  • ficoRangeHigh: upper bound of the borrower's FICO range at loan issuance
  • openAcc: number of open credit lines in the borrower's credit file
  • pubRec: number of derogatory public records
  • pubRecBankruptcies: number of public-record bankruptcies
  • revolBal: total revolving credit balance
  • revolUtil: revolving-line utilization rate, i.e. the amount of credit the borrower uses relative to all available revolving credit
  • totalAcc: total number of credit lines currently in the borrower's credit file
  • initialListStatus: initial listing status of the loan
  • applicationType: whether the loan is an individual application or a joint application with two co-borrowers
  • earliesCreditLine: month the borrower's earliest reported credit line was opened
  • title: loan title provided by the borrower
  • policyCode: publicly available policy_code = 1; new products not publicly available, policy_code = 2
  • n*: anonymized features n0-n14, derived counts of borrower behavior

3 Process Overview

  • Overall view:

    • load the datasets and check their size;
    • inspect the data types;
    • look at the summary statistics;
  • Values that need handling:

    • missing values
    • inf values (infinities)
    • single-value columns (only one unique value)
  • Data types:

    • categorical features
    • numeric features

      • discrete numeric features
      • continuous numeric features
  • Relationships in the data:

    • between features
    • between features and the target
  • Generate a data report with pandas_profiling

4 Code

4.1 Import the libraries for analysis and visualization

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
import pandas_profiling
warnings.filterwarnings('ignore')

4.2 Load the Datasets

data_train = pd.read_csv('./train.csv')
data_test_a = pd.read_csv('./testA.csv')

4.3 Overall View

data_train.shape
(800000, 47)
data_test_a.shape
(200000, 46)
data_train.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')

4.4 Data Types

data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  800000 non-null  int64  
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64  
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object 
 6   subGrade            800000 non-null  object 
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object 
 9   homeOwnership       800000 non-null  int64  
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64  
 12  issueDate           800000 non-null  object 
 13  isDefault           800000 non-null  int64  
 14  purpose             800000 non-null  int64  
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64  
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64  
 28  applicationType     800000 non-null  int64  
 29  earliesCreditLine   800000 non-null  object 
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n3                  759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB

4.5 Summary Statistics

data_train.describe()
         loanAmnt      term  interestRate  installment  annualIncome  isDefault
count   800000.00  800000.0     800000.00    800000.00  8.000000e+05  800000.00
mean     14416.82      3.48         13.24       437.95  7.613391e+04       0.20
std       8716.09      0.86          4.77       261.46  6.894751e+04       0.40
min        500.00      3.00          5.31        15.69  0.000000e+00       0.00
25%       8000.00      3.00          9.75       248.45  4.560000e+04       0.00
50%      12000.00      3.00         12.74       375.14  6.500000e+04       0.00
75%      20000.00      3.00         15.99       580.71  9.000000e+04       0.00
max      40000.00      5.00         30.99      1715.42  1.099920e+07       1.00

(excerpt — the full output is 8 rows × 42 columns)

pd.concat([data_train.head(3), data_train.tail(3)])  # DataFrame.append was removed in pandas 2.0
            id  loanAmnt  term  interestRate  installment grade subGrade  employmentTitle employmentLength  homeOwnership  ...  n14
0            0   35000.0     5         19.52       917.97     E       E2            320.0          2 years              2  ...  2.0
1            1   18000.0     5         18.49       461.90     D       D2         219843.0          5 years              0  ...  NaN
2            2   12000.0     5         16.99       298.17     D       D3          31698.0          8 years              0  ...  4.0
799997  799997    6000.0     3         13.33       203.12     C       C3           2582.0        10+ years              1  ...  4.0
799998  799998   19200.0     3          6.92       592.14     A       A4            151.0        10+ years              0  ...  5.0
799999  799999    9000.0     3         11.06       294.91     B       B3             13.0          5 years              0  ...  2.0

(excerpt — intermediate columns omitted)

4.6.1 Missing Values

print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')
There are 22 columns in train dataset with missing values.

Count the features with a missing ratio above 50%:

have_null_fea_dict = (data_train.isnull().sum()/len(data_train)).to_dict()
fea_null_moreThanHalf = {}
for key,value in have_null_fea_dict.items():
    if value > 0.5:
        fea_null_moreThanHalf[key] = value
print(fea_null_moreThanHalf)
{}

Visualize the NaN ratios:

missing = data_train.isnull().sum()/len(data_train)
missing = missing[missing > 0]
missing = missing.sort_values()
missing.plot.bar()

4.6.2 Infinite Values

for col in data_train.columns.values:
    if data_train[col].dtype != object and np.isinf(data_train[col]).any():
        print("{} has infinity values".format(col))

This dataset has no infinite values; if there were any, the code below converts them to NaN so they can be handled together with the missing values (note that replace returns a new DataFrame, so assign the result back):

# data_train = data_train.replace([np.inf, -np.inf], np.nan)
# NaN visualization
missing = data_train.isnull().sum()/len(data_train)
missing = missing[missing > 0]
missing = missing.sort_values()
missing.plot.bar()

4.6.3 Single-Value Columns

one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1]


one_value_fea_test = [col for col in data_test_a.columns if data_test_a[col].nunique() <= 1]
one_value_fea, one_value_fea_test
(['policyCode'], ['policyCode'])
print(f'There are {len(one_value_fea)} columns in train dataset with one unique value.')
print(f'There are {len(one_value_fea_test)} columns in test dataset with one unique value.')
There are 1 columns in train dataset with one unique value.
There are 1 columns in test dataset with one unique value.

4.7 Split Numeric and Categorical Features

numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(data_train.columns)))

4.8 Categorical Features

category_fea
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
for col in category_fea:
    print('Feature [{}] has [{}] unique classes'.format(col, data_train[col].nunique()))
    data_train[col].hist()
    plt.figure()

4.9.1 Split Numeric Features into Continuous and Discrete

def get_numerical_serial_fea(data, feas):
    # continuous (serial) features
    numerical_serial_fea = []
    # discrete features
    numerical_noserial_fea = []
    for fea in feas:
        uniques = data[fea].nunique()
        # treat a feature with <= 10 unique values as discrete, otherwise continuous
        if uniques <= 10:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea, numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(data_train,numerical_fea)
numerical_serial_fea
['id',
 'loanAmnt',
 'interestRate',
...
 'n10',
 'n13',
 'n14']
numerical_noserial_fea
['term',
 'homeOwnership',
 'verificationStatus',
 'isDefault',
 'initialListStatus',
 'applicationType',
 'policyCode',
 'n11',
 'n12']

4.9.2 Inspect the Discrete Features

data_train['term'].value_counts()
3    606902
5    193098
Name: term, dtype: int64
data_train['homeOwnership'].value_counts()
0    395732
1    317660
2     86309
3       185
5        81
4        33
Name: homeOwnership, dtype: int64
data_train['verificationStatus'].value_counts()
1    309810
2    248968
0    241222
Name: verificationStatus, dtype: int64
data_train['initialListStatus'].value_counts()
0    466438
1    333562
Name: initialListStatus, dtype: int64
data_train['applicationType'].value_counts()
0    784586
1     15414
Name: applicationType, dtype: int64
data_train['policyCode'].value_counts() # a single value — useless
1.0    800000
Name: policyCode, dtype: int64
data_train['n11'].value_counts() # severely imbalanced; check the default rate per value below
0.0    729682
1.0       540
2.0        24
4.0         1
3.0         1
Name: n11, dtype: int64
data_train.groupby('n11').sum()['isDefault'] / data_train['n11'].value_counts()
0.0    0.199276
1.0    0.201852
2.0    0.083333
3.0    1.000000
4.0    0.000000
dtype: float64
data_train['n12'].value_counts() # severely imbalanced; check the default rate per value below
0.0    757315
1.0      2281
2.0       115
3.0        16
4.0         3
Name: n12, dtype: int64
data_train.groupby('n12').sum()['isDefault'] / data_train['n12'].value_counts()
0.0    0.202093
1.0    0.225340
2.0    0.217391
3.0    0.375000
4.0    0.000000
dtype: float64

4.10.1 Visualize the Distributions of the Numeric Features

f = pd.melt(data_train, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")  # distplot is deprecated in newer seaborn; use histplot/displot there

  • If a feature's distribution is far from normal, a log transform can partially correct the skew.
  • In some cases approximate normality helps models converge faster, and some models assume normally distributed inputs (e.g. GMM, KNN); in practice it is enough to keep features from being heavily skewed, since extreme skew can hurt predictions.
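As a quick, self-contained illustration of why the log helps, the sketch below uses synthetic lognormal data as a stand-in for a right-skewed feature like loanAmnt (the `skewness` helper is illustrative, not part of the competition code):

```python
import numpy as np

def skewness(a):
    # standardized third central moment
    a = np.asarray(a, dtype=float)
    m, s = a.mean(), a.std()
    return ((a - m) ** 3).mean() / s ** 3

rng = np.random.default_rng(0)
# synthetic right-skewed data, roughly loanAmnt-shaped
x = rng.lognormal(mean=10.0, sigma=0.6, size=100_000)

# the log transform collapses the long right tail
print(skewness(x), skewness(np.log(x)))  # strongly positive vs. close to 0
```

scipy.stats.skew computes the same statistic if SciPy is available.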

4.10.2 Plot loanAmnt Before and After a Log Transform

plt.figure(figsize=(16,12))
plt.suptitle('Transaction Values Distribution', fontsize=22)

plt.subplot(221)
sub_plot_1 = sns.distplot(data_train['loanAmnt'])
sub_plot_1.set_title("loanAmnt Distribuition", fontsize=18)
sub_plot_1.set_xlabel("")
sub_plot_1.set_ylabel("Probability", fontsize=15)

plt.subplot(222)
sub_plot_2 = sns.distplot(np.log(data_train['loanAmnt']))
sub_plot_2.set_title("loanAmnt (Log) Distribuition", fontsize=18)
sub_plot_2.set_xlabel("")
sub_plot_2.set_ylabel("Probability", fontsize=15)

4.11 Distribution of the Categorical Features by Default Class

train_loan_fr = data_train.loc[data_train['isDefault'] == 1]
train_loan_nofr = data_train.loc[data_train['isDefault'] == 0]
fig, axes = plt.subplots(len(category_fea), 2, figsize=(15, 8))
for ax, col in zip(axes, category_fea):
    train_loan_fr.groupby(col)[col].count().plot(kind='barh', ax=ax[0], title='Count of {} fraud'.format(col))
    train_loan_nofr.groupby(col)[col].count().plot(kind='barh', ax=ax[1], title='Count of {} non-fraud'.format(col))

The version below should not be used: value_counts orders the categories by frequency, so the two panels end up in different orders and cannot be compared.

fig, axes = plt.subplots(len(category_fea), 2, figsize=(15, 8))
for ax, col in zip(axes, category_fea):
    train_loan_fr[col].value_counts().plot(kind='barh', ax=ax[0], title='Count of {} fraud'.format(col))
    train_loan_nofr[col].value_counts().plot(kind='barh', ax=ax[1], title='Count of {} non-fraud'.format(col))
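If value_counts is still preferred, appending sort_index restores a fixed category order so the two panels stay comparable. A small self-contained illustration (the `grades` series is made up):

```python
import pandas as pd

grades = pd.Series(['B', 'A', 'B', 'C', 'B'])

# value_counts orders by frequency: B comes first
print(grades.value_counts())

# sort_index restores the category order A, B, C,
# making side-by-side panels directly comparable
print(grades.value_counts().sort_index())
```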

for col in numerical_serial_fea:
    fig, ((ax1, ax2)) = plt.subplots(1, 2, figsize=(15, 6))
    data_train.loc[data_train['isDefault'] == 1] \
        [col].apply(lambda x: np.sign(x) * np.log(1 + np.abs(x))) \
        .plot(kind='hist',
              bins=100,
              title='Log {} - Fraud'.format(col),
              color='r',
              xlim=(-3, 10),
             ax= ax1)
    data_train.loc[data_train['isDefault'] == 0] \
        [col].apply(lambda x: np.sign(x) * np.log(1 + np.abs(x))) \
        .plot(kind='hist',
              bins=100,
              title='Log {} - Not Fraud'.format(col),
              color='b',
              xlim=(-3, 10),
             ax=ax2)
    plt.figure()

4.12 Loan Amount: Defaulters vs. Non-Defaulters

percent = data_train.groupby('isDefault')[['loanAmnt']].count()
percent = percent.reset_index()

sum_amnt = data_train.groupby('isDefault')['loanAmnt'].sum()
sum_amnt = sum_amnt.reset_index()

plt.figure(figsize=(12,5))
plt.subplot(121)
plot_tr_1 = sns.barplot(x='isDefault', y='loanAmnt',  dodge=True, data=percent)
plot_tr_1.set_title("Percent of Frauds  \n 0: good user | 1: bad user", fontsize=14)

# annotate the bars with percentages
for p in plot_tr_1.patches:
    height = p.get_height()
    plot_tr_1.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/len(data_train) * 100),
            ha="center", fontsize=15)     


plt.subplot(122)

plot_tr_2 = sns.barplot(x='isDefault', y='loanAmnt',  dodge=True, data=sum_amnt)
plot_tr_2.set_title("Total Amount in loanAmnt  \n 0: good user | 1: bad user", fontsize=14)

# annotate the bars with percentages
for p in plot_tr_2.patches:
    height = p.get_height()
    plot_tr_2.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/data_train['loanAmnt'].sum() * 100),
            ha="center", fontsize=15)     

4.13 Date Features

# New feature issueDateDT: number of days between issueDate and the
# earliest issue date in the dataset (2007-06-01)

# convert to datetime
data_train['issueDate'] = pd.to_datetime(data_train['issueDate'], format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_train['issueDateDT'] = data_train['issueDate'].apply(lambda x: x - startdate).dt.days

data_test_a['issueDate'] = pd.to_datetime(data_test_a['issueDate'], format='%Y-%m-%d')
data_test_a['issueDateDT'] = data_test_a['issueDate'].apply(lambda x: x - startdate).dt.days
plt.hist(data_train['issueDateDT'], label='train');
plt.hist(data_test_a['issueDateDT'], label='test');
plt.legend();
plt.title('Distribution of issueDateDT dates');
# train and test overlap in issueDateDT, so a time-based validation split would be unwise
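That overlap claim is easy to verify by comparing the two ranges. A sketch with synthetic stand-ins for the real issueDateDT columns (the values here are made up; with the actual data you would use data_train['issueDateDT'] and data_test_a['issueDateDT']):

```python
import pandas as pd

# synthetic stand-ins for train/test issueDateDT
train_dt = pd.Series([0, 30, 2500, 4018])
test_dt = pd.Series([61, 1200, 3900, 4201])

# intersection of the two date ranges
lo = max(train_dt.min(), test_dt.min())
hi = min(train_dt.max(), test_dt.max())
print((lo, hi))  # a non-empty interval (lo <= hi) means the periods overlap
```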

4.14 Pivot-Table View

# Pivot table: the index can contain multiple keys, columns is optional, and the
# aggregation function aggfunc is applied to the fields listed in values.
pivot = pd.pivot_table(data_train, index=['grade'], columns=['issueDateDT'], values=['loanAmnt'], aggfunc=np.sum)
            loanAmnt
issueDateDT        0        30       61       92      122  ...
grade
A                NaN   53650.0  42000.0  19500.0  34425.0  ...
B                NaN   13000.0  24000.0  32125.0   7025.0  ...
C                NaN   68750.0   8175.0  10000.0  61800.0  ...
D                NaN       NaN   5500.0   2850.0  28625.0  ...
E             7500.0       NaN  10000.0      NaN  17975.0  ...
F                NaN       NaN  31250.0   2125.0      NaN  ...
G                NaN       NaN      NaN      NaN      NaN  ...

(excerpt — the full pivot spans issueDateDT values from 0 up to 4201)

4.15 Generate a Report with pandas_profiling

# Without minimal=True the report also computes correlations; with this many
# rows, the Kendall correlation alone takes a very long time.
pfr = pandas_profiling.ProfileReport(data_train, minimal=True) 
pfr.to_file("./example.html")

The report is written to example.html.

Method 2: disable the correlation computations explicitly

profile = data_train.profile_report(
    title="Report without correlations",
    correlations={
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False},
    },
)
Last modified: July 30, 2021, 11:56 PM