2019-10-26 IEEE Competition Retrospective

IEEE-CIS-Fraud-Detection

Posted by lambda on October 26, 2019

Contact: github: lambda_xmu

Competition Description

Predict the probability that an online transaction is fraudulent. This is a binary classification problem with label isFraud. The data consists of two tables, identity and transaction.

Data

Transaction

Categorical Features

  • ProductCD: product code, the product for each transaction
  • card1 - card6: payment card information (card type, card category, country, issuing bank, etc.)
  • addr1, addr2: purchaser-related; billing region and billing country respectively
  • P_emaildomain: purchaser email domain
  • R_emaildomain: recipient email domain
  • M1 - M9: match information, such as names on card and address, etc.

Other

  • TransactionAMT: transaction payment amount in USD
  • dist: distance
  • C1-C14: counting, e.g. how many addresses are associated with the payment card, etc.
  • D1-D15: timedelta, e.g. days between previous transactions, etc.
  • Vxxx: features engineered by Vesta, including ranking, counting and other entity relations, e.g. how many times the payment card associated with an IP and email or address appeared within a 24-hour window.

Identity

Categorical Features

  • DeviceType
  • DeviceInfo
  • id_12 - id_38

Other

  • Network connection information (IP, ISP, Proxy, etc.)
  • Digital signatures associated with transactions (UA/browser/os/version, etc.)

TransactionDT is a timedelta from a given reference datetime, not an actual timestamp.

Read Data

Multiprocessing

import pandas as pd
# option 1
%%time
train_transaction = pd.read_csv('data/train_transaction.csv', index_col='TransactionID')
test_transaction = pd.read_csv('data/test_transaction.csv', index_col='TransactionID')
train_identity = pd.read_csv('data/train_identity.csv', index_col='TransactionID')
test_identity = pd.read_csv('data/test_identity.csv', index_col='TransactionID')
print ("Data is loaded!")

# CPU times: user 34.4 s, sys: 4.18 s, total: 38.6 s
# Wall time: 38.8 s

# option 2
%%time
files = ['data/test_identity.csv',
         'data/test_transaction.csv',
         'data/train_identity.csv',
         'data/train_transaction.csv']

import multiprocessing

def load_data(file):
    return pd.read_csv(file)

with multiprocessing.Pool() as pool:
    test_identity, test_transaction, train_identity, train_transaction = pool.map(load_data, files)

# CPU times: user 3.71 s, sys: 8 s, total: 11.7 s
# Wall time: 35.7 s

As the timings show, using multiple processes reduces the time spent loading the data.

Data Minification

Based on each column's dtype and its minimum and maximum values, downcast the dtype to int8/int16/int32/int64 or float32/float64.

import numpy as np

def reduce_mem_usage1(props):
    # NaNs in numeric columns are filled with (column minimum - 1)
    start_mem_usg = props.memory_usage().sum() / 1024**2
    print("Memory usage of properties dataframe is :",start_mem_usg," MB")
    NAlist = [] # Keeps track of columns that have missing values filled in.
    for col in props.columns:
        if props[col].dtype != object:  # Exclude strings

            # Print current column type
            print("******************************")
            print("Column: ",col)
            print("dtype before: ",props[col].dtype)

            # make variables for Int, max and min
            IsInt = False
            mx = props[col].max()
            mn = props[col].min()

            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(props[col]).all():
                NAlist.append(col)
                props[col].fillna(mn-1,inplace=True)

            # test if column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = (props[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True


            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mx < 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mx < 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64)

            # Make float datatypes 32 bit
            else:
                props[col] = props[col].astype(np.float32)

            # Print new column type
            print("dtype after: ",props[col].dtype)
            print("******************************")

    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024**2
    print("Memory usage is: ",mem_usg," MB")
    print("This is ",100*mem_usg/start_mem_usg,"% of the initial size")
    return props, NAlist

# Memory usage of properties dataframe is : 1775.1522827148438  MB
# ___MEMORY USAGE AFTER COMPLETION:___
# Memory usage is:  452.7989959716797  MB
# This is  25.507614213197968 % of the initial size

After applying reduce_mem_usage1, memory usage drops by about 75%!
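A minimal usage sketch, applied to the frames loaded above; the second return value records which columns had their NaNs filled with min-1:

train_transaction, train_na_cols = reduce_mem_usage1(train_transaction)
test_transaction, test_na_cols = reduce_mem_usage1(test_transaction)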

EDA

Label Distribution

A crucial first step of EDA is checking the label distribution and whether the classes are balanced.

train_transaction['isFraud'].value_counts().plot(kind='bar',figsize=(8, 5))

In this IEEE competition the data is highly imbalanced.
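A quick way to quantify the imbalance (a sketch; the exact ratio depends on the data):

train_transaction['isFraud'].value_counts(normalize=True)
# isFraud = 1 accounts for only a few percent of all transactions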

Time

Train vs Test

train_transaction['TransactionDT'].plot(kind='hist',
                                        figsize=(15, 5),
                                        label='train',
                                        bins=50,
                                        title='Train vs Test TransactionDT distribution')
test_transaction['TransactionDT'].plot(kind='hist',
                                       label='test',
                                       bins=50)

The plot shows that the train and test sets are split by time.

Time with Label

import datetime
import matplotlib.pyplot as plt

startdate = datetime.datetime.strptime('2017-12-01', '%Y-%m-%d')
train_transaction['TransactionDT'] = train_transaction['TransactionDT'].apply(lambda x: (startdate + datetime.timedelta(seconds = x)))
test_transaction['TransactionDT'] = test_transaction['TransactionDT'].apply(lambda x: (startdate + datetime.timedelta(seconds = x)))

fig, ax1 = plt.subplots(figsize=(16, 6))
train_transaction.set_index('TransactionDT').resample('D').mean()['isFraud'].plot(ax=ax1, color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_ylabel('isFraud mean', color='blue', fontsize=14)
ax2 = ax1.twinx()
train_transaction['TransactionDT'].dt.floor('d').value_counts().sort_index().plot(ax=ax2, color='tab:orange');
ax2.tick_params(axis='y', labelcolor='tab:orange');
ax2.set_ylabel('Number of training examples', color='tab:orange', fontsize=14);
ax2.grid(False)

The blue curve is the daily fraud rate and the orange curve is the daily transaction count. The plot reveals how the fraud rate evolves over time and how it relates to the daily transaction volume.

Day and Hour Feature

def make_day_feature(df, offset=0, tname='TransactionDT'):
    """
    Creates a day of the week feature, encoded as 0-6.

    Parameters:
    -----------
    df : pd.DataFrame
        df to manipulate.
    offset : float (default=0)
        offset (in days) to shift the start/end of a day.
    tname : str
        Name of the time column in df.
    """
    # found a good offset is 0.58
    days = df[tname] / (3600*24)
    encoded_days = np.floor(days-1+offset) % 7
    return encoded_days

def make_hour_feature(df, tname='TransactionDT'):
    """
    Creates an hour of the day feature, encoded as 0-23.

    Parameters:
    -----------
    df : pd.DataFrame
        df to manipulate.
    tname : str
        Name of the time column in df.
    """
    hours = df[tname] / (3600)
    encoded_hours = np.floor(hours) % 24
    return encoded_hours
vals = plt.hist(train['TransactionDT'] / (3600*24), bins=1800)
plt.xlim(70, 78)
plt.xlabel('Days')
plt.ylabel('Number of transactions')
plt.ylim(0,1000)

The transaction volume clearly has a daily periodicity. Use make_day_feature to construct a day-of-week feature, with offset adjusting where a day starts, and make_hour_feature to construct an hour-of-day feature:

plt.figure(figsize=(10,7))
train['hours'] = make_hour_feature(train)
plt.plot(train.groupby('hours').mean()['isFraud'], color='k')
ax = plt.gca()  # 获得当前的Axes对象ax
ax2 = ax.twinx()
_ = ax2.hist(train['hours'], alpha=0.3, bins=24)
ax.set_xlabel('Encoded hour')
ax.set_ylabel('Fraction of fraudulent transactions')
ax2.set_ylabel('Number of transactions')

Around hour 4 the transaction volume is at its lowest while the fraction of fraud peaks, so the hour of day is clearly correlated with the fraud rate.

TransactionAmt

import seaborn as sns

fig, ax = plt.subplots(1, 2, figsize=(18,4))

time_val = train_transaction['TransactionAmt'].values

sns.distplot(time_val, ax=ax[0], color='r')
ax[0].set_title('Distribution of TransactionAmt', fontsize=14)
ax[0].set_xlim([min(time_val), max(time_val)])

sns.distplot(np.log(time_val), ax=ax[1], color='b')
ax[1].set_title('Distribution of LOG TransactionAmt', fontsize=14)
ax[1].set_xlim([min(np.log(time_val)), max(np.log(time_val))])
plt.show()


fig, ax = plt.subplots(1, 2, figsize=(18,4))

time_val = train_transaction.loc[train_transaction['isFraud'] == 1]['TransactionAmt'].values

sns.distplot(np.log(time_val), ax=ax[0], color='r')
ax[0].set_title('Distribution of LOG TransactionAmt, isFraud=1', fontsize=14)
ax[0].set_xlim([min(np.log(time_val)), max(np.log(time_val))])

time_val = train_transaction.loc[train_transaction['isFraud'] == 0]['TransactionAmt'].values

sns.distplot(np.log(time_val), ax=ax[1], color='b')
ax[1].set_title('Distribution of LOG TransactionAmt, isFraud=0', fontsize=14)
ax[1].set_xlim([min(np.log(time_val)), max(np.log(time_val))])
plt.show()

For transaction amounts or any feature with a wide value range, taking the logarithm is a useful transformation: the relative ordering of values is preserved while the range shrinks dramatically, which makes gradient updates easier for non-tree models.
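As a small sketch, the log-transformed amount can also be kept as an explicit feature (np.log1p is assumed here so the transform stays defined at zero):

train_transaction['TransactionAmt_log'] = np.log1p(train_transaction['TransactionAmt'])
test_transaction['TransactionAmt_log'] = np.log1p(test_transaction['TransactionAmt'])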

Raw Feature EDA

def plot_numerical(feature):
    """
    Plot some information about a numerical feature for both train and test set.
    Args:
        feature (str): name of the column in DataFrame
    """
    fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(16, 18))
    sns.kdeplot(train[feature], ax=axes[0][0], label='Train');
    sns.kdeplot(test[feature], ax=axes[0][0], label='Test');

    sns.kdeplot(train[train['isFraud']==0][feature], ax=axes[0][1], label='isFraud 0')
    sns.kdeplot(train[train['isFraud']==1][feature], ax=axes[0][1], label='isFraud 1')

    test[feature].index += len(train)
    axes[1][0].plot(train[feature], '.', label='Train');
    axes[1][0].plot(test[feature], '.', label='Test');
    axes[1][0].set_xlabel('row index');
    axes[1][0].legend()
    test[feature].index -= len(train)

    axes[1][1].plot(train[train['isFraud']==0][feature], '.', label='isFraud 0');
    axes[1][1].plot(train[train['isFraud']==1][feature], '.', label='isFraud 1');
    axes[1][1].set_xlabel('row index');
    axes[1][1].legend()

    pd.DataFrame({'train': [train[feature].isnull().sum()], 'test': [test[feature].isnull().sum()]}).plot(kind='bar', rot=0, ax=axes[2][0]);
    pd.DataFrame({'isFraud 0': [train[(train['isFraud']==0) & (train[feature].isnull())][feature].shape[0]],
                  'isFraud 1': [train[(train['isFraud']==1) & (train[feature].isnull())][feature].shape[0]]}).plot(kind='bar', rot=0, ax=axes[2][1]);

    fig.suptitle(feature, fontsize=18);
    axes[0][0].set_title('Train/Test KDE distribution');
    axes[0][1].set_title('Target value KDE distribution');
    axes[1][0].set_title('Index versus value: Train/Test distribution');
    axes[1][1].set_title('Index versus value: Target distribution');
    axes[2][0].set_title('Number of NaNs');
    axes[2][1].set_title('Target value distribution among NaN values');

def plot_categorical(feature, train=train, test=test, target='isFraud', values=10):
    """
    Plotting distribution for the selected amount of most frequent values between train and test
    along with distibution of target
    Args:
        train (pandas.DataFrame): training set
        test (pandas.DataFrame): testing set
        feature (str): name of the feature
        target (str): name of the target feature
        values (int): amount of most frequest values to look at
    """
    df_train = pd.DataFrame(data={feature: train[feature], 'isTest': 0})
    df_test = pd.DataFrame(data={feature: test[feature], 'isTest': 1})
    df = pd.concat([df_train, df_test], ignore_index=True)
    df = df[df[feature].isin(df[feature].value_counts(dropna=False).head(values).index)]
    train = train[train[feature].isin(train[feature].value_counts(dropna=False).head(values).index)]
    fig, axes = plt.subplots(2, 1, figsize=(14, 12))
    sns.countplot(data=df.fillna('NaN'), x=feature, hue='isTest', ax=axes[0]);
    sns.countplot(data=train[[feature, target]].fillna('NaN'), x=feature, hue=target, ax=axes[1]);
    axes[0].set_title('Train / Test distibution of {} most frequent values'.format(values));
    axes[1].set_title('Train distibution by {} of {} most frequent values'.format(target, values));
    axes[0].legend(['Train', 'Test']);

def _desc(data, col, label):
    '''
    return: count/mean/std/min/25%/50%/75%/max/unique values/NaNs/NaNs share
    '''
    d0 = data.describe().reset_index()
    d0.columns = [col, label]
    return d0.append({col:'unique values', label:data.unique().shape[0]}, ignore_index=True) \
             .append({col:'NaNs', label:data.isnull().sum()}, ignore_index=True) \
             .append({col:'NaNs share', label:np.round(data.isnull().sum() / data.shape[0], 4)}, ignore_index=True)

LABEL, ID, TIME = 'isFraud', 'TransactionID', 'TransactionDT'  # column names assumed by the helpers below

def desc(col):
    d0 = _desc(train[col], col, 'Train')
    d1 = _desc(train.loc[train[LABEL] == 1, col], col, 'Train fraud')
    d2 = _desc(train.loc[train[LABEL] == 0, col], col, 'Train Not fraud')
    d3 = _desc(test[col], col, 'Test')
    dd = d0.merge(d1).merge(d2).merge(d3)
    display(dd)

    if col not in [ID]:
        N = 10
        d0 = train[[LABEL, col]].fillna(-999).groupby(col)[LABEL].agg(['size','mean','sum']).reset_index().sort_values('size', ascending=False).reset_index(drop=True)
        d1 = test[[ID,col]].fillna(-999).groupby(col)[ID].count().reset_index()
        dd = d0.merge(d1, how='left', on=col).head(N)
        dd = dd.rename({'size':'Count in train (desc)','mean':'Mean target','sum':'Sum target','TransactionID':'Count in test'}, axis=1)
        display(dd)

def numeric(col):
    plot_numerical(col)
    desc(col)

def categorical(col):
    plot_categorical(col)
    desc(col)

def eda(col):
    if col not in [LABEL, TIME]:
        categorical(col) if train[col].dtype == 'object' else numeric(col)

For numerical features the resulting figure is laid out as follows: the left column compares train and test, the right column compares positive and negative samples; the first row shows KDE density plots, the second row shows value-versus-index scatter plots (useful for spotting train/test drift; when there is no explicit time column, the row index can serve as the x-axis), and the third row visualizes missing values. See, for example, feature V2.

For categorical features, the 10 most frequent values are plotted for train, test and each label: the upper plot shows the train/test distribution difference, and the lower plot shows how the categories relate to the label:
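Usage is then one call per column, e.g. for the features mentioned above (this assumes train, test, LABEL, ID and TIME are defined as in the helpers):

eda('V2')         # numerical feature
eda('ProductCD')  # categorical feature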

Other

unique

For categorical variables it is useful to check the number of distinct values (nunique) per column, e.g. the C features in this competition:

plt.figure(figsize=(10, 7))
d_features = list(train_transaction.columns[16:30])
uniques = [len(train_transaction[col].unique()) for col in d_features]
sns.set(font_scale=1.2)
ax = sns.barplot(d_features, uniques, log=True)
ax.set(xlabel='Feature', ylabel='log(unique count)', title='Number of unique values per feature TRAIN')
for p, uniq in zip(ax.patches, uniques):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            uniq,
            ha="center")

Binning

Email domains can be binned by provider, e.g. gmail → google and icloud.com → apple, while uncommon domains are grouped into other.

emails = {'gmail': 'google', 'att.net': 'att', 'twc.com': 'spectrum', 'scranton.edu': 'other', 'optonline.net': 'other', 'hotmail.co.uk': 'microsoft', 'comcast.net': 'other', 'yahoo.com.mx': 'yahoo', 'yahoo.fr': 'yahoo', 'yahoo.es': 'yahoo', 'charter.net': 'spectrum', 'live.com': 'microsoft', 'aim.com': 'aol', 'hotmail.de': 'microsoft', 'centurylink.net': 'centurylink', 'gmail.com': 'google', 'me.com': 'apple', 'earthlink.net': 'other', 'gmx.de': 'other', 'web.de': 'other', 'cfl.rr.com': 'other', 'hotmail.com': 'microsoft', 'protonmail.com': 'other', 'hotmail.fr': 'microsoft', 'windstream.net': 'other', 'outlook.es': 'microsoft', 'yahoo.co.jp': 'yahoo', 'yahoo.de': 'yahoo', 'servicios-ta.com': 'other', 'netzero.net': 'other', 'suddenlink.net': 'other', 'roadrunner.com': 'other', 'sc.rr.com': 'other', 'live.fr': 'microsoft', 'verizon.net': 'yahoo', 'msn.com': 'microsoft', 'q.com': 'centurylink', 'prodigy.net.mx': 'att', 'frontier.com': 'yahoo', 'anonymous.com': 'other', 'rocketmail.com': 'yahoo', 'sbcglobal.net': 'att', 'frontiernet.net': 'yahoo', 'ymail.com': 'yahoo', 'outlook.com': 'microsoft', 'mail.com': 'other', 'bellsouth.net': 'other', 'embarqmail.com': 'centurylink', 'cableone.net': 'other', 'hotmail.es': 'microsoft', 'mac.com': 'apple', 'yahoo.co.uk': 'yahoo', 'netzero.com': 'other', 'yahoo.com': 'yahoo', 'live.com.mx': 'microsoft', 'ptd.net': 'other', 'cox.net': 'other', 'aol.com': 'aol', 'juno.com': 'other', 'icloud.com': 'apple'}
us_emails = ['gmail', 'net', 'edu']

for c in ['P_emaildomain', 'R_emaildomain']:
    train_transaction[c + '_bin'] = train_transaction[c].map(emails)
    test_transaction[c + '_bin'] = test_transaction[c].map(emails)

    train_transaction[c + '_suffix'] = train_transaction[c].map(lambda x: str(x).split('.')[-1])
    test_transaction[c + '_suffix'] = test_transaction[c].map(lambda x: str(x).split('.')[-1])

    train_transaction[c + '_suffix'] = train_transaction[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')
    test_transaction[c + '_suffix'] = test_transaction[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')

This kind of binning applies to many scenarios: for phone models, for example, mate 20 and mate 30 both belong to the HUAWEI brand.
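A hypothetical sketch of the same idea applied to DeviceInfo (the brand mapping below is illustrative only, not the one used in the competition):

device_brands = {'sm-': 'samsung', 'samsung': 'samsung', 'moto': 'motorola',
                 'huawei': 'huawei', 'mate': 'huawei', 'lg-': 'lg', 'ios': 'apple'}

def bin_device(info):
    # map a raw DeviceInfo string onto a coarse brand bucket; everything else becomes 'other'
    s = str(info).lower()
    for key, brand in device_brands.items():
        if key in s:
            return brand
    return 'other'

train_identity['DeviceInfo_bin'] = train_identity['DeviceInfo'].map(bin_device)
test_identity['DeviceInfo_bin'] = test_identity['DeviceInfo'].map(bin_device)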

Investigate “D” features

Many competition features are anonymized, but some are relatively easy to reverse-engineer, and knowing what they mean is extremely useful for later feature construction. The D features in this competition are a good example:

D1-D15: timedelta, such as days between previous transaction, etc.

# card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
# as a simplifying assumption, treat transactions with identical card information as the same user

columns = ['card1', 'card2', 'card3', 'card4', 'card5', 'card6']
grouped = train.groupby(columns, as_index=False)['TransactionID'].count()

card1 = 18383
card2 = 128
card3 = 150
card4 = 'visa'
card5 = 226
card6 = 'credit'

train_slice = train[(train['card1']==card1)&
                   (train['card2']==card2)&
                   (train['card3']==card3)&
                   (train['card4']==card4)&
                   (train['card5']==card5)&
                   (train['card6']==card6)]

features = ['TransactionID','TransactionDT','ProductCD', 'P_emaildomain', 'R_emaildomain', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15']
train_slice = train_slice.sort_values(['TransactionID'])[features]

# get the number of days from the starting point
train_slice['DaysFromStart'] = np.round(train_slice['TransactionDT']/(60*60*24),0)

train_slice['DaysFromPreviousTransaction'] = train_slice['DaysFromStart'].diff()

The resulting table shows that D3 = DaysFromPreviousTransaction, so D3 is the number of days since the previous transaction. D1 is monotonically increasing, and 481 = 449 + 32, 510 = 481 + 29, so D1 is likely the number of days since the first transaction. D2 equals D1, except on the first transaction, where D1 is 0 and D2 is NaN.

train[(train['D1']==0)&(train['D3']>0)].shape[0]/train.shape[0]
# 0.0021776678971788532

There are exceptions, but they are rare: as the ratio above shows, only about 0.2% of transactions.

Pre-processing

Based on the Investigate "D" features section, missing values of D2 can be filled from D1.
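A minimal sketch of that fill, valid under the relationship inferred above (D2 equals D1 except on a client's first transaction):

for df in [train, test]:
    df['D2'] = df['D2'].fillna(df['D1'])

Below, a few helper functions run simple statistics over pairs of columns to find similar dependencies elsewhere: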

def count_uniques(train, test, pair):
    unique_train = []
    unique_test = []

    for value in train[pair[0]].unique():
        unique_train.append(train[pair[1]][train[pair[0]] == value].value_counts().shape[0])

    for value in test[pair[0]].unique():
        unique_test.append(test[pair[1]][test[pair[0]] == value].value_counts().shape[0])

    pair_values_train = pd.Series(data=unique_train, index=train[pair[0]].unique())
    pair_values_test = pd.Series(data=unique_test, index=test[pair[0]].unique())

    return pair_values_train, pair_values_test

def nans_distribution(train, test, unique_train, unique_test, pair):
    train_nans_per_category = []
    test_nans_per_category = []

    for value in unique_train.unique():
        train_nans_per_category.append(train[train[pair[0]].isin(list(unique_train[unique_train == value].index))][pair[1]].isna().sum())

    for value in unique_test.unique():
        test_nans_per_category.append(test[test[pair[0]].isin(list(unique_test[unique_test == value].index))][pair[1]].isna().sum())

    pair_values_train = pd.Series(data=train_nans_per_category, index=unique_train.unique())
    pair_values_test = pd.Series(data=test_nans_per_category, index=unique_test.unique())

    return pair_values_train, pair_values_test

def fill_card_nans(train, test, pair_values_train, pair_values_test, pair):
    print(f'In train{[pair[1]]} there are {train[pair[1]].isna().sum()} NaNs' )
    print(f'In test{[pair[1]]} there are {test[pair[1]].isna().sum()} NaNs' )

    print('Filling train...')

    for value in pair_values_train[pair_values_train == 1].index:
        train.loc[train[pair[0]] == value, pair[1]] = train.loc[train[pair[0]] == value, pair[1]].value_counts().index[0]

    print('Filling test...')

    for value in pair_values_test[pair_values_test == 1].index:
        test.loc[test[pair[0]] == value, pair[1]] = test.loc[test[pair[0]] == value, pair[1]].value_counts().index[0]

    print(f'In train{[pair[1]]} there are {train[pair[1]].isna().sum()} NaNs' )
    print(f'In test{[pair[1]]} there are {test[pair[1]].isna().sum()} NaNs' )

    return train, test

The count_uniques function checks, for each value of the first column in the pair, how many distinct values the second column takes; if every category maps to exactly one value, the pair is essentially functionally dependent. nans_distribution counts the number of NaNs in the second column for each unique-count bucket. fill_card_nans fills the second column's missing values for categories whose unique count is 1, using the single (most frequent) observed value.

card1 can then be used to fill card2 through card6, for example card3:

unique_values_train, unique_values_test = count_uniques(train, test, ('card1', 'card3'))
pd.concat([unique_values_train.value_counts(), unique_values_test.value_counts()], axis=1).rename(columns={0: 'train', 1: 'test'})

train_nan_dist, test_nan_dist = nans_distribution(train, test, unique_values_train, unique_values_test, ('card1', 'card3'))
pd.concat([train_nan_dist, test_nan_dist], axis=1).rename(columns={0: 'train', 1: 'test'})

train, test = fill_card_nans(train, test, unique_values_train, unique_values_test, ('card1', 'card3'))

# In train['card3'] there are 1565 NaNs
# In test['card3'] there are 3002 NaNs
# Filling train...
# Filling test...
# In train['card3'] there are 17 NaNs
# In test['card3'] there are 48 NaNs

Beyond the card columns, card1 can also be used to fill other columns:

depend_features = []

# for one example card1 value (13926), find the columns that take a single value, i.e. columns that depend on card1
for col in train.columns:
    if train[train['card1'] == 13926][col].value_counts().shape[0] == 1:
        depend_features.append(col)

# ['card1', 'card2', 'card3', 'card4', 'card6', 'addr2', 'dist2', 'C3', 'C7', 'C12', 'D6', 'M1', 'V1', 'V2', 'V8', 'V9', 'V14', 'V15', 'V16', 'V27', 'V28', 'V33', 'V34', 'V37', 'V41', 'V46', 'V47', 'V51', 'V52', 'V57', 'V58', 'V65', 'V68', 'V73', 'V74', 'V77', 'V78', 'V79', 'V88', 'V89', 'V94', 'V107', 'V108', 'V109', 'V110', 'V111', 'V112', 'V113', 'V114', 'V115', 'V116', 'V117', 'V118', 'V119', 'V120', 'V121', 'V122', 'V138', 'V141', 'V142', 'V146', 'V147', 'V161', 'V162', 'V163', 'V167', 'V168', 'V169', 'V170', 'V171', 'V172', 'V173', 'V174', 'V175', 'V176', 'V177', 'V178', 'V179', 'V180', 'V181', 'V182', 'V183', 'V184', 'V185', 'V186', 'V187', 'V188', 'V189', 'V190', 'V191', 'V192', 'V193', 'V194', 'V195', 'V196', 'V197', 'V198', 'V199', 'V200', 'V201', 'V202', 'V203', 'V204', 'V205', 'V206', 'V207', 'V208', 'V209', 'V210', 'V211', 'V212', 'V213', 'V214', 'V215', 'V216', 'V217', 'V218', 'V219', 'V220', 'V223', 'V224', 'V225', 'V226', 'V228', 'V229', 'V230', 'V231', 'V232', 'V233', 'V234', 'V235', 'V236', 'V237', 'V238', 'V239', 'V240', 'V241', 'V242', 'V243', 'V244', 'V246', 'V247', 'V248', 'V249', 'V252', 'V253', 'V254', 'V257', 'V258', 'V260', 'V261', 'V262', 'V263', 'V264', 'V265', 'V266', 'V267', 'V268', 'V269', 'V273', 'V274', 'V275', 'V276', 'V277', 'V278', 'V286', 'V305', 'V311', 'V322', 'V325', 'V328', 'V329', 'V330', 'V331', 'V334', 'V337', 'V338', 'V339']

This can be wrapped up into a single helper:

def fill_pairs(train, test, pairs):
    for pair in pairs:

        unique_train = []
        unique_test = []

        print(f'Pair: {pair}')
        print(f'In train{[pair[1]]} there are {train[pair[1]].isna().sum()} NaNs' )
        print(f'In test{[pair[1]]} there are {test[pair[1]].isna().sum()} NaNs' )

        for value in train[pair[0]].unique():
            unique_train.append(train[pair[1]][train[pair[0]] == value].value_counts().shape[0])

        for value in test[pair[0]].unique():
            unique_test.append(test[pair[1]][test[pair[0]] == value].value_counts().shape[0])

        pair_values_train = pd.Series(data=unique_train, index=train[pair[0]].unique())
        pair_values_test = pd.Series(data=unique_test, index=test[pair[0]].unique())

        print('Filling train...')

        for value in pair_values_train[pair_values_train == 1].index:
            train.loc[train[pair[0]] == value, pair[1]] = train.loc[train[pair[0]] == value, pair[1]].value_counts().index[0]

        print('Filling test...')

        for value in pair_values_test[pair_values_test == 1].index:
            test.loc[test[pair[0]] == value, pair[1]] = test.loc[test[pair[0]] == value, pair[1]].value_counts().index[0]

        print(f'In train{[pair[1]]} there are {train[pair[1]].isna().sum()} NaNs' )
        print(f'In test{[pair[1]]} there are {test[pair[1]].isna().sum()} NaNs' )

    return train, test

pairs = [('card1', 'card2'), ('card1', 'card3')]
train, test = fill_pairs(train, test, pairs)

Data Relaxation

When a feature's distribution differs between the train and test sets, the model struggles with values it has never seen. Data relaxation removes category values whose relative frequency in the train set is more than three times that in the test set (and vice versa), as well as values that occur only rarely in the train set.

def relax_data(df_train, df_test, col):
    cv1 = pd.DataFrame(df_train[col].value_counts().reset_index().rename({col:'train'},axis=1))
    cv2 = pd.DataFrame(df_test[col].value_counts().reset_index().rename({col:'test'},axis=1))
    cv3 = pd.merge(cv1,cv2,on='index',how='outer')
    factor = len(df_test)/len(df_train)
    cv3['train'].fillna(0,inplace=True)
    cv3['test'].fillna(0,inplace=True)
    cv3['remove'] = False
    cv3['remove'] = cv3['remove'] | (cv3['train'] < len(df_train)/10000)
    cv3['remove'] = cv3['remove'] | (factor*cv3['train'] < cv3['test']/3)
    cv3['remove'] = cv3['remove'] | (factor*cv3['train'] > 3*cv3['test'])
    cv3['new'] = cv3.apply(lambda x: x['index'] if x['remove']==False else 0,axis=1)
    cv3['new'],_ = cv3['new'].factorize(sort=True)
    cv3.set_index('index',inplace=True)
    cc = cv3['new'].to_dict()
    df_train[col] = df_train[col].map(cc)
    df_test[col] = df_test[col].map(cc)
    return df_train, df_test

For example, many values of V258 appear only in the train set; after data relaxation the distributions look like this:
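Usage is one call per drifting column, e.g.:

train, test = relax_data(train, test, 'V258')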

Tricks

Every competition has its tricks, and in this one the trick is building a UID, i.e. identifying fraudulent clients, as hinted at in the competition description:

The logic of our labeling is define reported chargeback on the card as fraud transaction (isFraud=1) and transactions posterior to it with either user account, email address or billing address directly linked to these attributes as fraud too. If none of above is reported and found beyond 120 days, then we define as legit transaction (isFraud=0).

As a first attempt, transactions with identical M (card match information), card (payment card information) and P_emaildomain (purchaser email domain) features are treated as the same user.

columns = ['M'+ str(i) for i in range(1,10)]+['card'+ str(i) for i in range(1,7)]+['P_emaildomain']
# identity = pd.concat([train[columns],test[columns]]).drop_duplicates()

identity = test[columns].drop_duplicates()
identity['identity_id'] = list(range(len(identity)))

all_data = pd.concat([train.drop('isFraud', axis=1), test])
all_data = all_data.merge(identity, on=columns, how='left')
all_data = all_data[~pd.isnull(all_data['identity_id'])]

all_data = all_data.groupby('identity_id')['TransactionDT'].agg(['max','min','count']).reset_index()
all_data = all_data.sort_values('min',ascending=True).reset_index(drop=True)

all_data['percent'] = 1/len(all_data)
all_data['percent'] = all_data['percent'].cumsum()

all_data[all_data['min']>max(train['TransactionDT'])].shape[0]/all_data.shape[0]

# 0.6229906923024662

So roughly 62% of the clients appear only in the test set. The challenge in this competition is building a model that can predict unseen clients (not unseen time).

How the Magic Works

First define users via a UID; then aggregate features over the UID; finally drop the UID column. Suppose we have the following 10 transactions:

Using only FeatureX, 70% of the transactions can be classified correctly; using aggregated features computed over the UID (e.g. the mean of FeatureX per UID), every transaction can be classified correctly [similar to ensembling: weak learners become a strong learner]. Note that the UID itself is never used as a feature; see the sketch below.
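A minimal sketch of this aggregation, assuming a uid column has already been built (see the next section) and FeatureX is a placeholder column name:

all_data = pd.concat([train, test], sort=False)
uid_mean = all_data.groupby('uid')['FeatureX'].mean()

train['FeatureX_uid_mean'] = train['uid'].map(uid_mean)
test['FeatureX_uid_mean'] = test['uid'].map(uid_mean)

# the raw uid itself is dropped before training
train = train.drop('uid', axis=1)
test = test.drop('uid', axis=1)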

How to find UID (Unique Identification)

As shown above, more than 60% of the clients in the test set never appear in the train set, and card information alone cannot reliably pin down a UID because of many one-to-many cases. The dataset has roughly 430 columns, so which ones help define a UID? Adversarial validation saves a lot of the manual work.

Adversarial validation

  1. Concatenate the train and test sets, labeling training rows 0 and test rows 1;
  2. Build a classifier to learn the difference between the test and training data;
  3. Take the training samples that most resemble the test set as the validation set and train on the rest [alternatively, use the predicted probabilities as sample weights and run weighted cross-validation];
  4. Check the AUC; ideally it is around 0.5.
from xgboost import XGBClassifier
import catboost as cbt
from sklearn.preprocessing import LabelEncoder

for i in ['D' + str(i) for i in range(1,16)]:
    train[i] = np.floor(train.TransactionDT / (24*60*60)) - train[i]
    test[i] = np.floor(test.TransactionDT / (24*60*60)) - test[i]

features = ['TransactionAmt',
       'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6',
       'addr1', 'addr2', 'dist1', 'dist2', 'P_emaildomain', 'R_emaildomain',
       'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11',
       'C12', 'C13', 'C14', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8',
       'D9', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15', 'M1', 'M2', 'M3', 'M4',
       'M5', 'M6', 'M7', 'M8', 'M9']
all_data = pd.concat([train[features].sample(frac=0.1),
           test[features].sample(frac=0.1)])
all_data['is_this_transaction_in_test_data'] = [0]*train.sample(frac=0.1).shape[0] + [1]*test.sample(frac=0.1).shape[0]

cat_col = all_data.select_dtypes(object).columns
for i in cat_col:
    lbl = LabelEncoder()
    all_data[i] = lbl.fit_transform(all_data[i].astype(str))

cat_list = ['M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9',
            'P_emaildomain', 'ProductCD', 'R_emaildomain', 'card4', 'card6']
cbt_model = cbt.CatBoostClassifier(iterations=1000,learning_rate=0.1,verbose=100,eval_metric='AUC')
cbt_model.fit(all_data.drop(['is_this_transaction_in_test_data'],axis=1),all_data['is_this_transaction_in_test_data'])

from sklearn.metrics import roc_auc_score
y_pred = cbt_model.predict(all_data.drop(['is_this_transaction_in_test_data'],axis=1))
roc_auc_score(all_data['is_this_transaction_in_test_data'], y_pred)

feature = pd.DataFrame({'importance':cbt_model.feature_importances_, 'feature':cbt_model.feature_names_})
feature = feature.sort_values('importance',ascending=False)
feature = feature[feature['importance']!=0]
plt.figure(figsize=(10, 15))
plt.barh(feature['feature'],feature['importance'],height =0.5)

Based on the adversarial-validation feature importances, card1, addr1 and D1 can be used to identify the UID.

features = ['card1', 'addr1', 'D1']
all_data = pd.concat([train[features].sample(frac=0.1),
           test[features].sample(frac=0.1)])
all_data['is_this_transaction_in_test_data'] = [0]*train.sample(frac=0.1).shape[0] + [1]*test.sample(frac=0.1).shape[0]

cat_col = all_data.select_dtypes(object).columns
for i in cat_col:
    lbl = LabelEncoder()
    all_data[i] = lbl.fit_transform(all_data[i].astype(str))

cat_list = ['M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9',
            'P_emaildomain', 'ProductCD', 'R_emaildomain', 'card4', 'card6']
cbt_model = cbt.CatBoostClassifier(iterations=1000,learning_rate=0.1,verbose=100,eval_metric='AUC')
cbt_model.fit(all_data.drop(['is_this_transaction_in_test_data'],axis=1),all_data['is_this_transaction_in_test_data'])

y_pred = cbt_model.predict(all_data.drop(['is_this_transaction_in_test_data'],axis=1))
roc_auc_score(all_data['is_this_transaction_in_test_data'], y_pred)

# 0.90223049623829
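A minimal sketch of building that UID (note the D1 used here is the normalized version computed in the loop above, i.e. transaction day minus D1, which stays constant per client under the "days since first transaction" reading):

for df in [train, test]:
    df['uid'] = df['card1'].astype(str) + '_' + df['addr1'].astype(str) + '_' + df['D1'].astype(str)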

Feature Engineering

Encoding Functions

  1. encode_FE: concatenate train and test and compute frequency encoding
  2. encode_LE: label-encode categorical variables (train and test are encoded together)
  3. encode_AG: compute aggregated features such as group mean and standard deviation
  4. encode_CB: combine two columns into one
  5. encode_AG2: compute aggregated features counting how many unique values of a feature occur within each group
import gc

# FREQUENCY ENCODE TOGETHER
def encode_FE(df1, df2, cols):
    for col in cols:
        df = pd.concat([df1[col],df2[col]])
        vc = df.value_counts(dropna=True, normalize=True).to_dict()
        vc[-1] = -1
        nm = col+'_FE'
        df1[nm] = df1[col].map(vc)
        df1[nm] = df1[nm].astype('float32')
        df2[nm] = df2[col].map(vc)
        df2[nm] = df2[nm].astype('float32')
        print(nm,', ',end='')

# LABEL ENCODE
def encode_LE(col,train=X_train,test=X_test,verbose=True):
    df_comb = pd.concat([train[col],test[col]],axis=0)
    df_comb,_ = df_comb.factorize(sort=True) # Encode the object as an enumerated type or categorical variable.
                                             # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html
    nm = col
    if df_comb.max()>32000:
        train[nm] = df_comb[:len(train)].astype('int32')
        test[nm] = df_comb[len(train):].astype('int32')
    else:
        train[nm] = df_comb[:len(train)].astype('int16')
        test[nm] = df_comb[len(train):].astype('int16')
    del df_comb; x=gc.collect()
    if verbose: print(nm,', ',end='')

# GROUP AGGREGATION MEAN AND STD
def encode_AG(main_columns, uids, aggregations=['mean'], train_df=X_train, test_df=X_test,
              fillna=True, usena=False):
    # AGGREGATION OF MAIN WITH UID FOR GIVEN STATISTICS
    for main_column in main_columns:
        for col in uids:
            for agg_type in aggregations:
                new_col_name = main_column+'_'+col+'_'+agg_type
                temp_df = pd.concat([train_df[[col, main_column]], test_df[[col,main_column]]])
                if usena: temp_df.loc[temp_df[main_column]==-1,main_column] = np.nan
                temp_df = temp_df.groupby([col])[main_column].agg([agg_type]).reset_index().rename(
                                                        columns={agg_type: new_col_name})

                temp_df.index = list(temp_df[col])
                temp_df = temp_df[new_col_name].to_dict()

                train_df[new_col_name] = train_df[col].map(temp_df).astype('float32')
                test_df[new_col_name]  = test_df[col].map(temp_df).astype('float32')

                if fillna:
                    train_df[new_col_name].fillna(-1,inplace=True)
                    test_df[new_col_name].fillna(-1,inplace=True)

                print("'"+new_col_name+"'",', ',end='')

# COMBINE FEATURES
def encode_CB(col1,col2,df1=X_train,df2=X_test):
    nm = col1+'_'+col2
    df1[nm] = df1[col1].astype(str)+'_'+df1[col2].astype(str)
    df2[nm] = df2[col1].astype(str)+'_'+df2[col2].astype(str)
    encode_LE(nm,verbose=False)
    print(nm,', ',end='')

# GROUP AGGREGATION NUNIQUE
def encode_AG2(main_columns, uids, train_df=X_train, test_df=X_test):
    for main_column in main_columns:
        for col in uids:
            comb = pd.concat([train_df[[col]+[main_column]],test_df[[col]+[main_column]]],axis=0)
            mp = comb.groupby(col)[main_column].agg(['nunique'])['nunique'].to_dict()
            train_df[col+'_'+main_column+'_ct'] = train_df[col].map(mp).astype('float32')
            test_df[col+'_'+main_column+'_ct'] = test_df[col].map(mp).astype('float32')
            print(col+'_'+main_column+'_ct, ',end='')

Feature Engineering

The procedure for engineering features is as follows. First you think of an idea and create a new feature. Then you add it to your model and evaluate whether local validation AUC increases or decreases. If AUC increases keep the feature, otherwise discard the feature.

# TRANSACTION AMT CENTS
# many transactions with non-zero cents may be cross-border purchases (currency conversion)
X_train['cents'] = (X_train['TransactionAmt'] - np.floor(X_train['TransactionAmt'])).astype('float32')
X_test['cents'] = (X_test['TransactionAmt'] - np.floor(X_test['TransactionAmt'])).astype('float32')
print('cents, ', end='')

# FREQUENCY ENCODE: ADDR1, CARD1, CARD2, CARD3, P_EMAILDOMAIN
encode_FE(X_train,X_test,['addr1','card1','card2','card3','P_emaildomain'])

# COMBINE COLUMNS CARD1+ADDR1, CARD1+ADDR1+P_EMAILDOMAIN
encode_CB('card1','addr1')
encode_CB('card1_addr1','P_emaildomain')

# FREQUENCY ENOCDE
encode_FE(X_train,X_test,['card1_addr1','card1_addr1_P_emaildomain'])

# GROUP AGGREGATE
encode_AG(['TransactionAmt','D9','D11'],['card1','card1_addr1','card1_addr1_P_emaildomain'],['mean','std'],usena=True)

Feature Selection

  • forward feature selection (using single or groups of features): given a feature set $\{a_1, a_2, \dots, a_n\}$, first pick the best single-feature subset (say $\{a_2\}$) as the first-round selection; then add one feature at a time to build two-feature candidate subsets and keep the best one as the second-round selection, and so on until no better subset can be found. Gradually adding relevant features like this is called forward search; starting from the full set and dropping one irrelevant feature per round is backward search; combining the two, adding relevant features (which are never removed in later rounds) while dropping irrelevant ones each round, is bidirectional search.
  • recursive feature elimination (using single or groups of features): rank features by the learner's feature_importances_ attribute, remove the least important one from the current set, and repeat recursively until the desired number of features remains.
  • permutation importance: take an already trained model and its validation score (e.g. RMSE); say a house-price model has a validation RMSE of 200. Shuffle one feature (e.g. floor area) across the validation rows and re-predict; if the RMSE rises to 500, the importance of that feature is recorded as 500 - 200 = 300 (a sketch using AUC follows this list).
  • correlation analysis: a good feature subset should contain features that are highly correlated with the target.
  • time consistency: do the patterns a feature finds in the present still exist in the future?
  • client consistency: if sample 1 and sample 2 belong to different classes but take identical values on features A and B, the subset {A, B} should not be chosen as the final feature set.
  • train/test distribution analysis
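A minimal permutation-importance sketch for a fitted sklearn-style classifier; AUC is used instead of the RMSE in the example above since AUC is the competition metric:

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def permutation_importance(model, X_valid, y_valid, features, n_repeats=3):
    # drop in validation AUC after shuffling each column; a larger drop means a more important feature
    base_auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
    scores = {}
    for col in features:
        drops = []
        for _ in range(n_repeats):
            X_perm = X_valid.copy()
            X_perm[col] = np.random.permutation(X_perm[col].values)
            drops.append(base_auc - roc_auc_score(y_valid, model.predict_proba(X_perm)[:, 1]))
        scores[col] = np.mean(drops)
    return pd.Series(scores).sort_values(ascending=False)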

A few of these are discussed in more detail below:

Time Consistency

For each feature, build a single-feature model trained on the first month of data and evaluated on the last month; we want both the train and validation AUC to stay above 0.5.
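A sketch of such a per-feature check, assuming a month index column DT_M (built as in the Predict section) and a label Series y:

import lightgbm as lgb
from sklearn.metrics import roc_auc_score

def time_consistency(feature, train_df, y):
    # single-feature model: train on the first month, validate on the last month
    tr = train_df['DT_M'] == train_df['DT_M'].min()
    va = train_df['DT_M'] == train_df['DT_M'].max()
    clf = lgb.LGBMClassifier(n_estimators=200, num_leaves=31)
    clf.fit(train_df.loc[tr, [feature]], y[tr])
    auc_tr = roc_auc_score(y[tr], clf.predict_proba(train_df.loc[tr, [feature]])[:, 1])
    auc_va = roc_auc_score(y[va], clf.predict_proba(train_df.loc[va, [feature]])[:, 1])
    return auc_tr, auc_va  # ideally both stay above 0.5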

Covariate Shift

Covariate shift: check whether a feature has the same distribution in the train and test sets. An AUC around 0.5 means the feature's distribution barely changes between train and test.

import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import gc

def covariate_shift(feature):
    df_train = pd.DataFrame(data={feature: train[feature], 'isTest': 0})
    df_test = pd.DataFrame(data={feature: test[feature], 'isTest': 1})

    # Creating a single dataframe
    df = pd.concat([df_train, df_test], ignore_index=True)

    # Encoding if feature is categorical
    if str(df[feature].dtype) in ['object', 'category']:
        df[feature] = LabelEncoder().fit_transform(df[feature].astype(str))

    # Splitting it to a training and testing set
    X_train, X_test, y_train, y_test = train_test_split(df[feature], df['isTest'], test_size=0.33, random_state=47, stratify=df['isTest'])

    # `params` is assumed to be the shared LGBM parameter dict defined elsewhere in the notebook
    clf = lgb.LGBMClassifier(**params, num_boost_round=500)
    clf.fit(X_train.values.reshape(-1, 1), y_train)
    roc_auc =  roc_auc_score(y_test, clf.predict_proba(X_test.values.reshape(-1, 1))[:, 1])

    del df, X_train, y_train, X_test, y_test
    gc.collect();

    return roc_auc

When a feature shows a high covariate-shift AUC, the data relaxation introduced earlier can be applied.

The V columns can also be reduced. First, group features into blocks by their number of missing values: when several raw features all have, say, 89 missing values, they go into the same block. Below, this grouping is applied to the V features in this competition:

nans_df = train.isna()
nans_groups={}
i_cols = ['V'+str(i) for i in range(1,340)]
for col in train.columns:
    cur_group = nans_df[col].sum()
    try:
        nans_groups[cur_group].append(col)
    except:
        nans_groups[cur_group]=[col]

# -----------------------------------------------------------
# NAN count = 1269
# ['D1', 'V281', 'V282', 'V283', 'V288', 'V289', 'V296', 'V300', 'V301', 'V313', 'V314', 'V315']
# -----------------------------------------------------------

The columns within each block can loosely be assumed to be correlated; there are three ways to handle a block:

  1. Applied PCA on each group individually (a sketch follows this list)
  2. Selected a maximum sized subset of uncorrelated columns from each group
  3. Replaced the entire group with all columns averaged.
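For the first option, a sketch using sklearn PCA on one NaN block (the fillna value and scaling here are assumptions):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_block(train, test, cols, n_components=2, prefix='V_block'):
    # replace a block of correlated V columns with a few principal components
    all_vals = pd.concat([train[cols], test[cols]]).fillna(-1)
    all_vals = StandardScaler().fit_transform(all_vals)
    comps = PCA(n_components=n_components).fit_transform(all_vals)
    for i in range(n_components):
        train[prefix + '_pca' + str(i)] = comps[:len(train), i]
        test[prefix + '_pca' + str(i)] = comps[len(train):, i]
    return train, test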

The second method is explained in more detail below:

First, by NaN count, D11 & V1-V11 form one block; next, plot the correlation matrix of V1-V11:

def make_corr(Vs,Vtitle=''):
    cols = ['TransactionDT'] + Vs
    plt.figure(figsize=(15,15))
    sns.heatmap(train[cols].corr(), cmap='RdBu_r', annot=True, center=0.0, fmt=".2f")
    if Vtitle!='': plt.title(Vtitle,fontsize=14)
    else: plt.title(Vs[0]+' - '+Vs[-1],fontsize=14)
    plt.show()

Columns with correlation above 0.75 are treated as one group, so D11 & V1-V11 splits into [[V1],[V2,V3],[V4,V5],[V6,V7],[V8,V9],[V10,V11]]; within each group, keep the column with the largest number of unique values:

grps = [[1],[2,3],[4,5],[6,7],[8,9],[10,11]]
def reduce_group(grps,c='V'):
    use = []
    for g in grps:
        mx = 0
        vx = g[0]
        for gg in g:
            n = train[c+str(gg)].nunique()
            if n>mx:
                mx = n
                vx = gg
        use.append(vx)    # keep the column with the most unique values in this subset
    print('Use these',use)
reduce_group(grps)

# Use these [1, 3, 4, 6, 8, 11]

In the end, V1, V3, V4, V6, V8, V11 are used in place of V1-V11. The same procedure is applied to the other blocks.

Validation Strategy

Do not trust a single validation split; build several: train on the first four months, skip a month, and predict the last month; train on the first two months, skip two months, and predict the last month; train on the first month, skip three months, and predict the last month.
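A sketch of one such holdout, assuming a month index column DT_M (1 to 6) on the training data:

def month_holdout(X, y, train_months, valid_month):
    # fit on the given months and evaluate on the held-out month, leaving a gap in between
    tr = X['DT_M'].isin(train_months)
    va = X['DT_M'] == valid_month
    return (X[tr], y[tr]), (X[va], y[va])

# e.g. train on the first four months, skip one, validate on the last
(trn_X, trn_y), (val_X, val_y) = month_holdout(X_train, y_train, [1, 2, 3, 4], 6)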

In addition, use CV to see how the model predicts known versus unknown UIDs:

  • XGB model did best predicting known UIDs with AUC = 0.99723
  • LGBM model did best predicting unknown UIDs with AUC = 0.92117
  • CAT model did best predicting questionable UIDs with AUC = 0.98834

An ensemble of the three models gives the best predictions across all UID types.

Predict

Use GroupKFold grouped by month (DT_M) to produce out-of-fold and test predictions.
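DT_M is not constructed in the snippets above; a minimal sketch, reusing the startdate assumption from the Time section and assuming X_train still carries the raw TransactionDT in seconds:

import datetime

startdate = datetime.datetime.strptime('2017-12-01', '%Y-%m-%d')
dt = X_train['TransactionDT'].apply(lambda x: startdate + datetime.timedelta(seconds=x))
X_train['DT_M'] = (dt.dt.year - 2017) * 12 + dt.dt.month  # month index used as the GroupKFold groups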

from sklearn.model_selection import GroupKFold
import xgboost as xgb

oof = np.zeros(len(X_train))
preds = np.zeros(len(X_test))

skf = GroupKFold(n_splits=6)
for i, (idxT, idxV) in enumerate(skf.split(X_train, y_train, groups=X_train['DT_M']) ):
    month = X_train.iloc[idxV]['DT_M'].iloc[0]
    print('Fold',i,'withholding month',month)
    print(' rows of train =',len(idxT),'rows of holdout =',len(idxV))
    clf = xgb.XGBClassifier(
        n_estimators=5000,
        max_depth=12,
        learning_rate=0.02,
        subsample=0.8,
        colsample_bytree=0.4,
        missing=-1,
        eval_metric='auc',
        # USE CPU
        nthread=4,
        tree_method='hist'
        # USE GPU
        #tree_method='gpu_hist'
    )
    h = clf.fit(X_train[cols].iloc[idxT], y_train.iloc[idxT],
            eval_set=[(X_train[cols].iloc[idxV],y_train.iloc[idxV])],
            verbose=100, early_stopping_rounds=100)

    oof[idxV] += clf.predict_proba(X_train[cols].iloc[idxV])[:,1]
    preds += clf.predict_proba(X_test[cols])[:,1]/skf.n_splits
    del h, clf
    x=gc.collect()
print('#'*20)
print ('XGB95 OOF CV=',roc_auc_score(y_train,oof))

OOFPredict 画出分布图,查看是否相似:

plt.hist(oof,bins=100)
plt.ylim((0,5000))
plt.title('XGB OOF')
plt.show()

X_train['oof'] = oof
X_train.reset_index(inplace=True)
X_train[['TransactionID','oof']].to_csv('oof_xgb_95.csv')
X_train.set_index('TransactionID',drop=True,inplace=True)

sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission.isFraud = preds
sample_submission.to_csv('sub_xgb_96.csv',index=False)

plt.hist(sample_submission.isFraud,bins=100)
plt.ylim((0,5000))
plt.title('XGB96 Submission')
plt.show()

Preventing Overfitting

To prevent overfitting, the UID must not be used directly as a feature, because more than 60% of the clients in the test set are new. Instead, extract features aggregated over the UID:

new_features = df.groupby('uid')[columns].agg(['mean'])

This way the model can generalize to unseen clients.

Post processing

All transactions of the same client are either all isFraud = 0 or all isFraud = 1, i.e. their predictions should be identical. The post-processing therefore replaces all of a client's predictions with their average prediction, including the known isFraud values from the training set.

X_test['isFraud'] = sample_submission.isFraud.values
X_train['isFraud'] = y_train.values
comb = pd.concat([X_train[['isFraud']],X_test[['isFraud']]],axis=0).reset_index()  # TransactionID becomes a column for the merge below

uids = pd.read_csv('data/uids_v4_no_multiuid_cleaning..csv',usecols=['TransactionID','uid']).rename({'uid':'uid2'},axis=1)
comb = comb.merge(uids,on='TransactionID',how='left')
mp = comb.groupby('uid2').isFraud.agg(['mean'])
comb.loc[comb.uid2>0,'isFraud'] = comb.loc[comb.uid2>0].uid2.map(mp['mean'])

uids = pd.read_csv('data/uids_v1_no_multiuid_cleaning.csv',usecols=['TransactionID','uid']).rename({'uid':'uid3'},axis=1)
comb = comb.merge(uids,on='TransactionID',how='left')
mp = comb.groupby('uid3').isFraud.agg(['mean'])
comb.loc[comb.uid3>0,'isFraud'] = comb.loc[comb.uid3>0].uid3.map(mp['mean'])

sample_submission.isFraud = comb.iloc[len(X_train):].isFraud.values
sample_submission.to_csv('sub_xgb_96_PP.csv',index=False)

Reference