H&M Competition Notes 5: The More Recent the Purchase and the Less Popular the Article, the Stronger the Recommendation [LB: 0.02226]

Author: admin
Published: April 1, 2022
Category: Deep Learning in Practice

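Logic

The more recently a customer bought an article, and the less popular that article was when they bought it, the more likely it is that the customer genuinely likes it. How much a customer likes an article is scored from two factors:

How long ago the past order was placed
How popular the article was

This score is a single value per order; the scores of all orders in which the customer bought the same article are then summed.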

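Code

Import libraries

import cudf  # RAPIDS GPU DataFrame library; used below but missing from the original imports
import numpy as np
import pandas as pd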

from math import sqrt
from pathlib import Path
from tqdm import tqdm
tqdm.pandas()

Load the dataset

N = 12

df  = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv',
                            usecols= ['t_dat', 'customer_id', 'article_id'], 
                            dtype={'article_id': 'int32', 't_dat': 'string', 'customer_id': 'string'})
df ['customer_id'] = df ['customer_id'].str[-16:].str.hex_to_int().astype('int64')

df['t_dat'] = cudf.to_datetime(df['t_dat'])
last_ts = df['t_dat'].max()

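ldbw is the Tuesday that ends the Wednesday-to-Tuesday week containing t_dat. The filter ldbw == last_ts used later only matches if the last date in the data is itself a Tuesday.

%%time
# Convert from cudf to pandas
tmp = df[['t_dat']].copy().to_pandas()
# Day of the week (Monday=0 ... Sunday=6)
tmp['dow'] = tmp['t_dat'].dt.dayofweek

# t_dat - (dow - 1) days is the Tuesday of the current calendar week
tmp['ldbw'] = tmp['t_dat'] - pd.to_timedelta(tmp['dow'] - 1, unit='D')

# Wednesday to Sunday (dow >= 2) belong to the week ending next Tuesday, so add 7 days
late = tmp['dow'] >= 2
tmp.loc[late, 'ldbw'] = tmp.loc[late, 'ldbw'] + pd.Timedelta(days=7)

df['ldbw'] = tmp['ldbw'].values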
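To sanity-check the week assignment, here is a minimal standard-library sketch of the same rule applied to single dates. The helper name `week_end_tuesday` is hypothetical and not part of the original notebook, and the dates are arbitrary examples.

```python
from datetime import date, timedelta

def week_end_tuesday(d: date) -> date:
    """Return the Tuesday that closes the Wednesday-to-Tuesday week containing d."""
    dow = d.weekday()                      # Monday=0 ... Sunday=6
    tuesday = d - timedelta(days=dow - 1)  # Tuesday of the same calendar week
    if dow >= 2:                           # Wednesday..Sunday roll over to the next Tuesday
        tuesday += timedelta(days=7)
    return tuesday

# Arbitrary example dates: a Monday, a Tuesday, a Wednesday and a Sunday
for d in [date(2020, 9, 14), date(2020, 9, 15), date(2020, 9, 16), date(2020, 9, 20)]:
    print(d, d.strftime("%a"), "->", week_end_tuesday(d))
```

Monday and Tuesday map to the Tuesday of their own week, while Wednesday through Sunday map to the following Tuesday.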

weekly_sales holds the number of sales of each article in each week; it is then merged back into df

# Count transactions per week and article
weekly_sales = df.drop('customer_id', axis=1).groupby(['ldbw', 'article_id']).count().reset_index()
weekly_sales = weekly_sales.rename(columns={'t_dat': 'count'})

df = df.merge(weekly_sales, on=['ldbw', 'article_id'], how = 'left')

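Assume that the target week's sales, count_targ, equal the sales of the last week in the data

Rows with ldbw == last_ts belong to the final week of the training data. Merging that week's count on article_id attaches each article's final-week sales to every transaction of that article.

weekly_sales = weekly_sales.reset_index().set_index('article_id')

# No `how` is given, so this is an inner join: articles with no sales
# in the final week are dropped here
df = df.merge(
    weekly_sales.loc[weekly_sales['ldbw']==last_ts, ['count']],
    on='article_id', suffixes=("", "_targ"))

# Safeguard only; after an inner join count_targ has no missing values
df['count_targ'] = df['count_targ'].fillna(0)
del weekly_sales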

Ratio of final-week sales to the sales in the week of purchase

# Final-week sales / sales in the week of the purchase
df['quotient'] = df['count_targ'] / df['count']
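The effect of this ratio is easiest to see on toy data. The sketch below uses plain pandas instead of cudf, and the dates and article IDs are made up for illustration.

```python
import pandas as pd

# Toy transactions: week-ending date (ldbw) and article per purchase (made-up data)
toy = pd.DataFrame({
    "ldbw": pd.to_datetime(["2020-09-15"] * 5 + ["2020-09-22"] * 4),
    "article_id": [1, 1, 1, 1, 2, 1, 1, 2, 2],
})
last_week = toy["ldbw"].max()

# Sales of each article in each week, attached to every transaction
weekly = toy.groupby(["ldbw", "article_id"]).size().rename("count").reset_index()
toy = toy.merge(weekly, on=["ldbw", "article_id"], how="left")

# Sales of each article in the final week, attached to every row of that article
final = weekly.loc[weekly["ldbw"] == last_week, ["article_id", "count"]]
toy = toy.merge(final, on="article_id", suffixes=("", "_targ"))

toy["quotient"] = toy["count_targ"] / toy["count"]
print(toy.drop_duplicates(["ldbw", "article_id"]))
```

Article 2 sold once in the earlier week and twice in the final week, so its earlier purchase gets a quotient of 2.0. Article 1, whose sales halved from four to two, gets 0.5 for its earlier purchases.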

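Take the 12 most popular articles (largest summed quotient)

target_sales = df.drop('customer_id', axis=1).groupby('article_id')['quotient'].sum()
general_pred = target_sales.nlargest(N).index.to_pandas().tolist()
# Restore the leading zero that was lost when article_id was read as int32
general_pred = ['0' + str(article_id) for article_id in general_pred]
general_pred_str =  ' '.join(general_pred) # Join into a single space-separated string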
del target_sales

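The idea below is that the longer ago an order was placed, the less likely the customer is to buy that article again. The score is computed as follows:

For each order compute $x$, the number of days between the order date and the last date in the data
$x=\max(x,1)$, so that $x$ is never $0$
From $x$ compute $y$, which decreases as $x$ grows: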

$$ y=\frac{25000}{\sqrt{x}}+150000\times \exp(-0.2\times x)-1000 $$

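$y=\max(0,y)$, so that $y$ is never negative
$value=y\times \text{quotient}$: $y$ and quotient are both importance weights, and their product is $value$
Then sum value over each (customer_id, article_id) pair
Keep the customer-article pairs whose summed $value$ exceeds 100, apply a dense rank per customer, and keep ranks up to 12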
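To get a feel for the decay weight, the following sketch evaluates it for a few order ages in days. It assumes the exponent is negative, as in the notebook's code (`np.exp(-c*x)` with `c = 0.2`); the function name `time_weight` is hypothetical.

```python
import math

def time_weight(days_ago: int) -> float:
    """Decay weight y for an order placed `days_ago` days before the last date."""
    x = max(days_ago, 1)                      # avoid division by zero at x = 0
    y = 25000 / math.sqrt(x) + 150000 * math.exp(-0.2 * x) - 1000
    return max(y, 0.0)                        # weights are never negative

for days in (0, 1, 7, 30, 365, 700):
    print(f"{days:>3} days ago -> y = {time_weight(days):,.1f}")
```

Once the exponential term has died out, the weight reaches zero when 25000/sqrt(x) drops below 1000, that is for x above 625 days, so older orders contribute nothing to the score.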

%%time
# x = days between the last date and the order date, at least 1
tmp = df.copy().to_pandas()
tmp['x'] = ((last_ts - tmp['t_dat']) / np.timedelta64(1, 'D')).astype(int)
tmp['x'] = tmp['x'].clip(lower=1)

# y shrinks as the order gets older (the customer's interest fades)
a, b, c, d = 2.5e4, 1.5e5, 2e-1, 1e3
tmp['y'] = a / np.sqrt(tmp['x']) + b * np.exp(-c*tmp['x']) - d

# y = max(0, y)
tmp['y'] = tmp['y'].clip(lower=0)

# value = quotient * y
tmp['value'] = tmp['quotient'] * tmp['y']

# Sum value for each (customer_id, article_id) pair
tmp = tmp.groupby(['customer_id', 'article_id']).agg({'value': 'sum'})
tmp = tmp.reset_index()

# Keep only pairs with value > 100
tmp = tmp.loc[tmp['value'] > 100].copy()
# Rank each customer's articles by value, highest first
tmp['rank'] = tmp.groupby("customer_id")["value"].rank(method="dense", ascending=False)
# Keep ranks up to 12
tmp = tmp.loc[tmp['rank'] <= 12]
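One detail worth knowing: a dense rank gives tied values the same rank, so a filter on the rank can keep more rows per customer than the cutoff suggests. A small pandas sketch with made-up scores and a cutoff of 2 instead of 12:

```python
import pandas as pd

# Made-up scores; customer 1 has two articles tied at 300.0
scores = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2],
    "article_id": [10, 11, 12, 13, 10],
    "value": [500.0, 300.0, 300.0, 120.0, 250.0],
})

# Dense rank per customer, highest value first; ties share a rank
scores["rank"] = scores.groupby("customer_id")["value"].rank(method="dense", ascending=False)
top2 = scores.loc[scores["rank"] <= 2]
print(top2)
```

Customer 1 keeps three rows here because the two tied articles both receive rank 2. In the notebook any surplus is cut off later, when the prediction string is truncated to 12 IDs.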

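Finally, sort by value in descending order and build the prediction strings

# Within each customer, the highest-value articles come first
purchase_df = tmp.sort_values(['customer_id', 'value'], ascending = False).reset_index(drop = True)
# Restore the leading zero of the article ID and add a separating space
purchase_df['prediction'] = '0' + purchase_df['article_id'].astype(str) + ' '
# Concatenate each customer's IDs into one string
purchase_df = purchase_df.groupby('customer_id').agg({'prediction': sum}).reset_index()
purchase_df['prediction'] = purchase_df['prediction'].str.strip()
purchase_df = cudf.DataFrame(purchase_df)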

Then pad the customers who have fewer than 12 recommended articles with the 12 most popular ones

%%time
sub  = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv',
                            usecols= ['customer_id'], 
                            dtype={'customer_id': 'string'})

sub['customer_id2'] = sub['customer_id'].str[-16:].str.hex_to_int().astype('int64')

sub = sub.merge(purchase_df, left_on = 'customer_id2', right_on = 'customer_id', how = 'left',
               suffixes = ('', '_ignored'))

sub = sub.to_pandas()
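# Customers without personal predictions get the popular list
sub['prediction'] = sub['prediction'].fillna(general_pred_str)
# Append the popular list as padding
sub['prediction'] = sub['prediction'] + ' ' +  general_pred_str
sub['prediction'] = sub['prediction'].str.strip()
# 12 IDs of 10 characters plus 11 spaces = 131 characters
sub['prediction'] = sub['prediction'].str[:131]
sub = sub[['customer_id', 'prediction']]
sub.to_csv('submission.csv', index=False)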
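The padding and truncation logic can be illustrated on plain strings. This is a sketch with made-up article IDs; `pad_prediction` is a hypothetical helper, and it relies on every ID being exactly 10 characters long.

```python
N = 12
ID_LEN = 10
MAX_CHARS = N * ID_LEN + (N - 1)   # 12 IDs + 11 separating spaces = 131

# Made-up popular articles, formatted as 10-character IDs
general_pred = [f"{900000000 + i:010d}" for i in range(N)]
general_pred_str = " ".join(general_pred)

def pad_prediction(personal: str) -> str:
    """Append the popular list and cut the result back to 12 IDs."""
    return (personal + " " + general_pred_str).strip()[:MAX_CHARS]

padded = pad_prediction("0111111111 0222222222")
print(padded.split())
```

A customer with two personal recommendations ends up with those two followed by the first ten popular articles. Note that this simple string approach does not remove duplicates when a personal recommendation is also in the popular list.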

Last modified: December 4, 2023, 09:38 PM


