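Python's datatable package is closely related to R's data.table. It is generally faster than pandas and better suited to large datasets. Using the Titanic dataset, this post records basic datatable usage and compares it with pandas.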

1 Installation

!pip install datatable

2 Imports

import time
import numpy as np
import pandas as pd
import datatable as dt
print(dt.__version__)

3.1 CSV read speed

Time to read the CSV with datatable:

# Read the CSV with datatable
start = time.time()
dt_df = dt.fread('./train.csv')
end = time.time()
print(end - start)
0.002747774124145508

Time to read the CSV with pandas:

start = time.time()
pd_df = pd.read_csv('./train.csv')
end = time.time()
print(end - start)
0.06061220169067383

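Converting the datatable Frame to pandas. Reading with datatable and then converting (about 0.0027 s + 0.0038 s) is still faster than reading directly with pandas (about 0.061 s):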

start = time.time()
dt_df.to_pandas()
end = time.time()
print(end - start)
0.003814220428466797
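
A single `time.time()` measurement of a millisecond-scale call is noisy, so the figures above are best read as rough indications. A minimal sketch of a steadier measurement, assuming a hypothetical helper named `best_of` (not part of datatable or pandas) that repeats a call and keeps the fastest run measured with `time.perf_counter`:

```python
import time


def best_of(func, repeats=5):
    """Call func() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; in this post it could be lambda: dt.fread('./train.csv')
elapsed = best_of(lambda: sum(range(100_000)), repeats=5)
print(f"fastest of 5 runs: {elapsed:.6f} s")
```

Taking the minimum rather than the mean reduces the influence of one-off delays such as a cold disk cache on the first read.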

3.2 UI

datatable's display of a Frame looks good:

dt_df.head()

3.3 shape

Getting the shape of a datatable Frame:

dt_df.shape

3.4 Column names

Getting the first ten column names:

dt_df.names[:10]

3.5 Computing the mean

Column means with datatable:

dt_df.mean()

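Column means with pandas (numeric_only=True is needed on pandas 2.0 and later, where text columns such as Name otherwise raise a TypeError):

pd_df.mean(numeric_only=True)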
PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64

3.6 Sorting

datatable sort speed:

start = time.time()
dt_df.sort("Fare")
end = time.time()
print(end - start)
0.0002865791320800781
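
The pandas counterpart of this sort is `sort_values`. A minimal sketch on a small hypothetical frame (the values are invented for illustration, not taken from train.csv); no timing is quoted because none was measured here:

```python
import pandas as pd

# Hypothetical stand-in for pd_df with a few invented fares
toy_df = pd.DataFrame({"Name": ["a", "b", "c", "d"],
                       "Fare": [71.28, 7.25, 53.10, 8.05]})

# sort_values returns a new DataFrame ordered by Fare (ascending by default)
sorted_df = toy_df.sort_values("Fare")
print(sorted_df["Fare"].tolist())
```

Wrapping `pd_df.sort_values("Fare")` in the same start/end timing pattern would give a figure comparable to the datatable one.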

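3.7 groupby and filtering

datatable groupby speed (sum of Fare per Pclass, repeated 100 times):

start = time.time()
for i in range(100):
    dt_df[:, dt.sum(dt.f.Fare), dt.by(dt.f.Pclass)]
end = time.time()
print(end - start)
0.007885217666625977

datatable filtering speed (Fare values above the mean fare, repeated 100 times):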

start = time.time()
for i in range(100):
    dt_df[dt.f.Fare>dt.mean(dt.f.Fare), "Fare"]
end = time.time()
print(end - start)
0.006508588790893555
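
The expression `dt_df[dt.f.Fare>dt.mean(dt.f.Fare), "Fare"]` keeps the Fare values that lie above the mean fare. A minimal sketch of what is probably the closest pandas equivalent, on a small hypothetical frame with invented values:

```python
import pandas as pd

# Hypothetical stand-in for pd_df; the fares are invented so the mean is 40.0
toy_df = pd.DataFrame({"Pclass": [1, 3, 1, 3],
                       "Fare": [80.0, 10.0, 60.0, 10.0]})

# Boolean mask: True where Fare is above the column mean
above_mean = toy_df.loc[toy_df["Fare"] > toy_df["Fare"].mean(), "Fare"]
print(above_mean.tolist())
```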

pandas groupby speed (the same sum of Fare per Pclass, repeated 100 times):

start = time.time()
for i in range(100):
    pd_df.groupby("Pclass")["Fare"].sum()
end = time.time()
print(end - start)
0.05550026893615723

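3.8 Modeling

from sklearn.linear_model import LinearRegression

# train.csv (Titanic) has no loan_amnt/installment/int_rate columns, so as an
# assumption this uses numeric Titanic columns without missing values instead.
X = dt_df[:, ["Pclass", "SibSp"]].to_numpy()
y = dt_df[:, "Fare"].to_numpy().ravel()  # flatten (n, 1) to (n,)
model = LinearRegression()
model.fit(X, y)
model.coef_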

3.9 Saving the dataset

gdf = dt_df[:, dt.sum(dt.f.Fare), dt.by(dt.f.Pclass)]
gdf.to_csv("temp.csv")
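
For comparison, a minimal pandas sketch of the same aggregate-then-save step, using a small hypothetical frame and a temporary directory so that nothing is left on disk. The names `toy_df`, `gdf_pd` and `restored` are illustrative only:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for pd_df with invented fares
toy_df = pd.DataFrame({"Pclass": [1, 3, 1, 3], "Fare": [80.0, 10.0, 60.0, 10.0]})

# Sum Fare per Pclass; as_index=False keeps Pclass as a regular column
gdf_pd = toy_df.groupby("Pclass", as_index=False)["Fare"].sum()

# Write to a temporary file and read it back to check the round trip
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "temp.csv")
    gdf_pd.to_csv(path, index=False)  # index=False omits the row index column
    restored = pd.read_csv(path)

print(restored)
```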
Last modified: October 29, 2021, 11:23 PM