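Python's datatable is related to R's data.table. In the timings below it is faster than pandas, and it is better suited to large datasets. Using the Titanic dataset, these notes record basic datatable usage and compare it with pandas.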
1 Installation
!pip install datatable
2 Imports
import time
import numpy as np
import pandas as pd
import datatable as dt
print(dt.__version__)
3.1 CSV read speed
Time to read the CSV with datatable
# datatable: read the CSV
start = time.time()
dt_df = dt.fread('./train.csv')
end = time.time()
print(end - start)
0.002747774124145508
Time to read the CSV with pandas
start = time.time()
pd_df = pd.read_csv('./train.csv')
end = time.time()
print(end - start)
0.06061220169067383
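Converting the datatable Frame to pandas is faster than reading the file directly with pandas: fread plus to_pandas() takes about 0.0066 s here, against 0.061 s for pd.read_csv.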
start = time.time()
dt_df.to_pandas()
end = time.time()
print(end - start)
0.003814220428466797
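Each timing above comes from a single time.time() measurement, which is noisy at the millisecond scale. A minimal sketch of a steadier measurement follows, assuming a hypothetical helper named best_of (it is not part of datatable or pandas) that repeats a call and keeps the fastest run, using time.perf_counter from the standard library.

```python
import time


def best_of(func, repeat=5):
    """Call func() `repeat` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Stand-in workload; in these notes it would be e.g. lambda: dt.fread('./train.csv')
    elapsed = best_of(lambda: sum(range(100_000)))
    print(f"best of 5: {elapsed:.6f} s")
```

Taking the minimum rather than the mean is a design choice: it discards runs slowed down by unrelated system activity, although it also hides cold-cache effects on the first read.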
3.2 UI
datatable renders a Frame nicely in the notebook
dt_df.head()
3.3 shape
Getting the shape of a datatable Frame
dt_df.shape
3.4 Column names
Get the column names (first ten)
dt_df.names[:10]
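3.5 Computing the mean
Mean with datatable
dt_df.mean()
Mean with pandas
pd_df.mean(numeric_only=True)  # numeric_only=True is needed on pandas >= 2.0 because of the string columns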
PassengerId 446.000000
Survived 0.383838
Pclass 2.308642
Age 29.699118
SibSp 0.523008
Parch 0.381594
Fare 32.204208
dtype: float64
3.6 Sorting
Sort speed with datatable
start = time.time()
dt_df.sort("Fare")
end = time.time()
print(end - start)
0.0002865791320800781
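Sort speed with pandas
start = time.time()
pd_df.sort_values("Fare")
end = time.time()
print(end - start)
3.7 groupby
groupby with datatable (sum of Fare per Pclass), repeated 100 times
start = time.time()
for i in range(100):
    dt_df[:, dt.sum(dt.f.Fare), dt.by(dt.f.Pclass)]
end = time.time()
print(end - start)
0.007885217666625977
groupby with pandas, repeated 100 times
start = time.time()
for i in range(100):
    pd_df.groupby("Pclass")["Fare"].sum()
end = time.time()
print(end - start)
0.05550026893615723
Row filtering with datatable (rows whose Fare is above the mean), repeated 100 times
start = time.time()
for i in range(100):
    dt_df[dt.f.Fare > dt.mean(dt.f.Fare), "Fare"]
end = time.time()
print(end - start)
0.006508588790893555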
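3.8 Modeling
from sklearn.linear_model import LinearRegression
model = LinearRegression()
# loan_amnt, installment and int_rate are not Titanic columns; dt_df here must be a Frame that contains them
model.fit(dt_df[:, ["loan_amnt", "installment"]], dt_df[:, "int_rate"])
model.coef_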
3.9 Saving the dataset
gdf = dt_df[:, dt.sum(dt.f.Fare), dt.by(dt.f.Pclass)]
gdf.to_csv("temp.csv")