The previous article covered the basics of using CNNs; this one applies them to a real project, drawing on material from the book "Deep Learning with Python".
We use Kaggle's dogs-vs-cats dataset, which can be downloaded at https://www.kaggle.com/c/dogs-vs-cats.
The original training set contains 12,500 images each of cats and dogs. First we split it into train, validation, and test sets with 10,000, 1,250, and 1,250 images per class respectively, then sort the files into separate folders.
First, import the libraries needed for file handling:
import os, shutil

# path to the original extracted training set
origin_dir = '../input/dogs-vs-cats/train'
# root directory of the new dataset
base_dir = '../input/dogs-vs-cats-new'
# create the root directory of the new dataset
os.mkdir(base_dir)
# create a directory for each split and each class
train_dir = os.path.join(base_dir, 'train')
os.makedirs(train_dir)
val_dir = os.path.join(base_dir, 'val')
os.makedirs(val_dir)
test_dir = os.path.join(base_dir, 'test')
os.makedirs(test_dir)
train_cats_dir = os.path.join(train_dir, 'cats')
os.mkdir(train_cats_dir)
train_dogs_dir = os.path.join(train_dir, 'dogs')
os.mkdir(train_dogs_dir)
val_cats_dir = os.path.join(val_dir, 'cats')
os.mkdir(val_cats_dir)
val_dogs_dir = os.path.join(val_dir, 'dogs')
os.mkdir(val_dogs_dir)
test_cats_dir = os.path.join(test_dir, 'cats')
os.mkdir(test_cats_dir)
test_dogs_dir = os.path.join(test_dir, 'dogs')
os.mkdir(test_dogs_dir)
# copy the images into the corresponding folders
fnames = ['cat.{}.jpg'.format(i) for i in range(10000)]
for fname in fnames:
    src = os.path.join(origin_dir, fname)
    dst = os.path.join(train_cats_dir, fname)
    shutil.copyfile(src, dst)
fnames = ['dog.{}.jpg'.format(i) for i in range(10000)]
for fname in fnames:
    src = os.path.join(origin_dir, fname)
    dst = os.path.join(train_dogs_dir, fname)
    shutil.copyfile(src, dst)
fnames = ['cat.{}.jpg'.format(i) for i in range(10000, 11250)]
for fname in fnames:
    src = os.path.join(origin_dir, fname)
    dst = os.path.join(val_cats_dir, fname)
    shutil.copyfile(src, dst)
fnames = ['dog.{}.jpg'.format(i) for i in range(10000, 11250)]
for fname in fnames:
    src = os.path.join(origin_dir, fname)
    dst = os.path.join(val_dogs_dir, fname)
    shutil.copyfile(src, dst)
fnames = ['cat.{}.jpg'.format(i) for i in range(11250, 12500)]
for fname in fnames:
    src = os.path.join(origin_dir, fname)
    dst = os.path.join(test_cats_dir, fname)
    shutil.copyfile(src, dst)
fnames = ['dog.{}.jpg'.format(i) for i in range(11250, 12500)]
for fname in fnames:
    src = os.path.join(origin_dir, fname)
    dst = os.path.join(test_dogs_dir, fname)
    shutil.copyfile(src, dst)
Building the model
Next we build the CNN: four convolutional layers, each followed by max pooling, then a fully connected hidden layer with 512 units. Since this is a binary classification problem, the output layer uses a sigmoid activation to squash its single output into [0, 1].
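As a quick standalone check (not part of the model code), the sigmoid function 1 / (1 + e^-x) maps any real-valued logit into (0, 1), so the single output can be read as the probability of one class:

```python
import numpy as np

def sigmoid(x):
    # squashes any real-valued input into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-x))

# large negative logits map near 0, zero maps to 0.5,
# large positive logits map near 1
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```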
from keras import layers, models, optimizers

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu',
                        input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.summary()

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-4),
              metrics=['acc'])
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_2 (Conv2D) (None, 148, 148, 32) 896
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 74, 74, 32) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 72, 72, 64) 18496
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 36, 36, 64) 0
_________________________________________________________________
conv2d_4 (Conv2D) (None, 34, 34, 128) 73856
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 17, 17, 128) 0
_________________________________________________________________
conv2d_5 (Conv2D) (None, 15, 15, 128) 147584
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 7, 7, 128) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 6272) 0
_________________________________________________________________
dense_1 (Dense) (None, 512) 3211776
_________________________________________________________________
dense_2 (Dense) (None, 1) 513
=================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0
_________________________________________________________________
Data preprocessing
The preprocessing steps are:
(1) read the image files;
(2) decode the JPEG content into RGB grids of pixels;
(3) convert those grids into floating-point tensors;
(4) rescale the pixel values from [0, 255] into the [0, 1] interval.
This may look tedious, but Keras automates all of it: the keras.preprocessing.image module provides the ImageDataGenerator class, which quickly builds Python generators that yield preprocessed tensors automatically.
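To make the four steps above concrete, here is a minimal sketch of what the generator does for each image, written directly with Pillow and NumPy (the function name and path handling are hypothetical, for illustration only):

```python
from PIL import Image
import numpy as np

def load_and_preprocess(path, size=(150, 150)):
    img = Image.open(path).convert('RGB')   # (1)(2) read file, decode to RGB
    img = img.resize(size)                  # match the model's input size
    x = np.array(img, dtype='float32')      # (3) float tensor, shape (150, 150, 3)
    x /= 255.0                              # (4) rescale pixel values to [0, 1]
    return x
```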
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255)  # scale all pixel values by 1/255
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),  # resize all images to 150x150
    batch_size=50,
    class_mode='binary')  # only two classes, so use binary labels (0 and 1)
val_generator = test_datagen.flow_from_directory(
    val_dir,
    target_size=(150, 150),
    batch_size=50,
    class_mode='binary')
Training
We train with fit_generator, which takes the following arguments:
- train_generator: a Python generator that keeps yielding batches of training data together with their targets
- steps_per_epoch: how many batches count as one epoch; with batch_size=50 and 20,000 training samples, that is 20000/50 = 400
- validation_data: another generator, yielding validation data and targets
- validation_steps: the counterpart of steps_per_epoch for the validation set; here 2500/50 = 50
history = model.fit_generator(
    train_generator,
    steps_per_epoch=400,
    epochs=30,
    validation_data=val_generator,
    validation_steps=50)
Epoch 1/30
400/400 [==============================] - 24s 60ms/step - loss: 0.3115 - acc: 0.8621 - val_loss: 0.3415 - val_acc: 0.8400
Epoch 2/30
400/400 [==============================] - 23s 58ms/step - loss: 0.2866 - acc: 0.8780 - val_loss: 0.3496 - val_acc: 0.8462
...
Epoch 30/30
400/400 [==============================] - 24s 59ms/step - loss: 0.0218 - acc: 0.9929 - val_loss: 0.4752 - val_acc: 0.8838
We can save the model to disk:
model.save('cat_and_dogs.h5')
Visualization
First, let's look at how accuracy evolves on the training and validation sets:
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
plt.plot(acc, 'g', label='training acc')
plt.plot(val_acc, 'r', label='validation acc')
plt.title("Training and Validation accuracy")
plt.legend()
Now the loss curves:
plt.figure()
plt.plot(loss, 'g', label='training loss')
plt.plot(val_loss, 'r', label='validation loss')
plt.title("Training and Validation loss")
plt.legend()
The plots show that the model keeps improving on the training set while the validation metrics barely move: classic overfitting. One remedy is to train on more samples, which we can approximate with data augmentation.
Data augmentation
Data augmentation generates additional samples from the existing ones: by randomly rotating, scaling, shearing, shifting, and flipping the images, the model gets to see more variety and is less likely to overfit.
The main parameters are roughly as follows:
- rotation_range: the range, in degrees, for random rotations
- width_shift_range and height_shift_range: fractions of the total width/height within which to randomly translate the image horizontally or vertically
- shear_range: the angle for random shearing transformations
- zoom_range: the range for random zooming
- horizontal_flip: randomly flip images horizontally
- fill_mode: the strategy for filling in newly created pixels, which can appear after a rotation or a width/height shift
datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')
Let's see what the augmented images look like:
from keras.preprocessing import image

fnames = [os.path.join(train_cats_dir, fname)
          for fname in os.listdir(train_cats_dir)]
img_path = fnames[3]
img = image.load_img(img_path, target_size=(150, 150))
x = image.img_to_array(img)
x = x.reshape((1,) + x.shape)

# draw four randomly augmented versions of the same image
i = 0
for batch in datagen.flow(x, batch_size=1):
    plt.subplot(2, 2, i + 1)
    imgplot = plt.imshow(image.array_to_img(batch[0]))
    i += 1
    if i % 4 == 0:
        break
plt.show()
Training again
This time we rebuild the model: besides using data augmentation to curb overfitting, we also add BatchNormalization and Dropout to push it down further.
For BatchNormalization and other normalization schemes, see my earlier article: 各种 Normalization 的简介及区分 (an introduction to the various Normalization methods and how they differ).
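As a rough illustration of what BatchNormalization does during training, each feature is standardized over the current batch. This sketch deliberately ignores the learned scale/shift parameters and the moving averages that the real layer also maintains:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # x: array of shape (batch, features);
    # standardize each feature over the batch dimension
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(32, 4) * 5.0 + 3.0   # batch with large mean and variance
y = batch_norm(x)
# after normalization, each feature has ~zero mean and ~unit variance
```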
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu',
                        input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(32, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(32, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(32, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.BatchNormalization())
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-4),
              metrics=['acc'])
Set up the data generators again:
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
# Note: never augment the validation data;
# otherwise performance would be measured on an augmented set,
# and the numbers could be misleading
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary')
val_generator = test_datagen.flow_from_directory(
    val_dir,  # validate on the validation split, keeping the test split untouched
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary')
Start training:
history = model.fit_generator(
    train_generator,
    steps_per_epoch=400,
    epochs=50,
    validation_data=val_generator,
    validation_steps=50)
Epoch 1/50
400/400 [==============================] - 87s 218ms/step - loss: 0.7360 - acc: 0.5990 - val_loss: 0.6284 - val_acc: 0.6725
Epoch 2/50
400/400 [==============================] - 100s 251ms/step - loss: 0.6832 - acc: 0.6328 - val_loss: 0.6177 - val_acc: 0.6698
...
Epoch 50/50
400/400 [==============================] - 65s 162ms/step - loss: 0.3750 - acc: 0.8309 - val_loss: 0.3060 - val_acc: 0.8769
Let's look at the accuracy and loss curves again:
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
plt.plot(acc, 'g', label='training acc')
plt.plot(val_acc, 'r', label='validation acc')
plt.title("Training and Validation accuracy")
plt.legend()
plt.figure()
plt.plot(loss, 'g', label='training loss')
plt.plot(val_loss, 'r', label='validation loss')
plt.title("Training and Validation loss")
plt.legend()
This time the model keeps improving on both the training and validation sets, with no obvious overfitting.