In the previous blogs, we discussed the flow and flow_from_directory methods. Both methods perform the same task, i.e. generating batches of augmented data. The only thing that differs is the format or structure of the dataset. Some of the most common image-dataset formats are
- Keras built-in datasets
- Datasets containing separate folders of data corresponding to the respective classes.
- Datasets containing a single folder along with a CSV or JSON file that maps the image filenames with their corresponding classes.
We already know how to deal with the first two formats. In this blog, we will discuss how to perform data augmentation with data available in a data frame. To do this, Keras provides a built-in flow_from_dataframe method. So, let’s discuss this method in detail.
Keras API
```python
flow_from_dataframe(dataframe, directory=None, x_col='filename', y_col='class',
                    target_size=(256, 256), color_mode='rgb', classes=None,
                    class_mode='categorical', batch_size=32, shuffle=True,
                    seed=None, save_to_dir=None, save_prefix='',
                    save_format='png', subset=None, interpolation='nearest',
                    drop_duplicates=True)
```
In this, you need to provide the data frame that contains the image names or file paths and the corresponding labels. Now, there are two cases possible:
- if the data frame contains image names, then you need to specify the directory where these images reside using the “directory” argument. See the example below.
- if the data frame contains absolute image paths, then set the “directory” argument to None.
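Both cases can be sketched as follows; the folder path and filenames here are placeholders standing in for the Kaggle data used later in the post:

```python
import os
import pandas as pd

# Hypothetical location of the training images (placeholder path)
train_dir = 'D:/downloads/Data/train/'
filenames = ['cat.0.jpg', 'dog.0.jpg']  # placeholder filenames

# Case 1: bare filenames in the data frame; the folder is passed
# separately via the "directory" argument of flow_from_dataframe.
df_relative = pd.DataFrame({'filename': filenames})

# Case 2: absolute paths in the data frame; directory is set to None.
df_absolute = pd.DataFrame(
    {'filename': [os.path.join(train_dir, f) for f in filenames]})

print(df_absolute['filename'].iloc[0])
```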
Similarly, the values in the labels column can be a string, list, or tuple depending on the “class_mode” argument. For instance, if class_mode is binary, then the label column must contain the class values as strings. Note that we can also have multiple label columns, for instance in regression tasks such as bounding box prediction. In that case, you need to pass these columns as a list in the “y_col” argument.
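As a sketch of the multi-column case (the column names and values below are made up for illustration), a bounding-box data frame could look like this, with the coordinate columns passed together as a list in y_col:

```python
import pandas as pd

# Hypothetical bounding-box labels: one row per image,
# four coordinate columns (all names/values are illustrative)
boxes = pd.DataFrame({'filename': ['img_1.jpg', 'img_2.jpg'],
                      'x': [10, 34], 'y': [22, 8],
                      'w': [120, 96], 'h': [80, 64]})

# These columns would then be passed as, e.g.:
#   datagen.flow_from_dataframe(boxes, directory=..., x_col='filename',
#                               y_col=['x', 'y', 'w', 'h'], class_mode='raw')
# (class_mode='raw' is available in newer Keras versions)
print(list(boxes.columns))
```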
The rest of the arguments are the same as those discussed in the ImageDataGenerator flow_from_directory blog. Now let’s take an example to see how to use this method.
We will take the traditional cats vs dogs dataset. First, download the dataset from Kaggle. This dataset contains two folders, train and test, containing 25000 and 12500 images respectively.
Create a Dataframe
The first step is to create a data frame that contains the filename and the corresponding label columns. For this, we will iterate over each image in the train folder and check the filename prefix. If it is a cat, set the label to ‘0’; otherwise, set it to ‘1’.
```python
import os

# Path to the train folder
original_train = 'D:/downloads/Data/train/'

filenames = os.listdir(original_train)
categories = []
for filename in filenames:
    category = filename.split('.')[0]
    if category == 'cat':
        categories.append('0')
    else:
        categories.append('1')
```
Now create a data frame as
```python
import pandas as pd

data = pd.DataFrame({'filename': filenames, 'label': categories})
```
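Before building the generators, it is worth sanity-checking the data frame, e.g. that both classes are represented. A minimal sketch with placeholder filenames mimicking the Kaggle naming scheme (the real data frame comes from the listing above):

```python
import pandas as pd

# Placeholder filenames following the cat.<n>.jpg / dog.<n>.jpg scheme
filenames = ['cat.0.jpg', 'cat.1.jpg', 'dog.0.jpg', 'dog.1.jpg']
categories = ['0' if f.split('.')[0] == 'cat' else '1' for f in filenames]
data = pd.DataFrame({'filename': filenames, 'label': categories})

counts = data['label'].value_counts()
print(counts['0'], counts['1'])  # both classes should be present
```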
Create Generators
Now, we will create the train and validation generator using the flow_from_dataframe method as
```python
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1/255., validation_split=0.2)

train_generator = datagen.flow_from_dataframe(dataframe=data,
                                              directory=original_train,
                                              x_col='filename',
                                              y_col='label',
                                              target_size=(150, 150),
                                              class_mode='binary',
                                              batch_size=100,
                                              subset='training',
                                              seed=7)

validation_generator = datagen.flow_from_dataframe(dataframe=data,
                                                   directory=original_train,
                                                   x_col='filename',
                                                   y_col='label',
                                                   target_size=(150, 150),
                                                   class_mode='binary',
                                                   batch_size=100,
                                                   subset='validation',
                                                   seed=7)
```
Build the Model
```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)))
model.add(MaxPool2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPool2D((2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPool2D((2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPool2D((2, 2)))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='Adam', metrics=['accuracy'])
```
Train the Model
Let’s train the model using the fit_generator method.
```python
train_steps = train_generator.n // train_generator.batch_size
validation_steps = validation_generator.n // validation_generator.batch_size

history = model.fit_generator(train_generator,
                              steps_per_epoch=train_steps,
                              epochs=20,
                              validation_data=validation_generator,
                              validation_steps=validation_steps)
```
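To see where these step counts come from, here is a quick back-of-the-envelope check, assuming the full 25000-image Kaggle train set, validation_split=0.2, and batch_size=100 as above:

```python
n_total = 25000
batch_size = 100
validation_split = 0.2

n_validation = int(n_total * validation_split)   # 5000 images held out
n_train = n_total - n_validation                 # 20000 images for training

train_steps = n_train // batch_size              # steps per training epoch
validation_steps = n_validation // batch_size    # steps per validation pass
print(train_steps, validation_steps)
```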
Test time
At test time, we can simply use the flow_from_directory method (either method works). For this, you need to create a subfolder inside the test folder, since flow_from_directory expects images to sit inside class subfolders. Remember not to shuffle the data at test time, and set the class_mode argument to None since the test labels are unknown.
```python
test_directory = 'D:/downloads/Data/test/'

test_datagen = ImageDataGenerator(rescale=1/255.)
test_generator = test_datagen.flow_from_directory(test_directory,
                                                  target_size=(150, 150),
                                                  shuffle=False,
                                                  class_mode=None,
                                                  batch_size=1,
                                                  seed=7)
```
For predictions, we can simply use the predict_generator method.
```python
STEP_SIZE_TEST = test_generator.n // test_generator.batch_size
test_generator.reset()
pred = model.predict_generator(test_generator, steps=STEP_SIZE_TEST, verbose=1)

predictions = []
for i in pred:
    if i >= 0.5:
        predictions.append('1')
    else:
        predictions.append('0')

filenames = test_generator.filenames
results = pd.DataFrame({"Filename": filenames,
                        "Predictions": predictions})
```
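The thresholding step above can be checked in isolation; the probabilities and filenames below are made up stand-ins for the model's sigmoid outputs:

```python
import pandas as pd

pred = [0.91, 0.12, 0.50, 0.49]          # fake sigmoid outputs
predictions = ['1' if p >= 0.5 else '0' for p in pred]

filenames = ['1.jpg', '2.jpg', '3.jpg', '4.jpg']  # placeholder test filenames
results = pd.DataFrame({"Filename": filenames, "Predictions": predictions})

# results.to_csv('submission.csv', index=False)  # e.g. for a Kaggle submission
print(list(results['Predictions']))
```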
That’s all for the flow_from_dataframe method. Hope you enjoyed reading.
If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.