flow_from_dataframe | TheAILearner

In the previous blogs, we discussed the flow and flow_from_directory methods. Both methods perform the same task, i.e. they generate batches of augmented data. The only thing that differs is the format or structure of the dataset. Some of the most common formats for image datasets are:

Keras built-in datasets.
Datasets containing separate folders of data, one for each class.
Datasets containing a single folder along with a CSV or JSON file that maps the image filenames to their corresponding classes.

We already know how to deal with the first two formats. In this blog, we will discuss how to perform data augmentation when the dataset is described by a data frame. For this, Keras provides the built-in flow_from_dataframe method. So, let’s discuss this method in detail.

Keras API

flow_from_dataframe(dataframe, directory=None, x_col='filename', y_col='class', target_size=(256, 256), color_mode='rgb', classes=None, class_mode='categorical', batch_size=32, shuffle=True, seed=None, save_to_dir=None, save_prefix='', save_format='png', subset=None, interpolation='nearest', drop_duplicates=True)


Here, you need to provide a data frame that contains the image names or file paths and the corresponding labels. There are two possible cases:

If the data frame contains image names, specify the directory where these images reside using the “directory” argument. See the example below.
If the data frame contains absolute image paths, set the “directory” argument to None.
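
The two cases can be sketched with a small pandas example. The folder and file names below are made up for illustration; only the shape of the data frame matters.

```python
import os
import pandas as pd

# Hypothetical folder and file names, used only for illustration
image_dir = os.path.join("data", "train")
names = ["cat.0.jpg", "dog.0.jpg"]
labels = ["0", "1"]

# Case 1: the data frame holds bare file names,
# so the generator also needs directory=image_dir
df_names = pd.DataFrame({"filename": names, "class": labels})

# Case 2: the data frame holds absolute paths,
# so the generator is called with directory=None
abs_paths = [os.path.abspath(os.path.join(image_dir, n)) for n in names]
df_paths = pd.DataFrame({"filename": abs_paths, "class": labels})

print(df_names)
print(df_paths)
```

The column names 'filename' and 'class' match the defaults of x_col and y_col, so they would not need to be passed explicitly.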

Similarly, the values in the label column can be strings, lists, or tuples depending on the “class_mode” argument. For instance, if class_mode is binary, the label column must contain the class values as strings. Note that we can also have multiple label columns, for instance in regression tasks such as bounding box prediction. In that case, pass these columns as a list in the “y_col” argument.
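
As a rough sketch of the multi-column case, the data frame below stores one bounding box per image in four numeric columns. The column names and values are invented for illustration. The class_mode that accepts several numeric columns depends on the Keras version (as far as I know it is 'other' in older releases and 'raw' in newer ones), so please check the documentation for your version.

```python
import pandas as pd

# Hypothetical bounding-box annotations: one row per image
df = pd.DataFrame({
    "filename": ["img_0.jpg", "img_1.jpg"],
    "x_min": [10, 35],
    "y_min": [20, 40],
    "x_max": [110, 90],
    "y_max": [140, 120],
})

# The label columns are passed together as a list in y_col
y_cols = ["x_min", "y_min", "x_max", "y_max"]

# Each image then gets one target vector with len(y_cols) values
targets = df[y_cols].to_numpy()
print(targets.shape)  # one row per image, one column per label
```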

All the remaining arguments are the same as discussed in the ImageDataGenerator flow_from_directory blog. Now let’s take an example to see how to use this method.

We will use the classic cats vs dogs dataset. First, download the dataset from Kaggle. It contains two folders, train and test, with 25000 and 12500 images respectively.

Create a Dataframe

The first step is to create a data frame that contains a filename column and the corresponding label column. For this, we will iterate over each image in the train folder and check the filename prefix. If it is a cat, we set the label to '0', otherwise to '1'. The labels are stored as strings because class_mode='binary' expects string class values.

import os

# Path to the train folder
original_train = 'D:/downloads/Data/train/'

# Label every image from its filename prefix: 'cat' -> '0', anything else (dog) -> '1'
filenames = os.listdir(original_train)
categories = []
for filename in filenames:
    category = filename.split('.')[0]
    if category == 'cat':
        categories.append('0')
    else:
        categories.append('1')


Now create the data frame:

import pandas as pd

data = pd.DataFrame({'filename':filenames,'label':categories})
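
Before building the generators, it can help to sanity-check the data frame. The sketch below repeats the labelling rule on a few made-up file names (so it runs without the dataset) and counts the rows per label.

```python
import pandas as pd

# A few made-up file names that follow the 'cat.N.jpg' / 'dog.N.jpg' pattern
sample_files = ["cat.0.jpg", "cat.1.jpg", "dog.0.jpg", "dog.1.jpg", "dog.2.jpg"]

# Same rule as above: prefix 'cat' -> '0', anything else -> '1'
sample_labels = ["0" if name.split(".")[0] == "cat" else "1" for name in sample_files]

sample_data = pd.DataFrame({"filename": sample_files, "label": sample_labels})

# Number of rows per label
counts = sample_data["label"].value_counts().to_dict()
print(counts)
```

Running the same value_counts call on the real data frame shows at a glance whether the two classes are balanced.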


Create Generators

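Now, we will create the train and validation generators using the flow_from_dataframe method as follows:

from keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values to [0, 1] and reserve 20% of the rows for validation
datagen = ImageDataGenerator(rescale=1/255., validation_split=0.2)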

train_generator = datagen.flow_from_dataframe(dataframe=data, directory=original_train,
                                             x_col='filename',
                                             y_col='label',
                                             target_size=(150,150),
                                             class_mode='binary',
                                             batch_size=100,
                                             subset='training',
                                             seed=7)

validation_generator = datagen.flow_from_dataframe(dataframe=data, directory=original_train,
                                             x_col='filename',
                                             y_col='label',
                                             target_size=(150,150),
                                             class_mode='binary',
                                             batch_size=100,
                                             subset='validation',
                                             seed=7)
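
One caveat worth checking: as far as I can tell, validation_split slices the data frame by row position (the first fraction of rows becomes the validation subset) rather than sampling rows at random. If the rows are grouped by class, one subset could then be dominated by a single class. Below is a minimal sketch of shuffling the rows first, using a toy data frame in place of the real one.

```python
import pandas as pd

# Toy stand-in for the real data frame: 50 cats followed by 50 dogs
toy = pd.DataFrame({
    "filename": [f"cat.{i}.jpg" for i in range(50)] + [f"dog.{i}.jpg" for i in range(50)],
    "label": ["0"] * 50 + ["1"] * 50,
})

# Shuffle all rows once with a fixed seed, then renumber the index
shuffled = toy.sample(frac=1, random_state=7).reset_index(drop=True)

# Labels in the first 20% of rows, before and after shuffling
n_val = int(0.2 * len(toy))
print(toy["label"].iloc[:n_val].value_counts().to_dict())
print(shuffled["label"].iloc[:n_val].value_counts().to_dict())
```

Passing the same shuffled frame as the dataframe argument in both calls keeps the training and validation subsets disjoint, since both generators slice the same row order.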


Build the Model

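from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense

# Four Conv2D + MaxPool2D pairs followed by a small dense classifier
model = Sequential()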
model.add(Conv2D(32,(3,3),activation='relu',input_shape=(150,150,3)))
model.add(MaxPool2D((2,2)))
model.add(Conv2D(64,(3,3),activation='relu'))
model.add(MaxPool2D((2,2)))
model.add(Conv2D(128,(3,3),activation='relu'))
model.add(MaxPool2D((2,2)))
model.add(Conv2D(128,(3,3),activation='relu'))
model.add(MaxPool2D((2,2)))
model.add(Flatten())
model.add(Dense(512,activation='relu'))
model.add(Dense(1,activation='sigmoid'))

model.compile(loss='binary_crossentropy',optimizer = 'Adam',metrics=['accuracy'])
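
To see where the size of the Flatten layer comes from, the feature map sizes can be worked out by hand. The sketch below assumes the Keras defaults used in the model above, i.e. 'valid' padding with stride 1 for Conv2D and a pooling stride equal to the pool size.

```python
def conv_out(size, kernel=3):
    # 'valid' convolution with stride 1 shrinks each side by kernel - 1
    return size - (kernel - 1)

def pool_out(size, pool=2):
    # 2x2 max pooling with stride 2 halves each side, rounding down
    return size // pool

size = 150
sizes = []
for _ in range(4):          # four Conv2D + MaxPool2D pairs
    size = pool_out(conv_out(size))
    sizes.append(size)

flat_units = size * size * 128          # last Conv2D has 128 filters
dense_params = flat_units * 512 + 512   # weights plus biases of Dense(512)

print(sizes, flat_units, dense_params)
```

So the 150x150 input shrinks to 7x7 after the fourth pooling layer, and most of the model's parameters sit in the first Dense layer. Calling model.summary() should report the same numbers.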


Train the Model

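Let’s train the model using the fit_generator method. Note that fit_generator is deprecated in newer TensorFlow releases, where model.fit accepts generators directly.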

train_steps = train_generator.n//train_generator.batch_size
validation_steps = validation_generator.n//validation_generator.batch_size

history = model.fit_generator(train_generator,steps_per_epoch=train_steps, epochs=20,
                              validation_data=validation_generator,validation_steps=validation_steps)
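
The step counts follow directly from the dataset size, the validation split and the batch size. A quick worked example, assuming all 25000 training images end up in the data frame:

```python
import math

n_images = 25000          # images in the train folder
validation_split = 0.2
batch_size = 100

n_val = int(n_images * validation_split)
n_train = n_images - n_val

train_steps = n_train // batch_size
validation_steps = n_val // batch_size
print(train_steps, validation_steps)

# Floor division drops a final partial batch; ceil keeps it
print(1050 // batch_size, math.ceil(1050 / batch_size))
```

Here both subsets divide evenly by the batch size, so nothing is dropped. When they do not, using math.ceil instead of floor division makes sure the last partial batch is also seen in every epoch.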


Test time

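At test time, we can simply use the flow_from_directory method, although either method works. Since flow_from_directory looks for images inside subfolders, you need to create a subfolder inside the test folder and move the test images into it. Remember not to shuffle the data at test time, so that the predictions stay aligned with the filenames. The class_mode argument should be set to None because the test images have no labels.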

test_directory = 'D:/downloads/Data/test/'
test_datagen = ImageDataGenerator(rescale=1/255.)
test_generator = test_datagen.flow_from_directory(test_directory,target_size=(150,150),
                                                 shuffle=False,
                                                 class_mode=None,
                                                 batch_size=1,
                                                 seed=7)
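
Creating the subfolder can be done by hand or with a few lines of Python. Below is a minimal sketch; the helper move_into_subfolder and the subfolder name 'unknown' are my own choices for illustration, and the demo runs on a temporary folder with empty placeholder files instead of the real test folder.

```python
import os
import shutil
import tempfile

def move_into_subfolder(test_dir, subfolder="unknown"):
    """Move every file directly inside test_dir into test_dir/subfolder."""
    target = os.path.join(test_dir, subfolder)
    os.makedirs(target, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(test_dir)):
        source = os.path.join(test_dir, name)
        if os.path.isfile(source):      # skip the subfolder itself
            shutil.move(source, os.path.join(target, name))
            moved.append(name)
    return moved

# Demo on a temporary folder with empty placeholder files
demo_dir = tempfile.mkdtemp()
for name in ["1.jpg", "2.jpg", "3.jpg"]:
    open(os.path.join(demo_dir, name), "w").close()

moved = move_into_subfolder(demo_dir)
remaining = os.listdir(demo_dir)
inside = sorted(os.listdir(os.path.join(demo_dir, "unknown")))
print(moved, remaining, inside)
shutil.rmtree(demo_dir)  # clean up the demo folder
```

On the real data, the call would be move_into_subfolder(test_directory), run once before creating the test generator.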


For predictions, we can simply use the predict_generator method (in newer TensorFlow releases, model.predict plays the same role).

STEP_SIZE_TEST=test_generator.n//test_generator.batch_size
test_generator.reset()
pred=model.predict_generator(test_generator,steps=STEP_SIZE_TEST,verbose=1)

predictions = []
for i in pred:
    if i >=0.5:
        predictions.append('1')
    else:
        predictions.append('0')

filenames=test_generator.filenames
results=pd.DataFrame({"Filename":filenames,
                      "Predictions":predictions})
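
If readable class names are preferred over '0' and '1', the thresholding step can be wrapped in a small helper. This is only a sketch: the helper to_class_names is hypothetical, the probabilities and file names are made up, and it relies on the labelling used earlier ('0' for cat, '1' for dog).

```python
import os

def to_class_names(probabilities, threshold=0.5):
    """Map sigmoid outputs to names, assuming '0' = cat and '1' = dog."""
    return ["dog" if p >= threshold else "cat" for p in probabilities]

# Made-up outputs and file names standing in for pred and test_generator.filenames
sample_probs = [0.91, 0.08, 0.50]
sample_files = ["unknown/1.jpg", "unknown/2.jpg", "unknown/3.jpg"]

names = to_class_names(sample_probs)
# Strip the subfolder so only the image name remains
rows = [(os.path.basename(f), n) for f, n in zip(sample_files, names)]
print(rows)
```

With the real model output, pred.ravel() would flatten the (n, 1) array before it is passed to the helper.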


That’s all for the flow_from_dataframe method. Hope you enjoyed reading.

If you have any doubts or suggestions, please feel free to ask and I will do my best to help or improve myself. Goodbye until next time.
