Data set class

The DataSet class contains all the information needed to build our model. Each column represents a variable, and each row corresponds to one sample.

Throughout this tutorial, we will use the Iris data set to show how to use the most important methods of the main classes that make up OpenNN.


Sepal length   Sepal width   Petal length   Petal width   Iris flower
5.1            3.5           1.4            0.2           Setosa
4.9            3.0           1.4            0.2           Setosa
4.7            3.2           1.3            0.2           Setosa
4.6            3.1           1.5            0.2           Setosa
5.0            3.6           1.4            0.2           Setosa
5.4            3.9           1.7            0.4           Setosa
Iris data set

You can download the Iris data set here. Note that in the data file the target has been binarized into three columns:

  • 1 0 0 corresponds to Setosa.
  • 0 1 0 corresponds to Versicolor.
  • 0 0 1 corresponds to Virginica.
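
The binarization listed above is a one-hot encoding of the class label. As a minimal sketch, and not part of OpenNN, the hypothetical helper `one_hot` below shows how such an encoding could be produced from a label and an ordered list of class names. The function name and the choice to return all zeros for an unknown label are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not an OpenNN function): returns the one-hot
// encoding of `label` with respect to the ordered list `classes`.
// A label that is not in `classes` yields a vector of zeros.
std::vector<int> one_hot(const std::string& label,
                         const std::vector<std::string>& classes)
{
    assert(!classes.empty());  // at least one class is required

    std::vector<int> encoded(classes.size(), 0);

    for (std::size_t i = 0; i < classes.size(); ++i) {
        if (classes[i] == label) {
            encoded[i] = 1;  // mark the position of the matching class
        }
    }

    return encoded;
}
```

With the class order Setosa, Versicolor, Virginica, calling `one_hot("Versicolor", classes)` gives 0 1 0, which matches the list above.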

The DataSet class offers a wide variety of constructors. The easiest and most common way to create a data set object is the default constructor, which creates an empty data set.

DataSet data_set;

Once we have created the data set object, the next step is to fill it with data:

data_set.set_data_file_name("../data/iris_plant.csv");

data_set.set_separator("Space");

data_set.load_data();
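
To make the role of the separator concrete, here is a minimal standalone sketch of how one space-separated row could be turned into numbers. It does not show how OpenNN loads data internally. The helper name `parse_row` is hypothetical, and the sketch assumes that every field in the row is numeric, as is the case once the target has been binarized.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not an OpenNN function): parses one line of
// whitespace-separated numeric fields into a vector of doubles.
std::vector<double> parse_row(const std::string& line)
{
    std::istringstream stream(line);
    std::vector<double> values;
    double value = 0.0;

    // operator>> skips whitespace, so spaces act as the separator.
    while (stream >> value) {
        values.push_back(value);
    }

    // In this sketch a non-numeric token is treated as a programming
    // error: extraction must have stopped at the end of the line.
    assert(stream.eof());

    return values;
}
```

Under these assumptions, a row such as `5.1 3.5 1.4 0.2 1 0 0` yields seven values: four inputs followed by three binarized target columns.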

By default, the last column is set as the target and the remaining columns as inputs. To modify attributes of the Variables class, a pointer to it is needed. For instance, to set the name, units, and use of each variable, we first get the variables pointer.

Variables* variables_pointer = data_set.get_variables_pointer();

variables_pointer->set_name(0, "sepal_length");
variables_pointer->set_units(0, "centimeters");
variables_pointer->set_use(0, Variables::Input);

variables_pointer->set_name(1, "sepal_width");
variables_pointer->set_units(1, "centimeters");
variables_pointer->set_use(1, Variables::Input);

variables_pointer->set_name(2, "petal_length");
variables_pointer->set_units(2, "centimeters");
variables_pointer->set_use(2, Variables::Input);

variables_pointer->set_name(3, "petal_width");
variables_pointer->set_units(3, "centimeters");
variables_pointer->set_use(3, Variables::Input);

variables_pointer->set_name(4, "iris_setosa");
variables_pointer->set_use(4, Variables::Target);

variables_pointer->set_name(5, "iris_versicolour");
variables_pointer->set_use(5, Variables::Target);

variables_pointer->set_name(6, "iris_virginica");
variables_pointer->set_use(6, Variables::Target);

Similarly, to display the input and target information, use variables_pointer:

const Matrix<string> inputs_information = variables_pointer->get_inputs_information();
const Matrix<string> targets_information = variables_pointer->get_targets_information();

cout << "Input information" << endl << inputs_information << endl;
cout << "Target information" << endl << targets_information << endl;

The Instances class can also be accessed, through the instances pointer, for example to split the data.

Instances* instances_pointer = data_set.get_instances_pointer(); 

instances_pointer->split_random_indices(0.7, 0.15, 0.15);

The first number is the ratio of training instances, the second is the ratio of selection instances, and the last is the ratio of testing instances. By default, these values are set to 0.6, 0.2, and 0.2.

Finally, the DataSet class implements some useful preprocessing methods. Some of them are presented below:
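
As an illustration of what a random split by ratios involves, the standalone sketch below shuffles the instance indices and cuts them into three groups. It is not OpenNN's implementation. The names `Split` and `split_random`, the explicit seed, the normalization by the sum of the ratios, and the choice to give leftover instances to the testing group are all assumptions of this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical container for the three groups of instance indices.
struct Split {
    std::vector<std::size_t> training;
    std::vector<std::size_t> selection;
    std::vector<std::size_t> testing;
};

// Hypothetical helper (not an OpenNN function): shuffles the indices
// 0..instances_number-1 and splits them according to the given ratios.
Split split_random(std::size_t instances_number,
                   double training_ratio,
                   double selection_ratio,
                   double testing_ratio,
                   unsigned seed)
{
    assert(training_ratio >= 0.0 && selection_ratio >= 0.0 && testing_ratio >= 0.0);
    const double total = training_ratio + selection_ratio + testing_ratio;
    assert(total > 0.0);

    std::vector<std::size_t> indices(instances_number);
    std::iota(indices.begin(), indices.end(), std::size_t{0});

    std::mt19937 generator(seed);
    std::shuffle(indices.begin(), indices.end(), generator);

    // Truncate the group sizes; any remainder goes to the testing group.
    const std::size_t training_number = std::min(
        instances_number,
        static_cast<std::size_t>(instances_number * training_ratio / total));
    const std::size_t selection_number = std::min(
        instances_number - training_number,
        static_cast<std::size_t>(instances_number * selection_ratio / total));

    const auto selection_begin =
        indices.begin() + static_cast<std::ptrdiff_t>(training_number);
    const auto testing_begin =
        selection_begin + static_cast<std::ptrdiff_t>(selection_number);

    Split split;
    split.training.assign(indices.begin(), selection_begin);
    split.selection.assign(selection_begin, testing_begin);
    split.testing.assign(testing_begin, indices.end());
    return split;
}
```

Every index lands in exactly one group, so the three groups never overlap and together cover the whole data set.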

  • Scaling methods:
    Scaling the data lets the neural network work in better conditions. The following method scales the input data between a minimum and a maximum value and returns a structure with information about the inputs: the minimum, the maximum, the mean, and the standard deviation of each variable.

    const Vector< Statistics<double> > inputs_statistics = data_set.scale_inputs_minimum_maximum();
    
    cout << inputs_statistics;
    

    These scaled data are fed into the neural network during the training phase.

  • Correlation methods:
    Correlations show which input variables are most closely related to the targets. The following method calculates the linear correlations between all inputs and all targets.

    const Matrix<double> input_target_correlation = data_set.calculate_input_target_correlations();
    
    input_target_correlation.print_preview();
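
The minimum-maximum scaling mentioned in the list above can be sketched for a single column as follows. This is a standalone illustration, not OpenNN code. The names `Range` and `scale_minimum_maximum` are hypothetical, the target interval [-1, 1] is an assumption of this example, and a constant column is mapped to 0 to avoid dividing by zero.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical structure holding the original bounds of a column.
struct Range {
    double minimum;
    double maximum;
};

// Hypothetical helper (not an OpenNN function): scales `column` in place
// to the interval [-1, 1] (an assumption of this sketch) and returns the
// minimum and maximum the column had before scaling.
Range scale_minimum_maximum(std::vector<double>& column)
{
    assert(!column.empty());

    const auto bounds = std::minmax_element(column.begin(), column.end());
    // Copy the values now, because the loop below overwrites the column.
    const Range range{*bounds.first, *bounds.second};
    const double width = range.maximum - range.minimum;

    for (double& value : column) {
        value = (width == 0.0)
                    ? 0.0
                    : 2.0 * (value - range.minimum) / width - 1.0;
    }

    return range;
}
```

The returned bounds are what would be needed later to apply the same scaling to new inputs or to undo it.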
    

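The linear correlation between one input column and one target column can be illustrated with the Pearson coefficient. The sketch below is standalone and is not OpenNN's implementation. The helper name `linear_correlation` is hypothetical, and returning 0 when either column is constant is a convention chosen for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an OpenNN function): Pearson linear
// correlation coefficient between two columns of equal length.
// Returns 0 if either column has zero variance.
double linear_correlation(const std::vector<double>& x,
                          const std::vector<double>& y)
{
    assert(!x.empty() && x.size() == y.size());

    const double n = static_cast<double>(x.size());

    double mean_x = 0.0;
    double mean_y = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= n;
    mean_y /= n;

    double covariance = 0.0;  // sum of products of deviations
    double variance_x = 0.0;  // sum of squared deviations of x
    double variance_y = 0.0;  // sum of squared deviations of y
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - mean_x;
        const double dy = y[i] - mean_y;
        covariance += dx * dy;
        variance_x += dx * dx;
        variance_y += dy * dy;
    }

    const double denominator = std::sqrt(variance_x * variance_y);
    return (denominator == 0.0) ? 0.0 : covariance / denominator;
}
```

Values close to 1 or -1 indicate a strong linear relation between the input and the target, while values close to 0 indicate a weak one.
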
If you need more information about the DataSet class, visit the DataSet Class Reference.