Diagnose breast cancer from fine-needle aspirate images using OpenNN

This example aims to assess whether a lump in a breast could be malignant (cancerous) or benign (non-cancerous) from digitized images of a fine-needle aspiration biopsy.

The breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr William H. Wolberg.

Contents:

  1. Application type.
  2. Data set.
  3. Neural network.
  4. Training strategy.
  5. Testing analysis.
  6. Model deployment.
  7. Full Code.

1. Application type

The variable to be predicted can take two values (malignant or benign tumour). Therefore, this is a binary classification project.

The goal here is to model the probability of a malignant tumour, conditioned on the fine needle aspiration test features.
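To illustrate what modelling a probability means in practice, the following minimal C++ sketch maps a real-valued score to a probability with the logistic function and then turns that probability into a class decision. It is not OpenNN code: the names logistic and is_malignant and the default threshold of 0.5 are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>

// Logistic (sigmoid) function: maps any real score to a value in (0, 1),
// which can be read as a probability.
double logistic(double score)
{
    return 1.0 / (1.0 + std::exp(-score));
}

// Turns a probability into a class decision using a threshold.
// Returns true ("malignant") when the probability reaches the threshold.
// The 0.5 default is an assumption for this sketch, not an OpenNN setting.
bool is_malignant(double probability, double threshold = 0.5)
{
    assert(probability >= 0.0 && probability <= 1.0);
    assert(threshold > 0.0 && threshold < 1.0);
    return probability >= threshold;
}
```

For example, is_malignant(logistic(2.0)) is true and is_malignant(logistic(-2.0)) is false. In a medical setting, a lower threshold could be chosen to reduce the number of missed malignant cases, at the cost of more false alarms.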

2. Data set

The first step is to prepare the data set, the source of information for the classification problem. For that, we need to configure the following concepts:

  • Data source.
  • Variables.
  • Instances.

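The data source is the file breast_cancer.csv. It contains the data for this example in comma-separated values (CSV) format. It can be loaded as shown next, where the second argument is the column separator (it must match the one actually used in the file) and the third indicates that the first row holds the column names: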

DataSet data_set("path_to_source/breast_cancer.csv",',',true);
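The role of the separator can be sketched in plain C++, independently of OpenNN. The helpers below (split_line and parse_row are hypothetical names, and the sample row in the usage note is made up) split one line of a delimited file into fields and convert them to numbers. If the wrong separator is used, the number of fields no longer matches the expected number of columns.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits one line of a delimited text file into its fields.
std::vector<std::string> split_line(const std::string& line, char separator)
{
    std::vector<std::string> fields;
    std::stringstream stream(line);
    std::string field;
    while (std::getline(stream, field, separator))
    {
        fields.push_back(field);
    }
    return fields;
}

// Converts one data row to numbers and checks the column count.
// A wrong separator typically shows up here as a wrong number of fields.
std::vector<double> parse_row(const std::string& line, char separator,
                              std::size_t columns_number)
{
    const std::vector<std::string> fields = split_line(line, separator);
    assert(fields.size() == columns_number);
    std::vector<double> values;
    for (const std::string& field : fields)
    {
        values.push_back(std::stod(field)); // throws if a field is not numeric
    }
    return values;
}
```

For instance, parse_row("4;3;3;2;3;4;3;2;1;0", ';', 10) returns ten values, while splitting the same line with ',' yields a single field.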

The number of columns is 10, and the number of rows is 683. The variables in this problem are:

  • clump_thickness: (1-10). Benign cells tend to be grouped in monolayers, while cancerous cells are often grouped in multilayers.
  • cell_size_uniformity: (1-10). Cancer cells tend to vary in size, which makes this parameter valuable in determining whether the cells are cancerous or not.
  • cell_shape_uniformity: (1-10). Cancer cells also tend to vary in shape, which makes this parameter valuable for the same reason.
  • marginal_adhesion: (1-10). Normal cells tend to stick together. Cancer cells tend to lose this ability. So the loss of adhesion is a sign of malignancy.
  • single_epithelial_cell_size: (1-10). It is related to the uniformity mentioned above. Epithelial cells that are significantly enlarged may be malignant.
  • bare_nuclei: (1-10). This is a term used for nuclei not surrounded by cytoplasm (the rest of the cell). Those are typically seen in benign tumors.
  • bland_chromatin: (1-10). Describes a uniform "texture" of the nucleus seen in benign cells. In cancer cells, the chromatin tends to be coarser.
  • normal_nucleoli: (1-10). Nucleoli are small structures seen in the nucleus. In normal cells, the nucleolus is usually very small if visible at all. In cancer cells, the nucleoli become more prominent, and sometimes there are more of them.
  • mitoses: (1-10). Cancer is essentially a disease of uncontrolled mitosis.
  • diagnose: (0 or 1). Benign (non-cancerous) or malignant (cancerous) lump in a breast.

Once the data is ready, we can retrieve information about the variables, such as their names and descriptive statistics. The names are obtained as follows:

const Tensor<string, 1> inputs_names = data_set.get_input_variables_names();
const Tensor<string, 1> targets_names = data_set.get_target_variables_names();

The instances are divided into training, selection, and testing subsets. They represent 60% (409), 20% (137), and 20% (137) of the original instances, respectively, and are split at random using the following command

data_set.split_samples_random();
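The subset sizes quoted above can be reproduced with a short calculation. The sketch below is not OpenNN's implementation: split_sizes and shuffled_indices are hypothetical helpers, and the rounding rule (round the selection and testing shares, give the remainder to training) is an assumption that happens to yield 409, 137, and 137 for 683 instances.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

struct SplitSizes
{
    std::size_t training;
    std::size_t selection;
    std::size_t testing;
};

// Rounds the selection and testing shares and gives the remainder to training.
SplitSizes split_sizes(std::size_t samples_number,
                       double selection_ratio, double testing_ratio)
{
    assert(selection_ratio >= 0.0 && testing_ratio >= 0.0);
    assert(selection_ratio + testing_ratio <= 1.0);
    const auto selection = static_cast<std::size_t>(
        std::llround(selection_ratio * samples_number));
    const auto testing = static_cast<std::size_t>(
        std::llround(testing_ratio * samples_number));
    assert(selection + testing <= samples_number);
    return {samples_number - selection - testing, selection, testing};
}

// Returns the indices 0..n-1 in random order. The first `training` indices
// can be used for training, the next `selection` for selection, and the
// remaining ones for testing.
std::vector<std::size_t> shuffled_indices(std::size_t samples_number, unsigned seed)
{
    std::vector<std::size_t> indices(samples_number);
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    std::mt19937 generator(seed);
    std::shuffle(indices.begin(), indices.end(), generator);
    return indices;
}
```

Calling split_sizes(683, 0.2, 0.2) gives 409 training, 137 selection, and 137 testing instances. Fixing the seed makes the random split reproducible between runs.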

To get the number of input variables and the number of target variables, we use the following commands

const Index input_variables_number = data_set.get_input_variables_number();
const Index target_variables_number = data_set.get_target_variables_number();

For more information about the data set methods, see DataSet class.

3. Neural network

The second step is to choose the correct neural network architecture. For classification problems, it is usually composed of:

  • A scaling layer.
  • Two perceptron layers.
  • A probabilistic layer.
  • An unscaling layer.

The NeuralNetwork class builds the neural network and organizes its layers of neurons. The constructor below takes the project type and the architecture, given as the numbers of inputs, hidden neurons, and outputs. For more complex architectures, see NeuralNetwork class.

const Index hidden_neurons_number = 6;
NeuralNetwork neural_network(NeuralNetwork::Classification,
    {input_variables_number, hidden_neurons_number, target_variables_number});
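To get a feel for the size of this model, the sketch below counts the trainable parameters of a stack of fully connected layers. It assumes that every layer after the inputs is dense with one bias per neuron and that scaling adds no trainable parameters. dense_parameters_number is a hypothetical helper, not an OpenNN function, and the count reported by the library may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts the weights and biases of a stack of fully connected layers.
// architecture = {inputs, hidden neurons..., outputs}
std::size_t dense_parameters_number(const std::vector<std::size_t>& architecture)
{
    assert(architecture.size() >= 2);
    std::size_t parameters = 0;
    for (std::size_t i = 1; i < architecture.size(); ++i)
    {
        // Each neuron has one weight per incoming value plus one bias.
        parameters += (architecture[i - 1] + 1) * architecture[i];
    }
    return parameters;
}
```

For the architecture used here, dense_parameters_number({9, 6, 1}) gives (9 + 1) * 6 + (6 + 1) * 1 = 67 parameters, a small model relative to the 409 training instances.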

Once the neural network has been created, we can give it the names of the input and output variables

neural_network.set_inputs_names(inputs_names);
neural_network.set_outputs_names(targets_names);

The model is now defined, so we can proceed to the learning process with TrainingStrategy.

4. Training strategy

The third step is to set the training strategy, which is composed of:

  • Loss index.
  • Optimization algorithm.

Firstly, we construct the training strategy object

TrainingStrategy training_strategy(&neural_network, &data_set);

then, set the error term

training_strategy.set_loss_method(TrainingStrategy::NORMALIZED_SQUARED_ERROR);

and finally the optimization algorithm

training_strategy.set_optimization_method(TrainingStrategy::ADAPTIVE_MOMENT_ESTIMATION);

We can now start the training process by using the command

training_strategy.perform_training();

For more information about the training strategy methods, see TrainingStrategy class.
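To make the error term selected above more concrete, the following sketch computes a normalized squared error in plain C++. It uses a common definition, the sum of squared errors divided by the sum of squared deviations of the targets from their mean. This is an assumption for illustration, and OpenNN's implementation may differ in its details.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sum of squared differences between outputs and targets, divided by the
// sum of squared deviations of the targets from their mean.
double normalized_squared_error(const std::vector<double>& outputs,
                                const std::vector<double>& targets)
{
    assert(outputs.size() == targets.size());
    assert(!targets.empty());
    const double mean =
        std::accumulate(targets.begin(), targets.end(), 0.0) / targets.size();
    double squared_error = 0.0;
    double normalization = 0.0;
    for (std::size_t i = 0; i < targets.size(); ++i)
    {
        squared_error += (outputs[i] - targets[i]) * (outputs[i] - targets[i]);
        normalization += (targets[i] - mean) * (targets[i] - mean);
    }
    assert(normalization > 0.0); // undefined when all targets are equal
    return squared_error / normalization;
}
```

With this definition, a value of 0 means that the outputs match the targets exactly, and a value of 1 means that the model does no better than always predicting the mean of the targets.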

5. Testing analysis

The fourth step is to evaluate our model. For that purpose, we need to use the testing analysis class, whose goal is to validate the model’s generalization performance. Here, we compare the neural network outputs to the corresponding targets in the testing instances of the data set.

As in the previous cases, we start by building the testing analysis object

TestingAnalysis testing_analysis(&neural_network, &data_set);

and then perform the testing. In our case, we use the binary classification tests

testing_analysis.print_binary_classification_tests();

For more information about the testing analysis methods, see TestingAnalysis class.
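Binary classification tests are typically derived from a confusion matrix. The sketch below shows, in plain C++ and independently of OpenNN, how accuracy, sensitivity, and specificity can be computed from outputs and targets. The struct, the function names, and the 0.5 threshold are assumptions for this example, and the exact set of tests printed by OpenNN may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ConfusionMatrix
{
    std::size_t true_positives = 0;
    std::size_t false_positives = 0;
    std::size_t false_negatives = 0;
    std::size_t true_negatives = 0;
};

// Compares thresholded outputs with 0/1 targets (1 = positive class).
ConfusionMatrix confusion_matrix(const std::vector<double>& outputs,
                                 const std::vector<int>& targets,
                                 double threshold = 0.5)
{
    assert(outputs.size() == targets.size());
    ConfusionMatrix matrix;
    for (std::size_t i = 0; i < outputs.size(); ++i)
    {
        const bool predicted = outputs[i] >= threshold;
        const bool actual = targets[i] == 1;
        if (predicted && actual) ++matrix.true_positives;
        else if (predicted && !actual) ++matrix.false_positives;
        else if (!predicted && actual) ++matrix.false_negatives;
        else ++matrix.true_negatives;
    }
    return matrix;
}

// Fraction of all instances that are classified correctly.
double accuracy(const ConfusionMatrix& m)
{
    const std::size_t total = m.true_positives + m.false_positives
                            + m.false_negatives + m.true_negatives;
    assert(total > 0);
    return static_cast<double>(m.true_positives + m.true_negatives) / total;
}

// Fraction of actual positives that are detected.
double sensitivity(const ConfusionMatrix& m)
{
    const std::size_t positives = m.true_positives + m.false_negatives;
    assert(positives > 0);
    return static_cast<double>(m.true_positives) / positives;
}

// Fraction of actual negatives that are correctly rejected.
double specificity(const ConfusionMatrix& m)
{
    const std::size_t negatives = m.true_negatives + m.false_positives;
    assert(negatives > 0);
    return static_cast<double>(m.true_negatives) / negatives;
}
```

In a diagnosis problem, sensitivity is usually the figure to watch most closely, since a false negative corresponds to a malignant tumour classified as benign.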

6. Model deployment

Once our model is completed, the neural network is ready to predict outputs for inputs that it has never seen. This process is called model deployment.

To generate predictions with new data, pass a tensor of inputs to the calculate_outputs method

neural_network.calculate_outputs(inputs);

For instance, the new inputs are:

  • Clump thickness (1-10): 4
  • Cell size uniformity (1-10): 3
  • Cell shape uniformity (1-10): 3
  • Marginal adhesion (1-10): 2
  • Single epithelial cell size (1-10): 3
  • Bare nuclei (1-10): 4
  • Bland chromatin (1-10): 3
  • Normal nucleoli (1-10): 2
  • Mitoses (1-10): 1

and in OpenNN we can write it as

Tensor<type, 2> inputs(1,9);
inputs.setValues({{type(4),type(3),type(3),type(2),type(3),type(4),type(3),type(2),type(1)}});
neural_network.calculate_outputs(inputs);
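Before sending new data to a deployed model, it can be worth checking that each sample has the expected shape and range. The sketch below is a hypothetical helper, not part of OpenNN. It assumes nine input variables, each in the range 1 to 10, as described in the data set section.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when the sample has the expected number of features and
// every feature lies within [minimum, maximum].
bool is_valid_sample(const std::vector<double>& features,
                     std::size_t expected_size = 9,
                     double minimum = 1.0,
                     double maximum = 10.0)
{
    assert(minimum <= maximum);
    if (features.size() != expected_size)
    {
        return false;
    }
    for (const double value : features)
    {
        if (value < minimum || value > maximum)
        {
            return false;
        }
    }
    return true;
}
```

The sample listed above, is_valid_sample({4, 3, 3, 2, 3, 4, 3, 2, 1}), passes the check. A sample with a missing value or with a feature outside 1 to 10 would be rejected before reaching the network.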

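Alternatively, we can save the mathematical expression of the model to a file, using a different file name for each language so that the second call does not overwrite the first:

neural_network.save_expression_c("../data/expression.txt");
neural_network.save_expression_python("../data/expression.py");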

The exported expression can then be implemented in other languages, such as Python or PHP, among others.

7. Full Code

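Joining all steps, we obtain the following code. Note that this listing reads the file with ';' as the separator and trains with the cross-entropy error and the quasi-Newton method, whereas the snippets in the previous sections use ',', the normalized squared error, and adaptive moment estimation. Use the options that match your data file and your needs.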

// DataSet
DataSet data_set("../data/breast_cancer.csv", ';', true);
const Index input_variables_number = data_set.get_input_variables_number();
const Index target_variables_number = data_set.get_target_variables_number();
            
// Neural Network
const Index hidden_neurons_number = 6;
NeuralNetwork neural_network(NeuralNetwork::Classification,
      {input_variables_number,hidden_neurons_number,target_variables_number});
            
// Training Strategy
TrainingStrategy training_strategy(&neural_network, &data_set);
training_strategy.set_loss_method(TrainingStrategy::CROSS_ENTROPY_ERROR);
training_strategy.set_optimization_method(TrainingStrategy::QUASI_NEWTON_METHOD);
training_strategy.perform_training();
            
// Testing Analysis
TestingAnalysis testing_analysis(&neural_network, &data_set);
testing_analysis.print_binary_classification_tests();
            
// Model deployment
Tensor<type, 2> inputs(1,9); // one sample with nine input variables
inputs.setValues({{type(4),type(3),type(3),type(2),type(3),type(4),type(3),type(2),type(1)}});
neural_network.calculate_outputs(inputs);
            
// Save results
neural_network.save_expression_c("../data/breast_cancer.txt");
neural_network.save_expression_python("../data/breast_cancer.py");

This code can be exported to your C++ project.

References:

  • The data for this problem has been taken from the UCI Machine Learning Repository.
  • Wolberg, W.H., & Mangasarian, O.L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, 87, 9193–9196.
  • Zhang, J. (1992). Selecting typical instances in instance-based learning. In Proceedings of the Ninth International Machine Learning Conference (pp. 470–479). Aberdeen, Scotland: Morgan Kaufmann.
  • Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188. Also in Contributions to Mathematical Statistics (John Wiley, NY, 1950).