Diagnose breast cancer from fine-needle aspirate images using OpenNN
This example aims to assess whether a lump in a breast could be malignant (cancerous) or benign (non-cancerous) from digitized images of a fine-needle aspiration biopsy.
The breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr William H. Wolberg.
1. Application type
The variable to be predicted can take two values (malignant or benign tumour). Therefore, this is a binary classification project.
The goal here is to model the probability of a malignant tumour, conditioned on the fine needle aspiration test features.
2. Data set
The first step is to prepare the data set, the source of information for the classification problem. For that, we need to configure the following concepts:
- Data source.
- Variables.
- Instances.
The data source is the file breast_cancer.csv. It contains the data for this example in comma-separated values (CSV) format and can be loaded as
DataSet data_set("path_to_source/breast_cancer.csv",',',true);
The number of columns is 10, and the number of rows is 683. The variables in this problem are:
- clump_thickness: (1-10). Benign cells tend to be grouped in monolayers, while cancerous cells are often grouped in multilayers.
- cell_size_uniformity: (1-10). Cancer cells tend to vary in size and shape. That is why these parameters are valuable in determining whether the cells are cancerous or not.
- cell_shape_uniformity: (1-10). Cancer cells tend to vary in shape as well as in size, so uniformity of shape is valuable in determining whether the cells are cancerous or not.
- marginal_adhesion: (1-10). Normal cells tend to stick together. Cancer cells tend to lose this ability. So the loss of adhesion is a sign of malignancy.
- single_epithelial_cell_size: (1-10). It is related to the uniformity mentioned above. Epithelial cells that are significantly enlarged may be malignant.
- bare_nuclei: (1-10). This is a term used for nuclei not surrounded by cytoplasm (the rest of the cell). Those are typically seen in benign tumors.
- bland_chromatin: (1-10). Describes a uniform «texture» of the nucleus seen in benign cells. In cancer cells, the chromatin tends to be more coarse.
- normal_nucleoli: (1-10). Nucleoli are small structures seen in the nucleus. In normal cells, the nucleolus is usually very small if visible at all. In cancer cells, the nucleoli become more prominent, and sometimes there are more of them.
- mitoses: (1-10). Cancer is essentially a disease of uncontrolled mitosis.
- diagnose: (0 or 1). Benign (non-cancerous) or malignant (cancerous) lump in a breast.
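The ranges listed above can be checked before training. The following is a minimal standalone sketch, assuming each row stores its ten values as integers in the order given above; it uses only the C++ standard library, and the helper names (split_line, parse_integer, is_valid_row) are hypothetical and not part of OpenNN.

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Split one line of text on a single-character separator.
std::vector<std::string> split_line(const std::string& line, char separator)
{
    std::vector<std::string> fields;
    std::string field;
    std::istringstream stream(line);
    while (std::getline(stream, field, separator))
        fields.push_back(field);
    return fields;
}

// Parse a whole field as an integer; fails if any characters are left over.
bool parse_integer(const std::string& field, int& value)
{
    try
    {
        std::size_t parsed_characters = 0;
        value = std::stoi(field, &parsed_characters);
        return parsed_characters == field.size();
    }
    catch (const std::exception&)
    {
        return false;
    }
}

// Check one data row: nine inputs in 1..10 followed by a 0/1 target.
// Assumption: values are integers and the target is the last column.
bool is_valid_row(const std::string& line, char separator)
{
    const std::vector<std::string> fields = split_line(line, separator);
    if (fields.size() != 10) return false;

    for (std::size_t i = 0; i < fields.size(); ++i)
    {
        int value = 0;
        if (!parse_integer(fields[i], value)) return false;

        const bool is_target = (i == fields.size() - 1);
        if (is_target && value != 0 && value != 1) return false;
        if (!is_target && (value < 1 || value > 10)) return false;
    }
    return true;
}
```

Calling is_valid_row on every line after the header, with the same separator that is passed to the DataSet constructor, would flag rows with missing or out-of-range values.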
Once the data is ready, we retrieve information about the variables, such as their names and descriptive statistics.
const Tensor<string, 1> inputs_names = data_set.get_input_variables_names();
const Tensor<string, 1> targets_names = data_set.get_target_variables_names();
The instances are divided into training, selection, and testing subsets. They represent 60% (409), 20% (137), and 20% (137) of the original instances, respectively, and are split at random using the following command
data_set.split_samples_random();
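To illustrate what a random 60/20/20 split does, here is a minimal standalone sketch using only the C++ standard library. It is not OpenNN's implementation: the SampleUse enum, the split_samples function, and the rounding rule (round the selection and testing counts, give the remainder to training) are assumptions, chosen because they reproduce the 409/137/137 counts quoted above for 683 instances.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

enum class SampleUse { Training, Selection, Testing };

// Assign each sample to a subset at random.
// Assumption: selection and testing counts are rounded to the nearest
// integer and training receives whatever is left.
std::vector<SampleUse> split_samples(std::size_t samples_number,
                                     double selection_ratio,
                                     double testing_ratio,
                                     unsigned seed)
{
    assert(selection_ratio >= 0.0 && testing_ratio >= 0.0);

    const std::size_t selection_number =
        static_cast<std::size_t>(std::lround(selection_ratio * samples_number));
    const std::size_t testing_number =
        static_cast<std::size_t>(std::lround(testing_ratio * samples_number));
    assert(selection_number + testing_number <= samples_number);

    // Shuffle the sample indices so the assignment is random.
    std::vector<std::size_t> indices(samples_number);
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    std::mt19937 generator(seed);
    std::shuffle(indices.begin(), indices.end(), generator);

    // Everything starts as training; the first shuffled indices become
    // selection samples and the next ones become testing samples.
    std::vector<SampleUse> uses(samples_number, SampleUse::Training);
    for (std::size_t i = 0; i < selection_number; ++i)
        uses[indices[i]] = SampleUse::Selection;
    for (std::size_t i = 0; i < testing_number; ++i)
        uses[indices[selection_number + i]] = SampleUse::Testing;

    return uses;
}
```

Fixing the seed makes the split reproducible, which helps when comparing models trained on the same subsets.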
To get the number of input variables and the number of target variables, we use the following commands
const Index input_variables_number = data_set.get_input_variables_number();
const Index target_variables_number = data_set.get_target_variables_number();
For more information about the data set methods, see DataSet class.
3. Neural network
The second step is to choose the correct neural network architecture. For classification problems, it is usually composed of:
- A scaling layer.
- Two perceptron layers.
- A probabilistic layer.
- An unscaling layer.
The NeuralNetwork class builds the neural network and organizes its layers of neurons. The following constructor creates a classification network from the numbers of inputs, hidden neurons, and targets. If you need more complex architectures, see the NeuralNetwork class.
const Index hidden_neurons_number = 6;
NeuralNetwork neural_network(NeuralNetwork::Classification, {input_variables_number, hidden_neurons_number, target_variables_number});
Once the neural network has been created, we can label its inputs and outputs with the variable names taken from the data set
neural_network.set_inputs_names(inputs_names);
neural_network.set_outputs_names(targets_names);
The model is now defined, so we proceed to the learning process with TrainingStrategy.
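To show how data flows through the layer stack listed above, here is a minimal standalone sketch of a forward pass: scale the inputs, apply one hidden perceptron layer, then squash the result into a probability. It uses only the C++ standard library and is not OpenNN's implementation. The minimum-maximum scaling to [-1, 1], the tanh hidden activation, the logistic output, and all names (Layer, malignant_probability, and so on) are assumptions made for illustration; trained weights would come from the training step.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;   // one row of weights per neuron

struct Layer
{
    Matrix weights;
    Vector biases;
};

// Scaling layer (assumed minimum-maximum): map [minimum, maximum] to [-1, 1].
Vector scale_minimum_maximum(const Vector& inputs, double minimum, double maximum)
{
    assert(maximum > minimum);
    Vector scaled(inputs.size());
    for (std::size_t i = 0; i < inputs.size(); ++i)
        scaled[i] = 2.0 * (inputs[i] - minimum) / (maximum - minimum) - 1.0;
    return scaled;
}

// Weighted sum of the inputs plus a bias, for each neuron of a layer.
Vector combinations(const Layer& layer, const Vector& inputs)
{
    assert(layer.weights.size() == layer.biases.size());
    Vector result(layer.weights.size());
    for (std::size_t j = 0; j < layer.weights.size(); ++j)
    {
        assert(layer.weights[j].size() == inputs.size());
        double sum = layer.biases[j];
        for (std::size_t i = 0; i < inputs.size(); ++i)
            sum += layer.weights[j][i] * inputs[i];
        result[j] = sum;
    }
    return result;
}

double logistic(double x)
{
    return 1.0 / (1.0 + std::exp(-x));
}

// Forward pass: scaling -> hidden layer (tanh) -> single logistic output.
double malignant_probability(const Vector& raw_inputs,
                             const Layer& hidden,
                             const Layer& output)
{
    assert(output.weights.size() == 1);

    // All nine features in this data set range from 1 to 10.
    const Vector scaled = scale_minimum_maximum(raw_inputs, 1.0, 10.0);

    Vector activations = combinations(hidden, scaled);
    for (double& activation : activations)
        activation = std::tanh(activation);

    return logistic(combinations(output, activations)[0]);
}
```

With nine inputs and six hidden neurons, hidden.weights would be a 6 x 9 matrix and output.weights a 1 x 6 matrix, matching the sizes passed to the constructor above.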
4. Training strategy
The third step is to set the training strategy, which is composed of:
- Loss index.
- Optimization algorithm.
Firstly, we construct the training strategy object
TrainingStrategy training_strategy(&neural_network, &data_set);
then, set the error term
training_strategy.set_loss_method(TrainingStrategy::NORMALIZED_SQUARED_ERROR);
and finally the optimization algorithm
training_strategy.set_optimization_method(TrainingStrategy::ADAPTIVE_MOMENT_ESTIMATION);
We can now start the training process by using the command
training_strategy.perform_training();
For more information about the training strategy methods, see TrainingStrategy class.
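To make the two choices above more concrete, here is a minimal standalone sketch of what a normalized squared error and a single adaptive moment estimation (Adam) update compute. It uses only the C++ standard library and is not OpenNN's implementation. The normalization by the spread of the targets around their mean, the function names, and the default hyperparameters (the values commonly quoted for Adam) are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;

// Sum of squared errors divided by the sum of squared deviations of the
// targets from their mean (assumed normalization). A value of 1 means the
// model does no better than always predicting the mean target.
double normalized_squared_error(const Vector& outputs, const Vector& targets)
{
    assert(outputs.size() == targets.size() && !targets.empty());

    double mean = 0.0;
    for (double target : targets) mean += target;
    mean /= static_cast<double>(targets.size());

    double numerator = 0.0;
    double denominator = 0.0;
    for (std::size_t i = 0; i < targets.size(); ++i)
    {
        numerator += (outputs[i] - targets[i]) * (outputs[i] - targets[i]);
        denominator += (targets[i] - mean) * (targets[i] - mean);
    }
    assert(denominator > 0.0);   // targets must not all be equal
    return numerator / denominator;
}

// Running averages kept between Adam updates.
struct AdamState
{
    Vector first_moment;
    Vector second_moment;
    std::size_t step = 0;
};

// One Adam update of the parameters, given the gradient of the loss.
void adam_update(Vector& parameters, const Vector& gradient, AdamState& state,
                 double learning_rate = 0.001, double beta_1 = 0.9,
                 double beta_2 = 0.999, double epsilon = 1e-8)
{
    assert(parameters.size() == gradient.size());
    if (state.first_moment.empty())
    {
        state.first_moment.assign(parameters.size(), 0.0);
        state.second_moment.assign(parameters.size(), 0.0);
    }

    ++state.step;
    const double correction_1 = 1.0 - std::pow(beta_1, static_cast<double>(state.step));
    const double correction_2 = 1.0 - std::pow(beta_2, static_cast<double>(state.step));

    for (std::size_t i = 0; i < parameters.size(); ++i)
    {
        // Exponential moving averages of the gradient and its square.
        state.first_moment[i] = beta_1 * state.first_moment[i] + (1.0 - beta_1) * gradient[i];
        state.second_moment[i] = beta_2 * state.second_moment[i]
                               + (1.0 - beta_2) * gradient[i] * gradient[i];

        // Bias-corrected estimates, then the parameter step.
        const double corrected_first = state.first_moment[i] / correction_1;
        const double corrected_second = state.second_moment[i] / correction_2;
        parameters[i] -= learning_rate * corrected_first / (std::sqrt(corrected_second) + epsilon);
    }
}
```

A training loop would repeat three steps until a stopping criterion is met: evaluate the loss on the training samples, compute its gradient, and call an update such as adam_update.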
5. Testing analysis
The fourth step is to evaluate our model. For that purpose, we need to use the testing analysis class, whose goal is to validate the model’s generalization performance. Here, we compare the neural network outputs to the corresponding targets in the testing instances of the data set.
As in the previous cases, we start by building the testing analysis object
TestingAnalysis testing_analysis(&neural_network, &data_set);
and run the tests. In our case, we use the binary classification tests
testing_analysis.print_binary_classification_tests();
For more information about the testing analysis methods, see TestingAnalysis class.
6. Model deployment
Once our model is completed, the neural network is ready to predict outputs for inputs that it has never seen. This process is called model deployment.
To generate predictions with new data, you can use
neural_network.calculate_outputs();
For instance, the new inputs are:
- Clump thickness (1-10): 4
- Cell size uniformity (1-10): 3
- Cell shape uniformity (1-10): 3
- Marginal adhesion (1-10): 2
- Single epithelial cell size (1-10): 3
- Bare nuclei (1-10): 4
- Bland chromatin (1-10): 3
- Normal nucleoli (1-10): 2
- Mitoses (1-10): 1
and in OpenNN we can write it as
Tensor<type, 2> inputs(1,9);
inputs.setValues({{type(4),type(3),type(3),type(2),type(3),type(4),type(3),type(2),type(1)}});
neural_network.calculate_outputs(inputs);
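Alternatively, we can save the model's mathematical expression to a file, using a separate file for each language so that the second call does not overwrite the first
neural_network.save_expression_c("../data/expression.txt");
neural_network.save_expression_python("../data/expression.py");
The exported model can then be implemented in other languages, such as Python or PHP.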
7. Full Code
Joining all steps, we obtain the following code:
// DataSet
DataSet data_set("../data/breast_cancer.csv", ',', true);
data_set.split_samples_random();
const Tensor<string, 1> inputs_names = data_set.get_input_variables_names();
const Tensor<string, 1> targets_names = data_set.get_target_variables_names();
const Index input_variables_number = data_set.get_input_variables_number();
const Index target_variables_number = data_set.get_target_variables_number();

// Neural Network
const Index hidden_neurons_number = 6;
NeuralNetwork neural_network(NeuralNetwork::Classification, {input_variables_number, hidden_neurons_number, target_variables_number});
neural_network.set_inputs_names(inputs_names);
neural_network.set_outputs_names(targets_names);

// Training Strategy
TrainingStrategy training_strategy(&neural_network, &data_set);
training_strategy.set_loss_method(TrainingStrategy::NORMALIZED_SQUARED_ERROR);
training_strategy.set_optimization_method(TrainingStrategy::ADAPTIVE_MOMENT_ESTIMATION);
training_strategy.perform_training();

// Testing Analysis
TestingAnalysis testing_analysis(&neural_network, &data_set);
testing_analysis.print_binary_classification_tests();

// Model deployment: one sample with nine input values
Tensor<type, 2> inputs(1,9);
inputs.setValues({{type(4),type(3),type(3),type(2),type(3),type(4),type(3),type(2),type(1)}});
neural_network.calculate_outputs(inputs);

// Save results
neural_network.save_expression_c("../data/breast_cancer.txt");
neural_network.save_expression_python("../data/breast_cancer.py");
This code can be exported to your C++ project.