This post shows sample code for a simple way to handle missing feature values in machine learning: drop the rows that contain them, then make the number of training labels match the number of remaining training features.
How missing values should be handled depends on your machine learning approach, so this simple method is mainly useful when you just want to get the code running for the time being.
Note, however, that the sample code assumes the training features and the training labels share the same index.
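To illustrate why that assumption matters, here is a minimal sketch. The data and the names `features`, `labels_shared`, and `labels_shifted` are made up for illustration and are not part of the sample code. Selecting labels by the index that survives `dropna` is a label-based lookup, so it only returns the right rows when both objects were built with the same index. The `KeyError` behavior shown assumes a recent pandas version (1.0 or later).

```python
import numpy as np
import pandas as pd

features = pd.DataFrame({"FeatureA": [1.0, np.nan, 3.0]})  # default index 0, 1, 2
kept = features.dropna(how="any").index                     # labels 0 and 2 remain

# Same default index as the features: the lookup returns the matching rows.
labels_shared = pd.Series([10, 20, 30])
print(labels_shared.loc[kept].tolist())  # [10, 30]

# Hypothetical labels with a different index: the labels 0 and 2 do not exist,
# so the label-based lookup fails instead of returning the rows we want.
labels_shifted = pd.Series([10, 20, 30], index=[100, 101, 102])
try:
    labels_shifted.loc[kept]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)  # True
```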
import pandas as pd
import numpy as np

# Create sample data and insert NaN values
data_X = np.random.randn(6, 2)
data_X[0][1] = np.nan
data_X[4][0] = np.nan
print(data_X)
"""
#output
[[ 0.1669884          nan]
 [-0.93169488 -0.80602492]
 [ 1.34485881 -1.15684329]
 [-1.77475068  0.58345764]
 [        nan -1.34413655]
 [ 0.76400682  0.43928072]]
"""

# Create the feature DataFrame
train_features = pd.DataFrame(data_X, columns=["FeatureA", "FeatureB"])
print(train_features)
"""
#output
   FeatureA  FeatureB
0  0.166988       NaN
1 -0.931695 -0.806025
2  1.344859 -1.156843
3 -1.774751  0.583458
4       NaN -1.344137
5  0.764007  0.439281
"""

# Create labels
train_labels = pd.Series([1, 0, 1, 1, 0, 1])

# Drop rows whose features contain a missing value (NaN)
# how='any' drops a row if at least one of its values is missing
train_features = train_features.dropna(how='any')

# The number of rows in train_features and train_labels no longer matches
print(len(train_features.index.values))  # 4
print(len(train_labels.index.values))    # 6

# Match the number of rows in train_labels to train_features
# In other words, keep only the labels whose index remains in train_features
# This requires train_features and train_labels to have had the same index
train_labels = train_labels.loc[train_features.index]

# The number of rows in train_features and train_labels now matches
print(len(train_features.index.values))  # 4
print(len(train_labels.index.values))    # 4
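The same steps can be wrapped in a small helper. This is a sketch rather than part of the original sample: the function name `drop_missing_rows` is hypothetical, and instead of selecting labels by index it applies a boolean mask by position. That assumes only that the features and labels have the same number of rows in the same order, so they do not need to share an index.

```python
import numpy as np
import pandas as pd


def drop_missing_rows(features, labels):
    """Drop rows with any NaN feature, plus the labels at the same positions.

    Hypothetical helper: assumes features and labels have the same length
    and row order; their indexes do not need to match.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    # True for rows in which every feature value is present
    keep = features.notna().all(axis=1).to_numpy()
    return features[keep], labels[keep]


features = pd.DataFrame(
    {"FeatureA": [0.5, -0.9, np.nan], "FeatureB": [np.nan, -0.8, 1.2]}
)
labels = pd.Series([1, 0, 1], index=["a", "b", "c"])  # index differs on purpose

clean_features, clean_labels = drop_missing_rows(features, labels)
print(len(clean_features), len(clean_labels))  # 1 1
print(clean_labels.tolist())                   # [0]
```

Because the mask is a plain NumPy array, pandas filters both objects row by row without trying to align their indexes.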