Summer Research Fellowship Programme of India's Science Academies 2017

SIGN LANGUAGE RECOGNITION

Summer Research Fellowship report by:

Muskan Dhiman

National Institute of Technology, Hamirpur (H.P.)

Under the supervision of :

Dr G.N. Rathna

Department of Electrical Engineering, DSP Lab, Indian Institute of Science, Bangalore

Acknowledgment

I wish to express my sincere gratitude to my guide and mentor, Dr G.N. Rathna, for guiding and encouraging me during the course of my fellowship at the Indian Institute of Science while working on the project "Sign Language Recognition".

I also take this opportunity to thank Mr Mukesh Makwana and Mr Abhilash Jain for helping me carry out this project.

I sincerely thank the coordinator of the Summer Research Fellowship 2017, Mr CS Ravi Kumar, for giving me the opportunity to embark on this project.

Abstract

The project aims at building a machine learning model that can classify the various hand gestures used for fingerspelling in sign language. In this user-independent model, classification algorithms are trained on a set of image data and tested on a completely different set of data. Depth images are used for the image dataset, which gave better results than some of the previous literature [4], owing to the reduced pre-processing time. Various machine learning algorithms are applied to the datasets, including Convolutional Neural Networks (CNN). An attempt is made to increase the accuracy of the CNN model by pre-training it on the Imagenet dataset; however, only a small dataset could be used for pre-training, which gave an accuracy of about 15% during pre-training.

Introduction

Communication is crucial to human beings, as it enables us to express ourselves. We communicate through speech, gestures, body language, reading, writing and visual aids, speech being one of the most commonly used among them. Unfortunately, for the speech- and hearing-impaired minority, there is a communication gap. Visual aids or an interpreter are used to communicate with them, but these methods are cumbersome and expensive, and cannot be used in an emergency. Sign language chiefly uses manual communication to convey meaning: hand shapes, orientations and movements of the hands, arms or body are combined to express the speaker's thoughts.

Sign language consists of fingerspelling, which spells out words character by character, and word-level association, which involves hand gestures that convey the meaning of a whole word. Fingerspelling is a vital tool in sign language, as it enables the communication of names, addresses and other words that do not have a word-level sign. In spite of this, fingerspelling is not widely used, as it is challenging to understand and difficult to use. Moreover, there is no universal sign language and very few people know it, which makes it an inadequate alternative for communication.

A sign language recognition system that classifies fingerspelling can solve this problem. In this report, various machine learning algorithms are applied and their accuracies are recorded and compared.

Related Literature

Sanil Jain and KV Sameer Raja [4] worked on Indian Sign Language recognition using coloured images. They used feature extraction methods such as bag of visual words, Gaussian random and the Histogram of Gradients (HoG). An SVM was trained on three subjects, achieving an accuracy of 54.63% when tested on a totally different user.

Experiments

Datasets

For this project, two datasets are used: the ASL dataset and the ISL dataset.

American Sign Language (ASL) dataset:

The ASL dataset created by B. Kang et al. is used. It is a collection of 31,000 images, 1,000 images for each of the 31 classes, recorded from a total of five subjects. The gestures include the numerals 1-9 and the alphabets A-Z except 'J' and 'Z', because these require hand movement and thus cannot be captured as a static image. Some of the gestures are very similar, e.g. (0/O), (V/2) and (W/6); these are distinguished by context or meaning.

Indian Sign Language (ISL) dataset:


No standard dataset was available for ISL, so a dataset created by Mukesh Kumar Makwana, an M.E. student at IISc, is used. It consists of 43,750 depth images, 1,250 images for each of the 35 hand gestures, recorded from five different subjects. The gestures include the alphabets (A-Z) and numerals (0-9) except '2', whose gesture is identical to 'V'. The images are gray-scale with a resolution of 320x240.

Algorithms used:

Supervised classification algorithms are used, which require the dataset to be labelled before it is fed to the algorithm for training. For this project, three classifiers are used: SVM, k-NN and CNN.

Feature extraction algorithms are used for dimensionality reduction: they create a subset of the initial features so that only the important information is passed to the classifier. When the input to an algorithm is too large to process and is suspected to be redundant (as with the pixel representation of images), it can be converted into a reduced set of features. The feature extraction algorithms PCA, LBP and HoG are used alongside the classification algorithms for this purpose. This reduces the memory required and increases the efficiency of the model.

The algorithms used are as follows:

Support Vector Machine (SVM)

In SVM, each data point is plotted in an n-dimensional space (n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is done by finding the hyper-plane that best separates the classes.

Figure: SVM classifier

k-NN (k-Nearest Neighbors)

In k-NN classification, an object is classified by a majority vote of its neighbours: the object is assigned to the class most common among its k nearest neighbours, where k is a small positive integer. The output of the algorithm is a class membership.

Figure: k-NN classification
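As an illustration, a minimal scikit-learn sketch of k-NN classification might look as follows (k = 5 is the value used in the experiments below; X_train, y_train, X_test and y_test stand for the labelled training and test data and are assumptions here):

from sklearn.neighbors import KNeighborsClassifier

# Train a k-NN classifier with k = 5 on the flattened image features
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict class membership for the unseen test subject and measure accuracy
y_pred = knn.predict(X_test)
accuracy = knn.score(X_test, y_test)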

CNN

Convolutional Neural Networks (CNNs) are deep neural networks used to process data that have a grid-like topology, e.g. images, which can be represented as a 2-D array of pixels. A CNN model consists of four main operations: convolution, non-linearity (ReLU), pooling and classification (fully-connected layer).

Convolution: The purpose of convolution is to extract features from the input image. It preserves the spatial relationship between pixels by learning image features using small squares of the input data. It is usually followed by ReLU.

ReLU: An element-wise operation that replaces all negative pixel values in the feature map with zero. Its purpose is to introduce non-linearity into the convolutional network.

Pooling: Pooling (also called downsampling) reduces the dimensionality of each feature map but retains the important information.

Fully-connected layer: A multi-layer perceptron that uses the softmax function in the output layer. Its purpose is to use the features from the previous layers to classify the input image into one of the classes, based on the training data.

The combination of these layers is used to create a CNN model. The last layer is a fully connected layer.

Figure: CNN architecture
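As an illustration only (the actual architecture used in this project is described later as Model 1), a minimal Keras sketch combining these four operations could look like the following; the input shape, filter counts and activations here are illustrative assumptions:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# Convolution + ReLU: learn local features from the input image
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(240, 320, 1)))
# Pooling: downsample each feature map while retaining the important information
model.add(MaxPooling2D(pool_size=(2, 2)))
# Fully-connected layers with a softmax output for classification
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(35, activation='softmax'))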

Pre-training a CNN model:

The concept of transfer learning is used here: the model is first pre-trained on a dataset different from the original one. In this way the model gains knowledge that can be transferred to other neural networks. The knowledge gained by the model, in the form of "weights", is saved and can be loaded into another model. The pre-trained model can then be used as a feature extractor by adding fully-connected layers on top of it, and the whole model is trained on the original dataset after loading the saved weights.
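A minimal Keras sketch of this workflow is shown below; the model object, the file name and the number of frozen layers are illustrative, and the actual code used for models 2 and 3 appears later in this report:

# Save the weights learned during pre-training
model.save_weights("pretrained_weights.hdf5")

# Later, load the saved weights into a model with the same architecture
model.load_weights("pretrained_weights.hdf5")

# Freeze the pre-trained layers so they act as a fixed feature extractor;
# only the fully-connected layers added on top are trained on the original dataset
for layer in model.layers[:7]:
    layer.trainable = False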

PCA (Principal Component Analysis)

Using PCA, data is projected to a lower dimension for dimensionality reduction. The most important feature is the one with the largest variance or spread, as it corresponds to the largest entropy and thus encodes the most information. Thus the dimensions with the largest variance are kept while the others are discarded.

Figure: PCA feature reduction

LBP (Local Binary Patterns):

LBP computes a local representation of texture, constructed by comparing each pixel with its surrounding or neighbouring pixels. The results are stored as an array, which is then converted into decimal and stored as a 2-D LBP array.

Figure: Conversion of a pixel neighbourhood into its LBP representation
Figure: LBP output
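A minimal sketch of computing an LBP representation with scikit-image is shown below; P = 8 points and R = 2 radius match the parameters used in the experiments later, and im is assumed to be a gray-scale image array:

import numpy as np
from skimage.feature import local_binary_pattern

# P neighbouring points are sampled on a circle of radius R around each pixel
P, R = 8, 2
lbp = local_binary_pattern(im, P, R, method='default')  # 2-D array of LBP codes

# Histogram of LBP codes, usable as the feature vector for a classifier
hist, _ = np.histogram(lbp.ravel(), bins=np.arange(0, 2 ** P + 1), density=True)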

HoG (Histogram of Gradients):

A feature descriptor is a representation of an image or an image patch that simplifies the image by extracting useful information and throwing away extraneous information.

HoG is a feature descriptor that calculates a histogram of gradients for the image pixels, a vector of 9 bins (numbers) corresponding to the angles 0, 20, 40, 60 ... 160. The image is divided into cells (usually 8x8 pixels), and for each cell the gradient magnitude and gradient angle are calculated, from which a histogram is created for the cell. The histograms of a block of cells are normalized, and the final feature vector for the entire image is calculated.

Figure: Calculation of gradient magnitude and gradient direction
Figure: HoG visualization
Figure: Creating a histogram from gradient magnitude and direction

Input to algorithms

The image dataset was converted to a 2-D array of pixels, and the array was flattened and normalized. The following code snippet was used:

import os
import numpy as np
from PIL import Image

# Open the image, convert it to a NumPy array, then flatten and normalize to [0, 1]
im = Image.open(os.getcwd() + filenames_training[i])
im = np.array(im)
im = im.flatten()
im = im.astype(int)
im = im / 255.0
Conversion, Flattening and Normalization 

Experiments on ASL

The algorithms were first implemented on the ASL dataset. Training was done on four subjects and testing on the fifth subject.

Algorithms used

SVM+PCA

The SVM classifier is implemented using the SVM module in the sklearn library. For feature extraction, PCA is used, implemented with the PCA module in sklearn.decomposition. To find the optimum number of components to which the original feature set can be reduced without losing important features, a graph of number of components vs. variance is plotted. From the graph, 53 components are taken as the optimum, as the corresponding variance is close to the maximum; beyond 53 components, the additional variance explained per component is small and almost constant.

Figure: Variance vs. number of components (Y-axis: variance, X-axis: number of components)

Using PCA, we were able to reduce the number of components from 65,536 to 53, which reduced the complexity and training time of the algorithm.
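For reference, a minimal sketch of how such a variance-vs-components graph could be produced with scikit-learn and matplotlib is shown below; X_train is assumed to be the flattened, normalized training matrix, and the choice of 53 components is the one made above:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA with all components and accumulate the variance explained by each
pca = PCA().fit(X_train)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Plot cumulative variance against the number of components to pick the "knee" (about 53 here)
plt.plot(cumulative_variance)
plt.xlabel("No. of components")
plt.ylabel("Cumulative variance explained")
plt.show()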

This is a code snippet showing SVM and PCA.

from sklearn.svm import SVC
from sklearn.decomposition import PCA

# PCA feature reduction: keep the 53 most significant components
pca = PCA(n_components=53)
pca.fit(X_train)
X_train = pca.transform(X_train)

# Training the SVM classifier
svc = SVC(gamma=0.001, C=10)
svc.fit(X_train, y_train)

# Predicting on the test set (reduced with the same PCA transform)
X_test = pca.transform(X_test)
predict = svc.predict(X_test)
print(predict)

# Finding the accuracy
s = svc.score(X_test, y_test)
SVM and PCA

Reference: scikit-learn.org

SVM+LBP

Using LBP as the feature extraction method did not show promising results: LBP is a texture-recognition algorithm, and our dataset of depth images could not be classified on the basis of texture. A before-and-after comparison is shown below.

Figure: LBP feature extraction (left: original image; right: image after applying LBP)

As seen in the image after applying LBP, the edges of the curled fingers are not detected, so some image pre-processing might be needed to increase accuracy.

SVM+LBP with pre-processing

The following image pre-processing methods were performed:

1. Gamma correction: a nonlinear gray-level transformation that replaces gray-level I with I^γ (for γ > 0) or log(I) (for γ = 0), where γ ∈ [0, 1] is a user-defined parameter. It enhances the local dynamic range of the image in dark or shadowed regions while compressing it in bright regions and at highlights. Here γ = 0.2.
Figure: Gamma correction

2. Difference of Gaussian: Shading induced by surface structure is a potentially useful visual cue, but it is predominantly low-frequency spatial information that is hard to separate from effects caused by illumination gradients. Difference-of-Gaussian filtering suppresses this low-frequency illumination information while retaining the detail useful for recognition.

Figure: Difference of Gaussian

3. Contrast equalization: The final step of the pre-processing chain rescales the image intensities to standardize a robust measure of overall contrast or intensity variation.

Figure: Contrast equalization
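A minimal sketch of how the first two of these steps might be implemented with NumPy and SciPy is shown below; the image array im is assumed to be normalized to [0, 1], the Gaussian widths and the exponent alpha are illustrative, and the contrast-equalization line is a simplified version of the normalization described by Tan and Triggs (see References):

import numpy as np
from scipy.ndimage import gaussian_filter

# 1. Gamma correction: I -> I^gamma, with gamma = 0.2 as used above
im_gamma = np.power(im, 0.2)

# 2. Difference of Gaussian: subtract a wide blur from a narrow blur to suppress
#    low-frequency illumination gradients (sigma values are illustrative)
dog = gaussian_filter(im_gamma, sigma=1.0) - gaussian_filter(im_gamma, sigma=2.0)

# 3. Contrast equalization (simplified): rescale by a robust measure of overall contrast
alpha = 0.1
im_eq = dog / (np.mean(np.abs(dog) ** alpha) ** (1.0 / alpha))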

After preprocessing, LBP was applied.

Figure: LBP applied after pre-processing (first pass)

Applying LBP a second time:

Figure: LBP applied after pre-processing (second pass)

We were able to increase the accuracy by 20% after pre-processing. However, as the edges of the curled fingers were still not detected properly, the results were not very promising.

SVM+HoG:

Applying SVM with HoG gave the best accuracies recorded so far. HoG was implemented using the HoG module in the scikit-image library.

Figure: Applying HoG

The code snippet below was used to visualise the histogram.

import numpy as np
from PIL import Image
from skimage.feature import hog

hogtrain = []
for i in range(1, 1240):
    im = Image.open(readpath + training_filenames[i])
    im = np.array(im)
    # Compute the HoG feature vector (fd) and a visualization image for each sample
    fd, hog_image = hog(im, orientations=8, pixels_per_cell=(8, 8),
                        cells_per_block=(1, 1), visualise=True)
    hogtrain.append(fd)
Feature extraction of training dataset

Reference : scikit-image.org

The parameters pixels_per_cell and cells_per_block were varied and the results were recorded:

Pixels per cell | Cells per block | Average Accuracy (%)
8x8             | 1x1             | 68.93
8x8             | 2x2             | 60.89
16x16           | 1x1             | 68.02

The maximum accuracy was obtained with 8x8 pixels per cell and 1x1 cells per block, so these parameters were used.

Confusion matrix:

A confusion matrix gives a summary of the prediction results on a classification problem. Each row corresponds to an actual class and each column to a predicted class. Ideally the matrix is diagonal, which means that all classes have been correctly predicted.
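A minimal sketch of how such a matrix can be computed with scikit-learn is shown below; y_test and predict follow the variable names of the earlier SVM snippet:

from sklearn.metrics import confusion_matrix

# Rows correspond to actual classes, columns to predicted classes
cm = confusion_matrix(y_test, predict)
print(cm)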

A confusion matrix was obtained for SVM+HoG, with Subject 3 as the test dataset. The following classes showed anomalies: d, k, m, t, s, e; i.e., these classes were being wrongly predicted.

The classes showing anomalies were then separated from the original training dataset and trained in a separate SVM model. This method did not give good results, but it helped in identifying the classes that were being wrongly predicted.

Figure: Confusion matrix for SVM+HoG

k-NN with HoG:

k-nearest neighbours, when used with the HoG feature extractor, increased the accuracy by 12%. However, the algorithm took a long time to train and was not used subsequently.

Accuracies Recorded

Images/class | Testing set (Subject) | Algorithm                   | Parameters used                                | Maximum Accuracy (%)
10           | 1                     | SVM                         | C=10.0, gamma=0.01                             | 70
10           | 1                     | SVM+PCA                     | No. of components=53                           | 66.67
10           | 5                     | SVM+LBP                     | (No. of points for LBP, Radius): (8,2)         | 11.6
10           | 5                     | SVM+LBP with pre-processing | (No. of points for LBP, Radius): (16,2)        | 31.71
10           | 3                     | SVM+HoG                     | Pixels per cell: (8,8), Cells per block: (1,1) | 77.66
100          | 2                     | SVM+HoG                     | Pixels per cell: (8,8), Cells per block: (1,1) | 81.15
10           | 5                     | k-NN                        | Nearest neighbours=5                           | 55
10           | 3                     | k-NN with HoG               | Nearest neighbours=5                           | 67.63

Note: The table shows the maximum accuracy recorded for each algorithm.
Images/class | Classifier                  | Parameters                                     | Average Accuracy (%)
10           | SVM+PCA                     | No. of components=53                           | 60
10           | SVM+LBP                     | (No. of points for LBP, Radius): (8,2)         | 11
10           | SVM+LBP with pre-processing | (No. of points for LBP, Radius): (16,2)        | 31
10           | SVM+HoG                     | Pixels per cell: (8,8), Cells per block: (1,1) | 68.93
100          | SVM+HoG                     | Pixels per cell: (8,8), Cells per block: (1,1) | 71.78

Note: The table shows the average accuracy recorded for each algorithm.

Experiments with ISL dataset:

Accuracies recorded

The algorithms that showed promising results on the ASL dataset were applied to the ISL dataset, and the following accuracies were recorded. The table below shows the maximum accuracy recorded for each algorithm:

Table showing maximum accuracies

Images/class | Testing dataset (Subject no.) | Classifier | Parameters                                     | Max. Accuracy (%)
100          | 5                             | SVM+HoG    | Pixels per cell: (8,8), Cells per block: (1,1) | 71.88
100          | 5                             | SVM+PCA    | No. of components=53                           | 70.54

The table below shows the average accuracies recorded for each algorithm:

Table showing average accuracies

Images/class | Classifier | Parameters                                     | Avg. Accuracy (%)
100          | SVM+HoG    | Pixels per cell: (8,8), Cells per block: (1,1) | 64.12
100          | SVM+PCA    | No. of components=53                           | 56.8

CNN on ISL dataset:

The CNN model created by Mr Mukesh Makwana was used. The architecture of the model is as follows:

Model 1

  • Convolution layer: 3x3 kernel, 64 filters
  • Maxpooling: pool size 2x2
  • Convolution layer: 1x1 kernel, 16 filters
  • Maxpooling: pool size 3x3
  • Convolution layer: 3x3 kernel, 16 filters
  • Maxpooling: pool size 2x2
  • Convolution layer: 1x1 kernel, 32 filters
  • Maxpooling: pool size 2x2
  • Convolution layer: 5x5 kernel, 64 filters
  • Maxpooling: pool size 2x2
  • Flatten layer
  • Dropout layer: value 0.5
  • Fully-connected layer: 256 nodes
  • Dropout layer: value 0.5
  • Fully-connected layer: 64 nodes
  • Fully-connected layer: 35 nodes (output layer)

The model is compiled with the Adam optimizer from the keras.optimizers library.
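Based on the layer list above, a possible Keras sketch of Model 1 is shown below; the input shape is assumed from the 320x240 gray-scale depth images, and the activations, padding, loss and metrics are not specified in the report and are illustrative:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dropout, Dense

model = Sequential()
model.add(Conv2D(64, (3, 3), activation='relu', input_shape=(240, 320, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(16, (1, 1), activation='relu'))
model.add(MaxPooling2D(pool_size=(3, 3)))
model.add(Conv2D(16, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (1, 1), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (5, 5), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dense(35, activation='softmax'))

# Compile with the Adam optimizer, as described above
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])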

Following are the accuracies recorded for batch size 32 with 100 images per class:

For 50 epochs: 65.99 %

For 30 epochs after removing layer 7 and layer 8: 50 %

Pre-training on the CNN model:

Model 1 was modified to form model 2 and model 3, which were trained on an Imagenet dataset consisting of images of the following classes: Flowers, Nutmeg, Vegetables, Snowfall, Seashells and Ice-cream. Due to limited computation power, a dataset of 1,200 images is used.

Figure: Samples from the pre-training dataset (seashell, ice-cream, flower, vegetables, snowfall, nutmeg)

For model 2, layer 4, layer 7 and layer 8 were removed, and a dense layer with 512 nodes was added after layer 11.

For model 3, layers 2, 3, 4, 8 and 9 were removed, and a dense layer with 512 nodes was added after the flatten layer.

Pre-training was done with model 2 and model 3 after compiling them with the Keras optimizers Adam and Adadelta. The accuracies recorded for batch size 32 were as follows:

Model 2:

Optimizer: adam, 30 epochs: 14.40 %

Optimizer: adadelta, 50 epochs: 16.12 %

Model 3:

Optimizer: adam, 30 epochs: 14.3 %

Optimizer: adadelta, 50 epochs: 15.9 %

Following is the code snippet:

from keras.models import Sequential
from keras.layers import Dense, Dropout

# Fully-connected layers to be added on top of the pre-trained convolutional base
top_model = Sequential()
top_model.add(Dense(512, input_shape=model.output_shape[1:], activation='relu'))
top_model.add(Dropout(0.5))
top_model.add(Dense(256, activation='relu'))
top_model.add(Dropout(0.5))
top_model.add(Dense(64, activation='relu'))
top_model.add(Dropout(0.5))
top_model.add(Dense(35, activation='softmax'))

# Loading the weights of model 2 / model 3
model.load_weights("model_2.hdf5")

# Adding the dense layers on top of model 2
model.add(top_model)

# Freezing the pre-trained layers so they are not updated during training
for layer in model.layers[:7]:
    layer.trainable = False
Adding dense layers on top of the pre-trained CNN

For training the model, 300 images from each of the 6 classes were used, and 100 images per class for testing. The images were coloured and of varying sizes, so they were resized to 160x160. The image dimensions of the Indian Sign Language dataset (gray-scale images) and the Imagenet dataset (coloured images) had to be the same, so the ISL images were also resized to 160x160, giving both inputs the shape (160, 160, 3). The weights of models 2 and 3 were saved and then used for feature extraction by adding fully-connected layers, with the output layer having 35 nodes (the number of classes in the ISL dataset). However, when the whole model was trained with 100 images per class from the ISL dataset, the accuracies did not show improvement.
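A minimal sketch of the resizing and channel conversion described above, using PIL and NumPy, is shown below; the file path is illustrative:

import numpy as np
from PIL import Image

# Convert the gray-scale ISL image to 3 channels and resize it to 160x160,
# so that its shape (160, 160, 3) matches the pre-training (Imagenet) images
img = Image.open("isl_sample.png").convert("RGB")
img = img.resize((160, 160))
arr = np.array(img) / 255.0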

Conclusion

We conclude that SVM+HoG and Convolutional Neural Networks can be used as classification algorithms for sign language recognition; however, pre-training has to be performed with a larger dataset in order to improve accuracy. We achieved a maximum accuracy of 71.88% with SVM+HoG on the ISL depth-image dataset, with four subjects used for training and a different subject for testing, which is higher than the accuracy reported in previous literature.

Future work

User dependent model using pre-training:

Pre-training the model on a much larger dataset (e.g. ImageNet/ILSVRC) and then fine-tuning it with the ISL dataset, so that the model can show good results even when trained with a small dataset. For a user-dependent model, the user would provide a set of images for training so that the model becomes familiar with that user; this way the model will perform well for that particular user.

References:

  • Kang, Byeongkeun, Subarna Tripathi, and Truong Q. Nguyen. "Real-time sign language fingerspelling recognition using convolutional neural networks from depth map." Pattern Recognition (ACPR), 2015 3rd IAPR Asian Conference on. IEEE, 2015.
  • scikit-learn.org
  • Tan, Xiaoyang, and Bill Triggs. "Enhanced Local Texture Feature Sets for Face Recognition Under Difficult Lighting Conditions."
  • Jain, Sanil, and K.V. Sameer Raja. "Indian Sign Language Character Recognition."
  • deeplearningbook.org: Convolutional Networks
  • Iandola, Forrest N., Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size."
  • ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
  • www.learnopencv.com