Summer Research Fellowship report by:
Under the supervision of :
I wish to express my sincere gratitude to my guide and mentor, Dr GN Rathna, for guiding and encouraging me during the course of my fellowship at the Indian Institute of Science, while working on the project "Sign Language Recognition".
I also take this opportunity to thank Mr Mukesh Makwana and Mr Abhilash Jain for helping me carry out this project.
I sincerely thank the coordinator of Summer Research Fellowship 2017, Mr CS Ravi Kumar for giving me the opportunity to embark on this project.
The project aims at building a machine learning model that can classify the various hand gestures used for fingerspelling in sign language. In this user-independent model, classification algorithms are trained on one set of image data and tested on a completely different set. Depth images are used for the dataset; owing to the reduced pre-processing time, these gave better results than some of the previous work. Various machine learning algorithms are applied to the datasets, including a Convolutional Neural Network (CNN). An attempt is made to increase the accuracy of the CNN model by pre-training it on the Imagenet dataset. However, only a small dataset was available for pre-training, which gave an accuracy of 15% during training.
Communication is crucial to human beings, as it enables us to express ourselves. We communicate through speech, gestures, body language, reading, writing or visual aids, speech being one of the most commonly used among them. Unfortunately, for the speech- and hearing-impaired minority, there is a communication gap. Visual aids, or an interpreter, are used for communicating with them, but these methods are cumbersome and expensive, and cannot be used in an emergency. Sign Language chiefly uses manual communication to convey meaning. This involves simultaneously combining hand shapes, orientations and movements of the hands, arms or body to express the speaker's thoughts.
Sign Language consists of fingerspelling, which spells out words character by character, and word level association which involves hand gestures that convey the word meaning. Fingerspelling is a vital tool in sign language, as it enables the communication of names, addresses and other words that do not carry a meaning in word level association. In spite of this, fingerspelling is not widely used as it is challenging to understand and difficult to use. Moreover, there is no universal sign language and very few people know it, which makes it an inadequate alternative for communication.
A system for sign language recognition that classifies finger spelling can solve this problem. Various machine learning algorithms are used and their accuracies are recorded and compared in this report.
Sanil Jain and KV Sameer Raja worked on Indian Sign Language recognition using coloured images. They used feature extraction methods like bag of visual words, Gaussian random and the Histogram of Gradients (HoG). An SVM was trained on three subjects, achieving an accuracy of 54.63% when tested on a totally different user.
For this project, two datasets are used: the ASL dataset and the ISL dataset.
The ASL dataset created by B. Kang et al. is used. It is a collection of 31,000 images: 1,000 images for each of the 31 classes. The gestures were recorded from a total of five subjects. They include the numerals 1-9 and the alphabets A-Z except 'J' and 'Z', because these require movement of the hand and thus cannot be captured in the form of a static image. Some of the gestures are very similar: (0/o), (V/2) and (W/6). These are distinguished by context or meaning.
No standard dataset was available for ISL, so a dataset created by Mukesh Kumar Makwana, M.E. student at IISc, is used. It consists of 43,750 depth images: 1,250 images for each of the 35 hand gestures, recorded from five different subjects. The gestures include the alphabets (A-Z) and the numerals (0-9) except '2', which is exactly like 'V'. The images are gray-scale with a resolution of 320x240.
Classification machine learning algorithms like SVM and k-NN are used for supervised learning, which involves labelling the dataset before feeding it to the algorithm for training. For this project, three classification algorithms are used: SVM, k-NN and CNN.
Feature extraction algorithms are used for dimensionality reduction: they create a subset of the initial features so that only the important data is passed to the algorithm. When the input to an algorithm is too large to process and is suspected to be redundant (like the repetitiveness of images represented by pixels), it can be converted into a reduced set of features. The feature extraction algorithms PCA, LBP and HoG are used alongside the classification algorithms for this purpose. This reduces the memory required and increases the efficiency of the model.
The algorithms used are as follows:
In SVM, each data point is plotted in an n-dimensional space (n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is done by finding the hyper-plane that best differentiates the classes.
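As an illustrative sketch (on toy data, not the project's image features), an SVM classifier in scikit-learn might look as follows; the C and gamma values are those reported in the results table later:

```python
from sklearn.svm import SVC

# toy data: two well-separated clusters, one per class
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [9, 9], [9, 8], [8, 9], [8, 8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# RBF-kernel SVM with the C and gamma values used in the experiments
clf = SVC(C=10.0, gamma=0.01)
clf.fit(X, y)
pred = clf.predict([[0.5, 0.5], [8.5, 8.5]])
```

The learned hyper-plane separates the two clusters, so the two query points fall into class 0 and class 1 respectively.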
In k-NN classification, an object is classified by a majority vote of its neighbours: the object is assigned to the class most common among its k nearest neighbours, where k is a small positive integer. The output of the algorithm is a class membership.
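A minimal sketch of this voting scheme with scikit-learn, on assumed toy data (the project itself uses k=5 with HoG features):

```python
from sklearn.neighbors import KNeighborsClassifier

# toy 1-D data: two well-separated classes
X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# each query point gets the majority class of its k=3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
pred = knn.predict([[1.5], [10.5]])
```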
Convolutional Neural Networks (CNNs) are deep neural networks used to process data with a grid-like topology, e.g. images, which can be represented as a 2-D array of pixels. A CNN model consists of four main operations: convolution, non-linearity (ReLU), pooling and classification (fully-connected layer).
Convolution: The purpose of convolution is to extract features from the input image. It preserves the spatial relationship between pixels by learning image features using small squares of input data. It is usually followed by ReLU.
ReLU: An element-wise operation that replaces all negative pixel values in the feature map by zero. Its purpose is to introduce non-linearity into the convolutional network.
Pooling: Pooling (also called downsampling) reduces the dimensionality of each feature map while retaining the important information.
Fully-connected layer: A multilayer perceptron that uses the softmax function in its output layer. Its purpose is to use the features from the previous layers to classify the input image into one of the classes, based on the training data.
The combination of these layers is used to create a CNN model. The last layer is a fully connected layer.
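As an illustrative sketch of how these four operations combine, a minimal Keras model could look as follows; the input shape, filter count and kernel size are assumptions for illustration, not the project's actual architecture:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Minimal CNN sketch: convolution + ReLU, pooling, then a fully-connected
# softmax output. Shapes and filter counts are illustrative only.
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 1)))  # convolution + ReLU
model.add(MaxPooling2D(pool_size=(2, 2)))                                  # pooling / downsampling
model.add(Flatten())
model.add(Dense(35, activation='softmax'))                                 # fully-connected classifier
```

The final Dense layer has 35 nodes, one per class of the ISL dataset.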
Pre-training a CNN model:
The concept of transfer learning is used here: the model is first pre-trained on a dataset different from the original one. In this way, the model gains knowledge that can be transferred to other neural networks. The knowledge gained by the model, in the form of "weights", is saved and can be loaded into another model. The pre-trained model can be used as a feature extractor by adding fully-connected layers on top of it. After loading the saved weights, the model is trained with the original dataset.
Using PCA, data is projected to a lower dimension for dimensionality reduction. The most important feature is the one with the largest variance or spread, as it corresponds to the largest entropy and thus encodes the most information. Thus the dimension with the largest variance is kept while others are reduced.
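A small sketch of this idea with scikit-learn's PCA, on synthetic data whose variance is concentrated along one axis (the data and its sizes are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic data: almost all of the variance lies along the first axis
rng = np.random.RandomState(0)
X = rng.randn(100, 2) * np.array([10.0, 0.1])

# keep only the single dimension with the largest variance
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
```

Here `explained_variance_ratio_` confirms that nearly all of the information (variance) survives the reduction.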
LBP computes a local representation of texture, constructed by comparing each pixel with its surrounding (neighbouring) pixels. The result for each pixel is a binary pattern, which is converted to a decimal number and stored in a 2-D LBP array.
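A minimal sketch with scikit-image's `local_binary_pattern`; the tiny synthetic image is an assumption, while the (points, radius) = (8, 2) setting matches the one used in the experiments below:

```python
import numpy as np
from skimage.feature import local_binary_pattern

# tiny synthetic gray-scale image with a vertical edge
image = np.array([[0, 0, 255, 255],
                  [0, 0, 255, 255],
                  [0, 0, 255, 255],
                  [0, 0, 255, 255]], dtype=np.uint8)

# 8 sampling points on a circle of radius 2; each pixel is compared with
# its neighbours and the resulting binary pattern is stored as a number
lbp = local_binary_pattern(image, P=8, R=2, method='uniform')
```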
A feature descriptor is a representation of an image or an image patch that simplifies the image by extracting useful information and throwing away extraneous information.
HoG is a feature descriptor that calculates a histogram of gradients for the image pixels: a vector of 9 bins (numbers) corresponding to the angles 0, 20, 40, 60, ..., 160. The image is divided into cells (usually 8x8 pixels), and for each cell the gradient magnitude and gradient angle are calculated and used to build the cell's histogram. The histograms of each block of cells are then normalized, and the final feature vector for the entire image is assembled.
The image dataset was converted to a 2-D array of pixels. Each array was flattened and normalized. Following is the code snippet:
im = im.flatten()      # flatten the 2-D pixel array into a 1-D feature vector
im = im.astype(int)
im = im / 255.0        # normalize the pixel values to [0, 1]
The algorithms were first implemented on an ASL dataset. Training was done on four subjects and testing on the fifth subject.
SVM classifier is implemented using the SVM module present in the sklearn library. For feature extraction, PCA is used, implemented with the PCA module in sklearn.decomposition. To find the optimum number of components to which the original feature set can be reduced without losing important features, a graph of number of components vs. variance is plotted. From the graph, 53 components are taken as the optimum, since the corresponding variance is close to the maximum; beyond 53, the variance per component decreases slowly and is almost constant.
Using PCA, we were able to reduce the number of features from 65,536 to 53, which reduced the complexity and training time of the algorithm.
This is a code snippet showing SVM and PCA.
from sklearn.svm import SVC
from sklearn.decomposition import PCA
# PCA feature reduction to 53 components
pca = PCA(n_components=53)
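The component-selection step described above can be sketched as follows; the random matrix stands in for the real flattened-image data, and the 95% variance threshold is an assumed cut-off:

```python
import numpy as np
from sklearn.decomposition import PCA

# stand-in for the flattened image matrix (samples x pixels)
rng = np.random.RandomState(0)
X = rng.rand(60, 100)

# fit PCA with all components and accumulate the explained variance
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# smallest number of components that retains 95% of the variance
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
```

Plotting `cumulative` against the component index gives the "number of components vs. variance" curve used to pick 53 components.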
Using LBP as a feature extraction method did not show promising results: LBP is a texture recognition algorithm, and our dataset of depth images could not be classified on the basis of texture. A before-and-after comparison of LBP is presented below.
As seen in Fig 12b, the edges of the curled fingers are not detected, so some image pre-processing might be needed to increase accuracy.
The following image pre-processing methods were performed :
2. Difference of Gaussian: Shading induced by surface structure is potentially a useful visual cue but it is predominantly low-frequency spatial information that is hard to separate from effects caused by illumination gradients.
3. Contrast Equalization: The final step of our preprocessing chain rescales the image intensities to standardize a robust measure of overall contrast or intensity variation.
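A rough sketch of these two steps using SciPy; the Gaussian sigmas are assumptions, and the simple zero-mean/unit-variance rescaling below merely stands in for the contrast equalization step:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussian(image, sigma1=1.0, sigma2=2.0):
    # band-pass the image by subtracting two Gaussian blurs,
    # suppressing low-frequency illumination gradients
    return gaussian_filter(image, sigma1) - gaussian_filter(image, sigma2)

def contrast_equalize(image, eps=1e-8):
    # rescale intensities to a standard overall contrast
    image = image - image.mean()
    return image / (image.std() + eps)

img = np.random.rand(240, 320)          # stand-in depth image
out = contrast_equalize(difference_of_gaussian(img))
```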
After preprocessing, LBP was applied.
LBP applied a second time, after pre-processing.
We were able to increase the accuracy by 20% after pre-processing. However, as the edges of the curled fingers were still not detected properly, the results were not very promising.
Applying SVM with HoG gave the best accuracies recorded so far. HoG was implemented using the HoG module present in the scikit-image library.
The code snippet below was used to visualise the histogram.
from skimage.feature import hog
from skimage import data, color, exposure

# compute the HoG feature vector and visualisation image for each image
for i in range(1, 1240):
    fd, hog_image = hog(im, orientations=8, pixels_per_cell=(8, 8),
                        cells_per_block=(1, 1), visualise=True)   # im: the i-th image
Reference : scikit-image.org
The parameters pixels_per_cell and cells_per_block were varied and the results were recorded:
The maximum accuracy was obtained with pixels_per_cell = (8, 8) and cells_per_block = (1, 1), so these parameter values were used.
A confusion matrix gives a summary of the prediction results on a classification problem. Each row of the matrix corresponds to an actual class and each column to a predicted class. Ideally, a diagonal is obtained across the matrix, meaning that all classes have been predicted correctly.
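As a small sketch with scikit-learn (the labels are toy values, not the project's full class set):

```python
from sklearn.metrics import confusion_matrix

# toy actual and predicted labels; rows = actual class, columns = predicted
y_true = ['d', 'k', 'm', 'd', 'k', 'm']
y_pred = ['d', 'k', 't', 'd', 'm', 'm']
cm = confusion_matrix(y_true, y_pred, labels=['d', 'k', 'm', 't'])
```

Off-diagonal entries (here one 'k' predicted as 'm', and one 'm' as 't') mark the wrongly predicted classes.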
A confusion matrix was obtained for SVM+HoG, with Subject 3 as the test dataset, and the following classes showed anomalies: d, k, m, t, s and e, i.e., these classes were getting wrongly predicted.
The classes showing anomalies were then separated from the original training dataset and trained in a separate SVM model. This method did not give good results, but it helped in identifying the classes that were getting wrongly predicted.
k-NN, when used with the HoG feature extractor, increased the accuracy by 12%. However, the algorithm took a long time to run and was not used subsequently.
| Images/class | Testing set (Subject) | Algorithm | Parameters used | Maximum Accuracy (%) |
|---|---|---|---|---|
| 10 | 1 | SVM | C=10.0, gamma=0.01 | 70 |
| 10 | 1 | SVM+PCA | No. of components = 53 | 66.67 |
| 10 | 5 | SVM+LBP | (points, radius) = (8, 2) | 11.6 |
| 10 | 5 | SVM+LBP with pre-processing | (points, radius) = (16, 2) | 31.71 |
| 10 | 3 | SVM+HoG | pixels_per_cell (8,8), cells_per_block (1,1) | 77.66 |
| 100 | 2 | SVM+HoG | pixels_per_cell (8,8), cells_per_block (1,1) | 81.15 |
| 10 | 3 | k-NN with HoG | nearest neighbours = 5 | 67.63 |
| Images/class | Classifier | Parameters | Average Accuracy (%) |
|---|---|---|---|
| 10 | SVM+PCA | No. of components = 53 | 60 |
| 10 | SVM+LBP | (points, radius) = (8, 2) | 11 |
| 10 | SVM+LBP with pre-processing | (points, radius) = (16, 2) | 31 |
| 10 | SVM+HoG | pixels_per_cell (8,8), cells_per_block (1,1) | 68.93 |
| 100 | SVM+HoG | pixels_per_cell (8,8), cells_per_block (1,1) | 71.78 |
The algorithms that showed promising results on the ASL dataset were also applied to the ISL dataset, and the following accuracies were recorded. The following table shows the maximum accuracies recorded for each algorithm:
| Images/class | Testing set (Subject no.) | Classifier | Parameters | Max. Accuracy (%) |
|---|---|---|---|---|
| 100 | 5 | SVM+HoG | pixels_per_cell (8,8), cells_per_block (1,1) | 71.88 |
| 100 | 5 | SVM+PCA | No. of components = 53 | 70.54 |
The table below shows the average accuracies recorded for each algorithm:
| Images/class | Classifier | Parameters | Average Accuracy (%) |
|---|---|---|---|
| 100 | SVM+HoG | pixels_per_cell (8,8), cells_per_block (1,1) | 64.12 |
| 100 | SVM+PCA | No. of components = 53 | 56.8 |
The CNN model created by Mr Mukesh Makwana was used. The architecture of the model is as follows:
The model is compiled with the adam optimizer from the keras.optimizers library.
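A sketch of the compile step; the one-layer stand-in model below replaces the actual CNN architecture, and the categorical cross-entropy loss is an assumption:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# stand-in model; the real network is the CNN described above
model = Sequential()
model.add(Dense(35, activation='softmax', input_shape=(512,)))

# compile with the adam optimizer from keras.optimizers
model.compile(optimizer=Adam(), loss='categorical_crossentropy',
              metrics=['accuracy'])
```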
Following are the accuracies recorded for batch size 32 with 100 images per class :
For 50 epochs: 65.99 %
For 30 epochs after removing layer 7 and layer 8: 50 %
Model 1 was modified to form models 2 and 3, which were pre-trained on an Imagenet subset consisting of images of the following classes: Flowers, Nutmeg, Vegetables, Snowfall, Seashells and Ice-cream. Due to limited computation power, a dataset of 1,200 images is used.
For model 2, layers 4, 7 and 8 were removed, and a dense layer with 512 nodes was added after layer 11.
For model 3, layers 2, 3, 4, 8 and 9 were removed, and a dense layer with 512 nodes was added after the flatten layer.
Pre-training was done with models 2 and 3 after compiling them with the keras optimizers adam and adadelta. The accuracies for batch size 32 were as follows:
Model 2:
Optimizer: adam, epochs: 30 - 14.40 %
Optimizer: adadelta, epochs: 50 - 16.12 %
Model 3:
Optimizer: adam, epochs: 30 - 14.3 %
Optimizer: adadelta, epochs: 50 - 15.9 %
Following is the code snippet:
from keras.models import Sequential
from keras.layers import Dense

# adding the dense layers on top of model 2
top_model = Sequential()
top_model.add(Dense(512, input_shape=model.output_shape[1:], activation='relu'))

# loading the weights of model 2 / model 3 (saved earlier)

# freeze the first seven layers so their pre-trained weights are not updated
for layer in model.layers[:7]:
    layer.trainable = False
For pre-training, 300 images from each of the six classes are used for training and 100 images per class for testing. The images were coloured and of varying sizes, so they were resized to 160x160. The image dimensions of the Indian Sign Language dataset (gray-scale images) and the Imagenet subset (coloured images) had to match, so the ISL images were also resized to 160x160, giving both inputs the shape (160, 160, 3). The weights of models 2 and 3 are saved and then used for feature extraction, by adding fully-connected layers with an output layer of 35 nodes (the number of classes in the ISL dataset). However, when the whole model was trained with 100 images per class of the ISL dataset, the accuracies did not show improvement.
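The resizing-and-channel-stacking step can be sketched as below; the nearest-neighbour index sampling is an assumption (a library resize, e.g. PIL's, would normally be used instead):

```python
import numpy as np

def to_rgb_160(gray):
    # resize a gray-scale image to 160x160 by nearest-neighbour sampling,
    # then stack it into 3 identical channels to get shape (160, 160, 3)
    h, w = gray.shape
    rows = np.arange(160) * h // 160
    cols = np.arange(160) * w // 160
    resized = gray[rows][:, cols]
    return np.stack([resized] * 3, axis=-1)

img = np.zeros((240, 320), dtype=np.uint8)   # stand-in ISL depth image
out = to_rgb_160(img)
```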
We conclude that SVM+HoG and Convolutional Neural Networks can be used as classification algorithms for sign language recognition. However, pre-training has to be performed on a larger dataset to show an increase in accuracy. We achieved a maximum accuracy of 71.88% with SVM+HoG on the ISL depth-image dataset, with four subjects used for training and a different subject for testing, which is higher than the accuracies recorded in previous work.
User-dependent model using pre-training:
Pre-training the model on a larger dataset (e.g. ILSVRC, with 1,000 classes and over a million images), and then fine-tuning it with the ISL dataset, should let the model show good results even when trained with a small dataset. For a user-dependent model, the user gives a set of images to the model for training, so that it becomes familiar with the user. This way the model will perform well for that particular user.