Summer Research Fellowship Programme of India's Science Academies 2017
Gunda Rohit Chandra
Department of Computer Science and Engineering
National Institute of Technology, Warangal
Telangana - 506004
July 12, 2017
The aim of this project is to create a tracker, or improve an existing tracker,
for tracking objects in IR images. During this project, the basics of computer
vision, tracking and thermal imaging were studied, along with trackers such as
the Spatially Regularized Discriminative Correlation Filter (SRDCF), the
Tracking, Learning and Detection (TLD) tracker, the Multi-Domain Convolutional
Neural Network (MDNet), Guided MDNet and SR Guided MDNet. A preprocessing
method that makes the SR Guided MDNet tracker track well in IR images is
proposed, and the resulting tracker is named SR Guided MDNet IR.
I declare that this report titled "Object tracking in Infra Red images" is a
record of original work carried out by me under the supervision of Dr. Deepak
Mishra and has not formed the basis for the award of any degree, diploma,
associateship, fellowship, or other titles in this or any other Institution or
University of higher learning.
IR images are now used in many fields, such as the military and medicine.
Tracking in IR images is used in applications such as pedestrian detection in
autonomous vehicles and intrusion detection. Here, we create a new tracker
named SR Guided MDNet IR by adding a new preprocessing technique to SR Guided
MDNet to obtain better performance for object tracking in IR images.
Chapter 1
Computer Vision Basics
1.1 Introduction
A digital image is a representation of a two-dimensional image as a finite set
of digital values, called picture elements or pixels. Common image formats
include 1 sample per point (black and white or grayscale), 3 samples per point
(red, green and blue, or hue, saturation and value) and 4 samples per point
(cyan, magenta, yellow and black, or red, green, blue and alpha, where alpha is
also known as opacity).
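As a minimal sketch of the sample counts described above, the arrays below hold
1, 3 and 4 samples per pixel. The use of NumPy, the 8-bit sample type and the
helper name samples_per_pixel are assumptions made for illustration only.

```python
import numpy as np

height, width = 4, 6  # illustrative image size

# One sample per pixel: grayscale.
gray = np.zeros((height, width), dtype=np.uint8)

# Three samples per pixel: RGB (HSV has the same shape, different meaning).
rgb = np.zeros((height, width, 3), dtype=np.uint8)

# Four samples per pixel: RGBA (CMYK has the same shape, different meaning).
rgba = np.zeros((height, width, 4), dtype=np.uint8)

# Set the top-left pixel to pure red, fully opaque in the RGBA case.
rgb[0, 0] = (255, 0, 0)
rgba[0, 0] = (255, 0, 0, 255)


def samples_per_pixel(image):
    """Return the number of samples stored for each pixel."""
    return 1 if image.ndim == 2 else image.shape[2]
```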
1.1.1 Fields that Use Digital Image Processing
Unlike humans, who are limited to the visual band of the electromagnetic (EM)
spectrum, imaging machines cover almost the entire EM spectrum, ranging from
gamma to radio waves. They can operate on images generated by sources that
humans are not accustomed to associating with images. These include gamma-ray
imaging, X-ray imaging, imaging in the ultraviolet band, imaging in the visible
and infrared bands, imaging in the microwave band, imaging in the radio band,
and ultrasound imaging.
1.1.2 Important techniques in digital image processing
Some of the important techniques in digital image processing are image
resizing, image rotation, image translation, image skewing, image smoothing,
image sharpening, unsharp masking and median filtering.
1.1.3 Edge Detection
Edges are sudden discontinuities in an image, that is, sudden jumps in
intensity or colour. They can be found using derivatives, which have a nonzero
value only where the intensity or colour changes. We can take the gradient in
either one direction or in both directions. In practice, the edges are found by
masking, i.e. by sliding a small derivative mask (kernel) over the image.
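The sketch below illustrates derivative-based edge detection by masking. The
Sobel masks are one common choice and are an assumption here, since the text
does not name a specific mask; the function names are also illustrative.

```python
import numpy as np

# Sobel masks approximating the horizontal and vertical derivatives.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T


def apply_mask(image, mask):
    """Slide a 3x3 mask over the image (valid region only, no padding)."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += mask[dy, dx] * image[dy:dy + h - 2, dx:dx + w - 2]
    return out


def gradient_magnitude(image):
    """Combine both directional derivatives into one edge-strength map."""
    gx = apply_mask(image, SOBEL_X)
    gy = apply_mask(image, SOBEL_Y)
    return np.hypot(gx, gy)


# A dark image with a bright right half: one vertical edge.
image = np.zeros((5, 6))
image[:, 3:] = 1.0
edges = gradient_magnitude(image)
```

The response is zero in the flat regions and nonzero only in the columns that
straddle the jump in intensity.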
1.1.4 Key Stages in Digital Image Processing
Figure 1.1: A picture showing the key stages in Digital Image Processing
1.1.5 Neural Networks
Neural networks (convolutional neural networks, deep neural networks and other
types such as RNNs) are now used extensively in computer vision. Neural
networks are modelled on the human brain and nervous system. A network is
composed of a large number of highly interconnected processing elements called
neurons. An NN is configured for a specific application, such as pattern
recognition or data classification. Neural networks have several characteristic
properties: a unit either passes its signal on completely or not at all, and
the output is influenced by many units. The nodes in the network carry weights
analogous to synaptic weights, and the biological analogy extends to properties
such as the refractory period and bifurcation.
Inspiration from Neurobiology
A neuron is a many-inputs / one-output unit. Its output can be excited or not
excited, and the incoming signals from other neurons determine whether the
neuron shall excite ("fire"). The output is subject to attenuation in the
synapses, which are the junctions between neurons.
A simple neuron takes the inputs, calculates their summation and compares it
with the threshold set during the learning stage.
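A simple threshold neuron of this kind can be sketched as follows. The weights
stand in for the synaptic attenuation, and the specific weight and threshold
values are assumptions chosen only so that the neuron behaves like a logical
AND.

```python
def threshold_neuron(inputs, weights, threshold):
    """Return 1 (fire) if the weighted input sum reaches the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


# With unit weights and a threshold of 2, the neuron acts as a logical AND.
and_weights = [1.0, 1.0]
and_threshold = 2.0
outputs = [threshold_neuron([a, b], and_weights, and_threshold)
           for a in (0, 1) for b in (0, 1)]
```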
1.1.6 Deep Learning
A deep network consists of one input layer, one output layer and multiple fully
connected hidden layers in between. Each layer is represented as a series of
neurons and progressively extracts higher and higher-level features of the
input until the final layer essentially makes a decision about what the input
shows. Activation functions are used to transform the output of one layer
before it is passed on to the next. The functions generally used are ReLU,
softplus, tanh and sigmoid; ReLU is currently the most widely used. There are
many types of neural networks, including deep autoencoders, convolutional
neural networks, recurrent neural networks, long short-term memory (LSTM)
networks, generative adversarial networks and differentiable neural networks.
Figure 1.2: A neuron
Figure 1.3: Model for a neuron
Figure 1.4: A simple deep network
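To make the role of these functions concrete, the following sketch defines
three of them and passes an input through a small fully connected network
(tanh is available directly as np.tanh). The layer sizes, random weights and
the forward helper are illustrative assumptions, not a network used in this
report.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def softplus(x):
    # log(1 + exp(x)), computed in a numerically stable way.
    return np.logaddexp(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def forward(x, layers, activation=relu):
    """Pass x through fully connected layers given as (weights, bias) pairs.

    The activation is applied after every layer except the last one.
    """
    for index, (weights, bias) in enumerate(layers):
        x = weights @ x + bias
        if index < len(layers) - 1:
            x = activation(x)
    return x


rng = np.random.default_rng(0)
# Illustrative sizes: 4 inputs, two hidden layers of 8 units, 2 outputs.
sizes = [4, 8, 8, 2]
layers = [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
output = forward(rng.normal(size=4), layers)
```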
Chapter 2
Infra Red Images
2.1 Introduction to IR images
Infrared radiation was discovered in 1800 by Sir Frederick William Herschel.
The infrared band lies beyond the red end of the visible band, at longer
wavelengths. The infrared wavelength band is broad and is usually divided into
different bands based on their different properties: near infrared (NIR,
wavelengths 0.7-1 µm), shortwave infrared (SWIR, 1-3 µm), midwave infrared
(MWIR, 3-5 µm), and longwave infrared (LWIR, 8-12 µm).
Figure 2.1: Spectrum of electromagnetic radiation
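The band division can be expressed as a small lookup, sketched below. The
limits are the commonly quoted ones (0.7-1, 1-3, 3-5 and 8-12 µm); treating
each band as closed at its lower limit and open at its upper limit is an
assumption of this sketch, as is the function name.

```python
# Band limits in micrometres: (name, lower, upper).
IR_BANDS = [
    ("NIR", 0.7, 1.0),
    ("SWIR", 1.0, 3.0),
    ("MWIR", 3.0, 5.0),
    ("LWIR", 8.0, 12.0),
]


def infrared_band(wavelength_um):
    """Return the band name for a wavelength, or None if it is outside all bands."""
    for name, lower, upper in IR_BANDS:
        if lower <= wavelength_um < upper:
            return name
    return None
```

Note that wavelengths between 5 and 8 µm fall outside the listed bands, so the
lookup returns None for them.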
LWIR, and sometimes MWIR, is commonly referred to as thermal infrared (TIR).
TIR cameras are sensitive to emitted radiation at everyday temperatures and
should not be confused with NIR and SWIR cameras that, in contrast, mostly
measure reflected radiation. These non-thermal cameras are dependent on
illumination and in general behave in a similar way to visual cameras.
2.2 Thermal imaging
Thermal images are visual displays of measured emitted, reflected, and
transmitted thermal radiation within an area. When presented to an operator,
color maps are often used to map pixel intensity values in order to visualize
details more clearly. Due to multiple sources of thermal radiation, thermal
imaging can be challenging depending on the properties of the object and its
surroundings. The amount of radiation emitted by the object depends on its
emissivity. The thermal radiation from other objects is also reflected on the
surface of the object. The amount of radiation that reaches the detector is
affected by the atmosphere. Some is transmitted, some is absorbed, and some is
even emitted from the atmosphere itself. Moreover, the camera itself emits
thermal radiation during operation.
Figure 2.2: An example of how the emissivity of materials affects what is
perceived. A transparent tape of another logo than that of the soda has been
placed on the metal can. The can was then filled with hot water. The tape has
higher emissivity than the can and appears warmer when measuring. Image
courtesy of Jörgen Ahlberg and Patrik Stensbo
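The contributions described above can be combined in a simplified radiometric
model, sketched below. Several assumptions are made that are not stated in the
text: the object is opaque (so its reflectance is one minus its emissivity),
the atmosphere emits with emissivity one minus its transmittance, radiation is
integrated over all wavelengths using the Stefan-Boltzmann law, and the
camera's own emission is ignored. The temperatures and emissivities are
illustrative values only.

```python
STEFAN_BOLTZMANN = 5.670374419e-8  # W m^-2 K^-4


def blackbody_exitance(temperature_k):
    """Total radiant exitance of an ideal blackbody (Stefan-Boltzmann law)."""
    return STEFAN_BOLTZMANN * temperature_k ** 4


def received_radiation(t_object, t_reflected, t_atmosphere,
                       emissivity, transmittance):
    """Sum the three contributions reaching the detector (W/m^2)."""
    emitted = emissivity * transmittance * blackbody_exitance(t_object)
    reflected = ((1.0 - emissivity) * transmittance
                 * blackbody_exitance(t_reflected))
    atmosphere = (1.0 - transmittance) * blackbody_exitance(t_atmosphere)
    return emitted + reflected + atmosphere


# A hot can (330 K) in a 293 K room: bare metal versus a taped patch.
bare_metal = received_radiation(330.0, 293.0, 293.0,
                                emissivity=0.1, transmittance=1.0)
taped = received_radiation(330.0, 293.0, 293.0,
                           emissivity=0.95, transmittance=1.0)
```

Under these assumptions the taped patch sends more radiation to the detector
than the bare metal, although both are at the same temperature, which matches
the effect shown in Figure 2.2.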
2.2.1 Advantages and limitations of thermal imaging
From the aspect of measuring temperatures, thermal imaging is advantageous
compared to point-based methods since temperatures over a large area can be
compared. However, it is not considered to be as accurate as contact methods.
Compared to visual cameras, thermal cameras are favourable if there is a
temperature difference connected to the object or phenomenon we want to detect.
Thermal cameras can produce an image with no or few distortions during darkness
and/or difficult weather conditions such as fog, rain or snow.
Thermal cameras are expensive and have low resolution compared to visual
cameras. In comparison to a visual camera, a thermal camera typically requires
more training for correct usage. It is not considered possible to perform
person identification from thermal imagery. This means that thermal cameras can
be used in applications where preservation of privacy is crucial. However, if
person identification is required, the thermal camera has to be combined with a
visual camera.
2.2.2 Image analysis in thermal infrared
In the thermal infrared spectrum, there are no shadows since mostly emitted
radiation is measured. In most applications, the emitted radiation changes much
more slowly than the reflected radiation. That is, an object moving from a dark
room into the sunlight will not immediately change its appearance. Compared to
a visual camera, a thermal infrared camera typically has more blooming, lower
resolution and a larger percentage of dead pixels.
2.2.3 Applications
There are many applications related to automatic image analysis in thermal
imagery. Monitoring of wild and domestic animals can be used to detect
inflammations, perform behavior analysis, or estimate population sizes. A small
thermal camera mounted at the front of a car (or train) can be used to detect
and track pedestrians as well as other vehicles. Heat losses in buildings can
be detected using a thermal camera, and there are automatic methods that map
thermal images to 3D models for heat loss visualization. Thermal (radiometric)
cameras are useful for detecting fires. They can also see through smoke and are
commonly used by firefighters to find persons and to localize the base of the
fire. Industry is a broad area with many applications, such as detection of
different materials, positioning, and non-destructive testing. Medical
applications include detection of tumors in early stages, detection of
inflammations, and fever screening. There are numerous military applications,
such as automatic target recognition, target tracking, gunfire detection,
missile approach warning, mine detection, and sniper detection. In search and
rescue, persons can be searched for independently of daylight using cameras
carried by UAVs, helicopters or rescue robots. In surveillance, persons and
vehicles are detected and tracked, and their behavior is analyzed, in order to
detect intrusion and suspicious behavior.
Chapter 3
Automatic Target Recognition (ATR)
3.1 Introduction to ATR
Automatic Target Recognition (ATR) is the ability of an algorithm or device to
recognize targets or objects based on data obtained from sensors. An ATR
algorithm is a sequence of computer-executable steps that determines which
image locations should be given a target label.
3.2 Uses of Automatic Target Recognition
ATR can be useful for everything from recognizing an object on a battlefield to
filtering out interference caused by large flocks of birds on Doppler weather
radar. Possible military applications include simple identification systems
such as an IFF transponder, and ATR is also used in applications such as
unmanned aerial vehicles and cruise missiles. Research has been done into using
ATR for border security, safety systems, automated vehicles, and many others.
3.3 Various stages in ATR algorithm
Figure 3.1: Processes in automatic target recognition
Chapter 4
4.1 Introduction
Video tracking is the process of locating an object over time using a camera.
Various trackers are used to locate an object in the frames of a video. A
tracker requires the bounding box of the target to be initialized in the first
frame and then estimates the target location in the following frames of the
video.
Figure 4.1: A picture representing how a tracker works.
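The initialize-then-update workflow can be sketched with a toy tracker. The
template-matching approach used here (sum of squared differences in a small
search window) is a deliberately simple stand-in and is not one of the trackers
studied in this report; the class name, the search radius and the synthetic
frames are all assumptions made for illustration.

```python
import numpy as np


class TemplateTracker:
    """Toy tracker: finds the best match of a fixed template near the last box."""

    def __init__(self, search_radius=4):
        self.search_radius = search_radius
        self.template = None
        self.box = None  # (x, y, width, height)

    def init(self, frame, box):
        """Store the target appearance from the first-frame bounding box."""
        x, y, w, h = box
        self.template = frame[y:y + h, x:x + w].astype(float)
        self.box = box

    def update(self, frame):
        """Estimate the box in a new frame by minimising squared differences."""
        x0, y0, w, h = self.box
        best_score, best_xy = np.inf, (x0, y0)
        r = self.search_radius
        for y in range(max(0, y0 - r), min(frame.shape[0] - h, y0 + r) + 1):
            for x in range(max(0, x0 - r), min(frame.shape[1] - w, x0 + r) + 1):
                patch = frame[y:y + h, x:x + w].astype(float)
                score = np.sum((patch - self.template) ** 2)
                if score < best_score:
                    best_score, best_xy = score, (x, y)
        self.box = (best_xy[0], best_xy[1], w, h)
        return self.box


def make_frame(x, y, size=32):
    """Synthetic frame: a bright textured 6x6 target on a dark background."""
    frame = np.zeros((size, size))
    frame[y:y + 6, x:x + 6] = np.arange(36).reshape(6, 6) + 10.0
    return frame


tracker = TemplateTracker()
tracker.init(make_frame(5, 5), (5, 5, 6, 6))
# The target moves two pixels right and one pixel down per frame.
boxes = [tracker.update(make_frame(5 + 2 * t, 5 + t)) for t in range(1, 6)]
```

Because the template is never updated, a tracker like this fails as soon as the
target changes appearance, which is one reason the learning-based trackers in
the next chapter are needed.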
4.2 Applications of tracking
Object tracking has many real-world applications, such as human-computer
interaction, video communication and compression, security and surveillance,
traffic control, augmented reality, medical imaging, video editing, motion
analysis, and autonomous robots and cars.
4.3 Challenges in tracking
The challenges faced by a tracker include illumination changes, motion blur,
scale change, background clutter, occlusion and out-of-plane rotation.
Chapter 5
SR Guided MDNet
5.1 Trackers used in SR Guided MDNet
5.1.1 Spatially Regularized Discriminative Correlation Filters (SRDCF)
The tracker must generalize the target appearance from a very limited set
of training samples to achieve robustness against, e.g. occlusions, fast motion
and deformations. DCF based approaches have successfully been applied to the
tracking problem. These methods learn a correlation filter from a set of training
samples. The correlation filter is trained to perform a circular sliding window
operation on the training samples. This corresponds to assuming a periodic
extension of these samples. The periodic assumption enables efficient training
and detection by utilizing the Fast Fourier Transform (FFT).
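The way the periodic assumption lets training and detection run through the FFT
can be sketched with a single-sample correlation filter. This is a minimal
closed-form filter in the style of standard DCF trackers, not the SRDCF
formulation; the Gaussian label, the regularization constant and the function
names are assumptions made for illustration.

```python
import numpy as np


def gaussian_label(shape, sigma=2.0):
    """Desired response: a Gaussian peak at index (0, 0), wrapping periodically."""
    h, w = shape
    ys = np.minimum(np.arange(h), h - np.arange(h))[:, None]
    xs = np.minimum(np.arange(w), w - np.arange(w))[None, :]
    return np.exp(-(ys ** 2 + xs ** 2) / (2.0 * sigma ** 2))


def train_filter(patch, label, reg=1e-2):
    """Closed-form single-sample filter in the Fourier domain.

    Returns the complex conjugate of the filter spectrum, so that detection
    is a plain element-wise product.
    """
    F = np.fft.fft2(patch)
    G = np.fft.fft2(label)
    return (G * np.conj(F)) / (F * np.conj(F) + reg)


def detect(filter_conj, patch):
    """Return the (dy, dx) circular shift at which the response peaks."""
    response = np.real(np.fft.ifft2(np.fft.fft2(patch) * filter_conj))
    return np.unravel_index(np.argmax(response), response.shape)


rng = np.random.default_rng(0)
target = rng.normal(size=(32, 32))
filter_conj = train_filter(target, gaussian_label(target.shape))

# Circularly shifting the patch moves the response peak by the same amount.
shifted = np.roll(target, shift=(3, 5), axis=(0, 1))
dy, dx = detect(filter_conj, shifted)
```

Both training and detection cost only a few FFTs and element-wise operations,
which is what makes DCF trackers efficient. The price is that every shifted
sample wraps around the patch border.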
To alleviate the problems induced by the circular convolution, the
regularization term of the standard DCF is replaced with a more general
Tikhonov regularization [1]. The regularization weights determine the
importance of the filter coefficients depending on their spatial locations. In
this way, SRDCF overcomes the limitations of the standard DCF. An optimization
strategy based on the iterative Gauss-Seidel method enables efficient online
learning of SRDCF.
Figure 5.1: A picture representing the difference created by spatial regulariza-
tion in SRDCF [1]
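As a rough illustration of spatially varying regularization weights, the sketch
below builds a penalty map that is small at the filter centre and grows towards
the border, so that filter coefficients outside the target region are
suppressed. The quadratic form and the parameter values (base, growth) are
assumptions chosen for illustration and should not be read as the exact weight
function of [1].

```python
import numpy as np


def spatial_weights(filter_shape, target_shape, base=0.1, growth=3.0):
    """Quadratic penalty map: small inside the target, large towards the border."""
    fh, fw = filter_shape
    th, tw = target_shape
    # Coordinates measured from the filter centre.
    ys = (np.arange(fh) - (fh - 1) / 2.0)[:, None]
    xs = (np.arange(fw) - (fw - 1) / 2.0)[None, :]
    return base + growth * (ys / th) ** 2 + growth * (xs / tw) ** 2


weights = spatial_weights(filter_shape=(41, 41), target_shape=(10, 10))
```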
5.1.2 Tracking, Learning and Detection (TLD) tracker
In long-term tracking of unknown objects in a video stream, the object is
defined by its location and extent in a single frame. In every frame that
follows, the task is to determine the object's location and extent or indicate
that the object is not present. The TLD framework comprises the object state,
the object model, the tracker (a Median Flow tracker), the object detector, the
learning component and the integrator.
Figure 5.2: The framework and working of the TLD tracker [3].
5.1.3 Multi-Domain Convolutional Neural Network (MDNet)
The Multi-Domain Convolutional Neural Network (MDNet) is based on a
discriminatively trained CNN. The MDNet tracking process is divided into 2
phases: offline training, and online tracking with fine-tuning. Its
architecture has 5 hidden layers (3 convolutional layers and 2 fully connected
layers) and 1 domain-specific layer. The K branches at the last fully connected
layer correspond to K domains (or videos) during offline training, because each
video sequence is considered a separate domain when the network is trained
iteratively.
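The idea of shared layers followed by K domain-specific branches can be
sketched as follows. Only the fully connected part is shown, the convolutional
layers are replaced by a fixed feature vector, and the layer sizes and random
weights are illustrative assumptions rather than the actual MDNet dimensions or
parameters.

```python
import numpy as np


class MultiDomainHead:
    """Shared fully connected layers followed by K domain-specific branches."""

    def __init__(self, num_domains, feature_dim=16, hidden_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        # Shared layers: learned jointly from all training videos.
        self.shared = [rng.normal(scale=0.1, size=(hidden_dim, feature_dim)),
                       rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))]
        # One two-class (target vs. background) branch per domain.
        self.branches = [rng.normal(scale=0.1, size=(2, hidden_dim))
                         for _ in range(num_domains)]

    def forward(self, features, domain):
        """Score a feature vector with the branch of the given domain."""
        x = features
        for weights in self.shared:
            x = np.maximum(0.0, weights @ x)  # ReLU
        logits = self.branches[domain] @ x
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()  # softmax over the two classes


head = MultiDomainHead(num_domains=3)
features = np.ones(16)  # stands in for the output of the convolutional layers
scores = [head.forward(features, k) for k in range(3)]
```

During offline training only the branch of the current video would be updated
together with the shared layers, so the shared layers are pushed towards a
representation that is useful across all domains.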
The main limitations of MDNet are that pre-training requires a large amount of
time and that, because its computational complexity is high, a large amount of
time is also required to evaluate any sequence. The main reasons for this are
the large number of random samples generated around the previous target patch
and the large number of random training samples generated for fine-tuning.