After a long journey filled with many sleepless nights, consumption of copious amounts of coffee, and much suffering, I finally managed to submit my PhD thesis.
I must say that I was really nervous when it came to the viva. But once it started, I really enjoyed it. So much so that when the chair said there were only two more questions to go, I wished there was time for more.
My PhD studies dealt with the subject area of automated sign language recognition. For a description of what this involves, and why sign language recognition is useful, here is a link to a short news article available on Newspoint, the news portal of the University of Malta.
When I first started my research in this area, deep learning was still in its infancy and had not yet been applied to the field of sign language recognition. At that time I had only a basic knowledge of machine learning, let alone deep learning, and the general consensus in this area of study was that the way to go was to rely purely on computer vision and image processing techniques. In fact, that was the only option at the time.
But as deep learning methods gained traction and ‘awesome’ results started being produced in many other areas, my research journey changed direction, from a pure computer vision approach towards one that combines computer vision with deep learning.
One of the first problems I tackled that involved deep learning was that of finding videos that contain signing in the first place. This is a harder problem than it appears at first. It is not a clear-cut matter of distinguishing people speaking from people signing by looking at whether the mouth or the hands are moving. When people speak, their hands also move (this is called gesticulation), and when signers use sign language, their lips also move (what linguists call mouthings).
The problem therefore boils down to teaching a system to recognise that the hand motions of signers follow patterns of movement determined by the sign language (its phonetics and grammar), while the hand motions of speakers tend to be more unstructured.
Again, this distinction is not so clear-cut. The motion patterns of a signer’s hands do not rigidly follow the rules and structure of sign language, but exhibit variations caused by things like emotion, tone of voice, emphasis, and personal style of signing. And the way a speaker’s hands move is not completely unstructured either; some elements and structure of the spoken language creep into the hand motions. There is therefore a sort of continuum, with structured hand motions at one end and unstructured hand motions at the other, and signing and gesticulation sit somewhere along it, quite close to each other.
To solve this problem, I used a recurrent neural network (RNN), with a convolutional neural network (CNN) serving as the automatic feature extractor. The RNN (a bi-directional LSTM, to be exact) is ideal in this case because it can reason across time and thus learn to identify the hand motion patterns of signing.
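To make the idea more concrete, here is a minimal PyTorch sketch of how such a pipeline can be wired together: a CNN turns each video frame into a feature vector, and a bi-directional LSTM reasons over the resulting sequence to decide whether the clip contains signing. The choice of backbone, hidden size, and classification head here are illustrative assumptions, not the exact architecture or hyperparameters used in the thesis.

```python
import torch
import torch.nn as nn
from torchvision import models


class SigningDetector(nn.Module):
    """Sketch of a CNN feature extractor feeding a bi-directional LSTM."""

    def __init__(self, hidden_size: int = 256, num_classes: int = 2):
        super().__init__()
        # Pretrained CNN backbone as the per-frame feature extractor
        # (ResNet-18 is an assumption made for this sketch).
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        feature_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()  # keep the pooled features, drop the classifier
        self.cnn = backbone

        # Bi-directional LSTM reasons across time over the per-frame features.
        self.rnn = nn.LSTM(
            input_size=feature_dim,
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=True,
        )
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.view(b * t, c, h, w))  # (b*t, feature_dim)
        feats = feats.view(b, t, -1)                   # (b, t, feature_dim)
        outputs, _ = self.rnn(feats)                   # (b, t, 2*hidden_size)
        return self.classifier(outputs[:, -1, :])      # one prediction per clip


# Example: classify a batch of two 16-frame clips of 224x224 RGB frames.
model = SigningDetector()
clips = torch.randn(2, 16, 3, 224, 224)
logits = model(clips)  # shape (2, 2): signing vs. not signing
```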
To train this deep learning model on realistic data, I created a dataset of YouTube videos selected to contain a diversity of signing and speaking videos, as well as other videos that are considered difficult cases. The dataset is publicly available on the IEEE DataPort site for others to use if they find it useful.
Some things that I learned throughout this journey include:
And here is the link to the submitted thesis, in case it is of interest.