PhD finished!

31 May 2020 • 5 min. read • Comments

After a long journey filled with many sleepness nights, consumption of copious amounts of coffee, and much suffering, I finally managed to submit my PhD thesis.

I must say that I was really nervous when it came to the Viva. But then once it started, I really enjoyed it. So much that when the chair said only two more questions to go, I wished that there was time for more.

Sign language recognition

My PhD studies dealt with the subject area of automated sign language recognition. For a description of what this involves, and why sign language recognition is useful, here is a link to a short news article available on Newspoint, the news portal of the University of Malta.

A journey from computer vision to deep learning

When I first started my research in this area, deep learning was still in its infancy and had not been applied yet to the field of sign language recognition. At that time I had only basic knowledge of machine learning, let alone deep learning, and the general approach and consensus in this area of study was that the way to go is to purely use computer vision and image processing techniques. In fact, that was the only option at that time.

But then as deep learning methods gained traction and ‘awesome’ results started being produced in many other areas, my research journey started changing direction from a pure computer vision approach, towards one that combines computer vision with deep learning.

Deep learning sign detection

One of the first problems that I tackled that involved deep learning was that of finding videos that contain signing in the first place. This is a harder problem than it appears to be at first. It’s not a clear-cut issue of distinguishing people speaking from people signing by looking whether the mouth or the hands are moving. When people speak, the hands are also moving (called gesticulation). And when signers are using sign language, the lips also move (called mouthings by linguists).

Therefore the problem boils down to teaching a system to recognise that the hand motions of signers have patterns of movements that are determined by the sign language (the phonetics and grammar of sign language), while the hand motions of speakers tend to be more unstructured.

Again this distinction is not so clear-cut. The motion patterns of a signer’s hands do not rigidly follow the rules and sructure of sign language, but exhibit variations caused by the things like emotion, tone of voice, emphasis, personal style of signing, etc. And the way the hands of a speaker move are not completely unstructured, but some elements and structure of the spoken language creep into the hand motions. Therefore there exists a sort of continuum with structured hand motions at one end, and unstructured hand motions at the other. And speaking and signing are some way on this continuum close to each other.

To solve this problem, a recurrent neural network (RNN) was used, with a convolutional neural network (CNN) serving as the automatic feature extractor. The RNN network (a bi-directional LSTM RNN to be exact) is ideal in this case, because it can reason across time and thus learn to identify the hand motion patterns of signing.

To train this deep learning model with realistic data, I created a dataset of YouTube videos selected to contain a diversity of signing and speaking videos, as well as other videos that are considered to be difficult cases. This dataset was made publicly available on the IEEE DataPort site for others to use if they find it useful.

What I learnt

Some things that I learned throughout this journey include:

A better understanding and appreciation of machine learning and deep learning techniques.
Creating, preparing and curating datasets is very important and takes considerable time (especially groundtruthing)
Having a good experimentation and testing system is important for doing research in a systematic and repeatable way (Jupyter notebooks are great for example. I wish Apache Airflow was available at that time; it would have made my life much easier).
Deep learning has its disadvantages. One of them is the problem of explainability. If trained end-to-end, then the model becomes a black box. In comes the signing video, out comes the spoken text, but what is happening inside the model is difficult to interpret, to debug, and eventually to improve. A better approach (at least for sign language recognition) is to combine deep learning with a blend of automatically-determined and specially-chosen (handcrafted) features. A sort of hybrid approach combining deep learning with standard computer vision techniques. In fact this was one of the main contributions of my work. Rather than just feeding in raw video frames, I utilised a matrix factorisation algorithm (this is very similar algorithm to what is used in recommender systems) and then extracted semantic features from this representation to feed to the deep learning model. This approach has some resemblance to the idea of a word embedding space, but in my case the space is more natural and better reflects the structured motion of sign langauges. It is also more interpretable (unlike word embeddings). Another plus is that the representation can be re-used for different sign languages, especially ones where training data is limited. For example, I have shown how I can train my deep learning model on German sign language; then with no (or very minimal) fine-tuning, can reuse it for recognising Maltese sign language, which lacks large training datasets.

And, here is the link to the submitted thesis in case this is of interest.

PhD