Gesture Recognition Wearable | 2020 | ATLANTA

Atltvhead Gesture Recognition Bracer: a TensorFlow Lite gesture detector for the atltvhead project and an exploration into data science. Links: Medium publication, GitHub, Hackaday, and Adafruit Blog.

While completing this project I made some tools for anyone who wants to replicate this with their own gestures. All my files are found in this GitHub repo.

Getting Up and Running TLDR:

  • If you use Docker, you can use the Jupyter Notebook TensorFlow container or build from my makefile.

  • Get started by uploading the capture data .ino sketch in the Arduino_Sketch folder onto a SparkFun ESP32 Thing Plus, with a push button between pin 33 and GND and an Adafruit LSM6DSOX 9dof IMU connected with a Qwiic connector.

  • Use the capture data Python script in the Training_Data folder. Start this script on any PC, type in the gesture name, then start recording the motion data by pressing the button on the Arduino.

  • After recording a gesture several times, change to a different gesture and do it again. I tried to get at least 50 motion recordings of each gesture; you can try fewer if you like.

  • Once all the data is collected, navigate to the Python Scripts folder and run the data pipeline Python script and then the model pipeline script, in that order. Models are trained here, and this can take some time.

  • Run the predict gesture script and press the button on the Arduino to take a motion recording and see the results printed out. Run the tflite gesture prediction script to use the smaller model.


I run an interactive live stream. I wear an old TV (with a working LED display) for a helmet and a backpack with an integrated display. Twitch chat controls what’s displayed on the television screen and the backpack screen through chat commands. Together, Twitch chat and I go through the city of Atlanta, GA, spreading cheer.

Over time, I have accumulated over 20 channel commands for the TV display. Remembering these commands has become complicated and tedious, so it’s time to simplify my interface to the tvhead.

What are my resources? During the live stream, I am on rollerblades, my right hand is holding the camera, my left hand has a high five detecting glove I’ve built from a lidar sensor and esp32, my backpack has a raspberry pi 4, and a website with buttons that post commands in twitch chat.

What to simplify? I’d like to simplify the channel commands and gamify it a bit more.

What resources to use? I am going to change my high five glove, removing the lidar sensor, and feed the Raspberry Pi with acceleration and gyroscope data so that the Pi can infer the gesture performed from the arm data.


A working gesture detection model using TensorFlow and sensor data.

Sensors:

A brief explanation of my sensor choices:

Lidar: I used “lidar” time-of-flight sensors to detect high fives in the previous version of my wearable. However, they cannot detect arm gestures without a complex mounting of many sensors all over one arm.

Strain Sensor: Using change-in-resistance stretch rubbers or flex sensors, I can get an idea of which muscles I’m actuating or the general shape of my arm. However, they are easily damaged and wear with use.

Muscle Sensors: Something like an MYO armband can determine hand gestures, but it requires a lot of processing overhead for my use case. These sensors are also quite expensive, and the electrodes are not reusable.

IMU: Acceleration and gyroscope sensors are cheap and do not wear out over time. However, determining a gesture from the data output of the sensor requires a lot of manual thresholding and timing to determine anything useful. Luckily, machine learning can determine relationships in the data and can even be implemented on a microcontroller with tflite and TinyML. So I chose to go forward with an IMU sensor and machine learning.

The AGRB-Traning-Data-Capture.ino is my Arduino script to pipe acceleration and gyroscope data from an Adafruit LSM6DSOX 9dof IMU out of the USB serial port. I chose an ESP32 Thing Plus by SparkFun as the board because of the Qwiic connector support between this board and the Adafruit IMU. A push button is connected between ground and pin 33 with the internal pullup resistor on. Eventually, I plan to deploy a tflite model on the ESP32, so I’ve included a battery.

The data stream starts when the button on the Arduino is pressed and stops after 3 seconds. It is similar to a photograph, but instead of an x/y grid of camera pixels, it’s sensor data over time.

Building Housing:
I modeled my arm and constructed the housing around it using Fusion 360. I then 3D printed everything and assembled the joints with some brass rod. I especially enjoy using the clip to secure the electronics to my arm.

Data Collection:

With the Arduino loaded with the AGRB-Traning-Data-Capture.ino script and connected to the capture computer with a USB cable, run the capture data Python script. It’ll ask for the name of the gesture you are capturing. Then, when you are ready, press the button on the Arduino and perform the gesture within 3 seconds. When you have captured enough of one gesture, stop the Python script. Rinse and repeat.

I chose 3 seconds of data collection, or roughly 760 data points, because I wasn’t positive how long each gesture would take to perform. Anyway, more data is better, right?
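The capture step can be sketched roughly as follows. This is a minimal illustration, not the repo's capture script: the comma-separated line format `ax,ay,az,gx,gy,gz` and the names `parse_imu_line`, `read_snapshot`, and `sample_rate_hz` are assumptions made for the example. The reader accepts any iterable of text lines, such as lines decoded from the USB serial port.

```python
from typing import Iterable, List, Optional, Tuple

# One IMU reading: three acceleration axes followed by three gyroscope axes.
Reading = Tuple[float, ...]


def parse_imu_line(line: str) -> Optional[Reading]:
    """Parse one 'ax,ay,az,gx,gy,gz' line (assumed format); None if malformed."""
    parts = line.strip().split(",")
    if len(parts) != 6:
        return None
    try:
        return tuple(float(p) for p in parts)
    except ValueError:
        return None


def read_snapshot(lines: Iterable[str], max_points: int = 760) -> List[Reading]:
    """Collect valid readings until the stream ends or max_points is reached."""
    snapshot: List[Reading] = []
    for line in lines:
        reading = parse_imu_line(line)
        if reading is not None:
            snapshot.append(reading)
        if len(snapshot) >= max_points:
            break
    return snapshot


def sample_rate_hz(n_points: int, seconds: float) -> float:
    """Average sample rate implied by a snapshot of n_points over seconds."""
    return n_points / seconds


# Stand-in for serial output; the middle line simulates a corrupted read.
demo_lines = ["0.1,9.8,0.0,0.01,0.02,0.03", "garbage", "0.2,9.7,0.1,0.00,0.01,0.02"]
snapshot = read_snapshot(demo_lines)
rate = sample_rate_hz(760, 3.0)
print(len(snapshot), round(rate, 1))
```

Roughly 760 points in 3 seconds works out to an average of about 253 readings per second.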

Docker File:

I’ve included a Docker makefile for you! I used this Docker container while processing my data and training my model. It’s based on the Jupyter Notebooks TensorFlow Docker container.

Data Exploration and Augmentation:

As said in the TLDR, the data pipeline script will take all of your data from the data collection, split it between training/test/validation sets, augment the training data, and write finalized CSVs ready for model training.
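A minimal sketch of the splitting step, assuming a simple shuffled split. The fractions (75% train, 10% test, the rest validation) are assumptions that roughly match the 168/23/34 counts reported for the un-augmented sets, and `split_samples` is a hypothetical name rather than a function from the repo. Splitting before augmenting matters: it keeps augmented copies of one recording from landing in both the training and test sets.

```python
import random
from typing import Dict, List, Sequence


def split_samples(samples: Sequence, train: float = 0.75, test: float = 0.10,
                  seed: int = 42) -> Dict[str, List]:
    """Shuffle once, then slice into train/test/validation partitions."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return {
        "train": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        "validation": shuffled[n_train + n_test:],  # whatever is left over
    }


# 225 placeholder recordings stand in for the captured gesture files.
recordings = [f"recording_{i}" for i in range(225)]
splits = split_samples(recordings)
sizes = {name: len(items) for name, items in splits.items()}
print(sizes)
```

This sketch does not stratify by gesture; with a data set this small, splitting each gesture separately would keep the class balance even across the three sets.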

The following conclusions and findings come from the Jypter_Scripts/Data_Exploration.ipynb file:

  • The first exploration task I conducted was to use seaborn’s pair plot to plot all variables against one another for each different type of gesture. I was looking to see if there was any noticeable outright linear or logistic relationships between variables. For the fist pump data there seem to be some possible linear effects between the Y and Z axis, but not enough to make me feel confident in them.

  • Looking at the descriptions, I noticed that each gesture sample had a different number of points, and the counts were not consistent even between samples of the same gesture.

  • Each gesture’s acceleration and gyroscope data is pretty unique when looking at time series plots. Fist-pump mode and speed mode look the most similar and will probably be the trickiest to differentiate from one another.

  • Figure: Accelerations of Gestures

  • Figure: Radians of Gestures

  • Conducting a PCA of the different gestures showed that acceleration contributed most to the principal components when determining gestures. However, when conducting a PCA with min/max normalized acceleration and gyroscope data, the normalized gyroscope data became the most important feature. Specifically, Gyro_Z seems to contribute the most to the first principal component across all gestures.

  • So now the decision. The PCA of Raw Data says that accelerations work. The PCA of Normalized Data seems to conclude that gyroscope data works. Since I’d like to eventually move this project over to the esp32, less pre-processing will reduce processing overhead on the micro. So let’s try just using the raw acceleration data first. If that doesn’t work, I’ll add in the raw gyroscope data. If none of those work well, I’ll normalize the data.
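The flip between the two PCA results is what one would expect from a scale-sensitive method: PCA looks for directions of maximum variance, so on raw data the feature with the largest numeric range tends to dominate. The sketch below illustrates this with two made-up signals, not the recorded data. The column name `Acc_X`, the amplitudes, and the helper names are assumptions for illustration only.

```python
from math import pi, sin
from statistics import pvariance
from typing import Dict, List


def min_max(values: List[float]) -> List[float]:
    """Rescale a column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def variance_share(columns: Dict[str, List[float]]) -> Dict[str, float]:
    """Fraction of the total variance contributed by each column."""
    variances = {name: pvariance(vals) for name, vals in columns.items()}
    total = sum(variances.values())
    return {name: var / total for name, var in variances.items()}


n = 760  # points per snapshot
raw = {
    # Synthetic signals: a large-amplitude acceleration axis and a
    # small-amplitude, faster-moving gyroscope axis.
    "Acc_X": [10.0 * sin(2 * pi * i / n) for i in range(n)],
    "Gyro_Z": [2.0 * sin(2 * pi * 3 * i / n) for i in range(n)],
}
normalized = {name: min_max(vals) for name, vals in raw.items()}

raw_share = variance_share(raw)
norm_share = variance_share(normalized)
print(raw_share)
print(norm_share)
```

On the raw signals the acceleration column carries almost all of the variance; after min/max normalization the two columns contribute about equally, so a feature like Gyro_Z can come out on top.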

The following information can be found in more detail in the Data Cleaning and Augmentation.ipynb file:

Since I was collecting the data myself, I have a very small data set, meaning I will most likely overfit my model. To counter this, I applied augmentation techniques to my data set.

The augmentation techniques used are as follows:

  • Increase and decrease the peaks of the XYZ data

  • Shift the data to complete faster or slower. Time stretch and shrink.

  • Add noise to the data points

  • Increase and decrease the magnitude of the XYZ data uniformly

  • Shift the snapshot window around the time series data, making the gesture start sooner or later
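Four of the techniques listed above (uniform scaling, noise, time stretch, and window shift) can be sketched in plain Python as below. This is an illustration under assumptions, not the notebook's code: the function names, the linear interpolation used for stretching, and the edge padding used for shifting are all choices made for the example.

```python
import random
from typing import List, Tuple

Reading = Tuple[float, ...]  # one (x, y, z) reading


def scale_magnitude(sample: List[Reading], factor: float) -> List[Reading]:
    """Uniformly scale every axis, as if the gesture were done harder or softer."""
    return [tuple(v * factor for v in reading) for reading in sample]


def add_noise(sample: List[Reading], sigma: float, rng: random.Random) -> List[Reading]:
    """Add zero-mean Gaussian noise to every value."""
    return [tuple(v + rng.gauss(0.0, sigma) for v in reading) for reading in sample]


def time_stretch(sample: List[Reading], factor: float) -> List[Reading]:
    """Resample by linear interpolation; factor > 1 makes the gesture slower."""
    n_out = max(2, round(len(sample) * factor))
    stretched = []
    for j in range(n_out):
        pos = j * (len(sample) - 1) / (n_out - 1)  # position in the original
        lo = int(pos)
        hi = min(lo + 1, len(sample) - 1)
        frac = pos - lo
        stretched.append(tuple(a + (b - a) * frac
                               for a, b in zip(sample[lo], sample[hi])))
    return stretched


def shift_window(sample: List[Reading], offset: int) -> List[Reading]:
    """Move the gesture later (offset > 0) or sooner (offset < 0), same length."""
    sample = list(sample)
    n = len(sample)
    if offset >= 0:
        return ([sample[0]] * offset + sample[:n - offset])[:n]
    return (sample[-offset:] + [sample[-1]] * (-offset))[:n]


rng = random.Random(0)
sample = [(float(i), 0.0, -float(i)) for i in range(10)]
scaled = scale_magnitude(sample, 1.2)
noisy = add_noise(sample, 0.05, rng)
stretched = time_stretch(sample, 1.5)
shifted = shift_window(sample, 3)
print(len(scaled), len(noisy), len(stretched), len(shifted))
```

A stretched sample changes length, so it would need to be cut or padded back to the fixed window before being fed to a model.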

To address the inconsistent number of data points, I found that the average was 760 data points per sample. I then implemented a script that cut off the front and end of my data by a certain number of samples depending on the length, saving the first half and the second half as two different samples to keep as much data as possible. The cut data had a final length of 760 for each sample.
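One possible reading of this cutting step is sketched below: a recording longer than the window yields two samples, one keeping the first 760 points and one keeping the last 760. This interpretation, the name `cut_to_window`, and the choice to pad short recordings by repeating the last reading are all assumptions; the notebook may handle these cases differently.

```python
from typing import List, Sequence


def cut_to_window(sample: Sequence, window: int = 760) -> List[list]:
    """Return fixed-length versions of one recording.

    Longer recordings give two samples (front-aligned and end-aligned);
    shorter ones are padded with their last reading (an assumption).
    """
    n = len(sample)
    if n == 0:
        return []
    if n < window:
        return [list(sample) + [sample[-1]] * (window - n)]
    if n == window:
        return [list(sample)]
    return [list(sample[:window]), list(sample[-window:])]


long_recording = list(range(800))   # integers stand in for IMU readings
short_recording = list(range(700))

cuts = cut_to_window(long_recording)
padded = cut_to_window(short_recording)[0]
print(len(cuts), len(cuts[0]), len(padded))
```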

Before augmenting, I had 168 samples in my training set, 23 in my test set, and 34 in my validation set. After augmenting, I ended up with 8400 samples in my training set, 46 in my test set, and 68 in my validation set. Still small, but better than before.
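A quick check of the growth factors implied by these counts:

```python
before = {"train": 168, "test": 23, "validation": 34}
after = {"train": 8400, "test": 46, "validation": 68}

# Ratio of samples after processing to samples before, per set.
growth = {name: after[name] / before[name] for name in before}
print(growth)  # {'train': 50.0, 'test': 2.0, 'validation': 2.0}
```

The test and validation sets exactly double, which is consistent with only the two-sample cut being applied to them, while the training set grows 50 times through augmentation.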

Model Building and Selection:

As said in the TLDR, the model pipeline script will import all data from the finalized CSVs generated by the data pipeline script, create 2 different models (an LSTM and a CNN), compare the models’ performances, and save all models. Note that the LSTM will not have a size-optimized tflite model.

I want to eventually deploy on the ESP32 with TinyML. That limits us to using TensorFlow. I am dealing with time-series data, meaning the data points in a sample are not independent of one another. Since RNNs and LSTMs assume there are relationships between data points and take sequencing into account, they are a good choice for modeling our data. A CNN can also extract features from a time series; however, it needs to be presented with the entire time sequence of data because of how it handles the sequence order of data.

CNN: I made a 10-layer CNN. The first layer is a 2D convolution layer, which goes into a maxpool, a dropout, another 2D convolution, another maxpool, another dropout, a flattening, a dense layer, a final dropout, and a dense output layer for the 4 gestures.

After tuning hyperparameters, I ended up with a batch size of 192, 300 steps per epoch, and 20 epochs. I optimized with an adam optimizer and used sparse categorical cross-entropy for my loss, having accuracy as the metric to measure.
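The layer order described above could look something like this in Keras. The layer sequence, optimizer, loss, and metric follow the text; the filter counts, kernel sizes, pooling sizes, dropout rate, and dense width are assumptions, as is the 760 x 3 x 1 input shape (one snapshot of raw acceleration on three axes). The name `build_cnn` is hypothetical.

```python
import tensorflow as tf


def build_cnn(window: int = 760, axes: int = 3, n_gestures: int = 4) -> tf.keras.Model:
    """10-layer CNN in the order described in the text (sizes are assumed)."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window, axes, 1)),
        tf.keras.layers.Conv2D(8, (4, 3), padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D((3, 3)),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Conv2D(16, (4, 1), padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D((3, 1), padding="same"),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Dense(n_gestures, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # labels are integer gesture ids
        metrics=["accuracy"],
    )
    return model


cnn = build_cnn()
cnn.summary()
```

With a batch size of 192 and 300 steps per epoch, each epoch consumes 57,600 samples, far more than the 8400 in the training set, so the training data would have to repeat several times within an epoch.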

LSTM: Using TensorFlow I made a sequential LSTM model with 22 bidirectional layers and a dense output layer classifying to my 4 gestures.

After tuning hyperparameters, I ended up with a batch size of 64, 200 steps per epoch, and 20 epochs. I optimized with an adam optimizer and used sparse categorical cross-entropy for my loss, having accuracy as the metric to measure.
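A rough Keras sketch of such a model follows. It interprets the description as a single bidirectional LSTM layer with 22 units, which is an assumption; the softmax output, the 760 x 3 input shape, and the name `build_lstm` are also assumptions made for the example.

```python
import tensorflow as tf


def build_lstm(window: int = 760, axes: int = 3, n_gestures: int = 4,
               units: int = 22) -> tf.keras.Model:
    """Bidirectional LSTM followed by a dense classifier (sizes are assumed)."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window, axes)),
        # Reads the snapshot forwards and backwards, then concatenates both.
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units)),
        tf.keras.layers.Dense(n_gestures, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


lstm = build_lstm()
lstm.summary()
```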

Model Selection:

Both the CNN and LSTM perfectly predicted the gestures of the training set. During the test, the LSTM had a loss of 0.04 and the CNN had a loss of 0.007.

Next, I looked at the training and validation loss per epoch of training. From the look of it, the CNN with a batch size of 192 is pretty close to being fit correctly. The CNN with a batch size of 64 and the LSTM both seem a little overfit.

I also looked at the size of the model. The h5 file size of the LSTM is 97KB and the size of the CNN is 308KB. However, when comparing their tflite models, the CNN came in at 91KB and the LSTM grew to 119KB. On top of that, the quantized tflite CNN shrank to 28KB. I was unable to quantize the LSTM for size, so the CNN seems to be the winner. One last comparison, made when converting the tflite models to C++ for use on my microcontroller, revealed that both models increased in size: the CNN to 167KB and the LSTM to 729KB.
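A likely explanation for the jump in size (an inference from the numbers, not something measured here) is that the C++ version is a text source file in which each model byte is written out as something like `0x1c, `, about six characters per byte. Both 167KB/28KB and 729KB/119KB are close to a factor of six. The sketch below reproduces that ratio with made-up bytes; `to_c_array` and `g_model` are hypothetical names, and the formatting is only one plausible layout.

```python
def to_c_array(data: bytes, name: str = "g_model") -> str:
    """Render bytes as C source text, 12 hex literals per row."""
    rows = []
    for start in range(0, len(data), 12):
        chunk = data[start:start + 12]
        rows.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(rows)
    return (
        f"const unsigned char {name}[] = {{\n{body}\n}};\n"
        f"const unsigned int {name}_len = {len(data)};\n"
    )


# 28,672 placeholder bytes stand in for a 28KB quantized tflite file.
fake_model = bytes(range(256)) * 112
source = to_c_array(fake_model)
ratio = len(source) / len(fake_model)
print(len(fake_model), len(source), round(ratio, 2))
```

Once compiled, the array itself occupies the same number of bytes as the original tflite file, so the tflite size is the better guide to what the model would take up on the microcontroller.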

EDIT: After some more hyperparameter tweaking (cnn_model3), I shrank the CNN optimized model. The C++ implementation of this model is down to 69KB and the tflite implementation is down to 12KB. Model 3 has a loss of 0.015218148939311504 and an accuracy of 1.0.

So I chose to proceed with the CNN model, trained with a batch size of 192. I saved the model, as well as saved a tflite version of the model optimized for size in the model folder.
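Converting a Keras model to a size-optimized tflite model and running one prediction with it can be sketched as follows. The tiny untrained model here is a stand-in so the example is self-contained; in practice the trained CNN would be converted instead. The input shape and the all-zero snapshot are assumptions. Note that recent TensorFlow releases emit a deprecation warning for `tf.lite.Interpreter`.

```python
import numpy as np
import tensorflow as tf

# Stand-in model (untrained, hypothetical sizes) with the assumed input shape.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(760, 3, 1)),
    tf.keras.layers.Conv2D(4, (4, 3), padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D((3, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Convert with the default optimization, which quantizes weights for size.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_bytes = converter.convert()

# Load the converted model and run a single snapshot through it.
interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
interpreter.allocate_tensors()
input_info = interpreter.get_input_details()[0]
output_info = interpreter.get_output_details()[0]

snapshot = np.zeros((1, 760, 3, 1), dtype=np.float32)  # placeholder recording
interpreter.set_tensor(input_info["index"], snapshot)
interpreter.invoke()
probabilities = interpreter.get_tensor(output_info["index"])[0]
predicted = int(np.argmax(probabilities))
print(len(tflite_bytes), predicted)
```

The bytes returned by `convert()` can be written to a `.tflite` file with an ordinary binary file write.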

Raspberry Pi Deployment:

I used a raspberry pi 4 for my current deployment since it was already in a previous tvhead build, has the compute power for model inference, and can be powered by a battery.

The Pi is in a backpack with a display. On the display is a positive message that changes based on what is said in my Twitch chat during live streams. I reused that same display script and added the TensorFlow gesture prediction components from the prediction script to create the deployment script.

To infer gestures and send them to Twitch, use either of the two Twitch prediction scripts in the repo. They run the heavy .h5 model file. To run the tflite model on the Raspberry Pi, run the tflite version of the script. You’ll need to connect the Raspberry Pi to the ESP32 in the arm attachment using a USB cable. Press the button on the arm attachment to send data and predict a gesture. Long press the button to continually detect gestures (continuous snapshot mode).

Note: When running the scripts that communicate with Twitch you’ll need to follow Twitch’s chatbot development documentation for creating your chatbot and authenticating it.


It works! The gesture prediction works perfectly when triggering a gesture prediction from the arm attachment. Continuously sending data from my glove works well but feels sluggish in use due to the 3 seconds of data sampling between gesture predictions.

Future Work:

  • Shrink the data capture window from 3 seconds to 1.5 seconds

  • Test if the gyroscope data improves continuous snapshot mode

  • Deploy on the ESP32 with TinyML / TensorFlow Lite

Feel free to give me any feedback on this project or my scripts as it was my first real dive into Data Science and ML!