Analyzing CrossFit Subreddit with NLP

Grettel Juárez
6 min read · Mar 26, 2021

Using Natural Language Processing techniques on the discourse within the CrossFit subreddit community to gain insights for athletes and gym owners.

CrossFit is a fitness program focused on strength and conditioning to build functional fitness. Each day’s workout combines different movements across high intensity interval training, Olympic weightlifting, powerlifting, gymnastics, cardio, and strongman training.

CrossFit saw accelerated growth and increasing popularity starting in 2012. By July 2019, there were about 15,200 CrossFit affiliates worldwide, according to Morning Chalk Up. This year, that number is down 29% to about 10,800.

What happened? — 2020

The COVID-19 pandemic caused an unprecedented disruption in everyone’s lives in 2020. Many CrossFit athletes scaled back or cancelled their gym memberships because of the risk of infection, and government shutdowns hit local gyms hard. On top of that, Greg Glassman, the founder and then-CEO of CrossFit, made an insensitive comment in reference to George Floyd and failed to apologize for it. In the backlash that followed, many gyms disaffiliated, and Glassman was forced to sell the company.

Given the events of the past year, the current situation, and the move of many services online, what does this mean for athletes and gym owners in the CrossFit community? That is the question I investigated in an NLP project during my time in the 12-week Metis Data Science bootcamp.

To explore the discourse in the CrossFit online community, I used Reddit, a network of online communities organized around people’s interests. The subreddit is, of course, not fully representative of every CrossFit athlete, but it can provide insight into the online conversations in this community. The r/crossfit subreddit was created in December 2008 and currently has 232K members.

Approach

  1. Data Collection and Pre-processing
  2. Topic Modeling
  3. Results

Data Collection and Pre-processing

The data for the r/crossfit subreddit was collected using the Pushshift API, which makes historical Reddit posts easier to access. This analysis covers posts from December 2008 through February 19, 2021.
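As a rough illustration of the collection step, the sketch below only builds a request URL for the date range above and makes no network call. The endpoint and the parameter names (`subreddit`, `after`, `before`, `size`, `sort`) are assumptions based on how the public Pushshift API was commonly described at the time; they are not taken from this project, and access to Pushshift has changed since then.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumption: the public Pushshift submission-search endpoint as it was
# commonly described in 2021; it is not taken from the project itself.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def to_epoch(year, month, day):
    """Convert a UTC calendar date to a Unix timestamp in seconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def build_query(subreddit, start, end, size=100):
    """Build the request URL for one page of submissions between start and end."""
    params = {
        "subreddit": subreddit,
        "after": start,   # only posts created after this timestamp
        "before": end,    # only posts created before this timestamp
        "size": size,     # page size (hypothetical value)
        "sort": "asc",    # oldest first, so paging can move `after` forward
    }
    return PUSHSHIFT_URL + "?" + urlencode(params)


start = to_epoch(2008, 12, 1)
end = to_epoch(2021, 2, 20)  # upper bound one day later, so February 19 is covered
url = build_query("crossfit", start, end)
print(url)
```

To page through the full history, one would repeat the request with `after` set to the creation time of the last post returned, until a page comes back empty.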

Once the data was collected, I performed text pre-processing to clean the data. This process included turning all text to lowercase as well as removing emojis, numbers, and punctuation. I also removed stop words, which are commonly used words that have little meaning like “the”, “a”, and “with”. Lastly, I used lemmatization to group words with the same base form together. For example, “bats” becomes “bat”, “feet” becomes “foot”, “having” becomes “have”.
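The cleaning steps can be sketched in a few lines of plain Python. The stop-word set and the lemma lookup table here are tiny hypothetical stand-ins; a real pipeline would rely on a full stop-word list and a proper lemmatizer from a library such as NLTK or spaCy, and this is not necessarily how the project implemented it.

```python
import re
import string

# Toy stand-ins (assumptions): a real pipeline would use a full stop-word
# list and a library lemmatizer instead of these small hand-written tables.
STOP_WORDS = {"the", "a", "an", "with", "and", "of", "to", "in", "is", "my"}
LEMMAS = {"bats": "bat", "feet": "foot", "having": "have"}

# Map every punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lowercase, strip emojis/digits/punctuation, drop stop words, lemmatize."""
    text = text.lower()
    # Dropping non-ASCII characters removes emojis (and, as a side effect,
    # accented letters, which is acceptable for this English-only sketch).
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return [LEMMAS.get(tok, tok) for tok in tokens]


print(preprocess("Having 2 sore feet with the bats! 💪"))
# → ['have', 'sore', 'foot', 'bat']
```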

A chart of the data below shows the number of r/crossfit posts per year. This follows the growth pattern of CrossFit through 2019.
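The per-year counts behind such a chart can be computed directly from each post’s creation time. This is a minimal sketch with a handful of made-up timestamps; `created_utc` mirrors the field name Reddit uses for a post’s creation time in Unix seconds.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical sample of post creation times (Unix seconds, UTC).
created_utc = [1228089600, 1356998400, 1357084800, 1577836800]


def posts_per_year(timestamps):
    """Count posts by the calendar year (UTC) in which they were created."""
    years = (datetime.fromtimestamp(ts, tz=timezone.utc).year for ts in timestamps)
    return dict(sorted(Counter(years).items()))


print(posts_per_year(created_utc))
# → {2008: 1, 2013: 2, 2020: 1}
```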

Topic Modeling

Once the data was cleaned, I applied topic modeling to group the r/crossfit posts into common themes.

  1. TF-IDF: First, I used a TF-IDF (term frequency-inverse document frequency) vectorizer to turn the words in each post into a numerical vector. Each word is weighted by how often it appears in a post relative to how common it is across all posts in the dataset.
  2. NMF: Next, I applied NMF (non-negative matrix factorization), an unsupervised machine learning algorithm, to group the posts into topics.
  3. t-SNE: Finally, I used t-SNE, a dimensionality reduction method, to project the posts onto a 2-D plane for viewing.

Below are the resulting nine topics along with the top five words characterizing each topic. Each dot in the graph is an r/crossfit post, colored by topic. The groupings are reasonably distinguishable.

Results/Findings

1. Personal Records topic could indicate an opportunity for gym owners

Discussion in the Personal Records topic rises and falls together with talk about the CrossFit Open and the CrossFit Games. The “Open” and the “Games” are worldwide competitions that are typically held annually. This pattern makes sense because many people set new personal records during these events.
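One simple way to check whether two topics move together is to correlate their monthly post counts. The sketch below computes a Pearson correlation by hand; the monthly counts are invented for illustration and are not the project’s data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly post counts for two topics over the same six months.
personal_records = [40, 45, 90, 60, 42, 41]
open_and_games = [30, 35, 120, 70, 33, 31]

r = pearson(personal_records, open_and_games)
print(f"correlation: {r:.2f}")
```

A value close to 1 would indicate that the two topics peak in the same months; it says nothing about which one drives the other.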

Because athletes enjoy pushing themselves and exploring their limits, gyms that do not already host in-house competitions could boost engagement by doing so, even if the competitions are held virtually.

This Personal Records category also contains a high proportion of videos. The posts in this topic often ask for and receive detailed feedback from fellow redditors. Below are the top ten words along with a few sample posts in this topic.

For athletes, this provides an avenue for receiving praise and feedback on a heavy lift.

For gym owners, this could indicate an opportunity to provide an online coaching feedback service. This could be a way to reach many more customers outside of the local area. Since people are seeking feedback online from strangers, it may be beneficial and enticing for athletes to pay a small fee for a number of video reviews per week. This way, the athlete receives more personalized feedback from someone familiar with their movements without having to pay the high fee of a personal trainer.

2. Stand against racism

The discussion around the Glassman issue surfaced in an unexpected place: the “Why I Love CrossFit” topic. In this topic, many people share the impact (usually positive) CrossFit has had on their quality of life. The top ten words reflect why people join and love CrossFit: fitness, people, community. Surprisingly, this topic grew in June 2020, coinciding with the Glassman issue. Upon investigation, I found posts such as this: “I love crossfit, but am embarrassed for glassman”.

The backlash and the discussion of disaffiliation following Glassman’s comments show a large number of gyms standing against racism, with almost one third choosing to act on their beliefs by separating themselves from the brand.

Additionally, the graph below compares word frequencies for the one-year pandemic period against a matching one-year pre-pandemic period. The top pandemic words fall into three categories: quarantine, racism, and equipment.
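A comparison like this can be sketched by counting words in each period and ranking them by how much their share of all words grew. The token lists are made up, and using relative rather than raw frequency is an assumption made here so that periods with different post volumes stay comparable.

```python
from collections import Counter


def relative_freq(docs):
    """Share of all tokens accounted for by each word across tokenized docs."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def top_shifted_words(before_docs, during_docs, n=3):
    """Words whose relative frequency rose the most between the two periods."""
    before = relative_freq(before_docs)
    during = relative_freq(during_docs)
    vocab = set(before) | set(during)
    shift = {w: during.get(w, 0.0) - before.get(w, 0.0) for w in vocab}
    # Largest increase first; ties are broken alphabetically.
    return sorted(shift, key=lambda w: (-shift[w], w))[:n]


# Hypothetical tokenized posts for each one-year window.
pre_pandemic = [["open", "workout", "gym"], ["gym", "coach", "class"]]
pandemic = [["quarantine", "workout", "home"], ["equipment", "home", "quarantine"]]

print(top_shifted_words(pre_pandemic, pandemic))
# → ['home', 'quarantine', 'equipment']
```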

3. Increased discussion in “Gym Advice” topic

The “Gym Advice” topic, whose top fifteen words include programming and equipment, grows in March 2020. This reflects a shift toward athletes looking for and purchasing equipment and discussing programming. Athletes who own their equipment and show an interest in programming could be another indicator of demand for remote services.
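To see a shift like this in the data, one can track each topic’s share of posts per month. The sketch below assumes every post has already been labeled with its dominant topic; the month and topic pairs are invented for illustration.

```python
from collections import Counter, defaultdict


def monthly_topic_share(rows):
    """For each topic, the fraction of each month's posts that belong to it."""
    totals = Counter(month for month, _ in rows)
    by_topic = defaultdict(Counter)
    for month, topic in rows:
        by_topic[topic][month] += 1
    return {
        topic: {month: counts[month] / totals[month] for month in sorted(totals)}
        for topic, counts in by_topic.items()
    }


# Hypothetical (month, dominant topic) pairs, one per post.
labeled_posts = [
    ("2020-02", "Personal Records"),
    ("2020-02", "Gym Advice"),
    ("2020-03", "Personal Records"),
    ("2020-03", "Gym Advice"),
    ("2020-03", "Gym Advice"),
    ("2020-03", "Gym Advice"),
]

shares = monthly_topic_share(labeled_posts)
print(shares["Gym Advice"])
# → {'2020-02': 0.5, '2020-03': 0.75}
```

Using shares rather than raw counts separates a genuine change in what people talk about from a change in overall posting volume.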

Conclusion

Using applied Natural Language Processing techniques, this analysis provides insights for athletes and gym owners from the r/crossfit community on Reddit. The investigation found:

  • Athletes actively use this online community to seek feedback on their lifts and celebrate personal records. The volume of posts in this topic correlates with the CrossFit Games and Open. Hosting in-house competitions could potentially increase gym member engagement.
  • Racism is an important discussion, and many feel strongly about taking a stance against it. Gyms may consider intentional messaging of support for their athletes.
  • There is a potential opportunity for gym owners to provide remote services, such as combined programming and lift feedback, given that many more athletes have their own equipment due to the pandemic.

My GitHub repo for this project is here. Feel free to connect with me on LinkedIn.

Grettel Juárez

Data Science | Performance Engineering | Technology Consulting