NYC Languages Clustering

Ever wanted to see how languages are geographically represented across New York City? We’ve done the clustering for you.  

For this clustering app, one of our resident Software Engineers, Eugene O’Friel, put together a clustering experiment.

The goal was to determine whether speakers of languages with geographically proximal origins, or of languages within the same language family (for example, two languages spoken in Eastern Europe), would end up settling in nearby neighbourhoods.


Technical Write-up

This dataset is a product of the Endangered Language Alliance, whose goal is to completely catalog the languages spoken within the Greater NYC area, both modern and historical.


Interesting Insights 

  • Regarding the aforementioned hypothesis: sort of. There appear to be a few clusters that do group related languages together.
  • In terms of the clustering, I expected a greater concentration amongst the earlier languages. For the most part, though, there seemed to be some diversity.
  • It was interesting to see how many languages were spoken and came up in this dataset. It was a few hundred, whereas the original hypothesis was that there would be fewer than that.
  • There’s a great concentration in certain areas that historically welcomed immigrants, as evidenced by the languages represented in these areas.
  • Flushing had a large concentration of East Asian languages (while Jackson Heights had the concentration of South East Asian languages).
  • The Bronx had a huge concentration and diversity of African languages.
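
One way to put a number on the "sort of" above is cluster purity: the share of languages whose family matches the majority family in their cluster. The snippet below is only a minimal sketch of that check, not the method used in the app. The cluster ids and family labels are made-up toy values, and `cluster_purity` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, families):
    """Fraction of items whose family is the majority family of their cluster."""
    by_cluster = defaultdict(list)
    for cluster_id, family in zip(cluster_ids, families):
        by_cluster[cluster_id].append(family)
    # For each cluster, count how many members belong to its most common family.
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(families)


# Toy, invented labels (assumption): six languages assigned to two clusters.
cluster_ids = [0, 0, 0, 1, 1, 1]
families = ["Slavic", "Slavic", "Turkic", "Sinitic", "Sinitic", "Sinitic"]

purity = cluster_purity(cluster_ids, families)
print(f"purity = {purity:.2f}")
```

A purity of 1.0 would mean every cluster contains a single family. Values well below that would match the impression that only some clusters group related languages.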


  • There are languages that are clustered outside of their geographic region. Adding region of origin as a dimension could potentially solve this.
  • Would want to break this down further into historically spoken vs. current usage.
  • Different datasets would be great to complement this experiment. One idea was to look at New York City-specific discrimination or politically focused datasets, to see whether or not certain policies are enacted because of the population demographics there.
  • Ideas for extension: the languages are categorized by status. Maybe I can drill down and look at residential/community vs. historical trends and see if there are any patterns there.
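
To illustrate the idea of adding region as a dimension, here is a rough sketch that appends a weighted one-hot encoding of region of origin to each point's coordinates before clustering. Everything in it is an assumption for illustration: the four points and their region labels are invented, `build_features` and `kmeans` are hypothetical helpers, and the tiny k-means (seeded with the first k points, using plain Euclidean distance on raw degrees) stands in for whatever the app actually uses.

```python
import math


def build_features(points, region_weight):
    """Return [lat, lon, one-hot region * weight] for every point."""
    regions = sorted({p["region"] for p in points})
    features = []
    for p in points:
        one_hot = [region_weight if p["region"] == r else 0.0 for r in regions]
        features.append([p["lat"], p["lon"]] + one_hot)
    return features


def kmeans(vectors, k, iterations=10):
    """Tiny k-means; centroids start at the first k vectors (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in vectors
        ]
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented points (assumption): two locations, each with two regions of origin.
points = [
    {"lat": 40.75, "lon": -73.83, "region": "East Asia"},
    {"lat": 40.76, "lon": -73.83, "region": "Eastern Europe"},
    {"lat": 40.85, "lon": -73.88, "region": "East Asia"},
    {"lat": 40.86, "lon": -73.88, "region": "Eastern Europe"},
]

geo_only = kmeans(build_features(points, region_weight=0.0), k=2)
with_region = kmeans(build_features(points, region_weight=1.0), k=2)
print("geography only:", geo_only)
print("with region:   ", with_region)
```

The `region_weight` parameter controls the trade-off. At 0 the clusters follow location alone, and at larger values languages from the same region are pulled together even when they are spoken in different parts of the city.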

View link to the app here – 

NYC Languages Clustering
Benedek Zajkas
February 4, 2022