Topic Modeling Communities of Practice to Identify Learning Barriers

X-DBER 2023

Tim Ransom

Who I am

  • Computer Scientist
  • Mathematician
  • English Speaker
  • Clemson ESED graduate student
  • Cat person

photograph of Tim

Tim's cat Ada

Uses of NLP in classrooms

  • Which topics are students asking for help with online?
    • (Assumption) Frequently requested help topics indicate difficult-to-learn topics
  • Which topics can students find answers for online?
    • (Assumption) Students are searching for help on Reddit
  • Which topics might students be most exposed to online?
  • Identifying barriers to learning

What is Reddit

  • BBS-style, pseudo-anonymous social media
    • (Basically image/text forums with accounts)
  • Organized into subreddits around topics
Community        Learning Community
r/Python         r/learnpython
r/math           r/learnmath
r/engineering    r/EngineeringStudents
r/Physics        r/learnPhysics

Conceptualizing subreddits as communities of practice [1]

  • people → subscribed users
  • practice → topic of the subreddit
  • culture → layered internet & domain

Latent Dirichlet Allocation

  • LDA [2] is a well-established NLP topic modeling algorithm
  • Each document is modeled as a mixture of topics, and each topic as a distribution over terms
  • With sufficiently many documents, we can infer the relationships between terms and topics
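To make the document/topic/term relationship concrete, here is a minimal sketch of LDA fitted with a collapsed Gibbs sampler in plain Python. It is a toy illustration only, not the R/sparklyr pipeline used for this talk: the function name `fit_lda`, the hyperparameter values, and the four tiny example documents are all assumptions made up for the example.

```python
import random

def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs: list of token lists.
    Returns (doc_topic, topic_term, vocab), where each row of doc_topic and
    topic_term is a smoothed probability distribution.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    ndk = [[0] * n_topics for _ in docs]            # document-topic counts
    nkw = [[0] * n_vocab for _ in range(n_topics)]  # topic-term counts
    nk = [0] * n_topics                             # tokens assigned per topic
    z = []                                          # topic of every token

    # Start from a random topic assignment for each token.
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(n_topics)
            assignments.append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
        z.append(assignments)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = word_id[w]
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][wi] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][wi] + beta) / (nk[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][wi] += 1
                nk[k] += 1

    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in ndk[d]]
        for d, doc in enumerate(docs)
    ]
    topic_term = [
        [(c + beta) / (nk[k] + n_vocab * beta) for c in nkw[k]]
        for k in range(n_topics)
    ]
    return doc_topic, topic_term, vocab

# Hypothetical, hand-written "posts" (already tokenized) for illustration.
docs = [
    "while loop input user input loop".split(),
    "user input loop while break".split(),
    "class object method self class".split(),
    "self method class init object".split(),
]
doc_topic, topic_term, vocab = fit_lda(docs, n_topics=2)

# Show the three most probable terms for each topic.
for k, row in enumerate(topic_term):
    top_terms = [w for _, w in sorted(zip(row, vocab), reverse=True)[:3]]
    print(f"topic {k}: {top_terms}")
```

On a real corpus a library implementation would be the practical choice; the hand-rolled sampler is here only to show which counts link documents, topics, and terms.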

We’ll be topic modeling Reddit post data collected through Pushshift [3] and processed with R and Spark [4] on Clemson’s Palmetto cluster computer [5].
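As a rough sketch of the preprocessing step, the snippet below filters newline-delimited JSON posts down to one subreddit and tokenizes them into the bag-of-words form a topic model expects. It is written in Python for illustration rather than the R and Spark pipeline used in this work. The field names (`subreddit`, `title`, `selftext`) are assumed to follow the Pushshift submission layout, and the stopword list, helper names, and sample records are hypothetical.

```python
import io
import json
import re

# Tiny illustrative stopword list (an assumption, not the one used in the talk).
STOPWORDS = {"the", "a", "an", "to", "of", "in", "and", "is", "my", "it", "for", "how", "do"}
TOKEN_RE = re.compile(r"[a-z][a-z0-9_]+")

def tokenize(text):
    """Lowercase the text, keep word-like tokens, and drop stopwords."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS]

def load_posts(lines, subreddit):
    """Yield one token list per post from the requested subreddit."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)
        if (post.get("subreddit") or "").lower() != subreddit.lower():
            continue
        text = f"{post.get('title') or ''} {post.get('selftext') or ''}"
        tokens = tokenize(text)
        if tokens:
            yield tokens

# Hypothetical records standing in for a Pushshift dump file.
sample = io.StringIO(
    '{"subreddit": "learnpython", "title": "While loop keeps asking for input", '
    '"selftext": "My input loop never exits"}\n'
    '{"subreddit": "Python", "title": "New release of my plotting library", "selftext": ""}\n'
    '{"subreddit": "learnpython", "title": "IndexError: list index out of range", '
    '"selftext": "How do I index the last item?"}\n'
)
docs = list(load_posts(sample, "learnpython"))
for doc in docs:
    print(doc)
```

Passing an open file handle instead of the `io.StringIO` sample would apply the same filter to a real dump, one line at a time.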

Illustrative Example

Subreddit CoP Topics can Overlap

r/Python word clouds

r/learnpython word clouds

Data vis

Single Subreddit Topics

r/Python

r/learnpython

Identified Learning Barriers

  • User Input Loop
  • Classes vs. scripts (OOP foundations)
  • Interpreting debug messages
  • Data formatting
  • Array Indexing

r/learnpython over time

Next Steps

  • Interpret other subreddits
  • Interpret other disciplines
  • Investigate other text sources

References

[1] J. Lave, “Situating learning in communities of practice.” 1991.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
[3] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn, “The Pushshift Reddit dataset,” in Proceedings of the International AAAI Conference on Web and Social Media, 2020, vol. 14, pp. 830–839.
[4] J. Luraschi et al., “sparklyr: R interface to Apache Spark.” Mar. 2023. Available: https://CRAN.R-project.org/package=sparklyr
[5]

Contact info

email: tsranso@clemson.edu

github: https://github.com/ransomts

website: tsranso.people.clemson.edu