Topic Modeling Communities of Practice to Identify Learning Barriers

Who I Am

Computer Scientist
Mathematician
English Speaker
Clemson ESED graduate student
Cat person

photograph of Tim

Tim's cat Ada

Uses of NLP in classrooms

Which topics are students asking for help with online?
- (Assumption) Frequently requested help topics indicate difficult-to-learn topics
Which topics can students find answers for online?
- (Assumption) Students are searching for help on Reddit
Which topics might students be most exposed to online?
Identifying barriers to learning

What Is Reddit?

BBS-style, pseudo-anonymous social media
- (Basically image/text forums with accounts)
Organized into subreddits around topics

Community	Learning Community
r/Python	r/learnpython
r/math	r/learnmath
r/engineering	r/EngineeringStudents
r/Physics	r/learnPhysics

Conceptualizing subreddits as communities of practice [1]

people → subscribed users
practice → topic of the subreddit
culture → layered internet & domain

Latent Dirichlet Allocation

LDA [2] is a well-established NLP topic modeling algorithm
Documents are modeled as mixtures of topics, and topics as distributions over terms
With sufficiently many documents, we can infer the relationship between terms and topics
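
To make the document/topic/term relationship concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It is an illustration only, not the pipeline used in this work (which runs in R and Spark). The toy posts, the function names `fit_lda` and `top_terms`, and the hyperparameter values are all assumptions chosen for the example.

```python
import random


def fit_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables updated as topic assignments change.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[word]] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed estimates: theta = topics per document, phi = terms per topic.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + vocab_size * beta)
         for v in range(vocab_size)]
        for k in range(num_topics)
    ]
    return vocab, theta, phi


def top_terms(vocab, phi, n=3):
    """Return the n most probable terms for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in phi
    ]


# Invented toy "posts", already tokenized; not real Reddit data.
docs = [
    "input loop while input break loop".split(),
    "class object method class init object".split(),
    "loop input while break input".split(),
    "method class init object class".split(),
]

vocab, theta, phi = fit_lda(docs)
for k, terms in enumerate(top_terms(vocab, phi)):
    print(f"topic {k}: {terms}")
for d, mixture in enumerate(theta):
    print(f"doc {d}: {[round(p, 2) for p in mixture]}")
```

Each row of `theta` is one document's mixture over topics, and each row of `phi` is one topic's distribution over terms. The topics themselves are unlabeled: reading the top terms and naming a topic (for example, "user input loops") is an interpretive step done by the researcher.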

We’ll be topic modeling Reddit post data collected through Pushshift [3] and processed with R and Spark [4] on Clemson’s Palmetto computing cluster [5]
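
As a rough sketch of the preprocessing step, the snippet below turns newline-delimited JSON post records into tokenized documents for one subreddit. It is written in Python for illustration and is not the R/Spark code used here. It assumes each record carries `subreddit`, `title`, and `selftext` fields; the sample records, the stopword list, and the helper names `tokenize` and `posts_to_documents` are hypothetical.

```python
import json
import re

# Words of two or more characters that start with a letter.
TOKEN_RE = re.compile(r"[a-z][a-z0-9_']+")

# A tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"how", "do", "over", "my", "the", "in", "never", "is", "to", "and", "of"}


def tokenize(text):
    """Lowercase the text and keep tokens that are not stopwords."""
    return [tok for tok in TOKEN_RE.findall(text.lower()) if tok not in STOPWORDS]


def posts_to_documents(lines, subreddit):
    """Convert JSON lines into one token list per post in the given subreddit."""
    documents = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)
        if post.get("subreddit", "").lower() != subreddit.lower():
            continue
        body = post.get("selftext") or ""
        # Moderated or deleted posts keep only their title.
        if body in ("[removed]", "[deleted]"):
            body = ""
        tokens = tokenize(f"{post.get('title', '')} {body}")
        if tokens:
            documents.append(tokens)
    return documents


# Invented sample records for illustration; not real Reddit posts.
sample_lines = [
    '{"subreddit": "learnpython", "title": "How do I loop over user input?", '
    '"selftext": "My while loop never stops."}',
    '{"subreddit": "Python", "title": "New release notes", "selftext": ""}',
    '{"subreddit": "learnpython", "title": "Index error in list", "selftext": "[removed]"}',
]

documents = posts_to_documents(sample_lines, "learnpython")
for tokens in documents:
    print(tokens)
```

On the full dataset, the same filter-and-tokenize logic would need to run in a distributed setting, which is the role Spark plays in the pipeline described above.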

Illustrative Example

Subreddit CoP Topics Can Overlap

r/Python wordclouds

r/learnpython wordclouds

Data vis

Single Subreddit Topics

r/Python

r/learnpython

Identified Learning Barriers

User input loops
Classes vs. scripts (OOP foundations)
Interpreting debug messages
Data formatting
Array indexing

r/learnpython identified trends

Users are asking for more examples
File I/O & array posts remain constant
General question posts on the rise
Getting started posts on the rise

Next Steps

Interpret other subreddits
Interpret other disciplines
Investigate other text sources

References

[1]

J. Lave, “Situating learning in communities of practice.” 1991.

[2]

D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.

[3]

J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn, “The Pushshift Reddit dataset,” in Proceedings of the International AAAI Conference on Web and Social Media, 2020, vol. 14, pp. 830–839.

[4]

J. Luraschi et al., “sparklyr: R interface to Apache Spark.” Mar. 2023. Available: https://CRAN.R-project.org/package=sparklyr

[5]

Feb. 2023. Available: https://docs.rcd.clemson.edu/palmetto/about/

Contact info

email: tsranso@clemson.edu

github: https://github.com/ransomts

website: tsranso.people.clemson.edu