Data resources

I use this page as a repository to store & share links to slides, books, tips, forums, and useful packages in NLP, Machine Learning & Deep Learning, Python, R, and other related data retrieval topics. Wherever applicable, I also put relevant forum channels and links to the latest resources.

Natural Language Processing & Argument Mining

Code & Data for Social Science – A practitioner’s guide (Matthew Gentzkow, Stanford University & Jesse Shapiro, Brown University): A general go-to guidebook on how to organize and share your code and data logically and efficiently. A must-read for keeping your empirical work navigable, especially in the long run.

Flowchart for NLP project decisions: A comprehensive A-to-Z guide to assessing and planning an NLP project, strongly recommended for evaluating the scope and feasibility of your project.

NBER Econometric Methods for High-Dimensional Data: High-level lecture series on high-dimensional data and text as data, with links to references, data, and code for practice.

Text Data & Machine Learning for Social Science (Python) (Elliott Ash, ETH Zurich): An extensive, step-by-step text-as-data course. I got started on my own project using the code and slides from this class. His website also contains many useful links and tips for graduate students doing applied micro work.

Text as Data in Social Science (R) (Brandon Stewart, Princeton University): More than 1,000 slides of detailed instructions on working with text as data in R. Strongly recommended for those who want to use structural topic models (stm), as Brandon and his team have developed a great R package for them.

Argument mining framework (Ubiquitous Knowledge Processing (UKP) Lab, TU Darmstadt): I use ATLAS.ti, a user-friendly annotation tool, to annotate argument components & relations and to manage meta-data analysis among my team members.

Machine Learning & Deep Learning

Mathematics for Machine Learning: A comprehensive refresher on the essential math for machine learning (note that it’s very similar to the early PhD classes on Matrix Algebra & Optimization for Economics & Econometrics).

Introduction to Neural Networks: An intuitive 7-minute read on building your own neural networks from scratch.

In-depth tutorials in Deep Learning (both in Pytorch and Keras): Various teaching slides and exercises from intensive tutorials at the Ecole Polytechnique Data Science Summer School DS^3 2018 (Paris, France), which I attended last year. Materials on other advanced topics, such as Causality and Machine Learning and Topological Data Analysis, are also available. If you prefer a more extensive course, I highly recommend Tensorflow for Deep Learning Research by the awe-inspiring Chip Huyen (Stanford, NVIDIA).

Deep Learning, Reinforcement Learning (and other advanced materials): Slide sets from presentations at the Eastern European Machine Learning Summer School 2019 (Bucharest, Romania), which I attended this summer.

Facebook AI/DS/ML/DL group: I strongly recommend joining this group; members share many interesting applications, code, papers, etc. across all fields.

Python & R

Workshop materials (R, Python, Stata) (IQSS, Harvard University): Materials from the workshop series of Harvard Data Science Services.

Introduction to Programming with Python (Durham University): A self-study course with lots of exercises and examples that illustrate the fundamentals of programming in Python.

Python & R cheat sheets for various packages: A collection of DataCamp community one-page cheat sheets for the most common and useful packages in Python (numpy, pandas, scikit-learn, scipy, matplotlib, keras, pySpark, spaCy, seaborn) and R (tidyverse, xts time series). There is also a beginner’s cheat sheet for Python!


Webscraping: EEA 2018 workshop for economists (R), the basic package BeautifulSoup, and the more sophisticated Selenium applied to LinkedIn data, in addition to a bunch of other packages.
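To give a feel for the parsing step behind web scraping, here is a minimal sketch. It deliberately uses only Python's standard-library html.parser instead of BeautifulSoup or Selenium, so it runs without installing anything. The sample HTML, the example.com URLs, and the names LinkCollector and extract_links are all hypothetical and only for illustration; a real scraper would first download the page and should respect the site's terms of use.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a href=...> element."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._current_href: Optional[str] = None
        self._text_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Start recording text when an <a> tag opens (href may be missing).
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that sits inside an <a> tag that has an href.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        # Store the finished link when the <a> tag closes.
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def extract_links(html: str) -> List[Tuple[str, str]]:
    """Parse an HTML string and return all (href, text) pairs found."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content standing in for a downloaded web page.
SAMPLE_HTML = """
<ul>
  <li><a href="https://example.com/slides">Course slides</a></li>
  <li><a href="https://example.com/data">Practice <b>data</b></a></li>
  <li><a name="anchor-only">No link here</a></li>
</ul>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(f"{text}: {href}")
```

BeautifulSoup wraps this kind of parsing in a much friendlier interface, and Selenium adds browser automation for pages that only render their content with JavaScript.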

RegEx: RegEx101, Rubular, and this quick cheat sheet on Medium are a great help in working out the right regular expression for the text you want to keep or strip.
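As a small illustration of "keep or strip", the sketch below uses Python's built-in re module. The patterns and the function names clean_text and keep_hashtags are my own assumptions for the example, not taken from the linked resources; test and adapt them on RegEx101 for your own corpus.

```python
import re
from typing import List

# Hypothetical patterns for illustration; adjust them to your own corpus.
URL_PATTERN = re.compile(r"https?://\S+")    # web links to strip
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")  # anything that is not a-z or whitespace
WHITESPACE_PATTERN = re.compile(r"\s+")       # runs of spaces, tabs, newlines
HASHTAG_PATTERN = re.compile(r"#\w+")         # '#' followed by word characters


def clean_text(text: str) -> str:
    """Lowercase, strip URLs and non-letter characters, collapse whitespace."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = NON_LETTER_PATTERN.sub(" ", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()


def keep_hashtags(text: str) -> List[str]:
    """Return every hashtag found in the text, in order of appearance."""
    return HASHTAG_PATTERN.findall(text)


if __name__ == "__main__":
    print(clean_text("Check https://example.com NOW!! It's 100% free."))
    print(keep_hashtags("Loving #NLP and #text_as_data!"))
```

Note that stripping the apostrophe splits "it's" into "it s"; whether that is acceptable depends on your downstream tokenizer, which is exactly the kind of decision the testing tools above help with.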

POS: Part-of-speech tagging test to determine the function of each word and decide which words to keep or strip (e.g., "like" is a verb in one context but an adverb in another).

Politeness package: Check the politeness of your sentences with this model, which uses the conversational analysis toolkit ConvoKit. A politeness package in R is also available here (e.g., hedge_list, filler_list), created by Mike Yeomans (Harvard).