Sunday, February 24, 2019

The first result from my current replication book project

Here on my blog, I am sharing the data and replication files I wrote for an article published in 2016 by John V. Winters titled “Is economics a good major for future lawyers? Evidence from earnings data.” The replication files are at the bottom of this page. This is the first time I’ve produced and shared replication files, but over the next year, I hope to add replication files for an additional 10-15 previously published studies.
Update December 3, 2020: Please visit the finished book's companion site for all data and code files.  

Update January 30, 2020: Since I first wrote this post, I've learned a lot about economics research studies that use public microdata (the ACS), about R programming, and about how to replicate those studies (including how to ask for data; see also here). Eventually I settled on a list of lucky seven studies to replicate, although I expect this number to rise. In fact, I want to facilitate replication within a community of students and researchers, so that 7 may become 700 in the years ahead. I have discussed some of these plans in recent posts here and here. Soon I'll be posting beta versions of the seven replication files, all in one place, along with teaching and other resources, leading up to the publication of the book in early 2021.

The studies that I will replicate will serve as the basis for a book I am planning, which will narrate a statistical portrait of social life in America in the 21st century. My goal is for the book to be both timely and interesting to mainstream readers, as well as educational and thought-provoking for students and practitioners. Topics I plan to discuss include migration and citizenship; education, training and jobs; and housing, transportation and family life. The final chapter will document the carbon footprint of all Americans, which is one of the consequences of the striving for a better life that is the central driving force behind the statistics and stories presented earlier in the book.
Not only will I provide R scripts, Stata Do files, and data files on my blog for all of the analyses in the book, but all of the studies I will discuss use a common data source, the American Community Survey, a massive US Census Bureau survey which reaches 1% of the US population every year. These data are publicly accessible, and so my hope is that an interested reader can learn how previous research was done and pick up the tools needed to go out and do their own original research.

That said, I also hope to write it in a way that a more casual reader will find entertaining, inspirational and insightful. With this I turn to the details of the first study replicated.

The Winters (2016) study lists the most popular college majors among practicing lawyers, and then calculates earnings by major. Winters finds that political science is the most popular major among lawyers. According to the data, more than 20% of practicing lawyers majored in it. Winters (2016) reported median lawyer earnings among political science majors as $114,288 in 2015 dollars. The economics major was not as popular among future lawyers but still came in fourth at 6.21%. Median earnings were higher for lawyers who majored in economics than for those who majored in political science, at $130,723. Only lawyers who were electrical engineering or accounting majors had higher median earnings than those majoring in economics.

This study is a great example of how descriptive analysis can provide relevant information.

I was able to reproduce all of the results from Winters (2016) in Stata, the software the author used. With the R files, I was able to exactly reproduce the ranking of major popularity and all weighted mean earnings. With the weighted median earnings, I was able to reproduce 9 of 25 exactly and replicate the other 16 of 25 extremely closely, suggesting this minor discrepancy is due to a difference between the two weighted median functions I used in Stata and in R. The idea behind sampling weights is that you can correct for over- or under-sampling to make the resulting statistic (here, median income) more representative of the population from which the respondents were drawn.
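To make the weighted mean and weighted median concrete, here is a minimal sketch in Python. It is not the R or Stata code from the replication files, and the four earnings values and weights are made up for illustration. The two median functions use two common tie-handling conventions; which convention either the Stata or the R function actually follows is not confirmed here, so treat this only as an example of how two reasonable definitions can disagree slightly.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_median_lower(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value


def weighted_median_midpoint(values, weights):
    """Like the rule above, but when the cumulative weight lands exactly on
    half the total, average that value with the next one."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0.0
    for i, (value, weight) in enumerate(pairs):
        cumulative += weight
        if cumulative > half:
            return value
        if cumulative == half:
            return (value + pairs[i + 1][0]) / 2


# Hypothetical earnings and sampling weights (assumes all weights are positive).
earnings = [50000, 90000, 120000, 200000]
weights = [100, 150, 150, 100]

print(weighted_mean(earnings, weights))             # 113000.0
print(weighted_median_lower(earnings, weights))     # 90000
print(weighted_median_midpoint(earnings, weights))  # 105000.0
```

Both medians are defensible, and on real survey data with many distinct weights they usually land on the same or neighboring values, which is consistent with small differences in only some of the cells.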

After I replicated his findings, I decided to see what effect weighting had on the ranking of the most popular majors among lawyers, because Winters (2016) applied weights in his calculation of mean and median earnings, but not in the frequencies of the most popular majors. It's likely to be a very minor point, but we wouldn't know that if we didn't try. (I actually happened on this idea by accident, by adding weights to the wrong line of code.)

Table 1 shows the percentage of practicing lawyers for the top 10 majors. I list both the weighted percentages, which take into account over- and under-sampling, and the unweighted percentages that were reported by Winters (2016). The table is sorted by the weighted percentages, which changes the ranking only slightly from the one reported in Winters (2016). As you can see, the two sets of figures are very similar.

Table 1: The 10 most popular college majors among practicing lawyers 
| Major | Size rank | Unweighted percentage of practicing lawyers | Weighted percentage of practicing lawyers |
|---|---|---|---|
| Political Science and Government | 1 | 21.58 | 21.59 |
| History | 2 | 9.70 | 9.40 |
| English Language and Literature | 3 | 8.05 | 7.94 |
| Economics | 4 | 6.21 | 6.22 |
| Psychology | 5 | 4.83 | 4.83 |
| Business Management and Administration | 6 | 3.82 | 3.72 |
| General Business | 7 | 2.78 | 2.91 |
| Philosophy and Religious Studies | 8 | 2.77 | 2.85 |
| Accounting | 9 | 2.80 | 2.70 |
| Criminal Justice and Fire Protection | 10 | 2.33 | 2.52 |

Notes: The statistics are based on the 2009-2013 ACS data for practicing lawyers aged 30-61. The figures in the unweighted column are a reproduction of Winters’ (2016) result, while the weighted column is a replication that applies sampling weights.
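The difference between the two percentage columns comes down to what gets added up for each major: a count of respondents, or the sum of their sampling weights. A minimal Python sketch follows, using a handful of made-up records rather than the ACS extract; the function name `major_shares` and the record layout are assumptions for illustration only.

```python
from collections import defaultdict


def major_shares(records, use_weights):
    """Percentage of lawyers in each major.

    records: iterable of (major, sampling_weight) pairs, one per respondent.
    use_weights: if False, every respondent counts as 1 (unweighted share);
                 if True, every respondent counts as their sampling weight.
    """
    totals = defaultdict(float)
    for major, weight in records:
        totals[major] += weight if use_weights else 1.0
    grand_total = sum(totals.values())
    return {major: 100 * total / grand_total for major, total in totals.items()}


# Hypothetical respondents: (major, sampling weight). Not real ACS data.
records = [
    ("Political Science", 120),
    ("Political Science", 80),
    ("History", 150),
    ("Economics", 50),
]

print(major_shares(records, use_weights=False))
# {'Political Science': 50.0, 'History': 25.0, 'Economics': 25.0}
print(major_shares(records, use_weights=True))
# {'Political Science': 50.0, 'History': 37.5, 'Economics': 12.5}
```

In this toy example the weights are deliberately uneven, so the shares move a lot. In Table 1 they barely move, which is what one would expect if weights do not vary systematically by major.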

Sample weighting is not a topic that I’ve seen covered in any of the leading introductory econometrics textbooks, but it is widely used in applied research like the studies I plan to survey in my book. In every row of Table 1, the weighted and unweighted percentages are very similar, and the ranking is nearly the same regardless of which percentage is used. This suggests that, for this ranking at least, weighting makes little practical difference. Nonetheless, confusion over the role of weighting is a potential stumbling block for someone trying to use, or at least understand how others have used, the ACS data. As Angrist and Pischke (2009, p. 91) write in Mostly Harmless Econometrics, “Few things are as confusing to applied researchers as the role of sample weights. Even now, 20 years post-Ph.D., we read the section of the Stata manual on weighting with some dismay.”

But this issue needn’t cause excessive concern: the files I share below illustrate how to calculate weighted means and medians in both R and Stata, in the same way that experts like John Winters do. A student can see how to compute weighted averages and run weighted regressions in the analysis files without having to learn a great deal about the details, beyond a solid conceptual understanding of weighting as a best practice.

As the previous discussion showed, the Winters (2016) article illustrates how to properly weight observations when calculating average earnings. Three other learning opportunities I encountered in reproducing this study were:
  • Which income measure to use. The author reports using “…the full earnings of all lawyers…” but the ACS data include multiple earnings measures.
  • Whether to include or exclude people with negative reported income. In the working paper version of this article, Winters estimated regression models using a sample that excluded negative earners, but in the published version negative earners were included in the average income calculations.
  • How to adjust for inflation. The US Bureau of Labor Statistics regularly updates the Consumer Price Index, and authors rarely report the version of the CPI they use in their adjustments.
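The last two choices can be sketched in a few lines of Python. This is not Professor Winters' CPI calculation: it only shows the usual ratio adjustment (nominal earnings times the base-year index divided by the survey-year index) and a switch for keeping or dropping negative earners. The function names and the index values 100.0 and 110.0 are placeholders, not actual BLS figures.

```python
def to_base_year_dollars(nominal, cpi_survey_year, cpi_base_year):
    """Convert nominal earnings into base-year dollars with a CPI ratio."""
    return nominal * cpi_base_year / cpi_survey_year


def real_earnings(earnings, cpi_survey_year, cpi_base_year, keep_negative=True):
    """Inflation-adjust a list of earnings, optionally dropping negative earners."""
    kept = [e for e in earnings if keep_negative or e >= 0]
    return [to_base_year_dollars(e, cpi_survey_year, cpi_base_year) for e in kept]


# Placeholder index values chosen for easy arithmetic; NOT real CPI data.
CPI_SURVEY_YEAR = 100.0
CPI_BASE_YEAR = 110.0

# Hypothetical reported earnings, including one negative earner.
reported = [50000, -2000, 80000]

print(real_earnings(reported, CPI_SURVEY_YEAR, CPI_BASE_YEAR))
# [55000.0, -2200.0, 88000.0]
print(real_earnings(reported, CPI_SURVEY_YEAR, CPI_BASE_YEAR, keep_negative=False))
# [55000.0, 88000.0]
```

Because every adjusted value scales with the index ratio, a replication that uses a different CPI series or vintage than the original author will be off by a constant factor for each survey year, which is why seeing the author's exact calculation matters.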

My first approach is always to try to replicate studies without author assistance. But if that starts to get time-consuming, I am not shy about asking, and I'm glad the author shared his analysis files with me. It was especially helpful to see his precise CPI calculation, and I am grateful to Professor Winters for allowing me to include his CPI calculation code in my replication files below.

In their large-scale replication study, Chang and Li report that only about half of the authors they contacted provided them with replication files. Since I first wrote this post in February 2019, I have encountered a large share of non-responses, a few helpful authors like Professor Winters, and some helpful mandatory data sharing policies at journals like the American Economic Review. Helpful authors and mandatory policies both make the job of replicating studies much easier.

Replication Files:

Update December 3, 2020: Please visit the finished book's companion site for all R scripts. I still link to the Stata files below.


Analysis file (Do File), Data File: analysis subset (DTA)


To run the analysis in Stata, download the Do file and the DTA file. Change the directory in the Do file, highlight all of the code, and select Do.
See Codebook to match degree codes to major names. The Script and Do file also contain more information and extensive comments and notes. 

The first version of these files was posted on February 22, 2019. I plan to update and improve the files and will make a note here when the updated files are posted.
