In my last post, I described how I used the R programming language to reproduce the analysis of Costa and Kahn (2011), which was originally done in Stata. I called it a reproduction because, even though I was using different software, I was using the author-provided data.
In this post I address a different question: can their results be replicated using what should be the same data obtained from a different source? It turns out they can, and my R script is available here. This is a relevant question because data sets are sometimes updated (e.g. if errors are discovered). In addition, the data supplied by Costa and Kahn is only a small subset of the 2000 Census data, containing only California homeowners in single-family homes. This is enough to reproduce their results, but it severely limits the types of extensions that one can do with their data (for example, estimating the baseline model on Florida households or on households in multi-family housing).
It is easy in principle to download the original data from the source Costa and Kahn obtained it from, IPUMS-USA at the University of Minnesota. One can also download Census micro data directly from the Census Bureau or other distributors, but the good folks in Minnesota make downloading massive Census micro data sets as easy as making an Amazon purchase. Even better than Amazon, it is free! One only has to create an account and then "build" a data extract consisting of years (in this case just one year, the 2000 5% sample) and variables; the details of mine for this replication are shown in this codebook here.
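Once an extract is downloaded, it has to be cut down to the same kind of households as the author-provided file: California homeowners in single-family homes. The Python sketch below (not the R script used for this post) shows one way such a filter could look. The records are made up, and the numeric codes are an assumption about the IPUMS coding (STATEFIP 6 for California, OWNERSHP 1 for owned or being bought, UNITSSTR 3 for a detached single-family house) that should be checked against the extract's codebook.

```python
# Hypothetical household records; the keys mirror IPUMS-USA variable names.
# ASSUMPTION: the numeric codes below follow the IPUMS coding as I understand
# it and must be verified against the codebook for the actual extract.
households = [
    {"SERIAL": 1, "STATEFIP": 6, "OWNERSHP": 1, "UNITSSTR": 3},
    {"SERIAL": 2, "STATEFIP": 6, "OWNERSHP": 2, "UNITSSTR": 3},   # renter
    {"SERIAL": 3, "STATEFIP": 12, "OWNERSHP": 1, "UNITSSTR": 3},  # not California
    {"SERIAL": 4, "STATEFIP": 6, "OWNERSHP": 1, "UNITSSTR": 7},   # not single-family
]


def in_estimation_sample(hh, state=6, owner_code=1, single_family_codes=(3,)):
    """Return True for owner-occupied single-family homes in the given state."""
    return (
        hh["STATEFIP"] == state
        and hh["OWNERSHP"] == owner_code
        and hh["UNITSSTR"] in single_family_codes
    )


subsample = [hh for hh in households if in_estimation_sample(hh)]
subsample_serials = [hh["SERIAL"] for hh in subsample]
print(subsample_serials)
```

Whether attached single-family houses belong in `single_family_codes` is exactly the kind of judgment call on which two analysts' subsamples can differ.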
Right off the bat, however, there is one obvious difference between the data Costa and Kahn downloaded from IPUMS-USA and the data currently available there. In the original data, the year of construction variable indicated whether a home was built in 1995-1998 or in 1999-2000, whereas in the currently available data we know only that a home was built sometime in 1995-2000. In other words, since Costa and Kahn downloaded the data, the two categories were combined into one, probably to "harmonize" the data with that from other survey years.
To deal with this, I repeat the Costa and Kahn analysis using the data they supplied, but with their two year-built categories spanning 1995-2000 combined into one. Doing this was a bit more complicated than it sounds, as it also involves recalculating the average electricity price over the period in which the home was constructed. Luckily, Costa and Kahn include in their author-provided files the raw price data with which they calculated these averages, so I recalculated the average price over the 1995-2000 period. These adjusted average prices are available in this file. (It took a little trial and error to figure out precisely how they calculated these averages; I discuss this in this document (TBA), which describes my whole replication in more detail.)
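To see why the price has to be recalculated from the raw series rather than patched together from the two existing averages, here is a minimal Python sketch (again, not the R script behind this post). The annual prices and category labels are hypothetical, and the plain unweighted mean over years is an assumption; it is not necessarily how Costa and Kahn computed their averages.

```python
from statistics import mean

# Hypothetical annual electricity prices (cents per kWh). These are NOT the
# raw prices from the author-provided files.
annual_price = {1995: 10.0, 1996: 10.2, 1997: 10.4,
                1998: 10.6, 1999: 10.8, 2000: 11.0}

# Hypothetical labels for the two original year-built categories and the
# single combined category that replaces them.
COMBINED = {"1995-1998": "1995-2000", "1999-2000": "1995-2000"}


def recode_year_built(category):
    """Map the two late-1990s categories to one; leave the others unchanged."""
    return COMBINED.get(category, category)


def average_price(first_year, last_year, prices):
    """ASSUMPTION: an unweighted mean of the annual prices in the period."""
    return mean(prices[year] for year in range(first_year, last_year + 1))


recoded = [recode_year_built(c) for c in ["1980-1989", "1995-1998", "1999-2000"]]

price_1995_2000 = average_price(1995, 2000, annual_price)

# Averaging the two sub-period averages gives a different number, because
# the sub-periods cover four years and two years respectively.
naive_average = mean([average_price(1995, 1998, annual_price),
                      average_price(1999, 2000, annual_price)])

print(recoded, round(price_1995_2000, 2), round(naive_average, 2))
```

With these made-up prices the six-year average and the average of the two sub-period averages differ, which is why the raw price data are needed.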
Another difference concerns the reference category. We have period-of-construction dummies covering the entire time period, and one has to be omitted as the reference category: I omit the 1960s category, whereas Costa and Kahn omitted the 1999-2000 category. Using the same reference category in both sets of regressions is one way to make the results from the author-provided data directly comparable to the results from the data I just downloaded.
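Changing the omitted category only re-expresses the same comparisons against a different baseline. The Python sketch below illustrates this with hypothetical group means in a regression that contains nothing but the period dummies and an intercept, where each coefficient is simply a group's mean minus the reference group's mean. The numbers are invented for illustration and are not taken from the data.

```python
# Hypothetical mean log electricity bills by construction period.
group_mean = {"1960s": 6.80, "1970s": 6.82, "1980s": 6.79, "1995-2000": 6.70}


def dummy_coefficients(means, reference):
    """Coefficients of a dummies-only regression with `reference` omitted:
    each one is the group's mean minus the reference group's mean."""
    return {g: m - means[reference] for g, m in means.items() if g != reference}


relative_to_newest = dummy_coefficients(group_mean, "1995-2000")
relative_to_1960s = dummy_coefficients(group_mean, "1960s")

# Re-basing by hand: subtract the old coefficient of the new reference group.
# The old reference group itself has an implicit coefficient of zero.
rebased = {
    g: relative_to_newest.get(g, 0.0) - relative_to_newest["1960s"]
    for g in group_mean
    if g != "1960s"
}

for g in rebased:
    print(g, round(rebased[g], 3), round(relative_to_1960s[g], 3))
```

The point estimates can be re-based by subtraction like this, but the standard errors cannot, so re-estimating with the new reference category is the simpler route when a full table is needed.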
The table below shows the results of the re-estimated models using the data from IPUMS and the original data. Columns (1) and (2) use the data I downloaded, and columns (3) and (4) use the data supplied by the authors; the relevant comparisons are between columns (1) and (3) and between columns (2) and (4):
Costa Kahn Replication Regression Results

Dependent variable: logCOSTELEC. Standard errors are in parentheses.

| | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| AGE | 0.005*** (0.0003) | 0.005*** (0.0003) | 0.005*** (0.0003) | 0.005*** (0.0003) |
| ROOMS | 0.090*** (0.003) | 0.090*** (0.003) | 0.090*** (0.003) | 0.090*** (0.003) |
| logHHINCOME | 0.116*** (0.004) | 0.116*** (0.004) | 0.116*** (0.004) | 0.116*** (0.004) |
| HHSIZE | 0.056*** (0.004) | 0.056*** (0.004) | 0.057*** (0.004) | 0.057*** (0.004) |
| WHITE | 0.086*** (0.008) | 0.086*** (0.008) | 0.084*** (0.007) | 0.084*** (0.007) |
| ELEHEAT | 0.221*** (0.011) | 0.221*** (0.011) | 0.219*** (0.011) | 0.219*** (0.011) |
| SEI | 0.001*** (0.0001) | 0.001*** (0.0001) | 0.066*** (0.01) | 0.066*** (0.01) |
| YB1970 | 0.019*** (0.007) | -0.027*** (0.009) | 0.017** (0.007) | -0.029*** (0.008) |
| YB1980 | -0.011* (0.007) | 0.011 (0.009) | -0.011* (0.007) | 0.011 (0.008) |
| YB1990 | -0.059*** (0.009) | -0.00001 (0.012) | -0.057*** (0.009) | 0.001 (0.011) |
| YB1995 | -0.116*** (0.012) | -0.077*** (0.01) | -0.114*** (0.011) | -0.075*** (0.01) |
| Observations | 144296 | 144296 | 139345 | 139345 |
| R2 | 0.164 | 0.164 | 0.165 | 0.165 |
| Residual Std. Error | 3.014 | 3.014 | 3.007 | 3.007 |

lprice is reported for only two of the four columns, with estimates -0.236*** (0.041) and -0.233*** (0.039).

Note: *p<0.1; **p<0.05; ***p<0.01
What we see is that the results are nearly identical, even though the sample sizes differ by about 3.5% (144,296 observations in the data I downloaded versus 139,345 in the author-provided data). Perhaps I selected the estimation subsample in a slightly different way from them, or perhaps it is a result of other data changes made to harmonize the variables or update the data set. Whatever the reason for the differing estimation subsample sizes, this is by and large a very successful replication.
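The arithmetic behind "nearly identical" can be checked from the reported numbers alone. The Python sketch below copies the coefficients from columns (1) and (3) of the table and computes the sample-size gap and the largest coefficient gap. SEI is left out because it is on a different scale in the two sets of columns, and the dictionary names are just illustrative.

```python
# Coefficients copied from columns (1) and (3) of the regression table.
downloaded = {"AGE": 0.005, "ROOMS": 0.090, "logHHINCOME": 0.116,
              "HHSIZE": 0.056, "WHITE": 0.086, "ELEHEAT": 0.221,
              "YB1970": 0.019, "YB1980": -0.011, "YB1990": -0.059,
              "YB1995": -0.116}
author = {"AGE": 0.005, "ROOMS": 0.090, "logHHINCOME": 0.116,
          "HHSIZE": 0.057, "WHITE": 0.084, "ELEHEAT": 0.219,
          "YB1970": 0.017, "YB1980": -0.011, "YB1990": -0.057,
          "YB1995": -0.114}

n_downloaded, n_author = 144296, 139345

# Sample-size gap as a percentage of the author-provided sample.
sample_gap_pct = 100 * (n_downloaded - n_author) / n_author

# Largest absolute difference between matching coefficients.
max_gap = max(abs(downloaded[name] - author[name]) for name in downloaded)

print(f"sample gap: {sample_gap_pct:.2f}%")
print(f"largest coefficient gap: {max_gap:.3f}")
```

This is only a descriptive comparison. The two samples overlap heavily, so it is not a formal test of whether the coefficients differ.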
(The only exception in the comparison above is the coefficient on SEI, but here I just forgot to rescale the variable in the replication; in their original analysis they divided this variable by 100. I'll try to fix this in an update, but it is kind of trivial so I may not get around to it.)
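The reason the SEI mismatch is harmless is that dividing a regressor by 100 multiplies its coefficient (and its standard error) by 100 and changes nothing else. A minimal Python sketch with a one-variable least-squares slope and made-up data shows the mechanics; the values below are hypothetical and are not taken from the Census sample.

```python
def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical socioeconomic index values and log electricity bills.
sei = [20.0, 35.0, 50.0, 65.0, 80.0]
log_bill = [6.70, 6.72, 6.73, 6.74, 6.76]

slope_raw = ols_slope(sei, log_bill)
slope_scaled = ols_slope([s / 100 for s in sei], log_bill)

print(slope_raw, slope_scaled)

# The same factor links the two SEI entries in the table: 0.066 / 100 is
# 0.00066, which is displayed as 0.001 when rounded to three decimals.
table_check = round(0.066 / 100, 3)
```

So the 0.001 in columns (1) and (2) and the 0.066 in columns (3) and (4) are consistent with the same underlying estimate on two different scales.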
In the next article in this series, I will move from reproduction and replication to extension. I will estimate their model using data from the 2012-2017 American Community Surveys, the successor to the long-form decennial Census, to see if the results they obtained hold up in the most recent data. Whereas with reproduction and replication we expect the results to hold up, in the case of the extension I have in mind it's perfectly plausible that the results will be different. In the table above, we see that homes constructed in the 1980s, the decade just after the introduction of energy-efficient building codes, had statistically indistinguishable electricity bills from homes constructed in the 1960s. This suggests building codes, at least those enacted in the 1970s, were ineffective. We'll see if this result holds up in the most recent data.