Saturday, March 4, 2017

Merging data from different sources using spreadsheets

Students in my classes have been busy searching for data for their term papers.  Often, the variables needed for an analysis are contained in separate files and must be merged.  Merging, an importing data management task, can be accomplished using a spreadsheet like MS Excel or a statistical software package like Stata.  This article describes how to complete a merge in Excel, and is written for students or others who are completely new to data management and spreadsheets.




If you have very large data sets using Excel for merges can be problematic, but it is powerful enough to handle data merges for many common situations.  Say you want to test the hypothesis that controlling for income, median housing value is higher in more temperate cities.  Data on median home values and average July temperature are available in the 2007 County and City Data Book.  (Unfortunately these files are no longer hosted by the Census, but luckily you can still access them through the Internet Archive.  Click here to see archived versions of these files.)  Click here to download the spreadsheet I've created for this demo, which contains only the variables needed for the analysis.
 

Median housing value
D-6. Owner- and Renter-Occupied Housing Units and Vehicles Available

Per capita income
D-5. Income and Poverty
Average July temperature
C-6. Government and Climate

If you access the original data files you will find that you can easily combine data from files D-5 and D-6, because the Census Bureau has formatted them so the cities are listed in the same order in both files.  A copy and paste procedure will work to combine these variables.  

The problem comes in when you want to merge in Avg Temperature.  This is in the file C-6, which has more cities than the D files (the C-files have all cities with population larger than 25,000 while the D-files have only those cities with population larger than 100,000).  Fortunately, you can use Excel’s VLOOKUP tool to avoid having to match them up by hand.  Not only is this faster but you minimize the chance of human error.  

Rather than matching cities by name, it is easier to assign each city a number and then match on numbers.  This eliminates the possibility that misspellings (or alternative spellings, such as Saint Petersburg versus St. Petersburg) causes a failure to match records.  The U.S. Census assigns each city, county and state (and all other geographies, such as school districts) a “FIPS” code.  They did not include these codes in the D5, D6 and C6 files.  However, they are included in the D1 and C1 files.

Here are the steps needed to merge the data and carry out the analysis:

Step 1: Create a “large city file” using the data in D5 and D6 and the FIPS codes in D1.  This will list each city name, each city's FIPS code, its Per Capita Income, and its Median Housing Value.  This is done in the Worksheet titled “Big Cities File”.  

Step 2: Create a “small city file” in the same way, which contains city names and FIPS codes from C1 and average temperature from C6.

Step 3: Create a new worksheet called “Merge” and paste both the "big cities" and "small cities" data sets on the same worksheet, leaving a few columns empty between them.

Step 4: use the Vlookup syntax as shown in Cell E2 to merge the data from the two sets.  Here is the syntax:
=VLOOKUP(A2,$I$2:$L$1319,3,FALSE)

You will type this syntax into E2 as this is where you want the first average temperature variable to go.  The syntax works by first looking up the value in cell A2 (as specified in the syntax) and then looking in the first column of the table array (the data from cells I2 to L1319), and when making an exact match between the number in A2 and the number in column I, it returns the variable in the 3rd column of the table array, namely Average January Temperature.  Thus the four terms in the parentheses of the syntax point to the values to be looked up and returned, and the last entry "FALSE" tells Excel to only match if it is an exact match (if you type TRUE it will return the closest match, which may be useful for some purposes but not ours.)

Important note: using cell referencing (rather than type I2:L1319, we type ,$I$2:$L$1319) is important here, so when you have Excel automatically repeat the syntax for the rest of the rows, it will keep the lookup table array the same.  If you don’t use cell referencing, your lookup table will get smaller and smaller and this will cause you to not match all records.  You can do this by highlighting I2:L1319 and hitting the F4 key.

Finally, to prepare your file for regression, you want to drop all states (the Census releases these files with city averages and also state averages.)  You can easily do this by copying and pasting the merged data into a new worksheet, titled “FINAL” and sorting based on July temp, as the census didn’t report values here.  (Note, I am coping and pasting as values so the july temperature will appear as values rather than excel syntax.  To do this, right click and select "paste special" and choose "paste as values".)  All the states will then be at either the top or bottom of the file depending on how you sorted and you can just delete them.

I use the data in the worksheet titled FINAL to create the two scatterplots shown below:



As you can see, both warmer January and cooler July temperatures are associated with higher values.  

If you were inclined you could carry out multiple regression analysis using Excel (I may write up an article on how to do this later).  Or just import the data into Stata and do it there, which is what I've done.  Here is a table I produced using Stata and this data:


In the last column, which includes controls for state fixed effects, we see a 1% increase in Avg Jan Temp is associated with 1% higher median home values, and a 1% increase in Avg July Temperature is associated with 1% lower median home values.  The January effect is estimated to be about the same in all three specifications, while the July effect falls considerably when state effects are included.

In addition to the data file for this example mentioned above, I've also created a Stata do file to run the analysis reported in the table above.  This do file can be accessed by clicking here.





No comments:

Post a Comment