Getting correspondence from messy surveys: finding equivalent variable names

UNICEF’s Multiple Indicator Cluster Surveys are a great resource. In addition to survey-level indicators available via http://www.micscompiler.org/, UNICEF provides the individual-level survey data.

The data are messy, however: the same variables have different names across surveys and labelling is inconsistent and multilingual (both within and across surveys).

As downloaded, there are lots of files with pretty inconsistent variable names (we’ll not deal with value labelling here, but that’s also an issue). We need to process each of those files, learning what we can, and come up with a way to compare them — then produce some code to make the surveys conform.

The first thing to do is have Stata inventory the data files you need to review. If they’re all in one folder, the commands:

cd "folderwhereyourfileslive"
local thefiles: dir . files "*.dta"
foreach f of local thefiles {
   use "`f'", clear
   * do what you need to do...
}

What, however, if your files are in multiple folders, or in a tree with multiple layers of subfolders? Here you need a recursive search, which you do a few ways. A search for Stata dirlist recursive turns up some good stuff, or you can use shell commands (I’ll just use DOS/Windows for example):

cd "folderwhereyourfileslive"
winexec "dir /s /b *.dta >filelist.txt"
import delimited "filelist.txt", clear

… and v1 will contain the names of your files, with full path.

There’s a very nice option to Stata’s

-describe-

command. You can use

describe, replace

to produce a dataset containing the output that is typically written to Stata’s Results window. In the case of multiple files, you can simply add a variable indicating the name of the survey (file) to which you applied -describe-, and -append- to a cross-survey file.

The following starts with all of the downloaded and unzipped MICS surveys (that I converted from SPSS to Stata using Stat/Transfer). It uses a recursive dirlist function I found on Statalist, and combines -describe, replace- with some text cleanup. The result is a “mega” cross-survey file with variables indicating the path to the file, variable name, variable data type, variable label, and the value label (if any).

local dir_do "/somepath/"

local dir_f "/somepath/"

adopath + "`dir_do'"

local descfile "`dir_f'/micsdesc.dta"
local listfile "`dir_do'/micsfiles.dta"

erase "`descfile'"

dirlist, fromdir("`dir_f'") pattern ("*.dta") save("`listfile'") replace

use "`listfile'"


// build the file list
local fls // this will hold our list
qui count // #records will be stored in r(N)
forvalues i = 1/`r(N)' {
	local thisfile = fname[`i'] // ith value of variable fname
	local fls "`fls' `thisfile'" // build via append
}


tokenize "`fls'"
while "`1'" ~= "" {
	use "`1'", clear
	rename *, lower // might not be necessary, but let's assume v1 == V1
	save "`1'", replace // saving w/ lowercase variable names
	describe, replace
	// strip off the base path for clarity and to save space
	local fname = subinstr("`1'","`dir_f'","",.)
	gen fileis = "`fname'"
	capture confirm file "`descfile'"
	if _rc==0 {
		append using "`descfile'"
	}
	save "`descfile'", replace
	mac shift
}

// OK, now we've got a LONG file, each record a variable, grouped by survey.
// What next?
// Derive features from the filename
gen filetemp = fileis
replace filetemp = subinstr(filetemp,"Datasets","",.)
replace filetemp = subinstr(filetemp,".dta","",.)
replace filetemp = subinstr(filetemp,"-SPSS-","",.)
replace filetemp = subinstr(filetemp,"_","-",.)

// That's better.
// Now, try to find the 4-digit year in the path
gen yyyy=regexs(1) if regexm(filetemp,"([0-9][0-9][0-9][0-9])")
// If ther's a numeral immediately following 'MICS', it's the version
gen micsver=regexs(1) if regexm(filetemp,"mics([0-9])")
replace micsver=regexs(1) if regexm(filetemp,"MICS([0-9])")

// Country name is the first word in the path.
// Except in a few multi-word names and Myanmar.
split filetemp, p("-") limit(1)
rename filetemp1 country
drop if country=="micsfiles"
replace country="Myanmar" if country=="myanmar/DATA"
replace country="SouthSudan" if country=="South"
replace country="StateOfPalestine" if country=="State"
replace country="CentralAfricanRepublic" if country=="Central"
drop filetemp

save "`descfile'", replace

What do you do with this once you’ve got it?

  • You can list cases sorted either by variable name or survey name (file path / name), and see whether similarly-named variables have the same variable label (I’d take that as semi-strong evidence that they’re the same variable and worth treating as measured equivalently until I see some evidence to the contrary).
  • You can add a variable indicating to what the original variable should be renamed, in the case of the same variable that is named differently across different surveys. You can then use Stata’s excellent -file- commands to write out the text of a do-file containing a set of -use-, -rename-, and -save- statements so you’re automating your renaming operations.
  • Doubtless there are other uses and I’m just insufficiently imaginative. Suggestions, please!

Leave a Reply

Your email address will not be published. Required fields are marked *