Turning the World Fertility Surveys’ raw data into a set of Stata dta files

#WARNING — this is very draft.

I’m posting code for the adventurous among you, but I’m not sure how useful it is without more elaboration in the text. That elaboration is forthcoming … someday.

Note: this work was completed while working for Professor John Casterline.

The World Fertility Surveys are archived at Princeton's OPR (http://opr.princeton.edu/archive/wfs) as a set of fixed-width data files and data dictionaries. That is, sans code for making the files readily usable for analysis.

When faced with a problem like this, perhaps any problem where one can conceive of more and less sophisticated solutions, one is also faced with the sad fact that we only get to live in one universe at once, even if we can conceive of many. Should I just slog it out, writing code by hand to pull the variables I need for this particular task, or even go all out and write the whole thing? Or, should I try to make a robot that does it for me? The former is tedious but probably gets the desired result almost all the time; the latter isn’t a sure thing, and one could spend a lot of time and end up empty.

I built a robot this time around, and it’s inelegant but basically gets the job done.

What it Does

Given original WFS data and dictionary files from Princeton’s OPR archive or wherever, the code below:

  1. Produces a set of Stata dictionary (.dct) files to read the fixed-width data files into Stata-format files,
  2. Produces a set of do-files to apply variable and value labels to the Stata files,
  3. Runs the aforementioned dictionaries and do-files.
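
To make those three steps concrete, here is a toy version of the same pipeline on a two-line data file. It is only a sketch: the layout (columns 1-2 for an id, column 3 for a sex code), the variable names, and the labels are all made up for illustration and are not WFS content.

```stata
clear
tempfile dat dct
tempname fh

// A tiny fixed-width data file (hypothetical layout: cols 1-2 = id, col 3 = sex)
file open `fh' using "`dat'", write replace
file write `fh' "011" _n "022" _n
file close `fh'

// Step 1: a dictionary describing where each variable sits in the line
file open `fh' using "`dct'", write replace
file write `fh' `"infile dictionary using "`dat'" {"' _n
file write `fh' _tab `"_column(1) str2 id %2s "Respondent id""' _n
file write `fh' _tab `"_column(3) str1 sex %1s "Sex""' _n
file write `fh' "}" _n
file close `fh'

// Step 3a: read the data through the dictionary, then convert strings to numbers
infile using "`dct'"
destring _all, replace

// Steps 2 and 3b: define a value label and attach it
label define sex 1 "Male" 2 "Female"
label values sex sex

list
```

Everything is read as a string first and converted with destring afterwards, which mirrors what the full script does.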

Manual Front-End Stuff

Pick a folder in which to work, then fix the paths in the code I’ve posted so they match your environment. Note that UNIX/Mac paths require forward slashes (/); Stata for Windows accepts either forward slashes or backslashes (\), unlike Windows itself, which is essentially backslash-only. Forward slashes are the safer choice everywhere, because Stata treats a backslash placed directly before a macro reference as an escape character, which stops the macro from expanding.

Download data and dictionary files from the WFS archive: as of 9 Sept 2013, it’s here: http://opr.princeton.edu/archive/wfs

You’ll benefit from a tool like DownThemAll for Firefox when downloading lots of files.

It’s convenient to save all the files to a single folder, and we’ll let Stata sort them out. They’re zip files, however, so we’ll need to batch-unzip everything:

### BEGIN ###
local filespath "/Volumes/wherever/" // wherever you downloaded all those zip files!
cd "`filespath'"
local fls : dir "`filespath'" files "*.zip"
foreach f of local fls {
    unzipfile "`f'", replace
}
### END ###


Now, let’s process the codebooks:

### BEGIN ###
set more off
set trace off
capture drop _all
capture log close

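// Expected layout under basepath: dctorig/ holds the original codebooks (*.dct), dat/ holds the data files (*.dat)
local basepath /Volumes/WYRK/wfs
local outpath `basepath'/newdcts
local tmppath `basepath'/tmp

cap noi mkdir "`outpath'"
cap noi mkdir "`tmppath'"
cap noi mkdir "`basepath'/logs" // -log using- below fails if this folder is missing
cap noi mkdir "`basepath'/dta"  // destination for the finished .dta files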

// One survey per country - note that Dominican Republic (Dr) poses special challenges not handled by this version of the code
local surveys "Bd Bj Ci Cm Co Cr Dr Ec Eg Fj Gh Gy Ht Id Jm Ke Kr Lk Ls Ma Mr Mx My Np Pa Pe Ph Pk Pt Py Sd Sn Sy Th Tn Tr Tt Ve Ye"
// The Trinidad and Tobago survey is exceptional - doesn't fit with the rest.
*local surveys "Tt"

// seed stems for the various files we'll use in our output
tempname cb dct labs inlabs outlabs

// let's iterate over surveys
foreach cc of local surveys {
		cap noi macro drop varname
		log using "`basepath'/logs/`cc'_log.log", replace
		local cc = lower("`cc'")
		local codebook "`cc'sr.dct"
		file open `dct' using "`tmppath'/`cc'_dct_fixquotes.dct", write replace
		file open `labs' using "`tmppath'/`cc'_labels_step1.do", write replace

		file write `dct' "infile dictionary using `basepath'/dat/`cc'sr.dat {" _n
		file write `labs' "* Variable label file to be applied after dictionary `cc'_dct.dct" _n

		// we start by replacing some difficult-to-parse characters with either a blank (if superfluous) or with a string not found in the data itself (if we need that character later)
		filefilter "`basepath'/dctorig/`codebook'" "`tmppath'/tmp3_`codebook'", from(\Q) to("") replace
		filefilter "`tmppath'/tmp3_`codebook'" "`tmppath'/tmp1_`codebook'", from("%") to("XXXXX") replace
		filefilter "`tmppath'/tmp1_`codebook'" "`tmppath'/tmp_`codebook'", from("$") to("XXXXX") replace

		// open the text codebook
		file open `cb' using "`tmppath'/tmp_`codebook'", read
	local index=1
	while 1==1 {
		*di "`cc' -- `index' -- varname=`varname'. lastline=`prev_outline'. prev_varname=$prev_varname."
		if "`varname'"!="" {
			global prev_varname "`varname'"
		}

		file read `cb' line

		// stop reading once we hit end-of-file
		if r(eof)!=0 {
			continue, break
		} // end if r(eof)

		if (strpos("`line'","*")==1 ///
			| strpos("`line'",">")==1 ///
			| strpos("`line'","XXXXX")==1 ///
			| strpos("`line'","(")==1) {
			local ++index
			continue
		}
		local varname =    lower(subinstr(substr("`line'",1,6)," ","",.))
		local varname =    subinstr("`varname'",">"," ",1)
		local varpos =     subinstr(substr("`line'",10,4)," ","",.)
		local varlen =     subinstr(substr("`line'",15,2)," ","",.)
		local varmin =     subinstr(substr("`line'",17,4)," ","",.)
		local varmax =     subinstr(substr("`line'",21,4)," ","",.)
		local notapp =     subinstr(substr("`line'",26,4)," ","",.)
		local varspecial = substr("`line'",31,4)
		local varlab =     trim(substr("`line'",36,30))
		local useslab =    lower(subinstr(substr("`line'",67,6)," ","",.))

		if ("`varpos'"!=""){	// there's a position, so this is a variable line
			file write `dct' _tab "_column(`varpos') str`varlen' `varname' %`varlen's  ^^`varlab'^^" _n
			if "`prevlinetype'"=="val" ///
			  & "`prev_outline'"!="uses" ///
			  & "$prev_varname"!="" { // we need to append a -label values- line post-define
				file write `labs' _n "label values $prev_varname $prev_varname"
				local prev_outline "valuespostdefine"
			}

			if "$prev_varname"!="" {
				local outline "label values $prev_varname $prev_varname"
				local prev_outline "values"
			}
			if "`useslab'"=="" {	// this variable must be defining a new label
				local outline "label define `varname' "
				local prev_outline "define"
			}
			if "`useslab'"!=""{	// there's a variable name specified and we'll use its label def
				local outline  "label values `varname' `useslab'"
				local prev_outline "uses"
			}
			file write `labs' _n "`outline'"
		}
		else {
			local valval = trim(substr("`line'",36,4))
			local vallab = trim(substr("`line'",44,23))
			local prevlinetype = "val"
			noi di "valval is `valval' and vallab is `vallab'"
			file write `labs' " `valval' " "^^`vallab'^^ "
			local prevline `line'
		} // end if ("`varpos'"!="")
		local ++index			
	} // end while 1==1

	// post end-of-file
	if "`prevlinetype'"=="val" {
		file write `labs' _n "label values $prev_varname $prev_varname"
		global prev_varname
		local prev_outline "valuesEOF"
	}

	file write `dct' "}" _n
	file close `cb'
	file close `dct'
	file close `labs'
	log close
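	local lastcc "`cc'"
	// reset parser state so nothing leaks into the next survey
	local varname ""
	global prev_varname "" // prev_varname is a global, so it has to be cleared as one
	local prev_outline ""
	local prevlinetype ""
*	macro drop varname prev_varname useslab vallab val valval prevlinetype prev_outline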
} // end foreach surveys

* now clean up labels.do files *
foreach cc of local surveys {
	clear
	local cc = lower("`cc'")
	file open `outlabs' using "`tmppath'/`cc'_labels_step2.do", write replace
	file open `inlabs' using "`tmppath'/`cc'_labels_step1.do", read
	while 1==1 {	
		cap noi file read `inlabs' line
		if r(eof)!=0 {
			continue, break
		}
		tokenize "`line'", parse(" ")
		if strpos("`line'","define")!=0 {
			if "`4'"=="" { // this is a bogus line like "label define v001"
				continue
			}
			else {
				file write `outlabs' "`line'" _n
			}
		}
		else if strpos("`line'","values")!=0 { // this line could be wrong; -else- keeps a define line whose label text contains "values" out of this branch
			if "`5'"=="" { // this one is probably like "cap noi label values v001 v001"
				file write `outlabs' "local type : type `3'" _n
				// only attach value labels to numeric (non-str) variables
				file write `outlabs' `"if strpos("\`type'","str")==0 {"' _n
				file write `outlabs' _tab "`line'" _n
				file write `outlabs' "}" _n
			}
			else { // here's a problem
				tokenize "`line'",parse(" ")
				local secondline "`1' `2' `3' `4'"
				local thisvar `4'
				mac shift 4
				local firstline "label define `thisvar' `*', modify"
				file write `outlabs' "`firstline'" _n
				file write `outlabs' "`secondline'" _n
			} // end if "`5'"==""
		} // end strpos("`line'","values")!=0
	} // end while 1==1
	qui {
		file close `inlabs'
		file close `outlabs'

		filefilter "`tmppath'/`cc'_dct_fixquotes.dct" "`outpath'/`cc'_dct.dct", from("^^") to(\Q) replace
		filefilter "`tmppath'/`cc'_labels_step2.do" "`outpath'/`cc'_labels.do", from("^^") to(\Q) replace

		infile using "`outpath'/`cc'_dct.dct", using("`basepath'/dat/`cc'sr.dat")

		cap destring _all, replace
		compress
		save "`basepath'/dta/`cc'_wfs.dta", replace
	} // end quietly
} // end foreach surveys

/* now apply value labels */
foreach cc of local surveys {
	local cc = lower("`cc'")
	use "`basepath'/dta/`cc'_wfs.dta", clear
	do "`outpath'/`cc'_labels.do"
	save "`basepath'/dta/`cc'_wfs.dta", replace
	clear
} // end foreach surveys
### END ###
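
To see how the fixed-column parsing in the first loop treats a single codebook line, here is a small standalone sketch. The line itself is hypothetical: it is assembled piece by piece so that each field lands in the columns the parser reads, and it is not copied from any real WFS codebook.

```stata
// Hypothetical codebook line (not real WFS content), built so that:
//   cols 1-6 = name, 10-13 = start position, 15-16 = length,
//   17-20 = min, 21-24 = max, 36-65 = variable label
local line = "V004  " + "   " + "  12" + " " + " 2" + "   1" + "  99" ///
    + " " + "    " + " " + "    " + " " + "AGE AT FIRST MARRIAGE"

// Same extraction expressions the script uses
local varname = lower(subinstr(substr("`line'",1,6)," ","",.))
local varpos  = subinstr(substr("`line'",10,4)," ","",.)
local varlen  = subinstr(substr("`line'",15,2)," ","",.)
local varlab  = trim(substr("`line'",36,30))
local useslab = lower(subinstr(substr("`line'",67,6)," ","",.))

// What the script would write to the dictionary for this line
display "dictionary entry: _column(`varpos') str`varlen' `varname' %`varlen's"
display "variable label:   `varlab'"
```

Because `varpos' is non-empty, the script would treat this as a variable line; a line with blanks in columns 10-13 falls through to the value-label branch instead.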

4 thoughts on “Turning the World Fertility Surveys’ raw data into a set of Stata dta files”

  1. Hi

    Thanks so much for posting this, it is very helpful. However, when trying to run the do-file I receive an error message originating in the second part of the file (after “* now clean up labels.do files *”), which says “dictionary invalid”. The previous lines run ok. I was able to create the files in the tmp and logs folders. However, the newdcts folder only contains 2 files for the first country (“bj_dct.dct” and “bj_labels.do”). After that, I receive the mentioned error message and hence the .dta files cannot be generated.

    I’m guessing the error originates in the line infile using "`outpath'/`cc'_dct.dct", using("`basepath'/`cc'sr.dat"), but I’m new to dealing with dictionary files and also to programming in Stata, so not sure…

    Any help would be highly appreciated.

    Thank you so much!
    Naty

    • Hi. Sorry for the late reply.
      It looks like ‘dictionary invalid’ could be applicable to more than one command; that is, the error itself doesn’t contain enough information to infer where to look for a solution.
      Could you start a log, put -set trace on- in your code somewhere, and send me a copy? -trace- yields really verbose output that’s a pain to sift through, but it has the precision needed to pinpoint where problems are coming from.

    • Naty,

      Apologies for the late reply. I’ve got a revised set that addresses this and other bugs, and also handles the exceptionally difficult and painful to process Dominican Republic 1980 survey.

      Coming soon — stay tuned!

      Cheers,
      Colin
