Ordinal dates in Stata – ith day and week of month

 

clear

// seed some days
set obs 2300
gen daily = mdy(12,31,2019) + _n
format daily %td

// the basics
gen yr = year(daily)
gen mo = month(daily)
label define mo ///
    1 "Jan" 2 "Feb" 3 "Mar" 4 "Apr" 5 "May" 6 "Jun" ///
    7 "Jul" 8 "Aug" 9 "Sep" 10 "Oct" 11 "Nov" 12 "Dec"
label values mo mo

gen day_wk = dow(daily) + 1
label define day_wk ///
   1 "Sun" 2 "Mon" 3 "Tue" 4 "Wed" 5 "Thu" 6 "Fri" 7 "Sat"
label val day_wk day_wk

// ith week of the year
bysort yr : gen wk_yr = sum(dow(daily) == 1) + 1

// ith day of the year
bysort yr : gen day_yr = _n

// ith day of the month
bysort yr mo : gen day_mo = _n

// ith week of the month (days 1-7 are week 1, days 8-14 week 2, ...)
bysort yr mo : gen wk_mo = int((day_mo - 1) / 7) + 1

// making the first of something is easy
gen firstmonday = wk_mo==1 & day_wk==2
gen firstmondaymarch = firstmonday & mo==3

// but what about the last?
gen negdaymo = -day_mo
// sort descending by day_mo within yr-mo; putting negdaymo in parentheses
// keeps the within-group order deterministic
bysort yr mo (negdaymo) : gen lastweek = _n<=7
sort daily
drop negdaymo

list in 1/70

gen lastfriday = day_wk==6 & lastweek
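Once day_mo and day_wk exist, the ith and last occurrences of a weekday can also be computed directly, without the descending sort. A sketch (the new variable names are just illustrations):

```stata
// the nth occurrence of any weekday within its month is ceil(day_mo/7):
// days 1-7 hold the first occurrence of each weekday, days 8-14 the second, etc.
gen nth_dow_mo = ceil(day_mo / 7)
gen secondtuesday = nth_dow_mo==2 & day_wk==3

// a weekday is the month's last of its kind when it falls within
// seven days of the month's last calendar day
gen lastday = dofm(mofd(daily) + 1) - 1   // last day of the month
gen lastfriday2 = day_wk==6 & daily > lastday - 7
```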

 

Make use of Qualtrics’ exported csv data in Stata

Qualtrics dumps its data to Excel in a format that’s not quite ready for Stata, but it’s pretty close. You get variable names in the first row and variable labels in the second. Variable labels longer than XX characters get truncated in the middle in the case of sub-items on ‘matrix’ and other survey questions with multiple components, such that the title of the sub-item is listed in full where possible. We can extract the sub-item label, if it exists, by splitting the string on either of the delimiters Qualtrics (currently!) uses: …- and ?-

import excel filename.xlsx, firstrow clear
foreach v of varlist _all {
   local l`v' = `v'[1]
   local l`v' = subinstr("`l`v''","...-","|",.)
   local l`v' = subinstr("`l`v''","?-","|",.)
   tokenize `"`l`v''"', parse("|") // compound quotes in case a label contains quotes
   if `"`3'"'=="" {
      label variable `v' `"`1'"'
   }
   else {
      label variable `v' `"`3'"'
   }
   note `v' : `l`v''
}
drop in 1
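To see what the loop is doing, here’s the split on a made-up truncated label (the text is hypothetical, not real Qualtrics output):

```stata
local l "Please rate each of the followi...-Customer service"
local l = subinstr("`l'","...-","|",.)
tokenize "`l'", parse("|")
display "`1'"   // the truncated question stem
display "`3'"   // the sub-item title, which becomes the variable label
```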

Produce a fully-labeled Stata dataset from SAS — mostly

I’m working in SAS again, after about eight gleeful years. To be sure, I didn’t miss it, but I’d forgotten all the reasons why.

My data is in SAS, with a format catalog formats.sas7bcat. I want a fully-labeled Stata version of that data. What to do?

The following takes a SAS data file and format catalog that reside in the same directory, exports the format catalog’s contents to a Stata dataset (you’ll need to export the data itself in a separate step), and then turns that dataset into a do-file containing all the -label define- and -label values- statements needed to reproduce the value labels present in the SAS file.

In SAS:

libname mine "c:\temp\sasxport";
libname library "c:\temp\sasxport";

proc format lib=library cntlout=fmtdata;
run;

proc export data=fmtdata
	FILE="lbls_converted.dta"
	DBMS=STATA replace;
run;

Now, escaping back to Stata. The following is a supremely lazy way of approaching the problem, but it gets the job done. For one thing, I use -preserve- and -restore- to drop all but the line in which I’m working.

use lbls_converted.dta, clear // my output from SAS

local labeldo "convert_dict.do" // this is the do-file that'll contain your generated -label- statements

tempname outfile
file open `outfile' using `labeldo', write replace

// if you've got anything "weird" in your data you want to drop, do it before this line...
qui count
local nobs=`r(N)'

bysort fmtname (start) : gen rowindex=_n // order within each format is deterministic

foreach v of varlist _all { // I had a bunch of leading / trailing whitespace -- YMMV
	cap noi replace `v' = trim(`v')
}

// this may be a peculiarity of my situation -- some, but not all, formats end in _F
// -regexr- would be smarter here, as it could trim only the *final* instance
gen vname = subinstr(fmtname,"_F","",1) // replace one instance of _F

forvalues row = 1/`nobs' {
	preserve
	qui keep in `row'
	local vname = vname
	local valname = start
	local vallab = subinstr(label, "''","",.)
	if rowindex==1 {
		*if "`labelline'"!="" file write `outfile' "`labelline'" _n
		if "`varline'"!="" file write `outfile' _n "`varline'" _n
		local varline "label values `vname' `vname'"
		file write `outfile' "label define `vname' "
	}
	* local labelline "`labelline' `valname' `vallab'"
	file write `outfile' `" `valname' "`vallab'""'
	restore
}
// end forvalues

file write `outfile' _n "`varline'" _n // don't forget the -label values- line for the final format
file close `outfile'

use "data_converted.dta", clear
do `labeldo'

How might this fall apart? Easily. For example, your SAS FORMATs might not be named the same as your variables. I’m sure there’s a step that could solve this easily enough, but I’m fortunate not to have had to deal with it.
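If you do run into mismatched names, one way out, sketched here under the assumption that you can export a name-to-format crosswalk from SAS (PROC CONTENTS with OUT= gives you NAME and FORMAT columns) into a hypothetical varfmts.dta, is to attach each value label explicitly after running the generated do-file:

```stata
// hypothetical crosswalk: one row per variable, with its format name
use varfmts.dta, clear
gen lblname = subinstr(format, "_F", "", 1)  // mirror the cleanup used above
local N = _N
forvalues i = 1/`N' {
    local v`i'  = name[`i']
    local lb`i' = lblname[`i']
}
use data_converted.dta, clear
forvalues i = 1/`N' {
    cap noi label values `v`i'' `lb`i''
}
```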

Reading UN World Population Prospects Data into Stata

I’m not a fan of manual work that can be automated. Manually processing Excel files in Stata is a drag, though, particularly when those files are designed, in part, to present data in an attractive way.

This script addresses a couple of irritating issues one would otherwise have to handle manually:

  1. Stata lets you import data starting at the ith line, but it only imports variable names from the 1st
  2. Stata abhors (forbids, even) variable names starting with a numeral or containing a hyphen.

Note that I’ve only tested this with fertility data (TFR, ASFRs, natural increase), but a cursory look around makes me think it would work for any of the WPP files without much adjustment.
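The pattern behind both fixes, as a sketch (the filename and cell range are placeholders, not the real WPP layout; the full script is below the fold):

```stata
// import with names-as-data: row 1 of the import holds the header row
import excel using wpp_file.xlsx, cellrange(A17) clear
foreach v of varlist _all {
    // strtoname() legalizes a string as a Stata name: it prefixes a
    // leading numeral with _ and swaps hyphens (and other illegal
    // characters) for underscores
    local newname = strtoname(strtrim(`v'[1]))
    capture rename `v' `newname'   // -capture- in case of duplicate names
}
drop in 1                 // discard the header row now that names are set
destring _all, replace    // columns imported as text back to numeric
```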

Here goes:

Continue reading

Stata metadata mining with substring searches

In a previous post I described how one might assemble variable names and labels and value labels (plus some descriptives) into a single file, describing one or more Stata datasets. This can give a more compact — and, frankly, inductive — overview of a dataset’s properties than a codebook, and facilitates comparisons across (ostensibly similar) files. However, it’s mostly helpful as-is if you know what you’re looking for. What if you don’t?
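The kind of search this enables, as a minimal sketch (meta.dta and its variable names are hypothetical stand-ins for the metadata file from that post):

```stata
// substring search across metadata assembled from many files
use meta.dta, clear
keep if strpos(lower(varlabel), "contracept") | ///
        strpos(lower(vallabels), "contracept")
list filename varname varlabel, sepby(filename)
```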

Continue reading

Variable and value labeling across 244 Demographic and Health Surveys

(it’s just the individual woman recode files, BTW)

Here it is, a variable-level file with more than 800,000 cases. It’s only about 24MB zipped, but beware: it’s about 1.6GB unzipped.

You can get the code that produced this file here. Note that you have to put up with my filename conventions, not the original DHS filenames, but it’s not hard to understand what’s what — eg coir61.dta is Colombia, a DHS VI, first revision.

There are more than 244 datasets available from DHS, but you can always run my code on the extra files and append, saving you some minutes/hours and a few dimes’ worth of electricity.

Documenting variable and value labeling across a lot of data files

Working with cross-national survey data can be frustrating because of, among other things, incommensurability among surveys’ variable naming and value coding. Multi-national survey programs such as the Demographic and Health Surveys help provide consistent sampling designs, questionnaires and on-the-ground implementation — their work is helpful, indeed.

Inconsistencies are, for a number of reasons, essentially unavoidable. Too often, one discovers them only when an analysis that assumes all is well breaks. Doing better, it turns out, doesn’t require much effort.

Stata’s built-in data documentation commands, -describe- and -codebook- among them, produce nice human-readable reports of variable and value names. They’re no good for making comparisons across lots of data files, however. Wouldn’t it be nice to have a way to easily compare variable names, value labels, and perhaps some rough descriptives across a more or less arbitrary set of files?
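The kernel of the approach fits in a few lines: -describe, replace- turns the current dataset’s metadata into data, which can then be stacked across files. A sketch, assuming the .dta files of interest sit in the working directory:

```stata
local files : dir "." files "*.dta"
tempfile meta
clear
save `meta', emptyok
foreach f of local files {
    use "`f'", clear
    quietly describe, replace clear  // one row per variable: name, label, type, ...
    gen source = "`f'"
    append using `meta'
    save `meta', replace
}
use `meta', clear
```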

Here ya go, in less than a hundred lines:

Continue reading

Variable exploration in many files, cont’d

In “Getting Correspondence from messy surveys” I took a stab at finding variables that are likely what I’m looking for. As the list of data files and my list of search terms grew, I found I needed something a little more extensible, and more general, than my ad hoc scripts.

Below is the result. Here’s what it does. Taking as parameters

  • (a) a path containing your Stata data files of interest (subfolders are OK),
  • (b) a list of strings for which you want matching variables (that is, whose variable and/or value labels match), and
  • (c) [OPTIONAL] a list of strings whose matches you want to exclude,

the program:

(1) Uses the excellent -filelist- program written by Robert Picard to traverse the directory set in (a), and all its subfolders, creating a Stata data file containing filenames and full paths, one file per row,

(2) Opens each file identified in (1) and creates an output file with one row per data file and one new variable per search string, each containing the names of matching variables returned by -lookfor-,

(3) Using the file created in (2), examines each of the files found in (1) and writes a new file with one record per matching variable per file (that is, if two variables match in file A and three variables match in file B, the result of this step is a file with five rows),

(4) Removes records from the file created in (3) whose variable label matches any of the strings specified in (c), writing a new file containing the reduced set.
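Steps (1) and (2) boil down to something like this sketch (the path and search string are placeholders; -filelist- is on SSC):

```stata
// (1) build the roster of .dta files under the target directory
filelist, dir("c:/data/surveys") pattern("*.dta")
save roster.dta, replace
local N = _N

// (2) open each file and record what -lookfor- matches
forvalues i = 1/`N' {
    use roster.dta in `i', clear
    local f = dirname + "/" + filename
    use "`f'", clear
    quietly lookfor "water"       // matches variable names and labels
    display "`f' : `r(varlist)'"
}
```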

Clearly these operations could be completed in fewer steps. I of course welcome you to make refinements and would be grateful if you shared them (with me, especially). I also use Stata’s -file- commands because I love plain text files. All of this could be done with a set of local macros plus -generate- and -append-.

Continue reading

Getting correspondence from messy surveys: finding equivalent variable names

UNICEF’s Multiple Indicator Cluster Surveys are a great resource. In addition to survey-level indicators available via http://www.micscompiler.org/, UNICEF provides the individual-level survey data.

The data are messy, however: the same variables have different names across surveys and labelling is inconsistent and multilingual (both within and across surveys).

As downloaded, there are lots of files with pretty inconsistent variable names (we’ll not deal with value labelling here, but that’s also an issue). We need to process each of those files, learning what we can, and come up with a way to compare them — then produce some code to make the surveys conform.

Continue reading

Stata: Using S_ADO to reference files from your program w/o an argument

I’ve been working with a user-written Stata module whose code isn’t separate from its data; that is, the program relies on a big macro of operational details it needs in order to run. The program iterates a selected do-file over a set of data files chosen on the basis of a battery of characteristics, and the roster of data files and their characteristics is a big local macro inside the program.

That’s not a bad way to go, but I want that data to be accessible outside the program so that I can use it for other things, have someone who doesn’t understand code maintain it, etc. Separating code from data also satisfies a programming principle that was drilled into me from an early age (but, see this exchange on SE).

Because the program needs to be portable, I can’t hard-code the data file’s location into the program. Because the program needs to be backward-compatible with the user code that calls it, I can’t require that the program take a new argument indicating where the data live.

What to do?
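The gist of where this ends up, as a sketch with a hypothetical filename: ship the roster as its own .dta alongside the ado-file, then locate it at run time with -findfile-, which searches the ado-path recorded in S_ADO:

```stata
program define myprog
    version 12
    // find the companion data file wherever the package was installed
    findfile myprog_roster.dta   // searches the ado-path; stores r(fn)
    preserve
    use "`r(fn)'", clear
    // ... work with the roster ...
    restore
end
```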

Continue reading