Ordinal dates in Stata – ith day and week of month

 

clear

// seed some days
set obs 2300
gen daily = mdy(12,31,2019) + _n
format daily %td

// the basics
gen yr = year(daily)
gen mo = month(daily)
label define mo ///
    1 "Jan" 2 "Feb" 3 "Mar" 4 "Apr" 5 "May" 6 "Jun" ///
    7 "Jul" 8 "Aug" 9 "Sep" 10 "Oct" 11 "Nov" 12 "Dec"
label values mo mo

// dow() returns 0 = Sunday ... 6 = Saturday, so day_wk runs 1 = Sun ... 7 = Sat
gen day_wk = dow(daily) + 1
label define day_wk ///
   1 "Sun" 2 "Mon" 3 "Tue" 4 "Wed" 5 "Thu" 6 "Fri" 7 "Sat"
label val day_wk day_wk

// ith week of the year (weeks start on Monday; days before the first Monday are week 1)
bysort yr (daily) : gen wk_yr = sum(dow(daily) == 1 & _n > 1) + 1

// ith day of the year
bysort yr (daily) : gen day_yr = _n

// ith day of the month
bysort yr mo (daily) : gen day_mo = _n

// ith week of the month (days 1-7 are week 1, days 8-14 are week 2, ...)
gen wk_mo = int((day_mo - 1) / 7) + 1

// making the first of something is easy (day_wk: 1 = Sun, 2 = Mon)
gen firstmonday = wk_mo==1 & day_wk==2
gen firstmondaymarch = firstmonday & mo==3

// but what about the last?
// (in a month only partly covered by the data, this flags the last 7 days present)
gen negdaymo = -day_mo
bysort yr mo (negdaymo) : gen lastweek = _n<=7
sort yr mo day_mo
drop negdaymo

list in 1/70

// day_wk: 6 = Fri
gen lastfriday = day_wk==6 & lastweek
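
A shorter route to the last Friday, offered as a sketch rather than a replacement for the code above: Stata's date functions can count the days remaining in a calendar month without any sorting. The names month_end, days_left and lastfriday2 are mine, for illustration only. Because this works from the calendar rather than from the rows present, it should also behave in a month that the data only partly cover.

```stata
clear
set obs 2300
gen daily = mdy(12,31,2019) + _n
format daily %td

// last day of the month: first day of the next month, minus one
gen month_end = dofm(mofd(daily) + 1) - 1
format month_end %td

// days remaining in the calendar month (0 on the last day)
gen days_left = month_end - daily

// dow() returns 0 = Sunday through 6 = Saturday, so Friday is 5
gen lastfriday2 = dow(daily) == 5 & days_left < 7

list daily if lastfriday2 in 1/70
```

The same idea extends to any weekday: swap the 5 for the dow() value you need.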

 

Make use of Qualtrics’ exported csv data in Stata

Qualtrics dumps its data to Excel in a format that’s not quite ready for Stata, but it’s pretty close. You get variable names in the first row and variable labels in the second row. For sub-items on ‘matrix’ and other survey questions with multiple components, variable labels longer than XX characters get truncated in the middle, so that the title of the sub-item is listed in full if possible. We can extract the sub-item label, if it exists, by splitting the string on either of the delimiters Qualtrics (currently!) uses: ...- and ?-

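// row 1 of the sheet becomes the variable names; the label row lands in observation 1
import excel filename.xlsx, firstrow clear
foreach v of varlist _all {
   // pull the raw label text from observation 1
   local l`v' = `v'[1]
   // turn either Qualtrics delimiter into a single parse character
   local l`v' = subinstr("`l`v''","...-","|",.)
   local l`v' = subinstr("`l`v''","?-","|",.)
   // token 1 is the text before the delimiter, token 2 the "|", token 3 the sub-item
   tokenize "`l`v''", parse("|")
   if "`3'"=="" {
      label variable `v' "`1'"
   }
   else {
      label variable `v' "`3'"
   }
   // keep the full label as a note, with "|" marking where the delimiter was
   note `v' : "`l`v''"
}
// the label row is not data
drop in 1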
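
To see what the loop does to a single label, here is the same split applied to one string. The label text is made up for illustration; real Qualtrics labels will differ.

```stata
// hypothetical truncated label, as it might appear in row 2 of the export
local raw "How satisfied are you with the fol...-Customer service"

// same steps as the loop above, applied to one string
local piped = subinstr("`raw'", "...-", "|", .)
tokenize "`piped'", parse("|")

display "stem:     `1'"
display "sub-item: `3'"
```

Because parse("|") replaces the default of splitting on spaces, each token keeps its internal spaces, and the second token is the "|" itself.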

Stata metadata mining with substring searches

In a previous post I described how one might assemble variable names and labels and value labels (plus some descriptives) into a single file, describing one or more Stata datasets. This can give a more compact — and, frankly, inductive — overview of a dataset’s properties than a codebook, and facilitates comparisons across (ostensibly similar) files. However, it’s mostly helpful as-is if you know what you’re looking for. What if you don’t?
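
As a rough illustration of the idea, and not necessarily the approach taken in the full post, a substring search over variable labels might look like this. It uses the auto dataset that ships with Stata as a stand-in; the search term is arbitrary.

```stata
sysuse auto, clear

// built-in: searches variable names and variable labels
lookfor mile

// hand-rolled: case-insensitive search of variable labels only
local needle "mile"
foreach v of varlist _all {
    local lbl : variable label `v'
    if strpos(lower(`"`lbl'"'), "`needle'") > 0 {
        display "`v': `lbl'"
    }
}
```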


Getting correspondence from messy surveys: finding equivalent variable names

UNICEF’s Multiple Indicator Cluster Surveys are a great resource. In addition to survey-level indicators available via http://www.micscompiler.org/, UNICEF provides the individual-level survey data.

The data are messy, however: the same variables have different names across surveys and labelling is inconsistent and multilingual (both within and across surveys).

As downloaded, there are lots of files with pretty inconsistent variable names (we’ll not deal with value labelling here, but that’s also an issue). We need to process each of those files, learning what we can, and come up with a way to compare them — then produce some code to make the surveys conform.
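
One way to start, sketched under assumptions: stack a description of every file into a single dataset, so that names and labels can be compared side by side. The two datasets below (auto and auto2, both shipped with Stata) are stand-ins for the MICS files, and this is not necessarily how the full post does it.

```stata
tempfile meta
local first 1
foreach f in auto auto2 {
    sysuse `f', clear
    // swap the data in memory for one row per variable (name, type, varlab, ...)
    describe, replace clear
    gen source = "`f'"
    if !`first' {
        append using `meta'
    }
    save `meta', replace
    local first 0
}

// same variable name across files ends up on adjacent rows
sort name source
list name source varlab, sepby(name)
```

With real survey files you would replace the sysuse line with a use command looping over your own file list.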


Stata: Using S_ADO to reference files from your program w/o an argument

I’ve been working with a user-written module in Stata where the program’s code isn’t separate from its data; that is, the program relies on a big macro containing operational details it needs to perform. The program iterates a selected do-file over a set of data files that can be selected on the basis of a battery of characteristics, and the roster of data files and their characteristics is a big local macro inside the program.

That’s not a bad way to go, but I want that data to be accessible outside the program so that I can use it for other things, have someone who doesn’t understand code maintain it, etc. Separating code from data also satisfies a programming principle that was drilled into me from an early age (but, see this exchange on SE).

Because the program needs to be portable, I can’t hard-code the data file’s location into the program. Because the program needs to be backward-compatible with the user code that calls it, I can’t require that the program take a new argument indicating where the data live.

What to do?
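
One possible answer, sketched here without claiming it matches the full post: keep the data file somewhere on the ado-path and let Stata find it. The global S_ADO holds that search path, and findfile looks along it. Below, auto.dta and the program name roster_demo are stand-ins for the real data file and program.

```stata
// S_ADO lists the places Stata searches for ado-files
display `"$S_ADO"'

capture program drop roster_demo
program define roster_demo
    // findfile searches the ado-path and returns the full path in r(fn);
    // auto.dta stands in for a data file shipped alongside the program
    quietly findfile auto.dta
    use "`r(fn)'", clear
end

roster_demo
describe, short
```

The caller never passes a location, so existing user code keeps working.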


Use Stata to unzip a bunch of Demographic and Health Survey files and put them where I want them

The 2012 Indonesia Demographic and Health Survey data were released yesterday. There are a bunch of zip files to download, one for each of the survey components, and each of these zip files contains between zero and one files that colleagues and I want to use. Being lazy, I wanted to:

  1. Use Stata to do as much of the work as possible, and
  2. do nothing manually.
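
A minimal sketch of how that could go, assuming the zip files sit in a folder called downloads and the .dta files should end up in a folder called dta. Both folder names are hypothetical, and the full post may do this differently.

```stata
local src  "downloads"   // hypothetical: where the zip files were saved
local dest "dta"         // hypothetical: where the .dta files should go
local home "`c(pwd)'"

capture mkdir "`dest'"
cd "`src'"

// unzipfile extracts into the current working directory
local zips : dir "." files "*.zip"
foreach z of local zips {
    unzipfile "`z'", replace
}

// move whatever Stata files came out
local dtas : dir "." files "*.dta"
foreach d of local dtas {
    copy "`d'" "`home'/`dest'/`d'", replace
    erase "`d'"
}

cd "`home'"
```

File name case can matter on some systems, so the "*.dta" pattern may need adjusting if the archives hold upper-case names.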


Turning the World Fertility Surveys’ raw data into a set of Stata dta files

#WARNING — this is very draft.

I’m posting code for the adventurous among you, but I’m not sure how useful it is without more elaboration in the text. That elaboration is forthcoming … someday.

Note: this work was completed while working for Professor John Casterline.

The World Fertility Surveys are archived here, as a set of fixed-width data files and data dictionaries. That is, sans code for making the files readily usable for analysis.

When faced with a problem like this, perhaps any problem where one can conceive of more and less sophisticated solutions, one is also faced with the sad fact that we only get to live in one universe at once, even if we can conceive of many. Should I just slog it out, writing code by hand to pull the variables I need for this particular task, or even go all out and write the whole thing? Or, should I try to make a robot that does it for me? The former is tedious but probably gets the desired result almost all the time; the latter isn’t a sure thing, and one could spend a lot of time and end up empty.

I built a robot this time around, and it’s inelegant but basically gets the job done.

What it Does

Given original WFS data and dictionary files from Princeton’s OPR archive or wherever, the code below:

  1. Produces a set of do-files to convert the fixed-width data files into Stata-format files,
  2. Produces a set of do-files to apply variable and value labels to the Stata files,
  3. Executes the aforementioned do-files.
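
To give a flavour of the first step, here is a toy version of code that writes a do-file. It assumes the column layout has already been pulled out of a data dictionary; the variable names, column positions and file names are all invented, and parsing the real WFS dictionaries is not shown.

```stata
// hypothetical layout: variable name, first column, last column
local layout "caseid 1 5 age 6 7 parity 8 9"

tempname fh
file open `fh' using "read_example.do", write text replace

// build one infix command, one variable specification at a time
file write `fh' "infix"
tokenize `layout'
while "`1'" != "" {
    file write `fh' " `1' `2'-`3'"
    macro shift 3
}
file write `fh' `" using "example.dat", clear"' _n
file close `fh'

// show the generated do-file
type read_example.do
```

The generated file holds a single infix command that reads each variable from its column range.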

Manual Front-End Stuff

Pick a folder in which to work, then fix the paths in the code I’ve posted so they match your environment. Note that UNIX/Mac paths require forward slashes (/), while Stata for Windows accepts either forward slashes or backslashes (\), unlike Windows itself, which is essentially backslash-only.

Download data and dictionary files from the WFS archive. As of 9 Sept 2013, it’s here: http://opr.princeton.edu/archive/wfs

You’ll benefit from a tool like DownloadThemAll for Firefox when downloading lots of files.

It’s convenient to save all the files to a single folder, and we’ll let Stata sort them out. They’re zip files, however, so we’ll need to batch-unzip everything.


Building a unique ID in Stata using -concat-

Someone asked me today how to create a unique ID from a dataset with four variables whose combination is unique. They’re event data, sort of like this:

country  village  year  household
1        1        1956  1
1        2        1956  2
2        1        1956  1
12       53       2001  1234

The temptation is to do this:

egen uniqueid = concat(country village year household)

The problem is that household 1 in year 1960 in village 19 in country 11 will have the same id as household 1 in year 1960 in village 119 in country 1: 111919601 for both. Put simply, multi-digit variables without leading zeros “squish” together and you risk non-uniqueness (“collision”).

How to get leading zeros?
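
One possible answer, which may or may not be the one in the full post: convert each piece with string() and a zero-padded format, so every component has a fixed width. The widths below are assumptions that happen to fit this toy data; pick widths that fit the largest value of each variable in yours.

```stata
clear
input country village year household
11  19 1960 1
 1 119 1960 1
end

// naive concatenation: both rows become "111919601"
egen naive = concat(country village year household)

// fixed-width pieces: pad each component with leading zeros
gen padded = string(country, "%02.0f") + string(village, "%03.0f") ///
    + string(year, "%04.0f") + string(household, "%04.0f")

list naive padded

// errors out if padded does not uniquely identify the rows
isid padded
```

The concat function of egen also takes format() and punct() options, which offer other routes to the same result.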
