Custom color scheme for your Stata graphs, and matching colors in PowerPoint

The Problem

Stata’s built-in color schemes are fine for banging out drafts, but for presentations where you want a custom color scheme it’s often easier to produce your own scheme than to either find acceptable colors in Stata’s palette or define color codes every time you produce a figure. I’ve found that specifying a custom scheme with RGB color codes is the easiest way to get the colors I want from Stata, and then match them perfectly in PowerPoint. Read on…

Continue reading Custom color scheme for your Stata graphs, and matching colors in PowerPoint

R Skool vol1

I’m taking the Coursera course Computing for Data Analysis. These notes are just FYI — no warranty expressed or implied.

WK1

R treats numbers as numeric...
Must be explicit to get integers:
1 is numeric, 1L is an integer
Infinity is Inf
NaN is "not a number," undefined

Continue reading R Skool vol1

Use Stata to unzip a bunch of Demographic and Health Survey files and put them where I want them

The 2012 Indonesia Demographic and Health Survey data were released yesterday. There are a bunch of zip files to download, one for each of the survey components, and each of these zip files contains between zero and one files that colleagues and I want to use. Being lazy, I wanted to:

  1. Use Stata to do as much of the work for me as possible, and
  2. do nothing manually.
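
A minimal sketch of the general idea, not the code from the full post. The two folder paths are hypothetical, and it assumes all the downloaded zip files sit in one folder:

```stata
* Hypothetical folders; adjust to your own layout
local zipdir "C:/data/dhs/zips"
local outdir "C:/data/dhs/unzipped"

* Collect the name of every zip file in the download folder
local zips : dir "`zipdir'" files "*.zip"

* unzipfile extracts into the current working directory,
* so move there first and point back at the zip folder
cd "`outdir'"
foreach z of local zips {
    unzipfile "`zipdir'/`z'", replace
}
```

Picking out the one file you actually want from each archive would still need a rule of your own, for example a filename pattern passed to another `: dir` call.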

Continue reading Use Stata to unzip a bunch of Demographic and Health Survey files and put them where I want them

Clean up inconsistent text with Levenshtein distance

Researchers who deal with text data, particularly categorical text (as opposed to free prose), have long recognized the need to clean up data-entry, coding, and other inconsistencies. For example, I’ve seen many files with a field for ‘organization’ where the organization’s name is, alternately, spelled out, abbreviated, misspelled, or otherwise heterogeneous:

  • National Association for the Advancement of Colored People
  • National Association for the Advancement of Colored Persons
  • The National Association for the Advancement of Colored People
  • NAACP
  • N.A.A.C.P.
  • N A A C P

… and so on. In an analysis, we generally want to treat these as the same organization, but that involves lots of cleaning up by hand.
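
For readers who haven't met it, Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. Here is a rough Mata sketch of the textbook dynamic-programming version. The function name `lev` is my own, it is not necessarily what the full post uses, and it assumes plain ASCII text (`strlen()` counts bytes):

```stata
mata:
// Textbook dynamic-programming Levenshtein distance (illustrative only)
real scalar lev(string scalar a, string scalar b)
{
    real scalar i, j, la, lb, cost
    real matrix d

    la = strlen(a)
    lb = strlen(b)

    // d[i+1, j+1] holds the distance between the first i characters
    // of a and the first j characters of b
    d = J(la + 1, lb + 1, 0)
    for (i = 1; i <= la; i++) d[i + 1, 1] = i
    for (j = 1; j <= lb; j++) d[1, j + 1] = j

    for (i = 1; i <= la; i++) {
        for (j = 1; j <= lb; j++) {
            cost = (substr(a, i, 1) != substr(b, j, 1))
            d[i + 1, j + 1] = min((d[i, j + 1] + 1, d[i + 1, j] + 1, d[i, j] + cost))
        }
    }
    return(d[la + 1, lb + 1])
}

// Near-identical spellings give a small distance
lev("Colored People", "Colored Persons")
end
```

A small distance relative to the string length suggests two entries are variants of the same name. Abbreviations like NAACP sit far from the spelled-out form, so distance alone won't catch everything.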

Continue reading Clean up inconsistent text with Levenshtein distance

Turning the World Fertility Surveys’ raw data into a set of Stata dta files

#WARNING — this is very draft.

I’m posting code for the adventurous among you, but I’m not sure how useful it is without more elaboration in the text. That elaboration is forthcoming … someday.

Note: this work was completed while working for Professor John Casterline.

The World Fertility Surveys are archived here, as a set of fixed-width data files and data dictionaries. That is, sans code for making the files readily usable for analysis.

When faced with a problem like this, perhaps any problem where one can conceive of more and less sophisticated solutions, one is also faced with the sad fact that we only get to live in one universe at once, even if we can conceive of many. Should I just slog it out, writing code by hand to pull the variables I need for this particular task, or even go all out and write the whole thing? Or, should I try to make a robot that does it for me? The former is tedious but probably gets the desired result almost all the time; the latter isn’t a sure thing, and one could spend a lot of time and end up empty.

I built a robot this time around, and it’s inelegant but basically gets the job done.

What it Does

Given original WFS data and dictionary files from Princeton’s OPR archive or wherever, the code below:

  1. Produces a set of do-files to convert the fixed-width data files into Stata-format files,
  2. Produces a set of do-files to apply variable and value labels to the Stata files,
  3. Executes the aforementioned do-files.
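
To give a flavor of what "code that writes code" looks like in Stata, here is a toy sketch, not the actual robot. The file names, variable names, and column positions are all made up for illustration; the real ones come from the WFS dictionary files:

```stata
* Hypothetical file names, variable names, and column positions
tempname fh
file open `fh' using "read_example.do", write replace

* Write an -infix- command that reads two fixed-width fields
file write `fh' `"infix str caseid 1-4 int age 5-6 using "example.dat", clear"' _n
file write `fh' `"save "example.dta", replace"' _n
file close `fh'

* Inspect the generated do-file; run it with -do- once example.dat exists
type "read_example.do"
```

The same pattern extends to label do-files: loop over the dictionary entries and write one `label variable` or `label define` line per entry.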

Manual Front-End Stuff

Pick a folder in which to work, then fix the paths in the code I’ve posted so they match your environment. Note that UNIX/Mac paths require forward slashes (/), while Stata for Windows accepts either forward slashes or backslashes (\), unlike Windows itself, which is essentially backslash-only.
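
A small illustration of why forward slashes are the safer habit even on Windows. The folder and macro names here are hypothetical. In Stata, a backslash placed directly before a macro's left quote stops the macro from expanding:

```stata
* Hypothetical macro and folder names
local country "indonesia"

* Forward slashes: the macro expands as expected
display "C:/data/`country'/raw"

* Backslashes: the \ right before the macro blocks its expansion,
* so the text country is printed literally instead of indonesia
display "C:\data\`country'\raw"
```

Forward slashes also mean the same do-file runs unchanged on Mac and UNIX.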

Download data and dictionary files from the WFS archive: as of 9 Sept 2013, it’s here: http://opr.princeton.edu/archive/wfs

You’ll benefit from a tool like DownloadThemAll for Firefox when downloading lots of files.

It’s convenient to save all the files to a single folder, and we’ll let Stata sort them out. They’re zip files, however, so we’ll need to batch-unzip everything:

Continue reading Turning the World Fertility Surveys’ raw data into a set of Stata dta files

Building a unique ID in Stata using -concat-

Someone asked me today how to create a unique ID from a dataset with four variables whose combination is unique. They’re event data, sort of like this:

country  village  year  household
1        1        1956  1
1        2        1956  2
2        1        1956  1
12       53       2001  1234

The temptation is to do this:

egen uniqueid = concat(country village year household)

The problem is that household 1 in year 1960 in village 19 in country 11 will have the same ID as household 1 in year 1960 in village 119 in country 1: 111919601 for both. Put simply, multi-digit variables without leading zeros “squish” together and you risk non-uniqueness (“collision”).

How to get leading zeros?
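
One possible approach, sketched here under my own assumptions and not necessarily the one the full post describes: pad each component to a fixed width with `string()` and a zero-padded display format. The widths (2, 3, 4, 4) are guesses based on the sample data above; use the widest value each variable actually takes.

```stata
clear
input country village year household
11  19 1960 1
 1 119 1960 1
end

* The naive ID collides: both rows get 111919601
egen naive = concat(country village year household)
list naive

* Fixed-width, zero-padded pieces cannot run together
gen padded = string(country, "%02.0f") + string(village, "%03.0f") ///
           + string(year, "%04.0f") + string(household, "%04.0f")
list padded

* isid exits with an error if padded does not uniquely identify rows
isid padded
```

Another option is a separator between the pieces, for instance `egen ... = concat(...), punct("-")`, which avoids collisions without fixing the widths in advance.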

Continue reading Building a unique ID in Stata using -concat-