Stata metadata mining with substring searches

In a previous post I described how one might assemble variable names and labels and value labels (plus some descriptives) into a single file, describing one or more Stata datasets. This can give a more compact — and, frankly, inductive — overview of a dataset’s properties than a codebook, and facilitates comparisons across (ostensibly similar) files. However, it’s mostly helpful as-is if you know what you’re looking for. What if you don’t?

Let’s assume you’ve got a data file resulting from the code I suggested. Now you’ve got variable label contents in a string variable named varlab, and value label contents in a string variable named vallab. The easiest, if crude, way to approach finding what you want is to search for substrings that match what you seek (good thing you don’t have to look for superstrings — that’d be tough). There’s doubtless a lot of ways to approach this problem, but the following is simple enough and scales when you want to search for a lot of stuff in a single run.
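(A quick aside on the mechanism, with purely illustrative strings: Stata’s -strpos(haystack, needle)- returns the one-based position of the first occurrence of needle within haystack, and 0 if it never occurs, so “found” is simply strpos(…)>0.

display strpos("employment status", "status") // 12
display strpos("employment status", "income") // 0

That’s the whole trick underneath the code below.)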

Explanation follows the code.

local search_varlab `"onething||something with a space||something "with" quotes"'
local search_vallab `"onething||something with a space||something "with" quotes"'
use whateveryoucalledyoursummaryfilemaybeit'sjob_vardesc.dta, clear
tokenize `"`search_varlab'"', parse("||")

while `"`1'"'!="" {
	if `"`1'"'!="|" { // -tokenize- hands the parse characters back as tokens of their own; skip them
		local s_var_varname = subinstr(subinstr(`"`1'"',",","_",.)," ","_",.) // Stata abhors a (space) vacuum -- also commas
		local s_var_varname = subinstr(`"`s_var_varname'"',char(34),"",.) // quotes can't appear in a variable name either
		gen byte search_var_`s_var_varname' = . // initialize empty
		replace search_var_`s_var_varname'=1 if strpos(varlab,`"`1'"')>0 // alternately, -replace search_var_`s_var_varname'=strpos(varlab,`"`1'"')-
	}
	macro shift // next!
}
tokenize `"`search_vallab'"', parse("||")
while `"`1'"'!="" { // this is just the same as above, just this time with 'vallab'
	if `"`1'"'!="|" { // skip the parse-character tokens again
		local s_val_varname = subinstr(subinstr(`"`1'"',",","_",.)," ","_",.) // Stata abhors a (space) vacuum -- also commas
		local s_val_varname = subinstr(`"`s_val_varname'"',char(34),"",.)
		gen byte search_val_`s_val_varname' = . // initialize empty
		replace search_val_`s_val_varname'=1 if strpos(vallab,`"`1'"')>0
	}
	macro shift // next!
}

Here’s the gist:

  • The locals contain your list of search terms. Leave one empty when you don’t want to search that field (that is, when you want to search only variable labels, or only value labels).
  • The locals are in compound quotes so that you can use quotes in your search terms.
  • I’ve selected a double pipe, ||, as the delimiter -tokenize- will use to break the local macro into discrete pieces. This allows you to include spaces in your search terms, so that you can search for a phrase rather than a single alphanumeric string (see footnote 1). You can of course select something different, but a double pipe, ||, is unlikely to show up in your search terms (then again, maybe you won’t be inserting commas into your searches, either).
  • The code iterates over your search terms, looking in varlab or vallab as appropriate for the position of each search term in the variable’s values (using Stata’s perfectly nice -strpos()-).
  • It creates a variable for each search term, named search_var_`term' or search_val_`term', as appropriate, where `term' is the search term you specified minus characters forbidden in a Stata variable name. Each is initialized to missing (.).
  • Values of the new variable search_var_`term' (or search_val_`term') are replaced with a ‘1’ for cases whose contents match.
  • That’s it. It’s then up to you to -list- or -keep- or -sort- or whatever you’d like to do to isolate and examine the variables whose labels match.
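For instance, after a run with “income” among your varlab search terms (the term is illustrative, and I’m assuming the summary file stores the variable names themselves in a string variable called varname, per the earlier post):

list varname varlab if search_var_income==1
keep if search_var_income==1 // or cut the file down to just the hits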

What if you want to search for multiple things? You’ve got two choices, one of which requires pretty trivial modification and one which requires no changes at all.

  • The no-modification version might be obvious: if both/all of your desired search terms are already defined in the `search_' macro, you simply use Stata’s own logic for handling this stuff: look for a 1 in both generated variables. Given two search terms “this” and “that”, that’s if search_var_this==1 & search_var_that==1.
  • The slight modification version introduces another delimiter: you could specify a special character or string to denote AND, use -strpos()- to find that special character inside each token, split the token on it, and run some version of
    if strpos(varlab,"`first'") & strpos(varlab,"`second'")
    (where `first' and `second' hold the two pieces)

    See footnote 2 for arbitrary-number-of-elements AND statements.

Footnotes

  1. This is a clunkier approach than the alternative: if you’re willing to substitute underscores for spaces in your search terms, you can avoid -tokenize- altogether and use a simpler -foreach- loop, handling the underscore-to-space substitution within the loop. Hindsight, y’know.
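    A sketch of that simpler version, under the underscores-for-spaces convention (the term list here is illustrative):

    	local search_varlab "onething something_with_a_space"
    	foreach term of local search_varlab {
    		local phrase = subinstr("`term'","_"," ",.) // restore the spaces before matching
    		gen byte search_var_`term' = . // the underscored form is already a legal variable name
    		replace search_var_`term'=1 if strpos(varlab,"`phrase'")>0
    	}

    The trade-off: you can no longer search for terms containing literal underscores, and quoted phrases are out.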
  2. Let’s take this a little further, into possibly-silly territory: Let’s say you want this program to handle not one, not two, but n substrings all connected by AND (that is, you want all terms found). Let’s pretend you’ve selected a double ampersand, &&, as your AND delimiter. You could:
    	tokenize `"`1'"', parse("&&")
    	local criteriastring `"replace search_var_`s_var_varname'=1 if strpos(varlab,`"`1'"')>0"'
    	macro shift
    	while `"`1'"'!="" {
    		if `"`1'"'!="&" { // skip the parse-character tokens -tokenize- hands back
    			local criteriastring `"`criteriastring' & strpos(varlab,`"`1'"')>0"'
    		}
    		macro shift // without this, the loop never terminates
    	}
    	`criteriastring' // then execute the assembled -replace-
    

    The point is that you can build up an obnoxiously long set of conditions as long as they’re all ANDs.
