'How to find a common variable in a large number of databases using Stata

So I have a large number of databases (82) in Stata, that each contain around 1300 variables and several thousand observations. Some of these databases contain variables that give the mean or standard deviation of certain concepts. For example, a variable in such a dataset could be called "leverage_mean". Now, I want to know which datasets contain variables called concept_mean or concept_sd, without having to go through every dataset by hand.

I was thinking that maybe there is a way to loop through the databases looking for variables containing "mean" or "sd", unfortunately I have idea how to do this. I'm using R and Stata datafiles.



Solution 1:[1]

You can use filelist, (from SSC) to create a dataset of files. To install filelist, type in Stata's Command window:

ssc install filelist

With a list of datasets in memory, you can then loop over each file and use describe to get a list of variables for each file. You can store this list of variables in a single string variable. For example, the following will collect the names of all Stata datasets shipped with Stata and then store for each the variables they contain:

findfile "auto.dta"
local base_dir = subinstr("`r(fn)'", "/a/auto.dta", "", 1)

dis "`base_dir'"
filelist, dir("`base_dir'") pattern("*.dta")

gen variables = ""
local nmatch = _N
qui forvalues i = 1/`nmatch' {
    local f = dirname[`i'] + "/" + filename[`i']
    describe using "`f'", varlist
    replace variables = " `r(varlist)' " in `i'
}
leftalign  // also from SSC, to install: ssc install leftalign

Once you have all this information in the data in memory, you can easily search for specific variables. For example:

. list filename if strpos(variables, " rep78 ")

     +-----------+
     | filename  |
     |-----------|
 13. | auto.dta  |
 14. | auto2.dta |
     +-----------+

Solution 2:[2]

The lookfor_all package (SSC) is there for that purpose:

cd "pathtodirectory"

lookfor_all leverage_mean

Just make sure the file extensions are in lowercase(.dta) and not upper.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Robert Picard
Solution 2 Nile