Matt Pearce’s UCI Merge

The multitalented Matt Pearce has created a fantastic new Stata tool called “UCI Merge”, which facilitates the merging of cross-national datasets.

Check it out:

As many know first-hand, assembling and organizing large cross-national datasets is time-consuming, frustrating, and often error-prone.  Matt’s new tool will help automate the process and save countless hours of tedious work.

As an aside:  One of the reasons the Stanford world society/world polity research group has been so productive over many decades is the cumulation of expertise with cross-national datasets (as well as the sharing of data more generally).  Knowledge about data was passed down through many generations of graduate students.  I personally benefitted tremendously from the generosity of Marc Ventresca.  Marc taught me all about how to organize datasets, and helped me figure out the big datasets that had been previously assembled at Stanford.  (I remember asking “What is this variable called newid3?”)  Anyhow, Matt has continued this important tradition of generosity with his knowledge and expertise.

An excerpt from the README file is below.  This is new, so please report bugs or problems to Matt so they can be fixed.  And, consider buying Matt a beer at ASA…  He has earned it.

# UCIMerge – a framework for harmonizing cross-national time series data

## Read Me
UCIMerge is a framework in STATA to standardize the merging of international comparative datasets. This project creates conventions and a library of functions to make it easier and faster to merge time-series datasets, incorporate updates, keep observations consistent across years, conserve N, and encourage reproducible research.

This framework came about from conversations at the [UC Irvine International Comparative Workshop](

Download the [latest release]( Join the [announcement list]( to receive notifications of updates.

## How to use UCIMerge

The first time you run the scripts, it will take an extremely long time to update the datasets from the web. If you would like to jumpstart this, you can use this [starter pack]( by dropping these files into the /source directory. If you want to force the system to refresh a dataset, just delete that dataset file from /source.

1. Set the UCIMerge folder as the working directory for STATA (`cd ~/UCIMerge`)

2. Edit the file with the configuration that you would like.

3. Run `do master`; your new dataset will be opened and saved within the UCIMerge folder.

UCIMerge requires STATA 13. The .csv files which link countries across datasets can be used independently.

## Currently Supported Datasets

* [Norris 2009](
* [Freedom House 2015](
* [Polity IV](
* Polity IV Coups
* [World Development Indicators](
* [KOF Index of Globalization](
* [The Lexical Index of Electoral Democracy (LIED)](
* [CIRI Human Rights Dataset](
* [Quality of Government Standard dataset](
* [Cross National Time Series](
* [Penn World Table version 8.1](

INGO memberships: raw or cooked?

I’ve been hanging out at Stanford, which is great fun.  One question that came up recently is “how to best measure INGO memberships?”  I’ve been dealing with INGO data for a long time and I have opinions…

First, some background:  John Boli and George Thomas were the first to recognize that International Non-governmental Organizations (INGOs) are a core infrastructure of world society.  The discourses and activities of INGOs are a key embodiment of an emergent global culture, and INGOs play an important role in the spread of that culture.  Their book “Constructing World Culture:  International Non-Governmental Organizations Since 1875” makes this point very vividly.

These days, country memberships in International Non-governmental Organizations (INGOs) have been accepted as the standard way to measure national embeddedness in world society.  Countries tied to lots of INGOs are most exposed to global culture, and are fastest to adopt a whole host of policy innovations — new environmental laws, fashionable human rights commitments, and so on.

But, how should one actually operationalize INGO memberships in quantitative analyses?  Suppose citizens of a country are members of 1,500 different INGOs.  Should one use the raw counts?  The natural log of counts?  Memberships per capita?  Or something else?

I usually use the natural log of the INGO membership count.

As a practical matter, raw counts are hugely skewed (except in some cases — for instance analyses focusing on certain particular regions).  Logged INGO memberships are less skewed, and therefore work much better in regression-type models.  Also, one can make a substantive argument:  going from 100 to 200 INGO memberships has a bigger substantive effect than going from 1,100 to 1,200.  The natural log transformation helps take this into account (despite being a somewhat arbitrary correction).
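Both points, skew reduction and diminishing marginal effects, are easy to see with a quick sketch. The membership counts below are invented purely for illustration (real counts come from the data source discussed later):

```python
import math
import statistics

# Hypothetical INGO membership counts for a handful of countries.
# (Made-up numbers; real count distributions are strongly right-skewed.)
counts = [150, 300, 800, 1500, 3200]
logged = [math.log(c) for c in counts]  # natural log of each count

# Skew: in the raw counts the mean sits well above the median;
# after logging, mean and median end up nearly equal.
raw_ratio = statistics.mean(counts) / statistics.median(counts)   # about 1.49
log_ratio = statistics.mean(logged) / statistics.median(logged)   # about 0.98

# Diminishing marginal effects: moving from 100 to 200 memberships shifts
# the logged measure far more than moving from 1,100 to 1,200.
low_jump = math.log(200) - math.log(100)      # about 0.69
high_jump = math.log(1200) - math.log(1100)   # about 0.09
```

In a regression, that second property is exactly the substantive argument above: each additional membership matters less once a country is already densely embedded.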

Sometimes people suggest that INGO memberships be standardized by population.  Shouldn’t you correct for the size of the population?  Big countries can have more memberships… and besides, don’t you need lots of memberships to influence the culture of a large country?

These arguments are plausible, but ultimately I’m not sold.  First, the INGO membership variable from the Yearbook of International Organizations counts organizations that are tied to a country, not individual memberships.  An organization is counted as tied to a country if at least one citizen is a member.  That may not be ideal, but that’s the measure we’re stuck with.  So, if all 1.3 billion citizens of China joined Greenpeace, it would still count as one INGO tie.  Second, most diffusion studies focus on state policy, rather than individual attitudes or activities.  Many INGOs function as advocacy groups of various sorts — and don’t need to be connected to each and every citizen to influence policy diffusion.  Finally, I’ve looked at the actual result of standardizing INGOs by population.  Often it produces a very odd distribution.  Tiny island nations appear to be at the “center” of world society.  (Again, this could vary for different types of INGOs or if you focused on a particular region.)
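The per-capita distortion is also easy to demonstrate with a toy example. The country labels and figures here are entirely invented, just to show the mechanics:

```python
# Invented figures to illustrate the per-capita distortion:
# each entry is (INGO ties, population in millions).  Recall that a "tie"
# counts an organization with at least one member in the country,
# not the number of individual members.
countries = {
    "tiny island nation": (500, 0.1),
    "mid-sized country": (1500, 50.0),
    "very large country": (2500, 1300.0),
}

ties_per_million = {
    name: ties / pop for name, (ties, pop) in countries.items()
}

# Ranked by per-capita ties, the tiny island nation lands on "top",
# even though it has by far the fewest actual INGO connections.
ranking = sorted(ties_per_million, key=ties_per_million.get, reverse=True)
```

Dividing by population reverses the raw ordering: the country with the fewest ties ends up looking the most embedded, which is the odd distribution described above.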

In short, I’d recommend using logged INGO memberships as a default approach.  I can imagine situations where raw INGO counts or INGO membership per capita could be justifiable… but be sure to check the actual distribution before plowing ahead.

Those are my 2 cents.  If people have other views on this, I’m interested to hear them.

Of course, I’m only talking about count-based measures of INGOs, which are easiest to get.  Pam Paxton, Melanie Hughes, Jason Beckfield, and others, have been working on network-based measures of INGO ties.  That opens up a whole other range of options…

The World’s Regions (according to news reports)…

Marc Ventresca passed along a neat Wired article describing the work of Kalev Leetaru, a research fellow at Georgetown.  Leetaru did a cluster analysis of news report data to define regions of the globe.

The regions produce the following map:


The regions are mostly sensible…  you can really see the legacy of European colonialism.

The full article addresses a wide range of issues — such as the “tone” of news reports, which Leetaru suggests is predictive of events.

One nit to pick:  Leetaru describes his research as a new field of “culturomics”… whereas it looks to me like conventional quantitative social science.  I guess the lesson is that “omics” is more rigorous than “ology”.  Well, time to get back to work… doing cutting-edge sociomics!

Mike Landis defends!

Mike Landis wrapped up his PhD this Spring.  Congrats!!!

The dissertation is a quantitative, cross-national analysis of terrorism events over the last few decades.  One of the take-away points is that terrorism is frequently the spillover from an ongoing civil war.  That finding makes a ton of sense, and provides a better way of thinking about “typical” forms of terrorism, compared to popular accounts that focus on things like 9/11.  The dissertation does a nice job of developing and extending some of Ann Hironaka’s arguments in her book Neverending Wars (Harvard Press, 2005).  The dissertation committee was Ann, David Frank, Ed Amenta, Wayne Sandholtz, and myself.

The dissertation uses the Global Terrorism Database, which was put together by Gary LaFree and colleagues at the U of Maryland.  The dataset looks pretty interesting.

Again, congratulations to Dr. Mike Landis!

More Democracy Data

Christine Wotipka sends along a link to a democracy dataset that I hadn’t seen before:  “Democracy and Dictatorship Revisited.”  It covers 1946-2008 for over 200 countries.  The dataset was recommended by her colleague James Vreeland as having significant advantages over the Freedom House and Polity measures.

The dataset is described in the following paper (also available via the link):  Cheibub, José Antonio, Jennifer Gandhi, and James Raymond Vreeland. 2010. “Democracy and Dictatorship Revisited.” Public Choice, vol. 143, no. 1-2, pp. 67-101.

I haven’t had a chance to compare with other datasets.  But, the democracy variable seems to focus on elections (method of executive/legislative selection, updating Banks) and competitive political parties.  Definitely looks useful.

GDPx: New GDP Data Source

Liz Boyle just pointed me to a new source of GDP data, which she has heard good things about:

A detailed description of the datasource is here:

It looks like they smushed together a bunch of pre-existing GDP datasets to produce a long, consistent time series for the period since 1950.

Looks like they have a lot of health data, too.  The main website is:


Disaster Data

Wes Longhofer came across a new database: the Centre for Research on the Epidemiology of Disasters’ International Disaster Database. Site:

The site has cross-national data on both natural and human-caused disasters since 1900.  Apparently, the most costly industrial accident in history was a chemical spill in Spain in 2002.  Didn’t know that…

The dataset will be useful for our papers on environmental associations/policy reform/etc. Our prior work has generally found that environmental degradation variables (e.g., pollution) don’t do a good job of accounting for environmental mobilization or policy reform.  Reviewers have then suggested, on more than one occasion, that people may respond to vivid disasters (Three Mile Island, Exxon Valdez, etc), rather than actual degradation.  So, at one time David Frank pulled together a simple measure of disasters… but now someone has assembled a much more systematic dataset.

Disasters might also be an interesting issue to analyze as a dependent variable.  For instance, one wonders if strong environmental/health/safety laws, strong unions, or other factors reduce industrial accidents…  Maybe INGOs help, too… they do everything.  (kidding…)