Matt Pearce’s UCI Merge

The multitalented Matt Pearce has created a fantastic new Stata tool called “UCI Merge”, which facilitates the merging of cross-national datasets.

Check it out:  https://github.com/mpearce/UCIMerge/releases/latest

As many know first-hand, assembling and organizing large cross-national datasets is time-consuming, frustrating, and often error-prone.  Matt’s new tool will help automate the process and save countless hours of tedious work.

As an aside:  One of the reasons the Stanford world society/world polity research group has been so productive over many decades is the cumulation of expertise with cross-national datasets (as well as the sharing of data more generally).  Knowledge about data was passed down through many generations of graduate students.  I personally benefitted tremendously from the generosity of Marc Ventresca.  Marc taught me all about how to organize datasets, and helped me figure out the big datasets that had been previously assembled at Stanford.  (I remember asking “What is this variable called newid3?”)  Anyhow, Matt has continued this important tradition of generosity with his knowledge and expertise.

An excerpt from the README file is below.  This is new, so please report bugs or problems to Matt so they can be fixed.  And, consider buying Matt a beer at ASA…  He has earned it.

# UCIMerge – a framework for harmonizing cross national time series data

## Read Me
UCIMerge is a framework in STATA to standardize the merging of international comparative datasets. This project creates conventions and a library of functions so that it becomes easier and faster to merge time series datasets, incorporate updates, make sure observations are consistent across years, conserve N and encourage reproducible research.

This framework came about from conversations at the [UC Irvine International Comparative Workshop](http://sites.uci.edu/icsw/).

Download the [latest release](https://github.com/mpearce/UCIMerge/releases/latest). Join the [announcement list](http://eepurl.com/btU40r) to receive notifications of updates.

## How to use UCIMerge

The first time you run the scripts, it will take an extremely long time to update the datasets from the web. If you would like to jumpstart this, you can use this [starter pack](http://mattpearce.name/files/UCIMergeStarterPack.zip) by dropping these files into the /source directory. If you want to force the system to refresh a dataset, just delete that dataset file from /source.

1. Set the UCIMerge folder as the working directory for STATA (‘cd ~/UCIMerge’)

2. Edit the Master.do file with the configuration that you would like.

3. Run ‘do master’ -> your new dataset will be opened and saved within the UCIMerge folder.

UCIMerge requires STATA 13. The .csv files which link countries across datasets can be used independently.
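
As a rough illustration of using a country-linking table on its own, outside STATA, here is a minimal Python sketch. The column names (`wdi_code`, `polity_code`), the sample rows, and the placeholder values are assumptions for illustration, not the layout of the actual UCIMerge .csv files; check the real headers before adapting it.

```python
import csv
import io

# Hypothetical crosswalk: one row per country, one column per dataset's code.
# The real UCIMerge .csv files may use different column names.
CROSSWALK_CSV = """country,wdi_code,polity_code
France,FRA,FRN
Germany,DEU,GMY
"""

def load_crosswalk(text, from_col, to_col):
    """Return a dict mapping codes in from_col to codes in to_col."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[from_col]: row[to_col]
            for row in reader if row[from_col] and row[to_col]}

def merge_on_crosswalk(left, right, mapping):
    """Merge two country-year dicts keyed by (code, year).

    Codes in `left` are translated through `mapping` to look up `right`.
    Left rows without a match are kept with their own values only.
    """
    merged = {}
    for (code, year), values in left.items():
        combined = dict(values)
        other_code = mapping.get(code)
        combined.update(right.get((other_code, year), {}))
        merged[(code, year)] = combined
    return merged

# Placeholder values, not real data.
wdi = {("FRA", 2000): {"gdp_pc": 1.0}, ("DEU", 2000): {"gdp_pc": 2.0}}
polity = {("FRN", 2000): {"polity2": 9}}

mapping = load_crosswalk(CROSSWALK_CSV, "wdi_code", "polity_code")
merged = merge_on_crosswalk(wdi, polity, mapping)
```

Keeping unmatched rows, rather than dropping them, is one way to conserve N; whether that is the right choice depends on the analysis.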

## Currently Supported Datasets

* [Norris 2009](https://sites.google.com/site/pippanorris3/research/data#TOC-Democracy-Time-series-Data-Release-3.0-January-2009)
* [Freedom House 2015](https://freedomhouse.org/report/freedom-world/freedom-world-2015)
* [Polity IV](http://www.systemicpeace.org/polityproject.html)
* Polity IV Coups
* [World Development Indicators](http://data.worldbank.org)
* [KOF Index of Globalization](http://globalization.kof.ethz.ch)
* [The Lexical Index of Electoral Democracy (LIED)](http://ps.au.dk/forskning/forskningsprojekter/dedere/datasets/)
* [CIRI Human Rights Dataset](http://www.humanrightsdata.com)
* [Quality of Government Standard dataset](http://qog.pol.gu.se/data/datadownloads/qogstandarddata)
* [Cross National Time Series](http://www.databanksinternational.com)
* [Penn World Table version 8.1](http://www.rug.nl/research/ggdc/data/pwt/pwt-8.1)

INGO memberships: raw or cooked?

I’ve been hanging out at Stanford, which is great fun.  One question that came up recently is “how to best measure INGO memberships?”  I’ve been dealing with INGO data for a long time and I have opinions…

First, some background:  John Boli and George Thomas were the first to recognize that International Non-governmental Organizations (INGOs) are a core infrastructure of world society.  The discourses and activities of INGOs are a key embodiment of an emergent global culture, and INGOs play an important role in the spread of that culture.  Their book “Constructing World Culture:  International Non-Governmental Organizations Since 1875” makes this point very vividly.

These days, country memberships in International Non-governmental Organizations (INGOs) have been accepted as the standard way to measure national embeddedness in world society.  Countries tied to lots of INGOs are most exposed to global culture, and are fastest to adopt a whole host of policy innovations — new environmental laws, fashionable human rights commitments, and so on.

But, how should one actually operationalize INGO memberships in quantitative analyses?  Suppose citizens of a country are members of 1,500 different INGOs.  Should one use the raw counts?  The natural log of counts?  Memberships per capita?  Or something else?

I usually use the natural log of the INGO membership count.

As a practical matter, raw counts are hugely skewed (except in some cases — for instance analyses focusing on certain particular regions).  Logged INGO memberships are less skewed, and therefore work much better in regression-type models.  Also, one can make a substantive argument:  going from 100 to 200 INGO memberships has a bigger substantive effect than going from 1,100 to 1,200.  The natural log transformation helps take this into account (despite being a somewhat arbitrary correction).
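
To make the arithmetic concrete, here is a quick Python check of that example. It only computes the change in the natural log for the two cases mentioned above; nothing here is specific to any dataset.

```python
import math

def log_gain(before, after):
    """Change in ln(memberships) when the count moves from before to after."""
    return math.log(after) - math.log(before)

low = log_gain(100, 200)      # a doubling: ln(2), about 0.69
high = log_gain(1100, 1200)   # ln(12/11), about 0.09

print(round(low, 2), round(high, 2))
```

On the log scale, the same 100 additional memberships count for roughly eight times as much at the low end as at the high end.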

Sometimes people suggest that INGO memberships be standardized by population.  Shouldn’t you correct for the size of the population?  Big countries can have more memberships… and besides, don’t you need lots of memberships to influence the culture of a large country?

These arguments are plausible, but ultimately I’m not sold.  First, the INGO membership variable from the Yearbook of International Organizations counts organizations that are tied to a country, not individual memberships.  An organization is counted as tied to a country if one or more citizens are members.  That may not be ideal, but that’s the measure we’re stuck with.  So, if all 1.3 billion citizens of China joined Greenpeace, it would still count as one INGO tie.  Second, most diffusion studies focus on state policy, rather than individual attitudes or activities.  Many INGOs function as advocacy groups of various sorts, and don’t need to be connected to each and every citizen to influence policy diffusion.  Finally, I’ve looked at the actual result of standardizing INGOs by population.  Often it produces a very odd distribution.  Tiny island nations appear to be at the “center” of world society.  (Again, this could vary for different types of INGOs or if you focused on a particular region.)
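
Here is a small Python sketch of how the per-capita version can flip the ordering.  The two countries and their numbers are made up purely for illustration; they are not Yearbook figures.

```python
import math

# Hypothetical figures for illustration only (not real Yearbook data).
COUNTRIES = {
    "small island state": {"ingo_ties": 300, "population": 100_000},
    "large country": {"ingo_ties": 3_000, "population": 100_000_000},
}

def rank_by(measure):
    """Return country names sorted from highest to lowest on the measure."""
    return sorted(COUNTRIES, key=lambda name: measure(COUNTRIES[name]),
                  reverse=True)

# Logged count of INGO ties: the large country comes first.
by_log = rank_by(lambda c: math.log(c["ingo_ties"]))

# Ties per person: the small island state comes first by a wide margin.
by_per_capita = rank_by(lambda c: c["ingo_ties"] / c["population"])

print(by_log)
print(by_per_capita)
```

With these made-up numbers the island state has one tenth as many ties but a hundred times as many per person, which is the kind of pattern that pushes tiny countries to the top of the per-capita distribution.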

In short, I’d recommend using logged INGO memberships as a default approach.  I can imagine situations where raw INGO counts or INGO membership per capita could be justifiable… but be sure to check the actual distribution before plowing ahead.

Those are my 2 cents.  If people have other views on this, I’m interested to hear them.

Of course, I’m only talking about count-based measures of INGOs, which are easiest to get.  Pam Paxton, Melanie Hughes, Jason Beckfield, and others, have been working on network-based measures of INGO ties.  That opens up a whole other range of options…

The World’s Regions (according to news reports)…

Marc Ventresca passed along a neat Wired article, describing the work of Kalev Leetaru, a research fellow at Georgetown.  Leetaru did a cluster analysis of news report data to define regions of the globe.

http://www.wired.com/wiredscience/2013/09/world-civilizations-from-network-analysis/

The regions produce the following map:

[Image: map of the regions produced by the cluster analysis]

The regions are mostly sensible…  you can really see the legacy of European colonialism.

The full article addresses a wide range of issues — such as the “tone” of news reports, which Leetaru suggests is predictive of events.

http://firstmonday.org/ojs/index.php/fm/article/view/3663/3040

One nit to pick:  Leetaru describes his research as a new field of “culturomics”… whereas it looks to me like conventional quantitative social science.  I guess the lesson is that “omics” is more rigorous than “ology”.  Well, time to get back to work… doing cutting-edge sociomics!

Mike Landis defends!

Mike Landis wrapped up his PhD this Spring.  Congrats!!!

The dissertation is a quantitative, cross-national analysis of terrorism events over the last few decades.  One of the take-away points is that terrorism is frequently the spillover from an ongoing civil war.  That finding makes a ton of sense, and provides a better way of thinking about “typical” forms of terrorism, compared to popular accounts that focus on things like 9/11.  The dissertation does a nice job of developing and extending some of Ann Hironaka’s arguments in her book Neverending Wars (Harvard University Press, 2005).  The dissertation committee was Ann, David Frank, Ed Amenta, Wayne Sandholtz, and myself.

The dissertation uses the Global Terrorism Database, which was put together by Gary LaFree and colleagues at the University of Maryland.  The dataset looks pretty interesting.

Again, congratulations to Dr. Mike Landis!

More Democracy Data

Christine Wotipka sends along a link to a democracy dataset that I hadn’t seen before:  “Democracy and Dictatorship Revisited.”  It covers 1946-2008 for over 200 countries.  The dataset was recommended by her colleague James Vreeland as having significant advantages over the Freedom House and Polity measures.

https://sites.google.com/site/joseantoniocheibub/datasets/democracy-and-dictatorship-revisited

The dataset is described in the following paper (also available via the link):  Cheibub, José Antonio, Jennifer Gandhi, and James Raymond Vreeland. 2010. “Democracy and Dictatorship Revisited.” Public Choice, vol. 143, no. 1-2, pp. 67-101.

I haven’t had a chance to compare with other datasets.  But, the democracy variable seems to focus on elections (method of executive/legislative selection, updating Banks) and competitive political parties.  Definitely looks useful.

GDPx: New GDP Data Source

Liz Boyle just pointed me to a new source of GDP data, which she has heard good things about:

http://www.healthmetricsandevaluation.org/ghdx/record/gross-domestic-product-gdp-estimates-country-1950-2015

A detailed description of the datasource is here:

http://www.pophealthmetrics.com/content/10/1/12/abstract

It looks like they smushed together a bunch of pre-existing GDP datasets to produce a long, consistent time series for the period since 1950.

Looks like they have a lot of health data, too.  The main website is:

http://www.healthmetricsandevaluation.org

 

Disaster Data

Wes Longhofer came across a new database: The Centre for Research on the Epidemiology of Disasters’ International Disaster Database. Site: http://www.emdat.be/

The site has cross-national data on both natural and human-caused disasters since 1900.  Apparently, the most costly industrial accident in history was a chemical spill in Spain in 2002.  Didn’t know that…

The dataset will be useful for our papers on environmental associations/policy reform/etc. Our prior work has generally found that environmental degradation variables (e.g., pollution) don’t do a good job of accounting for environmental mobilization or policy reform.  Reviewers have then suggested, on more than one occasion, that people may respond to vivid disasters (Three Mile Island, Exxon Valdez, etc), rather than actual degradation.  So, at one time David Frank pulled together a simple measure of disasters… but now someone has assembled a much more systematic dataset.

Disasters might also be an interesting issue to analyze as a dependent variable.  For instance, one wonders if strong environmental/health/safety laws, strong unions, or other factors reduce industrial accidents…  Maybe INGOs help, too… they do everything.  (kidding…)

The MacroDataGuide

Katerina Vrablikova, a visiting fellow at the UCI Center for the Study of Democracy, pointed me to a data website that I hadn’t seen before.  It is called The MacroDataGuide:

http://www.nsd.uib.no/macrodataguide/index.html

It is maintained by the Norwegian Social Science Data Services (apparently a branch of the Norwegian Ministry of Education and Research) to organize “contextual” variables for use with the European Social Survey.

It is a nice, clean, website with descriptive information on lots of country-level datasets.  It has all the big ones, and a few that are new to me.

The site provides a wealth of summary information:  topics covered by the dataset, the number of countries and time period covered, relevant references, and mundane-but-useful information such as the file format(s) available, cost, and links to documentation and (usually) the actual dataset.  There is also commentary on data quality, which is rare to see.

Definitely worth a look.

More Globalization Indices

Wade Cole sent me a link to a useful data source, the KOF index of globalization:  http://globalization.kof.ethz.ch/

They create 3 measures of globalization, reflecting economic (trade, FDI, barriers), social (communication), and political (embassies, international orgs, treaties) dimensions.

They rely on standard measures that are pretty familiar.  A full list of the measures (and weights used to create the indices) can be found here:  http://globalization.kof.ethz.ch/static/pdf/variables_2010.pdf

Coverage looks pretty solid — from 1970 to the present.  Overall, worth checking out.

WDI Reshape do file

I got a request for my Stata code for reshaping the WDI 2010.  There is also code to add variable labels.  It is an alternative to using “wdireshape”.  You can download it here:  WDI 2010 reshape stata code.doc

Notes:

1.  The Stata code is in a Word “doc” file, rather than a Stata “do” file.  That is because WordPress limits the filetypes you can upload.  Rather than putting it on my other website, I just pasted it into a Word document.  You can just paste it back into a Stata do file…

2.  You should download the “csv” version of the WDI.

3.  You need a computer with quite a bit of memory to run the reshape.  If you are short on memory, you can manually select a subset of the file and reshape it in smaller chunks that fit into your computer’s memory.  You can then “merge” the pieces together.

4.  It takes a long time to run if you do the whole file at once (hours).
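
For anyone working outside Stata, the chunk-and-merge idea in note 3 can be sketched in a few lines of Python.  This is not the posted do file, just an illustration: it assumes each chunk has already been reshaped into country-year records, and the indicator names and values are hypothetical.

```python
def merge_pieces(pieces):
    """Combine reshaped chunks into one panel.

    Each piece is a dict keyed by (country, year) whose values are dicts
    of {indicator: value}. Records with the same key are merged.
    """
    combined = {}
    for piece in pieces:
        for key, values in piece.items():
            combined.setdefault(key, {}).update(values)
    return combined

# Hypothetical chunks, each holding a different subset of indicators.
chunk_a = {("AAA", 2000): {"indicator_a": 1.0},
           ("BBB", 2000): {"indicator_a": 2.0}}
chunk_b = {("AAA", 2000): {"indicator_b": 10.0}}

panel = merge_pieces([chunk_a, chunk_b])
print(panel[("AAA", 2000)])
```

Because each chunk is keyed by country and year, the pieces can be reshaped one at a time and combined afterwards without holding the whole wide file in memory.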

WDI 2010 Is Out — And Free!

The World Bank has released the World Development Indicators 2010 — and it is freely downloadable in multiple file formats.

http://data.worldbank.org/data-catalog

This is great news.  You can just download a spreadsheet (excel or .csv).  No more messing with a CDROM or MS Access databases!  (Those who have worked with older versions know what I mean.)

It is still a bit of a pain to work with.  It is too big for my (older) version of Excel.  And, the format remains quirky:  years are columns, countries are rows, with variables stacked long (a column contains the variable identifier). Fortunately, Stata’s “reshape” command can get it into a more useful format.  I can post the syntax if people are interested.
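
To show what the reshape has to accomplish, here is a minimal Python sketch of the same transformation on a toy file.  It is not the Stata syntax mentioned above.  The column names (`country`, `indicator`) and the values are made up; the real WDI file uses different headers and is far larger.

```python
import csv
import io

# Hypothetical miniature of the layout described above: one row per
# country-indicator pair, one column per year. Values are made up.
WIDE_CSV = """country,indicator,2000,2001
AAA,gdp,1.0,1.5
AAA,pop,10,11
BBB,gdp,2.0,
"""

def reshape_to_country_year(text):
    """Return {(country, year): {indicator: value}} from the wide layout."""
    panel = {}
    for row in csv.DictReader(io.StringIO(text)):
        country = row.pop("country")
        indicator = row.pop("indicator")
        # Every remaining column is a year.
        for year, value in row.items():
            if value == "":
                continue  # skip missing cells
            panel.setdefault((country, int(year)), {})[indicator] = float(value)
    return panel

panel = reshape_to_country_year(WIDE_CSV)
print(panel[("AAA", 2000)])
```

The result has one record per country-year with each indicator as its own variable, which is the shape most regression software expects.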

Growth of Commonwealth Universities

Danielle Logue has put together a really neat visualization of the historical proliferation of universities in the British Commonwealth.  As someone who thinks a lot about the growth of universities, I found it really interesting.

http://timothyhannigan.com/danielleMaps/dmap.php

Here’s a description with some of the context:

“This map is part of a larger doctoral research project by Danielle Logue, Said Business School, University of Oxford.  This project examines the changing composition of top management teams in over 500 universities across 37 countries of the British Commonwealth.  By conceptualising these leadership positions as constitutive of particular conceptions of control, it asks the question:  how do such conceptions of control spread in global, loosely structured fields, where there are not the usual suspects of organisational diffusion?  Amongst other findings, the research reveals the global diffusion of a finance ‘conception of control’, which will be demonstrated in an upcoming animated map.  Danielle is working with her DPhil colleague, Tim Hannigan at the Oxford Centre for Entrepreneurship and Innovation,  who provides the sophisticated technical expertise to produce such visualisations.  For further details, contact Danielle Logue (danielle.logue@sbs.ox.ac.uk) or Tim Hannigan (timothy.hannigan@sbs.ox.ac.uk).”

Development Finance Data

Wes sent me a link to the latest/greatest new source of data on international aid/development finance:

http://www.aiddata.org/home/index

They compile information on development finance (i.e., grants and loans) from governments and IGOs to developing countries for the period 1947-2009.

We are using it to expand on a paper, which we presented at ASA last August, that looked at the impact of World Bank “structural adjustment” loans on national income inequality. (We found that loans are associated with generally higher inequality in the 1980s; but not in the 1990s and not in Asia.)  This database will allow us to look at IMF loans, and other sources as well.

Also:  In the future, they plan to add data on aid from NGOs and private groups.  That will be really useful.

Updated IGO / INGO Membership Data

Allwyn Lim, working with Kiyo, has coded country IGO and INGO membership data from 2000-2007.  We hope to use this to extend our existing datasets.  However, we’ll need to do some checking (to watch out for discontinuities) first.
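
One simple way to do that kind of check is to flag unusually large year-over-year changes where the old and new series meet.  The Python sketch below is only an illustration:  the 25% threshold is an arbitrary assumption, and the counts are hypothetical, not values from the UIA data.

```python
def flag_jumps(series, threshold=0.25):
    """Return years whose count differs from the previous observed year
    by more than `threshold`, expressed as a proportion."""
    flagged = []
    years = sorted(series)
    for prev, curr in zip(years, years[1:]):
        if series[prev] == 0:
            continue  # proportional change is undefined from zero
        change = (series[curr] - series[prev]) / series[prev]
        if abs(change) > threshold:
            flagged.append(curr)
    return flagged

# Hypothetical counts for one country: 1999 is the last year of the old
# coding, 2000 the first year of the new coding.
example = {1998: 1000, 1999: 1030, 2000: 1400, 2001: 1440}
print(flag_jumps(example))  # [2000]
```

A flagged year is only a prompt to look at the coding, since a real surge in memberships would trip the same check.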

Allwyn has coded disaggregated data on all UIA sub-types, allowing for more nuanced analyses than in the past (e.g., separating ‘regional’ from truly ‘international’ INGOs).

The new dataset is on the password-protected part of the data archive at UC Irvine.

Substantively, the new data are quite interesting.  It looks like INGO memberships continued to grow incredibly quickly between 2000 and 2007.  My eyeball estimate is that the typical country’s memberships grew by about 20% in that period.  Wow.

> I’m attaching updated UIA NGO/IGO data for 2000-2007 in Excel. These
> are NGO/IGO memberships and not secretariats. There are two files:
>
> (1) “UIA 2000-2007” has separate worksheets that replicate the tables
> from the respective yearbooks. I’ve retained all categories since
> people may want to use different combinations for their analyses.
>
> (2) “UIA Totals 2000-2007” has NGO and IGO totals (sum of all
> categories A-U) in country-year format plus “newid3” and “gurrid”
> where I could identify them.