
The 2000 U.S. Census: 1 Billion RDF Triples

August 14, 2007

Overview

I've been interested lately in getting large amounts of existing data into RDF so that databases once isolated by being in vastly different data formats can start to be meshed more easily. (See data on the U.S. Congress in RDF). This page describes the 2000 U.S. Census converted into RDF (what is RDF?) and exposed via SPARQL.
The U.S. Census data is provided by the Census Bureau in a structured format (with an enormous amount of documentation, no less) and yields on the order of 1 billion RDF triples. The task of extracting those triples is a hefty one, though fairly straightforward, and this document explains what I did from start to finish: first transforming the Census data into Notation 3 (with a Perl script), and then loading it into a MySQL database and serving it via SPARQL (using my own C# library for RDF).

Also see my original announcement of this data set on the semantic-web mail list.

Try it out...

  • census.tgz: Basic geographic data for the U.S., the states, counties, cities, ZCTAs (the Census's approximation to zip codes), and congressional districts. 4.7 MB; 1,016,219 triples in N3 format. See example data below.
  • For the detailed Census statistics, you'll have to download the raw Census data files from the Census Bureau, along with my Perl script and the patch file below, and run the conversion yourself, because the resulting files are too big for me to offer as a download!
  • census.pl: Perl script to generate >1B triples of detailed census data from the Census's published data files. The script contains documentation on where to get the data files from the Census and how to run it. If you want to generate the detailed data, you'll also need census_table_layouts.patch which patches some auxiliary files that you'll need to get from the Census. link-geonames.pl is a Perl script to establish owl:sameAs links from this data set to the Geonames data set.
  • Access a sample URI as linked data: http://www.rdfabout.com/rdf/usgov/geo/us
  • SPARQL Query Page and Examples
  • The data and scripts can be reused under Creative Commons Attribution-NonCommercial-ShareAlike.

More information from the Census...

  • Summary File 1 (SF1): Tables focusing on age, sex, households, families, and housing units. Documentation and FTP download.
  • Summary File 3 (SF3): Tables focusing on social, economic and housing characteristics compiled from a sample of approximately 19 million housing units (about 1 in 6 households) that received the Census 2000 long-form questionnaire. Documentation and FTP download.
  • What are ZCTAs: ZIP Code Tabulation Areas (i.e. not ZIP codes exactly)

Later on this page...

  • About the Census Data
  • Modeling the Census Data in RDF and Conversion to Notation 3
  • Loading the Data into a Triple Store and Exposing it to the World
  • Dereferencing URIs and Linked Data
  • Providing a Sitemap

About the Census Data

The Census data comprises population statistics at various geographic levels, from the U.S. as a whole, down through states, counties, sub-counties (roughly, cities and incorporated towns), so-called "census designated places" ("CDP"s, what I would call a named "village", though the term might correspond better with the colloquial use of the word town), ZIP Code Tabulation Areas (ZCTAs, which approximate ZIP codes), and even deeper levels of granularity.
The statistics themselves contain total population counts, counts by age, sex, and race, information on commuting time to work, mean income, latitude and longitude of the region, etc. In fact, for each of the roughly 55,000 geographic regions from country down to CDP, 25 thousand statistics are reported! That's a lot.
Side notes: The data set contains around 3,200 counties, 36,000 "towns", 16,000 "villages", and 33,000 ZCTAs. There are fewer CDPs than towns here because I exclude CDPs that represent 100% of the town they are contained in. A big chunk of the 25k stats per region is iterations of the same statistics for race-based subsets of each region's total population, and since I don't think that's so interesting, and I wanted to keep the data size manageable (1 billion is large enough, thank you), I omit those stats, leaving around 11 thousand per region.
The thousands of statistics are, fortunately for the human user, structured. The stats break down into tables, and tables within tables, etc. For instance, population by sex and age (i.e. how many individuals are male and 24 years old) is a two-level table. First the total population is broken down by sex (total male, female), and then each of those parts is further broken down by age. It is not uncommon to see tables four or five levels deep. And since the statistics break things down into smaller and smaller categories, around 25% of all of the numbers reported over all of the regions are just zero.
Further, the stats break down into what the Census calls "universes." For a region, one universe is "total population", which means that the statistics represent counts of people out of all of the people in the region. Another universe is "households", which means the numbers are not counting people but households, out of all of the households in the region. Other universes break down those two, such as "total population 18 years and over."
And, lastly, there are some statistics that are not counts but are instead aggregates or medians, such as the median income of a subset of the population within some region.
The fact that the statistics are structured into hierarchical tables makes the relational database model problematic for representing the data: you would need either a single table with thousands of columns or else several hundred tables, and in either case the hierarchy is not explicitly modeled. An XML-ish database might work well, but the flexibility of a native RDF store for encoding a graph seems like a fair enough fit.

Modeling the Census Data in RDF and Conversion to Notation 3

Modeling Choices

The Census data could have been modeled in RDF any number of ways, and I chose one way that seemed to work out all right. I wanted the notion of a hierarchy of tables to carry over into RDF. That is, there should be nodes in the RDF graph representing a table, i.e. a subset of the population, out of which subtables may slice the population into smaller groups. Each slice of the world is represented by an RDF predicate. Tables as such are represented by blank (anonymous) nodes.
Okay, to take an example: Each region starts off with a node representing the region, i.e. <http://www.rdfabout.com/rdf/usgov/geo/us> for the United States as a whole. To keep the thousands of statistics slightly separated from the basic geographic data, that node follows a census:details predicate to a new node representing the 2000 Census statistics for that region. By the current convention, the URI for the details node just appends "/censustables" to the region URI.
The first way the Census slices the world as it pertains to that region is by dividing the world into universes, which as I mentioned above are "total population", "households", etc. In the RDF model, predicates representing each universe are applied to the ...region/censustables entity and land on a bnode representing a table that will further divide that universe. If the "total population" universe is then subdivided into "male" versus "female", predicates leave the first bnode and land on two new bnodes representing the males and the females out of the total population. Further subdivisions, such as by age, leave these bnodes and may land on new bnodes that may be subdivided further.
There is more to explain, below, but here is a graphical representation of what is going on. The black-colored nodes and edges are the nodes and triples I've discussed so far.
[Figure: Structure of the RDF model.]
As the figure shows, each Census predicate (in black) takes you from a table bnode to either (a) another table or (b) a literal numeric value (in blue). When it takes you to a numeric value, you've reached the end of the line, and it tells you how many people (households, etc.) fall into all of the categories that brought you to that value. So if you follow the path Region > population > female > age10-19, you end on a numeric value that tells you how many women aged 10-19 are in that region. (If you followed Region > households > nonFamilyHouseholds you would get the number of households, not people, that are non-family households. To know what a "non-family household" is, you would have to consult the PDFs published by the Census.) Now, what makes this a little bit weird is that we may also want to know how many women there are in total. If we followed the path only part way (Region > population > female), you might want a literal numeric value here, but we can't do that since we've already put a bnode there so that we can branch further. Instead, we branch off an rdf:value predicate to a literal value that has the total number for that category (the edges and literals in red). So if you want the total number of women, you follow Region > population > female > rdf:value. (This is, yes, a little bit awkward since you can't know whether you need that extra rdf:value without looking at the structure of the graph.)
You may notice in the figure that there are two population edges leaving the Region node terminating on bnodes that repeat the same total population count (120,000) in their rdf:values. This redundancy is there in order to model each table in the Census data as an independent bnode. That is, if there were only one population predicate leaving Region, then the male/female division and the inHousehold/inGroupQuarters division would be collapsed together. The benefit of keeping them apart is that the model is explicit about which categories are mutually exclusive and which fall into natural groupings, which applications may find useful.
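To make the path-following concrete, a query for the total number of women in a region would look roughly like the sketch below. The predicate names here are the shorthand from the figure, not necessarily the exact predicates the conversion generates (those are longer, as in the N3 excerpts further down), so treat this as an illustration of the shape of the query rather than something to paste into the end-point as-is.

# Illustrative only: follows Region > population > female > rdf:value.
# The census:population and census:female predicate names are the figure's
# shorthand, not necessarily the names actually generated by the conversion.
SELECT ?totalWomen WHERE {
  <http://www.rdfabout.com/rdf/usgov/geo/us/ny> census:details [
    census:population [
      census:female [ rdf:value ?totalWomen ]
    ]
  ] .
}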
Besides the graph above, there is also a hierarchy established between regions themselves using dcterms:isPartOf, and several other features, including actual names and latitude/longitude, are represented as predicates directly off of the region node.

Converting the Census Raw Data Files to Notation 3

The Census publishes its data in text-based data files (compressed; even compressed, it comes to several gigabytes in all). I chose to first convert them to Notation 3 files on disk, and then to load those files into a triple store.

Geographic Data

Basic geographic data, including the hierarchical relationship between the regions, latitude/longitude, and names, comes from the file usgeo_uf1.txt. It's a fixed-column-width text file, which is described here. Besides some standard schemas, I use two of my own: Census and USGovt. Below is the segment of the N3 version for New York.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix census: <tag:govshare.info,2005:rdf/census/> .
@prefix usgovt: <tag:govshare.info,2005:rdf/usgovt/> .

<http://www.rdfabout.com/rdf/usgov/geo/us/ny> 
   rdf:type usgovt:State ;
   usgovt:censusStateCode "21" ;
   usgovt:fipsStateCode "36" ;
   usgovt:uspsStateCode "NY" ;
   dc:title "New York" ;
   dcterms:isPartOf <http://www.rdfabout.com/rdf/usgov/geo/us> ;
   geo:lat 42.155127 ;
   geo:long -75.164667 ;
   census:population 18976457 ;
   census:households 7679307 ;
   census:landArea "122283145776 m^2" ;
   census:waterArea "19016249880 m^2" ;
   census:details <http://www.rdfabout.com/rdf/usgov/geo/us/ny/censustables> . 
You may notice the landArea and waterArea values are given in a strange half-number/half-unit literal value. I didn't know what to do since there isn't much consensus on how to indicate the units of physical quantities in RDF. (See the thread "ontology for units of measurement and/or physical quantities" at the semantic-web mail list archives and my own suggestion that came later on.)
A link to download the geographic data set (about 1 million triples) is at the start of this document.
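For a sense of what the conversion involves, parsing a fixed-column-width file like usgeo_uf1.txt in Perl mostly comes down to unpack-style slicing of each line and printing triples. Below is a minimal sketch; the unpack template, field widths, and summary-level code are invented for illustration and are not the real SF1 layout (census.pl and the Census documentation have the actual column positions).

#!/usr/bin/perl
# Minimal sketch only: the unpack template below is invented for illustration;
# see the Census documentation and census.pl for the real column layout.
use strict;
use warnings;

open my $fh, "<", "usgeo_uf1.txt" or die "usgeo_uf1.txt: $!";
while (my $line = <$fh>) {
    # Hypothetical layout: skip 8 chars, 3-char summary level, 2-char state
    # FIPS code, 3-char county FIPS code, then a 90-char padded area name
    # ("A" fields strip trailing spaces automatically).
    my ($sumlev, $state, $county, $name) = unpack "x8 A3 A2 A3 A90", $line;
    next unless defined $sumlev && $sumlev eq "050";   # say, county records only
    # The real script builds each region URI from the geographic codes;
    # a placeholder URI is printed here.
    print qq{<http://www.rdfabout.com/rdf/usgov/geo/us/...> dc:title "$name" ;\n};
    print qq{    usgovt:fipsStateCode "$state" .\n};
}
close $fh;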

Population Tables

The tables-within-tables population data came next. There are two sets of this data. The first set is the "100 Percent" data, covering questions asked of every individual in the United States. The second set is the Sample data, covering questions asked of about one-sixth of the population. The two data sets overlap in questions, so some statistics have two values, one as determined by the 100 Percent data and one from the Sample data; this forced the use of two separate namespaces for the statistics, "100pct" and "samp".
This data was far more difficult to get into N3 than the geographic data, which is just a flat table with a few fields per region. The "summary files" (as they're called) come from here (documentation is there too). For instance, the file SF1_all_0Final_National.zip (one gigabyte) contains about one-third of the census data. In that zip file are zip files for each state, and in those are comma-delimited text files with several hundred fields per file. The comma-delimited files are described in a set of metadata files for the SAS statistics program. Those files list the order of the fields within the data files and give each field a short description. In addition, whitespace indentation in the descriptions, which presumably is ignored by SAS, was crucial for establishing the hierarchical nature of the fields.
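To give a flavor of how that indentation can be turned into a hierarchy, here is a rough sketch of the idea. It is not the actual census.pl code, and the layout-line format it assumes (a field identifier followed by a quoted, indented description) is made up for illustration: keep a stack of ancestor fields and pop it whenever a field's indentation is not deeper than the one on top.

#!/usr/bin/perl
# Rough sketch of the idea only -- not the actual census.pl logic, and the
# assumed layout-line format is invented for illustration.
use strict;
use warnings;

my @stack;   # ancestors of the current field, outermost first
while (my $line = <STDIN>) {
    next unless $line =~ /^(\w+)\s+"(\s*)(.*?)"\s*$/;
    my ($field, $indent, $label) = ($1, length $2, $3);

    # A field indented no deeper than the top of the stack closes that table.
    pop @stack while @stack && $stack[-1]{indent} >= $indent;

    my $path = join " > ", (map { $_->{label} } @stack), $label;
    print "$field => $path\n";   # e.g. P002003 => Urban > Inside urbanized areas

    push @stack, { indent => $indent, label => $label };
}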
I used a Perl script to convert the data into N3. The script runs for about an hour and a half (on modern hardware) and yields 1 billion triples. To get the triples yourself, you will have to download my Perl script, a patch file, and the Census data files, all linked at the start of this document. The resulting N3 files are just too big (2.4 GB) for me to provide as a download.
The beginning of the output for the United States looks like this:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix : <tag:govshare.info,2005:rdf/census/details/100pct> .

<http://www.rdfabout.com/rdf/usgov/geo/us/censustables>
  :totalPopulation 281421906 ;     # P001001
  :totalPopulation [
     dc:title "URBAN AND RURAL (P002001)";
     rdf:value 281421906 ;   # P002001
     :urban [
        rdf:value 222360539 ;  # P002002
        :insideUrbanizedAreas 192323824 ;   # P002003
        :insideUrbanClusters 30036715 ;     # P002004
        ] ;
     :rural 59061367 ;   # P002005
     ] ;
  :totalPopulation [
     dc:title "RACE (P003001)";
     rdf:value 281421906 ;   # P003001
     :populationOfOneRace [
        rdf:value 274595678 ;    # P003002
        :whiteAlone 211460626 ;     # P003003
        :blackOrAfricanAmericanAlone 34658190 ;     # P003004
        :americanIndianAndAlaskaNativeAlone 2475956 ;   # P003005
  ...

Overall Statistics

The conversion process to Notation 3 yielded a total of 1,002,848,918 triples. 1,016,219 triples of that were for the geographic data; the rest were for the detailed population statistics. Regions covered include the U.S., states, counties, towns, census designated places ("villages"), ZCTAs (roughly ZIP Codes), and current congressional districts (2007-08). Also included are owl:sameAs links between geographic entities defined here and some defined in the Geonames data set.

Loading the Data into a Triple Store and Exposing it to the World

Loading the Data

I used my own SemWeb Library for C# to load the triples into a MySQL database. Loading took 39 hours on a 2.13 GHz Core 2 Duo (~7,000 statements per second). The database is 85 GB in size (~90 bytes per statement).
Within my library there is a command-line tool for loading data into a database. I used this to load the geographic data:
export SEMWEB_MYSQL_IMPORT_MODE=LOCK 

mono rdfstorage.exe --clear -in n3 \
    -out "mysql:censusgeo:Database=rdf;User name=rdf" \
    geo-* link-*
Because of the enormous size of the remaining data, the detailed data triples were kept in gzip'ed N3 files, so one has to be a little bit more creative to load them into the database without uncompressing them first:
export SEMWEB_MYSQL_IMPORT_MODE=DISABLEKEYS

cat sumfile* cong* | gunzip | cat -- schema.n3 - | \
    nice mono rdfstorage.exe \
        -out "mysql:censustables:Database=rdf;User name=rdf" \
        -in n3 -
The library creates three MySQL tables to store the data. The first table has columns Subject, Predicate, and Object, each an integer key representing the resource. Two other tables map integers to URIs, for named entities, and integers to values and datatypes for literals. MySQL indexes are created over the columns in several ways to support different types of queries.

Exposing the Data with SPARQL

Once the data is in a triple store, exposing the data via SPARQL is the best way to access it. SPARQL allows one to run all sorts of queries against the data set. We could use it, for instance, to get a list of all of the counties in New York sorted by the median income in the county:
SELECT ?name ?medianincome WHERE {
  ?county dcterms:isPartOf <http://www.rdfabout.com/rdf/usgov/geo/us/ny> ;
  rdf:type usgov:County ;
  dc:title ?name .
  ?county census:details [
    census2:population15YearsAndOverWithIncomeIn1999 [ 
      census2:medianIncomeIn1999 ?medianincome
    ]
  ] .
} ORDER BY ?medianincome
Another example would be to find all states in the United States with more students living in dorms than prison inmates:
SELECT ?name ?prisoners ?students WHERE {
  ?state rdf:type usgov:State ;
  dc:title ?name .
  ?state census:details [
    census1:populationInGroupQuarters [
      census1:institutionalizedPopulation [
        census1:correctionalInstitutions ?prisoners
      ] ;
      census1:noninstitutionalizedPopulation [
       census1:collegeDormitories__includesCollegeQuartersOffCampus ?students
      ]
    ]
  ] .
  FILTER(?students > ?prisoners) .
}
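For reference, the prefixes these two queries rely on are roughly the following. The census1:/census2: mappings are my reading of the "100pct"/"samp" namespaces described earlier, and I assume usgov: maps to the namespace written as usgovt: in the N3 above, so treat this list as a guess; the query page below has the authoritative declarations.

# My best guess at the declarations these queries need -- see the query page
# linked below for the authoritative list.
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX census:  <tag:govshare.info,2005:rdf/census/>
PREFIX usgov:   <tag:govshare.info,2005:rdf/usgovt/>
PREFIX census1: <tag:govshare.info,2005:rdf/census/details/100pct>
PREFIX census2: <tag:govshare.info,2005:rdf/census/details/samp>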
You can try out queries here. That page has the full query examples with PREFIXes and some additional suggestions for querying the data.
Again, I used my own library to set up a SPARQL end-point: it includes an ASP.NET HTTP protocol handler that plugs into ASP.NET to provide one. I run it under Mono's mod_mono under Apache, but in principle it should work with IIS on Windows too. Getting that going is very straightforward, at least once you have mod_mono working. First, instruct Apache that a certain location will be handled by the ASP.NET runtime by editing the server or vhost configuration and adding:
<Location /sparql>
    SetHandler mono
</Location>
(I believe you could also do something similar with a .htaccess file in a "sparql" directory.) Then, at the root-level of the website, in the web.config file, add:
<configuration>
    <configSections>
        <section name="sparqlSources" type="System.Configuration.NameValueSectionHandler,System"/>
    </configSections>

    <system.web>
        <httpHandlers>
            <add verb="*" path="sparql"
                type="SemWeb.Query.SparqlProtocolServerHandler, SemWeb.Sparql" />
        </httpHandlers>
    </system.web>

    <sparqlSources>
        <add key="/sparql" value="noreuse,mysql:censusgeo:Database=rdf;Server=localhost;User Id=rdf
                                          mysql:censustables:Database=rdf;Server=localhost;User Id=rdf"/>
    </sparqlSources>
</configuration>
and place the DLLs from the library in the bin directory of the website.
This creates a SPARQL end-point at http://www.rdfabout.com/sparql.
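Once it's running, you can hit the end-point with any HTTP client using the standard SPARQL protocol GET binding (a query passed in the query parameter, which is the same mechanism the DESCRIBE redirects below rely on). For example, something along these lines should return a handful of triples about the U.S. node:

curl --get --data-urlencode \
  'query=SELECT ?p ?o WHERE { <http://www.rdfabout.com/rdf/usgov/geo/us> ?p ?o } LIMIT 10' \
  http://www.rdfabout.com/sparql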

Dereferencing URIs and Linked Data

Current Semantic Web best practices are to mint URIs for entities that are dereferenceable, that is, http: URIs that you could plug into a web browser and get something back. For reasons beyond the scope of this web page, the practice is to have a web server reply with a 303 "see other" status when dereferencing RDF URIs, sending the user to a different page which actually provides some RDF information (usually RDF/XML) about the resource in question. (For more, see How to Publish Linked Data on the Web.)
Providing the redirect from the URIs to some other page describing them is easy enough with Apache. But what page should describe them, and how should those pages be created? With a SPARQL end-point already set up, an easy solution is to use URLs for SPARQL DESCRIBE queries as the targets of the redirects.
I've thus included the following in my .htaccess file at the root of my website. It redirects URLs in a certain subdirectory to the SPARQL end-point with a DESCRIBE query on the URL originally accessed. Note that the angled brackets in the query are URI-escaped, and that I assume no special URI characters like %, #, and & are present in URIs, or a different redirecting method would have to be used so that URIs could be properly escaped.
RedirectMatch 303 (/rdf/usgov/geo/.*) http://rdfabout.com/sparql?query=DESCRIBE+%3Chttp://www.rdfabout.com$1%3E
You could try it out by visiting http://www.rdfabout.com/rdf/usgov/geo/us, the URI I minted for the United States, right in your browser, or pasting that URI into any of the client-side tools on the LinkedData ESW wiki page.
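With the rule above in place, requesting a region URI should come back as a 303 redirect to the corresponding DESCRIBE query. Roughly, this is what I would expect to see (an expected exchange, not a captured transcript):

$ curl -I http://www.rdfabout.com/rdf/usgov/geo/us/ny
HTTP/1.1 303 See Other
Location: http://rdfabout.com/sparql?query=DESCRIBE+%3Chttp://www.rdfabout.com/rdf/usgov/geo/us/ny%3E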

Update: A robust solution for redirects, for URIs that may include special escaped characters, is here: .htaccess. It employs a redirect for all URLs in the /rdf path space for the virtual host. Note that this requires that a line be added to httpd.conf.

Providing a Sitemap

Following the new Semantic Web Crawling: a Sitemap Extension guide, and the robots.txt extension for finding sitemaps, I have created a sitemap.xml file for the data stored on this website, which you can take a look at here; it is referenced in the robots.txt file for the website.
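The reference from robots.txt is just the standard Sitemap directive; assuming the file lives at the root of the site, the line looks like this:

Sitemap: http://www.rdfabout.com/sitemap.xml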
This site is run by Joshua Tauberer.