How to reshape WDI data in R

I need to reshape some data I downloaded from the World Bank database, but I am having some difficulties with it.
The goal is for it to look like this:
year CH DE US
1980 17383.38 11746.40 12179.56
1981 15833.74 9879.46 13526.19
1982 16133.97 9593.66 13932.68
1983 16007.82 9545.86 15000.09
1984 15229.82 9012.48 16539.38
I use the following code to download the data. The WDI and RJSONIO packages are required.
wdi <- WDI(country = c("CH","DE","US"), indicator = "NY.GDP.PCAP.CD" ,start = 1980, end = 2010, extra = F)
Then I reshaped it the following way:
wdi2 <- reshape(wdi, direction = "wide", timevar="year", v.names="NY.GDP.PCAP.CD", idvar="country", drop="iso2c")
The output does not match my expectations of how it should look:
> wdi2
country NY.GDP.PCAP.CD.2010 NY.GDP.PCAP.CD.2009 NY.GDP.PCAP.CD.2008
1 Switzerland 70572.66 65790.07 68555.37
32 Germany 40163.82 40275.25 44132.04
63 United States 46615.51 45305.05 46759.56 ...
This one is a bit better but still not what I want:
> t(wdi2)
1 32 63
country "Switzerland" "Germany" "United States"
NY.GDP.PCAP.CD.2010 "70572.66" "40163.82" "46615.51"
NY.GDP.PCAP.CD.2009 "65790.07" "40275.25" "45305.05"
NY.GDP.PCAP.CD.2008 "68555.37" "44132.04" "46759.56"
NY.GDP.PCAP.CD.2007 "59663.77" "40402.99" "46349.12"
The wdi object looks like this:
> wdi
iso2c country NY.GDP.PCAP.CD year
1 CH Switzerland 70572.657 2010
2 CH Switzerland 65790.067 2009
3 CH Switzerland 68555.372 2008
4 CH Switzerland 59663.770 2007
...
30 CH Switzerland 16219.906 1981
31 CH Switzerland 17807.340 1980
32 DE Germany 40163.817 2010
33 DE Germany 40275.251 2009
34 DE Germany 44132.042 2008
...
62 DE Germany 11746.404 1980
63 US United States 46615.511 2010
64 US United States 45305.052 2009

In front of a computer again... so here's an update.
As mentioned in my comments, dcast from "reshape2" is quite convenient for this. You can get similar functionality from xtabs in base R if you're just doing the reshaping step.
x <- xtabs(NY.GDP.PCAP.CD ~ year + iso2c, wdi)
head(x)
# iso2c
# year CH DE US
# 1980 17807.34 11746.404 12179.56
# 1981 16219.91 9879.457 13526.19
# 1982 16527.46 9593.657 13932.68
# 1983 16398.24 9545.859 15000.09
# 1984 15601.26 9012.479 16539.38
# 1985 15748.95 9125.121 17588.81
xtabs creates a matrix of class "xtabs", so to get a data.frame, wrap the output in as.data.frame.matrix.
head(as.data.frame.matrix(x))
# CH DE US
# 1980 17807.34 11746.404 12179.56
# 1981 16219.91 9879.457 13526.19
# 1982 16527.46 9593.657 13932.68
# 1983 16398.24 9545.859 15000.09
# 1984 15601.26 9012.479 16539.38
# 1985 15748.95 9125.121 17588.81
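If you want year to be an actual column (as in your target layout) rather than the row names, one extra step does it. This is just a small follow-on sketch using the x object from above:
x_df <- as.data.frame.matrix(x)
x_df <- data.frame(year = as.integer(rownames(x_df)), x_df, row.names = NULL)
head(x_df)
#   year       CH        DE       US
# 1 1980 17807.34 11746.404 12179.56
# 2 1981 16219.91  9879.457 13526.19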
To answer the other question you asked in your comment ("However, isn't there a smarter way that puts the data directly into the right format by using only the reshape function?"): the answer is yes. Just swap what you were using for "idvar" and "timevar" in your original reshape attempt.
y <- reshape(wdi[-2], direction = "wide", idvar="year", timevar="iso2c")
## Optional step to clean up the resulting names
names(y) <- gsub("NY.GDP.PCAP.CD.", "", names(y))
head(y)
# year CH DE US
# 1 2010 70572.66 40163.82 46615.51
# 2 2009 65790.07 40275.25 45305.05
# 3 2008 68555.37 44132.04 46759.56
# 4 2007 59663.77 40402.99 46349.12
# 5 2006 54140.50 35237.60 44622.64
# 6 2005 51734.30 33542.78 42516.39
When using the reshape function, sometimes it helps to ignore the "id" and "time" parts of the argument names and think instead about where they are to go. ID variables usually make up a column, and time variables usually spread out wide, one column for each time. So, even though we might think of "country" as the actual ID variable, for the data format that you want, it is more of a time variable.
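To make that rule concrete, here are the two reshape calls side by side; both use the same wdi data from above, and only the roles of "year" and "iso2c" are swapped:
# rows = one per country, years spread into columns (what your first attempt produced)
reshape(wdi[-2], direction = "wide", idvar = "iso2c", timevar = "year")
# rows = one per year, countries spread into columns (the layout you want)
reshape(wdi[-2], direction = "wide", idvar = "year", timevar = "iso2c")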
Hopefully this helps, even though you've already accepted an answer :)

It is really easy to achieve using reshape2.
require(reshape2)
dcast(wdi[,-2], year ~ iso2c, value.var = 'NY.GDP.PCAP.CD')
EDIT: Oops, I did not see the comment posted by Ananda Mahto with the same solution. Ananda, if you post your comment as an answer, I will delete mine.

Here is a base R solution.
# renames the NY.GDP column to the country code and drops all but two columns
trans_one <- function(dat) {
  newcol <- dat[1, "iso2c"]
  idx <- which(colnames(dat) == "NY.GDP.PCAP.CD")
  colnames(dat)[[idx]] <- newcol
  dat <- dat[, c(newcol, "year")]
  dat
}
# split by country
sp <- split(wdi, wdi$iso2c)
# merge the per-country pieces together on year
fun <- function(x, y) {
  merge(x, trans_one(y), by = "year", all = TRUE)
}
Reduce(fun, x = tail(sp, -1), init = trans_one(sp[[1]]))
However, the reshape2 approach looks more straightforward to me now.

Related

Looping through multiple dataframes in R

I have a set of survey design data for each quarter/year stored as RDS files on my disk. The data looks like this:
Year Quarter Age
2010 1 27
2010 1 32
2010 1 34
...
I'm using the function svymean(formula = ~Age, na.rm = TRUE, design = data20101) to estimate the mean of the Age variable for each year/quarter file. I would like to do this more efficiently by running the function in a loop and saving the results in one single data frame.
The output I'm looking for is a data frame like this:
Year Quarter Mean_Age
2010 1 31.1
2010 1 32.4
2010 1 30.9
2010 1 34.5
2010 2 36.3
2010 2 31.2
2010 2 30.8
2010 2 35.6
...
I don't have enough rep to comment. I see r2evans is making good suggestions about how you might read in your big data. You will obviously need to list the files in some way if you are to iterate through them. This method iterates through the list of file names, assuming your data is all in one directory by itself. It also never keeps more than one dataset in memory at a time, which is ideal if the only thing you want is the grouped mean ages (not ideal if you are running more analysis besides this). I'm not sure what was most pressing in your question, but below is a general model of how to approach your problem.
library(dplyr)

output <- data.frame(Year = numeric(),
                     Quarter = numeric(),
                     Mean_Age = numeric())

filepath <- "./filepath_to_data/"
files_list <- list.files(filepath)

for (i in 1:length(files_list)) {
  # read one file, compute the grouped means, and append them to the output
  output <- read.csv(paste0(filepath, files_list[i])) %>%
    group_by(Year, Quarter) %>%
    summarise(Mean_Age = mean(Age), .groups = "drop") %>%
    bind_rows(output)
}
output
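Since your files are actually RDS files holding survey designs, a variant of the same pattern using readRDS and svymean might look like the sketch below. The directory path, the file pattern, and the assumption that each file stores a svydesign object whose variables include Year and Quarter are my own guesses, not something from your question:
library(survey)
library(dplyr)

files_list <- list.files("./filepath_to_data/", pattern = "\\.rds$",
                         ignore.case = TRUE, full.names = TRUE)

output <- lapply(files_list, function(f) {
  des <- readRDS(f)                                  # the stored survey design
  m   <- svymean(~Age, design = des, na.rm = TRUE)   # estimated mean age
  data.frame(Year     = unique(des$variables$Year),
             Quarter  = unique(des$variables$Quarter),
             Mean_Age = as.numeric(coef(m)))
}) %>%
  bind_rows()

output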

Tidycensus - One year ACS for All Counties in a State

Pretty simple problem, I think, but not sure of the proper solution. Have done some research on this and think I recall seeing a solution somewhere, but cannot remember where...anyway,
I want to get DP03 one-year ACS data for all Ohio counties for 2019. However, the code below only returns 39 of Ohio's 88 counties. How can I access the remaining counties?
My guess is that data is only being pulled for counties with populations of 65,000 or more.
library(tidycensus)
library(tidyverse)
acs_2019 <- load_variables(2019, dataset = "acs1/profile")
DP03 <- acs_2019 %>%
filter(str_detect(name, pattern = "^DP03")) %>%
pull(name, label)
Ohio_county <-
get_acs(geography = "county",
year = 2019,
state = "OH",
survey = "acs1",
variables = DP03,
output = "wide")
This results in a table that looks like this...
Ohio_county
# A tibble: 39 x 550
GEOID NAME `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 39057 Gree~ 138295 815 138295 NA 87465
2 39043 Erie~ 61316 516 61316 NA 38013
3 39153 Summ~ 442279 1273 442279 NA 286777
4 39029 Colu~ 83317 634 83317 NA 48375
5 39099 Maho~ 188298 687 188298 NA 113806
6 39145 Scio~ 60956 588 60956 NA 29928
7 39003 Alle~ 81560 377 81560 NA 49316
8 39023 Clar~ 108730 549 108730 NA 64874
9 39093 Lora~ 250606 896 250606 NA 150136
10 39113 Mont~ 428140 954 428140 NA 267189
Pretty sure I've seen a solution somewhere, but cannot recall where.
Any help would be appreciated since it would let the office more easily pull census data rather than wading through the US Census Bureau site. Best of luck and Thank you!
My colleague, who already pulled the data, did not specify whether or not the DP03 data came from the ACS 1 year survey or the ACS 5 year survey. As it turns out, it was from the ACS 5 year survey, which includes all Ohio counties, not just those counties over 65,000 population. Follow comments above for a description of how this answer was determined.
The code for all counties is below:
library(tidycensus)
library(tidyverse)
acs_2018 <- load_variables(2018, dataset = "acs5/profile")
DP03 <- acs_2018 %>%
filter(str_detect(name, pattern = "^DP03")) %>%
pull(name)
Ohio_county <-
get_acs(geography = "county",
year = 2018,
state = "OH",
survey = "acs5",
variables = DP03,
output = "wide")

efficient way to match and sum variables of two data frames based on two criteria [duplicate]

I have a data frame df1 of import data for 397 different industries over 17 years and several different exporting countries/regions.
> head(df1)
year importer exporter imports sic87dd
2300 1991 USA CAN 9.404848e+05 2011
2301 1991 USA CAN 2.259720e+04 2015
2302 1991 USA CAN 5.459608e+02 2021
2303 1991 USA CAN 1.173237e+04 2022
2304 1991 USA CAN 2.483033e+04 2023
2305 1991 USA CAN 5.353975e+00 2024
However, I want the sum of all imports for a given industry and a given year, regardless of where they came from. (The importer is always the US, sic87dd is a code that uniquely identifies the 397 industries)
So far I have tried the following code, which works correctly but is terribly inefficient and takes ages to run.
sic87dd <- unique(df1$sic87dd)
year <- unique(df1$year)
df2 <- data.frame("sic87dd" = rep(sic87dd, each = 17), "year" = rep(year, 397), imports = rep(0, 6749))
i <- 1
j <- 1
while (i <= nrow(df2)) {
  while (j <= nrow(df1)) {
    if ((df1$sic87dd[j] == df2$sic87dd[i]) == TRUE & (df1$year[j] == df2$year[i]) == TRUE) {
      df2$imports[i] <- df2$imports[i] + df1$imports[j]
    }
    j <- j + 1
  }
  i <- i + 1
  j <- 1
}
Is there a more efficient way to do this? I have seen some questions here that were somewhat similar and suggested the use of the data.table package, but I can't figure out how to make it work in my case.
Any help is appreciated.
There is a simple solution using dplyr:
First, you may want to treat your industry field as a factor rather than a number (I'm assuming this field is a 4-digit code):
df1$sic87dd <- as.factor(df1$sic87dd)
Next, group by both industry and year and summarise:
library(dplyr)

df1 %>%
  group_by(sic87dd, year) %>%
  summarise(total_imports = sum(imports))
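Since you mentioned data.table, here is a minimal equivalent (a sketch assuming df1 exactly as shown above):
library(data.table)

# convert once, then sum imports by industry and year
dt <- as.data.table(df1)
imports_by_industry_year <- dt[, .(imports = sum(imports)), by = .(sic87dd, year)]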

Faster way to process 1.2 million JSON geolocation queries from large dataframe

I am working on the Gowalla location-based check-in dataset, which has around 6.44 million check-ins covering 1.28 million unique locations. Gowalla only gives latitudes and longitudes, so I need to find the city, state and country for each of those coordinates. From another post on Stack Overflow I was able to create the R query below, which queries OpenStreetMap and finds the relevant geographical details.
Unfortunately it takes around 1 minute to process 125 rows, which means 1.28 million rows would take roughly a week. Is there a faster way to find these details? Maybe there is some package with built-in latitudes and longitudes of the world's cities, so that I can look up the city name for a given coordinate pair without querying online.
venueTable is a data frame with three columns: vid (venue ID), lat (latitude) and long (longitude).
library(jsonlite)

for (i in 1:nrow(venueTable)) {
  # just an indicator to display the current value of i on screen
  cat(paste(".", i, "."))
  # compose the reverse-geocoding URL for this row
  url <- paste0("http://nominatim.openstreetmap.org/reverse.php?format=json&lat=",
                venueTable$lat[i], "&lon=", venueTable$long[i])
  x <- fromJSON(url)
  venueTable$display_name[i] <- x$display_name
  venueTable$country[i] <- x$address$country
}
I am using the jsonlite package in R, so x, the result of the JSON query, is a parsed structure holding the returned fields; I then pull the fields I need via x$display_name or x$address$city.
My laptop is a Core i5-3230M with 8 GB RAM and a 120 GB SSD, running Windows 8.
You're going to have issues even if you persevere. The service you're querying allows 'an absolute maximum of one request per second', which you're already breaching. It's likely that they will throttle your requests long before you reach 1.28 million queries. Their website notes that similar APIs intended for larger uses offer only around 15k free daily requests.
It'd be much better for you to use an offline option. A quick search shows that there are many freely available datasets of populated places, along with their longitude and latitudes. Here's one we'll use: http://simplemaps.com/resources/world-cities-data
> library(dplyr)
> cities.data <- read.csv("world_cities.csv") %>% tbl_df
> print(cities.data)
Source: local data frame [7,322 x 9]
city city_ascii lat lng pop country iso2 iso3 province
(fctr) (fctr) (dbl) (dbl) (dbl) (fctr) (fctr) (fctr) (fctr)
1 Qal eh-ye Now Qal eh-ye 34.9830 63.1333 2997 Afghanistan AF AFG Badghis
2 Chaghcharan Chaghcharan 34.5167 65.2500 15000 Afghanistan AF AFG Ghor
3 Lashkar Gah Lashkar Gah 31.5830 64.3600 201546 Afghanistan AF AFG Hilmand
4 Zaranj Zaranj 31.1120 61.8870 49851 Afghanistan AF AFG Nimroz
5 Tarin Kowt Tarin Kowt 32.6333 65.8667 10000 Afghanistan AF AFG Uruzgan
6 Zareh Sharan Zareh Sharan 32.8500 68.4167 13737 Afghanistan AF AFG Paktika
7 Asadabad Asadabad 34.8660 71.1500 48400 Afghanistan AF AFG Kunar
8 Taloqan Taloqan 36.7300 69.5400 64256 Afghanistan AF AFG Takhar
9 Mahmud-E Eraqi Mahmud-E Eraqi 35.0167 69.3333 7407 Afghanistan AF AFG Kapisa
10 Mehtar Lam Mehtar Lam 34.6500 70.1667 17345 Afghanistan AF AFG Laghman
.. ... ... ... ... ... ... ... ... ...
It's hard to demonstrate without any actual data examples (helpful to provide!), but we can make up some toy data.
# make up toy data
> candidate.longlat <- data.frame(vid = 1:3,
                                  lat = c(12.53, -16.31, 42.87),
                                  long = c(-70.03, -48.95, 74.59))
Using the distm function in geosphere, we can calculate the distances between all of your data points and all of the city locations at once. For your full dataset this makes a matrix of roughly 9.4 billion numbers (1.28M points x 7,322 cities), so it might take a while (you could explore parallelisation) and will be highly memory intensive; a chunked version is sketched after the example below.
> install.packages("geosphere")
> library(geosphere)
# compute distance matrix using geosphere
> distance.matrix <- distm(x = candidate.longlat[,c("long", "lat")],
y = cities.data[,c("lng", "lat")])
It's then easy to find the closest city to each of your data points, and cbind it to your data.frame.
# work out which index in the matrix is closest to the data
> closest.index <- apply(distance.matrix, 1, which.min)
# cbind the city and country of the closest match onto the original query
> candidate.longlat <- cbind(candidate.longlat, cities.data[closest.index, c("city", "country")])
> print(candidate.longlat)
vid lat long city country
1 1 12.53 -70.03 Oranjestad Aruba
2 2 -16.31 -48.95 Anapolis Brazil
3 3 42.87 74.59 Bishkek Kyrgyzstan
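Because the full 1.28M x 7,322 distance matrix will not fit comfortably in 8 GB of RAM, here is a hedged sketch of a chunked version of the same idea; the chunk size is an arbitrary choice, and candidate.longlat stands in for your full venue table:
library(geosphere)

# process the query points in chunks so each distance matrix stays small
chunk_size <- 10000
n <- nrow(candidate.longlat)
closest.index <- integer(n)

for (start in seq(1, n, by = chunk_size)) {
  idx <- start:min(start + chunk_size - 1, n)
  d   <- distm(x = candidate.longlat[idx, c("long", "lat")],
               y = cities.data[, c("lng", "lat")])
  closest.index[idx] <- apply(d, 1, which.min)   # nearest city per query point
}

candidate.longlat <- cbind(candidate.longlat,
                           cities.data[closest.index, c("city", "country")])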
Here's an alternate way using R's inherent spatial processing capabilities:
library(sp)
library(rgeos)
library(rgdal)
# world places shapefile
URL1 <- "http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_populated_places.zip"
fil1 <- basename(URL1)
if (!file.exists(fil1)) download.file(URL1, fil1)
unzip(fil1)
places <- readOGR("ne_10m_populated_places.shp", "ne_10m_populated_places",
stringsAsFactors=FALSE)
# some data from the other answer since you didn't provide any
URL2 <- "http://simplemaps.com/resources/files/world/world_cities.csv"
fil2 <- basename(URL2)
if (!file.exists(fil2)) download.file(URL2, fil2)
# we need the points from said dat
dat <- read.csv(fil2, stringsAsFactors=FALSE)
pts <- SpatialPoints(dat[,c("lng", "lat")], CRS(proj4string(places)))
# this is not necessary
# I just don't like the warning about longlat not being a real projection
robin <- "+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"
pts <- spTransform(pts, robin)
places <- spTransform(places, robin)
# compute the distance matrix (this makes a pretty big matrix, so you should do it
# in chunks unless you have a ton of memory, or do it row-by-row)
far <- gDistance(pts, places, byid=TRUE)
# find the closest one
closest <- apply(far, 1, which.min)
# map to the fields (you may want to map to other fields)
locs <- places@data[closest, c("NAME", "ADM1NAME", "ISO_A2")]
locs[sample(nrow(locs), 10),]
## NAME ADM1NAME ISO_A2
## 3274 Szczecin West Pomeranian PL
## 1039 Balakhna Nizhegorod RU
## 1012 Chitre Herrera PA
## 3382 L'Aquila Abruzzo IT
## 1982 Dothan Alabama US
## 5159 Bayankhongor Bayanhongor MN
## 620 Deming New Mexico US
## 1907 Fort Smith Arkansas US
## 481 Dedougou Mou Houn BF
## 7169 Prague Prague CZ
It's about a minute (on my system) for ~7,500 points, so you're looking at a couple of hours rather than a day or more. You can do this in parallel and probably get it done in less than an hour.
For better place resolution, you could use a very lightweight shapefile of country or Admin 1 polygons, then use a second process to do the distance from better resolution points for those geographic locations.
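A hedged sketch of that two-step idea, using the Natural Earth admin-0 countries layer. The URL and the ADMIN field name are assumptions on my part, based on the same naming pattern as the populated-places file above:
# assumed URL, following the same Natural Earth pattern as URL1 above
URL3 <- "http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip"
fil3 <- basename(URL3)
if (!file.exists(fil3)) download.file(URL3, fil3)
unzip(fil3)

countries <- readOGR("ne_10m_admin_0_countries.shp", "ne_10m_admin_0_countries",
                     stringsAsFactors = FALSE)
countries <- spTransform(countries, robin)

# step 1: assign a country to each point by point-in-polygon
# (NA for points that fall over open water)
country_of_point <- sp::over(pts, countries)$ADMIN

# step 2 (not shown): restrict the nearest-place search to places
# within that country before computing distances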

Looping through csv and dumping to data frame in R

Using R, I am trying to take a csv file, loop through it, extract values, and dump them into a data frame. There are four columns in the csv: ID, UG_inst, Freq, and Year. Specifically, I want to loop through the UG_inst column by institution name for each year (2010-11,2011-12,2012-13,and 2013-14) and put the value at that cell into the respective "cell" in the R data frame. Right now, the csv just has a Year column, but the data frame I've created has a column for each year. The ultimate idea is to be able to create bar graphs representing the frequency per institution per year. Currently, the code below throws up NO errors, but appears to do nothing to the R data frame "j".
A couple of caveats: 1) Doing a nested for loop was making my head spin, so I decided to just use 2010-11 for now and just loop through the institution name. Since there are only 4 years, I can rewrite this four times, each time with a different year. 2) Also, in the csv, there are repeat names. So, if an institution name appears twice (will be adjacent rows in the csv due to alphabetical arrangement), is there a way to dump the SUM of these into the data frame in R?
All relevant info is below. Thanks so much for any help!!!!
Here is a link to the .csv file: https://www.dropbox.com/s/9et7muchkrgtgz7/UG_inst_ALL.csv
And here is the R code I am trying:
abc <- read.csv(insert file path to above csv here)
inst_string <- unique(abc$UG_inst)
j <- data.frame("UG_inst"=inst_string,"2010-11"=NA,"2011-12"=NA,"2012-13"=NA,"2013-14"=NA)
for (i in inst_string) {
  inst.index <- which(abc$UG_inst == i && abc$Year == "2010-11")
  j$X2010.11[j$Ug_inst == i] <- abc$Freq[inst.index]
}
Instead of using a nested loop (or a loop at all) I suggest using the reshape() function in base R.
abc <- read.csv("UG_inst_ALL.csv")
abc <- abc[2:4]
reshape(data = abc,
        v.names = "Freq",
        timevar = "Year",
        idvar = "UG_inst",
        direction = "wide")
This is known as "reshaping" your data, and you are going from a "long" format to a "wide" format.
In addition to base R's reshape function, here are a few other options to consider.
I'll assume that we are starting with data read in like the following.
abc <- read.csv("~/Downloads/UG_inst_ALL.csv", row.names = 1)
head(abc)
# UG_inst Freq Year
# 1 Abilene Christian University 0 2010-11
# 2 Adams State University 0 2010-11
# 3 Adrian College 1 2010-11
# 4 Agnes Scott College 0 2010-11
# 5 Alabama A&M University 1 2010-11
# 6 Albion College 1 2010-11
Option 1: xtabs
out <- as.data.frame.matrix(xtabs(Freq ~ UG_inst + Year, abc))
head(out)
# 2010-11 2011-12 2012-13 2013-14
# Abilene Christian University 0 1 0 0
# Adams State University 0 0 0 1
# Adrian College 1 0 0 0
# Agnes Scott College 0 0 1 0
# Alabama A&M University 1 3 1 2
# Albion College 1 0 0 0
Option 2: dcast from "reshape2"
library(reshape2)
head(dcast(abc, UG_inst ~ Year, value.var = "Freq"))
Option 3: spread from "tidyr"
library(dplyr)
library(tidyr)
abc %>% group_by(UG_inst) %>% spread(Year, Freq)
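In current tidyr, spread has been superseded by pivot_wider; the equivalent call would be:
library(tidyr)
pivot_wider(abc, names_from = Year, values_from = Freq)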
