Divide cases by population in R

In the table2 dataset from the tidyr package, we have:
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
How do I code this so that I can divide the 'cases' count by the 'population' count and then multiply by 10,000? (Yes, this is a question from R for Data Science by Hadley Wickham.)
I've thought of:
sum_1 <- vector()
for (i in seq(1, nrow(table2), by = 2)) {
  # cases sit in the odd rows, population in the even row that follows
  sum_1 <- c(sum_1, (table2$count[i] / table2$count[i + 1]) * 10000)
}

Assuming there are only two values of 'type' for each 'country'/'year' combination: after grouping by 'country' and 'year', arrange by 'type' (in case the order differs) and divide the first value of 'count' by the last value of 'count' to create 'newcol'.
library(dplyr)
table2 %>%
  group_by(country, year) %>%
  arrange(country, year, type) %>%
  mutate(newcol = 10000 * first(count) / last(count))
If we need only a summarised output, replace mutate with summarise, as in the sketch below.
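A minimal sketch of that summarised variant, under the same two-rows-per-group assumption:
table2 %>%
  group_by(country, year) %>%
  arrange(country, year, type) %>%
  summarise(newcol = 10000 * first(count) / last(count))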
If there are other values in 'type' in addition to 'cases' and 'population', then we subset 'count' based on a logical index:
table2 %>%
  group_by(country, year) %>%
  mutate(newcol = 10000 * count[type == "cases"] / count[type == "population"])
Here, too, the assumption is that there is a single 'cases' and a single 'population' row per 'country'/'year'.
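As an aside, a reshape-first sketch with tidyr's pivot_wider (not part of the original answer; it assumes the same one-row-per-type structure) makes the division read directly:
library(tidyr)
library(dplyr)
table2 %>%
  pivot_wider(names_from = type, values_from = count) %>%
  mutate(rate = 10000 * cases / population)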

Combining & totalling rows in R

I have the below dataset, with the variables as follows:
member_id - an id number for each member
year - the year in question
gender - binary variable, 0 is male, 1 is female
party - the party of the member
Leadership - TRUE if the member holds a leadership position in government or opposition, FALSE if they don't
house_start - the date the member became an MP
Year.Entered - the year they became an MP
Years.in.parliament - how many years it has been since they were first elected
Edu - the number of times the MP participated in debates related to education in the given year
member_id year gender party Leadership house_start Year.Entered Years.in.parliament Edu
1 386 1997 0 Conservative FALSE 03/05/1979 1979 18 7
2 37 1997 0 Labour FALSE 03/05/1979 1979 18 10
3 47 1997 0 Labour TRUE 09/06/1983 1983 14 157
4 408 1997 0 Conservative TRUE 03/05/1979 1979 18 48
5 15 1997 1 Liberal Democrat FALSE 09/06/1983 1983 14 3
6 15 1997 1 Liberal Democrat TRUE 09/06/1983 1983 14 9
As you can see with rows 5 and 6 in the dataset, the same member is recorded twice in the same year. This has happened throughout the dataset for some members because of the Leadership variable. For example, this member (id number 15) did not have a leadership position for the first part of 1997 but did get one later in the year. I want to combine these two rows and have the Leadership variable be TRUE in these cases. I also need to sum the Edu values for these rows, so for this member it would become 12 (because I want each member's number of participations per year for this policy area). So I want it to look like:
member_id year gender party Leadership house_start Year.Entered Years.in.parliament Edu
1 386 1997 0 Conservative FALSE 03/05/1979 1979 18 7
2 37 1997 0 Labour FALSE 03/05/1979 1979 18 10
3 47 1997 0 Labour TRUE 09/06/1983 1983 14 157
4 408 1997 0 Conservative TRUE 03/05/1979 1979 18 48
5 15 1997 1 Liberal Democrat TRUE 09/06/1983 1983 14 12
I have been trying to change these manually in Excel, but I need to do this for several different policy areas, so it is taking a lot of time. Any help would be much appreciated!
We can do a grouped sum, then arrange and slice the first row of each group:
library(dplyr)
df1 %>%
  group_by(member_id, year, gender, party) %>%
  mutate(Edu = sum(Edu)) %>%
  arrange(party, desc(Leadership)) %>%
  slice(1)
For each group, you can keep the row if it is the only one, or the row where Leadership is TRUE:
library(dplyr)
df %>%
  group_by(member_id, year, gender, party) %>%
  mutate(Edu = sum(Edu)) %>%
  filter(n() == 1 | Leadership)
From my understanding, the minimal repeating group is member_id & year. We can sum the Edu amount defensively (using na.rm = TRUE) and then slice the grouped data.frame using boolean algebra: which.max of a logical vector picks a TRUE record whenever one exists.
library(dplyr)
df %>%
  group_by(member_id, year) %>%
  mutate(Edu = sum(Edu, na.rm = TRUE)) %>%
  slice(which.max(Leadership)) %>%
  ungroup()
Alternatively, we can use the top_n function (which yields the same result):
df %>%
  group_by(member_id, year) %>%
  mutate(Edu = sum(Edu, na.rm = TRUE)) %>%
  top_n(1, Leadership) %>%
  ungroup()
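A summarise-based sketch is also possible, assuming the remaining columns (gender, party, house_start, and so on) are constant within each member-year:
library(dplyr)
df %>%
  group_by(member_id, year, gender, party,
           house_start, Year.Entered, Years.in.parliament) %>%
  summarise(Leadership = any(Leadership),
            Edu = sum(Edu, na.rm = TRUE),
            .groups = "drop")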

Lagging a variable by adding up the previous 5 years?

I am working with data that look like this:
Country Year Aid
Angola 1995 416420000
Angola 1996 459310000
Angola 1997 354660000
Angola 1998 335270000
Angola 1999 387540000
Angola 2000 302210000
I want to create a lagged variable by adding up the previous five years in the data, so that the observation for 2000 looks like this:
Country Year Aid Lagged5
Angola 2000 302210000 1953200000
Which was derived by adding the Aid observations from 1995 to 1999 together:
416420000 + 459310000 + 354660000 + 335270000 + 387540000 = 1953200000
I will also need to group by country.
Thank you!
You could do:
library(dplyr)
df %>%
  group_by(Country) %>%
  mutate(Lagged5 = sapply(Year, function(x) sum(Aid[between(Year, x - 5, x - 1)])))
Output:
# A tibble: 6 x 4
# Groups: Country [1]
Country Year Aid Lagged5
<chr> <int> <int> <int>
1 Angola 1995 416420000 0
2 Angola 1996 459310000 416420000
3 Angola 1997 354660000 875730000
4 Angola 1998 335270000 1230390000
5 Angola 1999 387540000 1565660000
6 Angola 2000 302210000 1953200000
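As an aside, when years are consecutive with one row per Country and Year (true of the sample data), a cumulative-sum sketch reproduces the same partial sums without re-scanning Aid for every row:
library(dplyr)
df %>%
  group_by(Country) %>%
  mutate(csum = cumsum(Aid),
         # sum of previous 5 years = cumsum one row back minus cumsum six rows back
         Lagged5 = lag(csum, 1, default = 0) - lag(csum, 6, default = 0)) %>%
  select(-csum)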
Using the input DF shown reproducibly in the Note at the end, define a roll function which sums the prior 5 rows, and use ave to run it for each Country. The width argument list(-seq(5)) to rollapplyr means to use offsets -1, -2, -3, -4, -5 in the sum, i.e. the values in the prior 5 rows.
The question did not discuss what to do with the initial rows in each country, so we put in NA values; if you want partial sums, add the partial = TRUE argument to rollapplyr. You can also change fill = NA to some other value, so it is quite flexible.
library(zoo)
roll <- function(x) rollapplyr(x, list(-seq(5)), sum, fill = NA)
transform(DF, Lag5 = ave(Aid, Country, FUN = roll))
Note
The input was assumed to be the following. We added a second country.
Lines <- "Country Year Aid
Angola 1995 416420000
Angola 1996 459310000
Angola 1997 354660000
Angola 1998 335270000
Angola 1999 387540000
Angola 2000 302210000"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE,
                 colClasses = c("character", "integer", "numeric"))
DF <- rbind(DF, transform(DF, Country = "Belize"))

Looping through two dataframes and adding columns inside of the loop

I have a problem when specifying a loop with a data frame.
The general idea I have is the following:
I have an area which contains a certain number of raster quadrants. These raster quadrants have been visited irregularly over several years (e.g. from 1950 to 2015).
I have two data frames:
1) a data frame containing the IDs of the raster quadrants (and one column for the year of the first visit to each quadrant):
df1 <- as.data.frame(cbind(c("12345", "12346", "12347", "12348"), rep(NA, 4)))
df1[, 1] <- as.character(df1[, 1])
df1[, 2] <- as.numeric(df1[, 2])
names(df1) <- c("Raster_Q", "First_visit")
2) a data frame that contains the info on the visits; this one is ordered first by raster quadrant and then by year. This data frame records which raster quadrant was visited in which year.
df2 <- as.data.frame(cbind(c(rep("12345", 5), rep("12346", 7), rep("12347", 3), rep("12348", 9)),
                           c(1950, 1952, 1955, 1967, 1951, 1968, 1970,
                             1998, 2001, 2014, 2015, 2017, 1965, 1986, 2000, 1952, 1955,
                             1957, 1965, 2003, 2014, 2015, 2016, 2017)))
df2[, 1] <- as.character(df2[, 1])
df2[, 2] <- as.numeric(as.character(df2[, 2]))
names(df2) <- c("Raster_Q", "Year")
I want to know when and how often the full area was 'sampled'.
[Figure: scheme of what I want to do; different colors indicate different areas/regions]
My rationale:
I sorted the complete data in df2 by quadrant and year. I then match the raster quadrant in df1 with the name of the raster quadrant in df2 and add the first value of Year from df2.
For this I wrote a loop (see below)
In order not to revisit a quadrant, I created a vector visited:
visited <- c()
Every entry of df2 that matches df1 is written into this vector, so that the second entry of e.g. raster quadrant "12345" in df2 is ignored in the loop.
Here comes the loop:
visited <- c()
for (i in 1:nrow(df2)) {
  index <- which(df1$Raster_Q == df2$Raster_Q[i])
  if (length(index) == 0) next
  if (df1$Raster_Q[index] %in% visited) next
  df1$First_visit[index] <- df2$Year[i]
  visited[index] <- df1$Raster_Q[index]
}
This gives me the first full sampling period.
Raster_Q First_visit
1 12345 1950
2 12346 1968
3 12347 1965
4 12348 1952
However, I want to have all full sampling periods.
So I do:
df1$"Second_visit"<-NA
I reset the visited vector and specify the following loop:
visited <- c()
for (i in 1:nrow(df2)) {
  if (df2$Year[i] <= max(df1$First_visit)) next
  index <- which(df1$Raster_Q == df2$Raster_Q[i])
  if (length(index) == 0) next
  if (df1$Raster_Q[index] %in% visited) next
  df1$Second_visit[index] <- df2$Year[i]
  visited[index] <- df1$Raster_Q[index]
}
This is basically the same loop as before, except that it skips any year that is not later than the latest First_visit, so years already covered by the first sampling period are ignored.
That gives me the second full sampling period:
Raster_Q First_visit Second_visit
1 12345 1950 NA
2 12346 1968 1970
3 12347 1965 1986
4 12348 1952 2003
Okay, so far so good. I could do that all by hand. But I have loads and loads of raster quadrants and several areas that can and should be screened in this way.
So doing all of this in a single loop would be really great! However, I realized that this creates a problem because the loop then becomes self-referential:
The added column will not be seen by subsequent iterations, because df1 itself is not re-read on each pass; in consequence, the new column for the new sampling period will not be included in the following iterations:
visited <- c()
for (i in 1:nrow(df2)) {
  m <- ncol(df1)
  index <- which(df1$Raster_Q == df2$Raster_Q[i])
  if (length(index) == 0) next
  if (df1$Raster_Q[index] %in% visited) next
  df1[index, m] <- df2$Year[i]
  visited[index] <- df1$Raster_Q[index]
  # finish "first visit"
  df1[, m + 1] <- NA
  # add column for "second visit"
  if (df2$Year[i] <= max(df1$First_visit)) next
  # make sure that the first-visit years are not included
  index <- which(df1$Raster_Q == df2$Raster_Q[i])
  if (length(index) == 0) next
  if (df1$Raster_Q[index] %in% visited) next
  df1[index, m + 1] <- df2$Year[i]
  visited[index] <- df1$Raster_Q[index]
}
This won't work. Another issue is that the vector visited is not emptied during this loop, so basically every Raster_Q has already been visited by the time the second sampling period starts.
I am stuck.... any ideas?
You can do this without a for loop by using the dplyr and tidyr packages. First, take your df2 and use dplyr::arrange to order by raster and year. Then rank the years visited using the rank function inside dplyr::mutate. Then, using tidyr::spread, you can put them all in their own columns. Here is the code:
library(dplyr)
df <- df2 %>%
  arrange(Raster_Q, Year) %>%
  group_by(Raster_Q) %>%
  mutate(visit = rank(Year),
         visit = paste0("visit_", as.character(visit))) %>%
  tidyr::spread(key = visit, value = Year)
Here is the output:
> df
# A tibble: 4 x 10
# Groups: Raster_Q [4]
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5 visit_6 visit_7 visit_8 visit_9
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 1951 1952 1955 1967 NA NA NA NA
2 12346 1968 1970 1998 2001 2014 2015 2017 NA NA
3 12347 1965 1986 2000 NA NA NA NA NA NA
4 12348 1952 1955 1957 1965 2003 2014 2015 2016 2017
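Side note: spread has since been superseded by tidyr's pivot_wider; a sketch of the same reshaping (assuming one visit per quadrant-year, so row_number() can stand in for rank()):
library(dplyr)
library(tidyr)
df2 %>%
  arrange(Raster_Q, Year) %>%
  group_by(Raster_Q) %>%
  mutate(visit = paste0("visit_", row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = visit, values_from = Year)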
EDIT: So I think I understand your problem a little better now. You are looking to remove all duplicate visits to each quadrant that happened before the maximum Year of each respective "round" of visits. So to accomplish this, I wrote a short function that in essence does what the code above does, but with a slight change. Here is the function:
filter_by_round <- function(data, round) {
  output <- data %>%
    arrange(Raster_Q, Year) %>%
    group_by(Raster_Q) %>%
    mutate(visit = rank(Year, ties.method = "first")) %>%
    ungroup() %>%
    mutate(in_round = ifelse(Year <= max(.$Year[.$visit == round]) & visit > round,
                             TRUE, FALSE)) %>%
    filter(!in_round) %>%
    select(-c(in_round, visit))
  return(output)
}
What this function does is look through the data and, if a given year is less than the max year for the specified "visit round", remove it. To apply this only to the first round, you would do this:
df2 %>%
  filter_by_round(1) %>%
  group_by(Raster_Q) %>%
  mutate(visit = rank(Year, ties.method = "first")) %>%
  ungroup() %>%
  mutate(visit = paste0("visit_", as.character(visit))) %>%
  tidyr::spread(key = visit, value = Year)
which would give you this:
# A tibble: 4 x 8
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5 visit_6 visit_7
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 NA NA NA NA NA NA
2 12346 1968 1970 1998 2001 2014 2015 2017
3 12347 1965 1986 2000 NA NA NA NA
4 12348 1952 2003 2014 2015 2016 2017 NA
However, while it does accomplish what your for loop would have, you now have other occurrences of the same problem. I have come up with a way to do this successfully but it requires you to know how many "visit rounds" you had or some trial and error. To accomplish this, you can use map and assign the change to a global variable.
# I do this so we do not lose the original dataset
df <- df2
# I chose 1:5 after some trial and error showed there are 5 unique
# "visit rounds" in your toy dataset.
# However, if you overshoot your number, it should still work;
# you will just get warnings about `max` not working correctly.
# That may cause issues, though, so figuring out your exact number is
# recommended.
purrr::map(1:5, function(x) {
  # this assigns the output of each iteration to the global variable df
  df <<- df %>%
    filter_by_round(x)
})
# now applying the original transformation to get the spread dataset
df %>%
  group_by(Raster_Q) %>%
  mutate(visit = rank(Year, ties.method = "first")) %>%
  ungroup() %>%
  mutate(visit = paste0("visit_", as.character(visit))) %>%
  tidyr::spread(key = visit, value = Year)
This will give you the following output:
# A tibble: 4 x 6
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 NA NA NA NA
2 12346 1968 1970 2014 2015 2017
3 12347 1965 1986 NA NA NA
4 12348 1952 2003 2014 2015 2016
Granted, this is probably not the most elegant solution, but it works. Hopefully this solves the problem for you.
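For what it's worth, the round-by-round logic from the question can also be written as one repeat loop over the visit table (a sketch, assuming df2 as defined above): each round takes the first remaining visit per quadrant, and the next round only considers years after that round's latest first visit.
library(dplyr)
visits <- df2 %>% arrange(Raster_Q, Year)
rounds <- list()
repeat {
  # first remaining visit per quadrant = this round's visit years
  round_i <- visits %>% group_by(Raster_Q) %>% slice(1) %>% ungroup()
  rounds[[length(rounds) + 1]] <- round_i
  # the next round starts strictly after this round's latest first visit
  visits <- visits %>% filter(Year > max(round_i$Year))
  if (nrow(visits) == 0) break
}
# rounds[[1]] corresponds to First_visit, rounds[[2]] to Second_visit, and so on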

R: How to spread, group_by, summarise and mutate at the same time

I want to spread the data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName', and then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...
data %>%
  spread(Year, Orders) %>%
  group_by(CountryName) %>%
  summarise_all(.funs = c(Sum = 'sum'), na.rm = TRUE) %>%
  mutate(percent_inc = 100 * ((`2014_Sum` - `2015_Sum`) / `2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
library(dplyr)
library(tidyr)
set.seed(2)
dat = data.frame(Country = sample(LETTERS[1:5], 500, replace = TRUE),
                 Year = sample(2014:2015, 500, replace = TRUE),
                 Orders = sample(-1:20, 500, replace = TRUE))
dat %>% group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm = TRUE)) %>%
  spread(Year, sum_orders) %>%
  mutate(Pct = (`2014` - `2015`) / `2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country = sample(LETTERS[1:5], 500, replace = TRUE),
                 Year = sample(2010:2015, 500, replace = TRUE),
                 Orders = sample(-1:20, 500, replace = TRUE))
dat %>% group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm = TRUE)) %>%
  group_by(Country) %>%
  arrange(Country, Year) %>%
  mutate(Pct = c(NA, -diff(sum_orders)) / lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1 You're getting the error duplicate identifiers for rows likely because of spread. spread wants to make N columns from your N unique values, but it needs to know in which unique row to place each value. If you have duplicate value-combinations, for instance:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
shows up twice, then spread cannot tell which row it should place the data in. The quick fix is to add data %>% mutate(row = row_number()) %>% spread... so each row has a unique identifier before spreading.
Error 2 You're getting the error sum not meaningful for factors likely because of summarise_all. summarise_all will operate on all columns, but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarise(`2014_Sum` = sum(`2014`), `2015_Sum` = sum(`2015`)).
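Putting both fixes together, a sketch of the whole pipeline that sums first (as in the answer above) and therefore avoids both errors; column names are taken from the question:
library(dplyr)
library(tidyr)
data %>%
  group_by(CountryName, Year) %>%
  summarise(Orders = sum(Orders, na.rm = TRUE)) %>%
  spread(Year, Orders) %>%
  mutate(percent_inc = 100 * (`2014` - `2015`) / `2014`) %>%
  select(CountryName, percent_inc)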

How do I melt or reshape binned data in R? [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
I have binned data reflecting the width of rivers across each continent. Below is a sample dataset; I pretty much just want to get the data into the summary form shown after it.
dat <- read.table(text =
"width continent bin
5.32 Africa 10
6.38 Africa 10
10.80 Asia 20
9.45 Africa 10
22.66 Africa 30
9.45 Asia 10",header = TRUE)
How do I melt the above toy dataset to create this dataframe?
Bin Count Continent
10 3 Africa
10 1 Asia
20 1 Asia
30 1 Africa
We could use any one of the aggregate-by-group approaches.
The data.table option is to convert the 'data.frame' to a 'data.table' (setDT(dat)); grouped by the 'continent' and 'bin' variables, we get the number of elements per group (.N):
library(data.table)
setDT(dat)[, list(Count = .N), .(continent, bin)]
# continent bin Count
#1: Africa 10 3
#2: Asia 20 1
#3: Africa 30 1
#4: Asia 10 1
Or a similar option with dplyr: group by the variables and then use n() instead of .N to get the count.
library(dplyr)
dat %>%
  group_by(continent, bin) %>%
  summarise(Count = n())
Or we can use aggregate from base R; using the formula method, we get the length:
aggregate(cbind(Count = width) ~ ., dat, FUN = length)
# continent bin Count
#1 Africa 10 3
#2 Asia 10 1
#3 Asia 20 1
#4 Africa 30 1
From #Frank's and #David Arenburg's comments, some additional options using data.table and dplyr. We convert the dataset to a data.table (setDT(dat)), convert to 'wide' format with dcast, then reconvert it back to 'long' using melt, and subset the rows (value > 0):
library(data.table)
melt(dcast(setDT(dat), continent ~ bin))[value > 0]
Using count from dplyr:
library(dplyr)
count(dat, bin, continent)
With sqldf:
library(sqldf)
sqldf("SELECT bin, continent, COUNT(continent) AS count
FROM dat
GROUP BY bin, continent")
Output:
bin continent count
1 10 Africa 3
2 10 Asia 1
3 20 Asia 1
4 30 Africa 1
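For completeness, base R's table() gives the same counts; a minimal sketch (the subset drops zero-count combinations, mirroring the outputs above):
counts <- as.data.frame(table(dat$bin, dat$continent))
names(counts) <- c("bin", "continent", "count")
counts[counts$count > 0, ]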
