R - cumulative sum by condition

So I have a dataset which simplified looks something like this:
Year ID Sum
2009 999 100
2009 123 85
2009 666 100
2009 999 100
2009 123 90
2009 666 85
2010 999 100
2010 123 100
2010 666 95
2010 999 75
2010 123 100
2010 666 85
I'd like to add a column with the cumulative sum, by year and ID. Like this:
Year ID Sum Cum.Sum
2009 999 100 100
2009 123 85 85
2009 666 100 100
2009 999 100 200
2009 123 90 175
2009 666 85 185
2010 999 100 100
2010 123 100 100
2010 666 95 95
2010 999 75 175
2010 123 100 200
2010 666 85 180
I think this should be pretty straight-forward, but somehow I haven't been able to figure it out. How do I do this? Thanks for the help!

Using data.table:
require(data.table)
DT <- data.table(DF)
DT[, Cum.Sum := cumsum(Sum), by=list(Year, ID)]
Year ID Sum Cum.Sum
1: 2009 999 100 100
2: 2009 123 85 85
3: 2009 666 100 100
4: 2009 999 100 200
5: 2009 123 90 175
6: 2009 666 85 185
7: 2010 999 100 100
8: 2010 123 100 100
9: 2010 666 95 95
10: 2010 999 75 175
11: 2010 123 100 200
12: 2010 666 85 180

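Another way, using plyr. Note that steps 1 and 2 attach each group's total to every row rather than a running sum:
1) use ddply to sum a variable by group (similar to SQL GROUP BY)
library(plyr)
X <- ddply(dataset, .(Year, ID), summarise, Total = sum(Sum))
2) merge the result with dataset
Y <- merge(dataset, X, by = c('Year', 'ID'))
3) for a running sum within each group, use transform with cumsum instead
Z <- ddply(dataset, .(Year, ID), transform, Cum.Sum = cumsum(Sum))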

You can use dplyr, and the base function cumsum:
require(dplyr)
dataset %>%
group_by(Year, ID) %>%
mutate(cumsum = cumsum(Sum)) %>%
ungroup()
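If you would rather not load a package, the same grouped running sum can be sketched in base R with ave(). This is only a minimal sketch: the data frame below is retyped from the question, and the name dataset is borrowed from the answers above.

```r
# Sample data retyped from the question
dataset <- data.frame(
  Year = rep(c(2009, 2010), each = 6),
  ID   = rep(c(999, 123, 666), times = 4),
  Sum  = c(100, 85, 100, 100, 90, 85, 100, 100, 95, 75, 100, 85)
)

# ave() applies FUN within each Year/ID combination and returns
# the results in the original row order
dataset$Cum.Sum <- ave(dataset$Sum, dataset$Year, dataset$ID, FUN = cumsum)
dataset
```

Because ave() keeps the original row order, the new column lines up with the rows as they appear in the question.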

Related

Re-code and spread data in columns, based on values in another column

I have a table that looks like this:
Year Tax1 Tax2 Tax3 Tax4
2004 12 123 145 104
2004 145 99 90 56
2005 212 300 240 123
etc...
The Tax# columns give info about the tax paid in years subsequent to the value in the Year column. I would like to re-arrange the table, and rename the columns, so it looked like this:
Year Tax2004 Tax2005 Tax2006 Tax2007 Tax2008
2004 12 123 145 104 NA
2004 145 99 90 56 NA
2005 NA 212 300 240 123
I was thinking of splitting the table into separate tables based on the Year column, then renaming the Tax# columns and joining them back together. But it's a bit convoluted, and I was wondering if there is a simpler way to do this?
Any help much appreciated.
library(dplyr)
library(tidyr)
df <- read.table(text = "
Year Tax1 Tax2 Tax3 Tax4
2004 12 123 145 104
2004 145 99 90 56
2005 212 300 240 123
", header = TRUE)
df %>%
mutate(id = row_number()) %>%
gather(rel_year, amount, contains("Tax")) %>%
mutate(rel_year = as.integer(gsub("Tax", "", rel_year)),
pay_year = Year + rel_year - 1,
pay_year = paste0("Tax", pay_year)) %>%
select(-rel_year) %>%
spread(pay_year, amount)
Result:
Year id Tax2004 Tax2005 Tax2006 Tax2007 Tax2008
1 2004 1 12 123 145 104 NA
2 2004 2 145 99 90 56 NA
3 2005 3 NA 212 300 240 123
dat1%>%
gather(key,value,-Year)%>%
group_by(key)%>%
mutate(col=1:n())%>%
ungroup()%>%
mutate(key=paste0("Tax",2004:2008)[(Year==2005)+
as.numeric(sub("\\D+","",key))])%>%
spread(key,value)
# A tibble: 3 x 7
Year col Tax2004 Tax2005 Tax2006 Tax2007 Tax2008
<int> <int> <int> <int> <int> <int> <int>
1 2004 1 12 123 145 104 NA
2 2004 2 145 99 90 56 NA
3 2005 3 NA 212 300 240 123
Here is an option using data.table
library(data.table)
library(readr)
dcast(melt(setDT(df, keep.rownames = TRUE), id.var = c("rn", "Year"))[,
newYear := paste0("Tax", Year + parse_number(variable) - 1)],
rn + Year~ newYear, value.var = 'value')[, rn := NULL][]
# Year Tax2004 Tax2005 Tax2006 Tax2007 Tax2008
#1: 2004 12 123 145 104 NA
#2: 2004 145 99 90 56 NA
#3: 2005 NA 212 300 240 123
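For comparison, here is a rough base R sketch of the same idea: work out which calendar year each Tax# column refers to (Year + # - 1) and write the value into that column of an empty matrix. The helper names (tax_cols, offsets, pay_years, out) are made up for this illustration.

```r
# Sample data retyped from the question
df <- data.frame(
  Year = c(2004, 2004, 2005),
  Tax1 = c(12, 145, 212),
  Tax2 = c(123, 99, 300),
  Tax3 = c(145, 90, 240),
  Tax4 = c(104, 56, 123)
)

tax_cols <- grep("^Tax", names(df), value = TRUE)
offsets  <- as.integer(sub("Tax", "", tax_cols)) - 1   # 0, 1, 2, 3

# Every calendar year that a payment can fall in
pay_years <- seq(min(df$Year), max(df$Year) + max(offsets))

# Empty result: one row per input row, one column per payment year
out <- matrix(NA_real_, nrow = nrow(df), ncol = length(pay_years),
              dimnames = list(NULL, paste0("Tax", pay_years)))

# Write each Tax# column into the cell for Year + # - 1
for (j in seq_along(tax_cols)) {
  target <- match(df$Year + offsets[j], pay_years)
  out[cbind(seq_len(nrow(df)), target)] <- df[[tax_cols[j]]]
}

result <- cbind(df["Year"], out)
result
```

Cells that no Tax# column maps to stay NA, which is what produces the NA entries in the desired output.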

Creating new column based on row values of multiple data subsetting conditions

I have a dataframe that looks more or less as follows (the original one has 12 years of data):
Year Quarter Age_1 Age_2 Age_3 Age_4
2005 1 158 120 665 32
2005 2 257 145 121 14
2005 3 68 69 336 65
2005 4 112 458 370 101
2006 1 75 457 741 26
2006 2 365 134 223 45
2006 3 257 121 654 341
2006 4 175 124 454 12
2007 1 697 554 217 47
2007 2 954 987 118 54
2007 4 498 235 112 65
Where the numbers in the age columns represent the number of individuals in each age class for a specific quarter within a specific year. Note that not every quarter in a year necessarily has data (e.g., the third quarter is missing in 2007). Also, each row represents a sampling event. Although not shown in this example, the original dataset always has more than one sampling event for a specific quarter within a specific year. For example, for the first quarter of 2005 I have 47 sampling events, and therefore 47 rows.
What I'd like to have now is a dataframe structured like this:
Year Quarter Age_1 Age_2 Age_3 Age_4 Cohort
2005 1 158 120 665 32 158
2005 2 257 145 121 14 257
2005 3 68 69 336 65 68
2005 4 112 458 370 101 112
2006 1 75 457 741 26 457
2006 2 365 134 223 45 134
2006 3 257 121 654 341 121
2006 4 175 124 454 12 124
2007 1 697 554 217 47 47
2007 2 954 987 118 54 54
2007 4 498 235 112 65 65
In this case, I want to create a new column (Cohort) in my original dataset that follows my cohorts through the dataset. In other words, in my first year of data (2005, with all quarters), I take the row values of Age_1 and paste them into the new column. When I move to the next year (2006), I take all the row values of Age_2 and paste them into the new column, and so on.
I have tried to use the following function, but somehow it only works for the first couple of years:
extract_cohort_quarter <- function(d, yearclass=2005, quarterclass=1) {
ny <- 1:nlevels(d$Year) #no. of Year levels in the dataset
nq <- 1:nlevels(d$Quarter)
age0 <- (paste("age", ny, sep="_"))
year0 <- as.character(yearclass + ny - 1)
quarter <- as.character(rep(1:4, length(age0)))
age <- rep(age0,each=4)
year <- rep(year0,each=4)
df <- data.frame(year,age,quarter,stringsAsFactors=FALSE)
n <- nrow(df)
dnew <- NULL
for(i in 1:n) {
tmp <- subset(d, Year==df$year[i] & Quarter==df$quarter[i])
tmp$Cohort <- tmp[[age[i]]]
dnew <- rbind(dnew, tmp)
}
levels(dnew$Year) <- paste("Yearclass_", yearclass, ":",
year,":",quarter,":", age, sep="")
dnew
}
I have plenty of data from age_1 to age_12 for all the years and quarters, so I don't think it's something related to the data structure itself.
Is there an easier solution to solve this problem? Or is there a way to improve my extract_cohort_quarter() function? Any help will be much appreciated.
-M
I have a simple solution, but it requires a bit of knowledge of the data.table library. I think you can easily adapt it to your further needs.
Here is the data:
library(data.table)
DT <- as.data.table(list(Year = c(2005, 2005, 2005, 2005, 2006, 2006 ,2006 ,2006, 2007, 2007, 2007),
Quarter= c(1, 2, 3, 4 ,1 ,2 ,3 ,4 ,1 ,2 ,4),
Age_1 = c(158, 257, 68, 112 ,75, 365, 257, 175, 697 ,954, 498),
Age_2= c(120 ,145 ,69 ,458 ,457, 134 ,121 ,124 ,554 ,987, 235),
Age_3= c(665 ,121 ,336 ,370 ,741 ,223 ,654 ,454,217,118,112),
Age_4= c(32,14,65,101,26,45,341,12,47,54,65)
))
Here is the code:
DT[,index := .GRP, by = Year]
DT[,cohort := get(paste0("Age_",index)),by = Year]
and the output:
> DT
Year Quarter Age_1 Age_2 Age_3 Age_4 index cohort
1: 2005 1 158 120 665 32 1 158
2: 2005 2 257 145 121 14 1 257
3: 2005 3 68 69 336 65 1 68
4: 2005 4 112 458 370 101 1 112
5: 2006 1 75 457 741 26 2 457
6: 2006 2 365 134 223 45 2 134
7: 2006 3 257 121 654 341 2 121
8: 2006 4 175 124 454 12 2 124
9: 2007 1 697 554 217 47 3 217
10: 2007 2 954 987 118 54 3 118
11: 2007 4 498 235 112 65 3 112
What it does:
DT[,index := .GRP, by = Year]
creates an index for each distinct year in your table (by = Year runs the operation once per year group, and .GRP is the group counter, numbered in order of appearance).
I then use that index to build the name of the matching Age_ column and fetch it with get():
DT[,cohort := get(paste0("Age_",index)),by = Year]
You can even do everything in a single line:
DT[,cohort := get(paste0("Age_",.GRP)),by = Year]
I hope it helps
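The same "year 1 takes Age_1, year 2 takes Age_2" lookup can also be sketched in base R by indexing a matrix with (row, column) pairs. Treat this as an illustration only: df1 is retyped from the question, and year_index and age_col are names I made up.

```r
# Sample data retyped from the question
df1 <- data.frame(
  Year    = c(2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006, 2007, 2007, 2007),
  Quarter = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 4),
  Age_1   = c(158, 257, 68, 112, 75, 365, 257, 175, 697, 954, 498),
  Age_2   = c(120, 145, 69, 458, 457, 134, 121, 124, 554, 987, 235),
  Age_3   = c(665, 121, 336, 370, 741, 223, 654, 454, 217, 118, 112),
  Age_4   = c(32, 14, 65, 101, 26, 45, 341, 12, 47, 54, 65)
)

# Position of each row's year among the distinct years: 1 for 2005, 2 for 2006, ...
year_index <- match(df1$Year, sort(unique(df1$Year)))

# Column holding the cohort for each row (Age_1 in year 1, Age_2 in year 2, ...)
age_col <- match(paste0("Age_", year_index), names(df1))

# A two-column (row, column) index matrix picks exactly one cell per row
df1$Cohort <- as.matrix(df1)[cbind(seq_len(nrow(df1)), age_col)]
df1
```

One assumption to be aware of: the index counts distinct years, so if a whole year were missing from the data, every later year would be shifted down by one age class. Using df1$Year - min(df1$Year) + 1 as the index would avoid that.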
Here is an option using tidyverse
library(dplyr)
library(tidyr)
df1 %>%
gather(key, Cohort, -Year, -Quarter) %>%
separate(key, into = c('key1', 'key2')) %>%
mutate(ind = match(Year, unique(Year))) %>%
group_by(Year) %>%
filter(key2 == Quarter[ind]) %>%
mutate(newcol = paste(Year, Quarter, paste(key1, ind, sep="_"), sep=":")) %>%
ungroup %>%
select(Cohort, newcol) %>%
bind_cols(df1, .)
# Year Quarter Age_1 Age_2 Age_3 Age_4 Cohort newcol
#1 2005 1 158 120 665 32 158 2005:1:Age_1
#2 2005 2 257 145 121 14 257 2005:2:Age_1
#3 2005 3 68 69 336 65 68 2005:3:Age_1
#4 2005 4 112 458 370 101 112 2005:4:Age_1
#5 2006 1 75 457 741 26 457 2006:1:Age_2
#6 2006 2 365 134 223 45 134 2006:2:Age_2
#7 2006 3 257 121 654 341 121 2006:3:Age_2
#8 2006 4 175 124 454 12 124 2006:4:Age_2
#9 2007 1 697 554 217 47 47 2007:1:Age_3
#10 2007 2 954 987 118 54 54 2007:2:Age_3
#11 2007 4 498 235 112 65 65 2007:4:Age_3

Selecting unique non-repeating values

I have some panel data from 2004-2007 which I would like to select according to unique values. To be more precise, I'm trying to find the entries and exits of individual stores throughout the period. Data sample:
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
2 2004 35000 800 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
4 2005 17500 320 136
4 2006 17500 320 136
4 2007 17500 320 136
5 2005 45000 580 191
5 2006 45000 580 191
5 2007 45000 580 191
6 2004 7000 345 191
6 2005 7000 345 191
6 2006 7000 345 191
7 2007 10000 500 191
So for instance I would like to find out how many stores have exited the market throughout the period, which should look like:
store year rev space market
2 2004 35000 800 136
6 2006 7000 345 191
As well as how many stores have entered the market, which would imply:
store year rev space market
4 2005 17500 320 136
5 2005 45000 580 191
7 2007 10000 500 191
UPDATE:
I didn't include that it also should assume incumbent stores, such as:
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
Since I'm pretty new to R, I've been struggling to do it right even on a year-by-year basis. Any suggestions?
Using the data.table package, if your data.frame is called df:
library(data.table)
dt = data.table(df)
exit = dt[,list(ExitYear = max(year)),by=store]
exit = exit[ExitYear != 2007] #Or whatever the "current year" is for this table
enter = dt[,list(EntryYear = min(year)),by=store]
enter = enter[EntryYear != 2004] #Stores already present in the first year of the panel are not entrants
UPDATE
To get all columns instead of just the year and store, you can do:
exit = dt[,.SD[year == max(year)], by=store]
exit[year != 2007]
store year rev space market
1: 2 2004 35000 800 136
2: 6 2006 7000 345 191
Using only base R functions, this is pretty simple:
> subset(aggregate(df["year"],df["store"],max),year!=2007)
store year
2 2 2004
6 6 2006
and
> subset(aggregate(df["year"],df["store"],min),year!=2004)
store year
4 4 2005
5 5 2005
7 7 2007
or using formula syntax:
> subset(aggregate(year~store,df,max),year!=2007)
store year
2 2 2004
6 6 2006
and
> subset(aggregate(year~store,df,min),year!=2004)
store year
4 4 2005
5 5 2005
7 7 2007
Update: Getting all the columns isn't possible with aggregate, so we can use base R's by() instead. by() isn't as clever at reassembling the result, though, and gives back one piece per store:
Filter(function(x)x$year!=2007,by(df,df$store,function(s)s[s$year==max(s$year),]))
$`2`
store year rev space market
5 2 2004 35000 800 136
$`6`
store year rev space market
18 6 2006 7000 345 191
So we need to take that step - let's build a little wrapper:
by2=function(x,c,...){Reduce(rbind,by(x,x[c],simplify=FALSE,...))}
And now use that instead:
> subset(by2(df,"store",function(s)s[s$year==max(s$year),]),year!=2007)
store year rev space market
5 2 2004 35000 800 136
18 6 2006 7000 345 191
We can further clarify this by creating a function for getting a row which has the stat (min or max) for a particular column:
statmatch=function(column,stat)function(df){df[df[column]==stat(df[column]),]}
> subset(by2(df,"store",statmatch("year",max)),year!=2007)
store year rev space market
5 2 2004 35000 800 136
18 6 2006 7000 345 191
Dplyr
Using all of these base functions which don't really resemble each other starts to get fiddly after a while, so it's a great idea to learn and use the excellent (and performant) dplyr package:
> df %>% group_by(store) %>%
arrange(-year) %>% slice(1) %>%
filter(year != 2007) %>% ungroup
Source: local data frame [2 x 5]
store year rev space market
1 2 2004 35000 800 136
2 6 2006 7000 345 191
and
> df %>% group_by(store) %>%
arrange(+year) %>% slice(1) %>%
filter(year != 2004) %>% ungroup
Source: local data frame [3 x 5]
store year rev space market
1 4 2005 17500 320 136
2 5 2005 45000 580 191
3 7 2007 10000 500 191
NB The ungroup is not strictly necessary here, but puts the table back in a default state for further calculations.
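None of the answers above covers the incumbents from the question's update, so here is a hedged base R sketch that labels every row as entrant, exiter or incumbent in one pass. The flag column names are my own invention, and only the store and year columns are retyped from the question (rev, space and market are left out to keep it short).

```r
# store and year retyped from the question's sample data
df <- data.frame(
  store = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7),
  year  = c(2004:2007, 2004, 2004:2007, 2005:2007, 2005:2007, 2004:2006, 2007)
)

first_year <- min(df$year)   # start of the panel (2004)
last_year  <- max(df$year)   # end of the panel (2007)

# First and last observed year of each store, repeated on every row of that store
df$entry_year <- ave(df$year, df$store, FUN = min)
df$exit_year  <- ave(df$year, df$store, FUN = max)

# Hypothetical flags; a store could in principle be both an entrant and an exiter
df$entrant   <- df$entry_year > first_year
df$exiter    <- df$exit_year  < last_year
df$incumbent <- !df$entrant & !df$exiter

unique(df$store[df$incumbent])   # stores observed in both the first and the last year
```

Filtering on df$incumbent then returns all rows of the incumbent stores. Note that this definition only looks at the first and last observation of a store, so a store with a gap in the middle would still count as an incumbent.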

diff operation within a group, after a dplyr::group_by()

Let's say I have this data.frame (with 3 variables)
ID Period Score
123 2013 146
123 2014 133
23 2013 150
456 2013 205
456 2014 219
456 2015 140
78 2012 192
78 2013 199
78 2014 133
78 2015 170
Using dplyr I can group them by ID and filter these ID that appear more than once
data <- data %>% group_by(ID) %>% filter(n() > 1)
Now, what I like to achieve is to add a column that is:
Difference = Score of Period P - Score of Period P-1
to get something like this:
ID Period Score Difference
123 2013 146
123 2014 133 -13
456 2013 205
456 2014 219 14
456 2015 140 -79
78 2012 192
78 2013 199 7
78 2014 133 -66
78 2015 170 37
It is rather trivial to do this in a spreadsheet, but I have no idea on how I can achieve this in R.
Thanks for any help or guidance.
Here is another solution using lag. Depending on the use case it might be more convenient than diff, because the NAs clearly show that a particular value did not have a predecessor, whereas a 0 using diff might be the result of a) a missing predecessor or b) the subtraction between two periods.
data %>% group_by(ID) %>% filter(n() > 1) %>%
mutate(
Difference = Score - lag(Score)
)
# ID Period Score Difference
# 1 123 2013 146 NA
# 2 123 2014 133 -13
# 3 456 2013 205 NA
# 4 456 2014 219 14
# 5 456 2015 140 -79
# 6 78 2012 192 NA
# 7 78 2013 199 7
# 8 78 2014 133 -66
# 9 78 2015 170 37
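A base R variant of the same idea, in case it is useful: it is only a sketch, with the data retyped from the question, and it pads diff() with a leading NA so that the first period of each ID is marked the same way as with lag.

```r
# Sample data retyped from the question
data <- data.frame(
  ID     = c(123, 123, 23, 456, 456, 456, 78, 78, 78, 78),
  Period = c(2013, 2014, 2013, 2013, 2014, 2015, 2012, 2013, 2014, 2015),
  Score  = c(146, 133, 150, 205, 219, 140, 192, 199, 133, 170)
)

# Keep only IDs that occur more than once
data <- data[ave(data$Score, data$ID, FUN = length) > 1, ]

# Sort so that diff() compares consecutive periods within each ID
data <- data[order(data$ID, data$Period), ]

# diff() returns one value fewer than its input, so pad the first row with NA
data$Difference <- ave(data$Score, data$ID, FUN = function(x) c(NA, diff(x)))
data
```

The explicit sort matters here: both diff and lag compare neighbouring rows, so the periods need to be in order within each ID before the difference means "this period minus the previous one".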

selecting observations based on a condition depended on a grouped variable

I have a question that I am hoping someone will help me answer. I have a data set ordered by parasites and year that looks something like this (the actual dataset is much larger):
parasites year samples
1000 2000 11
910 2000 22
878 2000 13
999 2002 64
910 2002 75
710 2002 16
890 2004 29
810 2004 10
789 2004 9
876 2005 120
750 2005 12
624 2005 157
What I would like to do is, for every year, select the 2 samples with the highest number of parasites, to give an output that looks like this:
parasites year samples
1000 2000 11
910 2000 22
999 2002 64
910 2002 75
890 2004 29
810 2004 10
876 2005 120
750 2005 12
I am new to programming as a whole and still trying to find my way around R. Can someone please explain to me how I would go about this? Thanks so much.
How about with data.table:
parasites<-read.table(header=T,text="parasites year samples
1000 2000 11
910 2000 22
878 2000 13
999 2002 64
910 2002 75
710 2002 16
890 2004 29
810 2004 10
789 2004 9
876 2005 120
750 2005 12
624 2005 157")
EDIT - sorry sorted by parasites, not samples
require(data.table)
data.table(parasites)[,.SD[order(-parasites)][1:2],by="year"]
Note .SD is the sub-table for each year value as set in by=
year parasites samples
1: 2000 1000 11
2: 2000 910 22
3: 2002 999 64
4: 2002 910 75
5: 2004 890 29
6: 2004 810 10
7: 2005 876 120
8: 2005 750 12
Here is a R-base solution (if you need it):
data = data.frame("parasites"=c(1000,910,878,999,910,710,890,810,789,876,750,624),
"year"=c(2000,2000,2000,2002,2002,2002,2004,2004,2004,2005,2005,2005),
"samples"=c(11,22,13,64,75,16,29,10,9,120,12,157))
# Sort by parasites within each year so that tail() picks the two highest counts
data = data[order(data$year,data$parasites),]
data_list = lapply(unique(data$year),function(x) (tail(data[data$year==x,],n=2)))
final_data = do.call(rbind, Map(as.data.frame,data_list))
Hope that helps!
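Another base R possibility, offered as a sketch rather than the definitive way: rank the rows within each year and keep ranks 1 and 2, which leaves the rows in their original order. The variable names rank_in_year and top2 are made up for this example.

```r
# Sample data retyped from the question
parasites <- data.frame(
  parasites = c(1000, 910, 878, 999, 910, 710, 890, 810, 789, 876, 750, 624),
  year      = rep(c(2000, 2002, 2004, 2005), each = 3),
  samples   = c(11, 22, 13, 64, 75, 16, 29, 10, 9, 120, 12, 157)
)

# Rank rows within each year, 1 = highest parasite count.
# ties.method = "first" breaks ties by row order, so exactly two rows are kept.
rank_in_year <- ave(-parasites$parasites, parasites$year,
                    FUN = function(x) rank(x, ties.method = "first"))

top2 <- parasites[rank_in_year <= 2, ]
top2
```

Changing the 2 in the last filter gives the top n per year without touching anything else.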

Resources