Summing and discarding grouped variables - r

I have a dataframe of this form
familyid memberid year contract months
1 1 2000 1 12
1 1 2001 1 12
1 1 2002 1 12
1 1 2003 1 12
2 3 2000 2 12
2 3 2001 2 12
2 3 2002 2 12
2 3 2003 2 12
3 2 2000 1 5
3 2 2000 2 5
3 2 2001 1 12
3 2 2002 1 12
3 2 2003 1 12
4 1 2000 2 12
4 1 2001 2 12
4 1 2002 2 12
4 1 2003 2 12
5 2 2000 1 8
5 2 2001 1 12
5 2 2002 1 12
5 2 2003 1 4
5 2 2003 1 6
I want back a dataframe like
familyid memberid year contract months
1 1 2000 1 12
1 1 2001 1 12
1 1 2002 1 12
1 1 2003 1 12
2 3 2000 2 12
2 3 2001 2 12
2 3 2002 2 12
2 3 2003 2 12
4 1 2000 2 12
4 1 2001 2 12
4 1 2002 2 12
4 1 2003 2 12
5 2 2000 1 8
5 2 2001 1 12
5 2 2002 1 12
5 2 2003 1 10
Basically, I want to sum the variable months when the same familyid shows the same value for the variable "contract" (in my example I am summing 6 and 4 for familyid=5 in year=2003). However, I also want to discard familyids that show, during the same year, two different values for the variable contract (in my case I am discarding familyid=3, since it shows contract=1 and contract=2 in year=2000). The other observations should be kept unchanged.
Does anybody know how to do this?
Thanks to anyone helping me.
Marco

You mentioned that you wanted to get the total months within one family's single contract in one year, but also to remove the families entirely with more than one contract in a year. Here's one approach:
library(dplyr)
df2 <- df %>%
  group_by(familyid, memberid, year, contract) %>%
  summarize(months = sum(months, na.rm = TRUE)) %>%
  # We need this for the second part: how many contracts did this family have this year?
  mutate(contracts_this_yr = n()) %>%
  ungroup() %>%
  # Only include the families with no years of multiple contracts
  group_by(familyid, memberid) %>%
  filter(max(contracts_this_yr) < 2) %>%
  ungroup()
Output
df2
# A tibble: 16 x 5
familyid memberid year contract months
<int> <int> <int> <int> <int>
1 1 1 2000 1 12
2 1 1 2001 1 12
3 1 1 2002 1 12
4 1 1 2003 1 12
5 2 3 2000 2 12
6 2 3 2001 2 12
7 2 3 2002 2 12
8 2 3 2003 2 12
9 4 1 2000 2 12
10 4 1 2001 2 12
11 4 1 2002 2 12
12 4 1 2003 2 12
13 5 2 2000 1 8
14 5 2 2001 1 12
15 5 2 2002 1 12
16 5 2 2003 1 10
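For completeness, the same logic can be sketched with data.table. This is a hypothetical sketch, not part of the original answer: the data is assumed to live in a data.table called dt, and only families 3 and 5 from the question are reproduced here.

```r
library(data.table)

# Subset of the question's data (families 3 and 5 only)
dt <- data.table(
  familyid = c(3, 3, 3, 3, 3, 5, 5, 5, 5, 5),
  memberid = rep(2, 10),
  year     = c(2000, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2003),
  contract = c(1, 2, 1, 1, 1, 1, 1, 1, 1, 1),
  months   = c(5, 5, 12, 12, 12, 8, 12, 12, 4, 6)
)

# Sum months within each family/member/year/contract combination
out <- dt[, .(months = sum(months)), by = .(familyid, memberid, year, contract)]

# Keep a family only if no year appears twice, i.e. no year with two contracts
out <- out[, if (uniqueN(year) == .N) .SD, by = .(familyid, memberid)]
```

Family 3 is dropped (two contracts in 2000) and family 5's two 2003 rows collapse to months = 10, matching the output above.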

Related

dcast reporting 1 or 0 rather than actual values

I have a data frame of this form
familyid Year memberid value
1 2000 1 5
1 2000 2 6
2 2000 1 5
3 2000 1 7
3 2000 2 8
1 2002 1 5
1 2002 2 5
2 2002 1 6
3 2002 1 7
3 2002 2 8
I want to transform it in the following way
familyid Year value_1 value_2
1 2000 5 6
2 2000 5 NA
3 2000 7 8
1 2002 5 5
2 2002 6 NA
3 2002 7 8
In other words I want to group my obs by familyid and year and then, for each memberid, create a column reporting the corresponding value of the last column. Whenever that family has only one member, I want to have NA in the value_2 column associated with member 2 of the reference family.
To do this I usually and successfully use the following code
setDT(df)
dfnew<-data.table::dcast(df, Year + familyid ~ memberid, value.var=c("value"))
Unfortunately this time I get something like this
familyid Year value_1 value_2
1 2000 1 1
2 2000 1 0
3 2000 1 1
1 2002 1 1
2 2002 1 0
3 2002 1 1
In other words, I get a new dataframe with 1 whenever the member exists (indeed, column value_1 contains all 1s, since every family has at least one member) and 0 whenever the member does not exist, regardless of the actual value in column "value". Does anybody know why this happens? Thank you for your time.
With tidyverse:
library(tidyverse)
df<-read.table(text="familyid Year memberid value
1 2000 1 5
1 2000 2 6
2 2000 1 5
3 2000 1 7
3 2000 2 8
1 2002 1 5
1 2002 2 5
2 2002 1 6
3 2002 1 7
3 2002 2 8",header=T)
df %>%
  group_by(familyid, Year) %>%
  spread(memberid, value) %>%
  arrange(Year) %>%
  mutate_at(c("1", "2"), .funs = funs(ifelse(is.na(.), 0, 1)))
# A tibble: 6 x 4
# Groups: familyid, Year [6]
familyid Year `1` `2`
<int> <int> <dbl> <dbl>
1 1 2000 1. 1.
2 2 2000 1. 0.
3 3 2000 1. 1.
4 1 2002 1. 1.
5 2 2002 1. 0.
6 3 2002 1. 1.
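For what it's worth, a likely cause of the 1/0 output (not verifiable from the toy data shown): if the real data contains duplicated familyid/Year/memberid combinations, dcast cannot place a single value in each cell, so it falls back to fun.aggregate = length with a warning and every cell becomes a count instead of a value. A minimal sketch with hypothetical duplicated data:

```r
library(data.table)

# Hypothetical data with a duplicated key: familyid = 1, Year = 2000, memberid = 1
df <- data.table(familyid = c(1, 1, 2), Year = c(2000, 2000, 2000),
                 memberid = c(1, 1, 1), value = c(5, 9, 6))

# Warns "Aggregate function missing, defaulting to 'length'" and returns counts
dcast(df, Year + familyid ~ memberid, value.var = "value")

# Passing an explicit aggregate (or deduplicating first) restores real values
dcast(df, Year + familyid ~ memberid, value.var = "value", fun.aggregate = sum)
```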

apply lag or lead in increasing order for the dataframe

df1 <- read.csv("C:/Users/uni/DS-project/df1.csv")
df1
year value
1 2000 1
2 2001 2
3 2002 3
4 2003 4
5 2004 5
6 2000 1
7 2001 2
8 2002 3
9 2003 4
10 2004 5
11 2000 1
12 2001 2
13 2002 3
14 2003 4
15 2004 5
16 2000 1
17 2001 2
18 2002 3
19 2003 4
20 2004 5
I want to apply lead so I can get the output shown below. We have a set of 5 yearly observations repeated n times. In the output, the first block keeps all five rows; the second block drops 2000 and its value; the third block drops 2000 and 2001 and their values; and so on.
output:
year value
2000 1
2001 2
2002 3
2003 4
2004 5
2001 2
2002 3
2003 4
2004 5
2002 3
2003 4
2004 5
2003 4
2004 5
Please help.
Just for fun, adding a vectorized solution using matrix sub-setting
m <- matrix(rep(TRUE, nrow(df1)), 5)
m[upper.tri(m)] <- FALSE
df1[m, ]
# year value
# 1 2000 1
# 2 2001 2
# 3 2002 3
# 4 2003 4
# 5 2004 5
# 7 2001 2
# 8 2002 3
# 9 2003 4
# 10 2004 5
# 13 2002 3
# 14 2003 4
# 15 2004 5
# 19 2003 4
# 20 2004 5
Below grp is 1 for each row of the first group, 2 for the second and so on. Seq is 1, 2, 3, ... for the successive rows of each grp. Now just pick out those rows for which Seq is at least as large as grp. This has the effect of removing the first i-1 rows from the ith group for i = 1, 2, ... .
grp <- cumsum(df1$year == 2000)
Seq <- ave(grp, grp, FUN = seq_along)
subset(df1, Seq >= grp)
We could alternately write this in the less general form:
subset(df1, 1:5 >= rep(1:4, each = 5))
In any case the output from either subset statement is:
year value
1 2000 1
2 2001 2
3 2002 3
4 2003 4
5 2004 5
7 2001 2
8 2002 3
9 2003 4
10 2004 5
13 2002 3
14 2003 4
15 2004 5
19 2003 4
20 2004 5
library(dplyr)
df1 %>%
  group_by(g = cumsum(year == 2000)) %>%
  filter(row_number() >= g) %>%
  ungroup() %>%
  select(-g)
# # A tibble: 14 x 2
# year value
# <int> <int>
# 1 2000 1
# 2 2001 2
# 3 2002 3
# 4 2003 4
# 5 2004 5
# 6 2001 2
# 7 2002 3
# 8 2003 4
# 9 2004 5
# 10 2002 3
# 11 2003 4
# 12 2004 5
# 13 2003 4
# 14 2004 5
Using lapply():
to <- nrow(df1) / 5 - 1
df1[-unlist(lapply(1:to, function(x) seq(1:x) + 5*x)), ]
year value
1 2000 1
2 2001 2
3 2002 3
4 2003 4
5 2004 5
7 2001 2
8 2002 3
9 2003 4
10 2004 5
13 2002 3
14 2003 4
15 2004 5
19 2003 4
20 2004 5
Where unlist(lapply(1:to, function(x) seq(1:x) + 5*x)) gives the indices to skip:
[1] 6 11 12 16 17 18
Using sequence:
df1[5 - rev(sequence(2:5) - 1), ]
# year value
# 1 2000 1
# 2 2001 2
# 3 2002 3
# 4 2003 4
# 5 2004 5
# 2.1 2001 2
# 3.1 2002 3
# 4.1 2003 4
# 5.1 2004 5
# 3.2 2002 3
# 4.2 2003 4
# 5.2 2004 5
# 4.3 2003 4
# 5.3 2004 5
how it works:
5-rev(sequence(2:5)-1)
# [1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5
rev(sequence(2:5)-1)
# [1] 4 3 2 1 0 3 2 1 0 2 1 0 1 0
sequence(2:5)-1
# [1] 0 1 0 1 2 0 1 2 3 0 1 2 3 4
sequence(2:5)
# [1] 1 2 1 2 3 1 2 3 4 1 2 3 4 5
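One point worth noting: the grp/Seq idea generalizes to blocks of unequal length, whereas the matrix and 1:5 shortcuts assume exactly five rows per block. A sketch with hypothetical uneven blocks, each starting at year 2000:

```r
# Two blocks of different lengths: (2000, 2001, 2002) and (2000, 2001)
df_uneven <- data.frame(year = c(2000, 2001, 2002, 2000, 2001),
                        value = 1:5)

# grp numbers the blocks, Seq numbers the rows within each block
grp <- cumsum(df_uneven$year == 2000)
Seq <- ave(grp, grp, FUN = seq_along)

# Drop the first i - 1 rows of block i
subset(df_uneven, Seq >= grp)
```

Block 1 is kept whole; block 2 loses its 2000 row.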

Change the form while merging multiple data frames

I have several data frames that are all in same format, like:
price <- data.frame(Year = c(2001, 2002, 2003),
                    A = c(1, 2, 3), B = c(2, 3, 4), C = c(4, 5, 6))
size <- data.frame(Year = c(2001, 2002, 2003),
                   A = c(1, 2, 3), B = c(2, 3, 4), C = c(4, 5, 6))
performance <- data.frame(Year = c(2001, 2002, 2003),
                          A = c(1, 2, 3), B = c(2, 3, 4), C = c(4, 5, 6))
> price
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
> size
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
> performance
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
and I want to merge these data frames but the result is in different form, the desired output is like:
> df
name Year price size performance
1 A 2001 1 1 1
2 A 2002 2 2 2
3 A 2003 3 3 3
4 B 2001 2 2 2
5 B 2002 3 3 3
6 B 2003 4 4 4
7 C 2001 3 3 3
8 C 2002 4 4 4
9 C 2003 5 5 5
which arranges the data by name and then by date. Since I have over 2000 names and 180 dates in each of the 20 data frames, it's not feasible to do this by typing in each specific name.
You need to convert your data frames to long format, then join them together:
library(tidyverse)
price_long <- price %>% gather(key, value = "price", -Year)
size_long <- size %>% gather(key, value = "size", -Year)
performance_long <- performance %>% gather(key, value = "performance", -Year)
price_long %>%
  left_join(size_long) %>%
  left_join(performance_long)
Joining, by = c("Year", "key")
Joining, by = c("Year", "key")
Year key price size performance
1 2001 A 1 1 1
2 2002 A 2 2 2
3 2003 A 3 3 3
4 2001 B 2 2 2
5 2002 B 3 3 3
6 2003 B 4 4 4
7 2001 C 4 4 4
8 2002 C 5 5 5
9 2003 C 6 6 6
You can use data.table:
library(data.table)
a <- list(price = price, size = size, performance = performance)
dcast(melt(rbindlist(a, use.names = TRUE, idcol = "name"), id.vars = 1:2), variable + Year ~ name)
variable Year performance price size
1: A 2001 1 1 1
2: A 2002 2 2 2
3: A 2003 3 3 3
4: B 2001 2 2 2
5: B 2002 3 3 3
6: B 2003 4 4 4
7: C 2001 4 4 4
8: C 2002 5 5 5
9: C 2003 6 6 6
We can combine the data frames, gather and spread the combined data frame.
library(tidyverse)
dat <- list(price, size, performance) %>%
  setNames(c("price", "size", "performance")) %>%
  bind_rows(.id = "type") %>%
  gather(name, value, A:C) %>%
  spread(type, value) %>%
  arrange(name, Year)
dat
# Year name performance price size
# 1 2001 A 1 1 1
# 2 2002 A 2 2 2
# 3 2003 A 3 3 3
# 4 2001 B 2 2 2
# 5 2002 B 3 3 3
# 6 2003 B 4 4 4
# 7 2001 C 4 4 4
# 8 2002 C 5 5 5
# 9 2003 C 6 6 6
dplyr::bind_rows comes in quite handy in such scenarios. A solution can be:
library(tidyverse)
bind_rows(list(price = price, size = size, performance = performance), .id = "Type") %>%
  gather(Key, Value, -Type, -Year) %>%
  spread(Type, Value)
# Year Key performance price size
# 1 2001 A 1 1 1
# 2 2001 B 2 2 2
# 3 2001 C 4 4 4
# 4 2002 A 2 2 2
# 5 2002 B 3 3 3
# 6 2002 C 5 5 5
# 7 2003 A 3 3 3
# 8 2003 B 4 4 4
# 9 2003 C 6 6 6
This solution is very similar to the one by @www; it just avoids the use of setNames.
To round it out, here's a package-free base R answer.
# gather the data.frames into a list
myList <- mget(ls())
Note that the three data.frames are the only objects in my environment.
# get the final data.frame
Reduce(merge,
       Map(function(x, y) setNames(cbind(x[1], stack(x[-1])), c("Year", y, "ID")),
           myList, names(myList)))
This returns
Year ID performance price size
1 2001 A 1 1 1
2 2001 B 2 2 2
3 2001 C 4 4 4
4 2002 A 2 2 2
5 2002 B 3 3 3
6 2002 C 5 5 5
7 2003 A 3 3 3
8 2003 B 4 4 4
9 2003 C 6 6 6
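Since these answers were written, gather() and spread() have been superseded in tidyr; here is a sketch of the same reshape with pivot_longer()/pivot_wider(), using the question's three data frames (reproduced inline so the snippet is self-contained):

```r
library(dplyr)
library(tidyr)

price <- data.frame(Year = c(2001, 2002, 2003), A = 1:3, B = 2:4, C = 4:6)
size <- price
performance <- price

dat <- list(price = price, size = size, performance = performance) %>%
  bind_rows(.id = "type") %>%
  # Wide to long: one row per Year/name combination
  pivot_longer(A:C, names_to = "name") %>%
  # Long to wide: one column per source data frame
  pivot_wider(names_from = type, values_from = value) %>%
  arrange(name, Year)
```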

merge data frames in R in two dimensions

DATA FRAME 1: HOUSE PRICE
year month MSA1 MSA2 MSA3
2000 1 12 6 7
2000 2 1 3 4
2001 3 9 5 7
DATA FRAME 2: MORTGAGE INFO
ID MSA YEAR MONTH
1 MSA1 2000 2
2 MSA3 2001 3
3 MSA2 2001 3
4 MSA1 2000 1
5 MSA3 2000 3
OUTCOME DESIRED:
ID MSA YEAR MONTH HOUSE_PRICE
1 MSA1 2000 2 1
2 MSA3 2001 3 7
3 MSA2 2001 3 5
Does anyone know how to achieve this efficiently? Data frame 2 is huge and data frame 1 is a manageable size. Thanks!
Assuming both are data.tables dt1 and dt2, this can be done without having to cast them to long form as follows:
require(data.table)
dt2[dt1, .(ID, MSA, House_price = get(MSA)), by = .EACHI,
    nomatch = 0L, on = c(YEAR = "year", MONTH = "month")]
# YEAR MONTH ID MSA House_price
# 1: 2000 1 4 MSA1 12
# 2: 2000 2 1 MSA1 1
# 3: 2001 3 2 MSA3 7
# 4: 2001 3 3 MSA2 5
dt1 = fread('year month MSA1 MSA2 MSA3
2000 1 12 6 7
2000 2 1 3 4
2001 3 9 5 7
')
dt2 = fread('ID MSA YEAR MONTH
1 MSA1 2000 2
2 MSA3 2001 3
3 MSA2 2001 3
4 MSA1 2000 1
5 MSA3 2000 3
')
This looks like a case of turning a data frame from wide to long form and then merging two data frames. Here is a tidyverse solution with gather and right_join. The name change is just there to make the join easier.
library(dplyr)
library(tidyr)
names(df1) <- toupper(names(df1))
gather(df1, MSA, HOUSE_PRICE, -YEAR, -MONTH) %>%
  right_join(df2, by = c("YEAR", "MONTH", "MSA"))
output
YEAR MONTH MSA HOUSE_PRICE ID
1 2000 2 MSA1 1 1
2 2001 3 MSA3 7 2
3 2001 3 MSA2 5 3
4 2000 1 MSA1 12 4
5 2000 3 MSA3 NA 5
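Note that right_join keeps mortgage rows with no matching house price (ID 5 above gets NA). If those rows should be dropped instead, inner_join does it. A self-contained sketch using the question's data:

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(YEAR = c(2000, 2000, 2001), MONTH = c(1, 2, 3),
                  MSA1 = c(12, 1, 9), MSA2 = c(6, 3, 5), MSA3 = c(7, 4, 7))
df2 <- data.frame(ID = 1:5,
                  MSA = c("MSA1", "MSA3", "MSA2", "MSA1", "MSA3"),
                  YEAR = c(2000, 2001, 2001, 2000, 2000),
                  MONTH = c(2, 3, 3, 1, 3))

# inner_join keeps only the mortgage rows that have a house price
out <- gather(df1, MSA, HOUSE_PRICE, -YEAR, -MONTH) %>%
  inner_join(df2, by = c("YEAR", "MONTH", "MSA"))
```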

Combining split() and cumsum()

I am trying to produce stats for cumulative goals by season for a particular soccer player. I have used the cut function to obtain the season from the game dates. I have data which corresponds to this dataframe
df.raw <-
data.frame(Game = 1:20,
Goals=c(1,0,0,2,1,0,3,2,0,0,0,1,0,4,1,2,0,0,0,3),
season = gl(4,5,labels = c("2001", "2002","2003", "2004")))
In real life, the number of games per season may not be constant
I want to end up with data that looks like this
df.seasoned <-
data.frame(Game = 1:20,seasonGame= rep(1:5),
Goals=c(1,0,0,2,1,0,3,2,0,0,0,1,0,4,1,2,0,0,0,3),
cumGoals = c(1,1,1,3,4,0,3,5,5,5,0,1,1,5,6,2,2,2,2,5),
season = gl(4,5,labels = c("2001", "2002","2003", "2004")))
That is, with the goals cumulatively summed within each season and a game number within the season.
df.raw$cumGoals <- with(df.raw, ave(Goals, season, FUN=cumsum) )
df.raw$seasonGame <- with(df.raw, ave(Game, season, FUN=seq))
df.raw
Or with transform ... the original transform, that is:
df.seas <- transform(df.raw, seasonGame = ave(Game, season, FUN = seq),
                     cumGoals = ave(Goals, season, FUN = cumsum))
df.seas
Game Goals season seasonGame cumGoals
1 1 1 2001 1 1
2 2 0 2001 2 1
3 3 0 2001 3 1
4 4 2 2001 4 3
5 5 1 2001 5 4
6 6 0 2002 1 0
7 7 3 2002 2 3
8 8 2 2002 3 5
9 9 0 2002 4 5
10 10 0 2002 5 5
snipped
Another job for ddply and transform (from the plyr package):
ddply(df.raw,.(season),transform,seasonGame = 1:NROW(piece),
cumGoals = cumsum(Goals))
Game Goals season seasonGame cumGoals
1 1 1 2001 1 1
2 2 0 2001 2 1
3 3 0 2001 3 1
4 4 2 2001 4 3
5 5 1 2001 5 4
6 6 0 2002 1 0
7 7 3 2002 2 3
8 8 2 2002 3 5
9 9 0 2002 4 5
10 10 0 2002 5 5
11 11 0 2003 1 0
12 12 1 2003 2 1
13 13 0 2003 3 1
14 14 4 2003 4 5
15 15 1 2003 5 6
16 16 2 2004 1 2
17 17 0 2004 2 2
18 18 0 2004 3 2
19 19 0 2004 4 2
20 20 3 2004 5 5
Here is a solution using data.table which is very fast.
library(data.table)
df.raw.tab <- data.table(df.raw)
df.raw.tab[, list(seasonGame = seq_along(Goals), cumGoals = cumsum(Goals)), by = "season"]
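A present-day dplyr sketch of the same computation, using the question's df.raw (reproduced inline so the snippet is self-contained):

```r
library(dplyr)

df.raw <- data.frame(
  Game = 1:20,
  Goals = c(1, 0, 0, 2, 1, 0, 3, 2, 0, 0, 0, 1, 0, 4, 1, 2, 0, 0, 0, 3),
  season = gl(4, 5, labels = c("2001", "2002", "2003", "2004"))
)

# Number the games and accumulate goals within each season
df.seasoned <- df.raw %>%
  group_by(season) %>%
  mutate(seasonGame = row_number(), cumGoals = cumsum(Goals)) %>%
  ungroup()
```

Like the ave-based answers, this copes with seasons of unequal length, per the question's caveat.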
