R: Running sum of changed column values within groups - r

I have data that looks like this:
df <- read.table(textConnection(
"ID DATE UNIT
100 1/5/2005 4
100 2/6/2006 4
100 3/7/2007 5
100 4/7/2008 5
100 5/9/2009 6
101 1/5/2005 1
101 2/6/2006 1
101 3/7/2007 1
101 4/7/2008 1
102 1/3/2010 3
102 4/5/2010 4
102 5/9/2011 3
102 6/7/2011 5
102 10/10/2012 5
103 1/5/2005 1
103 1/6/2010 2"),header=TRUE)
I want to group by ID, sort each group by DATE, and create another column that is a running count of the number of times the UNIT variable has changed for each given ID variable. So I want an output that looks like this:
ID DATE UNIT CHANGES
100 1/5/2005 4 0
100 2/6/2006 4 0
100 3/7/2007 5 1
100 4/7/2008 5 1
100 5/9/2009 6 2
101 1/5/2005 1 0
101 2/6/2006 1 0
101 3/7/2007 1 0
101 4/7/2008 1 0
102 1/3/2010 3 0
102 4/5/2010 4 1
102 5/9/2011 3 2
102 6/7/2011 5 3
102 10/10/2012 5 3
103 1/5/2005 1 0
103 1/6/2010 2 1

You could also do this in base R, using order to sort the observations and ave to compute the grouped values:
df$DATE <- as.Date(df$DATE, "%m/%d/%Y")
df <- df[order(df$ID, df$DATE),]
df$CHANGES <- ave(df$UNIT, df$ID, FUN=function(x) c(0, cumsum(diff(x) != 0)))
df
# ID DATE UNIT CHANGES
# 1 100 2005-01-05 4 0
# 2 100 2006-02-06 4 0
# 3 100 2007-03-07 5 1
# 4 100 2008-04-07 5 1
# 5 100 2009-05-09 6 2
# 6 101 2005-01-05 1 0
# 7 101 2006-02-06 1 0
# 8 101 2007-03-07 1 0
# 9 101 2008-04-07 1 0
# 10 102 2010-01-03 3 0
# 11 102 2010-04-05 4 1
# 12 102 2011-05-09 3 2
# 13 102 2011-06-07 5 3
# 14 102 2012-10-10 5 3
# 15 103 2005-01-05 1 0
# 16 103 2010-01-06 2 1
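A small variation, sketched under the assumption that UNIT might be a character or factor column (where diff() would fail): compare each value with its predecessor directly. The helper name count_changes and the toy data are illustrative only and not part of the answer above; the rows are assumed to be sorted by date within each ID already.

```r
# Hypothetical helper: running count of value changes in a vector
count_changes <- function(x) {
  # compare every element (from the 2nd on) with the one before it
  c(0, cumsum(x[-1] != x[-length(x)]))
}

# Toy data with a character UNIT column (made up for illustration)
df_chr <- data.frame(
  ID   = c(100, 100, 100, 101, 101),
  UNIT = c("a", "a", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# ave() over row indices so the helper sees each group's UNIT values
df_chr$CHANGES <- ave(seq_along(df_chr$UNIT), df_chr$ID,
                      FUN = function(i) count_changes(df_chr$UNIT[i]))
df_chr$CHANGES
# 0 0 1 0 0
```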

Using dplyr.
First I'm converting your DATE column to a date, assuming it's in format m/d/y (if not, change the "%m/%d/%Y" to "%d/%m/%Y"):
df$DATE <- as.Date(df$DATE, "%m/%d/%Y")
Now the code:
library(dplyr)
df %>% group_by(ID) %>%
arrange(DATE) %>%
mutate(CHANGES=c(0,cumsum(na.omit(UNIT!=lag(UNIT,1)))))

Related

Changing rows to columns through merging in R

actual_date=c('2018-01-03','2018-01-02','2018-01-25','2018-01-15','2018-01-06','2018-01-02','2018-01-16','2018-01-22','2018-01-03')
date_band=c('_201801','_201801','_201803','_201802','_201801', '_201801','_201803','_201804','_201801')
action=c('text','letter','call','letter','visit','letter','text','text','call')
unique_ref=c(1,1,2,1,2,3,3,4,4)
df1=as.data.frame(cbind(unique_ref,actual_date,date_band,action))
unique_ref=c(1,2,3,4)
priority_201801=c('3','2','3','0')
balance_201801=c('30','-20','35','-100')
priority_201802=c('1','1','1','2')
balance_201802=c('60','-40','35','0')
priority_201803=c('2','2','3','2')
balance_201803=c('30','-40','-50','100')
priority_201804=c('99','0','0','0')
balance_201804=c('0','-20','-50','-100')
df2=as.data.frame(cbind(unique_ref,priority_201801,balance_201801,priority_201802,balance_201802,priority_201803,
balance_201803,priority_201804,balance_201804))
The code above produces two examples of datasets I'm working with.
df1 looks like this:
unique_ref actual_date date_band action
1 1 2018-01-03 _201801 text
2 1 2018-01-02 _201801 letter
3 2 2018-01-25 _201803 call
4 1 2018-01-15 _201802 letter
5 2 2018-01-06 _201801 visit
6 3 2018-01-02 _201801 letter
7 3 2018-01-16 _201803 text
8 4 2018-01-22 _201804 text
9 4 2018-01-03 _201801 call
While df2 looks like:
unique_ref priority_201801 balance_201801 priority_201802 balance_201802 priority_201803 balance_201803 priority_201804 balance_201804
1 1 3 30 1 60 2 30 99 0
2 2 2 -20 1 -40 2 -40 0 -20
3 3 3 35 1 35 3 -50 0 -50
4 4 0 -100 2 0 2 100 0 -100
What I want to do is add columns to df2 named action_dateband (i.e. action_201801, action_201802 etc). The values would be taken from action in df1, using date_band and matched on unique_ref.
The desired output looks as follows. If there are two actions for one date band, there is a comma between them.
unique_ref priority_201801 balance_201801 action_201801 priority_201802 balance_201802 action_201802 priority_201803 balance_201803 action_201803
1 1 3 30 text,letter 1 60 letter 2 30
2 2 2 -20 visit 1 -40 2 -40 call
3 3 3 35 letter 1 35 3 -50 text
4 4 0 -100 call 2 0 2 100
priority_201804 balance_201804 action_201804
1 99 0
2 0 -20
3 0 -50
4 0 -100 text
library(tidyverse)
df2 %>%
left_join(df1, by=c("unique_ref")) %>% # join df1 to df2
select(-actual_date) %>% # remove column you won't need
mutate(date_band = paste0("action", date_band)) %>% # update column values
spread(date_band, action) # reshape to get the format you want
# unique_ref priority_201801 balance_201801 priority_201802 balance_201802 priority_201803 balance_201803
# 1 1 3 30 1 60 2 30
# 2 2 2 -20 1 -40 2 -40
# 3 3 3 35 1 35 3 -50
# 4 4 0 -100 2 0 2 100
# priority_201804 balance_201804 action_201801 action_201802 action_201803 action_201804
# 1 99 0 text letter <NA> <NA>
# 2 0 -20 visit <NA> call <NA>
# 3 0 -50 letter <NA> text <NA>
# 4 0 -100 call <NA> <NA> text
For the case where you have multiple actions for a specific ref and date_band you can use this approach:
library(tidyverse)
# update df1
df1_upd = df1 %>%
group_by(unique_ref, date_band) %>% # for every combination of ref and date_band
summarise(action = paste0(action, collapse = ",")) %>% # combine actions
ungroup() # forget the grouping
df2 %>%
left_join(df1_upd, by=c("unique_ref")) %>%
mutate(date_band = paste0("action", date_band)) %>%
spread(date_band, action)
# unique_ref priority_201801 balance_201801 priority_201802 balance_201802 priority_201803 balance_201803
# 1 1 3 30 1 60 2 30
# 2 2 2 -20 1 -40 2 -40
# 3 3 3 35 1 35 3 -50
# 4 4 0 -100 2 0 2 100
# priority_201804 balance_201804 action_201801 action_201802 action_201803 action_201804
# 1 99 0 text,letter letter <NA> <NA>
# 2 0 -20 visit <NA> call <NA>
# 3 0 -50 letter <NA> text <NA>
# 4 0 -100 call <NA> <NA> text
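For comparison, the same collapse-then-widen idea can be sketched in base R with aggregate() and reshape(). This is only an illustration under the assumption that df1 holds plain character columns; the object names collapsed and wide are made up here, and the final merge with df2 is left as a comment.

```r
df1 <- data.frame(
  unique_ref = c(1, 1, 2, 1, 2, 3, 3, 4, 4),
  date_band  = c("_201801", "_201801", "_201803", "_201802", "_201801",
                 "_201801", "_201803", "_201804", "_201801"),
  action     = c("text", "letter", "call", "letter", "visit",
                 "letter", "text", "text", "call"),
  stringsAsFactors = FALSE
)

# One row per (unique_ref, date_band); multiple actions joined by commas
collapsed <- aggregate(action ~ unique_ref + date_band, data = df1,
                       FUN = paste, collapse = ",")

# Long to wide: one action_<date_band> column per band, NA where absent
wide <- reshape(collapsed, idvar = "unique_ref", timevar = "date_band",
                direction = "wide", sep = "")

# The wide table could then be attached with, e.g.,
# merge(df2, wide, by = "unique_ref")
wide
```

Unlike the desired layout in the question, the action columns end up together rather than interleaved with the priority and balance columns; they would need reordering afterwards if that matters.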

Efficient way in R to add a new column to a dataframe with huge dataset

I really need to speed some R code up. I have a large dataset from a particular sport. Each row in the data frame represents some type of action in the game. For each game (game_id) we have two teams (team_id) that take part in the game. time_ref gives the chronological order of the actions within each game. action_id is the type of action in the game. player_off is set to TRUE or FALSE and is linked to action_id=3: action_id=3 represents a player getting a card, and player_off records whether the player was sent off when they got that card. Example data.frame:
> df
game_id team_id action_id player_off time_ref
100 10 1 NA 1000
100 10 1 NA 1001
100 10 1 NA 1002
100 11 1 NA 1003
100 11 2 NA 1004
100 11 1 NA 1005
100 10 3 1 1006
100 11 1 NA 1007
100 10 1 NA 1008
100 10 1 NA 1009
101 12 3 0 1000
101 12 1 NA 1001
101 12 1 NA 1002
101 13 2 NA 1003
101 13 3 1 1004
101 12 1 NA 1005
101 13 1 NA 1006
101 13 1 NA 1007
101 12 1 NA 1008
101 12 1 NA 1009
What I need is another column in the data frame that gives me TRUE or FALSE on whether both teams had an equal/unequal number of players on the field while each action (row) took place.
So game_id=100 had an action_id=3 & player_off=1 for team_id=10 at time_ref=1006. So we know the teams were equal with number of players on the field up to that point but unequal for the rest of the game (time_ref>1006). The same thing occurred in game_id=101 also.
This is an example of the data frame with the extra column I would like to have for the dataset.
>df
game_id team_id action_id player_off time_ref is_even
100 10 1 NA 1000 1
100 10 1 NA 1001 1
100 10 1 NA 1002 1
100 11 1 NA 1003 1
100 11 2 NA 1004 1
100 11 1 NA 1005 1
100 10 3 1 1006 1
100 11 1 NA 1007 0
100 10 1 NA 1008 0
100 10 1 NA 1009 0
101 12 3 0 1000 1
101 12 1 NA 1001 1
101 12 1 NA 1002 1
101 13 2 NA 1003 1
101 13 3 1 1004 1
101 12 1 NA 1005 0
101 13 1 NA 1006 0
101 13 1 NA 1007 0
101 12 1 NA 1008 0
101 12 1 NA 1009 0
So you can see that in game_id=100 a player was sent off at time_ref=1006 so all previous rows were marked as is_even=1 and subsequent marked as uneven or 0. Similar for game_id=101 at time_ref=1004.
What is the most efficient way of achieving this extra column? Preferably not using for loops.
For some vector
x = c(0, NA, NA, NA, 1, NA, NA, NA)
write a function to standardize the data (0 or 1 player lost), calculate the cumulative number of players lost, and compare this to zero,
fun0 = function(x) {
x[is.na(x)] = 0
cumsum(x) == 0
}
For several groups, use ave() with a grouping variable
x = c(x, rev(x))
grp = rep(1:2, each = length(x) / 2)
ave(x, grp, FUN = fun0)
For the data in the question, try
df$is_even = ave(df$player_off, df$game_id, FUN = fun0)
Semantically, it seems likely that the required function is more complicated than fun0(), specifically that if each team loses a player, they are again even, as @SunLisa says. If so, clean the data
df$player_off[is.na(df$player_off)] = 0
and change fun0(), e.g.,
fun1 <- function(x, team) {
is_team_1 <- team == head(team, 1) # is 'team' the first team?
x1 <- x & is_team_1 # lost player & team 1
x2 <- x & !is_team_1 # lost player & team 2
cumsum(x1) == cumsum(x2) # same total number of players?
}
(it doesn't seem like a good idea to coerce the logical return value to an integer). This could be applied by group with
df$is_even = ave(seq_len(nrow(df)), df$game_id, FUN = function(i) {
fun1(df$player_off[i], df$team_id[i])
})
or
split(df$is_even, df$game_id) <-
Map(fun1,
split(df$player_off, df$game_id),
split(df$team_id, df$game_id)
)
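To see what fun1() returns when both teams lose a player, here is a minimal sketch on a made-up single game (the data frame name game and its values are assumptions for illustration, not taken from the question):

```r
fun1 <- function(x, team) {
  is_team_1 <- team == head(team, 1)  # is 'team' the first team?
  x1 <- x & is_team_1                 # lost player & team 1
  x2 <- x & !is_team_1                # lost player & team 2
  cumsum(x1) == cumsum(x2)            # same total number of players?
}

# Hypothetical game: team 13 loses a player on row 3, team 12 on row 4
game <- data.frame(
  game_id    = 101,
  team_id    = c(12, 13, 13, 12, 12),
  player_off = c(0, 0, 1, 1, 0)
)

game$is_even <- fun1(game$player_off, game$team_id)
game$is_even
# TRUE TRUE FALSE TRUE TRUE
```

Note that the row recording a send-off is itself flagged with the new state here, whereas the desired output in the question still marks that row as even; shifting the result by one row within each game would be needed to reproduce that convention.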
The implementation of ave() is useful to look at, the important line being
split(x, g) <- lapply(split(x, g), FUN)
The right-hand side splits x by group g, then applies FUN() to each group. The left-hand side split<-() is a tricky operation, using the group indexes to update the original vector x.
Comments
The original question asked for 'no for loops', but actually lapply() (in ave()) and Map() are exactly that; ave() is relatively efficient because of the split-apply-combine strategy it adopts, rather than what the OP probably implemented, which was likely to iterate through games, subset the data frame, then update the data.frame for each game. The subsetting would have duplicated subsets of the entire data set, and the update in particular would have copied at least the entire result column on each assignment; this copying would have slowed the execution down a lot. It's also possible that the OP was struggling with fun0(); it would help to clarify the question, especially the title, to identify that as the problem.
There are faster ways, especially using the data.table package, but the principle is the same -- identify a function that operates on a vector the way you'd like, and apply it by group.
An alternative, fully-vectorized, solution follows this suggestion to calculate a cumulative sum by group. For fun0(), standardize x to be the number of players leaving the game at a particular timepoint, without NAs
x[is.na(x)] = 0
For the equivalent of fun0(), calculate the cumulative sum of players leaving the game, irrespective of group
cs = cumsum(x)
Correct this for the group that the cumulative sum applies to
in_game = cs - (grp - 1)
and set this to 'TRUE' when 0 players have left the game
is_even = (in_game == 0)
This relies on grp indexing from 1 to the number of groups; for the data here one might grp = match(df$game_id, unique(df$game_id)). A similar solution exists for fun1().
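The correction cs - (grp - 1) appears to assume that exactly one player left in every earlier group. A slightly more general sketch, assuming only that each group's rows are contiguous, subtracts the running total reached just before each group starts. The names first, offset and is_even_vec are illustrative.

```r
x   <- c(0, NA, NA, NA, 1, NA, NA, NA)
x   <- c(x, rev(x))
grp <- rep(1:2, each = length(x) / 2)

x[is.na(x)] <- 0          # NA means no player lost on that row
cs <- cumsum(x)           # running total, ignoring groups

first  <- !duplicated(grp)            # first row of each group
offset <- (cs - x)[first]             # total accumulated before each group
in_game <- cs - offset[match(grp, grp[first])]

is_even_vec <- in_game == 0
is_even_vec
```

On this example the result agrees with the grouped ave(x, grp, FUN = fun0) computation.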
Here's a dplyr + tidyr solution to the problem, with the summary of what was done:
Manipulate the data by converting all NAs in player_off to 0 for easier summing, and labelling the smaller team_id (assuming there are only 2 per game) as team1 and the other as team2 in a new team_num column
"Tally" the player_offs using spread and fill the invalid combinations in the data with 0 -- for example, in game_id = 100, there's no team_id = 11 for time_ref = 1000
Take the cumulative sum of the lagged team1 and team2 vectors (and of course fill NAs with 0)
Code below:
require(dplyr)
require(tidyr)
df %>%
group_by(game_id) %>%
mutate(
player_off = player_off %>% replace(list = is.na(.), values = 0),
team_num = if_else(team_id == min(team_id), "team1", "team2")
) %>%
spread(key = team_num, value = player_off, fill = 0) %>%
arrange(game_id, time_ref) %>%
mutate(
team1_cum = cumsum(lag(team1, default = 0)),
team2_cum = cumsum(lag(team2, default = 0)),
is_even = as.integer(team1_cum == team2_cum)
) %>%
ungroup() %>%
select(-team1, -team2, -team1_cum, -team2_cum)
Output:
# A tibble: 20 x 5
game_id team_id action_id time_ref is_even
<int> <int> <int> <int> <int>
1 100 10 1 1000 1
2 100 10 1 1001 1
3 100 10 1 1002 1
4 100 11 1 1003 1
5 100 11 2 1004 1
6 100 11 1 1005 1
7 100 10 3 1006 1
8 100 11 1 1007 0
9 100 10 1 1008 0
10 100 10 1 1009 0
11 101 12 3 1000 1
12 101 12 1 1001 1
13 101 12 1 1002 1
14 101 13 2 1003 1
15 101 13 3 1004 1
16 101 12 1 1005 0
17 101 13 1 1006 0
18 101 13 1 1007 0
19 101 12 1 1008 0
20 101 12 1 1009 0
Here's my thinking:
data.table is going to work well, especially when you are working with large data sets. It's faster. We just need to group by game, take the cumulative sum of each team's send-offs, and check whether the two are equal.
First I have to say:
(problem solved by Martin Morgan, his updated answer no longer has this error)
I don't think @Martin Morgan's answer is right. Let's imagine a certain case:
when team 1 has one player off, after which team 2 has another player off, then the 2 teams should be even, but @Martin Morgan's output would be FALSE.
I'll make an example with this dataset, where player_off of record 19 was modified to 1, which means that in game 101, after team 13 had 1 player off at 1004, team 12 had 1 player off at 1008, which would make the 2 teams even at 1009.
> dt.1
game_id team_id action_id player_off time_ref
1 100 10 1 NA 1000
2 100 10 1 NA 1001
3 100 10 1 NA 1002
4 100 11 1 NA 1003
5 100 11 2 NA 1004
6 100 11 1 NA 1005
7 100 10 3 1 1006
8 100 11 1 NA 1007
9 100 10 1 NA 1008
10 100 10 1 NA 1009
11 101 12 3 0 1000
12 101 12 1 NA 1001
13 101 12 1 NA 1002
14 101 13 2 NA 1003
15 101 13 3 1 1004
16 101 12 1 NA 1005
17 101 13 1 NA 1006
18 101 13 1 NA 1007
19 101 12 1 1 1008
20 101 12 1 NA 1009
But @Martin Morgan's function would produce this output:
> dt.1$is_even = ave(df$player_off, df$game_id, FUN = fun)
> dt.1
game_id team_id action_id player_off time_ref is_even
1 100 10 1 NA 1000 1
2 100 10 1 NA 1001 1
3 100 10 1 NA 1002 1
4 100 11 1 NA 1003 1
5 100 11 2 NA 1004 1
6 100 11 1 NA 1005 1
7 100 10 3 1 1006 1
8 100 11 1 NA 1007 0
9 100 10 1 NA 1008 0
10 100 10 1 NA 1009 0
11 101 12 3 0 1000 1
12 101 12 1 NA 1001 1
13 101 12 1 NA 1002 1
14 101 13 2 NA 1003 1
15 101 13 3 1 1004 1
16 101 12 1 NA 1005 0
17 101 13 1 NA 1006 0
18 101 13 1 NA 1007 0
19 101 12 1 1 1008 0
20 101 12 1 NA 1009 0
Notice how at lines 19 and 20, is_even=0, which is not what the OP wants.
My code does not process NAs, so I am going to transform NA to 0 first.
> dt.1<-as.data.table(dt.1)
> dt.1[is.na(dt.1)]<-0
My code would produce the correct output: at times 1008 and 1009, where both team 12 and team 13 have had 1 player off, the two teams are even.
> dt.1[,.(action_id,team2_off=(team_id==max(team_id))*player_off,team1_off=(team_id==min(team_id))*player_off,team_id,time_ref,player_off),by=game_id][order(game_id,time_ref)][,.(team_id,time_ref,action_id,player_off,even=as.numeric(cumsum(team2_off)==cumsum(team1_off))),by=game_id]
game_id team_id time_ref action_id player_off even
1: 100 10 1000 1 0 1
2: 100 10 1001 1 0 1
3: 100 10 1002 1 0 1
4: 100 11 1003 1 0 1
5: 100 11 1004 2 0 1
6: 100 11 1005 1 0 1
7: 100 10 1006 3 1 0
8: 100 11 1007 1 0 0
9: 100 10 1008 1 0 0
10: 100 10 1009 1 0 0
11: 101 12 1000 3 0 1
12: 101 12 1001 1 0 1
13: 101 12 1002 1 0 1
14: 101 13 1003 2 0 1
15: 101 13 1004 3 1 0
16: 101 12 1005 1 0 0
17: 101 13 1006 1 0 0
18: 101 13 1007 1 0 0
19: 101 12 1008 1 1 1
20: 101 12 1009 1 0 1
I understand it is a messy looking chunk of data.table code, let me explain step by step.
dt[, .(
action_id,
team2_off = (team_id == max(team_id)) * player_off,
team1_off = (team_id == min(team_id)) * player_off,
team_id,
time_ref,
player_off
), by = game_id][order(game_id, time_ref)][, .(team_id,
time_ref,
action_id,
player_off,
even = cumsum(team2_off) == cumsum(team1_off)), by = game_id]
First, we take the data.table dt, group by game_id, and do this calculation:
team2_off = (team_id == max(team_id)) * player_off,
team1_off = (team_id == min(team_id)) * player_off
Grouping by game_id and team_id together would not help here, because the two teams' running totals have to be compared row by row within a game. A logical expression inside each game_id group handles this well: multiplying the logical result of team_id == max/min(team_id) by player_off gives team2_off and team1_off. The product is 1 only when both factors are 1, which means 1 player of the selected team was sent off on that row.
Now we have a data table of:
> dt.1[,.(action_id,team2_off=(team_id==max(team_id))*player_off,team1_off=(team_id==min(team_id))*player_off,team_id,time_ref,player_off),by=game_id]
game_id action_id team2_off team1_off team_id time_ref player_off
1: 100 1 0 0 10 1000 0
2: 100 1 0 0 10 1001 0
3: 100 1 0 0 10 1002 0
4: 100 1 0 0 11 1003 0
5: 100 2 0 0 11 1004 0
6: 100 1 0 0 11 1005 0
7: 100 3 0 1 10 1006 1
8: 100 1 0 0 11 1007 0
9: 100 1 0 0 10 1008 0
10: 100 1 0 0 10 1009 0
11: 101 3 0 0 12 1000 0
12: 101 1 0 0 12 1001 0
13: 101 1 0 0 12 1002 0
14: 101 2 0 0 13 1003 0
15: 101 3 1 0 13 1004 1
16: 101 1 0 0 12 1005 0
17: 101 1 0 0 13 1006 0
18: 101 1 0 0 13 1007 0
19: 101 1 0 1 12 1008 1
20: 101 1 0 0 12 1009 0
Now we no longer need to group by two variables (team_id, game_id). We can just do cumsum by game_id and compare whether cumsum(team1_off)==cumsum(team2_off). We also order by game_id and time_ref, so the result has the correct order.
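The logical-multiplication step can be tried on its own in base R. This is a minimal sketch for a single game with made-up values (the vectors below are assumptions for illustration):

```r
# One hypothetical game: team 13 loses a player on row 3, team 12 on row 5
team_id    <- c(12, 12, 13, 13, 12)
player_off <- c(0, 0, 1, 0, 1)

# TRUE/FALSE times 0/1 keeps the send-off only for the matching team
team1_off <- (team_id == min(team_id)) * player_off  # 0 0 0 0 1
team2_off <- (team_id == max(team_id)) * player_off  # 0 0 1 0 0

# The teams are even whenever their running send-off totals match
even <- cumsum(team1_off) == cumsum(team2_off)
even
# TRUE TRUE FALSE FALSE TRUE
```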
I understand that NAs may have different meanings than 0 in this scenario. If you really care that much, just create a dummy column of player_off.
> dt$dummy<-dt$player_off
> dt$dummy[is.na(dt$dummy)]<-0
> dt<-as.data.table(dt)
> dt[, .(
+ action_id,
+ team2_off = (team_id == max(team_id)) * dummy,
+ team1_off = (team_id == min(team_id)) * dummy,
+ team_id,
+ time_ref,
+ player_off
+ ), by = game_id][order(game_id, time_ref)][, .(team_id,
+ time_ref,
+ action_id,
+ player_off,
+ even = as.numeric(cumsum(team2_off) == cumsum(team1_off))), by = game_id]
game_id team_id time_ref action_id player_off even
1: 100 10 1000 1 NA 1
2: 100 10 1001 1 NA 1
3: 100 10 1002 1 NA 1
4: 100 11 1003 1 NA 1
5: 100 11 1004 2 NA 1
6: 100 11 1005 1 NA 1
7: 100 10 1006 3 1 0
8: 100 11 1007 1 NA 0
9: 100 10 1008 1 NA 0
10: 100 10 1009 1 NA 0
11: 101 12 1000 3 0 1
12: 101 12 1001 1 NA 1
13: 101 12 1002 1 NA 1
14: 101 13 1003 2 NA 1
15: 101 13 1004 3 1 0
16: 101 12 1005 1 NA 0
17: 101 13 1006 1 NA 0
18: 101 13 1007 1 NA 0
19: 101 12 1008 1 NA 0
20: 101 12 1009 1 NA 0
I really think your question is very interesting, and I was determined to solve it using data.table. It took me a few hours and I almost gave up on data.table, thinking that it just couldn't process two groupings at a time. I eventually solved it with a logical multiplication.
Great fun I had
team1_off = (team_id == min(team_id)) * dummy
team2_off = (team_id == max(team_id)) * dummy

How to calculate the cumulative data difference with preceding data by group?

The reduced raw data is as follows:
Data group
2016/1/10 1
2016/2/4 1
2016/3/25 1
2016/4/13 1
2016/5/5 1
2016/7/1 2
2016/8/1 2
2016/10/1 2
2016/12/1 2
2016/12/31 2
The final data I want to get looks like this:
Data group cum_diff_preceding
2016/1/10 1 0
2016/2/4 1 25
2016/3/25 1 125
2016/4/13 1 182
2016/5/5 1 270
2016/7/1 2 0
2016/8/1 2 31
2016/10/1 2 153
2016/12/1 2 336
2016/12/31 2 456
The calculation method is as follows:
for row 2016/1/10, cum_diff_preceding is 0
for row 2016/2/4, cum_diff_preceding is (2016/2/4-2016/1/10)
for row 2016/3/25, cum_diff_preceding is (2016/3/25-2016/1/10)+(2016/3/25-2016/2/4)
for row 2016/4/13, cum_diff_preceding is (2016/4/13-2016/1/10)+(2016/4/13-2016/2/4)+(2016/4/13-2016/3/25)
for row 2016/5/5, cum_diff_preceding is (2016/5/5-2016/1/10)+(2016/5/5-2016/2/4)+(2016/5/5-2016/3/25)+(2016/5/5-2016/4/13)
for row 2016/7/1, cum_diff_preceding is 0
for row 2016/8/1, cum_diff_preceding is (2016/8/1-2016/7/1)
for row 2016/10/1, cum_diff_preceding is (2016/10/1-2016/7/1)+(2016/10/1-2016/8/1)
for row 2016/12/1, cum_diff_preceding is (2016/12/1-2016/7/1)+(2016/12/1-2016/8/1)+(2016/12/1-2016/10/1)
for row 2016/12/31, cum_diff_preceding is (2016/12/31-2016/7/1)+(2016/12/31-2016/8/1)+(2016/12/31-2016/10/1)+(2016/12/31-2016/12/1)
My main code is as follows:
>as.Date(df$Data,"%Y-%m-%d")
>fun_forcast<-function(df){for(i in 2:nrow(df)){df$cum_diff_preceeding[i]<-sum(df$data[i]-df$data[1:(i-1)])}}
>ddply(df,.(group),transform,cum_diff_preceding<-fun_forcast)
but it does not work.
Or when I change my code to
>fun_forcast<-function(df)(df$cum_diff_preceding<-sapply(1:NROW(df), function(i) sum(df$data[i] - df$data[1:(i-1)])))
ddply(df,.(group),fun_forcast)
it works, but the result format is
> ddply(df,.(group),fun_forcast)
group V1 V2 V3 V4 V5
1 1 0 25 125 182 270
2 2 0 31 153 336 380
I don't know how to get the results back into cum_diff_preceding in the original data.frame.
please
We can do this with ave from base R
df$Data <- as.Date(df$Data, "%Y/%m/%d")
fun_forcast <- function(v1) sapply(seq_along(v1), function(i) sum(v1[i] - v1[1:(i-1)]))
df$cum_diff_preceding <- with(df, ave(as.numeric(Data), group, FUN = fun_forcast))
df$cum_diff_preceding
#[1] 0 25 125 182 270 0 31 153 336 456
Or use dplyr
library(dplyr)
df %>%
group_by(group) %>%
mutate(cum_diff_preceding = fun_forcast(Data))
# A tibble: 10 x 3
# Groups: group [2]
# Data group cum_diff_preceding
# <date> <int> <dbl>
# 1 2016-01-10 1 0
# 2 2016-02-04 1 25
# 3 2016-03-25 1 125
# 4 2016-04-13 1 182
# 5 2016-05-05 1 270
# 6 2016-07-01 2 0
# 7 2016-08-01 2 31
# 8 2016-10-01 2 153
# 9 2016-12-01 2 336
#10 2016-12-31 2 456
By converting the dates to numeric, and generalizing the formula:
df %>%
group_by(group) %>%
mutate(numdata = as.numeric(Data),
cum_diff_preceding = (1:n())*numdata-cumsum(numdata)) %>%
select(-numdata)
# A tibble: 10 x 3
# Groups: group [2]
# Data group cum_diff_preceding
# <date> <int> <dbl>
# 1 2016-01-10 1 0
# 2 2016-02-04 1 25
# 3 2016-03-25 1 125
# 4 2016-04-13 1 182
# 5 2016-05-05 1 270
# 6 2016-07-01 2 0
# 7 2016-08-01 2 31
# 8 2016-10-01 2 153
# 9 2016-12-01 2 336
# 10 2016-12-31 2 456
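The closed form follows from expanding the sum: for the i-th date d_i in a group, sum over j < i of (d_i - d_j) = (i - 1) * d_i - (d_1 + ... + d_(i-1)) = i * d_i - cumsum(d)_i. A short base R check on the first group's dates, with variable names chosen here for illustration:

```r
# Dates of group 1 from the question, as day counts
d <- as.numeric(as.Date(c("2016/1/10", "2016/2/4", "2016/3/25",
                          "2016/4/13", "2016/5/5"), "%Y/%m/%d"))

# Direct definition: sum of differences to all preceding dates
brute <- sapply(seq_along(d), function(i) sum(d[i] - d[seq_len(i - 1)]))

# Closed form used in the mutate() call: i * d_i - cumsum(d)_i
closed <- seq_along(d) * d - cumsum(d)

brute
# 0 25 125 182 270
all(brute == closed)
# TRUE
```

The closed form avoids the quadratic work of the sapply() loop, which may matter for long groups.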

Aggregating over 2 columns of a dataframe R

My dataframe is as follows
TreeID Species PlotNo Basalarea
12345 A 1 120
13242 B 7 310
14567 D 8 250
13245 B 1 305
13426 B 1 307
13289 A 3 118
I used
newdata<- aggregate(Basalarea~PlotNo+Species, data, sum, na.rm=TRUE)
to aggregate all the values such that
newdata
Species PlotNo Basalarea
A 1 120
A 3 118
B 1 some value
B 7 310
D 8 250
This is great but I would like a dataframe such that
PlotNo A B D
1 120 some value 0
3 118 0 0
7 0 310 0
8 0 0 250
How do I obtain the above dataframe?
We can use dcast to convert from long to wide format. Specify the fun.aggregate as sum.
library(reshape2)
dcast(df1, PlotNo~Species, value.var='Basalarea', sum)
# PlotNo A B D
#1 1 120 612 0
#2 3 118 0 0
#3 7 0 310 0
#4 8 0 0 250
Or a base R option would be using xtabs. By default it gets the sum of the 'Basalarea' for the combinations of 'PlotNo' and 'Species'.
xtabs(Basalarea~PlotNo+Species, df1)
# Species
#PlotNo A B D
# 1 120 612 0
# 3 118 0 0
# 7 0 310 0
# 8 0 0 250
Or another base R option is tapply
with(df1, tapply(Basalarea, list(PlotNo, Species), FUN=sum))
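One difference worth noting: tapply() leaves NA for PlotNo/Species combinations that do not occur, while xtabs() reports 0. A minimal sketch, assuming df1 holds the six rows from the question (the names wide_tapply and wide_xtabs are illustrative):

```r
df1 <- data.frame(
  TreeID    = c(12345, 13242, 14567, 13245, 13426, 13289),
  Species   = c("A", "B", "D", "B", "B", "A"),
  PlotNo    = c(1, 7, 8, 1, 1, 3),
  Basalarea = c(120, 310, 250, 305, 307, 118)
)

# tapply() gives a PlotNo x Species matrix with NA for absent combinations
wide_tapply <- with(df1, tapply(Basalarea, list(PlotNo, Species), FUN = sum))
wide_tapply[is.na(wide_tapply)] <- 0   # replace NA by 0 to match the request

# xtabs() already fills absent combinations with 0
wide_xtabs <- xtabs(Basalarea ~ PlotNo + Species, df1)

wide_tapply["1", "B"]
# 612
```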

Rank function to rank multiple variables in R

I am trying to rank multiple numeric variables ( around 700+ variables) in the data and am not sure exactly how to do this as I am still pretty new to using R.
I do not want to overwrite the ranked values in the same variable and hence need to create a new rank variable for each of these numeric variables.
From reading the posts, I believe assign and transform function along with rank maybe able to solve this. I tried implementing as below ( sample data and code) and am struggling to get it to work.
The output dataset, in addition to the variables xcount, xvisit, ysales, needs to be populated with the variables xcount_rank, xvisit_rank, ysales_rank containing the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it's creating the rank values as (101, 230] , (230, 450] etc whereas I would like to see the values in the rank variable to be populated as 1, 2 etc up to 10 categories as per the splits I did. Is there any way to achieve this?
input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
Would like to see the records in the group they would fall under if I try to rank the interval values.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2
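The same two steps can also be sketched in base R with lapply(), which avoids depending on a particular dplyr version. The _bin column suffix and the vars object are names chosen here for illustration; the question itself only uses _rank.

```r
input <- read.table(header = FALSE, text = "101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id", "xcount", "xvisit", "ysales")

vars <- names(input)[2:4]   # the numeric variables to process

# Plain ranks, ties broken by order of appearance
input[paste(vars, "rank", sep = "_")] <-
  lapply(input[vars], rank, ties.method = "first")

# Interval number (1..3) instead of interval labels: labels = FALSE
input[paste(vars, "bin", sep = "_")] <-
  lapply(input[vars], cut, breaks = 3, labels = FALSE)

input$xcount_bin
# 1 1 1 1 2 1 3
```

Because lapply() works column by column, the same call should scale to several hundred variables by changing only vars.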
