I am trying to calculate a grouped rolling sum based on a window size k but, in the event that the within group row index (n) is less than k, I want to calculate the rolling sum using the condition k=min(n,k).
My issue is similar to the question "R dplyr rolling sum", but I am looking for a solution that provides a non-NA value for each row.
I can get part of the way there using dplyr and rollsum:
library(zoo)
library(dplyr)
df <- data.frame(Date = rep(seq(as.Date("2000-01-01"),
                                as.Date("2000-12-01"), by = "month"), 2),
                 ID = c(rep(1, 12), rep(2, 12)), value = 1)
df <- tbl_df(df)
df <- df %>%
  group_by(ID) %>%
  mutate(total3mo = rollsum(x = value, k = 3, align = "right", fill = NA))
df
Source: local data frame [24 x 4]
Groups: ID [2]
         Date    ID value total3mo
       (date) (dbl) (dbl)    (dbl)
1  2000-01-01     1     1       NA
2  2000-02-01     1     1       NA
3  2000-03-01     1     1        3
4  2000-04-01     1     1        3
5  2000-05-01     1     1        3
6  2000-06-01     1     1        3
7  2000-07-01     1     1        3
8  2000-08-01     1     1        3
9  2000-09-01     1     1        3
10 2000-10-01     1     1        3
..        ...   ...   ...      ...
In this case, what I would like is to return the value 1 for the observations on 2000-01-01 and the value 2 for the observations on 2000-02-01. More generally, I would like the rolling sum to be calculated over the largest window possible, but no larger than k.
In this particular case it's not too difficult to change a few NA values by hand. However, ultimately I would like to add several more columns to my data frame, each a rolling sum calculated over a different window. In that more general case it would get quite tedious to go back and change many NA values by hand.
Using the partial = TRUE argument of rollapplyr:
df %>%
  group_by(ID) %>%
  mutate(roll = rollapplyr(value, 3, sum, partial = TRUE)) %>%
  ungroup()
or without dplyr (still need zoo):
roll <- function(x) rollapplyr(x, 3, sum, partial = TRUE)
transform(df, roll = ave(value, ID, FUN = roll))
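If you need several such columns at once (the more general case mentioned in the question), the same pattern extends naturally. A minimal sketch, assuming hypothetical window lengths of 3, 6 and 12 months:
df %>%
  group_by(ID) %>%
  mutate(total3mo  = rollapplyr(value, 3,  sum, partial = TRUE),
         total6mo  = rollapplyr(value, 6,  sum, partial = TRUE),
         total12mo = rollapplyr(value, 12, sum, partial = TRUE)) %>%
  ungroup()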
Related
In R, I'm trying to average a subset of a column based on selecting a certain value (ID) in another column. Consider choosing one ID among 100 IDs, say ID number 5. I want to average the subset of values in another column that corresponds to ID 5, and then do the same thing for the rest of the IDs. What should this function be?
Using dplyr:
library(dplyr)
dt <- data.frame(ID = rep(1:3, each=3), values = runif(9, 1, 100))
dt %>%
  group_by(ID) %>%
  summarise(avg = mean(values))
Output:
     ID   avg
  <int> <dbl>
1     1  41.9
2     2  79.8
3     3  39.3
Data:
  ID    values
1  1  8.628964
2  1 99.767843
3  1 17.438596
4  2 79.700918
5  2 87.647472
6  2 72.135906
7  3 53.845573
8  3 50.205122
9  3 13.811414
We can use a group-by mean. In base R, this can be done with aggregate:
dt <- data.frame(ID = rep(1:3, each=3), values = runif(9, 1, 100))
aggregate(values ~ ID, dt, mean)
Output:
  ID   values
1  1 40.07086
2  2 53.59345
3  3 47.80675
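If you literally want the mean for a single chosen ID first (say ID 5, as in the question), a plain subset is enough. A minimal sketch; note the example data above only contains IDs 1 to 3, so ID 5 here is the question's hypothetical:
mean(dt$values[dt$ID == 5])  # average of 'values' over rows where ID == 5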
I have a data frame with three columns. Each row contains three unique numbers between 1 and 5 (inclusive).
df <- data.frame(a = c(1, 4, 2),
                 b = c(5, 3, 1),
                 c = c(3, 1, 5))
I want to use mutate to create two additional columns that, for each row, contain the two numbers between 1 and 5 that do not appear in the initial three columns in ascending order. The desired data frame in the example would be:
df2 <- data.frame(a = c(1, 4, 2),
                  b = c(5, 3, 1),
                  c = c(3, 1, 5),
                  d = c(2, 2, 3),
                  e = c(4, 5, 4))
I tried to use the mutate call below, utilizing setdiff, to accomplish this, but it returned NAs rather than the values I was looking for:
df <- df %>% mutate(d = setdiff(c(a, b, c), c(1:5))[1],
                    e = setdiff(c(a, b, c), c(1:5))[2])
I can get around this by looping through each row (or using an apply function) but would prefer a mutate approach if possible.
Thank you for your help!
Base R:
cbind(df, t(apply(df, 1, setdiff, x = 1:5)))
# a b c 1 2
# 1 1 5 3 2 4
# 2 4 3 1 2 5
# 3 2 1 5 3 4
Warning: if there are any non-numerical columns, apply will happily up-convert things (converting to a matrix internally).
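If that up-conversion is a concern, one workaround is to iterate over the rows without going through a matrix. A sketch using vapply, assuming df still has only the three numeric columns a, b, c and every row really omits exactly two of 1:5:
df[c('d', 'e')] <- t(vapply(seq_len(nrow(df)),
                            function(i) setdiff(1:5, unlist(df[i, ])),
                            integer(2)))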
We can use pmap to loop over the rows, create a list column, and then unnest it to create two new columns:
library(dplyr)
library(purrr)
library(tidyr)
df %>%
  mutate(out = pmap(., ~ setdiff(1:5, c(...)) %>%
                      as.list %>%
                      set_names(c('d', 'e')))) %>%
  unnest_wider(c(out))
# A tibble: 3 x 5
#       a     b     c     d     e
#   <dbl> <dbl> <dbl> <int> <int>
# 1     1     5     3     2     4
# 2     4     3     1     2     5
# 3     2     1     5     3     4
Or using base R:
# asplit(df, 1) splits the data frame into a list of its rows
df[c('d', 'e')] <- do.call(rbind, lapply(asplit(df, 1), function(x) setdiff(1:5, x)))
I have data with a grouping variable ("from") and values ("number"):
from number
1 1
1 1
2 1
2 2
3 2
3 2
I want to subset the data and select groups which have two or more unique values. In my data, only group 2 has more than one distinct 'number', so this is the desired result:
from number
2 1
2 2
Several possibilities; here's my favorite:
library(data.table)
setDT(df)[, if(+var(number)) .SD, by = from]
# from number
# 1: 2 1
# 2: 2 2
Basically, for each group we check whether there is any variance; if TRUE, the group's rows are returned.
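To see why the bare variance works as a truth value (an illustration, not part of the answer):
var(c(1, 1))  # 0   -> if(0)   is FALSE, so the group is dropped
var(c(1, 2))  # 0.5 -> if(0.5) is TRUE,  so the group is returned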
With base R, I would go with
df[as.logical(with(df, ave(number, from, FUN = var))), ]
# from number
# 3 2 1
# 4 2 2
Edit: for non-numerical data you could try the new uniqueN function from the devel version of data.table (or use length(unique(number)) > 1 instead):
setDT(df)[, if(uniqueN(number) > 1) .SD, by = from]
You could try
library(dplyr)
df1 %>%
  group_by(from) %>%
  filter(n_distinct(number) > 1)
# from number
#1 2 1
#2 2 2
Or using base R:
indx <- rowSums(!!table(df1)) > 1  # counts the distinct 'number' values per 'from'
subset(df1, from %in% names(indx)[indx])
# from number
#3 2 1
#4 2 2
Or
df1[with(df1, !ave(number, from, FUN=anyDuplicated)),]
# from number
#3 2 1
#4 2 2
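To see what the ave() call computes here, an illustration on the same data:
with(df1, ave(number, from, FUN = anyDuplicated))
# [1] 2 2 0 0 2 2  -> non-zero wherever a group repeats a value,
#                     so negating with ! keeps only group 2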
Using the variance concept shared by David, but doing it the dplyr way:
library(dplyr)
df %>%
  group_by(from) %>%
  mutate(variance = var(number)) %>%
  filter(variance != 0) %>%
  select(from, number)
#Source: local data frame [2 x 2]
#Groups: from
#from number
#1 2 1
#2 2 2
I have a data frame with an id, an ordering time value, and a value. For each group of ids, I would like to remove rows having a smaller value than rows with a smaller time value.
data <- data.frame(id = c(rep(c("a", "b"), each = 3L), "b"),
                   time = c(0, 1, 2, 0, 1, 2, 3),
                   value = c(1, 1, 2, 3, 1, 2, 4))
> data
  id time value
1  a    0     1
2  a    1     1
3  a    2     2
4  b    0     3
5  b    1     1
6  b    2     2
7  b    3     4
So the result would be:
> data
  id time value
1  a    0     1
2  a    2     2
3  b    0     3
4  b    3     4
(For id == "b", the rows where time %in% c(1, 2) are removed because their value is smaller than the value at an earlier time.)
I was thinking about lag:
data %>%
  group_by(id) %>%
  filter(time == 0 | lag(value, order_by = time) < value)
Source: local data frame [5 x 3]
Groups: id [2]
      id  time value
  <fctr> <dbl> <dbl>
1      a     0     1
2      a     2     2
3      b     0     3
4      b     2     2
5      b     3     4
But it doesn't work as expected, since lag is a vectorized function. Instead, the idea would be to use a "recursive lag", or to check against the last maximal value so far. I can do it recursively with a loop, but I'm sure there is a more straightforward, higher-level way to do it.
Any help would be appreciated, thank you!
Here is a data.table solution:
library(data.table)
setDT(data)
data[, myVal := cummax(c(0, shift(value)[-1])), by=id][value > myVal][, myVal := NULL][]
   id time value
1:  a    0     1
2:  a    2     2
3:  b    0     3
4:  b    3     4
The first part of the chain uses shift and cummax to create the cumulative maximum of the lagged value variable. In c(0, shift(value)[-1]), the 0 is prepended to supply a value lower than any in the variable; more generally, you could use min(value) - 1. The [-1] subsetting removes the first element of shift, which is NA. The second part of the chain selects observations where value is greater than the cumulative maximum. The final two chains remove the cumulative maximum variable and print out the result.
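As an illustration of that intermediate step, here is what it builds for id == "b" (values 3, 1, 2, 4 in time order):
v <- c(3, 1, 2, 4)
cummax(c(0, shift(v)[-1]))      # 0 3 3 3: running max of all earlier values
v > cummax(c(0, shift(v)[-1]))  # TRUE FALSE FALSE TRUE -> keep times 0 and 3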
Another option is to perform a self anti/non-equi join using data.table
library(data.table) # v1.10.0
setDT(data)[!data, on = .(id, time > time, value <= value)]
#    id time value
# 1:  a    0     1
# 2:  a    2     2
# 3:  b    0     3
# 4:  b    3     4
Which is basically saying: "if time is larger but value is less than or equal, then I don't want these rows" (hence the ! sign).
Here is an option with dplyr. After grouping by 'id', we filter the rows where the 'value' is greater than the cumulative maximum of the 'lag' of the 'value' column
library(dplyr)
data %>%
  group_by(id) %>%
  filter(value > cummax(lag(value, default = 0)))
#      id  time value
#  <fctr> <dbl> <dbl>
# 1      a     0     1
# 2      a     2     2
# 3      b     0     3
# 4      b     3     4
Or another option is slice after arranging by 'id' and 'time' (as the OP mentioned the order matters):
data %>%
  group_by(id) %>%
  arrange(id, time) %>%
  slice(which(value > cummax(lag(value, default = 0))))
My goal is to get the same number of rows for each split (based on the Initials column). I am trying to pad the number of rows so that each person has the same amount, while retaining the Initials column so I can tell them apart. My attempt below failed completely. Does anybody have suggestions?
df <- data.frame(Initials = c("a", "a", "b"), data = c(2, 3, 4))
attach(df)
maxrows <- max(table(Initials)) + 1
arr <- split(df, Initials)
lapply(arr, function(x){
  toadd <- maxrows - dim(x)[1]
  replicate(toadd, x <- rbind(x, rep(NA, 1)))  # colnames - 1 because col 1 should be the same Initial
})
Goal:
a 2
a 3
b 4
b NA
Using data.table...
my_rows <- seq.int(max(tabulate(df$Initials)))
library(data.table)
setDT(df)[ , .SD[my_rows], by=Initials]
#    Initials data
# 1:        a    2
# 2:        a    3
# 3:        b    4
# 4:        b   NA
.SD is the Subset of Data associated with each by= group. We can subset its rows like .SD[row_numbers], unlike a data.frame which requires an additional comma DF[row_numbers,].
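The padding itself comes from the fact that indexing a data.table past its last row returns NA rows; a small illustration:
data.table(data = 4)[1:2]  # a one-row group, like Initials == "b"
#    data
# 1:    4
# 2:   NA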
The analogue in dplyr is
my_rows <- seq.int(max(tabulate(df$Initials)))
library(dplyr)
setDT(df) %>% group_by(Initials) %>% slice(my_rows)
#   Initials  data
#     (fctr) (dbl)
# 1        a     2
# 2        a     3
# 3        b     4
# 4        b    NA
Strangely, this only works if df is a data.table. I've filed a report/query with dplyr. There's a good chance that the dplyr devs will prevent this usage in a future version.
Here's a dplyr/tidyr method. We group_by Initials, add row numbers, ungroup, complete the row-number/Initials combinations, then remove the row numbers:
library(dplyr)
library(tidyr)
df %>%
  group_by(Initials) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  complete(Initials, row) %>%
  select(-row)
Source: local data frame [4 x 2]
  Initials  data
    (fctr) (dbl)
1        a     2
2        a     3
3        b     4
4        b    NA
Interesting problem. Try:
to.add <- max(table(df$Initials)) - table(df$Initials)
rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df)-1)))
#  Initials data
#1        a    2
#2        a    3
#3        b    4
#4        b <NA>
We calculate the number of extra rows needed for each initial, combine those extras with NA values, and rbind them to the data frame.
max(table(df$Initials)) finds the count of the most frequent initial, in this case 2 (for "a"). Subtracting each initial's own count (table(df$Initials)) from that maximum gives a vector of the necessary additions. There's an added bonus to this method: by using table we automatically get a named vector.
We use the names of that vector to know 1) which initials to repeat, and 2) how many times they should be repeated.
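For the example data, that named vector looks like this:
to.add
# a b
# 0 1   -> "a" needs no extra rows, "b" needs one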
Note that this rbind coerces the data column to character; to restore its class, you can add newdf$data <- as.numeric(newdf$data), where newdf is the result of the rbind above.