Here is a sample data frame in which I would like to calculate a cumulative sum over a moving window.
d <- data.frame(y = 1:10)
From previous suggestions, I am able to perform a sliding window cumsum, using the following script (thanks to nograpes):
size <- 2                   # size of window
len <- nrow(d) - size + 1   # number of sliding windows to perform
sumsmatrix <- apply(d, 2, function(x)
  cumsum(x)[size:nrow(d)] - c(0, cumsum(x)[1:(len - 1)]))
This gives the following output:
y
3
5
7
9
11
13
15
17
19
My request is to perform a cumsum by moving window instead of a sliding window. For instance, if my window size is 2, I would like to calculate the cumsum of the first two rows of a column, then move on and calculate it for the 3rd and 4th rows, and so on.
Desired output:
y
1
3
3
7
5
11
7
15
9
19
How can I tweak the script to suit my needs?
We can do a group-by operation by creating a grouping variable with %/% and then using one of the group-by functions, i.e. tapply:
unlist(tapply(d$y, (seq_len(nrow(d))-1)%/% size, FUN = cumsum), use.names = FALSE)
#[1] 1 3 3 7 5 11 7 15 9 19
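To see the grouping variable on its own (a quick illustration, assuming size = 2 and the 10-row d from above):
(seq_len(nrow(d)) - 1) %/% size
#[1] 0 0 1 1 2 2 3 3 4 4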
Another way would be to create a group variable and use cumsum().
library(dplyr)
d %>%
  group_by(group = rep(1:(nrow(.)/2), each = 2)) %>%
  transmute(y = cumsum(y)) %>%
  ungroup() %>%
  select(-group)
# y
# <int>
#1 1
#2 3
#3 3
#4 7
#5 5
#6 11
#7 7
#8 15
#9 9
#10 19
The initial example seems to be for a rolling sum over a window of width 2, not a cumulative sum. It can be written more compactly as:
library(zoo)
rollapplyr(d, 2, sum)
or
rollsum(d, 2)
or this which uses no packages:
apply(d, 2, function(x) rowSums(embed(x, 2)))
Now, getting to the actual question, here are some alternatives:
1) zoo. To perform a cumsum rolling forward by 2 at a time:
apply(d, 2, function(x) t(rollapplyr(x, 2, cumsum, by = 2)))
2) no packages. This would also work and uses no packages:
apply(d, 2, function(x) apply(matrix(x, 2), 2, cumsum))
Revised to work on every column of its input.
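For example, with a hypothetical second column added to d, each column is processed independently:
d2 <- data.frame(y = 1:10, z = 10:1)  # z is a made-up second column
apply(d2, 2, function(x) apply(matrix(x, 2), 2, cumsum))
#       y  z
# [1,]  1 10
# [2,]  3 19
# [3,]  3  8
# [4,]  7 15
# [5,]  5  6
# [6,] 11 11
# [7,]  7  4
# [8,] 15  7
# [9,]  9  2
#[10,] 19  3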
Related
I am trying to get the rolling sum of the past N days without including NA's. Using this sample df:
myVec <- data.frame(myVec = c(7,2,4,5,1,3,2,9))
I'm aware of the function rollsumr using this approach:
library(zoo)
rollsumr(myVec$myVec, k = 3, fill = NA)
Which yields the following output:
[1] NA NA 13 11 10  9  6 14
However, I don't want the leading NA's to take up those spots. I'd like the first spot to be the first value, the second spot to be the sum of the 1st and 2nd values, and so on until the Nth spot is reached, which gives 7 9 13 11 10 9 6 14 as the final result.
I have a way to iterate through using a for loop when N is small; however, if my rolling-sum N were 50 with 100 rows, I'd have to handle each of the first 50 days individually. So I think there is a simpler, more efficient way to accomplish this.
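For reference, here is a minimal sketch of the loop-based approach described above (using the myVec data and N = 3); the answers below avoid the explicit loop:
n <- 3
out <- numeric(nrow(myVec))
for (i in seq_len(nrow(myVec))) {
  # sum over the window ending at i, truncated at the start of the vector
  out[i] <- sum(myVec$myVec[max(1, i - n + 1):i])
}
out
# [1]  7  9 13 11 10  9  6 14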
You can use partial=TRUE in rollapplyr:
library(zoo)
rollapplyr(myVec$myVec, 3, sum, partial = TRUE)
# [1] 7 9 13 11 10 9 6 14
Or using the same function in dplyr:
library(dplyr)
myVec %>%
  mutate(myVec = rollapplyr(myVec, 3, sum, partial = TRUE))
slider::slide_dbl is what you're looking for.
slider::slide_dbl(myVec$myVec, sum, .before = 2, .after = 0)
[1] 7 9 13 11 10 9 6 14
I want to exclude the minimum as well as the maximum value of each row in a data frame. (If one of those values is repeated, only one occurrence should be excluded.)
I can exclude either the minimum or the maximum, but not both.
I can't seem to find a way to combine the two (each works fine by itself):
d[-which(d == min(d))[1]]
d[-which(d == max(d))[1]]
This doesn't work:
d[
-which(d == min(d))[1] &
-which(d == max(d))[1]
]
It gives the full row.
(I also tried an approach using apply(d, 1, min/max), but this also fails.)
Update
As I remembered after looking at @Rich Pauloo's answer, we can directly use which.max and which.min to get the index of the maximum and minimum value in each row:
as.data.frame(t(apply(df, 1, function(x) x[-c(which.max(x), which.min(x))])))
# V1 V2 V3
#1 13 11 6
#2 15 8 18
#3 5 10 21
#4 14 12 17
#5 19 9 20
Here which.max/which.min ensure that you get the index of only the first maximum and minimum, respectively, for each row.
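A quick illustration of the difference on a made-up vector with a tie:
x <- c(2, 7, 1, 7)
which(x == max(x))  # every position of the maximum
#[1] 2 4
which.max(x)        # only the first one
#[1] 2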
Some other variations could be
as.data.frame(t(apply(df, 1, function(x)
  x[-c(which.max(x == min(x)), which.max(x == max(x)))])))
If you want to use which, we can do
as.data.frame(t(apply(df, 1, function(x)
  x[-c(which(x == min(x))[1], which(x == max(x))[1])])))
data
set.seed(1234)
df <- as.data.frame(matrix(sample(25), 5, 5))
df
# V1 V2 V3 V4 V5
#1 3 13 11 16 6
#2 15 1 8 25 18
#3 24 5 4 10 21
#4 14 12 17 2 22
#5 19 9 20 7 23
You were very close! With data.frames you need to use a comma within the brackets to accomplish row-column subsetting.
1) Use which.max() and which.min() to return the index of the max and min values of a vector, respectively.
2) Bind those indices into a new vector with c().
3) Use - and the vector from step 2 to subset your data frame for the desired rows.
Here's an example to copy/paste:
d <- data.frame(a = 1:5) # make example data.frame
d[-c(which.max(d$a), which.min(d$a)), ]
[1] 2 3 4
This will remove the rows containing the min and max values of score as shown in the example data frame.
library(tidyverse)
df <- tribble(~name, ~score,
              'John', 10,
              'Mike', 2,
              'Mary', 11,
              'Jane', 1,
              'Jill', 5)
df %>%
  arrange(score) %>%
  slice(-1, -nrow(.))
# A tibble: 3 x 2
name score
<chr> <dbl>
1 Mike 2
2 Jill 5
3 John 10
We can use
t(apply(df, 1, function(x) x[!x %in% range(x)]))
I want to aggregate data frame f into a new data frame g so that the column g$z contains a list of all group-wise values from column f$z. At first sight, this seems to be working:
f = data.frame(x=c(1, 1, 1, 2), y=c(4, 4, 5, 6), z=c(11, 12, 13, 14))
g = aggregate(z ~ x + y, f, c)
x y z
1 1 4 11, 12
2 1 5 13
3 2 6 14
Now I want to do different computations on the lists in column z for all rows in the data frame and put the results in new columns of the same data frame. But this doesn't work!
g$m = sum(g$z)
g$n = g$z + 1
Error in sum(g$z) : invalid 'type' (list) of argument
How can I work with lists inside a data frame cell like attempted above? Or is this simply un-R-like / impossible? If so, what is the correct approach?
UPDATE
My underlying goal is to do a lot of group-wise operations on all combinations of X and Y in the original data set. What options do I have for this in R in general?
1) Use apply. Pro: everything in one table. Con: complex table structure; can't use sum etc. directly.
2) for(y), for(x), subset. Pro: can do sum etc. directly. Con: lots of code, and possibly slow.
3) Work in parallel with the original and an aggregated table. Pro: can do sum etc. Con: data duplication.
Other options?
sum and vectorized arithmetic don't apply to lists directly, but you can simply use sapply and lapply for the task:
g$m <- sapply(g$z, sum)
g$n <- lapply(g$z, `+`, 1)
g
# x y z m n
#1 1 4 11, 12 23 12, 13
#2 1 5 13 13 14
#3 2 6 14 14 15
Or, if you use the tidyverse, you can use map + mutate:
library(dplyr)
library(purrr)
g %>% mutate(m = map_dbl(z, sum), n = map(z, ~ .x + 1))
# x y z m n
#1 1 4 11, 12 23 12, 13
#2 1 5 13 13 14
#3 2 6 14 14 15
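If the underlying goal is simply group-wise summaries for every x/y combination, another option is to skip the list column entirely and aggregate directly; a minimal sketch using the same f as above (base aggregate or dplyr both work):
aggregate(z ~ x + y, f, sum)
#  x y  z
#1 1 4 23
#2 1 5 13
#3 2 6 14
library(dplyr)
f %>% group_by(x, y) %>% summarise(m = sum(z))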
I have a survey of about 80 items, primarily valenced positively (higher scores indicate a better outcome), but about 20 of them are negatively valenced. I need to find a way to reverse score the negatively valenced ones in R. I am completely lost on how to do so. I am definitely an R beginner, and this is probably a dumb question, but could someone point me in a direction code-wise?
Here's an example with some fake data that you can adapt to your data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat = data.frame(Q1=sample(1:5,10,replace=TRUE),
                 Q2=sample(1:5,10,replace=TRUE),
                 Q3=sample(1:5,10,replace=TRUE))
dat
Q1 Q2 Q3
1 2 2 5
2 2 1 2
3 3 4 4
4 5 2 1
5 2 4 2
6 5 3 2
7 5 4 1
8 4 5 2
9 4 2 5
10 1 4 2
# Say you want to reverse questions Q1 and Q3
cols = c("Q1", "Q3")
dat[ ,cols] = 6 - dat[ ,cols]
dat
Q1 Q2 Q3
1 4 2 1
2 4 1 4
3 3 4 2
4 1 2 5
5 4 4 4
6 1 3 4
7 1 4 5
8 2 5 4
9 2 2 1
10 5 4 4
If you have a lot of columns, you can use tidyverse functions to select multiple columns to recode in a single operation.
library(tidyverse)
# Reverse code columns Q1 and Q3
dat %>% mutate(across(matches("^Q[13]"), ~ 6 - .))
# Reverse code all columns that start with Q followed by one or two digits
dat %>% mutate(across(matches("^Q[0-9]{1,2}"), ~ 6 - .))
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ 6 - .))
If different columns could have different maximum values, you can (adapting @HellowWorld's suggestion) customize the reverse-coding to the maximum value of each column:
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ max(.) + 1 - .))
Here is an alternative approach using the psych package. If you are working with survey data, this package has lots of good functions. Building on @eipi10's data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
original_data = data.frame(Q1=sample(1:5,10,replace=TRUE),
                           Q2=sample(1:5,10,replace=TRUE),
                           Q3=sample(1:5,10,replace=TRUE))
original_data
# Say you want to reverse questions Q1 and Q3. Set those keys to -1 and Q2 to 1.
# install.packages("psych") # Uncomment this if you haven't installed the psych package
library(psych)
keys <- c(-1,1,-1)
# Use the handy function from the psych package
# mini is the minimum value and maxi is the maximum value
# mini and maxi can also be vectors if you have different scales
new_data <- reverse.code(keys, original_data, mini = 1, maxi = 5)
new_data
The pro of this approach is that you can recode your entire survey in one function call. The con is that you need an extra library. The stock R approach is more elegant as well.
FYI, this is my first post on stack overflow. Long time listener, first time caller. So please give me feedback on my response.
Just converting @eipi10's answer using the tidyverse:
# Create same fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat <- data.frame(Q1 = sample(1:5, 10, replace = TRUE),
                  Q2 = sample(1:5, 10, replace = TRUE),
                  Q3 = sample(1:5, 10, replace = TRUE))
# Reverse scores in the desired columns (Q2 and Q3)
library(tidyverse)
dat <- dat %>%
  mutate(Q2Reversed = 6 - Q2,
         Q3Reversed = 6 - Q3)
Another option is to use recode() from the car package.
#Example data
data = data.frame(Q1=sample(1:5,10, replace=TRUE))
# Say you want to reverse questions Q1
library(car)
data$Q1reversed <- recode(data$Q1, "1=5; 2=4; 3=3; 4=2; 5=1")
data
The psych package has the intuitive reverse.code() function that can be helpful. Using the dataset started by @eipi10 and the same goal of reversing q1 and q3:
set.seed(1)
dat <- data.frame(q1 = sample(1:5, 10, replace = TRUE),
                  q2 = sample(1:5, 10, replace = TRUE),
                  q3 = sample(1:5, 10, replace = TRUE))
You can use the reverse.code() function. The first argument is the keys. This is a vector of 1 and -1. -1 means that you want to reverse that item. These go in the same order as your data.
The second argument, called items, is simply the name of your dataset. That is, where are these items located?
Last, the mini and maxi arguments are the smallest and largest values that a participant could possibly score. You can also leave these arguments as NULL, and the function will use the lowest and highest values in your data.
library(psych)
keys <- c(-1, 1, -1)
dat1 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat1
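As noted above, mini and maxi default to NULL, in which case the lowest and highest observed values are used; a quick sketch (dat1b is just an illustrative name):
dat1b <- reverse.code(keys = keys, items = dat)
dat1b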
Alternatively, your keys can also contain the specific names of the variables that you want to reverse score. This is helpful if you have many variables to reverse score and yields the same answer:
library(psych)
keys <- c("q1", "q3")
dat2 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat2
Note that, after reverse scoring, reverse.code() slightly modifies the variable name to have a - behind it (i.e., q1 becomes q1- after being reverse scored).
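If you prefer to keep the original column names, one option (a small sketch; reverse.code() returns a matrix, so colnames() applies) is to strip that trailing dash:
colnames(dat1) <- sub("-$", "", colnames(dat1))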
The solutions above assume wide data (one score per column). This reverse scores specific rows in long data (one score per row).
library(magrittr)
library(dplyr)
max <- 5  # maximum possible score on the scale
df <- data.frame(score = sample(1:max, 20, replace = TRUE))
df <- mutate(df, question = rownames(df))
df
df[c(4,13,17),] %<>% mutate(score = max + 1 - score)
df
Here is another attempt that will generalize to any number of columns. Let's use some made up data to illustrate the function.
# create a df
{
A = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
B = c(9, 2, 3, 2, 4, 0, 2, 7, 2, 8)
C = c(2, 4, 1, 0, 2, 1, 3, 0, 7, 8)
df1 = data.frame(A, B, C)
print(df1)
}
A B C
1 3 9 2
2 3 2 4
3 3 3 1
4 3 2 0
5 3 4 2
6 3 0 1
7 3 2 3
8 3 7 0
9 3 2 7
10 3 8 8
The columns to reverse code
# variables to reverse code
vtcode = c("A", "B")
The function to reverse-code the selected columns
reverseCode <- function(data, rev){
  # get maximum value per desired col: lapply(data[rev], max)
  # subtract values in cols to reverse-code from max value plus 1
  data[, rev] = mapply("-", lapply(data[rev], max), data[, rev]) + 1
  return(data)
}
reverseCode(df1, vtcode)
A B C
1 1 1 2
2 1 8 4
3 1 7 1
4 1 8 0
5 1 6 2
6 1 10 1
7 1 8 3
8 1 3 0
9 1 8 7
10 1 2 8
This code was inspired by a response from @catastrophic-failure to a related question about subtracting the max of a column from all entries in that column in R.
I am trying to calculate the lagged difference (or actual increase) for data that has been inadvertently aggregated. Each successive year in the data includes the values from the previous year. A sample data set can be created with this code:
set.seed(1234)
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
(df <- rbind(x, y, z))
I can use a combination of lapply() and split() to calculate the difference between each year for every unique id, like so:
(diffs <- lapply(split(df, df$id), function(x){-diff(x$value)}))
However, because of the nature of the diff() function, there are no results for the values in year 1, which means that after I flatten the diffs list of lists with Reduce(), I cannot add the actual yearly increases back into the data frame, like so:
df$actual <- Reduce(c, diffs) # flatten the list of lists
In this example, there are only 10 calculated differences or lags, while there are 15 rows in the data frame, so R throws an error when trying to add a new column.
How can I create a new column of actual increases with (1) the values for year 1 and (2) the calculated diffs/lags for all subsequent years?
This is the output I'm eventually looking for. My diffs list of lists calculates the actual values for years 2 and 3 just fine.
id value year actual
1 21 3 5
2 26 3 16
3 26 3 14
4 26 3 10
5 29 3 14
1 16 2 10
2 10 2 5
3 12 2 10
4 16 2 7
5 15 2 13
1 6 1 6
2 5 1 5
3 2 1 2
4 9 1 9
5 2 1 2
I think this will work for you. When you run into the diff problem just lengthen the vector by putting 0 in as the first number.
df <- df[order(df$id, df$year), ]
sdf <-split(df, df$id)
df$actual <- as.vector(sapply(seq_along(sdf), function(x) diff(c(0, sdf[[x]][,2]))))
df[order(as.numeric(rownames(df))),]
There's lots of ways to do this but this one is fairly fast and uses base.
Here are a few more ways of approaching this problem, utilizing aggregate, by, and plyr:
aggregate:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
df$actual <- c(unlist(t(aggregate(value~id, df, diff2)[, -1])))
df[order(as.numeric(rownames(df))),]
by:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
df$actual <- unlist(by(df$value, df$id, diff2))
df[order(as.numeric(rownames(df))),]
plyr:
df <- df[order(df$id, df$year), ]
df <- data.frame(temp=1:nrow(df), df)
library(plyr)
df <- ddply(df, .(id), transform, actual=diff2(value))
df[order(-df$year, df$temp),][, -1]
It gives you the final product of:
> df[order(as.numeric(rownames(df))),]
id value year actual
1 1 21 3 5
2 2 26 3 16
3 3 26 3 14
4 4 26 3 10
5 5 29 3 14
6 1 16 2 10
7 2 10 2 5
8 3 12 2 10
9 4 16 2 7
10 5 15 2 13
11 1 6 1 6
12 2 5 1 5
13 3 2 1 2
14 4 9 1 9
15 5 2 1 2
EDIT: Avoiding the Loop
May I suggest avoiding the loop by turning what I gave you into a function (the by solution is the easiest one for me to work with) and using sapply to apply it to the two columns you desire.
set.seed(1234) #make new data with another numeric column
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
df <- rbind(x, y, z)
df <- df.rep <- data.frame(df[, 1:2], new.var = df[, 2] + sample(1:5, nrow(df),
                           replace = T), year = df[, 3])
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x)) #function one
group.diff<- function(x) unlist(by(x, df$id, diff2)) #answer turned function
df <- data.frame(df, sapply(df[, 2:3], group.diff)) #apply group.diff to col 2:3
df[order(as.numeric(rownames(df))),] #reorder it
Of course you'd have to rename these unless you used transform as in:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x)) #function one
group.diff<- function(x) unlist(by(x, df$id, diff2)) #answer turned function
df <- transform(df, actual=group.diff(value), actual.new=group.diff(new.var))
df[order(as.numeric(rownames(df))),]
This would depend on how many variables you were doing this to.
1) diff.zoo. With the zoo package, it's just a matter of converting to zoo using split= and then performing the diff:
library(zoo)
zz <- zz0 <- read.zoo(df, split = "id", index = "year", FUN = identity)
zz[2:3, ] <- diff(zz)
It gives the following (in wide form rather than the long form you mentioned), where each column is an id and each row after the first is that year's value minus the prior year's (the first row keeps the year 1 values as-is):
> zz
1 2 3 4 5
1 6 5 2 9 2
2 10 5 10 7 13
3 5 16 14 10 14
The wide form shown may actually be preferable but you can convert it to long form if you want that like this:
dt <- function(x) as.data.frame.table(t(x))
setNames(cbind(dt(zz0), dt(zz)[3]), c("id", "year", "value", "actual"))
This puts the years in ascending order which is the convention normally used in R.
2) rollapply. Also using zoo this alternative uses a rolling calculation to add the actual column to your data. It assumes the data is structured as you show with the same number of years in each group arranged in order:
df$actual <- rollapply(df$value, 6, partial = TRUE, align = "left",
                       FUN = function(x) if (length(x) < 6) x[1] else x[1] - x[6])
3) subtraction. Making the same assumptions as in the prior solution, we can simplify further to this, which subtracts from each value the value 5 positions later:
transform(df, actual = value - c(tail(value, -5), rep(0, 5)))
or this variation:
transform(df, actual = replace(value, year > 1, -diff(ts(value), 5)))
EDIT: added rollapply and subtraction solutions.
Kind of hackish, but keeping your wonderful Reduce in place, you could add mock rows to your df for year 0:
mockRows <- data.frame(id = 1:5, value = 0, year = 0)
(df <- rbind(df, mockRows))
(df <- df[order(df$id, df$year), ])
(diffs <- lapply(split(df, df$id), function(x){diff(x$value)}))
(df <- df[df$year != 0,])
(df$actual <- Reduce(c, diffs)) # flatten the list of lists
df[order(as.numeric(rownames(df))),]
This is the output:
id value year actual
1 1 21 3 5
2 2 26 3 16
3 3 26 3 14
4 4 26 3 10
5 5 29 3 14
6 1 16 2 10
7 2 10 2 5
8 3 12 2 10
9 4 16 2 7
10 5 15 2 13
11 1 6 1 6
12 2 5 1 5
13 3 2 1 2
14 4 9 1 9
15 5 2 1 2