repeatedly applying ave for computing group means in a data frame - r

The following code separately produces the group means of x and y according to group. Suppose that I have a number of variables for which I need to repeat the same operation.
How would you suggest proceeding in order to obtain the same result through a single command? (I suppose it is necessary to use tapply, but I am not really sure about it.)
x=seq(1,11,by=2); y=seq(2,12,by=2); group=rep(1:2, each=3)
dat <- data.frame(cbind(group, x, y))
dat$m_x <- ave(dat$x, dat$group)
dat$m_y <- ave(dat$y, dat$group)
dat
Many thanks.

Alternative solutions using data.table and plyr packages:
1) Using data.table
require(data.table)
dt <- data.table(dat, key="group")
# Following #Matthew's comment, edited:
dt[, `:=`(m_x = mean(x), m_y = mean(y)), by=group]
Output:
group x y m_x m_y
1: 1 1 2 3 4
2: 1 3 4 3 4
3: 1 5 6 3 4
4: 2 7 8 9 10
5: 2 9 10 9 10
6: 2 11 12 9 10
2) using plyr and transform:
require(plyr)
ddply(dat, .(group), transform, m_x=mean(x), m_y=mean(y))
output:
group x y m_x m_y
1 1 1 2 3 4
2 1 3 4 3 4
3 1 5 6 3 4
4 2 7 8 9 10
5 2 9 10 9 10
6 2 11 12 9 10
3) using plyr and numcolwise (note the reduced output):
ddply(dat, .(group), numcolwise(mean))
Output:
group x y
1 1 3 4
2 2 9 10

Assuming you have more than just two columns, you would want to use apply to apply ave to every column in the matrix.
x=seq(1,11,by=2); y=seq(2,12,by=2); group=rep(1:2, each=3)
dat <- cbind(x, y)
ave.dat <- apply(dat, 2, function(column) ave(column, group))
# x y
# [1,] 3 4
# [2,] 3 4
# [3,] 3 4
# [4,] 9 10
# [5,] 9 10
# [6,] 9 10
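If the data are kept in a data frame rather than a matrix, the same idea can be sketched with lapply over the value columns, which also attaches the group means as new columns in one assignment. The helper name vars and the m_ prefix are illustrative assumptions, not part of the answer above.

```r
# Example data from the question
dat <- data.frame(group = rep(1:2, each = 3),
                  x = seq(1, 11, by = 2),
                  y = seq(2, 12, by = 2))

# Every column except the grouping column (assumed to be named "group")
vars <- setdiff(names(dat), "group")

# ave() uses mean by default and returns a vector as long as its input,
# so each result can be stored directly as a new column
dat[paste0("m_", vars)] <- lapply(dat[vars], function(col) ave(col, dat$group))
dat
```

Because vars is computed from the column names, the call does not change when more value columns are added.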

You can also use aggregate():
dat2 <- data.frame(dat, aggregate(dat[,-1], by=list(dat$group), mean)[group, -1])
dat2
group x y x.1 y.1
1 1 1 2 3 4
1.1 1 3 4 3 4
1.2 1 5 6 3 4
2 2 7 8 9 10
2.1 2 9 10 9 10
2.2 2 11 12 9 10
row.names(dat2) <- rownames(dat)
colnames(dat2) <- gsub("(.)\\.1", "m_\\1", colnames(dat2))
dat2
group x y m_x m_y
1 1 1 2 3 4
2 1 3 4 3 4
3 1 5 6 3 4
4 2 7 8 9 10
5 2 9 10 9 10
6 2 11 12 9 10
If the variable names are more than a single character, you would need to modify the gsub() call.
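One way to avoid the gsub() step for longer names is to rename the aggregated columns before joining them back with merge(). This is only a sketch: the column names height and weight are hypothetical, and merge() sorts by the key, so the original row order is not guaranteed in general.

```r
# Hypothetical data with multi-character variable names
dat <- data.frame(group  = rep(1:2, each = 3),
                  height = seq(1, 11, by = 2),
                  weight = seq(2, 12, by = 2))

vars <- setdiff(names(dat), "group")

# One row of means per group
means <- aggregate(dat[vars], by = list(group = dat$group), FUN = mean)

# Prefix the mean columns so they do not clash with the originals
names(means)[-1] <- paste0("m_", vars)

# Join the per-group means back onto every row
dat2 <- merge(dat, means, by = "group")
dat2
```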

Related

Add column to each data frame within list with function rowSums and range of columns

The following might serve as a small example of the real list.
a <- data.frame(
x = c("A","A","A","A","A"),
y = c(1,2,3,4,5),
z = c(1,2,3,4,5))
b <- data.frame(
x = c("A","A","A","A","A"),
y = c(1,2,3,4,5),
z = c(1,2,3,4,5))
c <- data.frame(
x = c("A","A","A","A","A"),
y = c(1,2,3,4,5),
z = c(1,2,3,4,5))
l <- list(a,b,c)
From the second column to the last column, on every data frame, I want to add the row sums as a new column to each data frame.
I tried:
lapply(l, function(x) rowSums(x[2:ncol(x)]))
which returns the correct sums, but doesn't add them to the data frames.
I also tried:
lapply(l, transform, sum = y + z)
which gives me the correct results but is not flexible enough, because I don't always know how many columns each data frame has or what they are named. The only thing I know is that I have to sum from the second column to the last. I tried to combine these two approaches, but I can't figure out exactly how to do it.
Thanks
Try this. You can index the columns negatively to exclude the first variable, so it does not matter how many additional variables there are when computing the row sums. Here is the code:
# Compute row sums over every column except the first, and return the modified data frame
l1 <- lapply(l, function(x) {x$RowSum <- rowSums(x[,-1], na.rm=TRUE); return(x)})
Output:
l1
[[1]]
x y z RowSum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[2]]
x y z RowSum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[3]]
x y z RowSum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
Here's how to combine your attempts. I used data[-1] instead of data[2:ncol(data)] because it seems simpler, but either should work.
lapply(l, function(data) transform(data, sum = rowSums(data[-1])))
Unfortunately, transform will be confused if the name of the argument to your anonymous function is the same as a column name: data[-1] needs to look at the data frame, not a particular column. (I originally used function(x) instead of function(data), and this caused an error because there is a column named x. From this perspective, Duck's answer is a little safer.)
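The name clash can be reproduced with a minimal sketch. The one-element list below is a reduced, hypothetical version of l, and tryCatch is used only to capture the error instead of stopping the script.

```r
# Reduced example: one data frame with a column named x
l <- list(data.frame(x = c("A", "A"), y = 1:2, z = 3:4))

# Argument named x: inside transform(), x refers to the column x
# (a character vector), so rowSums(x[-1]) fails
clash <- tryCatch(
  lapply(l, function(x) transform(x, sum = rowSums(x[-1]))),
  error = function(e) "error"
)

# Argument named data: no column has that name, so data[-1] is the data frame
ok <- lapply(l, function(data) transform(data, sum = rowSums(data[-1])))

clash
ok[[1]]
```

Any argument name that cannot collide with a column name avoids the problem.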
Does this work:
> add_col <- function(df){
+ df[(ncol(df)+1)] = rowSums(df[2:ncol(df)])
+ df
+ }
> lapply(l, add_col)
[[1]]
x y z V4
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[2]]
x y z V4
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[3]]
x y z V4
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
With sum as column name:
> add_col <- function(df){
+ df['sum'] = rowSums(df[2:ncol(df)])
+ df
+ }
> lapply(l, add_col)
[[1]]
x y z sum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[2]]
x y z sum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[3]]
x y z sum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
Using the tidyverse:
library(tidyverse)
map(l, ~.x %>% mutate(Sum := apply(.x[-1], 1, sum)))
#> [[1]]
#> x y z Sum
#> 1 A 1 1 2
#> 2 A 2 2 4
#> 3 A 3 3 6
#> 4 A 4 4 8
#> 5 A 5 5 10
#>
#> [[2]]
#> x y z Sum
#> 1 A 1 1 2
#> 2 A 2 2 4
#> 3 A 3 3 6
#> 4 A 4 4 8
#> 5 A 5 5 10
#>
#> [[3]]
#> x y z Sum
#> 1 A 1 1 2
#> 2 A 2 2 4
#> 3 A 3 3 6
#> 4 A 4 4 8
#> 5 A 5 5 10
Created on 2020-09-30 by the reprex package (v0.3.0)
We can use map with mutate
library(purrr)
library(dplyr)
map(l, ~ .x %>%
mutate(sum = rowSums(select(., -1))))
Or with c_across
map(l, ~ .x %>%
rowwise() %>%
mutate(sum = sum(c_across(-1), na.rm = TRUE)) %>%
ungroup)

Remove identical values if the same as previous in a time series

I have a time series:
df <- data.frame(t=1:10, x= c(5,7,8,9,5,5,5,5,4,3))
I want to remove values that are identical to the previous value to obtain:
x = c(5,7,8,9,5,4,3)
I tried:
df[unique(df$x),]
But this gives the incorrect answer.
You can do:
df[c(1, diff(df$x)) != 0, ]
t x
1 1 5
2 2 7
3 3 8
4 4 9
5 5 5
9 9 4
10 10 3
With dplyr, you can do:
df %>%
filter(x != lag(x, default = first(x)-1))
t x
1 1 5
2 2 7
3 3 8
4 4 9
5 5 5
6 9 4
7 10 3
In base R, we can use head and tail
subset(df, c(TRUE, head(x, -1) != tail(x, -1)))
# t x
#1 1 5
#2 2 7
#3 3 8
#4 4 9
#5 5 5
#9 9 4
#10 10 3
Another base solution would be using rle.
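If you want to subset the data frame based on this criterion, you can use the run lengths. Note that indexing with the cumulative sum of the lengths keeps the last row of each run (t = 8 rather than t = 5 for the repeated 5s). Otherwise, if you only need the de-duplicated x column, extract the values from rle. See below: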
df[cumsum(rle(df$x)$lengths), ] # dataframe subset
# t x
# 1 1 5
# 2 2 7
# 3 3 8
# 4 4 9
# 8 8 5
# 9 9 4
# 10 10 3
rle(df$x)$values # vector of values
# [1] 5 7 8 9 5 4 3
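If the first row of each run is wanted instead of the last, the start positions can be derived from the same rle result. This is a small sketch; the names ends and starts are illustrative.

```r
df <- data.frame(t = 1:10, x = c(5, 7, 8, 9, 5, 5, 5, 5, 4, 3))

r <- rle(df$x)

# Position of the last element of each run
ends <- cumsum(r$lengths)

# Position of the first element of each run
starts <- ends - r$lengths + 1L

df[starts, ]  # keeps t = 5 for the run of 5s
df[ends, ]    # keeps t = 8 for the run of 5s
```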
Or using data.table:
library(data.table)
setDT(df)[, rn := 1:.N, by = rleid(x)][rn == 1, .(t, x)]
# t x
# 1: 1 5
# 2: 2 7
# 3: 3 8
# 4: 4 9
# 5: 5 5
# 6: 9 4
# 7: 10 3
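library(dplyr)
library(tidyr)  # replace_na() comes from tidyr, not dplyr
df <- data.frame(t=1:10, x= c(5,7,8,9,5,5,5,5,4,3))
# TRUE where x differs from the previous value; the first element is NA
subsetVec <- df$x - lag(df$x) != 0
# Always keep the first row
subsetVec <- replace_na(subsetVec, TRUE)
df[subsetVec,]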

How to find closest match from list in R

I have a list of numbers and would like to find which is the next highest compared to each number in a data.frame. I have:
list <- c(3,6,9,12)
X <- c(1:10)
df <- data.frame(X)
And I would like to add a variable to df being the next highest number in the list. i.e:
X Y
1 3
2 3
3 3
4 6
5 6
6 6
7 9
8 9
9 9
10 12
I've tried:
df$Y <- which.min(abs(list-df$X))
but that gives an error message and would just get the closest value from the list, not the next above.
Another approach is to use findInterval:
df$Y <- list[findInterval(X, list, left.open=TRUE) + 1]
> df
X Y
1 1 3
2 2 3
3 3 3
4 4 6
5 5 6
6 6 6
7 7 9
8 8 9
9 9 9
10 10 12
You could do this...
df$Y <- sapply(df$X, function(x) min(list[list>=x]))
df
X Y
1 1 3
2 2 3
3 3 3
4 4 6
5 5 6
6 6 6
7 7 9
8 8 9
9 9 9
10 10 12
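Either approach can be wrapped in a small helper. The sketch below uses the findInterval idea; the function name next_at_or_above is hypothetical. It sorts the candidates first, since findInterval expects a non-decreasing vector, and returns NA when no candidate is large enough.

```r
# Smallest candidate that is >= each element of x, or NA if none exists
next_at_or_above <- function(x, candidates) {
  s <- sort(candidates)
  # With left.open = TRUE the intervals are (s[i], s[i + 1]], so a value
  # equal to a candidate maps to that candidate itself
  s[findInterval(x, s, left.open = TRUE) + 1]
}

df <- data.frame(X = 1:10)
df$Y <- next_at_or_above(df$X, c(3, 6, 9, 12))
df

next_at_or_above(13, c(3, 6, 9, 12))  # no candidate at or above 13
```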

Remove i+1th term if recurring

Say we have the following data
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
How would one write a function so that, for A, if the same value appears again in the (i+1)th position, the recurring row is removed?
Therefore the output should look like
data.frame(c(1,2,3,4,8,6,1,2,3,4), c(1,2,5,1,2,3,5,1,2,3))
My best guess would be to use a for loop; however, I have no experience with these.
You can try
data[c(TRUE, data[-1,1]!= data[-nrow(data), 1]),]
Another option, dplyr-esque:
library(dplyr)
dat1 <- data.frame(A=c(1,2,2,2,3,4,8,6,6,1,2,3,4),
B=c(1,2,3,4,5,1,2,3,4,5,1,2,3))
dat1 %>% filter(A != lag(A, default=FALSE))
## A B
## 1 1 1
## 2 2 2
## 3 3 5
## 4 4 1
## 5 8 2
## 6 6 3
## 7 1 5
## 8 2 1
## 9 3 2
## 10 4 3
Using diff, which calculates the pairwise differences with a lag of 1:
data[c( TRUE, diff(data[,1]) != 0), ]
output:
A B
1 1 1
2 2 2
5 3 5
6 4 1
7 8 2
8 6 3
10 1 5
11 2 1
12 3 2
13 4 3
Using rle
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
X <- rle(data$A)
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
View(data[Y, ])
row.names A B
1 1 1 1
2 2 2 2
3 5 3 5
4 6 4 1
5 7 8 2
6 8 6 3
7 10 1 5
8 11 2 1
9 12 3 2
10 13 4 3
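Since the question mentions a for loop, here is a sketch of the explicit loop that the vectorised answers above replace. The name keep is illustrative, and the vectorised forms are usually preferable in R.

```r
A <- c(1, 2, 2, 2, 3, 4, 8, 6, 6, 1, 2, 3, 4)
B <- c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3)
data <- data.frame(A, B)

# Start by keeping every row, then drop rows that repeat the previous A
keep <- rep(TRUE, nrow(data))
for (i in seq_len(nrow(data))[-1]) {
  if (data$A[i] == data$A[i - 1]) {
    keep[i] <- FALSE
  }
}

result <- data[keep, ]
result
```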

R: create a data frame out of a rolling window

Let's say I have a data frame with the following structure:
DF <- data.frame(x = 0:4, y = 5:9)
> DF
x y
1 0 5
2 1 6
3 2 7
4 3 8
5 4 9
What is the most efficient way to turn 'DF' into a data frame with the following structure:
w x y
1 0 5
1 1 6
2 1 6
2 2 7
3 2 7
3 3 8
4 3 8
4 4 9
Here w is the index of a length-2 window rolling through the data frame 'DF'. The length of the window should be arbitrary, i.e., a length of 3 yields
w x y
1 0 5
1 1 6
1 2 7
2 1 6
2 2 7
2 3 8
3 2 7
3 3 8
3 4 9
I am a bit stumped by this problem, because the data frame can also contain an arbitrary number of columns, i.e. w,x,y,z etc.
/edit 2: I've realized edit 1 is a bit unreasonable, as xts doesn't seem to deal with multiple observations per data point
My approach would be to use the embed function. The first thing to do is to create a rolling sequence of indices into a vector. Take a data-frame:
df <- data.frame(x = 0:4, y = 5:9)
nr <- nrow(df)
w <- 3 # window size
i <- 1:nr # indices of the rows
iw <- embed(i,w)[, w:1] # matrix of rolling-window indices of length w
> iw
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 3 4
[3,] 3 4 5
wnum <- rep(1:nrow(iw),each=w) # window number
inds <- i[c(t(iw))] # the indices flattened, to use below
dfw <- sapply(df, '[', inds)
dfw <- transform(data.frame(dfw), w = wnum)
> dfw
x y w
1 0 5 1
2 1 6 1
3 2 7 1
4 1 6 2
5 2 7 2
6 3 8 2
7 2 7 3
8 3 8 3
9 4 9 3
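The same row-index idea can be packaged as a function that works for any number of columns. This is a sketch rather than the answer's own code: the name rolling_windows and its width argument are hypothetical, it builds the indices with outer() instead of embed(), and it assumes the input has no column already named w.

```r
rolling_windows <- function(df, width) {
  n <- nrow(df)
  stopifnot(width >= 1, width <= n)

  # First row of each window
  starts <- seq_len(n - width + 1)

  # Column j of the matrix holds the row indices of window j;
  # as.vector() flattens it window by window
  idx <- as.vector(outer(seq_len(width) - 1L, starts, `+`))

  out <- data.frame(w = rep(starts, each = width), df[idx, , drop = FALSE])
  rownames(out) <- NULL
  out
}

DF <- data.frame(x = 0:4, y = 5:9)
rolling_windows(DF, 2)
rolling_windows(DF, 3)
```

Placing w first matches the column order requested in the question.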
