How to calculate rolling mean for each column - r

I want to calculate the mean of every five rows for each column, by group. I tried:
name<-colnames(df[,4:10])
df1<-for (i in name){
df%>%
group_by(A)%>%
summarise(!!paste(i,"mean"):=rollapplyr(get(i),5,mean,fill = NA,by.column=T))
}
The result, df1, is NULL.
Then I tried:
for (i in name){
df%>%
group_by(A)%>%
mutate(!!paste(i,"mean"):=rollapplyr(get(i),5,mean,fill = NA,by.column=T))
}
This runs, but nothing happens: df remains the same. And if I assign the result of the loop above to df1, df1 is still NULL.
I also tried rollmean:
df1<- for (i in name){
df%>%
group_by(CONM)%>%
mutate(!!paste(i,"mean"):=rollmean(get(i),5,fill = NA,align = "right"))
}
But I still get NULL.
My data is like this:
CONM A B C
a 1 2 3
a 2 3 4
a 3 4 5
a 4 5 6
a 5 6 7
a 6 7 8
And I want to get this result for each CONM:
CONM A B C A_mean B_mean C_mean
a 1 2 3 NA NA NA
a 2 3 4 NA NA NA
a 3 4 5 NA NA NA
a 4 5 6 NA NA NA
a 5 6 7 3 4 5
a 6 7 8 4 5 6
b 1 2 3 NA NA NA
Could someone help me with this? Should I use other packages? Thanks

We can use mutate with across to loop over columns A to C, passing a lambda (function(x) or the tidyverse shorthand ~) that applies rollmean to each column. The .names argument controls the names of the new columns.
library(dplyr)
library(zoo)
df %>%
group_by(CONM) %>%
mutate(across(A:C, ~ rollmean(., 5, fill = NA, align = 'right'),
.names = '{col}_mean')) %>%
ungroup
Output:
# A tibble: 7 x 7
# CONM A B C A_mean B_mean C_mean
# <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#1 a 1 2 3 NA NA NA
#2 a 2 3 4 NA NA NA
#3 a 3 4 5 NA NA NA
#4 a 4 5 6 NA NA NA
#5 a 5 6 7 3 4 5
#6 a 6 7 8 4 5 6
#7 b 1 2 3 NA NA NA
Or, as @G. Grothendieck mentioned, rollmeanr does the right alignment for us:
df %>%
group_by(CONM) %>%
mutate(across(A:C, ~ rollmeanr(., 5, fill = NA), .names = '{col}_mean'))
data
df <- structure(list(CONM = c("a", "a", "a", "a", "a", "a", "b"), A = c(1L,
2L, 3L, 4L, 5L, 6L, 1L), B = c(2L, 3L, 4L, 5L, 6L, 7L, 2L), C = c(3L,
4L, 5L, 6L, 7L, 8L, 3L)), class = "data.frame", row.names = c(NA,
-7L))
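As for why the loop attempts in the question give NULL: in R a for loop always evaluates to NULL, and the piped result inside the body is computed but never stored anywhere. Below is a minimal base-R sketch of a loop that does store its results, assuming the same toy data. Note that roll_mean_right is a hypothetical helper written only for this illustration (it is not a zoo function), and ave() is used to apply it within each CONM group.

```r
# Toy data matching the answer's `df`
df <- data.frame(
  CONM = c("a", "a", "a", "a", "a", "a", "b"),
  A = c(1L, 2L, 3L, 4L, 5L, 6L, 1L),
  B = c(2L, 3L, 4L, 5L, 6L, 7L, 2L),
  C = c(3L, 4L, 5L, 6L, 7L, 8L, 3L)
)

# A `for` loop evaluates to NULL, whatever its body computes
res <- for (i in 1:3) i * 2
is.null(res)
# [1] TRUE

# Hypothetical helper: right-aligned rolling mean of width k, padded with NA
roll_mean_right <- function(x, k = 5) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n >= k) {
    for (j in k:n) out[j] <- mean(x[(j - k + 1):j])
  }
  out
}

# Assign inside the loop so that each new column is actually kept
out <- df
for (col in c("A", "B", "C")) {
  # ave() applies the helper within each CONM group and preserves row order
  out[[paste0(col, "_mean")]] <- ave(as.numeric(df[[col]]), df$CONM,
                                     FUN = roll_mean_right)
}
out
```

The across() version above is shorter and is probably what you want in practice; the loop is only meant to show where the original attempts lost their results.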

Related

With R, how can I separate continuous values from a dataframe with item NA and calculate the average of only variable Y?

X Y
1 1 2
2 2 4
3 NA NA
4 NA NA
5 NA NA
6 NA NA
7 1 4
8 2 6
9 1 8
10 1 10
The result should be as follows: in the first case, the average of the values 2 and 4 is 3; in the second case, the average of the values 4, 6, 8 and 10 is 7; and so on.
Your data:
df = data.frame(X=c(1,2,NA,NA,NA,NA,1,2,1,1),Y=c(2,4,NA,NA,NA,NA,4,6,8,10))
You can identify blocks of consecutive rows without NAs using diff(complete.cases(..)); the block id changes whenever a row's NA status differs from that of the previous row:
blocks = cumsum(c(0,diff(complete.cases(df)) != 0 ))
block_means = tapply(df$Y,blocks,mean)
block_means
0 1 2
3 NA 7
block_means[!is.na(block_means)]
0 2
3 7
Or if you don't need to know the order:
na.omit(as.numeric(tapply(df$Y,blocks,mean)))
[1] 3 7
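To make the blocks step easier to follow, here is the same computation split into its intermediate pieces. This is only a sketch for illustration; the names ok and change are introduced here and are not part of the answer above.

```r
df <- data.frame(X = c(1, 2, NA, NA, NA, NA, 1, 2, 1, 1),
                 Y = c(2, 4, NA, NA, NA, NA, 4, 6, 8, 10))

# TRUE for rows that contain no NA at all
ok <- complete.cases(df)

# TRUE where a row's NA status differs from the previous row's
change <- diff(ok) != 0

# The block id starts at 0 and goes up by 1 at every change
blocks <- cumsum(c(0, change))
blocks
# [1] 0 0 1 1 1 1 2 2 2 2
```

Blocks 0 and 2 hold the complete rows and block 1 holds the NA rows, which is why the mean for block 1 comes out as NA.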
We can create groups of consecutive values using rleid from data.table, and within each group calculate the mean of the Y values.
library(dplyr)
df %>%
group_by(gr = data.table::rleid(is.na(Y))) %>%
summarise(Y = mean(Y, na.rm = TRUE)) %>%
filter(!is.na(Y)) -> df1
df1
# gr Y
# <int> <dbl>
#1 1 3
#2 3 7
The data.table way of doing this would be:
library(data.table)
df1 <- setDT(df)[, .(Y = mean(Y, na.rm = TRUE)), rleid(is.na(Y))][!is.na(Y)]
data
df <- structure(list(X = c(1L, 2L, NA, NA, NA, NA, 1L, 2L, 1L, 1L),
Y = c(2L, 4L, NA, NA, NA, NA, 4L, 6L, 8L, 10L)),
class = "data.frame", row.names = c(NA, -10L))

return all possible values with which.max in R

I have the following dataset
clust T2 n
1 a 1
1 b 3
1 c 3
2 d 5
3 a 4
3 b 3
4 b 5
4 c 8
4 t 6
4 e 7
etc..
using the following function:
library(dplyr)
table <- data %>% group_by(clust) %>% summarise(max = max(n), name1 = T2[which.max(n)])
I get this output
clust max name1
1 3 b
2 5 d
3 4 a
4 8 c
etc
However, there are cases where two or more T2 values correspond to max(n). How can I record those values too?
i.e.
clust max name1
1 3 b,c
2 5 d
3 4 a
4 8 c
etc
or
clust max name1
1 3 b
1 3 c
2 5 d
3 4 a
4 8 c
etc
We can compare with == instead of using which.max (which returns only the index of the first maximum), and paste the matching values together with toString:
library(dplyr)
library(tidyr)
data %>%
group_by(clust) %>%
summarise(max = max(n), name1 = toString(T2[n == max(n)]))
# A tibble: 4 x 3
# clust max name1
# <int> <int> <chr>
#1 1 3 b, c
#2 2 5 d
#3 3 4 a
#4 4 8 c
This can then be expanded into one row per value with separate_rows in the next step:
data %>%
group_by(clust) %>%
summarise(max = max(n), name1 = toString(T2[n == max(n)])) %>%
separate_rows(name1, sep=",\\s+")
# A tibble: 5 x 3
# clust max name1
# <int> <int> <chr>
#1 1 3 b
#2 1 3 c
#3 2 5 d
#4 3 4 a
#5 4 8 c
Or have a list column and then unnest
data %>%
group_by(clust) %>%
summarise(max = max(n), name1 = list(T2[n == max(n)])) %>%
unnest(c(name1))
# A tibble: 5 x 3
# clust max name1
# <int> <int> <chr>
#1 1 3 b
#2 1 3 c
#3 2 5 d
#4 3 4 a
#5 4 8 c
data
data <- structure(list(clust = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 4L, 4L,
4L), T2 = c("a", "b", "c", "d", "a", "b", "b", "c", "t", "e"),
n = c(1L, 3L, 3L, 5L, 4L, 3L, 5L, 8L, 6L, 7L)),
class = "data.frame", row.names = c(NA,
-10L))
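The difference between the two approaches is easy to see on a single group. A small base-R sketch, using the values of cluster 1 from the data above:

```r
# Cluster 1 from the example: two T2 values share the maximum n
n  <- c(1L, 3L, 3L)
T2 <- c("a", "b", "c")

# which.max() stops at the first maximum
first_only <- T2[which.max(n)]
first_only
# [1] "b"

# A logical comparison keeps every position that ties for the maximum
all_ties <- T2[n == max(n)]
all_ties
# [1] "b" "c"

# toString() collapses the ties into one comma-separated string
toString(all_ties)
# [1] "b, c"
```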

Getting rowSums for triplicate records and retaining only the one with highest value

I have a data frame with 163 observations and 65 columns with some animal data. The 163 observations are from 56 animals, and each was supposed to have triplicated records, but some information was lost so for the majority of animals, I have triplicates ("A", "B", "C") and for some I have only duplicates (which vary among "A" and "B", "A" and "C" and "B" and "C").
Columns 13:65 contain some information I would like to sum, and only retain the one triplicate with the higher rowSums value. So my data frame would be something like this:
ID Trip Acet Cell Fibe Mega Tera
1 4 A 2 4 9 8 3
2 4 B 9 3 7 5 5
3 4 C 1 2 4 8 6
4 12 A 4 6 7 2 3
5 12 B 6 8 1 1 2
6 12 C 5 5 7 3 3
I am not sure if what I need is to write my own function, or a loop, or what the best alternative actually is - sorry I am still learning and unfortunately for me, I don't think like a programmer so that makes things even more challenging...
So what I want is to keep only rows 2 and 6 (which have the highest rowSums among the triplicates for each animal), but for the whole data frame. What I want as a result is
ID Trip Acet Cell Fibe Mega Tera
1 4 B 9 3 7 5 5
2 12 C 5 5 7 3 3
REALLY sorry if the question is poorly elaborated or if it doesn't make sense, this is my first time asking a question here and I have only recently started learning R.
We can compute the row sums separately and use ave to find, within each ID, the row whose sum equals the group maximum. Then use the resulting logical vector to subset the rows of the dataset.
nm1 <- startsWith(names(df1), "V")
The OP has since updated the column names. In that case, use either a positional index
nm1 <- 3:7
Or select the columns with setdiff
nm1 <- setdiff(names(df1), c("ID", "Trip"))
v1 <- rowSums(df1[nm1], na.rm = TRUE)
i1 <- with(df1, v1 == ave(v1, ID, FUN = max))
df1[i1,]
# ID Trip V1 V2 V3 V4 V5
#2 4 B 9 3 7 5 5
#6 12 C 5 5 7 3 3
data
df1 <- structure(list(ID = c(4L, 4L, 4L, 12L, 12L, 12L), Trip = structure(c(1L,
2L, 3L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
V1 = c(2L, 9L, 1L, 4L, 6L, 5L), V2 = c(4L, 3L, 2L, 6L, 8L,
5L), V3 = c(9L, 7L, 4L, 7L, 1L, 7L), V4 = c(8L, 5L, 8L, 2L,
1L, 3L), V5 = c(3L, 5L, 6L, 3L, 2L, 3L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Here is one way.
library(tidyverse)
dat2 <- dat %>%
mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
group_by(ID) %>%
filter(Sum == max(Sum)) %>%
select(-Sum) %>%
ungroup()
dat2
# # A tibble: 2 x 7
# ID Trip V1 V2 V3 V4 V5
# <int> <fct> <int> <int> <int> <int> <int>
# 1 4 B 9 3 7 5 5
# 2 12 C 5 5 7 3 3
Here is another one. This method makes sure only one row per ID is kept, even if several rows have a row sum equal to the maximum.
dat3 <- dat %>%
mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
arrange(ID, desc(Sum)) %>%
group_by(ID) %>%
slice(1) %>%
select(-Sum) %>%
ungroup()
dat3
# # A tibble: 2 x 7
# ID Trip V1 V2 V3 V4 V5
# <int> <fct> <int> <int> <int> <int> <int>
# 1 4 B 9 3 7 5 5
# 2 12 C 5 5 7 3 3
DATA
dat <- read.table(text = " ID Trip V1 V2 V3 V4 V5
1 4 A 2 4 9 8 3
2 4 B 9 3 7 5 5
3 4 C 1 2 4 8 6
4 12 A 4 6 7 2 3
5 12 B 6 8 1 1 2
6 12 C 5 5 7 3 3 ",
header = TRUE)
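The example data has no ties, so all of the approaches above return the same two rows. A small base-R sketch of how they differ when a tie exists; the toy data frame below is made up for this illustration and is not from the question:

```r
# Hypothetical data: both rows of ID 1 have the same row sum (3)
toy <- data.frame(ID   = c(1L, 1L, 2L),
                  Trip = c("A", "B", "A"),
                  V1   = c(1L, 2L, 5L),
                  V2   = c(2L, 1L, 0L))

s <- rowSums(toy[c("V1", "V2")])

# Comparing against the group maximum keeps every tied row
keep_all <- toy[s == ave(s, toy$ID, FUN = max), ]
nrow(keep_all)
# [1] 3

# Sorting by ID and descending sum, then taking the first row per ID,
# keeps exactly one row per animal
sorted   <- toy[order(toy$ID, -s), ]
keep_one <- sorted[!duplicated(sorted$ID), ]
nrow(keep_one)
# [1] 2
```

Which behaviour is right depends on whether a tied animal should appear once or several times in the result.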

Is there a better way to spread a 'long' table with multiple columns into a 'wide' one? [duplicate]

This question already has answers here:
Convert data from long format to wide format with multiple measure columns
(6 answers)
Closed 4 years ago.
I want to reshape a long dataframe to a wide one. That is, I want to go from this:
file label val1 val2
1 red A 12 3
2 red B 4 2
3 red C 5 8
4 green A 3 3
5 green B 6 5
6 green C 9 6
7 blue A 3 3
8 blue B 1 2
9 blue C 4 6
to this:
file value1_A value1_B value1_C value2_A value2_B value2_C
1 red 12 4 5 3 2 8
2 green 3 6 9 3 5 6
3 blue 3 1 4 3 2 6
My best attempt thus far is as follows:
library(tidyverse)
dat <-
structure(list(file = structure(c(3L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 1L),
.Label = c("blue", "green", "red"),
class = "factor"),
label = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
.Label = c("A", "B", "C"),
class = "factor"),
val1 = c(12L, 4L, 5L, 3L, 6L, 9L, 3L, 1L, 4L),
val2 = c(3L, 2L, 8L, 3L, 5L, 6L, 3L, 2L, 6L)),
class = "data.frame", row.names = c(NA, -9L))
dat %>%
group_by(file) %>%
mutate(values1 = paste('value1', label, sep='_'),
values2 = paste('value2', label, sep='_')) %>%
spread(values1, val1) %>%
spread(values2, val2) %>%
select(-label)
# # A tibble: 9 x 7
# # Groups: file [3]
# file value1_A value1_B value1_C value2_A value2_B value2_C
# <fct> <int> <int> <int> <int> <int> <int>
# 1 blue 3 NA NA 3 NA NA
# 2 blue NA 1 NA NA 2 NA
# 3 blue NA NA 4 NA NA 6
# 4 green 3 NA NA 3 NA NA
# 5 green NA 6 NA NA 5 NA
# 6 green NA NA 9 NA NA 6
# 7 red 12 NA NA 3 NA NA
# 8 red NA 4 NA NA 2 NA
# 9 red NA NA 5 NA NA 8
The output is unsatisfactory since what should be on one row occupies three, with multiple 'NA'. This seems to be due to using spread twice, but I don't know how else to achieve the result I desire. I'd very much appreciate any advice on how to do this.
Many thanks in advance,
-R
Here's a way
library(tidyverse)
dat %>%
# first move to long form so we can
# see the original column names as strings
gather("variable_name", "value", contains("val")) %>%
# create the new column names from the variable name and the label
mutate(new_column_name = paste(variable_name, label, sep="_")) %>%
# get rid of the pieces we used to make the column names
select(-label, -variable_name) %>%
# now spread
spread(new_column_name, value)
Here's the data.table way, all in one line of code:
library( data.table )
dcast( setDT( dat ), file ~ label, value.var = c("val1", "val2"))
# file val1_A val1_B val1_C val2_A val2_B val2_C
# 1: blue 3 1 4 3 2 6
# 2: green 3 6 9 3 5 6
# 3: red 12 4 5 3 2 8
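For newer code, a similar one-step reshape is possible with pivot_wider, which accepts several value columns at once. This is a sketch assuming tidyr 1.0.0 or later is installed; the data frame is rebuilt here with character columns rather than factors to keep the example short.

```r
library(tidyr)

dat <- data.frame(
  file  = rep(c("red", "green", "blue"), each = 3),
  label = rep(c("A", "B", "C"), times = 3),
  val1  = c(12L, 4L, 5L, 3L, 6L, 9L, 3L, 1L, 4L),
  val2  = c(3L, 2L, 8L, 3L, 5L, 6L, 3L, 2L, 6L)
)

# One row per file; new columns are named <value column>_<label>
wide <- pivot_wider(dat, names_from = label, values_from = c(val1, val2))
names(wide)
# [1] "file"   "val1_A" "val1_B" "val1_C" "val2_A" "val2_B" "val2_C"
```

As with dcast, the new column names start with val1 and val2 rather than the value1 and value2 shown in the question, so rename afterwards if the exact names matter.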

Replacing missing character values by the character in the row below using R [duplicate]

This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 5 years ago.
I have a character column containing <NA> values, each of which I want to replace with the value in the row below. An example is below:
df12 <-
structure(
list(Reg = structure(c(NA, 1L, 1L, NA, 1L, 1L, NA, 2L, 2L, NA, 2L, 2L, NA, 2L, 2L)
, .Label = c("A", "B"), class = "factor")),
.Names = "Reg", row.names = c(NA, -15L), class = "data.frame")
df12
Reg
1 <NA>
2 A
3 A
4 <NA>
5 A
6 A
7 <NA>
8 B
9 B
10 <NA>
11 B
12 B
13 <NA>
14 B
15 B
library(dplyr)
Required Output
1 A
2 A
3 A
4 A
5 A
6 A
7 B
8 B
9 B
10 B
11 B
12 B
13 B
14 B
15 B
We can use fill from tidyr, specifying .direction = "up" so that each NA takes the value from the row below:
library(dplyr)
library(tidyr)
df12 %>%
fill(Reg, .direction = "up")
Another option is na.locf from zoo, which is designed specifically for filling missing values with the nearest non-missing value and may also be faster; with fromLast = TRUE it fills from the row below:
zoo::na.locf(df12,fromLast=TRUE)
# Reg
# 1 A
# 2 A
# 3 A
# 4 A
# 5 A
# 6 A
# 7 B
# 8 B
# 9 B
# 10 B
# 11 B
# 12 B
# 13 B
# 14 B
# 15 B
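If neither package is available, the same "fill from the row below" idea can be sketched in base R by walking the vector from the end. Note that fill_up is a hypothetical helper written for this illustration, not a function from tidyr or zoo.

```r
# Hypothetical helper: replace each NA with the value that follows it.
# Walking backwards means runs of several NAs are filled as well.
fill_up <- function(x) {
  for (i in rev(seq_along(x)[-length(x)])) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

x <- c(NA, "A", "A", NA, "B")
filled <- fill_up(x)
filled
# [1] "A" "A" "A" "B" "B"
```

A trailing NA has nothing below it and is left as NA.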
