How to average all columns in dataset by group [duplicate] - r

I'm using aggregate in R to try to summarize my dataset. I currently have 3-5 observations per ID, and I need to average these so that I have one value (the mean) per ID. Some columns return all NA when I use aggregate.
So far, I've created a vector for each column to average it, then tried to use merge to combine all of them. Some columns are characters, so I tried converting them to numbers using as.numeric(as.character(column)), but that returns too many NAs in the column.
library(dplyr)
Tr1 <- data %>% group_by(ID) %>% summarise(mean = mean(Tr1))
Tr2 <- data %>% group_by(ID) %>% summarise(mean = mean(Tr2))
Tr3 <- data %>% group_by(ID) %>% summarise(mean = mean(Tr3))
data2 <- merge(Tr1,Tr2,Tr3, by = ID)
This code first produces a warning:
There were 50 or more warnings (use warnings() to see the first 50)
and then an error:
Error in fix.by(by.x, x) :
'by' must specify one or more columns as numbers, names or logical
My original dataset looks like:
ID Tr1 Tr2 Tr3
1 4 5 6
1 5 3 9
1 3 5 9
4 5 1 8
4 2 6 4
6 2 8 6
6 2 7 4
6 7 1 9
and I am trying to write code that turns it into:
ID Tr1 Tr2 Tr3
1 4 4.3 8
4 3.5 3.5 6
6 3.7 5.3 6.3

You can use summarise_all instead of multiple uses of summarise:
library(dplyr)
data %>%
group_by(ID) %>%
summarise_all(mean)
# A tibble: 3 x 4
ID Tr1 Tr2 Tr3
<int> <dbl> <dbl> <dbl>
1 1 4 4.33 8
2 4 3.5 3.5 6
3 6 3.67 5.33 6.33
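In dplyr 1.0.0 and later, summarise_all() is superseded by across(). The following is a minimal sketch of the same summary, assuming the question's sample data is rebuilt as a data frame named data. The where(is.numeric) selection and na.rm = TRUE are additions made here, not part of the original answer; they are meant to skip character columns and ignore missing values instead of returning NA.

```r
library(dplyr)

# Sample data rebuilt from the question
data <- data.frame(
  ID  = c(1, 1, 1, 4, 4, 6, 6, 6),
  Tr1 = c(4, 5, 3, 5, 2, 2, 2, 7),
  Tr2 = c(5, 3, 5, 1, 6, 8, 7, 1),
  Tr3 = c(6, 9, 9, 8, 4, 6, 4, 9)
)

# One row per ID; every numeric non-grouping column is averaged
result <- data %>%
  group_by(ID) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

print(result)
```

As for the error in the question: merge() joins two data frames per call and expects by to be a quoted column name such as by = "ID", so passing three data frames with an unquoted ID fails before any joining happens.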

Related

R data imputation from group_by table [duplicate]

library(dplyr)
group = c(1,1,4,4,4,5,5,6,1,4,6)
animal = c('a','b','c','c','d','a','b','c','b','d','c')
sleep = c(14,NA,22,15,NA,96,100,NA,50,2,1)
test = data.frame(group, animal, sleep)
print(test)
group_animal = test %>% group_by(group, animal) %>% summarise(mean_sleep = mean(sleep, na.rm = TRUE))
I would like to replace the NA values in the sleep column with the mean sleep value computed per group and animal.
Is there any way that I can perform some sort of lookup like Excel that matches group and animal from the test dataframe to the group_animal dataframe and replaces the NA value in the sleep column from the test df with the sleep value in the group_animal df?
We could use mutate instead of summarise, since summarise returns a single row per group while mutate keeps every original row:
library(dplyr)
library(tidyr)
test <- test %>%
group_by(group, animal) %>%
mutate(sleep = replace_na(sleep, mean(sleep, na.rm = TRUE))) %>%
ungroup()
Output:
test
# A tibble: 11 × 3
group animal sleep
<dbl> <chr> <dbl>
1 1 a 14
2 1 b 50
3 4 c 22
4 4 c 15
5 4 d 2
6 5 a 96
7 5 b 100
8 6 c 1
9 1 b 50
10 4 d 2
11 6 c 1
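The Excel-style lookup the question asks about can also be written as a join. This is a sketch under the assumption that test and group_animal are built as in the question; the name filled and the .groups = "drop" argument are choices made here for illustration. left_join() attaches the group mean to every row, and coalesce() keeps the original sleep value unless it is NA.

```r
library(dplyr)

test <- data.frame(
  group  = c(1, 1, 4, 4, 4, 5, 5, 6, 1, 4, 6),
  animal = c("a", "b", "c", "c", "d", "a", "b", "c", "b", "d", "c"),
  sleep  = c(14, NA, 22, 15, NA, 96, 100, NA, 50, 2, 1)
)

# Lookup table: one mean per (group, animal) pair
group_animal <- test %>%
  group_by(group, animal) %>%
  summarise(mean_sleep = mean(sleep, na.rm = TRUE), .groups = "drop")

# Match each row to its group mean, then fill only the missing values
filled <- test %>%
  left_join(group_animal, by = c("group", "animal")) %>%
  mutate(sleep = coalesce(sleep, mean_sleep)) %>%
  select(-mean_sleep)

print(filled)
```

The mutate() version above gives the same result in one step; the join is mainly useful when the lookup table comes from a different source than the data being filled.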

Stepwise column sum in data frame based on another column in R

I have a data frame like this:
Team GF
A 3
B 5
A 2
A 3
B 1
B 6
Looking for output like this (just an additional column):
Team x avg(x)
A 3 0
B 5 0
A 2 3
A 3 2.5
B 1 5
B 6 3
avg(x) is the average of all previous instances of x where Team is the same. I have the following R code which gets the overall average, however I'm looking for the "step-wise" average.
new_df <- df %>% group_by(Team) %>% summarise(avg_x = mean(x))
Is there a way to vectorize this while only evaluating the previous rows on each "iteration"?
You want the cummean() function from dplyr, combined with lag() (replace_na() in the line below comes from tidyr):
df %>% group_by(Team) %>% mutate(avg_x = replace_na(lag(cummean(x)), 0))
Producing the following:
# A tibble: 6 × 3
# Groups: Team [2]
Team x avg_x
<chr> <dbl> <dbl>
1 A 3 0
2 B 5 0
3 A 2 3
4 A 3 2.5
5 B 1 5
6 B 6 3
As required.
Edit 1:
As @Ritchie Sacramento pointed out, the following is cleaner and clearer:
df %>% group_by(Team) %>% mutate(avg_x = lag(cummean(x), default = 0))
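For comparison, the same "average of all previous rows in the group" can be sketched in base R without dplyr. This assumes the data frame is named df with columns Team and x as in the answer; the helper name prev_mean is made up for this example. The running mean is cumsum(v) / seq_along(v), shifted down by one position, with 0 filling the first row of each group.

```r
df <- data.frame(
  Team = c("A", "B", "A", "A", "B", "B"),
  x    = c(3, 5, 2, 3, 1, 6)
)

# Hypothetical helper: mean of all earlier values, 0 when there are none
prev_mean <- function(v) {
  running <- cumsum(v) / seq_along(v)  # running mean including the current row
  c(0, head(running, -1))              # shift by one so the current row is excluded
}

# ave() applies the helper within each Team and keeps the original row order
df$avg_x <- ave(df$x, df$Team, FUN = prev_mean)

print(df)
```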

Create an id based on column value in R [duplicate]

I am working on a two-column dataset in R containing response values ("Response") of samples that belong to different groups ("Group"). I want to create a third column, ID, that numbers each sample from 1 to [..] (the groups do not all contain the same number of samples). Here are a few lines as an example: Example
Thanks for your help.
Try:
library(tidyverse)
your_data %>%
group_by(Group)%>%
mutate(ID = 1:n())
We could use cur_group_id(), which numbers the groups themselves rather than the rows within each group:
library(dplyr)
df %>%
group_by(Group) %>%
mutate(new_id = cur_group_id())
Output:
Group Response Id new_id
<chr> <dbl> <dbl> <int>
1 A 1.5 1 1
2 A 3.4 2 1
3 A 2.3 3 1
4 A 1.8 4 1
5 B 1.9 1 2
6 B 1.4 2 2
7 C 2.7 1 3
8 C 2.3 2 3
9 C 3.2 3 3
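The two answers produce different kinds of ID, so it helps to see them side by side. The sketch below uses base R only and assumes a data frame named df with the Group and Response values from the output above; the column names row_id and group_id are chosen here for illustration. Within dplyr, row_number() is a commonly used alternative to 1:n() for the first kind.

```r
df <- data.frame(
  Group    = c("A", "A", "A", "A", "B", "B", "C", "C", "C"),
  Response = c(1.5, 3.4, 2.3, 1.8, 1.9, 1.4, 2.7, 2.3, 3.2)
)

# Counter that restarts at 1 within each group (like mutate(ID = 1:n()))
df$row_id <- ave(seq_along(df$Group), df$Group, FUN = seq_along)

# One number per group, with groups taken in sorted order (like cur_group_id())
df$group_id <- as.integer(factor(df$Group))

print(df)
```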

How do I subset patient data based on number of readings for a particular variable for each patient?

I keep trying to find an answer, but haven't had much luck. I'll add a sample of some similar data.
What I'd be trying to do here is exclude patient 1 and patient 4 from my subset, as they only have one reading for "Mobility Score". So far, I've been unable to work out a way of counting the number of readings under each variable for each patient. If the patient only has one or zero readings, I'd like to exclude them from a subset.
This is an imgur link to the sample data. I can't upload the real data, but it's similar to this
This can be done with dplyr and group_by. For more information see ?group_by and ?summarize
# Create random data
dta <- data.frame(patient = rep(c(1,2),4), MobiScor = runif(8, 0,20))
dta$MobiScor[sample(1:8,3)] <- NA
# Count all available Mobility scores per patient and keep the original rows
library(dplyr)
dta %>% group_by(patient) %>% mutate(count = sum(!is.na(MobiScor)))
# Collapse to one row per patient with the count of available scores
dta %>% group_by(patient) %>% summarize(count = sum(!is.na(MobiScor)))
Example data
patient MobiScor
1 1 19.203898
2 2 13.684209
3 1 17.581468
4 2 NA
5 1 NA
6 2 NA
7 1 7.794959
8 2 NA
Result 1 (mutate)
patient MobiScor count
<dbl> <dbl> <int>
1 1 19.2 3
2 2 13.7 1
3 1 17.6 3
4 2 NA 1
5 1 NA 3
6 2 NA 1
7 1 7.79 3
8 2 NA 1
Result 2 (summarize)
patient count
<dbl> <int>
1 1 3
2 2 1
You can count the number of non-NA in each group and then filter based on that.
This can be done in base R:
subset(df, ave(!is.na(Mobility_score), patient, FUN = sum) > 1)
Using dplyr:
library(dplyr)
df %>% group_by(patient) %>% filter(sum(!is.na(Mobility_score)) > 1)
and data.table:
library(data.table)
setDT(df)[, .SD[sum(!is.na(Mobility_score)) > 1], patient]
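Since the real data is not available, here is a small deterministic example of the base R filter above. The values are made up for illustration: patients 1 and 4 have one and zero readings, so only patients 2 and 3 should remain. The column name Mobility_score follows the answer's code, and the names n_readings and kept are chosen here.

```r
# Made-up sample data: NA marks a missing Mobility Score reading
df <- data.frame(
  patient        = c(1, 2, 2, 3, 3, 3, 4),
  Mobility_score = c(5, 7, 9, NA, 4, 6, NA)
)

# Number of non-missing readings per patient, repeated on every row of that patient
n_readings <- ave(!is.na(df$Mobility_score), df$patient, FUN = sum)

# Keep only patients with more than one reading
kept <- df[n_readings > 1, ]

print(kept)
```

Rows with NA scores are kept for the patients that pass the filter; add a further !is.na(Mobility_score) condition if those rows should be dropped as well.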

Match string pattern in column names via loop and append as new column to dataframe

I have a data frame with column names as such:
abc_alpha = c(1,2,3,4)
abc_beta = c(5,6,7,8)
abc_char = c(9,10,11,12)
xyz_alpha = c(4,3,2,1)
xyz_beta = c(8,7,6,5)
xyz_char = c(12,11,10,9)
and my dataframe (df):
abc_alpha abc_beta abc_char xyz_alpha xyz_beta xyz_char
1 5 9 4 8 12
2 6 10 3 7 11
3 7 11 2 6 10
4 8 12 1 5 9
I would like to loop through the columns and match the columns that have the same end of the strings (after the underscore), take the average of two matching columns and append it to the end of the data frame as a new variable (col name for the new variable will be the matched string after the underscore). I'd like to use a loop instead of hard-coding the column names as the real dataset has way too many columns.
Expected output will be:
abc_alpha abc_beta abc_char xyz_alpha xyz_beta xyz_char alpha beta char
1 5 9 4 8 12 2.5 6.5 10.5
2 6 10 3 7 11 2.5 6.5 10.5
3 7 11 2 6 10 2.5 6.5 10.5
4 8 12 1 5 9 2.5 6.5 10.5
I've written the first part of the loop function, but can't seem to finish by appending the new columns to the dataframe:
for (i in 1:ncol(df)) {
x <- (strsplit(names(df)[i], split = '_', fixed = TRUE))[[1]][2]  # suffix after the underscore
}
I've browsed through possibly similar questions, but as I'm new to R, a lot of the answers that suggest using the apply family have confused me, and I've been unable to adapt those solutions to my situation.
Thank you!
We can split the data by a grouping variable created by removing the prefix (everything up to the underscore) from the column names, then take the rowMeans of each piece:
cbind(df, sapply(split.default(df, sub(".*_", "", names(df))), rowMeans))
#abc_alpha abc_beta abc_char xyz_alpha xyz_beta xyz_char alpha beta char
#1 1 5 9 4 8 12 2.5 6.5 10.5
#2 2 6 10 3 7 11 2.5 6.5 10.5
#3 3 7 11 2 6 10 2.5 6.5 10.5
#4 4 8 12 1 5 9 2.5 6.5 10.5
Or, using tidyverse: gather the columns into 'long' format, separate the 'key' column into two columns at the separator _, group by the row names and 'key2', summarise to get the mean, spread back to 'wide' format, and bind the result to the original dataset using bind_cols:
library(tidyverse)
df %>%
rownames_to_column('rn') %>% # create a rowname column
gather(key, val, -rn) %>% # convert to long format
separate(key, into = c('key1', 'key2')) %>% # split column into two
group_by(rn, key2) %>% # grouping with columns
summarise(val = mean(val)) %>% # get the mean
spread(key2, val) %>% # convert to wide format
ungroup %>% # remove the groups
select(-rn) %>% # select only columns of interest
bind_cols(df, .) # bind with the original dataset
# abc_alpha abc_beta abc_char xyz_alpha xyz_beta xyz_char alpha beta char
#1 1 5 9 4 8 12 2.5 6.5 10.5
#2 2 6 10 3 7 11 2.5 6.5 10.5
#3 3 7 11 2 6 10 2.5 6.5 10.5
#4 4 8 12 1 5 9 2.5 6.5 10.5
data
df <- data.frame(abc_alpha, abc_beta, abc_char, xyz_alpha, xyz_beta, xyz_char)
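Because the question asks specifically for a loop, here is a minimal loop-based sketch in base R. It assumes every column name has the form prefix_suffix as in the sample data; the variable names suffixes and cols are chosen here for illustration. For each distinct suffix, it finds the columns ending in that suffix and appends their row-wise mean as a new column.

```r
df <- data.frame(
  abc_alpha = c(1, 2, 3, 4),
  abc_beta  = c(5, 6, 7, 8),
  abc_char  = c(9, 10, 11, 12),
  xyz_alpha = c(4, 3, 2, 1),
  xyz_beta  = c(8, 7, 6, 5),
  xyz_char  = c(12, 11, 10, 9)
)

# Distinct endings after the last underscore: "alpha", "beta", "char"
suffixes <- unique(sub(".*_", "", names(df)))

for (s in suffixes) {
  # Columns whose name ends in "_<suffix>"; the new columns added below
  # have no underscore, so they are never matched on later iterations
  cols <- grep(paste0("_", s, "$"), names(df), value = TRUE)
  df[[s]] <- rowMeans(df[cols])
}

print(df)
```

The loop works unchanged when more than two columns share a suffix, since rowMeans() averages however many columns are matched.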
