Aggregating variables for cases - r

Hello fellow Overflowers,
the goal is to run several data manipulation steps on a fairly big dataset. In a first step, certain variables, which represent different cases of a certain piece of information, shall be aggregated for each case. There are always 5 variables to aggregate.
Right now, the dataset looks like this:
a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 ... xyz5 A B C
case1 3 4 7 9 6 21 13 4 1 7 8
case2 9 12 8 17 25 31 7 2 7 6
case3 5 3 11 10 32 19 13 5 1 6 8
...
It should look something like this:
mean-a mean-b ...mean-xyz A B C
case1 5,8 17 6,4 1 7 8
case2 9,6 24,3 8,3 2 7 6
case3 7,25 21,3 7 1 6 8
...
I'm not sure whether building a function or using the across function from the dplyr package is the right way to do it, since there are about 2000 variables which need to be aggregated.
Any help will be greatly appreciated.
Thanks a lot in advance!

You can also use the following solution:
library(dplyr)
library(stringr)
library(purrr)
# First we extract the unique letter values of column names
letters <- unique(str_remove(names(df), "\\d"))
[1] "a" "b"
# Then we iterate over each unique value and extract the columns that contain that letter
letters %>%
map(~ df %>%
select(contains(.x)) %>%
rowwise() %>%
mutate("mean_{.x}" := mean(c_across(contains(.x)), na.rm = TRUE))) %>%
bind_cols() %>%
relocate(contains("mean"), .after = last_col())
# A tibble: 3 x 12
# Rowwise:
a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 mean_a mean_b
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 4 7 9 6 21 13 7 8 4 5.8 10.6
2 9 12 8 17 25 31 4 2 2 7 14.2 9.2
3 5 3 11 10 32 19 13 2 2 5 12.2 8.2
Data
df <- tribble(
~a1, ~a2, ~a3, ~a4, ~a5, ~b1, ~b2, ~b3, ~b4, ~b5,
3, 4, 7, 9, 6, 21, 13, 7, 8, 4,
9, 12, 8, 17, 25, 31, 4, 2, 2, 7,
5, 3, 11, 10, 32, 19, 13, 2, 2, 5
)

Example Data:
# toy data
library(data.table)
m <- matrix(1:30, ncol = 10)
colnames(m) <- c(paste0('a', 1:5), paste0('b', 1:5))
d <- data.table(m)
d
a1 a2 a3 a4 a5 b1 b2 b3 b4 b5
1: 1 4 7 10 13 16 19 22 25 28
2: 2 5 8 11 14 17 20 23 26 29
3: 3 6 9 12 15 18 21 24 27 30
Determine Groups:
You can determine first the groups you want to aggregate.
groups <- split(colnames(d), gsub("\\d", "", colnames(d)))
groups
$a
[1] "a1" "a2" "a3" "a4" "a5"
$b
[1] "b1" "b2" "b3" "b4" "b5"
Aggregate
Afterwards you simply calculate the mean of each group.
d[, lapply(groups, function(i) {rowMeans(d[, i, with = FALSE])})]
a b
1: 7 22
2: 8 23
3: 9 24
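Both answers above drop the extra columns A, B and C that the question wants to keep. The following is a minimal base R sketch of one way to keep them, assuming that every column to aggregate ends in digits and every column to keep does not. The a/b values are taken from the first answer's example data and the A, B, C values from the question's desired output; the names is_group_col, prefix and result are only illustrative.

```r
# Toy data: two groups of five columns plus the untouched columns A, B, C
df <- data.frame(
  a1 = c(3, 9, 5),    a2 = c(4, 12, 3),  a3 = c(7, 8, 11),
  a4 = c(9, 17, 10),  a5 = c(6, 25, 32),
  b1 = c(21, 31, 19), b2 = c(13, 4, 13), b3 = c(7, 2, 2),
  b4 = c(8, 2, 2),    b5 = c(4, 7, 5),
  A = c(1, 2, 1), B = c(7, 7, 6), C = c(8, 6, 8)
)

# Assumption: columns to aggregate are exactly those whose name ends in digits
is_group_col <- grepl("\\d+$", names(df))
prefix <- sub("\\d+$", "", names(df)[is_group_col])

# Split the columns by prefix, then take one row mean per group
groups <- split.default(df[is_group_col], prefix)
means <- lapply(groups, rowMeans, na.rm = TRUE)
names(means) <- paste0("mean_", names(means))

# Put the group means in front of the columns that were left alone
result <- cbind(as.data.frame(means), df[!is_group_col])
result
```

Because the grouping is derived from the column names, the same code should work unchanged for about 2000 columns, as long as the naming pattern holds.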

Related

replace NAs in data frame with 'average if' of row

I have some data where each unique ID is a member of a group. Some IDs have missing data; for these I'd like to take the average of the other members of the same group for that row.
For example, with the below data I'd like to replace the "NA" for id 3 in row V_2 with the average of the other Group A members for that row (average of 21 & 22). Similarly for id 7 in row V_3 it would be the average of 34 & 64.
Group=rep(c('A', 'B', 'C'), each=3)
id=1:9
V_1 = t(c(10,20,30,40,10,10,20,35,65))
V_2 = t(c(21,22,"NA",42,12,12,22,32,63))
V_3 = t(c(24,24,34,44,14,14,"NA",34,64))
df <- as.data.frame(rbind(Group, id, V_1, V_2, V_3))
df
Group A A A B B B C C C
id 1 2 3 4 5 6 7 8 9
X 10 20 30 40 10 10 20 35 65
X.1 21 22 NA 42 12 12 22 32 63
X.2 24 24 34 44 14 14 NA 34 64
An approach using dplyr. The warnings occur because data frame columns are all character in your example (because the character class Group is in row 1). So ideally the whole data frame should be transposed...
library(dplyr)
library(tidyr)
tibble(data.frame(t(df))) %>%
group_by(Group) %>%
mutate(across(X:X.2, ~ as.numeric(.x))) %>%
mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm=T)))) %>%
t() %>%
as.data.frame()
V1 V2 V3 V4 V5 V6 V7 V8 V9
Group A A A B B B C C C
id 1 2 3 4 5 6 7 8 9
X 10 20 30 40 10 10 20 35 65
X.1 21.0 22.0 21.5 42.0 12.0 12.0 22.0 32.0 63.0
X.2 24 24 34 44 14 14 49 34 64
Warning messages:
1: Problem while computing `..1 = across(X:X.2, ~as.numeric(.x))`.
ℹ NAs introduced by coercion
ℹ The warning occurred in group 1: Group = "A".
2: Problem while computing `..1 = across(X:X.2, ~as.numeric(.x))`.
ℹ NAs introduced by coercion
ℹ The warning occurred in group 3: Group = "C".
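The coercion behind these warnings can be reproduced in isolation. The following is a small sketch with made-up vectors (not the full data from the question): rbind() on character and numeric rows gives a character matrix, and a quoted "NA" is an ordinary string until as.numeric() turns it into a real missing value.

```r
# Binding a character row and a numeric row coerces everything to character
m <- rbind(Group = c("A", "A"), V_2 = c(21, 22))
typeof(m)

# A quoted "NA" is a string, so is.na() does not flag it
x <- c("21", "22", "NA")
is.na(x)

# as.numeric() cannot parse "NA" as a number; it returns NA and warns
# (the warning is suppressed here on purpose)
num <- suppressWarnings(as.numeric(x))
is.na(num)
```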
Same example using transposed data
df_t %>%
group_by(Group) %>%
mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm=T)))) %>%
ungroup()
# A tibble: 9 × 5
Group id X X.1 X.2
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 10 21 24
2 A 2 20 22 24
3 A 3 30 21.5 34
4 B 4 40 42 44
5 B 5 10 12 14
6 B 6 10 12 14
7 C 7 20 22 49
8 C 8 35 32 34
9 C 9 65 63 64
with transpose back to wider format
df_t %>%
group_by(Group) %>%
mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm=T)))) %>%
t() %>%
as.data.frame()
V1 V2 V3 V4 V5 V6 V7 V8 V9
Group A A A B B B C C C
id 1 2 3 4 5 6 7 8 9
X 10 20 30 40 10 10 20 35 65
X.1 21.0 22.0 21.5 42.0 12.0 12.0 22.0 32.0 63.0
X.2 24 24 34 44 14 14 49 34 64
transposed data
df_t <- structure(list(Group = c("A", "A", "A", "B", "B", "B", "C", "C",
"C"), id = c(1, 2, 3, 4, 5, 6, 7, 8, 9), X = c(10, 20, 30, 40,
10, 10, 20, 35, 65), X.1 = c(21, 22, NA, 42, 12, 12, 22, 32,
63), X.2 = c(24, 24, 34, 44, 14, 14, NA, 34, 64)), class = "data.frame", row.names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9"))
Structuring the data the tidy way might make it easier. Package {Hmisc} offers a convenience impute helper (since this is such a frequent task). That way you could proceed as follows:
tidy the data
## example dataframe df:
set.seed(4711)
df <- data.frame(Group = gl(3, 3, labels = LETTERS[1:3]),
id = 1:9,
V_1 = sample(c(NA, 1:8)),
V_2 = sample(c(NA, 1:8)),
V_3 = sample(c(NA, 1:8))
)
## > df |> head()
## Group id V_1 V_2 V_3
## 1 A 1 1 7 6
## 2 A 2 4 8 2
## 3 A 3 3 2 3
## 4 B 4 6 4 1
## 5 B 5 5 3 8
## 6 B 6 NA NA 4
use {Hmisc} and {dplyr} together with the pipeline notation:
library(dplyr)
library(Hmisc)
df_imputed <-
df |> mutate(across(V_1:V_3, impute, mean))
> df_imputed |> head()
Group id V_1 V_2 V_3
1 A 1 1.0 7.0 6
2 A 2 4.0 8.0 2
3 A 3 3.0 2.0 3
4 B 4 6.0 4.0 1
5 B 5 5.0 3.0 8
6 B 6 4.5 4.5 4
Should you now prefer to replace missing values with groupwise medians instead of total means, the tidy arrangement (together with {dplyr}) requires only one additional group_by clause:
df |>
group_by(Group) |>
mutate(across(V_1:V_3, impute, median))
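If no extra packages are wanted, the group-wise mean replacement asked for in the question can also be sketched in base R with ave(). This assumes the transposed layout (one row per id) used in the first answer; fill_with_mean and value_cols are hypothetical names introduced only for this example.

```r
# Transposed data from the question: one row per id
df_t <- data.frame(
  Group = rep(c("A", "B", "C"), each = 3),
  id    = 1:9,
  X     = c(10, 20, 30, 40, 10, 10, 20, 35, 65),
  X.1   = c(21, 22, NA, 42, 12, 12, 22, 32, 63),
  X.2   = c(24, 24, 34, 44, 14, 14, NA, 34, 64)
)

# Hypothetical helper: replace NAs with the mean of the non-missing values
fill_with_mean <- function(v) replace(v, is.na(v), mean(v, na.rm = TRUE))

# ave() applies the helper separately within each Group
value_cols <- c("X", "X.1", "X.2")
df_t[value_cols] <- lapply(
  df_t[value_cols],
  function(col) ave(col, df_t$Group, FUN = fill_with_mean)
)
df_t
```

Swapping mean() for median() inside the helper would give group-wise medians instead.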

create a new variable based on other factors using R

So I have this dataframe and I aim to add a new variable based on others:
Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8
I want to create a variable called c_sep that if:
Qi==1 or Qi==2 c_sep takes a random number between (c_gen + 6) and Age;
Qi==3 or Qi==4 c_sep takes a random number between (Age-15) and Age;
And 0 otherwise,
so my data would look something like this:
Qi Age c_gen c_sep
1 56 13 24
2 43 15 13
5 31 6 0
3 67 8 40
Any ideas, please?
In base R, you can do something along the lines of:
dat <- read.table(text = "Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8", header = T)
set.seed(100)
dat$c_sep <- 0
dat$c_sep[dat$Qi %in% c(1,2)] <- apply(dat[dat$Qi %in% c(1,2),], 1, \(row) sample(
(row["c_gen"]+6):row["Age"], 1
)
)
dat$c_sep[dat$Qi %in% c(3,4)] <- apply(dat[dat$Qi %in% c(3,4),], 1, \(row) sample(
(row["Age"]-15):row["Age"], 1
)
)
dat
# Qi Age c_gen c_sep
# 1 1 56 13 28
# 2 2 43 15 43
# 3 5 31 6 0
# 4 3 67 8 57
If you are doing it more than twice you might want to put this in a function - depending on your requirements.
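One possible shape for such a function is sketched below. draw_between is a hypothetical helper, not something from the answer above. It indexes into the candidate vector with sample.int(), which sidesteps a known quirk of sample(): when its first argument is a single number n >= 1, it samples from 1:n rather than returning n. The sketch assumes the lower bound never exceeds the upper bound; seq() would otherwise count downwards.

```r
# Hypothetical helper: draw one integer uniformly from lo..hi (inclusive)
draw_between <- function(lo, hi) {
  candidates <- seq(lo, hi)
  candidates[sample.int(length(candidates), 1)]
}

dat <- data.frame(Qi    = c(1, 2, 5, 3),
                  Age   = c(56, 43, 31, 67),
                  c_gen = c(13, 15, 6, 8))

set.seed(100)
dat$c_sep <- 0                       # default for all other Qi values
low_q  <- dat$Qi %in% c(1, 2)        # rows drawn from (c_gen + 6)..Age
high_q <- dat$Qi %in% c(3, 4)        # rows drawn from (Age - 15)..Age

dat$c_sep[low_q]  <- mapply(draw_between, dat$c_gen[low_q] + 6, dat$Age[low_q])
dat$c_sep[high_q] <- mapply(draw_between, dat$Age[high_q] - 15, dat$Age[high_q])
dat
```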
Try this
df$c_sep <- ifelse(df$Qi == 1 | df$Qi == 2 ,
sapply(1:nrow(df) ,
\(x) sample(seq(df$c_gen[x] + 6, df$Age[x]) ,1)) ,
ifelse(df$Qi == 3 | df$Qi == 4 ,
sapply(1:nrow(df) ,
\(x) sample(seq(df$Age[x] - 15, df$Age[x]) ,1)) , 0))
output
Qi Age c_gen c_sep
1 1 56 13 41
2 2 43 15 42
3 5 31 6 0
4 3 67 8 58
A tidyverse option:
library(tidyverse)
df <- tribble(
~Qi, ~Age, ~c_gen,
1, 56, 13,
2, 43, 15,
5, 31, 6,
3, 67, 8
)
df |>
rowwise() |>
mutate(c_sep = case_when(
Qi <= 2 ~ sample(seq(c_gen + 6, Age, 1), 1),
between(Qi, 3, 4) ~ sample(seq(Age - 15, Age, 1), 1),
TRUE ~ 0
)) |>
ungroup()
#> # A tibble: 4 × 4
#> Qi Age c_gen c_sep
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 56 13 39
#> 2 2 43 15 41
#> 3 5 31 6 0
#> 4 3 67 8 54
Created on 2022-06-29 by the reprex package (v2.0.1)

How to calculate the sum of specific columns in R and make the results in a another column

I'm a beginner in biostatistics and R, and I need your help with an issue.
I have a table that contains more than 170 columns and more than 6000 rows. I want to add another column that contains the sum of all the columns except the first two.
For example, if I have data with 5 columns, A to E:
A B C D E
12 2 13 98 6
10 7 8 67 12
12 56 67 9 7
I want to add another column (column F, for example) that contains the sum of columns C, D and E (that is, all the columns except the first two),
so the result will be
A B C D E F
AA 2 13 98 6 117
CF 7 8 67 12 87
QZ 56 67 9 7 83
Please tell me if you need any other information or clarification.
Thank you very much
Does this work:
library(dplyr)
df %>% rowwise() %>% mutate(F = sum(c_across(-c(A:B))))
# A tibble: 3 x 6
# Rowwise:
A B C D E F
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12 2 13 98 6 117
2 10 7 8 67 12 87
3 12 56 67 9 7 83
Data used:
df
# A tibble: 3 x 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <dbl>
1 12 2 13 98 6
2 10 7 8 67 12
3 12 56 67 9 7
library(tibble)
library(dplyr)
tbl <-
tibble::tribble(
~A, ~B, ~C, ~D, ~E,
12, 2, 13, 98, 6,
10, 7, 8, 67, 12,
12, 56, 67, 9, 7
)
tbl %>% dplyr::mutate("F" = C + D + E )
## R might consider F to be an abbreviation for FALSE, so I put it in quotes
You will find the information you need in the top answer to this question:
stackoverflow.com/questions/3991905/sum-rows-in-data-frame-or-matrix
Basically, you just name your new column, use the rowSums function, and specify the columns you want to include with the square bracket subsetting.
data$new <- rowSums( data[,43:167] )
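Applied to the small example from the question, this could look like the sketch below. It assumes the two columns to skip are always the first two, so a negative index is enough; na.rm = TRUE is an extra assumption that only matters if some cells are missing.

```r
# Example data from the question: the first two columns are not summed
data <- data.frame(A = c("AA", "CF", "QZ"),
                   B = c(2, 7, 56),
                   C = c(13, 8, 67),
                   D = c(98, 67, 9),
                   E = c(6, 12, 7))

# Drop columns 1 and 2 by position, then sum what is left row by row
data$F <- rowSums(data[, -(1:2)], na.rm = TRUE)
data
```

With 170 columns the call stays the same, because the selection is "everything except the first two" rather than an explicit column range.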

How in R, can you create new columns using existing columns as variables?

I am new to R, and so would greatly appreciate more explanation for any code you might have to help solve my issue.
I have a data.frame with groups of columns related to each other and I want to perform a calculation on each of those many groups to get new output columns. For example, many biological replicates in an experiment where I want to perform the calculations on each replicate independently before collapsing them.
I know I could use mutate in dplyr to create new columns, but I am not sure how to do this in a loop or how to use an lapply-type strategy to avoid retyping the column names each time. My biggest issue is understanding how to convert the column names into something usable by one of these strategies.
For example:
> A.1 <- c(11,12,13,4,15,6,17,18)
> A.2 <- c(2,4,5,5,19,7,5,1)
>
> B.1 <- c (3,4,5,1,31,76,13,70)
> B.2 <- c (10,9,8,15,31,12,13,12)
>
> C.1 <- c(1,2,3,4,5,6,7,8)
> C.2 <- c(2,4,5,8,10,12,15,18)
>
> df <- data.frame(A.1, A.2, B.1, B.2, C.1, C.2)
>
> df
A.1 A.2 B.1 B.2 C.1 C.2
1 11 2 3 10 1 2
2 12 4 4 9 2 4
3 13 5 5 8 3 5
4 4 5 1 15 4 8
5 15 19 31 31 5 10
6 6 7 76 12 6 12
7 17 5 13 13 7 15
8 18 1 70 12 8 18
>
Where I want to create new columns where A.new = A.1/A.2, B.new = B.1/B.2 etc. without typing out each column name explicitly. Also note the "A" and "B" are really character strings, so typing all of them would be very messy and time consuming.
Something like this, but a general case for many column groups:
> df <- df %>% mutate(A.new = A.1/A.2)
> df <- df %>% mutate(B.new = B.1/B.2)
> df <- df %>% mutate(C.new = C.1/C.2)
>
> df
A.1 A.2 B.1 B.2 C.1 C.2 A.new B.new C.new
1 11 2 3 10 1 2 5.5000000 0.30000000 0.5000000
2 12 4 4 9 2 4 3.0000000 0.44444444 0.5000000
3 13 5 5 8 3 5 2.6000000 0.62500000 0.6000000
4 4 5 1 15 4 8 0.8000000 0.06666667 0.5000000
5 15 19 31 31 5 10 0.7894737 1.00000000 0.5000000
6 6 7 76 12 6 12 0.8571429 6.33333333 0.5000000
7 17 5 13 13 7 15 3.4000000 1.00000000 0.4666667
8 18 1 70 12 8 18 18.0000000 5.83333333 0.4444444
>
I don't see the answer to my question on here already, but if you could point me to existing answers that would be much appreciated! I am currently thinking about the column names as containing the variable, but maybe that is not the correct way to approach this (R is also the first programming language I am learning) and so my searches for answers haven't yielded much yet.
Thank you for your guidance in advance!
As mentioned by @A. S. K., it is easier to do the calculation if you have the data in a long format.
We can use pivot_longer to get the data into long format and, for every row, divide the first value by the second value within each group of columns.
library(dplyr)
df %>%
mutate(row = row_number()) %>%
tidyr::pivot_longer(cols = -row,
names_to = c('.value', 'group'),
names_sep = '\\.') %>%
group_by(row) %>%
summarise(across(A:C, list(new = ~.[1]/.[2]))) %>%
#If you have an older version of dplyr use
#summarise_at(vars(A:C), list(new = ~.[1]/.[2])) %>%
select(-row) %>%
bind_cols(df, .)
# A.1 A.2 B.1 B.2 C.1 C.2 A_new B_new C_new
#1 11 2 3 10 1 2 5.500 0.3000 0.500
#2 12 4 4 9 2 4 3.000 0.4444 0.500
#3 13 5 5 8 3 5 2.600 0.6250 0.600
#4 4 5 1 15 4 8 0.800 0.0667 0.500
#5 15 19 31 31 5 10 0.789 1.0000 0.500
#6 6 7 76 12 6 12 0.857 6.3333 0.500
#7 17 5 13 13 7 15 3.400 1.0000 0.467
#8 18 1 70 12 8 18 18.000 5.8333 0.444
You can specify a range of column names using A:C in the summarise step. Also, note that in the pivot_longer step the names_sep argument is used to split each column name into its two parts. Since your column names are A.1, A.2 and so on, I use '\\.' as the separator (names_sep is a regular expression, so the dot is escaped); you might need to change it to match your own column names.
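To illustrate changing the separator, here is a small sketch with hypothetical column names that use an underscore instead of a dot (A_1, A_2, ...). The data frame df2 is made up for this example; only names_sep differs from the call above.

```r
library(tidyr)

# Hypothetical data: same layout as in the question, but "_" as separator
df2 <- data.frame(A_1 = c(11, 12), A_2 = c(2, 4),
                  B_1 = c(3, 4),   B_2 = c(10, 9))
df2$row <- seq_len(nrow(df2))

# ".value" turns the part before the separator (A, B) into columns;
# the part after it (1, 2) is stored in the new column "group"
long <- pivot_longer(df2,
                     cols = -row,
                     names_to = c(".value", "group"),
                     names_sep = "_")
long
```

Each original row now appears once per replicate index, which is what lets the summarise step pick the first and second value of A, B and so on within every row.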
We can do this much more easily with split.default
lst1 <- lapply(split.default(df, sub("\\.\\d+$", "", names(df))),
function(x) x[[1]]/x[[2]])
df[paste0(names(lst1), ".new")] <- lst1
df
# A.1 A.2 B.1 B.2 C.1 C.2 A.new B.new C.new
#1 11 2 3 10 1 2 5.5000000 0.30000000 0.5000000
#2 12 4 4 9 2 4 3.0000000 0.44444444 0.5000000
#3 13 5 5 8 3 5 2.6000000 0.62500000 0.6000000
#4 4 5 1 15 4 8 0.8000000 0.06666667 0.5000000
#5 15 19 31 31 5 10 0.7894737 1.00000000 0.5000000
#6 6 7 76 12 6 12 0.8571429 6.33333333 0.5000000
#7 17 5 13 13 7 15 3.4000000 1.00000000 0.4666667
#8 18 1 70 12 8 18 18.0000000 5.83333333 0.4444444
NOTE: This needs no packages and is much simpler.
data
df <- structure(list(A.1 = c(11, 12, 13, 4, 15, 6, 17, 18), A.2 = c(2,
4, 5, 5, 19, 7, 5, 1), B.1 = c(3, 4, 5, 1, 31, 76, 13, 70), B.2 = c(10,
9, 8, 15, 31, 12, 13, 12), C.1 = c(1, 2, 3, 4, 5, 6, 7, 8), C.2 = c(2,
4, 5, 8, 10, 12, 15, 18)), class = "data.frame", row.names = c(NA,
-8L))

dplyr- renaming sequence of columns with select function

I'm trying to rename my columns in dplyr. I found that this can be done with the select function. However, when I try to rename a sequence of selected columns, I cannot get the names into the format that I want.
test = data.frame(x = rep(1:3, each = 2),
group =rep(c("Group 1","Group 2"),3),
y1=c(22,8,11,4,7,5),
y2=c(22,18,21,14,17,15),
y3=c(23,18,51,44,27,35),
y4=c(21,28,311,24,227,225))
CC <- paste("CC",seq(0,3,1),sep="")
aa<-test%>%
select(AC=x,AR=group,CC=y1:y4)
head(aa)
AC AR CC1 CC2 CC3 CC4
1 1 Group 1 22 22 23 21
2 1 Group 2 8 18 18 28
3 2 Group 1 11 21 51 311
4 2 Group 2 4 14 44 24
5 3 Group 1 7 17 27 227
6 3 Group 2 5 15 35 225
The problem is that even though I set CC to the values CC0, CC1, CC2, CC3, the output automatically gets column names starting from CC1.
How can I solve this issue?
I think you'll have an easier time creating such an expression with the select_ function:
library(dplyr)
test <- data.frame(x=rep(1:3, each=2),
group=rep(c("Group 1", "Group 2"), 3),
y1=c(22, 8, 11, 4, 7, 5),
y2=c(22, 18, 21, 14, 17, 15),
y3=c(23, 18, 51, 44, 27, 35),
y4=c(21, 28, 311,24, 227, 225))
# build out our select "translation" named vector
DQ <- paste0("y", 1:4)
names(DQ) <- paste0("DQ", seq(0, 3, 1))
# take a look
DQ
## DQ0 DQ1 DQ2 DQ3
## "y1" "y2" "y3" "y4"
test %>%
select_("AC"="x", "AR"="group", .dots=DQ)
## AC AR DQ0 DQ1 DQ2 DQ3
## 1 1 Group 1 22 22 23 21
## 2 1 Group 2 8 18 18 28
## 3 2 Group 1 11 21 51 311
## 4 2 Group 2 4 14 44 24
## 5 3 Group 1 7 17 27 227
## 6 3 Group 2 5 15 35 225
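The underscore verbs such as select_() were later deprecated in dplyr. As a hedged sketch of the same idea in current tidyselect syntax, a named character vector can be passed through all_of(), which selects the columns and renames them to the vector's names in one step. The lookup vector is built exactly as in the answer above; result is just an illustrative name.

```r
library(dplyr)

test <- data.frame(x = rep(1:3, each = 2),
                   group = rep(c("Group 1", "Group 2"), 3),
                   y1 = c(22, 8, 11, 4, 7, 5),
                   y2 = c(22, 18, 21, 14, 17, 15),
                   y3 = c(23, 18, 51, 44, 27, 35),
                   y4 = c(21, 28, 311, 24, 227, 225))

# Named vector: names are the new column names, values the existing ones
DQ <- setNames(paste0("y", 1:4), paste0("DQ", 0:3))

# all_of() with a named vector selects and renames at the same time
result <- test %>%
  select(AC = x, AR = group, all_of(DQ))
names(result)
```

In the question's own attempt, CC = y1:y4 uses the literal name CC as a prefix and numbers the columns from 1; the variable CC defined beforehand is never consulted, which is why the names start at CC1.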
