I have some data where each unique ID is a member of a group. Some IDs have missing data; for these I'd like to take the average of the other members of the same group for that row.
For example, with the data below I'd like to replace the "NA" for id 3 in row V_2 with the average of the other Group A members for that row (the average of 21 and 22). Similarly, for id 7 in row V_3 it would be the average of 34 and 64.
Group=rep(c('A', 'B', 'C'), each=3)
id=1:9
V_1 = t(c(10,20,30,40,10,10,20,35,65))
V_2 = t(c(21,22,"NA",42,12,12,22,32,63))
V_3 = t(c(24,24,34,44,14,14,"NA",34,64))
df <- as.data.frame(rbind(Group, id, V_1, V_2, V_3))
df
Group A A A B B B C C C
id 1 2 3 4 5 6 7 8 9
X 10 20 30 40 10 10 20 35 65
X.1 21 22 NA 42 12 12 22 32 63
X.2 24 24 34 44 14 14 NA 34 64
An approach using dplyr. The warnings occur because all data frame columns in your example are character (the character row Group in row 1 forces this). So ideally the whole data frame should be transposed, with one column per variable...
library(dplyr)
library(tidyr)
tibble(data.frame(t(df))) %>%
group_by(Group) %>%
mutate(across(X:X.2, ~ as.numeric(.x))) %>%
mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
t() %>%
as.data.frame()
V1 V2 V3 V4 V5 V6 V7 V8 V9
Group A A A B B B C C C
id 1 2 3 4 5 6 7 8 9
X 10 20 30 40 10 10 20 35 65
X.1 21.0 22.0 21.5 42.0 12.0 12.0 22.0 32.0 63.0
X.2 24 24 34 44 14 14 49 34 64
Warning messages:
1: Problem while computing `..1 = across(X:X.2, ~as.numeric(.x))`.
ℹ NAs introduced by coercion
ℹ The warning occurred in group 1: Group = "A".
2: Problem while computing `..1 = across(X:X.2, ~as.numeric(.x))`.
ℹ NAs introduced by coercion
ℹ The warning occurred in group 3: Group = "C".
The same example using already-transposed data (df_t, defined at the end of this answer):
df_t %>%
group_by(Group) %>%
mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
ungroup()
# A tibble: 9 × 5
Group id X X.1 X.2
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 10 21 24
2 A 2 20 22 24
3 A 3 30 21.5 34
4 B 4 40 42 44
5 B 5 10 12 14
6 B 6 10 12 14
7 C 7 20 22 49
8 C 8 35 32 34
9 C 9 65 63 64
With a transpose back to the wide format:
df_t %>%
group_by(Group) %>%
mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
t() %>%
as.data.frame()
V1 V2 V3 V4 V5 V6 V7 V8 V9
Group A A A B B B C C C
id 1 2 3 4 5 6 7 8 9
X 10 20 30 40 10 10 20 35 65
X.1 21.0 22.0 21.5 42.0 12.0 12.0 22.0 32.0 63.0
X.2 24 24 34 44 14 14 49 34 64
Transposed data:
df_t <- structure(list(Group = c("A", "A", "A", "B", "B", "B", "C", "C",
"C"), id = c(1, 2, 3, 4, 5, 6, 7, 8, 9), X = c(10, 20, 30, 40,
10, 10, 20, 35, 65), X.1 = c(21, 22, NA, 42, 12, 12, 22, 32,
63), X.2 = c(24, 24, 34, 44, 14, 14, NA, 34, 64)), class = "data.frame", row.names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9"))
Structuring the data the tidy way might make it easier. Package {Hmisc} offers a convenience impute helper (since this is such a frequent task). That way you could proceed as follows:
tidy the data
## example dataframe df:
set.seed(4711)
df <- data.frame(Group = gl(3, 3, labels = LETTERS[1:3]),
id = 1:9,
V_1 = sample(c(NA, 1:8)),
V_2 = sample(c(NA, 1:8)),
V_3 = sample(c(NA, 1:8))
)
## > df |> head()
## Group id V_1 V_2 V_3
## 1 A 1 1 7 6
## 2 A 2 4 8 2
## 3 A 3 3 2 3
## 4 B 4 6 4 1
## 5 B 5 5 3 8
## 6 B 6 NA NA 4
use {Hmisc} and {dplyr} together with the pipeline notation:
library(dplyr)
library(Hmisc)
df_imputed <-
df |> mutate(across(V_1:V_3, \(x) impute(x, mean)))
> df_imputed |> head()
Group id V_1 V_2 V_3
1 A 1 1.0 7.0 6
2 A 2 4.0 8.0 2
3 A 3 3.0 2.0 3
4 B 4 6.0 4.0 1
5 B 5 5.0 3.0 8
6 B 6 4.5 4.5 4
Should you now prefer to replace missing values with groupwise medians instead of total means, the tidy arrangement (together with {dplyr}) requires only one additional group_by clause:
df |>
group_by(Group) |>
mutate(across(V_1:V_3, \(x) impute(x, median)))
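For the group-wise mean fill asked about in the original question, the same idea can be sketched in base R without extra packages. This is only a minimal sketch: it assumes the data is already in the transposed (one column per variable) layout with real NA values, and the names fill_group_mean, value_cols and df_filled are made up for illustration.

```r
# Transposed example data from the question, with numeric NA values
df_t <- data.frame(
  Group = rep(c("A", "B", "C"), each = 3),
  id    = 1:9,
  X     = c(10, 20, 30, 40, 10, 10, 20, 35, 65),
  X.1   = c(21, 22, NA, 42, 12, 12, 22, 32, 63),
  X.2   = c(24, 24, 34, 44, 14, 14, NA, 34, 64)
)

# Hypothetical helper: replace each NA with the mean of its group
fill_group_mean <- function(x, g) {
  # ave() returns, for every element, the mean of that element's group
  group_means <- ave(x, g, FUN = function(v) mean(v, na.rm = TRUE))
  ifelse(is.na(x), group_means, x)
}

value_cols <- c("X", "X.1", "X.2")
df_filled <- df_t
df_filled[value_cols] <- lapply(df_t[value_cols], fill_group_mean, g = df_t$Group)

df_filled
```

With this data, id 3 gets 21.5 in X.1 and id 7 gets 49 in X.2, matching the dplyr results above. A group whose values are all NA would get NaN, so that case would need separate handling.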
Question:
The code below works, but is there a better "R way" of achieving a similar result? I am essentially trying to split groups into individual line items according to a user-defined function (currently just using a loop).
Example:
df1 <- data.frame(group = c("A", "B", "C"),
volume = c(200L, 45L, 104L)
)
print(df1)
#> group volume
#> 1 A 200
#> 2 B 45
#> 3 C 104
I want the volume to be broken across multiple rows according to group so that the final result is a dataframe where the new volume (vol2 in the below) would add up to original volume above. In this example, I'm applying integer math with a divisor of 52, so my final result should be:
print(df3)
#> group vol2
#> 1 A 52
#> 2 A 52
#> 3 A 52
#> 4 A 44
#> 21 B 45
#> 31 C 52
#> 32 C 52
This works
The code below DOES get me to the desired result shown above:
div <- 52L
df1$intgr <- df1$volume %/% div
df1$remainder <- df1$volume %% div
print(df1)
#> group volume intgr remainder
#> 1 A 200 3 44
#> 2 B 45 0 45
#> 3 C 104 2 0
df2 <- data.frame()
for (r in 1:nrow(df1)){
if(df1[r,"intgr"] > 0){
for (k in 1:as.integer(df1[r,"intgr"])){
df1[r,"vol2"] <- div
df2 <- rbind(df2, df1[r,])
}
}
if(df1[r,"remainder"]>0){
df1[r, "vol2"] <- as.integer(df1[r, "remainder"])
df2 <- rbind(df2, df1[r,])
}
}
print(df2)
#> group volume intgr remainder vol2
#> 1 A 200 3 44 52
#> 2 A 200 3 44 52
#> 3 A 200 3 44 52
#> 4 A 200 3 44 44
#> 21 B 45 0 45 45
#> 31 C 104 2 0 52
#> 32 C 104 2 0 52
df3 <- subset(df2, select = c("group", "vol2"))
print(df3)
#> group vol2
#> 1 A 52
#> 2 A 52
#> 3 A 52
#> 4 A 44
#> 21 B 45
#> 31 C 52
#> 32 C 52
Being still relatively new to R, I'm just curious if someone knows a better way / function / method that gets to the same place. Seems like there might be. I could potentially have a more complex way of breaking up the rows and I was thinking maybe there's a method that applies a UDF to the dataframe to do something like this. I was searching for "expand group/groups" but was finding mostly "expand.grid" which isn't what I'm doing here.
Thank you for any suggestions!
A quick function to help split each number by the modulus (assumes num >= 1):
fun <- function(num, mod) c(rep(mod, (num - 1) %/% mod), (num - 1) %% mod + 1)
fun(200, 52)
# [1] 52 52 52 44
fun(45, 52)
# [1] 45
fun(104, 52)
# [1] 52 52
And we can apply this a number of ways:
dplyr
library(dplyr)
df1 %>%
group_by(group) %>%
# dplyr >= 1.1.0 prefers reframe(vol2 = fun(volume, 52)) here; multi-row summarize() is deprecated
summarize(vol2 = fun(volume, 52), .groups = "drop")
# # A tibble: 7 x 2
# group vol2
# <chr> <dbl>
# 1 A 52
# 2 A 52
# 3 A 52
# 4 A 44
# 5 B 45
# 6 C 52
# 7 C 52
base R
do.call(rbind, by(df1, seq(nrow(df1)),
FUN = function(z) data.frame(group = z$group, vol2 = fun(z$volume, 52))))
data.table
library(data.table)
setDT(df1)
df1[, .(vol2 = fun(volume, 52)), by = group]
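Whichever variant is used, it may be worth checking that the pieces add back up to the original volumes, including for a volume that is an exact multiple of the divisor (104 here). The following is a rough base-R sketch of such a check; split_volume is a hypothetical stand-in for the helper above and assumes strictly positive volumes.

```r
# Hypothetical helper: full chunks of size `mod`, then the remaining piece.
# Using (num - 1) keeps exact multiples from producing an extra chunk.
split_volume <- function(num, mod) {
  stopifnot(num >= 1, mod >= 1)
  c(rep(mod, (num - 1) %/% mod), (num - 1) %% mod + 1)
}

df1 <- data.frame(group = c("A", "B", "C"),
                  volume = c(200L, 45L, 104L))

# One vector of pieces per input row
pieces <- Map(split_volume, df1$volume, 52L)

# Expand to one row per piece
out <- data.frame(group = rep(df1$group, lengths(pieces)),
                  vol2  = unlist(pieces))

# The pieces of each group should sum to the original volume
totals <- tapply(out$vol2, out$group, sum)
stopifnot(all(totals[df1$group] == df1$volume))
out
```

Here out has seven rows (four for A, one for B, two for C), which is the result the question asks for.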
A tidyverse approach using purrr::pmap and tidyr::unnest_longer may look like so:
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr)
div <- 52
df1 |>
mutate(intgr = volume %/% div, remainder = volume %% div, intgr1 = +(remainder > 0)) |>
mutate(vol2 = purrr::pmap(list(intgr, intgr1, remainder), ~ c(rep(div, ..1), rep(..3, ..2)))) |>
tidyr::unnest_longer(vol2) |>
select(-intgr1)
#> # A tibble: 7 × 5
#> group volume intgr remainder vol2
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 A 200 3 44 52
#> 2 A 200 3 44 52
#> 3 A 200 3 44 52
#> 4 A 200 3 44 44
#> 5 B 45 0 45 45
#> 6 C 104 2 0 52
#> 7 C 104 2 0 52
With data.table and rep:
library(data.table)
setDT(df1)[, .(vol2 = c(rep(52, volume%/%52), (volume%%52)[sign(volume%%52)])), group]
#> group vol2
#> 1: A 52
#> 2: A 52
#> 3: A 52
#> 4: A 44
#> 5: B 45
#> 6: C 52
#> 7: C 52
Or
setDT(df1)[, .(vol2 = c(rep(52, volume%/%52), volume%%52)), group][vol2 != 0]
#> group vol2
#> 1: A 52
#> 2: A 52
#> 3: A 52
#> 4: A 44
#> 5: B 45
#> 6: C 52
#> 7: C 52
Vectorised and without grouping:
df1 <- data.frame(group = c("A", "B", "C"),
volume = c(200L, 45L, 104L))
n <- 52
idx <- df1$volume %/% n + ((sel <- df1$volume %% n) != 0)
out <- df1[rep(seq_len(nrow(df1)), idx),]
out$volume <- n
out$volume[cumsum(idx)[sel != 0]] <- sel[sel != 0]
## group volume
##1 A 52
##1.1 A 52
##1.2 A 52
##1.3 A 44
##2 B 45
##3 C 52
##3.1 C 52
Another base R solution using aggregate :
aggregate(.~group,df1,\(x) c(rep(52, (x-1) %/% 52), (x-1) %% 52 + 1))
group volume
1 A 52, 52, 52, 44
2 B 45
3 C 52, 52
This results in a list column for volume (could be useful)
To transform it to a long dataframe we can either use stack:
with(
aggregate(.~group,df1,\(x) c(rep(52, (x-1) %/% 52), (x-1) %% 52 + 1)),
setNames(stack(setNames(volume,group))[2:1],names(df1))
)
group volume
1 A 52
2 A 52
3 A 52
4 A 44
5 B 45
6 C 52
7 C 52
Or alternatively use unnest from tidyr
library(tidyr)
aggregate(.~group,df1,\(x) c(rep(52, (x-1) %/% 52), (x-1) %% 52 + 1)) %>% unnest(volume)
# A tibble: 7 × 2
group volume
<chr> <dbl>
1 A 52
2 A 52
3 A 52
4 A 44
5 B 45
6 C 52
7 C 52
I'm a beginner in biostatistics and R, and I need your help with an issue.
I have a table with more than 170 columns and more than 6000 rows. I want to add another column that contains the row-wise sum of all the columns except the first two.
For example, suppose I have five columns, A to E:
A B C D E
12 2 13 98 6
10 7 8 67 12
12 56 67 9 7
I want to add another column (column F, for example) that contains the sum of columns C, D and E (that is, all the columns except the first two),
so the result will be:
A B C D E F
AA 2 13 98 6 117
CF 7 8 67 12 87
QZ 56 67 9 7 83
Please tell me if you need any other information or clarification.
Thank you very much
Does this work:
library(dplyr)
df %>% rowwise() %>% mutate(F = sum(c_across(-c(A:B))))
# A tibble: 3 x 6
# Rowwise:
A B C D E F
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12 2 13 98 6 117
2 10 7 8 67 12 87
3 12 56 67 9 7 83
Data used:
df
# A tibble: 3 x 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <dbl>
1 12 2 13 98 6
2 10 7 8 67 12
3 12 56 67 9 7
library(tibble)
library(dplyr)
tbl <-
tibble::tribble(
~A, ~B, ~C, ~D, ~E,
12, 2, 13, 98, 6,
10, 7, 8, 67, 12,
12, 56, 67, 9, 7
)
tbl %>% dplyr::mutate("F" = C + D + E )
## R might treat F as an abbreviation for FALSE, so I put it in quotes
You will find the information you need in the top answer to this question:
stackoverflow.com/questions/3991905/sum-rows-in-data-frame-or-matrix
Basically, you just name your new column, use the rowSums function, and specify the columns you want to include with the square bracket subsetting.
data$new <- rowSums( data[,43:167] )
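Applied to the five-column example from the question, the rowSums idea might look like the sketch below. Dropping the first two columns with a negative index avoids hard-coding the column range; na.rm = TRUE is an assumption about how missing values should be treated and can be removed if an NA should propagate to the total.

```r
# Example data from the question
df <- data.frame(A = c(12, 10, 12),
                 B = c(2, 7, 56),
                 C = c(13, 8, 67),
                 D = c(98, 67, 9),
                 E = c(6, 12, 7))

# Sum every column except the first two, row by row
df$F <- rowSums(df[, -(1:2)], na.rm = TRUE)

df
```

The new column F holds 117, 87 and 83. Because the selection is positional, the same line works unchanged for a table with 170 columns, provided all remaining columns are numeric.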
In R, I want to create a new dataframe per column name from the following dataframes:
agedf <- data.frame(A = c(12,14,16,18), B = c(13,15,17,19), C = c(11,13,15,17))
heightdf <- data.frame(A = c(110,120,130,140), B = c(120,130,140,150), C = c(115,125,135,145))
weightdf <- data.frame(A = c(80,90,100,110), B = c(90,100,110,120), C = c(85,95,105,115))
The desired result is code that creates a dataframe for each of A, B and C from their respective agedf, heightdf and weightdf columns, i.e. to end up with 3 dataframes as shown in this Excel screenshot:
[Image: Excel desired result]
How best could I do this?
Using a for-loop (there are probably alternatives with packages like tidyr or dplyr):
newlist = list()
names = colnames(agedf)
for(i in names){
index = which(colnames(agedf)==i)
newlist[[i]] = cbind(agedf[,index], heightdf[,index], weightdf[,index])
colnames(newlist[[i]]) = c("Age", "Height", "Weight")}
Output:
> newlist
$A
Age Height Weight
[1,] 12 110 80
[2,] 14 120 90
[3,] 16 130 100
[4,] 18 140 110
$B
Age Height Weight
[1,] 13 120 90
[2,] 15 130 100
[3,] 17 140 110
[4,] 19 150 120
$C
Age Height Weight
[1,] 11 115 85
[2,] 13 125 95
[3,] 15 135 105
[4,] 17 145 115
Without using a list, creating a new object for each name instead:
names = colnames(agedf)
for(i in names){
index = which(colnames(agedf)==i)
assign(i, cbind("Age"=agedf[,index], "Height"=heightdf[,index], "Weight"=weightdf[,index]))}
This gives the same output as before, just not in a list.
Lastly, if you want to add them all into a single data frame, and specify where each observation came from:
df = numeric()
names = colnames(agedf)
for(i in names){
index = which(colnames(agedf)==i)
df = rbind(df, cbind(i, agedf[,index], heightdf[,index], weightdf[,index]))}
colnames(df) = c("Code", "Age", "Height", "Weight")
df = as.data.frame(df)
Output:
> df
Code Age Height Weight
1 A 12 110 80
2 A 14 120 90
3 A 16 130 100
4 A 18 140 110
5 B 13 120 90
6 B 15 130 100
7 B 17 140 110
8 B 19 150 120
9 C 11 115 85
10 C 13 125 95
11 C 15 135 105
12 C 17 145 115
Note: you could pass the column names c("Age", "Height", "Weight") directly into the cbind in the other approach too.
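One thing to watch in the loops above is that cbind() on plain vectors builds a matrix, and once the character code i is bound in, every column becomes character. A possible variant that keeps real data frames with numeric columns is sketched below; it assumes the three inputs share the same column names and row order, and by_code and combined are illustrative names only.

```r
agedf    <- data.frame(A = c(12, 14, 16, 18), B = c(13, 15, 17, 19), C = c(11, 13, 15, 17))
heightdf <- data.frame(A = c(110, 120, 130, 140), B = c(120, 130, 140, 150), C = c(115, 125, 135, 145))
weightdf <- data.frame(A = c(80, 90, 100, 110), B = c(90, 100, 110, 120), C = c(85, 95, 105, 115))

# One data frame per column name, returned as a named list
by_code <- lapply(
  setNames(names(agedf), names(agedf)),
  function(code) data.frame(Age    = agedf[[code]],
                            Height = heightdf[[code]],
                            Weight = weightdf[[code]])
)

# Optionally stack them into one data frame with a Code column
combined <- do.call(
  rbind,
  Map(function(code, d) cbind(Code = code, d), names(by_code), by_code)
)
rownames(combined) <- NULL

str(by_code$A)
head(combined)
```

Keeping the results in a named list (by_code$A, by_code$B, ...) is usually easier to work with than creating separate objects with assign().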
Here is a way using the tidyverse.
library(dplyr)
library(purrr)
df_list <- list(Age = agedf,
Height = heightdf,
Weight = weightdf)
map(transpose(df_list), bind_cols)
# $A
# # A tibble: 4 x 3
# Age Height Weight
# <dbl> <dbl> <dbl>
# 1 12 110 80
# 2 14 120 90
# 3 16 130 100
# 4 18 140 110
#
# $B
# # A tibble: 4 x 3
# Age Height Weight
# <dbl> <dbl> <dbl>
# 1 13 120 90
# 2 15 130 100
# 3 17 140 110
# 4 19 150 120
#
# $C
# # A tibble: 4 x 3
# Age Height Weight
# <dbl> <dbl> <dbl>
# 1 11 115 85
# 2 13 125 95
# 3 15 135 105
# 4 17 145 115
I'm trying to split columns into new rows keeping the data of the first two columns.
d1 <- data.frame(a=c(100,0,78),b=c(0,137,117),c.1=c(111,17,91), d.1=c(99,66,22), c.2=c(11,33,44), d.2=c(000,001,002))
d1
a b c.1 d.1 c.2 d.2
1 100 0 111 99 11 0
2 0 137 17 66 33 1
3 78 117 91 22 44 2
Expected results would be:
a b c d
1 100 0 111 99
2 100 0 11 0
3 0 137 17 66
4 0 137 33 1
5 78 117 91 22
6 78 117 44 2
I have tried several times with dplyr, but it seems that is not the right approach.
If you want to stay in dplyr/tidyverse, you want tidyr::pivot_longer with a special reference to .value -- see the pivot vignette for more:
library(tidyverse)
d1 <- data.frame(
a = c(100, 0, 78),
b = c(0, 137, 117),
c.1 = c(111, 17, 91),
d.1 = c(99, 66, 22),
c.2 = c(11, 33, 44),
d.2 = c(000, 001, 002)
)
d1 %>%
pivot_longer(
cols = contains("."),
names_to = c(".value", "group"),
names_sep = "\\."
)
#> # A tibble: 6 x 5
#> a b group c d
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 100 0 1 111 99
#> 2 100 0 2 11 0
#> 3 0 137 1 17 66
#> 4 0 137 2 33 1
#> 5 78 117 1 91 22
#> 6 78 117 2 44 2
Created on 2020-05-11 by the reprex package (v0.3.0)
This could solve your issue:
#Try this
a1 <- d1[,c(1:4)]
a2 <- d1[,c(1,2,5,6)]
names(a1) <- names(a2) <- c('a','b','c','d')
DF <- rbind(a1,a2)
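Note that a plain rbind stacks all the ".1" rows first and then all the ".2" rows, whereas the expected output interleaves them per original row. If the order matters, one way to restore it is to sort by the source row, as in this sketch (first, second, src_row and block are names invented for the example):

```r
d1 <- data.frame(a = c(100, 0, 78), b = c(0, 137, 117),
                 c.1 = c(111, 17, 91), d.1 = c(99, 66, 22),
                 c.2 = c(11, 33, 44), d.2 = c(0, 1, 2))

# Two blocks with identical column names
first  <- setNames(d1[, c("a", "b", "c.1", "d.1")], c("a", "b", "c", "d"))
second <- setNames(d1[, c("a", "b", "c.2", "d.2")], c("a", "b", "c", "d"))
stacked <- rbind(first, second)

# Remember which original row and which block each stacked row came from
src_row <- rep(seq_len(nrow(d1)), times = 2)
block   <- rep(1:2, each = nrow(d1))

# Interleave: original row first, then block within that row
result <- stacked[order(src_row, block), ]
rownames(result) <- NULL
result
```

This reproduces the row order shown in the question's expected result.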
The posted answers are good; here's my attempt:
library(dplyr)
library(tidyr)
df <- data.frame(a=c(100,0,78),b=c(0,137,117),
c.1=c(111,17,91), d.1=c(99,66,22),
c.2=c(11,33,44), d.2=c(000,001,002))
# Make 2 pivot long operations
df_c <- df %>% select(-d.1, -d.2) %>%
pivot_longer(cols = c("c.1", "c.2"), values_to = "c") %>% select(-name)
df_d <- df %>% select(-c.1, -c.2) %>%
pivot_longer(cols=c("d.1","d.2"), values_to = "d") %>% select(-name)
# bind them without the "key" columns
bind_cols(df_c, select(df_d, -a, -b))
Which produces
# A tibble: 6 x 4
a b c d
<dbl> <dbl> <dbl> <dbl>
1 100 0 111 99
2 100 0 11 0
3 0 137 17 66
4 0 137 33 1
5 78 117 91 22
6 78 117 44 2