I want to calculate the median across several columns, selected by the letters each column name begins with, and store the result in a new column.
I know the way to do this for mean using dplyr and an example like that shown below:
library(dplyr)
mutate(df, IVMean = rowMeans(select(df, starts_with("IV")), na.rm = TRUE))
But how is this possible for median?
I have tried a number of ways including select then mutate but having no luck.
We could use rowMedians after converting to matrix
library(matrixStats)
library(dplyr)
mutate(df, IVMedian = rowMedians(as.matrix(select(df,
starts_with("IV"))), na.rm = TRUE))
-output
IV1 IV2 IV3 col4 IVMedian
1 2 4 1 1 2
2 3 8 3 2 3
3 4 9 4 3 4
Or this can be done in rowwise with c_across
df %>%
rowwise %>%
mutate(IVMedian = median(c_across(starts_with('IV')))) %>%
ungroup
# A tibble: 3 x 5
IV1 IV2 IV3 col4 IVMedian
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 4 1 1 2
2 3 8 3 2 3
3 4 9 4 3 4
Or use pmap (from purrr)
library(purrr)
df %>%
mutate(IVMedian = pmap_dbl(select(., starts_with('IV')),
~ median(c(...))))
IV1 IV2 IV3 col4 IVMedian
1 2 4 1 1 2
2 3 8 3 2 3
3 4 9 4 3 4
Or in collapse
library(collapse)
df$IVMedian <- dapply(gvr(df, vars = '^IV'), MARGIN = 1, FUN = fmedian)
data
df <- data.frame(IV1 = c(2, 3, 4), IV2 = c(4, 8, 9),
IV3 = c(1, 3, 4), col4 = c(1, 2, 3))
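If you would rather avoid extra packages, a base R version along the same lines might look like the sketch below. It assumes the same example df as above; the helper name iv_cols is made up for illustration.

```r
# Same example data as in the answer above
df <- data.frame(IV1 = c(2, 3, 4), IV2 = c(4, 8, 9),
                 IV3 = c(1, 3, 4), col4 = c(1, 2, 3))

# Logical index of the columns whose names begin with "IV"
# (iv_cols is a hypothetical helper name)
iv_cols <- startsWith(names(df), "IV")

# Row-wise median over those columns; apply() coerces the
# selected columns to a matrix and calls median() once per row
df$IVMedian <- unname(apply(df[, iv_cols, drop = FALSE], 1, median, na.rm = TRUE))

df
```

This is likely slower than rowMedians on large data, since apply() loops over rows in R, but it needs nothing beyond base R.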
Suppose the following table with two factor variables and one numerical variable:
df <- tibble(
x = as_factor(c("a", "a", "a", "b", "b", "b")),
y = as_factor(1:6),
val = c(10, 3, 8, 2, 6, 1)
)
> df
# A tibble: 6 x 3
x y val
<fct> <fct> <dbl>
1 a 1 10
2 a 2 3
3 a 3 8
4 b 4 2
5 b 5 6
6 b 6 1
I would like to re-order y such that the sum of val, when grouped by x, takes precedence, but y is still ordered by val within each group. To illustrate the goal:
# A tibble: 6 x 4
# Groups: x [2]
x y val sum
<fct> <fct> <dbl> <dbl>
1 a 1 10 21 # all y for which x=="a" come first, because
2 a 3 8 21 # the sum of val for x=="a" is greater than
3 a 2 3 21 # for x=="b"
4 b 5 6 9 # within each group, y is ordered by val
5 b 4 2 9
6 b 6 1 9
But how do I get there? Within tidyverse, I tried to solve it with forcats::fct_reorder(), thinking that grouping might help (df |> group_by(x) |> mutate(y = fct_reorder(y, val))), but it doesn't.
Can fct_reorder() do that at all? What other approaches could work?
Edit: I have found a solution, but it feels rather hacky:
df |>
group_by(x) |>
mutate(sum = sum(val)) |>
arrange(desc(sum), desc(val)) |> ungroup() |>
tibble::rowid_to_column() |>
mutate(across(c(x, y), \(x) fct_reorder(x, rowid)))
Perhaps, we need to arrange
library(dplyr)
library(forcats)
df %>%
arrange(desc(ave(val, x, FUN = sum)), desc(val)) %>%
mutate(across(where(is.factor), fct_inorder))
-output
# A tibble: 6 × 3
x y val
<fct> <fct> <dbl>
1 a 1 10
2 a 3 8
3 a 2 3
4 b 5 6
5 b 4 2
6 b 6 1
Or use fct_reorder/reorder in arrange
df %>%
arrange(desc(fct_reorder(x, val, .fun = sum)), desc(val)) %>%
mutate(across(where(is.factor), fct_inorder))
Probably we can use the following data.table option along with fct_inorder
library(data.table)
library(forcats)
setorder(
setDT(df)[
,
sum := sum(val), x
],
-sum, -val
)[
,
lapply(
.SD,
function(x) ifelse(is.factor(x), fct_inorder, c)(x)
)
]
and you will obtain
x y val sum
1: a 1 10 21
2: a 3 8 21
3: a 2 3 21
4: b 5 6 9
5: b 4 2 9
6: b 6 1 9
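For comparison, the same idea (sort the rows, then set the factor levels in order of appearance) can be sketched in base R only. This assumes a plain data.frame with the same values as the question's tibble; group_sum and ord are names made up for illustration.

```r
# Same values as the question's tibble, as a plain data.frame
df <- data.frame(x = factor(c("a", "a", "a", "b", "b", "b")),
                 y = factor(1:6),
                 val = c(10, 3, 8, 2, 6, 1))

# Sum of val per level of x, repeated for every row of that group
group_sum <- ave(df$val, df$x, FUN = sum)

# Sort by group sum (descending), then by val (descending)
ord <- order(-group_sum, -df$val)
df <- df[ord, ]

# Re-level both factors in order of first appearance,
# which is what fct_inorder() does
df$x <- factor(df$x, levels = unique(as.character(df$x)))
df$y <- factor(df$y, levels = unique(as.character(df$y)))

levels(df$y)
```

The sorting does the real work here; the factor() calls only record the resulting row order in the levels so that plots and tables follow it.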
This is probably trivial, but my data looks like this:
t <- structure(list(var = 1:5, ID = c(1, 2, 1, 1, 3)), class = "data.frame", row.names = c(NA,
-5L))
> t
var ID
1 1 1
2 2 2
3 3 1
4 4 1
5 5 3
I would like to get a mean value for each ID, so my idea was to transform them into this (variable names are not important):
f <- structure(list(ID = 1:3, var.1 = c(1, 2, 5), var.2 = c(2, NA,
NA), var.3 = c(3, NA, NA)), class = "data.frame", row.names = c(NA,
-3L))
> f
ID var.1 var.2 var.3
1 1 1 2 3
2 2 2 NA NA
3 3 5 NA NA
so that I could then calculate the mean for each var.x.
I know it's possible with tidyr (possibly pivot_wider?), but I can't figure out how to group it. How do I get a mean value for each ID?
Thank you in advance
You could use ave to get the mean of var for each ID:
t$mean = ave(t$var, t$ID, FUN = mean)
Result:
var ID mean
1 1 1 2.666667
2 2 2 2.000000
3 3 1 2.666667
4 4 1 2.666667
5 5 3 5.000000
If you want a simple table with the means, you could use aggregate:
aggregate(var ~ ID, data = t, FUN = mean)
ID var
1 1 2.666667
2 2 2.000000
3 3 5.000000
If you want to use rowMeans on your t dataframe, then we can first use pivot_wider, then get the mean of the row.
library(tidyverse)
t %>%
group_by(ID) %>%
mutate(row = row_number()) %>%
ungroup %>%
pivot_wider(names_from = row, values_from = var, names_prefix = "var.") %>%
mutate(mean = rowMeans(select(., starts_with("var")), na.rm = TRUE))
# ID var.1 var.2 var.3 mean
# <dbl> <int> <int> <int> <dbl>
# 1 1 1 3 4 2.67
# 2 2 2 NA NA 2
# 3 3 5 NA NA 5
Or since t is in long form, then we can just group by ID, then get the mean for all values in that group.
t %>%
group_by(ID) %>%
summarise(mean = mean(var))
# ID mean
# <dbl> <dbl>
#1 1 2.67
#2 2 2
#3 3 5
Or for f, we can use rowMeans for each row that will include any column that starts with var.
f %>%
mutate(mean = rowMeans(select(., starts_with("var")), na.rm = TRUE))
# ID var.1 var.2 var.3 mean
#1 1 1 2 3 2
#2 2 2 NA NA 2
#3 3 5 NA NA 5
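Since the data is already in long form, tapply is another base R option worth mentioning. The sketch below rebuilds t from the question; means and res are names made up for illustration.

```r
# Long-form data from the question
t <- data.frame(var = 1:5, ID = c(1, 2, 1, 1, 3))

# Mean of var within each ID; the result is a named vector,
# with the ID values as names
means <- tapply(t$var, t$ID, mean)

# Turn it into a small lookup table with one row per ID
res <- data.frame(ID = as.numeric(names(means)), mean = as.vector(means))
res
```

No reshaping to wide format is needed for this; the grouping variable does the work.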
I have a dataframe somewhat similar to the one below (df). I need to add a new column indicating the ratio of the largest value for each row (= largest value in row divided by sum of all values in the row). The output should look similar to df1.
df <- data.frame('x' = c(1, 4, 1, 4, 1), 'y' = c(4, 6, 5, 2, 3), 'z' = c(5, 3, 2, 3, 2))
df1 <- data.frame('x' = c(1, 4, 1, 4, 1), 'y' = c(4, 6, 5, 2, 3), 'z' = c(5, 3, 2, 3, 2), 'ratio' = c(0.5, 0.462, 0.625, 0.444, 0.5))
Thank you!
Here is a solution using dplyr:
library(dplyr)
df %>%
rowwise() %>%
mutate(max_value = max(x,y,z),
sum_values = sum(x,y,z),
ratio = max_value / sum_values) #%>%
#select(-max_value, -sum_values) #uncomment this line if you want to df1 as in your question
# A tibble: 5 x 6
x y z max_value sum_values ratio
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 5 5 10 0.5
2 4 6 3 6 13 0.462
3 1 5 2 5 8 0.625
4 4 2 3 4 9 0.444
5 1 3 2 3 6 0.5
library(tidyverse)
df %>%
rowwise() %>%
mutate(MAX = max(x,y,z, na.rm = TRUE ),
SUM = sum(x,y,z, na.rm = TRUE),
ratio = MAX / SUM)
# A tibble: 5 x 6
x y z MAX SUM ratio
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 5 5 10 0.5
2 4 6 3 6 13 0.462
3 1 5 2 5 8 0.625
4 4 2 3 4 9 0.444
5 1 3 2 3 6 0.5
Another option with rowSums and pmax
library(dplyr)
library(purrr)
df %>%
mutate(ratio = reduce(., pmax)/rowSums(.))
# x y z ratio
#1 1 4 5 0.5000000
#2 4 6 3 0.4615385
#3 1 5 2 0.6250000
#4 4 2 3 0.4444444
#5 1 3 2 0.5000000
Or in base R
df$ratio <- do.call(pmax, df)/rowSums(df)
Additional solution
df$ratio <- apply(df, 1, function(x) max(x, na.rm = TRUE) / sum(x, na.rm = TRUE))
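One thing to watch with the base R one-liners: they use every column of df, so they would also pick up a ratio column on a second run, or any non-numeric column. A sketch that names the columns explicitly, assuming the x, y, z layout from the question (cols is a made-up helper name):

```r
df <- data.frame(x = c(1, 4, 1, 4, 1),
                 y = c(4, 6, 5, 2, 3),
                 z = c(5, 3, 2, 3, 2))

# Columns that take part in the calculation (hypothetical helper name)
cols <- c("x", "y", "z")

# pmax() gives the element-wise (here: row-wise) maximum across the columns;
# na.rm = TRUE is passed along as an extra argument through do.call()
row_max <- do.call(pmax, c(as.list(df[cols]), list(na.rm = TRUE)))

df$ratio <- row_max / rowSums(df[cols], na.rm = TRUE)
df
```

Rerunning the last lines leaves the result unchanged, because ratio is never part of cols.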
I have a dataframe:
df <- data.frame(ca = c("a","b","a","c","b", "b"),
f = c(3,4,0,NA,3, 4),
f2 = c(NA,5,6,1,9, 7),
f3 = c(3,0,6,3,0, 8))
I want to sum my columns "f" and "f2" into a single column and name it "f_new".
Example:
df <- data.frame(ca = c("a","b","a","c","b", "b"),
f_new = c(3,9,6,1,12, 11),
f3 = c(3,0,6,3,0, 8))
Do you have an idea of how to do this with summarise, spread, group_by?
Using dplyr you can do this:
library(dplyr)
df %>%
rowwise() %>%
mutate(f_new = sum(f, f2, na.rm = TRUE))
# A tibble: 6 x 5
# ca f f2 f3 f_new
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 a 3 NA 3 3
#2 b 4 5 0 9
#3 a 0 6 6 6
#4 c NA 1 3 1
#5 b 3 9 0 12
#6 b 4 7 8 11
This method will retain any NA values in the original columns.
Here is an answer using tidyverse methods from dplyr and tidyr
library(tidyverse)
df <- data.frame(ca = c("a","b","a","c","b", "b"),
f = c(3,4,0,NA,3, 4),
f2 = c(NA,5,6,1,9, 7),
f3 = c(3,0,6,3,0, 8))
df %>%
replace_na(list(f = 0, f2 = 0)) %>%
mutate(f_new = f + f2)
#> ca f f2 f3 f_new
#> 1 a 3 0 3 3
#> 2 b 4 5 0 9
#> 3 a 0 6 6 6
#> 4 c 0 1 3 1
#> 5 b 3 9 0 12
#> 6 b 4 7 8 11
dplyr can do this quite nicely with the following code. rowwise() lets you consider each row separately, and mutate() sums whichever columns you want. na.rm = TRUE handles the case where you have NAs and want to ignore them. As a comment mentioned, without it the result is NA whenever any of the summed values is NA.
library(dplyr)
df %>%
rowwise() %>%
mutate(f_new = sum(f,f2, na.rm = TRUE))
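A vectorised base R alternative, in case rowwise() turns out slow on a larger data frame, is rowSums. The sketch below assumes the df from the question and also drops f and f2 so the result matches the expected output.

```r
df <- data.frame(ca = c("a", "b", "a", "c", "b", "b"),
                 f  = c(3, 4, 0, NA, 3, 4),
                 f2 = c(NA, 5, 6, 1, 9, 7),
                 f3 = c(3, 0, 6, 3, 0, 8))

# Sum f and f2 row by row, treating NA as "nothing to add"
df$f_new <- rowSums(df[c("f", "f2")], na.rm = TRUE)

# Keep only the columns shown in the expected output, in that order
df <- df[c("ca", "f_new", "f3")]
df
```

Note that with na.rm = TRUE a row where both f and f2 are NA would get 0, not NA, which may or may not be what you want.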
I have a dataset in R with two columns of numerical data and one with an identifier. Some of the rows share the same identifier (i.e. they are the same individual), but contain different data. I want to use the identifier to move the rows that share an identifier into columns. There are currently 600 rows, but there should be 400.
Can anyone share R code that might do this? I am new to R, and have tried the reshape (cast) package, but I can't really follow it, and am not sure it's exactly what I'm trying to do.
Any help gratefully appreciated.
UPDATE:
Current
ID Age Sex
1 3 1
1 5 1
1 6 1
1 7 1
2 1 2
2 12 2
2 5 2
3 3 1
Expected output
ID Age Sex Age2 Sex2 Age3 Sex3 Age4 Sex4
1 3 1 5 1 6 1 7 1
2 1 2 12 2 5 2
3 3 1
UPDATE 2:
So far I have tried using the melt and dcast commands from reshape2. I am getting there, but it still doesn't look quite right. Here is my code:
x <- melt(example, id.vars = "ID")
x$time <- ave(x$ID, x$ID, FUN = seq_along)
example2 <- dcast (x, ID ~ time, value.var = "value")
and here is the output using that code:
ID A B C D E F G H (for clarity I have labelled these)
1 3 5 6 7 1 1 1 1
2 1 12 5 2 2 2
3 3 1
So, as you can probably see, it is mixing up the 'sex' and 'age' variables and combining them in the same column. For example column D has the value '7' for person 1 (age4), but '2' for person 2 (Sex). I can see that my code is not instructing where the numerical values should be cast to, but I do not know how to code that part. Any ideas?
Here's an approach using gather, spread and unite from the tidyr package:
suppressPackageStartupMessages(library(tidyverse))
x <- tribble(
~ID, ~Age, ~Sex,
1, 3, 1,
1, 5, 1,
1, 6, 1,
1, 7, 1,
2, 1, 2,
2, 12, 2,
2, 5, 2,
3, 3, 1
)
x %>% group_by(ID) %>%
mutate(grp = 1:n()) %>%
gather(var, val, -ID, -grp) %>%
unite("var_grp", var, grp, sep ='') %>%
spread(var_grp, val, fill = '')
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Age2 Age3 Age4 Sex1 Sex2 Sex3 Sex4
#> * <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 3 5 6 7 1 1 1 1
#> 2 2 1 12 5 2 2 2
#> 3 3 3 1
If you prefer to keep the columns numeric then just remove the fill='' argument from spread(var_grp, val, fill = '').
Other questions which might help with this include:
R spreading multiple columns with tidyr
How can I spread repeated measures of multiple variables into wide format?
I have recently come across a similar issue in my data, and wanted to provide an update using the tidyr 1.0 functions as gather and spread have been retired. The new pivot_longer and pivot_wider are currently much slower than gather and spread, especially on very large datasets, but this is supposedly fixed in the next update of tidyr, so hope this updated solution is useful to people.
library(tidyr)
library(dplyr)
x %>%
group_by(ID) %>%
mutate(grp = 1:n()) %>%
pivot_longer(-c(ID, grp), names_to = "var", values_to = "val") %>%
unite("var_grp", var, grp, sep = "") %>%
pivot_wider(names_from = var_grp, values_from = val)
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Sex1 Age2 Sex2 Age3 Sex3 Age4 Sex4
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3 1 5 1 6 1 7 1
#> 2 2 1 2 12 2 5 2 NA NA
#> 3 3 3 1 NA NA NA NA NA NA
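For completeness, base R's reshape() can produce the same wide layout without tidyr. This is only a sketch on the example data from the question; grp and wide are names chosen for illustration.

```r
x <- data.frame(ID  = c(1, 1, 1, 1, 2, 2, 2, 3),
                Age = c(3, 5, 6, 7, 1, 12, 5, 3),
                Sex = c(1, 1, 1, 1, 2, 2, 2, 1))

# Number the rows within each ID: 1, 2, 3, ... per individual
x$grp <- ave(x$ID, x$ID, FUN = seq_along)

# Spread Age and Sex into one pair of columns per grp value;
# sep = "" gives names like Age1, Sex1, Age2, Sex2, ...
wide <- reshape(x, idvar = "ID", timevar = "grp",
                direction = "wide", sep = "")
wide
```

The key step is the same as in the answers above: the within-ID counter (grp) is what tells the reshaping function which column each value belongs in, which is the piece the melt/dcast attempt in the question was missing.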