Arranging a dataset by groups in R

I need help making a dataset that records which treatment each participant is on and what they scored in a composite test (this is just an exercise for my course; no real data are used).
A <- c(36, 35, 22, 20)
B <- c(26, 30, 25, 20)
C <- c(42, 30, 45, 62)
treatment <- c("A", "B", "C")
depression <- c(A, B, C)
df1 <- data.frame(treatment, depression)
This arranges the data in the data frame incorrectly, with the treatment column cycling A, B, C, A, B, C, ... instead of A, A, A, A, B, ...
Does anyone know how to rearrange this data?
I need the data arranged so that I can split it by treatment and run different calculations on each group.

treatment <- rep(LETTERS[1:3], each=4)
depression <- c(A, B, C)
df1 <- data.frame(treatment, depression)

I think you're looking for
df1 <- data.frame(treatment = rep(treatment, each = 4), depression)
For production/"real-life" code you would probably want to do something fancier, e.g.
L <- tibble::lst(A, B, C)  ## self-naming list
data.frame(treatment = rep(names(L), lengths(L)),
           depression = unlist(L))

Here is a tidyverse approach:
library(tidyverse)
tibble(depression) %>%
  mutate(treatment = rep(treatment, each = length(A)))
depression treatment
<dbl> <chr>
1 36 A
2 35 A
3 22 A
4 20 A
5 26 B
6 30 B
7 25 B
8 20 B
9 42 C
10 30 C
11 45 C
12 62 C
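Since the question also mentions splitting the data for per-group calculations, here is a minimal base R sketch of that next step (the mean is just an illustrative summary; use whatever statistic your exercise needs):
# split the depression scores into one vector per treatment group
groups <- split(df1$depression, df1$treatment)
sapply(groups, mean)
# or, equivalently, in one step
tapply(df1$depression, df1$treatment, mean)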


Aggregating rows with shared values and simultaneously choosing value in a separate column to keep in aggregated row

Hope you are well. I would like to a) add together the values in Column A where the values in Column B are equal to one another AND the values in Column C are equal to one another, and simultaneously b) in the new summed row, keep only the value from Column D that matches the maximal value of Column A within that summed group.
I think it is difficult to explain my query without an example.
Let us assume these are the relevant data:
df <- data.frame(A = c(10, 1, 4, 3, 7),
                 B = c("a", "a", "b", "b", "b"),
                 C = c(.5, .5, 2.5, 1.5, 2.5),
                 D = c(54, 36, 94, 57, 49))
resulting in this dataframe:
A B C D
1 10 a 0.5 54
2 1 a 0.5 36
3 4 b 2.5 94
4 3 b 1.5 57
5 7 b 2.5 49
Notice that rows 1 and 2 are equivalent in B and C, so they should be summed. But row 1 has the greater value in A of the two, so 54 should be kept instead of 36. It's similar with rows 3 and 5. The end result should be:
A B C D
11 a .5 54
3 b 1.5 57
11 b 2.5 49
I am partway there. I have found some code that does part a). Either of these does the trick:
aggregate(A ~ B + C, df, sum)
library(data.table)
setDT(df)[, .(summedvar = sum(A)), by = .(B, C)]
However, these approaches delete Column D, unsurprisingly. I am curious if anyone has any ideas about how to incorporate part b). Maybe I need to do multiple steps? Or maybe I'm going about this the wrong way? I’m deeply grateful for any advice.
You can use which.max to get the index of the highest value in A and pick the corresponding value of D.
Using dplyr you may do:
library(dplyr)
df %>%
  group_by(B, C) %>%
  summarise(D = D[which.max(A)],
            A = sum(A), .groups = "drop") %>%
  select(A, B, C, D)
# A B C D
# <dbl> <chr> <dbl> <dbl>
#1 11 a 0.5 54
#2 3 b 1.5 57
#3 11 b 2.5 49
Similarly in data.table -
library(data.table)
setDT(df)
df[, .(A = sum(A), D = D[which.max(A)]), .(B, C)]
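If you prefer base R, the same which.max idea works with split and lapply; a minimal sketch, assuming df is still the plain data frame defined in the question (i.e. before setDT):
grp <- split(df, list(df$B, df$C), drop = TRUE)
out <- do.call(rbind, lapply(grp, function(g) {
  # sum A within the group, keep the D from the row with the largest A
  data.frame(A = sum(g$A), B = g$B[1], C = g$C[1], D = g$D[which.max(g$A)])
}))
rownames(out) <- NULL
out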

R using combn with apply

I have a data frame that has percentage values for a number of variables and observations, as follows:
obs <- data.frame(Site = c("A", "B", "C"), X = c(11, 22, 33), Y = c(44, 55, 66), Z = c(77, 88, 99))
I need to prepare this data as an edge list for network analysis, with "Site" as the nodes and the remaining variables as the edges. The result should look like this:
Node1 Node2 Weight Type
A B 33 X
A C 44 X
...
B C 187 Z
So that for "Weight" we are calculating the sum of all possible pairs, and this separately for each column (which ends up in "Type").
I suppose the answer to this has to be using apply on a combn expression, like here Applying combn() function to data frame, but I haven't quite been able to work it out.
I can do this all by hand taking the combinations for "Site"
sites <- combn(obs$Site, 2)
Then the individual columns like so:
combX <- combn(obs$X, 2, function(x) sum(x))
and binding those results together, but this obviously becomes annoying very quickly.
I have tried to do all the variable columns in one go like this
b <- apply(obs[, -1], 1, function(x) {
  sum(utils::combn(x, 2))
})
but there is something wrong with that.
Can anyone help, please?
One option would be to create a function and then map that function over all of the columns that you have.
library(dplyr)
library(purrr)

func1 <- function(var) {
  obs %>%
    transmute(Node1 = combn(Site, 2)[1, ],
              Node2 = combn(Site, 2)[2, ],
              Weight = combn(!!sym(var), 2, function(x) sum(x)),
              Type = var)
}
map(colnames(obs)[-1], func1) %>% bind_rows()
Here is an example using combn
do.call(
  rbind,
  combn(1:nrow(obs),
        2,
        FUN = function(k) cbind(data.frame(t(obs[k, 1])), stack(data.frame(as.list(colSums(obs[k, -1]))))),
        simplify = FALSE
  )
)
which gives
X1 X2 values ind
1 A B 33 X
2 A B 99 Y
3 A B 165 Z
4 A C 44 X
5 A C 110 Y
6 A C 176 Z
7 B C 55 X
8 B C 121 Y
9 B C 187 Z
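The column names here come from stack and from transposing the Site pairs; to get the header asked for in the question you could rename the result afterwards, e.g. (assuming the result above has been assigned to res):
names(res) <- c("Node1", "Node2", "Weight", "Type")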
Try it this way:
library(tidyverse)
obs_long <- obs %>% pivot_longer(-Site, names_to = "type")
sites <- combn(obs$Site, 2) %>% t() %>% as_tibble()
Type <- tibble(type = c("X", "Y", "Z"))
merge(sites, Type) %>%
  left_join(obs_long, by = c("V1" = "Site", "type" = "type")) %>%
  left_join(obs_long, by = c("V2" = "Site", "type" = "type")) %>%
  mutate(res = value.x + value.y) %>%
  select(-c(value.x, value.y))
V1 V2 type res
1 A B X 33
2 A C X 44
3 B C X 55
4 A B Y 99
5 A C Y 110
6 B C Y 121
7 A B Z 165
8 A C Z 176
9 B C Z 187

How to add a new column to a data frame that is the -ln of the variable "hr" if the two "age" variables in the two dfs match in R?

My goal is to create a new column "hoLj" in df1 that is the negative natural log (-ln) of "hr" from df2 where the corresponding age in df1 matches age2 in df2.
df1 <- data.frame(age = c("1", "2", "4", "5", "7", "8"), dif = c("y", "n", "y", "n", "n", "y"))
df2 <- data.frame(age2 = c("1", "2", "3", "4", "5", "6", "7", "8"), hr = c(56, 57, 23, 46, 45, 19, 21, 79))
The resulting df1 should look like the below:
age dif hoLj
1 y -ln(56)
2 n -ln(57)
4 y -ln(46)
5 n -ln(45)
7 n -ln(21)
8 y -ln(79)
Thank you!
We can do a join and then take the negative natural log:
library(dplyr)
left_join(df1, df2, by = c("age" = "age2")) %>%
  mutate(hoLj = -log(hr)) %>%
  select(-hr)
Or with data.table
library(data.table)
setDT(df1)[df2, hoLj := -log(hr), on = .(age = age2)]
df1
# age dif hoLj
#1: 1 y -4.025352
#2: 2 n -4.043051
#3: 4 y -3.828641
#4: 5 n -3.806662
#5: 7 n -3.044522
#6: 8 y -4.369448
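A base R alternative is to look up hr by age with match; a minimal sketch using df1 and df2 as defined in the question:
# match() gives, for each age in df1, the position of the matching age2 in df2
df1$hoLj <- -log(df2$hr[match(df1$age, df2$age2)])
df1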

Group rows in data.frame and find quantile [duplicate]

I have the following data:
set.seed(789)
df_1 = data.frame(a = 22, b = 24, c = rnorm(10))
df_2 = data.frame(a = 44, b = 24, c = rnorm(10))
df_3 = data.frame(a = 33, b = 99, c = rnorm(10))
df_all = rbind(df_1, df_2, df_3)
I need to group df_all by columns a and b, and then find the 50th percentile of column c.
This can be done individually, for each df, as follows:
df_1_q = quantile(df_1$c, probs = 0.50)
df_2_q = quantile(df_2$c, probs = 0.50)
df_3_q = quantile(df_3$c, probs = 0.50)
However, my real df_all is much larger than this.
More generally, how can I group a data.frame by rows and apply a given function to each group?
Thanks
You could use dplyr for that:
library(dplyr)
df_all %>%
  group_by(a, b) %>%
  summarise(quantile = quantile(c, probs = 0.5))
# A tibble: 3 x 3
# Groups: a [?]
a b quantile
<dbl> <dbl> <dbl>
1 22 24 -0.268
2 33 99 -0.234
3 44 24 -0.445
Or using data.table as:
library(data.table)
dt <- data.table(df_all)
dt[,list(quantile=quantile(c, probs = 0.5)),by=c("a", "b")]
a b quantile
1: 22 24 -0.2679104
2: 44 24 -0.4450979
3: 33 99 -0.2336712
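Since the question also asks, more generally, how to apply an arbitrary function per group, here is a base R sketch using aggregate (same df_all as above; swap quantile for whatever function you need, extra arguments are passed through to it):
aggregate(c ~ a + b, data = df_all, FUN = quantile, probs = 0.5)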

Merge select columns from multiple tables using common identifiers in R

I would like to combine (merge) selected columns from multiple tables that have the following organization.
Here are two example datasets that I want to combine:
"dataset1"
A B C D E F (header)
1 2 3 4 5 F1(1st row)
6 7 8 9 10 F2(2nd row)
11 12 13 14 15 F3 (3rd row)
....
"dataset2"
A B C D E F (header)
16 17 18 19 20 F1(1st row)
21 22 23 24 25 F2(2nd row)
26 27 28 29 30 F3 (3rd row)
....
Here, the headers of all the different datasets (I have more than 100 datasets) are identical, and I want to use the names in column F (F1, F2, F3... more than F200) as a unique identifier.
For example, if I combine column "A" from all the different datasets using column F as the identifier, the result should look like this. Also, to distinguish where the data came from, the headers need to be changed to the dataset IDs.
dataset1  dataset2  F    (header)
       1        16  F1   (1st row)
       6        21  F2   (2nd row)
      11        26  F3   (3rd row)
....
Note that my datasets contain different numbers of rows, so some data points corresponding to F1~F200 could be missing; in that case I want to put NA or leave the cell empty.
To this end, I tried the following code:
x <- merge(dataset1, dataset2, by="F", all=T)
But this way I cannot extract only column A; it merges every column.
Similarly, I also tried
x <- Reduce(function(x, y) merge(x, y, all=TRUE, by=("F")), list(dataset1, dataset2))
This actually gave me results identical to the previous code. To further extract only column A, I tried the following, but it did not work:
x <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "F"), list(dataset1[, 1], dataset2[, 1]))
And I have no idea how to change the headers to the names of the datasets they came from.
Please understand that I have just started to learn R basics.
I'm using RStudio 0.98.507, and currently all of the datasets (more than a hundred) are loaded and present in the Global Environment.
Thank you very much!
Here's one solution with the following four sample data frames:
dataset1 <- data.frame(A = c(1, 6, 11),
                       B = c(2, 7, 12),
                       C = c(3, 8, 12),
                       D = c(4, 9, 13),
                       E = c(5, 10, 14),
                       F = c("F1", "F2", "F3"))
dataset2 <- data.frame(A = c(16, 21, 26),
                       B = c(17, 22, 27),
                       C = c(18, 23, 28),
                       D = c(19, 24, 29),
                       E = c(20, 25, 30),
                       F = c("F1", "F2", "F3"))
dataset3 <- data.frame(A = c(30, 61),
                       B = c(57, 90),
                       C = c(38, 33),
                       D = c(2, 16),
                       E = c(77, 25),
                       F = c("F1", "F2"))
dataset4 <- data.frame(A = c(36, 61),
                       B = c(47, 30),
                       C = c(37, 33),
                       D = c(45, 10),
                       E = c(66, 29),
                       F = c("F1", "F2"))
First combine them into a list:
datasets <- list(dataset1, dataset2, dataset3, dataset4)
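Since the question mentions more than a hundred datasets already sitting in the Global Environment, building this list by hand is impractical. Here is a sketch that collects them automatically, assuming they are all named dataset1, dataset2, ... (adjust the pattern to your actual names):
# grab every object whose name is "dataset" followed by digits
# note: ls() sorts names alphabetically, so dataset10 would come before dataset2
datasets <- mget(ls(pattern = "^dataset[0-9]+$"))
This also gives a named list, so names(datasets) is available later if you prefer the dataset names over numeric suffixes.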
Then rename all the columns except the F column. This is because later when we merge the data frames together, if the columns all have the same names then merge will try to differentiate them by adding .x or .y to the names -- which is fine when you're only merging two data sets, but gets confusing with more than two.
for (i in seq_along(datasets)) {
  for (j in seq_along(colnames(datasets[[i]]))) {
    if (colnames(datasets[[i]])[j] != "F") {
      colnames(datasets[[i]])[j] <- paste(colnames(datasets[[i]])[j], i, sep = ".")
    }
  }
}
This gives us data frames whose column headers look like this:
datasets[[1]]
## A.1 B.1 C.1 D.1 E.1 F
## 1 1 2 3 4 5 F1
## 2 6 7 8 9 10 F2
## 3 11 12 12 13 14 F3
Then use Reduce:
df <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "F"), datasets)
And select the columns you want, in this case all the columns with A in the column name:
df[, c("F", grep("A", names(df), value = TRUE))]
## F A.1 A.2 A.3 A.4
## 1 F1 1 16 30 36
## 2 F2 6 21 61 61
## 3 F3 11 26 NA NA
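The question also asks for the headers to be the dataset IDs themselves; one way to finish, assuming the list elements are in dataset order as above:
out <- df[, c("F", grep("A", names(df), value = TRUE))]
# replace the A.1, A.2, ... headers with dataset1, dataset2, ...
names(out)[-1] <- paste0("dataset", seq_along(datasets))
out
##    F dataset1 dataset2 dataset3 dataset4
## 1 F1        1       16       30       36
## 2 F2        6       21       61       61
## 3 F3       11       26       NA       NA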
