Normalize multiple values using values of one factor in R

We have some tidy data with treatments (multiple samples and a control), time points, and measured values. I want to normalize all the samples by dividing each value by the control value at the corresponding time point.
I know how I would do this with each value in its own column, but I can't figure out how to use a combination of gather, mutate, summarise, etc. from tidyr or dplyr to do this in a straightforward way.
Here is a sample data frame definition:
df <- structure(list(time = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
value = c(10, 20, 15, 100, 210, 180, 110, 180, 140),
as.factor.treat. = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
.Label = c("c", "t1", "t2"), class = "factor")),
.Names = c("time", "value", "treat"),
row.names = c(NA, -9L), class = "data.frame")
Data frame looks like this:
time value treat
1 10 c
2 20 c
3 15 c
1 100 t1
2 210 t1
3 180 t1
1 110 t2
2 180 t2
3 140 t2
Expected output: the same data frame, but with a normvalue column containing c(1,1,1,10,10.5,12,11,9,9.333333).
I'd like to get a column of normalized values for each treatment and time point using tidyverse procedures.

If you group by time (assuming that, as in the example, it is the grouping variable for time-point) then we can use bracket notation in a mutate statement to search only within the group. We can use that to access the control value for each group and then divide the un-normalized value by that:
library(dplyr)
df %>%
group_by(time) %>%
mutate(value.norm = value / value[treat == 'c'])
# A tibble: 9 x 4
# Groups: time [3]
time value treat value.norm
<dbl> <dbl> <fct> <dbl>
1 1 10 c 1
2 2 20 c 1
3 3 15 c 1
4 1 100 t1 10
5 2 210 t1 10.5
6 3 180 t1 12
7 1 110 t2 11
8 2 180 t2 9
9 3 140 t2 9.33
All this does is take the value column of each row and divide it by the value for the control sample with the same time value. As you can see, it doesn't care if sample t1 is missing an observation for time == 1:
df <- structure(list(time = c(1, 2, 3, 2, 3, 1, 2, 3),
value = c(10, 20, 15, 210, 180, 110, 180, 140),
as.factor.treat. = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
.Label = c("c", "t1", "t2"), class = "factor")),
.Names = c("time", "value", "treat"),
row.names = c(NA, -8L), class = "data.frame")
df %>%
group_by(time) %>%
mutate(value.norm = value / value[treat == 'c'])
# A tibble: 8 x 4
# Groups: time [3]
time value treat value.norm
<dbl> <dbl> <fct> <dbl>
1 1 10 c 1
2 2 20 c 1
3 3 15 c 1
4 2 210 t1 10.5
5 3 180 t1 12
6 1 110 t2 11
7 2 180 t2 9
8 3 140 t2 9.33
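One case the approach above does not cover is a time point with no control row at all: value[treat == 'c'] is then empty, and as far as I know recent dplyr versions raise an error in mutate() rather than filling in NA. A minimal sketch of one way around this, assuming you want NA for such time points, is to take the first matching control value with [1]. The small data frame here is made up for illustration, and [1] also silently picks the first control if a time point has several.

```r
library(dplyr)

# Hypothetical data: time 3 has no control ("c") row
df <- data.frame(
  time  = c(1, 2, 2, 3, 3),
  value = c(10, 20, 210, 180, 140),
  treat = c("c", "c", "t1", "t1", "t2")
)

res <- df %>%
  group_by(time) %>%
  # [1] returns NA when no control exists for this time point
  mutate(value.norm = value / value[treat == "c"][1]) %>%
  ungroup()

res
```

Rows from time points without a control end up with value.norm equal to NA instead of stopping the whole pipeline.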

Related

Creating a dataframe based on condition (with a probably dependent by age)?

I'm trying to create a synthetic dataset, but I'm struggling a bit.
Is there a way to create a column based on the values in another column?
I have a between-subject design and my participants are divided into two conditions
(condition 1 = 0, condition 2 = 1).
I want to make a column "Trial_1" (0 = Absence, 1 = Presence), but only for the participants in one of the conditions.
df <- data.frame(
Id = seq(1, 10, by = 1),
Age = sample(1:5, 10, replace = TRUE),
Condition = sample(0:1, 10, replace = TRUE),
Trial_1 = sample(0:1, 10, replace = TRUE, prob = c(0.3, 0.7)))
## But I want Trial_1 filled in only for participants with Condition == 1
And if there is an easy way to make the probability based on age, that would be amazing!
Thanks in advance!
You can create df with Id, Age, Condition columns first, and then use rowwise() and mutate() (both from dplyr package) to create Trial_1.
library(dplyr)
df %>%
rowwise() %>%
mutate(Trial_1 = sample(0:1, 1, prob=c(1-Age/10, Condition*Age/10)))
Here, note that the probability of 0 and 1 is 1-Age/10 and Age/10, respectively, to make it age-dependent; you would want to change this to whatever dependence on age you would like.
Also, note that I multiply the probability corresponding to 1 by Condition, ensuring that Condition=0 rows always get 0.
Output:
Id Age Condition Trial_1
<dbl> <int> <int> <int>
1 1 1 0 0
2 2 3 1 1
3 3 1 0 0
4 4 4 1 0
5 5 3 1 0
6 6 5 1 1
7 7 4 1 0
8 8 5 1 0
9 9 3 0 0
10 10 2 1 0
If you prefer those rows to be NA, then do something like this instead:
df %>%
rowwise() %>%
mutate(Trial_1 = if_else(Condition==1, sample(0:1, 1, prob=c(1-Age/10, Age/10)), NA_integer_))
Output:
Id Age Condition Trial_1
<dbl> <int> <int> <int>
1 1 1 0 NA
2 2 3 1 1
3 3 1 0 NA
4 4 4 1 1
5 5 3 1 0
6 6 5 1 1
7 7 4 1 0
8 8 5 1 0
9 9 3 0 NA
10 10 2 1 0
Input:
structure(list(Id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), Age = c(1L,
3L, 1L, 4L, 3L, 5L, 4L, 5L, 3L, 2L), Condition = c(0L, 1L, 0L,
1L, 1L, 1L, 1L, 1L, 0L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
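The rowwise() calls above draw one value per row. As a hedged alternative sketch, the same age-dependent draw can be vectorized with rbinom(), which accepts a vector of probabilities, so rowwise() is not needed. The Age/10 probability is the same illustrative choice used above, and set.seed() is only there to make the example repeatable.

```r
library(dplyr)

set.seed(1)  # only for reproducibility of this sketch

df <- data.frame(
  Id = 1:10,
  Age = c(1L, 3L, 1L, 4L, 3L, 5L, 4L, 5L, 3L, 2L),
  Condition = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L)
)

res <- df %>%
  mutate(
    # one Bernoulli draw per row with P(1) = Age / 10;
    # rows with Condition == 0 get NA
    Trial_1 = if_else(Condition == 1L,
                      rbinom(n(), size = 1, prob = Age / 10),
                      NA_integer_)
  )

res
```

Replacing NA_integer_ with 0L would give the "always 0 for Condition == 0" variant instead.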
I'd do it in two steps - first create the dataframe and then the Trial column. My solution isn't super elegant, but it's straightforward and doesn't require anything but base R. I hope it helps.
df <- data.frame(
Id = seq(1, 10, by = 1),
Age = sample(1:5, 10, replace = TRUE),
Condition = sample(0:1, 10, replace = TRUE)
)
df$Trial[df$Condition == 1] <- sample(0:1, sum(df$Condition), prob = c(0.3, 0.7), replace = TRUE)
# more generally, if you want to assign to Trial only when Condition is x
# df$Trial[df$Condition == x] <- sample(0:1, sum(df$Condition == x), prob = c(0.3, 0.7), replace = TRUE)

Add a new column with sum of count to a dataframe according to informations from another in R

I would need help in order to add count column into a table called tab1 according to another tab2.
Here is the first tab :
tab1
Event_Groups Other_column
1 1_G1,2_G2 A
2 2_G1 B
3 4_G4 C
4 7_G5,8_G5,9_G5 D
As you can see, the Event_Groups column holds two pieces of information (the Event and Group identifiers, separated by "_"). The same information is found in tab2$Group and tab2$Event. The idea is, for each comma-separated element within a row of tab1, to count the matching rows of tab2 where VALUE1 < 10 AND VALUE2 > 30, and then add this count to tab1 in a new column called Sum_count.
Here is the second table:
tab2
Group Event VALUE1 VALUE2
1 G1 1 5 50 <- VALUE1 < 10 & VALUE2 > 30 : count 1
2 G1 2 6 20 <- VALUE2 < 30 : count 0
3 G2 2 50 50 <- VALUE1 > 10 : count 0
4 G3 3 0 0
5 G4 1 0 0
6 G4 4 2 40 <- VALUE1 < 10 & VALUE2 > 30 : count 1
7 G5 7 1 70 <- VALUE1 < 10 & VALUE2 > 30 : count 1
8 G5 8 4 67 <- VALUE1 < 10 & VALUE2 > 30 : count 1
9 G5 9 3 60 <- VALUE1 < 10 & VALUE2 > 30 : count 1
Example :
For instance, for the first element of row 1 in tab1 (1_G1),
we see in tab2 (row 1) that VALUE1 < 10 & VALUE2 > 30, so I count 1.
For the second element of row 1 (2_G2), we see in tab2 (row 3) that VALUE1 > 10, so I count 0.
And here is the expected tab1 data frame:
Event_Groups Other_column Sum_count
1_G1,2_G2 A 1
2_G1 B 0
4_G4 C 1
7_G5,8_G5,9_G5 D 3
I do not know if I am clear enough, so do not hesitate to ask questions.
Here are the two tables in dput format, if that helps:
tab1
structure(list(Event_Groups = structure(1:4, .Label = c("1_G1,2_G2",
"2_G1", "4_G4", "7_G5,8_G5,9_G5"), class = "factor"), Other_column =
structure(1:4, .Label = c("A", "B", "C", "D"), class = "factor")),
class = "data.frame", row.names = c(NA,
-4L))
tab2
structure(list(Group = structure(c(1L, 1L, 2L, 3L, 4L, 4L, 5L,
5L, 5L), .Label = c("G1", "G2", "G3", "G4", "G5"), class = "factor"),
Event = c(1L, 2L, 2L, 3L, 1L, 4L, 7L, 8L, 9L), VALUE1 = c(5L,
6L, 50L, 0L, 0L, 2L, 1L, 4L, 3L), VALUE2 = c(50, 20, 50,
0, 0, 40, 70, 67, 60)), class = "data.frame", row.names = c(NA,
-9L))
Here is one way to do it:
library(dplyr)
library(tidyr)
tab1 %>%
mutate(Event_Groups = as.character(Event_Groups)) %>%
separate_rows(Event_Groups, sep = ",") %>%
left_join(.,
tab2 %>%
unite(col = "Event_Groups", Event, Group) %>%
mutate(count = if_else(VALUE1 < 10 & VALUE2 > 30,1L, 0L))) %>%
group_by(Other_column) %>%
summarise(Event_Groups = paste(unique(Event_Groups), collapse = ","),
Sum_count = sum(count)) %>%
select(Event_Groups, everything())
#> Joining, by = "Event_Groups"
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 4 x 3
#> Event_Groups Other_column Sum_count
#> <chr> <fct> <int>
#> 1 1_G1,2_G2 A 1
#> 2 2_G1 B 0
#> 3 4_G4 C 1
#> 4 7_G5,8_G5,9_G5 D 3
Created on 2021-07-29 by the reprex package (v0.3.0)
You can try a tidyverse approach:
library(tidyverse)
tab1 %>%
rownames_to_column() %>%
separate_rows(Event_Groups, sep = ",") %>%
separate(Event_Groups, into = c("Event", "Group"), sep = "_", convert = TRUE) %>%
left_join(tab2 %>%
mutate(count = as.numeric(VALUE1 < 10 & VALUE2 > 30)),
by = c("Event", "Group")) %>%
unite(Event_Groups, Event, Group) %>%
group_by(rowname) %>%
summarise(Event_Groups = toString(Event_Groups),
Other_column = unique(Other_column),
count =sum(count))
# A tibble: 4 x 4
rowname Event_Groups Other_column count
<chr> <chr> <chr> <dbl>
1 1 1_G1, 2_G2 A 1
2 2 2_G1 B 0
3 3 4_G4 C 1
4 4 7_G5, 8_G5, 9_G5 D 3
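For comparison, here is a minimal base R sketch of the same counting logic without any joins. It rebuilds the "Event_Group" key for every row of tab2, flags the rows that satisfy the condition, and sums the flags for the keys listed in each row of tab1. The helper names keys and hit are made up for this illustration, and the tables are re-created as plain character columns rather than factors.

```r
tab1 <- data.frame(
  Event_Groups = c("1_G1,2_G2", "2_G1", "4_G4", "7_G5,8_G5,9_G5"),
  Other_column = c("A", "B", "C", "D")
)
tab2 <- data.frame(
  Group  = c("G1", "G1", "G2", "G3", "G4", "G4", "G5", "G5", "G5"),
  Event  = c(1L, 2L, 2L, 3L, 1L, 4L, 7L, 8L, 9L),
  VALUE1 = c(5L, 6L, 50L, 0L, 0L, 2L, 1L, 4L, 3L),
  VALUE2 = c(50, 20, 50, 0, 0, 40, 70, 67, 60)
)

# Key in the same "Event_Group" form used in tab1
keys <- paste(tab2$Event, tab2$Group, sep = "_")
# TRUE where a tab2 row satisfies the counting condition
hit <- tab2$VALUE1 < 10 & tab2$VALUE2 > 30

# For each tab1 row, sum the flags of the tab2 rows whose key is listed
tab1$Sum_count <- vapply(
  strsplit(as.character(tab1$Event_Groups), ",", fixed = TRUE),
  function(k) sum(hit[keys %in% k]),
  integer(1)
)

tab1
```

Keys in tab1 that have no match in tab2 simply contribute 0 here, whereas the left_join versions above would produce NA for them.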

How to use column indices to collect values from columns in R

x y z column_indices
6 7 1 1,2
5 4 2 3
1 3 2 1,3
I have the column indices of the values I would like to collect in a separate column like so, what I want to create is something like this:
x y z column_indices values
6 7 1 1,2 6,7
5 4 2 3 2
1 3 2 1,3 1,2
What is the simplest way to do this in R?
Thanks!
In base R, we can use apply, split the column_indices on ',', convert them to integer and get the corresponding value from the row.
df$values <- apply(df, 1, function(x) {
inds <- as.integer(strsplit(x[4], ',')[[1]])
toString(x[inds])
})
df
# x y z column_indices values
#1 6 7 1 1,2 6, 7
#2 5 4 2 3 2
#3 1 3 2 1,3 1, 2
data
df <- structure(list(x = c(6L, 5L, 1L), y = c(7L, 4L, 3L), z = c(1L,
2L, 2L), column_indices = structure(c(1L, 3L, 2L), .Label = c("1,2",
"1,3", "3"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))
One solution involving dplyr and tidyr could be:
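library(dplyr)
library(tidyr)
df %>%
mutate(column_indices = as.character(column_indices)) %>%  # strsplit() needs character, not factor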
pivot_longer(-column_indices) %>%
group_by(column_indices) %>%
mutate(values = toString(value[1:n() %in% unlist(strsplit(column_indices, ","))])) %>%
pivot_wider(names_from = "name", values_from = "value")
column_indices values x y z
<chr> <chr> <int> <int> <int>
1 1,2 6, 7 6 7 1
2 3 2 5 4 2
3 1,3 1, 2 1 3 2
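Both answers above rely on some implicit behaviour: apply() first converts the data frame to a character matrix, and the grouped version assumes each column_indices string occurs in only one row. A hedged base R sketch that indexes each row directly, and so avoids both assumptions, might look like this. The helper name idx is made up for the example, and column_indices is assumed to be a character column.

```r
df <- data.frame(
  x = c(6L, 5L, 1L),
  y = c(7L, 4L, 3L),
  z = c(1L, 2L, 2L),
  column_indices = c("1,2", "3", "1,3"),
  stringsAsFactors = FALSE
)

# Parse "1,2" into c(1L, 2L) for every row
idx <- lapply(strsplit(df$column_indices, ",", fixed = TRUE), as.integer)

# Pick the requested columns row by row and paste them back together
df$values <- vapply(
  seq_len(nrow(df)),
  function(i) paste(unlist(df[i, idx[[i]]]), collapse = ","),
  character(1)
)

df
```

This produces "6,7", "2" and "1,2", matching the format requested in the question (no space after the comma).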

How to extract exactly three observations with biggest count

How can I extract only the three top observations with respect to some variable, e.g. a count (the n variable in the example data below)? I would like to avoid arranging rows, so I thought I could use dplyr::min_rank.
ex <- structure(list(code = c("17.1", "6.2", "151.5", "78.1", "88.1",
"95.1", "45.2", "252.2"), id = c(1, 2, 3, 4, 5, 6, 7, 8), n = c(6L,
5L, 8L, 10L, 6L, 3L, 4L, 6L)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L))
ex %>%
filter(min_rank(desc(n)) <= 3)
But if there are ties, it can give more than 3 observations. For example, the command above returns five rows:
# A tibble: 5 x 3
code id n
<chr> <dbl> <int>
1 17.1 1 6
2 151.5 3 8
3 78.1 4 10
4 88.1 5 6
5 252.2 8 6
How can I then extract exactly 3 observations? (no matter which observation is returned in case of ties)
We can use row_number that can take a column as argument
ex %>%
filter(row_number(desc(n)) <= 3)
# A tibble: 3 x 3
# code id n
# <chr> <dbl> <int>
#1 17.1 1 6
#2 151.5 3 8
#3 78.1 4 10
In base R, we can use
ex[tail(order(ex$n),3), ]
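Another option, assuming dplyr 1.0.0 or later is available, is slice_max() with with_ties = FALSE, which is meant for exactly this "top n, break ties arbitrarily" case. Note that, unlike the filter() version, the result comes back ordered by n, so this is only a sketch for when row order does not matter.

```r
library(dplyr)

ex <- tibble(
  code = c("17.1", "6.2", "151.5", "78.1", "88.1", "95.1", "45.2", "252.2"),
  id   = c(1, 2, 3, 4, 5, 6, 7, 8),
  n    = c(6L, 5L, 8L, 10L, 6L, 3L, 4L, 6L)
)

# Keep exactly three rows with the largest n; ties are cut off, not kept
top3 <- ex %>%
  slice_max(order_by = n, n = 3, with_ties = FALSE)

top3
```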

Remove all rows of a category if one row meets a condition [duplicate]

Problem:
I want to remove all the rows of a specific category if one of the rows has a certain value in another column (similar to the problems in the links below). The main difference is that I would like this to apply only when the row also matches a criterion in a third column.
Making a practice df
prac_df <- tibble::tibble(
subj = rep(1:4, each = 4),
trial = rep(rep(1:4, each = 2), times = 2),
ias = rep(c('A', 'B'), times = 8),
fixations = c(17, 14, 0, 0, 15, 0, 8, 6, 3, 2, 3,3, 23, 2, 3,3)
)
So my data frame looks like this.
subj ias fixations
1 1 A 17
2 1 B 14
3 2 A 0
4 2 B 0
5 3 A 15
6 3 B 0
7 4 A 8
8 4 B 6
And I want to remove all of subject 2 because it has a value of 0 for fixations column in a row that ias has a value of A. However I want to do this without removing subject 3, because even though there is a 0 it is in a row where the ias column has a value of B.
My attempt so far.
new.df <- prac_df[with(prac_df, ave(prac_df$fixations != 0, subj, FUN = all)),]
However this is missing the part that will only get rid of it if it has the value A in the ias column. I've attempted various uses of & or if but I feel like there's likely a clever and clean way I just don't know of.
My goal is to make a df like this.
subj ias fixations
1 1 A 17
2 1 B 14
3 3 A 15
4 3 B 0
5 4 A 8
6 4 B 6
Thank you very much!
Related questions:
R: Remove rows from data frame based on values in several columns
How to remove all rows belonging to a particular group when only one row fulfills the condition in R?
We group by 'subj' and then filter with a logical condition built from any and !, which keeps only the groups where no row has fixations equal to 0 together with ias equal to "A":
library(dplyr)
df1 %>%
group_by(subj) %>%
filter(!any(fixations==0 & ias == "A"))
# subj ias fixations
# <int> <chr> <int>
#1 1 A 17
#2 1 B 14
#3 3 A 15
#4 3 B 0
#5 4 A 8
#6 4 B 6
Or use all with |
df1 %>%
group_by(subj) %>%
filter(all(fixations!=0 | ias !="A"))
The same approach can be used with ave from base R
df1[with(df1, !ave(fixations==0 & ias =="A", subj, FUN = any)),]
data
df1 <- structure(list(subj = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), ias = c("A",
"B", "A", "B", "A", "B", "A", "B"), fixations = c(17L, 14L, 0L,
0L, 15L, 0L, 8L, 6L)), .Names = c("subj", "ias", "fixations"),
class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
