dplyr / base R: compute new columns using logical combinations of row indices

I am analysing a data set from an experiment and would like to calculate effect sizes for each variable. My dataframe consists of multiple variables (= columns) for 8 treatments t (= rows), with t1 - t4 being the controls for t5 - t8, respectively (t1 is the control for t5, t2 for t6, ...). The original data set is much larger, so I would like to solve the following two tasks:
I would like to calculate the log(treatment/control) for each t5 - t8 for one variable, e.g. effect size for t5 = log(t5/t1), effect size for t6 = log(t6/t2), ... . The name of the resulting column should be variablename_effect and the new column would only have 4 rows instead of 8.
The trickiest part is that I need to build the pairing of specific rows into my code, so that the correct control is used for each treatment.
I would like to calculate the effect sizes for all my variables within one code, so create multiple new columns with the correct names (variablename_effect).
I would prefer to solve the problem in dplyr or base R to keep it simple.
So far, the only related question I found was /r-dplyr-mutate-refer-new-column-itself (shows combination of multiple if else()). I would be very thankful for a solution, links to similar questions, or pointers to packages I should use in case it's not possible within dplyr / base R!
Sample data:
df <- data.frame("treatment" = c(1:8), "Var1" = c(9:16), "Var2" = c(17:24))
Edit: this is the df_effect I would expect to receive as an output, thanks @Martin_Gal for the hint!
df_effect <- data.frame("treatment" = c(5:8), "Var1_effect" = c(log(13/9), log(14/10), log(15/11), log(16/12)), "Var2_effect" = c(log(21/17), log(22/18), log(23/19), log(24/20)))
My ideas so far:
For calculating the effect size:
mutate() and for function:
# 1st option:
for (i in 5:8) {
dt_effect <- df %>%
mutate(Var1_effect = log(df[i, "Var1"]/df[i - 4, "Var1"]))
}
#2nd option:
for (i in 5:8){
dt_effect <- df %>%
mutate(Var1_effect = log(df[treatment == i , "Var1"]/df[treatment == i - 4 , "Var1"]))
}
problem: both return the result for i = 8 for every row!
mutate() and ifelse():
df_effect <- df %>%
mutate(Var1_effect = ifelse(treatment >= 5, log(df[, "Var1"]/df[ , "Var1"]), NA))
seems to work, but so far I couldn't implement which row to pick for the control, so it returns NA for t1 - t4 (correct) and 0 for t5 - t8 (mathematically correct as I calculate log(t5/t5), ... but not what I want).
maybe I should use summarise() instead of mutate() because I create fewer rows than in my original dataframe?
Make this work for every variable at the same time
My only idea would be to index the columns within a second for function and use the paste() to create the new column names, but I don't know exactly how to do this ...

I don't know if this will solve your problem, but I want to make a suggestion similar to Limey:
library(dplyr)
library(tidyr)
df %>%
mutate(control = 1 - (treatment-1) %/% (nrow(.)/2),
group = ifelse(treatment %% (nrow(.)/2) == 0, nrow(.)/2, treatment %% (nrow(.)/2))) %>%
select(-treatment) %>%
pivot_wider(names_from = c(control), values_from=c(Var1, Var2)) %>%
group_by(group) %>%
mutate(Var1_effect = log(Var1_0/Var1_1))
This yields
# A tibble: 4 x 6
# Groups: group [4]
group Var1_1 Var1_0 Var2_1 Var2_0 Var1_effect
<dbl> <int> <int> <int> <int> <dbl>
1 1 9 13 17 21 0.368
2 2 10 14 18 22 0.336
3 3 11 15 19 23 0.310
4 4 12 16 20 24 0.288
What happened here?
I assumed that the first half of your data.frame holds the controls for the second half. So I created an indicator variable (control) and a grouping variable (group) based on the treatment ids/numbers.
Now the treatment id isn't used anymore, so I dropped it.
Next I used pivot_wider to create a dataset with Var1_1 (i.e. Var1 for your control variable) and Var1_0 (i.e. Var1 for your "ordinary" variable).
Finally I calculated Var1_effect per group.
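The answer above computes the effect for Var1 only. A minimal sketch of how the same idea could cover every variable at once, assuming dplyr >= 1.0 (for across()) and assuming the rows are laid out so that each control sits exactly nrow(df)/2 rows above its treatment; the name offset is introduced here for illustration only:

```r
library(dplyr)

df <- data.frame(treatment = 1:8, Var1 = 9:16, Var2 = 17:24)

# Assumption: rows 1..n/2 are the controls, rows n/2+1..n the matching treatments
offset <- nrow(df) %/% 2L

df_effect <- df %>%
  arrange(treatment) %>%
  # for every column except treatment: log(value / value `offset` rows earlier)
  mutate(across(-treatment,
                ~ log(.x / dplyr::lag(.x, offset)),
                .names = "{.col}_effect")) %>%
  # control rows have no earlier partner (NA), so keep only the treatment rows
  filter(treatment > offset) %>%
  select(treatment, ends_with("_effect"))

df_effect
```

The .names argument builds the variablename_effect column names, so no paste() loop is needed.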

In response to OP's comment on @MartinGal's solution (which is perfectly fine in its own right):
First convert the input data to a more convenient form:
library(dplyr)
library(tidyr)   # pivot_longer(), pivot_wider()
library(tibble)  # add_column()
# Original input dataset
df <- data.frame("treatment" = c(1:8), "Var1" = c(9:16), "Var2" = c(17:24))
# Revised input dataset
revisedDF <- df %>%
select(-treatment) %>%
add_column(
Treatment=rep(c("Control", "Test"), each=4),
Experiment=rep(1:4, times=2)
) %>%
pivot_longer(
names_to="Variable",
values_to="Value",
cols=c(Var1, Var2)
) %>%
arrange(Experiment, Variable, Treatment)
revisedDF %>% head(6)
Giving
# A tibble: 6 x 4
Treatment Experiment Variable Value
<chr> <int> <chr> <int>
1 Control 1 Var1 9
2 Test 1 Var1 13
3 Control 1 Var2 17
4 Test 1 Var2 21
5 Control 2 Var1 10
6 Test 2 Var1 14
I like this format because it makes the analysis code completely independent of the number of variables, the number of experiments and the number of treatments.
The analysis is straightforward, too:
result <- revisedDF %>% pivot_wider(
names_from=Treatment,
values_from=Value
) %>%
mutate(Effect=log(Test/Control))
result
Giving
Experiment Variable Control Test Effect
<int> <chr> <int> <int> <dbl>
1 1 Var1 9 13 0.368
2 1 Var2 17 21 0.211
3 2 Var1 10 14 0.336
4 2 Var2 18 22 0.201
5 3 Var1 11 15 0.310
6 3 Var2 19 23 0.191
7 4 Var1 12 16 0.288
8 4 Var2 20 24 0.182
pivot_wider and pivot_longer are relatively new tidyr verbs (introduced in tidyr 1.0.0). If you're unable to use a recent version of the package, spread and gather do the same job with different argument names.
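Since the question also asks about base R, here is a rough sketch without any packages. It assumes the data are sorted by treatment and that the first half of the rows are the controls for the second half, in the same order; is_trt and vars are names made up for this example:

```r
df <- data.frame(treatment = 1:8, Var1 = 9:16, Var2 = 17:24)

vars   <- setdiff(names(df), "treatment")   # every measured variable
is_trt <- df$treatment > nrow(df) / 2       # TRUE for t5..t8

ctrl <- df[!is_trt, vars, drop = FALSE]     # rows t1..t4
trt  <- df[is_trt,  vars, drop = FALSE]     # rows t5..t8

# element-wise log(treatment / control); row i of trt pairs with row i of ctrl
effect <- log(trt / ctrl)
names(effect) <- paste0(vars, "_effect")

df_effect <- data.frame(treatment = df$treatment[is_trt], effect,
                        row.names = NULL)
df_effect
```

Because the division is done on whole data frames, adding more variables to df needs no change to the code.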

Related

Time series in R in column

I have a time series with months in rows instead of columns. It's quite a large dataset and I am looking for a way to get the mean for every 12 rows (in this case for temperature), so that a smaller dataset will emerge.
This can be done with group_by and summarize from dplyr. First you have to create a grouping variable that assigns each row to its block of 12 rows.
library(dplyr)
dta <- data.frame(temp = rnorm(60, 0, 1))
dta$group <- rep(1:(60/12), each = 12)  # one group per block of 12 consecutive rows
dta %>% group_by(group) %>% summarize(mean_temp = mean(temp))
Result
# A tibble: 5 x 2
group mean_temp
<int> <dbl>
1 1 -0.582
2 2 0.490
3 3 -0.197
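For comparison, a base R sketch of the same "mean of every 12 rows" idea. It derives the block number from the row position with integer division, so it also copes with a row count that is not a multiple of 12 (the last block is then simply shorter). The seed and the names block and block_means are assumptions for this example:

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
dta <- data.frame(temp = rnorm(60))

# rows 1-12 -> block 1, rows 13-24 -> block 2, ...
block <- (seq_len(nrow(dta)) - 1) %/% 12 + 1

# one mean per block
block_means <- tapply(dta$temp, block, mean)
block_means
```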

How to divide each of a range a variables by a second range of variables in R

I have a range of columns containing the numerators of certain diseases, and a range of columns containing the denominators of the same diseases. I want to loop through each of the numerator columns dividing by the appropriate denominator column creating a percentage column for each disease.
All my columns follow the same name format: disease1_num, disease2_num, disease1_den, disease2_den
I want to divide disease1_num/disease1_den*100 to create disease1_perc, then disease2_num/disease2_den*100 to create disease2_perc etc.
There are approximately 20 diseases in my dataset.
I am mainly using tidyverse commands.
I have tried using gather to create two datasets, one with the numerators and one with the denominators. I then extracted the disease name, joined them together, calculated the percentage and spread the dataset again, before adding this back to the original dataset. This works, but it is a bit long-winded; ideally I would like to do this in place in the original dataset.
# A tibble: 3 x 5
id disease1_num disease2_num disease1_den disease2_den
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 4 12 15
2 2 8 6 14 16
3 3 10 8 17 18
df_num <- df %>%
select(id,disease1_num:disease2_num) %>%
gather(key="num_indicator",value="num",disease1_num:disease2_num) %>%
mutate(indicator=str_remove(num_indicator,'_num'))
df_den <- df%>%
select(id, disease1_den:disease2_den) %>%
gather(key="den_indicator",value="den",disease1_den:disease2_den) %>%
mutate(indicator=str_remove(den_indicator,'_den'))
df_numden <- left_join(df_num,df_den,c('id','indicator'))
df_perc <- df_numden %>%
mutate(perc_indicator=str_replace(den_indicator,'den','perc'),
perc=num/den*100) %>%
select(id, perc_indicator:perc) %>%
spread(perc_indicator,perc)
df_final <- left_join(df,df_perc,'id')
We can just use grep (with value = TRUE) to get the column names and divide directly.
num_cols <- grep("num$", names(df), value = TRUE)
den_cols <- grep("den$", names(df), value = TRUE)
df[sub("_num","_perc", num_cols)]<- df[num_cols]/df[den_cols] * 100
df
# id disease1_num disease2_num disease1_den disease2_den disease1_perc disease2_perc
#1 1 5 4 12 15 41.7 26.7
#2 2 8 6 14 16 57.1 37.5
#3 3 10 8 17 18 58.8 44.4
Note that this relies on num_cols and den_cols having the same length and listing the diseases in the same order, because the columns are divided position by position.
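If the column order cannot be trusted, one possible safeguard is to build both name vectors from the disease names, so that each numerator is paired with its denominator by name. This is a sketch on the sample data from the question; diseases is a helper vector introduced here:

```r
df <- data.frame(id = 1:3,
                 disease1_num = c(5, 8, 10), disease2_num = c(4, 6, 8),
                 disease1_den = c(12, 14, 17), disease2_den = c(15, 16, 18))

# disease names, taken from the numerator columns
diseases <- sub("_num$", "", grep("_num$", names(df), value = TRUE))

num_cols <- paste0(diseases, "_num")
den_cols <- paste0(diseases, "_den")

# fail early if a numerator has no matching denominator column
stopifnot(all(den_cols %in% names(df)))

df[paste0(diseases, "_perc")] <- df[num_cols] / df[den_cols] * 100
df
```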

How to run a for loop for each group in a dataframe?

This question is similar to this one asked earlier but not quite. I would like to iterate through a large dataset (~500,000 rows) and for each unique value in one column, I would like to do some processing of all the values in another column.
Here is code that I have confirmed to work:
df = matrix(nrow=783,ncol=2)
counts = table(csvdata$value)
p = (as.vector(counts))/length(csvdata$value)
D = 1 - sum(p**2)
The only problem with it is that it returns the value D for the entire dataset, rather than returning a separate D value for each set of rows where ID is the same.
Say I had data like this:
How would I be able to do the same thing as the code above, but return a D value for each group of rows where ID is the same, rather than for the entire dataset? I imagine this requires a loop, and creating a matrix to store all the D values in with ID in one column and the value of D in the other, but not sure.
Ok, let's work with "In short, I would like whatever is in the for loop to be executed for each block of data with a unique value of "ID"".
In general you can group rows by values in one column (e.g. "ID") and then perform some transformation based on values/entries in other columns per group. In the tidyverse this would look like this
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(value.mean = mean(value))
## A tibble: 8 x 3
## Groups: ID [3]
# ID value value.mean
# <fct> <int> <dbl>
#1 a 13 12.6
#2 a 14 12.6
#3 a 12 12.6
#4 a 13 12.6
#5 a 11 12.6
#6 b 12 15.5
#7 b 19 15.5
#8 cc4 10 10.0
Here we calculate the mean of value per group, and add these values to every row. If instead you wanted to summarise values, i.e. keep only the summarised value(s) per group, you would use summarise instead of mutate.
library(tidyverse)
df %>%
group_by(ID) %>%
summarise(value.mean = mean(value))
## A tibble: 3 x 2
# ID value.mean
# <fct> <dbl>
#1 a 12.6
#2 b 15.5
#3 cc4 10.0
The same can be achieved in base R using one of tapply, ave, by. As far as I understand your problem statement there is no need for a for loop. Just apply a function (per group).
Sample data
df <- read.table(text =
"ID value
a 13
a 14
a 12
a 13
a 11
b 12
b 19
cc4 10", header = T)
Update
To conclude from the comments and chat, this should be what you're after.
# Sample data
set.seed(2017)
csvdata <- data.frame(
microsat = rep(c("A", "B", "C"), each = 8),
allele = sample(20, 3 * 8, replace = T))
csvdata %>%
group_by(microsat) %>%
summarise(D = 1 - sum(prop.table(table(allele))^2))
## A tibble: 3 x 2
# microsat D
# <fct> <dbl>
#1 A 0.844
#2 B 0.812
#3 C 0.812
Note that prop.table returns fractions and is shorter than your (as.vector(counts))/length(csvdata$value). Note also that you can reproduce your results for all values (irrespective of ID) if you omit the group_by line.
A base R option would be
df1$value.mean <- with(df1, ave(value, ID))
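The per-group D from the update can also be sketched in base R with tapply, which was mentioned above as an alternative to group_by/summarise. The helper name diversity is made up for this example, and the exact values depend on the random sample, so none are shown:

```r
set.seed(2017)
csvdata <- data.frame(
  microsat = rep(c("A", "B", "C"), each = 8),
  allele   = sample(20, 3 * 8, replace = TRUE))

# hypothetical helper: 1 - sum of squared relative frequencies
diversity <- function(x) 1 - sum(prop.table(table(x))^2)

# apply the helper separately to the alleles of each microsat
D <- tapply(csvdata$allele, csvdata$microsat, diversity)
D
```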

tidyr::gather() %>% mutate() %>% spread() returns NA's unexpectedly

My ultimate goal is to do a series of chisq.test's on this data, comparing the values of 'dealer','store' and 'transport' by 'gender'. I'm using spread and gather to create a column of 'female' and one for 'males' then planned to use group_by & map to run the chisq.test by group of 'key', which is created in my gather argument. I'm doing something wrong because I'm getting grouped NA's back.
The code below produces my dilemma.
set.seed(123)
df_ <- data_frame(gender = sample(c('male','female'),100,T),
dealer = sample(1:5,100,T),
store = sample(1:5,100,T),
transport = sample(1:5,100,T))
df_ %>%
gather(key,value,-gender) %>%
mutate(id = 1:nrow(.)) %>%
spread(gender,value)
Here is a data_frame of my desired outcome.
data_frame(key = sample(c('dealer','store','transport'),50,T),
male = sample(1:5,50,T),
female = sample(1:5,50,T))
You need to group_by(gender) before adding your id and spreading, i.e.
library(tidyverse)
df_ %>%
gather(key, value, -gender) %>%
group_by(gender) %>%
mutate(id = row_number()) %>%
spread(gender, value)
NOTE: Replacing row_number() with 1:nrow(.) will fail because of the grouping. The dot refers to the whole data frame, so 1:nrow(.) produces a sequence as long as the full data (rather than one sequence per group), and mutate cannot fit it into a single group. Hence the length error you get:
Error in mutate_impl(.data, dots) :
Column id must be length 156 (the group size) or one, not 300
If you instead use ... %>% mutate(id = 1:length(key)), it will work, because key is evaluated within each group.
The result in both cases (row_number() and 1:length(key)) is
# A tibble: 168 x 4
key id female male
* <chr> <int> <int> <int>
1 dealer 1 3 4
2 dealer 2 3 2
3 dealer 3 1 4
4 dealer 4 5 3
5 dealer 5 4 4
6 dealer 6 5 2
7 dealer 7 3 3
8 dealer 8 1 2
9 dealer 9 2 5
10 dealer 10 2 2
# ... with 158 more rows
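With current tidyr, the same reshaping could be written with pivot_longer and pivot_wider. The following is only a sketch and differs from the answer above in one respect: it numbers the rows within each gender and key, which is an assumption made here so that the id does not depend on the row order that pivot_longer produces.

```r
library(dplyr)
library(tidyr)

set.seed(123)
df_ <- tibble(gender    = sample(c("male", "female"), 100, TRUE),
              dealer    = sample(1:5, 100, TRUE),
              store     = sample(1:5, 100, TRUE),
              transport = sample(1:5, 100, TRUE))

wide <- df_ %>%
  pivot_longer(-gender, names_to = "key", values_to = "value") %>%
  group_by(gender, key) %>%
  mutate(id = row_number()) %>%   # 1, 2, 3, ... within each gender/key pair
  ungroup() %>%
  pivot_wider(names_from = gender, values_from = value)

wide
```

The genders are unlikely to be equally frequent, so the less frequent one is padded with NA at the end of each key.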
@elliot, while @Sotos has given a great answer to the challenge you were having with the tidyverse, I'm a bit confused about why you're going through all that extra effort. Your ultimate goal as stated was to run chisq.test for gender against each of the others (dealer, store & transport). Your original dataset doesn't need any modification to do that!
require(tidyverse)
set.seed(123)
yourdata <- data_frame(gender = sample(c('male','female'),100,T),
dealer = sample(1:5,100,T),
store = sample(1:5,100,T),
transport = sample(1:5,100,T))
yourdata
# A tibble: 100 x 4
gender dealer store transport
<chr> <int> <int> <int>
1 female 2 2 5
2 male 2 4 2
3 female 2 2 1
It can be used exactly as it stands! You may have other reasons to want to change the data, but it is already tidy: it represents one case or person per row.
Edited (January 16th) To achieve your stated ultimate goal you just have to:
require(dplyr)
require(broom)
allofthem <- lapply(yourdata[-1], function(y) tidy(chisq.test(x = yourdata$gender, y = y )))
allofthem <- bind_rows(allofthem, .id = "dependentv")
allofthem
You may also want to look at the lsr package, which runs chi-square tests of independence (association) and provides much more informative output. Also note that, from a statistical perspective, you are running several tests and should correct for multiple comparisons; see for example http://rpubs.com/ibecav/290361
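As a rough illustration of the multiple-testing caveat, base R's p.adjust can adjust the p-values from the three tests. Holm's method is used here purely as an example choice, not as a recommendation from the answer; p_raw and p_adj are names made up for this sketch:

```r
set.seed(123)
yourdata <- data.frame(gender    = sample(c("male", "female"), 100, TRUE),
                       dealer    = sample(1:5, 100, TRUE),
                       store     = sample(1:5, 100, TRUE),
                       transport = sample(1:5, 100, TRUE))

# one chi-square test of gender against each remaining column
p_raw <- sapply(yourdata[-1],
                function(y) chisq.test(x = yourdata$gender, y = y)$p.value)

# Holm correction: adjusted p-values are never smaller than the raw ones
p_adj <- p.adjust(p_raw, method = "holm")

cbind(p_raw, p_adj)
```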

group by in R dplyr for more than one variable on unique value of other variable

I have a dataset with three columns as below:
data <- data.frame(
grpA = c(1,1,1,1,1,2,2,2),
idB = c(1,1,2,2,3,4,5,6),
valueC = c(10,10,20,20,10,30,40,50),
otherD = c(1,2,3,4,5,6,7,8)
)
valueC is unique to each unique value of idB.
I want to use dplyr pipe (as the rest of my code is in dplyr) and use group_by on grpA to get a new column with sum of valueC values for each group.
The answer should be like:
newCol <- c(40,40,40,40,40,120,120,120)
but with data %>% group_by(grpA) %>% mutate(newCol = sum(valueC)), I get newCol <- c(70,70,70,70,70,120,120,120)
How do I include unique value of idB? Is there anything else I can use instead of group_by in dplyr %>% pipe.
I can't use summarise as I need to keep the values in otherD intact for later use.
Other option I have is to create newCol separately through sql and then merge with left join. But I am looking for a better solution inline.
If it has been answered before, please refer me to the link as I could not find any relevant answer to this issue.
We need unique with match
data %>%
group_by(grpA) %>%
mutate(ind = sum(valueC[match(unique(idB), idB)]))
# A tibble: 8 x 5
# Groups: grpA [2]
# grpA idB valueC otherD ind
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 1 40
#2 1 1 10 2 40
#3 1 2 20 3 40
#4 1 2 20 4 40
#5 1 3 10 5 40
#6 2 4 30 6 120
#7 2 5 40 7 120
#8 2 6 50 8 120
Or another option is to get the distinct rows by 'grpA', 'idB', grouped by 'grpA', get the sum of 'valueC' and left_join with the original data
data %>%
distinct(grpA, idB, .keep_all = TRUE) %>%
group_by(grpA) %>%
summarise(newCol = sum(valueC)) %>%
left_join(data, ., by = 'grpA')
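A third variant, sketched here as an alternative to match(unique(...)): !duplicated(idB) marks the first row of each idB within the group, so only one valueC per idB enters the sum. The name result is introduced for this example:

```r
library(dplyr)

data <- data.frame(
  grpA   = c(1, 1, 1, 1, 1, 2, 2, 2),
  idB    = c(1, 1, 2, 2, 3, 4, 5, 6),
  valueC = c(10, 10, 20, 20, 10, 30, 40, 50),
  otherD = c(1, 2, 3, 4, 5, 6, 7, 8)
)

result <- data %>%
  group_by(grpA) %>%
  # keep valueC only for the first occurrence of each idB, then sum per group
  mutate(newCol = sum(valueC[!duplicated(idB)])) %>%
  ungroup()

result$newCol
# 40 40 40 40 40 120 120 120
```

All rows and the otherD column stay intact, as required in the question.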
