dplyr calculations involving two columns of a data frame (R) - r

I'm pretty new to R and couldn't find a clear answer to my question after extensively searching the web. I'm trying to get dplyr functions to do the following task:
I have the following data.frame as a tibble. Columns starting with X. indicate different samples, and rows indicate how much a specific gene is expressed.
head(immgen_dat)
# A tibble: 6 x 212
ProbeSetID GeneName Description X.proB_CLP_BM. X.proB_CLP_FL. X.proB_FrA_BM. X.proB_FrA_FL. X.proB_FrBC_BM.
<int> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10344620 " Gm1056~ " predicted gene 1~ 15.6 15.3 17.2 16.1 18.1
2 10344622 " Gm1056~ " predicted gene 1~ 240. 255. 224. 312. 272.
3 10344624 " Lypla1" " lysophospholipas~ 421. 474. 349. 478. 459.
4 10344633 " Tcea1" " transcription el~ 802. 950. 864. 968. 1056.
5 10344637 " Atp6v1~ " ATPase H+ transp~ 199. 262. 167. 267. 255.
6 10344653 " Oprk1" " opioid receptor ~ 14.8 12.8 18.0 13.2 15.3
# ... with 204 more variables: X.proB_FrBC_FL. <dbl>,
I added a mean expression variable at the end for each gene by using the following code (the range of variables runs from the first sample to the last sample):
immgen_avg <- immgen_dat %>%
rowwise() %>%
mutate(Average = mean(X.proB_CLP_BM.:X.MLP_FL.))
Here, I have a quick question: The returned mean value I get from this code doesn't match the average I calculated elsewhere (in Excel). I don't think there are any missing values.
What I'd like to do is the following: for each gene, I'd like to compare the sample values with the average value and calculate a log2-fold difference (the log2 difference of gene expression in a sample compared to the average expression value across all the samples). I'd like to store this data frame under the name immgen_log2 and do some subsequent analyses. In this new data frame, I'd like to keep the gene names because I plan to merge it with another data table to compare the log2 change between different experiments.
What is the best way of doing this? I appreciate your answers.

I will explain what is happening in a short while, but one way to solve for the row means of your intended variables is:
immgen_dat %>%
mutate(Average = apply(.[, 4:8], 1, mean)) %>%
select(Average)
# Average
# 1 16.46
# 2 260.60
# 3 436.20
# 4 928.00
# 5 230.00
# 6 14.82
To see what is happening with your code, we can use the do function as follows:
df2 <- immgen_dat %>%
rowwise() %>%
do(Average = .$X.proB_CLP_BM.:.$X.proB_FrBC_BM.)
df2$Average[1]
# [[1]]
# [1] 15.6 16.6 17.6
You will see that : generates a sequence from 15.6 in steps of 1. You can see this explained in more detail by typing help(":"). So in
immgen_dat %>%
rowwise() %>%
mutate(Average = mean(X.proB_CLP_BM.:X.proB_FrBC_BM.))
you are computing the means of the values of these sequences.
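As a minimal base-R sketch of the point above: the values are copied from the first row of the excerpt, but the short column names a, b and c are made up for illustration. It contrasts the mean of the sequence built by : with the mean of the actual cell values.

```r
# Toy values from the first row of the excerpt; column names a/b/c are hypothetical.
toy <- data.frame(a = 15.6, b = 15.3, c = 18.1)

# `:` ignores column b entirely and counts up from a toward c in steps of 1.
seq_vals <- toy$a:toy$c
mean_of_seq <- mean(seq_vals)    # mean of the generated sequence, not of the row

# rowMeans() averages the actual cell values of the selected columns.
row_mean <- unname(rowMeans(toy[, c("a", "b", "c")]))

print(seq_vals)
print(c(mean_of_seq = mean_of_seq, row_mean = row_mean))
```

The two numbers differ, which would explain a mismatch with an average computed elsewhere.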
Edit
The logarithm of a ratio is the difference of the logarithms (provided both values are positive). So, to find the difference between the log2 of each numerical variable and the log2 of the Average, you can do something like:
immgen_log2 <- immgen_dat
# row means over the sample columns (columns 4:8 in the excerpt shown above)
avg <- rowMeans(immgen_dat[, 4:8])
# log2(x / avg) is the same as log2(x) - log2(avg)
immgen_log2[4:8] <- lapply(immgen_dat[4:8], function(x) log2(x / avg))

I'm not entirely sure I understand what you need to do, but whenever you use dplyr or the tidyverse in general (also ggplot2), a long representation of your data works best. I assume that you want to calculate the mean of all variables starting with X. for each ProbeSetID. Then, for each X.-column and ProbeSetID, calculate the ratio and take log2, i.e. log2(X.bla/mean):
df <- read.table(text = 'ProbeSetID X.proB_CLP_BM. X.proB_CLP_FL. X.proB_FrA_BM. X.proB_FrA_FL. X.proB_FrBC_BM.
10344620 15.6 15.3 17.2 16.1 18.1
10344622 240. 255. 224. 312. 272.
10344624 421. 474. 349. 478. 459.
10344633 802. 950. 864. 968. 1056.
10344637 199. 262. 167. 267. 255.
10344653 14.8 12.8 18.0 13.2 15.3', header = T)
library(dplyr)
library(tidyr)
result <-
df %>%
# transform to long:
gather(key = key, value = value, grep(x = names(.), pattern = "^X\\.")) %>%
# group by IDs, ie make rowwise calculations if it was still wide, but faster:
group_by(ProbeSetID) %>%
# calculate group-mean on the fly and calculate log-ratio directly:
mutate(log2_ratio = log2(value / mean(value)))
# transform back to wide, if needed:
result %>%
# remove initial values to have only 1 value variable:
select(-value) %>%
# go back to wide:
spread(key = key, value = log2_ratio)
# or, if you want to keep all values:
df %>%
# transform to long:
gather(key = key, value = value, grep(x = names(.), pattern = "^X\\.")) %>%
# group by IDs, ie make rowwise calculations if it was still wide, but faster:
group_by(ProbeSetID) %>%
# calculate the mean of each observation:
mutate(mean_value = mean(value)) %>%
# go back to wide:
spread(key, value) %>%
# now do the transformation to each variable that begins with X.:
mutate_at(.vars = vars(matches("^X\\.")),
.funs = funs(log2_ratio = log2(./mean_value)))
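For readers on a recent tidyr, the same long-then-wide idea can be sketched with pivot_longer() and pivot_wider(), which supersede gather() and spread(). This is only an illustration under assumptions: the small expr frame below reuses two rows and three sample columns from the excerpt, and the intermediate column name sample is made up. GeneName is carried through so the result can later be merged with another table.

```r
library(dplyr)
library(tidyr)

# Two rows and three sample columns taken from the excerpt; not the full data set.
expr <- data.frame(
  ProbeSetID = c(10344624L, 10344633L),
  GeneName = c("Lypla1", "Tcea1"),
  X.proB_CLP_BM. = c(421, 802),
  X.proB_CLP_FL. = c(474, 950),
  X.proB_FrA_BM. = c(349, 864)
)

immgen_log2 <- expr %>%
  # long format: one row per gene and sample
  pivot_longer(cols = starts_with("X."), names_to = "sample", values_to = "value") %>%
  # per-gene mean across samples, then the log2 ratio against it
  group_by(ProbeSetID) %>%
  mutate(log2_ratio = log2(value / mean(value))) %>%
  ungroup() %>%
  select(-value) %>%
  # back to wide: ProbeSetID and GeneName stay as identifier columns
  pivot_wider(names_from = sample, values_from = log2_ratio)

print(immgen_log2)
```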

Related

Row-wise sum of paired values in 2 identically labelled lists

I have two lists Sis1 and Sis2 with identical column names and identical first column values but different values in columns 2:n:
Time Nuc_11 Nuc_15 Nuc_16
835.20 71.00670 101.13133 89.51763
839.84 103.49544 146.28099 77.43385
844.48 142.79152 85.98028 83.36374
849.12 82.49453 90.12871 69.49881
853.76 63.62815 77.91290 104.42857
858.40 66.04051 110.48225 108.12739
What I want is to have a row-wise sum of Sis1 and Sis2 for each identically named pair of columns in the two lists (ie. sum of Sis1$Nuc_11 and Sis2$Nuc_11 for each row defined by the shared Time value). I'm missing something obvious. My best guess is:
NucID <- colnames(Sis1)
SisSum <- bind_cols(Sis1, Sis2) %>%
group_by(!!!NucID) %>%
summarise(across(everything(),list("SisSum" = sum)))
but that's not correct. I'm not sworn to dplyr for the answer, so if there's a nicer approach I'll take it.
Thanks in advance for the help :)
Since you say that they should be added "by the shared Time value", we cannot use bind_cols, which would require perfect alignment and shared-existence of all Time values. (If this is always the case, then just ignore it ... and pray that your assumption is always true.) I suggest a merge/join operation.
For discussion on merge/join operations, see How to join (merge) data frames (inner, outer, left, right) and What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?.
Further, joining on a floating-point is subject to a fundamental problem in programming (not just R), testing for perfect equality in floating-point numbers. To work around this, we need to determine a precision level at which two numbers are assumed to be effectively the same. (Note that often R can get floating-point intuitively correct, but there is no clear indication when it does not work.)
For more discussion on issues regarding floating-point equality, test (0.10 + 0.05) == 0.15 (returns FALSE), and then read Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754.
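A small base-R sketch of that comparison follows. The variable names and the tolerance tol are chosen only for illustration; all.equal() and an explicit absolute-difference check are two common ways to compare "close enough" numbers.

```r
x <- 0.10 + 0.05
y <- 0.15

exact_equal <- x == y                     # exact comparison of the stored doubles
near_equal  <- isTRUE(all.equal(x, y))    # comparison within all.equal's default tolerance

tol <- 1e-3                               # illustrative precision, as assumed in this answer
within_tol <- abs(x - y) < tol

print(c(exact_equal = exact_equal, near_equal = near_equal, within_tol = within_tol))
```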
Doing a merge/join based on common columns will be much simpler if we first pivot the data from the current wide format to a long format. (This relies on all of them being the same class, numeric in this case. If there is mixed-class, then this does not work.)
For discussion on pivoting, see Reshaping data.frame from wide to long format (wide-to-long) and the reverse, Reshape multiple value columns to wide format (long to wide).
I'll create a similarly-shaped Sis2, changing the numbers, and changing one of the Time values just a little to show what happens.
set.seed(42)
Sis2 <- Sis1 %>%
mutate(across(starts_with("Nuc"), ~ runif(n(), 50, 100)))
Sis2$Time[3] <- Sis2$Time[3] + 1e-5
Sis2
# Time Nuc_11 Nuc_15 Nuc_16
# 1 835.20 95.74030 86.82942 96.73361
# 2 839.84 96.85377 56.73333 62.77144
# 3 844.48 64.30698 82.84961 73.11464
# 4 849.12 91.52238 85.25324 97.00073
# 5 853.76 82.08728 72.88709 98.91132
# 6 858.40 75.95480 85.95561 55.87437
Despite the fact that they look similar, one of the Time values is in fact different, and will not join properly.
Sis1$Time
# [1] 835.20 839.84 844.48 849.12 853.76 858.40
Sis2$Time
# [1] 835.20 839.84 844.48 849.12 853.76 858.40
Sis1$Time == Sis2$Time
# [1] TRUE TRUE FALSE TRUE TRUE TRUE
Assuming that a precision of 0.001 is sufficient to treat two values as "effectively identical", here is code that works:
library(dplyr)
# library(tidyr) # pivot_*
full_join(
Sis1 %>%
mutate(Time_chr = sprintf("%0.03f", Time)) %>%
tidyr::pivot_longer(-c(Time, Time_chr)),
Sis2 %>%
mutate(Time_chr = sprintf("%0.03f", Time)) %>%
tidyr::pivot_longer(-c(Time, Time_chr)),
by = c("Time_chr", "name")) %>%
transmute(
Time = coalesce(Time.x, Time.y),
name,
values = rowSums(cbind(value.x, value.y), na.rm = TRUE)
) %>%
tidyr::pivot_wider(id_cols = Time, names_from = "name", values_from = "values") %>%
as.data.frame()
# Time Nuc_11 Nuc_15 Nuc_16
# 1 835.20 166.7470 187.9607 186.2512
# 2 839.84 200.3492 203.0143 140.2053
# 3 844.48 207.0985 168.8299 156.4784
# 4 849.12 174.0169 175.3819 166.4995
# 5 853.76 145.7154 150.8000 203.3399
# 6 858.40 141.9953 196.4379 164.0018
(The use of %>% as.data.frame() is purely to show all digits and rows without tibble's convenient brief-view. It is not required. Also ... I'm generally not a fan of nesting %>%-pipes inside function calls like that, but it shows what's being done well, I think; feel free to use temporary variables for pivoted storage if you like.)
Because we're doing a full-join, if there were any mismatches then we would see NA values. Such as this, without the use of converting Time to a string:
full_join(
tidyr::pivot_longer(Sis1, -Time),
tidyr::pivot_longer(Sis2, -Time),
by = c("Time", "name")
) %>%
transmute(
Time,
name,
values = rowSums(cbind(value.x, value.y), na.rm = TRUE)
) %>%
tidyr::pivot_wider(id_cols = Time, names_from = "name", values_from = "values") %>%
as.data.frame()
# Time Nuc_11 Nuc_15 Nuc_16
# 1 835.20 166.74700 187.96075 186.25124
# 2 839.84 200.34921 203.01432 140.20529
# 3 844.48 142.79152 85.98028 83.36374
# 4 849.12 174.01691 175.38195 166.49954
# 5 853.76 145.71543 150.79999 203.33989
# 6 858.40 141.99531 196.43786 164.00176
# 7 844.48 64.30698 82.84961 73.11464
Notice that 844.48 in row 3 and its similar-looking "844.48" in row 7 have not been added, due to the 1e-5 difference I added. You can get R to show them with more precision (see ?options for the 'digits' and 'scipen' arguments).
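To make the hidden difference visible, here is a short base-R sketch. The names t1 and t2 are hypothetical stand-ins for the two Time values; it prints them at the default precision, formats them with more digits, and shows that the sprintf("%0.03f", ...) key used for the join is the same for both.

```r
t1 <- 844.48
t2 <- 844.48 + 1e-5          # the perturbed value

print(c(t1, t2))             # default print precision hides the difference
print(format(c(t1, t2), digits = 10))

# Join keys as built in the answer: rounded to three decimals and stored as text.
key1 <- sprintf("%0.03f", t1)
key2 <- sprintf("%0.03f", t2)
print(c(key1, key2, same_key = key1 == key2))
```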
library(tidyverse)
list <- list(
read_table(
" Time Nuc_11 Nuc_15 Nuc_16
835.20 71.00670 101.13133 89.51763
839.84 103.49544 146.28099 77.43385
844.48 142.79152 85.98028 83.36374
849.12 82.49453 90.12871 69.49881
853.76 63.62815 77.91290 104.42857
858.40 66.04051 110.48225 108.12739"
),
read_table(
" Time Nuc_11 Nuc_15 Nuc_16
835.20 71.00670 101.13133 89.51763
839.84 103.49544 146.28099 77.43385
844.48 142.79152 85.98028 83.36374
849.12 82.49453 90.12871 69.49881
853.76 63.62815 77.91290 104.42857
858.40 66.04051 110.48225 108.12739"
)
)
do.call(rbind, list) %>%
pivot_longer(-Time) %>%
group_by(Time, name) %>%
summarise(sum = sum(value))
# A tibble: 18 × 3
# Groups: Time [6]
Time name sum
<dbl> <chr> <dbl>
1 835. Nuc_11 142.
2 835. Nuc_15 202.
3 835. Nuc_16 179.
4 840. Nuc_11 207.
5 840. Nuc_15 293.
6 840. Nuc_16 155.
7 844. Nuc_11 286.
8 844. Nuc_15 172.
9 844. Nuc_16 167.
10 849. Nuc_11 165.
11 849. Nuc_15 180.
12 849. Nuc_16 139.
13 854. Nuc_11 127.
14 854. Nuc_15 156.
15 854. Nuc_16 209.
16 858. Nuc_11 132.
17 858. Nuc_15 221.
18 858. Nuc_16 216.
@AlbertQuackstein, taking your comment on @r2evans's answer here into account, you could also solve your problem as follows:
# Solution 1
library(dplyr)
bind_rows(Sis1, Sis2) %>%
group_by(Time) %>%
summarise(across(everything(), sum))
# Solution 2: Base R approach
result = Sis1
result[-1] = Map(`+`, Sis1[-1], Sis2[names(Sis1[-1])])
If all columns also have the same order, you can replace Sis2[names(Sis1[-1])] in solution 2 with Sis2[-1].

override using .groups argument

I keep getting "summarise() has grouped output by 'new_brand'. You can override using the .groups argument." I'm not sure if I'm getting this error because I created columns pos_prop and neg_prop
superbowl %>% group_by(new_brand, superbowl) %>% summarize(mean(superbowl$volume, superbowl$pos_prop, superbowl$neg_prop), sd(superbowl$volume, superbowl$pos_prop, superbowl$neg_prop)) %>% filter(superbowl, superbowl == "0")
When I run rlang::last_error(), the code works. I'm not sure how to make the code run properly. Any help will be appreciated.
You're using summarize and such incorrectly. Try this:
superbowl %>%
# filter first: the superbowl column no longer exists after summarize()
filter(superbowl == "0") %>%
group_by(new_brand) %>%
summarize(across(c(volume, pos_prop, neg_prop),
list(mu = ~ mean(.), sigma = ~ sd(.))))
Notes on your code:
once you start a dplyr-pipe with superbowl %>%, almost never use superbowl$ in the dplyr verbs (very rare exceptions); I also removed the reference to superbowl in group_by and the stray first argument in filter, since it is not clear if you're trying to refer to the original frame symbol again ... if you have superbowl$superbowl, then grouping by it may still be appropriate; the filter on that column has to run before summarize, because summarize drops every column that is neither a grouping variable nor a computed summary;
either use across(..) as above or name the calculations, e.g., summarize(volume_mu = mean(volume), pos_mu = mean(pos_prop), ...); and
I'm inferring, but ... mean(volume, pos_prop, neg_prop) (with or without the superbowl$) is an error: in this case, the call is effectively mean(volume, trim=pos_prop, na.rm=neg_prop), which should be producing errors. One could adapt this to be mean(c(volume, pos_prop, neg_prop)) if you really want to aggregate three columns' data into a single number, but I thought that might be unintended over-aggregation.
Demonstration of this with real data:
mtcars %>%
group_by(cyl) %>%
summarize(across(c(disp, mpg),
list(mu = ~ mean(.), sigma = ~ sd(.))))
# # A tibble: 3 x 5
# cyl disp_mu disp_sigma mpg_mu mpg_sigma
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 105. 26.9 26.7 4.51
# 2 6 183. 41.6 19.7 1.45
# 3 8 353. 67.8 15.1 2.56
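The message quoted in the question is informational rather than an error: it appears when summarise() is called on data grouped by more than one variable and the .groups argument is not given. A hedged sketch with mtcars (the result names by_drop and by_keep are made up) shows two ways to state the intent explicitly.

```r
library(dplyr)

# Grouped by two variables: without .groups, summarise() reports which grouping remains.
by_drop <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(mpg_mu = mean(mpg), .groups = "drop")        # result is ungrouped

by_keep <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(mpg_mu = mean(mpg), .groups = "drop_last")   # result stays grouped by cyl

print(group_vars(by_drop))
print(group_vars(by_keep))
```

Passing .groups only silences the message and fixes the grouping of the result; the computed summaries are the same either way.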

How to group and average a dataset by order then plot a broken line plot?

I have a data frame that contains ability estimates and test scores for 5000 examinees; both are continuous variables. Since there are too many examinees, it would be messy to plot all their scores, so I wish to draw a 'broken line plot' or 'conditional mean plot' that averages the test scores of several examinees with similar ability levels at a time and plots their average score against their average ability.
I already managed to do this with the code below.
df<-cbind(rnorm(100,set.seed(123)),sample(100,set.seed(123)),) %>%
as.data.frame() %>%
setNames(c("ability","score")) #simulate the dataset
df<-df[order(df$ability),] #sort the data from low to high according to the ability variable
seq<-round(seq(from=1, to=nrow(df), length.out=10),0) #divide the data equally to nine groups (which is also gonna be the 9 points that appear in my plot)
b<-data.frame()
for (i in 1:9) {
b[i,1]<-mean(df[seq[i]:seq[i+1],1]) #calculate the mean of the ability by group
b[i,2]<-mean(df[seq[i]:seq[i+1],2]) # calculate the mean of test score by group
}.
I got the means of ability and test score using this for loop, and finally made the plot:
plot(b$V1,b$V2, type='b',
xlab="ability",
ylab="score",
main="Conditional score")
This code meets my goal, but I can't help wondering whether there's a simpler way to do this. Drawing a broken line plot by averaging data that is sorted from low to high seems to be a common task.
I wonder if there is any function or trick for this. All ideas are welcome! :)
Here is a solution to create the data to be plotted using dplyr:
set.seed(123)
df<-cbind(rnorm(100,1),sample(100,50)) %>%
as.data.frame() %>%
setNames(c("ability","score")) #simulate the dataset
df<-df[order(df$ability),] #sort the data from low to high according to the ability variable
df$id <- seq(1, nrow(df))
df %>% mutate(bin = ntile(id, 10)) %>%
group_by(bin) %>%
dplyr::summarize(meanAbility = mean(ability, na.rm=T),
meanScore = mean(score, na.rm=T)) %>%
as.data.frame()
bin meanAbility meanScore
1 1 -0.81312770 41.6
2 2 -0.09354171 52.3
3 3 0.29089892 54.4
4 4 0.68490709 45.8
5 5 0.93078744 59.8
6 6 1.17380069 34.0
7 7 1.42942368 41.3
8 8 1.64965315 40.1
9 9 1.95290596 35.6
10 10 2.50277510 52.9
I would approach the whole thing a bit differently (note also that your code has several errors and won't run the way you showed it).
The example below will lead to different numbers than yours (due to the random generation of numbers and your non-working code).
library(tidyverse)
df <- data.frame(ability = rnorm(100),
score = sample(100)) %>%
arrange(ability) %>%
mutate(seq = ntile(n = 9)) %>%
group_by(seq) %>%
summarize(mean_ability = mean(ability),
mean_score = mean(score))
which gives:
# A tibble: 9 x 3
seq mean_ability mean_score
<int> <dbl> <dbl>
1 1 -1.390807 45.25
2 2 -0.7241746 56.18182
3 3 -0.4315872 49
4 4 -0.2223723 48.81818
5 5 0.06313174 56.36364
6 6 0.3391321 42
7 7 0.6118022 53.27273
8 8 1.021438 50.54545
9 9 1.681746 53.54545
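To finish with the plot the question asks for, the binned means can be passed to base plot() with type = "b". This is a sketch under assumptions: the seed and the name binned are arbitrary, the data are simulated as in the answer, and ntile() is given the ability column explicitly.

```r
library(dplyr)

set.seed(123)  # arbitrary seed, only to make the sketch reproducible
binned <- data.frame(ability = rnorm(100), score = sample(100)) %>%
  arrange(ability) %>%
  mutate(bin = ntile(ability, 9)) %>%          # 9 roughly equal-sized ability bins
  group_by(bin) %>%
  summarize(mean_ability = mean(ability),
            mean_score = mean(score))

# Points joined by line segments: the "broken line" of conditional means.
plot(binned$mean_ability, binned$mean_score, type = "b",
     xlab = "ability", ylab = "score", main = "Conditional score")
```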

select top n values by group with n depending on other value in data frame

I'm quite new to R and coding in general. Your help would be highly appreciated :)
I'm trying to select the top n values by group, with n depending on another value (called factor in the following) from my data frame. Then, the selected values should be summarised by group to calculate the mean (d100). My goal is to get one value for d100 per group.
(Background: In forestry there is an indicator called d100, which is the mean diameter of the 100 thickest trees per hectare. If the size of the sampling area is smaller than 1 ha, you need to select correspondingly fewer trees to calculate d100. That's what the factor is for.)
First I tried to put the factor inside my data frame as its own column. Then I thought maybe it would help to have something like a "lookup-table", because R said that n must be a single number. But I don't know how to create a lookup function. (See the last part of the sample code.) Or maybe summarising df$factor before using it would do the trick?
Sample data:
(I indicated expressions where I'm not sure how to code them in R like this: 'I dont know how')
# creating sample data
library(tidyverse)
df <- data.frame(group = c(rep(1, each = 5), rep(2, each = 8), rep(3, each = 10)),
BHD = c(rnorm(23, mean = 30, sd = 5)),
factor = c(rep(pi*(15/100)^2, each = 5), rep(pi*(20/100)^2, each = 8), rep(pi*(25/100)^2, each = 10))
)
# group by ID, then select top_n values of df$BHD with n depending on value of df$factor
df %>%
group_by(group) %>%
slice_max(
BHD,
n = 100*df$factor,
with_ties = F) %>%
summarise(d100 = mean('sliced values per group'))
# other thought: having a "lookup-table" for the factor like this:
lt <- data.frame(group = c(1, 2, 3),
factor = c(pi*(15/100)^2, pi*(20/100)^2, pi*(25/100)^2))
# then
df %>%
group_by(group) %>%
slice_max(
BHD,
n = 100*lt$factor 'where lt$group == df$group',
with_ties = F) %>%
summarise(d100 = mean('sliced values per group'))
I already found this answer to a problem which seems similar to mine, but it didn't quite help.
Since all the factor values are the same within each group, you can select any one factor value.
library(dplyr)
df %>%
group_by(group) %>%
top_n(BHD, n = 100* first(factor)) %>%
ungroup
# group BHD factor
# <dbl> <dbl> <dbl>
# 1 1 25.8 0.0707
# 2 1 24.6 0.0707
# 3 1 27.6 0.0707
# 4 1 28.3 0.0707
# 5 1 29.2 0.0707
# 6 2 28.8 0.126
# 7 2 39.5 0.126
# 8 2 23.1 0.126
# 9 2 27.9 0.126
#10 2 31.7 0.126
# … with 13 more rows
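To get from the selected rows to one d100 value per group, the selection can be followed by summarise(). The sketch below is only one possible way to do it: d100_by_group is a hypothetical helper, the selection uses filter() with row_number(desc(BHD)) instead of top_n(), and the tiny trees frame is made up so the expected result is easy to check by hand.

```r
library(dplyr)

# Hypothetical helper: mean BHD of the thickest 100 * factor trees in each group.
d100_by_group <- function(data) {
  data %>%
    group_by(group) %>%
    # rank trees from thickest to thinnest and keep the first 100 * factor of them
    filter(row_number(desc(BHD)) <= 100 * first(factor)) %>%
    summarise(d100 = mean(BHD), n_trees = n())
}

# Made-up plots: factor 0.025 keeps 2 trees, factor 0.035 keeps 3 trees.
trees <- data.frame(
  group  = rep(c(1, 2), each = 4),
  BHD    = c(10, 20, 30, 40, 5, 15, 25, 35),
  factor = rep(c(0.025, 0.035), each = 4)
)

res <- d100_by_group(trees)
print(res)
```

With the sample data from the question, 100 * factor is larger than the number of trees in every group, so all rows are kept and d100 equals the plain group mean.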

Using dynamic names with dplyr

I have a data frame in which I would like to compute some extra columns as a function of the existing columns, but I want to specify both each new column name and the function dynamically. I have a vector of column names that are already in the data frame df_daily:
DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily", "Q3_Daily", "Q4_Daily", "Q5_Daily")
The rows of the data frame have responses to each question from each user each time they answer the questionnaire, as well as a column with the number of days since the user first answered the questionnaire (i.e. Days_From_First_Use = 0 on the very first use, = 1 if it is used the next day, etc.). I want to average the responses to these questions by Days_From_First_Use. I start by grouping my data frame by Days_From_First_Use:
df_test <- df_daily %>%
group_by(Days_From_First_Use)
and then try averaging the responses in a loop as follows:
for(i in 1:5){
df_test <- df_test %>%
mutate(!! paste0('Avg_Score_', DAILY_QUESTIONS[i]) :=
paste0('mean(', DAILY_QUESTIONS[i], ')'))
}
Unfortunately, while my new variable names are correct ("Avg_Score_Q1_Daily", "Avg_Score_Q2_Daily", "Avg_Score_Q3_Daily", "Avg_Score_Q4_Daily", "Avg_Score_Q5_Daily"), my answers are not: every row in my data frame has a string such as "mean(Q1_Daily)" in the relevant column.
So I'm clearly doing something wrong. What do I need to do to fix this and get the average score across all users on each day?
Sincerely and with many thanks in advance
Thomas Philips
I took a somewhat different approach, using summarize(across(...)) after group_by(Days_From_First_Use). I achieve the dynamic names by using rename_with and a custom function that replaces a leading "Q" with "Avg_Score_Q".
library(dplyr, warn.conflicts = FALSE)
# fake data -- 30 normalized "responses" from 0 to 2 days from first use to 5 questions
DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily", "Q3_Daily", "Q4_Daily", "Q5_Daily")
df_daily <- as.data.frame(do.call('cbind', lapply(1:5, function(i) rnorm(30, i))))
colnames(df_daily) <- DAILY_QUESTIONS
df_daily$Days_From_First_Use <- floor(runif(30, 0, 3))
df_test <- df_daily %>%
group_by(Days_From_First_Use) %>%
summarize(across(everything(), mean)) %>%
rename_with(.fn = function(x) gsub("^Q","Avg_Score_Q",x))
#> `summarise()` ungrouping output (override with `.groups` argument)
df_test
#> # A tibble: 3 x 6
#> Days_From_First… Avg_Score_Q1_Da… Avg_Score_Q2_Da… Avg_Score_Q3_Da…
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1.26 1.75 3.02
#> 2 1 0.966 2.14 3.48
#> 3 2 1.08 2.45 3.01
#> # … with 2 more variables: Avg_Score_Q4_Daily <dbl>, Avg_Score_Q5_Daily <dbl>
Created on 2020-12-06 by the reprex package (v0.3.0)
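For completeness, here are two hedged sketches of how the dynamic names could be built directly. The first uses the .names argument of across(). The second keeps the loop from the question but computes the mean from the column itself via .data[[q]] instead of pasting code into a string. The small df_daily frame and the result names avg_across and avg_loop are made up for illustration.

```r
library(dplyr)

DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily")

# Tiny made-up data: two days, two responses per day.
df_daily <- data.frame(
  Days_From_First_Use = c(0, 0, 1, 1),
  Q1_Daily = c(1, 3, 5, 7),
  Q2_Daily = c(2, 4, 6, 8)
)

# Variant 1: across() builds the new names from a glue-style template.
avg_across <- df_daily %>%
  group_by(Days_From_First_Use) %>%
  summarise(across(all_of(DAILY_QUESTIONS), mean, .names = "Avg_Score_{.col}"))

# Variant 2: the loop from the question, with the column looked up by name.
avg_loop <- df_daily %>% group_by(Days_From_First_Use)
for (q in DAILY_QUESTIONS) {
  new_name <- paste0("Avg_Score_", q)
  avg_loop <- avg_loop %>% mutate(!!new_name := mean(.data[[q]]))
}
avg_loop <- ungroup(avg_loop)

print(avg_across)
print(avg_loop)
```

The original loop stored text because paste0('mean(', ..., ')') returns a character string, and mutate() simply recycles that string into every row; it is never evaluated as code.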
