How to calculate the difference between values in one column based on another column? - r

I'm trying to calculate the difference between abundance at the time points C1 and C0. I'd like to do this for the different genes, so I've used group_by for the genes, but can't figure out how to find the difference in abundance at the different time points.
Here is one of my attempts:
IgH_CDR3_post_challenge_unique_vv <- IgH_CDR3_post_challenge_unique_v %>%
  group_by(gene) %>%
  mutate(increase_in_abundance = abundance[Timepoint == 'C1'] - abundance[Timepoint == 'C0']) %>%
  ungroup()
My data looks something like this:
gene  Timepoint  abundance
1     C0         5
2     C1         3
1     C1         6
3     C0         2

Assuming (!) you have exactly one entry per gene and time point (unlike the table posted in the question), you can reshape the data with pivot_wider and then calculate the difference for every gene. The example data isn't very helpful here, because genes 2 and 3 each have only one time point, so most differences are NA.
library(tidyverse)
df <- data.frame(gene = c(1, 2, 1, 3),
                 Timepoint = c("c0", "c1", "c1", "c0"),
                 abundance = c(5, 3, 6, 2))
df %>%
  pivot_wider(names_from = Timepoint,
              values_from = abundance,
              id_cols = gene) %>%
  mutate(increase_in_abundance = c1 - c0)
# A tibble: 3 x 4
gene c0 c1 increase_in_abundance
<dbl> <dbl> <dbl> <dbl>
1 1 5 6 1
2 2 NA 3 NA
3 3 2 NA NA
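If you would rather stay in the long format, the grouped approach from the question can also be made to tolerate missing time points. The following is only a sketch under the same assumption of at most one row per gene and time point; it uses the upper-case labels C0 and C1 from the question, and the name result is just illustrative. match() returns NA when a time point is absent, so the subtraction yields NA instead of an error.

```r
library(dplyr)

# Example data from the question (upper-case time point labels)
df <- data.frame(gene = c(1, 2, 1, 3),
                 Timepoint = c("C0", "C1", "C1", "C0"),
                 abundance = c(5, 3, 6, 2))

result <- df %>%
  group_by(gene) %>%
  summarise(
    # match() gives the position of the time point within the group, or NA if absent;
    # indexing with NA returns NA, so incomplete genes end up with an NA difference
    increase_in_abundance = abundance[match("C1", Timepoint)] -
      abundance[match("C0", Timepoint)],
    .groups = "drop"
  )

print(result)
```

Only gene 1 has both time points here, so it is the only gene with a non-missing difference.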

Related

Calculating percentiles and showing them as stacked bars for benchmarking

This is a follow up question to Calculate proportions of multiple columns.
I have the following data:
location = rep(c("A", "B", "C", "D"),
times = c(4, 6, 3, 7))
ID = (1:20)
Var1 = rep(c(0,2,1,1,0), times = 4)
Var2 = rep(c(2,1,1,0,2), times = 4)
Var3 = rep(c(1,1,0,2,0), times = 4)
df=as.data.frame(cbind(location, ID, Var1, Var2, Var3))
And with some help I counted occurrences and proportions of the different scores (0, 1, 2) in the different Vars like this:
df %>%
pivot_longer(starts_with("Var"), values_to = "score") %>%
type_convert() %>%
group_by(location, name) %>%
count(score) %>%
mutate(frac = n / sum(n)) -> dfmut
Now I have a data frame that looks like this, which I called dfmut:
# A tibble: 36 × 5
# Groups: location, name [12]
location name score n frac
<chr> <chr> <dbl> <int> <dbl>
1 A Var1 0 2 0.4
2 A Var1 1 2 0.4
3 A Var1 2 1 0.2
4 A Var2 0 1 0.2
5 A Var2 1 2 0.4
6 A Var2 2 2 0.4
7 A Var3 0 2 0.4
8 A Var3 1 2 0.4
9 A Var3 2 1 0.2
10 B Var1 0 2 0.4
What I would like to do now is get the 10th, 25th, 75th and 90th percentiles of the scores that are not 0 and turn them into a stacked bar chart. Here is an example story: each location (A, B, etc.) is a different person's garden, where they grew different kinds of vegetables (Var1, Var2, etc.). We scored how well the vegetables turned out, with score 0 = optimal, score 1 = suboptimal, score 2 = failure.
The goal is to get a stacked bar chart that shows how high the proportion of non-optimal (score 1 and 2) vegetables is in the 10% best, the 25% best gardens, etc. Then I want to indicate to each gardener where they lie in the ranking regarding each Var.
This could look something like in the image with dark green: best 10% to dark pink-purplish: worst 10% with the dot indicating garden A.
I started by making a new data frame with the quantiles, which is probably not very elegant, so feel free to point out how I could do this more efficiently:
library(dplyr)
library(reshape2)  # provides melt()
# Fractions of the non-optimal scores (1 and 2) for Var1
dfmut %>%
  subset(name == "Var1") %>%
  subset(score == "1" | score == "2") -> Var1_12
Percentiles <- c("10", "25", "75", "90")
Var1 <- quantile(Var1_12$frac, probs = c(0.1, 0.25, 0.75, 0.9))
data <- data.frame(Percentiles, Var1)
# Same steps for Var2
dfmut %>%
  subset(name == "Var2") %>%
  subset(score == "1" | score == "2") -> Var2_12
data$Var2 <- quantile(Var2_12$frac, probs = c(0.1, 0.25, 0.75, 0.9))
# Long format for plotting
data_tidy <- melt(data, id.vars = "Percentiles")
I can't get any further than this. Probably because I'm on an entirely wrong path...
Thank you for your help!
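One possible way to avoid repeating the subset and quantile steps for every variable is to split the fractions by variable and compute all quantiles in one call. This is only a sketch: it keeps the same filter as the attempt above (rows with score 1 or 2), rebuilds the example data with data.frame() so the columns stay numeric, and the names probs, nonzero and quantile_matrix are made up for illustration. It covers the quantile step only, not the plot.

```r
library(dplyr)
library(tidyr)

# Example data from the question, built with data.frame() to keep numeric columns
df <- data.frame(
  location = rep(c("A", "B", "C", "D"), times = c(4, 6, 3, 7)),
  ID = 1:20,
  Var1 = rep(c(0, 2, 1, 1, 0), times = 4),
  Var2 = rep(c(2, 1, 1, 0, 2), times = 4),
  Var3 = rep(c(1, 1, 0, 2, 0), times = 4)
)

# Fraction of each score per location and variable
dfmut <- df %>%
  pivot_longer(starts_with("Var"), values_to = "score") %>%
  count(location, name, score) %>%
  group_by(location, name) %>%
  mutate(frac = n / sum(n)) %>%
  ungroup()

probs <- c(0.10, 0.25, 0.75, 0.90)
nonzero <- filter(dfmut, score %in% c(1, 2))

# One column of quantiles per variable; rows are named "10%", "25%", "75%", "90%"
quantile_matrix <- sapply(split(nonzero$frac, nonzero$name), quantile, probs = probs)

# Long format, comparable to the melt() result in the question
data_tidy <- as.data.frame(as.table(quantile_matrix), responseName = "value")
names(data_tidy)[1:2] <- c("Percentiles", "variable")

print(data_tidy)
```

The long data_tidy frame has one row per percentile and variable, which is the shape ggplot2 usually expects for a stacked bar chart.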

In R, trying to average one column based on selecting a certain value in another column

In R, I'm trying to average a subset of a column based on selecting a certain value (ID) in another column. Consider the example of choosing an ID among 100 IDs, perhaps the ID number being 5. Then, I want to average a subset of values in another column that corresponds to the ID number that is 5. Then, I want to do the same thing for the rest of the IDs. What should this function be?
Using dplyr:
library(dplyr)
dt <- data.frame(ID = rep(1:3, each=3), values = runif(9, 1, 100))
dt %>%
group_by(ID) %>%
summarise(avg = mean(values))
Output:
ID avg
<int> <dbl>
1 1 41.9
2 2 79.8
3 3 39.3
Data:
ID values
1 1 8.628964
2 1 99.767843
3 1 17.438596
4 2 79.700918
5 2 87.647472
6 2 72.135906
7 3 53.845573
8 3 50.205122
9 3 13.811414
We can use a grouped mean. In base R, this can be done with aggregate:
dt <- data.frame(ID = rep(1:3, each=3), values = runif(9, 1, 100))
aggregate(values ~ ID, dt, mean)
Output:
ID values
1 1 40.07086
2 2 53.59345
3 3 47.80675
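Because runif() draws new numbers on every run, the two outputs above are not directly comparable. As a reproducible sketch, the data printed in the first answer can be typed in by hand and averaged with two more base R helpers: tapply() returns one mean per ID, while ave() repeats the group mean on every row. The names avg_by_id and avg are just illustrative.

```r
# Values copied from the "Data" listing above so the result is reproducible
dt <- data.frame(
  ID = rep(1:3, each = 3),
  values = c(8.628964, 99.767843, 17.438596,
             79.700918, 87.647472, 72.135906,
             53.845573, 50.205122, 13.811414)
)

# One mean per ID (a named array, names are the ID values)
avg_by_id <- tapply(dt$values, dt$ID, mean)

# Group mean attached to every row; ave() uses mean by default
dt$avg <- ave(dt$values, dt$ID)

print(round(avg_by_id, 1))
```

The rounded means agree with the dplyr output shown for the same data (41.9, 79.8, 39.3).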

Averages of different lengths in R

I am trying to compute average scores for responses to different events. My data is in long format with one row for each event; a sample dataset is here:
Subject Event R1 R2 R3 R4 Average
1 A 1 2 2 N/A 2.5
1 B 1 1 1 1 1
So to get the average for event A, it would be (R1 + R2 + R3)/3 ignoring the N/A, whereas event B has 4 responses. I computed the average for Event A in dplyr as:
data$average <- data%>%filter(Event == "A") %>% with(data, (R1 + R2 + R3)/4)
I ran into problems when I tried to do the same for the next event...Thank you for the help!
The following doesn't include the NA value as part of the mean calculation (na.rm=TRUE). Also, I think grouping by Event is important. When run without group_by, the calculations combine all events and the resulting value is 1.285714 (=9/7 obs).
library(dplyr)
data <- data.frame(
  Subject = c(1, 1),
  Event = c('A', 'B'),
  R1 = c(1, 1),
  R2 = c(2, 1),
  R3 = c(2, 1),
  R4 = c(NA, 1)
)
# One mean per Event; na.rm = TRUE drops the missing response
df <- data %>%
  group_by(Event) %>%
  mutate(Average = mean(c(R1, R2, R3, R4), na.rm = TRUE))
Output:
Subject Event R1 R2 R3 R4 Average
<dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A 1 2 2 NA 1.67
2 1 B 1 1 1 1 1
You don't need to filter one event at a time; dplyr can process all rows in a single pipeline. You also don't need to assign to a column from outside the pipeline, as in data$average <- (something); use mutate() instead. Note that mean(c(R1, R2, R3, R4)) inside a plain mutate() pools the values of all rows, so add rowwise() to get one mean per row:
data <-
  data %>%
  rowwise() %>%
  mutate(average = mean(c(R1, R2, R3, R4), na.rm = TRUE)) %>%
  ungroup()
You can use rowMeans to calculate means for each row of a dataframe. Specify in the input which columns you want to include. To ignore the NA set na.rm=TRUE.
data$Average <- rowMeans(data[,c("R1", "R2", "R3", "R4")], na.rm=TRUE)
If you have lots of columns to average and don't want to type them all out, you can use grep to match the column names of data against a pattern. For example, to average all the columns whose names start with "R":
data$Average <- rowMeans(data[, grep("^R", names(data))], na.rm = TRUE)
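As a small worked sketch of the rowMeans approach, using the two rows from the question with the N/A entered as NA: the helper names response_cols and n_answered are made up here, and the pattern "^R[0-9]+$" assumes that response columns are named R followed by digits. Counting the non-missing responses per row shows why event A is divided by 3 and event B by 4.

```r
# Toy data matching the table in the question; NA marks the missing response
data <- data.frame(
  Subject = c(1, 1),
  Event = c("A", "B"),
  R1 = c(1, 1),
  R2 = c(2, 1),
  R3 = c(2, 1),
  R4 = c(NA, 1)
)

# Select the response columns by name (assumed naming: R followed by digits)
response_cols <- grep("^R[0-9]+$", names(data), value = TRUE)

# Number of responses actually given in each row
n_answered <- rowSums(!is.na(data[, response_cols]))

# Row-wise mean over the available responses only
data$Average <- rowMeans(data[, response_cols], na.rm = TRUE)

print(data)
```

Event A comes out as (1 + 2 + 2)/3 and event B as (1 + 1 + 1 + 1)/4, without any per-event filtering.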
To complement the previous answers: if you have many columns named R1, R2, ..., R100, then instead of writing them all into the mean function you can reshape your dataframe into a longer format with pivot_longer, group by Event, and calculate the mean. Finally, pivot_wider brings the dataframe back to its initial wide format.
library(dplyr)
library(tidyr)
df %>%
  mutate_at(vars(contains("R")), as.numeric) %>%
  pivot_longer(cols = starts_with("R"), names_to = "R", values_to = "Values") %>%
  group_by(Event) %>%
  mutate(average = mean(Values, na.rm = TRUE)) %>%
  pivot_wider(names_from = R, values_from = Values)
# A tibble: 2 x 8
# Groups: Event [2]
Subject Event Average average R1 R2 R3 R4
<int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A 2.5 1.67 1 2 2 NA
2 1 B 1 1 1 1 1 1
As mentioned by @TTS, the average for event A in your table is miscalculated: (1 + 2 + 2)/3 is 1.67, not 2.5.
Reproducible example
df <- structure(list(Subject = c(1L, 1L), Event = c("A", "B"), R1 = c(1L,
1L), R2 = 2:1, R3 = 2:1, R4 = c("N/A", "1"), Average = c(2.5,
1)), row.names = c(NA, -2L), class = "data.frame")

Apply function to create mean for filtered columns across multiple columns r

I have a data frame with likert scoring across multiple aspects of a course (about 40 columns of likert scores like the two in the sample data below).
Not all rows contain valid scores. Valid scores are 1:5. Invalid scores are allocated 96:99 or are simply missing.
I would like to create an average score for each individual ID for each of the satisfaction columns that:
1) filters for invalid scores,
2) creates a mean of the valid scores for each id .
3) places the mean satisfaction score for each id in a new column labelled [column.name].mean as in Skill.satisfaction.mean below
I have included a sample data frame and the transformation of the data frame that I would like on a single row below.
####sample score vector
possible.scores <-c(1:5, 96,97, 99,"")
####data frame
ratings <- data.frame(ID = c(rep(1:7, each =2), 8:10), Degree = c(rep("Double", times = 14), rep("Single", times = 3)),
Skill.satisfaction = sample(possible.scores, size = 17, replace = TRUE),
Social.satisfaction = sample(possible.scores, size = 17, replace = TRUE)
)
####transformation applied over one of the satisfaction scales
ratings<- ratings %>%
group_by(ID) %>%
filter(!Skill.satisfaction %in% c(96:99), Skill.satisfaction!="") %>%
mutate(Skill.satisfaction.mean = mean(as.numeric(Skill.satisfaction), na.rm = T))
library(dplyr)
ratings %>%
  group_by(ID) %>%
  # Change satisfaction columns from factor into numeric
  mutate_at(vars(-ID, -Degree), list(~as.numeric(as.character(.)))) %>%
  # Get mean for values in 1:5
  mutate_at(vars(-ID, -Degree), list(mean = ~mean(.[. %in% 1:5], na.rm = TRUE)))
# A tibble: 6 x 6
# Groups: ID [3]
ID Degree Skill.satisfaction Social.satisfaction Skill.satisfaction_mean Social.satisfaction_mean
<int> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 Double 96 99 2 NaN
2 1 Double 2 97 2 NaN
3 2 Double 1 97 1 NaN
4 2 Double 97 NA 1 NaN
5 3 Double 96 96 NaN 3
6 3 Double 99 3 NaN 3
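With dplyr 1.0 or later, the same idea can be sketched with across(), whose .names argument produces the [column.name].mean labels asked for in the question. This is an illustrative sketch, not the answer's original code: the small ratings frame below is made up (fixed values instead of sample()), and it assumes that all rating columns end in "satisfaction".

```r
library(dplyr)

# Small fixed example; 96/97 and "" stand for invalid or missing scores
ratings <- data.frame(
  ID = c(1, 1, 2, 2, 3),
  Skill.satisfaction = c("96", "2", "1", "97", ""),
  Social.satisfaction = c("4", "5", "97", "", "3"),
  stringsAsFactors = FALSE
)

result <- ratings %>%
  # Convert the character scores to numbers; "" becomes NA
  mutate(across(ends_with("satisfaction"), as.numeric)) %>%
  group_by(ID) %>%
  # Mean of the valid scores (1 to 5) per ID, written to <column>.mean
  mutate(across(ends_with("satisfaction"),
                ~ mean(.x[.x %in% 1:5]),
                .names = "{.col}.mean")) %>%
  ungroup()

print(result)
```

An ID with no valid score for a column gets NaN, as in the output above.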

Iterate through columns and row values (list) in R dplyr

This question is based on the following post with additional requirements (Iterate through columns in dplyr?).
The original code is as follows:
library(dplyr)
df <- data.frame(col1 = rep(1, 15),
                 col2 = rep(2, 15),
                 col3 = rep(3, 15),
                 group = c(rep("A", 5), rep("B", 5), rep("C", 5)))
for (col in c("col1", "col2", "col3")) {
  filt.df <- df %>%
    filter(group == "A") %>%
    # select_() is deprecated; all_of() selects columns named in a character vector
    select(all_of(c("group", col)))
  # do other things, like ggplotting
  print(filt.df)
}
My objective is to output a frequency table for each unique COL by GROUP combination. The current example specifies a dplyr filter based on a GROUP value A, B, or C. In my case, I want to iterate (loop) through a list of values in GROUP (list <- c("A", "B", "C")) and generate a frequency table for each combination.
The frequency table is based on counts. For Col1 the result would look something like the table below. The example data set is simplified. My real dataset is more complex with multiple 'values' per 'group'. I need to iterate through Col1-Col3 by group.
group value n prop
A 1 5 .1
B 2 5 .1
C 3 5 .1
A better example of the frequency table is here: How to use dplyr to generate a frequency table
I struggled with this for a couple days, and I could have done better with my example. Thanks for the posts. Here is what I ended up doing to solve this. The result is a series of frequency tables for each column and each unique value found in group. I had 3 columns (col1, col2, col3) and 3 unique values in group (A,B,C), 3x3. The result is 9 frequency tables and a frequency table for each group value that is non-sensical. I am sure there is a better way to do this. The output generates some labeling, which is useful.
# Generate frequency tables via a loop over the unique group values
# Note: freq() is not part of base R; it has to come from a loaded package
iterate_by_group <- function(x) {
  for (g in unique(x$group)) {
    filt.df <- x[x$group == g, ]
    print(lapply(filt.df, freq))
  }
}
# Run
iterate_by_group(df)
We could gather into long format and then get the frequency (n()) by group
library(tidyverse)
gather(df, value, val, col1:col3) %>%
  group_by(group, value = parse_number(value)) %>%
  summarise(n = n(), prop = n/nrow(.))
# A tibble: 9 x 4
# Groups: group [?]
# group value n prop
# <fct> <dbl> <int> <dbl>
#1 A 1 5 0.111
#2 A 2 5 0.111
#3 A 3 5 0.111
#4 B 1 5 0.111
#5 B 2 5 0.111
#6 B 3 5 0.111
#7 C 1 5 0.111
#8 C 2 5 0.111
#9 C 3 5 0.111
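A variant of the same reshaping idea with pivot_longer() and count() is sketched below. The column names column and value and the object freq_table are chosen for illustration, and prop is defined here as the share within each column (an assumption; the answer above divides by the total number of rows instead).

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(col1 = rep(1, 15),
                 col2 = rep(2, 15),
                 col3 = rep(3, 15),
                 group = rep(c("A", "B", "C"), each = 5))

freq_table <- df %>%
  # One row per original cell: which column it came from and its value
  pivot_longer(col1:col3, names_to = "column", values_to = "value") %>%
  # Count every group / column / value combination
  count(group, column, value) %>%
  # Proportion within each column (assumed definition of prop)
  group_by(column) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

print(freq_table)
```

No explicit loop over columns or groups is needed, because count() enumerates every combination that occurs in the data.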
Is this what you want?
df %>%
group_by(group) %>%
summarise_all(funs(freq = sum))
