I have a data frame with replicates from an experiment in different columns: each row is a sample, with columns a, b, and c as the replicates. I want to:
Determine the variation between the replicates (how far apart the highest and lowest values in each row are) and put this in a new column called "variation".
If the variation is greater than 10, omit the one replicate that is furthest away.
How can I accomplish this in this data frame? I want new columns:
"max" - highest value of a, b, c for each row
"min" - lowest value of a, b, c for each row
"variation" - max/min for each row
Then, I want to omit the data points in a, b, or c that are furthest away from the others so the remaining points have <10 variation.
df <- data.frame(a = rnorm(10, 100, 20),
b = rnorm(10, 2000, 500),
c = rnorm(10, 50, 20))
df$max = apply(df, 1, max, na.rm = T)
df$min = apply(df, 1, min, na.rm = T)
df$variation = df$max/df$min
(Also, how can I calculate the max and min using dplyr and %>% notation?)
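For the max/min/variation columns alone, a minimal dplyr sketch (an aside, not part of the answer below; it assumes the only replicate columns are a, b and c):
library(dplyr)
df <- df %>%
  mutate(max = pmax(a, b, c, na.rm = TRUE),
         min = pmin(a, b, c, na.rm = TRUE),
         variation = max / min)   # ratio of highest to lowest replicate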
Example using dplyr pipes, with mutate and group_by. I reshaped the data into long format using tidyr's gather and back into wide format at the end using spread.
library(dplyr)
library(tidyr)
set.seed(100)
dtf_wide <- data.frame(a = rnorm(10, 100, 20),
b = rnorm(10, 2000, 500),
c = rnorm(10, 50, 20))
Reshape the data into long format, group by id (the row number in the wide format), then compute the variation and the distance from the median value.
dtf <- dtf_wide %>%
# Explicitly add an identification column (for the grouping)
mutate(id = row_number()) %>%
# put data in tidy format, one observation per row
gather(key, value, a:c) %>%
arrange(id) %>%
group_by(id) %>%
mutate(variation = max(value, na.rm = TRUE) / min(value, na.rm = TRUE),
median = median(value),
distancefrommedian = abs(value-median),
maxdistancefrommedian = max(distancefrommedian))
head(dtf)
# # A tibble: 6 x 7
# # Groups: id [2]
# id key value variation median distancefrommedian maxdistancefrommedian
# <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 a 89.95615 49.58856 89.95615 0.00000 1954.987
# 2 1 b 2044.94307 49.58856 89.95615 1954.98692 1954.987
# 3 1 c 41.23820 49.58856 89.95615 48.71795 1954.987
# 4 2 a 102.63062 31.37407 102.63062 0.00000 1945.507
# 5 2 b 2048.13723 31.37407 102.63062 1945.50661 1945.507
# 6 2 c 65.28121 31.37407 102.63062 37.34941 1945.507
If the variation is greater than 10, remove the line where the value is furthest away from the median (you could change the rule here to remove more lines if needed).
dtf <- dtf %>%
# For each id,
# Take all lines where variation is smaller than 10
filter(variation <= 10 |
# If the variation is greater than 10,
# filter out lines where the value is furthest away from the median
(variation > 10 & distancefrommedian < maxdistancefrommedian)) %>%
# Keep only interesting variables
select(id, key, value) %>%
# Compute the variations again (just to check)
mutate(variation = max(value, na.rm = TRUE) / min(value, na.rm = TRUE))
head(dtf)
# id key value variation
# <int> <chr> <dbl> <dbl>
# 1 1 a 89.95615 2.181379
# 2 1 c 41.23820 2.181379
# 3 2 a 102.63062 1.572131
# 4 2 c 65.28121 1.572131
# 5 3 a 98.42166 1.781735
# 6 3 c 55.23923 1.781735
Reshape data to obtain a table in wide format similar to the original data frame.
dtf_wide2 <- dtf %>%
spread(key, value)
head(dtf_wide2)
# id variation a c
# <int> <dbl> <dbl> <dbl>
# 1 1 4.385692 89.95615 41.23820
# 2 2 4.385692 102.63062 65.28121
# 3 3 4.385692 98.42166 55.23923
# 4 4 4.385692 117.73570 65.46809
# 5 5 4.385692 102.33943 33.71242
# 6 6 4.385692 106.37260 41.23099
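A side note on package versions (an assumption, not part of the original answer): gather() and spread() are superseded in tidyr 1.0+, so the same reshaping could also be written with pivot_longer() and pivot_wider(). The object names below are only illustrative.
library(tidyr)
# long format, one observation per row (equivalent of gather(key, value, a:c))
dtf_long <- dtf_wide %>%
  mutate(id = row_number()) %>%
  pivot_longer(a:c, names_to = "key", values_to = "value")
# back to wide format (equivalent of spread(key, value))
dtf_back <- dtf_long %>%
  pivot_wider(names_from = key, values_from = value)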
I have a large data set for which I'm attempting to remove repeated row entries based on several columns. The column headings and a sample entry are:
count  freq    cdr3nt         cdr3aa    v         d      j        VEnd  DStart  DEnd  JStart
5036   0.0599  TGCAGTGCTAGAG  CSARDPDR  TRBV20-1  TRBD1  TRBJ1-5  15    17      43    21
There are several thousand rows; for two rows to count as repeats, all values except "count" and "freq" must be the same. I want to remove the repeated entries, but before that I need to replace the "count" of the row that is kept with the sum of the counts of its repeats, to reflect the true abundance. Then I need to recalculate "freq" from the new "count" and the sum of all counts in the entire table.
For some reason, the script is not changing anything, and I know for a fact that the table has repeated entries.
Here's my script.
library(dplyr)
# Input sample replicate table.
dta <- read.table("/data/Sample/ci1371.txt", header=TRUE, sep="\t")
# combine rows with identical data. Recalculation of frequency values.
dta %>% mutate(total = sum(count)) %>%
group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
summarize(count_new = sum(count), freq = count_new/mean(total))
dta_clean <- dta
Any help is greatly appreciated. Here's a screenshot of what the data table looks like.
Preliminary step: convert to a data.table and store the names of the columns that are not count and freq.
library(data.table)
setDT(df)
cols <- colnames(df)[3:ncol(df)]
(in your example, count and freq are in the first two positions)
To recompute count and freq:
df_agg <- df[, .(count = sum(count)), by = cols]
df_agg[, 'freq' := count/sum(count)]
If you want to keep unique rows by all columns except count and freq:
df_unique <- unique(df, by = cols)
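A quick usage sketch on made-up data (only a subset of the question's columns, with invented values, to show the two steps end to end):
library(data.table)
dta_small <- data.table(count  = c(5036, 120, 40),
                        freq   = c(0.55, 0.013, 0.004),
                        cdr3nt = c("TGCAGTGCTAGAG", "TGCAGTGCTAGAG", "TGCAAA"),
                        v      = c("TRBV20-1", "TRBV20-1", "TRBV7"))
cols <- colnames(dta_small)[3:ncol(dta_small)]
# repeated rows collapse to one, with their counts summed
agg <- dta_small[, .(count = sum(count)), by = cols]
# frequencies recomputed from the new counts
agg[, freq := count / sum(count)]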
Sample data, where grp1 and grp2 are intended to be all of your grouping variables.
set.seed(42)
dat <- data.frame(
grp1 = sample(1:2, size=20, replace=TRUE),
grp2 = sample(3:4, size=20, replace=TRUE),
count = sample(100, size=20, replace=TRUE),
freq = runif(20)
)
head(dat)
# grp1 grp2 count freq
# 1 2 4 38 0.6756073
# 2 2 3 44 0.9828172
# 3 1 4 4 0.7595443
# 4 2 4 98 0.5664884
# 5 2 3 44 0.8496897
# 6 2 4 96 0.1894739
Code:
library(dplyr)
dat %>%
group_by(grp1, grp2) %>%
summarize(count = sum(count)) %>%
ungroup() %>%
mutate(freq = count / sum(count))
# # A tibble: 4 x 4
# grp1 grp2 count freq
# <int> <int> <int> <dbl>
# 1 1 3 22 0.0206
# 2 1 4 208 0.195
# 3 2 3 383 0.358
# 4 2 4 456 0.427
I have a data frame with likert scoring across multiple aspects of a course (about 40 columns of likert scores like the two in the sample data below).
Not all rows contain valid scores. Valid scores are 1:5. Invalid scores are allocated 96:99 or are simply missing.
I would like to create an average score for each individual ID for each of the satisfaction columns that:
1) filters out invalid scores,
2) creates a mean of the valid scores for each ID, and
3) places the mean satisfaction score for each ID in a new column labelled [column.name].mean, as in Skill.satisfaction.mean below.
I have included a sample data frame below, along with the transformation I would like, applied to a single satisfaction column.
####sample score vector
possible.scores <-c(1:5, 96,97, 99,"")
####data frame
ratings <- data.frame(ID = c(rep(1:7, each =2), 8:10), Degree = c(rep("Double", times = 14), rep("Single", times = 3)),
Skill.satisfaction = sample(possible.scores, size = 17, replace = TRUE),
Social.satisfaction = sample(possible.scores, size = 17, replace = TRUE)
)
####transformation applied over one of the satisfaction scales
ratings<- ratings %>%
group_by(ID) %>%
filter(!Skill.satisfaction %in% c(96:99), Skill.satisfaction!="") %>%
mutate(Skill.satisfaction.mean = mean(as.numeric(Skill.satisfaction), na.rm = T))
library(dplyr)
ratings %>%
group_by(ID) %>%
#Change satisfaction columns from factor into numeric
mutate_at(vars(-ID,-Degree), list(~as.numeric(as.character(.)))) %>%
#Get mean for values in 1:5
mutate_at(vars(-ID,-Degree), list(mean=~mean(.[. %in% 1:5], na.rm = T)))
# A tibble: 6 x 6
# Groups: ID [3]
ID Degree Skill.satisfaction Social.satisfaction Skill.satisfaction_mean Social.satisfaction_mean
<int> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 Double 96 99 2 NaN
2 1 Double 2 97 2 NaN
3 2 Double 1 97 1 NaN
4 2 Double 97 NA 1 NaN
5 3 Double 96 96 NaN 3
6 3 Double 99 3 NaN 3
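As an aside, assuming dplyr >= 1.0 (not what the answer above was written against): the same idea can be expressed with across(), which replaces the superseded mutate_at(); the .mean suffix asked for in the question is used for the new columns.
library(dplyr)
ratings %>%
  group_by(ID) %>%
  # satisfaction columns from factor/character to numeric
  mutate(across(-Degree, ~ as.numeric(as.character(.x)))) %>%
  # per-ID mean of the valid scores (1:5), placed in new "<column>.mean" columns
  mutate(across(-Degree, ~ mean(.x[.x %in% 1:5], na.rm = TRUE),
                .names = "{.col}.mean"))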
Let's say I have a few columns in my data frame that come from a bunch of similar factors:
For example: A1_Factor1, A1_Factor2, A1_Factor3, B1_Factor1, B1_Factor2, C1_Factor1, etc.
What I want is to create additional columns using this data. So:
A1_Mean - This should be the average of columns starting with A1
B1_Mean - This should be the average of columns starting with B1
A1_Min - This should be the minimum value of columns starting with A1
B1_Min - This should be the minimum value of columns starting with B1
A1_SD - This should be the Standard Deviation of columns starting with A1
B1_SD - This should be the Standard Deviation of columns starting with B1
How can this be done in R, so that the code first extracts the columns sharing a prefix, then performs the required calculations on them, and finally creates new columns named with the same prefixes?
Thanks for your help in advance! :)
You can do this using the tidyverse package.
Input:
library(tidyverse)
set.seed(123)
df <- tibble(A1_abc = sample(1:10, 5),
A1_cde = sample(10:15, 5),
B1_abc = sample(1:10, 5),
B1_cde = sample(15:20, 5))
df
# A tibble: 5 x 4
A1_abc A1_cde B1_abc B1_cde
<int> <int> <int> <int>
1 3 10 10 20
2 8 12 5 16
3 4 13 6 15
4 7 11 9 18
5 6 15 1 19
Method:
df %>%
gather(key, value) %>%
separate(key, c("gp", "rand"), sep = "_") %>%
select(-rand) %>%
group_by(gp) %>%
mutate(id = 1:n()) %>%
spread(gp, value) %>%
summarise_at(vars(2:3), funs(Min = min(.),
Max = max(.),
Mean = mean(.),
SD = sd(.)))
Output:
# A tibble: 1 x 8
A1_Min B1_Min A1_Max B1_Max A1_Mean B1_Mean A1_SD B1_SD
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3. 1. 15. 20. 8.90 11.9 3.96 6.61
If you want to add more functions, just add them to funs() inside summarise_at().
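For example, to also get the median, the same pipeline becomes (only the funs() call changes):
df %>%
  gather(key, value) %>%
  separate(key, c("gp", "rand"), sep = "_") %>%
  select(-rand) %>%
  group_by(gp) %>%
  mutate(id = 1:n()) %>%
  spread(gp, value) %>%
  summarise_at(vars(2:3), funs(Min = min(.),
                               Max = max(.),
                               Mean = mean(.),
                               SD = sd(.),
                               Median = median(.)))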
I created a small example, and this is what I have:
df <- data.frame("A1_factor1" = rnorm(5), "A1_factor2" = rnorm(5),
"B1_factor1" = rnorm(5), "B1_factor2" = rnorm(5))
col.names <- names(df)
group <- unique(substr(col.names, 1, 2))
for (i in 1:length(group)){
  # columns belonging to the current prefix (e.g. "A1")
  group.df <- df[, substr(names(df), 1, 2) == group[i]]
  # append the row-wise mean, min, sd and max as new columns
  df[, ncol(df)+1] <- apply(group.df, 1, mean)
  df[, ncol(df)+1] <- apply(group.df, 1, min)
  df[, ncol(df)+1] <- apply(group.df, 1, sd)
  df[, ncol(df)+1] <- apply(group.df, 1, max)
  # name the four new columns after the prefix
  names(df)[(ncol(df)-3):ncol(df)] <- paste(group[i], c("Mean", "Min", "SD", "Max"), sep = "_")
}
df
I hope this helps!
I have a data frame in R with 100 rows of unique first and last names and addresses. I also have columns for weather1 and weather2. I want to make a random number of copies, between 50 and 100, of each row. How would I do that?
df$fname df$lname df$street df$town df$state df$weather1 df$weather2
Using iris and base R:
#example data
iris2 <- iris[1:100, ]
#replicate rows at random
iris2[rep(1:100, times = sample(50:100, 100, replace = TRUE)), ]
Each row of iris2 will be replicated between 50 and 100 times, at random.
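Applied to a data frame laid out like the one in the question (column names taken from the post; df_expanded is just an illustrative name), the same pattern would be:
# each of the 100 rows is repeated a random 50-100 times
df_expanded <- df[rep(seq_len(nrow(df)), times = sample(50:100, nrow(df), replace = TRUE)), ]
rownames(df_expanded) <- NULL   # optional: drop the duplicated row names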
This is probably not the easiest way to do this, but...
What I've done here is, for each row of the data set, select just that row, make 1-3 copies of it (substitute 50-100 in your case), and finally stack all the results together.
library(dplyr)
library(purrr)
df <- tibble(foo = 1:3, bar = letters[1:3])
map_dfr(seq_len(nrow(df)), ~{
df %>%
slice(.x) %>%
sample_n(size = sample(1:3, 1), replace = TRUE)
})
#> # A tibble: 7 x 2
#> foo bar
#> <int> <chr>
#> 1 1 a
#> 2 1 a
#> 3 1 a
#> 4 2 b
#> 5 2 b
#> 6 3 c
#> 7 3 c
I have a data.frame such as
df1 <- data.frame(id = c("A", "A", "B", "B", "B"),
                  cost = c(100, 10, 120, 102, 102))
I know that I can use
df1.a <- group_by(df1, id) %>%
summarise(no.c = n(),
m.costs = mean(cost))
to calculate the number of observations and the mean by id. How could I calculate the number of observations and the mean for all rows that do NOT belong to each id, so that it would, for example, give 3 as the number of observations for "not A" and 2 for "not B"?
I would like to use the dplyr package and group_by, since I have to do this for a lot of huge data frames.
You can use the . to refer to the whole data.frame, which lets you calculate the differences between the group and the whole:
df1 %>% group_by(id) %>%
summarise(n = n(),
n_other = nrow(.) - n,
mean_cost = mean(cost),
mean_other = (sum(.$cost) - sum(cost)) / n_other)
## # A tibble: 2 × 5
## id n n_other mean_cost mean_other
## <fctr> <int> <int> <dbl> <dbl>
## 1 A 2 3 55 108
## 2 B 3 2 108 55
As you can see from the results, with two groups you could just use rev, but this approach will scale to more groups or calculations easily.
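For instance, the rev() shortcut (a sketch that is only valid with exactly two groups):
df1 %>% group_by(id) %>%
  summarise(n = n(),
            mean_cost = mean(cost)) %>%
  mutate(n_other = rev(n),
         mean_other = rev(mean_cost))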
Looking for something like this? This calculates the total cost and total number of rows first, then subtracts each group's cost and row count from those totals and takes the average cost:
sumCost = sum(df1$cost)
totRows = nrow(df1)
df1 %>%
group_by(id) %>%
summarise(no.c = totRows - n(),
m.costs = (sumCost - sum(cost))/no.c)
# A tibble: 2 x 3
# id no.c m.costs
# <fctr> <int> <dbl>
#1 A 3 108
#2 B 2 55