Calculate reads per million mapped reads (RPKM) using R

df1 <- read.table(text="
gene_id A1 A2 A3 A4 length Total
ENSMUSG00000000028 58 93 48 58 789 200
ENSMUSG00000000031 11 7 20 16 364 54
ENSMUSG00000000037 3 5 6 98 196 112
ENSMUSG00000000058 66 93 69 71 436 299
ENSMUSG00000000085 55 68 97 67 177 287", header=TRUE)
The table represents the read count of each gene in different samples (A1, A2, ..., A4).
How can I calculate the reads per million mapped reads (RPKM) for these raw read counts using R?
RPKM = (number of reads in a gene * 1e6) / (Total * length)
The expected output is:
out_put <- read.table(text="
gene_id A1 A2 A3 A4
ENSMUSG00000000028 367.5539 589.3536 304.1825 367.5539
ENSMUSG00000000031 559.6256 356.1254 1017.5010 814.0008
ENSMUSG00000000037 136.6618 227.7697 273.3236 4464.2857
ENSMUSG00000000058 506.2747 713.3871 529.2872 544.6289
ENSMUSG00000000085 1082.6985 1338.6090 1909.4864 1318.9236", header=TRUE)
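
As an aside, here is a minimal base-R sketch of that formula (not part of the original question; it assumes the column names shown above):
count_cols <- c("A1", "A2", "A3", "A4")
# divide each row of the count columns by that row's Total * length
rpkm <- sweep(as.matrix(df1[count_cols]) * 1e6, 1, df1$Total * df1$length, "/")
data.frame(gene_id = df1$gene_id, rpkm)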

One way to do this without writing many lines of code or using a loop is to use melt and dcast from reshape2:
library(reshape2)
m_df1 <- melt(df1, measure.vars = c("A1", "A2", "A3", "A4"))
m_df1$RPKM <- with(m_df1, value * 1e6 / (Total * length))
output <- dcast(m_df1, gene_id ~ variable, value.var = "RPKM")
> output
gene_id A1 A2 A3 A4
1 ENSMUSG00000000028 367.5539 589.3536 304.1825 367.5539
2 ENSMUSG00000000031 559.6256 356.1254 1017.5010 814.0008
3 ENSMUSG00000000037 136.6618 227.7697 273.3236 4464.2857
4 ENSMUSG00000000058 506.2747 713.3871 529.2872 544.6289
5 ENSMUSG00000000085 1082.6985 1338.6090 1909.4864 1318.9236
A second way is to use sapply to create a matrix of estimates, which you can then either rename and add to your original data, or combine with your gene_ids:
my_cols <- c("A1", "A2", "A3", "A4")
RPKMs <- sapply(my_cols, function(x) {
  df1[, x] * 1e6 / (df1$Total * df1$length)
})
# build a data.frame rather than cbind(): cbind() would coerce gene_id (a factor) to its integer codes
output <- data.frame(gene_id = df1$gene_id, RPKMs)

You can also achieve this without reshaping, using the data.table package:
library(data.table)
setDT(df1)[, indx := .I][, lapply(.SD, function(x) (x * 1e6) / (Total * length)),
                         by = .(indx, gene_id, length, Total)]
This gives:
indx gene_id length Total A1 A2 A3 A4
1: 1 ENSMUSG00000000028 789 200 367.5539 589.3536 304.1825 367.5539
2: 2 ENSMUSG00000000031 364 54 559.6256 356.1254 1017.5010 814.0008
3: 3 ENSMUSG00000000037 196 112 136.6618 227.7697 273.3236 4464.2857
4: 4 ENSMUSG00000000058 436 299 506.2747 713.3871 529.2872 544.6289
5: 5 ENSMUSG00000000085 177 287 1082.6985 1338.6090 1909.4864 1318.9236
Explanation:
With setDT(df1) you convert the data.frame into a data.table.
With [, indx := .I] you create a unique identifier for each row.
With by = .(indx, gene_id, length, Total) you specify the columns by which to group the data (these columns are not transformed); by including indx you make sure that each row is a unique group.
With lapply(.SD, function(x) (x * 1e6) / (Total * length)) you apply the required calculation to every column that is not named in the by statement.
A similar solution with dplyr:
library(dplyr)
func <- function(x,y,z) (x * 1e6) / (y * z)
df1 %>% mutate(indx=seq(1,nrow(.))) %>%
group_by(indx,gene_id,length,Total) %>%
summarise_each(funs(func(.,Total,length)))
which gives:
indx gene_id length Total A1 A2 A3 A4
(int) (fctr) (int) (int) (dbl) (dbl) (dbl) (dbl)
1 1 ENSMUSG00000000028 789 200 367.5539 589.3536 304.1825 367.5539
2 2 ENSMUSG00000000031 364 54 559.6256 356.1254 1017.5010 814.0008
3 3 ENSMUSG00000000037 196 112 136.6618 227.7697 273.3236 4464.2857
4 4 ENSMUSG00000000058 436 299 506.2747 713.3871 529.2872 544.6289
5 5 ENSMUSG00000000085 177 287 1082.6985 1338.6090 1909.4864 1318.9236
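
Note that summarise_each() and funs() have since been deprecated. As a rough sketch of the same idea in current dplyr (1.0 or later; my own addition, not from the original answer), across() inside mutate() does the job without needing a row index at all:
library(dplyr)
df1 %>% mutate(across(A1:A4, ~ .x * 1e6 / (Total * length)))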


Summing columns in a data frame and adding those values to a new data frame in R [duplicate]

I am trying to sum the columns of a data frame and add these sums to a new output data frame. When I run the following script, I get an error stating that the replacement has two rows and the data has 3.
a <-data.frame(replicate(3,sample(1:100,10,rep=TRUE)))
colnames(a) <- c("name1", "name2","name3")
for (i in 1:ncol(a)) {
b <-as.data.frame(names(a))
c <- sum(a[i])
b$d[i] <- c[i]
}
I am looking for the output as a data frame such as:
name1 sum1
name2 sum2
name3 sum3
Your solution was already pretty close. I made some slight modifications and it works: b is now created once, before the loop, and the column sum is assigned directly to b$sum[i]:
a <- data.frame(replicate(3, sample(1:100, 10, rep = TRUE)))
colnames(a) <- c("name1", "name2", "name3")
b <- as.data.frame(names(a))   # build the output frame once, outside the loop
for (i in 1:ncol(a)) {
  b$sum[i] <- sum(a[i])        # one column total per row of b
}
Output:
names(a) sum
1 name1 470
2 name2 616
3 name3 495
I would suggest a dplyr approach:
library(dplyr)
#Data
a <-data.frame(replicate(3,sample(1:100,10,rep=TRUE)))
colnames(a) <- c("name1", "name2","name3")
#Code
a %>%
mutate(across(c(name1:name3),.fns = list(sum = ~ sum(.,na.rm=T)) ))
Output:
name1 name2 name3 name1_sum name2_sum name3_sum
1 98 31 79 599 489 506
2 8 71 4 599 489 506
3 59 23 48 599 489 506
4 65 76 64 599 489 506
5 47 53 57 599 489 506
6 80 84 55 599 489 506
7 40 19 28 599 489 506
8 39 2 47 599 489 506
9 65 36 40 599 489 506
10 98 94 84 599 489 506
If only one dataframe is desired you can use this:
a %>%
summarise(across(c(name1:name3),.fns = list(sum = ~ sum(.,na.rm=T)) ))
Output:
name1_sum name2_sum name3_sum
1 599 489 506
Initial code should be used when you want to add those variables to same dataframe.
And if you want a variable for the names and another for results you can use previous code with pivot_longer() from tidyverse to produce this:
library(tidyverse)
#Code
a %>%
summarise(across(c(name1:name3),.fns = list(sum = ~ sum(.,na.rm=T)) )) %>%
pivot_longer(cols = everything())
Output:
# A tibble: 3 x 2
name value
<chr> <int>
1 name1_sum 599
2 name2_sum 489
3 name3_sum 506
It can be vectorized with colSums in base R
as.data.frame.list(colSums(a))
Or for a two column summary
stack(colSums(a))
If we need to create new columns in 'a'
a[paste0(names(a), "_sum")] <- colSums(a)
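If you want exactly the name/sum layout shown in the question, here is a small sketch building on stack() (the column names are my own choice, not from the original answer):
res <- stack(colSums(a))                                # columns `values` and `ind`
setNames(res[c("ind", "values")], c("name", "sum"))     # reorder and rename to name/sum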

Return a vector in a column in data.table

I have a data.table in R, and I'm looking to create a column of vectors from the .SDcols columns, row by row.
library("data.table")
dt = data.table(
id=1:6,
A1=sample(100,6),
A2=sample(100,6),
A3=sample(100,6),
B1=sample(100,6),
B2=sample(100,6),
B3=sample(100,6)
)
dt[,x1:=paste(.SD,collapse = ","),.SDcols=A1:B3,by=id]
dt[,x2:=strsplit(x1,",")] # x2 vector of characters
Now x2 holds a vector of characters for each row; however, I expected x2 to hold a vector of integers:
R > dt
id A1 A2 A3 B1 B2 B3 x2
1: 1 72 23 76 10 35 14 c(72,23,76,10,35,14)
2: 2 44 28 77 29 20 63 c(44,28,77,29,20,63)
3: 3 18 34 43 77 76 100 c(18,34,43,77,76,100)
4: 4 15 33 50 87 86 86 c(15,33,50,87,86,86)
5: 5 71 71 41 75 8 3 c(71,71,41,75,8,3)
6: 6 11 89 98 42 72 27 c(11,89,98,42,72,27)
I tried several solutions; all of them failed.
dt[,x2:=.(list(.SD)),.SDcols=A1:B3,by=id] #x2 is <data.table>
dt[,x2:=.(lapply(.SD,c)),.SDcols=A1:B3,by=id]
dt[,x2:=.(c(.SD)), .SDcols=A1:B3,by=id] #RHS 1 is length 6 (greater than the size (1) of group 1). The last 5 element(s) will be discarded.
dt[,x2:=c(.SD),.SDcols=A1:B3,by=id] # x2 equals A1
dt[,x2:=lapply(.SD,c),.SDcols=A1:B3,by=id] # x2 equals A1
dt[,x2:=sapply(.SD,c),.SDcols=A1:B3,by=id] # x2 equals A1
Any suggestion?
Thanks in advance
=====================================================================
edit: thanks Jaap,
dt[, x2 := lapply(strsplit(x1, ","), as.integer)] # it works
Still, I wonder if there is a more elegant solution?
=====================================================================
edit2:
New solutions; the base functions are more useful than I thought.
dt[,ABC0:=apply(rbind(.SD), 1, list),.SDcols=A1:B3,by=id]
dt[,ABC1:=apply(cbind(.SD), 1, list),.SDcols=A1:B3,by=id]
or, more simply:
dt[,ABC2:=lapply(.SD,rbind),.SDcols=A1:B3]
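One more base-flavoured sketch (my own addition, not from the thread) that puts one integer vector per row into a list column, assuming the columns A1:B3 above:
m <- as.matrix(dt[, .SD, .SDcols = A1:B3])
dt[, x3 := split(m, row(m))]   # x3 is a list column holding each row's values as an integer vector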

R: sum rows from column A until a condition on column B is met

I'm pretty new to R and can't seem to figure out how to deal with what seems to be a relatively simple problem. I want to sum the values of the column 'DURATION' per 'TRIAL_INDEX', but only over the initial run of rows where 'X_POSITION' is increasing. I only want to sum that first pass within each trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr to generate a new data frame that can be merged with my original data frame.
However, the code doesn't work, and I'm also not sure how to make sure it only adds up the first rows per trial that have increasing values of X_POSITION.
FirstPassRT = dat %>%
group_by(TRIAL_INDEX) %>%
filter(dplyr::lag(dat$X_POSITION,1) > dat$X_POSITION) %>%
summarise(FIRST_PASS_TIME=sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
# find the rows that will be used for summing DURATION
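# (per trial: from the first row up to the row just before X_POSITION first decreases, or all rows if it never decreases)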
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = T)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
Here is something you can try with dplyr package:
library(dplyr);
dat %>% group_by(TRIAL_INDEX) %>%
mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>%
mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>%
select(-IncLogic)
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
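Note that this sums every row where X_POSITION increases relative to the previous row. That matches the desired output on this data, but if X can dip and then rise again within a trial you would want to restrict the sum to the first increasing run, for example with a sketch like this (my own addition; the column name first_pass is arbitrary):
dat %>%
  group_by(TRIAL_INDEX) %>%
  mutate(first_pass = cumsum(X_POSITION < lag(X_POSITION, default = -Inf)) == 0,  # TRUE until the first drop
         FIRST_PASS_TIME = sum(DURATION[first_pass])) %>%
  ungroup() %>%
  select(-first_pass)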
If you want to summarize it down to one row per trial you can use summarize like this:
library(dplyr)
df <- data_frame(TRIAL_INDEX = c(1,1,1,1,1,2,2,2,2),
DURATION = c(204,172,186,670, 186,134,182,806, 323),
X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))
res <- df %>%
group_by(TRIAL_INDEX) %>%
mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
filter(x.increasing == TRUE) %>%
summarize(FIRST_PASS_TIME = sum(DURATION))  # sum the durations of the increasing rows
res
#Source: local data frame [2 x 2]
#
# TRIAL_INDEX FIRST_PASS_TIME
# (dbl) (dbl)
#1 1 562
#2 2 1122

Count number of occurrences of a string in R under different conditions

I have a data frame called "data", with multiple columns, which looks like this:
Preferences Status Gender
8a 8b 9a Employed Female
10b 11c 9b Unemployed Male
11a 11c 8e Student Female
That is, each customer selected 3 preferences and specified other information such as Status and Gender. Each preference is given by a [number][letter] combination, and there are c. 30 possible preferences. The possible preferences are:
8[a - c]
9[a - k]
10[a - d]
11[a - c]
12[a - i]
I want to count the number of occurrences of each preference, under certain conditions for the other columns - e.g. for all women.
The output will ideally be a dataframe that looks like this:
Preference Female Male Employed Unemployed Student
8a 1034 934 234 495 203
8b 539 239 609 394 235
8c 124 395 684 94 283
9a 120 999 895 945 345
9b 978 385 596 923 986
etc.
What's the most efficient way to achieve this?
Thanks.
I am assuming you are starting with something that looks like this:
mydf <- structure(list(
Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
Status = c("Employed", "Unemployed", "Student"),
Gender = c("Female", "Male", "Female")),
.Names = c("Preferences", "Status", "Gender"),
class = c("data.frame"), row.names = c(NA, -3L))
mydf
# Preferences Status Gender
# 1 8a 8b 9a Employed Female
# 2 10b 11c 9b Unemployed Male
# 3 11a 11c 8e Student Female
If that's the case, you need to "split" the "Preferences" column (by spaces), transform the data into a "long" form, and then reshape it to a wide form, tabulating while you do so.
With the right tools, this is pretty straightforward.
library(devtools)
library(data.table)
library(reshape2)
source_gist(11380733) # for `cSplit`; the same function is also available in the "splitstackshape" package
dcast.data.table( # Step 3--aggregate to wide form
melt( # Step 2--convert to long form
cSplit(mydf, "Preferences", " ", "long"), # Step 1--split "Preferences"
id.vars = "Preferences"),
Preferences ~ value, fun.aggregate = length)
# Preferences Employed Female Male Student Unemployed
# 1: 10b 0 0 1 0 1
# 2: 11a 0 1 0 1 0
# 3: 11c 0 1 1 1 1
# 4: 8a 1 1 0 0 0
# 5: 8b 1 1 0 0 0
# 6: 8e 0 1 0 1 0
# 7: 9a 1 1 0 0 0
# 8: 9b 0 0 1 0 1
I also tried a dplyr + tidyr approach, which looks like the following:
library(dplyr)
library(tidyr)
mydf %>%
separate(Preferences, c("P_1", "P_2", "P_3")) %>% ## splitting things
gather(Pref, Pvals, P_1:P_3) %>% # stack the preference columns
gather(Var, Val, Status:Gender) %>% # stack the status/gender columns
group_by(Pvals, Val) %>% # group by these new columns
summarise(count = n()) %>% # aggregate the numbers of each
spread(Val, count) # spread the values out
# Source: local data table [8 x 6]
# Groups:
#
# Pvals Employed Female Male Student Unemployed
# 1 10b NA NA 1 NA 1
# 2 11a NA 1 NA 1 NA
# 3 11c NA 1 1 1 1
# 4 8a 1 1 NA NA NA
# 5 8b 1 1 NA NA NA
# 6 8e NA 1 NA 1 NA
# 7 9a 1 1 NA NA NA
# 8 9b NA NA 1 NA 1
Both approaches are actually pretty quick. Test it with some better sample data than what you shared, like this:
preferences <- c(paste0(8, letters[1:3]),
paste0(9, letters[1:11]),
paste0(10, letters[1:4]),
paste0(11, letters[1:3]),
paste0(12, letters[1:9]))
set.seed(1)
nrow <- 10000
mydf <- data.frame(
Preferences = vapply(replicate(nrow,
sample(preferences, 3, FALSE),
FALSE),
function(x) paste(x, collapse = " "),
character(1L)),
Status = sample(c("Employed", "Unemployed", "Student"), nrow, TRUE),
Gender = sample(c("Male", "Female"), nrow, TRUE)
)
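
The gather/spread idiom above has since been superseded. As a rough sketch of the same split/count/widen idea with current tidyr (1.1 or later; my own addition, not from the original answers):
library(dplyr)
library(tidyr)
mydf %>%
  separate_rows(Preferences, sep = " ") %>%                  # one row per selected preference
  pivot_longer(c(Status, Gender), values_to = "Group") %>%   # stack Status and Gender into one column
  count(Preferences, Group) %>%
  pivot_wider(names_from = Group, values_from = n, values_fill = 0)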

R: looping through data.frame columns

I have the following my_data:
geneid chr acc_no start end size strand S1 S2 A1 A2
1 gene_010010 1 AC12345.1 3662 4663 1002 - 328 336 757 874
2 gene_010020 1 AC12345.1 5750 7411 1662 - 480 589 793 765
3 gene_010030 2 AC12345.1 9003 11024 2022 - 653 673 875 920
4 gene_010040 2 AC12345.1 12006 12566 561 - 573 623 483 430
5 gene_010050 3 AC12345.1 15035 17032 1998 - 2256 2333 1866 1944
6 gene_010060 3 AC12345.1 18188 18937 750 - 526 642 650 586
I am able to calculate the sums for a given column, e.g.:
chr.sums <- data.frame(with (my_data, tapply(S1, INDEX=chr, FUN=sum)))
The problem is, I want chr.sums to have four columns (S1, S2, A1 and A2) and 30 rows corresponding to the unique chr numbers. I do not want to switch back and forth between R and Python, but looping through columns and assigning the output to specific columns of a data.frame baffles me.
EDIT
Toy data set above.
You can use ddply from plyr. Here is some code:
plyr::ddply(my_data, .(chr), summarize, S1 = sum(S1), S2 = sum(S2),
A1 = sum(A1), A2 = sum(A2))
EDIT. A more compact solution would be:
plyr::ddply(my_data, .(chr), colwise(sum, .(S1, S2, A1, A2)))
Here is how it works. The data is first split into pieces based on chr. Then, the columns S1, S2, A1, A2 are summed up for each piece. Finally, they are assembled back into a single data frame.
Any place you have this kind of a split-apply-combine problem, think plyr as a solution.
tapply won't handle multiple columns but the formula version of aggregate will.
chr.sums <- aggregate(cbind(S1, S2, A1, A2) ~ chr, data = my_data, FUN = sum)
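
plyr has since been retired; as a hedged sketch (my own addition), the same split-apply-combine can be written with dplyr (1.0 or later) or with base rowsum():
library(dplyr)
my_data %>%
  group_by(chr) %>%
  summarise(across(c(S1, S2, A1, A2), sum), .groups = "drop")
# base R equivalent
rowsum(my_data[c("S1", "S2", "A1", "A2")], my_data$chr)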
