Aggregate a RANGE of values using R language

I have a CSV file with more than 2,000 rows and 8 columns. The schema of the CSV is as follows.
col0 col1 col2 col3......
1.77 9.1 9.2 8.8
2.34 6.3 0.9 0.44
5.34 6.3 0.9 0.44
9.34 6.3 0.9 0.44
...
(2,000 rows of data as above)
I am trying to aggregate specific sets of rows (set 1: rows 1-76, set 2: rows 96-121, ...) from the CSV above, e.g. between 1.77 and 9.34, across all columns for the corresponding rows; the aggregate of each set of rows would become one row in my output file. I have tried various methods, but I could only do it for a single set in the CSV file.
The output would be a csv file having aggregate values of the specified intervals like follows.
col0 col1 col2 col3
3.25 8.2 4.4 3.3 //(aggregate of rows 1-3)
2.2 3.3 9.9 1.2 //(aggregate of rows 6-10)
and so on..

Considering what Manetheran points out, you should, if not already done, add a column indicating which set each row belongs to.
The data.table way:
library(data.table)
set.seed(123)
dt <- data.table(col1 = rnorm(100), col2 = rnorm(100), new = rep(c(1, 2), each = 50))
dt[, lapply(.SD, mean), by = "new"]
new col1 col2
1: 1 0.03440355 -0.25390043
2: 2 0.14640827 0.03880684
You can replace mean with any other aggregation function.
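As a sketch of that suggestion applied to row ranges like those in the question (the ranges and values below are made up to mirror the question's format; rows falling outside every range are dropped before aggregating):

```r
library(data.table)

# Hypothetical row ranges in the spirit of the question (set 1: rows 1-3, set 2: rows 4-6)
ranges <- list(c(1, 3), c(4, 6))
dt <- data.table(col0 = c(1.77, 2.34, 5.34, 9.34, 7.32, 3.77),
                 col1 = c(9.1, 6.3, 6.3, 6.3, 4.5, 2.3))

# Label each row with the index of the range it falls into (NA if none)
dt[, set := NA_integer_]
for (i in seq_along(ranges)) dt[ranges[[i]][1]:ranges[[i]][2], set := i]

# Aggregate every remaining column within each set
dt[!is.na(set), lapply(.SD, mean), by = set]
```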

Here's a possible approach using only base functions:
# Arguments:
# - DF: a data.frame
# - ranges: a list of row ranges, each passed as a
#   vector c(startRowIndex, endRowIndex), used to split
#   the data.frame into sub-data.frames
# - FUN: a function that takes a sub-data.frame and
#   returns the aggregated result
aggregateRanges <- function(DF, ranges, FUN) {
  l <- lapply(ranges, function(x) {
    FUN(DF[x[1]:x[2], ])
  })
  do.call(rbind.data.frame, l)
}
# example data
data <- read.table(
  header = TRUE,
  text = "col0 col1 col2 col3
1.77 9.1 9.2 8.8
2.34 6.3 0.9 0.44
5.34 6.3 0.9 0.44
9.34 6.3 0.9 0.44
7.32 4.5 0.3 0.42
3.77 2.3 0.8 0.13
2.51 1.4 0.7 0.21
5.44 5.7 0.7 0.18
1.12 6.1 0.6 0.34")
# e.g. aggregate by summing the rows of each sub-data.frame
result <- aggregateRanges(
  data,
  ranges = list(c(1, 3), c(4, 7), c(8, 9)),
  FUN = function(dfSubset) {
    rowsum(dfSubset, group = rep.int(1, nrow(dfSubset)))
  }
)
> result
col0 col1 col2 col3
1 9.45 21.7 11.0 9.68
11 22.94 14.5 2.7 1.20
12 6.56 11.8 1.3 0.52


Storing output from R multiple loops into a list

I'm trying to carry out the following action on the columns of a dataframe (df1):
term1+term2+term3*req_no
req_no is a range of numbers: 20:24
df1:
ID term1 term2 term3
X299 1.2 2.3 0.12
X300 1.4 0.6 2.4
X301 0.3 1.6 1.2
X302 0.9 0.6 0.4
X303 0.3 1.8 0.3
X304 1.3 0.3 2.1
I need help getting this output; my attempt follows it.
Required output:
ID 20 21 22 23 24
X299 5.9 6.02 6.14 6.26 6.38
X300 50 52.4 54.8 57.2 59.6
X301 25.9 27.1 28.3 29.5 30.7
X302 9.5 9.9 10.3 10.7 11.1
X303 8.1 8.4 8.7 9 9.3
X304 43.6 45.7 47.8 49.9 52
Here's my attempt:
results <- list()
req_no <- 20:25
for (i in 1:nrow(df1)) {
  for (j in req_no) {
    res <- term1 + term2 + term3 * j
    results[j] <- res
  }
  results[[i]]
}
results2 <- do.call("rbind", results)
Help will be appreciated.
Here are a couple of different approaches, though neither is as succinct as Parfait's. Sample data:
df <- data.frame(ID = c("X299", "X300"),
                 term1 = c(1.2, 1.4),
                 term2 = c(2.3, 0.6),
                 term3 = c(0.12, 2.4))
req_no <- 20:25
Loop approach
Your initial approach is headed in the right direction, but in the future, it would help to specify exactly what your error or problem is. For an iterated and perhaps easier-to-read approach, here's one answer:
results <- matrix(data=NA, nrow=nrow(df), ncol=length(req_no)) # Empty matrix to store our results
colnames(results) <- req_no # Optional; name columns based off of req_no values
for (i in 1:nrow(df)) {
  # Do the calculation we want; returns a vector of length 6
  res <- df[i, ]$term1 + df[i, ]$term2 + (df[i, ]$term3 * req_no)
  # Save results for row i of df into row i of the results matrix
  results[i, ] <- res
}
# Now bind the columns (named 20 through 25) to the respective rows of df
output <- cbind(df, results)
output
From your initial attempt, note:
We only do one loop, since it is easy to multiply by a vector in R
There are a few ways to subset data from a data frame in R. In this case, df[i,] gets everything in the i-th row, while $termX gets the value in the column named termX
Using a results matrix instead of a list makes it very easy to copy the temporary computations (for each row) into rows of the matrix
Rather than rbind() (row bind), we want cbind() (column bind) to bind those results to new columns of the original rows.
Output:
ID term1 term2 term3 20 21 22 23 24 25
1 X299 1.2 2.3 0.12 5.9 6.02 6.14 6.26 6.38 6.5
2 X300 1.4 0.6 2.40 50.0 52.40 54.80 57.20 59.60 62.0
Dplyr/purrr functions
This could also be solved using tidy functions. In essence it's a pretty similar approach to Parfait's answer, but I've made the steps a bit more verbose to see what's going on.
library(dplyr)  # pipes and mutate
library(purrr)  # map / pmap
library(tidyr)  # unnest_wider
# Use purrr's map functions to do the computation we want
nested_df <- df %>%
  # Make new column holding term3 * req_no (stores a vector in each new cell)
  mutate(term3r = map(term3, ~ .x * req_no)) %>%
  # Make new column which sums the three columns of interest (stores a vector in each new cell)
  mutate(sum = pmap(list(term1, term2, term3r), ~ ..1 + ..2 + ..3))
# "Unnest" those vectors which store our sums, and keep only those and ID
output <- nested_df %>%
  # Creates six new columns (named ...1 to ...6) with the elements of each sum
  unnest_wider(sum) %>%
  # Keeps only the output data and IDs
  select(ID, ...1:...6)
output
Output:
# A tibble: 2 x 7
ID ...1 ...2 ...3 ...4 ...5 ...6
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 X299 5.9 6.02 6.14 6.26 6.38 6.5
2 X300 50 52.4 54.8 57.2 59.6 62
Consider directly assigning new columns with sapply using your formula:
df[paste0(req_no)] <- sapply(req_no, function(r) with(df, term1 + term2 + term3 * r))
df
# ID term1 term2 term3 20 21 22 23 24
# 1 X299 1.2 2.3 0.12 5.9 6.02 6.14 6.26 6.38
# 2 X300 1.4 0.6 2.40 50.0 52.40 54.80 57.20 59.60
# 3 X301 0.3 1.6 1.20 25.9 27.10 28.30 29.50 30.70
# 4 X302 0.9 0.6 0.40 9.5 9.90 10.30 10.70 11.10
# 5 X303 0.3 1.8 0.30 8.1 8.40 8.70 9.00 9.30
# 6 X304 1.3 0.3 2.10 43.6 45.70 47.80 49.90 52.00

calculating standard deviation in R on a dataframe

I have a dataframe in R.
index seq change change1 change2 change3 change4 change5 change6
1 1 0.12 0.34 1.2 1.7 4.5 2.5 3.4
2 2 1.12 2.54 1.1 0.56 0.87 2.6 3.2
3 3 1.86 3.23 1.6 0.23 3.4 0.75 11.2
... ... ... ... ... ... ... ... ...
The name of the dataframe is just FullData. I can access each column of FullData using code like:
FullData[2] for 'change'
FullData[3] for 'change1'
FullData[4] for 'change2'
...
Now, I wish to calculate the standard deviation of the values in the first row across the first four change columns, and so on across all the columns:
standarddeviation = sd(c(0.12, 0.34, 1.2, 1.7))
then
standarddeviation = sd(c(0.34, 1.2, 1.7, 4.5))
The above has to be done for all rows. So basically I want to calculate the sd row-wise over a sliding window of four columns, while the data is stored column-wise. Is it possible to do this?
How can I access a row of the data frame using a for loop on the index or seq variable?
How can I do this in R? Is there a better way?
I guess you're looking for something like this.
st.dev <- numeric()
for (i in 1:dim(FullData)[1]) {
  for (j in 1:dim(FullData)[2]) {
    st.dev <- cbind(st.dev, sd(FullData[i:dim(FullData)[1], j:dim(FullData)[2]]))
  }
}
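If the goal is the sliding-window version described in the question (sd over each run of four consecutive change columns, row by row), a base-R sketch along these lines may be closer; the data frame below just recreates the three example rows:

```r
# Recreate the example rows from the question
FullData <- data.frame(index = 1:3, seq = 1:3,
                       change  = c(0.12, 1.12, 1.86),
                       change1 = c(0.34, 2.54, 3.23),
                       change2 = c(1.2, 1.1, 1.6),
                       change3 = c(1.7, 0.56, 0.23),
                       change4 = c(4.5, 0.87, 3.4),
                       change5 = c(2.5, 2.6, 0.75),
                       change6 = c(3.4, 3.2, 11.2))

vals  <- as.matrix(FullData[, -(1:2)])  # drop index and seq
width <- 4                              # window of four columns

# roll_sd[i, j] is the sd of columns j..j+3 of row i
roll_sd <- sapply(seq_len(ncol(vals) - width + 1),
                  function(j) apply(vals[, j:(j + width - 1)], 1, sd))
roll_sd
```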

How to select individual rows from duplicates based on the highest median in R?

I have a dataframe containing gene expression data that looks like the following:
row.names symbol Sample1 Sample2 Sample3 Sample4
Probe1 Gene1 1.5 2.8 1.8 3.2
Probe2 Gene2 2.7 4.5 3.2 5.1
Probe3 Gene3 1.1 4.7 2.3 5.3
Probe4 Gene2 1.2 0.9 0.8 1.1
Probe5 Gene1 3.1 6.1 6.2 4.2
I want to subset the data so that only unique genes remain, and in each case the probe with the highest median will be retained i.e. the data above would become the following:
row.names symbol Sample1 Sample2 Sample3 Sample4
Probe2 Gene2 2.7 4.5 3.2 5.1
Probe3 Gene3 1.1 4.7 2.3 5.3
Probe5 Gene1 3.1 6.1 6.2 4.2
The dataframe has ~40,000 individual probes and ~100 samples.
Does anyone have any idea which commands in R are suitable for the task?
I wouldn't calculate medians row by row in a loop; rather, use the vectorized rowMedians function from the matrixStats package. Then reorder by the result and select unique entries using the data.table package:
library(data.table)
library(matrixStats)
df$Medians <- rowMedians(as.matrix(df[-(1:2)]))
unique(setDT(df)[order(-Medians)], by = "symbol")
# row.names symbol Sample1 Sample2 Sample3 Sample4 Medians
# 1: Probe5 Gene1 3.1 6.1 6.2 4.2 5.15
# 2: Probe2 Gene2 2.7 4.5 3.2 5.1 3.85
# 3: Probe3 Gene3 1.1 4.7 2.3 5.3 3.50
Some benchmarks
library(data.table)
library(matrixStats)
library(dplyr)
set.seed(123)
bigdf <- data.frame(A = paste0("Probe", 1:1e5),
                    symbol = paste0("Gene", sample(1e2, 1e5, replace = TRUE)),
                    matrix(sample(1e2, 1e7, replace = TRUE), ncol = 100))
bigdf2 <- copy(bigdf)
bigdf3 <- copy(bigdf2)
system.time({
  bigdf$Medians <- rowMedians(as.matrix(bigdf[-(1:2)]))
  unique(setDT(bigdf)[order(-Medians)], by = "symbol")
})
# user system elapsed
# 0.22 0.05 0.26
system.time(setDT(bigdf2)[,.SD[which.max(apply(.SD[,-(1:2), with = FALSE], 1, median)),], by = symbol])
# user system elapsed
# 5.17 0.01 5.33
system.time({
  bigdf3$medianCol <- apply(bigdf3[-(1:2)], 1, FUN = median)
  grouped_df <- group_by(bigdf3, symbol)
  filtered_df <- filter(grouped_df, medianCol == max(medianCol))
})
# user system elapsed
# 5.15 0.00 5.15
Or using dplyr:
library(dplyr)
df$medianCol <- apply(df[, 2:5], 1, FUN = median)
grouped_df <- group_by(df, symbol)
filtered_df <- filter(grouped_df, medianCol == max(medianCol))
filtered_df$medianCol <- NULL

Using the ddply command on a subset of data

I ran into some issues using the 'ddply' command from the 'plyr' package. I created a dataframe which looks like this one:
u v intensity season
24986 -1.97 -0.35 2.0 1
24987 -1.29 -1.53 2.0 1
24988 -0.94 -0.34 1.0 1
24989 -1.03 2.82 3.0 1
24990 1.37 3.76 4.0 1
24991 1.93 2.30 3.0 2
24992 3.83 -3.21 5.0 2
24993 0.52 -2.95 3.0 2
24994 3.06 -2.57 4.0 2
24995 2.57 -3.06 4.0 2
24996 0.34 -0.94 1.0 2
24997 0.87 4.92 5.0 3
24998 0.69 3.94 4.0 3
24999 4.60 3.86 6.0 3
I tried to use the function cumsum on the u and v values, but I don't get what I want. When I select a subset of my data, corresponding to a season, for example :
x <- cumsum(mydata$u[56297:56704]*10.8)
y <- cumsum(mydata$v[56297:56704]*10.8)
...this works perfectly. The thing is, I have a huge dataset (67,208 rows) with 92 seasons, and I'd like to make this function work on subsets of the data. So I tried this:
new <- ddply(mydata, .(mydata$seasons), summarize, x=c(0,cumsum(mydata$u*10.8)))
...and the result looks like this :
24986 1 NA
24987 1 NA
24988 1 NA
I found some questions related to this one on Stack Overflow and other websites, but none of them helped me with my problem. If someone has an idea, you're welcome ;)
Don't use your data.frame's name inside the plyr call; just reference the column name as though it were already defined:
ddply(mydata, .(season), summarise, x = c(0, cumsum(u * 10.8)))
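A minimal, self-contained illustration of that fix (the u values and season labels below are invented; the point is only that ddply resolves column names inside mydata on its own):

```r
library(plyr)

# Made-up subset in the question's shape
mydata <- data.frame(u = c(-1.97, -1.29, -0.94, 1.93, 3.83),
                     season = c(1, 1, 1, 2, 2))

# Column names u and season are looked up inside mydata; no mydata$ prefix needed
new <- ddply(mydata, .(season), summarise, x = c(0, cumsum(u * 10.8)))
new
```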

Multiple columns of data and getting average R program

I asked a question like this before, but I decided to simplify my data format because I'm very new to R and didn't understand what was going on. Here's the link to that question: How to handle more than multiple sets of data in R programming?
I have edited what my data should look like and decided to leave it in this format:
X1.0 X X2.0 X.1
0.9 0.9 0.2 1.2
1.3 1.4 0.8 1.4
As you can see, I have four columns of data. The real data I'm dealing with has up to 2000 data points. Columns "X1.0" and "X2.0" refer to time, so what I want is the average of "X" and "X.1" every 100 seconds, based on my two time columns "X1.0" and "X2.0". I can do it using this command:
cuts <- cut(data$X1.0, breaks=seq(0, max(data$X1.0)+400, 400))
 by(data$X, cuts, mean)
But this will only give me the average from one set of data, namely "X1.0" and "X". How can I do it so that I get averages from more than one data set? I also want to stop getting this kind of output:
cuts: (0,400]
[1] 0.7
------------------------------------------------------------
cuts: (400,800]
[1] 0.805
Note that the output above was done every 400 s. I really want a list of those cuts, i.e. the averages at the different intervals. Please help. I just used data=read.delim("clipboard") to get my data into the program.
It is a little bit confusing what output you want to get.
First I change the column names, but this is optional:
colnames(dat) <- c('t1', 'v1', 't2', 'v2')
Then I will use ave, which is like by but with better output. I am using a matrix trick to index the columns:
matrix(1:ncol(dat), ncol = 2) ## column 1 pairs with column 3, column 2 with column 4
[,1] [,2]
[1,] 1 3
[2,] 2 4
Then I use this matrix with apply. Here is the entire solution:
cbind(dat,
      apply(matrix(1:ncol(dat), ncol = 2), 2,
            function(x, by = 10) {      ## by 10 seconds! replace this with
                                        ## 100 or 400 for your real data
              t.col <- dat[, x][, 1]    ## time column (tXXX)
              v.col <- dat[, x][, 2]    ## value column (vXXX)
              ave(v.col, cut(t.col, breaks = seq(0, max(t.col), by)),
                  FUN = mean)
            }))
EDIT: corrected the cut and simplified the code:
cbind(dat,
      apply(matrix(1:ncol(dat), ncol = 2), 2,
            function(x, by = 10) ave(dat[, x][, 1], dat[, x][, 1] %/% by)))
X1.0 X X2.0 X.1 1 2
1 0.9 0.9 0.2 1.2 3.3000 3.991667
2 1.3 1.4 0.8 1.4 3.3000 3.991667
3 2.0 1.7 1.6 1.1 3.3000 3.991667
4 2.6 1.9 2.2 1.6 3.3000 3.991667
5 9.7 1.0 2.8 1.3 3.3000 3.991667
6 10.7 0.8 3.5 1.1 12.8375 3.991667
7 11.6 1.5 4.1 1.8 12.8375 3.991667
8 12.1 1.4 4.7 1.2 12.8375 3.991667
9 12.6 1.8 5.4 1.2 12.8375 3.991667
10 13.2 2.1 6.3 1.3 12.8375 3.991667
11 13.7 1.6 6.9 1.1 12.8375 3.991667
12 14.2 2.2 9.4 1.3 12.8375 3.991667
13 14.6 1.8 10.0 1.5 12.8375 10.000000
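If the goal is specifically the mean of each value column within fixed-width time bins, a base-R sketch using tapply, under the same paired-column layout, might look like this (the data values are made up in the question's format, and the bin width is an assumption):

```r
# Hypothetical data in the question's layout: alternating time/value column pairs
dat <- data.frame(X1.0 = c(0.9, 1.3, 2.0, 2.6, 9.7, 10.7),
                  X    = c(0.9, 1.4, 1.7, 1.9, 1.0, 0.8),
                  X2.0 = c(0.2, 0.8, 1.6, 2.2, 2.8, 3.5),
                  X.1  = c(1.2, 1.4, 1.1, 1.6, 1.3, 1.1))
by_width <- 10  # bin width in seconds; use 100 or 400 for the real data

# Pair up the columns: (1,2) and (3,4), as in the matrix trick above
pairs <- matrix(seq_len(ncol(dat)), nrow = 2)

# For each time/value pair, mean of the value column per time bin
bin_means <- lapply(seq_len(ncol(pairs)), function(k) {
  t.col <- dat[[pairs[1, k]]]
  v.col <- dat[[pairs[2, k]]]
  tapply(v.col, t.col %/% by_width, mean)
})
bin_means
```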
