I am a beginner at R programming and am trying to add one extra column to a matrix that has 50 columns. This new column would be the average of the first 10 values in that row.
randomMatrix <- generateMatrix(1,5000,100,50)
randomMatrix51 <- matrix(nrow=100, ncol=1)
for(ctr in 1:ncol(randomMatrix)){
randomMatrix51.mat[1,ctr] <- sum(randomMatrix [ctr, 1:10])/10
}
This gives the error below:
Error in randomMatrix51.mat[1, ctr] <- sum(randomMatrix[ctr, 1:10])/10 :
  incorrect number of subscripts on matrix
I tried this
cbind(randomMatrix,sum(randomMatrix [ctr, 1:10])/10)
But it only works for one row; if I use this cbind in the loop, all the old values are overwritten.
How do I add the average of the first 10 values as the new column? Is there a better way to do this than looping over the rows?
Bam!
a <- matrix(1:5000, nrow=100)
a <- cbind(a,apply(a[,1:10],1,mean))
On big datasets, however, it is faster (and arguably simpler) to use:
cbind(a, rowMeans(a[,1:10]) )
Methinks you are overthinking this.
a <- matrix(1:5000, nrow=100)
a <- transform(as.data.frame(a), first10ave = rowMeans(a[, 1:10]))
I have a big data table called "dt", and I want to produce a data table of the same dimensions which gives the deviation from the row mean of each entry in dt.
This code works, but it seems very slow to me. Is there a way to do it faster? Maybe I'm building my table wrong, so I'm not taking advantage of by-reference assignment. Or maybe this is as good as it gets?
(I'm an R novice, so any other tips are appreciated!)
Here is my code:
library(data.table)
r <- 100 # number of rows
c <- 100 # number of columns
# build a data table with random cols
# (maybe not the best way to build, but this isn't important)
dt <- data.table(rnorm(r))
for (i in c(1:(c-1))) {
dt <- cbind(dt,rnorm(r))
}
colnames(dt) <- as.character(c(1:c))
devs <- copy(dt)
means <- rowMeans(dt)
for (i in c(1:nrow(devs))) {
devs[i, colnames(devs) := abs(dt[i,] - means[[i]])]
}
If you subtract a vector from a data.frame (or data.table), that vector will be subtracted from every column of the data.frame (assuming they're all numeric). Numeric functions like abs also work on all-numeric data.frames. So, you can compute devs with
devs <- abs(dt - rowMeans(dt))
You don't need a loop to create dt either; you can use replicate, which evaluates its second argument a number of times specified by the first argument and arranges the results in a matrix (unless simplify = FALSE is given as an argument):
dt <- as.data.table(replicate(c, rnorm(r))) # c columns, each with r random values
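For illustration, here is the simplify = FALSE behaviour mentioned above:
replicate(3, rnorm(2))                    # 2-by-3 matrix: each replicate becomes a column
replicate(3, rnorm(2), simplify = FALSE)  # list of three length-2 vectors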
Not sure if it's what you are looking for, but the sweep function helps when applying operations that combine matrices and vectors (like your row means).
table <- matrix(rnorm(r*c), nrow=r, ncol=c) # generate random matrix
means <- apply(table, 1, mean) # compute row means
devs <- abs(sweep(table, 1, means, "-")) # compute by row the deviation from the row mean
I have an 18-by-48 matrix.
Is there a way to save each of the 18 rows automatically in a separate variable (e.g., from r1 to r18)?
I'd definitely advise against splitting a data.frame or matrix into its constituent rows. If I absolutely had to split the rows up, I'd put them in a list and operate from there.
If you desperately had to split it up, you could do something like this:
toy <- matrix(1:(18*48),18,48)
variables <- list()
for(i in 1:nrow(toy)){
variables[[paste0("variable", i)]] <- toy[i,]
}
list2env(variables, envir = .GlobalEnv)
I'd be inclined to stop after the for loop and avoid the list2env. But I think this should give you your result.
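If all you want is the rows in a list (the stopping point I'd recommend), split() does it in one line; a minimal sketch:
row_list <- split(toy, row(toy)) # one list element per row
row_list[["3"]]                  # the third row of toy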
I believe you can select a row r from your dataframe d by indexing without a column specified:
var <- d[r,]
Thus you can extract all of the rows into a variable by using
var <- d[1:nrow(d), ]
where var[1, ] is the first row, var[2, ] the second, etc. Not sure if this is exactly what you are looking for, though. Why would you want 18 different variables, one for each row?
Assuming your matrix is mat:
result <- data.frame(t(mat))
colnames(result) <- paste("r", 1:18, sep="")
attach(result)
I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variation (of which you are seeing many). That said, you can still use rowSums if you find it useful, like so. Note that I use sapply, which also returns a matrix.
# random custom function
myfun <- function(x){
return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
I would suggest to convert the data set into the 'long' format, group it by sample, and then calculate the result. Here is the solution using data.table:
library(data.table)
melt(setDT(x),id.vars = 'sample')[,sum(value^2),by=sample]
# sample V1
#1: 1 65
#2: 2 89
#3: 3 117
You can easily replace value^2 by any function you want.
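For example, with a placeholder function f (any vectorized function would do):
f <- function(v) sqrt(abs(v))
melt(setDT(x), id.vars = 'sample')[, sum(f(value)), by = sample]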
You can use the apply function, selecting the columns you need with c(i1, i2, ..., etc.):
apply((x[, c(2, 3)])^2, 1, sum)
If you want to apply a function named somefunction to some of the columns, whose indices or colnames are in the vector col_indices, and then sum the results, you can do:
# if somefunction can be vectorized :
x$results<-apply(x[,col_indices],1,function(x) sum(somefunction(x)))
# if not :
x$results<-apply(x[,col_indices],1,function(x) sum(sapply(x,somefunction)))
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors--each column is itself a vector. So you can use the handy-dandy lapply function to apply a function to the desired columns in the list/data frame.
I'm going to define a function as the square, as you have above, but of course this can be any function of any complexity (so long as it takes a vector as input and returns a vector of the same length; if it doesn't, it won't fit back into the original data.frame).
The steps below are extra pedantic to show each little bit, but obviously it can be compressed into one or two steps. Note that I only retain the sum of the squares of each row, given that you might want to save space in memory if you are working with lots and lots of data.
1. Create the data; define the function.
2. Grab the columns you want as a separate (temporary) data.frame.
3. Apply the function to the data.frame/list you just created.
4. lapply returns a list, so if you intend to retain it separately, make it a temporary data.frame. This is not strictly necessary.
5. Calculate the sums of the rows of the temporary data.frame and append it as a new column in x.
6. Remove the temp data.frame.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 #step 1
x[2:3] #Step 2
temp <- data.frame(lapply(x[2:3], square)) #step 3 and step 4
x$squareRowSums <- rowSums(temp) #step 5
rm(temp) #step 6
Here is another apply solution:
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))
I am interested in optimizing some code using data.table. I feel I should be able to do better than my current solution, and it does not scale well as the number of rows increases.
Consider a matrix of values, where ID denotes the person and the remaining values are traits (lineage in my case). I want to create a logical matrix that reflects whether two IDs (rows) share any values in their rows (including the ID). I have been using data.table lately, but I cannot figure out how to do this more efficiently. I have tried (and failed at) nesting apply statements, or somehow using data.table's .SD to accomplish this.
The working code is below.
m <- matrix(rep(1:10,2),nrow=5,byrow=T)
m[c(1,3),3:4] <- NA
dt <- data.table(m)
setnames(dt,c("id","v1","v2","v3"))
res <- matrix(data=NA,nrow=5,ncol=5)
dimnames(res) <- list(dt[,id],dt[,id])
for (i in 1:nrow(dt)){
for (j in i:nrow(dt)){
res[j,i] <- res[i,j] <-length(na.omit(intersect(as.numeric(dt[i]),as.numeric(dt[j])))) > 0
}
}
res
I had a similar problem a while ago and somebody helped me out. Here's that help converted to your problem...
tm<-t(m) #transpose the matrix
dtt<-data.table(tm[2:4,]) #take values of matrix into data.table
setnames(dtt,as.character(tm[1,])) #make data.table column names
comblist<-combn(names(dtt),2,FUN=list) #create list of all possible column combinations
preresults<-dtt[,lapply(comblist, function(x) length(na.omit(intersect(as.numeric(get(x[1])),as.numeric(get(x[2]))))) > 0)] #recreate your double for loop
preresults<-melt(preresults,measure.vars=names(preresults)) #change columns to rows
preresults[,c("LHS","RHS"):=lapply(1:2,function(i)sapply(comblist,"[",i))] #add column labels
preresults[,variable:=NULL] #kill unneeded column
I'm drawing a blank on how to get my preresults into the same format as your res, but this should give you the performance boost you're looking for.
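One possible way to fold preresults back into a symmetric logical matrix like res (a sketch; the diagonal is set to TRUE since every row trivially shares values with itself):
ids <- as.character(tm[1, ])
res2 <- matrix(TRUE, length(ids), length(ids), dimnames = list(ids, ids))
res2[cbind(preresults$LHS, preresults$RHS)] <- preresults$value # fill each pair
res2[cbind(preresults$RHS, preresults$LHS)] <- preresults$value # and its mirror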
I have two dataframes and I would like to do independent two-group t-tests on the rows (i.e., t.test(y1, y2) where y1 is a row in dataframe1 and y2 is the matching row in dataframe2).
What's the best way of accomplishing this?
EDIT:
I just found the format dataframe1[i, ] and dataframe2[i, ]. This will work in a loop. Is that the best solution?
The approach you outlined is reasonable; just make sure to preallocate your storage vector. I'd double-check that you really want to compare the rows instead of the columns: most datasets I work with have each row as a unit of observation, and the columns represent the separate responses/variables of interest. Regardless, it's your data - so if that's what you need to do, here's an approach:
#Fake data
df1 <- data.frame(matrix(runif(100),10))
df2 <- data.frame(matrix(runif(100),10))
#Preallocate results
testresults <- vector("list", nrow(df1))
#For loop
for (j in seq(nrow(df1))){
testresults[[j]] <- t.test(df1[j,], df2[j,])
}
You now have a list that is as long as you have rows in df1. I would then recommend using lapply and sapply to easily extract things out of the list object.
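For example, to pull the p-values and t statistics out of that list with sapply (a minimal sketch using the components a t.test() result contains):
p_values <- sapply(testresults, function(tt) tt$p.value)
t_stats  <- sapply(testresults, function(tt) tt$statistic)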
It would make more sense to have your data stored as columns.
You can transpose a data.frame by
df1_t <- as.data.frame(t(df1))
df2_t <- as.data.frame(t(df2))
Then you can use mapply to cycle through the two data.frames a column at a time
t.test_results <- mapply(t.test, x= df1_t, y = df2_t, SIMPLIFY = F)
Or you could use Map, which is a simple wrapper for mapply with SIMPLIFY = F (thus saving keystrokes!):
t.test_results <- Map(t.test, x = df1_t, y = df2_t)