Vectorizing a column-by-column comparison to separate values - r

I'm working with data gathered from multi-channel electrode systems and am trying to make this run faster than it currently does, but I can't find a good way of doing it without loops.
The gist is that I have a modified average for each column (each column is a channel) and need to compare every value in a column to the average for that column. If the value is above the adjusted mean, I need to put that value in another data frame so it can be easily read.
Here is some sample code for the problematic bit:
readout <- data.frame(Values = numeric(0))
#need to clear the dataframe in order to run it multiple times without errors
#timeFrame is just a subsection of the original data, 60 channels with upwards of a few million rows
readout <- readout[0,]
spikes <- c()
for (i in 1:ncol(timeFrame)){
  for (g in 1:nrow(timeFrame)){
    if (timeFrame[g,i] >= posCompValues[i,1])
      spikes <- append(spikes, timeFrame[g,i]) #append() returns a new vector, so the result has to be assigned
  }
}
The data ranges from 500 thousand to upwards of 130 million readings, so if anyone could point me in the right direction I'd appreciate it.

Something like this should work:
Return values of x greater than y:
cmpfun <- function(x,y) return(x[x>y])
For each element (column) of timeFrame, compare with the corresponding value of the first column of posCompValues:
vals1 <- Map(cmpfun,timeFrame,posCompValues[,1])
Collapse the list into a single vector:
spikes <- unlist(vals1)
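To make the Map() step concrete, here is a minimal sketch on made-up data. The two-channel timeFrame and the thresholds in posCompValues below are invented stand-ins for the real recordings, not the asker's data.

```r
# Toy stand-ins (assumptions): two channels, one threshold per channel
timeFrame <- data.frame(ch1 = c(1, 5, 2), ch2 = c(7, 3, 9))
posCompValues <- data.frame(thr = c(2, 6))

# Keep the values of x that exceed the threshold y
cmpfun <- function(x, y) x[x > y]

# Map() walks the columns of timeFrame and the thresholds in parallel
vals1 <- Map(cmpfun, timeFrame, posCompValues[, 1])

# Collapse the per-channel list into one plain vector
spikes <- unlist(vals1, use.names = FALSE)
print(spikes)
```

Because a data frame is a list of columns, Map() pairs each channel with its own threshold without any explicit indexing.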
If you want to save both the value and the corresponding column it may be worth unpacking this a bit into a for loop:
resList <- list()
for (i in seq(ncol(timeFrame))) {
  tt <- timeFrame[,i]
  # values in channel i that exceed that channel's threshold
  spikes <- tt[tt>posCompValues[i,1]]
  if (length(spikes)>0) {
    resList[[i]] <- data.frame(value=spikes,orig_col=i)
  }
}
res <- do.call(rbind, resList)
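If the loop is still too slow, a loop-free variant of the same idea is possible. The following is only a sketch on invented toy data (the small timeFrame and posCompValues are assumptions): sweep() compares every column with its own threshold, and which(..., arr.ind = TRUE) recovers the row and column of each hit.

```r
# Toy stand-ins (assumptions): two channels, one threshold per channel
timeFrame <- data.frame(ch1 = c(1, 5, 2), ch2 = c(7, 3, 9))
posCompValues <- data.frame(thr = c(2, 6))

m <- as.matrix(timeFrame)

# Logical matrix: TRUE where a value exceeds its column's threshold
above <- sweep(m, 2, posCompValues[, 1], FUN = ">")

# Row/column positions of every TRUE entry
hits <- which(above, arr.ind = TRUE)

# Value plus the channel (column) it came from
res <- data.frame(value = m[hits], orig_col = unname(hits[, "col"]))
print(res)
```

This works on the whole matrix at once, so it needs enough memory for a logical matrix of the same shape as the data.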

Related

How to match and store results from a long nested for loop into an empty column in a data frame in R

I'm trying to store p values from a long nested for loop in an empty column of a data frame. I've looked for examples close to my code, but my code is so long (and maybe even incorrect) that the approaches that work for other for loops don't seem to apply to mine.
The overview is that I'm trying to compare the relatedness of observed paired birds to the relatedness of all possible paired birds in a given year by finding a p value. To do this, I'm writing a for loop that selects a range of years from a huge data set. For each year, I narrow the data down to the observed pairs, add a relatedness column, and fill it with the relatedness values for those pairs from another data set. Inside this loop, a second for loop creates a data frame of all possible paired birds in that year, again with a relatedness column filled in from the other data set. From these two data frames of pairs and relatedness for each year, I want to apply the Wilcoxon test to find the p value for that year. I then want to store these p values in a separate data frame that I have created with a year column and a p value column.
Here is my (crazy looking) code:
year <- c(2000:2013)
pvalue <- c(NA)
results <- data.frame(year, pvalue)
for(j in c(2000:2013)) {
allbr_demo_noEPP_year <- subset(allbr_demo_noEPP, Year == j)
allbr_demo_noEPP_year_geno_obs <- allbr_demo_noEPP_year[allbr_demo_noEPP_year$Pairs %in% c(genome$pair1,genome$pair2),]
allbr_demo_noEPP_year_geno_obs$relatedness <- laply(allbr_demo_noEPP_year_geno_obs$Pairs, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
allbr_demo_noEPP_year_geno <- allbr_demo_noEPP_year[c(allbr_demo_noEPP_year$MB_USFWS,allbr_demo_noEPP_year$FB_USFWS) %in% genotyped$V2,]
breeder_list_males <- allbr_demo_noEPP_year_geno_obs[,8]
breeder_list_females <- allbr_demo_noEPP_year_geno_obs[,10]
unq_breeder_list_males <- unique(breeder_list_males)
unq_breeder_list_females <- unique(breeder_list_females)
all_poss_combo <-list()
for(i in unq_breeder_list_males){
print(i)
all_poss_combo[[i]]<-paste0(i, ",", unq_breeder_list_females)}
lapply(X = all_poss_combo, FUN= function(x) length(unique(x)))
all_poss_df<-unlist(all_poss_combo, use.names = F)
all_poss_df <- data.frame("combo"=all_poss_df, "M"=NA, "F"=NA)
all_poss_df$M <- substr(all_poss_df$combo, start = 1, stop = 10)
all_poss_df$F <- substr(all_poss_df$combo, start = 12, stop = 22)
all_poss_df_geno <- all_poss_df[all_poss_df$combo %in% c(genome$pair1,genome$pair2),]
all_poss_df_geno$relatedness <- laply(all_poss_df_geno$combo, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness, all_poss_df_geno$relatedness, alternative='greater')}
To be honest, I'm not even sure if this for loop will work (it seems pretty complex to me, but I am a beginner), but I was told that doing a for loop for this situation should work. I understand there are probably easier or faster ways to do what I am trying to do, which I also welcome, but I would also like to see how I could fix this for loop so it would work and how I could store the results from it into a data frame.
Thank you so much for any help given!
If you are simply looking to save the p value:
str(wilcox.test(rnorm(10), rnorm(10, 2))) # example from running ?wilcox.test
wilcox.test(rnorm(10), rnorm(10, 2))$p.value
So with your dataset, perhaps put this at the bottom of your for loop. Note that j runs over the years 2000 to 2013, so it is used to look up the matching row rather than as a position:
results$pvalue[results$year == j] <- wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness,
    all_poss_df_geno$relatedness, alternative='greater')$p.value
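As a self-contained illustration of storing one p value per year, here is a small sketch. The relatedness values are simulated with rnorm() purely as placeholders for the real observed and possible pairs, and the year range is shortened.

```r
set.seed(1)

years <- 2000:2003
results <- data.frame(year = years, pvalue = NA_real_)

for (j in years) {
  # Simulated placeholders (assumptions) for the per-year relatedness values
  observed <- rnorm(10, mean = 1)
  possible <- rnorm(30)

  p <- wilcox.test(observed, possible, alternative = "greater")$p.value

  # Write into the row whose year matches j, not into position j
  results$pvalue[results$year == j] <- p
}

print(results)
```

Matching on results$year keeps the result table the same length as the list of years, whereas indexing a vector directly by a value such as 2000 would pad it with NA up to that position.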

Restrict the range of values in a list of data frames in R

I have a list, dataframes, of 288 data frames in R, read from ASCII files, that contain values and their coordinates. The data are the average temperature of every hour by month, so the data frames have names that range from jan_01 to dec_24. What I want to do is reduce each of them to the values within a specific range (region) of coordinates. I can do this successfully for each individual frame using lower bound xl and upper bound xu. For example, the x-coordinates for 01:00 in April would be restricted by using
apr_01 <- apr_01[apr_01$x <= xu,]
apr_01 <- apr_01[apr_01$x >= xl,]
I suspect there's some way to use lapply() or a series of loops so that this operation can be done to all data frames for the whole year, but I couldn't figure out how to implement it since my method above needs the unique data frame name. I tried writing a generalizable function for use with lapply(), but I'm new to R so haven't had much luck. This is probably a trivial problem, but any help would be appreciated!
edit: The function I tried to write, which obviously won't work, was something like
restrict <- function(f){
f <- f[f$s1 <= xu]
f <- f[f$s1 >= xl]
}
for(i in 1:24){
apr_[i] <- restrict(apr_[i])
}
This turned out to be really simple. Here is the code that solved my problem:
restrict <- function(f){
  f <- f[f$x <= xu,]
  f <- f[f$x >= xl,]
  f # return the filtered data frame explicitly
}
dataframes <- lapply(dataframes, restrict)
The restrict function restricts the x-coordinates to what I'm interested in, then I applied it to each data frame in the list with lapply().
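For anyone who wants to try the pattern, here is a minimal sketch with two invented data frames standing in for the 288 real ones. Passing the bounds as arguments (lower and upper are names chosen for this example) avoids relying on global variables.

```r
# Bounds of the region of interest (made-up values)
xl <- 2
xu <- 4

# Two toy data frames (assumptions) in place of jan_01 ... dec_24
dataframes <- list(
  jan_01 = data.frame(x = 1:5, temp = c(10, 11, 12, 13, 14)),
  jan_02 = data.frame(x = 3:7, temp = c(20, 21, 22, 23, 24))
)

# Keep only the rows whose x lies within [lower, upper]
restrict <- function(f, lower, upper) {
  f[f$x >= lower & f$x <= upper, ]
}

# lapply() applies restrict to every element and keeps the list names
restricted <- lapply(dataframes, restrict, lower = xl, upper = xu)
print(restricted)
```

Keeping the data frames in a named list is what makes this work: no individual object names such as apr_01 are needed.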

Applying function to entire data-set x number of times for each observation

I am trying to apply a function to a simulated data set (the easy part), though I have to apply this function x number of times. My overall goal in this project is to simulate abundances of species over time for many populations over many years. In this example we will work with 15 species over thirty years for one community.
I have used a function and called it:
curve<-function(Ao,m,r,a,g){(Ao*((((x-m)/r)+(a/(a+g)))^a)*((1-(((x-m)/r)+(a/(a+g))))^g))/(((a/(a+g))^a)*((1-(a/(a+g)))^g))}
x<-seq(1,365, by=14) #this is the number of times that I get sampled abundances, and is included in the function
I then run a loop and create an abundance table, along with a table giving me the values of each variable.
TotSpecies<-15
Community<-30
for(n in 1:TotSpecies){
Ao<-rlnorm(TotSpecies,3,2)
m<-sample(seq(min(x)+5:max(x)-5),TotSpecies)
r<-runif(TotSpecies,min=0,max=max(x))
a<-(runif(TotSpecies,min=.1,max=4))
g<-(runif(TotSpecies,min=.1,max=4))
}
Abundance <- matrix(0,nrow=length(x),ncol=TotSpecies)
colnames(Abundance) = c("Sp1","Sp2","Sp3","Sp4","Sp5","Sp6","Sp7","Sp8","Sp9","Sp10","Sp11","Sp12","Sp13","Sp14","Sp15")
for(L in 1:TotSpecies){
Abundance[,L] <- curve(Ao[L],m[L],r[L],a[L],g[L])
}
#Alter matrix to remove NaNs and replace them with zeroes
Abundance.NA<-is.na(Abundance)
Abundance[Abundance.NA]<-0 #this makes Abundance have 0's where abundance is NaN
Pres.Abs<-Abundance
Pres.Abs[Pres.Abs>0]<-1 #presence-absence matrix
#creates a data frame with the values of each variable
Species<-1:TotSpecies
Year<-rep(1,TotSpecies)
year1data<-data.frame(Species,Year,Ao,m,r,a,g)
At this point, I only have data of abundances for one year and one community. Now I want to simulate for this community over thirty years, altering the abundances of species sequentially from year to year by adding error.
TotSpeciesData <- do.call(
rbind, #bind the table by rows
lapply( #applying the function in list form
split(year1data, year1data$Species), #splits data into groups by species
function(data)
with(
data,
data.frame(Species=Species, Year=1:Community, Ao=c(Ao, Ao + cumsum(rnorm((TotSpecies-1),0,2))),m=m, r=r, a=a, g=g) #data frame is Species, Year,
) ) )
TotSpeciesData$Ao[TotSpeciesData$Ao<0]<-0 #any values less than 0 go to 0
TotSpeciesData<-TotSpeciesData[order(TotSpeciesData$Year),] #orders the data frame by Year
This is now a data frame with each given variable for each species for each year. Now I do not know how to apply the function to this table and create an abundance table that has all fifteen species for the thirty years.
I started off thinking that a nested loop would be better than trying to use the apply function, because the apply function could not handle me trying to run the function x number of times (or am I wrong about this?).
TotSpeciesAbundance<-matrix(0,nrow=Community*length(x),ncol=TotSpecies)
colnames(TotSpeciesAbundance) = c("Sp1","Sp2","Sp3","Sp4","Sp5","Sp6","Sp7","Sp8","Sp9","Sp10","Sp11","Sp12","Sp13","Sp14","Sp15")
Year<-rep(1:Community, each=length(x))
TotSpeciesAbundance<-cbind(TotSpeciesAbundance,Year)
for(p in 1:450){
for(j in 1:TotSpecies){
Ao<-TotSpeciesData$Ao
m<-TotSpeciesData$m
r<-TotSpeciesData$r
a<-TotSpeciesData$a
g<-TotSpeciesData$g
TotSpeciesAbundance[,j]<- curve(Ao[j],m[j],r[j],a[j],g[j])
}
}
I have tried a number of different ways to alter the double loop, but cannot find a way to get it to work. This may be a bit amateur, but can anyone help with this?
I still don't really understand what you want, but I can make a guess. Your second for loop contains the same problem as your first one: you overwrite the same data.frame 450 times. I think what you intended is something like this:
# Bug fix: make TotSpeciesAbundance a data.frame so with works
TotSpeciesAbundance<-data.frame(TotSpeciesAbundance)
# Make a storage list beforehand.
big.list<-list()
for(p in 1:450){
TotSpeciesAbundance <- matrix(nrow=length(x),ncol=TotSpecies) # one row per sampling date in x, one column per species
for(j in 1:TotSpecies){
TotSpeciesAbundance[,j]<- with(TotSpeciesData,curve(Ao[j],m[j],r[j],a[j],g[j]))
}
big.list[[p]] <- TotSpeciesAbundance
}
But, conveniently, you can replace the inner for. The parameters live in TotSpeciesData, and the inner loop only ever uses its first TotSpecies rows:
for(p in 1:450) {
  big.list[[p]] <- with(TotSpeciesData[1:TotSpecies,],mapply(curve,Ao,m,r,a,g))
}
Which makes it quite clear that you were not only rerunning the thing 450 times, but you were doing it with exactly the same thing. You could replace this with:
replicate(450,with(TotSpeciesData[1:TotSpecies,],mapply(curve,Ao,m,r,a,g)),simplify=FALSE)
I am guessing you want to add a bit of noise or something each time, but I can't figure out exactly what you want. It would help if you clearly explained what you mean by an abundance matrix and gave a small example of what the output data would look like.
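To show the mapply() and replicate() pattern in isolation, here is a small sketch. It does not use the asker's curve function: bump() is a hypothetical, simpler seasonal curve, and params, n_years and the noise level are invented for the example. The noise step is only a guess at what "adding error" might mean.

```r
set.seed(42)

x <- seq(1, 365, by = 14)  # sampling dates, as in the question

# Hypothetical stand-in for the abundance curve (assumption)
bump <- function(peak, centre) peak * exp(-((x - centre) / 30)^2)

# One row of parameters per species (made-up values)
params <- data.frame(peak = c(10, 20, 5), centre = c(100, 180, 250))

# mapply() calls bump once per species; the result has one column per species
one_year <- with(params, mapply(bump, peak, centre))

# replicate() repeats the whole calculation, here with fresh noise each year
n_years <- 4
all_years <- replicate(n_years, {
  noisy_peak <- pmax(params$peak + rnorm(nrow(params), 0, 2), 0)
  mapply(bump, noisy_peak, params$centre)
}, simplify = FALSE)

print(dim(one_year))
print(length(all_years))
```

Each element of all_years is a matrix with one row per sampling date and one column per species, so the list plays the role of big.list above.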

Counting algorithm for big data in R

I have a big data frame with almost 1M rows (transactions) and 2600 columns (items). The values in the data set are 1's and NA's, and every column is stored as a factor. I want to add a new column to the end of the data frame that shows the number of 1's in each row.
Here is the R code that I wrote:
for(i in 1:nrow(dataset)){
  counter<-0
  for(j in 1:ncol(dataset)){
    if(!is.na(dataset[i,j])){
      counter<- counter+1
    }
  }
  dataset[i,ncol(dataset)+1]<-counter
}
But it has been running in RStudio for a very long time, because the running time is O(n^2). Is there another way to do this, or a way to improve this algorithm? (The machine has 80 GB of memory.)
Using a matrix (of numbers, not factors), as @joran suggested, would be better for this, and simply do:
rowSums(your_matrix, na.rm = TRUE)
While eddi's answer is the best in your case, a more general solution is to vectorize the code (that is, operate on all rows at once):
counter <- rep(0, nrow(dataset))
for(j in 1:ncol(dataset)) {
counter <- counter + !is.na(dataset[[j]])
}
dataset$no_of_1s <- counter
One note: in your code, the line
dataset[i,ncol(dataset)+1]<-counter
creates a new column for each row (because each step adds one more column), so the final data.frame would have 1M rows and about 1M columns (and it won't fit in your memory).
Another option is to use Reduce
dataset$no_of_1s <- Reduce(function(a,b) a+!is.na(b), dataset, init=integer(nrow(dataset)))
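The approaches above can be checked against each other on a tiny example. This is only a sketch: the three-column data frame of factors below is invented to mimic the described data (1's and NA's stored as factors).

```r
# Toy data (assumption): factor columns holding "1" or NA
dataset <- data.frame(
  a = factor(c("1", NA, "1")),
  b = factor(c(NA, NA, "1")),
  c = factor(c("1", NA, NA))
)

# Matrix route: test for non-missing cells, then sum across each row
n_by_matrix <- rowSums(!is.na(as.matrix(dataset)))

# Column-at-a-time route: add one logical vector per column
n_by_reduce <- Reduce(function(acc, col) acc + (!is.na(col)),
                      dataset, integer(nrow(dataset)))

print(unname(n_by_matrix))
print(n_by_reduce)
```

Counting non-missing cells with is.na() sidesteps the factor-to-number conversion, which is valid here only because every non-missing value is a 1.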

Return value from column indicated in same row

I'm stuck with a simple loop that takes more than an hour to run, and need help to speed it up.
Basically, I have a matrix with 31 columns and 400 000 rows. The first 30 columns have values, and the 31st column has a column-number. I need to, per row, retrieve the value in the column indicated by the 31st column.
Example row: [26,354,72,5987..,461,3] (this means that the value in column 3 is sought after (72))
The too slow loop looks like this:
a <- rep(0,nrow(data)) #To pre-allocate memory
for (i in 1:nrow(data)) {
a[i] <- data[i,data[i,31]]
}
I would think this would work:
a <- data[,data[,31]]
... but it results in "Error: cannot allocate vector of size 2.8 Mb".
I fear that this is a really simple question, so I've spent hours trying to understand apply, lapply, reshape, and more, but somehow I can't get a grip on the vectorization concept in R.
The matrix actually has even more columns that also go into the a-parameter, which is why I don't want to rebuild the matrix, or split it.
Your support is highly appreciated!
Chris
t(data[,1:30])[30*(0:399999)+data[,31]]
This works because you can reference matrices both in array format and in vector format (a 400000*30 long vector in this case, since only the 30 value columns are transposed), counting column-wise first. To count row-wise, you use the transpose.
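The transpose trick is easier to follow on a small case. Below is a minimal sketch with an invented 3-row matrix whose last column holds the column number to pick, checked against the original row-by-row lookup.

```r
# Toy matrix (assumption): 3 value columns plus a column-number column
data <- cbind(
  matrix(c(26, 354, 72,
           11,  12, 13,
           41,  42, 43), nrow = 3, byrow = TRUE),
  c(3, 1, 2)
)

n_val <- ncol(data) - 1  # number of value columns

# Reference result: one lookup per row, as in the slow loop
a_loop <- sapply(seq_len(nrow(data)),
                 function(i) data[i, data[i, ncol(data)]])

# After transposing, row i of data occupies positions
# n_val*(i-1)+1 ... n_val*i of the underlying vector
a_vec <- t(data[, 1:n_val])[n_val * (0:(nrow(data) - 1)) + data[, ncol(data)]]

print(a_vec)
```

The offset n_val*(i-1) jumps to the start of row i, and adding the stored column number lands on the wanted element.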
Single-index notation for the matrix may use less memory. This would involve doing something like:
i <- nrow(data)*(data[,31]-1) + 1:nrow(data)
a <- data[i]
Below is an example of single-index notation for matrices in R. In this example, the index of the per-row maximum is appended as the last column of a random matrix. This last column is then used to select the per-row maxima via single-index notation.
## create a random (10 x 5) matrix
M <- matrix(rpois(50,50),10,5)
## use the last column to index the maximum value of the first 5
## columns
MM <- cbind(M,apply(M,1,which.max))
## column ID row ID
i <- nrow(MM)*(MM[,ncol(MM)]-1) + 1:nrow(MM)
all(MM[i] == apply(M,1,max))
Using an index matrix is an alternative that will probably use more memory but is slightly clearer:
ii <- cbind(1:nrow(MM),MM[,ncol(MM)])
all(MM[ii] == apply(M,1,max))
Try to change the code to work a column at a time:
M <- matrix(rpois(30*400000,50),400000,30)
MM <- cbind(M,apply(M,1,which.max))
a <- rep(0,nrow(MM))
for (i in 1:(ncol(MM)-1)) {
a[MM[, ncol(MM)] == i] <- MM[MM[, ncol(MM)] == i, i]
}
This sets all elements in a with the values from column i if the last column has value i. It took longer to build the matrix than to calculate vector a.
