Efficient way to bin data ranges in R

I have several hundred variables in a data frame which need to be binned into buckets.
Currently, I'm using code similar to the following:
idx <- list()
idx[[1]] <- which(df$myVariable < 628 & df$myVariable >= 0)
idx[[2]] <- which(df$myVariable < 774 & df$myVariable >= 628)
idx[[3]] <- which(df$myVariable < 885 & df$myVariable >= 774)
idx[[4]] <- which(df$myVariable <= Inf & df$myVariable >= 4819)
idx[[5]] <- which(df$myVariable < 0)
df$myVariable[idx[[1]]] = 1
df$myVariable[idx[[2]]] = 2
df$myVariable[idx[[3]]] = 3
df$myVariable[idx[[4]]] = 4
df$myVariable[idx[[5]]] = 0
In reality, there are 21 ranges of values for each of the variables, and the cut points may vary between the variables. So, in full, this code is over 30,000 lines long (I have a script which generates it).
Is there a better way to represent this code? Ideally it would make use of dplyr, since I intend to run this code in sparklyr, but if that is not possible, native R code is fine (thanks to the spark_apply function).
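One way to collapse this kind of generated code is base R's cut() (or findInterval()), driven by a list of break points per variable. A minimal sketch, where breaks_list is a hypothetical named list holding each variable's cut points (only the breaks visible in the question are shown):
breaks_list <- list(
  myVariable = c(-Inf, 0, 628, 774, 885, 4819, Inf)   # cut points for one variable
)
for (v in names(breaks_list)) {
  # labels = FALSE returns the interval index; subtracting 1 codes values
  # below 0 as 0 and the remaining buckets as 1, 2, 3, ...
  df[[v]] <- cut(df[[v]], breaks = breaks_list[[v]], labels = FALSE, right = FALSE) - 1
}
Inside a dplyr pipeline the same breaks can be spelled out with case_when(), and since spark_apply() runs native R code, cut() works there as well.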

Related

Calculate elements of time series for each variable (loop?)

I need to calculate each component of the time series for each X (50 levels) and Y (80 levels) from my dataset (df).
I wanted to go with something akin to the code below, where I tried to get just the seasonal component. If I can get this working, it should be the same for the trend and random components of the decomposition.
P <- df$X
for (y in 1:length(P)) {
  OneP <- P[y]
  AllS <- unique(df$Y[df$X == OneP])
  for (i in 1:length(AllS)) {
    OneS <- AllS[i]
    df$TS[df$Y == OneS & df$X == OneP] <- ts(df$Mean[df$Y == OneS & df$X == OneP],
                                             start = c(1999, 1), end = c(2015, 12), frequency = 12)
    df$Dec[df$Y == OneS & df$X == OneP] <- decompose(ts(df$TS[df$Y == OneS & df$X == OneP],
                                                        frequency = 12), type = c("additive"))
    df$Decomposition_seasonal[df$Y == OneS & df$X == OneP] <- df$Dec([df$Y == OneS & df$X == OneP], Dec$seasonal)
  }
}
But this is not working. Error message is:
Error: attempt to apply non-function
I understand that the problem might come from my attempt to put the decomposition output in a column. But how else could I do it? Make a new dataset for every decomposition in every X * Y combination?
I know that the first lines of the code work as I used it before for something else. And I know this will run and give me TS and decomposition. It's the individual components bit that I am struggling with. Any advice is deeply appreciated.
Similar data:
X    Y  Mean  Date(mY)
Tru  A  35.6  02.2015
Fle  A  15    05.2010
Srl  C  67.1  05.1999
Tru  A  13.2  08.2006
Srl  B  89    08.2006
Tru  B  14.8  12.2001
Fle  A  21.5  11.2001
Lub  D  34.8  03.2000
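One way to sidestep the error is to keep the decompose() results in a list (they are lists themselves, so they cannot be stored in a data-frame column) and write only the numeric seasonal component back into df. A rough sketch, assuming df has columns X, Y and Mean and that every X/Y combination holds a complete monthly series:
df$Decomposition_seasonal <- NA_real_     # initialise the output column
groups  <- unique(df[, c("X", "Y")])
decomps <- vector("list", nrow(groups))   # one decomposition per X/Y combination
for (g in seq_len(nrow(groups))) {
  sel <- df$X == groups$X[g] & df$Y == groups$Y[g]
  x   <- ts(df$Mean[sel], start = c(1999, 1), frequency = 12)
  decomps[[g]] <- decompose(x, type = "additive")
  # the seasonal component is numeric, so it can live in a column
  df$Decomposition_seasonal[sel] <- as.numeric(decomps[[g]]$seasonal)
}
The trend and random components can be copied out of decomps[[g]] in the same way.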

speeding up boolean logic loop in R

I am very new to R but I am interested in learning more and improving.
I have a dataset with around 40,000+ rows containing the length of neuron segments. I want to compare the length trends of neurons of different groups. The first step in this analysis involves sorting the measurements into 1 of 6 categories: '<10', '10-15', '15-20', '20-25', '25-30', and '>30'.
I created these categories as appended columns using 'mutate' from the 'dplyr' package and now I am trying to write a boolean function to determine where the measurement fits by applying a value of '1' to the corresponding column if it fits, and a '0' if it doesn't.
Here is what I wrote:
for (i in 1:40019) {
  if (FinalData$Length[i] <= 10) {
    FinalData$`<10`[i] <- 1
  } else {FinalData$`<10`[i] <- 0}           # Fills '<10'
  if (FinalData$Length[i] >= 10 & FinalData$Length[i] < 15) {
    FinalData$`10-15`[i] <- 1
  } else {FinalData$`10-15`[i] <- 0}         # Fills '10-15'
  if (FinalData$Length[i] >= 15 & FinalData$Length[i] < 20) {
    FinalData$`15-20`[i] <- 1
  } else {FinalData$`15-20`[i] <- 0}         # Fills '15-20'
  if (FinalData$Length[i] >= 20 & FinalData$Length[i] < 25) {
    FinalData$`20-25`[i] <- 1
  } else {FinalData$`20-25`[i] <- 0}         # Fills '20-25'
  if (FinalData$Length[i] >= 25 & FinalData$Length[i] < 30) {
    FinalData$`25-30`[i] <- 1
  } else {FinalData$`25-30`[i] <- 0}         # Fills '25-30'
  if (FinalData$Length[i] >= 30) {
    FinalData$`>30`[i] <- 1
  } else {FinalData$`>30`[i] <- 0}           # Fills '>30'
}
This seems to work, but it takes a long time:
system.time(source('~/Desktop/Home/Programming/R/Boolean Loop R.R'))
   user  system elapsed
 94.408  19.147 118.203
The way I coded this seems very clunky and inefficient. Is there a faster and more efficient way to code something like this or am I doing this appropriately for what I am asking for?
Here is an example of some of the values I am testing:
'Length': 14.362, 12.482337, 8.236, 16.752, 12.045
If I am not being clear about how the data frame is structured, here is a screenshot:
[Screenshot: how my data frame is organized]
You can use the cut function in R. It is used to convert numeric values to factors:
x<-c(1,2,4,2,3,5,6,5,6,5,8,0,5,5,4,4,3,3,3,5,7,9,0,5,6,7,4,4)
cut(x = x, breaks = c(0, 3, 6, 9, 12), labels = c("grp1", "grp2", "grp3", "grp4"), right = FALSE)
Set right = TRUE or right = FALSE depending on whether you want each interval closed on the right or on the left.
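Applied to the bins in the question, a sketch (it assumes the measurements sit in FinalData$Length):
# right = FALSE makes the intervals [10, 15), [15, 20), ..., so a length of
# exactly 10 lands in "10-15" and exactly 30 lands in ">30"
FinalData$Group <- cut(FinalData$Length,
                       breaks = c(-Inf, 10, 15, 20, 25, 30, Inf),
                       labels = c("<10", "10-15", "15-20", "20-25", "25-30", ">30"),
                       right  = FALSE)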
You can vectorise that as follows (I made a sample of some data called DF)
DF <- data.frame(ID = 1:40000,
                 Letter = sample(letters, 40000, replace = TRUE),
                 Length = sample(1:40, 40000, replace = TRUE))
MyFunc <- function(x) {
  # build the labels in a separate vector so the numeric comparisons are not
  # disturbed once character labels start being assigned
  out <- character(length(x))
  out[x < 10]           <- "<10"
  out[x >= 10 & x < 15] <- "10-15"
  out[x >= 15 & x < 20] <- "15-20"
  out[x >= 20 & x < 25] <- "20-25"
  out[x >= 25 & x < 30] <- "25-30"
  out[x >= 30]          <- ">30"
  out
}
DF$Group <- MyFunc(DF$Length)
If it has to be 6 columns like that, you can modify the above to return a one for the matching size band and a zero for the others, for each of the 6 columns.
Edit: I guess a series of ifelse might be best if it really has to be 6 columns like that.
e.g.
DF$`<10` <- ifelse(DF$Length < 10, 1, 0)  # ifelse is already vectorised, so sapply is not needed
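The remaining five columns can be filled the same way, or in one short loop over the labels (a sketch that reuses the Group column from the cut() answer above):
labels <- c("<10", "10-15", "15-20", "20-25", "25-30", ">30")
for (l in labels) {
  # one vectorised comparison per label gives the 0/1 indicator column
  FinalData[[l]] <- as.integer(FinalData$Group == l)
}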

Extend conditions in a dynamic way

I am trying to build a decision table. At time 3 for example I have to take the previous results in time t=1 and time t=2 in order to make my decision in time 3. The decision table is going to be pretty big so I am considering an efficient way to do it by building a function. For instance at time 3:
rm(list=ls()) # clear memory
names <- c("a","b","c","d","e","f","g","h","i","j","k50","l50","m50","n50","o50")
proba <- c(1,1,1,1,1,1,1,1,1,1,0.5,0.5,0.5,0.5,0.5)
need <- 4
re <- 0.5
w <- 1000000000
# t1
t1 <- as.integer(names %in% (sample(names,need,prob=proba,replace=F)))
# t2
t2 <- rep(t1)
# t3
proba3 <- ifelse(t2==1,proba*re,proba)
t3 <- as.integer(names %in% (sample(names,need,prob=proba3,replace=F)))
Now the table grows until t=7, where proba7 takes its conditions from t=1 to t=6. After t=7 the decision always depends on the 6 previous outcomes plus the random part proba. In other words, the ifelse must be dynamic so that I can call it later. I have been trying something like:
probF <- function(a){
  test <- ifelse(paste0("t", a, sep="") == 1, proba*re, proba)
  return(test)
}
test <- probF(2)
but there is an error: I get just one value and not a vector. I know that it looks complicated.
Here are the conditions requested in the comments (I know it's not very well written):
proba7 <- ifelse(t2==1 & t3==1 & t4==0 & t5==0 & t6==0, proba,
          ifelse(t2==1 & t3==0 & t4==0 & t5==1 & t6==1, proba*re,
          ifelse(t2==1 & t3==0 & t4==0 & t5==0 & t6==1, w,
          ifelse(t2==0 & t3==1 & t4==1 & t5==0 & t6==0, proba,
          ifelse(t2==0 & t3==1 & t4==1 & t5==1 & t6==0, 0,
          ifelse(t2==0 & t3==0 & t4==1 & t5==1 & t6==1, 0,
          ifelse(t2==0 & t3==0 & t4==1 & t5==1 & t6==0, 0,
          ifelse(t2==0 & t3==0 & t4==0 & t5==1 & t6==1, proba*re,
          ifelse(t2==0 & t3==0 & t4==0 & t5==0 & t6==1, w, proba)))))))))
t7 <- as.integer(names %in% sample(names, need, prob = proba7, replace = FALSE))
If you take a bit of a different approach, you'll gain quite a lot of speed.
First of all, it is really a terribly bad idea to store every step as a separate t1, proba1, etc. If you need to keep all that information, predefine a matrix or list of the right size and store everything in there. That way you can use simple indices instead of having to resort to the bug-prone use of get(). If you find yourself typing get(), almost always it's time to stop and rethink your solution.
Secondly, you can use a simple principle to select the indices of the test t:
seq(max(1, i-6), i-1)
will allow you to use a loop index i and refer to the (at most) 6 previous positions, if they exist.
Thirdly, depending on what you want, you can reformulate your decision as well. If you store every t as a row in a matrix, you can simply use colSums() and check whether each column sum is larger than 0. Based on that index, you can update the probabilities in such a way that a 1 in any of the previous 6 rows scales the probability by re (0.5 in your example).
Wrapping everything in a function then looks like this:
myfun <- function(names, proba, need, re, w = 100){
  # For convenience, so I don't have to type this twice
  resample <- function(p){
    as.integer(names %in% sample(names, need, prob = p, replace = FALSE))
  }
  # get the number of needed columns
  nnames <- length(names)
  # create two matrices to store all the t-steps and the probabilities used
  theT <- matrix(nrow = w, ncol = nnames)
  theproba <- matrix(nrow = w, ncol = nnames)
  # Create a first step, using the original probabilities
  theT[1, ] <- resample(proba)
  theproba[1, ] <- proba
  # loop over the other simulations, each time checking the condition,
  # recalculating the probability and storing the result in the next
  # row of the matrices
  for(i in 2:w){
    # the id vector to select the (at most) 6 previous rows. If i - 6 is
    # smaller than 1 (i.e. there are no 6 steps yet), max(1, i - 6)
    # guarantees that you start at row 1.
    tid <- seq(max(1, i - 6), i - 1)
    # Create the probability vector from the original one
    p <- proba
    # look for which columns in the 6 previous steps contain a 1
    pid <- colSums(theT[tid, , drop = FALSE]) > 0
    # update the probability vector using the reduction factor re
    p[pid] <- p[pid] * re
    # store the next step and the used probabilities in the matrices
    theT[i, ] <- resample(p)
    theproba[i, ] <- p
  }
  # Return both matrices in a single list for convenience
  return(list(decisions = theT,
              proba = theproba))
}
which can be used as (note that w here is the number of simulated time steps, not the w defined in the question):
myres <- myfun(names, proba, need, re, w = 100)
head(myres$decisions)
head(myres$proba)
This returns you a matrix where every row is one t-point in the decision table.

cor() function in R with a subset

I have a table in R with three columns. I want to get the correlation of the first two columns with a subset of the third column following a specific set of conditions (values are all numeric, I want them to be > a certain number). The cor() function doesn't seem to have an argument to define such a subset.
I know that I could use the summary(lm()) function and square-root the r^2, but the issue is that I'm doing this inside a for loop and am just appending the correlation to a separate list that I have. I can't really append part of the summary of the regression easily to a list.
Here is what I am trying to do:
for (i in x) {list[i] = cor(data$column_a, data$column_b, subset = data$column_c > i)}
Obviously, though, I can't do that because the cor() function doesn't work with subsets.
(Note: x = seq(1,100) and list = NULL)
You can do this without a loop using lapply. Here's some code that will output a data frame with the month-range in one column and the correlation in another column. The do.call(rbind... business is just to take the list output from lapply and turn it into a data frame.
corrs = do.call(rbind, lapply(min(airquality$Month):max(airquality$Month),
function(x) {
data.frame(month_range=paste0(x," - ", max(airquality$Month)),
correlation = cor(airquality$Temp[airquality$Month >= x & airquality$Temp < 80],
airquality$Wind[airquality$Month >= x & airquality$Temp < 80]))
}))
corrs
month_range correlation
1 5 - 9 -0.3519351
2 6 - 9 -0.2778532
3 7 - 9 -0.3291274
4 8 - 9 -0.3395647
5 9 - 9 -0.3823090
You can subset the data first, and then find the correlation.
a <- subset(airquality, Temp < 80 & Month > 7)
cor(a$Temp, a$Wind)
Edit: I don't really know what your list variable is, but here is an example of dynamically changing the subset based on i (see how the month requirement changes with each iteration)
list <- seq(1, 5)
for (i in 1:5){
a <- subset(airquality, Temp < 80 & Month > i)
list[i] <- cor(a$Temp, a$Wind)
}
Based on the pseudo-code you provided alone, here's something that should work:
for (i in x) {
df <- subset(data, column_c > i)
list[i] = cor(df$column_a, df$column_b)
}
However, I don't know why you would want your index in list[i] to be the same value that you use to subset column_c. That could be another source of problems.
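For completeness, the whole loop can be written in one step with sapply (a sketch; data, column_a, column_b, column_c and x are the names from the question):
# one correlation per threshold in x, collected into a numeric vector
corrs <- sapply(x, function(i) {
  d <- subset(data, column_c > i)
  cor(d$column_a, d$column_b)
})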

Nested loop in R: columns then rows

I am trying to write a nested for loop in R, but am running into problems. I have researched as much as possible but can't find (or understand) the help I need. I am fairly new to R, so any advice on this looping would be appreciated, or if there is a simpler, more elegant way!
I have generated a file of daily temperatures for many many locations (I'll call them sites), and the file columns are set up like this:
year month day unix_time site_a site_b site_c site_d ... on and on
For each site (within each column), I want to run through the temperature values and create new columns (or a new data frame) with a number (a physiological rate) that corresponds with a range of those temperatures. (for example, temperatures less than 6.25 degrees have a rate of -1.33, temperatures between 6.25 and 8.75 have a rate of 0.99, etc). I have created a loop that does this for a single column of data. For example:
for (i in 1:dim(data)[1]) {
  if (data$point_a[i] < 6.25) data$rate_point_a[i] <- -1.33 else
  if (data$point_a[i] >= 6.25  && data$point_a[i] < 8.75)  data$rate_point_a[i] <- 0.99 else
  if (data$point_a[i] >= 8.75  && data$point_a[i] < 11.25) data$rate_point_a[i] <- 3.31 else
  if (data$point_a[i] >= 11.25 && data$point_a[i] < 13.75) data$rate_point_a[i] <- 2.56 else
  if (data$point_a[i] >= 13.75 && data$point_a[i] < 16.25) data$rate_point_a[i] <- 1.81 else
  if (data$point_a[i] >= 16.25 && data$point_a[i] < 18.75) data$rate_point_a[i] <- 2.78 else
  if (data$point_a[i] >= 18.75 && data$point_a[i] < 21.25) data$rate_point_a[i] <- 3.75 else
  if (data$point_a[i] >= 21.25 && data$point_a[i] < 23.75) data$rate_point_a[i] <- 1.98 else
  if (data$point_a[i] >= 23.75 && data$point_a[i] < 26.25) data$rate_point_a[i] <- 0.21
}
The above code gives me a new column called "rate_point_a" that holds my physiological rates. What I am having trouble with is nesting this loop inside another loop that runs through all of the columns. I have tried things such as:
for (i in 1:ncol(data)) {
  # for each row in that column
  for (s in 1:length(data)) {
    if ([i] < 6.25) rate1[s] <- -1.33 else ...
I guess I don't know how to make the "if else" statement refer to the correct places. I know that I can't add the "rate" columns onto the existing data frame, as this would increase my ncol as I go through the loop, so I need to put them into another data frame (though I don't think this is my main issue). I am going to have many, many points to work through and would rather not do them one at a time, hence my attempt at a nested loop.
Any help would be much appreciated. Here is a link to some sample data if that is helpful. http://dl.dropbox.com/u/17903768/AVHRR_output.txt Thanks in advance!
Use ifelse which is vectorized:
ifelse(data$point_a < 6.25, -1.33, ifelse(data$point_a < 8.75, 0.99, ifelse(data$point_a < 11.25, 3.31, ... and so on until finished.
For instance:
datap <- read.table('http://dl.dropbox.com/u/17903768/AVHRR_output.txt', header = TRUE)
# apply() returns a matrix of rates, one column per site column 5 to 9
rates <- apply(datap[, 5:9], 2, function(x) {
  ifelse(x < 6.25, -1.33,
  ifelse(x < 8.75,  0.99,
  ifelse(x < 11.25, 3.31,
  ifelse(x < 13.75, 2.56,
  ifelse(x < 16.25, 1.81,
  ifelse(x < 18.75, 2.78,
  ifelse(x < 21.25, 3.75,
  ifelse(x < 23.75, 1.98, 0.21))))))))
})
colnames(rates) <- paste0("rate_", colnames(rates))
datap <- cbind(datap, rates)
Andres' answer is great for the apply part to get you through all the temperature columns. I'm stuck here without a copy of R (at work) to experiment with, but I suspect that if you create a vector of your cutoff values
xcut <- c(0, 6.25, 8.75, 11.25, ...
and just do
x <- xcut[(which(x>xcut))]
you'll have a much simpler bit of code, and easier to edit as well. (note: I added the 0 value to avoid problems with small x values :-) )
here's another way using just logicals:
DAT <- read.table("http://dl.dropbox.com/u/17903768/AVHRR_output.txt",header=TRUE,as.is=TRUE)
recodecolumn <- function(x){
  out <- numeric(length(x))
  out[x < 6.25]               <- -1.33
  out[x >= 6.25  & x < 8.75]  <- 0.99
  out[x >= 8.75  & x < 11.25] <- 3.31
  out[x >= 11.25 & x < 13.75] <- 2.56
  out[x >= 13.75 & x < 16.25] <- 1.81
  out[x >= 16.25 & x < 18.75] <- 2.78
  out[x >= 18.75 & x < 21.25] <- 3.75
  out[x >= 21.25 & x < 23.75] <- 1.98
  out[x >= 23.75 & x < 26.25] <- 0.21
  out
}
NewCols <- apply(DAT[,5:9],2,recodecolumn)
colnames(NewCols) <- paste("rate",1928:1932,sep="_")
DAT <- cbind(DAT,NewCols)
I find that findInterval is useful in situations like this instead of nested if else statements as it is already vectorized and returns the position within a vector of cutoff points.
DAT <- read.table("http://dl.dropbox.com/u/17903768/AVHRR_output.txt",header=TRUE,as.is=TRUE)
recode.fn <- function(x){
cut.vec <- c(0, seq(6.25,26.25,by = 2.5),Inf)
recode.val <- c(-1.33, 0.99, 3.31, 2.56,1.81,2.78,3.75,1.98, 0.21)
cut.interval <- findInterval(x, cut.vec, FALSE)
return(recode.val[cut.interval])
}
# Add on recoded data to existing data frame
DAT[,10:14] <- sapply(DAT[,5:9],FUN=recode.fn)
