I need to calculate each component of the time series for each X (50 levels) and Y (80 levels) from my dataset (df).
I wanted to go with something akin to the code below, where I tried to just get the seasonality. If I can get this it should be the same for the trend and random component of the decompose.
P <- df$X
for(y in 1:length(P)) {
OneP <- P[y]
AllS <- unique(df$Y[df$X== OneP])
for(i in 1:length(AllS)) {
OneS<- AllS[i]
df$TS[df$Y == OneS & df$X== OneP] <- ts(df$Mean[df$Y == OneS & df$X
== OneP], start = c(1999, 1), end = c(2015, 12), frequency = 12)
df$Dec[df$Y == OneS & df$X== OneP] <- decompose(ts(df$TS[df$Y == OneS &
df$X== OneP], frequency = 12), type = c("additive"))
df$Decomposition_seasonal[df$Y == OneS & df$X== OneP] <- df$Dec([df$Y == OneS & df$X== OneP], Dec$seasonal)
}
But this is not working. Error message is:
Error: attempt to apply non-function
I understand that the problem might come from my attempt to put the decomposition output in a column. But how else can I do it? Make a new dataset for every decomposition in every X * Y combination?
I know that the first lines of the code work as I used it before for something else. And I know this will run and give me TS and decomposition. It's the individual components bit that I am struggling with. Any advice is deeply appreciated.
Similar data:
X Y Mean Date(mY)
Tru A 35.6 02.2015
Fle A 15 05.2010
Srl C 67.1 05.1999
Tru A 13.2 08.2006
Srl B 89 08.2006
Tru B 14.8 12.2001
Fle A 21.5 11.2001
Lub D 34.8 03.2000
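One way to get the components without storing the whole decompose() object in a column is to split the data by X and Y, decompose each piece, and bind the results back together. This is only a minimal sketch. It assumes every X * Y combination has a complete monthly series from 1999 to 2015, and that the data carry numeric year and month columns (the sample above stores the date as a string instead). The toy data and the helper name decompose_group are made up for illustration:

```r
# Toy data (hypothetical): 2 X levels x 2 Y levels, monthly means 1999-2015
set.seed(1)
df <- expand.grid(month = 1:12, year = 1999:2015,
                  X = c("Tru", "Fle"), Y = c("A", "B"),
                  stringsAsFactors = FALSE)
df$Mean <- 20 + 5 * sin(2 * pi * df$month / 12) + rnorm(nrow(df))

# Decompose one group's series and attach the three components as columns
decompose_group <- function(d) {
  d <- d[order(d$year, d$month), ]          # ts() needs chronological order
  parts <- decompose(ts(d$Mean, start = c(1999, 1), frequency = 12),
                     type = "additive")
  d$seasonal <- as.numeric(parts$seasonal)
  d$trend    <- as.numeric(parts$trend)
  d$random   <- as.numeric(parts$random)
  d
}

# One data frame per X * Y combination, then stack the results again
groups <- split(df, list(df$X, df$Y), drop = TRUE)
result <- do.call(rbind, lapply(groups, decompose_group))
head(result)
```

The trend and random columns are NA for the first and last six months of each series, because decompose() estimates the trend with a centred moving average.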
I am very new to R but I am interested in learning more and improving.
I have a dataset with 40,000+ rows containing the lengths of neuron segments. I want to compare the length trends of neurons from different groups. The first step in this analysis is sorting the measurements into one of 6 categories: '<10', '10-15', '15-20', '20-25', '25-30', and '>30'.
I created these categories as appended columns using 'mutate' from the 'dplyr' package and now I am trying to write a boolean function to determine where the measurement fits by applying a value of '1' to the corresponding column if it fits, and a '0' if it doesn't.
Here is what I wrote:
for (i in 1:40019) {
{if (FinalData$Length[i] <10)
{FinalData$`<10`[i]<-1
} else {FinalData$`<10`[i]<-0}} #Fills '<10'
if (FinalData$Length[i] >=10 & FinalData$Length[i]<15){
FinalData$`10-15`[i]<-1
} else{FinalData$`10-15`[i]<-0} #Fills'10-15'
if (FinalData$Length[i] >=15 & FinalData$Length[i]<20){
FinalData$`15-20`[i]<-1
} else{FinalData$`15-20`[i]<-0} #Fills '15-20'
if (FinalData$Length[i] >=20 & FinalData$Length[i]<25) {
FinalData$`20-25`[i]<-1
} else{FinalData$`20-25`[i]<-0} #Fills '20-25'
if(FinalData$Length[i] >=25 & FinalData$Length[i]<30){
FinalData$`25-30`[i]<-1
} else{FinalData$`25-30`[i]<-0} #Fills '25-30'
if(FinalData$Length[i] >=30){
FinalData$`>30`[i]<-1
} else{FinalData$`>30`[i]<-0} #Fills '>30'
}
This seems to work, but it takes a long time:
system.time(source('~/Desktop/Home/Programming/R/Boolean Loop R.R'))
user system elapsed
94.408 19.147 118.203
The way I coded this seems very clunky and inefficient. Is there a faster and more efficient way to code something like this or am I doing this appropriately for what I am asking for?
Here is an example of some of the values I am testing:
'Length': 14.362, 12.482337, 8.236, 16.752, 12.045
If I am not being clear about how the data frame is structured, here is a screenshot: [screenshot: how my data frame is organized]
You can use the cut function in R. It is used to convert numeric values to factors:
x<-c(1,2,4,2,3,5,6,5,6,5,8,0,5,5,4,4,3,3,3,5,7,9,0,5,6,7,4,4)
cut(x = x,breaks = c(0,3,6,9,12),labels = c("grp1","grp2","grp3","grp4"),right=F)
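Set right = TRUE or right = FALSE as needed (a logical, not a string). With right = FALSE the intervals are closed on the left and open on the right, e.g. [0, 3).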
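Applied to the categories from the question, that could look like the sketch below. The first five lengths are the sample values from the question; 31.2 is made up so that the last category is not empty. Using -Inf and Inf as the outer breaks is my assumption, so that no value falls outside the bins:

```r
# Sample lengths from the question, plus one made-up value (31.2)
Length <- c(14.362, 12.482337, 8.236, 16.752, 12.045, 31.2)

# right = FALSE gives left-closed bins: [10, 15), [15, 20), ...
bins <- cut(Length,
            breaks = c(-Inf, 10, 15, 20, 25, 30, Inf),
            labels = c("<10", "10-15", "15-20", "20-25", "25-30", ">30"),
            right = FALSE)

bins          # a factor with one category per measurement
table(bins)   # counts per category
```

A single factor column like this is usually easier to group or plot by than six separate 0/1 columns.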
You can vectorise that as follows (I made a sample of some data called DF)
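DF <- data.frame(ID = 1:40000, Letter = sample(letters, 40000, replace = TRUE), Length = sample(1:40, 40000, replace = TRUE))
MyFunc <- function(x) {
  # Write the labels into a separate character vector so that x stays numeric;
  # assigning strings into x itself would turn the later comparisons into string comparisons
  out <- character(length(x))
  out[x < 10] <- "<10"
  out[x >= 10 & x < 15] <- "10-15"
  out[x >= 15 & x < 20] <- "15-20"
  out[x >= 20 & x < 25] <- "20-25"
  out[x >= 25 & x < 30] <- "25-30"
  out[x >= 30] <- ">30"
  return(out)
}
DF$Group <- MyFunc(DF[,3])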
If it has to be 6 columns like that, you can modify the above to return a one or zero for the appropriate size and everything else, respectively, for each of the 6 columns.
Edit: I guess a series of ifelse might be best if it really has to be 6 columns like that.
e.g.
DF$'<10' <- ifelse(DF$Length < 10, 1, 0)
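If the six 0/1 columns really are needed, here is a sketch that fills them all without looping over rows. The lower/upper vectors and the made-up value 31.2 are my own additions; the column names follow the question:

```r
# Sample lengths from the question, plus one made-up value (31.2)
FinalData <- data.frame(Length = c(14.362, 12.482337, 8.236, 16.752, 12.045, 31.2))

# Bin edges: a value belongs to bin k when lower[k] <= Length < upper[k]
lower  <- c(-Inf, 10, 15, 20, 25, 30)
upper  <- c(10, 15, 20, 25, 30, Inf)
labels <- c("<10", "10-15", "15-20", "20-25", "25-30", ">30")

# One vectorised comparison per column (6 iterations, not 40,000)
for (k in seq_along(labels)) {
  FinalData[[labels[k]]] <-
    as.integer(FinalData$Length >= lower[k] & FinalData$Length < upper[k])
}
FinalData
```

The loop runs once per column rather than once per row, and each comparison works on the whole Length vector at once.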
I am trying to build a decision table. At time 3 for example I have to take the previous results in time t=1 and time t=2 in order to make my decision in time 3. The decision table is going to be pretty big so I am considering an efficient way to do it by building a function. For instance at time 3:
rm(list=ls()) # clear memory
names <- c("a","b","c","d","e","f","g","h","i","j","k50","l50","m50","n50","o50")
proba <- c(1,1,1,1,1,1,1,1,1,1,0.5,0.5,0.5,0.5,0.5)
need <- 4
re <- 0.5
w <- 1000000000
# t1
t1 <- as.integer(names %in% (sample(names,need,prob=proba,replace=F)))
# t2
t2 <- rep(t1)
# t3
proba3 <- ifelse(t2==1,proba*re,proba)
t3 <- as.integer(names %in% (sample(names,need,prob=proba3,replace=F)))
Now the table is going to be big until t=7 with proba7 which takes condition from t=1 to t=6. After t=7 it always takes the 6 previous outcomes plus the random part proba in order to make decision. In other words the ifelse must be dynamic in order that I can call it later. I have been trying something like
probF <- function(a){
test <- ifelse(paste0("t",a,sep="")==1,proba*re,proba)
return(test)
}
test <- probF(2)
but there is an error: I get just one value and not a vector. I know that it looks complicated.
For the conditions requested by one person (I know it's not very well written):
proba7 <- ifelse(t2==1 & t3==1 & t4==0 & t5==0 & t6==0,proba,
ifelse(t2==1 & t3==0 & t4==0 & t5==1 & t6==1,proba*re,
ifelse(t2==1 & t3==0 & t4==0 & t5==0 & t6==1, w,
ifelse(t2==0 & t3==1 & t4==1 & t5==0 & t6==0,proba,
ifelse(t2==0 & t3==1 & t4==1 & t5==1 & t6==0,0,
ifelse(t2==0 & t3==0 & t4==1 & t5==1 & t6==1,0,
ifelse(t2==0 & t3==0 & t4==1 & t5==1 &t6==0,0,
ifelse(t2==0 & t3==0 & t4==0 & t5==1 & t6==1, proba*re,
ifelse(t2==0 & t3==0 & t4==0 & t5==0 & t6==1,w,proba)))))))))
t7 <- as.integer(names %in% (sample(names,need,prob=proba7,replace=F)))
If you take a bit of a different approach, you'll gain quite a lot of speed.
First of all, it is really a terribly bad idea to store every step as a separate t1, proba1, etc. If you need to keep all that information, predefine a matrix or list of the right size and store everything in there. That way you can use simple indices instead of having to resort to the bug-prone use of get(). If you find yourself typing get(), almost always it's time to stop and rethink your solution.
Secondly, you can use a simple principle to select the indices of the test t:
seq(max(1, i-6), i-1)
will allow you to use a loop index i and refer to the 6 previous positions if they exist.
Thirdly, depending on what you want, you can reformulate your decision as well. If you store every t as a row in the matrix, you can simply use colSums() and check whether the column sum is larger than 0. Based on that index, you can update the probabilities in such a way that a 1 in any of the previous 6 rows halves the probability.
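To make the last two points concrete, here is a small sketch on a made-up 4 x 3 decision matrix (the values in theT and proba are invented for illustration; re is the reduction factor from the question):

```r
# Toy decision matrix: 4 earlier steps (rows) x 3 names (columns), made-up values
theT <- matrix(c(1, 0, 0,
                 0, 0, 0,
                 0, 1, 0,
                 0, 0, 0), nrow = 4, byrow = TRUE)
proba <- c(1, 1, 0.5)
re <- 0.5

i <- 5                                   # the step about to be simulated
tid <- seq(max(1, i - 6), i - 1)         # at most the six previous rows, never below row 1

# TRUE for every name that was selected at least once in those rows
pid <- colSums(theT[tid, , drop = FALSE]) > 0

# Reduce the probability only for those names
p <- proba
p[pid] <- p[pid] * re
p
```

The drop = FALSE matters when tid has length 1 (at i = 2): without it the selection collapses to a vector and colSums() fails.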
Wrapping everything in a function would then look like:
myfun <- function(names, proba, need, re,
w=100){
# For convenience, so I don't have to type this twice
resample <- function(p){
as.integer(
names %in% sample(names,need,prob=p, replace = FALSE)
)
}
# get the number of needed columns
nnames <- length(names)
# create two matrices to store all the t-steps and the probabilities used
theT <- matrix(nrow = w, ncol = nnames)
theproba <- matrix(nrow = w, ncol = nnames)
# Create a first step, using the original probabilities
theT[1,] <- resample(proba)
theproba[1,] <- proba
# loop over the other simulations, each time checking the condition
# recalculating the probability and storing the result in the next
# row of the matrices
for(i in 2:w){
# the id vector to select the (maximal) 6 previous rows. If
# i-6 is smaller than 1 (i.e. there are no 6 steps yet), the
# max(1, i-6) guarantees that you start minimal at 1.
tid <- seq(max(1, i-6), i-1)
# Create the probability vector from the original one
p <- proba
# look for which columns in the 6 previous steps contain a 1
pid <- colSums(theT[tid,,drop = FALSE]) > 0
# update the probability vector
p[pid] <- p[pid]*re
# store the next step and the used probabilities in the matrices
theT[i,] <- resample(p)
theproba[i,] <- p
}
# Return both matrices in a single list for convenience
return(list(decisions = theT,
proba = theproba)
)
}
which can be used as:
# here w is the number of simulated steps (rows), not the weight w from the question
myres <- myfun(names, proba, need, re, w = 100)
head(myres$decisions)
head(myres$proba)
This returns you a matrix where every row is one t-point in the decision table.
I have a table in R with three columns. I want to get the correlation of the first two columns with a subset of the third column following a specific set of conditions (values are all numeric, I want them to be > a certain number). The cor() function doesn't seem to have an argument to define such a subset.
I know that I could use the summary(lm()) function and square-root the r^2, but the issue is that I'm doing this inside a for loop and am just appending the correlation to a separate list that I have. I can't really append part of the summary of the regression easily to a list.
Here is what I am trying to do:
for (i in x) {list[i] = cor(data$column_a, data$column_b, subset = data$column_c > i)}
Obviously, though, I can't do that because the cor() function doesn't work with subsets.
(Note: x = seq(1,100) and list = NULL)
You can do this without a loop using lapply. Here's some code that will output a data frame with the month-range in one column and the correlation in another column. The do.call(rbind... business is just to take the list output from lapply and turn it into a data frame.
corrs = do.call(rbind, lapply(min(airquality$Month):max(airquality$Month),
function(x) {
data.frame(month_range=paste0(x," - ", max(airquality$Month)),
correlation = cor(airquality$Temp[airquality$Month >= x & airquality$Temp < 80],
airquality$Wind[airquality$Month >= x & airquality$Temp < 80]))
}))
corrs
month_range correlation
1 5 - 9 -0.3519351
2 6 - 9 -0.2778532
3 7 - 9 -0.3291274
4 8 - 9 -0.3395647
5 9 - 9 -0.3823090
You can subset the data first, and then find the correlation.
a <- subset(airquality, Temp < 80 & Month > 7)
cor(a$Temp, a$Wind)
Edit: I don't really know what your list variable is, but here is an example of dynamically changing the subset based on i (see how the month requirement changes with each iteration)
list <- seq(1, 5)
for (i in 1:5){
a <- subset(airquality, Temp < 80 & Month > i)
list[i] <- cor(a$Temp, a$Wind)
}
Based on the pseudo-code you provided alone, here's something that should work:
for (i in x) {
df <- subset(data, column_c > i)
list[i] = cor(df$column_a, df$column_b)
}
However, I don't know why you would want your index in list[i] to be the same value that you use to subset column_c. That could be another source of problems.
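A sketch that sidesteps that indexing issue: keep the thresholds in their own vector and let sapply() collect the results, so the position in the output never depends on the threshold value. It uses the built-in airquality data as in the answers above; the threshold values 5:8 are chosen only for illustration:

```r
# Thresholds to test; the result is indexed by position, not by threshold value
thresholds <- 5:8

cors <- sapply(thresholds, function(i) {
  a <- subset(airquality, Temp < 80 & Month > i)   # subset first ...
  cor(a$Temp, a$Wind)                              # ... then correlate
})
names(cors) <- thresholds   # label each correlation with its threshold
cors
```

This also avoids calling the result list, which masks the base R function of the same name.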
I am trying to write a nested for loop in R, but am running into problems. I have researched as much as possible but can't find (or understand) the help I need. I am fairly new to R, so any advice on this looping would be appreciated, or if there is a simpler, more elegant way!
I have generated a file of daily temperatures for many many locations (I'll call them sites), and the file columns are set up like this:
year month day unix_time site_a site_b site_c site_d ... on and on
For each site (within each column), I want to run through the temperature values and create new columns (or a new data frame) with a number (a physiological rate) that corresponds with a range of those temperatures. (for example, temperatures less than 6.25 degrees have a rate of -1.33, temperatures between 6.25 and 8.75 have a rate of 0.99, etc). I have created a loop that does this for a single column of data. For example:
for(i in 1:dim(data)[1]){
if (data$point_a[i]<6.25) data$rate_point_a[i]<--1.33 else
if (data$point_a[i]>=6.25 && data$point_a[i]<8.75) data$rate_point_a[i]<-0.99 else
if (data$point_a[i]>=8.75 && data$point_a[i]<11.25) data$rate_point_a[i]<-3.31 else
if (data$point_a[i]>=11.25 && data$point_a[i]<13.75) data$rate_point_a[i]<-2.56 else
if (data$point_a[i]>=13.75 && data$point_a[i]<16.25) data$rate_point_a[i]<-1.81 else
if (data$point_a[i]>=16.25 && data$point_a[i]<18.75) data$rate_point_a[i]<-2.78 else
if (data$point_a[i]>=18.75 && data$point_a[i]<21.25) data$rate_point_a[i]<-3.75 else
if (data$point_a[i]>=21.25 && data$point_a[i]<23.75) data$rate_point_a[i]<-1.98 else
if (data$point_a[i]>=23.75 && data$point_a[i]<26.25) data$rate_point_a[i]<-0.21
}
The above code gives me a new column called "rate_point_a" that has my physiological rates. What I am having trouble doing is nesting this loop into another loop that runs through all of the columns. I have tried things such as:
for (i in 1:ncol(data)){
#for each row in that column
for (s in 1:length(data)){
if ([i]<6.25) rate1[s]<--1.33 else ...
I guess I don't know how to make the "if else" statement refer to the correct places. I know that I can't add the "rate" columns onto the existing data frame, as this would increase my ncol as I go through the loop, so need to put them into another data frame (though don't think this is my main issue). I am going to have many many many points to work through and would rather not have to do them one at a time, hence my attempt at a nested loop.
Any help would be much appreciated. Here is a link to some sample data if that is helpful. http://dl.dropbox.com/u/17903768/AVHRR_output.txt Thanks in advance!
Use ifelse which is vectorized:
ifelse(data$point< 6.25,-1.33,ifelse(data$point< 8.75,0.99,ifelse(data$point< 11.25,3.31,.....Until finished.
For instance:
datap=read.table('http://dl.dropbox.com/u/17903768/AVHRR_output.txt',header=T)
# apply() returns a matrix of rates, one column per temperature column
rates <- apply(datap[,5:9],2,function(x){
ifelse(x<6.25,-1.33,
ifelse(x<8.75,0.99,
ifelse(x<11.25,3.31,
ifelse(x<13.75,2.56,
ifelse(x<16.25,1.81,
ifelse(x<18.75,2.78,
ifelse(x<21.25,3.75,
ifelse(x<23.75,1.98,0.21))))))))})
Andres' answer is great for the apply part to get you through all the "temperature" columns. I'm stuck here without a copy of R (at work) to experiment with, but I suspect if you create a vector of your cutoff values
xcut <- c(0,6.25,8.75,11.25,...
and just do
x <- xcut[(which(x>xcut))]
you'll have a much simpler bit of code, and easier to edit as well. (note: I added the 0 value to avoid problems with small x values :-) )
here's another way using just logicals:
DAT <- read.table("http://dl.dropbox.com/u/17903768/AVHRR_output.txt",header=TRUE,as.is=TRUE)
recodecolumn <- function(x){
out <- vector(length=length(x))
out[x < 6.25] <- -1.33
out[x >= 6.25 & x < 8.75] <- .99
out[x >= 8.75 & x < 11.25] <- 3.31
out[x >= 11.25 & x < 13.75] <- 2.56
out[x >= 13.75 & x < 16.25] <- 1.81
out[x >= 16.25 & x < 18.75] <- 2.78
out[x >= 18.75 & x < 21.25] <- 3.75
out[x >= 21.25 & x < 23.75] <- 1.98
out[x >= 23.75 & x < 26.25] <- 0.21
out
}
NewCols <- apply(DAT[,5:9],2,recodecolumn)
colnames(NewCols) <- paste("rate",1928:1932,sep="_")
DAT <- cbind(DAT,NewCols)
I find that findInterval is useful in situations like this instead of nested if else statements as it is already vectorized and returns the position within a vector of cutoff points.
DAT <- read.table("http://dl.dropbox.com/u/17903768/AVHRR_output.txt",header=TRUE,as.is=TRUE)
recode.fn <- function(x){
cut.vec <- c(-Inf, seq(6.25,26.25,by = 2.5),Inf) # -Inf so that temperatures below 0 also fall in the first interval
recode.val <- c(-1.33, 0.99, 3.31, 2.56,1.81,2.78,3.75,1.98, 0.21)
cut.interval <- findInterval(x, cut.vec, FALSE)
return(recode.val[cut.interval])
}
# Add on recoded data to existing data frame
DAT[,10:14] <- sapply(DAT[,5:9],FUN=recode.fn)
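For anyone who wants to try this without downloading the file, here is a self-contained sketch on made-up temperatures. The column names site_a and site_b and the values are invented; the cutoffs and rates are the ones from the question. Values of 26.25 and above get NA, because the question defines no rate for them:

```r
# Made-up temperatures for two hypothetical sites
temps <- data.frame(site_a = c(5.0, 7.1, 12.0, 25.0),
                    site_b = c(9.0, 14.2, 20.0, 22.3))

# Lower bounds of the intervals; -Inf catches everything below 6.25
cut.vec    <- c(-Inf, seq(6.25, 26.25, by = 2.5))
recode.val <- c(-1.33, 0.99, 3.31, 2.56, 1.81, 2.78, 3.75, 1.98, 0.21)

# findInterval() returns the interval number, which indexes the rate vector
recode <- function(x) recode.val[findInterval(x, cut.vec)]

# Recode every column and give the new columns a rate_ prefix
rates <- as.data.frame(lapply(temps, recode))
names(rates) <- paste0("rate_", names(temps))
result <- cbind(temps, rates)
result
```

Building the new column names from the old ones avoids hard-coding column positions, so the same code works however many site columns there are.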