I have been searching for an answer to this for a while without much luck so fingers crossed someone can help me!
I am dealing with cyclical data and I am trying to find the values of the two peaks and two troughs in each cycle. These are not necessarily the max/min and second max/min values overall, but rather the max/min and second max/min values conditional on each value being larger/smaller than both the preceding and the subsequent value.
This is an example of one cycle
x <- c(3.049, 3.492, 3.503, 3.429, 3.013, 2.881, 2.29, 1.785, 1.211, 0.890, 0.859, 0.903, 1.165, 1.634, 2.073, 2.477, 3.162, 3.207, 3.177, 2.742, 2.24, 1.827, 1.358, 1.111, 1.063, 1.098, 1.287, 1.596, 2.169, 2.292)
I have thousands of cycles, so I am using group_by in dplyr to group the cycles and then hoped to apply the conditional max/min argument within groups.
I would appreciate any advice with this,
Thanks in advance
Edit
I have since used the below function with just a slight edit on the last line
return(data.frame(Data.value=x, Time=y, Date=z,HHT=peak, LLT=trough))
where x is my original x above, y is a time var and z is a date var. This allowed me to do some extra calculations on the results (I needed the time at which the value was min/max as well as the value itself).
So now I have a data frame with everything I need, but it is only for one date. I still can't get this to run through the whole dataset using the group_by function. I have tried subsetting by date using
subsets<-split(data, data$datevar, drop=TRUE)
But still need a way to somehow run the findminmax function (and my few extra lines of code) for each subset. Any ideas?
Consider the following custom function, which you can call inside a dplyr group_by() pipeline. Essentially, the function iterates through the vector of cyclical values and compares each value with the neighbors before and after it. A peak has both neighbors lower than itself, and a trough has both neighbors larger than itself.
findminmax <- function(x){
  peak <- list(NA, NA)            # INITIALIZE TEMP LISTS AND ITERATORS
  p <- 1
  trough <- list(NA, NA)
  t <- 1
  for (i in seq_along(x)){
    if (i != 1 & i != length(x)){ # LEAVES OUT FIRST AND LAST VALUES
      if ((x[i] > x[i-1]) & (x[i] > x[i+1])) { # COMPARES IF GREATER THAN NEIGHBORS
        peak[p] <- x[i]
        p <- p + 1
      }
      if ((x[i] < x[i-1]) & (x[i] < x[i+1])){ # COMPARES IF LESS THAN NEIGHBORS
        trough[t] <- x[i]
        t <- t + 1
      }
    }
  }
  return(list(peak1=peak[[1]], peak2=peak[[2]],
              trough1=trough[[1]], trough2=trough[[2]]))
}
result <- findminmax(x)
#$peak1
#[1] 3.503
#$peak2
#[1] 3.207
#$trough1
#[1] 0.859
#$trough2
#[1] 1.063
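As a cross-check on the loop, the same neighbor comparison can be sketched without an explicit loop. This is only an alternative formulation, not the function above: it assumes strict inequalities (ties or plateaus are not treated as peaks or troughs), and the names turn, peaks and troughs are made up for illustration.

```r
# Example cycle from the question
x <- c(3.049, 3.492, 3.503, 3.429, 3.013, 2.881, 2.29, 1.785, 1.211, 0.890,
       0.859, 0.903, 1.165, 1.634, 2.073, 2.477, 3.162, 3.207, 3.177, 2.742,
       2.24, 1.827, 1.358, 1.111, 1.063, 1.098, 1.287, 1.596, 2.169, 2.292)

# sign(diff(x)) is +1 while the series rises and -1 while it falls.
# Differencing again gives -2 where a rise turns into a fall (a peak)
# and +2 where a fall turns into a rise (a trough).
turn <- diff(sign(diff(x)))

# "+ 1" shifts from the position in 'turn' back to the position in 'x'
peaks   <- x[which(turn == -2) + 1]
troughs <- x[which(turn ==  2) + 1]

print(peaks)
print(troughs)
```

The first and last values are never reported, because a turning point needs a neighbor on both sides, which matches the loop's treatment of the end points.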
For dplyr's group_by:
finaldf <- originaldf %>%
  group_by(z) %>%
  summarise(Time = mean(y),
            HHT1 = findminmax(x)$peak1,
            HHT2 = findminmax(x)$peak2,
            LLT1 = findminmax(x)$trough1,
            LLT2 = findminmax(x)$trough2)
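For the split() route mentioned in the question's edit, one possible base-R sketch is to lapply over the subsets and bind the results back together. Everything here is illustrative: the data frame dat, its columns datevar and value, and the two made-up dates are assumptions, and findminmax is redefined as a compact stand-in so that the block runs on its own.

```r
# Compact stand-in for findminmax(): first two peaks and troughs of one cycle.
findminmax <- function(x) {
  turn    <- diff(sign(diff(x)))        # -2 at a peak, +2 at a trough
  peaks   <- x[which(turn == -2) + 1]
  troughs <- x[which(turn ==  2) + 1]
  # Indexing past the end yields NA, so short cycles still return four values.
  list(peak1 = peaks[1], peak2 = peaks[2],
       trough1 = troughs[1], trough2 = troughs[2])
}

cycle <- c(3.049, 3.492, 3.503, 3.429, 3.013, 2.881, 2.29, 1.785, 1.211, 0.890,
           0.859, 0.903, 1.165, 1.634, 2.073, 2.477, 3.162, 3.207, 3.177, 2.742,
           2.24, 1.827, 1.358, 1.111, 1.063, 1.098, 1.287, 1.596, 2.169, 2.292)

# Hypothetical data: two dates, the second one shifted up by 1
dat <- data.frame(datevar = rep(c("day1", "day2"), each = length(cycle)),
                  value   = c(cycle, cycle + 1))

subsets  <- split(dat, dat$datevar, drop = TRUE)
per_date <- lapply(subsets, function(d) as.data.frame(findminmax(d$value)))
result   <- do.call(rbind, per_date)    # one row per date
print(result)
```

Any extra per-date calculations (for example, looking up the time at which each peak occurs) could go inside the function passed to lapply, since it receives the whole subset d.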
For my program, I have to write a script that identifies clusters based on the distance "locuslim", which is the maximum distance allowed. I got a slow for-loop working as desired, but I have some trouble vectorizing it to optimize the runtime.
"s" is the iterator for all strains being compared and "i" is to iterate over rows. Col is to create new columns, each strain getting a new column.
Especially the ifelse seems to be the issue...
The for-loop (as function):
clusterfunc <- function (TABLE11_4){
for (s in levels(Strains)){
message(s) # Just to see the progress while running
locus <- min(TABLE11_4$dist[TABLE11_4$dist>0 & TABLE11_4$Organism.q==s], na.rm=TRUE) # The minimal distance between two items
locuslim <- 3*locus # Maximal distance allowed
col <- paste("Cluster",s,sep=".")
TABLE11_4 <- TABLE11_4 %>% group_by(Organism.h) %>% arrange(Rev.gr) %>% mutate(col=ifelse(lead(dist)<=locuslim,1,2))
TABLE11_4 <- TABLE11_4 %>% group_by() %>% arrange(Rev.gr)
TABLE11_4$dist[is.na(TABLE11_4$dist)] <- 0 # Eliminate NAs by making them 0
c=1
for(i in 1:nrow(TABLE11_4)){
if(TABLE11_4$dist[i]<=locuslim){
TABLE11_4$col[i] <- c # Set clusternumber it belongs to
}
else{
TABLE11_4$col[i] <- c+1 # Set next clusternumber it belongs to
c <- c+1 # Go to next cluster
}
}
setnames(TABLE11_4,"col",col)
}
return(TABLE11_4)
}
Thank you in advance!
Load library and sample data:
library(MASS)
View(Cars93)
Cars93$ID=1:93
Now I want to subset Cars93 so that the new data frames (sub0l and sub0h) keep all IDs and all columns, but retain only the top 10% of values (for sub0h) or the lowest 10% of values (for sub0l) in columns 17:25. The remaining values (the 11th to 100th percentiles for sub0l and the 0th to 90th percentiles for sub0h) can be changed to NA.
Here is my attempt to create two dfs with top ten% or lowest ten% values from columns 17:25:
sub0l <- do.call(rbind,by (Cars93,Cars93$ID,FUN= function(x)
subset(Cars93, (Cars93[,17:25] <= quantile(Cars93[,17:25], probs= .10)))))
sub0h <- do.call(rbind,by (Cars93,Cars93$ID,FUN= function(x)
subset(Cars93, (Cars93[,17:25] >= quantile(Cars93[,17:25], probs= .91)))))
I get an error while subsetting the top and lowest 10% of the columns:
Error in `[.data.frame`(Cars93, ,17:25) : undefined columns selected
Called from: `[.data.frame`(Cars93, ,17:25)
Any better alternative?
I think the following returns what you are looking for
sub0l <- cbind(Cars93[,1:16], sapply(Cars93[,17:25],
function(i) ifelse(i > quantile(i, probs=0.1, na.rm=T) | is.na(i), NA, i)))
sub0h <- cbind(Cars93[,1:16], sapply(Cars93[,17:25],
function(i) ifelse(i < quantile(i, probs=0.91, na.rm=T) | is.na(i), NA, i)))
The sapply function loops through each variable (column) of the data.frame and applies the anonymous function to it. On each pass, the anonymous function receives the column as a vector through its argument i, computes that column's quantile, and passes the comparison to ifelse. ifelse looks at each element of i and checks whether it passes the test. If the element passes the test, it is assigned NA; if it fails, its original value is returned. This process works well for variables that are numeric.
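To see the mechanics on something small, here is a toy version of the same call. The data frame toy and its columns a and b are invented for illustration; only the sapply/ifelse/quantile pattern is taken from the answer.

```r
# Hypothetical toy data: two numeric columns, one with a missing value
toy <- data.frame(a = 1:10,
                  b = c(5, 3, NA, 8, 1, 9, 2, 7, 4, 6))

# Keep only values at or below each column's 10% quantile; everything else -> NA
low <- sapply(toy, function(i) {
  ifelse(i > quantile(i, probs = 0.1, na.rm = TRUE) | is.na(i), NA, i)
})

print(low)
```

With R's default quantile type, the cut-off is 1.9 for a and 1.8 for b, so only the value 1 survives in each column and every other cell becomes NA. Because all columns are numeric, sapply returns a numeric matrix with one column per variable.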
If some of the variables are not numeric, then you can add an additional check in the sapply functions as below:
sub0l <- cbind(Cars93[,1:16],
sapply(Cars93[,17:25],
function(i) {
if(is.numeric(i)) {
ifelse(i > quantile(i, probs=0.1, na.rm=T) | is.na(i), NA, i)
}
else i
}))
sub0h <- cbind(Cars93[,1:16],
sapply(Cars93[,17:25],
function(i) {
if(is.numeric(i)) {
ifelse(i < quantile(i, probs=0.91, na.rm=T) | is.na(i), NA, i)
}
else i
}))
Before beginning the operation described above, the anonymous function checks whether the vector i is numeric (in R, is.numeric is TRUE for vectors of type double or integer; see ?typeof for a discussion of the core element types in R). If this test fails, the vector is returned unchanged by else i. If the test passes, the process described above begins.
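One caveat worth checking on your own data: when some columns are returned unchanged as character or factor, sapply may simplify the mixed result into a single character matrix. A possible variation, sketched here on invented data (toy, size, kind and keep_low are all hypothetical names), is to use lapply and assign back into a copy of the data frame so that each column keeps its own type.

```r
# Hypothetical toy data with one numeric and one non-numeric column
toy <- data.frame(size = c(1, 5, 9, 3, 7),
                  kind = c("a", "b", "c", "d", "e"),
                  stringsAsFactors = FALSE)

# Same numeric guard as in the answer; the 50% quantile is used here only
# so that the tiny example keeps more than one value.
keep_low <- function(i) {
  if (is.numeric(i)) {
    ifelse(i > quantile(i, probs = 0.5, na.rm = TRUE) | is.na(i), NA, i)
  } else {
    i
  }
}

out <- toy
out[] <- lapply(toy, keep_low)   # out[] keeps the data.frame structure
print(out)
```

Here size keeps 1, 5 and 3 (values at or below the median of 5) and kind is passed through as a character column.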
I have a data frame with geographical locations (two columns) and presence/absence data (0/1). What I need to find out is how many of the other locations lie within the vicinity of each location and have the value 1. That number should then be appended in a fourth column.
xcoords <- c(4487754, 4488111, 4487598, 4487417, 4487949, 4487802, 4488011, 4487638, 4487455, 4487478, 4487141, 4487550, 4487547, 4488010, 4487271, 4487170)
ycoords <- c(5294654, 5294706, 5295490, 5293859, 5295313, 5294910, 5294668, 5295399, 5294537, 5294408, 5294927, 5294617, 5294727, 5294953, 5294168, 5295142)
yesno <- round(runif(16, 0, 1),0)
df <- data.frame(xcoords, ycoords, yesno) # a data frame, so that df$xcoords etc. work below
I set the vicinity to 500 meters (the geodata is in a projected coordinate system, so the units correspond), and will do the calculation using Pythagoras' theorem.
buffer <- 500
What I do know, is how to build nested for-loops. But I don't want my function to look like this:
count.in.buffer <- function(df, buffer){
lon <- df$xcoords
lat <- df$ycoords
count <- rep(0, length(lon))
for(i in 1:length(lon)){ # for every row...
for(j in 1:(length(lon) - 1)){ # ... check all other rows
nolon <- lon[-i]
nolat <- lat[-i]
noyes <- df$yesno[-i] # presence/absence of the other rows
if (sqrt((lon[i] - nolon[j])^2 + (lat[i] - nolat[j])^2) < buffer && noyes[j] == 1) count[i] <- count[i] + 1
}
}
result <- cbind(df, count)
return(result)
}
That would take ages to be computed, because my data frame actually has 67000 rows.
Instead I would want something more efficient, but I don't understand how to convert the inner for-loop to a function I could use in an apply-function row-wise on the whole data frame. But I do suspect, that an apply-function would be the way to go, right?
P.S.: I solved the problem already with a proper GIS, by creating buffer polygons around each data point and linking the other data points to every buffer polygon's attribute table, if they lie within that buffer. But I think it should be much quicker to do in R.
You could try moving it into an apply statement, and subsetting your data into those that have values.
First, make a new df that only has those with 1s in the yesno (no point calculating distances for those that don't for the summing):
df1 <- df[df[ ,'yesno'] == 1,]
Then we compare each row of df to all of df1 in a vectorised way and sum the counts, subtracting the row's own yesno value so that a point with a 1 does not count itself:
apply(df, 1, function(x){
sum(sqrt((x[1] - df1[ ,1])^2 +
(x[2] - df1[ ,2])^2) < buffer) - x[3]
})
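To make the bookkeeping concrete, here is the same idea on four made-up points where the answer can be checked by hand. The matrix pts, its coordinates and the name ones are assumptions for illustration only.

```r
# Hypothetical points: x, y and a presence flag
pts <- cbind(x     = c(0, 100, 300, 1000),
             y     = c(0,   0, 300, 1000),
             yesno = c(1,   1,   0,    1))
buffer <- 500

# Only rows with yesno == 1 can contribute to a count
ones <- pts[pts[, "yesno"] == 1, , drop = FALSE]

counts <- apply(pts, 1, function(p) {
  d <- sqrt((p[1] - ones[, 1])^2 + (p[2] - ones[, 2])^2)
  # A row with yesno == 1 finds itself at distance 0, so subtract its own flag
  sum(d < buffer) - p[3]
})

result <- cbind(pts, count = as.numeric(counts))
print(result)
```

Points 1 and 2 are 100 units apart, so each has one neighbor with a 1. Point 3 has no 1 itself but lies within 500 units of both of them, so it gets 2. Point 4 is far from everything and gets 0.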
I have a data.frame, ordered by mean column that looks like this:
10SE191_2 10SE207 10SE208 mean
7995783 12.64874 13.06391 12.69378 12.73937
8115327 12.69979 12.52285 12.41582 12.50363
8108370 12.58685 12.87818 12.66021 12.45720
7945680 12.46392 12.26087 11.77040 12.36518
7923547 11.98463 11.96649 12.50666 12.33138
8016718 12.81610 12.71548 12.48164 12.32703
I would like to apply a t.test to each row, using as input the intensity values: df[i,1:3] and the mean values from the rows with lower intensities. For example, for the first row I want to compute a t.test for df[1,1:3] vs _mean values_ from row 2 to row 6. My code uses a for loop but my current data.frame has more than 20,000 rows and 24 columns and it takes a long time. Any ideas for improving the code?
Thanks
Code:
temp <- matrix(-9, nrow=dim(matrix.order)[1], ncol=2) #create a result matrix
l <- dim(matrix.order)[1]
for (i in 1:l){
j <- 1+i
if (i < l | j +2 == l) { #avoid not enough y observations
mean.val <- matrix.order[j:l,4]
p <- t.test(matrix.order[i, 1:3], mean.val)
temp[i,1] <- p$p.value
}
else {temp[i,1] <- 1}
}
dput for my df
structure(list(`10SE191_2` = c(12.6487418898415, 12.6997932097351,12.5868508174491, 12.4639169398277, 11.9846348627906, 12.8160978540904), `10SE207` = c(13.0639063105224, 12.522848114011, 12.8781769160682, 12.260865493177, 11.9664905651469, 12.7154788700468), `10SE208` = c(12.6937808736673, 12.4158248856386, 12.6602128982717, 11.7704045448312, 12.5066604109231, 12.4816357798965), mean = c(12.7393707471856, 12.5036313008127, 12.4572035036992, 12.3651842840775, 12.3313821056582, 12.3270331271091)), .Names = c("10SE191_2", "10SE207", "10SE208", "mean"), row.names = c("7995783", "8115327", "8108370", "7945680", "7923547", "8016718"), class = "data.frame")
You can obtain all p-values (if possible) with this command:
apply(df, 1, function(x) {
y <- df$mean[df$mean < x[4]]
if(length(y) > 1)
t.test(x[1:3], y)$p.value
else NA
})
The function will return NA if there are not enough values for y.
7995783 8115327 8108370 7945680 7923547 8016718
0.08199794 0.15627947 0.04993244 0.50885253 NA NA
Running 2E4 t.tests probably takes a lot of time no matter what. Try using Rprof to find the hot spots. You might also want to use mclapply (from the parallel package) or similar parallel processing tools, since the analysis of each row is independent of the analysis of every other row, which makes this a task well suited to multicore processing.
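A rough sketch of the parallel route, assuming the parallel package that ships with R. The matrix m is simulated purely for illustration, row_p is a made-up helper name, and the core count of 2 is arbitrary. mclapply relies on forking, which is not available on Windows, so the sketch falls back to a single core there.

```r
library(parallel)

# Simulated stand-in for the real data: 20 rows, 3 intensity columns + mean
set.seed(1)
m <- matrix(rnorm(60, mean = 12, sd = 0.3), ncol = 3)
m <- cbind(m, mean = rowMeans(m))
m <- m[order(m[, "mean"], decreasing = TRUE), ]

# p-value for row i against the means of all rows with a lower mean
row_p <- function(i) {
  y <- m[m[, "mean"] < m[i, "mean"], "mean"]
  if (length(y) > 1) t.test(m[i, 1:3], y)$p.value else NA_real_
}

# Forking is unavailable on Windows, so use one core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

p_par <- unlist(mclapply(seq_len(nrow(m)), row_p, mc.cores = cores))
p_ser <- vapply(seq_len(nrow(m)), row_p, numeric(1))  # serial reference

print(head(p_par))
```

Whether this pays off depends on the number of cores and on how expensive each t.test is relative to the overhead of forking, so it is worth timing both versions on a subset first.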
I am trying to write a nested for loop in R, but am running into problems. I have researched as much as possible but can't find (or understand) the help I need. I am fairly new to R, so any advice on this looping would be appreciated, or if there is a simpler, more elegant way!
I have generated a file of daily temperatures for many many locations (I'll call them sites), and the file columns are set up like this:
year month day unix_time site_a site_b site_c site_d ... on and on
For each site (within each column), I want to run through the temperature values and create new columns (or a new data frame) with a number (a physiological rate) that corresponds with a range of those temperatures. (for example, temperatures less than 6.25 degrees have a rate of -1.33, temperatures between 6.25 and 8.75 have a rate of 0.99, etc). I have created a loop that does this for a single column of data. For example:
for(i in 1:dim(data)[1]){
if (data$point_a[i]<6.25) data$rate_point_a[i]<--1.33 else
if (data$point_a[i]>=6.25 && data$point_a[i]<8.75) data$rate_point_a[i]<-0.99 else
if (data$point_a[i]>=8.75 && data$point_a[i]<11.25) data$rate_point_a[i]<-3.31 else
if (data$point_a[i]>=11.25 && data$point_a[i]<13.75) data$rate_point_a[i]<-2.56 else
if (data$point_a[i]>=13.75 && data$point_a[i]<16.25) data$rate_point_a[i]<-1.81 else
if (data$point_a[i]>=16.25 && data$point_a[i]<18.75) data$rate_point_a[i]<-2.78 else
if (data$point_a[i]>=18.75 && data$point_a[i]<21.25) data$rate_point_a[i]<-3.75 else
if (data$point_a[i]>=21.25 && data$point_a[i]<23.75) data$rate_point_a[i]<-1.98 else
if (data$point_a[i]>=23.75 && data$point_a[i]<26.25) data$rate_point_a[i]<-0.21
}
The above code gives me a new column called "rate_point_a" that has my physiological rates. What I am having trouble doing is nesting this loop into another loop that runs through all of the columns. I have tried things such as:
for (i in 1:ncol(data)){
#for each row in that column
for (s in 1:length(data)){
if ([i]<6.25) rate1[s]<--1.33 else ...
I guess I don't know how to make the "if else" statement refer to the correct places. I know that I can't add the "rate" columns onto the existing data frame, as this would increase my ncol as I go through the loop, so need to put them into another data frame (though don't think this is my main issue). I am going to have many many many points to work through and would rather not have to do them one at a time, hence my attempt at a nested loop.
Any help would be much appreciated. Here is a link to some sample data if that is helpful. http://dl.dropbox.com/u/17903768/AVHRR_output.txt Thanks in advance!
Use ifelse which is vectorized:
ifelse(data$point< 6.25,-1.33,ifelse(data$point< 8.75,0.99,ifelse(data$point< 11.25,3.31,.....Until finished.
For instance:
datap=read.table('http://dl.dropbox.com/u/17903768/AVHRR_output.txt',header=T)
apply(datap[,5:9],2,function(x){
  # rates and cut-offs as given in the question; only the first rate is negative
  ifelse(x<6.25,-1.33,
  ifelse(x<8.75,0.99,
  ifelse(x<11.25,3.31,
  ifelse(x<13.75,2.56,
  ifelse(x<16.25,1.81,
  ifelse(x<18.75,2.78,
  ifelse(x<21.25,3.75,
  ifelse(x<23.75,1.98,0.21))))))))})
Andres's answer is great for the apply part to get you through all the "temperature" columns. I'm stuck here without a copy of R (at work) to experiment with, but I suspect if you create a vector of your cutoff values
xcut <- c(0,6.25,8.75,11.25,...
and just do
x <- xcut[(which(x>xcut))]
you'll have a much simpler bit of code, and easier to edit as well. (note: I added the 0 value to avoid problems with small x values :-) )
Here's another way using just logicals:
DAT <- read.table("http://dl.dropbox.com/u/17903768/AVHRR_output.txt",header=TRUE,as.is=TRUE)
recodecolumn <- function(x){
  out <- rep(NA_real_, length(x)) # values outside every range stay NA
  out[x < 6.25] <- -1.33
  out[x >= 6.25 & x < 8.75] <- .99
  out[x >= 8.75 & x < 11.25] <- 3.31
  out[x >= 11.25 & x < 13.75] <- 2.56
  out[x >= 13.75 & x < 16.25] <- 1.81
  out[x >= 16.25 & x < 18.75] <- 2.78
  out[x >= 18.75 & x < 21.25] <- 3.75
  out[x >= 21.25 & x < 23.75] <- 1.98
  out[x >= 23.75 & x < 26.25] <- 0.21
  out
}
NewCols <- apply(DAT[,5:9],2,recodecolumn)
colnames(NewCols) <- paste("rate",1928:1932,sep="_")
DAT <- cbind(DAT,NewCols)
I find that findInterval is useful in situations like this instead of nested if else statements as it is already vectorized and returns the position within a vector of cutoff points.
DAT <- read.table("http://dl.dropbox.com/u/17903768/AVHRR_output.txt",header=TRUE,as.is=TRUE)
recode.fn <- function(x){
cut.vec <- c(-Inf, seq(6.25,26.25,by = 2.5),Inf) # -Inf so that temperatures below 0 still fall in the first interval
recode.val <- c(-1.33, 0.99, 3.31, 2.56,1.81,2.78,3.75,1.98, 0.21)
cut.interval <- findInterval(x, cut.vec, FALSE)
return(recode.val[cut.interval])
}
# Add on recoded data to existing data frame
DAT[,10:14] <- sapply(DAT[,5:9],FUN=recode.fn)
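To check the lookup by hand, here is findInterval applied to a few made-up temperatures. The names breaks, rates and temps are illustrative, and this sketch makes one simplifying assumption that differs from the question: the last interval is left open-ended, so anything at or above 23.75 gets the final rate.

```r
# Cut-offs from the question; -Inf/Inf close the scale at both ends
breaks <- c(-Inf, seq(6.25, 23.75, by = 2.5), Inf)

# One rate per interval between consecutive breaks (values from the question)
rates <- c(-1.33, 0.99, 3.31, 2.56, 1.81, 2.78, 3.75, 1.98, 0.21)

# Hypothetical temperatures, including a negative one and a boundary value
temps <- c(-2, 6.25, 10, 25)

# findInterval returns, for each temperature, the index of the interval
# [breaks[k], breaks[k + 1]) that contains it
idx <- findInterval(temps, breaks)
recoded <- rates[idx]

print(idx)
print(recoded)
```

Intervals are closed on the left by default, so 6.25 lands in the second interval, which matches the >= / < pattern used in the question's loop.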