lapply error: subscript out of bounds - R

I have to make a 4*4 Kohonen map for a project.
However, I get the error
Error in win_index[1, ] : subscript out of bounds
In addition: There were 16 warnings (use warnings() to see them)
After testing my code, I believe the problem is the lapply call (line 77): it does not execute correctly, and the matrix it builds contains only NA. Since that matrix is used later on, the NA values propagate and the final result is wrong.
#############################################
##Function for distance calculation (RMSDA)##
#############################################
RMSDA<-function(data_phipsi,Kohonen_matrix)
{
  difference<-data_phipsi-Kohonen_matrix
  for(j in 1:length(difference)){
    if (difference[j]< -180) {difference[j]=difference[j]+360}
    if (difference[j]> +180) {difference[j]=difference[j]-360}
  }
  distance=mean(sqrt(difference^2))
  return(distance)
}
##############################
## Program ###
##############################
for(step in 1:iteration)
{
  data_phipsi<-data_phipsi[sample(nrow(data_phipsi)),] # Sample training vectors (shuffle the rows of the data frame)
  print(step) # Visualize where we are in the loops
  for(k_row in 1:nrow(data_phipsi))
  {
    # Update learn_rate and radius at each row of each iteration
    learn_rate<-learning(initial_rate,((step-1)*nrow(data_phipsi))+k_row,data_phipsi)
    learn_radius<-learning(initial_rate,((step-1)*nrow(data_phipsi))+k_row,data_phipsi)
    # Find the distance between each vector of angles of the Kohonen map and the training vector
    phipsi_RMSDA<-lapply(random_list, RMSDA, data_phipsi=data_phipsi[k_row,])
  }
How can I fix this lapply error?
Thank you.
Edit: these are the only elements needed for the lapply call.
For the random list we can use:
random_list <- list(
c(88, 148, 60, 83, -119, -59, -96, 169),
c(104, 101, 174, -48, 18, 10, -159, 158),
c(164, -80, 137, -170, -172, 52, -149, 96),
c(88, 18, -115, 48, -3, -158, -92, -154),
c(170, -107, -109, -14, -142, -77, -120, 76),
c(-121, 15, -46, -145, -128, 74, -166, 44),
c(46, -178, 67, -88, -125, -130, 88, -11),
c(131, 147, -32, 103, -16, 116, 78, -125),
c(75, -95, -137, 133, -97, -134, 126, -105),
c(115, 173, -82, -135, 134, 82, -143, -43),
c(111, 13, -54, -53, 103, 132, -13, -43),
c(-143, 89, -91, -137, -63, 14, -166, 83),
c(-98, 178, 14, -80, -122, -25, 19, 117),
c(-113, -97, 34, -178, -56, 18, -167, 84),
c(49, 82, 50, 168, -157, -154, 51, 78),
c(173, -4, 164, 125, 31, 115, -74, -92)
)
and an example of
data_phipsi[1,]
data_phipsi <- read.table(header = TRUE, text = "
phi1 psi2 phy2 psi3 phy3 psi4 phy4 psi5
-24.5 81.9 -155.2 -81.4 127.7 -118 166 -82.1")
data_phipsi
# phi1 psi2 phy2 psi3 phy3 psi4 phy4 psi5
# 1 -24.5 81.9 -155.2 -81.4 127.7 -118 166 -82.1
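For reference, the symptom can be reproduced with just the two objects above (a minimal sketch; nothing else from the script is needed):
res <- lapply(random_list, RMSDA, data_phipsi = data_phipsi[1, ])
str(res) # every element is NA, and each call emits a warning from mean()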

The main issue was that a follow-on command (check back in the question edits for the win_index line of code, if interested) was failing because the results from lapply were suspect.
Troubleshooting that found the problem:
when the error hits, the value of both random_list and data_phipsi[k_row,] were reasonable and not suspect;
however, the output from lapply was all NA
So diving into the function RMSDA, it was noticed that difference looked fine up until
distance=mean(sqrt(difference^2))
at which point it turned all-NA. The problem is that the data was still a data.frame, so mean was failing (just try mean(mtcars) to see). (It should also be noted that sqrt(difference^2) is equivalent to abs(difference).)
Replacing that line of RMSDA with
distance=mean(abs(unlist(difference)))
seems to have fixed the problem (the critical part being unlist).
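For completeness, the function with only that line changed (a sketch based on the code in the question):
RMSDA <- function(data_phipsi, Kohonen_matrix)
{
  difference <- data_phipsi - Kohonen_matrix
  for (j in 1:length(difference)) {
    if (difference[j] < -180) {difference[j] = difference[j] + 360}
    if (difference[j] > +180) {difference[j] = difference[j] - 360}
  }
  # unlist() turns the one-row data.frame into a numeric vector, so mean() works
  distance = mean(abs(unlist(difference)))
  return(distance)
}
sapply(random_list, RMSDA, data_phipsi = data_phipsi[1, ]) # now returns numeric distances, no NA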

Error: Column indexes must be at most 1 if... heatmap.2

I received an error from heatmap.2, and I found a similar error here (R: knnImputation Giving Error), but that question doesn't have an answer yet. I couldn't understand the problem. I am new to this, so apologies in advance for any mistakes.
I have a data frame df with 144 rows and 177 columns, containing the monthly averages for the years 2005-2016 produced by the timeAverage function of the openair package.
Here is a small example from df:
date Year Month Adana-Catalan Adana-Dogankent Adana-Meteoroloji
2008/09/01 2008 9 NaN NaN NaN
2008/10/01 2008 10 NaN NaN 1.7948718
2008/11/01 2008 11 NaN NaN 2.0909091
2008/12/01 2008 12 1.2694064 12.2384106 0.7272727
2009/01/01 2009 1 2.3150358 12.7479339 10.3779762
2009/02/01 2009 2 2.8241107 18.4320175 2.4494949
2009/03/01 2009 3 2.0401606 8.4597523 1.6529412
2009/04/01 2009 4 1.8604651 4.8560000 1.1267606
2009/05/01 2009 5 2.1087719 1.8202247 NaN
2009/06/01 2009 6 4.0695103 2.1463415 1.1111111
2009/07/01 2009 7 5.4016393 8.1298905 NaN
2009/08/01 2009 8 0.1313869 16.9874411 NaN
2009/09/01 2009 9 NaN 5.3753943 NaN
2009/10/01 2009 10 1.6626506 8.8000000 1.8388889
2009/11/01 2009 11 1.4177632 NaN 3.9879154
2009/12/01 2009 12 0.9644128 NaN 5.0281457
2010/01/01 2010 1 0.2608696 4.0898876 3.1981424
2010/02/01 2010 2 0.7619048 NaN 4.3169811
remove non-numeric columns:
df.monthly <- df[,-c(1:3)] #remove non-numeric columns
df.monthly.un <- unlist(df.monthly) #unlist the list
df.monthly.un[is.nan(df.monthly.un)] <- -999 #replace NaNs with -999
monthly.dim <- dim(df.monthly)
monthly.frame <- matrix(df.monthly.un, monthly.dim) #convert unlist to matrix
Then I calculated the distance matrices and produced dendrograms. Finally, I used heatmap.2 to produce a heatmap with the dendrograms.
monthly.dist <- dist(monthly.frame)
monthly.hclust <- hclust(monthly.dist, method="complete")
monthly.dist2 <- dist(t(monthly.frame))
colClust <- as.dendrogram(hclust(monthly.dist2, method="complete"))
rowClust <- as.dendrogram(monthly.hclust)
colpalette <- colorRampPalette(c("red","blue","green"))(n=100)
heatmap.2(monthly.frame, scale="none",
col=colpalette, trace= "none", cexRow=0.6, cexCol=1,
cex.main=0.7, key=T, Rowv=rowClust, labRow=df[,1],
main=expression("2005-2016 SO"[2] * " (ug/m"^3*")"))
However, when I run the code, it gives the error:
Error: Column indexes must be at most 1 if positive, not 22, 23, 24, 25, 21, 18, 19, 20, 16, 17, 12, 10, 11, 15, 13, 14, 3, 9, 8, 4, 7, 5, 6, 2, 124, 125, 121, 122, 123, 133, 132, 131, 134, 135, 126, 129, 127, 128, 130, 136, 137, 143, 144, 141, 142, 138, 139, 140, 57, 58, 55, 56, 42, 47, 41, 40, 36, 38, 37, 39, 46, 43, 44, 45, 34, 35, 26, 27, 28, 29, 30, 31, 32, 33, 59, 54, 53, 48, 49, 50, 51, 112, 116, 117, 114, 115, 88, 89, 52, 60, 63, 70, 75, 73, 74, 79, 77, 76, 78, 66, 67, 62, 65, 71, 64, 61, 72, 97, 87, 85, 86, 90, 98, 91, 83, 84, 92, 94, 96, 93, 95, 68, 69, 82, 80, 81, 113, 110, 111, 109, 118, 119, 120, 101, 105, 103, 104, 99, 106, 100, 102, 107, 108
Any idea why this error occurs? Thanks in advance!
This link shows you how to do KNN another way:
https://www.youtube.com/watch?v=u8XvfhBdbMw
Also, I don't understand why knnImputation(data) wouldn't work - although after playing around with the data frame it now does work, even though I don't know why.
What I did was:
mydata <- read_excel("icecreamdata.xlsx") #Here I'm importing my data
newdata <- data.frame() #I've created a blank data frame
newdata <- data.frame(mydata) #I'm putting the data I want into this new data frame
anyNA(newdata) #Checking for missing data. TRUE = yes, data is missing. FALSE = no, data is not missing.
fixeddata <- knnImputation(newdata) #Imputing data to a new object
anyNA(fixeddata) #FALSE = there is now no missing data
Both work, but I'd also like to know from the experts why we got the error: column indexes must be at most 1 if positive etc.
The primary explanation for the error reported can be found here.
I ran into this problem today and found that you should convert the tbl object into a data.frame. It is an annoying incompatibility between packages.
#Check the class of your df; I suspect your df is actually a tbl object
class(df)
df_new <- as.data.frame(df)
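A small illustration of the difference, using hypothetical toy data (a sketch of why the conversion helps): single-bracket subsetting selects columns of a tbl but elements of a plain vector, which is most likely what breaks heatmap.2's internal reordering of labRow.
library(tibble)
tb  <- tibble(site = c("A", "B", "C"))
df2 <- as.data.frame(tb)
df2[, 1]             # a plain character vector: "A" "B" "C"
tb[, 1]              # still a one-column tibble, not a vector
df2[, 1][c(2, 3, 1)] # reordering labels works on the vector
# tb[, 1][c(2, 3, 1)] would ask the tibble for columns 2, 3 and 1 -> subscript error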

Simulate data from a Gompertz curve in R

I have a set of data that I have collected which consists of a time series, where each y-value is found by taking the mean of 30 samples of grape cluster weight.
I want to simulate more data from this, with the same number of x and y values, so that I can carry out some Bayesian analysis to find the posterior distribution of the data.
I have the data, and I know that the growth follows a Gompertz curve with formula:
y = a*exp(-exp(-(x - x0)/b)), with a = 88.8, b = 11.7, and x0 = 15.1.
The data I have is
x = c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)
y = c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165).
Any help would be appreciated, thank you.
*Will edit when more information is given*
I am a little confused by your question. I have put what you have written into R; please elaborate so that I can help you:
gompertz <- function(x, x0, a, b){
  a*exp(-exp(-(x-x0)/b))
}
y = c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165) # means of 30 samples of grape cluster weights?
x = c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112) # ?
#??
gompertz(x, x0 = 15.1, a = 88.8, b = 11.7)
gompertz(y, x0 = 15.1, a = 88.8, b = 11.7)
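If the goal is to generate new simulated data sets of the same length, one simple option is to evaluate the curve at your x values and add random noise (a sketch of my own; the Gaussian noise and the value of sigma below are assumptions, not estimated from your data):
set.seed(1)
x  <- c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)
mu <- gompertz(x, x0 = 15.1, a = 88.8, b = 11.7)  # curve value at each x
sigma <- 10                                       # assumed residual standard deviation
y_sim <- rnorm(length(x), mean = mu, sd = sigma)  # one simulated data set
plot(x, y_sim, ylab = "simulated y")
lines(x, mu)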

How to predict new raster using model generated by cforest

I use a randomForest model to predict class memberships. 'x' consists of 10 classes, which I use together with 'training_predictors', the values extracted from a large RasterStack/brick, to train the model. The specific line of code is:
r_tree<-randomForest(x ~. , data=training_predictors, ...)
Then I run 'predict' using the model 'r_tree' that I apply to the rasterstack 'predictor_data', as follow:
predictions<-predict(predictor_data, r_tree, filename=outraster, fun=predict, na.rm=TRUE, format="PCDISK", overwrite=TRUE, progress="text", type="response")
The output is a raster that I use as thematic map.
I would like to use the conditional inference tree model 'cforest' instead of randomForest to achieve the same goal.
I understand that 'predict' can be used with cforest, yet I have not been able to generate raster files like those produced with randomForest as illustrated above.
It should run fine, but you may need to add the argument OOB=TRUE, and identify factors if there are any.
Example data
library(raster)
# the example RasterBrick from the raster package (also used in ?raster::predict)
logo <- brick(system.file("external/rlogo.grd", package="raster"))
p <- matrix(c(48, 48, 48, 53, 50, 46, 54, 70, 84, 85, 74, 84, 95, 85,
66, 42, 26, 4, 19, 17, 7, 14, 26, 29, 39, 45, 51, 56, 46, 38, 31,
22, 34, 60, 70, 73, 63, 46, 43, 28), ncol=2)
a <- matrix(c(22, 33, 64, 85, 92, 94, 59, 27, 30, 64, 60, 33, 31, 9,
99, 67, 15, 5, 4, 30, 8, 37, 42, 27, 19, 69, 60, 73, 3, 5, 21,
37, 52, 70, 74, 9, 13, 4, 17, 47), ncol=2)
# extract values for points
xy <- rbind(cbind(1, p), cbind(0, a))
v <- data.frame(cbind(xy[,1], extract(logo, xy[,2:3])))
colnames(v)[1] <- 'pa'
Basic model
library(party)
m1 <- cforest(pa~., control=cforest_unbiased(mtry=3), data=v)
pc1 <- predict(logo, m1, OOB=TRUE)
plot(pc1)
Model with factors
v$red <- as.factor(round(v$red/100))
logo$red <- round(logo[[1]]/100)
m2 <- cforest(pa~., control=cforest_unbiased(mtry=3), data=v)
f <- list(levels(v$red))
names(f) <- 'red'
pc2 <- predict(logo, m2, OOB=TRUE, factors=f)
plot(pc2)
By the way, this comes almost straight out of the help file of raster::predict

Simulate vectors conditional on custom distribution

I am measuring the duration of episodes (vector ep.dur, in minutes) per day, over an observation period of T=364 days. The vector ep.dur has length(ep.dur) equal to T=364, with zeros on days when no episode occurred, and range(ep.dur) is between 0 and 1440.
The sum of the episode durations over the period T is a<-sum(ep.dur).
Now I have a vector den, with length(den)=99. The vector den shows how many days are required for the development of each 1% (1%, 2%, 3%, ...) of a.
Now, given den and a, I would like to simulate multiple ep.dur vectors.
Is this possible?
Clarification 1 (first comment of danas.zuokas): the elements of den represent durations, NOT exact days. That means, for Example 1, that 1% (=1195.08) of a is developed in 1 day, 2% in 2 days, 3% in 3 days, 4% in 4 days, 5% in 5 days, 6% in 5 days, and so on. The episodes can take place anywhere in T.
Clarification 2 (second comment of danas.zuokas): unfortunately, no assumptions can be made about how the duration develops. That is why I have to simulate numerous ep.dur vectors. HOWEVER, I can expand the den vector to a finer resolution (0.1% jumps instead of 1% jumps) if that is of any help.
Description of the algorithm
The algorithm should satisfy all the information the den vector provides. I imagine it working as follows (using Example 3):
Each 1% jump of a is 335.46 min. den[1] tells us that 1% of a is developed in 1 day, so let's say we generate ep.dur[1]=335.46. OK. We go to den[2]: 2% of a is developed in den[2]=1 days. So ep.dur[1] cannot be 335.46 and is rejected (2% of a should also occur within one day). Let's say we had generated ep.dur[1]=1440 instead. den[1] is satisfied, den[2] is satisfied (at least 2% of the total duration is developed in den[2]=1 days), and den[3]=1 is also satisfied. A keeper? However, den[4]=2 is not satisfied if ep.dur[1]=1440, because it states that 4% of a (=1341.84) should occur in 2 days, not 1. So ep.dur[1] is rejected. Now let's say ep.dur[1]=1200: den[1:3] are accepted. Then we generate ep.dur[2], and so on, making sure that the generated ep.dur always satisfies the information provided by den.
Is this programmatically feasible? I really do not know where to start with this problem. I will provide a generous bounty once the bounty start period is over.
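To make the acceptance rule concrete, a small checker could look like this (a sketch, not part of the question; it assumes a and den as given in the examples below, and interprets den[i] as the first day by which i% of a has been reached):
satisfies_den <- function(ep.dur, a, den) {
  cum <- cumsum(ep.dur)
  targets <- (seq_along(den) / 100) * a        # i% of a
  reached_on_time <- cum[den] >= targets       # i% reached by day den[i]
  prev <- ifelse(den > 1, cum[pmax(den - 1, 1)], 0)
  not_too_early <- prev < targets              # ...but not already reached the day before
  all(reached_on_time) && all(not_too_early)
}
# A naive (and very slow) rejection sampler would draw candidate ep.dur vectors and
# keep only those for which satisfies_den(ep.dur, a, den) is TRUE; the answers below
# take a constructive approach instead.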
Example 1:
a<-119508
den<-c(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15,
16, 17, 18, 19, 20, 20, 21, 22, 23, 24, 25, 25, 26, 27, 28, 29,
30, 30, 31, 32, 33, 34, 35, 35, 36, 37, 38, 39, 40, 40, 41, 42,
43, 44, 45, 45, 46, 47, 48, 49, 50, 50, 51, 52, 53, 54, 55, 55,
56, 57, 58, 59, 60, 60, 61, 62, 63, 64, 65, 65, 66, 67, 68, 69,
70, 70, 71, 72, 73, 74, 75, 75, 76, 77, 78, 79, 80, 80, 81, 82,
83)
Example 2:
a<-78624
den<-c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11,
11, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19, 21, 22, 23,
28, 32, 35, 36, 37, 38, 43, 52, 55, 59, 62, 67, 76, 82, 89, 96,
101, 104, 115, 120, 126, 131, 134, 139, 143, 146, 153, 160, 165,
180, 193, 205, 212, 214, 221, 223, 227, 230, 233, 234, 235, 237,
239, 250, 253, 263, 269, 274, 279, 286, 288, 296, 298, 302, 307,
309, 315, 320, 324, 333, 337, 342, 347, 352)
Example 3
a<-33546
den<-c(1, 1, 1, 2, 4, 6, 8, 9, 12, 15, 17, 21, 25, 29, 31, 34, 37,
42, 45, 46, 51, 52, 56, 57, 58, 59, 63, 69, 69, 71, 76, 80, 81,
87, 93, 95, 102, 107, 108, 108, 112, 112, 118, 123, 124, 127,
132, 132, 132, 135, 136, 137, 150, 152, 162, 166, 169, 171, 174,
176, 178, 184, 189, 190, 193, 197, 198, 198, 201, 202, 203, 214,
218, 219, 223, 225, 227, 238, 240, 246, 248, 251, 254, 255, 257,
259, 260, 277, 282, 284, 285, 287, 288, 290, 294, 297, 321, 322,
342)
Example 4
a<-198132
den<-c(2, 3, 5, 6, 7, 9, 10, 12, 13, 14, 16, 17, 18, 20, 21, 23, 24,
25, 27, 28, 29, 31, 32, 34, 35, 36, 38, 39, 40, 42, 43, 45, 46,
47, 49, 50, 51, 53, 54, 56, 57, 58, 60, 61, 62, 64, 65, 67, 68,
69, 71, 72, 74, 75, 76, 78, 79, 80, 82, 83, 85, 86, 87, 89, 90,
91, 93, 94, 96, 97, 98, 100, 101, 102, 104, 105, 107, 108, 109,
111, 112, 113, 115, 116, 120, 123, 130, 139, 155, 165, 172, 176,
178, 181, 185, 190, 192, 198, 218)
As far as I understand what you're after, I would start by converting den to an rle object. (Here using data from your Example 3)
EDIT: Add 100% at day 364 to den
if(max(den)!=364) den <- c(den, 364)
(rleDen <- rle(den))
# Run Length Encoding
# lengths: int [1:92] 3 1 1 1 1 1 1 1 1 1 ... # 92 intervals
# values : num [1:92] 1 2 4 6 8 9 12 15 17 21 ...
percDur <- rleDen$lengths # Percentage of total duration in each interval
atDay <- rleDen$values # What day that percentage was reached
intWidth <- diff(c(0, atDay), k = 1) # Interval width
durPerDay <- 1440 # Max observation time per day
percPerDay <- durPerDay/a*100 # Max percentage per day
cumPercDur <- cumsum(percDur) # Cumulative percentage in each interval
maxPerInt <- pmin(percPerDay * diff(c(0, atDay), 1),
percDur + 1) # Max percent observation per interval
set.seed(1)
nsims <- 10 # Desired number of simulations
sampMat <- matrix(0, ncol = length(percDur), nrow = nsims) # Matrix to hold sim results
To allow for randomness while considering the limitation of a maximum 1440 minutes of observation per day, check to see if there are any long intervals (i.e., any intervals in which the jump in percentage cannot be completely achieved in that interval)
if(any(percDur > maxPerInt)){
  longDays <- percDur > maxPerInt
  morePerInt <- maxPerInt - percDur
  perEnd <- c(which(diff(longDays,1) < 0), length(longDays))
  # Group intervals into periods bounded by "long" days
  # and determine if there are any long periods (i.e., where
  # the jump in percentage can't be achieved in that period)
  perInd <- rep(seq_along(perEnd), diff(c(0, perEnd)))
  perSums <- tapply(percDur, perInd, sum)
  maxPerPer <- tapply(maxPerInt, perInd, sum)
  longPers <- perSums > maxPerPer
  # If there are long periods, determine, starting with the last period, when the
  # excess can be covered. Each group of periods is recorded in the persToWatch
  # object
  if(any(longPers)) {
    maxLongPer <- perEnd[max(which(longPers))]
    persToWatch <- rep(NA, length(maxLongPer))
    for(kk in rev(seq_len(maxLongPer))) {
      if(kk < maxLongPer && min(persToWatch, na.rm = TRUE) <= kk) next
      theSums <- cumsum(morePerInt[order(seq_len(kk),
                                         decreasing = TRUE)])
      above0 <- which(rev(theSums) > 0)
      persToWatch[kk] <- max(above0[which(!perInd[above0] %in% c(perInd[kk],
                             which(longPers)) & !above0 %in% which(longDays))])
    }
  }
}
Now we can start the randomness. The first component of the sampling determines the overall proportion of a that occurs in each of the intervals. How much? Let runif decide. The upper and lower limits must reflect the maximum observation time per day and the excess amount of any long days and periods
for(jj in seq_along(percDur[-1])) {
  upperBound <- pmin(sampMat[, jj] + maxPerInt[jj],
                     cumPercDur[jj] + 1)
  lowerBound <- cumPercDur[jj]
  # If there are long days, determine the interval over which the
  # excess observation time may be spread
  if(any(percDur > maxPerInt) && any(which(longDays) >= jj)) {
    curLongDay <- max(which(perInd %in% perInd[jj]))
    prevLongDay <- max(0, min(which(!longDays)[which(!longDays) <= jj]))
    curInt <- prevLongDay : curLongDay
    # If there are also long periods, determine how much excess observation time there is
    if(any(longPers) && maxLongPer >= jj) {
      curLongPerHigh <- min(which(!is.na(persToWatch))[
        which(!is.na(persToWatch)) >= jj])
      curLongPerLow <- persToWatch[curLongPerHigh]
      longInt <- curLongPerLow : curLongPerHigh
      curExtra <- max(0,
                      cumPercDur[curLongPerHigh] -
                        sum(maxPerInt[longInt[longInt > jj]]) -
                        sampMat[, jj, drop = FALSE])
    } else {
      curExtra <- cumPercDur[curLongDay] -
        (sum(maxPerInt[curInt[curInt > jj]]) +
           sampMat[, jj, drop = FALSE])
    }
    # Set the lower limit for runif appropriately
    lowerBound <- sampMat[, jj, drop = FALSE] + curExtra
  }
  # There may be tolerance errors when the observations are tightly
  # packed
  if(any(lowerBound - upperBound > 0)) {
    if(all((lowerBound - upperBound) <= .Machine$double.eps*2*32)) {
      upperBound <- pmax(lowerBound, upperBound)
    } else {
      stop("\nUpper and lower bounds are on the wrong side of each other\n",
           jj, max(lowerBound - upperBound))
    }
  }
  sampMat[, jj + 1] <- runif(nsims, lowerBound, upperBound)
}
Then add 100 percent to the end of the results and calculate the interval-specific percentage
sampMat2 <- cbind(sampMat[, seq_along(percDur)], 100)
sampPercDiff <- t(apply(sampMat2, 1, diff, k = 1))
The second component of the randomness determines the distribution of sampPercDiff over the interval widths intWidth. This still requires more thought in my opinion. For instance, how long does a typical episode last compared to the unit of time under consideration?
For each interval, determine if the random percentage needs to be allocated over multiple time units (in this case days). EDIT: Changed the following code to limit percentage increase when intWidth > 1.
library(foreach)
ep.dur <- foreach(ii = seq_along(intWidth), .combine = cbind) %do% {
  if(intWidth[ii] == 1){
    ret <- sampPercDiff[, ii, drop = FALSE] * a / 100
    dimnames(ret) <- list(NULL, atDay[ii])
    ret
  } else {
    theDist <- matrix(numeric(0), ncol = intWidth[ii], nrow = nsims)
    for(jj in seq_len(intWidth[ii] - 1)){
      theDist[, jj] <- floor(runif(nsims, 0, pmax(0,
        min(sampPercDiff[, ii], floor(sampMat2[, ii + 1]) - .Machine$double.eps -
          sampMat2[, ii]) * a / 100 - rowSums(theDist, na.rm = TRUE))))
    }
    theDist[, intWidth[ii]] <- sampPercDiff[, ii] * a / 100 - rowSums(theDist,
                                                                      na.rm = TRUE)
    distOrder <- replicate(nsims, c(sample.int(intWidth[ii] - 1),
                                    intWidth[ii]), simplify = FALSE)
    ret <- lapply(seq_len(nrow(theDist)), function(x) {
      theDist[x, order(distOrder[[x]])]
    })
    ans <- do.call(rbind, ret)
    dimnames(ans) <- list(NULL, atDay[ii] - ((intWidth[ii]:1) - 1))
    ans
  }
}
The duration time is sampled randomly for each time unit (day) in the interval to which it is to be distributed. After breaking up the total duration into daily observed times, these are then assigned randomly to the days in the interval.
Then, multiply the sampled and distributed percentages by a and divide by 100
ep.dur[1, 1 : 6]
# 1 2 3 4 5 6
# 1095.4475 315.4887 1.0000 578.9200 13.0000 170.6224
ncol(ep.dur)
# [1] 364
apply(ep.dur, 1, function(x) length(which(x == 0)))
# [1] 131 133 132 117 127 116 139 124 124 129
rowSums(ep.dur)/a
# [1] 1 1 1 1 1 1 1 1 1 1
plot(ep.dur[1, ], type = "h", ylab = "obs time")
I would most probably do this with a Ruby script, but it can be done in R too. I am not sure whether this is a homework problem or not. To answer your question: can this be done programmatically? Yes, of course!
For your problem, my solution is to define the minimum and maximum limits within which to randomly pick a percentage that satisfies the conditions given by the den vector and the value of a.
Since the den vector only contains values up to 99%, we cannot be sure when the 100% mark is going to happen. This splits my solution into 3 parts: 1) for the given den vector up to 98%, 2) for the 99%, and 3) beyond 99%. I could define another function and put the code common to all 3 parts in it, but I haven't done so.
Since I use the runif command to generate random numbers, it is unlikely to generate the exact lower-limit value. Hence, I have defined a threshold value: if the draw falls below it, I set that day to 0. You can keep this or remove it. Also, if you consider Example 4, the first 1% happens on the 2nd day, which means the 1st day can contain at most 0.999999% of the episodes, with the 1% mark reached on the 2nd day. This is why the maximum limit is defined by subtracting a smallestdiff value, which can be changed.
FindMinutes=function(a,den){
  if (a>1440*364){
    print("Invalid value for a")
    return("Invalid value for a")
  }
  threshold=1E-7
  smallestdiff=1E-6
  sum_perc=0.0
  start=1   #day 1
  min=0     #minimum percentage value for a day
  max=0     #maximum percentage value for a day
  days=rep(c(0),364)    #day vector with percentage of minutes - initialized to 0
  maxperc=1440*100/a    #maximum percentage wrt 1440 minutes/day
  #############################################################
  #############################################################
  ############ For the length of den vector ###################
  for (i in 1:length(den)){
    if (den[i]>start){
      min=(i-1)-sum_perc
      for(j in start:(den[i]-1)){   #number of days in-between
        if (j>start){ min=0 }
        if (i-smallestdiff-sum_perc>=maxperc){
          max=maxperc
          if ((i-smallestdiff-sum_perc)/(den[i]-j)>=maxperc){
            min=maxperc
          }else{
            if ((i-smallestdiff-sum_perc)/(den[i]-j-1)<maxperc){
              min=maxperc-(i-smallestdiff-sum_perc)/(den[i]-j-1)
            }else{
              min=maxperc
            }
          }
        }else{
          max=i-smallestdiff-sum_perc
        }
        if ((r=runif(1,min,max))>=threshold){
          days[j]=r
          sum_perc=sum_perc+days[j]
        }else{
          days[j]=0.0
        }
      }
      start=den[i]
    }
  }
  #############################################################
  #############################################################
  ##################### For the 99% ###########################
  min=99-sum_perc
  for(j in start:den[length(den)]){
    if (j>start){
      min=0
    }
    max=100-sum_perc
    if (100-sum_perc>=maxperc){
      max=maxperc
      if ((100-sum_perc)/(364+1-j)>=maxperc){
        min=maxperc
      }else{
        if ((100-sum_perc)/(364-j)<maxperc){
          min=maxperc-(100-sum_perc)/(364-j)
        }else{
          min=maxperc
        }
      }
    }else{
      max=100-sum_perc
    }
    if ((r=runif(1,min,max))>=threshold){
      days[j]=r
      sum_perc=sum_perc+days[j]
    }else{
      days[j]=0.0
    }
  }
  #############################################################
  #############################################################
  ##################### For the remaining 1% ##################
  min=0
  for(j in (den[length(den)]+1):364){   #days after the last entry of den
    max=100-sum_perc
    if (j==364){
      min=max
      days[j]=min
    }else{
      if (100-sum_perc>maxperc){
        max=maxperc
        if ((100-sum_perc)/(364+1-j)>=maxperc){
          min=maxperc
        }else{
          if ((100-sum_perc)/(364-j)<maxperc){
            min=maxperc-(100-sum_perc)/(364-j)
          }else{
            min=maxperc
          }
        }
      }else{
        max=100-sum_perc
      }
      if ((r=runif(1,min,max))>=threshold){
        days[j]=r
      }else{
        days[j]=0.0
      }
    }
    sum_perc=sum_perc+days[j]
    if (sum_perc>=100.00){
      break
    }
  }
  return(days*a/100)   #return as minutes vector corresponding to each of the 364 days
}#function
In my code, I randomly generate percentage values of episodes for each day according to the minimum and maximum limits. The condition (the den vector) holds when you round the percentage values to integers (the days vector), but you might need extra tuning (checking further ahead in the den vector and then re-tuning the minimum percentage) if you want it accurate to a few decimal places. You can also check that sum(FindMinutes(a,den)) is equal to a. If you want to define den in 0.1% steps, you can do so, but you need to change the corresponding equations (in min and max).
As a worst-case example, set a to the maximum value it can take and build a corresponding den vector:
a=1440*364
den<-c(0)
cc=1
for(i in 1:363){
  if (trunc(i*1440*100/(1440*364))==cc){
    den[cc]=i
    cc=cc+1
  }
}
You can run the above example by calling the function: maxexamplemin=FindMinutes(a,den)
and you can check to see that all the days have the maximum minutes of 1440 which is the only possible scenario here.
As an illustration, let me run your example 3:
a<-33546
den<-c(1, 1, 1, 2, 4, 6, 8, 9, 12, 15, 17, 21, 25, 29, 31, 34, 37, 42, 45, 46, 51, 52, 56, 57, 58, 59, 63, 69, 69, 71, 76, 80, 81, 87, 93, 95, 102, 107, 108, 108, 112, 112, 118, 123, 124, 127, 132, 132, 132, 135, 136, 137, 150, 152, 162, 166, 169, 171, 174, 176, 178, 184, 189, 190, 193, 197, 198, 198, 201, 202, 203, 214, 218, 219, 223, 225, 227, 238, 240, 246, 248, 251, 254, 255, 257, 259, 260, 277, 282, 284, 285, 287, 288, 290, 294, 297, 321, 322, 342)
rmin=FindMinutes(a,den)
sum(rmin)
[1] 33546
rmin2=FindMinutes(a,den)
rmin3=FindMinutes(a,den)
plot(rmin,type="h")
par(new=TRUE)
plot(rmin2,col="red",type="h")
par(new=TRUE)
plot(rmin3,col="red",type="h")
and the three superimposed plots are shown below:

Create new dataset removing variables with high inflation factors

I have a dataset of environmental variables I would like to use for a GLMM. I am using the corvif function from the AED package (http://www.highstat.com/Book2/AED_1.0.zip) to identify and remove variables with high inflation factors.
Instead of manually removing one variable at a time from my dataset when it has a GVIF value > 3 (highest value removed first), I would like to know how to write a loop that accomplishes this task automatically, with the result being a new dataset containing only the remaining variables (i.e. those with GVIF values < 3).
Any suggestions for how to approach this problem for a new R user?
Here is my sample data:
WW_Covs <- structure(list(Latitude = c(62.4419, 67.833333, 65.95, 63.72935,
60.966667, 60.266667, 55.660455, 62.216667, 61.3, 61.4, 62.084139,
55.662566, 64.48508, 63.208354, 62.87591, 62.70856, 62.64009,
63.79488, 59.55, 62.84206), BIO_02 = c(87, 82, 75, 70, 77, 70,
59, 84, 84, 79, 85, 60, 91, 87, 74, 74, 76, 70, 76, 74), BIO_03 = c(26,
23, 25, 26, 25, 24, 25, 25, 26, 25, 26, 26, 24, 25, 24, 25, 25,
25, 26, 24), BIO_04 = c(8443, 9219, 7594, 6939, 7928, 7593, 6160,
8317, 8167, 7972, 8323, 6170, 9489, 8578, 7814, 7680, 7904, 7149,
7445, 7803), BIO_05 = c(201, 169, 151, 166, 194, 210, 202, 205,
204, 186, 205, 200, 200, 195, 170, 154, 180, 166, 219, 170),
BIO_06 = c(-131, -183, -144, -102, -107, -75, -26, -119,
-113, -120, -120, -28, -169, -143, -131, -142, -124, -111,
-72, -129), BIO_08 = c(128, 109, 85, 78, 122, 145, 153, 134,
130, 126, 132, 152, 120, 119, 115, 98, 124, 104, 147, 115
), BIO_09 = c(-31, -81, -16, 13, -60, -6, 25, -25, -25, -70,
-25, 23, -56, -39, -47, -60, -39, 8, 0, -46), BIO_12 = c(667,
481, 760, 970, 645, 557, 645, 666, 652, 674, 670, 670, 568,
598, 650, 734, 620, 868, 571, 658), BIO_13 = c(78, 77, 96,
109, 85, 70, 67, 77, 84, 93, 78, 68, 72, 78, 93, 99, 90,
96, 72, 93), BIO_15 = c(23, 40, 25, 21, 36, 30, 21, 24, 28,
34, 24, 22, 28, 29, 34, 32, 36, 22, 30, 34), BIO_19 = c(147,
85, 180, 236, 108, 119, 154, 149, 135, 118, 148, 162, 117,
119, 120, 141, 111, 204, 111, 122)), .Names = c("Latitude",
"BIO_02", "BIO_03", "BIO_04", "BIO_05", "BIO_06", "BIO_08", "BIO_09",
"BIO_12", "BIO_13", "BIO_15", "BIO_19"), row.names = c(1:20), class = "data.frame")
Sample code:
library(AED)
WW_Final <- corvif(WW_Covs)
test <- corvif(WW_Covs)
test[order(-test$GVIF), ]
if(test$GVIF[1,] > 3, # this is where I get stuck...
Here is an algorithm for doing this. I illustrate with the built-in dataset longley, and I also use function vif in package car, rather than using package AED:
It's not pretty, and should be wrapped inside a function, but I leave that as an exercise for the interested reader.
The code:
library(car)
dat <- longley
cutoff <- 2
flag <- TRUE
while(flag){
  fit <- lm(Employed ~ ., data=dat)
  vfit <- vif(fit)
  if(max(vfit) > cutoff){
    dat <- dat[, -which.max(vfit)]
  } else {
    flag <- FALSE
  }
}
print(fit)
print(vfit)
The output:
Call:
lm(formula = Employed ~ ., data = dat)
Coefficients:
(Intercept) Unemployed Armed.Forces
50.66281 0.02265 0.02847
Unemployed Armed.Forces
1.032501 1.032501
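The loop can be wrapped into a reusable function along these lines (a sketch; the function name, arguments, and the use of car::vif instead of AED::corvif are my own choices, and it assumes at least two predictors remain):
library(car)
drop_high_vif <- function(data, response, cutoff = 3) {
  repeat {
    fml  <- reformulate(setdiff(names(data), response), response)
    vfit <- vif(lm(fml, data = data))
    if (max(vfit) <= cutoff) break
    # drop the predictor with the highest VIF and refit
    data <- data[, names(data) != names(which.max(vfit)), drop = FALSE]
  }
  data
}
# e.g. reduced <- drop_high_vif(longley, response = "Employed", cutoff = 2)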
