help with rle command - r

I'm having some trouble with an rle command that is designed to find the point at which participants first reach eight consecutive 1s.
For example, if:
x <- c(0,1,0,1,1,1,1,1,1,1,1,1)
I want it to return a value of 11 (the position of the eighth consecutive 1).
Thanks to DWin, I've been using this piece of code:
which(rle(x)$values == 1 & rle(x)$lengths >= 8)
sum(rle(x)$lengths[ 1:(min(which(rle(x)$lengths >= 8))-1) ]) + 8
I've been using this code successfully to process my data. However, I noticed that it made a mistake when processing one of my data files.
For example, if
x <- c(1,1,1,1,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
the code returns 19, which is actually the point at which eight contiguous zeros are reached. I'm not sure what is going wrong or how to fix it.
thanks in advance for your help.
Will

You need to paste the first line of code in its entirety into the second:
sum(rle(x)$lengths[ 1:(min(which( rle(x)$values==1 & rle(x)$lengths >= 8))-1) ]) + 8
[1] 39
However, here is another approach, using the function filter. This yields the same result in what I consider to be much more readable code:
which(filter(x, rep(1/8, 8), sides=1) == 1)[1]
[1] 39
Used in this way, filter essentially computes a moving average over a window of 8 values in the vector; that average equals 1 only when all 8 values in the window are 1. I then return the position of the first value where the moving average equals 1.
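For reference, a minimal sketch wrapping this approach in a reusable helper (the function name is mine; filter here is stats::filter, so namespacing it avoids a clash if dplyr is attached):
# a minimal sketch of the filter approach as a helper; returns NA if no run exists
first_run_of_ones <- function(x, len = 8) {
  # moving average over a window of `len` values; equals 1 only when all of them are 1
  which(stats::filter(x, rep(1 / len, len), sides = 1) == 1)[1]
}
first_run_of_ones(c(0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1))
# [1] 11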

In the basic programming course I teach, I advise students to give proper names to subresults, and to inspect these subresults:
lengthOfrepeatsOfAnything <- rle(x)$lengths
# 4 2 5 11 2 2 3 2 17
whichRepeatsAreOfOnes <- rle(x)$values == 1
# TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE (runs 1, 3, 5, 7 and 9 are runs of ones)
repeatsOfOnesLength <- lengthOfrepeatsOfAnything * whichRepeatsAreOfOnes # TRUE = 1, FALSE = 0
# 4 0 5 0 2 0 3 0 17
whichRepeatOfOneAreLongerThanEight <- which(repeatsOfOnesLength >= 8)
# 9
result <- NA
if (length(whichRepeatOfOneAreLongerThanEight) > 0) {
  firstRepeatOfOneAreLongerThanEight <- whichRepeatOfOneAreLongerThanEight[1]
  # 9
  if (firstRepeatOfOneAreLongerThanEight == 1) {
    result <- 8
  } else {
    repeatsBeforeFirstEightOnes <- 1:(firstRepeatOfOneAreLongerThanEight - 1)
    # 1 2 3 4 5 6 7 8
    lengthsOfRepeatsBeforeFirstEightOnes <- lengthOfrepeatsOfAnything[repeatsBeforeFirstEightOnes]
    # 4 2 5 11 2 2 3 2
    result <- sum(lengthsOfRepeatsBeforeFirstEightOnes) + 8
  }
}
I know it doesn't look as dandy as a one-line solution, but it helps to make things clear and to pick up errors. Besides: if you look back at this code in four months, which one will be easier to understand again?

My advice would be to break the code up into simpler pieces. As suggested by @Nick, you want to write code which can be easily debugged, and modular coding allows you to do that.
# find runs of 0s and 1s
run_01 = rle(x)
# index of the first run of 1s with length >= 8
run_1 = with(run_01, which(values == 1 & lengths >= 8))[1]
# number of elements before that run starts (seq_len handles a run starting at position 1)
start_pos = sum(run_01$lengths[seq_len(run_1 - 1)])
# add 8 to get the position of the eighth consecutive 1
end_pos = start_pos + 8
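A hedged sketch pulling these steps into a single function (the name is mine): it returns NA when no qualifying run exists, and seq_len covers the case where the qualifying run starts at position 1:
first_run_end <- function(x, len = 8) {
  r <- rle(x)
  hit <- which(r$values == 1 & r$lengths >= len)[1]
  if (is.na(hit)) return(NA_integer_)
  sum(r$lengths[seq_len(hit - 1)]) + len
}
first_run_end(x)
# [1] 39   (for the second example vector above)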

Related

Parallel processing for multiple nested for loops

I am trying to run simulation scenarios which in turn should provide me with the best scenario for a given date, back tested a couple of months. The input for a specific scenario has 4 input variables with each of the variables being able to be in 5 states (625 permutations). The flow of the model is as follows:
Simulate 625 scenarios to get each of their profit
Rank each of the scenarios according to their profit
Repeat the process through a 1-day expanding window for the last 2 months starting on the 1st Dec 2015 - creating a time series of ranks for each of the 625 scenarios
The unfortunate result of this is 5 nested for loops which can take extremely long to run. I had a look at the foreach package, but I am concerned about how the combining of the outputs will work in my scenario.
The current code that I am using works as follows. First, I create the possible states of each of the inputs, along with the window:
a<-seq(as.Date("2015-12-01", "%Y-%m-%d"),as.Date(Sys.Date()-1, "%Y-%m-%d"),by="day")
#input variables
b<-seq(1,5,1)
c<-seq(1,5,1)
d<-seq(1,5,1)
e<-seq(1,5,1)
set.seed(3142)
tot_results<-NULL
Next the nested for loops proceed to run through the simulations for me.
for(i in 1:length(a))
{
  cat(paste0("\n", "Current estimation date: ", a[i]), "; iteration:", i, " \n")
  # subset data for backtesting
  dataset_calc <- dataset[which(dataset$Date <= a[i]), ]
  p = 1
  results <- data.frame(rep(NA, 625))
  for(j in 1:length(b))
  {
    for(k in 1:length(c))
    {
      for(l in 1:length(d))
      {
        for(m in 1:length(e))
        {
          if(i == 1)
          {
            # create a unique ID to merge onto later
            unique_ID <- paste0(replicate(1, paste(sample(LETTERS, 5, replace=TRUE), collapse="")),
                                round(runif(n=1, min=1, max=1000000)))
          }
          # run profit calculation
          post_sim_results <- profit_calc(dataset_calc, param1=e[m], param2=d[l], param3=c[k], param4=b[j])
          # extract the final profit amount
          profit <- round(post_sim_results[nrow(post_sim_results), ], 2)
          results[p, ] <- data.frame(unique_ID, profit)
          p = p + 1
        }
      }
    }
  }
  # extract the ranks for all scenarios
  rank <- rank(results$profit)
  # bind the ranks for the expanding window
  if(i == 1)
  {
    tot_results <- data.frame(ID=results[,1], rank)
  } else {
    tot_results <- cbind(tot_results, rank)
  }
  suppressMessages(gc())
}
My biggest concern is the binding of the results, given that the outer loop's actions depend on the output of the inner loops.
Any advice on how to proceed would be greatly appreciated.
So I think that you can vectorize most of this, which should give a big reduction in run time.
Currently, you use for-loops (5, to be exact) to create every combination of values, and then run the values one by one through profit_calc (a function that is not specified). Ideally, you'd just take all possible combinations in one go and push them through profit_calc in one single operation.
-- Rationale --
a <- 1:10
b <- 1:10
d <- rep(NA,10)
for (i in seq(a)) d[i] <- a[i] * b[i]
d
# [1] 1 4 9 16 25 36 49 64 81 100
Since * also works on vectors, we can rewrite this to:
a <- 1:10
b <- 1:10
d <- a*b
d
# [1] 1 4 9 16 25 36 49 64 81 100
While it may save us only one line of code, it actually reduces the problem from 10 steps to 1 step.
-- Application --
So how does that apply to your code? Well, provided profit_calc can be vectorized, you can generate a data frame where each row is one possible combination of your parameters. We can do this with expand.grid:
foo <- expand.grid(b,c,d,e)
head(foo)
# Var1 Var2 Var3 Var4
# 1 1 1 1 1
# 2 2 1 1 1
# 3 3 1 1 1
# 4 4 1 1 1
# 5 5 1 1 1
# 6 1 2 1 1
Let's say we have a formula... (a - b) * (c + d)... Then it would work like:
bar <- (foo[,1] - foo[,2]) * (foo[,3] + foo[,4])
head(bar)
# [1] 0 2 4 6 8 -2
So basically, try to find a way to replace for-loops with vectorized options. If you cannot vectorize something, look into apply instead, as that can also save you time in most cases. If your code is running too slowly, ideally first see whether you can write a more efficient script. You may also be interested in the microbenchmark package, or ?system.time.
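Applied to the question, and only as a hedged sketch (profit_calc and dataset_calc are the OP's objects and are not defined here), the four inner loops collapse into one parameter grid plus a single mapply call; the outer loop over dates can stay as it is:
# build all 625 parameter combinations once
grid <- expand.grid(b = b, c = c, d = d, e = e)
# evaluate profit_calc for each row of the grid (assumes the OP's function signature)
profits <- mapply(function(bj, ck, dl, em) {
  res <- profit_calc(dataset_calc, param1 = em, param2 = dl, param3 = ck, param4 = bj)
  round(res[nrow(res), ], 2)
}, grid$b, grid$c, grid$d, grid$e)
# one column of the rank time series, for the current estimation date
ranks <- rank(profits)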

R enumerate duplicates in a dataframe with unique value

I have a dataframe containing a set of parts and test results. The parts are tested on 3 sites (North, Centre and South). Sometimes those parts are re-tested. I want to eventually create some charts that compare the results from the first time that a part was tested with the second (or third, etc.) time that it was tested, e.g. to look at tester repeatability.
As an example, I've come up with the code below. I've explicitly removed the "Experiment" column from the morley data set, as this is the column I'm effectively trying to recreate. The code works; however, it seems that there must be a more elegant way to approach this problem. Any thoughts?
Edit - I realise that the example given was overly simplistic for my actual needs (I was trying to generate a reproducible example as easily as possible).
New example:
part<-as.factor(c("A","A","A","B","B","B","A","A","A","C","C","C"))
site<-as.factor(c("N","C","S","C","N","S","N","C","S","N","S","C"))
result<-c(17,20,25,51,50,49,43,45,47,52,51,56)
data<-data.frame(part,site,result)
data$index <- 1
repeat {
  if(!anyDuplicated(data[, c("part", "site", "index")])) { break }
  data$index <- ifelse(duplicated(data[, 1:2]), data$index + 1, data$index)
}
data
part site result index
1 A N 17 1
2 A C 20 1
3 A S 25 1
4 B C 51 1
5 B N 50 1
6 B S 49 1
7 A N 43 2
8 A C 45 2
9 A S 47 2
10 C N 52 1
11 C S 51 1
12 C C 56 1
Old example:
# Generate a trial data frame from the morley dataset
df <- morley[, c(2, 3)]
# Set up an iterative variable:
# create the index column and initialise it to 1
df$index <- 1
# Loop through the data frame looking for duplicate pairs of
# Runs and Indices and increment the index if it's a duplicate
repeat {
  if(!anyDuplicated(df[, c(1, 3)])) { break }
  df$index <- ifelse(duplicated(df[, c(1, 3)]), df$index + 1, df$index)
}
# Check - The below vector should all be true
df$index==morley$Expt
We may use diff and cumsum on the 'Run' column to get the expected output. In this method, we are not creating a column of 1s (i.e. 'index'), and we are also assuming that the sequence in 'Run' is ordered as shown in the OP's example.
indx <- cumsum(c(TRUE,diff(df$Run)<0))
identical(indx, morley$Expt)
#[1] TRUE
Or we can use ave
indx2 <- with(df, ave(Run, Run, FUN=seq_along))
identical(indx2, morley$Expt)
#[1] TRUE
Update
Using the new example
with(data, ave(seq_along(part), part, site, FUN=seq_along))
#[1] 1 1 1 1 1 1 2 2 2 1 1 1
Or we can use getanID from library(splitstackshape)
library(splitstackshape)
getanID(data, c('part', 'site'))[]
I think this is a job for make.unique, with some manipulation.
index <- 1L + as.integer(sub("\\d+(\\.)?","",make.unique(as.character(morley$Run))))
index <- ifelse(is.na(index),1L,index)
identical(index,morley$Expt)
[1] TRUE
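To see why this works, it may help to look at what make.unique produces here before the sub() call:
head(make.unique(as.character(morley$Run)), 22)
# "1" "2" ... "19" "20" "1.1" "2.1"
# first occurrences get no suffix, later repeats get ".1", ".2", ...;
# sub() strips the leading digits and the dot, leaving the repeat count
# (which becomes NA via as.integer for first occurrences); adding 1 gives the Expt number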
Details of your actual data.frame may matter. However, a couple of options working with your example:
#this works if each group starts with 1:
df$index<-cumsum(df$Run==1)
#this is maybe more general, with data.table
require(data.table)
dt<-as.data.table(df)
dt[,index:=seq_along(Speed),by=Run]
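Whichever approach generates the index, getting the first and repeat results side by side for the new example (the OP's stated goal) is then a single reshape; a minimal sketch:
data$index <- with(data, ave(seq_along(part), part, site, FUN = seq_along))
reshape(data, idvar = c("part", "site"), timevar = "index", direction = "wide")
# result.1 holds the first test, result.2 the retest (NA where a part/site was only tested once)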

Double for loop to save several files using R

I am trying to do a “for loop” to generate files based on the column "group". I want to create a file for each group. My data is much bigger, but a sample would be:
id = c(1,2,3,4,5,6,7,8,9,10)
group = c(3,1,3,2,1,3,1,2,4,4)
weight = c(10,11,12,13,14,15,16,17,18,19)
index1 = c(50,50,50,50,50,50,50,50,50,50)
index2 = c(50,50,50,50,50,50,50,50,50,50)
data = data.frame(id,group,weight,index1,index2)
for (i in unique(data$group)){
  for (j in 1:nrow(data)){
    data$weight[j] = ifelse(data$group[j] == data$group[i], 0, data$weight[j])
    data$index1[j] = ifelse(data$group[j] == data$group[i], 0, 50)
    data$index2[j] = ifelse(data$group[j] == data$group[i], 5, 50)
  }
  write.table(data, paste("/home/paulaf/test/", data$group[i], ".txt", sep=""),
              quote=F, row.names=F, col.names=T)
}
It seems to work, but it doesn’t write all the files. Any help would be very much appreciated. Thanks in advance.
Paula,
That code is actually writing four files, but you're overwriting one of them, so you only end up with three.
When you name the file with paste, you're using data$group[i] to generate the name. If you look at those names by using cat() or something similar, you'll notice you have two 3.txt files.
/home/paulaf/test/3.txt
/home/paulaf/test/3.txt
/home/paulaf/test/1.txt
/home/paulaf/test/2.txt
So that's why you're not getting all of your files: your first 3.txt is overwritten.
Looking a bit more closely at your data object, you can see why this happened.
Your i takes the values 3, 1, 2, and 4 (the unique groups, in the order they appear). Using those values as row indices in data$group[i], you're actually pulling out rows 3, 1, 2 and 4 of data$group. Notice that the first and third rows are both group 3.
id group weight index1 index2
1 1 3 0 50 50
2 2 1 0 50 50
3 3 3 0 50 50
4 4 2 0 0 5
5 5 1 0 50 50
6 6 3 0 50 50
7 7 1 0 50 50
8 8 2 0 0 5
9 9 4 18 50 50
10 10 4 19 50 50
Maybe replace your write.table() with this:
write.table(data,paste("/home/paulaf/test/",i,".txt",sep=""),
quote=F,row.names=F,col.names=T)
And one other note to save you future headache: It's often helpful to print some of your variables to the console. It's just a way to get some insight into what's happening.
Also, good luck, keep working with R, you're doing great!
unique(data$group) is a vector of length 4, while data$group has a length of 10. By indexing data$group with the unique group values, you're generating the filenames from whatever happens to sit in those rows of data$group instead of from the unique values themselves.
Try replacing data$group[i] with just i inside the paste that generates the filename, e.g.
for (i in unique(data$group)){
  for (j in 1:nrow(data)){
    data$weight[j] = ifelse(data$group[j] == data$group[i], 0, data$weight[j])
    data$index1[j] = ifelse(data$group[j] == data$group[i], 0, 50)
    data$index2[j] = ifelse(data$group[j] == data$group[i], 5, 50)
  }
  fileName = paste("/home/paulaf/test/", i, ".txt", sep="")
  write.table(data, fileName, quote=F, row.names=F, col.names=T)
}
Your problem is very simple. Inside your write.table function, you're pasting the name using data$group[i], but your outer loop is not looping over the indices of the unique groups, but over the group values themselves. Your i values are 3, 1, 2 and 4, so calling data$group[i] for each of those results in 3, 3, 1, 2, which means the filenames are wrong (one file is replaced and you end up with only 3, for this sample). The solution is then:
write.table(data,paste("/home/paulaf/test/",i,".txt",sep=""),
quote=F,row.names=F,col.names=T)}
It's also slightly more efficient (and easier to read, imho) to use paste0, so:
write.table(data,paste0("/home/paulaf/test/",i,".txt"),
quote=F,row.names=F,col.names=T)}
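For completeness, a hedged sketch that also drops the inner row loop (ifelse is vectorized over whole columns) and compares against the loop value i directly rather than data$group[i]; note that, unlike the original, it writes each file from a fresh copy of data, so one group's changes don't leak into the next group's file:
for (i in unique(data$group)){
  out <- data
  out$weight <- ifelse(out$group == i, 0, out$weight)
  out$index1 <- ifelse(out$group == i, 0, 50)
  out$index2 <- ifelse(out$group == i, 5, 50)
  write.table(out, paste0("/home/paulaf/test/", i, ".txt"),
              quote=FALSE, row.names=FALSE, col.names=TRUE)
}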

handling 'wrong' entries and NAs in a data.table substituting them with entries from other table

I am using data.table in the context of a wider application using shiny and handsontable.js. This is the flow of this part of the app:
I publish a data.table on the browser with numeric columns using handsontable & shiny. This is rendered on the screen.
The user changes values and each time this happens a new data.table is returned with the data.
The problem is with error management, specifically if a user accidentally keys in a character.
My objective is to correct the user's error by replacing the single cell value where the character was entered with the value from the original copy (only this cell, as the others may contain valid changes to be saved at a later stage in the app).
Sadly I am not able to find an efficient solution to this problem. This is my code and a reproducible sample:
# I generate a sample data.table
library(data.table)
originTable = data.table(Cat = LETTERS[1:5],
                         Jan = 1:5,
                         Feb = sample(1:5),
                         Mar = sample(1:5),
                         Apr = sample(1:5),
                         May = sample(1:5))
# I take a full copy; to simulate the effect of a character keyed in by mistake,
# I convert the entire column to character
dt_ <- copy(originTable)
dt_[, Jan := as.character(Jan)]
# "q" entered by mistake by the user -
dt_[[5,2]] <- "q"
# This is what I get back:
Cat Jan Feb Mar Apr May
1: A 1 1 2 4 4
2: B 2 5 4 2 2
3: C 3 4 3 1 5
4: D 4 3 5 5 1
5: E q 2 1 3 3
Now to my code to try to fix this:
valCols <- month.abb[1:5]
for (j in valCols)
  set(dt_,
      i = NULL,
      j = j,
      value = as.numeric(as.character(dt_[[j]])))
This gives me a data.table with an NA value somewhere (in place of the character entered by mistake, in a position I don't know in advance).
To substitute the value I've used the following code
for (j in valCols)
  set(dt_,
      i = which(is.na(dt_[[j]])),
      j = j,
      value = as.numeric(originTable[[j]]))
But it does not work: it finds the correct column but ignores the i value, and copies the value at originTable[1, j] rather than originTable[i, j]. In the example, dt_[5, 2] gets 1 (the value at originTable[1, 2]) instead of 5.
In other words, I would have expected as.numeric(originTable[[j]]) to be subsetted by i (implicitly) and by j (explicitly).
To be fair the Warning is telling me what is happening:
Warning message:
In set(dt_, i = which(is.na(dt_[[j]])), j = j, value = as.numeric(originTable[[j]])) :
Supplied 5 items to be assigned to 1 items of column 'Jan' (4 unused)
But I remain with my problem unsolved.
I have read countless apparently similar SO posts, but sadly to no avail (possibly because NA handling has evolved in recent releases and older answers no longer fully reflect best practice). A non-NA-based solution would be equally acceptable. Thanks.
Try the following:
# use your criteria to determine what the incorrect values are in each column
wrongs = lapply(dt_[, !"Cat"], function(x) which(is.na(as.numeric(x))))
# now substitute
for (n in names(wrongs)) dt_[wrongs[[n]], (n) := originTable[[n]][wrongs[[n]]]]
dt_
# Cat Jan Feb Mar Apr May
#1: A 1 2 5 2 4
#2: B 2 4 3 4 5
#3: C 3 3 2 5 2
#4: D 4 1 1 1 1
#5: E 5 5 4 3 3
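As an aside, the OP's own set() attempt needs only one change to work: subset value by the same rows passed to i. A hedged sketch, reusing valCols and originTable from the question:
for (j in valCols) {
  bad <- which(is.na(dt_[[j]]))
  if (length(bad))
    # value now has the same length as i, so each bad cell gets its own replacement
    set(dt_, i = bad, j = j, value = as.numeric(originTable[[j]])[bad])
}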

Using R as a game simulator

I am trying to simulate a simple game where you spin a spinner, labeled 1-5, and then progress until you pass the finish line (spot 50). I am a bit new to R and have been working on this for a while, searching for answers. When I run the code below, it doesn't add the numbers in sequence; it returns a list of my 50 random spins and their values. How do I get this to add the spins on top of each other, then stop once the total is >= 50?
SpacesOnSpinner <- seq(1, 5, by=1)
N <- 50
L1 <- integer(N)
for (i in 1:N){
  takeaspin <- sample(SpacesOnSpinner, 1, replace=TRUE)
  L1[i] <- L1[i] + takeaspin
}
This is a good use-case for replicate. I'm not sure if you have to use a for loop, but you could do this instead (replicate is a loop too):
SpacesOnSpinner<-(seq(1,5,by=1))
N<-10
cumsum( replicate( N , sample(SpacesOnSpinner,1,replace=TRUE) ) )
#[1] 5 10 14 19 22 25 27 29 30 33
However, since you have a condition on which you want to break, perhaps the other answer with a while condition is exactly what you need in this case (people will tell you they are bad in R, but they have their uses). Using this method, you can see how many spins it took you to get past 50 by a simple subset afterwards (you will not know in advance how many spins it will take, but at most it will be 50):
N<-50
x <- cumsum( replicate( N , sample(5,1) ) )
# Value of accumulator at each round until <= 50
x[ x < 50 ]
#[1] 5 6 7 8 12 16 21 24 25 29 33 34 36 38 39 41 42 44 45 49
# Number of spins before total <= 50
length(x[x < 50])
#[1] 20
Here is another interesting way to simulate your game, using a recursive function.
spin <- function(outcomes = 1:5, start = 0L, end = 50L)
  if (start <= end)
    c(got <- sample(outcomes, 1), Recall(outcomes, start + got, end))
spin()
# [1] 5 4 4 5 1 5 3 2 3 4 4 1 5 4 3
Although elegant, it won't be as fast as an improved version of #Simon's solution that makes a single call to sample, as suggested by #Viktor:
spin <- function(outcomes = 1:5, end = 50L) {
  max.spins <- ceiling(end / min(outcomes))
  x <- sample(outcomes, max.spins, replace = TRUE)
  head(x, match(TRUE, cumsum(x) >= end))
}
spin()
# [1] 3 5 2 3 5 2 2 5 1 2 1 5 5 5 2 4
For your ultimate goal (find the probability of one person being in the lead for the entire game), it is debatable whether while will be more efficient or not: a while loop is certainly slower, but you may benefit from the possibility of exiting early as the lead switches from one player to the other. Both approaches are worth testing.
You can use a while statement and a variable total for keeping track of the sum:
total <- 0
while (total <= 50){
  takeaspin <- sample(SpacesOnSpinner, 1, replace=TRUE)
  total <- takeaspin + total
}
print(total)
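If the next step is to see how many spins a typical game takes, one option is to wrap that loop in a function and replicate it; a hedged sketch (the function name and seed are mine):
spins_to_finish <- function(spaces = 1:5, end = 50){
  total <- 0
  spins <- 0
  while (total < end){
    total <- total + sample(spaces, 1)
    spins <- spins + 1
  }
  spins
}
set.seed(42)
# distribution of the number of spins needed to reach 50, over 1000 simulated games
summary(replicate(1000, spins_to_finish()))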
