I have a large list (data) of 15,000 elements, each containing 10 numbers.
I am doing a time series cluster analysis using
distmatrix <- dist(data, method = "DTW")
This has now been running for 24 hours. Is it likely to complete any time soon? Is there a way of checking on its progress? I don't want to abort it just in case it's about to finish.
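One rough way to answer the "will it finish" question for future runs is to time the same call on much smaller subsets and extrapolate, since dist() evaluates n*(n-1)/2 pairs and therefore scales roughly quadratically in n. A minimal sketch (it assumes the "DTW" method comes from the dtw package registered with proxy, since base dist() does not provide it; it only helps with planning, not with inspecting the run that is already going):
library(dtw)       # registers the "DTW" method with the proxy package
library(proxy)

m <- do.call(rbind, data)     # 15000 x 10 matrix, one series per row

sizes <- c(250, 500, 1000)
secs <- sapply(sizes, function(n) {
  system.time(proxy::dist(m[1:n, ], method = "DTW"))["elapsed"]
})

# Elapsed time grows roughly with n^2; extrapolate to the full 15000 series
fit <- lm(secs ~ I(sizes^2))
predict(fit, data.frame(sizes = 15000)) / 3600   # crude estimate in hours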
I have a big dataset (around 100k rows) with two columns identifying a device_id and a date, and the rest of the columns being attributes (e.g. device_repaired, device_replaced).
I'm building an ML algorithm to predict when a device will have to be maintained. To do so, I want to calculate certain features (e.g. device_reparations_on_last_3days, device_replacements_on_last_5days).
I have a function that subsets my dataset and returns a calculation:
For the specified device,
That happened before the day in question,
As long as there's enough data (e.g. if I want the last 3 days but only 2 records exist, it returns NA).
Here's a sample of the data and the function outlined above:
data <- data.frame(device_id       = c(rep(1, 5), rep(2, 10)),
                   day             = c(1:5, 1:10),
                   device_repaired = sample(0:1, 15, replace = TRUE),
                   device_replaced = sample(0:1, 15, replace = TRUE))
# Example: how many times device 1 was repaired over the last 2 days before day 3
# => getCalculation(3,1,data,"device_repaired",2)
getCalculation <- function(fday, fdeviceid, fdata, fattribute, fpreviousdays) {
  # Subset: same device, strictly before fday, within the last fpreviousdays days
  df <- subset(fdata, day < fday & day > (fday - fpreviousdays - 1) & device_id == fdeviceid)
  # Make sure there's enough data; if so, make the calculation
  if (nrow(df) < fpreviousdays) {
    calculation <- NA
  } else {
    calculation <- sum(df[, fattribute])
  }
  return(calculation)
}
My problem is that the number of available attributes (e.g. device_repaired) and of features to calculate (e.g. device_reparations_on_last_3days) has grown considerably, and my script now takes around 4 hours to execute, since I need to loop over each row and calculate all of these features.
I'd like to vectorize this logic with an apply-style approach, which would also allow me to parallelize its execution, but I don't know if/how it's possible to pass these extra arguments to lapply.
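As a minimal sketch of the "extra arguments" part only (it reuses getCalculation() and the sample data above; the new column name is just illustrative): mapply() takes the per-row arguments in parallel and the constant ones through MoreArgs, and parallel::mcmapply() offers the same interface for multicore execution.
# Per-row arguments (fday, fdeviceid) vary; the rest are fixed via MoreArgs
data$repairs_last_2days <- mapply(
  getCalculation,
  fday      = data$day,
  fdeviceid = data$device_id,
  MoreArgs  = list(fdata = data, fattribute = "device_repaired", fpreviousdays = 2)
)
# The same call can be parallelised on Unix-alikes with
# parallel::mcmapply(..., mc.cores = 4)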
I am playing with a large dataset (~1.5m rows x 21 columns) which includes the longitude and latitude of each transaction. I am computing the distance of each transaction from a couple of target locations and appending these as new columns to the main dataset:
TargetLocation1<-data.frame(Long=XX.XXX,Lat=XX.XXX, Name="TargetLocation1", Size=ZZZZ)
TargetLocation2<-data.frame(Long=XX.XXX,Lat=XX.XXX, Name="TargetLocation2", Size=YYYY)
## MainData[6:7] are long and lat columns
MainData$DistanceFromTarget1<-distVincentyEllipsoid(MainData[6:7], TargetLocation1[1:2])
MainData$DistanceFromTarget2<-distVincentyEllipsoid(MainData[6:7], TargetLocation2[1:2])
I am using the geosphere package's distVincentyEllipsoid() function to compute the distances. As you can imagine, distVincentyEllipsoid() is computationally intensive, but it is more accurate than the other distance functions in the same package (distHaversine(), distMeeus(), distRhumb(), distVincentySphere()).
Q1) It takes me about 5-10 minutes to compute the distances for each target location [I have 16 GB RAM and an Intel i7-6600U 2.81 GHz CPU], and I have multiple target locations. Is there any faster way to do this?
Q2) Then I create a new categorical column marking each transaction that falls within the market definition of a target location, using a for loop with two if statements. Is there any other way to make this computation faster?
MainData$TransactionOrigin <- "Other"
for (x in 1:nrow(MainData)) {
  if (MainData$DistanceFromTarget1[x] <= 7000)
    MainData$TransactionOrigin[x] <- "Target1"
  if (MainData$DistanceFromTarget2[x] <= 4000)
    MainData$TransactionOrigin[x] <- "Target2"
}
Thanks
Regarding Q2
This will run much faster if you lose the loop.
MainData$TransactionOrigin <- "Other"
MainData$TransactionOrigin[which(MainData$DistanceFromTarget1 <= 7000)] <- "Target1"
MainData$TransactionOrigin[which(MainData$DistanceFromTarget2 <= 4000)] <- "Target2"
(Note the [x] from the loop is dropped: the conditions are now whole vectors. As in the loop, a row within both radii ends up as "Target2" because that assignment runs second.)
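Regarding Q1, a sketch of a common speed-up (assumptions: geosphere's distGeo() as a faster, still highly accurate alternative to distVincentyEllipsoid(), plus the base parallel package; mclapply() forks, so it is Unix-only):
library(geosphere)
library(parallel)

# One (Long, Lat) target per list element; names become the new column suffixes
targets <- list(Target1 = TargetLocation1, Target2 = TargetLocation2)

# Compute one distance column per target, spreading targets across cores
dists <- mclapply(targets,
                  function(t) distGeo(MainData[, 6:7], t[1:2]),
                  mc.cores = length(targets))
MainData[paste0("DistanceFrom", names(targets))] <- dists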
I have a data.table in R with 150,000 rows in it.
I use 9 features, and training takes more than 30 minutes; I didn't wait longer than that.
I also tried it on 500 rows (it takes 0.2 sec) and on 5,000 rows (71.2 sec).
So how should I train my model on all the data, or maybe you can give me some other advice?
Here is the console log:
train1 <- train[1:5000,]+1
> f1 = as.formula("target~ v1+ v2+ v3+ v4+ v5+ v6+ v7+ v8+ v9")
> a=Sys.time()
> nn <-neuralnet(f1,data=train1, hidden = c(4,2), err.fct = "ce", linear.output = TRUE)
Warning message:
'err.fct' was automatically set to sum of squared error (sse), because the response is not binary
> b=Sys.time()
> difftime(b,a,units = "secs")
Time difference of 71.2000401 secs
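(Aside on the warning above: err.fct = "ce" is meant for a binary 0/1 response with a sigmoid output, i.e. linear.output = FALSE. Note also that train[1:5000,] + 1 adds 1 to every column, so a target that was originally 0/1 becomes 1/2, which by itself triggers the fall-back to SSE. A minimal sketch of a consistent cross-entropy setup, assuming the target column really is 0/1 in the untouched data:)
library(neuralnet)

train1 <- train[1:5000, ]            # subset without the "+ 1" shift
f1 <- as.formula("target ~ v1 + v2 + v3 + v4 + v5 + v6 + v7 + v8 + v9")
nn <- neuralnet(f1, data = train1,
                hidden = c(4, 2),
                err.fct = "ce",            # cross-entropy needs a binary response
                linear.output = FALSE)     # sigmoid output for classification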
This is to be expected, in my experience; there are a lot of calculations involved in neural nets. I personally have one written in Python (2 hidden layers, with a momentum term), with about 38,000 patterns of 56 inputs and 3 outputs. Splitting them into 8,000-pattern chunks, it took about 10 minutes to run and just under a week to learn to my satisfaction.
The whole set of 38,000 needed more hidden nodes to store all the patterns, and that took over 6 hours to go through one cycle and over 3 months to learn. Neural networks are a very powerful tool, but in my experience they come at a price. Others may have better implementations, but every comparison of classification algorithms I have seen mentions the time to learn as significant.
I'm trying to measure the time used to solve a system in R:
t1 <- Sys.time()
b <- t(Q) %*% t(Pi) %*% z
Sol <- BackwardSubs(R, b)
t2 <- Sys.time()
DeltaT <- t2 - t1
print(paste("System: ", DeltaT, " sec", sep = ""))
Sometimes, for problems of high dimension, I get results that cannot be right (e.g. 1 or 2 seconds reported when the function execution actually takes several minutes).
Is Sys.time() correct?
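A likely explanation (a guess from the snippet, not a certainty): t2 - t1 returns a difftime whose units are chosen automatically, so a run of a few minutes yields a number like 1.8 in minutes, which the paste() call then labels as seconds. Forcing the units, or using system.time(), removes the ambiguity:
# Ask for seconds explicitly instead of letting difftime pick minutes or hours
DeltaT <- as.numeric(difftime(t2, t1, units = "secs"))
print(paste0("System: ", DeltaT, " sec"))

# Or time the whole block directly; "elapsed" is wall-clock seconds
# system.time({ b <- t(Q) %*% t(Pi) %*% z; Sol <- BackwardSubs(R, b) })["elapsed"]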
I've been trying to use the msm package to model an 8-state, multi-state Markov chain. My data set contains about 11,000 subjects, with slightly over 100k observations in total.
I try to run the msm function on several subsets of the data, taking the head of the data, like so:
mm2myTrajectoryMSM<-msm(role ~ year, subject=authId, data=head(mm2myMarkovRoles[,1:3,with=FALSE],7000), qmatrix=trajectory.qmatrix,death=1,control=list(trace=1,REPORT=1))
So far, I have not been able to get past ~7000 lines. Looking at the report output, I noticed that the function freezes when the iter value outputs a negative value. For example, here is the run with the first 10k rows of the data
initial value 19017.328402
iter 2 value 17808.111677
iter 3 value 17707.483305
iter 4 value -346782.085429 (freeze)
But it works with the first 20k rows
initial value 38101.266287
iter 2 value 35871.849676
iter 3 value 35796.410415
iter 4 value -721867.559664
iter 4 value -721867.559664
final value -721867.559664
converged
But not with 50k rows
initial value 92846.642840
iter 2 value 88466.007605
iter 3 value 88310.215979
iter 4 value 88276.433502
iter 5 value 88247.381022
iter 6 value -983685.709474
But it works for 60,010 and 80,007 rows (I'm capturing full records of subjects), and beyond that I cannot tell whether the system freezes or the analysis is simply taking a very long time. The one CPU assigned to the task is maxed out, but I am nowhere near my RAM limits (< 1% of the 96 GB on the server).
I have two questions: 1) Why does the function (arbitrarily?) hang on certain subsets of the data, and
2) How can I estimate the run time of this function? Last time I let it run, it went for over 2 days. Oddly, the computation time for many of the runs appeared to scale sub-linearly, but once I crossed a threshold it no longer did.
Are you running msm 1.5?
The changelog (http://cran.r-project.org/web/packages/msm/ChangeLog) mentions a fix for a bug that led to infinite loops on Windows.
If your time series has several short jumps you might get a log-likelihood underflow. You can study this by setting fixedpars = TRUE in the msm call (then extract the log-likelihood and look for underflow/overflow).
If something is wrong you'll get very long running times (hard to predict).
Also try rescaling the likelihood by passing fnscale = 100000 in the control list (msm hands control on to optim()).
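A minimal sketch of those two suggestions (argument names as documented for msm; the data and qmatrix objects are the ones from the question):
library(msm)

# 1. Evaluate the likelihood at the starting values only, without optimising,
#    then inspect it for underflow/overflow:
fit0 <- msm(role ~ year, subject = authId, data = mm2myMarkovRoles,
            qmatrix = trajectory.qmatrix, death = 1,
            fixedpars = TRUE)
logLik(fit0)     # -Inf or NaN here points to an underflow/overflow problem

# 2. Rescale the objective that msm hands to optim() so the optimiser works
#    with values near 1 instead of ~1e5:
fit <- msm(role ~ year, subject = authId, data = mm2myMarkovRoles,
           qmatrix = trajectory.qmatrix, death = 1,
           control = list(fnscale = 100000, trace = 1, REPORT = 1))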