More efficient way to get measurements from different items in R

I'm currently computing accuracy measurements for roughly 80k different items. Each item's measure has to be calculated independently, but my current approach takes far too long, so I want to find a faster way to do it.
Here's my R code with its comments:
work_file contains 4 variables: item_id, Dates, demand and forecast
my code:
output <- 0
uniques <- unique(work_file$item_id)
for (i in uniques) {
  # filter every unique item
  temporal <- work_file %>% filter(item_id == i)
  # calculate the accuracy measure for each item
  x <- temporal$demand
  x1 <- temporal$forecast
  item_error <- c(i, accuracy(x1, x))
  output <- rbind(output, item_error)
}
For ~80k unique items this takes hours.
Any suggestions?

R is a vectorized language, so the loop can usually be avoided altogether. Also, rbind-ing inside a loop is especially slow, since the output data structure is deleted and recreated on every iteration.
Provided the accuracy() function accepts vector input, this should work (without sample data to test against, there is always some doubt):
answer <- work_file %>%
  group_by(item_id) %>%
  summarize(accuracy(forecast, demand))
Here dplyr's group_by() collects the rows for each item_id, and summarize() then passes each group's forecast and demand vectors to the accuracy function.
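If accuracy() here is forecast::accuracy(), it returns a one-row matrix of several error measures (ME, RMSE, MAE, ...), so you may want to name and keep a single measure per item. A minimal sketch, assuming that is indeed the function in use:
library(dplyr)
library(forecast)  # assumption: accuracy() comes from the forecast package

per_item <- work_file %>%
  group_by(item_id) %>%
  summarize(rmse = accuracy(forecast, demand)[, "RMSE"])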

Consider data.table, whose grouped operations are efficient for this kind of task:
library(data.table)
setDT(work_file)[, .(acc = accuracy(forecast, demand)), item_id]
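As with the dplyr version, accuracy() likely returns several measures at once; if you only need one of them, a sketch along these lines keeps just the RMSE per item (again assuming forecast::accuracy()):
library(data.table)
library(forecast)  # assumption: accuracy() comes from the forecast package
setDT(work_file)
work_file[, .(rmse = accuracy(forecast, demand)[, "RMSE"]), by = item_id]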

Related

How to analyse rows with similar IDs in PySpark?

I have a very large Dataset (160k rows).
I want to analyse each subset of rows with the same ID.
I only care about subsets with the same ID that are at least 30 rows long.
What approach should I use?
I did the same task in R as follows (which, as far as I can tell, can't be translated to PySpark):
Order the data in ascending order.
Check whether the next row is the same as the current one; if yes, n = n + 1; if no, do the analysis and save the results. Rinse and repeat for the whole length of the data frame.
One easy method is to group by 'ID' and collect the columns that are needed for your analysis.
If just one column:
from pyspark.sql import functions as F
grouped_df = original_df.groupby('ID').agg(F.collect_list("column_m").alias("for_analysis"))
If you need multiple columns, you can use struct:
grouped_df = original_df.groupby('ID').agg(F.collect_list(F.struct("column_m", "column_n", "column_o")).alias("for_analysis"))
Then, once you have your data collected per ID, you can use a UDF to perform your analysis:
grouped_df = grouped_df.withColumn('analysis_result', analyse_udf('for_analysis', ...))

How to sample a list containing multiple dataframes using lapply in R?

I have this list of data that I created by using split on a dataframe:
dat_discharge = split(dat2,dat2$discharge_id)
I am trying to create a training and a test set from this list by sampling, in order to take into account the discharge_id groups, which are not at all equally distributed in the data.
I am trying to do this with lapply, as I'd rather not sample each group in the list individually.
trainlist<-lapply(dat_discharge,function(x) sample(nrow(x),0.75*nrow(x)))
trainL = dat_discharge[(dat_discharge %in% trainlist)]
testL = dat_discharge[!(dat_discharge %in% trainlist)]
I tried emulating this post (R removing items in a sublist from a list) to create the training and testing subsets, but the training list comes out entirely empty, which I assume means this is not the correct approach for a list of data frames?
Is what I am looking to do possible without selecting the individual data frames in the list, like data_frame[[1]]?
You could use map_dfr from the purrr library instead of lapply (note that you need install.packages("purrr") and library(purrr) before the next steps, but maybe you already have it installed since it's a common package).
Then you can use the following code:
library(purrr)
library(dplyr)  # for the %>% pipe

dat2$rowid <- 1:nrow(dat2)
dat_discharge <- split(dat2, dat2$discharge_id)
trainList <- dat_discharge %>% map_dfr(.f = function(x) {
  sampling <- sample(1:nrow(x), round(0.75 * nrow(x), 0))
  x[sampling, ]
})
testL <- dat2[!(dat2$rowid %in% trainList$rowid), ]
To explain the code above: first of all, I added a unique rowid to dat2 so I know which rows I am sampling and which I am not. This is used in the last line to separate the test and training datasets, so that the training dataset doesn't contain any rowid that the test set has.
Then I do the split to create dat_discharge, as you did.
Then, to each data frame inside the dat_discharge list, I apply the function inside map_dfr. map_dfr works like lapply, except that it "concatenates" the outputs into a single data frame instead of putting each output in a list, provided each iteration returns a data frame with the same columns as the first. Think of it as "okay, I got this data frame, I'm going to bind its rows to the previous result". So the result is just one big data frame.
Inside that function you can see that I do the sampling a bit differently: I sample 75% of the row indices of the iteration's data frame, then subset it with x[sampling, ], which yields the sampled data frame for that iteration (one of the data frames from the dat_discharge list). map_dfr then automatically binds those sampled data frames into a single big data frame instead of returning a list as lapply would.
Lastly, I create the test set as all the rowids from dat2 that are NOT present in the training set.
Hope this serves you well :)
Do note that if you want to sample 75% of the observations for each id, each id should have at least 4 observations for this to make sense. Imagine if a particular id had only 1 observation! The code would still work (it would simply select that observation), but you need to keep that implication in mind when you build your statistical model.
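For completeness, roughly the same split can also be done with lapply alone, close to what the question attempted; a minimal sketch, assuming dat2 and discharge_id as in the question:
dat2$rowid <- 1:nrow(dat2)
dat_discharge <- split(dat2, dat2$discharge_id)
# sample 75% of the rows of each group, then bind the sampled pieces together
train_pieces <- lapply(dat_discharge, function(x) x[sample(nrow(x), floor(0.75 * nrow(x))), ])
trainL <- do.call(rbind, train_pieces)
testL <- dat2[!(dat2$rowid %in% trainL$rowid), ]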

Recode a table in R

Suppose I have a table of ages:
ages <- array(round(runif(min=10,max=200,n=100)),dim=100,dimnames=list(age=0:99))
Suppose now I want to collapse my ages table in 5-year wide age groups.
This could be done quite easily by summing over the different values:
ages.5y <- array(NA, dim = 20,
                 dimnames = list(age = paste(seq(from = 0, to = 95, by = 5),
                                             seq(from = 4, to = 99, by = 5), sep = "-")))
ages.5y[1] <- sum(ages[1:5])
ages.5y[2] <- sum(ages[6:10])
...
ages.5y[20] <- sum(ages[96:100])
It could also be done using a loop:
for(i in 1:20) ages.5y[i]<-sum(ages[(5*i-4):(5*i)])
But while this method is easy for "regular" transformations, the loop approach becomes infeasible if the new intervals are irregular, e.g. 0-4, 5-12, 13-24, 25-50, 60-99.
If, instead of a table, I had individual values, this could be done quite easily using cut:
flattened <- rep(as.numeric(dimnames(ages)$age),ages)
table(cut(flattened,breaks=seq(from=0,to=100,by=5)))
This allows the use of arbitrary break points, e.g. breaks=c(5,10,22,33,41,63,88).
However, this is a quite resource-intensive way to do it.
So, my question is: Is there a better way to recode a contingency table?
You could use cut on the age values, but not the counts. Like this:
ages = 0:99
ageCounts = array(round(runif(min=10,max=200,n=100)),dim=100)
groups = cut(ages,breaks=seq(from=-1,to=100,by=5))
Then group them. I use data.table for this:
library(data.table)
DT = data.table(ages=ages, ageCounts=ageCounts, groups)
DT[, list(sum=sum(ageCounts)), by=groups]
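Since the original concern was irregular intervals, note that the same approach works with arbitrary break points; a small sketch with made-up breaks:
# hypothetical irregular age groups: 0-4, 5-12, 13-24, 25-50, 51-99
DT[, groups2 := cut(ages, breaks = c(-1, 4, 12, 24, 50, 99))]
DT[, list(sum = sum(ageCounts)), by = groups2]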

Summary stats a variable for each unique variable within a condition

I have a longitudinal spreadsheet that contains different growth variables for many individuals. At the moment my R code looks like this:
D5<-ifelse(growth$agyr == 5, growth$R.2ND.DIG.AVERAGE,NA)
Since it is longitudinal, I have the same measurement for each individual at multiple ages, hence the variable agyr. In this example it takes all kids who have a finger measurement at age 5.
What I would like to do is do that for all ages, so that I don't have to define an object every time; essentially I want to run some summary stats on finger length for any given agyr. Surely this is possible, but I am still a beginner at R.
tapply() is your friend here. For the mean for example:
with(growth,
tapply(R.2ND.DIG.AVERAGE,agyr,mean)
)
See also ?tapply and a good introductory book on R. And also ?with, a function that can really make your code a lot more readable.
If you have multiple levels you want to average over, you can give tapply() a list of factors. Say gender is a variable as well (a factor!); then you can do, e.g.:
with(growth,
tapply(R.2ND.DIG.AVERAGE,list(agyr,gender),mean)
)
tapply() returns an array-like structure (a vector, matrix or multidimensional array, depending on the number of categorizing factors). If you want your results in a data frame and/or summarize multiple variables at once, look at ?aggregate, eg:
thevars <- c("R.2ND.DIG.AVERAGE","VAR2","MOREVAR")
aggregate(growth[thevars],by=list(agyr,gender), FUN="mean")
or using the formula interface:
aggregate(cbind(R.2ND.DIG.AVERAGE,VAR2,MOREVAR) ~ agyr + gender,
data=growth, FUN = "mean")
Make sure you check the help files as well; both tapply() and aggregate() are quite powerful and have plenty of other possibilities.
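Since the goal was general summary stats rather than just the mean, note that tapply() accepts any summarising function; a small sketch with summary(), assuming the same growth data frame:
# five-number summary plus mean of finger length per age
with(growth,
     tapply(R.2ND.DIG.AVERAGE, agyr, summary)
)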

Endless function/loop in R: Data Management

I am trying to restructure an enormous data frame (about 12,000 cases). In the old data frame one person is one row and has about 250 columns (e.g. person 1, testA1, testA2, testB, ...), and I want all the results for that person in long form: there are 10 test sets overall, each with 24 items (A-Y), so one person should end up with 24 columns and 10 rows. There is also a fixed part of the data frame before the A-Y items start (personal information like age, gender etc.), which I want to keep as it is (fixdata).
The function/loop works for 30 cases (I tried it in advance), but for the 12,000 it is still calculating, for nearly 24 hours now. Any ideas why?
restructure <- function(data, firstcol, numcol, numsets){
  out <- data.frame(t(rep(0, (firstcol - 1) + numcol)))
  names(out) <- names(daten[0:(firstcol + numcol - 1)])
  for (i in 1:nrow(daten)) {
    fixdata <- daten[i, 1:(firstcol - 1)]
    for (j in seq(firstcol, ((firstcol - 1) + numcol * numsets), by = numcol)) {
      flexdata <- daten[i, j:(j + numcol - 1)]
      tmp <- cbind(fixdata, flexdata)
      names(tmp) <- names(daten[0:(firstcol + numcol - 1)])
      out <- rbind(out, tmp)
    }
  }
  out <- out[2:nrow(out), ]
  return(out)
}
Thanks in advance!
An idea why: you rbind to out in each iteration. This takes longer with each iteration as out grows, so you have to expect more than linear growth in run time as the data set gets bigger.
So, as Andrie says, you can look at melt.
Or you can do it with core R: stack.
Then you need to cbind the fixed part to the result yourself (you need to repeat the fixed columns with each = n.var.cols).
A third alternative would be array2df from package arrayhelpers.
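As a rough illustration of the melt idea (a sketch only: reshape2::melt is assumed, and daten/firstcol are taken from the question, so the id.vars would need adjusting to the real data):
library(reshape2)
# keep the fixed person-level columns as identifiers and stack everything else
long <- melt(daten, id.vars = names(daten)[1:(firstcol - 1)],
             variable.name = "item", value.name = "score")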
I agree with the others: look into the reshape2 and plyr packages; in particular melt, cast and dcast might help you. Plus, it might help to make use of smart column names, e.g.:
As <- grep("^testA", names(yourdf))
# returns a vector with the column positions of all testA1 through testA10 columns
Besides, if you 'spend' the two dimensions of a data.frame on test number and test type, there's obviously none left for the person. Sure, you can identify them by an ID that you could map to an aesthetic when plotting, but depending on what you want to do, you might want to store them in a list instead, so you end up with a list of persons, with a data frame for every person. I am not sure exactly what you are trying to do, but I hope this helps.
Maybe you're not getting on with plyr or the other reshaping functions. How about something more direct and low-level? If you currently have one row per person that goes A1, A2, A3, ..., A10, B1-B10, etc., then extract that lump of columns from your data frame (I'm guessing columns 11-250), reshape that section into the form you want, and put the pieces back together.
yDat <- data[, 11:250]
yDF <- lapply(1:nrow(data), function(i) matrix(yDat[i, ], ncol = 24))
yDF <- do.call(rbind, yDF)  # combine the list of matrices returned above into one
yDF <- data.frame(yDF)      # get it back into a data.frame
names(yDF) <- LETTERS[1:24] # might as well name the columns
That's the fastest way to get the bulk of your data into the shape you want. All the lapply() call did was add dimension attributes to each row so that it was in the shape you wanted, and return the results as a list, which was then bound together by the subsequent rbind. But now it doesn't have any of the ID information from the main data.frame. You just need to replicate each row of the first 10 columns 10 times, or you can use the convenience function merge to help with that: make a column that is already in your first 10 columns one of the columns of the new data.frame, and then just merge them.
yInfo <- data[, 1:10]
ID <- yInfo$ID
yDF$ID <- rep( yInfo$ID, each = 10 )
newDat <- merge(yInfo, yDF)
And now you're done... mostly. You might want to make an extra column that names the new rows:
newDat$condNum <- rep(1:10, nrow(newDat)/10)
This will be very fast running code. Your data.frame really isn't that big at all and much of the above will execute in a couple of seconds.
This is how you should be thinking about data in R. It's not that there aren't convenience functions to handle much of this, but you should be doing things in ways that avoid looping as much as possible. Technically, only one loop happened above, the lapply used right at the start, and it had very little in it (lapply calls should be compact). You're writing scalar code, and that is very, very slow in R, even before you start abusing memory by growing an object inside the loop. Furthermore, keep in mind that while you can't always avoid a loop of some kind, you can almost always avoid nested loops, which is one of your biggest problems here.
(read this to better understand your problems in this code... you've made most of the big errors in there)
