R lapply-split-rbindlist - does subset cause problems? - r

I'm sure this will be very easy as I'm still an R beginner but here goes...
I've started with a data frame which I've successfully put through lapply-split followed by rbindlist to regenerate as a dataframe.
From this same data set, I've subset some data and performed lapply-split followed by rbindlist and get the following error:
"Error in rbindlist(df) : Item 1 of list input is not a data.frame,
data.table or list"
This is confusing since it's the same (sub)set of data being split by the same parameter.
When I call:
df[1]
I get:
$SWS1Ami
[1] 13451.02
which is the mean value I wanted to calculate for the SWS1Ami group (so it seems to have done the lapply split correctly). When I call:
typeof(df[1])
I see it tells me this element(?) type is a list.
Two questions:
(1) What could cause rbindlist to not work after doing lapply-split? Why does this seem to sometimes work and sometimes not work?
(2) Is there a quick litmus test to tell if your dataframe is in the "right" setup to undergo lapply-split-rbindlist?

Related

Individual variables from the top row of my dataframe are not recognized as objects

I am really new to "R" and stackoverflow here a beginner question it seems so simple that I can not find the answer in the net...
I loaded my data:
df<- read.csv2("HOR.csv", na="NA",header = TRUE)
Now I would like to get an overview from individual variables of my df.
I use
summary(df)
and that works...
But when I try this with individual variables/columns, I get an error
Object not found
summary(A)
eg. the console says:
Error in summary(PSumme) : Object 'PSumme' not found
It is a very basic question but i can not do any type of analysis since it seems that R does not recognize any of my columns as objects, I get the same error when I try adding or subtracting columns with each others or trying to change variables e.g.:
PSumme * -1
I have already tried tibble and other commands to turn the columns heads into variables and it seems to work but the variables are not treated as objects.
Bests and ty for your time!

In mclapply : scheduled cores 9 encountered errors in user code, all values of the jobs will be affected

I went through the existing stackoverflow links regarding this error, but no solution given there is working (and some questions dont have solutions there either)
Here is the problem I am facing:
I run Arima models in parallel using mclapply of parallel package. The sample data is being split by key onto different cores and results are clubbed together using do.call + rbind (the server I place the script in has 20 cores of cpu which is passed on to mc.cores field)
Below is my mclapply code:
print('Before lapply')
data_sub <- do.call(rbind, mclapply(ds,predict_function,mc.cores=num_cores))
print('After lapply')
I get multiple set of values like below as output of 'predict_function'
So basically, I get the file as given above from multiple cores to be send to rbind. The code works perfectly for some part of data. Now, I get another set of data , same like above with same data type of each column, but different value in column 2
data type of each column is given in the column name above.
For the second case, I get below error:
simpleError in charToDate(x): character string is not in a standard unambiguous format
Warning message:
In mclapply(ds, predict, mc.cores = num_cores) :
scheduled cores 9 encountered errors in user code, all values of the jobs will be affected
I dont see this print: print('After lapply') for the second case, but is visible for first case.
I checked the date column in above dataframe, its in Date format. When I tried unique(df$DATE) it threw all valid values in the format as given above.
What is the cause of the error here? is it the first one due to which mclapply isnt able to rbind the values? Is the warning something we need to understand better?
Any advice would be greatly appreciated.

R: Error in .Primitive, non-numeric argument to binary operator

I did some reading on similar SO questions, but couldn't figure out how to resolve my error.
I have written the following string of code:
points[paste0(score.avail,"_pts")] <-
Map('*', points[score.avail], mget(paste0(score.avail,'_m')) )
Essentially, I have a list of columns in the 'points' data frame, defined by 'score.avail'. I am multiplying each of the columns by a respective constant, defined as the paste0(score.avail, '_m') expression. It appends new fields based on the multiplication, given by paste0(score.avail, "_pts") expression.
I have used this function before in a similar setup with no issues. However, I am now getting the following error:
Error in .Primitive("*")(dots[[1L]][[1L]], dots[[2L]][[1L]]) :
non-numeric argument to binary operator
I'm pretty sure R is telling me that one of the fields I'm trying to multiply is not numeric. However, I have checked all my fields, and they are numeric. I have even tried running a line as.numeric(score.avail) but that doesn't help. I also ran the following to remove NA's in the fields (before the Map function above).
for(col in score.avail){
points[is.na(get(col)) & (data.source == "average" |
data.source == "averageWeighted"), (col) := 0]}
The thing that stumps me is that this expression has worked with no issues before.
Update
I did some more digging by separating out each component of my original function. I'm getting odd output when running points[score.avail]. Previously when I ran this, it would return just the columns for all of my rows. Now, however, I'm getting none of the rows in my original data frame -- rather, it is imputing the column names in the 'score.avail' list as rows and filling in NA's everywhere (this is clearly the source of my problem).
I think this is because I'm using the object I'm pointing to is a data.table with keyvars set. Previously with this function, I had been pointing to a data frame.
Off to try a few more things.
Another Update
I was able to solve my problem by copying the 'points' object using as.data.frame(). However, I will leave the question open to see if anyone knows how to reset the data table key vars so that the function I specified above will work.
I was able to solve my problem by copying the 'points' object using as.data.frame(). Apparently classifying the object as a data.table was causing my headaches.

R ffdf Error: is.null(names(x)) is not TRUE

I'm trying to understand what a particular error message means in R.
I'm using an ff data frame (ffdf) to build an rpart model. That works fine.
However, when I try to apply the predict function (or any function) using ffdfdply, I get a cryptic error message that I can't seem to crack. I'm hoping someone here can shed light on its meaning.
PredictedData<-ffdfdply(x=TrainingData,split=TrainingData$id,
FUN=function(x) {x$Predicted<-predict(Model1,newdata=x)
x})
If I've thought about this correctly, ffdfdply will take the TrainingData table, split it into chunks based on TrainingData$id, then apply the predict function using the model file Model1. Then it will return the table (its labelled x in the function field), combine them back together into the table PredictedData. PredictedData should be the same as TrainingData, except with an additional column called "Predicted" added.
However, when I run this, I get rather unhelpful error message.
2014-07-16 21:16:17, calculating split sizes
2014-07-16 21:16:36, building up split locations
2014-07-16 21:17:02, working on split 1/30, extracting data in RAM of 32 split elements, totalling, 0.07934 GB, while max specified data specified using BATCHBYTES is 0.07999 GB
Error: is.null(names(x)) is not TRUE
In addition: Warning message:
In ffdfdply(x = TrainingData, split = TrainingData$id, FUN = function(x) { :
split needs to be an ff factor, converting using as.character.ff to an ff factor
Yes, every column has a name. Those names contain just alphanumeric characters plus periods. But the error message makes me think that that the columns should not have names? I guess I 'm confused what this means exactly.
I appreciate any hints anyone can provide and I'll be happy to provide more detail.
I think I found the solution to this.
Turns out I had periods in my column names. When I removed those periods, this worked perfectly.

Strangeness with filtering in R and showing summary of filtered data

I have a data frame loaded using the CSV Library in R, like
mySheet <- read.csv("Table.csv", sep=";")
I now can print a summary on that mySheet object
summary(mySheet)
and it will show me a summary for each column, for example, one column named Diagnose has the unique values RCM, UCM, HCM and it shows the number of occurences of each of these values.
I now filter by a diagnose, like
subSheet <- mySheet[mySheet$Diagnose=='UCM',]
which seems to be working, when I just type subSheet in the console it will print only the rows where the value has been matched with 'UCM'
However, if I do a summary on that subSheet, like
summary(subSheet)
it still 'knows' about the other two possibilities RCM and HCM and prints those having a value of 0. However, I expected that the new created object will NOT know about the possible values of the original mySheet I initially loaded.
Is there any way to get rid of those other possible values after filtering? I also tried subset but this one just seems to be some kind of shortcut to '[' for the interactive mode... I also tried DROP=TRUE as option, but this one didn't change the game.
Totally mind squeezing :D Any help is highly appreciated!
What you are dealing with here are factors from reading the csv file. You can get subSheet to forget the missing factors with
subSheet$Diagnose <- droplevels(subSheet$Diagnose)
or
subSheet$Diagnose <- subSheet$Diagnose[ , drop=TRUE]
just before you do summary(subSheet).
Personally I dislike factors, as they cause me too many problems, and I only convert strings to factors when I really need to. So I would have started with something like
mySheet <- read.csv("Table.csv", sep=";", stringsAsFactors=FALSE)

Resources