Define R-squared depending on one factor - r

I have data with one factor, ref_fruit. To get the R-squared for each level of that factor (with MF depending on heure), my script looks like this:
library(plyr)
# fit one linear model per level of ref_fruit
models <- dlply(P1, "ref_fruit", function(df)
  lm(MF ~ heure, data = df))
ldply(models, coef)
# print the full summary (including R-squared) of each model
l_ply(models, summary, .print = TRUE)
The problem is that every R-squared in the resulting list is really high, around 0.998 each time. This is not what I observed in Excel.
The other problem is that I get this message after executing:
ldply(models, coef)
Error in fs[[i]](x, ...) : attempt to apply non-function
Could someone help me, please?

Your problem is almost identical to this example for dplyr shown here. You didn't provide any data, but the approach should be:
require(dplyr)
# fit one linear model per fruit
by_fruit <- group_by(df, ref_fruit)
models <- by_fruit %>% do(mod = lm(MF ~ heure, data = .))
models
# extract the R-squared of each fitted model
summarise(models, rsq = summary(mod)$r.squared)
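If you want to sanity-check those numbers without plyr or dplyr, a base-R one-liner over the same (assumed) columns gives the per-group R-squared directly:
# base-R cross-check, assuming df has columns ref_fruit, MF and heure
sapply(split(df, df$ref_fruit),
       function(d) summary(lm(MF ~ heure, data = d))$r.squared)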

Related

In R, `Error in f(arg, ...) : NA/NaN/Inf in foreign function call (arg 1)` but there are no Infs, no NaNs, no `char`s, etc

I am trying to use the lqmm package in R and am receiving the error Error in f(arg, ...) : NA/NaN/Inf in foreign function call (arg 1). I can successfully use it on a version of my data in which a variable called cluster_name is averaged over.
I've tried to verify that there are no NaNs or infinite values in my dataset this way:
na_data <- mydata
new_DF <- na_data[rowSums(is.na(na_data)) > 0, ] # yields a dataframe with no observations
is.na(na_data) <- sapply(na_data, is.infinite)   # recode any Inf/-Inf as NA
new_DF <- na_data[rowSums(is.na(na_data)) > 0, ] # still a dataframe with no observations
There are no variables in my dataframe of type character -- every such variable has been converted to a factor.
When I run my model
m1 = lqmm(std_brain ~ std_beh*type*taught, random = ~1, group=subject, data = begin_data, tau=.5, na.action=na.exclude)
on the first 12,528 lines of my dataset, the model works fine. Line 12,529 looks totally normal.
Similarly, if I run tail(mydata, 11943) I get a dataframe that runs without error, but tail(mydata, 11944) gives me a dataframe that generates the error. I can also run a subset from 9990:21825 without error, but extending the dataframe on either side generates the error. The whole dataframe is 29,450 observations, so this middle slice contains the supposedly problematic observations.
I tried making a smaller version of my dataset that contained just the borders of the problem regions, plus some observations around them, and I can see that 3 of the 4 cases involve the same subject (7645), but I don't know what to make of that.
I don't see how to make this reproducible without providing the whole dataframe (in case you were wondering, the small dataset doesn't cause any error). So here is the csv file I used.
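For reference, the slicing above can be scripted; a minimal sketch, wrapping the lqmm call (shown in full below) in a helper:
# returns TRUE if the model fits without error on the given slice
fits_ok <- function(d) {
  fit <- try(lqmm(std_brain ~ std_beh*type*taught, random = ~1,
                  group = subject, data = d, tau = .5,
                  na.action = na.exclude), silent = TRUE)
  !inherits(fit, "try-error")
}
fits_ok(head(mydata, 12528)) # TRUE
fits_ok(head(mydata, 12529)) # FALSE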
Here is the function that gets the dataframe ready for analysis:
prep_data_set <- function(data_file, brain_var = 'beta', beh_var = 'accuracy') {
  data <- read.csv(data_file)
  data$subject <- factor(data$subject)
  data$type <- factor(data$type)
  data$type <- relevel(data$type, ref = "S")
  data$taught <- factor(data$taught)
  data <- subset(data, data$run_num < 13)
  data$run <- factor(data$run_num)
  brain_mean <- mean(data[[brain_var]])
  brain_sd <- sd(data[[brain_var]])
  beh_mean <- mean(data[[beh_var]])
  beh_sd <- sd(data[[beh_var]])
  # drop rows with an empty cluster_name, then refactor
  data <- subset(data, data$cluster_name != "")
  data$cluster_name <- factor(data$cluster_name)
  # scale the brain and behavior variables
  data$mean_centered_brain <- data[[brain_var]]
  data$std_brain <- data$mean_centered_brain / brain_sd
  data$mean_centered_beh <- data[[beh_var]]
  data$std_beh <- data$mean_centered_beh / beh_sd
  return(data)
}
I run
mydata = prep_data_set(file.path(resdir, 'robust0005', 'pos_rel_con__all_clusters.csv'))
m1 = lqmm(std_brain ~ std_beh*type*taught, random = ~1, group=subject, data = mydata, tau=.5, na.action=na.exclude)
to generate the error.
By comparison
regular_model = lmer(std_brain ~ type*taught*std_beh + (1|subject/run) +
(1|subject:cluster_name), data = mydata)
runs fine.
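A couple of additional quick checks on the prepared data (a minimal sketch, using mydata as built above):
# count non-finite values per numeric column
sapply(mydata[sapply(mydata, is.numeric)], function(col) sum(!is.finite(col)))
# cross-tab of the factor predictors; empty cells would be suspect
with(mydata, table(type, taught))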
I hope there is something interesting and generalizable in this question; I know it's kind of annoying to post to Stack Overflow with an idiosyncratic problem in a ~30,000-line dataset.

Error with RandomForest in R because of "too many categories"

I'm trying to train an RF model in R, but when I try to define the model:
rf <- randomForest(labs ~ ., data = as.matrix(dd.train))
It gives me the error:
Error in randomForest.default(m, y, ...) :
Can not handle categorical predictors with more than 53 categories.
Any idea what it could be?
And no, before you say "You have some categorical variable with more than 53 categories": no, all variables but labs are numeric.
Tim Biegeleisen: Read the last line of my question and you will see why it is not the same as the one you are linking!
Edited to address follow-up from OP
I believe using as.matrix in this case implicitly creates factors. It is also not necessary for this package. You can keep it as a data frame, but you will need to make sure that any unused factor levels are dropped by using droplevels (or something similar). There are many reasons an unused factor level may be in your data set, but a common one is a dropped observation.
Below is a quick example that reproduces your error:
library('randomForest')
# making a toy data frame
x <- data.frame('one' = c(1, 1, 1, 1, 1, seq(50)),
                'two' = c(seq(54), NA),
                'three' = seq(55),
                'four' = seq(55))
x$one <- as.factor(x$one)
x <- na.omit(x) # getting rid of an NA. Note this removes the whole row.
randomForest(one ~ ., data = as.matrix(x)) # your first error
randomForest(one ~ ., data = x)            # your second error
x <- droplevels(x)
randomForest(one ~ ., data = x)            # OK
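As a quick check on your own data, counting the levels of each factor column will show whether anything exceeds randomForest's limit (dd.train here is the data frame from the question):
# number of levels per factor column; anything above 53 trips randomForest
sapply(Filter(is.factor, dd.train), nlevels)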

How do I generate spline bases from a character vector of response variables?

I am working on a problem where I need to fit many additive models of the form y ~ s(x), where the response y is constant whereas the predictor x varies between models. I am using mgcv::smoothCon() to set up the bases, and lm() to fit the models. The reason I do this, rather than calling gam() directly, is that I need the unpenalized fits. My problem is that smoothCon() requires its object argument to be unquoted, e.g., s(x), and I wonder how I can generate such unquoted arguments from a character vector of variable names.
A minimal example can be illustrated using the mtcars dataset. The following snippet shows what I am able to do at the moment:
library(mgcv)
# Variables for which I want to create a smooth term s(x)
responses <- c("mpg", "disp")
# At the moment, this is the only solution which I am able to make work
bs <- list(
  smoothCon(s(mpg), data = mtcars),
  smoothCon(s(disp), data = mtcars)
)
It would be nicer to be able to generate bs using some functional programming approach. I imagine something like this, where foo() is my missing link:
lapply(paste0("s(", responses, ")"),
       function(x) smoothCon(foo(x), data = mtcars))
I have tried noquote() and as.symbol(), but both fail.
responses <- c("mpg", "disp")
lapply(paste0("s(", responses, ")"),
       function(x) smoothCon(noquote(x), data = mtcars))
#> Error: $ operator is invalid for atomic vectors
lapply(paste0("s(", responses, ")"),
       function(x) smoothCon(as.symbol(x), data = mtcars))
#> Error: object of type 'symbol' is not subsettable
We can do this by converting the string to a language object, evaluating it, and then applying smoothCon():
library(tidyverse)
out <- paste0("s(", responses, ")") %>%
  map(~ rlang::parse_expr(.x) %>%
        eval %>%
        smoothCon(., data = mtcars))
identical(out, bs)
#[1] TRUE
Why don't you try it like this?
smoothCon(s(get("disp")), data = mtcars)
Instead of disp, you give the name of whichever variable you prefer. You can even put this inside a loop or any other construct you like.
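For completeness, the same conversion works in base R without rlang, since s() is an ordinary mgcv function whose call can be parsed and evaluated directly; a minimal sketch:
# parse each "s(x)" string into a call, evaluate it to a smooth spec,
# then hand that spec to smoothCon()
bs2 <- lapply(paste0("s(", responses, ")"),
              function(x) smoothCon(eval(parse(text = x)), data = mtcars))
identical(bs2, bs) # should be TRUE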

Error in panel regression in case of a different independent variable - r

I am trying to run a Fama-MacBeth regression with the following code:
require(foreign)
require(plm)
require(lmtest)
fpmg <- pmg(return ~ max_1, df_all_11, index = c("yearmonth", "firms"))
Fama<-fpmg
coeftest(Fama)
It works when I regress the data using the independent variable named 'max_1'. However, when I switch to another independent variable named 'ivol_1', the result is an error. The code is
require(foreign)
require(plm)
require(lmtest)
fpmg <- pmg(return ~ ivol_1, df_all_11, index = c("yearmonth", "firms"))
Fama<-fpmg
coeftest(Fama)
The error message is:
Error in pmg(return ~ ivol_1, df_all_11, index = c("yearmonth", "firms")) :
Insufficient number of time periods
or sometimes the error is:
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data, :
object is not a matrix
For your convenience, I am sharing my data with you. The data link is:
data frame
I am wondering why this happens with a different independent variable in the same data frame. I would be grateful if you could solve this problem.
This problem can be solved with the mice function:
library(mice)
library(dplyr)
require(foreign)
require(plm)
require(lmtest)
df_all_11 <- read.csv("df_all_11.csv.part", sep = ",", header = TRUE, stringsAsFactors = FALSE)
# impute the missing ivol_1 values with predictive mean matching
x <- data.frame(ivol_1 = df_all_11$ivol_1, month = df_all_11$Month)
imputed_Data <- mice(x, m = 3, maxit = 5, method = 'pmm', seed = 500)
completeData <- complete(imputed_Data, 3)
df_all_11 <- mutate(df_all_11, ivol_1 = completeData$ivol_1)
fpmg2 <- pmg(return ~ ivol_1, df_all_11, index = c("yearmonth", "firms"))
coeftest(fpmg2)
The problem arises because the variable ivol_1 has a lot of NAs, so you should impute the NAs first and then run the pmg function.
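You can see the extent of the missingness before imputing; a quick check, assuming the same df_all_11:
# compare missingness between the variable that works and the one that fails
colSums(is.na(df_all_11[c("max_1", "ivol_1")]))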

R: update factor levels of ctree (package party) features

I am trying to make sure that all my features of type factor are represented fully (in terms of all possible factor levels) both in my tree object and in my test set for prediction.
for (j in 1:length(predictors)) {
  if (is.factor(Test[, j])) {
    ct[[names(predictors)[j]]] <- union(ct$xlevels[[names(predictors)[j]]],
                                        levels(Test[, names(predictors)[j]]))
  }
}
However, for the object ct (a ctree from package party) I can't figure out how to access the features' factor levels, as I am getting an error:
Error in ct$xlevels : $ operator not defined for this S4 class
I have had this problem countless times, and today I came up with a little hack that removes the need to fix level discrepancies in factors: just fit the model on the whole dataset (train + test), giving zero weight to the test observations. This way the ctree model will not drop any factor levels.
# would trigger an error if the data passed to predict did not match the training data levels
a <- ctree(Y ~ ., DF[train.IDs, ]) %>% predict(newdata = DF)
# passing the IDs as 0-1 weights instead of subsetting the data solves it
b <- ctree(Y ~ ., weights = as.numeric(1:nrow(DF) %in% train.IDs), data = DF) %>% predict(newdata = DF)
mean(a == b) # test that the predictions are equal; should be 1
Tell me if it works as expected!
