I'm trying to perform a Wald test on an interaction term in a survey-weighted model with imputed data - see below for a toy reprex.
When I run the last three lines of code with regTermTest using syntax modeled directly off the function documentation I get the following error: Error in terms.default(model) : no terms component nor attribute.
A quick Google search seems to suggest this error means I'm passing an unsupported object type to the function; I read this as perhaps regTermTest does not support passing an MIResult to model. However, it seems that as of version 3.6 of the survey package, regTermTest supports MIResult models? This page also seems to suggest that as well.
Appreciate any guidance on what I'm doing wrong. Alternatively I could be happy if someone knows how to get p-values on individual model terms from an MIResult object (e.g. like is shown in this post for a regular model object).
# load packages
library(tidyverse)
library(survey)
library(mi)
library(mitools)
# load data on school performance included in survey package
# documentation available here: https://r-survey.r-forge.r-project.org/survey/html/api.html
data(api)
# remove problematic variables that are unnecessary for this example
apisub <- apiclus1 %>% select(-c("name", "sname", "dname", "cname", "flag",
"acs.46", "acs.core"))
# create and update variable types in missing_data.frame
mdf <- missing_data.frame(apisub)
mdf <- change(mdf, "cds", what = "type", to = "irrelevant")
mdf <- change(mdf, "stype", what = "type", to = "irrelevant")
mdf <- change(mdf, "snum", what = "type", to = "irrelevant")
mdf <- change(mdf, "dnum", what = "type", to = "irrelevant")
mdf <- change(mdf, "cnum", what = "type", to = "irrelevant")
mdf <- change(mdf, "fpc", what = "type", to = "irrelevant")
mdf <- change(mdf, "pw", what = "type", to = "irrelevant")
# summarize the missing_data.frame
show(mdf)
# impute missing data
imputations <- mi(mdf)
# create imputation list to pass to svydesign
imp_list <- complete(imputations, m = 5)
# create complex survey design using imputed data
dsn <- svydesign(id = ~dnum,
weights = ~pw,
data = imputationList(imp_list),
fpc = ~fpc)
# subset the survey design to remove schools that did not meet both targets
# just as an example of subsetting
dsn_sub <- subset(dsn, both == "No")
# specify analytic model
anl <- with(dsn_sub,
svyglm(api99 ~ enroll + meals + avg.ed*ell,
family = gaussian(),
design = dsn
)
)
# combine results into a single output
res <- MIcombine(anl)
# perform wald test for main and ixn terms
regTermTest(res, ~meals)
regTermTest(res, ~avg.ed:ell)
regTermTest(res, ~avg.ed*ell)
Didn't figure out how to do this with regTermTest but I learned that you can run the following to run a global test for interaction instead:
library(aod)
> aod::wald.test(Sigma = vcov(res),
+ b = coef(res),
+ Terms = 6)
Wald test:
----------
Chi-squared test:
X2 = 0.031, df = 1, P(> X2) = 0.86
Related
EDIT: Duplicate of question answered here.
I am trying to work with survey weighted data where there is some substantial missingness across important variables. I am generally following the workflow from this archived tutorial on R-Forge. Unfortunately I am running into an error I can't seem to figure out when I attempt to reference the imputed data when create the complex survey design object.
I can't do a reproducible example of my actual data, but I run into the same issue when trying to do the same thing with the apiclus1 dataset included in the survey package, so putting that example below.
I removed several variables that are unimportant for imputation and a few that were causing issues - this should not meaningfully affect the example.
library(tidyverse)
library(survey)
library(mi)
library(mitools)
data(api)
apisub <- apiclus1 %>% select(-c("name", "sname", "dname", "cname", "flag",
"acs.46", "acs.core"))
mdf <- missing_data.frame(apisub)
mdf <- change(mdf, "cds", what = "type", to = "irrelevant")
mdf <- change(mdf, "stype", what = "type", to = "irrelevant")
mdf <- change(mdf, "snum", what = "type", to = "irrelevant")
mdf <- change(mdf, "dnum", what = "type", to = "irrelevant")
mdf <- change(mdf, "cnum", what = "type", to = "irrelevant")
mdf <- change(mdf, "fpc", what = "type", to = "irrelevant")
mdf <- change(mdf, "pw", what = "type", to = "irrelevant")
show(mdf)
imputations <- mi(mdf)
dsn1 <- svydesign(id = ~dnum, weights = ~pw, data = imputationList(imputations), fpc = ~fpc)
The error I get after running this last line says Error in as.list.default(X) : no method for coercing this S4 class to a vector.
Can someone help me understand what I'm doing wrong?
not sure i can answer but maybe this helps a bit? it's not clear to me how the library(mi) is supposed to fill in the missing data? svydesign() expects a list of data.frame objects, so the library(tidyverse) and library(mi) might create types not supported by the survey library and mitools..
# is this how to extract the four imputed data.frames from the `imputations` object?
# i am not sure
w <- lapply( imputations#data , data.frame )
# four data.frame objects in a list will now work:
dsn1 <- svydesign(id = ~dnum, weights = ~pw, data = imputationList(w), fpc = ~fpc)
# still missings in this variable, is that expected?
MIcombine( with( dsn1 , svymean( ~ avg.ed ) ) )
# result, ignoring the missings
MIcombine( with( dsn1 , svymean( ~ avg.ed , na.rm = TRUE ) ) )
I want to add columns to a survey.design created with the survey package, which can be done as following:
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
dclus2 <- transform(dclus1,
api00_b = api00 + 1)
svymean(~ api00, design = dclus2)
#> mean SE
#> api00 644.17 23.542
svymean(~ api00_b, design = dclus2)
#> mean SE
#> api00_b 645.17 23.542
For a more complex task, I need to create these variable names dynamically from external vectors. The following produces an error, but I think provides an illustration of what I want to achieve:
vars <- c("api00_a", "api00_b")
dclus2 <- transform(dclus1,
vars[[2]] = api00 + 1)
How could dynamic names for the new columns be implemented?
I don't think you can use a vector like this on the left-hand side of the equal sign in R. You don't have to use transform, which calls survey:::update.survey.design, though. You could just add your new variable directly:
dclus2 <- dclus1
dclus2$variables[ ,vars[[1]]] <- dclus2$variables[,"api00"] + 1
This is the same as creating the new variable before converting to a survey.design object, as long as you do not use any survey functions for creation of the new variable. Just using Anthony's comment:
apiclus2 <- apiclus1
apiclus2[ , vars[[1]]] <- apiclus2[ , "api00" ] + 1
dclus_prep_2 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus2, fpc = ~fpc)
You might prefer to use srvyr, which allows your kind of programming with dplyr's !!and :=:
library(srvyr)
dclus_srvyr_1 <- as_survey_design(.data = apiclus1,
ids = dnum,
weights = pw,
fpc = fpc)
dclus_srvyr_2 <- mutate(dclus_srvyr_1,
!!vars[[1]] := api00 + 1)
All versions have the same result:
lapply(list(dclus2, dclus_prep_2, dclus_srvyr_2),
function(design) svymean(~api00_a, design=design))
[[1]]
mean SE
api00_a 645.17 23.542
[[2]]
mean SE
api00_a 645.17 23.542
[[3]]
mean SE
api00_a 645.17 23.542
Here's a possible solution using purrr:
library(purrr)
vars <- c("api00_a", "api00_b")
transform_func <- function(data, vars) {
transform(data, vars = api00 + 1)
}
map(vars, ~transform_func(dclus1, .))
Which gives us the following list:
[[1]]
1 - level Cluster Sampling design
With (15) clusters.
update(`_data`, ...)
[[2]]
1 - level Cluster Sampling design
With (15) clusters.
update(`_data`, ...)
You can do this with bquote. For example
vars <- c("api00_plus_1", "api00_plus_2")
exprs<-list(quote(api00+1),quote(api00+2))
names(exprs)<-vars
bquote(update(dclus1,..(exprs)), splice=TRUE)
eval(bquote(update(dclus1,..(exprs)), splice=TRUE))
Here's another chunk from inside the survey package that converts any string variables mentioned in a formula to factor
strings_to_factors<-function(formula, design){
allv<-intersect(all.vars(formula), colnames(design))
vclass<-sapply(model.frame(design)[,allv,drop=FALSE], class)
if (!any(vclass=="character")) return(design)
vfix<-names(vclass)[vclass=="character"]
l<-as.list(vfix)
names(l)<-vfix
fl<-lapply(l, function(li) bquote(factor(.(as.name(li)))))
expr<-bquote(update(design, ..(fl)), splice=TRUE)
eval(expr)
}
I am trying to run a multinomial regression with imputed data. I can do this with the nnet package, however I want to use mlogit. Using the mlogit package I keep getting the following error "Error in 1:nrow(data) : argument of length 0".
So making the data
library(mlogit)
library(nnet)
library(tidyverse)
library(mice)
df <- data.frame(vax = sample(1:6, 500, replace = T),
age = runif(500, 12, 18),
var1 = sample(1:2, 500, replace = T),
var2 = sample(1:5, 500, replace = T))
# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp
df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)
# Inpute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)
It works using nnet's multinom:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))
But throws up an error when I try to reshape the data into mlogit format
test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))
Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?
Answer
You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2,
data = data.idx,
reflevel = "1",
nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
How with.mids works
When you apply with to a mids object (AKA the output of mice::mice), then you are actually calling with.mids.
If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:
It loops over all imputed datasets.
It uses complete to get one dataset.
It runs the expression with the dataset as the environment.
The third step is the problem. For functions that use formulas (like lm, glm and multinom), you can use that formula within a given environment. If the variables are not in the current environment (but rather in e.g. a data frame), you can specify a new environment by setting the data variable.
The problems
This is where both your problems derive from:
In your multinom call, you set the data variable to be df. Hence, you are actually running your multinom on the original df, NOT the imputed dataset!
In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument, but only the environment. That isn't sufficient for you.
Fixing multinom
The solution for your multinom line is simple: just don't specify data:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
summary(pool(multinomtest))
As you will see, this will yield very different results! But it is important to realise that this is what you are trying to obtain.
Fixing dfidx (and mlogit)
We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1", nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
From there, all we have to do is make something that looks like a mira object, so that we can still use pool:
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
Offering this as a way forward to circumvent the error with dfidx():
df5 <- df4$imp %>%
# work with a list, where each top-element is a different imputation run (imp_n)
map(~as.list(.x)) %>%
transpose %>%
# for each run, impute and return the full (imputed) data set
map(function(imp_n.x) {
df_out <- df4$data
df_out$vax[is.na(df_out$vax)] <- imp_n.x$vax
df_out$age[is.na(df_out$age)] <- imp_n.x$age
df_out$var1[is.na(df_out$var1)] <- imp_n.x$var1
df_out$var2[is.na(df_out$var2)] <- imp_n.x$var2
return(df_out)
}) %>%
# No errors with dfidx() now
map(function(imp_n.x) {
dfidx(data = imp_n.x, choice = "vax", shape = "wide")
})
However, I'm not too familiar with mlogit(), so can't help beyond this.
Update 8/2/21
As #slamballais mentioned in their answer, the issue is with dataset you refer to when fitting the model. I assume that mldata (from your code in the comments section) is a data.frame? This is probably why you are seeing the same coefficients - you are not referring to the imputed data sets (which I've identified as imp_n.x in the functions). The function purrr::map() is very similar to lapply(), where you apply a function to elements of a list. So to get the code working properly, you would want to change mldata to imp_n.x:
# To fit mlogit() for each imputed data set
df5 %>%
map(function(imp_n.x) {
# form as specified in the comments
mlogit(vax ~ 1 | age + var1 + var2,
data = imp_n.x,
reflevel = "1",
nests = list(type1 = c('1', '2'),
type2 = c('3','4'),
type3 = c('5','6')))
})
I'm currently working on a user-item collaborative filtering model.
I have a set of users and places they have shopped at, and have attempted to build a recommender model using R.
There are two aims of this project:
a) Recommend new shops to ALL customers
b) Give a stat to show how accurate the model is.
I have 2 years worth of data.
To answer b), I have subset my data to customers that have purchased in both the first 1.5 years AND in the following 6 months.
I have created a model on the data of transactions in the first 1.5 years, then have compared to model predictions to the ACTUAL 6 months of data.
By performing the above, I determined that I was to use UBCF and nn=500, and I was able to achieve accuracy of approx 80%.
However, I am now unsure of how to predict for the ENTIRE user base.
I was thinking of applying the ENTIRE dataset to the model I have just created, but there is bias/will not be accurate, as not all shops are represented in this small model I have created.
I have read articles and tutorials where people have done different things.
I have seen one where they input the entire dataset, and apply the [which] subsetting, so that it creates the model in 80% and tests using the remaining 20%.
My question is, if I was to use this process, how would I then get recommendations for ALL users, when the model only gives predictions for 20% of the users?
Is it best to create the model on the entire dataset?
SUBSET THE DATA
Create period flags
#If in 1.5 years, then 1. If in following 6 months, then 0.
FV$Flag1<-ifelse(FV$Date<="2018-10-01",1,0)
FV$Flag2<-ifelse(FV$Date>"2018-10-01",1,0)
IDENTIFICATION OF CUSTOMERS TO USE IN TRAINING MODEL
#Create SCV
#FV
FV_SCV<-select(FV, Customer, Flag1, Flag2) %>%
group_by(Customer) %>%
summarise_all(funs(sum)) #Sum all variables.
#Determine which customers to use based on if they have purchased both in the first and second years
FV_SCV$Use<-ifelse(FV_SCV$Flag1>0 & FV_SCV$Flag2>0, 1,0)
EXTRACT CUSTOMER LIST FOR TRAINING MODEL
#Training. Where customers have purchased both in the first & second year, but we only run the model on the first.
FV_Train<-FV_SCV %>%
filter(Use==1 )
SUBSET TO CUSTOMERS THAT HAVE PURCHASED IN 1 YEAR AND OF THE CUSTOMERS THAT HAVE PURCHASED IN BOTH YEARS, TO ONLY THOSE THAT HAVE SHOPPED IN THE FIRST YEAR
#FV_SCV$flag_sum<- FV_SCV$Flag1+FV_SCV$Flag2
SCV FOR CUSTOMERS USED IN THE TRAINING MODEL
#Join on the USE flag
FV_Train_Transactions<- FV %>% #Join on the page info
left_join(select(FV_Train, Customer, Use), by=c("Customer"="Customer"))
#Replace NA with 0
FV_Train_Transactions[is.na(FV_Train_Transactions)] <- 0
##Subset to only the users' transactions to be used in training
FV_Train_Transactions<-FV_Train_Transactions %>%
filter(Use==1)
##Create date flag for train and test to use to create the model on the train and comparing the results with the output of the test df
FV_Train_Transactions_Compare<-FV_Train_Transactions %>%
filter(Flag2>0)
##Create SCV for TRAIN
FV_TRAIN_SCV<-FV_Train_Transactions %>%
filter(Flag1>0) %>%
group_by(Customer, Brand)%>%
select(Customer, Brand)
FV_TRAIN_SCV$Flag<-1
#Not sure why this hasn't selected unique rows, so remove duplicates.
FV_TRAIN_SCV<-distinct(FV_TRAIN_SCV)
##Create scv for TEST
FV_TEST_SCV<-FV_Train_Transactions_Compare %>%
filter(Flag2>0) %>%
select(Customer, Brand) %>%
group_by(Customer, Brand)
FV_TEST_SCV$Flag<-1
#Not sure why this hasn't selected unique rows, so remove duplicates.
FV_TEST_SCV<-distinct(FV_TEST_SCV)
Transpose to columns
install.packages("reshape")
install.packages("reshape2")
install.packages("tidytext")
library(reshape)
library(reshape2)
library(tidytext)
#Melt data for transposition
#Train
fv_train_md<-melt(FV_TRAIN_SCV, id=(c("Customer", "Brand")))
FV_TRAIN_SCV2<-dcast(fv_train_md, Customer~Brand, value="Flag", fun.aggregate = mean)
#Test
fv_test_md<-melt(FV_TEST_SCV, id=(c("Customer" , "Brand")))
#Do the same for the overall transactions table
#Make FV_SCV a binary rating matrix
fv_overall<- FV[,c(1,3)] #The table name is case sensitive. Select only the customer and brand columns
fv_overall<- distinct(fv_overall) #Remove dups
fv_overall$Flag<-1
fv_overall_md<-melt(fv_overall, id=(c("Customer", "Brand")))
fv_overall_2<- dcast(fv_overall_md, Customer~Brand, value="Flag", fun.aggregate = mean)
#fv_test_123<-dcast(FV_TEST_SCV, Customer~Brand, value.var="Brand")
#colnames(fv_test_123)
#fv_test_12345<-which(fv_test_123==1, arr.ind=TRUE)
#fv_test_123<-colnames(fv_test_123)[fv_test_12345]
#fv_test_123
#fv_test_123_df<-as.data.frame((fv_test_123))
FV_TRAIN_SCV2<-dcast(fv_train_md, Customer~Brand, value="Value", fun.aggregate = mean)
FV_TEST_SCV2<-dcast(fv_test_md, Customer~Brand, value="Value", fun.aggregate = mean)
#Replace NaN with 0
is.nan.data.frame <- function(x)
do.call(cbind, lapply(x, is.nan))
FV_TRAIN_SCV2[is.nan(FV_TRAIN_SCV2)] <- 0
FV_TEST_SCV2[is.nan(FV_TEST_SCV2)] <- 0
fv_overall_2[is.nan(fv_overall_2)] <- 0
#
#
install.packages("recommenderlab")
library(recommenderlab)
row.names(FV_TRAIN_SCV2)<-FV_TRAIN_SCV2$Customer
FV_TRAIN_SCV2$Hawkers<-0
FV_TRAIN_SCV2$Pollini<-0
FV_TRAIN_SCV2$"Twin Set"<-0
FV_TRAIN_SCV2_matrix<-as.matrix(FV_TRAIN_SCV2[,2:ncol(FV_TRAIN_SCV2)])
FV_TRAIN_SCV2_binarymatrix<-as(FV_TRAIN_SCV2_matrix,"binaryRatingMatrix")
similarity_FV_train_trans_items<-similarity(FV_TRAIN_SCV2_binarymatrix, method="jaccard", which="items")
train_col<- data.frame(colnames(FV_TRAIN_SCV2))
#------------------------------------------------------------------------------------
row.names(fv_overall_2)<-fv_overall_2$Customer
#Convert NaN to 0
fv_overall_2[is.nan(fv_overall_2)]<-0
#fv_overall_matrix<- as.matrix(fv_overall_2[,2:(ncol(fv_overall_2)-3)])#Convert to matrix
fv_overall_matrix<- as.matrix(fv_overall_2[,2:ncol(fv_overall_2)])#Convert to matrix
#fv_overall_matrix<- as.matrix(fv_overall_matrix[,1:(ncol(fv_overall_2)-3)])
fv_matrix_binary<- as(fv_overall_matrix, "binaryRatingMatrix") #Make a binary ratings matrix
FV_overall_similarity<-similarity(fv_matrix_binary, method="jaccard", which="items")
overall_col<- data.frame(colnames(fv_overall_2))
#---------------------------------------------------------------------------------------------------------
#
#
#Now, define multiple recommender algorithms to compare them all.
algorithms <- list(`user-based CF 50` = list(name = "UBCF",param = list(method = "Jaccard", nn = 50)),
`user-based CF 100` = list(name = "UBCF",param = list(method = "Jaccard", nn = 100)),
`user-based CF 200` = list(name = "UBCF",param = list(method = "Jaccard", nn = 200)),
`user-based CF 500` = list(name = "UBCF",param = list(method = "Jaccard", nn = 500)),
#
`item-based CF 3` = list(name = "IBCF",param = list(method = "Jaccard", k = 3)),
`item-based CF 5` = list(name = "IBCF",param = list(method = "Jaccard", k = 5)),
`item-based CF 10` = list(name = "IBCF",param = list(method = "Jaccard", k = 10)),
`item-based CF 50` = list(name = "IBCF",param = list(method = "Jaccard", k = 50))
)
scheme <- evaluationScheme(FV_TRAIN_SCV2_binarymatrix, method = "cross", k = 4,given = 1)
scheme <- evaluationScheme(fv_matrix_binary, method = "cross", k = 4,given = 1)
results <- evaluate(scheme, algorithms, n = c(1,2,3,4,5,6,7,8))
results <- evaluate(scheme, algorithms, n = c(1,2,3,4,5,6,7,8,50,100,200,500))#Evaluating with n=c(1,2,3.....) being the number of recommendations
#names(results) #Check that all results have run.
#results
#Plot results to help determine which of the above models is best for further analysis
#plot(results, annotate = c(1, 3), legend = "right") #ROC Curve
#plot(results, "prec/rec", annotate = 3) #Precision/Recall Plot
The first of these plots (with FPR on x axis) is the ROC curve. The better performing model is the curve with the highest area therefore the better performing model, of these tested parameters, is UBCF with nn=500. Or, with nn=50.
Based on the precision/recall plot, nn should be set to 500.
MODEL USING UBCF nn = 500
recc_model <- Recommender(data = FV_TRAIN_SCV2_binarymatrix, method = "UBCF",
parameter = list(method = "Jaccard",
nn=500))
model_details <- getModel(recc_model)
model_details
#Running on ENTIRE DATA
recc_model <- Recommender(data = fv_matrix_binary, method = "UBCF",
parameter = list(method = "Jaccard",
nn=500))
model_details <- getModel(recc_model)
model_details
install.packages("plyr")
library(plyr)
#Get the scores
recc_predicted <- predict(object = recc_model, newdata = FV_TRAIN_SCV2_binarymatrix, n = 198, type="ratings")
ibcf_scores<-as(recc_predicted, "list")
ibcf_list_scores<-ldply(ibcf_scores, rbind)
ibcf_list_scores[is.na(ibcf_list_scores)]<-0
#Get the list of brands
recc_predicted_list2<-predict(object=recc_model,newdata= FV_TRAIN_SCV2_binarymatrix,type="topNList", n=198)
recc_predicted_list2_a<- as(recc_predicted_list2, "list")
recc_predicted_list2_b<-ldply(recc_predicted_list2_a, rbind)
#recc_predicted_list2_b[is.na(recc_predicted_list2_b)]<-0
#------------------------------------------------------------
#On the overall model:
#Get the scores
recc_predicted <- predict(object = recc_model, newdata = fv_matrix_binary, n = 80, type="ratings")
ibcf_scores<-as(recc_predicted, "list")
ibcf_list_scores<-ldply(ibcf_scores, rbind)
ibcf_list_scores[is.na(ibcf_list_scores)]<-0
#Get the list of brands
recc_predicted_list2<-predict(object=recc_model,newdata= fv_matrix_binary,type="topNList", n=80)
recc_predicted_list2_a<- as(recc_predicted_list2, "list")
recc_predicted_list2_b<-ldply(recc_predicted_list2_a, rbind)
#recc_predicted_list2_b[is.na(recc_predicted_list2_b)]<-0
Reshape df so all ratings are in one column. Use this to then create a unique table, to thendo a count if, as this always crashes excel.
install.packages("data.table")
library(data.table)
df.m1<- melt(ibcf_list_scores, id.vars=c(".id"),
value.name="Rating")
df.m1.unique<- data.frame(df.m1)
df.m1.unique$variable<-NULL
df.m1.unique$.id<-NULL
#df.m1.unique<-distinct(df.m1.unique)
#df.m1.unique<- df.m1.unique[order(df.m1.unique$Rating),] #This comma means it is only ordering based on this one var.
#Using ave
df.m1.unique$count<- ave(df.m1.unique$Rating, df.m1.unique[,c("Rating")], FUN=length)
rownames(df.m1.unique) <- c() #Remove rownames
df.m1.unique<-distinct(df.m1.unique)
df.m1.unique<- df.m1.unique[order(-df.m1.unique$Rating),]#Sort by ascending rating
#Plot this
df.m1.unique.plot<- data.frame(df.m1.unique[2:(nrow(df.m1.unique)-1),])
#plot(x=df.m1.unique.plot$Rating, y=df.m1.unique.plot$count)
#Get the cumulative distribution
df.m1.unique.plot2<- df.m1.unique.plot %>%
mutate(Percentage=cumsum(100*(count/sum(count))),
cumsum=cumsum(count))
Remove ratings
#a) Remove values that are less than specific rating
#Using logical indexing with replacement
ibcf_list_scores_removal<- ibcf_list_scores
#Replace low values with 0
ibcf_list_scores_removal[,2:ncol(ibcf_list_scores_removal)][ibcf_list_scores_removal[,2:ncol(ibcf_list_scores_removal)] < 0.0217] <- 0
#To flag whether customer is recommended the brand, replace all values >0 with 1. Keep 0 as is.
ibcf_list_scores_removal_b<- ibcf_list_scores_removal #Call a new df
ibcf_list_scores_removal_b[,2:ncol(ibcf_list_scores_removal_b)][ibcf_list_scores_removal_b[,2:ncol(ibcf_list_scores_removal_b)] > 0] <- 1#Create the flag
So basically I'd like to know how to create the model on my ENTIRE dataset?
And how to extract all ratings?
Thank you
I'm trying to specify a cluster variable after plm using vcovCR() in clubSandwich package for my simulated data (which I use for power simulation), but I get the following error message:
"Error in [.data.frame(eval(mf$data, envir), , index_names) : undefined columns selected"
I'm not sure if this is specific to vcovCR() or something general about R, but could anyone tell me what's wrong with my code? (I saw a related post here How to cluster standard errors of plm at different level rather than id or time?, but it didn't solve my problem).
My code:
N <- 100;id <- 1:N;id <- c(id,id);gid <- 1:(N/2);
gid <- c(gid,gid,gid,gid);T <- rep(0,N);T = c(T,T+1)
a <- qnorm(runif(N),mean=0,sd=0.005)
gp <- qnorm(runif(N/2),mean=0,sd=0.0005)
u <- qnorm(runif(N*2),mean=0,sd=0.05)
a <- c(a,a);gp = c(gp,gp,gp,gp)
Ylatent <- -0.05*T + a + u
Data <- data.frame(
Y = ifelse(Ylatent > 0, 1, 0),
id = id,gid = gid,T = T
)
library(clubSandwich)
library(plm)
fe.fit <- plm(formula = Y ~ T, data = Data, model = "within", index = "id",effect = "individual", singular.ok = FALSE)
vcovCR(fe.fit,cluster=Data$id,type = "CR2") # doesn't work, but I can run this by not specifying cluster as in the next line
vcovCR(fe.fit,type = "CR2")
vcovCR(fe.fit,cluster=Data$gid,type = "CR2") # I ultimately want to run this
Make your data a pdata.frame first. This is safer, especially if you want to have the time index created automatically (seems to be the case looking at your code).
Continuing what you have:
pData <- pdata.frame(Data, index = "id") # time index is created automatically
fe.fit2 <- plm(formula = Y ~ T, data = pData, model = "within", effect = "individual")
vcovCR(fe.fit2, cluster=Data$id,type = "CR2")
vcovCR(fe.fit2, type = "CR2")
vcovCR(fe.fit2,cluster=Data$gid,type = "CR2")
Your example does not work due to a bug in clubSandwich's data extraction function get_index_order (from version 0.3.3) for plm objects. It assumes both index variables are in the original data but this is not the case in your example where the time index is created automatically by only specifying the individual dimension by the index argument.