How to map predicted values to unique id in dataset? - r

I have written this R code to reproduce. Here, I have a created a unique column "ID", and I am not sure how to add the predicted column back to test dataset mapping to their respective IDs. Please guide me on the right way to do this.
#Code
library(C50)
data(churn)
data=rbind(churnTest,churnTrain)
data$ID<-seq.int(nrow(data)) #adding unique id column
rm(churnTrain)
rm(churnTest)
set.seed(1223)
ind <- sample(2,nrow(data),replace = TRUE, prob = c(0.7,0.3))
train <- data[ind==1,1:21]
test <- data[ind==2, 1:21]
xtrain <- train[,-20]
ytrain <- train$churn
xtest <- test[,-20]
ytest<- test$churn
x <- cbind(xtrain,ytrain)
## C50 Model
c50Model <- C5.0(churn ~
state +
account_length +
area_code +
international_plan +
voice_mail_plan +
number_vmail_messages +
total_day_minutes +
total_day_calls +
total_day_charge +
total_eve_minutes +
total_eve_calls +
total_eve_charge +
total_night_minutes +
total_night_calls +
total_night_charge +
total_intl_minutes +
total_intl_calls +
total_intl_charge +
number_customer_service_calls,data=train, trials=10)
# Evaluate Model
c50Result <- predict(c50Model, xtest)
table(c50Result, ytest)
#adding prediction to test data
testnew = cbind(xtest,c50Result)
#OR predict directly
xtest$churn = predict(c50Model, xtest)

I’d use match(dataID, predictedID) to match ID columns in data sets.
In reply to your comment:
If you want to add predicted values to the original dataframe, both ways of merging data and prediction are correct and produce identical result. The only thing is, I would use
xtest$churn_hut <- predict(c50Model, xtest)
instead of
xtest$churn <- predict(c50Model, xtest)
because here you are replacing original churn ( as in data$churn) with whatever the model predicted, so you can’t compare the two.

Related

R: How to loop through models dropping one observation at a time?

I have trouble looping through a regression model dropping one observation each time to estimate the effect of influential observations.
I would like to run the model several times, each time dropping the ith observation and extracting the relevant coefficient estimate and store it in a vector. I think this could quite easily be done with a fairly straight forward loop, however, I'm stuck at the specifics.
I want to be left with a vector containing n coefficient estimates from n iterations of the same model. Any help would be beneficial!
Below I provide some dummy data and example code.
#Dummy data:
set.seed(489)
patientn <- rep(1:400)
gender <- rbinom(400, 1, 0.5)
productid <- rep(c("Product A","Product B"), times=200)
country <- rep(c("USA","UK","Canada","Mexico"), each=50)
baselarea <- rnorm(400,400,60) #baseline area
baselarea2 <- rnorm(400,400,65) #baseline area2
sfactor <- c(
rep(c(0.3,0.9), times = 25),
rep(c(0.4,0.5), times = 25),
rep(c(0.2,0.4), times = 25),
rep(c(0.3,0.7), times = 25)
)
rashdummy2a <- data.frame(patientn,gender,productid,country,baselarea,baselarea2,sfactor)
Data <- rashdummy2a %>% mutate(rashleft = baselarea2*sfactor/baselarea*100) ```
## Example of how this can be done manually:
# model
m1<-lm(rashleft ~ gender + baselarea + sfactor, data = data)
# extracting relevant coefficient estimates, each time dropping a different "patient" ("patientn")
betas <- c(lm(rashleft ~ gender + baselarea + sfactor, data = rashdummy2b, patientn !=1)$coefficients[2],
lm(rashleft ~ gender + baselarea + sfactor, data = rashdummy2b, patientn !=2)$coefficients[2],
lm(rashleft ~ gender + baselarea + sfactor, data = rashdummy2b, patientn !=3)$coefficients[2])
# the betas vector now stores the relevant coefficient estimates (coefficient nr 2, for gender) for three different variations of the model.
We can use a for loop. In your question you use an object rashdummy2b which is not defined. Now I used data but you can replace that by an object of choice.
#create list to bind results to
result <- list()
#loop through patients and extract betas
for(i in unique(data$patientn)){
#construct linear model
lm.model <- lm(rashleft ~ gender + baselarea + sfactor, data = subset(data, data$patientn != i))
#create data.frame containing patient left out and coefficient
result.dt <- data.frame(beta = lm.model$coefficients[[2]],
patient_left_out = i)
#bind to list
result[[i]] <- result.dt
}
#bind to data.frame
result <- do.call(rbind, result)
Result
head(result)
beta patient_left_out
1 1.381248 1
2 1.345188 2
3 1.427784 3
4 1.361674 4
5 1.420417 5
6 1.454196 6
You can drop a particular row (or column) by using a negative index. In your case, you proceed as follows:
betas <- numeric(nrow(rashdummy2b)) # memory preallocation
for (i in 1:nrow(rashdummy2b)) {
betas[i] <- lm(rashleft ~ gender + baselarea + sfactor, data=rashdummy2b[-i,])$coefficients[2]
}

Is there a way to test a range of exponents in a lm() model in the same way as the code below more efficiently?

The basic gist is that I have a set of housing data that I need to create a model for to minimize the predicted price vs actual price of house based on the dataset. So I created this bit of code to essentially test for a range of different numerators and find the one that minimized the difference between them. I'm using the median instead of the mean as the data isn't exactly normal.
Since I only have experience with lm(), I'm using that to create the coefficients and C values. But since the model likes exponents, I have to also test various exponents. It does this for each of the variables and then goes back to the first and re-evaluates it based on the other exponents. The model starts out with all the exponents ending up equal to 1. So the same as the basic linear model. I know that this is probably horribly inefficient and probably uses a lot of code in a somewhat wasteful, but I'm in my first r class so sorry about the mess and/or convoluted coding logic.
Is there any way to do this same thing but being more efficient. Also, I can't really decrease the number of variables as the model likes having more variables and produces a greater margin of error when they aren't present.
w <- seq(1,10000,1)
r <- seq(1,10000,1)
t <- seq(1,10000,1)
z <- seq(1,10000,1)
s <- seq(1,10000,1)
coef_1 <- c(6000,6000,6000,6000,6000,6000,6000,6000)
v <- rep(6000, each = 8)
for(l_1 in 1:10){
for(t_1 in 1:8){
for(i in 1:10000){
t = t_1
coef_1[t] = i
mod5 <- lm(log(SALE_PRC) ~ I(TOT_LVG_AREA^((coef_1[1]-5000)/1000)) + I(LND_SQFOOT^((coef_1[2]-5000)/1000)) + I(RAIL_DIST^((coef_1[3]-5000)/1000)) + I(OCEAN_DIST^((coef_1[4]-5000)/1000)) + I(CNTR_DIST^((coef_1[5]-5000)/1000)) + I(HWY_DIST^((coef_1[6]-5000)/1000)) + I(structure_quality^((coef_1[7]-5000)/1000)) + SUBCNTR_DI + SPEC_FEAT_VAL + (exp(((coef_1[8]-5000)/1000)*SPECIAL_RATIO)) + age, data = kaggle_transform_final)
kaggle_new <- kaggle_transform_final %>%
add_predictions(model = mod5, var = "prediction") %>%
mutate(new_predict = exp(prediction)) %>%
mutate(new_difference = abs((new_predict-SALE_PRC))/SALE_PRC) %>%
mutate(average_percent_difference = median(new_difference)) %>%
mutate(mean_percent_difference = mean(new_difference)) %>%
mutate(quart_75 = quantile(new_difference,.75))
w[i] = kaggle_new$average_percent_difference[1]
r[i] = kaggle_new$mean_percent_difference[1]
t[i] = kaggle_new$quart_75[1]
z[i] = i
s[i] = (i-5000)/1000
if(i%%100 ==0){show(i)}
}
u <- data.frame(median_diff = w, mean_diff = r, quart_75 = t, actual = s, number = z) %>%
arrange(median_diff)
coef_1[t_1] <- u$number[1]
v[t_1] <- u$actual[1]
show(coef_1)
}
coef_1 <- coef_1
}

Dummies not included in summary

I want to create a function which will perform panel regression with 3-level dummies included.
Let's consider within model with time effects :
library(plm)
fit_panel_lr <- function(y, x) {
x[, length(x) + 1] <- y
#adding dummies
mtx <- matrix(0, nrow = nrow(x), ncol = 3)
mtx[cbind(seq_len(nrow(mtx)), 1 + (as.integer(unlist(x[, 2])) - min(as.integer(unlist(x[, 2])))) %% 3)] <- 1
colnames(mtx) <- paste0("dummy_", 1:3)
#converting to pdataframe and adding dummy variables
x <- pdata.frame(x)
x <- cbind(x, mtx)
#performing panel regression
varnames <- names(x)[3:(length(x))]
varnames <- varnames[!(varnames == names(y))]
form <- paste0(varnames, collapse = "+")
x_copy <- data.frame(x)
form <- as.formula(paste0(names(y), "~", form,'-1'))
params <- list(
formula = form, data = x_copy, model = "within",
effect = "time"
)
pglm_env <- list2env(params, envir = new.env())
model_plm <- do.call("plm", params, envir = pglm_env)
model_plm
}
However, if I use data :
data("EmplUK", package="plm")
dep_var<-EmplUK['capital']
df1<-EmplUK[-6]
In output I will get :
> fit_panel_lr(dep_var, df1)
Model Formula: capital ~ sector + emp + wage + output + dummy_1 + dummy_2 +
dummy_3 - 1
<environment: 0x000001ff7d92a3c8>
Coefficients:
sector emp wage output
-0.055179 0.328922 0.102250 -0.002912
How come that in formula dummies are considered and in coefficients are not ? Is there any rational explanation or I did something wrong ?
One point why you do not see the dummies on the output is because they are linear dependent to the other data after the fixed-effect time transformation. They are dropped so what is estimable is estimated and output.
Find below some (not readily executable) code picking up your example from above:
dat <- cbind(EmplUK, mtx) # mtx being the dummy matrix constructed in your question's code for this data set
pdat <- pdata.frame(dat)
rhs <- paste(c("emp", "wage", "output", "dummy_1", "dummy_2", "dummy_3"), collapse = "+")
form <- paste("capital ~" , rhs)
form <- formula(form)
mod <- plm(form, data = pdat, model = "within", effect = "time")
detect.lindep(mod$model) # before FE time transformation (original data) -> nothing offending
detect.lindep(model.matrix(mod)) # after FE time transformation -> dummies are offending
The help page for detect.lindep (?detect.lindep is included in package plm) has some more nice examples on linear dependence before and after FE transformation.
A suggestion:
As for constructing dummy variables, I suggest to use R's factor with three levels and not have the dummy matrix constructed yourself. Using a factor is typically more convinient and less error prone. It is converted to the binary dummies (treatment style) by your typical estimation function using the model.frame/model.matrix framework.

Why do i get ''not defined because of singularities''?

So I am running a survival analysis on my dataset of google playstor downloads.
My analysis using survreg only provides me with nas for coefficients though.
"(5 not defined because of singularities)"
If I use a normal lm regression this problem does not occur. This would not work however since all observations of the dependent variable are right censored for a different number (the numeric value is also the limit).
My original dataset: https://www.kaggle.com/lava18/google-play-store-apps
So here I will show you my entire code. It might be a bit long so scroll to the end for the survival analysis, but I wanted to give you the ability to fully comprehend.
library(readxl)
Dataset <- read_excel("Thesis/googleplaystore.xlsx")
View(Dataset)
#selecteer 500 apps
set.seed(1998)
dataset <- Dataset[sample(nrow(Dataset), 500), ]
View(dataset)
#Lastupdated --> days_since
end <- matrix( c("2018-08-31"), nrow=500, ncol=1, byrow=FALSE)
end <- format(as.Date(end), "%Y/%m/%d")
View(end)
dataset$`Last Updated` <- as.Date(dataset$`Last Updated`,
format = "%B %d, %Y")
dataset$`Last Updated` <- format(as.Date(dataset$`Last Updated`), "%Y/%m/%d")
View(dataset)
install.packages('lubridate')
library(lubridate)
elapsed.time <- dataset$`Last Updated` %--% end
View(elapsed.time)
dataset$days_since <- as.duration(elapsed.time) / ddays(1)
View(dataset)
# + verwijdern uit aantal installs
dataset$Install <- gsub("\\+","", dataset$Installs)
View(dataset)
dataset$Install <- gsub(",","", dataset$Install)
# installs en price numeric maken
typeof(dataset$Install)
dataset$Install <- as.numeric(dataset$Install)
View(dataset)
typeof(dataset$Rating)
dataset$Rating <- as.numeric(dataset$Rating)
typeof(dataset$Reviews)
typeof(dataset$Price)
dataset$Price <- gsub("\\$","", dataset$Price)
dataset$Price <- as.numeric(dataset$Price)
typeof(dataset$days_since)
#Tobit Survival analyses
library(help=survival)
library(survival)
dataset$ins_cen <- matrix( c("0"), nrow=500, ncol=1, byrow=FALSE)
typeof(dataset$ins_cen)
dataset$ins_cen <- as.numeric(dataset$ins_cen)
install.packages('tidyverse')
library(tidyverse)
dataset_2 <- dataset %>% filter(!is.na(dataset$Rating))
View(dataset_2)
dataset_2$dum_cen <- ifelse(dataset_2$ins_cen == 0, 0, 1)
dataset_2$dum_fac <- as.factor(dataset_2$dum_cen)
survreg(Surv(Install, ins_cen, type= 'right') ~ Rating + Price + Reviews + days_since,
dist="gaussian", data = dataset_2)
cor(dataset)
#CRCH
install.packages('crch')
library(crch)
View(dataset)
CRCH <- crch(Install ~ Rating + Price + Size + Reviews +days_since + `Current Ver` + Category, data = dataset, dist = 'gaussian', right = dataset_2$Install)
I tried turning the event into an dummy variable and a factor but both options do not work. The dummy variable changes nothing, while the factor variable gives an error.
Error in survreg(Surv(Install, dum_fac, type = "right") ~ Rating +
Price + : multi-state survival is not supported
Thanks for any help.
Sorry if I am asking stupid questions but I am still learning and can't figure my problem out.
p.s. I also tried to solve my problem using crch() but this lead to a different error, where I can't seem to wrap my head around either.
Error in optim(par = start, fn = loglikfun, gr = gradfun, method =
method, : non-finite value supplied by optim
Edit: I noticed I left character variables in the crch code.
When this is removed from the formula I get a different error.
Error in solve.default(hessfun(par)) : system is computationally
singular: reciprocal condition number = 7.31468e-142
CRCH code:
#CRCH
install.packages('crch')
library(crch)
View(dataset)
CRCH <- crch(Install ~ Rating + Price + Reviews +days_since, data = dataset, dist = 'gaussian', left = -Inf, right = dataset_2$Install)
x = Price + Size + Reviews +days_since + `Current Ver` + Category

R: cant get a lme{nlme} to fit when using self-constructed interaction variables

I'm trying to get a lme with self constructed interaction variables to fit. I need those for post-hoc analysis.
library(nlme)
# construct fake dataset
obsr <- 100
dist <- rep(rnorm(36), times=obsr)
meth <- dist+rnorm(length(dist), mean=0, sd=0.5); rm(dist)
meth <- meth/dist(range(meth)); meth <- meth-min(meth)
main <- data.frame(meth = meth,
cpgl = as.factor(rep(1:36, times=obsr)),
pbid = as.factor(rep(1:obsr, each=36)),
agem = rep(rnorm(obsr, mean=30, sd=10), each=36),
trma = as.factor(rep(sample(c(TRUE, FALSE), size=obsr, replace=TRUE), each=36)),
depr = as.factor(rep(sample(c(TRUE, FALSE), size=obsr, replace=TRUE), each=36)))
# check if all factor combinations are present
# TRUE for my real dataset; Naturally TRUE for the fake dataset
with(main, all(table(depr, trma, cpgl) >= 1))
# construct interaction variables
main$depr_trma <- interaction(main$depr, main$trma, sep=":", drop=TRUE)
main$depr_cpgl <- interaction(main$depr, main$cpgl, sep=":", drop=TRUE)
main$trma_cpgl <- interaction(main$trma, main$cpgl, sep=":", drop=TRUE)
main$depr_trma_cpgl <- interaction(main$depr, main$trma, main$cpgl, sep=":", drop=TRUE)
# model WITHOUT preconstructed interaction variables
form1 <- list(fixd = meth ~ agem + depr + trma + depr*trma + cpgl +
depr*cpgl +trma*cpgl + depr*trma*cpgl,
rndm = ~ 1 | pbid,
corr = ~ cpgl | pbid)
modl1 <- nlme::lme(fixed=form1[["fixd"]],
random=form1[["rndm"]],
correlation=corCompSymm(form=form1[["corr"]]),
data=main)
# model WITH preconstructed interaction variables
form2 <- list(fixd = meth ~ agem + depr + trma + depr_trma + cpgl +
depr_cpgl + trma_cpgl + depr_trma_cpgl,
rndm = ~ 1 | pbid,
corr = ~ cpgl | pbid)
modl2 <- nlme::lme(fixed=form2[["fixd"]],
random=form2[["rndm"]],
correlation=corCompSymm(form=form2[["corr"]]),
data=main)
The first model fits without any problems whereas the second model gives me following error:
Error in MEEM(object, conLin, control$niterEM) :
Singularity in backsolve at level 0, block 1
Nothing i found out about this error so far helped me to solve the problem. However the solution is probably pretty easy.
Can someone help me? Thanks in advance!
EDIT 1:
When i run:
modl3 <- lm(form1[["fixd"]], data=main)
modl4 <- lm(form2[["fixd"]], data=main)
The summaries reveal that modl4 (with the self constructed interaction variables) in contrast to modl3 shows many more predictors. All those that are in 4 but not in 3 show NA as coefficients. The problem therefore definitely lies within the way i create the interaction variables...
EDIT 2:
In the meantime I created the interaction variables "by hand" (mainly paste() and grepl()) - It seems to work now. However I would still be interested in how i could have realized it by using the interaction() function.
I should have only constructed the largest of the interaction variables (combining all 3 simple variables).
If i do so the model gets fit. The likelihoods then are very close to each other and the number of coefficients matches exactly.

Resources