So I am running a survival analysis on my dataset of Google Play Store downloads, but my analysis using survreg only gives me NAs for the coefficients:
"(5 not defined because of singularities)"
If I use a normal lm regression this problem does not occur. However, lm would not be appropriate here, since every observation of the dependent variable is right-censored at a different value (the observed numeric value is also the censoring limit).
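To illustrate what I mean by right-censored, here is my understanding of how survreg normally expects such an outcome (a toy sketch with simulated numbers, not my real data):
library(survival)
set.seed(1)
# simulate an outcome that is right-censored at a different limit per observation
x      <- runif(30, 0, 5)
true_y <- 2 + 3 * x + rnorm(30)
limit  <- 2.5 + 3 * x                  # observation-specific censoring limit
y      <- pmin(true_y, limit)          # we only observe the smaller of the two
status <- as.numeric(true_y <= limit)  # 1 = observed, 0 = right-censored
# Gaussian survreg on a right-censored outcome (a tobit-style model)
summary(survreg(Surv(y, status, type = "right") ~ x, dist = "gaussian"))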
My original dataset: https://www.kaggle.com/lava18/google-play-store-apps
Here is my entire code. It might be a bit long, so scroll to the end for the survival analysis, but I wanted to include everything for context.
library(readxl)
Dataset <- read_excel("Thesis/googleplaystore.xlsx")
View(Dataset)
# select 500 apps
set.seed(1998)
dataset <- Dataset[sample(nrow(Dataset), 500), ]
View(dataset)
#Lastupdated --> days_since
end <- matrix( c("2018-08-31"), nrow=500, ncol=1, byrow=FALSE)
end <- format(as.Date(end), "%Y/%m/%d")
View(end)
dataset$`Last Updated` <- as.Date(dataset$`Last Updated`,
format = "%B %d, %Y")
dataset$`Last Updated` <- format(as.Date(dataset$`Last Updated`), "%Y/%m/%d")
View(dataset)
install.packages('lubridate')
library(lubridate)
elapsed.time <- dataset$`Last Updated` %--% end
View(elapsed.time)
dataset$days_since <- as.duration(elapsed.time) / ddays(1)
View(dataset)
# remove + from the number of installs
dataset$Install <- gsub("\\+","", dataset$Installs)
View(dataset)
dataset$Install <- gsub(",","", dataset$Install)
# make Installs and Price numeric
typeof(dataset$Install)
dataset$Install <- as.numeric(dataset$Install)
View(dataset)
typeof(dataset$Rating)
dataset$Rating <- as.numeric(dataset$Rating)
typeof(dataset$Reviews)
typeof(dataset$Price)
dataset$Price <- gsub("\\$","", dataset$Price)
dataset$Price <- as.numeric(dataset$Price)
typeof(dataset$days_since)
#Tobit Survival analyses
library(help=survival)
library(survival)
dataset$ins_cen <- matrix( c("0"), nrow=500, ncol=1, byrow=FALSE)
typeof(dataset$ins_cen)
dataset$ins_cen <- as.numeric(dataset$ins_cen)
install.packages('tidyverse')
library(tidyverse)
dataset_2 <- dataset %>% filter(!is.na(dataset$Rating))
View(dataset_2)
dataset_2$dum_cen <- ifelse(dataset_2$ins_cen == 0, 0, 1)
dataset_2$dum_fac <- as.factor(dataset_2$dum_cen)
survreg(Surv(Install, ins_cen, type= 'right') ~ Rating + Price + Reviews + days_since,
dist="gaussian", data = dataset_2)
cor(dataset)
#CRCH
install.packages('crch')
library(crch)
View(dataset)
CRCH <- crch(Install ~ Rating + Price + Size + Reviews + days_since + `Current Ver` + Category, data = dataset, dist = 'gaussian', right = dataset_2$Install)
I tried turning the event into a dummy variable and a factor, but neither option works. The dummy variable changes nothing, while the factor variable gives an error:
Error in survreg(Surv(Install, dum_fac, type = "right") ~ Rating +
Price + : multi-state survival is not supported
Thanks for any help.
Sorry if I am asking stupid questions, but I am still learning and can't figure this problem out.
P.S. I also tried to solve the problem using crch(), but this led to a different error, which I can't wrap my head around either:
Error in optim(par = start, fn = loglikfun, gr = gradfun, method =
method, : non-finite value supplied by optim
Edit: I noticed I had left character variables in the crch formula. When these are removed, I get a different error:
Error in solve.default(hessfun(par)) : system is computationally
singular: reciprocal condition number = 7.31468e-142
CRCH code:
#CRCH
install.packages('crch')
library(crch)
View(dataset)
CRCH <- crch(Install ~ Rating + Price + Reviews + days_since, data = dataset, dist = 'gaussian', left = -Inf, right = dataset_2$Install)
x = Price + Size + Reviews + days_since + `Current Ver` + Category
I need help solving an error: when I run dp_stat at the end, my OLS model fails. Essentially, I try to find the cumulative difference between a benchmark and a given individual predictive model. Here dp is my independent variable and sg is my dependent variable; datanu is my Excel data. I'm not sure how to attach the data here, but here is a link to the Excel file and the code: https://drive.google.com/drive/folders/12BOuNBODURIP7CQIBZWMmHFc1d7zXHxN?usp=sharing If anyone has a fix, it would mean the world!
Error in lag(): ! n must be a positive integer, not a double vector of length 1.
rm(list= ls()) # Clear global environment
invisible(gc()) # Free up unused R-occupied memory
cat("\014") # Clear console output: equivalent to ctrl + L
library("tseries")
library("readxl")
library("Metrics")
library("lubridate")
library("ggplot2")
library("data.table")
library("dyn")
library("reshape2")
#header TRUE because the first row contains the column names
datanu <- read_xlsx("~/Documents/6.semester/Bachelor/Data/datanu.xlsx",
na = "NaN",
sheet = "datax")
myts <- ts(datanu, start=c(1872, 1), end=c(2020, 12), frequency=12)
plot(myts[, c("dp", "dy", "ep", "de")])
get_statistics <- function(myts, dp, sg, h=1, start=1872, end=2020, est_periods_OOS = 20) {
#### IS ANALYSIS
#1. Historical mean model for a portfolio
avg <- mean(window(myts, start, end)[, sg], na.rm=TRUE)
IS_error_N <- (window(myts, start, end)[, sg] - avg)
#2. OLS model
#reg <- dyn$lm(sg ~ lag(as.numeric(dp), 1), data=window(myts, start, end))
reg <- dyn$lm(eval(parse(text=sg)) ~ lag(eval(parse(text=dp)), -1), data=window(myts, start, end)) #Error in `lag()`:! `n` must be a positive integer, not a double vector of length 1.
IS_error_A <- reg$residuals
#OOS ANALYSIS
OOS_error_N <- numeric(end - start - est_periods_OOS)
OOS_error_A <- numeric(end - start - est_periods_OOS)
#use only information available up to the point the forecast is made
j <- 0
for (i in (start + est_periods_OOS):(end-1)) {
j <- j + 1
#Get the actual ERP that you want to predict
actual_ERP <- as.numeric(window(myts, i+1, i+1)[, sg])
#1. Historical mean model
OOS_error_N[j] <- actual_ERP - mean(window(myts, start, i)[, sg], na.rm=TRUE)
#2. OLS model
reg_OOS <- dyn$lm(eval(parse(text=sg)) ~ lag(eval(parse(text=dp)), -1),
data=window(myts, start, i))
#Compute_error
df <- data.frame(x=as.numeric(window(myts, i, i)[, dp]))
names(df) <- dp
pred_ERP <- predict.lm(reg_OOS, newdata=df)
OOS_error_A[j] <- pred_ERP - actual_ERP
}
#Compute statistics
MSE_N <- mean(OOS_error_N^2)
MSE_A <- mean(OOS_error_A^2)
T <- sum(!is.na(myts[, sg])) #number of non-missing observations
OOS_R2 <- 1 - MSE_A/MSE_N
#Is the -1 enough (maybe -2 needed because of lag)?
OOS_oR2 <- OOS_R2 - (1-OOS_R2)*(reg$df.residual)/(T - 1)
dRMSE <- sqrt(MSE_N) - sqrt(MSE_A)
##
#### CREATE PLOT
IS <- cumsum(IS_error_N[2:length(IS_error_N)]^2)-cumsum(IS_error_A^2)
OOS <- cumsum(OOS_error_N^2)-cumsum(OOS_error_A^2)
df <- data.frame(x=seq.int(from=start + 1 + est_periods_OOS, to=end),
IS=IS[(1 + est_periods_OOS):length(IS)],
OOS=OOS) #Because you lose one observation due to the lag
#Shift IS errors vertically, so that the IS line begins
# at zero on the date of the first OOS prediction (see Goyal/Welch (2008, p. 1465))
df$IS <- df$IS - df$IS[1]
df <- melt(df, id.var="x")
plotGG <- ggplot(df) +
geom_line(aes(x=x, y=value,color=variable)) +
geom_rect(data=data.frame(),#Needed by ggplot2, otherwise not transparent
aes(xmin=2008, xmax=2010,ymin=-0.2,ymax=0.2),
fill='red',
alpha=0.1) +
scale_y_continuous('Cumulative SSE Difference', limits=c(-0.2, 0.2)) +
scale_x_continuous('Year')
##
return(list(IS_error_N = IS_error_N,
IS_error_A = reg$residuals,
OOS_error_N = OOS_error_N,
OOS_error_A = OOS_error_A,
IS_R2 = summary(reg)$r.squared,
IS_aR2 = summary(reg)$adj.r.squared,
OOS_R2 = OOS_R2,
OOS_oR2 = OOS_oR2,
dRMSE = dRMSE,
plotGG = plotGG))
}
dp_stat <- get_statistics(myts, "dp", "sg", start=1872)
dp_stat$plotGG
As the error message states, n must be a positive integer, not a double vector of length 1. The error comes from providing n = -1 (a negative number) as the lag argument. I assume your idea is to lag by a negative number of positions, but this lag() method only accepts a positive number of positions. Instead of lag(), use lead() with n = 1 to achieve the desired result.
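For example, the substitution would look like this (an untested sketch against the same formula as above):
reg <- dyn$lm(eval(parse(text = sg)) ~ lead(eval(parse(text = dp)), 1),
data = window(myts, start, end))
#If the clash comes from another attached package masking stats::lag()
#(dplyr's lag(), for instance, only accepts a non-negative whole number),
#calling the stats version explicitly keeps the original negative-lag semantics:
reg <- dyn$lm(eval(parse(text = sg)) ~ stats::lag(eval(parse(text = dp)), -1),
data = window(myts, start, end))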
Long-time reader, first-time asker here :)
I have some data collected at specific times and dates, and there is reason to hypothesize the data roughly follows a 24-hour cycle. I would like to fit a sine wave model on my data as a function of time, so that it is possible to test if future data points fall on the predicted pattern.
I have read a few previous responses on this topic, but they do not solve my problem because, in my case, I'm hoping to keep the x-axis data in POSIXct date-time format. That's how the data is collected, and using this format makes for an easily interpreted plot.
Here's some reproducible data that is identical to my real data:
time <- c("2022-01-01 09:20:00", "2022-01-02 11:10:00",
"2022-01-02 18:37:00", "2022-01-03 14:01:00",
"2022-01-05 06:50:00", "2022-01-06 17:03:00")
time <- as.POSIXct(time)
value <- c(3, 6, 2, 8, 4, 1)
These are plotted fine in base R:
plot(time, value)
However, now I run into trouble when I try to construct a sine regression model that fits the time series. I'm also struggling to fully understand the parameters required by the nls function. Based on the previous examples, I have tried this approach (with comments on how I understand it working):
res <- nls(value ~ A * sin(omega * time + phi) + C, # This is the basic sine-function format
data = data.frame(time, value), # This defines the data used
start = list(A = 1, omega = 1, phi = 1, C = 1)) # This gives nls the starting values?
Here I get an error message, "Error in Ops.POSIXt(omega, time) : '*' not defined for "POSIXt" objects", which I interpret as meaning that this date format is not acceptable for this type of approach. I know this because if I simply replace the time variable with a dummy vector of integers, the model works fine and I can plot it as follows:
time2 <- c(1, 2, 3, 4, 5, 6)
res <- nls(value ~ A * sin(omega * time2 + phi) + C,
data = data.frame(time, value),
start=list(A=1, omega=1, phi=1, C=1))
coefs <- coef(res)
fit <- function(x, a, b, c, d) {a * sin(b * x + c) + d}
plot(time2, value)
curve(fit(x, a = coefs["A"], b = coefs["omega"],
c = coefs["phi"], d = coefs["C"]), add=TRUE,
lwd=2, col="red")
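(Side note: I realize POSIXct is stored internally as seconds since 1970-01-01, so even after converting to numeric the raw scale is enormous, which presumably matters for the nls starting values:)
as.numeric(as.POSIXct("2022-01-01 00:00:00", tz = "UTC"))
# 1640995200, i.e. roughly 1.6 billion seconds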
I know I'm on the right track, but my main question is: how can I do the above while keeping the time variable in POSIXct format?
As mentioned, my main order of business is to plot the data using ggplot2, but I can't begin to try that before solving the initial problem. Any pointers on how to get started with that are also greatly appreciated! :)
I would probably just generate a numeric count of days from an arbitrary origin time and use that. You can then modify your fit function so that it converts date-times to predicted values, which makes it easy to build a data frame of predictions from the model and plot it.
library(ggplot2) # for the plot at the end

df <- data.frame(time = time, value = value)
origin <- as.POSIXct("2022-01-01 00:00:00")
df$days <- as.numeric(difftime(df$time, origin, units = "days"))

res <- nls(value ~ A * sin(omega * days + phi) + C,
data = df,
start = list(A = 1, omega = 1, phi = 1, C = 1))

fit <- function(res, newdata) {
# convert date-times to numeric days since the same origin
x <- as.numeric(difftime(newdata$time, origin, units = "days"))
C <- as.list(coef(res))
C$A * sin(C$omega * x + C$phi) + C$C
}

# a fine grid of day offsets, for a smooth fitted curve
new_times <- seq(0, 6, by = 0.01)
new_df <- data.frame(time = origin + as.difftime(new_times, units = "days"))
new_df$value <- fit(res, new_df)
ggplot(df, aes(time, value)) +
geom_point() +
geom_line(data = new_df, colour = "gray") +
theme_bw()
I have been trying to work on IPTW using the Lalonde dataset. The code all works well until I try to print the svyCreateTableOne output, which gives me this error:
Error in round(n, digits = digits): non-numeric argument to mathematical function
Here is my code:
install.packages("Matching")
install.packages("ipw")
install.packages("survey")
install.packages("MatchIt")
library(tableone)
library(Matching)
library(ipw)
library(survey)
library(MatchIt)
data(lalonde)
#Fixing the data
age<-as.numeric(lalonde$age)
black<-as.numeric(lalonde$race=='black')
hispan<-as.numeric(lalonde$race=='hispan')
white<-as.numeric(lalonde$race=='white')
educ<-as.numeric(lalonde$educ)
nodegree<-as.numeric(lalonde$nodegree)
married<-as.numeric(lalonde$married)
re74<-lalonde$re74
re75<-lalonde$re75
re78<-lalonde$re78
treat<-as.numeric(lalonde$treat)
mydata<-cbind(age,black,hispan,educ,nodegree,married,re74,re75,re78,treat)
mydata<-data.frame(mydata)
xvars<-c("age","educ","black","hispan", "married","nodegree","re74","re75")
#table1
table1<-CreateTableOne(vars=xvars,strata="treat", data=mydata, test=FALSE)
print(table1,smd=TRUE)
#Propensity Scores
psmodel<-glm(treat ~ age + educ + black + hispan + married + nodegree + re74 + re75,
family=binomial(link="logit"), data=mydata)
summary(psmodel)
#value of propensity score for each subject
ps<-predict(psmodel,type="response")
#create weights
weight<-ifelse(treat==1,1/(ps),1/(1-ps))
#apply weights to data
weighteddata<-svydesign(ids = ~ 1, data =mydata, weights = ~weight)
#weighted table 1
weightedtable <- svyCreateTableOne(vars = xvars, strata="treat", data = weighteddata,
test = TRUE)
#show table with SMD
print(weightedtable,smd = TRUE)
I've made sure that everything is numeric, but the problem persists. I've looked around for answers as well, but nothing seems to help with my case. Can someone help, please? Thank you!
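One thing I wasn't sure about (just an assumption on my part): whether the weight vector needs to be an actual column of the data frame passed to svydesign, i.e.:
#attach the propensity weights as a column before building the design
mydata$weight <- ifelse(mydata$treat == 1, 1/ps, 1/(1 - ps))
weighteddata <- svydesign(ids = ~ 1, data = mydata, weights = ~weight)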
The basic gist is that I have a set of housing data and need to build a model that minimizes the difference between predicted and actual house prices. So I wrote this bit of code that tests a range of candidate exponents and keeps the one that minimizes that difference. I'm using the median instead of the mean because the data isn't exactly normal.
Since I only have experience with lm(), I'm using it to estimate the coefficients and intercept. But since the model likes exponents, I also have to test various exponents: the code does this for each variable in turn and then goes back to the first and re-evaluates it given the other exponents. The search starts with all exponents equal to 1, i.e. the same as the basic linear model. I know this is probably horribly inefficient and somewhat wasteful code, but I'm in my first R class, so sorry about the mess and/or convoluted logic.
Is there any way to do the same thing more efficiently (see the optim() sketch after my loop below for the direction I was wondering about)? Also, I can't really reduce the number of variables: the model benefits from having more of them and produces a greater error when they are left out.
library(dplyr)  # for %>% and mutate()
library(modelr) # for add_predictions()
#result holders for each of the 10000 candidate values
w <- seq(1,10000,1)
r <- seq(1,10000,1)
q75 <- seq(1,10000,1)
z <- seq(1,10000,1)
s <- seq(1,10000,1)
coef_1 <- c(6000,6000,6000,6000,6000,6000,6000,6000)
v <- rep(6000, each = 8)
for(l_1 in 1:10){
for(t_1 in 1:8){
for(i in 1:10000){
coef_1[t_1] = i #try the i-th candidate value for the current term
mod5 <- lm(log(SALE_PRC) ~ I(TOT_LVG_AREA^((coef_1[1]-5000)/1000)) + I(LND_SQFOOT^((coef_1[2]-5000)/1000)) + I(RAIL_DIST^((coef_1[3]-5000)/1000)) + I(OCEAN_DIST^((coef_1[4]-5000)/1000)) + I(CNTR_DIST^((coef_1[5]-5000)/1000)) + I(HWY_DIST^((coef_1[6]-5000)/1000)) +
I(structure_quality^((coef_1[7]-5000)/1000)) + SUBCNTR_DI + SPEC_FEAT_VAL + exp(((coef_1[8]-5000)/1000)*SPECIAL_RATIO) + age, data = kaggle_transform_final)
kaggle_new <- kaggle_transform_final %>%
add_predictions(model = mod5, var = "prediction") %>%
mutate(new_predict = exp(prediction)) %>%
mutate(new_difference = abs((new_predict-SALE_PRC))/SALE_PRC) %>%
mutate(average_percent_difference = median(new_difference)) %>%
mutate(mean_percent_difference = mean(new_difference)) %>%
mutate(quart_75 = quantile(new_difference,.75))
w[i] = kaggle_new$average_percent_difference[1]
r[i] = kaggle_new$mean_percent_difference[1]
q75[i] = kaggle_new$quart_75[1]
z[i] = i
s[i] = (i-5000)/1000
if(i%%100 ==0){show(i)}
}
u <- data.frame(median_diff = w, mean_diff = r, quart_75 = q75, actual = s, number = z) %>%
arrange(median_diff)
coef_1[t_1] <- u$number[1]
v[t_1] <- u$actual[1]
show(coef_1)
}
}
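For what it's worth, here is the optim() direction I was wondering about as an alternative to the grid: let a numerical optimizer search the exponents continuously instead of looping over 10000 candidates per term. This is just a sketch of the idea, untested on my data; it assumes the same kaggle_transform_final columns as above and that the transformed predictors stay positive:
#median absolute percentage error as a function of the seven exponents
#and the ratio coefficient (the same roles coef_1 plays above)
obj <- function(p) {
mod <- lm(log(SALE_PRC) ~ I(TOT_LVG_AREA^p[1]) + I(LND_SQFOOT^p[2]) +
I(RAIL_DIST^p[3]) + I(OCEAN_DIST^p[4]) + I(CNTR_DIST^p[5]) +
I(HWY_DIST^p[6]) + I(structure_quality^p[7]) +
SUBCNTR_DI + SPEC_FEAT_VAL + exp(p[8] * SPECIAL_RATIO) + age,
data = kaggle_transform_final)
pred <- exp(predict(mod, kaggle_transform_final))
median(abs(pred - kaggle_transform_final$SALE_PRC) / kaggle_transform_final$SALE_PRC)
}
#Nelder-Mead search starting from the all-ones model (the plain lm)
opt <- optim(par = rep(1, 8), fn = obj)
opt$par   #exponents found
opt$value #best median percentage error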
I have written this R code to reproduce the problem. I created a unique "ID" column, and I am not sure how to add the predicted column back to the test dataset, mapping each prediction to its respective ID. Please guide me on the right way to do this.
#Code
library(C50)
data(churn)
data=rbind(churnTest,churnTrain)
data$ID<-seq.int(nrow(data)) #adding unique id column
rm(churnTrain)
rm(churnTest)
set.seed(1223)
ind <- sample(2,nrow(data),replace = TRUE, prob = c(0.7,0.3))
train <- data[ind==1,1:21]
test <- data[ind==2, 1:21]
xtrain <- train[,-20]
ytrain <- train$churn
xtest <- test[,-20]
ytest<- test$churn
x <- cbind(xtrain,ytrain)
## C50 Model
c50Model <- C5.0(churn ~
state +
account_length +
area_code +
international_plan +
voice_mail_plan +
number_vmail_messages +
total_day_minutes +
total_day_calls +
total_day_charge +
total_eve_minutes +
total_eve_calls +
total_eve_charge +
total_night_minutes +
total_night_calls +
total_night_charge +
total_intl_minutes +
total_intl_calls +
total_intl_charge +
number_customer_service_calls,data=train, trials=10)
# Evaluate Model
c50Result <- predict(c50Model, xtest)
table(c50Result, ytest)
#adding prediction to test data
testnew = cbind(xtest,c50Result)
#OR predict directly
xtest$churn = predict(c50Model, xtest)
I’d use match(dataID, predictedID) to match the ID columns of the two data sets.
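For example (a sketch with hypothetical frames data and predicted, each carrying an ID column):
#align the predictions to the original rows by ID, then attach them
data$prediction <- predicted$c50Result[match(data$ID, predicted$ID)]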
In reply to your comment: if you want to add predicted values to the original dataframe, both ways of merging the data and the predictions are correct and produce identical results. The only thing is, I would use
xtest$churn_hut <- predict(c50Model, xtest)
instead of
xtest$churn <- predict(c50Model, xtest)
because here you are replacing the original churn column (as in data$churn) with whatever the model predicted, so you can no longer compare the two.
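With both columns in place, you can then compare predictions and actuals directly, e.g.:
#confusion table and accuracy of predicted vs. observed churn
table(predicted = xtest$churn_hut, actual = test$churn)
mean(xtest$churn_hut == test$churn)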