lm function is giving a warning that it is dropping rows - r

My question: do developers that make more games charge higher prices?
My code:
dev_data <- steam_data_final %>%
group_by(developer) %>%
summarize(num_dev = n(), avg_price = mean(price, na.rm = TRUE)) %>%
arrange(desc(num_dev))
dev_data
But this model isn't working; I get "Warning: Dropping 3038 rows with missing values":
mod_dev <- lm(num_dev ~ avg_price, data = dev_data)

Check whether any column contains NA, using summary() or is.na(). Rows with missing values are dropped when the model is fitted, which is what the warning reports. Note that mean(price, na.rm = TRUE) returns NaN for a developer whose prices are all missing, so avg_price can contain missing values even though you used na.rm = TRUE.
Also, it looks like you want lm(avg_price ~ num_dev, data = dev_data) rather than lm(num_dev ~ avg_price, data = dev_data): for your question the dependent variable should be avg_price, not num_dev. (This depends on your research question.)
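A minimal sketch of both points on made-up data. The toy_games data frame and its values are hypothetical stand-ins for steam_data_final, and base R is used instead of dplyr only to keep the example self-contained:

```r
# Hypothetical stand-in for steam_data_final
toy_games <- data.frame(
  developer = c("A", "A", "A", "B", "B", "C", "D", "D"),
  price     = c(9.99, 14.99, 19.99, 4.99, NA, NA, 29.99, 24.99)
)

# One row per developer: number of games and average price
num_dev   <- tapply(toy_games$price, toy_games$developer, length)
avg_price <- tapply(toy_games$price, toy_games$developer, mean, na.rm = TRUE)
dev_data  <- data.frame(
  developer = names(num_dev),
  num_dev   = as.vector(num_dev),
  avg_price = as.vector(avg_price)
)

# Count missing values per column; developer "C" has only NA prices,
# so its avg_price is NaN, which is.na() also flags
colSums(is.na(dev_data))

# Price as the response, number of games as the predictor
mod_dev <- lm(avg_price ~ num_dev, data = dev_data)

# Number of rows actually used in the fit (the NaN row is dropped)
nobs(mod_dev)
```

Comparing nobs(mod_dev) with nrow(dev_data) shows how many rows were dropped because of missing values.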


Problem `.x` is empty in pammtools packages

I am trying to replicate the example code in Bender and Scheipl for pammtools (Piece-wise exponential Additive Mixed Modeling tools), specifically a survival exercise with time-varying effects.
https://arxiv.org/pdf/1806.01042.pdf
library(dplyr); library(tidyr); library(purrr); library(ggplot2)
library(survival); library(mgcv); library(pammtools)
data("pbc", package="survival")
# event time information
pbc <- pbc %>%
filter(id <= 312) %>%
mutate(status = ifelse(status==0,0,1) )%>%
select(id:status, trt:sex, bili, protime)
pbc %>% slice(1:6)
pbc_ped <- as_ped(
data = list(pbc, pbcseq),
formula = Surv(pbc$time, pbc$status)~sex|concurrent(bili, protime, tz_var = "day"),
id = "id")
I always get the error
Error: .x is empty, and no .init supplied
I installed and checked Rtools, and I tried different (older) versions of purrr, since this error is sometimes related to the purrr version. I also tried running the code on https://rdrr.io/snippets/.
Any ideas? Thank you very much.
You have not used the code in that vignette. You also added pbc$ to the arguments in Surv(), a common mistake but generally not a productive strategy.
# Need to narrow the material from pbcseq
pbcseq <- pbcseq %>% select(id, day, bili, protime)
# I would have given it a different name
#------ Error when using "|" rather than "+"
pbc_ped <- as_ped(
data = list(pbc, pbcseq),
formula = Surv(time, status)~sex|concurrent(bili, protime, tz_var = "day"),
id = "id")
#Error: `.x` is empty, and no `.init` supplied
#________________
pbc_ped <- as_ped(
data = list(pbc, pbcseq),
formula = Surv(time, status)~sex + concurrent(bili, protime, tz_var = "day"),
id = "id") # No error
I think there may be an error in the vignette. I don't see any examples using the construct ...
Surv(time,status)~ variates | special(.)
They all use a "+" sign for adding the time-dependent covariates. If you go to https://adibender.github.io/pammtools//articles/data-transformation.html you see them using a "+" rather than a "|". I think there is some sloppiness in that package's documentation. But your additions only made the problem worse.
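On the pbc$ point: a formula that spells out data$column looks the column up in that object directly and bypasses the data argument, so any subsetting or reshaping applied to data is ignored. That is presumably part of why it causes trouble inside as_ped() too, though I have not traced it there. A small base-R illustration with lm() (the toy data frame d is hypothetical and unrelated to pbc):

```r
# Hypothetical data: 10 rows, of which only the first 5 should be modelled
d   <- data.frame(x = 1:10, y = c(3, 5, 8, 9, 12, 13, 14, 18, 19, 22))
sub <- d[1:5, ]

# Prefixed formula: d$y and d$x are taken from d itself, so `data = sub` is ignored
fit_prefixed <- lm(d$y ~ d$x, data = sub)

# Plain column names: looked up in `data`, as intended
fit_plain <- lm(y ~ x, data = sub)

nobs(fit_prefixed)  # number of rows used by the prefixed fit
nobs(fit_plain)     # number of rows used by the plain fit
```

The prefixed fit silently uses all rows of d, while the plain fit uses only the rows in sub.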

Treatment effect table in R with horizontally-oriented variables

I subsetted a data frame to keep only my 4 columns of interest, and I want to count the number of control (0) and treated (1) observations. I built a table with the gtsummary package, but the variables are vertically oriented (as in http://www.danieldsjoberg.com/gtsummary/articles/tbl_summary.html), one below the other, which is not what I want. I searched on Google, but all the tables I found have this orientation too.
I have attached a picture of what I'd like to obtain, in case anyone has an idea.
The code I use to obtain my initial table (same as in the link):
install.packages("gtsummary")
library(gtsummary)
# column names that start with a digit must be quoted with backticks
trial <- finaldf %>% select(treatment, `2digID`, `4digID`, classificationsdescription)
trial %>% tbl_summary()
t2 <- trial %>% tbl_summary(by = treatment)
I cannot share the real data, but I created an example that looks like it:
# R names cannot start with an underscore, so the ID vectors are named ID2 and ID4 here
ID2 <- c(38,38,38,38,38,38,38,38,38,38,80,80,80,80,80,80,80,80,80,80)
ID4 <- c(3837,3837,3837,3812,3812,3896,3894,3894,3877,3877, 8099,8099,8027,8027,8027,8033,8033,8064,8064,8022)
descriptions <- c('ILL1','ILL1','ILL1', 'ILL2','ILL2','ILL3','ILL4','ILL4','ILL5','ILL5','ILL1','ILL1','ILL2','ILL2','ILL2','ILL3','ILL3','ILL4','ILL4','ILL5')
trt <- c(0,0,0,1,1,1,0,0,1,1,0,0,1,1,1,0,0,1,1,0)
df.data <- data.frame(ID2, ID4, descriptions, trt)
UPDATE - SOLVED
I think I managed to solve this problem, even though my output is a data frame and not a "publication-ready" table:
install.packages("reshape2")
library(reshape2)
# backticks are needed because these column names start with an underscore
data_wide <- dcast(df, `_2digID` + `_4digID` + descriptions ~ treatment, value.var = "counts")
But I'm not sure yet that this gives the right numbers.
The example below gets you close, but not exactly what you're after. I like the idea of being able to support tables like this, and I'll add it to the list of features to implement!
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.4.1'
tbl <-
trial %>%
mutate(
grade = paste("Grade", as.character(grade)),
stage = paste("Stage", as.character(stage))
) %>%
tbl_strata(
strata = c(stage, grade),
~ .x %>%
tbl_summary(by = trt,
include = response,
type = response ~ "categorical",
missing = "no",
statistic = response ~ "{n}") %>%
modify_header(all_stat_cols() ~ "**{level}**"),
.combine_with = "tbl_stack"
) %>%
as_flex_table()
Table truncated because it was very long!
Created on 2021-07-14 by the reprex package (v2.0.0)
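If the goal is only to check the counts of control and treated observations per ID and description, a wide data frame can also be built in base R. This is a sketch on the example data from the question; the column names ID2 and ID4 are my own renaming (R names cannot begin with an underscore), and the result is a plain data frame, not a gtsummary table:

```r
# Example data from the question, with valid column names
df.data <- data.frame(
  ID2 = rep(c(38, 80), each = 10),
  ID4 = c(3837, 3837, 3837, 3812, 3812, 3896, 3894, 3894, 3877, 3877,
          8099, 8099, 8027, 8027, 8027, 8033, 8033, 8064, 8064, 8022),
  descriptions = c("ILL1", "ILL1", "ILL1", "ILL2", "ILL2", "ILL3", "ILL4",
                   "ILL4", "ILL5", "ILL5", "ILL1", "ILL1", "ILL2", "ILL2",
                   "ILL2", "ILL3", "ILL3", "ILL4", "ILL4", "ILL5"),
  trt = c(0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0)
)

# Indicator columns: 1 if the row is a control / treated observation
df.data$control <- as.integer(df.data$trt == 0)
df.data$treated <- as.integer(df.data$trt == 1)

# Sum the indicators within each ID2 / ID4 / description group:
# one row per group, one count column per treatment arm
wide <- aggregate(cbind(control, treated) ~ ID2 + ID4 + descriptions,
                  data = df.data, FUN = sum)
wide <- wide[order(wide$ID2, wide$ID4), ]
wide
```

The column totals of control and treated should add up to the number of rows in the original data, which gives a quick sanity check on a dcast() result.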

Error in 'add_p()' for variable X and test 'fisher.test', p-value omitted

I get the error below when I try to use the add_p() function to get a p-value for differences between my by variable (with 10 levels) and a categorical variable with two levels (yes/no). I am not sure how to provide a reproducible example. From the trials data, I imagine my by variable would be the "T Stage" variable with 10 levels, and the categorical variables would be: (1) "Chemotherapy Treatment" with 2 levels, and (2) "Chemotherapy Treatment2" with 4 levels. But here is the code I ran.
library(gtsummary)
library(tidyverse)
miro_def %>%
select(mheim, age_dx, time_t1d_yrs, gender, collard, fhist_pandz) %>%
tbl_summary(by = mheim, missing = "no",
type = list(c(gender, collard, fhist_pandz, mheim) ~ "categorical"),
label = list(gender ~ "Gender",
fhist_pandz ~ "Family history of PD",
age_dx ~ "Age at diagnosis",
time_t1d_yrs ~ "Follow-up(years)")) %>%
add_p() %>%
# style the output with custom header
#modify_header(stat_by = "{level}") %>%
# convert to kableExtra
as_kable_extra(booktabs = TRUE) %>%
# reduce font size to make table fit.
# you may also use the `latex_options = "scale_down"` argument here.
kable_styling(font_size = 7, latex_options = "scale_down")
However, I do get a p-value for this by variable (10 levels) with the other variables, which are continuous/numeric.
How can I fix this error?
When I have the multilevel by variable described above and a categorical variable with more than 2 levels, is there something special I should do to get a p-value?
There was an error in 'add_p()' for variable 'gender' and test 'fisher.test', p-value omitted:
Error in stats::fisher.test(data[[variable]], as.factor(data[[by]])): FEXACT error 7(location). LDSTP=18540 is too small for this problem,
(pastp=51.2364, ipn_0:=ipoin[itp=150]=215, stp[ipn_0]=40.6787).
Increase workspace or consider using 'simulate.p.value=TRUE'
There was an error in 'add_p()' for variable 'collard' and test 'fisher.test', p-value omitted:
Error in stats::fisher.test(data[[variable]], as.factor(data[[by]])): FEXACT error 7(location). LDSTP=18570 is too small for this problem,
(pastp=37.0199, ipn_0:=ipoin[itp=211]=823, stp[ipn_0]=23.0304).
Increase workspace or consider using 'simulate.p.value=TRUE'
There was an error in 'add_p()' for variable 'fhist_pandz' and test 'fisher.test', p-value omitted:
Error in stats::fisher.test(data[[variable]], as.factor(data[[by]])): FEXACT error 7(location). LDSTP=18570 is too small for this problem,
(pastp=36.4614, ipn_0:=ipoin[itp=58]=1, stp[ipn_0]=31.8106).
Increase workspace or consider using 'simulate.p.value=TRUE'
Since nobody posted an answer, here's what I used when I came across this. Following the Examples in the help file ?gtsummary::add_p.tbl_summary, I wrote a custom function that runs fisher.test() with the simulate.p.value = TRUE option:
## define custom test
fisher.test.simulate.p.values <- function(data, variable, by, ...) {
result <- list()
test_results <- stats::fisher.test(data[[variable]], data[[by]], simulate.p.value = TRUE)
result$p <- test_results$p.value
result$test <- test_results$method
result
}
## add p-values to your gtsummary table, using custom test defined above
summary_table %>%
add_p(
test = list(all_categorical() ~ "fisher.test.simulate.p.values") # this applies the custom test to all categorical variables
)
You can also change the number of replicates used for the simulated p-values by passing B (default 2000) to fisher.test() above.
All this assumes, of course, that it's appropriate to use Fisher's test in the first place.
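To see what the custom test does underneath, here is a sketch that calls fisher.test() directly with a simulated p-value and a larger B. The data are randomly generated and purely hypothetical (a yes/no variable against a 10-level grouping variable):

```r
set.seed(1)  # simulated p-values depend on the random seed

# Hypothetical data: 200 observations, 10-level group, 2-level outcome
grp    <- factor(sample(paste0("T", 1:10), 200, replace = TRUE))
yes_no <- factor(sample(c("yes", "no"), 200, replace = TRUE))

# Monte Carlo p-value instead of the exact network algorithm;
# B is the number of simulated tables (default 2000)
res <- fisher.test(yes_no, grp, simulate.p.value = TRUE, B = 10000)

res$p.value
res$method  # the method string records that the p-value was simulated
```

A larger B makes the simulated p-value more stable at the cost of run time.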
Since it fixed the issue for me, I would like to indicate that since version 1.3.6 of gtsummary there is an option in add_p() with which you can specify arguments to the test functions (i.e. test.args). Thank you to the developers for this!
From the NEWS:
Each add_p() method now has the test.args = argument. Use this argument to pass
additional arguments to the statistical method, e.g.
add_p(test = c(age, marker) ~ "t.test",
test.args = c(age, marker) ~ list(var.equal = TRUE))
It is also explained in the add_p() help (i.e. ?add_p).
I had a similar problem. You have to increase your workspace with test.args within add_p().
miro_def %>%
select(mheim, age_dx, time_t1d_yrs, gender, collard, fhist_pandz) %>%
tbl_summary(by = mheim, missing = "no",
type = list(c(gender, collard, fhist_pandz, mheim) ~ "categorical"),
label = list(gender ~ "Gender",
fhist_pandz ~ "Family history of PD",
age_dx ~ "Age at diagnosis",
time_t1d_yrs ~ "Follow-up(years)")) %>%
add_p(test.args = variable_with_no_pval ~ list(workspace=2e9))
or
add_p(test.args = all_test("fisher.test") ~ list(workspace=2e9))
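A note on the size of that value: as I read ?fisher.test, workspace is given in units of 4 bytes, so workspace = 2e9 would request on the order of 8 GB. It may be worth trying smaller values first. The sketch below uses a small made-up table only to show that the argument changes the memory available to the exact algorithm and not the result:

```r
# Hypothetical 2 x 6 table of counts
tab <- matrix(c(3, 1, 2, 4, 2, 3, 1, 5, 2, 2, 4, 1), nrow = 2)

# Default workspace (200000) versus a larger one
res_default <- fisher.test(tab)
res_bigger  <- fisher.test(tab, workspace = 2e6)

# Same exact p-value either way; workspace only matters when the
# default is too small for the problem (the FEXACT error in the question)
all.equal(res_default$p.value, res_bigger$p.value)
```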

Getting error message while calculating rmse in a time series analysis

I am trying to replicate this example of time series analysis in R using Keras (see here), and unfortunately I get an error message while computing the first average RMSE:
coln <- colnames(compare_train)[4:ncol(compare_train)]
cols <- map(coln, quo(sym(.)))
rsme_train <-
map_dbl(cols, function(col)
rmse(
compare_train,
truth = value,
estimate = !!col,
na.rm = TRUE
)) %>% mean()
rsme_train
Error message:
Error in is_symbol(x) : object '.' not found
There are some helpful comments at the bottom of the post, but installing a newer version of dplyr doesn't really help. Any suggestions on how to get around this?
I stumbled upon the same problem, so here's a solution that is close to the original code.
The transformation for cols is not necessary, because !! works with the character vector already. You can change the code to
coln <- colnames(compare_train)[4:ncol(compare_train)]
rsme_train <-
map_df(coln, function(col)
rmse(
compare_train,
truth = value,
estimate = !!col,
na.rm = TRUE
)) %>%
pull(.estimate) %>%
mean()
rsme_train
You might also want to check for updates of tidyverse, just to be sure.
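For comparison, the same quantity (RMSE of each prediction column against value, then the mean of those RMSEs) can be computed in base R without yardstick or tidy evaluation. This is a sketch on a small made-up compare_train; the column names pred1 and pred2 and the helper rmse_base() are hypothetical:

```r
# Hypothetical stand-in for compare_train: three leading columns,
# then one column per set of predictions
compare_train <- data.frame(
  index = 1:5,
  value = c(1, 2, 3, 4, 5),
  key   = "actual",
  pred1 = c(1, 2, NA, 4, 6),
  pred2 = c(2, 2, 3, 5, 5)
)
coln <- colnames(compare_train)[4:ncol(compare_train)]

# Root mean squared error, ignoring pairs with a missing value
rmse_base <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2, na.rm = TRUE))
}

# One RMSE per prediction column, looked up by name
rmse_by_col <- vapply(
  coln,
  function(col) rmse_base(compare_train$value, compare_train[[col]]),
  numeric(1)
)

rmse_train <- mean(rmse_by_col)
rmse_train
```

Because the columns are accessed by name with [[ ]], no quo(), sym() or !! is needed.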

randomForest Categorical Predictor Limits

I understand that R's randomForest function can only handle categorical predictors with fewer than 54 categories. However, when I trim my categorical predictor down to fewer than 54 categories, I still get the error. The only questions I've seen about categorical predictor limits on Stack Overflow ask how to get around this limit, but I'm trying to reduce my number of categories to stay within the function's limitation, and I am still getting the error.
The following script creates a data frame so we can predict 'profession'. Understandably, I get the "Can not handle categorical predictors with more than 53 categories" error when trying to run randomForest() on 'df' due to the 'college_id' variable.
But when I trim my data set to only include the top 40 college IDs, I get the same error. Am I missing some basic data frame concept that retains all of the categories even though only 40 are now populated in the 'df2' data frame? What is a workaround option that I can use?
library(dplyr)
library(randomForest)
# create data frame
df <- data.frame(profession = sample(c("accountant", "lawyer", "dentist"), 10000, replace = TRUE),
zip = sample(c("32801", "32807", "32827", "32828"), 10000, replace = TRUE),
salary = sample(c(50000:150000), 10000, replace = TRUE),
college_id = as.factor(c(sample(c(1001:1040), 9200, replace = TRUE),
sample(c(1050:9999), 800, replace = TRUE))))
# results in error, as expected
rfm <- randomForest(profession ~ ., data = df)
# arrange college_ids by count and retain the top 40 in the 'df' data frame
sdf <- df %>%
dplyr::group_by(college_id) %>%
dplyr::summarise(n = n()) %>%
dplyr::arrange(desc(n))
sdf <- sdf[1:40, ]
df2 <- dplyr::inner_join(df, sdf, by = "college_id")
df2$n <- NULL
# confirm that df2 only contains 40 categories of 'college_id'
nrow(df2[which(!duplicated(df2$college_id)), ])
# THIS IS WHAT I WANT TO RUN, BUT STILL RESULTS IN ERROR
rfm2 <- randomForest(profession ~ ., data = df2)
I think your variable still has all of the original factor levels: subsetting rows does not drop levels that no longer occur in the data. Try adding this line before you fit the forest again:
df2$college_id <- factor(df2$college_id)
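A small base-R sketch of why this works, using a made-up factor rather than the data from the question: the number of levels stays the same after subsetting until the factor is rebuilt with factor() or cleaned with droplevels().

```r
# Hypothetical factor with 5 levels
college_id <- factor(c(1001, 1001, 1002, 1003, 5000, 7000))

# Keep only three of the colleges
kept <- college_id[college_id %in% c("1001", "1002", "1003")]

length(unique(kept))       # distinct values actually present
nlevels(kept)              # levels still stored on the factor
nlevels(droplevels(kept))  # unused levels removed
nlevels(factor(kept))      # same effect as droplevels() here
```

randomForest() looks at the stored levels, not at the values present, which is why a check with duplicated() or unique() can report 40 categories while the factor still carries more than 53 levels.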
