Well, I've read the function reference for step_num2factor and honestly I haven't figured out how to use it properly.
temp_names <- as.character(unique(sort(all_raw$MSSubClass)))
price_recipe <-
  recipe(SalePrice ~ ., data = train_raw) %>%
  step_num2factor(MSSubClass, levels = temp_names)
temp_rec <- prep(price_recipe, training = train_raw, strings_as_factors = FALSE) # temporary recipe
temp_data <- bake(temp_rec, new_data = all_raw) # temporary data
class(all_raw$MSSubClass)
# > col_double()
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
After applying the step, temp_data$MSSubClass is full of NAs.
The observations are stored as 20, 30, 40, ..., 190, and I want to transform them to the names above (or even keep the same numbers, but as unordered factor levels).
If you know of any blog posts about the usage of step_num2factor, or some code that uses it, I would be glad to see those as well.
The complete dataset is provided by kaggle at:
kaggle data
Thx in advance,
I don't think that step_num2factor() is the best fit for this variable. Take a look at the help again, and notice that you need to give a transform argument that can be used to modify the numeric values prior to determining the levels. This would work OK if the data were all multiples of 10, but you have some values like 75 and 85, so I don't think you want that. This recipe step works best for numeric/integer-ish variables that you can easily transform to a set of integers with a simple function.
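For reference, here is a minimal sketch of how the transform argument is meant to work, using made-up data whose codes are clean multiples of 10 (the dat and lvls objects below are purely illustrative, not the housing data):
library(dplyr)
library(recipes)
# toy data: codes 10/20/30 can be mapped to the level indices 1/2/3
lvls <- c("ten", "twenty", "thirty")
dat  <- data.frame(code = c(10, 20, 30, 20), y = 1:4)
rec <- recipe(y ~ code, data = dat) %>%
  step_num2factor(code,
                  transform = function(x) as.integer(x / 10),
                  levels = lvls)
juice(prep(rec))$code
#> expected: a factor with levels "ten", "twenty", "thirty"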
Instead, I think you should think about step_mutate() and a simple coercion to a factor type:
library(tidyverse)
library(recipes)
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#>
#> fixed
#> The following object is masked from 'package:stats':
#>
#> step
train_raw <- read_csv("~/Downloads/house-prices-advanced-regression-techniques/train.csv")
#> Parsed with column specification:
#> cols(
#> .default = col_character(),
#> Id = col_double(),
#> MSSubClass = col_double(),
#> LotFrontage = col_double(),
#> LotArea = col_double(),
#> OverallQual = col_double(),
#> OverallCond = col_double(),
#> YearBuilt = col_double(),
#> YearRemodAdd = col_double(),
#> MasVnrArea = col_double(),
#> BsmtFinSF1 = col_double(),
#> BsmtFinSF2 = col_double(),
#> BsmtUnfSF = col_double(),
#> TotalBsmtSF = col_double(),
#> `1stFlrSF` = col_double(),
#> `2ndFlrSF` = col_double(),
#> LowQualFinSF = col_double(),
#> GrLivArea = col_double(),
#> BsmtFullBath = col_double(),
#> BsmtHalfBath = col_double(),
#> FullBath = col_double()
#> # ... with 18 more columns
#> )
#> See spec(...) for full column specifications.
price_recipe <-
  recipe(SalePrice ~ ., data = train_raw) %>%
  step_mutate(MSSubClass = factor(MSSubClass))

juiced_price <- prep(price_recipe) %>%
  juice()
levels(juiced_price$MSSubClass)
#> [1] "20" "30" "40" "45" "50" "60" "70" "75" "80" "85" "90" "120"
#> [13] "160" "180" "190"
juiced_price %>%
  count(MSSubClass)
#> # A tibble: 15 x 2
#> MSSubClass n
#> <fct> <int>
#> 1 20 536
#> 2 30 69
#> 3 40 4
#> 4 45 12
#> 5 50 144
#> 6 60 299
#> 7 70 60
#> 8 75 16
#> 9 80 58
#> 10 85 20
#> 11 90 52
#> 12 120 87
#> 13 160 63
#> 14 180 10
#> 15 190 30
Created on 2020-05-03 by the reprex package (v0.3.0)
This looks to me like it gets you the factor levels you want. If you want to use those strings from the .txt file, like "1-STORY 1945 & OLDER", instead of the numeric codes, you could supply both to factor(), e.g. factor(MSSubClass, levels = codes, labels = new_levels).
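For example, a sketch of that relabeling inside the recipe (the codes and new_levels vectors below are truncated for space; the real ones would need an entry for every code present in the data, otherwise unmatched rows become NA):
codes <- c(20, 30, 60)
new_levels <- c("1-STORY 1946 & NEWER ALL STYLES",
                "1-STORY 1945 & OLDER",
                "2-STORY 1946 & NEWER")
price_recipe <-
  recipe(SalePrice ~ ., data = train_raw) %>%
  step_mutate(MSSubClass = factor(MSSubClass, levels = codes, labels = new_levels))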
I am working through 'Machine Learning with R: Expert techniques for predictive modeling' by Brett Lantz. I am using the tidymodels suite as I try the example modeling exercises in R.
I am working through chapter 5, in which you build a decision tree with the C5.0 algorithm. I have created the model using the code shown below:
c5_v1 <- C5_rules() %>%
  set_mode('classification') %>%
  set_engine('C5.0')
c5_res_1 <- fit(object = c5_v1, formula = default ~., data = credit_train)
This has worked successfully:
parsnip model object
Call:
C5.0.default(x = x, y = y, trials = trials, rules = TRUE, control
= C50::C5.0Control(minCases = minCases, seed = sample.int(10^5, 1), earlyStopping
= FALSE))
Rule-Based Model
Number of samples: 900
Number of predictors: 20
Number of Rules: 22
Non-standard options: attempt to group attributes
Try as I might, Google as I do, read parsnip's documentation, etc., I cannot find out how to view the decision tree. Can anyone tell me how to view the actual tree it has created?
Do note that C5_rules() is a specification for a rule-based model, so after fitting with C5_rules() you shouldn't expect the output to be a decision tree, but a set of rules instead.
With the C5.0 engine, you can get both a decision tree output and a rules output. With the fitted model, run extract_fit_engine() to obtain the engine-specific fit embedded within the parsnip model fit, followed by summary() to extract the output.
library(tidymodels)
library(rules)
#>
#> Attaching package: 'rules'
#> The following object is masked from 'package:dials':
#>
#> max_rules
data(penguins, package = "modeldata")
# model specification
C5_decision_tree <- decision_tree() |>
  set_engine("C5.0") |>
  set_mode("classification")

C5_rules <- C5_rules() |>
  # no need to set the engine because only C5.0 is used for C5_rules()
  # verify with show_engines("C5_rules")
  set_mode("classification")

# fitting the models
C5_decision_tree_fitted <- C5_decision_tree |>
  fit(species ~ ., data = penguins)

C5_rules_fitted <- C5_rules |>
  fit(species ~ ., data = penguins)

# extracting the decision tree
C5_decision_tree_fitted |>
  extract_fit_engine() |>
  summary()
#>
#> Call:
#> C5.0.default(x = x, y = y, trials = 1, control = C50::C5.0Control(minCases =
#> 2, sample = 0))
#>
#>
#> C5.0 [Release 2.07 GPL Edition] Mon Jul 4 09:32:16 2022
#> -------------------------------
#>
#> Class specified by attribute `outcome'
#>
#> Read 333 cases (7 attributes) from undefined.data
#>
#> Decision tree:
#>
#> flipper_length_mm > 206:
#> :...island = Biscoe: Gentoo (118)
#> : island in {Dream,Torgersen}:
#> : :...bill_length_mm <= 46.5: Adelie (2)
#> : bill_length_mm > 46.5: Chinstrap (5)
#> flipper_length_mm <= 206:
#> :...bill_length_mm > 43.3:
#> :...island in {Biscoe,Torgersen}: Adelie (4/1)
#> : island = Dream: Chinstrap (59/1)
#> bill_length_mm <= 43.3:
#> :...bill_length_mm <= 42.3: Adelie (134/1)
#> bill_length_mm > 42.3:
#> :...sex = female: Chinstrap (4)
#> sex = male: Adelie (7)
#>
#>
#> Evaluation on training data (333 cases):
#>
#> Decision Tree
#> ----------------
#> Size Errors
#>
#> 8 3( 0.9%) <<
#>
#>
#> (a) (b) (c) <-classified as
#> ---- ---- ----
#> 145 1 (a): class Adelie
#> 1 67 (b): class Chinstrap
#> 1 118 (c): class Gentoo
#>
#>
#> Attribute usage:
#>
#> 100.00% flipper_length_mm
#> 64.56% bill_length_mm
#> 56.46% island
#> 3.30% sex
#>
#>
#> Time: 0.0 secs
# extracting the rules
C5_rules_fitted |>
  extract_fit_engine() |>
  summary()
#>
#> Call:
#> C5.0.default(x = x, y = y, trials = trials, rules = TRUE, control
#> = C50::C5.0Control(minCases = minCases, seed = sample.int(10^5,
#> 1), earlyStopping = FALSE))
#>
#>
#> C5.0 [Release 2.07 GPL Edition] Mon Jul 4 09:32:16 2022
#> -------------------------------
#>
#> Class specified by attribute `outcome'
#>
#> Read 333 cases (7 attributes) from undefined.data
#>
#> Rules:
#>
#> Rule 1: (68, lift 2.2)
#> bill_length_mm <= 43.3
#> sex = male
#> -> class Adelie [0.986]
#>
#> Rule 2: (208/64, lift 1.6)
#> flipper_length_mm <= 206
#> -> class Adelie [0.690]
#>
#> Rule 3: (48, lift 4.8)
#> island = Dream
#> bill_length_mm > 46.5
#> -> class Chinstrap [0.980]
#>
#> Rule 4: (34/1, lift 4.6)
#> bill_length_mm > 42.3
#> flipper_length_mm <= 206
#> sex = female
#> -> class Chinstrap [0.944]
#>
#> Rule 5: (118, lift 2.8)
#> island = Biscoe
#> flipper_length_mm > 206
#> -> class Gentoo [0.992]
#>
#> Default class: Adelie
#>
#>
#> Evaluation on training data (333 cases):
#>
#> Rules
#> ----------------
#> No Errors
#>
#> 5 2( 0.6%) <<
#>
#>
#> (a) (b) (c) <-classified as
#> ---- ---- ----
#> 146 (a): class Adelie
#> 1 67 (b): class Chinstrap
#> 1 118 (c): class Gentoo
#>
#>
#> Attribute usage:
#>
#> 97.90% flipper_length_mm
#> 49.85% island
#> 40.84% bill_length_mm
#> 30.63% sex
#>
#>
#> Time: 0.0 secs
Created on 2022-07-04 by the reprex package (v2.0.1)
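If you also want a graphical view of the tree (rather than the text output above), the C50 object extracted with extract_fit_engine() has a plot() method; a minimal sketch, reusing the fitted objects from above:
# plot the extracted C5.0 tree (tree fits only; rule-based fits have no tree to draw)
C5_decision_tree_fitted |>
  extract_fit_engine() |>
  plot()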
I know that in tidymodels you can set a custom tunable parameter space by interacting directly with the workflow object as follows:
library(tidymodels)
model <- linear_reg(
  mode = "regression",
  engine = "glmnet",
  penalty = tune()
)

rec_cars <- recipe(mpg ~ ., data = mtcars)

wkf <- workflow() %>%
  add_recipe(rec_cars) %>%
  add_model(model)

wkf_new_param_space <- wkf %>%
  parameters() %>%
  update(penalty = penalty(range = c(0.9, 1)))
but sometimes it makes more sense to do this right at the moment I specify a recipe or a model.
Does anyone know a way to achieve this?
The parameter ranges are inherently separated from the model specification and recipe specification in tidymodels. When you set tune() you are giving a signal to the tuning functions that this parameter will take multiple values and should be tuned over.
So as a short answer: you cannot specify ranges of parameters when you specify a recipe or a model, but you can create the parameters object right after, as you did.
In the end, what you need is a set of grid values to use for hyperparameter tuning, and you can create those grid values in at least 4 ways.
The first way is to do it the way you are doing it: pull the needed parameters out of the workflow and modify them as needed.
The second way is to create a parameters object that matches the parameters you need. This option and the remaining ones require you to make sure that you create values for all the parameters you are tuning.
The third way is to skip the parameters object altogether and create the grid with a grid_*() function and dials functions.
The fourth way is to skip the dials functions altogether and create the data frame yourself. I find tidyr::crossing() a useful replacement for grid_regular(). This way is a lot easier when you are working with integer parameters and parameters that don't benefit from transformations.
library(tidymodels)
model <- linear_reg(
  mode = "regression",
  engine = "glmnet",
  penalty = tune()
)

rec_cars <- recipe(mpg ~ ., data = mtcars)

wkf <- workflow() %>%
  add_recipe(rec_cars) %>%
  add_model(model)

# Option 1: using parameters() on the workflow
wkf_new_param_space <- wkf %>%
  parameters() %>%
  update(penalty = penalty(range = c(-5, 5)))

wkf_new_param_space %>%
  grid_regular(levels = 10)
#> # A tibble: 10 × 1
#> penalty
#> <dbl>
#> 1 0.00001
#> 2 0.000129
#> 3 0.00167
#> 4 0.0215
#> 5 0.278
#> 6 3.59
#> 7 46.4
#> 8 599.
#> 9 7743.
#> 10 100000
# Option 2: Using parameters() on list
my_params <- parameters(
  list(
    penalty(range = c(-5, 5))
  )
)

my_params %>%
  grid_regular(levels = 10)
#> # A tibble: 10 × 1
#> penalty
#> <dbl>
#> 1 0.00001
#> 2 0.000129
#> 3 0.00167
#> 4 0.0215
#> 5 0.278
#> 6 3.59
#> 7 46.4
#> 8 599.
#> 9 7743.
#> 10 100000
# Option 3: Use grid_*() with dials objects directly
grid_regular(
  penalty(range = c(-5, 5)),
  levels = 10
)
#> # A tibble: 10 × 1
#> penalty
#> <dbl>
#> 1 0.00001
#> 2 0.000129
#> 3 0.00167
#> 4 0.0215
#> 5 0.278
#> 6 3.59
#> 7 46.4
#> 8 599.
#> 9 7743.
#> 10 100000
# Option 4: Create grid values manually
tidyr::crossing(
  penalty = 10 ^ seq(-5, 5, length.out = 10)
)
#> # A tibble: 10 × 1
#> penalty
#> <dbl>
#> 1 0.00001
#> 2 0.000129
#> 3 0.00167
#> 4 0.0215
#> 5 0.278
#> 6 3.59
#> 7 46.4
#> 8 599.
#> 9 7743.
#> 10 100000
Created on 2021-08-17 by the reprex package (v2.0.1)
It seems that this is an old question, but I am having a hard time trying to insert this approach (option 1) into my workflow.
How is it supposed to continue?
Is wkf_new_param_space used as the grid or as the object in the tuning call?
model_tuned <-
  tune::tune_grid(
    object = wkf_new_param_space, ?
    resamples = cv_folds,
    grid = wkf_new_param_space, ?
    metrics = model_metrics,
    control = tune::control_grid(save_pred = TRUE, save_workflow = TRUE)
  )
I was trying to perform an exercise with correlated random normal variables in R. The purpose is pretty simple: generate correlated random variables, add some errors to these random variables, and look at their combined standard deviations. The following code works fine most of the time, but periodically fails with the error message shown after it:
res = NULL
x1 = rnorm(n = 500, mean = 0.02, sd = 0.2)
for (i in 3:5) {
  nn = i
  wght = rep(1 / (nn + 1), nn + 1)
  x234 = scale(matrix(rnorm(nn * 500), ncol = nn))
  x1234 = cbind(scale(x1), x234)
  c1 = var(x1234)
  chol1 = solve(chol(c1))
  newx = x1234 %*% chol1
  dgn = diag(x = 1, nrow = nn + 1, ncol = nn + 1)
  corrs = (runif(nn, min = 0.2, max = 0.8))
  v = c(1)
  vv = c(v, corrs)
  dgn[1, ] = vv
  dgn[, 1] = t(vv)
  newc = dgn
  chol2 = chol(newc)
  finalx = newx %*% chol2 * sd(x1) + mean(x1)
  fsd = sqrt(t(wght) %*% cov(finalx) %*% wght)
  noise = scale(matrix(rnorm((nn + 1) * 500, mean = 0, sd = 0.1), ncol = nn + 1))
  nmt = finalx + noise
  nsd = sqrt(t(wght) %*% cov(nmt) %*% wght)
  cmb = c(nn + 1, fsd, nsd)
  res = rbind(res, cmb)
  res
}
res
Here is the error message:
Error in chol.default(newc) :
the leading minor of order 4 is not positive definite
As I increase the number of random variables from 5 to 10, the success rate falls dramatically. I have done some searching, but was not able to properly understand what is going on. I would appreciate it if someone could explain the reason for the error message and help improve the code so that I can increase the number of random variables. I am OK with modifying the number of observations (currently set to 500).
Thanks!
Not to discourage you from writing your own code, but have you considered using one of the packages that generate random variables correlated the way you specify, and then adding your noise as desired? It seems more efficient...
# install.packages("SimMultiCorrData")
library(SimMultiCorrData)
#>
#> Attaching package: 'SimMultiCorrData'
#> The following object is masked from 'package:stats':
#>
#> poly
rcorrvar(n = 100, k_cat = 0, k_cont = 3, method = "Fleishman",
         means = c(0.02, 0.02, 0.02), vars = c(.2, .2, .2),
         skews = c(0, 0, 0), skurts = c(0, 0, 0),
         rho = matrix(c(1, -.8475514, -.7761684,
                        -.8475514, 1, .7909486,
                        -.7761684, .7909486, 1), 3, 3))
#>
#> Constants: Distribution 1
#>
#> Constants: Distribution 2
#>
#> Constants: Distribution 3
#>
#> Constants calculation time: 0 minutes
#> Intercorrelation calculation time: 0 minutes
#> Error loop calculation time: 0 minutes
#> Total Simulation time: 0 minutes
#> $constants
#> c0 c1 c2 c3
#> 1 0 1 0 0
#> 2 0 1 0 0
#> 3 0 1 0 0
#>
#> $continuous_variables
#> V1 V2 V3
#> 1 0.319695107 -0.09539562 0.04935637
#> 2 -0.044993481 0.18392534 -0.06670649
#> 3 -0.070313476 -0.06346264 -0.24941367
#> 4 0.172113990 0.34618351 0.47828409
#> 5 -0.274574396 0.34460006 0.09628439
#> 6 0.163286017 -0.10404186 -0.30498440
#> 7 0.189720419 0.34919058 -0.06916222
#> 8 0.346294222 -0.06309378 -0.17904333
#> 9 0.126299946 -0.08265343 0.04920184
#> 10 -0.280404683 0.17026612 0.51986206
#> 11 0.038499522 0.12446549 0.08325109
#> 12 -0.280384601 0.39031703 0.52271159
#> 13 0.045278970 0.46994063 0.11951804
#> 14 -0.194794669 -0.23913369 0.20371862
#> 15 -0.231546212 -0.00530418 -0.05841145
#> 16 0.346088425 -0.33119118 -0.27331346
#> 17 -0.453004492 0.60059088 0.52166094
#> 18 -0.072573425 0.05046599 0.33414391
#> 19 0.166013559 -0.18329940 0.10446314
#> 20 -0.098604755 -0.12496718 -0.61084161
#> 21 0.112571406 0.06160790 -0.16522639
#> 22 -0.089738379 0.35995382 0.18410621
#> 23 1.263601427 -0.93129093 -1.01284304
#> 24 0.467595367 -0.37048826 -0.56007336
#> 25 0.687837527 -0.71037730 -0.39024692
#> 26 -0.069806105 0.12184969 0.48233090
#> 27 0.460417179 0.11288231 -0.65215841
#> 28 -0.280200352 0.69895708 0.48867650
#> 29 -0.434993285 0.34369961 0.38985123
#> 30 0.156164881 -0.01521342 0.12130470
#> 31 0.106427524 -0.43769376 -0.38152970
#> 32 0.004461824 -0.02790287 0.13729747
#> 33 -0.617069179 0.62369153 0.74216927
#> 34 0.246206541 -0.22352474 -0.07086127
#> 35 -0.367155270 0.81098732 0.74171120
#> 36 -0.350166970 0.31690673 0.65302786
#> 37 -0.811889266 0.47066271 1.39740693
#> 38 -0.640483432 0.95157401 0.91042674
#> 39 0.288275932 -0.33698868 -0.15963674
#> 40 -0.056804796 0.29483915 0.15245274
#> 41 -0.266446983 0.09157321 -0.18294133
#> 42 0.611748802 -0.51417900 -0.22829506
#> 43 -0.052303947 -0.12391952 0.32055082
#> 44 0.127253868 0.06030743 -0.05578007
#> 45 0.395341299 -0.16222908 -0.08101956
#> 46 0.232971542 -0.09001768 0.06416376
#> 47 0.950584749 -0.67623380 -0.53429103
#> 48 0.256754894 -0.02981766 0.11701343
#> 49 0.233344371 -0.16151008 -0.05955383
#> 50 0.179751022 -0.09613500 -0.02272254
#> 51 0.097857477 0.27647838 0.40066424
#> 52 0.312418540 -0.02838812 -0.13918162
#> 53 0.705549829 -0.61698405 -0.29640094
#> 54 -0.074780651 0.42953939 0.31652087
#> 55 -0.291403183 0.05610553 0.32864232
#> 56 0.255325304 -0.55157170 -0.35415178
#> 57 0.120880052 -0.03856729 -0.61262393
#> 58 -0.648674586 0.59293157 0.79705060
#> 59 -0.404069704 0.29839572 -0.11963513
#> 60 0.029594092 0.24640773 0.27927410
#> 61 -0.127056071 0.30463198 -0.11407147
#> 62 -0.443629418 0.01942471 -0.32452308
#> 63 -0.139397963 0.20547578 0.11826198
#> 64 -0.512486967 0.24807759 0.67593407
#> 65 0.175825431 -0.15323003 -0.15738781
#> 66 -0.169247924 -0.29342285 -0.32655455
#> 67 0.540012695 -0.59459258 -0.12475814
#> 68 -0.498927728 0.05150384 0.07964582
#> 69 -0.166410612 0.07525901 -0.24507295
#> 70 0.582444257 -0.64069856 -0.60202487
#> 71 0.432974856 -0.66789588 -0.35017817
#> 72 0.484137908 -0.05404562 -0.34554109
#> 73 0.050180754 0.16226779 0.03339923
#> 74 -0.454340954 0.71886665 0.16057079
#> 75 0.776382309 -0.78986861 -1.29451966
#> 76 -0.480735672 0.43505688 0.46473186
#> 77 -0.086088864 0.54821715 0.42424756
#> 78 1.274991665 -1.26223004 -0.89524217
#> 79 0.006008305 -0.07710162 -0.07703056
#> 80 0.052344453 0.05182247 0.03126195
#> 81 -1.196792535 1.25723077 1.07875988
#> 82 0.057429049 0.06333375 -0.01933766
#> 83 0.207780426 -0.25919776 0.23279382
#> 84 0.316861262 -0.17226266 -0.24638375
#> 85 -0.032954787 -0.35399252 -0.17783342
#> 86 0.629198645 -0.85950566 -0.72744805
#> 87 0.068142675 -0.44343898 -0.17731659
#> 88 -0.244845275 0.28838443 0.32273254
#> 89 -0.206355945 -0.16599180 0.28202824
#> 90 0.023354603 0.18240309 0.30508536
#> 91 0.038201949 0.21409777 -0.05523652
#> 92 -0.961385546 1.21994616 0.71859653
#> 93 -0.916876574 0.36826421 0.35458708
#> 94 -0.135629660 -0.19348452 -0.14177523
#> 95 1.142650739 -0.94119197 -0.87394690
#> 96 0.561089630 -0.29328666 -0.63295015
#> 97 -0.054000942 -0.09673068 0.40208010
#> 98 -0.536990807 0.41466009 0.21541141
#> 99 0.015140675 -0.10702733 -0.29580071
#> 100 -0.830043387 0.77655165 0.08875664
#>
#> $summary_continuous
#> Distribution n mean sd median min max skew
#> X1 1 100 0.02 0.4472136 0.026474348 -1.196793 1.274992 0.19390551
#> X2 2 100 0.02 0.4472136 0.007060266 -1.262230 1.257231 -0.02466011
#> X3 3 100 0.02 0.4472136 0.005962144 -1.294520 1.397407 0.03116489
#> skurtosis fifth sixth
#> X1 0.7772036 0.1731516 -5.447704
#> X2 0.6009945 0.4473608 -4.123007
#> X3 0.7130090 0.1809188 -2.905017
#>
#> $summary_targetcont
#> Distribution mean sd skew skurtosis
#> 1 1 0.02 0.4472136 0 0
#> 2 2 0.02 0.4472136 0 0
#> 3 3 0.02 0.4472136 0 0
#>
#> $sixth_correction
#> [1] NA NA NA
#>
#> $valid.pdf
#> [1] "TRUE" "TRUE" "TRUE"
#>
#> $correlations
#> [,1] [,2] [,3]
#> [1,] 1.0000000 -0.8475514 -0.7761684
#> [2,] -0.8475514 1.0000000 0.7909486
#> [3,] -0.7761684 0.7909486 1.0000000
#>
#> $Sigma1
#> [,1] [,2] [,3]
#> [1,] 1.0000000 -0.8475514 -0.7761684
#> [2,] -0.8475514 1.0000000 0.7909486
#> [3,] -0.7761684 0.7909486 1.0000000
#>
#> $Sigma2
#> [,1] [,2] [,3]
#> [1,] 1.0000000 -0.8475514 -0.7761684
#> [2,] -0.8475514 1.0000000 0.7909486
#> [3,] -0.7761684 0.7909486 1.0000000
#>
#> $Constants_Time
#> Time difference of 0 mins
#>
#> $Intercorrelation_Time
#> Time difference of 0 mins
#>
#> $Error_Loop_Time
#> Time difference of 0 mins
#>
#> $Simulation_Time
#> Time difference of 0 mins
#>
#> $niter
#> 1 2 3
#> 1 0 0 0
#> 2 0 0 0
#> 3 0 0 0
#>
#> $maxerr
#> [1] 2.220446e-16
Created on 2020-05-13 by the reprex package (v0.3.0)
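As for why chol(newc) fails: newc is an identity matrix with random correlations written into its first row and column only, and a matrix of that shape is positive definite exactly when the squared correlations sum to less than 1 (take the Schur complement of the identity block). With nn correlations drawn uniformly from (0.2, 0.8), that condition fails more and more often as nn grows, which matches what you observed. A quick sketch that reuses the same construction and checks the smallest eigenvalue (the condition chol() needs):
# how often is the constructed correlation matrix positive definite?
set.seed(123)
pd_rate <- sapply(3:10, function(nn) {
  mean(replicate(2000, {
    dgn <- diag(nn + 1)
    corrs <- runif(nn, min = 0.2, max = 0.8)
    dgn[1, -1] <- corrs
    dgn[-1, 1] <- corrs
    min(eigen(dgn, symmetric = TRUE, only.values = TRUE)$values) > 0
  }))
})
data.frame(nn = 3:10, prop_positive_definite = pd_rate)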
OK, this is going to be a long post.
I am fairly new to R (I am currently using Microsoft R Open 3.5, with no checkpoint), but I am trying to work with the tidyverse, which I find very elegant for writing code and often a lot simpler.
I decided to replicate an exercise from guru99 here. It is a simple k-means exercise. However, because I always want to write generalizable code, I was trying to automatically rename the variables in mutate with new names. So I searched SO and found this solution here, which is very nice.
First, here is what works fine.
#library(tidyverse)
link <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/computers.csv"
df <- read.csv(link)
rescaled <- df %>%
  discard(is.factor) %>%
  select(-X) %>%
  mutate_all(
    funs("scaled" = scale)
  )
When you download the data with read.csv, you get df as a data.frame and everything works.
And now the weird things start. If you download the data with read_csv, or make it a tibble at any point afterwards, the first X variable will be named X1, and you need to change is.factor to is.character, because strings are converted to character, not factors, unless explicitly asked for (noting this for future me and others).
If you then run this code:
df1 <- read_csv(link)
df1 %>%
  discard(is.character) %>%
  select(-X1) %>%
  mutate_all(
    funs("scaled" = scale)
  )
the new variables are named price_scaled[,1], speed_scaled[,1], hd_scaled[,1], ram_scaled[,1], etc. when you view the output in the console, even if you print() it.
BUT if you call view() on it, you see the output with the names you expect, which are price_scaled, speed_scaled, hd_scaled, etc. ALSO, I am using an R Markdown document for the code, and when I change the chunk output to inline it displays the names correctly, with hd_scaled etc.
Does anyone have any idea how to get the names printed in the console like price_scaled etc.?
Why is this happening?
I thought this would be interesting to ask.
scale() returns a matrix, and dplyr/tibble isn't automatically coercing it to a vector. By changing your mutate_all() call to the below, we can have it return a vector. I identified that this is what was happening by calling class(df1$speed_scaled) and seeing the result "matrix".
library(tidyverse)
link <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/computers.csv"
df <- read_csv(link)
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#> X1 = col_double(),
#> price = col_double(),
#> speed = col_double(),
#> hd = col_double(),
#> ram = col_double(),
#> screen = col_double(),
#> cd = col_character(),
#> multi = col_character(),
#> premium = col_character(),
#> ads = col_double(),
#> trend = col_double()
#> )
df %>%
  discard(is.character) %>%
  select(-X1) %>%
  mutate_all(
    list("scaled" = function(x) scale(x)[[1]])
  )
#> # A tibble: 6,259 x 14
#> price speed hd ram screen ads trend price_scaled speed_scaled
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1499 25 80 4 14 94 1 -1.24 -1.28
#> 2 1795 33 85 2 14 94 1 -1.24 -1.28
#> 3 1595 25 170 4 15 94 1 -1.24 -1.28
#> 4 1849 25 170 8 14 94 1 -1.24 -1.28
#> 5 3295 33 340 16 14 94 1 -1.24 -1.28
#> 6 3695 66 340 16 14 94 1 -1.24 -1.28
#> 7 1720 25 170 4 14 94 1 -1.24 -1.28
#> 8 1995 50 85 2 14 94 1 -1.24 -1.28
#> 9 2225 50 210 8 14 94 1 -1.24 -1.28
#> 10 2575 50 210 4 15 94 1 -1.24 -1.28
#> # ... with 6,249 more rows, and 5 more variables: hd_scaled <dbl>,
#> # ram_scaled <dbl>, screen_scaled <dbl>, ads_scaled <dbl>,
#> # trend_scaled <dbl>
I've been trying to calculate marginal means for my lmer and glmer models in R. I found the emmeans function and I've been trying to understand it and apply it to my model. I found that it's hard to get the means for an interaction, so I'm starting with just additive predictors, but the function doesn't work the way it's presented in examples (e.g. here https://cran.r-project.org/web/packages/emmeans/vignettes/sophisticated.html):
emmeans(Oats.lmer, "nitro")
nitro emmean SE df lower.CL upper.CL
0.0 78.89207 7.294379 7.78 61.98930 95.79484
0.2 97.03425 7.136271 7.19 80.25029 113.81822
0.4 114.19816 7.136186 7.19 97.41454 130.98179
0.6 124.06857 7.070235 6.95 107.32795 140.80919
What I'm getting is:
emmeans(model2, "VariableA")
VariableA emmean SE df lower.CL upper.CL
0.4657459 2649.742 120.8955 19.07 2396.768 2902.715
There is only one line, and the variable is averaged instead of split into 0 and 1 (which are the values in the dataset; maybe the problem is that it's categorical?).
The model I'm running is:
model2 = lmer (rt ~ variableA + variableB + (1 |participant) + (1 |sequence/item), data=memoryData, REML=FALSE)
EDIT: The data file is quite big and I wasn't sure how to extract useful information from it, but here is the structure:
> str(memoryData)
'data.frame': 3168 obs. of 123 variables:
$ participant : int 10 10 10 10 10 10 10 10 10 10 ...
$ variableA : int 1 1 1 1 1 1 1 1 1 1 ...
$ variableB : int 1 1 1 1 1 1 1 1 1 1 ...
$ sequence: int 1 1 1 1 1 1 1 1 1 1 ...
$ item : int 25 26 27 28 29 30 31 32 33 34 ...
$ accuracy : int 1 1 1 1 1 1 0 1 1 1 ...
$ rt : num 1720 1628 1728 2247 1247 ...
Why is the function not working for me?
And as a further question, is there a way to get these means when I include an interaction between variables A and B?
EDIT 2: OK, it did work when I changed the variable to a factor; I guess my method of doing it was incorrect. But I'm still not sure how to calculate it when there is an interaction, because with this method R says "NOTE: Results may be misleading due to involvement in interactions".
To see marginal means of interactions, add all variables of the interaction term to emmeans(), and use the at argument if you want to see the marginal means at different levels of the interaction terms.
Here are some examples, for the average effect of the interaction and for marginal effects at different levels of the interaction term. The latter is more useful for visualization.
library(ggeffects)
library(lme4)
library(emmeans)
data("sleepstudy")
sleepstudy$inter <- sample(1:5, size = nrow(sleepstudy), replace = T)
m <- lmer(Reaction ~ Days * inter + (1 + Days | Subject), data = sleepstudy)
# average marginal effect of interaction
emmeans(m, c("Days", "inter"))
#> Days inter emmean SE df lower.CL upper.CL
#> 4.5 2.994444 298.3427 8.84715 16.98 279.6752 317.0101
#>
#> Degrees-of-freedom method: kenward-roger
#> Confidence level used: 0.95
# marginal effects at different levels of interactions -
# useful for plotting
ggpredict(m, c("Days [3,5,7]", "inter"))
#>
#> # Predicted values of Reaction
#> # x = Days
#>
#> # inter = 1
#> x predicted std.error conf.low conf.high
#> 3 279.349 8.108 263.458 295.240
#> 5 304.839 9.818 285.597 324.082
#> 7 330.330 12.358 306.109 354.551
#>
#> # inter = 2
#> x predicted std.error conf.low conf.high
#> 3 280.970 7.624 266.028 295.912
#> 5 304.216 9.492 285.613 322.819
#> 7 327.462 11.899 304.140 350.784
#>
#> # inter = 3
#> x predicted std.error conf.low conf.high
#> 3 282.591 7.446 267.997 297.185
#> 5 303.593 9.384 285.200 321.985
#> 7 324.594 11.751 301.562 347.626
#>
#> # inter = 4
#> x predicted std.error conf.low conf.high
#> 3 284.212 7.596 269.325 299.100
#> 5 302.969 9.502 284.345 321.594
#> 7 321.726 11.925 298.353 345.099
#>
#> # inter = 5
#> x predicted std.error conf.low conf.high
#> 3 285.834 8.055 270.046 301.621
#> 5 302.346 9.839 283.062 321.630
#> 7 318.858 12.408 294.540 343.177
#>
#> Adjusted for:
#> * Subject = 308
emmeans(m, c("Days", "inter"), at = list(Days = c(3, 5, 7), inter = 1:5))
#> Days inter emmean SE df lower.CL upper.CL
#> 3 1 279.3488 8.132335 23.60 262.5493 296.1483
#> 5 1 304.8394 9.824196 20.31 284.3662 325.3125
#> 7 1 330.3300 12.366296 20.69 304.5895 356.0704
#> 3 2 280.9700 7.630745 18.60 264.9754 296.9646
#> 5 2 304.2160 9.493225 17.77 284.2529 324.1791
#> 7 2 327.4621 11.901431 17.84 302.4420 352.4822
#> 3 3 282.5912 7.445982 16.96 266.8786 298.3038
#> 5 3 303.5927 9.383978 16.98 283.7927 323.3927
#> 7 3 324.5942 11.751239 16.98 299.7988 349.3896
#> 3 4 284.2124 7.601185 18.34 268.2639 300.1609
#> 5 4 302.9694 9.504102 17.85 282.9900 322.9487
#> 7 4 321.7263 11.927612 17.99 296.6666 346.7860
#> 3 5 285.8336 8.076779 23.02 269.1264 302.5409
#> 5 5 302.3460 9.845207 20.48 281.8399 322.8521
#> 7 5 318.8584 12.416642 21.02 293.0380 344.6788
#>
#> Degrees-of-freedom method: kenward-roger
#> Confidence level used: 0.95
And a plotting example:
ggpredict(m, c("Days", "inter [1,3,5]")) %>% plot()
You say that "changing the vari[a]ble to factor doesn't help", but I would think this would work (as documented in the emmeans FAQ):
md <- transform(memoryData,
                variableA = factor(variableA),
                variableB = factor(variableB))
model2 = lmer(rt ~ variableA + variableB +
                (1 | participant) + (1 | sequence/item),
              data = md, REML = FALSE)
emmeans(model2, ~variableA)
emmeans(model2, ~variableB)
emmeans(model2, ~variableA + variableB)
If this really doesn't work, then we need a reproducible example ...