stratification using carat package - how to deal with dropout - r

I am using the carat package for stratified randomization in a clinical trial. I have 4 strata and I used StrBCD.ui(path, folder = "myfolder") to create it. I have two questions.
1. Every time I rerun the function and add a new subject's data, it should take into account the data of previously entered patients, right? Where are these data stored? I cannot find the specific file in the created folder.
2. I might have some dropouts after participants have been assigned to a treatment, and I want to exclude them from the randomization. How do I remove their data? I do not want to restart the randomization from scratch, because I want to keep the participants that have already been allocated. In other words, is there a way to control the function and the data it considers for the randomization?
Thank you
Elisabetta
I tried to run the program several times and look for the data in the created folders
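To make the dropout question concrete, here is a minimal base R sketch of a stratified biased coin assignment in which the allocation history is an ordinary data frame kept by the investigator. It is not the package's StrBCD.ui and says nothing about where that function stores its data; the helper assign_next, the column names and the bias probability p = 0.85 are all assumptions made for illustration.

```r
# Hypothetical helper: assign the next participant within a stratum using a
# biased coin, ignoring anyone flagged as a dropout.
assign_next <- function(history, stratum, p = 0.85) {
  # Keep only active (non-dropout) participants in the same stratum
  active <- history[history$stratum == stratum & !history$dropout, ]
  imbalance <- sum(active$arm == "A") - sum(active$arm == "B")
  # Favour the under-represented arm with probability p; fair coin on a tie
  prob_a <- if (imbalance < 0) p else if (imbalance > 0) 1 - p else 0.5
  if (runif(1) < prob_a) "A" else "B"
}

# Allocation log kept by the investigator (assumed layout)
history <- data.frame(
  id      = 1:5,
  stratum = c(1, 1, 1, 2, 2),
  arm     = c("A", "A", "B", "A", "B"),
  dropout = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

set.seed(42)
new_arm <- assign_next(history, stratum = 1)

# Append the new participant so the next call sees the updated history
history <- rbind(
  history,
  data.frame(id = 6, stratum = 1, arm = new_arm, dropout = FALSE,
             stringsAsFactors = FALSE)
)
```

Because the history is explicit, marking a participant as a dropout only changes the dropout column; earlier allocations stay untouched and later assignments simply stop counting that person.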

Related

Can I use xgboost global model properly, if I skip step_dummy(all_nominal_predictors(), one_hot = TRUE)?

I wanted to try xgboost global model from: https://business-science.github.io/modeltime/articles/modeling-panel-data.html
On a smaller scale it works fine (like the wmt data: 7 departments, 7 ids), but what if I would like to run it on 200,000 time series (ids)? That means step_dummy creates another 200k columns and my PC can't handle it (it can't handle even 14k ids).
I tried to remove step_dummy, but then I end up with xgboost forecasting the same values for all ids.
My question is: how can I forecast 200k time series with a global xgboost model and still get proper forecasts for each one of the 200k ids?
Or is step_dummy necessary in order to create a proper forecast for all ids?
P.S.: The code should be the same as the one in the link. The only difference is that my dataset has 50 monthly observations for each id.
For this model, the data must be given to xgboost in the format of a sparse matrix. That means that there should not be any non-numeric columns in the data prior to the conversion (which tidymodels does under the hood at the last minute).
The traditional method for converting a qualitative predictor into a quantitative one is to use dummy variables. There are a lot of other choices though. You can use an effect encoding, feature hashing, or others too.
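As a rough illustration of an effect encoding, the base R sketch below replaces the id column with a single numeric column holding a shrunken per-id mean of the outcome. It is a simplified stand-in rather than what any tidymodels step does internally; the simulated data, the shrinkage weight prior_weight and the helper encode_id are assumptions.

```r
set.seed(1)
# Toy panel: 1,000 series ids with 50 monthly observations each (assumed shape)
n_ids <- 1000
train <- data.frame(
  id    = rep(sprintf("id_%04d", seq_len(n_ids)), each = 50),
  month = rep(seq_len(50), times = n_ids),
  stringsAsFactors = FALSE
)
level <- rnorm(n_ids, mean = 100, sd = 30)   # per-series level
train$value <- rep(level, each = 50) + rnorm(nrow(train), sd = 5)

# Effect encoding: one numeric column instead of n_ids dummy columns
global_mean  <- mean(train$value)
id_sum       <- tapply(train$value, train$id, sum)
id_n         <- tapply(train$value, train$id, length)
prior_weight <- 5                            # hypothetical shrinkage strength
id_effect <- (as.numeric(id_sum) + prior_weight * global_mean) /
  (as.numeric(id_n) + prior_weight)
names(id_effect) <- names(id_sum)

# Hypothetical helper: look up the encoding, using the global mean for unseen ids
encode_id <- function(ids) {
  out <- unname(id_effect[ids])
  out[is.na(out)] <- global_mean
  out
}

train$id_encoded <- encode_id(train$id)
```

The model then sees id_encoded instead of 1,000 dummy columns. Since the encoding is computed from the outcome, it should be estimated on training data only to avoid leaking test information.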
I think that there is no proper answer to the question of how to forecast 200k time series properly. Global models are the way to go here, but you need to experiment to find out which models do not belong inside the global forecast model.
There will be a threshold, determined mostly by the length of the series, that you put inside the global model.
Consider using several global models with different feature recipes.
If you want to avoid the step_dummy function, use lightgbm through the bonsai package, which is considerably faster and more accurate.

Comparison of good vs bad dataset using R

Stuck in a problem. There are two datasets A and B. Say they're datasets of two factories. Factory A is performing really well whereas Factory B is not. I have the data-set of Factory A (data being output from the manufacturing units) as well as Factory B, both having the same variables. How can I identify the problematic variable in Factory B which needs to be fixed so that Factory B starts performing well too? Therefore, I need to identify the problematic variable which needs immediate attention.
Looking forward to your response.
p.s: coding language being used is R
Well, this is a shameless plug for the dataMaid package, which I helped write and which sort of does what you are asking. The idea of the dataMaid package is to run a battery of tests on the variables in a data frame and produce a report that a human investigator (preferably someone with knowledge about the context) can look through in order to identify potential problems.
A super simple way to get started is to load the package and use the clean function on a data frame (if you try to clean the same data frame several times then it may be necessary to add the replace=TRUE argument to overwrite the existing report).
devtools::install_github("ekstroem/dataMaid")
library(dataMaid)
data(trees)
clean(trees)
This will create a report with summaries and error checks for each variable in the trees data frame. The report starts with a summary of all the variables, followed by a section for each individual variable. In each section we get a status about the variable type, summary statistics, a plot and, where applicable, an indicator that there might be a problem, for example with outliers.
The dataMaid package can also be used interactively by running checks for the individual variables or for all variables in the dataset
data(toyData)
check(toyData$var2) # Individual check of var2
check(toyData) # Check all variables at once
By default the standard battery of tests is run depending on the variable type, but it is possible to extend the package by providing your own checks.
In your case I would run the package on both datasets to get two reports, and any major differences in those would raise a flag about what could be problematic.
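Alongside the two reports, a quick numeric comparison can help rank the variables by how much they differ between the factories. The base R sketch below computes a standardized mean difference per shared numeric variable; it is not part of dataMaid, and the data frames factory_a and factory_b with their variable names are simulated assumptions.

```r
set.seed(7)
# Toy stand-ins for the two factories (variable names are hypothetical)
factory_a <- data.frame(temperature = rnorm(200, 70, 2),
                        pressure    = rnorm(200, 30, 1),
                        speed       = rnorm(200, 120, 5))
factory_b <- data.frame(temperature = rnorm(200, 70, 2),
                        pressure    = rnorm(200, 34, 1),   # deliberately shifted
                        speed       = rnorm(200, 121, 5))

# Standardized mean difference (B minus A) for each shared variable
shared <- intersect(names(factory_a), names(factory_b))
smd <- sapply(shared, function(v) {
  a <- factory_a[[v]]
  b <- factory_b[[v]]
  pooled_sd <- sqrt((var(a, na.rm = TRUE) + var(b, na.rm = TRUE)) / 2)
  (mean(b, na.rm = TRUE) - mean(a, na.rm = TRUE)) / pooled_sd
})

# Largest absolute differences first
ranking <- sort(abs(smd), decreasing = TRUE)
print(round(ranking, 2))
```

A large difference only flags a variable for inspection; whether it actually causes the poor performance still needs someone who knows the process.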

Cluster Analysis using R for large data sample

I am just starting out with segmenting a customer database for an ecommerce retail business using R. I am looking for some guidance about the best approach for this exercise.
I have searched the topics already posted here and tried them out myself like dist() and hclust(). However I am running into one issue or another and not able to overcome it since I am new to using R.
Here is the brief description of my problem.
I have approximately 480K records of customers who have bought so far. The data contains following columns:
email id
gender
city
total transactions so far
average basket value
average basket size (number of items purchased in one transaction)
average discount claimed per transaction
No of days since the user first purchased
Average duration between two purchases
No of days since last transaction
The business goal of this exercise is to identify the most profitable segments and encourage repeat purchases in those segments using campaigns. Can I please get some guidance as to how to do this successfully without running into problems like the size of the sample or the data type of columns?
Read up on how to subset data frames. When you try to define d, it looks like you're providing way too much data, which might be fixed by subsetting your table first. If not, you might want to take a random sample of your data instead of all of it. Suppose you know that columns 4 through 10 of your data frame called cust_data contain numerical data, then you might try this:
cust_data2 <- cust_data[, 4:10]
d <- dist(cust_data2)
For large values, you may want to log transform them--just experiment and see what makes sense. I really am not sure about this, and that's just a suggestion. Maybe choosing a more appropriate clustering or distance metric would be better.
Finally, when you run hclust, you need to pass in the d matrix, and not the original data set.
h <- hclust(d, method = "average")
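Putting the sampling, transformation and clustering steps together might look like the sketch below. With about 480K rows, dist() would need roughly 115 billion pairwise distances, which is why a random sample is used. The simulated data, the sample size of 1,000 and the choice of k = 4 are assumptions for illustration only.

```r
set.seed(123)
# Toy stand-in for columns 4 to 10 of cust_data (all numeric, simulated values)
cust_data2 <- as.data.frame(matrix(rexp(5000 * 7), ncol = 7))

# dist() needs n * (n - 1) / 2 distances, so work on a random sample of rows
sample_rows <- sample(nrow(cust_data2), 1000)

# Log transform the skewed values, then standardize each column
cust_sample <- scale(log1p(as.matrix(cust_data2[sample_rows, ])))

d <- dist(cust_sample)                # pairwise Euclidean distances
h <- hclust(d, method = "average")    # pass the distances, not the raw data
segments <- cutree(h, k = 4)          # k = 4 is an arbitrary illustrative choice
print(table(segments))
```

Standardizing matters here because columns such as basket value and days since last purchase sit on very different scales and would otherwise dominate the distances.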
Sadly your data does not contain any attributes that indicate what types of items/transactions did NOT result in a sale.
I am not sure if clustering is the way to go here.
Here are some ideas:
First split your data into a training set (say 70%) and a test set.
Set up a simple linear regression model with, say, "average basket value" as the response variable and all other variables as independent variables.
fit <- lm(averagebasketvalue ~ ., data = trainset)
Fit the model on the training set, determine the significant attributes (those with at least one star in the summary(fit) output), then focus on those variables.
Check the model on the test set by calculating R-squared and the sum of squared errors (SSE) there. You can use the predict() function; the calls will look like
fitpred <- predict(fit, newdata = testset)
sse <- sum((testset$averagebasketvalue - fitpred)^2) # SSE on the test set
sst <- sum((testset$averagebasketvalue - mean(testset$averagebasketvalue))^2)
1 - sse / sst # test-set R²
Maybe "city" contains too many unique values to be meaningful. Try to generalize them by introducing a new attribute CityClass (e.g. BigCity-MediumCity-SmallCity ... or whatever classification scheme is useful for your cities). You might also condition the model on "gender". Drop "email id".
This can go on for a while... play with the model to try to get better R-squared and SSEs.
I think a tree-based model (rpart) might also work well here.
Then you might change to cluster analysis at a later time.
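The split, fit and evaluate steps above can be sketched end to end as follows. The customer table is simulated and its column names are hypothetical, so only the workflow carries over to real data.

```r
set.seed(2024)
# Simulated stand-in for the customer table (column names are hypothetical)
n <- 1000
custdata <- data.frame(
  totaltransactions = rpois(n, 5) + 1,
  averagebasketsize = runif(n, 1, 10),
  averagediscount   = runif(n, 0, 0.3)
)
custdata$averagebasketvalue <- 20 + 8 * custdata$averagebasketsize -
  30 * custdata$averagediscount + rnorm(n, sd = 5)

# 70/30 split into training and test sets
train_idx <- sample(n, size = round(0.7 * n))
trainset  <- custdata[train_idx, ]
testset   <- custdata[-train_idx, ]

# Fit on the training set only
fit <- lm(averagebasketvalue ~ ., data = trainset)
print(summary(fit)$coefficients)

# Evaluate on the held-out test set
fitpred <- predict(fit, newdata = testset)
sse <- sum((testset$averagebasketvalue - fitpred)^2)
sst <- sum((testset$averagebasketvalue - mean(testset$averagebasketvalue))^2)
r2_test <- 1 - sse / sst
```

Comparing r2_test with the training R-squared from summary(fit) gives a first impression of whether the model overfits.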

Recursive Partitioning in R

I am aiming to better predict the buying habits of a company's customer base according to several customer attributes (demographic, past purchase categories, etc). I have a data set of about 100,000 returning customers including the time interval from their last purchase (the dependent variable in this study) along with several attributes (both continuous and categorical).
I plan on doing a survival analysis on each segment (segments defined as having similar time intervals across observations) to help understand likely time intervals between purchases. The problem I am encountering is how to best define these segments; i.e. groupings of attributes such that the time interval is sufficiently different between segments and similar within segments. I believe that building a decision tree is the best way to do this, I would suppose using recursive partitioning.
I am new to R and have poked around with the party package's mob command, however I am confused by which variables to include in the model and which to include for partitioning (command: mob(y ~ x1 + ... + xk | z1 + ... + zk), x being model variables and z being partitions). I simply want to build a tree from the set of attributes, so I suppose I want to partition on all of them? Not sure. I have also tried the rpart command but either get no tree or a tree with hundreds of thousands of nodes depending on the cp level.
If anyone has any suggestions, I'd appreciate it. Sorry for the novel and thanks for the help.
From the documentation at ?mob:
MOB is an algorithm for model-based recursive partitioning yielding a
tree with fitted models associated with each terminal node.
It's asking for model variables because it will build a model at every terminal node (e.g. linear, logistic) after splitting on the partition variables. If you want to partition without fitting models to the terminal nodes, the function I've used is ctree (also in the party package).
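A minimal sketch of the ctree approach is shown below, assuming the party package is installed (the code is skipped otherwise). The customer data are simulated, and the column names segment, age and days_between are hypothetical.

```r
set.seed(1)
n <- 600
customers <- data.frame(
  segment = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  age     = round(runif(n, 18, 80))
)
# Simulated effect: segment "a" buys again sooner on average
customers$days_between <- rexp(
  n, rate = ifelse(customers$segment == "a", 1 / 20, 1 / 60)
)

if (requireNamespace("party", quietly = TRUE)) {
  # Partition on all attributes; no model is fitted in the terminal nodes
  tree <- party::ctree(days_between ~ segment + age, data = customers)
  # Terminal node id for each customer, usable as a segment label
  customers$node <- party::where(tree)
  print(tapply(customers$days_between, customers$node, mean))
}
```

The node column can then serve as the grouping variable for a separate survival analysis per segment.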

Determine Significant Subgroups of Data Inputs

I have a large (10000 X 5001) table representing 10000 samples and 5001 different features of these samples. One of these features represents an output variable of each sample. In other words, I have 5000 input variables and one output variable for each sample.
I know that most of these inputs are irrelevant. Therefore, what I would like to do is determine the subset of input variables that predicts the output variable best. What is the best/simplest way to go about doing this in R?
You might want to check out Weka. In the Explorer load the data and then go to the Select attributes tab. There you will find several options to get the most informative attributes/features in your dataset.
You may want Principal Component Analysis (stats::prcomp) or Linear Discriminant Analysis (MASS::lda).
See this document by Avril Coghlan
http://little-book-of-r-for-multivariate-analysis.readthedocs.org/en/latest/
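As a small illustration of the prcomp suggestion, the sketch below runs PCA on simulated inputs and regresses the output on the leading components. Note that PCA ignores the output and yields combinations of inputs rather than a subset of them, so this is a dimension reduction sketch under assumed data, not a variable selection procedure; the 80% variance cutoff is arbitrary.

```r
set.seed(99)
# Small simulated stand-in: 200 samples, 50 inputs, output driven by 3 of them
x <- matrix(rnorm(200 * 50), nrow = 200,
            dimnames = list(NULL, paste0("x", 1:50)))
y <- 2 * x[, 1] - 1.5 * x[, 2] + x[, 3] + rnorm(200, sd = 0.5)

# Principal components of the centered and scaled inputs
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.8)[1]        # components needed to reach 80%

# Regress the output on the first k component scores
scores  <- pca$x[, seq_len(k), drop = FALSE]
pcr_fit <- lm(y ~ scores)
print(summary(pcr_fit)$r.squared)
```

With 5000 real inputs, the loadings in pca$rotation show which original variables contribute most to each retained component.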
Rather than taking 'random' suggestions, why not go to the CRAN Task View for Cluster Analysis & Finite Mixture Models ?
