Populating two inter-related columns in an R data.table

I have this sample data table:
library(data.table)
df <- data.table(
  indexer = 0:12,
  x1 = c(0, 1000, 1500, 1000, 1000, 2000,
         1000, 1000, 0, 351.2, 1000, 1000, 1851.2)
)
Now I need to create two additional columns, x2 and x3, in this data table such that x2[i] = x1[i] - x3[i] and x3[i] = x2[i-1], with x3[1] = 0.
How can I do this without using a loop in an efficient way?
EDIT1: expected results are
x2 = c(0.0,1000.0,500.0,500.0,500.0,1500.0,-500.0,1500.0,-1500.0,1851.2,-851.2,1851.2,0.0)
and
x3 = c(0.0,0.0,1000.0,500.0,500.0,500.0,1500.0,-500.0,1500.0,-1500.0,1851.2,-851.2,1851.2)
EDIT2: This is my first time posting a question here, hence the confusion. Forget the example; the actual formulas are:
x3[i] = c - x2[i-1]*(1+r/12); x2[i] = x1[i] - x3[i]; x3[1] = 0; # c is some constant.

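The problem is that x2 and x3 depend on each other. Thus, one needs to express x2 in terms of x1 alone. Substituting x3[i] = x2[i-1] gives x2[i] = x1[i] - x2[i-1] with x2[1] = x1[1], and unrolling this recurrence yields an alternating sum:
x2[i] = sum over j = 1..i of (-1)^(i-j) * x1[j]
Once we have the formula, programming is easy: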
df$x2 <- (-1)^(df$indexer) * cumsum(df$x1*(-1)^(df$indexer))
And x3 can be obtained from x2:
df$x3 <- c(0,df$x2[-nrow(df)])
[EDIT2] I guess that solution for the modified question, if it exists at all, should be sought along the same lines. I don't think it should be considered as a programming-related problem, because the code is quite straightforward once the mathematical formula is known.
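For the modified formulas in EDIT2, substituting x3 into x2 gives x2[i] = x1[i] - c + (1 + r/12) * x2[i-1], which is a first-order linear recurrence. Below is a minimal sketch, assuming made-up values for c and r (the question gives neither), that evaluates it with Reduce() instead of an explicit for loop. Reduce() still iterates internally, so this buys brevity rather than true vectorisation. The name const stands in for the question's c to avoid clashing with the c() function.

```r
# Data from the question
x1 <- c(0, 1000, 1500, 1000, 1000, 2000,
        1000, 1000, 0, 351.2, 1000, 1000, 1851.2)

# Hypothetical values: the question does not say what c and r are
const <- 100    # the constant called c in the question
r     <- 0.06   # assumed annual rate
g     <- 1 + r / 12

# x2[1] = x1[1] because x3[1] = 0; afterwards
# x2[i] = x1[i] - const + g * x2[i-1]
x2 <- Reduce(function(prev, cur) cur - const + g * prev,
             x1[-1], x1[1], accumulate = TRUE)

# x3[1] = 0 and x3[i] = const - g * x2[i-1]
x3 <- c(0, const - g * head(x2, -1))
```

Setting const to 0 and g to -1 recovers the recurrence of the original example.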

Related

Simulate missing values with MNAR method in R

I simulated a data set with the following assumptions:
x1 <- rbinom(100,1,0.5) #trt (size must be 1; size 0 would make every x1 equal to 0)
x2 <- rnorm(100,0,1) # metric outcome
df <- data.frame(x1,x2)
Now I'm trying to introduce missing values with two different mechanisms: first "missing completely at random" (MCAR) and second "missing not at random" (MNAR). I tried several packages, but none of them worked as I expected.
For the first scenario (MCAR) I used:
df_mcar <- ampute(data = df, prop = 0.1, mech = "MCAR", patterns = c(1, 0))$amp
... and it seems to work (with probability of 10% only x2 has missing values - independently of x1)
For the second scenario I want - again - that only x2 has missing values, but this time with special assumption on x1: Only for x1 = 1 I want x2 to have missing values in 10% of cases.
So in variable x2 I want missing values with probability of p=0.1 for x1 = 1 and with probability of p=0 for x1 = 0.
I would be glad for any hint or a simple solution :)
PS: I often read something like prodNA(...) but it does not work
Could probably do something like:
library(dplyr)
df %>%
  mutate(
    x2 = if_else(x1 == 1 & runif(n()) < .1, NA_real_, x2)
  )
My R is currently too busy for me to run the code, though.
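The same idea can be checked in base R without any packages. This is a minimal sketch, assuming a treatment indicator drawn with size = 1 and a larger sample (n = 1000 is an assumption, chosen so the missingness rates are visible); the names drop and df_miss are made up for illustration.

```r
set.seed(1)
n  <- 1000                    # assumed sample size, larger than in the question
x1 <- rbinom(n, 1, 0.5)       # treatment indicator, 0 or 1
x2 <- rnorm(n, 0, 1)          # metric outcome
df <- data.frame(x1, x2)

# A row loses x2 only if x1 == 1 AND an independent uniform draw is below 0.1
drop <- df$x1 == 1 & runif(n) < 0.1

df_miss <- df
df_miss$x2[drop] <- NA

# Share of missing x2 per level of x1: exactly 0 for x1 == 0, about 0.1 for x1 == 1
tapply(is.na(df_miss$x2), df_miss$x1, mean)
```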

multiple imputation and multigroup SEM in R

I want to perform multigroup SEM on imputed data using the R packages mice and semTools, specifically the runMI function that calls Lavaan.
I am able to do so when imputing the entire dataset at once, but whilst trawling through stackoverflow/stackexchange I have come across the recommendation to impute data separately for each level of a grouping variable (e.g. men, women), so that the features of each group are preserved
(e.g. https://stats.stackexchange.com/questions/149053/questions-on-multiple-imputation-with-mice-for-a-multigroup-sem-analysis-inclu). However, I've not been able to find any references to support this course.
My question is both conceptual and practical -
1) Is splitting the dataset by group prior to imputing the correct course? Could anyone point me towards references advising this?
2) If so, how can I combine the datasets imputed by group using mice, whilst still retaining the multiple imputed datasets in an object of the mids class? I have attempted to do so, but end up with an integer:
library(lavaan)  # provides HolzingerSwineford1939
library(mice)
set.seed(12345)
HSMiss <- HolzingerSwineford1939[ , paste("x", 1:9, sep = "")]
HSMiss$x5 <- ifelse(HSMiss$x1 <= quantile(HSMiss$x1, .3), NA, HSMiss$x5)
HSMiss$x9 <- ifelse(is.na(HSMiss$x5), NA, HSMiss$x9)
HSMiss$school <- HolzingerSwineford1939$school
HS.model <- '
visual =~ x1 + a*x2 + b*x3
textual =~ x4 + x5 + x6
x7 ~ textual + visual + x9
'
group1 <- subset(HSMiss, school =='Pasteur')
group2 <- subset(HSMiss, school =='Grant-White')
imputed.group1 <- mice(group1, m = 3, seed = 12345)
imputed.group2 <- mice(group2, m = 3, seed = 12345)
#attempted merging:
imputed.both <- nrow(complete(rbind(imputed.group1, imputed.group2)))
I would be incredibly grateful if anyone can offer me some help. As you can tell, I am very much still learning about R and imputation, so apologies if this is a stupid question - however, I couldn't find anything regarding this specific query elsewhere.
You are getting just an integer when merging because you are calling nrow(). Remove that call and you'll get a merged data frame. Note that complete() returns only the first imputed dataset by default (action = 1).
imputed.both <- complete(rbind(imputed.group1, imputed.group2))
In case you find yourself with datasets that have multiple groups, you can do something like the following to simplify this task.
imputed.groups <- lapply(split(HSMiss, HSMiss$school), function(x) {
  complete(mice(x, m = 3, seed = 12345))
})
imputed.both <- do.call(rbind, imputed.groups)
As for how appropriate this approach to imputation is, that's probably a question better suited to Cross Validated.

How to use dplyr to make several simple regressions using always the same independent variable but changing the dependent one?

I hope this is not the simplest question. I need to make a simple regression (yes, a simple one: Y = a + bX + epsilon). My data frame is such that each column has one variable (and each column has 20 rows (observations)). The problem is that the first 10 columns are from Y1 to Y10 and the last one is the only independent variable.
So, I have to run 10 regressions, changing only the Yi (i = 1,...10). For example:
Y1 = a + bX + epsilon
Y2 = a + bX + epsilon
...
Y10 = a + bX + epsilon
(Yi and X are all 20 x 1 vectors; it's really a simple exercise.)
I can do it one by one, but I would like to run them all in one command. I am not a veteran programmer and I was wondering whether dplyr could help me with this.
I am really looking for suggestions.
Thank you.
You can try
lapply(d1[paste0('Y',1:10)], function(y) lm(y~d1[,'X']))
where d1 is the dataset
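A self-contained sketch of the same idea on simulated data (the data are invented here, not the asker's). It builds each formula with reformulate() and passes data = d1, so the fitted models carry readable variable names, then collects the intercepts and slopes into one matrix.

```r
set.seed(42)

# Simulated stand-in for the real data: 20 rows, Y1..Y10 and a single X
d1 <- as.data.frame(matrix(rnorm(20 * 10), ncol = 10))
names(d1) <- paste0("Y", 1:10)
d1$X <- rnorm(20)

responses <- paste0("Y", 1:10)

# One simple regression per response, all using the same predictor X
fits <- lapply(responses, function(y) {
  lm(reformulate("X", response = y), data = d1)
})
names(fits) <- responses

# 10 x 2 matrix: one row per response, columns (Intercept) and X
coefs <- t(sapply(fits, coef))
```

Since every model shares the same predictor, lm(as.matrix(d1[responses]) ~ X, data = d1) fits all ten regressions in a single call and returns the same coefficients.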

How to separate specific list items with "+" and add to formula?

I am trying to generate a formula using dataframe column names of the following format:
d ~ x1 + x2 + x3 + x4
From the following sample dataset:
a = c(1,2,3)
b = c(2,4,6)
c = c(1,3,5)
d = c(9,8,7)
x1 = c(1,2,3)
x2 = c(2,4,6)
x3 = c(1,3,5)
x4 = c(9,8,7)
df = data.frame(a,b,c,d,x1,x2,x3,x4)
As for what I have tried already:
I know that I can subset only the columns I need using the following approach
predictors = names(df[5:8])
response = names(df[4])
However, my attempts to combine these into a formula have failed.
How can I assemble the predictors and the response variables into the following format:
d ~ x1 + x2 + x3 + x4
I ultimately want to input this formula into a randomForest function.
We can avoid the entire problem by using the default method of randomForest (rather than the formula method):
randomForest(df[5:8], df[[4]])
or in terms of predictors and response defined in the question:
randomForest(df[predictors], df[[response]])
As mentioned in the Note section of the randomForest help file, the default method used here has the additional advantage of performing better than the formula method.
How about:
reformulate(predictors,response=response)
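For reference, here is a short sketch of what reformulate() produces on the sample data, next to the equivalent paste()/as.formula() construction. The names f1 and f2 are made up for illustration.

```r
# Sample data from the question
df <- data.frame(a = c(1, 2, 3), b = c(2, 4, 6), c = c(1, 3, 5), d = c(9, 8, 7),
                 x1 = c(1, 2, 3), x2 = c(2, 4, 6), x3 = c(1, 3, 5), x4 = c(9, 8, 7))

predictors <- names(df)[5:8]
response   <- names(df)[4]

# Option 1: reformulate() joins the term labels with "+" and adds the response
f1 <- reformulate(predictors, response = response)

# Option 2: build the string by hand and convert it
f2 <- as.formula(paste(response, "~", paste(predictors, collapse = " + ")))

print(f1)
```

Either object can then be passed as the formula argument of a modelling function, together with data = df.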

Generating multiple datasets and applying function and output multiple dataset

Here is my problem, just hard for me...
I want to generate multiple datasets, then apply a function to these datasets and output corresponding output in single or multiple dataset (whatever possible)...
My example, although I need to generate a large number of variables and datasets
seed <- round(runif(10)*1000000)
datagen <- function(x){
  set.seed(x)
  var <- rep(1:3, c(rep(3, 3)))
  yvar <- rnorm(length(var), 50, 10)
  matrix <- matrix(sample(1:10, c(10*length(var)), replace = TRUE), ncol = 10)
  mydata <- data.frame(var, yvar, matrix)
}
gdt <- lapply (seed, datagen)
# resulting list (I believe is correct term) has 10 dataframes:
# gdt[1] .......to gdt[10]
# my function, this will perform anova in every component data frames and
#output probability coefficients...
anovp <- function(x){
ind <- 3:ncol(x)
out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]])
pval <- out$coefficients[,4][2]
pval <- do.call(rbind,pval)
}
plist <- lapply (gdt, anovp)
Error in gdt[x] : invalid subscript type 'list'
This is not working, I tried different options. But could not figure out...finally decided to bother experts, sorry for that...
My questions are:
(1) Is this possible to handle such situation in this way or there are other alternatives to handle such multiple datasets created?
(2) If this is right way, how can I do it?
Thank you for attention and I will appreciate your help...
You have the basic idea right, in that you should create a list of data frames and then use lapply to apply the function to each element of the list. Unfortunately, there are several oddities in your code.
There is no point in randomly generating a seed, then setting it. You only need to use set.seed in order to make random numbers reproducible. Cut the lines
seed <- round(runif(10)*1000000)
and maybe
set.seed(x)
rep(1:3, c(rep(3, 3))) is the same as rep(1:3, each = 3).
Don't call your variables var or matrix, since they mask the names of those functions, which is confusing.
3:ncol(x) is dangerous. If x has fewer than 3 columns it doesn't do what you think it does: it counts down (for example, 3:2 gives 3 2) instead of returning an empty sequence.
... and now, the problem you actually wanted solving.
The problem is in the line out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]]).
lapply passes data frames into anovp, not indices, so x is itself a data frame, and gdt[x] tries to subset a list with a data frame, which throws an error.
One more thing. While you are rewriting that line, note that lm takes a data argument, so you don't need to do things like gdt$some_column; you can just reference some_column directly.
EDIT: Further advice.
You appear to always use the formula yvar ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10. Since it's the same each time, create it before your call to lapply.
independent_vars <- paste(colnames(gdt[[1]])[-1:-2], collapse = " + ")
model_formula <- formula(paste("yvar", independent_vars, sep = " ~ "))
I probably wouldn't bother with the anovp function. Just do
models <- lapply(gdt, function(data) lm(model_formula, data))
Then include a further call to lapply to play with the coefficients if necessary. The next line replicates your anovp code, but won't work because model$coefficients is a vector (so the dimensions aren't right). Adjust to retrieve the bit you actually want.
coeffs <- lapply(models, function(model) do.call(rbind, model$coefficients[,4][2]))
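One way to get at p-values is through summary(), whose coefficients component is a matrix with a "Pr(>|t|)" column. The sketch below puts the advice above together under two assumptions that differ from the question: each dataset has 30 rows instead of 9 (with 9 rows and 10 predictors the model has no residual degrees of freedom, so no p-values can be computed), and the p-value wanted is the one for X1. The names make_data, grp and preds are made up for illustration.

```r
set.seed(123)

# Generate one dataset; n_per_group = 10 is an assumption (the question uses 3)
make_data <- function(n_per_group = 10) {
  grp   <- rep(1:3, each = n_per_group)
  yvar  <- rnorm(length(grp), 50, 10)
  preds <- matrix(sample(1:10, 10 * length(grp), replace = TRUE), ncol = 10)
  colnames(preds) <- paste0("X", 1:10)
  data.frame(grp, yvar, preds)
}

# A list of 10 data frames
gdt <- replicate(10, make_data(), simplify = FALSE)

# Build the formula once: yvar ~ X1 + ... + X10
independent_vars <- paste(colnames(gdt[[1]])[-(1:2)], collapse = " + ")
model_formula <- as.formula(paste("yvar ~", independent_vars))

# Fit one model per dataset
models <- lapply(gdt, function(data) lm(model_formula, data = data))

# Extract the p-value of X1 from each fitted model
pvals <- sapply(models, function(m) summary(m)$coefficients["X1", "Pr(>|t|)"])
```

pvals is then a plain numeric vector with one entry per dataset, which is usually easier to work with than a list of one-row matrices.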
