I would like to fit linear mixed-effects models via a for loop where Y and the random effect are always hard-coded, and the X variables (var.nam[i]) are looped through. I wrote the code and it works (as far as I can tell), but I would also like to subset the data depending on the type of the X variable (var.nam[i]) (numeric or factor), where:
when the X variable (var.nam[i]) is numeric, exclude all observations equal to 0
when the X variable (var.nam[i]) is a factor, do not subset at all
A short sample of my code is here:
for(i in 1:length(var.nam)) {
formula[i] <- paste0("Y", "~", paste0(c(var.nam[i], c("Season"),
c("Sex"),
c("Age"),
c("BMI"),
c("(1|HID)")), collapse="+"))
model <- lmer(formula[i], data = subset(data, paste0(c(var.nam[i])) != 0))
# loop continues...
}
As it is written now, it subsets on every X variable (var.nam[i]) regardless of its type. Is there any workaround or different way of subsetting that would work in this specific case?
Checking whether this solution works is a bit hard without data or the complete for loop.
Based on your question, you want to subset conditionally; adding an if/else statement should make this possible:
for(i in 1:length(var.nam)) {
formula[i] <- paste0("Y", "~", paste0(c(var.nam[i], c("Season"),
c("Sex"),
c("Age"),
c("BMI"),
c("(1|HID)")), collapse="+"))
data1 <- if (is.numeric(data[[var.nam[i]]])) {
  data[data[[var.nam[i]]] != 0, ]
} else {
  data
}
model <- lmer(formula[i], data = data1)
# loop continues...
}
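A note on the condition: is.numeric(data[[var.nam[i]]]) inspects the column whose name is stored in var.nam[i]. Checking the name string itself (for example mode(var.nam[i]), or paste0(c(var.nam[i])) != 0 inside subset()) always sees a character value, so it can never distinguish numeric from factor columns or filter any rows.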
I want to write a function that dynamically uses different correlation methods depending on the scale of measurement of the feature (continuous, dichotomous, ordinal). The label is always continuous. My idea was to use the apply() function: iterate over every feature (i.e. column), check its scale of measurement (numeric, factor with two levels, factor with more than two levels), and then use the appropriate correlation function. Unfortunately, my code seems to convert every feature into a character vector, and as a consequence the condition in the if statement is always false for every column. I don't know why my code is doing this. How can I prevent my code from converting my features to character vectors?
set.seed(42)
foo <- sample(c("x", "y"), 200, replace = T, prob = c(0.7, 0.3))
bar <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.5,0.05,0.1,0.1,0.25))
y <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.25,0.1,0.1,0.05,0.5))
data <- data.frame(foo,bar,y)
features <- data[, !names(data) %in% 'y']
dyn.corr <- function(x,y){
# print out structure of every column
print(str(x))
# if feature is numeric and has more than two outcomes use corr.test
if(is.numeric(x) & length(unique(x))>2){
result <- corr.test(x,y)[['r']]
} else {
result <- "else"
}
}
result <- apply(features,2,dyn.corr,y)
apply is built for matrices. When you call apply on a data frame, the first thing that happens is that the data frame is coerced to a matrix. A matrix can only hold one data type, so all columns of your data are converted to the most general type among them.
Use sapply or lapply to work with columns of a data frame.
This should work fine (I tried to test, but I don't know what package to load to get the corr.test function.)
result <- sapply(features, dyn.corr, y)
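To see the coercion that apply performs (and that sapply avoids), here is a quick check on the example data; it assumes corr.test comes from the psych package:
# apply() coerces the data frame to a character matrix first, so every
# column looks like character data inside dyn.corr:
apply(features, 2, class)    # "character" for every column
# sapply()/lapply() hand each column over unchanged, so bar stays numeric:
sapply(features, class)
# the corrected call, with the package that provides corr.test loaded:
library(psych)
result <- sapply(features, dyn.corr, y)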
I would like to ask if it is possible to apply this function to a data.table approach:
myfunction <- function(i) {
a <- test.dt[i, 1:21, with = F]
final <- t((t(b) == a) * value)
final[is.na(final)] <- 0
sum.value <- rowSums(final)
final1 <- cbind(train.dt, sum.value)
final1 <- final1[order(-sum.value),]
final1 <- final1[final1$sum.value > 0,]
suggestion <- unique(final1[, 22, with = F])
suggestion <- suggestion[1:5, ]
return(suggestion)
}
This is a custom kNN function I made to be used on character columns. It gives the top 5 suggestions/predictions. However, it has performance issues on my end when it is run on large test data (I have not been able to tweak it myself so far).
The variables used are as follows:
train.dt -- the training data, includes 22 columns (21 features, 1 label column)
test.dt -- the test data, same structure as training data
value -- a vector that contains the weights/importance value of 21 features
sum.value -- sum of all the weights on value vector (sum(value))
b -- has the same data as the training data, but excluding the label column
a -- has the same data as the test data, but excluding the label column
suggestion -- the output
Also, I want to use lapply (or any appropriate apply-family function) on this function; the i variable in the function refers to the row number of the test data, meaning I want to apply it to each row of the test data. I have not managed to get this working yet.
Hope you can understand and thank you in advance!
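For the lapply part, a minimal sketch under the assumption that myfunction, test.dt, b and value are already defined as in your code (this only changes how the function is called, not its performance):
library(data.table)
# call myfunction once per row index of the test data;
# each list element holds that row's top-5 suggestions
suggestions <- lapply(seq_len(nrow(test.dt)), myfunction)
# optionally stack the per-row suggestions into one data.table,
# tagging each block with the test row it came from
all.suggestions <- rbindlist(suggestions, idcol = "test.row")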
I am running the following imputation task in R as a for loop:
myData <- essuk[c(2,3,4,5,6,12)]
myDataImp <- matrix(0,dim(myData)[1],dim(myData)[2])
lower <- c(0)
upper <- c(Inf)
for (k in c(1:5))
{
gmm.fit1 <- gmm.tmvnorm(matrix(myData[,k],length(myData[,k]),1), lower=lower, upper=upper)
useMu <- matrix(gmm.fit1$coefficients[1],1,1)
useSigma <- matrix(gmm.fit1$coefficients[2],1,1)
replaceThese <- myData[,k]<=0
myDataImp[,k] <- myData[,k]
myDataImp[replaceThese,k] <- rtmvnorm(n=sum(replaceThese), c(useMu), c(useSigma), c(-Inf), c(0))
}
The steps are pretty straightforward:
Define the data set and an empty imputation data set.
For each of columns 1-5, fit a model.
Extract model estimates to be used for imputation.
Run a model using model estimates and replace values <= 0 with the new values in the imputation data set.
However, I want to do this separately for multiple groups, rather than for the full sample. Column 12 in the data set contains information on group membership (integers ranging from 1-72).
I have tried several options, including splitting the data frame with data_list <- split(myData, myData$V12) and using the lapply() function. However, this does not work because of how the model estimates are formatted:
Error in as.data.frame.default(data) :
cannot coerce class ""gmm"" to a data.frame
I have also thought about the possibility of doing a nested for loop, although I am not sure how that could be accomplished. Any suggestions are much appreciated.
What about using subset()?
myData$V12 <- as.factor(myData$V12)
listofresults <- list()
for (i in levels(myData$V12)) {
  data <- subset(myData, myData$V12 == i)
  # your analysis here: result saved in myDataImp
  listofresults[[i]] <- myDataImp
}
Not the most elegant, but it should work. Storing the results in a list (rather than concatenating with c()) keeps each group's imputed data intact.
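If you would rather avoid the explicit loop, the split()/lapply() route from your question can also work, as long as the function you apply returns the imputed data rather than the gmm fit itself. A rough sketch, wrapping your existing loop body in a hypothetical helper impute.group(), with lower, upper, gmm.tmvnorm and rtmvnorm as in your code:
# impute columns 1-5 for one group's data frame and return it
impute.group <- function(d) {
  imp <- d
  for (k in 1:5) {
    fit <- gmm.tmvnorm(matrix(d[, k], length(d[, k]), 1),
                       lower = lower, upper = upper)
    useMu    <- matrix(fit$coefficients[1], 1, 1)
    useSigma <- matrix(fit$coefficients[2], 1, 1)
    replaceThese <- d[, k] <= 0
    imp[replaceThese, k] <- rtmvnorm(n = sum(replaceThese),
                                     c(useMu), c(useSigma), c(-Inf), c(0))
  }
  imp
}
data_list    <- split(myData, myData$V12)
imputed_list <- lapply(data_list, impute.group)   # one imputed data frame per group
myDataImp    <- do.call(rbind, imputed_list)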
I have a matrix that is half-sparse: half of all cells are blank (NA), so when I run mice it tries to impute all of them. I am only interested in a subset.
Question: In the following code, how do I make "mice" only operate on the first two columns? Is there a clean way to do this using row-lag or row-lead, so that the content of the previous row can help patch holes in the current row?
set.seed(1)
#domain
x <- seq(from=0,to=10,length.out=1000)
#ranges
y <- sin(x) +sin(x/2) + rnorm(n = length(x))
y2 <- sin(x) +sin(x/2) + rnorm(n = length(x))
#kill 50% of cells
idx_na1 <- sample(x=1:length(x),size = length(x)/2)
y[idx_na1] <- NA
#kill more cells
idx_na2 <- sample(x=1:length(x),size = length(x)/2)
y2[idx_na2] <- NA
#assemble base data
my_data <- data.frame(x,y,y2)
#make the rest of the data
for (i in 3:50){
my_data[,i] <- rnorm(n = length(x))
idx_na2 <- sample(x=1:length(x),size = length(x)/2)
my_data[idx_na2,i] <- NA
}
#imputation
est <- mice(my_data)
data2 <- complete(est)
str(data2[,1:3])
Places that I have looked for answers:
help document (link)
google of course...
https://stats.stackexchange.com/questions/99334/fast-missing-data-imputation-in-r-for-big-data-that-is-more-sophisticated-than-s
I think what you are looking for can be done by modifying the "where" parameter of the mice function. "where" takes a matrix (or data frame) of the same size as the dataset you are imputing. By default it is is.na(data): a matrix with TRUE where a value is missing in your dataset and FALSE otherwise, which means that by default every missing value gets imputed. If you only want to impute the values in a specific column (in my example, column 2), you can do this:
# Define arbitrary matrix with TRUE values when data is missing and FALSE otherwise
A <- is.na(data)
# Replace all the other columns which are not the one you want to impute (let say column 2)
A[,-2] <- FALSE
# Run the mice function
imputed_data <- mice(data, where = A)
Instead of the where argument, a faster way might be to use the method argument. You can set it to "" for the columns/variables you want to skip. The downside is that automatic determination of the method will no longer happen. So:
imp <- mice(data,
method = ifelse(colnames(data) == "your_var", "logreg", ""))
But you can get the default methods from the documentation (the defaultMethod argument). By default, mice uses:
pmm, predictive mean matching (numeric data)
logreg, logistic regression imputation (binary data, factor with 2 levels)
polyreg, polytomous regression imputation for unordered categorical data (factor with > 2 levels)
polr, proportional odds model for ordered categorical data (ordered factor with > 2 levels)
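If you want to keep those defaults for the column you are imputing without hard-coding "logreg", one option is to start from the default method vector and blank out everything else. A sketch, assuming a recent mice version that exports make.method() and a hypothetical target column your_var:
library(mice)
# the methods mice would pick by default for each column
meth <- make.method(data)
# keep the default method only for the column you want to impute
meth[setdiff(names(meth), "your_var")] <- ""
imp <- mice(data, method = meth)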
Your question isn't entirely clear to me. Are you saying you wish to operate on only two columns? In that case mice(my_data[, 1:2]) will work. Or do you want to use all the data but only fill in missing values for some columns? For that, I'd just create an indicator matrix along the following lines:
# record which cells were originally missing
isNA <- data.frame(apply(my_data, 2, is.na))
# impute everything
est <- mice(my_data)
# then, column by column, put the NAs back where they were originally missing
mapply(function(x, isna) {
  x[isna] <- NA
  return(x)
}, <each MI mice return object column-wise>, isNA)
For your final question, "can I use mice for rolling data imputation?" I believe the answer is no. But you should double check the documentation.
I am running the following code, which is working fine:
model <- NULL
summary <- NULL
stepwise <- NULL
for (i in 1:100){
model[[i]] <- lm(r[[i]]~x1[[i]]+x2[[i]]+x3[[i]]+noise1[[i]]+noise2[[i]]+noise3[[i]]+noise4[[i]]+noise5[[i]]+noise6[[i]]+noise7[[i]])
summary[[i]] <- summary(model[[i]])$coefficients
stepwise[[i]] <- step(model[[i]], direction="both")$coefficients
}
I wanted to set up a counter to keep track of the variables that are stored in the stepwise list. I want a count of how many times each variable (x1, x2, x3, noise1, noise2, noise3, noise4, noise5, noise6, noise7) occurs. I was thinking of something like this
createCounter <- function(VALUE){
for (i in 1:100){
output <- VALUE <- VALUE+i
return(output)
}
}
but I don't know how to fine-tune it so that R understands to count a value if the stepwise list contains the particular variable. Any help would be appreciated.
Well, step(...)$coefficients returns a named vector of coefficients. The values are the coefficient estimates themselves, and the names of the vector are the variable names. So you can extract and count all the variable names across the 100 models with
table(unlist(lapply(stepwise, function(x) names(x))))
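Note that this count will also include "(Intercept)", since it appears in every coefficient vector; if you only want the predictors, you can drop it, for example:
# tally how often each term survives the stepwise selection,
# leaving out the intercept
counts <- table(unlist(lapply(stepwise, names)))
counts[names(counts) != "(Intercept)"]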