Use paste and !is.na to subset data frame

I am trying to define a subset of a data frame for a standard lm model inside a for loop. In the subset expression, I want to build the column name with paste (col1, col2, col3 in turn) and keep only the observations where that column is not NA. I have tried the following, but neither attempt works:
for(i in 1:3) {
  lm(y ~ x1 + x2, data=subset(df, x3=="Y" & !is.na(paste0("col", i))))
}
Or, defining the column name separately:
for(i in 1:3) {
  colname <- as.name(paste0("col", i))
  lm(y ~ x1 + x2, data=subset(df, x3=="Y" & !is.na(colname)))
}
BTW: This is simplified code to illustrate what I am trying to do. The code in my script does not give an error, but it ignores the !is.na condition of the subset expression. However, it works if done manually like this:
lm(y ~ x1 + x2, data=subset(df, x3=="Y" & !is.na(col1)))
I would greatly appreciate some advice!
Thanks in advance!
FK

The is.na() portion is being "ignored" because what you think is being evaluated isn't what is being evaluated. What is being evaluated is:
!is.na("col1")
and the string "col1" is obviously not NA, so the expression evaluates to a single TRUE, which is recycled across all rows of your data. The issue is that you have a variable name stored as a string, while subset() needs a logical vector with one value per row. So you need a way to turn the stored name into the column it refers to and evaluate the condition on that column. You can update your code along these lines:
for(i in 1:3) {
lm(y ~ x1 + x2, data=subset(df, x3=="Y" & !is.na(df[[paste0("col", i)]])))
}
This isn't optimal, though, and there are other ways you can and probably should update your code. Something along the lines of:
for(i in 1:3) {
lm(y ~ x1 + x2, data = df,
subset = df$x3 == "Y" & !is.na(df[[paste0("col", i)]]))
}
is a bit cleaner, as it uses lm()'s own subset argument to subset your data.
You still have the issue that you're not storing the results of your calls to lm() anywhere.
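One way to keep the fitted models is to collect them in a list. The following is only a minimal sketch under assumptions: the toy data frame df and its values are invented for illustration, and the list name models is hypothetical. The rows are selected with plain logical indexing so the example does not depend on how subset() evaluates its arguments.

```r
# Toy data, made up for illustration: col1..col3 have 1, 2 and 3 leading NAs.
set.seed(1)
df <- data.frame(
  y    = rnorm(20),
  x1   = rnorm(20),
  x2   = rnorm(20),
  x3   = rep(c("Y", "N"), each = 10),
  col1 = c(NA, 1:19),
  col2 = c(NA, NA, 1:18),
  col3 = c(NA, NA, NA, 1:17)
)

# Pre-allocate a list and store one fitted model per column.
models <- vector("list", 3)
for (i in 1:3) {
  # df[[name]] looks the column up by its name, so is.na() sees the data,
  # not the string "col1".
  keep <- df$x3 == "Y" & !is.na(df[[paste0("col", i)]])
  models[[i]] <- lm(y ~ x1 + x2, data = df[keep, ])
}
names(models) <- paste0("col", 1:3)

# Number of observations used by each model; these differ because each
# column has a different number of NAs.
sapply(models, nobs)
```

Each model can then be retrieved by name, for example summary(models$col2).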

Related

Subsetting numeric variable but not factor in R

I would like to fit linear mixed-effects models in a for loop, where Y and the random effect are always hard-coded and the X variables (var.nam[i]) are looped over. I wrote the code and it works (I believe), but I would also like to subset the data depending on the type of the X variable (var.nam[i]), numeric or factor:
when the X variable (var.nam[i]) is numeric, exclude all observations equal to 0
when the X variable (var.nam[i]) is a factor, do not subset
A short sample of my code is here:
for(i in 1:length(var.nam)) {
formula[i] <- paste0("Y", "~", paste0(c(var.nam[i], c("Season"),
c("Sex"),
c("Age"),
c("BMI"),
c("(1|HID)")), collapse="+"))
model <- lmer(formula[i], data = subset(data, paste0(c(var.nam[i])) != 0))
# loop continues...
}
As it is written now, it will subset all X variables (var.nam[i]) regardless of the type. Is there any workaround or different way to subset variable, that would work in this specific case?

Checking if this solution works is a bit hard without data or the complete for loop.
Based on your question, you want to subset conditionally; adding an if/else statement should make this possible. Note that the type check has to be made on the column itself, not on its name, because var.nam[i] is always a character string:
for(i in 1:length(var.nam)) {
  formula[i] <- paste0("Y", "~", paste0(c(var.nam[i], c("Season"),
                                          c("Sex"),
                                          c("Age"),
                                          c("BMI"),
                                          c("(1|HID)")), collapse="+"))
  # Look the column up by name and drop zero rows only when it is numeric
  x <- data[[var.nam[i]]]
  data1 <- if(is.numeric(x)) {data[which(x != 0), ]} else {data}
  model <- lmer(formula[i], data = data1)
  # loop continues...
}
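The conditional subsetting step can be tried on its own, without lme4. This is a small sketch under assumptions: the helper drop_zeros_if_numeric and the toy data frame are hypothetical names invented for illustration, not part of the original code.

```r
# Hypothetical helper: drop rows where a numeric column equals 0,
# and return the data unchanged for any other column type.
drop_zeros_if_numeric <- function(data, var) {
  x <- data[[var]]                  # look the column up by its name
  if (is.numeric(x)) {
    # which() also drops rows where the comparison is NA
    data[which(x != 0), , drop = FALSE]
  } else {
    data
  }
}

# Toy data, made up for illustration
toy <- data.frame(
  dose  = c(0, 1, 2, 0, 3),
  group = factor(c("a", "b", "a", "b", "a"))
)

nrow(drop_zeros_if_numeric(toy, "dose"))   # numeric: zero rows are removed
nrow(drop_zeros_if_numeric(toy, "group"))  # factor: nothing is removed
```

Inside the loop, the result of such a helper would be what gets passed as the data argument of lmer.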

Loop over data table columns and apply glm using for loop

I am trying to loop over my data table columns and apply glm to each column using a for loop.
for(n in 1:ncol(dt)){
model = glm(y ~ dt[, n], family=binomial(link="logit"))
}
Why doesn't this work? I am getting this error:
Error in `[.data.table`(dt, , n) :
j (the 2nd argument inside [...]) is a single symbol but column name 'n' is not found. Perhaps you intended DT[, ..n]. This difference to data.frame is deliberate and explained in FAQ 1.1.
I nearly managed to make it work using something like dt[[n]], but I think it gets rid of the column name.

Use lapply to iterate over the column names and reformulate to construct the formula:
model_list <- lapply(names(dt), function(x)
glm(reformulate(x, 'y'), dt, family=binomial(link="logit")))

We can create a formula with paste and use that in glm
model <- vector('list', ncol(dt))
for(n in 1:ncol(dt)){
model[[n]] = glm(as.formula(paste0('y ~ ', names(dt)[n])),
data = dt, family=binomial(link="logit"))
}
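Both answers loop over every column name, so if the response y is itself a column of the table it would also be used as a predictor. The sketch below, on made-up data in a plain data frame (the names dat and predictors are assumptions for illustration), excludes the response first and names the resulting list so each model can be traced back to its column.

```r
# Toy data, made up for illustration: a binary response and two predictors
set.seed(42)
dat <- data.frame(
  y = rbinom(50, 1, 0.5),
  a = rnorm(50),
  b = rnorm(50)
)

# Every column except the response is treated as a predictor
predictors <- setdiff(names(dat), "y")

# reformulate() builds y ~ a, y ~ b, ... from the column names
model_list <- lapply(predictors, function(x) {
  glm(reformulate(x, response = "y"), data = dat,
      family = binomial(link = "logit"))
})
names(model_list) <- predictors

# The coefficient names keep the original column names
lapply(model_list, function(m) names(coef(m)))
```

Because the formula is built from the column name, the name survives in the model output, which addresses the concern about dt[[n]] dropping it.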

use more than 100 features lm function in R

Assume I have a data frame consisting of 101 columns, where the first 100 are named data1 to data100 and the 101st column is named y.
I want to use the lm function in R with data1 to data100 as the features.
I know this can be written as:
lin_reg <- lm(y ~ data1+...+data100, dataframe)
Is there a better way of doing this?

lin_reg <- lm(y ~ ., data = dataframe)
This assumes your data frame really consists only of your outcome plus all feature variables, with no extra columns. The "." means "take everything else from that data frame".
Since, as per the comment, the OP wants to exclude certain columns:
dataframe_subset <- dataframe[, !names(dataframe) %in% c("data5", "data10")]
lin_reg <- lm(y ~ ., data = dataframe_subset)
In this example, the columns data5 and data10 are excluded.

You can create the formula dynamically with reformulate:
lin_reg <- lm(reformulate(paste0('data', 1:100), 'y'), dataframe)
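To see that the two approaches agree, here is a small sketch on a made-up data frame with three features instead of one hundred (the data are invented for illustration). It also shows that a column can be excluded inside the formula with the minus operator, as an alternative to subsetting the data frame first.

```r
# Toy data, made up for illustration: three features and an outcome
set.seed(1)
dataframe <- as.data.frame(matrix(rnorm(60), ncol = 3))
names(dataframe) <- paste0("data", 1:3)
dataframe$y <- rnorm(20)

# "." expands to every column other than the response
fit_dot <- lm(y ~ ., data = dataframe)

# reformulate() builds y ~ data1 + data2 + data3 from the names
fit_ref <- lm(reformulate(paste0("data", 1:3), response = "y"),
              data = dataframe)

# Both calls estimate the same coefficients
all.equal(coef(fit_dot), coef(fit_ref))

# Excluding a column directly in the formula
fit_minus <- lm(y ~ . - data2, data = dataframe)
names(coef(fit_minus))
```

The reformulate version is the safer choice when the data frame contains extra columns that should not enter the model.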

Passing arguments to subset within a function

I am attempting to fit a bunch of different models to a single dataset. Each of the models uses a different combination of outcome variable and data subset. To fit all of these models, I created a dataframe with one column for the outcome variable and one column specifying the data subset (as a string). (Note that the subsets are overlapping so there doesn't appear to be an obvious way to do this using nest().) I then created a new function which takes one row of this dataframe and calls "lm" using these options. Lastly, I use pmap to map this function to the dataframe.
After a bunch of experimentation, I found an approach that works but that is rather inelegant (see below for a simplified version of what I did). It seems like there should be a way to pass the subset condition to the subset argument in lm rather than using eval(parse(text = condition)) to first create a logical vector. I read the Advanced R section on metaprogramming in the hope that it would provide some insight, but I was unable to find anything that works.
Any suggestions would be helpful.
library(tidyverse)
outcomes <- c("mpg", "disp")
sub_conditions <- c("mtcars$cyl >=6", "mtcars$wt > 2")
models <- expand.grid(y = outcomes, condition = sub_conditions) %>% mutate_all(as.character)
fit <- function(y, condition) {
# Create the formula to use in all models
rx <- paste(y, "~ hp + am")
log_vec <- eval(parse(text = condition))
lm(rx, data = mtcars[log_vec,])
}
t <- pmap(models, fit)

Are you sure you want to pass the conditions as strings?
If so, there are not many options. You can use rlang::parse_expr as an alternative to parse(text = ...):
fit <- function(y, condition) {
rx <- paste(y, "~ hp + am")
lm(rx, data = mtcars[eval(rlang::parse_expr(condition)),])
}
and call it via
purrr::pmap(models, fit)
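If the conditions do not have to be strings, one alternative is to store them as unevaluated expressions and evaluate each one inside the data frame. This is only a sketch in base R under assumptions: the names conditions, big_engine, heavy, grid and fits are invented for illustration, and the row selection still happens before lm is called rather than through its subset argument.

```r
# Conditions stored as quoted expressions rather than strings.
# The labels "big_engine" and "heavy" are hypothetical.
conditions <- list(
  big_engine = quote(cyl >= 6),
  heavy      = quote(wt > 2)
)
outcomes <- c("mpg", "disp")

fit <- function(y, condition) {
  # Evaluate the expression with the columns of mtcars in scope
  keep <- eval(condition, envir = mtcars)
  lm(reformulate(c("hp", "am"), response = y), data = mtcars[keep, ])
}

# One row per combination of outcome and condition label
grid <- expand.grid(y = outcomes, condition = names(conditions),
                    stringsAsFactors = FALSE)

fits <- Map(function(y, cond) fit(y, conditions[[cond]]),
            grid$y, grid$condition)
names(fits) <- paste(grid$y, grid$condition, sep = "_")

# Number of rows used by each model
sapply(fits, nobs)
```

Writing the conditions without the mtcars$ prefix also means the same list could be reused with another data frame that has the same column names.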

With R, loop over data frames, and assign appropriate names to objects created in the loop

This is something which data analysts do all the time (especially when working with survey data which features missing responses). It's common to first multiply impute a set of complete data matrices, fit models to each of these matrices, and then combine the results. At the moment I'm doing things by hand and looking for a more elegant solution.
Imagine there's 5 *.csv files in the working directory, named dat1.csv, dat2.csv, ... dat5.csv. I want to estimate the same linear model using each data set.
Given this answer, a first step is to gather a list of the files, which I do with the following
csvdat <- list.files(pattern="dat.*csv")
Now I want to do something like
for(x in csvdat) {
lm.which(csvdat == "x") <- lm(y ~ x1 + x2, data = x)
}
The "which" statement is my silly way of trying to number each model in turn, using the position in the csvdat list that the loop is currently up to. That is, I'd like this loop to return a set of 5 lm objects with the names lm.1, lm.2, etc.
Is there some simple way to create these objects, and name them so that I can easily indicate which data set they correspond to?
Thanks for your help!

Another approach is to use the plyr package to do the looping. Using the example constructed by @chl, here is how you would do it:
require(plyr)
# read csv files into list of data frames
data_frames = llply(csvdat, read.csv)
# run regression models on each data frame
regressions = llply(data_frames, lm, formula = y ~ .)
names(regressions) = csvdat

Use a list to store the results of your regression models as well, e.g.
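# simulate a data frame with two predictors (V1, V2) and a response y
foo <- function(n) return(transform(X <- as.data.frame(replicate(2, rnorm(n))),
                                    y = V1+V2+rnorm(n)))
# row.names = FALSE keeps the row index out of the file, so that
# y ~ . below does not pick it up as an extra predictor
write.csv(foo(10), file="dat1.csv", row.names=FALSE)
write.csv(foo(10), file="dat2.csv", row.names=FALSE)
csvdat <- list.files(pattern="dat.*csv")
lm.res <- list()
for (i in seq_along(csvdat))
  lm.res[[i]] <- lm(y ~ ., data=read.csv(csvdat[i]))
names(lm.res) <- csvdat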
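The same list-based idea can be written with lapply. The sketch below is self-contained under assumptions: it writes three made-up CSV files into a temporary directory (the directory and the simulated data are invented for illustration) so that it does not touch the working directory.

```r
# Write three toy CSV files into a temporary directory (illustration only)
tmp <- tempfile("csvdemo")
dir.create(tmp)
set.seed(1)
for (k in 1:3) {
  d <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
  d$y <- d$x1 + d$x2 + rnorm(10)
  write.csv(d, file.path(tmp, sprintf("dat%d.csv", k)), row.names = FALSE)
}

# Collect the file paths, fit one model per file, and name each model
# after the file it came from
csvdat <- list.files(tmp, pattern = "^dat.*\\.csv$", full.names = TRUE)
fits <- lapply(csvdat, function(f) lm(y ~ x1 + x2, data = read.csv(f)))
names(fits) <- basename(csvdat)

names(fits)
```

A single model is then available as fits[["dat2.csv"]], and functions such as lapply(fits, coef) work on all of them at once.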

What you want is a combination of the functions seq_along() and assign().
seq_along creates a vector from 1 to 5 if there are five objects in csvdat (to get the appropriate numbers and not only the file names). Then assign (using paste to create the appropriate strings from the numbers) lets you create the variables.
Note that you will also need to load the data file first (this was missing in your example):
for (x in seq_along(csvdat)) {
data.in <- read.csv(csvdat[x]) #be sure to change this to read.table if necessary
assign(paste("lm.", x, sep = ""), lm(y ~ x1 + x2, data = data.in))
}
seq_along is not strictly necessary; there are other ways to solve the numbering problem.
The critical function is assign. With assign you can create variables with a name based on a string. See ?assign for further info.
Following chl's comments (see his post), everything in one line:
for (x in seq_along(csvdat)) assign(paste("lm", x, sep = "."), lm(y ~ x1 + x2, data = read.csv(csvdat[x])))
