Unique list of variable strings for model estimation - r

I want to create a vector of unique variable combinations to estimate various regression models for different sets of variables, while fixing one variable to be always included.
For example, I always want to include variable X1, plus a distinct combination of up to, say, three other variables from the full list of available variables X2, X3, ..., XN (the threshold may vary with the data and research question at hand).
The bivariate case is rather simple, I guess.
However, already for trivariate models, the variable combination "X1 X2 X3" will yield the same coefficients as "X1 X3 X2". Further, I also want to exclude combinations that contain the same variable twice, e.g. "X1 X2 X2".
What is the best way to exclude these "double-counted"/redundant combinations? Or how can I create such a vector of all possible distinct combinations directly?
Test code I tried so far (separating variables with underscores):
library(dplyr)
'%!in%' <- function(x,y)!('%in%'(x,y))
A <- c("X1", "X2", "X3", "X4", "X5") # all variables in dataset
a <- "X1" # keep X1 in all models
A_minus_a <- A[A %!in% a]
# first combination:
C1 <- outer(a, A_minus_a, paste, sep = "_")
# second set of combinations:
C2 <- outer(C1, A_minus_a, paste, sep = "_") %>% as.vector
# third set of combinations:
C3 <- outer(C2, A_minus_a, paste, sep = "_") %>% as.vector
# full list of model combinations, but including many "double-counted"/redundant models:
C <- c(C1, C2, C3)
Any help you can provide is very much appreciated!
P.S. For the second step I could prevent the problem by arranging the result of outer() as a matrix and then extracting the lower triangular elements without the diagonal. However, this no longer works for the third set of combinations. So there might be a better solution from the start.

How about using combn()? e.g. for sets of three variables:
cc <- combn(A_minus_a, m=3)
apply(cc,2,paste,collapse="_")
## [1] "X2_X3_X4" "X2_X3_X5" "X2_X4_X5" "X3_X4_X5"
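The combn() call above covers sets of exactly three variables and leaves out the fixed variable. A minimal sketch of how it could be extended to every model size up to the threshold, with X1 prepended to each combination, is given below. The names others and max_extra are made up for illustration, and the cap of three extra variables is taken from the question.

```r
A <- c("X1", "X2", "X3", "X4", "X5")  # all variables in the dataset
a <- "X1"                             # variable kept in every model
others <- setdiff(A, a)               # candidates to combine with X1
max_extra <- 3                        # assumed cap on additional variables

# combn() enumerates unordered subsets without repetition, so neither
# "X1_X3_X2" nor "X1_X2_X2" can occur. Loop over subset sizes 1..max_extra.
sizes <- seq_len(min(max_extra, length(others)))
combos <- unlist(lapply(sizes, function(m) {
  combn(others, m, FUN = function(v) paste(c(a, v), collapse = "_"))
}))

combos
length(combos)  # choose(4, 1) + choose(4, 2) + choose(4, 3) = 14
```

Each element can then be split on "_" again (or the combinations kept as a list) to build the model formulas.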

Related

Running the same function over multiple variables with systematic names (i.e., based on trial number)

I ran an experiment with multiple trials and have the data in wide format (i.e., each subject has a row, and the information for each of that subject's trials is contained in the columns).
The measurements collected for each trial are identical, such that for each participant there are columns with the same names followed by the relevant trial number e.g., x1, y1, x2, y2 where x and y are measurements collected at each trial and 1 and 2 represent trials 1 and 2 respectively.
I am creating some new variables based on the values for each trial e.g., x1 and y1 should be joined to create x1y1 (the particular functions likely don't matter, as I have been successful in writing them for the first trial). Now I am trying to apply those functions across the multiple trials (which again are identified by the number that follows the variable name) without writing a line of code for each trial 1:n that replicates the code I've written for trial 1. My question is whether I can use apply or a for loop that looks through the column names for the structure/numbering.
Suppose I want to do the following:
XY_1 = paste0(X1,Y1)
but for each trial 1-n. XY_n = paste0(Xn,Yn).
Perhaps something like:
for (trial in c(1:50)){XY[trial] = paste0(X[trial], Y[trial])}
I would like the new variables to be output as columns in the data file. Thank you for your help!
I think you're looking for mapply, where you can apply a function to vectors/lists in groups:
v1 <- c("a","b","c")
v2 <- c("1","2","3")
paste_together <- function(X, Y) {
  paste0(X, Y)
}
Called as:
mapply(paste_together, v1, v2)
which gives:
> result <- mapply(paste_together, v1, v2)
> result
a b c
"a1" "b2" "c3"
or perhaps:
> result <- mapply(paste_together, v1, v2, USE.NAMES=FALSE)
> result
[1] "a1" "b2" "c3"
I wasn't clear on your last statement. If you need the result of this to be in a single column, then convert it into a data frame:
> result <- data.frame(XpasteY=mapply(paste_together, v1, v2, USE.NAMES=F))
> result
XpasteY
1 a1
2 b2
3 c3
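If the goal is one new column per trial in the wide data frame itself, a loop over the trial numbers that builds the column names with paste0() is one possible approach. The sketch below assumes columns named X1, Y1, X2, Y2, ... as described in the question. The data frame dat, its values, and the XY_ prefix are invented for illustration.

```r
# Hypothetical wide-format data: one row per subject, two trials
dat <- data.frame(X1 = c("a", "b"), Y1 = c("1", "2"),
                  X2 = c("c", "d"), Y2 = c("3", "4"),
                  stringsAsFactors = FALSE)
n_trials <- 2

for (trial in seq_len(n_trials)) {
  x_col <- paste0("X", trial)        # e.g. "X1"
  y_col <- paste0("Y", trial)        # e.g. "Y1"
  new_col <- paste0("XY_", trial)    # e.g. "XY_1"
  # [[ ]] looks a column up by its name held in a variable
  dat[[new_col]] <- paste0(dat[[x_col]], dat[[y_col]])
}

dat
```

Replacing paste0() inside the loop with any other function of the two columns keeps the same pattern.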

Understanding coercion of factors into characters in an R dataframe

I am trying to figure out how coercion of factors and data frames works in R. I am trying to plot boxplots for a subset of a data frame. Let's go step by step.
x = rnorm(30, 1, 1)
Created a vector x with normal distribution
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
Created a character vector to later use as a factor for plotting boxplots for x1, x2, x3
df = data.frame(x,c)
Combined x and c into a data.frame. So now we would expect the class of df to be data.frame, df$x to be numeric, and df$c to be a factor (because we put c into a data frame), and is.data.frame and is.list applied to df should both give TRUE. (I assume that all data frames are lists as well, and that's why we get TRUE for both checks?)
And that's what happens below. All good till now.
class(df)
#[1] "data.frame"
is.data.frame(df)
#[1] TRUE
is.list(df)
#[1] TRUE
class(df$x)
#[1] "numeric"
class(df$c)
#[1] "factor"
Now I plot the spread of x grouped by the factor levels in c. So the first argument is x ~ c. But I want boxplots for just two levels: x1 and x2. So I used the subset argument of the boxplot function.
boxplot(x ~ c, subset=c %in% c("x1", "x2"), data=df)
This is the plot we get. Notice that since x3 is still a level of the factor, it is still plotted,
i.e. we still get 3 categories on the x-axis of the boxplot in spite of subsetting to 2 categories.
So, one solution I found was to change the class of the df variables to numeric and character:
class(df)<- c("numeric", "character")
boxplot(x ~ c, subset=c %in% c("x1", "x2"), data=df)
New boxplot. This is what we wanted, so it worked! We plotted boxes for just x1 and x2 and got rid of x3.
But if we run the same checks on all variables that we ran before this coercion, we get these outputs.
Anything funny?
class(df)
#[1] "numeric" "character"
is.data.frame(df)
#[1] FALSE
is.list(df)
#[1] TRUE
class(df$x)
#[1] "numeric"
class(df$c)
#[1] "factor"
Notice that df$c (the second variable, containing the categories x1, x2, x3) is still a factor!
And df stopped being a list (so was it ever a list?)
And what exactly did we do with class(df)<- c("numeric", "character"), if not change the data type of df$c?
So, to sum up, my questions (tl;dr version):
Are all data frames also lists in R?
Why did our boxplot drop x3 in the second case (when we coerced class(df) to numeric and character)?
If we did coerce the factor into characters by doing the above steps, why is it still showing that variable's class as factor?
And why did df stop being a data frame after we did the above steps?
The answers make more sense if we take your questions in a different order.
Are all data frames also lists in R?
Yes. A data frame is a list of vectors (the columns).
And why did df stop being a list after we did the above steps?
It didn't. It stopped being a data frame, because you changed the class with class(df)<- c("numeric", "character"). is.list(df) returns TRUE still.
If we did coerce the factor into characters by doing the above steps, why is it still showing that variable's class as factor?
class(df) operates on the df object itself, not on its columns. Look at str(df): the factor column is still a factor. The assignment class(df)<- c("numeric", "character") only set the class attribute of the data frame object itself to that character vector.
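The difference between the class of the container and the class of its columns can be checked directly. The small data frame below is made up for illustration, and the column is built with factor() explicitly so the result does not depend on the R version's default for stringsAsFactors.

```r
df <- data.frame(x = c(0.5, 1.5, 2.5),
                 c = factor(c("x1", "x2", "x3")))

class(df)           # class of the container: "data.frame"
is.list(df)         # TRUE: a data frame is a list of columns
sapply(df, class)   # class of each column: numeric, factor

# To change a column's type, convert the column, not the container
df$c <- as.character(df$c)
sapply(df, class)   # now numeric, character
class(df)           # still "data.frame"
```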
Why did our boxplot drop x3 in the second case (when we coerced class(df) to numeric and character)?
You've messed up your data frame object by explicitly setting the class attribute of the object to a vector c("numeric", "character"). It's hard to predict the full effects of this. My best guess is that boxplot or the functions that draw the axes accessed the class attribute of the data frame somehow.
To do what you really wanted:
x = rnorm(30, 1, 1)
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
df = data.frame(x,c)
df$c <- as.character(df$c)
or
x = rnorm(30, 1, 1)
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
df = data.frame(x,c, stringsAsFactors=FALSE)
Use droplevels like this:
df0 <- subset(df, c %in% c("x1", "x2"))
df0 <- transform(df0, c = droplevels(c))
levels(df0$c)
## [1] "x1" "x2"
Note that now c only has two levels, not three.
We can write this as a pipeline using magrittr like this:
library(magrittr)
df %>%
  subset(c %in% c("x1", "x2")) %>%
  transform(c = droplevels(c)) %>%
  boxplot(x ~ c, data = .)

For loops: Running through column names

I was looking for a shorter way to write this using a for loop,
i.e., i runs from 1 to 22 and columns 1 through 22 of my data are added to the multiple regression:
reg <- lm(log(y)~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+z1+z2+z3+z4+z5+z6+z7+z8+z9+z10+z11+z12, data)
To clarify, x1, x2 and x3 are all column names (x2 is "x two", not x squared). I am trying to do a multiple regression with the last 22 columns in my data set.
Someone suggested to do this:
reg1 <- lm(log(data$y)~terms( as.formula(
paste(" ~ (", paste0("X", 29:ncol(data) , collapse="+"), ")")
)
))
But it doesn't work. I don't think it is doing multiple regression (xone + xtwo + xthree); rather, it seems to have assigned the binary value 1 to each variable x1, x2, x3... and added them, which is not what I want.
I know that a for-loop was requested but it would have been a clumsy strategy, so here's a possible correct strategy:
formchr <- paste(
  paste("log(y)", paste0("x", 1:10, collapse = "+"), sep = "~"),  # the LHS and first 10 terms
  paste0("z", 1:12, collapse = "+"),                              # next 12 terms
  sep = "+")                                                      # put both parts together
reg1 <- lm(as.formula(formchr), data = data)
The full character-version of the formula should be passed to the as.formula function and the paste and paste0 functions are fully vectorized, so no loop is needed.
If the first 22 columns were the desired target for the RHS terms, you could have pasted together names(data)[1:22] or ...[29:50] if those were the locations, and this would be substituted for the RHS terms in the second paste above, dropping the third paste.
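The column-name variant described above could look roughly like the following sketch. The data frame dat, its five columns, and the positions 2:5 are stand-ins invented for illustration. With the real data the index would be 29:50 or whichever positions hold the predictors.

```r
set.seed(42)
# Hypothetical data: a positive response y (so log(y) is defined) and 4 predictors
dat <- as.data.frame(matrix(runif(20 * 5, min = 1, max = 2), ncol = 5))
names(dat) <- c("y", "x1", "x2", "z1", "z2")

rhs <- names(dat)[2:5]   # predictor names taken by column position
form <- as.formula(paste("log(y) ~", paste(rhs, collapse = " + ")))
form                     # log(y) ~ x1 + x2 + z1 + z2

fit <- lm(form, data = dat)
names(coef(fit))         # intercept plus one coefficient per predictor
```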
The only reason I used data as the name of an object is that it was implied by the question. It is a very confusing practice to use that name. data is an R function, and objects should have specific names that do not overlap with function names. The other very commonly abused name in this regard is df, which is the density function for the F distribution.
You could first subset your data into a data.frame which contains only the columns of interest. Then, you can run a linear model using the . formula syntax to select all columns other than the y variable.
Example using 1000 rows and 50 cols of data
N <- 1000
P <- 50
# rep() would repeat one identical column P times, so draw P independent columns instead
data <- as.data.frame(matrix(rnorm(N * P), nrow = N))
Assign your y data to y.
y <- data.frame(y = exp(rnorm(N)))  # strictly positive, so log(y) is defined
Create a new data.frame containing y and the last 22 columns.
model_data <- cbind(y, data[ ,29:50])
colnames(model_data) <- c("y", paste0("x", 1:10), paste0("z",1:12))
The following should do the trick. The . formula syntax will select all columns other than the y column.
reg <-lm(log(y) ~ ., data = model_data)

remove duplicate entries in cell - R

I searched high and low on here, and also tried the duplicated and unique functions for what I'm about to ask, but couldn't get anything to work. Let's say I have a data frame named company with a variable state. When I collapse the rows I'm left with this output in one of the state variable observations:
PA;PA;PA;TX;TX
How could I remove the dups inside the cell (and entire vector for that matter), so it looks as follows:
PA;TX
I have no problems removing dup rows, but can't seem to do it for the cells themselves.
This works for a single string:
x <- "PA;PA;PA;TX;TX"
x2 <- strsplit(x, ";")
x3 <- unlist(x2)
x4 <- unique(x3)
x5 <- paste(x4, collapse = ";")
If you want to do it for the whole vector company$state, you could roll all that up into one call to sapply:
sapply(company$state, function(x) paste(unique(unlist(strsplit(x, ";"))), collapse = ";"))
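A variant of the same idea using vapply(), which guarantees a character vector of the same length and drops the names that sapply() attaches, might look like this. The company data frame and the helper name dedupe_cell are made up for illustration. as.character() is there in case the column was read in as a factor.

```r
# Hypothetical data with duplicated entries inside each cell
company <- data.frame(state = c("PA;PA;PA;TX;TX", "NY", "CA;NY;CA"),
                      stringsAsFactors = FALSE)

# Split one cell on the separator, keep first occurrences, glue back together
dedupe_cell <- function(x, sep = ";") {
  paste(unique(strsplit(x, sep, fixed = TRUE)[[1]]), collapse = sep)
}

company$state <- vapply(as.character(company$state), dedupe_cell,
                        character(1), USE.NAMES = FALSE)
company$state
```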

Covariance matrices by group, lots of NA

This is a follow up question to my earlier post (covariance matrix by group) regarding a large data set. I have 6 variables (HML, RML, FML, TML, HFD, and BIB) and I am trying to create group specific covariance matrices for them (based on variable Group). However, I have a lot of missing data in these 6 variables (not in Group) and I need to be able to use that data in the analysis - removing or omitting by row is not a good option for this research.
I narrowed the data set down into a matrix of the actual variables of interest with:
>MMatrix = MMatrix2[1:2187,4:10]
This worked fine for calculating an overall covariance matrix with:
>cov(MMatrix, use="pairwise.complete.obs",method="pearson")
So to get this to list the covariance matrices by group, I turned the original data matrix into a data frame (so I could use the $ indicator) with:
>CovDataM <- as.data.frame(MMatrix)
I then used the following suggested code to get covariances by group, but it keeps returning NULL:
>cov.list <- lapply(unique(CovDataM$group),function(x)cov(CovDataM[CovDataM$group==x,-1]))
I figured this was because of my NAs, so I tried adding use = "pairwise.complete.obs" as well as use = "na.or.complete" (when desperate) to the end of the code, and it only returned NULLs. I read somewhere that "pairwise.complete.obs" can only be used if method = "pearson", but adding that at the end didn't make a difference either. I need to get covariance matrices of these variables by group, with all the available data included if possible, and I am way stuck.
Here is an example that should get you going:
# Create some fake data
m <- matrix(runif(6000), ncol=6,
            dimnames=list(NULL, c('HML', 'RML', 'FML', 'TML', 'HFD', 'BIB')))
# Insert random NAs
m[sample(6000, 500)] <- NA
# Create a factor indicating group levels
grp <- gl(4, 250, labels=paste('group', 1:4))
# Covariance matrices by group
covmats <- by(m, grp, cov, use='pairwise')
The resulting object, covmats, is a list with four elements (in this case), which correspond to the covariance matrices for each of the four groups.
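When the grouping variable is a column of the data frame, as in the question, the same result can be sketched with split() and lapply(), which returns a plain named list. The data below are invented, only three of the six variables are used to keep it short, and the column name Group is taken from the question's description.

```r
set.seed(1)
# Hypothetical data: a grouping column plus three measurement columns
dat <- data.frame(Group = rep(c("a", "b"), each = 10),
                  HML = rnorm(20), RML = rnorm(20), FML = rnorm(20))
dat$HML[c(2, 13)] <- NA   # scatter a few missing values
dat$RML[5] <- NA

vars <- c("HML", "RML", "FML")
# split() yields one data frame per group; extra arguments go on to cov()
cov_by_group <- lapply(split(dat[, vars], dat$Group), cov,
                       use = "pairwise.complete.obs")

names(cov_by_group)   # "a" "b"
cov_by_group$a        # 3 x 3 covariance matrix for group a
```

With use = "pairwise.complete.obs", each entry is computed from the rows where both variables are observed, so no row is dropped wholesale.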
Your problem is that lapply is treating your list oddly. If you run this code (which I hope is pretty much analogous to yours):
CovData <- matrix(1:75, 15)
CovData[3,4] <- NA
CovData[1,3] <- NA
CovData[4,2] <- NA
CovDataM <- data.frame(CovData, "group" = c(rep("a",5),rep("b",5),rep("c",5)))
colnames(CovDataM) <- c("a","b","c","d","e", "group")
lapply(unique(as.character(CovDataM$group)), function(x) print(x))
You can see that lapply is evaluating the list in a different manner than you intend. The NAs don't appear to be the problem. When I run:
by(CovDataM[ ,1:5], CovDataM$group, cov, use = "pairwise.complete.obs", method = "pearson")
It seems to work fine. Hopefully that generalizes to your problem.
