R: Help using dummyVars and adding back into data.frame

I have a data.frame of 373127 obs. of 193 variables. Some variables are factors which I want to use dummyVars() to separate each factor into its own column. I then want to merge the separate dummy variable columns back into my original data.frame, so I thought I could do the whole thing with apply, but something is not working and I can't figure out what it is.
Sample:
dat_final <- apply(dummies.var1, 1, function(x) {
dummies.var1 <- dummyVars(~ dat1$factor.var1 -1, data = dat1)
})
Thanks!

The following creates a new data frame, trsf, that contains the dummy columns; you can always reassign it to the original data frame afterwards:
library(caret)
customers <- data.frame(
id=c(10,20,30,40,50),
gender=c('male','female','female','male','female'),
mood=c('happy','sad','happy','sad','happy'),
outcome=c(1,1,0,0,0))
# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)

The real answer is .... Don't do that. It's almost never necessary.

You could do something like this:
# Example data
df = data.frame(x = rep(LETTERS, each = 3), y = rnorm(78))
df = cbind(df, model.matrix(~df$x - 1))
However, as pointed out by @user30257, it is hard to see why you would want to do this. In general, modeling tools in R don't need dummy variables; they deal with factors directly.

Creating dummy variables can be very important in feature selection, which it sounds like the original poster was doing.
For instance, suppose you have a feature that contains duplicated information (i.e., one of its levels corresponds to something measured elsewhere). You can determine this is the case very simply by comparing the dummy variables for these features using a variety of dissimilarity measures.
My preference is to use sparse.model.matrix and cBind, both from the Matrix package (in current versions of Matrix, plain cbind does the job of cBind).
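A minimal sketch of that approach, assuming the Matrix package is installed. The toy data frame and its column names are made up for illustration, and cbind is used in place of cBind on the assumption that a recent Matrix version is in use:

```r
library(Matrix)

# Hypothetical example data: one factor and one numeric column
df <- data.frame(
  x = factor(rep(c("a", "b", "c"), each = 2)),
  y = c(1.5, 2.0, 0.3, 0.8, 2.2, 1.1)
)

# One sparse dummy column per factor level (no intercept)
dummies <- sparse.model.matrix(~ x - 1, data = df)

# Put the numeric column into a sparse one-column matrix and bind it on
y_sparse <- Matrix(df$y, ncol = 1, sparse = TRUE, dimnames = list(NULL, "y"))
combined <- cbind(y_sparse, dummies)

dim(combined)
```

Keeping everything sparse matters mostly when a factor has many levels; for a handful of levels a dense model.matrix is just as convenient.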

Related

R: Comparing different versions of data in terms of levels

my aim is to compare differences in levels of variables that might occur across different versions of a dataset. In my code, I first generate strings in order to be able to compare several variables (numeric, categorical, etc.). However, the code fails and does not give the desired results, which would be a data frame that consists of the variable and possible differences (in a list). Any help is appreciated!
Thank you.
data1 <- lapply(?, as.character)
data2 <- lapply(?, as.character)
check_diffs <- function(vars, data1, data2) {
levels1 <- unique(data1$vars)
levels2 <- unique(data2$vars)
diff <- ifelse(length(union(setdiff(levels1,levels2), setdiff(levels2,levels1)))>0, list(union(setdiff(levels1,levels2), setdiff(levels2,levels1))), NA)
return(data.frame(var = vars, diffs = I(diff)))
}
diffs_df <- map_dfr(vars, ~check_diffs(.x, data1 = ?, data2 = ?))
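The issue with the code was that vars holds the column name as a string, so data1$vars looks for a column literally called "vars". The column must instead be retrieved with get(vars, dataX) (or, equivalently, dataX[[vars]]). With that change, the code returns the differences in coding between the two data sets.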
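A minimal sketch of the corrected function, using `[[` for the lookup and base R's lapply instead of map_dfr so that it runs without extra packages. The two small data frames are hypothetical stand-ins for the elided data sets:

```r
# Return the values that appear in only one of the two data sets
check_diffs <- function(var, data1, data2) {
  levels1 <- unique(as.character(data1[[var]]))
  levels2 <- unique(as.character(data2[[var]]))
  union(setdiff(levels1, levels2), setdiff(levels2, levels1))
}

# Hypothetical example data
data1 <- data.frame(colour = c("red", "blue"), size = c(1, 2))
data2 <- data.frame(colour = c("red", "green"), size = c(1, 2))
vars <- c("colour", "size")

# One list element per variable, stored as a list column
diff_list <- lapply(vars, check_diffs, data1 = data1, data2 = data2)
diffs_df <- data.frame(var = vars, diffs = I(diff_list))
```

Variables without differences end up with an empty character vector in the diffs column rather than NA, which keeps every entry the same type.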

How do you remerge the response variable to the data frame after removing it for standardization?

I have a dataset with 61 columns (60 explanatory variables and 1 response variable).
All the explanatory variables are numerical, and the response is categorical (Default). Some of the explanatory variables have negative values (financial data), and therefore it seems more sensible to standardize rather than normalize. However, when standardizing using the "apply" function, I have to remove the response variable first, so I do:
model <- read.table......
modelwithnoresponse <- model
modelwithnoresponse$Default <- NULL
means <- apply(modelwithnoresponse,2,mean)
standarddeviations <- apply(modelwithnoresponse,2,sd)
modelSTAN <- scale(modelwithnoresponse,center=means,scale=standarddeviations)
So far so good, the data is standardized. However, now I would like to add the response variable back to "modelSTAN". I've seen some posts on dplyr, merge functions and rbind, but I couldn't quite get them to work so that the response would simply be added back as the last column of "modelSTAN".
Does anyone have a good solution to this, or maybe another workaround to standardize it without removing the response variable first?
I'm quite new to R, as I'm a finance student and took R as an elective..
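If you want to add the column model$Default to modelSTAN, note that scale() returns a matrix, so convert it to a data frame first. Then you can do it like this
# scale() returns a matrix; convert it before adding a non-numeric column
modelSTAN <- as.data.frame(modelSTAN)
# assign the column directly
modelSTAN$Default <- model$Default
# or use cbind for columns (rbind is for rows)
modelSTAN <- cbind(modelSTAN, Default = model$Default)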
However, you don't need to remove it at all. Here's an alternative:
modelSTAN <- model
## get index of response, here named Default (names are case-sensitive)
resp <- which(names(modelSTAN) == "Default")
## standardize all the non-response columns
means <- colMeans(modelSTAN[-resp])
sds <- apply(modelSTAN[-resp], 2, sd)
modelSTAN[-resp] <- scale(modelSTAN[-resp], center = means, scale = sds)
If you're interested in dplyr:
library(dplyr)
modelSTAN <- model %>%
  mutate(across(-all_of("Default"), ~ as.numeric(scale(.x))))
Note that in the dplyr version I didn't bother saving the original means and SDs; you should still do that if you want to back-transform later. By default, scale centers each column on its mean and divides by its sd.
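To illustrate why saving the means and SDs matters, here is a small sketch of standardizing and then back-transforming. The data frame is made up for the example, and sweep is just one of several ways to undo the scaling:

```r
# Hypothetical data: two numeric predictors and a categorical response
model <- data.frame(
  a = c(-2, 0, 4, 6),
  b = c(10, 20, 30, 60),
  Default = factor(c("no", "yes", "no", "yes"))
)

resp <- which(names(model) == "Default")

# Save the parameters used for standardizing
means <- colMeans(model[-resp])
sds <- sapply(model[-resp], sd)

modelSTAN <- model
modelSTAN[-resp] <- scale(model[-resp], center = means, scale = sds)

# Back-transform: multiply by the saved SDs, then add the saved means
restored <- modelSTAN
scaled_back <- sweep(as.matrix(modelSTAN[-resp]), 2, sds, "*")
restored[-resp] <- sweep(scaled_back, 2, means, "+")
```

The same saved means and sds can also be passed to scale() to put new data on the scale of the original data.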

Kruskal-Wallis test on multiple columns at once

This maybe sounds a bit simple, but I cannot get the answer.
I have a dataset in R that has 26 samples in rows and many variables (>20) in columns. Some of them are categorical, so what I need to do is to carry out a Kruskal Wallis test for each numerical variable depending on each categorical one, so I do:
env_fact <- read.csv("environ_facts.csv")
kruskal.test(env_fact-1 ~ Categorical_var-1, data=env_fact)
But with this I can only do the test to the numerical variables one by one, which is tiresome.
Is there any way to carry all the Kruskal-Wallis tests for all numerical variables at once?
I can repeat it by each categorical variable, since I only have 4, but for the numerical one I have more than 20!!
Thanks a lot
Since I do not have a sample of the data set, I can only answer "theoretically".
First, you need to recognize which are the numeric columns.
The way to do this is the following:
library(tibble)
df = tibble(x = rnorm(10), y = rnorm(10), z = "a", w = rnorm(10))
# logical vector: TRUE for each numeric column
NumericCols = sapply(df, function(x) is.numeric(x))
df_Numeric = df[, NumericCols]
Now you take the numeric part of df, df_Numeric, and apply your function blabla on each column at a time:
sapply(df_Numeric, function(x) blabla(x))
Thank you very much Omry.
Working with a colleague, we reached a different (still incomplete) solution from yours:
my.variables <- colnames(env_fact)
for(i in 1:length(my.variables)) {
if(my.variables[i] == 'Categorical_var') {
next
} else {
kruskal.test(env_fact[,i], env_fact$Categorical_var)
}
}
However, we haven't been able to print on screen/get an output with the results for each of 'my.variables' by the 'Categorical_var' analyzed. We could only get a result for all the 'my.variables' as a whole.
Any idea??
Thank you very much
P.S.: My data looks like this:
Sample,Nunatak,Slope,Altitude,Depth,Fluoride,Acetate,Formiate,Chloride,Nitrate
m4,1,1,1,1,0.044,0.884,0.522,0.198,0.021
m6,1,1,1,2,0.059,0.852,0.733,0.664,0.038
m7,1,1,1,3,0.082,0.339,1.496,0.592,0.034
m8,1,1,2,1,0.112,0.812,2.709,0.357,0.014
m10,1,1,2,2,0.088,0.768,2.535,0.379,0
m11,1,1,3,1,0.101,0.336,4.504,0.229,0
m13,1,1,3,2,0.092,0.681,1.862,0.671,0.018
m14,1,2,2,1,0.12,1.055,3.018,0.771,0
m16,1,2,2,2,0.102,1.019,1.679,1.435,0
m17,1,2,2,3,0.26,0.631,0.505,0.574,0.008
Where Nunatak, Slope, Altitude and Depth are categorical and the rest are numerical. Hope this helps
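One likely reason nothing appears on screen is that R does not auto-print values inside a for loop, so each result has to be printed explicitly or stored. A minimal sketch using the sample rows above and base R only; Altitude is picked as the grouping variable here purely because Nunatak has a single level in this excerpt:

```r
env_fact <- read.csv(text = "Sample,Nunatak,Slope,Altitude,Depth,Fluoride,Acetate,Formiate,Chloride,Nitrate
m4,1,1,1,1,0.044,0.884,0.522,0.198,0.021
m6,1,1,1,2,0.059,0.852,0.733,0.664,0.038
m7,1,1,1,3,0.082,0.339,1.496,0.592,0.034
m8,1,1,2,1,0.112,0.812,2.709,0.357,0.014
m10,1,1,2,2,0.088,0.768,2.535,0.379,0
m11,1,1,3,1,0.101,0.336,4.504,0.229,0
m13,1,1,3,2,0.092,0.681,1.862,0.671,0.018
m14,1,2,2,1,0.12,1.055,3.018,0.771,0
m16,1,2,2,2,0.102,1.019,1.679,1.435,0
m17,1,2,2,3,0.26,0.631,0.505,0.574,0.008")

categorical <- c("Nunatak", "Slope", "Altitude", "Depth")
numeric_vars <- setdiff(names(env_fact), c("Sample", categorical))

# One Kruskal-Wallis test per numeric column, stored in a named list
results <- lapply(numeric_vars, function(v) {
  kruskal.test(env_fact[[v]], factor(env_fact$Altitude))
})
names(results) <- numeric_vars

# Collect the p-values into a named vector and print them
p_values <- sapply(results, function(res) res$p.value)
print(p_values)
```

Printing a single element, for example print(results[["Fluoride"]]), shows the full test output for that variable. Wrapping the lapply call in a loop over the categorical names would cover all four grouping variables.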

After Lasso: Store remaining variables as new dataframe (using R)

First of all, thank you very much for your interest and time. My question (using R):
To predict the yvar, I have run a lasso regression which reduced the set of xvariables from 736 to 30.
lasso.mod =glmnet(x,y,alpha=1)
cv.out =cv.glmnet (x,y,alpha=1)
lasso.bestlam =cv.out$lambda.min
tmp_coef = coef(cv.out,s=lasso.bestlam)
varnames = data.frame(name = tmp_coef@Dimnames[[1]][tmp_coef@i])
mylist = list(name = tmp_coef@Dimnames[[1]][tmp_coef@i])
Hence, I have the remaining variable names as a data frame and also as a list.
How is it possible to create a new data frame which has these remaining 30 variables and their observations in it? In other words: How can I get a subset of my original data which does not contain 737 variables but only 31?
I think this should be quite easy, however I have been spending more than two hours and it never worked...
Best wishes,
Thomas
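Cannot test your solution as I do not have the data, but this should do the trick:
# the @i slot is zero-based, hence the + 1
varnames <- tmp_coef@Dimnames[[1]][tmp_coef@i + 1]
# drop the intercept, which is not a column of x
varnames <- setdiff(varnames, "(Intercept)")
as.data.frame(cbind(x[, varnames], y))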
Your tmp_coef@Dimnames[[1]][tmp_coef@i + 1] vector (note the + 1, because the @i slot of a sparse matrix is zero-based) contains the names of the remaining variables, but also contains "(Intercept)" as the first item. If you discard it with [-1], you can extract the columns:
x <- as.data.frame(x[, tmp_coef@Dimnames[[1]][tmp_coef@i + 1][-1]])
Even simpler, you can use the indices in tmp_coef@i directly: once the intercept's entry is dropped, each zero-based row index equals the matching column number of x.
x <- as.data.frame(x[, tmp_coef@i[-1]])
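The indexing can be checked without fitting a model. The sketch below builds a small sparse column by hand as a stand-in for the output of coef(cv.out, s = ...); the coefficient values and the predictor matrix are hypothetical, and it assumes the Matrix package is installed:

```r
library(Matrix)

# Hand-made stand-in for a lasso coefficient column:
# the intercept, x2 and x4 are non-zero, x1 and x3 were shrunk to zero
tmp_coef <- sparseMatrix(
  i = c(1, 3, 5), j = c(1, 1, 1), x = c(0.5, 1.2, -0.7),
  dims = c(5, 1),
  dimnames = list(c("(Intercept)", "x1", "x2", "x3", "x4"), "s1")
)

# Hypothetical predictor matrix with matching column names
x <- matrix(1:12, nrow = 3, dimnames = list(NULL, c("x1", "x2", "x3", "x4")))

tmp_coef@i                                   # zero-based row indices
kept_names <- tmp_coef@Dimnames[[1]][tmp_coef@i + 1][-1]

by_name <- as.data.frame(x[, kept_names])
by_index <- as.data.frame(x[, tmp_coef@i[-1]])
```

Both routes select the same columns; selecting by name is the safer choice if the column order of x might differ from the order used when fitting.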

Turn R output into a dataframe of multiple vectors

The standard bit of code below from the VARS package forecasts values for several variables.
What I want to do is to take those values and turn them into a data frame so I can produce time series graphs.
> predict(var4, n.ahead=12, ci=0.95)
This question is highly vague. I suppose you're looking for:
x <- predict(var4, n.ahead=12, ci=0.95)
data.frame(n = rep(names(x$fcst), each = nrow(x$fcst[[1]])), do.call(rbind, x$fcst))
By the way: The package's name is vars, not VARS.
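To show what that reshaping produces without needing a fitted model, here is a sketch on a hand-made list that mimics the fcst component (one matrix per variable with columns fcst, lower, upper and CI). The variable names and numbers are invented, and the horizon column is an addition that is convenient for plotting:

```r
# Hypothetical stand-in for predict(var4, ...)$fcst
make_fcst <- function(start, steps = 3) {
  point <- start + seq_len(steps)
  cbind(fcst = point, lower = point - 0.5, upper = point + 0.5, CI = 0.5)
}
fcst <- list(gdp = make_fcst(100), infl = make_fcst(2))

steps <- nrow(fcst[[1]])

# Stack the per-variable matrices and label each row
long <- data.frame(
  variable = rep(names(fcst), each = steps),
  horizon = rep(seq_len(steps), times = length(fcst)),
  do.call(rbind, fcst)
)
```

The resulting long data frame has one row per variable and forecast step, which is the shape most plotting functions expect for drawing one line per variable.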
