My aim is to compare differences in the levels of variables across different versions of a dataset. In my code, I first convert everything to strings so that I can compare several kinds of variables (numeric, categorical, etc.). However, the code fails and does not give the desired result, which would be a data frame that contains each variable and its possible differences (as a list). Any help is appreciated!
Thank you.
data1 <- lapply(?, as.character)
data2 <- lapply(?, as.character)
check_diffs <- function(vars, data1, data2) {
levels1 <- unique(data1$vars)
levels2 <- unique(data2$vars)
diff <- ifelse(length(union(setdiff(levels1,levels2), setdiff(levels2,levels1)))>0, list(union(setdiff(levels1,levels2), setdiff(levels2,levels1))), NA)
return(data.frame(var = vars, diffs = I(diff)))
}
diffs_df <- map_dfr(vars, ~check_diffs(.x, data1 = ?, data2 = ?))
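The issue with the code was that vars holds a string, so data1$vars looks for a column literally named "vars". The column must instead be retrieved with get(vars, dataX) (or, equivalently, dataX[[vars]]). With that change, the code gives the differences in coding between both data sets.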
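A minimal sketch of the corrected approach, assuming two small toy data frames in place of the real dataset versions (which are not shown in the question). The toy data and the helper name level_diffs are hypothetical, and the data frame is assembled in one step with base R rather than with map_dfr:

```r
# Hypothetical toy data standing in for the two dataset versions
data1 <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"))
data2 <- data.frame(a = c(1, 2, 4), b = c("x", "y", "z"))

# Convert every column to character, keeping the data frame structure
data1[] <- lapply(data1, as.character)
data2[] <- lapply(data2, as.character)

# Levels that occur in only one of the two versions (NA if there are none).
# The column is looked up by its name with [[ ]], not with $.
level_diffs <- function(var, data1, data2) {
  levels1 <- unique(data1[[var]])
  levels2 <- unique(data2[[var]])
  d <- union(setdiff(levels1, levels2), setdiff(levels2, levels1))
  if (length(d) > 0) d else NA
}

vars <- c("a", "b")
diff_list <- lapply(vars, level_diffs, data1 = data1, data2 = data2)

# One row per variable, with the differences stored in a list column
diffs_df <- data.frame(var = vars, diffs = I(diff_list))
print(diffs_df)
```

Here variable a differs between the versions and b does not, so the list column holds the differing levels for a and NA for b.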
I have a dataset with 61 columns (60 explanatory variables and 1 response variable).
All the explanatory variables are numerical, and the response is categorical (Default). Some of the explanatory variables have negative values (financial data), and therefore it seems more sensible to standardize rather than normalize. However, when standardizing using the "apply" function, I have to remove the response variable first, so I do:
model <- read.table......
modelwithnoresponse <- model
modelwithnoresponse$Default <- NULL
means <- apply(modelwithnoresponse, 2, mean)
standarddeviations <- apply(modelwithnoresponse,2,sd)
modelSTAN <- scale(modelwithnoresponse,center=means,scale=standarddeviations)
So far so good, the data is standardized. However, now I would like to add the response variable back to "modelSTAN". I've seen some posts on dplyr, merge functions and rbind, but I couldn't quite get them to work so that the response would simply be added back as the last column of "modelSTAN".
Does anyone have a good solution to this, or maybe another workaround to standardize it without removing the response variable first?
I'm quite new to R, as I'm a finance student and took R as an elective..
If you want to add the column model$Default to modelSTAN, note that scale() returns a matrix, so convert it to a data frame first. Then you can do it like this
modelSTAN <- as.data.frame(modelSTAN)
# assign the column directly
modelSTAN$Default <- model$Default
# or use cbind for columns (rbind is for rows)
modelSTAN <- cbind(modelSTAN, Default = model$Default)
However, you don't need to remove it at all. Here's an alternative:
modelSTAN <- model
## get index of response, here named Default
resp <- which(names(modelSTAN) == "Default")
## standardize all the non-response columns
means <- colMeans(modelSTAN[-resp])
sds <- apply(modelSTAN[-resp], 2, sd)
modelSTAN[-resp] <- scale(modelSTAN[-resp], center = means, scale = sds)
If you're interested in dplyr:
library(dplyr)
modelSTAN <- model %>%
  # scale() returns a one-column matrix, so convert back to a plain numeric vector
  mutate(across(-all_of("Default"), ~ as.numeric(scale(.x))))
Note: in the dplyr version I didn't bother saving the original means and SDs; you should still do that if you want to back-transform later. By default, scale will use each column's mean and sd.
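To illustrate why it can be worth keeping the means and SDs, here is a small sketch of standardizing in place and then back-transforming. The toy data frame is hypothetical and only stands in for the real 61-column dataset:

```r
# Hypothetical toy data: two numeric predictors and a categorical response
model <- data.frame(
  x1 = c(-2, 0, 4, 6),
  x2 = c(10, 20, 30, 40),
  Default = c("yes", "no", "no", "yes")
)

resp <- which(names(model) == "Default")

# Save the column means and standard deviations before scaling
means <- colMeans(model[-resp])
sds <- apply(model[-resp], 2, sd)

# Standardize only the non-response columns; Default stays untouched
modelSTAN <- model
modelSTAN[-resp] <- scale(model[-resp], center = means, scale = sds)

# Back-transform: multiply each column by its sd, then add its mean
restored <- modelSTAN
scaled_mat <- as.matrix(modelSTAN[-resp])
restored[-resp] <- sweep(sweep(scaled_mat, 2, sds, `*`), 2, means, `+`)
```

After this, restored should match the original model up to floating-point error, and the Default column is never removed at any point.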
This may sound a bit simple, but I cannot find the answer.
I have a dataset in R that has 26 samples in rows and many variables (>20) in columns. Some of them are categorical, so what I need to do is to carry out a Kruskal-Wallis test for each numerical variable against each categorical one, so I do:
env_fact <- read.csv("environ_facts.csv")
kruskal.test(env_fact-1 ~ Categorical_var-1, data=env_fact)
But with this I can only run the test on the numerical variables one by one, which is tiresome.
Is there any way to carry all the Kruskal-Wallis tests for all numerical variables at once?
I can repeat it by each categorical variable, since I only have 4, but for the numerical one I have more than 20!!
Thanks a lot
Since I do not have a sample of the data set, I can only answer "theoretically".
First, you need to recognize which are the numeric columns.
The way to do this is the following:
library(tibble)
df = tibble(x = rnorm(10), y = rnorm(10), z = "a", w = rnorm(10))
NumericCols = sapply(df, is.numeric)
df_Numeric = df[, NumericCols]
Now you take the numeric part of df, df_Numeric, and apply your function blabla on each column at a time:
sapply(df_Numeric, function(x) blabla(x))
Thank you very much Omry.
Working with a colleague, we reached a different, though still incomplete, solution:
my.variables <- colnames(env_fact)
for(i in 1:length(my.variables)) {
if(my.variables[i] == 'Categorical_var') {
next
} else {
kruskal.test(env_fact[,i], env_fact$Categorical_var)
}
}
However, we haven't been able to print on screen/get an output with the results for each of 'my.variables' by the 'Categorical_var' analyzed. We could only get a result for all the 'my.variables' as a whole.
Any idea??
Thank you very much
P.S.: My data looks like this:
Sample,Nunatak,Slope,Altitude,Depth,Fluoride,Acetate,Formiate,Chloride,Nitrate
m4,1,1,1,1,0.044,0.884,0.522,0.198,0.021
m6,1,1,1,2,0.059,0.852,0.733,0.664,0.038
m7,1,1,1,3,0.082,0.339,1.496,0.592,0.034
m8,1,1,2,1,0.112,0.812,2.709,0.357,0.014
m10,1,1,2,2,0.088,0.768,2.535,0.379,0
m11,1,1,3,1,0.101,0.336,4.504,0.229,0
m13,1,1,3,2,0.092,0.681,1.862,0.671,0.018
m14,1,2,2,1,0.12,1.055,3.018,0.771,0
m16,1,2,2,2,0.102,1.019,1.679,1.435,0
m17,1,2,2,3,0.26,0.631,0.505,0.574,0.008
Where Nunatak, Slope, Altitude and Depth are categorical and the rest are numerical. Hope this helps
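One possible way to get a result per variable, sketched with the ten sample rows above: inside a for loop the value of kruskal.test() is not printed automatically, so the call either has to be wrapped in print() or the results have to be stored. The sketch below stores them in a named list. Altitude is picked as the grouping variable only as an example, and the names cat_var, num_vars, results and p_values are made up for illustration:

```r
# The sample rows from the post, read from inline text
env_fact <- read.csv(text = "Sample,Nunatak,Slope,Altitude,Depth,Fluoride,Acetate,Formiate,Chloride,Nitrate
m4,1,1,1,1,0.044,0.884,0.522,0.198,0.021
m6,1,1,1,2,0.059,0.852,0.733,0.664,0.038
m7,1,1,1,3,0.082,0.339,1.496,0.592,0.034
m8,1,1,2,1,0.112,0.812,2.709,0.357,0.014
m10,1,1,2,2,0.088,0.768,2.535,0.379,0
m11,1,1,3,1,0.101,0.336,4.504,0.229,0
m13,1,1,3,2,0.092,0.681,1.862,0.671,0.018
m14,1,2,2,1,0.12,1.055,3.018,0.771,0
m16,1,2,2,2,0.102,1.019,1.679,1.435,0
m17,1,2,2,3,0.26,0.631,0.505,0.574,0.008")

cat_var <- "Altitude"  # example grouping variable
num_vars <- c("Fluoride", "Acetate", "Formiate", "Chloride", "Nitrate")

# Run one Kruskal-Wallis test per numerical variable and keep every result
results <- lapply(num_vars, function(v) {
  kruskal.test(env_fact[[v]], factor(env_fact[[cat_var]]))
})
names(results) <- num_vars

# Full output for a single variable
print(results[["Fluoride"]])

# Compact overview: one p-value per numerical variable
p_values <- sapply(results, function(r) r$p.value)
print(p_values)
```

Repeating this for the other categorical variables only requires changing cat_var. With this many tests, a multiple-testing correction such as p.adjust() may be worth considering.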
First of all, thank you very much for your interest and time. My question (using R):
To predict the yvar, I have run a lasso regression which reduced the set of xvariables from 736 to 30.
lasso.mod =glmnet(x,y,alpha=1)
cv.out =cv.glmnet (x,y,alpha=1)
lasso.bestlam =cv.out$lambda.min
tmp_coef = coef(cv.out,s=lasso.bestlam)
varnames = data.frame(name = tmp_coef@Dimnames[[1]][tmp_coef@i])
mylist = list(name = tmp_coef@Dimnames[[1]][tmp_coef@i])
Hence, I have the remaining variable names as a data frame and also as a list.
How is it possible to create a new data frame which has these remaining 30 variables and their observations in it? In other words: How can I get a subset of my original data which does not contain 737 variables but only 31?
I think this should be quite easy; however, I have spent more than two hours on it and it never worked...
Best wishes,
Thomas
Cannot test your solution as I do not have the data, but this should do the trick:
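# @i is 0-based, hence the + 1; [-1] drops "(Intercept)" (assumed to be the first non-zero entry)
varnames <- tmp_coef@Dimnames[[1]][tmp_coef@i + 1][-1]
as.data.frame(cbind(x[, varnames], y))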
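Your tmp_coef@Dimnames[[1]][tmp_coef@i] expression is meant to hold the names of the remaining variables. Note that the i slot of a sparse matrix is 0-based, so the lookup needs tmp_coef@i + 1; the result then also contains "(Intercept)" as the first item (assuming the intercept is non-zero). If you discard it with [-1], you can extract the columns:
x <- as.data.frame(x[, tmp_coef@Dimnames[[1]][tmp_coef@i + 1][-1]])
Even simpler, you can use the indices in tmp_coef@i directly, since after dropping the intercept's 0 they match the column positions of x:
x <- as.data.frame(x[, tmp_coef@i[-1]])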
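To make the indexing concrete without needing the original data, here is a small sketch that builds a stand-in for the coefficient object with the Matrix package. The toy x and y, the chosen non-zero coefficients and the names keep and subset_df are all hypothetical; only the shape (a sparse one-column matrix whose first row is "(Intercept)") is meant to mirror what coef() returns for a glmnet fit:

```r
library(Matrix)

set.seed(1)
# Hypothetical predictors and response
x <- matrix(rnorm(20), nrow = 5,
            dimnames = list(NULL, c("V1", "V2", "V3", "V4")))
y <- rnorm(5)

# Stand-in for coef(cv.out, s = lasso.bestlam): a sparse column where only
# the intercept, V2 and V4 are non-zero
tmp_coef <- sparseMatrix(
  i = c(1, 3, 5), j = c(1, 1, 1), x = c(0.5, 1.2, -0.7),
  dims = c(5, 1),
  dimnames = list(c("(Intercept)", colnames(x)), "s1")
)

# The i slot stores 0-based row indices of the non-zero entries
print(tmp_coef@i)

# Names of the selected variables, with the intercept removed by name
keep <- setdiff(rownames(tmp_coef)[tmp_coef@i + 1], "(Intercept)")

# Subset of the original data: selected predictors plus the response
subset_df <- data.frame(x[, keep, drop = FALSE], y = y)
print(names(subset_df))
```

Removing the intercept by name rather than by position avoids relying on it being non-zero.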
The standard bit of code below from the VARS package forecasts values for several variables.
What I want to do is to take those values and turn them into a data frame so I can produce time series graphs.
> predict(var4, n.ahead=12, ci=0.95)
This question is highly vague. I suppose you're looking for:
x <- predict(var4, n.ahead=12, ci=0.95)
data.frame(n = rep(names(x$fcst), each = nrow(x$fcst[[1]])), do.call(rbind, x$fcst))
By the way: The package's name is vars, not VARS.
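Since var4 is not available here, the reshaping can be tried on a mock object. The sketch below assumes the forecast element is a named list with one matrix per variable and the columns fcst, lower, upper and CI. The variable names gdp and cpi, the numbers, the helper make_fcst and the extra step column are invented for illustration:

```r
# Mock of a forecast matrix for one variable, 12 steps ahead (made-up values)
make_fcst <- function(offset) {
  cbind(
    fcst  = offset + 1:12,
    lower = offset + 1:12 - 1,
    upper = offset + 1:12 + 1,
    CI    = rep(1, 12)
  )
}

# Mock of the predict() result: only the fcst element is needed here
x <- list(fcst = list(gdp = make_fcst(0), cpi = make_fcst(100)))

n_steps <- nrow(x$fcst[[1]])

# Long-format data frame: one row per variable and forecast step
fc_df <- data.frame(
  variable = rep(names(x$fcst), each = n_steps),
  step = rep(seq_len(n_steps), times = length(x$fcst)),
  do.call(rbind, x$fcst)
)
head(fc_df)
```

A long data frame like this can then be split by variable or passed to a plotting function to draw each forecast with its interval.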