This may sound a bit simple, but I cannot find the answer.
I have a dataset in R with 26 samples in rows and many variables (>20) in columns. Some of them are categorical, and I need to carry out a Kruskal-Wallis test of each numerical variable against each categorical one, so I do:
env_fact <- read.csv("environ_facts.csv")
kruskal.test(Numerical_var ~ Categorical_var, data = env_fact)
But with this I can only run the test on the numerical variables one by one, which is tiresome.
Is there any way to carry out all the Kruskal-Wallis tests for all numerical variables at once?
I can repeat it for each categorical variable, since I only have 4, but I have more than 20 numerical ones!
Thanks a lot
Since I do not have a sample of the data set, I can only answer "theoretically".
First, you need to identify which columns are numeric.
One way to do this is the following:
library(tibble)   # tibble() comes from the tibble package; a plain data.frame() works too

df = tibble(x = rnorm(10), y = rnorm(10), z = "a", w = rnorm(10))
NumericCols = sapply(df, function(x) is.numeric(x))
df_Numeric = df[, NumericCols]
Now you take the numeric part of df, df_Numeric, and apply your function blabla to one column at a time:
sapply(df_Numeric, function(x) blabla(x))
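For the question above, a minimal sketch of this (assuming env_fact is the data frame from the question and Categorical_var is one of its categorical columns) might be:
# Kruskal-Wallis test of every numeric column against one categorical variable;
# 'Categorical_var' is a placeholder for one of the four categorical columns
num_cols <- sapply(env_fact, is.numeric)
kw_results <- lapply(env_fact[, num_cols, drop = FALSE],
                     function(x) kruskal.test(x, factor(env_fact$Categorical_var)))
kw_results   # one test result per numeric variable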
Thank you very much Omry.
Working with a colleague, we reached a different (though still incomplete) solution from yours:
my.variables <- colnames(env_fact)
for(i in 1:length(my.variables)) {
  if(my.variables[i] == 'Categorical_var') {
    next
  } else {
    kruskal.test(env_fact[,i], env_fact$Categorical_var)
  }
}
However, we haven't been able to print to screen, or otherwise get as output, the results for each of 'my.variables' by the 'Categorical_var' analyzed. We could only get a result for all the 'my.variables' as a whole.
Any ideas?
Thank you very much
P.S.: My data looks like this:
Sample,Nunatak,Slope,Altitude,Depth,Fluoride,Acetate,Formiate,Chloride,Nitrate
m4,1,1,1,1,0.044,0.884,0.522,0.198,0.021
m6,1,1,1,2,0.059,0.852,0.733,0.664,0.038
m7,1,1,1,3,0.082,0.339,1.496,0.592,0.034
m8,1,1,2,1,0.112,0.812,2.709,0.357,0.014
m10,1,1,2,2,0.088,0.768,2.535,0.379,0
m11,1,1,3,1,0.101,0.336,4.504,0.229,0
m13,1,1,3,2,0.092,0.681,1.862,0.671,0.018
m14,1,2,2,1,0.12,1.055,3.018,0.771,0
m16,1,2,2,2,0.102,1.019,1.679,1.435,0
m17,1,2,2,3,0.26,0.631,0.505,0.574,0.008
Where Nunatak, Slope, Altitude and Depth are categorical and the rest are numerical. Hope this helps
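For what it's worth, results are not auto-printed inside a for loop, so they need to be printed or collected explicitly. A minimal sketch using the column names from the sample data above (with Nunatak as the example categorical variable):
cat_vars <- c("Nunatak", "Slope", "Altitude", "Depth")          # categorical columns
num_vars <- setdiff(colnames(env_fact), c("Sample", cat_vars))  # numerical columns
kw_list <- list()
for (v in num_vars) {
  kw_list[[v]] <- kruskal.test(env_fact[[v]], factor(env_fact$Nunatak))
}
print(kw_list)   # one Kruskal-Wallis result per numerical variable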
Related
My aim is to compare differences in the levels of variables that might occur across different versions of a dataset. In my code, I first generate strings so that I can compare several variables (numeric, categorical, etc.). However, the code fails and does not give the desired result, which would be a data frame consisting of the variable and the possible differences (in a list). Any help is appreciated!
Thank you.
data1 <- lapply(?, as.character)
data2 <- lapply(?, as.character)

check_diffs <- function(vars, data1, data2) {
  levels1 <- unique(data1$vars)
  levels2 <- unique(data2$vars)
  diff <- ifelse(length(union(setdiff(levels1, levels2), setdiff(levels2, levels1))) > 0,
                 list(union(setdiff(levels1, levels2), setdiff(levels2, levels1))), NA)
  return(data.frame(var = vars, diffs = I(diff)))
}
diffs_df <- map_dfr(vars, ~check_diffs(.x, data1 = ?, data2 = ?))
The issue with the code was that vars gives a string, which must be called with get(vars, dataX). Then, the code gives the differences in coding between both data sets.
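A sketch of that fix (assuming vars is a character vector of the column names to compare, and data1/data2 are the two dataset versions converted as above):
library(purrr)   # for map_dfr

check_diffs <- function(vars, data1, data2) {
  levels1 <- unique(get(vars, data1))   # look the column up by its name
  levels2 <- unique(get(vars, data2))
  diff <- ifelse(length(union(setdiff(levels1, levels2), setdiff(levels2, levels1))) > 0,
                 list(union(setdiff(levels1, levels2), setdiff(levels2, levels1))), NA)
  return(data.frame(var = vars, diffs = I(diff)))
}

diffs_df <- map_dfr(vars, ~check_diffs(.x, data1 = data1, data2 = data2))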
I want to write a function that dynamically uses different correlation methods depending on the scale of measure of the feature (continuous, dichotomous, ordinal). The label is always continuous. My idea was to use the apply() function: iterate over every feature (i.e. column), check its scale of measure (numeric, factor with two levels, factor with more than two levels), and then use the appropriate correlation function. Unfortunately my code seems to convert every feature into a character vector, and as a consequence the condition in the if statement is false for every column. I don't know why my code is doing this. How can I prevent my code from converting my features to character vectors?
set.seed(42)
foo <- sample(c("x", "y"), 200, replace = T, prob = c(0.7, 0.3))
bar <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.5,0.05,0.1,0.1,0.25))
y <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.25,0.1,0.1,0.05,0.5))
data <- data.frame(foo,bar,y)
features <- data[, !names(data) %in% 'y']
dyn.corr <- function(x, y){
  # print out the structure of every column
  print(str(x))
  # if the feature is numeric and has more than two outcomes, use corr.test
  if(is.numeric(x) & length(unique(x)) > 2){
    result <- corr.test(x, y)[['r']]
  } else {
    result <- "else"
  }
}
result <- apply(features,2,dyn.corr,y)
apply is built for matrices. When you apply it to a data frame, the first thing that happens is that your data frame is coerced to a matrix. A matrix can only have one data type, so all columns of your data are converted to the most general type among them.
Use sapply or lapply to work with columns of a data frame.
This should work fine (I tried to test, but I don't know what package to load to get the corr.test function.)
result <- sapply(features, dyn.corr, y)
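A quick way to see the coercion in action (a small illustration, not from the original post):
apply(features, 2, class)    # "character" "character"  -- the whole frame was coerced first
sapply(features, class)      # e.g. "character" "numeric" -- column types are preserved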
I have a data.frame of 373127 obs. of 193 variables. Some variables are factors, and I want to use dummyVars() to separate each factor into its own set of columns. I then want to merge the separate dummy-variable columns back into my original data.frame, so I thought I could do the whole thing with apply, but something is not working and I can't figure out what it is.
Sample:
dat_final <- apply(dummies.var1, 1, function(x) {
  dummies.var1 <- dummyVars(~ dat1$factor.var1 - 1, data = dat1)
})
Thanks!
You can do the following, which will create a new data frame, trsf, but you could always reassign the result back to the original df:
library(caret)
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c('male', 'female', 'female', 'male', 'female'),
  mood = c('happy', 'sad', 'happy', 'sad', 'happy'),
  outcome = c(1, 1, 0, 0, 0))
# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)
See more here
The real answer is: don't do that. It's almost never necessary.
You could do something like this:
# Example data
df = data.frame(x = rep(LETTERS, each = 3), y = rnorm(78))
df = cbind(df, model.matrix(~df$x - 1))
However, as pointed out by #user30257 it is hard to see why you want to do it. In general, modeling tools in R don't need dummy vars, but deal with factors directly.
Creating dummy variables can be very important in feature selection, which it sounds like the original poster was doing.
For instance, suppose you have a feature that contains duplicated information (i.e., one of its levels corresponds to something measured elsewhere). You can determine this is the case very simply by comparing the dummy variables for these features using a variety of dissimilarity measures.
My preference is to use sparse.model.matrix and cBind, both from the Matrix package.
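For example, a minimal sketch reusing the customers data frame from the caret answer above (sparse.model.matrix lives in the Matrix package):
library(Matrix)
# Sparse dummy coding of the two factor columns; -1 drops the intercept
sp <- sparse.model.matrix(~ gender + mood - 1, data = customers)
sp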
I have a large dataset: each row is a sample and each column is a feature. The first column, however, is filled with class factors (which here are 1, 2, 3, 4, 5). My aim is to do a Wilcoxon comparison between all the classes (so for every combination: 1,2; 1,3; 1,4; 1,5; 2,3; ...) for all the features. This is the code I wrote to do this (X is the data frame):
facs <- length(levels(factor(X[,1])))
v <- matrix(as.character(combn(facs, 2)), ncol = facs*2)
vecBoh <- data.frame(row.names = paste(v[1,], "-", v[2,]))
for(i in 2:ncol(X)) {
  WilF <- function(coppie) wilcox.test(X[,i] ~ Class, data = X, subset = Class %in% coppie)
  vecBoh[,i-1] <- as.numeric(sapply(apply(v, 2, WilF), "[", 3))
}
It works but it's extremely slow. I have the feeling there's a quicker way to do this. Does anyone have a clue?
You can use the pairwise.wilcox.test function for pairwise comparisons between groups, and I think reading about multiple comparisons beforehand can help here.
lapply(df[,-1], function(x)
  pairwise.wilcox.test(x, df$Class, p.adjust.method = "none"))
Where df is your data.frame
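If only the p-value matrices are needed, they can be pulled out afterwards; a small sketch (res is simply the list produced above, assigned to a name):
res <- lapply(df[,-1], function(x)
  pairwise.wilcox.test(x, df$Class, p.adjust.method = "none"))
pvals <- lapply(res, function(w) w$p.value)   # one matrix of pairwise p-values per feature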
This is a follow-up question to my earlier post (covariance matrix by group) regarding a large data set. I have 6 variables (HML, RML, FML, TML, HFD, and BIB) and I am trying to create group-specific covariance matrices for them (based on the variable Group). However, I have a lot of missing data in these 6 variables (not in Group), and I need to be able to use that data in the analysis - removing or omitting by row is not a good option for this research.
I narrowed the data set down into a matrix of the actual variables of interest with:
MMatrix = MMatrix2[1:2187, 4:10]
This worked fine for calculating an overall covariance matrix with:
cov(MMatrix, use = "pairwise.complete.obs", method = "pearson")
So to get this to list the covariance matrices by group, I turned the original data matrix into a data frame (so I could use the $ indicator) with:
CovDataM <- as.data.frame(MMatrix)
I then used the following suggested code to get covariances by group, but it keeps returning NULL:
cov.list <- lapply(unique(CovDataM$group), function(x) cov(CovDataM[CovDataM$group == x, -1]))
I figured this was because of my NAs, so I tried adding use = "pairwise.complete.obs" as well as use = "na.or.complete" (when desperate) to the end of the code, but it still only returned NULLs. I read somewhere that "pairwise.complete.obs" can only be used if method = "pearson", but adding that at the end didn't make a difference either. I need to get covariance matrices of these variables by group, with all the available data included if possible, and I am thoroughly stuck.
Here is an example that should get you going:
# Create some fake data
m <- matrix(runif(6000), ncol = 6,
            dimnames = list(NULL, c('HML', 'RML', 'FML', 'TML', 'HFD', 'BIB')))
# Insert random NAs
m[sample(6000, 500)] <- NA
# Create a factor indicating group levels
grp <- gl(4, 250, labels=paste('group', 1:4))
# Covariance matrices by group
covmats <- by(m, grp, cov, use='pairwise')
The resulting object, covmats, is a list with four elements (in this case), which correspond to the covariance matrices for each of the four groups.
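Individual matrices can then be pulled out by group name, for example:
covmats[['group 1']]   # covariance matrix for the first group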
Your problem is that lapply is treating your list oddly. If you run this code (which I hope is pretty much analogous to yours):
CovData <- matrix(1:75, 15)
CovData[3,4] <- NA
CovData[1,3] <- NA
CovData[4,2] <- NA
CovDataM <- data.frame(CovData, "group" = c(rep("a",5),rep("b",5),rep("c",5)))
colnames(CovDataM) <- c("a","b","c","d","e", "group")
lapply(unique(as.character(CovDataM$group)), function(x) print(x))
You can see that lapply is evaluating the list in a different manner than you intend. The NAs don't appear to be the problem. When I run:
by(CovDataM[ ,1:5], CovDataM$group, cov, use = "pairwise.complete.obs", method = "pearson")
It seems to work fine. Hopefully that generalizes to your problem.
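Applied to the original data, that would be something along the lines of (assuming CovDataM contains the six variables of interest and a grouping column named group, as in the code above):
by(CovDataM[, c("HML", "RML", "FML", "TML", "HFD", "BIB")], CovDataM$group,
   cov, use = "pairwise.complete.obs", method = "pearson")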