R: Comparing different versions of data in terms of levels

My aim is to compare the levels of variables across different versions of a dataset and find any differences. In my code, I first convert everything to strings so that I can compare variables of several types (numeric, categorical, etc.). However, the code fails and does not give the desired result, which would be a data frame containing each variable and its possible differences (as a list). Any help is appreciated!
Thank you.
data1 <- lapply(?, as.character)
data2 <- lapply(?, as.character)
check_diffs <- function(vars, data1, data2) {
  levels1 <- unique(data1$vars)
  levels2 <- unique(data2$vars)
  diff <- ifelse(length(union(setdiff(levels1,levels2), setdiff(levels2,levels1)))>0, list(union(setdiff(levels1,levels2), setdiff(levels2,levels1))), NA)
  return(data.frame(var = vars, diffs = I(diff)))
}
diffs_df <- map_dfr(vars, ~check_diffs(.x, data1 = ?, data2 = ?))

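The issue with the code was that vars holds the column name as a string, so data1$vars looks for a column literally named "vars" and returns NULL. The column has to be looked up by that string instead, with get(vars, dataX) or dataX[[vars]]. With that change, the code gives the differences in coding between both data sets.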
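A minimal sketch of the corrected lookup, using base R only. The two small data frames and the column names sex and age are made up for illustration, since the original data sets are not shown; lapply is used here in place of map_dfr to avoid extra packages.

```r
# Hypothetical example data (not from the question)
data1 <- data.frame(sex = c("m", "f", "f"), age = c(20, 30, 40))
data2 <- data.frame(sex = c("m", "f", "d"), age = c(20, 30, 40))

# Convert every column to character so all variable types compare the same way
data1 <- lapply(data1, as.character)
data2 <- lapply(data2, as.character)

check_diffs <- function(var, data1, data2) {
  # [[var]] looks the column up by the string stored in `var`
  levels1 <- unique(data1[[var]])
  levels2 <- unique(data2[[var]])
  # Values that occur in exactly one of the two versions
  union(setdiff(levels1, levels2), setdiff(levels2, levels1))
}

vars <- c("sex", "age")
# One row per variable; the differences are stored as a list column
diffs_df <- data.frame(
  var = vars,
  diffs = I(lapply(vars, check_diffs, data1 = data1, data2 = data2))
)
print(diffs_df)
```

A variable with no differences gets an empty character vector in the diffs column rather than NA, which keeps the column type consistent.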

Related

error with dfidx: the two indexes don't define unique observations

I have collected data from a survey in order to perform a choice based conjoint analysis.
I have preprocessed and cleaned the data with Python in order to use it in R.
However, when I apply the function dfidx to the dataset I get the following error: the two indexes don't define unique observations.
I really do not understand why. Before creating the .csv file I checked for duplicates with the pandas call final_df.duplicated().sum(), and its output was 0, meaning that there were no duplicates.
Can someone please help me understand what I am doing wrong?
Here is the code:
df <- read.csv('.../survey_results.csv')
df <- df[,-c(1)]
df$Platform <- as.factor(df$Platform)
df$Deposit <- as.factor(df$Deposit)
df$Fees <- as.factor(df$Fees)
df$Financial_Instrument <- as.factor(df$Financial_Instrument)
df$Leverage <- as.factor(df$Leverage)
df$Social_Trading <- as.factor(df$Social_Trading)
df.mlogit <- dfidx(df, idx = list(c("resp.id","ques"), "position"), shape='long')
Here is the link to the dataset that I am using https://github.com/AlbertoDeBenedittis/conjoint-survey-shiny/blob/main/survey_results.csv
Thank you in advance for your time.
The function dfidx() is built for data frames "for which observations are defined by two (potentialy nested) indexes" (ref).
I don't think this function is built for more than two indexes. In your df, the rows are unique ONLY when all three of the columns you mention above (resp.id, ques and position) are considered together.
One solution to this problem is to "combine" the two columns resp.id and ques into one (called, for example, resp.id.ques) with paste(...).
df$resp.id.ques <- paste(df$resp.id, df$ques, sep="_")
Then you can write the following line which should work just fine:
df.mlogit <- dfidx(df, idx = list("resp.id.ques", "position"))
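The uniqueness problem can be checked in base R before calling dfidx at all. This is only a sketch on made-up toy data (two respondents, two questions, two positions), not on the linked survey file. It shows that one respondent column plus position leaves duplicates, while the pasted key does not.

```r
# Hypothetical toy data with the same three index columns as in the question
df <- data.frame(
  resp.id  = rep(1:2, each = 4),
  ques     = rep(rep(1:2, each = 2), 2),
  position = rep(1:2, 4)
)

# Two columns alone do not identify a row: anyDuplicated() returns the
# index of the first duplicated row, or 0 if there is none
dup_two_cols <- anyDuplicated(df[, c("resp.id", "position")])

# Combine respondent and question into a single choice-situation key
df$resp.id.ques <- paste(df$resp.id, df$ques, sep = "_")
dup_combined <- anyDuplicated(df[, c("resp.id.ques", "position")])

print(dup_two_cols)  # non-zero: duplicates exist
print(dup_combined)  # zero: every (key, position) pair is unique
```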

Kruskal-Wallis test on multiple columns at once

This may sound a bit simple, but I cannot find the answer.
I have a dataset in R that has 26 samples in rows and many variables (>20) in columns. Some of them are categorical, so what I need to do is to carry out a Kruskal Wallis test for each numerical variable depending on each categorical one, so I do:
env_fact <- read.csv("environ_facts.csv")
kruskal.test(env_fact-1 ~ Categorical_var-1, data=env_fact)
But this way I can only run the test on the numerical variables one by one, which is tiresome.
Is there any way to carry all the Kruskal-Wallis tests for all numerical variables at once?
I can repeat it by each categorical variable, since I only have 4, but for the numerical one I have more than 20!!
Thanks a lot
Since I do not have a sample of the data set, I can only answer "theoretically".
First, you need to identify which columns are numeric.
The way to do this is the following:
library(tibble)
df = tibble(x = rnorm(10), y = rnorm(10), z = "a", w = rnorm(10))
# Logical vector: TRUE for each numeric column
NumericCols = sapply(df, function(x) is.numeric(x))
df_Numeric = df[, NumericCols]
Now you take the numeric part of df, df_Numeric, and apply your function blabla on each column at a time:
sapply(df_Numeric, function(x) blabla(x))
Thank you very much Omry.
Working with a colleague, we reached an incomplete solution that differs from yours:
my.variables <- colnames(env_fact)
for(i in 1:length(my.variables)) {
  if(my.variables[i] == 'Categorical_var') {
    next
  } else {
    kruskal.test(env_fact[,i], env_fact$Categorical_var)
  }
}
However, we haven't been able to print on screen/get an output with the results for each of 'my.variables' by the 'Categorical_var' analyzed. We could only get a result for all the 'my.variables' as a whole.
Any idea??
Thank you very much
P.S.: My data looks like this:
Sample,Nunatak,Slope,Altitude,Depth,Fluoride,Acetate,Formiate,Chloride,Nitrate
m4,1,1,1,1,0.044,0.884,0.522,0.198,0.021
m6,1,1,1,2,0.059,0.852,0.733,0.664,0.038
m7,1,1,1,3,0.082,0.339,1.496,0.592,0.034
m8,1,1,2,1,0.112,0.812,2.709,0.357,0.014
m10,1,1,2,2,0.088,0.768,2.535,0.379,0
m11,1,1,3,1,0.101,0.336,4.504,0.229,0
m13,1,1,3,2,0.092,0.681,1.862,0.671,0.018
m14,1,2,2,1,0.12,1.055,3.018,0.771,0
m16,1,2,2,2,0.102,1.019,1.679,1.435,0
m17,1,2,2,3,0.26,0.631,0.505,0.574,0.008
Where Nunatak, Slope, Altitude and Depth are categorical and the rest are numerical. Hope this helps
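Inside a for loop R does not auto-print the value of an expression, which is why the loop above shows nothing for the individual tests; each call would need to be wrapped in print(), or the results stored. A minimal sketch on the ten sample rows above, storing one test per numerical variable in a list. The choice of Altitude as the grouping variable is only for illustration, and the numerical column names are listed by hand.

```r
# The sample rows posted above
env_fact <- read.csv(text = "Sample,Nunatak,Slope,Altitude,Depth,Fluoride,Acetate,Formiate,Chloride,Nitrate
m4,1,1,1,1,0.044,0.884,0.522,0.198,0.021
m6,1,1,1,2,0.059,0.852,0.733,0.664,0.038
m7,1,1,1,3,0.082,0.339,1.496,0.592,0.034
m8,1,1,2,1,0.112,0.812,2.709,0.357,0.014
m10,1,1,2,2,0.088,0.768,2.535,0.379,0
m11,1,1,3,1,0.101,0.336,4.504,0.229,0
m13,1,1,3,2,0.092,0.681,1.862,0.671,0.018
m14,1,2,2,1,0.12,1.055,3.018,0.771,0
m16,1,2,2,2,0.102,1.019,1.679,1.435,0
m17,1,2,2,3,0.26,0.631,0.505,0.574,0.008")

numeric_vars <- c("Fluoride", "Acetate", "Formiate", "Chloride", "Nitrate")
group <- factor(env_fact$Altitude)  # grouping variable chosen for illustration

# One Kruskal-Wallis test per numerical column, kept in a named list
results <- lapply(env_fact[numeric_vars], function(x) kruskal.test(x, group))

# Pull the p-values out into a named numeric vector
p_values <- vapply(results, function(res) unname(res$p.value), numeric(1))
print(p_values)
```

Repeating this for the other categorical variables only requires changing the column used for group, and results[["Fluoride"]] gives the full test output for a single variable.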

R - Subset a Dataframe with a Programmatically built Formula

I'm working with a large data frame that is pulled from a data lake which I need to subset according to multiple different columns and run an analysis on. The basic subsettings come from an external Excel file which I read in and generate all possible combinations of. I want something to loop through each of these columns and subset my data accordingly.
A few of the subsettings follow a similar form to:
data_settings <- data.frame(country = rep(c('DE','RU','US','CA','BR'),6),
transport=rep(c('road','air','sea')),
category = rep(c('A','B')))
And my data lake extract has a form like:
df <- data.frame(country = rep(unique(data_settings$country),6),
transport = rep(unique(data_settings$transport),10),
category = rep(c('A','B'),15),
values = round(runif(30) * 10))
I need to subset the data according to each of the rows in my data_settings data frame, so I built a loop which constructs the formula according to what is in my data_settings data frame.
for(i in 1:nrow(data_settings)){
  sub_string <- paste0(names(data_settings[1]), '==', data_settings[i,1])
  for(j in 2:ncol(data_settings)){
    col <- names(data_settings)[j]
    val <- as.character(data_settings[i,j])
    sub_string <- paste0(sub_string, ' & ', col," == ","'",val,"'")
  }
  df_sub <- subset(df, formula(sub_string))
}
This successfully builds my strings which I try to pass to formula or as.formula, but I receive an error at that point. I've tried a few different formulations without any success. In my actual case, there are thousands of combinations with different columns and values to filter against.
Thanks in advance for your help!
Try this:
merge(data_settings, df)
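Called without a by argument, merge joins on all column names the two data frames share, so it keeps exactly the rows of df that match some row of data_settings. If a separate subset per settings row is needed, one option is to merge one row at a time. A minimal sketch with small made-up data frames (the names settings and subsets are only for illustration):

```r
# Hypothetical settings: each row defines one subset
settings <- data.frame(
  country   = c("DE", "US"),
  transport = c("road", "air"),
  category  = c("A", "B")
)

# Hypothetical data to be subset
df <- data.frame(
  country   = c("DE", "DE", "US", "US", "CA"),
  transport = c("road", "air", "air", "air", "sea"),
  category  = c("A", "A", "B", "A", "B"),
  values    = 1:5
)

# merge() joins on the shared columns (country, transport, category),
# so merging a single settings row keeps only the matching rows of df
subsets <- lapply(seq_len(nrow(settings)), function(i) {
  merge(settings[i, , drop = FALSE], df)
})

print(subsets[[1]])  # rows for DE / road / A
print(subsets[[2]])  # rows for US / air / B
```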
I worked on my previous approach a bit more today, without using subset, filter, etc., and put this together. It seems to do what I want well enough, by filtering successively on each column of the data_settings frame.
for(i in 1:nrow(data_settings)){
  df_sub <- df
  for(j in 1:ncol(data_settings)){
    col <- names(data_settings)[j]
    val <- as.character(data_settings[i,j])
    df_col <- grep(col, names(df))
    df_sub <- df_sub[df_sub[,df_col] == val,]
  }
  # Run further analysis here...
}
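As for the original error: a condition string such as the one built in the question is a logical expression, not a formula, so formula() cannot turn it into a filter. One possible alternative, sketched here on made-up data, is to parse the string and evaluate it with the data frame as the environment. Every value has to be quoted inside the string, and this should only be done with strings from a trusted source.

```r
# Hypothetical data (not the data lake extract from the question)
df <- data.frame(
  country   = c("DE", "DE", "US"),
  transport = c("road", "air", "road"),
  values    = c(10, 20, 30)
)

# A condition string like the one the loop builds, with all values quoted
sub_string <- "country == 'DE' & transport == 'road'"

# parse() turns the text into an expression; eval() runs it with the
# columns of df visible, returning one TRUE/FALSE per row
keep <- eval(parse(text = sub_string), envir = df)
df_sub <- df[keep, ]
print(df_sub)
```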

R: Help using dummyVars and adding back into data.frame

I have a data.frame of 373127 obs. of 193 variables. Some variables are factors, and I want to use dummyVars() to split each factor into one column per level. I then want to merge the separate dummy variable columns back into my original data.frame. I thought I could do the whole thing with apply, but something is not working and I can't figure out what it is.
Sample:
dat_final <- apply(dummies.var1, 1, function(x) {
dummies.var1 <- dummyVars(~ dat1$factor.var1 -1, data = dat1)
})
Thanks!
You can do the following, which creates a new df, trsf; you can always assign it back to the original df:
library(caret)
customers <- data.frame(
id=c(10,20,30,40,50),
gender=c('male','female','female','male','female'),
mood=c('happy','sad','happy','sad','happy'),
outcome=c(1,1,0,0,0))
# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)
The real answer is .... Don't do that. It's almost never necessary.
You could do something like this:
# Example data
df = data.frame(x = rep(LETTERS, each = 3), y = rnorm(78))
df = cbind(df, model.matrix(~df$x - 1))
However, as pointed out by @user30257, it is hard to see why you want to do it. In general, modeling tools in R don't need dummy vars, but deal with factors directly.
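A small, self-contained variant of the model.matrix approach, passing the data frame through the data argument so the new columns get shorter names. The data frame and the column name colour are made up for illustration.

```r
# Hypothetical example data with one factor column
df <- data.frame(
  id     = 1:4,
  colour = factor(c("red", "blue", "red", "green"))
)

# "- 1" drops the intercept so every level gets its own 0/1 column
dummies <- model.matrix(~ colour - 1, data = df)

# Attach the dummy columns to the original data frame
df_wide <- cbind(df, dummies)
print(df_wide)
```

Each row has exactly one 1 among the new columns, and the columns are named after the factor plus its level (colourblue, colourgreen, colourred).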
Creating dummy variables can be very important in feature selection, which it sounds like the original poster was doing.
For instance, suppose you have a feature that contains duplicated information (i.e., one of its levels corresponds to something measured elsewhere). You can determine this is the case very simply by comparing the dummy variables for these features using a variety of dissimilarity measures.
My preference is to use:
sparse.model.matrix and
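cBind (both from the Matrix package; note that cBind is deprecated in newer versions of Matrix, where plain cbind works on sparse matrices)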

Wilcoxon test on a large dataset algorithm

I have a large dataset: each row is a sample and each column is a feature. The first column, however, holds the class factor (here 1, 2, 3, 4, 5). My aim is to do a Wilcoxon comparison between all pairs of classes (so for every combination 1,2; 1,3; 1,4; 1,5; 2,3; ...) for all the features. This is the code I wrote in order to do this (X is the data frame):
facs <- length(levels(factor(X[,1])))
v <- matrix(as.character(combn(facs,2)),ncol=facs*2)
vecBoh <- data.frame(row.names=paste(v[1,],"-",v[2,]))
for(i in 2:ncol(X))
{
  WilF <- function(coppie) wilcox.test(X[,i] ~ Class, data=X, subset = Class %in% coppie)
  vecBoh[,i-1] <- as.numeric(sapply(apply(v,2,WilF),"[",3))
}
It works but it's extremely slow. I have the feeling there's a quicker way to do this. Does anyone have a clue?
You can use the pairwise.wilcox.test function for pairwise comparisons between groups. I also think that reading about multiple comparisons first would help here.
lapply(df[,-1], function(x)
pairwise.wilcox.test(x, df$Class, p.adjust.method = "none"))
Where df is your data.frame
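A runnable sketch of the same call on simulated data, to show the shape of the output. The data frame X, its three classes and the feature names f1 and f2 are made up; p.adjust.method = "none" is kept from the answer above, although an adjustment such as "holm" may be more appropriate when many comparisons are made.

```r
set.seed(1)

# Hypothetical data: class labels in the first column, features after it
X <- data.frame(
  Class = factor(rep(1:3, each = 10)),
  f1    = rnorm(30),
  f2    = rnorm(30)
)

# One pairwise Wilcoxon test per feature column
res <- lapply(X[, -1], function(x) {
  pairwise.wilcox.test(x, X$Class, p.adjust.method = "none")
})

# Each result holds a lower-triangular matrix of pairwise p-values
print(res$f1$p.value)
```

With k classes the p-value matrix has k - 1 rows and columns, with NA above the diagonal.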
