Collapse some categorical variables in tidyverse - r

I'm working with a large dataset that has several locations. However, for one of my analyses, two locations "Wells1" and "Wells2", need to be collapsed into a single location "Wells". All other locations should keep their current names.
There are several excellent questions showing how to do this using different base R functions (#1, #2), but I was wondering if anyone knows which tidyverse function would achieve the same goal.
The only thing I've come up with so far is:
case_when(recvDeployName %in% c("Wells1", "Wells2") ~ "Wells")
However, I get the following error message:
Error: Case 1 (.) must be a two-sided formula, not a list
I suspect I need to specify what should be done with the other categories, but I'm not sure how to do that.

The case_when can be written as
case_when(recvDeployName %in% c("Wells1", "Wells2") ~ "Wells",
          TRUE ~ recvDeployName)
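For illustration, here is a minimal sketch of that call inside mutate(). The data frame and the new column name location are made up for the example, and recvDeployName is assumed to be a character column (not a factor).

```r
library(dplyr)

# Hypothetical data: only the column name recvDeployName comes from the question
df <- data.frame(
  recvDeployName = c("Wells1", "Wells2", "Harbor", "Wells1"),
  stringsAsFactors = FALSE
)

df <- df %>%
  mutate(location = case_when(
    recvDeployName %in% c("Wells1", "Wells2") ~ "Wells",  # collapse the two sites
    TRUE ~ recvDeployName                                  # keep every other name
  ))
```

The TRUE ~ recvDeployName line is the catch-all: without it, every row that does not match the first condition would become NA.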

Related

R: Conditionally mutating a variable (tricky)

We are currently working on a project for school, and we do not have that much experience with coding and R. The dataset that we are working on contains the variable operationtype, which has a lot of combinations between several operation types. We want to recode this into the variable operationcategory. These are the categories we want to recode the many operations into:
"AVR/P+other"
"AVR/P+MVP/R+other"
"MVR/P+other"
"CABG+other"
"CABG+AVR/P+other"
"CABG+MVR/P+other"
If none of the above, then "Remaining"
We were wondering if this can be done somewhat automatically, where we can specify the following for AVR/P+other: if the operation includes AVR/P but does not include MVP/R, classify it as "AVR/P+other"; if it does include MVP/R, classify it as "AVR/P+MVP/R+other", since these two categories are closely related. Doing this by hand would take forever, so hopefully this is possible.
Thank you for your help in advance.
Koen
Assuming that operationtype contains the exact string, what I would probably do is something like this:
library(dplyr)
library(stringr)
transformed_df <- df %>%
  mutate(operationcategory = case_when(
    str_detect(operationtype, "AVR/P") & str_detect(operationtype, "MVP/R") ~ "AVR/P+MVP/R+other",
    str_detect(operationtype, "AVR/P") ~ "AVR/P+other",
    TRUE ~ "Remaining"))
Just beware that the conditions are evaluated in the order they are written and the first match wins, so the most restrictive conditions should be on top.
You could use regular expressions to use a single str_detect, but this is probably easier to understand and use.
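To see why the order matters, here is a small sketch on made-up operationtype values (the four strings are assumptions for illustration, not taken from the real dataset):

```r
library(dplyr)
library(stringr)

# Hypothetical values of operationtype
df <- data.frame(
  operationtype = c("AVR/P", "AVR/P+MVP/R", "CABG", "AVR/P+CABG"),
  stringsAsFactors = FALSE
)

transformed_df <- df %>%
  mutate(operationcategory = case_when(
    # Most restrictive condition first: both AVR/P and MVP/R present
    str_detect(operationtype, "AVR/P") & str_detect(operationtype, "MVP/R") ~ "AVR/P+MVP/R+other",
    # Reached only by rows that failed the condition above
    str_detect(operationtype, "AVR/P") ~ "AVR/P+other",
    TRUE ~ "Remaining"
  ))
```

If the two AVR/P conditions were swapped, the row "AVR/P+MVP/R" would already match the plain AVR/P test and would never reach the combined category.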

Anova Test- Separate Groups with Two Factor Comparison

Good morning,
I am trying to run some ANOVA tests on my dataset (using R) and I keep getting errors. I am trying to compare the average percentage of correct responses, as a factor of what "group" the subjects were in and what session/day it was. However, I have two separate conditions that I need to analyze separately.
So essentially, I need to compare PctCorrect in Condition 1, between groups and sessions and then do the same thing for condition 2.
I attempted using this code:
aov(ext$Pct.Correct[ext$Condition=="NC-EXT"]~ext$Group*ext$Session, data=ext)
and I got the following error:
Error in model.frame.default(formula = ext$Pct.Correct[ext$Condition
== : variable lengths differ (found for 'ext$Group')
I ran this code to make sure that all of my values were even:
mytable <- table(ext$Session, ext$Group, ext$Condition)
ftable(mytable)
And they were all the same value (which was to be expected), so I am not sure what's wrong.
I am very new to R so it's entirely possible that I am doing this completely wrong. Any help would be greatly appreciated.
You are filtering the left side of the formula but not the right side, so the response has fewer values than Group and Session; hence the "variable lengths differ" error.
You could try filtering your data frame in the data= argument, like this:
aov(Pct.Correct ~ Group * Session, data = ext[ext$Condition == "NC-EXT", ])
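As a rough sketch of running the same model once per condition, the example below simulates a small balanced data set. Only the column names come from the question; the values, the second condition label "C-EXT" and the group/session labels are invented for illustration.

```r
set.seed(1)

# Simulated stand-in for ext: 5 subjects per Group x Session x Condition cell
ext <- expand.grid(
  Subject   = 1:5,
  Group     = c("A", "B"),
  Session   = c("S1", "S2", "S3"),
  Condition = c("NC-EXT", "C-EXT"),
  stringsAsFactors = TRUE
)
ext$Pct.Correct <- runif(nrow(ext), min = 40, max = 100)

# Split the data by condition, then fit the two-factor ANOVA within each piece,
# so both sides of the formula always come from the same filtered rows
fits <- lapply(split(ext, ext$Condition), function(d) {
  aov(Pct.Correct ~ Group * Session, data = d)
})

summary(fits[["NC-EXT"]])
```

Because the filtering happens on the data frame rather than on a single variable, the response and the predictors always have the same length.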

Subsetting within a function

I'm trying to subset a dataframe within a function using a mixture of fixed variables and some variables which are created within the function (I only know the variable names, but cannot vectorise them beforehand). Here is a simplified example:
a<-c(1,2,3,4)
b<-c(2,2,3,5)
c<-c(1,1,2,2)
D<-data.frame(a,b,c)
subbing<-function(Data,GroupVar,condition){
g=Data$c+3
h=Data$c+1
NewD<-data.frame(a,b,g,h)
subset(NewD,select=c(a,b,GroupVar),GroupVar%in%condition)
}
Keep in mind that in my application I cannot compute g and h outside of the function. Sometimes I'll want to make a selection according to the values of h (as above) and other times I'll want to use g. There's also the possibility I may want to use both, but even just being able to subset using 1 would be great.
subbing(D,GroupVar=h,condition=5)
This returns an error saying that the object h cannot be found. I've tried to amend subset using as.formula and all sorts of things but I've failed every single time.
Besides the ease of the function there is a further reason why I'd like to use subset.
In the function I'm actually working on I use subset twice. The first time it's the simple subset function. It's just been pointed out below that another blog explored how it's probably best to use the good old data[colnames()=="g",]. Thanks for the suggestion, I'll have a go.
There is however another issue. I also use subset (or rather a variation) in my function because I'm dealing with several complex design surveys (see package survey), so subset.survey.design allows you to get the right variance estimation for subgroups. If I selected my group using [] I would get the wrong s.e. for my parameters, so I guess this is quite an important issue.
Thank you
The error occurs as soon as the function evaluates GroupVar: R looks for an object named h on its own, not for a column h within the data frame.
The best thing to do is pass the column name as a quoted string and use it in the select argument of subset. But of course, then you have to handle the condition part separately:
subbing <- function(Data, GroupVar, condition) {
....
DF <- subset(Data, select=c("a","b", GroupVar))
DF <- DF[DF[,3] %in% condition,]
}
That will do the trick, although it can be annoying to have one data frame indexing inside another.
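A self-contained sketch of the same idea using the example data from the question is shown below. It uses plain [[ ]] and [ ] indexing instead of subset, which is a swapped-in choice for the example, and it does not address the survey-design case.

```r
D <- data.frame(a = c(1, 2, 3, 4), b = c(2, 2, 3, 5), c = c(1, 1, 2, 2))

subbing <- function(Data, GroupVar, condition) {
  # g and h are created inside the function, as in the question
  NewD <- data.frame(a = Data$a, b = Data$b, g = Data$c + 3, h = Data$c + 1)
  # GroupVar is a character string, so [[ ]] can look the column up by name
  keep <- NewD[[GroupVar]] %in% condition
  NewD[keep, c("a", "b", GroupVar), drop = FALSE]
}

# Note the quotes around "h": the column name is passed as a string
res <- subbing(D, GroupVar = "h", condition = 3)
```

With this data h only takes the values 2 and 3, so condition = 5 from the original call would return zero rows rather than an error.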

Generating a dummy variable from lots of categories

So...I have a large data set with a variable that has many categories. I want to create new variables that group some of those categories into one.
I could do that with a conditional statement, but given the number of categories it would take me forever to go one line at a time. Also, while my original variable is numeric, the values themselves are random so I can't use logical or range statements.
How do I create this conditional variable based on many particular values?
I tried the following, but without success. Below is an example of the different categories I want to group into one.
classes <- c(549,162,210,222,44,96,62,208,525,202,149,442,427,
564,423,106,422,546,205,560,127,536,34,261,568,
366,524,401,548,95,156,8,528, 430,527,556,203,554,523,
501,530,55,252,585,19,540,71,204,502,504, 196,436,48,
102,526,201,521,23,558,552,118,416,117,216,510,494,
516,544,518)
So this seemed pretty intuitive to me, but it doesn't work.
df$chem<- cbind(ifelse(df$class == classes ,1,0))
Needless to say I'm a beginner, and this is probably not so hard to do, but I've been looking for a solution to this particular problem and I can't seem to find it. What am I missing? Thanks!
You are looking for %in% not ==
eg
df$chem <- cbind(ifelse(df$class %in% classes ,1,0))
or using the logical to numeric conversion
df$chem <- as.numeric(df$class %in% classes)
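A tiny sketch of the difference, on made-up values (the short classes vector and the data frame are only for illustration):

```r
classes <- c(549, 162, 210)
df <- data.frame(class = c(162, 7, 549, 210, 99))

# == compares element by element (recycling the shorter vector),
# so it tests position 1 against 549, position 2 against 162, and so on.
# %in% instead asks, for each value of df$class, whether it appears
# anywhere in classes.
df$chem <- as.numeric(df$class %in% classes)
```

Here rows 1, 3 and 4 get chem = 1 because their class values occur somewhere in classes, regardless of position.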
If you want individual dummy variables for all the categories in df$class, you can use the class.ind function in the nnet package (which ships with R as a recommended package):
library(nnet)
class_ind <- class.ind(df$class)
# add if you want to combine with the original
df_ind <- do.call(cbind, list(df, class.ind(df$class)))

R: partimat function doesn't recognize my classes

I am a relatively novice R user and am attempting to use the partimat() function within the klaR package to plot decision boundaries for a linear discriminant analysis, but I keep encountering the same error. I have tried inputting the arguments multiple different ways according to the manual, but keep getting the following error:
Error in partimat.default(x, grouping, ...) :
at least two classes required
Here is an example of the input I've given:
partimat(sources1[,c(3:19)],grouping=sources1[,2],method="lda",prec=100)
where my data table is loaded in under the name "sources1" with columns 3 through 19 containing the explanatory variables and column 2 containing the classes. I have also tried doing it by entering the formula like so:
partimat(sources1$group~sources1$tio2+sources1$v+sources1$cr+sources1$co+sources1$ni+sources1$rb+sources1$sr+sources1$y+sources1$zr+sources1$nb+sources1$la+sources1$gd+sources1$yb+sources1$hf+sources1$ta+sources1$th+sources1$u,data=sources1)
with these being the column heading.
I have successfully run an LDA on this same data set without issue so I'm not quite sure what is wrong.
The source code of the partimat.default function, which you can view with getAnywhere(partimat.default), contains the following check:
if (nlevels(grouping) < 2)
stop("at least two classes required")
Therefore maybe you haven't defined your grouping column as a factor variable. If you try summary(sources1[,2]) what do you get? If it's not a factor, try
sources1[,2] <- as.factor(sources1[,2])
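To see why a non-factor grouping column trips that check, here is a quick sketch with a made-up character vector (the class labels are hypothetical):

```r
# Hypothetical class labels stored as plain character strings
grouping <- c("basalt", "andesite", "basalt", "andesite")

# A character vector has no levels attribute, so nlevels() reports 0
n_before <- nlevels(grouping)

# After conversion, the distinct labels become factor levels
n_after <- nlevels(as.factor(grouping))
```

n_before is 0, which is below 2 and would trigger the stop() call, while n_after is 2.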
Or, in method 2, try removing the "sources1$" from each of your variable names in the formula, since you already specify the data frame in which to look for these variable names in the data argument. I think you are effectively specifying the data frame twice and it might be looking, for instance, for
"sources1$sources1$group"
Rather than
"sources1$group"
Without further error messages or a reproducible example (i.e. include some data in your post) it's hard to say really.
HTH
