R wilcox.test by categorical subset error - r

I'm trying to specify categorical subgroups. I found a source which suggests you can simply use this layout:
wilcox.test(growth ~ sugar, data= carbs, subset= sugar %in% c("test", "C"))
However, on my dataset it doesn't work, though the same format works if I convert the groups to numerical values in Excel.
wilcox.test(Distance~Application, data= walking.dat,
subset = Application %in% c("Control", "Cue-Lure"))
Error in wilcox.test.formula(Distance ~ Application, data = walking.dat, :
grouping factor must have exactly 2 levels
Any suggestions would be great.
Thanks!

It is amazing you can still see my deleted comments. I made two comments earlier, pointing out two possible issues.
issue 1:
It is highly likely that there is no "Control" or "Cue-Lure" in walking.dat$Application. I would suggest you try
with(walking.dat, unique(Application[Application %in% c("Control", "Cue-Lure")]))
to see what you get. Possibly you either get a single element, or nothing.
I can easily reconstruct the error you encountered. Consider the built-in R dataset airquality.
data(airquality)
unique(airquality$Month) ## 5 6 7 8 9
wilcox.test(Ozone ~ Month, data = airquality, subset = Month %in% c(6, 7)) ## fine
wilcox.test(Ozone ~ Month, data = airquality, subset = Month %in% c(1, 7)) ## fail
In the second case, you get an error:
Error in wilcox.test.formula(Ozone ~ Month, data = airquality, subset = Month %in% :
grouping factor must have exactly 2 levels
because 1 is not an available value of Month.
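A quick way to see which requested labels are missing is to compare them against the values actually present. The sketch below uses a made-up data frame (walking_toy, where "Control " carries a trailing space) purely for illustration; the real walking.dat may differ in some other way, such as capitalisation.

```r
# Hypothetical stand-in for walking.dat: "Control " has a trailing space
walking_toy <- data.frame(
  Distance    = c(1.2, 2.2, 2.9, 3.4, 4.1, 3.8),
  Application = c("Control ", "Control ", "Control ",
                  "Cue-Lure", "Cue-Lure", "Cue-Lure"),
  stringsAsFactors = FALSE
)
wanted <- c("Control", "Cue-Lure")

# Labels that were requested but never occur in the data
missing_before <- setdiff(wanted, unique(walking_toy$Application))
missing_before

# Strip the stray whitespace and check again
walking_toy$Application <- trimws(walking_toy$Application)
missing_after <- setdiff(wanted, unique(walking_toy$Application))
missing_after

# With both labels present, the subset leaves two groups to compare
res <- wilcox.test(Distance ~ Application, data = walking_toy,
                   subset = Application %in% wanted)
```

If missing_before is non-empty, the subset cannot contain two groups, which is exactly the situation that triggers the error.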
issue 2
If both levels exist, then I guess your variable Application is a factor. Check class(walking.dat$Application). The problem with a factor can be seen here:
x <- factor(letters[1:4])
x[x %in% c("a", "b")]
#[1] a b
#Levels: a b c d
Note that the unused factor levels c and d are not dropped after subsetting with %in%. However, if you do:
x <- as.character(x)
x[x %in% c("a", "b")]
#[1] "a" "b"
Although you get characters, the formula method will coerce them into a factor automatically, and that factor only has the levels that are actually present. In this way, there is no danger that additional unused factor levels could break wilcox.test().
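If you would rather keep the variable as a factor, the unused levels can be removed explicitly. This is a minimal sketch on the same toy factor; the object names (sub_x and so on) are only illustrative.

```r
x <- factor(letters[1:4])
sub_x <- x[x %in% c("a", "b")]

# Subsetting keeps all four levels, even though only two are used
n_before <- nlevels(sub_x)

# droplevels() removes the levels that no longer occur
n_dropped <- nlevels(droplevels(sub_x))

# Re-creating the factor has the same effect
n_refactored <- nlevels(factor(sub_x))

c(before = n_before, dropped = n_dropped, refactored = n_refactored)
```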

Related

r, stuck on converting numeric to factor with labels

I am stuck on this topic and I am new to R. I am trying to take this repetition pattern and assign levels using factor(), as.factor(), or levels() to assign levels to 1-5 such as 1 = yes, 2 =no, 3 =sometimes, 4 = almost never, 5 = almost always (these are for the purposes of this example only, I am aware that they do not make sense in terms of order). I am out of ideas. I have tried assigning the below code to a variable ex: x and making it as.factor(x) and then use:
x <- as.factor(x)
y <- factor(x, c('yes','no','sometimes','almost never', 'almost always'))
#or y <- factor(a, levels = c('yes','no','sometimes','almost never', 'almost always'), ordered = TRUE)
the output always gives me a bunch of NA values. I am trying to get R to convert 1-5 to the respective values and output yes < no < sometimes ,etc... in terms of providing an intrinsic order for my values/levels
code I have for the pattern this:
rep(c(1:5),each=3,times=4)
You almost had it; it's labels, not levels:
factor(
rep(c(1:5),each=3,times=4),
levels = 1:5,
labels = c('yes','no','sometimes','almost never', 'almost always'),
ordered = TRUE
)
Edit: as Roland pointed out, including both levels and labels is safer.
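To see why the original attempt produced NAs, it may help to compare the two calls side by side. This is only a sketch; bad, good and labs are names chosen for the illustration.

```r
x <- rep(1:5, each = 3, times = 4)
labs <- c("yes", "no", "sometimes", "almost never", "almost always")

# Passing the text as levels: none of the values 1..5 equals any of
# these strings, so every element becomes NA
bad <- factor(x, levels = labs)
all(is.na(bad))

# levels says which values occur in x, labels says what to call them
good <- factor(x, levels = 1:5, labels = labs, ordered = TRUE)
table(good)

# The order follows the order of the labels, so "yes" < "no"
good[1] < good[4]
```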

R Aggregate with a yet undefined range of columns (including factors)

I probably lack the right words to find my answer using the search function. I will have a dataset with a yet unknown number of columns, because they are a function of work within another program, and later changes there will change the number of variables in the dataset. However, the dataset has a clear structure, with 6 variables in the beginning (including the below-mentioned code, a factor variable, and year), and starting at the 7th column, all the other variables that are a function of the work in the other program (MaxQDA).
So I wish to have a flexible call for 7 to N columns for an aggregate function to replace the dot in the following code, which to my understanding calls for all columns.
dataset2 <- aggregate(. ~ code+jahr,
data = dataset,
sum,
na.action=na.pass
)
Suggestions from here do not help, as I don't know how to transfer the code+jahr into other suggested variations of aggregate-function writing.
addendum: Or, put differently: I wish to exempt a few columns from the aggregate-function, while summing up a range of other columns.
Since there was confusion about vector types. I have some factor data like ID and Name. Data would look like this
set.seed(42)
test2 <- as.data.frame(matrix(sample(16 * 4, replace=TRUE), ncol=16, nrow=4))
code <-c("aaa", "bbb","aaa", "ddd")
jahr <- c("1990", "1993", "2007", "2020")
id <- c("id1", "id2", "id3", "id4")
Name <- c("bla", "bla2", "bla3", "bla4")
test <- data.frame(code, jahr, id, Name)
dataset <- data.frame(test, test2)
dataset[1:4] <- lapply(dataset[, 1:4], as.factor)
Using the dataset above, we want to remove id and Name from the aggregation since they are factors that are not used to define groups. The simplest way to do that is to drop those columns from the data:
dataset2 <- aggregate(. ~ code+jahr, data = dataset[ , -(3:4)], sum, na.action=na.pass)
A slightly more complicated method is to build a logical vector that keeps the grouping columns and every numeric column, which leaves out the factors that are not used for grouping. The main advantage is not having to figure out column numbers, and it is relatively simple to change the grouping variables:
keep <- colnames(dataset) %in% c("code", "jahr") | sapply(dataset, is.numeric)
dataset2 <- aggregate(. ~ code+jahr, data = dataset[, keep], sum, na.action=na.pass)
Both produce the same results.
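If the value columns are only known by position ("from column k to the last one"), another option is the non-formula interface of aggregate(), indexing up to ncol(dataset). The sketch below rebuilds the toy dataset; first_value_col is an assumed name, and it is 5 here, whereas it would be 7 in the layout described in the question.

```r
set.seed(42)
values <- as.data.frame(matrix(sample(16 * 4, replace = TRUE),
                               ncol = 16, nrow = 4))
dataset <- data.frame(
  code = factor(c("aaa", "bbb", "aaa", "ddd")),
  jahr = factor(c("1990", "1993", "2007", "2020")),
  id   = factor(c("id1", "id2", "id3", "id4")),
  Name = factor(c("bla", "bla2", "bla3", "bla4")),
  values
)

# Position of the first value column (assumption: 5 in this toy data)
first_value_col <- 5
value_cols <- first_value_col:ncol(dataset)

# Sum every value column within each code/jahr combination
dataset2 <- aggregate(dataset[, value_cols],
                      by = dataset[c("code", "jahr")],
                      FUN = sum)
dim(dataset2)
```

Because the range ends at ncol(dataset), the call does not need to change when the other program adds or removes value columns.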

How to run t-test with subset of rows in R

Below is a part of my data (pairht_protein)
I am trying to run t-test on all the variables (columns) between two groups which are:
Resistant_group <- c(PAIR-01, PAIR-12, PAIR-09)
Sensitive_group <- c(PAIR-07, PAIR-02, PAIR-05)
Before I make a function, I picked one of the variables and tried:
t.test(m_pHSL660 ~ Subject, data = subset(pairht_protein, Subject %in% c("Resistant_group", "Sensitive_group")))
But it gave me an error : 'grouping factor must have exactly 2 levels'
Is there a way to run t-test between these groups? and possibly make it as a function?
First, you must correct how you define the groups. The subject IDs have to be quoted strings; unquoted, PAIR-01 is parsed as a subtraction:
Resistant_group <- c('PAIR-01', 'PAIR-12', 'PAIR-09')
Sensitive_group <- c('PAIR-07','PAIR-02','PAIR-05')
Then, using dplyr package create another factor variable with only two levels:
library(dplyr)
# assuming pairht_protein is your dataset name
pairht_protein <- pairht_protein %>%
  mutate(sub = case_when(Subject %in% Resistant_group ~ 1,
                         Subject %in% Sensitive_group ~ 2),
         sub = as.factor(sub))
Because this new variable is NA for subjects outside your two groups, and the formula method of t.test drops rows with NA by default, you don't need to subset:
t.test(m_pHSL660 ~ sub, data =pairht_protein)
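As for wrapping this in a function, here is one possible base R sketch that runs a Welch t-test on every numeric column. The data frame pairht_toy, the column m_other and the function name group_ttests are invented for the example, since the real data is not shown.

```r
set.seed(1)
# Hypothetical stand-in for pairht_protein
pairht_toy <- data.frame(
  Subject   = sprintf("PAIR-%02d", c(1, 12, 9, 7, 2, 5, 3)),
  m_pHSL660 = rnorm(7),
  m_other   = rnorm(7),
  stringsAsFactors = FALSE
)
Resistant_group <- c("PAIR-01", "PAIR-12", "PAIR-09")
Sensitive_group <- c("PAIR-07", "PAIR-02", "PAIR-05")

# Returns one p-value per numeric column, comparing group_a with group_b;
# subjects in neither group are ignored
group_ttests <- function(data, group_a, group_b, id_col = "Subject") {
  in_a <- data[[id_col]] %in% group_a
  in_b <- data[[id_col]] %in% group_b
  num_cols <- names(data)[sapply(data, is.numeric)]
  sapply(num_cols, function(v) {
    t.test(data[[v]][in_a], data[[v]][in_b])$p.value
  })
}

pvals <- group_ttests(pairht_toy, Resistant_group, Sensitive_group)
pvals
```

With many columns, the resulting p-values would normally need a multiple-testing correction, for example via p.adjust().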

Understanding coercion of factors into characters in an R dataframe

Trying to figure out how coercion of factors/ dataframe works in R. I am trying to plot boxplots for a subset of a dataframe. Let's see step-by-step
x = rnorm(30, 1, 1)
Created a vector x with normal distribution
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
Created a character string to later use as a factor for plotting boxplots for x1, x2, x3
df = data.frame(x,c)
combined x and c into a data.frame. So now we would expect class of df: dataframe, df$x: numeric, df$c: factor (because we sent c into a dataframe) and is.data.frame and is.list applied on df should give us TRUE and TRUE. (I assumed that all dataframes are lists as well? and that's why we are getting TRUE for both checks.)
And that's what happens below. All good till now.
class(df)
#[1] "data.frame"
is.data.frame(df)
#[1] TRUE
is.list(df)
#[1] TRUE
class(df$x)
#[1] "numeric"
class(df$c)
#[1] "factor"
Now I plot the spread of x grouped using factors present in c. So the first argument is x ~ c. But I want boxplots for just two factors: x1and x2. So I used a subset argument in boxplot function.
boxplot(x ~ c, subset=c %in% c("x1", "x2"), data=df)
This is the plot we get. Notice that since x3 is still a level of the factor, it is still plotted,
i.e. we still get 3 categories on the x-axis of the boxplot in spite of subsetting to 2 categories.
So, one solution I found was to change the class of df variables into numeric and character
class(df)<- c("numeric", "character")
boxplot(x ~ c, subset=c %in% c("x1", "x2"), data=df)
New boxplot. This is what we wanted, so it worked!, we plotted boxes for just x1 and x2, got rid of x3
But if we just run the same checks, we ran before doing this coercion, on all variables, we get these outputs.
Anything funny?
class(df)
#[1] "numeric" "character"
is.data.frame(df)
#[1] FALSE
is.list(df)
#[1] TRUE
class(df$x)
#[1] "numeric"
class(df$c)
#[1] "factor"
Check out that df$c (the second variable, containing categories x1, x2, x3) is still a factor!
And df stopped being a list (so was it ever a list?)
And what did we do exactly by class(df)<- c("numeric", "character") this coercion if not changing the datatype of df $ c?
So to sum up,
my questions for tldr version:
Are all dataframes, also lists in R?
Why did our boxplot drop x3 in the 2nd case (when we coerced class(df) into numeric and character)?
If we did coerce the factor into characters by doing the above steps, why is it still showing that the variable's class is factor?
And why did df stop being a dataframe after we did the above steps?
The answers make more sense if we take your questions in a different order.
Are all dataframes, also lists in R?
Yes. A data frame is a list of vectors (the columns).
And why did df stop being a list after we did the above steps?
It didn't. It stopped being a data frame, because you changed the class with class(df)<- c("numeric", "character"). is.list(df) returns TRUE still.
If we did coerce the factor into characters by doing the above steps, why is it still showing that the variable's class is factor?
class(df) <- ... operates on the df object itself, not on the columns. Look at str(df). The factor column is still a factor. The assignment only set the class attribute of the data frame object to the vector c("numeric", "character").
Why did our boxplot drop x3 in the 2nd case (when we coerced class(df) into numeric and character)?
You've messed up your data frame object by explicitly setting the class attribute of the object to a vector c("numeric", "character"). It's hard to predict the full effects of this. My best guess is that boxplot or the functions that draw the axes accessed the class attribute of the data frame somehow.
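The points above can be checked directly. In this sketch the factor is created explicitly with stringsAsFactors = TRUE so that it behaves the same in newer R versions, and df2 is just a copy used for the experiment.

```r
set.seed(1)
df <- data.frame(x = rnorm(30, 1, 1),
                 c = rep(c("x1", "x2", "x3"), each = 10),
                 stringsAsFactors = TRUE)

df2 <- df
# This only overwrites the class attribute of the container object
class(df2) <- c("numeric", "character")

is.data.frame(df2)  # no longer a data frame
is.list(df2)        # still a list underneath
class(df2[["c"]])   # the column itself was never touched

# Converting the column is what actually changes its type
df$c <- as.character(df$c)
class(df$c)
```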
To do what you really wanted:
x = rnorm(30, 1, 1)
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
df = data.frame(x,c)
df$c <- as.character(df$c)
or
x = rnorm(30, 1, 1)
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
df = data.frame(x,c, stringsAsFactors=FALSE)
Use droplevels like this:
df0 <- subset(df, c %in% c("x1", "x2"))
df0 <- transform(df0, c = droplevels(c))
levels(df0$c)
## [1] "x1" "x2"
Note that now c only has two levels, not three.
We can write this as a pipeline using magrittr like this:
library(magrittr)
df %>%
subset(c %in% c("x1", "x2")) %>%
transform(c = droplevels(c)) %>%
boxplot(x ~ c, data = .)
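The effect of dropping the levels can also be verified without looking at the plot, because boxplot() returns the group names it would draw. A small sketch, assuming c is a factor and using plot = FALSE to suppress the drawing:

```r
set.seed(1)
df <- data.frame(x = rnorm(30, 1, 1),
                 c = factor(rep(c("x1", "x2", "x3"), each = 10)))

# Subsetting inside boxplot() keeps the empty level x3
kept <- boxplot(x ~ c, data = df,
                subset = c %in% c("x1", "x2"), plot = FALSE)
kept$names

# Subset first, then drop the unused levels of every factor column
df0 <- droplevels(subset(df, c %in% c("x1", "x2")))
dropped <- boxplot(x ~ c, data = df0, plot = FALSE)
dropped$names
```

Note that droplevels() has to be applied after subsetting; wrapping the variable inside the formula would run on the full data, where all three levels are still in use.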

R: not meaningful as factors

what is best practice to handle this particular problem when it comes up? for example I have created a dataframe:
dat<- sqlQuery(con,"select * from mytable")
in which my table looks like:
ID RESULT GROUP
-- ------ -----
1 Y A
2 N A
3 N B
4 Y B
5 N A
in which ID is an int, Result and Group are both factors.
problem is that when I want to do something like:
tapply(dat$RESULT,dat$GROUP,sum)
I get complaints about columns being a factor:
Error in Summary.factor(c(2L,2L,2L,2L,1L,2L,1L,2L,2L,1L,1L, :
sum not meaningful for factors
Given that factors are essential for use in things like ggplot, how does everyone else handle this?
Setting stringsAsFactors=FALSE and rerunning gives
tapply(dat$RESULT,dat$GROUP,sum)
Error in FUN(X[[1L]], ...) : invalid "type" (character) or argument
so I'm not sure merely setting stringsAsFactors=FALSE is the right approach
I assume you want to sum up the "Y"s in the RESULT column.
As suggested by @akrun, one possibility is to use table()
with(dat,table(GROUP,RESULT))
If you want to stick with the tapply(), you can change the type of the RESULT column to a boolean:
dat$RESULT <- dat$RESULT=="Y"
tapply(dat$RESULT,dat$GROUP,sum)
If your goal is to have some columns as factors and other as strings, you can convert to factors only selected columns in the result, e.g. with
dat<- sqlQuery(con,"select ID,RESULT,GROUP from mytable",as.is=2)
As in the read.table man page (recalled by the sqlQuery man page) : as.is is either a vector of logicals (values are recycled if necessary), or a vector of numeric or character indices which specify which columns should not be converted to factors.
But then again, you need either to use table() or to turn the result into a boolean.
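Both approaches can be tried without a database connection. The sketch below rebuilds the five example rows by hand (so dat here is a stand-in for the sqlQuery result) and leaves RESULT as a factor, comparing it to "Y" only at the point of counting.

```r
# Stand-in for the query result shown in the question
dat <- data.frame(ID     = 1:5,
                  RESULT = factor(c("Y", "N", "N", "Y", "N")),
                  GROUP  = factor(c("A", "A", "B", "B", "A")))

# Cross-tabulate both factors
tab <- with(dat, table(GROUP, RESULT))
tab

# Count the "Y" rows per group; the comparison yields a logical vector,
# which sum() can handle, and the RESULT column itself stays a factor
y_counts <- tapply(dat$RESULT == "Y", dat$GROUP, sum)
y_counts
```

Keeping the column as a factor this way means it can still be used for plotting afterwards.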
I'm not clear what your question is, either. If you're just trying to sum the Y's, how about:
library(dplyr)
df <- data.frame(ID = 1:5,
RESULT = as.factor(c("Y","N","N","Y","N")),
GROUP = as.factor(c("A", "A", "B", "B", "A")))
df %>% mutate(logRes = (RESULT == "Y")) %>%
summarise(sum=sum(logRes))
