I am new to R and stuck on this topic. I am trying to take a repetition pattern of the values 1-5 and use factor(), as.factor(), or levels() to assign names to them, such as 1 = yes, 2 = no, 3 = sometimes, 4 = almost never, 5 = almost always (these are for the purposes of this example only; I am aware that they do not make sense in terms of order). I am out of ideas. I have tried assigning the code below to a variable, e.g. x, and then using:
x <- as.factor(x)
y <- factor(x, c('yes','no','sometimes','almost never', 'almost always'))
#or y <- factor(x, levels = c('yes','no','sometimes','almost never', 'almost always'), ordered = TRUE)
The output always gives me a bunch of NA values. I am trying to get R to convert 1-5 to the respective labels and output yes < no < sometimes, etc., so that my values/levels have an intrinsic order.
The code I have for the pattern is this:
rep(c(1:5),each=3,times=4)
You almost had it: it's labels, not levels.
factor(
rep(c(1:5),each=3,times=4),
levels = 1:5,
labels = c('yes','no','sometimes','almost never', 'almost always'),
ordered = TRUE
)
Edit: as Roland pointed out, including both levels and labels is safer.
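To see why the original attempt produced only NA, here is a minimal sketch (the names lab, wrong and right are just for illustration). With levels = the label strings, factor() looks for those strings among the values of x; since x only holds 1 to 5, nothing matches and every element becomes NA.

```r
x <- rep(1:5, each = 3, times = 4)
lab <- c("yes", "no", "sometimes", "almost never", "almost always")

# levels = lab: R searches x for these strings, finds none, returns NA everywhere
wrong <- factor(x, levels = lab)
all(is.na(wrong))

# levels = the values that occur in x, labels = the names to display for them
right <- factor(x, levels = 1:5, labels = lab, ordered = TRUE)
levels(right)

# the order follows the order of levels/labels, so "yes" < "no" here
right[1] < right[4]
```

So levels describes what is in the data, and labels describes what you want to see instead.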
Situation:
.csv file which contains the following:
x,y,z
1,2,3
-999,2,4
2,-999,4
2,4,-999
following tasks:
format variables correctly (factors)
define "-999" as NA
calculate mean size > A
create some boxplots
Issue:
If I am using the function replace_with_na_all (https://cran.r-project.org/web/packages/naniar/vignettes/replace-with-na.html) the calculation of the mean size will throw me this error for the mean calculation:
Argument is not numeric nor boolean: return NA
The boxplots look fine though.
If I use the base R assignment df[df == -999] <- NA instead, the calculation of the mean values works well.
However, if I first convert the variables with as.factor and define the NAs afterwards, the boxplot shows an extra group "-999", and only for the variable "x".
The summary(df) command also shows -999:0 for the variable x.
If I first define the NAs and then convert to factor, everything works as expected and only the defined factor levels are plotted.
In that case summary(df) does not show -999 for the variable.
These issues do not occur with the other variables that I also define as factors.
Code sample:
df <- read.csv("C:/Users/Jeremias/Desktop/test.csv")
df[df == -999] <- NA
df$x <- as.factor(df$x)
mean(df[df$y > 1, "y"], na.rm = TRUE)
boxplot(y ~ x, data = df, outline = FALSE)
It took me several hours to find the solution (correct order), and I would like to understand the why.
Maybe some more experienced user has an explanation for this behaviour, if this is just R specific or whatever.
As you already concluded correctly, it depends on the order of the steps. As soon as you define UrbanTrail$Geschlecht as a factor, its levels are stored as an attribute of the variable, as can be shown:
UrbanTrail <- data.frame(Geschlecht = c(1,2,2,1,1,2,1,1,2,-999),
Wohungsgroesse = 61:70)
UrbanTrail$Geschlecht <- as.factor(UrbanTrail$Geschlecht)
attr(UrbanTrail$Geschlecht, "levels") # Attributes: levels "-999", "1", "2"
UrbanTrail[UrbanTrail$Geschlecht == -999, "Geschlecht"] <- NA # Even though "-999" becomes 'NA ...
attr(UrbanTrail$Geschlecht, "levels") # ... attributes remain the same: levels "-999", "1", "2"
After -999 becomes NA, the levels are not adjusted accordingly.
When you make a boxplot, boxplot() looks up the levels (just as we did in this example), finds "-999", "1" and "2", and uses all three as categories, because the levels were not modified when -999 became NA.
Probably replace_with_na will automatically modify the levels of the variable afterwards.
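If the factor has already been created, the leftover level can be removed afterwards. A minimal sketch with a small made-up vector (g, g2, raw and g3 are illustrative names only): droplevels() removes levels that no longer occur, and replacing -999 before calling factor() avoids the problem in the first place.

```r
# Convert first, set NA afterwards: the level "-999" stays behind
g <- factor(c(1, 2, 2, 1, -999))
g[g == "-999"] <- NA
levels(g)    # still contains "-999"

# droplevels() removes levels that are no longer used
g2 <- droplevels(g)
levels(g2)

# Alternative: replace -999 on the raw values, then convert
raw <- c(1, 2, 2, 1, -999)
raw[raw == -999] <- NA
g3 <- factor(raw)
levels(g3)
```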
Best regards from Leipzig
Chris
P.S.:
I can strongly recommend reading "R for Data Science"
https://r4ds.had.co.nz/factors.html
I've often had to take a vector of group indicators and wanted to create a factor out of it to explore the data more easily. I've always done this by instantiating the factor and then assigning the levels to it where group indicators are the indices of the levels (perhaps easier to see below). But seeing as factors are the least understood data type for me, I wonder if there is a simple function that will do it all for me that I'm not aware of.
# set seed so we're all on the same page
set.seed(1337)
# create the contrived vector of indices
myNumbers <- sample(x = 1:26, size = 50, replace = TRUE)
# This is how I would create the factor
myFactor <- factor(myNumbers) # step 1
levels(myFactor) <- letters # step 2
# Inspect the result
myFactor
You can specify levels when creating the factor from a vector.
foo = factor(x = letters[myNumbers], levels = letters)
length(levels(foo))
#[1] 26
If you don't specify levels when creating a factor, they are derived automatically from the sorted unique values of the vector:
length(levels(myFactor)) #before step 2
#[1] 21
This means that, before step 2, the integer codes in myFactor range from 1 to 21 (range(as.numeric(myFactor))). As a result, even though you intended to use indices from 1:26, you are actually using indices from 1:21, so each letter is assigned by the rank of a value among the observed values rather than by the value itself.
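If you would rather keep the numeric vector as input, a one-step sketch is to pass both levels and labels, so that value 1 maps to "a", value 2 to "b" and so on, whether or not every value was sampled:

```r
set.seed(1337)
myNumbers <- sample(x = 1:26, size = 50, replace = TRUE)

# levels = all possible group indicators, labels = their names
myFactor <- factor(myNumbers, levels = 1:26, labels = letters)

nlevels(myFactor)
# every element carries the letter that belongs to its number
identical(as.character(myFactor), letters[myNumbers])
```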
I'm trying to specify categorical subgroups. I found a source which suggests you can simply use this layout:
wilcox.test(growth ~ sugar, data= carbs, subset= sugar %in% c("test", "C"))
However, on my dataset it doesn't work, though the same format works if I convert the groups to numerical values in Excel.
wilcox.test(Distance~Application, data= walking.dat,
subset = Application %in% c("Control", "Cue-Lure"))
Error in wilcox.test.formula(Distance ~ Application, data = walking.dat, :
grouping factor must have exactly 2 levels
Any suggestions would be great.
Thanks!
It is amazing you can still see my deleted comments. I made two comments earlier, pointing out two possible issues.
issue 1:
It is highly likely that there is no "Control" or "Cue-Lure" in walking.dat$Application. I would suggest you try
with(walking.dat, unique(Application[Application %in% c("Control", "Cue-Lure")]))
to see what you get. Possibly you either get a single element, or nothing.
I can easily reconstruct the error you encountered. Consider the built-in R dataset airquality.
data(airquality)
unique(airquality$Month) ## 5 6 7 8 9
wilcox.test(Ozone ~ Month, data = airquality, subset = Month %in% c(6, 7)) ## fine
wilcox.test(Ozone ~ Month, data = airquality, subset = Month %in% c(1, 7)) ## fail
In the second case, you get an error:
Error in wilcox.test.formula(Ozone ~ Month, data = foo, subset = Month %in% :
grouping factor must have exactly 2 levels
because 1 is not an available value of Month.
issue 2
If both levels exist, then I guess your variable Application is a factor. Check class(walking.dat$Application). The problem with a factor can be seen here:
x <- factor(letters[1:4])
x[x %in% c("a", "b")]
#[1] a b
#Levels: a b c d
Note that the factor levels are not dropped after subsetting with %in%: the unused levels c and d are still there. However, if you do:
x <- as.character(x)
x[x %in% c("a", "b")]
#[1] "a" "b"
Although you get characters, the formula method will coerce it into factors automatically. In this way, there is no danger that additional unused factor levels could break wilcox.test().
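Another option, if you want to keep the factor, is to drop the unused levels explicitly. This is only a sketch on airquality with Month converted to a factor for the sake of the example (aq, sub, n_before and res are illustrative names); whether unused levels actually trip wilcox.test() may depend on your R version, but after droplevels() the grouping factor has exactly two levels either way.

```r
aq <- airquality
aq$Month <- factor(aq$Month)               # levels "5" "6" "7" "8" "9"

sub <- subset(aq, Month %in% c("6", "7"))
n_before <- nlevels(sub$Month)             # subsetting keeps all five levels

sub$Month <- droplevels(sub$Month)         # now only "6" and "7" remain
nlevels(sub$Month)

# exact = FALSE avoids the warning about ties in Ozone
res <- wilcox.test(Ozone ~ Month, data = sub, exact = FALSE)
res$p.value
```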
I'm having trouble assigning value labels from lists to numeric variables. I've got a dataset (in form of a list()) containing eleven variables. The first five variables each have individual value levels, the last six each use the same 1-5 scale. I created lists with value labels for each of the first five variables and one for the scale. Now I would like to automatically assign the labels from those lists to my variables.
I've put my eleven variables in a list to be able to use mapply().
Here's an example of my current state:
# Example variables:
a <- c(1,2,3,4) # individual variable a
b <- c(1,2,2,1) # individual variable b
c <- c(1,2,3,4,5) # variable c using the scale
d <- c(1,2,3,4,5) # variable d also using the scale
mydata <- list(a,b,c,d)
# Example value labels:
lab.a <- c("These", "are", "value", "labels")
lab.b <- c("some", "more")
lab.c <- c("And", "those", "for", "the", "scale")
labels.abc <- list(lab.a, lab.b, lab.c)
# Assigning labels in two parts
part.a <- mapply(function(x,y) factor(as.numeric(x), labels = y, exclude = NA), mydata[1:2], labels.abc[1:2])
part.b <- mapply(function(x,y) factor(as.numeric(x), labels = y, exclude = NA), mydata[3:4], labels.abc[3])
Apart from not being able to combine the two parts, my major problem is the output format. mapply() returns the result as a matrix, whereas I need a list containing the individual variables again.
So, my question is: How can I assign the value labels in an automated procedure and as the result again get a list of variables, which now contain labeled information instead of numerics?
I'm quite lost here. Is my approach with mapply() generally doable, or am I completely on the wrong track?
Thanks in advance! Please comment if you need further information.
Problem solved!
Thanks @agstudy for pointing out the SIMPLIFY = FALSE argument, which prevents mapply() from reducing the result to a matrix.
The correct code is
part.a <- mapply(function(x,y) factor(as.numeric(x), labels = y, exclude = NA), mydata[1:2], labels.abc[1:2], SIMPLIFY = FALSE)
This provides exactly the same format of output as was put in.
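For the remaining part of the question (combining both parts), one possible sketch is to build a label list with one entry per variable, repeating the scale labels where needed, and to use Map(), which is mapply() with SIMPLIFY = FALSE. The names labels.all and labelled are made up for this example, and I pass levels explicitly so that labels line up even if a value does not occur in a variable.

```r
a <- c(1, 2, 3, 4)
b <- c(1, 2, 2, 1)
c <- c(1, 2, 3, 4, 5)
d <- c(1, 2, 3, 4, 5)
mydata <- list(a, b, c, d)

lab.a <- c("These", "are", "value", "labels")
lab.b <- c("some", "more")
lab.c <- c("And", "those", "for", "the", "scale")

# one label vector per variable; the scale labels are reused for c and d
labels.all <- list(lab.a, lab.b, lab.c, lab.c)

# Map() always returns a list, so the input format is preserved
labelled <- Map(
  function(x, lab) factor(x, levels = seq_along(lab), labels = lab),
  mydata, labels.all
)
labelled[[2]]
```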
I have two dataframes (DfA and DfB). Each dataframe has three factor variables: species, type and region. DfA also has a numeric value column, and I want to use it to estimate numeric values in a new column of DfB, based on shared attributes.
I have a function which asks for the species, type and region, then creates a subset of DfA with those attributes and runs an algorithm on the subset to estimate the new value. When I run the function and specify the values manually as a test, it works fine.
If all of the factor levels and combinations in DfB have matching factors in DfA, the function works fine with mapply. But if any row in DfB contains a factor level that is not present in DfA, I get an error (level sets of factors are different). Example: if DfA includes data for regions A,B and C, and DfB contains data for regions A,B,C and D, mapply returns the error; if I remove the rows with region D, the mapply function works.
How can I specify that, if a row contains a factor level that makes the function impossible to run, it should be skipped or given NA, so that the function still runs on the rows for which it works?
You can drop/add levels to your data.frames to make sure your function works rather than cater for a special case:
# dropping and setting levels
Z = as.factor(sample(LETTERS[1:5],20,replace=TRUE))
levels(Z)
Y = Z[!(Z %in% LETTERS[4:5])] # remove the values D and E; the levels are kept
levels(Y)
Y=droplevels(Y) # drop the levels
levels(Y)
levels(Y) = levels(Z) # bring them back
levels(Y)
Y = factor(Y,levels=LETTERS[1:7]) # expand them
levels(Y)
attr(Y,"levels")
attr(Y,"levels") = LETTERS[1:8] # keep expanding them
levels(Y)
require(plyr)
Y = mapvalues(Y,levels(Y),letters[1:length(levels(Y))]) # change the labels of the levels
levels(Y)
x<-factor(Y, labels=LETTERS[(length(unique(Y))+1):(2*length(unique(Y)))]) # change the labels of the levels on another variable
In your case:
dfa = data.frame("LVL1"=as.factor(sample(LETTERS[1:2],20,replace=T)))
dfb = data.frame("LVL2"=as.factor(sample(LETTERS[2:5],20,replace=T)))
newLevels = sort(unique(union(levels(dfa$LVL1),levels(dfb$LVL2))))
dfa$LVL1 = factor(dfa$LVL1,levels=newLevels)
dfb$LVL2 = factor(dfb$LVL2,levels=newLevels)
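If you really do want NA for the rows that cannot be computed, a different technique from the one above is to wrap the function in tryCatch(). The following is only a sketch: estimate_value() is a hypothetical stand-in for your real function, and the small DfA/DfB here are made-up data with a single region column.

```r
DfA <- data.frame(region = c("A", "B", "C"), value = c(10, 20, 30),
                  stringsAsFactors = FALSE)
DfB <- data.frame(region = c("A", "D", "C"), stringsAsFactors = FALSE)

# Hypothetical estimation function: fails when DfA has no matching rows
estimate_value <- function(region) {
  sub <- DfA[DfA$region == region, ]
  if (nrow(sub) == 0) stop("no matching rows in DfA for region ", region)
  mean(sub$value)
}

# Wrapper: any error is turned into NA, so mapply() keeps going
safe_estimate <- function(...) {
  tryCatch(estimate_value(...), error = function(e) NA_real_)
}

DfB$estimate <- mapply(safe_estimate, DfB$region, USE.NAMES = FALSE)
DfB
```

Note that this hides every error, not only the level mismatch, so harmonising the levels as shown above is usually the cleaner fix.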