R - Factor-to-Numeric Conversion using pattern matching

I have seen many questions on here addressing the issue of converting factors to numeric variables but none seem to address what I am trying to do.
I want to create a new column in a dataframe that contains numeric representations of an existing factor. I tried:
df$num = as.numeric(df$factor)
Which converted the factors but did not order them as needed. How can I define each factor's numeric value explicitly? Something along the lines of:
df$num = ("1" if factor == "GB", "2" if factor == "YT", "3" if factor == "BF")

This assumes you have a factor variable num with 3 levels (GB, YT, BF) that you want to analyze as numeric.
I solved a similar problem by converting to character first, e.g.:
df$num <- as.character(df$num)
Then recode to numeric values (recode() with this syntax is the dplyr function):
df$num <- recode(df$num, "GB" = 1, "YT" = 2, "BF" = 3)
It's not elegant, but this strategy worked for my similar problem.
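If the only goal is to control which number each level gets, a lookup by level name may be shorter than recoding. The following is a minimal base R sketch; the data frame `df` and its column `grp` are made-up stand-ins for the asker's data, not taken from the question.

```r
# Hypothetical example data; `grp` stands in for the factor column.
df <- data.frame(grp = factor(c("YT", "GB", "BF", "GB")))

# Option 1: a named lookup vector, indexed by the level names.
codes <- c(GB = 1, YT = 2, BF = 3)
df$num <- unname(codes[as.character(df$grp)])

# Option 2: re-declare the factor with the desired level order,
# then take the underlying integer codes.
df$num2 <- as.integer(factor(df$grp, levels = c("GB", "YT", "BF")))
```

Both options leave the original factor column untouched, and any value missing from the lookup or the `levels` vector becomes NA rather than silently receiving a code.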

Related

I want to revalue multiple variables at the same time in R

For my thesis research I am trying to revalue multiple variables/columns at the same time. I have tried using the following function, but this would require me to specify each column separately:
full_df$CR39 <- revalue(full_df$CR39, c("1"="0", "2" ="1"))
I have got around 65 variables (named CR00, CR01, CR02, ...) that I want to recode. The value "1" must become "0" and the value "2" must become "1". I also have variables named CR00FAM, CR01FAM, CR02FAM, ... which I do not wish to revalue at the same time.
I have tried using the 'select' function, but this does not seem to help: full_df%>% select(starts_with("DF"), -contains("FAM")).
Does anyone know a possible solution? I have searched a lot of stackoverflow topics, but none of the proposed solutions worked out for me.
We can loop over the variables to do this. Select the columns of interest with a regex, i.e. column names that start (^) with 'CR' followed by one or more digits (\\d+) up to the end ($) of the string. Loop over the selected columns with lapply, apply revalue, and assign the output back to the selected columns of the dataset.
library(plyr)  # provides revalue()
nm1 <- grep("^CR\\d+$", names(full_df), value = TRUE)
full_df[nm1] <- lapply(full_df[nm1], function(x) revalue(x, c("1" = "0", "2" = "1")))
Or using dplyr
library(dplyr)
full_df <- full_df %>%
  mutate(across(matches("^CR\\d+$"),
                ~ plyr::revalue(.x, c("1" = "0", "2" = "1"))))
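For comparison, the same column selection and recoding can be sketched in base R without plyr or dplyr. The small `full_df` below is invented for illustration (two CR columns and one FAM column), and the `lookup` vector is an assumed helper, not part of the original answer.

```r
# Hypothetical data: two CR columns to recode and one FAM column to leave alone.
full_df <- data.frame(
  CR00    = c("1", "2", "2"),
  CR01    = c("2", "1", "2"),
  CR00FAM = c("1", "2", "1"),
  stringsAsFactors = FALSE
)

# Names that are exactly "CR" followed by digits; CR00FAM does not match.
cols <- grep("^CR\\d+$", names(full_df), value = TRUE)

# Named lookup: old value -> new value. Values missing from it become NA.
lookup <- c("1" = "0", "2" = "1")
full_df[cols] <- lapply(full_df[cols], function(x) unname(lookup[as.character(x)]))
```

Unlike revalue, this lookup turns any value other than "1" or "2" into NA, which may or may not be what is wanted.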

R: boxplots include -999 which were defined as NA -> dependent on order of factor declaration and NA declaration

Situation:
.csv file which contains the following:
x,y,z
1,2,3
-999,2,4
2,-999,4
2,4,-999
following tasks:
format variables correctly (factors)
define "-999" as NA
calculate mean size > A
create some boxplots
Issue:
If I am using the function replace_with_na_all (https://cran.r-project.org/web/packages/naniar/vignettes/replace-with-na.html) the calculation of the mean size will throw me this error for the mean calculation:
argument is not numeric or logical: returning NA
The boxplots look fine though.
If I am using the integrated NA declaration df[df == -999] <- NA the calculation of the mean values works well.
But if I first convert the variables with as.factor and define the NAs afterwards, the boxplot shows a category for "-999", and only for the variable "x".
Also the summary(df) command shows -999:0 for the variable x.
If I first define the NAs and then convert to factor, everything is as expected and only the defined factor levels are plotted.
The summary(df) function will not show -999 for the variable.
These issues do not happen with other variables which I define as factors too.
Code sample:
df <- read.csv("C:/Users/Jeremias/Desktop/test.csv")
df[df == -999] <- NA
df$x <- as.factor(df$x)
mean(df[df$y > 1, "y"], na.rm = TRUE)
boxplot(y ~ x, data = df, outline = FALSE)
It took me several hours to find the solution (correct order), and I would like to understand the why.
Maybe some more experienced user has an explanation for this behaviour, if this is just R specific or whatever.
As you already concluded correctly, it depends on the (correct) order. As soon as you define UrbanTrail$Geschlecht as a factor, its levels are saved as an attribute of the variable, as can be shown:
UrbanTrail <- data.frame(Geschlecht = c(1,2,2,1,1,2,1,1,2,-999),
Wohungsgroesse = 61:70)
UrbanTrail$Geschlecht <- as.factor(UrbanTrail$Geschlecht)
attr(UrbanTrail$Geschlecht, "levels") # Attributes: levels "-999", "1", "2"
UrbanTrail[UrbanTrail$Geschlecht == -999, "Geschlecht"] <- NA # Even though "-999" becomes NA ...
attr(UrbanTrail$Geschlecht, "levels") # ... attributes remain the same: levels "-999", "1", "2"
After -999 becomes NA its levels are not adjusted accordingly.
If you make a boxplot, boxplot will look for the levels (just as we did in this example) and find "-999", "1" and "2" and will use these as categories, as the levels are not modified after -999 becomes NA.
Probably replace_with_na will automatically modify the levels of the variable afterwards.
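The behaviour described above can be reproduced on a tiny data frame, and base R's droplevels() removes the unused level once the values are NA. This is only a sketch: the data below is made up to resemble the question's csv, and `levels_before` and `levels_after_na` are illustrative names.

```r
# Hypothetical data resembling the question's csv.
df <- data.frame(x = c(1, -999, 2, 2), y = c(2, 2, -999, 4))

# Convert to factor first: "-999" is recorded as a level.
df$x <- as.factor(df$x)
levels_before <- levels(df$x)

# Setting the values to NA does not touch the levels attribute.
df$x[df$x == "-999"] <- NA
levels_after_na <- levels(df$x)

# droplevels() discards levels that no longer occur in the data.
df$x <- droplevels(df$x)
```

After the last line, boxplot and summary only see the levels "1" and "2", which matches what happens when the NAs are defined before the conversion to factor.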
Best regards from Leipzig
Chris
P.S.:
I can strongly recommend reading "R for Data Science"
https://r4ds.had.co.nz/factors.html

How to label factors but still preserve its original levels' values - R

I'll divide this question into two parts, being the first a general question, and the second a specific one.
First - I would like to know if there is a way to label numeric factors but still keep their original numeric levels. This is especially confusing since I realised that when we pass a labels argument to factor, the labels become the factor's levels, for example:
x<- factor(c(1,2,3, 2, 3, 1, 2), levels = c(1, 2, 3), labels = c("a", "b", "c"))
levels(x)
#[1] "a" "b" "c"
labels(x)
#[1] "1" "2" "3" "4" "5" "6" "7"
I would like to know if there is a way, as there is in Stata, to label the categories of a factor. I want to be able to sum x while its elements show as "a", "b" or "c", but keep the values 1, 2, or 3.
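One possible workaround, sketched here under the assumption that the original values are kept in a separate vector, is to index that vector by the factor's integer codes. The names `vals`, `labs`, and `original` are illustrative and not part of the question's code.

```r
# Original numeric values and their labels, kept side by side.
vals <- c(1, 2, 3)
labs <- c("a", "b", "c")

# The factor displays the labels ...
x <- factor(c(1, 2, 3, 2, 3, 1, 2), levels = vals, labels = labs)

# ... while as.integer(x) gives the position of each element's level,
# which can be used to look the original value up again.
original <- vals[as.integer(x)]
total <- sum(original)
```

This works for any set of values, including 0/1 dummies, because the lookup goes through the position of the level rather than through the label text.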
Second- I'm asking this because I have a very large data set which has columns with numeric categories. This data set comes with a dictionary in xlsx which I read and treat into R, so each column has its numeric categories and their respective labels. I'm attempting to read the dictionary, create a list of categories and labels inside a list of columns and then read the data set, loop through the columns and label the variables. These labels are important so I don't have to look at the dictionary every time I have to interpret something on the data set. And the numeric levels are important because since I have a lot of dummy variables (yes or no variables) I want to be able to sum them.
Here's my code (I use the data.table package):
dic<- readRDS(dictionary_filename)
# Reading data set #
data <- fread(dataset_filename, header = T, sep = "|", encoding = "UTF-8", na.strings = c("NA", ""))
# Treating the data.set #
# Identifying which lines of the dictionary have categorized variables. This is very specific to my dictionary structure #
index<- which(!is.na(dic$num.categoria))
# storing the names of columns that have categorized variables #
names_var<- dic$`Var name`[index]
names_var<- names_var[!is.na(names_var)]
# Creating a data frame with categorized variables which will be later split into lists #
df<- as.data.frame(dic[index,])
# Transforming the index column to factor so it is possible to split the data frame into a list with sublists for each categorized column #
df$N<- as.factor(df$N)
# Splitting the data frame to list
lst<- split(df, df$N)
# Creating a labels list and a levels list #
lbs<- list()
lvs<- list()
for (i in seq_along(lst)){
lbs[[i]]<- as.vector(lst[[i]]$category)
lvs[[i]]<- as.vector(lst[[i]]$category.number)
}
# Changing the data set columns into factors with their respective levels and labels #
k<- 1
for (var in names_var){
set(data, j =var, value = factor(data[[var]], levels = lvs[[k]], labels = lbs[[k]]))
k<- k +1
}
I realize the code is a bit abstract since I don't provide the data set or the dictionary, but it is just so you can get an idea. My code works: it runs with no error and does what I hoped it would do (all the categorized columns now show their labels, for example "yes" or "no" where before it was 1 or 0). The exception is that I can no longer access the original numbers in the levels, which I need in the next part of my project.
It would be preferable if there is a general way of doing so, since I run this code in a function, with many columns with different data sets and different dictionaries. Is there a way to accomplish this?
PS.: I have read the documentation in R and the answers to those questions:
Factor, levels, and original values
Having issues using order function in R
But unfortunately I wasn't able to figure it out by myself, it just became obvious that using the "labels" argument in "factor" was not the way to get it done.
Thank you so much!

Comparison of character type and factor type in R

OK, so I am having this issue right now. I have a matrix A whose rownames are the values of a field in another matrix B, and I want to find the indices of those rownames in B. I am trying which(A$field == rowname_A), but two things get in the way. First, rowname_A is of class character and has the format "X12345". Second, the values of A$field are of type factor. Is there a way to remove the prepended X from the character, convert it to factor, and do the comparison? Or to convert the factor values of A$field to character and then do the comparison?
Help will be appreciated.
Thanks.
This is fairly straightforward. The example below should help you out.
A <- matrix(1:3)
rownames(A) <- paste0("X", 1:3)
B <- data.frame(field = factor(1:3))
# Remove "X" from rownames(A) and check equality
B$field %in% substr(rownames(A), 2, nchar(rownames(A)))
# Add "X" to B$field and check equality
paste0("X", B$field) %in% rownames(A)
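The `%in%` checks above return TRUE/FALSE; if row indices are needed, as in the question's which() attempt, match() may be a closer fit. A small sketch follows, in which B$field is deliberately shuffled so the positions are not trivial; the variable `idx` is an illustrative name.

```r
A <- matrix(1:3)
rownames(A) <- paste0("X", 1:3)
B <- data.frame(field = factor(c(3, 1, 2)))

# Strip the leading "X" and compare as character, so the factor's
# internal integer codes never enter the comparison.
idx <- match(sub("^X", "", rownames(A)), as.character(B$field))
```

Here `idx[i]` is the row of B whose field equals the i-th rowname of A, or NA if there is no such row.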

how to force model.matrix to use all levels of 2 categorical variables?

Description
I have 2 categorical variables and I want to turn them into columns - for each category exactly one column
Progress
Simple code to achieve this:
d.data <- data.frame(a=as.factor(c("some1","some2","some3")), b = as.factor(c("other1","other3","other2")))
d.data.new <- data.frame(model.matrix(~a -1 + b -1, data=d.data))
names(d.data.new)
[1] "asome1" "asome2" "asome3" "bother2" "bother3"
"-1" works only for the "a" variable, which is represented by all 3 levels, but "b" has only two, and I need all 3.
I do not really understand how "-1" works in this case in the formula inside model.matrix.
Not a model.matrix solution, but you can get the binary output using mtabulate
library(qdapTools)
mtabulate(as.data.frame(t(d.data)))
Or another option would be to loop through the column names of 'd.data' and do the model.matrix separately on each column, cbind and change the column names (if required).
d1 <- do.call(cbind,lapply(names(d.data), function(i)
model.matrix(~get(i)-1, d.data)))
colnames(d1) <- sub('.*\\)', '', colnames(d1))
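A further option that stays within model.matrix is to pass full indicator contrasts for every factor through its contrasts.arg argument. The sketch below rebuilds the question's `d.data`; the names `full_contrasts` and `mm` are illustrative.

```r
d.data <- data.frame(a = factor(c("some1", "some2", "some3")),
                     b = factor(c("other1", "other3", "other2")))

# contrasts(x, contrasts = FALSE) returns an identity-like matrix with one
# column per level, so no level is dropped as a reference category.
full_contrasts <- lapply(d.data, contrasts, contrasts = FALSE)

# "- 1" removes the intercept; each factor then contributes all its levels.
mm <- model.matrix(~ a + b - 1, data = d.data, contrasts.arg = full_contrasts)
```

The resulting matrix has one column per level of each factor. It is rank deficient, which matters for fitting an unpenalized linear model but not when the columns are only used as indicator variables.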
