R - Change column variable from categorical value to nominal

I have a CSV dataset where a column X has integer values from 1 to 4, which I would like to replace with ["Low","Medium Low","Medium High","High"] according to the value. So dataset$X would then be a vector of those categories instead of a vector of numbers.
I've checked this example, but it seems like a complicated version of what I'm trying to do (since it maps fixed values to fixed categories, there should be an easier and cleaner way). Any suggestion on how to do it?
PS: I first tried it with "levels" and "cut", but since each value is one fixed number and not a range, neither worked properly.

You can use X to subset your categorical vector.
dataset$X <- c("Low","Medium Low","Medium High","High")[dataset$X]
dataset
#             X
# 1         Low
# 2  Medium Low
# 3 Medium High
# 4        High
Data:
dataset <- data.frame(X=1:4)
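If you would rather end up with an ordered factor than a character vector, here is a minimal sketch of the same mapping using factor(), starting again from the data above:
dataset <- data.frame(X = 1:4)
# map the codes 1-4 onto labelled, ordered levels in one step
dataset$X <- factor(dataset$X, levels = 1:4,
                    labels = c("Low","Medium Low","Medium High","High"),
                    ordered = TRUE)
dataset$X
# [1] Low         Medium Low  Medium High High
# Levels: Low < Medium Low < Medium High < High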

Related

How to label CCA-Plot with row.names in R

I've been trying to solve the following problem which I am sure is an easy one (I am just not able to find a solution). I am using the package vegan and want to perform a cca that shows the actual row names as labels (instead of the default "sit1", "sit2", ...).
I created a dataframe (ls_Treat1) with cast(), showing plot treatments (AB, DB, DL etc.) as row names and species occurrences. The dataframe looks as follows:
   species 1  species 2  species 3
AB         0          3          1
DB         1          6          0
DL         3          4          2
I created the data frame with the following code to set the treatments (AB, DB, DL, ...) as row names:
ls_Treat1 <- cast(fungi_ls, Treatment ~ species)
row.names(ls_Treat1) <- ls_Treat1$Treatment
ls_Treat1 <- ls_Treat1[,-1]
When I perform a cca with the following code:
ca <- cca(ls_Treat1)
plot(ca,display="sites")
R puts the default labels "sit1", "sit2", ... into the plot, instead of the actual row names, even though I have performed it this way before and the plots normally showed the right labels. Does this have anything to do with my creating the data frame? I tried to change the treatments (characters) into numbers (integers or factors) but still, the plot won't be labelled with my row names.
Can anyone help me with this?
Thank you very very much!!
The problem is that reshape::cast() does not produce a data.frame but something else. It claims to be a data.frame, but it is not. We do matrix algebra in cca and therefore cast the input to a matrix, which works for a standard data.frame but does not work with the object you supplied as input. In particular, when you remove the first column with ls_Treat1 <- ls_Treat1[,-1], you also remove the attributes that allow preserving names – it would have worked without removing this column (if the reshape package was still loaded). It seems that upgrading to the reshape2 package and using reshape2::acast() can be a solution.
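A minimal sketch of that fix, assuming the same fungi_ls data and column names as in the question:
library(reshape2)
library(vegan)
# acast() returns a true matrix with treatments as row names,
# so cca() can pick the site labels up directly
ls_Treat1 <- acast(fungi_ls, Treatment ~ species)
ca <- cca(ls_Treat1)
plot(ca, display = "sites")  # sites now labelled AB, DB, DL, ...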

What is the best way to treat labelled variables imported with haven?

I have about 15 SPSS election studies files saved as .sav files. My group and I will be recoding about 10 variables for each study to run some logistic regressions.
I have used haven to import all the files, so it looks like all the variables are of the haven_labelled class.
I have always been a little confused about how to handle this class of variables; however, I have observed a lot of improved performance as the haven and labelled packages have been updated, so I'm inclined to keep using them as opposed to, e.g., rio or foreign.
But I want to get a sense of what best practices should be before we start this effort so we don't look back with regret.
Each study file has about 200 variables, with a mix of factors and numeric variables. But to start, I'm wondering how I should go about recoding the sex variable so that I end up with a variable male where 1 is male and 0 is not.
One thing I want to ask about is the car::Recode() way of recoding variables as opposed to the dplyr::recode() way. I personally find the dplyr::recode() syntax very clunky and the help documentation poor. I am also not sure about the best way to set missing values.
To be specific, I think I have three specific questions.
Question 1: is there a compelling reason to use dplyr::recode as opposed to car::Recode? My own answer is that car::Recode() looks sufficient and easy to use.
Question 2: Should I make a point of converting variables to factors or numeric, or will I be OK leaving variables as haven_labelled with updated value labels? I am concerned about this quote from the haven documentation about the labelled class: "This class provides few methods, as I expect you'll coerce to a standard R class (e.g. a factor()) soon after importing."
However, maybe the haven_labelled class has been improved and is sufficiently different from the labelled class that it is no longer necessary to force conversion to other standard R classes.
Question 3: is there any advantage to setting missing values with the labelled (e.g. na_range(), na_values()) rather than with the car::Recode() method ?
My inclination is that there are clear disadvantages to using the labelled methods and I should stick with the car::Recode() method.
Thank you.
#FAKE DATA
library(labelled)
var1<-labelled(rep(c(1,5), 100), c(male = 1, female = 5))
var2<-labelled(sample(c(1,3,5,7,8,9), size=200, replace=T), c('strongly agree'=1, 'agree'=3, 'disagree'=5, 'strongly disagree'=7, 'DK'=8, 'refused'=9))
#give variable labels
var_label(var1)<-'Respondent\'s sex'
var_label(var2)<-'free trade is a good thing'
df<-data.frame(var1=var1, var2=var2)
str(df)
#This works really well; and I really like this.
look_for(df, 'sex')
look_for(df, 'free trade')
#the Car way
df$male<-car::Recode(df$var1, "5=0")
#Check results
df$male
#value labels are still there, so would have to be removed or updated
as_factor(df$male)
#Remove value labels
val_labels(df$male)<-NULL
#Check
class(df$male) #left with a numeric variable
#The other car way, keeping and modifying value labels
df$male2<-car::Recode(df$var1, "5=0")
df$male2
val_label(df$male2, 0)<-c('female')
val_label(df$male2, 5)<-NULL
val_labels(df$male2)
#Check class
class(df$male2)
#Can run numeric functions on it
mean(df$male2)
#easily convert to factor
as_factor(df$male2)
#How to handle missing values
#The CAR way
#use car to set missing values to NA
df$free_trade <- car::Recode(df$var2, "8=NA; 9=NA")
#Check class
class(df$free_trade)
#can still run numeric functions on haven_labelled
mean(df$free_trade, na.rm=T)
#table
table(df$free_trade)
#did the na recode work?
table(is.na(df$free_trade))
#check value labels
val_labels(df$free_trade)
#set missing values the labelled way
table(df$var2)
na_values(df$var2)<-c(8,9)
#check
df$var2
#but table() does not pick up 8 and 9 as missing
table(df$var2)
#this seems to not work very well
table(to_factor(df$var2))
to_factor(df$var2)
A bit late in the game, but still some answers:
Should I make a point of converting variables to factors or numeric or will I be OK, leaving variables as haven_labelled with updated value labels?
First, you need to understand that haven_labelled vectors are all of numeric type (i.e. they will be treated as continuous variables), which you can easily check with:
library(tidyverse)
df %>%
  as_tibble() %>%
  head()
which gives:
# A tibble: 6 x 2
  var1       var2
  <dbl+lbl>  <dbl+lbl>
1 1 [male]   5 [disagree]
2 5 [female] 5 [disagree]
3 1 [male]   3 [agree]
4 5 [female] 5 [disagree]
5 1 [male]   7 [strongly disagree]
6 5 [female] 9 [refused]
Whether you should convert to a standard type probably depends on your analysis.
For simple frequency tables it's probably fine to leave as is, e.g.
df %>%
  as_tibble() %>%
  count(var1)
# A tibble: 2 x 2
  var1           n
  <dbl+lbl>  <int>
1 1 [male]     100
2 5 [female]   100
However, for any analysis that is type sensitive (this already starts with calculating means, but also regression etc.) you definitely should convert your variables to an appropriate class for your analyses. Not doing so and treating everything as continuous will give you wrong results. Just think about a truly categorical variable like 1=Bus, 2=Car, 3=Bike that you'd throw into a linear regression.
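To make that concrete, a small sketch with invented data (the numbers here are hypothetical):
# a categorical travel-mode code treated two ways
mode <- c(1, 2, 3, 1, 2, 3)        # 1=Bus, 2=Car, 3=Bike
y    <- c(10, 20, 15, 12, 22, 14)
lm(y ~ mode)          # wrong: fits a single slope across the codes
lm(y ~ factor(mode))  # right: estimates one coefficient per category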
Is there a compelling reason to use dplyr::recode as opposed to car::Recode?
There is no right or wrong here. Personally, I have a preference for staying within the tidyverse, because it has easy recode functions, e.g. the recode you mentioned, but for more complex tasks you can also use if_else or case_when. You also have lots of functions to deal with missings, like replace_na, na_if, or coalesce. The syntax of car::Recode isn't much different from dplyr's, so it's really mostly personal preference, I'd say.
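For instance, a minimal tidyverse sketch on the fake data from the question (the recoding rules mirror the car examples above; the new column names are mine):
library(dplyr)
df <- df %>%
  mutate(
    male = if_else(var1 == 5, 0, 1),                    # 5 (female) -> 0
    free_trade = na_if(na_if(as.numeric(var2), 8), 9)   # DK/refused -> NA
  )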
The same is true for your question about whether to use the functions from labelled or not. The labelled package indeed adds some very powerful functions for dealing with labelled vectors that go beyond what haven or the tidyverse offer, so IMO it's a good package to use.

R programming using the cut() function to split a variable into 3 classes

I have a dataframe called wine with several variables, among which is the variable Spice of integer type. I would like to split this variable (Spice) using the cut() function into 3 classes (<2; between 2 and 2.5; >3).
Assuming you mean spice is numeric (and not integer, which could only fall between 2 and 2.5 if it were exactly 2), it might be hard to do with just cut() if you're trying to be both left and right inclusive.
You can get close with something like
# left-inclusive bins: [0,2), [2,2.5), [2.5,1e6)
dat <- data.frame(spice = 5 * runif(100))
dat$lvl <- cut(dat$spice, breaks = c(0, 2, 2.5, 1e6), right = FALSE)
dat$lvl <- as.factor(as.numeric(dat$lvl))  # relabel the bins as 1, 2, 3
though if the value for spice is exactly 2.5 it will be placed in group 3 instead of group 2.
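If you really need the middle class closed on both ends, a small sketch that builds the groups directly instead of relying on cut():
# nested ifelse() keeps exactly 2.5 in group 2: (-Inf,2) / [2,2.5] / (2.5,Inf)
dat$lvl2 <- as.factor(ifelse(dat$spice < 2, 1,
                             ifelse(dat$spice <= 2.5, 2, 3)))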

Missing rows after subsetting datatable on a single column

I have a datatable, DT, with columns A, B and C. I want only one A per unique B, and I want to choose that A based on the value of C (choose the largest C).
Based on this (incredibly helpful) SO page, Use data.table to get first of subgroup based on a variable, I tried something like this:
test <- data.table(A=c(1:5,1:3), B=c(1:8), C=c(11:18))
setkey(test,A,C)
test[,.SD[.N],by="A"]
In my test case, this gives me an answer that seems right:
#    A B  C
# 1: 1 6 16
# 2: 2 7 17
# 3: 3 8 18
# 4: 4 4 14
# 5: 5 5 15
And, as expected, the number of rows matches the number of unique entries for "A" in my DT:
length(unique(test$A))
# 5
However, when I apply this to my actual dataset, I am missing approximately 20% of my initially ~2 million rows.
I cannot seem to put together a test dataset that will recreate this type of a loss. There are no null values in the actual dataset. What else could be a factor in a dataset that would cause a discrepancy between the number of results from something like test[,.SD[.N],by="A"] and length(unique(test$A))?
Thanks to @Eddi's debugging coaching, here's the answer, at least for my dataset: differential handling of numbers in scientific notation.
In particular: In my actual dataset, columns A and B were very long numbers that, upon import from SQL to R, had been imported in scientific notation. It turns out the test[,.SD[.N],by="A"] and length(unique(test$A)) commands were handling this differently: length(unique(test$A)) was preserving the difference between two values that differed only in a small digit that is not visible in the collapsed scientific notation format printed as visual output, but test[,.SD[.N],by="A"] was, in essence, rounding the values and thus collapsing some of them together.
(I feel foolish that I didn't catch this myself before posting, but much appreciate the help - I hope somehow this spares someone else the same confusion, perhaps!)
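A sketch of the mechanism in isolation (setNumericRounding() is the data.table setting that controls how many low-order bytes of a double are ignored when grouping; older versions defaulted to 2, so whether the collapse below actually happens depends on your version):
library(data.table)
x <- c(1234567890123456, 1234567890123457)  # differ only in the last digit
print(x)                     # both display as 1.234568e+15 by default
length(unique(x))            # 2: base R keeps them distinct
setNumericRounding(2)        # emulate the old grouping tolerance
dt <- data.table(A = x, C = 1:2)
dt[, .SD[.N], by = "A"]      # the two values should collapse into one group
setNumericRounding(0)        # restore exact grouping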

Easy or default way to exclude rows with NA values from individual operations and comparisons

I work with survey data, where missing values are the rule rather than the exception. My datasets always have lots of NAs, and for simple statistics I usually want to work with cases that are complete on the subset of variables required for that specific operation, and ignore the other cases.
Most of R's base functions return NA if there are any NAs in the input. Additionally, subsets using comparison operators will return a row of NAs for any row with an NA on one of the variables. I literally never want either of these behaviors.
I would like for R to default to excluding rows with NAs for the variables it's operating on, and returning results for the remaining rows (see example below).
Here are the workarounds I currently know about:
Specify na.rm=T: Not too bad, but not all functions support it.
Add !is.na() to all comparison operations: Works, but it's annoying and error-prone to do this by hand, especially when there are multiple variables involved.
Use complete.cases(): Not helpful because I don't want to exclude cases that are missing any variable, just the variables being used in the current operation.
Create a new data frame with the desired cases: Often each row is missing a few scattered variables. That means that every time I wanted to switch from working with one variable to another, I'd have to explicitly create a new subset.
Use imputation: Not always appropriate, especially when computing descriptives or just examining the data.
I know how to get the desired results for any given case, but dealing with NAs explicitly for every piece of code I write takes up a lot of time. Hopefully there's some simple solution that I'm missing. But complex or partial solutions would also be welcome.
Example:
> z <- data.frame(x=c(413,612,96,8,NA), y=c(314,69,400,NA,8888))
# current behavior:
> z[z$x < z$y ,]
      x   y
3    96 400
NA   NA  NA
NA.1 NA  NA
# Desired behavior:
> z[z$x < z$y ,]
   x   y
3 96 400
# What I currently have to do in order to get the desired output:
> z[(z$x < z$y) & !is.na(z$x) & !is.na(z$y) ,]
   x   y
3 96 400
One trick for dealing with NAs in inequalities when subsetting is to do
z[which(z$x < z$y),]
#    x   y
# 3 96 400
The which() silently drops NA values.
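For completeness, subset() behaves the same way, since its documentation states that missing values in the condition are taken as false:
subset(z, x < y)
#    x   y
# 3 96 400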
