I'm starting to learn R and I'm having a hard time making changes to the names of values in a factor. I've tried using revalue and recode but am still seeing the original names when I look at the dataframe.
Here's what the DF looks like:
head(freecut)
gender oldness student_loaniness homeland
1 0 20 4 Eurasia
2 1 25 4 Oceana
3 1 56 2 Eastasia
4 0 65 6 Eastasia
5 1 50 5 Oceana
6 0 20 5 Eastasia
And here are the coding attempts:
revalue(freecut$homeland, c("Eastasia" = "East_Asia", "Eurasia" = "Asiope",
"Oceana" = "Nemoville"))
recode(freecut$homeland, Eastasia = "East_Asia", Eurasia = "Asiope",
Oceana = "Nemoville")
After running the code the DF looks exactly the same. I know that in Python I would have to throw in "inplace=True" to make changes stick, but I'm not sure what I need to do here (or what I'm missing).
R doesn't modify objects in place; you have to assign the result, either back to the original variable to modify it, or to a new variable. This is a paradigm of functional programming, and R is a functional programming language.
If you have x = 1, running x + 1 will evaluate and print the result, 2, but x is not changed. If you want to overwrite x with the modified value, you run x = x + 1.
In just the same way, running recode will evaluate and print a result, but if you want to modify the column in your data frame, you need to explicitly assign it with freecut$homeland = recode(...).
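As a minimal sketch of the assignment pattern, using dplyr's recode (plyr's revalue follows the same pattern):

library(dplyr)
# evaluates and prints the recoded factor, but freecut is unchanged:
recode(freecut$homeland, Eastasia = "East_Asia", Eurasia = "Asiope", Oceana = "Nemoville")
# assign the result back to actually change the column:
freecut$homeland <- recode(freecut$homeland, Eastasia = "East_Asia",
                           Eurasia = "Asiope", Oceana = "Nemoville")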
There are a few exceptions in add-on packages. For example, the data.table package defines some set* operators which do modify objects in place. data.table is fantastic, especially if you need efficiency, but if you are just starting with R I would recommend getting familiar with the basics first.
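For illustration, a minimal sketch of in-place modification with data.table (the dt object here is hypothetical):

library(data.table)
dt <- data.table(homeland = c("Eurasia", "Oceana", "Eastasia"))
# := updates the column by reference; no dt <- reassignment needed
dt[homeland == "Oceana", homeland := "Nemoville"]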
In addition to Gregor's answer, which addresses the more fundamental issues, in your particular case you can use levels<-:
levels(freecut$homeland) <- c("first", "second", "third")
# order is important if you don't want surprises
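If you would rather not rely on the level order at all, a small sketch that renames by name instead:

# rename a single level by name; the other levels are untouched
levels(freecut$homeland)[levels(freecut$homeland) == "Oceana"] <- "Nemoville"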
Or if you are ready to join the dark side, consider macros from the gtools package. The first steps are described e.g. in https://www.r-bloggers.com/macros-in-r/. Hardly anyone uses macros in R, and I don't know why. Maybe they're dangerous, but maybe they just seem obscure.
Related
I have about 15 SPSS election studies files saved as .sav files. My group and I will be recoding about 10 variables for each study to run some logistic regressions.
I have used haven to import all the files, so it looks like all the variables are of the haven_labelled class.
I have always been a little confused about how to handle this class of variables; however, I have seen a lot of improvement as the haven and labelled packages have been updated, so I'm inclined to keep using them as opposed to, e.g., rio or foreign.
But I want to get a sense of what best practices should be before we start this effort so we don't look back with regret.
Each study file has about 200 variables, with a mix of factors and numeric variables. But to start, I'm wondering how I should go about recoding the sex variable so that I end up with a variable male where 1 is male and 0 is not.
One thing I want to ask about is the car::Recode() way of recoding variables as opposed to the dplyr::recode() way. I personally find the dplyr::recode() syntax very clunky and the help documentation poor. I am also not sure about the best way to set missing values.
To be specific, I think I have three specific questions.
Question 1: is there a compelling reason to use dplyr::recode as opposed to car::Recode? My own answer is that car::Recode() looks sufficient and easy to use.
Question 2: Should I make a point of converting variables to factors or numeric or will I be OK, leaving variables as haven_labelled with updated value labels? I am concerned about this quote from the haven documentation about the labelled_class: ''This class provides few methods, as I expect you’ll coerce to a standard R class (e.g. a factor()) soon after importing''
However, maybe the haven_labelled class has been improved and is sufficiently different from the labelled class that it is no longer necessary to force conversion to other standard R classes.
Question 3: is there any advantage to setting missing values with the labelled tools (e.g. na_range(), na_values()) rather than with the car::Recode() method?
My inclination is that there are clear disadvantages to using the labelled methods and that I should stick with the car::Recode() method.
Thank you.
#FAKE DATA
library(labelled)
var1<-labelled(rep(c(1,5), 100), c(male = 1, female = 5))
var2<-labelled(sample(c(1,3,5,7,8,9), size=200, replace=T), c('strongly agree'=1, 'agree'=3, 'disagree'=5, 'strongly disagree'=7, 'DK'=8, 'refused'=9))
#give variable labels
var_label(var1)<-'Respondent\'s sex'
var_label(var2)<-'free trade is a good thing'
df<-data.frame(var1=var1, var2=var2)
str(df)
#This works really well; and I really like this.
look_for(df, 'sex')
look_for(df, 'free trade')
#the Car way
df$male<-car::Recode(df$var1, "5=0")
#Check results
df$male
#value labels are still there, so would have to be removed or updated
as_factor(df$male)
#Remove value labels
val_labels(df$male)<-NULL
#Check
class(df$male) #left with a numeric variable
#The other car way, keeping and modifying value labels
df$male2<-car::Recode(df$var1, "5=0")
df$male2
val_label(df$male2, 0)<-c('female')
val_label(df$male2, 5)<-NULL
val_labels(df$male2)
#Check class
class(df$male2)
#Can run numeric functions on it
mean(df$male2)
#easily convert to factor
as_factor(df$male2)
#How to handle missing values
#The CAR way
#use car to set missing values to NA
df$free_trade<-Recode(df$var2, "8=NA; 9=NA")
#Check class
class(df$free_trade)
#can still run numeric functions on haven_labelled
mean(df$free_trade, na.rm=T)
#table
table(df$free_trade)
#did the na recode work?
table(is.na(df$free_trade))
#check value labels
val_labels(df$free_trade)
#set missing values the labelled way
table(df$var2)
na_values(df$var2)<-c(8,9)
#check
df$var2
#but the table function does not pick up 8 and 9 as missing
table(df$var2)
#this seems to not work very well
table(to_factor(df$var2))
to_factor(df$var2)
A bit late in the game, but still some answers:
Should I make a point of converting variables to factors or numeric or will I be OK, leaving variables as haven_labelled with updated value labels?
First, you need to understand that haven_labelled vectors are all of numeric type (i.e. they will be treated as continuous variables), which you can easily check with:
library(tidyverse)
df %>%
as_tibble() %>%
head()
which gives:
# A tibble: 6 x 2
var1 var2
<dbl+lbl> <dbl+lbl>
1 1 [male] 5 [disagree]
2 5 [female] 5 [disagree]
3 1 [male] 3 [agree]
4 5 [female] 5 [disagree]
5 1 [male] 7 [strongly disagree]
6 5 [female] 9 [refused]
The question of whether you should convert to a standard type probably depends on your analysis.
For simple frequency tables it's probably fine to leave as is, e.g.
df %>%
as_tibble() %>%
count(var1)
# A tibble: 2 x 2
var1 n
<dbl+lbl> <int>
1 1 [male] 100
2 5 [female] 100
However, for any analysis that is type sensitive (this already starts with calculating means, but also applies to regression etc.) you should definitely convert your variables to a class appropriate for your analysis. Not doing so and treating everything as continuous will give you wrong results. Just think of a truly categorical variable like 1=Bus, 2=Car, 3=Bike that you'd throw into a linear regression.
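As a minimal sketch of that conversion, using haven's as_factor and zap_labels on the fake data above (the glm call is purely illustrative):

library(haven)
df$var2_factor  <- as_factor(df$var2)    # categorical: labels become factor levels
df$var2_numeric <- zap_labels(df$var2)   # continuous: drop labels, keep numeric codes
# then e.g. glm(male ~ var2_factor, data = df, family = binomial)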
Is there a compelling reason to use dplyr::recode as opposed to car::Recode?
There is no right or wrong here. Personally, I have a preference for staying within the tidyverse, because it has easy recode functions, e.g. the recode you mentioned, but for more complex tasks you can also use if_else or case_when. And then you also have lots of functions to deal with missings, like replace_na or na_if or coalesce. The syntax of car::Recode() isn't much different from the dplyr one, so it's really mostly personal preference, I'd say.
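For instance, a rough tidyverse sketch for the sex variable from the fake data above (the male3 name is just illustrative, to avoid clobbering the columns created earlier):

library(dplyr)
df <- df %>%
  mutate(male3 = case_when(
    var1 == 1 ~ 1,        # male
    var1 == 5 ~ 0,        # female
    TRUE      ~ NA_real_  # anything else -> missing
  ))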
The same is true for your question of whether you should use the functions from labelled or not. The labelled package indeed adds some very powerful functions to deal with labelled vectors that go beyond what haven or the tidyverse offers, so IMO it's a good package to use.
I have a matrix with individual column names (the row names are not important), like this:
TestMat<-matrix(1:25,ncol=5,nrow=5)
colnames(TestMat)<-c("A","B","C","D","E")
TestMat
For various reasons, but mostly because a package will later need it, I can't alter the values in the matrix and they all have to be integers.
Now I want to categorize my column names (e.g. A, B and D into "Group 1" and C and E into "Group 2"). The idea is that the matrix will get smaller later on, as values in the matrix are randomly diminished. As soon as a column sum reaches zero, that column will be dropped. Along this process I want to see how the fraction/size of one group changes compared to the other groups.
I thought the easiest way would be to just name all the corresponding columns identical:
TestMat2<-matrix(1:25,ncol=5,nrow=5)
colnames(TestMat2)<-c("Group1","Group1","Group2","Group1","Group2")
TestMat2
But this gives me error messages later on in the analysis, as R starts numbering the identical column names like "Group1", "Group1.1", "Group2", "Group1.2", "Group2.1".
I have tried my luck with the "class", "attr" and "factor" commands on my column names, but I am not getting anywhere.
Is there a trick or command, I've maybe never heard of?
As per the comments, why not put the grouping in another variable? Then something like:
> TestMat<-matrix(1:25,ncol=5,nrow=5)
> colnames(TestMat)<-c("A","B","C","D","E")
> grp <- factor(c("Group1","Group1","Group2","Group1","Group2"))  # "grp" rather than "F", which masks FALSE
... do something to your matrix...
> summary(grp[colSums(TestMat) > 40])
Group1 Group2
1 2
Is that it (with 40 standing in for the 0)?
The Bioconductor package Biobase defines a class, ExpressionSet, that allows annotations on the rows and columns of a matrix.
library(Biobase)
exprs = matrix(1:25,ncol=5,nrow=5, dimnames=list(NULL, LETTERS[1:5]))
df = data.frame(grp=c("Group1","Group1","Group2","Group1","Group2"),
row.names=colnames(exprs))
eset = ExpressionSet(exprs, AnnotatedDataFrame(df))
You can access columns in the data frame with $, subset with [, and extract with exprs(), e.g.,
> exprs(eset[, eset$grp == "Group1"])
A B D
1 1 6 16
2 2 7 17
3 3 8 18
4 4 9 19
5 5 10 20
or
> eset[,colSums(exprs(eset)) > 40]$grp
[1] Group2 Group1 Group2
Levels: Group1 Group2
The GenomicRanges package defines a similar class SummarizedExperiment when the rows are annotated with genomic ranges.
This coordinated integration of data and annotation on data is a really good thing, reducing the chance for 'clerical' errors when matrix and annotation are independent; I'm surprised so many comments suggest that you separately maintain two structures.
Thanks for all the helpful comments. I haven't posted here since my original post, because I first wanted to try all promising approaches and find a final solution to my problem.
I tried the Biobase package with its option for annotations, as well as Stephen's idea of grouping everything via a second variable.
As it turned out, as soon as the matrix diminished in size (as a part of the analysis) the external grouping failed, as column-numbers and grouping didn't match anymore and I couldn't find a way to combine the Bioconductor approach and my code.
I found a (somewhat roundabout) solution, though, if anybody cares:
I already stated that if I name my columns identically for grouping, R later numbers my groups, and they are thus no longer identical.
But then I just search for the first however-many letters needed to identify the proper group:
length(colnames(TestMat2)[substr(colnames(TestMat2),1,6) == "Group1"])
This way I can always check the fraction of one group of columns versus the others.
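(Incidentally, the same count can be written a bit more directly with startsWith; a small sketch:)

# count the columns of the current (possibly shrunken) matrix belonging to Group1
sum(startsWith(colnames(TestMat2), "Group1"))
# or as a fraction of the remaining columns
mean(startsWith(colnames(TestMat2), "Group1"))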
Thanks for your answers and help. I learned a lot and I think Bioconductor will come in handy in the future.
Cheers!
So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump straight into a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so I am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples how to use it.
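A minimal sketch of that approach (assuming the empty cells are NA; empty strings would need converting to NA first):

library(zoo)
d <- data.frame(LOG_MESSAGE = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
d$CURRENT_EVENT <- na.locf(d$LOG_MESSAGE)  # carry the last event forward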
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
# ensure character data
LOG_MESSAGE <- as.character(LOG_MESSAGE)
CURRENT_EVENT <- with(rle(LOG_MESSAGE), # list with 'values' and 'lengths'
rep(replace(values,
nchar(values)==0,
values[nchar(values) != 0]),
lengths))
})
# LOG_MESSAGE CURRENT_EVENT
# 1 FIRST_EVENT FIRST_EVENT
# 2 FIRST_EVENT
# 3 SECOND_EVENT SECOND_EVENT
# 4 SECOND_EVENT
# 5 SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <-
transform(dat,
Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
"_"),
`[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of 1. is then converted to a character string.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector; in this case each component is a vector of length two. We want the first elements of these vectors,
4. so I use sapply() to run the subsetting function '['() and extract the 1st element from each list component.
The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data dat.
Super short version: I'm trying to use a user-defined function to populate a new column in a dataframe with the command:
TestDF$ELN<-EmployeeLocationNumber(TestDF$Location)
However, when I run the command, it seems to just apply EmployeeLocationNumber to the first row's value of Location rather than using each row's value to determine the new column's value for that row individually.
Please note: I'm trying to understand R, not just perform this particular task. I was actually able to get the output I was looking for using the apply() function, but that's irrelevant. My understanding is that the above line should work on a row-by-row basis, but it isn't.
Here are the specifics for testing:
TestDF<-data.frame(Employee=c(1,1,1,1,2,2,3,3,3),
Month=c(1,5,6,11,4,10,1,5,10),
Location=c(1,5,6,7,10,3,4,2,8))
This TestDF keeps track of where each of 3 employees was over the course of the year among several locations.
(You can think of "Location" as unique to each Employee... it is essentially a unique ID for that row.)
The function EmployeeLocationNumber takes a location and outputs a number indicating the order in which that employee visited it. For example, EmployeeLocationNumber(8) = 2 because it was the second location visited by the employee who visited it.
EmployeeLocationNumber <- function(Site){
CurrentEmployee <- subset(TestDF,Location==Site,select=Employee, drop = TRUE)[[1]]
LocationDate<- subset(TestDF,Location==Site,select=Month, drop = TRUE)[[1]]
LocationNumber <- length(subset(TestDF,Employee==CurrentEmployee & Month<=LocationDate,select=Month)[[1]])
return(LocationNumber)
}
I realize I probably could have packed all of that into a single subset command, but I didn't know how referencing worked when you used subset commands inside other subset commands.
So, keeping in mind that I'm really trying to understand how to work in R, I have a few questions:
Why won't TestDF$ELN<-EmployeeLocationNumber(TestDF$Location) work row-by-row like other assignment statements do?
Is there an easier way to reference a particular value in a dataframe based on the value of another one? Perhaps one that does not return a dataframe/list that then must be flattened and extracted from?
I'm sure the function I'm using is laughably un-R-like...what should I have done to essentially emulate an INNER Join type query?
Using logical indexing, the condensed one-liner replacement for your function is:
EmployeeLocationNumber <- function(Site){
with(TestDF[do.call(order, TestDF), ], which(Location[Employee==Employee[which(Location==Site)]] == Site))
}
Of course this isn't the most readable way, but it demonstrates the principles of logical indexing and which() in R. Then, as others have said, just wrap it up with a vectorized *apply function to apply it across your dataset.
A) TestDF$Location is a vector. Your function is not set up to return a vector, so giving it a vector will probably fail.
B) In what sense is Location:8 the "second location visited"?
C) If you want within-group ordering, then you need to pass your dataframe, split up by employee, to a function that calculates the result.
D) Conditional access to a data.frame typically involves logical indexing and/or the use of which()
If you just want the sequence of visits by employee, try this:
with(TestDF, ave(Location, Employee, FUN=rank))
[1] 1 2 3 4 2 1 2 1 3
(Changing the first argument to Month, since that is what determines the sequence of locations:)
TestDF$LocOrder <- with(TestDF, ave(Month, Employee, FUN=rank))
If you wanted the second location for EE:3 it would be:
subset(TestDF, LocOrder==2 & Employee==3, select= Location)
# Location
# 8 2
The vectorized nature of R (aka row-by-row) works not by repeatedly calling the function with each next value of the arguments, but by passing the entire vector at once and operating on all of it at one time. But in EmployeeLocationNumber, you only return a single value, so that value gets repeated for the entire data set.
Also, your example for EmployeeLocationNumber does not match your description.
> EmployeeLocationNumber(8)
[1] 3
Now, one way to vectorize a function in the manner you are thinking (repeated calls for each value) is to pass it through Vectorize()
TestDF$ELN<-Vectorize(EmployeeLocationNumber)(TestDF$Location)
which gives
> TestDF
Employee Month Location ELN
1 1 1 1 1
2 1 5 5 2
3 1 6 6 3
4 1 11 7 4
5 2 4 10 1
6 2 10 3 2
7 3 1 4 1
8 3 5 2 2
9 3 10 8 3
As to your other questions, I would just write it as
TestDF$ELN<-ave(TestDF$Month, TestDF$Employee, FUN=rank)
The logic is take the months, looking at groups of the months by employee separately, and give me the rank order of the months (where they fall in order).
Your EmployeeLocationNumber function takes a vector in and returns a single value.
The assignment to create a new data.frame column therefore just gets a single value:
EmployeeLocationNumber(TestDF$Location) # returns 1
TestDF$ELN<-1 # Creates a new column with the single value 1 everywhere
Assignment doesn't do any magic like that. It takes a value and puts it somewhere, in this case the value 1. If the value were a vector of the same length as the number of rows, it would work as you wanted.
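A sketch of one way to get a vector of the right length, calling the function once per element (much like Vectorize() above):

# one call per row's Location gives a vector as long as the data frame
TestDF$ELN <- sapply(TestDF$Location, EmployeeLocationNumber)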
I'll get back to you on that :)
Ditto.
Update: I finally worked out some code to do it, but by then @DWin had a much better solution :(
TestDF$ELN <- unlist(lapply(split(TestDF, TestDF$Employee), function(x) rank(x$Month)))
...I guess the ave function does pretty much what the code above does. But for the record:
First I split the data.frame into sub-frames, one per employee. Then I rank the months (just in case your months are not in order). You could use order too, but rank can handle ties better. Finally I combine all the results into a vector and put it into the new column ELN.
Update again: Regarding question 2, "What is the best way to reference a value in a dataframe?":
This depends a bit on the specific problem, but if you have a value, say Employee=3 and want to find all rows in the data.frame that matches that, then simply:
TestDF$Employee == 3 # Returns logical vector with TRUE for all rows with Employee == 3
which(TestDF$Employee == 3) # Returns a vector of indices instead
TestDF[which(TestDF$Employee == 3), ] # Subsets the data.frame on Employee == 3
I've done some serious PHP/JS coding recently, and I've kind of lost my R muscle. While this problem could easily be tackled in PHP/JS, what is the most efficient way of solving it in R? I have to grade a questionnaire, and I have the following scenario:
raw t
5 0
6 2
7-9 3
10-12 4
15-20 5
If x equals, or falls within, a range given in raw, the value in the corresponding row of t should be returned. Of course, this can be done with a for loop or a switch, but just imagine a very lengthy set of value ranges in raw. How would you tackle this one?
We seem to be missing a part of the example, because there is no mention of "x".
dat <- read.table(textConnection("raw t
5 0
6 2
7-9 3
10-12 4
15-20 5"), header=TRUE, stringsAsFactors=FALSE)
dat$bot <- as.numeric( sapply( sapply(dat$raw, strsplit, "-"), "[", 1 ))
get.t <- function(x) findInterval(x, dat$bot)
get.t(8)
#[1] 3
> dat$t[get.t(6)]
[1] 2
> dat$t[get.t(5)]
[1] 0
I would simply use an indexing scheme kind of like what Corbin alluded to, but since he didn't provide an example, here's a simple one:
m <- cbind(c(5:12,15:20),
rep(c(0,2,3,4,5),times = c(1,1,3,3,6)))
m[m[,1] == 11,2]
[1] 4
Note: very similar to Simone's answer, as I started typing this a while back. It has a note at the end, though. The indexing approach I give is essentially Simone's answer.
There will have to be a loop involved somewhere.
The pseudo code of what I would do is something like:
score = blah
for each raw => t
break raw into rMin -> rMax
if(rMin <= score and rMax >= score)
return t
It avoids having to loop over each number between rMin and rMax (which is what I'm assuming you meant), but without some kind of indexing, that is the best you're going to get.
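For what it's worth, a rough R translation of that pseudocode (reusing the dat data frame parsed in the answer above):

get_t_loop <- function(score, dat) {
  for (i in seq_len(nrow(dat))) {
    bounds <- as.numeric(strsplit(dat$raw[i], "-")[[1]])
    rMin <- bounds[1]
    rMax <- bounds[length(bounds)]  # single values act as their own max
    if (score >= rMin && score <= rMax) return(dat$t[i])
  }
  NA  # score falls outside every range
}
get_t_loop(8, dat)  # 3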
Note: if you have a ton of calls to this, and indexing would actually be worth your while, the easiest type of indexing would just be a hash map of score -> t entries.
Basically you would parse your example data into something like:
index[5] = 0
index[6] = 2
index[7] = 3
index[8] = 3
index[9] = 3
You would need to carefully weigh whether building the index would be more time consuming than just looping over the ranges.
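In R, such an index could be built once as a plain named lookup vector; a sketch, again reusing dat (and treating single values as one-element ranges):

tops <- sapply(strsplit(dat$raw, "-"), function(p) as.numeric(p[length(p)]))
index <- rep(dat$t, times = tops - dat$bot + 1)      # one t entry per score
names(index) <- unlist(mapply(seq, dat$bot, tops))   # score -> t
index["8"]  # 3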
Note: the indexing approach is actually what Simone said.