I was given an Excel table similar to this:
datos <- data.frame(op= 1:4, var1= c(4, 2, 3, 2))
Now, there are other tables with the keys to op and var1, which happen to be categorical variables. Suppose that after loading them, they become:
set.seed(1)
op <- paste("op",c(1:4),sep="")
var1 <- sample(LETTERS, 19, replace= FALSE)
As you can see, there are unused levels in the data frame. I want to replace the numbers with the proper associated levels. This is what I've tried:
datos[] <- lapply(datos, factor)
levels(datos$op) <- op
levels(datos$var1) <- var1
This fails, because it reorders the factors alphabetically and gives a wrong output. I then tried:
datos$var1 <- factor(datos$var1, levels= var1, ordered= TRUE)
but this sets everything in datos$var1 to NA (I guess that's because of mismatched lengths).
What would be the right way to do this?
Following the kind advice of @docendoDiscimus, I post this answer for future reference:
For the data provided in the question:
datos$var1 <- factor(var1[datos$var1], levels= unique(var1))
datos
## op
Please note that this solution should be applied without first converting datos$var1 to a factor (that is, without running datos[] <- lapply(datos, factor)).
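The indexing step is easier to follow with a fixed key instead of a random sample. A minimal sketch, where the key vector and its letters are made up for illustration and are not the values from the question:

```r
# Hypothetical key: position i holds the label for numeric code i
key <- c("Q", "B", "Z", "A", "M")
codes <- c(4, 2, 3, 2)          # same numeric codes as datos$var1

# Look up each code by position, then fix the level order to the key order
labelled <- factor(key[codes], levels = unique(key))

as.character(labelled)  # "A" "B" "Z" "B"
levels(labelled)        # "Q" "B" "Z" "A" "M" (key order, not alphabetical)
```

Because the lookup happens on the raw numbers, unused labels stay available as levels and nothing is reordered.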
My aim is to compare differences in the levels of variables that might occur across different versions of a dataset. In my code, I first convert all columns to strings so that I can compare variables of several types (numeric, categorical, etc.). However, the code fails and does not give the desired result, which would be a data frame listing each variable together with its differences (as a list). Any help is appreciated!
Thank you.
data1 <- lapply(?, as.character)
data2 <- lapply(?, as.character)
check_diffs <- function(vars, data1, data2) {
levels1 <- unique(data1$vars)
levels2 <- unique(data2$vars)
diff <- ifelse(length(union(setdiff(levels1,levels2), setdiff(levels2,levels1)))>0, list(union(setdiff(levels1,levels2), setdiff(levels2,levels1))), NA)
return(data.frame(var = vars, diffs = I(diff)))
}
diffs_df <- map_dfr(vars, ~check_diffs(.x, data1 = ?, data2 = ?))
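The issue with the code was that vars holds the column name as a string, so data1$vars looks for an element literally named "vars" instead of the column whose name is stored in vars. The column has to be looked up by name, for example with get(vars, dataX). With that change, the code gives the differences in coding between both data sets.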
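A minimal sketch of the corrected comparison. The two small data frames are hypothetical stand-ins for the dataset versions, and `[[` is used here in place of get() as an equivalent way to look up a column by a name held in a string:

```r
# Hypothetical example data standing in for two versions of a dataset
data1 <- data.frame(color = c("red", "blue"), size = c(1, 2),
                    stringsAsFactors = FALSE)
data2 <- data.frame(color = c("red", "green"), size = c(1, 2),
                    stringsAsFactors = FALSE)

# Convert every column to character so all types compare the same way
data1[] <- lapply(data1, as.character)
data2[] <- lapply(data2, as.character)

# Values of column v that occur in exactly one of the two versions
check_diffs <- function(v, d1, d2) {
  levels1 <- unique(d1[[v]])   # [[ ]] looks the column up by its name
  levels2 <- unique(d2[[v]])
  union(setdiff(levels1, levels2), setdiff(levels2, levels1))
}

vars <- intersect(names(data1), names(data2))
diffs <- lapply(vars, check_diffs, d1 = data1, d2 = data2)

# One row per variable; the differences are stored as a list column
diffs_df <- data.frame(var = vars, diffs = I(diffs), stringsAsFactors = FALSE)
```

Variables without any difference get an empty character vector in the list column rather than NA, which keeps the column type uniform.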
I want to cut test$income into 25 levels. I stored the resulting intervals in a variable called levels, and I want to cut train$income using the same intervals. I tried the code below, but I am not sure why some of the values in train$income were coerced to NA.
What went wrong? Is there a better way to do this? Thank you!
test$income <- cut(test$income,b=25)
levels <- c(-0.853,-0.586,-0.325,-0.0643,0.196,0.457,0.718,0.978,1.24,1.5,1.76,2.02,2.28,2.54,2.8,3.06,3.32,3.59,3.85,4.11,4.37,4.63,4.89,5.15,5.41,5.68)
train$income <- cut(train$income,levels)
As @JohnGilfillan says, one likely reason is that some values in train$income are above 5.68 or at or below -0.853 (the default intervals are open on the left), so they fall outside every interval. Those values become NA, while the rest are assigned to an interval as expected. Another possible reason (in a different case) is that you used a character vector to specify the breaks in your actual code, since levels() of a cut() result returns a character vector. In that case you would get a vector with only NAs (printed as <NA>).
The solution is to widen the outermost breaks so that they cover the full range of the data being cut.
Try this:
set.seed(1)
a <- runif(100, -6, 6)
set.seed(2)
b <- runif(100, -6, 6)
levs <- levels(cut(a, 25))
levs <- gsub("\\(", "", levs)
levs <- gsub("\\]", "", levs)
levs <- c(as.numeric(sapply(strsplit(levs, ","), "[", 1)),
as.numeric(sapply(strsplit(levs, ","), "[", 2))[length(levs)])
cut.b <- cut(b, levs)
## Both NA values are outside levs
b[is.na(cut.b)]
cut.b.new <- cut(b, c(-6, levs[c(-1, -length(levs))], 6))
## No NAs
any(is.na(cut.b.new))
PS: It is not recommended to use function names as object names. That is why I used levs instead of levels.
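An alternative sketch that avoids parsing the interval labels: compute the numeric breaks once and reuse them for both vectors, with the outermost breaks set to -Inf and Inf. The vectors test_income and train_income are simulated stand-ins for the columns in the question, and the equal-width bins from seq() are an assumption that only approximates what cut(x, 25) chooses internally:

```r
set.seed(1)
test_income  <- runif(100, -1, 5)   # hypothetical stand-in for test$income
train_income <- runif(100, -2, 6)   # deliberately wider than the test range

# 26 break points give 25 equal-width bins over the test range
breaks <- seq(min(test_income), max(test_income), length.out = 26)

# Open the outermost bins so values outside the test range still get a bin
breaks[1] <- -Inf
breaks[length(breaks)] <- Inf

test_bins  <- cut(test_income, breaks)
train_bins <- cut(train_income, breaks)

anyNA(train_bins)   # FALSE
```

Both factors share the same 25 levels, so they can be compared or modelled together directly.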
I have several variables (around 20) in my data frame whose names all start with the same pattern. R reads them in as character, but they should be factors.
Below I have provided a comparable (just much smaller) data frame.
animal.farm <- data.frame(matrix(0, 5, 0))
set.seed(1)
animal.farm$ord.3 <- sample(1:4, 5, replace=T)
animal.farm$ani.4 <- sample(c("dog", "horse", "mink"), 5, replace=T)
animal.farm$ani.5 <- sample(c("fun", "boring", "clever"), 5, replace=T)
I've tried both
ls(pattern = "animal.farm$ani")
and
apropos("animal.farm$ani")
hoping to apply factor() to all the variables whose names start with "ani" in one or two lines of code, but no luck so far.
A simple base R solution:
id <- grep("^ani", names(animal.farm))
animal.farm[id] <- lapply(animal.farm[id], as.factor)
Using stringr to detect column names that start with ani
library(stringr)
cols <- str_detect(colnames(animal.farm), "^ani")
animal.farm[,cols] <- lapply(animal.farm[,cols], as.factor)
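For context, ls() and apropos() search for objects in an environment, not for columns inside a data frame, which is why both attempts in the question come back empty; the column names have to be taken from names(). A small self-contained sketch of the same idea using base R's startsWith(), with fixed values in place of the random sample from the question:

```r
# Fixed example data (values chosen for illustration only)
animal.farm <- data.frame(
  ord.3 = c(1, 4, 3, 1, 2),
  ani.4 = c("dog", "horse", "mink", "dog", "dog"),
  ani.5 = c("fun", "boring", "clever", "fun", "fun"),
  stringsAsFactors = FALSE
)

# Logical index of the columns whose names begin with "ani"
is_ani <- startsWith(names(animal.farm), "ani")

# Convert only those columns; the others are left untouched
animal.farm[is_ani] <- lapply(animal.farm[is_ani], factor)

sapply(animal.farm, class)
```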
Part of my code is similar to the following:
id.row <- c("x1","x2", "x10", "x20")
id.num <- c(1, 2, 10, 20)
id.name <- c("one","two","ten","twenty")
info.data <- data.frame(id.row, id.num, id.name)
req.mat <- matrix(1:4, nrow = 2, ncol = 2)
row.names(req.mat) <- c("x1","x10")
p1 <- info.data$id.row %in% row.names(req.mat)
op1 <- info.data$id.num[p1]
op2 <- info.data$id.name[p1]
I think the code is pretty much self-explanatory, and I am getting the results that I want. There is no problem printing op1, but when I print op2 it gives me additional output (Levels). Since there are thousands of rows in the original info.data, this "Levels" output does not look nice. Is there a way to turn it off?
Maybe you can use print(op2, max.levels=0)
If you didn't want any of your variables to be factors, then
info.data <- data.frame(id.row, id.num, id.name, stringsAsFactors=F)
is a good choice. If you wanted id.row but not id.name to be a factor then
info.data <- data.frame(id.row=factor(id.row), id.num, id.name,
stringsAsFactors=F)
is better to set that up. If you created your data in some other way so they already exist as factors in the data.frame, you can convert them back to character with
info.data$id.name <- as.character(info.data$id.name)
If you do want id.name to be a factor but just want to drop the extra levels after subsetting, then
op2 <- info.data$id.name[p1, drop=T]
will do the trick.
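A short sketch of the difference, using a standalone factor in place of info.data$id.name. It also shows droplevels(), which is base R's dedicated function for this and gives the same result as drop = TRUE here:

```r
id.name <- factor(c("one", "two", "ten", "twenty"))
keep <- c(TRUE, FALSE, TRUE, FALSE)   # plays the role of p1

sub <- id.name[keep]
nlevels(sub)                          # 4: subsetting keeps all original levels

nlevels(id.name[keep, drop = TRUE])   # 2: unused levels dropped while subsetting
nlevels(droplevels(sub))              # 2: unused levels dropped afterwards

as.character(sub)                     # plain strings, no Levels line when printed
```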
I'm having trouble assigning value labels from lists to numeric variables. I've got a dataset (in form of a list()) containing eleven variables. The first five variables each have individual value levels, the last six each use the same 1-5 scale. I created lists with value labels for each of the first five variables and one for the scale. Now I would like to automatically assign the labels from those lists to my variables.
I've put my eleven variables in a list to be able to use mapply().
Here's an example of my current state:
# Example variables:
a <- c(1,2,3,4) # individual variable a
b <- c(1,2,2,1) # individual variable b
c <- c(1,2,3,4,5) # variable c using the scale
d <- c(1,2,3,4,5) # variable d also using the scale
mydata <- list(a,b,c,d)
# Example value labels:
lab.a <- c("These", "are", "value", "labels")
lab.b <- c("some", "more")
lab.c <- c("And", "those", "for", "the", "scale")
labels.abc <- list(lab.a, lab.b, lab.c)
# Assigning labels in two parts
part.a <- mapply(function(x,y) factor(as.numeric(x), labels = y, exclude = NA), mydata[1:2], labels.abc[1:2])
part.b <- mapply(function(x,y) factor(as.numeric(x), labels = y, exclude = NA), mydata[3:4], labels.abc[3])
Apart from not being able to combine the two parts, my major problem is the output format. mapply() returns the result as a matrix, whereas I need a list containing the individual variables again.
So, my question is: How can I assign the value labels in an automated procedure and as the result again get a list of variables, which now contain labeled information instead of numerics?
I'm quite lost here. Is my approach with mapply() generally doable, or am I completely on the wrong track?
Thanks in advance! Please comment if you need further information.
Problem solved!
Thanks @agstudy for pointing out the SIMPLIFY = FALSE argument, which prevents mapply() from reducing the result to a matrix.
The correct code is
part.a <- mapply(function(x,y) factor(as.numeric(x), labels = y, exclude = NA), mydata[1:2], labels.abc[1:2], SIMPLIFY = FALSE)
This provides exactly the same format of output as was put in.
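The two parts can also be handled in a single call. A sketch under the assumption that an index vector (called label.index here, a name made up for this example) records which label set belongs to which variable; Map() is base R's shorthand for mapply() with SIMPLIFY = FALSE:

```r
mydata <- list(
  a = c(1, 2, 3, 4),      # individual variable a
  b = c(1, 2, 2, 1),      # individual variable b
  c = c(1, 2, 3, 4, 5),   # variable c using the scale
  d = c(1, 2, 3, 4, 5)    # variable d also using the scale
)

lab.a <- c("These", "are", "value", "labels")
lab.b <- c("some", "more")
lab.c <- c("And", "those", "for", "the", "scale")
labels.abc <- list(lab.a, lab.b, lab.c)

# Hypothetical helper: which entry of labels.abc each variable uses
label.index <- c(1, 2, 3, 3)

# Map() always returns a list, one labelled factor per input variable
labelled <- Map(
  function(x, lab) factor(x, levels = seq_along(lab), labels = lab),
  mydata, labels.abc[label.index]
)

as.character(labelled$b)   # "some" "more" "more" "some"
```

Setting levels explicitly to seq_along(lab) keeps the code-to-label mapping correct even when a variable does not use every value of its scale.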