I've got a data frame with a mix of variable types (integers, factors, logicals), and I would like to recode all of the NAs as a numeric value across the whole dataset while preserving the underlying class of each variable. For example:
frame <- data.frame("x" = rnorm(10), "y" = rep("A", 10))
frame[6,] <- NA
dat <- as.data.frame(apply(frame,2, function(x) ifelse(is.na(x)== TRUE, -9, x) ))
dat
str(dat)
However, here the integers turn into factors; when I include as.numeric(x) in the apply() function, this introduces errors. Thanks for any and all thoughts on how to deal with this.
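apply() first coerces the data frame to a matrix, and because one column is not numeric, that matrix is of type character. as.data.frame() then turns the character columns into factors by default (in R versions before 4.0.0). Instead, you could work column by column with lapply():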
dat <- as.data.frame(lapply(frame, function(x) ifelse(is.na(x), -9, x) ) )
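One caveat with the lapply() version: ifelse() on a factor column returns the underlying integer codes, so factors do not survive it. A minimal sketch of a per-column helper that handles factors separately is below. The name recode_na is made up for illustration, and stringsAsFactors = TRUE is set explicitly so the example behaves the same on any R version.

```r
# Example data from the question; y is forced to be a factor
frame <- data.frame(x = rnorm(10), y = rep("A", 10), stringsAsFactors = TRUE)
frame[6, ] <- NA

# recode_na is a hypothetical helper, not part of base R
recode_na <- function(col, value = -9) {
  if (is.factor(col)) {
    # the replacement must exist as a level, otherwise the assignment gives NA
    levels(col) <- c(levels(col), as.character(value))
    col[is.na(col)] <- as.character(value)
  } else {
    # numeric columns stay numeric; logical columns are promoted to numeric
    col[is.na(col)] <- value
  }
  col
}

dat <- frame
dat[] <- lapply(frame, recode_na)  # dat[] keeps the data frame structure
str(dat)
```

Logical columns cannot hold -9, so they become numeric here; an integer column stays integer only if value is itself an integer such as -9L.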
I am trying to figure out how coercion of factors and data frames works in R. I want to plot boxplots for a subset of a data frame. Let's go through it step by step.
x = rnorm(30, 1, 1)
Created a vector x with normal distribution
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
Created a character string to later use as a factor for plotting boxplots for x1, x2, x3
df = data.frame(x,c)
combined x and c into a data.frame. So now we would expect class of df: dataframe, df$x: numeric, df$c: factor (because we sent c into a dataframe) and is.data.frame and is.list applied on df should give us TRUE and TRUE. (I assumed that all dataframes are lists as well? and that's why we are getting TRUE for both checks.)
And that's what happens below. All good till now.
class(df)
#[1] "data.frame"
is.data.frame(df)
#[1] TRUE
is.list(df)
#[1] TRUE
class(df$x)
#[1] "numeric"
class(df$c)
#[1] "factor"
Now I plot the spread of x grouped by the factor levels in c, so the first argument is x ~ c. But I want boxplots for just two levels, x1 and x2, so I used the subset argument of the boxplot function.
boxplot(x ~ c, subset=c %in% c("x1", "x2"), data=df)
This is the plot we get. Notice that because x3 is still a level of the factor, it is still plotted,
i.e. we still get 3 categories on the x-axis of the boxplot in spite of subsetting to 2 categories.
So, one solution I found was to change the class of df variables into numeric and character
class(df)<- c("numeric", "character")
boxplot(x ~ c, subset=c %in% c("x1", "x2"), data=df)
New boxplot. This is what we wanted, so it worked: we plotted boxes for just x1 and x2 and got rid of x3.
But if we rerun the same checks on all variables that we ran before this coercion, we get these outputs.
Anything funny?
class(df)
#[1] "numeric" "character"
is.data.frame(df)
#[1] FALSE
is.list(df)
#[1] TRUE
class(df$x)
#[1] "numeric"
class(df$c)
#[1] "factor"
Notice that df$c (the second variable, containing the categories x1, x2, x3) is still a factor!
And df stopped being a list (so was it ever a list?)
And what exactly did we do with the coercion class(df) <- c("numeric", "character") if it did not change the data type of df$c?
So to sum up,
my questions for tldr version:
Are all dataframes, also lists in R?
Why did our boxplot drop x3 in the 2nd case (when we coerced class(df) into numeric and character)?
If we did coerce the factor into character by doing the above steps, why does it still show that the variable's class is factor?
And why did df stop being a data frame after we did the above steps?
The answers make more sense if we take your questions in a different order.
Are all dataframes, also lists in R?
Yes. A data frame is a list of vectors (the columns).
And why did df stop being a list after we did the above steps?
It didn't. It stopped being a data frame, because you changed the class with class(df) <- c("numeric", "character"). is.list(df) still returns TRUE.
If we did coerce the factor into character by doing the above steps, why does it still show that the variable's class is factor?
class(df) operates on the df object itself, not on its columns. Look at str(df): the factor column is still a factor. The assignment class(df) <- c("numeric", "character") only set the class attribute of the data frame object itself to that two-element vector; it did not touch df$x or df$c.
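To make the distinction concrete, here is a small sketch that compares the class of the container with the classes of its columns. It rebuilds the example data with stringsAsFactors = TRUE so that c is a factor regardless of the R version in use.

```r
# Same shape of data as in the question; c is forced to be a factor
df <- data.frame(x = rnorm(30, 1, 1),
                 c = rep(c("x1", "x2", "x3"), each = 10),
                 stringsAsFactors = TRUE)

class(df)          # class of the container only
sapply(df, class)  # class of each column, one entry per column
is.list(df)        # a data frame is a list of equal-length columns

# Changing one column's class means assigning to that column
df$c <- as.character(df$c)
sapply(df, class)
```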
Why did our boxplot drop x3 in the 2nd case (when we coerced class(df) into numeric and character)?
You've broken your data frame object by explicitly setting its class attribute to the vector c("numeric", "character"). It's hard to predict the full effects of this. My best guess is that boxplot, or the functions that draw the axes, accessed the class attribute of the data frame somehow.
To do what you really wanted:
x = rnorm(30, 1, 1)
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
df = data.frame(x,c)
df$c <- as.character(df$c)
or
x = rnorm(30, 1, 1)
c = c(rep("x1",10), rep("x2",10), rep("x3",10))
df = data.frame(x,c, stringsAsFactors=FALSE)
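Use droplevels like this (this assumes df$c is a factor, which is the default before R 4.0.0; on later versions create df with stringsAsFactors = TRUE):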
df0 <- subset(df, c %in% c("x1", "x2"))
df0 <- transform(df0, c = droplevels(c))
levels(df0$c)
## [1] "x1" "x2"
Note that now c only has two levels, not three.
We can write this as a pipeline using magrittr like this:
library(magrittr)
df %>%
  subset(c %in% c("x1", "x2")) %>%
  transform(c = droplevels(c)) %>%
  boxplot(x ~ c, data = .)
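The same idea can be checked without drawing anything. The sketch below uses only base R and passes plot = FALSE to boxplot so that it returns the group statistics instead of plotting; the names kept and bp are just for illustration.

```r
set.seed(42)
df <- data.frame(x = rnorm(30, 1, 1),
                 c = rep(c("x1", "x2", "x3"), each = 10),
                 stringsAsFactors = TRUE)

kept <- subset(df, c %in% c("x1", "x2"))
levels(kept$c)                 # subsetting rows does not remove the unused level

kept$c <- droplevels(kept$c)   # drop levels that no longer occur
bp <- boxplot(x ~ c, data = kept, plot = FALSE)
bp$names                       # the groups that would appear on the x-axis
```

Dropping plot = FALSE draws the two-box plot.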
I want to cut test$income into 25 levels and using the intervals derived, I stored them in a variable called levels and I wish to cut train$income based on the same intervals. I tried the following code below but I am not sure why some of my values in train$income were coerced to NA.
What went wrong? Is there a better way to do this? Thank you!
test$income <- cut(test$income,b=25)
levels <- c(-0.853,-0.586,-0.325,-0.0643,0.196,0.457,0.718,0.978,1.24,1.5,1.76,2.02,2.28,2.54,2.8,3.06,3.32,3.59,3.85,4.11,4.37,4.63,4.89,5.15,5.41,5.68)
train$income <- cut(train$income,levels)
As @JohnGilfillan says, one likely reason is that some values of train$income are higher than 5.68 or lower than -0.853. Those values fall outside every interval and become NA, while the rest are binned as expected. Another possible reason (in a different situation) is that the breaks in your actual code are a character vector (levels() of a cut object returns a character vector). In this case you would get a vector with only NAs (written as <NA>).
The solution is to expand the extremes of your levels vector.
Try this:
set.seed(1)
a <- runif(100, -6, 6)
set.seed(2)
b <- runif(100, -6, 6)
levs <- levels(cut(a, 25))
levs <- gsub("\\(", "", levs)
levs <- gsub("\\]", "", levs)
levs <- c(as.numeric(sapply(strsplit(levs, ","), "[", 1)),
          as.numeric(sapply(strsplit(levs, ","), "[", 2))[length(levs)])
cut.b <- cut(b, levs)
## Both NA values are outside levs
b[is.na(cut.b)]
cut.b.new <- cut(b, c(-6, levs[c(-1, -length(levs))], 6))
## No NAs
any(is.na(cut.b.new))
PS: It is not recommended to use function names as object names. Therefore levs instead of levels.
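A variation on the same idea is sketched below, under the assumption that equal-width bins over the range of the first variable are acceptable. It swaps in a different technique from the code above: the numeric breaks are built with seq() instead of being parsed out of the labels, and the two outer breaks are replaced by -Inf and Inf so that no value can fall outside. test_income and train_income are hypothetical stand-ins for test$income and train$income.

```r
set.seed(1)
test_income <- runif(100, -6, 6)    # stand-in for test$income
set.seed(2)
train_income <- runif(100, -7, 7)   # deliberately wider than the test data

# 26 break points give 25 intervals
brks <- seq(min(test_income), max(test_income), length.out = 26)
brks[1] <- -Inf                     # open the lowest interval downwards
brks[length(brks)] <- Inf           # open the highest interval upwards

test_bin  <- cut(test_income, brks)
train_bin <- cut(train_income, brks)

any(is.na(train_bin))               # no value falls outside the breaks
```

Because both calls use the same numeric breaks, the two factors share identical levels, which is usually what is wanted when a model trained on one set is applied to the other.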
Input:
df = data.frame(col1 = 1:5, col2 = 5:9)
rownames(df) <- letters[1:5]
#add jitter
jitter(df) #Error in jitter(df) : 'x' must be numeric
Expected output: jitter will be added to the columns of df. Thanks!
jitter is a function that takes numeric as input. You cannot simply run jitter on the whole data.frame. You need to loop through the columns. You can do:
data.frame(lapply(df, jitter))
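Note that data.frame(lapply(df, jitter)) builds a new data frame, so the row names a to e are not carried over. A small sketch of a variant that assigns the jittered columns back into a copy, and therefore keeps the row names, is below; df_jit is just an illustrative name.

```r
df <- data.frame(col1 = 1:5, col2 = 5:9)
rownames(df) <- letters[1:5]

set.seed(123)                   # only to make the example reproducible
df_jit <- df
df_jit[] <- lapply(df, jitter)  # replace the columns, keep the structure

df_jit
rownames(df_jit)                # still "a" to "e"
```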
jitter is meant to be applied to a numeric vector, not a data frame.
If you want to apply jitter to all your columns, this should do (note that apply returns a matrix, not a data frame):
apply(df, 2, jitter)
Just adding random numbers?
df_jit <- df + matrix(rnorm(nrow(df) * ncol(df), sd = 0.1), ncol = ncol(df))
a<- data.frame(sex=c(1,1,2,2,1,1),bq=factor(c(1,2,1,2,2,2)))
library(Hmisc)
label(a$sex)<-"gender"
label(a$bq)<-"xxx"
str(a)
b<-data.frame(lapply(a, as.character), stringsAsFactors=FALSE)
str(b)
When I convert the columns of data frame a to character, the column labels disappear. My real data frame has many columns; the two columns here are only an example. How can I keep the column labels when converting numeric columns to character? Thank you!
Labels are not a commonly used R feature. Unfortunately, you will have to do it yourself:
b <- data.frame(lapply(a, function(x) { y <- as.character(x); label(y) <- label(x); y }), stringsAsFactors = FALSE)
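The same pattern can be sketched without Hmisc, assuming the labels are stored in a plain "label" attribute on each column (which, as far as I know, is where Hmisc's label() keeps them). The attribute is copied by hand because as.character() drops it.

```r
a <- data.frame(sex = c(1, 1, 2, 2, 1, 1), bq = factor(c(1, 2, 1, 2, 2, 2)))
attr(a$sex, "label") <- "gender"   # assumption: label kept as a plain attribute
attr(a$bq, "label") <- "xxx"

b <- a
b[] <- lapply(a, function(col) {
  out <- as.character(col)                   # conversion drops attributes
  attr(out, "label") <- attr(col, "label")   # so restore the label by hand
  out
})

str(b)
```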
I have a column which contain numeric as well as non-numeric values. I want to find the mean of the numeric values which i can use it to replace the non-numeric values. How can this be done in R?
Say your data frame is named df and the column you want to "fix" is called df$x. You could do the following.
You have to unfactor the column and then convert it to numeric. This will give you NAs (and a coercion warning) for all the character strings that cannot be coerced to numbers.
nums <- as.numeric(as.character(df$x))
As Richie Cotton pointed out, there is a "more efficient, but harder to remember" way to convert factors to numeric
nums <- as.numeric(levels(df$x))[as.integer(df$x)]
To get the mean, use mean() but pass na.rm = TRUE so that the NAs are ignored
m <- mean(nums, na.rm = TRUE)
Assign the mean to all the NA values.
nums[is.na(nums)] <- m
You could then replace the old data, but I don't recommend it. Instead just add a new column
df$new.x <- nums
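Put together on a small made-up column, the steps above look like this. The values in x are invented for illustration, and suppressWarnings() is only there to silence the expected coercion warning.

```r
# Hypothetical column mixing numbers and non-numeric strings
df <- data.frame(x = c("1", "2", "abc", "4", "n/a"), stringsAsFactors = TRUE)

# factor -> character -> numeric; non-numeric strings become NA
nums <- suppressWarnings(as.numeric(as.character(df$x)))

m <- mean(nums, na.rm = TRUE)   # mean of 1, 2 and 4
nums[is.na(nums)] <- m          # fill the gaps with that mean

df$new.x <- nums                # keep the original column untouched
df
```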
This is a function I wrote yesterday to combat the non-numeric types. I have a data.frame with unpredictable type for each column. I want to calculate the means for numeric, and leave everything else untouched.
colMeans2 <- function(x) {
  # This function tries to guess the column type. Since all columns come as
  # characters (apply() converts the data frame to a matrix), it first checks
  # whether any value is "TRUE" or "FALSE". If not, it tries to coerce the
  # vector into integer. If that doesn't work it looks for a ' \" ' in the
  # vector (meaning a column with character) and uses that as a result.
  # Finally, if nothing else passes, the column is taken to be numeric and
  # its mean is calculated. The end.

  # try if logical; x is a character vector here, so compare the values
  # directly (levels(x) would be NULL); trimws() guards against padding
  if (any(trimws(x) %in% c("TRUE", "FALSE"))) return(NA)
  # try if integer
  try.int <- strtoi(x)
  if (all(!is.na(try.int))) return(try.int[1])
  # try if character
  if (any(grepl("\\\"", x))) return(x[1])
  # what's left is numeric
  mean(as.numeric(as.character(x)), na.rm = TRUE)
  # a possible warning about coerced NAs probably originates in the above line
}
You would use it like so:
apply(X = your.dataframe, MARGIN = 2, FUN = colMeans2)
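If the columns already carry their proper types (which is not the situation described above, where everything arrives as character), a simpler approach is possible. The sketch below is an alternative technique rather than a rewrite of colMeans2: it selects the numeric columns with is.numeric and leaves the rest alone. The contents of your.dataframe are invented for illustration.

```r
# Hypothetical data frame with mixed, correctly typed columns
your.dataframe <- data.frame(n = c(1.5, 2.5, NA),
                             i = 1:3,
                             s = c("a", "b", "c"),
                             l = c(TRUE, FALSE, NA))

# One TRUE/FALSE per column; vapply() guarantees a logical vector
num_cols <- vapply(your.dataframe, is.numeric, logical(1))

# Means of the numeric columns only, ignoring NAs
col_means <- colMeans(your.dataframe[num_cols], na.rm = TRUE)
col_means
```

When the columns do arrive as character, utils::type.convert() can be tried first to recover the types before applying this.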
It sort of depends on what your data looks like.
Does it look like this?
data = list(1, 2, 'new jersey')
Then you could
data.numbers = sapply(data, as.numeric)
and get
c(1, 2, NA)
And you can find the mean with
mean(data.numbers, na.rm=T)
A compact conversion:
vec <- c(0:10,"a","z")
vec2 <- (as.numeric(vec))
vec2[is.na(vec2)] <- mean(vec2[!is.na(vec2)])
as.numeric will print the warning message listed below and convert the non-numeric to NA.
Warning message:
In mean(as.numeric(vec)) : NAs introduced by coercion