Cut a column based on intervals of another column in R

I want to cut test$income into 25 levels. I stored the derived intervals in a variable called levels, and I want to cut train$income using the same intervals. I tried the code below, but I am not sure why some of the values in train$income were coerced to NA.
What went wrong? Is there a better way to do this? Thank you!
test$income <- cut(test$income,b=25)
levels <- c(-0.853,-0.586,-0.325,-0.0643,0.196,0.457,0.718,0.978,1.24,1.5,1.76,2.02,2.28,2.54,2.8,3.06,3.32,3.59,3.85,4.11,4.37,4.63,4.89,5.15,5.41,5.68)
train$income <- cut(train$income,levels)

As @JohnGilfillan says, one likely reason is that some values in train$income are higher than 5.68 or lower than -0.853. Those values fall outside every interval and become NA, while the rest are assigned to a level. Another possible reason (in a different situation) is that you used a character vector to specify the breaks in your actual code (levels() on a cut object returns a character vector). In this case you would get a vector with only NAs (written as <NA>).
The solution is to expand the extremes of your levels vector.
Try this:
set.seed(1)
a <- runif(100, -6, 6)
set.seed(2)
b <- runif(100, -6, 6)
levs <- levels(cut(a, 25))
levs <- gsub("\\(", "", levs)
levs <- gsub("\\]", "", levs)
levs <- c(as.numeric(sapply(strsplit(levs, ","), "[", 1)),
as.numeric(sapply(strsplit(levs, ","), "[", 2))[length(levs)])
cut.b <- cut(b, levs)
## Both NA values are outside levs
b[is.na(cut.b)]
cut.b.new <- cut(b, c(-6, levs[c(-1, -length(levs))], 6))
## No NAs
any(is.na(cut.b.new))
PS: It is not recommended to use function names as object names, hence levs instead of levels.
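As a possible alternative to parsing the labels, the break points can be built numerically and the two outer breaks opened up with -Inf and Inf. This is only a sketch: test_income and train_income are made-up stand-ins for test$income and train$income, and the breaks come from seq() over the observed range, so they are close to, but not identical to, the ones cut(x, 25) would choose.

```r
set.seed(1)
test_income <- runif(100, -6, 6)   # stand-in for test$income (assumed data)
set.seed(2)
train_income <- runif(100, -6, 6)  # stand-in for train$income (assumed data)

# 26 equally spaced break points over the test data define 25 intervals
brks <- seq(min(test_income), max(test_income), length.out = 26)

# Open the two outer intervals so values beyond the test range still get a bin
brks[1] <- -Inf
brks[length(brks)] <- Inf

# Apply the same breaks to both vectors
test_bin <- cut(test_income, breaks = brks)
train_bin <- cut(train_income, breaks = brks)

anyNA(train_bin)                                # no value falls outside the breaks
identical(levels(test_bin), levels(train_bin))  # both share the same intervals
```

Because both vectors are cut with the same numeric breaks, the resulting factors have identical levels and can be compared directly.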

Related

How to add jitter in a data frame in R

Input:
df = data.frame(col1 = 1:5, col2 = 5:9)
rownames(df) <- letters[1:5]
#add jitter
jitter(df) #Error in jitter(df) : 'x' must be numeric
Expected output: jitter will be added to the columns of df. Thanks!
jitter is a function that takes numeric as input. You cannot simply run jitter on the whole data.frame. You need to loop through the columns. You can do:
data.frame(lapply(df, jitter))
jitter is to be applied to a numeric vector, not a data frame.
If you want to apply jitter to all your columns, this should do (note that apply returns a matrix rather than a data frame):
apply(df, 2, jitter)
Just adding random numbers?
df_jit <- df + matrix(rnorm(nrow(df) * ncol(df), sd = 0.1), ncol = ncol(df))
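If the result should stay a data frame with its row names intact, one option is to assign the jittered columns back into a copy with `[] <-`. A minimal sketch using the df from the question; the set.seed call is only there to make the example reproducible:

```r
df <- data.frame(col1 = 1:5, col2 = 5:9)
rownames(df) <- letters[1:5]

set.seed(42)  # only for reproducibility of the example

# Assigning into df_jit[] replaces the columns but keeps the data frame
# structure, the column names, and the row names
df_jit <- df
df_jit[] <- lapply(df, jitter)

df_jit
```

In contrast, data.frame(lapply(df, jitter)) rebuilds the data frame from a plain list, so the original row names are not carried over.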

R assign levels to factor variable

I was given an Excel table similar to this:
datos <- data.frame(op= 1:4, var1= c(4, 2, 3, 2))
Now, there are other tables with the keys to op and var1, which happen to be categorical variables. Suppose that after loading them, they become:
set.seed(1)
op <- paste("op",c(1:4),sep="")
var1 <- sample(LETTERS, 19, replace= FALSE)
As you can see, there are unused levels in the data frame. I want to replace the numbers for the proper associated levels. This is what I've tried:
datos[] <- lapply(datos, factor)
levels(datos$op) <- op
levels(datos$var1) <- var1
This fails, because it reorders the factors alphabetically and gives a wrong output. I then tried:
datos$var1 <- factor(datos$var1, levels= var1, ordered= TRUE)
but this puts everything in datos$var1 as NA (I guess that's because of unmatching lengths).
What would be the right way to do this?
Following the kind advice of @docendoDiscimus, I post this answer for future reference:
For the data provided in the question:
datos$var1 <- factor(var1[datos$var1], levels= unique(var1))
datos
## op
Please note that this solution should be applied without first converting datos$var1 to a factor (that is, without running datos[] <- lapply(datos, factor)).
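An equivalent way to express the same mapping, assuming the numbers in the data are positions in the key vectors, is to pass both levels and labels to factor(). The sketch below reuses the objects from the question:

```r
datos <- data.frame(op = 1:4, var1 = c(4, 2, 3, 2))

set.seed(1)
op <- paste("op", 1:4, sep = "")
var1 <- sample(LETTERS, 19, replace = FALSE)

# levels = the numeric codes found in the data, labels = the text for each code.
# Codes that never occur in the data are kept as unused levels.
datos$op <- factor(datos$op, levels = seq_along(op), labels = op)
datos$var1 <- factor(datos$var1, levels = seq_along(var1), labels = var1)

str(datos)
```

The level order follows the key vectors rather than alphabetical order, because the levels are given explicitly.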

How to change values in data frame by column class in R

I've got a frame with a set of different variables - integers, factors, logicals - and I would like to recode all of the "NAs" as a numeric across the whole dataset while preserving the underlying variable class. For example:
frame <- data.frame("x" = rnorm(10), "y" = rep("A", 10))
frame[6,] <- NA
dat <- as.data.frame(apply(frame,2, function(x) ifelse(is.na(x)== TRUE, -9, x) ))
dat
str(dat)
However, here the integers turn into factors; when I include as.numeric(x) in the apply() function, this introduces errors. Thanks for any and all thoughts on how to deal with this.
apply first converts the data frame to a matrix, which here is of type character. as.data.frame then turns the character columns into factors (the default in R versions before 4.0.0). Instead, you could do
dat <- as.data.frame(lapply(frame, function(x) ifelse(is.na(x), -9, x) ) )
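Note that ifelse() on a factor column works on the underlying integer codes, so the factor class is not kept. A class-aware variant could look like the sketch below; recode_na is a hypothetical helper name, not an existing function, and the example forces y to be a factor so the factor branch is exercised.

```r
set.seed(1)
frame <- data.frame(x = rnorm(10), y = rep("A", 10), stringsAsFactors = TRUE)
frame[6, ] <- NA

# Hypothetical helper: replace NA with `value` while keeping factors as factors
recode_na <- function(col, value = -9) {
  if (is.factor(col)) {
    # The replacement must exist as a level, otherwise the assignment gives NA
    levels(col) <- c(levels(col), as.character(value))
    col[is.na(col)] <- as.character(value)
  } else {
    # Numeric columns stay numeric; character columns receive "-9";
    # logical and integer columns are promoted to numeric by the assignment
    col[is.na(col)] <- value
  }
  col
}

dat <- frame
dat[] <- lapply(frame, recode_na)
str(dat)
```

Here x stays numeric and y stays a factor with the extra level "-9".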

Undesired output (Levels) while selecting from R dataframe [duplicate]

This question already has answers here:
Drop unused factor levels in a subsetted data frame
Part of my code is similar to following:
id.row <- c("x1","x2", "x10", "x20")
id.num <- c(1, 2, 10, 20)
id.name <- c("one","two","ten","twenty")
info.data <- data.frame(id.row, id.num, id.name)
req.mat <- matrix(1:4, nrow = 2, ncol = 2)
row.names(req.mat) <- c("x1","x10")
p1 <- info.data$id.row %in% row.names(req.mat)
op1 <- info.data$id.num[p1]
op2 <- info.data$id.name[p1]
I think the code is pretty much self-explanatory, and I am getting the results that I want. There is no problem printing op1, but when I print op2 it gives me additional output (Levels). As there are thousands of rows in the original info.data, this "Levels" output does not look nice. Is there a way to turn it off?
Maybe you can use print(op2, max.levels=0)
If you didn't want any of your variables to be factors, then
info.data <- data.frame(id.row, id.num, id.name, stringsAsFactors=FALSE)
is a good choice. If you wanted id.row but not id.name to be a factor, then
info.data <- data.frame(id.row=factor(id.row), id.num, id.name,
                        stringsAsFactors=FALSE)
is a better way to set that up. If you created your data in some other way so that they already exist as factors in the data.frame, you can convert them back to character with
info.data$id.name <- as.character(info.data$id.name)
If you do want id.name to be a factor but just want to drop the extra levels after subsetting, then
op2 <- info.data$id.name[p1, drop=TRUE]
will do the trick.
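Base R's droplevels() is another way to discard unused levels after subsetting. A short sketch on the data from the question, with stringsAsFactors = TRUE set explicitly so that id.name is a factor regardless of the R version:

```r
id.row <- c("x1", "x2", "x10", "x20")
id.num <- c(1, 2, 10, 20)
id.name <- c("one", "two", "ten", "twenty")
info.data <- data.frame(id.row, id.num, id.name, stringsAsFactors = TRUE)

req.mat <- matrix(1:4, nrow = 2, ncol = 2)
row.names(req.mat) <- c("x1", "x10")
p1 <- info.data$id.row %in% row.names(req.mat)

# Subset first, then drop the levels that no longer occur
op2 <- droplevels(info.data$id.name[p1])
levels(op2)
```

droplevels() also accepts a whole data frame, in which case it drops unused levels from every factor column.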

Calculate Mean of a column in R having non numeric values

I have a column which contains numeric as well as non-numeric values. I want to find the mean of the numeric values so that I can use it to replace the non-numeric values. How can this be done in R?
Say your data frame is named df and the column you want to "fix" is called df$x. You could do the following.
You have to convert the factor to character and then to numeric. This will give you NAs for all the character strings that cannot be coerced to numbers.
nums <- as.numeric(as.character(df$x))
As Richie Cotton pointed out, there is a "more efficient, but harder to remember" way to convert factors to numeric
nums <- as.numeric(levels(df$x))[as.integer(df$x)]
To get the mean, use mean() but pass na.rm = TRUE
m <- mean(nums, na.rm = TRUE)
Assign the mean to all the NA values.
nums[is.na(nums)] <- m
You could then replace the old data, but I don't recommend it. Instead just add a new column
df$new.x <- nums
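Put together on a small made-up data frame (the values in df$x are assumptions for illustration only), the steps above might look like this:

```r
# Example data: a column mixing numbers and text (assumed for illustration)
df <- data.frame(x = c("1", "2", "abc", "4", "n/a"), stringsAsFactors = FALSE)

# as.character() makes this work whether x is a factor or a character vector;
# suppressWarnings() hides the expected "NAs introduced by coercion" warning
nums <- suppressWarnings(as.numeric(as.character(df$x)))

# Mean of the values that could be converted
m <- mean(nums, na.rm = TRUE)

# Replace the non-numeric entries with that mean and store it in a new column
nums[is.na(nums)] <- m
df$new.x <- nums

df
```

The original column x is left untouched, so the replaced entries can still be identified later.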
This is a function I wrote yesterday to combat the non-numeric types. I have a data.frame with unpredictable type for each column. I want to calculate the means for numeric, and leave everything else untouched.
colMeans2 <- function(x) {
  # This function tries to guess column type. Since all columns come as
  # characters, it first tries to see if x == "TRUE" or "FALSE". If
  # not so, it tries to coerce vector into integer. If that doesn't
  # work it tries to see if there's a ' \" ' in the vector (meaning a
  # column with character), it uses that as a result. Finally if nothing
  # else passes, it means the column type is numeric, and it calculates
  # the mean of that. The end.
  # browser()
  # try if logical
  if (any(levels(x) == "TRUE" | levels(x) == "FALSE")) return(NA)
  # try if integer
  try.int <- strtoi(x)
  if (all(!is.na(try.int))) return(try.int[1])
  # try if character
  if (any(grepl("\\\"", x))) return(x[1])
  # what's left is numeric
  mean(as.numeric(as.character(x)), na.rm = TRUE)
  # a possible warning about coerced NAs probably originates in the above line
}
You would use it like so:
apply(X = your.dataframe, MARGIN = 2, FUN = colMeans2)
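A different way to guess column types, shown here only as a sketch and not as a drop-in replacement for colMeans2, is to let utils::type.convert() do the guessing and then average only the columns that turn out to be numeric. The data frame raw and its contents are assumptions for illustration, and unlike colMeans2 this version returns NA for every non-numeric column.

```r
# All columns arrive as character, as in the scenario described above
raw <- data.frame(a = c("1.5", "2.5", "5"),
                  b = c("TRUE", "FALSE", "TRUE"),
                  c = c("x", "y", "z"),
                  stringsAsFactors = FALSE)

# type.convert() guesses logical / integer / double / character per column;
# as.is = TRUE keeps text as character instead of converting it to factor
typed <- lapply(raw, type.convert, as.is = TRUE)

# Mean for numeric columns, NA for everything else
means <- sapply(typed, function(col) {
  if (is.numeric(col)) mean(col, na.rm = TRUE) else NA_real_
})

means
```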
It sort of depends on what your data looks like.
Does it look like this?
data = list(1, 2, 'new jersey')
Then you could
data.numbers = sapply(data, as.numeric)
and get
c(1, 2, NA)
And you can find the mean with
mean(data.numbers, na.rm=TRUE)
A compact conversion:
vec <- c(0:10,"a","z")
vec2 <- (as.numeric(vec))
vec2[is.na(vec2)] <- mean(vec2[!is.na(vec2)])
as.numeric will emit a warning like the one shown below and convert the non-numeric values to NA.
Warning message:
In mean(as.numeric(vec)) : NAs introduced by coercion

Resources