How to use substr() to remove NA in mpg$trans data - r

Turn the variable trans into a factor variable whose unique values are “auto” and “manu”. (Hint: use the function substr() to extract substrings from a character vector before converting it to a factor.)
unique(mpg$trans)
mpg$trans <- substr(mpg$trans, 1, 234)
mpg$trans <- factor(mpg$trans, levels = c("auto", "manu"))
str(mpg)
However, this still doesn't work: trans ends up as NA instead of "auto" and "manu".
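One possible fix, sketched under the assumption that mpg is the ggplot2 dataset, whose trans values look like "auto(l5)" and "manual(m5)": with a stop position of 234, substr() returns every string unchanged, so nothing matches the levels "auto" and "manu" and factor() turns every value into NA. Stopping at character 4 leaves exactly the two prefixes. The vector trans below is a small stand-in for mpg$trans so the example runs without ggplot2.

```r
# Stand-in for mpg$trans (assumed to hold values of this shape)
trans <- c("auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l4)")

# Keep only the first four characters: "auto" or "manu"
trans_short <- substr(trans, 1, 4)

# Convert to a factor with the two expected levels
trans_factor <- factor(trans_short, levels = c("auto", "manu"))

# Count each level; an NA column would appear here if anything failed to match
table(trans_factor, useNA = "ifany")
```

On the real data the same two steps would be applied to mpg$trans directly.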

Related

character to numeric conversion problem in R

I have a large time series dataset in which the numeric results are stored in General format in MS Excel, and R picks up the data type as character. I tried using gsub(",", "", dummy), but it did not work: the dataset does not contain any commas or other visible special characters apart from the decimal point. Values are either positive or negative, there is one N/A, and the values have different numbers of decimal places.
How can I convert to numeric without having to deal with the N/A values afterwards? One thing to note is that, once converted to numeric, some of the values are displayed in scientific notation like 12.1 e+03 and others with four decimal places.
dummy = c("12.1", "42000", "1.2145", "12.25", "N/A", "323.369", "-1.235", "335", "0")
# Convert to numeric
dummy = gsub(",", "", dummy )
dummy = as.numeric(dummy )
Error
Warning message:
NAs introduced by coercion
Changing N/A to NA solves this issue:
# N/A to NA
dummy = c("12.1", "42000", "1.2145", "12.25", NA, "323.369", "-1.235", "335")
# Convert to numeric
dummy = gsub(",", "", dummy)
dummy = as.numeric(dummy)
To do so for your entire dataset, you can use:
# Across columns (for matrices)
data <- apply(data, 2, function(x){
ifelse(x == "N/A", NA, x)
})
# Then convert characters to numeric (for matrices)
data <- apply(data, 2, as.numeric)
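# Across columns (for data frames); assigning into data[] keeps the data frame
# structure, whereas data <- lapply(...) would return a plain list
data[] <- lapply(data, function(x){
  ifelse(x == "N/A", NA, x)
})
# Then convert characters to numeric (for data frames)
data[] <- lapply(data, as.numeric)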
Update: *apply differences for object types in R -- thanks to user20650 for pointing this out
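The data frame case can be tried end to end on a small example. This is a minimal sketch: the data frame and its column names a and b are made up for illustration, and the two steps are folded into a single lapply() call.

```r
# Hypothetical data frame with numbers stored as text and "N/A" placeholders
data <- data.frame(
  a = c("12.1", "N/A", "-1.235"),
  b = c("42000", "0", "N/A"),
  stringsAsFactors = FALSE
)

# Replace the placeholder with a real NA, then convert each column.
# Assigning into data[] keeps the data frame instead of returning a list.
data[] <- lapply(data, function(x) {
  x[x %in% "N/A"] <- NA
  as.numeric(x)
})

str(data)
```

If the values are read with read.csv(), its na.strings argument (for example na.strings = "N/A") can do the replacement at import time. Scientific notation in the printed output affects only the display, not the stored values.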

Is there a one-liner to create a factor from a numeric vector of level indices?

I've often had to take a vector of group indicators and wanted to create a factor out of it to explore the data more easily. I've always done this by instantiating the factor and then assigning the levels to it where group indicators are the indices of the levels (perhaps easier to see below). But seeing as factors are the least understood data type for me, I wonder if there is a simple function that will do it all for me that I'm not aware of.
# set seed so we're all on the same page
set.seed(1337)
# create the contrived vector of indices
myNumbers <- sample(x = 1:26, size = 50, replace = TRUE)
# This is how I would create the factor
myFactor <- factor(myNumbers) # step 1
levels(myFactor) <- letters # step 2
# Inspect the result
myFactor
You can specify levels when creating the factor from a vector.
foo = factor(x = letters[myNumbers], levels = letters)
length(levels(foo))
#[1] 26
If you don't specify levels when creating a factor, they are assigned automatically from the sorted unique values of the vector:
length(levels(myFactor)) #before step 2
#[1] 21
This means that, before step 2, the integer codes of myFactor range from 1 to 21 (see range(as.numeric(myFactor))). As a result, even though you intended to use indices from 1:26, step 2 matches the letters against indices from 1:21.
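The mismatch is easier to see on a tiny vector. The sketch below uses a made-up index vector idx with only three distinct values, and shows two one-step alternatives that map index i to letters[i].

```r
idx <- c(3, 26, 1, 3)   # hypothetical level indices; only 3 distinct values

# Two-step approach: levels are built from the sorted unique values (1, 3, 26),
# so the letters are matched by position among those values, not by index
two_step <- factor(idx)
levels(two_step) <- letters
as.character(two_step)    # "b" "c" "a" "b"

# One-step alternatives that map index i to letters[i]
by_lookup <- factor(letters[idx], levels = letters)
by_labels <- factor(idx, levels = 1:26, labels = letters)
as.character(by_lookup)   # "c" "z" "a" "c"
```

Both one-step forms keep all 26 levels even when some letters never occur.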

convert factor and character to numeric in a dataframe

I have a dataframe that I am trying to filter. Here is the structure:
'dataframe': 45 obs. of 1450 variables:
$ X01493112 :Factor w/ 47 levels "01493112", "0145769",...
..- attr(*, "names")= chr "510130020" "510360002"
I have a feeling I can't filter it because I have factors and characters but I cannot convert it to numeric. I have tried:
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
df2 <- as.numeric.factor(df1)
and numerous other conversions but I can't figure out why it won't work, when I call the new df I get
>numeric(0)
It would help to have some example data to work with, but try:
df$your_factor_variable_now_numeric <-
as.numeric(as.character(df$your_old_factor_variable))
Use this to convert a single factor variable, not the complete data frame. You can also have a look at type.convert(). If you want to convert all factors in the data frame, you can use something along the lines of:
df[] <- lapply(df, function(x) as.numeric(as.character(x)))
Note that this converts all factors and might not be what you want if you have factors that do not represent numeric values. If unnecessary conversion is a problem, or if there are non-numeric factors or characters in the data, the following would be appropriate:
numerify <- function(x) if(is.factor(x)) as.numeric(as.character(x)) else x
df[] <- lapply(df, numerify)
On a more general point though, the type of your variables should not prevent you from filtering, if, with filtering, you mean subsetting the dataframe. However, the type conversion should be solved with the above code.
fun1 <- function(x) as.numeric(as.character(x))
fun2 <- function(x) as.numeric(x)
fac_to_num <- function(y) modifyList(y,lapply(y[sapply(y,is.factor)],fun1))
char_to_num <- function(y) modifyList(y,lapply(y[sapply(y,is.character)],fun2))
Apply fac_to_num to your data frame for factor -> numeric conversion, and char_to_num for character -> numeric conversion.
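The reason for going through as.character() is that as.numeric() on a factor returns the internal level codes rather than the numbers the labels spell out. A small illustration, using a made-up factor f:

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "2.5", "10", "300"))

levels(f)                     # "10" "2.5" "300" (sorted as text)
as.numeric(f)                 # 1 2 1 3: internal level codes, not the values
as.numeric(as.character(f))   # 10.0 2.5 10.0 300.0: the values themselves
as.numeric(levels(f))[f]      # same values, converting each level only once
```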

R - Factor-to-Numeric Conversion using pattern matching

I have seen many questions on here addressing the issue of converting factors to numeric variables but none seem to address what I am trying to do.
I want to create a new column in a dataframe that contains numeric representations of an existing factor. I tried:
df$num = as.numeric(df$factor)
Which converted the factors but did not order them as needed. How can I define each factor's numeric value explicitly? Something along the lines of:
df$num = ("1" if factor == "GB", "2" if factor == "YT", "3" if factor == "BF")
Assume you have a factor variable num with 3 levels (GB, YT, BF) that you want to analyze as numeric.
I solved a similar problem by converting to character first, for example:
df$num <- as.character(df$num)
Then recoding to numeric values:
library(dplyr) # assumption: recode() with this argument style is dplyr's
df$num <- recode(df$num, "GB" = 1, "YT" = 2, "BF" = 3)
It's not elegant, but this strategy worked for my similar problem.
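A base R alternative, sketched here without any extra packages, is a named lookup vector. The data frame below is made up; its factor column is called factor, as in the question, and codes is a hypothetical name for the mapping.

```r
# Hypothetical data frame with a factor column named `factor`
df <- data.frame(factor = factor(c("YT", "GB", "BF", "GB")))

# Named lookup vector: the names are the levels, the values are the codes
codes <- c(GB = 1, YT = 2, BF = 3)

# Index by the character labels, not the factor itself
# (indexing with a factor would use its internal integer codes)
df$num <- unname(codes[as.character(df$factor)])

# Equivalent idea: put the levels in the desired order, then take the codes
df$num2 <- as.integer(factor(df$factor, levels = c("GB", "YT", "BF")))

df
```

The lookup vector also works when the target values are not consecutive integers, which the second form cannot express.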

R: mapply function returning error: level sets of factors are different

I have two dataframes (DfA and DfB). Each dataframe has three factor variables: species, type and region. DfA also has a numeric value column, and I want to use it to estimate numeric values in a new column of DfB, based on shared attributes.
I have a function which asks for the species, type and region, then creates a subset of DfA with those attributes and runs an algorithm on the subset to estimate the new value. When I run the function and specify the values manually as a test, it works fine.
If all of the factor levels and combinations in DfB have matching factors in DfA, the function works fine with mapply. But if any row in DfB contains a factor level that is not present in DfA, I get an error (level sets of factors are different). Example: if DfA includes data for regions A,B and C, and DfB contains data for regions A,B,C and D, mapply returns the error; if I remove the rows with region D, the mapply function works.
If a row contains a factor level that makes the function impossible to run, how can I skip that row or put NA in its place, and still run the function on the rows where it works?
You can drop/add levels to your data.frames to make sure your function works rather than cater for a special case:
# dropping and setting levels
Z = as.factor(sample(LETTERS[1:5],20,replace=TRUE))
levels(Z)
Y = as.factor(Z[-which(Z %in% LETTERS[4:5])])
levels(Y)
Y=droplevels(Y) # drop the levels
levels(Y)
levels(Y) = levels(Z) # bring them back
levels(Y)
Y = factor(Y,levels=LETTERS[1:7]) # expand them
levels(Y)
attr(Y,"levels")
attr(Y,"levels") = LETTERS[1:8] # keep expanding them
levels(Y)
require(plyr)
Y = mapvalues(Y,levels(Y),letters[1:length(levels(Y))]) # change the labels of the levels
levels(Y)
x<-factor(Y, labels=LETTERS[(length(unique(Y))+1):(2*length(unique(Y)))]) # change the labels of the levels on another variable
In your case:
dfa = data.frame("LVL1"=as.factor(sample(LETTERS[1:2],20,replace=TRUE)))
dfb = data.frame("LVL2"=as.factor(sample(LETTERS[2:5],20,replace=TRUE)))
newLevels = sort(unique(union(levels(dfa$LVL1),levels(dfb$LVL2))))
dfa$LVL1 = factor(dfa$LVL1,levels=newLevels)
dfb$LVL2 = factor(dfb$LVL2,levels=newLevels)
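If the goal is to get NA for rows that have no match, rather than to align the levels, the check can live inside the function passed to mapply. The following is only a sketch: the data are made up, only the region column is used, and estimate_value is a hypothetical stand-in (a plain mean) for the real estimation algorithm.

```r
# Hypothetical data; the names DfA, DfB and region follow the question
DfA <- data.frame(
  region = c("A", "A", "B", "C"),
  value  = c(1, 3, 10, 20),
  stringsAsFactors = FALSE
)
DfB <- data.frame(region = c("A", "B", "D"), stringsAsFactors = FALSE)

# Stand-in for the real estimation function: mean of the matching subset.
# Comparing as character avoids the "level sets of factors are different"
# error that == raises for two factors with different levels.
estimate_value <- function(region) {
  subset_a <- DfA[as.character(DfA$region) == as.character(region), ]
  if (nrow(subset_a) == 0) return(NA_real_)  # nothing to estimate from
  mean(subset_a$value)
}

DfB$estimate <- mapply(estimate_value, DfB$region, USE.NAMES = FALSE)
DfB
```

Region D has no rows in DfA, so its estimate is NA while the other rows are computed as usual.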
