How does a data frame handle data types? - r

I have encountered an issue that I do not understand and for which I could not find an explanation so far. Here is an example:
x = matrix(data = "test", nrow = 5, ncol = 3)
typeof(x[1, 1])
> "character"
x = as.data.frame(x)
typeof(x[1, 1])
> "integer"
Any idea why as.data.frame() coerces the data to integer type, and how to prevent it?

A matrix can hold only a single type. Matrices are normally used for numeric elements; if even one element is non-numeric (for example a string), the whole matrix is coerced to character.
In the OP's example the matrix holds character elements. as.data.frame() turns it into a data.frame, but in R versions before 4.0.0 the default option (stringsAsFactors=TRUE) converts each character column to the factor class. A factor is stored as integer codes plus a levels attribute, so typeof() reports "integer" while class() reports "factor".
This can be avoided by using stringsAsFactors=FALSE, which is the default from R 4.0.0 onwards:
x1 <- as.data.frame(x, stringsAsFactors=FALSE)
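To make the difference visible, here is a minimal sketch that builds both variants from the matrix in the question. The names df_fac and df_chr are made up for illustration, and stringsAsFactors is passed explicitly so the result should not depend on the R version.

```r
x <- matrix(data = "test", nrow = 5, ncol = 3)

# Pass stringsAsFactors explicitly so the behaviour is the same on old and new R
df_fac <- as.data.frame(x, stringsAsFactors = TRUE)
df_chr <- as.data.frame(x, stringsAsFactors = FALSE)

class(df_fac[1, 1])    # "factor"
typeof(df_fac[1, 1])   # "integer": the underlying codes
levels(df_fac[1, 1])   # "test": the label the code points to
typeof(df_chr[1, 1])   # "character"
```

In other words, typeof() describes the storage mode, while class() is usually the more informative check for data frame columns.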

Related

Comparison of character type and factor type in R

Ok, so I am having this issue right now. I have a matrix A whose row names are the values of a field in another matrix B, and I want to find the indices of those row names in B. I am trying to do this with which(B$field == rowname_A), but two things get in the way. First, the rowname_A variable is of class character and has the format "X12345". Second, the values of B$field are of type factor. Is there a way to remove the prepended X from the character value, convert it to factor, and then do the comparison? Or should I convert the factor values of B$field to character and then do the comparison?
Help will be appreciated.
Thanks.
This is fairly straightforward. The example below should help you out.
A <- matrix(1:3)
rownames(A) <- paste0("X", 1:3)
B <- data.frame(field = factor(1:3))
# Remove "X" from rownames(A) and check equality
B$field %in% substr(rownames(A), 2, nchar(rownames(A)))
# Add "X" to B$field and check equality
paste0("X", B$field) %in% rownames(A)
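Note that %in% returns a logical vector rather than positions. If the indices themselves are wanted, as the question asks, one possible sketch uses match() after converting the factor to character. The values in B below are deliberately shuffled for illustration and are not from the original post.

```r
A <- matrix(1:3)
rownames(A) <- paste0("X", 1:3)
# Hypothetical data: same values as above, but in a different order
B <- data.frame(field = factor(c(3, 1, 2)))

# Strip the leading "X", then look each row name up in B$field
keys <- sub("^X", "", rownames(A))
idx <- match(keys, as.character(B$field))
idx   # 2 3 1: the position in B$field of each row name of A
```

match() returns NA for row names that do not occur in B$field, which is easy to check with anyNA(idx).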

Converting data frame column from character to numeric

I have a data frame that I construct as such:
> yyz <- data.frame(a = c("1","2","n/a"), b = c(1,2,"n/a"))
> apply(yyz, 2, class)
          a           b
"character" "character"
I am attempting to convert the last column to numeric while still maintaining the first column as a character. I tried this:
> yyz$b <- as.numeric(as.character(yyz$b))
> yyz
    a  b
1   1  1
2   2  2
3 n/a NA
But when I run apply with class again, it still shows both columns as character.
> apply(yyz, 2, class)
          a           b
"character" "character"
Am I setting up the data frame wrong? Or is it the way R is interpreting the data frame?
If only one column needs to be numeric:
yyz$b <- as.numeric(as.character(yyz$b))
But if all the columns need to be changed to numeric, use lapply to loop over the columns. Because the columns are factors, convert each one to character first and then to numeric.
yyz[] <- lapply(yyz, function(x) as.numeric(as.character(x)))
Both columns in the OP's post are factors: c(1,2,"n/a") is coerced to character because of the string "n/a", and data.frame() then converts character columns to factor when stringsAsFactors=TRUE (the default before R 4.0.0; it is FALSE from 4.0.0 onwards). When reading from a file, this can be avoided with na.strings = "n/a" in read.table/read.csv. When using data.frame(), pass stringsAsFactors=FALSE to keep character columns.
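As a rough sketch of the na.strings route, the snippet below reads the same values from an inline string instead of a file. The csv_text variable and the colClasses choice are assumptions added for illustration, not part of the original post.

```r
# Hypothetical inline CSV standing in for a file on disk
csv_text <- "a,b\n1,1\n2,2\nn/a,n/a"

yyz <- read.csv(text = csv_text,
                na.strings = "n/a",                      # read "n/a" as NA
                colClasses = c("character", "numeric"))  # fix the column types up front

sapply(yyz, class)
```

With colClasses set this way, column a stays character and column b is numeric from the start, so no later conversion is needed. Note that na.strings applies to every column, so "n/a" in column a is read as NA as well.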
Regarding apply: it first converts the data frame to a matrix, and a matrix can hold only a single type, so every column is reported as "character". To check the class of each column, use
lapply(yyz, class)
Or
sapply(yyz, class)
Or check
str(yyz)
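The contrast between apply and sapply can be seen in a small sketch. stringsAsFactors = FALSE is passed explicitly here, which is an assumption made to keep the example independent of the R version.

```r
yyz <- data.frame(a = c("1", "2", "n/a"), b = c(1, 2, "n/a"),
                  stringsAsFactors = FALSE)

# "n/a" cannot be parsed, so it becomes NA (the coercion warning is expected)
yyz$b <- suppressWarnings(as.numeric(yyz$b))

via_apply  <- apply(yyz, 2, class)   # as.matrix() runs first: everything is character
via_sapply <- sapply(yyz, class)     # inspects each column as it is stored

via_apply    # a: "character", b: "character"
via_sapply   # a: "character", b: "numeric"
```

So the conversion in the question did work; only the check with apply was misleading.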

Sorting changes list to integer in R

I have a list, and when I apply sort() it changes the type to 'integer', which I do not understand. Any help is appreciated.
myfile.csv is a single column with values {"a","a","c","b","c","a"}
The code is as follows:
temp <- read.csv("myfile.csv",header=TRUE)
typeof(temp) ## prints: "list"
temp2 <- sort(temp[,1])
typeof(temp2) ## prints: "integer"
Now I can't refer to elements of temp2 using temp2[1,] or temp2[2,]; I get the error
Error in `[.default`(temp3, 1, ) : incorrect number of dimensions
Use this command and temp2 will be a data frame with sorted values:
temp2 <- temp[order(temp[ , 1]), , drop = FALSE]
temp2 <- sort(temp[,1]) takes the first column of the data.frame temp, sorts it, and assigns it to temp2. The result is an atomic vector (possibly with additional attributes), because data.frame columns are atomic vectors (possibly with additional attributes). Here the column was read as a factor, which is stored as integer codes, so typeof() reports "integer". A vector has no dimensions, which is why temp2[1,] fails; if you want the first element of temp2, use temp2[1]. You should study help("[").
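The two approaches can be compared in a small sketch. Since myfile.csv is not available, the data is read from an inline string; the column name V1 and the explicit stringsAsFactors = TRUE (to mimic the older default) are assumptions.

```r
# Inline stand-in for myfile.csv; the header name V1 is hypothetical
temp <- read.csv(text = "V1\na\na\nc\nb\nc\na", header = TRUE,
                 stringsAsFactors = TRUE)

# sort() on a single column returns a vector (here a factor), not a data frame
temp2 <- sort(temp[, 1])
typeof(temp2)   # "integer": factor codes
class(temp2)    # "factor"
temp2[1]        # first element, indexed with one subscript

# order() plus drop = FALSE keeps the data frame structure
temp_sorted <- temp[order(temp[, 1]), , drop = FALSE]
temp_sorted[1, , drop = FALSE]   # first row, still a data frame
```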

Losing Class information when I use apply in R

When I pass a row of a data frame to a function using apply, I lose the class information of the elements of that row. They all turn into 'character'. The following is a simple example. I want to add a couple of years to the 3 stooges' ages. When I try to add 2 to a value that had been numeric, R says "non-numeric argument to binary operator." How do I avoid this?
age = c(20, 30, 50)
who = c("Larry", "Curly", "Mo")
df = data.frame(who, age)
colnames(df) <- c( '_who_', '_age_')
dfunc <- function (er) {
  print(er['_age_'])
  print(er[2])
  print(is.numeric(er[2]))
  print(class(er[2]))
  return (er[2] + 2)
}
a <- apply(df,1, dfunc)
Output follows:
_age_
"20"
_age_
"20"
[1] FALSE
[1] "character"
Error in er[2] + 2 : non-numeric argument to binary operator
apply only really works on matrices (which have the same type for all elements). When you run it on a data.frame, it simply calls as.matrix first.
The easiest way around this is to work on the numeric columns only:
# skips the first column
a <- apply(df[, -1, drop=FALSE],1, dfunc)
# Or in two steps:
m <- as.matrix(df[, -1, drop=FALSE])
a <- apply(m,1, dfunc)
drop=FALSE is needed so that a single remaining column stays a data frame instead of being dropped to a plain vector.
-1 means all but the first column. You could instead explicitly specify the columns you want, for example df[, c('foo', 'bar')].
UPDATE
If you want your function to access one full data.frame row at a time, there are (at least) two options:
# "loop" over the index and extract a row at a time
sapply(seq_len(nrow(df)), function(i) dfunc(df[i,]))
# Use split to produce a list where each element is a row
sapply(split(df, seq_len(nrow(df))), dfunc)
The first option is probably better for large data frames since it doesn't have to create a huge list structure upfront.
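As a minimal sketch of why the types are lost and how an index-based loop keeps them, assuming simplified column names (who, age) instead of the ones in the question:

```r
df <- data.frame(who = c("Larry", "Curly", "Mo"), age = c(20, 30, 50))

# This is what apply() sees: one character matrix, numbers included
m <- as.matrix(df)
typeof(m)   # "character"

# Looping over row indices keeps each column's own type
older <- vapply(seq_len(nrow(df)),
                function(i) df$age[i] + 2,
                numeric(1))
older       # 22 32 52
```

For a simple case like this, the vectorised df$age + 2 gives the same result without any loop.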

Calculate Mean of a column in R having non numeric values

I have a column which contains numeric as well as non-numeric values. I want to find the mean of the numeric values, which I can then use to replace the non-numeric values. How can this be done in R?
Say your data frame is named df and the column you want to "fix" is called df$x. You could do the following.
You have to unfactor and then convert to numeric. This will give you NAs for all the character strings that cannot be coerced to numbers.
nums <- as.numeric(as.character(df$x))
As Richie Cotton pointed out, there is a "more efficient, but harder to remember" way to convert factors to numeric:
nums <- as.numeric(levels(df$x))[as.integer(df$x)]
To get the mean, use mean() with na.rm = TRUE (spelling out TRUE is safer than T, which can be reassigned):
m <- mean(nums, na.rm = TRUE)
Then assign the mean to all the NA values.
nums[is.na(nums)] <- m
You could then overwrite the old column, but I don't recommend it. Instead, just add a new column:
df$new.x <- nums
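Put together, the steps above can be sketched on a small made-up data frame. The values in x are hypothetical, and stringsAsFactors = TRUE is set explicitly so the column is a factor as the answer assumes.

```r
# Hypothetical column mixing numbers and a non-numeric string
df <- data.frame(x = c("1", "2", "abc", "4"), stringsAsFactors = TRUE)

# Unfactor, then convert; "abc" becomes NA (the coercion warning is expected)
nums <- suppressWarnings(as.numeric(as.character(df$x)))

# Same result via the levels-based idiom
nums_alt <- suppressWarnings(as.numeric(levels(df$x))[as.integer(df$x)])

m <- mean(nums, na.rm = TRUE)   # mean of 1, 2 and 4
nums[is.na(nums)] <- m          # fill the gap with the mean
df$new.x <- nums                # keep the original column untouched
df
```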
This is a function I wrote yesterday to combat the non-numeric types. I have a data.frame with unpredictable type for each column. I want to calculate the means for numeric, and leave everything else untouched.
colMeans2 <- function(x) {
  # This function tries to guess the column type. Since apply() hands over
  # all columns as character, it first checks for "TRUE" or "FALSE" values.
  # If there are none, it tries to coerce the vector into integer. If that
  # doesn't work, it looks for a ' \" ' in the vector (meaning a column with
  # character) and uses that as the result. Finally, if nothing else passes,
  # the column is treated as numeric and its mean is calculated. The end.
  x <- as.character(x)  # levels() is NULL for character input, so compare values
  # try if logical
  if (any(x %in% c("TRUE", "FALSE"))) return(NA)
  # try if integer
  try.int <- strtoi(x)
  if (all(!is.na(try.int))) return(try.int[1])
  # try if character
  if (any(grepl("\\\"", x))) return(x[1])
  # what's left is numeric
  mean(as.numeric(x), na.rm = TRUE)
  # a possible warning about coerced NAs probably originates in the above line
}
You would use it like so:
apply(X = your.dataframe, MARGIN = 2, FUN = colMeans2)
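If the columns already carry their proper types inside the data frame, a simpler alternative is to skip apply (and its conversion to a character matrix) and test each column directly. This is a sketch under that assumption, not the function above; the name numericMeans and the sample data are made up for illustration.

```r
# Hypothetical helper: mean for numeric columns, NA for everything else
numericMeans <- function(d) {
  vapply(d,
         function(col) if (is.numeric(col)) mean(col, na.rm = TRUE) else NA_real_,
         numeric(1))
}

# Made-up data frame with mixed column types
mixed <- data.frame(n = c(1, 2, 3),
                    s = c("a", "b", "c"),
                    l = c(TRUE, FALSE, TRUE))

res <- numericMeans(mixed)
res   # n: 2, s: NA, l: NA
```

This only helps when the types are correct to begin with; columns that were read as character or factor still need the conversion shown earlier.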
It sort of depends on what your data looks like.
Does it look like this?
data = list(1, 2, 'new jersey')
Then you could
data.numbers = sapply(data, as.numeric)
and get
c(1, 2, NA)
And you can find the mean with
mean(data.numbers, na.rm = TRUE)
A compact conversion:
vec <- c(0:10,"a","z")
vec2 <- (as.numeric(vec))
vec2[is.na(vec2)] <- mean(vec2[!is.na(vec2)])
as.numeric() emits the warning shown below and converts the non-numeric values to NA.
Warning message:
In mean(as.numeric(vec)) : NAs introduced by coercion

Resources