I recently had a problem in which every time I read a CSV file containing a table of values, R read it as a list instead of numeric. Since no thread gave me the complete answer for my situation, once I got it working I decided to post the script here in the hope that it is useful to someone. Here it is, with some description and a few options in case you need them:
(1) Read the data from a CSV file. Here the file has no header, so I set header=F; if yours has a header, change it to T.
data <- read.csv("folder_path/data_file.csv", header=F)
(1.a) Note: If you get a warning that says "incomplete final line found by readTableHeader", that means that R did not find a newline at the end of the file. Just add an extra empty line at the end of the csv file and the message will not show up again.
(2) You can check that the data is in list format (if it is numeric, then you are all set and don't need this procedure at all!) with the mode command.
mode(data)
(3) Initialize a matrix (as NA) where you want the data in numeric format, using the dimensions of data.
dataNum <- matrix(data = NA, nrow = dim(data)[1], ncol = dim(data)[2])
(4) OPTIONAL: If you want to add names to your columns and/or rows, you could use one of these options.
(4a) Add names to the columns and rows, assuming each carries similar information; in other words, you want the names to be col_1, col_2, ... and row_1, row_2, ...
colnames(dataNum) <- colnames(dataNum, do.NULL = F, prefix = "col_")
rownames(dataNum) <- rownames(dataNum, do.NULL = F, prefix = "row_")
(4b) If you want different names for each column and each row, then use this option instead and add all the names by hand.
colnames(dataNum) <- c("col_name_1", "col_name_2")
rownames(dataNum) <- c("row_name_1", "row_name_2")
(5) Transform the data from list to numeric form and put it in the matrix dataNum.
for (i in 1:dim(data)[2]) {
  dataNum[, i] <- as.numeric(data[[i]])
}
(6) You can check that the matrix is in numeric format with the mode command.
mode(dataNum)
(7) OPTIONAL: In case you would like to transpose the matrix, you can use the following instruction.
dataNum <- t(dataNum)
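Putting steps (1) through (6) together as one runnable sketch (a small hand-built data frame stands in for the read.csv result, since the original file path is yours alone):

```r
# Stand-in for the data frame returned by read.csv; its mode is "list"
data <- data.frame(V1 = c(1.5, 2.5), V2 = c(3.0, 4.0))
mode(data)  # "list"

# Initialize a numeric matrix with the same dimensions as data
dataNum <- matrix(data = NA, nrow = dim(data)[1], ncol = dim(data)[2])
colnames(dataNum) <- colnames(dataNum, do.NULL = FALSE, prefix = "col_")
rownames(dataNum) <- rownames(dataNum, do.NULL = FALSE, prefix = "row_")

# Copy each column over, coercing to numeric
for (i in 1:dim(data)[2]) {
  dataNum[, i] <- as.numeric(data[[i]])
}
mode(dataNum)  # "numeric"
```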
Here is a shorter/faster way to turn your data.frame into a numeric matrix:
data <- data.matrix(data)
There is also
data <- as.matrix(data)
but one important difference is if your data contains a factor or character column: as.matrix will coerce everything into a character matrix while data.matrix will always return a numeric or integer matrix.
data <- data.frame(
  logical = as.logical(c(TRUE, FALSE)),
  integer = as.integer(c(TRUE, FALSE)),
  numeric = as.numeric(c(TRUE, FALSE)),
  factor = factor(c(TRUE, FALSE))  # explicit factor; R >= 4.0 no longer auto-converts character columns
)
data.matrix(data)
#      logical integer numeric factor
# [1,]       1       1       1      2
# [2,]       0       0       0      1
as.matrix(data)
#      logical integer numeric factor
# [1,] " TRUE" "1"     "1"     "TRUE"
# [2,] "FALSE" "0"     "0"     "FALSE"
Related
I am working in R and have a dataset comprising 700 rows and 10 columns, with some of the values being '?'. I want to replace the '?' values with 0.
I am not sure whether the is.na() function would work here, as the values are not NA. If I convert my dataset into a matrix, search for '?', and replace it with 0, would that help?
I tried this code:
datafile <- sapply(datafile, function(y){if (y=='?') 0 else y})
After this I saved the file as a text file, but the '?' didn't go away.
You don't even need to convert to a matrix. As Ben Bolker said, your best option is to use na.strings when reading in the file.
If the data frame is not coming from a file, you can directly do:
df[df=="?"] <- 0
You have to remember though that anything containing characters might be converted to a factor. If that's the case, you have to convert those factors to character. Ben gives you a brute-force option; here's a gentler approach:
# check which variables are factors
isfactor <- sapply(df, is.factor)
# convert them to character
# I use lapply bcs that returns a list, and I use the
# list-like selection of "elements" (variables) to replace
# the variables
df[isfactor] <- lapply(df[isfactor], as.character)
So if you put everything together, you get:
df <- data.frame(
  a = c(1, 5, 3, '?', 4),
  b = c(3, '?', '?', 3, 2)
)
isfactor <- sapply(df, is.factor)
df[isfactor] <- lapply(df[isfactor], as.character)
df[df=="?"] <- 0
df
It depends whether you have other NA values in your data set. If not, almost certainly the easiest way to do this is to use the na.strings= argument to read.(table|csv|csv2|delim), i.e. read your data with something like dd <- read.csv(..., na.strings=c("?","NA")). Then
dd[is.na(dd)] <- 0
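A self-contained sketch of that flow, using textConnection and made-up column names in place of your real CSV file:

```r
# Hypothetical file contents; textConnection stands in for the real file
csv_text <- "a,b\n1,3\n5,?\n?,2"
dd <- read.csv(textConnection(csv_text), na.strings = c("?", "NA"))
dd[is.na(dd)] <- 0
dd
#   a b
# 1 1 3
# 2 5 0
# 3 0 2
```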
If for some reason you don't have control of this part of the process (e.g. someone handed you a .rda file and you don't have the original CSV), then it's a bit more tedious -- you need
which.qmark <- which(x=="?")
x <- suppressWarnings(as.numeric(as.character(x)))
x[which.qmark] <- 0
(This version also works if you have both ? and other NA values in your data)
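A quick sketch of that recipe on a made-up column x that contains both a '?' and a genuine NA:

```r
x <- factor(c("1.5", "?", "2.5", NA))  # made-up data: "?" plus a real NA
which.qmark <- which(x == "?")
x <- suppressWarnings(as.numeric(as.character(x)))
x[which.qmark] <- 0
x
# [1] 1.5 0.0 2.5  NA
```

Note that the real NA survives the conversion, while only the '?' positions are set to 0.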
On the output of read.table, as.vector produces an m x 1 matrix rather than a length m vector:
# data.txt contains one integer per line and nothing else
dataframe = read.table("data.txt", encoding='UTF-8', header=F)
v = as.vector(dataframe)
is.vector(v)
[1] FALSE
length(v)
[1] 1
dim(v)
[1] 19783 1
Consider readLines instead of read.table which imports the one column directly into a vector:
data <- readLines(con="data.txt", n=-1L, encoding='UTF-8', warn=FALSE)
is.vector(data)
#[1] TRUE
To summarise the above data types:
Data frame: A tabular object where each column can be a different type. A data frame is really a list.
Matrix: A tabular object where all values must have the same type.
Vector: A one dimensional object; all values must have the same type.
Hence it doesn't (in general) make sense to convert from a data frame to a vector.
In your example, you can either
unlist(dataframe)
or convert to a matrix, then use as.vector
as.vector(data.matrix(dataframe))
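Sketched on a small one-column data frame standing in for the read.table result:

```r
# Stand-in for the read.table result: one column of integers
dataframe <- data.frame(V1 = c(10L, 20L, 30L))

v1 <- unlist(dataframe, use.names = FALSE)
v2 <- as.vector(data.matrix(dataframe))

is.vector(v1)  # TRUE
all(v1 == v2)  # TRUE: both routes give a plain length-3 vector
```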
I would like to plot a heatmap of a table imported from MATLAB. The table has explicit rownames and colnames. I have loaded it into R with read.table, and I can run summary(i) and get the numeric summaries for each column:
i = read.table("file.txt",header=TRUE)
But when I try to run heatmap, it complains the converted matrix is not numeric, both with and without rownames.force=TRUE:
is.matrix(as.matrix(i,rownames.force=TRUE))
[1] TRUE
heatmap(as.matrix(i,rownames.force=TRUE))
Error in heatmap(as.matrix(i, rownames.force = TRUE)) :
'x' must be a numeric matrix
I think the problem is that as.matrix tries to convert the non-numeric rownames (or colnames, I am not sure anymore :-():
as.matrix(i)[1]
[1] "cluster-594-walk-0161"
Any ideas?
Without a reproducible example we are left guessing what goes wrong, but the error suggests that the matrix does not contain numbers but (probably) characters. Does this work:
i = as.numeric(i)
heatmap(as.matrix(i,rownames.force=TRUE))
and what is the output of:
is.numeric(as.matrix(i)[1])
(probably FALSE).
edit:
Your edit shows that the matrix contains characters, not numerics. It may be that in the text file the rownames are included as an additional column, probably the first one. In that case:
i = read.table("file.txt", header = TRUE, row.names = 1)
reads the first column as the rownames. So the problem is most likely in read.table, not in the conversion to a matrix.
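A minimal sketch of the fix, with textConnection and an invented file layout standing in for file.txt (the cluster IDs in the first column, numeric data after):

```r
# Hypothetical file layout: an id column followed by numeric columns
txt <- "id s1 s2
cluster-594-walk-0161 1.2 3.4
cluster-595-walk-0162 5.6 7.8"
i <- read.table(textConnection(txt), header = TRUE, row.names = 1)
m <- as.matrix(i)
is.numeric(m)   # TRUE, so heatmap(m) would now be happy
rownames(m)[1]  # "cluster-594-walk-0161"
```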
The solution is simply to define the row names first, then convert the data frame into a matrix, then insert the row names again. It should work perfectly:
rnames <- data[,1] # assign labels in column 1 to "rnames"
mat_data <- data.matrix(data[,2:ncol(data)]) # transform columns 2 through the last into a matrix
rownames(mat_data) <- rnames # assign row names
heatmap.2(mat_data, col=redblue(256), scale="row", key=T, keysize=1.5, trace="none",cexCol=0.9,srtCol=45) # your heatmap
When I pass a row of a data frame to a function using apply, I lose the class information of the elements of that row: they all turn into 'character'. The following is a simple example. I want to add a couple of years to the 3 stooges' ages. When I try to add 2 to a value that had been numeric, R says "non-numeric argument to binary operator." How do I avoid this?
age = c(20, 30, 50)
who = c("Larry", "Curly", "Mo")
df = data.frame(who, age)
colnames(df) <- c( '_who_', '_age_')
dfunc <- function(er) {
  print(er['_age_'])
  print(er[2])
  print(is.numeric(er[2]))
  print(class(er[2]))
  return(er[2] + 2)
}
a <- apply(df,1, dfunc)
Output follows:
_age_
"20"
_age_
"20"
[1] FALSE
[1] "character"
Error in er[2] + 2 : non-numeric argument to binary operator
apply only really works on matrices (which have the same type for all elements). When you run it on a data.frame, it simply calls as.matrix first.
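You can see the coercion directly with the stooges data from the question, without going through apply at all:

```r
age <- c(20, 30, 50)
who <- c("Larry", "Curly", "Mo")
df <- data.frame(who, age)

typeof(as.matrix(df))        # "character": the numbers became strings
as.matrix(df)[1, "age"]      # "20"
unique(apply(df, 1, class))  # "character" for every row
```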
The easiest way around this is to work on the numeric columns only:
# skips the first column
a <- apply(df[, -1, drop=FALSE],1, dfunc)
# Or in two steps:
m <- as.matrix(df[, -1, drop=FALSE])
a <- apply(m,1, dfunc)
The drop=FALSE is needed to avoid getting a single column vector.
-1 means all-but-the first column, you could instead explicitly specify the columns you want, for example df[, c('foo', 'bar')]
UPDATE
If you want your function to access one full data.frame row at a time, there are (at least) two options:
# "loop" over the index and extract a row at a time
sapply(seq_len(nrow(df)), function(i) dfunc(df[i,]))
# Use split to produce a list where each element is a row
sapply(split(df, seq_len(nrow(df))), dfunc)
The first option is probably better for large data frames since it doesn't have to create a huge list structure upfront.
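Applied to the stooges example, with dfunc reduced to just the addition (in both variants the function now receives a one-row data.frame, so the age column keeps its numeric class):

```r
df <- data.frame(who = c("Larry", "Curly", "Mo"),
                 age = c(20, 30, 50))
# er is a one-row data.frame here, not a character vector
dfunc <- function(er) er$age + 2

a1 <- sapply(seq_len(nrow(df)), function(i) dfunc(df[i, ]))
a2 <- sapply(split(df, seq_len(nrow(df))), dfunc)
a1
# [1] 22 32 52
```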
I have a column which contains numeric as well as non-numeric values. I want to find the mean of the numeric values so that I can use it to replace the non-numeric values. How can this be done in R?
Say your data frame is named df and the column you want to "fix" is called df$x. You could do the following.
You have to unfactor and then convert to numeric. This will give you NAs for all the character strings that cannot be coerced to numbers.
nums <- as.numeric(as.character(df$x))
As Richie Cotton pointed out, there is a "more efficient, but harder to remember" way to convert factors to numeric
nums <- as.numeric(levels(df$x))[as.integer(df$x)]
To get the mean, you use mean() but pass na.rm = T
m <- mean(nums, na.rm = T)
Assign the mean to all the NA values.
nums[is.na(nums)] <- m
You could then replace the old data, but I don't recommend it. Instead just add a new column
df$new.x <- nums
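The whole recipe in one runnable sketch; df$x is a made-up mixed column (in R >= 4.0 it is character rather than factor, but as.numeric(as.character(.)) handles both cases):

```r
df <- data.frame(x = c("1", "2", "oops", "4"))  # made-up mixed column
nums <- suppressWarnings(as.numeric(as.character(df$x)))  # "oops" -> NA
m <- mean(nums, na.rm = TRUE)   # mean of 1, 2, 4
nums[is.na(nums)] <- m          # fill the NA with the mean
df$new.x <- nums
df$new.x
# [1] 1.000000 2.000000 2.333333 4.000000
```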
This is a function I wrote yesterday to combat the non-numeric types. I have a data.frame with unpredictable type for each column. I want to calculate the means for numeric, and leave everything else untouched.
colMeans2 <- function(x) {
  # This function tries to guess the column type. Since all columns come as
  # characters, it first tries to see if x == "TRUE" or "FALSE". If
  # not, it tries to coerce the vector into integer. If that doesn't
  # work it tries to see if there's a ' \" ' in the vector (meaning a
  # column with characters), and uses that as the result. Finally if nothing
  # else passes, it means the column type is numeric, and it calculates
  # the mean of that. The end.
  # browser()

  # try if logical
  if (any(levels(x) == "TRUE" | levels(x) == "FALSE")) return(NA)
  # try if integer
  try.int <- strtoi(x)
  if (all(!is.na(try.int))) return(try.int[1])
  # try if character
  if (any(grepl("\\\"", x))) return(x[1])
  # what's left is numeric
  mean(as.numeric(as.character(x)), na.rm = TRUE)
  # a possible warning about coerced NAs probably originates in the above line
}
You would use it like so:
apply(X = your.dataframe, MARGIN = 2, FUN = colMeans2)
It sort of depends on what your data looks like.
Does it look like this?
data = list(1, 2, 'new jersey')
Then you could
data.numbers = sapply(data, as.numeric)
and get
c(1, 2, NA)
And you can find the mean with
mean(data.numbers, na.rm=T)
A compact conversion:
vec <- c(0:10,"a","z")
vec2 <- as.numeric(vec)
vec2[is.na(vec2)] <- mean(vec2[!is.na(vec2)])
as.numeric will print the warning message listed below and convert the non-numeric entries to NA.
Warning message:
In mean(as.numeric(vec)) : NAs introduced by coercion