I would like to plot a heatmap of a table imported from MATLAB. The table has explicit row names and column names. I loaded it into R with read.table, and I can run summary(i) and get the numeric summaries for each column:
i = read.table("file.txt",header=TRUE)
But when I try to run heatmap, it complains the converted matrix is not numeric, both with and without rownames.force=TRUE:
is.matrix(as.matrix(i,rownames.force=TRUE))
[1] TRUE
heatmap(as.matrix(i,rownames.force=TRUE))
Error in heatmap(as.matrix(i, rownames.force = TRUE)) :
'x' must be a numeric matrix
I think the problem is that as.matrix tries to convert the non-numeric rowname (or colname, I am not sure anymore :-():
as.matrix(i)[1]
[1] "cluster-594-walk-0161"
Any ideas?
Without a reproducible example we are left guessing what goes wrong, but the error suggests that the matrix does not contain numbers but (probably) characters. Does this work:
i = as.numeric(i)
heatmap(as.matrix(i,rownames.force=TRUE))
and what is the output of:
is.numeric(as.matrix(i)[1])
(probably FALSE).
edit:
Your edit shows that the matrix contains characters, not numerics. It may be that in the text file the rownames are included as an additional column, probably the first one. In that case:
i = read.table("file.txt", header = TRUE, row.names = 1)
reads the first column as the rownames. So the problem is most likely in read.table, not in the conversion to a matrix.
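To make the difference concrete, here is a minimal sketch with made-up data (the column names and values are hypothetical, and the table is read from an inline string instead of file.txt). It only shows that the label column turns the matrix into character unless it is declared as row names:

```r
# Hypothetical table: the first column holds labels, the rest are numbers
txt <- "name a b c
cluster-1 1 2 3
cluster-2 4 5 6
cluster-3 7 8 10"

# Without row.names, the label column becomes part of the data,
# so as.matrix() has to coerce everything to character
bad <- read.table(text = txt, header = TRUE)
is.numeric(as.matrix(bad))

# With row.names = 1, the labels become row names and the matrix stays numeric
good <- read.table(text = txt, header = TRUE, row.names = 1)
m <- as.matrix(good)
is.numeric(m)
rownames(m)
```

A matrix like m is what heatmap expects as its first argument.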
The solution is to save the row names first, convert the remaining columns of the data frame into a matrix, and then reattach the row names. Note that heatmap.2 and redblue come from the gplots package:
library(gplots) # provides heatmap.2 and redblue
rnames <- data[,1] # assign labels in column 1 to "rnames"
mat_data <- data.matrix(data[,2:ncol(data)]) # transform the remaining columns into a numeric matrix
rownames(mat_data) <- rnames # assign row names
heatmap.2(mat_data, col=redblue(256), scale="row", key=TRUE, keysize=1.5, trace="none",cexCol=0.9,srtCol=45) # your heatmap
I am new to R and RStudio, and I need to create a vector that stores the first 100 rows of a column of the CSV file the program reads. However, despite all my attempts, my variable v1 ends up as a data frame instead of an integer vector. What can I do to solve this? Here's my code:
library(readr)
cup_data <- read_csv("C:/Users/Asus.DESKTOP-BTB81TA/Desktop/STUDY/YEAR 2/
YEAR 2 SEM 2/PREDICTIVE ANALYTICS(1_PA_011763)/Week 1 (Intro to PA)/
Practical/cup98lrn variable subset small.csv")
# Retrieve only the selected columns
cup_data_small <- cup_data[c("AGE", "RAMNTALL", "NGIFTALL", "LASTGIFT",
"GENDER", "TIMELAG", "AVGGIFT", "TARGET_B", "TARGET_D")]
str(cup_data_small)
cup_data_small
#get the number of columns and rows
ncol(cup_data_small)
nrow(cup_data_small)
cat("No of column",ncol(cup_data_small),"\nNo of Row :",nrow(cup_data_small))
#cat
#Concatenate and print
#Outputs the objects, concatenating the representations.
#cat performs much less conversion than print.
#Print the first 10 rows of cup_data_small
head(cup_data_small, n=10)
#Create a vector V1 by selecting first 100 rows of AGE
v1 <- cup_data_small[1:100,"AGE",]
Here's what my environment says:
cup_data_small is a tibble, a slightly modified version of a dataframe that has slightly different rules to try to avoid some common quirks/inconsistencies in standard dataframes. E.g. in a standard dataframe, df[, c("a")] gives you a vector, and df[, c("a", "b")] gives you a dataframe - you're using the same syntax so arguably they should give the same type of result.
To get just a vector from a tibble with single-bracket indexing, you have to explicitly pass drop = TRUE (alternatively, [[ ]] or $ extract a single column as a vector), e.g.:
library(dplyr)
# Standard dataframe
iris[, "Species"]
iris_tibble = iris %>%
as_tibble()
# Remains a tibble/dataframe
iris_tibble[, "Species"]
# This gives you just the vector
iris_tibble[, "Species", drop = TRUE]
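Applied to the original question, a sketch could look like the following. It uses a plain data frame with made-up values as a stand-in for cup_data_small (the data here are hypothetical); `[[` and `$` return a vector for tibbles as well, so the same two lines should carry over:

```r
# Hypothetical stand-in for cup_data_small: 120 rows, two columns
cup <- data.frame(AGE = rep(c(25L, 40L, 63L), 40), GENDER = "F")

# Extract the column as a vector first, then take the first 100 values
v1 <- cup[["AGE"]][1:100]

# Equivalent with $
v2 <- cup$AGE[1:100]

str(v1)
```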
I'm using a data.frame that contains many data.frames. I'm trying to access these sub-data.frames within a loop. Within these loops, the names of the sub-data.frames are contained in a string variable. Since this is a string, I can use the [,] notation to extract data from these sub-data.frames, e.g. X <- "sub.df" and then df[42,X] would output the same as df$sub.df[42].
I'm trying to create a single row data.frame to replace a row within the sub-data.frames. (I'm doing this repeatedly and that's why my sub-data.frame name is in a string). However, I'm having trouble inserting this new data into these sub-data.frames. Here is a MWE:
#Set up the data.frames and sub-data.frames
sub.frame <- data.frame(X=1:10,Y=11:20)
df <- data.frame(A=21:30)
df$Z <- sub.frame
Col.Var <- "Z"
#Create a row to insert
new.data.frame <- data.frame(X=40,Y=50)
#This works:
df$Z[3,] <- new.data.frame
#These don't (even though both sides of the assignment give the correct values/dimensions):
df[,Col.Var][6,] <- new.data.frame #Gives Warning and collapses df$Z to 1 dimension
df[7,Col.Var] <- new.data.frame #Gives Warning and only uses first value in both places
#This works, but is a work-around and feels very inelegant(!)
eval(parse(text=paste0("df$",Col.Var,"[8,] <- new.data.frame")))
Are there any better ways to do this kind of insertion? Given my experience with R, I feel like this should be easy, but I can't quite figure it out.
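One option that avoids eval(parse()) is `[[`, which accepts the column name as a string and can be chained with row indexing on the left-hand side of an assignment. A minimal sketch, reusing the objects from the MWE above:

```r
# Same setup as in the question
sub.frame <- data.frame(X = 1:10, Y = 11:20)
df <- data.frame(A = 21:30)
df$Z <- sub.frame
Col.Var <- "Z"
new.data.frame <- data.frame(X = 40, Y = 50)

# df[[Col.Var]] is the string-based equivalent of df$Z,
# so row 8 of the sub-data.frame can be replaced directly
df[[Col.Var]][8, ] <- new.data.frame

df$Z[8, ]
```

The difference from df[, Col.Var] is that `[[` selects the nested data frame as a single element, so the row assignment is applied to it as a whole.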
First, I am new to R.
My csv has some numbers formatted as "general", so I can't do math with the data. Is there any solution for this?
I have tried data <- as.numeric(as.character(data)) but it failed.
data <- read.csv(file="TC.csv", header=TRUE, sep=",")
data[ data == "?" ] <- NA
for(i in 1:ncol(data)) {
data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
}
I get this message:
In mean.default(results) : argument is not numeric or logical: returning NA
I think the problem is related to numbers like on yellow cell.
Sample input:
You shouldn't need to loop over the data set to remove rows. Also, I don't believe the highlighted rows are the root of the problem. read.csv already returns a data frame, so no extra conversion is needed:
data <- read.csv(file="TC.csv", header=TRUE, sep=",")
To remove the rows containing the '?' character, you should be able to run the code below (here Column stands for the name of the affected column). I think this is easier than converting it to NA and then dropping it. Note fixed=TRUE: without it, '?' is interpreted as a regular-expression operator rather than a literal question mark.
data <- data[!grepl('?', data$Column, fixed=TRUE),]
data$Column <- as.numeric(as.character(data$Column)) # the column was read as text because of the '?'
mean(data$Column)
summary(data)
In summary, replace or drop the rows that have values that aren't numeric, convert the affected columns to numeric, and then perform your summary stats.
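You are getting that warning because you are applying the mean function to a column that is not numeric: the '?' entries cause the column to be read as character (or factor), and mean only operates on numeric or logical types.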
In R, the usual way of dealing with multi-dimensional data is not to loop over it, but to use one of the various apply functions, which perform an operation on one dimension of your data. Here you are looking for the column mean, which you get by:
TC.csv
a_0,a_1,a_2,a_3,a_4
3030.93,1,1,1,1
3095.78,2,2,2,2
2932.61,3,3,?,3
3032.24,4,4,4,4
2946.25,5,5,5,5
3058.88,6,?,6,6
get_mean.R
data <- read.csv(file="TC.csv", header=TRUE, sep=",", na.strings="?")
# apply( data, dimension, function, function_args )
col_means <- apply( data, 2, mean, na.rm=TRUE )
Apply Functions Over Array Margins
Apply a Function over a List or Vector
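For completeness, here is a sketch of the full mean imputation the question was after, assuming the sample TC.csv shown above (read from an inline string here so the snippet runs on its own). colMeans is used as a shorthand for the apply call; the replacement loop is my own illustration, not part of the original answer:

```r
# Inline copy of the sample TC.csv from above
csv_text <- "a_0,a_1,a_2,a_3,a_4
3030.93,1,1,1,1
3095.78,2,2,2,2
2932.61,3,3,?,3
3032.24,4,4,4,4
2946.25,5,5,5,5
3058.88,6,?,6,6"

# Treat '?' as missing so every column is read as numeric
data <- read.csv(text = csv_text, header = TRUE, na.strings = "?")

# Column means, ignoring missing values
col_means <- colMeans(data, na.rm = TRUE)

# Replace each missing value with the mean of its column
for (j in seq_along(data)) {
  data[[j]][is.na(data[[j]])] <- col_means[j]
}

data
```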
I am working in R and have a dataset comprising 700 rows and 10 columns, with some of the values being '?'. I want to replace the '?' values with 0.
I am not sure if the is.na() function would work here, as the values are not NA. If I convert my dataset into a matrix, and after searching for '?' , replace it with 0, would that help?
I tried this code:
datafile <- sapply(datafile, function(y){if (y=='?') 0 else y})
after this I saved the file as a text file, but the ? didn't go away.
You don't even need to convert to a matrix. As Ben Bolker said, your best option is to use na.strings when reading in the file.
If the data frame is not coming from a file, you can directly do:
df[df=="?"] <- 0
You have to remember though that anything containing character might be converted to a factor. If that's the case, you have to convert those factors to character. Ben gives you a brute force option, here's a more gentle approach:
# check which variables are factors
isfactor <- sapply(df, is.factor)
# convert them to character
# I use lapply bcs that returns a list, and I use the
# list-like selection of "elements" (variables) to replace
# the variables
df[isfactor] <- lapply(df[isfactor], as.character)
So if you put everything together, you get:
df <- data.frame(
a = c(1,5,3,'?',4),
b = c(3,'?','?',3,2)
)
isfactor <- sapply(df, is.factor)
df[isfactor] <- lapply(df[isfactor], as.character)
df[df=="?"] <- 0
df
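Note that after this replacement the columns still hold character values. If the goal is to do arithmetic, a further conversion step is needed. A small sketch, assuming every remaining entry is a valid number (stringsAsFactors = FALSE is set explicitly so the behaviour does not depend on the R version):

```r
df <- data.frame(
  a = c(1, 5, 3, "?", 4),
  b = c(3, "?", "?", 3, 2),
  stringsAsFactors = FALSE
)

# Replace the placeholder, then convert every column to numeric
df[df == "?"] <- 0
df[] <- lapply(df, as.numeric)

str(df)
colSums(df)
```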
It depends whether you have other NA values in your data set. If not, almost certainly the easiest way to do this is to use the na.strings= argument to read.(table|csv|csv2|delim), i.e. read your data with something like dd <- read.csv(...,na.strings=c("?","NA")). Then
dd[is.na(dd)] <- 0
If for some reason you don't have control of this part of the process (e.g. someone handed you a .rda file and you don't have the original CSV), then it's a bit more tedious -- you need
which.qmark <- which(x=="?")
x <- suppressWarnings(as.numeric(as.character(x)))
x[which.qmark] <- 0
(This version also works if you have both ? and other NA values in your data)
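As a worked illustration of those three lines, here is a sketch on a made-up factor that contains both "?" and a genuine NA (the values are hypothetical):

```r
# Hypothetical column: two '?' placeholders and one real missing value
x <- factor(c("1", "?", "3.5", NA, "?"))

# Remember where the '?' entries are before converting
which.qmark <- which(x == "?")

# Convert via character; '?' becomes NA (warning suppressed on purpose)
x <- suppressWarnings(as.numeric(as.character(x)))

# Only the former '?' positions are set to 0; the real NA is kept
x[which.qmark] <- 0
x
```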
I recently had a problem in which every time I read a csv file containing a table of values, R read it as a list instead of as numeric. As no thread provided me the entire answer for my situation, once I was able to make it run I decided to include here the script that worked for me in the hope that it is useful to someone. Here it is, with some description and some options in case you need it:
(1) Read the data from a csv file. Here the file has no header, so I put FALSE; if yours has a header, change it to TRUE.
data <- read.csv("folder_path/data_file.csv", header=FALSE)
(1.a) Note: If you get a warning that says "incomplete final line found by readTableHeader", that means the last line of the file does not end with a newline character. Just put an extra empty line at the end of the csv file and the message will not show up again.
(2) You can check that the data is in list format with the mode command. Note that mode reports "list" for any data frame, even when all of its columns are numeric; what this procedure produces is a numeric matrix.
mode(data)
(3) Initialize a matrix (as NA) where you want the data in numeric format, using the dimensions of data.
dataNum <- matrix(data = NA, nrow = dim(data)[1], ncol = dim(data)[2])
(4) OPTIONAL: If you want to add names to your columns and/or rows, you could use one of these options.
(4a) Add names to the columns and rows, assuming that each have similar information, in other words you want the names to be col_1, col_2, ... and row_1, row_2, ...
colnames(dataNum) <- colnames(dataNum, do.NULL = FALSE, prefix = "col_")
rownames(dataNum) <- rownames(dataNum, do.NULL = FALSE, prefix = "row_")
(4b) If you want different names for each column and each row, then use this option instead and add all the names by hand.
colnames(dataNum) <- c("col_name_1", "col_name_2")
rownames(dataNum) <- c("row_name_1", "row_name_2")
(5) Transform the data from list to numeric form and put it in the matrix dataNum.
for (i in seq_len(ncol(data))) {
  dataNum[,i] <- as.numeric(data[[i]])
}
(6) You can check that the matrix is in numeric format with the mode command.
mode(dataNum)
(7) OPTIONAL: In case you would like to transpose the matrix, you can use the following instruction.
dataNum <- t(dataNum)
Here is a shorter/faster way to turn your data.frame into a numeric matrix:
data <- data.matrix(data)
There is also
data <- as.matrix(data)
but one important difference is if your data contains a factor or character column: as.matrix will coerce everything into a character matrix while data.matrix will always return a numeric or integer matrix.
data <- data.frame(
logical = as.logical(c(TRUE, FALSE)),
integer = as.integer(c(TRUE, FALSE)),
numeric = as.numeric(c(TRUE, FALSE)),
factor = as.character(c(TRUE, FALSE))
)
data.matrix(data)
# logical integer numeric factor
# [1,] 1 1 1 2
# [2,] 0 0 0 1
as.matrix(data)
# logical integer numeric factor
# [1,] " TRUE" "1" "1" "TRUE"
# [2,] "FALSE" "0" "0" "FALSE"
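One caveat that follows from the output above: for a factor column, data.matrix returns the internal level codes, not the numbers the labels may look like. A small sketch with made-up values, showing the codes and one way to get the actual numbers instead (going through as.character first):

```r
# Hypothetical data: x is a factor whose labels look like numbers
df <- data.frame(x = factor(c("30", "2", "10")), y = c(1.5, 2.5, 3.5))

# data.matrix uses the level codes (levels are sorted as strings)
codes <- data.matrix(df)
codes[, "x"]

# Converting the factor to character and then to numeric keeps the values
df$x <- as.numeric(as.character(df$x))
vals <- data.matrix(df)
vals[, "x"]
```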