I am new to R and I am trying to convert a dataframe to a numeric matrix using the below code
expData <- read.table("GSM469176.txt",header = F)
expVec <- as.numeric(as.matrix(expData))
When I use as.matrix, without as.numeric, it returns some numbers (as below)
0.083531 0.083496 0.083464 0.083435 0.083406 0.083377 0.083348"
[9975] "-0.00285 -0.0028274 -0.0028046 -0.0027814 -0.0027574 -0.0027319 -0.0027042
but when I put in the as.numeric, they are all converted to "NA"
I apologize if someone has asked this question before but I can't find a post that solves my problem.
Thanks in advance
You have 2 issues. First, if you examine the structure of the data frame, you'll note that the first column is characters:
head(expData)[, 1:4]
V1 V2 V3 V4
1 YAL002W(cer) 6.1497e-02 6.2814e-02 6.4130e-02
2 YAL002W(par) 7.1352e-02 7.3262e-02 7.5171e-02
3 YAL003W(cer) 2.2428e-02 3.8252e-02 5.4078e-02
4 YAL003W(par) 2.6548e-02 3.6747e-02 4.6947e-02
5 YAL005C(cer) 2.4023e-05 2.3243e-05 2.2462e-05
6 YAL005C(par) 2.0252e-02 2.0346e-02 2.0440e-02
Therefore, trying to convert the complete data frame to numeric will not work as expected.
Second, you are running as.numeric() after as.matrix(), which is converting the matrix to a vector:
x <- as.numeric(as.matrix(expData))
# Warning message:
# NAs introduced by coercion
class(x)
[1] "numeric"
dim(x)
# NULL not a matrix
length(x)
# [1] 14261302
I suggest you try this:
rownames(expData) <- expData$V1
expData$V1 <- NULL
expData <- as.matrix(expData)
dim(expData)
# [1] 7502 1900
class(expData[, 1])
# [1] "numeric"
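To see both issues on something small, here is a minimal sketch using a made-up three-row data frame (toy and toy_mat are hypothetical names, and the values are just copied from the head() output above, not the full file):

```r
# Toy stand-in for expData: an ID column followed by numeric columns.
toy <- data.frame(
  V1 = c("YAL002W(cer)", "YAL002W(par)", "YAL003W(cer)"),
  V2 = c(6.1497e-02, 7.1352e-02, 2.2428e-02),
  V3 = c(6.2814e-02, 7.3262e-02, 3.8252e-02),
  stringsAsFactors = FALSE
)

# With the ID column included, as.matrix() falls back to a character matrix.
mode(as.matrix(toy))

# Convert only the numeric columns, then attach the IDs as row names.
toy_mat <- as.matrix(toy[, -1])
rownames(toy_mat) <- toy$V1

is.numeric(toy_mat)   # the matrix is numeric now
dim(toy_mat)          # and it still has two dimensions
```

This variant leaves the original data frame untouched, which can be handy if you still need the ID column later.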
You get the NA's when R doesn't know how to convert something to a number.
Specifically, the quotation marks in your output tell me that you have one (or several) long strings of numbers. To see why this is bad, try: as.numeric("-0.00285 -0.0028274")
I don't know what your raw data is like, but as @alexwhan mentioned, the culprit is probably in your call to read.table.
To fix it, try explicitly setting the sep argument (i.e., next to where you have header).
I would suggest opening up the raw file in a simple text editor (TextEdit.app or Notepad, not Word) and seeing how the values are separated. My guess is
..., sep="\t"
should do the trick.
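As a rough illustration of how the separator changes what read.table returns, here is a sketch on made-up tab-separated text. The contents of raw are an assumption for demonstration only, not the real GSM469176.txt:

```r
# Hypothetical two-line, tab-separated input (not the real file).
raw <- "geneA\t0.083531\t0.083496\ngeneB\t-0.00285\t-0.0028274"

# With a separator that does not occur in the data, each line stays one string.
wrong <- read.table(text = raw, header = FALSE, sep = ",",
                    stringsAsFactors = FALSE)
ncol(wrong)
suppressWarnings(as.numeric(wrong$V1))   # cannot be parsed as numbers

# With the matching separator, the fields are split into separate columns.
right <- read.table(text = raw, header = FALSE, sep = "\t",
                    stringsAsFactors = FALSE)
ncol(right)
sapply(right, class)
```

Checking ncol() and sapply(x, class) right after reading is a quick way to confirm the file was split the way you expected.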
I'm currently working with a large data column that holds times.
Time in this format is 00h00.
'\d\dh\d\d' is the regex equivalent I believe.
Though many of the cells have terms like "morning" or other terms that can't be used.
I'm trying to use the str_replace_all() function with no success.
As a follow-up question, would I be able to plot these times on a histogram, counting each occurrence? That is the end goal here.
Thank you for your suggestions.
If I understand correctly, you just want to filter off non matching entries in your time column, something like this:
df <- df[grepl("^\\d{2}h\\d{2}$", df$time), ]
As an alternative (guess), perhaps you mean to extract 00h00 from each string, removing any other non-compliant portion. Entries with no match at all come back as NA.
vec <- c("01h23", "02h34 ", "03h45 morning", "morning")
stringr::str_extract(vec, "\\d\\d[Hh]\\d\\d")
# [1] "01h23" "02h34" "03h45" NA
or with base R,
out <- strcapture("(\\d\\d[Hh]\\d\\d)", vec, list(tm = ""))
out
# tm
# 1 01h23
# 2 02h34
# 3 03h45
# 4 <NA>
This returns a data.frame whose tm column can easily be extracted as a vector. If you need the non-compliant strings to be empty strings instead of NA, then
out$tm[is.na(out$tm)] <- ""
out
# tm
# 1 01h23
# 2 02h34
# 3 03h45
# 4
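For the histogram follow-up, one possible approach is to capture the hour and minute separately and convert them to decimal hours. This is only a sketch under assumptions: the names parts and hours are made up, and hourly bins are a guess at what is wanted.

```r
vec <- c("01h23", "02h34 ", "03h45 morning", "morning")

# Capture hour and minute as integers; entries with no time become NA.
parts <- strcapture("(\\d\\d)[Hh](\\d\\d)", vec,
                    data.frame(hour = integer(), minute = integer()))

# Express each time as decimal hours and drop the entries with no time.
hours <- with(parts, hour + minute / 60)
hours <- hours[!is.na(hours)]

# One bar per hour of the day.
hist(hours, breaks = 0:24, xlab = "Hour of day",
     main = "Occurrences by time of day")
```

With real data you would pass df$time in place of vec; the breaks argument can be changed if finer or coarser bins are needed.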
In a dataframe, I have a column that has numeric values and some mixed in character data for some rows. I want to remove all rows with the character data and keep those rows with a number value. The df I have is 6 million rows, so I simply made a small object to try to solve my issue and then implement at a larger scale.
Here is what I did:
a <- c("fruit", "love", 53)
b <- str_replace_all("^[:alpha:]", 0)
Reading answers to other UseMethod errors on here (about factors), I tried to change "a" to as.character(a) and attempt "b" again. But, I get the same error. I'm trying to simply make any alphabetic value into the number zero and I'm fairly new at all this.
There are several issues here, even in these two lines of code. First, a is a character vector: c() coerces all of its elements to a single common type, and because "fruit" and "love" are characters, your numeric 53 is coerced into a character as well.
> print(a)
[1] "fruit" "love" "53"
You've got the wrong syntax for str_replace_all. See the documentation for how to use it correctly. But that's not what you want here, because you want numerics.
The first thing you need to do is convert a to a numeric. A crude way of doing this is simply
> b <- as.numeric(a)
Warning message:
NAs introduced by coercion
> b
[1] NA NA 53
And then subset to include only the numeric values in b:
> b <- b[!is.na(b)]
> b
[1] 53
But whether that's what you want to do with a 6 million row dataframe is another matter. Please think about exactly what you would like to do, supply us with better test data, and ask your question again.
There's probably a more efficient way of doing this on a large data frame (e.g. something column-wise, instead of row-wise), but to answer your specific question about each row a:
as.numeric(stringr::str_replace_all(a, "[a-z]+", "0"))
Note that the replacing value must be a character (the last argument in the function call, "0"). (You can look up the documentation from your R-console by: ?stringr::str_replace_all)
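For the data frame case in the question (drop the rows that hold text, keep the rows that hold numbers), a column-wise sketch could look like the following. The data frame and its column names id and value are hypothetical, since the question gives no sample of the real data:

```r
# Hypothetical small data frame with a mixed column.
df <- data.frame(id = 1:4,
                 value = c("fruit", "12.5", "love", "53"),
                 stringsAsFactors = FALSE)

# Convert the whole column in one call; non-numeric strings become NA.
num <- suppressWarnings(as.numeric(df$value))

# Keep only the rows that converted cleanly and store the numeric values.
clean <- df[!is.na(num), ]
clean$value <- num[!is.na(num)]
clean
```

Note that this also drops rows whose value was already missing, so it is worth counting NAs before and after if that distinction matters.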
#eg1:
paste(data.frame(a=as.character(as.Date("2019-12-31"))))
[1] "1"
#eg2:
paste(data.table(a=as.character(as.Date("2019-12-31"))))
[1] "2019-12-31"
#eg3:
paste(data.frame(a=as.Date("2019-12-31")))
[1] "18261"
My expected output is like eg2, but I don't want to use data.table.
I have only one question: how do I fix this for both eg1 and eg3?
In R versions before 4.0.0, data.frame() turns character columns into factors by default (stringsAsFactors = TRUE). data.table() does not do this by default, which is why eg1 and eg2 give different results when passed to paste(). For your particular case, I was able to get around it by unlisting and converting to character before using paste.
> paste(as.character(unlist(data.frame(a=as.character(as.Date("2019-12-31"))))))
[1] "2019-12-31"
Alternatively, you could avoid this by setting stringsAsFactors = FALSE and avoid the factor conversion.
> paste(data.frame(a=as.character(as.Date("2019-12-31")), stringsAsFactors = FALSE))
[1] "2019-12-31"
I don't understand why you are trying to use paste() if what you want to do is view what is contained inside the data frame. Instead, just enter the variable name of the data frame:
df <- data.frame(a=as.character(as.Date("2019-12-31")))
df
a
1 2019-12-31
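For the Date column in eg3, one option is to paste the column rather than the whole data frame. This is a small sketch; df_date is a made-up name:

```r
df_date <- data.frame(a = as.Date("2019-12-31"))

# Pasting the whole data frame exposes the underlying day count.
paste(df_date)

# Pasting the column itself keeps the date formatting.
paste(df_date$a)

# format() makes the intended layout explicit.
format(df_date$a, "%Y-%m-%d")
```

The difference arises because paste() on a data frame treats it as a list of columns, whereas paste() on the column converts the Date values themselves.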
I have a data frame which includes a Reference column. This is a 10 digit number, which could start with zeros.
When importing into R, the leading zeros disappear, which I would like to add back in.
I have tried using sprintf and formatC, but I have different problems with each.
DF=data.frame(Reference=c(102030405,2567894562,235648759), Data=c(10,20,30))
The outputs I get are the following:
> sprintf('%010d', DF$Reference)
[1] "0102030405" " NA" "0235648759"
Warning message:
In sprintf("%010d", DF$Reference) : NAs introduced by coercion
> formatC(DF$Reference, width=10, flag="0")
[1] "001.02e+08" "02.568e+09" "02.356e+08"
The first output gives NA when the number already has 10 digits, and the second stores the result in standard form.
What I need is:
[1] 0102030405 2567894562 0235648759
library(stringi)
DF = data.frame(Reference = c(102030405,2567894562,235648759), Data = c(10,20,30))
DF$Reference = stri_pad_left(DF$Reference, 10, "0")
DF
# Reference Data
# 1 0102030405 10
# 2 2567894562 20
# 3 0235648759 30
Alternative solutions: Adding leading zeros using R.
"When importing into R, the leading zeros disappear, which I would like to add back in."
Reading the column(s) in as characters would avoid this problem outright. You could use readr::read_csv() with the col_types argument.
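The same idea can be sketched with base R's read.csv() and its colClasses argument instead of readr (a swap made here only to avoid an extra dependency). The inline csv_text is invented to mirror the example data:

```r
# Hypothetical CSV contents mirroring the example data.
csv_text <- "Reference,Data\n0102030405,10\n2567894562,20\n0235648759,30"

# Read the first column as character so the leading zeros are kept.
DF <- read.csv(text = csv_text, colClasses = c("character", "numeric"))

DF$Reference
class(DF$Reference)
```

With a real file you would pass the file path as the first argument instead of text = csv_text.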
formatC
You can use
formatC(DF$Reference, digits = 0, width = 10, format ="f", flag="0")
# [1] "0102030405" "2567894562" "0235648759"
sprintf
The use of d in sprintf means that your values are integers (or they have to be converted with as.integer()). help(integer) explains that:
"the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly."
That is why as.integer(2567894562) returns NA.
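The limit is easy to check directly. The sketch below also shows a floating-point format with zero decimals, which sidesteps the integer conversion; using %010.0f here is a suggestion of mine rather than something from the question:

```r
# Largest value an R integer can hold.
.Machine$integer.max

x <- 2567894562
x > .Machine$integer.max            # the value is out of integer range
suppressWarnings(as.integer(x))     # so the conversion gives NA

# A double holds the value exactly, so format it as a float with no decimals.
padded <- sprintf("%010.0f", c(102030405, x, 235648759))
padded
```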
Another work around would be to use a character format s in sprintf:
sprintf('%010s',DF$Reference)
# [1] " 102030405" "2567894562" " 235648759"
But this gives spaces instead of leading zeros. gsub() can add zeros back by replacing spaces with zeros:
gsub(" ","0",sprintf('%010s',DF$Reference))
# [1] "0102030405" "2567894562" "0235648759"
I have to import many datasets automatically with the first column being a name, so a character vector, and the second column being a numeric vector, so I was using these specifications with read.table: colClasses = c("character", "numeric").
This works great if I have a dataframe saved in a df_file like this:
df<- data.frame(V1=c("s1","s2","s3","s4"), V2=c("1e-04","1e-04","1e-04","1e-04"))
read.table(df_file, header = FALSE, comment.char="", colClasses = c("character", "numeric"), stringsAsFactors=FALSE)
The problem is in some cases I have dataframes with numeric values in the form of exponential in the second column, and in these cases the import does not work since it does not recognise the column as numeric (or it imports as "character" if I don't specify the colClasses), so my question is:
how can I specify a column to be imported as numeric even when the values are exponential?
For example:
df<- data.frame(V1=c("s1","s2","s3","s4"), V2=c("10^(-4)","10^(-4)","10^(-4)","10^(-4)"))
I want all the exponential values to be imported as numeric, but even when I try to change from character to numeric after they are imported I get all "NA" (as.numeric(as.character(df$V2)) "Warning message: NAs introduced by coercion ")
I have tried to use "real" or "complex" with colClasses too but it still imports the exponentials as character.
Please help,
thank you!
I think the problem is that the form your exponentials are written in doesn't match the R style. If you read them in as character vectors you can convert them to exponentials if you know they all are exponentials. Use gsub to strip out the "10^(" and the ")", leaving you with the "-4", convert to numeric, then convert back to an exponential. Might not be the fastest way, but it works.
From your example:
df<- data.frame(V1=c("s1","s2","s3","s4"), V2=c("10^(-4)","10^(-4)","10^(-4)","10^(-4)"))
df$V2 <- 10^(as.numeric(gsub("10\\^\\(|\\)", "", df$V2)))
df
# V1 V2
#1 s1 1e-04
#2 s2 1e-04
#3 s3 1e-04
#4 s4 1e-04
What's happening in detail: gsub("10\\^\\(|\\)", "", df$V2) is substituting 10^( and ) with an empty string (you need to escape the caret and the parentheses), as.numeric() is converting your -4 string into the number -4, then you're just running 10^ on each element of the numeric vector you just made.
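Broken into separate steps on a vector with differing exponents (the values in v are made up for illustration), the same conversion looks roughly like this:

```r
v <- c("10^(-4)", "10^(-2)", "10^(3)")

# Step 1: strip "10^(" and ")" so only the exponent is left, still as text.
exponents <- gsub("10\\^\\(|\\)", "", v)
exponents

# Step 2: turn the exponent strings into numbers.
exp_num <- as.numeric(exponents)

# Step 3: raise 10 to each exponent.
result <- 10^exp_num
result
```

Note that this only works if every entry really has the form 10^(n); any other string would become NA at step 2.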
If you read in your data.frame with stringsAsFactors=FALSE, the column in question should come in as a character vector, in which case you can simply do:
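# evaluate each string separately; eval(parse(text = V2)) on the whole vector would return only the last value
transform(df, V2 = vapply(V2, function(s) eval(parse(text = s)), numeric(1), USE.NAMES = FALSE))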
You could use readLines to first load in the data and do all the operations required and then use read.table with textConnection as follows:
tt <- readLines("~/tmp.txt")
tt <- gsub("10\\^\\((.*)\\)$", "1e\\1", tt)
read.table(textConnection(tt), sep="\t", header=TRUE, stringsAsFactors=FALSE)
V1 V2
1 s1 1e-04
2 s2 1e-04
3 s3 1e-04
4 s4 1e-04