I have to import many datasets automatically with the first column being a name, so a character vector, and the second column being a numeric vector, so I was using these specifications with read.table: colClasses = c("character", "numeric").
This works great if I have a dataframe saved in a df_file like this:
df <- data.frame(V1=c("s1","s2","s3","s4"), V2=c("1e-04","1e-04","1e-04","1e-04"))
read.table(df_file, header = FALSE, comment.char="", colClasses = c("character", "numeric"), stringsAsFactors=FALSE)
The problem is that in some cases the second column contains numeric values written in an exponential form, and then the import fails: read.table does not recognise the column as numeric (or imports it as "character" if I don't specify colClasses). So my question is:
how can I specify a column to be imported as numeric even when the values are exponential?
For example:
df<- data.frame(V1=c("s1","s2","s3","s4"), V2=c("10^(-4)","10^(-4)","10^(-4)","10^(-4)"))
I want all the exponential values to be imported as numeric, but even when I try to convert from character to numeric after the import I get all NA: as.numeric(as.character(df$V2)) gives "Warning message: NAs introduced by coercion".
I have tried to use "real" or "complex" with colClasses too but it still imports the exponentials as character.
Please help,
thank you!
I think the problem is that the form your exponentials are written in doesn't match the R style. If you read them in as character vectors you can convert them to exponentials if you know they all are exponentials. Use gsub to strip out the "10^(" and the ")", leaving you with the "-4", convert to numeric, then convert back to an exponential. Might not be the fastest way, but it works.
From your example:
df<- data.frame(V1=c("s1","s2","s3","s4"), V2=c("10^(-4)","10^(-4)","10^(-4)","10^(-4)"))
df$V2 <- 10^(as.numeric(gsub("10\\^\\(|\\)", "", df$V2)))
df
# V1 V2
#1 s1 1e-04
#2 s2 1e-04
#3 s3 1e-04
#4 s4 1e-04
What's happening in detail: gsub("10\\^\\(|\\)", "", df$V2) substitutes 10^( and ) with an empty string (you need to escape the caret and the parentheses), as.numeric() converts your "-4" string into the number -4, and then you're just running 10^ on each element of the numeric vector you just made.
If you read in your data.frame with stringsAsFactors=FALSE, the column in question should come in as a character vector, in which case you can simply do:
transform(df, V2=eval(parse(text=V2)))
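One caveat worth noting: parse(text = ...) on a whole column produces one expression per element, but eval() returns only the value of the last expression, so the one-liner happens to work here only because every V2 entry is identical. A sketch of an element-wise version (toy data, not the poster's file):

```r
# eval(parse(text = s)) applied per element; eval() on a multi-expression
# parse would return only the last value, so we loop explicitly.
df <- data.frame(V1 = c("s1", "s2"), V2 = c("10^(-4)", "10^(-3)"),
                 stringsAsFactors = FALSE)
df$V2 <- unname(vapply(df$V2, function(s) eval(parse(text = s)), numeric(1)))
df$V2
# [1] 1e-04 1e-03
```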
You could use readLines to first load in the data and do all the operations required and then use read.table with textConnection as follows:
tt <- readLines("~/tmp.txt")
tt <- gsub("10\\^\\((.*)\\)$", "1e\\1", tt)
read.table(textConnection(tt), sep="\t", header=TRUE, stringsAsFactors=FALSE)
V1 V2
1 s1 1e-04
2 s2 1e-04
3 s3 1e-04
4 s4 1e-04
Related
How can I convert the vector y into a numeric vector?
y <- c("1+2", "0101", "5*5")
when I use
as.numeric(y)
OUTPUT
NA 101 NA
The following code
sapply(y, function(txt) eval(parse(text=txt)))
should do the trick.
The problem is quite deep and you need to know about metaprogramming.
The problem with as.numeric is that it only converts a string to a numeric if the string consists solely of digits, at most one dot, and an optional sign or exponent. Everything else is converted to NA. In your case, "1+2" contains a plus, hence NA, and "5*5" contains a multiplication, hence NA. To tell R that it should "perform the operation given by a string", you need eval and parse.
An option with map
library(purrr)
map_dbl(y, ~ eval(rlang::parse_expr(.x)))
#[1] 3 101 25
Simple but frustrating problem here:
I've imported xls data into R, which unfortunately is the only current way to get the data - no csv option or direct DB query.
Anyways - I'm looking to do quite a bit of manipulation on this data set, however the variable names are extraordinarily messy, e.g. col2 = "\r\n\r\n\r\n\r\r XXXXXX YYYYY ZZZZZZ" - you get my gist. Each column head has an equally messy name as this example and there are typically >15 columns per spreadsheet.
Ideally I'd like to program a name manipulation solution via R to avoid manually changing the names in xls prior to importing. But I can't seem to find the right solution, since every R function I try/check requires the column name be spelled out and set to a new variable. Spelling out the entire column name is tedious and impractical and plus the special characters seem to break R's functions anyways.
Does anyone know how to do a global replace all names or a global rename by column number rather than name?
I've tried
replace()
for loops
lapply()
Remove non-printing characters in the first gsub. Then trim whitespace off the ends using trimws and replace consecutive strings of the same character with just one of them in the second gsub. No packages are used.
# test input
d <- data.frame("\r\r\r\r\r\n\n\n\n\n\n XXXX YYYY ZZZZ" = 0, check.names = FALSE)
names(d) <- trimws(gsub("[^[:print:]]", "", names(d)))
names(d) <- gsub("(.)\\1+", "\\1", names(d))
d
## X Y Z
## 1 0
With R 3.6 or later you could consider replacing the first gsub line with this trimws line:
names(d) <- trimws(names(d), "both", "\\s")
If you want syntactic names add this after the above code:
names(d) <- make.names(names(d))
d
## X.Y.Z
## 1 0
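To the "rename by column number" part of the question: names(d) can be indexed positionally, so a global rename never has to spell out the messy originals. A minimal sketch (the column names here are invented):

```r
d <- data.frame(a = 1, b = 2, c = 3)
names(d)[2] <- "renamed"                 # rename a single column by position
names(d) <- paste0("col", seq_along(d))  # or replace every name at once
names(d)
# [1] "col1" "col2" "col3"
```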
I'm struggling to convert a character vector to numeric in R. I import a dataframe from csv with:
a = read.csv('myData.csv', header=T, stringsAsFactors=F)
One of my factors, fac1, is a vector of numbers but contains some instances of "na" and "nr". Hence, typeof(a$fac1) returns "character"
I create a new dataframe without "na" and "nr" entries
k = a[a$fac1 != "na" & a$fac1 != "nr", ]
I then try to convert fac1 to numeric with:
k$fac1_num = as.numeric(k$fac1)
The problem is that this doesn't work, as typeof(k$fac1_num) now returns "double" instead of "numeric"
Can anyone suggest a fix / point out what I'm doing wrong? Thanks in advance!
Try just coercing to numeric:
a = read.csv('myData.csv', header=T, stringsAsFactors=F)
a$fac1_num = as.numeric(a$fac1)
If you need to subset (which is generally not needed, and I would advise against doing it routinely, since there might be value in knowing what the other column values tell you about the "reality" behind the data), then just:
k <- a[ !is.na(a$fac1_num) , ]
That way you will still have the original character values in the a data object and can examine them if needed. The proper test for "numericy" is is.numeric().
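A quick sketch of why "double" is not a failure: typeof() reports the storage mode, while is.numeric() answers the question the poster actually cares about.

```r
x <- as.numeric(c("1.5", "2"))
typeof(x)      # "double": the storage mode of every numeric vector
class(x)       # "numeric"
is.numeric(x)  # TRUE: the test that actually matters
```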
Try using sapply with mode:
sapply(your_df, mode)
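Applied to a toy data frame (the columns here are made up), that call reports the mode of each column, so you can spot which ones came in as character:

```r
df <- data.frame(id = c("a", "b"), val = c(1.5, 2.5),
                 stringsAsFactors = FALSE)
sapply(df, mode)
# id: "character", val: "numeric"
```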
I have a data frame which includes a Reference column. This is a 10 digit number, which could start with zeros.
When importing into R, the leading zeros disappear, which I would like to add back in.
I have tried using sprintf and formatC, but I have different problems with each.
DF=data.frame(Reference=c(102030405,2567894562,235648759), Data=c(10,20,30))
The outputs I get are the following:
> sprintf('%010d', DF$Reference)
[1] "0102030405" "        NA" "0235648759"
Warning message:
In sprintf("%010d", DF$Reference) : NAs introduced by coercion
> formatC(DF$Reference, width=10, flag="0")
[1] "001.02e+08" "02.568e+09" "02.356e+08"
The first output gives NA when the number already has 10 digits, and the second stores the result in standard form.
What I need is:
[1] 0102030405 2567894562 0235648759
library(stringi)
DF = data.frame(Reference = c(102030405,2567894562,235648759), Data = c(10,20,30))
DF$Reference = stri_pad_left(DF$Reference, 10, "0")
DF
# Reference Data
# 1 0102030405 10
# 2 2567894562 20
# 3 0235648759 30
Alternative solutions: Adding leading zeros using R.
When importing into R, the leading zeros disappear, which I would like to add back in.
Reading the column(s) in as characters would avoid this problem outright. You could use readr::read_csv() with the col_types argument.
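Base R can do the same with a named colClasses; a sketch using inline data in place of the real file (the column names mirror the question):

```r
# Read Reference as character so the leading zeros survive the import;
# textConnection() stands in for the actual csv file here.
csv <- "Reference,Data\n0102030405,10\n2567894562,20"
DF <- read.csv(textConnection(csv), colClasses = c(Reference = "character"))
DF$Reference
# [1] "0102030405" "2567894562"
```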
formatC
You can use
formatC(DF$Reference, digits = 0, width = 10, format ="f", flag="0")
# [1] "0102030405" "2567894562" "0235648759"
sprintf
The use of d in sprintf means that your values are integers (or they have to be converted with as.integer()). help(integer) explains that:
"the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly."
That is why as.integer(2567894562) returns NA.
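A quick sketch of the limit in question (nothing here is specific to the poster's data):

```r
.Machine$integer.max    # the largest representable R integer (32-bit)
# [1] 2147483647
as.integer(2567894562)  # past the limit, so coercion yields NA plus a warning
```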
Another work around would be to use a character format s in sprintf:
sprintf('%010s',DF$Reference)
# [1] " 102030405" "2567894562" " 235648759"
But this gives spaces instead of leading zeros. gsub() can add zeros back by replacing spaces with zeros:
gsub(" ","0",sprintf('%010s',DF$Reference))
# [1] "0102030405" "2567894562" "0235648759"
I am new to R and I am trying to convert a dataframe to a numeric matrix using the below code
expData <- read.table("GSM469176.txt",header = F)
expVec <- as.numeric(as.matrix(expData))
When I use as.matrix, without as.numeric, it returns some numbers (as below)
[9975] "-0.00285 -0.0028274 -0.0028046 -0.0027814 -0.0027574 -0.0027319 -0.0027042
0.083531 0.083496 0.083464 0.083435 0.083406 0.083377 0.083348"
but when I put in the as.numeric, they are all converted to "NA"
I apologize if someone has asked this question before but I can't find a post that solves my problem.
Thanks in advance
You have 2 issues. First, if you examine the structure of the data frame, you'll note that the first column is characters:
head(expData)[, 1:4]
V1 V2 V3 V4
1 YAL002W(cer) 6.1497e-02 6.2814e-02 6.4130e-02
2 YAL002W(par) 7.1352e-02 7.3262e-02 7.5171e-02
3 YAL003W(cer) 2.2428e-02 3.8252e-02 5.4078e-02
4 YAL003W(par) 2.6548e-02 3.6747e-02 4.6947e-02
5 YAL005C(cer) 2.4023e-05 2.3243e-05 2.2462e-05
6 YAL005C(par) 2.0252e-02 2.0346e-02 2.0440e-02
Therefore, trying to convert the complete data frame to numeric will not work as expected.
Second, you are running as.numeric() after as.matrix(), which is converting the matrix to a vector:
x <- as.numeric(as.matrix(expData))
# Warning message:
# NAs introduced by coercion
class(x)
[1] "numeric"
dim(x)
# NULL not a matrix
length(x)
# [1] 14261302
I suggest you try this:
rownames(expData) <- expData$V1
expData$V1 <- NULL
expData <- as.matrix(expData)
dim(expData)
# [1] 7502 1900
class(expData[, 1])
# [1] "numeric"
You get the NA's when R doesn't know how to convert something to a number.
Specifically, the quotation marks in your output tell me that you have one (or several) long string(s) of numbers. To see why this is bad, try: as.numeric("-0.00285 -0.0028274")
I don't know what your raw data is like, but as #alexwhan mentioned, the culprit is probably in your call to read.table
To fix it, try explicitly setting the sep argument (ie, next to where you have header)
I would suggest opening up the raw file in a simple text editor (TextEdit.app or Notepad, not Word) and seeing how the values are separated. My guess is
..., sep="\t"
should do the trick.