Read data in form '1.4523e-9' - r

I'm trying to read data from a *.txt or *.csv file into R with read.table or read.csv. However, my data is written in the file as e.g. 1.4523e-9, denoting 1.4523*10^{-9}, and ggplot treats it as a string instead of a real number. Is there some sort of eval()-like function to convert it to its correct numeric value?

Depending on the exact format of the csv file you import, read.csv and read.table often simply convert all columns to factors. Since a straightforward conversion to numeric has failed, I assume this is your problem. You can change this using the colClasses argument, as follows:
# if every column should be numeric:
df <- read.csv("foobar.csv", colClasses = "numeric")
# if only some columns should be numeric, use a vector.
# To read the first as factor and the second as numeric:
df <- read.csv("foobar.csv", colClasses = c("factor", "numeric"))
Of course, both of the above are bare-bones examples; you probably want to supply other arguments as well, e.g. header = TRUE.
If you don't want to supply the classes of each column when you read the table (maybe you don't know them yet!), you can convert after the fact using either of the following:
df$a <- as.numeric(as.character(df$a)) # as you already discovered
df$a <- as.numeric(levels(df$a)[df$a])
Yes, these are both clunky, but they are standard and frequently recommended.
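As a quick sanity check, base R parses e-notation fine once the value is actually converted; and if several factor columns need fixing, one pass over the data frame handles them all (a sketch that assumes every factor column really holds numbers):
as.numeric("1.4523e-9")
## [1] 1.4523e-09
# convert every factor column in one pass (assumes they all hold numbers)
df[] <- lapply(df, function(x) if (is.factor(x)) as.numeric(as.character(x)) else x)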

Related

Error: unexpected input in "reg1 <- lm(_"

I am pretty new to R, about 3 months in. When I was trying to run a regression, R gave me this error: Error: unexpected input in "reg1 <- lm(_". The variable I use has an underscore in its name, and so do some other variables. This is the first time I have had a variable with an underscore in its name, and I don't know whether R supports underscores in a regression. If it doesn't, how can I change the name?
As good practice, always begin variable/column names with letters (this is not explicitly the rule, and you can technically start with a period, but it will save hassle). When dealing with data imported into R with predefined column names (or just when dealing with data frames in general), you can rename columns of the data frame df as follows:
names(df)[names(df) == 'OldName'] <- 'NewName'
If you really need to, you can protect 'illegal' names with back-quotes (although I agree with other answers/comments that this is not good practice ...)
dd <- data.frame(`_y`=rnorm(10), x = 1:10, check.names=FALSE)
names(dd)
## [1] "_y" "x"
lm(`_y` ~ x, data = dd)
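If you would rather sanitize all the names at once instead of back-quoting them, base R's make.names() turns arbitrary strings into syntactically valid names; a minimal sketch using the dd from above:
names(dd) <- make.names(names(dd))
names(dd)
## [1] "X_y" "x"
lm(X_y ~ x, data = dd)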

Read multiple integer columns as string, trying to gsub and convert back to integer

I have about 30 such columns within a data frame of over 100 columns. The file I am reading in stores its numbers as characters; in other words, 1300 is "1,300" and R thinks it is a character.
I am trying to fix that issue by replacing the "," with nothing and turning the field into an integer. I do not want to call gsub on each affected column separately. I would rather store the affected column names in a vector and fix them all with one function call or loop.
I have tried using lapply, but am not sure what to put as the "x" variable.
Here is my function with the error below it
ItemStats_2014[intColList] <- lapply(ItemStats_2014[intColList],
as.integer(gsub(",", "", ItemStats_2014[intColList])) )
Error in [.data.table(ItemStats_2014, intColList) : When i is a
data.table (or character vector), the columns to join by must be
specified either using 'on=' argument (see ?data.table) or by keying x
(i.e. sorted, and, marked as sorted, see ?setkey). Keyed joins might
have further speed benefits on very large data due to x being sorted
in RAM.
The file I am reading in stores its numbers as characters [with commas as decimal separator]
Just read those columns in directly as numbers, not as strings:
data.table::fread() understands decimal separators via its dec argument (dec = ',').
You might need to play with the fread(..., colClasses = c(...)) argument a bit to specify the integer columns:
myColClasses <- rep('character', 100) # for example; note 'string' is not an R class
myColClasses[intColList] <- 'integer'
# ...any other colClass fixup as needed...
ItemStats_2014 <- fread('your.csv', colClasses=myColClasses)
This approach is simpler and faster and uses much less memory than reading as string, then converting later.
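If you would rather convert after reading, note that the error message above comes from using data.frame-style indexing on a data.table. The data.table idiom modifies the columns in place; a sketch, assuming ItemStats_2014 is a data.table and intColList holds the names of the affected columns:
# strip the commas and convert, by reference, only the affected columns
ItemStats_2014[, (intColList) := lapply(.SD, function(x) as.integer(gsub(",", "", x))),
               .SDcols = intColList]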
Try using dplyr::mutate_at() to select multiple columns and apply a transformation to them.
ItemStats_2014 <- ItemStats_2014 %>%
    mutate_at(intColList, funs(as.integer(gsub(',', '', .))))
mutate_at selects columns from a character vector or with a dplyr selector function (see ?select_helpers), then applies one or more functions to each column. The . in the gsub call refers to each selected column that mutate_at passes to it; you can think of it as the x in function(x) ....
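On current dplyr (>= 1.0.0), funs() is deprecated; the same transformation can be written with across(), e.g.:
library(dplyr)
ItemStats_2014 <- ItemStats_2014 %>%
    mutate(across(all_of(intColList), ~ as.integer(gsub(",", "", .x))))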

R data read as characters

I am trying to read a data table into R. The data contains:
two columns with numeric values (continuous),
1556 columns with either 0 or 1, and
one last column with strings, representing two groups (group A and group B).
Some values are missing, and they are replaced with either ? or some spaces followed by ?. As a result, when I read the table into R, the numbers were read as characters.
For example, if Data[1,1] = 125, is.numeric(Data[1,1]) returns FALSE. I want to turn all the numbers into actual numbers, and I want every ? (with or without spaces before it) to become a missing value. I do not know how to do this. Thank you! (I have 3279 rows.)
You can specify the na.strings argument of ?read.table as na.strings = "?". Use that inside the read.table() call when you read the data into R, and the ? entries will then be recognised as missing. Since you also have some spaces before some of the ?, additionally pass strip.white = TRUE in the same call.
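Put together, a minimal sketch (the file name here is hypothetical):
Data <- read.table("mydata.txt", header = TRUE,
                   na.strings = "?", strip.white = TRUE)
str(Data) # numeric columns are now numeric; the ? entries are NA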

Write different datatype values to a file in R

Is it possible to write values of different datatypes to a file in R? Currently, I am using a simple vector as follows:
> vect = c(1, 2, "string")
> vect
[1] "1" "2" "string"
> write.table(vect, file="/home/sampleuser/sample.txt", append= FALSE, sep= "|")
However, since vect is now a character vector, opening the file shows the following contents, with everything quoted:
"x"
"1"|"1"
"2"|"2"
"3"|"string"
Is it not possible to preserve the data types of entries 1 and 2, so they are treated as numeric values instead of strings? My expected result is:
"x"
"1"|1
"2"|2
"3"|"string"
Also, I am assuming the left-side values "1", "2" and "3" are vector indexes? I did not understand why the first line is "x".
I wonder if simply removing all the quotes from the output file will solve your problem? That's easy: Add quote=FALSE to your write.table() call.
write.table(vect, file="/home/sampleuser/sample.txt",
            append=FALSE, sep="|", quote=FALSE)
x
1|1
2|2
3|string
Also, you can get rid of the column and row names if you like. But now your separator character doesn't appear because you have a one-column table.
write.table(vect, file="/home/sampleuser/sample.txt", append=FALSE, sep="|",
quote=FALSE, row.names=FALSE, col.names=FALSE)
1
2
string
For vectors and matrices, R requires every element to have the same data type, so it coerces all of the data in the vector/matrix into a common format, always from a more specific type to a less specific one. In this case, every item stored in your vector can be represented as type "character", so R automatically coerces the numeric parts of the vector to that type.
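You can see the coercion order directly:
class(c(1L, 2.5))        # integer + double     -> "numeric"
class(c(1, TRUE))        # logical promoted     -> "numeric"
class(c(1, 2, "string")) # anything + character -> "character"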
As #Dason said, you're better off using a list if this isn't something you want.
Alternatively, you can use a data.frame, which lets you store different datatypes in different columns (internally, R stores data.frames as lists, so it makes sense that this would be another option).
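A minimal sketch of both alternatives (the file name is hypothetical):
lst <- list(1, 2, "string") # a list keeps each element's own type
sapply(lst, class)
## [1] "numeric"   "numeric"   "character"
df <- data.frame(num = c(1, 2), txt = c("a", "string"))
sapply(df, class) # each column keeps its own type
write.table(df, file = "sample.txt", sep = "|", quote = FALSE, row.names = FALSE)
## file contents:
## num|txt
## 1|a
## 2|string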

How to correctly index an array?

Please download the file to your computer, and run:
http://freeuploadfiles.com/bb3cwypih2d2
data=read.table("path/to/file", sep="|",quote='',
head=T,blank.lines.skip=T,as.is=T)
ddata=array(data,dim=c(nrow(data),ncol(data)))
ddata[1,1]
I want to extract the first element of the first column. The answer should be AAC.
How do I do that?
Some suggestions to clean your code and make life easier in the long term:
Work with the data in a data.frame, not an array.
Never refer to TRUE as T. TRUE is a reserved word that can never be redefined, whereas T can take any value, including FALSE.
Use the <- symbol for assignment.
Don't use abbreviated argument names. The argument is header, not head. This might bite you.
Arrays can only contain a single class of object, so converting your data to an array will implicitly convert the numeric columns to character, which is surely a bad thing.
You then index the data frame like this:
dat <- read.table("nasdaqlisted.txt", sep="|", quote='',
header=TRUE, blank.lines.skip=TRUE, as.is=TRUE)
dat$Symbol[1]
[1] "AAC"
The following alternative ways of indexing also return the same element:
dat[1, "Symbol"]
dat[1, 1]
dat[, 1][1]
dat[["Symbol"]][1]
If you really want to do the foolish thing and convert your data to an array, then use matrix:
mdat <- as.matrix(dat)
mdat[1, 1]
Symbol
"AAC"
Disclaimer: I only post this since you ask. Arrays and matrices are powerful and fast, but not appropriate for this data.
