I read a CSV file into R with the following command:
myfile <- read.csv('C:/Users/myfilepath.csv', sep=',', header = F)
With this I get a nice data frame looking a little like this:
year / Variable1 / Variable2 / etc.
1958 / 1.42547014192473E-08 / 3.06399766669684E-10 / etc.
1959 / 2.05022315791225E-09 / 8.80152568089836E-08 / etc.
1960 / etc. .... ....
However, R seems to treat the letter E in the exponents as text, so the values are read in as strings. I need to convert these into plain numbers before I can analyze the data. The data set has 50 rows and 12 columns.
I tried as.numeric but get the error message
Error: (list) object cannot be coerced to type 'double'
Any ideas?
You can format the DF using:
format(myfile,scientific=FALSE)
You can use "options("scipen"=100)" before you read the file.
If you see trailing zeros, then I suggest you check the csv file before importing.
The answers by Soto and Alistair work if the cells in the imported csv are formatted as 'scientific'. Otherwise they don't. Thanks guys!
Code used:
mydata<- read.csv('C:/Users/mydata.csv', sep=',', na.strings=c("", "NA"), header = F)
mydata <- sapply(mydata, as.numeric)
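Note that sapply() returns a matrix here, not a data frame. If you want to keep the data frame structure, a minimal alternative sketch:
mydata[] <- lapply(mydata, as.numeric)  # converts every column in place, keeping the data.frame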
My data set testdata has 2 variables named PWGTP and AGEP
The data are in a .csv file.
When I do:
> head(testdata)
The variables show up as
ï..PWGTP AGEP
23 55
26 56
24 45
22 51
25 54
23 35
So, for some reason, R is reading PWGTP as ï..PWGTP. No biggie.
HOWEVER, when I use some function to refer to the variable ï..PWGTP, I get the message:
Error: id variables not found in data: ï..PWGTP
Similarly, when I use some function to refer to the variable PWGTP, I get the message:
Error: id variables not found in data: PWGTP
2 Questions:
Is there anything I should be doing to the source file to prevent mangling of the variable name PWGTP?
It should be trivial to rename ï..PWGTP to something else -- but R is unable to find a variable named as such. Your thoughts on how one should try to repair the variable name?
This is a BOM (Byte Order Mark) UTF-8 issue.
To prevent this from happening, 2 options:
Save your file as UTF-8 without BOM / signature -- or --
Use fileEncoding = "UTF-8-BOM" when using read.table or read.csv
Example:
mydata <- read.table(file = "myfile.txt", fileEncoding = "UTF-8-BOM")
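The same argument works with read.csv; a minimal sketch using the OP's data (the file name testdata.csv is an assumption):
testdata <- read.csv("testdata.csv", fileEncoding = "UTF-8-BOM")
head(testdata)  # the first column should now be PWGTP rather than ï..PWGTP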
It is possible that the column names in the file are something like '1 PWGTP', i.e. with a space between a number (or some other characters) and the name, and that those characters result in the '..' when read into R. One way to prevent this would be to use check.names = FALSE in read.csv/read.table:
d1 <- read.csv("yourfile.csv", header=TRUE, stringsAsFactors=FALSE, check.names=FALSE)
However, it is better not to have a name starting with number or have spaces in between.
So, supposing the OP read the data with the default options, i.e. with check.names = TRUE, we can use sub to change the column names:
names(d1) <- sub(".*\\.+", "", names(d1))
As an example
sub(".*\\.+", "", "ï..PWGTP")
#[1] "PWGTP"
I'm sure this is simple, but I'm not coming across an answer. I would like to import a data frame into R without processing the lines in a text editor first. Essentially, I want R to do it on read in. So all lines containing
FRAME 1 of ***
OR
ATOM-WISE TOTAL CONTACT ENERGY
will be skipped, deleted or ignored.
And all that will be left is;
Chain Resnum Atom number Energy(kcal/mol)
ATOM C 500 1519 -2.1286
ATOM C 500 1520 -1.1334
ATOM C 500 1521 -0.8180
ATOM C 500 1522 -0.7727
Is there a simple solution to this? I'm not sure which scan() or read.table() arguments would work.
EDIT
I was able to use readLines and gsub to read in the file and remove the unwanted lines. I dropped the empty "" strings left over from the deleted words, and now I am trying to convert the character vector to a regular (numeric) data frame. When I use data.frame(x) or as.data.frame(x) I am left with a data frame with 100K rows and only one variable. There should be at least 5 variables.
readLines gives you a vector with one character string for each line of the file. So you have to split these strings into the elements you want before you convert to a dataframe. If you have nice space-separated values, try:
m = matrix(unlist(strsplit(data, " +")), ncol=5, byrow=TRUE)
# where 'data' is the name of the vector of strings
df = data.frame(m, stringsAsFactors=FALSE)
Then for each column with numeric data, use as.numeric() on the column to convert.
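For reference, a minimal end-to-end sketch of the approach the OP describes (readLines, drop the unwanted lines, split, convert). The file name, the filter patterns, and the 5-column layout are assumptions based on the sample above:
lines <- readLines("contacts.txt")                       # hypothetical file name
lines <- lines[!grepl("FRAME|ATOM-WISE|^Chain", lines)]  # skip frame/header lines (patterns are assumptions)
lines <- lines[nzchar(trimws(lines))]                    # drop any blank lines left over
m  <- matrix(unlist(strsplit(trimws(lines), " +")), ncol = 5, byrow = TRUE)
df <- data.frame(m, stringsAsFactors = FALSE)
df[3:5] <- lapply(df[3:5], as.numeric)                   # columns 3-5 hold the numeric fields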
This is a question raised from an impatient person who has started working R just now.
I have a file containing lines like this:
simulation_time:386300;real_time:365;agents:300
simulation_time:386800;real_time:368;agents:300
simulation_time:386900;real_time:383;agents:300
simulation_time:387000;real_time:451;agents:300
simulation_time:387100;real_time:345;agents:300
simulation_time:387200;real_time:327;agents:300
simulation_time:387300;real_time:411;agents:300
simulation_time:387400;real_time:405;agents:300
simulation_time:387500;real_time:476;agents:300
simulation_time:387600;real_time:349;agents:300
....
I need to plot a graph from this file. This link teaches how to plot data read from a file in tabular format, but the lines above are not in tabular or a neat csv format.
Could you please tell me how to parse such a file?
Besides, if you have a reference for impatient people like me, please let me know.
thanks
If the structure of the file is strict, then you can customise your reading to get the data you want.
See the code below.
# reading the file
strvec = readLines(con = "File.txt", n = -1)
# strsplit by ";" or ":"
strlist = strsplit(strvec,":|;")
# changing to matrix (works only if the structure of each line is the same)
strmat = do.call(rbind, strlist)
# lets take only numbers
df = strmat[ ,c(2,4,6)]
# defining the names
colnames(df) = strmat[1 ,c(1,3,5)]
# changing strings to numerics (there might be better methods; any suggestions?)
df = apply(df, 2, as.numeric)
# changing to data.frame
df = as.data.frame(df)
# now you can do that ever you want
plot(df$simulation_time, type="l")
For data in that exact format:
d = read.csv(textConnection(gsub(";", ":", readLines("data.csv"))), sep = ":", header = FALSE)[, c(2,4,6)]
produces:
V2 V4 V6
1 386300 365 300
2 386800 368 300
3 386900 383 300
4 387000 451 300
you can then assign names to the data frame with names(d)=c("sim","real","agents").
It works by reading the file into a character vector, replacing the ";" with ":" so everything is separated by ":", then using read.csv to read that text into a data frame, and then taking only the data columns and not the repeated text columns.
I am aware that there are similar questions on this site, however, none of them seem to answer my question sufficiently.
This is what I have done so far:
I have a csv file which I open in excel. I manipulate the columns algebraically to obtain a new column "A". I import the file into R using read.csv() and the entries in column A are stored as factors - I want them to be stored as numeric. I find this question on the topic:
Imported a csv-dataset to R but the values becomes factors
Following the advice, I include stringsAsFactors = FALSE as an argument in read.csv(), however, as Hong Ooi suggested in the page linked above, this doesn't cause the entries in column A to be stored as numeric values.
A possible solution is to use the advice given in the following page:
How to convert a factor to an integer\numeric without a loss of information?
however, I would like a cleaner solution, i.e. a way to import the file so that the entries of column A are stored as numeric values.
Cheers for any help!
Whatever algebra you are doing in Excel to create the new column could probably be done more effectively in R.
Please try the following: read the raw file (before any Excel manipulation) into R using read.csv(..., stringsAsFactors=FALSE). [If that does not work, please take a look at ?read.table (which read.csv wraps); however, there may be some other underlying issue.]
For example:
delim = "," # or is it "\t" ?
dec = "." # or is it "," ?
myDataFrame <- read.csv("path/to/file.csv", header=TRUE, sep=delim, dec=dec, stringsAsFactors=FALSE)
Then, let's say your numeric column is column 4:
myDataFrame[, 4] <- as.numeric(myDataFrame[, 4]) # you can also refer to the column by "itsName"
Lastly, if you need any help with accomplishing in R the same tasks that you've done in Excel, there are plenty of folks here who would be happy to help you out
In read.table (and its relatives) it is the na.strings argument which specifies which strings are to be interpreted as missing values NA. The default value is na.strings = "NA".
If missing values in an otherwise numeric column are coded as something other than "NA", e.g. "." or "N/A", those rows will be interpreted as character, and then the whole column is converted to character.
Thus, if your missing values are anything other than "NA", you need to specify them in na.strings.
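A minimal sketch (the codes "." and "N/A" are examples; use whatever your file actually contains):
dat <- read.csv("file.csv", na.strings = c("NA", ".", "N/A"))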
If you're dealing with large datasets (i.e. datasets with a high number of columns), the solution noted above can be cumbersome to apply manually, and requires you to know a priori which columns are numeric.
Try this instead.
char_data <- read.csv(input_filename, stringsAsFactors = F)
num_data <- data.frame(data.matrix(char_data))
numeric_columns <- sapply(num_data,function(x){mean(as.numeric(is.na(x)))<0.5})
final_data <- data.frame(num_data[,numeric_columns], char_data[,!numeric_columns])
The code does the following:
Imports your data as character columns.
Creates an instance of your data as numeric columns.
Identifies which columns from your data are numeric (assuming columns with less than 50% NAs upon converting your data to numeric are indeed numeric).
Merges the numeric and character columns into a final dataset.
This essentially automates the import of your .csv file by preserving the data types of the original columns (as character and numeric).
Including this in the read.csv command worked for me: strip.white = TRUE
(I found this solution here.)
A version for data.table, based on the code from dmanuge:
library(data.table)  # needed for data.table()

convNumValues <- function(ds){
  ds <- data.table(ds)
  dsnum <- data.table(data.matrix(ds))
  num_cols <- sapply(dsnum, function(x){ mean(as.numeric(is.na(x))) < 0.5 })
  nds <- data.table(dsnum[, .SD, .SDcols = names(num_cols)[num_cols]],
                    ds[, .SD, .SDcols = names(num_cols)[!num_cols]])
  return(nds)
}
I had a similar problem. Based on Joshua's premise that Excel was the problem, I looked at the file and found that the numbers were formatted with commas between every third digit. Reformatting without commas fixed the problem.
I had a similar situation with my data file when I read it in as a csv: all the numeric values were turned into char. But in my file there was the word "Filtered" in place of NA. I converted "Filtered" to NA in the vim editor of the linux terminal with the command :%s/Filtered/NA/g, saved the file, and later read it into R; all the values were then num type and no longer char.
It looks like the character value "Filtered" was forcing all values into char format.
Charu
Hello @Shawn Hemelstrand, here are the steps in detail:
Take an example matrix in file.csv that has the word "Filtered" in it.
I opened file.csv in the linux command terminal:
vi file.csv
Then press "Esc", then "Shift" + ":"
and type the following command at the bottom:
%s/Filtered/NA/g
Press enter.
Then press "Esc", then "Shift" + ":" again,
and type "wq" at the bottom (this saves the file and quits the vim editor).
then in R script I read the file
data<- read.csv("file.csv", sep = ',', header = TRUE)
str(data)
All the columns that were earlier char type were now num type.
In case you need more help, it would be easier to share your txt or csv file.
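Note that the same result can be achieved without editing the file, using the na.strings argument discussed in earlier answers; a minimal sketch:
data <- read.csv("file.csv", header = TRUE, na.strings = c("NA", "Filtered"))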
I have a series of CSV files where numbers are formatted in the european style using commas instead of decimal points, i.e. 0,5 instead of 0.5.
There are too many of these files to edit them before importing to R. I was hoping there is an easy parameter for the read.csv() function, or a method to apply to the imported dataset, so that R treats the data as numbers rather than strings.
When you check ?read.table you will probably find all the answers you need.
There are two issues with (continental) European csv files:
What does the c in csv stand for? For standard csv this is a ",", for European csv this is a ";".
sep is the corresponding argument in read.table
What is the character for the decimal point? For standard csv this is a ".", for European csv this is a ",".
dec is the corresponding argument in read.table
To read standard csv use read.csv, to read European csv use read.csv2. These two functions are just wrappers to read.table that set the appropriate arguments.
If your file does not follow either of these standards set the arguments manually.
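For example, these two calls read the same European-style file (the file name is an assumption):
dat <- read.csv2("data.csv")
dat <- read.csv("data.csv", sep = ";", dec = ",")  # equivalent, with the arguments set manually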
From ?read.table:
dec: the character used in the file for decimal points.
And yes, you can use that for read.csv as well.
Alternatively, you can also use
read.csv2
which assumes a "," decimal separator and a ";" column separator; it is essentially equivalent to
read.csv(..., sep = ";", dec = ",")
Suppose this imported field is called "amount"; you can fix the type in this way if your numbers are being read in as character:
d$amount <- sub(",",".",d$amount)
d$amount <- as.numeric(d$amount)
I have this happen to me frequently, along with a bunch of other little annoyances, when importing from Excel or an Excel csv. Since there seems to be no consistent way to ensure getting what you expect when you import into R, post-hoc fixes seem to be the best method. By that I mean: LOOK at what you imported, make sure it's what you expected, and fix it if it's not.
dec = "," can be used as follows:
mydata <- read.table(fileIn, dec=",")
input file (fileIn):
D:\TEST>more input2.txt
06-05-2014 09:19:38 3,182534 0
06-05-2014 09:19:51 4,2311 0
Problems may also be solved if you indicate how your missing values are represented (na.strings = ...). For example, V1 and V2 here have the same format (decimal commas in the csv file), but since the missing values in V1 are coded as "---" it is interpreted as a factor:
dat <- read.csv2("...csv", header=TRUE)
head(dat)
> ID x time V1 V2
> 1 1 0:01:00 0,237 0.621
> 2 1 0:02:00 0,242 0.675
> 3 1 0:03:00 0,232 0.398
dat <- read.csv2("...csv", header=TRUE, na.strings="---")
head(dat)
> ID x time V1 V2
> 1 1 0:01:00 0.237 0.621
> 2 1 0:02:00 0.242 0.675
> 3 1 0:03:00 0.232 0.398
Maybe
as.is = TRUE
This also prevents character columns from being converted into factors.
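For example, a minimal sketch:
mydata <- read.csv("file.csv", as.is = TRUE)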
Just to add to Brandon's answer above, which worked well for me (I don't have enough rep to comment):
If you're using
d$amount <- sub(",",".",d$amount)
d$amount <- as.numeric(d$amount)
don't forget that you may first need sub("[.]", "", d$amount, perl=T) to get rid of a "." used as a thousands separator.
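Putting the two together, a minimal sketch for values like "1.234,56" (assuming "." is the thousands separator and "," the decimal mark):
d$amount <- as.numeric(sub(",", ".", gsub("[.]", "", d$amount)))  # "1.234,56" -> 1234.56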
You can pass the decimal character as a parameter (dec = ","):
# Semicolon as separator and comma as decimal point by default
read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",",
fill = TRUE, comment.char = "", encoding = "unknown", ...)
More info at https://r-coder.com/read-csv-r/
For the pythonistas:
import pandas as pd
mycsv = pd.read_csv("file.csv", delimiter=";", decimal=",")