I have a series of CSV files where numbers are formatted in the European style, using commas instead of decimal points, e.g. 0,5 instead of 0.5.
There are too many of these files to edit them before importing into R. I was hoping there is an easy parameter for the read.csv() function, or a method to apply to the imported dataset, so that R treats the data as numbers rather than strings.
When you check ?read.table you will probably find all the answers you need.
There are two issues with (continental) European csv files:
What does the c in csv stand for? For standard csv this is a ",", for European csv this is a ";".
sep is the corresponding argument in read.table
What is the character for the decimal point? For standard csv this is a ".", for European csv this is a ",".
dec is the corresponding argument in read.table
To read standard csv use read.csv, to read European csv use read.csv2. These two functions are just wrappers to read.table that set the appropriate arguments.
If your file does not follow either of these standards, set the arguments manually.
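For example, a minimal sketch that sets those arguments by hand (the file name is a placeholder):

# hypothetical file: semicolon-separated fields, comma as decimal point
mydata <- read.table("eurodata.csv", header = TRUE, sep = ";", dec = ",")
# equivalent shortcut for this common European layout:
mydata <- read.csv2("eurodata.csv")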
From ?read.table:
dec the character used in the file for decimal points.
And yes, you can use that for read.csv as well.
Alternatively, you can also use
read.csv2
which assumes a "," decimal separator and a ";" for column separators.
read.csv(..., sep = ";")
Suppose this imported field is called "amount"; if your numbers are being read in as character, you can fix the type this way:
d$amount <- sub(",",".",d$amount)
d$amount <- as.numeric(d$amount)
This happens to me frequently, along with a bunch of other little annoyances, when importing from Excel or Excel csv. Since there seems to be no consistent way to ensure getting what you expect when you import into R, post-hoc fixes seem to be the best method. By that I mean: LOOK at what you imported, make sure it's what you expected, and fix it if it's not.
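As a quick sketch of that kind of post-import inspection (d is whatever read.csv returned):

str(d)            # did numbers come in as character or factor?
head(d)           # eyeball the first few rows
sapply(d, class)  # one class per column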
The dec argument can be used as follows:
mydata <- read.table(fileIn, dec=",")
input file (fileIn):
D:\TEST>more input2.txt
06-05-2014 09:19:38 3,182534 0
06-05-2014 09:19:51 4,2311 0
Problems may also be solved if you indicate how your missing values are represented (na.strings = ...). For example, V1 and V2 here have the same format (decimals separated by "," in the csv file), but since NAs are present in V1 it is interpreted as a factor:
dat <- read.csv2("...csv", header=TRUE)
head(dat)
> ID x time V1 V2
> 1 1 0:01:00 0,237 0.621
> 2 1 0:02:00 0,242 0.675
> 3 1 0:03:00 0,232 0.398
dat <- read.csv2("...csv", header=TRUE, na.strings="---")
head(dat)
> ID x time V1 V2
> 1 1 0:01:00 0.237 0.621
> 2 1 0:02:00 0.242 0.675
> 3 1 0:03:00 0.232 0.398
Maybe
as.is = TRUE
will help; this also prevents character columns from being converted to factors.
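For instance, a one-line sketch combining it with the separator arguments (file name is a placeholder):

df <- read.csv("file.csv", sep = ";", dec = ",", as.is = TRUE)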
Just to add to Brandon's answer above, which worked well for me (I don't have enough rep to comment):
If you're using
d$amount <- sub(",",".",d$amount)
d$amount <- as.numeric(d$amount)
don't forget that you may first need sub("[.]", "", d$amount) to strip a "." used as a thousands separator (the brackets are needed because "." is a regex metacharacter).
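Putting the two substitutions together, a hedged sketch for values like "1.234,56" (assuming "." is a thousands separator and "," the decimal point):

d$amount <- gsub("[.]", "", d$amount)  # drop thousands separators first
d$amount <- gsub(",", ".", d$amount)   # then swap the decimal comma for a dot
d$amount <- as.numeric(d$amount)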
You can pass the decimal character as a parameter (dec = ","):
# Semicolon as separator and comma as decimal point by default
read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",",
fill = TRUE, comment.char = "", encoding = "unknown", ...)
More info on https://r-coder.com/read-csv-r/
For the Pythonistas:
import pandas as pd
mycsv = pd.read_csv("file.csv", delimiter=";", decimal=",")
ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines; the first one will serve as the column names. The fields are separated by commas, and the values are in quotation marks except for the first one, which I think is what creates the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and drop the ones I don't want later, no worries.
The data comes in a .csv file that was given to me.
If I open this file in Excel, the columns are all in one.
I work in France, but I don't know in which locale or encoding the file was created (btw I'm not French, so I am not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the data frame has the column names nicely separated, but the values all end up in the first column
then with
library(readr)
df <- read_delim('file.csv',
delim = ",",
quote = "",
escape_double = FALSE,
escape_backslash = TRUE)
but this way the regex column gets split into two columns, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, as the ne variable contains both ne and class, separated by a comma.
This is the best output so far, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines and fix it, but I am not an expert with regexes, so I would be clueless.
The file is also 300k lines, so it would be hard to inspect it.
Both read.delim and fread give warning messages; I can include them if they might be useful.
update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me output that is easier to manipulate: the regex and msg columns are still split incorrectly, but ne and class are distinct
I tried with the input you provided using read.csv and had no problems; when subsetting, each column is accessible. As for your other options, you're getting the quote option wrong: it needs to be "\"", i.e. the double-quote character needs to be escaped: df <- fread("file.csv", quote = "\"").
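In other words, a minimal sketch of the corrected fread call (your file name assumed):

library(data.table)
df <- fread("file.csv", quote = "\"")  # the double-quote character, escaped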
When using read.csv with your example I definitely get a data frame with 1 line and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# Output result for number of rows
# > 1
ncol(df)
# Output result for number of columns
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
When reading a csv file via fread and using colClasses to read the columns as numerics, I am having trouble with data that contains numbers with commas instead of dots. Since the data files have different origins, some use "." and some use "," as the decimal separator:
dt <- data.table(a=c("1,4","2,0","4,5","3,5","6,9"),c=(10:14))
write.csv(dt,"dt.csv",row.names=F)
dcsv <- fread("dt.csv", colClasses = list(numeric = 1:2), dec = ",").
I have 2 problems:
I want to read both columns as numerics. So I tried using dec = ",". I now get an error: Column number 2 (colClasses[[1]][2]) is out of range [1,ncol=1]
So I changed to colClasses = list(numeric = 1), but don't quite understand this.
Still the first column turns out to be character type instead of numeric.
How could I also allow dec to be either "." or ",", since I don't know in advance which decimal separator any of the hundreds of files uses? I tried a vector, but it did not work out. What am I missing? Thanks for any help!
It is not normal to have a file with 2 different types of numeric separator.
You should question the source of the file first thing.
Nevertheless, if you have such a file, the correct way to read it is to import the comma-separated variables as strings and then convert them to numeric.
library(data.table)
dt <- data.table(a=c("1,4","2,0","4,5","3,5","6,9"),c=(10:14))
write.csv(dt,"dt.csv",row.names=F)
dcsv <- fread("dt.csv", dec = ".")
dcsv[, a := as.numeric(gsub("\"", "", gsub(",", ".", a)))]
If you don't know whether your variable uses a comma or a dot separator, you can loop over your variables, test whether a variable is a string containing only numbers and commas, and convert only the ones fulfilling that condition, as in the sketch below.
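A minimal sketch of that idea (the regular expression is an assumption about what your comma-formatted numbers look like):

library(data.table)
dcsv <- fread("dt.csv")
for (col in names(dcsv)) {
  vals <- dcsv[[col]]
  # convert only character columns whose values are all digits plus one comma
  if (is.character(vals) && all(grepl("^[0-9]+(,[0-9]+)?$", vals))) {
    set(dcsv, j = col, value = as.numeric(sub(",", ".", vals)))
  }
}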
The data I have is as below, in a .csv file.
id.airwaybill_number.order_number.org_pincode.product_type.inscan_date.pickup_date.actual_weight.original_act_weight.chargeable_weight.collectable_value.declared_value.code.name.active.center_shortcode.center_shortcode.if.sc.center_shortcode...NULL csc.center_shortcode sc.center_shortcode..rts_status.reverse_pickup.ref_airwaybill_number.dest_pincode.pincode.item_description.length.breadth.height.volumetric_weight.city_name.city_name.state_shortcode.state_shortcode.zone_shortcode.zone_shortcode
"61773384 147200492 SLP759809537 110008 ppd 2016-03-02 04:38:56 2016-03-01 0.25 0.25 0.5 0 424 92006 JASPER INFOTECH PRIVATE LIMITED activ 0 NULL 37.5 DLT MPS MPS 0 0 NULL 403516 403516 Vimarsh Rechargeable Tube With Charger Emergency Light 10 10 10 0.2 DELHI MAPUSA DL GA NCR WS"
When I import it into R using -
y <- read.csv("x.csv", sep = "\t")
y <- read.table("x.csv", sep = "\t")
All the data comes into one cell. This is a sample of a very big data set, and I want to import the data column-wise, not into a single cell.
Please help.
Your file is a little odd in that it seems to have a mix of delimiters (some \t, some _, and some ,), and as @Sun Bee mentions in the comments, your header doesn't seem to match up with your data. For those reasons, it might be worth working on the file "from scratch" rather than relying on something like read.table or fread.
First, read in the file as text:
con <- file( "x.csv" )
input <- readLines( con )
close( con )
Then perform a few tasks on it. First, split the text in each line on any of "\t", ",", and "_".
data <- sapply( input, strsplit, "\t|,|_" )
If you take a look at the lengths of each element, you'll see that the first (the header) is an odd one out, meaning the values won't line up with the header names.
sapply( data, length )
My suggestion here is to remove that first row, and go without a header for now.
data <- data[ -1 ]
Then bind the list together row-wise to make a matrix (which you can convert to a data.frame if you prefer; see the note and sketch below). I'm removing the row names here because I assume you don't need them.
data <- do.call( rbind, data )
row.names(data) <- NULL
What results from the above is something that I'd say represents your data well, albeit without columns names. You can take the first line of your file and work with it to extract proper column names if you wish, but I'm not seeing exactly how they should go, so I won't attempt it here.
NOTE: rbind on a list of character vectors produces a character matrix, so no factor conversion happens at that step; it is the later conversion to a data.frame that turns character columns into factors by default. To avoid that, specify options( stringsAsFactors = FALSE ) beforehand, or pass stringsAsFactors = FALSE directly.
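As a finishing sketch, here's one way to convert the matrix to a data.frame with placeholder names (the real names would have to be parsed out of your header line):

data <- as.data.frame( data, stringsAsFactors = FALSE )
names( data ) <- paste0( "V", seq_len( ncol( data ) ) )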
I have a series of txt files formatted in the same way.
The first few rows are all file information. There are no variable names. As you can see, the spaces between fields are inconsistent, but the columns are left-aligned or right-aligned. I know SAS can directly read data in this format, and I wonder whether R provides any similar function.
I tried the read.csv function to load these data, hoping to save them in a data.frame with 3 columns, but it turns out the sep option in that function cannot take a regular expression such as "\s" (multiple spaces).
So I tried to read these data into a variable first and use the substr function to split them, as follows.
Step 1:
Factor<-data.frame(substr(Share$V1,1,9),substr(Share$V1,9,14),as.numeric(substr(Share$V1,15,30)))
Step 2:
But this is quite clumsy, and it requires counting the spaces in between.
I wonder if there is any method to directly load the data as three columns.
> Factor
F T S
1 +B2P A 1005757219
2 +BETA A 826083789
We can use read.table to read it as 3 columns:
read.table(text=as.character(Share$V1), sep="", header=FALSE,
stringsAsFactors=FALSE, col.names = c("FactorName", "Type", "Share"))
# FactorName Type Share
#1 +B2P A 1005757219
#2 +BETA A 826083789
#3 +E2P A 499237181
#4 +EF2P A 38647147
#5 +EFCHG A 866171133
#6 +IL1QNS A 945726018
#7 +INDMOM A 862690708
Another option would be to read it directly from the file, skipping the header line and changing the column names:
read.table("yourfile.txt", header=FALSE, skip=1, stringsAsFactors=FALSE,
col.names = c("FactorName", "Type", "Share"))
My data set testdata has 2 variables named PWGTP and AGEP
The data are in a .csv file.
When I do:
> head(testdata)
The variables show up as
ï..PWGTP AGEP
23 55
26 56
24 45
22 51
25 54
23 35
So, for some reason, R is reading PWGTP as ï..PWGTP. No biggie.
HOWEVER, when I use some function to refer to the variable ï..PWGTP, I get the message:
Error: id variables not found in data: ï..PWGTP
Similarly, when I use some function to refer to the variable PWGTP, I get the message:
Error: id variables not found in data: PWGTP
2 Questions:
Is there anything I should be doing to the source file to prevent mangling of the variable name PWGTP?
It should be trivial to rename ï..PWGTP to something else, but R is unable to find a variable with that name. Any thoughts on how one should repair the variable name?
This is a BOM (Byte Order Mark) UTF-8 issue.
To prevent this from happening, 2 options:
Save your file as UTF-8 without BOM / signature -- or --
Use fileEncoding = "UTF-8-BOM" when using read.table or read.csv
Example:
mydata <- read.table(file = "myfile.txt", fileEncoding = "UTF-8-BOM")
It is possible that the column names in the file contain extra characters, e.g. spaces or a stray prefix before PWGTP, which get converted to "." while reading into R. One way to prevent this would be to use check.names = FALSE in read.csv/read.table:
d1 <- read.csv("yourfile.csv", header=TRUE, stringsAsFactors=FALSE, check.names=FALSE)
However, it is better not to have names starting with a number or containing spaces.
So, supposing the OP read the data with the default options, i.e. with check.names = TRUE, we can use sub to change the column names:
names(d1) <- sub(".*\\.+", "", names(d1))
As an example
sub(".*\\.+", "", "ï..PWGTP")
#[1] "PWGTP"