How to remove the extra commas from a csv file? - r

I was trying to use a csv file in R in read.transactions() command from arules package.
The csv file when opened in Notepad++ shows extra commas for every non-existing values. So, I'm having to manually delete those extra commas before using the csv in read.transactions(). For example, the actual csv file when opened in Notepad++ looks like:
D115,DX06,Slz,,,,
HC,,,,,,
DX06,,,,,,
DX17,PG,,,,,
DX06,RT,Dty,Dtcr,,
I want it to appear like below while sending it into read.transactions():
D115,DX06,Slz
HC
DX06
DX17,PG
DX06,RT,Dty,Dtcr
Is there any way I can make that change in read.transactions() itself, or any other way? But even before that, we don't get to see those extra commas in R(that output I showed was from Notepad++)..
So how can we even remove them in R when we can't see it?

A simple way to create a new file without the trailing commas is:
file_lines <- readLines("input.txt")
writeLines(gsub(",+$", "", file_lines),
"without_commas.txt")
In the gsub command, ",+$" matches one or more (+) commas (,) at the end of a line ($).
Since you're using Notepad++, you could just do the substitution in that program: Search > Replace, replace ,+$ with nothing, Search Mode=Regular Expression.

Related

How to detect comma vs semicolon separation in a file before reading it? [duplicate]

I have to read in a lot of CSV files automatically. Some have a comma as a delimiter, then I use the command read.csv().
Some have a semicolon as a delimiter, then I use read.csv2().
I want to write a piece of code that recognizes if the CSV file has a comma or a semicolon as a a delimiter (before I read it) so that I don´t have to change the code every time.
My approach would be something like this:
try to read.csv("xyz")
if error
read.csv2("xyz")
Is something like that possible? Has somebody done this before?
How can I check if there was an error without actually seeing it?
Here are a few approaches assuming that the only difference among the format of the files is whether the separator is semicolon and the decimal is a comma or the separator is a comma and the decimal is a point.
1) fread As mentioned in the comments fread in data.table package will automatically detect the separator for common separators and then read the file in using the separator it detected. This can also handle certain other changes in format such as automatically detecting whether the file has a header.
2) grepl Look at the first line and see if it has a comma or semicolon and then re-read the file:
L <- readLines("myfile", n = 1)
if (grepl(";", L)) read.csv2("myfile") else read.csv("myfile")
3) count.fields We can assume semicolon and then count the fields in the first line. If there is one field then it is comma separated and if not then it is semicolon separated.
L <- readLines("myfile", n = 1)
numfields <- count.fields(textConnection(L), sep = ";")
if (numfields == 1) read.csv("myfile") else read.csv2("myfile")
Update Added (3) and made improvements to all three.
A word of caution. read.csv2() is designed to handle commas as decimal point and semicolons as separators (default values). If by any chance, your csv files have semicolons as separators AND points as decimal point, you may get problems because of dec = "," setting. If this is the case and you indeed have separator as the ONLY difference between the files, it is better to change the "sep" option directly using read.table()

readcsv fails to read # character in Julia

I've been using asd=readcsv(filename) to read a csv file in Julia.
The first row of the csv file contains strings which describe the column contents; the rest of the data is a mix of integers and floats. readcsv reads the numbers just fine, but only reads the first 4+1/2 string entries.
After that, it renders "". If I ask the REPL to display asd[1,:], it tells me it is 1x65 Array{Any,2}.
The fifth column in the first row of the csv file (this seems to be the entry it chokes on) is APP #1 bias voltage [V]; but asd[1,5] is just APP . So it looks to me as though readcsv has choked on the "#" character.
I tried using "quotes=false" keyword in readcsv, but it didn't help.
I used to use xlsread in Matlab and it worked fine.
Has anybody out there seen this sort of thing before?
The comment character in Julia is #, and this applies when reading files from delimited text files.
But luckily, the readcsv() and readdlm() functions have an optional argument to help in these situations.
You should try readcsv(filename; comment_char = '/').
Of course, the example above assumes that you don't have any / characters in your first line. If you do, then you'll have to change that / above to something else.

How to check if CSV file has a comma or a semicolon as separator?

I have to read in a lot of CSV files automatically. Some have a comma as a delimiter, then I use the command read.csv().
Some have a semicolon as a delimiter, then I use read.csv2().
I want to write a piece of code that recognizes if the CSV file has a comma or a semicolon as a a delimiter (before I read it) so that I don´t have to change the code every time.
My approach would be something like this:
try to read.csv("xyz")
if error
read.csv2("xyz")
Is something like that possible? Has somebody done this before?
How can I check if there was an error without actually seeing it?
Here are a few approaches assuming that the only difference among the format of the files is whether the separator is semicolon and the decimal is a comma or the separator is a comma and the decimal is a point.
1) fread As mentioned in the comments fread in data.table package will automatically detect the separator for common separators and then read the file in using the separator it detected. This can also handle certain other changes in format such as automatically detecting whether the file has a header.
2) grepl Look at the first line and see if it has a comma or semicolon and then re-read the file:
L <- readLines("myfile", n = 1)
if (grepl(";", L)) read.csv2("myfile") else read.csv("myfile")
3) count.fields We can assume semicolon and then count the fields in the first line. If there is one field then it is comma separated and if not then it is semicolon separated.
L <- readLines("myfile", n = 1)
numfields <- count.fields(textConnection(L), sep = ";")
if (numfields == 1) read.csv("myfile") else read.csv2("myfile")
Update Added (3) and made improvements to all three.
A word of caution. read.csv2() is designed to handle commas as decimal point and semicolons as separators (default values). If by any chance, your csv files have semicolons as separators AND points as decimal point, you may get problems because of dec = "," setting. If this is the case and you indeed have separator as the ONLY difference between the files, it is better to change the "sep" option directly using read.table()

CSV file is not recognized

Im exporting an excel file into a .csv file (cause I want to import it into R) but R doesn't recognize it.
I think this is because when I open it in notepad I get:
Item;Description
1;ja
2;ne
While a file which does not have any import issues is structured like this in notepad:
"Item","Description"
"1","ja"
"2","ne"
Does anybody know how I can either export it from excel in the right format or import a csv file with ";" seperator into R.
It's easy to deal with semicolon-delimited files; you can use read.csv2() instead of read.csv() (although be aware this will also use comma as the decimal separator character!), or specify sep=";".
Sorry to ask, but did you try reading ?read.csv ? The relevant information is in there, although it might admittedly be a little overwhelming/hard to sort out if you're new to R:
sep: the field separator character. Values on each line of the
file are separated by this character. If ‘sep = ""’ (the
default for ‘read.table’) the separator is ‘white space’,
that is one or more spaces, tabs, newlines or carriage
returns.

Reading a csv file with embedded quotes into R

I have to work with a .csv file that comes like this:
"IDEA ID,""IDEA TITLE"",""VOTE VALUE"""
"56144,""Net Present Value PLUS (NPV+)"",1"
"56144,""Net Present Value PLUS (NPV+)"",1"
If I use read.csv, I obtain a data frame with one variable. What I need is a data frame with three columns, where columns are separated by commas. How can I handle the quotes at the beginning of the line and the end of the line?
I don't think there's going to be an easy way to do this without stripping the initial and terminal quotation marks first. If you have sed on your system (Unix [Linux/MacOS] or Windows+Cygwin?) then
read.csv(pipe("sed -e 's/^\"//' -e 's/\"$//' qtest.csv"))
should work. Otherwise
read.csv(text=gsub("(^\"|\"$)","",readLines("qtest.csv")))
is a little less efficient for big files (you have to read in the whole thing before processing it), but should work anywhere.
(There may be a way to do the regular expression for sed in the same, more-compact form using parentheses that the second example uses, but I got tired of trying to sort out where all the backslashes belonged.)
I suggest both removing the initial/terminal quotes and turning the back-to-back double quotes into single double quotes. The latter is crucial in case some of the strings contain commas themselves, as in
"1,""A mostly harmless string"",11"
"2,""Another mostly harmless string"",12"
"3,""These, commas, cause, trouble"",13"
Removing only the initial/terminal quotes while keeping the back-to-back quote leads the read.csv() function to produce 6 variables, as it interprets all commas in the last row as value separators. So the complete code might look like this:
data.text <- readLines("fullofquotes.csv") # Reads data from file into a character vector.
data.text <- gsub("^\"|\"$", "", data.text) # Removes initial/terminal quotes.
data.text <- gsub("\"\"", "\"", data.text) # Replaces "" by ".
data <- read.csv(text=data.text, header=FALSE)
Or, of course, all in a single line
data <- read.csv(text=gsub("\"\"", "\"", gsub("^\"|\"$", "", readLines("fullofquotes.csv", header=FALSE))))

Resources