I'm working with CSV files and the problem is that some rows have columns containing \" inside. A simple example would be:
"Row 42"; "Some value"; "Description: \"xyz\""; "Anoher value"
As you can see, the third column contains that combination and when I use the read_csv method in R, the input format is messed up. One working solution is to open the CSV file in Notepad++ and simply replace \" with ', for example. However, I'd prefer to have this automated.
I'm able to replace the \" with ' by using
gsub('\\\\"', "\\\'", df)
However, I'm not able to write it in the original format. Whenever I read the CSV file with R, I lose the quotation marks indicating the columns. So, in other words, my current method outputs the following:
"Row 42; Some value; Description: 'xyz'; Anoher value"
The quotation marks before and after ; are missing.
It's almost fine, but when opening the preprocessed file with Excel, it doesn't recognize the columns. I think the most convenient solution would be to read the CSV file simply as one big string containing all the quotation marks, replace the combination explained above, and then write it out again. However, I'm not able to read the file as one big string containing all the quotation marks.
Is there a way to read the CSV file with R containing all the quotation marks? Do you have any other solutions to achieve that?
Already tried read.table? It comes with the base installation of R.
Define sep=';' as the separator and use nothing as the quote character, quote=''. Then gsub the redundant quotes away and do trimws. This should fix your data.
x <- '"Row 42"; "Some value;" "Description: \"xyz\""; "Anoher value"'
tab <- read.table(text=x, sep=';', quote='')
tab[] <- lapply(tab, \(x) trimws(gsub('\\"', '', x)))
tab
# V1 V2 V3 V4
# 1 Row 42 Some value Description: xyz Anoher value
In your case use read.table(file='<path to .csv file>', sep=';', quote='')
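If you also need to write the cleaned table back out so that Excel recognizes the columns again, a minimal sketch of the write step (assuming an output path out.csv) could be:
# quote each field again and keep ';' as the separator
write.table(tab, file='out.csv', sep=';', quote=TRUE, row.names=FALSE)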
I found the solution, in case anyone else faces the same problem:
library(readr)
data <- read_lines(inputFileName)            # read the file as plain lines, keeping all quotation marks
preprocessed <- gsub('\\\\"', "\\\'", data)  # replace \" with '
write_lines(preprocessed, outputFileName)
Related
I have more than 100 name items in my R script without quotation marks and commas between them. I want to make a vector out of them.
AWE XYA Name3 WERFS XYAGD ...... DSFSF
The vector should be
vec <- c("AWE", "XYA" ,"Name3" ,"WERFS" ,"XYAGD" ...... ,"DSFSF")
without manually entering the quotation marks and commas. Is there a way to automate this?
If you want to do that from RStudio, you have some solutions here.
You also have an RStudio addin to put quotation marks around words:
remotes::install_github("hrbrmstr/hrbraddins")
See there or there. After putting the quotation marks in, you can select the area in the script and do a find and replace to transform " " into ", ".
Assuming the file in which this is stored is called temp.R, you can use scan to get a character vector. This will also work if you have a text (.txt) file.
vec <- scan('temp.R', what = "character", quiet = TRUE)
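As a quick illustration (writing the names to a temporary file first, since scan needs something to read from):
tmp <- tempfile(fileext = ".R")
writeLines("AWE XYA Name3 WERFS XYAGD DSFSF", tmp)
vec <- scan(tmp, what = "character", quiet = TRUE)
dput(vec)
# c("AWE", "XYA", "Name3", "WERFS", "XYAGD", "DSFSF")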
I am trying to export a table to CSV format, but one of my columns is special - it's like a number string except that the length of the string needs to be the same every time, so I add trailing spaces to shorter numbers to get it to a certain length (in this case I make it length 5).
library(dplyr)
library(readr)
library(stringr)
df <- read.table(text="ID Something
22 Red
55555 Red
123 Blue
",header=T)
df <- mutate(df,ID=str_pad(ID,5,"right"," "))
df
ID Something
1 22 Red
2 55555 Red
3 123 Blue
Unfortunately, when I write the data out with write_csv, the trailing spaces disappear, which is not good for what I want to use this for. I think it's because I am downloading the csv from the R server and then opening it in Excel, which messes around with the data. Any tips?
str_pad() appears to be a function from the stringr package, which is not currently available for R 3.5.0, which I am using - this may be the cause of your issues as well. If the function actually works for you, please ignore the next step and skip straight to my Excel comments below.
Adding spaces. Here is how I have accomplished this task with base R
# a custom function to add arbitrary number of trailing spaces
SpaceAdd <- function(x, desiredLength = 5) {
additionalSpaces <- ifelse(nchar(x) < desiredLength,
paste(rep(" ", desiredLength - nchar(x)), collapse = ""), "")
paste(x, additionalSpaces, sep="")
}
# use the function on your df
df$ID <- mapply(df$ID, FUN = SpaceAdd)
# write csv normally
write.csv(df, "df.csv")
NOTE: When you import the file into Excel, you should use the 'import from text' wizard rather than just opening the .csv. This is because you need to mark your 'ID' column as text in order to keep the spaces.
NOTE 2: I have learned today that having your first column named 'ID' might actually cause further problems with Excel, since it may misinterpret the nature of the file and treat it as a SYLK file instead. So it may be best to avoid this column name if possible.
Here is a wiki tl;dr:
A commonly encountered (and spurious) 'occurrence' of the SYLK file happens when a comma-separated value (CSV) format is saved with an unquoted first field name of 'ID', that is the first two characters match the first two characters of the SYLK file format. Microsoft Excel (at least to Office 2016) will then emit misleading error messages relating to the format of the file, such as "The file you are trying to open, 'x.csv', is in a different format than specified by the file extension..."
details: https://en.wikipedia.org/wiki/SYmbolic_LinK_(SYLK)
I have to work with a .csv file that comes like this:
"IDEA ID,""IDEA TITLE"",""VOTE VALUE"""
"56144,""Net Present Value PLUS (NPV+)"",1"
"56144,""Net Present Value PLUS (NPV+)"",1"
If I use read.csv, I obtain a data frame with one variable. What I need is a data frame with three columns, where columns are separated by commas. How can I handle the quotes at the beginning of the line and the end of the line?
I don't think there's going to be an easy way to do this without stripping the initial and terminal quotation marks first. If you have sed on your system (Unix [Linux/MacOS] or Windows+Cygwin?) then
read.csv(pipe("sed -e 's/^\"//' -e 's/\"$//' qtest.csv"))
should work. Otherwise
read.csv(text=gsub("(^\"|\"$)","",readLines("qtest.csv")))
is a little less efficient for big files (you have to read in the whole thing before processing it), but should work anywhere.
(There may be a way to do the regular expression for sed in the same, more-compact form using parentheses that the second example uses, but I got tired of trying to sort out where all the backslashes belonged.)
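(For what it's worth, an equivalent but more compact call using extended regular expressions - assuming your sed supports the -E flag, as GNU and BSD sed do - would be something like
read.csv(pipe("sed -E 's/^\"|\"$//g' qtest.csv"))
but the two separate -e expressions above do the same job.)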
I suggest both removing the initial/terminal quotes and turning the back-to-back double quotes into single double quotes. The latter is crucial in case some of the strings contain commas themselves, as in
"1,""A mostly harmless string"",11"
"2,""Another mostly harmless string"",12"
"3,""These, commas, cause, trouble"",13"
Removing only the initial/terminal quotes while keeping the back-to-back quotes leads the read.csv() function to produce 6 variables, as it interprets all commas in the last row as value separators. So the complete code might look like this:
data.text <- readLines("fullofquotes.csv") # Reads data from file into a character vector.
data.text <- gsub("^\"|\"$", "", data.text) # Removes initial/terminal quotes.
data.text <- gsub("\"\"", "\"", data.text) # Replaces "" by ".
data <- read.csv(text=data.text, header=FALSE)
Or, of course, all in a single line
data <- read.csv(text=gsub("\"\"", "\"", gsub("^\"|\"$", "", readLines("fullofquotes.csv"))), header=FALSE)
I have a large dataset in a dbf file and would like to export it to a csv file.
Thanks to SO, I already managed to do that smoothly.
However, when I try to import it into R (the environment I work) it combines some characters together, making some rows much longer than they should be, consequently breaking the whole database. In the end, whenever I import the exported csv file I get only half of the db.
I think the main problem is with quotes inside string fields, but specifying quote="" in R didn't help (and it usually does).
I've searched for any question on how to deal with quotes when exporting from Visual FoxPro, but couldn't find the answer. I wanted to test this, but my computer throws an error stating that I don't have enough memory to complete the operation (probably due to the large db).
Any help will be highly appreciated. I've been stuck with this problem of exporting from the dbf into R for long enough; I've searched everything I could and am desperately looking for a simple way to import a large dbf into my R environment without any bugs.
(In R: I checked whether there are problems with the imported file, and indeed most columns have much longer nchar values than they should, while the number of rows has halved. Reading the db with read.csv("file.csv", quote="") didn't help. Reading with data.table::fread() returns the error
Expected sep (',') but '0' ends field 88 on line 77980:
But according to verbose=T this function reads the right number of rows (read.csv imports only about 1.5 million rows):
Count of eol after first data row: 2811729 Subtracted 1 for last eol
and any trailing empty lines, leaving 2811728 data rows
When exporting with TYPE DELIMITED, you have some control on the VFP side over how the export formats the output file.
To change the field delimiter from quotes to, say, a pipe character you can do:
copy to myfile.csv type delimited with "|"
so that will produce something like:
|A001|,|Company 1 Ltd.|,|"Moorfields"|
You can also change the separator from a comma to another character:
copy to myfile.csv type delimited with "|" with character "#"
giving
|A001|#|Company 1 Ltd.|#|"Moorfields"|
That may help in parsing on the R side.
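On the R side, a minimal sketch of reading that pipe-delimited output back in (assuming the export went to myfile.csv as above) might be:
# '|' acts as the quoting character, ',' stays the field separator
dat <- read.table("myfile.csv", sep=",", quote="|", stringsAsFactors=FALSE)
# for the second variant, use sep="#" instead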
There are three ways to delimit a string in VFP - the usual single and double quote characters, and square brackets. So to strip quotes out of character fields myfield1 and myfield2 in your DBF file you could do this in the Command Window:
close all
use myfile
copy to mybackupfile
select myfile
replace all myfield1 with chrtran(myfield1,["'],"")
replace all myfield2 with chrtran(myfield2,["'],"")
and repeat for other fields and tables.
You might have to write code to do the export, rather than simply using the COPY TO ... DELIMITED command.
SELECT thedbf
mfld_cnt = AFIELDS(mflds)
fh = FOPEN(m.filename, 1)
SCAN
    FOR aa = 1 TO mfld_cnt
        mcurfld = 'thedbf.' + mflds[aa, 1]
        mvalue = &mcurfld
        ** Or you can use:
        mvalue = EVAL(mcurfld)
        ** manipulate the contents of mvalue, possibly based on the field type
        DO CASE
        CASE mflds[aa, 2] = 'D'
            mvalue = DTOC(mvalue)
        CASE mflds[aa, 2] $ 'CM'
            ** Replace characters that are giving you problems in R
            mvalue = STRTRAN(mvalue, ["], '')
        OTHERWISE
            ** Etc.
        ENDCASE
        = FWRITE(fh, mvalue)
        IF aa # mfld_cnt
            = FWRITE(fh, [,])
        ENDIF
    ENDFOR
    = FWRITE(fh, CHR(13) + CHR(10))
ENDSCAN
= FCLOSE(fh)
Note that I'm using [ ] characters to delimit strings that include commas and quotation marks. That helps readability.
* create a comma-delimited file with no quotes around the character fields
copy to myfile.csv type delimited with ""   && the WITH clause is two double quotes, i.e. an empty delimiter
For example, I want to read lines from the beginning of a text file up to a string with the ";" symbol, excluding this string.
Thanks a lot.
A very simple approach might be to read the contents of the file using readLines:
content = readLines("data.txt")
And then split the character data on the ;:
split_content = strsplit(content, split = ";")
And then extract the first element, i.e. the text up to the semicolon:
first_element = lapply(split_content, "[[", 1)
The result is a list of all the text in the rows of the data file up to the semicolon.
PS: I'm not entirely sure about the last line... I can't check it as I've got no access to R right now.
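For what it's worth, a quick check of the idea above with some made-up example lines:
content = c("keep this; drop this", "no semicolon here", "head; tail")
split_content = strsplit(content, split = ";")
first_element = lapply(split_content, "[[", 1)
unlist(first_element)
# [1] "keep this"         "no semicolon here" "head"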