I'm trying to work with a file that is saved as a .csv file but is actually ;-delimited. The decimal points are commas.
Example of a row:
SAA1;6,022367813;10,9403136;5,807354922;3,169925001;3,807354922;8,636624621;5,247927513;5,459431619;9,09011242;4,247927513;4,087462841;5,247927513;4,584962501;11,17492568;4,754887502;6,857980995;7,409390936;7,499845887;8,224001674;10,19967234;9,638435914;4,700439718;6,14974712;2,807354922;0;7,348728154;4,700439718;6,820178962;4,700439718;6,044394119;1,584962501;6,044394119;6,375039431;3,807354922;9,087462841;8,74819285;5,614709844;8,330916878;6,62935662;5,169925001;6,442943496;2,321928095;8,312882955;9,240791332;2,807354922;9,06608919;6,539158811;5,64385619;4,584962501;6,700439718;6,108524457;7,539158811;6,658211483;8,982993575;5,285402219;8,744833837
I need to read this data into R and then work with it as numbers, with "." as the decimal point.
Here's what I've tried:
read.csv2("filename.csv", row.names=1, sep=";",dec=",")
This almost worked. Most of the numbers were correctly read in with periods. However, all the numbers in certain columns still contained commas. I tried to fix this with:
temp <- sub(",", ".", data)
However, this did not quite work. It truncated several of the numbers and completely corrupted other ones. I have no idea why.
I've also tried opening the file in Sublime text. I found and replaced all commas with periods. This again worked for the majority of the data, but several numbers again became corrupted.
I've also tried reading in the file without changing the comma decimal marks, writing it out with periods, and then reading it in again.
temp<-read.csv2("filename.csv", row.names=1, sep=";")
write.csv2(temp, "filename_edited", sep = ";", dec=".", row.names = TRUE, col.names = TRUE)
temp2 <- read.csv2("filename_edited", sep=";", row.names=1)
This also didn't work. (I'm not surprised, I was getting desperate.)
What am I doing wrong? How can I fix it?
A common issue is white space before or after the numbers (e.g. " 342,5" instead of "342,5"). Have you tried using the strip.white=TRUE parameter, like:
read.csv2("filename.csv", row.names=1, sep=";", strip.white=TRUE)
If you otherwise pre-process the data, trimws() may also be useful in this context.
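For example, a sketch of that pre-processing (assuming R >= 4.0, where the stubborn columns come through as character rather than factor):
df <- read.csv2("filename.csv", row.names = 1, strip.white = TRUE)
# Fix up any columns that were still read as text: trim whitespace,
# swap the decimal comma for a period, and convert to numeric.
# Note gsub rather than sub: sub() only replaces the first comma in each element.
chr <- vapply(df, is.character, logical(1))
df[chr] <- lapply(df[chr], function(x) as.numeric(gsub(",", ".", trimws(x))))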
The program I am exporting my data from (PowerBI) saves the data as a .csv file, but the first line of the file is sep=, and then the second line of the file has the header (column names).
Sample fake .csv file:
sep=,
Initiative,Actual to Estimate (revised),Hours Logged,Revised Estimate,InitiativeType,Client
FakeInitiative1 ,35 %,320.08,911,Platform,FakeClient1
FakeInitiative2,40 %,161.50,400,Platform,FakeClient2
I'm using this command to read the file:
initData <- read.csv("initData.csv",
row.names=NULL,
header=T,
stringsAsFactors = F)
but I keep getting an error that there are the wrong number of columns (because it thinks the first line tells it the number of columns).
If I do header=F instead, then it loads, but when I then do names(initData) <- initData[2,], the names have spaces and illegal characters, and it breaks the rest of my program. Obnoxious.
Does anyone know how to tell R to ignore that first line? I can go into the .csv file in a text editor and just delete the first line manually before I load it each time (if I do that, everything works fine) but I have to export a bunch of files and this is a bit stupid and tedious.
Any help would be much appreciated.
There are many ways to do that. Here's one:
all_content = readLines("initData.csv")
skip_first_line = all_content[-1]
initData <- read.csv(textConnection(skip_first_line),
row.names=NULL,
header=T,
stringsAsFactors = F)
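Alternatively, read.csv (via read.table) has a skip argument that does the same thing in one step:
initData <- read.csv("initData.csv",
                     skip = 1,
                     header = TRUE,
                     stringsAsFactors = FALSE)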
Your file could be in a UTF-16 encoding. See hrbrmstr's answer on how to read a UTF-16 file.
I'm new, and I have a problem:
I got a dataset (csv file) with 15 columns and 33,000 rows.
When I view the data in Excel it looks good, but when I try to load the data into RStudio I have a problem.
I used the code:
x <- read.csv(file = "1energy.csv", head = TRUE, sep="")
View(x)
The result is that the column names are good, but the data (row 2 and further) all end up in my first column, separated with ";". But when I try the code:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";")
The next problem is: Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
So I made the code:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";", row.names = NULL)
And it looks like it worked... But now the data is in the wrong columns (for example, the "name" column now contains the "time" value, and the "time" column contains the "costs" value).
Does anybody know how to fix this? I can rename columns but I think that is not the best way.
Excel, in its English version at least, may use a comma as separator, so you may want to try
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=",")
I once had a similar problem where the header had a long entry containing a character that read.csv mistook for a column separator. In reality, it was part of a long name that wasn't quoted properly.
Try skipping the header and see if the problem persists:
x1 <- read.csv(file = "1energy.csv", skip = 1, head = FALSE, sep=";")
In reply to your comment:
Two things you can do. Simplest one is to assign names manually:
myColNames <- c("col1.name", "col2.name")
names(x1) <- myColNames
The other way is to read just the name row (the first line in your file) and split it into a character vector:
nameLine <- readLines(con="1energy.csv", n=1)
fileColNames <- unlist(strsplit(nameLine, ";"))
Then see how you can fix the problem and assign the names to your x1 data frame. I don't know what exactly is wrong with your first line, so I can't tell you how to fix it.
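For example (a sketch): once the name line is cleaned up, make.names() will coerce the entries into syntactically valid column names before assigning them:
names(x1) <- make.names(fileColNames)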
Yet another cruder option is to open your csv file using a text editor and edit column names.
This happens because of Excel's specifics. The easy solution is to copy all your data (Ctrl+C) into Notepad and save it again from Notepad as filename.csv (don't forget to remove the .txt extension if necessary). It worked well for me: R opened the newly created csv file correctly, with all the data separated into the right columns.
Open your file in a text editor and see if it really is separated with commas...
Sometimes .csv files are separated with tabs instead of commas or semicolons; Excel opens them without a problem, but in R you have to specify the separator, like this:
x <- read.csv(file = "1energy.csv", head = TRUE, sep="\t")
I once had the same problem, this was my solution. Hope it works for you.
This problem can arise due to regional settings on the excel application where the .csv file was created.
While in most places a "," separates the columns in a COMMA separated file (which makes sense), in other places it is a ";"
Depending on your regional settings, you can experiment with:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=",") #used in North America
or,
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";") #used in some parts of Asia and Europe
You could use:
df <- read.csv("filename.csv", sep = ";", quote = "")
It solved one of my problems, similar to yours.
So I made the code:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";", row.names = NULL)
And it looks like it worked... But now the data is in the wrong columns (for example, the "name" column now contains the "time" value, and the "time" column contains the "costs" value).
Does anybody know how to fix this? I can rename columns but I think that is not the best way.
I had the exact same issue. Did quite some research and found out that the CSV was ill-formed.
In the header line of the CSV there were all the labels (separated by the separator) and then a line break.
Starting with line 2, there was an additional separator at the end of each line. So an example of such an ill-formed CSV file looks like this:
Field1;Field2 <-- see the *missing* semicolon at the end
12;23; <-- see the *trailing* semicolon in each of the data lines
34;67;
45;56;
Such ill-formed files are even harder to spot when they are TAB-separated.
Excel does not care about that when importing CSV files.
But R does.
When you use skip=1 you skip the header line that contains part of the mismatch. The data frame will be imported well, but there will be a column of "NA" at the end of each row. And obviously you will not have column names, as these were skipped.
Easiest solution: edit the CSV file and either add an additional separator at the end of the header line as well, or remove the trailing delimiters in the data lines. You can also use generic read and write functions in R for text files to automate that editing.
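For example, a small sketch of that automated edit (hypothetical file name; this strips the trailing separator from the data lines rather than touching the header):
lines <- readLines("illformed.csv")
# Remove the trailing ";" from every line except the header
lines[-1] <- sub(";$", "", lines[-1])
df <- read.csv(textConnection(lines), sep = ";")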
You can transform the data by arranging it into cells corresponding to columns:
1. Open your csv file.
2. Copy the content and paste it into a txt file; save it, then copy its content.
3. Open a new Excel file.
4. In Excel, go to the section responsible for data; it is actually called "Data".
5. Then, on the left side, go to the external data query (in German: "externe Daten abfragen").
6. Go ahead step by step and separate by commas.
7. Save your file as csv.
I had the same problem and it was frustrating...
However, I found the ultimate solution
First take this (csv) file and convert it online to a JSON file and download it... then redo the whole thing backwards (re-convert the JSON to csv) online... download the converted file... give it a name...
then load it in RStudio:
mydata <- read.csv(file = 'name your file.csv')
... took me 4 days to think out of the box... 🙂🙂🙂
I'm trying to work on a very large dataset in R. It's currently saved as a CSV and I'm using read.csv to import it. Unfortunately, one of the fields is Address, which naturally contains commas, and R is obviously reading these as separators. One positive: all of the commas within the addresses are followed by spaces.
So here's the question. Is there any way of telling read.csv that ", " is not a separator but "," is?
If not, is there any way I can import a csv into R as one long string to then do a find and replace? Or, as a last resort, could you point me in the direction of a decent text editor that I can get away with installing on my work laptop?
Thanks, James
My first attempt would be to get the file recreated using either quotes or a | separator.
If that weren't an option, I would then try to use readLines and gsub,
assuming I can read all of those character strings into memory.
library(dplyr)
# Build an example file whose Address field contains unquoted commas
File <-
  paste("Name,Address,Occupation,Hair Color",
        "Mary,123 Lamb St, Ruralsville, KY,Shepherd,Blonde",
        "Jim,17 Elm St., Urban Center, TN,Butler,Brown",
        sep = "\n")
tmp <- tempfile()   # temporary file to hold the example
write(File, tmp)
# Swap ", " for a placeholder so only the true separators remain,
# read the CSV, then restore ", " in the Address column
DF <- readLines(tmp) %>%
  gsub(", ", "_;_", .) %>%
  textConnection() %>%
  read.csv() %>%
  mutate(Address = gsub("_;_", ", ", Address))
unlink(tmp)
If the file were too large to read into memory, I would likely attempt to write a loop
that reads 100,000 lines at a time, performs the above code on each segment, and
writes it to a new CSV. The new CSV will be properly quoted from R and should read in
fairly well.
(I haven't tried to write this loop yet. I'm hoping for your sake it doesn't come to that.)
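If it does come to that, an untested sketch of such a loop (hypothetical file names; same "_;_" placeholder trick, with the header read once and shared across chunks) might look like:
con <- file("big.csv", "r")
header <- readLines(con, n = 1)
col_names <- strsplit(header, ",")[[1]]
first <- TRUE
repeat {
  chunk <- readLines(con, n = 100000)
  if (length(chunk) == 0) break
  fixed <- gsub(", ", "_;_", chunk)  # protect commas inside addresses
  df <- read.csv(text = fixed, header = FALSE,
                 col.names = col_names, stringsAsFactors = FALSE)
  df$Address <- gsub("_;_", ", ", df$Address)
  # write.table quotes character fields by default, so the output is
  # a properly quoted CSV
  write.table(df, "big_fixed.csv", sep = ",", append = !first,
              col.names = first, row.names = FALSE)
  first <- FALSE
}
close(con)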
I used SAS to save a tab-delimited text file with utf8 encoding on a windows machine. Then I tried to open this in R:
read.table(myfile, header =TRUE, sep = "\t")
To my surprise, the data was totally messed up, but only in a sneaky way. Number values changed randomly, but the overall layout looked normal, so it took me a while to notice the problem, which I'm assuming now is the BOM.
This is not a new issue of course; they address it briefly here, and recommend using
read.table(myfile, fileEncoding = "UTF-8", header =TRUE, sep = "\t")
However, this made no improvement! My only solution was to suppress the header, with or without the fileEncoding argument:
read.table(myfile, fileEncoding = "UTF-8", header =FALSE, sep = "\t")
read.table(myfile, header =FALSE, sep = "\t")
In either case, I have to do some funny business to replace the column names with the first row, but only after I remove some version of the BOM that appears at the beginning of the first column name (<U+FEFF> if I use fileEncoding, and ï»¿ if I don't use fileEncoding).
Isn't there a simple way to just remove the BOM and use read.table without any special arguments?
Update for @Joe:
The SAS that I used:
FILENAME myfile 'C:\Documents ... file.txt' encoding="utf-8";
proc export data=lib.sastable
outfile=myfile
dbms=tab replace;
putnames=yes;
run;
Update on further weirdness: Using fileEncoding="UTF-8-BOM" as @Joe suggested in his solution below seems to remove the BOM. However, it did not fix my original motivating problem, which is corruption in the data; the header row is fine, but weirdly the last few digits of the first column of numbers get messed up. I'll give Joe credit for his answer -- maybe my problem is not actually a BOM issue?
Hack solution: Use fileEncoding="UTF-8-BOM" AND also include the argument colClasses = "character". No idea why this works to fix the data corruption issue -- could be the topic of a future question.
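For reference, that hack spelled out as a sketch (type.convert() then restores numeric columns from the all-character read):
x <- read.table(myfile, fileEncoding = "UTF-8-BOM", header = TRUE,
                sep = "\t", colClasses = "character")
x[] <- lapply(x, type.convert, as.is = TRUE)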
As per your link, it looks like it works for me with:
read.table('c:\\temp\\testfile.txt',fileEncoding='UTF-8-BOM',header=TRUE,sep='\t')
Note the -BOM in the file encoding.
This is in 2.1 Variations on read.table in the R documentation. Under 12 Encoding, see "Under UNIX you might need...", which apparently applies even on Windows now (for me, at least).
Or you can use the SAS system option NOBOMFILE to write a UTF-8 file without the BOM.
I'm trying to read a giant data frame with cbc.read.table:
my.df <- cbc.read.table("df.csv",sep = ";", header =F)
This is what I get:
Error in cbc.read.table("2012Q2.csv", sep = "|", header = F) :
No rows to read
The working directory is set correctly. In principle it works using read.table, except that it doesn't read in all lines (about two million).
Does anybody have an idea what I can do about this?
SOLUTION:
Hi again, the following thread helped me out:
R: Why does read.table stop reading a file?
The problem was caused by quotation marks, probably because some of them were unclosed. I simply used an editor and deleted all double and single quotation marks, as well as all hash marks. It's working now.
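For what it's worth, the same effect can often be achieved without editing the file, by disabling quote and comment handling on the R side (a sketch with read.table; whether cbc.read.table forwards these arguments may vary):
my.df <- read.table("df.csv", sep = ";", header = FALSE,
                    quote = "", comment.char = "")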
@Anthony: Thanks for your question. I noticed that the problem did not occur in the first three lines, which is why I got the idea that it's an issue with the file. Thanks!
Paul