I'm trying to read a data file into R, but every time I do, R changes the headers. I can't see any way to control this in the documentation for the read functions.
I have the same data saved as both a csv and a tsv but get the same problem with both.
The headers in the data file look like this when I open it in excel or in the console:
cod name_mun age_class 1985 1985M 1985F 1986 1986M 1986F
But when I read it into R using either read.csv('my_data.csv') or read.delim('my_data.tsv') R changes the headers to this:
> colnames(my_data)
[1] "ï..cod" "name_mun" "age_class" "X1985" "X1985M" "X1985F" "X1986"
[8] "X1986M" "X1986F"
Why does R do this and how can I prevent it from happening?
You are seeing two different things here.
The "ï.." on the first column comes from having a byte order mark at the beginning of your file. Depending on how you created the file, you may be able to save as just ASCII or even just UTF-8 without a BOM to get rid of that.
R does not like to have variable names that begin with a digit. If you look at the help page ?make.names you will see
A syntactically valid name consists of letters, numbers and the dot or
underline characters and starts with a letter or the dot not followed
by a number. Names such as ".2way" are not valid, and neither are the
reserved words.
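As a quick check, make.names() can be called directly to see the renaming that read.csv applies by default. This is only a sketch; the headers vector is made up to mirror the column names in the question.

```r
# Hypothetical header names mirroring the question's file
headers <- c("cod", "name_mun", "1985", "1985M")

# make.names() prepends "X" to names that start with a digit
fixed <- make.names(headers)
print(fixed)
```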
You can get around that when you read in your data by using the check.names argument to read.csv possibly like this.
my_data = read.csv(file.choose(), check.names = FALSE)
That will keep the column names as numbers. The BOM will then show up in full at the start of the first column name, as "ï»¿cod".
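If re-saving the file without a BOM is not practical, one possible alternative is to ask R to skip the BOM while reading, via fileEncoding = "UTF-8-BOM". The sketch below builds a small throwaway file with a BOM (the file contents are invented for illustration) and reads it back with both options set.

```r
# Build a small CSV that starts with a UTF-8 byte order mark (made-up data)
f <- tempfile(fileext = ".csv")
con <- file(f, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)  # the three BOM bytes
writeBin(charToRaw("cod,name_mun,age_class,1985,1985M\n1,A,young,10,5\n"), con)
close(con)

# fileEncoding = "UTF-8-BOM" drops the BOM;
# check.names = FALSE keeps headers such as "1985" unchanged
my_data <- read.csv(f, fileEncoding = "UTF-8-BOM", check.names = FALSE)
print(colnames(my_data))
```

Columns whose names start with a digit then need backticks or [[ ]] to access, for example my_data[["1985"]].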
I have a series of massive data files that range in size from 800k to 1.4M rows, and one variable in particular has a set length of 12 characters (numeric data, padded with leading zeros where the number has fewer than 12 digits). The column should look like this:
col
000000000003
000000000102
000000246691
000000000042
102851000324
etc.
I need to export these files for a client to a CSV file, using R. The final data NEEDS to retain the 12 character structure, but when I open the CSV files in excel, the zeros disappear. This happens even after converting the entire data frame to character. The code I am using to do this is as follows.
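library(dplyr)  # mutate(), across(), %>%
library(rio)    # export() (assumed to come from rio)
# assign the result, otherwise the conversion is discarded
df1 <- df1 %>%
  mutate(across(everything(), as.character))
##### I did this for all data frames #####
export(df1, "df1.csv")
export(df2, "df2.csv")
....
export(df17, "df17.csv")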
I've read a few other posts that say this is an excel problem, and that makes sense, but given the number of data files and amount of data, as well as the need for the client to be able to open it in excel, I need a way to do it on the front end in R. Any ideas?
Yes, this is definitely an Excel problem!
To demonstrate: enter your column values in Excel, save the file as a CSV, and then re-open it in Excel. The leading zeros will disappear.
One option is to add a leading non-numerical character such as '
paste0("\' ", df$col)
Not a great option, but an option.
A slightly better option is to wrap the character string in Excel's Text function. Excel will then evaluate the function when the file is opened.
df$col <- paste0("=Text(", df$col, ", \"000000000000\")")
#or
df$col <- paste0("=\"", df$col, "\"")
write.csv(df, "df2.csv", row.names = FALSE)
Of course if the CSV file is saved and reopened then the leading 0 will again disappear.
Another option is to investigate saving the file directly as a .xlsx file with the "writexl", "xlsx" or a similar package.
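To see what the ="..." wrapper actually puts into the file, here is a small self-contained sketch. The ids data frame and the temporary file are made up for illustration; only paste0 and write.csv are used, as in the answer above.

```r
# Made-up example data: 12-character IDs stored as character
ids <- data.frame(col = c("000000000003", "102851000324"),
                  stringsAsFactors = FALSE)

# Wrap each value as ="value" so Excel treats it as a text formula
ids$col <- paste0("=\"", ids$col, "\"")

out <- tempfile(fileext = ".csv")
write.csv(ids, out, row.names = FALSE)

# write.csv doubles the embedded quotes, which is how CSV escapes them
csv_lines <- readLines(out)
print(csv_lines)
```

The trade-off is that the column is no longer plain data for any program other than Excel, so this is probably best kept for the copy that goes to the client.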
I've been searching for similar problems but I can't find anything helpful.
I'm trying to open a portion of a big csv file with
#choosing a certain number of variables from more than 250 available in the file
resources<-c("P13_2_1","P13_3_1","P13_2_2",...)
v <- fread("file.csv", select = resources, header = TRUE, encoding = "UTF-8")
After the file is opened, there are blank cells wherever there should be NA. However, when I look at what is in any of the blank cells, I see this:
v$P13_2_1[2]
[1] "\r"
Similarly, the header of every column looks fine in the RStudio viewer, but when I print the headers in the console, the same \r is attached.
The problem is present with both read.csv and fread, and I've tried modifying the quote and na.strings arguments.
I would like to get rid of the "\r" and, if possible, substitute it with NA.
I've been downloading Tweets in the form of a .csv file with the following schema:
username;date;retweets;favorites;text;geo;mentions;hashtags;permalink
The problem is that some tweets have semi-colons in their text attribute, for example, "I love you babe ;)"
When I try to import this CSV into R, I get some records with the wrong schema.
I think this format error happens because the CSV parser finds a ; in the text section and splits the row there, if you understand what I mean.
I've already tried matching with the regex: (;".*)(;)(.*";)
and replacing it with ($1)($3) until no more matches are found, but the error in the CSV parsing remains.
Any ideas on how to clean this CSV file? Or why the CSV parser is misbehaving?
Thanks for reading
Edit1:
I think there is no problem in the structure other than a badly chosen separator (';'). Look at this example record:
Juan_Levas;2015-09-14 19:59;0;2;"Me sonrieron sus ojos; y me tembló hasta el alma.";Medellín,Colombia;;;https://twitter.com/Juan_Levas/status/643574711314710528
This is a well-formatted record, but I think that the semicolon in the text section (marked between "") forces the parser to split the text section into 2 columns, in this case "Me sonrieron sus ojos and y me tembló hasta el alma.".
Is this possible?
Also, I'm using read.csv("data.csv", sep=';') to parse the CSV into a data frame.
Edit2:
How to reproduce the error:
Get the csv from here [~2 MB]: Download csv
Do df <- read.csv('twit_data.csv', sep=';')
Explore the resulting data frame (you can sort it by date, retweets or favorites and you will see the inconsistencies in the parsing).
Your CSV file is not properly formatted: the problem is not the separator occurring in character fields, it's rather the fact that the " are not escaped.
The best thing to do would be to generate a new file with a proper format (typically: using RFC 4180).
If it's not possible, your best option is to use a "smart" tool like readr:
library(readr)
df <- read_csv2('twit_data.csv')
It does quite a good job on your file. (I can't see any obvious parsing error in the resulting data frame)
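To illustrate the first point, that a separator inside a properly quoted field is not a problem by itself, here is a base R sketch. The two lines are invented stand-ins for the real file; count.fields is used only as one possible way to spot rows whose field count is off.

```r
# Invented sample: header plus one record with a ';' inside a quoted field
lines <- c(
  "username;date;retweets;favorites;text;geo;mentions;hashtags;permalink",
  "user_a;2015-09-14 19:59;0;2;\"first part; second part\";;;;https://example.com/1"
)

# The quoted ';' stays inside the text column
ok <- read.csv(text = lines, sep = ";", stringsAsFactors = FALSE)
print(ok$text)

# Fields per line; rows that differ from the header's count are suspects
n_fields <- count.fields(textConnection(lines), sep = ";",
                         quote = "\"", comment.char = "")
print(n_fields)
```

On the real file, lines where n_fields is not 9 (or is NA) would be the ones worth inspecting for unescaped quotes.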
The R function read.csv works as follows, as stated in the manual: "If there is a header and the first row contains one fewer field than the number of columns, the first column in the input is used for the row names." That's good. However, when it comes to write.csv, I cannot find a way to write the CSV file in the same layout. So, if I have a file.txt as below:
Column_1,Column_2
Row_1,2,3
Row_2,4,5
Then when I read it using a = read.csv('file.txt'), the row and column names are Row_x and Column_x as expected. However, when I write a to a CSV file again, what I get from write.csv(a, 'file2.txt', quote=F) is as below:
,Column_1,Column_2
Row_1,2,3
Row_2,4,5
So, there is a comma at the beginning of this file. If I read this file again using a2 = read.csv('file2.txt'), the resulting a2 will not be the same as the previous a: the row names of a2 will not be Row_x. That is, I do not want a comma at the beginning of the file. How can I get rid of this comma while using write.csv?
The two functions that you have mentioned, read.csv and write.csv, are just specific forms of the more generic functions read.table and write.table.
When I copy your example data into a .csv and try to read it with read.csv, R throws a warning and says that the header line was incomplete, so it resorted to special behaviour to fix the error. Because the file was incomplete, R completed it by adding an empty element at the top left. R understands that this is a header row, so the data appears okay in R, but when we write to a CSV, the file format has no notion of what is a header and what is not. The empty element that R created in the header row therefore shows up as a regular element, which is what you would expect. Basically R made the table into a 3x3 because every row needs the same number of elements.
You want the extra comma there, because it allows programs to read the column names in the right place. You can also fix the input by manually adding the column and row names in R, including the missing element, so that everything lines up. To read the file in again you can do the following, assuming foo.csv is your data.
To fix the wonky row names, add an extra option specifying which column holds the row names (row.names = your_column_number) when you read the file back in with the comma correctly in place.
y <- read.csv(file = "foo.csv") # this throws a warning because your input is incorrect
write.csv(y, "foo_out.csv") # writes the header with the leading empty field
x <- read.csv(file = "foo_out.csv", header = TRUE, row.names = 1) # reads the first column as the row names
Play around with read.csv and write.csv, but it might be worthwhile to move to the more generic functions read.table and write.table. They offer expanded functionality.
To read a CSV with the generic function:
y <- read.table(file = "foo.csv", sep = ",", header = TRUE)
This way you can specify the delimiter and easily read tab-delimited files exported from Excel (sep = "\t") or space-delimited files (sep = " ").
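As a rough sketch of both routes, using temporary files in place of file.txt and file2.txt: the first keeps the leading comma and reads it back with row.names = 1; the second uses write.table, which by default writes no header entry for the row-name column, so the header comes out one field short, as in the original file.

```r
# Recreate the example input (header has one fewer field than the rows)
in_file <- tempfile(fileext = ".txt")
writeLines(c("Column_1,Column_2", "Row_1,2,3", "Row_2,4,5"), in_file)
a <- read.csv(in_file)

# Route 1: keep the leading comma, then name the row-name column on read
out1 <- tempfile(fileext = ".txt")
write.csv(a, out1, quote = FALSE)
a2 <- read.csv(out1, row.names = 1)

# Route 2: write.table leaves the row-name column out of the header
out2 <- tempfile(fileext = ".txt")
write.table(a, out2, sep = ",", quote = FALSE)
header2 <- readLines(out2)[1]
a3 <- read.csv(out2)

print(header2)
print(all.equal(a, a2))
print(all.equal(a, a3))
```

Route 1 is usually the safer choice if the file will also be opened in a spreadsheet, since the header then lines up with the data columns.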
Hope that helps.
Dear friends:
I am trying to read data in which a single Excel cell contains multiple lines into an R data frame. Ideally, I want to keep those multiple lines in a single cell of the data frame, with some separator between the lines, such as | or ;.
How could I do that?
The resulting file should be like this:
Patient Segment(s)_____________Sponsor(s)__PrimaryDrugs_Other Drugs
"Comorbid...| Diabetic...|Hypertensive..."__NIDDK__celecoxib__.....
Many thanks!
It may depend on how you access this data. If you export it to a CSV file, the CR-LFs in the cell may break the lines, so that you would need to read them in with readLines() and then reassemble them with paste(). On the other hand, if you use a package that is designed to read individual cells, the line breaks may get incorporated into single elements. You should display the CSV output, or explain how you planned on accessing the XLS file and post a portion of it somewhere that people can get to.
On a Mac it requires ctl-opt-enter to put a cr-lf into a cell. If it is there, exporting produces a result that looks like this in a text editor
"there is
a test of
alt-ctl-enter
"
and which then looks like this with read.table:
read.table("~/test.csv", header=FALSE)
V1
1 there is \na test of \nalt-ctl-enter\n
#plus a harmless warning about an incomplete line.
So it is a single character element in a vector. To replace the "\n" (which is how the line break appears in R strings) with "|" (pipes), use gsub:
dat <- read.table("~/test.csv", header=FALSE)
gsub("\n", "|", dat$V1)
# [1] "there is |a test of |alt-ctl-enter|"
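For a version that can be run without the exported file, the sketch below writes a small quoted multi-line cell to a temporary file (the contents are made up to match the example above, minus the trailing blank line) and applies the same gsub step. It assumes the export quotes the cell, which is what lets read.csv keep it as one element.

```r
# Made-up CSV with one quoted cell that spans three lines
f <- tempfile(fileext = ".csv")
writeLines(c("\"there is", "a test of", "alt-ctl-enter\""), f)

# The quoted cell is read as a single string with embedded "\n"
dat <- read.csv(f, header = FALSE, stringsAsFactors = FALSE)

# Replace each embedded newline with a pipe
joined <- gsub("\n", "|", dat$V1, fixed = TRUE)
print(joined)
```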