Persistent "\r" string present while using read.csv or fread - R

I've been searching for similar problems but I can't find anything helpful.
I'm trying to open a portion of a big csv file with
#choosing a certain number of variables from more than 250 available in the file
resources<-c("P13_2_1","P13_3_1","P13_2_2",...)
v <- fread("file.csv", select = resources, header = TRUE, encoding = "UTF-8")
After the file is opened, wherever there should be NA there are blank cells. However, when I try to see what's in any of the blank cells, I see this:
v$P13_2_1[2]
[1] "\r"
Similarly, the header of every column seems fine in the RStudio viewer, but when I print them in the console, the same \r is attached.
The problem is present using both read.csv and fread, and I've tried modifying the quote and na.strings arguments.
I would like to get rid of the "\r" and possibly substitute it with NA.
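No accepted fix appears in this thread, but one possible cleanup sketch (assuming the stray "\r" comes from Windows CRLF line endings that weren't stripped) is to remove it after reading and turn the resulting empty strings into NA:
library(data.table)

v <- fread("file.csv", select = resources, header = TRUE, encoding = "UTF-8")

# Drop a trailing carriage return from every character column,
# then convert the resulting empty strings to NA
char_cols <- names(v)[sapply(v, is.character)]
v[, (char_cols) := lapply(.SD, function(x) {
  x <- sub("\r$", "", x)
  fifelse(x == "", NA_character_, x)
}), .SDcols = char_cols]

# The header can carry the same "\r", so clean the names too
setnames(v, sub("\r$", "", names(v)))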

Related

R: filename list result not recognized for actually reading the file (filename character encoding problem)

I get .xlsx files from various sources to read and analyse in R, and this works great. Files are big, 10+ MB. So far, readxl::read_xlsx was the only solution that worked; xlsx::read.xlsx produced only error messages: Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.OutOfMemoryError: GC overhead limit exceeded
Problem: some files have non-standard letters in the filename, e.g. displayed in Windows 10/explorer as '...ü...xlsx' (the character 'ü' somewhere in the filename). When I read all filenames in the folder in R, I get '...u"...xlsx'. I check for duplicates of the filenames from different folders before I actually read the files. However, when it comes to reading the above file, I get an error message '... file does not exist', no matter if I use
the path/filename character variable directly obtained from list.files (showing '...u"...xlsx')
the string constant '...u"...xlsx'
the string constant '...ü...xlsx'
As far as I understand, the problem arises from equivalent, yet not identical, Unicode compositions. I have no influence on how these characters are originally encoded. Therefore I see no way to read the file, other than (so far manually) renaming the file in Windows explorer, changing an 'ü' coded as 'u+"' to 'ü'.
Questions:
is there a workaround within R? (keep in mind the requirement to use read_xlsx, unless a yet unknown package works with huge files.)
if not possible within R, what would be the best option to change filenames automatically ('u+"' to 'ü')? I need to keep the 'ü' (or 'ä', 'ö', and others) in order to connect the analysis results back to the input, preferably without additional (non-standard) software (e.g. a command shell).
EDIT:
To read the list of files, dir_ls works (as suggested), but it returns an even stranger filename: 'ö' instead of 'ö', which in turn cannot be read (found) by read_xlsx either.
Try using the fs library. My workflow looks something like this:
library(tidyverse)
library(lubridate)
library(fs)
library(readxl)

directory_to_read <- getwd()

file_names_to_read <- dir_ls(path = directory_to_read,
                             recurse = FALSE,       # set this to TRUE to read all subdirectories
                             glob = "*.xls*",
                             ignore.case = TRUE) %>% # ignore upper/lower case extensions
  # Weed out Excel temp files - I constantly have this problem
  str_subset(string = .,
             regex(pattern = "\\/~\\$", ignore_case = TRUE), # use \\ before $ or it will not work
             negate = TRUE)                         # TRUE returns the non-matching names

map(file_names_to_read[4], read_excel)
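Separately, a possible workaround for the decomposed-umlaut problem itself (a sketch, assuming the mismatch is between NFC and NFD Unicode normalization; the stringi normalization and candidate-matching logic are my additions, not part of the answer above): normalize the listed names and open whichever form actually exists on disk.
library(stringi)
library(readxl)

f <- file_names_to_read[1]                  # a name as returned by dir_ls()/list.files()
candidates <- unique(c(f,
                       stri_trans_nfc(f),   # precomposed form: 'ü' as one code point
                       stri_trans_nfd(f)))  # decomposed form: 'u' + combining diaeresis

# Use whichever variant the file system actually recognizes
path_on_disk <- candidates[file.exists(candidates)][1]
data <- read_xlsx(path_on_disk)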

Different number of lines when loading a file into R

I have a .txt file with one column consisting of 1040 lines (including a header). However, when loading it into R using the read.table() command, it's showing 1044 lines (including a header).
The snippet of the file looks like
L*H
no
H*L
no
no
no
H*L
no
Might it be an issue with R?
When opened in Excel, it doesn't show any errors either.
EDIT
The problem was that R read a line like L + H* as three separate lines: L, +, and H*.
I used
table <- read.table(file.choose(), header=T, encoding="UTF-8", quote="\n")
You can try readLines() to see how many lines there are in your file, and feel free to use read.csv() to import it again to see if it gives the expected result. Sometimes a file is parsed differently due to an extra quote, an extra return, or potentially other things.
Possible import steps:
1. Look at your data with a text editor or readLines() to figure out the delimiter and file type.
2. Determine an import method (type read and press Tab; you will see the functions available for import. Also check out readr.)
3. Customize your arguments, for example whether you have a header or not, or whether you want to skip the first n lines.
4. Look at the data again in R with View(head(data)) or View(tail(data)), and decide whether you need to repeat steps 2-4.
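For example, a quick check along those lines (the file name is a placeholder):
raw_lines <- readLines("my_file.txt")   # placeholder file name
length(raw_lines)                       # expect 1040, header included

tab <- read.table("my_file.txt", sep = "\n", header = TRUE)
nrow(tab)                               # expect 1039 data rows if parsing agrees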
Based on the data you have provided, try using sep = "\n". By using sep = "\n" we ensure that each line is read as a single column value. Additionally, quote does not need to be used at all. There is no header in your example data, so I would remove that argument as well.
All that said, the following code should get the job done.
table <- read.table(file.choose(), sep = "\n")

Headers changing when reading data from csv or tsv in R

I'm trying to read a data file into R, but every time I do, R changes the headers. I can't see any way to control this in the documentation for the read functions.
I have the same data saved as both a csv and a tsv but get the same problem with both.
The headers in the data file look like this when I open it in excel or in the console:
cod name_mun age_class 1985 1985M 1985F 1986 1986M 1986F
But when I read it into R using either read.csv('my_data.csv') or read.delim('my_data.tsv') R changes the headers to this:
> colnames(my_data)
[1] "ï..cod" "name_mun" "age_class" "X1985" "X1985M" "X1985F" "X1986"
[8] "X1986M" "X1986F"
Why does R do this and how can I prevent it from happening?
You are seeing two different things here.
The "ï.." on the first column comes from having a byte order mark at the beginning of your file. Depending on how you created the file, you may be able to save as just ASCII or even just UTF-8 without a BOM to get rid of that.
R does not like to have variable names that begin with a digit. If you look at the help page ?make.names you will see
A syntactically valid name consists of letters, numbers and the dot or
underline characters and starts with a letter or the dot not followed
by a number. Names such as ".2way" are not valid, and neither are the
reserved words.
You can get around that when you read in your data by using the check.names argument to read.csv possibly like this.
my_data = read.csv(file.choose(), check.names = FALSE)
That will keep the column names as numbers. It will also change the "ï.." to the full BOM "ï»¿" at the start of the first name, so you still need to deal with the byte order mark itself.
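A combined sketch (fileEncoding = "UTF-8-BOM" is my addition, not part of the answer above): that encoding value tells read.csv to strip the byte order mark, while check.names = FALSE keeps the numeric headers.
my_data <- read.csv("my_data.csv",
                    fileEncoding = "UTF-8-BOM", # strips the BOM, removing the "ï.." prefix
                    check.names  = FALSE)       # keeps "1985" instead of "X1985"
colnames(my_data)
#> [1] "cod" "name_mun" "age_class" "1985" "1985M" "1985F" "1986" "1986M" "1986F"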

Loading csv into R with `sep=,` as the first line

The program I am exporting my data from (PowerBI) saves the data as a .csv file, but the first line of the file is sep=, and then the second line of the file has the header (column names).
Sample fake .csv file:
sep=,
Initiative,Actual to Estimate (revised),Hours Logged,Revised Estimate,InitiativeType,Client
FakeInitiative1 ,35 %,320.08,911,Platform,FakeClient1
FakeInitiative2,40 %,161.50,400,Platform,FakeClient2
I'm using this command to read the file:
initData <- read.csv("initData.csv",
                     row.names = NULL,
                     header = T,
                     stringsAsFactors = F)
but I keep getting an error that there are the wrong number of columns (because it thinks the first line tells it the number of columns).
If I do header=F instead then it loads, but then when I do names(initData) <- initData[2,] then the names have spaces and illegal characters and it breaks the rest of my program. Obnoxious.
Does anyone know how to tell R to ignore that first line? I can go into the .csv file in a text editor and just delete the first line manually before I load it each time (if I do that, everything works fine) but I have to export a bunch of files and this is a bit stupid and tedious.
Any help would be much appreciated.
There are many ways to do that. Here's one:
all_content = readLines("initData.csv")
skip_first_line = all_content[-1]
initData <- read.csv(textConnection(skip_first_line),
                     row.names = NULL,
                     header = T,
                     stringsAsFactors = F)
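A simpler alternative sketch, if you prefer a one-step read: skip = 1 jumps over the sep=, line, so read.csv takes its header from the second line.
initData <- read.csv("initData.csv",
                     skip = 1,                # skip the "sep=," line
                     header = TRUE,
                     stringsAsFactors = FALSE)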
Your file could be in a UTF-16 encoding. See hrbrmstr's answer on how to read a UTF-16 file; a sketch of the idea:
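A minimal sketch, assuming little-endian UTF-16 (check your file's BOM and adjust the encoding name accordingly):
initData <- read.csv("initData.csv",
                     fileEncoding = "UTF-16LE", # assumption: verify against the file's BOM
                     skip = 1,                  # still skip the "sep=," line
                     stringsAsFactors = FALSE)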

Inconsistency between 'read.csv' and 'write.csv' in R

The R function read.csv works as follows, as stated in the manual: "If there is a header and the first row contains one fewer field than the number of columns, the first column in the input is used for the row names." That's good. However, when it comes to the function write.csv, I cannot find a way to write the csv file in a similar way. So, if I have a file.txt as below:
Column_1,Column_2
Row_1,2,3
Row_2,4,5
Then when I read it using a = read.csv('file.txt'), the row and column names are Row_x and Column_x as expected. However, when I write the matrix a to a csv file again, what I get as a result from write.csv(a, 'file2.txt', quote=F) is as below:
,Column_1,Column_2
Row_1,2,3
Row_2,4,5
So, there is a comma at the beginning of this file. And if I read this file again using a2 = read.csv('file2.txt'), the resulting a2 will not be the same as the previous matrix a: the row names of a2 will not be Row_x. That is, I do not want a comma at the beginning of the file. How can I get rid of this comma while using write.csv?
The two functions that you have mentioned, read.csv and write.csv, are just specific forms of the more generic functions read.table and write.table.
When I copy your example data into a .csv and try to read it with read.csv, R throws a warning saying that the header line was incomplete, so it resorts to special behaviour to fix the error. Because the file was incomplete, R completed it by adding an empty element at the top left. R understands that this is a header row, so the data appears okay in R; but when you write to a csv, write.csv doesn't distinguish header from data, so the empty element that R created in the header row shows up as a regular element, which is what you would expect. Basically, R made your table 3x3 because it can't have rows with differing numbers of elements.
You want the extra comma there, because it allows programs to read the column names in the right place. To read the file in again, assuming foo_out.csv is the file you wrote, fix the wonky row names by adding an option specifying which column holds the row names (row.names = your_column_number) when you read it back in with the comma correctly in place.
y <- read.csv(file = "foo.csv") # this throws a warning because your input is incomplete
write.csv(y, "foo_out.csv")
x <- read.csv(file = "foo_out.csv", header = TRUE, row.names = 1) # reads the first column as the row names
Play around with read/write.csv, but it might be worthwhile to move to the more generic functions read.table and write.table. They offer expanded functionality.
To read a csv with the generic function:
y <- read.table(file = "foo.csv", sep = ",", header = TRUE)
thus you can specify the delimiter and easily read in tab-delimited files (sep = "\t", e.g. spreadsheet exports) or space-delimited files (sep = " ").
Hope that helps.
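That said, if you really do want to drop the leading comma, a sketch using write.table can reproduce the original layout (header row one field shorter than the data rows), which read.csv then interprets as row names:
a <- read.csv("file.txt")   # Row_x / Column_x, as in the question

# col.names = TRUE combined with row.names = TRUE writes a header with one
# fewer field than the data rows, i.e. no leading comma
write.table(a, "file2.txt", sep = ",", quote = FALSE,
            col.names = TRUE, row.names = TRUE)

a2 <- read.csv("file2.txt") # round-trips with Row_x as row names again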