R: read.csv introduced unreadable characters in one column name [duplicate]

I have a text file with Byte order mark (U+FEFF) at the beginning. I am trying to read the file in R. Is it possible to avoid the Byte order mark?
The function fread (from the data.table package) reads the file, but adds ļ»æ at the beginning of the first variable name:
> names(frame_pers)[1]
[1] "ļ»æreg_date"
The same happens with the read.csv function.
For now I have written a function that removes the BOM from the first column name, but I believe there should be a way to strip the BOM automatically.
remove.BOM <- function(x) setnames(x, 1, substring(names(x)[1], 4))
> names(frame_pers)[1]
[1] "ļ»æreg_date"
> remove.BOM(frame_pers)
> names(frame_pers)[1]
[1] "reg_date"
I am using the native encoding for the R session:
> options("encoding" = "")
> options("encoding")
$encoding
[1] ""

Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")? ?file says:
As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted and will remove
a Byte Order Mark if present (which it often is for files and webpages
generated by Microsoft applications).
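A minimal sketch of that suggestion, in case it helps. The temporary file and its contents are made up for illustration; only the column name reg_date comes from the question. It writes a small CSV that starts with the UTF-8 byte order mark (bytes EF BB BF) and reads it back with fileEncoding = "UTF-8-BOM":

```r
# Create a throwaway CSV that begins with a UTF-8 BOM (illustrative data only).
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)                  # the BOM itself
writeBin(charToRaw("reg_date,value\n2015-01-01,1\n"), con)  # the CSV body
close(con)

# fileEncoding (not encoding) tells the underlying file() connection
# to drop the BOM before the header line is parsed.
frame_pers <- read.csv(path, fileEncoding = "UTF-8-BOM")
names(frame_pers)[1]
# [1] "reg_date"
```

Note that the argument is fileEncoding, which is passed to the connection; encoding only marks the strings after they have been read.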

This was handled between versions 1.9.6 and 1.9.8 with this commit; update your data.table installation to fix this.
Once done, you can just use fread:
fread("file_name.csv")

I know it's been 8 years, but I just had this problem and came across this question, so this might help. An important detail (mentioned by hadley above) is that the argument must be fileEncoding = "UTF-8-BOM", not just encoding = "UTF-8-BOM". The encoding argument works for a few values, but not for UTF-8-BOM. Go figure. I found this out here: https://www.johndcook.com/blog/2019/09/07/excel-r-bom/

Related

str_detect producing vector related errors in R code (which previously worked) since update 1.5.0

I'm trying to do some simple str_detects as follows:
index1 <- str_detect(colnames(DataFrame), paste0("^", name_))
Here name_ is just a character string, so paste0("^", name_) has length 1.
The call yields the following error:
Error in stop_vctrs(): ! Input must be a vector, not an environment.
When I check rlang::last_error() I get:
`Backtrace:
stringr::str_detect(colnames(DataFrame), paste0("^", name_))
vctrs:::stop_scalar_type(<fn>(<env>), "")
vctrs:::stop_vctrs(msg, "vctrs_error_scalar_type", actual = x)`
I know that in this instance I could use the base R alternative:
grep(paste0("^", name_), colnames(DataFrame))
but the issue is that I have many long scripts which feature str_detect many times...
I'd like to understand the ways around this new error so that I can best fix all these instances in my code, thank you.
I have read Hadley's post on the stringr 1.5.0 update about the stricter vector definitions implemented in the tidyverse, but my question still stands.
EDIT: Uninstalling and reinstalling R/studio/tools fixed the issue.
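For reference, a small base R sketch of the same check. The data frame and the value of name_ are invented for illustration. grepl returns a logical vector, like str_detect, whereas grep returns the integer positions of the matches, so grepl is the closer drop-in replacement:

```r
# Illustrative data; only the variable names follow the question.
DataFrame <- data.frame(abc_1 = 1, abc_2 = 2, xyz = 3)
name_ <- "abc"

# Logical vector, one element per column name (same shape as str_detect output).
index1 <- grepl(paste0("^", name_), colnames(DataFrame))
index1
# [1]  TRUE  TRUE FALSE

# grep gives the matching positions instead.
grep(paste0("^", name_), colnames(DataFrame))
# [1] 1 2
```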

Is it possible to use unicode column name in data.frame/data.table in certain locale?

I need to create a data.table in which some column names contain Unicode symbols, for consistency (the names should match another source that uses them).
In some locales there is no problem, but in others users run into trouble: it turned out that my code does not work in certain locales.
Sys.setlocale("LC_CTYPE", "English_United States.1252")
> df <- data.frame("\u0394AICc" = 2)
Warning message:
unable to translate '<U+0394>AICc' to native encoding
> dt <- data.table(mtcars)
> dt[, "\u0394AICc" := "test"]
> dt
...
Warning message:
In do.call("cbind", lapply(x, function(col, ...) { :
unable to translate '<U+0394>AICc' to native encoding
# interestingly, the string prints normally in the console;
# it only causes problems as a data.frame column name.
> cat("\u0394AICc")
ΔAICc
I searched around, and most of the information I found is about file or string encoding, not about data.frame column names.
There was an issue filed against data.table for this; the behavior also depends on R, the OS and the locale. Since data.frame cannot handle it either, I don't think it is a data.table problem.
The problem goes away in some other locales, so I'm hoping there is a way to fix this without having to remove all the Unicode symbols.

Issue with double quotes and fread function

I have some column entries that look like this:
c("This is just a \"shame\"...") # a character string; the inner double quotes must be escaped in R
THIS WILL WRITE A FILE ON YOUR C:\ DRIVE:
sample.data <- data.frame(case1=c("This is just a 'shame'..."),
                          case2="This is just a shame") # here I could not work out how to insert the double quotes
write.csv(sample.data, file="C:/sample_data.csv")
require(data.table)
test.fread <- fread("C:/sample_data.csv")
test.read.csv <- read.csv("C:/sample_data.csv")
If I read the csv data with the fread function (from data.table), I get this error:
Bumped column 79 to type character on data row 12681, field contains '
a.n."'. Coercing previously read values in this column from logical,
integer or numeric back to character which may not be lossless; e.g., if
'00' and '000' occurred before they will now be just '0', and there
may be inconsistencies with treatment of ',,' and ',NA,' too (if they
occurred in this column before the bump). If this matters please rerun
and set 'colClasses' to 'character' for this column. Please note that column
type detection uses the first 5 rows, the middle 5 rows and the
last 5 rows, so hopefully this message should be very rare.
If reporting to datatable-help, please rerun and include
the output from verbose=TRUE.
If I use read.csv, no error occurs and the entries are read in correctly!
Question 1: How can I remove the double quotes inside the character entries?
Question 2: Why does read.csv read the entries correctly while fread fails?
As @Arun kindly suggested, the data.table development version 1.9.5, currently on GitHub, may help here.
To install it, follow this procedure (Rtools required):
# To install development version
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
I have tested it and can confirm that the newest version of data.table handles the double quotes without problems.
For further details and updates, see the data.table repository on GitHub.
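On the reproducible example in the question: the embedded double quotes can be written in R by escaping them with a backslash. The sketch below is a variant of that example, not the asker's exact data. It uses a temporary file instead of C:/ and sticks to base R, so it says nothing about how any particular fread version behaves. It shows that write.csv doubles the embedded quotes on disk and that read.csv restores them:

```r
# Same shape as the question's sample.data, but with real embedded double quotes.
sample.data <- data.frame(case1 = "This is just a \"shame\"...",
                          case2 = "This is just a shame",
                          stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")   # temporary file rather than C:/
write.csv(sample.data, file = path, row.names = FALSE)

# Inspect the raw lines: write.csv writes each embedded quote as "".
raw.lines <- readLines(path)

# read.csv undoes the doubling, so the value survives the round trip.
round.trip <- read.csv(path, stringsAsFactors = FALSE)
identical(round.trip$case1, sample.data$case1)
# [1] TRUE
```

If the quotes really are unwanted, gsub("\"", "", x, fixed = TRUE) removes them from a character vector x after reading.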

read.csv replaces column-name characters like `?` with `.`, `-` with `...`

I'm using RStudio and my output on the Console gets truncated. I can't find how to stop the truncation (I tried searching ?options as well as googling around for longer than I'd like to admit).
EDIT: My apologies everyone! I originally gave the long name as 'ThisIsAReallyReallyReallyReallyReallyLongName', but the issue only came up with the long name 'Translation Service Info - Which translation service?'. I think I found the issue. The ... wasn't truncation; read.csv was replacing characters it does not allow in names, such as spaces, ? and -, with ., so " - " became ....
Code
# Load File
myfile <- read.csv(file="C:\\Users\\wliu\\Desktop\\myfile.csv",
sep=",", header=TRUE, stringsAsFactors=FALSE, skip=2)
# Get my column names
mycolnames <- colnames(myfile)
# When I request a shorter name, this returns the full name
mycolnames[1] # Assuming first col is a short name
[1] "ThisIsAShortName"
# However, when I request a longer name, this returns a truncated version
mycolnames[2] # Assuming second col is a really long name
[1] "ThisIsA...Long...Name"
I want to get back the non-truncated version of mycolnames[2] (e.g. "ThisIsAReallyReallyReallyReallyReallyLongName")
Setup
I'm on Windows 7 64bit, RStudio Version 0.98.1091, R version 3.0.1 (2013-05-16) -- "Good Sport" with Platform: x86_64-w64-mingw32/x64 (64-bit). I tried with 'Use Git Bash as shell for Git projects' on and off.
myfile.csv
ThisIsAShortName, Translation Service Info - Which translation service?
23143505, Yes
23143614, No
23143324, Yes
This is expected behavior of read.csv, not a truncation problem in R. When the column names in a file contain spaces or special characters, read.csv replaces each of them with a . unless you specify check.names = FALSE.
Here's a glimpse at make.names, which is how read.table produces the column names.
nm <- "Translation Service Info - Which translation service?"
make.names(nm)
# [1] "Translation.Service.Info...Which.translation.service."
And here's the relevant line from read.table
if (check.names)
col.names <- make.names(col.names, unique = TRUE)
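A short sketch of the check.names = FALSE route, using inline text in place of myfile.csv (the rows are taken from the question, with the space after the comma dropped so the name has no leading blank):

```r
# Inline stand-in for myfile.csv.
csv_text <- "ThisIsAShortName,Translation Service Info - Which translation service?
23143505,Yes
23143614,No
23143324,Yes"

# check.names = FALSE skips make.names(), so the header is kept verbatim.
myfile <- read.csv(text = csv_text, check.names = FALSE,
                   stringsAsFactors = FALSE)
colnames(myfile)[2]
# [1] "Translation Service Info - Which translation service?"

# Non-syntactic names need [[ ]] or backticks when accessed.
myfile[["Translation Service Info - Which translation service?"]]
# [1] "Yes" "No"  "Yes"
```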
In RStudio, use the menu:
Tools > Global Options > Code > Display
In the Console section, set Limit length of lines displayed in console to: to some number much larger than it currently is set to.
Click OK.
How about:
options(width=300)
Does that solve the issue?

Using comparison operators on packageVersion

I'm writing some code for an R package that needs to compare the current package version to a record of the package version when it created a file. Basically, if the file was created by an old version of the package, it needs to check for changes to the file format. What I've got now is:
file_version<-get_file_version() # evals to a numeric like 100 or 302
pkg_version<-as.numeric(paste0(unlist(packageVersion('my.package')), collapse=""))
if (file_version < pkg_version) upgrade_fileformat() # checks and performs updates
So, this will work, but it seems like there should be a cleaner way to do this without the function sandwich to assign to the pkg_version variable. Does anyone know a nicer way to do the comparison I've described?
Instead of recording a number, record packageVersion("my.package") itself, and when reading it back, coerce it with package_version() so that the comparison operators work as expected.
ver <- as.numeric_version("3.10.1")
ver < "3.11"
# [1] TRUE
ver > "3"
# [1] TRUE
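A sketch of why this matters, with two hypothetical helpers (collapse_version mimics the paste0 approach from the question, needs_upgrade is one way to wrap the suggested comparison; neither is part of any package). Collapsing the components into one number misorders versions once a component reaches two digits, whereas package_version compares component by component:

```r
# Hypothetical helper reproducing the question's numeric encoding.
collapse_version <- function(v) {
  as.numeric(paste0(unlist(package_version(v)), collapse = ""))
}

# "1.10.0" collapses to 1100 and "2.0.0" to 200, so the order is wrong.
collapse_version("1.10.0") < collapse_version("2.0.0")
# [1] FALSE

# Hypothetical helper: store the version as a string, compare as versions.
needs_upgrade <- function(stored, current) {
  package_version(stored) < package_version(current)
}

needs_upgrade("1.10.0", "2.0.0")
# [1] TRUE
```

In a package, the stored string could come from as.character(packageVersion("my.package")) at the time the file is written.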
