read.csv replaces column-name characters like `?` with `.`, `-` with `...`

I'm using RStudio and my output on the Console gets truncated. I can't find how to stop the truncation (I tried searching ?options as well as googling around for longer than I'd like to admit).
EDIT: My apologies everyone! I originally had the long name as 'ThisIsAReallyReallyReallyReallyReallyLongName', but the issue only came up with the long name 'Translation Service Info - Which translation service?'. I think I found the issue: the ... wasn't truncation. read.csv was replacing characters that are not allowed in R names, such as spaces, ? and -, with a . each, so the " - " in the middle of the name came out as ....
Code
# Load File
myfile <- read.csv(file="C:\\Users\\wliu\\Desktop\\myfile.csv",
sep=",", header=TRUE, stringsAsFactors=FALSE, skip=2)
# Get my column names
mycolnames <- colnames(myfile)
# When I request a shorter name, this returns the full name
mycolnames[1] # Assuming first col is a short name
[1] "ThisIsAShortName"
# However, when I request a longer name, this returns a truncated version
mycolnames[2] # Assuming second col is a really long name
[1] "ThisIsA...Long...Name"
I want to get back the non-truncated version of mycolnames[2] (e.g. "ThisIsAReallyReallyReallyReallyReallyLongName")
Setup
I'm on Windows 7 64bit, RStudio Version 0.98.1091, R version 3.0.1 (2013-05-16) -- "Good Sport" with Platform: x86_64-w64-mingw32/x64 (64-bit). I tried with 'Use Git Bash as shell for Git projects' on and off.
myfile.csv
ThisIsAShortName, Translation Service Info - Which translation service?
23143505, Yes
23143614, No
23143324, Yes

This is expected behavior of read.csv, not a truncation problem in R. When the column names in a file contain spaces or special characters, read.csv replaces each such character with a . unless you specify check.names = FALSE.
Here's a glimpse at make.names, which is how read.table produces the column names.
nm <- "Translation Service Info - Which translation service?"
make.names(nm)
# [1] "Translation.Service.Info...Which.translation.service."
And here's the relevant line from read.table
if (check.names)
col.names <- make.names(col.names, unique = TRUE)
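To keep the original header text, the check.names = FALSE option mentioned above can be tried on a small inline copy of the file. This is a minimal sketch: csv_text, mangled and kept are illustrative names, the sample drops the space after the comma that the real file has, and the data is passed through the text argument instead of being read from disk.

```r
# Inline stand-in for myfile.csv (hypothetical sample, no space after the comma)
csv_text <- paste(
  "ThisIsAShortName,Translation Service Info - Which translation service?",
  "23143505,Yes",
  "23143614,No",
  sep = "\n"
)

# Default behaviour: header names are passed through make.names()
mangled <- read.csv(text = csv_text, stringsAsFactors = FALSE)
names(mangled)[2]

# check.names = FALSE keeps the header exactly as written in the file
kept <- read.csv(text = csv_text, stringsAsFactors = FALSE, check.names = FALSE)
names(kept)[2]

# Non-syntactic names need [[ ]] or backticks when the column is used
kept[["Translation Service Info - Which translation service?"]]
```

The trade-off is that a name with spaces can no longer be used as kept$name without backticks, which is why read.csv sanitises names by default.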

In RStudio, use the menu:
Tools > Global Options > Code > Display
In the Console section, set "Limit length of lines displayed in console to:" to a number much larger than its current value.
Click OK.

How about:
options(width=300)
Does that solve the issue?

Related

R: read.csv introduced unreadable characters in one column name [duplicate]

I have a text file with Byte order mark (U+FEFF) at the beginning. I am trying to read the file in R. Is it possible to avoid the Byte order mark?
The function fread (from the data.table package) reads the file, but adds ļ»æ at the beginning of the first variable name:
> names(frame_pers)[1]
[1] "ļ»æreg_date"
The same is with read.csv function.
For now I have written a function that removes the BOM from the first column name, but I believe there should be a way to strip the BOM automatically.
remove.BOM <- function(x) setnames(x, 1, substring(names(x)[1], 4))
> names(frame_pers)[1]
[1] "ļ»æreg_date"
> remove.BOM(frame_pers)
> names(frame_pers)[1]
[1] "reg_date"
I am using the native encoding for the R session:
> options("encoding" = "")
> options("encoding")
$encoding
[1] ""

Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")? ?file says:
As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted and will remove
a Byte Order Mark if present (which it often is for files and webpages
generated by Microsoft applications).
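A small self-contained check of that suggestion is sketched below. The temporary file, its contents, and the names tmp and clean are made up for illustration. The sketch writes the three BOM bytes by hand so that it does not depend on any particular editor or spreadsheet program.

```r
# Write a small CSV that starts with the UTF-8 byte order mark (EF BB BF)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("reg_date,value\n2015-01-01,1\n"), con)
close(con)

# fileEncoding = "UTF-8-BOM" removes the mark while the file is read
clean <- read.csv(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
names(clean)[1]

unlink(tmp)
```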

This was handled between versions 1.9.6 and 1.9.8 with this commit; update your data.table installation to fix this.
Once done, you can just use fread:
fread("file_name.csv")

I know it's been 8 years, but I just had this problem and came across this question, so this might help. An important detail (mentioned by hadley above) is that the argument needs to be fileEncoding = "UTF-8-BOM", not just encoding = "UTF-8-BOM". The encoding argument works for a few values but not UTF-8-BOM. Go figure. Found this out here: https://www.johndcook.com/blog/2019/09/07/excel-r-bom/

R: Encoding of labelled data and knit to html problems

First of all, sorry for not providing a reproducible example and posting images, a word of explanation why I did it is at the end.
I'd really appreciate some help, comments or otherwise. I have done my best to be as specific and concise as I can.
The problem I'm trying to solve is how (and where) to set the encoding so that Polish letters come out correctly after an .Rmd document is knitted to HTML.
I'm working with a labelled SPSS file imported into R via the haven library, and I use sjPlot tools to make tables and graphs.
I have already spent almost all day trying to sort this out, but I feel stuck with no idea where to go.
My sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250
Whenever I run (via console / script)
sjt.frq(df$sex, encoding = "Windows-1250")
I get a nice table with proper encoding in the rstudio viewer pane:
Trying with no encoding sjt.frq(df$sex) gives this:
I could live with setting the encoding each time sjt.frq is called, but the problem is that no matter how I set up sjt.frq inside a markdown document, it always gets knitted the wrong way.
Running the chunk inside the .Rmd is OK (for a completely unknown reason encoding = "UTF-8" worked here as well, although it didn't previously):
Knitting same document, not OK:
(note, that html header has all the polish characters)
Also, it looks like it could be specific to either HTML or sjPlot, because knitr can print Polish letters when they are in a vector and are passed as if they were printed to the console:
Is there anything I can set up / change in order to make this work?
While testing different options I discovered, that manually converting sex variable to factor and assigning labels again, works and Rstudio knits to html with proper encoding
df$sex <- factor(df$sex, label = c("kobieta", "mężczyzna"))
sjt.frq(df$sex, encoding = "Windows-1250")
Regarding no reproducible example:
I tried to simulate this example with fake data:
# Get libraries
library(sjPlot)
library(sjlabelled)
x <- rep(1:4, 4)
x<- set_labels(x, labels = c("ąę", "ćŁ", "óŚŚ", "abcd"))
# Run freq table similar to df$sex above
sjt.frq(x)
sjt.frq(x, encoding = "UTF-8")
sjt.frq(x, encoding = "Windows-1250")
Thing is, each sjt.frq call knits the way it should (although only encoding = "Windows-1250" renders properly in the RStudio viewer pane).

If you run sjt.frq(), a complete HTML-page is returned, which is displayed in a viewer.
However, for use inside markdown/knitr documents, only parts of the HTML output are required: you don't need the <head> part, for instance, because the knitr document creates its own header for the HTML page. Thus, there is a separate print() method for knitr documents, which uses a different return value to include in the knitr file.
Compare:
dummy <- sjt.frq(df$sex, encoding = "Windows-1250")
dummy$output.complete # used for default display in viewer
dummy$knitr # used in knitr-documents
Since the encoding is located in the <meta>-tag, which is not included in the $knitr-value, the encoding-argument in sjt.frq() has no effect on knitr-documents.
I think that this might help you: rmarkdown::render_site(encoding = 'UTF-8'). Maybe there are also other options to encode text, or you need to modify the final HTML-file, changing the charset encoding there.

Issue with double quotes and fread function

I have some column entries that look like this:
c("This is just a \"shame\"...") # since it's a character
THIS WILL WRITE A FILE ON YOUR C:\ DRIVE:
sample.data <- data.frame(case1=c("This is just a 'shame'..."),
case2="This is just a shame") # here I could not make it to insert the double quotes
write.csv(sample.data, file="C:/sample_data.csv")
require(data.table)
test.fread <- fread("C:/sample_data.csv")
test.read.csv <- read.csv("C:/sample_data.csv")
If I read the csv data with the fread function (from data.table), I get this error:
Bumped column 79 to type character on data row 12681, field contains '
a.n."'. Coercing previously read values in this column from logical,
integer or numeric back to character which may not be lossless; e.g., if
'00' and '000' occurred before they will now be just '0', and there
may be inconsistencies with treatment of ',,' and ',NA,' too (if they
occurred in this column before the bump). If this matters please rerun
and set 'colClasses' to 'character' for this column. Please note that column
type detection uses the first 5 rows, the middle 5 rows and the
last 5 rows, so hopefully this message should be very rare.
If reporting to datatable-help, please rerun and include
the output from verbose=TRUE.
If I use read.csv no error occurs and the entries are read in correctly!
Question 1: How can I remove the double quotes inside the character strings?
Question 2: Why does read.csv read the entries correctly while fread fails?

As @Arun kindly suggested, the data.table development version 1.9.5, currently on GitHub, may be of help here.
To install please follow this procedure (Rtools required):
# To install development version
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
I have tested it and can confirm that the newest version of data.table handles the double quotes without problems.
For further details and updates, see the data.table repository on GitHub.
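For anyone who wants to reproduce the sample data with real double quotes, here is a minimal base-R sketch. It assumes nothing about fread and uses only write.csv and read.csv. The names tmp, round_trip and no_quotes are illustrative, and a temporary file stands in for the C:/ path used in the question.

```r
# Escape embedded double quotes with a backslash inside an R string literal
sample.data <- data.frame(case1 = "This is just a \"shame\"...",
                          case2 = "This is just a shame",
                          stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv")
write.csv(sample.data, file = tmp, row.names = FALSE)

# write.csv doubles embedded quotes; read.csv turns them back into single ones
round_trip <- read.csv(tmp, stringsAsFactors = FALSE)

# Question 1: drop the double quotes from the values after reading
no_quotes <- gsub("\"", "", round_trip$case1, fixed = TRUE)

unlink(tmp)
```

Whether fread reads the same file cleanly depends on the installed data.table version, as discussed above.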

Unicode character with subscript

I want to add a Unicode character which has two letters as subscripts to my plot legend in R.
The character is an r with an accent breve (ř) and the two letters are i and j.
I already looked at this question: Unicode character with superscript and tried to adapt the answers to my problem.
Here is what I tried:
plot(1,pch=NA,ylab="",xlab="",axes=F)
legend("top",legend=paste("1-","\u{0159}"),bty ="n",bg = "white",cex=2)
legend("center",legend=paste("1-","\u{0159}","\u{0069}","\u{006A}"),bty="n",bg = "white",cex=2)
legend("bottomleft",legend=expression("1-"*"\u0159"["\u0069"*"\u006A"]),bty="n",bg = "white",cex=2)
legend("bottomright", legend = quote("1-" *"\u0159"["\u0069"*"\u006A"]),bty="n",bg = "white",cex=2)
The resulting plot can be found below
Both the Unicode letter and the subscript work fine by themselves but not together.
paste() with any combination of [ ] does return an error, but I think this is to be expected as paste can't handle [ ] for subscripts.
The FAQ site on CRAN might give a hint as I am using Windows but I am not sure how to implement this:
3.6 I don't see characters with accents at the R console, for example in ?text.
You need to specify a font in Rconsole (see Q5.2) that supports the encoding in use. This used to be a problem in earlier versions of Windows, but now it is hard to find a font which does not.
Support for these characters within Rterm depends on the environment (the terminal window and shell, including locale and codepage settings) within which it is run as well as the font used by the terminal window. Those are usually on legacy DOS settings and need to altered.

It has to do with the system locale, as seen e.g. if you try
intToUtf8(345)
# [1] "ř"
iconv(intToUtf8(345), "utf-8", localeToCharset())
# [1] "r"
This should fix it (I used Czech but other locales will probably work too):
Sys.setlocale("LC_CTYPE", "czech")
# [1] "Czech_Czech Republic.1250"
text(..., labels = quote("\u{0159}"[ij]))

fread segfault with 30GB space separated file with some rows starting with space

I was previously able to load a file of 2.7 million rows by roughly 3000 columns into an R data.table, but now I am getting a segfault. The size of the file is still similar. I'm not sure what is happening, and the error message doesn't help either. The machine has enough memory (~256 GB).
> dosages <- fread(combined_dosage_file, header=F, stringsAsFactors=F)
*** caught segfault ***
address (nil), cause 'unknown'
Traceback:
1: fread(combined_dosage_file, header = F, stringsAsFactors = F)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.8.10
Verbose output operating on a smaller test file with 500K lines
> f <- "../common_input/test.dosage"
> d <- fread(f, header=F, skip=1, stringsAsFactors=F, verbose=T)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 2 to detect sep ('skip' has been supplied) ... sep=' '
Found 3053 columns
First row with 3053 fields occurs on line 2 (either column names or first row of data)
'header' changed by user from 'auto' to FALSE
Count of eol after first data row: 500000
Subtracted 1 for last eol and any trailing empty lines, leaving 499999 data rows
Type codes: 03033222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222
222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 (first 5 rows)
Type codes: 03033222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222
222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 (+middle 5 rows)
Type codes: 03033222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222
222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 (+last 5 rows)
Bumping column 3 from INT to INT64 on data row 193249, field contains 'rs12218882'
Bumping column 3 from INT64 to REAL on data row 193249, field contains 'rs12218882'
Bumping column 3 from REAL to STR on data row 193249, field contains 'rs12218882'
*** caught segfault ***
address (nil), cause 'unknown'
Traceback:
1: fread(f, header = F, skip = 1, stringsAsFactors = F, verbose = T)
Update
I looked at the data file more carefully, and it turns out some random lines had an extra space at the beginning. After removing those I was able to load the whole file in R (pretty amazing compared to a standard data.frame) with 60 GB of RAM. For now I am good to go. It might be worth exploring whether that was the root cause, but it seems the leading space confused fread into seeing a different number of columns. Maybe it could emit a more specific error, if feasible.
-Abhi
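For a quick check of whether a file has this problem, the base-R sketch below flags and trims lines that begin with whitespace. The three sample lines and all variable names are invented for illustration. On a 30 GB file the same idea would more realistically be applied in chunks or with a command-line tool before calling fread.

```r
# Hypothetical sample: the second line starts with a stray space
lines <- c("1 rs1 0.1", " 2 rs2 0.2", "3 rs3 0.3")

# Flag lines that begin with a space or tab
has_leading_space <- grepl("^[ \t]", lines)
which(has_leading_space)

# Strip the leading whitespace, then parse with a single-space separator
cleaned <- sub("^[ \t]+", "", lines)
parsed <- read.table(text = cleaned, header = FALSE, sep = " ",
                     stringsAsFactors = FALSE)
dim(parsed)
```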

Update Sep 2015: As per @Arun's comment on the question above, he fixed this in v1.9.6, on CRAN Sep 2015 (#494).
Previous answer for posterity ...
I've looked through NEWS for v1.8.11 and I can't see any crash fixes for fread. In fact I don't remember any outright crash reports for fread at all, so this is new.
Please pass verbose=TRUE to fread and see if you can reproduce with that on. The output will give me a clue as to where the crash is. It will also report information about the file (rows, columns and types) which will be helpful.
If you start with a fresh session and load the file, does it ever crash?
Btw, iiuc it's going to be between 30GB and 60GB in RAM, depending on the column types (2.7e6 * 3000 * 4|8 / 1024^3). A big file, but as you say you have 256GB of RAM, so yes this is definitely something fread should be fine with. It's possible fread hasn't been tested to this size yet. Datasets of that size mostly tend to be broken down into smaller files and then the idiom is :
rbindlist(lapply(<files>, fread))
But you have what you have : one very big file. So fread should be fine with it. Just explaining why it may not have come up before.
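The size estimate quoted above (2.7e6 * 3000 * 4|8 / 1024^3) can be worked through explicitly. This sketch assumes 4 bytes per value for integer columns and 8 bytes for double columns, and ignores character columns and any per-object overhead. The variable names are illustrative.

```r
n_rows <- 2.7e6
n_cols <- 3000
bytes_per_gib <- 1024^3

# Lower bound: every column stored as 4-byte integers
gib_if_integer <- n_rows * n_cols * 4 / bytes_per_gib

# Upper bound: every column stored as 8-byte doubles
gib_if_double <- n_rows * n_cols * 8 / bytes_per_gib

round(c(integer = gib_if_integer, double = gib_if_double), 1)
```

Both figures fit comfortably in the 256 GB mentioned in the question, and the upper one matches the roughly 60 GB the asker reported using once the file loaded.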