I always use fread from the data.table package to read in large tables, but apparently it does not support reading Unicode files on Windows (Windows 7 Professional, to be precise).
Here is the file I tried to read:
A,B
ą,ž
ū,į
ų,ė
š,ę
It works fine if I read it on Mac OS X, or if I read it with read.csv using the option encoding="UTF-8". Unfortunately, fread does not have this option.
So are there other fast ways to read Unicode tables on Windows, or should I just use another OS? Or am I missing something obvious?
Here is the output of sessionInfo():
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.4
loaded via a namespace (and not attached):
[1] chron_2.3-45 plyr_1.8.1 Rcpp_0.11.5 reshape2_1.4.1 stringr_0.6.2
Update: Pasting the output as requested.
> aa<-fread("F:/R/unicode_test2.csv",verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.000000 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 2 columns. Longest stretch was from line 1 to line 5
Starting data input on line 1 (either column names or first row of data). First 10 characters: Ä„,B
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 5 (including 1 at the end)
Count of sep: 4
nrow = MIN( nsep [4] / ncol [2] -1, neol [5] - nblank [1] ) = 4
Type codes ( first 5 rows): 44
Type codes: 44 (after applying colClasses and integer64)
Type codes: 44 (after applying drop or select (if supplied)
Allocating 2 column slots (2 - 0 dropped)
Read 4 rows. Exactly what was estimated and allocated up front
0.000s ( 0%) Memory map (rerun may be quicker)
0.000s ( 0%) sep and header detection
0.000s ( 0%) Count rows (wc -l)
0.000s ( 0%) Column type detection (first, middle and last 5 rows)
0.000s ( 0%) Allocation of 4x2 result (xMB) in RAM
0.000s ( 0%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
0.001s Total
> aa
Ä„ B
1: ą ž
2: ū į
3: ų ė
4: Å¡ Ä™
> aa$A
[1] "ą" "ū" "ų" "š"
> aa$B
[1] "ž" "į" "ė" "ę"
> bb <- read.csv("F:/R/unicode_test.csv",encoding="UTF-8",strings=FALSE)
> bb
A B
1 a ž
2 u i
3 u e
4 š e
> bb$B
[1] "ž" "į" "ė" "ę"
> bb$A
[1] "ą" "ū" "ų" "š"
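A possible workaround, sketched here and not taken from the original post: if the strings come back as UTF-8 bytes without an encoding mark, base R's Encoding<- can re-mark them without touching the bytes. The helper name mark_utf8 is hypothetical, and the sketch assumes the file really is UTF-8. Later data.table releases also document an encoding argument for fread, so upgrading is probably the simpler route where that is possible.

```r
# Hypothetical helper: mark the column names and every character column
# of a data.frame as UTF-8. The bytes are left unchanged; only the
# encoding mark is set.
mark_utf8 <- function(df) {
  nm <- names(df)
  Encoding(nm) <- "UTF-8"
  names(df) <- nm
  for (col in names(df)) {
    if (is.character(df[[col]])) {
      Encoding(df[[col]]) <- "UTF-8"
    }
  }
  df
}

# Simulate strings whose UTF-8 bytes lost their encoding mark.
a <- c("\u0105", "\u016b", "\u0173", "\u0161")
Encoding(a) <- "unknown"
raw_df <- data.frame(A = a, stringsAsFactors = FALSE)

fixed_df <- mark_utf8(raw_df)
print(Encoding(fixed_df$A))

# With a newer data.table, the documented argument would be used instead:
# aa <- data.table::fread("F:/R/unicode_test2.csv", encoding = "UTF-8")
```

The same helper can be applied to the result of fread on an older data.table, since it only changes how R interprets the bytes that were already read.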
I've updated Strawberry Perl 64-bit 5.30.2001 and the gdata package. Now, when loading library(gdata), I always get these warning messages, which appear to be related to Perl.
suppressPackageStartupMessages(library(gdata))
# Warning messages:
# 1: In system(cmd, intern = intern, wait = wait | intern, show.output.on.console = wait, :
# running command 'C:\Windows\system32\cmd.exe /c ftype perl' had status 2
# 2: In system(cmd, intern = intern, wait = wait | intern, show.output.on.console = wait, :
# running command 'C:\Windows\system32\cmd.exe /c ftype perl' had status 2
However, read.xls, the function I need, seems to run well, except that the warning is repeated every time I use it.
read.xls("http://file-examples-com.github.io/uploads/2017/02/file_example_XLS_10.xls")
# trying URL 'http://file-examples-com.github.io/uploads/2017/02/file_example_XLS_10.xls'
# Content type 'application/vnd.ms-excel' length 8704 bytes
# downloaded 8704 bytes
# X0 First.Name Last.Name Gender Country Age Date Id
# 1 1 Dulce Abril Female United States 32 15/10/2017 1562
# 2 2 Mara Hashimoto Female Great Britain 25 16/08/2016 1582
# 3 3 Philip Gent Male France 36 21/05/2015 2587
# 4 4 Kathleen Hanner Female United States 25 15/10/2017 3549
# 5 5 Nereida Magwood Female United States 58 16/08/2016 2468
# 6 6 Gaston Brumm Male United States 24 21/05/2015 2554
# 7 7 Etta Hurn Female Great Britain 56 15/10/2017 3598
# 8 8 Earlean Melgar Female United States 27 16/08/2016 2456
# 9 9 Vincenza Weiland Female United States 40 21/05/2015 6548
# Warning messages:
# 1: In system(cmd, intern = intern, wait = wait | intern, show.output.on.console = wait, :
# running command 'C:\Windows\system32\cmd.exe /c ftype perl' had status 2
# 2: In system(cmd, intern = intern, wait = wait | intern, show.output.on.console = wait, :
# running command 'C:\Windows\system32\cmd.exe /c ftype perl' had status 2
I'm not sure how to deal with this warning because it tells me nothing. I could probably just ignore it and wrap the call in suppressWarnings().
Nevertheless, does anybody know a way to fix this? I couldn't find anything by googling and don't know where to start and what's actually wrong.
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gdata_2.18.0
loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2 gtools_3.8.2
I had the same issue with a freshly installed version of R, gdata and Strawberry Perl. I finally found this answer to a different (but related) question. Adapting the suggestion there, I ran the following in an elevated command prompt:
FTYPE perl=C:\Strawberry\perl\bin\perl.exe %1 %*
This solved the issue for me – however: I am not sure if setting the FTYPE like this might have any unwanted side effects. So be careful.
Update: The command above did suppress the warning "ftype perl' had status 2" for me, but gdata still had issues:
gdata: Unable to load perl libaries needed by read.xls()
gdata: to support 'XLSX' (Excel 2007+) files.
gdata: Run the function 'installXLSXsupport()'
gdata: to automatically download and install the perl
gdata: libaries needed to support Excel XLS and XLSX formats.
However, installXLSXsupport() failed with an unspecific error message.
I then ran
Sys.which("perl")
perl
"C:\\rtools40\\usr\\bin\\perl.exe"
and realized that the Perl version from RTools takes precedence over my Strawberry Perl installation – and apparently gdata does not "like" that Perl version.
Therefore, I decided to give Strawberry Perl precedence over RTools by changing my .Renviron file (usethis::edit_r_environ()):
PATH="${RTOOLS40_HOME}\usr\bin;${PATH}" # old
PATH="${PATH};${RTOOLS40_HOME}\usr\bin" # new
Again, I'm not entirely sure what ramifications this might have, but it fixed gdata for me.
Maybe adjusting the PATH alone would also have done the trick (without the ftype stunt I made first), but I cannot test this anymore.
What I recommend:
Adjust the PATH first.
If gdata still complains about the ftype, set the ftype.
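A third option, sketched under assumptions: read.xls has a perl argument, so the interpreter can be passed explicitly instead of relying on the PATH order. The helper name pick_perl and the file name in the commented call are hypothetical, the Strawberry path is the one from the FTYPE command above, and whether this also silences the ftype warning has not been verified here.

```r
# Hypothetical helper: return `preferred` if that file exists,
# otherwise fall back to whatever `perl` is found first on the PATH.
pick_perl <- function(preferred, fallback = unname(Sys.which("perl"))) {
  if (file.exists(preferred)) preferred else fallback
}

# Path taken from the FTYPE command above; adjust to the local installation.
perl_exe <- pick_perl("C:/Strawberry/perl/bin/perl.exe")

# Show the first PATH entries in search order, to see which Perl would win.
path_dirs <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
print(head(path_dirs))

# Hand the interpreter to gdata explicitly (hypothetical file name):
# df <- gdata::read.xls("some_file.xls", perl = perl_exe)
```

This leaves .Renviron and the file-type association untouched, which may be preferable if RTools needs to stay first on the PATH for package builds.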
I have an annoying CSV file (> 10 GB) which opens on Mac OS X but not on Windows 10.
The code I use
data_in <- fread("my_data.csv")
SessionInfo Windows
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4-3 forcats_0.3.0 stringr_1.3.0 dplyr_0.7.4 purrr_0.2.4 readr_1.1.1 tidyr_0.8.0 tibble_1.4.2
[9] ggplot2_2.2.1 tidyverse_1.2.1 RMySQL_0.10.14 DBI_0.8
loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 cellranger_1.1.0 pillar_1.2.1 compiler_3.4.4 plyr_1.8.4 bindr_0.1.1 tools_3.4.4 lubridate_1.7.2 jsonlite_1.5
[10] nlme_3.1-131.1 gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1 rlang_0.2.0 psych_1.8.3.3 cli_1.0.0 rstudioapi_0.7 yaml_2.1.18
[19] parallel_3.4.4 haven_1.1.1 bindrcpp_0.2.2 xml2_1.2.0 httr_1.3.1 hms_0.4.2 grid_3.4.4 glue_1.2.0 R6_2.2.2
[28] readxl_1.0.0 foreign_0.8-69 modelr_0.1.1 reshape2_1.4.3 magrittr_1.5 scales_0.5.0 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5
[37] colorspace_1.3-2 stringi_1.1.7 lazyeval_0.2.1 munsell_0.4.3 broom_0.4.4 crayon_1.3.4
SessionInfo OSX
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] sv_SE.UTF-8/sv_SE.UTF-8/sv_SE.UTF-8/C/sv_SE.UTF-8/sv_SE.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.11.2
loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0 yaml_2.1.19
This is the error I get on Windows. I have tried all the suggested solutions I could find, without any luck.
Expecting 10 cols, but line 1346596 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
In addition: Warning message:
Additional info when using verbose = TRUE (Windows) (Tried a smaller file, same problem)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.004474 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 10 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: ,asin,sale
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 3657 (including 0 at the end)
Count of sep: 138915
nrow = MIN( nsep [138915] / (ncol [10] -1), neol [3657] - endblanks [0] ) = 3657
Type codes (point 0): 1444444340
Type codes (point 1): 1444444340
Type codes (point 2): 1444444340
Type codes (point 3): 1444444340
Type codes (point 4): 1444444344
Type codes (point 5): 1444444344
Type codes (point 6): 1444444344
Type codes (point 7): 1444444344
Type codes (point 8): 1444444344
Type codes (point 9): 1444444344
Type codes (point 10): 1444444444
Type codes: 1444444444 (after applying colClasses and integer64)
Type codes: 1444444444 (after applying drop or select (if supplied)
Allocating 10 column slots (10 - 0 dropped)
Error in fread("md2.csv", verbose = T) :
Expecting 10 cols, but line 3312 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
verbose = T OSX
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 4 threads (omp_get_max_threads()=4, nth=4)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file md2.csv
File opened, size = 4.581MB (4803885 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
File ends abruptly with ','. Final end-of-line is missing. Using cow page to write 0 to the last byte.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<,asin,salesRank,imUrl,categori>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 100 lines of 10 fields using quote rule 0
Detected 10 columns on line 1. This line is either column names or first data row. Line starts as: <<,asin,salesRank,imUrl,categori>>
Quote rule picked = 0
fill=false and the most number of columns found is 10
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 10 because (4803885 bytes from row 1 to eof) / (2 * 127664 jump0size) == 18
Type codes (jump 000) : 5AAAAAA7A2 Quote rule 0
Type codes (jump 004) : 5AAAAAA7AA Quote rule 0
Type codes (jump 010) : 5AAAAAA7AA Quote rule 0
'header' determined to be true due to column 8 containing a string on row 1 and a lower type (float64) in the rest of the 1041 sample rows
=====
Sampled 1041 rows (handled \n inside quoted fields) at 11 jump points
Bytes from first data row on line 2 to the end of last row: 4803813
Line length: mean=2028.07 sd=3025.66 min=28 max=29901
Estimated number of rows: 4803813 / 2028.07 = 2369
Initial alloc = 4738 rows (2369 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 5AAAAAA7AA
[10] Allocate memory for the datatable
Allocating 10 column slots (10 - 0 dropped) with 4738 rows
[11] Read the data
jumps=[0..2), chunk_size=2401906, total_size=4803813
Read 3311 rows x 10 columns from 4.581MB (4803885 bytes) file in 00:00.025 wall clock time
[12] Finalizing the datatable
Type counts:
1 : int32 '5'
1 : float64 '7'
8 : string 'A'
=============================
0.001s ( 2%) Memory map 0.004GB file
0.005s ( 19%) sep=',' ncol=10 and header detection
0.000s ( 0%) Column type detection using 1041 sample rows
0.000s ( 0%) Allocation of 4738 rows x 10 cols (0.000GB) of which 3311 ( 70%) rows used
0.019s ( 78%) Reading 2 chunks (0 swept) of 2.291MB (each chunk 1655 rows) using 2 threads
+ 0.004s ( 15%) Parse to row-major thread buffers (grown 0 times)
+ 0.012s ( 48%) Transpose
+ 0.004s ( 15%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
0.025s Total
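This works fine in the newest version of data.table. Note that the OS X session above runs data.table 1.11.2, while the Windows session runs 1.10.4-3, so upgrading on Windows is the first thing to try.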
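If upgrading is not an option, the offending rows can be located with base R before trying fill=TRUE as the error message suggests. This is a minimal sketch on made-up data, not the file from the question; it assumes comma separators and double-quote quoting.

```r
# Made-up example file: line 3 has one field too many.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,price",
             "1,apple,0.5",
             "2,pear,0.7,extra",
             "3,plum,0.9"), path)

# count.fields() reports the number of fields on each line; lines inside
# a multi-line quoted field are reported as NA and are skipped by which().
n_fields <- count.fields(path, sep = ",", quote = "\"", comment.char = "")
expected <- n_fields[1]
bad_lines <- which(n_fields != expected)
print(bad_lines)

# The route suggested by the error message itself would be:
# data_in <- data.table::fread("my_data.csv", fill = TRUE)
```

On a file larger than 10 GB this scan is slow, so it is probably best run on a chunk around the line number reported in the error.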
I am struggling with some encoding issues. I have many textfiles that contain rows in the following format:
https://dl.dropboxusercontent.com/u/94114397/example.txt
According to Notepad++, these are all encoded in UTF-8 and most non-ASCII characters are displayed correctly, as you can see in lines 1 and 2. However, I have problems with some characters that seem to be wrongly interpreted(?). In my example file, this is the case in line 3 in the word "Lakuic", where there should be an "š" between the "u" and the "i". There actually is a character between those two letters, which can be seen by copy-pasting the word into the Google Chrome address bar.
Now when I read the data in R, it displays "Laku<U+009A>ic". How can I resolve this?
Try converting from UTF-8 to latin1:
df <- read.table("http://dl.dropboxusercontent.com/u/94114397/example.txt", sep = "\t", row.names = 1, stringsAsFactors = FALSE, encoding="UTF-8")
iconv(df[, 1], from = "UTF-8", to = "latin1")
# [1] "Trichocentrum<->longifolium<-><->(Lindl.) R.Jiménez, Acta Bot. Mex. 97: 54 (2011)."
# [2] "Salvia<->× hegelmaieri<->nothosubsp. accidentalis<->(Sánchez-Gómez & R.Morales)."
# [3] "Edraianthus<->tarae<-><->Lakušic, Bilten Drustva Ekologa BiH, Ser. A 4: 108 (1987)."
My sessionInfo():
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
#
# locale:
# [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C LC_TIME=German_Germany.1252
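Why the conversion helps, as far as I can tell: U+009A is a C1 control character in Unicode, but byte 0x9A is "š" in Windows-1252. Going from UTF-8 to latin1 turns the character back into the single byte 0x9A, which a 1252 locale then displays as "š". The sketch below makes the second step explicit so that it does not depend on the session locale. The input string is reconstructed from the question, not read from the original file.

```r
# The mis-decoded word from the question: U+009A sits between "u" and "i".
x <- "Laku\u009aic"

# Step 1: UTF-8 -> latin1 maps U+009A back to the single byte 0x9A.
as_bytes <- iconv(x, from = "UTF-8", to = "latin1")

# Step 2: reinterpret that byte as Windows-1252, where 0x9A is "š",
# and convert to UTF-8 so the result is portable across locales.
fixed <- iconv(as_bytes, from = "CP1252", to = "UTF-8")

print(fixed == "Laku\u0161ic")
```

Characters that were decoded correctly in the first place (such as "é" in line 1) are unaffected, since they map to the same byte in latin1 and Windows-1252.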
This works for me:
file1 <- "https://dl.dropboxusercontent.com/u/94114397/example.txt"
result <- read.table(file1, header=F, sep="\t", quote="\"",encoding="windows-1252")
Consider a data.table
dt = data.table(id = rep(c('a','b'), each=2),
val = rep(c(1,2,3), times=c(1,2,1)))
# > dt
# id val
# 1: a 1
# 2: a 2
# 3: b 2
# 4: b 3
that we want to subset by id.
If we key by that column alone, no problem.
setkey(dt, id)
dt[J('a'), val]
# id val
# 1: a 1
# 2: a 2
dt[J('a'), range(val)]
# id V1
# 1: a 1
# 2: a 2
But if dt happens to be keyed also by the numeric column val, then that extra key column no longer seems to work in j.
setkey(dt, id, val)
dt[J('a'), val]
# id val
# 1: a 1
dt[J('a'), range(val)]
# id V1
# 1: a 1
# 2: a 1
## I would have expected same results here as when key(dt) == "id" only
Some values seem to be missing now...
unless we resort to vector scan (which can be slow, and returns vectors here)
dt[id == 'a', val]
# [1] 1 2
dt[id == 'a', range(val)]
# [1] 1 2
or unless we explicitly set by (which throws a warning).
dt[J('a'), range(val), by = id]
# id V1
# 1: a 1
# 2: a 2
# Warning message:
# In `[.data.table`(dt, J("a"), range(val), by = id) :
# by is not necessary in this query; it equals all the join columns
# in the same order. j is already evaluated by group of x that each
# row of i matches to (by-without-by, see ?data.table). Setting by
# will be slower because a subset of x is taken and then grouped
# again. Consider removing by, or changing it.
What's going on please?
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] data.table_1.9.2
loaded via a namespace (and not attached):
[1] plyr_1.8.1 Rcpp_0.11.0 reshape2_1.2.2 stringr_0.6.2
[5] tools_3.0.1
Update: Added test in commit 1309 to catch regressions on this issue at any later stages. Closes #734. From NEWS:
Added tests (1351.1 and 1351.2) to catch any future regressions on particular case of binary search based subset reported here on SO. Thanks to Scott for the post. The regression was contained to v1.9.2 AFAICT. Closes #734.
Scott, thanks for the report (and the follow-up comment). It seems to have occurred in 1.9.2 alone. I tested it on the current development version v1.9.3 and things seem to work as intended. Please check the README file for installation instructions.
I've added an issue, #734, to remind us to add a test covering this usage so that we don't miss it again during future changes.
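To check whether an installed version is affected, the keyed join can be compared against the vector scan from the question. This is only a sketch of such a check, not the test that was added to the package, and it assumes a data.table version in which j returns a plain vector for both forms.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  dt <- data.table(id  = rep(c("a", "b"), each = 2),
                   val = rep(c(1, 2, 3), times = c(1, 2, 1)))
  setkey(dt, id, val)

  # Binary-search subset on the first key column only.
  by_join <- dt[.("a"), val]
  # Vector scan, which gave the expected rows in the question.
  by_scan <- dt[id == "a", val]

  # Both should return the same values on an unaffected version.
  print(identical(by_join, by_scan))
  print(packageVersion("data.table"))
}
```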
I am trying to read a .txt file, with Hebrew column names, but without success.
I uploaded an example file to:
http://www.talgalili.com/files/aa.txt
And am trying the command:
read.table("http://www.talgalili.com/files/aa.txt", header = T, sep = "\t")
This returns:
X.....ª X...ª...... X...œ....
1 12 97 6
2 123 354 44
3 6 1 3
Instead of:
אחת שתיים שלוש
12 97 6
123 354 44
6 1 3
My output for:
l10n_info()
Is:
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
And for:
Sys.getlocale()
Is:
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Can you suggest what I should try or change so that the file loads correctly?
Update:
Trying to use:
read.table("http://www.talgalili.com/files/aa.txt",fileEncoding ="iso8859-8")
Has resulted in:
V1
1 ?
Warning messages:
1: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
invalid input found on input connection 'http://www.talgalili.com/files/aa.txt'
2: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
incomplete final line found by readTableHeader on 'http://www.talgalili.com/files/aa.txt'
While also trying this:
Sys.setlocale("LC_ALL", "en_US.UTF-8")
Or this:
Sys.setlocale("LC_ALL", "en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8")
Gets me this:
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
Finally, here is the > sessionInfo()
R version 2.10.1 (2009-12-14)
i386-pc-mingw32
locale:
[1] LC_COLLATE=English_United States.1255 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.10.1
Any suggestion or clarification will be appreciated.
Best,
Tal
I would try passing parameter fileEncoding to read.table with a value of iso8859-8.
Use iconvlist() to get an alphabetical list of the supported encodings. As I saw here Hebrew must be part 8 of ISO 8859.
I tried George Donats's answer, but couldn't make it work, so I want to suggest another possibility for future reference.
I couldn't find the file online, so I recreated a txt file like yours, using TAB as the separator. You can load it into R with the Hebrew text by using a connection, as demonstrated below:
con<-file("aa.txt",open="r",encoding="iso8859-8") ##Open a read-only connection with encoding fit for Hebrew (iso8859-8)
Then you can load it into R with your code, using the con variable as the file input:
data<-read.table(con,sep="\t",header=TRUE)
Browsing into the data variable gives the following results:
str(data)
'data.frame': 3 obs. of 3 variables:
$ אחת : int 6 44 3
$ שתיים: int 97 354 1
$ שלוש : int 12 123 6
> data$אחת
[1] 6 44 3
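For a file that is actually UTF-8 rather than ISO 8859-8, a variant that may work is to let read.table mark the strings as UTF-8 instead of re-encoding them. The sketch below recreates a small file like the one in the question; it assumes UTF-8 content, and check.names = FALSE is added so the Hebrew headers are not rewritten into syntactic names.

```r
# Recreate a small tab-separated file with Hebrew column names, as UTF-8.
path <- tempfile(fileext = ".txt")
lines <- c("\u05d0\u05d7\u05ea\t\u05e9\u05ea\u05d9\u05d9\u05dd\t\u05e9\u05dc\u05d5\u05e9",
           "12\t97\t6",
           "123\t354\t44",
           "6\t1\t3")
writeLines(lines, path, useBytes = TRUE)

# encoding = "UTF-8" marks the strings that are read as UTF-8;
# check.names = FALSE keeps the header text as it is in the file.
hebrew_df <- read.table(path, header = TRUE, sep = "\t",
                        encoding = "UTF-8", check.names = FALSE)
str(hebrew_df)
```

Whether the names then print correctly still depends on the console and locale, as the Sys.setlocale attempts in the question show.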