I have an annoying csv > 10gb which opens on Mac OSX but not on Windows 10.
The code I use
data_in <- fread("my_data.csv")
SessionInfo Windows
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4-3 forcats_0.3.0 stringr_1.3.0 dplyr_0.7.4 purrr_0.2.4 readr_1.1.1 tidyr_0.8.0 tibble_1.4.2
[9] ggplot2_2.2.1 tidyverse_1.2.1 RMySQL_0.10.14 DBI_0.8
loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 cellranger_1.1.0 pillar_1.2.1 compiler_3.4.4 plyr_1.8.4 bindr_0.1.1 tools_3.4.4 lubridate_1.7.2 jsonlite_1.5
[10] nlme_3.1-131.1 gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1 rlang_0.2.0 psych_1.8.3.3 cli_1.0.0 rstudioapi_0.7 yaml_2.1.18
[19] parallel_3.4.4 haven_1.1.1 bindrcpp_0.2.2 xml2_1.2.0 httr_1.3.1 hms_0.4.2 grid_3.4.4 glue_1.2.0 R6_2.2.2
[28] readxl_1.0.0 foreign_0.8-69 modelr_0.1.1 reshape2_1.4.3 magrittr_1.5 scales_0.5.0 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5
[37] colorspace_1.3-2 stringi_1.1.7 lazyeval_0.2.1 munsell_0.4.3 broom_0.4.4 crayon_1.3.4
SessionInfo OSX
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] sv_SE.UTF-8/sv_SE.UTF-8/sv_SE.UTF-8/C/sv_SE.UTF-8/sv_SE.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.11.2
loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0 yaml_2.1.19
The error I get on Windows, I have tried all suggested solutions found without any luck.
Expecting 10 cols, but line 1346596 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
In addition: Warning message:
Additional info when using verbose = TRUE (Windows) (Tried a smaller file, same problem)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.004474 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 10 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: ,asin,sale
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 3657 (including 0 at the end)
Count of sep: 138915
nrow = MIN( nsep [138915] / (ncol [10] -1), neol [3657] - endblanks [0] ) = 3657
Type codes (point 0): 1444444340
Type codes (point 1): 1444444340
Type codes (point 2): 1444444340
Type codes (point 3): 1444444340
Type codes (point 4): 1444444344
Type codes (point 5): 1444444344
Type codes (point 6): 1444444344
Type codes (point 7): 1444444344
Type codes (point 8): 1444444344
Type codes (point 9): 1444444344
Type codes (point 10): 1444444444
Type codes: 1444444444 (after applying colClasses and integer64)
Type codes: 1444444444 (after applying drop or select (if supplied)
Allocating 10 column slots (10 - 0 dropped)
Error in fread("md2.csv", verbose = T) :
Expecting 10 cols, but line 3312 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
verbose = T OSX
nput contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 4 threads (omp_get_max_threads()=4, nth=4)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file md2.csv
File opened, size = 4.581MB (4803885 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
File ends abruptly with ','. Final end-of-line is missing. Using cow page to write 0 to the last byte.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<,asin,salesRank,imUrl,categori>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 100 lines of 10 fields using quote rule 0
Detected 10 columns on line 1. This line is either column names or first data row. Line starts as: <<,asin,salesRank,imUrl,categori>>
Quote rule picked = 0
fill=false and the most number of columns found is 10
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 10 because (4803885 bytes from row 1 to eof) / (2 * 127664 jump0size) == 18
Type codes (jump 000) : 5AAAAAA7A2 Quote rule 0
Type codes (jump 004) : 5AAAAAA7AA Quote rule 0
Type codes (jump 010) : 5AAAAAA7AA Quote rule 0
'header' determined to be true due to column 8 containing a string on row 1 and a lower type (float64) in the rest of the 1041 sample rows
=====
Sampled 1041 rows (handled \n inside quoted fields) at 11 jump points
Bytes from first data row on line 2 to the end of last row: 4803813
Line length: mean=2028.07 sd=3025.66 min=28 max=29901
Estimated number of rows: 4803813 / 2028.07 = 2369
Initial alloc = 4738 rows (2369 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 5AAAAAA7AA
[10] Allocate memory for the datatable
Allocating 10 column slots (10 - 0 dropped) with 4738 rows
[11] Read the data
jumps=[0..2), chunk_size=2401906, total_size=4803813
Read 3311 rows x 10 columns from 4.581MB (4803885 bytes) file in 00:00.025 wall clock time
[12] Finalizing the datatable
Type counts:
1 : int32 '5'
1 : float64 '7'
8 : string 'A'
=============================
0.001s ( 2%) Memory map 0.004GB file
0.005s ( 19%) sep=',' ncol=10 and header detection
0.000s ( 0%) Column type detection using 1041 sample rows
0.000s ( 0%) Allocation of 4738 rows x 10 cols (0.000GB) of which 3311 ( 70%) rows used
0.019s ( 78%) Reading 2 chunks (0 swept) of 2.291MB (each chunk 1655 rows) using 2 threads
+ 0.004s ( 15%) Parse to row-major thread buffers (grown 0 times)
+ 0.012s ( 48%) Transpose
+ 0.004s ( 15%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
0.025s Total
Works fine in newest version of data.table
Related
I have a large (8GB+) csv file (comma-separated) that I want to read into R. The file contains three columns
date #in 2017-12-27 format
text #a string
type #a label per string (either NA, typeA, or typeB)
The problem I encounter is that the text column contains various string indicators: ' (single quot. marks), " (double quot. marks), no quot. marks, as well as multiple separated strings.
E.g.
date text type
2016-01-01 great job! NA
2016-01-02 please, type "submit" typeA
2016-01-02 "can't see the "error" now" typeA
2016-01-03 "add \\"/filename.txt\\"" NA
To read these large data, I tried:
Base read.csv and readr's read_csv function: work fine for a portion but fail (probably due to memory) or take ages to read
chunking the data via Mac terminal into batches of 1m lines: fails because lines seem to break arbitrarily
Using fread (preferred as I hope this will solve the two other issues): fails with Error: Expecting 3 cols, but line 1103 contains text after processing all cols.
My idea is to work around these issues by using specifics of the data that I know, i.e. that each line starts with a date and ends with either NA, typeA, or typeB.
How could I implement this (either using pure readLines or into fread)?
Edit:
Sample data (anonymized) as opened with Mac TextWrangler:
"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success #myid",typeA
Sample data 2:
"date","text","type"
"2018-05-02","i try this, but it doesnt work",NA
"2018-05-02","Thank you very much. Cheers !!",NA
"2018-05-02","#myid. I'll change this.",NA
Sample data for reproducible fread error "Expecting 3 cols, but line 3 contains text after processing all cols.":
"date","text","type"
"2015-03-02","Some text, some text, some question? Please, some question?",NA
"2015-03-02","Here you have the error ""Can’t access {file \""Macintosh HD:abc:def:filename\"", \""/abc.txt\""} from directory."" something -1100 from {file ""Macintosh HD:abc:def:filename"", ""/abc.txt""} to file",NA
"2015-03-02","good idea",NA
"2015-03-02","Worked perfectly :)",NA
SessionInfo:
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4-3 readr_1.1.1
loaded via a namespace (and not attached):
[1] compiler_3.5.0 assertthat_0.2.0 R6_2.2.2 cli_1.0.0
[5] hms_0.4.2 tools_3.5.0 pillar_1.2.2 rstudioapi_0.7
[9] tibble_1.4.2 yaml_2.1.19 crayon_1.3.4 Rcpp_0.12.16
[13] utf8_1.1.3 pkgconfig_2.0.1 rlang_0.2.0
readLines approach could be
infile <- file("test.txt", "r")
txt <- readLines(infile, n = 1)
df <- NULL
#change this value as per your requirement
chunksize <- 1
while(length(txt)){
txt <- readLines(infile, warn=F, n = chunksize)
df <- rbind(df, data.frame(date = gsub("\\s.*", "", txt),
text = trimws(gsub("\\S+(.*)\\s+\\S+$", "\\1", txt)),
type = gsub(".*\\s", "", txt),
stringsAsFactors = F))
}
which gives
> df
date text type
1 2016-01-01 great job! NA
2 2016-01-02 please, type "submit" typeA
3 2016-01-02 "can't see the "error" now" typeA
4 2016-01-03 "add \\\\"/filename.txt\\\\"" NA
Sample data: test.txt contains
date text type
2016-01-01 great job! NA
2016-01-02 please, type "submit" typeA
2016-01-02 "can't see the "error" now" typeA
2016-01-03 "add \\"/filename.txt\\"" NA
Update:
You can modify above code with below regex parser to parse another set of sample data
df <- rbind(df, data.frame(date = gsub("\"(\\S{10}).*", "\\1", txt),
text = gsub(".*\"\\,\"(.*)\"\\,(\"|NA).*", "\\1", txt),
type = gsub(".*\\,|\"", "", txt),
stringsAsFactors = F))
Another set of sample data:
"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success #myid","typeA"
I tried to import a text file in R (3.4.0) which actually contains 4 columns but the 4th column is mostly empty until 200,000+th row. I use the fread() in package data.table (ver 1.10.4)
fread("test.txt",fill = TRUE, sep = "\t", quote = "", header = FALSE)
I got this error message:
Error in fread("test.txt", fill = TRUE, sep = "\t", quote = "", header = FALSE) :
Expecting 3 cols, but line 258088 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
I checked the file and there's additional text in 258088th row in the 4th column ("8-4").
Nevertheless, fill = TRUE did not solve this as I expected. I thought it might be fread() determining column numbers inappropriately because the additional column occurs very late in the file. So I tried this:
fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 250000)
The error persisted. On the other hand,
fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 258080)
This gives no error.
I thought I found the reason, but the weird thing happened when I tested with a dummy file generated by:
write.table(matrix(c(1:990000), nrow = 330000), "test2.txt", sep = "\t", row.names = FALSE)
with the addition of a "8-4" in the 4th column of the 250000th row by Excel. When read by fread():
fread("test2.txt", fill = TRUE, header = FALSE, sep = "\t")
It worked fine with no error message, and this should indicate some late additional column not necessarily trigger error.
I also tried changing encoding ("Latin-1" and "UTF-8") or quote, but neither helped.
Now I feel clueless, and hopefully I did my homework enough with a reproducible information. Thank you for helping.
For additional environmental info, my sessionInfo() is:
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.5.0 purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3
[5] tibble_1.3.3 ggplot2_2.2.1 tidyverse_1.1.1 stringr_1.2.0
[9] microbenchmark_1.4-2.1 data.table_1.10.4
loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 cellranger_1.1.0 compiler_3.4.0 plyr_1.8.4 forcats_0.2.0
[6] tools_3.4.0 jsonlite_1.5 lubridate_1.6.0 nlme_3.1-131 gtable_0.2.0
[11] lattice_0.20-35 rlang_0.1.1 psych_1.7.5 DBI_0.6-1 parallel_3.4.0
[16] haven_1.0.0 xml2_1.1.1 httr_1.2.1 hms_0.3 grid_3.4.0
[21] R6_2.2.1 readxl_1.0.0 foreign_0.8-68 reshape2_1.4.2 modelr_0.1.0
[26] magrittr_1.5 scales_0.4.1 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5
[31] colorspace_1.3-2 stringi_1.1.5 lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2
Actually there is a difference between the two files that you provide, and I think this is the cause of the different outputs of the fread.
The first file has an end of the line after the 3rd column, except line 258088, where there is a tab a 4th column and then the end of the line. (You can use the option 'show all characters to confirm that').
On the other hand the second file has in all rows an extra tab, i.e. a new empty column.
So in the first case fread expects 3 columns and then finds out a 4th column. On the contrary in the second file, fread expects 4 columns.
I checked read.table with fill=TRUE and it worked with both files. So I think that something is done differently with the fill option of the fread.
I would expect since fill=TRUE, all the lines to be used so as to infer the number of columns (with cost on computational time).
In the comments there are some nice workarounds you can use.
The file has issue: if the table has four columns, at the end of each row with the fourth column missing a \t should have been present.
In this case you may have better luck with a low-level approach: read the file line by line, add a \t to each row which doesn't have the fourth column, split each line with \t and collect all together in a data.frame. Most of the above work is done by the data.table::tstrsplit function. Try something like:
f<-readLines("test.txt")
require(stringr)
require(data.table)
a<-data.frame(tstrsplit(f,"\t",type.convert=TRUE,names=TRUE,keep=1:4),stringsAsFactors=FALSE)
str(a)
#'data.frame': 273070 obs. of 4 variables:
# $ V1: num 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.018 ...
# $ V2: num -18.7113 -1.2685 0.0768 0.1507 0.1609 ...
# $ V3: num 0 0 0 0 0 0 0 0 0 0 ...
# $ V4: chr NA NA NA NA ...
I was struggling with this as well. I found another solution (for csv and read.table) here How can you read a CSV file in R with different number of columns. This answer you can use the handy function count.fields to count the delimiters of a file by line and then take the max field count to pass the max number of column names to fread. A reproducible example is below.
Generate text with uneven number of fields
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n16520, California, ocean, summer, golden gate, beach, San Francisco\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n"
Write to file
cat(text, file = "foo")
Scan file for delimeters
max.fields<-max(count.fields("foo", sep = ','))
Now use fread to read file, but expect a max number of columns from the col.names argument
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))
However, I was basing this data on the example data from ?count.fields and found if the max number of fields is in the last line of the file, fread will still fail with the following error.
Error in fread("foo", header = FALSE, fill = TRUE, sep = ",", col.names = paste("V", :
Expecting 3 cols, but line 9 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
example
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n16520, California, ocean, summer, golden gate, beach, San Francisco\n"
cat(text, file = "foo")
max.fields<-max(count.fields("foo", sep = ','))
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))
I'll report this as an issue to the data.table Github. Update: issue logged here https://github.com/Rdatatable/data.table/issues/2691
I always use fread from data.table package to read in large tables. But apparently it does not support reading unicode files in Windows (Windows 7 Professional to be more precise)
Here is the file I tried to read:
A,B
ą,ž
ū,į
ų,ė
š,ę
It works fine if I read it in Mac OS X, or I read it with read.csv with option encoding=UTF-8. Unfortunately fread does not have this option.
So are there other fast ways to read unicode tables in Windows, or I should just use other OS? Or am I missing something obvious?
Here is the output of sessionInfo():
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.4
loaded via a namespace (and not attached):
[1] chron_2.3-45 plyr_1.8.1 Rcpp_0.11.5 reshape2_1.4.1 stringr_0.6.2
Update: Pasting the output as requested.
> aa<-fread("F:/R/unicode_test2.csv",verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.000000 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 2 columns. Longest stretch was from line 1 to line 5
Starting data input on line 1 (either column names or first row of data). First 10 characters: Ä„,B
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 5 (including 1 at the end)
Count of sep: 4
nrow = MIN( nsep [4] / ncol [2] -1, neol [5] - nblank [1] ) = 4
Type codes ( first 5 rows): 44
Type codes: 44 (after applying colClasses and integer64)
Type codes: 44 (after applying drop or select (if supplied)
Allocating 2 column slots (2 - 0 dropped)
Read 4 rows. Exactly what was estimated and allocated up front
0.000s ( 0%) Memory map (rerun may be quicker)
0.000s ( 0%) sep and header detection
0.000s ( 0%) Count rows (wc -l)
0.000s ( 0%) Column type detection (first, middle and last 5 rows)
0.000s ( 0%) Allocation of 4x2 result (xMB) in RAM
0.000s ( 0%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
0.001s Total
> aa
Ä„ B
1: ą ž
2: ū į
3: ų ė
4: Å¡ Ä™
> aa$A
[1] "ą" "ū" "ų" "š"
> aa$B
[1] "ž" "į" "ė" "ę"
> bb <- read.csv("F:/R/unicode_test.csv",encoding="UTF-8",strings=FALSE)
> bb
A B
1 a ž
2 u i
3 u e
4 š e
> bb$B
[1] "ž" "į" "ė" "ę"
> bb$A
[1] "ą" "ū" "ų" "š"
Say I have a tall dataframe with many rows per group, like so:
df <- data.frame(group = factor(rep(c("a","b","c"), each = 5)),
v1 = sample(1:100, 15, replace = TRUE),
v2 = sample(1:100, 15, replace = TRUE),
v3 = sample(1:100, 15, replace = TRUE))
What I want to do is split df into length(levels(df$group)) separate dataframes, e.g.,
df_a <- df[df$group=="a",]; df_b <- df[df$group == "b",] ; ...
And then print each dataframe in a separate HTML/PDF/DOCX file (probably using Rmarkdown and knitr).
I want to do this because I have a large dataframe and want to create a personalized report for each group a, b, c, etc. Thanks.
Update (11/18/14)
Following #daroczig 's advice in this thread and another thread, I attempted to make my own template that would simply print a nicely formatted table of all columns and rows per group to substitute into the "correlations" template call in the original sapply() function. I want to make my own template rather than just printing the nice table (e.g., the answer #Thomas graciously provided) because I'd like to build additional customization into the template once the simple printing works. Anyway, I've certainly butchered it:
<!--head
meta:
title: Sample Report
author: Nicapyke
description: This is a demo
packages: ~
inputs:
- name: eachgroup
class: character
standalone: TRUE
required: TRUE
head-->
### Records received up to present for Group <%= eachgroup %>
<%=
pandoc.table(df[df$group == eachgroup, ])
%>
Then, after saving that as groupreport.rapport in my working directory, I wrote the following R code, modeled after #daroczig's response:
allgroups <- unique(df$group)
library(rapport)
for (eachstate in allstates) {
rapport.docx("FILEPATHHERE", eachgroup = eachgroup)
}
I received the error:
Error in openFileInOS(f.out) : File not found!
I'm not sure what happened. I see from the pander documentation that this means it's looking for a system file, but that doesn't mean much to me. Anyway, this error doesn't get at the root of the problem, which is 1) what should go in the input section of the custom template YAML header, and 2) which R code should go in the rapport template vs. in the R script.
I realize I may be making a number of errors that reveal my lack of experience with rapport and pander. Thanks for your patience!
N.B.:
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.8 dplyr_0.3.0.2 rapport_0.51 yaml_2.1.13 pander_0.5.1
plyr_1.8.1 lattice_0.20-29
loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1 digest_0.6.4 evaluate_0.5.5 formatR_1.0 grid_3.1.2
[7] lazyeval_0.1.9 magrittr_1.0.1 parallel_3.1.2 Rcpp_0.11.3 reshape_0.8.5 stringr_0.6.2
[13] tools_3.1.2
A slightly off-topic, but still R/markdown one-liner for separate reports with report templates:
> library(rapport)
> sapply(levels(df$group), function(g) rapport.html('correlations', data = df[df$group == g, ], vars = c('v1', 'v2', 'v3')))
Exported to */tmp/RtmpYyRLjf/rapport-correlations-1-0.[md|html]* under 0.683 seconds.
Exported to */tmp/RtmpYyRLjf/rapport-correlations-2-0.[md|html]* under 0.888 seconds.
Exported to */tmp/RtmpYyRLjf/rapport-correlations-3-0.[md|html]* under 1.063 seconds.
The rapport package can run (predefined or custom) report templates on any (sub)dataset in markdown, then export it to HTML/docx/PDF/other formats. For a quick demo, I've uploaded the resulting documents:
rapport-correlations-1-0.html
rapport-correlations-2-0.html
rapport-correlations-3-0.html
You can do this with by (or split) and xtable (from the xtable package). Here I create xtable objects of each subset, and then loop over them to print them to file:
library('xtable')
s <- by(df, df$group, xtable)
for(i in seq_along(s)) print(s[[i]], file = paste0('df',names(s)[i],'.tex'))
If you use the stargazer package, you can get a nice summary of the dataframe instead of the dataframe itself in just one line:
library('stargazer')
by(df, df$group, stargazer, out = paste0('df',unique(df$group),'.tex'))
You should be able to easily include each of these files in, e.g., a PDF report. You could also use HTML markup using either xtable or stargazer.
I am trying to read a .txt file, with Hebrew column names, but without success.
I uploaded an example file to:
http://www.talgalili.com/files/aa.txt
And am trying the command:
read.table("http://www.talgalili.com/files/aa.txt", header = T, sep = "\t")
This returns me with:
X.....ª X...ª...... X...œ....
1 12 97 6
2 123 354 44
3 6 1 3
Instead of:
אחת שתיים שלוש
12 97 6
123 354 44
6 1 3
My output for:
l10n_info()
Is:
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
And for:
Sys.getlocale()
Is:
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Can you suggest to me what to try and change to allow me to load the file correctly ?
Update:
Trying to use:
read.table("http://www.talgalili.com/files/aa.txt",fileEncoding ="iso8859-8")
Has resulted in:
V1
1 ?
Warning messages:
1: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
invalid input found on input connection 'http://www.talgalili.com/files/aa.txt'
2: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
incomplete final line found by readTableHeader on 'http://www.talgalili.com/files/aa.txt'
While also trying this:
Sys.setlocale("LC_ALL", "en_US.UTF-8")
Or this:
Sys.setlocale("LC_ALL", "en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8")
Get's me this:
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
Finally, here is the > sessionInfo()
R version 2.10.1 (2009-12-14)
i386-pc-mingw32
locale:
[1] LC_COLLATE=English_United States.1255 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.10.1
Any suggestion or clarification will be appreciated.
Best,
Tal
I would try passing parameter fileEncoding to read.table with a value of iso8859-8.
Use iconvlist() to get an alphabetical list of the supported encodings. As I saw here Hebrew must be part 8 of ISO 8859.
I've tried #George Donats answer, but couldn't make it work. So I wanted to suggest another possibility for future reference.
I couldn't find the file online, so I've recreated a txt file like your using TAB as a seperator. You can load it into R with the Hebrew text using a connection. It is demonstrated below:
con<-file("aa.txt",open="r",encoding="iso8859-8") ##Open a read-only connection with encoding fit for Hebrew (iso8859-8)
Than you can load it into R with your code, using con variable as the file input, code described here:
data<-read.table(con,sep="\t",header=TRUE)
Browsing into the data variable gives the following results:
str(data)
'data.frame': 3 obs. of 3 variables:
$ אחת : int 6 44 3
$ שתיים: int 97 354 1
$ שלוש : int 12 123 6
> data$אחת
[1] 6 44 3