Invalid UTF-8 Error when saving Leaflet Widget in R

I'm creating a Leaflet widget in R using the following code:
m <- leaflet(map_data_wgs84) %>% addTiles() %>% addCircles(popup = (paste(sep="<br/>", as.character(map_data_wgs84$MEMBER_REF), map_data_wgs84$Name)))
saveWidget(m, file="c://software//members.html")
I want a popup that places an ID number and a name separated by a line break. However, when I run the saveWidget command I get the following error:
Error in gsub("</", "\\u003c/", payload, fixed = TRUE) :
input string 1 is invalid UTF-8
which is because of the <br/> separator.
What am I doing wrong here?
Thanks.
UPDATE:
It would appear it is not the <br/> separator but rather character(s) in the map_data_wgs84$Name column. These 12,000 records are pulled from a contact database before mapping.
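As a quick diagnostic (a sketch assuming R >= 3.3, where base::validUTF8() is available), the offending rows can be located with:
bad <- !validUTF8(map_data_wgs84$Name)         # flag names that are not valid UTF-8
map_data_wgs84[bad, c("MEMBER_REF", "Name")]   # inspect the offending rows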
I suspect I need some way to make the characters clean for use in Leaflet, with something like htmlEscape, but I can't figure out how to use it within paste. This doesn't work, because htmlEscape is parsed as a string:
addCircles(popup = paste(as.character(map_data_wgs84$MEMBER_REF), ~htmlEscape(map_data_wgs84$Name), sep=","))
For a MEMBER_REF of 56202, the popup becomes:
56202,htmlEscape(map_data_wgs84$Name)
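Note that htmlEscape() is an ordinary function and can be called directly inside paste(); leaflet only interprets the ~ formula syntax when the whole argument is a formula, which is why it ends up pasted as literal text above. A minimal sketch of the direct call (this fixes the escaping, though any invalid UTF-8 bytes themselves still need cleaning, as the answer below shows):
library(htmltools)   # provides htmlEscape()
m <- leaflet(map_data_wgs84) %>% addTiles() %>%
  addCircles(popup = paste(as.character(map_data_wgs84$MEMBER_REF),
                           htmlEscape(map_data_wgs84$Name),
                           sep = "<br/>"))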

Overview
To resolve the UTF-8 error, I followed three steps:
1. After reading How to identify/delete non-UTF-8 characters in R, I used base::Encoding() to manually mark the values in the Name column as UTF-8. I then used base::iconv() to replace all invalid UTF-8 characters with an empty string;
2. I placed the line break element - <br> rather than <br/> - inside the sep argument of the paste() call that builds the popup for the markers of the leaflet object; and
3. To be safe, I wrapped the text vectors used inside paste() in htmltools::htmlEscape().
Altogether, I was able to export that object as an HTML file. All package versions are listed below in the Session Info section.
Reproducible Example
# load necessary packages
library( htmltools )
library( htmlwidgets )
library( leaflet )
# create data
map_data_wgs84 <-
  data.frame( MEMBER_REF = "Popup"
            , Name = "Th\x86e birthplace of R."
            , Long = 174.768
            , Lat = -36.852
            , stringsAsFactors = FALSE )
# pre-processing
# ensure that all characters in the `Name` column
# are valid UTF-8 encoded
# Thank you to SO for this gem
# https://stackoverflow.com/questions/17291287/how-to-identify-delete-non-utf-8-characters-in-r
Encoding( x = map_data_wgs84$Name ) <- "UTF-8"
# replace all invalid UTF-8 characters with an empty string
map_data_wgs84$Name <-
  iconv( x = map_data_wgs84$Name
       , from = "UTF-8"
       , to = "UTF-8"
       , sub = "" )
# check work
map_data_wgs84$Name # [1] "The birthplace of R."
# create leaflet object
my.map <-
  leaflet( data = map_data_wgs84 ) %>%
  addTiles() %>%
  addCircles( lng = ~Long
            , lat = ~Lat
            , popup = paste( as.character( map_data_wgs84$MEMBER_REF )
                           , htmlEscape( map_data_wgs84$Name )
                           , sep = "<br>" )
            , radius = 50 )
# export leaflet object as HTML file
saveWidget( widget = my.map
          , file = "mywidget.html" )
# end of script #
Session Info
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] leaflet_2.0.1 htmlwidgets_1.2 htmltools_0.3.6
loaded via a namespace (and not attached):
[1] Rcpp_0.12.18 digest_0.6.15 later_0.7.3 mime_0.5
[5] R6_2.2.2 xtable_1.8-2 jsonlite_1.5 magrittr_1.5
[9] promises_1.0.1 tools_3.5.1 crosstalk_1.0.0 shiny_1.1.0
[13] httpuv_1.4.5 yaml_2.1.19 compiler_3.5.1

I saw this solution in another post (sorry, lost the link!) which uses sprintf to combine text and variables, then applies htmltools within lapply to make the content HTML-happy! In this particular case I created the following list:
labels <- sprintf(
  "<strong>%s</strong><br/>%s<br/>%s",
  map_data$Name, map_data$Events, map_data$member_ref
) %>% lapply(htmltools::HTML)
and then called this from leaflet:
my_map <- leaflet(map_data_wgs84) %>% addTiles() %>% addCircles(popup = labels)
Hope this helps someone else!

Related

Reading large data with messy strings and multiple string indicators in R

I have a large (8GB+) csv file (comma-separated) that I want to read into R. The file contains three columns
date #in 2017-12-27 format
text #a string
type #a label per string (either NA, typeA, or typeB)
The problem I encounter is that the text column contains various string indicators: ' (single quot. marks), " (double quot. marks), no quot. marks, as well as multiple separated strings.
E.g.
date text type
2016-01-01 great job! NA
2016-01-02 please, type "submit" typeA
2016-01-02 "can't see the "error" now" typeA
2016-01-03 "add \\"/filename.txt\\"" NA
To read these large data, I tried:
Base read.csv and readr's read_csv function: work fine for a portion but fail (probably due to memory) or take ages to read
Chunking the data via the Mac terminal into batches of 1m lines: fails because lines seem to break arbitrarily
Using fread (preferred as I hope this will solve the two other issues): fails with Error: Expecting 3 cols, but line 1103 contains text after processing all cols.
My idea is to work around these issues by using specifics of the data that I know, i.e. that each line starts with a date and ends with either NA, typeA, or typeB.
How could I implement this (either using pure readLines or into fread)?
Edit:
Sample data (anonymized) as opened with Mac TextWrangler:
"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success #myid",typeA
Sample data 2:
"date","text","type"
"2018-05-02","i try this, but it doesnt work",NA
"2018-05-02","Thank you very much. Cheers !!",NA
"2018-05-02","#myid. I'll change this.",NA
Sample data for reproducible fread error "Expecting 3 cols, but line 3 contains text after processing all cols.":
"date","text","type"
"2015-03-02","Some text, some text, some question? Please, some question?",NA
"2015-03-02","Here you have the error ""Can’t access {file \""Macintosh HD:abc:def:filename\"", \""/abc.txt\""} from directory."" something -1100 from {file ""Macintosh HD:abc:def:filename"", ""/abc.txt""} to file",NA
"2015-03-02","good idea",NA
"2015-03-02","Worked perfectly :)",NA
SessionInfo:
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4-3 readr_1.1.1
loaded via a namespace (and not attached):
[1] compiler_3.5.0 assertthat_0.2.0 R6_2.2.2 cli_1.0.0
[5] hms_0.4.2 tools_3.5.0 pillar_1.2.2 rstudioapi_0.7
[9] tibble_1.4.2 yaml_2.1.19 crayon_1.3.4 Rcpp_0.12.16
[13] utf8_1.1.3 pkgconfig_2.0.1 rlang_0.2.0
A readLines approach could be:
infile <- file("test.txt", "r")
txt <- readLines(infile, n = 1)   # read past the header line
df <- NULL
# change this value as per your requirement
chunksize <- 1
while(length(txt)){
  txt <- readLines(infile, warn = F, n = chunksize)
  df <- rbind(df, data.frame(date = gsub("\\s.*", "", txt),
                             text = trimws(gsub("\\S+(.*)\\s+\\S+$", "\\1", txt)),
                             type = gsub(".*\\s", "", txt),
                             stringsAsFactors = F))
}
close(infile)
which gives
> df
date text type
1 2016-01-01 great job! NA
2 2016-01-02 please, type "submit" typeA
3 2016-01-02 "can't see the "error" now" typeA
4 2016-01-03 "add \\\\"/filename.txt\\\\"" NA
Sample data: test.txt contains
date text type
2016-01-01 great job! NA
2016-01-02 please, type "submit" typeA
2016-01-02 "can't see the "error" now" typeA
2016-01-03 "add \\"/filename.txt\\"" NA
Update:
You can modify the above code with the regex parser below to parse another set of sample data:
df <- rbind(df, data.frame(date = gsub("\"(\\S{10}).*", "\\1", txt),
text = gsub(".*\"\\,\"(.*)\"\\,(\"|NA).*", "\\1", txt),
type = gsub(".*\\,|\"", "", txt),
stringsAsFactors = F))
Another set of sample data:
"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success #myid","typeA"

Error: Text after processing all cols in fread (data.table)

I tried to import a text file in R (3.4.0) which actually contains 4 columns, but the 4th column is mostly empty until the 200,000+th row. I use fread() from the data.table package (ver. 1.10.4):
fread("test.txt",fill = TRUE, sep = "\t", quote = "", header = FALSE)
I got this error message:
Error in fread("test.txt", fill = TRUE, sep = "\t", quote = "", header = FALSE) :
Expecting 3 cols, but line 258088 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
I checked the file and there's additional text in the 258088th row in the 4th column ("8-4").
Nevertheless, fill = TRUE did not solve this as I expected. I thought fread() might be determining the number of columns inappropriately because the additional column occurs very late in the file. So I tried this:
fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 250000)
The error persisted. On the other hand,
fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 258080)
This gives no error.
I thought I had found the reason, but something weird happened when I tested with a dummy file generated by:
write.table(matrix(c(1:990000), nrow = 330000), "test2.txt", sep = "\t", row.names = FALSE)
with "8-4" added to the 4th column of the 250000th row in Excel. When read by fread():
fread("test2.txt", fill = TRUE, header = FALSE, sep = "\t")
It worked fine with no error message, which suggests that a late additional column does not necessarily trigger the error.
I also tried changing encoding ("Latin-1" and "UTF-8") or quote, but neither helped.
Now I feel clueless; hopefully I have done my homework with enough reproducible information. Thank you for helping.
For additional environment info, my sessionInfo() is:
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.5.0 purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3
[5] tibble_1.3.3 ggplot2_2.2.1 tidyverse_1.1.1 stringr_1.2.0
[9] microbenchmark_1.4-2.1 data.table_1.10.4
loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 cellranger_1.1.0 compiler_3.4.0 plyr_1.8.4 forcats_0.2.0
[6] tools_3.4.0 jsonlite_1.5 lubridate_1.6.0 nlme_3.1-131 gtable_0.2.0
[11] lattice_0.20-35 rlang_0.1.1 psych_1.7.5 DBI_0.6-1 parallel_3.4.0
[16] haven_1.0.0 xml2_1.1.1 httr_1.2.1 hms_0.3 grid_3.4.0
[21] R6_2.2.1 readxl_1.0.0 foreign_0.8-68 reshape2_1.4.2 modelr_0.1.0
[26] magrittr_1.5 scales_0.4.1 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5
[31] colorspace_1.3-2 stringi_1.1.5 lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2
Actually there is a difference between the two files that you provide, and I think this is the cause of the different outputs of fread.
The first file has an end-of-line after the 3rd column, except on line 258088, where there is a tab, a 4th column, and then the end-of-line. (You can use your editor's 'show all characters' option to confirm that.)
The second file, on the other hand, has an extra tab in every row, i.e. a new empty column.
So in the first case fread expects 3 columns and then encounters a 4th one. In the second file, by contrast, fread expects 4 columns from the start.
I checked read.table with fill=TRUE and it worked with both files. So I think that something is done differently in fread's fill option.
Since fill=TRUE, I would expect all the lines to be used to infer the number of columns (at some computational cost).
In the comments there are some nice workarounds you can use.
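For instance, a sketch under the assumption of a tab-separated file with at most four columns: tell fread up front how many columns to allocate via col.names (the count.fields answer further down automates the counting):
library(data.table)
# supply four column names so fread does not infer only three
# columns from the early, three-column rows
dt <- fread("test.txt", fill = TRUE, sep = "\t", header = FALSE,
            col.names = paste0("V", 1:4))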
The file has an issue: if the table has four columns, a \t should have been present at the end of each row where the fourth column is missing.
In this case you may have better luck with a low-level approach: read the file line by line, add a \t to each row which doesn't have the fourth column, split each line with \t and collect all together in a data.frame. Most of the above work is done by the data.table::tstrsplit function. Try something like:
f<-readLines("test.txt")
require(stringr)
require(data.table)
a<-data.frame(tstrsplit(f,"\t",type.convert=TRUE,names=TRUE,keep=1:4),stringsAsFactors=FALSE)
str(a)
#'data.frame': 273070 obs. of 4 variables:
# $ V1: num 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.018 ...
# $ V2: num -18.7113 -1.2685 0.0768 0.1507 0.1609 ...
# $ V3: num 0 0 0 0 0 0 0 0 0 0 ...
# $ V4: chr NA NA NA NA ...
I was struggling with this as well. I found another solution (for CSV and read.table) here: How can you read a CSV file in R with different number of columns. In that answer you can use the handy function count.fields to count the delimiters of a file by line, then take the max field count and pass that many column names to fread. A reproducible example is below.
Generate text with uneven number of fields
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n16520, California, ocean, summer, golden gate, beach, San Francisco\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n"
Write to file
cat(text, file = "foo")
Scan the file for delimiters
max.fields <- max(count.fields("foo", sep = ','))
Now use fread to read the file, passing the expected maximum number of columns via the col.names argument:
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))
However, I was basing this data on the example data from ?count.fields, and found that if the max number of fields occurs in the last line of the file, fread will still fail with the following error:
Error in fread("foo", header = FALSE, fill = TRUE, sep = ",", col.names = paste("V", :
Expecting 3 cols, but line 9 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
Example:
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n16520, California, ocean, summer, golden gate, beach, San Francisco\n"
cat(text, file = "foo")
max.fields<-max(count.fields("foo", sep = ','))
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))
I'll report this as an issue on the data.table GitHub. Update: issue logged here: https://github.com/Rdatatable/data.table/issues/2691

Mysterious misspelt R error in assign 'argumemt is not a character' after updating shiny

I am running a shiny app that basically generates SQL code for word-searches within a column 'WordText'.
While the code works fine for other users running R 3.1.1, it started throwing errors after I updated shiny. Note that I was running R 3.2.3 prior to updating shiny, and the shiny app worked fine.
ERROR MESSAGE:
Warning: Error in FUN: argumemt is not a character vector
Stack trace (innermost first):
74: lapply
73: paste8
72: HTML
71: assign
70: renderUI [U:\00 R\Shiny - Coursera\08 Tech05/server.R#174]
69: func
68: output$key1A_main
1: runApp
Also, this is the first time I'm getting a stack trace! Not sure what triggered these errors.
The code snippet in question:
########### GENERATING MULTIPLE OUTPUTS ################
#### MULTIPLE KEYWORD DEPENDENCIES !!!
lapply(1:5, function(x){
  ## Defining as many functions as the number of times displayed - Word Search Generator, (Complaints and Monthly Trends TAB ) X2 - SS + TD
  output[[sprintf("key%dA_main",x)]] <- output[[sprintf("key%dA_main_SS_1",x)]] <- output[[sprintf("key%dA_main_SS_2",x)]] <-
    output[[sprintf("key%dA_main_TD_1",x)]] <- output[[sprintf("key%dA_main_TD_2",x)]] <- renderUI({
      ## Main Keyword Case Summary LIKE Statement
      assign(sprintf("key%dA_start",x),
             if(input[[sprintf("key%dA",x)]]=="") {""}
             else {HTML(paste0("(",br(),"WordText like '%",input[[sprintf("key%dA",x)]],"%'",br(),em(sprintf("/* Main Keyword %d */",x)),br()))}
      )
      ## 'AND' and Starting Parenthesis if any dependent keywords
      assign(sprintf("key%dA_start_OR",x),
             if(nchar(input[[sprintf("key%d_temp_1",x)]],allowNA = TRUE)==0) {" "} else {paste0("AND",br()," (",br()) }
      )
      ## 1st Dependent Keyword Case Summary LIKE Statement
      assign(sprintf("key%dA_first",x),
             if(nchar(input[[sprintf("key%d_temp_1",x)]],allowNA = TRUE)==0) {" "} else {paste0("WordText like '%", input[[sprintf("key%d_temp_1",x)]], "%'", br())}
      )
      ## All other Dependent Keywords Case Summary LIKE Statements
      assign(sprintf("key%dA_other",x),HTML(
        lapply(2:10, function(i) {
          xy <- input[[sprintf("key%d_temp_%d",x, i)]]
          if (nchar(xy,allowNA = TRUE)>0) paste0("OR WordText like '%", xy, "%'", br())
          else " "
        })#END lapply
      ))
      ## Ending Parenthesis if any dependent keywords
      assign(sprintf("key%dA_end_OR",x),
             if(nchar(input[[sprintf("key%d_temp_1",x)]],allowNA = TRUE)==0) {" "} else {paste0(")",br(),em(sprintf("/* Dependent Keyword(s) for Keyword %d */",x))) }
      )
      ## Ending Parenthesis for entire criteria (Main Keyword + dependent keywords)
      assign(sprintf("key%dA_end",x),
             if(input[[sprintf("key%dA",x)]]=="") {" "} else {HTML(paste0(br(),")")) }
      )
      ## Collating outputs
      HTML(paste0( get(sprintf("key%dA_start",x))
                 , get(sprintf("key%dA_start_OR",x))
                 , get(sprintf("key%dA_first",x))
                 , get(sprintf("key%dA_other",x))
                 , get(sprintf("key%dA_end_OR",x))
                 , get(sprintf("key%dA_end",x))) )
    })#END renderUI
})#END LAPPLY
The session info is as below:
sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 LC_NUMERIC=C LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] scales_0.3.0 ggplot2_2.0.0 RODBC_1.3-12 shinythemes_1.0.1 DT_0.1 shiny_0.13.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.3 digest_0.6.9 mime_0.4 plyr_1.8.3 grid_3.2.3 R6_2.1.1 jsonlite_0.9.19 xtable_1.8-0 gtable_0.1.2 magrittr_1.5 tools_3.2.3
[12] htmlwidgets_0.5 munsell_0.4.2 httpuv_1.3.3 colorspace_1.2-6 htmltools_0.3
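A hedged note on the likely culprit: reading the stack trace bottom-up (renderUI -> assign -> HTML -> paste8 -> lapply), the error occurs while HTML() processes its arguments, and the one place above where HTML() is handed a list rather than a character vector is the key%dA_other assignment, whose inner lapply() returns a list. A sketch of a possible workaround (not a confirmed fix; newer htmltools versions appear stricter about non-character input) is to collapse that list into a single string as a drop-in replacement for that block:
## unlist() the lapply() result and collapse it to one string,
## so HTML() receives a character vector rather than a list
assign(sprintf("key%dA_other", x), HTML(
  paste0(unlist(
    lapply(2:10, function(i) {
      xy <- input[[sprintf("key%d_temp_%d", x, i)]]
      if (nchar(xy, allowNA = TRUE) > 0) paste0("OR WordText like '%", xy, "%'", br())
      else " "
    })
  ), collapse = "")
))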

base::identical() returns TRUE, but the data frames are different

I've got a strange problem in dplyr (probably a bug?), but ran into an even stranger problem when debugging.
The dplyr part of the code already has an issue filed now, but please help me figure out why identical() doesn't detect the differences.
The code (copied from the issue I created on dplyr's GitHub) shows the issue with Swedish letters (å, ä, ö, Å, Ä, Ö) and, as a result, an example where base::identical(x, y) returns TRUE even when data frames x and y are different.
# Script to show how dplyr::select() breaks dplyr::group_by() with Swedish names
library(dplyr)
# Create data frame; column 1's name contains ä (the specific Swedish letters are åäöÅÄÖ)
my_df <- data.frame(användarnamn = letters[1:4], my_numvalues = 1:4,
                    my_text = c("stop","break","my","code"),
                    extra_col = LETTERS[1:4])
# use dplyr::select() to subset columns, then dplyr::group_by
# group_by fails on Swedish column names if the df is subsetted with select().
# If not subsetted or subsetted with [,1:3], everything works
my_df %>% select(1:3) %>% group_by(my_numvalues) # This works
my_df %>% select(1:3) %>% group_by(användarnamn) # This fails
my_df[,1:3] %>% group_by(användarnamn) # This works
my_df %>% group_by(användarnamn) # This works
# Same thing, but step by step
my_df_selected <- select(my_df, 1:3)
group_by(my_df_selected, användarnamn) # This fails
group_by(my_df_selected, my_numvalues) # This works
# and by %>%
my_df_selected %>% group_by(användarnamn) # This fails
my_df_selected %>% group_by(my_numvalues) # This works
# The names of the original df and the subsetted df are identical
identical(names(my_df)[1:3],names(my_df_selected))
# The function base::make.names() doesn't change the names; they're already valid
identical(names(my_df_selected), make.names(names(my_df_selected)))
# copy to a new df to rename
my_df_selected_renamed <- my_df_selected
# rename the df with its own old names passed through make.names()
names(my_df_selected_renamed) <- make.names(names(my_df_selected_renamed))
# The original subsetted df and the renamed df are identical
# according to base::identical()
identical(my_df_selected, my_df_selected_renamed)
# Here's the strange thing, it works now! Why??? I REALLY don't understand!
my_df_selected_renamed %>% group_by(användarnamn) # This works now!
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Swedish_Sweden.1252 LC_CTYPE=Swedish_Sweden.1252
[3] LC_MONETARY=Swedish_Sweden.1252 LC_NUMERIC=C
[5] LC_TIME=Swedish_Sweden.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.4.3
loaded via a namespace (and not attached):
[1] lazyeval_0.1.10 magrittr_1.5 R6_2.1.1 assertthat_0.1 parallel_3.2.2
[6] DBI_0.3.1 tools_3.2.2 Rcpp_0.12.1
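A hedged note on why identical() reports TRUE here: per ?identical, character strings are regarded as identical if they are in different encodings but would agree when translated to UTF-8. So if the only difference between the two data frames is the declared encoding of their names, identical() will not see it. Inspecting the encodings directly may reveal the difference (a diagnostic sketch; the exact values depend on your locale and dplyr version):
# identical() ignores declared-encoding differences, so compare them directly;
# make.names() re-serializes the names in the native encoding, which would
# explain why group_by() works after the rename
Encoding(names(my_df_selected))
Encoding(names(my_df_selected_renamed))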

Splitting a dataframe by group and printing group-specific rows to individual HTML files using pander and rapport

Say I have a tall dataframe with many rows per group, like so:
df <- data.frame(group = factor(rep(c("a","b","c"), each = 5)),
                 v1 = sample(1:100, 15, replace = TRUE),
                 v2 = sample(1:100, 15, replace = TRUE),
                 v3 = sample(1:100, 15, replace = TRUE))
What I want to do is split df into length(levels(df$group)) separate dataframes, e.g.,
df_a <- df[df$group=="a",]; df_b <- df[df$group == "b",] ; ...
And then print each dataframe in a separate HTML/PDF/DOCX file (probably using Rmarkdown and knitr).
I want to do this because I have a large dataframe and want to create a personalized report for each group a, b, c, etc. Thanks.
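For the splitting step itself, base::split() produces all the per-group data frames in one call (the by()/split() answer below uses the same idea); a minimal sketch:
df_list <- split(df, df$group)   # named list of per-group data frames
df_list$a                        # the rows where group == "a"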
Update (11/18/14)
Following @daroczig's advice in this thread and another thread, I attempted to make my own template that would simply print a nicely formatted table of all columns and rows per group, to substitute for the "correlations" template call in the original sapply() function. I want to make my own template rather than just printing the nice table (e.g., the answer @Thomas graciously provided) because I'd like to build additional customization into the template once the simple printing works. Anyway, I've certainly butchered it:
<!--head
meta:
title: Sample Report
author: Nicapyke
description: This is a demo
packages: ~
inputs:
- name: eachgroup
class: character
standalone: TRUE
required: TRUE
head-->
### Records received up to present for Group <%= eachgroup %>
<%=
pandoc.table(df[df$group == eachgroup, ])
%>
Then, after saving that as groupreport.rapport in my working directory, I wrote the following R code, modeled after @daroczig's response:
allgroups <- unique(df$group)
library(rapport)
for (eachgroup in allgroups) {
rapport.docx("FILEPATHHERE", eachgroup = eachgroup)
}
I received the error:
Error in openFileInOS(f.out) : File not found!
I'm not sure what happened. I see from the pander documentation that this means it's looking for a system file, but that doesn't mean much to me. Anyway, this error doesn't get at the root of the problem, which is 1) what should go in the input section of the custom template YAML header, and 2) which R code should go in the rapport template vs. in the R script.
I realize I may be making a number of errors that reveal my lack of experience with rapport and pander. Thanks for your patience!
N.B.:
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.8 dplyr_0.3.0.2 rapport_0.51 yaml_2.1.13 pander_0.5.1
plyr_1.8.1 lattice_0.20-29
loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1 digest_0.6.4 evaluate_0.5.5 formatR_1.0 grid_3.1.2
[7] lazyeval_0.1.9 magrittr_1.0.1 parallel_3.1.2 Rcpp_0.11.3 reshape_0.8.5 stringr_0.6.2
[13] tools_3.1.2
A slightly off-topic, but still R/markdown one-liner for separate reports with report templates:
> library(rapport)
> sapply(levels(df$group), function(g) rapport.html('correlations', data = df[df$group == g, ], vars = c('v1', 'v2', 'v3')))
Exported to */tmp/RtmpYyRLjf/rapport-correlations-1-0.[md|html]* under 0.683 seconds.
Exported to */tmp/RtmpYyRLjf/rapport-correlations-2-0.[md|html]* under 0.888 seconds.
Exported to */tmp/RtmpYyRLjf/rapport-correlations-3-0.[md|html]* under 1.063 seconds.
The rapport package can run (predefined or custom) report templates on any (sub)dataset in markdown, then export it to HTML/docx/PDF/other formats. For a quick demo, I've uploaded the resulting documents:
rapport-correlations-1-0.html
rapport-correlations-2-0.html
rapport-correlations-3-0.html
You can do this with by (or split) and xtable (from the xtable package). Here I create xtable objects of each subset, and then loop over them to print them to file:
library('xtable')
s <- by(df, df$group, xtable)
for(i in seq_along(s)) print(s[[i]], file = paste0('df',names(s)[i],'.tex'))
If you use the stargazer package, you can get a nice summary of the dataframe instead of the dataframe itself in just one line:
library('stargazer')
by(df, df$group, stargazer, out = paste0('df',unique(df$group),'.tex'))
You should be able to easily include each of these files in, e.g., a PDF report. You could also use HTML markup using either xtable or stargazer.
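For instance, with xtable the same loop can emit HTML simply by switching the print type (a sketch; print.xtable accepts type = "html"):
library('xtable')
s <- by(df, df$group, xtable)
for(i in seq_along(s)) print(s[[i]], type = "html", file = paste0('df', names(s)[i], '.html'))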
