I want to display data using the mapview package, but when the data include multibyte characters, the map sometimes fails to render.
What is the best way to get the map to display?
test1 <- test2 <- atlStorms2005
test1@data$test <- as.factor(c("日本語", "てすと"))
test2@data$test <- as.factor(c("日本語", "五十嵐"))
mapview(test1) # can show the map
mapview(test2) # cannot show
<- function(data, encoding = "UTF-8", fileEncoding="UTF-8"){
write.csv(data, file("tmp.csv", encoding = encoding), row.names = F, fileEncoding=fileEncoding)
tmp <- readr::read_csv("tmp.csv", col_types = cols())
test2@data <-
mapview(test2) # can show
However, the text shown in the popup for the test column is corrupted, even though the data itself is correct:
# A tibble: 6 × 4
  Name   MaxWind MinPress test
  <chr>    <int>    <int> <chr>
1 ALPHA       45      998 日本語
2 ARLENE      60      989 五十嵐
3 BRET        35     1002 日本語
4 CINDY       65      991 五十嵐
5 DELTA       60      980 日本語
6 DENNIS     130      930 五十嵐

As of commit bc2c57f, this should have been fixed. Until the next CRAN release of mapview, simply use the development version (devtools::install_github("environmentalinformatics-marburg/mapview", ref = "develop")) to solve this issue.
In brief, this behavior was related to our Rcpp routines which run under the hood in order to ensure a computationally efficient creation of popup tables. Here, the user's native encoding was used instead of UTF-8 to create JSON output files, resulting in corrupted text output on some machines where UTF-8 was not the default.
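To illustrate the kind of conversion involved, here is a minimal base R sketch of forcing strings to UTF-8 before they are serialized. It is not mapview's actual fix (that lives in its Rcpp routines), and the variable names are made up for illustration.

```r
# Illustrative strings, written with \u escapes so that the encoding of
# this source file does not matter: "日本語" and "五十嵐".
labels <- c("\u65e5\u672c\u8a9e", "\u4e94\u5341\u5d50")

# enc2utf8() converts from the declared/native encoding to UTF-8
# and marks the result accordingly.
labels_utf8 <- enc2utf8(labels)

Encoding(labels_utf8)   # declared encoding of each element
validUTF8(labels_utf8)  # TRUE where the bytes are valid UTF-8
```

Applying this kind of conversion to character (or factor-level) columns before they are written out avoids depending on the machine's default encoding.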


How to access Youtube Data API v3 with R

I am trying to use R to retrieve data from the YouTube API v3 and there are few/no tutorials out there that show the basic process. I have figured out this much so far:
# YouTube API query
library(httr)
library(stringr)
base_url <- ""
my_yt_search <- function(search_term, max_results = 20) {
  my_api_url <- str_c(base_url, "search?part=snippet&", "maxResults=", max_results, "&", "q=", search_term, "&key=",
                      my_api_key, sep = "")
  result <- GET(my_api_url)
  result  # return the response object
}
my_yt_search(search_term = "salmon")
But I am just getting some general meta-data and not the search results. Help?
PS. I know there is a package 'tuber' out there but I found it very unstable and I just need to perform simple searches so I prefer to code the requests myself.
Sadly, there is no way to get the durations directly from the search results. You need to call the videos endpoint (with part=contentDetails) after the search to get that information. However, you can pass as many as 50 IDs in a single call, so we can save some time by pasting all the IDs together.
library(httr)
library(jsonlite)
library(dplyr)
my_yt_duration <- function(...){
  # Build one request that carries every video ID
  my_api_url <- paste0(base_url, "videos?part=contentDetails", paste0("&id=", ..., collapse=""), "&key=",
                       my_api_key)
  GET(my_api_url) -> resp
  fromJSON(content(resp, "text"))$items %>% as_tibble %>% select(id, contentDetails) -> tb
  tb$contentDetails$duration %>% tibble(id=tb$id, duration=.)
}
### getting the video IDs
my_yt_search(search_term = "salmon")->res
## Converting from JSON then selecting all the video ids
# fromJSON(content(res,as="text") )$items$id$videoId
my_yt_duration(fromJSON(content(res,as="text") )$items$id$videoId)
# A tibble: 20 x 2
id duration
<chr> <chr>
1 -x2E7T3-r7k PT4M14S
2 b0ahREpQqsM PT3M35S
3 ROz8898B3dU PT14M17S
4 jD9VJ92xyzA PT5M42S
5 ACfeJuZuyxY PT3M1S
6 bSOd8r4wjec PT6M29S
7 522BBAsijU0 PT10M51S
8 1P55j9ub4es PT14M59S
9 da8JtU1YAyc PT3M4S
10 4MpYuaJsvRw PT8M27S
11 _NbbtnXkL-k PT2M53S
12 3q1JN_3s3gw PT6M17S
13 7A-4-S_k_rk PT9M37S
14 txKUTx5fNbg PT10M2S
17 4rTMdQzsm6U PT17M24S
18 V9eeg8d9XEg PT10M35S
20 rR9wq5uN_q8 PT4M53S
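If a search ever yields more than 50 IDs, they need to be split into batches before calling the videos endpoint. The following is a minimal base R sketch of that batching step only. It assumes the id parameter accepts a comma-separated list, as the API documentation describes; batch_ids and the sample IDs are hypothetical names made up for illustration.

```r
# Split a character vector of video IDs into comma-separated batches.
# batch_size = 50 reflects the per-call limit mentioned above.
batch_ids <- function(ids, batch_size = 50) {
  groups <- ceiling(seq_along(ids) / batch_size)
  vapply(split(ids, groups), paste, character(1), collapse = ",")
}

# Hypothetical IDs, for illustration only.
ids <- sprintf("video%03d", 1:120)
batches <- batch_ids(ids)

length(batches)  # one element per API call
```

Each element of batches can then be passed as the id value of one request, and the resulting tibbles combined with dplyr::bind_rows().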

Slow wordcloud in R

I'm trying to create a word cloud from a 300MB .csv file of text, but it's taking hours on a decent laptop with 16GB of RAM. I'm not sure how long this should typically take, but here's my code:
dfTemplate <- read.csv("CleanedDescMay.csv", header=TRUE, stringsAsFactors = FALSE)
template <- dfTemplate
template <- Corpus(VectorSource(template))
template <- tm_map(template, removeWords, stopwords("english"))
template <- tm_map(template, stripWhitespace)
template <- tm_map(template, removePunctuation)
dtm <- TermDocumentMatrix(template)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)
png(file="WordCloudDesc1.png", width=1000, height=700, bg="grey30")
wordcloud(d$word, d$freq, col=terrain.colors(length(d$word), alpha=0.9), random.order=FALSE, rot.per = 0.3, max.words=500)
title(main = "Top Template Words", font.main=1, col.main="cornsilk3", cex.main=1.5)
Any advice is appreciated!
Step 1: Profile
Have you tried profiling your full workflow on a small subset to figure out which steps take the most time? RStudio's built-in profiling tools are one way to do this.
If not, that should be your first step.
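As a rough illustration of stage-by-stage timing, here is a minimal base R sketch using system.time(). The sample text and the stages are stand-ins made up for this example, not the tm pipeline from the question; substitute a small sample of the real data and the real steps.

```r
# Hypothetical stand-in for a text column; replace with a sample of your data.
set.seed(1)
vocab <- c("the", "quick", "brown", "fox", "jumps")
sample_text <- vapply(
  1:2000,
  function(i) paste(sample(vocab, 50, replace = TRUE), collapse = " "),
  character(1)
)

# Time each stage separately to see where the time goes.
timings <- list(
  strip_punct = system.time(no_punct <- gsub("[[:punct:]]", " ", sample_text)),
  to_lower    = system.time(lowered  <- tolower(no_punct)),
  tokenize    = system.time(tokens   <- strsplit(lowered, " ", fixed = TRUE)),
  count       = system.time(counts   <- table(unlist(tokens)))
)

# Elapsed seconds per stage, largest first.
sort(vapply(timings, function(t) t[["elapsed"]], numeric(1)), decreasing = TRUE)
```

Whichever stage dominates on the subset is the one worth optimizing first.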
If the tm_map() functions are taking a long time:
If I recall correctly, I found working with stringi to be faster than the dedicated corpus tools.
My workflow wound up looking like the following for the pre-cleaning steps. This could definitely be optimized further -- magrittr pipes %>% do contribute some additional processing time, but I feel like that's an acceptable trade-off for the sanity of not having dozens of nested parentheses.
library(data.table)
library(stringi)
library(magrittr)
## This function handles the processing pipeline
textCleaner <- function(InputText, StopWords, Words, NewWords){
  InputText %>%
    stri_enc_toascii(.) %>%
    toupper(.) %>%
    stri_replace_all_regex(.,"[[:cntrl:]]"," ") %>%
    stri_replace_all_regex(.,"[[:punct:]]"," ") %>%
    stri_replace_all_regex(.,"[[:space:]]+"," ") %>% ## Replaces multiple spaces with a single space
    stri_replace_all_regex(.,"^[[:space:]]+|[[:space:]]+$","") %>% ## Remove leading and trailing spaces
    stri_replace_all_regex(.,"\\b"%s+%StopWords%s+%"\\b","",vectorize_all = FALSE) %>% ## Stopwords
    stri_replace_all_regex(.,"\\b"%s+%Words%s+%"\\b",NewWords,vectorize_all = FALSE) ## Replacements
}
## Replacement Words, I would normally read in a .CSV file
Replace <- data.table(Old = c("LOREM","IPSUM","DOLOR","SIT"),
New = c("I","DONT","KNOW","LATIN"))
## These need to be defined globally
GlobalStopWords <- c("AT","UT","IN","ET","A")
GlobalOldWords <- Replace[["Old"]]
GlobalNewWords <- Replace[["New"]]
## Generate some sample text
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
## Running Single Threaded
DT[,CleanedText := textCleaner(Text, GlobalStopWords,GlobalOldWords, GlobalNewWords)]
# user system elapsed
# 66.969 0.747 67.802
The process of cleaning text is embarrassingly parallel, so in theory you should be able to get some big time savings with multiple cores.
I used to run this pipeline in parallel, but looking back at it today, it turns out that the communication overhead makes this take twice as long with 8 cores as it does single threaded. I'm not sure if this was the same for my original use case, but I guess this may simply serve as a good example of why trying to parallelize instead of optimize can lead to more trouble than value.
## This function handles the cluster creation
## and exporting libraries, functions, and objects
parallelCleaner <- function(Text, NCores){
  cl <- parallel::makeCluster(NCores)
  parallel::clusterEvalQ(cl, library(magrittr))
  parallel::clusterEvalQ(cl, library(stringi))
  parallel::clusterExport(cl, list("textCleaner",
                                   "GlobalStopWords",
                                   "GlobalOldWords",
                                   "GlobalNewWords"))
  Text <- as.character(unlist(parallel::parLapply(cl, Text,
                                                  fun = function(x) textCleaner(x,
                                                                                GlobalStopWords,
                                                                                GlobalOldWords,
                                                                                GlobalNewWords))))
  parallel::stopCluster(cl)
  return(Text)
}
## Run it Parallel
DT[,CleanedText := parallelCleaner(Text = Text,
                                   NCores = 8)]
# user system elapsed
# 6.700 5.099 131.429
If the TermDocumentMatrix(template) is the chief offender:
Update: I mentioned that Drew Schmidt and Christian Heckendorf also submitted an R package named ngram to CRAN recently that might be worth checking out: ngram Github Repository. Turns out I should have just tried it before explaining the really cumbersome process of building a command line tool from source -- this would have saved me a lot of time had it been around 18 months ago!
It is a good deal more memory intensive and not quite as fast -- my memory usage peaked around 31 GB so that may or may not be a deal-breaker for you. All things considered, this seems like a really good option.
For the 500,000 paragraph case, ngram clocks in at around 7 minutes of runtime:
ng1 <- ngram::ngram(DT[["CleanedText"]],n = 1)
ng2 <- ngram::ngram(DT[["CleanedText"]],n = 2)
ng3 <- ngram::ngram(DT[["CleanedText"]],n = 3)
pt1 <- setDT(ngram::get.phrasetable(ng1))
pt1[,Ngrams := 1L]
pt2 <- setDT(ngram::get.phrasetable(ng2))
pt2[,Ngrams := 2L]
pt3 <- setDT(ngram::get.phrasetable(ng3))
pt3[,Ngrams := 3L]
pt <- rbindlist(list(pt1,pt2,pt3))
# user system elapsed
# 411.671 12.177 424.616
pt[Ngrams == 2][order(-freq)][1:5]
# ngrams freq prop Ngrams
# 1: SED SED 75096 0.0018013693 2
# 2: AC SED 33390 0.0008009444 2
# 3: SED AC 33134 0.0007948036 2
# 4: SED EU 30379 0.0007287179 2
# 5: EU SED 30149 0.0007232007 2
You can try using a more efficient ngram generator. I use a command line tool called ngrams (available on GitHub) by Zheyuan Yu, a partial implementation of Dr. Vlado Keselj's Text-Ngrams 1.6, to take pre-processed text files off disk and generate a .csv output with ngram frequencies.
You'll need to build it from source yourself using make and then interface with it using system() calls from R, but I found it to run orders of magnitude faster while using a tiny fraction of the memory. Using it, I was able to generate 5-grams from ~700MB of text input in well under an hour; the CSV result with all the output was a 2.9 GB file with 93 million rows.
Continuing the example above, I have a folder in my working directory, ngrams-master, that contains the ngrams executable created with make.
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
I think I may have made a couple of tweaks to get the output format how I wanted it. If you're interested, I can try to find the changes I made to generate .csv outputs that differ from the default and upload them to Github. (I did that project before I was familiar with the platform, so I don't have a good record of the changes I made. Live and learn.)
Update 2: I created a fork on Github, msummersgill/ngrams that reflects the slight tweaks I made to output results in a .CSV format. If someone was so inclined, I have a hunch that this could be wrapped up in a Rcpp based package that would be acceptable for CRAN submission -- any takers? I honestly have no clue how Ternary Search Trees work, but they seem to be significantly more memory efficient and faster than any other N-gram implementation currently available in R.
Drew Schmidt and Christian Heckendorf also submitted an R package named ngram to CRAN, I haven't used it personally but it might be worth checking out as well: ngram Github Repository.
The Whole Shebang:
Using the same pipeline described above but with a size closer to what you're dealing with (ExampleText.txt comes out to ~274MB):
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
DT[,CleanedText := textCleaner(Text, GlobalStopWords,GlobalOldWords, GlobalNewWords)]
# user system elapsed
# 66.969 0.747 67.802
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
While the example may not be a perfect representation due to the limited vocabulary generated by stringi::stri_rand_lipsum(), the total run time of ~4.2 minutes using less than 8 GB of RAM on 500,000 paragraphs has been fast enough for the corpuses (corpi?) I've had to tackle in the past.
If wordcloud() is the source of the slowdown:
I'm not familiar with this function, but @Gregor's comment on your original post seems like it would take care of this issue.
GramSubset <- Grams[Ngrams == 2][1:500]
wordcloud(GramSubset[["Token"]],GramSubset[["Frequency"]],color = GramSubset[["Frequency"]],
rot.per = 0.3,font.main=1, col.main="cornsilk3", cex.main=1.5)
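If only single-word frequencies are needed for the cloud, they can also be counted directly from the cleaned text, without building a term-document matrix at all. Here is a minimal base R sketch, assuming the text has already been cleaned into space-separated upper-case words as in the pipeline above. The sample strings and the word_freq name are made up for illustration.

```r
# Hypothetical cleaned text, for illustration only.
cleaned <- c("SED AC SED EU", "EU SED AC", "AC SED")

# Split on single spaces and drop any empty tokens.
words <- unlist(strsplit(cleaned, " ", fixed = TRUE))
words <- words[nzchar(words)]

# Count, sort, and keep at most the 500 most frequent words.
freq <- sort(table(words), decreasing = TRUE)
top <- head(freq, 500)

word_freq <- data.frame(word = names(top), freq = as.integer(top))
word_freq
```

The word and freq columns can then be passed to wordcloud() in place of the matrix-derived d$word and d$freq from the question.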

Need to use jsonlite to handle ndjson message list using stream_in() and stream_out()

I have an ndjson data source. For a simple example, consider a text file with three lines, each containing a valid json message. I want to extract 7 variables from the messages and put them in a dataframe.
Please use the following sample data in a text file. You can paste this data into a text editor and save it as "ndjson_sample.txt"
The following three lines of code accomplish what I want to do:
file1 <- "ndjson_sample.txt"
json_data1 <- ndjson::stream_in(file1)
raw_df_temp1 <-$ct))
For reasons I won't get into, I cannot use the ndjson package. I must find a way to use the jsonlite package to do the same thing using the stream_in() and stream_out() functions. Here's what I tried:
con_in1 <- file(file1, open = "rt")
con_out1 <- file(tmp <- tempfile(), open = "wt")
callback_func <- function(df){
  jsonlite::stream_out(df, con_out1, pagesize = 1)
}
jsonlite::stream_in(con_in1, handler = callback_func, pagesize = 1)
con_in2 <- file(tmp, open = "rt")
raw_df_temp2 <- jsonlite::stream_in(con_in2)
This is not giving me the same data frame as a final output. Can you tell me what I'm doing wrong and what I have to change to make raw_df_temp1 equal raw_df_temp2?
I could potentially solve this with the fromJSON() function operating on each line of the file, but I'd like to find a way to do it with the stream functions. The files I will be dealing with are quite large, so efficiency will be key. I need this to be as fast as possible.
Thank you in advance.
Currently under ct you'll find a string that can (subsequently) be fed to fromJSON independently, but it will not be parsed as such. Ignoring your stream_out(stream_in(...),...) test, here are a couple of ways to read it in:
json <- stream_in(file('ds_guy.ndjson'), simplifyDataFrame=FALSE)
# opening file input connection.
# Imported 3 records. Simplifying...
# closing file input connection.
cbind(
  ts = sapply(json, `[[`, "ts"),
  do.call(rbind.data.frame, lapply(json, function(a) fromJSON(a$ct)))
)
# ts Var1 Var2 Var3 Var4 Var5 Var6 Var7
# 1 1 6 6 -70 12353 1 abc x
# 2 2 6 6 -68 4528 1 def y
# 3 3 6 6 -70 -5409 1 ghi z
Calling fromJSON on each string might be cumbersome, and with larger data this slow-down is why there is stream_in, so if we can capture the "ct" component into a stream of its own, then ...
writeLines(sapply(json, `[[`, "ct"), 'ds_guy2.ndjson')
(There are far-more-efficient ways to do this with non-R tools, including perhaps a simple
sed -e 's/.*"ct":"\({.*\}\)"}$/\1/g' -e 's/\\"/"/g' ds_guy.ndjson > ds_guy2.ndjson
though this makes a few assumptions about the data that may not be perfectly safe. A better solution would be to use jq, which should "always" correctly parse proper json. Its -r (raw output) flag prints the string unquoted and unescaped, so no follow-up sed is needed:
jq -r '.ct' ds_guy.ndjson > ds_guy2.ndjson
and you can do that with system(...) in R if needed.)
From there, under the assumption that each line will contain exactly one row of data.frame data:
json2 <- stream_in(file('ds_guy2.ndjson'), simplifyDataFrame=TRUE)
# opening file input connection.
# Imported 3 records. Simplifying...
# closing file input connection.
cbind(ts=sapply(json, `[[`, "ts"), json2)
# ts Var1 Var2 Var3 Var4 Var5 Var6 Var7
# 1 1 6 6 -70 12353 1 abc x
# 2 2 6 6 -68 4528 1 def y
# 3 3 6 6 -70 -5409 1 ghi z
NB: in the first example, "ts" is a factor, all others are character because that's what fromJSON gives. In the second example, all strings are factor. This can easily be addressed through judicious use of stringsAsFactors=FALSE, depending on your needs.
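The intermediate file can also be skipped by feeding the "ct" strings straight back into stream_in() through a text connection. The following is a minimal sketch of that idea. The three sample lines are hypothetical, shaped like the data described above (a "ts" field plus a "ct" field holding a JSON string), and only two of the seven variables are included for brevity.

```r
library(jsonlite)

# Hypothetical ndjson records, for illustration only.
lines <- c(
  '{"ts":"1","ct":"{\\"Var1\\":\\"6\\",\\"Var7\\":\\"x\\"}"}',
  '{"ts":"2","ct":"{\\"Var1\\":\\"6\\",\\"Var7\\":\\"y\\"}"}',
  '{"ts":"3","ct":"{\\"Var1\\":\\"6\\",\\"Var7\\":\\"z\\"}"}'
)

# First pass: parse the outer records; "ct" comes back as a string column.
con_outer <- textConnection(lines)
outer <- stream_in(con_outer, verbose = FALSE)
close(con_outer)

# Second pass: treat the "ct" strings themselves as an ndjson stream.
con_inner <- textConnection(outer$ct)
inner <- stream_in(con_inner, verbose = FALSE)
close(con_inner)

result <- cbind(ts = outer$ts, inner, stringsAsFactors = FALSE)
result
```

For a real file, the first textConnection(lines) would be replaced by file("ndjson_sample.txt"). Whether this is faster than the external jq route on large inputs would need to be measured.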

Extract table from

I would like to extract the following table using rvest from (for any date):
I tried the following but failed to produce any result:
url <- ""
htmlSession <-html_session(url) ## create session
goForm <- html_form(htmlSession)[[2]] ## pull form from session
#filledGoForm <- set_values(goForm, value="04/26/2017") # This does not work
filledGoForm <- goForm
filledGoForm$fields[[1]]$value <- "04/26/2017"
htmlSession <- submit_form(htmlSession, filledGoForm)
> htmlSession <- submit_form(htmlSession, filledGoForm)
Submitting with ''
Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode, :
Not Found (HTTP 404).
Any hints on how to do this would be highly appreciated.
That site uses many XHR requests to populate the tables. And, it establishes a server session with a hidden POST request which won't be replicated with html_session().
We'll need to add in httr for some help:
The first thing we need to do is to just hit the site to get an initial qs_wid cookie into the implicit cookie jar curl/httr/rvest share:
init <- GET("")
Next, we need to mimic the hidden "login" that the web page does:
nxt <- POST(url = "",
body = list(redirectPage = "/BondCenter/TRACEMarketAggregateStats.jsp"),
encode = "form")
That creates a session on the server back-end and places a few other cookies in our cookie jar.
GET(
  url = "",
  query = list(
  )
) -> res
makes the request. You can make a function out of all three steps (together) and parameterize that last GET.
Unfortunately, that returns a very broken HTML <table> that html_table() can't translate into a data frame automagically for you, but that shouldn't stop you:
content(res) %>%
html_nodes("td") %>%
html_text() %>%
matrix(ncol=4, byrow=TRUE) %>%
as_data_frame() %>%
mutate_all(as.numeric) %>%
rename(all_issues=V1, investment_grade=V2, high_yield=V3, convertible=V4) %>%
mutate(category = c("total_issues_traded", "advances", "declines", "unchanged", "high_52", "low_52", "dollar_volume"))
## # A tibble: 7 × 5
## all_issues investment_grade high_yield convertible category
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 7983 5602 2194 187 total_issues_traded
## 2 3025 1798 1100 127 advances
## 3 4448 3575 824 49 declines
## 4 124 42 75 7 unchanged
## 5 257 66 175 16 high_52
## 6 139 105 33 1 low_52
## 7 22601 16143 5742 715 dollar_volume
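The reshaping step is independent of the scraping itself, so it can be tried offline. Below is a minimal base R sketch of turning a flat vector of cell text into a four-column data frame. The cells vector is a stand-in for what html_text() would return and uses the first two rows of the output above.

```r
# Cell text in document order (values from the first two rows shown above).
cells <- c("7983", "5602", "2194", "187",
           "3025", "1798", "1100", "127")

# Fill row by row, four cells per table row.
m <- matrix(cells, ncol = 4, byrow = TRUE)

df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- c("all_issues", "investment_grade", "high_yield", "convertible")
df[] <- lapply(df, as.numeric)  # convert every column from text to numbers
df$category <- c("total_issues_traded", "advances")
df
```

The byrow = TRUE argument is what makes each group of four consecutive cells become one row rather than one column.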
To get the other data tables, go to the Developer Tools option in your browser (switch to one that has it if yours doesn't … you're likely on Windows given that you're doing finance things and IE/Edge aren't very good browsers for introspection) and refresh the page to see the other requests that get made.

extracting Australia BOM weather data programmatically with R

Here I can enter a substation number, say 009572; choose the variable (say Temperature) and its type (say Maximum). Clicking "get data" brings me to a page with a link "All years of data". Click it, and you get a zip file. I am aware of this question, but here I don't have a direct link to a zip file. Can something be done to automate weather data extraction from the Australian Bureau Of Meteorology website with R?
I had the same question and this S.O. question was one of the first pages to come up. After further searching I found the R package bomrang, which:
Provides functions to interface with Australian Government Bureau of
Meteorology (BOM) data, fetching data and returning a tidy data frame
of précis forecasts, current weather data from stations, ag
information bulletins, historical weather data and downloading and
importing radar or satellite imagery.
Bomrang is a part of rOpenSci and is actively developed. It has a good set of functions:
Several functions are provided by bomrang to retrieve Australian
Bureau of Meteorology (BOM) data. A family of functions retrieve
weather data and return tidy data frames;
get_precis_forecast(), which retrieves the précis (short) forecast;
get_current_weather(), which fetches the current weather for a given station;
get_ag_bulletin(), which retrieves the agriculture bulletin;
get_weather_bulletin(), which retrieves the BOM 0900 or 1500 bulletins;
get_coastal_forecast(), which returns coastal waters forecasts; and
get_historical(), which retrieves historical daily observations for a given station.
A second group of functions retrieve information pertaining to
satellite and radar imagery,
the satellite imagery itself, get_satellite_imagery();
get_available_radar(); and
the radar imagery itself, get_radar_imagery().
The function get_historical() seems to do what OP is needing. For example, to get the historical daily rainfall from a weather station in Sydney is as easy as:
> rain_066062 <- bomrang::get_historical(stationid = 066062,
+ type = 'rain',
+ meta = T)
> head(rain_066062)
# A tibble: 1 x 10
site name lat lon start end years percent AWS ncc_obs_code
<int> <chr> <dbl> <dbl> <date> <date> <dbl> <int> <chr> <chr>
1 66062 SYDNEY (OBSERVATORY HILL) -33.9 151. 1858-07-01 2018-11-01 160. 100 Y 136
Product_code Station_number Year Month Day Rainfall Period Quality
1 IDCJAC0009 66062 1858 1 1 NA NA
2 IDCJAC0009 66062 1858 1 2 NA NA
3 IDCJAC0009 66062 1858 1 3 NA NA
4 IDCJAC0009 66062 1858 1 4 NA NA
5 IDCJAC0009 66062 1858 1 5 NA NA
Another nice feature is if you have the longitude and latitude of a place of interest, get_historical() will find the nearest weather station to that location.
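One detail worth noting: R reads a bare number such as 066062 as 66062, so the leading zero is dropped. If station numbers are stored as numbers and the six-character form is needed (as on the BOM web form, e.g. 009572), they can be zero-padded. Here is a small base R sketch. Whether a given bomrang function requires the padded form is an assumption to check against its documentation.

```r
# Station numbers stored as plain numbers (leading zeros already lost).
station_numbers <- c(9572, 66062)

# Pad to six characters with leading zeros.
station_ids <- formatC(station_numbers, width = 6, format = "d", flag = "0")
station_ids
```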
To install from CRAN:
install.packages("bomrang")
Or install the development version from Github:
if (!require("remotes")) {
  install.packages("remotes", repos = "")
}
remotes::install_github("ropensci/bomrang", build_vignettes = TRUE)
Here's the code I wrote to download the data directly; it also resolves your p_c problem. Feel free to improve the function and post your version.
#daily code = 136
#monthly code = 139
bomdata<- function(station,code){
  for(i in 1: length(station)){
    p.url<-paste("",station[i],"&p_display_type=availableYears&p_nccObsCode=",code,sep ="")
    filelist <- list.files(pattern = ".txt")
    foo<- file(filelist,"r")
    text<- suppressWarnings(readLines(foo))
    # the p_c value sits between the first ":" and the second "," of the first line
    l<- regexpr(":",text[1])
    m<- unlist(gregexpr(",", text[1], perl = TRUE))
    pc<- substr(text[1],l[[1]]+1,l[[1]]+(m[2]-(l[[1]]+1)))
    url<-paste("",station[i],"&p_c=",pc,"&p_nccObsCode=",code,"&p_startYear=2013", sep ="")
    suppressWarnings(download.file(url,paste(station[i],".zip",sep= ""), mode = "wb"))
  }
}
You can try this; it is a code sequence used by the metvurst package.
url <- ""
yr <- 1993:2012
fijilst <- lapply(seq(yr), function(i) {
  read.csv(paste(url, yr[i], ".csv", sep = ""), na.strings = c(-9999, 999))
})
While I still can't see how to do this with download.file(), the following almost does the job provided Chrome's "Ask where to save each file before downloading" is unticked.
system(paste('"C:/Documents and Settings/UserName/Local Settings/Application Data/Google/Chrome/Application/chrome.exe"',
'-url'), wait = FALSE)
Then I could use paste0() and loop through various station numbers if I knew what p_c=-18465084 means and how it changes from station to station.
