Dynamic data exporting using R

> mybrowser$navigate("http://bitcointicker.co/transactions/")
> a <- mybrowser$findElement(using = 'css selector',"#transactionscontainer")
> a
[1] "remoteDriver fields"
$remoteServerAddr
[1] "localhost"
$port
[1] 4444
$browserName
[1] "firefox"
$version
[1] ""
$platform
[1] "ANY"
$javascript
[1] TRUE
$autoClose
[1] FALSE
$nativeEvents
[1] TRUE
$extraCapabilities
list()
[1] "webElement fields"
$elementId
[1] "0"
I am trying to web scrape live data using RSelenium and rvest. I am planning to create a control loop with a timer that runs every minute, but I am struggling with dynamically exporting the data into a folder on my computer. Ideally, R would create one output file and append new rows to it automatically, although I am not sure whether this is possible in R.
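One way to do this is to append each scraped observation to a single CSV on disk. A minimal sketch follows (not code from the thread: the output file name and the column names are assumptions, and it reuses the #transactionscontainer element located above):

out_file <- "bitcoin_transactions.csv"   # assumed output path

repeat {
  elem <- mybrowser$findElement(using = "css selector", "#transactionscontainer")
  row  <- data.frame(timestamp = Sys.time(),
                     text      = unlist(elem$getElementText()),
                     stringsAsFactors = FALSE)
  # Append to the one file; write the header only on the first pass.
  write.table(row, out_file, sep = ",", row.names = FALSE,
              col.names = !file.exists(out_file),
              append = file.exists(out_file))
  Sys.sleep(60)  # wait one minute before the next scrape
}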

Related

Error when downloading multiple pdf files from list of urls in R

I have a list of URLs and am trying to download the PDFs they link to with lapply. Even though the download bar pops up, I get the following message and the files are not downloaded:
only first element of 'destfile' argument used
trying URL 'https://reliefweb.int/sites/reliefweb.int/files/resources/hno_car_2021_final_fr.pdf'
Content type 'application/pdf' length 22087482 bytes (21.1 MB)
downloaded 21.1 MB
names <- lapply(pdf, basename) # get names
destination <- paste0("~/", names)
lapply(pdf, download.file, destfile = destination)
pdf
[[1]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/hno_car_2021_final_fr.pdf"
[[2]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/rnro_centralsahel_oct_2020_fr_web.pdf"
[[3]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/rnro_centralsahel_oct_2020_en_web.pdf"
[[4]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/hno_2020-final.pdf"
[[5]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/hno_light_2020-en_final_0.pdf"
[[6]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/20200701_HNO_CENTROAMERICA%20ADDENDUM%20ING.pdf"
[[7]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/20200706%20ONEPAGER%20HNO%20Centroame%CC%81rica%20ING.pdf"
[[8]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/cmr_hno_2020-revised_print.pdf"
[[9]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/20200616_HNO_CENTROAMERICA%20ADDENDUM.pdf"
I would handle everything in one step; see the code example below for the first two files.
names = c("https://reliefweb.int/sites/reliefweb.int/files/resources/hno_car_2021_final_fr.pdf",
"https://reliefweb.int/sites/reliefweb.int/files/resources/rnro_centralsahel_oct_2020_fr_web.pdf"
)
# making the filenames
downloaded = lapply(names, function(url){
# extract the last part of the url to make the filename
destination = unlist(strsplit(url, '/'))
destination = destination[length(destination)]
destination = paste0 ("~/", destination)
# download the file
download.file(url = url, destfile=destination, mode="wb")
return(destination) # This is optional, just the see where the files are saved
})
# downloaded
# [[1]]
# [1] "~/hno_car_2021_final_fr.pdf"
# [[2]]
# [1] "~/rnro_centralsahel_oct_2020_fr_web.pdf"

Why does sapply of an ordered list output my content twice

I stored a list of file names using this code:
filesList <- list.files(path="/Users/myPath/data/", pattern="*.csv")
I then wanted to output it without the indexes (which usually appear as [1] at the start of each line), so I tried this:
sapply(filesList[order(filesList)], print)
The result is given below, copied exactly from RStudio. Why is my list of files printed twice? I can work with this, I am just curious.
[1] "IMDB_Bottom250movies.csv"
[1] "IMDB_Bottom250movies2_OMDB_Detailed.csv"
[1] "IMDB_Bottom250movies2.csv"
[1] "IMDB_ErrorLogIDs1_OMDB_Detailed.csv"
[1] "IMDB_ErrorLogIDs1.csv"
[1] "IMDB_ErrorLogIDs2_OMDB_Detailed.csv"
[1] "IMDB_ErrorLogIDs2.csv"
[1] "IMDB_OMDB_Kaggle_TestSet_OMDB_Detailed.csv"
[1] "IMDB_OMDB_Kaggle_TestSet.csv"
[1] "IMDB_Top250Engmovies.csv"
[1] "IMDB_Top250Engmovies2_OMDB_Detailed.csv"
[1] "IMDB_Top250Engmovies2.csv"
[1] "IMDB_Top250Indianmovies.csv"
[1] "IMDB_Top250Indianmovies2_OMDB_Detailed.csv"
[1] "IMDB_Top250Indianmovies2.csv"
[1] "IMDB_Top250movies.csv"
[1] "IMDB_Top250movies2_OMDB_Detailed.csv"
[1] "IMDB_Top250movies2.csv"
[1] "TestDoc2_KaggleData_OMDB_Detailed.csv"
[1] "TestDoc2_KaggleData.csv"
[1] "TestDoc2_KaggleData68_OMDB_Detailed.csv"
[1] "TestDoc2_KaggleData68.csv"
[1] "TestDoc2_KaggleDataHUGE_OMDB_Detailed.csv"
[1] "TestDoc2_KaggleDataHUGE.csv"
IMDB_Bottom250movies.csv IMDB_Bottom250movies2_OMDB_Detailed.csv
"IMDB_Bottom250movies.csv" "IMDB_Bottom250movies2_OMDB_Detailed.csv"
IMDB_Bottom250movies2.csv IMDB_ErrorLogIDs1_OMDB_Detailed.csv
"IMDB_Bottom250movies2.csv" "IMDB_ErrorLogIDs1_OMDB_Detailed.csv"
IMDB_ErrorLogIDs1.csv IMDB_ErrorLogIDs2_OMDB_Detailed.csv
"IMDB_ErrorLogIDs1.csv" "IMDB_ErrorLogIDs2_OMDB_Detailed.csv"
IMDB_ErrorLogIDs2.csv IMDB_OMDB_Kaggle_TestSet_OMDB_Detailed.csv
"IMDB_ErrorLogIDs2.csv" "IMDB_OMDB_Kaggle_TestSet_OMDB_Detailed.csv"
IMDB_OMDB_Kaggle_TestSet.csv IMDB_Top250Engmovies.csv
"IMDB_OMDB_Kaggle_TestSet.csv" "IMDB_Top250Engmovies.csv"
IMDB_Top250Engmovies2_OMDB_Detailed.csv IMDB_Top250Engmovies2.csv
"IMDB_Top250Engmovies2_OMDB_Detailed.csv" "IMDB_Top250Engmovies2.csv"
IMDB_Top250Indianmovies.csv IMDB_Top250Indianmovies2_OMDB_Detailed.csv
"IMDB_Top250Indianmovies.csv" "IMDB_Top250Indianmovies2_OMDB_Detailed.csv"
IMDB_Top250Indianmovies2.csv IMDB_Top250movies.csv
"IMDB_Top250Indianmovies2.csv" "IMDB_Top250movies.csv"
IMDB_Top250movies2_OMDB_Detailed.csv IMDB_Top250movies2.csv
"IMDB_Top250movies2_OMDB_Detailed.csv" "IMDB_Top250movies2.csv"
TestDoc2_KaggleData_OMDB_Detailed.csv TestDoc2_KaggleData.csv
"TestDoc2_KaggleData_OMDB_Detailed.csv" "TestDoc2_KaggleData.csv"
TestDoc2_KaggleData68_OMDB_Detailed.csv TestDoc2_KaggleData68.csv
"TestDoc2_KaggleData68_OMDB_Detailed.csv" "TestDoc2_KaggleData68.csv"
TestDoc2_KaggleDataHUGE_OMDB_Detailed.csv TestDoc2_KaggleDataHUGE.csv
"TestDoc2_KaggleDataHUGE_OMDB_Detailed.csv" "TestDoc2_KaggleDataHUGE.csv"
The second copy (without the indexes) is close enough to copy-paste and use, just wondering why this happened?
What is happening here is that sapply calls print on each element of filesList[order(filesList)], and print writes each name to the console (the lines with [1]). RStudio then prints the result of the sapply call itself, which is a named character vector of the same file names (the second, index-free copy). You can use cat to print the values without the [1], or wrap the sapply call in invisible to suppress its return value. https://stackoverflow.com/a/12985020/6490232
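For example, either of these prints the sorted names only once (a small sketch based on the code above):

cat(filesList[order(filesList)], sep = "\n")           # no [1] indices, nothing extra returned
invisible(sapply(filesList[order(filesList)], print))  # keeps print() but hides sapply's return value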

Count number of times a word-wildcard appears in text (in R)

I have a vector of either regular words ("activated") or wildcard words ("activat*"). I want to:
1) Count the number of times each word appears in a given text (i.e., if "activated" appears in text, "activated" frequency would be 1).
2) Count the number of times each word wildcard appears in a text (i.e., if "activated" and "activation" appear in text, "activat*" frequency would be 2).
I'm able to achieve (1), but not (2). Can anyone please help? Thanks.
library(tm)
library(qdap)
text <- "activation has begun. system activated"
text <- Corpus(VectorSource(text))
words <- c("activation", "activated", "activat*")
# Using termco to search for the words in the text
apply_as_df(text, termco, match.list=words)
# Result:
# docs word.count activation activated activat*
# 1 doc 1 5 1(20.00%) 1(20.00%) 0
Is it possible that this might have something to do with the versions? I ran the exact same code (see below) and got the result you expected:
> text <- "activation has begunm system activated"
> text <- Corpus(VectorSource(text))
> words <- c("activation", "activated", "activat")
> apply_as_df(text, termco, match.list=words)
docs word.count activation activated activat
1 doc 1 5 1(20.00%) 1(20.00%) 2(40.00%)
Below is the output when I run R.Version(). I am running this in RStudio Version 0.99.491 on Windows 10.
> R.Version()
$platform
[1] "x86_64-w64-mingw32"
$arch
[1] "x86_64"
$os
[1] "mingw32"
$system
[1] "x86_64, mingw32"
$status
[1] ""
$major
[1] "3"
$minor
[1] "2.3"
$year
[1] "2015"
$month
[1] "12"
$day
[1] "10"
$`svn rev`
[1] "69752"
$language
[1] "R"
$version.string
[1] "R version 3.2.3 (2015-12-10)"
$nickname
[1] "Wooden Christmas-Tree"
Hope this helps
Maybe consider a different approach using the stringi library?
text <- "activation has begun. system activated"
words <- c("activation", "activated", "activat*")
library(stringi)
counts <- unlist(lapply(words, function(word) {
  # turn the "*" wildcard into \p{L}, i.e. "any letter", in the regex pattern
  newWord <- stri_replace_all_fixed(word, "*", "\\p{L}")
  stri_count_regex(text, newWord)
}))
ratios <- counts/stri_count_words(text)
names(ratios) <- words
ratios
Result is:
activation activated activat*
0.2 0.2 0.4
In the code I convert * into \p{L}, which matches any letter in a regex pattern. After that I count the regex occurrences that are found.

Rearranging list into data.frame

I scraped 99 user profiles from forums for my PhD research.
The output is a list with 99 elements. Since each user decides which information to put on his or her profile, each element contains a different number of information snippets.
Here's a sample of the output (I also don't know why the numbering has all these $ and ` signs):
$`77.1`
$`77.1`[[1]]
[1] "Username:"
$`77.1`[[2]]
[1] "*Username*"
$`77.1`[[3]]
[1] "*Username*"
$`77.1`[[4]]
[1] "Rank:"
$`77.1`[[5]]
[1] "*Rank*"
$`77.1`[[6]]
[1] "Groups:"
$`77.1`[[7]]
[1] "*Groups*"
$`77.1`[[8]]
[1] "Location:"
$`77.1`[[9]]
[1] "*Location*"
$`77.1`[[10]]
[1] ""
$`78.1`
$`78.1`[[1]]
[1] "Username:"
$`78.1`[[2]]
[1] "*Username*"
$`78.1`[[3]]
[1] "*Username*"
$`78.1`[[4]]
[1] "Rank:"
$`78.1`[[5]]
[1] "*Rank*"
$`78.1`[[6]]
[1] "Age:"
$`78.1`[[7]]
[1] "*AGE*"
$`78.1`[[8]]
[1] "Groups:"
$`78.1`[[9]]
[1] "*Groups*"
$`78.1`[[10]]
[1]"Interests in history:"
$`78.1`[[11]]
[1] "*Interests*"
$`78.1`[[12]]
[1] "Location:"
$`78.1`[[13]]
[1] "*Location*"
$`78.1`[[14]]
[1] ""
Is there a way to arrange this list into a data frame where each row consists of information from one element?
I tried to arrange them into a matrix, but this doesn't work well because a matrix needs a consistent number of columns, which I don't have.
I would love it to look like this:
Id 1 2 3 4 5 6
1 Username: *Username* Rank *Rank* Groups: *Groups*
2 Username: *Username2* ...
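One common approach is to pad every profile to the same length and then bind the rows. A minimal sketch, assuming the 99-element list is called profiles (a hypothetical name, not from the thread):

n <- max(lengths(profiles))            # the longest profile sets the number of columns

padded <- lapply(profiles, function(p) {
  p <- unlist(p)                       # flatten the [[1]], [[2]], ... structure
  length(p) <- n                       # shorter profiles are padded with NA
  p
})

df <- data.frame(Id = seq_along(padded),
                 do.call(rbind, padded),
                 stringsAsFactors = FALSE)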

R: resourcing code in snow workers

I'm using snow's parApply() to distribute processing tasks to a number of workers on the local machine. The problem is that if I change the code in one of the functions, the workers will not be aware of the changes.
How can I "re-source" the source code files in the workers?
EDIT
I can't call source() on my cluster to re-evaluate all my functions:
> cl = makeSOCKcluster(rep("localhost", 5))
> clusterCall(cl, getwd)
[[1]]
[1] "/home/user"
[[2]]
[1] "/home/user"
[[3]]
[1] "/home/user"
[[4]]
[1] "/home/user"
[[5]]
[1] "/home/user"
> clusterCall(cl, source, 'ets.load.R')
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
5 nodes produced errors; first error: cannot open the connection
Re-source the updated code in each worker using parallel::clusterCall().
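A minimal sketch of that approach, using clusterCall() as in the attempt above. It assumes, which the thread does not confirm, that the "cannot open the connection" error comes from the workers not finding ets.load.R relative to their own working directory:

# Resolve the path on the master so every worker sources the same absolute file;
# a bare relative path fails when the file is not in the workers' working directory.
script <- normalizePath("ets.load.R")
clusterCall(cl, source, script)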
