R Web scraping data from StockTwits website

I want to get some information from tweets posted on the platform StockTwits. Here you can see an example tweet: https://stocktwits.com/3726859/message/469518468
I have already asked the same question once (R How to web scrape data from StockTwits with RSelenium?), but the StockTwits website has been changed and I can no longer work with the same html_nodes() command.
I would therefore be very happy if someone could help me with the input to the html_nodes() function.
I would like to read the following information: number of replies, number of reshares, number of likes.
This is how far I have got:
library(rvest)
read_html("https://stocktwits.com/SunAndStorm/message/499613811") |>
html_nodes()
The final result should be a dataframe which should look like this:
# A tibble: 1 × 5
  Reply Reshare  Like Share Search
  <int>   <int> <int> <int>  <int>
1     5       0     1     0      0

Look into the network section in the developer tools and you'll find their API. Call it with a tweet ID of interest.
I composed a start for you here. I couldn't find reshares and search, but I am sure they are in there somewhere. Since you have thousands of tweets to gather info on, this method is more efficient.
library(tidyverse)
library(httr2)

get_stockwits <- function(id) {
  # hit the public conversation endpoint for the given message id
  data <-
    str_c("https://api.stocktwits.com/api/2/messages/", id, "/conversation.json?limit=21") %>%
    request() %>%
    req_perform() %>%
    resp_body_json(simplifyVector = TRUE)

  tibble(
    tweet = data %>%
      getElement("message") %>%
      getElement("body"),
    reply = data %>%
      getElement("message") %>%
      getElement("conversation") %>%
      getElement("replies"),
    likes = data %>%
      getElement("message") %>%
      getElement("likes") %>%
      getElement("total"),
    comments = data %>%
      getElement("children") %>%
      getElement("messages") %>%
      getElement("body")
  ) %>%
    nest(comments = comments)
}

get_stockwits(469518468)
# A tibble: 1 x 4
tweet reply likes comments
<chr> <int> <int> <list>
1 $GME going back in all this month 5 1 <tibble [2 x 1]>
Unnest the comments column to see the individual comments:
get_stockwits(469518468) %>%
unnest(comments)
# A tibble: 2 x 4
tweet reply likes comments
<chr> <int> <int> <chr>
1 $GME going back in all this month 5 1 #okkenny yeah with options
2 $GME going back in all this month 5 1 #okkenny playing monthly only
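Since you have thousands of tweets, one way to scale this up is to map the function over all IDs; a minimal sketch, where ids is a hypothetical vector of message IDs you have collected (map_dfr() comes with the tidyverse already loaded above):
# `ids` is a hypothetical vector of StockTwits message IDs
ids <- c(469518468, 499613811)

all_tweets <- map_dfr(ids, ~ {
  Sys.sleep(0.5) # pause between requests to be polite to the API
  get_stockwits(.x)
})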

I do not use html nodes, but find the element with an XPath. The following code gives you the information you need:
library(RSelenium)

url <- "https://stocktwits.com/SunAndStorm/message/499613811"

# Set up driver
driver <- rsDriver(browser = "firefox", chromever = NULL)
remDr <- driver[["client"]]

# Go to site
remDr$navigate(url)

# Extract information using xpath
info <- remDr$findElement(using = "xpath", "/html/body/div[2]/div/div[2]/div[2]/div[2]/div/div/div/div[1]/div[1]/div/div[2]/article/div/div[5]")
Then you can use getElementText() to get the information:
> info$getElementText()
[[1]]
[1] "4Comments\n0Reshares\n7Likes"
If you need help converting this string to a dataframe let me know and I can help you out, but I assume this is not the main problem.
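For completeness, a minimal sketch of that conversion, assuming the string has the "4Comments\n0Reshares\n7Likes" shape shown above:
txt <- info$getElementText()[[1]] # "4Comments\n0Reshares\n7Likes"
parts <- strsplit(txt, "\n")[[1]]
counts <- as.integer(gsub("\\D", "", parts)) # keep the digits: 4 0 7
labels <- gsub("\\d", "", parts) # keep the labels: "Comments" "Reshares" "Likes"
setNames(as.data.frame(t(counts)), labels)
#   Comments Reshares Likes
# 1        4        0     7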
Kind regards

Related

select for rows that don't have a string

I have a df of lot #'s with all of the data associated with them. Some of that data is experimental. Those lot #'s start with X. For example, X42A7299, where any normal lot would be 42A7299. I want to exclude those rows. The DF is called all_cls4. Here is the code I have tried:
all_cls4new<- all_cls4 %>% filter(!str_detect(Lot_#, ^X))
this just returns a + (R's continuation prompt; the # in Lot_# starts a comment, so the command is left incomplete).
I also get this result with filter and !grep. What am I missing?
library(dplyr)
library(stringr)
x <- tribble(
  ~lot,       ~other_data,
  "X42A7299", 45,
  "42A7299",  100
)
x %>%
filter(!(str_detect(lot, '^X')))
#> # A tibble: 1 × 2
#> lot other_data
#> <chr> <dbl>
#> 1 42A7299 100
Also, be careful with a symbol in your column name (e.g. Lot_#). I would rename it to a "clean" name (e.g. snake case); janitor::clean_names() is useful for this. If you use it as is, you will have to wrap it in backticks:
x %>%
filter(!(str_detect(`Lot_#`, '^X')))
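For illustration, a small sketch of the clean_names() route, using a hypothetical tibble y with the awkward column name (with janitor's default replacements, # should become "number"):
library(janitor)

y <- tibble(`Lot_#` = c("X42A7299", "42A7299"), other_data = c(45, 100))

y %>%
  clean_names() %>% # `Lot_#` should become lot_number
  filter(!str_detect(lot_number, '^X'))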

Webscraping using R - Table content

New to webscraping. I am trying to scrape specific data from websites.
For example: https://www.vesselfinder.com/vessels/KOTA-CARUM-IMO-9494577-MMSI-563150100
I need to scrape the distance the ship has travelled in 2020 and 2021.
shipws <- read_html(shipsite)
The above code gets me the site. shipsite is the url.
Now, I tried using,
a <- shipws %>%
html_nodes( css = "_1hFrZ") %>%
html_attr()
But it returns an empty result. _1hFrZ was the td class on the website. It returns empty when I use html_text() too.
a <- shipsite %>%
html() %>%
html_nodes(xpath='//*[@id="tbc1"]/div[1]/div[1]/table') %>%
html_table()
A few tutorials asked me to do it the above way, but that turned up errors saying the html() function does not exist, and removing html() did not help either.
Would love to know where I am going wrong. Thank you.
We can just get all the tables from the website with:
library(rvest)

df <- 'https://www.vesselfinder.com/vessels/KOTA-CARUM-IMO-9494577-MMSI-563150100' %>%
  read_html() %>%
  html_table()
The table of interest is,
df[[2]]
# A tibble: 4 x 2
X1 X2
<chr> <int>
1 Travelled distance (nm) 98985
2 Port Calls 54
3 Average / Max Speed (kn) NA
4 Min / Max Draught (m) NA
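To pull just the travelled-distance figure out of that table, a small sketch (assuming dplyr is loaded for filter() and pull()):
library(dplyr)

df[[2]] %>%
  filter(X1 == "Travelled distance (nm)") %>%
  pull(X2)
#> [1] 98985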

R for loop to extract info from a file and add it into tibble?

I am not great with tidyverse so forgive me if this is a simple question. I have a bunch of files with data that I need to extract and add into distinct columns in a tibble I created.
I want the row names to start with the file IDs, which I did manage to create:
filelist <- list.files(pattern=".txt") # Gives me the filenames in the current directory.
# The filenames are something like AA1230.report.txt for example
file_ID <- trimws(filelist, whitespace="\\..*") # Gives me the ID which is before the "report.txt"
metadata <- as_tibble(file_ID[1:181]) # create dataframe with IDs as row names for 180 files.
Now in these report files are information on species and abundance (kraken report files for those familiar with kraken) and all I need is to extract the number of reads for each domain. I can easily search up in each file the domains and number of reads that fall into that domain using something like:
sample_data <- as_tibble(read.table("AA1230.report.txt", sep="\t", header=FALSE, strip.white=TRUE))
sample_data <- rename(sample_data, Percentage=V1, Num_reads_root=V2, Num_reads_taxon=V3, Rank=V4, NCBI_ID=V5, Name=V6) # Just renaming the column headers for clarity
sample_data %>% filter(Rank=="D") # D for domain
This gives me a clear output such as:
  Percentage Num_Reads_Root Num_Reads_Taxon Rank  NCBI_ID Name
       <dbl>          <int>           <int> <fct>   <int> <fct>
1      75.9           60533              28 D           2 Bacteria
2       0.48            386               0 D        2759 Eukaryota
3       0.01              4               0 D        2157 Archaea
4       0.02             19               0 D       10239 Viruses
Now, I want to just grab the info in the second column and final column and save this info into my tibble so that I can get something like:
> metadata
value Bacteria_Counts Eukaryota_Counts Viruses_Counts Archaea_Counts
<chr> <int> <int> <int> <int>
1 AA1230 60533 386 19 4
2 AB0566
3 AA1231
4 AB0567
5 BC1148
6 AW0001
7 AW0002
8 BB1121
9 BC0001
10 BC0002
....with 171 more rows
I'm just having trouble coming up with a for loop to create these sample_data outputs and then, from those, extract the info and place it into a tibble. I guess my first loop should create the sample_data outputs, so something like:
for (files in file.list()) {
>> get_domains <<
}
Then another loop to extract that info from the above loop and insert it into my metadata tibble.
Any suggestions? Thank you so much!
PS: If regular dataframes in R are better for this, let me know. I have only recently learned that the tidyverse is a better way to organize dataframes in R, but I still have more to learn about it.
You could also do:
library(tidyverse)

filelist <- list.files(pattern = ".txt")
nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")

set_names(filelist, filelist) %>%
  map_dfr(read_table, col_names = nms, .id = 'file_ID') %>%
  filter(Rank == 'D') %>%
  select(file_ID, Name, Num_reads_root) %>%
  pivot_wider(id_cols = file_ID, names_from = Name, values_from = Num_reads_root) %>%
  mutate(file_ID = str_remove(file_ID, '\\..*$')) # strip ".report.txt", leaving just the ID
I've found that using a for loop is nice sometimes because it saves all the progress along the way in case you hit an error. Then you can find the problem file and debug it, or wrap the read in try() and throw a warning().
library(tidyverse)

filelist <- list.files(pattern = ".txt") # list files
tmp_list <- list()

for (i in seq_along(filelist)) {
  my_table <- read_tsv(filelist[i], col_names = FALSE) %>% # the files look like headerless .tsv's
    rename(Percentage = X1, Num_reads_root = X2, Num_reads_taxon = X3,
           Rank = X4, NCBI_ID = X5, Name = X6) %>%
    filter(Rank == "D") %>%
    mutate(file_ID = trimws(filelist[i], whitespace = "\\..*")) %>%
    select(file_ID, everything())
  tmp_list[[i]] <- my_table
}

out <- bind_rows(tmp_list)
out
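If you then want the wide per-sample layout from the question (one row per file, one count column per domain), a minimal sketch building on out:
out %>%
  select(file_ID, Name, Num_reads_root) %>%
  pivot_wider(names_from = Name, values_from = Num_reads_root,
              names_glue = "{Name}_Counts")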

R list save as quoted list

I want to save the recommenderlab predict() output as a quoted, comma-separated list. I have one question in place for the same, but here I want to extend it with a twist.
I have already tried a few approaches and found the one below relevant, but I am stuck on a simple step: putting the output into a quoted, comma-separated string.
library("recommenderlab")
library(stringi)
data("MovieLense")
MovieLense100 <- MovieLense[rowCounts(MovieLense) >100,]
MovieLense100
train <- MovieLense100[1:50]
rec <- Recommender(train, method = "UBCF")
rec
pre <- predict(rec, MovieLense100[101:105], n = 10)
as(pre, "list")
list1 = as(pre, "list")
cat(paste0(shQuote(list1[["291"]]),collapse=","))
The above gives me, for a given user:
"Titanic (1997)","Contact (1997)","Alien (1979)","Amadeus (1984)","Godfather, The (1972)","Aliens (1986)","Sting, The (1973)","American Werewolf in London, An (1981)","Schindler's List (1993)","Glory (1989)"
I want to put the user and movies in a dataframe where the first column will be the user and the second column will be the movies in the above concatenated form.
Given that paste0(shQuote(list1[["291"]]), collapse=",") produces the string of movie recommendations (note that cat() only prints it and returns NULL, so don't assign its result), one could do the following to turn this into a data frame tagged with a name:
movies <- paste0(shQuote(list1[["291"]]), collapse = ",")
theData <- data.frame(name = "Santhosh", movies, stringsAsFactors = FALSE)
Another approach would be to save each movie as a separate column in the output data frame, which would make it easier to use the data in R without having to parse the movie list multiple times. The tidyverse (i.e. tidyr and dplyr) can be used to produce this data frame.
library(tidyr)
library(dplyr)

recommendedMovies <- c("Titanic (1997)","Contact (1997)","Alien (1979)","Amadeus (1984)","Godfather, The (1972)","Aliens (1986)","Sting, The (1973)","American Werewolf in London, An (1981)","Schindler's List (1993)","Glory (1989)")

theData <- data.frame(name = "Santhosh",
                      rank = seq_along(recommendedMovies),
                      movies = recommendedMovies,
                      stringsAsFactors = FALSE)

theData %>%
  group_by(name) %>%
  spread(., rank, movies, sep = "movie")
...and the output:
> theData %>% group_by(name) %>%
+ spread(.,rank,movies,sep="movie")
# A tibble: 1 x 11
# Groups: name [1]
name rankmovie1 rankmovie2 rankmovie3 rankmovie4 rankmovie5 rankmovie6 rankmovie7 rankmovie8 rankmovie9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Sant… Titanic (… Contact (… Alien (19… Amadeus (… Godfather… Aliens (1… Sting, Th… American … Schindler…
# ... with 1 more variable: rankmovie10 <chr>
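Note that spread() has since been superseded; a sketch of the same reshape with tidyr::pivot_wider():
theData %>%
  pivot_wider(names_from = rank, values_from = movies,
              names_prefix = "rankmovie")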

How to scrape multiple tables that are without IDs or Class using R

I'm trying to scrape this webpage using R : http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all (All the pages)
I'm new to programming, and everywhere I've looked tables are mostly identified with IDs, divs, or classes. On this page there are none; the data is just stored in table format. How should I scrape it?
This is what I did:
library(rvest)
webpage <- read_html("http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all")
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[9:10] %>%
  html_table(fill = TRUE)

colnames(tbls_ls[[1]]) <- c("Mobile Make", "State", "District",
                            "Police Station", "Status", "Mobile Type(GSM/CDMA)",
                            "FIR/DD/GD Date")
You can scrape the table data by targeting the css id of each table. It looks like each page is composed of 3 different tables pasted one after another. Two of the tables have #AutoNumber15 css id while the third (in the middle) has the #AutoNumber16 css id.
I put together a simple code example that should get you started in the right direction.
suppressMessages(library(tidyverse))
suppressMessages(library(rvest))

# define function to scrape the table data from a page
get_page <- function(page_id = 1) {
  # default link
  link <- "http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all&Page_No="
  # build link
  link <- paste0(link, page_id)

  # get tables data
  wp <- read_html(link)
  wp %>%
    html_nodes("#AutoNumber16, #AutoNumber15") %>%
    html_table(fill = TRUE) %>%
    bind_rows()
}

# get the data from the first three pages
iter_page <- 1:3

# this is just a progress bar
pb <- progress_estimated(length(iter_page))

# this code will iterate over pages 1 through 3 and apply the get_page()
# function defined earlier. The Sys.sleep() part is used to pause the code
# after each iteration so that the server is not overloaded with requests.
map_df(iter_page, ~ {
  pb$tick()$print()
  df <- get_page(.x)
  Sys.sleep(sample(10, 1) * 0.1)
  as_tibble(df)
})
#> # A tibble: 72 x 4
#> X1 X2 X3
#> <chr> <chr> <chr>
#> 1 FIR/DD/GD Number 000165 State
#> 2 FIR/DD/GD Date 17/08/2017 District
#> 3 Mobile Type(GSM/CDMA) GSM Police Station
#> 4 Mobile Make SAMSUNG J2 Mobile Number
#> 5 Missing/Stolen Date 23/04/2017 IMEI Number
#> 6 Complainant AKEEL KHAN Complainant Contact Number
#> 7 Status Stolen/Theft Report Date/Time on ZIPNET
#> 8 <NA> <NA> <NA>
#> 9 FIR/DD/GD Number FIR No 37/ State
#> 10 FIR/DD/GD Date 17/08/2017 District
#> # ... with 62 more rows, and 1 more variables: X4 <chr>
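The result is still label/value pairs spread over (X1, X2) and (X3, X4). A hedged sketch to reshape it into one row per record, assuming (as in the printed output) that each record block is separated by an all-NA row:
df <- map_df(iter_page, get_page)

df %>%
  mutate(record = cumsum(is.na(X1))) %>% # an all-NA divider row starts a new record
  filter(!is.na(X1)) %>%
  pivot_longer(c(X1, X3), values_to = "key") %>% # stack the two label columns
  mutate(value = if_else(name == "X1", X2, X4)) %>% # pair each label with its value
  select(record, key, value) %>%
  pivot_wider(names_from = key, values_from = value)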
