I'm trying to display a list of the top three users based on a user-selected variable (see below). I have created a function that filters my table based on the Agency selected via the dropdown and retrieves the top 3 users in a column. I then transformed the column into a string to render it in the app, but the results are being displayed wrapped in c(...):
I'm okay with the format of the names separated by a comma, but I cannot find a way to eliminate the c(...).
This is the code for my function:
Top3UsersbyAgency <- function(filteredbyAgencyPool) {
filteredbyAgencyPool %>%
arrange(desc(MTD_Domestic)) %>%
group_by(userDisplayName) %>%
head(3) %>%
select(userDisplayName) %>%
na.exclude() %>%
na_if("") %>%
na.omit() %>%
toString()
}
And this is the result:
> Top3UsersbyAgency(filteredbyAgencyPool)
[1] "c(\"Payal Malhotra\", \"Swati Parmar\", \"Unassigned\")"
In the app, I simply used textOutput in the ui and renderText in the server function. I also tried renderTable to display the results as a column, but it honestly looks ugly with the column title in the middle, so I'd rather display the information as a plain-text list of names. Any suggestions on how to clean this string up?
Try transforming the data.frame column to a vector with %>% .[[1]]:
Top3UsersbyAgency <- function(filteredbyAgencyPool) {
filteredbyAgencyPool %>%
arrange(desc(MTD_Domestic)) %>%
group_by(userDisplayName) %>%
head(3) %>%
select(userDisplayName) %>%
na.exclude() %>%
na_if("") %>%
na.omit() %>%
.[[1]] %>%
toString()
}
As an illustration:
library(magrittr)
data.frame(c('a','b','c')) %>% toString()
#> [1] "c(\"a\", \"b\", \"c\")"
data.frame(c('a','b','c')) %>% .[[1]] %>% toString()
#> [1] "a, b, c"
I would like to extract the job descriptions, i.e. the "p" tag HTML elements, from all 16 pages generated by the last line of code.
"ret" is a list of those 16 HTML pages. I'm not used to working with lists of lists, so I'm confused about how to extract the data from them.
Normally I would use
ret %>%
html_elements("body p")
But I'm getting the error message:
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "list"
library(tidyverse)
library(rvest)
library(xml2)
url<-"https://www.indeed.com/jobs?q=data%20analyst&l=San%20Francisco%2C%20CA&vjk=0c2a6008b4969776"
page<-xml2::read_html(url) # function will read in the code from the webpage and break it down into different elements (<div>, <span>, <p>, etc.)
#get job title
title<-page %>%
html_nodes(".jobTitle") %>%
html_text()
#get company Location
loc<-page %>%
html_nodes(".companyLocation") %>%
html_text()
#job snippet
page %>%
html_nodes(".job-snippet") %>%
html_text()
#Get link
desc<- page %>%
html_nodes("a[data-jk]") %>%
html_attr("href")
# Create combined link
combined_link <- paste("https://www.indeed.com", desc, sep="")
#Turn combined link into a session follow link
page1 <- html_session(combined_link[[1]])
page1 %>%
html_nodes(".iCIMS_JobContent, #jobDescriptionText") %>%
html_text()
#one<- page %>% html_elements("a[id*='job']")
#create function return a list of page-returns
ret <- lapply(paste0("https://www.indeed.com", desc), read_html)
We could either use lapply from base R
out <- lapply(ret, function(x) x %>%
html_nodes(".iCIMS_JobContent, #jobDescriptionText") %>%
html_text())
or loop with map from purrr
library(purrr)
out <- map(ret, ~ .x %>%
html_nodes(".iCIMS_JobContent, #jobDescriptionText") %>%
html_text())
NOTE: Both loop over the elements of the list; .x or x is the individual element, supplied via an anonymous function, i.e. a function created on the fly (function(x) in base R, or ~ in the tidyverse).
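As a tiny illustration of that equivalence (a standalone sketch, not tied to the scraping code):
lapply(1:3, function(x) x^2) # base R anonymous function
map(1:3, ~ .x^2) # purrr tilde shorthand; both return a list of 1, 4, 9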
I am trying to apply a function that extracts a table from a list of scraped links. I am at the final stage, applying the get_injury_data function to the links, but I have been having trouble executing this successfully. I get the following error:
Error in matrix(unlist(values), ncol = width, byrow = TRUE) :
'data' must be of a vector type, was 'NULL'
I wonder if anyone can help me spot where I am going wrong. The code is as follows:
library(tidyverse)
library(rvest)
# create a function to grab the team links
get_team_links <- function(url){
url %>%
read_html %>%
html_nodes('td.hauptlink a') %>%
html_attr('href') %>%
.[. != '#'] %>% # remove rows with the '#' string
paste0('https://www.transfermarkt.com', .) %>% # paste the website domain onto the url strings
unique() %>% # keep only unique links
as_tibble() %>% # turn the strings into a tibble dataset
rename("links" = "value") %>% # rename the value column
filter(!grepl('profil', links)) %>% # remove the player links that were included
filter(!grepl('spielplan', links)) %>% # remove links to additional team pages
mutate(links = gsub("startseite", "kader", links)) # change the link to point to the detailed squad page
}
# create a function to grab the player links
get_player_links <- function(url){
url %>%
read_html %>%
html_nodes('td.hauptlink a') %>%
html_attr('href') %>%
.[. != '#'] %>% # remove rows with the '#' string
paste0('https://www.transfermarkt.com', .) %>% # paste the website domain onto the url strings
unique() %>% # keep only unique links
as_tibble() %>% # turn the strings into a tibble dataset
rename("links" = "value") %>% # rename the value column
filter(grepl('profil', links)) %>% # keep only the player links
mutate(links = gsub("profil", "verletzungen", links)) # change the link to point to the injury page
}
# create a function to get the injury dataset
get_injury_data <- function(url){
url %>%
read_html() %>%
html_nodes('#yw1') %>%
html_table()
}
# get team links and save it as team_links
team_links <- get_team_links('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
# get player links by mapping the function onto the team_links dataset
# and then unnest the list of lists into one long list
player_injury_links <- team_links %>%
mutate(links = map(team_links$links, get_player_links)) %>%
unnest(links)
# using the player_injury_links list, create a dataset by web scraping the player injury pages
player_injury_data <- map(player_injury_links$links, get_injury_data)
Solution
So the issue I was having was that some of the links I was scraping did not have any data.
To overcome this, I used the possibly function from the purrr package, which helped me create a new, error-free function.
The line of code that was giving me trouble now becomes:
player_injury_data <- player_injury_links$links %>%
purrr::map(purrr::possibly(get_injury_data, otherwise = NULL, quiet = TRUE))
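As a small follow-up sketch (my addition, not from the original solution): purrr::compact() drops the NULL placeholders that possibly() returns for links without a table, leaving only the scrapes that succeeded.
library(purrr)
# keep only the scrapes that actually returned a table
player_injury_data <- compact(player_injury_data)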
Error in unnest_tokens.data.frame(., entity, text, token = tokenize_scispacy_entities, :
Expected output of tokenizing function to be a list of length 100
unnest_tokens() works well on a sample of a few observations but fails on the entire dataset.
https://github.com/dgrtwo/cord19
Reproducible example:
library(dplyr)
library(cord19)
library(tidyverse)
library(tidytext)
library(spacyr)
Install the model from here - https://github.com/allenai/scispacy
spacy_initialize("en_core_sci_sm")
tokenize_scispacy_entities <- function(text) {
spacy_extract_entity(text) %>%
group_by(doc_id) %>%
nest() %>%
pull(data) %>%
map("text") %>%
map(str_to_lower)
}
paragraph_entities <- cord19_paragraphs %>%
select(paper_id, text) %>%
sample_n(10) %>%
unnest_tokens(entity, text, token = tokenize_scispacy_entities)
I faced the same problem. I don't know the reason why, but after I filter out empty abstracts and shorter abstract strings, everything seems to work just fine.
abstract_entities <- article_data %>%
filter(nchar(abstract) > 30) %>%
select(paper_id, title, abstract) %>%
sample_n(1000) %>%
unnest_tokens(entity, abstract, token = tokenize_scispacy_entities)
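A plausible root cause (my hypothesis, not confirmed in the thread) is that spacy_extract_entity() silently drops documents in which it finds no entities, so the tokenizer returns fewer list elements than input rows and unnest_tokens() complains about the length mismatch. Here is a hedged sketch of a padded tokenizer, assuming spacyr's default doc_id naming of "text1", "text2", ... in input order:
tokenize_scispacy_entities_padded <- function(text) {
out <- rep(list(character(0)), length(text)) # one empty slot per input document
ents <- spacy_extract_entity(text)
if (!is.null(ents) && nrow(ents) > 0) {
idx <- as.integer(sub("^text", "", ents$doc_id)) # recover input positions
found <- split(tolower(ents$text), idx) # entities grouped by document
out[as.integer(names(found))] <- found
}
out # always has length(text) elements, as unnest_tokens() expects
}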
The code below works if I remove the Sys.sleep() from within the map() function. I tried to research the error ('Don't know how to pluck from a closure') but I haven't found much on that topic.
Does anyone know where I can find documentation on this error, and any help on why it is happening and how to prevent it?
library(rvest)
library(tidyverse)
library(stringr)
# scrape player listing pages 0 through 18
page <- 0:18
# no need to create a list. Just a vector
urls = paste0("https://www.mlssoccer.com/players?page=", page)
# define this function that collects the player's name from a url
get_the_names = function( url){
url %>%
read_html() %>%
html_nodes("a.name_link") %>%
html_text()
}
# map the urls to the function that gets the names
players = map(urls, get_the_names) %>%
# turn into a single character vector
unlist() %>%
# make lower case
tolower() %>%
# replace the `space` to underscore
str_replace_all(" ", "-")
# Now create a vector of player urls
player_urls = paste0("https://www.mlssoccer.com/players/", players )
# define a function that reads the 3rd table of the url
get_the_summary_stats <- function(url){
url %>%
read_html() %>%
html_nodes("table") %>%
html_table() %>% .[[3]]
}
# let's read 5 players only to speed things up [otherwise it takes a significant amount of time to run...]
a_few_players <- player_urls[1:5]
# get the stats
tables = a_few_players %>%
# important step so I can name the rows I get in the table
set_names() %>%
#map the player urls to the function that reads the 3rd table
# note the `safely` wrap around the get_the_summary_stats' function
# since there are players with no stats and causes an error (eg.brenden-aaronson )
# the output will be a list of lists [result and error]
map(., ~{ Sys.sleep(5)
safely(get_the_summary_stats) }) %>%
# collect only the `result` output (the table) into a data frame
# There is also an `error` output
# also, name each row with the players name
map_df("result", .id = "player") %>%
#keep only the player name (remove the www.mls.... part)
mutate(player = str_replace(player, "https://www.mlssoccer.com/players/", "")) %>%
as_tibble()
tables <- tables %>% separate(Match,c("awayTeam","homeTeam"), extra= "drop", fill = "right")
purrr::safely(...) returns a function, so your map(., ~ { Sys.sleep(5); safely(get_the_summary_stats) }) is returning functions, not any data. In R, a "closure" is a function and its enclosing environment.
Tilde notation is a tidyverse-specific shorthand for terser anonymous functions. Typically (e.g., with lapply) one would write lapply(mydata, function(x) get_the_summary_stats(x)). In tilde notation, the same thing is written as map(mydata, ~ get_the_summary_stats(.)).
So, re-write to:
... %>% map(~ { Sys.sleep(5); safely(get_the_summary_stats)(.); })
From comments by @r2evans
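Putting it together, a minimal sketch of the corrected pipeline (same objects as above; safely(get_the_summary_stats)(.x) now calls the wrapped function on each url, so every list element holds result and error components):
tables <- a_few_players %>%
set_names() %>%
map(~ { Sys.sleep(5); safely(get_the_summary_stats)(.x) }) %>%
map_df("result", .id = "player")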
I would like to know how to use lapply and/or for loops to have more concise code.
This is what I currently have and it works.
MLFreq <- MLlyrics %>%
unnest_tokens(word, line) %>%
anti_join(stop_words) %>%
ungroup() %>%
count(word)
MLpct <- sum(albumList2$MLlyrics$n) / sum(MLFreq$n)
ViewFreq <- ViewLyrics %>%
unnest_tokens(word, line) %>%
anti_join(stop_words) %>%
ungroup() %>%
count(word)
Viewpct <- sum(albumList2$ViewLyrics$n) / sum(ViewFreq$n)
#... repeating 6 times with different data frames
I've been trying
Freq <- lapply(albumList2, function(df){
df %>% unnest_tokens(word, line) %>%
anti_join(stop_words) %>%
ungroup()%>%
count(word) %>%
sum(albumList2$df$n) / sum(df$n)
})
and
for (i in 1:length(albumList2)) {
unnest_tokens(word, line) %>%
anti_join(stop_words) %>%
ungroup()%>%
count(word) %>%
print(sum(albumList2$i$n) / sum(i$n))
}
but the lapply brings
Error in check_input(x) : Input must be a character vector of any length or
a list of character vectors, each of which has a length of 1.
and the for loop brings
no applicable method for 'unnest_tokens_' applied to an object of class
"function"
For reference albumList2 contains a list of data frames (MLlyrics, ViewLyrics, etc...)
I was originally going to leave it as is but just read something along the lines of "If you use the same code 3 times, loop over it"
The problem with the lapply example is that the list you are looping over is a nested list instead of a single list.
Also, references of the kind sum(albumList2$df$n) / sum(df$n) and print(sum(albumList2$i$n) / sum(i$n)) would not work:
i is just a number ranging from 1 to length(albumList2), so saying you want 1$n or albumList2$1$n does not make sense.
You should read about indexing in lists and nested lists here and here. Please add some dummy data that everyone can test and help you better.
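For what it's worth, here is a minimal sketch of the lapply approach, assuming albumList2 is a flat named list of data frames that each have a line column, and, mirroring the question's MLpct line, that each original data frame also has an n column:
freqs <- lapply(albumList2, function(df) {
df %>%
unnest_tokens(word, line) %>%
anti_join(stop_words, by = "word") %>%
count(word)
})
# one ratio per album, indexing both lists by name with [[ rather than $
pcts <- sapply(names(freqs), function(nm) sum(albumList2[[nm]]$n) / sum(freqs[[nm]]$n))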