Hello to all professionals out here,
I have created a CSV that contains cities and the corresponding Tripadvisor URLs. If I now search my list for a specific city, for example Munich here, the subset function returns the URL. I then try to read this URL, which is stored in search_url, with read_html, unfortunately without success.
The relevant part of my code is the following.
search_url <- subset(data, city %in% "München", select = url)
pages <- read_html(search_url)
pages <- pages %>%
html_nodes("._15_ydu6b") %>%
html_attr('href')
When I run search_url I get the following output:
https://www.tripadvisor.de/Restaurants-g187323-Berlin.html
But when I use the above code and want to execute read_html, the following error occurs:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "data.frame"
I have now spent several hours on this, but unfortunately I have not found a suitable hint anywhere. It would be wonderful if you could help me out here.
That's because the result of subset() is a data frame here, although the real result is simply one string. Check this simple example with mtcars:
# this will be data.frame although the result is one numeric value 21.4
class(subset(mtcars, disp == 258, select = mpg))
# [1] "data.frame"
So you can probably use
pages <- read_html(as.character(search_url))
if you are sure that your subset returns only one character value; otherwise
pages <- read_html(search_url[1, 1])
should work as well and takes the first result of your subset.
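If you prefer a tidyverse-style alternative, dplyr::pull() extracts the column as a plain character vector instead of a data frame. A minimal sketch, assuming your data frame is called data with the city and url columns as in your code:

library(dplyr)
library(rvest)

# pull() returns a plain character vector instead of a data frame
search_url <- data %>%
  filter(city == "München") %>%
  pull(url)

pages <- read_html(search_url[1]) %>%   # [1] guards against multiple matches
  html_nodes("._15_ydu6b") %>%
  html_attr("href")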
A friend of mine is working with the R language and asked me what she did wrong; I can't seem to find the problem. Does someone know what it is?
The code she sent me:
# 10*. Pipe that to a ggplot command and create a histogram with 4 bins.
# Hint: you will NOT write ggplot(df, aes(...)) because the df is already piped in.
# Instead, just write: ggplot(aes(...)) etc.
# Title the histogram, "Distribution of Sunday tips for bills over $20"
# Feel free to style the plot (not required; this would be a typical exploratory
# analysis where only you will see it, so it doesn't have to be perfect).
df %>%
filter(total_bill > 20 & day == "Sun") %>%
ggplot(aes(x=total_bill, fill=size)) +
geom_histogram(bins=4) +
ggtitle("Distribution of Sunday tips for bills over $20")
The error:
Error in df(.) : argument "df1" is missing, with no default
Type ?df in your console and you will see that df is a function with the following arguments:
df(x, df1, df2, ncp, log = FALSE)
where df1 is a required argument with no default. So the error message is saying that R cannot find a value for the df1 argument of the df function.
It seems like in this code example, your friend is trying to pipe a data frame called df into the filter() function from the dplyr package and then into ggplot() from the ggplot2 package to create a plot.
So my guess is that your friend needs to define df as a data frame first. Otherwise, R will treat df as the built-in function and keep throwing this error.
By the way, since df is a defined function in R, it is not a good name for a data frame. However, people use df as a name for a data frame all the time. Try a different name, such as dat, next time.
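For illustration, here is a minimal sketch of what that could look like. The column names come from your friend's code, but the values below are made up, so treat this only as a stand-in for the real tips data:

library(dplyr)
library(ggplot2)

# made-up stand-in for the real data, under a name that does not clash with df()
dat <- data.frame(
  total_bill = c(25.3, 18.7, 42.1, 31.5),
  day        = c("Sun", "Sun", "Sat", "Sun"),
  size       = c(2, 3, 4, 2)
)

dat %>%
  filter(total_bill > 20 & day == "Sun") %>%
  ggplot(aes(x = total_bill, fill = factor(size))) +   # factor() so fill is discrete
  geom_histogram(bins = 4) +
  ggtitle("Distribution of Sunday tips for bills over $20")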
I have a data.frame (dim: 100 x 1) containing a list of URLs; each URL looks something like this: https:blah-blah-blah.com/item/123/index.do .
The list is a data.frame called my_list with 100 rows and a single character column named col ($ col: chr). It looks like this:
1 "https:blah-blah-blah.com/item/123/index.do"
2 "https:blah-blah-blah.com/item/124/index.do"
3 "https:blah-blah-blah.com/item/125/index.do"
etc.
I am trying to import each of these URLs into R and save the results collectively as a single object that is suitable for text-mining procedures.
I know how to convert each of these URLs manually:
library(pdftools)
library(tidytext)
library(textrank)
library(dplyr)
library(tm)
#1st document
url <- "https:blah-blah-blah.com/item/123/index.do"
article <- pdf_text(url)
Once this "article" object has been successfully created, I can inspect it:
str(article)
chr [1:13]
It looks like this:
[1] "abc ....."
[2] "def ..."
etc etc
[15] "ghi ..."
From here, I can successfully save this as an RDS file:
saveRDS(article, file = "article_1.rds")
Is there a way to do this for all 100 articles at the same time? Maybe with a loop?
Something like:
for (i in 1:100) {
  url_i <- my_list[i, 1]
  article_i <- pdf_text(url_i)
  saveRDS(article_i, file = paste0("article_", i, ".rds"))
}
If this was written correctly, it would save each article as an RDS file (e.g. article_1.rds, article_2.rds, ... article_100.rds).
Would it then be possible to save all these articles into a single rds file?
Please note that list is not a good name for an object, as it masks the built-in list() function. I think it is usually good to name your variables according to their content. Maybe url_df would be a good name.
library(pdftools)
#> Using poppler version 20.09.0
library(tidyverse)
url_df <-
data.frame(
url = c(
"https://www.nimh.nih.gov/health/publications/autism-spectrum-disorder/19-mh-8084-autismspecdisordr_152236.pdf",
"https://www.nimh.nih.gov/health/publications/my-mental-health-do-i-need-help/20-mh-8134-mymentalhealth-508_161032.pdf"
)
)
Since the URLs are already in a data.frame, we can store the text data in an additional column. That way the data will be easily available for later steps.
text_df <-
url_df %>%
mutate(text = map(url, pdf_text))
Instead of saving each text in a separate file, we can now store all of the data in a single file:
saveRDS(text_df, "text_df.rds")
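When you need the text again later, you can read that single file back in; tidyr::unnest_longer() then expands the list column into one row per pdf page. A small sketch, assuming the text_df created above:

text_df <- readRDS("text_df.rds")

# one row per pdf page, with the url kept alongside the text
text_long <- tidyr::unnest_longer(text_df, text)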
For historical reasons, for loops are not very popular in the R community. Base R has the *apply() function family, which provides a functional approach to iteration. The tidyverse has the purrr package and its map*() functions, which improve upon the *apply() functions. I recommend taking a look at https://purrr.tidyverse.org/ to learn more.
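For comparison, here is a base R sketch of the same idea with lapply(), again using the url_df from above; the result is equivalent to the map() version:

text_df_base <- url_df
text_df_base$text <- lapply(url_df$url, pdf_text)   # list column, one element per url

saveRDS(text_df_base, "text_df.rds")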
It seems that there are certain URLs in your data which are not valid pdf files. You can wrap the call in tryCatch to handle the errors. If your data frame is called df with a url column in it, you can do:
library(pdftools)
lapply(seq_along(df$url), function(x) {
  tryCatch({
    saveRDS(pdf_text(df$url[x]), file = sprintf('article_%d.rds', x))
  }, error = function(e) {})
})
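If you then want everything in a single rds file as well, one possible sketch is to read the individual files back into a list and save that list in one go (file names as produced above):

files <- sprintf('article_%d.rds', seq_along(df$url))
files <- files[file.exists(files)]   # skip articles that failed to download

all_articles <- lapply(files, readRDS)
saveRDS(all_articles, 'all_articles.rds')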
So say you have a data.frame called my_df with a column that contains the URLs of your pdf locations. Judging by your comments, it seems that some URLs lead to broken PDFs. You can use tryCatch in these cases to report back which links were broken and check manually what's wrong with those links.
You can do this in a for loop like this:
my_df <- data.frame(url = c(
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf", # working pdf
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pfd" # broken pdf
))
# make some useful new columns
my_df$id <- seq_along(my_df$url)
my_df$status <- NA
for (i in my_df$id) {
  my_df$status[i] <- tryCatch({
    message("downloading ", i) # put a status message on screen
    article_i <- suppressMessages(pdftools::pdf_text(my_df$url[i]))
    saveRDS(article_i, file = paste0("article_", i, ".rds"))
    "OK"
  }, error = function(e) {return("FAILED")}) # return the string FAILED if something goes wrong
}
my_df$status
#> [1] "OK" "FAILED"
I included a broken link in the example data on purpose to showcase how this would look.
Alternatively, you can use a loop from the apply family. The difference is that instead of iterating through a vector and applying the same code until the end of the vector, *apply takes a function, applies it to each element of a list (or objects which can be transformed to lists) and returns the results from each iteration in one go. Many people find *apply functions confusing at first because usually people define and apply functions in one line. Let's make the function more explicit:
s_download_pdf <- function(link, id) {
  tryCatch({
    message("downloading ", id) # put a status message on screen
    article_i <- suppressMessages(pdftools::pdf_text(link))
    saveRDS(article_i, file = paste0("article_", id, ".rds"))
    "OK"
  }, error = function(e) {return("FAILED")})
}
Now that we have this function, let's use it to download all files. I'm using mapply which iterates through two vectors at once, in this case the id and url columns:
my_df$status <- mapply(s_download_pdf, link = my_df$url, id = my_df$id)
my_df$status
#> [1] "OK" "FAILED"
I don't think it makes much of a difference which approach you choose as the speed will be bottlenecked by your internet connection instead of R. Just thought you might appreciate the comparison.
I have some code that I built to scrape player data from Yahoo's fantasy football player page so I can get a list of players and the rank that Yahoo gives them.
The code worked fine last year but now I am getting an error when I run the separate function:
> temp <- separate(temp,two,c('Note', 'Player','a','b','c','Opp'), sep="\n", remove=TRUE)
Error in `[.data.frame`(x, x_vars) : undefined columns selected
In addition: Warning message:
Expected 6 pieces. Missing pieces filled with `NA` in 1 rows [1].
I cannot figure out why it is giving this error; the column I am trying to separate looks correct. I have another script that uses this function to do something similar, and when I tried it there it worked fine.
The "missing pieces filled in with NA" warning shouldn't be a problem; it just won't run because of the undefined columns error.
The minimal code that I use to get to where I am is this:
library(rvest) ## For read_html
library(tidyr) ## For the separate function
#scrapes the data
url <- 'https://football.fantasysports.yahoo.com/f1/107573/players?status=A&pos=O&cut_type=9&stat1=S_S_2017&myteam=0&sort=PR&sdir=1&count=0'
web <- read_html(url)
table = html_nodes(web, 'table')
temp <- html_table(table)[[2]]
#
colnames(temp) <- c('one','two',3:26)
temp <- separate(temp,two,c('Note', 'Player','a','b','c','Opp'), sep="\n", remove=TRUE)
The data is scraped in without column names, so I quickly assign names, including spelling out the column in question so it works with the separate function. I have tried using quotation marks around two in separate, but it gives the same error.
After removing the first row of temp, your code works.
library(dplyr)
colnames(temp) <- c('one','two',3:ncol(temp))
# Use ncol(temp) to make sure the column number is correct
temp2 <- temp %>%
filter(row_number() > 1) %>%
separate(two, c('Note', 'Player','a','b','c','Opp'), sep="\n", remove=TRUE)
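As an aside, if you would rather keep the first row, separate() also has fill and extra arguments; fill = "right" pads rows that have fewer than six pieces with NA instead of warning. A sketch of the same call, not tested against your data:

temp2 <- separate(temp, two, c('Note', 'Player', 'a', 'b', 'c', 'Opp'),
                  sep = "\n", remove = TRUE, fill = "right")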
Disclaimer: I know that this question has been asked before. The answer provided there worked for me in the past, but for some reason it has stopped working now.
I am pulling Marketing email statistics from the Mailchimp API. I have been doing this for the last half year or so. However, in the past 2 months, I believe the structure of what I pull has changed and thus, my code no longer works and I cannot figure out why. I believe it has something to do with the nested data frames within my list of data frames that I receive.
Here is an example of my code and the resulting list of data frames. I have removed sensitive information from my code and image:
library(httr)
library(jsonlite)
library(plyr)
#Opens-----------
opens1 <- GET("https://us4.api.mailchimp.com/3.0/reports/***ReportNumber***/sent-to?count=4000",authenticate('***My Company***', '***My-Password***'))
opens1 <- content(opens1,"text")
opens1 <- fromJSON(opens1)
Then I run opens1 <- ldply(opens1, data.frame), and I receive the following error:
Error in allocate_column(df[[var]], nrows, dfs, var) :
Data frame column 'merge_fields' not supported by rbind.fill
I tried using and looking up rbind.fill() and the other methods described in the linked answer at the top of my post, to no avail. What am I interpreting incorrectly about the merge_fields variable, or am I way off, and how do I correct it?
I'm just trying to get one data frame of all of the variables from the opens1 list.
Thanks for any and all help, and please, feel free to ask any clarification questions!
At a quick glance, this seems to work for me:
library(httr)
campaign_id <- "-------"
apikey = "------"
url <- sprintf("https://us1.api.mailchimp.com/3.0/reports/%s/sent-to", campaign_id)
opens <- GET(url, query = list(apikey = apikey, count = 4000L))
lst <- rjson::fromJSON(content(opens, "text"))
df <- dplyr::bind_rows(
lapply(lst$sent_to, function(x)
as.data.frame(t(unlist(x)), stringsAsFactors = F)
))
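As a hedged alternative that stays with jsonlite, fromJSON() has a flatten argument which unnests nested data frame columns such as merge_fields into regular columns. A sketch, assuming opens1 still holds the raw JSON text returned by content(opens1, "text"):

library(jsonlite)

parsed <- fromJSON(opens1, flatten = TRUE)
opens_df <- parsed$sent_to   # member-level rows as one flat data frame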
I read some data into R with read.xlsx() from the openxlsx package; here's my code for reading the data:
data_all = read.xlsx(xlsxFile = paste0(path, EoLfileName), sheet = 1, detectDates = T, skipEmptyRows = F)
Now, when I access one name cell in my data, it prints the name as a character string:
> data_all[1,'name']
[1] "76-ES+ADVIP-20G"
Now, let's say I want to subset out some rows based on a condition on another column:
data_sub = subset(data_all, !is.na(data_all$amount))
However, if I then print the same cell of this subset, I get:
> data_sub[1,'name']
[1] "A94198.10"
I've also tried subsetting using the following method:
data_sub = data_all[!is.na(data_all$amount),]
but I get the same thing: the expected output of "76-ES+ADVIP-20G" is turned into "A94198.10".
I've checked data_all$name and data_sub$name many times with mode() and str(); both return character, so they are in the correct format.
Here's a link to sample data to play with:
https://drive.google.com/file/d/0BwIbultIWxeVY1VtdDU5NFp1Tkk/view?usp=sharing
Please help me! I am quite stuck, and I don't see other posts with a similar problem.
Why is this happening? Subsetting shouldn't change the data, correct?
Thank you in advance for your help!
Additional note (if it's helpful):
While debugging, I noticed that when I was viewing data_all in RStudio, if I copied and pasted the name "76-ES+ADVIP-20G" into the filter bar, it could not find it; I'd have to type in "76-ES", and as soon as I typed the next character, "+", the RStudio data view filter would say "no matching records found".
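For anyone debugging something similar, here is a small hypothetical diagnostic sketch (column names as in the question) that checks where the expected value actually sits before and after subsetting:

# where does the expected name live in the full data?
which(data_all$name == "76-ES+ADVIP-20G")

# would subset() drop that row because amount is NA?
data_all$amount[data_all$name == "76-ES+ADVIP-20G"]

# compare the first few names before and after subsetting
head(data_all$name)
head(data_sub$name)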