Values are not getting entered in dataframe from web scraping - r

My main aim is to extract the content from the website and save it locally. When the content on the website is updated, the update should be reflected in the local data as well.
I am able to read the data from the webpage used in the code. Now I want to save the result into a data frame so that I can export it: the values of x6 should be entered into the data frame df, so that I can export the data frame to a text file or an Excel file (or you can suggest any other way to extract the data from the webpage used in the code). My for loop is not working, so please help me out.
library(rvest)
library(dplyr)
library(qdapRegex) # install.packages("qdapRegex")
google <- read_html("https://bidplus.gem.gov.in/bidresultlists")
(x <- google %>%
   html_nodes(".block") %>%
   html_text())
class(x)
(x1 <- gsub(" ", "", x))
(x2 <- gsub(" ", "", x1))
(x3 <- gsub(" ", "", x2))
(x4 <- gsub(" ", "", x3))
(x5 <- gsub(" ", "", x4))
(x6 <- gsub("\n", "", x5))
class(x6)
length(x6[i])
typeof(x6)
for (i in x6) {
  BIDNO <- rm_between(x6[i], "BID NO:", "Status", extract = TRUE)
  Status <- rm_between(x6[i], "Status:", "Quantity Required", extract = TRUE)
  Quantity_Required <- rm_between(x6[i], "Quantity Required:", "Department Name And Address", extract = TRUE)
  Department_Name_And_Address <- rm_between(x6[i], "Department Name And Address:", "Start Date", extract = TRUE)
  Start_Date <- rm_between(x6[i], "Start Date:", "End Date", extract = TRUE)
  # End_Date <- rm_between(x6[i], "End Date: ", "Technical Evaluation", extract=TRUE)
  df <- data.frame("BID_NO", "Status", "Quantity_Required", "Department_Name_Address", "Start_Date")
}
df
View(df)

Targeting the desired elements with XPath is likely a path with less frustration & error:
library(rvest)
library(dplyr)
pg <- read_html("https://bidplus.gem.gov.in/bidresultlists")
Get all the bid blocks:
blocks <- html_nodes(pg, ".block")
Target items & quantity div:
items_and_quantity <- html_nodes(blocks, xpath=".//div[@class='col-block' and contains(., 'Item(s)')]")
Pull out items and quantities:
items <- html_nodes(items_and_quantity, xpath=".//strong[contains(., 'Item(s)')]/following-sibling::span") %>% html_text(trim=TRUE)
quantity <- html_nodes(items_and_quantity, xpath=".//strong[contains(., 'Quantity')]/following-sibling::span") %>% html_text(trim=TRUE) %>% as.numeric()
Get the department name and address, and modify it so the three lines are separated with pipes (|). This will make it possible to split them apart later. The pipe symbol is a pain in regexes since it has to be escaped, but it is highly unlikely to appear in the text itself, whereas tabs can often cause confusion later on.
department_name_and_address <- html_nodes(blocks, xpath=".//div[@class='col-block' and contains(., 'Department Name And Address')]") %>%
  html_text(trim=TRUE) %>%
  gsub("\n", "|", .) %>%
  gsub("[[:space:]]*\\||\\|[[:space:]]*", "|", .)
Target the block header which has bid # and status:
block_header <- html_nodes(blocks, "div.block_header")
Pull out bid # (see note at the end of the answer):
html_nodes(block_header, xpath=".//p[contains(@class, 'bid_no')]") %>%
  html_text(trim=TRUE) %>%
  gsub("^.*: ", "", .) -> bid_no
Pull out status:
html_nodes(block_header, xpath=".//p/b[contains(., 'Status')]/following-sibling::span") %>%
  html_text(trim=TRUE) -> status
Target & pull out start & end dates:
html_nodes(blocks, xpath=".//strong[contains(., 'Start Date')]/following-sibling::span") %>%
  html_text(trim=TRUE) -> start_date
html_nodes(blocks, xpath=".//strong[contains(., 'End Date')]/following-sibling::span") %>%
  html_text(trim=TRUE) -> end_date
Make a data frame:
data.frame(
  bid_no,
  status,
  start_date,
  end_date,
  items,
  quantity,
  department_name_and_address,
  stringsAsFactors=FALSE
) -> xdf
Some of the bids are "RA"s so we can also create a column letting us know which ones are which:
xdf$is_ra <- grepl("/RA/", bid_no)
The resultant data frame:
str(xdf)
## 'data.frame': 10 obs. of 8 variables:
## $ bid_no : chr "GEM/2018/B/93066" "GEM/2018/B/93082" "GEM/2018/B/93105" "GEM/2018/B/93999" ...
## $ status : chr "Not Evaluated" "Not Evaluated" "Not Evaluated" "Not Evaluated" ...
## $ start_date : chr "25-09-2018 03:53:pm" "27-09-2018 09:16:am" "25-09-2018 05:08:pm" "26-09-2018 05:21:pm" ...
## $ end_date : chr "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" ...
## $ items : chr "automotive chassis fitted with engine" "automotive chassis fitted with engine" "automotive chassis fitted with engine" "Storage System" ...
## $ quantity : num 1 1 1 2 90 1 981 6 4 376
## $ department_name_and_address: chr "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Maharashtra Energy Department Maharashtra Bhusawal Tps N/a" ...
## $ is_ra : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
I'll let you turn dates into POSIXct elements.
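If you do, a possible conversion is sketched below (the format string is my assumption based on the "25-09-2018 03:53:pm" layout above; toupper() is used because %p matching can be locale-sensitive):
xdf$start_date <- as.POSIXct(toupper(xdf$start_date), format = "%d-%m-%Y %I:%M:%p")
xdf$end_date   <- as.POSIXct(toupper(xdf$end_date), format = "%d-%m-%Y %I:%M:%p")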
The contiguous code w/o explanation is here.
Also, this isn't Java. for loops are rarely the solution to a problem in R. And, you should read up on regexes since counting spaces for substitution is also a path fraught with peril and frustration.
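For instance, the whole x1 … x6 chain in the question could likely be replaced by a single substitution over a whitespace character class (a sketch of the idea, not part of the original answer):
x_clean <- gsub("[[:space:]]+", " ", x) # one pass: collapse every run of spaces/newlines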

The problem appears to be that what you created is a bunch of string literals, with 'BID_NO' etc. in quotes. If you are trying to save values into a data frame, you need to pass the variables into which you saved the values instead:
df <- data.frame(BID_NO, Status, Quantity_Required, Department_Name_Address, Start_Date)
Provided all the code above which creates each field is correct and the values are saved into those variables, you will get a ONE ROW data frame, because it is created inside a for loop: each iteration writes over the last version.
If you hope to save multiple rows, create final_df prior to the loop. Then
data.frame(rbind(final_df, df)) will bind the row of data to the empty frame on the first pass and append a new row on each subsequent pass.
But remember that any data frame created inside the loop is created anew and overwritten on each pass, and it must be built from variables without quotes around them; see the sketch below.
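A minimal sketch of that accumulation pattern (assuming the rm_between() extractions above work as intended; only two fields shown):
final_df <- data.frame()           # empty frame, created once before the loop
for (i in seq_along(x6)) {         # index by position, not by value
  BID_NO <- rm_between(x6[i], "BID NO:", "Status", extract = TRUE)[[1]]
  Status <- rm_between(x6[i], "Status:", "Quantity Required", extract = TRUE)[[1]]
  df <- data.frame(BID_NO, Status, stringsAsFactors = FALSE) # unquoted variables
  final_df <- rbind(final_df, df)  # append this iteration's row
}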

Related

Extract certain words from dynamic strings vector

I'm working with questionnaire datasets where I need to extract some brands' names from several questions. The problem is that each dataset might have a different question line, for example:
Data #1
What do you know about AlphaToy?
Data #2
What comes to your mind when you heard AlphaCars?
Data #3
What do you think of FoodTruckers?
What I want to extract are the words AlphaToy, AlphaCars, and FoodTruckers. In Excel, I can get those brand names via Flash Fill.
As I'm working with R, I need to convert the "flash fill" step into an R function, yet I couldn't find out how to do it. Here's the desired output:
brandName <- list(
  Toy = c(
    "1. What do you know about AlphaToy?",
    "2. What do you know about BetaToyz?",
    "3. What do you know about CharlieDoll?",
    "4. What do you know about DeltaToys?",
    "5. What do you know about Echoty?"
  ),
  Car = c(
    "18. What comes to your mind when you heard AlphaCars?",
    "19. What comes to your mind when you heard BestCar?",
    "20. What comes to your mind when you heard CoolCarz?"
  ),
  Trucker = c(
    "5. What do you think of FoodTruckers?",
    "6. What do you think of IceCreamTruckers?",
    "7. What do you think of JellyTruckers?",
    "8. What do you think of SodaTruckers?"
  )
)
extractBrandName <- function(...) {
  # some code here
}
#desired output
> extractBrandName(brandName$Toy)
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
As the title says, the function should work on dynamic strings, so when the function is applied to brandName the desired output is:
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
$Car
[1] "AlphaCars" "BestCar" "CoolCarz"
$Trucker
[1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
Edit:
The brand name can be in lowercase, uppercase, or consist of two or more words, for instance: IBM, Louis Vuitton.
The brand names might appear in the middle of a sentence; they don't always come at the end. The sentences are unpredictable because each client might provide different data.
Can anyone help me with the function code to achieve the desired output? Thank you in advance!
Edit: here's my attempt
The idea (thanks to shs' answer) is to find the words the inputs share and exclude them, leaving the unique words (which should be the brand names) behind. Following this post, I use intersect() wrapped inside Reduce() to get the common words, then exclude them via lapply() and make sure any brand names of two or more words are merged back together with str_c(collapse = " ").
Code
library(stringr)
extractBrandName <- function(x) {
  cleanWords <- x %>%
    str_remove_all("^\\d+|\\.|,|\\?") %>%
    str_squish() %>%
    str_split(" ")
  commonWords <- cleanWords %>%
    Reduce(intersect, .)
  extractedWords <- cleanWords %>%
    lapply(., function(y) {
      y[!y %in% commonWords] %>%
        str_c(collapse = " ")
    }) %>%
    unlist()
  return(extractedWords)
}
Output (1st test case)
> #output
> extractBrandName(brandName$Toy)
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
$Car
[1] "AlphaCars" "BestCar" "CoolCarz"
$Trucker
[1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
Output (2nd test case)
This test case includes brand names of two or more words, located in the middle and at the beginning of the sentence.
brandName2 <- list(
  Middle = c("Have you used any products from AlphaToy this past 6 months?",
             "Have you used any products from BetaToys Collection this past 6 months?",
             "Have you used any products from Charl TOYZ this past 6 months?"),
  First = c("AlphaCars is the best automobile dealer, yes/no?",
            "Best Vehc is the best automobile dealer, yes/no?",
            "CoolCarz & Bike is the best automobile dealer, yes/no?")
)
> #output
> lapply(brandName2, extractBrandName)
$Middle
[1] "AlphaToy" "BetaToys Collection" "Charl TOYZ"
$First
[1] "AlphaCars" "Best Vehc" "CoolCarz & Bike"
In the end, the solution to this problem was found, thanks to shs, who gave the initial idea, and to the answer from the post I linked above. If you have any suggestions, please feel free to comment. Thank you.
This function checks which words the first two strings have in common and then removes everything from the beginning of the strings up to and including the common element, leaving only the desired part of the string:
library(stringr)
extractBrandName <- function(x) {
  x %>%
    str_split(" ") %>%
    {.[[1]][.[[1]] %in% .[[2]]]} %>%
    str_c(collapse = " ") %>%
    str_c("^.+", .) %>%
    str_remove(x, .) %>%
    str_squish() %>%
    str_remove("\\?")
}
lapply(brandName, extractBrandName)
#> $Toy
#> [1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
#>
#> $Car
#> [1] "AlphaCars" "BestCar" "CoolCarz"
#>
#> $Trucker
#> [1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"

Importing multiple invoices (.PDF) in R. Turning them from strings to a tibble

So I'm doing a project where I need to load a large number of PDFs into R. That part is mostly covered. The problem is that when importing the PDFs into R, every line becomes a string, not all the information in the string is relevant, and in some cases information is missing. So I want to select the info I need and place it into a tibble for further analysis.
Importing the PDFs is done with pdftools. It's working, though hints or tips are welcome:
library(pdftools) # pdf_text()
library(purrr)    # map()
library(stringr)  # str_squish(), str_to_lower()
library(magrittr) # the %>% pipe
invoice_pdfs <- list.files(pattern = "*.pdf") # gather all the .pdf in current wd
invoice_list <- map(invoice_pdfs, .f = function(invoices) { # using the purrr::map function
  pdf_text(invoices) %>%      # extract text from the listed pdf file(s)
    readr::read_lines() %>%   # read all text from the pdf, one line per string
    str_squish() %>%          # collapse repeated whitespace in the text
    str_to_lower()            # convert strings to lower case
})
reproducible example:
invoice_example <- c("invoice",
"to: rade ris",
"cane nompany",
"kakber street 23d",
"nork wey",
"+223 (0)56 015 6542",
"invoice id: 85600023",
"date reference product product reference weigth amount",
"01-02-2016 840000023 product a 24.45.6 de6583621 14.900 kg a 50 per tonne 745,00",
"07-02-2016 840000048 product b 24.45.7 qf8463641 19.000 kg a 50 per tonne 950,00",
"03-02-2016 840000032 product b 24.34.2 qf8463641 4.000 kg per tonne 250,00",
"02-02-2016 840000027 ke7801465 1.780 kg per tonne 89,00",
"subtotal 2.034,00",
"sales tax 183,06",
"total 2.217,06")
So here is where the problem starts.
What I've tried is using stringr and rebus to select specific parts of the text. I've made the following function to search the document for a specific string; it returns the row number:
word_finder <- function(x, findWord) {
  word_hit <- x %>% # temp for storing TRUE or FALSE
    str_detect(pattern = fixed(findWord))
  which(word_hit == TRUE) # give row number(s) if TRUE
}
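For example, on the invoice_example vector above, this should return the index of the line containing the invoice id:
word_finder(invoice_example, "invoice id")
#> [1] 7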
And the following search patterns:
detect_date <- dgt(2) %R% "-" %R% dgt(2) %R% "-" %R% dgt(2)
detect_money <- optional(DIGIT) %R% optional(".") %R% one_or_more(DIGIT) %R% "," %R% dgt(2)
detect_invoice_num <- str_trim(SPC %R% dgt(8) %R% optional(SPC))
The next step should be to make a tibble (or data frame) with the column names c("date", "reference", "product", "product reference", "weight", "amount"). I've also tried making a tibble of the whole invoice_example; the problems are the missing info in some fields and that the column names don't match the corresponding values.
So I would like some function that uses the search patterns and places each matched value into a predestined column. I've got no clue how to get this done. Or maybe I should handle this completely differently?
The final result should be something like this.
reproducible example:
invoice_nr <- c("85600023", "85600023", "85600023", "85600023" )
date <- c( "01-02-2016", "07-02-2016", "03-02-2016", "02-02-2016")
reference <- c( "840000023", "840000048", "840000032", "840000027")
product_id <- c( "de6583621", "qf8463641", "qf8463641", "ke7801465")
weight <- c("14.900", "19.000", "4.000", "1.780")
amount <- c("745.00", "950.00", "250.00", "89.00")
example_tibble <- tibble(invoice_nr, date, reference, product_id, weight, amount)
Result:
# A tibble: 4 x 6
invoice_nr date reference product_id weight amount
<chr> <chr> <chr> <chr> <chr> <chr>
1 85600023 01-02-2016 840000023 de6583621 14.900 745.00
2 85600023 07-02-2016 840000048 qf8463641 19.000 950.00
3 85600023 03-02-2016 840000032 qf8463641 4.000 250.00
4 85600023 02-02-2016 840000027 ke7801465 1.780 89.00
Any suggested ways of dealing with this will be appreciated!
Actually you can use the functions of library(stringr) to achieve your goal (I skipped the rebus part, as it seems to be anyway 'just' a helper for creating the regex, which I did by hand):
library(tidyverse)
parse_invoice <- function(in_text) {
  ## define regex, some assumptions:
  ## product id is 2 lower characters followed by 7 digits
  ## weight is some digits with a dot followed by kg
  ## amount is some digits at the end with a comma
  all_regex <- list(date       = "\\d{2}-\\d{2}-\\d{4}",
                    reference  = "\\d{9}",
                    product_id = "[a-z]{2}\\d{7}",
                    weight     = "\\d+\\.\\d+ kg",
                    amount     = "\\d+,\\d+$")
  ## look only at lines where there is invoice data
  rel_lines <- str_subset(in_text, all_regex$date)
  ## extract the pieces from the regex
  ret <- as_tibble(map(all_regex, str_extract, string = rel_lines))
  ## clean up the data
  ret %>%
    mutate(invoice_nr = str_extract(str_subset(in_text, "invoice id:"), "\\d{8}"),
           date = as.Date(date, "%d-%m-%Y"),
           weight = as.numeric(str_replace(weight, "(\\d+.\\d+) kg", "\\1")),
           amount = as.numeric(str_replace(amount, ",", "."))) %>%
    select(invoice_nr, date, reference, product_id, weight, amount)
}
str(parse_invoice(invoice_example))
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 6 variables:
# $ invoice_nr: chr "85600023" "85600023" "85600023" "85600023"
# $ date : Date, format: "2016-02-01" "2016-02-07" ...
# $ reference : chr "840000023" "840000048" "840000032" "840000027"
# $ product_id: chr "de6583621" "qf8463641" "qf8463641" "ke7801465"
# $ weight : num 14.9 19 4 1.78
# $ amount : num 745 950 250 89
Since I'm not familiar with rebus I've rewritten your code. Assuming the invoices are at least somewhat structured the same way, I could generate a tibble from your example. You would just have to apply this to your whole list and then purrr::reduce it to a big tibble (see the sketch after the code):
df <- tibble(date = na.omit(str_extract(invoice_example, "\\d{2}-\\d{2}-\\d{4}")))
df %>%
  mutate(invoice_nr = na.omit(sub("invoice id: ", "", str_extract(invoice_example, "invoice id: [0-9]+"))),
         reference = na.omit(sub("\\d{2}-\\d{2}-\\d{4} ", "", str_extract(invoice_example, "\\d{2}-\\d{2}-\\d{4} \\d{9}"))),
         product_id = na.omit(str_extract(invoice_example, "[:lower:]{2}\\d{7}")),
         weight = na.omit(sub(" kg", "", str_extract(invoice_example, "[0-9\\.]+ kg"))),
         amount = na.omit(sub("tonne ", "", str_extract(invoice_example, "tonne [0-9,]+"))))
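A minimal sketch of that apply-and-reduce step (my illustration, reusing invoice_list from the question and the parse_invoice() helper from the answer above):
library(purrr)
library(dplyr)
all_invoices <- invoice_list %>% # one character vector per pdf
  map(parse_invoice) %>%         # one tibble per invoice
  reduce(bind_rows)              # stack them into a single big tibble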

R: Using dplyr to filter out a dataframe

I'm new to R and I'm having trouble filtering my data frame on a specific condition. For some reason the code runs without errors, but when I view the updated data frame, the condition I set didn't execute.
The condition that isn't executing is var > 50.
Any help would be greatly appreciated!
Code so far:
if (!require(pacman)) {
  install.packages('pacman')
}
pacman::p_load("ggplot2", "tidyr", "plyr", "dplyr")
#### Read in the necessary data ######
roadsalt_data <- read.table("QADportaldata_1988-2015.tsv", header = T, sep = "\t", fill = T, stringsAsFactors = F)
# Convert date column from a character class to a date class so ggplot can display as a continuous variable ###
roadsalt_data$stdate <- as.Date(roadsalt_data$stdate)
## Filter dataset to only contain columns I need ########
filtered_roadsalt <- roadsalt_data %>%
  select(orgid, stdate, locid, charnam, val) %>%
  filter(between(stdate, as.Date("1996-01-01"), as.Date("2015-07-01"))) %>%
  filter(charnam == "Total dissolved solids" & "var" > 50)
Preview of my dataset:
'data.frame': 47850 obs. of 5 variables:
$ orgid : chr "USGS-NJ" "USGS-NJ" "USGS-NJ" "USGS-NJ" ...
$ stdate : Date, format: "2014-03-05" "2014-03-05" "2014-03-04" ...
$ locid : chr "USGS-01367785" "USGS-01367785" "USGS-01455099" "USGS-01455099" ...
$ charnam: chr "Total dissolved solids" "Total dissolved solids" "Total dissolved solids" "Total dissolved solids" ...
$ val : chr "0.21" "154" "0.43" "333" ...
I am assuming class(val) is factor, in which case the condition in filter has to be written this way:
filter(charnam == "Total dissolved solids" & as.numeric(as.character(val)) > 50.00)
When using dplyr functions, you don't need quotes around your variable names. So,
filter(charnam == "Total dissolved solids" & "var" > 50)
should be replaced with
filter(charnam == "Total dissolved solids" & var > 50)
var also has to be converted to a numeric variable.
That being said, if you select at the beginning of your pipe, you have to include all the variables on which you want to filter. You haven't selected a variable called var in your initial select statement, so you won't be able to filter on it. If var is meant to be val, then you're good to go.
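Putting those points together, a sketch of the corrected pipeline (assuming the intended column is val, which your str() output shows is stored as character):
filtered_roadsalt <- roadsalt_data %>%
  select(orgid, stdate, locid, charnam, val) %>%
  filter(between(stdate, as.Date("1996-01-01"), as.Date("2015-07-01"))) %>%
  filter(charnam == "Total dissolved solids", as.numeric(val) > 50)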

Geocode IP addresses in R

I have made this short code to automate geocoding of IP addresses by using the freegeoip.net (15,000 queries per hour by default; excellent service!):
> library(RCurl)
Loading required package: bitops
> ip.lst =
c("193.198.38.10","91.93.52.105","134.76.194.180","46.183.103.8")
> q = do.call(rbind, lapply(ip.lst, function(x){
try( data.frame(t(strsplit(getURI(paste0("freegeoip.net/csv/", x)), ",")[[1]]), stringsAsFactors = FALSE) )
}))
> names(q) = c("ip","country_code","country_name","region_code","region_name","city","zip_code","time_zone","latitude","longitude","metro_code")
> str(q)
'data.frame': 4 obs. of 11 variables:
$ ip : chr "193.198.38.10" "91.93.52.105" "134.76.194.180" "46.183.103.8"
$ country_code: chr "HR" "TR" "DE" "DE"
$ country_name: chr "Croatia" "Turkey" "Germany" "Germany"
$ region_code : chr "" "06" "NI" ""
$ region_name : chr "" "Ankara" "Lower Saxony" ""
$ city : chr "" "Ankara" "Gottingen" ""
$ zip_code : chr "" "06450" "37079" ""
$ time_zone : chr "Europe/Zagreb" "Europe/Istanbul" "Europe/Berlin" ""
$ latitude : chr "45.1667" "39.9230" "51.5333" "51.2993"
$ longitude : chr "15.5000" "32.8378" "9.9333" "9.4910"
$ metro_code : chr "0\r\n" "0\r\n" "0\r\n" "0\r\n"
In three lines of code you get coordinates for all IPs, including city/country codes. I wonder if this could be parallelized so it runs even faster? Geocoding >10,000 IPs can otherwise take hours.
library(rgeolocate)
ip_lst = c("193.198.38.10", "91.93.52.105", "134.76.194.180", "46.183.103.8")
maxmind(ip_lst, "~/Data/GeoLite2-City.mmdb",
fields=c("country_code", "country_name", "region_name", "city_name",
"timezone", "latitude", "longitude"))
## country_code country_name region_name city_name timezone latitude longitude
## 1 HR Croatia <NA> <NA> Europe/Zagreb 45.1667 15.5000
## 2 TR Turkey Istanbul Istanbul Europe/Istanbul 41.0186 28.9647
## 3 DE Germany Lower Saxony Bilshausen Europe/Berlin 51.6167 10.1667
## 4 DE Germany North Rhine-Westphalia Aachen Europe/Berlin 50.7787 6.1085
There are instructions in the package for obtaining the necessary data files. Some of the fields you're pulling are woefully inaccurate (more so than any geoip vendor would like to admit). If you do need ones that aren't available, file an issue and we'll add them.
I've found multidplyr is a great package for making parallel server calls. This is the best guide I've found, and I highly recommend reading the whole thing to better understand how the package works: http://www.business-science.io/code-tools/2016/12/18/multidplyr.html
library("devtools")
devtools::install_github("hadley/multidplyr")
library(parallel)
library(multidplyr)
library(RCurl)
library(tidyverse)
# Convert your example into a function
get_ip <- function(ip) {
  do.call(rbind, lapply(ip, function(x) {
    try(data.frame(t(strsplit(getURI(
      paste0("freegeoip.net/csv/", x)
    ), ",")[[1]]), stringsAsFactors = FALSE))
  })) %>% nest(X1:X11)
}
# Made ip.lst into a Tibble to make it work better with dplyr
ip.lst <-
  tibble(
    ip = c(
      "193.198.38.10",
      "91.93.52.105",
      "134.76.194.180",
      "46.183.103.8",
      "193.198.38.10",
      "91.93.52.105",
      "134.76.194.180",
      "46.183.103.8"
    )
  )
# Create a cluster based on how many cores your machine has
cl <- detectCores()
cluster <- create_cluster(cores = cl)
# Create a partitioned tibble
by_group <- partition(ip.lst, cluster = cluster)
# Send libraries and the function get_ip() to each cluster
by_group %>%
  cluster_library("tidyverse") %>%
  cluster_library("RCurl") %>%
  cluster_assign_value("get_ip", get_ip)
# Send parallel requests to the website and parse the results
q <- by_group %>%
  do(get_ip(.$ip)) %>%
  collect() %>%
  unnest() %>%
  tbl_df() %>%
  select(-PARTITION_ID)
# Set names of the results
names(q) <- c("ip", "country_code", "country_name", "region_code", "region_name",
              "city", "zip_code", "time_zone", "latitude", "longitude", "metro_code")

R For loop unwanted overwrite

I would like to store each result of the loop in a differently named object (text<somename>).
Right now the loop overwrites text on every iteration:
library(rvest)
main.page <- read_html(x = "http://www.imdb.com/event/ev0000681/2016")
urls <- main.page %>% # feed `main.page` to the next step
html_nodes(".alt:nth-child(2) strong a") %>% # get the CSS nodes
html_attr("href") # extract the URLs
for (i in urls) {
  a01 <- paste0("http://www.imdb.com", i)
  text <- read_html(a01) %>% # load the page
    html_nodes(".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop , .summary_text+ .credit_summary_item .itemprop") %>% # isolate the text
    html_text()
}
How could I code it in such a way that the i from the list is added to text in the for statement?
To solidify my comment:
main.page <- read_html(x = "http://www.imdb.com/event/ev0000681/2016")
urls <- main.page %>%                          # feed `main.page` to the next step
  html_nodes(".alt:nth-child(2) strong a") %>% # get the CSS nodes
  html_attr("href")                            # extract the URLs
texts <- sapply(head(urls, n = 3), function(i) {
  read_html(paste0("http://www.imdb.com", i)) %>%
    html_nodes(".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop , .summary_text+ .credit_summary_item .itemprop") %>%
    html_text()
}, simplify = FALSE)
str(texts)
# List of 3
# $ /title/tt5843990/: chr [1:4] "Lav Diaz" "Charo Santos-Concio" "John Lloyd Cruz" "Michael De Mesa"
# $ /title/tt4551318/: chr [1:4] "Andrey Konchalovskiy" "Yuliya Vysotskaya" "Peter Kurth" "Philippe Duquesne"
# $ /title/tt4550098/: chr [1:4] "Tom Ford" "Amy Adams" "Jake Gyllenhaal" "Michael Shannon"
If you use lapply(...), you'll get an unnamed list, which may or may not be a problem for you. Instead, using sapply(..., simplify = FALSE), we get a named list where each name is (in this case) the partial url retrieved from urls.
Using sapply with its default simplify = TRUE can lead to unexpected outputs. As an example:
set.seed(9)
sapply(1:3, function(i) rep(i, sample(3, size=1)))
# [1] 1 2 3
One may think that this will always return a vector. However, if any one of the returned elements differs in length from the others, the result becomes a list instead:
set.seed(10)
sapply(1:3, function(i) rep(i, sample(3, size=1)))
# [[1]]
# [1] 1 1
# [[2]]
# [1] 2
# [[3]]
# [1] 3 3
In which case, it's best to have certainty in the return value, forcing a list:
set.seed(9)
sapply(1:3, function(i) rep(i, sample(3, size=1)), simplify = FALSE)
# [[1]]
# [1] 1
# [[2]]
# [1] 2
# [[3]]
# [1] 3
That way, you always know exactly how to reference sub-returns. (This is one of the tenets and advantages of Hadley's purrr package: each function always returns a list of exactly the type you declare. There are other advantages to the package as well.) A purrr version of the scrape above is sketched below.
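A minimal purrr sketch of the same scrape (my illustration, not part of the original answer; set_names() keeps the partial URLs as list names, mirroring the sapply version):
library(rvest)
library(purrr)
texts <- head(urls, n = 3) %>%
  set_names() %>% # name each element by its own value (the partial URL)
  map(~ read_html(paste0("http://www.imdb.com", .x)) %>%
        html_nodes(".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop , .summary_text+ .credit_summary_item .itemprop") %>%
        html_text())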
