Web scraping and looping through pages with R - r

I am learning data scraping and, on top of that, I am quite a beginner with R (for work I use Stata; I use R only for very specific tasks).
In order to learn scraping, I am exercising with a few pages on Psychology Today.
I have written a function that allows me to scrape information for one therapist and to create a data set with the information collected in this way:
install.packages('rvest') #Installing the rvest package (only needed once)
install.packages('xml2') #Installing the xml2 package (only needed once)
library('rvest') #to scrape
library('xml2') #to handle missing values (it works with html_node, not with html_nodes)
#Specifying the url for desired website to be scraped
url <- 'https://www.psychologytoday.com/us/therapists/THE_ONE_YOU_WANT'
#Reading the HTML code from the website
URL <- read_html(url)
#creating the function
getProfile <- function(profilescrape) {
##NAME
#Using CSS selectors to name
nam_html <- html_node(URL,'.contact-name')
#Converting the name data to text
nam <- html_text(nam_html)
#Let's have a look at the rankings
head(nam)
#Data-Preprocessing: removing '\n' (for the next informations, I will keep \n, to help
# me separate each item within the same type of
# information)
nam<-gsub("\n","",nam)
head(nam)
#Converting each info from text to factor
nam<-as.factor(nam)
#Let's have a look at the name
head(nam)
##MODALITIES
#Using CSS selectors to modality
mod_html <- html_node(URL,'.attributes-modality .copy-small')
#Converting the name data to text
mod <- html_text(mod_html)
#Let's have a look at the rankings
head(mod)
#Converting each info from text to factor
mod<-as.factor(mod)
#Let's have a look at the rankings
head(mod)
##Combining all the lists to form a data frame
onet_df<-data.frame(Name = nam,
Modality = mod)
##Structure of the data frame
str(onet_df)
}
View(onet_df)
This code seems to be working well for whatever therapist I choose.
Now, I would like to use this function on multiple profiles, to generate one data set, with name and modality of MHPs.
Let's say that I want to apply the above function "getProfile" to the first 20 therapists in Illinois and store the information for these 20 therapists in a data set called "onet_df":
j <- 1
MHP_codes <- c(324585 : 449807) #therapist identifier
withinpage_codes <- c(1 : 20) #therapist running number
for(code1 in withinpage_codes) {
for(code2 in MHP_codes) {
URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
record_profile <- getProfile <- function(profilescrape)
onet_df[[j]] <- rbind.fill(onet_df, record_profile)
j <- j + 1
}
}
EDITS START HERE:
This loop does not create any data set; moreover, it does not give any error message.
Would someone be able to help me de-bug this loop?
Please, keep in mind that I am a real beginner.
Following suggestions, I have modified the beginning of the function as follows:
#creating the function
getProfile <- function(URL) {....}
Moreover, I have used three alternative loops:
1st alternative
j <- 1
MHP_codes <- c(324585 : 449807) #therapist identifier
withinpage_codes <- c(1 : 20) #therapist running number
for(code1 in withinpage_codes) {
for(code2 in MHP_codes) {
URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
record_profile <- getProfile(URL)
onet_df[[j]] <- rbind.fill(onet_df, record_profile)
j <- j + 1
}
}
which gives the following error message:
Error in UseMethod("xml_find_first") :
no applicable method for 'xml_find_first' applied to an object of class "character"
2nd alternative
MHP_codes <- c(324585, 449807) #therapist identifier
withinpage_codes <- c(1:20) #therapist running number
df_list <- vector(mode = "list",
length = length(MHP_codes) * length(withinpage_codes))
j <- 1
for(code1 in withinpage_codes) {
for(code2 in MHP_codes) {
URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
df_list[[j]] <- getProfile(URL)
j <- j + 1
}
}
final_df <- rbind.fill(df_list)
This loop gives the same error message (please, refer to the above one).
Now, I just have to figure out why no data set is produced by the loop. There might be two problems. First, something within the loop does not work (I have run both loops on only one existing page and no data set is produced). Second, when I run the loop on a series of links, some of them might be missing, which would produce an error message.

Consider several adjustments:
Adjust the function to receive a URL parameter. Right now, profilescrape is not used anywhere in the function, so the function takes whatever URL is assigned in the global environment.
getProfile <- function(URL) {
...
}
Adjust the ending of the function to return the needed object. Without an explicit return, R returns the value of the last evaluated expression, and str() only prints the structure and returns NULL. Therefore, replace str(onet_df) with return(onet_df).
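A minimal sketch of the difference, using two hypothetical helper functions (profile_str and profile_ret are illustrative names, not part of the original script). It only shows what each ending hands back to the caller:

```r
# Hypothetical helper ending with str(): the structure is printed, nothing is returned
profile_str <- function() {
  onet_df <- data.frame(Name = "A", Modality = "B")
  str(onet_df)
}

# Hypothetical helper ending with return(): the data frame reaches the caller
profile_ret <- function() {
  onet_df <- data.frame(Name = "A", Modality = "B")
  return(onet_df)
}

res_str <- profile_str()  # prints the structure; res_str holds NULL
res_ret <- profile_ret()  # res_ret holds the data frame

is.null(res_str)
is.data.frame(res_ret)
```

This is why the original loop had nothing to bind: every call produced NULL rather than a data frame.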
Pass the dynamic URL in the loop to the function by calling it, rather than redefining it with function(profilescrape):
URL <- paste0(...)
record_profile <- getProfile(URL)
Initialize a list with specified length (2 x 20) before loop. Then on each iteration assign to loop index rather than growing object in loop which is memory inefficient.
MHP_codes <- c(324585, 449807) #therapist identifier
withinpage_codes <- c(1:20) #therapist running number
df_list <- vector(mode = "list",
length = length(MHP_codes) * length(withinpage_codes))
j <- 1
for(code1 in withinpage_codes) {
for(code2 in MHP_codes) {
URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
df_list[[j]] <- tryCatch(getProfile(URL),
error = function(e) NULL)
j <- j + 1
}
}
Call rbind.fill (from the plyr package) once outside the loop to combine all data frames together:
final_df <- rbind.fill(df_list)
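The pattern above (preallocate, catch errors per iteration, bind once) can be tried offline with a stand-in for the scraper. In this sketch fake_profile is a hypothetical function that fails for odd ids to mimic missing pages, and base R's do.call(rbind, ...) is swapped in for plyr's rbind.fill, which is sufficient here only because every data frame has the same columns:

```r
# Hypothetical stand-in for getProfile(): fails for odd ids to mimic missing pages
fake_profile <- function(id) {
  if (id %% 2 == 1) stop("page not found")
  data.frame(Name = paste0("therapist_", id), Modality = "Individuals")
}

ids <- 1:6
df_list <- vector(mode = "list", length = length(ids))

for (j in seq_along(ids)) {
  result <- tryCatch(fake_profile(ids[j]), error = function(e) NULL)
  # Wrapping in list() keeps slot j even when result is NULL;
  # df_list[[j]] <- NULL would delete the element instead
  df_list[j] <- list(result)
}

# Drop the failed (NULL) entries and stack the rest in a single call
final_df <- do.call(rbind, Filter(Negate(is.null), df_list))
final_df
```

Using seq_along() also removes the need for a separate counter variable.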
With that said, consider an apply family solution, specifically Map (wrapper to mapply). Doing so, you avoid the bookkeeping of initializing list and incremental variable and you "hide" the loop for compact statement.
# ALL POSSIBLE PAIRINGS
web_codes_df <- expand.grid(MHP_codes = c(324585, 449807),
withinpage_codes = c(1:20))
# MOVE URL ASSIGNMENT INSIDE FUNCTION
getProfile <- function(code1, code2) {
URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
URL <- read_html(URL) # parse the page first: html_node() needs a document, not a URL string
# ...same code as before...
}
# ELEMENT-WISE LOOP PASSING PARAMS IN PARALLEL TO FUNCTION
df_list <- Map(function(code1, code2) tryCatch(getProfile(code1, code2),
error = function(e) NULL),
code1 = web_codes_df$withinpage_codes,
code2 = web_codes_df$MHP_codes)
final_df <- rbind.fill(df_list)
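Because Map matches arguments by name, each column of the grid has to go to the parameter that plays the same role inside the function. A small offline sketch, assuming a hypothetical build_url helper and an example.com address in place of the real site:

```r
# Hypothetical helper: code2 is the therapist id, code1 the running number
build_url <- function(code1, code2) {
  paste0("https://example.com/therapists/", code2, "?ref=", code1)
}

# All pairings; the first column varies fastest
web_codes_df <- expand.grid(MHP_codes = c(324585, 449807),
                            withinpage_codes = 1:3)

# Element-wise call: row k of the grid supplies the k-th code1 and code2
urls <- Map(build_url,
            code1 = web_codes_df$withinpage_codes,
            code2 = web_codes_df$MHP_codes)

length(urls)
urls[[1]]
```

Printing a few of the generated URLs before scraping is a cheap way to confirm that the two codes ended up in the intended positions.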

One of the users, Parfait, helped me to sort out the issues. So, a very big thank you goes to this user.
Below I post the script. I apologize if it is not precisely commented.
Here is the code.
#Loading packages
library('rvest') #to scrape
library('xml2') #to handle missing values (it works with html_node, not with html_nodes)
library('plyr') #to bind together different data sets
#get working directory
getwd()
setwd("~/YOUR OWN FOLDER HERE")
#DEFINE SCRAPING FUNCTION
getProfile <- function(URL) {
##NAME
#Using CSS selectors to name
nam_html <- html_node(URL,'.contact-name')
#Converting the name data to text
nam <- html_text(nam_html)
#Let's have a look at the rankings
head(nam)
#Data-Preprocessing: removing '\n' (for the next informations, I will keep \n, to help
# me separate each item within the same type of
# information)
nam<-gsub("\n","",nam)
head(nam)
#Converting each info from text to factor
nam<-as.factor(nam)
#Let's have a look at the name
head(nam)
#If I need to remove blank space do this:
#Data-Preprocessing: removing excess spaces
#variable<-gsub(" ","",variable)
##MODALITIES
#Using CSS selectors to modality
mod_html <- html_node(URL,'.attributes-modality .copy-small')
#Converting the name data to text
mod <- html_text(mod_html)
#Let's have a look at the rankings
head(mod)
#Converting each info from text to factor
mod<-as.factor(mod)
#Let's have a look at the rankings
head(mod)
##Combining all the lists to form a data frame
onet_df<-data.frame(Name = nam,
Modality = mod)
return(onet_df)
}
Then, I apply this function with a loop to a few therapists. For illustrative purposes, I take four adjacent therapist IDs, without knowing a priori whether each of these IDs has actually been assigned (this is done because I want to see what happens if the loop stumbles on a non-existent link).
j <- 1
MHP_codes <- c(163805:163808) #therapist identifier
df_list <- vector(mode = "list", length(MHP_codes))
for(code1 in MHP_codes) {
URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
#Reading the HTML code inside tryCatch, so that a missing page yields NULL instead of stopping the loop
df_list[[j]] <- tryCatch(getProfile(read_html(URL)),
error = function(e) NULL)
j <- j + 1
}
final_df <- rbind.fill(df_list)
save(final_df,file="final_df.Rda")
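The xml_find_first error quoted in the question appears when html_node() receives a URL string instead of a parsed document. The sketch below reproduces this offline; the HTML fragment is made up for illustration and only borrows the CSS class names from the question, so it says nothing about the real page layout:

```r
library(xml2)   # read_html()
library(rvest)  # html_node(), html_text()

# Hypothetical page fragment reusing the CSS classes from the question
page_source <- "<html><body><h1 class='contact-name'>\nJane Doe\n</h1><div class='attributes-modality'><span class='copy-small'>Individuals</span></div></body></html>"

page <- read_html(page_source)  # parsed document

nam <- gsub("\n", "", html_text(html_node(page, ".contact-name")))
mod <- html_text(html_node(page, ".attributes-modality .copy-small"))

# A selector that matches nothing gives NA with html_node(), not an error
missing_value <- html_text(html_node(page, ".no-such-class"))

# Passing a plain string instead of a parsed document raises an error
failed <- inherits(
  tryCatch(html_node("https://example.com", ".contact-name"),
           error = function(e) e),
  "error"
)
```

Here nam and mod hold the extracted text, missing_value is NA, and failed is TRUE, which matches the behaviour described in the thread.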

Related

Loop Changing to Matrix then Running tests

I have a dataframe with ~9000 rows of human coded data in it, two coders per item so about 4500 unique pairs. I want to break the dataset into each of these pairs, so ~4500 dataframes, run a kripp.alpha on the scores that were assigned, and then save those into a coder sheet I have made. I cannot get the loop to work to do this.
I can get it to work individually, using this:
example.m <- as.matrix(example.m)
s <- kripp.alpha(example.m)
example$alpha <- s$value
However, when trying a loop I am getting either "Error in get(v) : object 'NA' not found" when running this:
for (i in items) {
v <- i
v <- v[c("V1","V2")]
v <- assign(v, as.matrix(get(v)))
s <- kripp.alpha(v)
i$alpha <- s$value
}
Or I am getting "In i$alpha <- s$value : Coercing LHS to a list" when running:
for (i in items) {
i.m <- i[c("V1","V2")]
i.m <- as.matrix(i.m)
s <- kripp.alpha(i.m)
i$alpha <- s$value
}
Here is an example set of data. Items is a list of individual dataframes.
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
items <- c("l","t")
I am sure this is a basic question, but what I want is for each file, i, to add a column with the alpha score at the end. Thanks!
Your problem is with scoping and with extracting objects that are referenced through strings: items holds the names "l" and "t", not the data frames themselves. You'd need get() or eval() on those names to make your current approach work.
Here's another solution
library("irr") # For kripp.alpha
# Produce the data
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
# Collect the data as a list right away
items <- list(l, t)
Now you can sapply() directly over the elements in the list.
sapply(items, function(v) {
kripp.alpha(as.matrix(v[c("V1","V2")]))$value
})
which produces
[1] 0.0 -0.5
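The question also asked for the score to be added to each data frame as a column. One way to sketch that is lapply() returning the modified data frame. To keep the example free of package dependencies, a hypothetical agreement() function (the share of items on which both coders agree) stands in for kripp.alpha() here:

```r
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1), nrow = 2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1), nrow = 2))

# Keep the data frames themselves (not their names) in a named list
items <- list(l = l, t = t)

# Hypothetical stand-in statistic: share of columns on which the two rows agree
agreement <- function(m) mean(m[1, ] == m[2, ])

# Return each data frame with the score attached as a new column
items <- lapply(items, function(df) {
  df$alpha <- agreement(as.matrix(df))
  df
})

items$t
```

Swapping agreement() for the real statistic only changes the one line inside the anonymous function.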

Loop in R with a function included

I'm very new at R and I would like to do a loop in order to return search volume (through an API call) for a list of keywords.
Here is the code that I used:
install.packages("SEMrushR")
library(SEMrushR)
mes_keywords_to_check <- readLines("voyage.txt") # List of keywords to check
mes_keywords_to_check <- as.character(mes_keywords_to_check)
Loop
for (i in 1:length(mes_keywords_to_check)) {
test_keyword <- as.character(mes_keywords_to_check[i])
df_test_2 <- keyword_overview_all(test_keyword, "fr","API KEY NUMBER") ##keyword_overview_all is the function from the Semrush package
}
By doing this, I only get the Search Volume for the first keyword in the list. My purpose is of course to get the data required for the full list of keywords.
Here is the table that I get: [screenshot not preserved]
Do you have any idea how I could solve this issue?
Well, you need to add your results to some kind of container, for example a list. As of now, you have just one object that gets overwritten with data from the most recent iteration of your loop.
results = list()
for (i in 1:length(mes_keywords_to_check)) {
test_keyword <- as.character(mes_keywords_to_check[i])
df_test_2 <- keyword_overview_all(test_keyword, "fr","API KEY NUMBER") ##keyword_overview_all is the function from the Semrush package
results[[i]] <- df_test_2
}
But most R experts would suggest refraining from using a loop:
library("plyr")
result <- plyr::ldply(mes_keywords_to_check, function(x) keyword_overview_all(as.character(x), "fr","API KEY NUMBER"))
I did not test this, and it probably needs some tweaking, but it should point you in the right direction.
It looks like you're reading in the text file with readLines("voyage.txt"), which returns a character vector with one element per line. These lines are then passed to the for loop. The code below converts the lines to words. There are various approaches, but this one uses a loop within a loop to keep using for(), in case you prefer to search line by line, word by word. It also splits on non-alphanumeric characters with a regex, so that punctuation around words is dropped.
mes_lines <- readLines("voyage.txt") # List of keywords to check
mes_lines <- as.character(mes_lines)
search_results <- list()
for (i in 1:length(mes_lines)) {
mes_keywords_to_check <- unlist(strsplit(mes_lines[i],"[^[:alnum:]]")) # split only the current line
mes_keywords_to_check <- mes_keywords_to_check[nchar(mes_keywords_to_check)>0]
if (length(mes_keywords_to_check)==0) next
for (w in 1:length(mes_keywords_to_check))
{
test_keyword <- as.character(mes_keywords_to_check[w])
print(paste0("Checking word=",test_keyword))
df_test_2 <- keyword_overview_all(test_keyword, "fr","API KEY NUMBER") ##keyword_overview_all is the function from the Semrush package
search_results[[length(search_results) + 1]] <- df_test_2 # append() would splice the data frame's columns into the list
}
}
search_results
Thanks for pointing to the right direction.
Here is what I did, and this is working:
final_result <- data.frame()
mes_keywords_to_check <- readLines("voyage.txt")
mes_keywords_to_check <- as.character(mes_keywords_to_check)
for (i in 1:length(mes_keywords_to_check)) {
test_keyword <- as.character(mes_keywords_to_check[i])
df_test_2 <- keyword_overview_all(test_keyword, "fr","API KEY")
final_result <- rbind(final_result,df_test_2)
}
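Growing a data frame with rbind() inside the loop works, but the same result can be collected with lapply() and bound once at the end. A minimal offline sketch, where fake_overview is a hypothetical stand-in for keyword_overview_all() (no API call is made) and the keywords are made up:

```r
# Hypothetical stand-in for keyword_overview_all(): one row per keyword
fake_overview <- function(keyword) {
  data.frame(keyword = keyword, n_chars = nchar(keyword))
}

keywords <- c("voyage", "vol pas cher", "hotel")

# One data frame per keyword, kept in a list
results <- lapply(keywords, fake_overview)

# Bind all rows in a single call
final_result <- do.call(rbind, results)
final_result
```

The list keeps every iteration's result, which is exactly what the single df_test_2 variable in the original loop could not do.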

Mapping over a custom function

I am getting stuck trying to incorporate the "pages" element here, along with the "type" - the pages element changes but the type remains constant.
The current code I have maps over the tickers and downloads the data into a list format, however it only uses the default number of pages, how can I map over the tickers as well as the pages?
Adding company_filings(pages, type) doesn't work.
library(edgarWebR)
tickers <- c("63908", "793952")
count <- 100
pages <- 1:4
type = "10-K"
custom_company_filings <- function(compfilings){
company_filings(compfilings)
}
x <- Map(custom_company_filings, tickers)
Edit:
What I am trying to achieve is to do something like:
for(tick in tickers){
for(i in pages){
company_filings("get company filings")
}
}
So for each ticker I will collect pages 1, 2, 3 and 4 from the company_filings() function, which comes from the edgarWebR package.
The custom_company_filings() function was my attempt at solving the problem, but when I posted here I removed all my attempts. For instance, one attempt was:
custom_company_filings <- function(compfilings, pages){
company_filings(compfilings, pages)
}
x <- Map(custom_company_filings(page = pages), tickers)
We can use a nested lapply:
library(edgarWebR)
tickers <- c("63908", "793952")
count <- 100
pages <- 1:4
type <- "10-K"
lapply(tickers, function(x){
lapply(pages, function(y){
company_filings(x, type = type, count = count, page = y)
})
})
Or with purrr:
library(purrr)
pmap(expand.grid(tickers, pages, stringsAsFactors = FALSE), ~company_filings(..1, type = type, count = count, page = ..2))
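If a single flat table is preferred over a nested list, the ticker and page pairs can be built once and the results bound together. This is only a sketch: fake_filings is a hypothetical stand-in for company_filings() so that it runs without network access, and stringsAsFactors = FALSE is set so the tickers stay character strings:

```r
# Hypothetical stand-in for company_filings(): echoes its inputs as one row
fake_filings <- function(ticker, page) {
  data.frame(ticker = ticker, page = page)
}

tickers <- c("63908", "793952")
pages <- 1:4

# Every ticker/page combination, with tickers kept as character
grid <- expand.grid(ticker = tickers, page = pages, stringsAsFactors = FALSE)

# One call per row of the grid
res <- Map(fake_filings, grid$ticker, grid$page)
names(res) <- paste(grid$ticker, grid$page, sep = "_p")

# Stack all results into one data frame
all_filings <- do.call(rbind, res)
all_filings
```

Naming the list elements by ticker and page makes it easy to trace each block of rows back to the request that produced it.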

I don't know how to get every single result of my function list out

RStudio Version 1.0.143
Windows Ver: Windows10 Pro
I have 300+ files which have the same structure, and I want to create a loop that calculates the correlation index of the required files. I can get the right files and calculate the correlation index, but I can't get all of the results shown. I tried to save them to a vector, but R tells me "object not found". And if it can work, I am also worried about whether the content of the vector will persist if I run the function several times. Here's the loop:
for(i in ind_larg){
specdata_i <- read.csv(i)
com_case_ind <- complete.cases(specdata_i)
sulfate_i <- specdata_i[,2][com_case_ind]
nitrate_i <- specdata_i[,3][com_case_ind]
ou[i] <- cor(sulfate_i, nitrate_i)
}
and the result
Error: object 'ou' not found
I'm not sure if you need the rest of the code before this, so I attach them at the end here.
setwd("C:/Users/sunxi/Coursera/specdata")
ind <- dir(path = "C:/Users/sunxi/Coursera/specdata", pattern = ".csv") #Save the file names to a vector.
specdata_ful <- lapply(ind, read.csv) #Read all the files into a list of data frames.
specdat_recon_ful <- do.call(rbind, specdata_ful) #Combine the list into one data frame, with each variable in one column.
com_case_ful <- complete.cases(specdat_recon_ful) #Filter the complete cases.
id_ful <- specdat_recon_ful[,4][com_case_ful] #The ID of the complete cases.
sulfate_ful <- specdat_recon_ful[,2][com_case_ful] #The sulfate value of the complete cases.
nitrate_ful <- specdat_recon_ful[,3][com_case_ful] #The nitrate value of the complete cases.
id_freq_ful <- table(id_ful) #Summarize the frequency of each id.
id_freq_mat_ful <- as.data.frame(id_freq_ful) #Convert the table into a data frame.
good <- id_freq_mat_ful[["Freq"]] > 1000 #Flag the frequencies larger than the threshold.
id_good <- id_freq_mat_ful[["id_ful"]][good] #Keep the ids whose number of complete cases exceeds the threshold.
ind_larg <- ind[id_good] #Create an index of the files for the ids with the required frequency.
You have to create the variable ou before you access it with ou[i]:
ou <- c()
for(i in ind_larg){
# your loop here...
ou[i] <- cor(sulfate_i, nitrate_i)
}
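On the worry about the vector's contents carrying over between runs: building the result in one expression, for instance with vapply(), recreates it from scratch every time. A self-contained sketch under stated assumptions: the two CSV files are made up and written to a temporary directory, with sulfate and nitrate in columns 2 and 3 as in the question:

```r
# Write two small, made-up files to a temporary directory
data_dir <- file.path(tempdir(), "specdata_sketch")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(Date = 1:4, sulfate = c(1, 2, NA, 4),
                     nitrate = c(2, 4, 6, 8), ID = 1),
          file.path(data_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = 1:4, sulfate = c(4, 3, 2, 1),
                     nitrate = c(1, 2, 3, NA), ID = 2),
          file.path(data_dir, "002.csv"), row.names = FALSE)

files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# One correlation per file, computed on complete cases only
ou <- vapply(files, function(f) {
  d <- read.csv(f)
  ok <- complete.cases(d)
  cor(d[ok, 2], d[ok, 3])
}, numeric(1))
names(ou) <- basename(files)
ou
```

Because ou is assigned as a whole, rerunning the code never leaves stale entries from an earlier run.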

Output of function that is supposed to be input of another function

So, I have a function:
complete <- function(directory,id = 1:332 ) {
directory <- list.files(path="......a")
g <- list()
for(i in 1:length(directory)) {
g[[i]] <- read.csv(directory[i],header=TRUE)
}
rbg <- do.call(rbind,g)
rbgr <- na.omit(rbg) #reads files and omits NA's
complete_subset <- subset(rbgr,rbgr$ID %in% id,select = ID)
table.rbgr <- sapply(complete_subset,table)
table.rbd <- data.frame(table.rbgr)
id.table <- c(id)
findla.tb <- cbind (id.table,table.rbd)
names(findla.tb) <- c("id","nobs")
print(findla.tb) #creates table with number of observations
}
Basically, when you call it with a specific numeric id (say 4),
you are supposed to get this output
id nobs
15 328
So, I just need the nobs data to be fed into another function which measures the correlation between two columns if the nobs value is greater than another arbitrarily determined value(T). Since nobs is determined by the value of id, I am uncertain how to create a function that takes into account the output of the other function?
I have tried something like this:
corr <- function (directory, t) {
 directory <- list.files(path=".......")
 g <- list()
 for(i in 1:length(directory)) {
 g[[i]] <- read.csv(directory[i],header=TRUE)
  }
rbg <- do.call(rbind,g)
g.all <- na.omit(rbg) #reads files and removes observations
source(".....complete.R") #sourcing the complete function above
complete("spec",id)
g.allse <- subset(g.all,g.all$ID %in% id,scol )
g.allnit <- subset(g.all,g.all$ID %in% id,nit )
for(g.all$ID %in% id) {
if(id > t) {
cor(g.allse,g.allnit) #calcualte correlation of these two columns if they have similar id
}
}
#basically for each id that matches the ID in g.all function, if the id > t variable, calculate the correlation between columns
}
complete("spec", 3)
cr <- corr("spec", 150)
head(cr)
I have also tried to make the complete function a data.frame but it does not work and it gives me the following error:
error in data.frame(... check.names = false) arguments imply differing number of rows. So, I am not sure how to proceed....
First off, a reproducible example always helps in getting your question answered, along with a clear explanation of what your functions do/are supposed to do. We cannot run your example code.
Next, you seem to have an error in your corr function. You make multiple references to id but never actually populate this variable in your example code. So we'll just have to guess at what you need help with.
I think what you are trying to do is:
given an id, call complete with that id
use the nobs from that in your code.
In this case, you need to make sure to store the output of your call to complete, e.g.
comp <- complete('spec', id)
You can access the id column value comp['id'] and the nobs value via comp['nobs'] so you could do e.g.
if (comp['nobs'] > t) {
# do stuff e.g.
cor(g.allse, g.allnit)
}
Make sure you store the output of cor somewhere if you wish to actually get it back later.
You will have to fix the problem of id not being defined yourself, because it is unclear what you want that to be.
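The idea of feeding one function's output into another can be sketched on an in-memory data frame. Everything here is hypothetical: the readings data, the column names, and the two functions complete_counts and corr_above, which take the data as an argument instead of reading files from a directory:

```r
# Made-up readings for three monitors; one row has a missing value
readings <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  sulfate = c(1, 2, 3, 5, NA, 4, 3, 2, 1),
  nitrate = c(2, 4, 6, 1, 2, 1, 2, 3, 5)
)

# Returns (rather than prints) the number of complete observations per id
complete_counts <- function(data, id) {
  ok <- data[complete.cases(data) & data$ID %in% id, ]
  data.frame(id = id, nobs = as.vector(table(factor(ok$ID, levels = id))))
}

# Uses the stored output of complete_counts() to decide which ids to keep
corr_above <- function(data, threshold) {
  counts <- complete_counts(data, unique(data$ID))
  keep <- counts$id[counts$nobs > threshold]
  ok <- data[complete.cases(data), ]
  vapply(keep, function(i) {
    cor(ok$sulfate[ok$ID == i], ok$nitrate[ok$ID == i])
  }, numeric(1))
}

cr <- corr_above(readings, 2)
cr
```

With a threshold of 2, only ids 1 and 3 have enough complete observations, so cr holds two correlations. The key point is that complete_counts() returns its table, so the caller can store it and index into it.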
