Loop for a string - r

This code will be used to count the number of links in my tweets collection. The collection was gathered from 10 accounts. The question is: how could I loop through the ten accounts in one piece of code and put the output in a table or graph? "Unames" represents the name of the account. Thanks in advance.
mydata <- read.csv("tweets.csv",sep=",", header=TRUE)
head(mydata)
dim(mydata)
colnames(mydata)
# tweets for each university
table(mydata$University)
Unames<- unique(mydata$University)
mystring <- function(Uname, string){
  mydata_temp <- subset(mydata, University == Uname)
  mymatch <- rep(NA, dim(mydata_temp)[1])
  for(i in 1:dim(mydata_temp)[1]){
    mymatch[i] <- length(grep(string, mydata_temp[i, 2]))
  }
  return(mymatch)
}
# web link, e.g. Here I would like to see the total links for all universities in a table or graph; the code below only gives me the output one account at a time!
mylink <- mystring(Unames[1],"http://")

So my suspicions are wrong and you do have a body of data for which this command produces the desired results (and you expect all the other accounts to behave the same way):
mylink <- mystring(Unames[1],"http://")
In that case, you should just do this:
links_list <- lapply(Unames, mystring, "http://")
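To get from that list to a table or graph, here is a minimal sketch, assuming mydata, Unames, and mystring() are defined as above (each element of the result is 0 or 1 per tweet, so summing gives the number of link-containing tweets per university):

links_list <- lapply(Unames, mystring, "http://")

# total number of tweets containing a link, per university
link_counts <- sapply(links_list, sum)
names(link_counts) <- Unames

link_counts   # table of totals
barplot(link_counts, las = 2, main = "Tweets containing links, by university")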


Web scraping in R using for loop

I would like to scrape the data from this link, and I have written the following code in R to do so. This, however, does not work and only returns the first page of the results. Apparently, the loop does not work. Does anybody know what's wrong with the loop?
library('rvest')
for (i in 1:40) {
  webpage <- read_html(paste0(("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=, i"))
  rank_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(1)')
  rank_data <- html_text(rank_data_html)
  rank_data <- as.numeric(rank_data)
  title_data_html <- html_nodes(webpage,'.censo_list font')
  title_data <- html_text(title_data_html)
  author_data_html <- html_nodes(webpage,'.censo_list+ td font')
  author_data <- html_text(author_data_html)
  country_data_html <- html_nodes(webpage,'.censo_list~ td:nth-child(4) font')
  rcountry_data <- html_text(country_data_html)
  year_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(5) font')
  year_data <- html_text(year_data_html)
  type_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(6) font')
  type_data <- html_text(type_data_html)
}
censorship_df<-data.frame(Rank = rank_data, Title = title_data, Author = author_data, Country = rcountry_data, Type = type_data, Year = year_data)
write.table(censorship_df, file="sample.csv",sep=",",row.names=F)
Are you sure there's anything wrong with the loop? I would expect it to get the first page of results 40 times. Look at
webpage <- read_html(paste0(("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=, i"))
Shouldn't that be the following (the difference is in the last ten characters of the string; the quotation mark moves)?
webpage <- read_html(paste0("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=", i))
What paste0() does in R is stitch strings together without any separator. But you only have one string, so it tries to fetch results for page=, i literally. You want it to fetch page=1 through page=40, so move the quotation mark so that the string ends with page=" and paste0() joins the URL and i together.
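For instance, with a made-up URL just to illustrate the behavior:

paste0("http://example.com/result.html?page=", 3)
# [1] "http://example.com/result.html?page=3"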
I'm not an R programmer, but that simply leaps out at me.
Source for paste0 behavior.
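For completeness, here is a minimal sketch of the whole loop with the corrected paste0() call. It also collects each page's rows in a list instead of overwriting them on every iteration, assuming the CSS selectors from the question return one value per table row:

library(rvest)

base_url <- "http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page="

pages <- vector("list", 40)
for (i in 1:40) {
  webpage <- read_html(paste0(base_url, i))
  # one data frame per page, built from the same selectors as in the question
  pages[[i]] <- data.frame(
    Rank    = as.numeric(html_text(html_nodes(webpage, 'tr+ tr td:nth-child(1)'))),
    Title   = html_text(html_nodes(webpage, '.censo_list font')),
    Author  = html_text(html_nodes(webpage, '.censo_list+ td font')),
    Country = html_text(html_nodes(webpage, '.censo_list~ td:nth-child(4) font')),
    Year    = html_text(html_nodes(webpage, 'tr+ tr td:nth-child(5) font')),
    Type    = html_text(html_nodes(webpage, 'tr+ tr td:nth-child(6) font')),
    stringsAsFactors = FALSE
  )
}
censorship_df <- do.call(rbind, pages)
write.table(censorship_df, file = "sample.csv", sep = ",", row.names = FALSE)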

JSON applied over a dataframe in R

I used the code below on one website and it returned a perfect result. It looks for the keyword Emaar, pasted at the end of the query:
library(httr)
library(jsonlite)
query<-"https://www.googleapis.com/customsearch/v1?key=AIzaSyA0KdZHRkAjmoxKL14eEXp2vnI4Yg_po38&cx=006431301429107149113:as7yqcm2qc8&q=Emaar"
result11 <- content(GET(query))
print(result11)
result11_JSON <- toJSON(result11)
result11_JSON <- fromJSON(result11_JSON)
result11_df <- as.data.frame(result11_JSON)
Now I want to apply the same function over a data frame containing keywords, so I made the test .csv file below:
Company Name
[1] ADES International Holding Ltd
[2] Emirates REIT (CEIC) Limited
[3] POLARCUS LIMITED
I called it Testing Website Extraction.csv. The code I used:
test_companies <- read.csv("... \\Testing Website Extraction.csv")
# removing spaces and adding a "+" sign, then pasting the query before it (the query already has my unique Google key and search engine ID)
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
The result here is a list of length 3 (the 3 search terms), with a sublist for each term containing url (list[2]), queries (list[2]), ... items (list[10]), and these are the same for each search term (same lengths separately). My issue is applying the remainder of the code. When I run:
result_JSON <- toJSON(result)
result_JSON <- as.list(fromJSON(result_JSON))
I get a list of 6 lists that contain sublists, and putting it into a tidy data frame where the results are listed under each other (not separately) is proving to be difficult.
Also note that I tried taking the 3 separate lists out of result one by one, but that is a lot of manual labor if I have a longer list of keywords.
The expected end result should include 30 observations of 37 variables (for each search term, 10 observations of 37 variables, all underneath each other).
Things I have tried unsuccessfully:
These work to flatten the list:
#do.call(c , result)
#all.equal(listofvectors, res, check.attributes = FALSE)
#unlist(result, recursive = FALSE)
# for (i in 1:length(result)) {listofvectors <- c(listofvectors, result[[i]])}
#rbind()
#rbind.fill()
Even after flattening, I don't know how to organize them into a tidy final output that a non-R user can interact with.
Any help here would be greatly appreciated; I am here in case anything is not clear about my question.
Always happy to learn more about R, so please bear with me as I am just starting to catch up.
All the best and thanks in advance!
Basically what I did was extract only the columns I need from the list of data frames; below is the final code:
library(httr)
library(jsonlite)
library(tidyr)
library(stringr)
library(purrr)
library(plyr)
test_companies <- read.csv("c:\\users\\... Companies Without Websites List.csv")
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
function_toJSONall <- function(all) {toJSON(all)}
a <- lapply(result, function_toJSONall)        # convert each search result to JSON
function_fromJSONall <- function(all) {fromJSON(all)}
b <- lapply(a, function_fromJSONall)           # parse the JSON back into R lists
function_dataframe <- function(all) {as.data.frame(all)}
c <- lapply(b, function_dataframe)             # one data frame per search term
function_column <- function(all) {all[ ,15:30]}
result_final <- lapply(c, function_column)     # keep only the columns of interest
results_df <- rbind.fill(c[])                  # stack the data frames underneath each other
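The whole pipeline can also be condensed into a single call, assuming each element of result converts to a data frame the same way as above; rbind.fill() from plyr pads any missing columns with NA, so the per-keyword frames stack underneath each other:

results_df <- rbind.fill(lapply(result, function(r) as.data.frame(fromJSON(toJSON(r)))))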

how to search through a column of links looking for string matches in r?

I have a data table with a list of .txt links in a single column. I am looking for a way for R to search within each link to see whether the file contains either of the strings "discount rate" or "discounted cash flow". I then want R to create two columns next to each link (one for discount rate and one for discounted cash flow) that contain a 1 if the string is present and a 0 if not.
Here's a small list of sample links that I would like to sift through:
http://www.sec.gov/Archives/edgar/data/1015328/0000913849-04-000510.txt
http://www.sec.gov/Archives/edgar/data/1460306/0001460306-09-000001.txt
http://www.sec.gov/Archives/edgar/data/1063761/0001047469-04-028294.txt
http://www.sec.gov/Archives/edgar/data/1230588/0001178913-09-000260.txt
http://www.sec.gov/Archives/edgar/data/1288246/0001193125-04-155851.txt
http://www.sec.gov/Archives/edgar/data/1436866/0001172661-09-000349.txt
http://www.sec.gov/Archives/edgar/data/1089044/0001047469-04-026535.txt
http://www.sec.gov/Archives/edgar/data/1274057/0001047469-04-023386.txt
http://www.sec.gov/Archives/edgar/data/1300379/0001047469-04-026642.txt
http://www.sec.gov/Archives/edgar/data/1402440/0001225208-09-007496.txt
http://www.sec.gov/Archives/edgar/data/35527/0001193125-04-161618.txt
Maybe something like this...
checktext <- function(file, text) {
  filecontents <- readLines(file)
  return(as.numeric(any(grepl(text, filecontents, ignore.case = TRUE))))
}
df$DR <- sapply(df$file_name, checktext, "discount rate")
df$DCF <- sapply(df$file_name, checktext, "discounted cash flow")
A much faster version, thanks to Gregor's comment below, would be
checktext <- function(file, text) {
  filecontents <- readLines(file)
  sapply(text, function(x) as.numeric(any(grepl(x, filecontents,
                                                ignore.case = TRUE))))
}
df[, c("DR", "DCF")] <- t(sapply(df$file_name, checktext,
                                 c("discount rate", "discounted cash flow")))
Or if you are doing it from URLs rather than local files, replace df$file_name with df$websiteURL in the above. It worked for me on the short list you provided.
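A hypothetical usage example, assuming the links above are stored in a data frame column called file_name (readLines() accepts URLs directly):

df <- data.frame(
  file_name = c(
    "http://www.sec.gov/Archives/edgar/data/1015328/0000913849-04-000510.txt",
    "http://www.sec.gov/Archives/edgar/data/1460306/0001460306-09-000001.txt"
  ),
  stringsAsFactors = FALSE
)

df[, c("DR", "DCF")] <- t(sapply(df$file_name, checktext,
                                 c("discount rate", "discounted cash flow")))
df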

for loop read html challenge in R

I would like to loop a parse query.
The thing that stops me is that I need to insert a number into the URL that R then reads and parses. The URL has to be between quotation marks; does anyone know how to insert the i from the for loop so that it gets substituted and R is still able to retrieve the page?
This is the code (I would like a list with all the artists of the charts of the 52 weeks):
library(rvest)
weeknummer = 1:52
l <- c()
b <- c()
for (i in weeknummer){
  htmlpage <- read_html("http://www.top40.nl/top40/2015/week-"[i]"")
  Top40html <- html_nodes(htmlpage,".credit")
  top40week1 <- html_text(Top40html)
  b <- top40week1
  l <- c(l,b)
}
You need to turn the URL into one string.
pageurl <- paste0("http://www.top40.nl/top40/2015/week-",i)
htmlpage <- read_html(pageurl)
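A minimal sketch of the complete loop with that fix, assuming the .credit selector holds the artist names as in the question:

library(rvest)

artists <- c()
for (i in 1:52) {
  pageurl  <- paste0("http://www.top40.nl/top40/2015/week-", i)
  htmlpage <- read_html(pageurl)
  weekdata <- html_text(html_nodes(htmlpage, ".credit"))
  artists  <- c(artists, weekdata)
}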

For loop skip if table does not exist

I ran into another problem. I have a for loop that contains URLs to scrape batting information from a table with the id batting_gamelogs. If that id does not exist on the page, the loop should move on to the next URL; otherwise it should scrape the table.
I think it should be something like this below, but I can't get it to work.
if xpathSApply(batting, '//*[@id != "batting_gamelogs"]')[[1]] next
else
{
  tableNode <- xpathSApply(batting, '//*[@id="batting_gamelogs"]')[[1]]
  data <- readHTMLTable(tableNode, stringsAsFactors = FALSE)
  data # select the first table
  total <- cbind(id, year, data)
  batlist <- rbind(batlist, total)
}
I have attached sample code.
# SCRAPE BATTING STATS
data = NULL
batlist = NULL
battingURLs <- paste("http://www.baseball-reference.com", yplist[,c("hrefs")], sep="")
for(thisbattingURL in battingURLs){
  batting <- htmlParse(thisbattingURL)
  fstampid <- regexpr("&", thisbattingURL, fixed=TRUE)-1
  fstampyr <- regexpr("year=", thisbattingURL, fixed=TRUE)+5
  id <- substr(thisbattingURL, 53, fstampid)
  year <- substr(thisbattingURL, fstampyr, 75)
  tableNode <- xpathSApply(batting, '//*[@id="batting_gamelogs"]')[[1]]
  data <- readHTMLTable(tableNode, stringsAsFactors = FALSE)
  data # select the first table
  total <- cbind(id,year,data)
  batlist <- rbind(batlist, total)
}
batlist
Any help is much appreciated!
I can't get it to work.
This phrase should always be a reminder to say what actually happened (and how it differs from what you expected to happen). I suspect what happened was that it skipped too often (rather than failing to skip when it should have). But you could tell us that, instead of leaving us to figure it out.
if xpathSApply(batting, '//*[@id != "batting_gamelogs"]')[[1]] next
The "not" is in the wrong place. Here, you're saying, skip this iteration if there is an element on the page that has an id attribute whose value is not batting_gamelogs. Instead you want to skip this iteration if there is no element on the page that has an id attribute whose value is batting_gamelogs.
So, use this for your XPath expression:
'//*[@id = "batting_gamelogs"]'
and put the "not" outside of xpathSApply(), by testing whether the length of the result list is zero (thanks to the answer at https://stackoverflow.com/a/25553805/423105):
if (length(xpathSApply(batting, '//*[@id = "batting_gamelogs"]')) == 0) next
I took out the [[1]] because you just want to test whether any values are returned; you don't care about extracting the first result.
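Putting it together, a minimal sketch of the loop with the skip check in place, assuming yplist and the XML-based parsing from the question:

library(XML)

batlist <- NULL
battingURLs <- paste("http://www.baseball-reference.com", yplist[, c("hrefs")], sep = "")
for (thisbattingURL in battingURLs) {
  batting <- htmlParse(thisbattingURL)
  # skip this URL if the page has no batting_gamelogs table
  if (length(xpathSApply(batting, '//*[@id = "batting_gamelogs"]')) == 0) next
  fstampid <- regexpr("&", thisbattingURL, fixed = TRUE) - 1
  fstampyr <- regexpr("year=", thisbattingURL, fixed = TRUE) + 5
  id   <- substr(thisbattingURL, 53, fstampid)
  year <- substr(thisbattingURL, fstampyr, 75)
  tableNode <- xpathSApply(batting, '//*[@id = "batting_gamelogs"]')[[1]]
  data  <- readHTMLTable(tableNode, stringsAsFactors = FALSE)
  total <- cbind(id, year, data)
  batlist <- rbind(batlist, total)
}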
