How can I do web scraping in Julia? - web-scraping

I want to extract the names of universities and their websites from this site into lists.
In Python I did it with BeautifulSoup v4:
import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
content = BeautifulSoup(page.text, 'html.parser')

college_name = []
college_link = []
college_name_list = content.find_all('h3', class_='college')

for college in college_name_list:
    if college.find('a'):
        college_name.append(college.find('a').text)
        college_link.append(college.find('a')['href'])
I really like programming in Julia and since it's very similar to Python, I wanted to know if I can do web scraping in Julia too. Any help would be appreciated.

Your Python code doesn't quite work anymore; I guess the website has been updated recently, and as far as I can tell they have removed the links.
Here is a similar example using Gumbo.jl and Cascadia.jl.
I am using the built-in download command to download the webpage, which writes it to disk in a temp file that I then read into a String.
It might be cleaner to use HTTP.jl, which could read the page straight into a String, but for this simple example download is fine.
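For reference, the HTTP.jl variant could look roughly like this (a minimal sketch, untested; it assumes HTTP.jl is installed alongside Gumbo.jl):

using HTTP
using Gumbo

url = "https://thebestschools.org/features/best-computer-science-programs-in-the-world/"

# HTTP.get fetches the page into memory; resp.body is a byte vector that can be
# converted straight to a String, so nothing is written to disk.
resp = HTTP.get(url)
page = parsehtml(String(resp.body))

The rest of the code would be unchanged. Anyway, here is the full example using download: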
using Gumbo
using Cascadia

url = "https://thebestschools.org/features/best-computer-science-programs-in-the-world/"
page = parsehtml(read(download(url), String))

college_name = String[]
college_location = String[]

sections = eachmatch(sel"section", page.root)
for section in sections
    maybe_col_heading = eachmatch(sel"h3.college", section)
    if length(maybe_col_heading) == 0
        continue
    end
    col_heading = first(maybe_col_heading)
    name = strip(text(last(col_heading.children)))
    push!(college_name, name)

    loc = first(eachmatch(sel".school-location", section))
    push!(college_location, text(loc[1]))
end
[college_name college_location]
Outputs
julia> [college_name college_location]
51×2 Array{String,2}:
"Massachusetts Institute of Technology (MIT)" "Cambridge, Massachusetts"
"Massachusetts Institute of Technology (MIT)" "Cambridge, Massachusetts"
"Stanford University" "Stanford, California"
"Carnegie Mellon University" "Pittsburgh, Pennsylvania"
⋮
"Shanghai Jiao Tong University" "Shanghai, China"
"Lomonosov Moscow State University" "Moscow, Russia"
"City University of Hong Kong" "Hong Kong"
Seems like it listed MIT twice; probably the filtering code in my demo isn't quite right.
But :shrug:, MIT is a great university, I hear.
Julia was invented there :joy:

Yes.
For the purpose of web scraping, Julia has three libraries:
HTTP.jl to download the source code of the website (comparable to Python's requests library),
Gumbo.jl to parse the downloaded source code into a hierarchical, structured object,
and Cascadia.jl to scrape it using a CSS-selector API.
I saw from your profile that you're young (16), and your Python implementation is correct.
Therefore, I'd suggest you try a web-scraping task with these three libraries to better understand how they work.
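For example, a minimal end-to-end sketch combining the three could look like this (the URL and the "a" selector are only placeholders, not part of your actual task; untested):

using HTTP
using Gumbo
using Cascadia

url = "https://example.com"              # placeholder page
resp = HTTP.get(url)                      # 1. download (comparable to requests.get)
doc = parsehtml(String(resp.body))        # 2. parse the HTML into a tree of elements
links = eachmatch(sel"a", doc.root)       # 3. query the tree with a CSS selector

for link in links
    # each match is a Gumbo HTMLElement; its attributes are stored in a Dict
    println(get(link.attributes, "href", "(no href)"))
end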
The task that you wish to do, unfortunately, cannot yet be accomplished with Cascadia, since the h3 is inside a <span>, which is currently not an implemented SelectorType in Cascadia.jl.
Source

Related

Scraping: No attribute find_all for <p>

Good morning :)
I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=438653&CID=102313
The text I am trying to get seems to be located inside some <p> and separated by <br>.
For some reason, whenever I try to access a <p>, I get the following error: "ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?", and this happens even if I use find() instead of find_all().
My code is below (it is a very simple thing with no loop yet; I just would like to identify where the error comes from):
from selenium import webdriver
import time
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='MYPATH/chromedriver',options=options)
url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("div", class_="col-sm-12")
people_in_column = column.find_all("p").find_all("br")
Is there anything obvious I am not understanding here?
Thanks a lot in advance for your help!
You are trying to call find_all on a ResultSet (a list of elements), which is incorrect: find_all returns a list, so you cannot call find_all on the result again without iterating over it. The correct way is as follows; hope it works.
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Full working code as an example:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])

webdriver_service = Service("./chromedriver")  # Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=options)

url = "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5)  # wait up to 5 seconds before calls to find elements time out
driver.get(url)

content = driver.page_source  # .encode('utf-8').strip()
soup = BeautifulSoup(content, "html.parser")

columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Output:
Notice of NIH Policy to All Applicants: Meeting rosters are provided for information purposes only. Applicant investigators and institutional officials must not communicate directly with study section members about an application before or after the review. Failure to observe this policy will create a serious breach of integrity in the peer review process, and may lead to actions outlined in NOT-OD-22-044, including removal of the application from immediate review.

How can I use beautiful soup to get the following data from kick starter?

I am trying to get some data from Kickstarter. How can I use the Beautiful Soup library to do it?
Kick Starter link
https://www.kickstarter.com/discover/advanced?woe_id=2347575&sort=magic&seed=2600008&page=7
These are the following information I need
Crowdfunding goal
Total crowdfunding
Total backers
Length of the campaign (# of days)
This is my current code
import requests
r = requests.get('https://www.kickstarter.com/discover/advanced?woe_id=2347575&sort=magic&seed=2600008&page=1')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', attrs={'js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg'})
len(results)
I'll give you some hints from what I know, and hope you can do the rest by yourself.
1. Crawling can create legal problems when you abuse a site's Terms of Service.
2. find_all should be used with a 'for' statement; it works like "find all" on a web page (Ctrl + F).
e.g.
for a in soup.find_all('div', attrs={'js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg'}):
    print(a)
3. Links should be opened in a 'for' statement - https://www.kickstarte...seed=2600008&page=1 - the bold number is repeated in the for statement, so you can crawl all the data in order.
4. You should follow links twice: the link above gives a list of projects, and you then have to get the link of each project.
So the code's algorithm looks like this:
for i in range(0, 10000):
    url = www.kick.....page=i
    for pj_link in find_all(each pj's link):
        r2 = requests.get(pj_link)
        soup2 = BeautifulSoup(r2.text, 'html.parser')
        ......

Extracting full article text via the newsanchor package [in R]

I am using the newsanchor package in R to try to extract entire article contents via NewsAPI. For now I have done the following:
require(newsanchor)
results <- get_everything(query = "Trump +Trade", language = "en")
test <- results$results_df
This gives me a dataframe full of info on (at most) 100 articles. These however do not contain the entire actual article text. Rather, they contain something like the following:
[1] "Tensions between China and the U.S. ratcheted up several notches over the weekend as Washington sent a warship into the disputed waters of the South China Sea. Meanwhile, Google dealt Huaweis smartphone business a crippling blow and an escalating trade war co… [+5173 chars]"
Is there a way to extract the remaining 5173 chars? I have tried to read the documentation but I am not really sure how.
I don't think that is possible, at least with the free plan. If you go through the documentation at https://newsapi.org/docs/endpoints/everything, in the Response object section it says:
content - string
The unformatted content of the article, where available. This is truncated to 260 chars for Developer plan users.
So all the content is restricted to only 260 characters. However, test$url has the link to the source article, which you can use to scrape the entire content; but since the articles are aggregated from various sources, I don't think there is one automated way to do this.

How to get the items count in xml using marklogic

I am new to MarkLogic. I need to get the total count of books from the following XML. Can anyone suggest how?
<bk:bookstore xmlns:bk="http://www.bookstore.org">
  <bk:book category='Computer'>
    <bk:author>Gambardella, Matthew</bk:author>
    <bk:title>XML Developer's Guide</bk:title>
    <bk:price>44.95</bk:price>
    <bk:publish_year>1995</bk:publish_year>
    <bk:description>An in-depth look at creating applications with XML.</bk:description>
  </bk:book>
  <bk:book category='Fantasy'>
    <bk:author>Ralls, Kim</bk:author>
    <bk:title>Midnight Rain</bk:title>
    <bk:price>5.95</bk:price>
    <bk:publish_year>2000</bk:publish_year>
    <bk:description>A former architect battles corporate zombies, an evil
      sorceress, and her own childhood to become queen of the world.</bk:description>
  </bk:book>
  <bk:book category='Comic'>
    <bk:author>Robert M. Overstreet</bk:author>
    <bk:title>The Overstreet Indian Arrowheads Identification </bk:title>
    <bk:price>2000</bk:price>
    <bk:publish_year>1991</bk:publish_year>
    <bk:description>A leading expert and dedicated collector, Robert M.
      Overstreet has been writing The Official Overstreet Identification and Price
      Guide to Indian Arrowheads for more than 21 years</bk:description>
  </bk:book>
  <bk:book category='Comic'>
    <bk:author>Randall Fuller</bk:author>
    <bk:title>The Book That Changed America</bk:title>
    <bk:price>1000</bk:price>
    <bk:publish_year>2017</bk:publish_year>
    <bk:description>The New York Times Book Review Throughout its history
      America has been torn in two by debates over ideals and beliefs.</bk:description>
  </bk:book>
</bk:bookstore>
Can anyone help me find the solution for this, as I am new to MarkLogic?
I'd suggest using cts:count-aggregate in combination with cts:element-reference. This requires you to have an element range index on book.
cts:count-aggregate(cts:element-reference(fn:QName("http://www.bookstore.org", "book")))
If performance isn't too critical and your document count isn't too large, you could also count with fn:count.
declare namespace bk="http://www.bookstore.org";
fn:count(//bk:book)
Try this:
declare namespace bk="http://www.bookstore.org";
let $book_xml :=
  <bk:bookstore xmlns:bk="http://www.bookstore.org">
    <bk:book>
      ........
    </bk:book>
    ........
  </bk:bookstore>
return fn:count($book_xml//bk:book)
Hope that helps!

Can MeCab be configured / enhanced to give me the reading of English words too?

If I begin with a wholly Japanese sentence and run it through MeCab, I get something like this:
$ echo "吾輩は猫である" | mecab
吾輩 名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
猫 名詞,一般,*,*,*,*,猫,ネコ,ネコ
で 助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ
ある 助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル
EOS
If I smash together everything I get from the last column, I get "ワガハイワネコデアル", which I can then feed into a speech synthesis program and get output. Said program, however, doesn't handle English words.
If I throw English into MeCab, it manages to tokenise it (probably naively at the spaces), but gives no readings:
$ echo "I am a cat" | mecab
I 名詞,固有名詞,組織,*,*,*,*
am 名詞,一般,*,*,*,*,*
a 名詞,一般,*,*,*,*,*
cat 名詞,固有名詞,組織,*,*,*,*
EOS
I want to get readings for these as well, even if they're not perfect, so that I can get something along the lines of "アイアムアキャット".
I have already scoured the web for solutions, and while I do find a bunch of websites whose transliteration appears adequate, I can't find any way to do it in my own code. In a couple of cases, I emailed the site authors and have gotten no response yet after waiting for a few weeks. (Just how far behind on their inboxes are these people?)
There are a number of directions I can go but I hit dead ends on all of them so far, so this is my compound question:
MeCab takes custom dictionaries. Is there a custom dictionary which fills in the English knowledge somewhat?
Is there some other library or tool that can take English and spit out Katakana?
Is there some library or tool that can take IPA (International Phonetic Alphabet) and spit out Katakana? (I know how to get from English to IPA.)
As an aside, I find that the software "VOICEROID" can speak English text (poorly, but adequately for my purposes). This software uses MeCab too (or at least its DLL and dictionary files are included in the install.) It also uses another library, Cabocha, which as far as I can tell by running it does the exact same thing as MeCab. It could be using custom dictionaries for either of these two libraries to do the job, or the code to do it could be in the proprietary AITalk library they are using. More research is needed and I haven't figured out how to run either tool against their dictionaries to test it out directly either.
