I'm a student attending university lessons from home. My teacher just gave me an assignment: collect all the titles and subtitles from an online Italian newspaper that include the words "coronavirus" and/or "covid 19" within certain dates (from the 22nd to the 29th of January, plus the 1st and the 8th of April), and transcribe them into an Excel file to analyze the words used.
I searched online and figured that this could be done by scraping, which made my day, considering that I need to find something like 100-150 titles plus subtitles and I have a very short deadline. Unfortunately, I am also a beginner at this, and all I managed by myself was to find a way to collect just the titles from the webpage. I'm using Data Miner with Google Chrome, as a beginner is supposed to.
In practice, I need to find all titles and subtitles on the website "La Gazzetta dello Sport" (link below) that contain the words "coronavirus" and/or "covid 19", but there is a problem: I can see only the titles on the search page, and to get the subtitles I have to click on the article and go to another page. Is there a way to obtain all the results with Data Miner, or should I use another program?
So, to put it simply: I can't figure out how to make Data Miner collect the title from the search page, click it to go to the article page, collect the subtitle, go back to the search page, move on to the next title, and repeat. I don't know if this is possible or just sci-fi. Like I said, I'm a total newbie at this and it's my first time using this kind of tool.
Url: https://www.gazzetta.it/nuovaricerca/home.shtml?q=coronavirus&dateFrom=2020-01-22&dateTo=2020-01-29
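If Data Miner turns out not to handle the click-through step, the same loop can be sketched in a few lines of Python. This is only a rough sketch under assumptions: the query parameters (q, dateFrom, dateTo) come from the URL above, but the page markup is guessed. It assumes each result title is a link inside an h2 tag and that the subtitle is the first h2 on the article page. Both are hypothetical and need checking against the real pages. The names search_urls, HeadingLinks, FirstTagText and scrape are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

SEARCH_URL = "https://www.gazzetta.it/nuovaricerca/home.shtml"  # from the question


def search_urls(keywords, date_ranges):
    """Build one search URL per (keyword, date range) pair."""
    return [
        SEARCH_URL + "?" + urlencode({"q": kw, "dateFrom": start, "dateTo": end})
        for kw in keywords
        for start, end in date_ranges
    ]


class HeadingLinks(HTMLParser):
    """Collect (text, href) for every <a> nested inside an <h2>.

    ASSUMPTION: result titles are links inside <h2> tags.
    """

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.href = None
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.links.append(("".join(self.text).strip(), self.href))
            self.href = None
        elif tag == "h2":
            self.in_h2 = False


class FirstTagText(HTMLParser):
    """Capture the text of the first <tag> element on a page."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.state = "before"  # before -> inside -> done
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and self.state == "before":
            self.state = "inside"

    def handle_data(self, data):
        if self.state == "inside":
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self.state == "inside":
            self.state = "done"


def scrape(search_page_url, fetch, subtitle_tag="h2"):
    """Return (title, subtitle, url) rows for one results page.

    `fetch` is any function mapping a URL to its HTML text.
    ASSUMPTION: the subtitle is the first <subtitle_tag> of the article.
    """
    results = HeadingLinks()
    results.feed(fetch(search_page_url))
    rows = []
    for title, href in results.links:
        article_url = urljoin(search_page_url, href)  # resolve relative links
        article = FirstTagText(subtitle_tag)
        article.feed(fetch(article_url))
        rows.append((title, "".join(article.parts).strip(), article_url))
    return rows
```

For live pages, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read().decode("utf-8")`, provided the results are present in the served HTML rather than loaded by JavaScript afterwards. The rows can then be written out with the `csv` module and opened in Excel.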
Related
There is a blog that keeps track of equipment losses during the invasion of Ukraine. I would like to scrape the blog and transfer the information into an Excel file.
The first problem is that the blog mostly presents information by grouping links to photos under a heading on the website. The photos have dates written on them.
Example: The equipment type heading (Tanks) and then the model (T-62) and the link (1,captured)
image1
And if you click on the link you get this image:
image2
I would like to extract the dates from the photos and then add those dates alongside the equipment type and model number in an Excel file.
Example:
image3
Naturally, the dates on the photos are heterogeneous in format (12.12.22 vs 12/12/22), position, and style.
The second problem is that often the blog will link to a tweet instead.
Can you point me in the right direction to begin coding something like this?
I've looked into the Tesseract library.
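Once an OCR step (for instance Tesseract through the pytesseract wrapper's image_to_string) has turned a photo into text, the mixed date formats can be normalised with a regular expression. The sketch below covers only that normalisation step and rests on two assumptions that may not hold for every photo: dates are written day-first, and two-digit years belong to the 2000s. The name find_dates is made up for this example.

```python
import re
from datetime import date

# Day, month, year separated by ".", "/" or "-"; the year has 4 or 2 digits.
DATE_RE = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4}|\d{2})\b")


def find_dates(ocr_text):
    """Return ISO-formatted dates found in OCR text (ASSUMPTION: day-first)."""
    found = []
    for day, month, year in DATE_RE.findall(ocr_text):
        year = int(year)
        if year < 100:
            year += 2000  # ASSUMPTION: two-digit years are in the 2000s
        try:
            found.append(date(year, int(month), int(day)).isoformat())
        except ValueError:
            continue  # impossible date, e.g. an OCR misread giving month 13
    return found
```

Both `find_dates("12.12.22")` and `find_dates("12/12/22")` give the same ISO string, so the spreadsheet column stays consistent. Photos where no date is found would still need a manual look.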
I'd like to scrape data from the following website: http://maps2.roktech.net/durhamnc_gomaps4/
In a separate spreadsheet on my computer, I have a list of parcel IDs, corresponding to various properties in the county.
Here's what needs to happen:
1. First, copy and paste parcel ID (from a separate spreadsheet) into the search box, to search by parcel.
2. Then, copy and paste all the columns of data that show up associated with that parcel ID, and paste it into the spreadsheet.
And that's it! It sounds pretty simple, but I can't seem to figure it out. I've tried using UiPath, but I'm not experienced with the software.
How could I go about doing this? How difficult is this to do?
Thanks so much for any help or assistance.
Ryan
Please watch the following training video on how to scrape data using UiPath.
https://www.tutorialspoint.com/uipath/uipath_studio_data_scraping_and_screen_scraping.htm
It is highly recommended to go through the free UiPath RPA Academy training videos, which will quickly put you in the know.
https://academy.uipath.com/learn/course/internal/view/elearning/372/level-1-foundation-training-revamped-20182
You don't need UiPath for this job.
1. Go to the site.
2. Zoom out.
3. Click "select options".
4. Select the whole area.
5. A table with "1000 results" will appear. On the same line where "1000 results" is written there are 4 buttons; the last one is "export to EXCEL". Click it and you will have the whole data in one table, which you can then filter.
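The final filtering step can also be scripted instead of done by hand. A minimal sketch, assuming the export has been saved as CSV and that the parcel ID column is called PARCEL_ID (a guess; use whatever header the real export has). The function name filter_by_parcel and the sample rows are invented for illustration.

```python
import csv
import io


def filter_by_parcel(export_file, parcel_ids, id_column="PARCEL_ID"):
    """Return the rows of an exported table whose ID is in parcel_ids.

    ASSUMPTION: the export is CSV text with a header named id_column.
    """
    wanted = {pid.strip() for pid in parcel_ids}
    return [
        row
        for row in csv.DictReader(export_file)
        if (row.get(id_column) or "").strip() in wanted
    ]


# Made-up sample data standing in for the exported table.
sample = io.StringIO("PARCEL_ID,OWNER\n100,A\n200,B\n300,C\n")
matches = filter_by_parcel(sample, ["200", " 300"])
```

With a real file, `open("export.csv", newline="")` would replace the `io.StringIO` object, and the parcel IDs would be read from the separate spreadsheet.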
I have a Google Doc page that is a "slider" quiz. For example, people rate themselves on a scale according to how comfortable they are with, say, Microsoft Word (0 = weak to 5 = strong), then PowerPoint, etc.
These responses are submitted and saved in a Google spreadsheet, "responses.csv".
Based on the response in each column, I want to use the "Document Studio" add-on, for which I select "Google Slide" as the option, so that it builds a Google Slide from the responses.
But I want a function that takes the replies (values 1-5) and gives me an image, so I wrote if(A1=1, "drive.google.com/1.jpg", A1=2, "drive.google.com/2.jpg"). Then I referenced the column "image-slider-1".
However, the image does not show up in the Google Slide, and I don't know why. I tried to reference the slider value and import an image from Google Docs.
I'm attempting to emergency-revamp my print company's website after the guy "developing" the site for me simply disappeared off the face of the planet last week, leaving me with no site and potentially countless thousands in lost revenue ("not happy" isn't even close). All went well until I came across this issue, and for the life of me I can't find an answer to it anywhere:
Creating a tabbed table containing sizes and prices in WordPress... easy (this is not a stereotypical pricing table)
Integrating WooCommerce into the chosen theme... easy
Now, making each price an individual "add to cart" button - major rage quit imminent.
I have no interest in making a product for every single conceivable variation, as it's simply not necessary if I can get the tabbed table working as described. Having to do so would probably give me heart failure, as I would then have to create yet more unnecessary graphics. My customers just want to be able to see the price, click the price (and thus "add it to cart"), purchase, done.
Here's the tabbed table in question, in case everything I've just typed makes zero sense: http://www.protradeprinting.com/canvasprints/
ANY suggestions would be a big help.
I am interested in extracting data on paranormal activity reported in the news, so that I can analyze the time and place of each appearance for any correlations. This project is just for fun, to learn and use web scraping, text extraction, and spatial and temporal correlation analysis. So please forgive my choice of topic; I wanted to do something interesting and challenging.
First, I found a website that has a collection of reported paranormal incidents, with collections for 2009, 2010, 2011 and 2012.
The structure of the website is as follows: each year has 1 to 10 pages, and the links look like this
for year 2009
link http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
On each page they have collected the stories under headings like this
Internal structure
Paranormal Activity, Posted 03-14-09
Each of these headlines has two pages inside it, like this
link http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
On each of these pages they have the actual reported stories collected under various headlines, along with links to the original websites for those stories. I am interested in collecting the reported text and extracting the kind of paranormal activity (ghost, demon or UFO) and the time, date and place of each incident. I wish to analyze this data for any spatial and temporal correlations. If UFOs or ghosts are real, their movements must show some behavior and correlations in space or time. That is the long and short of the story...
I need help scraping the text from the pages above. Below is the code I have written to follow one page and its links down to the final text I want. Can anyone tell me whether there is a better, more efficient way to get clean text from the final page, and how to automate collecting the text by following all 10 pages for the whole of 2009?
library(XML)
# Source of paranormal news: about.com
# First page to start from (2009)
pn.url <- "http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"
pn.html <- htmlTreeParse(pn.url, useInternalNodes = TRUE)
# Headline text of every <h3> on the index page
pn.h3 <- xpathSApply(pn.html, "//h3", xmlValue)
# Links of the headlines to follow to each story (@href selects the attribute)
pn.h3.links <- xpathSApply(pn.html, "//h3/a/@href")
# Follow the first headline link of the internal structure ...
# Paranormal Activity, Posted 01-03-09 (following this headline)
# http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
pn.l1.url <- pn.h3.links[1]
pn.l1.html <- htmlTreeParse(pn.l1.url, useInternalNodes = TRUE)
pn.l1.links <- xpathSApply(pn.l1.html, "//p/a/@href")
# Follow one of the story links found on that page ...
# British couple has 'black-and-white-twins' twice (following this headline)
# http://www.msnbc.msn.com/id/28471626/
pn.l1.f1.url <- pn.l1.links[7]
pn.l1.f1.html <- htmlTreeParse(pn.l1.f1.url, useInternalNodes = TRUE)
# Keep every text node that is not inside a <script> or <style> element
pn.l1.f1.text <- xpathSApply(pn.l1.f1.html, "//text()[not(ancestor::script)][not(ancestor::style)]", xmlValue)
Thank you sincerely in advance for reading my post and for taking the time to help me.
I would be grateful to any expert who would like to mentor me through this whole project.
Regards
Sathish
Try the Scrapy and BeautifulSoup libraries. They are Python based rather than R based, but they are widely regarded as among the best tools for scraping. You can use the command line interface to connect the two languages; for more details about connecting R and Python, have a look here.
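To give an idea of what the Python side looks like, here is a minimal sketch of the same two extraction steps as the R code in the question: the links inside h3 headings, and the visible text with script and style content left out. To stay dependency-free it uses the standard library's html.parser as a stand-in for BeautifulSoup, so it is not BeautifulSoup's API. The names PageText and parse_page are made up for this example.

```python
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collect <h3><a href> links and visible text, skipping script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.in_h3 = False
        self.h3_links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            href = dict(attrs).get("href")
            if href:
                self.h3_links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self.skip_depth and data.strip():
            self.text.append(data.strip())


def parse_page(html):
    """Return (h3 links, list of visible text chunks) for one HTML page."""
    parser = PageText()
    parser.feed(html)
    return parser.h3_links, parser.text
```

Looping `parse_page` over every index page of a year, then over each returned link, would cover the automation part of the question. Downloading the pages (for example with urllib.request) is left out here.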