There is a blog that keeps track of equipment losses during the invasion of Ukraine. I would like to scrape the blog and translate the information there into an Excel file.
The first problem is that the blog mostly presents information by grouping links to photos under a heading on the website. The photos have dates written on them.
Example: the equipment type heading (Tanks), then the model (T-62), and then the link (1, captured):
image1
And if you click on the link you get this image:
image2
I would like to scrape the dates from the photos and then have those dates added alongside the equipment type and model in an Excel file.
Example:
image3
Naturally, the way the dates are written is heterogeneous in format (12.12.22 vs 12/12/22), location, and style.
The second problem is that often the blog will link to a tweet instead.
Can you point me in the right direction to begin coding something like this?
I've looked into the Tesseract library.
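Tesseract (through the pytesseract wrapper) is a reasonable direction. Purely as a starting point, here is a minimal Python sketch, assuming requests, Pillow, pytesseract and pandas (with openpyxl for the Excel output) are installed; the photo URL, the equipment list and the date regex are placeholders you would replace with whatever you actually scrape from the blog:

import io
import re
import requests
import pandas as pd
import pytesseract
from PIL import Image

def date_from_photo(url):
    # Download the linked photo, OCR it, and look for a date like 12.12.22 or 12/12/22.
    img = Image.open(io.BytesIO(requests.get(url, timeout=30).content))
    text = pytesseract.image_to_string(img)
    match = re.search(r"\b\d{1,2}[./]\d{1,2}[./]\d{2,4}\b", text)
    return match.group(0) if match else None

rows = []
# In the real script these triples would come from parsing the blog's headings and links.
for equipment_type, model, photo_url in [("Tanks", "T-62", "https://example.com/photo1.jpg")]:
    rows.append({"Type": equipment_type, "Model": model, "Date": date_from_photo(photo_url)})

pd.DataFrame(rows).to_excel("losses.xlsx", index=False)

Tweet links would need a separate branch, for example reading the date from the tweet page or its API, since there is no photo to OCR; dates the regex misses can be left blank and filled in by hand.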
I have a situation where I need to extract tables from 13 different links, all with the same structure, and then append them into a single table with all the data. At first I extracted the links from the home page by copying each hyperlink and then imported the data through the Web connector in Power BI. However, 3 months later I realized that those links change every quarter, while the home page where they are listed stays the same.
So I did some research and found this video on YouTube (https://www.youtube.com/watch?v=oxglJL0VWOI), which explains how to scrape the links from a website by building a table with the link text as one column and the corresponding URL as another. That way the links are updated automatically whenever I refresh the data.
The thing is, I can't figure out how to use those links to extract the data automatically, without copying them one by one and importing each through the Power BI Web connector (Web.BrowserContents). Can anyone give me a hint on how to implement this?
Thanks in advance!
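In Power Query the usual pattern is to turn the single-link import into a custom function and then invoke it once per row of the scraped link table, so each refresh re-reads whatever links the home page currently lists. Purely to illustrate the same idea outside Power BI, here is a hedged Python sketch; the home page URL, the link filter and the table index are assumptions:

import pandas as pd
import requests
from bs4 import BeautifulSoup

HOME = "https://example.com/reports"  # placeholder for the real home page

# Collect the current quarterly links from the home page.
html = requests.get(HOME, timeout=30).text
links = [a["href"] for a in BeautifulSoup(html, "html.parser").select("a[href]")
         if "report" in a["href"]]  # placeholder filter for the 13 links

# read_html returns every table on a page; [0] assumes the data sits in the first one.
frames = [pd.read_html(url)[0] for url in links]
combined = pd.concat(frames, ignore_index=True)
combined.to_excel("combined.xlsx", index=False)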
I'm a student attending university lessons from home. My teacher just gave me an assignment: take all the titles and subtitles from an online Italian journal that include the words "coronavirus" and/or "covid 19" within a certain time frame (from the 22nd to the 29th of January, plus just the 1st and the 8th of April), and transcribe them into an Excel file to analyze the words used.
I searched online and figured out that this could be considered scraping, which made my day, considering that I need to find something like 100-150 titles plus subtitles and I have a very short deadline. Unfortunately, I am also a beginner at this, and all I could do by myself was find a way to collect just the titles from the web page. I'm using Data Miner with Google Chrome, as a beginner is supposed to.
Practically, I need to find all the titles and subtitles on the website "La Gazzetta dello Sport" (link attached below) that contain the words "coronavirus" and/or "covid 19", but there is a problem: I can see only the titles on the search page; to get the subtitles I have to click on the article and go to another page. Is there a way to obtain all the results with Data Miner, or should I use another program?
So, just to make it simple: I can't figure out how to make Data Miner collect the title from the search page, click through to the article page, collect the subtitle, go back to the search page, move on to the next title and subtitle, and repeat. I don't know if this is possible or just science fiction; like I said, I'm a total newbie at this and it's my first time using these kinds of tools.
Url: https://www.gazzetta.it/nuovaricerca/home.shtml?q=coronavirus&dateFrom=2020-01-22&dateTo=2020-01-29
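If Data Miner cannot be made to click through to each article, this kind of two-level scrape is usually easier in a small script. Here is a hedged Python sketch with requests and BeautifulSoup, in which the selectors are guesses you would adjust after inspecting the real pages:

import pandas as pd
import requests
from bs4 import BeautifulSoup

SEARCH_URL = ("https://www.gazzetta.it/nuovaricerca/home.shtml"
              "?q=coronavirus&dateFrom=2020-01-22&dateTo=2020-01-29")

search = BeautifulSoup(requests.get(SEARCH_URL, timeout=30).text, "html.parser")

rows = []
for link in search.select("a[href]"):           # placeholder: narrow this to the result links only
    href = link["href"]
    if not href.startswith("http"):
        continue
    article = BeautifulSoup(requests.get(href, timeout=30).text, "html.parser")
    title = article.find("h1")                   # placeholder: the title tag on the article page
    subtitle = article.find("h2")                # placeholder: the subtitle may use another tag
    rows.append({"Title": title.get_text(strip=True) if title else "",
                 "Subtitle": subtitle.get_text(strip=True) if subtitle else "",
                 "URL": href})

pd.DataFrame(rows).to_excel("gazzetta.xlsx", index=False)

You would run it once per date range (the second range covering the 1st and 8th of April) and merge the two output files.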
I'd like to scrape data from the following website: http://maps2.roktech.net/durhamnc_gomaps4/
In a separate spreadsheet on my computer, I have a list of parcel IDs, corresponding to various properties in the county.
Here's what needs to happen:
1. First, copy and paste a parcel ID (from the separate spreadsheet) into the search box, to search by parcel.
2. Then, copy all the columns of data that show up for that parcel ID, and paste them into the spreadsheet.
And that's it! It sounds pretty simple, but I can't seem to figure it out. I've tried using UiPath, but I'm not experienced with the software.
How could I go about doing this? How difficult is this to do?
Thanks so much for any help or assistance.
Ryan
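If UiPath feels heavy, browser automation with Selenium is another route. The sketch below is only a skeleton: every locator and wait is a guess, and a map application like this one may need different handling once you inspect how it renders its results.

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

parcel_ids = pd.read_excel("parcels.xlsx")["ParcelID"]    # placeholder file and column names

driver = webdriver.Chrome()
driver.get("http://maps2.roktech.net/durhamnc_gomaps4/")
time.sleep(10)                                            # crude wait for the map app to load

rows = []
for pid in parcel_ids:
    box = driver.find_element(By.CSS_SELECTOR, "input[type='text']")   # placeholder locator
    box.clear()
    box.send_keys(str(pid), Keys.ENTER)
    time.sleep(5)                                         # crude wait for the results to render
    cells = driver.find_elements(By.CSS_SELECTOR, "td")   # placeholder locator for the result table
    rows.append({"ParcelID": pid, "Data": " | ".join(c.text for c in cells)})

driver.quit()
pd.DataFrame(rows).to_excel("parcel_data.xlsx", index=False)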
Please watch the following training video on how to scrape data using UiPath:
https://www.tutorialspoint.com/uipath/uipath_studio_data_scraping_and_screen_scraping.htm
It is also highly recommended to go through the free UiPath RPA Academy training videos, which will quickly get you up to speed:
https://academy.uipath.com/learn/course/internal/view/elearning/372/level-1-foundation-training-revamped-20182
You don't need UiPath for this job:
1. Go to the site.
2. Zoom out.
3. Click "select options".
4. Select the whole area.
5. A table with "1000 results" will appear. On the same line where "1000 results" is written there are 4 buttons; the last one is "export to EXCEL". Click it and you will have all the data in one table, which you can then filter.
I have a Google Docs page that is a "slider" quiz. For example, people rate themselves on a scale for how comfortable they are with, say, Microsoft Word (0 = weak to 5 = strong), then PowerPoint, and so on.
These responses are submitted and saved in an Excel-style Google spreadsheet, "responses.csv".
Based on the response per column, I want to use the "Document Studio" add-on, selecting "Google Slides" as the option, so it makes me a Google Slide from the responses.
But I want to make a function that pulls the replies with values 1-5 and gives me an image, so I made an if(A1=1, "drive.google.com/1.jpg", A1=2, "drive.google.com/2.jpg"). Then I referenced the column "image-slider-1".
However, the image is not showing up in the Google Slide, and I don't know why. I tried to reference the slider value and import an image from Google Docs.
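One thing worth checking is the lookup itself, since it is really just a value-to-URL map. If it helps to prepare that image-URL column outside the spreadsheet before handing it to the add-on, here is a small Python sketch of the same mapping; the file name, column names and Drive URLs are assumptions:

import pandas as pd

# Placeholder map from slider score (1-5) to an image URL.
IMAGE_FOR_SCORE = {n: f"https://drive.google.com/{n}.jpg" for n in range(1, 6)}

df = pd.read_csv("responses.csv")
df["image-slider-1"] = df["slider-1"].map(IMAGE_FOR_SCORE)   # unmapped values (e.g. 0) stay empty
df.to_csv("responses_with_images.csv", index=False)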
I am interested in extracting data on paranormal activity reported in the news, so that I can analyze the place and time of the appearances for any correlations. This project is just for fun, to learn and use web scraping, text extraction, and spatial and temporal correlation analysis. So please forgive me for choosing this topic; I wanted to do something interesting and challenging.
First, I found that this website has a collection of reported paranormal incidents; they have collections for 2009, 2010, 2011 and 2012.
The structure of the website is as follows: for every year they have pages 1..10, and the links go like this.
For year 2009:
link http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
On each page they have collected the stories under headings like this.
Internal structure
Paranormal Activity, Posted 03-14-09
Each of these headlines has two pages inside it; the links go like this:
link http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
On each of these pages they have the actual reported stories, collected under various headlines, plus the links to the original websites for those stories. I am interested in collecting that reported text and extracting information about the kind of paranormal activity (ghost, demon, or UFO) and the time, date, and place of each incident. I wish to analyze this data for any spatial and temporal correlations. If UFOs or ghosts are real, they must show some behavior and correlation in space or time in their movements. That is the long version of the story...
I need help with web scraping the text from the pages described above. Below I have written code that follows one page and its links down to the final text I want. Can anyone let me know whether there is a better and more efficient way to get clean text from the final page, and how to automate collecting the text by following all 10 pages for the whole of 2009?
library(XML)
#source of paranormal news from about.com
#first page to start
#2009 - http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
pn.url<-"http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"
pn.html<-htmlTreeParse(pn.url,useInternalNodes=T)
pn.h3=xpathSApply(pn.html,"//h3",xmlValue)
#extracting the links of the headlines to follow to the story
pn.h3.links=xpathSApply(pn.html,"//h3/a/@href")
#Extracted the links of the Internal structure to follow ...
#Paranormal Activity, Posted 01-03-09 (following this head line)
#http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
pn.l1.url<-pn.h3.links[1]
pn.l1.html<-htmlTreeParse(pn.l1.url,useInternalNodes=T)
pn.l1.links=xpathSApply(pn.l1.html,"//p/a/@href")
#Extracted the links of the Internal structure to follow ...
#British couple has 'black-and-white-twins' twice (following this head line)
#http://www.msnbc.msn.com/id/28471626/
pn.l1.f1.url=pn.l1.links[7]
pn.l1.f1.html=htmlTreeParse(pn.l1.f1.url,useInternalNodes=T)
pn.l1.f1.text=xpathSApply(pn.l1.f1.html,"//text()[not(ancestor::script)][not(ancestor::style)]",xmlValue)
I sincerely thank you in advance for reading my post and taking the time to help me.
I will be grateful to any expert who would like to mentor me through this whole project.
Regards
Sathish
Try the Scrapy and BeautifulSoup libraries. Although they are Python based, they are considered among the best in the scraping domain. You can use the command line interface to connect the two; for more details about connecting R and Python, have a look here.
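For example, a BeautifulSoup sketch of the walk you describe (year index page, then headline pages, then story pages) might look like the following; the tag selectors mirror the XPath in your R code, and the old about.com URLs may no longer resolve:

import requests
from bs4 import BeautifulSoup

INDEX = "http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"

def soup(url):
    return BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

def clean_text(url):
    doc = soup(url)
    for tag in doc(["script", "style"]):   # drop scripts and styles before extracting text
        tag.decompose()
    return doc.get_text(separator=" ", strip=True)

index = soup(INDEX)
headline_links = [a["href"] for h3 in index.find_all("h3") for a in h3.find_all("a", href=True)]

stories = []
for page_url in headline_links:
    page = soup(page_url)
    for a in page.select("p a[href]"):     # the story links sit inside paragraphs, as in the R code
        stories.append({"page": page_url, "story": a["href"], "text": clean_text(a["href"])})

Covering all ten index pages for 2009 is then just one more loop over the index URLs.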