I am looking to extract some data from this website:
http://www.delfi.lv/bizness/biznesa_vide/tirgus-liberalizacija-ka-latvija-nonaca-krievijas-gazes-juga.d?id=44233361&com=1&s=5
For me, the valuable information looks like this:
"<h3 class="ca56269332 comment-noavatar listh3 comment-noavatar-author">
vārds
</h3>"
In this example, "ca56269332" and "vārds" ("name" in Latvian) are dynamic values.
What I want to achieve is something like this:
"<h3 class="* comment-noavatar listh3 comment-noavatar-author">
*
</h3>"
where "*" marks a dynamic value, and then export the results to some kind of Excel or data file.
Also I want to extract multiple pages, like:
/tirgus-liberalizacija-ka-latvija-nonaca-krievijas-gazes-juga.d?id=44233361&com=1&s=5&no=0
/tirgus-liberalizacija-ka-latvija-nonaca-krievijas-gazes-juga.d?id=44233361&com=1&s=5&no=20
/tirgus-liberalizacija-ka-latvija-nonaca-krievijas-gazes-juga.d?id=44233361&com=1&s=5&no=40
etc.
Can anyone please share some valuable resources to achieve that? I know it can be done with PHP's file_get_contents, but I would like an easier solution, since my goal is not to publish anything to a web page but to use the result as a data file for my study project.
How can I extract just the dynamic data, so that I avoid saving every page with all the useless markup it contains, and avoid manually processing a large number of web comments?
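A minimal sketch of this in Python, using only the standard library. The sample markup below is hard-coded for illustration (the second author name is made up); in practice you would fetch each page with urllib.request, incrementing the &no= offset by 20, and feed each page's HTML through the same regex:

```python
import csv
import re

# Sample of the markup described above; in a real run you would download
# each page (e.g. with urllib.request) by varying the &no= offset.
html = '''
<h3 class="ca56269332 comment-noavatar listh3 comment-noavatar-author">
vārds
</h3>
<h3 class="ab12345678 comment-noavatar listh3 comment-noavatar-author">
otrs vārds
</h3>
'''

# The first class token is the dynamic part, so match only on the
# stable class tokens and capture the text between the tags.
pattern = re.compile(
    r'<h3 class="\w+ comment-noavatar listh3 comment-noavatar-author">'
    r'\s*(.*?)\s*</h3>',
    re.DOTALL,
)
authors = pattern.findall(html)

# Write the extracted names to a CSV file that Excel can open directly.
with open("comments.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["author"])
    for name in authors:
        writer.writerow([name])

print(authors)  # ['vārds', 'otrs vārds']
```

For messier real-world HTML a proper parser such as BeautifulSoup is more robust than a regex, but for a fixed, known markup pattern like this one the regex approach is usually sufficient.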
Related
I'm looking for advice on methods to scrape the gender of clothing items on a website that doesn't specify the gender on the product page.
The website I'm crawling is www.very.co.uk and an example of a product page would be this - https://www.very.co.uk/berghaus-combust-reflect-long-jacket-red/1600352465.prd
Looking at that page, there looks to be no easy way to create a script that could identify this item as womenswear. Other websites might have breadcrumbs to use, or the gender might be in the title / URL but this has nothing.
As I'm using scrapy, with the crawl template and Rules to build a hierarchy of links to scrape, I was wondering if it's possible to pass a variable in one of the Rules or in start_urls, so that every item scraped via that rule / start URL is marked as womenswear. I could then feed this variable into a method / loader statement to tag the item as womenswear before putting it into a database.
If not, would anyone have any other ideas on how to categorise this item as womenswear? I saw an example where you could use an Excel spreadsheet to create the start_urls, tagging each row in the spreadsheet as womenswear, menswear, etc. However, I feel this method might cause issues further down the line and would prefer to avoid it if possible. I'll spare the details of why I think this would be problematic unless anyone asks.
Thanks in advance
There does seem to be a breadcrumb in your example; however, as an alternative you can usually check the page source by simply searching for your term - maybe there is some embedded JavaScript/JSON that can be extracted?
Here you can see some JavaScript with a subcategory field indicating that the item is a "womens_everyday_sports_jacket".
You can parse it quite easily with some regex (note that body_as_unicode() is deprecated in recent Scrapy versions; response.text is the equivalent):
re.findall(r'subcategory: "(.+?)"', response.text)
# ['womens_everyday_sports_jacket']
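As a quick stand-alone check of that regex, you can run it against a snippet resembling the embedded JavaScript. The dataLayer.push wrapper below is made up for illustration; only the subcategory: "..." part mimics the real page source:

```python
import re

# Fake page-source snippet; only the subcategory field matters here.
body = 'dataLayer.push({ subcategory: "womens_everyday_sports_jacket" });'

# Same pattern as in the answer above, applied to the raw page text.
matches = re.findall(r'subcategory: "(.+?)"', body)
print(matches)  # ['womens_everyday_sports_jacket']
```

In a spider callback you would pass response.text instead of the hard-coded string, then assign the extracted value to your item's gender/category field.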
I am creating a form in Shiny R and when the user inputs some information, I want that information to then be appended to the bottom of an existing dataset in an existing xlsx file.
So far I've tried using write.xlsx() with append = TRUE to add to an existing file, but this just creates a new sheet in the file. From there I tried specifying the sheet I want the data written to (sheet = 'Data'), but it then tries to make a new sheet with that name and complains that it already exists.
Is this something that's even possible with write.xlsx() or will I have to find a different method to do this?
Edit
I think it would actually be helpful if I give a little more info about the app I'm making. The app is also being used to analyze the data in the existing file. This is why I just want to add to it, rather than create a new file.
Edit 2
For the most part, I seem to have things working correctly thanks to the suggestions below in the comments, with the exception of one issue I can't seem to fix. Every time new data is submitted through the form, an additional column of numbers gets added to the dataset, which I definitely don't want! Is there a way to prevent this from happening? As of now I am using bind_rows() to merge the data; I've also tried rbind(), rbind_all(), smartbind(), and probably one or two more I'm forgetting at the moment, with no luck.
At the moment I am generating a PDF using KnpSnappy. I'm already showing a bunch of data in there, but it'd be really handy to include one of my datatables as well - is this possible?
I would have the data as JSON and then display it with Ajax.
If you don't have a lot of data, you could just keep it in your script file as let myJsonData = { ...JSON data... }.
I have a lot of similar URLs that I would like to merge in Google Analytics. I've managed to merge a lot of them already; however, I've now run into a bit of a problem.
I have URLs that look something like this:
article/4567/edit
article/87478548/edit
article/82984786/add
article/8374/add
How would I go about merging these URLs so that they display as;
article/edit
article/add
Any help is greatly appreciated.
EDIT: I also need GA to display every article on a single line of the table, as "article/", regardless of the ID that follows it. I don't want the table to look like:
article/12342 1,000 views
article/7465890 900 views
I need it to display as:
article/ 1,900 views
You can create an Advanced filter that combines the relevant parts for you:
The output would be /article/edit or /article/add, with everything between those two parts removed.
EDIT:
If you just want everything, regardless of /edit, /add, /12341/edit, /7305/add, /whatever/edit, to show up just as /article, then you can just change your filter like this:
Field A: Request URI = (/article)/.*
Output to: Request URI = $A1
This will convert the following examples:
/article/123/edit -> /article
/article/2345/add -> /article
/article/anything -> /article
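The filter itself runs inside GA, but you can sanity-check the pattern locally before applying it. Here is a small Python simulation of that Field A / Output rewrite, where the substitution keeps only the captured "/article" group, just as $A1 does:

```python
import re

examples = ["/article/123/edit", "/article/2345/add", "/article/anything"]

# Field A captures "/article"; the output "$A1" keeps only that group.
merged = [re.sub(r"(/article)/.*", r"\1", u) for u in examples]
print(merged)  # ['/article', '/article', '/article']
```

Once all rows collapse to the same Request URI, GA aggregates their view counts on a single line, as in the 1,900-views example above.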
This post on combining similar URLs in Google Analytics explains how to do it. You need to use a regex; something like this should work (not tested):
(article\/)[0-9]*\/(edit|add)
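To double-check that pattern before pasting it into GA, here is a small Python simulation of the rewrite. The replacement \1\2 mirrors combining the two captured groups (the prefix and the action) in the filter output; the slashes are left unescaped, which is equivalent in Python's regex syntax:

```python
import re

urls = [
    "article/4567/edit",
    "article/87478548/edit",
    "article/82984786/add",
    "article/8374/add",
]

# Drop the numeric ID, keeping the captured prefix and action.
pattern = re.compile(r"(article/)[0-9]*/(edit|add)")
merged = [pattern.sub(r"\1\2", u) for u in urls]
print(merged)  # ['article/edit', 'article/edit', 'article/add', 'article/add']
```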
The main Data Type used by Yahoo Pipes is the [Item], which is RSS feed content. I want to take an RSS's content or sub-element, make it into [Text] (or a number might work), and then use it as an INPUT into a [Module] to build a RSS-URL with specific parameters. I will then use the new RSS-URL to pull more content.
Could possibly use the [URL Builder Module] or some work-around.
The key here is using "dynamic" data from an RSS feed (not user input, or a static data), and getting that data into a Data Type that is compatible (and/or accessible) as an INPUT into a module.
This seems like vital functionality, but I cannot figure it out. I have tried many, many workarounds, with no success.
The Specific API and Methods (if you are interested)
Using the LastFM API.
1st Method: user.getWeeklyChartList. Then pick the "from" (start) and "to" (end) Unix timestamps from one year ago today.
2nd Method: user.getWeeklyAlbumChart using those specific (and "dynamic") timestamps to pull my top albums for that week.
tl;dr. Build an RSS-URL using specific parameters from another RSS feed's content.
I think I may have figured it out. I doubt it is the best way, but it works. The problem was that the module I needed to use didn't have an input node. The Loop module does have one, so if I embed the URL Builder inside the Loop module, I can access sub-element content from the first feed and use it as parameters to build the URL for the second feed. Then I can discard all the extra items generated by the Loop by using Truncate.
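Outside Pipes, the same chaining idea can be sketched in plain Python: pull the timestamps from the first API response, then use them as parameters when building the second request URL. The username and API key below are placeholders, and the timestamps are hard-coded so the sketch runs without network access; the endpoint and method name come from the Last.fm API:

```python
from urllib.parse import urlencode

# Values that would come from the user.getWeeklyChartList response
# (hard-coded here so the sketch runs offline).
week = {"from": "1327968000", "to": "1328572800"}

params = {
    "method": "user.getweeklyalbumchart",
    "user": "example_user",     # placeholder username
    "api_key": "YOUR_API_KEY",  # placeholder key
    "from": week["from"],
    "to": week["to"],
}

# Build the second request URL from the first feed's dynamic data.
url = "http://ws.audioscrobbler.com/2.0/?" + urlencode(params)
print(url)
```

The resulting URL can then be fetched to pull the weekly album chart for that specific week, which is exactly what the Loop-embedded URL Builder does inside Pipes.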