A few weeks back I was faced with a challenge: use web scraping to list all the files of a GitHub repository, group them by extension, and sum the file sizes for each extension. The important constraint was that we SHOULD NOT use GitHub's API or any web-scraping library.
My solution was to fetch the main HTML page as a string and apply a regex to extract all URLs containing <repo_owner>/<repo_name>/blob or <repo_owner>/<repo_name>/tree. For each blob URL, we could make another request and apply another regex to extract the file size and line count; for each tree URL, we'd make another request to extract more blob and tree URLs. I repeated this until no tree URLs were left.
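Roughly, the crawling loop looked like the sketch below. This is a reconstruction in PHP rather than my original code; the repository URL is a placeholder and the regex is simplified, assuming the HTML of the time kept repo paths in plain href attributes.

<?php
// Reconstruction (simplified): fetch a repo page, regex out the
// blob (file) and tree (directory) links, and recurse into trees.
function crawl(string $url, array &$blobUrls): void
{
    $html = file_get_contents($url);

    // Match href="/<owner>/<repo>/blob/..." and href="/<owner>/<repo>/tree/..."
    preg_match_all('#href="(/[^/"]+/[^/"]+/(blob|tree)/[^"]+)"#', $html, $m, PREG_SET_ORDER);

    foreach ($m as $match) {
        $link = 'https://github.com' . $match[1];
        if ($match[2] === 'blob') {
            $blobUrls[] = $link;   // fetched later to regex out size and lines
        } else {
            sleep(1);              // the delay that made this so slow
            crawl($link, $blobUrls);
        }
    }
}

$blobUrls = [];
crawl('https://github.com/some-owner/some-repo', $blobUrls); // placeholder repo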
It solved the problem, but it was a pretty bad solution: it requires far too many requests to GitHub, and we always got blocked at some point while analyzing a repository. I added a delay between requests, but then a single repository takes a "LOOOT" of time to process; and if we instead fire, say, 10 requests simultaneously, we run into the too-many-requests problem anyway.
This bothers me to this day because I never found a better solution. Since the challenge is no longer active, I'd like to hear other people's ideas on how it could be solved!
INTRO:
I'm in a bind: when uploading an inventory feed to Amazon, in 2021, their system still doesn't understand UTF-8 encoding.
Here we have a file, in a WordPress installation, serving as the image for a visual product.
Example URL: https://wordpresssite.com/uploads/Café-à-la-crème.jpg
WordPress displays it fine.
Amazon reads a bunch of gibberish, can't find the file, and gives an error.
Can we leave the file name on the source server as is, and yet do something in cPanel or in the Excel file that lists this URL, in a way that Amazon can also read it?
Is this ultimately as simple as telling Excel to encode that column differently before uploading?
Thank you in advance!
UPDATE: What I am trying now is to export the Excel file to CSV and then run it through line by line with PHP, using a combination of tricks, hoping to do a passable job of it. From what I see, there are many ways that "sorta" work, but nothing is sure.
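A minimal sketch of the kind of thing I mean, assuming a feed.csv with the image URL in the first column; iconv's //TRANSLIT is one of those "sorta work" tricks, and its output can vary between systems:

<?php
// Sketch: read the exported CSV line by line and transliterate
// accented characters in the URL column down to plain ASCII.
setlocale(LC_CTYPE, 'en_US.UTF-8'); // //TRANSLIT depends on the locale

$in  = fopen('feed.csv', 'r');        // hypothetical file names
$out = fopen('feed-ascii.csv', 'w');

while (($row = fgetcsv($in)) !== false) {
    // e.g. "Café-à-la-crème.jpg" becomes roughly "Cafe-a-la-creme.jpg"
    $row[0] = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $row[0]);
    fputcsv($out, $row);
}

fclose($in);
fclose($out);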
UPDATE 2: I realize that this doesn't solve my problem: if the file name in the feed is changed, turning an "é" into an "e", then Amazon won't find the image either, because the file on the server still has the accent. So I'll have to go through all the images I'm using and track down the ones with accents.
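If it comes to that, something like this hedged sketch could at least list the offending files (the uploads path is a placeholder):

<?php
// List upload files whose names contain non-ASCII (e.g. accented) bytes.
foreach (glob('/path/to/wp-content/uploads/*') as $file) { // placeholder path
    if (preg_match('/[\x80-\xFF]/', $file)) {
        echo $file, PHP_EOL;
    }
}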
QUESTION ABOUT PROCEDURE: I haven't quite been able to understand how things work here. I thought originally that this is about getting help when stuck. I have explained the problem, and code isn't necessary. If I'm wrong, please tell me how it changes THIS situation. I'm using Excel and WordPress, and I have to lose the UTF-8 accented characters that seem to cause Amazon's systems such grief (no judgement of Amazon, except that this resistance to UTF-8 is giving me brain shudders at the moment).
MORE INFO: If this helps, I'm writing in English but certain art products have a lot of French and some German in their names. I thought my example sufficient to illustrate what I was up against.
My problem is not how to convert the characters but how to put the steps together to do what I need. It's because this whole process is not a simple choice between iconv() and utf8_decode() in PHP that it's extra stressful. Once I get the big picture sorted, the smaller steps are written about in many places where I can find more specific details if I need them.
I'm not snarking here, but it seems that this kind of comment is just kicking someone when they are down. You are not the first to make such a suggestion over the years, but again, I am curious how I could have explained any more than I already have, in a way that pertains to my actual problem.
Thanks for your response.
That URI is not properly encoded as per RFC 3986 (see also Wikipedia: percent/URL encoding). You cannot expect a server to blindly assume a requested URI to be UTF-8 encoded, but you can expect every server to support percent encoding:
https://wordpresssite.com/uploads/Caf%C3%A9-%C3%A0-la-cr%C3%A8me.jpg
In PHP this can be achieved through rawurlencode(); in JavaScript it would be encodeURI().
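For example, a small sketch; note that rawurlencode() escapes slashes too, so you would encode only the file name portion rather than the whole URL:

<?php
// Percent-encode just the file name; rawurlencode() on the full URL
// would also escape "/" and ":" and break it.
$url  = 'https://wordpresssite.com/uploads/Café-à-la-crème.jpg';
$pos  = strrpos($url, '/');
$base = substr($url, 0, $pos + 1);  // https://wordpresssite.com/uploads/
$file = substr($url, $pos + 1);     // Café-à-la-crème.jpg
echo $base . rawurlencode($file);
// https://wordpresssite.com/uploads/Caf%C3%A9-%C3%A0-la-cr%C3%A8me.jpg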
Not sure what you want with Excel and CSV, but from what I understood it is unrelated to your actual problem.
I feel like I'm missing something basic, but I've googled and explored all around the interface, and I can't figure out how to remove multiple URL parameters at once without using "Undo" or removing the request in its entirety. Basically I have requests with around 50 or more URL parameters, and since Paw doesn't show these in the URL field, the only way I can see to remove them is by using the little minus symbol next to each one, which is quite tedious when you have lots. How do I remove multiple ones at once?
I'm late to the party, but this has been irritating me for a while now as well. I would love to see a batch-delete action added for URL params, but for now I create a version of the request with no params, then duplicate it whenever I need to start with a clean request. It definitely can clutter up the request list, but such requests could be thrown into a subfolder and duplicated from there if a lot of them are needed. Either way, it's faster than always having to remove params manually; in my case it's just 15 or so, but that's still an aggravation.
The website I manage uses Google Analytics to track URLs. Recently I found out that some of the URLs contain UTM codes that should not be there. I need some way of determining whether URLs containing the UTM codes utm_source=redirect or utm_source=redirectfolder are currently on the website and being redirected within the same website. If so, I will need to remove the UTM codes from those URLs, because Google Analytics automatically tracks URLs that redirect within the same domain, so it does not require UTM codes (and they actually hurt the analytics).
My apologies if I sound a little broken here; I am still trying to understand it all myself, as I am a new graduate with a CS degree and am now the only web developer. I am not asking anyone to write this for me, just whether I could be pointed in the right direction for writing a ColdFusion script that may help with this.
So if I understand correctly, your codebase is riddled with problematic URLs. To clean up the URLs programmatically you'll need to do a couple of things up front.
Identify the querystring parameter variable/value pair that needs to be eliminated.
Create a worker file to access all your .cfm and .cfc files (of interest).
Create a loop that goes through the directories and reads, edits, and saves your files (be careful here not to go crazy; rather than overwriting existing files, consider writing uniquely named copies unless you are sure).
Create a find/replace function or regular expression to target and remove your troublesome parameters (a sketch follows at the end of this answer).
Save your file and move on in the loop.
OR:
You can use an IDE like Dreamweaver or Sublime Text to locate these via a regex search, then spot-check and remove them.
I would selectively remove the URL parameters, but if you have so many pages that it makes no sense, then programmatic removal would be the way to go.
You will be using cfdirectory, cffile, and either reMatch() (creating an array and rebuilding) or find/replace with replaceNoCase().
Your cfdirectory call will return a query-like variable, and you will spin through it just as you would a normal query with cfoutput.
Pull one or two files out of your repo to build your code against until you are comfortable. I would code in exit strategies (fail gracefully), like adding a locatable comment at each change spot so you can check it manually later, or escaping out if a file won't write, and many other try/catch opportunities.
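For the find/replace step mentioned above, the core pattern could look something like the following. This is a PHP sketch with hypothetical file names, since the regex itself carries over; in ColdFusion the same idea maps onto reMatch() and replaceNoCase():

<?php
// Sketch: strip the two troublesome utm_source values from URLs in
// one file's source, writing a copy rather than overwriting.
$source = file_get_contents('page.cfm');   // hypothetical file name

// Pass 1: the parameter has others after it -> drop "utm_source=...&".
$cleaned = preg_replace('/utm_source=redirect(folder)?&/i', '', $source);

// Pass 2: the parameter is last -> drop its leading "?" or "&" as well.
$cleaned = preg_replace('/[?&]utm_source=redirect(folder)?(?=["\'\s]|$)/i', '', $cleaned);

file_put_contents('page.cleaned.cfm', $cleaned);  // spot-check before swapping in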
I hope this helps.
I am currently trying to code a simple ASP.NET URL shortener which allows me to customise the shortened URL. I am also not allowed to use open source, which means I cannot use any of the existing URL-shortening services; I am required to develop it on my own.
But this is the first time I am doing this, so I have no idea how to start (excluding the UI).
I understand that similar questions have already been asked, but I've read through the posts and couldn't understand what they were about. I've also tried googling for a solution, but nothing seems to work.
I would really appreciate any help given to me.
P.S. I am fairly new to programming and not strong in any particular language.
You would need:
A system to store pairs of shortened URLs and their full version.
A page which takes the shortened URL parameter (e.g. short.aspx?q=SHORTENED), looks it up in your data store, and redirects to the full URL (sketched below).
Some interface to edit your data store, add new URLs, etcetera.
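A minimal sketch of that lookup-and-redirect page, with a hard-coded data store standing in for a real database (shown in PHP for brevity; in ASP.NET the same flow would live in Page_Load and end with Response.Redirect):

<?php
// Sketch of the redirect endpoint: /short.php?q=KEY
// The array stands in for a real data store such as a database table;
// keys and target URLs here are placeholders.
$urls = [
    'docs'  => 'https://example.com/some/very/long/documentation/url',
    'promo' => 'https://example.com/spring-sale-landing-page',
];

$key = $_GET['q'] ?? '';

if (isset($urls[$key])) {
    header('Location: ' . $urls[$key], true, 301); // permanent redirect
} else {
    http_response_code(404);
    echo 'Unknown short URL.';
}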
That should be it really. If this is too difficult, it might be smarter to start on a basic programming course first.
I'm fairly new to web development and have never done any screen scraping or web crawling before, but yesterday a friend of mine asked me if I would be able to grab some data from this website, which is not mine, nor his, but the data is publicly available, even for download.
The problem with the data is that it's available only as one file per date or company, rather than one file for multiple dates or companies, which involves a lot of tedious clicking through the calendar. So he thought it would be nice if I could create some app that could grab all the data with one click and output it in one single file or something similar.
The website uses an ASPX WebForm with __doPostBack to retrieve the data for different dates; even the links to download the data as XLS aren't the usual "href=…" links. They are, I assume, references to some ASP script…
To be honest, the only thing I tried was PHP cURL, which didn't work. But since it was my first time using cURL, I don't even know whether it failed because this is not possible with cURL, or just because I don't know how to work with it.
I am only somewhat proficient in PHP and JavaScript, and not at all in ASP, though I wouldn't mind learning something new.
So my question is: is it at all possible to grab the data from a website like this? And if it is, would you be so kind as to give me some hints on how to approach this kind of problem?
The website, again, is here: http://extranet.net4gas.cz/capacity_ee.aspx
Thanks
C# has a nice WebClient class to do the job:
using System.Net;

// Create the web client.
WebClient client = new WebClient();
// Download the page HTML as a string.
string value = client.DownloadString("http://www.microsoft.com/");
Once you have the page HTML in a string, you can use regular expressions to scrape the content you are looking for.
Here is a very basic regular expression to give you a hint:
using System.Text.RegularExpressions;

// Matches the first run of digits in the input string.
Regex regex = new Regex(@"\d+");
Match match = regex.Match("hello here 10 values");
if (match.Success)
{
    Console.WriteLine(match.Value); // prints "10"
}
Marosko, as you said, the data on the website is open to the public, so you can certainly scrape it; the task is just to cut down the manual clicking through dates while extracting it. I personally don't have much idea of how cURL works, but I am sure it would involve a lot of coding. I would rather suggest you automate the entire process using some automation tool, like a software application. Try Automation Anywhere: I bought it a few months back for some data-extraction purposes and it worked very well. It is automated, and you can check out the screen-scraping capabilities it offers. It's my favorite :)
Charles