I have a list of links. Each link leads to a small list of text files that I'm trying to archive.
My list is at host/file/list.html
The list has almost a thousand links to /file/list.html?id=xxx
Inside each list.html?id page, the linked files are located at /data/file/list/filename.txt, where the filenames follow no pattern other than the file type.
Along the way are all the header and footer links, which I want to ignore. If I set --include-directories to /data/file/list, it won't scrape any of the pages at /file/list.html?id=xxx.
Here's what I've got so far, but it doesn't work with recursion at -l 2; it only works if I start on an id page itself.
wget --recursive -l 2 --include-directories=/data/file/list http://host/file/list.html
This only downloads list.html and stops. If I also include /file/list, it downloads too many other files; I'm trying to download as few files as possible. I realize wget will have to read each of the list.html?id pages to get the .txt file lists, but it looks like it's downloading all of the id pages one at a time without following the links on them. In case I had my recursion limit wrong, I tried -l 3, but that gave the same result.
I ended up adding /file/list to the included directories, and also adding -nc to help prevent downloading the same header and footer links multiple times. It seems to have worked well, mostly downloading just the necessary files.
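For reference, the final command looked something like this (a sketch, keeping the same host, paths, and depth as above):

wget --recursive -l 2 -nc --include-directories=/file/list,/data/file/list http://host/file/list.html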
I am using a fairly obscure bookmark manager on Android. I picked this one after trying others because it could import, export, and classify bookmarks into folders, the design was good, and it was easy to search my bookmarks.
After importing all my bookmarks from other browsers and also from files, I started classifying all of them into folders, subfolders, and so on.
I spent many days classifying them all the way I wanted.
After classifying them, I tried to export them.
The problem is that the only option offered is to export them as a .html file containing all the bookmarks, but without any folders.
The .html file contains all my bookmarks, but in complete disorder, and it doesn't mention the folders.
The app also has a "backup" function, so I tried it, and it creates a .db file.
I opened this .db file with an SQLite viewer app, and inside, among other things I don't understand, I found a list of all my bookmarks with a number next to each one, and also a list of my folders, each with its corresponding number next to it.
When I open the .db file, I have a choice between:
- sqlite master
- android metadata
- bookmarks
- folders
- sqlite sequence
If I click on "bookmarks", all my bookmarks appear in a kind of spreadsheet with rows and columns. In another column next to them there is a number: for example, every bookmark related to "Kitchen recipes" has the number 1 written next to it.
And in the "Folders" folder, next to the folder called "Recipes" its also written 1.
So I'm happy because it seems that my classification is stored in this file.
But I don't know how to easily extract all that data and use it to create a bookmark file that can be imported into another bookmark app or browser (for example .csv, .xbel, or .html, but with folders).
I guess I need some "script" working like this:
if the first row in "folders" has the number 8 next to it, then take all the bookmarks in the "bookmarks" table that also have an 8 written next to them, and put them inside that folder.
I'm a complete beginner at coding; I don't know what SQLite is, or anything like that.
So I know I may be asking for too much information at the same time.
But if some kind person could point me in the right direction, by explaining:
- whether that's possible
- what the easiest way would be
- whether a solution already exists
- whether someone like me can do it, and what I would have to learn to be able to do it myself some day
Thanks
Here are pictures so it's easier to understand: [screenshots of the SQLite table list, the folders table, and the bookmarks table]
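A minimal sketch of the kind of script described above, in Python (whose standard library includes an sqlite3 module). The bookmarks and folders table names come from the viewer, but the column names used here (_id, name, title, url, folder_id) are assumptions and would need to be adjusted to whatever the SQLite viewer actually shows. It writes a Netscape-style bookmarks .html file, which most browsers can import with the folder structure intact:

import sqlite3
import html

# Open the backup database (file name is an assumption; use the real .db file).
con = sqlite3.connect("backup.db")
cur = con.cursor()

# Read the folders: assumed columns are a numeric id and a name.
folders = cur.execute("SELECT _id, name FROM folders").fetchall()

with open("bookmarks_with_folders.html", "w", encoding="utf-8") as out:
    out.write("<!DOCTYPE NETSCAPE-Bookmark-file-1>\n<H1>Bookmarks</H1>\n<DL><p>\n")
    for folder_id, folder_name in folders:
        # One <H3> heading per folder, then the bookmarks whose number matches that folder.
        out.write("<DT><H3>%s</H3>\n<DL><p>\n" % html.escape(folder_name))
        rows = cur.execute(
            "SELECT title, url FROM bookmarks WHERE folder_id = ?", (folder_id,)
        ).fetchall()
        for title, url in rows:
            out.write('<DT><A HREF="%s">%s</A>\n' % (html.escape(url), html.escape(title)))
        out.write("</DL><p>\n")
    out.write("</DL><p>\n")

con.close()

The resulting file can then be loaded through a browser's "import bookmarks from HTML" function.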
I don't already know what files exist, but I know the basic structure of the URL for a bunch of files that definitely do exist. I'd like to learn what they are and download them.
I can download an individual file without issue, in this case the land cover data for Allegany County in Maryland:
download.file("https://cicwebresources.blob.core.windows.net/chesapeakebaylandcover/MD/ALLE_24001.zip"
, destfile ="data/GIS_downloads/")
But I'd like to download all the land cover data .zip files for the state of MD.
I saw some examples of web scraping that went something like this, and I'm trying to make them work for my situation:
library(stringi)
library(rvest)  # read_html() comes from rvest/xml2, not stringi
baseURL <- "https://cicwebresources.blob.core.windows.net/chesapeakebaylandcover/MD/"
doc <- read_html(baseURL)
# etc
but the URL for what I want to call the "parent directory" returns a 404 error.
How can I list all the .zip files for MD, given that I know they all share the same URL format but don't know the specific strings for each county etc.?
Thanks!
The 404 error should be taken at face value... maybe the solution is to start from a webpage that does load, and find the needed links there. As @Gregor Thomas and @r2evans suggested, the website wasn't letting me access the parent directory, so a workaround was called for.
In this case, I found the links listed on another webpage (not the parent directory for the .zip files). This way I got a list of the needed links, though not by being clever with the scraping as suggested in the comments above. The following code got me where I wanted to be for now...
doc <- httr::GET("https://www.chesapeakeconservancy.org/conservation-innovation-center/high-resolution-data/land-cover-data-project/")
CIC <- "https://cicwebresources.blob.core.windows.net/chesapeakebaylandcover/MD/"
parsed <- XML::htmlParse(httr::content(doc, as = "text"), asText = TRUE)
links <- XML::xpathSApply(parsed, "//a/@href")
todl <- links[grepl(CIC, links, fixed = TRUE)]
The object todl ("to download") held the links I was looking for, without my having to know in advance which layers were and were not included.
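From there, a short loop can fetch each of the matched files; a minimal sketch, assuming the same data/GIS_downloads destination folder as in the question:

# Download each matched .zip into the destination folder used earlier.
dir.create("data/GIS_downloads", recursive = TRUE, showWarnings = FALSE)
for (u in todl) {
  download.file(u, destfile = file.path("data/GIS_downloads", basename(u)), mode = "wb")
}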
I want to download a large number of .png files that have .htm file extensions. I've tried some WinPcap-based utilities, but none of them pick up the files I need. The utilities I have tried are called York, EtherWatch and Pikachu2. I've also tried a Firefox extension called Save Images - which was too buggy to be useful - and I've tried looking in the browser cache. This last approach works, but it has a problem...
...I need at least the last 30 characters of the file names to be maintained so that I know which image is which.
Does anyone know how I can get this done?
You can use DownThemAll to download all the images, and then rename the file extensions programmatically.
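For the renaming step, a minimal Python sketch, assuming the downloaded files sit in a folder called downloads and only the extension needs to change:

import os

folder = "downloads"  # assumption: wherever DownThemAll saved the files
for name in os.listdir(folder):
    # Keep the original name (so the distinguishing trailing characters survive),
    # and only swap the .htm extension for .png.
    if name.lower().endswith(".htm"):
        os.rename(os.path.join(folder, name),
                  os.path.join(folder, name[:-4] + ".png"))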
I have two files: I edited one and kept the other just for reference.
However, I messed up some code in the file I'm editing, and since it's a huge file I don't know where I made the error, or even whether there are more errors. The code wasn't altered, I just deleted some of it.
I want to know if there is a program, plugin, or script into which I can feed the two files and have it override only the parameters of the classes that were edited (the class names weren't altered).
I know I should have used Git and all that, but I didn't. Lesson learned.
I appreciate any help. I'm using Sublime Text.
If you're on a Unix-like OS, or you have Cygwin installed, you can use diff and patch to do this: diff -u records the differences between the two files as a unified diff, and patch applies those recorded changes.
$ diff -u old.css new.css > changes.diff
$ patch < changes.diff
I have a lot of scripts (.do files) in different folders, which are frequently moved around. I would like Stata to detect where the script is and use that as the pwd (working directory). I know people who seemingly have this functionality by default (the pwd is changed to the script's location when the script is run), but we cannot figure out why I am not so lucky. It is a bit tedious to always have a "cd" line at the top of my scripts and to have to change that line to reflect the current directory. I'm using Stata 12 with Windows 7 Professional.
It looks to me like something similar is answered in this question:
Paths to do-file in Stata
What it seems like you could do is keep an MS Excel file that somehow tracks the location of all your scripts, and then use that to generate a simple high-level do-file that calls all your programs (although this may not be how your scripts work). If your folder locations are changing I am not sure how you can completely avoid updating at least some lines of code when something gets moved around. This would at least centralize the necessary updates into one place.
You can use Sublime Text.
https://sublime.wbond.net/packages/Stata%20Enhanced
When you build the do-file (or a selection) using Sublime Text, the directory containing the do-file automatically becomes the current working directory.