Build web graph with wget

I'm using wget with the -r (recursive) option to crawl and download all the pages starting from a root.
For debugging purposes I'd like to output which page led me to another one, for example: https://stackoverflow.com/ -> https://stackoverflow.com/questions
Is there a way to do that?
Please note that I explicitly need to use wget.

The best solution I have found so far is to use the --warc-file option to export a WARC archive of my crawl. This format also stores the Referer header of each request.
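With --warc-file, wget records every request and response of the crawl in a single archive alongside the normal download. For example, an invocation along these lines produces the crawler.warc file that the script below reads (with wget's default gzip compression the file would be crawler.warc.gz instead):
wget -r --warc-file=crawler --no-warc-compression https://stackoverflow.com/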
Using a Python library to read the output, I wrote the following simple script to export a CSV with source/target columns:
# reads the archive with the 'warc' package (pip install warc)
import warc

f = warc.open("crawler.warc")
for record in f:
    # only request records carry the Referer header
    if record['WARC-Type'] != 'request':
        continue
    for line in record.payload:
        if line.startswith("Referer:"):
            # source page (referer) -> target page (requested URI)
            print line.replace("Referer: ", "").strip('\n\r') + "," + record['WARC-Target-URI']


Use AWK to safely search and replace URLs in Wordpress SQL-Dump

I am working on a webtool to mirror a WordPress installation into a development system.
The aim is to have a live system for production and a development system for testing. The webtool then offers a one-click sync between those systems.
Each of the systems is standalone, with its own webroot, database and URL.
I am having trouble with the database dump, in which I have to find all references to the source URL and replace them with the URL of the destination (e.g. "www.example.com" -> "www-dev.example.com").
What I need to do is:
Find all occurrences of the URL and replace them with the new one.
If the match is also part of a serialized string, set the field separator, reload the match and update the stored length so that the actual length ends up in the array.
In a first attempt I tried to solve this with a sed command that looks as follows: sed -i.orig 's/360\.example\.com/360-dev\.my\.example\.dev/g'.
This didn't work because the dump contains serialized arrays that include the URL, and sed cannot update the string-length indicator of those serialized values.
My latest attempt is to use awk, as suggested here, because it is capable of arithmetic operations.
My awk script looks like this:
/360[.]example[.]com/ {
    sub("360.example.com", "360-dev.my.example.dev");
    if ($0 ~ /s:[[:digit:]]+:["](http[s]?:\/\/)?360[.]example[.]com["]/) {
        FS="\"";
        $0=$0;
        n=length($2)-1;
        sub(/:[[:digit:]]+:/, ":" n ":");
    }
} 1
There seem to be some errors in my script that I can't find. It does not replace all of the occurrences of the URL and completely skips the length-indicator update.
How can I fix my script to achieve what I want to do?
EDIT: (Added input/output samples)
The database dump consists of the whole WordPress database, with CREATE TABLE IF NOT EXISTS and INSERT statements for each table and record.
Normal (unserialized) occurrence:
(36, 'home', 'http://360.example.com/blogname', 'yes'),
should result in:
(36, 'home', 'http://360-dev.my.example.dev/blogname', 'yes'),
Serialized occurrence:
(404, 'wp-maintenance-mode', 'a:21:{s:6:"active";i:1;s:4:"time";i:0;s:4:"link";i:1;s:7:"support";i:0;s:10:"admin_link";i:1;s:7:"rewrite";s:0:"";s:6:"notice";i:1;s:4:"unit";i:1;s:5:"theme";i:0;s:8:"styleurl";s:69:"http://360.example.com/wp-content/themes/blogname/css/maintenance.css";s:5:"index";i:0;s:5:"title";s:0:"";s:6:"header";s:0:"";s:7:"heading";s:0:"";s:4:"text";s:12:"Example Text";s:7:"exclude";a:1:{i:0;s:0:"";}s:6:"bypass";i:0;s:4:"role";a:1:{i:0;s:13:"administrator";}s:13:"role_frontend";a:1:{i:0;s:13:"administrator";}s:5:"radio";i:0;s:4:"date";s:0:"";}', 'yes'),
Should result in:
(404, 'wp-maintenance-mode', 'a:21:{s:6:"active";i:1;s:4:"time";i:0;s:4:"link";i:1;s:7:"support";i:0;s:10:"admin_link";i:1;s:7:"rewrite";s:0:"";s:6:"notice";i:1;s:4:"unit";i:1;s:5:"theme";i:0;s:8:"styleurl";s:76:"http://360-dev.my.example.dev/wp-content/themes/blogname/css/maintenance.css";s:5:"index";i:0;s:5:"title";s:0:"";s:6:"header";s:0:"";s:7:"heading";s:0:"";s:4:"text";s:12:"Example Text";s:7:"exclude";a:1:{i:0;s:0:"";}s:6:"bypass";i:0;s:4:"role";a:1:{i:0;s:13:"administrator";}s:13:"role_frontend";a:1:{i:0;s:13:"administrator";}s:5:"radio";i:0;s:4:"date";s:0:"";}', 'yes'),
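The samples show the bookkeeping that a plain textual replacement misses: the URL grows by seven characters, so the length prefix of the affected serialized string has to change from s:69 to s:76. As an illustration of that step alone, here is a minimal Python sketch, assuming the double quotes inside serialized values appear unescaped in the dump (as in the samples above) and borrowing the com.sql/local.sql file names from the sed answer further down:
import re

OLD = "360.example.com"
NEW = "360-dev.my.example.dev"

def fix_lengths(dump):
    # Recompute the NN in every s:NN:"..." marker so it matches the
    # byte length of the quoted value after the URL replacement.
    return re.sub(r's:\d+:"([^"]*)"',
                  lambda m: 's:%d:"%s"' % (len(m.group(1).encode("utf-8")), m.group(1)),
                  dump)

with open("com.sql") as f:
    dump = f.read().replace(OLD, NEW)
with open("local.sql", "w") as f:
    f.write(fix_lengths(dump))
Recomputing every s:NN: prefix rather than only the changed ones is harmless as long as the prefixes in the original dump were already correct.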
EDIT 2:
Now using wp-cli to do the search & replace task.
I've got a multisite setup with blogs numbered (2, 3, 9).
Executing wp search-replace --url=360.example.com '360.example.com' '360-dev.my.example.dev' results in an error telling me that the single-site tables (wp_redirection_items and wp_redirection_groups) cannot be found.
This is true, because they really do not exist; there is instead one set per blog (e.g. wp_2_redirection_items and so on). This error results in over 9000 missed occurrences in the search & replace. It's possible to replace everything with wp search-replace --url=360.example.com '360.example.com' '360-dev.my.example.dev' wp_*, but it still throws the error.
As suggested by @archimiro, the task is now done by wp-cli.
But since I also have a multisite setup, which led to some errors, I had to figure out the command for a full database search & replace.
The final command:
wp search-replace --url=360.example.com '360.example.com' '360-dev.my.example.dev' wp_*
Without explicitly telling wp-cli to search & replace in ALL (wp_*) tables, it would stop as soon as a "table not found" error was thrown.
Also not awk or wp-cli, but here is a PHP function I wrote that seems to work well.
function snr($search, $replace, $inputfile, $outputfile){
    $sql = file_get_contents($inputfile);
    $sql1 = str_replace($search, $replace, $sql);
    file_put_contents($outputfile, $sql1);
    // Split on the s: markers of serialized strings so each chunk starts with its length prefix
    $serstrings = preg_split("/(?<=[{;])s:/", $sql1);
    foreach ($serstrings as $i => $serstring) {
        if (!!strpos($serstring, $replace)) {
            // Extract the quoted value and recompute its length
            $justString = str_replace("\\", "", str_replace("\\\\", "j", explode('\\";', explode(':\\"', $serstring)[1])[0]));
            $correct = strlen($justString);
            $serstrings[$i] = preg_replace('/^\d+/', $correct, $serstrings[$i]);
        }
    }
    file_put_contents($outputfile, implode("s:", $serstrings));
}
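Called with the values from the question and the dump file names used in the sed answer below, the call would look something like:
snr('360.example.com', '360-dev.my.example.dev', 'com.sql', 'local.sql');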
I've used this in the past with success:
sed 's|360\.example\.com|360-dev\.my\.example\.dev|g' com.sql > local.sql
Edit: sorry, not awk, but neither is wp-cli.

Unix SQLLDR script gives 'Unexpected End of File' error

All, I am running the following script to load data onto the Oracle server using a Unix box and sqlldr. Earlier it gave me an error saying sqlldr: command not found. I added "sqlplus << EOF", and now it gives an unexpected end of file syntax error on line 12, even though there are only 11 lines of code. What seems to be the problem?
#!/bin/bash
FILES='ls *.txt'
CTL='/blah/blah1/blah2/name/filename.ctl'
for f in $FILES
do
    cat $CTL | sed "s/:FILE/$f/g" >$f.ctl
    sqlplus ID/'PASSWORD'@SERVERNAME << EOF
    sqlldr SCHEMA_NAME/SCHEMA_PASSWORD control=$f.ctl data=$f
    EOF
done
sqlplus will never know what to do with the command sqlldr; they are two complementary command-line utilities for interfacing with Oracle DB.
Note that NO sqlplus or EOF etc. is required to load data into a schema:
#!/bin/bash
#you dont want this FILES='ls *.txt'
CTL_PATH='/blah/blah1/blah2/name/'
CTL_FILE="$CTL_PATH/filename.ctl"
SCHEMA_NM=SCHEMA_NAME
SCHEMA_PSWD=SCHEMA_PASSWORD
for f in *.txt
do
    # don't need cat! cat $CTL | sed "s/:FILE/$f/g" >"$f".ctl
    sed "s/:FILE/$f/g" "$CTL_FILE" > "$CTL_PATH/$f.ctl"
    #myBad sqlldr "$SCHEMA_NAME/$SCHEMA_PASSWORD" control="$CTL_PATH/$f.ctl" data="$f"
    sqlldr "$SCHEMA_NM/$SCHEMA_PSWD@$SERVER_NAME" control="$CTL_PATH/$f.ctl" data="$f" rows=10000 direct=true errors=999
done
Without getting too philosophical, using assignments like FILES=$(ls *.txt) is a bad habit to get into. By contrast, for f in *.txt deals correctly with files that have odd characters in their names (like spaces or other syntax-breaking values). BUT the other habit you do want to get into is to quote all variable references (like $f) with double quotes: "$f", OK? ;-) This is the other side of protection for files with spaces etc. embedded in them.
In the edit update, I've variable-ized your CTL_PATH and CTL_FILE. I think I understand your intent: you have one standard CTL_FILE that you pass through sed to create a table-specific .ctl file (a good approach in my experience). Note that you don't need cat to send a file to sed, but your use of redirection (> $f.ctl) to create an altered file is very shell-like too.
In the 2nd edit update, I looked here on S.O. and found an example sqlldr command line that has the correct syntax, and modified it to work with your variable names.
To finish up:
A. Are you sure the Oracle Client package is installed on the machine that you are running your script on?
B. Is the /path/to/oracle/client/tools/bin included in your working $PATH?
C. Try which sqlldr. If you don't get anything, either it's not installed or it's not in the path.
D. If not installed, you'll have to get it installed.
E. Once installed, note the directory that contains the sqlldr command. find / -name 'sqlldr*' will take a long time to run, but it will print out the path you want to use.
F. Take the "path" part of what is returned (like /opt/oracle/11.2/client/bin/, but not the sqlldr at the end) and edit the script near its 2nd line to add:
export ORCL_PATH="/path/you/found/to/oracle/client"
export PATH="$ORCL_PATH:$PATH"
These steps should solve any remaining issues. If this doesn't work, see if there is someone where you work who understands your local computing environment and can help explain any missing or different steps.
IHTH

How to use a template in vim

This is really a newbie question - but basically, how do I enable a template for certain filetypes?
Basically, I just want the template to insert a header of sorts, that is, with some functions that I find useful, libraries loaded, etc.
I interpret
:help template
as meaning that I should place this in my vimrc:
au BufNewFile,BufRead ~/.vim/skeleton.R
Running an R script then shows that something could happen, but apparently nothing does:
--- Auto-Commands ---
This may be because a template consists of commands (and there are no such commands in skeleton.R) - and in this case I just want it to insert a text header (which is what skeleton.R consists of).
Sorry if this question is mind-bogglingly stupid ;-/
The command that you've suggested is not going to work: what it will do is run no Vim command at all whenever you open ~/.vim/skeleton.R.
A crude way of achieving what you want would be to use:
:au BufNewFile *.R r ~/.vim/skeleton.R
This will read (:r) your file whenever a new *.R file is created. You want to avoid having BufRead in the autocmd, or it will read the skeleton file into your working file every time you open the file!
There are many plugins that add a lot more control to this process. Being the author and therefore completely biased, I'd recommend this one, but there are plenty of others listed here.
Shameless plug:
They all work in a relatively similar way, but to explain my script:
You install the plugin as described on the linked page and then create some templates in ~/.vim/templates. These templates should have the same extension as the 'target' file, so if it's a template for .R files, call it something like skeleton.R. In your .vimrc, add something like this:
let g:file_template_default = {}
let g:file_template_default['R'] = 'skeleton'
Then create your new .R file (with a filename, so save it if it's new) and enter:
:LoadFileTemplate
You can also skip the .vimrc editing and just do:
:LoadFileTemplate skeleton
See the website for more details.
Assuming that your skeletons are in your ~/.vim/templates/ directory, you can put this snippet in your vimrc file:
augroup templates
au!
" read in templates files
autocmd BufNewFile *.* silent! execute '0r ~/.vim/templates/skeleton.'.expand("<afile>:e")
augroup END
Some explanation:
BufNewFile *.* = each time we edit a new file
silent! execute = execute silently, with no error messages if it fails
0r = read the file and insert its content at the top (0) of the new file
expand("<afile>:e") = get the extension of the current filename
see also http://vim.wikia.com/wiki/Use_eval_to_create_dynamic_templates
*fixed missing dot in file path
Create a templates subdirectory in your ~/.vim folder
$ mkdir -p ~/.vim/templates
Create a new file in that subdirectory called R.skeleton and put in the header and/or other stuff you want to automagically load upon creating a new .R file.
$ vim ~/.vim/templates/R.skeleton
Then, add the following to your ~/.vimrc file, which may have been suggested in a way by "guest"
autocmd BufNewFile * silent! 0r $HOME/.vim/templates/%:e.skeleton
Have a look at my github repository for some more details and other options.
It's just a trick I used to use.
It's cheap, but if you don't know anything about Vim and its commands, it's easy to handle.
Make a template file like this:
~/.vim/templates/barney.cpp
and, as you know, barney.cpp should contain your template code.
Then add a function like ForBarneyStinson() to the end of your .vimrc file, located at ~/.vimrc.
It should look like this:
function ForBarneyStinson()
    :read ~/.vim/templates/barney.cpp
endfunction
Then just use this command in vim:
:call ForBarneyStinson()
and you'll see your template.
As an example, I already have two templates for .cpp files:
:call ForBarney()
:call ACM()
Sorry, said too much.
Coding's awesome! :)
Also take a look at https://github.com/aperezdc/vim-template.git.
I use it and have contributed some patches to it, and would argue it's relatively full-featured.
What about using the snipmate plugin? See here
There exist many template-file expanders -- you'll also find explanations there on how to implement a rudimentary template-file expander.
For my part, I'm maintaining the fork of muTemplate. For a simple start, just drop a {ft}.template file into {rtp}/template/. If you want to use any (viml) variable or expression, just do. You can even put vim code (and now even functions) into the template file if you wish. Several smart decisions are already implemented for C++ and vim files.

Is there an app to automatically format CSS files?

I have a compressed CSS file (all whitespace removed) that I want to inspect, but it's a huge pain to inspect as-is. Is there any utility (preferably a Linux command-line tool) that I can run the file through to format it nicely?
The online service that Dave Newman mentioned has been converted into a Node.js script, which you can run on the command-line. If you have NPM installed you can just do:
npm install -g cssunminifier
It's pretty versatile in how you can use it. Here are three different examples:
cssunminifier style.min.css style.css
cssunminifier --width=8 style.min.css
curl http://cdn.sstatic.net/stackoverflow/all.css | cssunminifier - | less
Here’s more info on the command-line css unminifier
Try this online service.
You can also inspect any compressed file in Firebug.
I wrote a little formatter in Ruby for you. Save it as a .rb file and use it from the CLI like ruby format.rb input.css input-clean.css:
# Formats CSS
input, output = ARGV

# Input
if input == nil or output == nil
  puts "Syntax: #{$0} [input] [output]"
  exit
end

# Opens file
unless File.exist? input
  puts "File #{input} doesn't exist."
  exit
end

# Reads file
input = File.read input

# Creates output file
output = File.new output, "w+"

# Processes input
input = input.gsub("{", "\n{\n\t")
             .gsub(",", ", ")
             .gsub(";", ";\n\t")
             .gsub(/\t?}/, "}\n\n\n")
             .gsub(/\t([^:]+):/, "\t" + '\1: ')

# Writes output
output.write input

# Closes output
output.close
These programs are called 'beautifiers'. You should be able to google one that fits your needs.
If you're looking for a locally-executable utility, as opposed to a web service, you want CSS Tidy.
This also indents: styleneat
Here's a free Windows app to "beautify" your file. I haven't used it, so I don't know how well it works.
http://www.blumentals.net/csstool/
It is specific, but Visual Studio does this for that file type (by no means the generic solution to which you allude).
Take a look at the vkBeautify plugin:
http://www.eslinstructor.net/vkbeautify/
It can beautify (pretty-print) CSS, XML and JSON text.
It is written in plain JavaScript, and is small, simple and fast.

Convert ASP.NET project pages from Windows-1251 to UTF-8

I can do that file by file with Save As Encoding in Visual Studio, but I'd like to do this in one click. Is it possible?
I know some will start bashing me, but:
download a Smalltalk IDE (such as ST/X),
open a workspace,
and type in:
'yourDirectoryHere' asFilename directoryContentsAsFilenamesDo:[:oldFileName |
    |cyrString utfString newFile|
    cyrString := oldFileName contentsAsString.
    utfString := CharacterEncoder encodeString:cyrString from:#'iso8859-5' into:#'utf'.
    newFile := oldFileName withSuffix:'utf'.
    newFile contents:utfString.
].
That will convert all files in the given directory and create corresponding .utf files without affecting the originals. Even if you do not normally use Smalltalk, for this type of action Smalltalk is a perfect scripting environment.
I know most of you don't read Smalltalk, but the code should be readable even for non-Smalltalkers, and a corresponding Perl/Python/Java/C# piece of code could also be written and executed in a minute or so, taking the above as a guide. I guess all current languages provide something similar to the CharacterEncoder above.
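For instance, a rough Python counterpart (an illustrative sketch only, assuming the source pages really are Windows-1251/cp1251 encoded and that writing a .utf copy next to each original file is acceptable) might look like this:
import pathlib

src_dir = pathlib.Path("yourDirectoryHere")  # placeholder, as in the Smalltalk snippet
for old_file in src_dir.iterdir():
    if not old_file.is_file():
        continue
    text = old_file.read_text(encoding="cp1251")            # decode the original page
    new_file = old_file.parent / (old_file.name + ".utf")   # e.g. Default.aspx -> Default.aspx.utf
    new_file.write_text(text, encoding="utf-8")             # write a UTF-8 copy, original untouched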
