Use AWK to safely search and replace URLs in a WordPress SQL dump

I am working on a webtool to mirror a WordPress installation into a development system.
The aim is to have a live system for production and a development system for testing. The webtool then offers a one-click sync between those systems.
Each of the systems is standalone, with its own webroot, database and url.
I am having trouble with the database dump, in which I have to find all references to the source URL and replace them with the URL of the destination (e.g. "www.example.com" -> "www-dev.example.com").
What I need to do is:
Find all occurrences of the URL and replace them with the new one.
If the match is also in the format of a serialized string, it should set the field separator and reload the record, so that the actual length can be set in the serialized array.
In a first attempt I tried to solve this with a sed command like the following: sed -i.orig 's/360\.example\.com/360-dev\.my\.example\.dev/g'.
This didn't work because the dump contains serialized arrays that embed the URL, and sed is no good for updating the string-length indicators of those serialized values.
My latest attempt is to use awk, as suggested here, because it is capable of arithmetic operations.
My awk script looks like this:
/360[.]example[.]com/ {
    sub("360.example.com", "360-dev.my.example.dev");
    if ($0 ~ /s:[[:digit:]]+:["](http[s]?:\/\/)?360[.]example[.]com["]/) {
        FS = "\"";
        $0 = $0;
        n = length($2) - 1;
        sub(/:[[:digit:]]+:/, ":" n ":");
    }
} 1
There seem to be some errors in my script which I can't find. It does not replace all of the occurrences of the URL, and it completely skips the length-indicator update.
How can I fix my script to achieve what I want to do?
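Two things stand out in the script above: sub() replaces only the first match on a line (gsub() replaces them all), and the test for the serialized format expects a closing quote right after the domain, so URLs that continue with a path never match it. A corrected sketch along those lines, assuming the serialized values contain no embedded double quotes or backslash escapes, and run under LC_ALL=C so that length() counts bytes the way PHP's serializer does (dump.sql and dump-dev.sql are placeholder names):
LC_ALL=C awk '{
    gsub(/360\.example\.com/, "360-dev.my.example.dev")  # replace every occurrence, not just the first
    line = $0; out = ""
    # recompute the length prefix of every serialized string s:N:"..." on the line
    while (match(line, /s:[0-9]+:"[^"]*"/)) {
        content = substr(line, RSTART, RLENGTH)
        sub(/^s:[0-9]+:"/, "", content)  # strip the s:N:" prefix
        sub(/"$/, "", content)           # strip the closing quote
        out = out substr(line, 1, RSTART - 1) "s:" length(content) ":\"" content "\""
        line = substr(line, RSTART + RLENGTH)
    }
    print out line
}' dump.sql > dump-dev.sql
Recomputing the prefix of every serialized string, changed or not, sidesteps having to detect which ones the substitution actually touched.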
EDIT: (Added Input/Output samples)
The database dump consists of the whole WordPress database, with CREATE TABLE IF NOT EXISTS and INSERT statements for each table and record.
Normal (unserialized) occurrence:
(36, 'home', 'http://360.example.com/blogname', 'yes'),
should result in:
(36, 'home', 'http://360-dev.my.example.dev/blogname', 'yes'),
Serialized occurrence:
(404, 'wp-maintenance-mode', 'a:21:{s:6:"active";i:1;s:4:"time";i:0;s:4:"link";i:1;s:7:"support";i:0;s:10:"admin_link";i:1;s:7:"rewrite";s:0:"";s:6:"notice";i:1;s:4:"unit";i:1;s:5:"theme";i:0;s:8:"styleurl";s:69:"http://360.example.com/wp-content/themes/blogname/css/maintenance.css";s:5:"index";i:0;s:5:"title";s:0:"";s:6:"header";s:0:"";s:7:"heading";s:0:"";s:4:"text";s:12:"Example Text";s:7:"exclude";a:1:{i:0;s:0:"";}s:6:"bypass";i:0;s:4:"role";a:1:{i:0;s:13:"administrator";}s:13:"role_frontend";a:1:{i:0;s:13:"administrator";}s:5:"radio";i:0;s:4:"date";s:0:"";}', 'yes'),
Should result in:
(404, 'wp-maintenance-mode', 'a:21:{s:6:"active";i:1;s:4:"time";i:0;s:4:"link";i:1;s:7:"support";i:0;s:10:"admin_link";i:1;s:7:"rewrite";s:0:"";s:6:"notice";i:1;s:4:"unit";i:1;s:5:"theme";i:0;s:8:"styleurl";s:76:"http://360-dev.my.example.dev/wp-content/themes/blogname/css/maintenance.css";s:5:"index";i:0;s:5:"title";s:0:"";s:6:"header";s:0:"";s:7:"heading";s:0:"";s:4:"text";s:12:"Example Text";s:7:"exclude";a:1:{i:0;s:0:"";}s:6:"bypass";i:0;s:4:"role";a:1:{i:0;s:13:"administrator";}s:13:"role_frontend";a:1:{i:0;s:13:"administrator";}s:5:"radio";i:0;s:4:"date";s:0:"";}', 'yes'),
EDIT 2:
Now using wp-cli to do the task of search & replace.
I've got a multisite setup with blogs numbered (2,3,9).
Executing wp search-replace --url=360.example.com '360.example.com' '360-dev.my.example.dev' results in an error telling me that the single-site tables (wp_redirection_items and wp_redirection_groups) cannot be found.
This is true, because they really do not exist under those names; they exist once per blog instead (e.g. wp_2_redirection_items and so on). This error results in over 9000 missed occurrences in the search & replace. It's possible to replace everything with wp search-replace --url=360.example.com '360.example.com' '360-dev.my.example.dev' wp_*, but it still throws the error.

As suggested by @archimiro, the task is now done by wp-cli.
But as I also have a multisite setup, which led to some errors, I had to figure out the command for a full-database search-replace task.
The final command:
wp search-replace --url=360.example.com '360.example.com' '360-dev.my.example.dev' wp_*
Without explicitly telling wp-cli to search & replace in ALL tables matching wp_*, it stops as soon as a "table not found" error is thrown.
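If your wp-cli version supports them, the --all-tables-with-prefix flag covers every table sharing the configured prefix without the shell glob, and --dry-run previews the changes before writing anything; for example (double-check the flags against your wp-cli version):
wp search-replace --url=360.example.com '360.example.com' '360-dev.my.example.dev' --all-tables-with-prefix --dry-run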

Not awk or wp-cli either, but this is a PHP function I wrote that seems to work well.
function snr($search, $replace, $inputfile, $outputfile){
    // plain text search & replace over the whole dump
    $sql = file_get_contents($inputfile);
    $sql1 = str_replace($search, $replace, $sql);
    file_put_contents($outputfile, $sql1);
    // split at every serialized-string marker "s:" that follows '{' or ';'
    $serstrings = preg_split("/(?<=[{;])s:/", $sql1);
    foreach ($serstrings as $i => $serstring) {
        if (!!strpos($serstring, $replace)) {
            // pull out the quoted value, collapsing escaped backslashes to a
            // single placeholder so strlen() matches the unserialized length
            $justString = str_replace("\\","",str_replace("\\\\","j",explode('\\";',explode(':\\"',$serstring)[1])[0]));
            $correct = strlen($justString);
            // overwrite the leading length number of this serialized string
            $serstrings[$i] = preg_replace('/^\d+/', $correct, $serstrings[$i]);
        }
    }
    file_put_contents($outputfile, implode("s:", $serstrings));
}
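Called, for example, like this (file names are placeholders):
snr('360.example.com', '360-dev.my.example.dev', 'dump.sql', 'dump-dev.sql');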

I've used this in the past with success:
sed 's|360\.example\.com|360-dev\.my\.example\.dev|g' com.sql > local.sql
Edit: sorry not awk, but neither is wp-cli.


How to avoid Jupyter cell-ids from changing all the time and thereby spamming the VCS diffs?

As discussed in q/66678305, newer Jupyter versions store in addition to the source code and output of cells an ID for the purpose of e.g. linking to a cell.
However, these IDs aren't stable; they often change even when the cell's source code was not touched. As a result, if you have the .ipynb file under version control with e.g. git, the commits end up having lots of rather funny-sounding "changed lines" that don't correspond to any actual change made in the commit. Like,
 {
   "cell_type": "code",
   "execution_count": null,
-  "id": "respected-breach",
+  "id": "incident-winning",
   "metadata": {},
   "outputs": [],
Is there a way to prevent this?
Answer for Git on Linux. Probably also works on macOS, but not Windows.
It is good practice not to version-control the .ipynb files as saved by Jupyter, but instead a filtered version that does not contain all the volatile information. For this purpose, various git hooks are available; the one I'm using is based on https://github.com/toobaz/ipynb_output_filter/blob/master/ipynb_output_filter.py.
Strangely enough, it turns out this script cannot be modified to remove the "id" field from cells. Namely, if you try to remove that field in the filtering loop, like with
for field in ("prompt_number", "execution_number", "id"):
    if field in cell:
        del cell[field]
then the write function from jupyter_nbformat will just put an id back in. It is possible to merely change the id to something constant, but then Jupyter will complain about non-unique ids.
As a hack to circumvent this, I now use this filter with a simple grep to delete the ID:
#!/bin/bash
grep -v '^ *"id": "[a-z\-]*",$'
Store that in e.g. ~/bin/ipynb_output_filter.sh, make it executable (chmod +x ~/bin/ipynb_output_filter.sh) and ensure you have the following ~/.gitattributes file:
*.ipynb filter=dropoutput_ipynb
and in your git config (either global ~/.gitconfig or project)
[core]
    attributesfile = ~/.gitattributes
[filter "dropoutput_ipynb"]
    clean = ~/bin/ipynb_output_filter.sh
    smudge = cat
If you want to use a standard python filter in addition to that, you can invoke it before the grep in ~/bin/ipynb_output_filter.sh, like
#!/bin/bash
~/bin/ipynb_output_filter.py | grep -v '^ *"id": "[a-z\-]*",$'
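To check that the filter chain behaves as expected, you can run it by hand on any notebook (file name hypothetical); it should emit the notebook JSON with no id lines:
~/bin/ipynb_output_filter.sh < notebook.ipynb | grep '"id"'    # should print nothing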

Build web graph with wget

I'm using wget with the -r (recursive) option to crawl and download all the pages starting from a root.
For debugging purposes I'd like to output which page routed me to another one, for example: https://stackoverflow.com/ -> https://stackoverflow.com/questions
Is there such a way to do that?
Please note that I explicitly need to use wget.
The best solution I found until now is to use the --warc-file option to export a WARC archive of my crawl. This format also stores the Referer of every request.
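For example, a crawl invocation could look like this (root URL taken from the example above; --no-warc-compression keeps crawler.warc uncompressed so the script below can read it directly):
wget -r --warc-file=crawler --no-warc-compression https://stackoverflow.com/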
Using a python library to read the output, I wrote the following simple script to export a CSV with source/target columns:
import warc  # the (Python 2) warc package

f = warc.open("crawler.warc")
for record in f:
    # only request records carry the Referer header
    if record['WARC-Type'] != 'request':
        continue
    for line in record.payload:
        if line.startswith("Referer:"):
            # emit "source , target"
            print line.replace("Referer: ", "").strip('\n\r'), ",", record['WARC-Target-URI']

Fail a grunt build when debug prints exist in source

I am working on a PHP/Javascript project where I've nicely set up a build workflow. It involves testing, minifying, compressing into the final zip deliverable, and a whole lot of other nice stuff.
I want to build a task that fails when certain patterns appear in the source code. I would like to look for any print_r(), error_log(), var_dump(), etc. calls, and halt the build process if there are any. Later I might also want to check for things in Javascript or CSS, so this is not only a PHP question.
I know it can be done with grunt-shell and grep but I'd like to know the following:
Are there any grunt plugins specific to this task? Ideally I would like to be able to specify a list of regexes per file type, and to set whether to continue or fail the build on pattern match.
How do others tackle the problem of double-checking the packaged source for the most common debug statements or other patterns?
Not a complete answer to my question, but I've recently come across this grunt plugin, which is somewhat related: it removes console.log statements from JavaScript. I haven't tried it yet, but it looks good. I still would like to know if there's something similar for PHP though.
http://grunt-tasks.com/grunt-remove-logging-calls/
Edit: Seeing as there's only tumbleweeds rolling in the wind here, I'm posting my workaround based on grunt-shell. However, this is not what I was looking for, and it's not perfect because it doesn't do proper syntax parsing:
shell: {
    check_debug_prints: {
        command: '(! (egrep -r "var_dump|print_r|error_log" --include=*.php src || egrep -r "console\\.\\w+|debugger;" --include=*.js src)) || (echo "Debug prints in source - build aborted" && false)'
    }
},
and
grunt.loadNpmTasks( 'grunt-shell' );
Edit 2: I finally found the exact grunt plugin I was looking for: grunt-search. There is a failOnMatch boolean option that lets you indicate whether a particular regex pattern should cause the build to fail when found.
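From memory, a configuration for it looks roughly like the sketch below; failOnMatch is the option named above, while the target layout and the other option names are assumptions to check against the plugin's README:
search: {
    debug_prints: {
        files: { src: ['src/**/*.php', 'src/**/*.js'] },
        options: {
            searchString: /(var_dump|print_r|error_log|console\.\w+|debugger;)/g,
            logFile: 'tmp/debug-prints.json',  // assumed option name
            failOnMatch: true
        }
    }
}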

Unix SQLLDR script gives 'Unexpected End of File' error

All, I am running the following script to load the data onto the Oracle server using a unix box and sqlldr. Earlier it gave me an error saying sqlldr: command not found. I added "sqlplus << EOF", and it still gives me an 'unexpected end of file' syntax error on line 12, but the script is only 11 lines long. What seems to be the problem, according to you?
#!/bin/bash
FILES='ls *.txt'
CTL='/blah/blah1/blah2/name/filename.ctl'
for f in $FILES
do
    cat $CTL | sed "s/:FILE/$f/g" >$f.ctl
    sqlplus ID/'PASSWORD'@SERVERNAME << EOF
    sqlldr SCHEMA_NAME/SCHEMA_PASSWORD control=$f.ctl data=$f
    EOF
done
sqlplus will never know what to do with the command sqlldr; they are two complementary command-line utilities for interfacing with an Oracle DB.
Note that NO sqlplus or EOF heredoc etc. is required to load data into a schema:
#!/bin/bash
# you don't want this: FILES='ls *.txt'
CTL_PATH='/blah/blah1/blah2/name'
CTL_FILE="$CTL_PATH/filename.ctl"
SCHEMA_USER=SCHEMA_NAME
SCHEMA_PASSWORD=SCHEMA_PASSWORD
SERVER_NAME=SERVERNAME
for f in *.txt
do
    # don't need cat! cat $CTL | sed "s/:FILE/$f/g" > "$f".ctl
    sed "s/:FILE/$f/g" "$CTL_FILE" > "$CTL_PATH/$f.ctl"
    # myBad: sqlldr "$SCHEMA_USER/$SCHEMA_PASSWORD" control="$CTL_PATH/$f.ctl" data="$f"
    sqlldr "$SCHEMA_USER/$SCHEMA_PASSWORD@$SERVER_NAME" control="$CTL_PATH/$f.ctl" data="$f" rows=10000 direct=true errors=999
done
Without getting too philosophical, using assignments like FILES=$(ls *.txt) is a bad habit to get into. By contrast, for f in *.txt deals correctly with files that have odd characters in their names (like spaces or other syntax-breaking values). BUT the other habit you do want to get into is to quote all variable references (like $f) with double quotes: "$f", OK? ;-) This is the other side of protecting yourself against file names with spaces etc. embedded in them.
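A quick throwaway demo of the difference (hypothetical file name):
# a file name with a space survives the glob but not the ls capture
touch 'two words.txt'
FILES=$(ls *.txt)
for f in $FILES; do echo "ls gave: [$f]"; done   # two broken items: [two] and [words.txt]
for f in *.txt;  do echo "glob gave: [$f]"; done # one intact item: [two words.txt]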
In the edit update, I've turned your CTL path and file into the variables CTL_PATH and CTL_FILE. I think I understand your intent: you have one standard CTL file that you pass through sed to create a table-specific .ctl file (a good approach, in my experience). Note that you don't need cat to send a file to sed, but your use of redirection (> $f.ctl) to create an altered file is very shell-like too.
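For illustration, a minimal filename.ctl template this would work against might look like the following; the :FILE token is what the sed command substitutes, and the table and column names are made up:
LOAD DATA
INFILE ':FILE'
APPEND INTO TABLE my_table
FIELDS TERMINATED BY ','
(col1, col2, col3)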
In the 2nd edit update, I looked here on S.O., found an example sqlldr command line with the correct syntax, and modified it to work with your variable names.
To finish up:
A. Are you sure the Oracle client package is installed on the machine that you are running your script on?
B. Is the /path/to/oracle/client/tools/bin included in your working $PATH?
C. Try which sqlldr. If you don't get anything, either it's not installed or it's not in the path.
D. If not installed, you'll have to get it installed.
E. Once installed, note the directory that contains the sqlldr command. find / -name 'sqlldr*' will take a long time to run, but it will print out the path you want to use.
F. Take the "path" part of what is returned (like /opt/oracle/11.2/client/bin/, but not the sqlldr at the end), and edit the script at the 2nd line with:
export ORCL_PATH="/path/you/found/to/oracle/client"
export PATH="$ORCL_PATH:$PATH"
These steps should solve any remaining issues. If this doesn't work, see if there is someone where you work that understands your local computing environment that can help explain any missing or different steps.
IHTH

Avoid message "-- Loading resources from .sqliterc"

Minor problem, nevertheless irritating: is there a way to avoid the following message from appearing each time I run a query?
-- Loading resources from /Users/ThG/.sqliterc
As a stupid workaround, this works:
sqlite3 your_sqlite.db 'select * from your_table' < /dev/null
This is because the current code does this:
if( stdin_is_interactive ){
    utf8_printf(stderr,"-- Loading resources from %s\n",sqliterc);
}
Forcing a stdin redirect thwarts this due to this piece of code:
stdin_is_interactive = isatty(0);
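Any non-tty stdin does the trick, so piping the query in suppresses the message as well:
echo 'select * from your_table;' | sqlite3 your_sqlite.db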
This works as well:
sqlite3 -batch your_sqlite.db 'select * from your_table'
due to this piece of code:
}else if( strcmp(z,"-batch")==0 ){
    /* Need to check for batch mode here to so we can avoid printing
    ** informational messages (like from process_sqliterc) before
    ** we do the actual processing of arguments later in a second pass.
    */
    stdin_is_interactive = 0;
}
but it's longer, so it kind of defeats the purpose.
I know that this question is pretty old now, but simply deleting '/Users/ThG/.sqliterc' should solve the problem. '.sqliterc' is a configuration file for sqlite's interactive command-line front-end. If you don't spend a lot of time in there, you won't miss the file.
That resource msg comes out on stderr, and it's followed by a blank line, so you could get rid of it with something like this (wrapped up in a script file of its own):
#!/bin/bash
sqlite3 -init /your/init/file /your/sqlite3/file.db "
your
SQL
cmds
" 2>/dev/null | sed -e1d
When using sqlite in shell scripts, you usually don't even want your ~/.sqliterc to be loaded at all. This works well for me:
sqlite3 -init <(echo)
Explanation:
-init specifies the file to load instead of ~/.sqliterc.
<(echo) uses Process Substitution to provide a path to a temporary empty file.
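Combined with the one-shot query style used earlier in this thread (same placeholder names), that gives:
sqlite3 -init <(echo) your_sqlite.db 'select * from your_table'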
A bit late, but @levant pied almost had the solution: you need to pass an additional -interactive to silence the "-- Loading resources from" message.
$ sqlite3 -batch -interactive
SQLite version 3.31.1 2020-01-27 19:55:54
Enter ".help" for usage hints.
sqlite> .q
You can simply rename your config file to disable the warning, and revert the rename afterwards to keep the configuration.
I use the following:
# suppress default configuration warnings
mv $HOME/.sqliterc $HOME/.backup.sqliterc

# sqlite3 scripts...

# revert
mv $HOME/.backup.sqliterc $HOME/.sqliterc
