Add styling rules in pandoc tables for odt/docx output (table borders) - docx

I'm generating some odt/docx reports via markdown using knitr and pandoc, and am now wondering how you'd go about formatting tables. Primarily I'm interested in adding rules (at least top, bottom, and one below the header, but being able to add arbitrary ones inside the table would be nice too).
Running the following example from the pandoc documentation through pandoc (without any special parameters) just yields a "plain" table without any kind of rules/colours/guides (with either -t odt or -t docx).
+---------------+---------------+--------------------+
| Fruit         | Price         | Advantages         |
+===============+===============+====================+
| Bananas       | $1.34         | - built-in wrapper |
|               |               | - bright color     |
+---------------+---------------+--------------------+
| Oranges       | $2.10         | - cures scurvy     |
|               |               | - tasty            |
+---------------+---------------+--------------------+
I've looked through the "styles" for the possibility of specifying table formatting in a reference .docx/.odt, but found nothing obvious beyond the "table header" and "table contents" styles, both of which seem to concern only the formatting of text within the table.
Being rather unfamiliar with WYSIWYG-style document processors, I'm lost as to how to continue.

Here's how I searched for how to do this:
The way to add a table in docx is to use the <w:tbl> tag. So I searched for this in the GitHub repository and found it in this file (called Writers/Docx.hs, so it's not a big surprise):
blockToOpenXML opts (Table caption aligns widths headers rows) = do
  let captionStr = stringify caption
  caption' <- if null caption
                 then return []
                 else withParaProp (pStyle "TableCaption")
                      $ blockToOpenXML opts (Para caption)
  let alignmentFor al = mknode "w:jc" [("w:val",alignmentToString al)] ()
  let cellToOpenXML (al, cell) = withParaProp (alignmentFor al)
                                 $ blocksToOpenXML opts cell
  headers' <- mapM cellToOpenXML $ zip aligns headers
  rows' <- mapM (\cells -> mapM cellToOpenXML $ zip aligns cells)
           $ rows
  let borderProps = mknode "w:tcPr" []
                    [ mknode "w:tcBorders" []
                      $ mknode "w:bottom" [("w:val","single")] ()
                    , mknode "w:vAlign" [("w:val","bottom")] () ]
  let mkcell border contents = mknode "w:tc" []
                               $ [ borderProps | border ] ++
                               if null contents
                                  then [mknode "w:p" [] ()]
                                  else contents
  let mkrow border cells = mknode "w:tr" [] $ map (mkcell border) cells
  let textwidth = 7920 -- 5.5 in in twips, 1/20 pt
  let mkgridcol w = mknode "w:gridCol"
                    [("w:w", show $ (floor (textwidth * w) :: Integer))] ()
  return $
    [ mknode "w:tbl" []
      ( mknode "w:tblPr" []
        ( [ mknode "w:tblStyle" [("w:val","TableNormal")] () ] ++
          [ mknode "w:tblCaption" [("w:val", captionStr)] ()
          | not (null caption) ] )
      : mknode "w:tblGrid" []
        (if all (==0) widths
            then []
            else map mkgridcol widths)
      : [ mkrow True headers' | not (all null headers) ] ++
        map (mkrow False) rows'
      )
    ] ++ caption'
I'm not familiar with Haskell at all, but I can see that the border style is hardcoded, since there is no variable in it:
let borderProps = mknode "w:tcPr" []
                  [ mknode "w:tcBorders" []
                    $ mknode "w:bottom" [("w:val","single")] ()
                  , mknode "w:vAlign" [("w:val","bottom")] () ]
What does that mean?
It means that you can't change the style of the docx tables with the current version of Pandoc. However, there's a way to get your own style.
How to get your own style?
Create a Docx Document with the style you want on your table (by creating that table)
Change the extension of that file and unzip it
Open word/document.xml and search for the <w:tbl>
Try to find out how your style translates in XML and change the borderProps according to what you see.
Here's a test with a border-style I created:
And here is the corresponding XML:
<w:tblBorders>
  <w:top w:val="dotted" w:sz="18" w:space="0" w:color="C0504D" w:themeColor="accent2"/>
  <w:left w:val="dotted" w:sz="18" w:space="0" w:color="C0504D" w:themeColor="accent2"/>
  <w:bottom w:val="dotted" w:sz="18" w:space="0" w:color="C0504D" w:themeColor="accent2"/>
  <w:right w:val="dotted" w:sz="18" w:space="0" w:color="C0504D" w:themeColor="accent2"/>
  <w:insideH w:val="dotted" w:sz="18" w:space="0" w:color="C0504D" w:themeColor="accent2"/>
  <w:insideV w:val="dotted" w:sz="18" w:space="0" w:color="C0504D" w:themeColor="accent2"/>
</w:tblBorders>
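If you'd rather not rename and unzip the file by hand, here is a minimal Python sketch of the same inspection step (standard library only; the file name styled-table.docx is a placeholder, not something from this answer):
import zipfile

# a docx is just a zip archive, so read word/document.xml straight out of it
# "styled-table.docx" is a placeholder name (assumption)
with zipfile.ZipFile("styled-table.docx") as docx:
    xml = docx.read("word/document.xml").decode("utf-8")

# crude but effective: print the first table definition so you can see the border XML
start = xml.find("<w:tbl>")
end = xml.find("</w:tbl>", start) + len("</w:tbl>")
print(xml[start:end])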
What about odt?
I haven't looked into it yet; ask if you can't figure it out yourself using a similar method.
Hope this helps, and don't hesitate to ask if you need more.

Same suggestion as edi9999: hack the XML content of the converted docx. The following is my R code for doing that.
The tblPr variable contains the definition of the style to be added to the tables in the docx. You can modify the string to suit your own needs.
require(XML)
docx.file <- "report.docx"
tblPr <- '<w:tblPr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:tblStyle w:val="a8"/><w:tblW w:w="0" w:type="auto"/><w:tblBorders><w:top w:val="single" w:sz="4" w:space="0" w:color="000000" w:themeColor="text1"/><w:left w:val="single" w:sz="4" w:space="0" w:color="000000" w:themeColor="text1"/><w:bottom w:val="single" w:sz="4" w:space="0" w:color="000000" w:themeColor="text1"/><w:right w:val="single" w:sz="4" w:space="0" w:color="000000" w:themeColor="text1"/><w:insideH w:val="single" w:sz="4" w:space="0" w:color="000000" w:themeColor="text1"/><w:insideV w:val="single" w:sz="4" w:space="0" w:color="000000" w:themeColor="text1"/></w:tblBorders><w:jc w:val="center"/></w:tblPr>'
## unzip the docx converted by Pandoc
system(paste("unzip", docx.file, "-d temp_dir"))
document.xml <- "temp_dir/word/document.xml"
doc <- xmlParse(document.xml)
tbl <- getNodeSet(xmlRoot(doc), "//w:tbl")
tblPr.node <- lapply(1:length(tbl), function (i)
    xmlRoot(xmlParse(tblPr)))
added.Pr <- names(xmlChildren(tblPr.node[[1]]))
for (i in 1:length(tbl)) {
    tbl.node <- tbl[[i]]
    if ('tblPr' %in% names(xmlChildren(tbl.node))) {
        children.Pr <- xmlChildren(xmlChildren(tbl.node)$tblPr)
        for (j in length(added.Pr):1) {
            if (added.Pr[j] %in% names(children.Pr)) {
                replaceNodes(children.Pr[[added.Pr[j]]],
                             xmlChildren(tblPr.node[[i]])[[added.Pr[j]]])
            } else {
                ## first.child <- children.Pr[[1]]
                addSibling(children.Pr[['tblStyle']],
                           xmlChildren(tblPr.node[[i]])[[added.Pr[j]]],
                           after=TRUE)
            }
        }
    } else {
        addSibling(xmlChildren(tbl.node)[[1]], tblPr.node[[i]], after=FALSE)
    }
}
## save hacked xml back to docx
saveXML(doc, document.xml, indent = F)
setwd("temp_dir")
system(paste("zip -r ../", docx.file, " *", sep=""))
setwd("..")
system("rm -fr temp_dir")

edi9999 has the best answer but here's what I do:
When creating the docx, use a reference docx to get styles. That reference will contain a heap of other styles that Pandoc doesn't use when creating the document, but they are still in there. Typically you'll get the default sets, but you can add a new table style too.
Then you only need to update the word\document.xml file to reference the new table style, and you can do that programmatically (by unzipping, running sed, and updating the docx archive), e.g.:
7z.exe x mydoc.docx word\document.xml
sed "s/<w:tblStyle w:val=\"TableNormal\"/<w:tblStyle w:val=\"NewTableStyle\"/g" word\document.xml > word\document2.xml
copy word\document2.xml word\document.xml /y
7z.exe u mydoc.docx word\document.xml
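If you are not on Windows, or prefer a single script, here is a minimal Python sketch of the same unzip/replace/re-zip step; the file name mydoc.docx and style names follow the commands above, and the sketch itself is an assumption rather than part of the original answer:
import zipfile

# "mydoc.docx" and the style names are placeholders taken from the commands above
src = "mydoc.docx"
dst = "mydoc-styled.docx"

# copy every member of the archive, rewriting only word/document.xml
with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
    for item in zin.infolist():
        data = zin.read(item.filename)
        if item.filename == "word/document.xml":
            data = data.replace(b'w:val="TableNormal"', b'w:val="NewTableStyle"')
        zout.writestr(item, data)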

Add a table style named "TableNormal" in reference.docx.

Using a reference docx file and then python-docx does the job pretty easily:
https://python-docx.readthedocs.io/
First, convert your document to docx:
Bash:
pandoc --standalone --data-dir=/path/to/reference/ --output=/tmp/xxx.docx input_file.md
Notes:
/path/to/reference/ points to the folder containing reference.docx
reference.docx is a file containing the styles you need for docx elements
Then give the tables of your document the style you want to use:
Python:
import docx

document = docx.Document('/tmp/xxx.docx')
for table in document.tables:
    table.style = document.styles['custom_style']  # custom_style must exist in your reference.docx file
document.save("target.docx")  # thank you Anish

Just add a table style, styled however you want, called "Table" in the reference-doc file, and update pandoc to the latest version.

I really liked gbjbaanb's answer - here's a PowerShell version:
Background: set up a pandoc --reference-doc template as described in the pandoc documentation for the --reference-doc parameter.
Open up Word and create a new custom table style in the template doc. In our example that custom table style is called 'MyCustomTable'.
Generate your Word doc using the --reference-doc parameter - the custom table style will be included in the doc, you just have to insert its name in the right place. This bit of PowerShell will do that for you:
$outFile = "C:\Path\To\Your\Doc.docx"
$workFolder = "C:\Some\Temp\Folder\Somewhere\"
# then this replaces table style in $outFile:
$zipFile = $outFile.Replace(".docx",".zip")
Rename-Item $outFile $zipFile
Expand-Archive $zipFile -DestinationPath $workFolder -Force
$wordXml = Get-Content "${workFolder}Word\Document.xml"
$updatedXml = $wordXml.Replace('<w:tblStyle w:val="Table" />','<w:tblStyle w:val="MyCustomTable" />')
Set-Content -Path "${workFolder}Word\Document.xml" -Value $updatedXml
Compress-Archive -Path "${workFolder}*" -DestinationPath $zipFile -Force
Rename-Item $zipFile $outFile
... where $outFile is the docx, and $workFolder is a temp folder somewhere.
In some earlier versions of pandoc, instead of searching for <w:tblStyle w:val="Table" /> you'll need to search for <w:tblStyle w:val="TableNormal" />.

Add a filter and customize your own table style; see this Lua filter: https://github.com/ZhouJunjun/TyporaLuaFilter

Related

snakemake Wildcards in input files cannot be determined from output files:

I use snakemake to create a pipeline to split a bam by chromosome, but I met a problem:
Wildcards in input files cannot be determined from output files:
'OutputDir'
Can someone help me figure it out?
if config['ref'] == 'hg38':
    ref_chr = []
    for i in range(1,23):
        ref_chr.append('chr'+str(i))
    ref_chr.extend(['chrX','chrY'])
elif config['ref'] == 'b37':
    ref_chr = []
    for i in range(1,23):
        ref_chr.append(str(i))
    ref_chr.extend(['X','Y'])

rule all:
    input:
        expand(f"{OutputDir}/split/{name}.{{chr}}.bam",chr=ref_chr)

rule minimap2:
    input:
        TargetFastq
    output:
        Sortbam = "{OutputDir}/{name}.sorted.bam",
        Sortbai = "{OutputDir}/{name}.sorted.bam.bai"
    resources:
        mem_mb = 40000
    threads: nt
    singularity:
        OntSoftware
    shell:
        """
        minimap2 -ax map-ont -d {ref_mmi} --MD -t {nt} {ref_fasta} {input} | samtools sort -O BAM -o {output.Sortbam}
        samtools index {output.Sortbam}
        """

rule split_bam:
    input:
        rules.minimap2.output.Sortbam
    output:
        splitBam = expand(f"{OutputDir}/split/{name}.{{chr}}.bam",chr=ref_chr),
        splitBamBai = expand(f"{OutputDir}/split/{name}.{{chr}}.bam.bai",chr=ref_chr)
    resources:
        mem_mb = 30000
    threads: nt
    singularity:
        OntSoftware
    shell:
        """
        samtools view -# {nt} -b {input} {chr} > {output.splitBam}
        samtools index -# {nt} {output.splitBam}
        """
I changed the wildcard {OutputDir}, but it does not help.
expand(f"{OutputDir}/split/{name}.{{chr}}.bam",chr=ref_chr),
splitBamBai = expand(f"{OutputDir}/split/{name}.{{chr}}.bam.bai",chr=ref_chr),
A couple of comments on these lines:
You escape chr by using double braces, {{chr}}. This means you don't want chr to be expanded, which I doubt is correct. I suspect you want something like:
expand("{{OutputDir}}/split/{{name}}.{chr}.bam",chr=ref_chr),
The rule minimap2 does not contain the {chr} wildcard, hence the error you get.
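For illustration only, here is a minimal sketch of how split_bam might be written with a per-chromosome {chr} wildcard, so that rule all drives one job per chromosome. The names come from the question; resources, threads and singularity are omitted, and whether this fits the rest of the pipeline is an assumption:
rule split_bam:
    input:
        rules.minimap2.output.Sortbam
    output:
        # after f-string evaluation, {chr} remains as a wildcard in the output paths
        splitBam = f"{OutputDir}/split/{name}.{{chr}}.bam",
        splitBamBai = f"{OutputDir}/split/{name}.{{chr}}.bam.bai"
    shell:
        """
        samtools view -b {input} {wildcards.chr} > {output.splitBam}
        samtools index {output.splitBam}
        """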
As an aside, when you create a bam file and its index in the same rule, the timestamp of the index file can end up older than the bam file itself. This can later generate spurious warnings from samtools/bcftools. See https://github.com/snakemake/snakemake/issues/1378 (not sure if it's been fixed).

Select version string from JSON array and increment it by one using jq

Bash script to find the tags in an ECR repo:
aws ecr describe-images --repository-name laplacelab-backend-repo \
  --query 'sort_by(imageDetails,& imagePushedAt)[*]' \
  --output json | jq -r '.[].imageTags'
Output:
[
  "v1",
  "sometag",
  ...
]
How can I extract the version number? Only the v<number> tag contains the version. I need to get the number and increment it to set a variable. The output of sort_by(imageDetails,& imagePushedAt)[*] is either something like this (set 1):
[
  {
    "registryId": "057296704062",
    "repositoryName": "laplacelab-backend-repo",
    "imageDigest": "sha256:c14685cf0be7bf7ab1b42f529ca13fe2e9ce00030427d8122928bf2d46063bb7",
    "imageTags": [
      "v1"
    ],
    "imageSizeInBytes": 351676915,
    "imagePushedAt": 1593514683.0
  }
]
or, if the repo has no images, sort_by(imageDetails,& imagePushedAt)[*] returns an empty array [] instead (set 2).
As a result, I'm trying to get a variable VERSION with the next version for an update, or 1 if the repo is empty.
You could use the select() function on the imageTags array to get only the tag starting with v and increment it.
jq '( .[].imageTags[] | select(startswith("v")) | ltrimstr("v") | tonumber | .+1 ) // 1'
For other cases, like the tags array being empty or containing null strings (error case), the value defaults to 1.
To store it into a variable, e.g. version (avoid using uppercase variable names in user scripts), use command substitution. See How do I set a variable to the output of a command in Bash?
version=$( <your-pipeline> )
Note: this does not work well with version strings following the Semantic Versioning spec, e.g. v1.2.1, as jq does not have a library to parse them.

Default Representation/Drawing method in VMD

In VMD I want to load every new file with the drawing method CPK. This doesn't seem to be an option in the .vmdrc file for some technical reasons.
How can I do this from the VMD command line (so that I can make a script)?
Or is there some other solution/workaround/hack to make this work?
There are several ways to achieve what you want:
(1) put the following line in the correct location of your .vmdrc
mol default style CPK
(2) use the VMD Preferences Panel (last item in the Extensions menu of the main window) to generate a .vmdrc file that meets your expectations. The setting you're looking for is in the Representations tab.
(3) for more advanced settings (i.e. default settings applied to molecules already loaded when vmd reads the startup .vmdrc file), you can use the following (works for me on VMD 1.9.2):
proc reset_viz {molid} {
  # operate only on existing molecules
  if {[lsearch [molinfo list] $molid] >= 0} {
    # delete all representations
    set numrep [molinfo $molid get numreps]
    for {set i 0} {$i < $numrep} {incr i} {
      mol delrep $i $molid
    }
    # add new representations
    mol representation CPK
    # add other representation stuff you want here
    mol addrep $molid
  }
}

proc reset_viz_proxy {args} {
  foreach {fname molid rw} $args {}
  eval "after idle {reset_viz $molid}"
}

## put a trace on vmd_initialize_structure
trace variable vmd_initialize_structure w reset_viz_proxy

after idle {
  if { 1 } {
    foreach molid [molinfo list] {
      reset_viz $molid
    }
  }
}
This piece of code is adapted from Axel Kohlmeyer's website.
HTH,
I found a convenient solution.
In .bashrc add:
vmda () {
  echo -e "
mol default style CPK
user add key Control-w quit
" > /tmp/vmdstartup
  echo "mol new $1" > /tmp/vmdcommand
  vmd -e /tmp/vmdcommand -startup /tmp/vmdstartup
}
Look at a structure with
vmda file.pdb
and close the window (quit the application) with Ctrl+w, like other windows.

os.walk ignore directories and their content

I'm trying to ignore some directories and the files in them under a specific path, and this is my code:
x = open(wbCMD, 'a')
x.write('set path="C:\Program Files\WinRAR\";%path% c:/Program Files/WinRAR/\n')
x.write('Rar.exe a -r "Backup.rar" -m5 -ep1')
chkdict = {}
setdef = chkdict.setdefault
for root, dirs, files in os.walk(foldername):
    if ignoreddirs in dirs:
        continue
    for file in files:
        ext = path.splitext(file)[1]
        if ext in ignored:
            continue
        if not ext in chkdict:
            print("%s" % setdef(ext,ext))
            x.write(" *%s" % setdef(ext,ext))
x.write(" *makefile *Depend *readme\npause")
x.close
del chkdict
The ignoreddirs list looks like this:
ignoreddirs = ["bin"]
dirs and ignoreddirs are both lists of strings. Therefore, dirs does not contain ignoreddirs. It may, however, contain some of its elements. One way to check this is to look at their intersection:
if len(set(ignoreddirs).intersection(set(dirs))) > 0:
    continue
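Note that this continue skips the files of the directory that contains an ignored subdirectory, while os.walk will still descend into the ignored directory itself. If the goal is to prune the ignored directories and their contents (an assumption about the intent, though the title suggests it), the usual os.walk idiom is to modify dirs in place, roughly like this sketch:
import os

ignoreddirs = ["bin"]
foldername = "."  # root folder to scan, as in the question

for root, dirs, files in os.walk(foldername):
    # prune ignored directories in place so os.walk does not descend into them
    dirs[:] = [d for d in dirs if d not in ignoreddirs]
    for filename in files:
        print(os.path.join(root, filename))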

How can I merge PDF files (or PS if not possible) such that every file will begin on an odd page?

I am working on a UNIX system and I'd like to merge thousands of PDF files into one file in order to print it. I don't know in advance how many pages they have.
I'd like to print it double-sided, such that two files never share the same sheet of paper.
Therefore I'd like the merged file to be aligned such that every file begins on an odd page, with a blank page added whenever the next page to be written would be an even one.
Here's the solution I use (it's based on Dingo's basic principle, but uses an easier approach for the PDF manipulation):
Create a PDF file with a single blank page
First, create a PDF file with a single blank page somewhere (in my case, it is located at /path/to/blank.pdf). This command should work (from this thread):
touch blank.ps && ps2pdf blank.ps blank.pdf
Run Bash script
Then, from the directory that contains all my PDF files, I run a little script that appends the blank.pdf file to each PDF file with an odd page number:
#!/bin/bash
for f in *.pdf; do
  let npages=$(pdfinfo "$f"|grep 'Pages:'|awk '{print $2}')
  let modulo="($npages %2)"
  if [ $modulo -eq 1 ]; then
    pdftk "$f" "/path/to/blank.pdf" output "aligned_$f"
    # or
    # pdfunite "$f" "/path/to/blank.pdf" "aligned_$f"
  else
    cp "$f" "aligned_$f"
  fi
done
Combine the results
Now, all aligned_-prefixed files have even page numbers, and I can join them using
pdftk aligned_*.pdf output result.pdf
# or
pdfunite aligned_*.pdf result.pdf
Tool info:
ps2pdf is in the ghostscript package in most Linux distros
pdfinfo, pdfunite are from the Poppler PDF rendering library (usually the package name is poppler-utils or poppler_utils)
pdftk is usually its own package, the pdftk package
Your problem can be solved more easily if you look at it from another point of view.
To ensure that, in printing, page 1 of the second pdf file is not attached to the last page of the first pdf file on the same sheet of paper, and, more generally, that the first page of each subsequent pdf file is not printed on the back of the same sheet as the last page of the preceding pdf file,
you need to selectively add one blank page, but only to pdf files having an odd number of pages.
I wrote a simple script named addblankifneeded that you can put in a file and then copy to /usr/bin or /usr/local/bin,
and then invoke in the folder where you have your pdfs with this syntax:
for f in *.pdf; do addblankifneeded $f; done
This script adds a blank page at the end of pdf files having an odd number of pages, skips pdf files that already have an even number of pages, and then joins all the pdfs into one.
Requirements: pdftk, pdfinfo
NOTE: depending on your bash environment, you may need to replace the sh interpreter with bash in the first line of the script.
#!/bin/sh
#script to automatically add a blank page at the end of pdf documents if their page count is not a multiple of 2, and then to join all pdfs into one
#
# made by Dingo
#
# dokupuppylinux.co.cc
#
#http://pastebin.com/u/dingodog (my pastebin toolbox for pdf scripts)
#
filename=$1
altxlarg="`pdfinfo -box $filename| grep MediaBox | cut -d : -f2 | awk '{print $3 FS $4}'`"
echo "%PDF-1.4
%µí®û
3 0 obj
<<
/Length 0
>>
stream
endstream
endobj
4 0 obj
<<
/ProcSet [/PDF ]
/ExtGState <<
/GS1 1 0 R
>>
>>
endobj
5 0 obj
<<
/Type /Halftone
/HalftoneType 1
/HalftoneName (Default)
/Frequency 60
/Angle 45
/SpotFunction /Round
>>
endobj
1 0 obj
<<
/Type /ExtGState
/SA false
/OP false
/HT /Default
>>
endobj
2 0 obj
<<
/Type /Page
/Parent 7 0 R
/Resources 4 0 R
/Contents 3 0 R
>>
endobj
7 0 obj
<<
/Type /Pages
/Kids [2 0 R ]
/Count 1
/MediaBox [0 0 595 841]
>>
endobj
6 0 obj
<<
/Type /Catalog
/Pages 7 0 R
>>
endobj
8 0 obj
<<
/CreationDate (D:20110915222508)
/Producer (libgnomeprint Ver: 2.12.1)
>>
endobj
xref
0 9
0000000000 65535 f
0000000278 00000 n
0000000357 00000 n
0000000017 00000 n
0000000072 00000 n
0000000146 00000 n
0000000535 00000 n
0000000445 00000 n
0000000590 00000 n
trailer
<<
/Size 9
/Root 6 0 R
/Info 8 0 R
>>
startxref
688
%%EOF" | sed -e "s/595 841/$altxlarg/g">blank.pdf
pdftk blank.pdf output fixed.pdf
mv fixed.pdf blank.pdf
pages="`pdftk $filename dump_data | grep NumberOfPages | cut -d : -f2`"
if [ $(( $pages % 2 )) -eq 0 ]
then echo "$filename has already a multiple of 2 pages ($pages ). Script will be skipped for this file" >>report.txt
else
pdftk A=$filename B=blank.pdf cat A B output blankadded.pdf
mv blankadded.pdf $filename
pdffiles=`ls *.pdf | grep -v -e blank.pdf -e joinedtogether.pdf| xargs -n 1`; pdftk $pdffiles cat output joinedtogether.pdf
fi
exit 0
You can use PDFsam:
gratis
runs on Microsoft Windows, Mac OS X and Linux
portable version available (at least on Windows)
can add a blank page after each merged document if the document has an odd number of pages
Disclaimer: I'm the author of the tools I'm mentioning here.
sejda-console
It's a free and open source command line interface for performing pdf manipulations such as merge or split. The merge command has an option stating:
[--addBlanks] : add a blank page after each merged document if the number of pages is odd (optional)
Since you just need to print the pdf, I'm assuming you don't care about the order in which your documents are merged. This is the command you can use:
sejda-console merge -d /path/to/pdfs_to_merge -o /outputpath/merged_file.pdf --addBlanks
It can be downloaded from the official website sejda.org.
sejda.com
This is a web application backed by Sejda and has the same functionalities mentioned above but through a web interface. You are required to upload your files so, depending on the size of your input set, it might not be the right solution for you.
If you select the merge command and upload your pdf documents you will have to flag the checkbox Add blank page if odd page number to get the desired behaviour.
Here is a PowerShell version of the most popular solution using pdftk. I did this for Windows, but you can use PowerShell Core for other platforms.
# install pdftk server if on windows
# https://www.pdflabs.com/tools/pdftk-server/

$blank_pdf_path = ".\blank.pdf"
$input_folder = ".\input\"
$aligned_folder = ".\aligned\"
$final_output_path = ".\result.pdf"

foreach($file in (Get-ChildItem $input_folder -Filter *.pdf))
{
    # easy but might break if pdfinfo output changes
    # takes 7th line with the "Page: 2" and matches only numbers
    (pdfinfo $file.FullName)[7] -match "(\d+)" | Out-Null
    $npages = $Matches[1]
    $modulo = $npages % 2

    if($modulo -eq 1)
    {
        $output_path = Join-Path $aligned_folder $file.Name
        pdftk $file.FullName $blank_pdf_path output $output_path
    }
    else
    {
        Copy-Item $file.FullName -Destination $aligned_folder
    }
}

$aligned_pdfs = Join-Path $aligned_folder "*.pdf"
pdftk $aligned_pdfs output $final_output_path
Preparation
Install Python and make sure you have the pyPDF package.
Create a PDF file with a single blank page at /path/to/blank.pdf (I've created blank pdf pages here).
Save this as pdfmerge.py in any directory of your $PATH. (I'm not a Windows user. This is straightforward under Linux. Please let me know if you get errors / if it works.)
Make pdfmerge.py executable.
Every time you need it
Run pdfmerge.py in a directory that contains only the PDF files you want to merge.
pdfmerge.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from argparse import ArgumentParser
from glob import glob
from pyPdf import PdfFileReader, PdfFileWriter

def merge(path, blank_filename, output_filename):
    blank = PdfFileReader(file(blank_filename, "rb"))
    output = PdfFileWriter()
    for pdffile in glob('*.pdf'):
        if pdffile == output_filename:
            continue
        print("Parse '%s'" % pdffile)
        document = PdfFileReader(open(pdffile, 'rb'))
        for i in range(document.getNumPages()):
            output.addPage(document.getPage(i))
        if document.getNumPages() % 2 == 1:
            output.addPage(blank.getPage(0))
            print("Add blank page to '%s' (had %i pages)" % (pdffile, document.getNumPages()))
    print("Start writing '%s'" % output_filename)
    output_stream = file(output_filename, "wb")
    output.write(output_stream)
    output_stream.close()

if __name__ == "__main__":
    parser = ArgumentParser()
    # Add more options if you like
    parser.add_argument("-o", "--output", dest="output_filename", default="merged.pdf",
                        help="write merged PDF to FILE", metavar="FILE")
    parser.add_argument("-b", "--blank", dest="blank_filename", default="blank.pdf",
                        help="path to blank PDF file", metavar="FILE")
    parser.add_argument("-p", "--path", dest="path", default=".",
                        help="path of source PDF files")
    args = parser.parse_args()
    merge(args.path, args.blank_filename, args.output_filename)
Testing
Please leave a comment if this works on Windows and Mac.
Please always leave a comment if it doesn't work / if it could be improved.
It works on Linux. Joining 3 PDFs into a single 200-page PDF took less than a second.
Martin had a good start. I updated it to PyPDF2 and made a few tweaks, like sorting the output by filename.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from argparse import ArgumentParser
from glob import glob
from PyPDF2 import PdfFileReader, PdfFileWriter
import os.path

def merge(pdfpath, blank_filename, output_filename):
    with open(blank_filename, "rb") as f:
        blank = PdfFileReader(f)
        output = PdfFileWriter()
        filelist = sorted(glob(os.path.join(pdfpath,'*.pdf')))
        for pdffile in filelist:
            if pdffile == output_filename:
                continue
            print("Parse '%s'" % pdffile)
            document = PdfFileReader(open(pdffile, 'rb'))
            for i in range(document.getNumPages()):
                output.addPage(document.getPage(i))
            if document.getNumPages() % 2 == 1:
                output.addPage(blank.getPage(0))
                print("Add blank page to '%s' (had %i pages)" % (pdffile, document.getNumPages()))
        print("Start writing '%s'" % output_filename)
        with open(output_filename, "wb") as output_stream:
            output.write(output_stream)

if __name__ == "__main__":
    parser = ArgumentParser()
    # Add more options if you like
    parser.add_argument("-o", "--output", dest="output_filename", default="merged.pdf",
                        help="write merged PDF to FILE", metavar="FILE")
    parser.add_argument("-b", "--blank", dest="blank_filename", default="blank.pdf",
                        help="path to blank PDF file", metavar="FILE")
    parser.add_argument("-p", "--path", dest="path", default=".",
                        help="path of source PDF files")
    args = parser.parse_args()
    merge(args.path, args.blank_filename, args.output_filename)
The code by Chris Lercher in https://stackoverflow.com/a/12761103/1369181 did not quite work for me. I do not know whether that is because I am working on Cygwin/mintty. Also, I have to use qpdf instead of pdftk. Here is the code that worked for me:
#!/bin/bash
for f in *.pdf; do
  npages=$(pdfinfo "$f"|grep 'Pages:'|sed 's/[^0-9]*//g')
  modulo=$(($npages %2))
  if [ $modulo -eq 1 ]; then
    qpdf --empty --pages "$f" "path/to/blank.pdf" -- "aligned_$f"
  else
    cp "$f" "aligned_$f"
  fi
done
Now, all "aligned_" files have even page numbers, and I can join them using qpdf (thanks to https://stackoverflow.com/a/51080927):
qpdf --verbose --empty --pages aligned_* -- all.pdf
And here is the useful code from https://unix.stackexchange.com/a/272878 that I used for creating the blank page:
echo "" | ps2pdf -sPAPERSIZE=a4 - blank.pdf
This one worked for me. I used pdfcpu on macOS.
Can be installed this way:
brew install pdfcpu
And I have slightly adjusted the code from https://stackoverflow.com/a/12761103/1369181:
#!/bin/bash
mkdir aligned
for f in *.pdf; do
  let npages=$(pdfcpu info "$f"|grep 'Page count:'|awk '{print $3}')
  let modulo="($npages %2)"
  if [ $modulo -eq 1 ]; then
    pdfcpu page insert -pages l -mode after "$f" "aligned/$f"
  else
    cp "$f" "aligned/$f"
  fi
done
pdfcpu merge merged-aligned.pdf aligned/*.pdf
rm -rf aligned
NB: this creates and removes an "aligned" directory in the current directory, so feel free to improve it to make it safer to use.
