Manually adding special characters to a PDF - R

I have a PDF (of a simple line graph) which was generated using R and is stored with plain-text, uncompressed streams. The data used to generate the graph has been lost, and I would like to modify the axis labels by editing the PDF.
I found this line:
/F2 1 Tf 12.00 0.00 -0.00 12.00 238.73 18.72 Tm (Error Rate) Tj
which appears to control the axis label I want (it currently says "Error Rate"). Changing it to:
/F2 1 Tf 12.00 0.00 -0.00 12.00 238.73 18.72 Tm (Different Label) Tj
does indeed change the axis label to "Different Label".
Now, I want the new label to be "Mu", as in the Greek letter mu. I know this is possible, since I can generate PDFs in R with Greek letters in their axis labels.
My first thought was to manually enter the UTF-8 character for mu using the vim digraph Ctrl+K m*, and also through the character map and the like, to give:
/F2 1 Tf 12.00 0.00 -0.00 12.00 238.73 18.72 Tm ( μ ) Tj
If I attempt to write the file after doing this, I get the error message "CONVERSION ERROR in line xyz", where xyz is the modified line. Opening the saved PDF shows a '?' for the axis label.
How does the PDF encode mu? How can I modify the label accordingly?

PDF files are binary files; modifying them as text files will corrupt them in most cases. In order to keep the file valid you need to update the xref table (see the PDF specification for more details): the table stores the byte offset of every object, so if your edit changes the byte count of the file, the entries after the edit point no longer match and the file becomes invalid. Another option is to remove the xref table altogether and pass the resulting file through another tool that can "guess" it for you. I have done this with Ghostscript in the past with good results.
About the issue with fonts: which font does /F2 correspond to? Is it partially embedded (subsetted) in the PDF file?
If it is, the file probably does not contain the glyph information required to add the character μ.
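For the "let another tool guess it" route, something along the following lines can be run from R; this is only a minimal sketch, assuming Ghostscript is on the PATH and using placeholder file names (gs rewrites the whole document and will usually rebuild a consistent xref table while doing so):
# rewrite the hand-edited PDF through Ghostscript's pdfwrite device,
# which regenerates the xref table and object offsets from scratch
system2("gs", c("-o", "fixed.pdf", "-sDEVICE=pdfwrite", "edited.pdf"))
The same command works directly in a terminal if you prefer not to go through R.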

I tried this:
pdf("testmu.pdf",compress=FALSE)
plot(1:10,1:10,xlab="abc",ylab=expression("LABEL "*mu))
dev.off()
and found the following chunk in the resulting file:
BT
/F2 1 Tf 0.00 12.00 -12.00 0.00 10.28 235.40 Tm (LABEL ) Tj
ET
BT
/F6 1 Tf 0.00 12.00 -12.00 0.00 10.28 276.09 Tm (m) Tj
ET
So I suspect that if you use
/F6 1 Tf 12.00 0.00 -0.00 12.00 238.73 18.72 Tm (m) Tj
in your example above it should work (in the Symbol font the character "m" is the glyph μ). I don't know whether R always defines F6 (the Symbol font), so you might also need to hack in a font object along these lines:
13 0 obj
<< /Type /Font /Subtype /Type1 /Name /F6 /BaseFont /Symbol >>
endobj
edit: as pointed out in the other answer, and the comment below it, it seems you also need to manually update the xref table: search for xref, find a chunk like
xref
0 13
and increment the second value (the object count) so that it covers the newly added object ...
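For what it's worth, a strict PDF reader will also expect an offset entry for the new object and a matching /Size in the trailer, so after adding object 13 the header would, I believe, end up as:
xref
0 14
In practice it may be easier to skip this bookkeeping entirely and let a tool such as Ghostscript or qpdf (see the other answers) rebuild the table for you.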

As @yms pointed out, PDFs are typically not editable in a text editor, as they most likely contain binary data and have an xref table that needs to be updated if characters are inserted or deleted. If you must edit a PDF by hand, use a tool like qpdf to do the heavy lifting.
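A minimal sketch of that workflow, run from R and assuming qpdf and its companion tool fix-qdf are installed (file names are placeholders): --qdf rewrites the PDF into an uncompressed, text-editable form, and fix-qdf repairs the stream lengths and the xref table after you have made your edits.
# convert the PDF into editable "QDF" form (uncompressed, normalized layout)
system2("qpdf", c("--qdf", "--object-streams=disable", "graph.pdf", "editable.pdf"))
# ... change the label text in editable.pdf with a text editor ...
# recompute stream lengths and the cross-reference table
system2("fix-qdf", "editable.pdf", stdout = "fixed.pdf")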

Related

How Can I Download and Use a Matrix from Matrix Market?

I am trying to write code to store a matrix to a variable directly from Matrix Market's website. Below is a sample URL that I'd use:
https://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/bcsstruc1/bcsstk01.mtx.gz
The example URL will download a bcsstk01.mtx.gz file. I need to extract the bcsstk01.mtx file and then use MatrixMarket.mmread() so I can save it to a variable.
I first tried saving the downloaded file (or URL location) to a variable with A = HTTP.get(), but a lack of online resources and knowledge led to no results. Then I used HTTP.download() and got the .mtx.gz file, but I can't unzip it. And finally, MatrixMarket.mmread() cannot read .gz files, so I'm stuck with a downloaded file I can't do anything with unless I manually unzip it.
Using the info from the link in the comments and some fiddling, I managed to get the following:
using TranscodingStreams, CodecZlib
using Downloads
stream = PipeBuffer()
openstream = TranscodingStream(GzipDecompressor(), stream)
Downloads.download("https://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/bcsstruc1/bcsstk01.mtx.gz", stream)
for line in eachline(openstream)
println(line)
end
This prints:
%%MatrixMarket matrix coordinate real symmetric
48 48 224
1 1 2.8322685185200e+06
5 1 1.0000000000000e+06
6 1 2.0833333333300e+06
7 1 -3.3333333333300e+03
...
which I suppose is the desired data.

Failure of unz() to unzip from a zip file offset of more than 2^31 bytes

I have been obtaining .zip archives of genome annotation from NCBI (mainly gff files). In order to save disk space I prefer not to unzip the archive, but to read these files directly into R using unz(). However, it seems that unz() is unable to extract files from the end of 'large' zip files:
ncbi.zip <- "file_location/name.zip"
files <- unzip(ncbi.zip, list=TRUE)
gff.files <- files$Name[ grep("gff$", files$Name) ]
## this works
gff.128 <- readLines( unz(ncbi.zip, gff.files[128]) )
## this gives an empty data structure (read.table() stops
## with an error saying no lines or similar
gff.129 <- readLines( unz(ncbi.zip, gff.files[129]) )
## there are 31 more gff files after the 129th one.
## no lines are read from any of these.
The zip file itself seems to be fine; I can unzip the specific files using unzip on the command line, and unzip -t does not report any errors.
I've tried this with R versions 3.5 (openSUSE Leap 15.1), 3.6, and 4.2 (CentOS 7) and with more than one zip file, and get exactly the same result.
I attached strace to R whilst reading in the 128th and 129th files. In both cases I get a lot of lseek calls towards the end of the file (offset 2845892608, larger than 2^31) to start with; this is where I assume the zip central directory can be found. For the 128th file (the one that can be read), I eventually get an lseek to an offset slightly below 2^31, followed by a set of lseeks and reads (which extend beyond 2^31).
For the 129th file, I get the same reads towards the end of the file, but then, rather than finding a position within the file, I get:
lseek(3, 2845933568, SEEK_SET) = 2845933568
lseek(3, 4294963200, SEEK_SET) = 4294963200
read(3, "", 4096) = 0
lseek(3, 4095, SEEK_CUR) = 4294967295
read(3, "", 4096) = 0
Which is a bit weird since the file itself is only about 2.8 GB; 4294967295 is, of course, 2^32 - 1.
To me this feels like an integer overflow bug, and I am considering posting a bug report. But I am wondering if anyone has seen something similar before, or if I am doing something stupid.
Having done what I should have started with (reading the zip64 format specification), it's actually clear that this is not an integer overflow error.
Zip files contain a central directory at the end of the archive; this contains amongst other things the names of the compressed files and the offset of the compressed data in the zip archive. The offset (and file size fields) are only given 4 bytes each in the standard directory field; when the offset is larger than this it should instead be given in the extra fields section and the value in the standard field should be set to 0xFFFFFFFF. Since this is the offset that gets used when reading the file it seems clear that the problem lies in the parsing of the extra field.
I had a look at the source code for R 4.2.1 and it seems that the problem is due to the way the offset specified in the standard offset field is tested:
if(file_info.uncompressed_size == (ZPOS64_T)(unsigned long)-1)
On a 64-bit Linux build, unsigned long is 64 bits wide, so (ZPOS64_T)(unsigned long)-1 is 2^64 - 1 and can never equal the 32-bit 0xFFFFFFFF sentinel read from the 4-byte directory field; changing the comparison to == 0xFFFFFFFF seems to fix the problem.
I've submitted a bug report to R. Hopefully changing the check will not have any unintended consequences and the issue will be fixed.
Still, I'm curious as to whether anyone else has come across the same issue. Seems a bit unlikely that my experience is unique.
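In the meantime, a workaround that avoids R's internal zip code entirely is to stream the entry through the command-line unzip (which, as noted above, handles the archive without problems); a minimal sketch:
## "unzip -p" writes a single archive member to stdout, so R never has to
## seek inside the zip file itself
gff.129 <- readLines(pipe(paste("unzip -p", shQuote(ncbi.zip), shQuote(gff.files[129]))))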

Read over 13,000 txt files into RStudio

I am trying to load txt files into R. The files are EEG recordings for emotional words: every text file is the recording of one word for one study participant, and for every participant 360 words were recorded.
Every text file contains the complete time frame of the EEG recording (from 0 to 2000 ms) across the columns and the electrodes 1 to 58 down the rows.
I have an R script worked out with a statistics expert, who is currently unavailable for me to ask again. The script used to work: it read the data and produced an output.
Now RStudio runs the script up to a certain point, but the environment ends up with every table showing NA for each data point (data it was able to read before).
Much later R gives an error as output, but I guess this mainly follows from the data not being read properly.
I seem to be blind to any mistake in the script, in which I did not change anything after it worked for the first time.
Also, installing R on another computer for the first time and running the script results in the same output: the environment shows NA for all data.
I have looked into several tips on how to read txt files into R, but they do not cover complex data like this.
The code employed, to summarize the data into means per electrode and time point, is as follows:
# x: one file's data (electrodes in rows, time samples in columns)
# tf1/tf2: column indices of two time windows
# returns max/mean/min per electrode for each window
stats <- function(x, tf1=NA, tf2=NA){
  natf1 <- sum(is.na(tf1)) > 0
  natf2 <- sum(is.na(tf2)) > 0
  if(natf1 & natf2){ tf1 <- 1:dim(x)[2] }
  if(!natf1){
    mx  <- as.numeric(lapply(as.data.frame(t(x[,tf1])), max))
    mn  <- as.numeric(lapply(as.data.frame(t(x[,tf1])), mean))
    min <- as.numeric(lapply(as.data.frame(t(x[,tf1])), min))
  }
  if(!natf2){
    mx2  <- as.numeric(lapply(as.data.frame(t(x[,tf2])), max))
    mn2  <- as.numeric(lapply(as.data.frame(t(x[,tf2])), mean))
    min2 <- as.numeric(lapply(as.data.frame(t(x[,tf2])), min))
  }
  if(natf2){
    return(cbind(electrode=1:58, max1=mx, mean1=mn, min1=min,
                 max2=NA, mean2=NA, min2=NA))
  }else{
    return(cbind(electrode=1:58,
                 max1=mx, mean1=mn, min1=min,
                 max2=mx2, mean2=mn2, min2=min2))
  }
}
I hope somebody can help me and improve my limited knowledge.
Cheers, Emily
P.S.: the code to read the txt files should be the following:
filenames <- dir()          # read the file names
nfiles <- length(filenames) # number of files
store <- matrix(rep(NA, nfiles*9*58), nfiles*58, 9)
ti <- 2010/503
t250 <- floor(250/ti)
t350 <- ceiling(350/ti)
t350p1 <- t350 + 1
t450 <- ceiling(450/ti)
for(i in 1:nfiles){
  data <- read.table(filenames[i], sep=" ", dec=".")
  vp   <- as.numeric(substr(filenames[i], 3, 4))
  word <- as.numeric(substr(filenames[i], 15, 17))
  print(i)
  store[(1+(i-1)*58):(58+(i-1)*58), 1]   <- rep(vp, 58)
  store[(1+(i-1)*58):(58+(i-1)*58), 2]   <- word
  store[(1+(i-1)*58):(58+(i-1)*58), 3:9] <- stats(data, t250:t350, t350p1:t450)
}
The data looks as follows (it is the output of an EEG recording, listed by time (x) and electrode (y)):
-24.0726 -25.4886 -19.3321 -12.9210 -5.1501 3.1598 7.3684 4.7018 -2.2902 -7.5973 -8.6344 -7.8640 -7.4511 -6.1870 -2.6582 0.8325 1.3330 -0.3912 -1.8508 -3.5361 -5.7567 -6.1500 -5.9328 -6.0740 -5.1535 -3.7834 -0.3229 3.5887 2.1871 -3.7773 -7.5377 -7.9027 -10.2698 -11.9537 -8.7184 -6.0458 -9.0905 -14.2111 -17.0484 -18.7480 -17.6947 -12.7817 -9.0529 -7.9332 -8.9464 -11.4776 -13.9951 -11.9900 -3.6849 -1.1153 -5.2907 -4.8818 -2.8731 -5.9760 -7.7751 -5.4999 -7.4731 -9.3200 ...
Error message received:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 5 did not have 2 elements
but I am not sure how to fix this, or where / in which file to check for said error.
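One hedged guess, since that message comes from read.table(): with sep=" ", every single space counts as a delimiter, so runs of spaces or tabs in the exported files produce rows with the "wrong" number of elements. A sketch of a more forgiving call, assuming the files are simply whitespace-delimited:
## sep = "" (the default) treats any run of whitespace as one separator
data <- read.table(filenames[i], header = FALSE, sep = "", dec = ".")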

R: read.table problems

I might just be crazy, but I really can't see where I am going wrong in my simple R program. I am trying to read a table from a file, but every time I try it comes back with this error:
./tmp.r: line 1: syntax error near unexpected token `('
./tmp.r: line 1: `tmp <- read.table("/home/Data/run1.DOC.sample_summary",header=FALSE)'
The file I am trying to read from looks something like this:
Aim A_%_above_20 A_%_above_30 A_%_above_40
28 0.0 0.0 0.0
99 50 100.0 82.9
34 62.1 0.0 0.0
Here is my code:
tmp <- read.table("/home/Data/run1.DOC.sample_summary",header=FALSE)
names(tmp)
max_num <- max(tmp)
hist(tmp$'*_%_above_30',col=heat.colors(max_num), main='Percent in Test', xlab='Percent Covered')
Does anyone see what I am doing wrong here? I am just not seeing it.
Thanks
Does tmp$'*_%_above_30' really work in your last line?
Also, try commenting out different parts of your code to see which one is making it crash.
Finally, maybe it's just a bad encoding of some characters in your code. Try rewriting it from scratch.
And how do you launch your script?
Is this because you used the wrong path name? E.g., try changing "home..." to "/home...".
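The quoted error is a bash error rather than an R error, which suggests the shell is executing tmp.r directly instead of handing it to R. A sketch of the usual remedy, assuming that is the cause: either run the script with Rscript tmp.r, or make its first line a shebang so that ./tmp.r works:
#!/usr/bin/env Rscript
tmp <- read.table("/home/Data/run1.DOC.sample_summary", header = FALSE)
Once it runs under R, note that the file shown above appears to have a header row, so header = TRUE may be what you actually want.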

How to extract TextGrid intervals using a Praat script?

Using the Praat executable, I can write a TextGrid interval to a text file by clicking the To TextGrid (vuv) button in the right-hand panel of the Praat objects window. I'm using To TextGrid (vuv)... 0.02 0.01 in my Praat script, but I get the error "Command "To TextGrid (vuv)..." not available for current selection".
Am I missing something?
Is it possible to do this from a Praat script at all?
This may help. To TextGrid (vuv)... is a command on a PointProcess object, which is why the script first derives one from each Sound:
directory$ = "./"
list = Create Strings as file list... list 'directory$'*.wav
numberOfFiles = Get number of strings
if !numberOfFiles
    exit There are no sound files in the folder!
endif
for current_file from 1 to numberOfFiles
    select list
    fileName$ = Get string... current_file
    name$ = fileName$ - ".wav"
    sound = Read from file... 'directory$''fileName$'
    # min and max pitch
    pulses = To PointProcess (periodic, cc)... 30 400
    vuv = To TextGrid (vuv)... 0.02 0.01
    Save as text file... 'directory$''name$'.TextGrid
    plus pulses
    plus sound
    Remove
endfor
In addition, you can join the praat-users group on Yahoo, where Paul, one of the authors of Praat, is very active answering questions. You can learn more about Praat scripting in the Praat manual here:
http://www.fon.hum.uva.nl/praat/manual/Scripting.html
