R script file encoding (R Studio) - r

Which file encoding do I have to use to be able to save this vector (Matching complex URLs within text blocks (R)) correctly in a R script? The special characters and Chinese signs seem to make things somehow complicated.
x <- c("http://foo.com/blah_blah",
"http://foo.com/blah_blah/",
"(Something like http://foo.com/blah_blah)",
"http://foo.com/blah_blah_(wikipedia)",
"http://foo.com/more_(than)_one_(parens)",
"(Something like http://foo.com/blah_blah_(wikipedia))",
"http://foo.com/blah_(wikipedia)#cite-1",
"http://foo.com/blah_(wikipedia)_blah#cite-1",
"http://foo.com/unicode_(✪)_in_parens",
"http://foo.com/(something)?after=parens",
"http://foo.com/blah_blah.",
"http://foo.com/blah_blah/.",
"<http://foo.com/blah_blah>",
"<http://foo.com/blah_blah/>",
"http://foo.com/blah_blah,",
"http://www.extinguishedscholar.com/wpglob/?p=364.",
"http://✪df.ws/1234",
"rdar://1234",
"rdar:/1234",
"x-yojimbo-item://6303E4C1-6A6E-45A6-AB9D-3A908F59AE0E",
"message://%3c330e7f840905021726r6a4ba78dkf1fd71420c1bf6ff#mail.gmail.com%3e",
"http://➡.ws/䨹",
"www.c.ws/䨹",
"<tag>http://example.com</tag>",
"Just a www.example.com link.",
"http://example.com/something?with,commas,in,url, but not at end",
"What about <mailto:gruber#daringfireball.net?subject=TEST> (including brokets).",
"mailto:name#example.com",
"bit.ly/foo",
"“is.gd/foo/”",
"WWW.EXAMPLE.COM",
"http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))/Web_ENG/View_DetailPhoto.aspx?PicId=752",
"http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))",
"http://lcweb2.loc.gov/cgi-bin/query/h?pp/horyd:#field(NUMBER+#band(thc+5a46634))")
I appreciate any help.

Running your example,
source('file.R', encoding="unknown")
works fine and saving as R object and reloading works as well:
save(x, file='kk.Rd')
load('kk.Rd')
You can get all different encodings with iconvlist() and test them all, for example:
vals <- lapply(iconvlist(), function(x)
tryCatch(source('file.R', encoding=x),
error=function(e)return(NULL)))
with file.R being your script, and then
iconvlist()[which(!sapply(vals, function(x)is.null(x)))]
gives you all encodings where no error was thrown while loading.
Does this help?

Related

How do I tell Vim to use any/all dictionary files that fit a filepath wildcard glob?

I am trying to set the dictionary option (to allow autocompletion of certain words of my choosing) using wildcards in a filename glob, as follows:
:set dict+=$VIM/dict/dict*.lst
The hope is that, with this line in the initially sourced .vimrc (or, in my case of Windows 10, _vimrc), I can add different dictionary files to my $VIM/dict directory later, and each new invocation of Vim will use those dictionary files, without me needing to modify my .vimrc settings.
However, an error message says that there is no such file. When I give a specific filename (as in :set dict+=$VIM/dict/dict01.lst ), then it works.
The thing is, I could swear that this used to work. I had this setting in my .vimrc files since I started using Vim 7.1, and I don't recall any such error message until recently. Now it shows up on my Linux laptop as well as my Windows 7 and Windows 10 laptops. I can't remember exactly when this started happening.
Yes, I tried using backslashes (as in :set dict+=$VIM\dict\dict*.lst ) in case it was a Windows compatibility issue, but that still doesn't work. (Also this is happening on my Linux laptop, too, and that doesn't use backslashes for filepaths.)
Am I going senile? Or is there some other mysterious force going on?
Assuming for now that it is a change in the latest version of Vim, is there some way to specify "use all the dictionary files that fit this glob"?
-- Edited 2021-02-14 06:17:07
I also checked to see if it was due to having more than one file that fits the wildcard glob. (I thought that if I had more than one file that fit the wildcard, the glob would turn into two filenames, equivalent to saying dict+=$VIM/dict/dict01.lst dict02.lst which would not be syntactically valid.) But it still did not working after removing extra files so that only one file fit my pathname of $VIM/dict/dict*.lst . (I had previously put another Addendum here happily explaining that this was how I solved my problem, but it turned out to be premature.)
You must expand wildcards before setting an option. Multiple file names must be separated by commas. For example,
let &dictionary = tr(expand("$VIM/dict/dict*.lst"), "\n", ",")
If adding a value to a non-empty option, don't forget to add comma too (let is more universal than set, so it's less forgiving):
let &dictionary .= "," . tr(expand(...)...)

Is there a way to check where R is 'stuck' within a for loop? (R)

I am using system() to run several files iteratively through a program via CMD. It deposits each outputs into a sub-directory designated for specifically and only that input file. So # of inputs is exactly equal to the number of output directories/outputs.
My code works for the first iteration, but I can see in the console that it won't move on to the second file after completing the first. The stop sign remains active so I know R is still 'running', but since the for loop environment is unique I can't really tell what it's stuck on. It just stays like this for hours. Therefore I'm not sure how to begin to diagnose the issue I'm having. Is there a way of tracing what happened after cancelling the code, for example?
If your curious, the code looks like this btw. I don't know how to make it reproducible, so I just commented each line:
for (i in 1:length(flist)) {
##flist is a vector of character strings. Each
row of characters is both the name of the input file and the name of the
output directory
setwd(paste0(solutions_dir, "\\", flist[i]))
#sets the appropriate dir
system(paste0(program_dir,"\\program.exe I=",
file_dir, "\\", flist[i], " O=",solutions_dir, "\\", flist[i],
"\\solv"))
##line that inputs program's exe file and the appropriate input/output
locations
}

How do I include a variable file name in a system() function to call Windows commands

This is likely a stupid question but I have not found a work around (at least in anything I have searched for, though I might just not be using the right search parameters.)
I want to call an executable in Windows, and send a file to it (in this case a Blaise man file), the name of which is variable in my script.
So, for example, I have
x<-2
myfile<-c(paste("FileNumber",x,".man", sep="")
system("myapp.exe" myfile)
But I simply get
Error: unexpected symbol in "system("myapp.exe" myfile"
as if the command is not recognizing the object as myfile, instead taking "myfile" as literal text.
I tried using a paste function to create a whole line command, but that also did not work.
The system command will not concatenate the string and the myfile object together, you have to do it yourself.
So, try this instead:
x<-2
myfile<-c(paste("FileNumber",x,".man", sep=""))
cmd <- paste("myapp.exe", myfile)
system(cmd)
Or just:
x<-2
system(paste("myapp.exe", c(paste("FileNumber",x,".man", sep=""))))

Tesseract use of number-dawg

I need to specify a numeric pattern. I already made training normally.
I created a config file that has the line
user_patterns_suffix user-patterns
and the file user-patterns contains my patterns, for example:
:\d\d\d\d\d\d\d.
:\d\d\d\d\d\d\d\d\d;
!\d\d\d\d\d\d\d\d}
then I launch tesseract with the config file over a tif, and it tells me "Error: failed to insert pattern " message, for the first two patterns. It ultimately acts as if no pattern has been issued.
I need to recognize only and ever that patterns, and tried to train a language with a number-dawg file, but then, when using tesseract command, I got a segmentation fault.
I used in the number-dawg file its conversion of the above patterns:
: .
: ;
! }
The questions, as the google documentation is not clear, and I do not speak english:
the patterns file, where have to be used? I suppose number-dawg has to be used during training, but I got seg fault so couldn't try with it, and user-patterns during recognition phase, when launching Tesseract, but didn't work. Where am I doing errors?
do I need a dictionary, also, when training with number-dawg? I have a digit and punctiation only set of possible characters, and all the possible numbers in the digits, a dictionary is not possible. If I need to use dictionaries, how could I do?
Thanks in advance for help, any hint would be very appreciated

Wordpress/Apache - 404 error with unicode characters in image filenames

We've recently moved a website to a new server, and are running into an odd issue where some uploaded images with unicode characters in the filename are giving us a 404 error.
Via ssh/FTP, we can see that the files are definitely there.
For example:
http://sjofasting.no/project/adnoy
none of the images are working:
Code:
<img class='image-display' title='' src='http://sjofasting.no/wp/wp-content/uploads/2012/03/ådnøy_1_2.jpg' width='685' height='484'/>
SSH:
-rw-r--r-- 1 xxxxxxxx xxxxxxxx 836813 Aug 3 16:12 ådnøy_1_2.jpg
What is also strange is that if you navigate to the directory you can even click on the image and it works:
http://sjofasting.no/wp/wp-content/uploads/2012/03/
click on 'ådnøy_1_2.jpg' and it works.
Somehow wordpress is generating
http://sjofasting.no/wp/wp-content/uploads/2012/03/ådnøy_1_2.jpg
and copying from the direct folder browse is generating
http://sjofasting.no/wp/wp-content/uploads/2012/03/a%CC%8Adn%C3%B8y_1_2.jpg
What is going on??
edit:
If I copy the image url from the wordpress source I get:
http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellg%C3%A5rd-12.jpg
When copied from the apache browser I get:
http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellga%cc%8ard-12.jpg
What could account for this discrepancy between:
%C3%A5 and %cc%8
??
Unicode normalisation.
0xC3 0xA5 is the UTF-8 encoding for U+00E5 a-with-ring.
0xCC 0x8A is the UTF-8 encoding for U+030A combining ring.
U+0035 is the composed (Normal Form C) way of writing an a-ring; an a letter followed by U+030A is the decomposed (Normal Form D) way of writing it. å vs å - they should look the same, though they may differ slightly depending on font rendering.
Now normally it doesn't really matter which one you've got because sensible filesystems leave them untouched. If you save a file called [char U+00E5].txt (å.txt), it stays called that under Windows and Linux.
Macs, on the other hand, are insane. The filesystem prefers Normal Form D, to the extent that any composed characters you pass into it get converted into decomposed ones. If you put a file in called [char U+00E5].txt and immediately list the directory, you'll find you've actually got a file called a[char U+030A].txt. You can still access the file as [char U+00E5].txt on a Mac because it'll convert that input into Normal Form D too before looking it up, but you cannot recover the same filename in character sequence terms as you put in: it's a lossy conversion.
So if you save your files on a Mac and then transfer to a filesystem where [char U+00E5].txt and a[char U+030A].txt refer to different files, you will get broken links.
Update the pages to point to the Normal Form D versions of the URLs, or re-upload the files from a filesystem that doesn't egregiously mangle Unicode characters.
Think Different, Cause Bizarre Interoperability Problems.

Resources