BlueSky Statistics - Character Encoding Problem - R

I am loading a data set whose characters are encoded in ISO 8859-9 ("Latin 5") on Windows 10 (Microsoft assigns code page 28599, a.k.a. Windows-28599, to ISO-8859-9 in Windows).
The data set is originally in Excel.
Whenever I run an analysis, or any other operation, with a variable name containing a character specific to this code page (ISO 8859-9), I get an error like:
Error: undefined columns selected
BSkyFreqResults <- BSkyFrequency(vars = c("MesleÄŸi"), data = Turnudep_raw_data_5)
Error: object 'BSkyFreqResults' not found
BSkyFormat(BSkyFreqResults)
The characters ÄŸ within "MesleÄŸi" are originally a single Turkish character, ğ (g with a breve).
Variable names that contain only ASCII letters work normally in BlueSky operations.
Using Save As in Excel with the Web Options encoding set to UTF-8 does not help. Exporting to a CSV file does not work either, whether saved as-is or as UTF-8.
How can I load this data into BlueSky so that it works?
This same data set works in RStudio:
> Sys.getlocale('LC_CTYPE')
[1] "Turkish_Turkey.1254"
And also in SPSS:
Language is set to Unicode
[Picture of Language settings in SPSS]
It also works in Jamovi
I also get an error when I start BlueSky, that may be relevant to this problem:
Python-CFFI error
From cffi callback <function _consolewrite_ex at 0x000002A36B441F78>:
Traceback (most recent call last):
File "rpy2\rinterface_lib\callbacks.py", line 132, in _consolewrite_ex
File "rpy2\rinterface_lib\conversion.py", line 133, in _cchar_to_str_with_maxlen
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 15: invalid start byte
Since then, I have re-downloaded and re-installed BlueSky, but I still get this Python-CFFI error every time I start the software.
I want to work with BlueSky and will appreciate any help in resolving this problem.
Thanks in advance
Here is a link for reproducing the problem.
The zip file contains a data source of 2 cases both in Excel and BlueSky format, a BlueSky Markdown file to show how the error is produced and an RMarkdown file for redundancy (probably useless).
UPDATE: The Python error (Python-CFFI error) appears to be related to the Region settings in Windows.
If the region is USA (Turnudep_reprex_Windows_Region_USA-Settings.jpg), the Python error does NOT appear.
If the region is Turkey (Turnudep_reprex_Windows_Region_Turkey-Settings.jpg), the Python error DOES appear.
Unfortunately, setting the region and language to USA eliminates the Python error message but not the main problem: all operations with Turkish variable names still end in an error.
This may be a problem that only the BlueSky developers can solve ...
Any help or suggestion will be greatly appreciated.
UPDATE FOR VERSION 10.2: The Python-CFFI error is eliminated in this version. All other problems persist. I also notice that I cannot change variable names that contain characters outside ASCII. That is, if a variable name is something like "HastaNo", I can run analyses with that variable and rename it in the editor. If the variable name is something like "Mesleği", I cannot run analyses with it AND I cannot rename it in the editor to "Meslegi" or anything else that would make it usable in analyses.
UPDATE FOR VERSION 10.2.1 (R package version 8.70): No change from version 10.2. Variable names that contain a character outside ASCII cause an error AND cannot be changed in BlueSky Statistics.

For version 10, according to user manual chapter 15.1.3, you can adjust the encoding setting. (Answer edited for clarity.)
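If the encoding setting alone does not help, a workaround on the R side is to read the exported CSV with its real encoding declared and then transliterate the variable names to plain ASCII before analysis. This is only a sketch: the use of iconv with ASCII//TRANSLIT is my own suggestion (its output is platform-dependent), and the commented read.csv call uses a placeholder file name.

```r
# Read the exported CSV declaring its actual encoding (placeholder file name):
# df <- read.csv("Turnudep_raw_data_5.csv", fileEncoding = "ISO-8859-9")

# Transliterate non-ASCII variable names, e.g. "Mesleği" -> "Meslegi".
# ASCII//TRANSLIT output is platform-dependent; check the result.
nm <- c("HastaNo", "Mesle\u011fi")
ascii_nm <- iconv(nm, from = "UTF-8", to = "ASCII//TRANSLIT")
# names(df) <- ascii_nm
```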

Related

UTF-8 problems in RStudio

I am currently handing over my work with some R files to my colleague, and we are having a lot of trouble getting the files to work on his computer. The script as well as the data contain Nordic letters, so to prevent this from being an issue we made sure to save the R files with UTF-8 encoding.
Still, there are two problems. A solution to either one would be much appreciated:
Problem 1: when loading the standard CSV data file (semicolon-separated, which works on my computer), my colleague gets the following error:
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string 3
But we have also tried to make it work with a CSV file that he saved in UTF-8 format and with an Excel (xlsx) file. He can load both files fine (with read.csv2 and read_excel from the readxl package, respectively), and in both cases, when he opens the data in R, it looks fine to him too ("æ", "ø" and "å" are included).
The problem first appears when he tries to run the plots that actually have to grab and display values from the data columns in which "æ", "ø" and "å" appear. There, he gets the following error message:
in grid.call(c_textBounds, as.graphicAnnot(x$label), x$x, x$y, : invalid input 'value with æ/ø/å' in 'utf8towcs'
When I try to run the R script with the comma-separated CSV-UTF-8 data file and open the data in a tab in RStudio, I can see that æ, ø and å are not displayed correctly (just a bunch of strange signs). This is odd, considering that it should work better with this type of CSV file; instead I have problems with it and not with the standard (non-UTF-8, semicolon-separated) file.
When I run the script with the xlsx file, it works completely fine for me: I get to the plot that displays the data values with æ, ø and å, and I do not get the error message.
Why does my colleague get these errors?
(We have also made sure that he installed the Danish version of R from the CRAN website.)
We have tried all of the above.
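One avenue worth trying (a sketch, not a guaranteed fix): declare the file's encoding explicitly when reading, so that R marks the strings correctly for make.names and for plotting. The small latin1 file written below is a stand-in for the real data; which encoding to declare (UTF-8 vs latin1) depends on how the file was actually saved.

```r
# Write a tiny latin1-encoded, semicolon-separated file as a stand-in:
tmp <- tempfile(fileext = ".csv")
latin1_bytes <- charToRaw(iconv("navn;by\nS\u00f8ren;\u00c5rhus\n",
                                from = "UTF-8", to = "latin1"))
writeBin(latin1_bytes, tmp)

# Declaring the real encoding lets read.csv2 return correctly marked strings:
df <- read.csv2(tmp, fileEncoding = "latin1")
df$navn  # "Søren", decoded correctly
```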

Attempts to parse bencode / torrent file in R

I would like to parse torrent files automatically via R. I tried the R bencode package:
library('bencode')
test_torrent <- readLines('/home/user/Downloads/some_file.torrent', encoding = "UTF-8")
decoded_torrent <- bencode::bdecode(test_torrent)
but ran into an error:
Error in bencode::bdecode(test_torrent) :
input string terminated unexpectedly
In addition if I try to parse just part of this file bdecode('\xe7\xc9\xe0\b\xfbD-\xd8\xd6(\xe2\004>\x9c\xda\005Zar\x8c\xdfV\x88\022t\xe4գi]\xcf'), I get
Error in bdecode("\xe7\xc9\xe0\b\xfbD-\xd8\xd6(\xe2\004>\x9c\xda\005Zar\x8c\xdfV\x88\022t\xe4գi]\xcf") :
Wrong encoding '�'. Allowed values are i, l, d or a digit.
Maybe there is another way to do this in R? Or can I perhaps specify a different language encoding in the R script?
Thanks in advance!
It might be that the torrent file is somehow corrupted.
A bencode value must begin with the character i (for integers), l (for lists), d (for dictionaries) or a number (for the length of a string).
The example string ('\xe7\xc9...'), doesn't start with any of those characters, and hence it can't be decoded.
See this for more info on the bencode format.
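To illustrate the rule, here is a small base-R check (purely illustrative; this helper is not part of the bencode package):

```r
# A bencode value must start with i, l, d, or a digit:
is_valid_bencode_start <- function(s) {
  substr(s, 1, 1) %in% c("i", "l", "d", as.character(0:9))
}

is_valid_bencode_start("i42e")    # TRUE: integer 42
is_valid_bencode_start("4:spam")  # TRUE: length-prefixed string "spam"
is_valid_bencode_start("spam")    # FALSE: missing the length prefix
```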
There seem to be several issues here.
Firstly, your code should not treat torrent files as text files in UTF-8 encoding. The content a torrent describes is split into equally-sized pieces (except possibly the last), and the torrent file contains a concatenation of SHA1 hashes of those pieces. SHA1 hashes are unlikely to be valid UTF-8 strings.
So, you should not read the file into memory using readLines, because that is for text files. Instead, you should use a connection:
test_torrent <- file("/home/user/Downloads/some_file.torrent")
open(test_torrent, "rb")
bencode::bdecode(test_torrent)
Secondly, the library itself seems to suffer from a similar issue, since readChar, which it uses internally, also assumes that it is dealing with text.
This might be due to recent R version changes, seeing as the library is over 6 years old. I was able to apply a quick hack and get it working by passing useBytes = TRUE to readChar.
https://github.com/UkuLoskit/R-bencode/commit/b97091638ee6839befc5d188d47c02567499ce96
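For reference, a self-contained sketch of what the byte-oriented read looks like: readChar with useBytes = TRUE returns the file's bytes as-is instead of reinterpreting them as locale-encoded text (the tiny file written below stands in for a real torrent):

```r
tmp <- tempfile(fileext = ".torrent")
# "d1:ai1ee" is a minimal bencoded dictionary {a: 1}:
writeBin(charToRaw("d1:ai1ee"), tmp)
s <- readChar(tmp, nchars = file.size(tmp), useBytes = TRUE)
nchar(s, type = "bytes")  # 8
```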
You can install my version as follows:
install.packages("devtools")
library(devtools)
devtools::install_github("UkuLoskit/R-bencode")
Caveat lector! I'm not an R programmer :).

Why can't I convert my R data to SPSS due to invalid format?

I want to import a R file into SPSS. I used the following code for this:
library(foreign)
write.foreign(mydata, "C:\\Users\\LM\\OneDrive\\Documents\\mydata.txt",
"C:\\Users\\LM\\OneDrive\\Documents\\mydata.sps", package="SPSS")
Then I opened the syntax document that was made. When I run this in SPSS I get the following error:
Error # 4130 in column 41. Text: .
The DATA LIST command contains an invalid format.
Execution of this command stops.
What went wrong?
You've already identified your answer -- in this case it is likely the leading period, not the trailing period, that fails in your variable name. In any case, leading and trailing periods should be avoided in SPSS Statistics variable names.
See the IBM SPSS Statistics Command Syntax Reference Guide: Universals section -
https://www.ibm.com/support/knowledgecenter/SSLVMB_26.0.0/statistics_reference_project_ddita/spss/base/syn_variables_variable_names.html
Several sources provide examples on how to rename your columns in an R dataframe. Here is one:
http://rprogramming.net/rename-columns-in-r/
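A sketch of such a rename in R before calling write.foreign (the exact cleanup rules here are my own assumption; consult the SPSS naming rules linked above):

```r
clean_spss_names <- function(x) {
  x <- gsub("^\\.+|\\.+$", "", x)  # strip leading/trailing periods
  gsub("\\.", "_", x)              # interior periods -> underscores, to be safe
}

clean_spss_names(c(".score.", "age.years"))  # "score" "age_years"
# names(mydata) <- clean_spss_names(names(mydata))
```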
You might find the STATS GET R extension command useful. It can be installed from the Extensions > Extension Hub menu.

PDF File Import R

I have multiple .pdf files (stored in a local folder) that contain text. I would like to import the .pdf files (i.e., the texts) into R. I applied the function read_dir (R package: textreadr):
library("textreadr")
Data <- read_dir("<MY PATH>")
The function works well, BUT for several files that include special characters (i.e., letters such as 'ć') in their names (e.g., 'filenameć.pdf'), the function did not work (error message: 'The following files failed to read in and were removed:' …).
What can I do?
I tried to rename the files via R, which did not work (probably for the same reasons). That might be a workaround.
I did not want to rename the files manually :)
Follow-Up (only for experts):
For several files, I got one of the following error messages (and I have no idea why):
PDF error: Mismatch between font type and embedded font file
or
PDF error: Couldn't find trailer dictionary
Any suggestions or hints how to solve this issue?
Likely the issue concerns the encoding of the file names. If you absolutely want to use R to rename the files, the function you want is iconv: determine the encoding of the file names, then convert them to UTF-8.
However, a much better approach would be to rename them using bash from the command line. Can you provide a more complete set of examples?
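A sketch of the iconv-based rename in R (assumptions: the file names decode from the native locale, and ASCII//TRANSLIT output is platform-dependent; try it on copies first):

```r
rename_to_ascii <- function(dir) {
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)
  ascii <- iconv(basename(files), from = "", to = "ASCII//TRANSLIT")
  file.rename(files, file.path(dir, ascii))
}

# Example with a throwaway directory containing 'filenameć.pdf':
d <- file.path(tempdir(), "pdfs"); dir.create(d, showWarnings = FALSE)
file.create(file.path(d, "filename\u0107.pdf"))
rename_to_ascii(d)
list.files(d)  # the names are now ASCII-only
```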

When importing data into ChemoSpec, I get: Error in `[.data.frame`(temp, , 2) : undefined columns selected

I'm new to R (and any kind of programming language in general) and was hoping for a package to help analyze some HPLC data. My script:
library(ChemoSpec)
spec <- files2SpectraObject(gr.crit=c("Control","AC","Fifty"),
gr.cols=c("auto"), freq.unit="minutes", int.unit="mAU",
descrip="hplc test data", fileExt=".csv",
out.file="hplc test data", debug=TRUE)
And the output:
The default behavior of this function has changed as of July 2016. See ?files2SpectraObject. Really: please read it!
files2SpectraObject is checking the first file
files2SpectraObject will now import your files
Importing file: AC_3G_L_1_220_trim.csv
Error in [.data.frame(temp, , 2) : undefined columns selected
I've got the ChemoSpec pdf and formatted my files accordingly into two columns, no headers, .csv format. Any suggestions as to what I've missed?
I am the author of ChemoSpec -- sorry for the delay in answering!
You probably need to add sep = "," to your files2SpectraObject call. You may also need to set the header and possibly the decimal marker. The only way to know is to open one of your csv files in a plain text editor and see what it looks like. ChemoSpec now allows a lot of flexibility in the format of the csv file, because it turns out that not all instrument manufacturers feel that csv means "comma separated values". Plus, different countries have different standards for the decimal marker (and your instrument may or may not be set up to reflect typical local standards). This is all detailed in ?files2SpectraObject.
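The "open the csv in a plain text editor" step can also be done from R. The sketch below writes a stand-in for AC_3G_L_1_220_trim.csv with a semicolon separator and comma decimals, then peeks at it; the sep and dec arguments mentioned above are documented in ?files2SpectraObject.

```r
tmp <- tempfile(fileext = ".csv")             # stand-in for the real data file
writeLines(c("0,00;12,5", "0,01;13,1"), tmp)  # semicolon sep, comma decimals
readLines(tmp, n = 2)
# For a file like this, one would pass sep = ";" and dec = "," in the call.
```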
There is also a new version of ChemoSpec on CRAN as of a few days ago.
