Cannot get R to compare seemingly equal strings

I recently received an Excel file of unknown origin and used Excel to output a subset into a text file, "example0.txt" (see the Dropbox link below). In this dataset, two strings that appear equal by all means cannot be compared in R.
a<-scan("example0.txt", what="raw")
a
[1] "ÖSTVÅG" "FALKVÅG" "ÖSTVÅG"
# cells a[1] and a[3] appear identical ("ÖSTVÅG"). However,
a[1]==a[3] # I was expecting TRUE but I get FALSE
nchar(a[1]) == nchar(a[3]) # I was expecting TRUE (n=6) but I get FALSE
# and similarly,
nchar(a[2]) == 8 # I was expecting n=7
See the files in Dropbox:
https://www.dropbox.com/sh/hz6vsj5kj1u9ag6/AACtTMh-x4IIB10kMCUE5rvBa?dl=0
This behaviour really complicates matching strings on a larger dataset I am working on, let alone newer datasets that will be coming in.
Initially I suspected it had to do with encodings, but I have tried reading the data with different encodings and performing conversions in R, and I still get the same results.
I managed to solve the issue by opening the file in Notepad, re-typing a[3], and re-saving the file. I was surprised to see that it then worked (you can see the result in example0correct.txt), even though the new file appears identical to the original one.
Can anyone explain to me what is happening and how I can detect and correct these occurrences, taking example0 as a starting point?
Note: I don't know if it matters, but I am using R 3.2.2 on a Windows 7 machine with Office 2013 installed.
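One likely explanation, though it is not confirmed in this thread, is Unicode normalization: the same accented letter can be stored either precomposed or as a base letter plus a combining mark, which would also explain nchar(a[2]) returning 8 for a 7-letter word. A minimal sketch for detecting and fixing this, assuming the stringi package is available:
x <- "\u00C5"   # precomposed LATIN CAPITAL LETTER A WITH RING ABOVE
y <- "A\u030A"  # "A" followed by COMBINING RING ABOVE
x == y          # FALSE, although both print as "Å"
nchar(y)        # 2, one visible letter counted as two characters
library(stringi)
stri_trans_nfc(x) == stri_trans_nfc(y)  # TRUE after normalising both to NFC
a_norm <- stri_trans_nfc(a)             # normalise the scanned vector before matching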

Related

Fill initial student IDs with zeros in R/exams exams2nops()

When using the exams package to produce PDF files for NOPS exams, I want to edit the number of digits that are available to my students (reglength). I am aware that the package only admits a minimum of 7 digits. However, our students only have 5 digits in their ID number. So I would like to know whether it is possible to edit the template via PDF manipulation (I tried with the staplr::get_fields() function, but it does not work) or in any other way... I simply want to set the first two digits to 0 (i.e., 00xxxxx).
This is a good idea which I have implemented in exams 2.4-0. If you specify a reglength < 7 now, it is still enforced to be 7 internally - and thus this length is necessary in the registration CSV file! But on the exam sheet the initial IDs are fixed to "0" and the corresponding boxes are ticked already.
For example, the following code will give you a warning but produce the output below:
exams2nops(..., reglength = 5)
## Warning message:
## In make_nops_page(n, nchoice = nchoice, reglength = reglength) :
## 'reglength = 5' too small, using 7 instead, but fixing initial IDs to 0
Notes:
I haven't tested yet whether the scanning of these sheets actually works well enough because I'm in my home office for the foreseeable future without access to a suitable printer/scanner. I don't see sources for potential problems, though.
I assumed that you referred to exams2nops() rather than exams2pdf() and hence have modified your question accordingly. In exams2pdf() you need to provide your own template anyway and can edit it in any way you like, adding the zeros where you need them.
We added the XX manually to the PDF file generated by exams2nops, and the scans did work fine. I see no reason why they should not work with your solution. Thanks for implementing it.
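Since the registration CSV still has to contain 7-digit IDs, the existing 5-digit student IDs presumably have to be zero-padded there. A minimal base-R sketch; the column names and semicolon separator are assumptions about the usual NOPS registration file layout, not taken from the thread:
ids <- c(12345, 67890)                                  # assumed 5-digit student IDs
reg <- data.frame(
  registration = formatC(ids, width = 7, flag = "0"),   # "0012345" "0067890"
  name         = c("Doe, Jane", "Roe, John"),
  id           = c("jdoe", "jroe")
)
write.table(reg, "registration.csv", sep = ";", quote = FALSE, row.names = FALSE)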

.rds file internal format

I have lost a .rds file due to the device (let's call it volume 1) filling up. Usually when that happened, R would throw an error and stop; in that case I had a safe copy on a different volume (volume 2). This time, however, R wrote the file to volume 1 without error and copied it over to volume 2. Now the file cannot be opened with readRDS anymore; it fails with the error "error reading from connection".
The file contains a data.table, is stored uncompressed and infoRDS can read the metadata:
> infoRDS('corrupt.rds')
$version
[1] 3
$writer_version
[1] "3.6.3"
$min_reader_version
[1] "3.5.0"
$format
[1] "xdr"
$native_encoding
[1] "UTF-8"
Also, hexView::readRaw can read the file and shows the names of the columns of the data table.
Using
readRaw('corrupt.rds', endian = 'big', human = 'real', width = 8, offset = 5)
I can see many of the numbers I need to recover. However, this seems like a very tedious approach, since I don't understand the internal format of the .rds file.
I also looked into xmlDeserializeHook, but I don't understand how to use it.
Of course the C code used by readRDS (unserializeFromConn) contains all the information about the structure used, but higher-level documentation would be helpful.
Is there an easier way than to dive into that C code or pick up the numbers manually one by one?
R Internals contains documentation of the serialisation format. Unless somebody has published a more detailed description on the mailing list, that's probably the best we can do. But (at a glance) it looks to be a fairly comprehensive description (definitely when taken together with the implementation).
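For orientation, a small sketch of reading just that header by hand, following the layout described in R Internals (an uncompressed XDR-format .rds starts with "X\n", then three big-endian integers for the format version, writer version and minimal reader version, and, for format 3, the native encoding string); recovering the actual column data would still mean walking the serialised objects according to that grammar:
con <- file("corrupt.rds", "rb")
readChar(con, 2)                                 # "X\n" marks binary XDR serialisation
readBin(con, integer(), n = 3, endian = "big")   # 3, 198147 (= 3.6.3), 197888 (= 3.5.0)
enc_len <- readBin(con, integer(), endian = "big")
readChar(con, enc_len)                           # "UTF-8"
close(con)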

Problem with encoding of character strings when loading json files to RStudio under Windows 10

I am trying to extract Tweets from json files and save them as RData under Windows 10 and using RStudio version 1.2.5033 and streamR. However, Windows (and RStudio and streamR subsequently) assumes that the input is Windows-1252 although it is UTF-8 which leads to serious encoding issues.
To replicate the problem, please use this json file with two fake Tweets since I could not replicate the structure of the original json files within R. But this structure leads to issues with the only solution I found for the encoding issue (see below).
The code I used is the following:
df <- streamR::parseTweets("test.json")
The output I get with df$text is: '[1] "RT #bkabka:EikÃ¶ tÃ¤mÃ¤" "RT #bkabka:EspaÃ±a"'.
The output should be: '[1] "RT #bkabka:Eikö tämä" "RT #bkabka:España"'.
My question is therefore: (How) can I tell R and streamR to interpret the text as being encoded using UTF-8 rather than Windows-1252?
Since all this happens because the function wrongly assumes that the text is encoded with Windows-1252, one solution would be to go through the whole corpus and replace all of these wrongly interpreted special characters with the correct ones, for example using the table I found here. In my case, however, the corpus is very large, making this a very suboptimal solution in the long run. Additionally, I would have no way to check whether it actually replaced all special characters correctly.
Some additional information:
Using rjson and the following code somehow makes R interpret the encoding correctly, but it has trouble with the structure of the json files since it only extracts the first line:
lt <- rjson::fromJSON(file="test.json")
I guess it cannot extract the subsequent lines because it does not recognise the line break, which is an actual line break and not \n or any other character combination. Unfortunately, I do not have the possibility to change the json files.
The json files were created by another person under macOS using streamR - if I am not mistaken.
The same problem appears when using plain R instead of RStudio. The problem does not appear on macOS.
The problem is even more serious when using tweet2r, the only other package I am aware of that allows extracting Tweets from json files using R. tweet2r deletes specific special characters such as "¶", and thus the wrongly interpreted special characters cannot be replaced with the correct ones anymore.
Thanks to MrFlick (see his comment), here is a solution using jsonlite that results in a very similar data frame structure and reads the encoding correctly:
df <- jsonlite::stream_in(file("~/../Downloads/test.json"))
Just some further information for those used to the luxury of streamR handling Tweets who might encounter a similar issue in the future: there are two main differences in the data frames created by parseTweets and stream_in:
parseTweets does not extract data for broken Tweets. stream_in does. Hence, the data frame has more rows when using stream_in but contains the same Tweets.
stream_in creates fewer variables since some of the columns in the resulting data frame are themselves data frames. This might lead to issues when using the data frame without transforming it further; parseTweets does that transformation for you.
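On the second point, one possible workaround (my assumption, not something suggested in the thread) is to flatten the nested data-frame columns after reading, which gives a structure closer to what parseTweets produces:
df  <- jsonlite::stream_in(file("test.json"))
df2 <- jsonlite::flatten(df)   # nested data frames become "parent.child" columns
str(df2, max.level = 1)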

Import XLS, readxl / gdata bring in DF with NA

I am trying to bring this .xls file into R: https://www.reit.com/sites/default/files/returns/MonthlyHistoricalReturns.xls
I've tried to bring it in directly from the URL on a Windows machine. I've already run across the https-versus-http issue as well as the Perl issue for Windows. To get around these, I've tried running on Ubuntu as well as downloading the file first.
My latest two attempts with readxl and gdata both produce a data frame, though neither one has any data in it. There are no error messages.
NAREIT <- readxl::read_xls("~/Downloads/MonthlyHistoricalReturns.xls")
This produces 38 observations of one variable, all NA.
NAREIT <- gdata::read.xls("~/Downloads/MonthlyHistoricalReturns.xls")
And this one produces 0 observations of 1 variable, with "No data available in table" as the text in the only cell.
The file is admittedly ugly, with multiple unneeded header rows, merged cells, frozen views, etc. I've tried specifying ranges, columns, rows, rows to skip, column names, and everything else I could think of from the readxl and gdata documentation.
I can just cut the range I need, save it as CSV, and work with that. But, as I am likely to have to come back to this regularly, I am looking for the 'right' way to open this file. Any thoughts are much appreciated.
It looks like there are several rows of header, so you would need to figure out what you would like as a header, or consult a few pages on Stack Overflow that show how to deal with two-line headers.
Anyway, I can import it like this, and it seems to be just fine:
library(readxl)
MonthlyHistoricalReturns <- read_excel("MonthlyHistoricalReturns.xls", sheet = "Index Data", skip = 7)
I skipped the first 7 rows so that your header starts there.
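If you still want to pull the file straight from the URL, an untested sketch that downloads it first; mode = "wb" matters on Windows, otherwise the binary .xls gets mangled:
url <- "https://www.reit.com/sites/default/files/returns/MonthlyHistoricalReturns.xls"
tmp <- tempfile(fileext = ".xls")
download.file(url, tmp, mode = "wb")
NAREIT <- readxl::read_excel(tmp, sheet = "Index Data", skip = 7)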

data.table v.1.11.0+ no longer freads data file that was fread by v.1.10.4-3

I've encountered a possible bug in the new version of data.table. I have a 2 GB .csv file with c. 3 million rows and 67 columns. I can read it all fine with fread() from data.table v1.10.4-3, but v1.11.0+ terminates at a row somewhere in the middle. The base read.csv() also hits the same problem. I really like data.table and want to create a bug report on GitHub, but obviously I can't upload the 2 GB data file anywhere.
I need a way of slicing out maybe ~10 rows around the problematic point (the row number is known) in order to create a portable reproducible example. Any ideas how I can do that without reading in the whole .csv file?
Also, is there a program I can use to open the raw file to look at the problematic point and see what causes the issue? Notepad/Excel won't open a file this big.
EDIT: the verbose output.
EDIT2: this is the problematic line. It shows that what is supposed to be one line is somehow split into 3 lines. I can only assume it is due to an export bug in the ancient software (SAP Business Objects) that was used to create the CSV. It is unsurprising that this causes an issue. However, it is surprising that data.table v1.10.4-3 was able to handle it in a smart way and read it correctly, whereas v1.11.0+ could not. Could it have something to do with encoding or hidden technical characters?
EDIT3: proof that this is what really happens.
Thanks for including the output. It shows that fread is issuing a warning. Did you miss this warning before?
Warning message:
In fread("Data/FP17s with TCD in March 2018.csv", na.strings = c("#EMPTY", :
Stopped early on line 138986. Expected 67 fields but found 22. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<916439/0001,Q69,GDS Contract,MR A SYED,916439,Mr,SYED A Mr,A,SYED,58955,3718.00,Nine Mile Ride Dental Practice,Dental Surgery,193 Nine Mile Ride,Finchampstead,WOKINGHAM,RG40 4JD,2181233168.00,TORIN,FASTNEDGE,1 ANCHORITE CLOSE,>>
This is very helpful, surely. It tells you the line number: 138986. It says that this line has 22 fields where 67 were expected. Could the warning be better by stating why it expects 67 fields at that point (e.g. by saying there are 67 column names and it has seen 67 columns up to that point)? It gives you a hint of what to try (fill=TRUE), which would fill that too-short line with NA in columns 23:67. Then it includes the data from the line, too.
Does it work with fill=TRUE, as the warning message suggests?
You say it worked in 1.10.4-3, but I suspect it more likely stopped early there too, just without a warning. If so, not warning was a bug, which is now fixed.
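For completeness, a hedged sketch of the fill = TRUE variant the warning suggests (the remaining arguments of the original call are truncated in the warning, so only the visible ones are repeated here):
DT <- data.table::fread("Data/FP17s with TCD in March 2018.csv",
                        na.strings = "#EMPTY",
                        fill = TRUE)   # pads the too-short line with NA in columns 23:67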
Using Powershell on Windows:
Get-Content YourFile.csv | Select -Index (0,19,20,21,22) > OutputFileName.csv
would dump the header and lines 20-23 into a new file.
Use a combination of skip and nrows:
You mentioned that you have no problem reading the file with v1.10.4-3, right? So use that version to skip most of the .csv and set nrows to the number of rows you want. Once you have that data.table, you can write that portion of the file out and you have a portable reproducible example.
For example:
DT <- fread("my_file.csv", skip = 138981, nrows = 10)
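The excerpt can then be written back out for the bug report (file names are assumed). Note that with skip the header row is not read, so the excerpt will likely get default V1..V67 column names unless you add the header line separately:
fwrite(DT, "repro_excerpt.csv")   # small enough to attach to a GitHub issue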
