Non-ASCII characters in R, reading from .sav file

I am trying to read a .sav file into RStudio. The file contains data from a Spanish language survey, and when I read it into R -- even though my default text encoding has already been set to ISO-8859-1 -- the display of special characters is incorrect.
For example, the word "Camión" appears as
"Cami<c3><b3>n"
even though it shows up correctly as "Camión" in PSPP.
This is what I did:
install.packages("memisc")
library(memisc)
jcv2014 <- as.data.set(spss.system.file('myfile.sav'))
Later, I wanted to create a list of just the variable labels, so I did the following:
library(foreign)
jcv2014.spss <- read.spss("myfile.sav", to.data.frame=FALSE, use.value.labels=FALSE)
jcv2014_vars <- attr(jcv2014.spss, "variable.labels")
(I'm not sure if this is the best way to do it, but it worked)
Anyway, this time around, I still didn't get the proper accents but there was a different sort of encoding:
A variable label that was supposed to be "¿Qué calificación le daría..." instead appeared as
"\302\277Qu\303\251 calificaci\303\263n le dar\303\255a..."
I'm not sure how to get the proper characters, but they appear correctly in PSPP. I tried changing the default text encoding in R to both ISO-8859-1 and UTF-8, to no avail. I don't know what the original file was encoded in, but I guessed it would be one of those.
Any ideas?
And if it helps, I have R version 3.1.1 and OS X Yosemite version 10.10.1, and I am using PSPP, not SPSS.
Thanks so much in advance!!!

Can you just set the encoding once you've read the data in?
# Here's your sentence
s <- "\302\277Qu\303\251 calificaci\303\263n le dar\303\255a..."
# it has no encoding
Encoding(s)
# [1] "unknown"
# but if you specify UTF-8, then it shows up correctly
iconv(s, 'UTF-8')
# [1] "¿Qué calificación le daría..."
# This also works
Encoding(s) <- 'UTF-8'
s
# [1] "¿Qué calificación le daría..."
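If you need the same fix for every label or column rather than for one string, something along these lines might do. It is only a sketch: mark_utf8 is a made-up helper name (it is not part of foreign or memisc), and it assumes the bytes really are UTF-8, which the \302\277 escapes above suggest. read.spss() also has a reencode argument that may make this step unnecessary; the commented-out call reuses the file name from the question and is untested here.

```r
# Hypothetical helper (not from any package): declare that the bytes in a
# character vector, or in a factor's levels, are UTF-8.
mark_utf8 <- function(x) {
  if (is.character(x)) {
    Encoding(x) <- "UTF-8"
  } else if (is.factor(x)) {
    lv <- levels(x)
    Encoding(lv) <- "UTF-8"
    levels(x) <- lv
  }
  x
}

# Stand-in for attr(jcv2014.spss, "variable.labels")
labels <- c("\302\277Qu\303\251 calificaci\303\263n le dar\303\255a...",
            "Cami\303\263n")
labels <- mark_utf8(labels)
Encoding(labels)

# For a whole data frame, apply it column by column
df <- data.frame(veh = factor("Cami\303\263n"), n = 1L)
df[] <- lapply(df, mark_utf8)

# Alternatively, ask read.spss() to re-encode while reading:
# jcv2014.spss <- read.spss("myfile.sav", to.data.frame = FALSE,
#                           use.value.labels = FALSE, reencode = "UTF-8")
```

Marking the encoding does not change any bytes; it only tells R how to interpret them, so it is the wrong tool if the file turns out to be Latin-1.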
Here are the results of my sessionInfo() call. You should post yours too.
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reshape2_1.4 hexbin_1.27.0 ggplot2_1.0.0 data.table_1.9.2 yaml_2.1.13
[6] redshift_0.4 RJDBC_0.2-4 rJava_0.9-6 DBI_0.3.1
loaded via a namespace (and not attached):
[1] colorspace_1.2-4 digest_0.6.4 grid_3.1.1 gtable_0.1.2 labeling_0.2
[6] lattice_0.20-29 MASS_7.3-33 munsell_0.4.2 plyr_1.8.1 proto_0.3-10
[11] Rcpp_0.11.2 scales_0.2.4 stringr_0.6.2 tools_3.1.1
Update: looks like you may not have a locale that supports UTF-8. Here are the locale settings for each category on my system. You might try using Sys.setlocale() and updating them one by one on your system (or just use LC_ALL if you don't feel the need to test each one incrementally). See ?Sys.setlocale for more info.
cat_str <- c("LC_COLLATE", "LC_CTYPE", "LC_MONETARY", "LC_NUMERIC",
"LC_TIME", "LC_MESSAGES", "LC_PAPER", "LC_MEASUREMENT")
sapply(cat_str, Sys.getlocale)
# LC_COLLATE LC_CTYPE LC_MONETARY LC_NUMERIC LC_TIME LC_MESSAGES
# "en_US.UTF-8" "en_US.UTF-8" "en_US.UTF-8" "C" "en_US.UTF-8" "en_US.UTF-8"
# LC_PAPER LC_MEASUREMENT
# "" ""
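A quick way to check whether the session is in a UTF-8 locale, and to try switching if it is not, could look like the sketch below. The locale name "en_US.UTF-8" is taken from my output above; that it exists on your system is an assumption, which is why the result of Sys.setlocale() is checked.

```r
# Does the current session think it is running in a UTF-8 locale?
info <- l10n_info()
info[["UTF-8"]]
Sys.getlocale("LC_CTYPE")

# Try to switch only when needed. Sys.setlocale() returns "" (with a
# warning) when the requested locale is not available.
if (!isTRUE(info[["UTF-8"]])) {
  res <- suppressWarnings(Sys.setlocale("LC_CTYPE", "en_US.UTF-8"))
  if (identical(res, "")) {
    message("Locale en_US.UTF-8 is not available on this system")
  }
}
```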

Related

R source() encoding bug?

I have found a very strange bug in how R handles the encoding of character constants.
main.R:
options(encoding = "UTF-8")
print(Sys.getlocale())
print(getOption("encoding"))
print("first run")
source("internal.R")
print("")
print("second run")
source("internal.R", encoding = "UTF-8")
print("")
internal.R
print(Sys.getlocale())
print(getOption("encoding"))
char_constant="Тут не просто живут баги, тут у них гнездо"
print(Encoding(char_constant))
Now let's see the output after pushing the Source button:
[1] "ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8"
[1] "UTF-8"
[1] "first run"
[1] "ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8"
[1] "UTF-8"
[1] "unknown"
[1] ""
[1] "second run"
[1] "ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8"
[1] "UTF-8"
[1] "UTF-8"
[1] ""
Notice the difference in encoding: "unknown" the first time and "UTF-8" the second time.
There is an obvious small bug: source() ignores the default encoding option.
The real problem is that mixing different encodings in a data.table causes a lot of trouble, and RStudio creates a "UTF-8" constant when you execute a single line but an "unknown" constant when you source the whole file.
Does anybody have any idea what is going on and how to work around it?
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin14.5.0 (64-bit)
Running under: OS X 10.12.4 (unknown)
locale:
[1] ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.3.0
On Windows, R's source function does not work with files that include characters that aren't part of the current system encoding. You may have trouble with RStudio's Run All and Source on Save commands, as they rely on source.
Take a look at: https://support.rstudio.com/hc/en-us/articles/200532197-Character-Encoding
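If the mixed marks are what hurts (for example in data.table joins), one possible workaround is to normalise character vectors after sourcing. The sketch below is an assumption-laden illustration, not a fix for source() itself: normalise_utf8 is a hypothetical helper, it relies on validUTF8() (available from R 3.3.0), and it assumes that unmarked strings with valid UTF-8 bytes really are UTF-8.

```r
# Hypothetical helper: mark strings as UTF-8 when they carry no encoding
# mark but their bytes are valid UTF-8. Pure ASCII strings stay "unknown",
# because R never marks ASCII.
normalise_utf8 <- function(x) {
  stopifnot(is.character(x))
  fix <- Encoding(x) == "unknown" & validUTF8(x)
  Encoding(x[fix]) <- "UTF-8"
  x
}

# Simulate an unmarked constant like the one from the first source() call:
# these are the UTF-8 bytes of the first word of the constant in internal.R.
unmarked <- rawToChar(as.raw(c(0xd0, 0xa2, 0xd1, 0x83, 0xd1, 0x82)))
x <- c(unmarked, "plain ascii")

Encoding(x)                  # marks before normalising
Encoding(normalise_utf8(x))  # marks after normalising
```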

R read.csv didn't load all rows of .tsv file

A little mystery. I have a .tsv file that contains 58936 rows. I loaded the file into R using this command:
dat <- read.csv("weekly_devdata.tsv", header=FALSE, stringsAsFactors=TRUE, sep="\t")
but nrow(dat) only shows this:
> nrow(dat)
[1] 28485
So I used the sed -n command to write the rows around where it stopped (before, including and after that row) to a new file and was able to load that file into R so I don't think there was any corruption in the file.
Is it an environment issue?
Here's my sessionInfo()
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] tcltk stats graphics grDevices utils datasets methods base
other attached packages:
[1] sqldf_0.4-10 RSQLite_1.0.0 DBI_0.3.1 gsubfn_0.6-6 proto_0.3-10 scales_0.2.4 plotrix_3.5-11
[8] reshape2_1.4.1 dplyr_0.4.1
loaded via a namespace (and not attached):
[1] assertthat_0.1 chron_2.3-45 colorspace_1.2-4 lazyeval_0.1.10 magrittr_1.5 munsell_0.4.2
[7] parallel_3.1.2 plyr_1.8.1 Rcpp_0.11.4 rpart_4.1-8 stringr_0.6.2 tools_3.1.2
Did I run out of memory? Is that why it didn't finish loading?
I had a similar problem lately, and it turned out I had two different problems.
1 - Not all rows had the right number of tabs. I ended up counting them using awk
2 - At some points in the file I had quotes that were not closed. This was causing it to skip over all the lines until it found a closing quote.
I will dig up the awk code I used to investigate and fix these issues and post it.
Since I am using Windows, I used the awk that came with git bash.
This counted the number of tabs in a line and printed out those lines that did not have the right number.
awk -F "\t" 'NF!=6 { print NF-1 ":" $0 } ' Catalog01.csv
I used something similar to count quotes, and I used tr to fix a lot of it.
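The same two checks can also be done from inside R if awk is not at hand. This is a rough sketch on made-up lines; count_char is a hypothetical helper, and the expected number of tabs (2 here) has to be adapted to the real file, whose lines you would get from readLines().

```r
# Made-up sample; for a real file use lines <- readLines("weekly_devdata.tsv")
lines <- c("a\tb\tc", "d\t\"e\tf", "g\th")

# Hypothetical helper: how often does a fixed string occur in each line?
count_char <- function(lines, ch) {
  lengths(regmatches(lines, gregexpr(ch, lines, fixed = TRUE)))
}

n_tabs   <- count_char(lines, "\t")
n_quotes <- count_char(lines, "\"")

bad_tabs   <- which(n_tabs != 2)        # lines with the wrong number of tabs
odd_quotes <- which(n_quotes %% 2 == 1) # lines with an unbalanced quote
```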
Pretty sure this was not a memory issue. If the problem is unmatched quotes then try this:
t <- read.csv("weekly_devdata.tsv", header=FALSE, stringsAsFactors=TRUE, sep="\t",
quote="")
There is also the very useful function count.fields that I use inside table to get a high-level view of the consequences of various parameter settings. Take a look at results of:
table( count.fields( "weekly_devdata.tsv", sep="\t"))
And compare to:
table( count.fields( "weekly_devdata.tsv", sep="\t", quote=""))
It's sometimes necessary to read the file in with readLines, remove one or more lines, assign the result to clean, and then send the cleaned-up lines to read.table(text=clean, sep="\t", quote="").
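To see the effect on something small, here is a sketch with a made-up four-line file in which one field opens a quote that is never closed. The file contents and variable names are invented for illustration; only the read.csv and count.fields calls mirror the ones above.

```r
tsv <- tempfile(fileext = ".tsv")
writeLines(c("1\talpha\tx",
             "2\t\"beta\ty",   # opening quote that is never closed
             "3\tgamma\tz",
             "4\tdelta\tw"), tsv)

# Default quoting: the rows after the stray quote get swallowed
n_default <- nrow(suppressWarnings(
  read.csv(tsv, header = FALSE, sep = "\t")))

# Quoting disabled: each line becomes one row
n_noquote <- nrow(read.csv(tsv, header = FALSE, sep = "\t", quote = ""))

c(default = n_default, noquote = n_noquote)
table(count.fields(tsv, sep = "\t", quote = ""))
unlink(tsv)
```

If the real file legitimately uses quotes around some fields, quote="" will leave the quote characters in the data, so this is a diagnostic first and a fix second.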
There could well be some illegal characters in some of the entries. Check how many rows load and where reading stops, then look more closely at that row of the raw data for things like symbol-font (Webdings) characters.

invalid or not-yet-implemented 'Matrix' subsetting in Shiny

When I run my shiny application, I got an error message saying
Error in prob[tw, uni.c] :
invalid or not-yet-implemented 'Matrix' subsetting
That same code ran without error when it was not on Shiny. Any idea how I can troubleshoot this?
I'm not sure how to reproduce the data here, but prob is of class dgCMatrix from the Matrix package, tw is a single integer, and uni.c is a numeric vector.
EDIT:
sessionInfo() output:
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_Singapore.1252 LC_CTYPE=English_Singapore.1252 LC_MONETARY=English_Singapore.1252
[4] LC_NUMERIC=C LC_TIME=English_Singapore.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] shiny_0.10.1 Matrix_1.1-4
loaded via a namespace (and not attached):
[1] bitops_1.0-6 caTools_1.17.1 digest_0.6.4 grid_3.1.1 htmltools_0.2.6 httpuv_1.3.0 lattice_0.20-29
[8] Rcpp_0.11.3 RJSONIO_1.3-0 tools_3.1.1 xtable_1.7-4
It turned out to be a bug in my code that is exposed by how Shiny works.
Outside Shiny, the function where the code resides worked seamlessly, being fed input from another function in the right format.
In Shiny, I expected the function in server.R to receive the input after the submitButton button is pressed, with sensible input keyed into the field. Apparently, even before the first press of button, the default value in the input field (which was not a sensible one) was passed to my function. That default value is not well-handled by my function and caused the error. Both changing the default value, and building extra error-checking in my function, worked to solve the issue.
Apologies for the confusion; this was a learning experience to be careful with default values and with Shiny processing sequence.
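The extra error-checking could look roughly like this. It is a sketch, not my actual code: valid_index and safe_lookup are hypothetical names, and a base matrix stands in for the dgCMatrix so the example runs without the Matrix package. Inside a Shiny app, the same condition could be passed to validate(need(...)) instead of returning NULL.

```r
# Hypothetical check: is i a usable set of positive whole-number indices
# for a dimension of size n?
valid_index <- function(i, n) {
  is.numeric(i) && length(i) > 0 && !anyNA(i) &&
    all(i >= 1 & i <= n) && all(i == floor(i))
}

# Only subset when both indices are sensible; otherwise return NULL so the
# caller can skip the computation instead of hitting a subsetting error.
safe_lookup <- function(prob, tw, uni.c) {
  if (!valid_index(tw, nrow(prob)) || !valid_index(uni.c, ncol(prob))) {
    return(NULL)
  }
  prob[tw, uni.c, drop = FALSE]
}

prob <- matrix(seq_len(12) / 12, nrow = 3)
safe_lookup(prob, 2, c(1, 4))    # sensible input: a 1 x 2 matrix
safe_lookup(prob, NA, c(1, 4))   # unusable default: NULL
safe_lookup(prob, "", c(1, 4))   # empty text field: NULL
```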

"non-numeric argument to binary operator" error from getReturns

For some reason, some code I usually run in RStudio is no longer working. I'm hoping that someone has had a similar experience and understands what's going on.
getReturns(c('C','BAC'), start='2004-01-01', end='2008-12-31')
This results in:
Error in unclass(e1) + unclass(e2) :
non-numeric argument to binary operator
I can't find anything online or on Stack Overflow that addresses this issue. Also, the most recent documentation, from July 2014, doesn't mention anything either:
http://cran.r-project.org/web/packages/stockPortfolio/stockPortfolio.pdf
Does anyone have any idea what's going on here?
It's probably a function name clash issue. Running
timeSeries::getReturns(c('C','BAC'), start='2004-01-01', end='2008-12-31')
gives me the error, but running
stockPortfolio::getReturns(c('C','BAC'), start='2004-01-01', end='2008-12-31')
works fine.
How did this happen?
You must have loaded the stockPortfolio package, and then loaded either timeSeries or another package that depends upon timeSeries. Have a look through your console for a message that looks like
The following object is masked from ‘package:stockPortfolio’:
getReturns
Use the double colon operator (as shown above) to explicitly tell R which package to look in.
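Masking is easy to reproduce without installing anything. In the sketch below a user-defined sd stands in for the later-loaded package; the choice of sd and the names masked and real are just for illustration and have nothing to do with stockPortfolio or timeSeries.

```r
# A definition with the same name as a function in an attached package
# masks it, much like a later-loaded package would.
sd <- function(x) "masked version"

masked <- sd(c(1, 2, 3))          # finds the masking definition first
real   <- stats::sd(c(1, 2, 3))   # always the one from the stats package

# List every place on the search path that defines the name
find("sd")

rm(sd)  # remove the masking definition again
```

conflicts(detail = TRUE) gives the same kind of overview for every masked name at once.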
I have a similar problem using stockPortfolio in an R Markdown document.
Code that works in an R file does not work in the Rmd file.
```{r p3}
recordState()
ff <- allFunds1$Fund
returns <-stockPortfolio::getReturns(ff,freq="month")
save(allFunds1,file='allFunds1.rda')
```
gives the error message and traceback
Error in unclass(e1) + unclass(e2) : non-numeric argument to binary operator
5. structure(unclass(e1) + unclass(e2), class = "Date")
4.`+.Date`(as.Date(origin, ...), x)
3. as.Date.numeric(uDates, origin = minDate)
2. as.Date(uDates, origin = minDate)
1. stockPortfolio::getReturns(ff, freq = "month")
My recordState function saves the results of search() and sessionInfo() in the chunk:
[1] "search:"
[1] ".GlobalEnv" "tools:rstudio" "package:stats"
[4] "package:graphics" "package:grDevices" "package:utils"
[7] "package:datasets" "package:methods" "Autoloads"
[10] "package:base"
[1] "sessionInfo():"
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 digest_0.6.11 dplyr_0.5.0
[4] rprojroot_1.2 assertthat_0.1 R6_2.2.0
[7] xtable_1.8-2 DBI_0.5-1 backports_1.0.5
[10] magrittr_1.5 evaluate_0.10 stringi_1.1.2
[13] stockPortfolio_1.2 rmarkdown_1.3 tools_3.3.2
[16] stringr_1.1.0 readr_1.0.0 yaml_2.1.14
[19] htmltools_0.3.5 knitr_1.15.1 tibble_1.2
The original posting suggests that this error can result from confusing stockPortfolio::getReturns with the function in timeSeries but I have used the full name and do not have either of the libraries loaded.

\usepackage{Sweavel} produces error: It seems you are using the Sweave-specific syntax

If I include \usepackage{Sweavel} in my .rnw file, I get an X11 popup error "It seems you are using the Sweave-specific syntax; you may need Sweave2knitr("IPT-baseline-test.rnw") to convert it to knitr" when I compile in RStudio (Version 0.98.484). The document compiles, but I have to dismiss the error.
(1) Any ideas why \usepackage{Sweavel} triggers the error?
(2) Is there a way to turn off the popup since the document compiles anyway?
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.5
loaded via a namespace (and not attached):
[1] colorspace_1.2-4 dichromat_2.0-0 digest_0.6.3 evaluate_0.5.1
[5] formatR_0.10 ggplot2_0.9.3.1 grid_3.0.2 gtable_0.1.2
[9] labeling_0.2 MASS_7.3-29 munsell_0.4.2 plyr_1.8
[13] proto_0.3-10 RColorBrewer_1.0-5 reshape2_1.2.2 scales_0.2.3
[17] stringr_0.6.2 tools_3.0.2
You shouldn't need to \usepackage{Sweavel} explicitly, I think -- knitr should handle that automatically. If you really want to suppress this false positive, you can rename Sweavel.sty to a file name that doesn't start with Sweave. The which_sweave() function at https://github.com/yihui/knitr/blob/de7c65c58acfb1f3f5c0ac2f00b92cd2546be943/R/utils-sweave.R shows you what patterns knitr is looking for to detect "old Sweave syntax", specifically in this case the regular expression
regexp <-
'^\\s*\\\\(usepackage(\\[.*\\])?\\{Sweave|SweaveInput\\{|SweaveOpts\\{)'
So changing to mySweavel.sty should work ...
grepl(regexp,"\\usepackage{Sweave}") ## TRUE
grepl(regexp,"\\usepackage{Sweavel}") ## TRUE
grepl(regexp,"\\usepackage{mySweavel}") ## FALSE
My guess is that you have a newer version of knitr on your new machine than on your old one, and it's trying harder to detect old Sweave syntax.
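To find out which lines of a document would trigger the message, the same regular expression can be run over the file's lines. This is just a sketch: the sample lines are made up, sweave_pattern is my own variable name for the expression quoted above, and for a real file you would use readLines() on the .rnw instead.

```r
# The pattern quoted above from knitr's utils-sweave.R
sweave_pattern <-
  '^\\s*\\\\(usepackage(\\[.*\\])?\\{Sweave|SweaveInput\\{|SweaveOpts\\{)'

# Made-up preamble; for a real document: doc <- readLines("yourfile.rnw")
doc <- c("\\documentclass{article}",
         "\\usepackage{Sweavel}",
         "\\usepackage[noae]{Sweave}",
         "\\usepackage{mySweavel}",
         "  \\SweaveOpts{concordance=TRUE}")

hits <- grep(sweave_pattern, doc)  # line numbers that look like old Sweave
doc[hits]
```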
Removing the Sweave tag in the LaTeX document prevented the warnings, and the file still rendered. So the suggestion made in one of the comments above worked: knit2pdf (in my case) figures it out.
