UTF-8 dotless i in cmd - r

I have an R script encoded in UTF-8. When I run it in RStudio, the Turkish characters cause no problems. However, when I try to run it from cmd it throws an error:
Columns ÜrünAçiklama and HataTanimi don't exist.
It gives this error because my dataframe actually has the columns 'ÜrünAçıklama' and 'HataTanımı'.
As you can see, there is no problem with the characters "Ü, ü, Ç", but there is a problem with the dotless i (ı). I run the script from cmd with this line:
Rscript --encoding="UTF-8" myscript.r
My OS is Windows 10.
What should I do? Thanks in advance.
EDIT:
An example should be fine.
Here is my dataset. When I try to delete duplicate rows, I cannot reach the columns that contain a dotless i. You can try it in your own cmd with the following script.
library(readxl)
rm(list = ls())
shell("cls")
df <- read_excel("stackoverflow.xlsx")
df$ÜrünNo
df$ÜrünAçıklama
df$HataTanımı
df$HataZamanı
df_nd <- df[!duplicated(df[,c("ÜrünNo","ÜrünAçıklama","HataZamanı")]),]
Also here is my CMD output:
[1] 1 2 3 3 4
[1] "X" "Y" "Z" "Z" "Q"
[1] "A" "B" "C" "C" "D"
[1] 10 11 12 12 13
Error in `vectbl_as_col_location()`:
! Can't subset columns past the end.
x Columns `ÜrünAçiklama` and `HataZamani` don't exist.
Backtrace:
x
1. +-df[!duplicated(df[, c("ÜrünNo", "ÜrünAçiklama", "HataZamani")])]
2. +-tibble:::`[.tbl_df`(...)
3. +-base::duplicated(df[, c("ÜrünNo", "ÜrünAçiklama", "HataZamani")])
4. +-df[, c("ÜrünNo", "ÜrünAçiklama", "HataZamani")]
5. \-tibble:::`[.tbl_df`(df, , c("ÜrünNo", "ÜrünAçiklama", "HataZamani"))
6. \-tibble:::vectbl_as_col_location(...)
7. +-tibble:::subclass_col_index_errors(...)
8. | \-base::withCallingHandlers(...)
9. \-vctrs::vec_as_location(j, n, names)
10. \-vctrs `<fn>`()
11. \-vctrs:::stop_subscript_oob(...)
12. \-vctrs:::stop_subscript(...)
13. \-rlang::abort(...)
Execution halted
As you can see, I can reach the columns one by one, but when I try to delete duplicate rows it just says the columns do not exist.

While doing some research, I came across this sentence:
R 4.2 for Windows will support UTF-8 as native encoding, which will be a major improvement in encoding support, allowing Windows R users to work with international text and data.
Later, I realized I was using 4.1. Updating R is the easiest and fastest solution. Sorry for the inconvenience.
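For anyone who has to stay on R 4.1 or earlier, one possible workaround (a sketch using a stand-in data frame, since the original spreadsheet isn't available here) is to address the key columns by position rather than by name, so the dotless ı never has to survive the cmd encoding round trip:

```r
# Stand-in for read_excel("stackoverflow.xlsx"): assumed structure,
# column names and values taken from the question.
df <- data.frame(check.names = FALSE,
  "ÜrünNo"       = c(1, 2, 3, 3, 4),
  "ÜrünAçıklama" = c("X", "Y", "Z", "Z", "Q"),
  "HataTanımı"   = c("A", "B", "C", "C", "D"),
  "HataZamanı"   = c(10, 11, 12, 12, 13)
)

# Refer to the key columns by position (1 = ÜrünNo, 2 = ÜrünAçıklama,
# 4 = HataZamanı), so no non-ASCII name appears in the subset call:
key_cols <- c(1, 2, 4)
df_nd <- df[!duplicated(df[, key_cols]), ]
nrow(df_nd)  # 4: the duplicated third row is dropped
```

This only sidesteps the encoding problem for subsetting; any other code that spells out the column names would still need R 4.2's native UTF-8 support.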

Related

r functions will not recognise apostrophe in character string

I have a large data frame of survey data read from a .csv that looks like this when simplified.
x <- data.frame("q1" = c("yes","no","don’t_know"),
"q2" = c("no","no","don’t_know"),
"q3" = c("yes","don’t_know","don’t_know"))
I want to create a column using rowSums as below
x$dntknw<-rowSums(x=="don’t_know")
I can do it for all the yes and no answers easily, but in my dataframe it just generates zeros for the don’t_know's.
I previously had an issue with the apostrophe looking like this: don’t_know. I added encoding = "UTF-8" to my read.table call to fix that. However, now I can't seem to get any R functions to recognise it; I tried gsub("’","",df) but this didn't work, just as with rowSums.
Is this a problem with the encoding? Is there a regex solution to removing them? What solutions are there for dealing with this?
It is an encoding issue, not a regex one. I am unable to reproduce the problem, and my encoding is set to UTF-8 in R. Try setting the encoding to UTF-8 in your R session defaults rather than only at read time.
Here is my sample output with your code:
> x
q1 q2 q3 dntknw
1 yes no yes 0
2 no no don’t_know 1
3 don’t_know don’t_know don’t_know 3
> Sys.setlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
Here is some more detail that may be helpful.
https://support.rstudio.com/hc/en-us/articles/200532197-Character-Encoding
As @Drj stated, it is probably an encoding error. When I paste your code into my console, I get
> x$q1
[1] yes no don<U+0092>t_know
Even if the encoding is off, you can still match it using regex:
grepl("don.+t_know", x$q1)
# [1] FALSE FALSE TRUE
Hence, you can calculate the row sums as follows:
x$dntknw <- rowSums(apply(x, 2, function(y) grepl("don.+t_know", y)))
Which results in
> x
q1 q2 q3 dntknw
1 yes no yes 0
2 no no don<U+0092>t_know 1
3 don<U+0092>t_know don<U+0092>t_know don<U+0092>t_know 3
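If the character really is the curly right quote (U+2019), another option is to write it as a Unicode escape, which sidesteps any mismatch between the script file's encoding and the console's. A sketch, assuming that is the code point in your data:

```r
# Rebuild the example with the apostrophe written as an escape, so the
# string literal does not depend on the source file's encoding:
x <- data.frame("q1" = c("yes", "no", "don\u2019t_know"),
                "q2" = c("no", "no", "don\u2019t_know"),
                "q3" = c("yes", "don\u2019t_know", "don\u2019t_know"))

# The same escape in the comparison matches exactly:
x$dntknw <- rowSums(x == "don\u2019t_know")
# one count per row: 0, 1 and 3
```

The same escape works inside gsub("\u2019", "", ...) if you want to strip the character instead of matching it.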

R (RStudio): View error 'names' attribute [4] must be the same length as the vector [1] for character vector

This has been happening recently, and I cannot understand how to resolve it. N.B. I am using RStudio v0.99.893.
I have created a character vector from a data.table, which I then attempt to View, and receive the above error:
Error in View : 'names' attribute [4] must be the same length as the vector [1]
The original DT has ~10,000 observations of 12 variables, here is a subset capturing all classes:
> head(DT, 3)
HQ URL type ID1 ID2 completion date_first
1: imag image-welcome basic 444 24 0.1111111 2016-01-04 14:55:57
2: imag image-welcome basic 329 12 0.2222222 2016-03-15 11:37:21
3: imag image-confirm int 101 99 0.1111111 2016-01-06 20:55:07
as.character(sapply(DT, class))
[1] "character" "character" "character" "integer"
[5] "integer" "numeric" "c(\"POSIXct\", \"POSIXt\")"
From DT I create a character vector of the unique values of URL for a subset of interest (only 'imag' HQ):
URL.unique <- unique(DT[HQ == "imag", URL])
> class(URL.unique)
[1] "character"
> names(URL.unique)
NULL
> View(URL.unique)
Error in View : 'names' attribute [4] must be the same length as the vector [1]
> length(URL.unique)
[1] 262
Printing URL.unique in the console works fine, as does exporting it via write.table() but it is annoying that I cannot view it.
Unless there is something implicitly incorrect about the above, I am resorting to reinstalling RStudio. I've already tried quitting and relaunching, just in case there was some issue, as I tend to leave multiple projects open on my computer over days.
Any help would be appreciated!
As noted by @Jonathan, this is currently filed with RStudio to investigate. I can confirm that reinstalling and other measures did not resolve the issue, which still persists. If it is reproduced and filed as a bug, I would ask @Jonathan to post the details here for anyone else to tie into.
The workaround of View(data.frame(u = URL.unique)) does the job of launching the viewer on the data object of interest (thanks @Frank).
I am using View(as.matrix(df$col_name)) and it seems to be working well.

Display vector as rows instead of columns in Octave

I'm new to Octave and running into a formatting issue which I can't seem to fix. If I display a variable with multiple columns I get something along the lines of:
Columns 1 through 6:
0.75883 0.93290 0.40064 0.43818 0.94958 0.16467
However what I would really like to have is:
0.75883
0.93290
0.40064
0.43818
0.94958
0.16467
I've read the format documentation here but haven't been able to make the change. I'm running Octave 3.6.4 on Windows; I've previously used Octave 3.2.x on Windows and seen it produce the desired output by default.
To be specific, in case it matters, I am using the fir1 command as part of the signal package and these are sample outputs that I might see.
It sounds like, as Dan suggested, you want to display the transpose of your vector, i.e. a row vector rather than a column vector:
>> A = rand(1,20)
A =
Columns 1 through 7:
0.681499 0.093300 0.490087 0.666367 0.212268 0.456260 0.532721
Columns 8 through 14:
0.850320 0.117698 0.567046 0.405096 0.333689 0.179495 0.942469
Columns 15 through 20:
0.431966 0.100049 0.650319 0.459100 0.613030 0.779297
>> A'
ans =
0.681499
0.093300
0.490087
0.666367
0.212268
0.456260
0.532721
0.850320
0.117698
0.567046
0.405096
0.333689
0.179495
0.942469
0.431966
0.100049
0.650319
0.459100
0.613030
0.779297

CRAN-R: Subset a dataframe bug or violation of semantics?

Today I was confronted with a bug in my code caused by a dataframe subset operation. I would like to know whether the problem I found is a bug or whether I am violating R semantics.
I am running RHEL x86_64 with R 2.15.2-61015 (Trick or Treat). I am using the subset function from the base package.
The following code should be reproducible and it was run on a clean R console initiated for the purpose of this test.
>teste <-data.frame(teste0=c(1,2,3),teste1=c(3,4,5))
>teste0<-1
>teste1<-1
>subset(teste,teste[,"teste0"]==1 & teste[,"teste1"]==1)
[1] teste0 teste1
<0 rows> (or 0-length row.names)
>subset(teste,teste[,"teste0"]==teste0 & teste[,"teste1"]==teste1)
teste0 teste1
1 1 3
2 2 4
3 3 5
However, if I run the logical test outside the subset operation:
>teste[,"teste0"]==teste0 & teste[,"teste1"]==teste1
[1] FALSE FALSE FALSE
I would expect both subset operations to yield an empty dataframe. However, the second one returns the complete dataframe. Is this a bug, or am I missing something about R environments and namespaces?
Thank you for your help,
Miguel
In this statement:
subset(teste,teste[,"teste0"]==teste0 & teste[,"teste1"]==teste1)
subset() evaluates the condition with the data frame's columns in scope first, so teste0 means teste$teste0 (and likewise teste1). Each column compares equal to itself on every row, which is why the full dataframe comes back.
In this statement:
teste[,"teste0"]==teste0 & teste[,"teste1"]==teste1
teste0 and teste1 are the vectors you defined above (not columns of the data frame).
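One way to avoid the clash is to give the outside variables names that are not also columns of the data frame, so subset()'s look-inside-the-data-first evaluation cannot shadow them. A minimal sketch of the same test:

```r
teste <- data.frame(teste0 = c(1, 2, 3), teste1 = c(3, 4, 5))

# v0/v1 are not columns of `teste`, so subset() falls back to the
# enclosing environment when it evaluates the condition:
v0 <- 1
v1 <- 1
res <- subset(teste, teste0 == v0 & teste1 == v1)
nrow(res)  # 0, matching the logical test run outside subset()
```

This shadowing behaviour is also why ?subset warns that the function is intended for interactive use and should be avoided in programming.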

Import DAT File - Parsing Issue

I have a tab-delimited DAT file that I want to read into R. When I import the data using read.delim, my data frame has the correct number of columns, but more rows than expected.
My data file represents responses to a survey. After digging a little deeper, it appears that R creates a new record when there is a "." in a column that holds an open-ended response. It appears that respondents may sometimes have hit "enter" to add a new line.
Is there a way to get around this? I read the help, but I am not sure how to tell R to ignore this character in the character response.
Here is an example response that parses incorrectly. This is one response, but you can see that there are returns that put this onto multiple lines when parsed by R.
possible ask for size before giving free tshirt.
Also maybe have the interview in conference rooms instead of tight offices. I felt very cramped.
I would of loved to have gone, but just had to make a choices and had more options then I expected.
I am analyzing the data with SPSS, and the data are brought in fine there; however, I need to use R for more advanced modeling.
Any help will be greatly appreciated. Thanks in advance.
There is an 'na.strings' argument. You don't offer any test case, but perhaps you can do this:
read.delim(file="myfil.DAT", na.strings=".")
I think it would be good if you could edit your question to better demonstrate the problem. I cannot produce an error with a simple attempt:
> read.delim(text="a\tb\t.\nc\td\te\n",header=FALSE)
V1 V2 V3
1 a b .
2 c d e
> read.delim(text="a\tb\t.\nc\td\te\n",header=FALSE, na.strings=".")
V1 V2 V3
1 a b <NA>
2 c d e
(After the clarification that above comments are not particularly relevant.) This will bring in a field that has a linefeed in it .... but it requires that the "field" be quoted in the original file:
> scan(file=textConnection("'a\nb'\nx\t.\nc\td\te\n"), what=list("","","") )
Read 2 records
[[1]]
[1] "a\nb" "c"
[[2]]
[1] "x" "d"
[[3]]
[1] "." "e"
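If quoting the multi-line fields in the source file isn't an option, another approach (a sketch, assuming every complete record has a known, fixed number of tab-separated fields) is to read the raw lines first and glue any line with too few tabs onto the record before it:

```r
# Toy stand-in for the DAT file: the second physical line is really a
# continuation of the first record's open-ended answer.
raw <- c("a\tb\tfirst line", "continued answer", "c\td\te")

n_fields <- 3L
recs <- character(0)
for (line in raw) {
  # Count the tab characters on this physical line.
  n_tabs <- length(regmatches(line, gregexpr("\t", line))[[1]])
  if (n_tabs < n_fields - 1L && length(recs) > 0) {
    # Too few tabs: treat this line as a continuation of the previous record.
    recs[length(recs)] <- paste(recs[length(recs)], line)
  } else {
    recs <- c(recs, line)
  }
}
df <- read.delim(text = paste(recs, collapse = "\n"), header = FALSE)
df  # two rows, three columns
```

In practice you would replace `raw` with readLines("yourfile.DAT"); the heuristic breaks down if an open-ended answer itself happens to contain tabs.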
