I am having trouble displaying Russian characters in the RStudio console. I load an Excel file containing Russian text with the readxl package. The Cyrillic displays properly in the data frame viewer. However, if I run a function whose output includes the variable names, the RStudio console displays symbols instead of the proper Cyrillic characters.
test.xlsx contains two columns: зависимая переменная (dependent variable, numeric) and независимая переменная (independent variable, factor).
зависимая_переменная независимая_переменная
5 а
6 б
8 в
8 а
7.5 б
6 в
5 а
4 б
3 в
2 а
5 б
My code:
Sys.setlocale(locale = "Russian")
install.packages("readxl")
require(readxl)
basetable <- readxl::read_excel('test.xlsx',sheet = 1)
View(basetable)
basetable$независимая_переменная <- as.factor(basetable$независимая_переменная)
str(basetable)
This is what I get for the output of the str function:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 11 obs. of 2 variables:
$ çàâèñèìàÿ_ïåðåìåííàÿ : num 5 6 8 8 7.5 6 5 4 3 2 ...
$ íåçàâèñèìàÿ_ïåðåìåííàÿ: Factor w/ 3 levels "а","б","в": 1 2 3 1 2 3 1 2 3 1 ...
I want to have the variable names displayed properly in Russian because I will be building many models from this data. For reference, here is my sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251
[3] LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C
[5] LC_TIME=Russian_Russia.1251
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readxl_0.1.1 shiny_0.13.1 dplyr_0.4.3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.2 digest_0.6.9 assertthat_0.1 mime_0.4
[5] chron_2.3-47 R6_2.1.2 xtable_1.8-2 jsonlite_0.9.19
[9] DBI_0.3.1 magrittr_1.5 lazyeval_0.1.10 data.table_1.9.6
[13] tools_3.2.3 httpuv_1.3.3 parallel_3.2.3 htmltools_0.3
Try changing the encoding of the data frame's column names to UTF-8:
Encoding(colnames(YOURDATAFRAME)) <- "UTF-8"
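As a minimal, self-contained sketch (the data frame below is invented to mirror the question's columns):

```r
# Toy data frame with Cyrillic column names (illustrative only).
df <- data.frame(x = c(5, 6, 8), y = c("а", "б", "в"))
names(df) <- c("зависимая_переменная", "независимая_переменная")

# Re-declare the names' encoding: this relabels the existing bytes
# as UTF-8, it does not convert them.
Encoding(colnames(df)) <- "UTF-8"
str(df)
```

Note that this only helps when the bytes already are valid UTF-8 but mislabelled; if the names actually arrived in CP1251, `iconv(colnames(df), "CP1251", "UTF-8")` would be the conversion to try instead.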
I have a dataframe that I'd like to add new columns to, but the calculation depends on values in another dataframe that holds instructions.
I have created a reproducible example below (although in reality there are quite a few more columns).
input dataframes:
base <- data.frame("A"=c("orange","apple","banana"),
"B"=c(5,3,6),
"C"=c(7,12,4),
"D"=c(5,2,7),
"E"=c(1,18,4))
key <- data.frame("cols"=c("A","B","C","D","E"),
"include"=c("no","no","yes","no","yes"),
"subtract"=c("na","A","B","C","D"),
"names"=c("na","G","H","I","J"))
desired output dataframe:
output <- data.frame("A"=c("orange","apple","banana"),
"B"=c(5,3,6),
"C"=c(7,12,4),
"D"=c(5,2,7),
"E"=c(1,18,4),
"H"=c(2,9,-2),
"J"=c(-4,16,-3))
The key dataframe has a row for each column in the base dataframe, and an "include" column that must be set to "yes" for any calculation to be done. If it is set to "yes", I want to add a new column, named as given in "names", that subtracts the column given in "subtract".
For example, column "C" in the base dataframe is set to be included, so I want to create a new column called "H" that holds the values of column "C" minus the values of column "B".
I thought I could do this with a loop but my attempts have not been successful and my searches have not found anything that helped (I'm a bit new). Any help would be much appreciated.
sessionInfo():
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
Here is a base R option
k <- subset(key, include == "yes")
output <- cbind(base, setNames(base[k[["cols"]]] - base[k[["subtract"]]], k$names))
and we will get
> output
A B C D E H J
1 orange 5 7 5 1 2 -4
2 apple 3 12 2 18 9 16
3 banana 6 4 7 4 -2 -3
Does the following work for you?
output <- base
for(i in which(key[["include"]] == "yes")){
key.row <- key[i, ]
output[[key.row[["names"]]]] <- base[[key.row[["cols"]]]] - base[[key.row[["subtract"]]]]
}
Result:
> output
A B C D E H J
1 orange 5 7 5 1 2 -4
2 apple 3 12 2 18 9 16
3 banana 6 4 7 4 -2 -3
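For completeness, the same key-driven rule can also be written without an explicit loop, using Map() in base R (base and key are re-created here so the snippet stands alone; stringsAsFactors = FALSE keeps the lookup columns as plain strings):

```r
base <- data.frame(A = c("orange", "apple", "banana"),
                   B = c(5, 3, 6), C = c(7, 12, 4),
                   D = c(5, 2, 7), E = c(1, 18, 4))
key <- data.frame(cols     = c("A", "B", "C", "D", "E"),
                  include  = c("no", "no", "yes", "no", "yes"),
                  subtract = c("na", "A", "B", "C", "D"),
                  names    = c("na", "G", "H", "I", "J"),
                  stringsAsFactors = FALSE)

k <- key[key$include == "yes", ]                       # rows that request a new column
new_cols <- Map(function(a, b) base[[a]] - base[[b]],  # each column minus its "subtract" column
                k$cols, k$subtract)
names(new_cols) <- k$names
output <- cbind(base, new_cols)
```

This computes the same H and J columns as the answers above; Map pairs up each included column with the column it should subtract.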
I loaded a UTF-8 CSV file with Japanese characters in it; its str output looks like this:
> str(purchases)
'data.frame': 168996 obs. of 7 variables:
$ ITEM_COUNT : int 1 1 1 1 1 1 2 2 1 1 ...
$ I_DATE : Date, format: "2012-03-28" "2011-07-04" ...
$ SMALL_AREA_NAME: Factor w/ 55 levels "キタ","ミナミ他",..: 6 47 26 26 26 26 26 35 35 26 ...
$ USER_ID_hash : Factor w/ 22782 levels "0000b53e182165208887ba65c079fc21",..: 19467 7623 7623 7623 7623 7623 7623 7623 7623 7623 ...
$ COUPON_ID_hash : Factor w/ 19368 levels "000eba9b783cec10658308b5836349f6",..: 3929 8983 5982 5982 5982 5982 5982 2737 18489 5018 ...
$ category : Factor w/ 13 levels "Beauty","Delivery service",..: 2 3 2 2 2 2 2 7 2 3 ...
So I think there's nothing wrong with my encoding or locale (en_US.UTF-8)? But when I plot with
> barplot(table(purchases$SMALL_AREA_NAME))
why do the Japanese characters turn into little blocks?
I think I have a font that can display Japanese characters:
> names(X11Fonts())
[1] "serif" "sans" "mono" "Times" "Helvetica"
[6] "CyrTimes" "CyrHelvetica" "Arial" "Mincho"
Additional info:
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_1.0.1
You may want to take a look at the showtext package, which allows you to use different fonts in R graphs. It also ships with a CJK font that can be used directly.
Try running the code below:
library(showtext)
showtext.auto()
## ... code to generate data
barplot(table(purchases$SMALL_AREA_NAME))
I have a data.table dt with ids in the column idnum, and a data.table ids that contains a list of ids in the column idnum (all of which exist in dt).
I want to get:
The intersection: dt where dt.idnum == ids.idnum
The complement of the intersection: dt where dt.idnum is not in ids.idnum
I got the first one with ease using
setkey(dt, idnum)
setkey(ids, idnum)
dt[ids]
However, I'm stuck on the second one. My approach was
dt[is.element(idnum, ids[, idnum]) == FALSE]
However, the row counts of the two groups do not add up to nrow(dt), so I suspect the second command is wrong. What can I do instead, and where am I going wrong? Is there perhaps a more efficient way of computing the second group, given that it is the complement of the first group and I already have that one?
Update
I tried the approach given in the answer, but my numbers don't add up:
> nrow(x[J(ids$idnum)])
[1] 148
> nrow(x[!J(ids$idnum)])
[1] 52730
> nrow(x)
[1] 52863
The first two numbers add up to 52878; that is, I have 15 rows too many. My data contains duplicate idnum values; could that be the reason?
Here's some description of the data I used:
> str(x)
Classes 'data.table' and 'data.frame': 52863 obs. of 1 variable:
$ idnum: int 6 6 11 21 22 22 22 22 27 27 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "idnum"
> head(x)
idnum
1: 6
2: 6
3: 11
4: 21
5: 22
6: 22
> str(ids)
Classes 'data.table' and 'data.frame': 46 obs. of 1 variable:
$ idnum: int 2909 5012 5031 5033 5478 6289 6405 6519 7923 7940 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "idnum"
> head(ids)
idnum
1: 2909
2: 5012
3: 5031
4: 5033
5: 5478
6: 6289
and here is
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] C/C/C/C/C/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] yaml_2.1.13 ggplot2_1.0.0 mFilter_0.1-3
[4] data.table_1.9.4 foreign_0.8-61
loaded via a namespace (and not attached):
[1] MASS_7.3-35 Rcpp_0.11.3 chron_2.3-45
[4] colorspace_1.2-4 digest_0.6.4 grid_3.1.1
[7] gtable_0.1.2 labeling_0.3 munsell_0.4.2
[10] plyr_1.8.1 proto_0.3-10 reshape2_1.4
[13] scales_0.2.4 stringr_0.6.2 tools_3.1.1
Here is one way:
library(data.table)
set.seed(1) # for reproducible example
dt <- data.table(idnum = 1:1e5, x = rnorm(1e5))  # 100,000 rows, unique ids
ids <- data.table(idnum = sample(1:1e5, 10))     # 10 random ids
setkey(dt,idnum)
result.1 <- dt[J(ids$idnum)]   # inclusive set (records with common ids)
result.2 <- dt[!J(ids$idnum)]  # exclusive set (records from dt with ids$idnum excluded)
any(result.2$idnum %in% result.1$idnum)
# [1] FALSE
EDIT: Response to the OP's comment.
Comparing the number of rows is not meaningful. The join returns rows corresponding to all matches, so if a given idnum is present twice in dt and three times in ids, you will get 2 x 3 = 6 rows in the result for that id. The important test is the one I did: are any of the ids in result.1 also present in result.2? If so, then something is wrong.
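To make that row arithmetic concrete, here is a tiny base-R illustration (using merge() rather than a keyed data.table join, purely to show the counts; the numbers are invented):

```r
dt_small  <- data.frame(idnum = c(7, 7), x = 1:2)  # idnum 7 appears twice
ids_small <- data.frame(idnum = c(7, 7, 7))        # idnum 7 appears three times

# Every occurrence on one side matches every occurrence on the other,
# so the join produces 2 x 3 = 6 rows.
nrow(merge(dt_small, ids_small, by = "idnum"))  # 6
```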
If you have duplicated ids$idnum, try:
result.1 <- dt[J(unique(ids$idnum))] # inclusive set (records with common ids)
I am trying to use normalize.loess() through lumiN() from the lumi package.
At the 38th iteration, the loess() function fails with
Error in simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize, :
NA/NaN/Inf in foreign function call (arg 1)
I have searched, and it may be related to a missing argument, but I checked with debug(loess) and all arguments are defined.
I cannot post the data because the matrix is very large (13237x566) and also confidential, but I found the following:
a minimal example works (a random 20x5 matrix)
normalization fails between columns 1 and 38
the same normalization using only those columns completes successfully
it is not a memory issue
the matrix has no NA values
What am I missing?
Thanks
Code
raw_matrix <- lumiR('example.txt')
norm_matrix <- lumiN(raw_matrix, method='loess')
Perform loess normalization ...
Done with 1 vs 2 in iteration 1
Done with 1 vs 3 in iteration 1
...
Done with 1 vs 37 in iteration 1
Done with 1 vs 38 in iteration 1
Error in simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize, :
NA/NaN/Inf in foreign function call (arg 1)
Environment
My sessionInfo() is
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] affy_1.38.1 lumi_2.12.0 Biobase_2.20.0
[4] BiocGenerics_0.6.0 BiocInstaller_1.10.2
loaded via a namespace (and not attached):
[1] affyio_1.28.0 annotate_1.38.0 AnnotationDbi_1.22.6
[4] beanplot_1.1 Biostrings_2.28.0 colorspace_1.2-4
[7] DBI_0.2-7 GenomicRanges_1.12.5 grid_3.0.2
[10] illuminaio_0.2.0 IRanges_1.18.1 KernSmooth_2.23-10
[13] lattice_0.20-24 limma_3.16.8 MASS_7.3-29
[16] Matrix_1.0-14 matrixStats_0.8.12 mclust_4.2
[19] methylumi_2.6.1 mgcv_1.7-27 minfi_1.6.0
[22] multtest_2.16.0 nleqslv_2.0 nlme_3.1-111
[25] nor1mix_1.1-4 preprocessCore_1.22.0 RColorBrewer_1.0-5
[28] reshape_0.8.4 R.methodsS3_1.5.2 RSQLite_0.11.4
[31] siggenes_1.34.0 splines_3.0.2 stats4_3.0.2
[34] survival_2.37-4 tcltk_3.0.2 tools_3.0.2
[37] XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.6.0
I somehow figured out what was not working:
I was trying to normalize a matrix that had already been log2-transformed. As far as I know, normalize.loess by default log-transforms the input matrix, so my data was going to be log-transformed twice.
This was a problem because some values in the input matrix were equal to 1, so:
log2(log2(1)) = log2(0) = -Inf
which is clearly not an allowed value during normalization.
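The arithmetic is easy to check directly in R:

```r
log2(1)        # first log: values equal to 1 become 0
log2(log2(1))  # second log: log2(0) is -Inf, which breaks the normalization
```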
Hope this helps someone.
I am working intensively with the amazing ff and ffbase packages.
Due to some technical details, I have to work on my C: drive during my R session. Afterwards, I move the generated files to my P: drive (using cut/paste in Windows, not using ff).
The problem is that when I load the ffdf object:
load.ffdf("data")
I get the error:
Error: file.access(filename, 0) == 0 is not TRUE
This is expected, because nobody told the ffdf object that it was moved, but trying:
filename(data$x) <- "path/data_ff/x.ff"
or
pattern(data) <- "./data_ff/"
does not help either, giving the error:
Error in `filename<-.ff`(`*tmp*`, value = filename) :
ff file rename from 'C:/DATA/data_ff/id.ff' to 'P:/DATA_C/data_ff/e84282d4fb8.ff' failed.
Is there any way to "change" the file paths stored in the ffdf object to the files' new location?
Thank you!
If you want to 'correct' your filenames afterwards you can use:
physical(x)$filename <- "newfilename"
For example:
> a <- ff(1:20, vmode="integer", filename="./a.ff")
> saveRDS(a, "a.RDS")
> rm(a)
> file.rename("./a.ff", "./b.ff")
[1] TRUE
> b <- readRDS("a.RDS")
> b
ff (deleted) integer length=20 (20)
> physical(b)$filename <- "./b.ff"
> b[]
opening ff ./b.ff
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Using filename() in the first session would of course have been easier. You could also have a look at the save.ffdf and corresponding load.ffdf functions in the ffbase package, which make this even simpler.
Addition
To rename the filenames of all columns in an ffdf you can use the following function:
redir <- function(ffdf, newdir) {
  for (x in physical(ffdf)) {
    fn <- basename(filename(x))
    physical(x)$filename <- file.path(newdir, fn)
  }
  return(ffdf)
}
You can also use ff:::clone()
R> foo <- ff(1:20, vmode = "integer")
R> foo
ff (open) integer length=20 (20)
[1] [2] [3] [4] [5] [6] [7] [8] [13] [14] [15] [16] [17] [18] [19]
1 2 3 4 5 6 7 8 : 13 14 15 16 17 18 19
[20]
20
R> physical(foo)$filename
[1] "/vol/fftmp/ff69be3e90e728.ff"
R> bar <- clone(foo, pattern = "~/")
R> bar
ff (open) integer length=20 (20)
[1] [2] [3] [4] [5] [6] [7] [8] [13] [14] [15] [16] [17] [18] [19]
1 2 3 4 5 6 7 8 : 13 14 15 16 17 18 19
[20]
20
R> physical(bar)$filename
[1] "/home/ubuntu/69be5ec0cf98.ff"
From what I understand from briefly skimming the code of save.ffdf and load.ffdf, those functions do this for you when you save/load.