I am surprised that a copy of the matrix is made in the following code:
> (m <- matrix(1:12, nrow = 3))
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
> tracemem(m)
[1] "<000001E2FC1E03D0>"
> str(m)
int [1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
> attr(m, "dim") <- 4:3
tracemem[0x000001e2fc1e03d0 -> 0x000001e2fcb05008]:
> m
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
> str(m)
int [1:4, 1:3] 1 2 3 4 5 6 7 8 9 10 ...
Is it useful? Is it avoidable?
EDIT: I do not have the same results as GKi.
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.3 tools_4.0.3
> m <- matrix(1:12, nrow = 3)
> tracemem(m)
[1] "<000001F8DB2C7D90>"
> attr(m, "dim") <- c(4, 3)
tracemem[0x000001f8db2c7d90 -> 0x000001f8db2d93f0]:
One difference is that I do not use a BLAS library...
I'm using R 3.6.3 and indeed a copy is made. To change an attribute without making a copy, you can use the setattr function of the data.table package:
library(data.table)
m <- matrix(1:12, nrow = 3)
.Internal(inspect(m))
setattr(m, "dim", c(4L,3L))
.Internal(inspect(m))
In my case it is not making a copy of the data:
m <- matrix(1:12, nrow = 3)
.Internal(inspect(m))
##250ff98 13 INTSXP g0c4 [REF(1),ATT] (len=12, tl=0) 1,2,3,4,5,...
#ATTRIB:
# #38da270 02 LISTSXP g0c0 [REF(1)]
# TAG: #194d610 01 SYMSXP g0c0 [MARK,REF(1171),LCK,gp=0x4000] "dim" (has value)
# #38c3d88 13 INTSXP g0c1 [REF(65535)] (len=2, tl=0) 3,4
attr(m, "dim") <- 4:3
.Internal(inspect(m))
##250ff98 13 INTSXP g0c4 [REF(1),ATT] (len=12, tl=0) 1,2,3,4,5,...
#ATTRIB:
# #38da270 02 LISTSXP g0c0 [REF(1)]
# TAG: #194d610 01 SYMSXP g0c0 [MARK,REF(1171),LCK,gp=0x4000] "dim" (has value)
# #38d9978 13 INTSXP g0c0 [REF(65535)] 4 : 3 (expanded)
The data vector was at #250ff98 before the assignment and is still at that address afterwards. Only the dim attribute changes, from #38c3d88 to #38d9978.
sessionInfo()
#R version 4.0.3 (2020-10-10)
#Platform: x86_64-pc-linux-gnu (64-bit)
#Running under: Debian GNU/Linux 10 (buster)
#
#Matrix products: default
#BLAS: /usr/local/lib/R/lib/libRblas.so
#LAPACK: /usr/local/lib/R/lib/libRlapack.so
#
#locale:
# [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
# [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
# [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
#[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#
#attached base packages:
#[1] stats graphics grDevices utils datasets methods base
#
#loaded via a namespace (and not attached):
#[1] compiler_4.0.3 tools_4.0.3
The same with tracemem.
m <- matrix(1:12, nrow = 3)
tracemem(m)
#[1] "<0x289ff98>"
attr(m, "dim") <- 4:3
tracemem(m)
#[1] "<0x289ff98>"
But if you call str(m) in between, a copy is currently made:
m <- matrix(1:12, nrow = 3)
tracemem(m)
#[1] "<0x28a01c8>"
str(m)
# int [1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
attr(m, "dim") <- 4:3
#tracemem[0x28a01c8 -> 0x2895608]:
I get a numeric to integer64 type conversion after melting a data.table object in R.
Given the file stats.txt, tab separated:
id x y
A 283726709252 0.1
B 288604342155 0.2
C 329048184196 0.3
D 192107948937 0.4
I want to read it into a data.table and melt it. So:
library(data.table)
stats<- fread('stats.txt')
stats
id x y
1: A 283726709252 0.1
2: B 288604342155 0.2
3: C 329048184196 0.3
4: D 192107948937 0.4
str(stats)
Classes ‘data.table’ and 'data.frame': 4 obs. of 3 variables:
$ id: chr "A" "B" "C" "D"
$ x :integer64 283726709252 288604342155 329048184196 192107948937
$ y : num 0.1 0.2 0.3 0.4
- attr(*, ".internal.selfref")=<externalptr>
So far so good. Now if I melt it, I get the y variable converted from numeric to integer64:
xm<- melt.data.table(data= stats, id.vars= 'id')
xm
id variable value
1: A x 283726709252
2: B x 288604342155
3: C x 329048184196
4: D x 192107948937
5: A y 4591870180066957722
6: B y 4596373779694328218
7: C y 4599075939470750515
8: D y 4600877379321698714
str(xm)
Classes ‘data.table’ and 'data.frame': 8 obs. of 3 variables:
$ id : chr "A" "B" "C" "D" ...
$ variable: Factor w/ 2 levels "x","y": 1 1 1 1 2 2 2 2
$ value :integer64 283726709252 288604342155 329048184196 192107948937 4591870180066957722 4596373779694328218 4599075939470750515 4600877379321698714
- attr(*, ".internal.selfref")=<externalptr>
Is this a bug or am I doing something wrong?
sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_2.2.1 data.table_1.10.4-3
loaded via a namespace (and not attached):
[1] compiler_3.4.1 colorspace_1.3-2 scales_0.5.0 lazyeval_0.2.0 plyr_1.8.4 gtable_0.2.0 tibble_1.3.3 Rcpp_0.12.12 grid_3.4.1 rlang_0.1.1 munsell_0.4.3
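The large values in the y rows are consistent with the 8 bytes of each double being reinterpreted as a 64-bit integer rather than converted. A minimal base R sketch (it needs neither data.table nor bit64) prints the bit pattern of 0.1 in hex; 0x3fb999999999999a is 4591870180066957722 in decimal, which is the value shown for row 5 of xm.

```r
# Raw bytes of the double 0.1, most significant byte first.
bytes <- writeBin(0.1, raw(), size = 8, endian = "big")

# Collapse the bytes into one hex string.
hex <- paste(as.character(bytes), collapse = "")
hex
# [1] "3fb999999999999a"
```

If that reading is right, a possible workaround (an assumption, not tested against data.table 1.10.4-3) is to give both measure columns the same type before melting, for example by reading x as double with fread's integer64 = "double" argument.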
I am attempting an exercise in R for Data Science (7.5.2.1, #2): Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
First, transmute columns.
library(dplyr)  # provides %>%, transmute(), filter(), group_by(), summarise()
library(nycflights13)
foo <- nycflights13::flights %>%
transmute(tot_delay = dep_delay + arr_delay, m = month, d = dest) %>%
filter(!is.na(tot_delay)) %>%
group_by(m, d) %>%
summarise(avg_delay = mean(tot_delay))
Now foo appears to be a data frame based on the 'Source' output.
> foo
Source: local data frame [1,112 x 3]
Groups: m [?]
m d avg_delay
<int> <chr> <dbl>
1 1 ALB 76.571429
2 1 ATL 8.567982
3 1 AUS 19.017751
4 1 AVL 49.000000
5 1 BDL 32.081081
6 1 BHM 47.043478
7 1 BNA 25.930233
8 1 BOS 2.698517
9 1 BQN 8.516129
10 1 BTV 18.393665
# ... with 1,102 more rows
It doesn't appear that as_tibble is working. What could I be doing wrong?
> as_tibble(foo)
Source: local data frame [1,112 x 3]
Groups: m [?]
m d avg_delay
<int> <chr> <dbl>
1 1 ALB 76.571429
2 1 ATL 8.567982
3 1 AUS 19.017751
4 1 AVL 49.000000
5 1 BDL 32.081081
6 1 BHM 47.043478
7 1 BNA 25.930233
8 1 BOS 2.698517
9 1 BQN 8.516129
10 1 BTV 18.393665
# ... with 1,102 more rows
Shouldn't the internals be different for a tibble?
> str(foo)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1112 obs. of 3 variables:
$ m : int 1 1 1 1 1 1 1 1 1 1 ...
$ d : chr "ALB" "ATL" "AUS" "AVL" ...
$ avg_delay: num 76.57 8.57 19.02 49 32.08 ...
- attr(*, "vars")=List of 1
..$ : symbol m
- attr(*, "drop")= logi TRUE
> str(as_tibble(foo))
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1112 obs. of 3 variables:
$ m : int 1 1 1 1 1 1 1 1 1 1 ...
$ d : chr "ALB" "ATL" "AUS" "AVL" ...
$ avg_delay: num 76.57 8.57 19.02 49 32.08 ...
- attr(*, "vars")=List of 1
..$ : symbol m
- attr(*, "drop")= logi TRUE
Note that as_tibble() works as expected
> packageDescription("tibble")
Package: tibble
Encoding: UTF-8
Version: 1.3.0
> is_tibble(foo)
[1] TRUE
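One way to see why as_tibble() appears to change nothing is to look at the class vector: the str() output already lists tbl_df, which is the tibble class. The sketch below uses base R only; foo_like is a hypothetical stand-in that copies the class vector from the str() output, not the real dplyr result, and inherits() is, as far as I know, essentially the check that is_tibble() performs.

```r
# Hypothetical stand-in for foo: a one-row data frame given the same
# class vector that str(foo) reports (assumption: no dplyr attributes needed
# as long as the object is only inspected, not printed or manipulated).
foo_like <- structure(
  data.frame(m = 1L, d = "ALB", avg_delay = 76.571429),
  class = c("grouped_df", "tbl_df", "tbl", "data.frame")
)

# The object is already a tibble (tbl_df) and additionally grouped.
is_tbl <- inherits(foo_like, "tbl_df")
is_grouped <- inherits(foo_like, "grouped_df")
c(tibble = is_tbl, grouped = is_grouped)
```

Since the object is already a tibble, converting it again has nothing to change; the print header differs only because of how the installed package versions format it.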
Works for me - foo is a "tibble" and is announced as "A tibble: 1,112 x 3" in the print:
> foo
Source: local data frame [1,112 x 3]
Groups: m [?]
# A tibble: 1,112 x 3
m d avg_delay
<int> <chr> <dbl>
1 1 ALB 76.571429
2 1 ATL 8.567982
So you possibly have an old version of dplyr. Mine is:
> packageDescription("dplyr")
Package: dplyr
Type: Package
Version: 0.5.0
And everything else:
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.5.0 tibble_1.3.1
loaded via a namespace (and not attached):
[1] magrittr_1.5 R6_2.2.0 assertthat_0.2.0 DBI_0.5-1
[5] tools_3.3.1 Rcpp_0.12.11 rlang_0.1.1
I'm sorry to bother you with what is probably an encoding question. After spending a couple of hours without finding a solution, I decided to post it here.
I'm trying, unsuccessfully, to write a simple table using write.table, write.csv, or write.csv2 from Ubuntu 14.04. My data is kind of messy, resulting from a cronjob:
ID <- c("",30,26,20,30,40,5,10,4)
b <- c("",2233,12,2,22,13,23,23,100)
c <- c("","","","","","","","","")
d <- c("","","","","","","","","")
e <- c("","","","","","800","","","")
f <- c("","","","","","","","","")
g <- c("","","","","","","","EA","")
h <- c("","","","","","","","","")
df <- data.frame(ID,b,c,d,e,f,g,h)
# change all columns to chr
for (i in seq_len(ncol(df))) {
  df[, i] <- as.character(df[, i])
}
str(df)
# 'data.frame': 9 obs. of 8 variables:
# $ ID: chr "" "30" "26" "20" ...
# $ b : chr "" "2233" "12" "2" ...
# $ c : chr "" "" "" "" ...
# $ d : chr "" "" "" "" ...
# $ e : chr "" "" "" "" ...
# $ f : chr "" "" "" "" ...
# $ g : chr "" "" "" "" ...
# $ h : chr "" "" "" "" ...
head(df,n=9)
# ID b c d e f g h
# 1
# 2 30 2233
# 3 26 12
# 4 20 2
# 5 30 22
# 6 40 13 800
# 7 5 23
# 8 10 23 EA
# 9 4 100
I have tried different combinations and suggestions found on SO, but nothing worked. The result is always displaced somehow: it comes out wide instead of long. In the current example it is just one long row.
I tried:
write.table(df,"df.csv",row.names = FALSE, dec=".",sep=";")
write.table(df,"df.csv",row.names = FALSE,dec=".",sep=";", col.names = TRUE)
write.table(df,"df.csv",row.names = FALSE,sep=";",fileEncoding = "UTF-8")
write.table(df,"df.csv",row.names = FALSE,fileEncoding = "UTF-8")
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8
[4] LC_COLLATE=de_DE.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.4.3 DBI_0.4-1 RGA_0.4.2 RMySQL_0.11-3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.5 lubridate_1.5.6 digest_0.6.9 assertthat_0.1 R6_2.1.2
[6] plyr_1.8.3 jsonlite_1.0 magrittr_1.5 httr_1.1.0 stringi_1.1.1
[11] curl_0.9.7 tools_3.3.1 stringr_1.0.0 parallel_3.3.1
Wrong output as pic:
The correct output is produced from the same data on:
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
The problem isn't R or Ubuntu; it is Notepad. Specifically, Notepad expects "\r\n" for line breaks, whereas most other text readers are happy with "\n", which is the default line break used by write.xxx.
If you add the parameter eol="\r\n" then you should be able to open in Notepad and see the expected line breaks.
For instance:
write.table(df,"df.csv",row.names = FALSE, dec=".",sep=";",eol="\r\n")
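To check the effect of eol without opening Notepad, one can count the carriage-return bytes in the written file. This is a minimal base R sketch; df_demo, path_lf, path_crlf and count_cr are names made up for the illustration, and the counts in the comments assume a Unix-alike system such as Ubuntu (on Windows, text-mode output already converts "\n" itself).

```r
# Small stand-in for the data above (hypothetical subset).
df_demo <- data.frame(ID = c("", "30", "26"),
                      b  = c("", "2233", "12"),
                      stringsAsFactors = FALSE)

path_lf   <- tempfile(fileext = ".csv")
path_crlf <- tempfile(fileext = ".csv")

# Default line ending ("\n") versus explicit Windows line ending ("\r\n").
write.table(df_demo, path_lf,   row.names = FALSE, sep = ";")
write.table(df_demo, path_crlf, row.names = FALSE, sep = ";", eol = "\r\n")

# Count carriage-return bytes (0x0d) in a file.
count_cr <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  sum(bytes == as.raw(0x0d))
}

count_cr(path_lf)    # on a Unix-alike: 0
count_cr(path_crlf)  # on a Unix-alike: 4 (header line + 3 data rows)
```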
I am trying to understand whether this is a bug in RStudio or whether I am missing something.
I am reading a csv file into R. When I print it to the console in RStudio I get gibberish (unless I look at a specific vector), while in Rgui it is fine.
The code I will run is this:
Sys.setlocale("LC_ALL", "Hebrew")
x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
x # shows gibberish
x[,2]
colnames(x)
Here is the output from RStudio (gibberish):
> x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
> x
âéì..áùðéí. îéâãø
1 23.0 æëø
2 24.0 ð÷áä
3 23.0 ð÷áä
4 24.0 ð÷áä
5 25.0 æëø
6 18.0 æëø
7 26.0 æëø
8 21.5 ð÷áä
9 24.0 æëø
10 26.0 æëø
11 24.0 æëø
12 19.0 ð÷áä
13 19.0 ð÷áä
14 24.5 æëø
15 21.0 ð÷áä
> x[,2]
[1] זכר נקבה נקבה נקבה זכר זכר זכר נקבה זכר זכר זכר נקבה נקבה זכר נקבה
Levels: זכר נקבה
> colnames(x)
[1] "âéì..áùðéí." "îéâãø"
>
And here it is in Rgui (here it is fine):
> x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
> x # shows gibberish
גיל..בשנים. מיגדר
1 23.0 זכר
2 24.0 נקבה
3 23.0 נקבה
4 24.0 נקבה
5 25.0 זכר
6 18.0 זכר
7 26.0 זכר
8 21.5 נקבה
9 24.0 זכר
10 26.0 זכר
11 24.0 זכר
12 19.0 נקבה
13 19.0 נקבה
14 24.5 זכר
15 21.0 נקבה
> x[,2]
[1] זכר נקבה נקבה נקבה זכר זכר זכר נקבה זכר זכר זכר נקבה נקבה זכר נקבה
Levels: זכר נקבה
> colnames(x)
[1] "גיל..בשנים." "מיגדר"
>
In both sessions, my sessionInfo() is:
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=Hebrew_Israel.1255 LC_CTYPE=Hebrew_Israel.1255
[3] LC_MONETARY=Hebrew_Israel.1255 LC_NUMERIC=C
[5] LC_TIME=Hebrew_Israel.1255
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] installr_0.17.0
I'm using the latest RStudio version 0.99.892
Thanks.
This is a bug in RStudio, and not the only one. I see you have already received a general answer about the problems RStudio currently has with non-English locale support on Windows. As far as I know, this is not the first version with such problems. You may also run into some newer problems that I think are related to Windows 10. Since I am affected by that second kind of problem as well, I use the English locale to print Hebrew.
I did some debugging on your problem and came up with a workaround and (I think) some new insight into where the problem lies. It could probably be debugged further to write a complete function that fixes it, but due to time constraints I decided to stop here.
I've created this data:
x <- data.frame("x"= c("דור","dor"))
As mentioned already, with the Hebrew locale I get gibberish as well:
Sys.setlocale("LC_ALL", "Hebrew")
[1] "LC_COLLATE=Hebrew_Israel.1255;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=Hebrew_Israel.1255;LC_NUMERIC=C;LC_TIME=Hebrew_Israel.1255"
"דור"
[1] "ãåø"
x
x
1 ãåø
2 dor
Using the English locale, I get this output:
Sys.setlocale("LC_ALL", "English")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
"דור"
[1] "דור"
x
x
1 <U+05D3><U+05D5><U+05E8>
2 dor
Note that non-data.frame output prints fine. The problem also occurs with the data.table class, while list and matrix print fine.
Checking both the print.data.frame and print.table methods reveals the main suspect: format.
Further investigation confirms this suspicion:
as.matrix(x)
x
[1,] "דור"
[2,] "dor"
format(as.matrix(x))
x
[1,] "<U+05D3><U+05D5><U+05E8>"
[2,] "dor "
So in your case I suggest the following workflow:
Sys.setlocale("LC_ALL", "Hebrew")
x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
as.matrix(x)
âéì..áùðéí. îéâãø
[1,] "23.0" "זכר"
[2,] "24.0" "נקבה"
[3,] "23.0" "נקבה"
[4,] "24.0" "נקבה"
[5,] "25.0" "זכר"
[6,] "18.0" "זכר"
[7,] "26.0" "זכר"
[8,] "21.5" "נקבה"
[9,] "24.0" "זכר"
[10,] "26.0" "זכר"
[11,] "24.0" "זכר"
[12,] "19.0" "נקבה"
[13,] "19.0" "נקבה"
[14,] "24.5" "זכר"
[15,] "21.0" "נקבה"
Both locales, Hebrew and English, worked on my machine for the cell values, but the column names were not displayed correctly in either.
To conclude, this is far from a complete solution; it is only a small, partial workaround for the printing (or rather the formatting) problem. It also sheds some more light on this Hebrew / non-English issue in RStudio, on which better solutions may be built. One example of a solution to a similar problem, writing Hebrew on Windows, can be seen in this SO thread.
I'm using the bigmemory and biganalytics packages and specifically trying to compute the mean of a big.matrix object. The documentation for biganalytics (e.g. ?biganalytics) suggests that mean() should be available for big.matrix objects, but this fails:
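library(bigmemory)     # provides big.matrix()
library(biganalytics)  # provides colmean() and mean() for big.matrix
x <- big.matrix(5, 2, type="integer", init=0,
                dimnames=list(NULL, c("alpha", "beta")))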
x
# An object of class "big.matrix"
# Slot "address":
# <pointer: 0x00000000069a5200>
x[,1] <- 1:5
x[,]
# alpha beta
# [1,] 1 0
# [2,] 2 0
# [3,] 3 0
# [4,] 4 0
# [5,] 5 0
mean(x)
# [1] NA
# Warning message:
# In mean.default(x) : argument is not numeric or logical: returning NA
Although some things work OK:
colmean(x)
# alpha beta
# 3 0
sum(x)
# [1] 15
mean(x[])
# [1] 1.5
mean(colmean(x))
# [1] 1.5
Without mean(), it seems mean(colmean(x)) is the next best thing:
# try it on something bigger
x = big.matrix(nrow=10000, ncol=10000, type="integer")
x[] <- c(1:(10000*10000))
mean(colmean(x))
# [1] 5e+07
mean(x[])
# [1] 5e+07
system.time(mean(colmean(x)))
# user system elapsed
# 0.19 0.00 0.19
system.time(mean(x[]))
# user system elapsed
# 0.28 0.11 0.39
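The reason mean(colmean(x)) agrees with the overall mean is that every column of a matrix has the same number of rows, so the column means carry equal weight. A minimal sketch with a plain base R matrix standing in for the big.matrix (m, overall and via_cols are illustrative names; colMeans() is the base R counterpart of colmean()):

```r
# Plain integer matrix with the same values as the small example above.
m <- cbind(alpha = 1:5, beta = 0L)

overall  <- mean(m)            # mean over all 10 cells
via_cols <- mean(colMeans(m))  # mean of the two column means (3 and 0)

c(overall = overall, via_cols = via_cols)
# overall via_cols
#     1.5      1.5
```

The shortcut would not carry over to structures whose columns can differ in length, where a weighted mean would be needed instead.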
Presumably mean() could be faster still, especially for rectangular matrices with a large number of columns.
Any ideas why mean() isn't working for me?
OK - re-installing biganalytics seems to have fixed this.
I now have:
library("biganalytics")
x = big.matrix(10000,10000, type="integer")
for(i in 1L:10000L) { j = c(1L:10000L) ; x[i,] <- i*10000L + j }
mean(x)
# [1] 50010001
mean(x[,])
# [1] 50010001
mean(colmean(x))
# [1] 50010001
system.time(replicate(100, mean(x)))
# user system elapsed
# 20.16 0.02 20.23
system.time(replicate(100, mean(colmean(x))))
# user system elapsed
# 20.08 0.00 20.24
system.time(replicate(100, mean(x[,])))
# user system elapsed
# 31.62 12.88 44.74
So all good. My sessionInfo() is now:
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biganalytics_1.1.12 biglm_0.9-1 DBI_0.3.1 foreach_1.4.2 bigmemory_4.5.8 bigmemory.sri_0.1.3
loaded via a namespace (and not attached):
[1] codetools_0.2-8 iterators_1.0.7 Rcpp_0.11.2