Compute mean of big.matrix in R

I'm using the bigmemory and biganalytics packages and specifically trying to compute the mean of a big.matrix object. The documentation for biganalytics (e.g. ?biganalytics) suggests that mean() should be available for big.matrix objects, but this fails:
x <- big.matrix(5, 2, type="integer", init=0,
                dimnames=list(NULL, c("alpha", "beta")))
x
# An object of class "big.matrix"
# Slot "address":
# <pointer: 0x00000000069a5200>
x[,1] <- 1:5
x[,]
# alpha beta
# [1,] 1 0
# [2,] 2 0
# [3,] 3 0
# [4,] 4 0
# [5,] 5 0
mean(x)
# [1] NA
# Warning message:
# In mean.default(x) : argument is not numeric or logical: returning NA
Although some things work OK:
colmean(x)
# alpha beta
# 3 0
sum(x)
# [1] 15
mean(x[])
# [1] 1.5
mean(colmean(x))
# [1] 1.5
Without mean(), it seems mean(colmean(x)) is the next best thing:
# try it on something bigger
x = big.matrix(nrow=10000, ncol=10000, type="integer")
x[] <- c(1:(10000*10000))
mean(colmean(x))
# [1] 5e+07
mean(x[])
# [1] 5e+07
system.time(mean(colmean(x)))
# user system elapsed
# 0.19 0.00 0.19
system.time(mean(x[]))
# user system elapsed
# 0.28 0.11 0.39
Presumably mean() could be faster still, especially for rectangular matrices with a large number of columns.
Any ideas why mean() isn't working for me?
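For what it's worth, a quick diagnostic I could run (standard methods-package calls, nothing biganalytics-specific; I'm assuming the method would be registered as S4, since big.matrix is an S4 class) is to check whether a mean() method is registered for big.matrix at all:
library(biganalytics)
showMethods("mean")    # S4 methods mean() can dispatch on
methods("mean")        # S3 methods, in case it is registered that way instead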

OK - re-installing biganalytics seems to have fixed this.
I now have:
library("biganalytics")
x = big.matrix(10000,10000, type="integer")
for(i in 1L:10000L) { j = c(1L:10000L) ; x[i,] <- i*10000L + j }
mean(x)
# [1] 50010001
mean(x[,])
# [1] 50010001
mean(colmean(x))
# [1] 50010001
system.time(replicate(100, mean(x)))
# user system elapsed
# 20.16 0.02 20.23
system.time(replicate(100, mean(colmean(x))))
# user system elapsed
# 20.08 0.00 20.24
system.time(replicate(100, mean(x[,])))
# user system elapsed
# 31.62 12.88 44.74
So all good. My sessionInfo() is now:
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biganalytics_1.1.12 biglm_0.9-1 DBI_0.3.1 foreach_1.4.2 bigmemory_4.5.8 bigmemory.sri_0.1.3
loaded via a namespace (and not attached):
[1] codetools_0.2-8 iterators_1.0.7 Rcpp_0.11.2

Is there a way to prevent copy-on-modify when modifying attributes?

I am surprised that a copy of the matrix is made in the following code:
> (m <- matrix(1:12, nrow = 3))
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
> tracemem(m)
[1] "<000001E2FC1E03D0>"
> str(m)
int [1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
> attr(m, "dim") <- 4:3
tracemem[0x000001e2fc1e03d0 -> 0x000001e2fcb05008]:
> m
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
> str(m)
int [1:4, 1:3] 1 2 3 4 5 6 7 8 9 10 ...
Is it useful? Is it avoidable?
EDIT: I do not have the same results as GKi.
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.3 tools_4.0.3
> m <- matrix(1:12, nrow = 3)
> tracemem(m)
[1] "<000001F8DB2C7D90>"
> attr(m, "dim") <- c(4, 3)
tracemem[0x000001f8db2c7d90 -> 0x000001f8db2d93f0]:
One difference is that I do not use a BLAS library...
I'm using R 3.6.3 and indeed a copy is made. To change an attribute without making a copy, you can use the setattr function of the data.table package:
library(data.table)
m <- matrix(1:12, nrow = 3)
.Internal(inspect(m))
setattr(m, "dim", c(4L,3L))
.Internal(inspect(m))
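As a hedged aside, you can also verify with tracemem (as in the question) that setattr() really modifies the attribute in place:
m <- matrix(1:12, nrow = 3)
tracemem(m)                    # note the address
setattr(m, "dim", c(4L, 3L))   # changes the dim attribute by reference
tracemem(m)                    # same address: no copy was made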
In my case it is not making a copy of the data:
m <- matrix(1:12, nrow = 3)
.Internal(inspect(m))
##250ff98 13 INTSXP g0c4 [REF(1),ATT] (len=12, tl=0) 1,2,3,4,5,...
#ATTRIB:
# #38da270 02 LISTSXP g0c0 [REF(1)]
# TAG: #194d610 01 SYMSXP g0c0 [MARK,REF(1171),LCK,gp=0x4000] "dim" (has value)
# #38c3d88 13 INTSXP g0c1 [REF(65535)] (len=2, tl=0) 3,4
attr(m, "dim") <- 4:3
.Internal(inspect(m))
##250ff98 13 INTSXP g0c4 [REF(1),ATT] (len=12, tl=0) 1,2,3,4,5,...
#ATTRIB:
# #38da270 02 LISTSXP g0c0 [REF(1)]
# TAG: #194d610 01 SYMSXP g0c0 [MARK,REF(1171),LCK,gp=0x4000] "dim" (has value)
# #38d9978 13 INTSXP g0c0 [REF(65535)] 4 : 3 (expanded)
The data was at #250ff98 and is still there afterwards; only the dim attribute changes, from #38c3d88 to #38d9978.
sessionInfo()
#R version 4.0.3 (2020-10-10)
#Platform: x86_64-pc-linux-gnu (64-bit)
#Running under: Debian GNU/Linux 10 (buster)
#
#Matrix products: default
#BLAS: /usr/local/lib/R/lib/libRblas.so
#LAPACK: /usr/local/lib/R/lib/libRlapack.so
#
#locale:
# [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
# [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
# [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
#[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#
#attached base packages:
#[1] stats graphics grDevices utils datasets methods base
#
#loaded via a namespace (and not attached):
#[1] compiler_4.0.3 tools_4.0.3
The same with tracemem.
m <- matrix(1:12, nrow = 3)
tracemem(m)
#[1] "<0x289ff98>"
attr(m, "dim") <- 4:3
tracemem(m)
#[1] "<0x289ff98>"
But if you call str(m) in between, a copy is currently made (presumably because passing m to str() bumps its reference count):
m <- matrix(1:12, nrow = 3)
tracemem(m)
#[1] "<0x28a01c8>"
str(m)
# int [1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
attr(m, "dim") <- 4:3
#tracemem[0x28a01c8 -> 0x2895608]:

How to save a CSV file with R with line breaks that Notepad will recognize?

I'm sorry to bother you with what is probably an encoding question. After spending a couple of hours without finding a solution, I decided to post it here.
I'm trying, unsuccessfully, to write a simple table using write.table, write.csv and write.csv2 from Ubuntu 14.04. My data is somewhat messy, as it comes from a cron job:
ID <- c("",30,26,20,30,40,5,10,4)
b <- c("",2233,12,2,22,13,23,23,100)
c <- c("","","","","","","","","")
d <- c("","","","","","","","","")
e <- c("","","","","","800","","","")
f <- c("","","","","","","","","")
g <- c("","","","","","","","EA","")
h <- c("","","","","","","","","")
df <- data.frame(ID,b,c,d,e,f,g,h)
# change columns to chr
# convert every column to character
for (i in seq_len(ncol(df))) {
  df[, i] <- as.character(df[, i])
}
str(df)
# 'data.frame': 9 obs. of 8 variables:
# $ ID: chr "" "30" "26" "20" ...
# $ b : chr "" "2233" "12" "2" ...
# $ c : chr "" "" "" "" ...
# $ d : chr "" "" "" "" ...
# $ e : chr "" "" "" "" ...
# $ f : chr "" "" "" "" ...
# $ g : chr "" "" "" "" ...
# $ h : chr "" "" "" "" ...
head(df,n=9)
ID b c d e f g h
# 1
# 2 30 2233
# 3 26 12
# 4 20 2
# 5 30 22
# 6 40 13 800
# 7 5 23
# 8 10 23 EA
# 9 4 100
I have tried various combinations and suggestions found on SO, but nothing worked. The result is always somehow displaced: instead of long, it is wide. In the current example it is just one long row.
I tried:
write.table(df,"df.csv",row.names = FALSE, dec=".",sep=";")
write.table(df,"df.csv",row.names = FALSE,dec=".",sep=";", col.names = T)
write.table(df,"df.csv",row.names = FALSE,sep=";",fileEncoding = "UTF-8")
write.table(df,"df.csv",row.names = FALSE,fileEncoding = "UTF-8")
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8
[4] LC_COLLATE=de_DE.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.4.3 DBI_0.4-1 RGA_0.4.2 RMySQL_0.11-3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.5 lubridate_1.5.6 digest_0.6.9 assertthat_0.1 R6_2.1.2
[6] plyr_1.8.3 jsonlite_1.0 magrittr_1.5 httr_1.1.0 stringi_1.1.1
[11] curl_0.9.7 tools_3.3.1 stringr_1.0.0 parallel_3.3.1
Wrong output (screenshot omitted): everything ends up in one long row.
The correct output results from the same data on:
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
(screenshot of the correct output omitted)
The problem isn't R or Ubuntu; it is Notepad. Specifically, Notepad expects "\r\n" for line breaks, whereas most other text readers are happy with "\n", which is the default line ending used by the write.* functions.
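You can verify this from R itself; a hedged check (assuming the df.csv written above) is to count the line-ending bytes in the file:
raw_bytes <- readBin("df.csv", what = "raw", n = file.size("df.csv"))
sum(raw_bytes == as.raw(0x0a))   # number of "\n" bytes: one per row written
sum(raw_bytes == as.raw(0x0d))   # number of "\r" bytes: 0 with the default eol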
If you add the parameter eol="\r\n" then you should be able to open in Notepad and see the expected line breaks.
For instance:
write.table(df,"df.csv",row.names = FALSE, dec=".",sep=";",eol="\r\n")
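The eol argument is also passed through by write.csv and write.csv2 (both forward it to write.table), so, assuming the same data frame, this should behave the same:
write.csv2(df, "df.csv", row.names = FALSE, eol = "\r\n")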

Unable to pipe predict() output through filter() to ggplot()

I'm struggling to figure out why I can't use filter() on the results
of predict.gam() and then ggplot() the subset of predictions. I'm not
sure the prediction step is really part of the problem, but that's what
it takes to trigger the error. Just filter() %>% ggplot() with a
dataframe works fine.
library(dplyr)
library(ggplot2)
library(mgcv)
gam1 <- gam(Petal.Length~s(Petal.Width) + Species, data=iris)
nd <- expand.grid(Petal.Width = seq(0, 5, 0.05),
                  Species = levels(iris$Species),
                  stringsAsFactors = FALSE)
predicted <- predict(gam1,newdata=nd)
predicted <- cbind(predicted,nd)
filter(tbl_df(predicted), Species == "setosa") %>%
ggplot(aes(x=Petal.Width, y = predicted)) +
geom_point()
## Error: length(rows) == 1 is not TRUE
But:
filter(tbl_df(predicted), Species == "setosa")
## Source: local data frame [101 x 3]
##
## predicted Petal.Width Species
## (dbl[10]) (dbl) (chr)
## 1 1.294574 0.00 setosa
## 2 1.327482 0.05 setosa
## 3 1.360390 0.10 setosa
## 4 1.393365 0.15 setosa
## 5 1.426735 0.20 setosa
## 6 1.460927 0.25 setosa
## 7 1.496477 0.30 setosa
## 8 1.533949 0.35 setosa
## 9 1.573888 0.40 setosa
## 10 1.616810 0.45 setosa
## .. ... ... ...
And the problem is filter() because:
pick <- predicted$Species == "setosa"
ggplot(predicted[pick,],aes(x=Petal.Width, y = predicted)) +
geom_point()
I've also tried saving the result of filter to an object and using that directly in ggplot() but that has the same error.
Obviously not a crisis, because there's a workaround, but my mental
model of how to use filter() is obviously wrong! Any insights much
appreciated.
Edit: When I first posted this I was still using R 3.2.3 and was getting warnings from ggplot2 and dplyr. So I upgraded to 3.3.0 and it's still happening.
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mgcv_1.8-12 nlme_3.1-127 ggplot2_2.1.0 dplyr_0.4.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 knitr_1.11 magrittr_1.5 munsell_0.4.2
## [5] colorspace_1.2-6 lattice_0.20-33 R6_2.1.1 stringr_1.0.0
## [9] plyr_1.8.3 tools_3.3.0 parallel_3.3.0 grid_3.3.0
## [13] gtable_0.1.2 DBI_0.3.1 htmltools_0.2.6 lazyeval_0.1.10
## [17] yaml_2.1.13 assertthat_0.1 digest_0.6.8 Matrix_1.2-6
## [21] formatR_1.2 evaluate_0.7.2 rmarkdown_0.9.5 labeling_0.3
## [25] stringi_1.0-1 scales_0.3.0
The problem arises because your predict() call returns a named array rather than a plain numeric vector.
class(predicted$predicted)
# [1] "array"
The first filter() gives you correct-looking output on the surface; however, if you inspect it, you will notice that the predicted column is still an array, and it still contains all 303 values:
str(filter(tbl_df(predicted), Species == "setosa"))
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 101 obs. of 3 variables:
$ predicted : num [1:303(1d)] 1.29 1.33 1.36 1.39 1.43 ...
..- attr(*, "dimnames")=List of 1
.. ..$ : chr "1" "2" "3" "4" ...
$ Petal.Width: num 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 ...
$ Species: chr "setosa" "setosa" "setosa" "setosa" ...
In contrast, good old logical subsetting does the job on all dimensions:
str(predicted[pick,])
'data.frame': 101 obs. of 3 variables:
$ predicted : num [1:101(1d)] 1.29 1.33 1.36 1.39 1.43 ... # Now 101 obs here too
..- attr(*, "dimnames")=List of 1
.. ..$ : chr "1" "2" "3" "4" ...
$ Petal.Width: num 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 ...
$ Species : chr "setosa" "setosa" "setosa" "setosa" ...
So either you coerce the predicted column to numeric:
library(dplyr)
library(ggplot2)
predicted %>% mutate(predicted = as.numeric(predicted)) %>%
filter(Species == "setosa") %>%
ggplot(aes(x = Petal.Width, y = predicted)) +
geom_point()
Or replace filter() by subset():
predicted %>%
subset(Species == "setosa") %>%
ggplot(aes(x = Petal.Width, y = predicted)) +
geom_point()
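Another option (a hedged variant of the setup from the question, not something the original answer requires) is to drop the array structure at the source, when the prediction data frame is first built:
predicted <- data.frame(predicted = as.numeric(predict(gam1, newdata = nd)), nd)
filter(predicted, Species == "setosa") %>%
  ggplot(aes(x = Petal.Width, y = predicted)) +
  geom_point()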

Mean and Median Vs Summary

I'm currently doing the Reproducible Data course on Coursera, and one of the questions asks for the mean and median of steps per day. I have this, but when I confirm it with the summary function, summary's mean and median are different. I'm running this via knitr.
Why would this be?
** Below is an edit showing all of my script so far, including a link to the raw data:
## Download the data (you have to change https to http to get this to work in knitr)
target_url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
target_localfile = "ActivityMonitoringData.zip"
if (!file.exists(target_localfile)) {
download.file(target_url, destfile = target_localfile)
}
Unzip the file to the temporary directory
unzip(target_localfile, exdir="extract", overwrite=TRUE)
List the extracted files
list.files("./extract")
## [1] "activity.csv"
Load the extracted data into R
activity.csv <- read.csv("./extract/activity.csv", header = TRUE)
activity1 <- activity.csv[complete.cases(activity.csv),]
str(activity1)
## 'data.frame': 15264 obs. of 3 variables:
## $ steps : int 0 0 0 0 0 0 0 0 0 0 ...
## $ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
Use a histogram to view the number of steps taken each day
histData <- aggregate(steps ~ date, data = activity1, sum)
h <- hist(histData$steps,                # save the histogram as an object
          breaks = 11,                   # "suggests" 11 bins
          freq = TRUE,
          col = "thistle1",
          main = "Histogram of Activity",
          xlab = "Number of daily steps")
Obtain the Mean and Median of the daily steps
steps <- histData$steps
mean(steps)
## [1] 10766
median(steps)
## [1] 10765
summary(histData$steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41 8840 10800 10800 13300 21200
summary(steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41 8840 10800 10800 13300 21200
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: i386-w64-mingw32/i386 (32-bit)
##
## locale:
## [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.6
##
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.5 formatR_1.0 stringr_0.6.2 tools_3.1.1
Actually, the answer is correct; you are just printing it wrong. You are setting the digits option somewhere.
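A quick way to check what it is currently set to:
getOption("digits")   # the factory default is 7; a lower value explains the truncated output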
Put this before the script:
options(digits=12)
And you'll have:
mean(steps)
# [1] 10766.1886792
median(steps)
# [1] 10765
summary(steps)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 41.0000 8841.0000 10765.0000 10766.1887 13294.0000 21194.0000
Notice that summary() uses max(3, getOption("digits") - 3) significant digits when printing, so it rounds a bit (10766.1887 instead of 10766.1886792).
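Alternatively, summary() accepts a digits argument directly, so you can ask for more precision in a single call without changing the global option:
summary(steps, digits = 12)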

Converting Factor to Date without creating NA's

I'm running into problems converting a factor to a date; it is producing NA values, which I don't want.
Data for my problem can be found here: (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip)
x <- read.csv("activity.csv")
head(x)
steps date interval
1 NA 2012-10-01 0
2 NA 2012-10-01 5
3 NA 2012-10-01 10
4 NA 2012-10-01 15
5 NA 2012-10-01 20
6 NA 2012-10-01 25
Goal: I'm trying to find the mean total number of steps taken per day. So first, I need to bin the values so that each data point corresponds to the sum for a given day:
x$Day <- as.Date(cut(x$date, breaks = "day"))
Error in cut.default(x$date, breaks = "day") : 'x' must be numeric
I confirm this with the class() function:
class(x[,2])
"factor"
This is odd, because from the head(x) above it looked like it was a Date. Anyway, in order to bin the values with cut() so that each data point corresponds to the sum for a given day, I need to convert the dates to the "Date" class:
x[,2] <- as.Date(x[,2], format="%Y/%m/%d")
class(x[,2])
[1] "Date"
OK, so in theory I should be able to bin values now
x$Day <- as.Date(cut(x$date, breaks = "day"))
Error in seq.int(0, to0 - from, by) : 'to' cannot be NA, NaN or infinite
In addition: Warning messages:
1: In min.default(c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, :
no non-missing arguments to min; returning Inf
2: In max.default(c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, :
no non-missing arguments to max; returning -Inf
head(is.na(x))
steps date interval
[1,] TRUE TRUE FALSE
[2,] TRUE TRUE FALSE
[3,] TRUE TRUE FALSE
[4,] TRUE TRUE FALSE
[5,] TRUE TRUE FALSE
[6,] TRUE TRUE FALSE
If I compare this to what I saw prior to running x[,2] <- as.Date(x[,2], format="%Y/%m/%d"):
head(is.na(x))
steps date interval
[1,] TRUE FALSE FALSE
[2,] TRUE FALSE FALSE
[3,] TRUE FALSE FALSE
[4,] TRUE FALSE FALSE
[5,] TRUE FALSE FALSE
[6,] TRUE FALSE FALSE
I'm not sure what's going on here. I expected this to work because I got the idea from the following tutorial (http://blog.mollietaylor.com/2013/08/plot-weekly-or-monthly-totals-in-r.html?m=1).
sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_Canada.1252
[2] LC_CTYPE=English_Canada.1252
[3] LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_Canada.1252
attached base packages:
[1] stats graphics grDevices utils
[5] datasets methods base
other attached packages:
[1] scales_0.2.4 ggplot2_1.0.0
loaded via a namespace (and not attached):
[1] colorspace_1.2-4 digest_0.6.4
[3] grid_3.0.3 gtable_0.1.2
[5] MASS_7.3-29 munsell_0.4.2
[7] plyr_1.8.1 proto_0.3-10
[9] Rcpp_0.11.1 reshape2_1.4
[11] stringr_0.6.2 tools_3.0.3
Just to illustrate, these all result in the same output (except for the class of the date column of course):
x <- read.csv("~/Downloads/activity.csv")
# Date is a factor
r1 <- aggregate(steps~date,data = x,FUN = mean)
x1 <- read.csv("~/Downloads/activity.csv",stringsAsFactors = FALSE)
# Date is a character
r2 <- aggregate(steps~date,data = x1,FUN = mean)
x2 <- x
x2$date <- as.Date(as.character(x$date))
# Date is a date
r3 <- aggregate(steps~date,data = x2,FUN = mean)
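As an aside on the NAs in the question (my reading of it, based on the dates shown in head(x)): the dates contain hyphens, so the format string needs to be "%Y-%m-%d" rather than "%Y/%m/%d"; with the mismatched format every value fails to parse and becomes NA. For example:
x <- read.csv("activity.csv")
x$date <- as.Date(as.character(x$date), format = "%Y-%m-%d")
sum(is.na(x$date))   # should be 0: no dates are lost in the conversion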
my_data <- read.csv(your_file, stringsAsFactors = FALSE)
# Convert 'my_data$date' to Date format
my_data$date <- as.Date(my_data$date)
This should work...
