I'm currently doing a Reproducible Data course on Coursera, and one of the questions asks for the mean and median of steps per day. I have these values, but when I confirm them with the summary() function, the mean and median reported by summary() are different. I'm running this via knitr.
Why would this be?
Below is an edit showing all of my script so far, including a link to the raw data.
Download the data (you have to change https to http to get this to work in knitr):
target_url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
target_localfile = "ActivityMonitoringData.zip"
if (!file.exists(target_localfile)) {
download.file(target_url, destfile = target_localfile)
}
Unzip the file to the "extract" directory
unzip(target_localfile, exdir="extract", overwrite=TRUE)
List the extracted files
list.files("./extract")
## [1] "activity.csv"
Load the extracted data into R
activity.csv <- read.csv("./extract/activity.csv", header = TRUE)
activity1 <- activity.csv[complete.cases(activity.csv),]
str(activity1)
## 'data.frame': 15264 obs. of 3 variables:
## $ steps : int 0 0 0 0 0 0 0 0 0 0 ...
## $ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
Use a histogram to view the number of steps taken each day
histData <- aggregate(steps ~ date, data = activity1, sum)
h <- hist(histData$steps, # Save histogram as object
breaks = 11, # "Suggests" 11 bins
freq = T,
col = "thistle1",
main = "Histogram of Activity",
xlab = "Number of daily steps")
Obtain the Mean and Median of the daily steps
steps <- histData$steps
mean(steps)
## [1] 10766
median(steps)
## [1] 10765
summary(histData$steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41 8840 10800 10800 13300 21200
summary(steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41 8840 10800 10800 13300 21200
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: i386-w64-mingw32/i386 (32-bit)
##
## locale:
## [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.6
##
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.5 formatR_1.0 stringr_0.6.2 tools_3.1.1
Actually, the answers are correct; you are just printing them with too few digits. You must be setting the digits option somewhere.
Put this before your script:
options(digits=12)
And you'll have:
mean(steps)
# [1] 10766.1886792
median(steps)
# [1] 10765
summary(steps)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 41.0000 8841.0000 10765.0000 10766.1887 13294.0000 21194.0000
Notice that summary() uses max(3, getOption("digits") - 3) significant digits when printing, so it rounds a bit (10766.1887 instead of 10766.1886792).
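If you'd rather not change the global option, a minimal alternative (just a sketch, using the digits arguments that print() and summary() already accept) is to pass digits per call:
print(mean(steps), digits = 12)
# [1] 10766.1886792
summary(steps, digits = 12)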
I have a set of raster files (in this case downloaded from http://www.paleoclim.org/) that I am reading into R using the stars package.
library("tidyverse")
library("fs")
library("stars")
data_path <- "./paleoclim"
(data_files <- list.files(data_path, pattern = "*.tif"))
#> [1] "BA_v1_2_5m_bio_1_badia.tif"
#> [2] "BA_v1_2_5m_bio_10_badia.tif"
#> [3] "BA_v1_2_5m_bio_11_badia.tif"
#> [...]
#> [39] "EH_v1_2_5m_bio_1_badia.tif"
#> [40] "EH_v1_2_5m_bio_10_badia.tif"
#> [41] "EH_v1_2_5m_bio_11_badia.tif"
#> [...]
#> [58] "HS1_v1_2_5m_bio_1_badia.tif"
#> [59] "HS1_v1_2_5m_bio_10_badia.tif"
#> [60] "HS1_v1_2_5m_bio_11_badia.tif"
#> [...]
(paleoclim <- read_stars(path(data_path, data_files)))
#> stars object with 2 dimensions and 133 attributes
#> attribute(s):
#> BA_v1_2_5m_bio_1_badia.tif BA_v1_2_5m_bio_10_badia.tif
#> Min. :101.0 Min. :213.0
#> 1st Qu.:166.0 1st Qu.:278.0
#> Median :173.0 Median :298.0
#> Mean :171.8 Mean :290.3
#> 3rd Qu.:180.0 3rd Qu.:304.0
#> Max. :200.0 Max. :325.0
#> [...]
#> dimension(s):
#> from to offset delta refsys point values
#> x 1 72 36 0.0416667 WGS 84 FALSE NULL [x]
#> y 1 48 33 -0.0416667 WGS 84 FALSE NULL [y]
Created on 2020-12-07 by the reprex package (v0.3.0)
The filenames contain two pieces of information that I would like to represent as dimensions of the stars object, e.g. HS1_v1_2_5m_bio_1_badia.tif refers to period "HS1" and bioclimatic variable "bio_1".
I've got as far as using st_redimension() to create the new dimensions and levels:
periods <- str_extract(names(paleoclim), "[^_]+")
biovars <- str_extract(names(paleoclim), "bio_[0-9]+")
paleoclim %>%
merge() %>%
st_redimension(
new_dims = st_dimensions(x = 1:72, y = 1:48,
period = unique(periods),
biovar = unique(biovars))
)
#> stars object with 4 dimensions and 1 attribute
#> attribute(s):
#> X
#> Min. : -91.0
#> 1st Qu.: 26.0
#> Median : 78.0
#> Mean : 588.2
#> 3rd Qu.: 256.0
#> Max. :11275.0
#> dimension(s):
#> from to offset delta refsys point values
#> x 1 72 1 1 NA FALSE NULL [x]
#> y 1 48 1 1 NA FALSE NULL [y]
#> period 1 7 NA NA NA FALSE BA,...,YDS
#> biovar 1 19 NA NA NA FALSE bio_1,...,bio_9
But this doesn't actually map the values of the attributes (filenames) to the levels of the new dimensions. Also, most of the information (e.g. CRS) about the original x and y dimensions is lost because I have to recreate them manually.
How do you properly define new dimensions of a stars object based on another dimension or attribute?
I don't see a straightforward way to split one dimension into two after all files have been read into a three-dimensional stars object. An alternative approach you could use is to:
read one folder at a time, so that all files of that folder go into the third (variable) dimension, storing the result as a separate stars object in a list;
then combine the resulting stars objects, so that they go into the fourth (period) dimension.
For this example, I downloaded the following two products and unzipped into two separate folders:
http://sdmtoolbox.org/paleoclim.org/data/BA/BA_v1_10m.zip
http://sdmtoolbox.org/paleoclim.org/data/HS1/HS1_v1_10m.zip
Here is the code:
library(stars)
# Directories with GeoTIFF files
paths = c(
"/home/michael/Downloads/BA_v1_10m",
"/home/michael/Downloads/HS1_v1_10m"
)
# Read the files and set 3rd dimension
r = list()
for(i in paths) {
files = list.files(path = i, pattern = "\\.tif$", full.names = TRUE)
r[[i]] = read_stars(files)
names(r[[i]]) = basename(files)
r[[i]] = st_redimension(r[[i]])
}
# Combine the list
r = do.call(c, r)
# Attributes to 4th dimension
names(r) = basename(paths)
r = st_redimension(r)
# Clean dimension names
r = st_set_dimensions(r, names = c("x", "y", "variable", "period"))
r
and the printout of the result:
## stars object with 4 dimensions and 1 attribute
## attribute(s), summary of first 1e+05 cells:
## BA_v1_10m.HS1_v1_10m
## Min. :-344.0
## 1st Qu.:-290.0
## Median :-274.0
## Mean :-264.8
## 3rd Qu.:-252.0
## Max. :-128.0
## NA's :94073
## dimension(s):
## from to offset delta refsys point values x/y
## x 1 2160 -180 0.166667 WGS 84 FALSE NULL [x]
## y 1 1072 88.6667 -0.166667 WGS 84 FALSE NULL [y]
## variable 1 19 NA NA NA NA bio_1.tif,...,bio_9.tif
## period 1 2 NA NA NA NA BA_v1_10m , HS1_v1_10m
The result is a stars object with four dimensions, including x, y, variable, and period.
Here are plots, separately for each of the two levels in the period dimension:
plot(r[,,,,1,drop=TRUE])
plot(r[,,,,2,drop=TRUE])
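As a small follow-up sketch (using the r object built above; "bio_1.tif" is one of the values shown in the variable dimension), you can subset a single variable/period combination by matching dimension values:
vars <- st_get_dimension_values(r, "variable")
bio1_ba <- r[, , , which(vars == "bio_1.tif"), 1, drop = TRUE]
plot(bio1_ba)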
I'm attempting an interval regression in R with censored data containing dependent values given either as a number y or as an interval [0, z] containing y.
After searching, I found several sources with examples recommending survival::survreg (e.g. here), though they don't deal with exactly the same problem. However, I can't get it to work with my data, and I assume I have some special case.
I'll give you a MWE. First I create some data and the latent intervals:
# data
set.seed(417699)
df <- data.frame(ind = rbinom(10, 1, .75))
df <- transform(df,
value = ifelse(df$ind == 1, sample(1:1000), NA),
value1 = ifelse(df$ind == 0, sample(10:100) * 10, 0),
cv1 = rbinom(10, 2, .7), # 1st independent var.
cv2 = rbinom(10, 2, .25) # 2nd indep. var.
)
# intervals depending if 'ind' equals 0
df$liv <- with(df, ifelse(ind == 1, value, 0))
df$uiv <- with(df, ifelse(ind == 0, value1, value))
df
## ind value value1 cv1 liv uiv cv2
## 1 1 616 1 2 616 616 0
## 2 0 NA 450 2 0 450 0
## 3 1 236 1 2 236 236 0
## 4 1 130 1 1 130 130 1
## 5 0 NA 350 1 0 350 1
## 6 0 NA 250 2 0 250 0
## 7 1 241 1 1 241 241 0
## 8 1 950 1 2 950 950 1
## 9 1 557 1 2 557 557 1
## 10 1 453 1 2 453 453 1
As one can see, there are now points or intervals depending on whether ind equals 1 or 0. Specifically, if ind = 0, the value lies somewhere in the interval.
Now, with survival::Surv(), and assuming the data are left censored, I create the "survival object" as follows.
library(survival)
(Y <- with(df, Surv(liv, uiv, event = rep(2, nrow(df)), type = "interval")))
## [1] [837, 837] [ 0, 340] [694, 694] [ 74, 74] [ 0, 280] [ 0, 640] [177, 177]
## [8] [650, 650] [368, 368] [179, 179]
summary(Y)
## time1 time2 status
## Min. : 0.0 Min. : 74.0 Min. :3
## 1st Qu.: 18.5 1st Qu.:204.2 1st Qu.:3
## Median :178.0 Median :354.0 Median :3
## Mean :297.9 Mean :423.9 Mean :3
## 3rd Qu.:579.5 3rd Qu.:647.5 3rd Qu.:3
## Max. :837.0 Max. :837.0 Max. :3
All fine, but at the end survreg() fails with an error:
survreg(Y ~ cv1 + cv2, data = df, dist = "gaussian")
## Error in coxph.wtest(t(x) %*% (wt * x), c((wt * eta + weights * deriv$dg) %*% :
## NA/NaN/Inf in foreign function call (arg 3)
In Surv() I tried several values for the event= and type= arguments; most of them didn't work, and I'm confused about how to specify the right settings (i.e. I don't know whether I'm wrong or the function is; see the following note).
Note: survreg() seems to have had a bug a few versions ago, which should now be fixed (I don't know for sure).
Does anyone know what's going on and how to solve this issue? Also, at the moment this seems to be the only promising way to run this kind of interval regression in R, but maybe there is a better option. Thank you.
A tiny comment on this question finally brought me the solution. The trick is to set type = "interval2", which lets us drop the event= argument.
(Y <- with(df, Surv(liv, uiv, type = "interval2")))
## [1] 616 [ 0, 450] 236 130 [ 0, 350] [ 0, 250] 241
## [8] 950 557 453
summary(Y)
## time1 time2 status
## Min. : 0.0 Min. : 1.0 Min. :1.0
## 1st Qu.: 32.5 1st Qu.: 1.0 1st Qu.:1.0
## Median :238.5 Median : 1.0 Median :1.0
## Mean :318.3 Mean :105.7 Mean :1.6
## 3rd Qu.:531.0 3rd Qu.:187.8 3rd Qu.:2.5
## Max. :950.0 Max. :450.0 Max. :3.0
coef(intreg <- survreg(Y ~ cv1 + cv2, data = df, dist = "gaussian"))
## (Intercept) cv1 cv2
## -282.0126 326.4428 216.9370
Compared to a plain OLS fit, the regression results look reasonable:
coef(reg <- lm(value ~ cv1 + cv2, data = df))
## (Intercept) cv1 cv2
## -242.5294 364.1176 127.8235
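As a rough sanity check (a sketch only; both model objects were created above), the fitted values of the interval regression and the OLS fit can be put side by side:
cbind(interval = predict(intreg, type = "response"),
      ols      = predict(reg, newdata = df))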
I have a data set with 61 observations and 2 variables. When I run summary() on the whole data frame, the quantiles, median, mean and max of the second variable are sometimes different from what I get when I run summary() on the second variable alone. Why is that?
data <- read.csv("testdata.csv")
head(data)
# Group.1 x
# 1 10/1/12 0
# 2 10/2/12 126
# 3 10/3/12 11352
# 4 10/4/12 12116
# 5 10/5/12 13294
# 6 10/6/12 15420
summary(data)
# Group.1 x
# 10/1/12 : 1 Min. : 0
# 10/10/12: 1 1st Qu.: 6778
# 10/11/12: 1 Median :10395
# 10/12/12: 1 Mean : 9354
# 10/13/12: 1 3rd Qu.:12811
# 10/14/12: 1 Max. :21194
# (Other) :55
summary(data[2])
# x
# Min. : 0
# 1st Qu.: 6778
# Median :10395
# Mean : 9354
# 3rd Qu.:12811
# Max. :21194
# The following code yield different result:
summary(data$x)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0 6778 10400 9354 12810 21190
@r2evans' comment is correct in that the discrepancy is caused by differences between summary.data.frame and summary.default.
The default value of digits for both methods is max(3L, getOption("digits") - 3L). If you haven't changed your options, this will evaluate to 4L. However, the two methods use their digits argument differently when formatting the output, which is the reason for the differences in the two methods' output. From ?summary:
digits: integer, used for number formatting with signif() (for summary.default) or format() (for summary.data.frame).
Say we have the vector of x´s summary statistics in the question:
q <- append(quantile(data$x), mean(data$x), after = 3L)
q
## 0% 25% 50% 75% 100%
## 0.00 6778.00 10395.00 9354.23 12811.00 21194.00
In summary.default the output is formatted with signif(), which rounds its input to the supplied number of significant digits:
signif(q, digits = 4L)
## 0% 25% 50% 75% 100%
## 0 6778 10400 9354 12810 21190
summary.data.frame, on the other hand, uses format(), which treats its digits argument as only a suggestion (see ?format) for the number of significant digits to display:
format(q, digits = 4L)
## 0% 25% 50% 75% 100%
## " 0" " 6778" "10395" " 9354" "12811" "21194"
Thus, when using the default digits argument value 4, summary.default(data$x) rounds the 5-digit quantiles to only 4 significant digits, but summary.data.frame(data[2]) displays the 5-digit quantiles without rounding.
If you explicitly supply the digits argument as larger than 4, you'll get identical results:
summary(data[2], digits = 5L)
## x
## Min. : 0.0
## 1st Qu.: 6778.0
## Median :10395.0
## Mean : 9354.2
## 3rd Qu.:12811.0
## Max. :21194.0
summary(data$x, digits = 5L)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 6778.0 10395.0 9354.2 12811.0 21194.0
As an extreme example of the differences of the two methods with the default digits:
df <- data.frame(a = 1e5 + 0:100)
summary(df$a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 100000 100000 100000 100000 100100 100100
summary(df)
## a
## Min. :100000
## 1st Qu.:100025
## Median :100050
## Mean :100050
## 3rd Qu.:100075
## Max. :100100
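If you want the fuller precision everywhere without passing digits on each call, raising the global option works too; a quick sketch, restoring the default afterwards:
options(digits = 10)   # summary() then uses max(3, 10 - 3) = 7 significant digits
summary(df$a)
options(digits = 7)    # back to the default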
I'm using the bigmemory and biganalytics packages and specifically trying to compute the mean of a big.matrix object. The documentation for biganalytics (e.g. ?biganalytics) suggests that mean() should be available for big.matrix objects, but this fails:
x <- big.matrix(5, 2, type="integer", init=0,
                dimnames=list(NULL, c("alpha", "beta")))
x
# An object of class "big.matrix"
# Slot "address":
# <pointer: 0x00000000069a5200>
x[,1] <- 1:5
x[,]
# alpha beta
# [1,] 1 0
# [2,] 2 0
# [3,] 3 0
# [4,] 4 0
# [5,] 5 0
mean(x)
# [1] NA
# Warning message:
# In mean.default(x) : argument is not numeric or logical: returning NA
Although some things work OK:
colmean(x)
# alpha beta
# 3 0
sum(x)
# [1] 15
mean(x[])
# [1] 1.5
mean(colmean(x))
# [1] 1.5
Without mean(), it seems mean(colmean(x)) is the next best thing:
# try it on something bigger
x = big.matrix(nrow=10000, ncol=10000, type="integer")
x[] <- c(1:(10000*10000))
mean(colmean(x))
# [1] 5e+07
mean(x[])
# [1] 5e+07
system.time(mean(colmean(x)))
# user system elapsed
# 0.19 0.00 0.19
system.time(mean(x[]))
# user system elapsed
# 0.28 0.11 0.39
Presumably mean() could be faster still, especially for rectangular matrices with a large number of columns.
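(As an aside, here is a sketch of another workaround that uses only functions already shown above: derive the overall mean from biganalytics' sum() and the matrix dimensions, avoiding the full copy that mean(x[]) makes.)
sum(x) / (nrow(x) * ncol(x))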
Any ideas why mean() isn't working for me?
OK - re-installing biganalytics seems to have fixed this.
I now have:
library("biganalytics")
x = big.matrix(10000,10000, type="integer")
for(i in 1L:10000L) { j = c(1L:10000L) ; x[i,] <- i*10000L + j }
mean(x)
# [1] 50010001
mean(x[,])
# [1] 50010001
mean(colmean(x))
# [1] 50010001
system.time(replicate(100, mean(x)))
# user system elapsed
# 20.16 0.02 20.23
system.time(replicate(100, mean(colmean(x))))
# user system elapsed
# 20.08 0.00 20.24
system.time(replicate(100, mean(x[,])))
# user system elapsed
# 31.62 12.88 44.74
So all good. My sessionInfo() is now:
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biganalytics_1.1.12 biglm_0.9-1 DBI_0.3.1 foreach_1.4.2 bigmemory_4.5.8 bigmemory.sri_0.1.3
loaded via a namespace (and not attached):
[1] codetools_0.2-8 iterators_1.0.7 Rcpp_0.11.2
I need help speeding up a little bit of code. I have a data.frame "df" and would like to create new columns and fill them with given values. Here is some sample code showing how I do it:
df <- as.data.frame(1:20)
a <- c(31:50)
b <- c(201:220)
df[c("A","B")] <- c(a, b)
Now the problem is that my data is big (several million rows) and it takes more time than expected, so I think there is a better way. Any ideas? Thank you!
Extending a data.frame (or any object) causes R to copy the whole object when you add a new column. The data.table package offers some great performance features on top of the data.frame model; among other things, it allows you to add columns in place. See the code below for a simple demo:
require(data.table)
a2 <- data.table(x=1:10)
a2[, y:=21:30] ## this will create y inside a2 without copying it
summary(a2) ## just like using a data.frame
The resulting object (a data.table) will play nicely with (almost) all code that uses data.frames. It has an alternative syntax for most operations, which are performed much more efficiently. It's worth spending some time looking into.
If you'd like to add multiple columns, then:
a2[, `:=`(y=21:30, z=31:40)]
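If you are adding many columns programmatically, a loop over data.table's set() does the same job with minimal overhead (a sketch; it simply overwrites the y and z created above with the same values):
new_cols <- list(y = 21:30, z = 31:40)
for (nm in names(new_cols)) set(a2, j = nm, value = new_cols[[nm]])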
Edit: @Thell has taken the time to prepare benchmarks of different methods for extending a data.frame. They suggest that, despite the copying, plain data.frame can actually be faster (at least on R 3.1.0). Keep this in mind as an alternative and see which one works best for your code.
You stated you have 'some million' rows, so here is an excerpt of benchmarks with 3 columns of 10 million rows...
R 3.0.3 (on 32bit Celeron system w/ 2GB memory)
## Unit: microseconds
## expr min lq median uq max neval
## dt.addC(dt) 35.38 56.03 64.82 67.77 185.2 100
## df.add(df) 181.43 214.80 221.42 229.81 366.6 100
## dt.addB(dt) 2359.54 2457.09 2513.11 2577.00 6398.0 100
## dt.addA(dt) 2913.74 2995.64 3047.29 3125.82 6791.1 100
R 3.1.0 (on 64bit Haswell i7 w/ 24 GB memory)
## Unit: microseconds
## expr min lq median uq max neval
## df.add(df) 10.25 30.74 33.36 48.53 84.25 100
## dt.addC(dt) 27120.45 27563.79 27990.22 29642.46 87637.63 100
## dt.addB(dt) 38452.71 39018.90 46225.69 50142.46 130893.53 100
## dt.addA(dt) 193268.78 247749.71 251380.74 256380.43 440916.17 100
Note:
The difference between data.frame and data.table on 3.1.0 can be explained by the new way that R 3.1.0 handles assignments. Arun (one of the data.table authors) explains this in this chat log.
df.add (a common base R way to add columns to a data.frame).
df$b <- b.vals
df$c <- c.vals
dt.addA (the common base data.frame method applied to data.table)
dt$b <- b.vals
dt$c <- c.vals
dt.addB (a common data.table way to add columns)
dt[,`:=`(b=b.vals, c=c.vals)]
dt.addC (another data.table method of setting values [from Arun])
## to reduce the overhead due to `[.data.table` on small data.frames.
set(dt, j="b", value=b.vals)
set(dt, j="c", value=c.vals)
Benchmarks for other data set sizes
R 3.1.0 on i7 System
# Test # 1,000
## Unit: microseconds
## expr min lq median uq max neval
## dt.addC(dt) 6.007 10.38 11.71 12.50 20.79 100
## df.add(df) 11.534 19.49 20.57 21.32 940.63 100
## dt.addB(dt) 326.166 344.85 351.43 365.47 1412.86 100
## dt.addA(dt) 798.777 850.47 867.60 888.23 1935.20 100
## test relative
## 1 df.add(df) 1
## 4 dt.addC(dt) 1
## 3 dt.addB(dt) 35
## 2 dt.addA(dt) 87
# Test # 10,000
## Unit: microseconds
## expr min lq median uq max neval
## dt.addC(dt) 11.13 17.88 19.20 20.80 988.9 100
## df.add(df) 10.97 20.56 22.65 24.94 41.1 100
## dt.addB(dt) 333.17 364.15 389.87 419.08 1347.0 100
## dt.addA(dt) 823.99 875.88 897.10 1076.90 29233.1 100
## test relative
## 1 df.add(df) 1
## 4 dt.addC(dt) 1
## 3 dt.addB(dt) 19
## 2 dt.addA(dt) 50
# Test # 10,000,000
## Unit: microseconds
## expr min lq median uq max neval
## df.add(df) 10.25 30.74 33.36 48.53 84.25 100
## dt.addC(dt) 27120.45 27563.79 27990.22 29642.46 87637.63 100
## dt.addB(dt) 38452.71 39018.90 46225.69 50142.46 130893.53 100
## dt.addA(dt) 193268.78 247749.71 251380.74 256380.43 440916.17 100
## test relative
## 1 df.add(df) 1
## 4 dt.addC(dt) 1536
## 3 dt.addB(dt) 2213
## 2 dt.addA(dt) 11667
R 3.0.3 on Celeron System
# Test # 1,000
## Unit: microseconds
## expr min lq median uq max neval
## dt.addC(dt) 55.78 82.58 94.48 96.14 176.1 100
## df.add(df) 182.65 215.36 220.10 225.03 361.6 100
## dt.addB(dt) 2699.10 2774.61 2827.34 2894.23 3442.2 100
## dt.addA(dt) 5259.89 6066.00 6122.37 6231.50 10265.9 100
## test relative
## 4 dt.addC(dt) 1.000
## 1 df.add(df) 2.889
## 3 dt.addB(dt) 32.444
## 6 dfadd2dtB(dt) 69.667
## 2 dt.addA(dt) 69.889
## 5 dfadd2dtA(dt) 96.000
# Test # 10,000
## Unit: microseconds
## expr min lq median uq max neval
## dt.addC(dt) 134.0 162.8 168.7 185.8 4135 100
## df.add(df) 576.7 616.4 633.7 663.2 72749 100
## dt.addB(dt) 2789.8 2932.6 2993.0 3054.7 6702 100
## dt.addA(dt) 5400.6 6701.5 6819.0 10079.2 11518 100
## test relative
## 4 dt.addC(dt) 1.000
## 1 df.add(df) 8.143
## 3 dt.addB(dt) 14.619
## 2 dt.addA(dt) 34.286
## 6 dfadd2dtB(dt) 34.381
## 5 dfadd2dtA(dt) 53.810
# Test # 10,000,000
## Unit: milliseconds
## expr min lq median uq max neval
## dt.addC(dt) 121.1 146.2 147.2 161.8 303.8 100
## dt.addB(dt) 197.7 225.4 228.0 270.2 380.7 100
## df.add(df) 767.8 823.5 857.0 938.2 1156.9 100
## dt.addA(dt) 709.6 1071.9 1112.6 1170.1 1343.9 100
## test relative
## 4 dt.addC(dt) 1.000
## 3 dt.addB(dt) 1.566
## 1 df.add(df) 6.172
## 2 dt.addA(dt) 7.594
System/Session Info...
Intel® Core™ i7-4700MQ Processor
24 GB
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rbenchmark_1.0.0 microbenchmark_1.3-0 data.table_1.9.2
##
## "Linux" "3.11.0-19-generic" "x86_64"
Intel(R) Celeron(R) CPU 2.53GHz
2 GB
## R version 3.0.3 (2014-03-06)
## Platform: i686-pc-linux-gnu (32-bit)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rbenchmark_1.0.0 microbenchmark_1.3-0 data.table_1.9.2
## [4] knitr_1.5
##
## "Linux" "3.2.0-60-generic-pae" "i686"
Why don't you simply do the following:
df <- data.frame (x=1:20)
df$a <- 31:50
df$b <- 201:220
There's an excellent ebook called "R Fundamentals and Graphics" which will give you a solid understanding of the basics of R and its graphical features.