I noticed some strange behavior of data.table that I don't understand:
library(data.table)
df <- as.data.table(matrix(ncol = 100, nrow = 3, data = sample(letters, 300, replace = TRUE)))
If I want to swap the first two columns, I can do:
df[,c(2,1,3:100L)]
which works fine. But if I do:
df[,c(2,1,3:ncol(df))]
[1] 2 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
[65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
[97] 97 98 99 100
and I don't understand it, because ncol(df) is 100 and it is an integer. Why does it do that?
You need to use with=FALSE as follows:
df[,c(2,1,3:ncol(df)),with=FALSE]
From ?data.table, under the Arguments entry for with:
When j is a character vector of column names, a numeric vector of column positions to select or of the form startcol:endcol, and the value returned is always a data.table. with=FALSE is not necessary anymore to select columns dynamically. Note that x[, cols] is equivalent to x[, ..cols] and to x[, cols, with=FALSE] and to x[, .SD, .SDcols=cols].
Since c(2,1,3:100L) is a literal numeric vector of column positions, with=FALSE is not required and the columns are returned automatically. When j is c(2,1,3:ncol(df)), the call to ncol() makes data.table treat j as an expression to evaluate, so the numeric vector itself is returned.
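For illustration, a small sketch (assuming the df built above) of the equivalent forms the documentation mentions, with the column positions held in a variable:
cols <- c(2, 1, 3:ncol(df))        # column positions stored in a variable
identical(df[, ..cols],            # '..' prefix looks 'cols' up outside df
          df[, cols, with = FALSE])
# [1] TRUE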
I have a data table with over 90000 observations and 1201 variables. All columns except the last one store numeric values, the last column is the column with names of source files (over 100). Here is a small sample of the data table:
library(data.table)
DT <- data.table(V1=sample(0:100,20,replace=TRUE),
V2=sample(0:100,20,replace=TRUE), V3=sample(0:100,20,replace=TRUE),
V4=sample(0:100,20,replace=TRUE), V5=sample(0:100,20,replace=TRUE),
V6=sample(0:100,20,replace=TRUE), V7=sample(0:100,20,replace=TRUE),
file=rep(c("A","B","C","D"), each = 5))
What I want to do is to calculate a median of ALL values in each group (file). So e.g. for group A the median would be calculated from rows 1,2,3,4,5 at once. In the next step, I would like to assign the medians to each of the rows depending on a group (expected output below).
The question seems simple; I have googled many similar questions about median/mean calculation by group (aggregate being one of the most popular solutions). However, in all of those cases only one column is taken into account for the median calculation. Here there are 7 columns (or, in my original data, 1200), and median does not accept that: it expects a numeric vector.
Therefore I have experimented with unlist, aggregate, the dplyr package and tapply, without any luck...
Due to the amount of data and the number of groups (i.e. files), the code should be quite automatic and efficient... I would really appreciate your help!
Just a small example of the code, which obviously failed:
DT_median <- setDT(DT)[, DT_med := median(DT[,1:7]), by = file]
The expected result should look like this:
V1 V2 V3 V4 V5 V6 V7 file DT_med
42 78 9 0 60 46 65 A 37.5
36 36 46 45 5 96 64 A 37.5
83 31 92 100 15 2 9 A 37.5
36 16 49 82 32 4 46 A 37.5
29 17 39 6 62 52 97 A 37.5
37 70 17 90 8 10 93 B 47
72 62 68 83 96 77 20 B 47
10 47 29 2 93 16 30 B 47
69 87 7 47 96 17 8 B 47
23 70 72 27 10 86 49 B 47
78 51 13 33 56 6 39 C 51
28 92 100 5 75 33 17 C 51
71 82 9 20 34 83 22 C 51
62 40 84 87 37 45 34 C 51
55 80 55 94 66 96 12 C 51
93 1 99 97 7 77 6 D 41
53 55 71 12 19 25 28 D 41
27 25 28 89 41 22 60 D 41
91 25 25 57 21 98 27 D 41
2 63 17 53 99 65 95 D 41
As we want to calculate the median of all the values grouped by 'file', unlist the Subset of Data.table (.SD), get the median, and assign (:=) the output to create the new column 'DT_med':
library(data.table)
DT[, DT_med := median(unlist(.SD), na.rm = TRUE), by = file]
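In the full data set with 1201 columns, .SD already excludes the grouping column 'file', but the selection can also be made explicit with .SDcols; a small sketch under that assumption:
num_cols <- setdiff(names(DT), "file")   # every column except the grouping one
DT[, DT_med := median(unlist(.SD), na.rm = TRUE), by = file, .SDcols = num_cols]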
I use a simple example from the "airquality" dataset.
The first four rows are complete, which can be checked simply with complete.cases.
Row 5 contains missing values.
Row 6 also contains missing values.
This can be checked quickly by:
is.na(airquality[5,])
is.na(airquality[6,])
I would expect which(is.na(airquality)) to give me the list of row numbers that contain at least one TRUE, i.e. at least one NA value.
However, it lists 5, 10, 25, ..., i.e. row number 6 is NOT listed. Why? There is an NA value in row number 6!
library(datasets)
complete.cases(airquality)
is.na(airquality[5,])
is.na(airquality[6,])
which(is.na(airquality))
There is obviously something that I do not understand here.
From help("is.na"):
The data frame method for is.na returns a logical matrix with the same
dimensions as the data frame, and with dimnames taken from the row and
column names of the data frame.
In other words, it's not giving you the information you're assuming it's giving you. It's giving you the linear positions of the elements of the logical matrix described above, counting down the columns (column-major order). Try
# get the cases with missingness
which(!complete.cases(airquality))
[1] 5 6 10 11 25 26 27 32 33 34 35 36 37 39 42 43 45 46 52
[20] 53 54 55 56 57 58 59 60 61 65 72 75 83 84 96 97 98 102 103
[39] 107 115 119 150
# and check against is.na
unique(sort(which(is.na(airquality), arr.ind = TRUE)[ , 1]))
[1] 5 6 10 11 25 26 27 32 33 34 35 36 37 39 42 43 45 46 52
[20] 53 54 55 56 57 58 59 60 61 65 72 75 83 84 96 97 98 102 103
[39] 107 115 119 150
all.equal(which(!complete.cases(airquality)),
unique(sort(which(is.na(airquality), arr.ind = TRUE)[ , 1])))
[1] TRUE
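As a complementary check (not part of the original answer), arrayInd() converts the column-major linear indices returned by which() back into (row, column) pairs:
lin <- which(is.na(airquality))           # linear indices, counting down columns
head(arrayInd(lin, dim(airquality)))      # corresponding (row, column) pairs
unique(sort(arrayInd(lin, dim(airquality))[, 1]))   # same rows as above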
I am experiencing some strange behavior in R when trying to index a matrix with another matrix. I run into a subscript out of bounds error when indexing with a two-column matrix, but not with a four-column matrix. See the following reproducible code. Any insight would be appreciated!
This
data <- matrix(rbinom(100, 1, .5), nrow = 10)
idx <- cbind(1:50, 51:100)
data[idx]
results in:
Error in data[idx] : subscript out of bounds
However
data[cbind(idx,idx)]
works.
My session info:
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin15.5.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
The key insight as to why this isn't working is given in ?'[':
When indexing arrays by [ a single argument i can be a matrix with as many columns as there are dimensions of x; the result is then a vector with elements corresponding to the sets of indices in each row of i.
and it is clear why the subscript out of bounds error arises: data doesn't have 50 rows or 100 columns.
What's happening in the second example is that the indexing matrix is treated as a plain vector, because it has more columns than the matrix being indexed has dimensions, so it extracts elements c(1:100, 1:100) from data.
This is more easily seen with
m <- matrix(1:100, ncol = 10, byrow = TRUE)
and indexing with cbind(idx, idx) gives
> m[cbind(idx,idx)]
[1] 1 11 21 31 41 51 61 71 81 91 2 12 22 32 42 52 62 72
[19] 82 92 3 13 23 33 43 53 63 73 83 93 4 14 24 34 44 54
[37] 64 74 84 94 5 15 25 35 45 55 65 75 85 95 6 16 26 36
[55] 46 56 66 76 86 96 7 17 27 37 47 57 67 77 87 97 8 18
[73] 28 38 48 58 68 78 88 98 9 19 29 39 49 59 69 79 89 99
[91] 10 20 30 40 50 60 70 80 90 100 1 11 21 31 41 51 61 71
[109] 81 91 2 12 22 32 42 52 62 72 82 92 3 13 23 33 43 53
[127] 63 73 83 93 4 14 24 34 44 54 64 74 84 94 5 15 25 35
[145] 45 55 65 75 85 95 6 16 26 36 46 56 66 76 86 96 7 17
[163] 27 37 47 57 67 77 87 97 8 18 28 38 48 58 68 78 88 98
[181] 9 19 29 39 49 59 69 79 89 99 10 20 30 40 50 60 70 80
[199] 90 100
which is the same as
m[c(idx[,1], idx[,2], idx[,1], idx[,2])]
or specifically,
m[c(1:50, 51:100, 1:50, 51:100)]
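For comparison, a minimal sketch of valid two-column matrix indexing, where each row of the index matrix is an in-bounds (row, column) pair:
data <- matrix(rbinom(100, 1, .5), nrow = 10)
idx2 <- cbind(1:10, 10:1)   # ten (row, col) pairs, all within dim(data)
data[idx2]                  # extracts data[1,10], data[2,9], ..., data[10,1]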
I've got some sort of index, like:
index <- 1:100
I've also got a list of "exclusion intervals" / ranges
exclude <- data.frame(start = c(5,50, 90), end = c(10,55, 95))
start end
1 5 10
2 50 55
3 90 95
I'm looking for an efficient way (in R) to remove all the indices that fall within the ranges in the exclude data frame,
so the desired output would be:
1,2,3,4, 11,12,...,48,49, 56,57,...,88,89, 96,97,98,99,100
I could do this iteratively: go over every exclusion interval (using ddply) and remove the indices that fall in each interval. But is there a more efficient way (or a ready-made function) that does this?
I'm using library(intervals) to compute my intervals; I could not find a built-in function that does this.
Another approach that looks valid could be:
starts = findInterval(index, exclude[["start"]])
ends = findInterval(index, exclude[["end"]] + 1L) ## the "+ 1L" removes the
                                                  ## upper bounds from 'index' too
index[starts != (ends + 1L)] ## a value above a lower bound and
                             ## below an upper bound is inside that interval
The main advantage here is that no vector containing all the intervals' elements needs to be created, and also that it handles any values inside a particular interval, not just integers; e.g.:
set.seed(101); x = round(runif(15, 1, 100), 3)
x
# [1] 37.848 5.339 71.259 66.111 25.736 30.705 58.902 34.013 62.579 55.037 88.100 70.981 73.465 93.232 46.057
x[findInterval(x, exclude[["start"]]) != (findInterval(x, exclude[["end"]]) + 1L)]
# [1] 37.848 71.259 66.111 25.736 30.705 58.902 34.013 62.579 55.037 88.100 70.981 73.465 46.057
We can use Map to get the sequence for the corresponding elements of the 'start' and 'end' columns, unlist to create a vector, and setdiff to get the values of 'index' that are not in that vector:
setdiff(index,unlist(with(exclude, Map(`:`, start, end))))
#[1] 1 2 3 4 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#[20] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
#[39] 45 46 47 48 49 56 57 58 59 60 61 62 63 64 65 66 67 68 69
#[58] 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
#[77] 89 96 97 98 99 100
Or we can use rep and then use setdiff.
i1 <- with(exclude, end-start) +1L
setdiff(index,with(exclude, rep(start, i1)+ sequence(i1)-1))
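As a quick sanity check (not part of the original answers), the two constructions of the exclusion vector agree:
a <- unlist(with(exclude, Map(`:`, start, end)))
b <- with(exclude, rep(start, i1) + sequence(i1) - 1)
identical(as.integer(a), as.integer(b))
# [1] TRUE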
NOTE: In both methods, the constructed vector contains the positions that need to be excluded. In the above case the original vector ('index') is a plain sequence, so I used setdiff. If it contains arbitrary elements, use the position vector appropriately, i.e.
index[-unlist(with(exclude, Map(`:`, start, end)))]
or
index[setdiff(seq_along(index), unlist(with(exclude,
Map(`:`, start, end))))]
Another approach
> index[-do.call(c, lapply(1:nrow(exclude), function(x) exclude$start[x]:exclude$end[x]))]
[1] 1 2 3 4 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[25] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 56 57 58 59 60
[49] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
[73] 85 86 87 88 89 96 97 98 99 100
?sort states that the partial argument may be NULL or a vector of indices for partial sorting.
I tried:
x <- c(1,3,5,2,4,6,7,9,8,10)
sort(x)
## [1] 1 2 3 4 5 6 7 8 9 10
sort(x, partial=5)
## [1] 1 3 4 2 5 6 7 9 8 10
sort(x, partial=2)
## [1] 1 2 5 3 4 6 7 9 8 10
sort(x, partial=4)
## [1] 1 2 3 4 5 6 7 9 8 10
I am not sure what partial means when sorting a vector.
As ?sort states,
If partial is not NULL, it is taken to contain indices of elements of the result
which are to be placed in their correct positions in the sorted array by partial sorting.
In other words, the following assertion is always true:
stopifnot(sort(x, partial=pt_idx)[pt_idx] == sort(x)[pt_idx])
for any x and pt_idx, e.g.
x <- sample(100) # input vector
pt_idx <- sample(1:100, 5) # indices for partial arg
This behavior is different from the one defined in the Wikipedia article on partial sorting. In R's sort() we are not necessarily computing the k smallest elements.
For example, if
print(x)
## [1] 91 85 63 80 71 69 20 39 78 67 32 56 27 79 9 66 88 23 61 75 68 81 21 90 36 84 11 3 42 43
## [31] 17 97 57 76 55 62 24 82 28 72 25 60 14 93 2 100 98 51 29 5 59 87 44 37 16 34 48 4 49 77
## [61] 13 95 31 15 70 18 52 58 73 1 45 40 8 30 89 99 41 7 94 47 96 12 35 19 38 6 74 50 86 65
## [91] 54 46 33 22 26 92 53 10 64 83
and
pt_idx
## [1] 5 54 58 95 8
then
sort(x, partial=pt_idx)
## [1] 1 3 2 4 5 6 7 8 11 12 9 10 13 15 14 16 17 18 23 30 31 27 21 32 36 34 35 19 20 37
## [31] 38 33 29 22 26 25 24 28 39 41 40 42 43 48 46 44 45 47 51 50 52 49 53 54 57 56 55 58 59 60
## [61] 62 64 63 61 65 66 70 72 73 69 68 71 67 79 78 82 75 81 80 77 76 74 89 85 88 87 83 84 86 90
## [91] 92 93 91 94 95 96 97 99 100 98
Here x[5], x[54], ..., x[8] are placed in their correct positions - and we cannot say anything else about the remaining elements. HTH.
EDIT: Partial sorting may of course reduce the sorting time if you are interested in finding only some of the order statistics.
require(microbenchmark)
x <- rnorm(100000)
microbenchmark(sort(x, partial=1:10)[1:10], sort(x)[1:10])
## Unit: milliseconds
## expr min lq median uq max neval
## sort(x, partial = 1:10)[1:10] 2.342806 2.366383 2.393426 3.631734 44.00128 100
## sort(x)[1:10] 16.556525 16.645339 16.745489 17.911789 18.13621 100
Regarding the statement "Here x[5], x[54], ..., x[8] are placed in their correct positions": I don't think that is correct. It should rather be: in the result, i.e. the sorted x, result[5], result[54], ..., result[8] will hold the correct values from x.
Quote from the R manual:
If partial is not NULL, it is taken to contain indices of elements of
the result which are to be placed in their correct positions in the
sorted array by partial sorting. For each of the result values in a
specified position, any values smaller than that one are guaranteed to
have a smaller index in the sorted array and any values which are
greater are guaranteed to have a bigger index in the sorted array.
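A small check of that guarantee (my own sketch, not from the manual): after sort(x, partial = k), result[k] equals the k-th order statistic, and everything before it is no larger:
set.seed(42)
x <- sample(1000)
k <- 10
res <- sort(x, partial = k)
stopifnot(res[k] == sort(x)[k])      # result[k] is the k-th smallest value
all(res[seq_len(k - 1)] <= res[k])   # all earlier elements are <= res[k]
# [1] TRUE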