I have a data table with over 90,000 observations and 1,201 variables. All columns except the last one store numeric values; the last column holds the names of the source files (over 100 of them). Here is a small sample of the data table:
library(data.table)
DT <- data.table(V1 = sample(0:100, 20, replace = TRUE),
                 V2 = sample(0:100, 20, replace = TRUE),
                 V3 = sample(0:100, 20, replace = TRUE),
                 V4 = sample(0:100, 20, replace = TRUE),
                 V5 = sample(0:100, 20, replace = TRUE),
                 V6 = sample(0:100, 20, replace = TRUE),
                 V7 = sample(0:100, 20, replace = TRUE),
                 file = rep(c("A", "B", "C", "D"), each = 5))
What I want to do is to calculate the median of ALL values in each group (file). So e.g. for group A the median would be calculated from rows 1 to 5 at once. In the next step, I would like to assign the medians to each of the rows depending on the group (expected output below).
The question seems simple, and I have googled many similar questions about calculating a median/mean by group (with aggregate as one of the most popular solutions). However, in all of those cases only one column is taken into account for the median calculation. Here there are 7 columns (or 1,200 in my original data), and median does not accept that; it expects a numeric vector.
I have therefore experimented with unlist, aggregate, the dplyr package, and tapply, without any luck.
Due to the amount of data and the number of groups (i.e. files), the code should be fairly automatic and efficient. I would really appreciate your help!
Just a small example of the code, which has obviously failed:
DT_median <- setDT(DT)[, DT_med := median(DT[,1:7]), by = file]
The expected result should look like this:
V1 V2 V3 V4 V5 V6 V7 file DT_med
42 78 9 0 60 46 65 A 37.5
36 36 46 45 5 96 64 A 37.5
83 31 92 100 15 2 9 A 37.5
36 16 49 82 32 4 46 A 37.5
29 17 39 6 62 52 97 A 37.5
37 70 17 90 8 10 93 B 47
72 62 68 83 96 77 20 B 47
10 47 29 2 93 16 30 B 47
69 87 7 47 96 17 8 B 47
23 70 72 27 10 86 49 B 47
78 51 13 33 56 6 39 C 51
28 92 100 5 75 33 17 C 51
71 82 9 20 34 83 22 C 51
62 40 84 87 37 45 34 C 51
55 80 55 94 66 96 12 C 51
93 1 99 97 7 77 6 D 41
53 55 71 12 19 25 28 D 41
27 25 28 89 41 22 60 D 41
91 25 25 57 21 98 27 D 41
2 63 17 53 99 65 95 D 41
Since we want to calculate the median from all the values, grouped by 'file', unlist the Subset of Data.table (.SD), take the median, and assign (:=) the output to create the new column 'DT_med':
library(data.table)
DT[, DT_med := median(unlist(.SD), na.rm = TRUE), by = file]
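If the real table ever contains non-numeric columns other than 'file', one variant (just a sketch; it assumes the numeric columns are named V1 to V7 as in the sample) restricts .SD explicitly via .SDcols:
num_cols <- paste0("V", 1:7)  # the numeric columns of the sample data
DT[, DT_med := median(unlist(.SD), na.rm = TRUE), by = file, .SDcols = num_cols]
# sanity check: one median per file
unique(DT[, .(file, DT_med)])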
I noticed a strange behavior of data.table that I don't understand:
library(data.table)
df <- as.data.table(matrix(ncol = 100, nrow = 3, data = sample(letters, 300, replace = TRUE)))
If I want to swap the first two columns, I can do:
df[,c(2,1,3:100L)]
which works fine. But if I do:
df[,c(2,1,3:ncol(df))]
[1] 2 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
[65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
[97] 97 98 99 100
I don't understand this, because ncol(df) is 100 and is an integer. Why does it do that?
You need to use with=FALSE as follows:
df[,c(2,1,3:ncol(df)),with=FALSE]
From ?data.table, under the Arguments entry for with:
When j is a character vector of column names, a numeric vector of column positions to select or of the form startcol:endcol, the value returned is always a data.table. with=FALSE is not necessary anymore to select columns dynamically. Note that x[, cols] is equivalent to x[, ..cols] and to x[, cols, with=FALSE] and to x[, .SD, .SDcols=cols].
Since c(2,1,3:100L) is a literal numeric vector, with=FALSE is not required and the columns are returned automatically. When j is c(2,1,3:ncol(df)), however, it contains a function call, so the expression is evaluated within the data.table and the resulting vector is returned as-is.
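A common alternative, mentioned in the quote above, is to build the index vector first and select it with the .. prefix, which tells data.table to look the name up in the calling scope:
cols <- c(2, 1, 3:ncol(df))  # desired column order
df[, ..cols]                 # equivalent to df[, cols, with = FALSE]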
Should have a dupe somewhere
I use a simple example from the "airquality" dataset.
The first four rows are complete, which can be checked with complete.cases.
Rows 5 and 6 both contain missing values.
This can be checked quickly by:
is.na(airquality[5,])
is.na(airquality[6,])
I would expect that which(is.na(airquality)) would give me the list of row numbers that include at least one TRUE, i.e. at least one NA value.
However, it lists 5, 10, 25, ..., i.e. row number 6 is NOT listed. Why? There is an NA value in row 6!
library(datasets)
complete.cases(airquality)
is.na(airquality[5,])
is.na(airquality[6,])
which(is.na(airquality))
There is obviously something that I do not understand here.
From help("is.na"):
The data frame method for is.na returns a logical matrix with the same
dimensions as the data frame, and with dimnames taken from the row and
column names of the data frame.
In other words, it's not giving you the information you're assuming it's giving you. It's giving you the linear positions of the NA entries in the logical matrix described above, counting down the columns (column-major order). Try
# get the cases with missingness
which(!complete.cases(airquality))
[1] 5 6 10 11 25 26 27 32 33 34 35 36 37 39 42 43 45 46 52
[20] 53 54 55 56 57 58 59 60 61 65 72 75 83 84 96 97 98 102 103
[39] 107 115 119 150
# and check against is.na
unique(sort(which(is.na(airquality), arr.ind = TRUE)[ , 1]))
[1] 5 6 10 11 25 26 27 32 33 34 35 36 37 39 42 43 45 46 52
[20] 53 54 55 56 57 58 59 60 61 65 72 75 83 84 96 97 98 102 103
[39] 107 115 119 150
all.equal(which(!complete.cases(airquality)),
unique(sort(which(is.na(airquality), arr.ind = TRUE)[ , 1])))
[1] TRUE
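For completeness, an equivalent base R idiom (not from the answer above) counts the NAs per row of the logical matrix:
# rows with at least one NA
which(rowSums(is.na(airquality)) > 0)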
Suppose I have a matrix or a data frame and I want only those values that are at least 5, with no values between 86 and 90 (both inclusive).
a <- matrix(1:100, nrow = 10, ncol = 10)
rownames(a) <- LETTERS[1:10]
colnames(a) <- LETTERS[1:10]
A B C D E F G H I J
A 1 11 21 31 41 51 61 71 81 91
B 2 12 22 32 42 52 62 72 82 92
C 3 13 23 33 43 53 63 73 83 93
D 4 14 24 34 44 54 64 74 84 94
E 5 15 25 35 45 55 65 75 85 95
F 6 16 26 36 46 56 66 76 86 96
G 7 17 27 37 47 57 67 77 87 97
H 8 18 28 38 48 58 68 78 88 98
I 9 19 29 39 49 59 69 79 89 99
J 10 20 30 40 50 60 70 80 90 100
Note: you can convert it into a data frame if this kind of operation is easier with data frames.
Now I want my result in such a format that only the values that are at least 5 and not between 86 and 90 are retained; everything else is deleted and replaced with a blank space.
My desired output is like below:
A B C D E F G H I J
A 11 21 31 41 51 61 71 81 91
B 12 22 32 42 52 62 72 82 92
C 13 23 33 43 53 63 73 83 93
D 14 24 34 44 54 64 74 84 94
E 5 15 25 35 45 55 65 75 85 95
F 6 16 26 36 46 56 66 76 96
G 7 17 27 37 47 57 67 77 97
H 8 18 28 38 48 58 68 78 98
I 9 19 29 39 49 59 69 79 99
J 10 20 30 40 50 60 70 80 100
Is there any function in R which can take my condition and produce the desired result? I want to be able to adapt the code to the problem. I searched Stack Overflow but didn't find anything like this. I don't want to filter based on whole rows or columns.
I tried
a[a> 5 & a!=c(85:90)]
but this gives me the values and loses the matrix structure.
Assuming that 'a' is a matrix, we can assign the values where a %in% 86:90 or a < 5 to NA. Here I am not assigning '', as that would change the class from numeric to character. Also, NA is more useful for later processing.
a[a %in% 86:90 | a<5] <- NA
However, if we need it to be ''
a[a %in% 86:90 | a<5] <- ""
If we are using a data.frame
a1 <- as.data.frame(a)
a1[] <- lapply(a1, function(x) replace(x, x %in% 86:90 | x < 5, ""))
a1
# A B C D E F G H I J
#A 11 21 31 41 51 61 71 81 91
#B 12 22 32 42 52 62 72 82 92
#C 13 23 33 43 53 63 73 83 93
#D 14 24 34 44 54 64 74 84 94
#E 5 15 25 35 45 55 65 75 85 95
#F 6 16 26 36 46 56 66 76 96
#G 7 17 27 37 47 57 67 77 97
#H 8 18 28 38 48 58 68 78 98
#I 9 19 29 39 49 59 69 79 99
#J 10 20 30 40 50 60 70 80 100
NOTE: This changes the class of each column to character
In the OP's code, a != c(85:90) will not work as intended: 85:90 is recycled to the length of 'a', and the comparison is made element-wise between 'a' and the recycled vector. Instead, we need %in% when testing against a vector of length > 1.
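To see why, here is a small sketch with illustrative values: != compares position by position (recycling the shorter vector), while %in% tests membership.
v <- c(3, 85, 86, 90, 91, 86)   # same length as 85:90, for a clean comparison
v != 85:90                      # element-wise: 3!=85, 85!=86, 86!=87, ...
# [1] TRUE TRUE TRUE TRUE TRUE TRUE   (misses 85, 86 and 90 entirely)
v %in% 86:90                    # TRUE exactly for the values inside 86..90
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE
v[!(v %in% 86:90) & v >= 5]     # keep values >= 5 and outside 86..90
# [1] 85 91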
I've got some sort of index, like:
index <- 1:100
I've also got a list of "exclusion intervals" / ranges
exclude <- data.frame(start = c(5,50, 90), end = c(10,55, 95))
start end
1 5 10
2 50 55
3 90 95
I'm looking for an efficient way (in R) to remove all the indexes that belong in the ranges in the exclude data frame
so the desired output would be:
1,2,3,4, 11,12,...,48,49, 56,57,...,88,89, 96,97,98,99,100
I could do this iteratively: go over every exclusion interval (using ddply) and remove the indexes that fall in it. But is there a more efficient way (or a function) that does this?
I'm using library(intervals) to compute my intervals, but I could not find a built-in function that does this.
Another approach that looks valid could be:
starts <- findInterval(index, exclude[["start"]])
ends <- findInterval(index, exclude[["end"]] + 1L) ## '+ 1L' so that the upper
                                                   ## bounds are removed from 'index' too
index[starts != (ends + 1L)] ## a value above a lower bound and
                             ## below an upper bound is inside that interval
The main advantage here is that no vector containing all the intervals' elements needs to be created, and also that it handles any set of values inside a particular interval, not just integers; e.g.:
set.seed(101); x = round(runif(15, 1, 100), 3)
x
# [1] 37.848 5.339 71.259 66.111 25.736 30.705 58.902 34.013 62.579 55.037 88.100 70.981 73.465 93.232 46.057
x[findInterval(x, exclude[["start"]]) != (findInterval(x, exclude[["end"]]) + 1L)]
# [1] 37.848 71.259 66.111 25.736 30.705 58.902 34.013 62.579 55.037 88.100 70.981 73.465 46.057
We can use Map to get the sequence between corresponding elements of the 'start' and 'end' columns, unlist to create a single vector, and setdiff to get the values of 'index' that are not in that vector.
setdiff(index,unlist(with(exclude, Map(`:`, start, end))))
#[1] 1 2 3 4 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#[20] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
#[39] 45 46 47 48 49 56 57 58 59 60 61 62 63 64 65 66 67 68 69
#[58] 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
#[77] 89 96 97 98 99 100
Or we can use rep and then use setdiff.
i1 <- with(exclude, end - start) + 1L
setdiff(index, with(exclude, rep(start, i1) + sequence(i1) - 1))
NOTE: Both methods build the vector of positions that need to be excluded. In the above case the original vector ('index') is itself the sequence 1:100, so I used setdiff on the values directly. If 'index' contains arbitrary elements, use the position vector for subsetting instead, i.e.
index[-unlist(with(exclude, Map(`:`, start, end)))]
or
index[setdiff(seq_along(index),
              unlist(with(exclude, Map(`:`, start, end))))]
Another approach
index[-do.call(c, lapply(1:nrow(exclude), function(x) exclude$start[x]:exclude$end[x]))]
[1] 1 2 3 4 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[25] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 56 57 58 59 60
[49] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
[73] 85 86 87 88 89 96 97 98 99 100
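For completeness (this is not from the answers above): recent versions of data.table also ship a vectorised inrange() helper that covers this case directly, with inclusive bounds by default:
library(data.table)
# keep the values that fall inside none of the [start, end] ranges
index[!inrange(index, exclude$start, exclude$end)]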
?sort states that the partial argument may be NULL or a vector of indices for partial sorting.
I tried:
x <- c(1,3,5,2,4,6,7,9,8,10)
sort(x)
## [1] 1 2 3 4 5 6 7 8 9 10
sort(x, partial=5)
## [1] 1 3 4 2 5 6 7 9 8 10
sort(x, partial=2)
## [1] 1 2 5 3 4 6 7 9 8 10
sort(x, partial=4)
## [1] 1 2 3 4 5 6 7 9 8 10
I am not sure what partial means when sorting a vector.
As ?sort states,
If partial is not NULL, it is taken to contain indices of elements of the result
which are to be placed in their correct positions in the sorted array by partial sorting.
In other words, the following assertion is always true:
stopifnot(sort(x, partial=pt_idx)[pt_idx] == sort(x)[pt_idx])
for any x and pt_idx, e.g.
x <- sample(100) # input vector
pt_idx <- sample(1:100, 5) # indices for partial arg
This behavior is different from the one described in the Wikipedia article on partial sorting. In R's sort(), we are not necessarily computing the k smallest elements.
For example, if
print(x)
## [1] 91 85 63 80 71 69 20 39 78 67 32 56 27 79 9 66 88 23 61 75 68 81 21 90 36 84 11 3 42 43
## [31] 17 97 57 76 55 62 24 82 28 72 25 60 14 93 2 100 98 51 29 5 59 87 44 37 16 34 48 4 49 77
## [61] 13 95 31 15 70 18 52 58 73 1 45 40 8 30 89 99 41 7 94 47 96 12 35 19 38 6 74 50 86 65
## [91] 54 46 33 22 26 92 53 10 64 83
and
pt_idx
## [1] 5 54 58 95 8
then
sort(x, partial=pt_idx)
## [1] 1 3 2 4 5 6 7 8 11 12 9 10 13 15 14 16 17 18 23 30 31 27 21 32 36 34 35 19 20 37
## [31] 38 33 29 22 26 25 24 28 39 41 40 42 43 48 46 44 45 47 51 50 52 49 53 54 57 56 55 58 59 60
## [61] 62 64 63 61 65 66 70 72 73 69 68 71 67 79 78 82 75 81 80 77 76 74 89 85 88 87 83 84 86 90
## [91] 92 93 91 94 95 96 97 99 100 98
Here x[5], x[54], ..., x[8] are placed in their correct positions - and we cannot say anything else about the remaining elements. HTH.
EDIT: Partial sorting may also reduce the sorting time, of course, e.g. if you are only interested in a few of the order statistics.
require(microbenchmark)
x <- rnorm(100000)
microbenchmark(sort(x, partial=1:10)[1:10], sort(x)[1:10])
## Unit: milliseconds
## expr min lq median uq max neval
## sort(x, partial = 1:10)[1:10] 2.342806 2.366383 2.393426 3.631734 44.00128 100
## sort(x)[1:10] 16.556525 16.645339 16.745489 17.911789 18.13621 100
Regarding the statement "Here x[5], x[54], ..., x[8] are placed in their correct positions": I don't think that is correct. It should be: in the result, i.e. the partially sorted x, the values at positions 5, 54, ..., 8 (result[5], result[54], ..., result[8]) will be the right values from the fully sorted x.
A quote from the R manual:
If partial is not NULL, it is taken to contain indices of elements of
the result which are to be placed in their correct positions in the
sorted array by partial sorting. For each of the result values in a
specified position, any values smaller than that one are guaranteed to
have a smaller index in the sorted array and any values which are
greater are guaranteed to have a bigger index in the sorted array.
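A small sketch (not from the thread) illustrating that guarantee: everything before a partially sorted position is no larger than the value there, and everything after is no smaller, while the two sides themselves stay unsorted.
set.seed(1)
x <- sample(1000)
k <- 10
y <- sort(x, partial = k)
stopifnot(y[k] == sort(x)[k],                 # k-th order statistic is in place
          all(y[seq_len(k - 1)] <= y[k]),     # smaller values sit before it
          all(y[(k + 1):length(y)] >= y[k]))  # larger values sit after it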