I have a data.table, DT, like the one below:
col1 col2 col3 col4 col5
1: 1 2 3 4 5
2: 4 5 6 8 9
3: 3 4 4 5 5
4: 4 3 5 3 3
5: 4 5 6 6 67
I want to count the unique values in certain columns for each row (a different set of columns may be used for each row).
How do I achieve this in the fewest possible steps? The table is huge, so running a for loop is out of the question.
I am looking for a solution like
DT[ , count_unique:= apply(DT[ , cols, with = F], 1, function(x) { length(unique(x)) })]
But this will fail, since "cols" will need to take different columns for each row.
Any help will be appreciated.
I think this is easiest to do with matrices, which have a matrix subset operation (which, incidentally, inspired the data.table join syntax).
Let's say this is your data:
m = matrix(c(1:4, 1,3,2,2, 1,2,3,3), ncol = 3)
# [,1] [,2] [,3]
#[1,] 1 1 1
#[2,] 2 3 2
#[3,] 3 2 3
#[4,] 4 2 3
And let's say you want to count unique values for all columns for rows 1 and 2, and for first and last columns only for rows 3 and 4. The way you can represent this is as follows:
cols = matrix(c(1,1, 1,2, 1,3,
2,1, 2,2, 2,3,
3,1, 3,3,
4,1, 4,3), ncol = 2, byrow = T)
# [,1] [,2]
# [1,] 1 1
# [2,] 1 2
# [3,] 1 3
# [4,] 2 1
# [5,] 2 2
# [6,] 2 3
# [7,] 3 1
# [8,] 3 3
# [9,] 4 1
#[10,] 4 3
The result you want is then easy to compute:
tapply(m[cols], cols[,1], function(x) length(unique(x)))
#1 2 3 4
#1 2 1 2
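If the per-row column choices start out as a list rather than a two-column matrix, the index matrix can be built mechanically. This is a minimal sketch, assuming a hypothetical list sel whose i-th element holds the columns to use for row i (that list is not part of the question):

```r
# Same example matrix as above
m <- matrix(c(1:4, 1,3,2,2, 1,2,3,3), ncol = 3)

# Hypothetical input: element i lists the columns to use for row i
sel <- list(1:3, 1:3, c(1, 3), c(1, 3))

# Two-column (row, column) index matrix: repeat each row number once per chosen column
cols <- cbind(rep(seq_along(sel), lengths(sel)), unlist(sel))

# Pull the selected cells with matrix indexing and count unique values per row
counts <- tapply(m[cols], cols[, 1], function(x) length(unique(x)))
```

For a data.table, m would be something like as.matrix(DT), assuming all columns share one type.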
I'd like to return a grid with unique rows from a sequence vector. I'm looking for a general solution so I can pass any number of sequences in a vector. I don't know the terminology for this, so how can I do this?
Example
num <- 3
v <- c(seq(1, num, 1))
Desired Output
1 2 3
2 3 1
3 1 2
Second and third column can be switched:
1 3 2
2 1 3
3 2 1
I tried manipulating expand.grid() but it requires sorting and filtering which seems excessive.
We can use permn from the combinat package, which generates all possible permutations of v, and then select the first num of them using head:
head(as.data.frame(do.call(rbind, combinat::permn(v))), num)
# V1 V2 V3
#1 1 2 3
#2 1 3 2
#3 3 1 2
We can also use sample to select any num rows instead of the first num rows.
For reference, the full list of permutations is:
combinat::permn(v) #gives
#[[1]]
#[1] 1 2 3
#[[2]]
#[1] 1 3 2
#[[3]]
#[1] 3 1 2
#[[4]]
#[1] 3 2 1
#[[5]]
#[1] 2 3 1
#[[6]]
#[1] 2 1 3
Here's one solution (column order differs but the idea holds):
n = 3
sweep(replicate(n, 1:n), 2, 1:n, "+") %% n + 1
[,1] [,2] [,3]
[1,] 3 1 2
[2,] 1 2 3
[3,] 2 3 1
Explanation:
replicate will first create a matrix where each column is 1:n:
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 2 2
[3,] 3 3 3
I then use the sweep function to add 1 to column 1, 2 to column 2, 3 to column 3:
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5
[3,] 4 5 6
At this point, you can do a modulo on the matrix and then add 1 to arrive at the desired matrix.
Edit: If you need the same column order as in your desired output, you can do
(sweep(replicate(n, 1:n), 2, 1:n, "+") + 1) %% n + 1
Another base R option
t(sapply(1:length(v), function(i) rep(v, 2)[i:(i+2)]))
# [,1] [,2] [,3]
#[1,] 1 2 3
#[2,] 2 3 1
#[3,] 3 1 2
Explanation: We cyclically permute v; sapply collects the shifted vectors as the columns of a matrix, and t() transposes them into rows.
For general v (of length length(v)) this becomes
t(sapply(1:length(v), function(i) rep(v, 2)[i:(i + length(v) - 1)]))
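The same cyclic shifts can also be produced without an explicit loop by computing the index of every cell with modular arithmetic. This is a minimal sketch, not taken from the answers above; the example values in v are made up for illustration:

```r
v <- c(10, 20, 30, 40)   # illustrative values, any vector works
n <- length(v)

# Cell (i, j) takes element ((i - 1) + (j - 1)) mod n, converted to 1-based indexing
idx <- (outer(seq_len(n), seq_len(n), "+") - 2) %% n + 1

# Look up the values and restore the n x n shape (column-major order is preserved)
latin <- matrix(v[idx], nrow = n)
```

Each row is v shifted one step further to the left, so every value appears exactly once in each row and each column.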
Is it possible to extend the sample function in R to not return more than say 2 of the same element when replace = TRUE?
Suppose I have a list:
l = c(1,1,2,3,4,5)
To sample 3 elements with replacement, I would do:
sample(l, 3, replace = TRUE)
Is there a way to constrain its output so that only a maximum of 2 of the same elements are returned? So (1,1,2) or (1,3,3) is allowed, but (1,1,1) or (3,3,3) is excluded?
set.seed(0)
The basic idea is to convert sampling with replacement to sampling without replacement.
ll <- unique(l) ## unique values
#[1] 1 2 3 4 5
pool <- rep.int(ll, 2) ## replicate each unique so they each appear twice
#[1] 1 2 3 4 5 1 2 3 4 5
sample(pool, 3) ## draw 3 samples without replacement
#[1] 4 3 5
## replicate it a few times
## each column is a sample after the "simplification" by `replicate`
replicate(5, sample(pool, 3))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 1 4 2 2 3
#[2,] 4 5 1 2 5
#[3,] 2 1 2 4 1
If you want different values to appear up to different numbers of times, you can do, for example,
pool <- rep.int(ll, c(2, 3, 3, 4, 1))
#[1] 1 1 2 2 2 3 3 3 4 4 4 4 5
## draw 9 samples; replicate 5 times
oo <- replicate(5, sample(pool, 9))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 5 1 4 3 2
# [2,] 2 2 4 4 1
# [3,] 4 4 1 1 1
# [4,] 4 2 3 2 5
# [5,] 1 4 2 5 2
# [6,] 3 4 3 3 3
# [7,] 1 4 2 2 2
# [8,] 4 1 4 3 3
# [9,] 3 3 2 2 4
We can call tabulate on each column to count the frequency of 1, 2, 3, 4, 5:
## set `nbins` in `tabulate` so frequency table of each column has the same length
apply(oo, 2L, tabulate, nbins = 5)
# [,1] [,2] [,3] [,4] [,5]
#[1,] 2 2 1 1 2
#[2,] 1 2 3 3 3
#[3,] 2 1 2 3 2
#[4,] 3 4 3 1 1
#[5,] 1 0 0 1 1
The counts in all columns meet the frequency upper bound c(2, 3, 3, 4, 1) we have set.
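The pool idea can be wrapped in a small helper. This is only a sketch: the name sample_capped and its max_rep argument are hypothetical, and like the code above it draws from the unique values, so duplicates in the input do not carry extra weight.

```r
# Hypothetical helper: draw `size` values, each unique value appearing at most `max_rep` times
sample_capped <- function(x, size, max_rep = 2L) {
  pool <- rep.int(unique(x), max_rep)        # each unique value max_rep times
  stopifnot(size <= length(pool))            # cannot draw more than the pool holds
  pool[sample.int(length(pool), size)]       # sample positions without replacement
}

set.seed(1)
l <- c(1, 1, 2, 3, 4, 5)
draws <- replicate(1000, sample_capped(l, 3))

# Largest number of repeats seen in any single draw
max_count <- max(apply(draws, 2, function(s) max(table(s))))
```

Sampling positions with sample.int avoids the special case where sample() treats a length-one numeric vector as 1:x.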
Would you explain the difference between rep and rep.int?
rep.int is not the "integer" method for rep. It is just a faster primitive function with less functionality than rep. You can get more details of rep, rep.int and rep_len from the doc page ?rep.
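A short comparison may help; this is a small sketch based on the documented arguments of these functions, with a made-up vector x:

```r
x <- c(1, 2, 3)

a <- rep(x, times = 2)       # whole vector repeated twice
b <- rep.int(x, 2)           # same result, but rep.int only accepts `times`
c_each <- rep(x, each = 2)   # `each` is available in rep, not in rep.int
d <- rep_len(x, 5)           # recycle x to a fixed output length
```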
I am looking for R code that transforms a matrix as follows (a: the original matrix, b: the desired output). Example:
a <- matrix(c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6), nrow = 6, ncol = 4)
b <- matrix(c(1,2,3,4,5,6,2,3,4,5,6,0,3,4,5,6,0,0,4,5,6,0,0,0), nrow = 6, ncol = 4)
a
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 2 2 2 2
[3,] 3 3 3 3
[4,] 4 4 4 4
[5,] 5 5 5 5
[6,] 6 6 6 6
b
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 3 4 5
[3,] 3 4 5 6
[4,] 4 5 6 0
[5,] 5 6 0 0
[6,] 6 0 0 0
Thus, the first column is not shifted, the second column is shifted up one step, the third column shifted up two steps, and so on. The shifted columns are padded with zeros.
The following links didn't help me (nor did a double for-loop, a function with different variables, or the functions diag and kronecker).
R: Shift values in single column of dataframe UP
r matrix individual shift operations of elements
Rotate a Matrix in R
Have you any ideas? Thanks.
This seems to work with data.table. Should perform well with a large matrix:
library(data.table)
# One way
dt[, shift(.SD, 0:3, 0, "lead", FALSE), .SDcols = 1]
# Alternatively
dt[, shift(dt, 0:3, 0, "lead", FALSE)][, 1:4]
Both return:
V1 V2 V3 V4
1: 1 2 3 4
2: 2 3 4 5
3: 3 4 5 6
4: 4 5 6 0
5: 5 6 0 0
6: 6 0 0 0
Using the following data:
a <- matrix(c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6), nrow = 6, ncol = 4)
dt <- setDT(as.data.frame(a))
I have a rough solution using sapply. Each iteration of sapply shifts one column and pads it with zeros; sapply then combines the results, which you can pass to matrix with the right dimensions (those of your initial matrix).
matrix(sapply(1:dim(a)[2], function(x){c(a[x:dim(a)[1], x], rep(0, (x - 1) ))}), ncol = dim(a)[2], nrow = dim(a)[1])
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 3 4 5
[3,] 3 4 5 6
[4,] 4 5 6 0
[5,] 5 6 0 0
[6,] 6 0 0 0
You can shift the columns by filling a matrix that has one more row than "a" with the values from "a" (a warning is generated during the recycling). Then select the original number of rows and replace the lower right triangle with zeros.
nr <- nrow(a)
a2 <- matrix(a, ncol = ncol(a), nrow = nr + 1)[1:nr, ]
a2[col(a2) + row(a2) > nr + 1] <- 0
a2
# [,1] [,2] [,3] [,4]
# [1,] 1 2 3 4
# [2,] 2 3 4 5
# [3,] 3 4 5 6
# [4,] 4 5 6 0
# [5,] 5 6 0 0
# [6,] 6 0 0 0
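If the recycling warning is unwelcome, the same shift can be written with explicit matrix indexing. This is a minimal sketch under the same assumptions (column j shifted up by j - 1 rows, padded with zeros); the helper names src and ok are made up for illustration:

```r
a <- matrix(rep(1:6, 4), nrow = 6)   # same values as the example matrix
nr <- nrow(a)
nc <- ncol(a)

# Target cell (i, j) takes its value from source row i + j - 1 of the same column
src <- row(a) + col(a) - 1
ok <- src <= nr                      # cells whose source row exists

b <- matrix(0, nr, nc)               # start from the zero padding
b[ok] <- a[cbind(src[ok], col(a)[ok])]
```

Unlike shifting copies of the first column, this reads each column's own values, so it also works when the columns differ.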
Building on tyluRp's answer, which almost worked for me, I suggest looping through all columns and calling shift on each one individually. Let's start with a matrix of random numbers here:
a <- matrix(floor(10*runif(24)), ncol=4)
a
[,1] [,2] [,3] [,4]
[1,] 8 4 8 3
[2,] 0 6 9 0
[3,] 1 6 0 7
[4,] 0 3 9 7
[5,] 2 4 2 9
[6,] 4 8 5 6
library(data.table)
dt <- setDT(as.data.frame(a))
Now the loop that does the job...
for (i in 2:length(dt)) dt[,i] <- shift(dt[,i,with=F],(i-1),0,"lead")
...by replacing columns with their shifted version.
The original answers replaced all columns by shifted copies of the first column, thus losing data. This is probably due to the group behaviour of data.table.
I have Matrix A and List B as per below:
Matrix A:
[,1][,2]
[1,] 1 1
[2,] 1 2
[3,] 2 1
[4,] 2 2
[5,] 10 1
[6,] 10 2
[7,] 11 1
[8,] 11 2
[9,] 5 5
[10,] 5 6
List B below contains groupings based on minimum distance, given in the row order of Matrix A. For example, the first four entries of List B[[1]] correspond to the first four points of Matrix A, that is (1,1), (1,2), (2,1), (2,2); all four belong to group 1, and so on.
List B:
[[1]]
[1] 1 1 1 1 3 2 3 2 1 1
[[2]]
[1] 3 3 3 3 3 3 1 2 3 3
[[3]]
[1] 1 1 2 2 3 3 3 3 2 2
How can I calculate the mean of the points of group 1, group 2 and group 3 respectively based on the groupings?
If there is only one vector, this is how I do it:
meanPoints <- apply(MatrixA, 2, tapply, ListB, mean)
But how do I loop over the list to get the mean points for List B[[1]], [[2]] and [[3]] in R?
I think you can do it with lapply() to build an anonymous function to handle the iteration through your multiple grouping vectors.
# similar data bc I didn't want to type that
MatrixA <- matrix(data = 1:20, ncol = 2)
B <- c(rep(1:3, length.out = 10))
C <- c(rep(3:1, length.out = 10))
listB <- list(B, C)
# just wrapping your single vector solution
lapply(listB, function(x) {apply(MatrixA, 2, tapply, x, mean)})
[[1]]
[,1] [,2]
1 5.5 15.5
2 5.0 15.0
3 6.0 16.0
[[2]]
[,1] [,2]
1 6.0 16.0
2 5.0 15.0
3 5.5 15.5
Is that what you were looking for?
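If the matrix is large, the group means can also be computed from group sums. This is a sketch of an alternative, not the approach above: it uses base R's rowsum and divides by the group sizes, on the same stand-in data.

```r
MatrixA <- matrix(1:20, ncol = 2)
listB <- list(rep(1:3, length.out = 10), rep(3:1, length.out = 10))

# For each grouping vector g: per-group column sums divided by group sizes.
# rowsum() and table() both order the groups by sorted group label.
group_means <- lapply(listB, function(g) {
  rowsum(MatrixA, g) / as.vector(table(g))
})
```

Each list element is a matrix with one row per group and one column per column of MatrixA.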
I am looking for a fast way to convert a list into a matrix with an additional column containing a repeating 1:5 pattern. For instance, the list mat looks like this. The list and the repeating pattern can get to thousands of values in length and so a fast approach would be ideal.
I can convert the list to a matrix using melt (may not be ideal for large matrices though), however, I am having trouble getting the repeating pattern to work.
The list looks like this:
mat
[[1]]
[1] 5
[[2]]
[1] 1 4 5
[[3]]
[1] 3 1
[[4]]
[1] 4 6 5 3
The output should contain the values of the list as well as an index column that counts from 1 up to the length of each list element. For instance, mat[[4]] contains 4 values, so its part of the index column should contain the values 1:4.
output
[,1] [,2]
5 1
1 1
4 2
5 3
3 1
1 2
4 1
6 2
5 3
3 4
mat <- list(5, c(1,4,5), c(3,1), c(4,6,5,3)) ## your example data
We can use basic operations:
cbind( unlist(mat), sequence(lengths(mat)) )
# [,1] [,2]
# [1,] 5 1
# [2,] 1 1
# [3,] 4 2
# [4,] 5 3
# [5,] 3 1
# [6,] 1 2
# [7,] 4 1
# [8,] 6 2
# [9,] 5 3
#[10,] 3 4
Alternatively,
cbind( unlist(mat), unlist(lapply(mat, seq_along)) )
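If you also need to know which list element each value came from, one more column can be added in the same way. This is a minimal sketch; the column names value, element and pos are made up for illustration:

```r
mat <- list(5, c(1, 4, 5), c(3, 1), c(4, 6, 5, 3))
len <- lengths(mat)

out <- cbind(
  value   = unlist(mat),                  # all values, in list order
  element = rep(seq_along(mat), len),     # index of the originating list element
  pos     = sequence(len)                 # 1..length within each element
)
```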
Here is another option with Map. We get the sequence of each list element with lapply, cbind the corresponding elements of list using Map and rbind it.
do.call(rbind, Map(cbind, mat, lapply(mat, seq_along)))
# [,1] [,2]
#[1,] 5 1
#[2,] 1 1
#[3,] 4 2
#[4,] 5 3
#[5,] 3 1
#[6,] 1 2
#[7,] 4 1
#[8,] 6 2
#[9,] 5 3
#[10,] 3 4
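Or with data.table: we melt the list into a two-column data.frame (melt for a plain list comes from the reshape2 package), convert it to a data.table with setDT, and assign (:=) the within-group sequence to 'L1' after grouping by 'L1'
library(data.table)
# melt() for a plain list is provided by reshape2, so call it explicitly
setDT(reshape2::melt(mat))[, L1 := seq_len(.N), L1][]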
# value L1
# 1: 5 1
# 2: 1 1
# 3: 4 2
# 4: 5 3
# 5: 3 1
# 6: 1 2
# 7: 4 1
# 8: 6 2
# 9: 5 3
#10: 3 4