I have a vector
x <- c(1,2,5,4,3,1,1,4,2,6,7,2,4,1,5)
and I want to add the values that are the same, giving me the new vector
x <- c(4, 6, 3, 12, 10, 6, 7)
It sounds quite simple but I am stuck.
You can use sapply to iterate over the vector of the unique values, and then sum each one of the corresponding entries, like so:
> sapply(unique(x), function(i) sum(x[x == i]))
[1] 4 6 10 12 3 6 7
If the order is relevant, please indicate which order do you want.
In this solution, the order is the same as the output of unique, which you can use to know what is the sum of what value.
> unique(x)
[1] 1 2 5 4 3 6 7
Edit
It looks like you want the ascending order of unique values. In that case, you can do like this:
> sapply(sort(unique(x)), function(i) sum(x[x == i]))
[1] 4 6 3 12 10 6 7
aggregate(x, list(number = x), FUN = sum )
# number x
#1 1 4
#2 2 6
#3 3 3
#4 4 12
#5 5 10
#6 6 6
#7 7 7
The result is a data.frame and you can extract the second column as usual.
Here's another option, for fun:
with(rle(sort(x)), lengths * values)
# [1] 4 6 3 12 10 6 7
Benchmarks
library(microbenchmark)
x <- c(1,2,5,4,3,1,1,4,2,6,7,2,4,1,5)
x <- rep(x, length.out=1000)
matthew <- function() with(rle(sort(x)), lengths * values)
iled <- function() sapply(sort(unique(x)), function(i) sum(x[x == i]))
kota <- function() as.numeric(table(x) * as.integer(names(table(x))))
deena <- function() {
freqTable = as.data.frame(table(x))
as.numeric(as.character(freqTable$x)) * freqTable$Freq
}
roland <- function() aggregate(x, list(number = x), FUN = sum )$x
microbenchmark(matthew(), iled(), kota(), deena(), roland())
# Unit: microseconds
# expr min lq mean median uq max neval
# matthew() 105.5 116.9 167.5 122.5 131.3 1466 100
# iled() 111.2 125.6 160.3 131.4 138.8 1449 100
# kota() 1821.5 1899.3 1960.4 1915.9 1940.7 3031 100
# deena() 1124.7 1175.6 1221.1 1187.9 1207.7 2700 100
# roland() 1912.2 1967.9 2116.6 1995.5 2078.5 3610 100
One way to do that would be multiplying each element with its frequency. The table function does a good with that :
freqTable = as.data.frame(table(x))
requiredResult = as.numeric(as.character(freqTable$x)) * freqTable$Freq
You want to compute #n * n for each n, where #n is the number of occurrence of n.
Just an alternative approach to the sapply above.
table(x) * as.integer(names(table(x)))
# x
# 1 2 3 4 5 6 7
# 4 6 3 12 10 6 7
Related
Here is a sample dataset:
data <- data.frame(x=c(4,3,4,4,99),
y=c(4,NA,3,2,4),
z = c(88,NA,4,4,5),
w = c(4,5,2,3,4))
I would like to create a new column for means using rowMeans. I would like to keep na.rm=F because if its truly NA I do not want to include that into my means calculation.
But if its either 88/99 I would like R to ignore it while calculating the mean and still use the remaining valid values. So far I have the below.
data$mean <- rowMeans(subset(data, select = c(`x`,`y`,`z`,`w`)), na.rm = T)
But I am not sure how to add in a function where it would just ignore the 88 and 99 from calculations.
This is what I am hoping to get
data <- data.frame(x=c(4,3,4,4,99),
y=c(4,NA,3,2,4),
z = c(88,NA,4,4,5),
w = c(4,5,2,3,4),
mean=c(4,NA,3.25,3.25,4.3))
Any help is appreciated - thank you!
Using rowMeans nevertheless with na.rm=TRUE, but on a subset and temporally replaceing 88 and 99 with NA.
s <- rowSums(is.na(data)) == 0 ## store row subset
v <- c("x", "y", "z", "w") ## col subset to calc. mean
data$mean <- NA ## ini column
m <- as.matrix(data[v]) ## we'll ned a matrix
data$mean[s] <- rowMeans(replace(m[s, v], m[s, v] %in% c(88, 99), NA), na.rm=TRUE)
data
# x y z w mean
# 1 4 4 88 4 4.000000
# 2 3 NA NA 5 NA
# 3 4 3 4 2 3.250000
# 4 4 2 4 3 3.250000
# 5 99 4 5 4 4.333333
Or simply using apply but is much slower.
f <- \(x) if (any(is.na(x))) NA else mean(x[!x %in% c(88, 99)])
cbind(data, mean=apply(data, 1, f))
# x y z w mean
# 1 4 4 88 4 4.000000
# 2 3 NA NA 5 NA
# 3 4 3 4 2 3.250000
# 4 4 2 4 3 3.250000
# 5 99 4 5 4 4.333333
From microbenchmark.
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# apply 35.018162 35.941815 38.834333 36.394632 36.960161 212.469412 100 b
# rowMeans 1.097393 1.119396 1.493563 1.193787 1.226691 9.352118 100 a
data <- data.frame(x=c(4,3,4,4,99),
y=c(4,NA,3,2,4),
z = c(88,NA,4,4,5),
w = c(4,5,2,3,4))
df$mean <- apply(data, 1, function(x) {
idx <- which((x %in% c(88, 89)) == FALSE)
mean(x[ idx ], na.rm = TRUE)
})
x y z w mean
1 4 4 88 4 4.00
2 3 NA NA 5 4.00
3 4 3 4 2 3.25
4 4 2 4 3 3.25
5 99 4 5 4 28.00
I have a list of matrices, with duplicated values in column id. How can I split the duplicates in all list elements?
The way I do it with data.frames is with lapply + split + duplicated, but that doesnt work with matrices, as they are also split up in numerics. I would like to keep the matrix structure.
## Data.frame - all good
df <- data.frame(
id = rep(1:10, each = 2),
val = rep(10, each = 20)
)
df_list <- rep(list(df), 2);
lapply(df_list, function(x){split(x, duplicated(x[,'id']))$'FALSE'})
## Matrix - Here's my problem
mt <- as.matrix(data.frame(
id = rep(seq(1,10,1), each = 2),
val = rep(10, each = 20)
))
mt_list <- rep(list(mt), 2)
lapply(mt_list, function(x){split(x, duplicated(x[,'id']))$'FALSE'})
Maybe try
split(df,ave(df$id, df$id, FUN= function(x) seq_along(x)))
$`1`
id val
1 1 10
3 2 10
5 3 10
7 4 10
9 5 10
11 6 10
13 7 10
15 8 10
17 9 10
19 10 10
$`2`
id val
2 1 10
4 2 10
6 3 10
8 4 10
10 5 10
12 6 10
14 7 10
16 8 10
18 9 10
20 10 10
While writing that question and fiddling around with the code, I came up with a solution.
Since i didnt find anything about this specific setup, I though I'll post it anyway.
The functions subset / subset.matrix work:
lapply(mt_list, function(x){subset.matrix(x, !duplicated(x[,'id']))})
I benchmarked the different options; subset.matrix seems to be slightly faster than just subset.
mt <- as.matrix(data.frame(
id = rep(seq(1,1000,1), each = 2),
val = rep(1000, each = 20)
))
mt_list <- rep(list(mt), 50)
mc <- microbenchmark::microbenchmark(
subset = lapply(mt_list, function(x){subset(x, !duplicated(x[,'id']))}),
subset.matrix = lapply(mt_list, function(x){subset.matrix(x, !duplicated(x[,'id']))}),
split = lapply(mt_list, function(x){matrix(split(x, duplicated(x[,'id']))$'FALSE', ncol = 2)}),
unique = lapply( mt_list, unique )
)
mc
Unit: milliseconds
expr min lq mean median uq max neval cld
subset 3.758708 3.862849 4.256363 3.900580 3.981629 9.713416 100 a
subset.matrix 3.583632 3.700450 4.174137 3.729881 3.821947 9.611992 100 a
split 32.630604 33.061503 34.535531 33.262841 33.726039 77.531039 100 b
unique 144.832487 148.408874 155.099591 150.326865 155.456601 430.992916 100 c
For a clustering algorithm that I'm implementing, I would like to initialize the clusters assignments at random. However, I need that there are no gaps. That is, this is not ok:
set.seed(2)
K <- 10 # initial number of clusters
N <- 20 # number of data points
z_init <- sample(K,N, replace=TRUE) # initial assignments
z_init
# [1] 2 8 6 2 10 10 2 9 5 6 6 3 8 2 5 9 10 3 5 1
sort(unique(z_init))
# [1] 1 2 3 5 6 8 9 10
where labels 4 and 7 have not been used.
Instead, I would like this vector to be:
# [1] 2 6 5 2 8 8 2 7 4 5 5 3 6 2 4 7 8 3 4 1
where the label 5 has become 4 and so forth to fill the lower empty labels.
More examples:
The vector 1 2 3 5 6 8 should be ̀1 2 3 4 5 6 7
The vector 15,5,7,7,10 should be ̀1 2 3 3 4
Can it be done avoiding for loops? I don't need it to be fast, I prefer it to be elegant and short, since I'm doing it only once in the code (for label initialization).
My solution using a for loop
z_init <- c(3,2,1,3,3,7,9)
idx <- order(z_init)
for (i in 2:length(z_init)){
if(z_init[idx[i]] > z_init[idx[i-1]]){
z_init[idx[i]] <- z_init[idx[i-1]]+1
}
else{
z_init[idx[i]] <- z_init[idx[i-1]]
}
}
z_init
# 3 2 1 3 3 4 5
Edit: #GregSnow came up with the current shortest answer. I'm 100% convinced that this is the shortest possible way.
For fun, I decided to golf the code, i.e. write it as short as possible:
z <- c(3, 8, 4, 4, 8, 2, 3, 9, 5, 1, 4)
# solution by hand: 1 2 3 3 4 4 4 5 6 6 7
sort(c(factor(z))) # 18 bits, as proposed by #GregSnow in the comments
# [1] 1 2 3 3 4 4 4 5 6 6 7
Some other (functioning) attempts:
y=table(z);rep(seq(y),y) # 24 bits
sort(unclass(factor(z))) # 24 bits, based on #GregSnow 's answer
diffinv(diff(sort(z))>0)+1 # 26 bits
sort(as.numeric(factor(z))) # 27 bits, #GregSnow 's original answer
rep(seq(unique(z)),table(z)) # 28 bits
cumsum(c(1,diff(sort(z))>0)) # 28 bits
y=rle(sort(z))$l;rep(seq(y),y) # 30 bits
Edit2: Just to show that bits isn't everything:
z <- sample(1:10,10000,replace=T)
Unit: microseconds
expr min lq mean median uq max neval
sort(c(factor(z))) 2550.128 2572.2340 2681.4950 2646.6460 2729.7425 3140.288 100
{ y = table(z) rep(seq(y), y) } 2436.438 2485.3885 2580.9861 2556.4440 2618.4215 3070.812 100
sort(unclass(factor(z))) 2535.127 2578.9450 2654.7463 2623.9470 2708.6230 3167.922 100
diffinv(diff(sort(z)) > 0) + 1 551.871 572.2000 628.6268 626.0845 666.3495 940.311 100
sort(as.numeric(factor(z))) 2603.814 2672.3050 2762.2030 2717.5050 2790.7320 3558.336 100
rep(seq(unique(z)), table(z)) 2541.049 2586.0505 2733.5200 2674.0815 2760.7305 5765.815 100
cumsum(c(1, diff(sort(z)) > 0)) 530.159 545.5545 602.1348 592.3325 632.0060 844.385 100
{ y = rle(sort(z))$l rep(seq(y), y) } 661.218 684.3115 727.4502 724.1820 758.3280 857.412 100
z <- sample(1:100000,replace=T)
Unit: milliseconds
expr min lq mean median uq max neval
sort(c(factor(z))) 84.501189 87.227377 92.13182 89.733291 94.16700 150.08327 100
{ y = table(z) rep(seq(y), y) } 78.951701 82.102845 85.54975 83.935108 87.70365 106.05766 100
sort(unclass(factor(z))) 84.958711 87.273366 90.84612 89.317415 91.85155 121.99082 100
diffinv(diff(sort(z)) > 0) + 1 9.784041 9.963853 10.37807 10.090965 10.34381 17.26034 100
sort(as.numeric(factor(z))) 85.917969 88.660145 93.42664 91.542263 95.53720 118.44512 100
rep(seq(unique(z)), table(z)) 86.568528 88.300325 93.01369 90.577281 94.74137 118.03852 100
cumsum(c(1, diff(sort(z)) > 0)) 9.680615 9.834175 10.11518 9.963261 10.16735 14.40427 100
{ y = rle(sort(z))$l rep(seq(y), y) } 12.842614 13.033085 14.73063 13.294019 13.66371 133.16243 100
It seems to me that you are trying to randomly assign elements of a set (the numbers 1 to 20) to clusters, subject to the requirement that each cluster be assigned at least one element.
One approach that I could think of would be to select a random reward r_ij for assigning element i to cluster j. Then I would define binary decision variables x_ij that indicate whether element i is assigned to cluster j. Finally, I would use mixed integer optimization to select the assignment from elements to clusters that maximizes the collected reward subject to the following conditions:
Every element is assigned to exactly one cluster
Every cluster has at least one element assigned to it
This is equivalent to randomly selecting an assignment, keeping it if all clusters have at least one element, and otherwise discarding it and trying again until you get a valid random assignment.
In terms of implementation, this is pretty easy to accomplish in R using the lpSolve package:
library(lpSolve)
N <- 20
K <- 10
set.seed(144)
r <- matrix(rnorm(N*K), N, K)
mod <- lp(direction = "max",
objective.in = as.vector(r),
const.mat = rbind(t(sapply(1:K, function(j) rep((1:K == j) * 1, each=N))),
t(sapply(1:N, function(i) rep((1:N == i) * 1, K)))),
const.dir = c(rep(">=", K), rep("=", N)),
const.rhs = rep(1, N+K),
all.bin = TRUE)
(assignments <- apply(matrix(mod$solution, nrow=N), 1, function(x) which(x > 0.999)))
# [1] 6 5 3 3 5 6 6 9 2 1 3 4 7 6 10 2 10 6 6 8
sort(unique(assignments))
# [1] 1 2 3 4 5 6 7 8 9 10
You could do like this:
un <- sort(unique(z_init))
(z <- unname(setNames(1:length(un), un)[as.character(z_init)]))
# [1] 2 6 5 2 8 8 2 7 4 5 5 3 6 2 4 7 8 3 4 1
sort(unique(z))
# [1] 1 2 3 4 5 6 7 8
Here I replace elements of un in z_init with corresponding elements of 1:length(un).
A simple (but possibly inefficient) approach is to convert to a factor then back to numeric. Creating the factor will code the information as integers from 1 to the number of unique values, then add labels with the original values. Converting to numeric then drops the labels and leaves the numbers:
> x <- c(1,2,3,5,6,8)
> (x2 <- as.numeric(factor(x)))
[1] 1 2 3 4 5 6
>
> xx <- c(15,5,7,7,10)
> (xx2 <- as.numeric(factor(xx)))
[1] 4 1 2 2 3
> (xx3 <- as.numeric(factor(xx, levels=unique(xx))))
[1] 1 2 3 3 4
The levels = portion in the last example sets the numbers to match the order in which they appear in the original vector.
I have a vector in R,
a = c(2,3,4,9,10,2,4,19)
let us say I want to efficiently insert the following vectors, b, and c,
b = c(2,1)
d = c(0,1)
right after the 3rd and 7th positions (the "4" entries), resulting in,
e = c(2,3,4,2,1,9,10,2,4,0,1,19)
How would I do this efficiently in R, without recursively using cbind or so.
I found a package R.basic but its not part of CRAN packages so I thought about using a supported version.
Try this:
result <- vector("list",5)
result[c(TRUE,FALSE)] <- split(a, cumsum(seq_along(a) %in% (c(3,7)+1)))
result[c(FALSE,TRUE)] <- list(b,d)
f <- unlist(result)
identical(f, e)
#[1] TRUE
EDIT: generalization to arbitrary number of insertions is straightforward:
insert.at <- function(a, pos, ...){
dots <- list(...)
stopifnot(length(dots)==length(pos))
result <- vector("list",2*length(pos)+1)
result[c(TRUE,FALSE)] <- split(a, cumsum(seq_along(a) %in% (pos+1)))
result[c(FALSE,TRUE)] <- dots
unlist(result)
}
> insert.at(a, c(3,7), b, d)
[1] 2 3 4 2 1 9 10 2 4 0 1 19
> insert.at(1:10, c(4,7,9), 11, 12, 13)
[1] 1 2 3 4 11 5 6 7 12 8 9 13 10
> insert.at(1:10, c(4,7,9), 11, 12)
Error: length(dots) == length(pos) is not TRUE
Note the bonus error checking if the number of positions and insertions do not match.
You can use the following function,
ins(a, list(b, d), pos=c(3, 7))
# [1] 2 3 4 2 1 9 10 2 4 0 1 4 19
where:
ins <- function(a, to.insert=list(), pos=c()) {
c(a[seq(pos[1])],
to.insert[[1]],
a[seq(pos[1]+1, pos[2])],
to.insert[[2]],
a[seq(pos[2], length(a))]
)
}
Here's another function, using Ricardo's syntax, Ferdinand's split and #Arun's interleaving trick from another question:
ins2 <- function(a,bs,pos){
as <- split(a,cumsum(seq(a)%in%(pos+1)))
idx <- order(c(seq_along(as),seq_along(bs)))
unlist(c(as,bs)[idx])
}
The advantage is that this should extend to more insertions. However, it may produce weird output when passed invalid arguments, e.g., with any(pos > length(a)) or length(bs)!=length(pos).
You can change the last line to unname(unlist(... if you don't want a's items named.
The straightforward approach:
b.pos <- 3
d.pos <- 7
c(a[1:b.pos],b,a[(b.pos+1):d.pos],d,a[(d.pos+1):length(a)])
[1] 2 3 4 2 1 9 10 2 4 0 1 19
Note the importance of parenthesis for the boundaries of the : operator.
After using Ferdinand's function, I tried to write my own and surprisingly it is far more efficient.
Here's mine :
insertElems = function(vect, pos, elems) {
l = length(vect)
j = 0
for (i in 1:length(pos)){
if (pos[i]==1)
vect = c(elems[j+1], vect)
else if (pos[i] == length(vect)+1)
vect = c(vect, elems[j+1])
else
vect = c(vect[1:(pos[i]-1+j)], elems[j+1], vect[(pos[i]+j):(l+j)])
j = j+1
}
return(vect)
}
tmp = c(seq(1:5))
insertElems(tmp, c(2,4,5), c(NA,NA,NA))
# [1] 1 NA 2 3 NA 4 NA 5
insert.at(tmp, c(2,4,5), c(NA,NA,NA))
# [1] 1 NA 2 3 NA 4 NA 5
And there's the benchmark result :
> microbenchmark(insertElems(tmp, c(2,4,5), c(NA,NA,NA)), insert.at(tmp, c(2,4,5), c(NA,NA,NA)), times = 10000)
Unit: microseconds
expr min lq mean median uq max neval
insertElems(tmp, c(2, 4, 5), c(NA, NA, NA)) 9.660 11.472 13.44247 12.68 13.585 1630.421 10000
insert.at(tmp, c(2, 4, 5), c(NA, NA, NA)) 58.866 62.791 70.36281 64.30 67.923 2475.366 10000
my code works even better for some cases :
> insert.at(tmp, c(1,4,5), c(NA,NA,NA))
# [1] 1 2 3 NA 4 NA 5 NA 1 2 3
# Warning message:
# In result[c(TRUE, FALSE)] <- split(a, cumsum(seq_along(a) %in% (pos))) :
# number of items to replace is not a multiple of replacement length
> insertElems(tmp, c(1,4,5), c(NA,NA,NA))
# [1] NA 1 2 3 NA 4 NA 5
Here's an alternative that uses append. It's fine for small vectors, but I can't imagine it being efficient for large vectors since a new vector is created upon each iteration of the loop (which is, obviously, bad). The trick is to reverse the vector of things that need to be inserted to get append to insert them in the correct place relative to the original vector.
a = c(2,3,4,9,10,2,4,19)
b = c(2,1)
d = c(0,1)
pos <- c(3, 7)
z <- setNames(list(b, d), pos)
z <- z[order(names(z), decreasing=TRUE)]
for (i in seq_along(z)) {
a <- append(a, z[[i]], after = as.numeric(names(z)[[i]]))
}
a
# [1] 2 3 4 2 1 9 10 2 4 0 1 19
Example Data
A<-c(1,4,5,6,2,3,4,5,6,7,8,7)
B<-c(4,6,7,8,2,2,2,3,8,8,7,8)
DF<-data.frame(A,B)
What I would like to do is apply a correction factor to column A, based on the values of column B. The rules would be something like this
If B less than 4 <- Multiply A by 1
If B equal to 4 and less than 6 <- Multiply A by 2
If B equal or greater than 6 <- Multiply by 4
I suppose I could write an "if" statement (and I'd be glad to see a good example), but I'd also be interested in using square bracket indexing to speed things up.
The end result would look like this
A B
2 4
16 6
20 7
24 8
ect
Use this:
within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
Or this (corrected by #agstudy):
within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
Benchmarking:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
identical(a(DF), b(DF))
#[1] TRUE
microbenchmark(a(DF), b(DF), times=1000)
#Unit: milliseconds
# expr min lq median uq max neval
# a(DF) 8.603778 10.253799 11.07999 11.923116 53.91140 1000
# b(DF) 3.763470 3.889065 5.34851 5.480294 39.72503 1000
Similar to #Ferdinand solution but using transform
transform(DF, newcol = ifelse(B<4, A,
ifelse(B>=6,4*A,2*A)))
A B newcol
1 1 4 2
2 4 6 16
3 5 7 20
4 6 8 24
5 2 2 2
6 3 2 3
7 4 2 4
8 5 3 5
9 6 8 24
10 7 8 28
11 8 7 32
12 7 8 28
I prefer to use findInterval as an index into a set of factors for such operations. The proliferation of nested test-conditional and consequent vectors with multiple ifelse calls offends my efficiency sensibilities:
DF$A <- DF$A * c(1,2,4)[findInterval(DF$B, c(-Inf,4,6,Inf) ) ]
DF
A B
1 2 4
2 16 6
3 20 7
4 24 8
snipped ....
Benchmark:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
ccc <- function(DF) within(DF, {A * c(1,2,4)[findInterval(B, c(-Inf,4,6,Inf) ) ]})
microbenchmark(a(DF), b(DF), ccc(DF), times=1000)
#-----------
Unit: microseconds
expr min lq median uq max neval
a(DF) 7616.107 7843.6320 8105.0340 8322.5620 93549.85 1000
b(DF) 2638.507 2789.7330 2813.8540 3072.0785 92389.57 1000
ccc(DF) 604.555 662.5335 676.0645 698.8665 85375.14 1000
Note: I would not have done this using within if I were coding my own function, but thought for fairness to the earlier effort, I would make it apples <-> apples.