How to take the difference between two consecutive elements of a vector and remove the one ending in two zeros if the difference is less than 10

I am trying to generate a vector breaks_x from another vector break_init. If the difference between two successive elements of break_init is less than 10, the element ending in two zeros should be removed.
My code always removes break_init[i], even when it does not end in two zeros.
Can anyone help, please?
break_init <- c(100,195,200,238,300,326,400,481,500,537,600,607,697,700,800,875,900,908,957)
breaks_x <- vector()
for (i in 1:(length(break_init) - 1)) {
  if (break_init[i+1] - break_init[i] >= 10) {
    breaks_x[i] <- break_init[i]
  } else {
    if (grepl("[00]$", as.character(break_init[i])) == TRUE) {
      breaks_x[i] <- NA
    } else if (grepl("[00]$", as.character(break_init[i])) == FALSE) {
      breaks_x[i+1] <- NA
    } else {
      breaks_x[i] <- break_init[i]
    }
  }
}
[1] 0 100 NA 200 238 300 326 400 481 500 537 NA 607 NA 700 800 875 NA 908 957 #result breaks_x
[1] 0 100 195 NA 238 300 326 400 481 500 537 NA 607 697 NA 800 875 NA 908 957 #what I want my result to be

r2evans has the right idea. Just a little modification to check both the forward and the backwards difference:
bln10 <- diff(break_init) < 10
breaks_x <- replace(break_init, (c(FALSE, bln10) | c(bln10, FALSE)) & break_init %% 100 == 0, NA)
breaks_x
# [1] 100 195 NA 238 300 326 400 481 500 537 NA 607 697 NA 800 875 NA 908 957
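Note that the pattern "[00]$" in the original loop is a regex character class, so it matches a single trailing zero rather than two; "00$" (or the arithmetic test break_init %% 100 == 0 used above) is what actually detects two trailing zeros. For clarity, the replacement mask can also be built from named pieces; this is a small sketch of the same logic:
bln10     <- diff(break_init) < 10   # gap to the next element is < 10
near_next <- c(bln10, FALSE)         # element is close to its successor
near_prev <- c(FALSE, bln10)         # element is close to its predecessor
ends_00   <- break_init %% 100 == 0  # element ends in two zeros
breaks_x  <- replace(break_init, (near_next | near_prev) & ends_00, NA)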

Related

R function generating incorrect results

I am trying to get better with functions in R and I was working on a function to pull out every odd value from 100 to 500 that was divisible by 3. I got close with the function below. It keeps returning all of the values correctly but it also includes the first number in the sequence (101) when it should not. Any help would be greatly appreciated. The code I wrote is as follows:
Test = function(n) {
  if (n > 100) {
    s = seq(from = 101, to = n, by = 2)
    p = c()
    for (i in seq(from = 101, to = n, by = 2)) {
      if (any(s == i)) {
        p = c(p, i)
        s = c(s[(s %% 3) == 0], i)
      }
    }
    return(p)
  } else {
    stop
  }
}
Test(500)
Here is a function that gets all the odd multiples of 3 in the range. It's fully vectorized, no loops at all.
Check if n is within the range [100, 500].
Create an integer vector N from 100 to n.
Create a logical index of the elements of N that are divisible by 3 but not by 2.
Extract the elements of N that match the index i.
The main work is done in 3 code lines.
Test <- function(n) {
  stopifnot(n >= 100)
  stopifnot(n <= 500)
  N <- seq_len(n)[-(1:99)]
  i <- ((N %% 3) == 0) & ((N %% 2) != 0)
  N[i]
}
Test(500)
Here is a vectorised one-liner which optionally allows you to change the lower bound from a default of 100 to anything you like. If the bounds are wrong, it returns an empty vector rather than throwing an error.
It works by creating a vector of 1:500 (or more generally, 1:n), then testing whether each element is greater than 100 (or whichever lower bound m you set), AND whether each element is odd AND whether each element is divisible by 3. It uses the which function to return the indices of the elements that pass all the tests.
Test <- function(n, m = 100) which(1:n > m & 1:n %% 2 != 0 & 1:n %% 3 == 0)
So you can use it as specified in your question:
Test(500)
# [1] 105 111 117 123 129 135 141 147 153 159 165 171 177 183 189 195 201 207 213 219
# [21] 225 231 237 243 249 255 261 267 273 279 285 291 297 303 309 315 321 327 333 339
# [41] 345 351 357 363 369 375 381 387 393 399 405 411 417 423 429 435 441 447 453 459
# [61] 465 471 477 483 489 495
Or play around with upper and lower bounds:
Test(100, 50)
# [1] 51 57 63 69 75 81 87 93 99
Here is an example function for your objective:
Test <- function(n) {
  if (n < 100 | n > 500) stop("out of range")
  v <- seq(101, n, by = 2)
  na.omit(ifelse(v %% 2 == 1 & v %% 3 == 0, v, NA))
}
stop() is called when your n is outside the range [100, 500].
ifelse() outputs the desired odd values plus NAs.
na.omit() filters out the NAs and produces the final result.

Aggregate column intervals into new columns in data.table

I would like to aggregate a data.table based on intervals of a column (time). The idea here is that each interval should be a separate column with a different name in the output.
I've seen a similar question on SO but I couldn't get my head around the problem. Help?
reproducible example
library(data.table)
# sample data
set.seed(1L)
dt <- data.table( id= sample(LETTERS,50,replace=TRUE),
time= sample(60,50,replace=TRUE),
points= sample(1000,50,replace=TRUE))
# simple summary by `id`
dt[, .(total = sum(points)), by=id]
> id total
> 1: J 2058
> 2: T 1427
> 3: C 1020
In the desired output, each column would be named after the interval it originates from. For example with three intervals, say time < 10, time < 20, time < 30, the head of the output should be:
id | total | subtotal_under10 | subtotal_under20 | subtotal_under30
Exclusive Subtotal Categories
set.seed(1L);
N <- 50L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));
breaks <- seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L);
cuts <- cut(dt$time,breaks,labels=paste0('subtotal_under',breaks[-1L]),right=F);
res <- dcast(dt[,.(subtotal=sum(points)),.(id,cut=cuts)],id~cut,value.var='subtotal');
res <- res[dt[,.(total=sum(points)),id]][order(id)];
res;
## id subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60 total
## 1: A NA NA 176 NA NA 512 688
## 2: B NA NA 599 NA NA NA 599
## 3: C 527 NA NA NA NA NA 527
## 4: D NA NA 174 NA NA NA 174
## 5: E NA 732 643 NA NA NA 1375
## 6: F 634 NA NA NA NA 1473 2107
## 7: G NA NA 1410 NA NA NA 1410
## 8: I NA NA NA NA NA 596 596
## 9: J 447 NA 640 NA NA 354 1441
## 10: K 508 NA NA NA NA 454 962
## 11: M NA 14 1358 NA NA NA 1372
## 12: N NA NA NA NA 730 NA 730
## 13: O NA NA 271 NA NA 259 530
## 14: P NA NA NA NA 78 NA 78
## 15: Q 602 NA 485 NA 925 NA 2012
## 16: R NA 599 357 479 NA NA 1435
## 17: S NA 986 716 865 NA NA 2567
## 18: T NA NA NA NA 105 NA 105
## 19: U NA NA NA 239 1163 641 2043
## 20: V NA 683 NA NA 929 NA 1612
## 21: W NA NA NA NA 229 NA 229
## 22: X 214 993 NA NA NA NA 1207
## 23: Y NA 130 992 NA NA NA 1122
## 24: Z NA NA NA NA 104 NA 104
## id subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60 total
Cumulative Subtotal Categories
I've come up with a new solution based on the requirement of cumulative subtotals.
My objective was to avoid looping operations such as lapply(), since I realized that it should be possible to compute the desired result using only vectorized operations such as findInterval(), vectorized/cumulative operations such as cumsum(), and vector indexing.
I succeeded, but I should warn you that the algorithm is fairly intricate, in terms of its logic. I'll try to explain it below.
breaks <- seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L);
ints <- findInterval(dt$time,breaks);
res <- dt[,{ y <- ints[.I]; o <- order(y); y <- y[o]; w <- which(c(y[-length(y)]!=y[-1L],T)); v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks)))); c(sum(points),as.list(cumsum(points[o])[v])); },id][order(id)];
setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks[-1L])));
res;
## id total subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60
## 1: A 688 NA NA 176 176 176 688
## 2: B 599 NA NA 599 599 599 599
## 3: C 527 527 527 527 527 527 527
## 4: D 174 NA NA 174 174 174 174
## 5: E 1375 NA 732 1375 1375 1375 1375
## 6: F 2107 634 634 634 634 634 2107
## 7: G 1410 NA NA 1410 1410 1410 1410
## 8: I 596 NA NA NA NA NA 596
## 9: J 1441 447 447 1087 1087 1087 1441
## 10: K 962 508 508 508 508 508 962
## 11: M 1372 NA 14 1372 1372 1372 1372
## 12: N 730 NA NA NA NA 730 730
## 13: O 530 NA NA 271 271 271 530
## 14: P 78 NA NA NA NA 78 78
## 15: Q 2012 602 602 1087 1087 2012 2012
## 16: R 1435 NA 599 956 1435 1435 1435
## 17: S 2567 NA 986 1702 2567 2567 2567
## 18: T 105 NA NA NA NA 105 105
## 19: U 2043 NA NA NA 239 1402 2043
## 20: V 1612 NA 683 683 683 1612 1612
## 21: W 229 NA NA NA NA 229 229
## 22: X 1207 214 1207 1207 1207 1207 1207
## 23: Y 1122 NA 130 1122 1122 1122 1122
## 24: Z 104 NA NA NA NA 104 104
## id total subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60
Explanation
breaks <- seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L);
breaks <- seq(0,ceiling(max(dt$time)/10)*10,10); ## old derivation, for reference
First, we derive breaks as before. I should mention that I realized there was a subtle bug in my original derivation algorithm. Namely, if the maximum time value is a multiple of 10, then the derived breaks vector would've been short by 1. Consider if we had a maximum time value of 60. The original calculation of the upper limit of the sequence would've been ceiling(60/10)*10, which is just 60 again. But it should be 70, since the value 60 technically belongs in the 60 <= time < 70 interval. I fixed this in the new code (and retroactively amended the old code) by adding 1 to the maximum time value when computing the upper limit of the sequence. I also changed two of the literals to integers and added an as.integer() coercion to preserve integerness.
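To see the off-by-one concretely, here is a small illustration with a hypothetical maximum time value of 60:
max_time <- 60L;
seq(0, ceiling(max_time/10)*10, 10);                        ## old: 0 10 ... 60, one break short
seq(0L, as.integer(ceiling((max_time + 1L)/10)*10), 10L);   ## new: 0 10 ... 70, so 60 falls in 60 <= time < 70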
ints <- findInterval(dt$time,breaks);
Second, we precompute the interval indexes into which each time value falls. We can precompute this once for the entire table, because we'll be able to index out each id group's subset within the j argument of the subsequent data.table indexing operation. Note that findInterval() behaves perfectly for our purposes using the default arguments; we don't need to mess with rightmost.closed, all.inside, or left.open. This is because findInterval() by default uses lower <= value < upper logic, and it's impossible for values to fall below the lowest break (which is zero) or on or above the highest break (which must be greater than the maximum time value because of the way we derived it).
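As a small illustration of those defaults (hypothetical time values, with breaks running 0, 10, ..., 70):
b <- seq(0L, 70L, 10L);
findInterval(c(0L, 9L, 10L, 59L, 60L), b);
## [1] 1 1 2 6 7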
res <- dt[,{ y <- ints[.I]; o <- order(y); y <- y[o]; w <- which(c(y[-length(y)]!=y[-1L],T)); v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks)))); c(sum(points),as.list(cumsum(points[o])[v])); },id][order(id)];
Third, we compute the aggregation using a data.table indexing operation, grouping by id. (Afterward we sort by id using a chained indexing operation, but that's not significant.) The j argument consists of 6 statements executed in a braced block which I will now explain one at a time.
y <- ints[.I];
This pulls out the interval indexes for the current id group in input order.
o <- order(y);
This captures the order of the group's records by interval. We will need this order for the cumulative summation of points, as well as the derivation of which indexes in that cumulative sum represent the desired interval subtotals. Note that the within-interval orders (i.e. ties) are irrelevant, since we're only going to extract the final subtotals of each interval, which will be the same regardless if and how order() breaks ties.
y <- y[o];
This actually reorders y to interval order.
w <- which(c(y[-length(y)]!=y[-1L],T));
This computes the endpoints of each interval sequence, IOW the indexes of only those elements that comprise the final element of an interval. This vector will always contain at least one index, it will never contain more indexes than there are intervals, and it will be unique.
v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks))));
This repeats each element of w according to its distance (as measured in intervals) from its following element. We use diff() on y[w] to compute these distances, requiring an appended length(breaks) element to properly treat the final element of w. We also need to cover if the first interval (and zero or more subsequent intervals) is not represented in the group, in which case we must pad it with NAs. This requires prepending an NA to w and prepending a 1 to the argument vector to diff().
c(sum(points),as.list(cumsum(points[o])[v]));
Finally, we can compute the group aggregation result. Since you want a total column and then separate subtotal columns, we need a list starting with the total aggregation, followed by one list component per subtotal value. points[o] gives us the target summation operand in interval order, which we then cumulatively sum, and then index with v to produce the correct sequence of cumulative subtotals. We must coerce the vector to a list using as.list(), and then prepend the list with the total aggregation, which is simply the sum of the entire points vector. The resulting list is then returned from the j expression.
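To make the indexing concrete, here is a tiny hand-worked sketch of those intermediate vectors for one hypothetical group: four records with times 5, 25, 27, 55, points 100, 200, 300, 400, and breaks 0, 10, ..., 70:
y <- findInterval(c(5L,25L,27L,55L), seq(0L,70L,10L));  ## 1 3 3 6
o <- order(y);                                          ## 1 2 3 4 (already in interval order)
y <- y[o];
w <- which(c(y[-length(y)]!=y[-1L],T));                 ## 1 3 4 -- last record of each represented interval
v <- rep(c(NA,w),diff(c(1L,y[w],8L)));                  ## 1 1 3 3 3 4 4 (8 is length(breaks))
cumsum(c(100L,200L,300L,400L)[o])[v];                   ## 100 100 600 600 600 1000 1000
The last line is exactly the vector of cumulative subtotals under 10 through under 70 for that group.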
setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks[-1L])));
Last, we set the column names. It is more performant to set them once after-the-fact, as opposed to having them set repeatedly in the j expression.
Benchmarking
For benchmarking, I wrapped my code in a function, and did the same for Mike's code. I decided to make my breaks variable a parameter with its derivation as the default argument, and I did the same for Mike's my_nums variable, but without a default argument.
Also note that for the identical() proofs-of-equivalence, I coerce the two results to matrix, because Mike's code always computes the total and subtotal columns as doubles, whereas my code preserves the type of the input points column (i.e. integer if it was integer, double if it was double). Coercing to matrix was the easiest way I could think of to verify that the actual data is equivalent.
library(data.table);
library(microbenchmark);
bgoldst <- function(dt,breaks=seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L)) { ints <- findInterval(dt$time,breaks); res <- dt[,{ y <- ints[.I]; o <- order(y); y <- y[o]; w <- which(c(y[-length(y)]!=y[-1L],T)); v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks)))); c(sum(points),as.list(cumsum(points[o])[v])); },id][order(id)]; setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks[-1L]))); res; };
mike <- function(dt,my_nums) { cols <- sapply(1:length(my_nums),function(x){return(paste0("subtotal_under",my_nums[x]))}); dt[,(cols) := lapply(my_nums,function(x) ifelse(time<x,points,NA))]; dt[,total := points]; dt[,lapply(.SD,function(x){ if (all(is.na(x))){ as.numeric(NA) } else{ as.numeric(sum(x,na.rm=TRUE)) } }),by=id, .SDcols=c("total",cols) ][order(id)]; };
## OP's sample input
set.seed(1L);
N <- 50L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));
identical(as.matrix(bgoldst(copy(dt))),as.matrix(mike(copy(dt),c(10,20,30,40,50,60))));
## [1] TRUE
microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60)));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(copy(dt)) 3.281380 3.484301 3.793532 3.588221 3.780023 6.322846 100
## mike(copy(dt), c(10, 20, 30, 40, 50, 60)) 3.243746 3.442819 3.731326 3.526425 3.702832 5.618502 100
Mike's code is actually faster (usually) by a small amount for the OP's sample input.
## large input 1
set.seed(1L);
N <- 1e5L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));
identical(as.matrix(bgoldst(copy(dt))),as.matrix(mike(copy(dt),c(10,20,30,40,50,60,70))));
## [1] TRUE
microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60,70)));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(copy(dt)) 19.44409 19.96711 22.26597 20.36012 21.26289 62.37914 100
## mike(copy(dt), c(10, 20, 30, 40, 50, 60, 70)) 94.35002 96.50347 101.06882 97.71544 100.07052 146.65323 100
For this much larger input, my code significantly outperforms Mike's.
In case you're wondering why I had to add the 70 to Mike's my_nums argument, it's because with so many more records, the probability of getting a 60 in the random generation of dt$time is extremely high, which requires the additional interval. You can see that the identical() call gives TRUE, so this is correct.
## large input 2
set.seed(1L);
N <- 1e6L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));
identical(as.matrix(bgoldst(copy(dt))),as.matrix(mike(copy(dt),c(10,20,30,40,50,60,70))));
## [1] TRUE
microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60,70)));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(copy(dt)) 204.8841 207.2305 225.0254 210.6545 249.5497 312.0077 100
## mike(copy(dt), c(10, 20, 30, 40, 50, 60, 70)) 1039.4480 1086.3435 1125.8285 1116.2700 1158.4772 1412.6840 100
For this even larger input, the performance difference is slightly more pronounced.
I'm pretty sure something like this might work as well:
# sample data
set.seed(1)
dt <- data.table( id= sample(LETTERS,50,replace=TRUE),
time= sample(60,50,replace=TRUE),
points= sample(1000,50,replace=TRUE))
#Input numbers
my_nums <- c(10,20,30)
#Defining columns
cols <- sapply(1:length(my_nums),function(x){return(paste0("subtotal_under",my_nums[x]))})
dt[,(cols) := lapply(my_nums,function(x) ifelse(time<x,points,NA))]
dt[,total := sum((points)),by=id]
dt[,(cols):= lapply(.SD,sum,na.rm=TRUE),by=id, .SDcols=cols ]
head(dt)
id time points subtotal_under10 subtotal_under20 subtotal_under30 total
1: G 29 655 0 0 1410 1410
2: J 52 354 447 447 1087 1441
3: O 27 271 0 0 271 530
4: X 15 993 214 1207 1207 1207
5: F 5 634 634 634 634 2107
6: X 6 214 214 1207 1207 1207
Edit: To aggregate columns, you can simply change to:
#Defining columns
cols <- sapply(1:length(my_nums),function(x){return(paste0("subtotal_under",my_nums[x]))})
dt[,(cols) := lapply(my_nums,function(x) ifelse(time<x,points,NA))]
dt[,total := points]
dt[,lapply(.SD,function(x){
if (all(is.na(x))){
as.numeric(NA)
} else{
as.numeric(sum(x,na.rm=TRUE))
}
}),by=id, .SDcols=c("total",cols) ]
This should give the expected output of 1 row per ID.
Edit: Per the OP's comment below, changed so that 0s are NA, and changed so that an as.numeric() call is not needed when building the columns.
After thinking about this for a while, I think I've arrived at a very simple and fast solution based on conditional sums! The small problem is that I haven't figured out how to automate this code to create a larger number of columns without having to write out each of them (a possible way to automate it is sketched after the benchmark below). Any help here would be really welcome!
library(data.table)
dt[, .( total = sum(points)
, subtotal_under10 = sum(points[which( time < 10)])
, subtotal_under20 = sum(points[which( time < 20)])
, subtotal_under30 = sum(points[which( time < 30)])
, subtotal_under40 = sum(points[which( time < 40)])
, subtotal_under50 = sum(points[which( time < 50)])
, subtotal_under60 = sum(points[which( time < 60)])), by=id][order(id)]
microbenchmark
Using the same benchmark proposed by @bgoldst in another answer, this simple solution is much faster than the alternatives:
set.seed(1L)
N <- 1e6L
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T))
library(microbenchmark)
microbenchmark(rafa(copy(dt)),bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60)))
# expr min lq mean median uq max neval cld
# rafa(copy(dt)) 95.79 102.45 117.25 110.09 116.95 278.50 100 a
# bgoldst(copy(dt)) 192.53 201.85 211.04 207.50 213.26 354.17 100 b
# mike(copy(dt), c(10, 20, 30, 40, 50, 60)) 844.80 890.53 955.29 921.27 1041.96 1112.18 100 c
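One possible way to automate the creation of those subtotal columns along the same conditional-sum lines is sketched below (the threshold vector th is an assumption, and this variant was not included in the benchmark above):
th <- seq(10, 60, by = 10)
dt[, c(list(total = sum(points)),
       setNames(lapply(th, function(x) sum(points[which(time < x)])),
                paste0("subtotal_under", th))),
   by = id][order(id)]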

R Conditional summing

I've just started my adventure with programming in R. I need to create a program summing the numbers divisible by 3 and 5 in the range of 1 to 1000, using the '%%' operator. I came up with the idea of creating two matrices with the numbers from 1 to 1000 in one column and their remainders in the second one. However, I don't know how to sum the proper elements (a kind of SUMIF function in Excel). I attach all I've done below. Thanks in advance for your help!
s1 <- 1:1000
in1 <- s1 %% 3   # remainders mod 3 ("in" itself is a reserved word in R)
m1 <- matrix(c(s1, in1), 1000, 2, byrow = FALSE)
s2 <- 1:1000
in2 <- s2 %% 5   # remainders mod 5
m2 <- matrix(c(s2, in2), 1000, 2, byrow = FALSE)
Mathematically, the best way is probably to find the least common multiple of the two numbers and check the remainder vs that:
# borrowed from Roland Rau
# http://r.789695.n4.nabble.com/Greatest-common-divisor-of-two-numbers-td823047.html
gcd <- function(a,b) if (b==0) a else gcd(b, a %% b)
lcm <- function(a,b) abs(a*b)/gcd(a,b)
s <- seq(1000)
s[ (s %% lcm(3,5)) == 0 ]
# [1] 15 30 45 60 75 90 105 120 135 150 165 180 195 210
# [15] 225 240 255 270 285 300 315 330 345 360 375 390 405 420
# [29] 435 450 465 480 495 510 525 540 555 570 585 600 615 630
# [43] 645 660 675 690 705 720 735 750 765 780 795 810 825 840
# [57] 855 870 885 900 915 930 945 960 975 990
Since your s is every number from 1 to 1000, you could instead do
seq(lcm(3,5), 1000, by=lcm(3,5))
Just use sum on either result if that's what you want to do.
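For example, summing the second form directly (using the lcm() helper defined above):
sum(seq(lcm(3, 5), 1000, by = lcm(3, 5)))
# [1] 33165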
Props to @HoneyDippedBadger for figuring out what the OP was after.
See if this helps
x =1:1000 ## Store no. 1 to 1000 in variable x
x ## print x
Div = x[x%%3==0 & x%%5==0] ## Extract Nos. divisible by 3 & 5 both b/w 1 to 1000
Div ## Nos. Stored in DIv which are divisible by 3 & 5 both
length(Div)
table(x%%3==0 & x%%5==0) ## To see how many are TRUE for given condition
sum(Div) ## Sums up no.s divisible by both 3 and 5 b/w 1 to 1000

How to divide a set of overlapping ranges into non-overlapping ranges? (but in R)

Let's say we have two datasets:
assays:
BHID<-c(127,127,127,127,128)
FROM<-c(950,959,960,961,955)
TO<-c(958,960,961,966,969)
Cu<-c(0.3,0.9,2.5,1.2,0.5)
assays<-data.frame(BHID,FROM,TO,Cu)
and litho:
BHID<-c(125,127,127,127)
FROM<-c(940,949,960,962)
TO<-c(949,960,961,969)
ROCK<-c(1,1,2,3)
litho<-data.frame(BHID,FROM,TO,ROCK)
and I want to join the two sets; the result after running the algorithm would be:
BHID  FROM   TO   Cu  ROCK
 125   940  970    -     1
 127   949  950    -     1
 127   950  958  0.3     1
 127   958  959    -     1
 127   959  960  0.9     1
 127   960  961  2.5     2
 127   961  962  1.2     -
 127   962  966  1.2     3
 127   966  969    -     3
 128   955  962  0.5     -
Use merge
merge(assays, litho, all=T)
In essence, all=T is the equivalent of a SQL FULL OUTER JOIN. I haven't specified any columns because, in this case, the merge function will perform the join across the columns with the same names.
Tough one, but the code seems to work. The idea is to first expand each row into many rows, each representing a one-unit increment from FROM to TO. After merging, contiguous rows are identified and collapsed back together. Obviously it is not a very efficient approach, so it may or may not work if your real data has very large FROM and TO ranges.
library(plyr)
ASSAYS <- adply(assays, 1, with, {
  SEQ <- seq(FROM, TO)
  data.frame(BHID,
             FROM = head(SEQ, -1),
             TO   = tail(SEQ, -1),
             Cu)
})
LITHO <- adply(litho, 1, with, {
  SEQ <- seq(FROM, TO)
  data.frame(BHID,
             FROM = head(SEQ, -1),
             TO   = tail(SEQ, -1),
             ROCK)
})
not.as.previous <- function(x) {
  x1 <- head(x, -1)
  x2 <- tail(x, -1)
  c(TRUE, !is.na(x1) & !is.na(x2) & x1 != x2 |
          is.na(x1) & !is.na(x2) |
          !is.na(x1) & is.na(x2))
}
MERGED <- merge(ASSAYS, LITHO, all = TRUE)
MERGED <- transform(MERGED,
                    gp.id = cumsum(not.as.previous(BHID) |
                                   not.as.previous(Cu) |
                                   not.as.previous(ROCK)))
merged <- ddply(MERGED, "gp.id", function(x) {
  out <- head(x, 1)
  out$TO <- tail(x$TO, 1)
  out
})
merged
# BHID FROM TO Cu ROCK gp.id
# 1 125 940 949 NA 1 1
# 2 127 949 950 NA 1 2
# 3 127 950 958 0.3 1 3
# 4 127 958 959 NA 1 4
# 5 127 959 960 0.9 1 5
# 6 127 960 961 2.5 2 6
# 7 127 961 962 1.2 NA 7
# 8 127 962 966 1.2 3 8
# 9 127 966 969 NA 3 9
# 10 128 955 969 0.5 NA 10
Note that the first row is not exactly the same as in your expected output, but I think mine makes more sense.

R How to remove duplicates from a list of lists

I have a list of lists that contain the following 2 variables:
> dist_sub[[1]]$zip
[1] 901 902 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928
[26] 929 930 931 933 934 935 936 937 938 939 940 955 961 962 963 965 966 968 969 970 975 981
> dist_sub[[1]]$hu
[1] 4990 NA 168 13224 NA 3805 NA 6096 3884 4065 NA 16538 NA 12348 10850 NA
[17] 9322 17728 NA 13969 24971 5413 47317 7893 NA NA NA NA NA 140 NA 4
[33] NA NA NA NA NA 13394 8939 NA 3848 7894 2228 17775 NA NA NA
> dist_sub[[2]]$zip
[1] 921 934 952 956 957 958 959 960 961 962 965 966 968 969 970 971
> dist_sub[[2]]$hu
[1] 17728 140 4169 32550 18275 NA 22445 0 13394 8939 3848 7894 2228 17775 NA 12895
Is there a way to remove duplicates such that if a zipcode appears in one list it is removed from the other lists according to specific criteria?
Example: zipcode 00921 is present in the two lists above. I'd like to keep it only in the list with the lowest sum of hu (housing units). In this case I would like to keep zipcode 00921 in the 2nd list only, since the sum of hu is 162,280 in list 2 versus 256,803 in list 1.
Any help is very much appreciated.
Here is a simulated dataset for your problem so that others can use it too.
dist_sub <- list(list("zip"=1:10,
"hu"=rnorm(10)),
list("zip"=8:12,
"hu"=rnorm(5)),
list("zip"=c(1, 3, 11, 7),
"hu"=rnorm(4))
)
Here's a solution that I was able to come up with. I realized that loops were really the cleaner way to do this:
do.this <- function(x) {
  for (k in 1:(length(x) - 1)) {
    for (l in (k + 1):length(x)) {
      to.remove <- which(x[[k]][["zip"]] %in% x[[l]][["zip"]])
      # only drop when there is something to drop; x[-integer(0)] would empty the vector
      if (length(to.remove) > 0) {
        x[[k]][["zip"]] <- x[[k]][["zip"]][-to.remove]
        x[[k]][["hu"]]  <- x[[k]][["hu"]][-to.remove]
      }
    }
  }
  return(x)
}
The idea is really simple: for each set of zips we keep removing the elements that are repeated in any set after it. We do this up to the penultimate set; the last set keeps all of its elements, and by then every set before it has already dropped anything it shared with the last set.
The solution using the criterion you have, i.e. the lowest sum of hu, can be easily implemented with the function above. What you need to do is reorder the list dist_sub by the sum of hu, like so:
sum_hu <- sapply(dist_sub, function (k) sum(k[["hu"]], na.rm=TRUE))
dist_sub <- dist_sub[order(sum_hu, decreasing=TRUE)]
Now you have dist_sub sorted by sum_hu, which means that for each set, the sets that come before it have a larger sum_hu. Therefore, if the sets at positions i and j (i < j) have a value a in common, then a should be removed from the ith set. That is what this solution does. Does that make sense?
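Putting the two pieces together on the simulated data, a short usage sketch would be:
sum_hu <- sapply(dist_sub, function(k) sum(k[["hu"]], na.rm = TRUE))
dist_sub <- do.this(dist_sub[order(sum_hu, decreasing = TRUE)])
str(dist_sub)  # each shared zip is now kept only in the set with the lowest sum of hu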
PS: I've called the function do.this because I usually like writing generic solutions while this was a very specific question, albeit, an interesting one.
