I have a data frame with 2 columns and a vector of the same length. I am trying to remove all duplicated pairs from the data frame and remove the elements at the same indices from the vector.
I have a data frame:
> from <- c(1,1,2,4,3)
> to <- c(1,1,2,3,5)
> ft <- data.frame(from,to)
> ft
from to
1 1 1
2 1 1
3 2 2
4 4 3
5 3 5
And a vector:
> dist <- c(1,2,3,4,5)
> dist
[1] 1 2 3 4 5
I used the function unique() to remove all duplicated pairs:
> unique(ft)
from to
1 1 1
3 2 2
4 4 3
5 3 5
How can I get the indices of the rows that were removed from "ft" (row 2 in this case) so that I can remove the same positions from "dist"?
As @eddi notes, you can get a logical vector that indicates which rows are duplicates with duplicated(). I combined that with which(), which returns the positions at which that logical vector is TRUE (i.e., the duplicated rows). You can then create a new data.frame (vector, etc.) by using - in the subscript to exclude the indicated rows from your object.
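A minimal sketch of that approach, assuming ft and dist are built as in the question:

> dup <- which(duplicated(ft))  # row numbers of the duplicated pairs
> ft[-dup, ]                    # drop those rows from the data frame
> dist[-dup]                    # drop the same positions from the vector

Note that this breaks when there are no duplicates at all, because which() then returns integer(0) and the negative subscript drops every row; the negated-logical version below does not have that problem.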
Edit: In the comments, @DWin points out a better way than using -: if we negate duplicated() with !, we get a logical vector that marks the rows to retain:
> from <- c(1,1,2,4,3)
> to <- c(1,1,2,3,5)
> ft <- data.frame(from,to)
> ft
from to
1 1 1
2 1 1
3 2 2
4 4 3
5 3 5
> dist <- c(1,2,3,4,5)
> dist
[1] 1 2 3 4 5
> remove <- !duplicated(ft)
> remove
[1] TRUE FALSE TRUE TRUE TRUE
> ft.new <- ft[which(remove), ]
> ft.new
from to
1 1 1
3 2 2
4 4 3
5 3 5
> dist.new <- dist[which(remove)]
> dist.new
[1] 1 3 4 5
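For what it's worth, the which() calls are optional here; the same logical vector can subset both objects directly (a small variation on the code above):

> keep <- !duplicated(ft)
> ft[keep, ]   # same rows as ft.new above
> dist[keep]
[1] 1 3 4 5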
Related
Take as an example the data frame below. I need to change the data frame by keeping only the columns that are in the filter objects.
test <- data.frame(A = c(1,6,1,2,3) , B = c(1,2,1,1,2), C = c(1,7,6,4,1), D = c(1,1,1,1,1))
filter <- c("A", "B", "C", "D")
filter2 <- c("A","B","D")
To do that I'm using this piece of code:
`%ni%` <- Negate(`%in%`)
test <- test[,-which(names(test) %ni% filter2)]
If I use the filter2 object I get what is expected:
A B D
1 1 1 1
2 6 2 1
3 1 1 1
4 2 1 1
5 3 2 1
However, if I use the filter object, I get a dataframe with zero columns:
data frame with 0 columns and 5 rows
I expected to get an untouched data frame, since filter contains all of test's columns. Why does this happen, and how can I write more reliable code so that I don't get empty data frames in these situations?
Use ! instead of -
test[,!(names(test) %ni% filter2)]
test[,!(names(test) %ni% filter)]
Wrapping with which() and using - works only when which() returns a vector of length greater than 0:
> which(names(test) %ni% filter2)
[1] 3
> which(names(test) %ni% filter)
integer(0)
Applying - leaves the integer(0) case unchanged:
> -which(names(test) %ni% filter)
integer(0)
> -which(names(test) %ni% filter2)
[1] -3
thus,
> test[integer(0)]
data frame with 0 columns and 5 rows
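If you do want to keep the -which() style, one defensive option (just a sketch, not the only way) is to apply the negative index only when which() actually returns something:

> idx <- which(names(test) %ni% filter)
> if (length(idx)) test[, -idx] else test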
I think you can simplify the column selection by subsetting the data frame with a character vector of column names.
test[filter]
# A B C D
#1 1 1 1 1
#2 6 2 7 1
#3 1 1 6 1
#4 2 1 4 1
#5 3 2 1 1
test[filter2]
# A B D
#1 1 1 1
#2 6 2 1
#3 1 1 1
#4 2 1 1
#5 3 2 1
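Another safe variant (again just a sketch) is to intersect the filter with the actual column names first, which also covers the case where the filter mentions columns that do not exist in the data frame:

> test[intersect(filter2, names(test))]
# same result as test[filter2] above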
I have a vector of TRUE and FALSE values:
dat <- c(T,F,F,T,F,F,F,T,F,T,F,F,F,F)
I'd like to number each instance of TRUE with a unique sequential number and to assign each FALSE value the number associated with the TRUE value preceding it.
Therefore, my desired output using the example dat above (which has 4 TRUE values) is:
1 1 1 2 2 2 2 3 3 4 4 4 4 4
What I tried:
I've tried the following, which works, but I know there must be a simpler solution!
whichT <- which(dat==T)
whichF <- which(dat==F)
l1 <- lapply(1:length(whichT),
             FUN = function(x)
               which(whichF > whichT[x] & whichF < whichT[(x+1)])
)
l1[[length(l1)]] <- which(whichF > whichT[length(whichT)])
replaceFs <- unlist(
  lapply(1:length(whichT),
         function(x) l1[[x]] <- rep(x, length(l1[[x]]))
  )
)
replaceTs <- 1:length(whichT)
dat2 <- dat
dat2[whichT] <- replaceTs
dat2[whichF] <- replaceFs
dat2
[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
I need a simpler and quicker solution because my real data set is 181k rows long!
Base R solutions preferred, but any solution works
cumsum(dat) will do what you want. When used in mathematical functions, TRUE gets converted to 1 and FALSE to 0, so taking the cumulative sum adds 1 every time you see a TRUE and adds nothing for a FALSE, which is exactly what you want.
dat <- c(T,F,F,T,F,F,F,T,F,T,F,F,F,F)
cumsum(dat)
# [1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
Instead of doing the indexing, this can be done easily with cumsum from base R. Here, TRUE/FALSE gets coerced to 1/0, and when we take the cumulative sum, the result increments by 1 wherever there is a 1.
cumsum(dat)
#[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
cumsum() is the most straightforward way; however, you can also do:
Reduce("+", dat, accumulate = TRUE)
[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
Starting with a data.frame...
df = data.frame(k=c(1,5,4,7,6), v=c(3,1,4,1,5))
> df
k v
1 1 3
2 5 1
3 4 4
4 7 1
5 6 5
I might run some number of arbitrary manipulations...
> foo1 = df[df$k>3,]
> foo2 = head(foo1[order(foo1$v),], 2)
> foo2
k v
2 5 1
4 7 1
At this point foo2 has somehow retained the original row numbers from df (in this case 2 and 4).
How do I extract these?
> insert_magic_function_here(foo2)
[1] 2 4
I think you're looking for rownames.
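A short sketch of what that looks like with the foo2 above:

> rownames(foo2)
[1] "2" "4"
> as.numeric(rownames(foo2))
[1] 2 4

rownames() returns a character vector, so wrap it in as.numeric() (or as.integer()) if you need the row numbers as numbers.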
I would like to select the first (2, 3, 0, 4) rows, respectively, of the groups in a data frame.
> f <- data.frame(group=c(1,1,1,2,2,3,4), y=c(1:7))
> f
group y
1 1 1
2 1 2
3 1 3
4 2 4
5 2 5
6 3 6
7 4 7
and obtain a data frame as follows
group y
1 1
1 2
2 4
2 5
4 7
I tried to use by and head, but head does not take a vector of row counts.
Thank you for your help.
With the more traditional lapply:
k <- c(2,3,0,4)
fs <- split(f, f$group)
do.call(rbind,lapply(seq_along(k), function(i) head(fs[[i]], k[i])))
result is:
group y
1 1 1
2 1 2
4 2 4
5 2 5
7 4 7
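If you prefer to avoid positional indexing, a roughly equivalent base R sketch (assuming f is sorted by group so that split() and k line up, as above) is:

> do.call(rbind, Map(head, split(f, f$group), k))

Map() pairs each group's data frame with its entry in k and head() does the truncation; the result contains the same rows, with row names prefixed by the group.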
Using plyr:
library(plyr)
rows <- c(2,3,0,4)
ddply(f,.(group),function(x)head(x,rows[x[1,1]]))
group y
1 1 1
2 1 2
3 2 4
4 2 5
5 4 7
Edit: I misunderstood the question, so I have updated the answer.
A version of the function that uses indexes instead:
fun1 <- function(){
  # first row index of each group (assumes f is sorted by group)
  idx <- c(0, which(diff(f$group) != 0)) + 1
  # expand each start index into the first nf[x] row indices of that group
  idx2 <- unlist(lapply(1:length(nf), function(x) seq.int(from=idx[x], length.out=nf[x])), use.names=F)
  f1 <- f[idx2,]
  return(f1)
}
fun2 <- function(){
  ddply(f, .(group), function(x) head(x, nf[x[1,1]]))
}
Test data (size suggested by author of question)
f<-data.frame(group=sample(1:1000,50000,T),y=c(1:50000))
f <- f[order(f$group),]
nf <- rpois(length(unique(f$group)),3)
system.time(fun1())
system.time(fun2())
On my system, fun1 is roughly 60 times faster.