I have a data frame with two columns and a vector whose length equals the number of rows. I want to remove all duplicated pairs from the data frame and remove the elements at the same indices from the vector.
I have a data frame:
> from <- c(1,1,2,4,3)
> to <- c(1,1,2,3,5)
> ft <- data.frame(from,to)
> ft
  from to
1    1  1
2    1  1
3    2  2
4    4  3
5    3  5
And vector:
> dist <- c(1,2,3,4,5)
> dist
[1] 1 2 3 4 5
I used the function unique() to remove all duplicated pairs:
> unique(ft)
  from to
1    1  1
3    2  2
4    4  3
5    3  5
How can I get the index of every row that was removed from "ft", so that I can remove the same positions from "dist"? In this case that would be index 2.
As @eddi notes, you can get a logical vector that indicates which rows are duplicates with duplicated(). I combined that with which(), which returns the positions of the TRUE elements (i.e., the duplicated rows). You can then create a new data.frame (vector, etc.) by using - in the subscript of your object to exclude the indicated rows.
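A minimal sketch of the which() plus negative-index approach described above, using the data from the question. The names dup_idx, ft_new and dist_new are only illustrative. The last line shows why this form is fragile: when there are no duplicates, which() returns an empty vector, and negative indexing with an empty vector selects nothing.

```r
from <- c(1, 1, 2, 4, 3)
to   <- c(1, 1, 2, 3, 5)
ft   <- data.frame(from, to)
dist <- c(1, 2, 3, 4, 5)

# Positions of rows that repeat an earlier row
dup_idx <- which(duplicated(ft))

# Drop those positions from both objects
ft_new   <- ft[-dup_idx, ]
dist_new <- dist[-dup_idx]

# Pitfall: with no duplicates the index is empty and everything is dropped
empty_result <- dist[-integer(0)]
```

Here dup_idx is 2, so dist_new keeps elements 1, 3, 4 and 5, while empty_result has length zero.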
Edit: In the comments, @DWin points out a better way than using -. If we negate the result of duplicated() with !, we get a logical vector that marks which rows to retain:
> from <- c(1,1,2,4,3)
> to <- c(1,1,2,3,5)
> ft <- data.frame(from,to)
> ft
  from to
1    1  1
2    1  1
3    2  2
4    4  3
5    3  5
> dist <- c(1,2,3,4,5)
> dist
[1] 1 2 3 4 5
> keep <- !duplicated(ft)
> keep
[1]  TRUE FALSE  TRUE  TRUE  TRUE
> ft.new <- ft[keep, ]
> ft.new
  from to
1    1  1
3    2  2
4    4  3
5    3  5
> dist.new <- dist[keep]
> dist.new
[1] 1 3 4 5
I have a data.frame like this:
data <- data.frame(A=c(1,3,5),B=c(4,3,6),C=c(2,2,8),D=c(3,3,4))
A B C D
1 4 2 3
3 3 2 3
5 6 8 4
Now I want to create a new variable "E", which is the lowest value of columns A, B and C in each row, so that the data.frame looks like this:
A B C D E
1 4 2 3 1
3 3 2 3 2
5 6 8 4 5
I can do this using a for loop:
for (i in 1:nrow(data)) {
  data$E[i] <- min(data[i,c("A","B","C")])
}
But I was wondering whether this could be done differently (more efficiently)?
Many thanks!
Here are a few ways of doing it: with pmin (parallel minimum), or with apply (to apply the min function to each row).
pmin( data[,1], data[,2], data[,3] )
# [1] 1 2 5
do.call( pmin, data[,1:3] )
# [1] 1 2 5
apply(data[,1:3], 1, min)
# [1] 1 2 5
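To actually store the result as column E, as the question asks, something like the following should work. This is a sketch that selects the columns by name rather than by position; the variable cols is only illustrative.

```r
data <- data.frame(A = c(1, 3, 5), B = c(4, 3, 6), C = c(2, 2, 8), D = c(3, 3, 4))

# Columns to take the row-wise minimum over
cols <- c("A", "B", "C")

# A data frame is a list of columns, so do.call passes each column to pmin
data$E <- do.call(pmin, data[, cols])
```

Selecting by name keeps the code working if the column order of the data frame changes.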
I have a data frame in R similar to the following. My real 'df' data frame is much bigger than this one, but I have simplified things as much as possible to avoid confusion.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column 'a' (considering only those records that have '1' in column 'id'), the number '1' occurred 3 times and the number '3' occurred 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain again: in column 'a' (considering only those observations that have '2' in column 'id'), the number '1' occurred 4 times, the number '2' occurred 3 times and the number '3' occurred 3 times.
So this is what I would like to do: calculate the occurrences of numbers for each custom-defined subset (and then collect these values into a data frame). I know it is not a difficult task, but the PROBLEM is that I will have to change the input 'df' data frame on a regular basis, so both the number of rows and the number of columns might change over time.
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, ”automatic” way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1

   variable
id  a b c d e
  1 3 8 2 2 4
  2 4 6 3 2 4
  3 4 2 1 5 1

, , value = 2

   variable
id  a b c d e
  1 0 1 4 3 3
  2 3 3 3 6 2
  3 1 4 5 3 4

, , value = 3

   variable
id  a b c d e
  1 7 1 4 5 3
  2 3 1 4 2 4
  3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1'
(the dimensions are id, variable, value) you could just do
> dftab[1,'a',3]
[1] 7
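If the reshape package is not available, a similar three-way table can be built in base R. The sketch below is not the answer's method: it stacks the columns by hand, uses only columns a and b from the question to stay short, and then flattens the table into a data frame. The names value_cols, long, counts and counts_df are illustrative.

```r
id <- rep(1:3, each = 10)
a <- c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
df <- data.frame(id, a, b)

# Every column except id is treated as a value column
value_cols <- setdiff(names(df), "id")

# Long format: one row per (id, variable, value) observation
long <- data.frame(
  id       = rep(df$id, times = length(value_cols)),
  variable = rep(value_cols, each = nrow(df)),
  value    = unlist(df[value_cols], use.names = FALSE)
)

# Three-way table with dimensions id x variable x value
counts <- table(long)

# One row per combination, including combinations with a count of zero
counts_df <- as.data.frame(counts, responseName = "n")
```

Indexing by dimension name, for example counts["1", "a", "3"], avoids depending on the position of each level.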
A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[df$id==x[1],-1],2,table))
However, when a group doesn't contain every value in some column, as with column a of group 1 (which has no 2s), the result for that id group will be a list rather than a nice table (matrix).
$`1`
$`1`$a

1 3
3 7

$`1`$b

1 2 3
8 1 1

$`1`$c

1 2 3
2 4 4

$`1`$d

1 2 3
2 3 5

$`1`$e

1 2 3
4 3 3


$`2`
  a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4

$`3`
  a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
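One possible way to get a matrix for every group, even when a value is missing from a column, is to convert each column to a factor with a fixed set of levels before tabulating. This is a sketch using split and sapply rather than the tapply call above, and it uses only columns a and b from the question; lv and per_id are illustrative names.

```r
id <- rep(1:3, each = 10)
a <- c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
df <- data.frame(id, a, b)

# All values that occur anywhere in the non-id columns
lv <- sort(unique(unlist(df[, -1])))

# For each id group, tabulate each column against the same levels,
# so every table has the same length and sapply returns a matrix
per_id <- lapply(split(df[, -1], df$id), function(g) {
  sapply(g, function(x) table(factor(x, levels = lv)))
})
```

With fixed levels, the missing value 2 in column a of group 1 appears as a count of zero instead of changing the shape of the result.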
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
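library(plyr)

# Tabulate every column except "id" for the rows of one group
ColTables <- function(df) {
  counts <- list()
  for(a in names(df)[names(df) != "id"]) {
    counts[[a]] <- table(df[a])
  }
  return(counts)
}
# Apply ColTables to each id group; the result is a list keyed by id
results <- dlply(df, "id", ColTables)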
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
One way to do it is with the aggregate function, but you have to add a dummy column to your data frame for it to count:
> df$freq <- 0
> aggregate(freq~a+id,df,length)
  a id freq
1 1  1    3
2 3  1    7
3 1  2    4
4 2  2    3
5 3  2    3
6 1  3    4
7 2  3    1
8 3  3    5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
  b id freq
1 1  1    8
2 2  1    1
3 3  1    1
4 1  2    6
5 2  2    3
6 3  2    1
7 1  3    2
8 2  3    4
9 3  3    4
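A variation that needs no dummy column is to tabulate the grouping columns directly and convert the table to a data frame. This is only a sketch with a hypothetical helper name, count_groups, and it behaves differently from aggregate in one respect: combinations that never occur are kept with a frequency of zero.

```r
id <- rep(1:3, each = 10)
a <- c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
df <- data.frame(id, a)

# Hypothetical helper: cross-tabulate the named columns and flatten the result
count_groups <- function(df, groups) {
  as.data.frame(table(df[, groups]), responseName = "freq")
}

out <- count_groups(df, c("a", "id"))
```

For columns a and id this gives 9 rows rather than the 8 returned by aggregate, because the combination a = 2, id = 1 is listed with freq 0.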
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))
I would like to select the first (2,3,0,4) rows of each group in a data frame, i.e. the first 2 rows of group 1, the first 3 of group 2, none of group 3 and the first 4 of group 4.
> f <- data.frame(group=c(1,1,1,2,2,3,4), y=c(1:7))

group y
1     1
1     2
1     3
2     4
2     5
3     6
4     7
and obtain a data frame as follows
group y
1 1
1 2
2 4
2 5
4 7
I tried to use by and head but head does not take a vector.
Thank you for your help.
With the more traditional lapply:
k <- c(2,3,0,4)
fs <- split(f, f$group)
do.call(rbind,lapply(seq_along(k), function(i) head(fs[[i]], k[i])))
result is:
  group y
1     1 1
2     1 2
4     2 4
5     2 5
7     4 7
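The lapply version pairs k with the groups by position. A possible vectorized variant, sketched below, looks the count up by group name instead and avoids splitting the data frame. It assumes k is a named vector with one entry per group; pos and limit are illustrative names.

```r
f <- data.frame(group = c(1, 1, 1, 2, 2, 3, 4), y = 1:7)

# Number of rows wanted per group, keyed by group label (an assumption of this sketch)
k <- c("1" = 2, "2" = 3, "3" = 0, "4" = 4)

# Position of each row within its own group: 1, 2, 3, 1, 2, 1, 1
pos <- ave(seq_along(f$group), f$group, FUN = seq_along)

# Row limit that applies to each row, looked up by group label
limit <- unname(k[as.character(f$group)])

# Keep a row only while its within-group position is inside the limit
res <- f[pos <= limit, ]
```

Groups with fewer rows than requested are simply returned whole, the same as with head.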
Using plyr:
library(plyr)
rows <- c(2,3,0,4)
ddply(f,.(group),function(x)head(x,rows[x[1,1]]))
  group y
1     1 1
2     1 2
3     2 4
4     2 5
5     4 7
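Edit: I misunderstood the question, so I updated the answer.
Version of the function using indexes (it assumes f is sorted by group and that every group has at least as many rows as requested in nf):
fun1 <- function(){
  # first row index of each group
  idx <- c(0,which(diff(f$group)!=0))+1
  # nf[x] consecutive row indexes starting at each group's first row
  idx2 <- unlist(lapply(1:length(nf),function(x) seq.int(from=idx[x],length.out=nf[x])),use.names=FALSE)
  f1 <- f[idx2,]
  return(f1)
}
fun2 <- function(){
  ddply(f,.(group),function(x) head(x,nf[x[1,1]]))
}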
Test data (size suggested by author of question)
f<-data.frame(group=sample(1:1000,50000,T),y=c(1:50000))
f <- f[order(f$group),]
nf <- rpois(length(unique(f$group)),3)
system.time(fun1())
system.time(fun2())
On my system fun1 is ~60 times faster.
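The index-based function takes nf[x] consecutive rows from the start of each group, so it would run into the next group if a group had fewer rows than requested. A possible variant that caps the count at the group size is sketched below. It is not the answer's code: it uses rle and sequence from base R, it still assumes f is sorted by group, and the names runs, starts, take and the fixed seed are illustrative.

```r
set.seed(1)  # fixed seed only so the sketch is reproducible
f <- data.frame(group = sample(1:1000, 50000, TRUE), y = 1:50000)
f <- f[order(f$group), ]
nf <- rpois(length(unique(f$group)), 3)

# Run lengths give the size of each group in the sorted data
runs <- rle(f$group)

# Row index where each group starts
starts <- cumsum(c(1, head(runs$lengths, -1)))

# Never ask for more rows than a group actually has
take <- pmin(nf, runs$lengths)

# 1..take[i] for each group, shifted to that group's starting row
idx <- sequence(take) + rep(starts - 1, times = take)
f1 <- f[idx, ]
```

Groups with a count of zero contribute no indexes, so they drop out of f1 without any special handling.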