Return df with a columns values that occur more than once [duplicate] - r

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 5 years ago.
I have a data frame df, and I am trying to subset all rows that have a value in column B occur more than once in the dataset.
I tried using table to do it, but am having trouble subsetting from the table:
t<-table(df$B)
Then I try subsetting it using:
subset(df, table(df$B)>1)
And I get the error
"Error in x[subset & !is.na(subset)] :
object of type 'closure' is not subsettable"
How can I subset my data frame using table counts?

Here is a dplyr solution (using mrFlick's data.frame)
library(dplyr)
newd <- dd %>% group_by(b) %>% filter(n()>1) #
newd
# a b
# 1 1 1
# 2 2 1
# 3 5 4
# 4 6 4
# 5 7 4
# 6 9 6
# 7 10 6
Or, using data.table
setDT(dd)[,if(.N >1) .SD,by=b]
Or using base R
dd[dd$b %in% unique(dd$b[duplicated(dd$b)]),]

May I suggest an alternative, faster way to do this with data.table?
require(data.table) ## 1.9.2
setDT(df)[, .N, by=B][N > 1L]$B
(or) you can couple .I (another special variable - see ?data.table) which gives the corresponding row number in df, along with .N as follows:
setDT(df)[df[, .I[.N > 1L], by=B]$V1]
(or) have a look at #mnel's another for another variation (using yet another special variable .SD).

Using table() isn't the best because then you have to rejoin it to the original rows of the data.frame. The ave function makes it easier to calculate row-level values for different groups. For example
dd<-data.frame(
a=1:10,
b=c(1,1,2,3,4,4,4,5,6, 6)
)
dd[with(dd, ave(b,b,FUN=length))>1, ]
#subset(dd, ave(b,b,FUN=length)>1) #same thing
a b
1 1 1
2 2 1
5 5 4
6 6 4
7 7 4
9 9 6
10 10 6
Here, for each level of b, it counts the length of b, which is really just the number of b's and returns that back to the appropriate row for each value. Then we use that to subset.

Related

Sum of data in a column based on categorical condition from another column [duplicate]

This question already has answers here:
Idiomatic R code for partitioning a vector by an index and performing an operation on that partition
(3 answers)
Closed 9 years ago.
Suppose I have a data frame like this:
set.seed(123)
df <- as.data.frame(cbind(y<-sample(c("A","B","C"),10,T), X<-sample(c(1,2,3),10,T)))
df <- df[order(df$V1),]
Is there a simply function to sum (or any FUN) V2 by V1 and add to df as a new column, such that:
df$sum <- c(6,6,8,8,8,8,6,6,6,6)
df
I may write a function for that, but I have to do that frequently and be better to know the simplest way to realize that.
I agree with #mnel at least on his first point. I didn't see ave demonstrated in the answers he cited and I think it's the "simplest" base-R method. Using that data.frame(cbind( ...)) construction should be outlawed and teachers who demonstrate it should be stripped of their credentials.
set.seed(123)
df<-data.frame(y=sample( c("A","B","C"), 10, T),
X=sample(c (1,2,3), 10, T))
df<-df[order(df$y),] # that step is not necessary for success.
df
df$sum <- ave(df$X, df$y, FUN=sum)
df
y X sum
1 A 3 6
6 A 3 6
3 B 3 8
7 B 1 8
9 B 1 8
10 B 3 8
2 C 2 6
4 C 2 6
5 C 1 6
8 C 1 6

Return subset of rows based on aggregate of keyed rows [duplicate]

This question already has an answer here:
Subset rows corresponding to max value by group using data.table
(1 answer)
Closed 6 years ago.
I would like to subset a data table in R within each subset, based on an aggregate function over the subset of rows. For example, for each key, return all values greater than the mean of a field calculated only for rows in subset. Example:
library(data.table)
t=data.table(Group=rep(c(1:5),each=5),Detail=c(1:25))
setkey(t,'Group')
library(foreach)
library(dplyr)
ret=foreach(grp=t[,unique(Group)],.combine=bind_rows,.multicombine=T) %do%
t[Group==grp&Detail>t[Group==grp,mean(Detail)],]
# Group Detail
# 1: 1 4
# 2: 1 5
# 3: 2 9
# 4: 2 10
# 5: 3 14
# 6: 3 15
# 7: 4 19
# 8: 4 20
# 9: 5 24
#10: 5 25
The question is, is it possible to succinctly code the last two lines using data.table features? Sorry if this is a repeat, I am also struggling explaining the exact goal to have google/stackoverflow find it.
Using the .SD function works. Was not aware of it, thanks:
dt[, .SD[Detail > mean(Detail)], by = Group]
Also works, with some performance gains:
indx <- dt[, .I[Detail > mean(Detail)], by = Group]$V1 ; dt[indx]

countif within R repeated across each row

I'm having trouble trying to replicate some of the countif function I'm familiar with in excel. I've got a data frame, and it has a large number of rows. I'm trying to take 2 variables (x & z) and do a countif of how many other variables within my dataframe match that. I figured out doing:
sum('mydataframe'$x==`mydataframe`$x[1]&`mydataframe'$z==`mydataframe`$z[1])
This gives me the correct countif for x&z within the whole data set for the first row [1]. The problem is I've got to use that [1]. I've tried using the (with,...) command, but then I can no longer access the whole column.
I'd like to be able to do the count of x & z combination for each row within the data frame then have that output as a new vector that I can just add as another column. And I'd like this to go on for every row through to the end.
Hopefully this is pretty simple. I figure some combination of (with,..) or apply or something will do it, but I'm just too new.
I am interested in a count total in every instance, not a running sequential count.
It seems that you are asking for a way to create a new column that contains the number of rows in the entire data frame with x and z value equal to the values of those variables for that row.
With a bit of sample data:
(dat <- data.frame(x=c(1, 1, 2), z=c(3, 3, 3)))
# x z
# 1 1 3
# 2 1 3
# 3 2 3
One simple approach would be grouping with dplyr's group_by function and then creating a new column with the number of elements in that group:
library(dplyr)
dat %>% group_by(x, z) %>% mutate(n=n())
# x z n
# (dbl) (dbl) (int)
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1
A base R solution would probably involve ave:
dat$n <- ave(rep(NA, nrow(dat)), dat$x, dat$z, FUN=length)
dat
# x z n
# 1 1 3 2
# 2 1 3 2
# 3 2 3 1
An option using data.table would be to convert the 'data.frame' to 'data.table' (setDT(dat)) , group by 'x', 'z' and
assign 'n' as the number of elements in each group (.N).
library(data.table)
setDT(dat)[, n:= .N, by = .(x,z)]
dat
# x z n
#1: 1 3 2
#2: 1 3 2
#3: 2 3 1

Adding group column to data frame [duplicate]

This question already has an answer here:
Compute the minimum of a pair of vectors
(1 answer)
Closed 7 years ago.
Say I have the following data frame:
dx=data.frame(id=letters[1:4], count=1:4)
# id count
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d 4
And I would like to (grammatically) add a column that will get the count whenever count<3, otherwise 3, so I'll get the following:
# id count group
# 1 a 1 1
# 2 b 2 2
# 3 c 3 3
# 4 d 4 3
I thought to use
dx$group=if(dx$count<3){dx$count}else{3}
but it doesn't work on arrays. How can I do it?
In this particular case you can just use pmin (as I stated in the comments above):
df$group <- pmin(df$count, 3)
In general your if/else construction does not work on vectors, but you can use the function ifelse. It takes three arguments: First the condition, then the result if the condition is met and finally the result if the condition is not met. For your example you would write the following:
df$group <- ifelse(df$count < 3, df$count, 3)
Note that in your example the pmin solution is better. Just mentioning the ifelse solution for completeness.

keep values of a data frame column R

In my data frame df I want to get the id number satisfying the condition that the value of A is greater than the value of B. In the example I only would want Id=2.
Id Name Value
1 A 3
1 B 5
1 C 4
2 A 7
2 B 6
2 C 8
vecA<-vector();
vecB<-vector();
vecId<-vector();
i<-1
while(i<=length(dim(df)[1]){
if(df$Name[[i]]=="A"){vecA<-c(vecA,df$Value)}
if(df$Name[[i]]=="B"){vecB<-c(vecB,df$Value)}
if(vecA[i]>vecB[i]){vecId<-c(vecId,)}
i<-i+1
}
First, you could convert your data from long to wide so you have one row for each ID:
library(reshape2)
(wide <- dcast(df, Id~Name, value.var="Value"))
# Id A B C
# 1 1 3 5 4
# 2 2 7 6 8
Now you can use normal indexing to get the ids with larger A than B:
wide$Id[wide$A > wide$B]
# [1] 2
The first answer works out well for sure. I wanted to get to regular subset operations as well. I came up with this since you might want to check out some of the more recent R packages. If you had 3 groups to compare that would be interesting. Oh in the code below exp is the exact data.frame you started with.
library(plyr)
library(dplyr)
comp <- exp %>% filter(Name %in% c("A","B")) %>% group_by(Id) %>% filter(min_rank(Value)>1)
# If the whole row is needed
comp[which.max(comp$Value),]
# If not
comp[which.max(comp$Value),"Id"]

Resources