This question already has answers here:
Select groups based on number of unique / distinct values
(4 answers)
Closed last month.
I have a data.frame
library(dplyr)
ID <- c(1,1,1,1,2,2,3,3,3,3,4,4,5)
Score <- c(20,22,34,56,78,98,56,43,45,33,24,54,22)
Quarter <- c("Q1","Q2","Q3","Q4","Q1","Q2","Q1","Q2","Q3","Q4","Q1","Q2","Q1")
df <- data.frame(ID,Score,Quarter)
I only want to deal with the data that has all 4 quarters (Q1, Q2, Q3, Q4 in the column "Quarter"). One way I thought I could do this is to subset where an ID is present 4 times, since it is repeated once per quarter. I am having a hard time subsetting on the count of IDs. I tried:
filter(df, count(df, vars = ID)==4)
But it did not work, and any guidance would be greatly appreciated.
Thank you
One way we can do this is to use n_distinct to count the unique Quarter values for each ID and keep only the groups that have all 4.
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n_distinct(Quarter) == 4)
# ID Score Quarter
# <dbl> <dbl> <fct>
#1 1.00 20.0 Q1
#2 1.00 22.0 Q2
#3 1.00 34.0 Q3
#4 1.00 56.0 Q4
#5 3.00 56.0 Q1
#6 3.00 43.0 Q2
#7 3.00 45.0 Q3
#8 3.00 33.0 Q4
Equivalent base R implementation using ave would be
# length(unique(x)) counts the distinct quarters within each ID; keep IDs where it equals 4
df[as.numeric(ave(df$Quarter, df$ID, FUN = function(x) length(unique(x)))) == 4, ]
Here are a few alternatives. The last three are base solutions.
#1 is an SQL solution which creates a one-column data frame, df0, holding only those IDs having 4 quarters; it is then joined to df, thereby eliminating all other IDs.
#2 is a dplyr solution which filters the groups retaining only those with 4 rows.
#3 is a data.table solution which returns the rows for those ID groups having 4 rows and NULL for the other groups. This has the effect of eliminating the other groups.
#4 is a zoo solution which converts df to a wide-form zoo object with quarters along the top and ID as the time index. It then removes any row having an NA and reshapes back to the original layout using fortify.zoo, reordering back to sorted order at the end. The last line of the solution could be omitted if the row order does not matter. Interestingly, it does not use knowledge of the number 4.
#5 is a base solution which splits df into a list of data frames, one per ID, and then uses Filter to extract those having 4 rows. Finally it puts it all back together.
#6 is a base solution which creates a vector having one element per row of df containing the number of rows (including the current row) having the ID in that row. Then use subset to reduce df to those rows for which that vector equals 4.
#7 is a base solution which splits df into a list of data frames, one per ID, and then uses Reduce to iterate over it appending the current data frame to what we have so far if it has 4 rows or just keeping what we have so far if not.
# 1
library(sqldf)
sqldf("with df0 as (
select ID from df group by ID having count(*) = 4
)
select * from df join df0 using (ID)")
# 2
library(dplyr)
df %>% group_by(ID) %>% filter(n() == 4) %>% ungroup
# 3
library(data.table)
as.data.table(df)[, if (nrow(.SD) == 4) .SD, by = ID]
# 4
library(zoo)
z <- read.zoo(df, split = "Quarter")
df2 <- fortify.zoo(na.omit(z), melt = TRUE, names = names(df)[c(1, 3:2)])
df2 <- df2[order(df2$ID, df2$Quarter), ]
# 5
do.call("rbind", Filter(function(x) nrow(x) == 4, split(df, df$ID)))
# 6
subset(df, ave(ID, ID, FUN = length) == 4)
# 7
Reduce(function(x, y) if (nrow(y) == 4) rbind(x, y) else x, split(df, df$ID))
Here is another base R method using table, rowSums and %in%. We get the frequency count of the 'ID' and 'Quarter' columns with table, then convert it to a logical matrix in which zero counts are TRUE and all others FALSE (!table(...)). Taking the row-wise sum (rowSums) counts the quarters each ID is missing; negating again gives TRUE for IDs missing none. Finally, we take the names of the TRUE elements and compare them with ID using %in% to subset the dataset.
subset(df, ID %in% names(which(!rowSums(!table(df[c(1,3)])))))
# ID Score Quarter
#1 1 20 Q1
#2 1 22 Q2
#3 1 34 Q3
#4 1 56 Q4
#7 3 56 Q1
#8 3 43 Q2
#9 3 45 Q3
#10 3 33 Q4
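The same pipeline broken into steps (a hedged restatement of the identical logic; the intermediate names are added here only for illustration):
tab <- table(df[c(1, 3)])            # ID x Quarter frequency table
missing <- rowSums(!tab)             # how many quarters each ID is missing
keep <- names(which(missing == 0))   # IDs present in all 4 quarters
subset(df, ID %in% keep)             # %in% coerces ID to character for the match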
I just figured out I can do this as well:
df[df$ID %in% names(table(df$ID))[table(df$ID)==4],]
It gets the desired result using only the counts of ID.
My data frame looks like this:
df <- data.frame(gene=c("A","B","C","A","B","D"),
                 origin=rep(c("old","new"),each=3),
                 value=sample(rnorm(10,2),6))
gene origin value
1 A old 1.5566908
2 B old 1.3000358
3 C old 0.7668213
4 A new 2.5274712
5 B new 2.2434525
6 D new 2.0758326
I want to find the common genes between the two groups of origin (old and new).
I want my data to look like this:
gene origin value
1 A old 1.5566908
2 B old 1.3000358
4 A new 2.5274712
5 B new 2.2434525
Any help is appreciated. Ideally, I would like to find common rows among groups using multiple columns.
A base R option using ave + subset
subset(
  df,
  as.logical(ave(origin, gene, FUN = function(x) all(c("old", "new") %in% x)))
)
gives
gene origin value
1 A old 0.5994593
2 B old 4.0449345
4 A new 3.2478612
5 B new 0.2673525
You can use split and reduce to get the common genes, then use them in filter.
library(dplyr)
library(purrr)
df %>% filter(gene %in% (split(df$gene, df$origin) %>% reduce(intersect)))
# gene origin value
#1 A old 1.271
#2 B old 2.838
#3 A new 0.974
#4 B new 1.375
Or, keeping in base R:
subset(df, gene %in% Reduce(intersect, split(df$gene, df$origin)))
One possibility could be:
df %>%
  group_by(gene) %>%
  filter(all(c("old", "new") %in% origin))
gene origin value
<chr> <chr> <dbl>
1 A old 1.63
2 B old 0.904
3 A new 2.18
4 B new 1.24
I would filter according to duplicates, scanning from both the last and the first.
library(tidyverse)
df %>% filter(
  duplicated(gene, fromLast = TRUE) | duplicated(gene, fromLast = FALSE)
)
gene origin value
1 A old 2.665606
2 B old 1.565466
3 A new 4.025450
4 B new 2.647110
Note: I can't replicate your data as you didn't provide a seed!
Using subset with table in base R
subset(df, gene %in% names(which(rowSums(table(gene, origin) > 0) == 2)))
gene origin value
1 A old 3.0536642
2 B old 2.0796124
4 A new 0.1621484
5 B new 2.3587338
I'm trying to remove rows with duplicate values in one column of a data frame. Every existing value in that column should still be represented: a value may appear more than once if its corresponding values in another column are distinct and non-missing, but only once if its values in that other column are all missing. Take for example the following data frame:
toy <- data.frame(Group = c(1,1,2,2,2,3,3,4,5,5,6,7,7), Class = c("a",NA,"a","b",NA,NA,NA,NA,"a","b","a","a","a"))
I would like to end up with this:
ideal <- data.frame(Group = c(1,2,2,3,4,5,5,6,7), Class = c("a","a","b",NA,NA,"a","b","a","a"))
I tried transforming the data frame into a data table and follow the advice here, like this:
library(data.table)
toy.dt <- as.data.table(toy)
toy.dt[, .(Class = if(all(is.na(Class))) NA_character_ else na.omit(Class)), by = Group]
but duplicates weren't handled as needed: value 7 in the column 'Group' should appear only once in the resulting data.
It would be a bonus if the solution doesn't require transforming the data into a data table.
Here is one way using base R. We first drop the NA rows in toy and keep only the unique rows. We then left-join the unique Group values with that result, which brings back a single NA row for any group whose Class values were all missing.
df1 <- unique(na.omit(toy))
merge(unique(subset(toy, select = Group)), df1, all.x = TRUE)
# Group Class
#1 1 a
#2 2 a
#3 2 b
#4 3 <NA>
#5 4 <NA>
#6 5 a
#7 5 b
#8 6 a
#9 7 a
Same logic using dplyr functions :
library(dplyr)
toy %>%
  na.omit() %>%
  distinct() %>%
  right_join(toy %>% distinct(Group))
If you would like to try a tidyverse approach:
library(tidyverse)
toy %>%
  group_by(Group) %>%
  filter(!(is.na(Class) & sum(!is.na(Class)) > 0)) %>%
  distinct()
Output
# A tibble: 9 x 2
# Groups: Group [7]
Group Class
<dbl> <chr>
1 1 a
2 2 a
3 2 b
4 3 NA
5 4 NA
6 5 a
7 5 b
8 6 a
9 7 a
This question is similar to one asked earlier, but not quite the same. I would like to iterate through a large dataset (~500,000 rows) and, for each unique value in one column, do some processing of all the values in another column.
Here is code that I have confirmed to work:
df = matrix(nrow=783,ncol=2)
counts = table(csvdata$value)
p = (as.vector(counts))/length(csvdata$value)
D = 1 - sum(p**2)
The only problem with it is that it returns the value D for the entire dataset, rather than returning a separate D value for each set of rows where ID is the same.
Say I had the same kind of data, but with an "ID" column grouping the rows (as in the sample data reproduced in the answer below). How would I be able to do the same thing as the code above, but return a D value for each group of rows where ID is the same, rather than for the entire dataset? I imagine this requires a loop, and creating a matrix to store all the D values with ID in one column and the value of D in the other, but I am not sure.
Ok, let's work with "In short, I would like whatever is in the for loop to be executed for each block of data with a unique value of "ID"".
In general you can group rows by values in one column (e.g. "ID") and then perform some transformation based on values/entries in other columns per group. In the tidyverse this would look like this
library(tidyverse)
df %>%
  group_by(ID) %>%
  mutate(value.mean = mean(value))
## A tibble: 8 x 3
## Groups: ID [3]
# ID value value.mean
# <fct> <int> <dbl>
#1 a 13 12.6
#2 a 14 12.6
#3 a 12 12.6
#4 a 13 12.6
#5 a 11 12.6
#6 b 12 15.5
#7 b 19 15.5
#8 cc4 10 10.0
Here we calculate the mean of value per group, and add these values to every row. If instead you wanted to summarise values, i.e. keep only the summarised value(s) per group, you would use summarise instead of mutate.
library(tidyverse)
df %>%
  group_by(ID) %>%
  summarise(value.mean = mean(value))
## A tibble: 3 x 2
# ID value.mean
# <fct> <dbl>
#1 a 12.6
#2 b 15.5
#3 cc4 10.0
The same can be achieved in base R using one of tapply, ave, or by. As far as I understand your problem statement there is no need for a for loop; just apply a function per group (see the tapply sketch after the sample data below).
Sample data
df <- read.table(text =
"ID value
a 13
a 14
a 12
a 13
a 11
b 12
b 19
cc4 10", header = T)
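For instance, a minimal tapply sketch of the summarise-style result, using the sample data above:
# One mean per group, returned as a named vector
tapply(df$value, df$ID, mean)
#    a    b  cc4
# 12.6 15.5 10.0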
Update
To conclude from the comments and chat, this should be what you're after.
# Sample data
set.seed(2017)
csvdata <- data.frame(
microsat = rep(c("A", "B", "C"), each = 8),
allele = sample(20, 3 * 8, replace = T))
csvdata %>%
  group_by(microsat) %>%
  summarise(D = 1 - sum(prop.table(table(allele))^2))
## A tibble: 3 x 2
# microsat D
# <fct> <dbl>
#1 A 0.844
#2 B 0.812
#3 C 0.812
Note that prop.table returns fractions and is shorter than your (as.vector(counts))/length(csvdata$value). Note also that you can reproduce your results for all values (irrespective of ID) if you omit the group_by line.
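For illustration, a quick check of that equivalence (a small hedged sketch with made-up data):
x <- c("a", "a", "b", "c")
prop.table(table(x))             # named fractions: a 0.50, b 0.25, c 0.25
as.vector(table(x)) / length(x)  # the same fractions, without names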
A base R option would be
df1$value.mean <- with(df1, ave(value, ID))
My goal is to get the same number of rows for each split (based on the column Initials). I am basically trying to pad the number of rows so that each person has the same amount, while retaining the Initials column so I can tell them apart. My attempt failed completely. Anybody have suggestions?
df<-data.frame(Initials=c("a","a","b"),data=c(2,3,4))
attach(df)
maxrows=max(table(Initials))+1
arr <- split(df, Initials)
lapply(arr, function(x){
  toadd <- maxrows - dim(x)[1]
  replicate(toadd, x <- rbind(x, rep(NA, 1))) # colnames -1 because col 1 should keep the same Initial
})
Goal:
a 2
a 3
b 4
b NA
Using data.table...
my_rows <- seq.int(max(tabulate(df$Initials)))
library(data.table)
setDT(df)[ , .SD[my_rows], by=Initials]
# Initials data
# 1: a 2
# 2: a 3
# 3: b 4
# 4: b NA
.SD is the Subset of Data associated with each by= group. We can subset its rows like .SD[row_numbers], unlike a data.frame which requires an additional comma DF[row_numbers,].
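A tiny illustration of that comma rule on a throwaway data frame (hedged; DF is just an example name):
DF <- data.frame(a = 1:3, b = 4:6)
DF[1:2]    # without the comma this selects the first two *columns*
DF[1:2, ]  # the trailing comma is what selects the first two *rows*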
The analogue in dplyr is
my_rows <- seq.int(max(tabulate(df$Initials)))
library(dplyr)
setDT(df) %>% group_by(Initials) %>% slice(my_rows)
# Initials data
# (fctr) (dbl)
# 1 a 2
# 2 a 3
# 3 b 4
# 4 b NA
Strangely, this only works if df is a data.table. I've filed a report/query with dplyr. There's a good chance that the dplyr devs will prevent this usage in a future version.
Here's a dplyr/tidyr method. We group_by Initials, add row numbers, ungroup, complete the Initials/row-number combinations, then remove our row-number column:
library(dplyr)
library(tidyr)
df %>% group_by(Initials) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  complete(Initials, row) %>%
  select(-row)
Source: local data frame [4 x 2]
Initials data
(fctr) (dbl)
1 a 2
2 a 3
3 b 4
4 b NA
Interesting problem. Try:
to.add <- max(table(df$Initials)) - table(df$Initials)
rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df)-1)))
# Initials data
#1 a 2
#2 a 3
#3 b 4
#4 b <NA>
We calculate the number of extra rows each initial needs, combine those repeats with NA values, and then rbind them onto the data frame.
max(table(df$Initials)) gives the count of the most frequent initial, in this case 2. Subtracting each initial's own count, table(df$Initials), from that maximum yields a vector of the necessary additions. There's an added bonus to this method: by using table we automatically get a named vector.
We use the names of that vector to know 1) which initials to repeat, and 2) how many times they should be repeated.
To preserve the class of the data column (the rbind above coerces it to character), you can add newdf$data <- as.numeric(newdf$data), where newdf is whatever name you gave the combined result.
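Putting the pieces together, a minimal end-to-end sketch (newdf is simply the name chosen here for the combined result):
to.add <- max(table(df$Initials)) - table(df$Initials)
newdf <- rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df) - 1)))
newdf$data <- as.numeric(newdf$data) # rbind coerced 'data' to character; restore it
newdf
#   Initials data
# 1        a    2
# 2        a    3
# 3        b    4
# 4        b   NA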
I'm trying to collapse a data frame by removing all but one row from each group of rows with identical values in a particular column. In other words, the first row from each group.
For example, I'd like to convert this
> d = data.frame(x=c(1,1,2,4),y=c(10,11,12,13),z=c(20,19,18,17))
> d
x y z
1 1 10 20
2 1 11 19
3 2 12 18
4 4 13 17
Into this:
x y z
1 1 11 19
2 2 12 18
3 4 13 17
I'm using aggregate to do this currently, but the performance is unacceptable with more data:
> d.ordered = d[order(-d$y),]
> aggregate(d.ordered,by=list(key=d.ordered$x),FUN=function(x){x[1]})
I've tried split/unsplit with the same function argument as here, but unsplit complains about duplicate row numbers.
Is rle a possibility? Is there an R idiom for converting rle's lengths vector into the indices of the rows that start each run, which I could then use to pluck those rows out of the data frame?
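For reference, such an idiom does exist: the run boundaries fall out of cumsum over the lengths. A minimal sketch, assuming d is already sorted by x as above:
r <- rle(d$x)                               # runs of identical x values
starts <- cumsum(c(1, head(r$lengths, -1))) # first row of each run: 1 3 4
ends <- cumsum(r$lengths)                   # last row of each run: 2 3 4
d[ends, ]                                   # one row per x, keeping the last row of each run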
Maybe duplicated() can help:
R> d[ !duplicated(d$x), ]
x y z
1 1 10 20
3 2 12 18
4 4 13 17
R>
Edit: Shucks, never mind. This picks the first in each block of repetitions, and you wanted the last. So here is another attempt using plyr:
R> library(plyr)
R> ddply(d, "x", function(z) tail(z,1))
x y z
1 1 11 19
2 2 12 18
3 4 13 17
R>
Here plyr does the hard work of finding unique subsets, looping over them and applying the supplied function -- which simply returns the last set of observations in a block z using tail(z, 1).
Just to add a little to what Dirk provided... duplicated has a fromLast argument that you can use to select the last row:
d[ !duplicated(d$x,fromLast=TRUE), ]
Here is a data.table solution which will be time- and memory-efficient for large data sets
library(data.table)
DT <- as.data.table(d) # convert to data.table
setkey(DT, x) # set key to allow binary search using `J()`
DT[J(unique(x)), mult ='last'] # subset out the last row for each x
DT[J(unique(x)), mult ='first'] # if you wanted the first row for each x
There are a few options using dplyr:
library(dplyr)
df %>% distinct(x, .keep_all = TRUE)
df %>% group_by(x) %>% filter(row_number() == 1)
df %>% group_by(x) %>% slice(1)
You can use more than one column with both distinct() and group_by():
df %>% distinct(x, y, .keep_all = TRUE)
The group_by() and filter() approach can be useful if there is a date or some other sequential field and
you want to ensure the most recent observation is kept, and slice() is useful if you want to avoid ties:
df %>% group_by(x) %>% filter(date == max(date)) %>% slice(1)