Select row by level of a factor - r

I have a data frame, df2, containing observations grouped by an ID factor that I would like to subset. I have used another function to identify which rows within each factor group I want to select. This is shown below in df:
df <- data.frame(ID = c("A","B","C"),
                 pos = c(1,3,2))
df2 <- data.frame(ID = c(rep("A",5), rep("B",5), rep("C",5)),
                  obs = c(1:15))
In df, pos corresponds to the index of the row that I want to select within the factor level mentioned in ID, not in the whole data frame df2. I'm looking for a way to select the rows for each ID according to the right index (their row number within that level of the factor in df2).
So, in this example, I want to select the first value in df2 with ID == 'A', the third value in df2 with ID == 'B' and the second value in df2 with ID == 'C'.
This would then give me:
df3 <- data.frame(ID = c("A", "B", "C"),
                  obs = c(1, 8, 12))

dplyr
library(dplyr)
merge(df, df2) %>%
  group_by(ID) %>%
  filter(row_number() == pos) %>%
  select(-pos)
# ID obs
# 1 A 1
# 2 B 8
# 3 C 12
base R
df2m <- merge(df,df2)
do.call(rbind,
  by(df2m, df2m$ID, function(SD) SD[SD$pos[1], setdiff(names(SD), "pos")])
)
by splits the merged data frame df2m by df2m$ID and operates on each part; it returns the results in a list, so they must be combined with rbind at the end. Each subset of the data (one per value of ID) is filtered by pos, and the "pos" column is dropped using normal data.frame syntax.
data.table, suggested by @DavidArenburg in a comment
library(data.table)
setkey(setDT(df2), "ID")[df][, .SD[pos[1L], !"pos", with = FALSE], by = ID]
The first part -- setkey(setDT(df2), "ID")[df] -- is the merge. After that, the resulting table is split by = ID, and each Subset of Data, .SD, is operated on. pos[1L] is subsetting in the normal way, while !"pos", with = FALSE corresponds to dropping the pos column.
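With a recent data.table (an assumption: version 1.9.8 or later, where !"pos" in j drops the column without needing with = FALSE), the same call can be written as:
setkey(setDT(df2), "ID")[df][, .SD[pos[1L], !"pos"], by = ID]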
See @eddi's answer for a better data.table approach.

Here's a base R solution:
df2$pos <- ave(df2$obs, df2$ID, FUN=seq_along)
merge(df, df2)
ID pos obs
1 A 1 1
2 B 3 8
3 C 2 12
If df2 is sorted by ID, you can just do df2$pos <- sequence(table(df2$ID)) for the first line.
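As a quick check (a sketch, using df2 exactly as defined in the question, which is sorted by ID), both calls produce the same within-group index:
sequence(table(df2$ID))
# [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5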

Using data.table version 1.9.5+:
setDT(df2)[df, .SD[pos], by = .EACHI, on = 'ID']
which merges on the ID column, then selects the pos-th row within each matching group of df2, for each row of df.
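For reference, the result matches the desired df3 from the question:
#    ID obs
# 1:  A   1
# 2:  B   8
# 3:  C  12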

Related

mass removal of rows from a data frame based on a column condition

I would like to remove all rows based on a condition of a column. The code below produces sample test data:
test_data <- data.frame(index = c(1,2,3,4,5), group = c("a", "a", "a", "b", "c"), count = c(1,2,2,3,4))
The data frame has 3 columns: index, group and count. I would like to remove all rows belonging to a group if any one row of the group has count 1. So in the above data frame, I would like to remove rows with index 1, 2 and 3, since the first row has count = 1 and rows 2 and 3 fall in the same group "a". The resultant data frame should look like this:
testdata2 <- data.frame(index = c(4,5), group = c("b", "c"), count = c(3,4))
Any help would be appreciated! Thanks!
We can use ave with subset in base R and select groups that have no row where count is 1.
subset(test_data, ave(count != 1, group, FUN = all))
# index group count
#4 4 b 3
#5 5 c 4
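To see why this works: the ave() call returns one logical per row, TRUE only for rows whose entire group satisfies count != 1:
with(test_data, ave(count != 1, group, FUN = all))
# [1] FALSE FALSE FALSE  TRUE  TRUE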
With dplyr this can be done as:
library(dplyr)
test_data %>% group_by(group) %>% filter(all(count != 1))
And with data.table:
library(data.table)
setDT(test_data)[, .SD[all(count != 1)], group]

In R: How to subset a large dataframe by top 5 longest runs of frequent values in 1 column?

I have a dataframe with 1 column. The values in this column can ONLY be "good" or "bad". I would like to find the top 5 largest runs of "bad".
I am able to use the rle(df) function to get the run lengths of all the "good" and "bad" values.
How do I find the 5 largest runs that are attributable ONLY to "bad"?
How do I get the starting and ending indices of the top 5 largest runs for ONLY "bad"?
Your assistance is much appreciated!
One option would be rleid. Convert the 'data.frame' to a 'data.table' (setDT(df1)), create a grouping column with rleid (which generates a unique id based on adjacent non-matching elements), add the number of elements per group ('n') and the row number ('rn') as columns, subset the rows where 'goodbad' is "bad", order by 'n' in decreasing order and, grouped by 'grp', summarise the first and last row number as well as the entry for 'goodbad'.
library(data.table)
setDT(df1)[, grp := rleid(goodbad)][, n := .N, grp][, rn := .I][
  goodbad == 'bad'][order(-n),
  .(goodbad = first(goodbad), n = n, start = rn[1], last = rn[.N]),
  .(grp)][n %in% head(unique(n), 5)][, grp := NULL][]
Or we can use rle and other base R methods
rl <- rle(df1$goodbad)
grp <- with(rl, rep(seq_along(values), lengths))
df2 <- transform(df1, grp = grp, n = rep(rl$lengths, rl$lengths),
                 rn = seq_len(nrow(df1)))
df3 <- subset(df2, goodbad == 'bad')
do.call(data.frame, aggregate(rn ~ grp,
        subset(df3[order(-df3$n), ], n %in% head(unique(n), 5)), range))
data
set.seed(24)
df1 <- data.frame(goodbad = sample(c("good", "bad"), 100,
replace = TRUE), stringsAsFactors = FALSE)
The sort(...) function arranges values in increasing or decreasing order. The default is increasing, but you can set decreasing = TRUE. Use ?sort for more info.
The which(...) function returns the INDEX of values that meet a logical criterion. The code below sorts the times column of the rows where the goodbad value is "GOOD".
sort(your.df$times[which(your.df$goodbad == "GOOD")])
If you wanted to get the top 5 you could do this:
top5_good <- sort(your.df$times[which(your.df$goodbad == "GOOD")], decreasing = TRUE)[1:5]
top5_bad <- sort(your.df$times[which(your.df$goodbad == "BAD")], decreasing = TRUE)[1:5]

Finding rows with a unique combination of values (R)

This is a bit more complicated than the title lets on, and I'm sure that if I could think of a way to describe it better, I could google it better.
I have data that looks like this:
SET ID
100301006 1287025
100301006 1287026
100301010 1287027
100301013 1287030
100301011 1287027
and I would like to identify and select those rows where both values in the row are unique within their respective columns. In the example above, I want to grab only the row:
100301013 1287030
I don't want SET 100301006, since it matches to 2 different records in the ID field (1287025 and 1287026). Similarly, I don't want SET 100301010, since the ID record it matches to (1287027) can also match another SET (100301011).
In some cases there could be more than 2 matches.
I could do this in loops, but that seems like a hack. I'd love a base R or data.table solution, but I'm not so interested in dplyr (trying to minimize dependencies).
We can use duplicated on each column independently to create a list of logical vectors, Reduce it to a single vector with &, and use that to subset the rows of the dataset.
df1[Reduce(`&`, lapply(df1, function(x)
     !(duplicated(x) | duplicated(x, fromLast = TRUE)))), ]
# SET ID
#4 100301013 1287030
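To see what the inner expression computes, apply it to one column: it flags every value that occurs more than once, including the first occurrence. For the SET column:
x <- df1$SET
duplicated(x) | duplicated(x, fromLast = TRUE)
# [1]  TRUE  TRUE FALSE FALSE FALSE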
Or, as @chinsoon12 suggested:
m1 <- sapply(df1, function(x) !(duplicated(x)| duplicated(x, fromLast = TRUE)))
df1[rowSums(m1) == ncol(m1),, drop = FALSE]
data
df1 <- structure(list(SET = c(100301006L, 100301006L, 100301010L, 100301013L,
100301011L), ID = c(1287025L, 1287026L, 1287027L, 1287030L, 1287027L
)), class = "data.frame", row.names = c(NA, -5L))
Here's a quick base-R hack:
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
SET ID
100301006 1287025
100301006 1287026
100301010 1287027
100301013 1287030
100301011 1287027")
counts <- sapply(df, function(x) { tb <- table(x); tb[ match(x, names(tb)) ]; })
counts
# SET ID
# 100301006 2 1
# 100301006 2 1
# 100301010 1 2
# 100301013 1 1
# 100301011 1 2
At this point, we have the number of times each element is found in its column ... so we want rows where all counts are 1.
df[ rowSums(counts == 1) == ncol(df), ]
# SET ID
# 4 100301013 1287030
You could use data.table to select only groups with 1 row, grouping by ID first, then by SET. This is similar to @r2evans's method of checking that the counts for ID and SET are both 1.
library(data.table)
setDT(df)
df[, if(.N == 1) .SD, ID][, if(.N == 1) .SD, SET]
# SET ID
# 1: 100301013 1287030
Or for more than 2 columns
Reduce(function(x, y) x[, if(.N == 1) .SD, y], names(df), init = df)
# ID SET
# 1: 1287030 100301013
With base R, maybe you can use ave() to do it:
r <- df[which(with(df, ave(seq(nrow(df)), SET, FUN = length) *
                   ave(seq(nrow(df)), ID, FUN = length)) == 1), ]
> r
SET ID
4 100301013 1287030
DATA
df <- read.table(text="SET ID
100301006 1287025
100301006 1287026
100301010 1287027
100301013 1287030
100301011 1287027",header = T)
If we have a dataframe df and want to find the unique combinations of columns column1, column2 and column3:
library(dplyr)
df <- df %>% group_by(column1, column2, column3) %>% summarise()
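Equivalently (a minor variant; distinct() is standard dplyr), the unique combinations can be obtained directly, without a grouped result:
df %>% distinct(column1, column2, column3)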

R equivalent to SAS "merge" "by"

If you use only "merge" and "by" in SAS to merge datasets that contain several variables with equal names (besides the ID(s) that you merge by), SAS will combine these variables into one, using the value read last - it is described here: https://communities.sas.com/t5/SAS-Programming/Merge-step-overwriting-shared-vars/m-p/281542#M57117
Text from above link:
"There is a rule: whichever value was read last. But that rule is simple only when the merge is one-to-one. In that case, the value you get depends on the order in the MERGE statement:
merge a b;
by id;
The value of common variables (for a one-to-one merge) comes from data set B. SAS reads a value from data set A, then reads a value from data set B. The value from B is read last, and overwrites the value read from data set A.
If there is a mismatch, and an ID appears only in data set A but not in data set B, the value will be the one found in data set A."
How do I make R behave the same way, without having to combine the rows afterwards according to certain conditions? (In SAS, values are not overwritten by NAs.)
library(tidyverse)
#create tibbles
df1 <- tibble(id = c(1:3), y = c("tt", "ff", "kk"))
df2 <- tibble(id = c(1,2,4), y = c(4,3,8))
df3 <- tibble(id = c(1:3), y = c(5,7,NA))
#combine the tibbles
combined_df <- list(df1, df2, df3) %>%
  reduce(full_join, by = "id")
# desired output
combined_df_desired <- tibble(id = 1:4, y = c(5,7,"kk",8))
I don't know exactly what you mean by "certain conditions". There isn't a way to change the inner workings of full_join(), but you can do:
list(df1, df2, df3) %>%
  reduce(full_join, by = "id") %>%
  mutate_all(as.character) %>%
  mutate(y = coalesce(y, y.y, y.x)) %>%
  select(id, y)
A tibble: 4 x 2
id y
<chr> <chr>
1 1 5
2 2 7
3 3 kk
4 4 8
coalesce() takes a set of columns and returns the first non-NA value for each row. You can order the columns inside the function according to your priorities.
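A minimal illustration of coalesce() on plain vectors:
library(dplyr)
coalesce(c(NA, 2, NA), c(1, NA, NA), c(9, 9, 3))
# [1] 1 2 3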

How to merge data frames in R using *alternative* columns

I'm trying to merge 2 data frames in R, but I have two different columns with different types of ID variable. Sometimes a row will have a value for one of those columns but not the other. I want to consider them both, so that if one frame is missing a value for one of the columns then the other will be used.
> df1 <- data.frame(first = c('a', 'b', NA), second = c(NA, 'q', 'r'))
> df1
first second
1 a <NA>
2 b q
3 <NA> r
> df2 <- data.frame(first = c('a', NA, 'c'), second = c('p', 'q', NA))
> df2
first second
1 a p
2 <NA> q
3 c <NA>
I want to merge these two data frames and get 2 rows:
row 1, because it has the same value for "first"
row 2, because it has the same value for "second"
row 3 would be dropped, because df1 has a value for "second", but not "first", and df2 has the reverse
It's important that NAs are ignored and don't "match" in this case.
I can get kinda close:
> merge(df1,df2, by='first', incomparables = c(NA))
first second.x second.y
1 a <NA> p
> merge(df1,df2, by='second', incomparables = c(NA))
second first.x first.y
1 q b <NA>
But I can't rbind these two data frames together because they have different column names, and it doesn't seem like the "R" way to do it (in the near future, I'll have a 3rd, 4th and even 5th type of ID).
Is there a less clumsy way to do this?
Edit: Ideally, the output would look like this:
> df3 <- data.frame(first = c('a', 'b'), second = c('p','q'))
> df3
first second
1 a p
2 b q
row 1, has matched because the column "first" has the same value in both data frames, and it fills in the value for "second" from df2
row 2, has matched because the column "second" has the same value in both data frames, and it fills in the value for "first" from df1
there is no row 3, because there is no column that has a value in both data frames
Using sqldf we can do this, since in SQL we can alternate between join conditions using OR:
library(sqldf)
df <- sqldf("select a.*, b.*
             from df1 a
             join df2 b
             ON a.first = b.first
             OR a.second = b.second")
library(dplyr)
# If the value in first is NA (i.e. is.na(first) is TRUE), take first..3's value, otherwise keep first's; likewise for second
df %>% mutate(first = ifelse(is.na(first), first..3, first),
              second = ifelse(is.na(second), second..4, second)) %>%
  # Discard first..3 and second..4 since we no longer need them
  select(-first..3, -second..4)
first second
1 a p
2 b q
