subset dataframe based on hierarchical preference of factor levels within column in R

I have a dataframe which I would like to subset based on a hierarchical preference of factor levels within a column. The following example shows that, per level of "ID", I want to select only one "method": keep "CACL" if possible; if "CACL" doesn't exist for that level, subset for "KCL"; and if that doesn't exist either, subset for "H2O".
ID <- c(1,1,1,2,2,3)
method <- c("CACL","KCL","H2O","H2O","KCL","H2O")
df1 <- data.frame(ID, method)
ID method
1 1 CACL
2 1 KCL
3 1 H2O
4 2 H2O
5 2 KCL
6 3 H2O
The desired result, df2, keeps one row per ID:
ID <- c(1,2,3)
method <- c("CACL","KCL","H2O")
df2 <- data.frame(ID, method)
ID method
1 1 CACL
2 2 KCL
3 3 H2O
I have done something similar before, subsetting by selecting a minimum number within a level, but I am not able to adapt it. Should I use ifelse here too?
# if present, choose rows containing "number" 2 instead of 1
# (this column contained only the two numbers 1 and 2)
library(dplyr)
new <- df %>%
  group_by(col1, col2, col3) %>%
  summarize(number = ifelse(any(number > 1), min(number[number > 1]), 1))
dfnew <- merge(new, df, by = c("col1", "col2", "col3", "number"), all.x = TRUE)

You can use order() with match() and then simply !duplicated():
df1 <- df1[order(match(df1$method, c("CACL","KCL","H2O"))),]
df1[!duplicated(df1$ID),]
# ID method
#1 1 CACL
#5 2 KCL
#6 3 H2O
# Variant that leaves df1 unchanged:
i <- order(match(df1$method, c("CACL","KCL","H2O")))
df1[i[!duplicated(df1$ID[i])],]
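To see why this works: on the original df1, match() maps each method to its rank in the preference vector, and ordering by that rank puts the preferred method first within each ID:
match(df1$method, c("CACL","KCL","H2O"))
# [1] 1 2 3 3 2 3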

An option using dplyr:
df1 %>%
  mutate(preference = match(method, c("CACL","KCL","H2O"))) %>%
  group_by(ID) %>%
  filter(preference == min(preference)) %>%
  select(-preference)
# A tibble: 3 x 2
# Groups: ID [3]
ID method
<dbl> <fct>
1 1 CACL
2 2 KCL
3 3 H2O
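On dplyr 1.0.0 or later, the filter step can also be written with slice_min(), which keeps ties by default just like the min() comparison:
df1 %>%
  mutate(preference = match(method, c("CACL","KCL","H2O"))) %>%
  group_by(ID) %>%
  slice_min(preference, n = 1) %>%
  select(-preference)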

Related

Assign increasing index per group in grouped dataframe in R

I have a dataframe/tibble that looks like this:
# A tibble: 15 x 2
# Groups: id [3]
id date
<int> <date>
1 1 1998-02-13
2 1 1998-02-14
3 1 1998-02-15
4 1 1998-02-16
5 1 1998-02-17
6 2 1998-02-13
7 2 1998-02-14
8 2 1998-02-15
9 2 1998-02-16
10 2 1998-02-17
11 3 1998-02-13
12 3 1998-02-14
13 3 1998-02-15
14 3 1998-02-16
15 3 1998-02-17
I would like to add a variable "days_before_event" that counts from 1 to 5 (but this should not be hardcoded; it should be the number of elements per group). I thought about doing something like this
df_long %>% mutate(days_before_event = 1:nrow(.))
where nrow(.) should be the number of rows per group. This does not work and shows the error
Error: Problem with `mutate()` input `days_before_event`.
x Input `days_before_event` can't be recycled to size 5.
i Input `days_before_event` is `1:nrow(.)`.
i Input `days_before_event` must be size 5 or 1, not 15.
i The error occurred in group 1: id = 1.
Any trick on how I can achieve this?
We can use row_number() instead of 1:nrow(.):
library(dplyr)
df_long %>%
  mutate(days_before_event = row_number())
Or using base R, use ave to group over the 'id' and get the sequence (seq_along)
df_long$days_before_event <- with(df_long, ave(seq_along(id), id, FUN = seq_along))
The error occurs because the dataset is grouped (see the Groups attribute in the printed data). 1:nrow(.) therefore generates the sequence from the first to the last row of the entire dataset rather than of each group, creating a length mismatch: mutate() can only return per-group output that matches the group size (here 5) or has length 1 (or else it has to be wrapped in a list).
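For reference, a minimal sketch that reconstructs the example data from the printed tibble and applies the fix:
library(dplyr)
# rebuild the grouped example: three ids, five consecutive dates each
df_long <- tibble(
  id = rep(1:3, each = 5),
  date = rep(seq(as.Date("1998-02-13"), as.Date("1998-02-17"), by = "day"), times = 3)
) %>%
  group_by(id)

df_long %>%
  mutate(days_before_event = row_number())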

Repeat (duplicate) just one row twice in R

I'm trying to duplicate just the second row in a dataframe, so that row will appear twice. A dplyr or tidyverse approach would be great. I've tried using slice() but I can only get it to either duplicate the row I want and remove all the other data, or duplicate all the data, not just the second row.
So, starting from df, I want something like df1:
df <- data.frame(t = c(1,2,3,4,5),
                 r = c(2,3,4,5,6))
df1 <- data.frame(t = c(1,2,2,3,4,5),
                  r = c(2,3,3,4,5,6))
Thanks!
Here's also a tidyverse approach with uncount:
library(tidyverse)
df %>%
  mutate(nreps = if_else(row_number() == 2, 2, 1)) %>%
  uncount(nreps)
Basically, the idea is to set the number of times you want each row to occur. In this case row number 2 (hence row_number() == 2) occurs twice and all others occur only once, but you could construct a more complex feature where each row has a different number of repetitions. Then uncount this variable (called nreps in the code).
Output:
t r
1 1 2
2 2 3
2.1 2 3
3 3 4
4 4 5
5 5 6
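Following that remark, a sketch of the more general pattern, with a hypothetical per-row count column driving the repetitions:
df %>%
  mutate(nreps = c(1, 3, 1, 2, 1)) %>%  # hypothetical repeat count for each row
  uncount(nreps)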
One way with slice would be:
library(dplyr)
df %>% slice(sort(c(row_number(), 2)))
# t r
#1 1 2
#2 2 3
#3 2 3
#4 3 4
#5 4 5
#6 5 6
Also:
df %>% slice(sort(c(seq_len(n()), 2)))
In base R, this can be written as:
df[sort(c(seq(nrow(df)), 2)), ]

Automate filtering to subset data based on multiple columns

Here is a data set I am trying to subset:
df <- data.frame(
  id = c(1:5),
  ax1 = c(5,3,7,-1,9),
  bx1 = c(0,1,-1,0,3),
  cx1 = c(2,1,5,-1,5),
  dx1 = c(3,7,2,1,8))
The data set has a variable x1 that is measured at different time points, denoted by ax1, bx1, cx1 and dx1. I am trying to subset these data by deleting the rows with -1 in any of these columns (i.e. ax1, bx1, cx1, dx1). I would like to know if there is a way to automate filtering (or the filter function) to perform this task. I am familiar with situations where the focus is to filter rows based on a single column (or variable).
For the current case, I made an attempt starting with
mutate_at(vars(ends_with("x1")))
to select the required columns, but I am not sure how to combine this with the filter function to produce the desired results. The expected output would have the 3rd and 4th rows deleted. I appreciate any help on this. There is a similar case resolved here, but not through an automated process; I want to adapt the automation to the case of large data with many columns.
You can use filter() with across().
library(dplyr)
df %>%
  filter(across(ends_with("x1"), ~ .x != -1))
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8
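Note that on more recent dplyr versions (1.0.4+) the same condition is written with if_all(), since using across() inside filter() was later deprecated:
df %>%
  filter(if_all(ends_with("x1"), ~ .x != -1))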
It's equivalent to filter_at() with all_vars(), which has been superseded in dplyr 1.0.0.
df %>%
  filter_at(vars(ends_with("x1")), all_vars(. != -1))
Using base R, with rowSums:
cols <- grep('x1$', names(df))
df[rowSums(df[cols] == -1) == 0, ]
# id ax1 bx1 cx1 dx1
#1 1 5 0 2 3
#2 2 3 1 1 7
#5 5 9 3 5 8
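To see the mechanics: rowSums() counts, per row, how many of the x1 columns equal -1, and only rows with a count of 0 are kept:
rowSums(df[cols] == -1)
# [1] 0 0 1 2 0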
Or with apply:
df[!apply(df[cols] == -1, 1, any), ]
Using filter_at (with df as defined above):
library(dplyr)
df %>%
  filter_at(vars(ax1:dx1), ~ . != -1)
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this, without subsetting and sampling the subsets.
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat), dat$ID, FUN = sample, 1), ]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
If your data is truly a matrix and not a data.frame, you can work around this too, with:
dat[tapply(as.character(seq(nrow(dat))), dat$ID, FUN = sample, 1), ]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
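Once the value is a character, sample() treats it as the vector itself rather than as an upper bound, so a single-row group always returns its own label:
replicate(10, sample("4", 1))
# [1] "4" "4" "4" "4" "4" "4" "4" "4" "4" "4"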
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
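On dplyr 1.0.0 or later, sample_n() is superseded; the equivalent with slice_sample() is:
df %>% group_by(ID) %>% slice_sample(n = 1)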
The idea is to reorder the rows randomly and then remove duplicates in that order:
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]
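If the draw needs to be reproducible, seed the random number generator before shuffling:
set.seed(42)  # any fixed seed gives the same sample on each run
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]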

keep values of a data frame column R

In my data frame df I want to get the Id numbers satisfying the condition that the value of A is greater than the value of B. In the example I would only want Id = 2.
Id Name Value
1 A 3
1 B 5
1 C 4
2 A 7
2 B 6
2 C 8
# collect the A and B values per Id, then compare
vecA <- vector()
vecB <- vector()
vecId <- vector()
i <- 1
while (i <= dim(df)[1]) {
  if (df$Name[i] == "A") { vecA <- c(vecA, df$Value[i]) }
  if (df$Name[i] == "B") { vecB <- c(vecB, df$Value[i]) }
  i <- i + 1
}
vecId <- unique(df$Id)[vecA > vecB]  # Ids where A > B
First, you could convert your data from long to wide so you have one row for each ID:
library(reshape2)
(wide <- dcast(df, Id ~ Name, value.var = "Value"))
# Id A B C
# 1 1 3 5 4
# 2 2 7 6 8
Now you can use normal indexing to get the ids with larger A than B:
wide$Id[wide$A > wide$B]
# [1] 2
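If you prefer the tidyverse, tidyr's pivot_wider() is the modern equivalent of the dcast() call:
library(tidyr)
wide <- pivot_wider(df, names_from = Name, values_from = Value)
wide$Id[wide$A > wide$B]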
The first answer works out well for sure. I wanted to get to regular subset operations as well, since you might want to check out some of the more recent R packages. In the code below, df is the exact data.frame you started with.
library(dplyr)
# keep, per Id, the larger of the A and B values
comp <- df %>%
  filter(Name %in% c("A","B")) %>%
  group_by(Id) %>%
  filter(min_rank(Value) > 1)
# If the whole row is needed
comp[comp$Name == "A", ]
# If only the Id is needed
comp$Id[comp$Name == "A"]
