Using dplyr first function but ignoring a particular character - r

I wish to add the first feature in the following dataset in a new column
mydf <- data.frame (customer= c(1,2,1,2,2,1,1) , feature =c("other", "a", "b", "c", "other","b", "c"))
customer feature
1 1 other
2 2 a
3 1 b
4 2 c
5 2 other
6 1 b
7 1 c
by using dplyr. However, I want my code to ignore the "other" feature in the data set and choose the first feature that is not "other".
so the following code is not sufficient:
library (dplyr)
new <- mydf %>%
group_by(customer) %>%
mutate(firstfeature = first(feature))
How can I ignore "other" so that I reach the following ideal output:
customer feature firstfeature
1 1 other b
2 2 a a
3 1 b b
4 2 c a
5 2 other a
6 1 b b

With dplyr we can group by customer and take the first feature that is not "other" in each group.
library(dplyr)
mydf %>%
group_by(customer) %>%
mutate(firstfeature = feature[feature != "other"][1])
# customer feature firstfeature
# <dbl> <chr> <chr>
#1 1 other b
#2 2 a a
#3 1 b b
#4 2 c a
#5 2 other a
#6 1 b b
#7 1 c b
Similarly, we can do this with ave in base R:
mydf$firstfeature <- ave(mydf$feature, mydf$customer,
FUN= function(x) x[x!= "other"][1])

Another option is data.table
library(data.table)
setDT(mydf)[, firstfeature := feature[feature != "other"][1], customer]
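One case the question does not cover is a customer whose rows are all "other". The following is a minimal base R sketch of what the indexing idiom above does in that situation; the data frame mydf2 is hypothetical and not part of the original question.

```r
# Hypothetical data: customer 3 has only "other" rows.
mydf2 <- data.frame(
  customer = c(1, 1, 2, 3, 3),
  feature  = c("other", "b", "a", "other", "other"),
  stringsAsFactors = FALSE
)

# x[x != "other"] is empty for customer 3, and indexing an empty
# vector with [1] yields NA, so that group is filled with NA.
mydf2$firstfeature <- ave(
  mydf2$feature, mydf2$customer,
  FUN = function(x) x[x != "other"][1]
)
mydf2
```

So the idiom should not fail on such groups; it just leaves NA, which may or may not be what you want.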

Related

keep last non missing observation for all variables by group

My data has multiple columns and some of those columns have missing values in different rows. I would like to group (collapse) the data by the variable "g", keeping the last non-missing observation of each variable.
Input:
d <- data.table(a=c(1,NA,3,4),b=c(1,2,3,4),c=c(NA,NA,'c',NA),g=c(1,1,2,2))
Desired output
d_g <- data.table(a=c(1,4),b=c(2,4),c=c(NA,'c'),g=c(1,2))
A data.table (or dplyr) solution is preferred here.
Note: this is related to this question, but the main answers there seem to cause unnecessary NAs in some groups.
Using data.table :
library(data.table)
d[, lapply(.SD, function(x) last(na.omit(x))), g]
# g a b c
#1: 1 1 2 <NA>
#2: 2 4 4 c
One option using dplyr could be:
d %>%
group_by(g) %>%
summarise(across(everything(), ~ if(all(is.na(.))) NA else last(na.omit(.))))
g a b c
<dbl> <dbl> <dbl> <chr>
1 1 1 2 <NA>
2 2 4 4 c
In base R, aggregate could be used.
aggregate(.~g, d, function(x) tail(x[!is.na(x)], 1), na.action = NULL)
# g a b c
#1 1 1 2
#2 2 4 4 c
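If you want an explicit NA for a group where a column is entirely missing, a base R variant could look like the sketch below. The helper name last_non_na is made up for illustration, and a plain data.frame is used instead of a data.table so that no package is needed.

```r
d <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(1, 2, 3, 4),
  c = c(NA, NA, "c", NA),
  g = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last non-missing value, or a typed NA
# (same class as x) when every value in the group is missing.
last_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) == 0) x[NA_integer_] else keep[length(keep)]
}

cols <- setdiff(names(d), "g")

# Collapse each group to one row, then stack the rows.
res <- do.call(rbind, lapply(split(d, d$g), function(grp) {
  data.frame(g = grp$g[1], lapply(grp[cols], last_non_na),
             stringsAsFactors = FALSE)
}))
rownames(res) <- NULL
res
```

Returning a typed NA keeps each column's class the same across groups, which avoids surprises when the per-group rows are bound together.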

Drop list columns from dataframe using dplyr and select_if

Is it possible to drop all list columns from a dataframe using dplyr select, similar to dropping a single column?
df <- tibble(
a = LETTERS[1:5],
b = 1:5,
c = list('bob', 'cratchit', 'rules!','and', 'tiny tim too"')
)
df %>%
select_if(-is.list)
Error in -is.list : invalid argument to unary operator
This seems to be a workable workaround, but I wanted to know whether it can be done with select_if.
df %>%
select(-which(map(df,class) == 'list'))
Use Negate
df %>%
select_if(Negate(is.list))
# A tibble: 5 x 2
a b
<chr> <int>
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
There is also purrr::negate that would give the same result.
We can use Filter from base R
Filter(Negate(is.list), df)
# A tibble: 5 x 2
# a b
# <chr> <int>
#1 A 1
#2 B 2
#3 C 3
#4 D 4
#5 E 5
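In more recent dplyr versions the scoped verbs such as select_if are superseded, and the same selection can be written with where(). This is a sketch assuming dplyr 1.0.0 or later is installed; it reuses the tibble from the question.

```r
library(dplyr)

df <- tibble(
  a = LETTERS[1:5],
  b = 1:5,
  c = list("bob", "cratchit", "rules!", "and", "tiny tim too")
)

# where() wraps a predicate function; ! negates the selection,
# so every column that is not a list is kept.
res <- df %>% select(!where(is.list))
names(res)
```

Here the negation is applied to the selection rather than to the function, which is why it works where -is.list did not.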

R: how to select rows with two conditions (bought both products)

I have a dataset which is similar to the following:
ID = c(1,2,3,4,1,2,3)
Product = c("a", "b", "c", "a","b","a","a")
Quantity = c(1,1,1,1,1,1,1)
df = data.frame(ID, Product, Quantity)
# ID Product Quantity
#1 1 a 1
#2 2 b 1
#3 3 c 1
#4 4 a 1
#5 1 b 1
#6 2 a 1
#7 3 a 1
I want to select the people who purchased both product a and product b. In the case of the above example, the desired result I want is:
ID Product Quantity
1 a 1
2 b 1
1 b 1
2 a 1
I cannot recall a function that does this. All I can think of is a loop, but I am hoping to find a more succinct solution.
With ave:
df[
with(df, ave(as.character(Product), ID, FUN=function(x) all(c("a","b") %in% x) ))=="TRUE",
]
# ID Product Quantity
#1 1 a 1
#2 2 b 1
#5 1 b 1
#6 2 a 1
You could do the following with dplyr
library(dplyr)
df %>%
filter(Product %in% c('a','b')) %>% # Grab only desired products
group_by(ID) %>% # For each ID...
filter(n_distinct(Product) == 2) %>% # Only grab IDs that bought both products
ungroup # Remove grouping.
## # A tibble: 4 x 3
## ID Product Quantity
## <dbl> <fctr> <dbl>
## 1 1 a 1
## 2 2 b 1
## 3 1 b 1
## 4 2 a 1
Edit
Here is a slightly more concise dplyr version using all (similar to how Psidom used it in the data.table solution):
df %>%
group_by(ID) %>%
filter(all(c('a','b') %in% as.character(Product))) %>%
ungroup
Another option using data.table:
library(data.table)
setDT(df)[, .SD[all(c("a", "b") %in% Product)], ID]
# ID Product Quantity
#1: 1 a 1
#2: 1 b 1
#3: 2 b 1
#4: 2 a 1
Here is an option using data.table
library(data.table)
setDT(df, key = "Product")[c("a", "b")][, if(uniqueN(Product)==2) .SD , ID]
# ID Product Quantity
#1: 1 a 1
#2: 1 b 1
#3: 2 a 1
#4: 2 b 1
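An edge case worth checking with any of these approaches is a customer who buys the same product twice, since counting rows alone would treat that as two products. Below is a small base R sketch on hypothetical data (ID 4 buys "a" twice); the names has_both and keep_ids are illustrative only.

```r
# Hypothetical data: ID 4 bought "a" twice and never bought "b".
df <- data.frame(
  ID      = c(1, 2, 3, 4, 1, 2, 3, 4),
  Product = c("a", "b", "c", "a", "b", "a", "a", "a"),
  stringsAsFactors = FALSE
)

# For each ID, test whether both "a" and "b" appear among its products.
has_both <- tapply(df$Product, df$ID, function(p) all(c("a", "b") %in% p))
keep_ids <- as.numeric(names(has_both)[has_both])

# Keep every row belonging to a qualifying ID.
res <- df[df$ID %in% keep_ids, ]
res
```

Testing membership with %in% rather than counting rows means ID 4 is correctly left out.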

Determine subgroup index

I have a large data frame with groups and subgroups. I would like to determine the index of the subgroup in each group, like shown in the OUTPUT column of the following data frame:
df <- data.frame(
Group = factor(c("A","A","A","A","A","B","B","B","B")),
Subgroup = factor(c("a","a","b","b","b","a","a","b","b")),
OUTPUT = c(1,1,2,2,2,1,1,2,2)
)
I've tried several possibilities without any success. I'd like to work with dplyr, but I'm not sure how to go about this. The following code returns an unexpected result.
require(dplyr)
df <- df %>%
group_by(Group) %>%
mutate(
OUTPUT_2 = dplyr::id(Subgroup)
)
#df
# Group Subgroup OUTPUT_2
# (fctr) (fctr) (int)
#1 A a 8
#2 A a 8
#3 A b 8
#4 A b 8
#5 A b 8
#6 B a 4
#7 B a 4
#8 B b 4
#9 B b 4
I've the feeling I'm close, but not getting there. Can anybody help?
Here is a solution with data.table without aggregation (assuming dt is the data converted with as.data.table(df)):
dt[order(Subgroup), Output := cumsum(!duplicated(Subgroup)) , by = .(Group)]
This should be much faster than methods based on aggregation.
We can use the factor route with dplyr
library(dplyr)
df %>%
group_by(Group) %>%
mutate(OUTPUT = as.numeric(factor(Subgroup, levels= unique(Subgroup))))
# Group Subgroup OUTPUT
# <fctr> <fctr> <dbl>
#1 A a 1
#2 A a 1
#3 A b 2
#4 A b 2
#5 A b 2
#6 B a 1
#7 B a 1
#8 B b 2
#9 B b 2
Or another option is match with the unique elements of 'Subgroup' after grouping by 'Group'
df %>%
group_by(Group) %>%
mutate(OUTPUT = match(Subgroup, unique(Subgroup)) )
# Group Subgroup OUTPUT
# <fctr> <fctr> <int>
#1 A a 1
#2 A a 1
#3 A b 2
#4 A b 2
#5 A b 2
#6 B a 1
#7 B a 1
#8 B b 2
#9 B b 2
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
unique(dt[, .(Group, Subgroup)])[, idx := 1:.N, by = Group][dt, on = c('Group', 'Subgroup')]
# Group Subgroup idx OUTPUT
#1: A a 1 1
#2: A a 1 1
#3: A b 2 2
#4: A b 2 2
#5: A b 2 2
#6: B a 1 1
#7: B a 1 1
#8: B b 2 2
#9: B b 2 2
Translation to dplyr should be straightforward.
Another method, following the idea of using factors from aosmith's comment, is:
dt[, idx := as.integer(factor(Subgroup, unique(Subgroup))), by = Group][]
This will create a factor with correct levels per Group which is the indexing you're after.
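The reason for passing unique(Subgroup) as the levels is that a plain factor() sorts its levels, so the index would follow alphabetical order rather than order of first appearance. Here is a base R sketch of the difference on hypothetical data where group B starts with "b"; the column names idx and alpha_idx are made up for illustration.

```r
# Hypothetical data: in group B, subgroup "b" appears before "a".
df <- data.frame(
  Group    = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Subgroup = c("a", "a", "b", "b", "b", "b", "b", "a", "a"),
  stringsAsFactors = FALSE
)

# Index by order of first appearance within each group.
df$idx <- ave(
  seq_along(df$Subgroup), df$Group,
  FUN = function(i) match(df$Subgroup[i], unique(df$Subgroup[i]))
)

# Index by sorted (alphabetical) level order within each group.
df$alpha_idx <- ave(
  seq_along(df$Subgroup), df$Group,
  FUN = function(i) as.integer(factor(df$Subgroup[i]))
)
df
```

In group B the two columns disagree: idx numbers "b" as 1 because it comes first, while alpha_idx numbers it as 2.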

Create a variable capturing the most frequent occurrence by group

Define:
df1 <-data.frame(
id=c(rep(1,3),rep(2,3)),
v1=as.character(c("a","b","b",rep("c",3)))
)
s.t.
> df1
id v1
1 1 a
2 1 b
3 1 b
4 2 c
5 2 c
6 2 c
I want to create a third variable freq that contains the most frequent observation in v1 by id s.t.
> df2
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
You can do this using ddply from the plyr package and a custom function to pick out the most frequent value:
myFun <- function(x){
tbl <- table(x$v1)
x$freq <- rep(names(tbl)[which.max(tbl)],nrow(x))
x
}
ddply(df1,.(id),.fun=myFun)
Note that which.max will return the first occurrence of the maximum value, in the case of ties. See ??which.is.max in the nnet package for an option that breaks ties randomly.
Another way consists of using tidyverse functions:
grouping first, using group_by(), and counting the occurrence of the second variable using tally()
arranging by the number of occurrences with arrange()
summarizing and picking out the first row with summarize() and first()
Therefore:
df1 %>%
group_by(id, v1) %>%
tally() %>%
arrange(id, desc(n)) %>%
summarize(freq = first(v1))
This will give you just the mapping (which I find cleaner):
# A tibble: 2 x 2
id freq
<dbl> <fctr>
1 1 b
2 2 c
You can then left_join your original data frame with that table.
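For completeness, the join step might look like the sketch below. It assumes dplyr is installed, uses count() as a shorthand for the group_by() plus tally() pair, and the name freq_tbl is made up for illustration.

```r
library(dplyr)

df1 <- data.frame(
  id = c(rep(1, 3), rep(2, 3)),
  v1 = c("a", "b", "b", rep("c", 3)),
  stringsAsFactors = FALSE
)

# One row per id holding its most frequent v1.
freq_tbl <- df1 %>%
  count(id, v1) %>%
  arrange(id, desc(n)) %>%
  group_by(id) %>%
  summarize(freq = first(v1))

# Attach the per-id result back onto every original row.
res <- left_join(df1, freq_tbl, by = "id")
res
```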
mode <- function(x) names(table(x))[ which.max(table(x)) ]
df1$freq <- ave(df1$v1, df1$id, FUN=mode)
> df1
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
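Two small caveats about the ave approach, shown as a hedged sketch: naming the helper mode masks base::mode, so a different name may be safer, and in a tie which.max picks the first entry of the table, which is sorted by value rather than by order in the data. The name most_frequent is made up for illustration.

```r
# Hypothetical helper, named to avoid masking base::mode.
most_frequent <- function(x) {
  tbl <- table(x)  # counts, with names sorted by value
  names(tbl)[which.max(tbl)]
}

df1 <- data.frame(
  id = c(rep(1, 3), rep(2, 3)),
  v1 = c("a", "b", "b", rep("c", 3)),
  stringsAsFactors = FALSE
)
df1$freq <- ave(df1$v1, df1$id, FUN = most_frequent)
df1

# Tie: "z" comes first in the data, but "y" sorts first in the table.
tie_result <- most_frequent(c("z", "y"))
tie_result
```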
