I have a df as below and I would like to add a new column that concatenates col1 within each ID. I wrote code with dplyr, but I don't know how to complete it.
df %>%
  group_by(ID) %>%
  mutate(col2 = paste0(col1, ?????))   # <- what do I write here?
df<-read.table(text="
ID col1
1 A
1 B
1 A
2 C
2 C
2 D", header=T)
Desired result:
ID col1 col2
1 A ABA
1 B ABA
1 A ABA
2 C CCD
2 C CCD
2 D CCD
Use the collapse argument.
df %>%
  group_by(ID) %>%
  mutate(col2 = paste(col1, collapse = "")) %>%
  ungroup()
giving:
# A tibble: 6 x 3
ID col1 col2
<int> <fct> <chr>
1 1 A ABA
2 1 B ABA
3 1 A ABA
4 2 C CCD
5 2 C CCD
6 2 D CCD
Alternatively, using only base R, we could use this one-liner:
transform(df, col2 = ave(as.character(col1), ID, FUN = function(x) paste(x, collapse = "")))
giving:
ID col1 col2
1 1 A ABA
2 1 B ABA
3 1 A ABA
4 2 C CCD
5 2 C CCD
6 2 D CCD
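For larger data, a data.table sketch of the same idea (assuming the data.table package is available; `:=` adds the column by reference within each ID group) could look like this:
library(data.table)
setDT(df)                                          # convert to a data.table by reference
df[, col2 := paste(col1, collapse = ""), by = ID]  # group-wise concatenation, as above
df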
I am trying to find, for each id, the value tied to that id's maximum date in R. The dataframe looks like this:
id value date
1  A     12/12/2021
1  B     12/13/2021
1  A     12/14/2021
2  A     12/13/2021
2  C     12/07/2021
2  B     12/17/2021
3  C     12/13/2021
3  B     12/06/2021
3  C     12/02/2021
The code should return:
id value date       max_value
1  A     12/12/2021 A
1  B     12/13/2021 A
1  A     12/14/2021 A
2  A     12/13/2021 B
2  C     12/07/2021 B
2  B     12/17/2021 B
3  C     12/13/2021 C
3  B     12/06/2021 C
3  C     12/02/2021 C
I have tried the following and get an error.
df <- df[!is.na(df$date), ]
for (ID in unique(df$id)) {
  as.data.frame(df %>%
    filter(id == ID) %>%
    dplyr::mutate(max_value = ifelse(df$date == max(df$date, na.rm = T), df$value,
                                     df$value[df$date == max(df$date, na.rm = T) & df$id == ID])))
}
Try the following dplyr approach:
df %>%
  group_by(id) %>%
  mutate(max = value[date == max(date)])
Output:
# id value date max
# <int> <chr> <chr> <chr>
# 1 1 A 12/12/2021 A
# 2 1 B 12/13/2021 A
# 3 1 A 12/14/2021 A
# 4 2 A 12/13/2021 B
# 5 2 C 12/07/2021 B
# 6 2 B 12/17/2021 B
# 7 3 C 12/13/2021 C
# 8 3 B 12/06/2021 C
# 9 3 C 12/02/2021 C
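Note that date is stored as character here, so max(date) compares strings; that happens to work for this sample (same month and year throughout) but is not safe in general. A sketch that parses the dates first (assuming a month/day/year format) would be:
df %>%
  group_by(id) %>%
  mutate(max_value = value[which.max(as.Date(date, format = "%m/%d/%Y"))]) %>%  # value at the latest date
  ungroup()
Using which.max() also returns a single index, which sidesteps the case where the maximum date appears more than once within an id.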
I have a dataframe that looks like this
df <- data.frame(time=seq(1,4,1),col1=c("a","b","d","c"), col2=c("d","a","c","b"))
df
#> time col1 col2
#> 1 1 a d
#> 2 2 b a
#> 3 3 d c
#> 4 4 c b
Created on 2021-11-06 by the reprex package (v2.0.1)
I want to sort my data frame based on col2 so that it looks like this:
time col1 col2
3 d d
1 a a
4 c c
2 b b
Any ideas or help is highly appreciated!
I don't know if this makes any sense, but you could do a self join:
library(tidyr)
library(dplyr)
df %>%
  select(col2) %>%
  inner_join(df %>% mutate(col2 = col1), by = "col2") %>%
  select(time, col1, col2)
This returns
time col1 col2
1 3 d d
2 1 a a
3 4 c c
4 2 b b
A solution in base R:
df <- data.frame(time = match(df$col2,df$col1), col1 = df$col2, col2=df$col2)
#> time col1 col2
#> 1 3 d d
#> 2 1 a a
#> 3 4 c c
#> 4 2 b b
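A dplyr sketch of the same idea (assuming, as in the example, that every value of col1 appears exactly once) is to arrange the rows by the position of col1 within col2 and then overwrite col2:
library(dplyr)
df %>%
  arrange(match(col1, col2)) %>%  # reorder rows so col1 follows the order given by col2
  mutate(col2 = col1)             # the desired output repeats the reordered letters in col2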
I have a specific filtering question. Here is what my sample dataset looks like:
df <- data.frame(id = c(1,2,3,3,4,5),
                 cat = c("A","A","A","B","B","B"))
> df
id cat
1 1 A
2 2 A
3 3 A
4 3 B
5 4 B
6 5 B
Grouping by id, when an id has multiple cat categories, I would like to keep only cat A. So the desired output would be:
> df.1
id cat
1 1 A
2 2 A
3 3 A
4 4 B
5 5 B
Any ideas?
Thanks!
If there are only two categories in cat, we can use the following logic:
df %>%
  group_by(id) %>%
  filter(!(n() == 2 & cat == "B"))
# A tibble: 5 x 2
# Groups: id [5]
id cat
<dbl> <chr>
1 1 A
2 2 A
3 3 A
4 4 B
5 5 B
When there are multiple other letters possible:
df <- data.frame(id = c(1,2,3,3,4,5,6,6,6,7),
                 cat = c("A","A","A","B","B","B", "A", "B", "C", "D"))

df %>%
  group_by(id) %>%
  filter(!(n() >= 2 & cat %in% LETTERS[2:26]))
# A tibble: 7 x 2
# Groups: id [7]
id cat
<dbl> <chr>
1 1 A
2 2 A
3 3 A
4 4 B
5 5 B
6 6 A
7 7 D
Explanation: n() gives the current group size. When a group has more than one row, we drop every row whose cat is one of the letters B through Z; in other words, we keep only "A" in multi-row groups.
In this example you can take the first item from each group. In other situations you may need to reorder with arrange() first.
(using dplyr)
df %>% group_by(id) %>% summarise(cat = first(cat))
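For instance, if the rows were not already ordered with "A" first within each id, a sketch of the arrange-first remark above could be:
df %>%
  group_by(id) %>%
  arrange(cat, .by_group = TRUE) %>%  # sort within each id so "A" comes first when present
  summarise(cat = first(cat))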
Base R:
aggregate(
  df$cat,
  by = list(id = df$id),
  FUN = \(x) {
    unx <- unique(x)
    if (length(unx) > 1) 'A' else unx
  }
)
# id x
# 1 1 A
# 2 2 A
# 3 3 A
# 4 4 B
# 5 5 B
One approach with dplyr: after grouping by id, keep rows where the id has only one row or where cat is "A".
library(dplyr)
df %>%
  group_by(id) %>%
  filter(n() == 1 | cat == "A")
Output
id cat
<dbl> <chr>
1 1 A
2 2 A
3 3 A
4 4 B
5 5 B
Also, if it is possible to have the same cat repeated within a single id, you can filter where the number of distinct cat is 1 (or keep if cat is "A"):
df %>%
  group_by(id) %>%
  filter(n_distinct(cat) == 1 | cat == "A")
Using base R
subset(df, cat == 'A' | id %in% names(which(table(id) == 1)))
id cat
1 1 A
2 2 A
3 3 A
5 4 B
6 5 B
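A data.table sketch with the same logic as the dplyr answers above (keep a row if its id occurs only once, or if cat is "A"; assumes the data.table package):
library(data.table)
setDT(df)[, .SD[.N == 1 | cat == "A"], by = id]  # .N is the group size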
I would like to create a new column based on the conditions below:
if the `str` column contains only `A`, then insert `A`
if the `str` column contains only `B`, then insert `B`
if the `str` column contains both `A` and `B`, then insert `AB`
df<-read.table(text="
ID str
1 A
1 A
1 AA
1 ABB
2 BA
2 BB", header=T)
Desired output:
ID str simplify_str
1 A A
1 A A
1 AA A
1 ABB AB
2 BA AB
2 BB B
As far as tidyverse options are concerned, you could use dplyr::case_when with stringr::str_detect:
library(dplyr)
library(stringr)
df %>%
  mutate(simplify_str = case_when(
    str_detect(str, "^A+$") ~ "A",
    str_detect(str, "^B+$") ~ "B",
    TRUE ~ "AB"))
# ID str simplify_str
#1 1 A A
#2 1 A A
#3 1 AA A
#4 1 ABB AB
#5 2 BA AB
#6 2 BB B
Using your data.frame:
As <- grep("A", df$str)
Bs <- grep("B", df$str)
df$simplify_str <- ""
df$simplify_str[As] <- paste0(df$simplify_str[As], "A")
df$simplify_str[Bs] <- paste0(df$simplify_str[Bs], "B")
df
ID str simplify_str
1 1 A A
2 1 A A
3 1 AA A
4 1 ABB AB
5 2 BA AB
6 2 BB B
A general solution in base R: split each string and paste together its unique characters in sorted order.
df$simplify_str <- sapply(strsplit(as.character(df$str), ""),
function(x) paste(unique(sort(x)), collapse = ""))
df
# ID str simplify_str
#1 1 A A
#2 1 A A
#3 1 AA A
#4 1 ABB AB
#5 2 BA AB
#6 2 BB B
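A data.table sketch of the same case_when logic (assuming a data.table version that provides fcase()):
library(data.table)
setDT(df)[, simplify_str := fcase(
  grepl("^A+$", str), "A",   # only A's
  grepl("^B+$", str), "B",   # only B's
  default = "AB"             # otherwise a mix of A and B
)]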
I have a dataset which is similar to the following:
ID = c(1,2,3,4,1,2,3)
Product = c("a", "b", "c", "a","b","a","a")
Quantity = c(1,1,1,1,1,1,1)
df = data.frame(ID, Product, Quantity)
# ID Product Quantity
#1 1 a 1
#2 2 b 1
#3 3 c 1
#4 4 a 1
#5 1 b 1
#6 2 a 1
#7 3 a 1
I want to select the people who purchased both product a and product b. For the above example, the desired result is:
ID Product Quantity
1 a 1
2 b 1
1 b 1
2 a 1
I cannot recall a function that does this for me. I could do it with a loop, but I am hoping to find a more succinct solution.
With ave:
df[
  with(df, ave(as.character(Product), ID, FUN = function(x) all(c("a","b") %in% x))) == "TRUE",
]
# ID Product Quantity
#1 1 a 1
#2 2 b 1
#5 1 b 1
#6 2 a 1
You could do the following with dplyr:
library(dplyr)
df %>%
  filter(Product %in% c('a','b')) %>% # Grab only desired products
  group_by(ID) %>%                    # For each ID...
  filter(n() > 1) %>%                 # Only grab IDs where the count is >1
  ungroup()                           # Remove grouping.
## # A tibble: 4 x 3
## ID Product Quantity
## <dbl> <fctr> <dbl>
## 1 1 a 1
## 2 2 b 1
## 3 1 b 1
## 4 2 a 1
Edit
Here is a slightly more concise dplyr version using all() (similar to how Psidom used it in the data.table solution):
df %>%
  group_by(ID) %>%
  filter(all(c('a','b') %in% as.character(Product))) %>%
  ungroup()
Another option using data.table:
library(data.table)
setDT(df)[, .SD[all(c("a", "b") %in% Product)], ID]
# ID Product Quantity
#1: 1 a 1
#2: 1 b 1
#3: 2 b 1
#4: 2 a 1
Here is an option using data.table:
library(data.table)
setDT(df, key = "Product")[c("a", "b")][, if (uniqueN(Product) == 2) .SD, ID]
# ID Product Quantity
#1: 1 a 1
#2: 1 b 1
#3: 2 a 1
#4: 2 b 1