Group, summarize and transpose - r

I have a dataframe that looks like this:
ctgroup (dataframe)
Camera Trap Name Animal Name a_sum
1 CAM27 Chicken 1
2 CAM27 Dog 1
3 CAM27 Dog 4
4 CAM28 Cat 3
5 CAM28 Dog 22
6 CAM28 Dog 1
*a_sum = No. of animals recorded in a camera
So essentially I want to group by two fields (Camera Trap Name, Animal Name), sum the values in the column "a_sum", and transpose the data so that Animal Name becomes the columns and Camera Trap Name the rows. I want to display all the animal names as columns, with 0 where no data is available,
i.e.,
Camera trap name Dog Cat Wolf Chicken
CAM28 23 4 1 4
CAM27 5 0 0 4
I tried using the following code
dcast (ctgroup, Camera.Trap.name + Animal.name, value.var = "a_sum")
And I got the following error:
In dcast(ctgroup, Camera.Trap.name + Scientific.name, value.var = "a_sum") :
The dcast generic in data.table has been passed a grouped_df and will attempt to redirect to the reshape2::dcast; please note that reshape2 is deprecated, and this redirection is now deprecated as well. Please do this redirection yourself like reshape2::dcast(ctgroup). In the next version, this warning will become an error.
I don't think I know enough to construct the correct code for carrying out this work.

With data.table ...
# Load data.table.
require(data.table)
# Create data.set.
df <- data.frame(Camera = c("CAM27", "CAM27", "CAM27", "CAM28", "CAM28", "CAM28"),
Animal = c("Chicken", "Dog", "Dog", "Cat", "Dog", "Dog"),
a_sum = c(1, 1, 4, 3, 22, 1))
# Set the data.frame as a data.table.
setDT(df)
# Cast by `Camera` and `Animal` and sum `a_sum`.
dcast(df, Camera ~ Animal, value.var = "a_sum", fun.aggregate = sum)
# Camera Cat Chicken Dog
# 1: CAM27 0 1 5
# 2: CAM28 3 0 23
# If you want to coerce back to a data.frame.
setDF(df)
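As for the warning in the question, it appears to have two causes: dcast() expects a formula of the form rows ~ columns (the question's call uses `+` and has no `~`), and ctgroup is a grouped_df, which data.table's dcast() does not handle. A minimal sketch, assuming ctgroup has the columns Camera.Trap.name, Animal.name and a_sum; the data below is rebuilt from the rows printed in the question and is only a stand-in for the OP's real object:

```r
library(data.table)

# Stand-in for the OP's ctgroup (assumed column names, data copied from the question).
ctgroup <- data.frame(
  Camera.Trap.name = c("CAM27", "CAM27", "CAM27", "CAM28", "CAM28", "CAM28"),
  Animal.name = c("Chicken", "Dog", "Dog", "Cat", "Dog", "Dog"),
  a_sum = c(1, 1, 4, 3, 22, 1)
)

# as.data.frame() should drop a dplyr grouping if there is one;
# as.data.table() then gives dcast() the class it expects.
dt <- as.data.table(as.data.frame(ctgroup))

# Rows go on the left of `~`, columns on the right; a_sum is summed per cell.
res <- dcast(dt, Camera.Trap.name ~ Animal.name,
             value.var = "a_sum", fun.aggregate = sum)
print(res)
```

With this data, the CAM27 row holds 5 dogs and the CAM28 row 23, matching the output shown above.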

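The dplyr approach (assuming the columns are named Camera, Animal and a_sum, as in the data.table example above):
library(dplyr)
library(tidyr)
ctgroup %>%
group_by(Camera, Animal) %>%
# Sum a_sum per camera and animal, then drop the grouping.
summarize(a_sum = sum(a_sum), .groups = "drop") %>%
# One column per animal; combinations with no data are filled with 0.
pivot_wider(id_cols = Camera, names_from = Animal, values_from = a_sum, values_fill = list(a_sum = 0))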
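Neither answer produces a column for an animal that never occurs in the data (the "Wolf" column in the question's desired output). One possible way to get it, sketched in base R: turn Animal into a factor whose levels list every species, then cross-tabulate with xtabs(). The vector all_animals is a hypothetical full species list, not something given in the question.

```r
df <- data.frame(
  Camera = c("CAM27", "CAM27", "CAM27", "CAM28", "CAM28", "CAM28"),
  Animal = c("Chicken", "Dog", "Dog", "Cat", "Dog", "Dog"),
  a_sum = c(1, 1, 4, 3, 22, 1),
  stringsAsFactors = FALSE
)

# Hypothetical full list of species, including one (Wolf) with no records.
all_animals <- c("Dog", "Cat", "Wolf", "Chicken")
df$Animal <- factor(df$Animal, levels = all_animals)

# xtabs() sums a_sum for every Camera x Animal cell; unused levels stay as 0.
tab <- xtabs(a_sum ~ Camera + Animal, data = df)

# Convert the 2-d table to a plain data.frame with cameras as row names.
wide <- as.data.frame.matrix(tab)
print(wide)
```

The column order follows the factor levels, so the layout can be controlled through all_animals.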

Related

R: Merge rows that share same code and at least one or more strings in name-column

I would like to merge rows in a dataframe if they have at least one word in common and have the same value for 'code'. The column to be searched for matching words is "name". Here's an example dataset:
df <- data.frame(
id = 1:8,
name = c("tiger ltd", "tiger cpy", "tiger", "rhino", "hippo", "elephant", "elephant bros", "last comp"),
code = c(rep("4564AB", 3), rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)
The approach that I envision would look something like this:
use group_by on the code-column,
check if the group contains 2 or more rows,
check if there are any shared words among the different rows. If so, merge those rows and combine the information into a single row.
The final dataset would look like this:
final_df <- data.frame(
id = c("1|2|3", 4:8),
name = c(paste(c("tiger ltd", "tiger cpy", "tiger"), collapse = "|"), "rhino", "hippo", "elephant", "elephant bros", "last comp"),
code = c("4564AB", rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)
The first three rows have the common word 'tiger' and the same code. Therefore they are merged into a single row with the different values separated by "|". The other rows are not merged because they either do not have a word in common or do not have the same code.
We could use an if/else condition after grouping. Extract the words from the 'name' column and check for intersecting elements; create a flag that is TRUE when the number of intersecting elements is greater than 0 and the group size (n()) is greater than 1, and use this flag to decide whether to paste (str_c) the elements of the other columns together.
library(dplyr)
library(stringr)
library(purrr)
library(magrittr)
df %>%
group_by(code = factor(code, levels = unique(code))) %>%
mutate(flag = n() > 1 &
(str_extract_all(name, "\\w+") %>%
reduce(intersect) %>%
length %>%
is_greater_than(0))) %>%
summarise(across(-flag, ~ if(any(flag))
str_c(.x, collapse = "|") else as.character(.x)), .groups = 'drop') %>%
select(names(df))
-output
# A tibble: 6 × 3
id name code
<chr> <chr> <fct>
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2 4 rhino 7845BC
3 5 hippo 7845BC
4 6 elephant 6144DE
5 7 elephant bros 7845KI
6 8 last comp 7845EG
-OP's expected
> final_df
id name code
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2 4 rhino 7845BC
3 5 hippo 7845BC
4 6 elephant 6144DE
5 7 elephant bros 7845KI
6 8 last comp 7845EG
You can use this helper function f(), and apply it to each group:
f <- function(d) {
if(length(Reduce(intersect,strsplit(d[["name"]]," ")))>0) {
d = lapply(d,paste0,collapse="|")
}
return(d)
}
library(data.table)
setDT(df)[,id:=as.character(id)][, f(.SD),code]
Output:
code id name
<char> <char> <char>
1: 4564AB 1|2|3 tiger ltd|tiger cpy|tiger
2: 7845BC 4 rhino
3: 7845BC 5 hippo
4: 6144DE 6 elephant
5: 7845KI 7 elephant bros
6: 7845EG 8 last comp
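Both answers rest on intersecting the word sets of all rows in a group. A small base R sketch of that step on its own may help to see what it does and does not detect; common_words() is a hypothetical helper name, and the third call uses made-up names that are not in the question's data.

```r
# Split each name into words, then keep only the words present in every name.
common_words <- function(x) Reduce(intersect, strsplit(x, " "))

# All three names share "tiger".
tigers <- common_words(c("tiger ltd", "tiger cpy", "tiger"))

# No shared word at all.
mixed <- common_words(c("rhino", "hippo"))

# "tiger" is shared by two names and "ltd" by two names,
# but no single word occurs in all three.
partial <- common_words(c("tiger ltd", "tiger cpy", "lion ltd"))

print(tigers)
print(length(mixed))
print(length(partial))
```

So a group is merged only when at least one word occurs in every row of the group. If rows should also be merged when they overlap merely pairwise, a different check would be needed.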

dplyr::full_join two data frames with part-match in the "by" argument in R

I would like to join two data sets that look like the following data sets. The matching rule would be that the Item variable from mykey matches the first part of the Item entry in mydata to some degree.
mydata <- tibble(Item = c("ab_kssv", "ab_kd", "cde_kh", "cde_ksa", "cde"),
Answer = c(1,2,3,4,5),
Avg = rep(-100, length(Item)))
mykey <- tibble(Item = c("ab", "cde"),
Avg = c(0 ,10))
The result should be the following:
Item Answer Avg
1 ab_kssv 1 0
2 ab_kd 2 0
3 cde_kh 3 10
4 cde_ksa 4 10
5 cde 5 10
I looked at these three SO questions, but did not find a nice solution there. I also briefly tried the fuzzyjoin package, but that did not work. Finally, I have a for-loop-based solution:
for (currLine in 1:nrow(mydata)) {
mydata$Avg[currLine] <- mykey$Avg[str_starts(mydata$Item[currLine], mykey$Item)]
}
It does the job, but it is not nice to read or understand, and I wonder whether the "by" argument of full_join() from the dplyr package can be made a bit more tolerant in its matching. Any help will be appreciated!
Using a fuzzyjoin::regex_left_join you could do:
Note: I renamed the Item column in your mykey dataset to regex to make clear that this is the regex to match by and added a "^" to ensure that we match at the beginning of the Item column in the mydata dataset.
library(fuzzyjoin)
library(dplyr)
mykey <- mykey %>%
rename(regex = Item) %>%
mutate(regex = paste0("^", regex))
mydata %>%
select(-Avg) %>%
regex_left_join(mykey, by = c(Item = "regex")) %>%
select(-regex)
#> # A tibble: 5 × 3
#> Item Answer Avg
#> <chr> <dbl> <dbl>
#> 1 ab_kssv 1 0
#> 2 ab_kd 2 0
#> 3 cde_kh 3 10
#> 4 cde_ksa 4 10
#> 5 cde 5 10
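If an extra package is not wanted, the prefix rule can also be sketched in base R with startsWith(). This is only an illustration of the matching rule from the question, assuming each Item matches at most one key (the first matching key is used); it is not a general fuzzy join.

```r
mydata <- data.frame(
  Item = c("ab_kssv", "ab_kd", "cde_kh", "cde_ksa", "cde"),
  Answer = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)
mykey <- data.frame(Item = c("ab", "cde"), Avg = c(0, 10),
                    stringsAsFactors = FALSE)

# For each data item, find the index of the first key that is a prefix of it
# (NA if no key matches).
idx <- vapply(
  mydata$Item,
  function(x) which(startsWith(x, mykey$Item))[1],
  integer(1),
  USE.NAMES = FALSE
)

# Look up the average belonging to the matched key.
mydata$Avg <- mykey$Avg[idx]
print(mydata)
```

Items without a matching key would end up with NA in Avg, which mirrors what a left join does.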

Pivoting a table with multiple binary measures of the same variable in R

I have a data.frame with hundreds of columns, each of which is a binary ("YES"/"NO") indicator for one possible result of a measurement. These can be collapsed to a short list of variables with multiple options (no collapsed variable has more than one YES). I'm trying to figure out the collapsing, but the only solution I could come up with is very inelegant and time-consuming on my large dataset.
Here's a toy example to explain what I mean:
> groceries<-tribble(~item, ~prod_potato, ~prod_apple, ~prod_banana, ~day_monday, ~day_tuesday, ~day_wednesday,
1, "N","N","Y","N","N","Y",
2, "Y","N","N","N","N","Y",
3, "Y","N","N","N","Y","N",
4,"N","Y","N","Y","N","N")
# A tibble: 4 x 7
item prod_potato prod_apple prod_banana day_monday day_tuesday day_wednesday
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 N N Y N N Y
2 2 Y N N N N Y
3 3 Y N N N Y N
4 4 N Y N Y N N
Each item can only be ONE of potato, banana or apple, and can only be bought on a specific day, so these multiple columns are really not helpful.
My desired result would be something like:
item prod day
1 banana wednesday
2 potato wednesday
3 potato tuesday
4 apple monday
Here's the solution I came up with, which does the job, but isn't very good:
pivot_longer(groceries,2:4,names_to=c("prod"),names_prefix = "prod_") %>%
filter(value=="Y") %>% select(-value) %>%
pivot_longer(2:4,names_to="day",names_prefix="day_") %>%
filter(value=="Y") %>% select(-value)
# A tibble: 4 x 3
item prod day
<dbl> <chr> <chr>
1 1 banana wednesday
2 2 potato wednesday
3 3 potato tuesday
4 4 apple monday
But I'm 100% sure there's a less cumbersome solution that doesn't require me to repeat this process over some 20-odd collapsed variables.
My absolute ideal solution would be able to group the columns based on the string before the _ and use that as the column name, and put the string after the _ as the value when the value of the original variable is "YES". But I'm willing to work with a slightly more manual solution where I identify the columns to group and the variable name every time.
Can anyone suggest a solution (ideally in tidyverse -- I'm sure data.table would have a super efficient solution, but I could never wrap my head around it)?
Thanks!
You can use pivot_longer to separate all of them into group and options, then just reduce to the Y values (I use summarize just to drop the column, but can easily use filter here) and pivot_wider.
library(dplyr)
library(tidyr)
# library(tidyverse)
groceries %>%
pivot_longer(-item,
names_to = c("group", "options"),
names_sep = "_") %>%
group_by(item, group) %>%
summarize(options = options[value == "Y"],
.groups = "drop") %>%
pivot_wider(names_from = "group",
values_from = "options")
#> # A tibble: 4 × 3
#> item day prod
#> <dbl> <chr> <chr>
#> 1 1 wednesday banana
#> 2 2 wednesday potato
#> 3 3 tuesday potato
#> 4 4 monday apple
library(data.table)
setDT(groceries)
dt <- melt(groceries, id.vars = c("item"))[value == "Y"]
dt[, c("A", "B") := tstrsplit(variable, "_")]
dcast(dt, item ~ A, value.var = c("B"))
# item day prod
# 1: 1 wednesday banana
# 2: 2 wednesday potato
# 3: 3 tuesday potato
# 4: 4 monday apple
Or as one-liner:
dcast(melt(groceries, id.vars = c("item"))[value == "Y"][, c("A", "B") := tstrsplit(variable, "_")], item ~ A, value.var = c("B"))
A base R option
dfout <- reshape(
transform(
subset(
cbind(groceries[1], stack(groceries[-1])),
values == "Y"
),
p = gsub("_.*", "", ind),
q = gsub(".*_", "", ind)
)[c("item", "p", "q")],
direction = "wide",
idvar = "item",
timevar = "p"
)
dfout[order(dfout$item), ]
gives
item q.prod q.day
9 1 banana wednesday
2 2 potato wednesday
3 3 potato tuesday
8 4 apple monday
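All three answers split the column names at the underscore. With hundreds of real columns, some option names may themselves contain an underscore, in which case splitting at every "_" yields more than two pieces. A minimal base R sketch of splitting at the first underscore only; the column name prod_sweet_potato is hypothetical and not part of the question's data.

```r
cols <- c("prod_potato", "day_monday", "prod_sweet_potato")

# Everything before the first underscore is the group ...
group <- sub("_.*$", "", cols)

# ... and everything after the first underscore is the option.
option <- sub("^[^_]*_", "", cols)

print(data.frame(cols, group, option))
```

In the tidyr answer, the corresponding change would presumably be to replace names_sep = "_" with a names_pattern regular expression such as "^([^_]+)_(.*)$"; treat that pattern as an untested suggestion.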

Getting a count of specific values in a data frame that appear in another

This question may sound similar to others, but I hope it is different enough.
I want to take a specific list of values and count how often they appear in another list of values, where non-occurring values are returned as '0'.
I have a Data Frame (df1) with the following values:
Items <- c('Carrots','Plums','Pineapple','Turkey')
df1<-data.frame(Items)
>df1
Items
1 Carrots
2 Plums
3 Pineapple
4 Turkey
And a second Data Frame (df2) that contains a column called 'Thing':
> head(df2,n=10)
ID Date Thing
1 58150 2012-09-12 Potatoes
2 12357 2012-09-28 Turnips
3 50788 2012-10-04 Oranges
4 66038 2012-10-11 Potatoes
5 18119 2012-10-11 Oranges
6 48349 2012-10-14 Carrots
7 23328 2012-10-16 Peppers
8 66038 2012-10-26 Pineapple
9 32717 2012-10-28 Turnips
10 11345 2012-11-08 Oranges
I know the word 'Turkey' only appears in df1 NOT in df2. I want to return a frequency table or count of the items in df1 that appears in df2 and return '0' for the count of Turkey.
How can I summarize the values of one data frame column using the values from another? The closest I got was:
df2%>% count (Thing) %>% filter(Thing %in% df1$Items,)
But this returns only the items that appear in both df1 and df2, so 'Turkey' gets excluded. So close!
> df2%>% count (Thing) %>% filter(Thing %in% df1$Items,)
# A tibble: 3 x 2
Thing n
<fctr> <int>
1 Carrots 30
2 Pineapple 30
3 Plums 38
I want my output to look like this:
1 Carrots 30
2 Pineapple 30
3 Plums 38
4 Turkey 0
I am newish to R and completely new to dplyr.
I use this sort of thing all the time. I'm sure there's a more savvy way to code it, but it's what I got:
item <- vector()
count <- vector()
items <- list(unique(df1$Items))
for (i in 1:length(items)){
item[i] <- items[i]
count[i] <- sum(df2$Thing == item)
}
df3 <- data.frame(cbind(item, count))
Hope this helps!
Stephen's solution worked with a slight modification, adding the [i] to the item at the end of count[i] line. See below:
item <- vector()
count <- vector()
for (i in 1:length(unique(Items))){
item[i] <- Items[i]
count[i]<- sum(df2$Thing == item[i])
}
df3 <- data.frame(cbind(item, count))
> df3
item count
1 Carrots 30
2 Plums 38
3 Pineapple 30
4 Turkey 0
dplyr drops 0 count rows, and you have the added complication that the possible categories of Thing are different between your two datasets.
If you add the factor levels from df1 to df2, you can use complete from tidyr, which is a common way to add 0 count rows.
I'm adding the factor levels from df1 to df2 using a convenience function from package forcats called fct_expand.
library(dplyr)
library(tidyr)
library(forcats)
df2 %>%
mutate(Thing = fct_expand(Thing, as.character(df1$Items))) %>%
count(Thing) %>%
complete(Thing, fill = list(n = 0)) %>%
filter(Thing %in% df1$Items)
A different approach is to aggregate df2 first, to right join with df1 (to pick all rows of df1), and to replace NA by zero.
library(dplyr)
df2 %>%
count(Thing) %>%
right_join(unique(df1), by = c("Thing" = "Items")) %>%
mutate(n = coalesce(n, 0L))
# A tibble: 4 x 2
Thing n
<chr> <int>
1 Carrots 1
2 Plums 0
3 Pineapple 1
4 Turkey 0
Warning message:
Column `Thing`/`Items` joining factors with different levels, coercing to character vector
The same approach in data.table:
library(data.table)
setDT(df2)[, .N, by = Thing][unique(setDT(df1)), on = .(Thing = Items)][is.na(N), N := 0L][]
Thing N
1: Carrots 1
2: Plums 0
3: Pineapple 1
4: Turkey 0
Note that in both implementations unique(df1) is used to avoid unintended duplicate rows after the join.
Edit 2019-06-22:
With development version 1.12.3 data.table has gained a coalesce() function, which is available as fcoalesce() in released versions (1.12.4 and later). So, the above statement can be written
setDT(df2)[, .N, by = Thing][unique(setDT(df1)), on = .(Thing = Items)][, N := fcoalesce(N, 0L)][]
If df2 is large and df1 contains only a few Items it might be more efficient to join first and then to aggregate:
library(dplyr)
df2 %>%
right_join(unique(df1), by = c("Thing" = "Items")) %>%
group_by(Thing) %>%
summarise(n = sum(!is.na(ID)))
# A tibble: 4 x 2
Thing n
<chr> <int>
1 Carrots 1
2 Pineapple 1
3 Plums 0
4 Turkey 0
Warning message:
Column `Thing`/`Items` joining factors with different levels, coercing to character vector
The same in data.table syntax:
library(data.table)
setDT(df2)[unique(setDT(df1)), on = .(Thing = Items)][, .(N = sum(!is.na(ID))), by = Thing][]
Thing N
1: Carrots 1
2: Plums 0
3: Pineapple 1
4: Turkey 0
Edit 2019-06-22: Above can be written more concisely by aggregating in a join:
setDT(df2)[setDT(df1), on = .(Thing = Items), .N, by = .EACHI]
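For completeness, the same zero-filled count can be sketched in base R without any join: convert Thing to a factor whose levels are exactly the items of interest, then tabulate. The df2 below contains only the ten rows printed in the question, so the counts are those of the sample, not of the OP's full data.

```r
df1 <- data.frame(Items = c("Carrots", "Plums", "Pineapple", "Turkey"),
                  stringsAsFactors = FALSE)

# Only the Thing column of the ten rows shown by head(df2, n = 10).
df2 <- data.frame(
  Thing = c("Potatoes", "Turnips", "Oranges", "Potatoes", "Oranges",
            "Carrots", "Peppers", "Pineapple", "Turnips", "Oranges"),
  stringsAsFactors = FALSE
)

# Values not listed in `levels` become NA and are ignored by table();
# levels that never occur are kept with a count of 0.
counts <- table(factor(df2$Thing, levels = df1$Items))

result <- data.frame(Thing = names(counts), n = as.integer(counts))
print(result)
```

The rows come out in the order of df1$Items, which is often what is wanted for a report.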

Replace a subset of a data frame with dplyr join operations

Suppose that I applied a treatment to some column values of a data frame like this:
id animal weight height ...
1 dog 23.0
2 cat NA
3 duck 1.2
4 fairy 0.2
5 snake BAD
df <- data.frame(id = seq(1:5),
animal = c("dog", "cat", "duck", "fairy", "snake"),
weight = c("23", NA, "1.2", "0.2", "BAD"))
Suppose that the treatment has to be carried out in a separate table and produces the following data frame, which is a subset of the original:
id animal weight
2 cat 2.2
5 snake 1.3
sub_df <- data.frame(id = c(2, 5),
animal = c("cat", "snake"),
weight = c("2.2", "1.3"))
Now I want to put all together again, so I use an operation like this:
> df %>%
anti_join(sub_df, by = c("id", "animal")) %>%
bind_rows(sub_df)
id animal weight
4 fairy 0.2
1 dog 23.0
3 duck 1.2
2 cat 2.2
5 snake 1.3
Is there a way to do this directly with join operations?
If the subset contains only the key columns and the treated variable (id, animal, weight) rather than all the variables of the original data frame (id, animal, weight, height), how could I reassemble the subset with the original set?
What you describe is a join operation in which you update some values in the original dataset. This is very easy to do with great performance using data.table because of its fast joins and update-by-reference concept (:=).
Here's an example for your toy data:
library(data.table)
setDT(df) # convert to data.table without copy
setDT(sub_df) # convert to data.table without copy
# join and update "df" by reference, i.e. without copy
df[sub_df, on = c("id", "animal"), weight := i.weight]
The data is now updated:
# id animal weight
#1: 1 dog 23.0
#2: 2 cat 2.2
#3: 3 duck 1.2
#4: 4 fairy 0.2
#5: 5 snake 1.3
You can use setDF to switch back to ordinary data.frame.
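The same update-by-key idea can be sketched in base R with match(), assuming id uniquely identifies a row in df and every id in sub_df exists in df. Unlike the data.table version, this modifies a copy rather than updating by reference.

```r
df <- data.frame(
  id = 1:5,
  animal = c("dog", "cat", "duck", "fairy", "snake"),
  weight = c("23", NA, "1.2", "0.2", "BAD"),
  stringsAsFactors = FALSE
)
sub_df <- data.frame(
  id = c(2, 5),
  animal = c("cat", "snake"),
  weight = c("2.2", "1.3"),
  stringsAsFactors = FALSE
)

# Row positions in df that correspond to the ids in sub_df.
idx <- match(sub_df$id, df$id)

# Guard against ids in sub_df that are missing from df.
stopifnot(!anyNA(idx))

# Overwrite only the treated column; other columns and the row order are untouched.
df$weight[idx] <- sub_df$weight
print(df)
```

Because only the weight column is overwritten, extra columns in df (such as a height column) are kept as they are.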
Remove the na's first, then simply stack the tibbles:
bind_rows(filter(df,!is.na(weight)),sub_df)
Isn't dplyr::rows_update exactly what we need here? The following code should work:
df %>% dplyr::rows_update(sub_df, by = "id")
This should work as long as there is a unique identifier (one or multiple variables) for your datasets.
For anyone looking for a solution to use in a tidyverse pipeline:
I run into this problem a lot, and have written a short function that uses mostly tidyverse verbs to get around this. It will account for the case when there are additional columns in the original df.
For example, if the OP's df had an additional 'height' column:
library(dplyr)
df <- tibble(id = seq(1:5),
animal = c("dog", "cat", "duck", "fairy", "snake"),
weight = c("23", NA, "1.2", "0.2", "BAD"),
height = c("54", "45", "21", "50", "42"))
And the subset of data we wanted to join in was the same:
sub_df <- tibble(id = c(2, 5),
animal = c("cat", "snake"),
weight = c("2.2", "1.3"))
If we used the OP's method alone (anti_join %>% bind_rows), this won't work because of the additional 'height' column in df. An extra step or two is needed.
In this case we could use the following function:
replace_subset <- function(df, df_subset, id_col_names = c()) {
# work out which of the columns contain "new" data
new_data_col_names <- colnames(df_subset)[which(!colnames(df_subset) %in% id_col_names)]
# complete the df_subset with the extra columns from df
df_sub_to_join <- df_subset %>%
left_join(select(df, -all_of(new_data_col_names)), by = id_col_names)
# join and bind rows
df_out <- df %>%
anti_join(df_sub_to_join, by = c(id_col_names)) %>%
bind_rows(df_sub_to_join)
return(df_out)
}
Now for the results:
replace_subset(df = df , df_subset = sub_df, id_col_names = c("id"))
## A tibble: 5 x 4
# id animal weight height
# <dbl> <chr> <chr> <chr>
#1 1 dog 23 54
#2 3 duck 1.2 21
#3 4 fairy 0.2 50
#4 2 cat 2.2 45
#5 5 snake 1.3 42
And here's an example using the function in a pipeline:
df %>%
replace_subset(df_subset = sub_df, id_col_names = c("id")) %>%
mutate_at(.vars = vars(c('weight', 'height')), .funs = ~as.numeric(.)) %>%
mutate(bmi = weight / (height^2))
## A tibble: 5 x 5
# id animal weight height bmi
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 1 dog 23 54 0.00789
#2 3 duck 1.2 21 0.00272
#3 4 fairy 0.2 50 0.00008
#4 2 cat 2.2 45 0.00109
#5 5 snake 1.3 42 0.000737
hope this is helpful :)
