Numerical difference between all rows within a group in R - r

I have a dataframe that looks a bit like
Indices<-data.frame("Animal"=c("Cat", "Cat", "Cat", "Dog", "Dog", "Dog", "Dog", "Bird",
"Bird"), "Trend"=c(1,3,5,-3,1,2,4,2,1), "Project"=c("ABC", "ABC2",
"EDF", "ABC", "EDF", "GHI", "ABC2", "ABC", "GHI"))
I want to find out whether two or more trend estimates differ by >= 3 within each animal group. I tried using mutate and lag:
Indices %>%
group_by(CommonName) %>%
mutate(Diff = Trend - lag(Trend))
But this only shows me the difference between the rows that are right after each other, and I am trying to see the difference between all of the rows within a group. It also gives me the differences but doesn't tell me if the value is >=3.
I would prefer to have the end result being a list of the animals and project names that have an absolute trend difference >=3.
Animal TrendDiff Projects
Cat 4 ABC-EDF
Dog 7 ABC-ABC2
Dog 3 ABC2-EDF
Dog 4 ABC-EDF
Dog 5 ABC-GHI
I have well over 200 different "animal" groups and over 400 rows so need it to be something that doesn't need to specify each row. I am still very new to r so please be specific with your answers. Thanks!

One approach would be to left_join your Indices data.frame with itself
library(dplyr)
Indices %>%
left_join(Indices, by = "Animal") %>%
filter(Project.x != Project.y) %>%
mutate(TrendDiff = Trend.x - Trend.y) %>%
filter(TrendDiff >= 3)
# A tibble: 5 x 6
# Groups: Animal [2]
# Animal Trend.x Project.x Trend.y Project.y TrendDiff
# cat 5 EDF 1 ABC 4
# Dog 1 EDF -3 ABC 4
# Dog 2 GHI -3 ABC 5
# Dog 4 ABC2 -3 ABC 7
# Dog 4 ABC2 1 EDF 3

Related

R: Merge rows that share same code and at least one or more strings in name-column

I would like to merge rows in a dataframe if they have at least one word in common and have the same value for 'code'. The column to be searched for matching words is "name". Here's an example dataset:
df <- data.frame(
id = 1:8,
name = c("tiger ltd", "tiger cpy", "tiger", "rhino", "hippo", "elephant", "elephant bros", "last comp"),
code = c(rep("4564AB", 3), rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)
The approach that I envision would look something like this:
use group_by on the code-column,
check if the group contains 2 or more rows,
check if there are any shared words among the different rows. If so, merge those rows and combine the information into a single row.
The final dataset would look like this:
final_df <- data.frame(
id = c("1|2|3", 4:8),
name = c(paste(c("tiger ltd", "tiger cpy", "tiger"), collapse = "|"), "rhino", "hippo", "elephant", "elephant bros", "last comp"),
code = c("4564AB", rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)
The first three rows have the common word 'tiger' and the same code. Therefore they are merged into a single row with the different values separated by "|". The other rows are not merged because they either do not have a word in common or do not have the same code.
We could have a condition with if/else after grouping. Extract the words from the 'name' column and check for any intersecting elements, create a flag where the length of intersecting elements are greater than 0 and the group size (n()) is greater than 1 and use this to paste/str_c elements of the other columns
library(dplyr)
library(stringr)
library(purrr)
library(magrittr)
df %>%
group_by(code = factor(code, levels = unique(code))) %>%
mutate(flag = n() > 1 &
(str_extract_all(name, "\\w+") %>%
reduce(intersect) %>%
length %>%
is_greater_than(0))) %>%
summarise(across(-flag, ~ if(any(flag))
str_c(.x, collapse = "|") else as.character(.x)), .groups = 'drop') %>%
select(names(df))
-output
# A tibble: 6 × 3
id name code
<chr> <chr> <fct>
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2 4 rhino 7845BC
3 5 hippo 7845BC
4 6 elephant 6144DE
5 7 elephant bros 7845KI
6 8 last comp 7845EG
-OP's expected
> final_df
id name code
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2 4 rhino 7845BC
3 5 hippo 7845BC
4 6 elephant 6144DE
5 7 elephant bros 7845KI
6 8 last comp 7845EG
You can use this helper function f(), and apply it to each group:
f <- function(d) {
if(length(Reduce(intersect,strsplit(d[["name"]]," ")))>0) {
d = lapply(d,paste0,collapse="|")
}
return(d)
}
library(data.table)
setDT(df)[,id:=as.character(id)][, f(.SD),code]
Output:
code id name
<char> <char> <char>
1: 4564AB 1|2|3 tiger ltd|tiger cpy|tiger
2: 7845BC 4 rhino
3: 7845BC 5 hippo
4: 6144DE 6 elephant
5: 7845KI 7 elephant bros
6: 7845EG 8 last comp

How to replace NA in a dataframe for a specific value using the results of another column and taking into account conditions of another column?

I have a dataframe composed of 9 columns with more than 4000 observations. For this question I will present a simpler dataframe (I use the tidyverse library)
Let's say I have the following dataframe:
library(tidyverse)
df <- tibble(Product = c("Bread","Oranges","Eggs","Bananas","Whole Bread" ),
Weight = c(NA, 1, NA, NA, NA),
Units = c(2,6,1,2,1),
Price = c(1,3.5,0.5,0.75,1.5))
df
I want to replace the NA values of the Weight column for a number multiplied by the results of Units depending on the word showed by the column Product. Basically, is a rule like:
Replace NA in Weight for 2.5*number of units if Product contains the word "Bread". Replace for 1 if Product contains the word "Eggs"
The thing is that I don't know how to code somehting like that in R. I tried the following code that a kind user gave me for a similar question:
df <- df %>%
mutate(Weight = case_when(Product == "bread" & is.na(Weight) ~ 0.25*Units))
But it doesn't work and it doesn't take into account the fact that if there is "Whole Bread" written in my dataframe it also has to apply the rule.
Does anyone have an idea?
Some of them are not exact matches, so use str_detect
library(dplyr)
library(stringr)
df %>%
mutate(Weight = case_when(is.na(Weight) &
str_detect(Product, regex("Bread", ignore_case = TRUE)) ~ 2.5 * Units,
is.na(Weight) & Product == "Eggs"~ Units, TRUE ~ Weight))
-output
# A tibble: 5 × 4
Product Weight Units Price
<chr> <dbl> <dbl> <dbl>
1 Bread 5 2 1
2 Oranges 1 6 3.5
3 Eggs 1 1 0.5
4 Bananas NA 2 0.75
5 Whole Bread 2.5 1 1.5

Group, summarize and transpose

I have a dataframe that looks like this:
ctgroup (dataframe)
Camera Trap Name Animal Name a_sum
1 CAM27 Chicken 1
2 CAM27 Dog 1
3 CAM27 Dog 4
4 CAM28 Cat 3
5 CAM28 Dog 22
6 CAM28 Dog 1
*a_sum = No. of animals recorded in a camera
So essentially I want to - Group by 2 fields(Camera Trap Name, Scientific Name) and then Count the number of record in the column "a_sum", and transpose the data so that Animal. Name becomes column and Camera Trap Name my rows. I want to display all the animal names in columns, with 0 if no data available
i.e.,
Camera trap name Dog Cat Wolf Chicken
CAM28 23 4 1 4
CAM27 5 0 0 4
I tried using the following code
dcast (ctgroup, Camera.Trap.name + Animal.name, value.var = "a_sum")
And I got the following error:
In dcast(ctgroup, Camera.Trap.name + Scientific.name, value.var = "a_sum") :
The dcast generic in data.table has been passed a grouped_df and will attempt to redirect to the reshape2::dcast; please note that reshape2 is deprecated, and this redirection is now deprecated as well. Please do this redirection yourself like reshape2::dcast(ctgroup). In the next version, this warning will become an error.
I don't think I know enough to construct the correct code for carrying out this work.
With data.table ...
# Load data.table.
require(data.table)
# Create data.set.
df <- data.frame(Camera = c("CAM27", "CAM27", "CAM27", "CAM28", "CAM28", "CAM28"),
Animal = c("Chicken", "Dog", "Dog", "Cat", "Dog", "Dog"),
a_sum = c(1, 1, 4, 3, 22, 1))
# Set the data.frame as a data.table.
setDT(df)
# Cast by `Camera` and `Animal` and sum `a_sum`.
dcast(df, Camera ~ Animal, value.var = "a_sum", fun.aggregate = sum)
# Camera Cat Chicken Dog
# 1: CAM27 0 1 5
# 2: CAM28 3 0 23
# If you want to coerce back to a data.frame.
setDF(df)
The dplyr approach:
library(dplyr)
library(tidyr)
ctgroup %>%
group_by(Camera, Animal) %>%
summarize(a_sum = sum(a_sum)) %>%
pivot_wider(id_cols = Camera, names_from = Animal, values_from = a_sum, values_fill = list(a_sum = 0))

Delete a row based on a two seperate matching requirements

If there is a post about this I apologize - I searched many times for an answer and couldn't find anything that works.
What I need to do is delete all rows in the following example that are equal to 66 only if there is a duplicate animal type with anything other then a 66.
animals <- c("dog", "dog", "dog", "cat", "cat", "cat", "mouse", "mouse", "rat", "rat")
number <- c(1,2,66,2,66,66,66,66,2,1)
df <- data.frame(animals,number)
Using that df I would want to delete row 3 because dog has other values of 1 and 2, I would want to delete both 66's for cat because there is a cat with other value of 2 but I wouldn't want to delete either mouse entries because they are both 66, and I wouldn't want to delete anything with rat because there are no 66 values.
I would end up something similar to this:
animals <- c("dog", "dog", "cat", "mouse", "mouse", "rat", "rat")
number <- c(1,2,2,66,66,2,1)
In the real data-set there are so many entries that you simply cant use a count and remove everything with an aggregate total of less then 66 (was my first instinct)
This was my second try but couldn't think through it for some reason.
df(!number == 66 | if(unique(animals) ==
maybe a which statement involved? Any help would be greatly appreciated!
One way using base R ave where we check if any animal has a number other than 66, if it has then we return the ones ignoring 66 or else return all rows.
df[with(df, ave(number != 66, animals, FUN = function(x) if (any(x)) x else !x)), ]
# animals number
#1 dog 1
#2 dog 2
#3 cat 2
#4 mouse 66
#5 mouse 66
#6 rat 2
#7 rat 1
The dplyr version would filter the groups which has all 66 in it or ignore the rows with 66 otherwise.
library(dplyr)
df %>%
group_by(animals) %>%
filter(all(number == 66) | number != 66)
# animals number
# <fct> <dbl>
#1 dog 1
#2 dog 2
#3 cat 2
#4 mouse 66
#5 mouse 66
#6 rat 2
#7 rat 1
Using dplyr
library(dplyr)
df %>% group_by(animals) %>%
mutate(Flag= case_when( number %in% c(1,2) ~ 1,
all(number == 66) ~ 1,
number == 66 ~ 0)) %>%
filter(Flag==1) %>% select(-Flag) %>% ungroup()
# A tibble: 7 x 2
animals number
<chr> <dbl>
1 dog 1.
2 dog 2.
3 cat 2.
4 mouse 66.
5 mouse 66.
6 rat 2.
7 rat 1.

Match words in a data frame to a string in R

I have a data frame from a recall task where participants recall as many words as they can from a list they learned earlier. Here's a mock up of the data. Each row is a subject and each column (w1-w5) is a word recalled:
df <- data.frame(subject = 1:5,
w1 = c("screen", "toad", "toad", "witch", "toad"),
w2 = c("package", "tuna", "tuna", "postage", "dinosaur"),
w3 = c("tuna", "postage", "toast", "athlete", "ranch"),
w4 = c("toad", "witch", "tuna", "package", "NA"),
w5 = c("windwo", "mermaid", "NA", "NA", "NA")
)
Which produces the following data frame:
subject w1 w2 w3 w4 w5
1 1 screen package tuna toad windwo
2 2 toad tuna postage witch mermaid
3 3 toad tuna toast tuna NA
4 4 witch postage athlete package NA
5 5 toad dinosaur ranch NA NA
I want to match each word produced (columns w1 - w5) to a list of the correct words, which are:
words <- c("screen", "package", "tuna", "toad", "window",
"postage", "witch", "mermaid", "toast", "dinosaur")
I only want to award points for words that are spelled correctly and are not repeated. So for example, for the data above I'd like to end up with a data frame that looks like this:
subject nCorrect
1 1 4
2 2 5
3 3 3
4 4 3
5 5 2
Subject 1 would get four points because they misspelled one word.
Subject 2 would get five points.
Subject 3 would get 3 points because they repeated tuna and are missing one word.
Subject 4 would get three points because they have one incorrect word and one missing word.
Subject 5 would get two points because they have one incorrect word and two missing words.
data.frame(subject = df$subject
, nCorrect = apply(df[, -1], 1, function(x) sum(unique(x) %in% words)))
# subject nCorrect
# 1 1 4
# 2 2 5
# 3 3 3
# 4 4 3
# 5 5 2
With data.table (same result)
setDT(df)
df[, sum(unique(unlist(.SD)) %in% words), by = subject]
Another option is to convert the data in long format. Group on subject to use dplyr::summarise to find correct number of matching answers.
library(tidyverse)
words <- c("screen", "package", "tuna", "toad", "window",
"postage", "witch", "mermaid", "toast", "dinosaur")
df %>% gather(key, value, -subject) %>%
group_by(subject) %>%
summarise(nCorrect = sum(unique(value) %in% words))
# # A tibble: 5 x 2
# subject nCorrect
# <int> <int>
# 1 1 4
# 2 2 5
# 3 3 3
# 4 4 3
# 5 5 2

Resources