Delete a row based on two separate matching requirements - r

If there is a post about this I apologize - I searched many times for an answer and couldn't find anything that works.
What I need to do is delete all rows in the following example that are equal to 66, but only if the same animal type also has a row with a value other than 66.
animals <- c("dog", "dog", "dog", "cat", "cat", "cat", "mouse", "mouse", "rat", "rat")
number <- c(1,2,66,2,66,66,66,66,2,1)
df <- data.frame(animals,number)
Using that df, I would want to delete row 3 because dog has other values (1 and 2). I would want to delete both 66s for cat because there is a cat with another value (2). I would not want to delete either mouse entry because both are 66, and I would not want to delete anything for rat because it has no 66 values.
I would end up with something similar to this:
animals <- c("dog", "dog", "cat", "mouse", "mouse", "rat", "rat")
number <- c(1,2,2,66,66,2,1)
In the real data set there are so many entries that you simply can't use a count and remove everything with an aggregate total of less than 66 (that was my first instinct).
This was my second try, but I couldn't think it through for some reason.
df(!number == 66 | if(unique(animals) ==
Maybe a which statement is involved? Any help would be greatly appreciated!

One way is to use base R's ave: for each animal we check whether it has any number other than 66. If it does, we keep only its rows that are not 66; otherwise we keep all of its rows.
df[with(df, ave(number != 66, animals, FUN = function(x) if (any(x)) x else !x)), ]
#    animals number
# 1      dog      1
# 2      dog      2
# 4      cat      2
# 7    mouse     66
# 8    mouse     66
# 9      rat      2
# 10     rat      1
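The logic packed into the ave call can also be unpacked into separate steps. The following is a minimal base R sketch of the same idea, not the answerer's own code; the intermediate names not66, has_other and keep are made up for illustration.

```r
animals <- c("dog", "dog", "dog", "cat", "cat", "cat", "mouse", "mouse", "rat", "rat")
number <- c(1, 2, 66, 2, 66, 66, 66, 66, 2, 1)
df <- data.frame(animals, number)

# TRUE for rows whose number is not 66
not66 <- df$number != 66
# TRUE for every row of an animal that has at least one non-66 value
has_other <- as.logical(ave(not66, df$animals, FUN = any))
# keep the non-66 rows, plus all rows of animals that only have 66
keep <- not66 | !has_other
result <- df[keep, ]
```

Splitting the condition this way makes it easy to inspect each logical vector on its own before subsetting.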
The dplyr version keeps every row of a group in which all values are 66, and otherwise keeps only the rows that are not 66.
library(dplyr)
df %>%
  group_by(animals) %>%
  filter(all(number == 66) | number != 66)
# animals number
# <fct> <dbl>
#1 dog 1
#2 dog 2
#3 cat 2
#4 mouse 66
#5 mouse 66
#6 rat 2
#7 rat 1

Using dplyr
library(dplyr)
df %>%
  group_by(animals) %>%
  # flag rows to keep: any non-66 row, or every row of an all-66 group
  mutate(Flag = case_when(number != 66      ~ 1,
                          all(number == 66) ~ 1,
                          number == 66      ~ 0)) %>%
  filter(Flag == 1) %>%
  select(-Flag) %>%
  ungroup()
# A tibble: 7 x 2
animals number
<chr> <dbl>
1 dog 1.
2 dog 2.
3 cat 2.
4 mouse 66.
5 mouse 66.
6 rat 2.
7 rat 1.

Related

What is the most efficient way of extracting some numbers from a data point in R? (Plus other specific steps!)

I've got quite a specific problem, for which I can just about find a very hacky solution, but I'm hoping somebody could outline a slightly more elegant method.
I have a CSV file, consisting of one row per historical football match played. The fields I care about look something like this:
home_team <- c("Team A", "Team B", "Team B")
away_team <- c("Team C", "Team C", "Team D")
home_goals <- c(2, 0, 1)
away_goals <- c(1, 2, 0)
home_goal_mins <- c("5 60", "NA", "80")
away_goal_mins <- c("15", "20 40", "NA")
df <- data.frame(home_team, away_team, home_goals, away_goals, home_goal_mins, away_goal_mins,
stringsAsFactors = FALSE)
df
#> home_team away_team home_goals away_goals home_goal_mins away_goal_mins
#> 1 Team A Team C 2 1 5 60 15
#> 2 Team B Team C 0 2 NA 20 40
#> 3 Team B Team D 1 0 80 NA
Created on 2020-10-05 by the reprex package (v0.3.0)
My goal is to transform this dataframe such that there is one line per goal scored, per game, like this:
The main challenges, as I see them:
1. The *_goal_mins fields are read in as strings containing both numbers and NAs.
2. Replicating the rows so that each Home/Away team combination has as many rows as the total number of goals for that match.
With regard to (1), I've been using stringr::str_split(., " ") to extract the numbers but then struggle to transform them into a numeric vector. Taking the first row of df as an example, I'm struggling to transform "5 60" into c(5, 60), and it gets harder when I try to combine the home team's "5 60" with the away team's "15" to get the full goal sequence of c(5, 15, 60).
As for (2), my current approach is to calculate the total_goals_scored per match, and do the following:
expanded_df <- df[rep(seq_len(dim(df)[1]),
df$total_goals_scored), ]
but I sense that there may be a better method.
Any help or tips will be appreciated!
Thanks
Using the dplyr and tidyr libraries you could:
1. Bring home_goal_mins and away_goal_mins into the same column using pivot_longer.
2. Split the data on whitespace and put each goal in its own row.
3. Drop the NA values.
4. Arrange the data by goal minute (value).
5. Get the data back into wide format.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = c(home_goal_mins, away_goal_mins)) %>%
  separate_rows(value, sep = ' ', convert = TRUE) %>%
  filter(!is.na(value)) %>%
  arrange(home_team, away_team, value) %>%
  group_by(home_team, away_team) %>%
  mutate(row = row_number()) %>%
  pivot_wider()
# home_team away_team home_goals away_goals row home_goal_mins away_goal_mins
# <chr> <chr> <dbl> <dbl> <int> <int> <int>
#1 Team A Team C 2 1 1 5 NA
#2 Team A Team C 2 1 2 NA 15
#3 Team A Team C 2 1 3 60 NA
#4 Team B Team C 0 2 1 NA 20
#5 Team B Team C 0 2 2 NA 40
#6 Team B Team D 1 0 1 80 NA
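For the string-to-number step on its own (turning "5 60" into c(5, 60) and merging it with the away team's minutes), a small base R sketch may help. It is independent of the tidyr pipeline; parse_mins is a hypothetical helper name, and treating the literal string "NA" as "no goals" is an assumption based on the example data.

```r
# hypothetical helper: turn a space-separated string of minutes into numbers
parse_mins <- function(x) {
  # treat both a real NA and the literal string "NA" as "no goals"
  if (is.na(x) || x == "NA") return(numeric(0))
  as.numeric(strsplit(x, " ", fixed = TRUE)[[1]])
}

home <- parse_mins("5 60")  # home goal minutes of the first match
away <- parse_mins("15")    # away goal minutes of the first match
# full goal sequence for the match, in chronological order
all_goals <- sort(c(home, away))
```

Returning a zero-length vector for matches without goals means c() and sort() need no special casing.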

Count how many times a Nurse washes their hand before patient contact: Is X before Y, group_by(ID)?

I have a set of observations of nurses conducting patient care, recording what they touch or do. This might look like:
df<-data.frame(ActivityID=rep(1:3, each=3),
Action=c("Door", "Hygiene", "Patient", "Door", "Patient", "Door", "Door", "Patient", "Hygiene"))
I'd like to check whether they wash their hands before the first time they touch the patient for each ActivityID, and count for how many ActivityIDs this occurs. Essentially I'd like to know if X happens before Y for each activity.
My thought was to use which to find the first occurrence for both Patient and Hygiene:
require(dplyr)
a=df%>%
group_by(ActivityID) %>%
which(Action=="Hygiene")
b=df%>%
group_by(ActivityID) %>%
which(Action=="Patient")
which(a<b)
But this doesn't seem to work in pipe form and sometimes, they don't touch the patient. Any help would be much appreciated.
The total number of unique activities can be calculated using:
library(dplyr)
total_Activities <- n_distinct(df$ActivityID)
total_Activities
#[1] 3
We can write a function to check whether hands were washed at any time before the patient was touched for the first time:
hands_washed_before_touch <- function(x) {
  ind1 <- which(x == 'Hygiene')  # positions of hand washing
  ind2 <- which(x == 'Patient')  # positions of patient contact
  # TRUE only if both occur and the first wash precedes the first contact
  length(ind1) && length(ind2) && ind1[1] < ind2[1]
}
and use it by group:
df1 <- df %>%
  group_by(ActivityID) %>%
  summarise(hands_washed = hands_washed_before_touch(Action))
df1
# ActivityID hands_washed
# <int> <lgl>
#1 1 TRUE
#2 2 FALSE
#3 3 FALSE
To get the count, we can sum the hands_washed column, i.e. sum(df1$hands_washed).
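The same check and count can be sketched without dplyr. This is a base R variant under the assumption that Action is a character column; the names washed and n_washed are made up for illustration. match() returns the position of the first occurrence, or NA when the action never happens, which covers activities where the patient is never touched.

```r
df <- data.frame(ActivityID = rep(1:3, each = 3),
                 Action = c("Door", "Hygiene", "Patient", "Door", "Patient", "Door",
                            "Door", "Patient", "Hygiene"),
                 stringsAsFactors = FALSE)

washed <- sapply(split(df$Action, df$ActivityID), function(x) {
  h <- match("Hygiene", x)  # position of the first hand wash, NA if none
  p <- match("Patient", x)  # position of the first patient contact, NA if none
  !is.na(h) && !is.na(p) && h < p
})
# number of activities in which hands were washed before the first contact
n_washed <- sum(washed)
```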
Here is another alternative using case_when from the dplyr package.
library(dplyr)
df1 <- df %>%
  group_by(ActivityID) %>%
  mutate(hands_washed = case_when(
    !any(Action == "Hygiene") ~ "False",
    min(c(which(Action == "Hygiene"), Inf)) > which.max(Action == "Patient") ~ "False",
    TRUE ~ "True")) %>%
  ungroup()
df1
# A tibble: 9 x 3
# ActivityID Action hands_washed
# <int> <fct> <chr>
#1 1 Door True
#2 1 Hygiene True
#3 1 Patient True
#4 2 Door False
#5 2 Patient False
#6 2 Door False
#7 3 Door False
#8 3 Patient False
#9 3 Hygiene False

Numerical difference between all rows within a group in R

I have a dataframe that looks a bit like
Indices<-data.frame("Animal"=c("Cat", "Cat", "Cat", "Dog", "Dog", "Dog", "Dog", "Bird",
"Bird"), "Trend"=c(1,3,5,-3,1,2,4,2,1), "Project"=c("ABC", "ABC2",
"EDF", "ABC", "EDF", "GHI", "ABC2", "ABC", "GHI"))
I want to find out whether two or more trend estimates differ by >= 3 within each animal group. I tried using mutate and lag:
Indices %>%
  group_by(Animal) %>%
  mutate(Diff = Trend - lag(Trend))
But this only shows me the difference between the rows that are right after each other, and I am trying to see the difference between all of the rows within a group. It also gives me the differences but doesn't tell me if the value is >=3.
I would prefer the end result to be a list of the animals and project names that have an absolute trend difference >=3.
Animal  TrendDiff  Projects
Cat     4          ABC-EDF
Dog     7          ABC-ABC2
Dog     3          ABC2-EDF
Dog     4          ABC-EDF
Dog     5          ABC-GHI
I have well over 200 different "animal" groups and over 400 rows, so I need something that doesn't require specifying each row. I am still very new to R, so please be specific with your answers. Thanks!
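One approach would be to left_join your Indices data.frame with itself on Animal, which pairs every row with every other row of the same animal, and then keep the pairs whose difference is at least 3:
library(dplyr)
Indices %>%
  left_join(Indices, by = "Animal") %>%
  filter(Project.x != Project.y) %>%
  mutate(TrendDiff = Trend.x - Trend.y) %>%
  filter(TrendDiff >= 3)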
# A tibble: 5 x 6
# Groups: Animal [2]
# Animal Trend.x Project.x Trend.y Project.y TrendDiff
# Cat 5 EDF 1 ABC 4
# Dog 1 EDF -3 ABC 4
# Dog 2 GHI -3 ABC 5
# Dog 4 ABC2 -3 ABC 7
# Dog 4 ABC2 1 EDF 3
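An alternative that enumerates each unordered pair exactly once is to use combn within each animal group. The following is a base R sketch, not part of the answer above; pairwise_diffs, all_pairs and big are hypothetical names, and the order of the two project names in the label follows row order rather than the order shown in the question.

```r
Indices <- data.frame(
  Animal  = c("Cat", "Cat", "Cat", "Dog", "Dog", "Dog", "Dog", "Bird", "Bird"),
  Trend   = c(1, 3, 5, -3, 1, 2, 4, 2, 1),
  Project = c("ABC", "ABC2", "EDF", "ABC", "EDF", "GHI", "ABC2", "ABC", "GHI"),
  stringsAsFactors = FALSE)

# hypothetical helper: all pairwise absolute differences within one animal
pairwise_diffs <- function(d) {
  if (nrow(d) < 2) return(NULL)  # nothing to compare
  idx <- combn(nrow(d), 2)       # each column is one pair of row positions
  data.frame(Animal    = d$Animal[1],
             TrendDiff = abs(d$Trend[idx[1, ]] - d$Trend[idx[2, ]]),
             Projects  = paste(d$Project[idx[1, ]], d$Project[idx[2, ]], sep = "-"),
             stringsAsFactors = FALSE)
}

all_pairs <- do.call(rbind, lapply(split(Indices, Indices$Animal), pairwise_diffs))
# keep only the pairs whose absolute difference is at least 3
big <- all_pairs[all_pairs$TrendDiff >= 3, ]
```

Because split and lapply handle the grouping, no animal or row has to be named by hand.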

Replace a subset of a data frame with dplyr join operations

Suppose that I applied a treatment to some column values of a data frame like this:
id  animal  weight  height  ...
1   dog     23.0
2   cat     NA
3   duck    1.2
4   fairy   0.2
5   snake   BAD
df <- data.frame(id = seq(1:5),
                 animal = c("dog", "cat", "duck", "fairy", "snake"),
                 weight = c("23", NA, "1.2", "0.2", "BAD"))
Suppose that the treatment requires working in a separate table and gives, as its result, the following data frame, which is a subset of the original:
id  animal  weight
2   cat     2.2
5   snake   1.3
sub_df <- data.frame(id = c(2, 5),
                     animal = c("cat", "snake"),
                     weight = c("2.2", "1.3"))
Now I want to put it all together again, so I use an operation like this:
df %>%
  anti_join(sub_df, by = c("id", "animal")) %>%
  bind_rows(sub_df)
id  animal  weight
4   fairy   0.2
1   dog     23.0
3   duck    1.2
2   cat     2.2
5   snake   1.3
Is there a way to do this directly with join operations?
If the subset contains only the key columns and the treated variable (id, animal, weight) rather than all the variables of the original data frame (id, animal, weight, height), how could I merge the subset back into the original?
What you describe is a join operation in which you update some values in the original dataset. This is very easy to do with great performance using data.table because of its fast joins and update-by-reference concept (:=).
Here's an example for your toy data:
library(data.table)
setDT(df) # convert to data.table without copy
setDT(sub_df) # convert to data.table without copy
# join and update "df" by reference, i.e. without copy
df[sub_df, on = c("id", "animal"), weight := i.weight]
The data is now updated:
# id animal weight
#1: 1 dog 23.0
#2: 2 cat 2.2
#3: 3 duck 1.2
#4: 4 fairy 0.2
#5: 5 snake 1.3
You can use setDF to switch back to ordinary data.frame.
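Remove the NAs first, then simply stack the tibbles. Note that this only drops rows whose weight is NA, so with the example data the snake row with weight "BAD" is kept alongside its replacement:
bind_rows(filter(df,!is.na(weight)),sub_df)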
Isn't dplyr::rows_update exactly what we need here? The following code should work:
df %>% dplyr::rows_update(sub_df, by = "id")
This should work as long as there is a unique identifier (one or multiple variables) for your datasets.
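For comparison, the same keyed update can be sketched in base R with match(). This is a swapped-in technique rather than the rows_update call itself; it assumes id uniquely identifies rows in both tables, and pos and updated are hypothetical names.

```r
df <- data.frame(id = 1:5,
                 animal = c("dog", "cat", "duck", "fairy", "snake"),
                 weight = c("23", NA, "1.2", "0.2", "BAD"),
                 stringsAsFactors = FALSE)
sub_df <- data.frame(id = c(2, 5),
                     animal = c("cat", "snake"),
                     weight = c("2.2", "1.3"),
                     stringsAsFactors = FALSE)

# row positions in df of the ids that appear in sub_df
pos <- match(sub_df$id, df$id)
stopifnot(!anyNA(pos))  # every id in sub_df must exist in df

# overwrite only the treated column; other columns and the row order are untouched
updated <- df
updated$weight[pos] <- sub_df$weight
```

Since only the weight column is assigned, any extra columns in df (such as height) would carry over unchanged.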
For anyone looking for a solution to use in a tidyverse pipeline:
I run into this problem a lot, and have written a short function that uses mostly tidyverse verbs to get around this. It will account for the case when there are additional columns in the original df.
For example, if the OP's df had an additional 'height' column:
library(dplyr)
df <- tibble(id = seq(1:5),
animal = c("dog", "cat", "duck", "fairy", "snake"),
weight = c("23", NA, "1.2", "0.2", "BAD"),
height = c("54", "45", "21", "50", "42"))
And the subset of data we wanted to join in was the same:
sub_df <- tibble(id = c(2, 5),
animal = c("cat", "snake"),
weight = c("2.2", "1.3"))
If we used the OP's method alone (anti_join %>% bind_rows), it would not work because of the additional 'height' column in df. An extra step or two is needed.
In this case we could use the following function:
replace_subset <- function(df, df_subset, id_col_names = c()) {
  # work out which of the columns contain "new" data
  new_data_col_names <- colnames(df_subset)[which(!colnames(df_subset) %in% id_col_names)]
  # complete the df_subset with the extra columns from df
  # (all_of() avoids the ambiguity of selecting with a bare character vector)
  df_sub_to_join <- df_subset %>%
    left_join(select(df, -all_of(new_data_col_names)), by = c(id_col_names))
  # join and bind rows
  df_out <- df %>%
    anti_join(df_sub_to_join, by = c(id_col_names)) %>%
    bind_rows(df_sub_to_join)
  return(df_out)
}
Now for the results:
replace_subset(df = df , df_subset = sub_df, id_col_names = c("id"))
## A tibble: 5 x 4
# id animal weight height
# <dbl> <chr> <chr> <chr>
#1 1 dog 23 54
#2 3 duck 1.2 21
#3 4 fairy 0.2 50
#4 2 cat 2.2 45
#5 5 snake 1.3 42
And here's an example using the function in a pipeline:
df %>%
  replace_subset(df_subset = sub_df, id_col_names = c("id")) %>%
  mutate_at(.vars = vars(c('weight', 'height')), .funs = ~as.numeric(.)) %>%
  mutate(bmi = weight / (height^2))
## A tibble: 5 x 5
# id animal weight height bmi
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 1 dog 23 54 0.00789
#2 3 duck 1.2 21 0.00272
#3 4 fairy 0.2 50 0.00008
#4 2 cat 2.2 45 0.00109
#5 5 snake 1.3 42 0.000737
hope this is helpful :)

Assign value to variable based on values on multiple other columns (alternative to ifelse)

I have a data frame describing a large number of people. I want to assign each person to a group, based on several variables. For example, let's say I have the variable "state" with 5 states, the variable "age group" with 4 groups and the variable "income" with 5 groups. I will have 5x4x5 = 100 groups, that I want to name with numbers going from 1 to 100. I have always done this in the past using a combination of ifelse statements, but now as I have 100 possible outcomes I am wondering if there is a faster way than specifying each combination by hand.
Here's a MWE with the expected outcome:
mydata <- as.data.frame(cbind(c("FR","UK","UK","IT","DE","ES","FR","DE","IT","UK"),
c("20","80","20","40","60","20","60","80","40","60"),c(1,4,2,3,1,5,5,3,4,2)))
colnames(mydata) <- c("Country","Age","Income")
group_grid <- transform(expand.grid(state = c("IT","FR","UK","ES","DE"),
age = c("20","40","60","80"), income = 1:5), val = 1:100)
desired_result <- as.data.frame(cbind(c("FR","UK","UK","IT","DE","ES","FR","DE","IT","UK"),
c("20","80","20","40","60","20","60","80","40","60"),
c(1,4,2,3,1,5,5,3,4,2),
c(2,78,23,46,15,84,92,60,66,33)))
colnames(desired_result) <- c("Country","Age","Income","Group_code")
The following should do it:
mydata$Group_code <- with(mydata, as.integer(interaction(Country, Age, Income)))
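interaction numbers the combinations by factor level order, with the levels of the first factor varying fastest. As a hedged sketch, assuming the codes should match the val column of group_grid, the levels can be set explicitly in the same order used in expand.grid. The level vectors below are copied from the question; the way mydata is rebuilt here with data.frame is an illustrative simplification.

```r
mydata <- data.frame(
  Country = c("FR", "UK", "UK", "IT", "DE", "ES", "FR", "DE", "IT", "UK"),
  Age     = c("20", "80", "20", "40", "60", "20", "60", "80", "40", "60"),
  Income  = c(1, 4, 2, 3, 1, 5, 5, 3, 4, 2),
  stringsAsFactors = FALSE)

# fix the level order so it matches the order used to build group_grid
mydata$Group_code <- with(mydata, as.integer(interaction(
  factor(Country, levels = c("IT", "FR", "UK", "ES", "DE")),
  factor(Age,     levels = c("20", "40", "60", "80")),
  factor(Income,  levels = as.character(1:5)))))
```

Without explicit levels, the factors default to alphabetical order, so the resulting codes would still identify the groups but would not line up with group_grid.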
Here is a left_join option using dplyr:
library(dplyr)
# the join needs key columns of the same class, so convert factors and income to character
grpD <- group_grid %>%
  mutate_if(is.factor, as.character) %>%
  mutate(income = as.character(income))
mydata %>%
  mutate_if(is.factor, as.character) %>%  # change class here too
  left_join(., grpD, by = c("Country" = "state", "Age" = "age", "Income" = "income"))
# Country Age Income val
#1 FR 20 1 2
#2 UK 80 4 78
#3 UK 20 2 23
#4 IT 40 3 46
#5 DE 60 1 15
#6 ES 20 5 84
#7 FR 60 5 92
#8 DE 80 3 60
#9 IT 40 4 66
#10 UK 60 2 33
