R Dataframe Detecting Hidden Repeated Patterns by Group

I have a dataframe that looks like this:
person year location rank
Harry 2002 Los Angeles 1
Harry 2006 Boston 1
Harry 2006 Los Angeles 2
Harry 2006 Chicago 3
Peter 2001 New York 1
Peter 2002 New York 1
Lily 2005 Springfield 1
Lily 2007 New York 1
Lily 2008 Boston 1
Lily 2011 Chicago 1
Lily 2011 New York 2
Sam 2005 Springfield 1
Sam 2007 New York 1
Sam 2008 Boston 1
Sam 2008 Springfield 2
Sam 2008 New York 3
Sam 2011 Chicago 1
Sam 2011 Springfield 2
I want to know, at the person level, who has a location with rank = 1 in a certain year where that location reappears in the next available year with rank != 1. For example, the output should look like:
person yes/no
Harry 1
Peter 0
Lily 0
Sam 1

Here's an approach with dplyr; it could probably be more concise.
library(dplyr)
df1 %>%
# define year_number as a count of unique years [assumes sorted already]
group_by(person) %>%
mutate(year_num = cumsum(year != lag(year, default = 0))) %>%
# check for successive years with different ranks
group_by(person, location) %>%
mutate(next_yr_switch = year_num == lag(year_num, default = -Inf) + 1 & rank != lag(rank)) %>%
group_by(person) %>%
summarize(`yes/no` = sum(next_yr_switch))
## A tibble: 4 x 2
# person `yes/no`
#* <chr> <int>
#1 Harry 1
#2 Lily 0
#3 Peter 0
#4 Sam 1
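The pipeline above flags any change of rank between consecutive available years and sums the matches, so a person with several matches would score above 1. A stricter variant is sketched below: it requires the earlier rank to be exactly 1 and the later one to differ from 1, and it returns a 0/1 flag. The data frame is rebuilt from the question so the block runs on its own; the column names year_num, hit and flag are illustrative, and the sketch assumes each location appears at most once per person and year.

```r
library(dplyr)

# Data from the question
df1 <- data.frame(
  person = c(rep("Harry", 4), rep("Peter", 2), rep("Lily", 5), rep("Sam", 7)),
  year = c(2002, 2006, 2006, 2006, 2001, 2002,
           2005, 2007, 2008, 2011, 2011,
           2005, 2007, 2008, 2008, 2008, 2011, 2011),
  location = c("Los Angeles", "Boston", "Los Angeles", "Chicago",
               "New York", "New York",
               "Springfield", "New York", "Boston", "Chicago", "New York",
               "Springfield", "New York", "Boston", "Springfield", "New York",
               "Chicago", "Springfield"),
  rank = c(1, 1, 2, 3, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 3, 1, 2),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  group_by(person) %>%
  # number each person's distinct years 1, 2, 3, ... regardless of row order
  mutate(year_num = dense_rank(year)) %>%
  group_by(person, location) %>%
  arrange(year_num, .by_group = TRUE) %>%
  # previous appearance of this location was rank 1 in the preceding
  # available year, and the current rank is not 1
  mutate(hit = lag(rank) == 1 & rank != 1 & year_num == lag(year_num) + 1) %>%
  group_by(person) %>%
  summarise(flag = as.integer(any(hit, na.rm = TRUE)))

print(result)
```

Using dense_rank() instead of a cumulative sum removes the assumption that the rows are already sorted by year.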

Related

R: Adding Missing Rows to a Dataset

I am working in R.
I have a dataset that looks something like this:
id = c("john", "john", "john", "john","john", "james", "james", "james", "james", "james")
year = c(2010,2011, 2014, 2016,2017, 2013, 2016, 2017, 2018,2020)
var = c(1,1,1,1,1,1,1,1,1,1)
my_data = data.frame(id, year, var)
> my_data
id year var
1 john 2010 1
2 john 2011 1
3 john 2014 1
4 john 2016 1
5 john 2017 1
6 james 2013 1
7 james 2016 1
8 james 2017 1
9 james 2018 1
10 james 2020 1
As we can see, there are some missing years (i.e. non-consecutive years) in this dataset - for each ID, I am trying to add rows corresponding to these missing years and assign the "var" variable as "0" in these rows.
As an example, this would look something like this for the first ID:
id year var
1 john 2010 1
2 john 2011 1
3 john 2012 0
4 john 2013 0
5 john 2014 1
6 john 2015 0
7 john 2016 1
8 john 2017 1
I tried to do this with the following code:
# https://stackoverflow.com/questions/74365569/backfilling-rows-based-on-max-conditions-in-r
library(dplyr)
library(tidyr)
my_data %>%
group_by(id) %>%
complete(year = full_seq(year, period = 1)) %>%
fill(year, var, .direction = "downup") %>%
mutate(var= 0 ) %>%
ungroup
But this is not giving the desired result - as we can see, rows have been deleted and all values of "var" have been replaced with 0:
A tibble: 16 x 3
id year var
<chr> <dbl> <dbl>
1 james 2013 0
2 james 2014 0
3 james 2015 0
4 james 2016 0
5 james 2017 0
6 james 2018 0
7 james 2019 0
8 james 2020 0
Can someone please show me how to fix this problem?
Thanks!
I would include the fill argument in your complete function. There you can specify in a named list what you want to include as values for missing combinations.
library(tidyverse)
my_data %>%
group_by(id) %>%
complete(year = full_seq(year, period = 1), fill = list(var = 0)) %>%
ungroup
Output
id year var
<chr> <dbl> <dbl>
1 james 2013 1
2 james 2014 0
3 james 2015 0
4 james 2016 1
5 james 2017 1
6 james 2018 1
7 james 2019 0
8 james 2020 1
9 john 2010 1
10 john 2011 1
11 john 2012 0
12 john 2013 0
13 john 2014 1
14 john 2015 0
15 john 2016 1
16 john 2017 1
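You can create a data.frame with all years and ids, then do a full_join with the original data.frame. Note that this grid spans the overall minimum to maximum year for every id, so it also adds years outside an individual id's own range (for example 2010 to 2012 for james).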
library(dplyr)
library(tidyr)
expand_grid(id = unique(my_data$id),year = min(my_data$year):max(my_data$year)) %>%
full_join(my_data) %>%
replace_na(replace = list(var = 0))
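If each id should only be filled between its own first and last year, as in the desired output, one base R alternative is to build the per-id year sequences first and then merge them with the data. This is a sketch rather than part of the answers above; the names ranges and filled are illustrative.

```r
id <- c("john", "john", "john", "john", "john",
        "james", "james", "james", "james", "james")
year <- c(2010, 2011, 2014, 2016, 2017, 2013, 2016, 2017, 2018, 2020)
var <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
my_data <- data.frame(id, year, var, stringsAsFactors = FALSE)

# One row per id and per year between that id's first and last year
ranges <- do.call(rbind, lapply(split(my_data, my_data$id), function(d) {
  data.frame(id = d$id[1], year = seq(min(d$year), max(d$year)),
             stringsAsFactors = FALSE)
}))

# Left join: years missing from my_data get var = NA, then 0
filled <- merge(ranges, my_data, by = c("id", "year"), all.x = TRUE)
filled$var[is.na(filled$var)] <- 0

print(filled)
```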

Extracting hidden Info from R Dataframe

I have a dataframe that looks like this:
person year Office Job rank
Harry 2002 Los Angeles CEO 0
Harry 2006 Boston CEO 0
Harry 2006 Los Angeles Advisor 1
Harry 2006 Chicago Chairman 2
Peter 2001 New York Director 0
Peter 2001 Chicago CFO 1
Peter 2001 Chicago COO 2
Peter 2002 Chicago CEO 0
Lily 2005 Springfield CEO 0
Lily 2007 New York CFO 0
Lily 2008 Boston COO 0
Lily 2011 Chicago Advisor 0
Lily 2011 New York board 1
Sam 2006 Chicago COO 0
Sam 2007 Chicago CFO 0
Sam 2007 Chicago CEO 1
Sam 2010 New York Advisor 0
I want to know, at the person level, who shows at least one of the following two patterns:
In one available year an office has rank 0, and in the next available year the office still exists but its rank is no longer 0, i.e. it is bigger than 0 (the job does not matter). For example, Los Angeles for Harry.
In one available year an office has rank 0, and in the previous available year the office exists but its rank is bigger than 0. For example, Chicago for Peter.
Note that New York for Lily matches neither pattern, because 2007 is not the previous available year before 2011 for Lily (2008 is).
Note that an office can appear multiple times in a year (with different jobs). Chicago for Sam is one such case. Chicago for Sam also does not count: although Chicago has rank 1 in 2007 and rank 0 in the previous available year, Chicago also has rank 0 in 2007.
Thus, the output should look like:
person yes/no
Harry 1
Peter 1
Lily 0
Sam 0
If I understand correctly, I think this will work. You want to:
Figure out whether any job in a person-year-office has rank 0.
For each person-office, check the two cases you're interested in (the current row has a rank 0, and either the previous or the next row does not). This is easier to do if you expand the dataframe to include all combinations of person-year and office.
For each person, check whether any row matches either case you specified, and fill the missing values.
library(tidyverse)
# comma-separated so that multi-word offices such as "Los Angeles" stay in one column
df <- read_csv(
"person,year,Office,Job,rank
Harry,2002,Los Angeles,CEO,0
Harry,2006,Boston,CEO,0
Harry,2006,Los Angeles,Advisor,1
Harry,2006,Chicago,Chairman,2
Peter,2001,New York,Director,0
Peter,2001,Chicago,CFO,1
Peter,2001,Chicago,COO,2
Peter,2002,Chicago,CEO,0
Lily,2005,Springfield,CEO,0
Lily,2007,New York,CFO,0
Lily,2008,Boston,COO,0
Lily,2011,Chicago,Advisor,0
Lily,2011,New York,board,1
Sam,2006,Chicago,COO,0
Sam,2007,Chicago,CFO,0
Sam,2007,Chicago,CEO,1
Sam,2010,New York,Advisor,0
"
)
df %>%
group_by(person, year, Office) %>%
summarise(any_rank_0 = any(rank == 0)) %>%
ungroup() %>%
complete(nesting(person, year), Office) %>%
arrange(person, Office, year) %>%
group_by(person, Office) %>%
mutate(
case_1 = any_rank_0 & !lead(any_rank_0), #current 0, next not 0
case_2 = any_rank_0 & !lag(any_rank_0) #current 0, previous not 0
) %>%
group_by(person) %>%
summarise(result = replace_na(any(case_1) | any(case_2), FALSE))
#> # A tibble: 4 x 2
#> person result
#> <chr> <lgl>
#> 1 Harry TRUE
#> 2 Lily FALSE
#> 3 Peter TRUE
#> 4 Sam FALSE
Created on 2021-05-20 by the reprex package (v1.0.0)
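The final replace_na() matters because complete() adds rows whose any_rank_0 is NA, and R's logical operators propagate NA unless the result is already decided by the known value. A small base R illustration, independent of the data above (the variable names are illustrative):

```r
# Three-valued logic: NA means "unknown"
a  <- TRUE & NA            # NA: the result depends on the unknown value
b  <- FALSE & NA           # FALSE: decided regardless of the unknown value
c1 <- any(c(NA, FALSE))    # NA: the unknown element might be TRUE
c2 <- any(c(NA, TRUE))     # TRUE: one TRUE is enough

# Treating "unknown" as "no match" is what replace_na(..., FALSE) does
flag <- c1
flag[is.na(flag)] <- FALSE

print(list(a = a, b = b, c1 = c1, c2 = c2, flag = flag))
```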

R - clean up data based on preceding and following values

I have a table which is later divided into multiple intervals based on multiple conditions. In some rare cases, I get one or more rows which do not fall into the defined interval, so I'd like to perform some extra clean-up of the data.
For each group (name, location), if the row value is stop == 0, I need to count how many such rows are in the interval. If that count is less than 3, I need to check how many continuous rows are marked as stop == 1 above and below the interval of zeros. If the count of values with stop == 1 above & below == 1, then I need to change the zeros in the interval to 1.
I hope the example makes it clearer:
df <- read.table(text="name location stop
John London 1
John London 1
John London 1
John London 1
John London 1
John London 1
John London 1
John London 0
John London 0
John London 1
John London 1
John London 1
John London 1
John London 1
John London 1
John London 0
John New_York 0
John New_York 0
John New_York 0
John New_York 1
John New_York 0
",header = TRUE, stringsAsFactors = FALSE)
You could iterate over the rows, but it seems that all you want to do is replace all instances of 101 with 111 and 1001 with 1111 in stop. You can do this by turning the stop column into a string and then making the substitutions with gsub():
stopString = paste0(df$stop, collapse = "")
stopString = gsub("101","111",stopString)
stopString = gsub("1001","1111",stopString)
df$stop = as.numeric(unlist(strsplit(stopString,"")))
> df
name location stop
1 John London 1
2 John London 1
3 John London 1
4 John London 1
5 John London 1
6 John London 1
7 John London 1
8 John London 1
9 John London 1
10 John London 1
11 John London 1
12 John London 1
13 John London 1
14 John London 1
15 John London 1
16 John London 0
17 John New_York 0
18 John New_York 0
19 John New_York 0
20 John New_York 1
21 John New_York 0
Edit: grouping by name and location:
df <- read.table(text="name location stop
John London 1
John London 0
John London 1
John New_York 0
John New_York 1
John New_York 0
John New_York 0
John New_York 0
John New_York 1
John New_York 0
",header = TRUE, stringsAsFactors = TRUE)
f <- function(x)
{
stopString = paste0(x, collapse = "")
stopString = gsub("101","111",stopString)
stopString = gsub("1001","1111",stopString)
as.numeric(unlist(strsplit(stopString,"")))
}
library(dplyr)
# mutate() keeps one row per input row; f() returns a vector as long as its input
df %>% group_by(name, location) %>%
mutate(s = f(stop))
# A tibble: 10 x 4
# Groups: name, location [2]
name location stop s
<fct> <fct> <int> <dbl>
1 John London 1 1
2 John London 0 1
3 John London 1 1
4 John New_York 0 0
5 John New_York 1 1
6 John New_York 0 0
7 John New_York 0 0
8 John New_York 0 0
9 John New_York 1 1
10 John New_York 0 0
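String substitution with gsub() does not revisit overlapping matches (for example "10101" becomes "11101"), so alternating patterns can leave a zero behind. A run-length based alternative is sketched below using base R's rle(). It is a different technique from the answer above, and it assumes that a run of at most two zeros should be filled whenever it has a run of ones directly on both sides; fill_short_gaps and max_gap are hypothetical names.

```r
# Fill runs of 0 that are at most max_gap long and sit between runs of 1
fill_short_gaps <- function(x, max_gap = 2) {
  r <- rle(x)
  n <- length(r$values)
  if (n >= 3) {
    idx <- 2:(n - 1)  # interior runs only, so both neighbours exist
    fill <- r$values[idx] == 0 & r$lengths[idx] <= max_gap &
      r$values[idx - 1] == 1 & r$values[idx + 1] == 1
    r$values[idx[fill]] <- 1
  }
  inverse.rle(r)
}

df <- data.frame(
  name = "John",
  location = c(rep("London", 3), rep("New_York", 7)),
  stop = c(1, 0, 1, 0, 1, 0, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# ave() applies the function within each name/location group
df$s <- ave(df$stop, df$name, df$location, FUN = fill_short_gaps)
print(df)
```

Because the fill decision is made on the original runs, every qualifying gap is handled in one pass, including alternating sequences such as 1 0 1 0 1.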

How to manually enter a cell in a dataframe? [duplicate]

This is my dataframe:
county state cases deaths FIPS
Abbeville South Carolina 4 0 45001
Acadia Louisiana 9 1 22001
Accomack Virginia 3 0 51001
New York C New York 2 0 NA
Ada Idaho 113 2 16001
Adair Iowa 1 0 19001
I would like to manually put "55555" into the NA cell. My actual df is thousands of lines long, and the row containing the NA changes from day to day. I would like to fill it in based on the county. Is there a way to say df[df$county == "New York C",] <- df$FIPS = "55555" or something like that? I don't want to insert based on the column or row number because they change.
This will put 55555 into the NA cells within column FIPS where county is New York C
df$FIPS[is.na(df$FIPS) & df$county == "New York C"] <- 55555
Output
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 55555
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
Data
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 NA
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
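For anyone who wants to run this, here is a reproducible version of the same idea. The data frame is rebuilt by hand from the printed Data block above, so the column types (numeric FIPS in particular) are an assumption.

```r
df <- data.frame(
  county = c("Abbeville", "Acadia", "Accomack", "New York C",
             "Ada", "Adair", "New York C"),
  state = c("South Carolina", "Louisiana", "Virginia", "New York",
            "Idaho", "Iowa", "New York"),
  cases = c(4, 9, 3, 2, 113, 1, 1),
  deaths = c(0, 1, 0, 0, 2, 0, 0),
  FIPS = c(45001, 22001, 51001, NA, 16001, 19001, 18000),
  stringsAsFactors = FALSE
)

# Both conditions must hold: the cell is NA and the county matches,
# so the existing 18000 in row 7 is left untouched
df$FIPS[is.na(df$FIPS) & df$county == "New York C"] <- 55555

print(df)
```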
You could use & (and) to substitute the df$FIPS entries that meet the two desired conditions.
df$FIPS[is.na(df$FIPS) & df$state=="New York"]<-55555
If you want to change values based on multiple conditions, I'd go with dplyr::mutate().
library(dplyr)
df <- df %>%
mutate(FIPS = ifelse(is.na(FIPS) & county == "New York C", 55555, FIPS))

How to join/ merge two tables using character values?

I would like to combine two tables based on first name, last name, and year, and create a new binary variable indicating whether the row from table 1 was present in the 2nd table.
First table is a panel data set of some attributes of NBA players during a season:
firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic","Larry","Larry")
lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson","Bird","Bird")
year<-c("1991","1992","1993","1991","1992","1993","1992","1992")
season<-data.frame(firstname,lastname,year)
firstname lastname year
1 Michael Jordan 1991
2 Michael Jordan 1992
3 Michael Jordan 1993
4 Magic Johnson 1991
5 Magic Johnson 1992
6 Magic Johnson 1993
7 Larry Bird 1992
8 Larry Bird 1992
The second data.frame is a panel data set of some attributes of NBA players selected to the All-Star game:
firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic")
lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson")
year<-c("1991","1992","1993","1991","1992","1993")
ALLSTARS<-data.frame(firstname,lastname,year)
firstname lastname year
1 Michael Jordan 1991
2 Michael Jordan 1992
3 Michael Jordan 1993
4 Magic Johnson 1991
5 Magic Johnson 1992
6 Magic Johnson 1993
My desired result looks like:
firstname lastname year allstars
1 Michael Jordan 1991 1
2 Michael Jordan 1992 1
3 Michael Jordan 1993 1
4 Magic Johnson 1991 1
5 Magic Johnson 1992 1
6 Magic Johnson 1993 1
7 Larry Bird 1992 0
8 Larry Bird 1992 0
I tried to use a left join, but I am not sure whether that makes sense:
test<-join(season, ALLSTARS, by =c("lastname","firstname","year") , type = "left", match = "all")
Here's a simple solution using a data.table binary join, which allows you to update a column by reference while joining
library(data.table)
setkey(setDT(season), firstname, lastname, year)[ALLSTARS, allstars := 1L]
season
# firstname lastname year allstars
# 1: Larry Bird 1992 NA
# 2: Larry Bird 1992 NA
# 3: Magic Johnson 1991 1
# 4: Magic Johnson 1992 1
# 5: Magic Johnson 1993 1
# 6: Michael Jordan 1991 1
# 7: Michael Jordan 1992 1
# 8: Michael Jordan 1993 1
Or using dplyr
library(dplyr)
ALLSTARS %>%
mutate(allstars = 1L) %>%
right_join(., season)
# firstname lastname year allstars
# 1 Michael Jordan 1991 1
# 2 Michael Jordan 1992 1
# 3 Michael Jordan 1993 1
# 4 Magic Johnson 1991 1
# 5 Magic Johnson 1992 1
# 6 Magic Johnson 1993 1
# 7 Larry Bird 1992 NA
# 8 Larry Bird 1992 NA
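Both joins above leave NA for players who are not in ALLSTARS, whereas the desired output has 0. A dplyr sketch that produces 0/1 directly, using left_join() so the row order of season is kept and coalesce() to turn the NAs into 0; the distinct() call is a precaution against duplicated rows in ALLSTARS multiplying rows in the result.

```r
library(dplyr)

season <- data.frame(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic",
                "Larry", "Larry"),
  lastname = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson",
               "Bird", "Bird"),
  year = c("1991", "1992", "1993", "1991", "1992", "1993", "1992", "1992"),
  stringsAsFactors = FALSE
)
ALLSTARS <- data.frame(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic"),
  lastname = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson"),
  year = c("1991", "1992", "1993", "1991", "1992", "1993"),
  stringsAsFactors = FALSE
)

result <- season %>%
  left_join(
    ALLSTARS %>% distinct() %>% mutate(allstars = 1L),
    by = c("firstname", "lastname", "year")
  ) %>%
  # rows without a match get NA from the join; turn those into 0
  mutate(allstars = coalesce(allstars, 0L))

print(result)
```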
In base R:
ALLSTARS$allstars <- 1L
newdf <- merge(season, ALLSTARS, by=c('firstname', 'lastname', 'year'), all.x=TRUE)
newdf$allstars[is.na(newdf$allstars)] <- 0L
newdf
Or one I like for a different approach:
season$allstars <- (apply(season, 1, function(x) paste(x, collapse='')) %in%
apply(ALLSTARS, 1, function(x) paste(x, collapse='')))+0L
#
# firstname lastname year allstars
# 1 Michael Jordan 1991 1
# 2 Michael Jordan 1992 1
# 3 Michael Jordan 1993 1
# 4 Magic Johnson 1991 1
# 5 Magic Johnson 1992 1
# 6 Magic Johnson 1993 1
# 7 Larry Bird 1992 0
# 8 Larry Bird 1992 0
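Pasting the fields together with collapse='' can in principle produce false matches, because different field values may concatenate to the same string (for example "ab" + "c" and "a" + "bc"). A variant that builds the key with a separator is sketched below, assuming the separator character "|" never occurs in the data; key is an illustrative helper name.

```r
season <- data.frame(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic",
                "Larry", "Larry"),
  lastname = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson",
               "Bird", "Bird"),
  year = c("1991", "1992", "1993", "1991", "1992", "1993", "1992", "1992"),
  stringsAsFactors = FALSE
)
ALLSTARS <- season[1:6, ]  # the six all-star rows from the question

# Build one key string per row from the three matching columns
key <- function(d) {
  do.call(paste, c(d[c("firstname", "lastname", "year")], sep = "|"))
}

season$allstars <- as.integer(key(season) %in% key(ALLSTARS))
print(season)
```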
It looks like you are using join() from the plyr package. You were almost there: just preface your command with ALLSTARS$allstars <- 1. Then do your join as it is written and finally convert the NA values to 0. So:
ALLSTARS$allstars <- 1
test <- join(season, ALLSTARS, by =c("lastname","firstname","year") , type = "left", match = "all")
test$allstars[is.na(test$allstars)] <- 0
Result:
firstname lastname year allstars
1 Michael Jordan 1991 1
2 Michael Jordan 1992 1
3 Michael Jordan 1993 1
4 Magic Johnson 1991 1
5 Magic Johnson 1992 1
6 Magic Johnson 1993 1
7 Larry Bird 1992 0
8 Larry Bird 1992 0
That said, I would personally use left_join or right_join from the dplyr package, as in David's answer, instead of plyr's join(). Also note that you don't actually need the by argument of join() in this case, as by default the function will try to join on all fields with common names, which is what you want here.
