Deleting certain rows of a dataframe based on multiple columns in R

I have a dataframe that looks like this:
Year Name Place Job
2010 Jim USA CEO
2010 Jim Canada Advisor
2010 Jim Canada Board Member
2011 Jim USA CEO
2017 Peter Mexico COO
2019 Peter Korea CEO
2019 Peter China Advisor
2013 Harry USA Chairman
2014 Harry Canada CEO
2015 Harry Canada CEO
2015 Harry Canada Advisor
I want to remove rows from the above dataframe based on the "Year" and "Name" columns. Basically, every row whose Year/Name combination occurs in the list below (in dataframe format) should be removed:
Year Name
2010 Jim
2019 Peter
2013 Harry
2014 Harry
Thus, the final output should look like:
Year Name Place Job
2011 Jim USA CEO
2017 Peter Mexico COO
2015 Harry Canada CEO
2015 Harry Canada Advisor

base R
While dplyr (below) has anti_join, in base R one needs to merge, find the rows that did not match, and keep only those by hand. Note that merge sorts the result by the key columns, so the row order differs from dat.
# using the `rem` frame, augmenting a little
rem$keep <- FALSE
tmp <- merge(dat, rem, by = c("Year", "Name"), all.x = TRUE)
tmp
# Year Name Place Job keep
# 1 2010 Jim USA CEO FALSE
# 2 2010 Jim Canada Advisor FALSE
# 3 2010 Jim Canada Board Member FALSE
# 4 2011 Jim USA CEO NA
# 5 2013 Harry USA Chairman FALSE
# 6 2014 Harry Canada CEO FALSE
# 7 2015 Harry Canada CEO NA
# 8 2015 Harry Canada Advisor NA
# 9 2017 Peter Mexico COO NA
# 10 2019 Peter Korea CEO FALSE
# 11 2019 Peter China Advisor FALSE
tmp[ is.na(tmp$keep), ]
# Year Name Place Job keep
# 4 2011 Jim USA CEO NA
# 7 2015 Harry Canada CEO NA
# 8 2015 Harry Canada Advisor NA
# 9 2017 Peter Mexico COO NA
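A merge-free variant in base R is to build a composite key from both columns and test membership with %in%. The sketch below is only an illustration under stated assumptions: dat_small and rem_small are cut-down stand-ins for the real frames (not the full data), and the "\r" separator is an arbitrary choice that is assumed not to occur in the names.

```r
# Cut-down stand-ins for the two frames (illustrative only)
dat_small <- data.frame(
  Year = c(2010L, 2010L, 2011L, 2017L, 2019L),
  Name = c("Jim", "Jim", "Jim", "Peter", "Peter"),
  Job  = c("CEO", "Advisor", "CEO", "COO", "CEO"),
  stringsAsFactors = FALSE
)
rem_small <- data.frame(
  Year = c(2010L, 2019L),
  Name = c("Jim", "Peter"),
  stringsAsFactors = FALSE
)

# One composite key per row, built from both matching columns
key_dat <- paste(dat_small$Year, dat_small$Name, sep = "\r")
key_rem <- paste(rem_small$Year, rem_small$Name, sep = "\r")

# Keep the rows whose key is absent from the removal list
res <- dat_small[!key_dat %in% key_rem, ]
res
```

Compared with the merge route, this keeps the original row order and adds no helper column, at the cost of relying on a separator that must not appear in the data.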
dplyr
dplyr::anti_join(dat, rem, by = c("Year", "Name"))
# Year Name Place Job
# 1 2011 Jim USA CEO
# 2 2017 Peter Mexico COO
# 3 2015 Harry Canada CEO
# 4 2015 Harry Canada Advisor
Data
dat <- structure(list(Year = c(2010L, 2010L, 2010L, 2011L, 2017L, 2019L, 2019L, 2013L, 2014L, 2015L, 2015L), Name = c("Jim", "Jim", "Jim", "Jim", "Peter", "Peter", "Peter", "Harry", "Harry", "Harry", "Harry"), Place = c("USA", "Canada", "Canada", "USA", "Mexico", "Korea", "China", "USA", "Canada", "Canada", "Canada"), Job = c("CEO", "Advisor", "Board Member", "CEO", "COO", "CEO", "Advisor", "Chairman", "CEO", "CEO", "Advisor")), row.names = c(NA, -11L), class = "data.frame")
rem <- structure(list(Year = c(2010L, 2019L, 2013L, 2014L), Name = c("Jim", "Peter", "Harry", "Harry")), class = "data.frame", row.names = c(NA, -4L))

Using data.table
library(data.table)
# `dat` and `rem` are the two data frames from the Data section above
setDT(dat)[!rem, on = .(Year, Name)]
Output:
# Year Name Place Job
#1: 2011 Jim USA CEO
#2: 2017 Peter Mexico COO
#3: 2015 Harry Canada CEO
#4: 2015 Harry Canada Advisor

Another approach:
library(dplyr)
library(stringr)
dat %>% mutate(x = str_c(Year, Name)) %>%
filter(str_detect(x, str_c(str_c(rem$Year,rem$Name), collapse = '|'), negate = TRUE)) %>%
select(-x)
Year Name Place Job
1 2011 Jim USA CEO
2 2017 Peter Mexico COO
3 2015 Harry Canada CEO
4 2015 Harry Canada Advisor

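We could use str_c to build a combined Year/Name key and filter on it (filtering on Year alone would ignore Name and only works on this sample by coincidence):
library(dplyr)
library(stringr)
dat %>%
filter(!str_c(Year, Name, sep = "_") %in% str_c(rem$Year, rem$Name, sep = "_"))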
Output:
Year Name Place Job
1 2011 Jim USA CEO
2 2017 Peter Mexico COO
3 2015 Harry Canada CEO
4 2015 Harry Canada Advisor
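A small sketch of why both columns have to enter the comparison. The frames dat_demo and rem_demo are made up for illustration (they are not the question's data), and the key() helper is a hypothetical convenience function, not part of any package.

```r
# Made-up data: two people share the year 2010, only Jim/2010 should go
dat_demo <- data.frame(
  Year = c(2010L, 2010L, 2011L),
  Name = c("Jim", "Peter", "Jim"),
  stringsAsFactors = FALSE
)
rem_demo <- data.frame(Year = 2010L, Name = "Jim", stringsAsFactors = FALSE)

# Year only: Peter's 2010 row is dropped too, although it is not listed
by_year <- dat_demo[!dat_demo$Year %in% rem_demo$Year, ]

# Year and Name together: only Jim's 2010 row is dropped
key <- function(d) paste(d$Year, d$Name, sep = "_")  # hypothetical helper
by_both <- dat_demo[!key(dat_demo) %in% key(rem_demo), ]

nrow(by_year)
nrow(by_both)
```

On the sample data in the question the two variants happen to agree, because no removed year is shared with a name that should be kept.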

A base R option using merge + subset
subset(
merge(
dat,
cbind(rem, Removal = 1),
all = TRUE
),
is.na(Removal),
select = -Removal
)
gives
Year Name Place Job
4 2011 Jim USA CEO
7 2015 Harry Canada CEO
8 2015 Harry Canada Advisor
9 2017 Peter Mexico COO

Related

Transform "start-end" datasets to panel dataset in r [duplicate]

This problem is also known as 'transforming a "start-end" dataset to a panel dataset'
I have a data frame containing "name" of U.S. Presidents, the years when they start and end in office, ("from" and "to" columns). Here is a sample:
name from to
Bill Clinton 1993 2001
George W. Bush 2001 2009
Barack Obama 2009 2012
...and the output from dput:
dput(tail(presidents, 3))
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name",
"from", "to"), row.names = 42:44, class = "data.frame")
I want to create a data frame with two columns ("name" and "year"), with a row for each year that a president was in office. Thus, I need to create a regular sequence with each year from "from" to "to". Here's my expected output:
name year
Bill Clinton 1993
Bill Clinton 1994
...
Bill Clinton 2000
Bill Clinton 2001
George W. Bush 2001
George W. Bush 2002
...
George W. Bush 2008
George W. Bush 2009
Barack Obama 2009
Barack Obama 2010
Barack Obama 2011
Barack Obama 2012
I know that I can use data.frame(name = "Bill Clinton", year = seq(1993, 2001)) to expand things for a single president, but I can't figure out how to iterate for each president.
How do I do this? I feel that I should know this, but I'm drawing a blank.
Update 1
OK, I've tried both solutions, and I'm getting an error:
foo<-structure(list(name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"), from = c(1885, 1889, 1893), to = c(1889, 1893, 1897)), .Names = c("name", "from", "to"), row.names = 22:24, class = "data.frame")
ddply(foo, "name", summarise, year = seq(from, to))
Error in seq.default(from, to) : 'from' must be of length 1
Here's a data.table solution. It has the nice (if minor) feature of leaving the presidents in their supplied order:
library(data.table)
dt <- data.table(presidents)
dt[, list(year = seq(from, to)), by = name]
# name year
# 1: Bill Clinton 1993
# 2: Bill Clinton 1994
# ...
# ...
# 21: Barack Obama 2011
# 22: Barack Obama 2012
Edit: To handle presidents with non-consecutive terms, use this instead:
dt[, list(year = seq(from, to)), by = c("name", "from")]
You can use the plyr package:
library(plyr)
ddply(presidents, "name", summarise, year = seq(from, to))
# name year
# 1 Barack Obama 2009
# 2 Barack Obama 2010
# 3 Barack Obama 2011
# 4 Barack Obama 2012
# 5 Bill Clinton 1993
# 6 Bill Clinton 1994
# [...]
and if it is important that the data be sorted by year, you can use the arrange function:
df <- ddply(presidents, "name", summarise, year = seq(from, to))
arrange(df, year)
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# 3 Bill Clinton 1995
# [...]
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Edit 1: Following @edgester's "Update 1", a more appropriate approach is to use adply, which works row by row and therefore accounts for presidents with non-consecutive terms:
adply(foo, 1, summarise, year = seq(from, to))[c("name", "year")]
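The error in "Update 1" arises because grouping by name hands seq() two start years at once for Grover Cleveland. As a package-free illustration of the row-by-row idea, here is a minimal base R sketch; Map() and rep() are my substitution for adply, and the name panel is chosen only for this example.

```r
# The data from "Update 1": Grover Cleveland appears twice
foo <- data.frame(
  name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"),
  from = c(1885, 1889, 1893),
  to   = c(1889, 1893, 1897),
  stringsAsFactors = FALSE
)

# One year sequence per row, so repeated names never reach seq() together
years <- Map(seq, foo$from, foo$to)

# Repeat each name once per year in its own term
panel <- data.frame(
  name = rep(foo$name, lengths(years)),
  year = unlist(years),
  stringsAsFactors = FALSE
)
panel
```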
An alternate tidyverse approach using unnest and map2. However many data columns you have (such as name), they will all be correctly present in the new data frame.
library(tidyverse)
presidents %>%
mutate(year = map2(from, to, seq)) %>%
unnest(year) %>%
select(-from, -to)
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# ...
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Before tidyr v1.0.0, one could create variables as part of unnest().
presidents %>%
unnest(year = map2(from, to, seq)) %>%
select(-from, -to)
Here's a dplyr solution:
library(dplyr)
# the data
presidents <-
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name",
"from", "to"), row.names = 42:44, class = "data.frame")
# the expansion of the table
presidents %>%
rowwise() %>%
do(data.frame(name = .$name, year = seq(.$from, .$to, by = 1)))
# the output
Source: local data frame [22 x 2]
Groups: <by row>
name year
(chr) (dbl)
1 Bill Clinton 1993
2 Bill Clinton 1994
3 Bill Clinton 1995
4 Bill Clinton 1996
5 Bill Clinton 1997
6 Bill Clinton 1998
7 Bill Clinton 1999
8 Bill Clinton 2000
9 Bill Clinton 2001
10 George W. Bush 2001
.. ... ...
h/t: https://stackoverflow.com/a/24804470/1036500
Two base solutions.
Using sequence:
len = d$to - d$from + 1
data.frame(name = d$name[rep(1:nrow(d), len)], year = sequence(len, d$from))
Using mapply:
l <- mapply(`:`, d$from, d$to)
data.frame(name = d$name[rep(1:nrow(d), lengths(l))], year = unlist(l))
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# ...snip
# 8 Bill Clinton 2000
# 9 Bill Clinton 2001
# 10 George W. Bush 2001
# 11 George W. Bush 2002
# ...snip
# 17 George W. Bush 2008
# 18 George W. Bush 2009
# 19 Barack Obama 2009
# 20 Barack Obama 2010
# 21 Barack Obama 2011
# 22 Barack Obama 2012
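One caveat on the sequence approach: to my knowledge, the second argument of sequence() (the start values) only exists from R 4.0.0 onwards. A hedged sketch of an equivalent that avoids it follows; the frame rng and its values are made up for illustration.

```r
# Made-up ranges for illustration
rng <- data.frame(
  name = c("a", "b"),
  from = c(2001, 2011),
  to   = c(2003, 2012),
  stringsAsFactors = FALSE
)

len <- rng$to - rng$from + 1

# sequence(len) counts 1..len[i] within each row; adding the repeated
# start year (minus 1) shifts each run to begin at that row's `from`
year <- sequence(len) + rep(rng$from, len) - 1

out <- data.frame(name = rep(rng$name, len), year = year,
                  stringsAsFactors = FALSE)
out
```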
As noted by @Esteis in a comment, there may well be several columns that need to be repeated following the expansion of the ranges (not only 'name', as in the OP). In that case, instead of repeating the values of a single column, simply repeat the rows of the entire data frame, except for the 'from' and 'to' columns. A simple example:
d = data.frame(x = 1:2, y = 3:4, names = c("a", "b"),
from = c(2001, 2011), to = c(2003, 2012))
# x y names from to
# 1 1 3 a 2001 2003
# 2 2 4 b 2011 2012
len = d$to - d$from + 1
cbind(d[rep(1:nrow(d), len), setdiff(names(d), c("from", "to"))],
year = sequence(len, d$from))
x y names year
1 1 3 a 2001
1.1 1 3 a 2002
1.2 1 3 a 2003
2 2 4 b 2011
2.1 2 4 b 2012
Here is a quick base-R solution, where Df is your data.frame:
do.call(rbind, apply(Df, 1, function(x) {
data.frame(name=x[1], year=seq(x[2], x[3]))}))
It gives some warnings about row names, but appears to return the correct data.frame.
Another option using tidyverse could be to gather data into long format, group_by name and create a sequence between from and to date.
library(tidyverse)
presidents %>%
gather(key, date, -name) %>%
group_by(name) %>%
complete(date = seq(date[1], date[2]))%>%
select(-key)
# A tibble: 22 x 2
# Groups: name [3]
# name date
# <chr> <dbl>
# 1 Barack Obama 2009
# 2 Barack Obama 2010
# 3 Barack Obama 2011
# 4 Barack Obama 2012
# 5 Bill Clinton 1993
# 6 Bill Clinton 1994
# 7 Bill Clinton 1995
# 8 Bill Clinton 1996
# 9 Bill Clinton 1997
#10 Bill Clinton 1998
# … with 12 more rows
Another solution using dplyr and tidyr. It correctly preserves any data columns you have.
library(magrittr) # for pipes
df <- data.frame(
tata = c('toto1', 'toto2'),
from = c(2000, 2004),
to = c(2001, 2009),
measure1 = rnorm(2),
measure2 = 10 * rnorm(2)
)
tata from to measure1 measure2
1 toto1 2000 2001 -0.575 -10.13
2 toto2 2004 2009 -0.258 17.37
df %>%
dplyr::rowwise() %>%
dplyr::mutate(year = list(seq(from, to))) %>%
dplyr::select(-from, -to) %>%
tidyr::unnest(c(year))
# A tibble: 8 x 4
tata measure1 measure2 year
<chr> <dbl> <dbl> <int>
1 toto1 -0.575 -10.1 2000
2 toto1 -0.575 -10.1 2001
3 toto2 -0.258 17.4 2004
4 toto2 -0.258 17.4 2005
5 toto2 -0.258 17.4 2006
6 toto2 -0.258 17.4 2007
7 toto2 -0.258 17.4 2008
8 toto2 -0.258 17.4 2009
Use by to create a by list L of data.frames, one data.frame per president, and then rbind them together. No packages are used.
L <- by(presidents, presidents$name, with, data.frame(name, year = from:to))
do.call("rbind", setNames(L, NULL))
If you don't mind row names then the last line could be reduced to just:
do.call("rbind", L)
An addition to the tidyverse solutions can be:
df %>%
uncount(to - from + 1) %>%
group_by(name) %>%
transmute(year = seq(first(from), first(to)))
name year
<chr> <dbl>
1 Bill Clinton 1993
2 Bill Clinton 1994
3 Bill Clinton 1995
4 Bill Clinton 1996
5 Bill Clinton 1997
6 Bill Clinton 1998
7 Bill Clinton 1999
8 Bill Clinton 2000
9 Bill Clinton 2001
10 George W. Bush 2001

R duplicate rows based on the elements in a string column [duplicate]

I have a more or less specific question that probably pertains to loops in R. I have a dataframe:
X location year
1 North Dakota, Minnesota, Michigan 2011
2 California, Tennessee 2012
3 Bastrop County (Texas) 2013
4 Dallas (Texas) 2014
5 Shasta (California) 2015
6 California, Oregon, Washington 2011
I have two problems with this data: 1) I need a column that consists of just the state names of each row. I guess this should be generally easy with gsub and using a list of all US state names.
list <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "etc")
pat <- paste0("\\b(", paste0(list, collapse="|"), ")\\b")
pat
data$state <- gsub(data$location, "", paragraph)
The bigger issue for me is 2) I need an individual (duplicate) row for each state that is in the dataset. So if row 6 has California, Oregon and Washington in 2011, I need to have a separate row of each one separately like this:
X location year
1 California 2011
2 Oregon 2011
3 Washington 2011
Thank you for your help!
You can use str_extract_all to extract all the states and unnest to duplicate rows so that each state ends up in a separate row. The built-in constant state.name holds the names of the US states and can be used here to create the pattern.
library(dplyr)
pat <- paste0("\\b", state.name, "\\b", collapse = "|")
df %>%
mutate(states = stringr::str_extract_all(location, pat)) %>%
tidyr::unnest(states)
# A tibble: 11 x 3
# location year states
# <chr> <int> <chr>
# 1 North Dakota, Minnesota, Michigan 2011 North Dakota
# 2 North Dakota, Minnesota, Michigan 2011 Minnesota
# 3 North Dakota, Minnesota, Michigan 2011 Michigan
# 4 California, Tennessee 2012 California
# 5 California, Tennessee 2012 Tennessee
# 6 Bastrop County (Texas) 2013 Texas
# 7 Dallas (Texas) 2014 Texas
# 8 Shasta (California) 2015 California
# 9 California, Oregon, Washington 2011 California
#10 California, Oregon, Washington 2011 Oregon
#11 California, Oregon, Washington 2011 Washington
data
df <- structure(list(location = c("North Dakota, Minnesota, Michigan",
"California, Tennessee", "Bastrop County (Texas)", "Dallas (Texas)",
"Shasta (California)", "California, Oregon, Washington"), year = c(2011L,
2012L, 2013L, 2014L, 2015L, 2011L)), class = "data.frame", row.names = c(NA, -6L))
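For comparison, the same extract-then-expand idea can be sketched without packages, using regmatches() and gregexpr(). This is only an illustration: the two-row frame loc is a cut-down stand-in for the data above, and perl = TRUE is my choice to make the \b word boundaries explicit.

```r
# Cut-down stand-in for the location data (illustrative only)
loc <- data.frame(
  location = c("California, Oregon, Washington", "Dallas (Texas)"),
  year = c(2011L, 2014L),
  stringsAsFactors = FALSE
)

# Alternation of all state names, anchored on word boundaries
pat <- paste0("\\b(", paste(state.name, collapse = "|"), ")\\b")

# One character vector of matched states per input row
hits <- regmatches(loc$location, gregexpr(pat, loc$location, perl = TRUE))

# Repeat each year once per state found in its row
out <- data.frame(
  location = unlist(hits),
  year = rep(loc$year, lengths(hits)),
  stringsAsFactors = FALSE
)
out
```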

Value of a variable matching the first row of another variable by group [duplicate]

As in the title, I would like to have a process that allows me to assign a set of unique values of a first variable, to the most common value of a second variable, matching the first row of a third value. Example:
Name Year Job
Alicia 1990 Butcher
Alicia 1991 Baker
George 1989 Scientist
George 1990 Banker
George 1991 Banker
I would like to easily identify what is the first job each unique Name did:
Name Year Job First Job
Alicia 1990 Butcher Butcher
Alicia 1991 Baker Butcher
George 1989 Scientist Scientist
George 1990 Banker Scientist
George 1991 Banker Scientist
We can use data.table for this:
library(data.table)
setDT(df1)[order(Year),FirstJob:=Job[1],.(Name)][]
## or using which.min instead of ordering as akrun suggested:
# setDT(df1)[,FirstJob:=Job[which.min(Year)], .(Name)][]
#> Name Year Job FirstJob
#> 1: Alicia 1990 Butcher Butcher
#> 2: Alicia 1991 Baker Butcher
#> 3: George 1989 Scientist Scientist
#> 4: George 1990 Banker Scientist
#> 5: George 1991 Banker Scientist
Data:
read.table(text="Name Year Job
Alicia 1990 Butcher
Alicia 1991 Baker
George 1989 Scientist
George 1990 Banker
George 1991 Banker",
header=T, stringsAsFactors=F) -> df1
We can group by 'Name' and extract the first 'Job' to create the new column 'FirstJob'
library(dplyr)
df1 %>%
group_by(Name) %>%
mutate(FirstJob = first(Job))
# A tibble: 5 x 4
# Groups: Name [2]
# Name Year Job FirstJob
# <chr> <int> <chr> <chr>
#1 Alicia 1990 Butcher Butcher
#2 Alicia 1991 Baker Butcher
#3 George 1989 Scientist Scientist
#4 George 1990 Banker Scientist
#5 George 1991 Banker Scientist
If the 'Year' is not ordered
df1 %>%
group_by(Name) %>%
mutate(FirstJob = Job[which.min(Year)])
data
df1 <- structure(list(Name = c("Alicia", "Alicia", "George", "George",
"George"), Year = c(1990L, 1991L, 1989L, 1990L, 1991L), Job = c("Butcher",
"Baker", "Scientist", "Banker", "Banker")), class = "data.frame",
row.names = c(NA,
-5L))
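A package-free sketch of the same idea, assuming "first" means the job in the earliest Year for each Name. The frame jobs is the example data deliberately shuffled to show that input order does not matter; the names sorted and firsts are chosen only for this illustration.

```r
# The example data in shuffled order
jobs <- data.frame(
  Name = c("George", "Alicia", "George", "Alicia", "George"),
  Year = c(1990L, 1991L, 1989L, 1990L, 1991L),
  Job  = c("Banker", "Baker", "Scientist", "Butcher", "Banker"),
  stringsAsFactors = FALSE
)

# Sort a copy by Name and Year, then keep the first row of each Name
sorted <- jobs[order(jobs$Name, jobs$Year), ]
firsts <- sorted[!duplicated(sorted$Name), c("Name", "Job")]

# Look up each row's first job by Name; the original row order is untouched
jobs$FirstJob <- firsts$Job[match(jobs$Name, firsts$Name)]
jobs
```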

Reshaping data from an from-to format into country-year in R

I have a data set containing information about political leaders in the following format:
leader country begin end
clinton USA 1994 2001
bush USA 2002 2009
... ... ... ...
In order to merge it with other data, however, I would like to reshape it into the commonly used country-year format, like the following:
country year leader
USA 1994 clinton
USA 1995 clinton
My current approach (creating an empty data frame and using a nested for-loop) takes a very long time and, to be honest, seems very clumsy. As the dataset is rather large, I am looking for a smarter and more efficient way to do this.
PS: Don't worry about the odd years. Leaders are assigned only to years in which they were already in office at the start of the year, which is why Bush starts in 2002 rather than 2001.
Do you mean something like this?
df <- structure(list(leader = structure(2:1, .Label = c("bush", "clinton"
), class = "factor"), country = structure(c(1L, 1L), .Label = "USA", class = "factor"),
begin = c(1994L, 2002L), end = c(2001L, 2009L)), class = "data.frame", row.names = c(NA,
-2L))
library(dplyr) # group_by, arrange, %>%
library(tidyr) # expand
df %>%
group_by(leader,country) %>%
expand(year=begin:end) %>%
arrange(year)
You get for every country+leader, all the years between begin and end
# A tibble: 16 x 3
# Groups: leader, country [2]
leader country year
<fct> <fct> <int>
1 clinton USA 1994
2 clinton USA 1995
3 clinton USA 1996
4 clinton USA 1997
5 clinton USA 1998
6 clinton USA 1999
7 clinton USA 2000
8 clinton USA 2001
9 bush USA 2002
10 bush USA 2003
11 bush USA 2004
12 bush USA 2005
13 bush USA 2006
14 bush USA 2007
15 bush USA 2008
16 bush USA 2009
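Since the stated goal is to merge the result with other country-year data, here is a hedged base R sketch of the expansion followed by that merge. The frame other and its value column are purely hypothetical placeholders (the numbers mean nothing), and Map() with rep() is my substitution for the tidyr call above.

```r
leaders <- data.frame(
  leader  = c("clinton", "bush"),
  country = "USA",
  begin   = c(1994L, 2002L),
  end     = c(2001L, 2009L),
  stringsAsFactors = FALSE
)

# Expand each from-to row into one row per year
n_years <- leaders$end - leaders$begin + 1L
panel <- data.frame(
  country = rep(leaders$country, n_years),
  year    = unlist(Map(seq, leaders$begin, leaders$end)),
  leader  = rep(leaders$leader, n_years),
  stringsAsFactors = FALSE
)

# Hypothetical country-year table to merge with (placeholder values)
other <- data.frame(country = "USA", year = c(2001L, 2002L), value = c(1.5, 2.5))

# Left join on the country-year key; years without a match get NA
merged <- merge(panel, other, by = c("country", "year"), all.x = TRUE)
head(merged)
```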

A problem with multiple conditions in R that does not work

I am having a problem with multiple conditions in R.
My data is like this:
Region in UK Year Third column (year.city)
Liverpool 2008
Manchester 2010
Liverpool 2016
Chester 2015
Birmingham 2016
Blackpool 2012
Birmingham 2005
Chester 2009
Liverpool 2005
Hull 2011
Leeds 2013
Liverpool 2014
Bradford 2008
London 2010
Coventry 2009
Cardiff 2016
Liverpool 2007
What I want to create is a third column that contains four groups: Liverpool before 2010, Liverpool after 2010, other cities before 2010, and other cities after 2010. I tried a couple of approaches, such as mutate, but could not solve it. Could you please help me do it?
Thanks
I would do this as @dvibisan suggested and use dplyr.
# Create a dataframe
df <- structure(list(`Region in UK` = c("Liverpool", "Manchester", "Liverpool",
"Chester", "Birmingham", "Blackpool", "Birmingham", "Chester",
"Liverpool", "Hull", "Leeds", "Liverpool", "Bradford", "London",
"Coventry", "Cardiff", "Liverpool"),
Year = c(2008L, 2010L, 2016L, 2015L, 2016L, 2012L, 2005L, 2009L, 2005L, 2011L, 2013L, 2014L, 2008L, 2010L, 2009L, 2016L, 2007L)),
row.names = c(NA, -17L), class = c("data.table", "data.frame"))
# Load the dplyr library to use mutate and if_else (if there were more than 2 conditions of interest for each variable could use case_when)
library(dplyr)
# Create a new column using mutate, pasting together two conditions
df <-
df %>%
mutate(`Third column (year.city)` = paste0(if_else(grepl("Liverpool", `Region in UK`, fixed = TRUE), `Region in UK`, "Other cities"),
if_else(Year < 2010, " before 2010", " 2010 or after")))
The easiest way I think is using vectorisation with base R:
# create index of categories
vec <- c("Other cities after 2010", "Liverpool after 2010", "Other cities before 2010", "Liverpool before 2010")
# create index vector
ix <- 1 + (df$Region.in.UK == "Liverpool") + 2*(df$Year < 2010)
# index the categories-vector with the index-vector
df$year.city <- vec[ix]
The result:
> df
Region.in.UK Year year.city
1 Liverpool 2008 Liverpool before 2010
2 Manchester 2010 Other cities after 2010
3 Liverpool 2016 Liverpool after 2010
4 Chester 2015 Other cities after 2010
5 Birmingham 2016 Other cities after 2010
6 Blackpool 2012 Other cities after 2010
7 Birmingham 2005 Other cities before 2010
8 Chester 2009 Other cities before 2010
9 Liverpool 2005 Liverpool before 2010
10 Hull 2011 Other cities after 2010
11 Leeds 2013 Other cities after 2010
12 Liverpool 2014 Liverpool after 2010
13 Bradford 2008 Other cities before 2010
14 London 2010 Other cities after 2010
15 Coventry 2009 Other cities before 2010
16 Cardiff 2016 Other cities after 2010
17 Liverpool 2007 Liverpool before 2010
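Note that the index arithmetic above places the year 2010 itself in the "after 2010" groups. A minimal sketch that builds the label from two separate ifelse() calls and spells out that boundary is shown here; the four-row frame uk and the wording "2010 or after" are assumptions made for this illustration.

```r
# Cut-down stand-in for the city data (illustrative only)
uk <- data.frame(
  Region = c("Liverpool", "Manchester", "Liverpool", "Chester"),
  Year   = c(2008L, 2010L, 2016L, 2009L),
  stringsAsFactors = FALSE
)

# Each condition is evaluated once for the whole column
city   <- ifelse(uk$Region == "Liverpool", "Liverpool", "Other cities")
period <- ifelse(uk$Year < 2010, "before 2010", "2010 or after")

# Pasting the two parts yields the four possible groups
uk$year.city <- paste(city, period)
uk
```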
Try this
Region_in_UK = c( "Liverpool", "Manchester", "Liverpool", "Chester",
"Birmingham", "Blackpool", "Birmingham", "Chester", "Liverpool", "Hull",
"Leeds", "Liverpool", "Bradford", "London", "Coventry", "Cardiff", "Liverpool")
Year = c(2008, 2010, 2016, 2015, 2016, 2012, 2005, 2009, 2005, 2011, 2013,
2014, 2008, 2010, 2009, 2016, 2007)
df = data.frame(Region_in_UK, Year)
# erase the code above and replace your own dataframe if its bigger
# than the data you displayed at this point and name it "df" (e.g.:
# df = your_dataframe)
library(dplyr) # provides mutate()
df$year_city = rep(NA, dim(df)[1])
df = mutate(df, year_city =
ifelse (grepl("Liverpool", df$Region_in_UK) & df$Year < 2010,
"Liverpool before 2010", df$year_city))
df = mutate(df, year_city =
ifelse (grepl("Liverpool", df$Region_in_UK) & df$Year >= 2010,
"Liverpool 2010 and after", df$year_city))
df = mutate(df, year_city =
ifelse (!grepl("Liverpool", df$Region_in_UK) & df$Year < 2010,
"Other before 2010", df$year_city))
df = mutate(df, year_city =
ifelse (!grepl("Liverpool", df$Region_in_UK) & df$Year >= 2010,
"Other 2010 and after", df$year_city))
Using base R you could do:
transform(df, year.city = factor(paste(sub('^((?!Liver).)*$', 'other', Region_in_UK,perl = TRUE), Year>2010), label=1:4))
Region_in_UK Year year.city
1 Liverpool 2008 1
2 Manchester 2010 3
3 Liverpool 2016 2
4 Chester 2015 4
5 Birmingham 2016 4
6 Blackpool 2012 4
7 Birmingham 2005 3
8 Chester 2009 3
9 Liverpool 2005 1
10 Hull 2011 4
11 Leeds 2013 4
12 Liverpool 2014 2
13 Bradford 2008 3
14 London 2010 3
15 Coventry 2009 3
16 Cardiff 2016 4
17 Liverpool 2007 1
You can also do:
transform(df,m=factor(paste(!grepl("Liverpool",Region_in_UK),Year>2010),label=1:4))
or
transform(df,m = factor(paste(sub('(Liverpool)|.*','\\1',Region_in_UK),Year<=2010),label=4:1))
Region_in_UK Year m
1 Liverpool 2008 1
2 Manchester 2010 3
3 Liverpool 2016 2
4 Chester 2015 4
5 Birmingham 2016 4
6 Blackpool 2012 4
7 Birmingham 2005 3
8 Chester 2009 3
9 Liverpool 2005 1
10 Hull 2011 4
11 Leeds 2013 4
12 Liverpool 2014 2
13 Bradford 2008 3
14 London 2010 3
15 Coventry 2009 3
16 Cardiff 2016 4
17 Liverpool 2007 1
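These one-liners rely on factor() sorting the pasted strings alphabetically, and label = 1:4 assumes that all four combinations occur in the data. A hedged sketch that pins the mapping down with explicit levels follows; the four-row frame uk is a made-up subset, and the level strings are simply what paste() produces here.

```r
# Made-up subset of the city data
uk <- data.frame(
  Region_in_UK = c("Liverpool", "Manchester", "Liverpool", "Chester"),
  Year = c(2008, 2010, 2016, 2015),
  stringsAsFactors = FALSE
)

# City part and period part, pasted into one string per row
grp <- paste(
  ifelse(grepl("Liverpool", uk$Region_in_UK), "Liverpool", "other"),
  uk$Year > 2010
)

# Explicit levels fix which combination receives which code,
# even if one of the four combinations is missing from the data
uk$m <- factor(
  grp,
  levels = c("Liverpool FALSE", "Liverpool TRUE", "other FALSE", "other TRUE"),
  labels = 1:4
)
uk
```

As in the answers above, Year > 2010 puts 2010 itself in the "before" groups.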
