A problem with multiple conditions in R that does not work - r

I am having a problem with multiple conditions in R.
My data is like this:
Region in UK Year Third column (year.city)
Liverpool 2008
Manchester 2010
Liverpool 2016
Chester 2015
Birmingham 2016
Blackpool 2012
Birmingham 2005
Chester 2009
Liverpool 2005
Hull 2011
Leeds 2013
Liverpool 2014
Bradford 2008
London 2010
Coventry 2009
Cardiff 2016
Liverpool 2007
What I want to create is a third column that contains four groups: Liverpool before 2010, Liverpool after 2010, other cities before 2010, and other cities after 2010. I tried a couple of approaches, such as mutate, but could not solve it. Could you please help me do it?
Thanks

I would do this as @dvibisan suggested and use dplyr.
# Create a dataframe
df <- structure(list(`Region in UK` = c("Liverpool", "Manchester", "Liverpool",
"Chester", "Birmingham", "Blackpool", "Birmingham", "Chester",
"Liverpool", "Hull", "Leeds", "Liverpool", "Bradford", "London",
"Coventry", "Cardiff", "Liverpool"),
Year = c(2008L, 2010L, 2016L, 2015L, 2016L, 2012L, 2005L, 2009L, 2005L, 2011L, 2013L, 2014L, 2008L, 2010L, 2009L, 2016L, 2007L)),
row.names = c(NA, -17L), class = c("data.table", "data.frame"))
# Load the dplyr library to use mutate and if_else (if there were more than 2 conditions of interest for each variable could use case_when)
library(dplyr)
# Create a new column using mutate, pasting together two conditions
df <-
df %>%
mutate(`Third column (year.city)` = paste0(if_else(grepl("Liverpool", `Region in UK`, fixed = TRUE), `Region in UK`, "Other cities"),
if_else(Year < 2010, " before 2010", " 2010 or after")))
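The code comment above mentions case_when for situations with more than two conditions per variable. A minimal sketch of that variant follows, assuming the same column names as the question and using only four of its rows; the label wording is illustrative, not the asker's required text.

```r
library(dplyr)

# Four rows taken from the question's data, with the original column names
df <- data.frame(
  `Region in UK` = c("Liverpool", "Manchester", "Liverpool", "Chester"),
  Year = c(2008L, 2010L, 2016L, 2009L),
  check.names = FALSE
)

# Each condition is spelled out, so a row with a missing Year or city
# falls through to NA instead of landing in the wrong group
df <- df %>%
  mutate(`Third column (year.city)` = case_when(
    `Region in UK` == "Liverpool" & Year <  2010 ~ "Liverpool before 2010",
    `Region in UK` == "Liverpool" & Year >= 2010 ~ "Liverpool 2010 or after",
    `Region in UK` != "Liverpool" & Year <  2010 ~ "Other cities before 2010",
    `Region in UK` != "Liverpool" & Year >= 2010 ~ "Other cities 2010 or after"
  ))

df
```

Compared with pasting two if_else results together, this form makes it easier to add a third time period or a second named city later.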

I think the easiest way is to use vectorisation in base R:
# create the vector of category labels
vec <- c("Other cities after 2010", "Liverpool after 2010", "Other cities before 2010", "Liverpool before 2010")
# create the index vector (TRUE counts as 1, FALSE as 0)
ix <- 1 + (df$Region.in.UK == "Liverpool") + 2*(df$Year < 2010)
# index the categories-vector with the index-vector
df$year.city <- vec[ix]
The result:
> df
Region.in.UK Year year.city
1 Liverpool 2008 Liverpool before 2010
2 Manchester 2010 Other cities after 2010
3 Liverpool 2016 Liverpool after 2010
4 Chester 2015 Other cities after 2010
5 Birmingham 2016 Other cities after 2010
6 Blackpool 2012 Other cities after 2010
7 Birmingham 2005 Other cities before 2010
8 Chester 2009 Other cities before 2010
9 Liverpool 2005 Liverpool before 2010
10 Hull 2011 Other cities after 2010
11 Leeds 2013 Other cities after 2010
12 Liverpool 2014 Liverpool after 2010
13 Bradford 2008 Other cities before 2010
14 London 2010 Other cities after 2010
15 Coventry 2009 Other cities before 2010
16 Cardiff 2016 Other cities after 2010
17 Liverpool 2007 Liverpool before 2010
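To see why the index arithmetic in this answer picks the right label, it may help to enumerate every combination of the two conditions. This is only an illustrative sketch; the names combos, is_liverpool and before_2010 are made up for the demonstration.

```r
vec <- c("Other cities after 2010", "Liverpool after 2010",
         "Other cities before 2010", "Liverpool before 2010")

# Every combination of the two logical conditions
combos <- expand.grid(is_liverpool = c(FALSE, TRUE),
                      before_2010  = c(FALSE, TRUE))

# Logical values are coerced to 0/1, so the index runs from 1 to 4
combos$ix    <- 1 + combos$is_liverpool + 2 * combos$before_2010
combos$label <- vec[combos$ix]

combos
```

Because the second condition is Year < 2010, the "after 2010" labels also cover the year 2010 itself.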

Try this
Region_in_UK = c( "Liverpool", "Manchester", "Liverpool", "Chester",
"Birmingham", "Blackpool", "Birmingham", "Chester", "Liverpool", "Hull",
"Leeds", "Liverpool", "Bradford", "London", "Coventry", "Cardiff", "Liverpool")
Year = c(2008, 2010, 2016, 2015, 2016, 2012, 2005, 2009, 2005, 2011, 2013,
2014, 2008, 2010, 2009, 2016, 2007)
df = data.frame(Region_in_UK, Year)
# delete the code above and substitute your own dataframe if it is bigger
# than the data you displayed, and name it "df" (e.g.:
# df = your_dataframe)
library(dplyr) # mutate() comes from dplyr
df$year_city = rep(NA, dim(df)[1])
df = mutate(df, year_city =
ifelse (grepl("Liverpool", df$Region_in_UK) & df$Year < 2010,
"Liverpool before 2010", df$year_city))
df = mutate(df, year_city =
ifelse (grepl("Liverpool", df$Region_in_UK) & df$Year >= 2010,
"Liverpool 2010 and after", df$year_city))
df = mutate(df, year_city =
ifelse (!grepl("Liverpool", df$Region_in_UK) & df$Year < 2010,
"Other before 2010", df$year_city))
df = mutate(df, year_city =
ifelse (!grepl("Liverpool", df$Region_in_UK) & df$Year >= 2010,
"Other 2010 and after", df$year_city))

Using base R you could do:
transform(df, year.city = factor(paste(sub('^((?!Liver).)*$', 'other', Region_in_UK,perl = TRUE), Year>2010), label=1:4))
Region_in_UK Year year.city
1 Liverpool 2008 1
2 Manchester 2010 3
3 Liverpool 2016 2
4 Chester 2015 4
5 Birmingham 2016 4
6 Blackpool 2012 4
7 Birmingham 2005 3
8 Chester 2009 3
9 Liverpool 2005 1
10 Hull 2011 4
11 Leeds 2013 4
12 Liverpool 2014 2
13 Bradford 2008 3
14 London 2010 3
15 Coventry 2009 3
16 Cardiff 2016 4
17 Liverpool 2007 1
You can also do:
transform(df,m=factor(paste(!grepl("Liverpool",Region_in_UK),Year>2010),label=1:4))
or
transform(df,m = factor(paste(sub('(Liverpool)|.*','\\1',Region_in_UK),Year<=2010),label=4:1))
Region_in_UK Year m
1 Liverpool 2008 1
2 Manchester 2010 3
3 Liverpool 2016 2
4 Chester 2015 4
5 Birmingham 2016 4
6 Blackpool 2012 4
7 Birmingham 2005 3
8 Chester 2009 3
9 Liverpool 2005 1
10 Hull 2011 4
11 Leeds 2013 4
12 Liverpool 2014 2
13 Bradford 2008 3
14 London 2010 3
15 Coventry 2009 3
16 Cardiff 2016 4
17 Liverpool 2007 1
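The numeric codes above depend on how factor() sorts the pasted keys. A small sketch of the same idea with explicit levels and descriptive labels is shown below; the label text is an assumption for illustration, and it keeps this answer's Year > 2010 split, which puts 2010 in the earlier group.

```r
Region_in_UK <- c("Liverpool", "Manchester", "Liverpool", "Chester")
Year <- c(2008, 2010, 2016, 2015)
df <- data.frame(Region_in_UK, Year)

# First part: TRUE when the city is NOT Liverpool; second part: TRUE when Year > 2010
key <- paste(!grepl("Liverpool", df$Region_in_UK), df$Year > 2010)

# Spell out the levels so the mapping does not rely on sort order
# or on every combination being present in the data
lev <- c("FALSE FALSE", "FALSE TRUE", "TRUE FALSE", "TRUE TRUE")
lab <- c("Liverpool up to 2010", "Liverpool after 2010",
         "Other up to 2010", "Other after 2010")

df$year.city <- factor(key, levels = lev, labels = lab)
df
```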

Related

Create a dataframe from some rows of an already existing dataframe in R

I need to create a new dataframe from the data of 4 different countries of this already existing dataframe:
Country Happiness.Score GDP year
Switzerland 7.587 1.39651 2015
Iceland 7.561 1.30232 2015
Denmark 7.527 1.32548 2015
Norway 7.522 1.459 2015
Canada 7.427 1.32629 2015
Finland 7.406 1.29025 2015
Netherlands 7.378 1.32944 2015
Sweden 7.364 1.33171 2015
New Zealand 7.286 1.25018 2015
Australia 7.284 1.33358 2015
Israel 7.278 1.22857 2015
Costa Rica 7.226 0.95578 2015
Austria 7.2 1.33723 2015
Mexico 7.187 1.02054 2015
United States 7.119 1.39451 2015
Brazil 6.983 0.98124 2015
Ireland 6.94 1.33596 2015
Belgium 6.937 1.30782 2015
Note: The dataframe above is just an example. The original dataframe has many more countries, and each one has data for the following years: 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022
Something like
selected_countries <- c("Iceland", "Norway", "Spain", "Nigeria")
new_dd <- dd[dd$Country %in% selected_countries, ]
or
new_dd <- subset(dd, Country %in% selected_countries)
or
library(dplyr)
new_dd <- dd %>% filter(Country %in% selected_countries)
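As a self-contained illustration of the %in% filter, here is a sketch built from the first four rows of the example table. The object name dd follows the answer; missing_countries is a hypothetical helper added to show which requested names had no rows.

```r
dd <- data.frame(
  Country = c("Switzerland", "Iceland", "Denmark", "Norway"),
  Happiness.Score = c(7.587, 7.561, 7.527, 7.522),
  GDP = c(1.39651, 1.30232, 1.32548, 1.459),
  year = c(2015, 2015, 2015, 2015)
)

selected_countries <- c("Iceland", "Norway", "Spain", "Nigeria")

# Keep only the rows whose Country appears in the selection
new_dd <- dd[dd$Country %in% selected_countries, ]

# Names without a matching row are dropped silently, so list them explicitly
missing_countries <- setdiff(selected_countries, dd$Country)

new_dd
missing_countries
```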

r deleting certain rows of dataframe based on multiple columns

I have a dataframe that looks like this:
Year Name Place Job
2010 Jim USA CEO
2010 Jim Canada Advisor
2010 Jim Canada Board Member
2011 Jim USA CEO
2017 Peter Mexico COO
2019 Peter Korea CEO
2019 Peter China Advisor
2013 Harry USA Chairman
2014 Harry Canada CEO
2015 Harry Canada CEO
2015 Harry Canada Advisor
I want to remove certain rows from the above dataframe based on the "Year" and "Name" columns. Basically, every "Year/Name" combination that occurs in the list below (in dataframe format) should be removed:
Year Name
2010 Jim
2019 Peter
2013 Harry
2014 Harry
Thus, the final output looks like:
Year Name Place Job
2011 Jim USA CEO
2017 Peter Mexico COO
2015 Harry Canada CEO
2015 Harry Canada Advisor
base R
While dplyr (below) has anti_join, in base R one needs to merge and find the rows that did not match and remove them by hand.
# using the `rem` frame, augmenting a little
rem$keep <- FALSE
tmp <- merge(dat, rem, by = c("Year", "Name"), all.x = TRUE)
tmp
# Year Name Place Job keep
# 1 2010 Jim USA CEO FALSE
# 2 2010 Jim Canada Advisor FALSE
# 3 2010 Jim Canada Board Member FALSE
# 4 2011 Jim USA CEO NA
# 5 2013 Harry USA Chairman FALSE
# 6 2014 Harry Canada CEO FALSE
# 7 2015 Harry Canada CEO NA
# 8 2015 Harry Canada Advisor NA
# 9 2017 Peter Mexico COO NA
# 10 2019 Peter Korea CEO FALSE
# 11 2019 Peter China Advisor FALSE
tmp[ is.na(tmp$keep), ]
# Year Name Place Job keep
# 4 2011 Jim USA CEO NA
# 7 2015 Harry Canada CEO NA
# 8 2015 Harry Canada Advisor NA
# 9 2017 Peter Mexico COO NA
dplyr
dplyr::anti_join(dat, rem, by = c("Year", "Name"))
# Year Name Place Job
# 1 2011 Jim USA CEO
# 2 2017 Peter Mexico COO
# 3 2015 Harry Canada CEO
# 4 2015 Harry Canada Advisor
Data
dat <- structure(list(Year = c(2010L, 2010L, 2010L, 2011L, 2017L, 2019L, 2019L, 2013L, 2014L, 2015L, 2015L), Name = c("Jim", "Jim", "Jim", "Jim", "Peter", "Peter", "Peter", "Harry", "Harry", "Harry", "Harry"), Place = c("USA", "Canada", "Canada", "USA", "Mexico", "Korea", "China", "USA", "Canada", "Canada", "Canada"), Job = c("CEO", "Advisor", "Board Member", "CEO", "COO", "CEO", "Advisor", "Chairman", "CEO", "CEO", "Advisor")), row.names = c(NA, -11L), class = "data.frame")
rem <- structure(list(Year = c(2010L, 2019L, 2013L, 2014L), Name = c("Jim", "Peter", "Harry", "Harry")), class = "data.frame", row.names = c(NA, -4L))
Using data.table (with the dat and rem objects defined above)
library(data.table)
setDT(dat)[!rem, on = .(Year, Name)]
Output:
# Year Name Place Job
#1: 2011 Jim USA CEO
#2: 2017 Peter Mexico COO
#3: 2015 Harry Canada CEO
#4: 2015 Harry Canada Advisor
Another approach:
library(dplyr)
library(stringr)
dat %>% mutate(x = str_c(Year, Name)) %>%
filter(str_detect(x, str_c(str_c(rem$Year,rem$Name), collapse = '|'), negate = TRUE)) %>%
select(-x)
Year Name Place Job
1 2011 Jim USA CEO
2 2017 Peter Mexico COO
3 2015 Harry Canada CEO
4 2015 Harry Canada Advisor
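We could use str_c to build a combined Year/Name key, so that rows are dropped only when both columns match:
library(dplyr)
library(stringr)
dat %>%
filter(!str_c(Year, Name) %in% str_c(rem$Year, rem$Name))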
Output:
Year Name Place Job
1 2011 Jim USA CEO
2 2017 Peter Mexico COO
3 2015 Harry Canada CEO
4 2015 Harry Canada Advisor
A base R option using merge + subset
subset(
merge(
dat,
cbind(rem, Removal = 1),
all = TRUE
),
is.na(Removal),
select = -Removal
)
gives
Year Name Place Job
4 2011 Jim USA CEO
7 2015 Harry Canada CEO
8 2015 Harry Canada Advisor
9 2017 Peter Mexico COO
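The key-matching idea behind several of the answers above can also be written in base R without a merge. The following is a minimal sketch, assuming the dat and rem frames from the Data section; key_dat and key_rem are names introduced here for illustration.

```r
dat <- data.frame(
  Year  = c(2010L, 2010L, 2010L, 2011L, 2017L, 2019L, 2019L, 2013L, 2014L, 2015L, 2015L),
  Name  = c("Jim", "Jim", "Jim", "Jim", "Peter", "Peter", "Peter",
            "Harry", "Harry", "Harry", "Harry"),
  Place = c("USA", "Canada", "Canada", "USA", "Mexico", "Korea", "China",
            "USA", "Canada", "Canada", "Canada"),
  Job   = c("CEO", "Advisor", "Board Member", "CEO", "COO", "CEO", "Advisor",
            "Chairman", "CEO", "CEO", "Advisor")
)
rem <- data.frame(Year = c(2010L, 2019L, 2013L, 2014L),
                  Name = c("Jim", "Peter", "Harry", "Harry"))

# Build one key per row from both columns; the separator keeps
# the year and the name from running together
key_dat <- paste(dat$Year, dat$Name, sep = "_")
key_rem <- paste(rem$Year, rem$Name, sep = "_")

# Keep the rows whose key is not in the removal list
res <- dat[!key_dat %in% key_rem, ]
res
```

Unlike the merge-based versions, this keeps the rows in their original order and needs no helper column.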

R duplicate rows based on the elements in a string column [duplicate]

This question already has answers here:
str_extract_all: return all patterns found in string concatenated as vector
(2 answers)
Closed 2 years ago.
I have a more or less specific question that probably pertains to loops in R. I have a dataframe:
X location year
1 North Dakota, Minnesota, Michigan 2011
2 California, Tennessee 2012
3 Bastrop County (Texas) 2013
4 Dallas (Texas) 2014
5 Shasta (California) 2015
6 California, Oregon, Washington 2011
I have two problems with this data: 1) I need a column that consists of just the state names of each row. I guess this should be generally easy with gsub and using a list of all US state names.
list <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "etc")
pat <- paste0("\\b(", paste0(list, collapse="|"), ")\\b")
pat
data$state <- gsub(data$location, "", paragraph)
The bigger issue for me is 2) I need an individual (duplicate) row for each state that is in the dataset. So if row 6 has California, Oregon and Washington in 2011, I need to have a separate row of each one separately like this:
X location year
1 California 2011
2 Oregon 2011
3 Washington 2011
Thank you for your help!
You can use str_extract_all to extract all the states and unnest to duplicate the rows so that each state ends up in a separate row. R has a built-in constant, state.name, which holds the names of the US states and can be used here to create the pattern.
library(dplyr)
pat <- paste0("\\b", state.name, "\\b", collapse = "|")
df %>%
mutate(states = stringr::str_extract_all(location, pat)) %>%
tidyr::unnest(states)
# A tibble: 11 x 3
# location year states
# <chr> <int> <chr>
# 1 North Dakota, Minnesota, Michigan 2011 North Dakota
# 2 North Dakota, Minnesota, Michigan 2011 Minnesota
# 3 North Dakota, Minnesota, Michigan 2011 Michigan
# 4 California, Tennessee 2012 California
# 5 California, Tennessee 2012 Tennessee
# 6 Bastrop County (Texas) 2013 Texas
# 7 Dallas (Texas) 2014 Texas
# 8 Shasta (California) 2015 California
# 9 California, Oregon, Washington 2011 California
#10 California, Oregon, Washington 2011 Oregon
#11 California, Oregon, Washington 2011 Washington
data
df <- structure(list(location = c("North Dakota, Minnesota, Michigan",
"California, Tennessee", "Bastrop County (Texas)", "Dallas (Texas)",
"Shasta (California)", "California, Oregon, Washington"), year = c(2011L,
2012L, 2013L, 2014L, 2015L, 2011L)), class = "data.frame", row.names = c(NA, -6L))
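For readers who prefer to avoid the extra packages, the same extract-then-repeat logic can be sketched in base R with gregexpr and regmatches. This is an illustrative alternative under the same data, not the answer's own method; out and states are names chosen here.

```r
df <- data.frame(
  location = c("North Dakota, Minnesota, Michigan", "California, Tennessee",
               "Bastrop County (Texas)", "Dallas (Texas)",
               "Shasta (California)", "California, Oregon, Washington"),
  year = c(2011L, 2012L, 2013L, 2014L, 2015L, 2011L)
)

# One alternation of all state names, bounded by word boundaries
pat <- paste0("\\b(", paste(state.name, collapse = "|"), ")\\b")

# A list with one character vector of matched states per row
states <- regmatches(df$location, gregexpr(pat, df$location, perl = TRUE))

# Repeat each row once per matched state; rows without a match drop out
n <- lengths(states)
out <- data.frame(
  location = rep(df$location, n),
  year     = rep(df$year, n),
  state    = unlist(states)
)
out
```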

Interpolate based on multiple conditions in r

Beginner R user here. I have a dataset of yearly employment numbers for different industry classifications and different subregions. For some observations, the number of employees is missing (NA). I would like to fill these values through linear interpolation (using na.approx or some other method). However, I only want to interpolate within the same industry classification and subregion.
For example, I have this:
subregion <- c("East Bay", "East Bay", "East Bay", "East Bay", "East Bay", "South Bay")
industry <-c("A","A","A","A","A","B" )
year <- c(2013, 2014, 2015, 2016, 2017, 2002)
emp <- c(50, NA, NA, 80,NA, 300)
data <- data.frame(cbind(subregion,industry,year, emp))
subregion industry year emp
1 East Bay A 2013 50
2 East Bay A 2014 <NA>
3 East Bay A 2015 <NA>
4 East Bay A 2016 80
5 East Bay A 2017 <NA>
6 South Bay B 2002 300
I need to generate this table, leaving the fifth observation uninterpolated because the next observation belongs to a different subregion and industry.
subregion industry year emp
1 East Bay A 2013 50
2 East Bay A 2014 60
3 East Bay A 2015 70
4 East Bay A 2016 80
5 East Bay A 2017 <NA>
6 South Bay B 2002 300
Articles like this have been helpful, but I cannot figure out how to adapt the solution to match the requirement that two columns be the same for interpolation to occur, instead of one. Any help would be appreciated.
We could do a grouped na.approx (from zoo):
library(tidyverse)
data %>%
group_by(subregion, industry) %>%
mutate(emp = zoo::na.approx(emp, na.rm = FALSE))
# A tibble: 6 x 4
# Groups: subregion, industry [2]
# subregion industry year emp
# <fct> <fct> <dbl> <dbl>
#1 East Bay A 2013 50
#2 East Bay A 2014 60
#3 East Bay A 2015 70
#4 East Bay A 2016 80
#5 East Bay A 2017 NA
#6 South Bay B 2002 300
data
data <- data.frame(subregion,industry,year, emp)
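If installing zoo is not an option, a rough base R equivalent can be sketched with ave and approx. The helper interp_group and the column emp_filled are hypothetical names, and the sketch interpolates over row position within each group, which assumes the rows are already sorted by year with no gaps.

```r
subregion <- c("East Bay", "East Bay", "East Bay", "East Bay", "East Bay", "South Bay")
industry  <- c("A", "A", "A", "A", "A", "B")
year      <- c(2013, 2014, 2015, 2016, 2017, 2002)
emp       <- c(50, NA, NA, 80, NA, 300)
data <- data.frame(subregion, industry, year, emp)

# Linear interpolation inside one group; leading and trailing NAs stay NA
interp_group <- function(v) {
  ok <- !is.na(v)
  if (sum(ok) < 2) return(v)  # approx() needs at least two known points
  approx(x = which(ok), y = v[ok], xout = seq_along(v))$y
}

# ave() applies the helper separately within each subregion/industry pair
data$emp_filled <- ave(data$emp, data$subregion, data$industry,
                       FUN = interp_group)
data
```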

How to create two-way table with character variables in r

I have the following data frame in R:
df <- data.frame(Year = c(2011, 2012, 2013, 2011, 2012, 2013, 2011, 2012, 2013),
Country = c("England", "England", "England", "French", "French", "French", "Germany", "Germany", "Germany"),
Pop = c(53.107, 53.493, 53.865, 63.070, 63.375, 63.697, 80.328, 80.524, 80.767))
# df
# Year Country Pop
# 1 2011 England 53.107
# 2 2012 England 53.493
# 3 2013 England 53.865
# 4 2011 French 63.070
# 5 2012 French 63.375
# 6 2013 French 63.697
# 7 2011 Germany 80.328
# 8 2012 Germany 80.524
# 9 2013 Germany 80.767
I would like to get the following table:
Year
2011 2012 2013
Country Pop Country Pop Country Pop
England 53,107 England 53,493 England 53,865
French 63,07 French 63,375 French 63,697
Germany 80,328 Germany 80,524 Germany 80,767
Will this do?
> xtabs(Pop ~ Country + as.factor(Year), df)
as.factor(Year)
Country 2011 2012 2013
England 53.107 53.493 53.865
French 63.070 63.375 63.697
Germany 80.328 80.524 80.767
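If row and column totals are also wanted, the xtabs result can be passed to addmargins from base R. A short sketch using the question's data follows; tab and tab_with_totals are names chosen here.

```r
df <- data.frame(
  Year = c(2011, 2012, 2013, 2011, 2012, 2013, 2011, 2012, 2013),
  Country = c("England", "England", "England", "French", "French", "French",
              "Germany", "Germany", "Germany"),
  Pop = c(53.107, 53.493, 53.865, 63.070, 63.375, 63.697, 80.328, 80.524, 80.767)
)

# Two-way table of Pop by Country (rows) and Year (columns)
tab <- xtabs(Pop ~ Country + Year, df)

# Append a "Sum" row and a "Sum" column
tab_with_totals <- addmargins(tab)
tab_with_totals
```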
Solution with dplyr + tidyr:
library(dplyr)
library(tidyr)
df_reshaped = df %>%
mutate(Year = paste0("Pop_", Year)) %>%
spread(Year, Pop)
compute_margins = df_reshaped %>%
summarize_if(is.numeric, sum, na.rm = TRUE) %>%
as.list(.) %>%
c(Country = "Total") %>%
bind_rows(df_reshaped, .) %>%
mutate(Total = rowSums(.[2:4]))
Result:
> df_reshaped
Country Pop_2011 Pop_2012 Pop_2013
1 England 53.107 53.493 53.865
2 French 63.070 63.375 63.697
3 Germany 80.328 80.524 80.767
> compute_margins
Country Pop_2011 Pop_2012 Pop_2013 Total
1 England 53.107 53.493 53.865 160.465
2 French 63.070 63.375 63.697 190.142
3 Germany 80.328 80.524 80.767 241.619
4 Total 196.505 197.392 198.329 592.226
To get the format you want, you can do the following:
Map(function(x, y){
temp = cbind(compute_margins[1], x)
names(temp)[2] = y
return(temp)
}, compute_margins[2:4], names(compute_margins)[2:4]) %>%
unname() %>%
do.call(cbind, .) %>%
cbind(compute_margins[5])
Result:
Country Pop_2011 Country Pop_2012 Country Pop_2013 Total
1 England 53.107 England 53.493 England 53.865 160.465
2 French 63.070 French 63.375 French 63.697 190.142
3 Germany 80.328 Germany 80.524 Germany 80.767 241.619
4 Total 196.505 Total 197.392 Total 198.329 592.226
