Creating a variable from two other categorical variables in R

I looked for similar questions/answers on StackOverflow, but I could not find the answer I am looking for.
I have two categorical variables -
Region of education
Educational residence
Both variables (region of education and educational residence) are in one dataset after merging two different datasets.
Region of education has values such as Western, Non-western, and Unknown.
Educational residence has only two values: In the USA and Out of the USA.
Now I have to make a new categorical variable by recoding: Western becomes Education from Western countries, Non-Western becomes Education from Non-Western countries, and Unknown stays Unknown, while from the educational residence I only need to use In the USA.
So in the end, the new variable will have four options:
Education from western countries,
Education from non-western countries,
Education from the USA,
Unknown
Does anyone have an idea how to do this?
I apologise as I cannot post the data output due to ethical and legal issues.
I will be very thankful for any kind of help.

You don't need to share real data to get a pointer to the correct answer; a simple reproducible example should suffice. According to your description, the relevant columns in your data frame should look something like this:
set.seed(1)
df <- data.frame(Region = sample(c("Western", "Non-Western", "Unknown"), 10, TRUE),
Residence = sample(c("USA", "Non-USA"), 10, TRUE))
df
#> Region Residence
#> 1 Western USA
#> 2 Unknown USA
#> 3 Western USA
#> 4 Non-Western USA
#> 5 Western Non-USA
#> 6 Unknown Non-USA
#> 7 Unknown Non-USA
#> 8 Non-Western Non-USA
#> 9 Non-Western USA
#> 10 Unknown USA
We can smoosh these columns together using ifelse. Where the Residence column is "USA", the output will be "USA", and otherwise it will retain the "Western", "Non-Western" and "Unknown" levels from the Region column:
df$Education <- ifelse(df$Residence == "USA", "USA", df$Region)
df
#> Region Residence Education
#> 1 Western USA USA
#> 2 Unknown USA USA
#> 3 Western USA USA
#> 4 Non-Western USA USA
#> 5 Western Non-USA Western
#> 6 Unknown Non-USA Unknown
#> 7 Unknown Non-USA Unknown
#> 8 Non-Western Non-USA Non-Western
#> 9 Non-Western USA USA
#> 10 Unknown USA USA
Created on 2022-08-20 with reprex v2.0.2
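If the longer labels from the question are needed, the same ifelse idea can be combined with a named lookup vector. This is only a sketch under stated assumptions: the small df below is invented for illustration, the label wording is taken from the question, and the level order is a guess at what the asker wants.

```r
# Hypothetical example data (not the asker's real data)
df <- data.frame(
  Region = c("Western", "Non-Western", "Unknown", "Western"),
  Residence = c("Non-USA", "Non-USA", "Non-USA", "USA"),
  stringsAsFactors = FALSE
)

# Named vector mapping each Region value to the label requested in the question
labels <- c(
  "Western" = "Education from Western countries",
  "Non-Western" = "Education from non-Western countries",
  "Unknown" = "Unknown"
)

# USA residence takes priority; otherwise look up the label for the region
df$Education <- ifelse(df$Residence == "USA",
                       "Education from the USA",
                       unname(labels[df$Region]))

# Store as a factor with an explicit level order (assumed order)
df$Education <- factor(df$Education,
                       levels = c("Education from Western countries",
                                  "Education from non-Western countries",
                                  "Education from the USA",
                                  "Unknown"))
df
```

Note that rows with Residence "USA" are labelled "Education from the USA" regardless of their Region value, which matches the behaviour of the ifelse call above.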

Related

Insert a value to a column by condition

I am attempting to fill in a new column in my dataset. I have a dataset containing information on football matches. There is a column called "Stadium", which has various stadium names. I wish to add a new column which contains the country in which each stadium is located. My data looks something like this:
Match ID Stadium
1 Anfield
2 Camp Nou
3 Stadio Olimpico
4 Anfield
5 Emirates
I am attempting to create a new column looking like this:
Match ID Stadium Country
1 Anfield England
2 Camp Nou Spain
3 Stadio Olimpico Italy
4 Anfield England
5 Emirates England
There is only a handful of stadiums but many rows, meaning I am trying to find a way to avoid inserting the values manually. Any tips?
You want to get the unique stadium names from your data, manually create a vector with the country for each of those stadiums, then join them using Stadium as a key.
library(dplyr)
# Example data
df <- data.frame(`Match ID` = 1:12,
Stadium = rep(c("Stadio Olympico", "Anfield",
"Emirates"), 4))
# Get the unique stadium names in a vector
unique_stadiums <- df %>% pull(Stadium) %>% unique()
unique_stadiums
#> [1] "Stadio Olympico" "Anfield" "Emirates"
# Manually create a vector of country names corresponding to each element of
# the unique stadium name vector. Ordering matters here!
countries <- c("Italy", "England", "England")
# Place them both into a data.frame
lookup <- data.frame(Stadium = unique_stadiums, Country = countries)
# Join the country names to the original data on the stadium key
left_join(x = df, y = lookup, by = "Stadium")
#> Match.ID Stadium Country
#> 1 1 Stadio Olympico Italy
#> 2 2 Anfield England
#> 3 3 Emirates England
#> 4 4 Stadio Olympico Italy
#> 5 5 Anfield England
#> 6 6 Emirates England
#> 7 7 Stadio Olympico Italy
#> 8 8 Anfield England
#> 9 9 Emirates England
#> 10 10 Stadio Olympico Italy
#> 11 11 Anfield England
#> 12 12 Emirates England
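A base R alternative that avoids relying on vector order is a named lookup vector, where each stadium name is paired explicitly with its country. The sketch below assumes the five rows shown in the question; the object names matches and country_by_stadium are made up for illustration.

```r
# Example data modelled on the table in the question
matches <- data.frame(
  MatchID = 1:5,
  Stadium = c("Anfield", "Camp Nou", "Stadio Olimpico", "Anfield", "Emirates"),
  stringsAsFactors = FALSE
)

# Named vector: names are stadiums, values are countries
country_by_stadium <- c(
  "Anfield" = "England",
  "Camp Nou" = "Spain",
  "Stadio Olimpico" = "Italy",
  "Emirates" = "England"
)

# Index the lookup by stadium name; stadiums missing from the lookup give NA
matches$Country <- unname(country_by_stadium[matches$Stadium])
matches
```

Because each country sits next to its stadium name, adding or reordering stadiums cannot silently misalign the two.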

data wrangling in R with names_pattern for pivoting on ENDING pattern?

I have a dataset with a column, CatSex, that's got data in it in a form similar to "American.Indian.or.Alaska.Native.men"--the characters after the last period, I want to turn into a new pivoted column, so I have two columns, one called Cat with only the demographic info in it, and one called Sex with the sex in it. The characters before the sex designation don't follow any clear pattern. I am not very good at R, but it's better than Tableau Prep with large data sets, it seems. What I ultimately want is to pivot the data so that I have two distinct columns for the different categories here. I used this code to get part of the way there (the original data held like 119 columns with names like "Grand.total.men..C2005_A_RV..First.major..Area..ethnic..cultural..and.gender.studies...Degrees.total"), but I can't figure out how to do this with the pattern I'm now left with in the column CatSex:
pivot_longer(
cols = -c(UnitID, Institution.Name),
names_to = c("CatSex", "Disc"),
names_pattern = "(.*)..C2005_A_RV..First.major..(.*)",
values_to = "Count",
values_drop_na = TRUE
)
Here's a screenshot of the data structure I have now. I'm sorry for not putting in reproducible code--I don't know how to do that in this context!
EDIT: Here's a head(df) of the cleaned data so far:
# A tibble: 6 × 5
UnitID Institution.Name CatSex Disc Count
<int> <fct> <chr> <chr> <int>
1 177834 A T Still University of Health Sciences Grand.total.men Health.professions.and.related.clinical.sciences...Degrees.total. 212
2 177834 A T Still University of Health Sciences Grand.total.women Health.professions.and.related.clinical.sciences...Degrees.total. 359
3 177834 A T Still University of Health Sciences White.non.Hispanic.men Health.professions.and.related.clinical.sciences...Degrees.total. 181
4 177834 A T Still University of Health Sciences White.non.Hispanic.women Health.professions.and.related.clinical.sciences...Degrees.total. 317
5 177834 A T Still University of Health Sciences Black.non.Hispanic.men Health.professions.and.related.clinical.sciences...Degrees.total. 3
6 177834 A T Still University of Health Sciences Black.non.Hispanic.women Health.professions.and.related.clinical.sciences...Degrees.total. 5
Use extract from the tidyr package (part of the tidyverse):
Capture two groups with ().
Define the second group as one or more characters that are not a period, running up to the end of the string ($).
library(dplyr)
library(tidyr)
df %>%
extract(CatSex, c("Cat", "Sex"), "(.*)\\.([^.]+)$")
UnitID Institution.Name Cat Sex
1 222178 Abilene Christian University Hispanic men
2 222178 Abilene Christian University Hispanic women
3 222178 Abilene Christian University American.Indian.or.Alaska.Native men
4 222178 Abilene Christian University American.Indian.or.Alaska.Native women
5 222178 Abilene Christian University Asian.or.Pacific.Islander women
6 222178 Abilene Christian University Asian.or.Pacific.Islander men
7 222178 Abilene Christian University Grand.total men
8 222178 Abilene Christian University Grand.total women
9 222178 Abilene Christian University White.non.Hispanic men
10 222178 Abilene Christian University White.non.Hispanic women
11 222178 Abilene Christian University Black.non.Hispanic men
12 222178 Abilene Christian University Black.non.Hispanic women
13 222178 Abilene Christian University Hispanic men
14 222178 Abilene Christian University Hispanic women
15 222178 Abilene Christian University American.Indian.or.Alaska.Native men
Disc
1 Communication..journalism..and.related.programs
2 Communication..journalism..and.related.programs
3 Communication..journalism..and.related.programs
4 Communication..journalism..and.related.programs
5 Communication..journalism..and.related.programs
6 Communication..journalism..and.related.programs
7 Computer.and.information.sciences.and.support.serv
8 Computer.and.information.sciences.and.support.servi
9 Computer.and.information.sciences.and.support.servi
10 Computer.and.information.sciences.and.support.servi
11 Computer.and.information.sciences.and.support.servi
12 Computer.and.information.sciences.and.support.servi.
13 Computer.and.information.sciences.and.support.serv
14 Computer.and.information.sciences.and.support.servi.
15 Computer.and.information.sciences.and.support.servi
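To see what the pattern above captures without any packages, the same regular expression can be tried in base R with regexec and regmatches. This is a minimal sketch on made-up input strings; it assumes every value contains at least one period, since values without one would not match at all.

```r
# Hypothetical sample values in the same shape as the CatSex column
cat_sex <- c("Grand.total.men",
             "White.non.Hispanic.women",
             "American.Indian.or.Alaska.Native.men")

# Group 1: everything before the last period (greedy .*)
# Group 2: one or more non-period characters up to the end of the string
m <- regmatches(cat_sex, regexec("^(.*)\\.([^.]+)$", cat_sex))

# Each list element holds: full match, group 1, group 2
parts <- do.call(rbind, m)
result <- data.frame(Cat = parts[, 2], Sex = parts[, 3],
                     stringsAsFactors = FALSE)
result
```

The greedy `.*` is what pushes the split to the last period rather than the first.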
pivot_longer is not the right function in this context.
Here are a few options:
Using tidyr::separate
tidyr::separate(df, 'CatSex', c('Cat', 'Sex'), sep = '(\\.)(?!.*\\.)')
#  Cat Sex
#1 Grand.total men
#2 Grand.total women
#3 White.non.Hispanic men
#4 White.non.Hispanic women
#5 Black.non.Hispanic men
#6 Black.non.Hispanic women
Using stringr functions
library(dplyr)
library(stringr)
df %>%
mutate(Sex = str_extract(CatSex, '(men|women)$'),
Cat = str_remove(CatSex, '\\.(men|women)$'))
In base R
transform(df, Sex = sub('.*\\.(men|women)$', '\\1', CatSex),
Cat = sub('\\.(men|women)$', '', CatSex))
data
It is easier to help if you provide data in a reproducible format
df <- data.frame(CatSex = c("Grand.total.men", "Grand.total.women",
"White.non.Hispanic.men", "White.non.Hispanic.women",
"Black.non.Hispanic.men", "Black.non.Hispanic.women"))

Assigning new label / group by partial string matching with vector of shortened labels

I am trying to group data together in R. I'm using data from a Tidy Tuesday challenge (global seafood, stock), and want to group the data into oceans. Currently, the data is separated into ocean segments (e.g. Eastern Central Atlantic and Northeast Central Atlantic):
Ocean code year bio_sus bio_nonsus
1 Eastern Central Atlantic NA 2015 57.1 42.9
2 Eastern Central Atlantic NA 2017 57.1 42.9
3 Southeast Central Atlantic NA 2015 67.6 32.4
4 Southeast Central Atlantic NA 2017 67.6 32.4
Is there a way to combine the different ocean data (the bio_sus and bio_nonsus) into one larger group (e.g. all the segments of the Atlantic into one Atlantic for 2015 and 2017)?
I have four different oceans in total (Pacific, Atlantic, Indian and Mediterranean), each segmented like this.
#This is the data:
stock <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-12/fish-stocks-within-sustainable-levels.csv')
This is essentially a "multiple partial string matching" problem. Here is one approach. Loop over your partial strings to get the indices for each partial match, then replace the original vector with the matches. Then summarise by your new column.
library(dplyr)
stock <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-12/fish-stocks-within-sustainable-levels.csv')
oceans <- c("pacific", "atlantic", "indian", "mediterranean")
lu <- stack(sapply(oceans, grep, x = stock$Entity, ignore.case = TRUE))
stock$oceans <- stock$Entity
stock$oceans[lu$values] <- as.character(lu$ind)
stock %>%
group_by(oceans) %>%
summarise(across(matches("^share"), sum))
#> # A tibble: 5 × 3
#> oceans `Share of fish stocks within biologi… `Share of fish stocks tha…
#> <chr> <dbl> <dbl>
#> 1 atlantic 742. 458.
#> 2 indian 277. 123.
#> 3 mediterranean 75.3 125.
#> 4 pacific 894. 306.
#> 5 World 1609. 491.
Created on 2021-11-13 by the reprex package (v2.0.1)
Why not use str_split() from the stringr package to extract the ocean, and make one column for the ocean and one for the sub-segment?
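The partial-matching step can also be tried on its own, without downloading the CSV. The sketch below uses a handful of illustrative segment names (assumed, not read from the real file) and a plain loop with grepl; names that match none of the oceans, such as "World", are left unchanged.

```r
# Illustrative segment names (assumption: the real Entity column looks similar)
segments <- c("Eastern Central Atlantic",
              "Southeast Pacific",
              "Eastern Indian Ocean",
              "Mediterranean and Black Sea",
              "World")

oceans <- c("Pacific", "Atlantic", "Indian", "Mediterranean")

# Start from the original names, then overwrite every partial match
ocean_group <- segments
for (o in oceans) {
  ocean_group[grepl(o, segments, ignore.case = TRUE)] <- o
}

data.frame(segments, ocean_group)
```

The resulting ocean_group vector can then be used as the grouping column, for example in group_by.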

Why can't I merge two files with left_join in R?

I am trying to link two files with left join, in R. I do not get the output I need.
Here is an example of two files:
here is file_1:
And when I do a left_join in R, I do not get what I want. Here is the code:
minap_piv_na_stemi_nstemi <- left_join(file_1, file_2)
As you can see in the last line, Rochdale Infirmary should be populated with the provider_name and trust_code, but this is not happening. Can someone help?
Try this. Use left_join to join by hospital_name. Use e.g. coalesce to fill in the missing information for trust_code and provider_name. Get rid of the .x and .y columns:
library(dplyr)
left_join(file_1, file_2, by = "hospital_name") %>%
mutate(trust_code = coalesce(trust_code.x, trust_code.y),
provider_name = coalesce(provider_name.x, provider_name.y)) %>%
select(-ends_with(".x"), -ends_with(".y"))
#> # A tibble: 182 x 3
#> hospital_name trust_code provider_name
#> <chr> <chr> <chr>
#> 1 Addenbrooke's Hospital RGT Cambridge University Hospitals NHS Founda…
#> 2 Royal Albert Edward In… RRF Wrightington, Wigan and Leigh NHS Foundat…
#> 3 Airedale General Hospi… RCF Airedale NHS Trust
#> 4 Wycombe Hospital RXQ Buckinghamshire Healthcare NHS Trust
#> 5 Barnsley Hospital RFF Barnsley Hospital NHS Foundation Trust
#> 6 Basildon Hospital RDD Basildon and Thurrock University Hospital…
#> 7 Royal United Hospital … RD1 Royal United Hospital Bath NHS Trust
#> 8 Bedford Hospital RC1 Bedford Hospital NHS Trust
#> 9 Broomfield Hospital RQ8 Mid Essex Hospital Services NHS Trust
#> 10 Rochdale Infirmary RW6 Pennine Acute Hospitals NHS Trust
#> # … with 172 more rows
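For context: when no by argument is given, left_join joins on every column the two tables have in common, so a row whose trust_code is NA in one file and filled in the other would not match. The original files are not shown, so the sketch below is a base R illustration of the same join-then-fill idea on invented two-row data; the column layout and the helper fill_na are assumptions.

```r
# Hypothetical stand-ins for file_1 and file_2 (real files were not posted)
file_1 <- data.frame(
  hospital_name = c("Bedford Hospital", "Rochdale Infirmary"),
  trust_code = c("RC1", NA),
  provider_name = c("Bedford Hospital NHS Trust", NA),
  stringsAsFactors = FALSE
)
file_2 <- data.frame(
  hospital_name = "Rochdale Infirmary",
  trust_code = "RW6",
  provider_name = "Pennine Acute Hospitals NHS Trust",
  stringsAsFactors = FALSE
)

# Left join on hospital_name only; clashing columns get .x / .y suffixes
joined <- merge(file_1, file_2, by = "hospital_name",
                all.x = TRUE, suffixes = c(".x", ".y"))

# Hypothetical helper: take the first value unless it is NA
fill_na <- function(a, b) ifelse(is.na(a), b, a)

joined$trust_code <- fill_na(joined$trust_code.x, joined$trust_code.y)
joined$provider_name <- fill_na(joined$provider_name.x, joined$provider_name.y)

# Drop the suffixed helper columns
joined <- joined[, c("hospital_name", "trust_code", "provider_name")]
joined
```

Here fill_na plays the role that coalesce plays in the dplyr version.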

Create a variable in a data frame based on another matrix in R

I am having some problems with the following task
I have a data frame of this type with 99 different countries for thousands of IDs
ID Nationality var 1 var 2 ....
1 Italy //
2 Eritrea //
3 Italy //
4 USA
5 France
6 France
7 Eritrea
....
I want to add a variable corresponding to a given macroregion of Nationality
so I created a matrix of this kind with the rule to follow
Nationality Continent
Italy Europe
Eritrea Africa
Usa America
France Europe
Germany Europe
....
I'd like to obtain this:
ID Nationality var 1 var 2 Continent
1 Italy // Europe
2 Eritrea // Africa
3 Italy // Europe
4 USA America
5 France Europe
6 France Europe
7 Eritrea Africa
....
I was trying with this command
datasubset <- merge(dataset , continent.matrix )
but it doesn't work, it reports the following error
Error: cannot allocate vector of size 56.6 Mb
That seems very strange to me; applying this code to a subset doesn't work either. Do you have any suggestions on how to proceed?
Thank you very much in advance for your help. I hope my question doesn't sound too trivial, but I am quite new to R.
You can do this with the left_join function (from the dplyr package):
library(dplyr)
df <- tibble(ID=c(1,2,3),
Nationality=c("Italy", "Usa", "France"),
var1=c("a", "b", "c"),
var2=c(4,5,6))
nat_cont <- tibble(Nationality=c("Italy", "Eritrea", "Usa", "Germany", "France"),
Continent=c("Europe", "Africa", "America", "Europe", "Europe"))
df_2 <- left_join(df, nat_cont, by=c("Nationality"))
The output:
> df_2
# A tibble: 3 x 5
ID Nationality var1 var2 Continent
<dbl> <chr> <chr> <dbl> <chr>
1 1 Italy a 4 Europe
2 2 Usa b 5 America
3 3 France c 6 Europe
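One detail worth checking in the question's data: the data frame spells the country "USA" while the lookup table spells it "Usa", and an exact join would leave that row without a continent. A possible workaround, sketched here in base R on invented data, is to compare the keys case-insensitively with match. The object names dataset and continent_lookup are placeholders.

```r
# Hypothetical data reproducing the "USA" vs "Usa" mismatch from the question
dataset <- data.frame(
  ID = 1:4,
  Nationality = c("Italy", "Eritrea", "USA", "France"),
  stringsAsFactors = FALSE
)
continent_lookup <- data.frame(
  Nationality = c("Italy", "Eritrea", "Usa", "France", "Germany"),
  Continent = c("Europe", "Africa", "America", "Europe", "Europe"),
  stringsAsFactors = FALSE
)

# Position of each nationality in the lookup, ignoring case
idx <- match(toupper(dataset$Nationality),
             toupper(continent_lookup$Nationality))

# Pull the continent at that position; unmatched nationalities become NA
dataset$Continent <- continent_lookup$Continent[idx]
dataset
```

This adds the column in place and only builds one index vector, which may also be lighter on memory than a full merge.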
