I am trying to filter on two conditions, but I keep removing all patients with either condition - r

I'm a beginner in R, so apologies for any errors, and thank you for helping.
I have a dataset (liver) where rows are patient ID numbers, and columns include what region the patient resides in (London, Yorkshire etc) and what unit the patient was treated in (hospital name). Some of the units are private units. I've identified 120 patients from London, of whom 100 were treated across three private units. I want to remove the 100 London patients treated in private units but I keep accidentally removing all patients treated in the private units (around 900 patients). I'd be grateful for advice on how to just remove the London patients treated privately.
I've tried various combinations of using subset and filter with different exclamation points and brackets in different places including for example:
liver <- filter(liver, region_name != "London" & unit_name!="Primrose Hospital" & unit_name != "Oak Hospital" & unit_name != "Wilson Hospital")
Thank you very much.

Your unit_name condition is zeroing your results. Try using the match function, which is more commonly seen in its infix form, %in%:
liver <- filter(liver,
                region_name != "London",
                !unit_name %in% c("Primrose Hospital",
                                  "Oak Hospital",
                                  "Wilson Hospital"))
Also you can separate logical AND conditions using a comma.

Building on Pariksheet's great start, which still drops private-hospital patients from outside London: here we need to use the OR operator, |, within the filter function. I've made an example data frame which demonstrates how this works for your case. The example tibble contains your three private London hospitals plus one non-private hospital that we want to keep. It also has Manchester patients who attend either Manch General or one of the private hospitals, all of whom we want to keep.
EDITED: Now includes character vectors so that the combinations to exclude can be generalised.
library(dplyr)  # provides tibble(), filter() and %>%
liver <- tibble(region_name = rep(c('London', 'Liverpool', 'Glasgow', 'Manchester'), each = 4),
                unit_name = c(rep(c('Primrose Hospital',
                                    'Oak Hospital',
                                    'Wilson Hospital',
                                    'State Hospital'), times = 3),
                              rep(c('Manch General', 'Primrose Hospital'), each = 2)))
liver
# A tibble: 16 x 2
region_name unit_name
<chr> <chr>
1 London Primrose Hospital
2 London Oak Hospital
3 London Wilson Hospital
4 London State Hospital
5 Liverpool Primrose Hospital
6 Liverpool Oak Hospital
7 Liverpool Wilson Hospital
8 Liverpool State Hospital
9 Glasgow Primrose Hospital
10 Glasgow Oak Hospital
11 Glasgow Wilson Hospital
12 Glasgow State Hospital
13 Manchester Manch General
14 Manchester Manch General
15 Manchester Primrose Hospital
16 Manchester Primrose Hospital
excl.private.regions <- c('London',
'Liverpool',
'Glasgow')
excl.private.hospitals <- c('Primrose Hospital',
'Oak Hospital',
'Wilson Hospital')
liver %>%
filter(! region_name %in% excl.private.regions |
! unit_name %in% excl.private.hospitals)
# A tibble: 7 x 2
region_name unit_name
<chr> <chr>
1 London State Hospital
2 Liverpool State Hospital
3 Glasgow State Hospital
4 Manchester Manch General
5 Manchester Manch General
6 Manchester Primrose Hospital
7 Manchester Primrose Hospital
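For the original question (drop only the London patients treated in private units), a minimal sketch is to build the "London AND private" condition first and then negate it as a whole. This uses base R so it runs without any packages; the names private_units, drop_row and liver_kept are made up for illustration, and the data is the 16-row toy example from above rather than the real liver dataset.

```r
# Toy data: the 16-row example from above, rebuilt with base R only.
liver <- data.frame(
  region_name = rep(c("London", "Liverpool", "Glasgow", "Manchester"), each = 4),
  unit_name = c(rep(c("Primrose Hospital", "Oak Hospital",
                      "Wilson Hospital", "State Hospital"), times = 3),
                rep(c("Manch General", "Primrose Hospital"), each = 2)),
  stringsAsFactors = FALSE
)

# Hypothetical helper vector listing the private units.
private_units <- c("Primrose Hospital", "Oak Hospital", "Wilson Hospital")

# TRUE only for rows that are BOTH in London AND in a private unit.
drop_row <- liver$region_name == "London" & liver$unit_name %in% private_units

# Negate the whole condition to keep every other row.
liver_kept <- liver[!drop_row, ]

# dplyr equivalent of the two steps above:
# filter(liver, !(region_name == "London" & unit_name %in% private_units))
```

Negating the combined condition is the same as the OR form used above, since !(A & B) equals !A | !B. Chaining != tests with & instead removes every row that matches any one of them, which is why all private-unit patients disappeared.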


How can I add the country name to a dataset based on city name and population? [duplicate]

I have a dataset containing information on a range of cities, but there is no column which says what country the city is located in. In order to perform the analysis, I need to add an extra column which has the name of the country.
population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin
I expect the output to look like this:
population city country
500,000 Oslo Norway
750,000 Bristol England
500,000 Liverpool England
1,000,000 Dublin Ireland
How can I add a column of country names based on the city and population to a large dataset in R?
I am adapting Tom Hoel's answer, as suggested by Ian Campbell. If this is selected I am happy to mark it as community wiki.
library(maps)
library(dplyr)
data("world.cities")
df <- readr::read_table("population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin")
df |>
inner_join(
select(world.cities, name, country.etc, pop),
by = c("city" = "name")
) |> group_by(city) |>
filter(
abs(pop - population) == min(abs(pop - population))
)
# A tibble: 4 x 4
# Groups: city [4]
# population city country.etc pop
# <dbl> <chr> <chr> <int>
# 1 500000 Oslo Norway 821445
# 2 750000 Bristol UK 432967
# 3 500000 Liverpool UK 468584
# 4 1000000 Dublin Ireland 1030431
As others have noted, cities with these names exist in other countries as well.
library(tidyverse)
library(maps)
data("world.cities")
df <- read_table("population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin")
df %>%
merge(., world.cities %>%
select(name, country.etc),
by.x = "city",
by.y = "name")
# A tibble: 7 × 3
city population country.etc
<chr> <dbl> <chr>
1 Bristol 750000 UK
2 Bristol 750000 USA
3 Dublin 1000000 USA
4 Dublin 1000000 Ireland
5 Liverpool 500000 UK
6 Liverpool 500000 Canada
7 Oslo 500000 Norway
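One way to narrow a one-to-many result like this down to a single country per city is to keep, for each city, the candidate whose recorded population is closest to yours. The sketch below does that in base R. The lookup table is a hypothetical stand-in for maps::world.cities: the UK, Ireland and Norway populations are copied from the output shown earlier, while the USA and Canada populations are made-up placeholders, not real figures.

```r
# The question's data, with populations already parsed as numbers.
df <- data.frame(
  population = c(500000, 750000, 500000, 1000000),
  city = c("Oslo", "Bristol", "Liverpool", "Dublin"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup standing in for maps::world.cities.
# The USA and Canada populations are placeholders for illustration only.
lookup <- data.frame(
  name = c("Oslo", "Bristol", "Bristol", "Liverpool", "Liverpool", "Dublin", "Dublin"),
  country.etc = c("Norway", "UK", "USA", "UK", "Canada", "Ireland", "USA"),
  pop = c(821445, 432967, 30000, 468584, 3000, 1030431, 40000),
  stringsAsFactors = FALSE
)

# One row per (city, candidate country) pair.
joined <- merge(df, lookup, by.x = "city", by.y = "name")

# Distance between the population we were given and each candidate's population.
joined$gap <- abs(joined$pop - joined$population)

# Within each city, keep the candidate with the smallest gap.
best <- joined[joined$gap == ave(joined$gap, joined$city, FUN = min), ]
result <- best[, c("population", "city", "country.etc")]
result
```

This is only a heuristic: it assumes your population figures are roughly comparable to those in the lookup, and an exact tie in gap would keep both candidates.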
I think your best bet would be to add a new column called country to your dataset and fill it in. This is part of the data preparation phase of the CRISP-DM process, so it is not uncommon. If that does not answer your question, please let me know and I will do my best to help.

data wrangling in R with names_pattern for pivoting on ENDING pattern?

I have a dataset with a column, CatSex, that holds values of a form similar to "American.Indian.or.Alaska.Native.men". I want to split off the characters after the last period, so that I end up with two columns: one called Cat with only the demographic info in it, and one called Sex with the sex in it. The characters before the sex designation don't follow any clear pattern. I am not very good at R, but it seems to cope with large data sets better than Tableau Prep. What I ultimately want is to pivot the data so that I have two distinct columns for the different categories here. I used this code to get part of the way there (the original data held about 119 columns with names like "Grand.total.men..C2005_A_RV..First.major..Area..ethnic..cultural..and.gender.studies...Degrees.total"), but I can't figure out how to do this with the pattern I'm now left with in the column CatSex:
pivot_longer(
cols = -c(UnitID, Institution.Name),
names_to = c("CatSex", "Disc"),
names_pattern = "(.*)..C2005_A_RV..First.major..(.*)",
values_to = "Count",
values_drop_na = TRUE
)
Here's a screenshot of the data structure I have now. I'm sorry for not putting in reproducible code--I don't know how to do that in this context!
EDIT: Here's a head(df) of the cleaned data so far:
# A tibble: 6 × 5
UnitID Institution.Name CatSex Disc Count
<int> <fct> <chr> <chr> <int>
1 177834 A T Still University of Health Sciences Grand.total.men Health.professions.and.related.clinical.sciences...Degrees.total. 212
2 177834 A T Still University of Health Sciences Grand.total.women Health.professions.and.related.clinical.sciences...Degrees.total. 359
3 177834 A T Still University of Health Sciences White.non.Hispanic.men Health.professions.and.related.clinical.sciences...Degrees.total. 181
4 177834 A T Still University of Health Sciences White.non.Hispanic.women Health.professions.and.related.clinical.sciences...Degrees.total. 317
5 177834 A T Still University of Health Sciences Black.non.Hispanic.men Health.professions.and.related.clinical.sciences...Degrees.total. 3
6 177834 A T Still University of Health Sciences Black.non.Hispanic.women Health.professions.and.related.clinical.sciences...Degrees.total. 5
Use extract() from the tidyr package (part of the tidyverse):
Capture two groups with ().
Define the second group as one or more characters that are not ., running up to the end of the string ($).
library(dplyr)
library(tidyr)
df %>%
extract(CatSex, c("Cat", "Sex"), "(.*)\\.([^.]+)$")
UnitID Institution.Name Cat Sex
1 222178 Abilene Christian University Hispanic men
2 222178 Abilene Christian University Hispanic women
3 222178 Abilene Christian University American.Indian.or.Alaska.Native men
4 222178 Abilene Christian University American.Indian.or.Alaska.Native women
5 222178 Abilene Christian University Asian.or.Pacific.Islander women
6 222178 Abilene Christian University Asian.or.Pacific.Islander men
7 222178 Abilene Christian University Grand.total men
8 222178 Abilene Christian University Grand.total women
9 222178 Abilene Christian University White.non.Hispanic men
10 222178 Abilene Christian University White.non.Hispanic women
11 222178 Abilene Christian University Black.non.Hispanic men
12 222178 Abilene Christian University Black.non.Hispanic women
13 222178 Abilene Christian University Hispanic men
14 222178 Abilene Christian University Hispanic women
15 222178 Abilene Christian University American.Indian.or.Alaska.Native men
Disc
1 Communication..journalism..and.related.programs
2 Communication..journalism..and.related.programs
3 Communication..journalism..and.related.programs
4 Communication..journalism..and.related.programs
5 Communication..journalism..and.related.programs
6 Communication..journalism..and.related.programs
7 Computer.and.information.sciences.and.support.servi
8 Computer.and.information.sciences.and.support.servi
9 Computer.and.information.sciences.and.support.servi
10 Computer.and.information.sciences.and.support.servi
11 Computer.and.information.sciences.and.support.servi
12 Computer.and.information.sciences.and.support.servi
13 Computer.and.information.sciences.and.support.servi
14 Computer.and.information.sciences.and.support.servi
15 Computer.and.information.sciences.and.support.servi
pivot_longer is not the right function in this context.
Here are a few options:
Using tidyr::separate
tidyr::separate(df, 'CatSex', c('Cat', 'Sex'), sep = '(\\.)(?!.*\\.)')
#  Cat Sex
#1 Grand.total men
#2 Grand.total women
#3 White.non.Hispanic men
#4 White.non.Hispanic women
#5 Black.non.Hispanic men
#6 Black.non.Hispanic women
Using stringr functions
library(dplyr)
library(stringr)
df %>%
  # anchor with $ so that only the suffix at the end of the string is matched
  mutate(Sex = str_extract(CatSex, '(men|women)$'),
         Cat = str_remove(CatSex, '\\.(men|women)$'))
In base R
transform(df, Sex = sub('.*\\.(men|women)$', '\\1', CatSex),
          Cat = sub('\\.(men|women)$', '', CatSex))
data
It is easier to help if you provide data in a reproducible format
df <- data.frame(CatSex = c("Grand.total.men", "Grand.total.women",
"White.non.Hispanic.men", "White.non.Hispanic.women",
"Black.non.Hispanic.men", "Black.non.Hispanic.women"))
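If the suffix values are not known in advance, the same split can be sketched in base R by working from the last period only, without naming men or women in the pattern. The vector cat_sex below is a small illustrative sample built from values quoted in the question, and cat_part and sex_part are made-up names.

```r
# Small illustrative sample of CatSex values.
cat_sex <- c("Grand.total.men",
             "White.non.Hispanic.women",
             "American.Indian.or.Alaska.Native.men")

# Everything before the final period: delete the last period and what follows it.
cat_part <- sub("\\.[^.]+$", "", cat_sex)

# Everything after the final period: the greedy .* consumes up to the last period.
sex_part <- sub("^.*\\.", "", cat_sex)

data.frame(Cat = cat_part, Sex = sex_part)
```

Both patterns rely on [^.] or greedy matching to find the last period, so any number of earlier periods in the category name is left untouched.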

Why can't I merge two files with left_join in R?

I am trying to link two files with a left join in R, but I do not get the output I need.
Here is an example of the two files:
Here is file_1:
When I do a left_join in R, I do not get what I want. Here is the code:
minap_piv_na_stemi_nstemi <- left_join(file_1, file_2)
As you can see in the last line, Rochdale Infirmary should be populated with the provider_name and trust_code. This is not happening. Can someone help?
Try this. Use left_join to join by hospital_name. Use e.g. coalesce to fill in the missing information for trust_code and provider_name. Get rid of the .x and .y columns:
library(dplyr)
left_join(file_1, file_2, by = "hospital_name") %>%
mutate(trust_code = coalesce(trust_code.x, trust_code.y),
provider_name = coalesce(provider_name.x, provider_name.y)) %>%
select(-ends_with(".x"), -ends_with(".y"))
#> # A tibble: 182 x 3
#> hospital_name trust_code provider_name
#> <chr> <chr> <chr>
#> 1 Addenbrooke's Hospital RGT Cambridge University Hospitals NHS Founda…
#> 2 Royal Albert Edward In… RRF Wrightington, Wigan and Leigh NHS Foundat…
#> 3 Airedale General Hospi… RCF Airedale NHS Trust
#> 4 Wycombe Hospital RXQ Buckinghamshire Healthcare NHS Trust
#> 5 Barnsley Hospital RFF Barnsley Hospital NHS Foundation Trust
#> 6 Basildon Hospital RDD Basildon and Thurrock University Hospital…
#> 7 Royal United Hospital … RD1 Royal United Hospital Bath NHS Trust
#> 8 Bedford Hospital RC1 Bedford Hospital NHS Trust
#> 9 Broomfield Hospital RQ8 Mid Essex Hospital Services NHS Trust
#> 10 Rochdale Infirmary RW6 Pennine Acute Hospitals NHS Trust
#> # … with 172 more rows
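One plausible reason the original call fills nothing in, assuming both files share all three column names: left_join(file_1, file_2) without by joins on every common column, so a row whose trust_code is NA in file_1 never matches the populated row in file_2. The sketch below shows the join-then-fill idea in base R on two tiny hypothetical data frames (the real files are not shown in the question; the values are borrowed from the output above). merge() stands in for left_join() and a small helper, fill_from, stands in for coalesce().

```r
# Hypothetical toy versions of the two files; the real ones are not shown.
file_1 <- data.frame(
  hospital_name = c("Bedford Hospital", "Rochdale Infirmary"),
  trust_code = c("RC1", NA),
  provider_name = c("Bedford Hospital NHS Trust", NA),
  stringsAsFactors = FALSE
)
file_2 <- data.frame(
  hospital_name = "Rochdale Infirmary",
  trust_code = "RW6",
  provider_name = "Pennine Acute Hospitals NHS Trust",
  stringsAsFactors = FALSE
)

# Left join on hospital_name only; clashing columns get .x / .y suffixes.
joined <- merge(file_1, file_2, by = "hospital_name", all.x = TRUE)

# Take the file_1 value when present, otherwise fall back to file_2.
fill_from <- function(x, y) ifelse(is.na(x), y, x)

result <- data.frame(
  hospital_name = joined$hospital_name,
  trust_code = fill_from(joined$trust_code.x, joined$trust_code.y),
  provider_name = fill_from(joined$provider_name.x, joined$provider_name.y),
  stringsAsFactors = FALSE
)
result
```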

put the resulting values from for loop into a table in r [duplicate]

I'm trying to calculate the total number of matches played by each team in the year 2019 and put them in a table along with the corresponding team names.
teams<-c("Sunrisers Hyderabad", "Mumbai Indians", "Gujarat Lions", "Rising Pune Supergiants",
"Royal Challengers Bangalore","Kolkata Knight Riders","Delhi Daredevils",
"Kings XI Punjab", "Deccan Chargers","Rajasthan Royals", "Chennai Super Kings",
"Kochi Tuskers Kerala", "Pune Warriors", "Delhi Capitals", " Gujarat Lions")
for (j in teams) {
print(j)
ipl_table %>%
filter(season==2019 & (team1==j | team2 ==j)) %>%
summarise(match_count=n())->kl
print(kl)
match_played<-data.frame(Teams=teams,Match_count=kl)
}
The match count for the last team (i.e. Gujarat Lions) is 0, and it's filling in 0s for all the other teams as well.
The output match_played can be found on the link given below.
I'd be really glad if someone could help me regarding this error as I'm very new to R.
Filter for the particular season, reshape the data to long format, and then count the number of matches.
library(dplyr)
matches %>%
filter(season == 2019) %>%
tidyr::pivot_longer(cols = c(team1, team2), values_to = 'team_name') %>%
count(team_name) -> result
result
# team_name n
# <chr> <int>
#1 Chennai Super Kings 17
#2 Delhi Capitals 16
#3 Kings XI Punjab 14
#4 Kolkata Knight Riders 14
#5 Mumbai Indians 16
#6 Rajasthan Royals 14
#7 Royal Challengers Bangalore 14
#8 Sunrisers Hyderabad 15
Here is an example
library(tidyr)
df_2019 <- matches[matches$season == 2019, ] # get the season you need
df_long <- gather(df_2019, Team_id, Team_Name, team1:team2) # Make it long format
final_count <- data.frame(t(table(df_long$Team_Name)))[-1] # count the number of matches
names(final_count) <- c("Team", "Matches")
Team Matches
1 Chennai Super Kings 17
2 Delhi Capitals 16
3 Kings XI Punjab 14
4 Kolkata Knight Riders 14
5 Mumbai Indians 16
6 Rajasthan Royals 14
7 Royal Challengers Bangalore 14
8 Sunrisers Hyderabad 15
Or by using base R
final_count <- data.frame(t(table(c(df_2019$team1, df_2019$team2))))[-1]
names(final_count) <- c("Team", "Matches")
final_count
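For completeness, the loop in the question goes wrong because match_played is rebuilt on every iteration from the latest single count kl, so after the loop every row holds the count for the last team. A minimal sketch of a loop-style fix is to collect one count per team into a vector and build the table once. The ipl_table below is a hypothetical four-row stand-in with made-up fixtures, not the real data.

```r
# Hypothetical stand-in for the real data; the fixtures are made up.
ipl_table <- data.frame(
  season = c(2019, 2019, 2019, 2018),
  team1 = c("Mumbai Indians", "Delhi Capitals", "Chennai Super Kings", "Mumbai Indians"),
  team2 = c("Chennai Super Kings", "Mumbai Indians", "Delhi Capitals", "Delhi Daredevils"),
  stringsAsFactors = FALSE
)
teams <- c("Mumbai Indians", "Delhi Capitals", "Chennai Super Kings", "Gujarat Lions")

# Restrict to the season of interest once, outside the per-team step.
s2019 <- ipl_table[ipl_table$season == 2019, ]

# One count per team, collected into a vector instead of being overwritten.
match_count <- vapply(
  teams,
  function(tm) sum(s2019$team1 == tm | s2019$team2 == tm),
  integer(1),
  USE.NAMES = FALSE
)

# Build the table once, after all the counts exist.
match_played <- data.frame(Teams = teams, Match_count = match_count)
match_played
```

Unlike the count-based answers above, this keeps teams with no matches in the season, showing them with a count of 0.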

Partial String Matching in R to unify text into one category

I have a dataset as follows:
EstablishmentName Freq
bahria university 20
bahria university islamabad 12
arid agriculture 3
arid agriculture university 15
arid rawalpindi 9
college of e&me, nust 20
college of e & me (nust) 15
college of eme 30
As you can see above, Bahria University and Bahria University Islamabad are almost the same, and the same goes for the other strings. I want to unify them into one category each, such that:
Expected Output
EstablishmentName Freq
Bahria University 32
Arid Agriculture 27
College of EME 65
I have tried the following solution, but it doesn't seem to work.
library(SnowballC)
library(dplyr)
mutate(df, word = wordStem(EstablishmentName)) %>%
group_by(EstablishmentName) %>%
summarise(total = sum(Freq))
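One possible approach, sketched here under the assumption that you are willing to write the grouping rules by hand, is to map each name to a canonical label with a regular expression and then sum Freq per label. The rules vector and the Canonical column are hypothetical names introduced for this example, and the patterns are tuned only to the eight rows shown. Note that summing all three "college of e&me" variants gives 20 + 15 + 30 = 65.

```r
# The data from the question.
df <- data.frame(
  EstablishmentName = c("bahria university", "bahria university islamabad",
                        "arid agriculture", "arid agriculture university",
                        "arid rawalpindi", "college of e&me, nust",
                        "college of e & me (nust)", "college of eme"),
  Freq = c(20, 12, 3, 15, 9, 20, 15, 30),
  stringsAsFactors = FALSE
)

# Hand-written rules (an assumption): regex pattern -> canonical name.
rules <- c("^bahria" = "Bahria University",
           "^arid" = "Arid Agriculture",
           "^college of e ?&? ?me" = "College of EME")

# Label every row whose name matches a pattern; unmatched rows stay NA.
df$Canonical <- NA_character_
for (pattern in names(rules)) {
  hit <- grepl(pattern, df$EstablishmentName)
  df$Canonical[hit] <- rules[[pattern]]
}

# Sum the frequencies within each canonical name.
totals <- aggregate(Freq ~ Canonical, data = df, FUN = sum)
totals
```

Explicit rules are tedious for many categories but keep the grouping auditable. Any name that matches no rule stays NA and is left out of the totals, which makes gaps in the rules easy to spot by checking df$Canonical.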
