Assigning new label / group by partial string matching with vector of shortened labels - r

I am trying to group data together in R. I'm using data from a Tidy Tuesday challenge (global seafood, stock) and want to group the data into oceans. Currently, the data is separated into ocean segments (e.g. Eastern Central Atlantic and Northeast Central Atlantic):
  Ocean                        code  year bio_sus bio_nonsus
1 Eastern Central Atlantic     NA    2015    57.1       42.9
2 Eastern Central Atlantic     NA    2017    57.1       42.9
3 Southeast Central Atlantic   NA    2015    67.6       32.4
4 Southeast Central Atlantic   NA    2017    67.6       32.4
Is there a way to combine the data for the different ocean segments (bio_sus and bio_nonsus) into one larger group (e.g. all the Atlantic segments into a single Atlantic entry for 2015 and 2017)?
I have four different oceans in total (Pacific, Atlantic, Indian and Mediterranean), all segmented like this.
#This is the data:
stock <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-12/fish-stocks-within-sustainable-levels.csv')

This is essentially a "multiple partial string matching" problem. Here is one approach: loop over your partial strings to get the indices for each partial match, then replace the entries of the original vector with the matched labels. Then summarise by your new column.
library(dplyr)
stock <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-12/fish-stocks-within-sustainable-levels.csv')
oceans <- c("pacific", "atlantic", "indian", "mediterranean")
lu <- stack(sapply(oceans, grep, x = stock$Entity, ignore.case = TRUE))
stock$oceans <- stock$Entity
stock$oceans[lu$values] <- as.character(lu$ind)
stock %>%
  group_by(oceans) %>%
  summarise(across(matches("^share"), sum))
#> # A tibble: 5 × 3
#> oceans `Share of fish stocks within biologi… `Share of fish stocks tha…
#> <chr> <dbl> <dbl>
#> 1 atlantic 742. 458.
#> 2 indian 277. 123.
#> 3 mediterranean 75.3 125.
#> 4 pacific 894. 306.
#> 5 World 1609. 491.
Created on 2021-11-13 by the reprex package (v2.0.1)

Why not use str_split() from the stringr package to extract the ocean, making one column for the ocean and one for the sub-segment?
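One way to act on this suggestion is to pull the ocean name out of each label with a regular expression instead of splitting on spaces. The sketch below swaps in stringr::str_extract() for str_split(); the entity vector is made-up illustration data, not the Tidy Tuesday file, and labels with no match (such as "World") are assumed to keep their original value.

```r
library(stringr)

# Hypothetical labels standing in for stock$Entity
entity <- c("Eastern Central Atlantic", "Southeast Pacific",
            "Western Indian Ocean", "Mediterranean and Black Sea", "World")
oceans <- c("Pacific", "Atlantic", "Indian", "Mediterranean")

# One case-insensitive pattern that matches any of the ocean names
pattern <- regex(paste(oceans, collapse = "|"), ignore_case = TRUE)

# Extract the matching ocean; non-matching labels come back as NA
ocean <- str_extract(entity, pattern)

# Keep the original label where no ocean name was found
ocean[is.na(ocean)] <- entity[is.na(ocean)]
ocean
```

The resulting vector can be assigned to a new column and passed to group_by() in the same way as the oceans column in the answer above.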

Related

Creating a variable from other two categorical variables R

I looked for similar questions/answers on StackOverflow, but I could not find the answer I am looking for.
I have two categorical variables -
Region of education
Educational residence
Both variables (region of education and educational residence) are in one dataset after merging two different datasets.
Region of education has values such as Western, Non-western, Unknown.
Educational residence has only two values: In the USA and Out of the USA.
Now, I have to make a new categorical variable with the following recoding: Western as Education from Western countries, Non-Western as Education from Non-Western countries, and Unknown as Unknown, while from the educational residence I only need to use In the USA.
So in end, the new variable will have four options -
Education from western countries,
Education from non-western countries,
Education from the USA,
Unknown
Does anyone have an idea of how to do this?
I apologise as I cannot post the data output due to ethical and legal issues.
I will be very thankful for any kind of help.
You don't need to share real data to get a pointer to the correct answer, just a simple reproducible example should suffice. According to your description, the relevant columns in your data frame should look something like this reproducible example:
set.seed(1)
df <- data.frame(Region = sample(c("Western", "Non-Western", "Unknown"), 10, TRUE),
                 Residence = sample(c("USA", "Non-USA"), 10, TRUE))
df
#> Region Residence
#> 1 Western USA
#> 2 Unknown USA
#> 3 Western USA
#> 4 Non-Western USA
#> 5 Western Non-USA
#> 6 Unknown Non-USA
#> 7 Unknown Non-USA
#> 8 Non-Western Non-USA
#> 9 Non-Western USA
#> 10 Unknown USA
We can smoosh these columns together using ifelse. Where the Residence column is "USA", the output will be "USA", and otherwise it will retain the "Western", "Non-Western" and "Unknown" levels from the Region column:
df$Education <- ifelse(df$Residence == "USA", "USA", df$Region)
df
#> Region Residence Education
#> 1 Western USA USA
#> 2 Unknown USA USA
#> 3 Western USA USA
#> 4 Non-Western USA USA
#> 5 Western Non-USA Western
#> 6 Unknown Non-USA Unknown
#> 7 Unknown Non-USA Unknown
#> 8 Non-Western Non-USA Non-Western
#> 9 Non-Western USA USA
#> 10 Unknown USA USA
Created on 2022-08-20 with reprex v2.0.2
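If the longer labels from the question are wanted instead of the raw levels, the same ifelse idea can be combined with a named lookup vector. This is a minimal sketch on a small made-up data frame; the label text is taken from the question, and the column names follow the example above.

```r
# Small hypothetical data frame with the same columns as the example above
df <- data.frame(Region = c("Western", "Non-Western", "Unknown", "Western"),
                 Residence = c("USA", "Non-USA", "Non-USA", "Non-USA"))

# Named lookup: names are the Region values, elements are the new labels
labels <- c("Western"     = "Education from Western countries",
            "Non-Western" = "Education from non-Western countries",
            "Unknown"     = "Unknown")

# USA residence takes priority; otherwise translate Region via the lookup.
# as.character() guards against Region being stored as a factor.
df$Education <- ifelse(df$Residence == "USA",
                       "Education from the USA",
                       unname(labels[as.character(df$Region)]))
df$Education
```

A lookup vector keeps the recoding in one place, so adding a further region only means adding one named element.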

Add zero padding to numbers in a column by using str_pad in stringr package

I want to use the stringr str_pad function to put a column in my desired format, which means zero-padding the numbers in the "Code" column to 3 digits.
I've run this code:
Animals %>%
  gather(most_common, cnt, M:OG) %>%
  group_by(name) %>%
  slice(which.max(cnt)) %>%
  arrange(code)
Which resulted in the following tibble:
Code Name most_common
32 Monkey Africa
33 Wolf Europe
34 Tiger Asia
35 Godzilla Asia
#With 1 234 more rows
I'm happy with my code above. However, because I'm going to merge this df later on, I need the "Code" column to be three digits with zero padding (i.e. in the format "nnn" / "032"), as this:
Code Name most_common
032 Monkey Africa
033 Wolf Europe
034 Tiger Asia
035 Godzilla Asia
#With 1 234 more rows
I've run stringr str_pad($code, $3, $0), but it doesn't work. I guess there's something wrong there. Can I run this code anywhere in my chunk, or do I need to use %>%?
A possible solution:
library(tidyverse)
df <- read.table(text = "Code Name most_common
32 Monkey Africa
33 Wolf Europe
34 Tiger Asia
35 Godzilla Asia", header = T)
df %>%
  mutate(Code = str_pad(Code, width = 3, pad = "0"))
#> Code Name most_common
#> 1 032 Monkey Africa
#> 2 033 Wolf Europe
#> 3 034 Tiger Asia
#> 4 035 Godzilla Asia
In base R, we can use sprintf:
df$Code <- sprintf("%03d", df$Code)
Another option is formatC, with format "d" for integers and the flag "0" to prepend zeros:
df$Code <- formatC(df$Code, width = 3, format = "d", flag = "0")
df
#> Code Name most_common
#> 1 032 Monkey Africa
#> 2 033 Wolf Europe
#> 3 034 Tiger Asia
#> 4 035 Godzilla Asia
Created on 2022-07-23 by the reprex package (v2.0.1)
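One detail worth checking before a merge is how these functions treat codes that are already three digits or longer. The sketch below uses a made-up vector of codes and assumes they are stored as integers (sprintf with "%03d" does not accept character input, so character codes would need as.integer() first).

```r
# Hypothetical codes of varying width
codes <- c(5L, 32L, 123L, 1234L)

# Pad to at least three digits; longer codes are left as they are, not truncated
padded <- sprintf("%03d", codes)
padded

# formatC with the "0" flag gives the same strings
padded_fc <- formatC(codes, width = 3, format = "d", flag = "0")
identical(padded, padded_fc)
```

Both results are character vectors, so the key column in the other data frame needs to be character as well for the merge to match.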

R combine rows and columns within a dataframe

I've looked around for a while trying to figure this out, but I just can't seem to describe my problem concisely enough to google my way out of it. I am trying to work with Michigan COVID stats where the data has Detroit listed separately from Wayne County. I need to add Detroit's numbers to Wayne County's numbers, then remove the Detroit rows from the data frame.
I have included a screen grab too. For the purposes of this problem, can someone explain how I can get Detroit City added to Dickinson, and then make the Detroit City rows disappear? Thanks.
library(tidyverse)
library(openxlsx)
cases_deaths <- read.xlsx("https://www.michigan.gov/coronavirus/-/media/Project/Websites/coronavirus/Cases-and-Deaths/4-20-2022/Cases-and-Deaths-by-County-2022-04-20.xlsx?rev=f9f34cd7a4614efea0b7c9c00a00edfd&hash=AA277EC28A17C654C0EE768CAB41F6B5.xlsx")[,-5]
# Remove rows that don't describe counties
cases_deaths <- cases_deaths[-c(51,52,101,102,147,148,167,168),]
[Screenshot: code chunk output]
You could do:
cases_deaths %>%
  filter(COUNTY %in% c("Wayne", "Detroit City")) %>%
  mutate(COUNTY = "Wayne") %>%
  group_by(COUNTY, CASE_STATUS) %>%
  summarize_all(sum) %>%
  bind_rows(cases_deaths %>%
              filter(!COUNTY %in% c("Wayne", "Detroit City")))
#> # A tibble: 166 x 4
#> # Groups: COUNTY [83]
#> COUNTY CASE_STATUS Cases Deaths
#> <chr> <chr> <dbl> <dbl>
#> 1 Wayne Confirmed 377396 7346
#> 2 Wayne Probable 25970 576
#> 3 Alcona Confirmed 1336 64
#> 4 Alcona Probable 395 7
#> 5 Alger Confirmed 1058 8
#> 6 Alger Probable 658 5
#> 7 Allegan Confirmed 24109 294
#> 8 Allegan Probable 3024 52
#> 9 Alpena Confirmed 4427 126
#> 10 Alpena Probable 1272 12
#> # ... with 156 more rows
Created on 2022-04-23 by the reprex package (v2.0.1)
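A variation on the same idea is to relabel the Detroit City rows first and then summarise the whole table in one pass, which avoids splitting and re-binding the data. This is a rough sketch on a small made-up data frame with the same column names; the numbers are invented for illustration and are not the Michigan figures.

```r
library(dplyr)

# Toy data with the same columns as cases_deaths (values are invented)
cases <- data.frame(
  COUNTY      = c("Alcona", "Alcona", "Detroit City", "Detroit City", "Wayne", "Wayne"),
  CASE_STATUS = rep(c("Confirmed", "Probable"), 3),
  Cases       = c(10, 1, 100, 20, 200, 30),
  Deaths      = c(1, 0, 5, 2, 7, 3)
)

combined <- cases %>%
  # Relabel Detroit City as Wayne so both fall into the same group
  mutate(COUNTY = if_else(COUNTY == "Detroit City", "Wayne", COUNTY)) %>%
  group_by(COUNTY, CASE_STATUS) %>%
  # Sum the numeric columns within each county/status pair
  summarise(across(c(Cases, Deaths), sum), .groups = "drop")

combined
```

Because every county goes through the same group_by(), the result comes back sorted by county rather than with Wayne at the top.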

Change order of conditions when plotting normalised counts for single gene

I have a data frame of 17 samples, each with a location condition, and I would like to plot the counts for a single gene, "photosystem II protein D1 1", grouped by location.
View(metadata)
sample location
<chr> <chr>
1 X1344 West
2 X1345 West
3 X1365 West
4 X1366 West
5 X1367 West
6 X1419 West
7 X1420 West
8 X1421 West
9 X1473 Mid
10 X1475 Mid
11 X1528 Mid
12 X1584 East
13 X1585 East
14 X1586 East
15 X1678 East
16 X1679 East
17 X1680 East
View(countdata)
func X1344 X1345 X1365 X1366 X1367 X1419 X1420 X1421 X1473 X1475 X1528 X1584 X1585 X1586 X1678 X1679 X1680
photosystem II protein D1 1 11208 6807 3483 4091 12198 7229 7404 5606 6059 7456 4007 2514 5709 2424 2346 4447 5567
countdata contains thousands of genes but I am only showing the headers and gene of interest
ddsMat has been created like this:
ddsMat <- DESeqDataSetFromMatrix(countData = countdata,
                                 colData = metadata,
                                 design = ~ location)
When plotting:
library(DESeq2)
plotCounts(ddsMat, "photosystem II protein D1 1", intgroup=c("location"))
By default, the function plots the conditions alphabetically, e.g. East-Mid-West. But I would like them to appear on the graph in the order West-Mid-East.
[Image: plotCounts output]
Is there a way of doing this?
Thanks,
I have found that you can manually change the order like this:
ddsMat$location <- factor(ddsMat$location, levels=c("West", "Mid", "East"))
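The reason this works can be seen without DESeq2: a character vector converted with factor() gets its levels in alphabetical order unless levels is given explicitly, and that level order is what grouped plots generally follow. A small base R sketch with a made-up location vector:

```r
# Hypothetical sample locations
location <- c("West", "West", "Mid", "East", "East", "East")

# Default: levels are sorted alphabetically
f_default <- factor(location)
levels(f_default)

# Explicit levels fix the order used for grouping and plotting
f_ordered <- factor(location, levels = c("West", "Mid", "East"))
levels(f_ordered)

# Counts per group follow the level order
table(f_ordered)
```

Any value not listed in levels becomes NA, so the spelling in levels has to match the data exactly.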

How to custom order levels of a factor variable by 2+ conditions?

STP_name PCT
<fct> <dbl>
1 ENGLAND 2.58
2 MIDLANDS AND EAST OF ENGLAND 2.64
3 LEICESTER, LEICESTERSHIRE AND RUTLAND STP 2.31
4 BATH, SWINDON AND WILTSHIRE STP 2.94
5 KENT AND MEDWAY STP 2.8
6 BUCKINGHAMSHIRE, OXFORDSHIRE AND BERKSHIRE WEST STP 2.73
7 NORTH CENTRAL LONDON STP 1.93
8 NORTHAMPTONSHIRE STP 2.63
9 SOMERSET STP 3.41
10 SURREY HEARTLANDS STP 2.87
I would like to order the STP_name factor variable by their descending values in PCT, however I want the row for ENGLAND and MIDLANDS AND EAST OF ENGLAND to appear at the start of the factor levels regardless of their PCT value.
Ideally looking for a forcats solution but base is fine.
Assuming that the dataset has all unique 'STP_name' values, change it to a factor with levels specified based on the order of 'PCT', and then move the two levels ("ENGLAND", "MIDLANDS AND EAST OF ENGLAND") to the beginning with relevel or fct_relevel (from forcats):
library(tidyverse)
library(forcats)
df1 %>%
  mutate(STP_name = fct_relevel(factor(STP_name, levels = STP_name[order(-PCT)]),
                                c("ENGLAND", "MIDLANDS AND EAST OF ENGLAND")))
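Since base R was mentioned as acceptable, the same ordering can be sketched without forcats by building the level vector by hand. The data frame below is a made-up subset of the rows in the question, and the helper names (first, by_pct, lev) are only for illustration.

```r
# Subset of the question's data, for illustration
df1 <- data.frame(
  STP_name = c("ENGLAND", "MIDLANDS AND EAST OF ENGLAND", "SOMERSET STP",
               "KENT AND MEDWAY STP", "NORTH CENTRAL LONDON STP"),
  PCT = c(2.58, 2.64, 3.41, 2.8, 1.93)
)

# Levels that must come first regardless of PCT
first <- c("ENGLAND", "MIDLANDS AND EAST OF ENGLAND")

# All names sorted by descending PCT
by_pct <- as.character(df1$STP_name[order(-df1$PCT)])

# Pinned levels first, then the rest in PCT order (setdiff keeps by_pct's order)
lev <- c(first, setdiff(by_pct, first))

df1$STP_name <- factor(df1$STP_name, levels = lev)
levels(df1$STP_name)
```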
