Separating a column with multiple different entries with tidyr - r

I am trying to split one column in a data frame, which shows the active period(s) of several artists/bands, into two columns (start_of_career, end_of_career). The variable class is character. I tried to use tidyr's separate function, and when I run it I see the split result in the console but not in the data frame itself, so I assume that it doesn't work properly.
Please see here a made up example of the data I want to split:
Column A    Column B
Artist A    1995-present
Artist B    1995-1997, 2008, 2010-present
As you can see, some rows consist only of a start and end date, while others have several dates.
All I actually need is the first number and the last, e.g. for Artist B I need only start_of_career 1995 and end_of_career "present". But I am somehow not able to solve this issue.
The code I used was:
library(tidyr)
df %>% separate(col = period_active, into = c('start_of_career', 'end_of_career'), sep = '-')
I also tried other separators (",", " "), but that didn't work either.
I also tried:
df$start_of_career = strsplit(df$period_active, split = '-')
But this didn't work either.

Using df, shown reproducibly in the Note at the end, remove everything except the first and last parts of Column B and then separate what is left.
library(dplyr)
library(tidyr)
df %>%
mutate(`Column B` = sub("-.*-", "-", `Column B`)) %>%
separate(`Column B`, c("start", "end"))
## Column A start end
## 1 Artist A 1995 present
## 2 Artist B 1995 present
Note
df <-
structure(list(`Column A` = c("Artist A", "Artist B"), `Column B` = c("1995-present",
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA,
-2L))
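On the point raised in the question (the split shows up in the console but not in the data frame): separate returns a new data frame rather than changing df in place, so the result has to be assigned to a name. A minimal sketch, assuming a column called period_active as in the question; the artist column and the df_split name are made up for illustration:

```r
library(tidyr)

df <- data.frame(
  artist = c("Artist A", "Artist B"),
  period_active = c("1995-present", "1995-1997, 2008, 2010-present"),
  stringsAsFactors = FALSE
)

# Collapse everything between the first and the last dash into one dash
df$period_active <- sub("-.*-", "-", df$period_active)

# Without the assignment the result is only printed and df stays as it was
df_split <- separate(df, col = period_active,
                     into = c("start_of_career", "end_of_career"),
                     sep = "-")

names(df)        # still contains period_active
names(df_split)  # contains start_of_career and end_of_career
```

Assigning back to df instead of df_split would overwrite the original data frame, which is usually what is wanted here.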

Using base R
df <- cbind(df[1], read.table(text = sub("-[0-9, ]+", "", df$`Column B`),
header = FALSE, col.names = c("start", "end"), sep = "-"))
-output
> df
Column A start end
1 Artist A 1995 present
2 Artist B 1995 present
We could do this with separate as well
library(tidyr)
separate(df, `Column B`, into = c("start", "end"), sep = "-[^A-Za-z]*")
Column A start end
1 Artist A 1995 present
2 Artist B 1995 present
data
df <- structure(list(`Column A` = c("Artist A", "Artist B"),
`Column B` = c("1995-present",
"1995-1997, 2008, 2010-present")), class = "data.frame",
row.names = c(NA,
-2L))
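To see what the two patterns above actually match, it may help to apply them to the raw strings first. A small base R sketch; the variable names x, trimmed and pieces are only illustrative:

```r
x <- c("1995-present", "1995-1997, 2008, 2010-present")

# Removes the first dash together with the run of digits, commas and
# spaces that follows it. The first string has no such run after its
# dash, so it is returned unchanged.
trimmed <- sub("-[0-9, ]+", "", x)

# Splits on a dash plus everything after it that is not a letter,
# which swallows all the intermediate years in one go.
pieces <- strsplit(x, "-[^A-Za-z]*")

trimmed
pieces
```

Both patterns rely on the last piece starting with a letter ("present"). If the final entry were a year, they would behave differently, so this is worth checking against the real data.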

We could use separate_rows and then filter for first and last row of group:
library(tidyr)
library(dplyr)
df %>%
separate_rows(Column.B) %>%
group_by(Column.A) %>%
filter(row_number()==1 | row_number()==n()) %>%
mutate(Colum.C = c("start", "end"))
Column.A Column.B Colum.C
<chr> <chr> <chr>
1 Artist A 1995 start
2 Artist A present end
3 Artist B 1995 start
4 Artist B present end
data:
structure(list(Column.A = c("Artist A", "Artist B"), Column.B = c("1995-present",
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA,
-2L))
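The result above is in long format. If the two-column layout from the question is preferred, one possible variation is to summarise each group with first and last instead of filtering. This is only a sketch; the wide name is made up:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Column.A = c("Artist A", "Artist B"),
  Column.B = c("1995-present", "1995-1997, 2008, 2010-present"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  separate_rows(Column.B) %>%           # one row per year / "present"
  group_by(Column.A) %>%
  summarise(start = first(Column.B),    # first entry of each artist
            end = last(Column.B),       # last entry of each artist
            .groups = "drop")

wide
```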

Use strsplit and then pick the first and the last entry.
library(dplyr)
df %>%
rowwise() %>%
mutate(splitrow = strsplit(`Column B`, "-"),
start_of_career = splitrow[1],
end_of_career = splitrow[length(splitrow)],
splitrow = NULL) %>%
ungroup()
# A tibble: 2 × 4
`Column A` `Column B` start_of_career end_of_career
<chr> <chr> <chr> <chr>
1 Artist A 1995-present 1995 present
2 Artist B 1995-1997, 2008, 2010-present 1995 present
Data
df <- structure(list(`Column A` = c("Artist A", "Artist B"), `Column B` = c("1995-present",
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA,
-2L))
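This also explains why df$start_of_career = strsplit(...) in the question did not give the expected column: strsplit returns a list with one character vector per row, not a plain character vector. A base R sketch that picks the first and last piece of each list element without rowwise; the column names artist and period_active are assumptions taken from the question:

```r
df <- data.frame(
  artist = c("Artist A", "Artist B"),
  period_active = c("1995-present", "1995-1997, 2008, 2010-present"),
  stringsAsFactors = FALSE
)

# strsplit() returns a list: one character vector of pieces per row
parts <- strsplit(df$period_active, "-", fixed = TRUE)

# Take the first and the last piece of every vector
df$start_of_career <- vapply(parts, function(p) p[1], character(1))
df$end_of_career <- vapply(parts, function(p) p[length(p)], character(1))

df
```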

Another option: use strsplit and return a list of the start and end values.
library(dplyr)
library(tidyr)
# Split on dashes, commas or spaces and keep the first and last piece
f <- \(v) {
v <- strsplit(v, "-|,| ")[[1]]
list(start = v[1], end = v[length(v)])
}
df %>%
mutate(`Column B` = lapply(`Column B`, f)) %>%
unnest_wider(`Column B`)
Output:
# A tibble: 2 × 3
`Column A` start end
<chr> <chr> <chr>
1 Artist A 1995 present
2 Artist B 1995 present

The code below extracts the first word before the first dash and the last word after the last dash.
# Loop over rows; length(df) would count columns, not rows
for (i in seq_len(nrow(df)))
{
df$start[i] <- sub("-.*", "", df$`Column B`[i])
df$end[i] <- sub("^.+-", "", df$`Column B`[i])
}
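Since sub is vectorised, the same extraction can also be written without a loop. A minimal sketch; the third entry ("2008", with no dash at all) is a made-up case added to show what happens when the pattern does not match:

```r
period <- c("1995-present", "1995-1997, 2008, 2010-present", "2008")

# sub() works on whole vectors, so no loop is required
start <- sub("-.*", "", period)   # drop everything from the first dash on
end   <- sub("^.*-", "", period)  # drop everything up to the last dash

data.frame(period, start, end)
```

When there is no dash, sub returns the input unchanged, so a single year ends up as both start and end.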

Related

Count unique values per month in R

I have a dataset with dead bird records from field observers.
Death.Date Observer Species Bird.ID
1 03/08/2021 DA MF FC10682
2 15/08/2021 AG MF FC10698
3 12/01/2022 DA MF FC20957
4 09/02/2022 DA MF FC10708
I want to produce a dataset from this with the number of unique Bird.ID / Month so I can produce a graph from that. ("unique" because some people make mistakes and enter a bird twice sometimes).
The output in this case would be:
Month Number of dead
08/2021 2
01/2022 1
02/2022 1
The idea is to use the distinct function but by month (knowing the value is in date format dd/mm/yyyy).
If your Date column is of character type, first convert it to date type with dmy.
Change the format to month and year.
group_by and summarise, counting the distinct Bird.ID values.
library(dplyr)
library(lubridate) # in case your Date is in character format
df %>%
mutate(Death.Date = dmy(Death.Date)) %>% # you may not need this line
mutate(Month = format(as.Date(Death.Date), "%m/%Y")) %>%
group_by(Month) %>%
summarise(`Number of dead` = n_distinct(Bird.ID)) # count each bird only once per month
Month `Number of dead`
<chr> <int>
1 01/2022 1
2 02/2022 1
3 08/2021 2
For completeness, this can be achieved using aggregate without any additional packages:
df <- data.frame(
Death.Date = c("3/8/2021", "15/08/2021", "12/1/2022", "9/2/2022"),
Observer = c("DA", "AG", "DA", "DA"),
Species = c("MF", "MF", "MF", "MF"),
Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708")
)
aggregate.data.frame(
x = df["Bird.ID"],
by = list(death_month = format(as.Date(df$Death.Date, "%d/%m/%Y"), "%m/%Y")),
FUN = function(x) {length(unique(x))}
)
Notes
The anonymous function function(x) {length(unique(x))} provides the count of the unique values
The format(as.Date(df$Death.Date, "%d/%m/%Y"), "%m/%Y") call ensures that the month/year string is provided
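To make the difference between counting rows and counting unique birds visible, here is a small base R sketch with one deliberately duplicated record. The fifth row and the names birds, raw_counts and unique_counts are made up for illustration:

```r
birds <- data.frame(
  Death.Date = c("03/08/2021", "15/08/2021", "12/01/2022", "09/02/2022",
                 "03/08/2021"),
  Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708",
              "FC10682"),  # last record repeats the first bird
  stringsAsFactors = FALSE
)

birds$Month <- format(as.Date(birds$Death.Date, "%d/%m/%Y"), "%m/%Y")

# Counts rows, so the duplicated entry is counted twice
raw_counts <- table(birds$Month)

# Counts each Bird.ID only once per month
unique_counts <- tapply(birds$Bird.ID, birds$Month,
                        function(x) length(unique(x)))

raw_counts
unique_counts
```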
data.table solution
library(data.table)
library(lubridate)
# Reproducible example with a duplicated bird
deadbirds <- data.table::data.table(Death.Date = c("03/08/2021", "15/08/2021", "12/01/2022", "09/02/2022", "03/08/2021"),
Observer = c("DA", "AG", "DA", "DA", "DA"),
Species = c("MF", "MF", "MF" , "MF", "MF"),
Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708", "FC10682"))
# Clean dataset, option 1: delete all duplicated rows
deadbirds <- base::unique(deadbirds)
# Clean dataset, option 2: keep only the first row per bird (can be useful when there are duplicates with different values in irrelevant columns)
deadbirds <- deadbirds[
j = .SD[1],
by = c("Bird.ID")
]
# Death.Date as date
deadbirds <- deadbirds[
j = Death.Date := lubridate::dmy(Death.Date)
]
# Create month.Death.Date
deadbirds <- deadbirds[
j = month.Death.Date := base::paste0(lubridate::month(Death.Date),
"/",
lubridate::year(Death.Date))
]
# Count by month (returns one row per month instead of adding a column to every row)
deadbirds <- deadbirds[
j = .(`Number of dead` = .N),
by = month.Death.Date]
A possible solution, based on tidyverse, lubridate and zoo::as.yearmon:
library(tidyverse)
library(lubridate)
library(zoo)
df <- data.frame(
Death.Date = c("3/8/2021", "15/08/2021", "12/1/2022", "9/2/2022"),
Observer = c("DA", "AG", "DA", "DA"),
Species = c("MF", "MF", "MF", "MF"),
Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708")
)
df %>%
group_by(date = as.yearmon(dmy(Death.Date))) %>%
summarise(nDead = n_distinct(Bird.ID), .groups = "drop")
#> # A tibble: 3 x 2
#> date nDead
#> <yearmon> <int>
#> 1 Aug 2021 2
#> 2 Jan 2022 1
#> 3 Feb 2022 1
You could use:
as.data.frame(table(format(as.Date(df$Death.Date,'%d/%m/%Y'), '%m/%Y')))
# Var1 Freq
# 1 01/2022 1
# 2 02/2022 1
# 3 08/2021 2
data:
df <- data.frame(
Death.Date = c("3/8/2021", "15/08/2021", "12/1/2022", "9/2/2022"),
Observer = c("DA", "AG", "DA", "DA"),
Species = c("MF", "MF", "MF", "MF"),
Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708")
)
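One thing to keep in mind for the graph: month labels in "%m/%Y" form sort alphabetically, which is why 01/2022 appears before 08/2021 in several outputs above. A possible workaround, sketched here in base R, is to build the key year-first so that alphabetical order matches chronological order (the names dates, month_key and counts are illustrative):

```r
dates <- as.Date(c("3/8/2021", "15/08/2021", "12/1/2022", "9/2/2022"),
                 "%d/%m/%Y")

# Year-first keys sort in chronological order
month_key <- format(dates, "%Y-%m")

counts <- as.data.frame(table(Month = month_key))
counts
```

Note that table counts rows, so duplicated Bird.ID entries should be removed beforehand if they can occur.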

extract component from a list of file names in a column to create a new column in R

I have a data frame with numerous columns and rows. One column in particular "Filename" has information that I would like to separate and make into a new column "ID"
A Filename
1 Sample.2020-03-16_2345_WES01_FF001_089-267/2355245_H445.FASTQs/AA56789_1.clipped.fastq.gz
2 Sample.2020-03-15_2355_WES01_FF001_089-267/2345245_H345.FASTQs/AA52399_1.clipped.fastq.gz
My new df2 I would like to create is
A ID
1 AA56789
2 AA52399
I have not written any code for this, as I am just beginning to understand sub and gsub. Any help would be appreciated, thank you!
We could use basename with trimws
df2 <- cbind(df1['A'], ID = trimws(basename(df1$Filename), whitespace = "_.*"))
-output
df2
A ID
1 1 AA56789
2 2 AA52399
data
df1 <- structure(list(A = 1:2, Filename = c("Sample.2020-03-16_2345_WES01_FF001_089-267/2355245_H445.FASTQs/AA56789_1.clipped.fastq.gz",
"Sample.2020-03-15_2355_WES01_FF001_089-267/2345245_H345.FASTQs/AA52399_1.clipped.fastq.gz"
)), class = "data.frame", row.names = c(NA, -2L))
You can use a regex to extract the value of interest. In base R, using sub:
cbind(df[1], ID = sub('.*/(\\w+)_.*', '\\1', df$Filename))
# A ID
#1 1 AA56789
#2 2 AA52399
A tidyverse-solution:
library(dplyr)
library(stringr)
df %>%
mutate(ID = str_replace(Filename, '.*/(\\w+)_.*', '\\1'), .keep = "unused")
which is basically the same as the base R sub approach above, or
df %>%
mutate(ID = str_extract(Filename, "(?<=/)(\\w+)(?=_\\d+\\.clipped)"), .keep = "unused")
Both return
# A tibble: 2 x 2
A ID
<dbl> <chr>
1 1 AA56789
2 2 AA52399
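For someone just getting started with sub, it may be easier to read the extraction as two separate steps: take the file name without its directories, then cut everything from the first underscore onwards. A minimal base R sketch; the variable names are illustrative:

```r
filename <- c(
  "Sample.2020-03-16_2345_WES01_FF001_089-267/2355245_H445.FASTQs/AA56789_1.clipped.fastq.gz",
  "Sample.2020-03-15_2355_WES01_FF001_089-267/2345245_H345.FASTQs/AA52399_1.clipped.fastq.gz"
)

# Step 1: keep only the part after the last "/"
file_only <- basename(filename)

# Step 2: remove the first underscore and everything after it
id <- sub("_.*$", "", file_only)

id
```

This assumes the ID never contains an underscore itself, which holds for the two example rows.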

R - Collapsing observations and creating new columns

In my dataframe there are multiple rows for a single observation (each referenced by rif). I would like to collapse the rows and create new columns from the keyword column. The outcome would include as many keyword columns as the number of rows for an observation (e.g. keyword_1, keyword_2, etc.). Do you have any idea? Thanks a lot.
This is my MWE
df1 <- structure(list(rif = c("text10", "text10", "text10", "text11", "text11"),
date = c("20180329", "20180329", "20180329", "20180329", "20180329"),
keyword = c("Lucca", "Piacenza", "Milano", "Cascina", "Padova")),
row.names = c(NA, 5L), class = "data.frame")
Does this work:
library(dplyr)
library(tidyr)
df1 %>%
group_by(rif, date) %>%
mutate(n = row_number()) %>%
pivot_wider(id_cols = c(rif, date), values_from = keyword, names_from = n, names_prefix = 'keyword')
# A tibble: 2 x 5
# Groups: rif, date [2]
rif date keyword1 keyword2 keyword3
<chr> <chr> <chr> <chr> <chr>
1 text10 20180329 Lucca Piacenza Milano
2 text11 20180329 Cascina Padova NA
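If the column names should match the keyword_1, keyword_2 pattern from the question exactly, one small variation is to put the underscore into names_prefix and to ungroup before widening. A sketch under those assumptions; the wide name is made up:

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  rif = c("text10", "text10", "text10", "text11", "text11"),
  date = c("20180329", "20180329", "20180329", "20180329", "20180329"),
  keyword = c("Lucca", "Piacenza", "Milano", "Cascina", "Padova"),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  group_by(rif, date) %>%
  mutate(n = row_number()) %>%   # position of each keyword within its group
  ungroup() %>%
  pivot_wider(names_from = n, values_from = keyword,
              names_prefix = "keyword_")

names(wide)
```

Groups with fewer keywords than the largest group are filled with NA, as in the output above.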

How to replace with only the part before the ":" in every row of a column in R

In a dataset, I have a column named "Interventions", and each row looks like this:
row1: "Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600"
row2: "Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
I want to extract only the intervention type, such as "Drug", "Biological" or "Procedure", and keep that in the column. Even better would be to keep only the unique intervention types, instead of "Drug" four times as in the first row.
The expected output would look like this:
row1: "Drug"
row2: "Biological, Drug, Procedure"
I am just getting started with R. I have tidyverse installed and am somewhat used to working with %>%. If anyone can help me with this, it would be much appreciated!
If we want to extract only the prefix part before the :
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
df1 %>%
mutate(Interventions = map_chr(str_extract_all(Interventions,
"\\w+(?=:)"), ~ toString(sort(unique(.x)))))
# Interventions
#1 Drug
#2 Biological, Drug, Procedure
Or another option is to separate the rows based on the delimiters, slice the alternate rows and paste together the sorted unique values in 'Interventions'
df1 %>%
mutate(rn = row_number()) %>%
separate_rows(Interventions, sep="[:|]") %>%
group_by(rn) %>%
slice(seq(1, n(), by = 2)) %>%
distinct() %>%
summarise(Interventions = toString(sort(unique(Interventions)))) %>%
ungroup %>%
select(-rn)
# A tibble: 2 x 1
# Interventions
# <chr>
#1 Drug
#2 Biological, Drug, Procedure
data
df1 <- structure(list(Interventions = c("Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
"Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
)), class = "data.frame", row.names = c(NA, -2L))
Not as concise, and the same logic as the answer above, but in base R:
# Create df:
df1 <- structure(list(Interventions = c("Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
"Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
)), class = "data.frame", row.names = c(NA, -2L))
# Assign a row id vec:
df1$row_num <- 1:nrow(df1)
# Split string on | delim:
split_up <- strsplit(df1$Interventions, split = "[|]")
# Roll down the dataframe - keep uniques:
rolled_out <- unique(data.frame(row_num = rep(df1$row_num, sapply(split_up, length)),
Interventions = gsub("[:].*","", unlist(split_up))))
# Stack the dataframe:
df2 <- aggregate(Interventions~row_num, rolled_out, paste0, collapse = ", ")
# Drop id vec:
df2 <- within(df2, rm("row_num"))
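The same idea can probably be written more compactly in base R by handling each row inside vapply. A sketch, using a shortened version of the example data (the interventions and types names are illustrative):

```r
interventions <- c(
  "Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab",
  "Biological: alemtuzumab|Drug: carmustine|Drug: cytarabine|Procedure: allogeneic bone marrow"
)

types <- vapply(
  strsplit(interventions, "|", fixed = TRUE),  # one vector of entries per row
  function(parts) {
    prefix <- sub(":.*", "", parts)            # keep the text before the colon
    toString(sort(unique(prefix)))             # unique types, comma separated
  },
  character(1)
)

types
```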

Remove row below conditionally in dataframe and add values together in R

I have a large dataset with 3 columns: Name, Country, and Sales.
I'd like to sum the Sales column by Names that are both identical and occur consecutively. Then I'd like to remove all rows but the first occurrence of a series, replacing the value of Sales with the series sum.
For example:
Name,Country,Sales
A,V,100
A,W,100
B,X,100
B,Y,100
A,Z,100
Would be reduced to:
Name,Country,Sales
A,V,200
B,X,200
A,Z,100
Anyone got any idea how to do this?
Your data
df <- structure(list(Name = c("A", "A", "B"), Country = c("X", "Y",
"Z"), Sales = c(100L, 100L, 100L)), .Names = c("Name", "Country",
"Sales"), row.names = c(NA, -3L), class = c("data.table", "data.frame"
))
dplyr solution
library(dplyr)
library(data.table)
ans <- df %>%
group_by(rleid(Name)) %>%
summarise(Name = unique(Name), Sales=sum(Sales)) %>%
select(-1)
Output
Name Sales
<chr> <int>
1 A 200
2 B 100
Alternative example
newdf <- rbind(df, data.frame(Name=c("A","A","B","B"),
Country=c("A","B","C","D"),
Sales=c(100,100,100,100)))
ans <- newdf %>%
group_by(rleid(Name)) %>%
summarise(Name = unique(Name), Sales=sum(Sales)) %>%
select(-1)
Output
Name Sales
<fctr> <dbl>
1 A 200
2 B 100
3 A 200
4 B 200
Here's another solution using sqldf:
library(data.table)
df <- fread("Name,Country,Sales
A,V,100
A,W,100
B,X,100
B,Y,100
A,Z,100")
df$rle = rleid(df$Name)
library(sqldf)
sqldf("select min(rowid) as row_names,
Name,
Country,
sum(Sales) as Sales
from df group by rle", row.names = TRUE)
# Name Country Sales
# 1 A V 200
# 3 B X 200
# 5 A Z 100
row.names = TRUE searches for a column named row_names and treats it as row names, so min(rowid) will not show up as a new column if I set it as row_names.
Try this:
library(dplyr)
library(data.table) # provides rleid()
df %>%
group_by(Series = rleid(Name)) %>%
mutate(Sales = sum(Sales)) %>%
filter(row_number() == 1)
Output:
Name Country Sales Series
1 A V 200 1
2 B X 200 2
3 A Z 100 3
Sample data:
require(data.table)
df <- fread("Name,Country,Sales
A,V,100
A,W,100
B,X,100
B,Y,100
A,Z,100")
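All of the answers above rely on data.table's rleid to give every run of consecutive identical names its own id. For readers who want to see what that step does, here is a rough base R equivalent built on rle. This is a sketch only, and the names runs, run_id and result are made up:

```r
df <- data.frame(
  Name = c("A", "A", "B", "B", "A"),
  Country = c("V", "W", "X", "Y", "Z"),
  Sales = c(100, 100, 100, 100, 100),
  stringsAsFactors = FALSE
)

# rle() finds the runs; repeating the run index by run length
# gives one id per row: 1 1 2 2 3
runs <- rle(df$Name)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Keep the first row of every run and replace Sales with the run total
result <- df[!duplicated(run_id), ]
result$Sales <- as.vector(tapply(df$Sales, run_id, sum))

result
```

Because the id changes whenever the name changes, the second block of "A" rows gets its own id and is not merged with the first block.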
