Tidyr: Drop string until a certain character - r

What's the easiest way to drop a string before a certain character?
The data looks as follows:
library(tidyverse)
df <- data.frame(var1 = c("lang:10,q1:10,m2:20,q3:20,m5:10",
"lang:1,q1:10,m2:20,m3:20,q3:10",
"lang:100,q1:10,m2:20"))
Now, I'd like to remove the "lang:xy," part at the beginning of each row.
I tried to use "separate", but the comma is also used afterwards (everything that comes after the first comma should stay together).
So my desired output is:
var1
-------------------------
q1:10,m2:20,q3:20,m5:10
q1:10,m2:20,m3:20,q3:10
q1:10,m2:20
Thanks!

You can use str_remove from the stringr package:
df %>%
  mutate(
    var1 = var1 %>% str_remove("^lang:[0-9]*,")
  )

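Or try this, which splits each row on the commas, drops the lang piece, and pastes the rest back together:
library(tidyverse)
#Code
df %>%
  mutate(id = 1:n()) %>%                      # remember which row each piece came from
  separate_rows(var1, sep = ',') %>%
  filter(!grepl('^lang:', var1)) %>%          # drop only the leading "lang:..." piece
  group_by(id) %>%
  summarise(var1 = paste0(var1, collapse = ',')) %>%
  ungroup() %>%
  select(-id)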
Output:
# A tibble: 3 x 1
var1
<chr>
1 q1:10,m2:20,q3:20,m5:10
2 q1:10,m2:20,m3:20,q3:10
3 q1:10,m2:20

Just to round out the answers, the sub function from base R can also work here:
df$var1 <- sub("^lang:\\d+,", "", df$var1)
df
var1
1 q1:10,m2:20,q3:20,m5:10
2 q1:10,m2:20,m3:20,q3:10
3 q1:10,m2:20

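We can also use trimws from base R; its whitespace argument (available since R 3.6.0) accepts a regular expression:
df$var1 <- trimws(df$var1, which = "left", whitespace = "lang:\\d+,")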
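
If the prefix is not always "lang:" followed by digits, a more general pattern can drop everything up to and including the first comma. This is a minimal base R sketch; the vector name x is only for illustration, and elements without any comma are returned unchanged.

```r
x <- c("lang:10,q1:10,m2:20,q3:20,m5:10",
       "lang:1,q1:10,m2:20,m3:20,q3:10",
       "lang:100,q1:10,m2:20")

# "^[^,]*," matches from the start of the string up to and including
# the FIRST comma; sub() replaces only that one match, so the later
# commas are left alone
rest <- sub("^[^,]*,", "", x)
print(rest)
```

Because the character class [^,] cannot cross a comma, the match always stops at the first one, which is the behaviour separate could not give here.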

Related

What is the tidyverse way to apply a function designed to take data.frames as input across a grouped tibble in R?

I've written a function that takes multiple columns as its input that I'd like to apply to a grouped tibble, and I think that something with purrr::map might be the right approach, but I don't understand what the appropriate input is for the various map functions. Here's a dummy example:
myFun <- function(DF){
DF %>% mutate(MyOut = (A * B)) %>% pull(MyOut) %>% sum()
}
MyDF <- data.frame(A = 1:5, B = 6:10)
myFun(MyDF)
This works fine. But what if I want to add some grouping?
MyDF <- data.frame(A = 1:100, B = 1:100, Fruit = rep(c("Apple", "Mango"), each = 50))
MyDF %>% group_by(Fruit) %>% summarize(MyVal = myFun(.))
This doesn't work. I get the same value for every group in my data.frame or tibble. I then tried using something with purrr:
MyDF %>% group_by(Fruit) %>% map(.f = myFun)
Apparently, that's expecting character data as input, so that's not it.
This next variation is basically what I need, but the output is a list of lists rather than a tibble with one row for each value of Fruit:
MyDF %>% group_by(Fruit) %>% group_map(~ myFun(.))
We can use the OP's function in group_modify
library(dplyr)
MyDF %>%
  group_by(Fruit) %>%
  group_modify(~ .x %>%
    summarise(MyVal = myFun(.x))) %>%
  ungroup()
Output:
# A tibble: 2 × 2
Fruit MyVal
<chr> <int>
1 Apple 42925
2 Mango 295425
Or in group_map where the .y is the grouping column
MyDF %>%
  group_by(Fruit) %>%
  group_map(~ bind_cols(.y, MyVal = myFun(.))) %>%
  bind_rows()
# A tibble: 2 × 2
Fruit MyVal
<chr> <int>
1 Apple 42925
2 Mango 295425
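The summarize() attempt in the question returns the same value for every group because the magrittr `.` refers to the whole data frame on the left of the pipe, not to the current group. As a sketch, assuming dplyr >= 1.1.0 (where pick() is available), the current group's columns can be handed to the function directly inside summarise():

```r
library(dplyr)

# The function from the question: takes a data frame with columns A and B
myFun <- function(DF) {
  DF %>% mutate(MyOut = A * B) %>% pull(MyOut) %>% sum()
}

MyDF <- data.frame(A = 1:100, B = 1:100,
                   Fruit = rep(c("Apple", "Mango"), each = 50))

# pick(everything()) evaluates to a tibble holding the current group's
# non-grouping columns, so myFun() sees one group at a time
res <- MyDF %>%
  group_by(Fruit) %>%
  summarise(MyVal = myFun(pick(everything())))
print(res)
```

This keeps the result as one row per Fruit without the extra bind_rows step.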

add numbers to specific observations

I have several IDs I am working with. I want to add a leading zero to values that have a single digit after the dash in id. Here is sample data:
id
2034-5
1023-12
1042-22
1231-9
I want this:
id
2034-05
1023-12
1042-22
1231-09
I tried this, but it's not working. Any advice?
x <-sprintf("%02d", df$id)
You could actually use sub here for a base R option:
df$id <- sub("-(\\d)$", "-0\\1", df$id)
df
id
1 2034-05
2 1023-12
3 1042-22
4 1231-09
Data:
df <- data.frame(id=c("2034-5", "1023-12", "1042-22", "1231-9"), stringsAsFactors=FALSE)
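To use sprintf, you have to separate the two numbers, apply sprintf to the second one, and then combine them again. The second part is converted to integer so that %02d can zero-pad it; zero-padding a string with %02s is platform-dependent.
library(dplyr)
library(tidyr)
df %>%
  separate(id, c('id1', 'id2')) %>%
  mutate(id2 = sprintf('%02d', as.integer(id2))) %>%
  unite(id, id1, id2, sep = '-')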
# id
#1 2034-05
#2 1023-12
#3 1042-22
#4 1231-09
An option with strsplit and sprintf from base R (the second part is converted to integer so that %02d pads it reliably):
df$id <- sapply(strsplit(df$id, "-"), function(x)
  sprintf("%s-%02d", x[1], as.integer(x[2])))
df$id
#[1] "2034-05" "1023-12" "1042-22" "1231-09"
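The attempt in the question fails because %d expects a number, while id is a character string that contains a dash. The sketch below wraps the split, pad, and recombine idea in a small helper. The name pad_suffix and its width argument are made up for illustration, and the code assumes every id has exactly one dash followed by digits.

```r
ids <- c("2034-5", "1023-12", "1042-22", "1231-9")

# sprintf("%02d", ids) raises an error because ids is character, not integer
failed <- inherits(try(sprintf("%02d", ids), silent = TRUE), "try-error")

# Hypothetical helper: zero-pad the part after the dash to `width` digits
pad_suffix <- function(ids, width = 2) {
  parts  <- strsplit(ids, "-", fixed = TRUE)
  prefix <- vapply(parts, function(p) p[1], character(1))
  suffix <- vapply(parts, function(p) as.integer(p[2]), integer(1))
  fmt    <- paste0("%s-%0", width, "d")   # e.g. "%s-%02d"
  sprintf(fmt, prefix, suffix)
}

padded <- pad_suffix(ids)
print(padded)
```

Building the format string from width makes it easy to pad to three or more digits if the ids ever need it.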

Use REGEX in R to extract specific string in value as a new column?

I have a column that contains strings of characters/values that look like this
Current
111111~24-JUL-17 10:43:36~6.14
Desired Output
24-JUL-17 10:43:36
Hoping to take everything between the '~' --> So Date/Time and disregard everything else.
I have this code right now, but it only seems to capture part of it:
df$Last <- gsub(".+\\s(.+)$", "\\1", df$col1)
We can use tidyr's separate to get below result:
library(dplyr)
library(tidyr)
df <- data.frame(c1 = c('111111~24-JUL-17 10:43:36~6.14','111111~24-JUL-21 10:34:36~6.14'))
df
c1
1 111111~24-JUL-17 10:43:36~6.14
2 111111~24-JUL-21 10:34:36~6.14
df %>% separate(col = c1, into = c('x','Date','y'), sep = '~') %>% select(2)
Date
1 24-JUL-17 10:43:36
2 24-JUL-21 10:34:36
Using stringr package:
library(dplyr)
library(stringr)
df %>% mutate(c1 = str_extract(c1, '(?<=~).*(?=~)'))
c1
1 24-JUL-17 10:43:36
2 24-JUL-21 10:34:36
We can use sub in base R
df$c1 <- sub(".*~([^~]+)~.*", "\\1", df$c1)
df$c1
#[1] "24-JUL-17 10:43:36" "24-JUL-21 10:34:36"
data
df <- data.frame(c1 = c('111111~24-JUL-17 10:43:36~6.14',
'111111~24-JUL-21 10:34:36~6.14'))
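The gsub() call in the question keeps everything after the last whitespace, which is why only the time and the trailing value come back. Since the fields are delimited by a literal "~", another base R sketch is to split on it and take the second field. The vector name x is illustrative, and the code assumes each string has at least two "~"-separated fields.

```r
x <- c("111111~24-JUL-17 10:43:36~6.14",
       "111111~24-JUL-21 10:34:36~6.14")

# The question's pattern: ".+\\s" consumes everything up to the last
# whitespace, so the captured group starts at the time, not the date
partial <- gsub(".+\\s(.+)$", "\\1", x)

# Split on the literal "~" (fixed = TRUE avoids regex interpretation)
# and keep the second field of each string
fields <- strsplit(x, "~", fixed = TRUE)
middle <- vapply(fields, function(f) f[2], character(1))

print(partial)
print(middle)
```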

Tidyverse extract elements from character conditions

I would like to locate in this dataframe the elements of var1 for which the elements of var2 are different
data.frame(var1=c("a","a","a","b","b","c","c","c"),
var2=c("X","X","X","Y","Z","W","W","W"),
stringsAsFactors = F)
expected result
data.frame(var1=c("b"))
Many thanks in advance !
Does this work:
library(dplyr)
df %>% group_by(var1) %>% filter(length(unique(var2))>1) %>% distinct(var1)
# A tibble: 1 x 1
# Groups: var1 [1]
var1
<chr>
1 b
Following @Karthik S, here is an alternative approach, if I understood the question correctly.
library(dplyr)
df <- data.frame(var1=c("a","a","a","b","b","c","c","c"), var2=c("X","X","X","Y","Z","W","W","W"), stringsAsFactors = F)
dplyr::select(
distinct(df)[duplicated(distinct(df)$var1),], var1
)
Call select only if you need to guarantee that the output is a dataframe/tibble.
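For comparison, the same check can be sketched in base R by counting the distinct var2 values per var1 with tapply. The intermediate names n_vals and mixed are only for illustration.

```r
df <- data.frame(var1 = c("a", "a", "a", "b", "b", "c", "c", "c"),
                 var2 = c("X", "X", "X", "Y", "Z", "W", "W", "W"),
                 stringsAsFactors = FALSE)

# Number of distinct var2 values within each var1 group
n_vals <- tapply(df$var2, df$var1, function(v) length(unique(v)))

# Keep the var1 values whose group holds more than one distinct var2
mixed <- names(n_vals)[n_vals > 1]
print(mixed)
```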

Dplyr top_n returns multiple rows

Dplyr provides a function top_n(), however in case of equal values it returns all rows (more than one). I would like to return exactly one row per group. See the example below.
df <- data.frame(id1=c(rep("A",3),rep("B",3),rep("C",3)),id2=c(8,8,4,7,7,4,5,5,5))
df %>% group_by(id1) %>% top_n(n=1)
You can use a combination of arrange and slice
df %>%
group_by(id1) %>%
arrange(desc(id2)) %>%
slice(1)
Use desc within arrange if you want the largest element; otherwise leave it out.
slice_head() (dplyr >= 1.0.0) can be used in place of slice(1); it takes no column argument, only n:
df %>%
  group_by(id1) %>%
  arrange(desc(id2)) %>%
  slice_head(n = 1)
Use slice_max() with the argument with_ties = FALSE:
library(dplyr)
df %>%
group_by(id1) %>%
slice_max(id2, with_ties = FALSE)
# A tibble: 3 x 2
# Groups: id1 [3]
id1 id2
<chr> <dbl>
1 A 8
2 B 7
3 C 5
If you don't want to remember so many {dplyr} function names that are prone to change anyway, I can recommend the {data.table} package for such tasks. It is also often faster on large data.
library(data.table)
df <- data.frame(id1=c(rep("A",3),rep("B",3),rep("C",3)),id2=c(8,8,4,7,7,4,5,5,5))
setDT(df)
# sort by id2 in descending order first, so that head() picks the largest value per group
df[order(-id2),
   .(id2_head = head(id2, 1)),
   by = id1]
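A base R sketch of the same idea, without any packages: sort so that the largest id2 comes first within each id1, then keep the first row of every group with duplicated(). The object names d, sorted, and top1 are illustrative.

```r
d <- data.frame(id1 = rep(c("A", "B", "C"), each = 3),
                id2 = c(8, 8, 4, 7, 7, 4, 5, 5, 5))

# Order by group, then by id2 descending within each group
sorted <- d[order(d$id1, -d$id2), ]

# duplicated() is FALSE only for the first row of each id1,
# so this keeps exactly one row per group even when values tie
top1 <- sorted[!duplicated(sorted$id1), ]
print(top1)
```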
