combine tibbles with inexact values - r

I have two tibbles and I want to combine them based on the Batsman column. However, the values in the 2 columns are not completely identical, i.e. "V Kohli" vs. "Virat Kohli (IND)". How can I combine the tibbles based on these inexact matches?
Thank you!
x1 <- tibble(Batsman=c("V Kohli (INDIA)","RG Sharma (INDIA)","Babar Azam (PAK)","GJ Maxwell (AUS)"),
Runs=c(500,400,300,200),
Matches=c(67,54,47,23)
x2 <- tibble(Rank=c(1,2,3,4),
Batsman=c("Virat Kohli", "Rohit Sharma", "Glenn Maxwell","Babar Azam"),
Rating=c(853,820,640,500))

So you want to join two texts strings,
> x1$Batsman
[1] "V Kohli (INDIA)" "RG Sharma (INDIA)" "Babar Azam (PAK)" "GJ Maxwell (AUS)"
> x2$Batsman
[1] "Virat Kohli" "Rohit Sharma" "Glenn Maxwell" "Babar Azam"
I guess you have much more names than these four?
It is definitely a tricky task, computer are notoriously bad at doing these kind of tasks. (There is some famous examples of very long function just to read phone numbers). From the strings you provide, i can see they always have similar names.
I would use stringr to extract the names with a regexp.
The full code:
library(tibble)
library(stringr)
x1 <- tibble(Batsman=c("V Kohli (INDIA)","RG Sharma (INDIA)","Babar Azam (PAK)","GJ Maxwell (AUS)"),
Runs=c(500,400,300,200),
Matches=c(67,54,47,23) )
x2 <- tibble(Rank=c(1,2,3,4),
Batsman=c("Virat Kohli", "Rohit Sharma", "Glenn Maxwell","Babar Azam"),
Rating=c(853,820,640,500))
AA <- str_sub(x1$Batsman, start = str_locate(x1$Batsman, " ")[,1]+1, 20)
AA <- str_sub(AA, start = 1, end = str_locate(AA, " ")[,1]-1) %>%
str_to_lower()
BB <- str_sub(x2$Batsman, start = str_locate(x2$Batsman, " ")[,1]+1, 20) %>%
str_to_lower()
match(AA, BB)

Related

Extract certain words from dynamic strings vector

I'm working with questionnaire datasets where I need to extract some brands' names from several questions. The problem is each data might have a different question line, for example:
Data #1
What do you know about AlphaToy?
Data #2
What comes to your mind when you heard AlphaCars?
Data #3
What do you think of FoodTruckers?
What I want to extract are the words AlphaToy, AlphaCars, and FoodTruckers. In Excel, I can get those brands' names via flash fill, the illustration is below.
As I working with R, I need to convert the "flash fill" step into an R function, yet I couldn't found out how to do it. Here's desired output:
brandName <- list(
Toy = c(
"1. What do you know about AlphaToy?",
"2. What do you know about BetaToyz?",
"3. What do you know about CharlieDoll?",
"4. What do you know about DeltaToys?",
"5. What do you know about Echoty?"
),
Car = c(
"18. What comes to your mind when you heard AlphaCars?",
"19. What comes to your mind when you heard BestCar?",
"20. What comes to your mind when you heard CoolCarz?"
),
Trucker = c(
"5. What do you think of FoodTruckers?",
"6. What do you think of IceCreamTruckers?",
"7. What do you think of JellyTruckers?",
"8. What do you think of SodaTruckers?"
)
)
extractBrandName <- function(...) {
#some codes here
}
#desired output
> extractBrandName(brandName$Toy)
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
As the title says, the function should work to dynamic strings, so when the function is applied to brandName the desired output is:
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
$Car
[1] "AlphaCars" "BestCar" "CoolCarz"
$Trucker
[1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
Edit:
The brand name can be in lowercase, uppercase, or even two words or more, for instance: IBM, Louis Vuitton
The brand names might appear in the middle of the sentence, it's not always come at the end of the sentence. The thing is, the sentences are unpredictable because each client might provide different data of each other
Can anyone help me with the function code to achieve the desired output? Thank you in advance!
Edit, here's attempt
The idea (thanks to shs' answer) is to find similar words from the input, then exclude them leaving the unique words (it should be the brand names) behind. Following this post, I use intersect() wrapped inside a Reduce() to get the common words, then I exclude them via lapply() and make sure any two or more words brand names merged together with str_c(collapse = " ").
Code
library(stringr)
extractBrandName <- function(x) {
cleanWords <- x %>%
str_remove_all("^\\d+|\\.|,|\\?") %>%
str_squish() %>%
str_split(" ")
commonWords <- cleanWords %>%
Reduce(intersect, .)
extractedWords <- cleanWords %>%
lapply(., function(y) {
y[!y %in% commonWords] %>%
str_c(collapse = " ")
}) %>% unlist()
return(extractedWords)
}
Output (1st test case)
> #output
> extractBrandName(brandName$Toy)
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
$Car
[1] "AlphaCars" "BestCar" "CoolCarz"
$Trucker
[1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
Output (2nd test case)
This test case includes two or more words brand names, located at the middle and the beginning of the sentence.
brandName2 <- list(
Middle = c("Have you used any products from AlphaToy this past 6 months?",
"Have you used any products from BetaToys Collection this past 6 months?",
"Have you used any products from Charl TOYZ this past 6 months?"),
First = c("AlphaCars is the best automobile dealer, yes/no?",
"Best Vehc is the best automobile dealer, yes/no?",
"CoolCarz & Bike is the best automobile dealer, yes/no?")
)
> #output
> lapply(brandName2, extractBrandName)
$Middle
[1] "AlphaToy" "BetaToys Collection" "Charl TOYZ"
$First
[1] "AlphaCars" "Best Vehc" "CoolCarz & Bike"
In the end, the solution to this problem is found. Thanks to shs who gave the initial idea and the answer from the post I linked above. If you have any suggestions, please feel free to comment. Thank you.
This function checks which words the first two strings have in common and then removes everything from the beginning of the strings up to and including the common element, leaving only the desired part of the string:
library(stringr)
extractBrandName <- function(x) {
x %>%
str_split(" ") %>%
{.[[1]][.[[1]] %in% .[[2]]]} %>%
str_c(collapse = " ") %>%
str_c("^.+", .) %>%
str_remove(x, .) %>%
str_squish() %>%
str_remove("\\?")
}
lapply(brandName, extractBrandName)
#> $Toy
#> [1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
#>
#> $Car
#> [1] "AlphaCars" "BestCar" "CoolCarz"
#>
#> $Trucker
#> [1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"

Removing strings after in a dataframe to make group

This is my set of col name which i want to make group.
colnames(data4)
[1] "VC_1_UI_S3" "VC_2_UI_S4" "BAF60A_KD_1_S13" "BAF60A_KD_2_S14" "VC_VD3_1_S7" "VC_VD3_2_S8"
[7] "BAF60A_VD3_1_S15" "BAF60A_VD3_2_S16"
Im doing this
metadata$Group <- sub("_[^[:alpha:]]+_S[0-9]", "", (colnames(data4)))
Which results in this
metadata
Group
VC_1_UI_S3 VC_1_UI_S3
VC_2_UI_S4 VC_2_UI_S4
BAF60A_KD_1_S13 BAF60A_KD3
BAF60A_KD_2_S14 BAF60A_KD4
VC_VD3_1_S7 VC_VD3
VC_VD3_2_S8 VC_VD3
BAF60A_VD3_1_S15 BAF60A_VD35
BAF60A_VD3_2_S16 BAF60A_VD36
For VC_VD3_1_S7 and VC_VD3_1_S8 Im getting the desired result which is only VC_VD3 but not for others so what I require is like this below
Group
VC_1_UI_S3 VC_1_UI
VC_2_UI_S4 VC_2_UI
BAF60A_KD_1_S13 BAF60A_KD
BAF60A_KD_2_S14 BAF60A_KD
VC_VD3_1_S7 VC_VD
VC_VD3_2_S8 VC_VD
BAF60A_VD3_1_S15 BAF60A_VD3
BAF60A_VD3_2_S16 BAF60A_VD3
Any suggestion or help really appreciated
We could use
sub("_\\d*_*S[0-9]+$", "", x)
#[1] "VC_1_UI" "VC_2_UI" "BAF60A_KD" "BAF60A_KD"
#[5]"VC_VD3" "VC_VD3" "BAF60A_VD3" "BAF60A_VD3"
Or use str_remove from stringr
library(stringr)
str_remove(x, "_\\d*_*S[0-9]+$")
data
x <- c("VC_1_UI_S3", "VC_2_UI_S4", "BAF60A_KD_1_S13", "BAF60A_KD_2_S14",
"VC_VD3_1_S7", "VC_VD3_2_S8", "BAF60A_VD3_1_S15", "BAF60A_VD3_2_S16"
)
You can try this regex making the number before S[0-9]+ optional -
sub("_(\\d+_)?S[0-9]+$", "", x)
#[1] "VC_1_UI" "VC_2_UI" "BAF60A_KD" "BAF60A_KD" "VC_VD3" "VC_VD3"
#[7] "BAF60A_VD3" "BAF60A_VD3"
data
x <- c("VC_1_UI_S3" , "VC_2_UI_S4" ,"BAF60A_KD_1_S13" ,"BAF60A_KD_2_S14" ,
"VC_VD3_1_S7","VC_VD3_2_S8", "BAF60A_VD3_1_S15", "BAF60A_VD3_2_S16")

Sequence of numbers by hyphen without hyphenating single occurrences

I want to generate readable number sequences (e.g. 1, 2, 3, 4 = 1-4), but for a set of data where each number in the sequence must have four digits (e.g. 99 = 0099 or 1 = 0001 or 1022 = 1022) AND where there are different letters in front of each number.
I was looking at the answer to this question, which managed to do almost exactly as I want with two caveats:
If there is a stand-alone number that does not appear in a sequence, it will appear twice with a hyphen in between
If there are several stand-alone numbers that do no appear in a sequence, they won't be included in the result
### Create Data Set ====
## Create the data for different tags. I'm only using two unique levels here, but in my dataset I've got
## 400+ unique levels.
FM <- paste0('FM', c('0001', '0016', '0017', '0018', '0019', '0021', '0024', '0026', '0028'))
SC <- paste0('SC', c('0002', '0003', '0004', '0010', '0012', '0014', '0033', '0036', '0039'))
## Combine data
my.seq1 <- c(FM, SC)
## Sort data by number in sequence
my.seq1 <- my.seq1[order(substr(my.seq1, 3, 7))]
### Attempt Number Sequencing ====
## Get the letters
sp.tags <- substr(my.seq1, 1, 2)
## Get the readable number sequence
lapply(split(my.seq1, sp.tags), ## Split data by the tag ID
function(x){
## Get the run lengths as per [previous answer][1]
rl <- rle(c(1, pmin(diff(as.numeric(substr(x, 3, 7))), 2)))
## Generate number sequence by separator as per [previous answer][1]
seq2 <- paste0(x[c(1, cumsum(rl$lengths))], c("-", ",")[rl$values], collapse="")
return(substr(seq2, 1, nchar(seq2)-1))
})
## Combine lists and sort elements
my.seq2 <- unlist(strsplit(do.call(c, my.seq2), ","))
my.seq2 <- my.seq2[order(substr(my.seq2, 3, 7))]
names(my.seq2) <- NULL
my.seq2
[1] "FM0001-FM0001" "SC0002-SC0004" "FM0016-FM0019" "FM0028" "SC0039"
my.seq1
[1] "FM0001" "SC0002" "SC0003" "SC0004" "SC0010" "SC0012" "SC0014" "FM0016" "FM0017" "FM0018" "FM0019" "FM0021"
[13] "FM0024" "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"
The major problems with this are:
Some values are completely missing from the data set (e.g. FM0021, FM0024, FM0026)
The first number in the sequence (FM0001) appears with a hyphen in between
I feel like I'm getting warmer by using A5C1D2H2I1M1N2O1R2T1's answer to utilize seqToHumanReadable because it's quite elegant AND solves both problems. Two more problems are that I'm not able to tag the ID before each number and can't force the number of digits to four (e.g. 0004 becomes 4).
library(R.utils)
lapply(split(my.seq1, sp.tags), function(x){
return(unlist(strsplit(seqToHumanReadable(substr(x, 3, 7)), ',')))
})
$FM
[1] "1" " 16-19" " 21" " 24" " 26" " 28"
$SC
[1] "2-4" " 10" " 12" " 14" " 33" " 36" " 39"
Ideally the result would be:
"FM0001, SC002-SC004, SC0012, SC0014, FM0017-FM0019, FM0021, FM0024, FM0026, FM0028, SC0033, SC0036, SC0039"
Any ideas? It's one of those things that's really simple to do by hand but would take blinking ages, and you'd think a function would exist for it but I haven't found it yet or it doesn't exist :(
This should do?
# get the prefix/tag and number
tag <- gsub("(^[A-z]+)(.+)", "\\1", my.seq1)
num <- gsub("([A-z]+)(\\d+$)", "\\2", my.seq1)
# get a sequence id
n <- length(tag)
do_match <- c(FALSE, diff(as.numeric(num)) == 1 & tag[-1] == tag[-n])
seq_id <- cumsum(!do_match) # a sequence id
# tapply to combine the result
res <- setNames(tapply(my.seq1, seq_id, function(x)
if(length(x) < 2)
return(x)
else
paste(x[1], x[length(x)], sep = "-")), NULL)
# show the result
res
#R> [1] "FM0001" "SC0002-SC0004" "SC0010" "SC0012" "SC0014" "FM0016-FM0019" "FM0021"
#R> [8] "FM0024" "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"
# compare with
my.seq1
#R> [1] "FM0001" "SC0002" "SC0003" "SC0004" "SC0010" "SC0012" "SC0014" "FM0016" "FM0017" "FM0018" "FM0019" "FM0021" "FM0024"
#R> [14] "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"
Data
FM <- paste0('FM', c('0001', '0016', '0017', '0018', '0019', '0021', '0024', '0026', '0028'))
SC <- paste0('SC', c('0002', '0003', '0004', '0010', '0012', '0014', '0033', '0036', '0039'))
my.seq1 <- c(FM, SC)
my.seq1 <- my.seq1[order(substr(my.seq1, 3, 7))]

Extracting Sub-expressions from a Dataframe of Strings Using Regular Expressions

I have a regular expression that is able to match my data, using grepl, but I can't figure out how to extract the sub-expressions inside it to new columns.
This is returning the test string as foo, without any of the sub-expressions:
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+)\\s+(\\d*\\:?\\d+\\.\\d+)"
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"
m <- regexpr(entryPattern, test)
foo <- regmatches(test, m)
In my real use case, I'm acting on lots of strings similar to test. I'm able to find the correctly formatted ones, so I think the pattern is correct.
rows$isMatch <- grepl(entryPattern, rows$text)
What 'm hoping to do is add the sub-expressions as new columns in the rows dataframe (i.e. rows$rank, rows$name, rows$country, etc.). Thanks in advance for any advice.
It seems that regmatches won't do what I want. Instead, I need the stringr package, as suggested by #kent-johnson.
library(stringr)
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+?)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+?)\\s+(\\d*\\:?\\d+\\.\\d+)"
str_match(test, entryPattern)[1,2:8]
Which outputs:
[1] "101"
[2] "POULET Laure"
[3] "FRA"
[4] "1992"
[5] "25-29"
[6] "E. M. S. Bron Natation"
[7] "26.00"

how to generate very small numbers randomly in R

I want to generate very small numbers in the range of 1e-9 to 1.
if possible those numbers should be from all orders.
for example 1e-9, 2e-5, 3.2e-6 , 1.6e-4 .... etc
I tried this
set.seed(123)
kk <- runif(20,1e-9,1)
#min(kk)
#0.04205953
How can I do this?
EDIT
#RichardScriven suggested that decreasing the max number, so tried that
kk <- runif(20,1e-9,1e-5)
kk
#[1] 6.479287e-06 3.198886e-06 3.077892e-06 2.198457e-06 3.695519e-06 #9.842208e-06 1.542869e-06 9.113490e-07 1.419927e-06 6.900381e-06 #6.192946e-06 8.914050e-06 6.730318e-06 7.371040e-06
#[15] 5.211836e-06 6.598725e-06 8.218233e-06 7.863029e-06 9.798239e-06 #4.394876e-06
Maybe log the ranges, then exp them back to actual values:
set.seed(13031982)
exp(runif(10, log(1e-9),log(1)))
# [1] 1.758939e-02 1.343684e-06 1.803232e-06 1.564901e-04 5.603956e-07
# [6] 1.042067e-09 6.536568e-08 1.374840e-05 2.210080e-04 6.245864e-03

Resources