R - use dplyr mutate with purrr for string manipulation

I have two tibbles, each containing a column of strings. I need to compare the strings in one tibble against the strings in the other and create a new column based on the result.
Small example below:
## Tibble 1 - the 'master'
structure(list(terms = c("This", "is", "a", "stri", "of", "areas",
"times", "two", "to", "see", "what", "will", "be", "in", "the",
"magic", "will", "rally", "for", "a", "cry", "from", "the", "deepest",
"part", "of", "the", "ocean", "com", "en", "au", "us"), rank = c("A",
"B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N",
"O", "P", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K",
"L", "M", "N", "O", "P"), id = 1:32), row.names = c(NA, -32L), class = c("tbl_df",
"tbl", "data.frame"))
## Tibble 2 - the 'comparison'
structure(list(conds = c("this.com", "two.org", "magic.edu",
"cry/en/org", "magic.com"), ind = structure(c(2L, 1L, 5L, 3L,
4L), .Label = c("bad", "good", "Indifferent", "Maybe", "Ugly"
), class = "factor")), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
Ideally, the output would be the 'master' tibble with a new column holding the ind value that corresponds to each matched string.
Attempt so far:
terms <- terms %>% mutate(
test = ifelse(
sapply(lapply(terms, grepl, condition_str$conds), any) == TRUE,
condition_str$ind,
'NA'))
terms
result
# A tibble: 32 x 4
terms rank id test
<chr> <chr> <int> <chr>
1 This A 1 NA
2 is B 2 1
3 a C 3 5
4 stri D 4 NA
5 of E 5 NA
6 areas F 6 NA
7 times G 7 NA
8 two H 8 5
9 to I 9 NA
10 see J 10 NA
It returns a result, but the new column holds the factor's integer codes rather than its labels. It also fails on the larger data set I am working with.
Questions:
Is there a purrr solution that uses stringr or stringi? My problem might be in my string matching.
Is there a way to incorporate fixed = TRUE into the grepl function?
Is there a way to get the classification levels into the mutated column?
Thanks for any assistance.
James
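One possible way to address the second and third questions in base R is sketched below. It assumes that a term counts as a match when it appears literally (case-sensitively) inside a conds string, and that the first matching condition wins. Both rules are assumptions, as are the small stand-in data frames and the helper name first_label. In the attempt above, ifelse() drops the factor attributes, so only the integer codes survive, and condition_str$ind is recycled by position rather than looked up by match.

```r
# Hypothetical miniature versions of the two tibbles from the question
master <- data.frame(terms = c("This", "is", "magic", "cry", "ocean"),
                     stringsAsFactors = FALSE)
comparison <- data.frame(
  conds = c("this.com", "two.org", "magic.edu", "cry/en/org", "magic.com"),
  ind   = factor(c("good", "bad", "Ugly", "Indifferent", "Maybe")),
  stringsAsFactors = FALSE
)

# For one term, return the label of the first condition that contains it
# (assumed matching rule: literal, case-sensitive substring)
first_label <- function(term, conds, labels) {
  hit <- which(grepl(term, conds, fixed = TRUE))  # fixed = TRUE: no regex
  if (length(hit) == 0) NA_character_ else labels[hit[1]]
}

master$test <- vapply(
  master$terms, first_label, character(1),
  conds  = comparison$conds,
  labels = as.character(comparison$ind),  # labels, not integer codes
  USE.NAMES = FALSE
)
master
```

Converting ind with as.character() before the lookup is what keeps the labels; unmatched terms get a real NA instead of the string 'NA'.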

Related

How to remove everything after % in dataframe and merge deleted data into a new data frame?

I have a rather odd question. Not sure if this is possible, but if it is it would be a workaround for a problem I am having. I am creating the following table:
library(janitor)
firsttable <- tabyl(df, Essay, Grade) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits = 1) %>%
adorn_ns()
Essay A B C D
N 30.0% (3) 37.5% (3) 70.0% (7) 93.8% (15)
Y 70.0% (7) 62.5% (5) 30.0% (3) 6.2% (1)
As you can see, the variables are in character format, and include a percentage and count in parentheses. I would like to:
remove the % sign and everything after it
save what I remove and merge it into another data frame (with the same dimensions as the initial table) after
So the above table would become:
Essay A B C D
N 30.0 37.5 70.0 93.8
Y 70.0 62.5 30.0 6.2
And I would save the % sign and the values that follow it, to be merged into a data frame b of the same dimensions:
b <- tabyl(df, TrueFalse, Color)
TrueFalse B G R Y
FALSE 7 5 1 1
TRUE 11 5 9 5
So the final table would be:
TrueFalse B G R Y
FALSE 7% (3) 5% (3) 1% (7) 1% (15)
TRUE 11% (7) 5% (5) 9% (3) 5% (1)
I understand that in this example this would produce totally incorrect percentages for the final table, but all I'm looking for is the ability to remove everything including and after the percentage sign then merge it into the cells of a data frame of the same dimensions. It's an odd question, I know.
Any help would be appreciated!
Data:
df <- structure(list(Grade = c("A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C",
"C", "C", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D", "D",
"D", "D", "D", "D", "D", "D", "D", "D", "D", "D"), Essay = c("Y",
"Y", "Y", "Y", "Y", "N", "N", "Y", "Y", "N", "Y", "Y", "Y", "Y",
"Y", "N", "N", "N", "N", "N", "N", "N", "N", "N", "Y", "Y", "N",
"Y", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N",
"Y", "N", "N", "N"), Color = c("B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "G",
"G", "G", "G", "G", "G", "G", "G", "G", "G", "R", "R", "R", "R",
"R", "R", "R", "R", "R", "R", "Y", "Y", "Y", "Y", "Y", "Y"),
TrueFalse = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)), class = "data.frame", row.names = c(NA,
-44L))
We can apply readr::parse_number across every column except 'Essay':
library(dplyr)
firsttable %>%
mutate(across(-Essay, readr::parse_number))
# Essay A B C D
# N 30 37.5 70 93.8
# Y 70 62.5 30 6.2
To get the second output, drop 'Essay', extract the count together with its parentheses from each remaining column, then use map2 (from purrr) to paste (str_c) each result onto the corresponding column of 'b' (excluding 'TrueFalse'):
library(stringr)
library(purrr)
library(tibble)  # provides add_column()
firsttable %>%
select(-Essay) %>%
mutate(across(everything(), ~ str_extract(., "\\(\\d+\\)"))) %>%
map2_dfc(b %>%
select(-TrueFalse), ., str_c, sep='% ') %>%
add_column(TrueFalse = b$TrueFalse, .before = 1)
# A tibble: 2 x 5
# TrueFalse B G R Y
# <lgl> <chr> <chr> <chr> <chr>
#1 FALSE 7% (3) 5% (3) 1% (7) 1% (15)
#2 TRUE 11% (7) 5% (5) 9% (3) 5% (1)
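The split-and-recombine idea can also be seen on a plain character vector. This is a minimal base R sketch; the vectors cells and other_values are made-up stand-ins for one column of firsttable and one column of b.

```r
# Hypothetical cell contents, formatted like the tabyl output
cells <- c("30.0% (3)", "70.0% (7)", "6.2% (1)")

# Keep everything before the % sign and convert it to numeric
kept <- as.numeric(sub("%.*", "", cells))

# Keep the % sign and everything after it
removed <- sub("^[^%]*", "", cells)

# Paste the removed part onto values from another source
other_values <- c(7, 11, 5)
pasted <- paste0(other_values, removed)

kept     # 30.0 70.0  6.2
removed  # "% (3)" "% (7)" "% (1)"
pasted   # "7% (3)" "11% (7)" "5% (1)"
```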
Or use Map in base R
b_new <- b
b_new[-1] <- Map(function(x, y)
sprintf('%d%% %s', x, sub(".*\\s+", "", y)), b[-1], firsttable[-1])
Extract the underlying counts (the core attribute) from the tabyl object and combine them with b:
core <- attr(firsttable, "core")
replace(b, -1, Map(sprintf, "%d%% (%d)", b[-1], core[-1]))
## TrueFalse B G R Y
## FALSE 7% (3) 5% (3) 1% (7) 1% (15)
## TRUE 11% (7) 5% (5) 9% (3) 5% (1)
The table of percentages can be recreated from the core values:
setNames(cbind(core[1], 100 * proportions(as.matrix(core[-1]), 2)), names(core))
## Essay A B C D
## 1 N 30 37.5 70 93.75
## 2 Y 70 62.5 30 6.25
or alternately the strings could be extracted from firsttable and converted to numeric:
replace(firsttable, -1, Map(function(x) as.numeric(sub("%.*", "", x)), firsttable[-1]))
## Essay A B C D
## N 30 37.5 70 93.8
## Y 70 62.5 30 6.2

Loop through the columns to search for multiple variables in R

Sorry that this is a follow-up question. I am trying to count how many times 'S' and 'T' appear in each column, first in rows 1 to 10 ('downstream') and then in rows 15 to 25 ('upstream').
ST <- data.frame(scale = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0),
aa = c('A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y'))
#input (example)
V1 V2 V3 V4 V5
1 C D E R N
2 C A M K P
3 V T Q Q E
4 A T S S S
5 C D E R N
6 C A M K P
7 V T Q Q E
8 A T S S S
9 R V D S A
10 W R H I C
11 S N I P T
12 Q A S D E
13 C D E R N
14 C A M K P
15 V T Q Q E
16 A T S S S
17 C D E R N
18 C A M K P
19 V T Q Q E
20 A T S S S
21 R V D S A
22 W R H I C
23 S N I P T
24 G A D S S
25 N T T S A
When I had a data frame with 'S' only, the script below worked but with 'ST', it doesn't. Could someone tell me why? Of course, I could get 'S' and 'T' separately and then add it later but is there a way to do it through this single data frame 'ST'?
#sum values from positions 1 to 10 and then from 15 to 25 works well for 1 letter only
count_aa <- df_trial %>%
summarise(across(everything(), ~ c(sum(.[1:10] == 'T'), sum(.[15:25] == 'T')))) %>%
mutate(categ = c('upstream', 'downstream'), .before = 1)
#view(count_aa)
df_count_aa<- as.data.frame(t(count_aa))
#view(df_count_aa)
We can use %in% instead of == when there is more than one element to compare:
library(dplyr)
df_trial %>%
summarise(across(everything(), ~
c(sum(.[1:10] %in% c('S', 'T')),
sum(.[15:25] %in% c('S', 'T'))))) %>%
mutate(categ = c('upstream', 'downstream'), .before = 1)
Output:
# categ V1 V2 V3 V4 V5
#1 upstream 0 4 2 3 2
#2 downstream 1 5 3 5 4
The == operator performs an elementwise comparison. With a right-hand side of more than one element, as in == c("S", "T"), R recycles that vector along the whole column: 'S' is compared with the first element of the column, 'T' with the second, 'S' again with the third, and so on. The comparison therefore depends on position, not on membership.
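The difference between the two operators shows up on a short vector. This is a minimal sketch; the vector x is made up for illustration.

```r
x <- c("S", "T", "T", "S", "A", "T")

# == recycles c("S", "T") to S, T, S, T, S, T and compares by position
by_position <- x == c("S", "T")
by_position
# TRUE TRUE FALSE FALSE FALSE TRUE

# %in% asks, for each element of x, whether it is 'S' or 'T'
by_membership <- x %in% c("S", "T")
by_membership
# TRUE TRUE TRUE TRUE FALSE TRUE

sum(by_position)    # 3, undercounts
sum(by_membership)  # 5, the intended count
```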
In base R, colSums gives the counts over all rows of each column:
colSums(df_trial == 'S') + colSums(df_trial == 'T')
In base R, you can do this with sapply:
data.frame(categ = c('upstream', 'downstream'),
sapply(df_trial, function(x)
c(sum(x[1:10] %in% c('S', 'T')), sum(x[15:25] %in% c('S', 'T')))))
# categ V1 V2 V3 V4 V5
#1 upstream 0 4 2 3 2
#2 downstream 1 5 3 5 4
Using base R
rbind(downstream = sapply(df[1:10,], function(x) sum(grepl('[ST]',x))),
upstream = sapply(df[15:25,], function(x) sum(grepl('[ST]',x))))
#            V1 V2 V3 V4 V5
# downstream  0  4  2  3  2
# upstream    1  5  3  5  4
Data Used:
df <-
structure(list(V1 = c("C", "C", "V", "A", "C", "C", "V", "A",
"R", "W", "S", "Q", "C", "C", "V", "A", "C", "C", "V", "A", "R",
"W", "S", "G", "N"), V2 = c("D", "A", "T", "T", "D", "A", "T",
"T", "V", "R", "N", "A", "D", "A", "T", "T", "D", "A", "T", "T",
"V", "R", "N", "A", "T"), V3 = c("E", "M", "Q", "S", "E", "M",
"Q", "S", "D", "H", "I", "S", "E", "M", "Q", "S", "E", "M", "Q",
"S", "D", "H", "I", "D", "T"), V4 = c("R", "K", "Q", "S", "R",
"K", "Q", "S", "S", "I", "P", "D", "R", "K", "Q", "S", "R", "K",
"Q", "S", "S", "I", "P", "S", "S"), V5 = c("N", "P", "E", "S",
"N", "P", "E", "S", "A", "C", "T", "E", "N", "P", "E", "S", "N",
"P", "E", "S", "A", "C", "T", "S", "A")), row.names = c(NA, -25L
), class = c("tbl_df", "tbl", "data.frame"))

How to pull the column indices when matching the rows of a dataframe and a vector

Say I have a dataframe of letters like so:
X1 X2 X3
1 G A C
2 G T C
3 G T C
4 A T G
5 A C G
And a vector like so:
ref <- c("A", "C", "C", "A", "G")
Going row-wise, how do I pull the column indices of the dataframe which match the vector?
So the answer should be a vector of numbers like so:
2, 3, 3, 1, 3
We can use
max.col(df1 == ref)
#[1] 2 3 3 1 3
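By default max.col breaks ties at random, and a row with no match is an all-FALSE tie, so it would return an arbitrary column. A hedged variant that guards against this is sketched below. The sixth row of the data and the choice to return NA for it are assumptions added for illustration.

```r
# Data from the question plus one hypothetical row that matches nothing
df1 <- data.frame(X1 = c("G", "G", "G", "A", "A", "T"),
                  X2 = c("A", "T", "T", "T", "C", "T"),
                  X3 = c("C", "C", "C", "G", "G", "T"),
                  stringsAsFactors = FALSE)
ref <- c("A", "C", "C", "A", "G", "A")

# Logical matrix: TRUE where a cell equals the ref value of its row
hits <- as.matrix(df1) == ref

# Take the first matching column in each row, deterministically
idx <- max.col(hits, ties.method = "first")

# Rows without any match get NA instead of a misleading column index
idx[rowSums(hits) == 0] <- NA
idx
# 2 3 3 1 3 NA
```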
data
df1 <- structure(list(X1 = c("G", "G", "G", "A", "A"), X2 = c("A", "T",
"T", "T", "C"), X3 = c("C", "C", "C", "G", "G")), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5"))

Replace values across multiple variables in R

I have a dataframe with 82 variables. Many of the variables contain alphabetic letters, which I want to change into a set of numbers. I can do this column-by-column, number-by-number using the code below:
library(tibble)
library(dplyr)  # provides %>% and mutate()
mydf <- tribble(~Var1, ~Var2.a, ~Var3.a, ~Var4.a,
"A", "b", "b", "d",
"B", "w", NA, "w",
"C", "g", "k", "b",
"D", "k", NA, "j")
newdf <- mydf %>%
mutate(Var2.a = ifelse(Var2.a %in% c("m", "p", "w", "h", "n"), 1, Var2.a),
Var2.a = ifelse(Var2.a %in% c("k", "b", "g", "j", "f", "d"), 2, Var2.a),
Var3.a = ifelse(Var3.a %in% c("m", "p", "w", "h", "n"), 1, Var3.a),
Var3.a = ifelse(Var3.a %in% c("k", "b", "g", "j", "f", "d"), 2, Var3.a),
Var4.a = ifelse(Var4.a %in% c("m", "p", "w", "h", "n"), 1, Var4.a),
Var4.a = ifelse(Var4.a %in% c("k", "b", "g", "j", "f", "d"), 2, Var4.a))
But this will take a lot of time for the 70+ columns I need to change!
All the variables of interest have a matching letter combination in the variable name (".a" in the example data), so I should be able to use an ifelse statement on these columns using contains(). However I can't work out how to do this!
I have looked at this answer, which I think is getting me close, but I can't work out how to embed an if-statement into it:
newdf <- mydf %>%
mutate_at(vars[2:4] = ifelse(vars %in% c("m", "p", "w", "h", "n"), 1, vars)
But I get the error Error in vars[2:4] : object of type 'closure' is not subsettable. I think the brackets are wrong here, and probably also the use of vars!
Try this example:
# Custom recoding function; I prefer case_when (nested if_else would also work)
foo <- function(x){
case_when(
x %in% c("m", "p", "w", "h", "n") ~ 1L,
x %in% c("k", "b", "g", "j", "f", "d") ~ 2L,
TRUE ~ NA_integer_)
}
mydf %>%
mutate_at(vars(Var2.a:Var4.a), foo)
# # A tibble: 4 x 4
# Var1 Var2.a Var3.a Var4.a
# <chr> <int> <int> <int>
# 1 A 2 2 2
# 2 B 1 NA 1
# 3 C 2 2 2
# 4 D 2 NA 2
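The same recoding can be sketched in base R with a named lookup vector, selecting the columns by the ".a" in their names rather than by position. The names lookup and target are illustrative, and mydf is rebuilt here as a plain data frame so the block runs on its own.

```r
mydf <- data.frame(Var1   = c("A", "B", "C", "D"),
                   Var2.a = c("b", "w", "g", "k"),
                   Var3.a = c("b", NA, "k", NA),
                   Var4.a = c("d", "w", "b", "j"),
                   stringsAsFactors = FALSE)

# Named vector mapping each letter to its numeric code
lookup <- c(setNames(rep(1L, 5), c("m", "p", "w", "h", "n")),
            setNames(rep(2L, 6), c("k", "b", "g", "j", "f", "d")))

# Columns whose name contains the literal text ".a"
target <- grepl(".a", names(mydf), fixed = TRUE)

# Index the lookup by letter; NA and unlisted letters become NA
mydf[target] <- lapply(mydf[target], function(x) unname(lookup[x]))
mydf
```

With dplyr 1.0 or later, mutate(across(contains(".a"), foo)) applies the foo function from the answer above to the same set of columns.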

Comparing columns of two dataframes to get the deviation point in R

I have 2 dataframes:
> dput(DF1)
structure(c("a", "b", "c", "d", "e", "f", "g"), .Dim = c(1L,
7L), .Dimnames = list("1", c("seq1", "seq2", "seq3", "seq4",
"seq5", "seq6", "seq7")))
> dput(DF2)
structure(list(seq1 = c("a", "a", "a", "a", "a"), seq2 = c("b",
"d", "d", "d", "b"), seq3 = c("c", "c", "c", "c", "c"), seq4 = c("e",
"e", "d", "d", "d"), seq5 = c("f", "f", "f", "g", "e"), seq6 = c("g",
"g", "g", "g", "g"), seq7 = c("g", "g", "g", "g", "g"), UserId = c("1",
"2", "3", "4", "5")), .Names = c("seq1", "seq2", "seq3", "seq4",
"seq5", "seq6", "seq7", "UserId"), row.names = c(NA, -5L), class = "data.frame")
I want to compare these two datasets. DF1 is the correct, predefined sequence. For example, User 1 in DF2 deviated to e (instead of going to d, he went to e).
In the end I need a dataframe that meets the requirements below:
> dput(required_dataframe)
structure(list(UserID = c("1", "2", "3", "4", "5"), Deviation = c("e",
"d", "d", "d", "g"), Actual_sequence = c("d", "b", "b", "b",
"f")), .Names = c("UserID", "Deviation", "Actual_sequence"), row.names = c(NA,
-5L), class = "data.frame")
For instance, user 1 deviated at point e (he should have gone to d). For every user I need to find the deviation point along with the corresponding step of the actual sequence.
Once you get the two matrices to line up perfectly, you can compare them row-by-row and find out where they don't match. You can then find the first value in each row and use that as a selection:
sel <- cbind(
seq_len(nrow(DF2)),
max.col(t(t(DF2[seq_along(DF1)]) != c(DF1)), "first")
)
cbind(DF2["UserId"], Deviation=DF2[sel], Actual=DF1[sel[,2]])
# UserId Deviation Actual
#1 1 e d
#2 2 d b
#3 3 d b
#4 4 d b
#5 5 g f
The core of the comparison is this part, where you can see each cell being lined up:
t(DF2[seq_along(DF1)]) != c(DF1)
# [,1] [,2] [,3] [,4] [,5]
#seq1 FALSE FALSE FALSE FALSE FALSE
#seq2 FALSE TRUE TRUE TRUE FALSE
#seq3 FALSE FALSE FALSE FALSE FALSE
#seq4 TRUE TRUE FALSE FALSE FALSE
#seq5 TRUE TRUE TRUE TRUE FALSE
#seq6 TRUE TRUE TRUE TRUE TRUE
#seq7 FALSE FALSE FALSE FALSE FALSE
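The approach can also be traced on plain vectors and matrices. In this minimal sketch, expected and observed are cut-down stand-ins for DF1 and DF2 (users 1, 2 and 5), and sweep is used in place of the double transpose purely for readability.

```r
expected <- c("a", "b", "c", "d", "e", "f", "g")
observed <- rbind(c("a", "b", "c", "e", "f", "g", "g"),
                  c("a", "d", "c", "e", "f", "g", "g"),
                  c("a", "b", "c", "d", "e", "g", "g"))

# TRUE where a user's step differs from the expected step in that column
mismatch <- sweep(observed, 2, expected, FUN = "!=")

# Column of the first mismatch in each row
first_bad <- max.col(mismatch, ties.method = "first")

result <- data.frame(
  Deviation = observed[cbind(seq_len(nrow(observed)), first_bad)],
  Actual    = expected[first_bad],
  stringsAsFactors = FALSE
)
result
#   Deviation Actual
# 1         e      d
# 2         d      b
# 3         g      f
```

One caveat: a row that matches the expected sequence everywhere has no TRUE value, so max.col would return column 1 for it. Such rows would need to be filtered out first, for example with rowSums(mismatch) > 0.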
