I am trying to calculate a ratio using this formula: log2(_5p/_3p).
I have a dataframe in R whose entries have the same name except for the last part, which is either _3p or _5p. I want to apply log2(_5p/_3p) to each specific name.
For instance for the first two rows the result will be like this:
LQNS02277998.1_30988 log2(40/148)= -1.887525
Ideally I want to create a new data frame with the results where only the common part of the name is kept.
LQNS02277998.1_30988 -1.887525
How can I do this in R?
> head(dup_res_LC1_b_2)
# A tibble: 6 x 2
microRNAs n
<chr> <int>
1 LQNS02277998.1_30988_3p 148
2 LQNS02277998.1_30988_5p 40
3 Dpu-Mir-279-o6_LQNS02278070.1_31942_3p 4
4 Dpu-Mir-279-o6_LQNS02278070.1_31942_5p 4
5 LQNS02000138.1_777_3p 73
6 LQNS02000138.1_777_5p 12
structure(list(microRNAs = c("LQNS02277998.1_30988_3p",
"LQNS02277998.1_30988_5p", "Dpu-Mir-279-o6_LQNS02278070.1_31942_3p",
"Dpu-Mir-279-o6_LQNS02278070.1_31942_5p", "LQNS02000138.1_777_3p",
"LQNS02000138.1_777_5p"), n = c(148L, 40L, 4L, 4L, 73L, 12L)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
We can use a group-by operation after removing the substring at the end (i.e. _3p or _5p) with str_remove, then take the log2 of the ratio of the paired 'n' values:
library(dplyr)
library(stringr)
df1 %>%
  group_by(grp = str_remove(microRNAs, "_[^_]+$")) %>%
  mutate(new = log2(last(n)/first(n)))
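If you want the collapsed two-column result from the question (one row per common name), a summarise variant of the same idea should work. This is a sketch that extracts the arm explicitly instead of relying on row order, assuming each name has exactly one _3p and one _5p row; the column names grp, arm, and log2_ratio are my own:

```r
library(dplyr)
library(stringr)

df1 %>%
  mutate(grp = str_remove(microRNAs, "_[35]p$"),      # common part of the name
         arm = str_extract(microRNAs, "[35]p$")) %>%  # "3p" or "5p"
  group_by(grp) %>%
  summarise(log2_ratio = log2(n[arm == "5p"] / n[arm == "3p"]))
```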
This is an example of data:
exp_data <- structure(list(Seq = c("AAAARVDS", "AAAARVDSSSAL",
"AAAARVDSRASDQ"), Change = structure(c(19L, 20L, 13L), .Label = c("",
"C[+58]", "C[+58], F[+1152]", "C[+58], F[+1152], L[+12], M[+12]",
"C[+58], L[+2909]", "L[+12]", "L[+370]", "L[+504]", "M[+12]",
"M[+1283]", "M[+1457]", "M[+1491]", "M[+16]", "M[+16], Y[+1013]",
"M[+16], Y[+1152]", "M[+16], Y[+762]", "M[+371]", "M[+386], Y[+12]",
"M[+486], W[+12]", "Y[+12]", "Y[+1240]", "Y[+1502]", "Y[+1988]",
"Y[+2918]"), class = "factor"), `Mass` = c(1869.943,
1048.459, 707.346), Size = structure(c(2L, 2L, 2L), .Label = c("Matt",
"Greg",
"Kieran"
), class = "factor"), `Number` = c(2L, 2L, 2L)), row.names = c(244L,
392L, 396L), class = "data.frame")
I would like to draw your attention to the column Change, as this is the one I would like to filter on. We have three rows here, and I would like to keep only the first one, because it contains a change greater than +100 for a specific letter. In general I would like to keep all rows containing a letter change greater than +100. There may be up to 4-5 letters in the Change column, but if at least one has a modification of at least +100, I would like to keep that row.
Do you have a simple solution for that?
Expected output:
Seq Change Mass Size Number
244 AAAARVDS M[+486], W[+12] 1869.943 Greg 2
Not entirely sure I understood your problem statement correctly, but perhaps something like this:
library(dplyr)
library(stringr)
exp_data %>% filter(str_detect(Change, "\\d{3}"))
# Seq Change Mass Size Number
#1 AAAARVDS M[+486], W[+12] 1869.943 Greg 2
Or the same in base R
exp_data[grep("\\d{3}", exp_data$Change), ]
# Seq Change Mass Size Number
#1 AAAARVDS M[+486], W[+12] 1869.943 Greg 2
The idea is to use a regular expression to keep only those rows where Change contains at least one three-digit expression.
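If you want the pattern to encode the threshold more explicitly (a plus sign followed by three or more digits, i.e. a modification of at least +100), a slightly stricter regex could be used; a sketch:

```r
library(dplyr)
library(stringr)

# keep rows where Change contains "+NNN..." with three or more digits
exp_data %>% filter(str_detect(Change, "\\+\\d{3,}"))
```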
You can use str_extract_all from the stringr package
library(stringr)
data.table solution
library(data.table)
setDT(exp_data)
exp_data[, max := max(as.numeric(str_extract_all(Change, "[[:digit:]]+")[[1]])), by = Seq]
exp_data[max > 100, ]
Seq Change Mass Size Number max
1: AAAARVDS M[+486], W[+12] 1869.9 Greg 2 486
dplyr solution
library(dplyr)
exp_data %>%
  group_by(Seq) %>%
  filter(max(as.numeric(str_extract_all(Change, "[[:digit:]]+")[[1]])) > 100)
# A tibble: 1 x 5
# Groups: Seq [1]
Seq Change Mass Size Number
<chr> <fct> <dbl> <fct> <int>
1 AAAARVDS M[+486], W[+12] 1870. Greg 2
I'm trying to sum up the counts for 'Moderna' in R.
The problem is that the original Excel file has the value Moderna mixed with other vaccines: entries contain 'Moderna' together with, for example, 'Oxford/AstraZeneca'.
My attempt at summing the 'Moderna' counts is below:
Number_Of_Countries_Using_Moderna <- Number_of_Vaccines_used %>%
  group_by(vaccines) %>%
  summarize(Moderna_Countries = sum(n))
My idea was to group_by vaccines to isolate Moderna, then sum the Moderna counts (making a new column in the process). The problem is that group_by(vaccines) is not correct here, because the Moderna entries are spread across several combined-vaccine values.
Do you guys have any suggestions? Thank you for your time :)
Problem was solved with either of the two solutions below, thank you.
If I understood correctly, you are trying to get the sum of n whenever Moderna is mentioned in the column vaccines? If that's the case, here is a solution below. You need to "filter", not "group_by":
Number_of_Vaccines_used %>%
  filter(grepl("Moderna", vaccines)) %>%
  summarize(Moderna_Countries = sum(n))
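The same count can be sketched in base R, without dplyr:

```r
# sum n over the rows whose vaccines string mentions Moderna
sum(Number_of_Vaccines_used$n[grepl("Moderna", Number_of_Vaccines_used$vaccines)])
```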
Not exactly what you asked for: if you are looking for a complete list of vaccines and their counts, you could use
library(dplyr)
library(tidyr)
Number_of_Vaccines_used %>%
  mutate(vaccines = strsplit(vaccines, ", ")) %>%
  unnest(vaccines) %>%
  group_by(vaccines) %>%
  summarise(n = sum(n))
This results in something like
# A tibble: 10 x 2
vaccines n
<chr> <int>
1 Covaxin 1
2 EpiVacCorona 1
3 Johnson&Johnson 2
4 Moderna 35
5 Oxford/AstraZeneca 105
6 Pfizer/BioNTech 82
7 Sinopharm/Beijing 24
8 Sinopharm/Wuhan 2
9 Sinovac 18
10 Sputnik V 20
Data
structure(list(vaccines = c("Covaxin, Oxford/AstraZeneca", "EpiVacCorona, Sputnik V", "Johnson&Johnson", "Johnson&Johnson, Moderna, Pfizer/BioNTech", "Moderna", "Moderna, Oxford/AstraZeneca"), n = c(1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
I want to create a new column in a dataframe using partial matches in another column. The problem is that my values are only partial matches: the suffix _3p or _5p at the end of the names exists only in the original dataframe, not in the other column I am testing against.
The code I am using should work, but because of the partial matches it does not, and I am stuck.
> head(df)
# A tibble: 6 x 2
microRNAs `number of targets`
<chr> <int>
1 bantam|LQNS02278082.1_33125_3p 128
2 bantam|LQNS02278082.1_33125_5p 8
3 Dpu-Mir-10-P2_LQNS02277998.1_30984_3p 44
4 Dpu-Mir-10-P2_LQNS02277998.1_30984_5p 78
5 Dpu-Mir-10-P3_LQNS02277998.1_30988_3p 1076
6 Dpu-Mir-10-P3_LQNS02277998.1_30988_5p 309
> dput(head(df))
structure(list(microRNAs = c("bantam|LQNS02278082.1_33125_3p",
"bantam|LQNS02278082.1_33125_5p", "Dpu-Mir-10-P2_LQNS02277998.1_30984_3p",
"Dpu-Mir-10-P2_LQNS02277998.1_30984_5p", "Dpu-Mir-10-P3_LQNS02277998.1_30988_3p",
"Dpu-Mir-10-P3_LQNS02277998.1_30988_5p"), `number of targets` = c(128L,
8L, 44L, 78L, 1076L, 309L)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
#matches to look for
unique
1 miR-9|LQNS02278094.1_36129
2 LQNS02278139.1_39527
3 LQNS02278139.1_39523
4 LQNS02278075.1_32386
5 Dpu-Mir-10-P3_LQNS02277998.1_30988
> dput(head(unique))
structure(list(unique = c("miR-9|LQNS02278094.1_36129",
"LQNS02278139.1_39527", "LQNS02278139.1_39523", "LQNS02278075.1_32386",
"Dpu-Mir-10-P3_LQNS02277998.1_30988")), row.names = c(NA,
6L), class = "data.frame")
#Create new column with Yes, No
df$new <- ifelse(df$microRNAs %in% unique$unique, 'Yes', 'No')
## But every value comes out 'No' because of the partial match.
A fast solution using data.table.
library(data.table)
# convert data.frame to data.table
setDT(df)
# create temporary column dropping the last 3 characters
df[, microRNAs_short := substr(microRNAs, 1, nchar(microRNAs) - 3)]
# check which values are in common with the lookup data frame
df[, new := fifelse(microRNAs_short %in% unique$unique, 'Yes', 'No')]
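If the suffix is not guaranteed to be exactly three characters long, stripping it with a regex is safer than substr; a sketch (note that the lookup table's name, unique, shadows base::unique, so renaming it may be wise):

```r
library(data.table)
setDT(df)
# strip a literal _3p or _5p suffix, whatever the rest of the name looks like
df[, new := fifelse(sub("_[35]p$", "", microRNAs) %in% unique$unique, 'Yes', 'No')]
```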
We could use regex_left_join from fuzzyjoin
library(fuzzyjoin)
regex_left_join(df, unique, by = c("microRNAs" = "unique"))
This question already has answers here: Regular expressions (RegEx) and dplyr::filter() (2 answers). Closed 4 years ago.
I have a data.frame like this:
Client Product
1 VV_Brazil_Jul
2 VV_Brazil_Mar
5 VV_US_Jul
1 VV_JP_Apr
3 VV_CH_May
6 VV_Brazil_Aug
I would like to delete all rows with "Brazil".
You can do this using the grepl function and the ! to find the cases that are not matched:
# Create a dataframe where some cases have the product with Brazil as part of the value
df <- structure(list(Client = c(1L, 2L, 5L, 1L, 3L, 6L),
Product = c("VV_Brazil_Jul", "VV_Brazil_Mar", "VV_US_Jul", "VV_JP_Apr", "VV_CH_May", "VV_Brazil_Aug")),
row.names = c(NA, -6L), class = c("data.table", "data.frame"))
# Display the original dataframe in the Console
df
# Limit the dataframe to cases which do not have Brazil as part of the product
df <- df[!grepl("Brazil", df$Product, ignore.case = TRUE),]
# Display the revised dataframe in the Console
df
You can do the same thing with the tidyverse collection
dplyr::slice(df, -stringr::str_which(df$Product, "Brazil"))
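An equivalent dplyr filter, for comparison; a sketch:

```r
library(dplyr)
# keep only the rows whose Product does not mention Brazil
df %>% filter(!grepl("Brazil", Product, ignore.case = TRUE))
```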
I have data like this:
df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L,
9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway",
" USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
"indiaAfghanestan ", "USA", "USAargentina "), class = "factor"),
value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L,
8L), .Label = c("1941029507", "2367321518", "2849255881",
"2913128511", "2927576083", "4550996370", "457707181.9",
"637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label",
"value"), class = "data.frame", row.names = c(NA, -10L))
I want to take the longest name (in letters), find how many shorter, matching names there are, and assign them all to one group; then move on to the next longest ungrouped name and assign its matches to another group, and so on until no names are left.
First, I calculate the length of each label:
library(dplyr)
dft <- data.frame(names = df$label, chr = apply(df, 2, nchar)[, 1])
colnames(dft)[1] <- "label"
df2 <- inner_join(df, dft)
Now I can simply find which string is the longest
df2[which.max(df2$chr),]
Now I need to see which other strings share their letters with this longest string. We have these possibilities:
Afghanestankabolindia
it can be
A
Af
Afg
Afgh
Afgha
Afghan
Afghane
.
.
.
all possible combinations, but the order of the letters must stay the same (from left to right): for example Afghan is valid, but fAhg is not
so we have only two other strings that are similar to this one
Afghanestan
Afghanestankabol
This is because the strings must match exactly, with not even one letter differing (beyond the length of the longer string), to be assigned to the same group.
The desired output is as follows:
label value group
Afghanestan 2927576083 1
Afghanestankabol 2913128511 1
Afghanestankabolindia 1941029507 1
indiaAfghanestan 796495286.2 2
Holandnorway 457707181.9 3
holand 89291651.19 3
holandindia 4550996370 3
USA 2849255881 4
USAargentina 2367321518 4
USAargentinabrazil 637943892.6 4
Why is indiaAfghanestan a separate group? Because it is not completely contained in another name (it only partially overlaps one); to share a group it would have to be the beginning of a bigger name.
I tried to use this one Find similar strings and reconcile them within one dataframe which did not help me at all
I found something else which maybe helps
require("Biostrings")
pairwiseAlignment(df2$label[3], df2$label[1], gapOpening = 0, gapExtension = 4, type = "overlap")
but still I don't know how to assign them into one group
You could try
library(magrittr)
df$label %>%
  tolower %>%
  trimws %>%
  stringdist::stringdistmatrix(method = "jw", p = 0.1) %>%
  as.dist %>%
  `attr<-`("Labels", df$label) %>%
  hclust %T>%
  plot %T>%
  rect.hclust(h = 0.3) %>%
  cutree(h = 0.3) %>%
  print -> df$group

df
# label value group
# 1 Afghanestan 2927576083 1
# 2 Afghanestankabol 2913128511 1
# 3 Afghanestankabolindia 1941029507 1
# 4 indiaAfghanestan 796495286.2 2
# 5 Holandnorway 457707181.9 3
# 6 holand 89291651.19 3
# 7 holandindia 4550996370 3
# 8 USA 2849255881 4
# 9 USAargentina 2367321518 4
# 10 USAargentinabrazil 637943892.6 4
See ?stringdist::'stringdist-metrics' for an overview of the string dissimilarity measures offered by stringdist.
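Since the question's rule is really about exact prefixes rather than fuzzy similarity, the grouping can also be sketched in base R: link two labels whenever one is a prefix of the other (after trimming and lower-casing) and take the connected components. This reproduces the expected groups for the example data, though it is quadratic in the number of labels:

```r
x <- tolower(trimws(as.character(df$label)))
n <- length(x)
grp <- seq_len(n)                    # start with every label in its own group
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    # merge the two groups if one label is a prefix of the other
    if (startsWith(x[i], x[j]) || startsWith(x[j], x[i]))
      grp[grp == grp[j]] <- grp[i]
  }
}
df$group <- match(grp, unique(grp))  # renumber groups 1, 2, 3, ...
```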