I have a dataset called colours in which I want to find some keywords based on a list (colouryellow, colourblue, colourwhite) that I have created. This is an example of the dataset:
USER    MESSAGE
23456   The colouryellow is very bright!
31245   Most girls like colourpink
99999   I am having a break
9877    The colouryellow is like the sun
Is there a way to obtain the number of times each keyword in the list appears in the column MESSAGE?
For example, the output would be like:
Keyword        Frequency of Keywords
colouryellow   2
colourblue     0
colourwhite    0
I have tried the following code but it does not provide me the frequency for each keyword, instead displays them together.
colour= read.csv("C: xxxxxx")
keywordcount <- dplyr::filter(colour, grepl("colouryellow|colourblue|colourwhite", MESSAGE))
Thank you in advance.
Some things you can do.
some_colours <- c("colouryellow", "colourblue", "colourwhite")
some_col_regex <- paste0("\\b(", paste(some_colours, collapse = "|"), ")\\b")
grepl(some_col_regex, colour$MESSAGE)
# [1] TRUE FALSE FALSE TRUE
lengths(regmatches(colour$MESSAGE, gregexpr(some_col_regex, colour$MESSAGE)))
# [1] 1 0 0 1
table(unlist(regmatches(colour$MESSAGE, gregexpr(some_col_regex, colour$MESSAGE))))
# colouryellow
# 2
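The table() call above only lists keywords that actually occur, while the question also asks for the zero counts. A minimal sketch, assuming base R only and the same toy data as in the Data section: turn the matches into a factor whose levels are the keyword list, so that absent keywords are counted as 0. The names hits, counts and keyword_freq are made up for this illustration.

```r
# Toy data copied from the Data section of this answer
colour <- data.frame(
  USER = c(23456L, 31245L, 99999L, 9877L),
  MESSAGE = c("The colouryellow is very bright!", "Most girls like colourpink",
              "I am having a break", "The colouryellow is like the sun"),
  stringsAsFactors = FALSE
)
some_colours <- c("colouryellow", "colourblue", "colourwhite")
some_col_regex <- paste0("\\b(", paste(some_colours, collapse = "|"), ")\\b")

# All keyword occurrences across every message, as one character vector
hits <- unlist(regmatches(colour$MESSAGE, gregexpr(some_col_regex, colour$MESSAGE)))

# A factor with fixed levels keeps the keywords that never occur (count 0)
counts <- table(factor(hits, levels = some_colours))

keyword_freq <- data.frame(Keyword = names(counts), Frequency = as.vector(counts))
keyword_freq
```

This counts every occurrence, so a keyword that appears twice in one message contributes 2.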
Data
colour <- structure(list(USER = c(23456L, 31245L, 99999L, 9877L), MESSAGE = c("The colouryellow is very bright!", "Most girls like colourpink", "I am having a break", "The colouryellow is like the sun")), class = "data.frame", row.names = c(NA, -4L))
Related
What gsub function can I use in R to get the gene name and the id number from a vector which looks like this?
head(colnames(cn), 20)
[1] "A1BG (1)" "NAT2 (10)" "ADA (100)" "CDH2 (1000)" "AKT3 (10000)" "GAGE12F (100008586)"
[7] "RNA5-8SN5 (100008587)" "RNA18SN5 (100008588)" "RNA28SN5 (100008589)" "LINC02584 (100009613)" "POU5F1P5 (100009667)" "ZBTB11-AS1 (100009676)"
[13] "MED6 (10001)" "NR2E3 (10002)" "NAALAD2 (10003)" "DUXB (100033411)" "SNORD116-1 (100033413)" "SNORD116-2 (100033414)"
[19] "SNORD116-3 (100033415)" "SNORD116-4 (100033416)"
1) Assuming the input s given in the Note at the end we can use read.table specifying that the fields are separated by ( and that ) is a comment character. We also strip white space around fields and give meaningful column names. No packages are used.
DF <- read.table(text = s, sep = "(", comment.char = ")",
strip.white = TRUE, col.names = c("Gene", "Id"))
DF
giving this data frame, where DF$Gene holds the gene names and DF$Id holds the ids.
         Gene        Id
1        A1BG         1
2        NAT2        10
3         ADA       100
4        CDH2      1000
5        AKT3     10000
6     GAGE12F 100008586
7   RNA5-8SN5 100008587
8    RNA18SN5 100008588
9    RNA28SN5 100008589
10  LINC02584 100009613
11   POU5F1P5 100009667
12 ZBTB11-AS1 100009676
13       MED6     10001
14      NR2E3     10002
15    NAALAD2     10003
16       DUXB 100033411
17 SNORD116-1 100033413
18 SNORD116-2 100033414
19 SNORD116-3 100033415
20 SNORD116-4 100033416
2) A variation of the above is to first remove the parentheses and then read it in giving the same result. Note that the second argument of chartr contains two spaces so that each parenthesis is translated to a space.
read.table(text = chartr("()", "  ", s), col.names = c("Gene", "Id"))
Note
Lines <- '[1] "A1BG (1)" "NAT2 (10)" "ADA (100)" "CDH2 (1000)" "AKT3 (10000)" "GAGE12F (100008586)"
[7] "RNA5-8SN5 (100008587)" "RNA18SN5 (100008588)" "RNA28SN5 (100008589)" "LINC02584 (100009613)" "POU5F1P5 (100009667)" "ZBTB11-AS1 (100009676)"
[13] "MED6 (10001)" "NR2E3 (10002)" "NAALAD2 (10003)" "DUXB (100033411)" "SNORD116-1 (100033413)" "SNORD116-2 (100033414)"
[19] "SNORD116-3 (100033415)" "SNORD116-4 (100033416)" '
L <- Lines |>
textConnection() |>
readLines() |>
gsub(pattern = "\\[\\d+\\]", replacement = "")
s <- scan(text = L, what = "")
so s looks like this:
> dput(s)
c("A1BG (1)", "NAT2 (10)", "ADA (100)", "CDH2 (1000)", "AKT3 (10000)",
"GAGE12F (100008586)", "RNA5-8SN5 (100008587)", "RNA18SN5 (100008588)",
"RNA28SN5 (100008589)", "LINC02584 (100009613)", "POU5F1P5 (100009667)",
"ZBTB11-AS1 (100009676)", "MED6 (10001)", "NR2E3 (10002)", "NAALAD2 (10003)",
"DUXB (100033411)", "SNORD116-1 (100033413)", "SNORD116-2 (100033414)",
"SNORD116-3 (100033415)", "SNORD116-4 (100033416)")
First, in the future please share your data using the dput() command. See this for details.
Second, here is one solution for extracting the parts you need:
library(tidyverse)
g<-c("A1BG (1)","NAT2 (10)","ADA (100)" , "RNA18SN5 (100008588)", "RNA28SN5 (100008589)")
gnumber<-stringr::str_extract(g,"(?=\\().*?(?<=\\))")
gnumber
gname<-stringr::str_extract(g, "[:alpha:]+")
gname
# or, to get the whole first word:
gname<-stringr::word(g,1,1)
gname
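Since the question asks specifically about gsub, here is a minimal base R sketch using sub (one replacement per string is enough here). It assumes every element has the form "NAME (digits)", as in the sample; the variable names gene and id are made up for this illustration.

```r
g <- c("A1BG (1)", "NAT2 (10)", "RNA5-8SN5 (100008587)")

# Gene name: drop everything from the optional space before "(" to the end
gene <- sub(" *\\(.*$", "", g)

# Id: keep only the digits captured between the parentheses
id <- sub("^.*\\(([0-9]+)\\)$", "\\1", g)

data.frame(Gene = gene, Id = as.integer(id))
```

Unlike the "[:alpha:]+" pattern, this keeps names that contain digits or hyphens, such as RNA5-8SN5, intact.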
I am trying the following :
gg <-c("delete from below 110 11031133 11 11031135 110",
"delete froml #10989431 from adfdaf 10888022 <(>&<)> 10888018",
"this is for the deletion of an incorrect numberss that is no longer used for asd09 and sd040",
"please delete the following mangoes from trey 10246211 1 10821224 1 10821248 1 10821249",
"from 11015647 helppp 1 na from 0050 - zfhhhh 10840637 1")
pattern_to_find <- c('\\d{4,}')
aa <- str_extract_all(gg, pattern_to_find)
aa
With this code I am able to extract any number that has at least a fixed number of digits. But if I want to extract 2-digit numbers, it picks up the first two digits of every longer number instead:
pattern_to_find <- c('\\d{2}')
How can I modify my pattern so that it works both ways?
Regards,
R
Tidyverse solution:
library(tidyverse)
pattern_to_find <- c('\\d{2,}')
aa <- str_extract_all(gg, pattern_to_find)
Base R solution:
base_aa <- regmatches(gg, gregexpr(pattern_to_find, gg))
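If the goal is numbers with exactly two digits, rather than two or more, one option is to wrap the pattern in word boundaries. This is a sketch under that assumption, in base R with perl = TRUE; the shortened input x and the name exact_two are made up for this illustration.

```r
x <- c("below 110 11031133 11 11031135 110",
       "trey 10246211 1 10821224")

# \\b on both sides means the two digits cannot be part of a longer number
exact_two <- regmatches(x, gregexpr("\\b[0-9]{2}\\b", x, perl = TRUE))
exact_two
```

The same pattern should also work with str_extract_all; changing {2} to {4,} with the boundaries kept gives whole numbers of four or more digits.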
I have a question about extracting part of a string from several files that have these rows:
units = specified
- name 0 = prDM: Pressure, Digiquartz [db]
- name 1 = t090C: Temperature [ITS-90, deg C]
- name 2 = c0S/m: Conductivity [S/m]
- name 3 = t190C:Temperature, 2 [ITS-90, deg C]
- name 4 = c1S/m: Conductivity, 2 [S/m]
- name 5 = flSP: Fluorescence, Seapoint
- name 6 = sbeox0ML/L: Oxygen, SBE 43 [ml/l]
- name 7 = altM: Altimeter [m]
- name 8 = sal00: Salinity, Practical [PSU]
- name 9 = sal11: Salinity, Practical, 2 [PSU]
- span 0 = 1.000, 42.000
I need to extract information only from the rows that start with "name", taking everything between = and : .
For example, in the row "name 0 = prDM: Pressure, Digiquartz [db]" the desired result will be prDM.
Some files have a different number of "name" rows (i.e. this example has 13 rows, but other files have 16, and the number varies), so I want the solution to be as general as possible, so that I can always extract the right strings regardless of the number of rows. Rows start with # and a space before name.
I have tried this code, but it only extracts the first row. Can you please help me with this? Many thanks!
CNV<-NULL
for (i in 1:nro.files){
x <- readLines(all.files[i])
name.col<-grep("^\\# name", x)
df <- data.table::fread(text = x[name.col])
CNV[[i]]<-df
}
Use stringr and the regex pattern "name \\d+ = (.*?):", which means in words: "name", followed by a space and one or more digits, followed by a space, an equals sign and another space, followed by a captured group containing any character (the period) zero or more times but as few as possible (the *?), followed by a colon.
library(stringr)
strings <- c("name 0 = prDM: Pressure, Digiquartz [db]",
"name 1 = t090C: Temperature [ITS-90, deg C]",
"name 2 = c0S/m: Conductivity [S/m]",
"name 3 = t190C:Temperature, 2 [ITS-90, deg C]",
"name 4 = c1S/m: Conductivity, 2 [S/m]",
"name 5 = flSP: Fluorescence, Seapoint",
"name 6 = sbeox0ML/L: Oxygen, SBE 43 [ml/l]",
"name 7 = altM: Altimeter [m]",
"name 8 = sal00: Salinity, Practical [PSU]",
"name 9 = sal11: Salinity, Practical, 2 [PSU]")
result <- str_match(strings, "name \\d+ = (.*?):")
result[,2]
[1] "prDM" "t090C" "c0S/m" "t190C" "c1S/m" "flSP" "sbeox0ML/L"
[8] "altM" "sal00" "sal11"
Or if you prefer base
pattern = "name \\d+ = (.*):"
result <- regmatches(strings, regexec(pattern, strings))
sapply(result, "[[", 2)
[1] "prDM" "t090C" "c0S/m" "t190C" "c1S/m" "flSP" "sbeox0ML/L"
[8] "altM" "sal00" "sal11"
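To apply the same idea to several files with a varying number of "name" rows, as in the question, something along these lines might work. It is only a sketch: it assumes the header lines really start with "# name" as the question states, the helper extract_names is hypothetical, and a temporary file stands in for the real files in all_files.

```r
# Hypothetical helper: return the short names from the "# name" header lines
extract_names <- function(lines) {
  name_lines <- grep("^# name [0-9]+ = ", lines, value = TRUE)
  # Keep what sits between "= " and the first colon
  sub("^# name [0-9]+ = ([^:]*):.*$", "\\1", name_lines)
}

# Stand-in for one real file (assumed layout)
header <- c("# units = specified",
            "# name 0 = prDM: Pressure, Digiquartz [db]",
            "# name 3 = t190C:Temperature, 2 [ITS-90, deg C]",
            "# span 0 = 1.000, 42.000")
tmp <- tempfile(fileext = ".cnv")
writeLines(header, tmp)
all_files <- tmp

# One character vector of names per file, whatever the number of rows
CNV <- lapply(all_files, function(f) extract_names(readLines(f)))
CNV[[1]]
```

Because grep selects every matching line before the extraction, the number of "name" rows per file does not matter.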
Use str_extract from package stringr and positive lookahead and lookbehind:
str <- "name 0 = prDM: Pressure, Digiquartz [db]"
str_extract(str, "(?<== ).*(?=:)")
[1] "prDM"
Explanation:
(?<== ) if you see = followed by white space on the left (lookbehind)
.* match anything until ...
(?=:) ... you see a colon on the right (lookahead)
In Base R
test <- c("name 0 = prDM: Pressure, Digiquartz [db]","name 1 = t090C: Temperature [ITS-90, deg C]")
gsub("^name [0-9]+ = (.+):.+","\\1",test)
[1] "prDM" "t090C"
Explanation
^name [0-9]+ searches at the beginning of the string (^) for name followed by a number of any length
= (.+): one or more (+) of any character (.) found between = and : are captured by ( ) so they can be recalled later with \\1
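One caveat, sketched here under the assumption that a description might itself contain a colon (none of the sample rows do): (.+) is greedy, so it would run up to the last colon. A negated character class stops at the first one instead. The input string below is made up for this illustration.

```r
tricky <- "name 1 = t090C: Temperature: primary [ITS-90, deg C]"

# Greedy (.+) captures up to the LAST colon
greedy <- gsub("^name [0-9]+ = (.+):.+", "\\1", tricky)

# [^:]+ cannot cross a colon, so it stops at the FIRST one
first_colon <- gsub("^name [0-9]+ = ([^:]+):.*", "\\1", tricky)

greedy
first_colon
```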
In base R, how do I get:
Ref2 - the first 2 initials of the Ref, e.g. AC12 = AC, AL34 = AL
Street2 - the first initial of each word in Street, e.g. Abbey Court = AC, Albert Gardens = AG
then compare Ref2 & Street2 to see if they are the same or not
and then only use those that are not the same for further calculations?
You can try the following
> substr(Ref2,1,2) ==gsub("[a-z| ]","",Street2)
[1] TRUE FALSE
You can use that logical vector to filter your original data; negate it with ! to keep only the rows that differ, as the question asks.
The code works by only taking the first two characters from Ref2 and removing all lowercase characters + spaces from Street2.
Data
Ref2 = c("AC12","AL34")
Street2=c("Abbey Court","Albert Gardens")
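As a sketch of the last step in the question (keeping only the rows where the two sets of initials differ), assuming the two vectors sit side by side in a data frame; the names df, same and mismatched are made up for this illustration.

```r
Ref2 <- c("AC12", "AL34")
Street2 <- c("Abbey Court", "Albert Gardens")
df <- data.frame(Ref2, Street2, stringsAsFactors = FALSE)

# TRUE where the first two characters of Ref2 equal the street initials
same <- substr(df$Ref2, 1, 2) == gsub("[a-z ]", "", df$Street2)

# Keep only the rows that do NOT match, for the further calculations
mismatched <- df[!same, ]
mismatched
```

Note that stripping lowercase letters only works when each word starts with a capital and contains no other capitals.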
Just adding an option for anybody who wants to extract the first letter of each word where case is not consistent or the whole word is the same case.
This also includes filtering the table for continued use (using data.table).
library(data.table)
library(stringr)
data_example <- data.table(Ref2 = c("AC12", "AL34", "AG34"),
Street = c("Abbey Court", "Albert gardens", "albert gardens"))
data_example <- data_example[tolower(str_extract(Ref2, "^.{2}")) == tolower(paste0(str_extract(Street, "^."), str_extract(Street, "(?<=\\s).")))]
> View(data_example)
> data_example
   Ref2         Street
1: AC12    Abbey Court
2: AG34 albert gardens
In a single dataset (QueryTM), I have two columns, Query and TM. I want to check whether the Query contains the value of TM (in the same row) or not. For example, if TM is "Coca Cola" and Query is "Coca Cola India", Query should match TM. However, if Query is "Coca Colala India", it shouldn't match. The results are to be stored in another column, say Result.
I am using R as the platform.
You will need to add word boundaries to capture exact matching. Using mapply you can do,
dd$result <- mapply(grepl, paste0('\\b', dd$TM, '\\b'), dd$Query)
dd
#            TM              Query result
#1    Coca Cola  Coca Colala India  FALSE
#2 Fanta Orange Fanta Orange India   TRUE
DATA
dput(dd)
structure(list(TM = c("Coca Cola", "Fanta Orange"), Query = c("Coca Colala India",
"Fanta Orange India")), .Names = c("TM", "Query"), row.names = c(NA,
-2L), class = "data.frame")
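One thing to watch: TM is pasted straight into a regular expression, so a value containing characters such as + or . would be read as regex syntax. A minimal sketch of escaping them first, assuming perl = TRUE; the helper escape_regex and the third row ("C++ Cola") are hypothetical and only there to illustrate the point.

```r
dd <- data.frame(
  TM    = c("Coca Cola", "Fanta Orange", "C++ Cola"),
  Query = c("Coca Colala India", "Fanta Orange India", "C++ Cola India"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: put a backslash in front of every regex metacharacter
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

patterns <- paste0("\\b", escape_regex(dd$TM), "\\b")

# Row-wise match of each pattern against the Query in the same row
dd$result <- mapply(grepl, patterns, dd$Query,
                    MoreArgs = list(perl = TRUE), USE.NAMES = FALSE)
dd
```

\\b only marks a boundary next to a letter, digit or underscore, so a TM that starts or ends with punctuation would need a different boundary test.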