I am having trouble replacing the words in a tweet with the numeric value of their frequency.
I have already made a data frame showing the words ranked by their frequency.
Now I want to substitute the words in the tweets with the frequency rank of every word.
I attached snips of my data frames.
Tweets and word frequency data:
My goal is that a tweet looks like this:
[1] [3] [7] [11] [18] [12] [10] [5] [3] [44] [23] [46] [2] [90]
The [1] means that it is the most frequent word in the dataset.
Any help appreciated! :)
I think stringr::str_replace_all is an efficient way to go about it: just pass it a named vector with your word frequency and you're done.
See reprex below. The first few lines just generate random data; your frequency table looks like the df I generated below.
sentence <- "the quick brown fox jumps over the lazy dog"
sentence_split <- unique(as.character(stringr::str_split(string = sentence, pattern = " ", simplify = TRUE)))
names(sentence_split) <- sample(x = 1:1000, size = length(sentence_split))
df <- data.frame(word = sentence_split,
n = sample(x = 1:1000, size = length(sentence_split)))
df
#> word n
#> the 740
#> quick 192
#> brown 145
#> fox 809
#> jumps 700
#> over 910
#> lazy 352
#> dog 256
replace_vector <- paste0("[", df$n, "]")
names(replace_vector) <- df$word
stringr::str_replace_all(string = sentence, pattern = replace_vector)
#> [1] "[740] [192] [145] [809] [700] [910] [740] [352] [256]"
Created on 2021-07-24 by the reprex package (v2.0.0)
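The reprex above substitutes each word's count n. If the aim is the frequency rank from the question ([1] for the most frequent word), one possible base R sketch follows. The toy tweets, the freq table, and the helper name to_ranks are made up for illustration; the sketch assumes words are separated by whitespace and that every word appears in the table.

```r
# Hypothetical toy data standing in for the real tweets
tweets <- c("the quick brown fox", "the lazy dog and the fox")

# Frequency table: count every word, most frequent first
words <- unlist(strsplit(tweets, "\\s+"))
counts <- sort(table(words), decreasing = TRUE)
freq <- data.frame(word = names(counts),
                   n = as.integer(counts),
                   rank = seq_along(counts))

# Replace each word by its rank. match() compares whole words only,
# so "the" can never be replaced inside a longer word such as "other".
to_ranks <- function(tweet, freq) {
  w <- strsplit(tweet, "\\s+")[[1]]
  paste0("[", freq$rank[match(w, freq$word)], "]", collapse = " ")
}

ranked <- vapply(tweets, to_ranks, character(1), freq = freq, USE.NAMES = FALSE)
ranked
```

Looking words up with match() instead of a regular expression avoids partial-word replacements; with str_replace_all the same effect would need word boundaries (\\b) around each pattern.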
I am assessing the impact of hotspot single nucleotide polymorphisms (SNPs) from a next generation sequencing (NGS) experiment on the protein sequence of a virus. I have the reference DNA sequence and a list of hotspots. I first need to figure out the reading frame in which these hotspots occur. To do this, I generated a DNAStringSetList with all human codons and want to use vmatchPattern or matchPattern from the Biostrings package to figure out where the hotspots land in the codon reading frame.
I often struggle with lapply and the other apply functions, so I tend to use for loops instead. I am trying to improve in this area, so I would welcome an apply solution should one be available.
Here is the code for the list of codons:
alanine <- DNAStringSet("GCN")
arginine <- DNAStringSet(c("CGN", "AGR", "CGY", "MGR"))
asparginine <- DNAStringSet("AAY")
aspartic_acid <- DNAStringSet("GAY")
asparagine_or_aspartic_acid <- DNAStringSet("RAY")
cysteine <- DNAStringSet("TGY")
glutamine <- DNAStringSet("CAR")
glutamic_acid <- DNAStringSet("GAR")
glutamine_or_glutamic_acid <- DNAStringSet("SAR")
glycine <- DNAStringSet("GGN")
histidine <- DNAStringSet("CAY")
start <- DNAStringSet("ATG")
isoleucine <- DNAStringSet("ATH")
leucine <- DNAStringSet(c("CTN", "TTR", "CTY", "YTR"))
lysine <- DNAStringSet("AAR")
methionine <- DNAStringSet("ATG")
phenylalanine <- DNAStringSet("TTY")
proline <- DNAStringSet("CCN")
serine <- DNAStringSet(c("TCN", "AGY"))
threonine <- DNAStringSet("ACN")
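# Note: TGG actually encodes tryptophan and TAY encodes tyrosine, so the next two names are swapped
tyrosine <- DNAStringSet("TGG")
tryptophan <- DNAStringSet("TAY")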
valine <- DNAStringSet("GTN")
stop <- DNAStringSet(c("TRA", "TAR"))
codons <- DNAStringSetList(list(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
proline, serine, threonine, tyrosine, tryptophan, valine, stop))
Current for loop code:
reference_stringset <- DNAStringSet(covid)
codon_locations <- list()
for (i in 1:length(codons)) {
pattern <- codons[[i]]
codon_locations[i] <- vmatchPattern(pattern, reference_stringset)
}
Current error message. I am indexing into the codons DNAStringSetList, so each pattern is a DNAStringSet.
Error in normargPattern(pattern, subject) : 'pattern' must be a single string or an XString object
I can't give out the exact nucleotide sequence, but here is the COVID genome (link: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta) to use as a reprex:
# for those not used to .fasta files: first copy and paste the genome into Notepad and save it as a .fasta file
# use readDNAStringSet from the Biostrings package to read in the .fasta file
filepath <- "path/to/genome.fasta"  # placeholder: insert your file path
covid <- readDNAStringSet(filepath)
For the current code, change how codons is built. Currently, printing codons gives:
DNAStringSetList of length 24
[[1]] GCN
[[2]] CGN AGR CGY MGR
[[3]] AAY
[[4]] GAY
[[5]] RAY
[[6]] TGY
[[7]] CAR
[[8]] GAR
[[9]] SAR
[[10]] GGN
...
<14 more elements>
Change it from a DNAStringSetList to a single combined DNAStringSet, so that each element is one codon pattern.
codons <- DNAStringSet(c(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
proline, serine, threonine, tyrosine, tryptophan, valine, stop))
codons
DNAStringSet object of length 32:
width seq
[1] 3 GCN
[2] 3 CGN
[3] 3 AGR
[4] 3 CGY
[5] 3 MGR
... ... ...
[28] 3 TGG
[29] 3 TAY
[30] 3 GTN
[31] 3 TRA
[32] 3 TAR
When I run the script on the SARS-CoV-2 isolate listed for the example, I get the following output (I'm showing a small slice):
codon_locations[27:28]
[[1]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 0 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[[2]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 554 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 89 91 3
[2] 267 269 3
[3] 283 285 3
[4] 352 354 3
[5] 358 360 3
... ... ... ...
[550] 29261 29263 3
[551] 29289 29291 3
[552] 29472 29474 3
[553] 29559 29561 3
[554] 29793 29795 3
Looking at the elements that returned matches, only the patterns made up of standard nucleotides ("ATCG", no wobble codes) found anything. The patterns containing wobble codes will need to be changed as well before they can be searched.
If you're on Twitter, I suggest sharing the question with the #rstats, #bioconductor, and #bioinformatics hashtags to generate some more traction; I've noticed that bioinformatics-specific questions on SO don't generate as much buzz.
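One option for the wobble codes is the fixed argument of vmatchPattern. With the default fixed = TRUE, an IUPAC ambiguity code in the pattern only matches the identical letter in the subject; with fixed = FALSE, codes such as N, R and Y match any base they stand for. A minimal sketch with lapply in place of the for loop, assuming Biostrings is installed; toy_subject and toy_codons are made-up stand-ins for reference_stringset and codons.

```r
library(Biostrings)

# Made-up subject sequence; use reference_stringset in practice
toy_subject <- DNAStringSet(c(toy = "ATGGCTTGGTAA"))
toy_codons <- DNAStringSet(c("GCN", "TGG", "TAR"))

# lapply version of the for loop: one MIndex object per codon pattern.
# toy_codons[[i]] is a single DNAString, which vmatchPattern accepts.
codon_locations <- lapply(seq_along(toy_codons), function(i) {
  vmatchPattern(toy_codons[[i]], toy_subject, fixed = FALSE)
})
names(codon_locations) <- as.character(toy_codons)

# Start positions of every "GCN" match in the first subject sequence
start(codon_locations[["GCN"]][[1]])
```

Note that this reports matches in all three reading frames, so the start positions still have to be compared against the frame of the coding sequence.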
For university research I am trying to scrape an FDA table (robots.txt allows scraping this content).
The table contains 19 rows and 2 columns:
https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181
The format I am trying to extract is:
col1 col2 url_of_col2
<chr> <chr> <chr>
1 Device Classificati~ distal transcutaneous electrical stimulator for treatm~ https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpcd/classification.cfm?s~
What I achieved:
I can easily extract the items of the first column:
#library
library(tidyverse)
library(xml2)
library(rvest)
#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")
# select table of interest
html %>%
html_nodes("table") -> tables
tables[[9]] -> table
# extract col 1 items
table %>%
html_nodes("th") %>%
html_text() %>%
gsub("\n|\t|\r","",.) %>%
trimws()
#> [1] "Device Classification Name" "510(k) Number"
#> [3] "Device Name" "Applicant"
#> [5] "Applicant Contact" "Correspondent"
#> [7] "Correspondent Contact" "Regulation Number"
#> [9] "Classification Product Code" "Date Received"
#> [11] "Decision Date" "Decision"
#> [13] "Regulation Medical Specialty" "510k Review Panel"
#> [15] "summary" "Type"
#> [17] "Clinical Trials" "Reviewed by Third Party"
#> [19] "Combination Product"
Created on 2021-02-27 by the reprex package (v1.0.0)
Where I get stuck
Since some cells of column 2 contain a table, this approach does not give the same number of items:
# extract col 2 items
table %>%
html_nodes("td") %>%
html_text()%>%
gsub("\n|\t|\r","",.) %>%
trimws()
#> [1] "distal transcutaneous electrical stimulator for treatment of acute migraine"
#> [2] "K203181"
#> [3] "Nerivio, FGD000075-4.7"
#> [4] "Theranica Bioelectronics ltd4 Ha-Omanutst. Poleg Industrial Parknetanya, IL4250574"
#> [5] "Theranica Bioelectronics ltd"
#> [6] "4 Ha-Omanutst. Poleg Industrial Park"
#> [7] "netanya, IL4250574"
#> [8] "alon ironi"
#> [9] "Hogan Lovells US LLP1735 Market StreetSuite 2300philadelphia, PA 19103"
#> [10] "Hogan Lovells US LLP"
#> [11] "1735 Market Street"
#> [12] "Suite 2300"
#> [13] "philadelphia, PA 19103"
#> [14] "janice m. hogan"
#> [15] "882.5899"
#> [16] "QGT "
#> [17] "QGT "
#> [18] "10/26/2020"
#> [19] "01/22/2021"
#> [20] "substantially equivalent (SESE)"
#> [21] "Neurology"
#> [22] "Neurology"
#> [23] "summary"
#> [24] "Traditional"
#> [25] "NCT04089761"
#> [26] "No"
#> [27] "No"
Created on 2021-02-27 by the reprex package (v1.0.0)
Moreover, I could not find a way to extract the URLs of col2.
I found a good manual for reading HTML tables with cells spanning multiple rows. However, I think this approach does not work for nested tables.
There is a similar question regarding a nested table without links (How to scrape older html with nested tables in R?) which has not been answered yet. A comment suggested this question; unfortunately, I could not apply it to my HTML table.
There is the unpivotr package that aims to read nested HTML tables; however, I could not solve my problem with that package.
Yes, the tables within the rows of the parent table do make it more difficult. The key for this one is to find the 27 rows of the table and then parse each row individually.
library(rvest)
library(stringr)
library(dplyr)
#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")
# select table of interest
tables <- html %>% html_nodes("table")
table <- tables[[9]]
#find all of the table's rows
trows <- table %>% html_nodes("tr")
#find the left column
leftside <- trows %>% html_node("th") %>% html_text() %>% trimws()
#find the right column (remove whitespace at the ends and in the middle)
rightside <- trows %>% html_node("td") %>% html_text() %>% str_squish() %>% trimws()
#get links
links <- trows %>% html_node("td a") %>% html_attr("href")
answer <- data.frame(leftside, rightside, links)
Some of the links are relative, so one will need something like paste0("https://www.accessdata.fda.gov/", answer$links) on those to obtain the full web address (paste0 rather than paste, because paste inserts a space between the two parts by default).
The final data frame does have several cells containing "NA". These can be removed, and the table can be cleaned up some more depending on the final requirements. See tidyr::fill() as a good starting point.
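The reason the row-by-row approach lines up is that html_node() (singular) returns exactly one result per row, with NA where nothing matches, whereas html_nodes() returns every match. A small sketch on a made-up two-row table (the HTML, the /a/b.cfm path, and the variable names are invented for illustration, not taken from the FDA page):

```r
library(rvest)

# Made-up table: the second row's cell contains no link
toy <- read_html('<table>
  <tr><th>Name</th><td>widget <a href="/a/b.cfm">more</a></td></tr>
  <tr><th>Type</th><td>Traditional</td></tr>
</table>')

rows <- html_nodes(toy, "tr")

# One value per row, so the three vectors always have the same length
left  <- html_text(html_node(rows, "th"))
right <- html_text(html_node(rows, "td"))
href  <- html_attr(html_node(rows, "td a"), "href")

# Prepend the host only where a link exists
full <- ifelse(is.na(href), NA_character_,
               paste0("https://www.accessdata.fda.gov", href))

data.frame(left, right, full)
```

Here the host is written without a trailing slash because the toy href starts with "/"; adjust that to however the real hrefs begin.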
Update
To reduce the answer down to the desired 19 original rows:
library(tidyr)
#replace NA with blanks
answer$links <- replace_na(answer$links, "")
#fill in the blanks in the first column to allow for grouping
answer <- fill(answer, leftside, .direction = "down")
#Create the final results
finalanswer <- answer %>% group_by(leftside) %>%
summarize(info=paste(rightside, collapse = " "), link=first(links))
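One thing to be aware of: group_by() on a character column returns the groups in alphabetical order, so the summarized table no longer follows the order of the page. A possible tweak, sketched on made-up rows (toy and toy_final are hypothetical names, and the link is invented), is to turn leftside into a factor whose levels follow the order of appearance before grouping:

```r
library(dplyr)
library(tidyr)

# Made-up rows: NA in leftside marks a row that came from a nested table
toy <- data.frame(
  leftside  = c("Device Name", "Applicant", NA, NA),
  rightside = c("Widget", "Acme Ltd", "1 Main Street", "Springfield"),
  links     = c("/scripts/example.cfm", NA, NA, NA)
)

toy_final <- toy %>%
  mutate(links = replace_na(links, "")) %>%
  fill(leftside, .direction = "down") %>%
  # factor levels in order of appearance keep the page's row order
  mutate(leftside = factor(leftside, levels = unique(leftside))) %>%
  group_by(leftside) %>%
  summarize(info = paste(rightside, collapse = " "), link = first(links))

toy_final
```

Without the factor step, "Applicant" would be listed before "Device Name".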
I have two vectors, one that contains a list of variables, and one that contains dates, such as
Variables_Pays <- c("PIB", "ConsommationPrivee","ConsommationPubliques",
"FBCF","ProductionIndustrielle","Inflation","InflationSousJacente",
"PrixProductionIndustrielle","CoutHoraireTravail")
Annee_Pays <- c("2000","2001")
I want to merge them to have a vector with each variable indexed by my date; that is, my desired output is
> Colonnes_Pays_Principaux
[1] "PIB_2000" "PIB_2001" "ConsommationPrivee_2000"
[4] "ConsommationPrivee_2001" "ConsommationPubliques_2000" "ConsommationPubliques_2001"
[7] "FBCF_2000" "FBCF_2001" "ProductionIndustrielle_2000"
[10] "ProductionIndustrielle_2001" "Inflation_2000" "Inflation_2001"
[13] "InflationSousJacente_2000" "InflationSousJacente_2001" "PrixProductionIndustrielle_2000"
[16] "PrixProductionIndustrielle_2001" "CoutHoraireTravail_2000" "CoutHoraireTravail_2001"
Is there a simpler / more readable way than the double for loop that I have tried (successfully) below?
Colonnes_Pays_Principaux <- vector()
for (Variable in (1:length(Variables_Pays))){
for (Annee in (1:length(Annee_Pays))){
Colonnes_Pays_Principaux=
append(Colonnes_Pays_Principaux,
paste(Variables_Pays[Variable],Annee_Pays[Annee],sep="_")
)
}
}
expand.grid will create a data frame with all combinations of the two vectors.
with(
expand.grid(Variables_Pays, Annee_Pays),
paste0(Var1, "_", Var2)
)
#> [1] "PIB_2000" "ConsommationPrivee_2000"
#> [3] "ConsommationPubliques_2000" "FBCF_2000"
#> [5] "ProductionIndustrielle_2000" "Inflation_2000"
#> [7] "InflationSousJacente_2000" "PrixProductionIndustrielle_2000"
#> [9] "CoutHoraireTravail_2000" "PIB_2001"
#> [11] "ConsommationPrivee_2001" "ConsommationPubliques_2001"
#> [13] "FBCF_2001" "ProductionIndustrielle_2001"
#> [15] "Inflation_2001" "InflationSousJacente_2001"
#> [17] "PrixProductionIndustrielle_2001" "CoutHoraireTravail_2001"
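The output above is grouped by year rather than by variable. If the order from the question matters, one way (a sketch, using a shortened Variables_Pays and made-up column names Annee and Variable) is to pass the years first, because expand.grid varies its first argument fastest:

```r
Variables_Pays <- c("PIB", "FBCF", "Inflation")   # shortened for the example
Annee_Pays <- c("2000", "2001")

# expand.grid varies its FIRST argument fastest, so the years go first
Colonnes <- with(
  expand.grid(Annee = Annee_Pays, Variable = Variables_Pays),
  paste0(Variable, "_", Annee)
)
Colonnes
#> [1] "PIB_2000"       "PIB_2001"       "FBCF_2000"      "FBCF_2001"
#> [5] "Inflation_2000" "Inflation_2001"

# Same result with rep(): repeat each variable once per year
identical(
  Colonnes,
  paste(rep(Variables_Pays, each = length(Annee_Pays)), Annee_Pays, sep = "_")
)
#> [1] TRUE
```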
We can use outer:
c(t(outer(Variables_Pays, Annee_Pays, paste, sep = '_')))
# [1] "PIB_2000" "PIB_2001"
# [3] "ConsommationPrivee_2000" "ConsommationPrivee_2001"
# [5] "ConsommationPubliques_2000" "ConsommationPubliques_2001"
# [7] "FBCF_2000" "FBCF_2001"
# [9] "ProductionIndustrielle_2000" "ProductionIndustrielle_2001"
#[11] "Inflation_2000" "Inflation_2001"
#[13] "InflationSousJacente_2000" "InflationSousJacente_2001"
#[15] "PrixProductionIndustrielle_2000" "PrixProductionIndustrielle_2001"
#[17] "CoutHoraireTravail_2000" "CoutHoraireTravail_2001"
No real need to go beyond the basics here! Use paste for pasting the strings and rep to repeat either Annee_Pays or Variables_Pays to get all combinations:
Variables_Pays <- c("PIB", "ConsommationPrivee","ConsommationPubliques",
"FBCF","ProductionIndustrielle","Inflation","InflationSousJacente",
"PrixProductionIndustrielle","CoutHoraireTravail")
Annee_Pays <- c("2000","2001")
# To get this in the same order as in your example:
paste(rep(Variables_Pays, rep(2, length(Variables_Pays))), Annee_Pays, sep = "_")
# Alternative order:
paste(Variables_Pays, rep(Annee_Pays, rep(length(Variables_Pays), 2)), sep = "_")
# Or, if order doesn't matter too much (this covers every combination
# only because the two lengths, 9 and 2, share no common factor):
paste(Variables_Pays, rep(Annee_Pays, length(Variables_Pays)), sep = "_")
In base R:
Variables_Pays <- c("PIB", "ConsommationPrivee","ConsommationPubliques",
"FBCF","ProductionIndustrielle","Inflation","InflationSousJacente",
"PrixProductionIndustrielle","CoutHoraireTravail")
Annee_Pays <- c("2000","2001")
# returns a 9 x 2 character matrix; wrap the call in c() to get a plain vector
cbind(paste(Variables_Pays, Annee_Pays, sep = "_"), paste(Variables_Pays, rev(Annee_Pays), sep = "_"))
I would like R to read in the first 10,000 digits of pi and group every 10 digits together.
For example, I want R to read in a sequence
pi <- 3.14159265358979323846264338327950288419716939937510582097...
and give me a table where each row contains 10 digits:
3141592653
5897932384
6264338327
...
I am new to R and really don't know where to start so any help would be much appreciated!
Thank you in advance
https://rextester.com/OQRM27791
p <- strsplit("314159265358979323846264338327950288419716939937510582097", "")
digits <- p[[1]]
# group every 10 digits and paste each group back into one string
sapply(split(digits, ceiling(seq_along(digits) / 10)), paste, collapse = "")
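A numeric value in R only carries about 15 to 16 significant digits, so the digits have to be handled as text. As an alternative sketch of the same grouping without splitting into single characters, substring() can cut the string at computed positions (the names starts and groups are just for this example):

```r
digits <- "314159265358979323846264338327950288419716939937510582097"

# start position of every group of 10 characters
starts <- seq(1, nchar(digits), by = 10)

# substring() is vectorised over start/stop; the last group may be shorter
groups <- substring(digits, starts, pmin(starts + 9, nchar(digits)))

# one row per group, as a table
data.frame(digits = groups)
```

Keeping the groups as character strings also preserves leading zeros, which would be lost if they were converted to numbers.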
Here's one way to do it. It's fully reproducible, so just cut and paste it into your R console. The vector result is the first 10,000 digits of pi, split into 1000 strings of 10 digits.
For this many digits, I have used an online source for the precalculated value of pi. This is read in using readChar and the decimal point is stripped out with gsub. The resulting string is split into individual characters and put in a 1000 * 10 matrix (filled row-wise). The rows are then pasted into strings, giving the result. I have displayed only the first 100 entries of result for clarity of presentation.
pi_url <- "https://www.pi2e.ch/blog/wp-content/uploads/2017/03/pi_dec_1m.txt"
pi_char <- gsub("\\.", "", readChar(pi_url, 1e4 + 1))
pi_mat <- matrix(strsplit(pi_char, "")[[1]], byrow = TRUE, ncol = 10)
result <- apply(pi_mat, 1, paste0, collapse = "")
head(result, 100)
#> [1] "3141592653" "5897932384" "6264338327" "9502884197" "1693993751"
#> [6] "0582097494" "4592307816" "4062862089" "9862803482" "5342117067"
#> [11] "9821480865" "1328230664" "7093844609" "5505822317" "2535940812"
#> [16] "8481117450" "2841027019" "3852110555" "9644622948" "9549303819"
#> [21] "6442881097" "5665933446" "1284756482" "3378678316" "5271201909"
#> [26] "1456485669" "2346034861" "0454326648" "2133936072" "6024914127"
#> [31] "3724587006" "6063155881" "7488152092" "0962829254" "0917153643"
#> [36] "6789259036" "0011330530" "5488204665" "2138414695" "1941511609"
#> [41] "4330572703" "6575959195" "3092186117" "3819326117" "9310511854"
#> [46] "8074462379" "9627495673" "5188575272" "4891227938" "1830119491"
#> [51] "2983367336" "2440656643" "0860213949" "4639522473" "7190702179"
#> [56] "8609437027" "7053921717" "6293176752" "3846748184" "6766940513"
#> [61] "2000568127" "1452635608" "2778577134" "2757789609" "1736371787"
#> [66] "2146844090" "1224953430" "1465495853" "7105079227" "9689258923"
#> [71] "5420199561" "1212902196" "0864034418" "1598136297" "7477130996"
#> [76] "0518707211" "3499999983" "7297804995" "1059731732" "8160963185"
#> [81] "9502445945" "5346908302" "6425223082" "5334468503" "5261931188"
#> [86] "1710100031" "3783875288" "6587533208" "3814206171" "7766914730"
#> [91] "3598253490" "4287554687" "3115956286" "3882353787" "5937519577"
#> [96] "8185778053" "2171226806" "6130019278" "7661119590" "9216420198"
Created on 2020-07-23 by the reprex package (v0.3.0)
We can use str_extract_all:
pi <- readLines("https://www.pi2e.ch/blog/wp-content/uploads/2017/03/pi_dec_1m.txt")
library(stringr)
t <- unlist(str_extract_all(sub("\\.","", pi), "\\d{10}"))
t[1:100]
[1] "3141592653" "5897932384" "6264338327" "9502884197" "1693993751" "0582097494" "4592307816" "4062862089"
[9] "9862803482" "5342117067" "9821480865" "1328230664" "7093844609" "5505822317" "2535940812" "8481117450"
[17] "2841027019" "3852110555" "9644622948" "9549303819" "6442881097" "5665933446" "1284756482" "3378678316"
[25] "5271201909" "1456485669" "2346034861" "0454326648" "2133936072" "6024914127" "3724587006" "6063155881"
[33] "7488152092" "0962829254" "0917153643" "6789259036" "0011330530" "5488204665" "2138414695" "1941511609"
[41] "4330572703" "6575959195" "3092186117" "3819326117" "9310511854" "8074462379" "9627495673" "5188575272"
[49] "4891227938" "1830119491" "2983367336" "2440656643" "0860213949" "4639522473" "7190702179" "8609437027"
[57] "7053921717" "6293176752" "3846748184" "6766940513" "2000568127" "1452635608" "2778577134" "2757789609"
[65] "1736371787" "2146844090" "1224953430" "1465495853" "7105079227" "9689258923" "5420199561" "1212902196"
[73] "0864034418" "1598136297" "7477130996" "0518707211" "3499999983" "7297804995" "1059731732" "8160963185"
[81] "9502445945" "5346908302" "6425223082" "5334468503" "5261931188" "1710100031" "3783875288" "6587533208"
[89] "3814206171" "7766914730" "3598253490" "4287554687" "3115956286" "3882353787" "5937519577" "8185778053"
[97] "2171226806" "6130019278" "7661119590" "9216420198"
Consider the following vector
x <- paste0(LETTERS,1:26)
I want to replace the letters with the NATO phonetic alphabet (Alfa, Bravo, Charlie, etc.) whilst keeping the numbers. Is there a vector within R, similar to LETTERS, that has the full NATO phonetic alphabet?
I'm not aware of a built-in list. It's just a vector of words, so you can build it yourself.
NATO <- strsplit("Alfa, Bravo, Charlie, Delta, Echo, Foxtrot, Golf, Hotel, India, Juliett, Kilo, Lima, Mike, November, Oscar, Papa, Quebec, Romeo, Sierra, Tango, Uniform, Victor, Whiskey, X-ray, Yankee, Zulu", ", ")
z <- paste0(unlist(NATO),1:26)
z
#> [1] "Alfa1" "Bravo2" "Charlie3" "Delta4" "Echo5"
#> [6] "Foxtrot6" "Golf7" "Hotel8" "India9" "Juliett10"
#> [11] "Kilo11" "Lima12" "Mike13" "November14" "Oscar15"
#> [16] "Papa16" "Quebec17" "Romeo18" "Sierra19" "Tango20"
#> [21] "Uniform21" "Victor22" "Whiskey23" "X-ray24" "Yankee25"
#> [26] "Zulu26"
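The paste0 call above relies on x being in alphabetical order. If the input can come in any order, a named lookup vector is one way to go; a sketch, assuming each element of x is a single capital letter followed by a number (the example x and the names letter and number are made up):

```r
nato <- c("Alfa", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf",
          "Hotel", "India", "Juliett", "Kilo", "Lima", "Mike", "November",
          "Oscar", "Papa", "Quebec", "Romeo", "Sierra", "Tango", "Uniform",
          "Victor", "Whiskey", "X-ray", "Yankee", "Zulu")
names(nato) <- LETTERS            # look words up by their letter

x <- c("C3", "A10", "Z26")        # made-up input in arbitrary order
letter <- substr(x, 1, 1)         # first character
number <- substring(x, 2)         # everything after it

paste0(nato[letter], number)
#> [1] "Charlie3" "Alfa10"   "Zulu26"
```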