I have 50+ manuscript titles in an R Markdown document, copied directly from a Word document. Is there a function or package that can sort these titles in alphabetical order so that I can list them back in R Markdown?
Hospital admission and mortality rates for non-covid diseases in Denmark during covid-19 pandemic: nationwide population based cohort study
Covid-19 deaths in Africa: prospective systematic postmortem surveillance study
Food anaphylaxis in the United Kingdom: analysis of national data, 1998-2018
Association of first trimester prescription opioid use with congenital malformations in the offspring: population based cohort study
In the following example, you should be able to copy your text into the text argument of read.table. sep = "\n" specifies that the entries are separated by line breaks, and empty lines are skipped (blank.lines.skip = TRUE).
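df <- read.table(sep = "\n", blank.lines.skip = TRUE, stringsAsFactors = FALSE,
quote = "", comment.char = "",  # treat apostrophes, quotes and # inside titles as literal text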
text = "Hospital admission and mortality rates for non-covid diseases in Denmark during covid-19 pandemic: nationwide population based cohort study
Covid-19 deaths in Africa: prospective systematic postmortem surveillance study
Food anaphylaxis in the United Kingdom: analysis of national data, 1998-2018
Association of first trimester prescription opioid use with congenital malformations in the offspring: population based cohort study"
)
df <- sort(df$V1)
df
# [1] "Association of first trimester prescription opioid use with congenital malformations in the offspring: population based cohort study"
# [2] "Covid-19 deaths in Africa: prospective systematic postmortem surveillance study"
# [3] "Food anaphylaxis in the United Kingdom: analysis of national data, 1998-2018"
# [4] "Hospital admission and mortality rates for non-covid diseases in Denmark during covid-19 pandemic: nationwide population based cohort study"
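As an alternative to read.table, the same idea can be sketched with base string functions. This is only a sketch: the names titles_raw and sorted_titles are made up for illustration, and it assumes one title per line. It sorts case-insensitively and prints the result as a Markdown bullet list, which R Markdown renders as a list when the chunk option results = 'asis' is set.

```r
# Paste the titles as one multi-line string (blank lines are allowed).
titles_raw <- "Hospital admission and mortality rates for non-covid diseases in Denmark during covid-19 pandemic: nationwide population based cohort study

Covid-19 deaths in Africa: prospective systematic postmortem surveillance study
Food anaphylaxis in the United Kingdom: analysis of national data, 1998-2018
Association of first trimester prescription opioid use with congenital malformations in the offspring: population based cohort study"

# Split into lines, trim whitespace, and drop empty lines.
titles <- trimws(strsplit(titles_raw, "\n", fixed = TRUE)[[1]])
titles <- titles[nzchar(titles)]

# Sort ignoring case, so "covid" and "Covid" sort together.
sorted_titles <- titles[order(tolower(titles))]

# Emit a Markdown bullet list (use results = 'asis' in the chunk header).
cat(paste0("- ", sorted_titles), sep = "\n")
```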
I am unable to load the Groceries data set in R.
Can anyone help?
> data()
Data sets in package ‘datasets’:
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales) Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide Uptake in Grass Plants
ChickWeight Weight versus age of chicks on different diets
DNase Elisa assay of DNase
EuStockMarkets Daily Closing Prices of Major European Stock Indices,
1991-1998
Formaldehyde Determination of Formaldehyde
HairEyeColor Hair and Eye Color of Statistics Students
Harman23.cor Harman Example 2.3
Harman74.cor Harman Example 7.4
Indometh Pharmacokinetics of Indomethacin
InsectSprays Effectiveness of Insect Sprays
JohnsonJohnson Quarterly Earnings per Johnson & Johnson Share
LakeHuron Level of Lake Huron 1875-1972
LifeCycleSavings Intercountry Life-Cycle Savings Data
Loblolly Growth of Loblolly pine trees
Nile Flow of the River Nile
Orange Growth of Orange Trees
OrchardSprays Potency of Orchard Sprays
PlantGrowth Results from an Experiment on Plant Growth
Puromycin Reaction Velocity of an Enzymatic Reaction
Seatbelts Road Casualties in Great Britain 1969-84
Theoph Pharmacokinetics of Theophylline
Titanic Survival of passengers on the Titanic
ToothGrowth The Effect of Vitamin C on Tooth Growth in Guinea Pigs
UCBAdmissions Student Admissions at UC Berkeley
UKDriverDeaths Road Casualties in Great Britain 1969-84
UKgas UK Quarterly Gas Consumption
USAccDeaths Accidental Deaths in the US 1973-1978
USArrests Violent Crime Rates by US State
USJudgeRatings Lawyers' Ratings of State Judges in the US Superior Court
USPersonalExpenditure Personal Expenditure Data
UScitiesD Distances Between European Cities and Between US Cities
VADeaths Death Rates in Virginia (1940)
WWWusage Internet Usage per Minute
WorldPhones The World's Telephones
ability.cov Ability and Intelligence Tests
airmiles Passenger Miles on Commercial US Airlines, 1937-1960
airquality New York Air Quality Measurements
anscombe Anscombe's Quartet of 'Identical' Simple Linear
Regressions
attenu The Joyner-Boore Attenuation Data
attitude The Chatterjee-Price Attitude Data
austres Quarterly Time Series of the Number of Australian
Residents
beaver1 (beavers) Body Temperature Series of Two Beavers
beaver2 (beavers) Body Temperature Series of Two Beavers
cars Speed and Stopping Distances of Cars
chickwts Chicken Weights by Feed Type
co2 Mauna Loa Atmospheric CO2 Concentration
crimtab Student's 3000 Criminals Data
discoveries Yearly Numbers of Important Discoveries
esoph Smoking, Alcohol and (O)esophageal Cancer
euro Conversion Rates of Euro Currencies
euro.cross (euro) Conversion Rates of Euro Currencies
eurodist Distances Between European Cities and Between US Cities
faithful Old Faithful Geyser Data
fdeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK
freeny Freeny's Revenue Data
freeny.x (freeny) Freeny's Revenue Data
freeny.y (freeny) Freeny's Revenue Data
infert Infertility after Spontaneous and Induced Abortion
iris Edgar Anderson's Iris Data
iris3 Edgar Anderson's Iris Data
islands Areas of the World's Major Landmasses
ldeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK
lh Luteinizing Hormone in Blood Samples
longley Longley's Economic Regression Data
lynx Annual Canadian Lynx trappings 1821-1934
mdeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK
morley Michelson Speed of Light Data
mtcars Motor Trend Car Road Tests
nhtemp Average Yearly Temperatures in New Haven
nottem Average Monthly Temperatures at Nottingham, 1920-1939
npk Classical N, P, K Factorial Experiment
occupationalStatus Occupational Status of Fathers and their Sons
precip Annual Precipitation in US Cities
presidents Quarterly Approval Ratings of US Presidents
pressure Vapor Pressure of Mercury as a Function of Temperature
quakes Locations of Earthquakes off Fiji
randu Random Numbers from Congruential Generator RANDU
rivers Lengths of Major North American Rivers
rock Measurements on Petroleum Rock Samples
sleep Student's Sleep Data
stack.loss (stackloss) Brownlee's Stack Loss Plant Data
stack.x (stackloss) Brownlee's Stack Loss Plant Data
stackloss Brownlee's Stack Loss Plant Data
state.abb (state) US State Facts and Figures
state.area (state) US State Facts and Figures
state.center (state) US State Facts and Figures
state.division (state) US State Facts and Figures
state.name (state) US State Facts and Figures
state.region (state) US State Facts and Figures
state.x77 (state) US State Facts and Figures
sunspot.month Monthly Sunspot Data, from 1749 to "Present"
sunspot.year Yearly Sunspot Data, 1700-1988
sunspots Monthly Sunspot Numbers, 1749-1983
swiss Swiss Fertility and Socioeconomic Indicators (1888) Data
treering Yearly Treering Data, -6000-1979
trees Diameter, Height and Volume for Black Cherry Trees
uspop Populations Recorded by the US Census
volcano Topographic Information on Auckland's Maunga Whau Volcano
warpbreaks The Number of Breaks in Yarn during Weaving
women Average Heights and Weights for American Women
Use ‘data(package = .packages(all.available = TRUE))’
to list the data sets in all *available* packages.
> head(Groceries)
Error in head(Groceries) : object 'Groceries' not found
> groceries <- data(Groceries)
Warning message:
In data(Groceries) : data set ‘Groceries’ not found
> library(datasets)
> groceries <- data(Groceries)
Warning message:
In data(Groceries) : data set ‘Groceries’ not found
>
Groceries is in the arules package.
install.packages("arules")
library(arules)
data(Groceries)
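If it is unclear which package ships a data set, the listing returned by data() can be searched. The helper below is a hypothetical convenience function (find_dataset is not part of R); it is a sketch that assumes the package is already installed, since data() can only list installed packages.

```r
# Hypothetical helper: report which installed package(s) provide a data set.
find_dataset <- function(name, packages = .packages(all.available = TRUE)) {
  # data(package = ...)$results is a character matrix with the columns
  # "Package", "LibPath", "Item" and "Title".
  listing <- suppressWarnings(data(package = packages)$results)
  # Items such as "beaver1 (beavers)" carry a suffix; keep only the name.
  items <- sub(" .*$", "", listing[, "Item"])
  unique(unname(listing[items == name, "Package"]))
}

# mtcars ships with the 'datasets' package.
mtcars_pkg <- find_dataset("mtcars", packages = "datasets")
print(mtcars_pkg)
```

Once arules is installed, calling find_dataset("Groceries") without the packages argument should list arules; before installation it returns an empty character vector, which matches the "data set not found" warning in the question.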
I need to extract the journal article titles from a bibliography list. The titles are all within quotation marks.
So is there a way to ask R to extract all text that is within quotation marks?
I have read the list into R as a text file:
data <- readLines("Publications _ CCDM.txt")
Here are a few lines from the list:
Andronis, C.E., Hane, J., Bringans, S., Hardy, G., Jacques, S., Lipscombe, R., Tan, K-C. (2020). “Gene validation and remodelling using proteogenomics of Phytophthora cinnamomi, the causal agent of Dieback.” bioRxiv. DOI: https://doi.org/10.1101/2020.10.25.354530
Beccari, G., Prodi, A., Senatore, M.T., Balmas, V,. Tini, F., Onofri, A., Pedini, L., Sulyok, M,. Brocca, L., Covarelli, L. (2020). “Cultivation Area Affects the Presence of Fungal Communities and Secondary Metabolites in Italian Durum Wheat Grains.” Toxins https://www.mdpi.com/2072-6651/12/2/97
Corsi, B., Percvial-Alwyn, L., Downie, R.C., Venturini, L., Iagallo, E.M., Campos Mantello, C., McCormick-Barnes, C., See, P.T., Oliver, R.P., Moffat, C.S., Cockram, J. “Genetic analysis of wheat sensitivity to the ToxB fungal effector from Pyrenophora tritici-repentis, the causal agent of tan spot” Theoretical and Applied Genetics. https://doi.org/10.1007/s00122-019-03517-8
Derbyshire, M.C., (2020) Bioinformatic Detection of Positive Selection Pressure in Plant Pathogens: The Neutral Theory of Molecular Sequence Evolution in Action. (2020) Frontiers in Microbiology. https://doi.org/10.3389/fmicb.2020.00644
Dodhia, K.N., Cox, B.A., Oliver, R.P., Lopez-Ruiz, F.J. (2020). “When time really is money: in situ quantification of the strobilurin resistance mutation G143A in the wheat pathogen Blumeria graminis f. sp. tritici.” bioRxiv, doi: https://doi.org/10.1101/2020.08.20.258921
Graham-Taylor, C., Kamphuis, L.G., Derbyshire, M.C. (2020). “A detailed in silico analysis of secondary metabolite biosynthesis clusters in the genome of the broad host range plant pathogenic fungus Sclerotinia sclerotiorum.” BMC Genomics https://doi.org/10.1186/s12864-019-6424-4
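Try something like this, where x is the character vector of lines (data in the question):
library(stringr)
# unlist() flattens the per-line matches into one vector, so every line is covered
str_extract_all(x, "“.*?”") %>% unlist()
If you want to remove the quotation marks from the result, pipe it into this step at the end of the pipeline:
str_remove_all("[“”]")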
Output:
[1] "Gene validation and remodelling using proteogenomics of Phytophthora cinnamomi, the causal agent of Dieback."
[2] "Cultivation Area Affects the Presence of Fungal Communities and Secondary Metabolites in Italian Durum Wheat Grains."
[3] "Genetic analysis of wheat sensitivity to the ToxB fungal effector from Pyrenophora tritici-repentis, the causal agent of tan spot"
[4] "When time really is money: in situ quantification of the strobilurin resistance mutation G143A in the wheat pathogen Blumeria graminis f. sp. tritici."
[5] "A detailed in silico analysis of secondary metabolite biosynthesis clusters in the genome of the broad host range plant pathogenic fungus Sclerotinia sclerotiorum."
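The same extraction can be sketched in base R without stringr. The three reference strings below are made-up placeholders, not entries from the bibliography, and refs and titles are illustrative names. The pattern assumes the titles are wrapped in curly quotation marks (written here as the escapes \u201c and \u201d).

```r
# Made-up placeholder references; the second one has no quoted title.
refs <- c(
  "Author, A. (2020). \u201cFirst title.\u201d Journal A",
  "Author, B. (2020) Unquoted title. Journal B",
  "Author, C. \u201cSecond title\u201d Journal C"
)

# Find every lazily matched run between an opening and a closing curly quote.
matches <- regmatches(refs, gregexpr("\u201c.*?\u201d", refs, perl = TRUE))

# Flatten the per-line list and strip the quotation marks themselves.
titles <- gsub("[\u201c\u201d]", "", unlist(matches))
print(titles)
```

Entries without curly quotation marks contribute no match, which is why the Derbyshire reference is absent from the output above.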
I am analyzing some survey data in R. Due to the sample design, all analysis has to be done with the "survey" package that can take the sample structure into account, which means I can't just get within-column or within-row percents using prop.table() the way I would on non-survey data.
For anyone not familiar with the row/column percent terminology, what I mean is percents for one variable conditional on being in a specific row/column for another variable. For example:
      | male | female
black | 10   | 20
white | 15   | 15
other | 10   | 15
A row percent would be number of observations in a cell divided by number of observations in that row, for example the percent for "male" in the row "other" is 40% (10/(10+15)). A column percent would be number of observations in a cell divided by number of observations in that column, for example the percent for "other" in the column "female" is 30% (15/(20+15+15)). Normally these are easily calculated with prop.table(), but I can't use prop.table() this time because it doesn't account for survey sample design.
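For reference, the unweighted version of these definitions can be sketched with prop.table() on the example counts. The object names counts, row_pct and col_pct are illustrative, and this ignores the survey design entirely.

```r
# Example counts from the table above (rows: race, columns: gender).
counts <- matrix(
  c(10, 20,
    15, 15,
    10, 15),
  nrow = 3, byrow = TRUE,
  dimnames = list(race = c("black", "white", "other"),
                  gender = c("male", "female"))
)

# margin = 1 divides each cell by its row total; margin = 2 by its column total.
row_pct <- prop.table(counts, margin = 1) * 100
col_pct <- prop.table(counts, margin = 2) * 100

row_pct["other", "male"]    # 10 / (10 + 15) = 40
col_pct["other", "female"]  # 15 / (20 + 15 + 15) = 30
```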
I have been Googling and testing things trying to figure out how to do this with the "survey" package, and so far I have found the svytable() function and can get it to give me a basic cross-tab of counts (eg. race by gender) but not survey-weighted percents. I have also found the svymean() and svytotal() functions, but so far all I've managed to do is get univariate weighted percents from svymean() (which appears to dummy-code each category as 0/1 then take a mean), and to combine svymean with the interaction function (eg. svymean(~interaction(race,gender),...)) to get cell percents (eg. "black males are XX% of the total sample"), but I still can't get within-row and within-column percents.
How do I get the "survey" package to give me survey-adjusted column and row percents for a cross-tab of two variables?
You didn't provide any sample data, so I'll use the built-in datasets of the survey package:
library(survey)
data(api)
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
svyby(~awards, by = ~stype, design=dclus1, FUN=svymean)
  stype  awardsNo awardsYes se.awardsNo se.awardsYes
E     E 0.2291667 0.7708333  0.02904587   0.02904587
H     H 0.5714286 0.4285714  0.14564997   0.14564997
M     M 0.4800000 0.5200000  0.11663553   0.11663553
These are row percentages, that is, the estimated proportions of each award category (yes / no) within each of the three school types, followed by their standard errors. We see that an estimated 77.1% of elementary schools in the whole state of California were eligible for an awards program.
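Column percentages can be sketched the same way by swapping the roles of the two variables in svyby(). If only point estimates are needed, one possible shortcut is to apply prop.table() to the weighted counts from svytable(), with the caveat that this route gives no standard errors. The object names below are illustrative.

```r
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Column percentages with standard errors:
# the distribution of school type within each awards category.
col_pct <- svyby(~stype, by = ~awards, design = dclus1, FUN = svymean)
print(col_pct)

# Point estimates only: weighted cross-tab, then proportions by margin.
weighted_counts <- svytable(~stype + awards, design = dclus1)
row_prop <- prop.table(weighted_counts, margin = 1)  # within school type
col_prop <- prop.table(weighted_counts, margin = 2)  # within awards category
print(round(row_prop, 3))
print(round(col_prop, 3))
```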
I am trying to join two Spark data frames so that the n-grams generated from the article descriptions are matched against a list of keywords.
list of articles(df_sparklyr):
id description
1 In order to investigate the role of calcium pathway in myeloid differentiation, the expression level of genes related to calcium pathway in all trans retinoic acid (ATRA) induced NB4 cell differentiation was detected by cDNA microarray, some of which were further confirmed by quantitative real time RT PCR. At the same time, the expressions of these genes in NB4 R1 cells treated with ATRA and 8 CPT cAM P alone or in combination, and in differentiation of primary cells from ATRA induced newly diagnosed APL patients were detected by real time RT PCR. The results showed that during differentiation of ATRA induced NB4 cells, the expressions of genes related to calcium concentration had changed, the expression of downstream effectors in calcium pathway was up regulated and confirmed by real time RT PCR assay. The expression of genes related to calcium concentration did not change significantly when NB4 R1 cells were treated by ATRA or 8 CPT cAMP alone, but expression changes of those genes were similar to the changes in ATRA induced NB4 cell differentiation when NB4 R1 cells were treated by ATRA combined with 8 CPT cAMP. In addition, the expression changes of those genes in ATRA induced primary cells of patients with APL were also similar to changes in ATRA induced NB4 cell differentiation. It is concluded that calcium pathway may be involved in ATRA induced differentiation in APL cell.
2 This study was aimed to investigate the inhibitory effect of flavonoids of puerarin (PR) in different concentrations on proliferation of 4 kinds of acute myeloid leukemia (AML) cell lines (Kasumi 1, HL 60, NB4 and U937), and to explore its possible mechanism. The MTT method was used to detected the inhibitory effect of PR on proliferation of AML cell lines. The flow cytometry was adopted to determine the change of cell cycle in vitro. The results showed that a certain concentration of PR could inhibit the proliferation of these 4 cell lines effectively in time and dose dependent manners, and the intensity of inhibition on 4 kinds of AML cell lines was from high to low as follows: NB4>Kasumi 1>U937>HL 60. Meanwhile, PR could also change cycle process, cell proportion in G1 G0 phase decreased, cells in S phase increased and Sub diploid peak also appeared. It is concluded that PR can selectively inhibit the proliferation of 4 AML cell lines and block cell cycle process, especially for NB4 cells.
3 This study was aimed to investigate the effects of flavonoids of puerarin (PR) on apoptosis of acute promyelocytic leukemia (APL) cell line NB4 cells and its mechanism. The NB4 were treated with PR in vitro, the MTT assay was used to detect the inhibitory effect of PR on cell proliferation. The apoptosis of NB4 cells were detected by flow cytometry labelled with Annexin V PI. The expressions of pml rar alpha, bcl 2 and survivin were detected by real time reverse transcription polymerase chain reaction (real time RT PCR), the expressions of JNK, p38 MAPK, FasL, caspase 3, caspase 8 were detected by Western blot. The results showed that with the increasing of PR concentrations, the apoptosis rates of NB4 cells were gradually elevated. Simultaneously, the mRNA expression of pml rar alpha, bcl 2 and survivin decreased, while the protein expression of JNK, FasL, caspase 3 and caspase 8 increased, which presented the positive correlation to PR concentrations. When PR combined with arsenic trioxide (ATO), the expression levels of above mentioned mRNA and protein decreased or increased more significantly. It is concluded that PR can effectively induce the apoptosis of NB4 cells. PR combined with ATO displays synergistic effect. It may be triggered by the activation of JNK signal pathway.
list of keywords(dict_tbl):
[1] "3 M SYNDROME"
[2] "3-M SYNDROME"
[3] "3-M SYNDROME 1"
[4] "3M SYNDROME"
[5] "DOLICHOSPONDYLIC DYSPLASIA"
[6] "GLOOMY FACE SYNDROME"
[7] "LE MERRER SYNDROME"
[8] "THREE M SYNDROME"
[9] "YAKUT SHORT STATURE SYNDROME"
[10] "ABDOMINAL AORTIC ANEURYSM"
[11] "ANEURYSM ABDOMINAL AORTIC"
[12] "AORTIC ANEURYSM ABDOMINAL"
[13] "AORTIC ANEURYSM FAMILIAL ABDOMINAL 1"
[14] "ABSENCE EPILEPSY"
[15] "ABSENCE SEIZURE"
[16] "CHILDHOOD ABSENCE EPILEPSY"
[17] "JUVENILE ABSENCE EPILEPSY"
[18] "PETIT MAL SEIZURE"
[19] "PYKNOLEPSY"
[20] "ACANTHAMOEBA INFECTION"
[21] "ACANTHAMOEBA INFECTIONS"
[22] "ACANTHAMOEBA KERATITIS"
[23] "ACCOMMODATIVE SPASM"
Using the following code:
s_2 = df_sparklyr %>%
ft_tokenizer("description", "words")%>%
ft_ngram(input_col = "words", output_col = "ngrams")%>%
semi_join(y = dict_tbl, by = c("ngrams" = "Keywords"))
I am getting the following error:
Error: org.apache.spark.sql.AnalysisException: cannot resolve '(outer() = RHS.Keywords)' due to data type mismatch: differing types in '(outer() = RHS.Keywords)' (array and string).;
It appears that you are missing a few things:
1. the parameter n, which dictates how many tokens to use for each n-gram
2. the function explode, which turns the list of n-grams in each row into one n-gram per row
3. for the join, it is much easier to rename the column you are joining on so that it has the same name in both tables
Here is the detailed approach; I hope it helps.
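Step 1: generate the Spark data frame
library(sparklyr)
library(dplyr)
library(tibble)  # for rownames_to_column()
sc <- spark_connect(master = "local")  # assumption: a local Spark connection; adjust as needed
my_text <-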
'In order to investigate the role of calcium pathway in myeloid differentiation, the expression level of genes related to calcium pathway in all trans retinoic acid (ATRA) induced NB4 cell differentiation was detected by cDNA microarray, some of which were further confirmed by quantitative real time RT PCR. At the same time, the expressions of these genes in NB4 R1 cells treated with ATRA and 8 CPT cAM P alone or in combination, and in differentiation of primary cells from ATRA induced newly diagnosed APL patients were detected by real time RT PCR. The results showed that during differentiation of ATRA induced NB4 cells, the expressions of genes related to calcium concentration had changed, the expression of downstream effectors in calcium pathway was up regulated and confirmed by real time RT PCR assay. The expression of genes related to calcium concentration did not change significantly when NB4 R1 cells were treated by ATRA or 8 CPT cAMP alone, but expression changes of those genes were similar to the changes in ATRA induced NB4 cell differentiation when NB4 R1 cells were treated by ATRA combined with 8 CPT cAMP. In addition, the expression changes of those genes in ATRA induced primary cells of patients with APL were also similar to changes in ATRA induced NB4 cell differentiation. It is concluded that calcium pathway may be involved in ATRA induced differentiation in APL cell.
This study was aimed to investigate the inhibitory effect of flavonoids of puerarin (PR) in different concentrations on proliferation of 4 kinds of acute myeloid leukemia (AML) cell lines (Kasumi 1, HL 60, NB4 and U937), and to explore its possible mechanism. The MTT method was used to detected the inhibitory effect of PR on proliferation of AML cell lines. The flow cytometry was adopted to determine the change of cell cycle in vitro. The results showed that a certain concentration of PR could inhibit the proliferation of these 4 cell lines effectively in time and dose dependent manners, and the intensity of inhibition on 4 kinds of AML cell lines was from high to low as follows: NB4>Kasumi 1>U937>HL 60. Meanwhile, PR could also change cycle process, cell proportion in G1 G0 phase decreased, cells in S phase increased and Sub diploid peak also appeared. It is concluded that PR can selectively inhibit the proliferation of 4 AML cell lines and block cell cycle process, especially for NB4 cells.
This study was aimed to investigate the effects of flavonoids of puerarin (PR) on apoptosis of acute promyelocytic leukemia (APL) cell line NB4 cells and its mechanism. The NB4 were treated with PR in vitro, the MTT assay was used to detect the inhibitory effect of PR on cell proliferation. The apoptosis of NB4 cells were detected by flow cytometry labelled with Annexin V PI. The expressions of pml rar alpha, bcl 2 and survivin were detected by real time reverse transcription polymerase chain reaction (real time RT PCR), the expressions of JNK, p38 MAPK, FasL, caspase 3, caspase 8 were detected by Western blot. The results showed that with the increasing of PR concentrations, the apoptosis rates of NB4 cells were gradually elevated. Simultaneously, the mRNA expression of pml rar alpha, bcl 2 and survivin decreased, while the protein expression of JNK, FasL, caspase 3 and caspase 8 increased, which presented the positive correlation to PR concentrations. When PR combined with arsenic trioxide (ATO), the expression levels of above mentioned mRNA and protein decreased or increased more significantly. It is concluded that PR can effectively induce the apoptosis of NB4 cells. PR combined with ATO displays synergistic effect. It may be triggered by the activation of JNK signal pathway.'
my_col = my_text %>% strsplit(split = '\n') %>% unlist
my_df <-
as.data.frame(my_col, stringsAsFactors = FALSE) %>% as_tibble() %>%
rownames_to_column('id') %>%
rename(description = my_col)
my_spark_df <- my_df %>% copy_to(sc, ., 'my_spark_df')
Step 2: generate the list of keywords
key_words <- c(
"3-M SYNDROME"
,"3-M SYNDROME 1"
,"3M SYNDROME"
,"DOLICHOSPONDYLIC DYSPLASIA"
,"GLOOMY FACE SYNDROME"
,"LE MERRER SYNDROME"
,"THREE M SYNDROME"
,"YAKUT SHORT STATURE SYNDROME"
,"ABDOMINAL AORTIC ANEURYSM"
,"ANEURYSM ABDOMINAL AORTIC"
,"AORTIC ANEURYSM ABDOMINAL"
,"AORTIC ANEURYSM FAMILIAL ABDOMINAL 1"
,"ABSENCE EPILEPSY"
,"ABSENCE SEIZURE"
,"CHILDHOOD ABSENCE EPILEPSY"
,"JUVENILE ABSENCE EPILEPSY"
,"PETIT MAL SEIZURE"
,"PYKNOLEPSY"
,"ACANTHAMOEBA INFECTION"
,"ACANTHAMOEBA INFECTIONS"
,"ACANTHAMOEBA KERATITIS"
,"ACCOMMODATIVE SPASM")
key_words_spark_df <-
as.data.frame(key_words, stringsAsFactors = FALSE) %>% as_tibble() %>%
mutate(key_words = tolower(key_words)) %>%
copy_to(sc, ., 'keywords_spark')
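Step 3: the join
my_spark_df %>%
# split each description into words (ft_tokenizer lowercases its input,
# which is why the keywords were lowercased in step 2)
ft_tokenizer("description", "words") %>%
# build two-word n-grams; with n = 2 only two-word keywords can match
ft_ngram(input_col = "words", output_col = "ngrams", n = 2) %>%
# one row per n-gram instead of one array per description
mutate(ngrams = explode(ngrams)) %>%
select(id, ngrams) %>%
# give the join column the same name on both sides
rename(key_words = ngrams) %>%
inner_join(key_words_spark_df, by = "key_words")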
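To see what the tokenise, n-gram, explode and join steps do without a Spark cluster, the logic can be sketched in plain R. This is only an illustration of the idea, not what sparklyr runs internally: make_ngrams is a hypothetical helper and the two short descriptions are made-up sentences.

```r
# Hypothetical helper: lowercase, split on whitespace, build n-word n-grams.
make_ngrams <- function(text, n = 2) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up descriptions keyed by id, and lowercased keywords.
docs <- c("1" = "Patient presented with absence epilepsy and fever",
          "2" = "No relevant findings were reported")
keywords <- tolower(c("ABSENCE EPILEPSY", "PETIT MAL SEIZURE"))

# One row per (id, matching n-gram), mirroring explode + inner_join.
matches <- do.call(rbind, lapply(names(docs), function(id) {
  hits <- intersect(make_ngrams(docs[[id]], n = 2), keywords)
  data.frame(id = rep(id, length(hits)), key_words = hits,
             stringsAsFactors = FALSE)
}))
print(matches)
```

Because the n-grams here have exactly two words, the three-word keyword can never match; matching keywords of several lengths would need one pass per value of n.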