Match error when using semi_join in sparklyr - r

I am trying to join two Spark data frames so that the n-grams generated from one are matched against a list of keywords in the other.
List of articles (df_sparklyr):
id description
1 In order to investigate the role of calcium pathway in myeloid differentiation, the expression level of genes related to calcium pathway in all trans retinoic acid (ATRA) induced NB4 cell differentiation was detected by cDNA microarray, some of which were further confirmed by quantitative real time RT PCR. At the same time, the expressions of these genes in NB4 R1 cells treated with ATRA and 8 CPT cAM P alone or in combination, and in differentiation of primary cells from ATRA induced newly diagnosed APL patients were detected by real time RT PCR. The results showed that during differentiation of ATRA induced NB4 cells, the expressions of genes related to calcium concentration had changed, the expression of downstream effectors in calcium pathway was up regulated and confirmed by real time RT PCR assay. The expression of genes related to calcium concentration did not change significantly when NB4 R1 cells were treated by ATRA or 8 CPT cAMP alone, but expression changes of those genes were similar to the changes in ATRA induced NB4 cell differentiation when NB4 R1 cells were treated by ATRA combined with 8 CPT cAMP. In addition, the expression changes of those genes in ATRA induced primary cells of patients with APL were also similar to changes in ATRA induced NB4 cell differentiation. It is concluded that calcium pathway may be involved in ATRA induced differentiation in APL cell.
2 This study was aimed to investigate the inhibitory effect of flavonoids of puerarin (PR) in different concentrations on proliferation of 4 kinds of acute myeloid leukemia (AML) cell lines (Kasumi 1, HL 60, NB4 and U937), and to explore its possible mechanism. The MTT method was used to detected the inhibitory effect of PR on proliferation of AML cell lines. The flow cytometry was adopted to determine the change of cell cycle in vitro. The results showed that a certain concentration of PR could inhibit the proliferation of these 4 cell lines effectively in time and dose dependent manners, and the intensity of inhibition on 4 kinds of AML cell lines was from high to low as follows: NB4>Kasumi 1>U937>HL 60. Meanwhile, PR could also change cycle process, cell proportion in G1 G0 phase decreased, cells in S phase increased and Sub diploid peak also appeared. It is concluded that PR can selectively inhibit the proliferation of 4 AML cell lines and block cell cycle process, especially for NB4 cells.
3 This study was aimed to investigate the effects of flavonoids of puerarin (PR) on apoptosis of acute promyelocytic leukemia (APL) cell line NB4 cells and its mechanism. The NB4 were treated with PR in vitro, the MTT assay was used to detect the inhibitory effect of PR on cell proliferation. The apoptosis of NB4 cells were detected by flow cytometry labelled with Annexin V PI. The expressions of pml rar alpha, bcl 2 and survivin were detected by real time reverse transcription polymerase chain reaction (real time RT PCR), the expressions of JNK, p38 MAPK, FasL, caspase 3, caspase 8 were detected by Western blot. The results showed that with the increasing of PR concentrations, the apoptosis rates of NB4 cells were gradually elevated. Simultaneously, the mRNA expression of pml rar alpha, bcl 2 and survivin decreased, while the protein expression of JNK, FasL, caspase 3 and caspase 8 increased, which presented the positive correlation to PR concentrations. When PR combined with arsenic trioxide (ATO), the expression levels of above mentioned mRNA and protein decreased or increased more significantly. It is concluded that PR can effectively induce the apoptosis of NB4 cells. PR combined with ATO displays synergistic effect. It may be triggered by the activation of JNK signal pathway.
List of keywords (dict_tbl):
[1] "3 M SYNDROME"
[2] "3-M SYNDROME"
[3] "3-M SYNDROME 1"
[4] "3M SYNDROME"
[5] "DOLICHOSPONDYLIC DYSPLASIA"
[6] "GLOOMY FACE SYNDROME"
[7] "LE MERRER SYNDROME"
[8] "THREE M SYNDROME"
[9] "YAKUT SHORT STATURE SYNDROME"
[10] "ABDOMINAL AORTIC ANEURYSM"
[11] "ANEURYSM ABDOMINAL AORTIC"
[12] "AORTIC ANEURYSM ABDOMINAL"
[13] "AORTIC ANEURYSM FAMILIAL ABDOMINAL 1"
[14] "ABSENCE EPILEPSY"
[15] "ABSENCE SEIZURE"
[16] "CHILDHOOD ABSENCE EPILEPSY"
[17] "JUVENILE ABSENCE EPILEPSY"
[18] "PETIT MAL SEIZURE"
[19] "PYKNOLEPSY"
[20] "ACANTHAMOEBA INFECTION"
[21] "ACANTHAMOEBA INFECTIONS"
[22] "ACANTHAMOEBA KERATITIS"
[23] "ACCOMMODATIVE SPASM"
Using the following code:
s_2 = df_sparklyr %>%
ft_tokenizer("description", "words")%>%
ft_ngram(input_col = "words", output_col = "ngrams")%>%
semi_join(y = dict_tbl, by = c("ngrams" = "Keywords"))
I am getting the following error:
Error: org.apache.spark.sql.AnalysisException: cannot resolve '(outer() = RHS.Keywords)' due to data type mismatch: differing types in '(outer() = RHS.Keywords)' (array and string).;

It appears that you are missing a few things:
1. the parameter n of ft_ngram, which sets how many tokens go into each n-gram (it defaults to 2)
2. the function explode, which turns the array of n-grams in each row into one n-gram per row; this is the "array and string" type mismatch the error reports
3. for the join, it is much easier to rename the column you are joining on so that both tables share its name
Here is the detailed approach; I hope it helps.
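Step 1: generate the Spark data frame
library(sparklyr)
library(dplyr)
library(tibble)  # for rownames_to_column()
# Assumption: a local Spark connection; substitute your own cluster settings
sc <- spark_connect(master = "local")
my_text =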
'In order to investigate the role of calcium pathway in myeloid differentiation, the expression level of genes related to calcium pathway in all trans retinoic acid (ATRA) induced NB4 cell differentiation was detected by cDNA microarray, some of which were further confirmed by quantitative real time RT PCR. At the same time, the expressions of these genes in NB4 R1 cells treated with ATRA and 8 CPT cAM P alone or in combination, and in differentiation of primary cells from ATRA induced newly diagnosed APL patients were detected by real time RT PCR. The results showed that during differentiation of ATRA induced NB4 cells, the expressions of genes related to calcium concentration had changed, the expression of downstream effectors in calcium pathway was up regulated and confirmed by real time RT PCR assay. The expression of genes related to calcium concentration did not change significantly when NB4 R1 cells were treated by ATRA or 8 CPT cAMP alone, but expression changes of those genes were similar to the changes in ATRA induced NB4 cell differentiation when NB4 R1 cells were treated by ATRA combined with 8 CPT cAMP. In addition, the expression changes of those genes in ATRA induced primary cells of patients with APL were also similar to changes in ATRA induced NB4 cell differentiation. It is concluded that calcium pathway may be involved in ATRA induced differentiation in APL cell.
This study was aimed to investigate the inhibitory effect of flavonoids of puerarin (PR) in different concentrations on proliferation of 4 kinds of acute myeloid leukemia (AML) cell lines (Kasumi 1, HL 60, NB4 and U937), and to explore its possible mechanism. The MTT method was used to detected the inhibitory effect of PR on proliferation of AML cell lines. The flow cytometry was adopted to determine the change of cell cycle in vitro. The results showed that a certain concentration of PR could inhibit the proliferation of these 4 cell lines effectively in time and dose dependent manners, and the intensity of inhibition on 4 kinds of AML cell lines was from high to low as follows: NB4>Kasumi 1>U937>HL 60. Meanwhile, PR could also change cycle process, cell proportion in G1 G0 phase decreased, cells in S phase increased and Sub diploid peak also appeared. It is concluded that PR can selectively inhibit the proliferation of 4 AML cell lines and block cell cycle process, especially for NB4 cells.
This study was aimed to investigate the effects of flavonoids of puerarin (PR) on apoptosis of acute promyelocytic leukemia (APL) cell line NB4 cells and its mechanism. The NB4 were treated with PR in vitro, the MTT assay was used to detect the inhibitory effect of PR on cell proliferation. The apoptosis of NB4 cells were detected by flow cytometry labelled with Annexin V PI. The expressions of pml rar alpha, bcl 2 and survivin were detected by real time reverse transcription polymerase chain reaction (real time RT PCR), the expressions of JNK, p38 MAPK, FasL, caspase 3, caspase 8 were detected by Western blot. The results showed that with the increasing of PR concentrations, the apoptosis rates of NB4 cells were gradually elevated. Simultaneously, the mRNA expression of pml rar alpha, bcl 2 and survivin decreased, while the protein expression of JNK, FasL, caspase 3 and caspase 8 increased, which presented the positive correlation to PR concentrations. When PR combined with arsenic trioxide (ATO), the expression levels of above mentioned mRNA and protein decreased or increased more significantly. It is concluded that PR can effectively induce the apoptosis of NB4 cells. PR combined with ATO displays synergistic effect. It may be triggered by the activation of JNK signal pathway.'
my_col = my_text %>% strsplit(split = '\n') %>% unlist
my_df <-
as.data.frame(my_col, stringsAsFactors = FALSE) %>% as_tibble() %>%
rownames_to_column('id') %>%
rename(description = my_col)
my_spark_df <- my_df %>% copy_to(sc, ., 'my_spark_df')
Step 2: generate the list of keywords
key_words <- c(
"3-M SYNDROME"
,"3-M SYNDROME 1"
,"3M SYNDROME"
,"DOLICHOSPONDYLIC DYSPLASIA"
,"GLOOMY FACE SYNDROME"
,"LE MERRER SYNDROME"
,"THREE M SYNDROME"
,"YAKUT SHORT STATURE SYNDROME"
,"ABDOMINAL AORTIC ANEURYSM"
,"ANEURYSM ABDOMINAL AORTIC"
,"AORTIC ANEURYSM ABDOMINAL"
,"AORTIC ANEURYSM FAMILIAL ABDOMINAL 1"
,"ABSENCE EPILEPSY"
,"ABSENCE SEIZURE"
,"CHILDHOOD ABSENCE EPILEPSY"
,"JUVENILE ABSENCE EPILEPSY"
,"PETIT MAL SEIZURE"
,"PYKNOLEPSY"
,"ACANTHAMOEBA INFECTION"
,"ACANTHAMOEBA INFECTIONS"
,"ACANTHAMOEBA KERATITIS"
,"ACCOMMODATIVE SPASM")
key_words_spark_df <-
as.data.frame(key_words, stringsAsFactors = FALSE) %>% as_tibble() %>%
# ft_tokenizer lower-cases its output, so lower-case the keywords to match
mutate(key_words = tolower(key_words)) %>%
copy_to(sc, ., 'keywords_spark')
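Step 3: the join
my_spark_df %>%
ft_tokenizer("description", "words") %>%
# n = 2 yields bigrams, so only two-word keywords can match
ft_ngram(input_col = "words", output_col = "ngrams", n = 2) %>%
# one row per n-gram instead of one array per document
mutate(ngrams = explode(ngrams)) %>%
select(id, ngrams) %>%
rename(key_words = ngrams) %>%
inner_join(key_words_spark_df, by = "key_words")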
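The tokenise, n-gram, explode and join steps can also be sketched locally in base R, which makes it easier to see why each row must hold a single n-gram before the join. This is only an illustration under stated assumptions, not sparklyr code: make_ngrams, docs and keywords are hypothetical names, and the two sentences are made-up test data. Unlike the Spark pipeline with n = 2, it builds n-grams for every length that occurs in the keyword list.

```r
# Local base-R analogue of tokenise -> n-grams -> explode -> join (no Spark needed).
# make_ngrams, docs and keywords are hypothetical names used for illustration.
make_ngrams <- function(text, n) {
  # Lower-case and split on whitespace, similar to what ft_tokenizer() does
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

docs <- data.frame(
  id = 1:2,
  description = c("Childhood absence epilepsy was studied",
                  "No keyword appears in this sentence"),
  stringsAsFactors = FALSE
)
keywords <- data.frame(
  key_words = tolower(c("ABSENCE EPILEPSY",
                        "CHILDHOOD ABSENCE EPILEPSY",
                        "PYKNOLEPSY")),
  stringsAsFactors = FALSE
)

# Every n-gram length that occurs in the keyword list
lengths_needed <- unique(lengths(strsplit(keywords$key_words, " ")))

# "Explode": one row per (id, n-gram) pair instead of one vector per document
exploded <- do.call(rbind, lapply(seq_len(nrow(docs)), function(r) {
  grams <- unlist(lapply(lengths_needed,
                         function(n) make_ngrams(docs$description[r], n)))
  data.frame(id = rep(docs$id[r], length(grams)),
             key_words = grams,
             stringsAsFactors = FALSE)
}))

# Plain equality join on the shared column name
matches <- merge(exploded, keywords, by = "key_words")
print(matches)
```

In Spark the same idea would mean running ft_ngram once per required n and combining the exploded results, since a single call produces n-grams of one length only.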

Related

automatically testing a list of variables for a ks.test within code

I am running ks.tests on a long list of variables (about 15) in univariate and multivariate analyses. The goal of my analysis is to assess memory performance in different high- and low-risk categories for memory impairment. For example, "age" would be one variable, and in an analysis with a single variable I would compare the memory scores of young (low risk) vs. old (high risk) participants. When looking at two variables I might combine age and education, so that would be young, well-educated (low risk) vs. old, low-education (high risk) participants, and so on.
I have been plugging each variable into the same code. The single-variable analysis involved 15 copies of the same chunk of code with only the variable name changed. However, when layering two or more factors, the number of code chunks started to grow exponentially. For example, with the two-factor analysis I had to do age x all 15 other variables, sex x all 15 other variables, education x all 15 other variables, and so on. Is there a way for the chunk of code to run over each variable automatically, without me having to copy and paste it and change the variable name?
newdataLOGOS$"risk factor"<- ifelse(newdataLOGOS$"risk factor"==0,1,
ifelse(newdataLOGOS$"risk factor"==1,2,0))
newdataLOGOSx"risk factor"lowrisk <- subset(newdataLOGOS, "risk factor"==1,
select = c("id","risk factor","memory score"))
newdataLOGOSx"risk factor"highrisk <- subset(newdataLOGOS, "risk factor"==2,
select = c("id","risk factor", "memory score"))
ks.test(newdataLOGOSx"risk factor"lowrisk$"memory score",
newdataLOGOSx"risk factor"highrisk$"memory score")
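One way to avoid the copy and paste is to keep the variable names in a character vector and loop over them, using combn() for the two-factor combinations. The sketch below is only an illustration: dat, memory_score and the *_risk columns are hypothetical, the data are simulated, and the risk indicators are assumed to be coded 0 (low risk) and 1 (high risk) rather than the 1/2 recoding used above.

```r
# Hypothetical data: one memory score plus 0/1 risk indicators (0 = low, 1 = high)
set.seed(1)
dat <- data.frame(
  memory_score = rnorm(200),
  age_risk = rbinom(200, 1, 0.5),
  sex_risk = rbinom(200, 1, 0.5),
  edu_risk = rbinom(200, 1, 0.5)
)
risk_vars <- c("age_risk", "sex_risk", "edu_risk")

# Single-factor analysis: low vs high risk for each variable in turn
single <- sapply(risk_vars, function(v) {
  ks.test(dat$memory_score[dat[[v]] == 0],
          dat$memory_score[dat[[v]] == 1])$p.value
})

# Two-factor analysis: every pair of variables, low on both vs high on both
pairs <- combn(risk_vars, 2, simplify = FALSE)
double <- sapply(pairs, function(p) {
  low  <- dat[[p[1]]] == 0 & dat[[p[2]]] == 0
  high <- dat[[p[1]]] == 1 & dat[[p[2]]] == 1
  ks.test(dat$memory_score[low], dat$memory_score[high])$p.value
})
names(double) <- sapply(pairs, paste, collapse = " x ")

print(single)
print(double)
```

With 15 variables, combn() produces 105 pairs, so some correction for multiple testing is probably worth considering.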

How do I work out reduction percentage for a group of drenches?

I have data from a sheep producer who has tried several different drenches (treatment used for worm control) on his flock of sheep in order to work out the individual efficacy of each drug. The producer took faecal samples at Day 0 which represented the number of eggs in each animal before treatment and Day 14 samples which represented the number of eggs after treatment. The Column EPG means Eggs Per Gram of faeces and these values are what I need to use when determining the reduction percentage (i.e. compare post treatments with the pre-control and control groups). From the image, I need to get a reduction percentage from BZ and LEV (post counts) using the PRE data (pre-counts)
I have tried the following:
model <- fecr_stan(data1$before, data1$after, rawCounts=FALSE, preCF=50,
paired=TRUE, indEfficacy=TRUE)
But I get the error message:
Error in if (mean(preFEC) < mean(postFEC)) warning("mean of
pre-treatment is smaller of post-treatment. Results may be unreliable
if default priors are used.\n") : missing value where TRUE/FALSE
needed
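One likely cause of "missing value where TRUE/FALSE needed" is that mean(preFEC) or mean(postFEC) evaluates to NA, for example because the count vectors contain missing values or the referenced columns do not exist. The base-R sketch below reproduces that situation with hypothetical counts (before and after are made-up vectors, not the producer's data) and computes a simple percentage reduction in mean EPG. It is not the Bayesian model that fecr_stan fits.

```r
# Hypothetical paired counts in eggs per gram; NA marks a missing sample
before <- c(400, 650, NA, 300, 500)
after  <- c( 50,   0, 100, NA,  50)

# A comparison involving NA is itself NA, which if() cannot evaluate
comparison_is_na <- is.na(mean(before) < mean(after))

# Keep only the animals that have both a pre- and a post-treatment count
ok   <- complete.cases(before, after)
pre  <- before[ok]
post <- after[ok]

# Simple percentage reduction in arithmetic mean EPG
reduction_pct <- 100 * (1 - mean(post) / mean(pre))
print(reduction_pct)
```

If the data are in long format, with one row per sample and a column identifying the PRE, BZ and LEV groups, the before and after vectors would first have to be extracted per treatment group.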

I'm having problems performing sentiment analysis on R using a JSON file

I have the following JSON file and I want to do a sentiment analysis on the "descriptions" keys in it. I'm using this tutorial for help. Below is the code I have so far:
library(jsonlite) #formats the json file into an appropriate R structure
formatdata<- fromJSON("the json file", flatten = TRUE)
View(formatdata)
text <- c(formatdata)
library(dplyr)
text_df<-data_frame(line =1:20, text=text)
library(tidytext)
text_df %>%
unnest_tokens(word, text)
A portion of the "formatdata" variable looks like this :
[1] "The complex interplay between cancer cells, stromal cells and immune cells in the tumor microenvironment (TME) regulates tumorigenesis and provides emerging targets for immunotherapies. Crosstalk between CD4(+) T cells and proliferating chronic lymphocytic leukemia (CLL) tumor B cells occurs within lymphoid tissue pseudo-follicles and investigating these interactions is essential to understand both disease pathogenesis and the effects of immunotherapy. Tumor-derived extracellular vesicle (EV) shedding is emerging as an important mode of intercellular communication in the TME. In order to characterize tumor EVs released in response to T cell-derived TME signals, we performed microRNA (miR) profiling of EVs released from CLL cells stimulated with CD40 and IL-4. Our results reveal an enrichment of specific cellular miRNAs including miR-363 within EVs derived from CD40/IL-4 stimulated CLL cells compared to parental cell miRNA content and control EVs from unstimulated CLL cells. We demonstrate that autologous patient CD4(+) T cells internalize CLL-EVs containing miR-363 that targets the immunomodulatory molecule CD69. We further reveal that autologous CD4(+) T cells that are exposed to EVs from CD40/IL-4 stimulated CLL cells exhibit enhanced migration, immunological synapse signaling and interactions with tumor cells. Knockdown of miR-363 in CLL cells prior to CD40/IL-4 stimulation, prevented the ability of CLL-EVs to induce increased synapse signaling and confer altered functional properties to CD4(+) T cells. Taken together, these data reveal a novel role for CLL-EVs in modifying T cell function that highlights unanticipated complexity of intercellular communication that may have implications for bidirectional CD4(+) T-cell:tumor interactions within the TME.Peer-reviewedPost-prin"
[2] "Objective: Gut homing of lymphocytes via adhesion molecules has recently emerged as new target for therapy in inflammatory bowel diseases. We aimed to analyze the in vivo homing of effector (Teff) and regulatory (Treg) T cells to the inflamed gut via α4β7 and GPR15. Design: We assessed the expression of homing receptors on T cells in peripheral blood and inflamed mucosa. We studied the migration pattern and homing of Teff and Treg cells to the inflamed gut using intravital confocal microscopy and FACS in a humanized mouse model in DSS-treated NSG (NOD.Cg-Prkdcscid-Il2rgtm1Wjl/SzJ) mice. Results: Expression of GPR15 and α4β7 was significantly increased on Treg rather than Teff cells in peripheral blood of patients with ulcerative colitis (UC) as compared to Crohn´s disease and controls. In vivo analysis in a humanized mouse model showed augmented gut homing of UC Treg cells as compared to controls. Moreover, suppression of UC (but not control) Teff and Treg cell homing was noted upon treatment with the α4β7 antibody vedolizumab. In contrast, siRNA blockade of GPR15 had only effects on homing of Teff cells but did not affect Treg homing in UC. Clinical vedolizumab treatment was associated with marked expansion of UC Treg cells in peripheral blood. Conclusion: α4β7 rather than GPR15 is crucial for increased colonic homing of UC Treg cells in vivo, while both receptors control UC Teff homing. Vedolizumab treatment impairs homing of UC Treg cells leading to their accumulation in peripheral blood with subsequent suppression of systemic effector T cell expansion"
[3] "This work was financially supported by the “Deutsches Zentrum für Luft- und Raumfahrt e.V.” (DLR) under the project “Solarzellenkonzepte für Raumfahrtgeneratoren der nächsten Generation” (SoNG, Contract No. 50RN1301). S.H. gratefully acknowledges the support by the Royal Society and the Wolfson Foundation.Dilute nitride GaInN(Sb)As with a band gap (Eg) of 1.0 eV is a promising material for the integration in next generation multijunction solar cells. We have investigated the effect of a compositionally graded GaInNAs absorber layer on the spectral response of a GaInNAs sub cell. We produced band gap gradings (ΔEg) of up to 39 meV across a 1 μm thick GaInNAs layer. Thereby, the external quantum efficiency—compared to reference cells—was increased due to the improved extraction of photo-generated carriers from 34.0% to 36.7% for the wavelength range from 900 nm to 1150 nm. However, this device figure improvement is accompanied by a small decrease in the open circuit voltage of about 20 mV and the shift of the absorption edge to shorter wavelengths.Publisher PDFPeer reviewe"
What I want to do is using the "nrc" or "bing" lexicon to do sentiment analysis. But when I try using the code below, this error appears:
text_df %>%
+ inner_join(get_sentiments("bing"))
Error: `by` required, because the data sources have no common variables
Can anyone offer any suggestions?
Try combining the strings inside formatdata together and see if it works
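The error itself says that the two tables share no column name: text_df has the columns line and text, while the bing lexicon has word and sentiment. The result of unnest_tokens() is printed in the code above but never stored, so the join still runs on the untokenised table. A minimal sketch of the chained version follows, assuming the descriptions can be reduced to a plain character vector; the two sentences are hypothetical stand-ins for the JSON content.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the descriptions read from the JSON file
text <- c("This promising treatment showed a significant improvement.",
          "The device suffered a small decrease in voltage.")
text_df <- tibble(line = seq_along(text), text = text)

# Tokenise first, then join on the `word` column that both tables now share
sentiment_counts <- text_df %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(line, sentiment)

print(sentiment_counts)
```

tibble() is used here because data_frame() is deprecated in recent versions of dplyr and tibble.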

Plotted LDA Analysis in R - how to draw attention to/reclassify outliers?

I have some clinical data that I ran an LDA analysis on - 98 patients and their corresponding protein levels for a specific type of analysis. I plotted the output and have essentially 6 different "clusters" of individuals. Here's the problem - some of the clusters have outliers that are much closer to other clusters. We know that in some cases the medical diagnosis that the cluster is based on (phenotype) might be in error, and the outlier member of the green cluster that appears closer to the blue cluster might actually be part of the blue cluster. So my question is: is there a computational way for me to detect these individuals and re-evaluate whether or not they are actually members of the cluster they are closer to based on the data if we ignore the medical diagnosis?
Here's a sample of the output:

Kernel Ridge Regression in R (for Drug-Target Interaction)

It's very hard to find any information on implementing KRR, therefore any minor input will be truly highly appreciated.
I want to run Kernel Ridge Regression on a set of kernels I have computed, but I do not know how to do this in R. I found the constructKRRLearner function from the CVST package, but the manual is not clear at all, especially for me as a complete beginner in Machine Learning. The function needs an x and a y, but I have no idea what to input there, as I only have a data frame that holds the pairwise kernel computed as the Kronecker product between drugs and proteins.
How can I do a Kernel Ridge Regression task in R?
Ideally I also want to visualize my data points and then illustrate the regression line on the plot! For instance like this:
http://scikit-learn.org/stable/_images/plot_kernel_ridge_regression_0011.png
MORE INFO ON MY DATASET
I have a drug-target interactions (DTI) data set. The data set comprises 100 drug compounds (rows) and 100 protein kinase targets (columns). There are some NaNs (missing values) in this data set. Values in this data set reflect how tightly a compound binds to a target.
I have drugs' SMILES and CHEMBL IDs.
I have the protein's (targets) sequences and UNIPROT IDs.
For drugs [100 drugs]: I converted the drug SMILES to an SDFset, and then I computed the fingerprints for each drug using OpenBabel. Based on these fingerprints I computed Tanimoto kernels for all possible combinations between drugs (using the "fpSim" function), e.g. Drug 1 with Drug 2, 3, 4, ... 100, then Drug 2 with Drug 1, 3, 4, ... 100, and so on until Drug 99 with Drug 100. I named this BASE_DRUG_KERNELS
For proteins: I had the protein sequences, so I computed Smith-Waterman scores for all combination of protein pairs; e.g. Protein 1 with Protein 2, 3, ... 100, then Protein 2 with Protein 1, 3, 4, ... 100 and so on until Protein 99 with Protein 100. I named this BASE_PROTEIN_KERNELS
Then I computed the Kronecker product between BASE_DRUG_KERNELS and BASE_PROTEIN_KERNELS, which gave me a 10,000 x 10,000 matrix of 100,000,000 elements. I named this matrix KRONECKER_PRODUCTS
I wish to run Kernel Ridge Regression on the matrix KRONECKER_PRODUCTS.
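With a precomputed kernel, kernel ridge regression reduces to solving one linear system for the dual coefficients, alpha = (K + lambda * I)^(-1) y, and predicting with the kernel values between new and training pairs. The base-R sketch below shows this on toy data and is only an illustration under stated assumptions: the RBF kernels built from random features stand in for the Tanimoto and Smith-Waterman kernels, rbf_kernel and all variable names are hypothetical, lambda is picked arbitrarily, and constructKRRLearner from CVST is not used.

```r
set.seed(42)

# Hypothetical helper: RBF kernel matrix from a feature matrix (rows = items)
rbf_kernel <- function(X, sigma = 1) {
  d2 <- as.matrix(dist(X))^2
  exp(-d2 / (2 * sigma^2))
}

# Toy stand-ins for BASE_DRUG_KERNELS and BASE_PROTEIN_KERNELS
n_drug <- 4
n_prot <- 3
K_drug <- rbf_kernel(matrix(rnorm(n_drug * 2), n_drug))
K_prot <- rbf_kernel(matrix(rnorm(n_prot * 2), n_prot))

# Binding values: drugs in rows, proteins in columns, with two missing entries
Y <- matrix(rnorm(n_drug * n_prot), n_drug, n_prot)
Y[2, 3] <- NA
Y[4, 1] <- NA

# Pairwise kernel; row (i - 1) * n_prot + j corresponds to (drug i, protein j)
K <- kronecker(K_drug, K_prot)
y <- as.vector(t(Y))  # same pair ordering as the rows of K

obs <- !is.na(y)
lambda <- 0.1  # regularisation strength, chosen arbitrarily here

# Dual coefficients from the observed pairs: (K + lambda * I) alpha = y
alpha <- solve(K[obs, obs] + lambda * diag(sum(obs)), y[obs])

# Predicted binding values for the unobserved pairs
y_pred <- K[!obs, obs, drop = FALSE] %*% alpha
print(y_pred)
```

In practice lambda would be chosen by cross-validation. The cost of solve() grows cubically with the number of observed pairs, so the full 10,000 x 10,000 system will be considerably slower than this toy case.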
