Does SyntaxNet allow me to add a custom dictionary? - syntaxnet

What I want to do is add in a custom resource that tells SyntaxNet to combine two tokens into a single token. I'm processing biomedical data from NCBI and species are almost always written with their genus (so, genus + species). I need to preserve the genus + species format into a single token.
Examples:
Arthrobacter globiformis (genus = "Arthrobacter", species = "globiformis")
Desulfosporosinus meridiei (genus = "Desulfosporosinus", species = "meridiei")
E. coli (genus = "E.", species = "coli")
Is there a way to do this in SyntaxNet that does not include retraining?

I am afraid there is no easy (and principled) solution for your problem. You could try preprocessing your data before parsing it with SyntaxNet. More principled solutions would require code changes.
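As a rough illustration of the preprocessing route, the sketch below joins each known genus + species pair into one whitespace-free string before the text is handed to the parser. The helper name merge_species and the underscore joiner are assumptions made for illustration, and whether SyntaxNet's tokenizer then keeps the joined string as a single token would have to be checked on real output.

```r
# Hypothetical helper: replace the space inside each known "genus species"
# pair with an underscore so the pair contains no whitespace.
merge_species <- function(text, species) {
  for (sp in species) {
    joined <- gsub(" ", "_", sp, fixed = TRUE)
    text <- gsub(sp, joined, text, fixed = TRUE)
  }
  text
}

# Species names taken from the question
species <- c("Arthrobacter globiformis",
             "Desulfosporosinus meridiei",
             "E. coli")

merge_species("E. coli and Arthrobacter globiformis were cultured.", species)
```

After parsing, the underscores can be turned back into spaces with the same gsub() call in reverse.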

Related

R - [DESeq2] - Making DESeq Dataset object from csv of already normalized counts

I'm trying to use DESeq2's plotPCA function in a meta-analysis of data.
Most of the files I have received are raw counts, pre-normalization. I run DESeq2 to normalize them and then run plotPCA.
One of the files I received does not have raw counts or even the FASTQ files, just the data that has already been normalized by DESeq2.
How could I go about importing this data (non-integers) as a DESeqDataSet object after it has already been normalized?
Consensus in vignettes and other comments seems to be that objects can only be constructed from matrices of integers.
I was mostly concerned with getting the format the same between plots. Ultimately, I just used a workaround to get the plots looking the same via ggfortify.
If anyone is curious, this is what I ended up doing. Note that the "names" file is organized like the metadata file used as colData when building a DESeq object with DESeqDataSetFromMatrix, except that I renamed the design column from "conditions" to "group" so it would match the output of plotPCA. The result should look identical.
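library(ggfortify)
# Normalized counts: genes in rows, samples in columns
data <- read.csv("COUNTS.csv", sep = ",", header = TRUE, row.names = 1)
# Sample metadata, including the "group" column
names <- read.csv("NAMES.csv")
PCA <- prcomp(t(data))
autoplot(PCA, data = names, colour = "group", size = 3)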
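For anyone without the CSV files at hand, here is a minimal base-R sketch of the same idea on made-up data. The matrix norm_counts, the sample labels and the log2 transform are all assumptions for illustration; the percent-variance vector is computed only because it is handy for labelling the axes consistently across plots.

```r
set.seed(1)
# Made-up normalized counts: 100 genes (rows) x 6 samples (columns)
norm_counts <- matrix(rlnorm(600, meanlog = 5), nrow = 100,
                      dimnames = list(paste0("gene", 1:100),
                                      paste0("s", 1:6)))

# PCA on samples, so the matrix is transposed (samples become rows)
pca <- prcomp(t(log2(norm_counts + 1)))

# Percent of variance explained by each component
pct_var <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

# Scores for the first two components plus a hypothetical grouping column
scores <- data.frame(pca$x[, 1:2],
                     group = rep(c("ctrl", "trt"), each = 3))
scores
```

The scores data frame can then be passed to any plotting function, with pct_var[1] and pct_var[2] pasted into the axis labels.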

Occurrence of a certain word within a text stored in columns using R

I'm currently facing quite a lot of challenges doing text analysis with R.
I have a table with the columns Date, Text and Likes.
I want to count how often a certain word occurs in the Text column (counting at most one occurrence per entry) and how often it does not.
I want to plot the results like in this picture,
but with differently colored dots for "occurrence" and "no occurrence" of the searched word, aggregated monthly on the y-axis, with likes on the x-axis.
It would be great if you could help me with this challenge.
Update: sample data is available at https://drive.google.com/file/d/1IWqDoRFBTL8er8VmvisHDeB5uM3BGgJe/view?usp=sharing
It looks like there are several moving parts here so let me outline the tasks I think you are looking for assistance with:
Determine if a word appears in text, row by row.
Plot this information.
Display the information by category, i.e. word found or not found.
Provide some sort of smoothed fit over the data.
You can accomplish the first task by using your choice of pattern matching function. grepl for example will search with the pattern as its first argument. You may want to look into other parameters such as case sensitivity to ensure they match your needs. You'll want to store this result into another column, assuming you use ggplot. Then, you can pass the data to ggplot and use the col argument to have it separate out categories for you.
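As a small sketch of the pattern-matching step on its own (the three strings are made up), grepl returns one TRUE or FALSE per element regardless of how many times the word appears in it, and ignore.case controls case sensitivity:

```r
# Made-up texts for illustration
texts <- c("I love Oranges", "Bananas only", "orange juice, orange peel")

# One logical value per text, even if the word occurs more than once
has_word <- grepl("orange", texts, ignore.case = TRUE)
has_word

# Counts of texts without and with the word
table(has_word)
```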
It doesn't appear that your data is readily available from your question. In the future, it generally helps if you can share some sample data. I have made my own sample which should be similar to what you describe. See the example code below.
library(tidyverse)  # loads ggplot2 and the %>% pipe
set.seed(5)
# 60 days of made-up data
data <- data.frame(Date = seq.Date(from = as.Date("2021-01-01"),
                                   to = as.Date("2021-03-01"),
                                   by = "day"),
                   fruit = sample(c("banana", "orange", "apple"), 60, replace = TRUE),
                   likes = runif(60, 100, 1000))
# Flag each row by whether the search word occurs in the text column
data$good_fruit <- ifelse(grepl("orange", data$fruit), "orange", "not orange")
data %>%
  ggplot() +
  geom_point(aes(Date, likes, col = good_fruit)) +
  geom_smooth(aes(Date, likes))
Since I threw together literally random data, there is not much of a pattern here, but I think this illustrates the general idea of what you wanted to show. If you want a more specific kind of aggregation, I would recommend performing that manipulation before passing the data to ggplot, but for a rough fit this should work.
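One possible way to do a monthly aggregation before plotting, sketched in base R on made-up data. The column names follow the question (Date, Text, Likes), while the search word and the choice of the mean as the summary are assumptions:

```r
set.seed(5)
# Made-up posts: one per day for 60 days
posts <- data.frame(
  Date  = seq.Date(as.Date("2021-01-01"), as.Date("2021-03-01"), by = "day"),
  Text  = sample(c("banana", "orange", "apple"), 60, replace = TRUE),
  Likes = round(runif(60, 100, 1000))
)

# Flag rows containing the word, and derive a year-month key
posts$found <- grepl("orange", posts$Text)
posts$month <- format(posts$Date, "%Y-%m")

# Mean likes per month, separately for rows with and without the word
monthly <- aggregate(Likes ~ month + found, data = posts, FUN = mean)
monthly
```

The resulting monthly table has one row per month and category, which can be passed to ggplot with col = found in the same way as above.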

Biomart in R to convert rssnp to gene name

I have the following code in R.
library(biomaRt)
snp_mart = useMart("ENSEMBL_MART_SNP", dataset = "hsapiens_snp")
snp_attributes = c("refsnp_id", "chr_name", "chrom_start",
                   "associated_gene", "ensembl_gene_stable_id", "minor_allele_freq")
getENSG <- function(rs, mart = snp_mart) {
  results <- getBM(attributes = snp_attributes,
                   filters = "snp_filter", values = rs, mart = mart)
  return(results)
}
getENSG("rs144864312")
refsnp_id chr_name chrom_start associated_gene ensembl_gene_stable_id
1 rs144864312 8 20254959 NA ENSG00000061337
minor_allele_freq
1 0.000399361
I have no background in biology so please forgive me if this is an obvious question. I was told that rs144864312 should match to the gene name "LZTS1".
The code above largely came from the internet. My question is: where do I extract that gene name from? I understand that listAttributes(snp_mart) gives a list of all possible outputs, but I don't see any that give me the gene name above. How do I extract this gene name using biomaRt, given the rs number? Thank you in advance.
PS: I need to do this for something like 500 entries (not just 1). Hence why I created a simple function as above to extract the gene name.
First I think your question will draw more professional attention on https://www.biostars.org/
That said, to my knowledge, now that you have the Ensembl ID (ENSG00000061337), you are just one step away from getting the gene name. If you search for "how to convert ensembl ID to gene name" you will find many approaches. Here are a few options:
use: https://david.ncifcrf.gov/conversion.jsp
use BioMart under Ensembl: http://www.ensembl.org/biomart/martview/1cb4c119ae91cb34b2cd5280be0a1aac
download a table with both gene name and ensembl ID, and customize your query. You might want to download it from UCSC Genome Browser, and here are some instructions: https://www.biostars.org/p/92939/
Good luck
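The third option (a downloaded table plus your own query) could look roughly like the sketch below. The one-row gene_table is a hypothetical stand-in for a real downloaded ID-to-name table, and its column names are assumptions. The ID/name pair is the one given in the question.

```r
# Shaped like the getBM() result shown in the question
snp_hits <- data.frame(
  refsnp_id = "rs144864312",
  ensembl_gene_stable_id = "ENSG00000061337",
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for a downloaded Ensembl ID -> gene name table
gene_table <- data.frame(
  ensembl_gene_id = "ENSG00000061337",
  gene_name = "LZTS1",
  stringsAsFactors = FALSE
)

# Left join: keep every SNP, attach the gene name where the ID matches
annotated <- merge(snp_hits, gene_table,
                   by.x = "ensembl_gene_stable_id",
                   by.y = "ensembl_gene_id",
                   all.x = TRUE)
annotated
```

Because merge works on whole data frames, the same call covers 500 SNPs as easily as one.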

Data cleaning in Excel sheets using R

I have data in Excel sheets and I need a way to clean it. I would like to remove inconsistent values; for example, a branch name is written variously as Computer Science and Engineering, C.S.E, C.S, or Computer Science. How can I bring all of them into a single notation?
The car package has a recode function. See its help page for worked examples.
In fact an argument could be made that this should be a closed question:
Why is recode in R not changing the original values?
How to recode a variable to numeric in R?
Recode/relevel data.frame factors with different levels
And a few more questions easily identifiable with a search: [r] recode
EDIT:
I liked Marek's comment so much I decided to make a function that implemented it. (Factors have always been one of those R-traps for me and his approach seemed very intuitive.) The function is designed to take character or factor class input and return a grouped result that also classifies an "all_others" level.
my_recode <- function(fac, levslist) {
  # levslist has the form:
  #   list(animal = c("cow", "pig"),
  #        bird = c("eagle", "pigeon"))
  nfac <- factor(fac)
  inlevs <- levels(nfac)
  # levels not named in levslist are collected under "all_others"
  othrlevs <- inlevs[!inlevs %in% unlist(levslist)]
  # wrap in list() so several leftover levels stay under the one name
  levels(nfac) <- c(levslist, list(all_others = othrlevs))
  nfac
}
df <- data.frame(name = c('cow', 'pig', 'eagle', 'pigeon', "zebra"),
                 stringsAsFactors = FALSE)
df$type <- my_recode(df$name, list(
  animal = c("cow", "pig"),
  bird = c("eagle", "pigeon")))
df
#-----------
name type
1 cow animal
2 pig animal
3 eagle bird
4 pigeon bird
5 zebra all_others
You want a way to clean your data and you specify R. Is there a reason for it? (automation, remote control [console], ...)
If not, I would suggest OpenRefine. It is a great tool for exactly this job. It is not hosted, so you can safely download it and run it against your dataset (xls/xlsx work fine); you then create a text facet and group away.
It uses advanced algorithms (and even gives you a choice) and is really helpful. I have cleaned a lot of data in no time.
The videos at the official web site are useful.
There is no one-size-fits-all solution for these types of problems. From what I understand, you have Branch Names that are inconsistently labelled.
You would like to see C.S.E. but what you actually have is CS, Computer Science, CSE, etc. And perhaps a number of other Branch Names that are inconsistent.
The first thing I would do is get a unique list of Branch Names in the file. I'll provide an example using the built-in letters vector so you can see what I mean:
your_df <- data.frame(ID=1:2000)
your_df$BranchNames <- sample(letters,2000, replace=T)
your_df$BranchNames <- as.character(your_df$BranchNames) # only if it's a factor
unique.names <- sort(unique(your_df$BranchNames))
Now that we have a sorted list of unique values, we can create a listing of recodes:
Let's say we wanted to recode a through g as just A:
your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"
You would then repeat the process above, eliminating or grouping the unique names as appropriate.
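Applied to the branch names from the question, the same idea can be written with a named lookup vector so that all variants are recoded in one pass. This is only a sketch: the canonical label chosen here and the extra "Mechanical" entry are assumptions for illustration.

```r
# Variants from the question, plus one made-up unrelated branch
branch <- c("Computer Science and Engineering", "C.S.E", "C.S",
            "Computer Science", "Mechanical")

# Hypothetical mapping: names are the variants, values the canonical label
canonical <- c("C.S.E"            = "Computer Science and Engineering",
               "C.S"              = "Computer Science and Engineering",
               "Computer Science" = "Computer Science and Engineering")

# Replace only the entries that appear in the mapping; leave the rest alone
hit <- branch %in% names(canonical)
cleaned <- branch
cleaned[hit] <- unname(canonical[branch[hit]])
cleaned
```

Running unique() on the cleaned vector afterwards shows which variants still need an entry in the mapping.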

Transform a matrix txt file in spectra data for ChemoSpec package

I want to use ChemoSpec with mass spectra of about 60,000 data points each.
I already have them in one txt file as a matrix (X + 90 samples = 91 columns; 60,000 rows).
How can I adapt this file as spectra data without exporting each sample again as a separate csv file (which takes quite long in R given the size of my data)?
The typical (and only?) way to import data into ChemoSpec is by way of the getManyCsv() function, which as the question indicates requires one CSV file for each sample.
Creating 90 CSV files from the 91 columns - 60,000 rows file described, may be somewhat slow and tedious in R, but could be done with a standalone application, whether existing utility or some ad-hoc script.
An R-only solution would be to create a new method, say getOneBigCsv(), adapted from getManyCsv(). After all, the logic of getManyCsv() is relatively straightforward.
Don't expect such a solution to be sizzling fast, but it should, in any case, compare with the time it takes to run getManyCsv() and avoid having to create and manage the many files, hence overall be faster and certainly less messy.
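For completeness, the "ad-hoc script" route of splitting the big table into one file per sample might look like the sketch below. The tiny data frame stands in for the real 60,000 x 91 table, and the two-column, header-less, comma-separated layout is an assumption that should be checked against what the import function actually expects.

```r
# Small stand-in for the big table: first column is the X axis
big <- data.frame(X  = c(100.1, 100.2, 100.3),
                  S1 = c(1, 5, 2),
                  S2 = c(0, 7, 3))

# Write into a scratch directory so nothing in the working dir is touched
out_dir <- file.path(tempdir(), "per_sample_csv")
dir.create(out_dir, showWarnings = FALSE)

# One two-column file (X, intensity) per sample column
for (s in names(big)[-1]) {
  write.table(big[, c("X", s)],
              file = file.path(out_dir, paste0(s, ".csv")),
              sep = ",", row.names = FALSE, col.names = FALSE)
}

written <- list.files(out_dir, pattern = "\\.csv$")
written
```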
Sorry I missed your question 2 days ago. I'm the author of ChemoSpec - always feel free to write directly to me in addition to posting somewhere.
The solution is straightforward. You already have your data in a matrix (after you read it in with read.csv("file.txt")), so you can use it to manually create a Spectra object. In the R console type ?Spectra to see the structure of a Spectra object, which is a list with specific entries. You will need to put your X column (which I assume is mass) into the freq slot. The rest of the data matrix then goes into the data slot. Next, manually create the other needed entries (making sure the data types are correct). Finally, assign the Spectra class to your completed list by doing something like class(my.spectra) <- "Spectra" and you should be good to go. I can give you more details on or off list if you describe your data a bit more fully. Perhaps you have already solved the problem?
By the way, ChemoSpec is totally untested with MS data, but I'd love to find out how it works for you. There may be some changes that would be helpful so I hope you'll send me feedback.
Good Luck, and let me know how else I can help.
Many years have passed and I am not sure if anybody is still interested in this topic, but I had the same problem and wrote a small workaround that converts my data to class 'Spectra' by extracting the information from the data itself:
# Assumption:
# Data is stored as a numeric data.frame with column names representing samples
# and row names containing the domain axis
dataframe2Spectra <- function(Spectrum_df,
                              freq = as.numeric(rownames(Spectrum_df)),
                              data = as.matrix(t(Spectrum_df)),
                              names = paste("YourFileDescription", 1:dim(Spectrum_df)[2]),
                              groups = rep(factor("Factor"), dim(Spectrum_df)[2]),
                              colors = rainbow(dim(Spectrum_df)[2]),
                              sym = 1:dim(Spectrum_df)[2],
                              alt.sym = letters[1:dim(Spectrum_df)[2]],
                              unit = c("a.u.", "Domain"),
                              desc = "Some signal. Describe it with 'desc'") {
  features <- c("freq", "data", "names", "groups", "colors", "sym", "alt.sym", "unit", "desc")
  Spectrum_chem <- vector("list", length(features))
  names(Spectrum_chem) <- features
  Spectrum_chem$freq <- freq
  Spectrum_chem$data <- data
  Spectrum_chem$names <- names
  Spectrum_chem$groups <- groups
  Spectrum_chem$colors <- colors
  Spectrum_chem$sym <- sym
  Spectrum_chem$alt.sym <- alt.sym
  Spectrum_chem$unit <- unit
  Spectrum_chem$desc <- desc
  # important step
  class(Spectrum_chem) <- "Spectra"
  # some warnings (data must be samples x length of freq)
  if (length(freq) != dim(data)[2]) print("Dimension of data is NOT #samples X length of freq")
  if (length(names) > dim(data)[1]) print("Too many names")
  if (length(names) < dim(data)[1]) print("Too few names")
  if (length(groups) > dim(data)[1]) print("Too many groups")
  if (length(groups) < dim(data)[1]) print("Too few groups")
  if (length(colors) > dim(data)[1]) print("Too many colors")
  if (length(colors) < dim(data)[1]) print("Too few colors")
  if (!is.matrix(data) || !is.numeric(data)) print("'data' is not a matrix or it's not numeric")
  return(Spectrum_chem)
}
# Spectrum is your own data.frame, laid out as described above
Spectrum_chem <- dataframe2Spectra(Spectrum)
chkSpectra(Spectrum_chem)
