I analyzed DNA sequences in a bioinformatics pipeline to identify genetic variants in my samples. The effects of these variants were estimated using the software snpEff, which returns a VCF file like this example file.
Since I have a multitude of these VCF files, I'd like to read them in and extract data from the annotation field (ANN=). The problem is that every line after the header contains an ANN field, but the number of annotations can vary from line to line. Thus, I'm looking for a simple way to convert the annotation subfields into a list of data frames (one row for every annotation, columns for the annotation subfields).
I'd appreciate any suggestions on how to extract the annotation info. Thanks a lot in advance!
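One possible approach, sketched below with a made-up INFO string rather than a real snpEff file: annotations in the ANN field are separated by commas and their subfields by pipes, so base R's strsplit() can split them in two passes. Only the first four of snpEff's subfields are named here, and the column names are abbreviations, not the official ones.

```r
# Sketch: parse the ANN= part of a VCF INFO column into a data frame,
# one row per annotation. Assumes the usual snpEff layout:
# annotations separated by ",", subfields separated by "|".
ann_names <- c("allele", "annotation", "impact", "gene_name")  # first 4 subfields only

parse_ann <- function(info) {
  # pull the ANN= value out of the INFO column (stops at the next ";")
  ann <- sub(".*ANN=([^;]*).*", "\\1", info)
  # split into annotations, then each annotation into subfields
  anns <- strsplit(strsplit(ann, ",")[[1]], "\\|")
  df <- as.data.frame(do.call(rbind, lapply(anns, function(x) x[1:4])),
                      stringsAsFactors = FALSE)
  names(df) <- ann_names
  df
}

# made-up INFO string with two annotations
info <- "DP=100;ANN=A|missense_variant|MODERATE|gene1,A|synonymous_variant|LOW|gene2"
parse_ann(info)
```

Applying parse_ann() with lapply() over the INFO column of a parsed VCF would then give the desired list of data frames, one per variant line.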
Is there a way to use only the data from one CSV file that does not appear in another CSV file? I recently split some data to conduct EFA and CFA analyses. For the CFA I need to exclude the rows that were used for the EFA, because otherwise there is no point in randomly splitting the data.
So how do I use only the data that was not used in the EFA? If anyone can help, it would be much appreciated.
Edit:
What I did was the following:
Usage <- anti_join(file_one, file_two, by = 'the column to separate by')
Then I just exported the result to a CSV. Thank you all!
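For anyone finding this later, here is a small self-contained sketch of that anti_join() approach with made-up data; the "id" column name is hypothetical and stands in for whatever key column the two files share.

```r
# anti_join() keeps the rows of the first data frame that have no match
# in the second, based on the join column ("id" here is a made-up name).
library(dplyr)

full_data <- data.frame(id = 1:6, score = c(3, 5, 2, 4, 6, 1))
efa_half  <- full_data[c(1, 3, 5), ]   # rows already used for the EFA

cfa_half <- anti_join(full_data, efa_half, by = "id")
cfa_half$id                            # ids 2, 4, 6 remain
# write.csv(cfa_half, "cfa_data.csv", row.names = FALSE)
```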
As the title says, I'm not a programmer. I've tried R before, got very confused and abandoned it. I'm a physician, and I do all my statistics either with SPSS or Excel. I'd like to learn some coding for when I get into problems like this:
I have an ASCII file that I'd like to extract data from. The fields are contained within columns of variable width. 90% of the file is useless to me. For example, the fields I'm interested in extracting are encoded in columns 00645-00649, 03315-03319, etc. I'd like to get this into a format I can run stats on in SPSS/Excel. Should I be looking to use R, Python, something else, or am I totally beyond hope?
Thanks in advance.
It's impossible to say for certain given only the information here, but the DATA LIST command in SPSS may well allow you to read the data into SPSS directly from the file. If you can specify the column locations of the desired variables, you can specify those on that command, and SPSS will simply skip over the unnamed columns.
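If you do end up trying R, it has an equivalent: read.fwf() reads fixed-width files, and negative widths skip character ranges you don't need. The sketch below uses a tiny made-up file; for the real file the widths would be derived from the column positions you listed (e.g. skip up to column 00644, read 5 characters, and so on).

```r
# Sketch: read one 5-character field from a fixed-width file,
# skipping the 5 characters before it. A temp file stands in for
# the real data; widths and the "value" name are illustrative.
tmp <- tempfile()
writeLines(c("ABCDE12345", "FGHIJ67890"), tmp)

# negative width = skip that many characters, positive width = read a field
dat <- read.fwf(tmp, widths = c(-5, 5), col.names = "value")
dat$value   # 12345, 67890
```

The resulting data frame can then be written out with write.csv() for use in SPSS or Excel.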
It's the first time I've dealt with Matlab files in R.
The rationale for saving the information in a .mat file was the size of the dataset (it contains 226518 rows); we were worried that Excel (and then a CSV) would not handle it.
I can upload the original file if necessary
So I have my Matlab file, and when I open it in Matlab everything looks fine.
There are various arrays, and the one I want is called "allPoints".
I can open it and see that it contains values around 0.something.
Screenshot:
What I want to do is to extract the same data in R.
library(R.matlab)
df <- readMat("170314_Col_HD_R20_339-381um_DNNhalf_PPP1-EN_CellWallThickness.mat")
str(df)
And here I get stuck. How do I pull "allPoints" out of it? $ does not seem to work.
I will have multiple files that need to be combined into one single data frame in R, so the plan is to mutate each extracted data frame, adding a new column for the sample, and then rbind them all together.
Could anybody help?
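A sketch of one way this could work: readMat() returns a named list, so the array can be pulled out with [[ ]] (or $, once you know the exact element name, which str() or names() will show; R.matlab sometimes alters names, e.g. replacing underscores with dots). A small list stands in for the real file below, and the sample label is hypothetical.

```r
# Stand-in for the list that readMat() would return for the real file.
mat <- list(allPoints = matrix(c(0.1, 0.2, 0.3, 0.4), ncol = 2))

names(mat)                         # check the exact element names first
all_points <- mat[["allPoints"]]   # or mat$allPoints

ap <- as.data.frame(all_points)
ap$sample <- "sample_01"           # hypothetical sample label

# With several files, the same idea per file, then bind the results:
# do.call(rbind, lapply(files, read_one_file))
```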
I would like to find out what the "R way" would be to let users do the following with R: I have a file that can contain the data of one or more analysis runs of some other software. My R package should provide additional ways to calculate statistics or produce plots for those analyses. So the first step a user would have to do is read in the file (with one or more analyses), then select an analysis and work with it.
An analysis is uniquely identified by two names (an analysis name and an analysis type where the type should later correspond to an S3 class).
What I am not sure about is how best to represent the collection of analyses that is returned when reading in the file: should this be an object, or simply a list of lists (since there are two IDs identifying an analysis, the outer list could be indexed by name and the inner one by type)? Using bare lists feels very low-level and clumsy, though.
If the read function returns a special kind of container object what would be a good method to access one of the contained objects based on name and type?
There are probably many ways to do this, but since I have only recently started writing R code that others will use, I am not sure how best to follow existing R conventions when designing this.
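One common pattern, sketched below with entirely hypothetical names: give the collection its own S3 class and provide a small accessor that selects an analysis by name and type, tagging the result with the type so that S3 methods (plot, summary, ...) can dispatch on it. This is one convention among several, not the only "R way".

```r
# Hypothetical reader: each analysis becomes a list with name/type/data,
# and the collection gets its own S3 class.
read_analyses <- function(file) {
  # ... real parsing of `file` would go here; hard-coded for the sketch
  analyses <- list(
    list(name = "run1", type = "typeA", data = 1:3),
    list(name = "run2", type = "typeB", data = 4:6)
  )
  structure(list(analyses = analyses), class = "analysis_collection")
}

# Accessor: select one analysis by its two identifiers.
get_analysis <- function(x, name, type) {
  stopifnot(inherits(x, "analysis_collection"))
  for (a in x$analyses) {
    if (a$name == name && a$type == type) {
      # tag with the type so S3 methods can dispatch on it later
      class(a) <- c(a$type, "analysis")
      return(a)
    }
  }
  NULL
}

coll <- read_analyses("runs.dat")
a <- get_analysis(coll, "run2", "typeB")
class(a)   # "typeB" "analysis"
```

The accessor keeps the container's internals private, so the underlying list-of-lists representation can change later without breaking user code.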
Can someone please give me a small example of what the microarray data that I import into LIMMA should look like when I import it into R?
I am trying to decipher differentially regulated genes from a microarray sample. Thanks.
Generally speaking: a tab- (or otherwise) separated file with normalized expression levels, plus a column of probeset IDs (or other gene identifiers) and a header that defines the samples.
To get an example of the needed code, I suggest inspecting a GEO2R-generated script (accessible from any GEO dataset) and reading the limma vignette.
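To make that description concrete, here is a tiny sketch of such a file with made-up expression values (the probeset IDs are just illustrative), read into R with the identifiers as row names:

```r
# Minimal made-up example of the expected input layout:
#   probeset_id  sample1  sample2  sample3
#   1007_s_at    7.21     7.05     8.90
#   1053_at      5.10     5.33     4.98
expr_file <- tempfile()
writeLines(c("probeset_id\tsample1\tsample2\tsample3",
             "1007_s_at\t7.21\t7.05\t8.90",
             "1053_at\t5.10\t5.33\t4.98"), expr_file)

# row.names = 1 keeps the probeset IDs as row names,
# leaving a numeric probes-by-samples expression matrix
expr <- read.delim(expr_file, row.names = 1)
dim(expr)   # 2 probesets x 3 samples
```

A matrix in this shape is what would then be handed to limma's lmFit().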