Retrieving GWAS information with R

I am trying to get specific disease-related information from the GWAS catalog. This can be done directly from the website via a spreadsheet download. But I was wondering if I could possibly do it programmatically in R. Any suggestions will be greatly appreciated.
Thanks.
Avoks

Check out the function download.file() and the RCurl package (http://cran.r-project.org/web/packages/RCurl/index.html) - this should do what you are looking for.
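As a minimal sketch of that idea (the download URL below is only illustrative; check the GWAS Catalog downloads page for the current link to the file you need):
# download the catalog associations TSV and read it into a data frame
url  <- "https://www.ebi.ac.uk/gwas/api/search/downloads/full"   # illustrative URL
dest <- "gwas_catalog_associations.tsv"
download.file(url, destfile = dest)
gwas <- read.delim(dest, stringsAsFactors = FALSE, quote = "")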

You will have to download .tsv file(s) first and manually edit them.
This is because the GWAS Catalog files contain HTML character entities, such as &#x000E7; in "Behçet's disease" (encoding that special fourth letter). By default read.table treats # as a comment character, so the rest of any line containing such an entity is dropped and you get an error message, e.g.:
line 2028 did not have 34 elements
So you download the file first, open it in a plain-text editor, replace every # with nothing, and only then load it into R with:
read.table("gwas_catalog_v1.0-associations_e91_r2018-02-21.tsv", sep="\t", header=TRUE, stringsAsFactors=FALSE, quote="")
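Alternatively, a sketch that may let you skip the manual edit entirely: read.table has a comment.char argument, and setting it to an empty string stops # from being treated as a comment character.
# same call as above, but with comment parsing disabled
gwas <- read.table("gwas_catalog_v1.0-associations_e91_r2018-02-21.tsv",
                   sep = "\t", header = TRUE, stringsAsFactors = FALSE,
                   quote = "", comment.char = "")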

Related

Why won't the HTML function in R actually write the HTML?

So I recently helped write a script for my lab which takes our processed data and makes a merged data frame of it. To keep the lab updated, we keep our data tables on a secure wiki, so I need an HTML file generated so that I can easily upload the data frame to the wiki. It has worked before - all I did was copy what was already written and working and edit it to work for a different time point in our data collection. I get no errors back and the data looks how I want it to look. As far as I know this script should be logically sound, and so far it is, except for one issue: R creates the file for the HTML, but there is no HTML written inside it.
I have HTML files from the other time points which were generated the exact same way as this one, so I don't think it is a script-construction issue.
Any ideas as to why this could be happening? I just need to know where to start triaging.
The package used for the HTML is R2HTML, included in the package list at the top of the script. For the HTML(, file=paste()) call, you will need to substitute your own directory to check whether the HTML is actually written to the file.
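For reference, a minimal sketch of the kind of call being described, assuming a data frame named my_df and a writable output path (both placeholders):
library(R2HTML)
# write the data frame to an HTML file in the current working directory
out_file <- file.path(getwd(), "my_df_table.html")
HTML(my_df, file = out_file, append = FALSE)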
If I am not wrong, you are trying to get the data frame in HTML format.
In that case you can use the xtable package in R.
Just add the code below at the bottom of the script:
## install the xtable package before importing it
library("xtable")
print(xtable(ChildSRPtotsFU_wiki), type="html", file="check_stack_overflow.html")

Is there a way to accelerate formatted table writing from R to excel?

I have a 174603-row by 178-column data frame which I am writing to Excel using openxlsx::saveWorkbook (I use this package to get the formatted cells, with colors, header styles and so on). But the process is extremely slow; depending on the amount of memory used by the machine it can take from 7 to 17 minutes, and I need a way to reduce this significantly (it doesn't need to be seconds, but anything below 5 minutes would be OK).
I've already searched other questions, but they all seem to focus either on reading data into R (I have no problem with this) or on writing unformatted files from R (using write.csv and the like).
Apparently I can't use the xlsx package because of the settings on my computer (an industrial computer; check the comments on this question).
Any suggestions regarding packages, or other functionality inside this package, to make this run faster would be highly appreciated.
This question has been around for some time, but I had the same problem as you and came up with a solution worth mentioning.
There is a package called writexl that has implemented a way to export a data frame to Excel using the C library libxlsxwriter. You can export to Excel using the following code:
library(writexl)
writexl::write_xlsx(df, "Excel.xlsx", format_headers = TRUE)
The format_headers parameter only applies centered and bold headers, but I edited the C code of its source on GitHub (the writexl repository maintained by rOpenSci).
You can download or clone it. Inside the src folder you can edit the write_xlsx.c file.
For example, in the part where the header format is created
//how to format headers (bold + center)
lxw_format * title = workbook_add_format(workbook);
format_set_bold(title);
format_set_align(title, LXW_ALIGN_CENTER);
you can add these lines to add a background color to the header:
format_set_pattern (title, LXW_PATTERN_SOLID);
format_set_bg_color(title, 0x8DC4E4);
There is a lot of other formatting you can apply; search the libxlsxwriter library documentation.
When you have finished editing that file, and assuming you have the source code in a folder called writexl, you can build and install the edited package with:
shell("R CMD build writexl")
install.packages("writexl_1.2.tar.gz", repos = NULL, type = "source")
Exporting again using the first chunk of code will then generate the formatted Excel file, faster than any other library I know of.
Hope this helps.
Have you tried:
write.table(GroupsAlldata, file = 'Groupsalldata.txt')
in order to obtain it in .txt format?
Then in Excel you can use 'Text to Columns' to put your data into a table.
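As a small variant of that idea (untested against your data; GroupsAlldata is the object named above), writing tab-separated text makes the Text to Columns step, or a straight paste into Excel, split into columns more cleanly:
# tab-separated output imports into Excel columns directly
write.table(GroupsAlldata, file = "Groupsalldata.txt",
            sep = "\t", row.names = FALSE, quote = FALSE)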
good luck

BlueSky Statistics - File naming conventions

When opening the file 'TestFile.RData' in BlueSky Statistics, it is opened with this name PLUS Dataset3 attached; the tab looks like TestFile.RData(Dataset3).
I would like to use my original name when using R code in the R command editor, but from what I see BlueSky wants me to use the Dataset3 name.
Please clarify this file name issue for me.
If my original name is changed, I foresee issues with reproducing things, as the assigned name Dataset3 is not controllable.
Regards
Your observation is correct. Whenever a file that is not an R data file is opened in BlueSky Statistics, we create a data frame object in R. We name these objects sequentially, namely Dataset1, Dataset2, Dataset3, etc. We could always use the name of the original file; however, we went with Dataset1, Dataset2, Dataset3 for compatibility with SPSS. Many of our users come from SPSS and that is exactly what SPSS does. There is a simple workaround, see below.
To work around this you need to change the default code we use to open the dataset. To see the code in the output window, go to the top-level menu Tools -> Configuration Settings, select the Output tab, and select the checkbox next to the text "Show syntax in output window".
The code you will see in the output window when you open a dataset is:
BSkyloadDataset(fullpathfilename='C:/Users/Aaron_2/Documents/BlueSky Statistics/Sample Datasets/IRT/engagement.csv',
    filetype='CSV', worksheetName='', load.missing=FALSE, character.to.factor=FALSE,
    csvHeader=TRUE, isBasketData=FALSE, trimSPSStrailing=FALSE,
    sepChar=',', deciChar='.', datasetName='Dataset2')
All you need to do is change the datasetName parameter to the name you want to use.
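For example (reusing the generated call above; 'engagement' is just an illustrative name matching the file being opened):
# identical to the generated call, except for the datasetName parameter
BSkyloadDataset(fullpathfilename='C:/Users/Aaron_2/Documents/BlueSky Statistics/Sample Datasets/IRT/engagement.csv',
    filetype='CSV', worksheetName='', load.missing=FALSE, character.to.factor=FALSE,
    csvHeader=TRUE, isBasketData=FALSE, trimSPSStrailing=FALSE,
    sepChar=',', deciChar='.', datasetName='engagement')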
I will also add an enhancement to make the default behavior, when opening files, to name the dataset after the file. This is easy to do.
With R datasets this is not a problem because we load all data frame objects into the grid, and the name of the dataset in the grid continues to be the name of the dataset object.
BlueSky is one of the few packages that use R and allow you to open and work on multiple data files at once. This naming approach is its way of allowing that while using files that have not yet been stored as R data files (.RData). After importing data from a non-R file, simply use "File> Save as" and save it as an R Object (.RData). The next time you open that file, it will maintain the name you've given it.

RStudio: Save data from Viewer

Due to a stupid mistake and a defective USB stick I lost a bunch of data and I am now trying to recover it.
Some of the data is still displayed in the Viewer tabs when I open RStudio. However, I can only save R scripts and R Markdown files out of the Viewer. The displayed data frames are nice and complete, and I can sort and filter them in the Viewer, but I cannot find a "save" option. Is there a possibility to save this displayed data to .RData or .csv or something similar?
I would suggest three different approaches, but none of them will necessarily work. I have sorted them according to my prior expectations of success.
1) You can copy your whole data frame from the Viewer and paste it into external spreadsheet software to obtain a .csv file, e.g. via the "Convert Text to Columns" button in MS Excel.
2) You can copy and paste the character string into an object that is passed to the text option of read.table or to dput(); see the sketch after this list. Check out the "Copy your data" section of this famous SO question.
3) Finally, you can use Google Chrome's "Inspect Element" function to inspect the HTML code of the object in the Viewer. Once you find the table, you can copy-paste it and scrape it with an HTML parser, e.g. using the rvest package. Good luck!
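A minimal sketch of approach 2 (the column names and values below are made up; paste your own copied text between the quotes):
# read.table can parse a pasted character string via its text argument
recovered <- read.table(text = "
id  value
1   0.5
2   0.7
", header = TRUE)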
Thanks everybody, there is a way to access the data as .RData files, which was kindly explained to me here.
I used the second method and located the files in %localappdata%\RStudio-Desktop\viewer-cache.

How to write multiple tables, dataframes, regression results etc - to one excel file?

I am looking for an easy way to get objects into MS Excel.
(I am using the built-in "Puromycin" dataset for the examples)
I would like to place the contents of these objects to a single excel file:
Puromycin
summary(Puromycin$rate)
summary(Puromycin$conc)
table(Puromycin$state)
lm( conc ~ rate , data=Puromycin)
By "contents" I mean what is shown in the console when I press Enter; I don't know what else to call it.
I tried to do this:
sink("datafilewhichexcelhopefullyunderstands.csv")
Puromycin
summary(Puromycin$rate)
summary(Puromycin$conc)
table(Puromycin$state)
lm( conc ~ rate , data=Puromycin)
sink()
This gives me a file with the .csv extension; however, when I open the file in Notepad, there is no comma separation. That means I can't get Excel to open it properly, where by properly I mean that each number is in its own cell.
Others have suggested this for a similar problem
https://stackoverflow.com/a/13007555/1831980
But as a novice I feel that the solution is too complex, and I am hoping for a simpler method.
What I am doing now is this:
write.table(Puromycin, file="clipboard" , sep=";" , row.names=FALSE )
write.table(summary(Puromycin$conc), file="clipboard" , sep=";" , row.names=FALSE )
... etc...
But this requires a lot of copying and pasting, which I hope to eliminate.
Any help would be appreciated.
write.table and its friends are intended to write out columns of data separated by whatever separator is specified. Your clipboard ends up containing several different kinds of output because you are using summary, which does not return a simple table.
For writing the data values out, you can use write.csv on a data frame and then open with Excel. For example, Puromycin is already a data frame (which you can see with str(Puromycin)) so you can just write it out directly:
write.csv(file = "some file.csv", x = Puromycin)
That file will go into the current working directory (which can be determined with getwd()).
Writing out/saving the results of the regression model is a bit more of a challenge. You could definitely use sink as you did, but specify a .txt extension on your file so a text editor can open it. There are fancier methods (Sweave, knitr) which you might want to look into in the long run, as they can write really nice reports automatically.
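A minimal sketch of that sink-to-.txt idea (the file name is just an example):
# divert console output to a plain-text file that any editor can open
sink("puromycin_results.txt")
print(summary(Puromycin$rate))
print(table(Puromycin$state))
print(lm(conc ~ rate, data = Puromycin))
sink()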
In the meantime, get to know str(any R object) as it will be your friend. You can see all the objects in your workspace with ls().
This will only be helpful if you are prepared to use Excel's Data/Text to Columns functions:
capture.output(
  sapply(c(Puromycin,
           summary(Puromycin$rate),
           summary(Puromycin$conc),
           table(Puromycin$state),
           lm(conc ~ rate, data = Puromycin)),
         FUN = print),
  file = "datafilewhichexcelhopefullyunderstands.csv", append = TRUE)
The problem being that Excel will not read the whitespace as a cell separator unless you specifically tell it to. You can (and I have often done so) use the fixed-width field input features offered by the Text-to-Columns dialog interface.
Your simplest option may be to use the RExcel tool, which transfers information between R and Excel. However, it is not free software.
The XLConnect package is another option; it can be used to write information directly to an Excel file.
The tricky part is the lm call. lm does not return a simple vector, matrix, or data frame (all of which are easy to convert to csv or send directly) and there is not a clear way to convert the various parts of a list to cells in a spreadsheet. What would be better is to use extractor functions to pull the important parts from the return of lm or the summary of the lm object and send those to Excel using the other tools.
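A minimal sketch of that extractor-function idea (the file name is just an example): pull the coefficient table out of the fit, which is an ordinary matrix, and write that out for Excel.
# extract the coefficient table from the model and save it as a csv
fit <- lm(conc ~ rate, data = Puromycin)
coefs <- summary(fit)$coefficients   # estimates, std. errors, t and p values
write.csv(coefs, file = "puromycin_lm_coefficients.csv")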
If you can tell us more about why you want the numbers in Excel and what you plan to do with them after, then we may be able to offer better help (you may be able to completely skip excel).
If the main goal is to share output with others then you should really look at the knitr package (or other related packages). This will not create Excel files, but it can be used (along with the pandoc program and possibly other tools) to create a report in a format that is easy to share with others not familiar with R. You could put everything into a .pdf file or a .docx file (the latter is read by MS Word and would have tables which can be edited using Word). There is no simple way to get edits back into R, but with track changes you can easily see what changes have been made and hand-edit your R script/template accordingly.
