Export .RData into a CSV in R

I want to export fake.bc.Rdata from the package "qtl" to a CSV. When I run summary(), it shows this is an object of class "cross", which is why my attempts to convert it fail. I also tried resave, but got this warning: cannot coerce class ‘c("bc", "cross")’ to a data.frame.
Thank you all for your help in advance!

CSV stands for comma-separated values, and it is not suitable for all kinds of data.
As indicated in the comments, it requires a clear rectangular structure of rows and columns.
Take this JSON as an example:
{
"name":"John",
"age":30,
"likes":"Walking","Running"
}
If you were to represent this in CSV format, how would you deal with the list of values in likes? One way would be to repeat the data:
name,age,likes
John,30,Walking
John,30,Running
But that doesn't really look right. Even if you merged the two values into one field, you would still have trouble reading the data back reliably, e.g.
name,age,likes
John,30,Walking/Running
Thus, CSV is best suited for tidy data.
TL;DR
Can your data be represented tidily as comma-separated values, or should you be looking at alternative forms of exporting your data?
EDIT:
It appears you do have some options:
If you look at the package reference, you have the option to export your data using write.cross().
For your data, you could use write.cross(fake.bc, "csv", "myCrossData", c(1,5,13)). The documentation describes the comma-delimited formats as follows:
Comma-delimited formats: a single csv file is created in the formats
"csv" or "csvr". Two files are created (one for the genotype data and
one for the phenotype data) for the formats "csvs" and "csvsr"; if
filestem="file", the two files will be names "file_gen.csv" and
"file_phe.csv".

Related

Issue when importing float as string from Excel. Adding precision incorrectly

I'm using openxlsx's read.xlsx() to import a data frame from a multi-class column. The desired result is to import all values as strings, exactly as they're represented in Excel. However, some decimals come through as very long floats.
Sample data is simply an Excel file with a column containing the following rows:
abc123,
556.1,
556.12,
556.123,
556.1234,
556.12345
require(openxlsx)
df <- read.xlsx('testnumbers.xlsx')
Using the above R code to read the file results in df containing these string
values:
abc123,
556.1,
556.12,
556.12300000000005,
556.12339999999995,
556.12345000000005
The Excel file provided in production has the column formatted as "General". If I format the column as Text, there is no change unless I explicitly double-click each cell in Excel and hit enter. In that case, the number is correctly displayed as a string. Unfortunately, clicking each cell isn't an option in the production environment. Any solution, Excel, R, or otherwise is appreciated.
Edit:
I've read through Why Are Floating Point Numbers Inaccurate? and believe I understand the math behind what's going on. At this point, I suppose I'm looking for a workaround: how can I get a float from Excel into an R data frame as text, without changing the representation?
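The same representation issue can be reproduced directly in R, since both tools store the value as an IEEE-754 double (a quick illustration with one of the sample values above):
# Print 556.123 with full precision; the output should be something
# like 556.12300000000005, the nearest representable double.
print(556.123, digits = 17)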
I was able to get the correct formats into a data frame using pandas in python.
import pandas as pd
test = pd.read_excel('testnumbers.xlsx', dtype = str)
This will suffice as a workaround, but I'd like to see a solution built in R.
Here is a workaround in R using openxlsx that I used to solve a similar issue. I think it will solve your question, or at least allow you to format the numbers as text in the Excel files programmatically.
I use it to reformat specific cells in a large number of files (in my case, converting from general to 'scientific', as an example of how you might alter this for another format).
This uses functions from the openxlsx package that you reference in the OP.
First, load the xlsx file as a workbook (stored in memory, which preserves all the xlsx formatting etc.; this is slightly different from the method shown in the question, which pulls in only the data):
testnumbers <- loadWorkbook(here::here("test_data/testnumbers.xlsx"))
Then create a "style" that converts the numbers to "text", and apply it to the worksheet in memory:
numbersAsText <- createStyle(numFmt = "TEXT")
addStyle(testnumbers, sheet = "Sheet1", style = numbersAsText, cols = 1, rows = 1:10)
Finally, save the workbook back out (here under a new file name):
saveWorkbook(testnumbers,
file = here::here("test_data/testnumbers_formatted.xlsx"),
overwrite = T)
When you open the Excel file, the numbers will be stored as "text".

R readr: get columns specification of existing data, not imported one?

I have a dataset created in an R session that I want to 1) export as CSV and 2) save the readr-type column specifications for separately. This will allow me to read the data later on, using read_csv() and specifying col_types from the file saved in 2).
Problem: one gets column specifications (the spec attribute) only for data read with a read_* function. It does not seem possible to obtain column specifications directly from a dataset created within R.
My workflow so far is:
Export item: write_csv()
Read specification from the exported file: spec_csv().
Save the column specification: write_rds()
Then finally read_csv(step_1, col_types=step_3)
But this is error prone, as spec_csv() can get it wrong: it is only guessing, so when every value in a column is NA it has to assign an arbitrary (character) class. Ideally I would like to extract column specifications directly from the original dataset, instead of writing and re-loading. How can I do that? I.e., how can I convert the classes of a data frame to a spec object?
Thanks!
Actual (inefficient) workflow:
library(tidyverse)
write_csv(iris, "iris.csv")
spec_csv("iris.csv") %>%
write_rds("col_specs_path.rda")
read_csv("iris.csv", col_types = read_rds("col_specs_path.rda"))

Reading zipped folder containing non-traditional spreadsheet

I'm trying to read a zipped file called etfreit.zip, linked under "Purchases from April 2016 onward".
Inside the zipped folder is a file called 2016.xls which is difficult to read as it contains empty rows along with Japanese text.
I have tried various ways of reading the xls from R, but I keep getting errors. This is the code I tried:
download.file("http://www3.boj.or.jp/market/jp/etfreit.zip", destfile="etfreit.zip")
unzip("etfreit.zip")
data <- read.csv(text=readLines("2016.xls")[-(1:10)])
I'm trying to skip the first 10 rows as I simply wish to read the data in the xls file. The code works only to the extent that it runs, but the data looks truly bizarre.
Would greatly appreciate any help on reading the spreadsheet properly in R for purposes of performing analysis.
There is more than one bizarre thing going on here, I think, but I had some success with the (somewhat older) gdata package:
data = gdata::read.xls("2016.xls")
By the way, treating an .xls file as CSV seldom works; actually, it shouldn't work at all :) Find a proper import function for your type of data and use it; don't assume that read.csv is going to take care of anything other than CSV (properly).
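For instance, if the file really is a genuine .xls spreadsheet, readxl is another option to try (a sketch; whether it copes with this particular file is untested here, and the skip value mirrors the row-dropping below):
# Attempt the same import with readxl, skipping the Japanese header rows
data <- readxl::read_excel("2016.xls", skip = 7)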
As per your comment: I'm not sure what you mean by "not properly aligned", but here is some code that cleans the data a bit, and gives you numeric variables instead of factors (note I'm using tidyr for that):
data2 = data[-c(1:7), -c(1, 6)]
names(data2) = c("date", "var1", "var2", "var3")
data2[, c(2:4)] = sapply(data2[, c(2:4)], tidyr::extract_numeric)
# Optionally convert the column with factor dates to Posixct
data2$date = as.POSIXct(data2$date)
Also, note that I am removing only the 7 upper rows; this seems to be the portion of the data that contains the header with Japanese text.
"Odd" unusual excel tables cab be read with the jailbreakr package. It is still in development, but looks pretty ace:
https://github.com/rsheets/jailbreakr

R: give data frames new names based on contents of their current name

I'm writing a script to plot data from multiple files. Each file is named using the same format, where strings between “.” give some info on what is in the file. For example, SITE.TT.AF.000.52.000.001.002.003.WDSD_30.csv.
These data will be from multiple sites, so SITE, or WDSD_30, or any other string, may differ depending on where the data is from, though its position in the file name will always indicate a specific feature such as location or measurement.
So far I have each file read into R and saved as a data frame named the same as the file. I'd like to get something like the following to work: if there is a data frame in the global environment that contains WDSD_30, then plot a specific column from that data frame. The column will always have the same name, so I could write plot(WDSD_30$meas), and no matter what site's files were loaded in the global environment, the script would find the WDSD_30 file and plot the meas variable. My goal for this script is to be able to point it to any folder containing files from a particular site, and no matter what the site, the script will be able to read in the data and find files containing the variables I'm interested in plotting.
A colleague suggested I try using strsplit() to break up the file name and extract the element I want to use, then use that to rename the data frame containing that element. I'm stuck on how exactly to do this or whether this is the best approach.
Here's what I have so far:
site.files <- basename(list.files(pattern = ".csv", recursive = TRUE, full.names = FALSE))
sfsplit <- lapply(site.files, function(x) strsplit(x, ".", fixed = TRUE)[[1]])
for (i in 1:length(site.files)) assign(site.files[i], read.csv(site.files[i]))
for (i in 1:length(site.files))
  if (sfsplit[[i]][10] == grep("PARQL", sfsplit[[i]][10])) {
    assign(data.frame.getting.named.PARQL, sfsplit[[i]][10])
  } else if (sfsplit[[i]][10] == grep("IRBT", sfsplit[[i]][10])) {
    assign(data.frame.getting.named.IRBT, sfsplit[[i]][10])
...and so on for each data frame I'd like to eventually plot from. Is this a good approach, or is there some better way? I'm also unclear on how to refer to the objects I made up for this example, data.frame.getting.named.xxxx, without using the entire file name as it was read into R. Is there something like data.frame[1] to generically refer to the 1st data frame in the global environment?
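A sketch of an alternative that avoids assign() altogether: keep all data frames in a named list, keyed by the token extracted from each file name (this assumes the measurement name is always the 10th dot-separated field, as in the example name above):
site.paths <- list.files(pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
all.data   <- lapply(site.paths, read.csv)

# Name each data frame by the 10th field of its file name, e.g. "WDSD_30"
names(all.data) <- sapply(basename(site.paths),
                          function(x) strsplit(x, ".", fixed = TRUE)[[1]][10])

# Any file can now be looked up by measurement name:
if ("WDSD_30" %in% names(all.data)) plot(all.data[["WDSD_30"]]$meas)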

R Save Files Into Array

I have multiple files (.txt and .csv) that I would like to essentially merge. They are all very different tables, so I would like to end up with about 30 sheets of different tables, and then be able to save that one file and index it later.
I've had trouble finding the most efficient way to do this, as most of my searches end up at merge(), which isn't applicable since these data files are all unique.
The biggest issue is that each data frame is different, varying in names of columns and number of rows, unlike similar questions that have been asked.
What's the best way to combine the tables I have into one array, and save it?
EDIT:
To add some more detail, I have essentially three different kinds of data frames from multiple different files:
.csv files, with table headers: "X" "gene" "baseMean" "log2FoldChange" "lfcSE" "stat" "pvalue" "padj" "TuLob" "TuDu"
one kind of .txt file, with headers: "hgnc_symbol" "ensembl_gene_id" "ensembl_transcript_id" "ensembl_peptide_id" "band" "chromosome_name" "start_position" "end_position" "transcript_start" "transcript_end" "description" "go_id" "name_1006" "transcript_source" "status"
and a second kind of .txt file, with headers: "hgnc_symbol" "ensembl_gene_id" "ensembl_transcript_id" "ensembl_peptide_id" "band" "chromosome_name" "start_position" "end_position" "transcript_start" "transcript_end" "description" "name_1006" "transcript_source" "status"
Again, I'm not trying to merge these tables, just to save them in a stack or three-dimensional array as one file, to be opened and indexed later.
I think what you want to do is use the save function to save the data in R's internal format.
df1 <- data.frame(x=rnorm(100))
df2 <- data.frame(y=rnorm(10), z=rnorm(10))
This gives us two data frames with different columns, rows, etc.
save(df1, df2, file="agg.RData")
This saves both objects to agg.RData.
You can later do a
load("agg.RData")
head(df1)
...
See also attach, which does what load does, only lazily, so it will only load the objects once you try to access them.
Finally, you can get some measure of isolation by specifying an environment for load:
agg <- new.env()
load("agg.RData", agg)
head(agg$df1)
...
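An alternative sketch: store all tables in one named list and save that single object with saveRDS(), so loading never clobbers anything in the workspace and indexing is explicit (the list names here are illustrative):
tables <- list(genes = df1, annot = df2)
saveRDS(tables, "agg.rds")

# Later: read the whole collection back as one object and index it
tables <- readRDS("agg.rds")
head(tables$genes)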
