I have a dataset created in an R session that I want to 1) export as CSV and 2) save the readr-style column specifications separately. This will allow me to read the data later with read_csv(), specifying col_types from the file saved in step 2.
Problem: one gets column specifications (the spec attribute) only for data read with a read_* function. It does not seem possible to obtain column specifications directly from a dataset created within R.
My workflow so far is:
Export the data: write_csv()
Read the specification back from the exported file: spec_csv()
Save the column specification: write_rds()
Then finally read_csv(step_1, col_types = step_3)
But this is error prone, as spec_csv() can get it wrong: it is only guessing, so when all values in a column are NA it has to fall back to an arbitrary (character) class. Ideally I would like to extract the column specification directly from the original dataset, instead of writing and re-loading. How can I do that? I.e., how can I convert the classes of a data frame to a spec object?
Thanks!
Actual (inefficient) workflow:
library(tidyverse)
write_csv(iris, "iris.csv")
spec_csv("iris.csv") %>%
  write_rds("col_specs_path.rda")
read_csv("iris.csv", col_types = read_rds("col_specs_path.rda"))
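One way to skip the write/re-read round trip is to map each column's class to the corresponding readr collector by hand. This is only a sketch covering the common classes (the helper names are my own, not readr API); anything unrecognized falls back to character:

```r
library(readr)

# Map one column to a readr collector (a sketch; extend as needed)
class_to_collector <- function(x) {
  if (is.factor(x))                col_factor(levels = levels(x))
  else if (inherits(x, "Date"))    col_date()
  else if (inherits(x, "POSIXct")) col_datetime()
  else if (is.integer(x))          col_integer()
  else if (is.numeric(x))          col_double()
  else if (is.logical(x))          col_logical()
  else                             col_character()
}

# Build a col_spec directly from the in-memory data frame
spec_from_df <- function(df) do.call(cols, lapply(df, class_to_collector))

spec_from_df(iris)
```

The resulting spec can then be saved with write_rds() and later passed straight to read_csv(col_types = ...), with no guessing involved.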
Related
I want to export fake.bc.Rdata from the package "qtl" into a CSV. When I run summary() it shows this is an object of class "cross", which makes me fail to convert it. I also tried resave, but there is a warning: cannot coerce class ‘c("bc", "cross")’ to a data.frame.
Thank you all for your help in advance!
CSV stands for comma-separated values, and is not suitable for all kinds of data.
As indicated in the comments, it requires clear columns and rows.
Take this JSON as an example:
{
  "name": "John",
  "age": 30,
  "likes": ["Walking", "Running"]
}
If you were to represent this in CSV format, how would you deal with the list-valued "likes" field? One way would be to repeat the data:
name,age,likes
John,30,Walking
John,30,Running
But that doesn't really look right. Even if you merged the two rows into one, you would still have trouble reading the data back, e.g.
name,age,likes
John,30,Walking/Running
Thus, CSV is best suited for tidy data.
TL;DR
Can your data be represented tidily as comma-separated values, or should you be looking at alternative forms of exporting your data?
EDIT:
It appears you do have some options:
If you look at the reference, you have the option to export your data using write.cross().
For your data, you could use write.cross(fake.bc, "csv", "myCrossData", c(1,5,13)). It then does the following:
Comma-delimited formats: a single csv file is created in the formats
"csv" or "csvr". Two files are created (one for the genotype data and
one for the phenotype data) for the formats "csvs" and "csvsr"; if
filestem="file", the two files will be named "file_gen.csv" and
"file_phe.csv".
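Putting that together with the example data shipped in qtl (fake.bc is the dataset from the question), the call would look roughly like this:

```r
library(qtl)
data(fake.bc)  # example backcross object of class c("bc", "cross")

# Export chromosomes 1, 5 and 13 in comma-delimited format;
# with format = "csv" a single file myCrossData.csv is created
write.cross(fake.bc, format = "csv", filestem = "myCrossData",
            chr = c(1, 5, 13))
```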
I am creating a data frame from a CSV file. However, when I run the code below, it does not recognize all the objects in the file. It recognizes some of them, but not all.
smallsample <- data.frame(read.csv("SmallSample.csv",header = TRUE),smallsample$age,smallsample$income,smallsample$gender,smallsample$marital,smallsample$numkids,smallsample$risk)
smallsample
It won't recognize marital or numkids, despite the fact that those are the column names in the table in the .csv file.
When you use read.csv, the output is already a data frame.
You can simply use smallsample <- read.csv("SmallSample.csv")
Result using a dummy csv file
  age income gender marital numkids risk
1  32  34932 Female  Single       1 0.9611315
2  22  50535   Male  Single       0 0.7257541
3  40  42358   Male  Single       1 0.6879534
4  40  54648   Male  Single       3 0.5680680
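A minimal, self-contained version of the same idea, with dummy data written to a temporary file (column names taken from the question):

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,income,gender,marital,numkids,risk",
             "32,34932,Female,Single,1,0.9611315",
             "22,50535,Male,Single,0,0.7257541"), tmp)

smallsample <- read.csv(tmp)  # already a data frame
smallsample$marital           # columns are accessible directly
smallsample$numkids
```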
I'm trying to write data to an existing Excel file from R, while preserving the formatting. I'm able to do so following the answer to this question (Write from R into template in excel while preserving formatting), except that my file includes empty columns at the beginning, and so I cannot just begin to write data at cell A1.
As a solution I was hoping to be able to find the first non-empty cell, then start writing from there. If I run read.xlsx(file="myfile.xlsx") using the openxlsx package, the empty columns and rows are automatically removed, and only the data is left, so this doesn't work for me.
So I thought I would first load the worksheet using wb <- loadWorkbook("file.xlsx") so I have access to getStyles(wb) (which works). However, the subsequent command getTables returns character(0), and wb$tables returns NULL. I can't figure out why this is. Am I right that these variables would tell me the first non-empty cell?
I've tried manually removing the empty columns and rows preceding the data, straight in the Excel file, but that doesn't change things. Am I on the right path here or is there a different solution?
As suggested by Stéphane Laurent, the package tidyxl offers the perfect solution here.
For instance, I can now search the Excel file for a character value, like my variable names of interest ("Item", "Score", and "Mean", which correspond to the names() of the data.frame I want to write to my Excel file):
require(tidyxl)
colnames <- c("Item","Score","Mean")
excelfile <- "FormattedSheet.xlsx"
x <- xlsx_cells(excelfile)
# Find all cells with character values: return their address (i.e., Cell) and character (i.e., Value)
chars <- x[x$data_type == "character", c("address", "character")]
starting.positions <- unlist(
chars[which(chars$character %in% colnames), "address"]
)
# returns: c(C6, D6, E6)
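Once you have those addresses, you can turn them into numeric indices, e.g. to feed openxlsx::writeData(startCol =, startRow =). A quick sketch that handles single-letter columns only ("C6" is the assumed first header cell from above):

```r
addr <- "C6"  # first non-empty header cell found above (assumed)

col_num <- match(gsub("[0-9]", "", addr), LETTERS)  # "C" -> 3
row_num <- as.integer(gsub("[A-Z]", "", addr))      # "6" -> 6
```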
I am using openxlsx's read.xlsx to import a data frame from a mixed-type column. The desired result is to import all values as strings, exactly as they are represented in Excel. However, some decimals are imported as very long floats.
Sample data is simply an Excel file with a column containing the following rows:
abc123,
556.1,
556.12,
556.123,
556.1234,
556.12345
require(openxlsx)
df <- read.xlsx('testnumbers.xlsx')
Using the above R code to read the file results in df containing these string values:
abc123,
556.1,
556.12,
556.12300000000005,
556.12339999999995,
556.12345000000005
The Excel file provided in production has the column formatted as "General". If I format the column as Text, there is no change unless I explicitly double-click each cell in Excel and hit enter. In that case, the number is correctly displayed as a string. Unfortunately, clicking each cell isn't an option in the production environment. Any solution, Excel, R, or otherwise is appreciated.
Edit:
I've read through this question and believe I understand the math behind what's going on. At this point, I suppose I'm looking for a workaround. How can I get a float from Excel into an R data frame as text, without changing the representation?
Why Are Floating Point Numbers Inaccurate?
I was able to get the correct formats into a data frame using pandas in python.
import pandas as pd
test = pd.read_excel('testnumbers.xlsx', dtype = str)
This will suffice as a workaround, but I'd like to see a solution built in R.
Here is a workaround in R using openxlsx that I used to solve a similar issue. I think it will solve your question, or at least allow you to format as text in the excel files programmatically.
I will use it to reformat specific cells in a large number of files (I'm converting from general to 'scientific' in my case- as an example of how you might alter this for another format).
This uses functions in the openxlsx package that you reference in the OP
First, load the xlsx file in as a workbook (stored in memory, which preserves all the xlsx formatting, etc.; slightly different from the method shown in the question, which pulls in only the data):
testnumbers <- loadWorkbook(here::here("test_data/testnumbers.xlsx"))
Then create a "style" that converts the numbers to "text" and apply it to the virtual worksheet (in memory):
numbersAsText <- createStyle(numFmt = "TEXT")
addStyle(testnumbers, sheet = "Sheet1", style = numbersAsText, cols = 1, rows = 1:10)
Finally, save it back to the original file:
saveWorkbook(testnumbers,
             file = here::here("test_data/testnumbers_formatted.xlsx"),
             overwrite = TRUE)
When you open the Excel file, the numbers will be stored as "text".
I'm importing a csv file into R. I read a post here that said in order to get R to treat the first row of data as headers I needed to include the call header=TRUE.
I'm using the import function for RStudio and there is a Code Preview section in the bottom right. The default is:
library(readr)
existing_data <- read_csv("C:/Users/rruch/OneDrive/existing_data.csv")
View(existing_data)
I've tried placing header=TRUE in the following places:
read_csv(header=TRUE, "C:/Users...)
existing_data.csv", header=TRUE
after 2/existing_data.csv")
Would anyone be able to point me in the right direction?
You should use col_names instead of header. Try this:
library(readr)
existing_data <- read_csv("C:/Users/rruch/OneDrive/existing_data.csv", col_names = TRUE)
There are two different functions to read CSV files (actually far more than two): read.csv from the utils package and read_csv from the readr package. The first one takes a header argument and the second one takes col_names.
You could also try fread function from data.table package. It may be the fastest of all.
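For completeness, a small self-contained sketch of the data.table alternative (the two-column file here is dummy data; fread, like read.csv, accepts header =):

```r
library(data.table)

tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b"), tmp)

dt <- fread(tmp, header = TRUE)  # first row used as column names
```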
Good luck!
It looks like there is one variable name that is correctly identified as a variable name (notice your first column). I would guess that your first row only contains the variable "Existing Product List", and that your other variable names are actually contained in the second row. Open the file in Excel or LibreOffice Calc to confirm.
If it is indeed the case that all of the variable names you've listed (including "Existing Product List") are in the first row, then you're in the same boat as me. In my case, the first row contains all of my variables, however they appear as both variable names and the first row of observations. Turns out the encoding is messed up (which could also be your problem), so my solution was simply to remove the first row.
library(readr)
mydat = read_csv("my-file-path-&-name.csv")
mydat = mydat[-1, ]