I would like to create some R functions and combine them in a small R package for a university course. Those functions will take input values (tree diameter at breast height) and estimate an output parameter (biomass) based on a function with parameters published in an article. Those parameters depend on the tree species and the functions are of this form (just a simplified example):
fun <- function(diameter, species) {
  parameter <- dataframe$parameter[which(dataframe$species == species)]
  mass <- parameter * diameter
  return(mass)
}
Hence, the function will have to look up those parameters in a table (named dataframe in this example).
Now I wondered what the best practice would be to implement this. I have the data as Excel tables and I can load them as data.frames in R. I could use dput() and paste the output into my functions, so they contain the data to look up the values. However, this approach is probably not the most efficient one, and it would make the functions pretty much unreadable.
Creating the data.frame as a global variable in the user's environment is probably also not best practice?
So, I wondered, how data sets should be included in R functions. Unfortunately, I was too stupid to google a solution (searching for terms such as "data.frame" or "function" obviously leads to a lot of stuff not related to my question; I have no useful keywords to start with). I hope this question is not a duplicate (the suggestions show very different questions).
The question is whether the data are to be user-visible as a test data set. In that case the right place for the data frame is a file in the /data subdirectory of the package; this is the standard method. However, if the data are just parameters, one can define a local environment in the package and then place the data in an .R script together with the function in the /R folder of the package.
You may have a look at the source code of package marelac, a package for aquatic sciences that uses both techniques.
Finally, one can store it in R/sysdata.rda as shown in the link provided by @Gabor Grothendieck.
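If you go the internal-data route, a minimal sketch using the usethis helper (run once during package development; the object name dataframe is taken from the question):
usethis::use_data(dataframe, internal = TRUE)
# creates R/sysdata.rda; the object is then available to the
# package's own functions without being exported to users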
Briefly, the method used in marelac: First, there is a file aaa.R that defines a new environment as follows:
.marelac <- new.env()
The file name aaa.R ensures that this file is loaded first; the leading dot makes it a "hidden" variable.
Then a function can make use of it:
## -----------------------------------------------------------------------------
## Seawater Composition
## function taken from package marelac, license: GPL >= 2
## -----------------------------------------------------------------------------
.marelac$sw_comp <- c(Na = 0.3065958, Mg = 0.0365055, Ca = 0.0117186,
    K = 0.0113495, Sr = 0.0002260, Cl = 0.5503396, SO4 = 0.0771319,
    HCO3 = 0.0029805, Br = 0.0019134, CO3 = 0.0004078, BOH4 = 0.0002259,
    F = 0.0000369, OH = 0.0000038, BOH3 = 0.0005527, CO2 = 0.0000121)

sw_comp <- function(species = c("Na", "Mg", "Ca", "K", "Sr", "Cl", "SO4", "HCO3",
                                "Br", "CO3", "BOH4", "F", "OH", "BOH3", "CO2")) {
  species <- match.arg(species, several.ok = TRUE)
  .marelac$sw_comp[species]
}
Here sw_comp is the standard composition of seawater. This is the smallest (almost trivial) function using this technique, and there are some others, e.g. gas_solubility or diffcoeff.
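Applied to the biomass question, the same technique might look roughly like this (a minimal sketch; the names .biomass and estimate_mass are placeholders and the parameter values are invented for illustration):
# in R/aaa.R: package-local environment, loaded first
.biomass <- new.env()

# in the same .R file as the function; invented example values
.biomass$params <- data.frame(
  species   = c("beech", "oak", "spruce"),
  parameter = c(0.12, 0.15, 0.09)
)

estimate_mass <- function(diameter, species) {
  p <- .biomass$params
  p$parameter[match(species, p$species)] * diameter
}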
The R package vtreat provides a handy way of creating "one-hot encoders" for categorical variables (see a relevant post at the Win-Vector blog). Is there any way to save the treatment plan object tplan for further use (e.g., an equivalent of Python's pickle mechanism)?
tplan <- vtreat::designTreatmentsZ(dTrain, vars)
oneHotEncoded <- as.matrix(vtreat::prepare(tplan, dTrain, varRestriction = vars))
I would like to transform whatever data I get with this particular treatment plan (which was computed on dTrain), in a situation where dTrain is no longer available. That is, I cannot re-use dTrain the next time I call the script.
P.S. The solution need not necessarily be confined to using vtreat.
Base R provides the general functions save() and load() for such purposes.
Here is a reproducible example using code snippets from the post you have linked to:
library(titanic)
library(vtreat)
data(titanic_train)
outcome <- 'Survived'
target <- 1
shouldBeCategorical <- c('PassengerId', 'Pclass', 'Parch')
for (v in shouldBeCategorical) {
  titanic_train[[v]] <- as.factor(titanic_train[[v]])
}
tooDetailed <- c("Ticket", "Cabin", "Name", "PassengerId")
vars <- setdiff(colnames(titanic_train), c(outcome, tooDetailed))
dTrain <- titanic_train
set.seed(4623762)
tplan <- vtreat::designTreatmentsZ(dTrain, vars,
                                   minFraction = 0,
                                   verbose = FALSE)
save(tplan, file='tplan.RData')
The file tplan.RData will be saved in your current working directory; afterwards, in a new R session, when you ask for
load('tplan.RData')
you will get your tplan variable back.
Alternatively, the base R functions saveRDS() and readRDS() will also do the job; the usage is analogous, except that readRDS() returns the object itself, so you assign it to whatever name you choose. Many consider this preferable.
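A minimal sketch of that variant (the file name is arbitrary):
saveRDS(tplan, file = "tplan.rds")
# later, in a fresh session: readRDS() returns the object,
# so you pick the name at assignment time
tplan <- readRDS("tplan.rds")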
I have multiple time series (each in a separate file), which I need to adjust seasonally using the seasonal package in R, and then store each adjusted series in a separate file again, in a different directory.
The code works for a single county. So I tried to use a for loop, but R is unable to use read.dta() with a wildcard.
I'm new to R and usually use Stata, so the question may be quite stupid and my code quite messy.
Sorry and Thanks in advance
Nathan
for (i in 1:402) {
  alo[i] <- read.dta("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/SINGLE_SERIES/County[i]")
  alo_ts[i] <- ts(alo[i], freq = 12, start = 2007)
  m[i] <- seas(alo_ts[i])
  original[i] <- as.data.frame(original(m[i]))
  adjusted[i] <- as.data.frame(final(m[i]))
  trend[i] <- as.data.frame(trend(m[i]))
  irregular[i] <- as.data.frame(irregular(m[i]))
  County[i] <- data.frame(cbind(adjusted[i], original[i], trend[i], irregular[i], deparse.level = 1))
  write.dta(County[i], "/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/ADJUSTED_SERIES/County[i].dta")
}
This is a good place to use a function and the *apply family. As noted in a comment, your main problem is likely to be that you're using Stata-like character string construction that will not work in R. You need to use paste (or paste0, as here) rather than just passing the indexing variable directly in the string like in Stata. Here's some code:
f <- function(i) {
  d <- read.dta(paste0("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/SINGLE_SERIES/County", i, ".dta"))
  alo_ts <- ts(d, freq = 12, start = 2007)
  m <- seas(alo_ts)
  original <- as.data.frame(original(m))
  adjusted <- as.data.frame(final(m))
  trend <- as.data.frame(trend(m))
  irregular <- as.data.frame(irregular(m))
  County <- cbind(adjusted, original, trend, irregular, deparse.level = 1)
  write.dta(County, paste0("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/ADJUSTED_SERIES/County", i, ".dta"))
  invisible(County)
}
# return a list of all of the resulting datasets
lapply(1:402, f)
It would probably also be a good idea to take advantage of relative directories by first setting your working directory:
setwd("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/")
Then you can simplify the above paths to:
d <- read.dta(paste0("./SINGLE_SERIES/County",i,".dta"))
and
write.dta(County, paste0("./ADJUSTED_SERIES/County",i,".dta"))
which will make your code more readable and more reproducible should someone, for example, ever run it on another computer.
I want to share some software as a package but some of my scripts do not seem to go very naturally as functions. For example consider the following chunk of code where 'raw.df' is a data frame containing variables of both discrete and continuous kinds. The functions 'count.unique' and 'squash' will be defined in the package. The script splits the data frame into two frames, 'cat.df' to be treated as categorical data and 'cts.df' to be treated as continuous data.
My idea of how this would be used is that the user would read in the data frame 'raw.df', source the script, then interactively edit 'cat.df' and 'cts.df', perhaps combining some categories and transforming some variables.
dcutoff <- 9
tail(raw.df)
(nvals <- apply(raw.df, 2, count.unique))
p <- dim(raw.df)[2]
(catvar <- (1:p)[nvals <= dcutoff])
p.cat <- length(catvar)
(ctsvar <- (1:p)[nvals > dcutoff])
p.cts <- length(ctsvar)
cat.df <- raw.df[ ,catvar]
for (i in 1:p.cat) cat.df[ ,i] <- squash(cat.df[ ,i])
head(cat.df)
for(i in 1:p.cat) {
cat(as.vector(table(cat.df[ ,i])), "\n")
}
cts.df <- raw.df[ ,ctsvar]
for(i in 1:p.cts) {
cat( quantile(cts.df[ ,i], probs = seq(0, 1, 0.1)), "\n")
}
Now this could, of course, be made into a function returning a list containing nvals, p, p.cat, cat.df, etc.; however, this seems rather ugly to me. And the only provision for including scripts in a package seems to be the demo folder, which does not seem to be the right way to go. Advice on how to proceed would be gratefully received.
(But the gratitude would not be formally expressed as it seems that using a comment to express thanks is deprecated.)
It is better to encapsulate your code in a function. Returning a list is not ugly: S3 objects, for example, are just lists with a class attribute.
object <- list(attribute.name = something, ...)
class(object) <- "cname"
return(object)
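Applied to your script, a minimal sketch (reusing your object names; split_frame and the class name "splitframe" are placeholders, and count.unique and squash are assumed to come from your package):
split_frame <- function(raw.df, dcutoff = 9) {
  nvals <- apply(raw.df, 2, count.unique)
  catvar <- which(nvals <= dcutoff)
  ctsvar <- which(nvals > dcutoff)
  cat.df <- raw.df[, catvar, drop = FALSE]
  for (i in seq_along(catvar)) cat.df[, i] <- squash(cat.df[, i])
  object <- list(nvals = nvals, cat.df = cat.df,
                 cts.df = raw.df[, ctsvar, drop = FALSE])
  class(object) <- "splitframe"
  object
}
The user can then inspect and interactively edit the components, e.g. res <- split_frame(raw.df); res$cat.df.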
You can also use the inst folder (as mentioned in Dirk's comment), since the contents of the inst subdirectory will be copied recursively to the installation directory.
You create an inst folder:
inst
----scripts
some_scripts.R
You can call it from a function in your package and use the system.file() mechanism to load it.
load_myscript <- function(){
source(system.file(package='your_pkg_name','scripts/some_scripts.R'))
}
You call it as any other function in your package:
load_myscript()
At my company, we are thinking of gradually phasing out SPSS in favour of R. During the transition, though, we'll still have data coming in the SPSS data file format (.sav).
I'm having issues importing this SPSS data file into R. When I import an SPSS file into R, I want to retain both the values and the value labels of the variables. The read.spss() function from the foreign package gives me the option to retain either the values OR the value labels of a variable, but not both.
AFAIK, R does allow factor variables to have values (levels) and value labels (level labels). I was just wondering if it's possible to somehow modify the read.spss() function to incorporate this.
Alternatively, I came across the spss.system.file() function from the memisc package, which supposedly allows this to happen, but it asks for a separate syntax file (codes.file), which is not always available to me.
Here's a sample data file.
I'd appreciate any help resolving this issue.
Thanks.
I do not know how to read in SPSS metadata; I usually read .csv files and add metadata back, or write a small one-off Perl script to do the job. What I wanted to mention is that a recently published R package, Rz, may assist you with bringing SPSS data into R. I have had a quick look at it and it seems useful.
There is a way to read SPSS data files into R via an ODBC driver.
1) There is an IBM SPSS Statistics Data File Driver. I could not find the download link; I got it from my SPSS provider. The Standalone Driver is all you need. You do not need SPSS to install or use the driver.
2) Create a DSN for the SPSS data driver.
3) Using the RODBC package you can then read any SPSS data file into R. It is possible to get the value labels for each variable as separate tables, and then use the labels in R in any way you wish.
Here is a working example on Windows (I do not have SPSS on my computer now) that reads your example data file into R. I have not tested this on Linux, but it probably works there too, because there is an SPSS data driver for Linux as well.
require(RODBC)
# Create connection
# Change the DSN name and CP_CONNECT_STRING according to your setting
con <- odbcDriverConnect("DSN=spss_ehsis;SDSN=SAVDB;HST=C:\\Program Files\\IBM\\SPSS\\StatisticsDataFileDriver\\20\\Standalone\\cfg\\oadm.ini;PRT=StatisticsSAVDriverStandalone;CP_CONNECT_STRING=C:\\temp\\data_expt.sav")
# List of tables
Tables <- sqlTables(con)
Tables
# List of table names to extract
table.names <- Tables$TABLE_NAME[Tables$TABLE_SCHEM != "SYSTEM"]
# Function to query a table by name
sqlQuery.tab.name <- function(table) {
sqlQuery(con, paste0("SELECT * FROM [", table, "]"))
}
# Retrieve all tables
Data <- lapply(table.names, sqlQuery.tab.name)
# See the data
lapply(Data, head)
# Close connection
close(con)
For example, we can see that value labels are defined for two variables:
[[5]]
VAR00002 VAR00002_label
1 1 Male
2 2 Female
[[6]]
VAR00003 VAR00003_label
1 2 Student
2 3 Employed
3 4 Unemployed
Additional information
Here is a function that allows you to read SPSS data after a connection has been made to the SPSS data file. The function allows you to specify the list of variables to be selected. If value.labels = TRUE, the selected variables that have value labels in the SPSS data file are converted to R factors with the labels attached.
I have to say I am not satisfied with the performance of this solution. It works well for small data files, but the RAM limit is reached quite often for large SPSS data files (even when only a subset of variables is selected).
get.spss <- function(channel, variables = NULL, value.labels = F) {
VarNames <- sqlQuery(channel = channel,
query = "SELECT VarName FROM [Variables]", as.is = T)$VarName
if (is.null(variables)) variables <- VarNames else {
if (any(!variables %in% VarNames)) stop("Wrong variable names")
}
if (value.labels) {
ValueLabelTableName <- sqlQuery(channel = channel,
query = "SELECT VarName FROM [Variables]
WHERE ValueLabelTableName is not null",
as.is = T)$VarName
ValueLabelTableName <- intersect(variables, ValueLabelTableName)
}
variables <- paste(variables, collapse = ", ")
data <- sqlQuery(channel = channel,
query = paste("SELECT", variables, "FROM [Cases]"),
as.is = T)
if (value.labels) {
for (var in ValueLabelTableName) {
VL <- sqlQuery(channel = channel,
query = paste0("SELECT * FROM [VLVAR", var,"]"),
as.is = T)
data[, var] <- factor(data[, var], levels = VL[, 1], labels = VL[, 2])
}
}
return(data)
}
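Usage would then be along these lines (a sketch, reusing the connection con from the example above):
df <- get.spss(con, variables = c("VAR00002", "VAR00003"), value.labels = TRUE)
head(df)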
My work is going through the same transition.
read.spss() returns the variable labels as an attribute of the object you create with it. So in the example below I have a data frame called rvm, created by read.spss() with to.data.frame = TRUE. It has 3,500 variables with short names a1, a2, etc., but long labels for each variable in SPSS. I can access the variable labels with
cbind(attributes(rvm)$variable.labels)
which returns the full names of all 3,500 variables, the tail of which looks like
…
x23 "Other Expenditure Uncapped Daily Expenditure In Region"
x24 "Accommodation Expenditure In Region"
x25 "Food/Meals/Drink Expenditure In Region"
x26 "Local Transport Expenditure In Region"
x27 "Sightseeing/Attractions Expenditure In Region"
x28 "Event/Conference Expenditure In Region"
x29 "Gambling/Casino Expenditure In Region"
x30 "Gifts/Souvenirs Expenditure In Region"
x31 "Other Shopping Expenditure In Region"
x0 "Accommodation Daily Expenditure In Region"
What to do with these is another matter, but at least I have them, and if I want I can put them in some other object for safekeeping, search them with grep, etc.
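For example, a quick way to search them (a sketch, reusing the rvm object from above):
labs <- attributes(rvm)$variable.labels
grep("Expenditure", labs, value = TRUE)  # all labels mentioning a keyword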
Since you have SPSS available, I recommend installing the "Essentials for R" plugin (free of charge, but you need to register, also see the installation instructions) which allows you to run R within SPSS. The plugin includes an R package with functions that transfer the active SPSS data frame to R (and back) - including labeled factor levels, dates, German umlauts - details that are otherwise notoriously difficult. In my experience, it is more reliable than R's own foreign package.
Once you have everything set up, open the data in SPSS, and run something like the following code in the syntax window:
begin program r.
myDf <- spssdata.GetDataFromSPSS(missingValueToNA=TRUE,
factorMode="labels",
rDate="POSIXct")
save(myDf, file="d:/path/to/your/myDf.Rdata")
end program.
Essentials for R plugin link (apparently breaks markdown link syntax):
https://www.ibm.com/developerworks/mydeveloperworks/wikis/home/wiki/We70df3195ec8_4f95_9773_42e448fa9029/page/Downloads%20for%20IBM®%20SPSS®%20Statistics?lang=en
Nowadays, the package haven provides the functionality to achieve what you want (and much more).
The function read_sav() can import *.sav and *.zsav files and returns a tibble. Variable labels are automatically stored in the label attribute of the corresponding variables within that tibble, and value labels in the labels attribute. The labelled class preserves the original semantics and allows arbitrary labels to be associated with numeric or character vectors. If needed, we can use the function as_factor() to coerce labelled vectors, or even all labelled vectors within a data.frame or tibble at once, to factors.
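A minimal sketch (the file name and variable name are hypothetical):
library(haven)
df <- read_sav("survey.sav")
attr(df$gender, "label")           # variable label
attr(df$gender, "labels")          # value labels
df$gender <- as_factor(df$gender)  # convert to a factor with those labels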
I want to use ChemoSpec with mass spectra of about 60,000 data points each.
I already have them in one txt file as a matrix (X + 90 samples = 91 columns; 60,000 rows).
How can I adapt this file as spectra data without exporting each single file again in CSV format (which is quite slow in R given the size of my data)?
The typical (and only?) way to import data into ChemoSpec is by way of the getManyCsv() function, which, as the question indicates, requires one CSV file for each sample.
Creating 90 CSV files from the 91-column, 60,000-row file described may be somewhat slow and tedious in R, but it could be done with a standalone application, whether an existing utility or some ad-hoc script.
An R-only solution would be to create a new method, say getOneBigCsv(), adapted from getManyCsv(). After all, the logic of getManyCsv() is relatively straightforward.
Don't expect such a solution to be sizzling fast, but it should compare with the time it takes to run getManyCsv(), and it avoids having to create and manage the many files, hence it would be faster overall and certainly less messy.
Sorry I missed your question 2 days ago. I'm the author of ChemoSpec - always feel free to write directly to me in addition to posting somewhere.
The solution is straightforward. You already have your data in a matrix (after you read it in with read.csv("file.txt")), so you can use it to manually create a Spectra object. In the R console, type ?Spectra to see the structure of a Spectra object, which is a list with specific entries. You will need to put your X column (which I assume is mass) into the freq slot. Then the rest of the data matrix will go into the data slot. Then manually create the other needed entries (making sure the data types are correct). Finally, assign the Spectra class to your completed list by doing something like class(my.spectra) <- "Spectra" and you should be good to go. I can give you more details on or off list if you describe your data a bit more fully. Perhaps you have already solved the problem?
By the way, ChemoSpec is totally untested with MS data, but I'd love to find out how it works for you. There may be some changes that would be helpful so I hope you'll send me feedback.
Good Luck, and let me know how else I can help.
Many years have passed and I am not sure whether anybody is still interested in this topic, but I had the same problem and wrote a little workaround to convert my data to the class 'Spectra' by extracting the information from the data itself:
# Assumption:
# data are stored as a numeric data.frame with column names representing samples
# and row names containing the domain axis
dataframe2Spectra <- function(Spectrum_df,
freq = as.numeric(rownames(Spectrum_df)),
data = as.matrix(t(Spectrum_df)),
names = paste("YourFileDescription", 1:dim(Spectrum_df)[2]),
groups = rep(factor("Factor"), dim(Spectrum_df)[2]),
colors = rainbow(dim(Spectrum_df)[2]),
sym = 1:dim(Spectrum_df)[2],
alt.sym = letters[1:dim(Spectrum_df)[2]],
unit = c("a.u.", "Domain"),
desc = "Some signal. Describe it with 'desc'"){
  features <- c("freq", "data", "names", "groups", "colors", "sym", "alt.sym", "unit", "desc")
  Spectrum_chem <- vector("list", length(features))
  names(Spectrum_chem) <- features
  Spectrum_chem$freq <- freq
  Spectrum_chem$data <- data
  Spectrum_chem$names <- names
  Spectrum_chem$groups <- groups
  Spectrum_chem$colors <- colors
  Spectrum_chem$sym <- sym
  Spectrum_chem$alt.sym <- alt.sym
  Spectrum_chem$unit <- unit
  Spectrum_chem$desc <- desc
  # important step: assign the Spectra class
  class(Spectrum_chem) <- "Spectra"
  # some sanity checks
  if (length(freq) != dim(data)[2]) print("Dimension of data is NOT #samples x length of freq")
  if (length(names) > dim(data)[1]) print("Too many names")
  if (length(names) < dim(data)[1]) print("Too few names")
  if (length(groups) > dim(data)[1]) print("Too many groups")
  if (length(groups) < dim(data)[1]) print("Too few groups")
  if (length(colors) > dim(data)[1]) print("Too many colors")
  if (length(colors) < dim(data)[1]) print("Too few colors")
  if (!is.matrix(data)) print("'data' is not a matrix or it's not numeric")
  return(Spectrum_chem)
}
Spectrum_chem <- dataframe2Spectra(Spectrum)
chkSpectra(Spectrum_chem)