How to save and load the output of seqefsub() in TraMineR - r

I have a long dataset where I want to experiment with different settings for the seqefsub() function, and depending on the settings, one run can take relatively long. Therefore I want the computer to calculate all the different variations so that I can evaluate the results later and possibly use them for further processing.
My problem is that when I save the results to a file and load them again, the structure of the data appears to be broken. As a result I cannot use the TraMineR functions on this data after loading it, so I have to redo all the calculations every time I close R.
Saving the workspace with RStudio (.RData) gives the same error, and so does saving to a binary format.
This is how the sequence list looks in RStudio before saving:
And after loading:
This is the code I used for this example:
library(TraMineR)
data(actcal.tse)
seqe <- seqecreate(actcal.tse[1:100, ])
fsub <- seqefsub(seqe, minSupport = 0.1)
save(fsub, file="fsub.rda")
rm(fsub)
load("fsub.rda")
Details of my system:
x86_64-pc-linux-gnu (Ubuntu 14.04 LTS)
R version 3.2.0 (2015-04-16)
RStudio Version 0.98.1103
TraMineR stable version 1.8-9 (Built: 2015-04-22)

If you check the value returned by seqefsub(), it is a subseqelist object. This kind of object contains other objects, listed in the docs as:
seqe: The list of sequences in which the subsequences were searched (a seqelist event sequence object).
subseq: A list of subsequences (a seqelist event sequence object).
data: A data frame containing details (support, frequency, ...) about the subsequences
And others. What I did to save the results was to extract only the data I needed into plain vectors and build a data frame with them before saving it.
library(TraMineR)
data(actcal.tse)
seqe <- seqecreate(actcal.tse[1:100, ])
fsub <- seqefsub(seqe, minSupport = 0.1)
# Get only the data I need
# (explore the other objects to get what you need) ====
# Get the Support column from data (which is a data frame)
support <- fsub$data$Support
# subseq cannot be converted to a data frame directly;
# it stores the subsequences found, so I convert them to strings
sequences <- as.character(fsub$subseq)
# Build the data frame
result <- data.frame(sequences, support)
# Save it in the home directory
save(result, file = "~/result.rda")
rm(result)
load("~/result.rda")
I hope it still helps you.

Related

R show variable list/header of Stata or SAS file in R without loading the complete dataset

I am given very big (around 10 Gb each) datasets in both SAS and Stata format. I am going to read them into R for analysis.
Is there a way to show what variables (columns) they contain inside without reading the whole data file? I often only need some of the variables. I can view them of course from File Explorer, but it's not reproducible and takes a lot of time.
Both SAS and Stata are available on the system, but just opening a file might take a minute or so.
If you have SAS, run a proc contents or proc datasets to see the details of the dataset without opening it. You may want to do that anyway, so that you can verify variable types, lengths, and formats.
libname myFiles 'path to your sas7bdatfiles';
proc contents data=myfiles.datasetName;
run;
See below for the dta solution, which you can update to SAS using read_sas.
library(haven)
# read in first row of dta
dta_head <- read_dta("my_data.dta",
n_max = 1)
# get variable names of dta
dta_names <- names(dta_head)
After examining the names and labels of your dta file, you can then remove the n_max = 1 option and read in the full file, possibly adding the col_select option to specify the subset of variables you wish to read in.
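As a rough sketch of that last step (the file name and variable names below are placeholders for your own):
library(haven)
# read only the variables you actually need, skipping the rest of the large file
dta_subset <- read_dta("my_data.dta",
                       col_select = c(id, age, income))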

rbind and bind_rows not working on Mac but works on Windows

I am working on importing multiple csv files which contain data from climate sensors (temperature, humidity, etc.). My code is below. First, I get a list of all the files I am importing and the length of that list. I then have a for loop cycling through each file and, inside of that, an if statement checking whether the file is a csv. In the if block, the code reads the file in and temporarily stores it in "store". Afterwards, a data frame which holds all of the data, called "main", and "store" are combined using rbind.
file_list <- list.files(".")
num_files <- length(file_list)
main <- data.frame()
for (i in 1:num_files){
check <- stri_length(file_list[i])
if (substr(file_list[i],start = check - 2, stop = check) == "csv"){
store <- read.csv(file_list[i])
main <- rbind(main, store)
}
However, when I run the code, I get the following error: "Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match"
There are a few things that are interesting about this code and error. First, when I run this code on a Windows computer, it works brilliantly with no error, but it does not work on my Mac. Second, all the columns in the imported data are identical, so rbind should work, and it does on Windows. Third, when I examine the empty data frame "main", it has very weird column names that do not match up with the data, for example X21092, X16, X5.2, etc. Hence, it will not bind the temporarily stored data and the empty data frame. When I do not use rbind and try something different such as bind_rows, I get the same wacky column names that do not match up with anything, so the problem is not with rbind. Finally, when I run code that uses a very similar for loop with rbind on my Mac, it works just as expected.
I have tried clearing everything, such as the environment, using the code on a new file, and uninstalling R and RStudio, all of which yield the same result and error.
Is there some reason I am getting this error on my Mac, and what should I do to fix it?
Here is a photo of the data frame when it runs correctly, and an image of the data frame when I get the error on my Mac (screenshots omitted).

How to export an R dataframe to a Power Query table

I'm using an R script within Power Query to do some data transformations and return a scaled table.
My R code is like this:
# 'dataset'
It does seem odd that this fails to return. A quick glance online gave this 3-minute YouTube video, which uses the same method you are using. Searching a bit further, one may come across the Microsoft documentation, which gives a possible reason why there might be an issue.
When preparing and running an R script in Power BI Desktop, there are a few limitations:
Only data frames are imported, so make sure the data you want to import to Power BI is represented in a data frame
Columns that are typed as Complex and Vector are not imported, and are replaced with error values in the created table
These seem like the most obvious reasons. Betting that there are no complex columns in your dataset, I believe the former is the likely reason. A quick recreation of your dataset shows that the scale function changes your dataset into a matrix class object. This class is kept by cbind, and as such output is of class matrix and not data.frame.
> dataset <- as.data.frame(abs(matrix(rnorm(1000), ncol = 4)))
> class(dataset)
[1] "data.frame"
> library(dplyr)
> df_normal <- log(dataset + 1) %>%
+   select(c(2:4)) %>%
+   scale
> class(df_normal)
[1] "matrix"
> df_normal <- cbind(dataset[, 1], df_normal)
> output <- df_normal
> class(output)
[1] "matrix"
A simple fix would then seem to be adding output <- as.data.frame(output), as this is in line with the Power BI documentation. It may also need a return-like statement at the end; adding a line at the end of the script simply stating output should take care of that.
Edit
For clarification, I believe the following edited version of your script should return the expected data:
# 'dataset' contains the input data for this script
library(dplyr)
df_normal <- log(dataset + 1) %>%
  select(c(2:4)) %>%
  scale
df_normal <- cbind(dataset[, c(1)], df_normal)
output <- as.data.frame(df_normal)
# output   # uncommenting this last line might also be needed as a return-like statement

Why am I getting different output from the Alteryx R tool

I am using the Alteryx R Tool to sign an Amazon HTTP request. To do so, I need the hmac function that is included in the digest package.
I'm using a Text Input tool that includes the key and a datestamp:
Key = "foo"
datestamp = "20120215"
Here's the issue. When I run the following script:
the.data <- read.Alteryx("1", mode="data.frame")
write.Alteryx(base64encode(hmac(the.data$key,the.data$datestamp,algo="sha256",raw = TRUE)),1)
I get an incorrect result when compared to when I run the following:
write.Alteryx(base64encode(hmac("foo","20120215",algo="sha256",raw = TRUE)),1)
The difference being that when I hardcode the values for the key and datestamp I get the correct result, but if I use the variables from the R data frame I get incorrect output.
Does the data frame alter the data in some way? Has anyone come across this when working with the R Tool in Alteryx?
Thanks for your input.
The issue appears to be that when creating the data frame, your character variables are converted to factors. The way to fix this with the data.frame constructor function is
the.data <- data.frame(Key="foo", datestamp="20120215", stringsAsFactors=FALSE)
I haven't used read.Alteryx but I assume it has a similar way of achieving this.
Alternatively, if your data frame has already been created, you can convert the factors back into character:
write.Alteryx(base64encode(hmac(
  as.character(the.data$Key),
  as.character(the.data$datestamp),
  algo = "sha256", raw = TRUE)), 1)

Package a large data set

Column-wise storage in the inst/extdata directory of a package, as suggested by Jan, is now implemented in the dfunbind package.
I'm using the data-raw idiom to make entire analyses from the raw data to the results reproducible. For this, datasets are first wrapped in R packages which can then be loaded with library().
One of the datasets I'm using is largish, around 8 million observations with about 80 attributes. For my current analysis I only need a small fraction of the attributes, but I'd like to package the entire dataset anyway.
Now, if it is simply packaged as a data frame (e.g., with devtools::use_data()), it will be loaded in its entirety when first accessing it. What would be the best approach to package this kind of data so that I can lazy-load at the column level? (Only those columns which I'm actually accessing are loaded, the others happily stay on disk and don't occupy RAM.) Would the ff package help? Can anyone point me to a working example?
I think I would store the data in inst/extdata. Then create a couple of functions in your package that can read and return parts of that data. In your functions you can get the path to your data using: system.file("extdata", "yourfile", package = "yourpackage"). (As on the page you linked to.)
The question then is in what format you store your data and how you obtain selections from it without reading the data into memory. For that, there are a large number of options. To name some:
sqlite: Store your data in an SQLite database. You can then perform queries on this data using the RSQLite package (a minimal sketch follows this list).
ff: Store your data in ff objects (e.g. save using the save.ffdf function from ffbase; use load.ffdf to load again). ff doesn't handle character fields well (they are always converted to factors), and in theory the files are not cross-platform, although as long as you stay on Intel platforms you should be OK.
CSV: Store your data in a plain old csv file. You can then make selections from this file using the LaF package. The performance will probably be less than with ff but might be good enough.
RDS: Store each of your columns in a separate RDS file (using saveRDS) and load them using readRDS. The advantage is that you do not depend on any additional R packages, and it is fast. The disadvantage is that you cannot do row selections (but that does not seem to be what you need).
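As a rough sketch of the sqlite option (assuming the database is shipped as inst/extdata/mydata.sqlite containing a table called mydata; the file, table, and package names are placeholders):
library(DBI)
library(RSQLite)
# read only the requested columns from the SQLite database shipped with the package
read_columns_sqlite <- function(columns) {
  db <- system.file("extdata", "mydata.sqlite", package = "yourpackage")
  con <- dbConnect(SQLite(), db)
  on.exit(dbDisconnect(con))
  query <- sprintf("SELECT %s FROM mydata", paste(columns, collapse = ", "))
  dbGetQuery(con, query)
}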
If you only want to select columns, I would go with RDS.
A rough example using RDS
The following code creates an example package containing the iris data set:
load_data <- function(dataset, columns) {
  result <- vector("list", length(columns))
  for (i in seq_along(columns)) {
    col <- columns[i]
    fn <- system.file("extdata", dataset, paste0(col, ".RDS"), package = "lazyload")
    result[[i]] <- readRDS(fn)
  }
  names(result) <- columns
  as.data.frame(result)
}
store_data <- function(package, name, data) {
  dir <- file.path(package, "inst", "extdata", name)
  dir.create(dir, recursive = TRUE)
  for (col in names(data)) {
    saveRDS(data[[col]], file.path(dir, paste0(col, ".RDS")))
  }
}
packagename <- "lazyload"
package.skeleton(packagename, "load_data")
store_data(packagename, "iris", iris)
After building and installing the package (you'll need to fix the documentation, e.g. delete it) you can do:
library(lazyload)
data <- load_data("iris", "Sepal.Width")
To load the Sepal.Width column of the iris data set.
Of course this is a very simple implementation of load_data: there is no error handling, it assumes all columns exist, it does not know which columns exist, and it does not know which data sets exist.
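As one possible refinement (a sketch only, reusing the load_data arguments and the lazyload package layout from above), a check for unknown columns could be added before the loop:
# stop early if a requested column has no corresponding RDS file in the package
available <- sub("\\.RDS$", "",
                 list.files(system.file("extdata", dataset, package = "lazyload")))
missing <- setdiff(columns, available)
if (length(missing) > 0) {
  stop("unknown column(s): ", paste(missing, collapse = ", "))
}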
