I work with .dta files and am trying to make loading data as comfortable as possible. In my view, I need a combination of haven and readstata13.
haven looks perfect: it provides the nicest variable labels ("sub-labels"). But it does not provide a column-selection function, so I cannot use read_dta for large files (~1 GB files, on 64 GB RAM, Intel Xeon E5).
Question: Is there a way to select/load a subset of data?
read.dta13 is my best workaround: it has select.cols, but I then have to extract the attributes afterwards, save them, and merge them back (for about 10 files).
Question: How can I manually add these second labels that the haven package creates? (What are they called?)
Here is the MWE:
library(foreign)
write.dta(mtcars, "mtcars.dta")  # create an example .dta file
library(haven)
mtcars <- read_dta("mtcars.dta")  # haven attaches a "label" attribute to each column
library(readstata13)
mtcars2 <- read.dta13("mtcars.dta", convert.factors = FALSE,
                      select.cols = c("mpg", "cyl", "vs"))
var.labels <- attr(mtcars2, "var.labels")  # readstata13 keeps labels in one data-frame attribute
data.key.mtcars2 <- data.frame(var.name = names(mtcars2), var.labels)
haven's development version supports selecting columns with the col_select argument:
library(haven) # devtools::install_github("tidyverse/haven")
mtcars <- read_dta("mtcars.dta", col_select = c(mpg, cyl, vs))
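col_select also accepts selection expressions like dplyr::select() does (assuming a recent haven version); a quick sketch:
mtcars_sub <- read_dta("mtcars.dta", col_select = starts_with("c"))  # selects cyl and carb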
Alternatively, the column labels in RStudio's viewer are taken from each column's "label" attribute. You can use a simple loop to assign them from the labels read by readstata13:
for (i in seq_along(mtcars2)) {
  attr(mtcars2[[i]], "label") <- var.labels[i]  # one label per column
}
View(mtcars2)
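If you prefer not to loop, the labelled package (if you have it installed) provides var_label() as a vectorised setter; a hedged sketch:
library(labelled)
var_label(mtcars2) <- setNames(as.list(var.labels), names(mtcars2))  # named list of labels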
I would be very grateful for any guidance on how to use the xltabr package to automatically format tables in R, please:
https://github.com/moj-analytical-services/xltabr
In SPSS, for example, I would apply the relevant weight and then run a crosstab on the raw data, e.g. var1*var2.
How would you go about doing this in R so that the package recognises it and produces the table?
Much appreciated.
You need to create or read in the data frame which you want to use first:
library(foreign)
dat <- read.spss("mydataframe.sav", to.data.frame = TRUE)
Then you need to put it in the format you want. For your crosstab example, you can do this:
library(reshape2)
# Species and Petal.Width stand in here for your variable1 and variable2
ct <- reshape2::dcast(iris, Species ~ Petal.Width,
                      value.var = "Sepal.Length", fun.aggregate = length)
# depending on what data you want, you can change the fun.aggregate function (e.g. sum or mean)
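Since the question mentions weights: assuming your data frame has a weight column (the column names var1, var2 and weight below are hypothetical), a weighted crosstab can be built by summing the weights instead of counting rows:
ct_w <- reshape2::dcast(dat, var1 ~ var2, value.var = "weight", fun.aggregate = sum)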
Then you can use the xltabr package to prepare the excel file by creating a Workbook:
wb <- xltabr::auto_crosstab_to_wb(ct)
Then you can save it as .xlsx file:
library(openxlsx)
openxlsx::saveWorkbook(wb, file = "crosstable.xlsx", overwrite = TRUE)
I hope this helps
My question boils down to: what is the sparklyr equivalent of the str R command?
I am opening a large table (from a file), call it my_table, in Spark, from R, using the sparklyr package.
How can I describe the table? Column names and types, a few examples, etc.
Apologies in advance for what must be a very basic question, but I did search for it, checked RStudio's sparklyr cheatsheet, and did not find the answer.
Let's use the mtcars dataset and move it to a local Spark instance for example purposes:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
tbl_cars <- dplyr::copy_to(sc, mtcars, "mtcars")
Now you have many options; here are two of them, each slightly different. Choose based on your needs:
1. Collect the first row into R (now it is a standard R data frame) and look at str:
str(tbl_cars %>% head(1) %>% collect())
2. Invoke the schema method and look at the result:
spark_dataframe(tbl_cars) %>% invoke("schema")
This will give something like:
StructType(StructField(mpg,DoubleType,true), StructField(cyl,DoubleType,true), StructField(disp,DoubleType,true), StructField(hp,DoubleType,true), StructField(drat,DoubleType,true), StructField(wt,DoubleType,true), StructField(qsec,DoubleType,true), StructField(vs,DoubleType,true), StructField(am,DoubleType,true), StructField(gear,DoubleType,true), StructField(carb,DoubleType,true))
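A possible third option, if your sparklyr version ships it, is the sdf_schema() helper, which returns the schema as an R list instead of a Java object:
sdf_schema(tbl_cars)  # a list of (name, type) pairs, one entry per column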
I have a large data frame containing about 4 million rows and 15 variables. I'm trying to implement a model selection algorithm which adds to the lm model the variable that yields the highest increase in r-squared.
The following code snippet is where my function fails due to the large data size. I tried biglm, but still no luck. I use mtcars as an example here just to illustrate.
library(biglm)
library(dplyr)
data <- mtcars
y <- "mpg"
vars.model <- "cyl"
vars.remaining <- setdiff(names(data), c("mpg", "cyl"))
new.rsq <- sapply(vars.remaining, function(x) {
  vars.test <- paste(vars.model, x, sep = "+")
  fit.sum <- biglm(as.formula(paste(y, vars.test, sep = "~")),
                   data) %>% summary()
  new.rsq <- fit.sum$rsq
})
new.rsq
I'm not sure how exactly R handles the memory here, but the biglm output for my 4 million rows of data is extremely small (6.6 KB). I don't know how it accumulates to several GB when I wrap it into sapply. Any tips on how to optimise this are greatly appreciated.
Memory usage goes up because each call to biglm() makes a copy of the data in memory. Since sapply() is basically a for loop, using doMC (or doParallel) allows you to go through the loop with a single copy of the data in memory. Here is one possibility:
EDIT: As #moho wu pointed out, parallel fitting helps, but not quite enough. Factors are more efficient than plain characters, so that helps too. Then ff can help even more, as it keeps bigger data sets on disk rather than in memory. I updated the code below to make it a complete toy example using ff and doMC.
library(tidyverse)
library(pryr)
# toy data
df <- sample_n(mtcars, size = 1e7, replace = TRUE)
df$A <- as.factor(letters[1:5])
# get objects / save on disk
all_vars <- names(df)
y <- "mpg"
vars.model <- "cyl"
vars.remaining <- all_vars[-c(1:2)]
save(y, vars.model, vars.remaining, file = "all_vars.RData")
readr::write_delim(df, path = "df.csv", delim = ";")
# close R session and start fresh
library(ff)
library(biglm)
library(doMC)
library(tidyverse)
# read flat file as "ff" ; also read variables
ff_df <- read.table.ffdf(file = "df.csv", sep = ";", header = TRUE)
load("all_vars.RData")
# prepare the "cluster"
nc <- 2 # number of cores to use
registerDoMC(cores = nc)
# make all formula
fo <- paste0(y, "~", vars.model, "+", vars.remaining)
fo <- modify(fo, as.formula) %>%
set_names(vars.remaining)
# fit models in parallel
all_rsq <- foreach(fo = fo) %dopar% {
fit <- biglm(formula = fo, data = ff_df)
new.rsq <- summary(fit)$rsq
}
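To finish the selection step, pick the variable that gives the largest r-squared. foreach() returns a plain list here, so the names are reattached first (a small sketch):
names(all_rsq) <- vars.remaining
best.var <- names(which.max(unlist(all_rsq)))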
The culprit in my case is that I have a lot of character columns. It works fine after I change them all to factors using my original script:
library(dplyr)
data %>%
  mutate_if(is.character, as.factor)
@meriops' answer is also sound. Parallel processing might be something to consider if factorising your data frame doesn't solve the problem.
I'm using R Sweave and wanted to begin my document by showing a sample of my table. My problem is that my table has 39 variables and many rows. The rows aren't a problem, since I can take just a few of them using sample_n, but I need to have all my variables visible. Sadly, the table would not fit even on a landscape sheet. I'm using xtable to generate my table. I think the easiest way would be to put as many variables as possible on the sheet, then continue with the rest below, and so on, until everything is displayed.
Here is a minimal example:
library(dplyr)
library(xtable)
dat <- bind_cols(mtcars, mtcars, mtcars, mtcars)
a <- as.data.frame(dat) %>%
  sample_n(5)
print(xtable(a))
I already know about longtable, but that would only help me if I had too many rows, not too many columns, wouldn't it? I'm still a little bit lost about having R and LaTeX in the same file at the same time...
An answer using my huxtable package. Create the table, then break it up by columns:
library(huxtable)
library(dplyr)  # for bind_cols and sample_n
dat <- sample_n(as.data.frame(bind_cols(mtcars, mtcars, mtcars, mtcars)), 5)
ht <- as_hux(dat, add_colnames = TRUE)
# now format to taste:
bold(ht)[1,] <- TRUE
ht[, 1:5] # first 5 columns; will print as LaTeX within an R Markdown document
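To emit every column in fixed-width chunks rather than slicing by hand, a small sketch (using huxtable's print_latex() and chunks of 5 columns):
for (i in seq(1, ncol(ht), by = 5)) {
  print_latex(ht[, i:min(i + 4, ncol(ht))])  # one LaTeX table per 5-column slice
}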
I have an SPSS file which contains variables and value labels. I found the foreign package with its read.spss function:
data <- read.spss("2017.sav", to.data.frame = TRUE, use.value.labels = TRUE)
If I use use.value.labels = TRUE, all strings are converted to factor variables, and I don't want that because not all of them are factors.
I found one solution, but I don't know if it is the best way to do it:
1. First read the SPSS file as above.
2. Then select which variables are not factors and change them to character with:
cols <- c("x", "ab")
data[cols] <- lapply(data[cols], as.character)
If I don't use use.value.labels = TRUE, I won't have the value labels at all, and I cannot export the file correctly.
You can also use the memisc package:
library(memisc)
sav <- spss.system.file("file.sav")
df <- as.data.set(sav)
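If I remember correctly, memisc can also show the variable labels directly (a hedged sketch):
description(df)  # should list each variable together with its label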
My company regularly deals with SAV files, and we extract the metadata separately. With the foreign package, you can get the metadata out in a few different ways (after you have loaded the file in):
data.label.table <- attr(sav, "label.table")  # the value labels
missings <- attr(sav, "missings")             # the missing-value definitions
The other bits require various lapply and sapply functions to get them out. The script I have is quite long, so I will not share it here. If you read the data in with read.spss(sav, to.data.frame = TRUE) you can get:
VariableLabels <- unname(attr(sav, "variable.labels"))
I don't know why, but I can't install the foreign package.
Here is what I did instead to import a dataset from SPSS to R (through Excel):
1. Open your data in SPSS.
2. Export the dataset from SPSS to Excel, but make sure to choose the "Save value labels where defined instead of data values" option at the very bottom.
3. Open R.
4. Import the dataset from Excel (see the sketch below).
Now you have a dataset in R with value labels.
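For the import step, readxl is one option (the file name below is hypothetical):
library(readxl)
data <- read_excel("mydata.xlsx")  # the file exported from SPSS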
Use the haven package:
library(haven)
data <- read_sav("2017.sav")
The labels are shown in the RStudio viewer.
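haven stores each variable's label as a "label" attribute on the column, so you can also collect them yourself, similar to the data.key approach above (a small sketch):
var.labels <- sapply(data, attr, "label")  # named vector of variable labels
data.key <- data.frame(var.name = names(data), var.labels)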