I have spent a few weeks looking for solutions to my problem of not only finding but creating a descriptive statistics summary table on my data that is exportable to xlsx (ideally). I have only found partial answers and my knowledge in R and R packages is still basic enough to limit my progress. My data set is time series data with 6 columns that have 50,000+ rows.
My DF information:
DateTime:POSIXCT format "YYYY-MM-DD HH:MM:SS"
Var1: num
Var2: num
Var3: num
Var4: factor w/ 2 levels "A","B"
Var5: factor w/ 4 levels "S1","S2","S3","S4"
My objectives are as follows:
Manipulate my data frame using tidyverse to subset my data
Take the subsetted data to create one summary table (i.e. in a tibble or data.frame format) with two sub-factors (Var4 and/or Var5) for Var1, Var2, and Var3. Below is a simplified, visual example of the table I am aiming for:
Export the summary table (or summary tables if one table is not possible) to .xlsx (ideally), .csv, or .txt to be used in Excel for stylistic table edits. At the moment, the "writexl" package works very well for me, as I have problems with the "xlsx" and "openxlsx" packages. Here is the code needed to export to xlsx using the writexl package: write_xlsx(dataframe, path = "C:/Users/user/Desktop/yourfile.xlsx") (path must name the output file, not just the folder). Note for macOS users: path = /Users/admin/yoursubfolder/yoursubfolder.... (fill in "yoursubfolder" with the actual folder name on your computer)
What I have done:
Used dplyr and the %>% operator to manipulate the data without and with factor Var4 or Var5
Tried to create a summary table with Var4 as a factor for Var1, Var2, and Var3 (partial success; the style is not what I want or it is not exportable to Excel)
Looked through multiple StackOverflow questions and Google searches with no success in finding code that works for my particular case. I've tried qwraps2 to create one and looked into the following packages for something pre-made: psych, stargazer, and Hmisc. I do not like their table styles, and they do not all have the option to show just N, mean, StDev, SEM, Min, and Max.
I know SEM is not a standard function in most packages; thus, I borrowed this function from an answer on Stack Overflow because I do not know how to create functions. Here is the borrowed code: SEM <- function(x) sd(x)/sqrt(length(x))
Since I cannot attach sample data and my coding is very basic, here is what I could come up with:
Example data:
Unfortunately, I cannot attach sample data for testing. Also due to my limited knowledge of R, I cannot make a perfect data frame. Below is a sample data frame, but I cannot get the factor to be evenly distributed in their respective columns (Sorry). Here is my code:
df <- data.frame(
"DateTime" = seq(c(ISOdate(2018,03,01)), by = "day", length.out = 100),
"Var1" = rnorm(1:100),
"Var2" = rnorm(1:100),
"Var3" = rnorm(1:100),
"Var4" = c("A", "B"),
"Var5" = c("S1","S2", "S3", "S4"))
I was trying this:
"S1"[(1:25)],
"S2"[(26:50)],
"S3"[(51:75)],
"S4"[(76:100)] # and
"A"[(1:50], "B"[(51:100)] #but that didn't work, so sorry again.
Despite my lack of proper coding, any guidance, tips, and suggestions from anyone with more experience in R would be greatly appreciated. I do like R and all the capabilities of the software, but I find it very inconvenient that there is no simple, straightforward way to export tables from the console so they can be copied and pasted into useful forms like Excel spreadsheets or Word documents, instead of the standard export in LaTeX format (which I do not understand at all, by the way). I know this topic has been discussed in different forums, and others share my sentiment on how terrible it is, especially for people who need it for data processing rather than for document creation as with R Markdown.
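On the side issue of the sample data: the block layout attempted above (the first 50 rows "A", the next 50 "B", and 25 rows per "S" level) can probably be built with rep() and its each argument. This is only a sketch in base R; the column names follow the question, and the random values are placeholders.

```r
# Sketch: rebuild the example data frame with evenly sized factor blocks.
df <- data.frame(
  DateTime = seq(ISOdate(2018, 3, 1), by = "day", length.out = 100),
  Var1 = rnorm(100),
  Var2 = rnorm(100),
  Var3 = rnorm(100),
  # each = 50 repeats "A" 50 times, then "B" 50 times
  Var4 = factor(rep(c("A", "B"), each = 50)),
  # each = 25 gives rows 1-25 "S1", 26-50 "S2", and so on
  Var5 = factor(rep(c("S1", "S2", "S3", "S4"), each = 25))
)

# Cross-tabulate the two factors to check the layout
table(df$Var4, df$Var5)
```

Using times = 25 instead of each = 25 would cycle through S1 to S4 repeatedly, which gives every Var4/Var5 combination some rows.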
An example with your df:
library(dplyr)
library(tidyr)
# standard error of the mean
SEM_function <- function(x){sd(x)/sqrt(length(x))}
df %>% as_tibble() %>%
  # reshape to long format: one row per observation of Var1, Var2 or Var3
  gather("Var_num", "value", Var1:Var3) %>%
  # one summary row per variable and factor combination
  group_by(Var_num, Var4, Var5) %>%
  summarise("N" = n(),
            "mean" = mean(value),
            "StDev" = sd(value),
            "SEM" = SEM_function(value),
            "min" = min(value),
            "max" = max(value))
Hope this helps
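The export step asked about in the question is not covered above, so here is a hedged end-to-end sketch. It assumes the dplyr, tidyr and writexl packages are installed, uses pivot_longer() (the current tidyr replacement for gather()), and writes to a temporary folder; the toy data, the seed and the file name summary_table.xlsx are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(writexl)

set.seed(1)  # arbitrary seed so the toy data is reproducible
df <- data.frame(
  Var1 = rnorm(100), Var2 = rnorm(100), Var3 = rnorm(100),
  Var4 = rep(c("A", "B"), each = 50),
  Var5 = rep(c("S1", "S2", "S3", "S4"), times = 25)
)

# standard error of the mean
sem <- function(x) sd(x) / sqrt(length(x))

summary_tbl <- df %>%
  pivot_longer(Var1:Var3, names_to = "Var_num", values_to = "value") %>%
  group_by(Var_num, Var4, Var5) %>%
  summarise(N = n(), mean = mean(value), StDev = sd(value),
            SEM = sem(value), min = min(value), max = max(value),
            .groups = "drop")

# path must be a file name ending in .xlsx; tempdir() is used only for the demo
out_file <- file.path(tempdir(), "summary_table.xlsx")
write_xlsx(summary_tbl, path = out_file)
```

The result is an ordinary tibble, so write.csv(summary_tbl, "summary_table.csv", row.names = FALSE) should work as a plain-text alternative.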
I know there are a number of questions similar to this here, but 1) most of the solutions rely on deprecated functions like ml_create_dummy_variables and 2) other solutions are incomplete.
Is there a function or an approach to easily hot encode a categorical variable into multiple dummy variables in sparklyr?
This post asks for a solution in SparkR; incidentally, a sparklyr solution is given there, but it only works when the categories are unique in a given column, which renders it pointless.
This solution results in a single dummy instead of a dummy for each category (it grabs the first category). This is also the solution I stumbled onto (based on this post), and it does not cut it:
iris_sdf <- copy_to(sc, iris, overwrite = TRUE)
iris_sdf %>%
ft_string_indexer(input_col = "Species", output_col = "species_num") %>%
mutate(cat_num = species_num + 1) %>%
ft_one_hot_encoder("species_num", "species_dum") %>%
ft_vector_assembler(c("species_dum"))
I'm looking for a solution that will take Species from the iris dataset and generate three columns, one for each category in Species (virginica, setosa, and versicolor). In plain R, the fastDummies package has what I need, but I'm left wondering how to achieve similar functionality in sparklyr.
Again, I'll note that ml_create_dummy_variables (suggested by this post) produced the following error:
Error in ml_create_dummy_variables(., "species_num", "species_dum") :
  could not find function "ml_create_dummy_variables"
Note: I'm using sparklyr_1.3.1
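One possible workaround, sketched here on the local iris data frame so that it runs without a Spark connection, is to build one indicator column per category with plain dplyr verbs. The assumption (not verified against sparklyr 1.3.1) is that the same mutate() call also works on iris_sdf, since ifelse() and == are normally translated to Spark SQL; the Species_* column names are made up.

```r
library(dplyr)

# Local sketch: one 0/1 indicator column per Species category.
# Assumption: applied to the Spark table iris_sdf, the same mutate()
# would be translated to CASE WHEN expressions.
iris_dummies <- iris %>%
  mutate(
    Species_setosa     = ifelse(Species == "setosa", 1L, 0L),
    Species_versicolor = ifelse(Species == "versicolor", 1L, 0L),
    Species_virginica  = ifelse(Species == "virginica", 1L, 0L)
  )

head(iris_dummies)
```

With many categories, the three expressions would have to be generated programmatically rather than typed out, which this sketch does not attempt.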
I am analysing student level data from PISA 2015. The data is available in SPSS format here
I can load the data into R using the read_sav function in the haven package. I need to be able to edit the data in R and then save/export the data in SPSS format with the original value labels that are included in the SPSS download intact. The code I have used is:
library(haven)
student<-read_sav("CY6_MS_CMB_STU_QQQ.sav",user_na = T)
student2<-data.frame(student)
#some edits to data
write_sav(student2,"testdata1.sav")
When my colleague (who works in SPSS) tries to open the "testdata1.sav" the value labels are missing. I've read through the haven documentation and can't seem to find a solution for this. I have also tried read/write.spss in the foreign package but have issues loading in the dataset.
I am using R version 3.4.0 and the latest build of haven.
Does anyone know if there is a solution for this? I'd be very grateful for your help. Please let me know if you require any additional information to answer this.
library(foreign)
df <- read.spss("spss_file.sav", to.data.frame = TRUE)
This may not be exactly what you are looking for, because it uses the labels as the data. So if you have an SPSS file with 0 for "Male" and 1 for "Female," you will have a df with values that are all Males and Females. It gets you one step further, but perhaps isn't the whole solution. I'm working on the same problem and will let you know what else I find.
library(sjlabelled)
student <- sjlabelled::read_spss("CY6_MS_CMB_STU_QQQ.sav")
student2 <- student
write_spss(student2, "testdata1.sav")
I have not tested this, but I hope it works. The sjlabelled package copes well with non-ASCII characters such as German umlauts.
But keep in mind that R stores the labels as attributes. These attributes can be lost during some data transformations (subsetting the data, for example). Once they are lost in R, they won't show up in SPSS, of course. The sjlabelled::copy_labels function is helpful in those cases:
student2 <- copy_labels(student2, student) #after data transformations and before export to spss
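To make the point about attributes concrete, here is a small sketch using haven on made-up data (the toy data frame and its sex variable are hypothetical, not part of the PISA file). It shows the labels stored as an attribute, a conversion that drops them, and a round trip through write_sav() and read_sav() that keeps them.

```r
library(haven)

# Toy data: 'sex' is a made-up variable carrying SPSS-style value labels
toy <- data.frame(id = 1:4)
toy$sex <- labelled(c(0, 1, 1, 0), labels = c(Male = 0, Female = 1))

# The value labels live in an attribute of the column
attr(toy$sex, "labels")

# Converting to a plain numeric vector drops that attribute
stripped <- as.numeric(toy$sex)
attr(stripped, "labels")

# Round trip through a temporary .sav file: labels are still attached
tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)
back <- read_sav(tmp)
attr(back$sex, "labels")
```

So the thing to check before exporting is whether the edited columns still carry their labels attribute.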
I think you need to recover the value labels in the data frame after importing the dataset into R, then write that data frame to a .sav file.
# load libraries
library(haven)
library(dplyr)
# load dataset
student <- read_sav("CY6_MS_CMB_STU_QQQ.sav", user_na = TRUE)
# identify all columns that carry haven value labels
labelled_vars <- names(student)[sapply(student, is.labelled)]
# convert all haven-labelled variables into factors
student2 <- student %>%
  mutate(across(all_of(labelled_vars), as_factor))
# write dataset
write_sav(student2, "testdata1.sav")
I'm using R Sweave and want to begin my document by showing a sample of my table. My problem is that my table has 39 variables and many rows. The rows aren't a problem, since I can take just a few using sample_n, but I need to have all my variables visible, and sadly they would not fit even on a landscape page. I'm using xtable to generate my table. I think the easiest way would be to put as many variables as possible on the page, then continue with the rest underneath, and so on, until everything is displayed.
Here is a minimal example:
library(dplyr)
library(xtable)
dat <- bind_cols(mtcars, mtcars, mtcars, mtcars)
a <- as.data.frame(dat) %>%
  sample_n(5)
print(xtable(a))
I already know about longtable, but it would only help if I had too many rows, not too many columns, wouldn't it? I'm still a little lost with having R and LaTeX in the same file at the same time...
An answer using my huxtable package. Create the table, then break it up by columns:
library(dplyr)
library(huxtable)
dat <- sample_n(as.data.frame(bind_cols(mtcars, mtcars, mtcars, mtcars)), 5)
ht <- as_hux(dat, add_colnames = TRUE)
# now format to taste:
bold(ht)[1,] <- TRUE
ht[,1:5] # first 5 columns. Will print as LaTeX within an R Markdown document
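The idea in the question of filling the page with as many columns as fit and then continuing underneath can also be sketched with xtable alone, by splitting the column indices into groups and printing one table per group. This is an illustration on the first rows of mtcars; cols_per_table is a made-up tuning value, and in a Sweave chunk the output would presumably need results=tex.

```r
library(xtable)

dat <- mtcars[1:5, ]   # a few rows stand in for the sampled data
cols_per_table <- 4    # hypothetical number of columns that fit on the page

# Group the column indices: 1-4, 5-8, 9-11, ...
col_groups <- split(seq_along(dat),
                    ceiling(seq_along(dat) / cols_per_table))

# Emit one LaTeX table per group of columns
for (cols in col_groups) {
  print(xtable(dat[, cols, drop = FALSE]))
}
```

Row names are repeated in every table, which helps when matching rows across the pieces.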
I'm running through a large dataset chunk by chunk, updating a list of linear models as I go using the biglm function. The issue occurs when a particular chunk does not contain all the factor levels that appear in my linear model, and I get this error:
Error in update.biglm(model, new) : model matrices incompatible
The description of update.biglm mentions that factor levels must be the same across all chunks. I could probably come up with a workaround to avoid this, but there must be a better way. This pdf, on the 'biglm' page, mentions that "Factors must have their full set of levels specified (not necessarily present in the data chunk)". So I think there is some way to specify all the possible levels so that I can update a model with a chunk in which not all levels are present, but I can't figure out how to do it.
Here's an example piece of code to illustrate my problem:
library(biglm)
df = data.frame(a = rnorm(12), b = as.factor(rep(1:4, each = 3)), c = rep(0:1, 6))
model = biglm(a ~ b + c, data = df)
df.new = data.frame(a = rnorm(6), b = as.factor(rep(1:2, each = 3)), c = rep(0:1, 3))
model.new = update(model, df.new) # fails: df.new$b only has levels 1 and 2
Thanks for any advice you have.
I came across this problem also. Are the variables in your large data frame specified as factors before breaking them into chunks? Also, is the data set formatted as a data frame?
large_df <- as.data.frame(large_data_set) # just to make sure it's a df.
large_df$factor.vars <- as.factor(large_df$factor.vars)
If this is the case, then all of the factor levels should be preserved in the factor variables even after breaking the data frame into chunks. This will ensure that biglm creates the proper design matrix from the first call, and that all subsequent updates will be compatible.
If you have different data frames from the start (as you illustrate in your example), perhaps you should merge them into one before breaking it down into chunks. Continuing from your example:
df.large <- rbind(df,df.new)
chunk1 <- df.large[1:12,]
chunk2 <- df.large[13:18,]
model <- biglm(a~b+c,data = chunk1)
model.new <- update(model,chunk2) # this is now compatible
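If merging the data frames is not practical, an alternative is to declare the full set of levels on each chunk explicitly with factor(..., levels = ). The sketch below uses only base R and compares the design-matrix columns, which is what has to match between chunks; the seed is arbitrary, and the assumption is that update() from biglm then accepts the new chunk.

```r
set.seed(42)  # arbitrary seed for the toy data
df <- data.frame(a = rnorm(12), b = as.factor(rep(1:4, each = 3)), c = rep(0:1, 6))
df.new <- data.frame(a = rnorm(6), b = as.factor(rep(1:2, each = 3)), c = rep(0:1, 3))

# Give the new chunk the full set of levels, even those not present in it
df.new$b <- factor(df.new$b, levels = levels(df$b))

# Both chunks now produce design matrices with the same columns
cols_first <- colnames(model.matrix(a ~ b + c, data = df))
cols_new   <- colnames(model.matrix(a ~ b + c, data = df.new))
identical(cols_first, cols_new)

# With biglm loaded, the update is then expected to go through:
# model     <- biglm(a ~ b + c, data = df)
# model.new <- update(model, df.new)
```

When reading chunks from a file, the same factor() call can be applied to each chunk right after it is read, using a fixed vector of all known levels.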
I have data in Excel sheets and I need a way to clean it. I would like to remove inconsistent values; for example, the branch name is given variously as Computer Science and Engineering, C.S.E, C.S, and Computer Science. How can I bring all of them into a single notation?
The car package has a recode function. See its help page for worked examples.
In fact an argument could be made that this should be a closed question:
Why is recode in R not changing the original values?
How to recode a variable to numeric in R?
Recode/relevel data.frame factors with different levels
And a few more questions easily identifiable with a search: [r] recode
EDIT:
I liked Marek's comment so much I decided to make a function that implemented it. (Factors have always been one of those R-traps for me and his approach seemed very intuitive.) The function is designed to take character or factor class input and return a grouped result that also classifies an "all_others" level.
my_recode <- function(fac, levslist) {
  # levslist is of the form:
  #   list(animal = c("cow", "pig"),
  #        bird = c("eagle", "pigeon"))
  nfac <- factor(fac)
  inlevs <- levels(nfac)
  # levels not mentioned in levslist are grouped under "all_others"
  othrlevs <- inlevs[!inlevs %in% unlist(levslist)]
  # wrap othrlevs in list() so several leftover levels stay in one group
  levels(nfac) <- c(levslist, list(all_others = othrlevs))
  nfac
}
df <- data.frame(name = c('cow','pig','eagle','pigeon', "zebra"),
stringsAsFactors = FALSE)
df$type <- my_recode(df$name, list(
animal = c("cow", "pig"),
bird = c("eagle", "pigeon") ) )
df
#-----------
name type
1 cow animal
2 pig animal
3 eagle bird
4 pigeon bird
5 zebra all_others
You want a way to clean your data and you specify R. Is there a reason for it? (automation, remote control [console], ...)
If not, I would suggest Open Refine. It is a great tool for exactly this job. It is not hosted, so you can safely download it and run it against your dataset (xls/xlsx work fine); you then create a text facet and group away.
It uses advanced algorithms (and even gives you a choice) and is really helpful. I have cleaned a lot of data in no time.
The videos at the official web site are useful.
There are no one-size-fits-all solutions for these types of problems. From what I understand, you have Branch Names that are inconsistently labelled.
You would like to see C.S.E. but what you actually have is CS, Computer Science, CSE, etc., and perhaps a number of other Branch Names that are inconsistent.
The first thing I would do is get a unique list of Branch Names in the file. I'll provide an example using the built-in letters vector so you can see what I mean:
your_df <- data.frame(ID=1:2000)
your_df$BranchNames <- sample(letters,2000, replace=T)
your_df$BranchNames <- as.character(your_df$BranchNames) # only if it's a factor
unique.names <- sort(unique(your_df$BranchNames))
Now that we have a sorted list of unique values, we can create a listing of recodes:
Let's say we wanted to rename a through g as just "A":
your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"
And you'd repeat the process above, eliminating or grouping the unique names as appropriate.
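When there are many variants, repeating that assignment gets tedious, so one option is a named lookup vector that maps every known variant to the notation you want. The sketch below uses the branch spellings from the question; "Mechanical" is a made-up extra value to show that unmatched names are left alone.

```r
# Spellings from the question plus one made-up unrelated branch
branch <- c("Computer Science and Engineering", "C.S.E", "C.S",
            "Computer Science", "Mechanical")

# Lookup: names are the variants, values are the notation to keep
lookup <- c(
  "C.S.E"            = "Computer Science and Engineering",
  "C.S"              = "Computer Science and Engineering",
  "Computer Science" = "Computer Science and Engineering"
)

# Replace known variants; anything not in the lookup keeps its value
cleaned <- unname(ifelse(branch %in% names(lookup), lookup[branch], branch))
cleaned
```

The lookup itself can be kept in a small spreadsheet and read in alongside the data, so new variants are added without touching the code.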