Comparison of good vs bad dataset using R

I'm stuck on a problem. There are two datasets, A and B; say they are datasets from two factories. Factory A is performing really well whereas Factory B is not. I have the dataset of Factory A (the data being output from the manufacturing units) as well as that of Factory B, both having the same variables. How can I identify the problematic variable in Factory B that needs to be fixed so that Factory B starts performing well too, i.e. the variable that needs immediate attention?
Looking forward to your response.
P.S.: the coding language being used is R.

Well, this is a shameless plug for the dataMaid package, which I helped write and which does roughly what you are asking. The idea of the dataMaid package is to run a battery of tests on the variables in a data frame and produce a report that a human investigator (preferably someone with knowledge of the context) can look through in order to identify potential problems.
A super simple way to get started is to load the package and use the clean function on a data frame (if you try to clean the same data frame several times, it may be necessary to add the replace = TRUE argument to overwrite the existing report).
devtools::install_github("ekstroem/dataMaid")  # install the development version
library(dataMaid)
data(trees)     # example data from the datasets package
clean(trees)    # writes a report with checks for every variable
This will create a report with summaries and error checks for each variable in the trees data frame. The report opens with a summary of all the variables and then gives a section for each individual variable, showing its type, summary statistics, a plot and, where relevant, an indicator of potential problems, such as outliers.
The dataMaid package can also be used interactively by running checks for individual variables or for all variables in the dataset:
data(toyData)
check(toyData$var2) # Individual check of var2
check(toyData) # Check all variables at once
By default, the standard battery of tests is run depending on the variable type, but it is possible to extend the package by providing your own checks, along the lines of the sketch below.
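To give an idea of what such an extension can look like, here is a minimal sketch built on the checkFunction() and checkResult() helpers from the package documentation; the check itself (noNegatives) is made up for illustration:
# Sketch of a user-supplied check: flag numeric variables with negative values.
noNegatives <- function(v, ...) {
  problemValues <- unique(v[!is.na(v) & v < 0])
  problem <- length(problemValues) > 0
  if (!problem) problemValues <- NULL
  checkResult(list(problem = problem,
                   message = if (problem) "Found negative values." else "",
                   problemValues = problemValues))
}
noNegatives <- checkFunction(noNegatives,
                             description = "Check for negative values",
                             classes = "numeric")
noNegatives(c(-1, 2, 3))  # can be run interactively like the built-in checks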
In your case I would run the package on both datasets to get two reports, and any major differences in those would raise a flag about what could be problematic.
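In code, that could look like this (factoryA and factoryB are placeholders for your two data frames):
clean(factoryA, replace = TRUE)  # report for the well-performing factory
clean(factoryB, replace = TRUE)  # report for the under-performing factory
A variable whose summary statistics, distribution plot or flagged problems differ markedly between the two reports is a natural candidate for the one that needs attention.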

Related

How can I get My.stepwise.glm to return the model outside the console?

I asked this question on RCommunity but haven't had anyone bite... so I'm here!
My current project involves predicting whether some trees will survive given future climate change scenarios. Against better judgement (which would be to use something like Maxent) I've decided to pursue this with a GLM, which requires presence and absence data. Every time I generate my absence data (as I was only given presence data) using randomPoints from dismo, the resulting GLM has different significant variables. I found a package called My.stepwise that has a My.stepwise.glm function (here: My.stepwise.glm: Stepwise Variable Selection Procedure for Generalized Linear... in My.stepwise: Stepwise Variable Selection Procedures for Regression Analysis), and this goes through a forward/backward selection process to find the best variables and returns a fitted model for you.
My problem is that I don't want to run My.stepwise.glm just once and use the model it spits out. I'd like to run it roughly 100 times with different pseudo-absence data, see which variables it returns each time, then take the most frequent variables and move forward with building my model using those. The issue is that the My.stepwise.glm function ends with print(summary(initial.model)), and I would like to be able to access the output the way step() returns a list, where you can then say step$coefficients and have the coefficients returned as numerics. Can anyone help me with this?
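One untested idea (a sketch, not a verified solution): since the function ends with print(summary(initial.model)), the fitted model exists in the function's evaluation frame on exit, so trace() may be able to copy it out. The variable name initial.model is taken from the function's source as quoted above; captured.model is a made-up name.
library(My.stepwise)

# Untested sketch: copy the internal model object out when the function exits.
# Assumes the final model lives in a local variable called `initial.model`,
# as the closing print(summary(initial.model)) call suggests.
trace(My.stepwise.glm,
      exit = quote(assign("captured.model", initial.model, envir = .GlobalEnv)))

# ... call My.stepwise.glm(...) as usual with one set of pseudo-absence data ...

untrace(My.stepwise.glm)
names(coef(captured.model))  # the variables retained in this run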

Is there an R function/package to transform time series data for confidentiality reasons?

I wish to share a dataset (largely time-series data) with a group of data scientists so that they can explore the statistical relationships within the data (e.g. between variables). However, for confidentiality reasons I am unable to share the original dataset, so I was wondering whether I could transform the data with some random transformation that I know but that the recipients won't. Is this a common practice? Is there an associated R package?
I have been exploring the use of synthetic datasets and have looked at 'synthpop', but my challenge seems slightly different. For example, I don't necessarily want the data to include fictional individuals that resemble the original file. Rather, I'd prefer the values associated with a specific variable to be unclear (e.g. still numerical but otherwise nonsensical) to the human viewer while still enabling statistical analysis (e.g. despite the actual values being unclear, the relationship between variables x and y remains the same).
I have a feeling that this is probably quite a simple process (e.g. change the names of the variables and apply the same kind of transformation across all variables), but I'm not a mathematician/statistician, so I don't want to violate the underlying relationships through an inappropriate transformation.
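To illustrate the kind of thing I have in mind, here is a minimal sketch (original is a placeholder for my real data frame, and the random scale/shift values are arbitrary). As far as I understand, a positive per-variable affine transformation leaves the Pearson correlations between variables unchanged:
set.seed(42)  # in practice, keep the seed and the transformation private

# Sketch: mask each numeric variable with its own positive scale and shift.
# `original` is a placeholder for the real data frame.
masked <- as.data.frame(lapply(original, function(x) {
  if (is.numeric(x)) runif(1, 0.5, 2) * x + rnorm(1, sd = 10) else x
}))
names(masked) <- paste0("V", seq_along(masked))  # hide the variable names too
Of course, anything that depends on the true units or means (rather than on correlations) would be distorted by this.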
Thanks!

R: [Indicspecies package] multipatt function: extract values from summary.multipatt

I am working with the multipatt function from the 'indicspecies' package and am unable to extract the values from its summary. Unfortunately I can't print the whole summary and am left with partial information for my model. The reason is the huge amount of data that needs to be printed from the summary (300,000 different species, 3 groups, 6 comparable combinations).
This is what happens when the summary is saved (preceding code included):
x <- multipatt(data, ...)
sumx <- summary(x)
sumx
NULL
str(sumx)
NULL
So, the summary does not work exactly like a generic summary. It seems that the function is based around the older indval function from the 'labdsv' package (which is mentioned in the documentation). I found an archived thread where a similar problem is discussed: http://r.789695.n4.nabble.com/extract-values-from-summary-of-function-indval-of-the-package-labdsv-td4637466.html
but it seems to be unresolved (and it is not about exactly the same function, but rather the underlying indval function).
I was wondering if anyone has experience with the indicspecies package and knows a way to extract the info from the summary.
It is possible to extract significance and other information from the other saved data from the model, but it might be nice to just get a quick complete overview from the data.
P.S. I tried
options(max.print=1000000)
but this didn't solve it for me.
I used to capture the summary output for a multipatt object, but I don't any more because the p-values reported are not corrected for multiple testing. To answer the OP's question: you can capture the summary output using capture.output,
e.g.
dat.multipatt.summary <- capture.output(summary(dat.multipatt, indvalcomp = TRUE))
Again, I do not recommend this. It is very important to correct the p-values for multiple testing, so the summary output actually isn't helpful. To be clear, ?multipatt states:
"sign Data table with results of the best matching pattern, the association value and the degree of statistical significance of the association (i.e. p-values from permutation test). Note that p-values are not corrected for multiple testing."
I just posted an answer describing how to correct the p-values here: https://stats.stackexchange.com/questions/370724/indiscpecies-multipatt-and-overcoming-multi-comparrisons/401277#401277
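In outline, the correction can be applied directly to the p-values stored in the object's sign element rather than to the summary printout (a sketch; data and groups stand in for your own inputs, and p.value is the column documented in ?multipatt):
# Sketch: adjust the permutation p-values held in the 'sign' element.
res <- multipatt(data, groups)
res$sign$p.value.adj <- p.adjust(res$sign$p.value, method = "fdr")
head(res$sign[order(res$sign$p.value.adj), ])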
I don't have any experience with this package, and since you haven't provided the data it's difficult to reproduce. But since summary is returning NULL, are you sure your x is computed properly? Check the object.size or class or something else of x to see if it indeed has any content.
Also, instead of accessing all the contents of summary(x) together, you can use @ to access slots of the object (similar to $ in a data frame).
If you need further assistance, it would be better to provide at least a small subset of your data or some other sample data so that the community can work with it.
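For reference, checking the object might look like this (a sketch, with x being the saved multipatt object from the question):
class(x)                # should be "multipatt"
object.size(x)          # sanity check that x actually holds content
str(x, max.level = 1)   # top-level components, including 'sign'
head(x$sign)            # per-species best pattern, stat and p-value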

Using `data()` for time series objects in R

I apologise if this question has been asked already (I haven't been able to find it). I was under the impression that I could access datasets in R using data(), for example, from the datasets package. However, this doesn't work for time series objects. Are there other examples where this is not the case? (And why?)
data("ldeaths") # no dice
ts("ldeaths") # works
(However, this works for data("austres"), which is also a time-series object).
The data function is designed to load package data sets and all their attributes, time series or otherwise.
I think the issue you're having is that there is no stand-alone data set called ldeaths in the datasets package. ldeaths exists as one of three data sets in the UKLungDeaths data set; the other two are fdeaths and mdeaths.
The following should lazily load all data sets.
data(UKLungDeaths)
Then, typing ldeaths in the console or using it as an argument in some function will load it.
str(ldeaths)
While it is uncommon for package authors to include multiple objects in one data set, it does happen, and this line from the data function documentation gives one a 'heads up' about it:
"For each given data set, the first two types (‘.R’ or ‘.r’, and ‘.RData’ or ‘.rda’ files) can create several variables in the load environment, which might all be named differently from the data set"
That is the case here: while there are three time series objects contained in the data set, none of them is named UKLungDeaths.
This situation arises when the package author uses the save function to write multiple R objects to an external file. In the wild, I've seen folks use save to bundle a description file with the data set, although this would not be the proper way to document something in a full-on package. If you're really curious, go read the documentation for the save function.
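To see the mechanism in action, here is a small self-contained sketch:
# save() can bundle several objects into one .rda file; this is exactly why
# data(UKLungDeaths) creates three variables in the load environment.
data(UKLungDeaths)                          # creates ldeaths, fdeaths, mdeaths
f <- tempfile(fileext = ".rda")
save(ldeaths, fdeaths, mdeaths, file = f)   # three objects, one file
rm(ldeaths, fdeaths, mdeaths)
load(f)                                     # all three come back
ls(pattern = "deaths")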
Justin

Convention for R function to read a file and return a collection of objects

I would like to find out what the "R way" would be to let users do the following with R: I have a file that can contain the data of one or more analysis runs of some other software. My R package should provide additional ways to calculate statistics or produce plots for those analyses. So the first step a user has to do is read in the file (with one or more analyses), then select an analysis and work with it.
An analysis is uniquely identified by two names (an analysis name and an analysis type where the type should later correspond to an S3 class).
What I am not sure about is how best to represent the collection of analyses that is returned when reading in the file: should this be an object or simply a list of lists (since there are two ids for identifying an analysis, the outer list could be indexed by name and the inner one by type)? Using a list feels very low-level and clumsy, though.
If the read function returns a special kind of container object what would be a good method to access one of the contained objects based on name and type?
There are probably many ways to do this, but since I have only just started writing R code that others will eventually use, I am not sure how best to follow existing R conventions when designing this.
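For what it's worth, one common S3 pattern is a classed list plus an extractor function. The following is only an illustrative sketch (all names are made up, not taken from an existing package); each analysis is assumed to be a list carrying $name, $type and its data:
# Illustrative S3 container for a collection of analyses.
read_analyses <- function(path) {
  runs <- list()  # in practice: parse `path` and fill this with analyses
  structure(list(runs = runs), class = "analysis_collection")
}

# Extractor keyed on the two identifiers; the type doubles as an S3 class
# on the returned analysis so that later methods can dispatch on it.
get_analysis <- function(x, name, type) {
  stopifnot(inherits(x, "analysis_collection"))
  for (run in x$runs) {
    if (identical(run$name, name) && identical(run$type, type)) {
      class(run) <- c(type, "analysis")
      return(run)
    }
  }
  stop("no analysis with name '", name, "' and type '", type, "'")
}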
