Is there an R function/package to transform time series data for confidentiality reasons?

I wish to share a dataset (largely time-series data) with a group of data scientists so they can explore the statistical relationships within it (e.g. between variables). However, for confidentiality reasons I am unable to share the original dataset, so I was wondering whether I might be able to transform the data with some random transformation that I know but the recipients don't. Is this a common practice? Is there an associated R package?
I have been exploring the use of synthetic datasets and have looked at 'synthpop', but my challenge seems slightly different. For example, I don't necessarily want the data to include fictional individuals that resemble the original file. Rather, I'd prefer the value associated with a specific variable to be obscured (e.g. still numerical but nonsensical to a human viewer) while still enabling statistical analysis (e.g. despite the actual values being obscured, the relationship between variables x and y remains the same).
I have a feeling that this is probably quite a simple process (e.g. change the names of the variables, apply the same transformation across all variables), but I'm not a mathematician/statistician, so I don't want to destroy the underlying relationships through an inappropriate transformation.
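To make the idea concrete, here is the sort of thing I have in mind - purely illustrative, with made-up numbers, and I realise a real disclosure-control method would need expert review:

set.seed(1)
orig <- data.frame(x = rnorm(50), y = rnorm(50))
orig$y <- orig$y + 0.8 * orig$x          # build in a relationship

# secret per-variable affine keys (a > 0 preserves the sign of the correlation)
key <- list(x = c(a = 3.7, b = -12), y = c(a = 0.4, b = 99))

masked <- data.frame(
  x = key$x[["a"]] * orig$x + key$x[["b"]],
  y = key$y[["a"]] * orig$y + key$y[["b"]]
)

# Pearson correlation is invariant to positive affine rescaling
all.equal(cor(orig$x, orig$y), cor(masked$x, masked$y))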
Thanks!

Related

Predicting a numeric attribute through high dimensional nominal attributes

I'm having difficulties mining a big (100K entries) dataset of mine concerning logistics transportation. I have around 10 nominal string attributes (i.e. city/region/country names, customer/vessel identification codes, etc.). Along with those, I have one date attribute "departure" and one ratio-scaled numeric attribute "goal".
What I'm trying to do is use a training set to find out which attributes correlate strongly with "goal" and then validate these patterns by predicting the "goal" value of entries in a test set.
I assume clustering, classification and neural networks could be useful for this problem, so I used RapidMiner, KNIME and ELKI and tried to apply some of their tools to my data. However, most of these tools only handle numeric data, so I got no useful results.
Is it possible to transform my nominal attributes into numeric ones? Or do I need to find different algorithms that can actually handle nominal data?
You most likely want to use a tree-based algorithm; these are good at using nominal features. Please be aware that you do not want to use "id-like" attributes.
I would recommend RapidMiner's AutoModel feature as a start. GBT and RandomForest should work well.
Best,
Martin
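For completeness, here is the same idea sketched in R rather than RapidMiner - a minimal, illustrative example with made-up data, assuming the randomForest package is installed:

library(randomForest)

set.seed(7)
df <- data.frame(
  city  = factor(sample(c("Hamburg", "Rotterdam", "Antwerp"), 200, replace = TRUE)),
  month = factor(sample(month.abb, 200, replace = TRUE)),
  goal  = rnorm(200)
)

# Tree ensembles consume factors directly; no dummy coding needed
fit <- randomForest(goal ~ city + month, data = df, importance = TRUE)
importance(fit)   # rough view of which nominal attributes matter for "goal"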
The handling of nominal attributes does not depend on the tool; it is a question of which algorithm you use. For example, k-means with Euclidean distance can't handle string values, but other distance functions can, and some algorithms handle nominal data directly, for example the random forest implementation in RapidMiner.
You can also, of course, transform the nominal attributes to numerical ones, for example by using a binary dummy encoding or by assigning a unique integer value to each category (which might introduce some bias). In RapidMiner you have the Nominal to Numerical operator for that.
Depending on the distribution of your nominal values it might also be useful to handle rare values: you could either group them together in a new category (such as "other") or use a feature selection algorithm after you apply the dummy encoding, as sketched below.
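To sketch both steps in R terms (this answer uses RapidMiner, but the transformations themselves are tool-agnostic; the data here is made up):

df <- data.frame(
  city = c("Hamburg", "Rotterdam", "Hamburg", "Antwerp", "Genoa"),
  goal = c(10.2, 8.1, 11.0, 7.4, 9.3)
)

# Group rare nominal values into an "other" bucket (the threshold is arbitrary)
counts <- table(df$city)
rare <- names(counts)[counts < 2]
df$city <- ifelse(df$city %in% rare, "other", df$city)

# Binary dummy (one-hot) encoding via model.matrix; -1 drops the intercept
# so every remaining level gets its own 0/1 column
X <- model.matrix(~ city - 1, data = df)
cbind(as.data.frame(X), goal = df$goal)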
See the screenshot for a sample RapidMiner process (which uses the Replace Rare Values operator from the Operator Toolbox extension).
Edit: Martin is also right, AutoModel will be a good start to check for problematic attributes and find a fitting algorithm.

Comparison of good vs bad dataset using R

I'm stuck on a problem. There are two datasets, A and B. Say they're datasets of two factories. Factory A is performing really well whereas Factory B is not. I have the dataset of Factory A (data being the output from the manufacturing units) as well as that of Factory B, both having the same variables. How can I identify the problematic variable in Factory B which needs immediate attention and needs to be fixed so that Factory B starts performing well too?
Looking forward to your response.
P.S.: the coding language being used is R.
Well, this is a shameless plug for the dataMaid package, which I helped write and which does roughly what you are asking. The idea of the dataMaid package is to run a battery of tests on the variables in a data frame and produce a report that a human investigator (preferably someone with knowledge about the context) can look through in order to identify potential problems.
A super simple way to get started is to load the package and use the clean function on a data frame (if you try to clean the same data frame several times, it may be necessary to add the replace=TRUE argument to overwrite the existing report).
devtools::install_github("ekstroem/dataMaid")  # install the development version from GitHub
library(dataMaid)
data(trees)    # built-in example dataset
clean(trees)   # generates a data-screening report
This will create a report with summaries and error checks for each variable in the trees data frame. The report opens with a summary of all the variables, followed by a section for each individual variable giving its type, summary statistics, a plot of its distribution and, where relevant, problem indicators - for the trees data, for example, an indicator that there might be a problem with outliers.
The dataMaid package can also be used interactively by running checks for the individual variables or for all variables in the dataset
data(toyData)
check(toyData$var2) # Individual check of var2
check(toyData) # Check all variables at once
By default, the standard battery of tests is run depending on the variable type, but it is possible to extend the package by providing your own checks.
In your case I would run the package on both datasets to get two reports, and any major differences in those would raise a flag about what could be problematic.
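So, assuming the two factories' data live in data frames factoryA and factoryB (hypothetical names), the workflow would simply be:

clean(factoryA)   # report for the well-performing factory
clean(factoryB)   # report for the under-performing factory
# then compare the two generated reports variable by variable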

Different types of categorical variables and how to make R use them in PCA

I am starting to use R and learning about PCA too, but my problem here relates to categorical variables.
I have three main types of categorical variables in my data set.
Type one: Presence/absence of a trait, and I know that I can transform that into 1/0 in the table and no worries.
Type two: Frequency or abundance of a trait, that I assume R will understand if I transform something that is absent/less abundant/abundant into 0/1/2, for example.
Type three: The problematic one. This is truly categorical, functioning as a label. The name of the variable is "cambial variant", and the possibilities are: absent/lobed/compound/fissured/corded and so on...
These are different types of cambial variants, and those different types can be found in different species of plants.
First I assumed that it would be OK to simply use different numbers for these different types (e.g. absent = 0, lobed = 1, compound = 2, and so on). So I performed prcomp like that, and the results were exactly what I expected. But then I realized I had made a mistake by using different numbers for things that are not related at all, right? The program understands that 3 is bigger than 2, and that is somehow wrong, because the "fissured" type and the "compound" type are not levels of intensity (abundance, frequency, whatever) of the same thing. They are different types of the same structure, but one does not transform into the other; they are not related in that way.
So, in sum, I want to know how I could transform these variables, which are more like labels, into something that the function prcomp can understand and use. Or, if that is not an option, whether there is another function in R that can do a PCA with categorical variables that can't be converted to numbers.
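For example, would dummy (0/1) coding, with one column per type, be the right approach? A toy sketch of what I mean, with entirely made-up data:

df <- data.frame(
  height = c(3.2, 5.1, 4.8, 2.9),
  cambial_variant = factor(c("absent", "lobed", "compound", "lobed"))
)

# model.matrix expands the factor into one 0/1 column per level
# (-1 drops the intercept so no level is left out)
X <- model.matrix(~ height + cambial_variant - 1, data = df)

# prcomp then sees only numeric columns; scaling seems sensible when
# mixing 0/1 indicators with continuous traits
pca <- prcomp(X, scale. = TRUE)
summary(pca)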
Thank you all, and I am sorry if my question is way too dumb, I am really just a beginner here!

treating categorical/binary variables in lm

When building a regression model using lm, do we need to explicitly specify which variables should be treated as categorical or binary? If we have to, how do we do that? Thanks.
This brings up another important question: Is whether a variable is numeric or categorical a property of the data or a property of the analysis?
Back in the early days of statistical computing it was easier to store categorical variables as numbers and therefore it was necessary to designate at some point that these variables were indeed representing categories rather than the numbers themselves having meaning. The common place to designate this was at the point of analysis. This results in a legacy of having the variable type be a property of the analysis.
R (and others) is a much more modern language and takes the approach that this should be a property of the data itself. This simplifies things in that you can make the designation once and all resulting analyses/graphs/tables/etc. will treat the variable properly. I think this approach is much more intuitive and simple; after all, if a particular variable is categorical for one analysis, shouldn't it be categorical for all analyses, graphs, tables, etc.?
This has been a bit of a long answer, but the idea is to help you shift your thinking from how to designate this in the analysis to thinking how to designate the properties for the data itself. If you designate that your data is a factor (using the factor, ordered or other functions) before any analysis, then the R analysis/graphing/table tools will do the correct thing. Depending on how your data looks and how it was entered/imported, this conversion may have already been done for you.
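A minimal sketch of the difference, with made-up data:

dat <- data.frame(
  y = c(2.1, 3.4, 2.8, 5.0, 4.7, 5.2),
  group = c(1, 1, 2, 2, 3, 3)   # 1/2/3 are labels, not quantities
)

# As numbers, lm() fits a single slope for group:
lm(y ~ group, data = dat)

# Declared a factor once, lm() (and plots, tables, ...) treats it as categories:
dat$group <- factor(dat$group, labels = c("a", "b", "c"))
lm(y ~ group, data = dat)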
Other properties, such as the order of the categories, should also be properties of the data, not the analysis/graph/table/etc.

technique to obfuscate clustered data and preserve privacy in r

Background
I have some private survey data that contains a column of confidential information: the geographic location of the survey respondents. Under no circumstances can this information be released.
As is common in survey research, in order for users to correctly calculate a variance on my survey data set, those users will either need that geographic location (unacceptable) or, alternatively, a set of replicate weights. I can create that set of replicate weights; however, it's quite easy to look at the correlations between those weights and back-calculate which of the survey respondents share the same geographic location. That is also unacceptable.
To help me with this question, you don't have to be familiar with replicate weights -- just think of them as a few columns of strongly-correlated clustered data.
I understand that if I want to maintain that clustering, an evil data user will always have semi-decent guesses at who shares geographic locations; I just want to make that guessing game less precise. With the un-obfuscated replicate weights, an evil data user can figure out 100% of the cases.
Request
I am looking for a technique that:
- prevents public use file users from easily deducing the shared geographic locations from the correlations between my replicate weights variables
- does not obliterate the correlations between my columns of data (the replicate weights variables)
- can be implemented on an R data.frame object without a major time investment
I say "shared" because the evil user might not know where the location is, but they might know whether two survey respondents are from the same location -- an unacceptable possibility.
What I have tried
I don't really want to re-invent the wheel here. I am looking for R syntax, an R package, or anything else that would be relatively straightforward to implement. I've found one, two, three, four papers describing techniques that would all be suitable for my purposes; unfortunately, none of the authors have been willing to share actual code to implement them.
I can do simple things like adding and subtracting random values to my replicate weights columns according to a normal distribution (sketched below), but I'd prefer to rely on the work of someone who understands privacy issues better than I do.
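Here is a sketch of that naive approach, just to be explicit about what I mean (all numbers and names here are made up):

set.seed(42)

# Stand-in for the replicate weights: a few strongly-correlated columns
n <- 100
base <- rnorm(n)
repwts <- data.frame(
  rw1 = base + rnorm(n, sd = 0.1),
  rw2 = base + rnorm(n, sd = 0.1),
  rw3 = base + rnorm(n, sd = 0.1)
)

# Naive perturbation: mean-zero gaussian noise on every cell. A small sd
# keeps the between-column correlations roughly intact but only weakly
# hides which rows cluster together -- hence this question.
noisy <- repwts + matrix(rnorm(n * ncol(repwts), sd = 0.05), nrow = n)

round(cor(repwts), 2)
round(cor(noisy), 2)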
Thanks!
I have written this nine-step tutorial to walk through the process in an attempt to answer my own question. I am not an expert in the field of privacy/confidentiality and would love to hear feedback about this idea as well as other ideas. Thanks!
http://www.asdfree.com/2014/09/how-to-provide-variance-calculation-on.html
