Creating 'structured' missing data in a data table in R

I am doing research in a lab with a mentor who has developed a model that analyzes genetic data, which utilizes an ANOVA. I have simulated a dataset that I want to use in evaluating our model's ability to handle varying levels of missing data.
Our dataset consists of 15 species, with 4 individuals each, which we represent by naming the columns 'A'(x4) 'B'(x4)...etc. Each row represents a gene.
I'm trying to come up with code that removes 1% of the data at random, but in such a way that each species keeps at least 2 individuals with valid data; otherwise our model will just quit out (since it's ANOVA-based).
I realize this makes the 'randomly' missing data not so random, but we're trying different methods. It's important that the missing data is otherwise randomized. I'm hoping someone could help me with setting this up?

Here is a toy example that may help:
is_valid_df <- function(df, col, val) {
  # TRUE if every group in `col` still has more than `val` rows
  all(table(df[col]) > val)
}

filter_function <- function(df, perc, col, val) {
  n <- nrow(df)
  filter <- sample(1:n, n * perc)
  if (is_valid_df(df[-filter, ], col, val)) {
    df[-filter, ]
  } else {
    cat("resampling\n")
    # note: the recursive call must be the returned value; calling it
    # without returning it would make the function return NULL
    filter_function(df, perc, col, val)
  }
}

set.seed(20)
a <- filter_function(iris, 0.1, "Species", 44)
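The toy above drops whole rows; a closer match to the layout described in the question is to blank out individual cells. Below is a minimal sketch, assuming 15 species ('A' to 'O') with 4 individuals each and one row per gene; all names and sizes here are made up. It removes 1% of cells at random while guaranteeing each species keeps at least 2 valid individuals in every row:

```r
set.seed(1)
n_genes <- 100; n_species <- 15; n_per <- 4
species <- rep(LETTERS[1:n_species], each = n_per)
mat <- matrix(rnorm(n_genes * n_species * n_per), nrow = n_genes)
colnames(mat) <- paste0(species, rep(1:n_per, times = n_species))

target <- round(0.01 * length(mat))  # 1% of all cells
removed <- 0
while (removed < target) {
  i <- sample(n_genes, 1)
  j <- sample(ncol(mat), 1)
  if (is.na(mat[i, j])) next
  sp_cols <- which(species == species[j])
  # only blank this cell if the species keeps >= 2 valid individuals here
  if (sum(!is.na(mat[i, sp_cols])) > 2) {
    mat[i, j] <- NA
    removed <- removed + 1
  }
}
mean(is.na(mat))  # 0.01
```

At a 1% removal rate the constraint is rarely binding, so the rejection loop terminates quickly; at much higher rates a smarter sampling scheme would be needed.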

Grouping and transposing data in R

It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
[The question included three table illustrations: the structure I had, where I have got to with my transformation efforts, and what I need to end up with. They are not reproduced here.]
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured, for example the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data (whilst writing the code I have restricted myself to just 3 years; the illustration of the structure is based on this test data). The number of years captured will change over time; generally it will increase.
The number of policies will fluctuate. I've just labelled them policy 1, 2, etc. for sensitivity reasons, and limited the number whilst testing the code to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric, and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric became the column and the grouped data became the row, then expand the data so the metrics are repeated for each year. A friend of mine who does coding (but has never used R) suggested loops might be a better way forward; again, I am not sure of the best approach, so I welcome advice. On Reddit someone suggested pivot_wider/pivot_longer, but this appears to be a summarise tool, and I am not trying to summarise the data, rather transform its structure.
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks
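For what it's worth, pivot_longer()/pivot_wider() from tidyr reshape data rather than summarise it, so they may fit this problem after all. Here is a minimal sketch assuming the imported table has one row per metric, one column per year, and a policy label; the real column names and values will differ:

```r
library(tidyr)

# toy stand-in for one policy's imported tab
wide <- data.frame(
  policy = "policy 1",
  metric = c("metric 1", "metric 2"),
  y2024  = c(10, 20),
  y2030  = c(15, 25),
  y2035  = c(18, 30)
)

# lengthen: one row per policy/metric/year
long <- pivot_longer(wide, cols = starts_with("y"),
                     names_to = "year", names_prefix = "y",
                     values_to = "value")

# widen: metrics become columns, one row per policy/year
result <- pivot_wider(long, names_from = metric, values_from = value)
```

The long format is often the most convenient shape to store in a database, with the wide-by-metric version produced on demand for analysis.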

R - select cases so that the mean of a variable is some given number

I previously worked on a project where we examined some sociological data. I did the descriptive statistics and after several months, I was asked to make some graphs from the stats.
I made the graphs, but something seemed odd, and when I compared the graphs to the numbers in the report, I noticed that they are different. Upon investigating further, I noticed that my cleaning code (which removed participants with duplicate IDs) now results in more rows, i.e. more participants with unique IDs than previously. I now have 730 participants, whereas previously there were 702. I don't know if this was due to updates of some packages, and unfortunately I cannot post the actual data here because it is confidential, but I am trying to find out who these 28 participants are and what happened in the data.
Therefore, I would like to know if there is a method that allows the user to filter the cases so that the mean of some variables is a set number. Ideally it would be something like this, but of course I know that it's not going to work in this form:
iris %>%
  filter_if(mean(.$Petal.Length) == 1.3)
I know that this was an incorrect attempt but I don't know any other way that I would try this, so I am looking for help and suggestions.
I'm not convinced this is a tractable problem, but you may get somewhere by doing the following.
Firstly, work out what the sum of the variable was in your original analysis, and what it is now:
old_sum <- 702 * old_mean
new_sum <- 730 * new_mean
Now work out what the sum of the variable in the extra 28 cases would be:
extra_sum <- new_sum - old_sum
This allows you to work out the relative proportions of the sum of the variable from the old cases and from the extra cases. Put these proportions in a vector:
contributions <- c(extra_sum/new_sum, old_sum/new_sum)
Now, using the functions described in my answer to this question, you can find the optimal solution to partitioning your variable to match these two proportions. The rows which end up in the "extra" partition are likely to be the new ones. Even if they aren't the new ones, you will be left with a sample that has a mean that differs from your original by less than one part in a million.
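A brute-force version of this idea, workable only for small candidate pools, is to search directly for the k rows whose values sum closest to extra_sum (k = 28 in the real data; the toy below uses k = 2, and all the values are made up):

```r
# Brute-force sketch: find the k rows whose values sum closest to extra_sum.
# combn() enumeration grows combinatorially, so this only scales to small n;
# for 28 out of 730 rows a proper partitioning/optimisation method is needed.
find_extra_rows <- function(values, extra_sum, k) {
  idx  <- combn(length(values), k, simplify = FALSE)
  sums <- vapply(idx, function(i) sum(values[i]), numeric(1))
  idx[[which.min(abs(sums - extra_sum))]]
}

x <- c(5.1, 4.9, 5.0, 5.2, 2.0, 3.1)
find_extra_rows(x, extra_sum = 5.1, k = 2)  # rows 5 and 6 (2.0 + 3.1 = 5.1)
```

If the raw data files from the original analysis still exist, comparing participant IDs directly (e.g. with setdiff()) would of course be far more reliable than any reconstruction from means.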

Predictive modelling for a 3 dimensional dataframe

I have a dataset which contains all the quotes made by a company over the past 3 years. I want to create a predictive model using the library caret in R to predict whether a quote will be accepted or rejected.
The structure of the dataset is causing me some problems. It contains 45 variables; however, I have only included the two below, as they are the only variables that are important to this problem. An extract of the dataset is shown below.
contract.number item.id
0030586792 32X10AVC
0030586792 ZFBBDINING
0030587065 ZSTAIRCL
0030587065 EMS164
0030591125 YCLEANOFF
0030591125 ZSTEPSWC
contract.number <- c("0030586792","0030586792","0030587065","0030587065","0030591125","0030591125")
item.id <- c("32X10AVC","ZFBBDINING","ZSTAIRCL","EMS164","YCLEANOFF","ZSTEPSWC")
dataframe <- data.frame(contract.number,item.id)
Each unique contract.number corresponds to a single quote made. The item.id corresponds to the item that is being quoted for. Therefore, quote 0030586792 includes both items 32X10AVC and ZFBBDINING.
If I randomise the order of the dataset and model it in its current form, I am worried that a model would just learn which contract.numbers won and lost during training, and this would invalidate my testing, as in the real world this is not known prior to the prediction being made. I also have the additional issue of what to do if the model predicts that the same contract.number will win with some item.ids and lose with others.
My ideal solution would be to condense each contract.number into a single line with multiple item.ids per line, forming a three-dimensional dataframe. But I am not aware whether caret would then be able to model this. It is not realistic to split the item.ids into multiple columns, as some quotes have hundreds of item.ids. Any help would be much appreciated!
(Sorry if I haven't explained well!)
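One way to get one row per quote without hundreds of item columns is to collapse the items into a single string, or to build a 0/1 indicator matrix. A minimal base-R sketch using the extract above (whether indicators are practical depends on how many distinct item.ids exist):

```r
contract.number <- c("0030586792", "0030586792", "0030587065",
                     "0030587065", "0030591125", "0030591125")
item.id <- c("32X10AVC", "ZFBBDINING", "ZSTAIRCL",
             "EMS164", "YCLEANOFF", "ZSTEPSWC")
quotes <- data.frame(contract.number, item.id)

# one row per contract, items collapsed into a single string
per_quote <- aggregate(item.id ~ contract.number, data = quotes,
                       FUN = paste, collapse = ";")

# alternatively, a 0/1 indicator column per item -- wide and sparse in
# practice, but a shape that standard modelling functions can consume
indicators <- as.data.frame.matrix(table(quotes$contract.number,
                                         quotes$item.id))
```

Collapsing to one row per contract also resolves the inconsistency worry: the model then makes exactly one win/lose prediction per quote.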

PLS-DA in mdatools package in R

I have been trying to use the mdatools package to run a pls-da using the plsda() function. I have data with 9000 variables and around 30 observations. Each of the 30 rows are patients; the first column contains the clinical status for each patient (disease or control) and the remaining 8999 columns contain numerical data on the patients. I used the following code to run the plsda:
plsda(data[,2:9000], data[,1], ncomp = 8999, coeffs.ci = 'jk')
When the code finally runs, it returns an error saying "Error in selectCompNum.pls(model, ncomp): wrong number of selected components!"
I chose ncomp = 8999 as that is the number of predictor columns (columns 2 to 9000). The strange thing is, this worked well with a low number of components. For example, when I tried
plsda(data[,2:10], data[,1], ncomp = 9, coeffs.ci = 'jk')
No error message is returned.
Perhaps I am misunderstanding how to select the right number of components? I would greatly appreciate any help. Thank you very much in advance!
I am the developer of the mdatools package and just came across your question by accident. The number of components in PLS/PLS-DA is the number of latent variables. Every latent variable is a linear combination of the original variables. Normally you need far fewer components than original variables; depending on the type of data, the number of components can be anywhere from 1-2 to 10-20. I recommend you look at the PLS part of the tutorial and contact me directly (e.g. by email) if you still have any questions or issues with the package.
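The hard ceiling can also be seen with plain PCA (not PLS-DA, but the same dimensionality logic): with 30 observations you can never extract more than 30 components, no matter how many columns the data has. A quick illustration on simulated data:

```r
set.seed(42)
X <- matrix(rnorm(30 * 500), nrow = 30)  # 30 observations, 500 variables
p <- prcomp(X)
ncol(p$x)  # 30: the component count is capped by the number of observations
```

So with ~30 patients, ncomp = 8999 can never be satisfied, whereas a small value such as 9 (or 10-20) is within the feasible range.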

What makes these two R data frames not identical?

I have two small data frames, this_tx and last_tx. They are, in every way that I can tell, completely identical. this_tx == last_tx results in a frame of identical dimensions, all TRUE. this_tx %in% last_tx, two TRUEs. Inspected visually, clearly identical. But when I call
identical(this_tx, last_tx)
I get a FALSE. Hilariously, even
identical(str(this_tx), str(last_tx))
will return a TRUE. If I set this_tx <- last_tx, I'll get a TRUE.
What is going on? I don't have the deepest understanding of R's internal mechanics, but I can't find a single difference between the two data frames. If it's relevant, the two variables in the frames are both factors - same levels, same numeric coding for the levels, both just subsets of the same original data frame. Converting them to character vectors doesn't help.
Background (because I wouldn't mind help on this, either): I have records of drug treatments given to patients. Each treatment record essentially specifies a person and a date. A second table has a record for each drug and dose given during a particular treatment (usually, a few drugs are given each treatment). I'm trying to identify contiguous periods during which the person was taking the same combinations of drugs at the same doses.
The best plan I've come up with is to check the treatments chronologically. If the combination of drugs and doses for treatment[i] is identical to the combination at treatment[i-1], then treatment[i] is a part of the same phase as treatment[i-1]. Of course, if I can't compare drug/dose combinations, that's right out.
Generally, in this situation it's useful to try all.equal which will give you some information about why two objects are not equivalent.
Well, the jaded cry of "moar specifics plz!" may win in this case:
Check the output of dput() and post if possible. str() just summarizes the contents of an object whilst dput() dumps out all the gory details in a form that may be copied and pasted into another R interpreter to regenerate the object.
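A common culprit is an attribute mismatch that == never looks at, such as row names stored as character in one frame and as integers in the other. (Incidentally, identical(str(a), str(b)) is always TRUE because str() prints to the console and returns NULL, so you are comparing NULL to NULL.) A toy reproduction of the symptom:

```r
df1 <- data.frame(x = 1:2)
df2 <- df1
rownames(df2) <- c("1", "2")  # visually identical rows, character row names

df1 == df2           # all TRUE: the values match element-wise
identical(df1, df2)  # FALSE: the row.names attributes differ
all.equal(df1, df2)  # describes the attribute mismatch instead of TRUE
dput(df1)            # shows row.names = c(NA, -2L) here, c("1", "2") for df2
```

Running dput() on both of your frames and diffing the output should pinpoint exactly which attribute differs.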