I have a dataset which contains all the quotes made by a company over the past 3 years. I want to create a predictive model using the library caret in R to predict whether a quote will be accepted or rejected.
The structure of the dataset is causing me some problems. It contains 45 variables, but I have only included the two that matter for this problem. An extract of the dataset is shown below.
contract.number item.id
0030586792 32X10AVC
0030586792 ZFBBDINING
0030587065 ZSTAIRCL
0030587065 EMS164
0030591125 YCLEANOFF
0030591125 ZSTEPSWC
contract.number <- c("0030586792","0030586792","0030587065","0030587065","0030591125","0030591125")
item.id <- c("32X10AVC","ZFBBDINING","ZSTAIRCL","EMS164","YCLEANOFF","ZSTEPSWC")
dataframe <- data.frame(contract.number,item.id)
Each unique contract.number corresponds to a single quote made. The item.id corresponds to the item that is being quoted for. Therefore, quote 0030586792 includes both items 32X10AVC and ZFBBDINING.
If I randomise the order of the dataset and model it in its current form, I am worried that the model would simply learn which contract.numbers won and lost during training, which would invalidate my testing, since in the real world this is not known before the prediction is made. I also have the additional issue of what to do if the model predicts that the same contract.number will win with some item.ids and lose with others.
My ideal solution would be to condense each contract.number into a single line with multiple item.ids per line, forming a three-dimensional dataframe. But I am not sure whether caret would be able to model this. It is not realistic to split the item.ids into multiple columns, as some quotes have hundreds of item.ids. Any help would be much appreciated!
(Sorry if I haven't explained well!)
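For reference, here is a minimal sketch (using dplyr on the example dataframe above) of the condensed one-row-per-quote structure I have in mind; whether caret can actually model a list-column like this is exactly my question:
library(dplyr)
# one row per quote, with all of its item.ids held in a list-column
quotes <- dataframe %>%
  group_by(contract.number) %>%
  summarise(items = list(as.character(item.id)), .groups = "drop")
quotes$items[[1]]  # "32X10AVC" "ZFBBDINING"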
It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had: (table shown as an image in the original post)
Where I have got to with my transformation efforts: (image in the original post)
What I need to end up with: (image in the original post)
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured, for example the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data (while writing the code I have restricted myself to just 3 years; the illustration of the structure is based on this test). The number of years captured will change over time, and generally it will increase.
The number of policies will fluctuate. I've just labelled them policy 1, policy 2, etc. for sensitivity reasons, and limited their number while testing the code to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric, and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric becomes the column and the grouped data becomes the row, and then expand the data so the metrics are repeated for each year. A friend of mine who codes (but has never used R) suggested that loops might be a better way forward. Again, I am not sure of the best approach, so I welcome advice. On Reddit someone suggested pivot_wider/pivot_longer, but those look like summarising tools to me, and I am not trying to summarise the data, rather transform its structure.
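To check whether the pivot route does what I need, I tried a toy version of the data as I understand it (all names and numbers are invented):
library(dplyr)
library(tidyr)
# one tibble per policy, mimicking what the initial import produces
policies <- list(
  policy1 = tibble(metric = c("metric.A", "metric.B"),
                   `2024` = c(10, 5), `2030` = c(12, 6), `2035` = c(15, 8)),
  policy2 = tibble(metric = c("metric.A", "metric.B"),
                   `2024` = c(20, 9), `2030` = c(22, 10), `2035` = c(25, 12))
)
# stack the tibbles with a policy id, then reshape; pivot_longer/pivot_wider
# move values around rather than summarising them
long <- bind_rows(policies, .id = "policy") %>%
  pivot_longer(-c(policy, metric), names_to = "year", values_to = "value")
wide <- pivot_wider(long, names_from = metric, values_from = value)
# wide: one row per policy and year, one column per metric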
Any suggestions on approaches or possible tools/functions would be gratefully received. I am learning R while pulling this data together to create a database that can be used for analysis, so if my approach sounds weird, feel free to suggest alternatives. Thanks!
I am doing research in a lab with a mentor who has developed a model that analyzes genetic data, which utilizes an ANOVA. I have simulated a dataset that I want to use in evaluating our model's ability to handle varying levels of missing data.
Our dataset consists of 15 species with 4 individuals each, which we represent by naming the columns 'A' (x4), 'B' (x4), etc. Each row represents a gene.
I'm trying to come up with code that removes 1% of the data at random, but such that each species keeps at least 2 individuals with valid data, because otherwise our model will just quit (since it is ANOVA-based).
I realize this makes the 'randomly' missing data not quite random, but we're trying different methods; it's important that the missing data is otherwise randomized. I'm hoping someone could help me with setting this up?
Here is a toy example that may help:
is_valid_df <- function(df, col, val) {
  # TRUE if every level of `col` still has more than `val` rows left
  all(table(df[[col]]) > val)
}
filter_function <- function(df, perc, col, val) {
  n <- nrow(df)
  drop <- sample(seq_len(n), round(n * perc))
  if (is_valid_df(df[-drop, ], col, val)) {
    df[-drop, ]
  } else {
    cat("resampling\n")
    # the recursive call must be the return value
    filter_function(df, perc, col, val)
  }
}
set.seed(20)
a <- filter_function(iris, 0.1, "Species", 44)
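Note that the toy example above drops whole rows. A cell-level variant closer to the question's layout (assumed from the description: one row per gene, four columns per species) might look like this; it sets ~1% of the cells to NA while keeping at least two valid individuals per species within each gene:
set.seed(20)
n_genes <- 1000
species <- rep(LETTERS[1:15], each = 4)  # 15 species x 4 individuals
x <- matrix(rnorm(n_genes * length(species)), nrow = n_genes,
            dimnames = list(NULL, species))
n_missing <- round(length(x) * 0.01)
masked <- 0
while (masked < n_missing) {
  i <- sample(n_genes, 1)
  j <- sample(ncol(x), 1)
  mates <- which(species == species[j])  # columns of the same species
  # mask this cell only if its species keeps >= 2 valid individuals in gene i
  if (!is.na(x[i, j]) && sum(!is.na(x[i, mates])) > 2) {
    x[i, j] <- NA
    masked <- masked + 1
  }
}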
I have been trying to use the mdatools package to run a PLS-DA using the plsda() function. I have data with 9000 variables and around 30 observations. Each of the 30 rows is a patient; the first column contains the clinical status of each patient (disease or control) and the remaining 8999 columns contain numerical data on the patients. I used the following code to run the PLS-DA:
plsda(data[,2:9000], data[,1], ncomp = 8999, coeffs.ci = 'jk')
When the code finally finishes running, it returns an error saying "Error in selectCompNum.pls(model, ncomp): wrong number of selected components!"
I chose ncomp = 8999 as the total number of predictor columns (columns 2 through 9000 of the data). The strange thing is that this worked fine with a low number of components. For example, when I tried
plsda(data[,2:10], data[,1], ncomp = 9, coeffs.ci = 'jk')
No error message is returned.
Perhaps I am misunderstanding how to select the right number of components? I would greatly appreciate any help. Thank you very much in advance!
I am the developer of the mdatools package and just came across your question. The number of components in PLS/PLS-DA is the number of latent variables. Every latent variable is a linear combination of the original variables. Normally you need far fewer components than original variables; depending on the data, anywhere from 1-2 to 10-20. I recommend you look at the PLS part of the tutorial and ask me directly (e.g. by email) if you still have any questions or issues with the package.
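For example, something along these lines (a sketch: ncomp = 10 is just an illustration, and cv = 1 requests full cross-validation, which the jack-knife confidence intervals rely on). Note also that with only ~30 observations you cannot extract more than roughly 30 latent variables in any case, which is why ncomp = 8999 fails:
library(mdatools)
m <- plsda(data[, 2:9000], data[, 1], ncomp = 10, cv = 1, coeffs.ci = "jk")
summary(m)  # reports the selected number of components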
I am trying to generate a time dummy variable in R. I am analyzing quarterly panel data (1990q1-2013q3). How do I generate a time dummy variable for 2007q1-2009q1 period, i.e. for 2007q1 dummy=1...
The data looks like the picture (shown as an image in the original post); asset rank is the id variable.
Regards & Thanks!
I would say model.matrix is probably your best bet.
date.f <- factor(dat$date)
dummies <- model.matrix(~ date.f)  # intercept plus one 0/1 column per quarter after the first
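If you do not want the intercept column, drop it in the formula:
dummies <- model.matrix(~ date.f - 1)  # one 0/1 column per quarter, no intercept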
I used a simpler way, following this answer. I guess there is no difference between time series and panel data here in terms of application.
print(date)
# assumes `date` is a zoo::yearqtr or text in the fixed "YYYY Qq" format,
# so that these comparisons sort chronologically
dummy <- as.numeric(date >= "2007 Q1" & date <= "2008 Q4")
print(dummy)
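For completeness, the setup this assumes (not shown above) is a quarterly date vector whose "YYYY Qq" text format sorts chronologically, e.g. built with zoo:
library(zoo)
# 1990 Q1 .. 2013 Q3 as "YYYY Qq" strings (95 quarters)
date <- format(as.yearqtr(1990 + seq(0, by = 0.25, length.out = 95)), "%Y Q%q")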
The answer of @Demet is useful, but it gets tedious if you have many periods (e.g. 50).
The answer of @Amstell is useful too; it returns a matrix including an intercept column of ones. Depending on how you want to continue analyzing the data, you will have to work out which form is most useful for your follow-up analysis.
In addition to the answers proposed, the following code gives you one dummy column per period, without the intercept:
dummies <- table(1:length(date), as.factor(date))
Furthermore, it is important to keep track of which time period is the reference group when interpreting the model. With the factor-based approach you can change the reference group, for example with relevel() (shown here with an assumed level name):
date.f <- relevel(date.f, ref = "2007 Q1")  # make 2007 Q1 the reference period
I'm trying to analyse multiple sequences with TraMineR at once. I've had a look at seqdef but I'm struggling to understand how I'd create a TraMineR dataset when I'm dealing with multiple variables. I guess I'm working with something similar to the dataset used by Aassve et al. (as mentioned in the tutorial), whereby each wave has information about several states (e.g. children, marriage, employment). All my variables are binary. Here's an example of a dataset with three waves (D,W2,W3) and three variables.
D<-data.frame(ID=c(1:4),A1=c(1,1,1,0),B1=c(0,1,0,1),C1=c(0,0,0,1))
W2<-data.frame(A2=c(0,1,1,0),B2=c(1,1,0,1),C2=c(0,1,0,1))
W3<-data.frame(A3=c(0,1,1,0),B3=c(1,1,0,1),C3=c(0,1,0,1))
L<-data.frame(D,W2,W3)
I may be wrong, but the material I found deals with the data management and analysis of one variable at a time only (e.g. employment status across several waves). My dataset is much larger than the above, so I can't really input these manually as shown on page 48 of the tutorial. Has anyone dealt with this type of data using TraMineR (or a similar package)?
1) How would you feed the data above to TraMineR?
2) How would you compute the substitution costs and then cluster them?
Many thanks
When using sequence analysis, we are interested in the evolution of one variable (for instance, a sequence of one variable across several waves). You then have several options for analyzing several variables:
Create one set of sequences per variable and then analyze the links between the clusters of sequences. In my opinion, this is the best way to go if your variables measure different concepts (for instance, family and employment).
Create a new variable for each wave that is the interaction of the different variables of that wave, using the interaction function. For instance, for wave one, use L$IntVar1 <- interaction(L$A1, L$B1, L$C1, drop=TRUE) (use drop=TRUE to remove unused combinations of answers), and then analyze the sequence of this newly created variable; see the sketch after this list. In my opinion, this is the preferred way if your variables are different dimensions of the same concept. For instance, marriage, children and union are all related to family life.
Create one sequence object per variable and then use seqdistmc to compute the distances (multi-channel sequence analysis). This is equivalent to the previous method, depending on how you set the substitution costs (see below).
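As a sketch of the second option on the question's example data (assuming the wave-one variables are A1/B1/C1, wave two A2/B2/C2, and wave three A3/B3/C3):
library(TraMineR)
L$IntVar1 <- interaction(L$A1, L$B1, L$C1, drop = TRUE)
L$IntVar2 <- interaction(L$A2, L$B2, L$C2, drop = TRUE)
L$IntVar3 <- interaction(L$A3, L$B3, L$C3, drop = TRUE)
# one sequence per respondent over the three waves of combined states
int.seq <- seqdef(L, var = c("IntVar1", "IntVar2", "IntVar3"))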
If you use the second strategy, you can count the differences between the original variables to set the substitution costs. For instance, between the states "Married, Child" and "Not married, Child" you could set the substitution cost to 1, because there is only a difference on the "marriage" variable. Similarly, you would set the substitution cost between "Married, Child" and "Not married, No Child" to 2, because all of your variables are different. Finally, set the indel cost to half the maximum substitution cost. This is the strategy used by seqdistmc.
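For instance, with two underlying binary variables (invented state names: M/NM for marriage, C/NC for child), the hand-built cost matrix described above would be the following, assuming a sequence object (like int.seq above) whose alphabet is exactly these four states:
states <- c("M.C", "M.NC", "NM.C", "NM.NC")
# cost = number of underlying variables on which the two states differ
sm <- matrix(c(0, 1, 1, 2,
               1, 0, 2, 1,
               1, 2, 0, 1,
               2, 1, 1, 0),
             nrow = 4, dimnames = list(states, states))
dists <- seqdist(int.seq, method = "OM", sm = sm, indel = max(sm) / 2)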
Hope this helps.
Biemann and Datta (2013) discuss multi-dimensional sequence analysis, which means creating multiple sequences for the same individuals.
I used the following approach to do so:
1) Define the three sequence objects, one per dimension:
comp.seq   <- seqdef(comp,   NULL, states = comp.scodes,   labels = comp.labels,   alphabet = comp.alphabet,   missing = "Z")
titles.seq <- seqdef(titles, NULL, states = titles.scodes, labels = titles.labels, alphabet = titles.alphabet, missing = "Z")
member.seq <- seqdef(member, NULL, states = member.scodes, labels = member.labels, alphabet = member.alphabet, missing = "Z")
2) Compute the multi-channel (multi-dimensional) distance matrix:
mcdist <- seqdistmc(channels = list(comp.seq, member.seq, titles.seq),
                    method = "OM", sm = list("TRATE", "TRATE", "TRATE"),
                    with.missing = TRUE)
3) Cluster it with Ward's method:
library(cluster)
clusterward <- agnes(mcdist, diss = TRUE, method = "ward")
plot(clusterward, which.plots = 2)
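If you need cluster memberships rather than just the dendrogram, you can cut the tree (k = 3 here is purely for illustration):
cluster3 <- cutree(as.hclust(clusterward), k = 3)  # one cluster label per sequence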
Never mind the parameters like missing or left, etc., but I hope the brief code sample helps.