I am trying to generate a time dummy variable in R. I am analyzing quarterly panel data (1990q1-2013q3). How do I generate a time dummy variable for 2007q1-2009q1 period, i.e. for 2007q1 dummy=1...
Data looks like in the picture. Asset rank is the id variable.
Regards & Thanks!
I would say model.matrix is probably your best bet.
date.f <- factor(dat$date)
dummies = model.matrix(~date.f)
I used more simpler way following this answer. I guess there is no difference between time series and panel data here in terms of application.
print date
dummy <- as.numeric(date >= "2007 Q1" & date<="2008 Q4")
print (dummy)
The answer of #Demet is useful, but it gets kind of tedious if you have many (e.g. 50 periods).
The answer of #Amstell is useful too, it returns a matrix including an intercept with ones. Depending on how you want to continue analyzing the data you have to take out which is the most useful for your follow-up analysis.
In addition to the answers proposed I propose the following code which shows you just a single dummy variable instead of a huge matrix.
dummies = table(1:length(date),as.factor(date))
Furthermore it is important to take care which time period is the reference group for interpreting the model. You can change the reference group if you have two time periods by applying the following code.
abs(Date(-1))
Related
It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured- for example, the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data (whilst trying to do the code I have restricted myself to just 3 years. The illustration of the structure is based on this test). The number of years captured will change overtime- generally it will increase.
The number of policies will fluctuate, I've just labelled them policy 1, 2 etc for sensitivity reasons and limited the number whilst testing the code. Again, I have limited the number to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric, and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric became the column, and the grouped data would become the row. Then expand the data to get the metrics repeated for each year. A friend of mine who does coding (but has never used R) has suggested using loops might be a better way forward. Again, I am not sure of the best approach so welcome advice. On Reddit someone suggested using pivot_wider/pivot_longer but this appears to be a summarise tool and I am not trying to summarise the data rather transform its structure.
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks
I have a dataset which contains all the quotes made by a company over the past 3 years. I want to create a predictive model using the library caret in R to predict whether a quote will be accepted or rejected.
The structure of the dataset is causing me some problems. It contains 45 variables, however, I have only included two bellow as they are the only variables that are important to this problem. An extract of the dataset is shown below.
contract.number item.id
0030586792 32X10AVC
0030586792 ZFBBDINING
0030587065 ZSTAIRCL
0030587065 EMS164
0030591125 YCLEANOFF
0030591125 ZSTEPSWC
contract.number <- c("0030586792","0030586792","0030587065","0030587065","0030591125","0030591125")
item.id <- c("32X10AVC","ZFBBDINING","ZSTAIRCL","EMS164","YCLEANOFF","ZSTEPSWC")
dataframe <- data.frame(contract.number,item.id)
Each unique contract.number corresponds to a single quote made. The item.id corresponds to the item that is being quoted for. Therefore, quote 0030586792 includes both items 32X10AVC and ZFBBDINING.
If I randomise the order of the dataset and model it in its current form I am worried that a model would just learn which contract.numbers won and lost during training and this would invalidate my testing as in the real world this is not known prior to the prediction being made. I also have the additional issue of what to do if the model predicts that the same contract.number will win with some item.id's and loose with others.
My ideal solution would be to condense each contract.number into a single line with multiple item.ids per line to form a 3 dimensional dataframe. But i am not aware if caret would then be able to model this? It is not realistic to split the item.ids into multiple columns as some quotes have 100s of item.id's. Any help would be much appreciated!
(Sorry if I haven't explained well!)
I have extensively read and re-read the Troubleshooting R Connections and Tableau and R Integration help documents, but as a new Tableau user they just aren't helping me.
I need to be able to calculate Kaplan-Meier survival probabilities across any dimensions that are dragged onto the sheet. Ideally, I would be able to retrieve this in a tabular format at multiple time points, but for now, I would be happy just to get it at a single time point.
My data in Tableau have columns for [event-boolean] and [time to event]. Let's say I also have columns for Gender and District.
Currently, I have a calculated field [surv] as:
SCRIPT_REAL('
library(survival);
fit <- summary(survfit(Surv(.arg2,.arg1) ~ 1), times=365);
fit$surv'
, min([event-boolean])
, min([time to event])
)
I have messed with Computed Using, Addressing, Partitions, Aggregate Measures, and parameters to the R function, but no combination I have tried has worked.
If [District] is in Columns, do I need to change my SCRIPT_REAL call or do I just need to change some other combination of levers?
I used Andrew's solution to solve this problem. Essentially,
- Turn off Aggregate Measures
- In the Measure Values shelf, select Compute Using > Cell
- In the calculated field, start with If FIRST() == 0 script_*() END
- Ctrl+drag the measure to the Filters shelf and use a Special > Non-null filter.
I'm a new user to R. I need to run wilcoxon test on a large set of data. Currently I have a whole year of transaction data (each transaction is categorized by quarter, say Q12014) and was able to get a result for the complete set. My code is as follows (with ties broken by transaction amount):
> total$reRank=NA
> total$reRank[order(total$Rank,-total$TxnAmount.x)]=1:length(total$Rank)
> Findings=total$reRank[total$Findings==1]
> NOFindings=total$reRank[total$Findings==0]
> wilcox.test(Findings,NOFindings,na.action=na.omit,alternative='less',exact=F)
Now that I was asked to run wilcoxon test quarter by quarter, what code shall I use to filter the data by each quarter?
Without a reproducible example, it's difficult to give you exact code specific to your data.
However, it seems like your problem can be answered with library(dplyr)
library(dplyr)
quarter1Data <- filter(fullData, Quarter == "Q12014")
quarter2Data <- filter(fullData, Quarter == "Q22014")
And so on. See this page for a more in depth explanation on how to use this package.
You can then re-run your existing code replacing your total dataset with these smaller datasets. There is likely a more efficient way to do this, but without knowing the structure of your dataset, this is the simplest method I can think of.
I am currently analysing a rather large dataset (22k+records) and am having some trouble getting the data into a wide format (with one row corresponding to each observation, and columns representing variables).
The data came in two CSV files, one giving demographics and the other giving participants probability ratings to a number of questions. Both of these CSV files were in long format.
I have used the reshape (and reshape2 for speed) packages to attempt to solve my problem. The specific issue i am having is the following.
I have the participants probability ratings in the following form (after one successful reshape).
dtf <- read.csv("http://dl.dropbox.com/u/8566396/foobar.csv")
Now, the format i would like my data to be in is as follows:
User ID Qid1, ....Qid255 Time, with the probabilities for each question in the questions corresponding column.
I have tried a loop and apply to put the values into a new data frame, and many variations of melt and cast. I have also tried the base reshape function, but all to no avail.
In the past, i've always edited my CSV files directly, but this is not an option with the size of this file (my laziness when it comes to data manipulation within R has come back to haunt me).
Any advice or solution you can give to avoid me having to do this by hand would be greatly appreciated.
Your dataset has 6 rows, 3 of which have the column "variable" equal to "probability" and 3 of which have that column equal to "time". You want to have probability be the value of each, and time be added onto the right.
I think there's a difficulty in making this work for you because what you want to do isn't clear. You have values for each UID-Time-X### cell, and values for each UID-Prob-X### cell. Therefore, you have to discard information to get it into your preferred format (UID-Time-X### with probabilities as the values). It seems to me like you're treating time as an ID variable, but it's storing values like a content variable.
To avoid discarding any data, your output would have to look something like:
UID Time1 Time2 Time3 Prob1 Prob2 Prob3
Which is simply reshaped wide.