I'm a new user to R. I need to run a Wilcoxon test on a large set of data. I currently have a whole year of transaction data (each transaction is categorized by quarter, e.g. Q12014) and was able to get a result for the complete set. My code is as follows (ties are broken by transaction amount):
> total$reRank=NA
> total$reRank[order(total$Rank,-total$TxnAmount.x)]=1:length(total$Rank)
> Findings=total$reRank[total$Findings==1]
> NOFindings=total$reRank[total$Findings==0]
> wilcox.test(Findings,NOFindings,na.action=na.omit,alternative='less',exact=F)
Now that I have been asked to run the Wilcoxon test quarter by quarter, what code should I use to filter the data by each quarter?
Without a reproducible example, it's difficult to give you exact code specific to your data.
However, it seems like your problem can be solved with the dplyr package:
library(dplyr)
quarter1Data <- filter(fullData, Quarter == "Q12014")
quarter2Data <- filter(fullData, Quarter == "Q22014")
And so on. See the dplyr documentation for a more in-depth explanation of how to use this package.
You can then re-run your existing code replacing your total dataset with these smaller datasets. There is likely a more efficient way to do this, but without knowing the structure of your dataset, this is the simplest method I can think of.
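For example, here is a sketch of how the quarter-by-quarter runs could be wired up without creating a separate object for each quarter. It assumes your full data frame is called total, with the Quarter column plus the Rank, TxnAmount.x and Findings columns from your code:

# re-rank within one quarter and run the one-sided Wilcoxon test
run_quarter_test <- function(dat) {
  dat$reRank <- NA
  dat$reRank[order(dat$Rank, -dat$TxnAmount.x)] <- seq_len(nrow(dat))
  wilcox.test(dat$reRank[dat$Findings == 1],
              dat$reRank[dat$Findings == 0],
              alternative = "less", exact = FALSE)
}

# split the data by quarter and apply the test to each piece
results <- lapply(split(total, total$Quarter), run_quarter_test)
results[["Q12014"]]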
It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured, for example the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and in the example structure above).
Each metric has many years of data (while writing the code I restricted myself to just 3 years; the illustration of the structure is based on this test data). The number of years captured will change over time and will generally increase.
The number of policies will fluctuate. I've just labelled them Policy 1, Policy 2, etc. for sensitivity reasons, and limited the number while testing the code to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles, each with a row per metric and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a data frame, created a vector to act as a column header and used cbind() to put it on top, giving the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric becomes the column and the grouped data becomes the row, then expand the data so the metrics are repeated for each year. A friend of mine who codes (but has never used R) suggested that loops might be a better way forward. Again, I am not sure of the best approach, so I welcome advice. On Reddit someone suggested using pivot_wider/pivot_longer, but that appears to be a summarise tool and I am not trying to summarise the data, rather transform its structure.
Any suggestions on approaches or possible tools/functions would be gratefully received. I am learning R while trying to pull this data together to create a database that can be used for analysis, so if my approach sounds weird, feel free to suggest alternatives. Thanks
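For concreteness, here is a rough sketch (with made-up policy/metric names and values, since the real data is sensitive) of the kind of structural change I am after, using the pivot functions that were suggested. pivot_longer()/pivot_wider() reshape the structure rather than summarise, so this may or may not be the right tool:

library(dplyr)
library(tidyr)

# made-up stand-in for the imported data: one row per policy/metric,
# one column per year of values
wide <- tibble(
  policy = rep(c("Policy 1", "Policy 2"), each = 2),
  metric = rep(c("Metric A", "Metric B"), times = 2),
  `2024` = c(10, 20, 30, 40),
  `2030` = c(11, 21, 31, 41),
  `2035` = c(12, 22, 32, 42)
)

# step 1: gather the year columns into long form (one row per policy/metric/year)
long <- pivot_longer(wide, cols = c(`2024`, `2030`, `2035`),
                     names_to = "year", values_to = "value")

# step 2: spread the metrics out as columns (one row per policy/year)
result <- pivot_wider(long, names_from = metric, values_from = value)
result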
I previously worked on a project where we examined some sociological data. I did the descriptive statistics and after several months, I was asked to make some graphs from the stats.
I made the graphs, but something seemed odd, and when I compared the graphs to the numbers in the report, I noticed that they were different. Upon investigating further, I found that my cleaning code (which removed participants with duplicate IDs) now results in more rows, i.e. more participants with unique IDs than previously. I now have 730 participants, whereas previously there were 702. I don't know if this was due to package updates, and unfortunately I cannot post the actual data here because it is confidential, but I am trying to find out who these 28 participants are and what happened in the data.
Therefore, I would like to know if there is a method that allows the user to filter the cases so that the mean of some variable equals a set number. Ideally it would be something like this, although of course I know that it's not going to work in this form:
iris %>%
  filter_if(mean(.$Petal.Length) == 1.3)
I know that this was an incorrect attempt but I don't know any other way that I would try this, so I am looking for help and suggestions.
I'm not convinced this is a tractable problem, but you may get somewhere by doing the following.
Firstly, work out what the sum of the variable was in your original analysis, and what it is now:
old_sum <- 702 * old_mean
new_sum <- 730 * new_mean
Now work out what the sum of the variable in the extra 28 cases would be:
extra_sum <- new_sum - old_sum
This allows you to work out the relative proportions of the sum of the variable from the old cases and from the extra cases. Put these proportions in a vector:
contributions <- c(extra_sum/new_sum, old_sum/new_sum)
Now, using the functions described in my answer to this question, you can find the optimal solution to partitioning your variable to match these two proportions. The rows which end up in the "extra" partition are likely to be the new ones. Even if they aren't the new ones, you will be left with a sample that has a mean that differs from your original by less than one part in a million.
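As a worked example with made-up means (purely illustrative; substitute the values from your original report and your current data):

old_mean <- 3.20   # mean reported for the original 702 participants (illustrative)
new_mean <- 3.25   # mean computed for the current 730 participants (illustrative)

old_sum   <- 702 * old_mean     # 2246.4
new_sum   <- 730 * new_mean     # 2372.5
extra_sum <- new_sum - old_sum  # 126.1, contributed by the 28 extra cases

contributions <- c(extra_sum / new_sum, old_sum / new_sum)
contributions
# approximately 0.05315 0.94685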
I am working with a time-series data stream from an experiment. We record multiple data channels, including a trigger channel ('X7.ramptrig' in the linked data: Time-Series Data Example), which indicates when a relevant event occurs in the other channels.
I am trying to create subsets of the next n rows (e.g. 15,000) of the time series (time steps are 0.1 ms) that occur after the onset of a trigger ('1'). That column has multiple triggers ('1') interspersed at irregular intervals; every other time step is a '0', indicating no new event.
I am asking whether there is a more efficient solution that directly subsets the subsequent n rows after a trigger is detected, instead of the indirect (and possibly inflexible) solution I have come up with.
Link to simple example data:
https://gtvault-my.sharepoint.com/:t:/g/personal/shousley6_gatech_edu/EZZSVk6pPpJPvE0fXq1W2KkBhib1VDoV_X5B0CoSerdjFQ?e=izlkml
I have a working solution that creates an index from the trigger channel and splits the dataset on that index. Because the triggers vary in their placement in time, the resulting data-frame subsets are not consistent, and there is occasionally an 'extra' subset that precedes the 'important' ones ('res$0' in the example). Additionally, I need the subsets to be matched for total time and aligned to trigger onset.
My current solution 'cuts' the list of data frames down to the same size (in the example, the first 15,000 rows). While this technically works, it seems clunky. I also tried to translate a SQL solution using FETCH NEXT, but those functions are not available in the SQLite supported in R.
I am completely open to alternatives so please be unconstrained by my current solution.
library(dplyr)

## create an index that increments at every event trigger onset
idx <- c(0, cumsum(diff(Time_Series_Data_Example$X7.ramptrig) > 0))

## split the original data frame on event triggers
split1 <- split(Time_Series_Data_Example, idx)

## cut each data frame down to 1.5 s (15,000 rows at 0.1 ms per step);
## top_n() with a negative n keeps the 15,000 lowest-ranked rows
## (ranked by the last column by default)
res <- lapply(split1, function(x) top_n(x, -15000))
Here is an example of the output: head(res[["1"]])
For the example data and code provided, the output is 4 subsets, 3 of which are 'important' and time-synced to the trigger. The first, 'res$0', is a throwaway subset.
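For reference, a more direct approach might look something like the sketch below (assuming the same Time_Series_Data_Example data frame and X7.ramptrig trigger column as above): find the rows where the trigger switches to 1, then slice the next 15,000 rows from each one.

# row indices where the trigger channel switches from 0 to 1 (trigger onsets)
onsets <- which(diff(c(0, Time_Series_Data_Example$X7.ramptrig)) > 0)

# for each onset, take the next 15,000 rows (1.5 s at 0.1 ms per step),
# truncating at the end of the recording if necessary
n_rows <- 15000
res2 <- lapply(onsets, function(i) {
  Time_Series_Data_Example[i:min(i + n_rows - 1, nrow(Time_Series_Data_Example)), ]
})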
Thanks in advance, and please let me know how I can improve my question (this is my first attempt).
I am trying to generate a time dummy variable in R. I am analyzing quarterly panel data (1990q1-2013q3). How do I generate a time dummy variable for the 2007q1-2009q1 period, i.e. dummy = 1 for quarters from 2007q1 through 2009q1 and 0 otherwise?
The data looks as shown in the picture. Asset rank is the ID variable.
Regards & Thanks!
I would say model.matrix is probably your best bet.
# convert the quarter variable to a factor and expand it into dummy columns
date.f <- factor(dat$date)
dummies <- model.matrix(~ date.f)
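As a usage note, model.matrix() includes an intercept column by default; dropping it gives one 0/1 column per quarter (a sketch, using the same date.f as above):

# one dummy column per quarter, no intercept column
dummies <- model.matrix(~ date.f - 1)
head(dummies)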
I used a simpler way, following this answer. I guess there is no difference between time series and panel data here in terms of application.
# dummy = 1 for quarters inside the chosen window, 0 otherwise
print(date)
dummy <- as.numeric(date >= "2007 Q1" & date <= "2008 Q4")
print(dummy)
The answer of @Demet is useful, but it gets kind of tedious if you have many periods (e.g. 50).
The answer of @Amstell is useful too; it returns a matrix of dummies including an intercept column of ones. Depending on how you want to continue analyzing the data, you have to decide which output is most useful for your follow-up analysis.
In addition to the answers proposed, the following code gives you one 0/1 dummy column per period directly, without the intercept column:
dummies <- table(seq_along(date), as.factor(date))
Furthermore, it is important to keep track of which time period is the reference group when interpreting the model. If you have only two time periods, you can switch the reference group by flipping the 0/1 dummy:
dummy <- abs(dummy - 1)  # flips 0s and 1s, swapping which period is the reference
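A tiny sketch of both ideas with made-up quarters:

# made-up quarterly dates, purely for illustration
date <- c("2007 Q1", "2007 Q2", "2007 Q1", "2007 Q2")

# one 0/1 column per period, one row per observation
dummies <- table(seq_along(date), as.factor(date))

# with two periods, flipping the 0/1 dummy swaps the reference group
d <- as.numeric(date == "2007 Q2")
abs(d - 1)
# 1 0 1 0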
So I have a data set with the following columns: test_group, person_id, gross, purchases. This is essentially a list of people, how much they've spent, how many times they've purchased, and which group they are in.
I'm using the following ddply code to get some summary statistics:
mean_rpu <- ddply(data, .(test_group), summarise,
                  total_rpu = sum(gross),
                  total_users = length(person_id),
                  total_purchasers = length(subset(data, purchases > 0)$person_id),
                  mean_rpu = mean(gross),
                  sd_rpu = sd(gross))
The problem I'm running into is with the total_purchasers summary. I'm trying to get a count of people who are purchasers within each test_group, but the current code returns the total number of purchasers in the entire dataset, not the count for each test_group. Are there any optimizations I can make here?
I appreciate the help!
Without a reproducible example it's hard to say for sure, but perhaps you wanted this:
total_purchasers=length(person_id[purchases>0])
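In context, the full call would then look something like this (a sketch of the same idea; sum(purchases > 0) would count the purchasers within each group equally well):

library(plyr)

mean_rpu <- ddply(data, .(test_group), summarise,
                  total_rpu = sum(gross),
                  total_users = length(person_id),
                  # counts only the purchasers within the current test_group
                  total_purchasers = length(person_id[purchases > 0]),
                  mean_rpu = mean(gross),
                  sd_rpu = sd(gross))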