So I have a data set with the following columns: test_group, person_id, gross, purchases. It is essentially a list of people, how much they've spent, how many times they've purchased, and which group they are in.
I'm using the following ddply code to get some summary statistics:
mean_rpu <- ddply(data, .(test_group), summarise,
                  total_rpu = sum(gross),
                  total_users = length(person_id),
                  total_purchasers = length(subset(data, purchases > 0)$person_id),
                  mean_rpu = mean(gross),
                  sd_rpu = sd(gross))
The problem I'm running into is with the "total_purchasers" summary. I'm trying to get a count of people who are purchasers within each test_group. The current code only gives the total_purchasers for the entire dataset, not broken down by the test_group factor. Are there any fixes or optimizations I can make here?
I appreciate the help!
Without a reproducible example it's hard to say for sure, but perhaps you wanted this:
total_purchasers=length(person_id[purchases>0])
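For completeness, the whole call would then look something like this (a sketch assuming plyr is loaded and the columns are named as in the question):
library(plyr)
mean_rpu <- ddply(data, .(test_group), summarise,
                  total_rpu = sum(gross),
                  total_users = length(person_id),
                  total_purchasers = length(person_id[purchases > 0]),  # now counted within each test_group
                  mean_rpu = mean(gross),
                  sd_rpu = sd(gross))
Inside ddply(..., summarise, ...) every expression is evaluated on the subset of rows for the current test_group, so indexing person_id by purchases > 0 gives a per-group count rather than a count over the whole data set.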
I have a little problem with my code. I hope you can help me :)
I used an apply function to create a list of 20 data frames (return data for about three companies and the stock index, grouped by year, over 5 years). Now I want to apply a function with two arguments that calculates, for every year, the proportion of the covariance between a selected company's returns and the index's returns to the variance of the index; this is why I'm trying to group the data. How can I do this automatically, without manually typing the code for every year and company?
I have no idea whether I should use a for loop or whether there is some other way.
The other thing is: how can I delete unnecessary columns from a list of data frames?
I'll be thankful for your help.
And sorry for my English :D
You may consider purrr::map_dfr(). The first argument will be your list of data frames, and the second is the function to apply to each data frame. The final result will be a single data frame combining the results of all of those calls. Your code will likely look something like this:
purrr::map_dfr(list_of_dataframes, function(x) {...})
Within the braces, insert your logic in place of the .... In that context, x will be list_of_dataframes[[1]], then list_of_dataframes[[2]], and so on.
You may want to consult the documentation of the package purrr for further details.
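For instance, here is a minimal sketch, assuming each data frame in list_of_dataframes holds one year/company group and has columns named company_return, index_return, year, and company (hypothetical names; swap in whatever your data frames actually contain):
library(purrr)
result <- map_dfr(list_of_dataframes, function(x) {
  x$unneeded_column <- NULL  # drop a column you don't want (hypothetical name)
  data.frame(
    year    = unique(x$year),
    company = unique(x$company),
    ratio   = cov(x$company_return, x$index_return) / var(x$index_return)  # covariance-to-variance proportion
  )
})
The same anonymous function is also the place to delete unnecessary columns, as in the x$unneeded_column <- NULL line above.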
I previously worked on a project where we examined some sociological data. I did the descriptive statistics and after several months, I was asked to make some graphs from the stats.
I made the graphs, but something seemed off, and when I compared the graphs to the numbers in the report I noticed that they were different. Upon investigating further, I found that my cleaning code (which removes participants with duplicate IDs) now results in more rows, i.e. more participants with unique IDs than before: I now have 730 participants, whereas previously there were 702. I don't know if this is due to package updates, and unfortunately I cannot post the actual data here because it is confidential, but I am trying to find out who these 28 participants are and what happened in the data.
Therefore, I would like to know if there is a method that allows the user to filter the cases so that the mean of some variables is a set number. Ideally it would be something like this, but of course I know that it's not going to work in this form:
iris %>%
filter_if(mean(.$Petal.Length) == 1.3)
I know this is an incorrect attempt, but I don't know how else to try it, so I am looking for help and suggestions.
I'm not convinced this is a tractable problem, but you may get somewhere by doing the following.
Firstly, work out what the sum of the variable was in your original analysis, and what it is now:
old_sum <- 702 * old_mean
new_sum <- 730 * new_mean
Now work out what the sum of the variable in the extra 28 cases would be:
extra_sum <- new_sum - old_sum
This allows you to work out the relative proportions of the sum of the variable from the old cases and from the extra cases. Put these proportions in a vector:
contributions <- c(extra_sum/new_sum, old_sum/new_sum)
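For example, with made-up means (purely illustrative, not from your actual study):
old_mean <- 3.20                                # hypothetical mean reported for the original 702 participants
new_mean <- 3.25                                # hypothetical mean in the current 730-row data
old_sum <- 702 * old_mean                       # 2246.4
new_sum <- 730 * new_mean                       # 2372.5
extra_sum <- new_sum - old_sum                  # 126.1
contributions <- c(extra_sum / new_sum, old_sum / new_sum)
contributions                                   # roughly 0.053 and 0.947, summing to 1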
Now, using the functions described in my answer to this question, you can find the optimal solution to partitioning your variable to match these two proportions. The rows which end up in the "extra" partition are likely to be the new ones. Even if they aren't the new ones, you will be left with a sample that has a mean that differs from your original by less than one part in a million.
I am trying to change my data from long to wide format. It is a factorial design with one between subject and two within subject variables.
My data:
https://drive.google.com/file/d/0B9lnMw6dkH9KZUZKQkh4M3BIbGM/view?usp=sharing
When I try
library(reshape2)
data.wide<- dcast(correct.anal,group+subnum~speed+int, value.var="corr")
on the data, it says
Aggregation function missing: defaulting to length
I do not have duplicate values, though, so I do not understand what I need to do.
What I want to achieve is to go from my current data to an output with one line per subject and 22 columns (subnum, group, and the twenty speed/int combinations).
Can anyone help with that?
Perhaps this can help:
data.wide <- dcast(correct.anal, group + subnum ~ speed + int, fun.aggregate = mean, value.var = "corr")
I just added fun.aggregate = mean to average over the duplicates.
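If you want to double-check whether duplicates really exist (dcast() only falls back to length when the same group/subnum/speed/int combination occurs more than once), a quick base-R check along these lines may help:
dup_rows <- duplicated(correct.anal[, c("group", "subnum", "speed", "int")])
sum(dup_rows)             # greater than 0 means some combinations occur more than once
correct.anal[dup_rows, ]  # inspect the offending rows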
I'm a new R user. I need to run a Wilcoxon test on a large set of data. Currently I have a whole year of transaction data (each transaction is categorized by quarter, e.g. Q12014) and was able to get a result for the complete set. My code is as follows (with ties broken by transaction amount):
> total$reRank=NA
> total$reRank[order(total$Rank,-total$TxnAmount.x)]=1:length(total$Rank)
> Findings=total$reRank[total$Findings==1]
> NOFindings=total$reRank[total$Findings==0]
> wilcox.test(Findings,NOFindings,na.action=na.omit,alternative='less',exact=F)
Now that I have been asked to run the Wilcoxon test quarter by quarter, what code should I use to filter the data by each quarter?
Without a reproducible example, it's difficult to give you exact code specific to your data.
However, it seems like your problem can be solved with dplyr:
library(dplyr)
quarter1Data <- filter(fullData, Quarter == "Q12014")
quarter2Data <- filter(fullData, Quarter == "Q22014")
And so on. See this page for a more in-depth explanation of how to use the package.
You can then re-run your existing code replacing your total dataset with these smaller datasets. There is likely a more efficient way to do this, but without knowing the structure of your dataset, this is the simplest method I can think of.
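If you would rather not repeat the filtering by hand, one possible sketch (assuming fullData has Quarter, Rank, TxnAmount.x, and Findings columns, as in your code above) is to loop over the quarters and rerun the same test on each subset:
library(dplyr)
results <- lapply(unique(fullData$Quarter), function(q) {
  total <- filter(fullData, Quarter == q)           # keep only one quarter
  total$reRank <- NA
  total$reRank[order(total$Rank, -total$TxnAmount.x)] <- seq_len(nrow(total))
  Findings <- total$reRank[total$Findings == 1]
  NoFindings <- total$reRank[total$Findings == 0]
  wilcox.test(Findings, NoFindings, alternative = "less", exact = FALSE)
})
names(results) <- unique(fullData$Quarter)
results$Q12014   # the result for the first quarter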
I have a full dataset of observations and over 40 columns of categories, but I only want two, NameID and Error, and I want to sort Error in descending order while still having NameID connected to each observation. Here is some code I've tried:
z<-15
sort(data.frame(skill$Error,skill$NameID),decreasing = TRUE)[1:z]
data.frame(skill$NameID,sort(skill$Error,decreasing=TRUE)[1:z])
error2<-skill[order(Error , )]
Hopefully from what I've tried you can understand what I'm trying to do. Again, I want to pull two columns from my skill data set, Error and NameID, but with Error sorted and NameID kept attached to its values. I need this all done inside of R. Thanks!
df <- data.frame(Error=skill$Error,NameID=skill$NameID)
df <- df[order(df$Error, decreasing=TRUE), ]
Best of luck with whatever you are doing. Hopefully you have someone else to learn some R from.
Assuming that skill is a data frame:
Errors <- skill[,c("Error","NameID")]
Errors <- Errors[order(-Errors$Error),]
You don't ever want to use sort on a data frame column like this, because it sorts whichever column you give it independently of the rest of the data frame. You only ever want order; order keeps the links between the other columns intact.
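As a follow-up, since the question used z <- 15 to keep only the top rows, you can take the first z rows of the ordered result with head():
z <- 15
head(Errors, z)   # the 15 rows with the largest Error, NameID still attached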