I asked a very general version of this question a while ago. I thought I had enough programming background to get from that answer to my own function, but it turns out I was wrong. This is my first time using R, and I'm having some trouble.
Given the following dataset:
Amount_Bought CustomerID
12 28
18 28
2 6
9 6
10 6
I want to create a column called "average spending" which holds the average spending of each customer, based on their ID. There are about 1000 entries in the data, with a varying number of purchases per customer.
For example, for customerID 28, I would want average spending to be (12 + 18)/2 = 15
So, something like this:
Amount_Bought CustomerID Average_Spending
12 28
18 28 15
2 6
9 6
10 6 7
How would I go about doing this?
Thank you
How about:
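library(plyr)
# Per-customer mean; the grouping name must match the column in the data (CustomerID)
sumdat <- ddply(my_data, "CustomerID", summarise,
                avg_spending = mean(Amount_Bought))
# Join the averages back onto every row of the original data
merge(my_data, sumdat)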
(There are a variety of ways to aggregate data like this in R: ave and aggregate in base R, the dplyr package, the data.table package, and so on. There are lots of questions on SO comparing the efficiency of these approaches, e.g. "Joining aggregated values back to the original data frame".)
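As a sketch of the base R route mentioned above, ave can compute the group mean and return it aligned with the original rows, so no merge is needed. The data frame below is rebuilt from the example in the question; the column name Average_Spending is taken from the desired output, and every row of a customer receives that customer's mean.

```r
# Example data copied from the question
my_data <- data.frame(
  Amount_Bought = c(12, 18, 2, 9, 10),
  CustomerID    = c(28, 28, 6, 6, 6)
)

# ave() applies mean (its default FUN) within each CustomerID group
# and returns a vector of the same length as the input
my_data$Average_Spending <- ave(my_data$Amount_Bought, my_data$CustomerID)

print(my_data)
```

If only one row per customer is wanted instead, aggregate(Amount_Bought ~ CustomerID, data = my_data, FUN = mean) gives the summary table on its own.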
I guess this is pretty basic, but I'm struggling to find a way to do it or an answer online. I'm trying to create a data frame of future dates, where each date is repeated for every combination of 2 other variables
so I should have
Dates | Channel | Product
Channel can take 4 values and Product 7 values, and I need to create dates for the 45 days following the last day in my current df. That gives 28 combinations per day, so my new df should have 1260 rows (45 * 7 * 4)
as the sample below
I know about this function
Dates = seq(max(train$Date), by = "day", length.out = 45)
However, this creates a single vector and does not repeat the dates for each combination. Is there any way I can adapt this?
This question already has answers here:
Functional way to reverse cumulative sum?
(2 answers)
Closed 1 year ago.
I'm new to R and working with a data frame. I'm looking for a way to create a new column and populate it with the reverse of another column that contains a cumulative sum. I want the individual values that were added at each step.
I have data that looks like this:
Cumulative Sum
0
4
9
18
33
I'd like to create a new column and populate it with the individual value, reversing the cumulative sum, something like the following:
Cumulative Sum  Individual Value
0               0
4               4
9               5
18              9
33              15
Any and all help would be very much appreciated.
Use diff, which computes iterated differences.
vec=c(0,4,9,18,33)
c(0,diff(vec))
[1] 0 4 5 9 15
If your vector does not start with 0, use diff(c(0, vec)) instead, so that the first value is kept.
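A quick way to check an inverse like this is to run cumsum over the result and compare it with the input. The sketch below uses a vector that does not start at 0 and prepends a 0 before differencing, which is one way to keep the first value; the variable names are only illustrative.

```r
# Cumulative sums that do not start at 0
cum <- c(4, 9, 18, 33)

# Prepend 0 so the first difference is the first value itself
individual <- diff(c(0, cum))

# Round trip: re-accumulating the individual values gives back the input
roundtrip <- cumsum(individual)
print(individual)
print(roundtrip)
```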
I am working on a problem for a statistics class that uses baseball team data such as attendance, wins/losses, and other team stats. The problem statement calls for variables to be created for winning teams (those with 81 or more wins), losing teams (those with fewer than 81 wins), and attendance figures in three categories: less than 2 million, between 2 and 3 million, and more than 3 million.
The raw data is keyed by team name, with one team per row and then the stats in each column.
I then need to create a table with counts of the number of teams along those dimensions, like:
Winning Season  Low Attendance  Med. Attendance  High Attendance
Yes             3               12               3
No              2               10               2
We can use whatever tool we'd like to complete it and I am attempting to use R and RStudio to create the table in order to gain knowledge about stats and R at the same time. However, I can't figure out how to make it happen or what function(s) to use to create a table with those aggregate numbers.
I have looked at data.table, dplyr and others, but I cannot figure out how to get the counts of teams in each category. If it were SQL, I would be able to
select count(*) from table where attend < 2000000 and wins < 81
and then programmatically create the table. I can't figure out how to do the same in R.
Thank you for any help.
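One way to approach this in base R is to turn each numeric column into a category and then cross-tabulate the two with table. This is a minimal sketch, not a worked solution to the class problem: the teams data frame and its values are invented, the column names wins and attend are borrowed from the SQL snippet above, and a team with exactly 3 million attendance is assigned to the high group as an assumption, since the problem statement leaves that boundary open.

```r
# Invented example data; real data would have one row per team
teams <- data.frame(
  team   = c("A", "B", "C", "D", "E", "F"),
  wins   = c(95, 70, 81, 60, 88, 79),
  attend = c(3200000, 1500000, 2500000, 2100000, 1900000, 3100000)
)

# Winning season: 81 or more wins
teams$winning <- ifelse(teams$wins >= 81, "Yes", "No")

# Attendance bands: [-Inf, 2M), [2M, 3M), [3M, Inf)
teams$attendance <- cut(
  teams$attend,
  breaks = c(-Inf, 2e6, 3e6, Inf),
  labels = c("Low", "Med", "High"),
  right  = FALSE
)

# Count teams in each combination of the two categories
counts <- table(Winning = teams$winning, Attendance = teams$attendance)
print(counts)
```

Each cell of counts corresponds to one select count(*) query with a where clause on both conditions.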
This question already has answers here:
How to filter a data frame
(2 answers)
Select rows from a data frame based on values in a vector
(3 answers)
Closed 5 years ago.
I have a data.frame listing locations that have been sampled over several years, with the number of species calculated at each location. Hence, I have a species number per location for each year.
However, not every location has been sampled each year.
It would look something like this:
Location Year Species
1 2007 3
1 2008 10
2 2008 4
2 2009 5
2 2010 6
3 2007 3
3 2008 10
3 2009 5
3 2010 6
I want to select only those stations that have been sampled each year, and get a data.frame showing only these locations, the relevant years and their species numbers.
In the above example that would obviously be only location 3.
I searched various sites thoroughly but could not find the answer. I guess the answer is quite simple using either aggregate or subset, and I tried various solutions, but to no avail.
Edit: the answers referred to as being duplicate do not answer my question: I want to select the various years, but only return the stations that contain all these years. Answers referred to only supply the lines with the various years, but not conditional on the stations.
Edit
The answer to my question turned out to be adding a column to my data.frame that counts how often each value occurs in the column Location, using the following code:
transform(df, freq.loc = ave(seq(nrow(df)), Location, FUN = length))
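A count column like this can then be used to keep only the fully sampled locations. The following is a minimal sketch, assuming each location appears at most once per year and using the column names from the example table (Location, Year, Species); the name n_years is introduced here only for illustration.

```r
# Example data copied from the question
df <- data.frame(
  Location = c(1, 1, 2, 2, 2, 3, 3, 3, 3),
  Year     = c(2007, 2008, 2008, 2009, 2010, 2007, 2008, 2009, 2010),
  Species  = c(3, 10, 4, 5, 6, 3, 10, 5, 6)
)

# Number of distinct years present anywhere in the data
n_years <- length(unique(df$Year))

# Rows per location, repeated on every row of that location
df$freq.loc <- ave(seq_len(nrow(df)), df$Location, FUN = length)

# Keep only locations that were sampled in every year
complete <- df[df$freq.loc == n_years, ]
print(complete)
```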
This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 4 years ago.
I would like to expand a sample survey to simulate a population. For example, suppose I have the following data sample (kept very small to explain my question):
control weight sex age race
1 2 F 23 W
2 3.1 M 21 B
3 5.3 F 19 W
In this case, control identifies the people interviewed. For example, I would like to get a data frame where control 1 (a person who is female, 23 years old and white) is repeated 2 times (2 rows). The difficulty arises when I try to repeat control 2 3.1 times and control 3 5.3 times, while preserving sex, age and race.
There is the "survey" package, but I don't know if there is some function for this situation.
How can I find a solution for this problem?
If you need to expand the rows of the dataset based on the value in the 'weight' column, one option is expandRows from splitstackshape. This is similar to df1[rep(1:nrow(df1), df1$weight), ].
library(splitstackshape)
expandRows(df1, 'weight')
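For comparison, the base R indexing approach can be sketched as follows. The data frame is rebuilt from the sample in the question. Because a row cannot be repeated a fractional number of times, this sketch rounds the weights down with floor, which is an assumption about the desired behaviour; rounding to the nearest integer with round would be an equally plausible choice.

```r
# Sample data copied from the question
df1 <- data.frame(
  control = 1:3,
  weight  = c(2, 3.1, 5.3),
  sex     = c("F", "M", "F"),
  age     = c(23, 21, 19),
  race    = c("W", "B", "W")
)

# Whole number of copies per row (assumption: round down)
n_copies <- floor(df1$weight)

# Repeat each row index n_copies times, then index the data frame
expanded <- df1[rep(seq_len(nrow(df1)), times = n_copies), ]
rownames(expanded) <- NULL

print(expanded)
```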