Hello and thank you in advance for your assistance,
(PLEASE Note Comments section for additional insight: i.e. the cost column in the example below was added to this question; Simon, provides a great answer, but the cost column itself is not represented in the data response from him, although the function he provides works with the cost column)
I have a data set, lets call it 'data' which looks like this
NAME DATE COLOR PAID COST
Jim 1/1/2013 GREEN 150 100
Jim 1/2/2013 GREEN 50 25
Joe 1/1/2013 GREEN 200 150
Joe 1/2/2013 GREEN 25 10
What I would like to do is sum the PAID (and COST) elements of the records with the same NAME value and reduce the number of rows (as in this example) to 2, such that my new data frame looks like this:
NAME DATE COLOR PAID COST
Jim 1/2/2013 GREEN 200 125
Joe 1/2/2013 GREEN 225 160
As far as the dates are concerned, I don't really care about which one survives the summation process.
I've gotten as far as rowSums(data), but I'm not exactly certain how to use it. Any help would be greatly appreciated.
aggregate is the function you are looking for:
aggregate( cbind( PAID , COST ) ~ NAME + COLOR , data = data , FUN = sum )
# NAME PAID
# 1 Jim 200
# 2 Joe 225
Related
I'm looking to identify duplicate records in my data set based on multiple columns, review the records, and keep the ones with the most complete data in R. I would like to keep the row(s) associated with each name that have the maximum number of data points populated. In the case of date columns, I would also like to treat invalid dates as missing. My data looks like this:
df<-data.frame(Record=c(1,2,3,4,5),
First=c("Ed","Sue","Ed","Sue","Ed"),
Last=c("Bee","Cord","Bee","Cord","Bee"),
Address=c(123,NA,NA,456,789),
DOB=c("12/6/1995","0056/12/5",NA,"12/5/1956","10/4/1980"))
Record First Last Address DOB
1 Ed Bee 123 12/6/1995
2 Sue Cord 0056/12/5
3 Ed Bee
4 Sue Cord 456 12/5/1956
5 Ed Bee 789 10/4/1980
So in this case I would keep records 1, 4, and 5. There are approximately 85000 records and 130 variables, so if there is a way to do this systematically, I'd appreciate the help. Also, I'm a total R newbie (as if you couldn't tell), so any explanation is also appreciated. Thanks!
#Add a new column to the dataframe containing the number of NA values in each row.
df$nMissing <- apply(df,MARGIN=1,FUN=function(x) {return(length(x[which(is.na(x))]))})
#Using ave, find the indices of the rows for each name with min nMissing
#value and use them to filter your data
deduped_df <-
df[which(df$nMissing==ave(df$nMissing,paste(df$First,df$Last),FUN=min)),]
#If you like, remove the nMissinig column
df$nMissing<-deduped_df$nMissing<-NULL
deduped_df
Record First Last Address DOB
1 1 Ed Bee 123 12/6/1995
4 4 Sue Cord 456 12/5/1956
5 5 Ed Bee 789 10/4/1980
Edit: Per your comment, if you also want to filter on invalid DOBs, you can start by converting the column to date format, which will automatically treat invalid dates as NA (missing data).
df$DOB<-as.Date(df$DOB,format="%m/%d/%Y")
I have a data set in R that involves students and GPAs, for example
Student GPA
Jim 3.00
Tom 3.29
Ana 3.99
and so on.
I want a column that puts them in a bin. for example
Student GPASplit
Jim 3.0-3.5
Tom 3.0-3.5
Ana 3.5-4.0
Because when I try to take the statistics for the GPA all the bins are seperated based on the actual GPA. For example I am trying to find the percentage for how many students have higher than a 3.5, a GPA between 3.0-3.5, and so forth. But I get the percentage in terms of the actual GPA and when you have 4000 data points all with different GPAs, it is hard to figure out how many have a GPA higher than 3.5 and so forth? Does this make sense? Sorry if it doesn't.
You can use the cut() function to split data into bins that you define. You have to be careful about values that fall exactly on the boundaries though, and make sure they're being treated how you want. With your example data:
> df$GPA_split = cut(df$GPA, breaks = c(3.0, 3.5, 4.0), include.lowest = TRUE)
> df
Student GPA GPA_split
1 Jim 3.00 [3,3.5]
2 Tom 3.29 [3,3.5]
3 Ana 3.99 (3.5,4]
# Count values in each bin
> table(df$GPA_split)
[3,3.5] (3.5,4]
2 1
In R a dataset data1 that contains game and times. There are 6 games and times simply tells us how many time a game has been played in data1. So head(data1) gives us
game times
1 850
2 621
...
6 210
Similar for data2 we get
game times
1 744
2 989
...
6 711
And sum(data1$times) is a little higher than sum(data2$times). We have about 2000 users in data1 and about 1000 users in data2 but I do not think that information is relevant.
I want to compare the two datasets and see if there is a statistically difference and which game "causes" that difference.
What test should I use two compare these. I don't think Pearson's chisq.test is the right choice in this case, maybe wilcox.test is the right to chose ?
please see data sample as follows:
3326 2015-03-03 Wm Eu Apple 2L 60
3327 2015-03-03 Tp Euro 2 Layer 420
3328 2015-03-03 Tpe 3-Layer 80
3329 2015-03-03 14/3 Bgs 145
3330 2015-03-04 T/P 196
3331 2015-03-04 Wm Eu Apple 2L 1,260
3332 2015-03-04 Tp Euro 2 Layer 360
3333 2015-03-04 14/3 Bgs 1,355
Currently graphing this data creates a really horrible graph because the amount of cartons change so rapidly by day. It would make more sense to sum the cartons by month so that each data point represents a sum for that month rather than an individual day. The current range of the data is 11/01/2008-04/01/2015.
This is the code that I am using to graph (which may or may not be relevant for this):
ggvis(myfile, ~Shipment.Date, ~ctns) %>%
layer_lines()
Shipment.Date is column 2 in the data set and ctns is the 4th column.
I don't know much about R and have given it a few trys with some code that I have found here but I don't think I have found a problem similar enough to match the code. My idea is to create a new table, sum Act. Ctns for the month and then save it as that new table and graph from there.
Thanks for any assistance! :)
Do you need this:
data.aggregated<-aggregate(list(new.value=data$value),
by=list(date.time=cut(data$date.time, breaks="1 month")),
FUN=function(x) sum(x))
heres a table, the time when the query runs i.e now is 2010-07-30 22:41:14
number | person | timestamp
45 mike 2008-02-15 15:31:14
56 mike 2008-02-15 15:30:56
67 mike 2008-02-17 13:31:14
34 mike 2010-07-30 22:31:14
56 bob 2009-07-30 22:37:14
67 bob 2009-07-30 22:37:14
22 tom 2010-07-30 22:37:14
78 fred 2010-07-30 22:37:14
Id like a query that can add up the number for each person. Then only display the name totals which have a recent entry say last 60 minutes. The difficult seems to be, that although its possible to use AND timestamp > now( ) - INTERVAL 600, this has the affect of stopping the full sum of the number.
the results I would from above are
Mike 202
tom 22
fred 78
bob is not included his latest entry is not recent enough its a year old! mike although he has several old entries is valid because he has one entry recently - but key, it still adds up his full 'number' and not just those with the time period.
go on get your head round that one in a single query ! and thanks
andy.
You want a HAVING clause:
select name, sum(number), max(timestamp_column)
from table
group by name
HAVING max( timestamp_column) > now( ) - INTERVAL 600;
andrew - in the spirit of education, i'm not going to show the query (actually, i'm being lazy but don't tell anyone) :).
basically tho', you'd have to do a subselect within your main criteria select. in psuedo code it would be:
select person, total as (select sum(number) from table1 t2 where t2.person=t1.person)
from table1 t1 where timestamp > now( ) - INTERVAL 600
that will blow up, but you get the gist...
jim