R data frame: add values in common rows - r

I have a data frame like this:
> df1
portfolio date ticker quantity price
1 port 2010-01-01 AAPL 100 10
2 port 2010-01-01 AAPL 200 10
3 port 2010-01-01 AAPL 400 11
If rows of df1 are identical in every column except quantity, I want to sum the quantity of those rows.
That is, I need the following output:
portfolio date ticker quantity price
1 port 2010-01-01 AAPL 300 10
3 port 2010-01-01 AAPL 400 11
How can I do that? Thanks.

Here you go... :-)
For plyr:
library(plyr)
ddply(df1, .(portfolio, date, ticker, price), summarize, quantity = sum(quantity))
For data.table:
library(data.table)
dt <- data.table(df1)
dt[, list(quantity = sum(quantity)), by = list(portfolio, date, ticker, price)]
There may be a more concise way to express the list of grouping variables. Otherwise, the aggregate solution is much more elegant.

Use aggregate. Assuming your data.frame is called "mydf":
> aggregate(quantity ~ ., mydf, sum)
portfolio date ticker price quantity
1 port 2010-01-01 AAPL 10 300
2 port 2010-01-01 AAPL 11 400
Of course, we should all now wait for the data.table and ddply versions to populate the answers list....
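For anyone who wants to run the aggregate answer end to end, here is a minimal base R sketch. It rebuilds the question's data by hand (storing date as a plain character column is an assumption, since the question does not show the column types) and relies on the `.` in the formula to group by every column other than quantity.

```r
# Rebuild the example data from the question; dates are kept as plain
# character strings here, which is an assumption about the original data.
mydf <- data.frame(
  portfolio = c("port", "port", "port"),
  date      = c("2010-01-01", "2010-01-01", "2010-01-01"),
  ticker    = c("AAPL", "AAPL", "AAPL"),
  quantity  = c(100, 200, 400),
  price     = c(10, 10, 11),
  stringsAsFactors = FALSE
)

# "quantity ~ ." sums quantity within each combination of all other columns
result <- aggregate(quantity ~ ., data = mydf, FUN = sum)
print(result)
```

The two rows with price 10 collapse into one row, while the row with price 11 stays separate because price is part of the grouping.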

Related

How to calculate aggregate statistics on a dataframe in R by applying conditions on time values?

I am working on climate data analysis. After loading file in R, my interest is to subset data based upon hours in a day.
For time analysis, we can use $hour on the variable in which the time vector is stored (a POSIXlt object) if we want to work with hours.
I want to subset my data for each hour of the day across 365 days and then average the data at a particular hour over the whole year. For example, I would like to take the values of irradiation, wind speed, etc. at 12:00 PM for a year and then take the mean of these values.
I know how to subset a data frame based on conditions. For example, if my data is in a matrix called data that contains 2 columns, say time and wind speed, and I want to keep the rows in which irradiation isn't zero, I can do this using the following code:
my_data <- subset(data, data[,1]>0)
But how can I subset by the hour values in the time column, which is a variable stored in data?
My data look like this:
I hope I made sense in this question.
Thanks in advance!
Here is a possible solution. You can create an hourly grouping with format(df$time,'%H'), which keeps only the hour of each timestamp. We can then group by this new column and calculate the mean for each group.
df <- data.frame(time = seq(Sys.time(), Sys.time() + 2*60*60*24, by = 'hour'), val = sample(seq(5), 49, replace = TRUE))
library(dplyr)
df %>% mutate(hour = format(time, '%H')) %>%
group_by(hour) %>%
summarize(mean_val = mean(val))
To subset the non-zero values first, you can do either:
df = subset(df,val!=0)
or start the dplyr chain with:
df %>% filter(val != 0)
Hope this helps!
df looks as follows:
time val
1 2018-01-31 12:43:33 4
2 2018-01-31 13:43:33 2
3 2018-01-31 14:43:33 2
4 2018-01-31 15:43:33 3
5 2018-01-31 16:43:33 3
6 2018-01-31 17:43:33 1
7 2018-01-31 18:43:33 2
8 2018-01-31 19:43:33 4
... ... ... ...
And the output:
# A tibble: 24 x 2
hour mean_val
<chr> <dbl>
1 00 3.50
2 01 3.50
3 02 4.00
4 03 2.50
5 04 3.00
6 05 2.00
.... ....
This assumes your time column is already of class POSIXct; otherwise, you would first have to convert it, for example with as.POSIXct(x, format='%Y-%m-%d %H:%M:%S').
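The same idea can be sketched in base R without dplyr. The data below is hypothetical (four made-up readings at 00:00 and 12:00 UTC over two days, in a data frame named obs), chosen so the result is reproducible, and the time zone is fixed to UTC as an assumption.

```r
# Hypothetical readings: two days, sampled at 00:00 and 12:00 UTC
time <- as.POSIXct(c("2018-01-01 00:00:00", "2018-01-01 12:00:00",
                     "2018-01-02 00:00:00", "2018-01-02 12:00:00"),
                   tz = "UTC")
obs <- data.frame(time = time, val = c(0, 4, 2, 6))

# Extract the hour of day as a two-digit string ("00" to "23")
obs$hour <- format(obs$time, "%H", tz = "UTC")

# Mean of val for every hour of the day
hourly_mean <- aggregate(val ~ hour, data = obs, FUN = mean)
print(hourly_mean)

# Mean for one particular hour only (noon), via a plain logical subset
noon_mean <- mean(obs$val[obs$hour == "12"])

# The $hour component mentioned in the question is available on POSIXlt
hour_num <- as.POSIXlt(obs$time)$hour
```

Here format() gives the hour as a string, which is convenient for grouping, while as.POSIXlt()$hour gives it as a number, which is convenient for comparisons such as hour_num >= 9.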

Rbind Difference of rows

I want to determine the difference of each row and have that total difference rbinded at the end. Below is a sample dataset:
DATE <- as.Date(c('2016-11-28','2016-11-29'))
TYPE <- c('A', 'B')
Revenue <- c(2000, 1000)
Sales <- c(1000, 4000)
Price <- c(5.123, 10.234)
Material <- c(10000, 7342)
df<-data.frame(DATE, TYPE, Revenue, Sales, Price, Material)
df
DATE TYPE Revenue Sales Price Material
1 2016-11-28 A 2000 1000 5.123 10000
2 2016-11-29 B 1000 4000 10.234 7342
How do I calculate the difference for each of the columns to produce this total row:
DATE TYPE Revenue Sales Price Material
1 2016-11-28 A 2000 1000 5.123 10000
2 2016-11-29 B 1000 4000 10.234 7342
3 DIFFERENCE -1000 3000 5.111 -2658
I can easily do it by column but am having trouble doing it by row.
Any help would be great, thanks!
As 'DATE' is Date class, we may need to change it to character before proceeding with rbinding with string "DIFFERENCE". Other than that, subset the numeric columns of 'df', loop it with lapply, get the difference, concatenate with the 'DATE' and 'TYPE', and rbind with original dataset.
df$DATE <- as.character(df$DATE)
rbind(df, c(DATE = "DIFFERENCE", TYPE= NA, lapply(df[-(1:2)], diff)))
# DATE TYPE Revenue Sales Price Material
#1 2016-11-28 A 2000 1000 5.123 10000
#2 2016-11-29 B 1000 4000 10.234 7342
#3 DIFFERENCE <NA> -1000 3000 5.111 -2658
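A slightly more general variant, sketched under the assumption that every numeric column should be differenced, picks those columns by type instead of by position. The names num_cols, diff_row and result are only illustrative.

```r
# Sample data from the question, with DATE stored as character so that the
# "DIFFERENCE" label can share the column
df <- data.frame(
  DATE     = c("2016-11-28", "2016-11-29"),
  TYPE     = c("A", "B"),
  Revenue  = c(2000, 1000),
  Sales    = c(1000, 4000),
  Price    = c(5.123, 10.234),
  Material = c(10000, 7342),
  stringsAsFactors = FALSE
)

# Find the numeric columns by type rather than by position
num_cols <- vapply(df, is.numeric, logical(1))

# Build a one-row data frame: a label, a missing TYPE, and the row differences
diff_row <- data.frame(
  DATE = "DIFFERENCE",
  TYPE = NA_character_,
  lapply(df[num_cols], diff),
  stringsAsFactors = FALSE
)

result <- rbind(df, diff_row)
print(result)
```

Building diff_row as a data frame keeps each column's type intact when rbind matches the columns by name.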

How to Have a COUNTIF Function dependent on the dates of the same row in R

My main problem is figuring out a way to count the number of days a particular item was sold. For example, if I have the following data frame, I would like to count the number of days in which item A or B were sold, i.e., item A was only sold on one day during our sample, and item B was sold 3 times, however only sold on 2 different days. My goal would be to have a function that outputs the number of days in which item was sold, here being (A,B)=(1, 2).
row item_name date
1 A 2016-03-04 3:49
2 B 2016-05-31 16:15
3 B 2016-05-31 16:35
4 B 2016-06-08 16:05
Try this
library(dplyr)
df1 %>% group_by(item_name) %>% summarise(n_distinct(as.Date(date)))
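For comparison, a base R sketch of the same count. It assumes the date column is a character string that starts with a year-month-day date, as in the sample; the names sales, day and days_sold are only illustrative.

```r
# Sample data from the question
sales <- data.frame(
  item_name = c("A", "B", "B", "B"),
  date      = c("2016-03-04 3:49", "2016-05-31 16:15",
                "2016-05-31 16:35", "2016-06-08 16:05"),
  stringsAsFactors = FALSE
)

# Keep only the calendar day; the trailing time of day is ignored by as.Date
day <- as.Date(sales$date, format = "%Y-%m-%d")

# Split the days by item and count the distinct days in each group
days_sold <- vapply(split(day, sales$item_name),
                    function(d) length(unique(d)),
                    integer(1))
print(days_sold)
```

The result is a named integer vector, so days_sold[["B"]] gives the number of distinct days on which item B was sold.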

Aggregate function in R using two columns simultaneously

Data:-
df=data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),Year=c(2016,2015,2014,2016,2006,2006),Balance=c(100,150,65,75,150,10))
Name Year Balance
1 John 2016 100
2 John 2015 150
3 Stacy 2014 65
4 Stacy 2016 75
5 Kat 2006 150
6 Kat 2006 10
Code:-
aggregate(cbind(Year,Balance)~Name,data=df,FUN=max )
Output:-
Name Year Balance
1 John 2016 150
2 Kat 2006 150
3 Stacy 2016 75
I want to aggregate/summarize the above data frame using two columns, Year and Balance. I used the base function aggregate to do this. I need the maximum balance of the most recent year. In the first row of the output, John has the latest year (2016) but the balance from 2015, which is not what I need; it should output 100, not 150. Where am I going wrong?
Somewhat ironically, aggregate is a poor tool for aggregating. You could make it work, but I'd instead do:
library(data.table)
setDT(df)[order(-Year, -Balance), .SD[1], by = Name]
# Name Year Balance
#1: John 2016 100
#2: Stacy 2016 75
#3: Kat 2006 150
I would suggest using the dplyr package:
library(dplyr)
data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),
Year=c(2016,2015,2014,2016,2006,2006),
Balance=c(100,150,65,75,150,10)) %>% #create the dataframe
as_tibble() %>% #convert it to a tibble
group_by(Name, Year) %>% #group it by Name and Year
summarise(maxBalance=max(Balance)) %>% # calculate the maximum for each group
group_by(Name) %>% # group the resulting data frame by Name
top_n(1, Year) # keep only the row with the most recent year in each group
Here is another solution without the data.table package.
First, sort the data frame by year and balance, both descending:
df <- df[order(-df$Year, -df$Balance),]
Then select the first row in each group with the same name:
df[!duplicated(df$Name),]
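Put together, the sort-then-deduplicate approach can be sketched end to end as follows. The variable names sorted and latest are only illustrative.

```r
# Sample data from the question
df <- data.frame(
  Name    = c("John", "John", "Stacy", "Stacy", "Kat", "Kat"),
  Year    = c(2016, 2015, 2014, 2016, 2006, 2006),
  Balance = c(100, 150, 65, 75, 150, 10),
  stringsAsFactors = FALSE
)

# Sort so that the latest year comes first and, within a year,
# the largest balance comes first
sorted <- df[order(-df$Year, -df$Balance), ]

# duplicated() flags every repeat of a name, so negating it keeps
# only the first (latest year, highest balance) row per person
latest <- sorted[!duplicated(sorted$Name), ]
print(latest)
```

Unlike aggregate with cbind(Year, Balance), which takes the maximum of each column independently, this keeps Year and Balance from the same original row.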

Taking average of dataframe elements sharing same date

I am a bit lost in how to take average of a data frame formatted in the following way:
id date quantity product
1 12-05-2015 10 apple
2 21-03-2015 12 orange
3 12-05-2015 15 orange
4 21-03-2015 16 apple
Expected result:
date quantity
21-03-2015 14
12-05-2015 12.5
I tried converting it to zoo object, but then I run into issues as dates are non-unique.
Try
aggregate(quantity~date, df1, mean)
# date quantity
#1 12-05-2015 12.5
#2 21-03-2015 14.0
Or
library(data.table)
setDT(df1)[, list(quantity=mean(quantity)), date]
As @Alex A. mentioned in the comments, list( can be replaced by .( in recent data.table versions.
You could also use the dplyr package. Assuming your data frame is called df:
library(dplyr)
df %>%
group_by(date) %>%
summarize(quantity = mean(quantity))
# date quantity
# 1 12-05-2015 12.5
# 2 21-03-2015 14.0
This gets the mean quantity grouped by date.
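Note that the outputs above are sorted by the date string, so 12-05-2015 comes before 21-03-2015, whereas the expected result in the question is in calendar order. A minimal base R sketch that reorders the result chronologically, assuming the dates are day-month-year character strings (the name avg is only illustrative):

```r
# Sample data from the question; dates are dd-mm-yyyy strings
df1 <- data.frame(
  id       = 1:4,
  date     = c("12-05-2015", "21-03-2015", "12-05-2015", "21-03-2015"),
  quantity = c(10, 12, 15, 16),
  product  = c("apple", "orange", "orange", "apple"),
  stringsAsFactors = FALSE
)

# Mean quantity per date (rows come back sorted by the date *string*)
avg <- aggregate(quantity ~ date, data = df1, FUN = mean)

# Reorder by the parsed date so that March comes before May
avg <- avg[order(as.Date(avg$date, format = "%d-%m-%Y")), ]
print(avg)
```

Parsing with the explicit "%d-%m-%Y" format matters here, because the default year-month-day parsing would not read these strings correctly.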

Resources