aggregate value with time in data frame - r

I have the following data frame;
Date <- as.Date(c('2006-08-23', '2006-08-30', '2006-09-06', '2006-09-13', '2006-09-20'))
order <- c("buy", "buy", "sell", "buy", "buy")
cost <- c(10, 15, 12, 13, 8)
df <- data.frame(Date, order, cost)
df
Date order cost
1 2006-08-23 buy 10
2 2006-08-30 buy 15
3 2006-09-06 sell 12
4 2006-09-13 buy 13
5 2006-09-20 buy 8
How can I sum the cost column by taking into account date and the order column and obtain the new balance column in a new data frame like this one?
Date order cost balance
1 2006-08-23 buy 10 10
2 2006-08-30 buy 15 25
3 2006-09-06 sell 12 13
4 2006-09-13 buy 13 26
5 2006-09-20 buy 8 34

Assuming you have a sorted DF, "cost" is an unsmart label, since we will have to generate a sign to show what the actual cost was based on the buy/sell flag.
df$cost[df$order == 'sell'] <- -df$cost[df$order == 'sell']
balance is then cumsum(df$cost).

Related

Cumsum function step wise in R

I am facing one problem, I calculated a monthly interest rate for a mortgage, however, I would need to sum the results in order to have it yearly (always 12 months).
H <- 2000000 # mortgage
i.m <- 0.03/12 # rate per month
year <- 15 # years
a <- (H*i.m*(1+i.m)^(12*year))/
((1+i.m)^(12*year)-1)
a # monthly payment
interest <- a*(1-(1/(1+i.m)^(0:(year*12))))
interest
cumsum(a*(1-(1/(1+i.m)^(0:(year*12))))) # first 12 values together and then next 12 values + first values and ... (I want to have for every year a value)
You may do this with tapply in base R.
monthly <- cumsum(a*(1-(1/(1+i.m)^(0:(year*12)))))
yearly <- tapply(monthly, ceiling(seq_along(monthly)/12), sum)
I think you can use the following solution:
monthly <- cumsum(a*(1-(1/(1+i.m)^(0:(year*12)))))
sapply(split(monthly, ceiling(seq_along(monthly) / 12)), function(x) x[length(x)])
1 2 3 4 5 6 7 8
2254.446 9334.668 21098.218 37406.855 58126.414 83126.695 112281.337 145467.712
9 10 11 12 13 14 15 16
182566.812 223463.138 268044.605 316202.434 367831.057 422828.023 481093.905 486093.905

Is there any method : text to table with R?

I have a strings which have some pattern.
for example
>oc
[1]"for financial company payment manufacturer company payment distributor people payment other payment total payment 1 month payment 10 20 30 40 100 2 month payment 8 14 15 30 67 1 year payment 5 9 11 15 40"
raw material is table and there is some disturbing things, I decide to extract text from table and organize, clean them with code then reshape table form.
The raw material table looks like this
for financial company payment | manufacturer company payment | distributor people payment | other..
1 m..| 10 20 30 ...
2 m..| 8 14 15 ...
1 y..| 5 9 11 ...
I appreciate any method so please, leave any comment for it. It would be great help to me.
Also what I tried to do is first, use extract_text function (in tabilizer library)
and second I use regular expression to make strings tidy
finally I use scan function.
Again, any method is okay. please leave any help. Thank you!
Here's a solution--anything but elegant but working:
Your data:
oc <- "for financial company payment manufacturer company payment distributor people payment other payment total payment 1 month payment 10 20 30 40 100 2 month payment 8 14 15 30 67 1 year payment 5 9 11 15 40"
First, split the string at payment:
oc <- strsplit(oc, " payment ")
Prepare a matrix to fill in the data:
mt <- matrix(NA, ncol = 5, nrow = 3)
Grab the relevant elements from oc as column names:
colnames(mt) <- oc[[1]][1:5]
Define the rownames:
rownames(mt) <- c("1 month", "2 month", "1 year")
Grab the numbers from oc:
numbers <- ocx[[1]][7:9]
Clean numbers:
numbers <- gsub("( 2 month| 1 year)", "", numbers)
Now breaknumbers into individual numbers using str_extract_all from the package stringr:
library(stringr)
numbers <- str_extract_all(numbers, "\\d+")
Iterate over the rows in mt to fill in the numbers from numbers:
for(i in 1:nrow(mt)){
mt[i,] <- numbers[[i]]
}
Finally redefine mt as a dataframe:
mt <- as.data.frame(mt)
Et voilá, the result:
mt
for financial company manufacturer company distributor people other total
1 month 10 20 30 40 100
2 month 8 14 15 30 67
1 year 5 9 11 15 40

Aggregating by subsets in dplyr

I have a dataset with a million records that I need to aggregate after first subsetting the data. It is difficult to provide a good reproducible sample because in this case, the sample size would be rather large - but I will try anyway.
A random sample of the data that I am working with looks like this:
> df
auto_id user_id month
164537 7124 240249 10
151635 7358 226423 9
117288 7376 172463 9
177119 6085 199194 11
128904 7110 141608 9
157194 7143 241964 9
71303 6090 141646 7
72480 6808 175910 7
108705 6602 213098 8
97889 7379 185516 8
184906 6405 212580 12
37242 6057 197905 8
157284 6548 162928 9
17910 6885 194180 10
70660 7162 161827 7
8593 7375 207061 8
28712 6311 176373 10
144194 7324 142715 9
73106 7196 176153 7
67065 7392 171039 7
77954 7116 161489 7
59842 7107 162637 7
101819 5994 182973 9
183546 6427 142029 12
102881 6477 188129 8
In every month, there many users who are the same, and first we should subset by month and make a frequency table of the users and the amount of trips taken (unfortunately, in the random sample above there is only one trip per user, but in the larger dataset, this is not the case):
full_data <- full_data[full_data$month == 7,]
users <- as.data.frame(table(full_data$user_id))
head(users)
Var1 Freq
1 100231 10
2 100744 17
3 111281 1
4 111814 2
5 113716 3
6 117493 3
As we can see, in the full data set, in month of July (month = 7), users have taken multiple trips. Now the important part - which is to subset only the top 10% of these users (the top 10% in terms of Freq)
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
Now the new dataframe - topten - can be summed and we get the amount of trips taken by the top ten percent of users
sum(topten$Freq)
[1] 12147
In the end the output should look like this
> output
month trips
1 7 12147
2 8 ...
3 9 ...
4 10 ...
5 11 ...
6 12 ...
Is there a way to automate this process using dplyr - I mean specifically the subsetting by the top ten percent ? I have tried
output <- full_data %>%
+ group_by(month) %>%
+ summarise(n = n())
But this only aggregates total trips by month. Could someone suggest a way to integrate this part into the query in dplyr ? :
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
The code below counts the number of rows for each user_id in each month, and then selects the 10% of users with the most rows in each month and sums them. Let me know if it solves your problem.
library(dplyr)
full_data %>% group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9) %>%
summarise(n_trips = sum(n))
UPDATE: Following up on your comment, let's do a check with some fake data. Below we have 30 different values of user_id and 10,000 total rows. I've also used the prob argument so that the probability of a user_id being selected is proportional to its value (i.e., user_id 1 is the least likely to be chosen and user_id 30 is the most likely to be chosen).
set.seed(3)
full_data = data.frame(user_id=sample(1:30,10000, replace=TRUE, prob=1:30),
month=sample(1:12, 10000, replace=TRUE))
Let's look as the number of rows for each user_id for month==1. The code below counts the number of rows for each user_id and sorts from most to least common. Note that the three most common values of user_id (28,29,26) comprise 171 rows (60+57+54). Since there are 30 different values of user_id the top three users represent the top 10% of users:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
arrange(desc(n)) %>% as.data.frame
month user_id n
1 1 28 60
2 1 29 57
3 1 26 54
4 1 30 53
5 1 27 49
6 1 22 43
7 1 21 41
8 1 20 40
9 1 23 40
10 1 24 38
11 1 25 38
12 1 19 37
13 1 18 33
14 1 16 28
15 1 15 27
16 1 17 27
17 1 14 26
18 1 9 20
19 1 12 20
20 1 13 20
21 1 10 17
22 1 11 17
23 1 6 15
24 1 7 13
25 1 8 13
26 1 4 9
27 1 5 7
28 1 2 3
29 1 3 2
30 1 1 1
So now let's take the next step and select the top 10% of users. To answer the question in your comment, filter(percent_rank(n) >= 0.9) keeps only the top 10% of user_id, based on the value of n (which is the number of rows for each user_id). percent_rank is on of several ranking functions in dplyr that have different ways of dealing with ties (which may be the reason you're not getting the results you expect). See ?percent_rank for details:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9)
month user_id n
1 1 26 54
2 1 28 60
3 1 29 57
And the sum of n (the total number of trips for the top 10%) is:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9) %>%
summarise(n_trips = sum(n))
month n_trips
1 1 171
So it looks like the code does what we'd naively expect, but maybe the issue is related to how ties are dealt with. Let me know if you're still getting anomalous results in your real data or if I've misunderstood what you're trying to accomplish.

Assign rows to a group based on spatial neighborhood and temporal criteria in R

I have an issue that I just cannot seem to sort out. I have a dataset that was derived from a raster in arcgis. The dataset represents every fire occurrence during a 10-year period. Some raster cells had multiple fires within that time period (and, thus, will have multiple rows in my dataset) and some raster cells will not have had any fire (and, thus, will not be represented in my dataset). So, each row in the dataset has a column number (sequential integer) and a row number assigned to it that corresponds with the row and column ID from the raster. It also has the date of the fire.
I would like to assign a unique ID (fire_ID) to all of the fires that are within 4 days of each other and in adjacent pixels from one another (within the 8-cell neighborhood) and put this into a new column.
To clarify, if there were an observation from row 3, col 3, Jan 1, 2000 and another from row 2, col 4, Jan 4, 2000, those observations would be assigned the same fire_ID.
Below is a sample dataset with "rows", which are the row IDs of the raster, "cols", which are the column IDs of the raster, and "dates" which are the dates the fire was detected.
rows<-sample(seq(1,50,1),600, replace=TRUE)
cols<-sample(seq(1,50,1),600, replace=TRUE)
dates<-sample(seq(from=as.Date("2000/01/01"), to=as.Date("2000/02/01"), by="day"),600, replace=TRUE)
fire_df<-data.frame(rows, cols, dates)
I've tried sorting the data by "row", then "column", then "date" and looping through, to create a new fire_ID if the row and column ID were within one value and the date was within 4 days, but this obviously doesn't work, as fires which should be assigned the same fire_ID are assigned different fire_IDs if there are observations in between them in the list that belong to a different fire_ID.
fire_df2<-fire_df[order(fire_df$rows, fire_df$cols, fire_df$date),]
fire_ID=numeric(length=nrow(fire_df2))
fire_ID[1]=1
for (i in 2:nrow(fire_df2)){
fire_ID[i]=ifelse(
fire_df2$rows[i]-fire_df2$rows[i-1]<=abs(1) & fire_df2$cols[i]-fire_df2$cols[i-1]<=abs(1) & fire_df2$date[i]-fire_df2$date[i-1]<=abs(4),
fire_ID[i-1],
i)
}
length(unique(fire_ID))
fire_df2$fire_ID<-fire_ID
Please let me know if you have any suggestions.
I think this task requires something along the lines of hierarchical clustering.
Note, however, that there will be necessarily some degree of arbitrariness in the ids. This is because it is entirely possible that the cluster of fires itself is longer than 4 days yet every fire is less than 4 days away from some other fire in that cluster (and thus should have the same id).
library(dplyr)
# Create the distances
fire_dist <- fire_df %>%
# Normalize dates
mutate( norm_dates = as.numeric(dates)/4) %>%
# Only keep the three variables of interest
select( rows, cols, norm_dates ) %>%
# Compute distance using L-infinite-norm (maximum)
dist( method="maximum" )
# Do hierarchical clustering with "single" aggl method
fire_clust <- hclust(fire_dist, method="single")
# Cut the tree at height 1 and obtain groups
group_id <- cutree(fire_clust, h=1)
# First attach the group ids back to the data frame
fire_df2 <- cbind( fire_df, group_id ) %>%
# Then sort the data
arrange( group_id, dates, rows, cols )
# Print the first 20 records
fire_df2[1:10,]
(Make sure you have dplyr library installed. You can run install.packages("dplyr",dep=TRUE) if not installed. It is a really good and very popular library for data manipulations)
A couple of simple tests:
Test #1. The same forest fire moving.
rows<-1:6
cols<-1:6
dates<-seq(from=as.Date("2000/01/01"), to=as.Date("2000/01/06"), by="day")
fire_df<-data.frame(rows, cols, dates)
gives me this:
rows cols dates group_id
1 1 1 2000-01-01 1
2 2 2 2000-01-02 1
3 3 3 2000-01-03 1
4 4 4 2000-01-04 1
5 5 5 2000-01-05 1
6 6 6 2000-01-06 1
Test #2. 6 different random forest fires.
set.seed(1234)
rows<-sample(seq(1,50,1),6, replace=TRUE)
cols<-sample(seq(1,50,1),6, replace=TRUE)
dates<-sample(seq(from=as.Date("2000/01/01"), to=as.Date("2000/02/01"), by="day"),6, replace=TRUE)
fire_df<-data.frame(rows, cols, dates)
output:
rows cols dates group_id
1 6 1 2000-01-10 1
2 32 12 2000-01-30 2
3 31 34 2000-01-10 3
4 32 26 2000-01-27 4
5 44 35 2000-01-10 5
6 33 28 2000-01-09 6
Test #3: one expanding forest fire
dates <- seq(from=as.Date("2000/01/01"), to=as.Date("2000/01/06"), by="day")
rows_start <- 50
cols_start <- 50
fire_df <- data.frame(dates = dates) %>%
rowwise() %>%
do({
diff = as.numeric(.$dates - as.Date("2000/01/01"))
expand.grid(rows=seq(rows_start-diff,rows_start+diff),
cols=seq(cols_start-diff,cols_start+diff),
dates=.$dates)
})
gives me:
rows cols dates group_id
1 50 50 2000-01-01 1
2 49 49 2000-01-02 1
3 49 50 2000-01-02 1
4 49 51 2000-01-02 1
5 50 49 2000-01-02 1
6 50 50 2000-01-02 1
7 50 51 2000-01-02 1
8 51 49 2000-01-02 1
9 51 50 2000-01-02 1
10 51 51 2000-01-02 1
and so on. (All records identified correctly to belong to the same forest fire.)

Summarize variable for different time periods and by group using ddply

I am trying to summarize a sales report by client and get the total sales for different periods of time:
Client Q Sales Date
A 2 30 01/01/2014
A 3 24 02/01/2014
A 1 10 03/01/2014
B 4 10 01/01/2014
B 1 20 02/01/2014
B 3 30 03/01/2014
I am able to summarize by client using ddply:
rapport <- ddply(df, CLIENT, summarise,
Q = sum(Q),
Sales = sum(Sales) )
Client Q Sales
A 6 64
B 7 60
I would like to add an extra column with sales only for the date 03/01/2014
Client Q Sales Sales03/01/2014
A 6 64 10
B 7 60 30
DF <- read.table(text=" Client Q Sales Date
A 2 30 01/01/2014
A 3 24 02/01/2014
A 1 10 03/01/2014
B 4 10 01/01/2014
B 1 20 02/01/2014
B 3 30 03/01/2014", header=TRUE)
library(plyr)
ddply(DF, .(Client), summarise,
Q = sum(Q),
`Sales03/01/2014` = Sales[Date=="03/01/2014"],
Sales = sum(Sales))
# Client Q Sales03/01/2014 Sales
#1 A 6 10 64
#2 B 8 30 60
Note that order of evaluation is important here if you want the same name for output as for input Sales. Also, it is best to avoid names that are not valid syntax.
You can also achieve the same result using dplyr:
library(dplyr)
DF %>%
group_by(Client) %>%
summarise(SumOfQ = sum(Q)
`Sales03/01/2014` = Sales[Date=="03/01/2014"],
SumOfSales = sum(Sales))
dplyr is slower for the example case, but much faster for large data frames.

Resources