I have a data.frame that looks like this:
> head(activity_data)
    ev_id cust_id active previous_active start_date
1 1141880     201      1               0 2008-08-17
2 4927803     201      1               0 2013-03-17
3 1141880     244      1               0 2008-08-17
4 2391524     244      1               0 2011-02-05
5 1141868     325      1               0 2008-08-16
6 1141872     325      1               0 2008-08-16
for each cust_id
for each ev_id
create a new variable $recent_active (= sum $active across all rows with this cust_id where $start_date > [this_row]$start_date - 10)
I am struggling to do this using ddply, since my split grouping was .(cust_id) but I wanted to return rows keyed by both cust_id and ev_id.
Here is what I tried
ddply(activity_data, .(cust_id), function(x) recent_active=sum(x[this_row,]$active))
If ddply is not an option, what other efficient ways do you recommend? My dataset has ~200mn rows and I need to do this about 10-15 times per row.
sample data is here
You actually need to use a two-step approach here (and you also need to convert start_date to Date format, e.g. with as.Date, before using the following code):
ddply(activity_data, .(cust_id), transform, recent_active = your_function) # not clear what function you are asking for here
ddply(activity_data, .(cust_id, ev_id), summarize, recent_active = sum(recent_active))
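Since the function in the first step is left unspecified above, here is a minimal base-R sketch of the 10-day rule on an illustrative subset of the question's data. Note it follows the question's condition literally (start_date > this row's start_date - 10, with no upper bound); for ~200mn rows a data.table or windowed-join approach would be far faster, but the logic is the same.

```r
# Illustrative subset of the question's data, with dates converted via as.Date
activity_data <- data.frame(
  ev_id   = c(1141880, 4927803, 1141880, 2391524, 1141868, 1141872),
  cust_id = c(201, 201, 244, 244, 325, 325),
  active  = c(1, 1, 1, 1, 1, 1),
  start_date = as.Date(c("2008-08-17", "2013-03-17", "2008-08-17",
                         "2011-02-05", "2008-08-16", "2008-08-16")))

# For each row: sum `active` over rows of the same cust_id whose start_date
# is later than (this row's start_date - 10 days), per the question's spec
recent_sum <- function(idx) {
  d <- activity_data$start_date[idx]   # dates for this cust_id
  a <- activity_data$active[idx]       # active flags for this cust_id
  sapply(d, function(x) sum(a[d > x - 10]))
}

# ave() applies recent_sum per cust_id and preserves the original row order
activity_data$recent_active <- ave(seq_len(nrow(activity_data)),
                                   activity_data$cust_id, FUN = recent_sum)
```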
Looking to fill a matrix with a reverse cumsum. There are multiple breaks that must be maintained.
I have provided a sample matrix for what I want to accomplish. The first column is the data, the second column is what I want. You will see that column 2 is updated to reflect the number of items that are left. When there are 0's the previous number must be carried through.
update <- matrix(c(rep(0,4),rep(1,2),2,rep(0,2),1,3,
rep(10,4), 9,8,6, rep(6,2), 5, 2),ncol=2)
I have tried multiple ways to create a sequence, and to loop using numerous packages (e.g. zoo). What makes it difficult is that the numbers in column 1 can be between 0, 1, ..., X but must be less than column 2.
Any help or tips would be appreciated
EDIT: Column 2 starts with a given value which can represent any starting value (e.g. inventory at the beginning of a month). Column 1 would then represent "purchases" made; thus, column 2 should reflect the total number of remaining items available.
The following will report the purchase and inventory balance as described:
starting_inventory <- 100
df <- data.frame(purchases=c(rep(0,4),rep(1,2),2,rep(0,2),1,3))
df$cum_purchases <- cumsum(df$purchases)
df$remaining_inventory <- starting_inventory - df$cum_purchases
Result:
   purchases cum_purchases remaining_inventory
1          0             0                 100
2          0             0                 100
3          0             0                 100
4          0             0                 100
5          1             1                  99
6          1             2                  98
7          2             4                  96
8          0             4                  96
9          0             4                  96
10         1             5                  95
11         3             8                  92
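The same cumsum idea applies directly to the matrix in the question: subtracting the running total of column 1 from the starting inventory reproduces column 2 exactly.

```r
update <- matrix(c(rep(0, 4), rep(1, 2), 2, rep(0, 2), 1, 3,
                   rep(10, 4), 9, 8, 6, rep(6, 2), 5, 2), ncol = 2)

# Starting inventory = first balance plus any purchase made in that first row
start <- update[1, 2] + update[1, 1]
rebuilt <- start - cumsum(update[, 1])
all(rebuilt == update[, 2])  # TRUE
```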
In a data frame, I am trying to delete the column whose sum is the least. I want it to be dynamic since I want to use it in a function
E.g
      a b   c
1   434 0  45
2  5452 1 456
3 42342 0  26
4   542 1  15
5   542 1 323
6   413 0  45
I want to remove the 2nd column (i.e. column b) since its sum is the least, but I need this done dynamically since it has to be part of a function.
We can use colSums with which.min to get the index of the column with the minimum sum and remove that column:
df1[-which.min(colSums(df1))]
Or another option is Filter
mn <- min(sapply(df1, sum))
Filter(function(x) sum(x) != mn, df1)
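Either line drops straight into a function, which covers the "dynamic" requirement. A minimal sketch, using the data from the example above:

```r
# Drop whichever column currently has the smallest sum
drop_min_col <- function(df) df[-which.min(colSums(df))]

df1 <- data.frame(a = c(434, 5452, 42342, 542, 542, 413),
                  b = c(0, 1, 0, 1, 1, 0),
                  c = c(45, 456, 26, 15, 323, 45))
names(drop_min_col(df1))  # "a" "c" -- column b (sum 3) is removed
```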
I have a table with values
KId sales_month quantity_sold
100           1             0
100           2             0
100           3             0
496           2             6
511           2            10
846           1             4
846           2             6
846           3             1
338           1             6
338           2             0
Now I require the output as:
KId sales_month quantity_sold result
100           1             0      1
100           2             0      1
100           3             0      1
496           2             6      1
511           2            10      1
846           1             4      1
846           2             6      1
846           3             1      0
338           1             6      1
338           2             0      1
The calculation goes as follows: if the quantity sold in March (month 3) is less than 60% of the combined quantity sold in January (month 1) and February (month 2), then the result should be 1; otherwise it should display 0. I need a solution that performs this.
Thanks in advance.
If I understand correctly, your requirement is to compare the quantity sold in month t with the sum of the quantities sold in months t-1 and t-2. If so, I suggest the dplyr package, which offers the nice feature of grouping rows and mutating columns in your data frame.
library(dplyr)

resultData <- group_by(data, KId) %>%
  arrange(sales_month) %>%
  # quantities sold one and two months earlier
  mutate(monthMinus1Qty = lag(quantity_sold, 1),
         monthMinus2Qty = lag(quantity_sold, 2)) %>%
  group_by(KId, sales_month) %>%
  # combined quantity of the two preceding months
  mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
  # 1 if the current month is below 60% of the previous two months, else 0
  mutate(result = ifelse(quantity_sold / previous2MonthsQty >= 0.6, 0, 1)) %>%
  select(KId, sales_month, quantity_sold, result)
The pipeline produces the desired result column.
Adding
select(KId,sales_month, quantity_sold, result)
at the end lets us display only the columns we care about (and not all the intermediate steps).
I believe this should satisfy your requirement. NAs in the result column are due to 0/0 division or to no data at all for the previous months.
Should you need to expand your calculation beyond one calendar year, you can add year column and adjust group_by() arguments appropriately.
For more information on the dplyr package, see its documentation.
Some time ago I asked a question about creating market basket data. Now I would like to create a similar data.frame, but based on a third variable. Unfortunately I ran into problems trying. Previous question: Effecient way to create market basket matrix in R
@shadow and @SimonO101 gave me good answers, but I was not able to adapt their answers correctly. I have the following data:
Customer <- as.factor(c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003))
Product <- as.factor(c(100001,100001,100001,100004,100004,100002,100003,100003,100003,100002,100003,100008))
input <- data.frame(Customer,Product)
I can create a contingency table now the following way:
input_df <- as.data.frame.matrix(table(input))
However I have a third (numeric) variable which I want as output in the table.
Number <- c(3,1,-4,1,1,1,1,1,1,1,1,1)
input <- data.frame(Customer,Product,Number)
Now the code (of course, now there are 3 variables) does not work anymore. The result I am looking for has unique Customer as row names and unique Product as column names. And has Number as value (or 0 if not present), this number could be calculated by:
input_agg <- aggregate( Number ~ Customer + Product, data = input, sum)
Hope my question is clear, please comment if something is not clear.
You can use xtabs for that :
R> xtabs(Number ~ Customer + Product, data = input)
         Product
Customer  100001 100002 100003 100004 100008
  1000001      0      1      0      2      0
  1000002      0      0      3      0      0
  1000003      0      1      1      0      1
This class of problem is designed for reshape2::dcast...
require( reshape2 )
# Too many rows so change to a data.table.
dcast( input , Customer ~ Product , fun = sum , value.var = "Number" )
# Customer 100001 100002 100003 100004 100008
#1 1000001 0 1 0 2 0
#2 1000002 0 0 3 0 0
#3 1000003 0 1 1 0 1
Recently, dcast for data.table objects was implemented by @Arun in response to FR #2627. Great stuff. You will have to use the development version, 1.8.11. Also, at the moment it should be called as dcast.data.table, because dcast is not yet an S3 generic in the reshape2 package. That is, you can do:
require(reshape2)
require(data.table)
input <- data.table(input)
dcast.data.table(input , Customer ~ Product , fun = sum , value.var = "Number")
# Customer 100001 100002 100003 100004 100008
# 1: 1000001 0 1 0 2 0
# 2: 1000002 0 0 3 0 0
# 3: 1000003 0 1 1 0 1
This should work quite well on bigger data and should be much faster than reshape2:::dcast as well.
Alternatively, you can try the reshape::cast version, which may or may not crash... Try it!!
require(reshape)
input <- data.table( input )
cast(input, Customer ~ Product, fun = sum, value = "Number")
I have data in the following format called DF (this is just a made up simplified sample):
eval.num eval.count fitness fitness.mean green.h.0 green.v.0 offset.0    random
       1          1    1500         1500       100       120       40    232342
       2          2    1000         1250       100       120       40     11843
       3          3    1250         1250       100       120       40 981340234
       4          4    1000       1187.5       100       120       40   4363453
       5          1    2000         2000       200       100       40    345902
       6          1    3000         3000       150        90       10       943
       7          1    2000         2000        90        90      100   9304358
       8          2    1800         1900        90        90      100    284333
However, the eval.count column is incorrect and I need to fix it. It should report the number of rows with the same values for (green.h.0, green.v.0, and offset.0) by only looking at the previous rows.
The example above uses the expected values, but assume they are incorrect.
How can I add a new column (say "count") which will count all previous rows which have the same values of the specified variables?
I have gotten help on a similar problem of just selecting all rows with the same values for specified columns, so I supposed I could just write a loop around that, but it seems inefficient to me.
Ok, let's first do it in the easy case where you just have one column.
> data <- rep(sample(1000, 5), sample(5, 5))
> head(data)
[1] 435 435 435 278 278 278
Then you can just use rle to figure out the contiguous sequences:
> sequence(rle(data)$lengths)
[1] 1 2 3 1 2 3 4 5 1 2 3 4 1 2 1
Or altogether:
> head(cbind(data, sequence(rle(data)$lengths)))
[1,] 435 1
[2,] 435 2
[3,] 435 3
[4,] 278 1
[5,] 278 2
[6,] 278 3
For your case with multiple columns, there are probably a bunch of ways of applying this solution. Easiest might be to just paste the columns you care about together to form a single vector.
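For example (a sketch using the question's column names): collapse the key columns into one string, then take a grouped running count with ave(). Unlike the rle approach, this also matches rows that are not contiguous.

```r
# Sample rows mirroring the question's (green.h.0, green.v.0, offset.0) values
DF <- data.frame(green.h.0 = c(100, 100, 100, 100, 200, 150, 90,  90),
                 green.v.0 = c(120, 120, 120, 120, 100,  90, 90,  90),
                 offset.0  = c( 40,  40,  40,  40,  40,  10, 100, 100))

# One string key per row; "\r" as separator avoids accidental collisions
key <- do.call(paste, c(DF, sep = "\r"))

# Running count of each key so far (occurrence index); subtract 1 if you
# want only the number of *previous* matching rows
DF$count <- ave(rep(1, nrow(DF)), key, FUN = cumsum)
DF$count  # 1 2 3 4 1 1 1 2, matching eval.count in the question
```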
Okay I used the answer I had on another question and worked out a loop that I think will work. This is what I'm going to use:
cmpfun2 <- function(r) {
  count <- 0
  if (r[1] > 1) {                 # r[1] is eval.num, assumed to equal the row number
    for (row in 1:(r[1] - 1)) {   # scan only the earlier rows
      if (all(r[27:51] == DF[row, 27:51, drop = FALSE])) {
        count <- count + 1
      }
    }
  }
  return(count)
}

# Note: apply() coerces the data frame to a matrix, so mixed column types
# become character; keep columns 27:51 numeric for the comparison to work
brows <- apply(DF, 1, cmpfun2)
print(brows)
Please comment if I made a mistake and this won't work, but I think I've figured it out. Thanks!
I have a solution I figured out over time (sorry I haven't checked this in a while)
checkIt <- function(bind) {
  print(bind)
  # TRUE for every row whose columns 23:47 match those of row `bind`
  cmpfun <- function(r) { all(r == heeds.data[bind, 23:47, drop = FALSE]) }
  brows <- apply(heeds.data[, 23:47], 1, cmpfun)
  #print(heeds.data[brows, c("eval.num", "fitness", "green.h.1", "green.h.2", "green.v.5")])
  print(nrow(heeds.data[brows, c("eval.num", "fitness", "green.h.1", "green.h.2", "green.v.5")]))
}
Note that heeds.data is my actual data frame; I originally printed a few columns to make sure it was working (now commented out). Columns 23:47 are the ones checked for duplicates.
Also, I really haven't learned as much R as I should so I'm open to suggestions.
Hope this helps!