R - Modifying several interrelated columns with dynamic lag data simultaneously

I have a dataset that looks like:
C_ID I_ID Loan R1 Prot_id Prot_value R2 maxRank
1 A c 341 1 p1 506 1 3
2 A c 341 1 p2 366 2 3
3 A c 341 1 p3 263 3 3
4 A a 689 2 p1 506 1 3
5 A a 689 2 p2 366 2 3
6 A a 689 2 p3 263 3 3
7 A d 720 3 p1 506 1 3
8 A d 720 3 p2 366 2 3
9 A d 720 3 p3 263 3 3
10 A b 334 4 p1 506 1 3
11 A b 334 4 p2 366 2 3
12 A b 334 4 p3 263 3 3
13 A e 752 5 p1 506 1 3
14 A e 752 5 p2 366 2 3
15 A e 752 5 p3 263 3 3
16 B h 193 1 p5 529 1 2
17 B h 193 1 p4 414 2 2
18 B g 494 2 p5 529 1 2
19 B g 494 2 p4 414 2 2
20 B f 227 3 p5 529 1 2
21 B f 227 3 p4 414 2 2
22 B j 785 4 p5 529 1 2
23 B j 785 4 p4 414 2 2
24 B i 371 5 p5 529 1 2
25 B i 371 5 p4 414 2 2
26 B k 395 6 p5 529 1 2
27 B k 395 6 p4 414 2 2
Where R1 is the ranking of each loan within its contract (C_ID) group and R2 is the ranking of each collateral under the contract. What is needed is:
C_ID I_ID Loan R1 Prot_id Prot_value R2 maxRank PreAllocation Allocation PostAllocation Residual
1 A c 341 1 p1 506 1 3 341 341 0 165
2 A c 341 1 p2 366 2 3 0 0 0 366
3 A c 341 1 p3 263 3 3 0 0 0 263
4 A a 689 2 p1 506 1 3 689 165 524 0
5 A a 689 2 p2 366 2 3 524 366 158 0
6 A a 689 2 p3 263 3 3 158 158 0 105
7 A d 720 3 p1 506 1 3 720 0 720 0
8 A d 720 3 p2 366 2 3 720 0 720 0
9 A d 720 3 p3 263 3 3 720 105 615 0
10 A b 334 4 p1 506 1 3 334 0 334 0
11 A b 334 4 p2 366 2 3 334 0 334 0
12 A b 334 4 p3 263 3 3 334 0 334 0
13 A e 752 5 p1 506 1 3 752 0 752 0
14 A e 752 5 p2 366 2 3 752 0 752 0
15 A e 752 5 p3 263 3 3 752 0 752 0
16 B h 193 1 p5 529 1 2 193 193 0 336
17 B h 193 1 p4 414 2 2 0 0 0 414
18 B g 494 2 p5 529 1 2 494 336 158 0
19 B g 494 2 p4 414 2 2 158 158 0 256
20 B f 227 3 p5 529 1 2 227 0 227 0
21 B f 227 3 p4 414 2 2 227 227 0 29
22 B j 785 4 p5 529 1 2 785 0 785 0
23 B j 785 4 p4 414 2 2 785 29 756 0
24 B i 371 5 p5 529 1 2 371 0 371 0
25 B i 371 5 p4 414 2 2 371 0 371 0
26 B k 395 6 p5 529 1 2 395 0 395 0
27 B k 395 6 p4 414 2 2 395 0 395 0
Only the Allocation column is important; the other columns exist just to arrive at it. I was able to compute this using a loop, as below:
df3 <- as.data.frame(df3)
df3$PreAllocation <- 0
df3$Allocation <- 0
df3$PostAllocation <- 0
df3$Residual <- 0
for (i in 1:nrow(df3)) {
  # Amount still to be covered: the full loan for the first collateral,
  # otherwise whatever the previous collateral left uncovered
  df3$PreAllocation[i] <- ifelse(df3$R2[i] == 1, df3$Loan[i], df3$PostAllocation[i - 1])
  # Allocate from the collateral's residual (looked up maxRank rows back,
  # i.e. the same collateral under the previous loan) or its full value
  df3$Allocation[i] <- ifelse(df3$R1[i] > 1,
                              min(df3$Residual[i - df3$maxRank[i]], df3$PreAllocation[i]),
                              min(df3$PreAllocation[i], df3$Prot_value[i]))
  df3$PostAllocation[i] <- df3$PreAllocation[i] - df3$Allocation[i]
  # Collateral value still available after this allocation
  df3$Residual[i] <- ifelse(df3$R1[i] == 1,
                            df3$Prot_value[i] - df3$Allocation[i],
                            df3$Residual[i - df3$maxRank[i]] - df3$Allocation[i])
}
However, when the dataset is big, there are performance issues. I have been trying to arrive at the same result using apply functions, rowwise + transform, etc., but could not manage it, because:
1. The columns are interdependent.
2. The calculation needs a dynamic lag (based on maxRank) of the very columns being generated.
Any suggestions? Thanks.

It looks like a loop was used so you could look at the value in the previous row. Below is a solution using dplyr, which has the functions lag() (and lead()) that let you look at previous (or following) rows. It also uses pmin() (there is also a pmax()), which takes the element-wise min/max across a set of vectors.
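As a quick standalone illustration of those two building blocks (results shown in the comments):
library(dplyr)
lag(c(10, 20, 30))            # NA 10 20 -- each value shifted down one row
pmin(c(1, 5, 3), c(4, 2, 3))  # 1 2 3   -- element-wise minimum of the two vectors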
# `mutate()` takes data plus one or more LHS = RHS statements about the data.
# Each column on the LHS is created (or overwritten) with the logic on the
# RHS. Conveniently, we don't have to prepend each name with `df3$`.
# Uses dplyr's `if_else()` instead of base R's `ifelse()`.
# `lag(x, n = 1)` looks at the previous row's value of x.
# Note: `n` must be a single integer, so we group by C_ID (within which
# maxRank is constant) and pass `first(maxRank)` as the lag distance.
df3 <- df3 %>%
  group_by(C_ID) %>%
  mutate(
    PreAllocation = if_else(R2 == 1, Loan, lag(PostAllocation, n = 1)),
    Allocation = if_else(R1 > 1,
                         pmin(lag(Residual, n = first(maxRank)), PreAllocation),
                         pmin(PreAllocation, Prot_value)),
    PostAllocation = PreAllocation - Allocation,
    Residual = if_else(R1 == 1,
                       Prot_value - Allocation,
                       lag(Residual, n = first(maxRank)) - Allocation)
  )
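One caveat, since the columns here are interdependent across rows: lag() inside a single mutate() reads a column as it stood before that column is redefined later in the same call, so a fully vectorized one-pass replacement of a recursive loop may still need per-group iteration. Treat the snippet above as a sketch of the lag()/pmin() building blocks rather than a verified drop-in.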
I encourage you to look at the dplyr CRAN page and the "Introduction to dplyr" vignette for further information.
If you'd like syntax closer to base R's for subsetting and assignment, you might also consider the data.table package.
Both are very popular frameworks for data manipulation and aggregation.
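For flavor, a minimal data.table sketch of the same shift idea (assuming the df3 above; prev_post is just an illustrative column, not part of the solution):
library(data.table)
dt <- as.data.table(df3)
# shift() is data.table's analogue of dplyr's lag()/lead()
dt[, prev_post := shift(PostAllocation, n = 1, type = "lag"), by = C_ID]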

Related

R - substituting loop for row wise manipulation, may require taking lag of column being generated

I have two tables:
C_ID Loan_ID Loan
A L1 341
A L2 689
A L3 720
A L4 334
B L5 193
B L6 494
B L7 227
C_ID Prot_id Prot_value
A p1 506
A p2 366
A p3 263
B p4 529
B p5 414
Table 1 has loan details and table 2 has collateral details; both loans and collateral are at the contract (C_ID) level. Collateral has to be allocated to the loans with the rule that the highest collateral gets assigned first and the earlier loan (L1 before L2 within the same contract) gets the allocation first.
So finally, I should get:
C_ID Loan_ID Loan Coll_ID Collateral Allocation
A L1 341 p1 506 341
A L1 341 p2 366 0
A L1 341 p3 263 0
A L2 689 p1 506 165
A L2 689 p2 366 366
A L2 689 p3 263 158
A L3 720 p1 506 0
A L3 720 p2 366 0
A L3 720 p3 263 105
A L4 334 p1 506 0
A L4 334 p2 366 0
A L4 334 p3 263 0
B L5 193 p4 529 193
B L5 193 p5 414 0
B L6 494 p4 529 336
B L6 494 p5 414 158
B L7 227 p4 529 0
B L7 227 p5 414 227
I was able to arrive at these using merge and a loop, but the loop (with its supporting columns) takes a lot of time on a big dataset.
df3<-merge(x=df1,y=df2,by="C_ID")
df3 <- mutate(df3,PreAllocation = 0,Allocation = 0, PostAllocation = 0,Residual = 0)
for (i in 1:nrow(df3)) {
  df3$PreAllocation[i] <- ifelse(df3$R2[i] == 1, df3$Loan[i], df3$PostAllocation[i - 1])
  df3$Allocation[i] <- ifelse(df3$R1[i] > 1,
                              min(df3$Residual[i - df3$maxRank[i]], df3$PreAllocation[i]),
                              min(df3$PreAllocation[i], df3$Prot_value[i]))
  df3$PostAllocation[i] <- df3$PreAllocation[i] - df3$Allocation[i]
  df3$Residual[i] <- ifelse(df3$R1[i] == 1,
                            df3$Prot_value[i] - df3$Allocation[i],
                            df3$Residual[i - df3$maxRank[i]] - df3$Allocation[i])
}
Can anyone please help me with an alternative way to arrive at the Allocation column that saves time?
Thanks.
@akrun: after merging and ranking, it may look like
C_ID Loan_ID Loan R1 Coll_ID Collateral R2 maxRank
A c 341 1 p1 506 1 3
A c 341 1 p2 366 2 3
A c 341 1 p3 263 3 3
A a 689 2 p1 506 1 3
A a 689 2 p2 366 2 3
A a 689 2 p3 263 3 3
A d 720 3 p1 506 1 3
A d 720 3 p2 366 2 3
A d 720 3 p3 263 3 3
A b 334 4 p1 506 1 3
A b 334 4 p2 366 2 3
A b 334 4 p3 263 3 3
A e 752 5 p1 506 1 3
A e 752 5 p2 366 2 3
A e 752 5 p3 263 3 3
B h 193 1 p5 529 1 2
B h 193 1 p4 414 2 2
B g 494 2 p5 529 1 2
B g 494 2 p4 414 2 2
B f 227 3 p5 529 1 2
B f 227 3 p4 414 2 2
B j 785 4 p5 529 1 2
B j 785 4 p4 414 2 2
B i 371 5 p5 529 1 2
B i 371 5 p4 414 2 2
B k 395 6 p5 529 1 2
B k 395 6 p4 414 2 2
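For reference, a minimal dplyr sketch of that merge-and-rank step (column names taken from the tables above; a sketch, not the asker's exact code):
library(dplyr)
df3 <- merge(df1, df2, by = "C_ID") %>%
  group_by(C_ID) %>%
  mutate(R1 = dense_rank(Loan_ID),           # earlier loan IDs rank first
         R2 = dense_rank(desc(Prot_value)),  # highest collateral ranks first
         maxRank = max(R2)) %>%              # number of collaterals per contract
  arrange(C_ID, R1, R2)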

merging multiple p-values from Fisher test to the original data

I have run a Fisher test on all my rows, which outputs a lot of p-values. How can I correctly attach the p-values to the original columns? I tried the following code, but the rows in the original data (d) do not match the p-values (e) in the merged data frame (f).
d <- read.table('test.txt', header = FALSE)
e <-apply(d,1, function(x) fisher.test(matrix(x,nr=2), alternative='greater')$p.value)
f <-merge(d,as.data.frame(e),by.x=0,by.y=0)
> d
V1 V2 V3 V4
1 1 839 63 222247
2 1 839 47 222263
3 1 839 299 222011
4 6 834 1821 220489
5 1 839 198 222112
6 1 839 324 221986
7 2 838 808 221502
8 3 837 935 221375
9 4 836 1723 220587
10 1 839 117 222193
> e
[1] 2.144749e-01 1.656028e-01 6.776690e-01 6.848409e-01 5.280300e-01 7.067099e-01 8.091576e-01 6.859446e-01
[9] 8.895988e-01 3.592658e-01
> f
Row.names V1 V2 V3 V4 e
1 1 1 839 63 222247 2.144749e-01
2 10 1 839 117 222193 3.592658e-01
3 11 6 834 850 221460 1.071752e-01
4 12 29 811 11625 210685 9.941101e-01
5 13 2 838 1231 221079 9.463472e-01
6 14 1 839 1236 221074 9.907043e-01
7 15 3 837 905 221405 6.647785e-01
8 16 3 837 793 221517 5.768163e-01
9 17 6 834 687 221623 4.906665e-02
10 18 1 839 226 222084 5.753710e-01
merge() with by.x = 0, by.y = 0 joins on row names, which are character, so the result comes back sorted lexicographically ("1", "10", "11", ...) and no longer lines up with d. Since apply() returns the p-values in the original row order, a plain cbind() keeps everything aligned:
f <- cbind(d, e)
# V1 V2 V3 V4 e
#1 1 839 63 222247 0.2144749
#2 1 839 47 222263 0.1656028
#3 1 839 299 222011 0.6776690
#4 6 834 1821 220489 0.6848409
#5 1 839 198 222112 0.5280300
#6 1 839 324 221986 0.7067099
#7 2 838 808 221502 0.8091576
#8 3 837 935 221375 0.6859446
#9 4 836 1723 220587 0.8895988
#10 1 839 117 222193 0.3592658
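If you do want to keep the merge() route, restoring numeric row order afterwards also works (a small sketch on the same objects):
f <- merge(d, as.data.frame(e), by.x = 0, by.y = 0)
f <- f[order(as.numeric(f$Row.names)), ]  # undo the lexicographic sort on row names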

How to export tibble to .csv

I did an RFM analysis using the package "rfm". The results are in a tibble and I can't seem to figure out how to export it to .csv. I tried the code below, but it exported a blank file.
> dim(bmdata4RFM)
[1] 1182580 3
> str(bmdata4RFM)
'data.frame': 1182580 obs. of 3 variables:
$ customer_ID: num 0 0 0 0 0 0 0 0 0 0 ...
$ sales_date : Factor w/ 366 levels "1/1/2018 0:00:00",..: 267 275 286 297 300 301 302 303 304 305 ...
$ sales : num 101541 110543 60932 75472 43588 ...
> head(bmdata4RFM,5)
customer_ID sales_date sales
1 0 6/30/2017 0:00:00 101540.70
2 0 7/1/2017 0:00:00 110543.35
3 0 7/2/2017 0:00:00 60932.20
4 0 7/3/2017 0:00:00 75471.93
5 0 7/4/2017 0:00:00 43587.70
> library(rfm)
> # convert date from factor to date format
> bmdata4RFM[,2] <- as.Date(as.character(bmdata4RFM[,2]), format = "%m/%d/%Y")
> rfm_result_v2
# A tibble: 535,868 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<dbl> <date> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
1 0 2018-06-30 12 366 42462470. 5 5 5 555
2 1 2018-06-30 12 20 2264. 5 5 5 555
3 2 2018-01-12 181 24 1689 3 5 5 355
4 3 2018-05-04 69 27 1984. 4 5 5 455
5 6 2017-12-07 217 12 922. 2 5 5 255
6 7 2018-01-15 178 19 1680. 3 5 5 355
7 9 2018-01-05 188 19 2106 2 5 5 255
8 20 2018-04-11 92 4 414. 4 5 5 455
9 26 2018-02-10 152 1 72 3 1 2 312
10 48 2017-12-20 204 1 90 2 1 3 213
11 68 2017-09-30 285 1 37 1 1 1 111
12 70 2017-12-17 207 1 18 2 1 1 211
13 104 2017-08-11 335 1 90 1 1 3 113
14 120 2017-07-27 350 1 19 1 1 1 111
15 134 2018-01-13 180 1 275 3 1 4 314
16 153 2018-06-24 18 10 1677 5 5 5 555
17 155 2018-05-28 45 1 315 5 1 4 514
18 171 2018-06-11 31 6 3485. 5 5 5 555
19 172 2018-05-24 49 1 93 5 1 3 513
20 174 2018-06-06 36 3 347. 5 4 5 545
# ... with 535,858 more rows
> write.csv(rfm_result_v2,"bmdataRFMFunction_output071218v2.csv")
The problem seems to be that the result of rfm_table_order() is not just a tibble. Looking at this already-solved question and using its data, you can see this:
> class(rfm_result)
[1] "rfm_table_order" "tibble" "data.frame"
So if, for example, you choose this:
> rfm_result$rfm
# A tibble: 325 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<int> <date> <dbl> <dbl> <int> <int> <int> <int> <dbl>
1 1 2017-08-06 353 1 145 4 1 2 412
2 2 2016-10-15 648 1 268 2 1 3 213
3 5 2016-12-14 588 1 119 3 1 1 311
4 7 2017-04-27 454 1 290 3 1 3 313
5 8 2016-12-07 595 3 835 2 5 5 255
6 10 2017-07-31 359 1 192 4 1 2 412
7 11 2017-08-16 343 1 278 4 1 3 413
8 12 2017-10-14 284 2 294 5 4 3 543
9 15 2016-07-12 743 1 206 2 1 2 212
10 17 2017-05-22 429 2 405 4 4 4 444
# ... with 315 more rows
You can export it with this command:
write.table(rfm_result$rfm , file = "your_path\\df.csv")
The OP asks for CSV output.
Being very picky, write.table(rfm_result$rfm , file = "your_path\\df.csv") writes a space-delimited file (write.table's default separator), not a CSV.
If you want a CSV, add the sep = "," parameter; you'll also likely want to omit the row names, so add row.names = FALSE as well.
write.table(rfm_result$rfm , file = "your_path\\df.csv", sep = ",", row.names = FALSE)
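Equivalently, base R's write.csv() hard-wires the comma separator for you (same assumed path as above):
write.csv(rfm_result$rfm, file = "your_path\\df.csv", row.names = FALSE)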

R, LME and Tukey Test produces error after sorting

Below you will find a simplified code snippet which reproduces the error I get:
rm(list = ls(all = TRUE))
library(nlme)      # for lme()
library(multcomp)  # for glht()
rawdata <- read.table(file = "test.csv", header = TRUE, sep = ",", dec = ".", stringsAsFactors = TRUE)
sets <- levels(as.factor(rawdata[, 'set']))
index <- rawdata[, 'set'] == sets[1]
testdata <- rawdata[index, ]
testdata$name <- as.factor(testdata$name)
testdata$name <- factor(testdata$name, as.character(testdata$name))
#### sorting step
sortindex <- with(testdata, order(name))
testdata <- testdata[sortindex, ]
####
testdata.lme <- lme(var ~ name, random = ~1 | sample, testdata, method = 'REML')
testdata.lme.tuk <- summary(glht(testdata.lme, linfct = mcp(name = 'Tukey')))
The error is the following:
Error in glht.matrix(model = list(modelStruct = list(reStruct = list(sample = -10.3240629947066)), :
‘ncol(linfct)’ is not equal to ‘length(coef(model))’
which is not produced if the sorting step is left out. However, I need the sorting in the larger code for other functions and data clean-up, and I also want the first variable to represent my control and therefore the intercept.
The same problem has come up in other posts before but was either not solved or turned out to be a data clean-up issue. Does anyone have an idea?
The data I used are listed below, but I think the error is reproducible as soon as a subset and a sorting step are included:
name var sample set
423 10.31 1 1
423 10.39 1 1
423 10.86 1 1
421 10.75 1 1
421 10.24 1 1
421 10.27 1 1
424 10.75 1 1
424 10.75 1 1
424 10.75 1 1
423 10.14 2 1
423 10.85 2 1
423 10.48 2 1
425 10.56 1 1
425 10.67 1 1
425 10.17 1 1
426 10.50 1 1
426 10.30 1 1
426 10.29 1 1
423 10.57 3 1
423 10.07 3 1
423 10.01 3 1
428 10.26 1 1
428 10.17 1 1
428 10.72 1 1
429 10.88 1 1
429 10.93 1 1
429 10.63 1 1
423 10.38 4 1
423 10.54 4 1
423 10.25 4 1
432 10.72 1 1
432 10.62 1 1
432 10.14 1 1
434 10.45 1 1
434 10.38 1 1
434 10.41 1 1
435 10.64 1 1
435 10.21 1 1
435 10.21 1 1
423 10.46 5 1
423 10.41 5 1
423 10.13 5 1
501 10.09 1 1
501 10.86 1 1
501 10.05 1 1
503 10.22 1 1
503 10.94 1 1
503 10.38 1 1
423 10.31 1 2
423 10.39 1 2
423 10.86 1 2
421 10.75 1 2
421 10.24 1 2
421 10.27 1 2
424 10.75 1 2
424 10.75 1 2
424 10.75 1 2
423 10.14 2 2
423 10.85 2 2
423 10.48 2 2
425 10.56 1 2
425 10.67 1 2
425 10.17 1 2
426 10.50 1 2
426 10.30 1 2
426 10.29 1 2
423 10.57 3 2
423 10.07 3 2
423 10.01 3 2
428 10.26 1 2
428 10.17 1 2
428 10.72 1 2
429 10.88 1 2
429 10.93 1 2
429 10.63 1 2
423 10.38 4 2
423 10.54 4 2
423 10.25 4 2
432 10.72 1 2
432 10.62 1 2
432 10.14 1 2
434 10.45 1 2
434 10.38 1 2
434 10.41 1 2
435 10.64 1 2
435 10.21 1 2
435 10.21 1 2
423 10.46 5 2
423 10.41 5 2
423 10.13 5 2
501 10.09 1 2
501 10.86 1 2
501 10.05 1 2
503 10.22 1 2
503 10.94 1 2
503 10.38 1 2
All you should have to do is apply the factor() function to your variable again after any subset or sorting step:
testdata$name <- factor(testdata$name)
This rebuilds the factor's levels from the values actually present, so they match the coefficients the fitted model estimates; glht() constructs its Tukey contrast matrix from the factor's levels, which is why a stale level set triggers the ncol(linfct) mismatch.
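A standalone illustration of the effect (toy data, not the question's):
x <- factor(c("a", "b", "c"))
y <- x[x != "b"]      # the subset still carries the unused level "b"
levels(y)             # "a" "b" "c"
levels(factor(y))     # "a" "c" -- re-applying factor() drops the stale level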

use rollapply and zoo to calculate rolling average of a column of variables

I want to calculate the rolling mean of wins for each value in column "sp". This is a sample of my data:
the_date sp wins
01-06--2012 1 305
02-06--2012 1 276
03-06--2012 1 184
04-06--2012 1 248
05-06--2012 1 243
06-06--2012 1 363
07-06--2012 1 272
01-06--2012 2 432
02-06--2012 2 369
03-06--2012 2 302
04-06--2012 2 347
05-06--2012 2 357
06-06--2012 2 331
07-06--2012 2 380
01-06--2012 3 1
02-06--2012 3 2
03-06--2012 3 3
04-06--2012 3 2
05-06--2012 3 0
06-06--2012 3 2
07-06--2012 3 0
What I want is a column added to the data that gives the moving average of wins over 3 days for each sp. The following output is what I desire:
the_date sp wins SMA_wins
01-06--2012 1 305 305.00
02-06--2012 1 276 290.50
03-06--2012 1 184 255.00
04-06--2012 1 248 236.00
05-06--2012 1 243 225.00
06-06--2012 1 363 284.67
07-06--2012 1 272 292.67
01-06--2012 2 432 432.00
02-06--2012 2 369 400.50
03-06--2012 2 302 367.67
04-06--2012 2 347 339.33
05-06--2012 2 357 335.33
06-06--2012 2 331 345.00
07-06--2012 2 380 356.00
01-06--2012 3 1 1.00
02-06--2012 3 2 1.50
03-06--2012 3 3 2.00
04-06--2012 3 2 2.33
05-06--2012 3 0 1.67
06-06--2012 3 2 1.33
07-06--2012 3 0 0.67
I am using rollapply.
df <- group_by(df, sp)
df_zoo <- zoo(df$wins, df$the_date)
mutate(df, SMA_wins=rollapplyr(df_zoo, 3, mean, align="right", partial=TRUE))
If I filter my data on a specific sp, it works perfectly.
How can I make this work when I group by sp?
Thanks
You can do it like this:
library(dplyr)
library(zoo)
df %>% group_by(sp) %>%
mutate(SMA_wins=rollapplyr(wins, 3, mean, partial=TRUE))
It looks like your use of df and df_zoo in the mutate() call was messing things up: referring to df$wins (or to df_zoo, built from the whole data frame) bypasses the grouping, so the window runs across sp boundaries. Inside a grouped mutate(), refer to the bare column wins instead.
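To see what partial = TRUE does at the start of each group (result in the comment):
library(zoo)
rollapplyr(c(305, 276, 184, 248), 3, mean, partial = TRUE)
# 305.0 290.5 255.0 236.0 -- windows shorter than 3 are averaged as-is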
