Related
So, I'm having a headache finding a way to program this in R. Since it is very easy to do in Excel, I'm hoping this is just my n00b lack of knowledge.
Checking the example table I'm presenting: my objective is to create the last column (Balance).
Each TRA (101 and 102) has a number of IDAs (the order of all entries in that TRA, from 1 to last).
Balance in the 1st IDA is the total sum of the Principal. For each subsequent IDA, the Balance is reduced by the Principal amount, until the last Balance simply equals the last Principal.
In other words,
the Balance value of one row is the sum of the Principal value in that same row plus the Balance value of the next IDA row, continuing until the last row of each TRA.
So, for instance:
For TRA 101, we have four rows (IDA from 1 to 4). The Balance of the 1st row is the Principal of the 1st row plus the Balance of the 2nd (-4,799,471 + -14,398,412 = -19,197,882; the displayed values are rounded, hence the small off-by-one differences).
For the last IDA of each TRA (4 in 101, 9 in 102), I just need the value of the Principal.
We tried this option, but it isn't working when we have different Principal values throughout the TRA.
df %<>%
  group_by(TRA) %>%
  arrange(desc(IDA)) %>%
  mutate(saldo = cumsum(Principal)) %>%
  ungroup() %>%
  arrange(TRA)
Can someone point me to the best approach, please?
ROW TRA IDA IDB Principal Balance
1 101 1 1011 -4,799,471 -19,197,882
2 101 2 1012 -4,799,471 -14,398,412
3 101 3 1013 -4,799,471 -9,598,941
4 101 4 1014 -4,799,471 -4,799,471
5 102 1 1021 -5,248,583 -47,237,250
6 102 2 1022 -5,248,583 -41,988,667
7 102 3 1023 -5,248,583 -36,740,084
8 102 4 1024 -5,248,583 -31,491,500
9 102 5 1025 -5,248,583 -26,242,917
10 102 6 1026 -5,248,583 -20,994,334
11 102 7 1027 -5,248,583 -15,745,750
12 102 8 1028 -5,248,583 -10,497,167
13 102 9 1029 -5,248,584 -5,248,584
If your posted data is the data frame you're working with, you need to convert your Principal column to numeric, e.g.
df %>%
group_by(TRA) %>%
arrange(desc(IDA)) %>%
mutate(saldo = cumsum(as.numeric(gsub(",", "", Principal)))) %>%
ungroup() %>%
arrange(TRA)
# A tibble: 13 × 7
ROW TRA IDA IDB Principal Balance saldo
<int> <int> <int> <int> <chr> <chr> <dbl>
1 4 101 4 1014 -4,799,471 -4,799,471 -4799471
2 3 101 3 1013 -4,799,471 -9,598,941 -9598942
3 2 101 2 1012 -4,799,471 -14,398,412 -14398413
4 1 101 1 1011 -4,799,471 -19,197,882 -19197884
5 13 102 9 1029 -5,248,584 -5,248,584 -5248584
6 12 102 8 1028 -5,248,583 -10,497,167 -10497167
7 11 102 7 1027 -5,248,583 -15,745,750 -15745750
8 10 102 6 1026 -5,248,583 -20,994,334 -20994333
9 9 102 5 1025 -5,248,583 -26,242,917 -26242916
10 8 102 4 1024 -5,248,583 -31,491,500 -31491499
11 7 102 3 1023 -5,248,583 -36,740,084 -36740082
12 6 102 2 1022 -5,248,583 -41,988,667 -41988665
13 5 102 1 1021 -5,248,583 -47,237,250 -47237248
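If you also need the rows back in their original order, a small addition (assuming the ROW column reflects that order) is to finish with arrange(ROW):
df %>%
  group_by(TRA) %>%
  arrange(desc(IDA)) %>%
  mutate(saldo = cumsum(as.numeric(gsub(",", "", Principal)))) %>%
  ungroup() %>%
  arrange(ROW)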
It works fine, no? (Here read_table guesses the comma-grouped values as numbers, so Principal is parsed as a double and cumsum works directly.)
df <- read_table(
"ROW TRA IDA IDB Principal Balance
1 101 1 1011 -4,799,471 -19,197,882
2 101 2 1012 -4,799,471 -14,398,412
3 101 3 1013 -4,799,471 -9,598,941
4 101 4 1014 -4,799,471 -4,799,471
5 102 1 1021 -5,248,583 -47,237,250
6 102 2 1022 -5,248,583 -41,988,667
7 102 3 1023 -5,248,583 -36,740,084
8 102 4 1024 -5,248,583 -31,491,500
9 102 5 1025 -5,248,583 -26,242,917
10 102 6 1026 -5,248,583 -20,994,334
11 102 7 1027 -5,248,583 -15,745,750
12 102 8 1028 -5,248,583 -10,497,167
13 102 9 1029 -5,248,584 -5,248,584"
)
df %>%
group_by(TRA) %>%
arrange(TRA, desc(IDA)) %>%
mutate(saldo = cumsum(Principal)) %>%
ungroup()
# A tibble: 13 × 7
ROW TRA IDA IDB Principal Balance saldo
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 101 4 1014 -4799471 -4799471 -4799471
2 3 101 3 1013 -4799471 -9598941 -9598942
3 2 101 2 1012 -4799471 -14398412 -14398413
4 1 101 1 1011 -4799471 -19197882 -19197884
5 13 102 9 1029 -5248584 -5248584 -5248584
6 12 102 8 1028 -5248583 -10497167 -10497167
7 11 102 7 1027 -5248583 -15745750 -15745750
8 10 102 6 1026 -5248583 -20994334 -20994333
9 9 102 5 1025 -5248583 -26242917 -26242916
10 8 102 4 1024 -5248583 -31491500 -31491499
11 7 102 3 1023 -5248583 -36740084 -36740082
12 6 102 2 1022 -5248583 -41988667 -41988665
13 5 102 1 1021 -5248583 -47237250 -47237248
I have a longitudinal dataset in long form with around 2,800 rows and around 400 participants in total. Here's a sample of my data.
# ID wave score sex age edu
#1 1001 1 28 1 69 12
#2 1001 2 27 1 70 12
#3 1001 3 28 1 71 12
#4 1001 4 26 1 72 12
#5 1002 1 30 2 78 9
#6 1002 3 30 2 80 9
#7 1003 1 30 2 65 16
#8 1003 2 30 2 66 16
#9 1003 3 29 2 67 16
#10 1003 4 28 2 68 16
#11 1004 1 22 2 85 4
#12 1005 1 20 2 60 9
#13 1005 2 18 1 61 9
#14 1006 1 22 1 74 9
#15 1006 2 23 1 75 9
#16 1006 3 25 1 76 9
#17 1006 4 19 1 77 9
I want to create a new column "cutoff" with values "Normal" or "Impaired", because my outcome variable, "score", has a cutoff indicating impairment according to a norm. The norm consists of different -1.5SD measures (the cutoff points) according to Sex, Edu (years of education), and Age.
Below is what I'm currently doing: checking the Excel file myself and putting in the corresponding cutoff score according to the three conditions. First of all, I am not sure if I am creating the right column.
data$cutoff <- ifelse(data$sex == 1 & data$age < 70
                      & data$edu < 3
                      & data$score < 19.91, "Impaired", "Normal")
data$cutoff <- ifelse(data$sex == 2 & data$age < 70
                      & data$edu < 3
                      & data$score < 18.39, "Impaired", "Normal")
Additionally, I am wondering if I can import an Excel file stating the norm and create the column according to the values in it.
The Excel file is structured as shown below.
# Sex Male Female
#60-69 Edu(yr) 0-3 4-6 7-12 13>= 0-3 4-6 7-12 13>=
#Age Number 22 51 119 72 130 138 106 51
# Mean 24.45 26.6 27.06 27.83 23.31 25.86 27.26 28.09
# SD 3.03 1.89 1.8 1.53 3.28 2.55 1.85 1.44
# -1.5SD' 19.92 23.27 23.76 24.8 18.53 21.81 23.91 25.15
#70-79 Edu(yr) 0-3 4-6 7-12 13>= 0-3 4-6 7-12 13>=
....
I have created new columns "agecat" and "educat", allocating each ID to the age and education groups used in the norm. Now I want to make use of these columns by matching them against the rows and columns of the Excel file above. One motivation is to create code that can be reused in further research with this test's scores.
Your ifelse approach can work, but note that each call rebuilds the entire cutoff column, so the second assignment resets the rows flagged by the first back to "Normal"; you would need to nest the conditions (or build the column in one pass). Either way, I would definitely import the Excel file rather than hardcoding the values, though you may need to structure it a bit differently. I would structure it just like a dataset, with columns for Sex, Edu, Age, Mean, SD, -1.5SD, etc., import it into R, then do a left outer join on Sex+Edu+Age:
merge(x = long_df, y = norm_df, by = c("Sex", "Edu(yr)", "Age"), all.x = TRUE)
Then you can compare the columns directly.
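A minimal sketch of what that might look like, assuming the norm sheet has been reshaped to one row per group and joined on the category columns you created; the file name and the cutoff_1.5SD column name here are hypothetical:
library(readxl)

# hypothetical long-format norm table: one row per sex/age-band/edu-band group
norm_df <- read_excel("norm_long.xlsx")  # columns: sex, agecat, educat, Mean, SD, cutoff_1.5SD

# left outer join on the grouping columns, then compare score to the group's cutoff
merged <- merge(x = long_df, y = norm_df,
                by = c("sex", "agecat", "educat"), all.x = TRUE)
merged$cutoff <- ifelse(merged$score < merged$cutoff_1.5SD, "Impaired", "Normal")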
If I understand correctly, the OP wants to mark a certain type of outlier in their dataset. So, there are two tasks here:
Compute the statistics mean(score), sd(score), and the cutoff value mean(score) - 1.5 * sd(score) for each group of sex, age category agecat, and edu category educat.
Find all rows where score is lower than the cutoff value for the particular group.
As already mentioned by hannes101, the second step can be implemented by a non-equi join.
library(data.table)
# categorize age and edu (left closed intervals)
mydata[, c("agecat", "educat") := .(cut(age, c(seq(0, 90, 10), Inf), right = FALSE),
cut(edu, c(0, 4, 7, 13, Inf), right = FALSE))][]
# compute statistics
cutoffs <- mydata[, .(.N, Mean = mean(score), SD = sd(score),
m1.5SD = mean(score) - 1.5 * sd(score)),
by = .(sex, agecat, educat)]
# non-equi update join
mydata[, cutoff := "Normal"]
mydata[cutoffs, on = .(sex, agecat, educat, score < m1.5SD), cutoff := "Impaired"][]
mydata
ID wave score sex age edu agecat educat cutoff
1: 1001 1 28 1 69 12 [60,70) [7,13) Normal
2: 1001 2 27 1 70 12 [70,80) [7,13) Normal
3: 1001 3 28 1 71 12 [70,80) [7,13) Normal
4: 1001 4 26 1 72 12 [70,80) [7,13) Normal
5: 1002 1 30 2 78 9 [70,80) [7,13) Normal
6: 1002 3 30 2 80 9 [80,90) [7,13) Normal
7: 1003 1 33 2 65 16 [60,70) [13,Inf) Normal
8: 1003 2 32 2 66 16 [60,70) [13,Inf) Normal
9: 1003 3 31 2 67 16 [60,70) [13,Inf) Normal
10: 1003 4 24 2 68 16 [60,70) [13,Inf) Impaired
11: 1004 1 22 2 85 4 [80,90) [4,7) Normal
12: 1005 1 20 2 60 9 [60,70) [7,13) Normal
13: 1005 2 18 1 61 9 [60,70) [7,13) Normal
14: 1006 1 22 1 74 9 [70,80) [7,13) Normal
15: 1006 2 23 1 75 9 [70,80) [7,13) Normal
16: 1006 3 25 1 76 9 [70,80) [7,13) Normal
17: 1006 4 19 1 77 9 [70,80) [7,13) Normal
18: 1007 1 33 2 65 16 [60,70) [13,Inf) Normal
19: 1007 2 32 2 66 16 [60,70) [13,Inf) Normal
20: 1007 3 31 2 67 16 [60,70) [13,Inf) Normal
21: 1007 4 31 2 68 16 [60,70) [13,Inf) Normal
ID wave score sex age edu agecat educat cutoff
In this made-up example there is only one row which meets the "Impaired" conditions.
Likewise, the statistics table is rather sparsely populated; groups with a single observation have SD = NA, so no cutoff can be computed for them:
cutoffs
sex agecat educat N Mean SD m1.5SD
1: 1 [60,70) [7,13) 2 23.00000 7.071068 12.39340
2: 1 [70,80) [7,13) 7 24.28571 3.147183 19.56494
3: 2 [70,80) [7,13) 1 30.00000 NA NA
4: 2 [80,90) [7,13) 1 30.00000 NA NA
5: 2 [60,70) [13,Inf) 8 30.87500 2.900123 26.52482
6: 2 [80,90) [4,7) 1 22.00000 NA NA
7: 2 [60,70) [7,13) 1 20.00000 NA NA
Data
OP's sample dataset has been modified in one group for demonstration.
library(data.table)
mydata <- fread("
# ID wave score sex age edu
#1 1001 1 28 1 69 12
#2 1001 2 27 1 70 12
#3 1001 3 28 1 71 12
#4 1001 4 26 1 72 12
#5 1002 1 30 2 78 9
#6 1002 3 30 2 80 9
#7 1003 1 33 2 65 16
#8 1003 2 32 2 66 16
#9 1003 3 31 2 67 16
#10 1003 4 24 2 68 16
#11 1004 1 22 2 85 4
#12 1005 1 20 2 60 9
#13 1005 2 18 1 61 9
#14 1006 1 22 1 74 9
#15 1006 2 23 1 75 9
#16 1006 3 25 1 76 9
#17 1006 4 19 1 77 9
#18 1007 1 33 2 65 16
#19 1007 2 32 2 66 16
#20 1007 3 31 2 67 16
#21 1007 4 31 2 68 16
", drop = 1L)
I did an RFM analysis using the "rfm" package. The results are in a tibble and I can't seem to figure out how to export them to .csv. I tried the code below, but it exported a blank file.
> dim(bmdata4RFM)
[1] 1182580 3
> str(bmdata4RFM)
'data.frame': 1182580 obs. of 3 variables:
$ customer_ID: num 0 0 0 0 0 0 0 0 0 0 ...
$ sales_date : Factor w/ 366 levels "1/1/2018 0:00:00",..: 267 275 286 297 300 301 302 303 304 305 ...
$ sales : num 101541 110543 60932 75472 43588 ...
> head(bmdata4RFM,5)
customer_ID sales_date sales
1 0 6/30/2017 0:00:00 101540.70
2 0 7/1/2017 0:00:00 110543.35
3 0 7/2/2017 0:00:00 60932.20
4 0 7/3/2017 0:00:00 75471.93
5 0 7/4/2017 0:00:00 43587.70
> library(rfm)
> # convert date from factor to date format
> bmdata4RFM[,2] <- as.Date(as.character(bmdata4RFM[,2]), format = "%m/%d/%Y")
> rfm_result_v2
# A tibble: 535,868 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<dbl> <date> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
1 0 2018-06-30 12 366 42462470. 5 5 5 555
2 1 2018-06-30 12 20 2264. 5 5 5 555
3 2 2018-01-12 181 24 1689 3 5 5 355
4 3 2018-05-04 69 27 1984. 4 5 5 455
5 6 2017-12-07 217 12 922. 2 5 5 255
6 7 2018-01-15 178 19 1680. 3 5 5 355
7 9 2018-01-05 188 19 2106 2 5 5 255
8 20 2018-04-11 92 4 414. 4 5 5 455
9 26 2018-02-10 152 1 72 3 1 2 312
10 48 2017-12-20 204 1 90 2 1 3 213
11 68 2017-09-30 285 1 37 1 1 1 111
12 70 2017-12-17 207 1 18 2 1 1 211
13 104 2017-08-11 335 1 90 1 1 3 113
14 120 2017-07-27 350 1 19 1 1 1 111
15 134 2018-01-13 180 1 275 3 1 4 314
16 153 2018-06-24 18 10 1677 5 5 5 555
17 155 2018-05-28 45 1 315 5 1 4 514
18 171 2018-06-11 31 6 3485. 5 5 5 555
19 172 2018-05-24 49 1 93 5 1 3 513
20 174 2018-06-06 36 3 347. 5 4 5 545
# ... with 535,858 more rows
> write.csv(rfm_result_v2,"bmdataRFMFunction_output071218v2.csv")
The problem seems to be that the result of rfm_table_order is not only a tibble. Looking at an already-solved related question and using its data, you can see this:
> class(rfm_result)
[1] "rfm_table_order" "tibble" "data.frame"
So if, for example, you choose this:
> rfm_result$rfm
# A tibble: 325 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<int> <date> <dbl> <dbl> <int> <int> <int> <int> <dbl>
1 1 2017-08-06 353 1 145 4 1 2 412
2 2 2016-10-15 648 1 268 2 1 3 213
3 5 2016-12-14 588 1 119 3 1 1 311
4 7 2017-04-27 454 1 290 3 1 3 313
5 8 2016-12-07 595 3 835 2 5 5 255
6 10 2017-07-31 359 1 192 4 1 2 412
7 11 2017-08-16 343 1 278 4 1 3 413
8 12 2017-10-14 284 2 294 5 4 3 543
9 15 2016-07-12 743 1 206 2 1 2 212
10 17 2017-05-22 429 2 405 4 4 4 444
# ... with 315 more rows
You can export it with this command:
write.table(rfm_result$rfm, file = "your_path\\df.csv")
OP asks for a CSV output.
Being very picky, write.table(rfm_result$rfm, file = "your_path\\df.csv") writes a space-separated file (write.table's default separator), not a CSV.
If you want a CSV, add the sep = "," parameter; you'll likely also want to omit the row names, so use row.names = FALSE as well.
write.table(rfm_result$rfm, file = "your_path\\df.csv", sep = ",", row.names = FALSE)
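For completeness, write.csv hardwires the comma separator, so an equivalent call (same placeholder path) is:
write.csv(rfm_result$rfm, file = "your_path\\df.csv", row.names = FALSE)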
I have the following dataframe:
teamID X3M TR AS ST BK PTS FGP FTP
1 423 2884 1405 585 344 5797 0.4763141 0.7370821
2 467 2509 868 326 200 6159 0.4590164 0.7604167
3 769 1944 1446 614 168 6801 0.4248021 0.7825521
4 814 2457 1596 620 308 8058 0.4348856 0.8241445
5 356 2215 1153 403 243 4801 0.4427576 0.7478921
6 302 3360 1151 381 393 6271 0.4626974 0.6757176
7 384 2318 1070 431 269 5225 0.4345146 0.7460317
8 353 2529 1683 561 203 6150 0.4537273 0.7344740
9 598 2384 1635 497 162 6439 0.4512104 0.7998392
10 502 3191 1898 525 337 7107 0.4598565 0.7836970
I want to produce a dataframe like this:
teamID rank_X3M rank_TR rank_AS rank_ST rank_BK rank_PTS rank_FGP rank_FTP
1 5
2 6
3 9
4 10
5 3
6 1
7 4
8 2
9 8
10 7
I tried apply(-df[,c(2:9)], 1, rank, ties.method='min') and got this
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
X3M 4 4 5 4 4 6 4 4 5 6
TR 2 2 2 2 2 2 2 2 2 2
AS 3 3 3 3 3 3 3 3 3 3
ST 5 5 4 5 5 4 5 5 4 5
BK 6 6 6 6 6 5 6 6 6 4
PTS 1 1 1 1 1 1 1 1 1 1
FGP 8 8 8 8 8 8 8 8 8 8
FTP 7 7 7 7 7 7 7 7 7 7
Any suggestions about what to try next? Thanks!
Try sapply as shown below; you can relabel the columns afterwards (see the short follow-up after the output).
cl <- read.table(text="
teamID X3M TR AS ST BK PTS FGP FTP
1 423 2884 1405 585 344 5797 0.4763141 0.7370821
2 467 2509 868 326 200 6159 0.4590164 0.7604167
3 769 1944 1446 614 168 6801 0.4248021 0.7825521
4 814 2457 1596 620 308 8058 0.4348856 0.8241445
5 356 2215 1153 403 243 4801 0.4427576 0.7478921
6 302 3360 1151 381 393 6271 0.4626974 0.6757176
7 384 2318 1070 431 269 5225 0.4345146 0.7460317
8 353 2529 1683 561 203 6150 0.4537273 0.7344740
9 598 2384 1635 497 162 6439 0.4512104 0.7998392
10 502 3191 1898 525 337 7107 0.4598565 0.7836970", header=T)
new <- cbind(cl$teamID, sapply(cl[,c(2:9)], rank))
new
X3M TR AS ST BK PTS FGP FTP
[1,] 1 5 8 5 8 9 3 10 3
[2,] 2 6 6 1 1 3 5 7 6
[3,] 3 9 1 6 9 2 8 1 7
[4,] 4 10 5 7 10 7 10 3 10
[5,] 5 3 2 4 3 5 1 4 5
[6,] 6 1 10 3 2 10 6 9 1
[7,] 7 4 3 2 4 6 2 2 4
[8,] 8 2 7 9 7 4 4 6 2
[9,] 9 8 4 8 5 1 7 5 9
[10,] 10 7 9 10 6 8 9 8 8
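If you want the rank_* names from your desired output, one way (a small sketch continuing from new above) is to convert to a data frame and relabel:
# convert the matrix to a data frame and set the column names
new <- as.data.frame(new)
names(new) <- c("teamID", paste0("rank_", names(cl)[2:9]))
new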
I am not sure I am approaching this in the correct way, but what I am trying to do is split a data frame into groups based on the difference between consecutive values. For example, using the data below, I would like to split on the difference between values in the MIN column: if the difference is > 2, create a split. In the example below I would end up with 4 split sets of data.
MIN SEC PT CO2R CO2D PAR
58 10 5 375.7 -11.6 1002
58 20 5 375.4 -11.6 1001
58 33 5 375.2 -11.6 1001
58 43 5 375.2 -11.5 1000
58 54 5 375.3 -11.8 1000
2 0 5 375.5 -6.3 1001
2 8 5 375.3 -6 1000
2 21 5 375.2 -6.1 997
2 37 5 375.3 -6.2 993
2 51 5 375.4 -6.2 1003
5 20 5 376.3 -7.6 1000
5 35 5 376.1 -7.3 1000
5 52 5 375.9 -7.3 1000
6 5 5 376 -7.8 1000
6 23 5 376.1 -8 1002
10 2 5 376.3 -3.3 1003
10 14 5 376.3 -3.1 1003
10 27 5 376.5 -3.4 1003
10 41 5 376.7 -3.7 1006
10 55 5 376.8 -3.9 997
I have used the split function before when there was a unique element identifying each subset of the data, but there is nothing unique in this data set to split on. Perhaps this function is not what I need? Any hints appreciated!
Thanks,
You could use diff to find the differences between consecutive values and split to split the data frame. Assuming your data frame is called dat:
# create an index for differences > 2
idx <- c(0, cumsum(abs(diff(dat$MIN)) > 2))
# split the data frame
split(dat, idx)
The result (a list of 4 data frames):
$`0`
MIN SEC PT CO2R CO2D PAR
1 58 10 5 375.7 -11.6 1002
2 58 20 5 375.4 -11.6 1001
3 58 33 5 375.2 -11.6 1001
4 58 43 5 375.2 -11.5 1000
5 58 54 5 375.3 -11.8 1000
$`1`
MIN SEC PT CO2R CO2D PAR
6 2 0 5 375.5 -6.3 1001
7 2 8 5 375.3 -6.0 1000
8 2 21 5 375.2 -6.1 997
9 2 37 5 375.3 -6.2 993
10 2 51 5 375.4 -6.2 1003
$`2`
MIN SEC PT CO2R CO2D PAR
11 5 20 5 376.3 -7.6 1000
12 5 35 5 376.1 -7.3 1000
13 5 52 5 375.9 -7.3 1000
14 6 5 5 376.0 -7.8 1000
15 6 23 5 376.1 -8.0 1002
$`3`
MIN SEC PT CO2R CO2D PAR
16 10 2 5 376.3 -3.3 1003
17 10 14 5 376.3 -3.1 1003
18 10 27 5 376.5 -3.4 1003
19 10 41 5 376.7 -3.7 1006
20 10 55 5 376.8 -3.9 997
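Once you have the list, each group can be processed directly, for example with sapply (just an illustration, computing the mean CO2D per split):
# mean CO2D within each of the four groups
sapply(split(dat, idx), function(g) mean(g$CO2D))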