I want to create a new variable on a data frame using a lookup table. I have df1 (a data frame) with columns Amount and Term, and I need to create a new variable "Premium" whose values come from the lookup table.
I tried the ifelse function, but it's too tedious.
Below is an illustration/example
df1 <- data.frame(Amount, Term)
df1
# Amount Term
# 1 2500 23
# 2 3600 30
# 3 7000 45
# 4 12000 50
# 5 16000 38
And I need to create the new variable 'Premium' using the Premium lookup table below.
Term
Amount 0-24 Mos 25-36 Mos 37-48 Mos 49-60 Mos
0 - 5,000 133 163 175 186
5,001 - 10,000 191 213 229 249
10,001 - 15,000 229 252 275 306
15,001 - 20,000 600 615 625 719
20,001 - 25,000 635 645 675 786
So the output for Premium should be:
df1
# Amount Term Premium
# 1 2500 23 133
# 2 3600 30 163
# 3 7000 45 229
# 4 12000 50 306
# 5 16000 38 625
Data
df1 <- structure(list(Amount = c(2500L, 3600L, 7000L, 12000L, 16000L),
Term = c(23L, 30L, 45L, 50L, 38L)),
class = "data.frame",
row.names = c(NA, -5L))
lkp <- structure(c(133L, 191L, 229L, 600L, 635L,
163L, 213L, 252L, 615L, 645L,
175L, 229L, 275L, 625L, 675L,
186L, 249L, 306L, 719L, 786L),
.Dim = 5:4,
.Dimnames = list(Amount = c("0 - 5,000", "5,001 - 10,000",
"10,001 - 15,000", "15,001 - 20,000",
"20,001 - 25,000"),
Term = c("0-24 Mos", "25-36 Mos", "37-48 Mos",
"49-60 Mos")))
Code
First create the upper limits for month and amount, using regular expressions on the column and row names (you did not post your data in a reproducible way, so this regex may need adaptation to your real lookup table's structure):
(month <- c(0, as.numeric(sub("\\d+-(\\d+) Mos$",
"\\1",
colnames(lkp)))))
# [1] 0 24 36 48 60
(amt <- c(0, as.numeric(sub("^\\d+,*\\d* - (\\d+),(\\d+)$",
"\\1\\2",
rownames(lkp)))))
# [1] 0 5000 10000 15000 20000 25000
Get the positions for each element of df1 using findInterval:
(rows <- findInterval(df1$Amount, amt))
# [1] 1 1 2 3 4
(cols <- findInterval(df1$Term, month))
# [1] 1 2 3 4 3
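One caveat, hedged: findInterval treats the breakpoints as left-closed by default (x >= breaks[i]), so an Amount of exactly 5,000 would fall into the second bin even though the table's first row is "0 - 5,000". If the upper limits are meant to be inclusive, left.open = TRUE switches to (lo, hi] intervals:

```r
amt <- c(0, 5000, 10000, 15000, 20000, 25000)

# Default: intervals are [lo, hi), so the boundary value 5000 lands in bin 2
findInterval(5000, amt)
# [1] 2

# left.open = TRUE: intervals are (lo, hi], matching the "0 - 5,000" row
findInterval(5000, amt, left.open = TRUE)
# [1] 1
```

None of the Amounts in the posted df1 sit exactly on a boundary, so both variants give the same result there.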
Use these indices to subset the lookup matrix:
df1$Premium <- lkp[cbind(rows, cols)]
df1
# Amount Term Premium
# 1 2500 23 133
# 2 3600 30 163
# 3 7000 45 229
# 4 12000 50 306
# 5 16000 38 625
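The subsetting step relies on matrix indexing: when the index is a two-column matrix of (row, column) pairs, R extracts one element per pair rather than a sub-matrix. A minimal illustration on a toy matrix:

```r
m <- matrix(1:12, nrow = 3)

# One element per (row, column) pair: m[1, 1], m[2, 3], m[3, 4]
m[cbind(c(1, 2, 3), c(1, 3, 4))]
# [1]  1  8 12

# Contrast with ordinary subsetting, which returns a 3 x 3 sub-matrix
m[c(1, 2, 3), c(1, 3, 4)]
```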
To get what you want, you need to reorganise the lookup table and categorise the data. Below is a potential workflow for handling such situations; hope this is helpful:
library(tidyverse)
df1 <- data.frame(
  Amount = c(2500L, 3600L, 7000L, 12000L, 16000L),
  Term = c(23L, 30L, 45L, 50L, 38L)
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# functions for analysis ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
amount_tier_function <- function(x){
  case_when(x <= 5000 ~ "Tier_5000",
            x <= 10000 ~ "Tier_10000",
            x <= 15000 ~ "Tier_15000",
            x <= 20000 ~ "Tier_20000",
            TRUE ~ "Tier_25000")
}
month_tier_function <- function(x){
  case_when(x <= 24 ~ "Tier_24",
            x <= 36 ~ "Tier_36",
            x <= 48 ~ "Tier_48",
            TRUE ~ "Tier_60")
}
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Recut lookup table headings ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
lookup_df <- data.frame(stringsAsFactors = FALSE,
  amount_tier = c("Tier_5000", "Tier_10000", "Tier_15000",
                  "Tier_20000", "Tier_25000"),
  Tier_24 = c(133L, 191L, 229L, 600L, 635L),
  Tier_36 = c(163L, 213L, 252L, 615L, 645L),
  Tier_48 = c(175L, 229L, 275L, 625L, 675L),
  Tier_60 = c(186L, 249L, 306L, 719L, 786L)
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Join everything together ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
lookup_df_tidy <- lookup_df %>%
  gather(mth_tier, Premium, -amount_tier)

df1 %>%
  mutate(amount_tier = amount_tier_function(Amount),
         mth_tier = month_tier_function(Term)) %>%
  left_join(., lookup_df_tidy) %>%
  select(-amount_tier, -mth_tier)
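Note that gather() is superseded in current tidyr; the reshaping step can equivalently be written with pivot_longer(). A small sketch on a truncated version of the lookup table above:

```r
library(tidyr)

lookup_df <- data.frame(amount_tier = c("Tier_5000", "Tier_10000"),
                        Tier_24 = c(133L, 191L),
                        Tier_36 = c(163L, 213L))

# Equivalent of gather(mth_tier, Premium, -amount_tier)
lookup_df_tidy <- pivot_longer(lookup_df, -amount_tier,
                               names_to = "mth_tier",
                               values_to = "Premium")
lookup_df_tidy
# A 4-row tibble with columns amount_tier, mth_tier, Premium
```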
Related
I have a data frame where each row contains the counts of a contingency table, on which I would like to run chisq.test (for each row) in R. The output from each row should be added to the data frame as new columns (X-squared value, p-value).
DF1:
ID1 ID2 female_boxing female_cycling male_boxing male_cycling
A zit 43 170 159 710
B tag 37 134 165 744
C hfs 32 96 170 784
D prt 17 61 185 811
E its 31 112 169 762
F qrw 68 233 130 645
This is what I tried:
apply(DF1[,c('female_boxing','female_cycling','male_boxing','male_cycling')], 1, function(x) chisq.test(x) )
But this gives me only the summary table for each row.
You were close. Just inspect one single test with str, which helps you decide which elements to select.
apply(dat[,c('female_boxing','female_cycling','male_boxing','male_cycling')],
1, function(x) chisq.test(x)[c('statistic', 'p.value')] )
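For instance, inspecting one test run on the first row's counts shows the available components (a quick sketch; called this way, chisq.test is the goodness-of-fit test against equal probabilities):

```r
tst <- chisq.test(c(43, 170, 159, 710))
names(tst)
# [1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed"
# [7] "expected"  "residuals" "stdres"
```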
apply gives you a list; the results are a little nicer using sapply and looping over the rows:
chi <- t(sapply(seq(nrow(dat)), function(i)
chisq.test(dat[i, c('female_boxing','female_cycling','male_boxing','male_cycling')])[
c('statistic', 'p.value')]))
cbind(dat, chi)
# ID1 ID2 female_boxing female_cycling male_boxing male_cycling statistic p.value
# 1 A zit 43 170 159 710 988.7209 5.033879e-214
# 2 B tag 37 134 165 744 1142.541 2.146278e-247
# 3 C hfs 32 96 170 784 1334.991 3.762222e-289
# 4 D prt 17 61 185 811 1518.015 0
# 5 E its 31 112 169 762 1245.218 1.133143e-269
# 6 F qrw 68 233 130 645 752.3941 9.129485e-163
Data:
dat <- structure(list(ID1 = c("A", "B", "C", "D", "E", "F"), ID2 = c("zit",
"tag", "hfs", "prt", "its", "qrw"), female_boxing = c(43L, 37L,
32L, 17L, 31L, 68L), female_cycling = c(170L, 134L, 96L, 61L,
112L, 233L), male_boxing = c(159L, 165L, 170L, 185L, 169L, 130L
), male_cycling = c(710L, 744L, 784L, 811L, 762L, 645L)), class = "data.frame", row.names = c(NA,
-6L))
In a column of data I am trying to identify the minimum value between a new high and the previous new high. In the example below I marked where the new highs are and what the minimum value is between them. How do I work this out in R? In Excel I could do it with the MATCH and MAX/MIN formulas, but I am not sure how to find the minimum value in a segment of a column in R.
data
0 New High
-80
-160
-160
-160
-160
-160
-347
-351
-351
-444
-444
-444
43 New High -444
43
10
10
-6
20
352 New High -6
352
352
528 New High 352
528
511
511
518
472
You can use cummax to compute the cumulative maximum up to each row, and start a new group whenever the current row's value exceeds the previous row's cumulative maximum. Within each group, return the minimum value.
library(dplyr)
df %>%
  group_by(group = cumsum(V1 > lag(cummax(V1), default = first(V1)))) %>%
  summarise(min_value = min(V1))
# group min_value
# <int> <int>
#1 0 -444
#2 1 -6
#3 2 352
#4 3 472
This treats the trailing part (after the last new high) as another group, so it also returns the minimum of that part; remove the last row if it is not needed.
To apply this to multiple columns, we can write a function and call it with lapply:
apply_fun <- function(data, col) {
  col1 <- sym(col)
  data %>%
    group_by(group = cumsum(!!col1 > lag(cummax(!!col1),
                                         default = first(!!col1)))) %>%
    summarise(min_value = min(!!col1))
}
result <- lapply(names(df), apply_fun, data = df)
data
df <- structure(list(V1 = c(0L, -80L, -160L, -160L, -160L, -160L, -160L,
-347L, -351L, -351L, -444L, -444L, -444L, 43L, 43L, 10L, 10L,
-6L, 20L, 352L, 352L, 352L, 528L, 528L, 511L, 511L, 518L, 472L
)), class = "data.frame", row.names = c(NA, -28L))
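For reference, the same grouping can be sketched in base R without dplyr: lag the cumulative maximum by one row, flag rows that exceed it, and take group-wise minima with tapply. The sketch below uses an abbreviated version of the series:

```r
V1 <- c(0, -80, -160, -347, 43, 10, -6, 20, 352, 528, 511, 472)

lagged_cummax <- c(V1[1], head(cummax(V1), -1))  # cummax of all previous rows
grp <- cumsum(V1 > lagged_cummax)                # increments at each new high
tapply(V1, grp, min)
#    0    1    2    3
# -347   -6  352  472
```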
I have data that looks like this:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 19 obs. of 7 variables:
$ Week Ending : chr "5/1/18" "5/1/18" "5/1/18" "5/1/18" ...
$ Agent : chr "telbenja ." "Tomsaint ." "davidlor ." "moniquec
." ...
$ Inbound : int 25 62 44 36 1 22 144 36 28 51 ...
$ Manual : int 0 3 4 22 0 0 13 6 2 1 ...
$ Avg Talk Time: 'hms' num 00:03:29 00:03:20 00:03:51 00:02:37 ...
..- attr(*, "units")= chr "secs"
$ Avg Wrap Time: 'hms' num 00:01:57 00:01:13 00:01:31 00:01:24 ...
..- attr(*, "units")= chr "secs"
$ Avg Hold Time: 'hms' num 00:00:11 00:00:02 00:00:02 00:00:00
This is just a sample, I have about 100,000 rows.
Ultimately, what I need is an 'Agent' called 'Average' whose values in all other columns are the averages of all other rows within the same 'Week Ending' (date).
I believe the solution is some sort of group_by and summarize dplyr sorcery, but I can't seem to make it return row values: group_by and summarize would give me a brand-new summary table, whereas I need a new row for each date ('Week Ending') containing the mean of each column for that date.
Any help with this is greatly appreciated.
Result of dput(head(response_codes)):
structure(list(`Response Code` = structure(c(105L, 72L, 79L,
159L, 104L, 17L), .Label = c("304001", "312001", "799007", "843001",
"951001", "1490001", "1490002", "1524002", "1524003", "1620001",
"1696001", "2297001", "2299001", "2302001", "2305001", "2312001",
"2314001", "2315001", "2316001", "2317001", "2327001", "2328001",
"2329001", "2330001", "2333001", "2374001", "2380002", "2415001",
"2420001", "2428001", "2428004", "2428005", "2428006", "2434001",
"2435002", "2444002", "2449002", "2457002", "2457003", "2462001",
"2463001", "2463002", "2478001", "2586010", "2673002", "2677001",
"2678002", "2682001", "2683002", "2835005", "2938001", "2950001",
"2974001", "3006001", "3006002", "3007001", "3046001", "3077003",
"3091001", "3093001", "3093010", "3094003", "3115001", "3115006",
"3115010", "3116001", "3116003", "3117001", "3117002", "3148001",
"3214001", "3239001", "3244001", "3245001", "3245002", "3245003",
"3262001", "3262002", "3273001", "3276001", "3276002", "3276003",
"3276005", "3276006", "3276012", "3276013", "3276017", "3276019",
"3276020", "3276021", "3276023", "3276030", "3276036", "3276037",
"3276038", "3276039", "3276043", "3276044", "3276045", "3276048",
"3276050", "3289001", "3330001", "3334001", "3334002", "3347001",
"3348001", "3361001", "3382001", "3383001", "3393001", "3394001",
"3394002", "3399001", "3403005", "3486003", "3488003", "3491001",
"3558001", "3584001", "3585002", "3586001", "3588001", "3591001",
"3677002", "3677003", "3678001", "3678002", "3691003", "3691004",
"3691005", "3691006", "3691009", "3691010", "3691014", "3692001",
"3693002", "3694002", "3695002", "3741001", "3743001", "3753001",
"3753002", "3755001", "3762001", "3765001", "3766001", "3767001",
"3767002", "3768001", "3769001", "3771001", "3772001", "3792001",
"3795001", "3797001", "3799001", "3800001", "3810001", "7014001",
"7371007", "7445001", "9007001", "9009001"), class = "factor"),
`Total Recruits` = c(518L, 467L, 345L, 335L, 333L, 224L),
`Number of 2nd Purchase (Converts)` = c(217L, 248L, 181L,
106L, 218L, 150L), `Total Cms that took a wp on or after their recruitment case` = c(187L,
169L, 142L, 104L, 361L, 233L), `Currently Closed Wine Plans` = c(135L,
130L, 108L, 79L, 295L, 188L), `Currently Active Wine Plans` = c(52L,
39L, 34L, 25L, 66L, 45L), `Upgrade to WP %` = c(36.1, 36.19,
41.16, 31.04, 108.41, 104.02), `2nd Purchase Conversion Rate` = c(41.89,
53.1, 52.46, 31.64, 65.47, 66.96), `Number of Conti Cases Purchased` = c(232L,
208L, 171L, 108L, 449L, 353L), `Number of Distinct WP Customers` = c(94L,
101L, 84L, 51L, 193L, 141L)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
library(dplyr)
# 1. Adding fake week and Agent
response_codes <- response_codes %>%
  mutate(fake_week = rep(1:3, each = 2),
         Agent = letters[1:6])
# 2. Make summary by week
summarized <- response_codes %>%
  group_by(fake_week) %>%
  summarise_if(is.numeric, mean) %>%
  mutate(Agent = "Average")
# 3. Combine
combo <- bind_rows(response_codes, summarized)
Output
# Just showing columns 1-3 and 10-12:
> combo[, c(1:3,10:12)]
# A tibble: 9 x 6
`Response Code` `Total Recruits` `Number of 2nd Purchase (Converts)` `Number of Distinct WP Customers` fake_week Agent
<fct> <dbl> <dbl> <dbl> <int> <chr>
1 3334002 518 217 94 1 a
2 3239001 467 248 101 1 b
3 3273001 345 181 84 2 c
4 3810001 335 106 51 2 d
5 3334001 333 218 193 3 e
6 2314001 224 150 141 3 f
7 NA 492. 232. 97.5 1 Average
8 NA 340 144. 67.5 2 Average
9 NA 278. 184 167 3 Average
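As an aside, summarise_if() is superseded in current dplyr; the same per-week means can be written with across(). A self-contained sketch on a toy data frame (the column names here are illustrative, not the question's real ones):

```r
library(dplyr)

dat <- data.frame(week = c(1, 1, 2, 2),
                  agent = c("a", "b", "c", "d"),
                  inbound = c(25, 62, 44, 36))

averages <- dat %>%
  group_by(week) %>%
  summarise(across(where(is.numeric), mean)) %>%  # mean of every numeric column
  mutate(agent = "Average")

bind_rows(dat, averages)
# Original four rows plus one "Average" row per week
```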
I have a dataframe as follows:
Jen Rptname freq
AKT bilb1 23
AKT bilb1 234
DFF bilb22 987
DFF bilf34 7
DFF jhs23 623
AKT j45 53
JFG jhs98 65
I know how to group the whole data frame by individual columns, but how do I merge individual rows based on a regex match (in this case bilb.* and jhs.*)?
I want to merge the rows (adding their frequencies together) that match bilb*, and separately the rows that match jhs*, so that I end up with:
AKT bilb 257
DFF bilb 987
DFF bilf34 7
DFF jhs 623
AKT j45 53
JFG jhs 65
This is so that the aggregation is by Jen and Rptname, letting me see how many of the same Rptnames are in each Jen.
We can use grep to get the indices of the 'Rptname' elements that contain 'bilb' or 'jhs', remove the numeric part with sub, and use aggregate to get the sum of 'freq' by 'Rptname':
indx <- grep('bilb|jhs', df1$Rptname)
df1$Rptname[indx] <- sub('\\d+', '', df1$Rptname[indx])
aggregate(freq~Rptname, df1, FUN=sum)
# Rptname freq
#1 bilb 1244
#2 bilf34 7
#3 j45 53
#4 jhs 688
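If you also want to keep the Jen column, matching the expected output in the question, the same aggregate call can group by both variables. A self-contained sketch using the question's table:

```r
df1 <- data.frame(Jen = c("AKT", "AKT", "DFF", "DFF", "DFF", "AKT", "JFG"),
                  Rptname = c("bilb1", "bilb1", "bilb22", "bilf34",
                              "jhs23", "j45", "jhs98"),
                  freq = c(23L, 234L, 987L, 7L, 623L, 53L, 65L))

# Same relabelling as above, then aggregate by both grouping columns
indx <- grep('bilb|jhs', df1$Rptname)
df1$Rptname[indx] <- sub('\\d+', '', df1$Rptname[indx])

aggregate(freq ~ Jen + Rptname, df1, FUN = sum)
#   Jen Rptname freq
# 1 AKT    bilb  257
# 2 DFF    bilb  987
# 3 DFF  bilf34    7
# 4 AKT     j45   53
# 5 DFF     jhs  623
# 6 JFG     jhs   65
```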
Update
Suppose your dataset is 'df2'
df2$grp <- gsub("([A-Z]+|[a-z]+)[^A-Z]+", "\\1", df2$Rptname)
aggregate(freq~grp+Jen, df2, FUN=sum)
data
df1 <- structure(list(Rptname = c("bilb1", "bilb1", "bilb22",
"bilf34",
"jhs23", "j45", "jhs98"), freq = c(23L, 234L, 987L, 7L, 623L,
53L, 65L)), .Names = c("Rptname", "freq"), class = "data.frame",
row.names = c(NA, -7L))
df2 <- structure(list(Jen = c("AKT", "AKT", "AKT", "DFF", "DFF",
"DFF",
"DFF", "DFF", "DFF", "AKT", "JFG", "JFG", "JFG"), Rptname = c("bilb1",
"bilb1", "bilb22", "bilb22", "bilb1", "BTBy", "bilf34", "BTBx",
"jhs23", "j45", "jhs98", "BTBfd", "BTBx"), freq = c(23L, 234L,
22L, 987L, 18L, 18L, 7L, 9L, 623L, 53L, 65L, 19L, 14L)),
.Names = c("Jen",
"Rptname", "freq"), class = "data.frame", row.names = c(NA, -13L))
Similar to akrun's and I like his use of aggregate better than my creation of an intermediate vector:
> inter <- tapply(dat$freq, sub("^(bilb|jhs)(.+)$", "\\1", dat$Rptname) ,sum)
> final <- data.frame( nams = names(inter), sums = inter)
> final
nams sums
bilb bilb 1244
bilf34 bilf34 7
j45 j45 53
jhs jhs 688
My pattern requires that 'bilb' and 'jhs' be at the beginning of the value. Remove the "^" if that was not intended; but if so, also add a "(.*)" and switch to "\\2" in the replacement.
I have two time series (zoo) objects and a data frame
z1
z1 <- structure(c(400L, 125L, 125L, 125L, 120L,400L, 125L, 125L, 125L, 120L,400L, 125L, 125L, 125L, 120L
,400L, 125L, 125L, 125L, 120L), .Dim = c(5L, 4L), .Dimnames = list(NULL, c("T1", "T2", "T3", "T6"
)), index = structure(c(15723, 15725, 15726, 15727, 15728), class = "Date"),
class = "zoo")
T1 T2 T3 T6
2013-01-18 400 400 400 400
2013-01-20 125 125 125 125
2013-01-21 125 125 125 125
2013-01-22 125 125 125 125
2013-01-23 120 120 120 120
z2
z2 <- structure(c(40L, 12L, 25L, 15L, 10L,40L, 25L, 15L, 123L, 190L,150L, 115L, 155L, 105L, 80L
,40L, 425L, 225L, 115L, 20L), .Dim = c(5L, 4L), .Dimnames = list(NULL, c("T1", "T2", "T3", "T6"
)), index = structure(c(15723, 15725, 15726, 15727, 15728), class = "Date"),
class = "zoo")
T1 T2 T3 T6
2013-01-18 40 40 150 40
2013-01-20 12 25 115 425
2013-01-21 25 15 155 225
2013-01-22 15 123 105 115
2013-01-23 10 190 80 20
df
l <- "Name, DOB, TypeOfApply, House
T1, 2008-12-16, sync,44
T2, 2008-12-15, sync,54
T3, 2008-12-19, async,34
T4, 2008-12-18, async,84
T5, 2008-12-11, sync,94"
df <- read.csv(text = l)
I want to apply a formula (a function I created, calc) on the condition that TypeOfApply == "sync". z1 and z2 will always have the same number of rows and columns.
calc(z1,z2,df$DOB-2013-01-18,df$House)
T1 T2 T3 T6
2013-01-18 calc(400,40,((2008-12-16)-(2013-01-18)),44) calc(400,40,((2008-12-15)-(2013-01-18)),54) 400 400
2013-01-20 calc(125,12,((2008-12-16)-(2013-01-20)),44) calc(400,25,((2008-12-15)-(2013-01-20)),54) 125 125
2013-01-21 calc(125,25,((2008-12-16)-(2013-01-21)),44) calc(400,15,((2008-12-15)-(2013-01-21)),54) 125 125
2013-01-22 calc(125,15,((2008-12-16)-(2013-01-22)),44) calc(400,123,((2008-12-15)-(2013-01-22)),54) 125 125
2013-01-23 calc(120,10,((2008-12-16)-(2013-01-23)),44) calc(400,190,((2008-12-15)-(2013-01-23)),54) 120 120
So, in this code the formula is applied to T1 and T2, but not to the others:
T3 - TypeOfApply is async
T5 - does not exist in z1 and z2
T6 - does not exist in df
Update
Sequence of names in df may be different. So it may be like T2, T1, T3, T5, T4
A sample calc function:
calc <- function(x, y, z, v)
{
  val <- x + y + (z/365) + v
  return(val)
}
Here, I am using str_trim because there are leading/trailing spaces in the "df" columns. Convert the factor column "DOB" to "Date" class, and create an "indx" from the condition that the "TypeOfApply" element is "sync" and the corresponding "Name" element is present in the column names of "z1". This "indx" is used for subsetting "df", "z1", and "z2". Then use Map to pass the corresponding columns of "z1" and "z2", and the elements of "df1$DOB" and "df1$House", as inputs to the calc function.
library(stringr)
indx <- intersect(with(df,str_trim(Name[str_trim(TypeOfApply)=='sync'])),
colnames(z1))
df1 <- df[str_trim(as.character(df$Name)) %in% indx,c(2,4)]
df1$DOB <- as.Date(str_trim(df1$DOB))
Map(function(u, v, x, y) calc(u, v, x - as.Date('2013-01-18'), y),
    as.data.frame(z1[,indx]), as.data.frame(z2[,indx]), df1$DOB, df1$House)
Update
Using the calc function from OP's post
z3 <- z1[,indx]
index <- as.Date('2013-01-18')
z3[] <- mapply(calc, as.data.frame(z1[,indx]),
as.data.frame(z2[,indx]), df1$DOB-index, df1$House)
z3
# T1 T2
#2013-01-18 479.9068 489.9041
#2013-01-20 176.9068 199.9041
#2013-01-21 189.9068 189.9041
#2013-01-22 179.9068 297.9041
#2013-01-23 169.9068 359.9041
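This works because calc uses only vectorized arithmetic, so mapply processes each column element-wise. A minimal check with z = 0:

```r
calc <- function(x, y, z, v)
{
  val <- x + y + (z/365) + v
  return(val)
}

# Two elements in, two elements out: 400+40+0+44 and 125+12+0+44
calc(c(400, 125), c(40, 12), 0, 44)
# [1] 484 181
```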
Suppose, if I change the order of "df" rows
set.seed(24)
df <- df[sample(1:nrow(df)),]
Then, the "Map" list elements will be in the same order as "indx", for example,
indx
#[1] "T2" "T1"
df1
# DOB House
#2 2008-12-15 54
#1 2008-12-16 44
Map(function(u,v,x,y) u, as.data.frame(z1[,indx]),
as.data.frame(z2[,indx]), df1$DOB, df1$House)
#$T2
#[1] 400 125 125 125 120
#$T1
#[1] 400 125 125 125 120