In a column of data I am trying to identify the minimum value in between a new high and the previous new high. In the example below I marked where the new highs are and what the minimum value is between them. What is the R formula to figure this out? In excel I would be able to do it using the match and max/min formula. I am not sure how to find the minimum value in a segment of a column in r.
data
0 New High
-80
-160
-160
-160
-160
-160
-347
-351
-351
-444
-444
-444
43 New High -444
43
10
10
-6
20
352 New High -6
352
352
528 New High 352
528
511
511
518
472
You can use cummax to calculate cumulative maximum until that row and create a new group whenever the current row's value is greater than previous cummax value. Within each group you can return the minimum value.
library(dplyr)
df %>%
group_by(group = cumsum(V1 > lag(cummax(V1), default = first(V1)))) %>%
summarise(min_value = min(V1))
# group min_value
# <int> <int>
#1 0 -444
#2 1 -6
#3 2 352
#4 3 472
This considers the last part as another group hence also returns minimum value in that part. You can remove the last row if it is not needed.
To apply for multiple columns, we can write a function and call it with lapply :
apply_fun <- function(data, col) {
col1 <- sym(col)
df %>%
group_by(group = cumsum(!!col1 > lag(cummax(!!col1),
default = first(!!col1)))) %>%
summarise(min_value = min(!!col1))
}
result <- lapply(names(df), apply_fun, data = df)
data
df <- structure(list(V1 = c(0L, -80L, -160L, -160L, -160L, -160L, -160L,
-347L, -351L, -351L, -444L, -444L, -444L, 43L, 43L, 10L, 10L,
-6L, 20L, 352L, 352L, 352L, 528L, 528L, 511L, 511L, 518L, 472L
)), class = "data.frame", row.names = c(NA, -28L))
Related
id timepoint dv.a
1 baseline 100
1 1min 105
1 2min 90
2 baseline 70
2 1min 100
2 2min 80
3 baseline 80
3 1min 80
3 2min 90
I have repeated measures data for a given subject in long format as above. I'm looking to calculate percent change relative to baseline for each subject.
id timepoint dv pct.chg
1 baseline 100 100
1 1min 105 105
1 2min 90 90
2 baseline 70 100
2 1min 100 143
2 2min 80 114
3 baseline 80 100
3 1min 80 100
3 2min 90 113
df <- expand.grid( time=c("baseline","1","2"), id=1:4)
df$dv <- sample(100,12)
df %>% group_by(id) %>%
mutate(perc=dv*100/dv[time=="baseline"]) %>%
ungroup()
You're wanting to do something for each 'id' group, so that's the group_by, then you need to create a new column, so there's a mutate. That new variable is the old dv, scaled by the value that dv takes at the baseline - hence the inner part of the mutate. And finally it's to remove the grouping you'd applied.
Try creating a helper column, group and arrange on that. Then use the window function first in your mutate function:
df %>% mutate(clean_timepoint = str_remove(timepoint,"min") %>% if_else(. == "baseline", "0", .) %>% as.numeric()) %>%
group_by(id) %>%
arrange(id,clean_timepoint) %>%
mutate(pct.chg = (dv / first(dv)) * 100) %>%
select(-clean_timepoint)
in Base Ryou can do this
for(i in 1:(NROW(df)/3)){
df[1+3*(i-1),4] <- 100
df[2+3*(i-1),4] <- df[2+3*(i-1),3]/df[1+3*(i-1),3]*100
df[3+3*(i-1),4] <- df[3+3*(i-1),3]/df[1+3*(i-1),3]*100
}
colnames(df)[4] <- "pct.chg"
output:
> df
id timepoint dv.a pct.chg
1 1 baseline 100 100.0000
2 1 1min 105 105.0000
3 1 2min 90 90.0000
4 2 baseline 70 100.0000
5 2 1min 100 142.8571
6 2 2min 80 114.2857
7 3 baseline 80 100.0000
8 3 1min 80 100.0000
9 3 2min 90 112.5000
Base R solution: (assuming "baseline" always appears as first record per group)
data.frame(do.call("rbind", lapply(split(df, df$id),
function(x){x$pct.change <- x$dv/x$dv[1]; return(x)})), row.names = NULL)
Data:
df <- structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
timepoint = c(
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min"
),
dv = c(100L, 105L, 90L, 70L, 100L, 80L, 80L, 80L, 90L)
),
class = "data.frame",
row.names = c(NA,-9L)
)
I have a tibble in R, where I want to change values in some columns with a condition based on a value of another column. So in the tibble df below, I want to multiply all values in the columns agr, man and ser where value in variable column is equal to va with 1000 and where value is equal to emp with 100 and replace the values in the respective columns with these calculated values. There must be a simple solution to it but I am at a loss.
df
country variable year agr man ser
chn va 1980 345 124 62
chn emp 1980 34 65 58
chn va 1981 345 243 670
ind emp 1980 54 34 40
ind va 1980 456 345 760
I have tried using ifelse, mutate_at and sweep functions but it does not work out.
Assuming that there would be also other value in 'variable' column, an option is to use case_when with mutate_at
library(dplyr)
df %>%
mutate_at(vars(agr:ser), ~ case_when(variable == 'va'~ . * 1000,
variable == 'emp' ~ .* 100, TRUE ~ as.numeric(.)))
data
df <- structure(list(country = c("chn", "chn", "chn", "ind", "ind"),
variable = c("va", "emp", "va", "emp", "va"), year = c(1980L,
1980L, 1981L, 1980L, 1980L), agr = c(345L, 34L, 345L, 54L,
456L), man = c(124L, 65L, 243L, 34L, 345L), ser = c(62L,
58L, 670L, 40L, 760L)), class = "data.frame", row.names = c(NA,
-5L))
I want to create a new variable on the data frame that uses a look up table. So I had df1 (dataframe) that has Amount and Term. And I need to create a new variable "Premium" that create its values using the look up table.
I tried the ifelse function but it's too tedious.
Below is an illustration/example
df1 <- data.frame(Amount, Term)
df1
# Amount Term
# 1 2500 23
# 2 3600 30
# 3 7000 45
# 4 12000 50
# 5 16000 38
And I need to create new variable the 'Premium' by using the Premium Lookup table below.
Term
Amount 0-24 Mos 25-36 Mos 37-48 Mos 49-60 Mos
0 - 5,000 133 163 175 186
5,001 - 10,000 191 213 229 249
10,001 - 15,000 229 252 275 306
15,001 - 20,000 600 615 625 719
20,001 - 25,000 635 645 675 786
So the output for premium should be.
df1
# Amount Term Premium
# 1 2500 23 133
# 2 3600 30 163
# 3 7000 45 229
# 4 12000 50 306
# 5 16000 38 625
Data
df1 <- structure(list(Amount = c(2500L, 3600L, 7000L, 12000L, 16000L),
Term = c(23L, 30L, 45L, 50L, 38L)),
class = "data.frame",
row.names = c(NA, -5L))
lkp <- structure(c(133L, 191L, 229L, 600L, 635L,
163L, 213L, 252L, 615L, 645L,
175L, 229L, 275L, 625L, 675L,
186L, 249L, 306L, 719L, 786L),
.Dim = 5:4,
.Dimnames = list(Amount = c("0 - 5,000", "5,001 - 10,000",
"10,001 - 15,000", "15,001 - 20,000",
"20,001 - 25,000"),
Term = c("0-24 Mos", "25-36 Mos", "37-48 Mos",
"49-60 Mos")))
Code
Create first the upper limits for month and amount using regular expressions from the column and row names (you did not post your data in a reproducible way, so this regex may need adaptation based on your real lookup table structure):
(month <- c(0, as.numeric(sub("\\d+-(\\d+) Mos$",
"\\1",
colnames(lkp)))))
# [1] 0 24 36 48 60
(amt <- c(0, as.numeric(sub("^\\d+,*\\d* - (\\d+),(\\d+)$",
"\\1\\2",
rownames(lkp)))))
# [1] 0 5000 10000 15000 20000 25000
Get the positions for each element of df1 using findInterval:
(rows <- findInterval(df1$Amount, amt))
# [1] 1 1 2 3 4
(cols <- findInterval(df1$Term, month))
# [1] 1 2 3 4 3
Use these indices to subset the lookup matrix:
df1$Premium <- lkp[cbind(rows, cols)]
df1
# Amount Term Premium
# 1 2500 23 133
# 2 3600 30 163
# 3 7000 45 229
# 4 12000 50 306
# 5 16000 38 625
To get to what you want you need to organise the table and categorise the data. I have provided a potential workflow to handle such situations. Hope this is helpful:
library(tidyverse)
df1 <- data.frame(
Amount = c(2500L, 3600L, 7000L, 12000L, 16000L),
Term = c(23L, 30L, 45L, 50L, 38L)
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# functions for analysis ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
amount_tier_function <- function(x){
case_when(x <= 5000 ~ "Tier_5000",
x <= 10000 ~ "Tier_10000",
x <= 15000 ~ "Tier_15000",
x <= 20000 ~ "Tier_20000",
TRUE ~ "Tier_25000")
}
month_tier_function <- function(x){
case_when(x <= 24 ~ "Tier_24",
x <= 36 ~ "Tier_36",
x <= 48 ~ "Tier_48",
TRUE ~ "Tier_60")
}
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Recut lookup table headings ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
lookup_df <- data.frame(stringsAsFactors=FALSE,
amount_tier = c("Tier_5000", "Tier_10000", "Tier_15000", "Tier_20000",
"Tier_25000"),
Tier_24 = c(133L, 191L, 229L, 600L, 635L),
Tier_36 = c(163L, 213L, 252L, 615L, 645L),
Tier_48 = c(175L, 229L, 275L, 625L, 675L),
Tier_60 = c(186L, 249L, 306L, 719L, 786L)
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Join everything together ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
lookup_df_tidy <- lookup_df %>%
gather(mth_tier, Premium, - amount_tier)
df1 %>%
mutate(amount_tier = amount_tier_function(Amount),
mth_tier = month_tier_function(Term)) %>%
left_join(., lookup_df_tidy) %>%
select(-amount_tier, -mth_tier)
I have a large data set like this :
ID Number
153 31
28
31
30
104 31
30
254 31
266 31
and I want to compute sum by ID include the NA. I mean get this :
ID Number
153 120
104 61
254 31
266 31
I tried aggregate but I dont get the expected result. Some help would be appreciated
One option is to convert the blanks to NA, then fill replace the NA elements with non-NA adjacent elements above in 'ID', grouped by 'ID', get the sum of 'Number'
library(tidyverse)
df1 %>%
mutate(ID = na_if(ID, "")) %>%
fill(ID) %>%
group_by(ID) %>%
summarise(Number = sum(Number))
# A tibble: 4 x 2
# ID Number
# <chr> <int>
#1 104 61
#2 153 120
#3 254 31
#4 266 31
Or without using fill, create a grouping variable with a logical expression and cumsum, and then do the sum
df1 %>%
group_by(grp = cumsum(ID != "")) %>%
summarise(ID = first(ID), Number = sum(Number)) %>%
select(-grp)
data
df1 <- structure(list(ID = c("153", "", "", "", "104", "", "254", "266"
), Number = c(31L, 28L, 31L, 30L, 31L, 30L, 31L, 31L)), row.names = c(NA,
-8L), class = "data.frame")
Or do it straightforwardly :) by
cbind(df1[df1$ID != "", "ID", drop = FALSE],
Number = rev(diff(c(0, rev((rev(cumsum(rev(df1$Number)))[df1$ID != ""]))))))
I have a dataframe as follows:
Jen Rptname freq
AKT bilb1 23
AKT bilb1 234
DFF bilb22 987
DFF bilf34 7
DFF jhs23 623
AKT j45 53
JFG jhs98 65
I know how to group the whole dataframe based on individual columns but how do I merge individual rows based on a grep (in this case bilb.* and jhs.*)
I want to be able to merge the rows (and therefore also add the frequencies together) with bilb* and separately the rows with jhs* so that I end up with
AKT bilb 257
DFF bilb 987
DFF bilf34 7
DFF jhs 623
AKT j45 53
JFG jhs 65
This is so that the aggregation is by Jen and Rptname so I can see how many of the same Rptnames are in each Jen
We can use grep to get the index of 'Rptname' elements that have 'bilb' or 'jhs', remove the numeric part with sub and use aggregate to get the sum of 'Freq' by 'Rptname'
indx <- grep('bilb|jhs', df1$Rptname)
df1$Rptname[indx] <- sub('\\d+', '', df1$Rptname[indx])
aggregate(freq~Rptname, df1, FUN=sum)
# Rptname freq
#1 bilb 1244
#2 bilf34 7
#3 j45 53
#4 jhs 688
Update
Suppose your dataset is 'df2'
df2$grp <- gsub("([A-Z]+|[a-z]+)[^A-Z]+", "\\1", df2$Rptname)
aggregate(freq~grp+Jen, df2, FUN=sum)
data
df1 <- structure(list(Rptname = c("bilb1", "bilb1", "bilb22",
"bilf34",
"jhs23", "j45", "jhs98"), freq = c(23L, 234L, 987L, 7L, 623L,
53L, 65L)), .Names = c("Rptname", "freq"), class = "data.frame",
row.names = c(NA, -7L))
df2 <- structure(list(Jen = c("AKT", "AKT", "AKT", "DFF", "DFF",
"DFF",
"DFF", "DFF", "DFF", "AKT", "JFG", "JFG", "JFG"), Rptname = c("bilb1",
"bilb1", "bilb22", "bilb22", "bilb1", "BTBy", "bilf34", "BTBx",
"jhs23", "j45", "jhs98", "BTBfd", "BTBx"), freq = c(23L, 234L,
22L, 987L, 18L, 18L, 7L, 9L, 623L, 53L, 65L, 19L, 14L)),
.Names = c("Jen",
"Rptname", "freq"), class = "data.frame", row.names = c(NA, -13L))
Similar to akrun's and I like his use of aggregate better than my creation of an intermediate vector:
> inter <- tapply(dat$freq, sub("^(bilb|jhs)(.+)$", "\\1", dat$Rptname) ,sum)
> final <- data.frame( nams = names(inter), sums = inter)
> final
nams sums
bilb bilb 1244
bilf34 bilf34 7
j45 j45 53
jhs jhs 688
My pattern would require that the 'bilb' amd 'jhs' be at the beginning of the value. Remove the "^" if that was not intended, but if so, add a "(.*)" and switch to "\\2" in the replacement.