How to merge specific rows that match a grep pattern

I have a dataframe as follows:
Jen Rptname freq
AKT bilb1 23
AKT bilb1 234
DFF bilb22 987
DFF bilf34 7
DFF jhs23 623
AKT j45 53
JFG jhs98 65
I know how to group the whole dataframe based on individual columns, but how do I merge individual rows based on a grep pattern (in this case bilb.* and jhs.*)?
I want to merge the rows with bilb* (adding their frequencies together) and, separately, the rows with jhs*, so that I end up with:
AKT bilb 257
DFF bilb 987
DFF bilf34 7
DFF jhs 623
AKT j45 53
JFG jhs 65
This is so that the aggregation is by Jen and Rptname so I can see how many of the same Rptnames are in each Jen

We can use grep to get the index of the 'Rptname' elements that contain 'bilb' or 'jhs', remove the numeric part with sub, and use aggregate to get the sum of 'freq' by 'Rptname':
indx <- grep('bilb|jhs', df1$Rptname)
df1$Rptname[indx] <- sub('\\d+', '', df1$Rptname[indx])
aggregate(freq~Rptname, df1, FUN=sum)
# Rptname freq
#1 bilb 1244
#2 bilf34 7
#3 j45 53
#4 jhs 688
Update
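Suppose your dataset is 'df2'. To aggregate by both 'Jen' and the letter prefix of 'Rptname' (the pattern below strips everything after the leading run of letters, so 'bilf34' becomes 'bilf' and 'BTBfd' becomes 'BTB'):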
df2$grp <- gsub("([A-Z]+|[a-z]+)[^A-Z]+", "\\1", df2$Rptname)
aggregate(freq~grp+Jen, df2, FUN=sum)
data
df1 <- structure(list(Rptname = c("bilb1", "bilb1", "bilb22",
"bilf34",
"jhs23", "j45", "jhs98"), freq = c(23L, 234L, 987L, 7L, 623L,
53L, 65L)), .Names = c("Rptname", "freq"), class = "data.frame",
row.names = c(NA, -7L))
df2 <- structure(list(Jen = c("AKT", "AKT", "AKT", "DFF", "DFF",
"DFF",
"DFF", "DFF", "DFF", "AKT", "JFG", "JFG", "JFG"), Rptname = c("bilb1",
"bilb1", "bilb22", "bilb22", "bilb1", "BTBy", "bilf34", "BTBx",
"jhs23", "j45", "jhs98", "BTBfd", "BTBx"), freq = c(23L, 234L,
22L, 987L, 18L, 18L, 7L, 9L, 623L, 53L, 65L, 19L, 14L)),
.Names = c("Jen",
"Rptname", "freq"), class = "data.frame", row.names = c(NA, -13L))

Similar to akrun's, and I like his use of aggregate better than my creation of an intermediate vector (here dat is the question's data frame):
> inter <- tapply(dat$freq, sub("^(bilb|jhs)(.+)$", "\\1", dat$Rptname) ,sum)
> final <- data.frame( nams = names(inter), sums = inter)
> final
nams sums
bilb bilb 1244
bilf34 bilf34 7
j45 j45 53
jhs jhs 688
My pattern requires 'bilb' and 'jhs' to be at the beginning of the value. If that was not intended, remove the "^", add a leading "(.*)" and switch to "\\2" in the replacement.
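
Combining the two answers above, a minimal sketch that collapses only the bilb and jhs names and then sums by both Jen and the collapsed name might look like this. The data frame dat is rebuilt here from the question's example, and the column name grp is an illustrative assumption, not something from the original posts.

```r
# Question's example data, rebuilt by hand (Jen, Rptname, freq)
dat <- data.frame(
  Jen     = c("AKT", "AKT", "DFF", "DFF", "DFF", "AKT", "JFG"),
  Rptname = c("bilb1", "bilb1", "bilb22", "bilf34", "jhs23", "j45", "jhs98"),
  freq    = c(23L, 234L, 987L, 7L, 623L, 53L, 65L),
  stringsAsFactors = FALSE
)

# Collapse names that start with 'bilb' or 'jhs' to that prefix;
# every other name (bilf34, j45) is left untouched
dat$grp <- sub("^(bilb|jhs).*$", "\\1", dat$Rptname)

# Sum freq within each Jen / collapsed-name combination
res <- aggregate(freq ~ Jen + grp, data = dat, FUN = sum)
res
```

Because the pattern is anchored and lists the prefixes explicitly, names such as bilf34 and j45 keep their digits, which is what the desired output in the question shows.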

Related

How to find the minimum value in between two values using R

In a column of data I am trying to identify the minimum value between a new high and the previous new high. In the example below I marked where the new highs are and what the minimum value is between them. What is the R formula to figure this out? In Excel I would be able to do it using the match and max/min formulas. I am not sure how to find the minimum value in a segment of a column in R.
data
0 New High
-80
-160
-160
-160
-160
-160
-347
-351
-351
-444
-444
-444
43 New High -444
43
10
10
-6
20
352 New High -6
352
352
528 New High 352
528
511
511
518
472

You can use cummax to calculate cumulative maximum until that row and create a new group whenever the current row's value is greater than previous cummax value. Within each group you can return the minimum value.
library(dplyr)
df %>%
group_by(group = cumsum(V1 > lag(cummax(V1), default = first(V1)))) %>%
summarise(min_value = min(V1))
# group min_value
# <int> <int>
#1 0 -444
#2 1 -6
#3 2 352
#4 3 472
This considers the last part as another group hence also returns minimum value in that part. You can remove the last row if it is not needed.
To apply this to multiple columns, we can write a function and call it with lapply:
# 'data' is the data frame to summarise, 'col' the column name as a string
apply_fun <- function(data, col) {
  col1 <- sym(col)
  data %>%
    group_by(group = cumsum(!!col1 > lag(cummax(!!col1),
                                         default = first(!!col1)))) %>%
    summarise(min_value = min(!!col1))
}
result <- lapply(names(df), apply_fun, data = df)
data
df <- structure(list(V1 = c(0L, -80L, -160L, -160L, -160L, -160L, -160L,
-347L, -351L, -351L, -444L, -444L, -444L, 43L, 43L, 10L, 10L,
-6L, 20L, 352L, 352L, 352L, 528L, 528L, 511L, 511L, 518L, 472L
)), class = "data.frame", row.names = c(NA, -28L))
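
For readers who prefer to avoid the dplyr dependency, the same cummax grouping idea can be sketched in base R. This is only an illustrative equivalent under the assumption that the values sit in a plain vector; the names prev_max, grp and min_by_group are made up for this example.

```r
V1 <- c(0L, -80L, -160L, -160L, -160L, -160L, -160L, -347L, -351L, -351L,
        -444L, -444L, -444L, 43L, 43L, 10L, 10L, -6L, 20L, 352L, 352L,
        352L, 528L, 528L, 511L, 511L, 518L, 472L)

# Running maximum seen before each row; the first row is compared with itself
prev_max <- c(V1[1], head(cummax(V1), -1))

# A new group starts whenever the value exceeds the previous running maximum
grp <- cumsum(V1 > prev_max)

# Minimum within each group (group 0 is everything before the first new high)
min_by_group <- tapply(V1, grp, min)
min_by_group
```

As with the dplyr version, the last group covers the values after the final new high, so its minimum can be dropped if it is not needed.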

How to replace values in specific rows of some columns in R tibble with transformed values conditional on row values?

I have a tibble in R in which I want to change the values in some columns based on the value of another column. In the tibble df below, I want to multiply all values in the columns agr, man and ser by 1000 where variable equals va and by 100 where variable equals emp, replacing the original values with the results. There must be a simple solution, but I am at a loss.
df
country variable year agr man ser
chn va 1980 345 124 62
chn emp 1980 34 65 58
chn va 1981 345 243 670
ind emp 1980 54 34 40
ind va 1980 456 345 760
I have tried the ifelse, mutate_at and sweep functions, but could not get them to work.

Assuming there may also be other values in the 'variable' column, one option is to use case_when with mutate_at:
library(dplyr)
df %>%
mutate_at(vars(agr:ser), ~ case_when(variable == 'va'~ . * 1000,
variable == 'emp' ~ .* 100, TRUE ~ as.numeric(.)))
data
df <- structure(list(country = c("chn", "chn", "chn", "ind", "ind"),
variable = c("va", "emp", "va", "emp", "va"), year = c(1980L,
1980L, 1981L, 1980L, 1980L), agr = c(345L, 34L, 345L, 54L,
456L), man = c(124L, 65L, 243L, 34L, 345L), ser = c(62L,
58L, 670L, 40L, 760L)), class = "data.frame", row.names = c(NA,
-5L))
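
The same conditional scaling can also be sketched in base R with a named lookup vector of multipliers. This is an alternative illustration, not the answer's method; the names cols and mult are assumptions, and rows whose variable is neither va nor emp are left unchanged (multiplier 1).

```r
df <- data.frame(
  country  = c("chn", "chn", "chn", "ind", "ind"),
  variable = c("va", "emp", "va", "emp", "va"),
  year     = c(1980L, 1980L, 1981L, 1980L, 1980L),
  agr      = c(345L, 34L, 345L, 54L, 456L),
  man      = c(124L, 65L, 243L, 34L, 345L),
  ser      = c(62L, 58L, 670L, 40L, 760L),
  stringsAsFactors = FALSE
)

cols <- c("agr", "man", "ser")

# One multiplier per row, looked up by the value of 'variable'
mult <- unname(c(va = 1000, emp = 100)[df$variable])
mult[is.na(mult)] <- 1   # any other value of 'variable' is left as-is

# Scale each selected column by the per-row multiplier
df[cols] <- lapply(df[cols], function(x) x * mult)
df
```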

Create new variable based on the Look up table

I want to create a new variable in a data frame using a lookup table. I have a data frame df1 with the columns Amount and Term, and I need to create a new variable "Premium" whose values come from the lookup table.
I tried the ifelse function, but it is too tedious.
Below is an illustration/example
df1 <- data.frame(Amount, Term)
df1
# Amount Term
# 1 2500 23
# 2 3600 30
# 3 7000 45
# 4 12000 50
# 5 16000 38
And I need to create the new variable 'Premium' by using the Premium lookup table below.
                   Term
Amount             0-24 Mos  25-36 Mos  37-48 Mos  49-60 Mos
0 - 5,000               133        163        175        186
5,001 - 10,000          191        213        229        249
10,001 - 15,000         229        252        275        306
15,001 - 20,000         600        615        625        719
20,001 - 25,000         635        645        675        786
So the output for premium should be.
df1
# Amount Term Premium
# 1 2500 23 133
# 2 3600 30 163
# 3 7000 45 229
# 4 12000 50 306
# 5 16000 38 625

Data
df1 <- structure(list(Amount = c(2500L, 3600L, 7000L, 12000L, 16000L),
Term = c(23L, 30L, 45L, 50L, 38L)),
class = "data.frame",
row.names = c(NA, -5L))
lkp <- structure(c(133L, 191L, 229L, 600L, 635L,
163L, 213L, 252L, 615L, 645L,
175L, 229L, 275L, 625L, 675L,
186L, 249L, 306L, 719L, 786L),
.Dim = 5:4,
.Dimnames = list(Amount = c("0 - 5,000", "5,001 - 10,000",
"10,001 - 15,000", "15,001 - 20,000",
"20,001 - 25,000"),
Term = c("0-24 Mos", "25-36 Mos", "37-48 Mos",
"49-60 Mos")))
Code
First create the upper limits for month and amount from the column and row names using regular expressions (you did not post your data in a reproducible way, so these patterns may need adapting to the structure of your real lookup table):
(month <- c(0, as.numeric(sub("\\d+-(\\d+) Mos$",
"\\1",
colnames(lkp)))))
# [1] 0 24 36 48 60
(amt <- c(0, as.numeric(sub("^\\d+,*\\d* - (\\d+),(\\d+)$",
"\\1\\2",
rownames(lkp)))))
# [1] 0 5000 10000 15000 20000 25000
Get the positions for each element of df1 using findInterval:
(rows <- findInterval(df1$Amount, amt))
# [1] 1 1 2 3 4
(cols <- findInterval(df1$Term, month))
# [1] 1 2 3 4 3
Use these indices to subset the lookup matrix:
df1$Premium <- lkp[cbind(rows, cols)]
df1
# Amount Term Premium
# 1 2500 23 133
# 2 3600 30 163
# 3 7000 45 229
# 4 12000 50 306
# 5 16000 38 625
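
One detail worth checking with this approach is what happens exactly on a tier boundary. By default findInterval treats intervals as closed on the left, so an Amount of 5000 or a Term of 24 would land in the next tier, whereas the lookup table's labels ("0 - 5,000", "0-24 Mos") suggest they belong to the lower one. A minimal sketch of the difference, assuming the table's upper limits are meant to be inclusive (the data frame edge is made up for illustration):

```r
amt   <- c(0, 5000, 10000, 15000, 20000, 25000)
month <- c(0, 24, 36, 48, 60)
lkp   <- matrix(c(133, 191, 229, 600, 635,
                  163, 213, 252, 615, 645,
                  175, 229, 275, 625, 675,
                  186, 249, 306, 719, 786), nrow = 5)

# Hypothetical rows sitting exactly on, and just above, the first boundaries
edge <- data.frame(Amount = c(5000, 5001), Term = c(24, 25))

# Default: intervals are [lower, upper), so 5000 / 24 fall into the second tier
default_prem <- lkp[cbind(findInterval(edge$Amount, amt),
                          findInterval(edge$Term, month))]

# left.open = TRUE: intervals are (lower, upper], matching inclusive upper limits
open_prem <- lkp[cbind(findInterval(edge$Amount, amt, left.open = TRUE),
                       findInterval(edge$Term, month, left.open = TRUE))]

default_prem
open_prem
```

For the five rows in the question neither setting changes the result, since none of them sits on a boundary.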

To get to what you want you need to organise the table and categorise the data. I have provided a potential workflow to handle such situations. Hope this is helpful:
library(tidyverse)
df1 <- data.frame(
Amount = c(2500L, 3600L, 7000L, 12000L, 16000L),
Term = c(23L, 30L, 45L, 50L, 38L)
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# functions for analysis ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
amount_tier_function <- function(x){
case_when(x <= 5000 ~ "Tier_5000",
x <= 10000 ~ "Tier_10000",
x <= 15000 ~ "Tier_15000",
x <= 20000 ~ "Tier_20000",
TRUE ~ "Tier_25000")
}
month_tier_function <- function(x){
case_when(x <= 24 ~ "Tier_24",
x <= 36 ~ "Tier_36",
x <= 48 ~ "Tier_48",
TRUE ~ "Tier_60")
}
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Recut lookup table headings ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
lookup_df <- data.frame(stringsAsFactors=FALSE,
amount_tier = c("Tier_5000", "Tier_10000", "Tier_15000", "Tier_20000",
"Tier_25000"),
Tier_24 = c(133L, 191L, 229L, 600L, 635L),
Tier_36 = c(163L, 213L, 252L, 615L, 645L),
Tier_48 = c(175L, 229L, 275L, 625L, 675L),
Tier_60 = c(186L, 249L, 306L, 719L, 786L)
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Join everything together ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
lookup_df_tidy <- lookup_df %>%
gather(mth_tier, Premium, - amount_tier)
df1 %>%
mutate(amount_tier = amount_tier_function(Amount),
mth_tier = month_tier_function(Term)) %>%
left_join(., lookup_df_tidy) %>%
select(-amount_tier, -mth_tier)

Sorting dataframe by column of letters and numbers

I have been attempting to sort my dataframe by the first column - or day - with multiple different methods listed below to no avail. I suspect it could be because it is attempting to order by the first number but I am unsure how I would change that to get it to order the rows properly. The dataset is as follows:
df1
[day][sample1][sample2]
[1,]day0 22 11
[2,]day11 23 15
[3,]day15 25 14
[4,]day2 21 13
[5,]day8 20 17
...
I am looking to order the entire row by day. I have tried the following
df[sort(as.character(df$day)),]
df[order(as.character(df$day)),]
mixedorder(as.character(df$day)) (gtools package)
The mixedorder merely output an index of numbers.
Current Code:
df_0$day = metadata_df[,3]
df_0 <- df_0[,c(8,1:7)]
df1 <- aggregate(df_0[,2:ncol(df_0)], df_0[1], mean)
df1 <- df1[mixedorder(as.character(df1$day)),]
df1$day <- factor(df1$day, levels = unique(df1$day))
rownames(df1) <- 1:nrow(df1)
##Plotting expression levels
Plot1 <- ggplot() +
geom_line(data=df1, aes(x=day, y=sample1, group=1, color="blue"))+
geom_line(data=df2, aes(x=day, y=sample1, group=2, color="red"))
Note that I have done the same transformations with df2 as I have with df1. Both df1 and df2 are the same, except with slightly different values in them.

mixedorder (from the gtools package) returns the ordering index, which can then be used to order the rows:
library(gtools)
df1 <- df[mixedorder(as.character(df$day)),]
df1
# day sample1 sample2
#1 day0 22 11
#4 day2 21 13
#5 day8 20 17
#2 day11 23 15
#3 day15 25 14
It is not clear how the OP is plotting; one option is to fix the factor levels in the sorted order and plot from long format:
library(tidyverse)
df1 %>%
mutate(day = factor(day, levels = unique(day))) %>%
gather(key, val, -day) %>%
ggplot(., aes(x = day, y = val, color = key)) +
geom_point()
data
df <- structure(list(day = structure(1:5, .Label = c("day0", "day11",
"day15", "day2", "day8"), class = "factor"), sample1 = c(22L,
23L, 25L, 21L, 20L), sample2 = c(11L, 15L, 14L, 13L, 17L)), .Names = c("day",
"sample1", "sample2"), class = "data.frame", row.names = c(NA,
-5L))
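
If installing gtools is not an option, a base R sketch of the same idea is to strip the non-digit characters and order on the resulting number. This assumes every label has the form of a text prefix followed by a single number, as in the question; the name sorted is made up for this example.

```r
df <- data.frame(
  day     = factor(c("day0", "day11", "day15", "day2", "day8")),
  sample1 = c(22L, 23L, 25L, 21L, 20L),
  sample2 = c(11L, 15L, 14L, 13L, 17L)
)

# Keep only the digits of each label and convert them to numbers
day_num <- as.numeric(gsub("[^0-9]", "", as.character(df$day)))

# Reorder the rows by that number
sorted <- df[order(day_num), ]

# Fix the factor levels in the sorted order so plots keep it too
sorted$day <- factor(sorted$day, levels = unique(as.character(sorted$day)))
sorted
```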

R comparing 2 dfs to sum data between values

I have 2 dataframes in R, one with start (column 1) and end (column 2) coordinates...
df1
2500 3499
3500 4499
4500 5499
5500 6499
And one with point coordinates (column 1) and associated values (column 2)...
df2
2657 17
2895 33
3875 12
4448 42
5122 3
5633 65
5781 12
I would like to find a vectorized approach to sum the values from df2 column 2 where the df2 column 1 coordinates are between the start and stop coordinates in df1. With this data the result should look like this...
df3
2500 3499 50
3500 4499 54
4500 5499 3
5500 6499 77
The dfs contain 100,000+ rows. I can achieve this easily using loops, but as we are in R it is slow and not the best approach.
What is the best way to do this? A flexible solution that can be adapted to functions other than simply summing the data would also be good to know.

Here's a possible data.table::foverlaps solution. As you haven't specified column names, I'm assuming that they are called V1 and V2 in both data sets
Solution
library(data.table)
setDT(df1)[, `:=`(start = V1, end = V2)]
setDT(df2)[, `:=`(start = V1, end = V1)]
setkey(df1, start, end)
foverlaps(df2, df1)[, list(SumV2 = sum(i.V2)), by = list(V1, V2)]
# V1 V2 SumV2
# 1: 2500 3499 50
# 2: 3500 4499 54
# 3: 4500 5499 3
# 4: 5500 6499 77
Explanation
Here we converted both data sets to data.table objects and specified the start/end values to overlap on. Then, we keyed the data set that we want to join against. Finally we ran the foverlaps function and then aggregated the matched values of V2 from df2 by the desired columns in df1
Data
df1 <- structure(list(V1 = c(2500L, 3500L, 4500L, 5500L), V2 = c(3499L,
4499L, 5499L, 6499L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(V1 = c(2657L, 2895L, 3875L, 4448L, 5122L, 5633L,
5781L), V2 = c(17L, 33L, 12L, 42L, 3L, 65L, 12L)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -7L))
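
When the intervals in df1 are sorted by start and do not overlap, as in the example, a vectorized base R sketch with findInterval and tapply is another option. This is an illustration under those assumptions rather than a general replacement for foverlaps; the names idx, inside and Sum are made up here, and sum can be swapped for any other summary function.

```r
df1 <- data.frame(V1 = c(2500L, 3500L, 4500L, 5500L),
                  V2 = c(3499L, 4499L, 5499L, 6499L))
df2 <- data.frame(V1 = c(2657L, 2895L, 3875L, 4448L, 5122L, 5633L, 5781L),
                  V2 = c(17L, 33L, 12L, 42L, 3L, 65L, 12L))

# Index of the last interval whose start is <= each point (0 if before all starts)
idx <- findInterval(df2$V1, df1$V1)

# Keep only points that also lie at or before the end of that interval
inside <- idx > 0 & df2$V1 <= df1$V2[pmax(idx, 1)]

# Summarise the values per interval; intervals with no points give NA
sums <- tapply(df2$V2[inside],
               factor(idx[inside], levels = seq_len(nrow(df1))),
               sum)

df3 <- cbind(df1, Sum = as.vector(sums))
df3
```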
