Sum by aggregating complex paired names in R

In R, I'm trying to aggregate a dataframe based on unique IDs, BUT I need to use some kind of wild card value for the IDs. Meaning I have paired names like this:
lion_tiger
elephant_lion
tiger_lion
And I need the lion_tiger and tiger_lion IDs to be summed together, because the order in the pair does not matter.
Using this dataframe as an example:
df <- data.frame(pair = c("1_3","2_4","2_2","1_2","2_1","4_2","3_1","4_3","3_2"),
value = c("12","10","19","2","34","29","13","3","14"))
So the values for pair IDs, "1_2" and "2_1" need to be summed in a new table. That new row would then read:
1_2 36
Any suggestions? While my example has numbers as the pair IDs, in reality I would need this to work with text (like the "lion_tiger" example above).

We can split the 'pair' column by _, sort and paste it back together, and use it as the grouping variable in a group-by function to get the sum:
tapply(as.numeric(as.character(df$value)),
       sapply(strsplit(as.character(df$pair), '_'), function(x)
         paste(sort(as.numeric(x)), collapse = "_")), FUN = sum)
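Since the real IDs are text (like "lion_tiger"), the same idea works if you drop as.numeric and let sort() order the character strings. A minimal sketch, assuming a small hypothetical data frame in the question's format:
df_txt <- data.frame(pair  = c("lion_tiger", "elephant_lion", "tiger_lion"),
                     value = c(1, 2, 3))
# build the order-independent key, then sum by it
key <- sapply(strsplit(as.character(df_txt$pair), "_"),
              function(x) paste(sort(x), collapse = "_"))
tapply(df_txt$value, key, FUN = sum)
# elephant_lion    lion_tiger
#             2             4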
Another option is gsubfn:
library(gsubfn)
df$pair <- gsubfn('([0-9]+)_([0-9]+)',
                  ~ paste(sort(as.numeric(c(x, y))), collapse = '_'),
                  as.character(df$pair))
df$value <- as.numeric(as.character(df$value))
aggregate(value~pair, df, sum)

Using tidyverse and purrrlyr
df <- data.frame(name = c("lion_tiger", "elephant_lion", "tiger_lion"),
                 value = c(1, 2, 3), stringsAsFactors = FALSE)
require(tidyverse)
require(purrrlyr)
df %>%
  separate(col = name, sep = "_", into = c("A", "B")) %>%
  by_row(.collate = "rows",
         ..f = function(this_row) {
           paste0(sort(c(this_row$A, this_row$B)), collapse = "_")
         }) %>%
  rename(sorted = ".out") %>%
  group_by(sorted) %>%
  summarize(sum(value)) %>%
  show
## A tibble: 2 x 2
# sorted `sum(value)`
# <chr> <dbl>
#1 elephant_lion 2
#2 lion_tiger 4
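purrrlyr is not strictly needed for this; as a hedged alternative sketch, plain dplyr/tidyr with rowwise() should give the same result on the df defined above:
library(dplyr)
library(tidyr)
df %>%
  separate(name, into = c("A", "B"), sep = "_") %>%
  rowwise() %>%                                            # sort within each row
  mutate(sorted = paste(sort(c(A, B)), collapse = "_")) %>%
  group_by(sorted) %>%
  summarise(value = sum(value), .groups = "drop")
# sorted        value
# elephant_lion     2
# lion_tiger        4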

Related

R - Regular Expressions (Regex) with a list of Data Frames (only first match)

So, I'm the happy owner of a list of 17246 data frames and need to extract three pieces of data from each of them:
To whom the job was given.
The standard code that describes what kind of job it is (Ex. "00" inside this "12-00.07").
The date on which it was assigned.
Each data frame contains data about just one worker.
But the data is entered inconsistently: it always starts with the pattern "Worker:" + "name or number identification".
So, I can find the data with a regular expression that targets "Worker:".
I can also target the first string that matches a date pattern: "dd/dd/dd".
The desired output is a df with 3 columns ("Worker", "Code", "Date"), and then all dfs united into one.
In order to achieve this, I find myself with three problems:
a) The information is presented in no particular order (I cannot subset specific rows).
b) The intended worker and code are substrings inside other characters.
c) More than one date is present in each df and I only want the first match. All other dates are misleading.
The input is this:
v1 <- c("Worker: Joseph", "06/01/21", "12-00.07", "06/19/21", "useless", "06-11.85")
v2 <- c("useless","99-08-70", "Worker: 3rd", "05/01/21", "useless", "25-57.99", "07/01/21")
df1 <- data.frame(text = v1)
df2 <- data.frame(text = v2)
PDF_list <- list(df1, df2)
The desired outcome is this:
library(dplyr)
n1 <- c("Joseph", "Joseph")
c1 <- c("00", "11")
d1 <- c("06/01/21", "06/01/21")
n2 <- c("3rd", "3rd")
c2 <- c("08", "57")
d2 <- c("05/01/21", "05/01/21")
df1 <- data.frame(name = n1, code = c1, date = d1)
df2 <- data.frame(name = n2, code = c2, date = d2)
PDF_list <- list(df1, df2)
one_df <- bind_rows(PDF_list)
So far, I've managed to write this poor excuse of a code. It doesn’t select the substrings and it cheats to get the desired date:
library(tidyverse)
library(tidyr)
library(stringr)
v1 <- c("Worker: Joseph", "06/01/21", "12-00.07", "06/19/21", "useless", "06-11.85")
v2 <- c("useless","99-08-70", "Worker: 3rd", "05/01/21", "useless", "25-57.99", "07/01/21")
df1 <- data.frame(text = v1)
df2 <- data.frame(text = v2)
PDF_list <- list(df1, df2)
for (num in 1:length(PDF_list)) {
  worker <- filter(PDF_list[[num]], grepl("Worker:\\s*?(\\w.+)", text))
  code   <- filter(PDF_list[[num]], grepl("-(\\d{2}).+", text))
  date   <- filter(PDF_list[[num]], grepl("^\\d{2}/\\d{2}.+", text))
  if (nrow(date) > 1) {
    date <- date[1, 1]
  }
  t_list <- cbind(worker, code, date)
  names(t_list) <- c("name", "code", "date")
  PDF_list[[num]] <- t_list
}
rm(worker, code, date, t_list)
one_df <- bind_rows(PDF_list)
View(one_df)
Any help? Thanks!
A method using tidyverse:
Loop over the list with map, arranging the rows of each data frame so that the row containing 'Worker:' becomes the top row.
Bind the list elements into a single dataset with the _dfr suffix in map, creating a grouping index by specifying .id.
Group by the 'grp' column.
Use summarise to take as 'date' the first element of 'text' that matches two digits, /, two digits, /, two digits from the start (^) to the end ($) of the string.
The first element becomes 'name' after removing the substring 'Worker:' and any following spaces with str_remove.
Similarly, extract 'code' by capturing the middle group of digits from the entries made up only of digits separated by - or .
library(dplyr)
library(stringr)
library(purrr)
PDF_list %>%
  map_dfr(~ .x %>%
            arrange(!str_detect(text, 'Worker:')), .id = 'grp') %>%
  group_by(grp) %>%
  summarise(date = first(text[str_detect(text, "^\\d{2}/\\d{2}/\\d{2}$")]),
            name = str_remove(first(text), "Worker:\\s*"),
            code = str_replace(text[str_detect(text, '^\\d+-(\\d+)[.-]\\d+$')],
                               "^\\d+-(\\d+)[.-]\\d+$", "\\1"),
            .groups = 'drop') %>%
  select(name, code, date)
Output:
# A tibble: 4 x 3
name code date
<chr> <chr> <chr>
1 Joseph 00 06/01/21
2 Joseph 11 06/01/21
3 3rd 08 05/01/21
4 3rd 57 05/01/21

R Subsetting text from a comma-separated column in a data frame

I have a data.frame with a column that looks like this:
diagnosis
F.31.2,A.43.2,R.45.2,F.43.1
I want to somehow split this column into two columns: one containing all the values with F and one containing all the other values, resulting in a df that looks like this:
F             other
F.31.2,F.43.1 A.43.2,R.45.2
Thanks in advance
Try the following tidyverse approach. You can separate the rows at the commas, then create a group according to the pattern, and reshape to wide to obtain the expected result:
library(dplyr)
library(tidyr)
#Data
df <- data.frame(diagnosis='F.31.2,A.43.2,R.45.2,F.43.1',stringsAsFactors = F)
#Code
new <- df %>%
  separate_rows(diagnosis, sep = ',') %>%
  mutate(Group = ifelse(grepl('F', diagnosis), 'F', 'Other')) %>%
  pivot_wider(values_fn = toString, names_from = Group, values_from = diagnosis)
Output:
# A tibble: 1 x 2
F Other
<chr> <chr>
1 F.31.2, F.43.1 A.43.2, R.45.2
First, use strsplit to split at the commas. Then use grep to find the indexes of the F entries, select/anti-select them by multiplying the index vector by 1 or -1, and paste the results back together.
tmp <- el(strsplit(d$diagnosis, ","))
res <- lapply(c(1, -1), function(x) paste(tmp[grep("F", tmp)*x], collapse=","))
res <- setNames(as.data.frame(res), c("F", "other"))
res
# F other
# 1 F.31.2,F.43.1 A.43.2,R.45.2
Data:
d <- setNames(read.table(text="F.31.2,A.43.2,R.45.2,F.43.1"), "diagnosis")
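If the diagnosis column had more than one row, a hedged sketch extending the same split/grep idea (split_one is a hypothetical helper name):
split_one <- function(s) {
  tmp <- strsplit(as.character(s), ",")[[1]]
  # keep F-codes in one column, everything else in the other
  data.frame(F     = paste(tmp[grepl("F", tmp)], collapse = ","),
             other = paste(tmp[!grepl("F", tmp)], collapse = ","))
}
do.call(rbind, lapply(d$diagnosis, split_one))
#               F         other
# 1 F.31.2,F.43.1 A.43.2,R.45.2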

Create a loop for calculating values from a dataframe in R?

Let's say I make a dummy dataframe with 6 columns with 10 observations:
X <- data.frame(a=1:10, b=11:20, c=21:30, d=31:40, e=41:50, f=51:60)
I need to create a loop that evaluates 3 columns at a time, adding the summed second and third columns and dividing this by the sum of the first column:
(sum(b)+sum(c))/sum(a) ... (sum(e)+sum(f))/sum(d) ...
I then need to construct a final dataframe from these values. For example using the dummy dataframe above, it would look like:
value
1. 7.454545
2. 2.84507
I imagine I need to use the next function to iterate within the loop, but I'm fairly lost! Thank you for any help.
You can split your data frame into groups of 3 columns by creating a vector with rep where each element repeats 3 times. Then with this list of sub data frames, (s)apply a function that sums the second and third columns, adds them together, and divides by the sum of the first column.
out_vec <- sapply(
  split.default(X, rep(1:ncol(X), each = 3, length.out = ncol(X))),
  function(x) (sum(x[2]) + sum(x[3])) / sum(x[1])
)
data.frame(value = out_vec)
# value
# 1 7.454545
# 2 2.845070
You could also sum all the columns up front before the sapply with colSums, which will be more efficient.
out_vec <- sapply(
  split(colSums(X), rep(1:ncol(X), each = 3, length.out = ncol(X))),
  function(x) (x[2] + x[3]) / x[1]
)
data.frame(value = out_vec, row.names = NULL)
# value
# 1 7.454545
# 2 2.845070
You could use tapply:
tapply(colSums(X), gl(ncol(X)/3, 3), function(x)sum(x[-1])/x[1])
1 2
7.454545 2.845070
Here is an option with tidyverse
library(dplyr) # 1.0.0
library(tidyr)
X %>%
  summarise(across(.fns = sum)) %>%
  pivot_longer(everything()) %>%
  group_by(grp = as.integer(gl(n(), 3, n()))) %>%
  summarise(value = sum(lead(value) / first(value), na.rm = TRUE)) %>%
  select(value)
# A tibble: 2 x 1
# value
# <dbl>
#1 7.45
#2 2.85

Select unique values

I need to change this function because it doesn't match values exactly. For example, if I want MAPK4, the function also matches MAPK41, AMAPK4, etc. The function must select only exact matches.
Function:
library(dplyr)
df2 <- df %>%
  rowwise() %>%
  mutate(mutated = paste(mutated_genes[unlist(
           lapply(mutated_genes, function(x) grepl(x, genes, ignore.case = TRUE)))],
           collapse = ","),
         circuit_name = gsub("", "", circuit_name)) %>%
  select(-genes) %>%
  data.frame()
data:
df <-structure(list(circuit_name = c("hsa04010__117", "hsa04014__118" ), genes = c("MAP4K4,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP3*,DUSP3*,DUSP3*,DUSP3*,PPM1A,AKT3,AKT3,AKT3,ZAK,MAP3K12,MAP3K13,TRAF2,CASP3,IL1R1,IL1R1,TNFRSF1A,IL1A,IL1A,TNF,RAC1,RAC1,RAC1,RAC1,MAP2K7,MAPK8,MAPK8,MAPK8,MECOM,HSPA1A,HSPA1A,HSPA1A,HSPA1A,HSPA1A,HSPA1A,MAP4K3,MAPK8IP2,MAP4K1", "MAP4K4,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*")), class = "data.frame", row.names = c(NA, -2L))
mutated_genes <- c("MAP4K4", "MAP3K12","TRAF2", "CACNG3")
output:
circuit_name mutated
1 hsa04010__117 MAP4K4,TRAF2
2 hsa04014__118 MAP4K4
A base R approach would be to split the genes on "," and return those strings which match mutated_genes.
df$mutated <- sapply(strsplit(df$genes, ","), function(x)
toString(grep(paste0(mutated_genes, collapse = "|"), x, value = TRUE)))
df[c(1, 3)]
# circuit_name mutated
#1 hsa04010__117 MAP4K4, MAP3K12, TRAF2
#2 hsa04014__118 MAP4K4
Please note that based on the mutated_genes vector, your expected output is missing MAP3K12 for hsa04010__117.
Here is a tidyverse possibility
library(dplyr)
library(tidyr)
df %>%
  separate_rows(genes) %>%
  filter(genes %in% mutated_genes) %>%
  group_by(circuit_name) %>%
  summarise(mutated = toString(genes))
## A tibble: 2 x 2
# circuit_name mutated
# <chr> <chr>
#1 hsa04010__117 MAP4K4, MAP3K12, TRAF2
#2 hsa04014__118 MAP4K4
Explanation: We separate comma-separated entries into different rows, then select only those rows where genes %in% mutated_genes and summarise results per circuit_name by concatenating genes entries.
PS. Personally I'd recommend keeping the data in a tidy long format (i.e. don't concatenate entries with toString); that way you have one row per gene, which will make any post-processing of the data much more straightforward.
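As a hedged sketch of that long-format suggestion (same df and mutated_genes as above, df_long is just an illustrative name), simply stop before the summarise step:
library(dplyr)
library(tidyr)
df_long <- df %>%
  separate_rows(genes) %>%          # one row per gene
  filter(genes %in% mutated_genes)  # keep only the mutated genes
df_long  # one row per matched gene per circuit_name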
We can use str_extract_all from stringr
library(stringr)
df$mutated <- sapply(str_extract_all(df$genes,
                                     paste(mutated_genes, collapse = "|")),
                     toString)
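If partial matches remain a concern (the question mentions MAPK41 / AMAPK4), a hedged variant is to wrap the alternation in word boundaries so that only whole gene names are extracted:
library(stringr)
# \b on each side ensures e.g. MAP4K4 does not match inside a longer token
pat <- paste0("\\b(", paste(mutated_genes, collapse = "|"), ")\\b")
df$mutated <- sapply(str_extract_all(df$genes, pat), toString)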

Insert Column Name into its Value using R

I need to insert the column name, Department, into its values. I have code like this:
Department <- c("Store1","Store2","Store3","Store4","Store5")
Department2 <- c("IT1","IT2","IT3","IT4","IT5")
x <- c(100,200,300,400,500)
Result <- data.frame(Department,Department2,x)
Result
The expected result is like this:
Department <- c("Department_Store1","Department_Store2","Department_Store3","Department_Store4","Department_Store5")
Department2 <- c("Department2_IT1","Department2_IT2","Department2_IT3","Department2_IT4","Department2_IT5")
x <- c(100,200,300,400,500)
Expected.Result <- data.frame(Department,Department2,x)
Expected.Result
Can somebody help? Thanks
Another way with dplyr and tidyr:
library(dplyr)
library(tidyr)
# Convert to character to avoid a warning; this converts all columns to character
Result[] <- lapply(Result, as.character)
Result %>%
  mutate_if(is.factor, as.character) %>% # optional: only convert factor columns, keep other types
  gather(key, value, -x) %>%
  mutate(var = paste(key, value, sep = "_")) %>%
  select(-value) %>%
  spread(key, var)
x Department Department2
1 100 Department_Store1 Department2_IT1
2 200 Department_Store2 Department2_IT2
3 300 Department_Store3 Department2_IT3
4 400 Department_Store4 Department2_IT4
5 500 Department_Store5 Department2_IT5
Data:
Result <- data.frame(
Department = c("Store1","Store2","Store3","Store4","Store5"),
Department2 = c("IT1","IT2","IT3","IT4","IT5"),
x = c(100,200,300,400,500)
)
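gather() and spread() have since been superseded in tidyr; a hedged sketch of the same reshape with pivot_longer()/pivot_wider(), assuming the Result defined above:
library(dplyr)
library(tidyr)
Result %>%
  mutate(across(-x, as.character)) %>%                       # avoid factor/type issues
  pivot_longer(-x, names_to = "key", values_to = "value") %>%
  mutate(value = paste(key, value, sep = "_")) %>%           # prepend the column name
  pivot_wider(names_from = key, values_from = value)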
If you gather the column names in question into a vector dep_col, this is a clean base R solution with a for loop:
df <- data.frame(x = 1:5,
                 Department  = paste0("Store", 1:5),
                 Department2 = paste0("IT", 1:5))
dep_col <- names(df)[-1]
for (c in dep_col)
  df[[c]] <- paste(c, df[[c]], sep = "_")
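For reference, after this loop the hypothetical df defined above should look like:
df
#   x        Department     Department2
# 1 1 Department_Store1 Department2_IT1
# 2 2 Department_Store2 Department2_IT2
# 3 3 Department_Store3 Department2_IT3
# 4 4 Department_Store4 Department2_IT4
# 5 5 Department_Store5 Department2_IT5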
If I understand correctly, the OP wants to prepend the values in all columns starting with "Department" with the respective column name.
Edit: By request of the OP, the code to select columns has been generalized to pick up additional column names.
Here is a solution using data.table's fast set() function:
library(data.table)
setDT(Result)
cols <- stringr::str_subset(names(Result), "^(Department|Division|Team)")
for (j in cols) {
set(Result, NULL, j, paste(j, Result[[j]], sep = "_"))
}
Result
Department Department2 x
1: Department_Store1 Department2_IT1 100
2: Department_Store2 Department2_IT2 200
3: Department_Store3 Department2_IT3 300
4: Department_Store4 Department2_IT4 400
5: Department_Store5 Department2_IT5 500
Note that set() updates by reference, i.e., without copying the whole object.
