Making a new column with multiple elements after group_by - R

I'm trying to make a new column as described below. The d's correspond to dates and the V2 values are events on those dates. I need to collect the events for each date into V3, a single column whose row entries are a concatenation of the events. My attempt below does not work. Thanks in advance.
df =
V1 V2
d1 U
d2 M
d1 T
d1 Q
d2 P
Desired resulting df:
df.1 =
V1 V3
d1 U,T,Q
d2 M,P
df.1 <- df %>% group_by(., V1) %>%
  mutate(., V3 = c(distinct(., V2))) %>%
  as.data.frame
The above code results in the following error (ignore the 15 and the 1s; they're specific to my actual code):
Error: incompatible size (15), expecting 1 (the group size) or 1

You can use aggregate like this:
df.1 <- aggregate(V2 ~ V1, data = df, FUN = paste, collapse = ",")
# V1 V2
# 1 d1 U,T,Q
# 2 d2 M,P
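If the same event could repeat on a date and only distinct events are wanted (as the question's attempt with distinct suggests), a small variation wrapping the values in unique should do it:
df.1 <- aggregate(V2 ~ V1, data = df,
                  FUN = function(x) paste(unique(x), collapse = ","))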

A data frame will not allow a vector as a single element, so instead of using c(), you can use paste() to concatenate the elements into a single string.
df.1 <- df %>%
  group_by(V1) %>%
  mutate(V3 = paste(unique(V2), collapse = ",")) %>%
  select(V1, V3) %>%
  unique() %>%
  as.data.frame()

Still with dplyr, you can try:
df %>%
  group_by(V1) %>%
  summarize(V3 = paste(unique(V2), collapse = ", "))
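For the sample df above, this should return one row per date:
# A tibble: 2 x 2
#   V1    V3
#   <chr> <chr>
# 1 d1    U, T, Q
# 2 d2    M, P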

Related

Loop a merge+sum function on a set of dataframes in R

I have the following list of data frames:
dflist <- list(df1_A, df1_B, df1_C, df1_D, df1_E,
df2_A, df2_B, df2_C, df2_D, df2_E,
df3_A, df3_B, df3_C, df3_D, df3_E,
df4_A, df4_B, df4_C, df4_D, df4_E)
names(dflist) <- c("df1_A", "df1_B", "df1_C", "df1_D", "df1_E",
"df2_A", "df2_B", "df2_C", "df2_D", "df2_E",
"df3_A", "df3_B", "df3_C", "df3_D", "df3_E",
"df4_A", "df4_B", "df4_C", "df4_D", "df4_E")
Each data frame has the same structure (with the same column names):
df1_A
V1 V2
G18941 17
G20092 534
G19692 10
G19703 260
G16777 231
G20045 0
...
I would like to make a function that merges all the dataframes with the same number (but different letter) in my list and sums the values in column V2 when the names in V1 are the same.
Doing it by hand, I managed to do this for df1_A and df1_B with the following code:
newdf <- bind_rows(df1_A, df1_B) %>%
  group_by(V1) %>%
  summarise_all(sum, na.rm = TRUE)
I can easily turn this into a function like this:
MergeAndSum <- function(df1, df2) {
  newdf <- bind_rows(df1, df2) %>%
    group_by(V1) %>%
    summarise_all(sum, na.rm = TRUE)
  return(newdf)
}
But I don't really see how to call it to do the loop. I tried something like:
for (i in 2:length(dflist)) {
  df1 <- dflist[[i - 1]]
  df2 <- dflist[[i]]
  out1 <- MergeAndSum(df1, df2)
  return(out1)
}
I imagine something that merges and sums df1_A with df1_B and reassigns the result to df1_A, then calls the function again with df1_A and df1_C, then with df1_A and df1_D, and finally with df1_A and df1_E, reassigning the result to df1_A each time.
Then the same thing with df2 (df2_A, df2_B, ... df2_E), then df3 and df4.
If you know how to do this I am listening.
bind_rows can combine a list of data frames. Passing .id adds the name of each list element as a new column; you can then extract the data frame prefix (df1 from df1_A, df2 from df2_A, and so on) and take the sum of the V2 column grouped by prefix and V1.
library(dplyr)
bind_rows(dflist, .id = "id") %>%
  mutate(id = stringr::str_extract(id, 'df\\d+')) %>%
  group_by(id, V1) %>%
  summarise(V2 = sum(V2, na.rm = TRUE), .groups = "drop")
Since you want to sum only one column (V2), you can use summarise instead of summarise_all, which has been superseded.
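If you specifically want the iterative per-prefix merge described in the question (fold df1_A into df1_B, that result into df1_C, and so on), a sketch using Reduce with the MergeAndSum function from the question would be:
library(stringr)
# split the named list by its df1/df2/... prefix, then fold each group
# pairwise with MergeAndSum; the result is one merged data frame per prefix
prefixes <- str_extract(names(dflist), "df\\d+")
merged_list <- lapply(split(dflist, prefixes),
                      function(dfs) Reduce(MergeAndSum, dfs))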

R - Regular Expressions (Regex) with a list of Data Frames (only first match)

So, I'm the happy owner of a list of 17,246 data frames and need to extract three pieces of data from each of them:
1. To whom the job was given.
2. The standard code that describes what kind of job it is (e.g. "00" inside "12-00.07").
3. The date on which it was assigned.
Each data frame contains data about just one worker.
But the data is entered inconsistently: each record always starts with the literal string "Worker:" followed by a name or number identification.
So, I can find that data with a regular expression that targets "Worker:".
I can also target the first string that matches a date pattern: "dd/dd/dd".
The desired output is a df with 3 columns (“Worker”, “Code”, “Date”) and then unite all dfs into one.
In order to achieve this end, I find myself with three problems:
a) The information is presented in no order (cannot subset specific rows).
b) The intended worker and code are a substring inside other characters.
c) More than one date is presented in each df and I only want the first match. All other dates are misleading.
The input is this:
v1 <- c("Worker: Joseph", "06/01/21", "12-00.07", "06/19/21", "useless", "06-11.85")
v2 <- c("useless","99-08-70", "Worker: 3rd", "05/01/21", "useless", "25-57.99", "07/01/21")
df1 <- data.frame(text = v1)
df2 <- data.frame(text = v2)
PDF_list <- list(df1, df2)
The desired outcome is this:
library(dplyr)
n1 <- c("Joseph", "Joseph")
c1 <- c("00", "11")
d1 <- c("06/01/21", "06/01/21")
n2 <- c("3rd", "3rd")
c2 <- c("08", "57")
d2 <- c("05/01/21", "05/01/21")
df1 <- data.frame(name = n1, code = c1, date = d1)
df2 <- data.frame(name = n2, code = c2, date = d2)
PDF_list <- list(df1, df2)
one_df <- bind_rows(PDF_list)
So far, I've managed to write this poor excuse for code. It doesn't select the substrings, and it cheats to get the desired date:
library(tidyverse)
library(tidyr)
library(stringr)
v1 <- c("Worker: Joseph", "06/01/21", "12-00.07", "06/19/21", "useless", "06-11.85")
v2 <- c("useless","99-08-70", "Worker: 3rd", "05/01/21", "useless", "25-57.99", "07/01/21")
df1 <- data.frame(text = v1)
df2 <- data.frame(text = v2)
PDF_list <- list(df1, df2)
for (num in 1:length(PDF_list)) {
  worker <- filter(PDF_list[[num]], grepl("Worker:\\s*?(\\w.+)", text))
  code <- filter(PDF_list[[num]], grepl("-(\\d{2}).+", text))
  date <- filter(PDF_list[[num]], grepl("^\\d{2}/\\d{2}.+", text))
  if (nrow(date) > 1) {
    date <- date[1, 1]
  }
  t_list <- cbind(worker, code, date)
  names(t_list) <- c("name", "code", "date")
  PDF_list[[num]] <- t_list
}
rm(worker, code, date, t_list)
one_df <- bind_rows(PDF_list)
View(one_df)
Any help? Thanks!
A method using tidyverse:
- Loop over the list with map, arranging the rows of each data frame so that the row containing 'Worker:' becomes the top row.
- Bind the list elements into a single dataset with map_dfr, creating a grouping index by specifying .id.
- Group by the 'grp' column.
- Use summarise to create the output: the 'date' is the first element of 'text' matching two digits, /, two digits, /, two digits across the whole string (^ to $).
- The 'name' is the first element after removing the substring 'Worker:' and any following spaces with str_remove.
- The 'code' is extracted by capturing the middle group of digits from elements made up only of digit groups separated by - or .
library(dplyr)
library(stringr)
library(purrr)
PDF_list %>%
  map_dfr(~ .x %>%
            arrange(!str_detect(text, 'Worker:')), .id = 'grp') %>%
  group_by(grp) %>%
  summarise(date = first(text[str_detect(text, "^\\d{2}/\\d{2}/\\d{2}$")]),
            name = str_remove(first(text), "Worker:\\s*"),
            code = str_replace(text[str_detect(text, '^\\d+-(\\d+)[.-]\\d+$')],
                               "^\\d+-(\\d+)[.-]\\d+$", "\\1"),
            .groups = 'drop') %>%
  select(name, code, date)
Output:
# A tibble: 4 x 3
#   name   code  date
#   <chr>  <chr> <chr>
# 1 Joseph 00    06/01/21
# 2 Joseph 11    06/01/21
# 3 3rd    08    05/01/21
# 4 3rd    57    05/01/21
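As a side note on the code extraction, applying str_match to the raw vector shows what the pattern captures; this is only an illustration of the regex, not part of the pipeline:
# second column of the str_match result holds the captured digit group
str_match(v1, "^\\d+-(\\d+)[.-]\\d+$")[, 2]
# [1] NA NA "00" NA NA "11"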

How to drop columns in data frame based on their contribution to the sum over all columns in R

I'm trying to drop columns from my data frame based on their contribution to the sum across all columns.
An example with a 1x5 data frame would be the following (I suppose it would also be possible to drop rows from a 5x1 data frame in a similar way and then transpose it). Assume the values sum to 100.
df <- data.frame(V1 = 5, V2 = 10, V3 = 20, V4 = 40, V5 = 25)
V1 V2 V3 V4 V5
5 10 20 40 25
I now want to keep the columns that contribute the most to e.g. at least 80% of the sum over all columns.
So what I want to achieve is:
V3 V4 V5
20 40 25
Is there an elegant way to do this?
Thanks in advance!
There are many possible approaches. One way in base R is to unlist the data, sort it in decreasing order, and take the cumulative sum of the value proportions. Stop when it reaches the threshold (0.8) and select all the columns up to that point.
vals <- cumsum(prop.table(sort(unlist(df), decreasing = TRUE))) > 0.8
df[names(vals[1:which.max(vals)])]
# V4 V5 V3
#1 40 25 20
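To see why the which.max trick picks the right cutoff, look at the intermediate cumulative proportions: vals is FALSE until the running share first exceeds 0.8, and which.max returns the index of that first TRUE.
cumsum(prop.table(sort(unlist(df), decreasing = TRUE)))
#   V4   V5   V3   V2   V1
# 0.40 0.65 0.85 0.95 1.00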
We can use tidyverse
library(tidyr)
library(dplyr)
library(tibble) # for deframe()
pivot_longer(df, everything()) %>%
  arrange(desc(value)) %>%
  filter(!lag(cumsum(value) > 80, default = FALSE)) %>%
  deframe()
# V4 V5 V3
#40 25 20
Or, if we need the columns in their original order:
pivot_longer(df, everything()) %>%
  arrange(desc(value)) %>%
  filter(!lag(cumsum(value) > 80, default = FALSE)) %>%
  arrange(match(name, names(df))) %>%
  mutate(rn = 1) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  select(-rn)
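For the example df, this reordered version should return a one-row result with the original column order:
# A tibble: 1 x 3
#      V3    V4    V5
#   <dbl> <dbl> <dbl>
# 1    20    40    25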
Using a tidyverse approach, I created a row id, gathered the data, sorted the values in descending order, calculated the cumulative share, and kept columns until that share reached 0.8 (lagging the cumulative sum so the column that crosses the threshold is kept too).
library(tidyverse)
df %>%
  rownames_to_column("id") %>%
  gather(var, value, -id) %>%
  group_by(id) %>%
  arrange(desc(value)) %>%
  mutate(sum = cumsum(value) / sum(value)) %>%
  filter(lag(sum, default = 0) < 0.8)
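For the example data this should keep the three largest columns in long form, roughly:
# A tibble: 3 x 4
# Groups:   id [1]
#   id    var   value   sum
#   <chr> <chr> <dbl> <dbl>
# 1 1     V4       40  0.4
# 2 1     V5       25  0.65
# 3 1     V3       20  0.85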
Here A, B and C are example columns, and which.min identifies the column with the minimum value (the one that contributes least to the sum):
A <- c(5)
B <- c(6)
C <- c(4)
df <- cbind(A, B, C)
# index of the column with the smallest value
cond <- which.min(df)
mask <- df[, -cond]
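Applied once, this removes only the single smallest column. A sketch of repeating the idea until the kept columns account for at least 80% of the total (reusing the 1x5 df from the question) could look like this:
df <- data.frame(V1 = 5, V2 = 10, V3 = 20, V4 = 40, V5 = 25)
total <- sum(df)
vals <- unlist(df)
# drop the smallest value while the remainder still holds >= 80% of the total
while (sum(vals) - min(vals) >= 0.8 * total) {
  vals <- vals[-which.min(vals)]
}
df[names(vals)]
#   V3 V4 V5
# 1 20 40 25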

How to number unique pairs X,Y

Ok, so I have the following data.frame:
v1<-c(456,234,981,776,112,998)
v2<-c(981,112,456,998,234,776)
df<- data.frame(v1,v2)
I want to obtain an extra variable with a numeric code for each pair of v1 and v2 values. The trick is that I need to number them as unordered pairs, so, for example, (456,981) and (981,456) should both be numbered 1.
So the outcome would be something like this:
v1<-c(456,234,981,776,112,998)
v2<-c(981,112,456,998,234,776)
v3<-c(1,2,1,3,2,3)
df<- data.frame(v1,v2,v3)
You can sort rowwise and use match, i.e.
v1 <- do.call(paste, data.frame(t(apply(df, 1, sort))))
match(v1, unique(v1))
#[1] 1 2 1 3 2 3
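For the example data, printing the sorted-and-pasted keys makes the match step transparent: equal keys get the same position in unique(v1).
v1
# [1] "456 981" "112 234" "456 981" "776 998" "112 234" "776 998"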
How about this using dplyr? Basically, you sort the two values within each row. Not sure whether it is more efficient; it is obviously a lot more lines.
library(dplyr)
df <- data.frame(v1,v2)
# Sort the v1 and v2 elements within each row
df.new <- df %>%
  mutate(z1 = pmin(v1, v2),
         z2 = pmax(v1, v2))
# Build a distinct coding table
df.codes <- df.new %>%
  distinct(z1, z2) %>%
  mutate(v3 = 1:n())
# Join it back together
df.new %>%
  left_join(df.codes, by = c("z1", "z2")) %>%
  select(v1, v2, v3)
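For the example data this should reproduce the desired v3:
#    v1  v2 v3
# 1 456 981  1
# 2 234 112  2
# 3 981 456  1
# 4 776 998  3
# 5 112 234  2
# 6 998 776  3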

R dedupe records that are not exactly duplicates

I have a list of records that I need to dedupe. The rows are combinations drawn from the same set of values, but the usual duplicate-removal functions do not work because the two columns are not exact duplicates. Below is a reproducible example.
df <- data.frame( A = c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),
B = c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"))
Below is the desired output that I'm looking for.
df_Final <- data.frame( A = c("2","2","2","331","391","481"),
B = c("43","501","502","491","496","490"))
I guess the idea is that you want to find where the elements in column A first appear in column B:
idx = match(df$A, df$B)
and keep the row if the element in A isn't in B (is.na(idx)) or the element in A occurs before its first occurrence in B (seq_along(idx) < idx):
df[is.na(idx) | seq_along(idx) < idx,]
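For the example data this keeps exactly the six desired rows:
#     A   B
# 1   2  43
# 2   2 501
# 3   2 502
# 7 331 491
# 8 391 496
# 9 481 490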
Maybe a more-or-less literal tidyverse approach would be to create and then drop a temporary column:
library(tidyverse)
df %>%
  mutate(idx = match(A, B)) %>%
  filter(is.na(idx) | seq_along(idx) < idx) %>%
  select(-idx)
You can remove all rows which would be duplicates under some reordering with:
require(dplyr)
df %>%
  apply(1, sort) %>%
  t %>%
  data.frame %>%
  group_by_all %>%
  slice(1)
