I have a dataset that is wide across 10 sessions, and each session has ID#s for two team members. I want to paste the two ID#s together to form team IDs. I can do this with 10 mutate calls (one for each session), but I am trying to find a way to have 1 mutate inside a map or pmap.
A simple data example with only 2 sessions is
df2 <- data.frame( subj = c(1001,1002),
id1.s1 = c(21, 44),
id2.s1 = c(21, 55),
id1.s2 = c(23, 44),
id2.s2 = c(21, 77))
df2 <- df2 %>%
mutate(team.s1=paste(id1.s1, id2.s1, sep="-")) %>%
mutate(team.s2=paste(id1.s2, id2.s2, sep="-")) %>%
select(grep("subj|team", names(.)))
This gives
subj team.s1 team.s2
1 1001 21-21 23-21
2 1002 44-55 44-77
Is there a way to make a 3 element list with e1 = 10 team names, e2 = 10 ID#1, e3 = 10 ID#2 and use mutate inside of pmap? Or some other way that avoids 10 mutate lines?
I could not figure out how to get the data frame name into mutate
A solution based on tidyr's gather and spread functions. The separate function splits one column into several based on a separator pattern.
library(dplyr)
library(tidyr)
df2 <- df1 %>%
gather(ID_S, Value, -subj) %>%
separate(ID_S, into = c("ID", "S")) %>%
group_by(subj, S) %>%
summarise(Value = paste(Value, collapse = "-")) %>%
mutate(S = paste0("team.", S)) %>%
spread(S, Value) %>%
ungroup()
df2
# # A tibble: 2 x 3
# subj team.s1 team.s2
# * <dbl> <chr> <chr>
# 1 1001 21-21 23-21
# 2 1002 44-55 44-77
DATA
df1 <- data.frame( subj = c(1001,1002),
id1.s1 = c(21, 44),
id2.s1 = c(21, 55),
id1.s2 = c(23, 44),
id2.s2 = c(21, 77))
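In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider. The same idea can be sketched with those functions; this is an illustration on the example data, not a tested drop-in for the real 10-session data, and it assumes every column other than subj is named like id1.s1 (member, a dot, session). The names df and teams are made up for the sketch.

```r
library(dplyr)
library(tidyr)

# Example data from the question (df is an illustrative name)
df <- data.frame(subj = c(1001, 1002),
                 id1.s1 = c(21, 44), id2.s1 = c(21, 55),
                 id1.s2 = c(23, 44), id2.s2 = c(21, 77))

teams <- df %>%
  # "id1.s1" -> column id1 plus a session value "s1" (".value" keeps id1/id2 as columns)
  pivot_longer(-subj, names_to = c(".value", "session"), names_sep = "\\.") %>%
  # one paste for all sessions
  mutate(team = paste(id1, id2, sep = "-")) %>%
  # back to one column per session: team.s1, team.s2, ...
  pivot_wider(id_cols = subj, names_from = session, values_from = team,
              names_prefix = "team.")

teams
```

Because the session is parsed from the column name, the same pipeline should cover 10 sessions without any extra mutate lines.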
One option is to split the data frame by the suffix of the column names (the session, i.e. s1/s2), then paste the columns of each session together with do.call(paste, ...):
With tidyverse (version 1.2.1):
df2 %>%
split.default(sub('id[12]\\.(s[0-9]+)', '\\1', names(.))) %>%
map_dfc(~do.call(paste, c(sep="-", .)))
# A tibble: 2 x 3
# s1 s2 subj
# <chr> <chr> <chr>
#1 21-21 23-21 1001
#2 44-55 44-77 1002
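If you want to stay close to the "one paste inside a map" idea from the question and keep the team.s1 style names, something like the following might also work. It is only a sketch: the sessions vector and the names team_cols and teams are assumptions for illustration, and for the real data sessions would be paste0("s", 1:10).

```r
library(purrr)

df <- data.frame(subj = c(1001, 1002),
                 id1.s1 = c(21, 44), id2.s1 = c(21, 55),
                 id1.s2 = c(23, 44), id2.s2 = c(21, 77))

# Session suffixes to iterate over (assumed naming scheme: id1.<s>, id2.<s>)
sessions <- c("s1", "s2")

# One paste per session, looked up by column name
team_cols <- map(sessions, function(s) {
  paste(df[[paste0("id1.", s)]], df[[paste0("id2.", s)]], sep = "-")
})
names(team_cols) <- paste0("team.", sessions)

# Bind the new team columns next to subj
teams <- cbind(df["subj"], as.data.frame(team_cols, stringsAsFactors = FALSE))
teams
```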
I have two dataframes:
set.seed(1)
df1 <- data.frame(k1 = "AFD(1);Acf(2);Vgr7(2);"
,k2 = "ABC(7);BHG(46);TFG(675);")
df2 <- data.frame(site =c("AFD(1);AFD(2);", "Acf(2);", "TFG(677);",
"XX(275);", "ABC(7);", "ABC(9);")
,p1 = rnorm(6, mean = 5, sd = 2)
,p2 = rnorm(6, mean = 6.5, sd = 2))
The first dataframe is in fact a list of often very long strings, made of "elements". Each "element" is made of a few letters/numbers, followed by a number in brackets, followed by a semicolon. In this example I only put 3 "elements" into each string, but in my real dataframe there are tens to hundreds of them.
> df1
k1 k2
1 AFD(1);Acf(2);Vgr7(2); ABC(7);BHG(46);TFG(675);
The second dataframe shares some of the "elements" with df1. Its first column, called site, contains some (not all) "elements" from the first dataframe, sometimes the "element" forms the whole string, and sometimes is a part of a longer string:
> df2
site p1 p2
1 AFD(1);AFD(2); 4.043700 3.745881
2 Acf(2); 5.835883 5.670011
3 TFG(677); 7.717359 5.711420
4 XX(275); 4.794425 6.381373
5 ABC(7); 5.775343 8.700051
6 ABC(9); 4.892390 8.026351
I would like to filter the whole df2 using df2$site and each k column from df1 (there are many K columns, not all of them contain k in the names).
The easiest way to explain this is to show what the desired output would look like.
> outcome
k site p1 p2
1 k1 AFD(1);AFD(2); 4.043700 3.745881
2 k1 Acf(2); 5.835883 5.670011
3 k2 ABC(7); 5.775343 8.700051
The first column of the outcome dataframe corresponds to the column names in df1. The second column corresponds to the site column of df2 and contains only those sites that share at least one "element" with a df1 column. The other columns are from df2.
I appreciate that this question is made of two separate "problems", one grepping-related and one related to looping through df1 columns. I decided to show the task in its entirety in case there exists a solution that addresses both in one go.
FAILED SOLUTION 1
I can create a string to grep, but for each column separately:
# this replaces the semicolons with "|", but does not escape the brackets.
k1_pattern <- df1 %>%
select(k1) %>%
deframe() %>%
str_replace_all(";","|")
And then I am not sure how to use it. This (below) didn't work, maybe because I didn't escape brackets, but I am struggling with doing it:
k1_result <- df2 %>%
filter(grepl(pattern = k1_pattern, site))
But even if it did work, it would only deal with a single column from df1, and I have many, and would like to perform this operation on all df1 columns at the same time.
FAILED SOLUTION 2
I can create a list of sites to search in df2 from columns in df1:
k1_sites<- df1 %>%
select(k1) %>%
deframe() %>%
strsplit(., "[;]") %>%
unlist()
but the delimiter is lost here, and %in% cannot be used, as the match will sometimes be partial.
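On the escaping problem in the first attempt: one way to sidestep it is to avoid regular expressions altogether and match each "element" literally with fixed = TRUE. Note also that replacing every ";" with "|" leaves a trailing "|", which creates an empty alternative that matches every row. The following is a minimal base R sketch for a single column; the names k1, site, elements and hits are illustrative only.

```r
k1   <- "AFD(1);Acf(2);Vgr7(2);"
site <- c("AFD(1);AFD(2);", "Acf(2);", "TFG(677);",
          "XX(275);", "ABC(7);", "ABC(9);")

# Split into elements and put the delimiter back, so "AFD(1);" cannot match "AFD(10);"
elements <- paste0(strsplit(k1, ";", fixed = TRUE)[[1]], ";")

# fixed = TRUE treats brackets literally, so nothing needs escaping
hits <- Reduce(`|`, lapply(elements, function(el) grepl(el, site, fixed = TRUE)))

site[hits]
```

A caveat of this sketch: it is still a substring match, so an element such as "AFD(1);" would also hit a site containing "XAFD(1);". If that can occur in the real data, splitting both sides on ";" and matching exactly is safer.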
library(dplyr)
df2 %>%
mutate(site_list = strsplit(site, ";")) %>%
rowwise() %>%
filter(length(intersect(site_list,
unlist(strsplit(x = paste0(c(t(df1)), collapse=""),
split = ";")))) != 0) %>%
select(-site_list)
#> # A tibble: 3 x 3
#> # Rowwise:
#> site p1 p2
#> <chr> <dbl> <dbl>
#> 1 AFD(1);AFD(2); 3.75 7.47
#> 2 Acf(2); 5.37 7.98
#> 3 ABC(7); 5.66 9.52
Updated answer:
library(dplyr)
library(tidyr)
df1 %>%
tibble::rownames_to_column("id") %>%
pivot_longer(-id, names_to = "k", values_to = "site") %>%
separate_rows(site, sep = ";") %>%
filter(site != "") %>%
select(-id) -> df1_k
df2 %>%
tibble::rownames_to_column("id") %>%
separate_rows(site, sep = ";") %>%
full_join(., df1_k, by = c("site")) %>%
group_by(id) %>%
fill(k, .direction = "downup") %>%
filter(!is.na(id) & !is.na(k)) %>%
summarise(k = first(k),
site = paste0(site, collapse = ";"),
p1 = first(p1),
p2 = first(p2), .groups = "drop") %>%
select(-id)
#> # A tibble: 3 x 4
#> k site p1 p2
#> <chr> <chr> <dbl> <dbl>
#> 1 k1 AFD(1);AFD(2); 3.75 7.47
#> 2 k1 Acf(2); 5.37 7.98
#> 3 k2 ABC(7); 5.66 9.52
Here's a way going to a long format for exact matching (so no regex):
library(dplyr)
library(tidyr)
df1_long = df1 |> stack() |>
separate_rows(values, sep = ";") |>
filter(values != "")
df2 |>
mutate(id = row_number()) |>
separate_rows(site, sep = ";") |>
filter(site != "") |>
left_join(df1_long, by = c("site" = "values")) %>%
group_by(id) |>
filter(any(!is.na(ind))) %>%
summarize(
site = paste(site, collapse = ";"),
across(-site, \(x) first(na.omit(x)))
)
# # A tibble: 3 × 5
# id site p1 p2 ind
# <int> <chr> <dbl> <dbl> <fct>
# 1 1 AFD(1);AFD(2) 3.75 7.47 k1
# 2 2 Acf(2) 5.37 7.98 k1
# 3 5 ABC(7) 5.66 9.52 k2
Let's say I have a dataframe of scores
library(dplyr)
id <- c(1 , 2)
name <- c('John', 'Ninaa')
score1 <- c(8, 6)
score2 <- c(NA, 7)
df <- data.frame(id, name, score1, score2)
Some mistakes have been made so I want to correct them. My corrections are in a different dataframe.
id <- c(2,1)
column <- c('name', 'score2')
new_value <- c('Nina', 9)
corrections <- data.frame(id, column, new_value)
I want to search the dataframe for the correct id and column and change the value.
I have tried something with match, but I don't know how to mutate the correct column.
df %>% mutate(corrections$column = replace(corrections$column, match(corrections$id, id), corrections$new_value))
We could join by 'id', then mutate across the columns named in 'column', replacing the element in the row where the current column name (cur_column()) matches 'column' with the corresponding 'new_value':
library(dplyr)
df %>%
left_join(corrections) %>%
mutate(across(all_of(column), ~ replace(.x, match(cur_column(),
column), new_value[match(cur_column(), column)]))) %>%
select(names(df))
-output
id name score1 score2
1 1 John 8 9
2 2 Nina 6 7
Here is an implementation built on dplyr::rows_update, though it involves functions from several packages. In practice I would prefer a more parsimonious approach.
library(tidyverse)
corrections %>%
group_by(id) %>%
group_map(
~ pivot_wider(.x, names_from = column, values_from = new_value) %>% type_convert,
.keep = TRUE) %>%
reduce(rows_update, by = 'id', .init = df)
# id name score1 score2
# 1 1 John 8 9
# 2 2 Nina 6 7
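For a small corrections table, a plain loop over its rows may be the easiest thing to read and debug. This is a rough base R sketch on the example data, not a replacement for the answers above; it assumes each id occurs once in df and that numeric targets receive values that as.numeric can parse. The name fixed is made up for the sketch.

```r
df <- data.frame(id = c(1, 2), name = c("John", "Ninaa"),
                 score1 = c(8, 6), score2 = c(NA, 7),
                 stringsAsFactors = FALSE)
corrections <- data.frame(id = c(2, 1), column = c("name", "score2"),
                          new_value = c("Nina", "9"),
                          stringsAsFactors = FALSE)

fixed <- df
for (i in seq_len(nrow(corrections))) {
  row   <- match(corrections$id[i], fixed$id)  # row to correct
  col   <- corrections$column[i]               # column to correct
  value <- corrections$new_value[i]            # stored as character
  # Convert back to numeric when the target column is numeric
  if (is.numeric(fixed[[col]])) value <- as.numeric(value)
  fixed[row, col] <- value
}
fixed
```

Converting per cell keeps score2 numeric, whereas assigning the character "9" directly would turn the whole column into character.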
Here is a small sample from two tables which I joined on gender:
df1 <- tibble(gender = c("M", "M"), ID = c(1,2))
df2 <- tibble(gender = c("M", "M", "M"), ID = c(30, 40, 50))
When I do a left_join, person 1 can match any of persons 30, 40 and 50, and so can person 2. I am looking for exactly one match for each person on the left.
However if I group_by ID.x and slice the first option, then person 30 will be a match to both 1 and 2; and I need unique matches (i.e. each person on the left has one match on the right and vice versa). So in this case a better choice would be 1-30 and 2-40, e.g.
How can I achieve this efficiently with joins? The long way would be to loop and manually remove person 30 after it was found as a match for person 1, and so on.
You need to create another variable to join by.
df1 <- df1 %>% group_by(gender) %>% mutate(ind = 1:n()) %>% ungroup()
df2 <- df2 %>% group_by(gender) %>% mutate(ind = 1:n()) %>% ungroup()
left_join(df1, df2, by = c("gender", "ind")) %>%
select(-ind)
# # A tibble: 2 × 3
# gender ID.x ID.y
# <chr> <dbl> <dbl>
# 1 M 1 30
# 2 M 2 40
I have data that were collected over a year but are broken up by month. In my code, I labeled them df1-df12, one for each month. I am trying to use the group_by function to group all the dataframes in the same way. When I run the following code on a single dataframe, it works fine:
df <- df %>%
group_by(date,id) %>%
slice(n()) %>%
ungroup()
However, I would like to streamline this code so that I can use this function for all 12 dataframes without having to copy/paste 12 times, since there is a lot of data to go through. Here is what I have tried to do to that end:
func1<-function(df)
{
df <- df %>%
group_by(date,id) %>%
slice(n()) %>%
ungroup()
}
yr19<-c(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12)
map(yr19, func1)
However, I get the following error message: Error in UseMethod("group_by") :
no applicable method for 'group_by' applied to an object of class "character". As stated above, I don't get this error message if I go through and do it individually, but there are many months and many years to be analyzed, and from a time perspective I don't think running this code manually is feasible. Thanks for your help.
There are two ways you can approach this. The first uses the approach suggested by @ktiu:
## Create example data
library(dplyr) # for pipe and group_by()
set.seed(914)
df1 <- tibble(
date = sample(1:30, 50, replace = T),
id = sample(1:10, 50, replace = T),
var1 = rnorm(50, mean = 10, sd = 3)
)
df2 <- tibble(
date = sample(1:30, 50, replace = T),
id = sample(1:10, 50, replace = T),
var1 = rnorm(50, mean = 10, sd = 3)
)
Modifying your function to address error
func1<-function(df)
{
df <- df %>%
group_by(date,id) %>%
slice(n()) %>%
ungroup()
df
}
## And using list rather than c to combine data frames.
yr19 <- list(df1, df2)
yr19_data <- lapply(yr19, func1)
# This will return a list of data frames you can access with `yr19_data[[1]]`
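To see why c() triggers the error: a data frame is itself a list of columns, so c(df1, df2) concatenates the columns into one flat list, and map then hands func1 a single column (here a character vector) rather than a data frame. A small sketch with made-up toy data shows the difference; the names flattened and kept are illustrative.

```r
df1 <- data.frame(date = 1:2, id = c("a", "b"), stringsAsFactors = FALSE)
df2 <- data.frame(date = 3:4, id = c("c", "d"), stringsAsFactors = FALSE)

# c() splices the columns of both data frames into one flat list
flattened <- c(df1, df2)
length(flattened)      # one element per column, not per data frame
class(flattened[[2]])  # a bare column vector

# list() keeps each data frame intact as one element
kept <- list(df1, df2)
length(kept)
class(kept[[1]])
```

With 12 monthly data frames, mget(paste0("df", 1:12)) is one way to build the list without typing every name, assuming the objects exist in the calling environment.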
An alternative approach is to add a variable identifying the source data frame, then collapse everything into a single data frame and manipulate it from there. Which approach makes more sense will depend on what else you want to do later.
func2 <- function(df.name){
mutate(get(df.name), source = df.name)
}
# This is set up to get objects given their names, so we'll use a character vector
# of names to iterate off of.
yr19 = c("df1", "df2")
df.list <- lapply(yr19, func2)
df.long <- do.call(bind_rows, df.list)
df.long
# # A tibble: 100 x 4
# date id var1 source
# <int> <int> <dbl> <chr>
# 1 27 9 9.31 df1
# 2 5 3 16.5 df1
# 3 28 3 2.67 df1
# 4 24 4 8.94 df1
# 5 13 3 1.68 df1
At this point you can manipulate one data frame in your original pipe:
df <- df.long %>%
group_by(source, date,id) %>%
slice(n()) %>%
ungroup()
df
# # A tibble: 93 x 4
# date id var1 source
# <int> <int> <dbl> <chr>
# 1 1 8 9.89 df1
# 2 2 4 10.9 df1
# 3 4 3 8.45 df1
# 4 5 3 16.5 df1
# 5 5 7 10.6 df1
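As a variation on the same idea, bind_rows can add the source column itself through its .id argument when it is given a named list, which avoids get(). This is a sketch on small made-up data (so the result is easy to check by eye), not on the simulated data above; the names combined and last_rows are illustrative.

```r
library(dplyr)

df1 <- tibble(date = c(1, 1, 2), id = c(1, 1, 1), var1 = c(10, 11, 12))
df2 <- tibble(date = c(1, 1),    id = c(2, 2),    var1 = c(20, 21))

# The list names become the values of the "source" column
combined <- bind_rows(list(df1 = df1, df2 = df2), .id = "source")

# Same pipeline as before: keep the last row of each source/date/id group
last_rows <- combined %>%
  group_by(source, date, id) %>%
  slice(n()) %>%
  ungroup()

last_rows
```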
This question is slightly modified from this one.
I have a dataframe in long table format like this:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
ID name value
1 a broad
1 c 50
1 a mangrove
1 c 50
1 a mangrove
1 c 50
2 a coniferous
2 c 50
About the data: The value 50 in the second row corresponds to the value broad in the first row. Similarly, the value 50 in the fourth row corresponds to the value mangrove in the third row, and so on. In simple words, each value for name c belongs to the value for name a in the row above it.
I want to combine the value in such a way that I could get the corresponding values for each name, which would also aggregate the values with similar names:
df2 <- data.frame(ID=c(1,1,2),
name=c("c_broad","c_mangrove","c_coniferous"),
value=c(50,100,50))
which should look like this:
ID name value
1 c_broad 50
1 c_mangrove 100
2 c_coniferous 50
Using reshape2:
library(reshape2)
df1$grp = cumsum(df1$name == "a")
df2 = dcast(df1, ID + grp ~ name)
df2$c = as.numeric(df2$c)
aggregate(c ~ ID + a, df2, sum)
ID a c
1 1 broad 50
2 2 coniferous 50
3 1 mangrove 100
Column names can be changed if desired, also "c_" can be added to the names with paste.
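The renaming step mentioned above could look roughly like this. The sketch starts from a hand-typed copy of the aggregate output (called res here, an illustrative name), so it only demonstrates the paste and reorder, not the reshaping itself.

```r
# Hand-typed copy of the aggregate() result shown above
res <- data.frame(ID = c(1, 2, 1),
                  a  = c("broad", "coniferous", "mangrove"),
                  c  = c(50, 50, 100),
                  stringsAsFactors = FALSE)

# Prefix the category with "c_" and rename columns to match the desired output
out <- data.frame(ID = res$ID,
                  name = paste0("c_", res$a),
                  value = res$c,
                  stringsAsFactors = FALSE)

# Order by ID as in the question
out <- out[order(out$ID), ]
rownames(out) <- NULL
out
```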
Using tidyverse:
library(dplyr) # needed for the %>% pipe; the functions themselves are called with dplyr::
value_a <- df1 %>% dplyr::filter(name=="a") %>% dplyr::pull(value)
df1 %>%
dplyr::filter(name=="c") %>% #Modify into a sensible data frame from here
dplyr::mutate(a = value_a,
name = stringr::str_c(name, "_" ,a)) %>%
dplyr::select(-a) %>% # to here
dplyr::group_by(ID, name) %>%
dplyr::summarise(value=sum(as.numeric(value)))
# A tibble: 3 x 3
# Groups: ID [2]
ID name value
<dbl> <chr> <dbl>
1 1 c_broad 50
2 1 c_mangrove 100
3 2 c_coniferous 50
The main problem with your dataframe is that a single column contains both names and values, and that is the first thing you should fix. My advice is to always convert the original dataframe into a tidy format (https://tidyr.tidyverse.org/articles/tidy-data.html) and from there leverage all the power of the tidyverse, data.table, or your framework of choice.
Note that the temporary variable value_a could be created inside the pipeline directly; I have kept it separate for clarity. The main idea is to separate values and species into different columns (the first three calls in the pipeline) and then apply the usual tidyverse operations.
Might not be the most elegant, but it works:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50)
)
df1 %>% group_by( 1+floor((1:n()-1)/2) ) %>%
summarize(
ID = ID[1],
name = paste0( name[2], "_", value[1] ),
value = as.numeric(value[2])
) %>% ungroup %>% select( -1 ) %>% group_by(name) %>%
mutate( value = sum(value) ) %>%
unique
Here is something improved that is actually human-readable:
i <- seq( 1, nrow(df1), 2 )
df1 %>% summarise(
ID = ID[i],
name = paste0( name[i+1], "_", value[i] ),
value = as.numeric(value[i+1])
) %>% group_by(name) %>%
summarize(
ID=ID[1], value = sum( value )
) %>% arrange(ID)
Base R solution:
# Nullify numeric values belonging to a grouping category: grps => character vector
grps <- gsub("\\d+", NA, df1$value)
# Interpolate NA values using prior string value: a => character vector
df1$a <- na.omit(grps)[cumsum(!(is.na(grps)))]
# Split-Apply-Combine aggregation: data.frame => stdout(console)
data.frame(do.call(rbind, lapply(with(df1, split(df1, a)), function(x){
y <- transform(subset(x, !grepl("\\D+", value)), value = as.numeric(value))
setNames(
aggregate(value ~ ID + a, y, FUN = function(z){sum(z, na.rm = TRUE)}),
c("ID", "a", "c")
)
}
)
),
row.names = NULL
)
An additional option:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
library(tidyverse)
df1 %>%
pivot_wider(id_cols = ID, names_from = name, values_from = value) %>%
unnest(c("a", "c")) %>%
group_by(ID, name = a) %>%
summarise(value = sum(as.numeric(c), na.rm = T), .groups = "drop")
#> # A tibble: 3 x 3
#> ID name value
#> <dbl> <chr> <dbl>
#> 1 1 broad 50
#> 2 1 mangrove 100
#> 3 2 coniferous 50
Created on 2021-04-12 by the reprex package (v2.0.0)