Gather multiple columns into multiple key columns with tidyr

I'm not sure whether tidyr::gather can take multiple columns and gather them into multiple key columns.
Similar questions have been asked but they all refer to gathering multiple columns in one key column.
I'm trying to gather 4 columns into 2 key and 2 value columns like in the following example:
Sample data:
df <- data.frame(
subject = c("a", "b"),
age1 = c(33, 35),
age2 = c(43, 45),
weight1 = c(90, 67),
weight2 = c(70, 87)
)
subject age1 age2 weight1 weight2
1 a 33 43 90 70
2 b 35 45 67 87
Desired result:
dfe <- data.frame(
subject = c("a", "a", "b", "b"),
age = c("age1", "age2", "age1", "age2"),
age_values = c(33, 43, 35, 45),
weight = c("weight1", "weight2", "weight1", "weight2"),
weight_values = c(90, 70, 67, 87)
)
subject age age_values weight weight_values
1 a age1 33 weight1 90
2 a age2 43 weight2 70
3 b age1 35 weight1 67
4 b age2 45 weight2 87

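Here's one way to do it: gather each pair of columns separately, then keep only the rows where the trailing numbers match.
library(dplyr)
library(tidyr)
df %>%
  gather(key = "age", value = "age_values", age1, age2) %>%
  gather(key = "weight", value = "weight_values", weight1, weight2) %>%
  # compare the digit after "age" with the digit after "weight"
  filter(substring(age, 4) == substring(weight, 7))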
subject age age_values weight weight_values
1 a age1 33 weight1 90
2 b age1 35 weight1 67
3 a age2 43 weight2 70
4 b age2 45 weight2 87

Here's one approach. The idea is to use gather, then split the resulting dataframe by variable (age and weight), do the mutate operations separately for each of the two dataframes, and finally merge the dataframes back together using subject and the variable number (1 or 2).
library(dplyr)
library(tidyr)
library(purrr)
df %>%
gather(age1:weight2, key = key, value = value) %>%
separate(key, sep = -1, into = c("var", "num")) %>%
split(.$var) %>%
map(~mutate(., !!.$var[1] := paste0(var, num), !!paste0(.$var[1], "_values") := value)) %>%
map(~select(., -var, -value)) %>%
Reduce(f = merge, x = .) %>%
select(-num)
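For comparison, here is a minimal sketch of the same reshaping with pivot_longer, which supersedes gather in current tidyr. It assumes every column name is a lowercase prefix followed by a number; the helper column num and the regular expression are illustrative choices, not part of the answers above.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  subject = c("a", "b"),
  age1 = c(33, 35),
  age2 = c(43, 45),
  weight1 = c(90, 67),
  weight2 = c(70, 87)
)

res <- df %>%
  # ".value" sends the prefix (age / weight) to column names,
  # while the trailing number goes into the helper column `num`
  pivot_longer(-subject,
               names_to = c(".value", "num"),
               names_pattern = "([a-z]+)(\\d+)") %>%
  # keep the measurements under the *_values names first ...
  rename(age_values = age, weight_values = weight) %>%
  # ... then rebuild the key columns from the number
  mutate(age = paste0("age", num),
         weight = paste0("weight", num)) %>%
  select(subject, age, age_values, weight, weight_values)

res
```

Because both measurements are pivoted in one step, no cross join followed by a filter is needed.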

Related

R: Double Pivots Using DPLYR?

I am working with the R programming language.
I have a dataset that looks something like this:
x = c("GROUP", "A", "B", "C")
date_1 = c("CLASS 1", 20, 60, 82)
date_1_1 = c("CLASS 2", 37, 22, 8)
date_2 = c("CLASS 1", 15,100,76)
date_2_1 = c("CLASS 2", 84, 18,88)
my_data = data.frame(x, date_1, date_1_1, date_2, date_2_1)
x date_1 date_1_1 date_2 date_2_1
1 GROUP CLASS 1 CLASS 2 CLASS 1 CLASS 2
2 A 20 37 15 84
3 B 60 22 100 18
4 C 82 8 76 88
I am trying to restructure the data so it looks like this:
Note: in the real Excel data, date_1 is the same date as date_1_1 and date_2 is the same as date_2_1. R won't accept duplicate column names, so I named them differently.
Currently, I am doing this manually in Excel using different "transpose" functions, but I am wondering if there is a way to do this in R (possibly using the dplyr library).
I have been trying to read different tutorial websites online (Pivoting), but so far nothing seems to match the problem I am trying to work on.
Can someone please show me how to do this?
Thanks!
I made some assumptions about your data because of the duplicate column names. For example, suppose the column header pattern is CLASS_ClassNum_Date:
df<-data.frame(GROUP = c("A", "B", "C"),
CLASS_1_1 = c(20, 60, 82),
CLASS_2_1 = c(37, 22, 8),
CLASS_1_2 = c(15,100,76),
CLASS_2_2 = c(84, 18,88))
library(tidyr)
pivot_longer(df, -GROUP,
names_pattern = "(CLASS_.*)_(.*)",
names_to = c(".value", "Date"))
GROUP Date CLASS_1 CLASS_2
<chr> <chr> <dbl> <dbl>
1 A 1 20 37
2 A 2 15 84
3 B 1 60 22
4 B 2 100 18
5 C 1 82 8
6 C 2 76 88
Edit: Substantially improved pivot_longer by using names_pattern= correctly
There are lots of ways to achieve your desired outcome, but I don't believe there is an 'easy'/'simple' way. Here is one potential solution:
library(tidyverse)
library(vctrs)
x = c("GROUP", "A", "B", "C")
date_1 = c("CLASS 1", 20, 60, 82)
date_1_1 = c("CLASS 2", 37, 22, 8)
date_2 = c("CLASS 1", 15,100,76)
date_2_1 = c("CLASS 2", 84, 18,88)
my_data = data.frame(x, date_1, date_1_1, date_2, date_2_1)
# Combine column names with the names in the first row
colnames(my_data) <- paste(my_data[1,], colnames(my_data), sep = "-")
my_data %>%
filter(`GROUP-x` != "GROUP") %>% # remove first row (info now in column names)
pivot_longer(everything(), # pivot the data
names_to = c(".value", "Date"),
names_sep = "-") %>%
mutate(GROUP = vec_fill_missing(GROUP, # fill NAs in GROUP introduced by pivoting
direction = "downup")) %>%
filter(Date != "x") %>% # remove "unneeded" rows
mutate(`CLASS 2` = vec_fill_missing(`CLASS 2`, # fill NAs again
direction = "downup")) %>%
na.omit() %>% # remove any remaining NAs
mutate(across(starts_with("CLASS"), ~as.numeric(.x)),
Date = str_extract(Date, "\\d+")) %>%
rename("date" = "Date", # rename the columns
"group" = "GROUP",
"count_class_1" = `CLASS 1`,
"count_class_2" = `CLASS 2`) %>%
arrange(date) # arrange by "date" to get your desired output
#> # A tibble: 6 × 4
#> date group count_class_1 count_class_2
#> <chr> <chr> <dbl> <dbl>
#> 1 1 A 20 37
#> 2 1 B 60 84
#> 3 1 C 82 18
#> 4 2 A 15 37
#> 5 2 B 100 22
#> 6 2 C 76 8
Created on 2022-12-09 with reprex v2.0.2
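Note that in the printed output above, count_class_2 for group B at date 1 is 84, while the input pairs B's 60 with 22. The sketch below is an alternative that keeps each CLASS 1 value paired with the CLASS 2 value from the same date. It assumes the first row always holds the class labels and that the date number is the first run of digits in each column name; class_labels, date_ids and values are illustrative names.

```r
library(dplyr)
library(tidyr)

my_data <- data.frame(
  x        = c("GROUP", "A", "B", "C"),
  date_1   = c("CLASS 1", 20, 60, 82),
  date_1_1 = c("CLASS 2", 37, 22, 8),
  date_2   = c("CLASS 1", 15, 100, 76),
  date_2_1 = c("CLASS 2", 84, 18, 88),
  stringsAsFactors = FALSE
)

# class labels come from the first row, date numbers from the column names
class_labels <- as.character(unlist(my_data[1, -1]))
date_ids <- sub("^date_(\\d+).*$", "\\1", names(my_data)[-1])

# drop the label row and rename columns to "<class>-<date>"
values <- my_data[-1, ]
names(values) <- c("group", paste(class_labels, date_ids, sep = "-"))

res <- values %>%
  pivot_longer(-group,
               names_to = c(".value", "date"),
               names_sep = "-") %>%
  # the columns were read as character because of the label row
  mutate(across(starts_with("CLASS"), as.numeric)) %>%
  arrange(date, group)

res
```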

Performing pivot_longer() over multiple sets of columns

I am stuck in performing pivot_longer() over multiple sets of columns. Here is the sample dataset
df <- data.frame(
id = c(1, 2),
uid = c("m1", "m2"),
germ_kg = c(23, 24),
mineral_kg = c(12, 17),
perc_germ = c(45, 34),
perc_mineral = c(78, 10))
I need the output dataframe to look like this
out <- data.frame(
id = c(1, 1, 2, 2),
uid = c("m1", "m1", "m2", "m2"),
crop = c("germ", "mineral", "germ", "mineral"),
kg = c(23, 12, 24, 17),
perc = c(45, 78, 34, 10))
library(tidyverse)
df %>%
  # move "kg" to the front so every name reads <measure>_<crop>
  rename_with(~str_replace(.x, '(.*)_kg', 'kg_\\1')) %>%
  pivot_longer(-c(id, uid), names_to = c('.value', 'crop'), names_sep = '_')
# A tibble: 4 x 5
id uid crop kg perc
<dbl> <chr> <chr> <dbl> <dbl>
1 1 m1 germ 23 45
2 1 m1 mineral 12 78
3 2 m2 germ 24 34
4 2 m2 mineral 17 10
If you were to use data.table:
library(data.table)
melt(setDT(df), c('id', 'uid'), patterns(kg = 'kg', perc = 'perc'))
id uid variable kg perc
1: 1 m1 1 23 45
2: 2 m2 1 24 34
3: 1 m1 2 12 78
4: 2 m2 2 17 10
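In the melt output above, variable holds only the position of each column within its pattern (1 or 2), not the crop name. A minimal sketch of mapping it back, assuming the columns matched by each pattern appear in the same crop order (germ, then mineral); the names crop_index and crops are illustrative.

```r
library(data.table)

df <- data.frame(
  id = c(1, 2),
  uid = c("m1", "m2"),
  germ_kg = c(23, 24),
  mineral_kg = c(12, 17),
  perc_germ = c(45, 34),
  perc_mineral = c(78, 10)
)

long <- melt(as.data.table(df),
             id.vars = c("id", "uid"),
             measure.vars = patterns("_kg$", "^perc_"),
             value.name = c("kg", "perc"),
             variable.name = "crop_index")

# crop_index is a factor with levels "1", "2"; use its integer codes as an index
crops <- c("germ", "mineral")
long[, crop := crops[as.integer(crop_index)]]
long[, crop_index := NULL]
setorder(long, id, crop)

long
```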
I suspect there might be a simpler way using pivot_longer_spec, but one tricky thing here is that your column names don't have a consistent ordering of their semantic components. @Onyambu's answer deals with this nicely by fixing it upstream.
library(tidyverse)
df %>%
pivot_longer(-c(id, uid)) %>%
separate(name, c("col1", "col2")) %>% # only needed
mutate(crop = if_else(col2 == "kg", col1, col2), # because name
meas = if_else(col2 == "kg", col2, col1)) %>% # structure
select(id, uid, crop, meas, value) %>% # is
pivot_wider(names_from = meas, values_from = value) # inconsistent
# A tibble: 4 x 5
id uid crop kg perc
<dbl> <chr> <chr> <dbl> <dbl>
1 1 m1 germ 23 45
2 1 m1 mineral 12 78
3 2 m2 germ 24 34
4 2 m2 mineral 17 10
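One way the pivot_longer_spec idea could look is sketched below. The spec is written out by hand, which sidesteps the inconsistent name ordering at the cost of listing every column; the table contents are an assumption based on the sample data only.

```r
library(tidyr)
library(tibble)

df <- data.frame(
  id = c(1, 2),
  uid = c("m1", "m2"),
  germ_kg = c(23, 24),
  mineral_kg = c(12, 17),
  perc_germ = c(45, 34),
  perc_mineral = c(78, 10)
)

# .name  = existing column, .value = output column, crop = new key column
spec <- tribble(
  ~.name,         ~.value, ~crop,
  "germ_kg",      "kg",    "germ",
  "mineral_kg",   "kg",    "mineral",
  "perc_germ",    "perc",  "germ",
  "perc_mineral", "perc",  "mineral"
)

res <- pivot_longer_spec(df, spec)
res
```

For wide data with many columns, the spec could also be generated with build_longer_spec() and then edited, rather than typed by hand.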

How to iterate over dataframe using character vector and calculate the mean for matching items in R

I have a character vector and want to iterate over some dataframes and consider the matching characters and note the corresponding values and finally take the average of all the values and store in results in a new dataframe.
Below is the sample example:
ip <- c("John", "Amanda", "Aaron", "Peter", "Jolie")
dfs <- data.frame("names" = c('John','Peter','jucy'), "value1" = c(21, 24, 26), "value2" = c(20, 23, 32))
dfg <- data.frame("names" = c('Justin','John','Jill'), "value1" = c(35, 11, 10), "value2" = c(10, 28, 27))
dft <- data.frame("names" = c('Louis','Chan','John'), "value1" = c(42, 74, 26), "value2" = c(26, 53, 54))
dfr <- data.frame("names" = c('Ale','Terry','Tom'), "value1" = c(61, 34, 76), "value2" = c(28, 63, 38))
dfm <- data.frame("names" = c('Sam','Jolie','Peter'), "value1" = c(11, 84, 86), "value2" = c(50, 13, 68))
Expected output:
names value1 value2
John 19.33 34
Peter 55 45.5
Jolie 84 13
For John value1 = mean(c(21, 11, 26)) = 19.33 and value2 = mean(c(20, 28, 54)) = 34
Similarly, for Peter value1 = mean(c(24, 86)) = 55 and value2 = mean(c(23,68)) = 45.5
We can collect the datasets in a list with mget, bind them together with bind_rows, and then compute the mean by group:
library(dplyr)
out <- mget(ls(pattern = '^df[sgtrms]$')) %>%
  bind_rows() %>%
  group_by(names) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
out
# A tibble: 12 x 3
# names value1 value2
# <chr> <dbl> <dbl>
# 1 Ale 61 28
# 2 Chan 74 53
# 3 Jill 10 27
# 4 John 19.3 34
# 5 Jolie 84 13
# 6 jucy 26 32
# 7 Justin 35 10
# 8 Louis 42 26
# 9 Peter 55 45.5
#10 Sam 11 50
#11 Terry 34 63
#12 Tom 76 38
If we need to filter based on ip:
mget(ls(pattern = '^df[sgtrms]$')) %>%
  bind_rows() %>%
  filter(names %in% ip) %>%
  group_by(names) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
# A tibble: 3 x 3
# names value1 value2
# <chr> <dbl> <dbl>
#1 John 19.3 34
#2 Jolie 84 13
#3 Peter 55 45.5
Or using base R with aggregate
aggregate(.~ names, subset(do.call(rbind,
mget(ls(pattern = "^df[sgtrms]$"))), names %in% ip), mean)
Another base R approach would be binding all your dataframes, apply a filter based on ip vector and finally aggregate with mean():
#Vector
ip <- c("John", "Amanda", "Aaron", "Peter", "Jolie")
#Data
dfs <- data.frame("names" = c('John','Peter','jucy'),
"value1" = c(21, 24, 26), "value2" = c(20, 23, 32),stringsAsFactors = F)
dfg <- data.frame("names" = c('Justin','John','Jill'),
"value1" = c(35, 11, 10), "value2" = c(10, 28, 27),stringsAsFactors = F)
dft <- data.frame("names" = c('Louis','Chan','John'),
"value1" = c(42, 74, 26), "value2" = c(26, 53, 54),stringsAsFactors = F)
dfr <- data.frame("names" = c('Ale','Terry','Tom'),
"value1" = c(61, 34, 76), "value2" = c(28, 63, 38),stringsAsFactors = F)
dfm <- data.frame("names" = c('Sam','Jolie','Peter'),
"value1" = c(11, 84, 86), "value2" = c(50, 13, 68),stringsAsFactors = F)
#Bind all
dfmacro <- rbind(dfs,dfg,dft,dfr,dfm)
#Filter based on ip
dfmacro2 <- dfmacro[dfmacro$names %in% ip,]
#Aggregate
aggregate(cbind(value1,value2)~names,data=dfmacro2,mean)
Output:
names value1 value2
1 John 19.33333 34.0
2 Jolie 84.00000 13.0
3 Peter 55.00000 45.5
For the sake of completeness, here are two variants which use data.table.
The OP has requested to iterate over the dataframes, to extract the desired rows by matching names, and to compute the mean of values for each name across all extracted rows.
All answers posted so far suggest a different order of operations which combines the dataframes first, then extracts the desired rows and aggregates by name, finally.
The two variants suggested in this answer take the same route.
library(data.table)
rbindlist(list(dfs, dfg, dft, dfr, dfm))[
names %chin% ip, lapply(.SD, mean), keyby = names]
names value1 value2
1: John 19.33333 34.0
2: Jolie 84.00000 13.0
3: Peter 55.00000 45.5
rbindlist() combines all rows, names %chin% ip picks the desired rows, lapply(.SD, mean) computes the means across all columns except the names column which is used for grouping.
An alternative approach aggregates in a join:
library(data.table)
rbindlist(list(dfs, dfg, dft, dfr, dfm))[
.(ip), on = .(names = V1), nomatch = NULL, lapply(.SD, mean), keyby = .EACHI]
Here, the combined rows are joined with ip, and non-matching rows are dropped (nomatch = NULL). Within the join, the data are grouped and aggregated simultaneously.
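For contrast, the order of operations the question literally describes (iterate over the dataframes, extract the matching rows from each, then average) can be sketched in base R as follows. The list name df_list and the intermediate names are illustrative; the result should agree with the combine-first variants.

```r
ip <- c("John", "Amanda", "Aaron", "Peter", "Jolie")

dfs <- data.frame(names = c("John", "Peter", "jucy"), value1 = c(21, 24, 26), value2 = c(20, 23, 32))
dfg <- data.frame(names = c("Justin", "John", "Jill"), value1 = c(35, 11, 10), value2 = c(10, 28, 27))
dft <- data.frame(names = c("Louis", "Chan", "John"), value1 = c(42, 74, 26), value2 = c(26, 53, 54))
dfr <- data.frame(names = c("Ale", "Terry", "Tom"), value1 = c(61, 34, 76), value2 = c(28, 63, 38))
dfm <- data.frame(names = c("Sam", "Jolie", "Peter"), value1 = c(11, 84, 86), value2 = c(50, 13, 68))

df_list <- list(dfs, dfg, dft, dfr, dfm)

# step 1: iterate over the dataframes and keep only the rows matching ip
matched <- lapply(df_list, function(d) d[d$names %in% ip, ])

# step 2: stack the extracted rows (dataframes without a match contribute zero rows)
combined <- do.call(rbind, matched)

# step 3: average each value column per name
res <- aggregate(cbind(value1, value2) ~ names, data = combined, FUN = mean)
res
```

Keeping the dataframes in an explicit list also avoids relying on the mget(ls(pattern = ...)) lookup, which picks up any object in the workspace whose name happens to match.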

How to combine multiple data frame columns in R

I have a .csv file with demographic data for my participants. The data are coded and downloaded from my study database (REDCap) in a way that each race has its own separate column. That is, each participant has a value in each of these columns (1 if endorsed, 0 if unendorsed).
It looks something like this:
SubjID Sex Age White AA Asian Other
001 F 62 0 1 0 0
002 M 66 1 0 0 0
I have to use a roundabout way to get my demographic summary stats. There's gotta be a simpler way to do this. My question is, how can I combine these columns into one column so that there is only one value for race for each participant? (i.e. recoding so 1 = white, 2 = AA, etc, and only the endorsed category is being pulled for each participant and added to this column?)
This is what I would like for it to look:
SubjID Sex Age Race
001 F 62 2
002 M 66 1
This is more or less similar to our approach with similar data from REDCap. We use pivot_longer for dummy variables. The final Race variable could also be made a factor. Please let me know if this is what you had in mind.
Edit: The Race variable is now converted to a factor inside pivot_longer (instead of with a separate mutate). With tidyr 1.1.0 or later this is done with names_transform; older versions accepted names_ptypes = list(Race = factor()) for the same purpose.
library(tidyverse)
df <- data.frame(
  SubjID = c("001", "002"),
  Sex = c("F", "M"),
  Age = c(62, 66),
  White = c(0, 1),
  AA = c(1, 0),
  Asian = c(0, 0),
  Other = c(0, 0)
)
df %>%
  pivot_longer(cols = c("White", "AA", "Asian", "Other"),
               names_to = "Race",
               names_transform = list(Race = as.factor),
               values_to = "Value") %>%
filter(Value == 1) %>%
select(-Value)
Result:
# A tibble: 2 x 4
SubjID Sex Age Race
<fct> <fct> <dbl> <fct>
1 001 F 62 AA
2 002 M 66 White
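If the numeric coding from the question (1 = White, 2 = AA, and so on) is also needed, one option is to fix the factor levels in that order and take the integer codes. This is a sketch under that assumption; race_levels and Race_number are illustrative names.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  SubjID = c("001", "002"),
  Sex = c("F", "M"),
  Age = c(62, 66),
  White = c(0, 1),
  AA = c(1, 0),
  Asian = c(0, 0),
  Other = c(0, 0)
)

# the order of the levels defines the numeric code: White = 1, AA = 2, ...
race_levels <- c("White", "AA", "Asian", "Other")

res <- df %>%
  pivot_longer(cols = all_of(race_levels),
               names_to = "Race", values_to = "Value") %>%
  filter(Value == 1) %>%
  mutate(Race = factor(Race, levels = race_levels),
         Race_number = as.integer(Race)) %>%
  select(-Value)

res
```

A participant who endorsed more than one category would appear on several rows here, so it may be worth checking for duplicated SubjID values afterwards.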
Here is another approach using reshape2
df[df == 0] <- NA
df <- reshape2::melt(df, measure.vars = c("White", "AA", "Asian", "Other"), variable.name = "Race", na.rm = TRUE)
df <- subset(df, select = -value)
# SubjID Sex Age Race
# 002 M 66 White
# 001 F 62 AA
Here's a base approach:
race_cols <- 4:7
ind <- max.col(df[, race_cols])
df$Race_number <- ind
df$Race <- names(df[, race_cols])[ind]
df[, -race_cols]
SubjID Sex Age Race_number Race
1 001 F 62 2 AA
2 002 M 66 1 White
Data from @Ben:
df <- data.frame(
SubjID = c("001", "002"),
Sex = c("F", "M"),
Age = c(62, 66),
White = c(0, 1),
AA = c(1, 0),
Asian = c(0, 0),
Other = c(0, 0)
)

How to convert diagonal rows into single row in R? [duplicate]

I have a dataset1 which is as follows:
dataset1 <- data.frame(
id1 = c(1, 1, 1, 2, 2, 2),
id2 = c(122, 122, 122, 133, 133, 133),
num1 = c(1, NA, NA, 50,NA, NA),
num2 = c(NA, 2, NA, NA, 45, NA),
num3 = c(NA, NA, 3, NA, NA, 4)
)
How to convert multiple rows into a single row?
The desired output is:
id1, id2, num1, num2, num3
1 122 1 2 3
2 133 50 45 4
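library(dplyr)
dataset1 %>%
  group_by(id1, id2) %>%
  # keep the non-NA value of each column within each (id1, id2) group
  summarise(across(everything(), ~ .x[!is.na(.x)]), .groups = "drop") %>%
  as.data.frame()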
# id1 id2 num1 num2 num3
# 1 1 122 1 2 3
# 2 2 133 50 45 4
Note: this assumes there is exactly one non-NA value per column within each group.
Using data.table
library(data.table)
data.table(dataset1)[, lapply(.SD, sum, na.rm = TRUE), by = c("id1", "id2")]
# id1 id2 num1 num2 num3
#1: 1 122 1 2 3
#2: 2 133 50 45 4
You can use dplyr to achieve that:
library(dplyr)
dataset1 %>%
group_by(id1, id2) %>%
mutate(
num1 = sum(num1, na.rm=T),
num2 = sum(num2, na.rm=T),
num3 = sum(num3, na.rm=T)
) %>%
distinct()
Output:
This also assumes that repeated values within a group should be summed: if id1 = 1 has two values for num1, the result is their sum. If you're confident that every id has only one value for each of num1 to num3, you don't need to worry about this.
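If summing is not wanted, a minimal sketch that instead takes the first non-missing value per group is shown below. It assumes that taking the first value is an acceptable way to resolve repeats. Unlike sum(..., na.rm = TRUE), which returns 0 for a group with no values, this returns NA in that case.

```r
library(dplyr)

dataset1 <- data.frame(
  id1 = c(1, 1, 1, 2, 2, 2),
  id2 = c(122, 122, 122, 133, 133, 133),
  num1 = c(1, NA, NA, 50, NA, NA),
  num2 = c(NA, 2, NA, NA, 45, NA),
  num3 = c(NA, NA, 3, NA, NA, 4)
)

res <- dataset1 %>%
  group_by(id1, id2) %>%
  # indexing with [1] yields NA when a group has no non-missing value
  summarise(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")

res
```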
