I am trying to write Tidyverse code that finds two rows that have exactly matching values on two conditions. The rows should match on Participant_ID & Indicator. There should be no more than two rows that match exactly on these two values. In this pair of matches, one will have occurred at timepoint 1 and the other at timepoint 4. After the matches are identified, I want the score at timepoint 4 to be subtracted from the score at timepoint 1. I would also like to preserve the Group number in the final tibble.
There will be rows that don't have matches. Those can be omitted, if possible. I don't want them in the resulting tibble.
I am having trouble wrapping my head around this, so thank you very much for your help!
example <- tibble (
Participant_ID = c('Part1','Part2','Part1','Part2','Part1','Part2','Part1','Part2'),
Indicator =c('item1','item1','item1','item1','item2','item2','item2','item2'),
Timepoint = c(1,1,4,4,1,1,4,4),
Score = c(3,3,1.5,3,4,4,3.5,3.5),
Group = c(1,2,1,2,1,2,1,2))
example %>%
pivot_wider(id_cols = c(Participant_ID, Indicator, Group), names_from = Timepoint, values_from = Score) %>%
transmute(Participant_ID, Indicator, Group, Score = `1` - `4`)
# A tibble: 4 x 4
# Participant_ID Indicator Group Score
# <chr> <chr> <dbl> <dbl>
# 1 Part1 item1 1 1.5
# 2 Part2 item1 2 0
# 3 Part1 item2 1 0.5
# 4 Part2 item2 2 0.5
Data
example <- structure(list(Participant_ID = c("Part1", "Part2", "Part1", "Part2", "Part1", "Part2", "Part1", "Part2"), Indicator = c("item1", "item1", "item1", "item1", "item2", "item2", "item2", "item2"), Timepoint = c(1, 1, 4, 4, 1, 1, 4, 4), Score = c(3, 3, 1.5, 3, 4, 4, 3.5, 3.5), Group = c(1, 2, 1, 2, 1, 2, 1, 2)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))
One tidyverse approach is to use pivot_wider() from tidyr to place each pair of matches in one row, then calculate the difference between the two scores:
example %>%
pivot_wider(id_cols = c(Participant_ID, Indicator), values_from = Score, names_from = Timepoint, names_prefix = "Score_Timepoint_") %>%
mutate(Score_difference = Score_Timepoint_1 - Score_Timepoint_4)
This produces:
# A tibble: 4 x 5
Participant_ID Indicator Score_Timepoint_1 Score_Timepoint_4 Score_difference
<chr> <chr> <dbl> <dbl> <dbl>
1 Part1 item1 3 1.5 1.5
2 Part2 item1 3 3 0
3 Part1 item2 4 3.5 0.5
4 Part2 item2 4 3.5 0.5
You could arrange the data by descending Timepoint and then use diff by group.
library(dplyr)
example %>%
arrange(Participant_ID, Indicator, desc(Timepoint)) %>%
group_by(Participant_ID, Indicator) %>%
summarise(Score = diff(Score))
# Participant_ID Indicator Score
# <chr> <chr> <dbl>
#1 Part1 item1 1.5
#2 Part1 item2 0.5
#3 Part2 item1 0
#4 Part2 item2 0.5
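The question also asks that unmatched rows be dropped and that Group be kept. A minimal sketch of one way to do that, assuming each matched pair has exactly one row at timepoint 1 and one at timepoint 4; `example2` and its unmatched `Part3` row are hypothetical and added only for illustration:

```r
library(dplyr)

# Hypothetical data: same columns as `example`, plus one unmatched row (Part3)
example2 <- tibble(
  Participant_ID = c("Part1", "Part1", "Part2", "Part2", "Part3"),
  Indicator      = c("item1", "item1", "item1", "item1", "item1"),
  Timepoint      = c(1, 4, 1, 4, 1),
  Score          = c(3, 1.5, 3, 3, 2),
  Group          = c(1, 1, 2, 2, 1)
)

result <- example2 %>%
  group_by(Participant_ID, Indicator, Group) %>%
  # keep only groups with exactly one row at timepoint 1 and one at timepoint 4
  filter(sum(Timepoint == 1) == 1, sum(Timepoint == 4) == 1) %>%
  # score at timepoint 1 minus score at timepoint 4
  summarise(Score = Score[Timepoint == 1] - Score[Timepoint == 4],
            .groups = "drop")

result
```

Grouping by Group as well keeps it in the result; this only makes sense if Group is constant within a participant, as it is in the example data.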
Related
Let's say I have data like this:
Number  ID
1.5     X
2.4     X
3.1     Y
3.2     Y
My desired output is
ID  1 < x < 2  2 < x < 3  3 < x < 4
X   1          1          0
Y   0          0          2
What would be an efficient approach to creating this?
We can create a column with cut and then use pivot_wider to reshape from 'long' to 'wide' format
library(dplyr)
library(tidyr)
df1 %>%
mutate(grp = cut(Number, breaks = c(1, 2, 3, 4),
labels = c("1<x<2", "2<x<3", "3<x<4"))) %>%
pivot_wider(names_from = grp, values_from= Number,
values_fn = length, values_fill = 0)
-output
# A tibble: 2 × 4
ID `1<x<2` `2<x<3` `3<x<4`
<chr> <int> <int> <int>
1 X 1 1 0
2 Y 0 0 2
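One detail worth knowing about cut(): by default its intervals are open on the left and closed on the right, so a value of exactly 2 lands in the bin labelled "1<x<2" and a value of exactly 1 becomes NA. A small base R sketch of the same binning, with table() doing the counting instead of pivot_wider(); the vectors simply repeat the question's data:

```r
# Same values as in the question
Number <- c(1.5, 2.4, 3.1, 3.2)
ID     <- c("X", "X", "Y", "Y")

breaks <- c(1, 2, 3, 4)
labs   <- c("1<x<2", "2<x<3", "3<x<4")

# Bin each number; intervals are (1,2], (2,3], (3,4] by default
grp <- cut(Number, breaks = breaks, labels = labs)

# Cross-tabulate ID against bin to get the counts
counts <- table(ID, grp)
counts

# Boundary behaviour: 2 falls in the first bin, 1 falls outside all bins
on_upper_edge <- as.character(cut(2, breaks = breaks, labels = labs))
on_lower_edge <- cut(1, breaks = breaks, labels = labs)
```

If the boundaries matter for your data, the right and include.lowest arguments of cut() control which side is closed.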
data
df1 <- structure(list(Number = c(1.5, 2.4, 3.1, 3.2), ID = c("X", "X",
"Y", "Y")), class = "data.frame", row.names = c(NA, -4L))
I have a dataset that has employees' capacity each month, and I want to get a total for each employee across all months:
library(dplyr)
data <- tibble(employee = c("Justin", "Corey","Sibley", "Justin", "Corey","Sibley"),
education = c("graudate", "student", "student", "graudate", "student", "student"),
fte_max_capacity = c(1, 2, 3, 1, 2, 3),
project = c("big", "medium", "small", "medium", "small", "small"),
aug_2021 = c(1, 1, 1, 1, 1, 1),
sep_2021 = c(1, 1, 1, 1, 1, 1),
oct_2021 = c(1, 1, 1, 1, 1, 1),
nov_2021 = c(1, 1, 1, 1, 1, 1))
I've tried following using the code found here, but I get this error:
data %>%
dplyr::select(-contains("project")) %>%
dplyr::group_by(employee) %>%
mutate(sum = rowSums(select(., vars(contains("_20")))))
Error: Problem with `mutate()` input `sum`.
x Must subset columns with a valid subscript vector.
x Subscript has the wrong type `quosures`.
ℹ It must be numeric or character.
ℹ Input `sum` is `rowSums(select(., vars(contains("_20"))))`.
ℹ The error occurred in group 1: employee = "Corey".
I also tried a modified version of the solution from this website, but I also get an error, despite all the relevant columns being numeric:
data %>%
dplyr::select(-contains("project")) %>%
dplyr::group_by(employee) %>%
mutate_at(vars(contains("_20"), rowSums(., na.rm = T)))
Error: 'x' must be numeric
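The data are grouped, so use cur_data() inside select(): it returns the current group's rows without the grouping column, whereas . refers to the whole data frame, grouping column included, which causes the error. Also drop vars(), which select() does not accept (it is the source of the `quosures` error). In dplyr 1.1.0 and later, cur_data() is deprecated in favor of pick().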
library(dplyr)
data %>%
dplyr::select(-contains("project")) %>%
dplyr::group_by(employee) %>%
dplyr::mutate(sum = sum(rowSums(select(cur_data(), contains("_20"))))) %>%
ungroup
-output
# A tibble: 6 x 8
employee education fte_max_capacity aug_2021 sep_2021 oct_2021 nov_2021 sum
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Justin graudate 1 1 1 1 1 8
2 Corey student 2 1 1 1 1 8
3 Sibley student 3 1 1 1 1 8
4 Justin graudate 1 1 1 1 1 8
5 Corey student 2 1 1 1 1 8
6 Sibley student 3 1 1 1 1 8
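An alternative that sidesteps the grouped-data problem is to compute the per-row total before grouping, then sum that total within each employee. This is only a sketch on a smaller, hypothetical dataset (`capacity`, with two month columns) so that the numbers are easy to check by hand:

```r
library(dplyr)

# Hypothetical, smaller version of the question's data
capacity <- tibble(
  employee = c("Justin", "Corey", "Justin", "Corey"),
  aug_2021 = c(1, 0.5, 1, 0.5),
  sep_2021 = c(1, 0.5, 0, 1)
)

result <- capacity %>%
  # the data are not grouped yet, so `.` holds only what we expect
  mutate(row_total = rowSums(select(., contains("_20")))) %>%
  group_by(employee) %>%
  # total across all rows (projects) of the same employee
  mutate(sum = sum(row_total)) %>%
  ungroup()

result
```

Here Justin's rows total 2 and 1, so his sum is 3; Corey's rows total 1 and 1.5, so his sum is 2.5.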
I have a data frame that looks like this:
structure(list(value1 = c(1, 2, 3, 4, 5), value2 = c(1, 2, 2,
2, 2), value3 = c(1, 1, 2, 3, 4)), class = "data.frame", row.names = c("apple1",
"apple2", "orange1", "orange2", "plum"))
         value1  value2  value3
apple1   1       1       1
apple2   2       2       1
orange1  3       2       2
orange2  4       2       3
plum     5       2       4
Now I want to compute the mean of every column, grouped by the first part of the row names
(for example, the mean of value1 for the apple group, regardless of the apple number).
I figured out that something like this works:
y<-x[grep("apple",row.names(x)),]
mean(y$value1)
mean(y$value2)
mean(y$value3)
y<-x[grep("orange",row.names(x)),]
mean(y$value1)
mean(y$value2)
mean(y$value3)
y<-x[grep("plum",row.names(x)),]
mean(y$value1)
mean(y$value2)
mean(y$value3)
but for a bigger dataset, this is going to take ages, so I was wondering if there is a more efficient way to subset the data based on the first part of the row name and calculating the mean afterward.
Using tidyverse:
library(tidyverse)
df %>%
tibble::rownames_to_column("row") %>%
dplyr::mutate(row = str_remove(row, "\\d+")) %>%
dplyr::group_by(row) %>%
dplyr::summarize(across(where(is.numeric), ~ mean(.)), .groups = "drop")
In base R you could do:
df$row <- gsub("\\d+", "", rownames(df))
data.frame(do.call(cbind, lapply(df[,1:3], function(x) by(x, df$row, mean))))
Output
row value1 value2 value3
* <chr> <dbl> <dbl> <dbl>
1 apple 1.5 1.5 1
2 orange 3.5 2 2.5
3 plum 5 2 4
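Another base R option, shown here only as a sketch, is aggregate(): strip the trailing digits from the row names to get a grouping vector, then take the mean of every column per group. The variable names `grp` and `res` are just illustrative:

```r
df <- data.frame(
  value1 = 1:5,
  value2 = c(1, 2, 2, 2, 2),
  value3 = c(1, 1, 2, 3, 4),
  row.names = c("apple1", "apple2", "orange1", "orange2", "plum")
)

# Grouping key: row name with any trailing digits removed
grp <- sub("\\d+$", "", rownames(df))

# Mean of each column within each group
res <- aggregate(df, by = list(row = grp), FUN = mean)
res
```

This avoids adding a helper column to df and returns a plain data frame with one row per group.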
Data
df <- structure(list(value1 = 1:5, value2 = c(1, 2, 2, 2, 2), value3 = c(1,
1, 2, 3, 4)), class = "data.frame", row.names = c("apple1", "apple2",
"orange1", "orange2", "plum"))
I have a table like the following:
A, B, C
1, Yes, 3
1, No, 2
2, Yes, 4
2, No, 6
etc
I want to convert it to:
A, Yes, No
1, 3, 2
2, 4, 6
I have tried using:
dat <- dat %>%
spread(B, C) %>%
group_by(A)
However, now I have a bunch of NA values. Is it possible to use pivot_longer to do this instead?
We can use pivot_wider
library(tidyr)
pivot_wider(dat, names_from = B, values_from = C)
-output
# A tibble: 2 x 3
# A Yes No
# <dbl> <dbl> <dbl>
#1 1 3 2
#2 2 4 6
If there are duplicate rows, one option is to add a sequence column (here with rowid() from data.table) so that each row is uniquely identified before pivoting:
library(data.table)
library(dplyr)
dat1 <- bind_rows(dat, dat) # // example with duplicates
dat1 %>%
mutate(rn = rowid(B)) %>%
pivot_wider(names_from = B, values_from = C) %>%
select(-rn)
-output
# A tibble: 4 x 3
# A Yes No
# <dbl> <dbl> <dbl>
#1 1 3 2
#2 2 4 6
#3 1 3 2
#4 2 4 6
data
dat <- structure(list(A = c(1, 1, 2, 2), B = c("Yes", "No", "Yes", "No"
), C = c(3, 2, 4, 6)), class = "data.frame", row.names = c(NA,
-4L))
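One possible cause of the NA values in the question (this is a guess, since the full table is not shown) is an extra column that is unique per row, which both spread() and pivot_wider() treat as part of the row identifier. The sketch below uses a hypothetical `row_id` column to reproduce that and shows that restricting id_cols avoids it:

```r
library(tidyr)

# Hypothetical: the same table with an extra per-row identifier
dat2 <- data.frame(
  row_id = 1:4,
  A = c(1, 1, 2, 2),
  B = c("Yes", "No", "Yes", "No"),
  C = c(3, 2, 4, 6)
)

# row_id is kept as an identifier, so every input row gets its own output row
with_na <- pivot_wider(dat2, names_from = B, values_from = C)

# Only A identifies a row, so Yes and No end up side by side
fixed <- pivot_wider(dat2, id_cols = A, names_from = B, values_from = C)

with_na
fixed
```

If your real data has no such column, the NAs more likely come from combinations of A and B that are simply missing.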
My data frame looks like this:
id A T C G ref var
1 1 10 15 7 0 A C
2 2 11 9 2 3 A G
3 3 2 31 1 12 T C
I'd like to create two new columns, ref_count and var_count, which will have the following values:
Value from A column and value from C column, since ref is A and var is C
Value from A column and value from G column, since ref is A and var is G
etc.
So I'd like to select a column based on the value in another column for each row.
Thanks!
We can use pivot_longer to reshape into 'long' format, filter the rows and then reshape it to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = A:G) %>%
filter(name == ref | name == var) %>%
# label each kept row by its role, not by column order, so ref and var cannot be swapped
mutate(nm1 = if_else(name == ref, 'ref_count', 'var_count')) %>%
select(id, value, nm1) %>%
pivot_wider(names_from = nm1, values_from = value) %>%
left_join(df1, ., by = "id")
# A tibble: 3 x 9
# id A T C G ref var ref_count var_count
#* <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
#1 1 10 15 7 0 A C 10 7
#2 2 11 9 2 3 A G 11 3
#3 3 2 31 1 12 T C 31 1
Or in base R, we can also make use of the vectorized row/column indexing
df1$ref_count <- as.matrix(df1[2:5])[cbind(seq_len(nrow(df1)), match(df1$ref, names(df1)[2:5]))]
df1$var_count <- as.matrix(df1[2:5])[cbind(seq_len(nrow(df1)), match(df1$var, names(df1)[2:5]))]
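The base R version relies on indexing a matrix with a two-column matrix, where each row of the index is a (row, column) pair. Here is the same idea unpacked into steps, as a sketch; the intermediate names `m`, `rows`, `ref_col` and `var_col` are only for illustration:

```r
df1 <- data.frame(
  id = 1:3,
  A = c(10, 11, 2), T = c(15, 9, 31), C = c(7, 2, 1), G = c(0, 3, 12),
  ref = c("A", "A", "T"),
  var = c("C", "G", "C")
)

# Numeric matrix of the four count columns
m <- as.matrix(df1[c("A", "T", "C", "G")])

rows    <- seq_len(nrow(df1))            # 1, 2, 3
ref_col <- match(df1$ref, colnames(m))   # column position named by ref
var_col <- match(df1$var, colnames(m))   # column position named by var

# Each row of cbind(rows, col) picks one cell of m
df1$ref_count <- m[cbind(rows, ref_col)]
df1$var_count <- m[cbind(rows, var_col)]

df1
```

Because everything is vectorized, this scales well to many rows, and it does not depend on the order of the A, T, C, G columns.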
data
df1 <- structure(list(id = 1:3, A = c(10, 11, 2), T = c(15, 9, 31),
C = c(7, 2, 1), G = c(0, 3, 12), ref = c("A", "A", "T"),
var = c("C", "G", "C")), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
The following is a tidyverse alternative without creating a long dataframe that needs filtering. It essentially uses tidyr::nest() to nest the dataframe by rows, after which the correct column can be selected for each row.
library(dplyr)
library(tidyr)
library(purrr)
df1 %>%
nest(data = -id) %>%
mutate(
data = map(
data,
~mutate(., refcount = .[[ref]], var_count = .[[var]])
)
) %>%
unnest(data)
#> # A tibble: 3 × 9
#> id A T C G ref var refcount var_count
#> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 10 15 7 0 A C 10 7
#> 2 2 11 9 2 3 A G 11 3
#> 3 3 2 31 1 12 T C 31 1
A variant of this does not need the (assumed row-specific) id column but defines the nested groups from the unique values of ref and var directly:
df1 %>%
nest(data = -c(ref, var)) %>%
mutate(
data = pmap(
list(data, ref, var),
function(df, ref, var) {
mutate(df, refcount = df[[ref]], var_count = df[[var]])
}
)
) %>%
unnest(data)
The data were specified by akrun:
df1 <- structure(list(id = 1:3, A = c(10, 11, 2), T = c(15, 9, 31),
C = c(7, 2, 1), G = c(0, 3, 12), ref = c("A", "A", "T"),
var = c("C", "G", "C")), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))