Spreading a dataframe with two grouping columns - r

I have a data set of teachers as follows:
df <- data.frame(
teacher = c("A", "A", "A", "A", "B", "B", "C", 'C'),
seg = c("1", '1', "2", "2", "1", "2", "1", "2"),
claim = c(
"beth",
'john',
'john',
'beth',
'summer',
'summer',
"hannah",
"hannah"
)
)
I would ideally like to spread my dataset like this:
Desired output.
Any ideas for how I can use either spread or pivot_wider to achieve this? The issue is that there are two grouping variables here (teacher and segment). Some teachers have the same segment more than once, but others don't.

One option would be to create a sequence column grouped by 'teacher' and 'seg', and then use pivot_wider:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
group_by(teacher, seg) %>%
mutate(segN = c("", "double")[row_number()]) %>%
ungroup %>%
mutate(seg = str_c("seg", seg, segN)) %>%
select(-segN) %>%
pivot_wider(names_from = seg, values_from = claim)
# A tibble: 3 x 5
# teacher seg1 seg1double seg2 seg2double
# <fct> <fct> <fct> <fct> <fct>
#1 A beth john john beth
#2 B summer <NA> summer <NA>
#3 C hannah <NA> hannah <NA>
It can be simplified with rowid from data.table
library(data.table)
df %>%
mutate(seg = str_c('seg', c('', '_double')[rowid(teacher, seg)], seg)) %>%
pivot_wider(names_from = seg, values_from = claim)
#or use spread
# spread(seg, claim)
# teacher seg1 seg_double1 seg2 seg_double2
#1 A beth john john beth
#2 B summer <NA> summer <NA>
#3 C hannah <NA> hannah <NA>
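Note that the c("", "double")[row_number()] lookup only covers up to two rows per teacher/seg pair; a third duplicate would index past the end of the vector and produce NA. A minimal sketch of one way around that, assuming a numeric suffix such as seg1_2 is acceptable in place of "double" (the suffix format and the helper column n are my own choices, not part of the answer above):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  teacher = c("A", "A", "A", "A", "B", "B", "C", "C"),
  seg = c("1", "1", "2", "2", "1", "2", "1", "2"),
  claim = c("beth", "john", "john", "beth", "summer", "summer", "hannah", "hannah")
)

wide <- df %>%
  group_by(teacher, seg) %>%
  mutate(n = row_number()) %>%   # 1, 2, 3, ... within each teacher/seg pair
  ungroup() %>%
  # first occurrence keeps the plain name, later ones get a numeric suffix
  mutate(seg = if_else(n == 1L, paste0("seg", seg), paste0("seg", seg, "_", n))) %>%
  select(-n) %>%
  pivot_wider(names_from = seg, values_from = claim)

wide
```

With the example data this gives the columns teacher, seg1, seg1_2, seg2 and seg2_2, and it keeps working if a teacher has three or more rows for the same segment.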

You can also use a base R way with the powerful reshape function and some minor data preparation
# find duplicate values
dups <- duplicated(df[, 1:2])
# assign new names to duplicates
df[dups, 2] <- paste0(df[dups, 2], "double")
# use base r reshape function that automatically builds suitable names
wide <- reshape(df, v.names = "claim", idvar = "teacher",
timevar = "seg", direction = "wide", sep = "")
# change varnames to the desired output
names(wide) <- gsub("claim", "seg", names(wide))
wide

Related

How do I apply a transformation to pairs of columns across a dataframe?

I have a dataframe akin to the following:
Date        England  ...3  Wales  ...5
2021-04-01  145      e     23
2021-04-02  200      e     90     s
2021-04-03  120            66     e
There are 365 rows.
The columns continue such that they alternate between a location (e.g., England) and a generic column name (e.g., ...3). There are 190 locations in total, so there are 381 columns. I want to use the unite() function from tidyverse to merge each pair of columns with a "-" separator, e.g., "145-e" for the first row for England and "23- " for Wales. The date column should remain as it is. Is there a way to loop this so that each of the 190 pairs is united? Thank you.
One option would be to first rename your columns in a consistent manner. Then reshape your data to long which makes it easy to apply tidyr::unite and if desired convert back to wide format afterwards:
# Rename columns
names(dat)[seq(3, ncol(dat), by = 2)] <- paste(names(dat)[seq(2, ncol(dat), by = 2)], "2", sep = "_")
names(dat)[seq(2, ncol(dat), by = 2)] <- paste(names(dat)[seq(2, ncol(dat), by = 2)], "1", sep = "_")
library(tidyr)
dat |>
pivot_longer(-Date, names_to = c("location", ".value"), names_sep = "_") |>
replace_na(list(`2` = "")) |>
unite(value, `1`, `2`, sep = "-") |>
pivot_wider(names_from = location, values_from = value)
#> # A tibble: 3 × 3
#> Date England Wales
#> <chr> <chr> <chr>
#> 1 2021-04-01 145-e 23-
#> 2 2021-04-02 200-e 90-s
#> 3 2021-04-03 120- 66-e
DATA
dat <- data.frame(
Date = c("2021-04-01", "2021-04-02", "2021-04-03"),
England = c(145L, 200L, 120L),
...3 = c("e", "e", NA),
Wales = c(23L, 90L, 66L),
...5 = c(NA, "s", "e")
)
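If you would rather not reshape at all, a plain loop over the column pairs also works. This is only a sketch in base R, assuming the layout described in the question (Date first, then strictly alternating value/flag columns); the names value_cols, flag and out are made up for the example:

```r
dat <- data.frame(
  Date = c("2021-04-01", "2021-04-02", "2021-04-03"),
  v1 = c(145L, 200L, 120L),
  f1 = c("e", "e", NA),
  v2 = c(23L, 90L, 66L),
  f2 = c(NA, "s", "e")
)
# set the names as strings so the "...3" style names are kept verbatim
names(dat) <- c("Date", "England", "...3", "Wales", "...5")

value_cols <- seq(2, ncol(dat), by = 2)  # positions of the location columns
out <- dat["Date"]                       # start from the Date column only
for (i in value_cols) {
  flag <- dat[[i + 1]]                   # the generic column right of the location
  flag[is.na(flag)] <- ""                # treat a missing flag as empty
  out[[names(dat)[i]]] <- paste(dat[[i]], flag, sep = "-")
}
out
```

The result has one column per location (England, Wales, ...) holding the pasted values, with Date left untouched.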

How do I split the value of a column based on a lookup table?

I have a dataset with ids and associated values:
df <- data.frame(id = c("1", "2", "3"), value = c("12", "20", "16"))
I have a lookup table that matches the id to another reference label ref:
lookup <- data.frame(id = c("1", "1", "1", "2", "2", "3", "3", "3", "3"), ref = c("a", "b", "c", "a", "d", "d", "e", "f", "a"))
Note that id to ref is a many-to-many match: the same id can be associated with multiple ref, and the same ref can be associated with multiple id.
I'm trying to split the value associated with the df$id column equally into the associated ref columns. The output dataset would look like:
output <- data.frame(ref = c("a", "b", "c", "d", "e", "f"), value = c("18", "4", "4", "14", "4", "4"))
ref  value
a    18
b    4
c    4
d    14
e    4
f    4
I tried splitting this into four steps:
1. calling pivot_wider on lookup, turning rows with the same id value into columns (e.g., a, b, c)
2. merging the two datasets based on id
3. dividing each df$value equally into the a, b, c, etc. columns that are not empty
4. transposing the dataset and summing across the id columns.
I can't figure out how to make step (3) work, though, and I suspect there's a much easier approach.
A variation of @thelatemail's answer with base pipes.
merge(df, lookup) |> type.convert(as.is=TRUE) |>
transform(value=ave(value, id, FUN=\(x) x/length(x))) |>
with(aggregate(list(value=value), list(ref=ref), sum))
# ref value
# 1 a 18
# 2 b 4
# 3 c 4
# 4 d 14
# 5 e 4
# 6 f 4
Here's a potential logic. Merge value from df into lookup by id, divide value by number of matching rows, then group by ref and sum. Then take your pick of how you want to do it.
Base R
tmp <- merge(lookup, df, by="id", all.x=TRUE)
tmp$value <- ave(as.numeric(tmp$value), tmp$id, FUN=\(x) x/length(x) )
aggregate(value ~ ref, tmp, sum)
dplyr
library(dplyr)
lookup %>%
left_join(df, by="id") %>%
group_by(id) %>%
mutate(value = as.numeric(value) / n() ) %>%
group_by(ref) %>%
summarise(value = sum(value))
data.table
library(data.table)
setDT(df)
setDT(lookup)
lookup[df, on="id", value := as.numeric(value)/.N, by=.EACHI][
, .(value = sum(value)), by=ref]
# ref value
#1: a 18
#2: b 4
#3: c 4
#4: d 14
#5: e 4
#6: f 4
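To see where the numbers come from, it can help to compute the per-id share first and then sum the shares by ref. The following is just an illustrative base R sketch of that same logic (the names share and res are my own): id 1 has three refs, so each gets 12/3 = 4; id 2 has two, so each gets 20/2 = 10; id 3 has four, so each gets 16/4 = 4.

```r
df <- data.frame(id = c("1", "2", "3"), value = c("12", "20", "16"))
lookup <- data.frame(
  id = c("1", "1", "1", "2", "2", "3", "3", "3", "3"),
  ref = c("a", "b", "c", "a", "d", "d", "e", "f", "a")
)

# share of each id's value that goes to every one of its refs
share <- as.numeric(df$value) / as.vector(table(lookup$id)[df$id])
names(share) <- df$id

# look up the share for every lookup row by id, then sum within each ref
res <- tapply(share[lookup$id], lookup$ref, sum)
res
```

So ref a collects 4 + 10 + 4 = 18 and ref d collects 10 + 4 = 14, which matches the expected output.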
This may work
lookup %>%
left_join(lookup %>%
group_by(id) %>%
summarise(n = n()) %>%
left_join(df, by = "id") %>%
mutate(value = as.numeric(value)) %>%
mutate(repl = value/n) %>%
select(id, repl) ,
by = "id"
) %>% select(ref, repl) %>%
group_by(ref) %>% summarise(value = sum(repl))
ref value
<chr> <dbl>
1 a 18
2 b 4
3 c 4
4 d 14
5 e 4
6 f 4

How to fill a dataframe from another one in R?

I want to fill df2 with information from df1.
df1 as below
ID Mutation
1 A
2 B
2 C
3 A
df2 as below
ID A B C
1
2
3
For example, if mutation A is found in ID 1, then I want it marked as "Y" in df2.
So the df2 result should be
ID A B C
1  Y
2    Y Y
3  Y
I have hundreds of IDs and more than 20 mutations. How can I efficiently achieve this in R? Thanks!
Using data.table you can try
library(data.table)
setDT(df)
df2 <- dcast(df,formula = ID~Mutation )
df2[, c("A", "B", "C") := lapply(.SD, function(x) ifelse(is.na(x), " ", "Y")), ID]
df2
# Output
#    ID A B C
# 1:  1 Y
# 2:  2   Y Y
# 3:  3 Y
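A base R alternative, as a rough sketch: cross-tabulate ID against Mutation and turn every non-zero cell into "Y". The object names tab, marks and result are only for illustration, and the example data mirrors the question's df1.

```r
df <- data.frame(ID = c(1L, 2L, 2L, 3L), Mutation = c("A", "B", "C", "A"))

# counts of each ID/Mutation combination as a plain integer matrix
tab <- unclass(table(df$ID, df$Mutation))

# "Y" where the combination occurs at least once, empty string otherwise
marks <- ifelse(tab > 0, "Y", "")

result <- data.frame(ID = as.integer(rownames(marks)), marks, row.names = NULL)
result
```

Because the table is built from whatever values occur in Mutation, this should scale to 20 or more mutations without listing the column names by hand.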
Create a new column with value 'Y' and cast the data in wide format.
library(dplyr)
library(tidyr)
df %>%
mutate(value = 'Y') %>%
pivot_wider(names_from = Mutation, values_from = value, values_fill = '')
# ID A B C
# <int> <chr> <chr> <chr>
#1 1 "Y" "" ""
#2 2 "" "Y" "Y"
#3 3 "Y" "" ""
data
df <- structure(list(ID = c(1L, 2L, 2L, 3L), Mutation = c("A", "B",
"C", "A")), class = "data.frame", row.names = c(NA, -4L))

How to reshape dataframe by date in R

I have the following dataframe:
df1 <- data.frame(
date = c("14-Mar-20", "14-Mar-20", "14-Mar-20", "15-Mar-20", "15-Mar-20", "15-Mar-20"),
status = c("new", "progress", "completed", "new", "progress", "completed"),
count = c("1", "2", "3", "4", "5", "6"),
stringsAsFactors = FALSE
)
I want to reshape it into the following format:
How can I do so? I am trying to use "melt" function but I am unable to make any headway!
We can use pivot_wider from tidyr
library(dplyr)
library(tidyr)
df1 %>%
pivot_wider(names_from = status, values_from = count)
# A tibble: 2 x 4
# date new progress completed
# <chr> <chr> <chr> <chr>
#1 14-Mar-20 1 2 3
#2 15-Mar-20 4 5 6
dcast from data.table:
library(data.table)
setDT(df1)
dcast(df1, date ~ status, value.var = 'count')
Here is a base R solution using reshape
res <- reshape(df1,direction = "wide",idvar = "date",timevar = "status")
res
#        date count.new count.progress count.completed
# 1 14-Mar-20         1              2               3
# 4 15-Mar-20         4              5               6

How do I count the frequency of a character within a string, by a group?

My data.frame contains information on the movements completed by an individual and a string (of alpha characters) that represents these movements in a database. It is structured as follows:
MovementAnalysis <- structure(list(Strings = c("AaB", "cZhH", "Bb", "bAc"), Descriptor = c("Jog/ Stop/ Turn", "Change/ Shuffle/ Backwards/ Jump", "Turn/ Duck", "Duck/ Jog/ Change"), Person = c("Sally", "Sally", "Ben", "Ben")), .Names = c("Strings", "Descriptor", "Person"), row.names = c(NA, 4L), class = "data.frame")
I wish to capture the frequency of each alpha letter (for example: A, a, B, b) within all the Strings for each Person. There are 48 alpha upper and lower case letters. My actual data.frame contains the movements of 100 + individuals, so a quick solution to iterate over each individual would be ideal. As an example, my anticipated output would be:
Output <- structure(list(Person = c("Sally", "Sally", "Sally", "Sally", "Ben", "Ben", "Ben", "Ben"), Letter = c("A", "a", "B", "b", "A", "a", "B", "b"), Frequency = c(1, 1, 1, 0, 1, 0, 1, 2)), .Names = c("Person", "Letter", "Frequency"), row.names = c(NA, 8L), class = "data.frame")
Thank you!
One option is using data.table
library(data.table)
# df1 is assumed to be the question's MovementAnalysis data
df1 <- as.data.table(MovementAnalysis)
df2 <- df1[,list(Letter={
tmp <- unlist(strsplit(Strings, ''))
factor(tmp[tmp %in% c("A", "a", "B", "b")],
levels=c("A", "a", "B", "b"))}) , Person]
df2[, ind:="Frequency"]
dcast(df2, Person+Letter~ind, value.var="Letter", length, drop=FALSE)
# Person Letter Frequency
#1: Ben A 1
#2: Ben a 0
#3: Ben B 1
#4: Ben b 2
#5: Sally A 1
#6: Sally a 1
#7: Sally B 1
#8: Sally b 0
Less wizardry than akrun's answer, but I think it works:
your.func <- function(data) {
require(dplyr)
bag.of.letters <- function(strings) {
concat.string <- paste(strings, collapse='')
all.chars.vec <- unlist(strsplit(concat.string,""))
result <- data.frame(table(factor(all.chars.vec,levels = c(letters,LETTERS))))
colnames(result) <- c("Letter","Frequency")
result[order(result[["Letter"]]),]
}
lapply(X = unique(data[["Person"]]),
FUN = function(n) {
strings = data %>% filter(Person == n) %>% .[["Strings"]]
data.frame(Person = n, bag.of.letters(strings))
}) %>% do.call(rbind,.)
}
your.func(MovementAnalysis)
If you want to have only letters with positive Frequency in your Letter column, remove the factor(..., levels = c(letters,LETTERS)) part.
Here's an option using cSplit_e from my "splitstackshape" package. I've combined it with "magrittr" so that you can walk through the steps without having to store any intermediate objects or create a long nested expression.
The first option shows how to get the "wide" form, as described by @alistaire.
library(splitstackshape)
library(magrittr)
data.table(subset(MovementAnalysis, select = -Descriptor)) %>%
cSplit_e("Strings", "", type = "character", drop = TRUE, fill = 0) %>%
.[, lapply(.SD, sum), by = Person] %>%
subset(select = grep("Person|_[AaBb]$", names(.)))
# Person Strings_a Strings_A Strings_b Strings_B
# 1: Sally 1 1 0 1
# 2: Ben 0 1 2 1
To go from the above to the long form, you just need to add a melt line.
data.table(subset(MovementAnalysis, select = -Descriptor)) %>%
cSplit_e("Strings", "", type = "character", drop = TRUE, fill = 0) %>%
.[, lapply(.SD, sum), by = Person] %>%
subset(select = grep("Person|_[AaBb]$", names(.))) %>%
melt(id.vars = "Person")
# Person variable value
# 1: Sally Strings_a 1
# 2: Ben Strings_a 0
# 3: Sally Strings_A 1
# 4: Ben Strings_A 1
# 5: Sally Strings_b 0
# 6: Ben Strings_b 2
# 7: Sally Strings_B 1
# 8: Ben Strings_B 1
It's not clear from your question, but if your restricting the data to just "A", "a", "B", and "b" was just for the purpose of illustration and you're actually interested in the full 48 options, then you can also omit the following line:
subset(select = grep("Person|_[AaBb]$", names(.)))
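For completeness, the same count can be sketched in base R with strsplit and table, without any extra packages. This assumes only the Strings and Person columns from the question are needed; the intermediate names chars, long and freq are made up for the example.

```r
MovementAnalysis <- data.frame(
  Strings = c("AaB", "cZhH", "Bb", "bAc"),
  Person = c("Sally", "Sally", "Ben", "Ben"),
  stringsAsFactors = FALSE
)

# split every string into single characters (one list element per row)
chars <- strsplit(MovementAnalysis$Strings, "")

# one row per character, carrying along the person it belongs to
long <- data.frame(
  Person = rep(MovementAnalysis$Person, lengths(chars)),
  Letter = factor(unlist(chars), levels = c(LETTERS, letters))
)

# cross-tabulate and flatten; unused letters are kept with a count of 0
freq <- as.data.frame(table(long), responseName = "Frequency")
head(freq)
```

Setting the factor levels up front is what keeps the zero counts; restrict the levels to c("A", "a", "B", "b") if only those four letters are of interest.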
