I have a long-form data frame that has multiple entries for the same date and person.
jj <- data.frame(month=rep(1:3,4),
student=rep(c("Amy", "Bob"), each=6),
A=c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
B=c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5))
I want to convert it to wide form and make it like this:
month Amy.A Bob.A Amy.B Bob.B
1
2
3
1
2
3
1
2
3
1
2
3
My question is very similar to this. I have used the code given in the answer:
kk <- jj %>%
gather(variable, value, -(month:student)) %>%
unite(temp, student, variable) %>%
spread(temp, value)
but it gives the following error:
Error: Duplicate identifiers for rows (1, 4), (2, 5), (3, 6), (13, 16), (14, 17), (15, 18), (7, 10), (8, 11), (9, 12), (19, 22), (20, 23), (21, 24)
Thanks in advance.
Note: I don't want to delete multiple entries.
Your code was missing a mutate() step that adds an id! Here is a solution using dplyr and tidyr only.
jj %>%
gather(variable, value, -(month:student)) %>%
unite(temp, student, variable) %>%
group_by(temp) %>%
mutate(id=1:n()) %>%
spread(temp, value)
# A tibble: 6 x 6
# month id Amy_A Amy_B Bob_A Bob_B
# * <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 9 6 3 5
# 2 1 4 8 5 5 3
# 3 2 2 7 7 2 4
# 4 2 5 6 6 6 1
# 5 3 3 6 8 1 6
# 6 3 6 9 7 5 5
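To see why the extra id column removes the "Duplicate identifiers" error, here is a minimal base-R sketch. It assumes the jj data from the question; the names dup_before and dup_after are only illustrative, and ave() stands in for the group_by() + mutate(id = 1:n()) step.

```r
jj <- data.frame(month = rep(1:3, 4),
                 student = rep(c("Amy", "Bob"), each = 6),
                 A = c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
                 B = c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5))

# Without an id, month + student does not identify a row uniquely,
# which is what spread() complains about.
dup_before <- anyDuplicated(jj[c("month", "student")]) > 0

# Running counter within each student (1..6), the base-R analogue of
# group_by() followed by mutate(id = 1:n()).
jj$id <- ave(seq_along(jj$student), jj$student, FUN = seq_along)

# With the id added, every month/student/id combination is unique.
dup_after <- anyDuplicated(jj[c("month", "student", "id")]) > 0
```

Once each combination of key columns is unique, spreading to wide form no longer has to decide which of several values goes into a cell.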
The issue is the two columns for both A and B. If we can make that one value column, we can spread the data as you would like. Take a look at the output for jj_melt when you use the code below.
library(reshape2)
jj_melt <- melt(jj, id=c("month", "student"))
jj_spread <- dcast(jj_melt, month ~ student + variable, value.var="value", fun=sum)
# month Amy_A Amy_B Bob_A Bob_B
# 1 1 17 11 8 8
# 2 2 13 13 8 5
# 3 3 15 15 6 11
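As a cross-check on the summed table, the per-month, per-student totals can also be computed in long form with base R's aggregate(). This is only a sketch under the assumption that jj is the data frame from the question; sums is an illustrative name.

```r
jj <- data.frame(month = rep(1:3, 4),
                 student = rep(c("Amy", "Bob"), each = 6),
                 A = c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
                 B = c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5))

# One row per month/student pair, with A and B summed over the duplicates.
sums <- aggregate(cbind(A, B) ~ month + student, data = jj, FUN = sum)

# Example lookup: Amy's total for A in month 1 (9 + 8).
amy_a_month1 <- sums$A[sums$month == 1 & sums$student == "Amy"]
```

These are the same numbers that dcast() places into the Amy_A, Amy_B, Bob_A and Bob_B columns.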
I won't mark this as a duplicate since the other question did not summarize by sum, but the data.table answer could help with one additional argument, fun=sum:
library(data.table)
dcast(setDT(jj), month ~ student, value.var=c("A", "B"), fun=sum)
# month A_sum_Amy A_sum_Bob B_sum_Amy B_sum_Bob
# 1: 1 17 8 11 8
# 2: 2 13 8 13 5
# 3: 3 15 6 15 11
If you would like to use the tidyr solution, combine it with dcast to summarize by sum.
as.data.frame(jj)
library(tidyr)
jj %>%
gather(variable, value, -(month:student)) %>%
unite(temp, student, variable) %>%
dcast(month ~ temp, fun=sum)
# month Amy_A Amy_B Bob_A Bob_B
# 1 1 17 11 8 8
# 2 2 13 13 8 5
# 3 3 15 15 6 11
Edit
Based on your new requirements, I have added an activity column.
library(dplyr)
jj %>% group_by(month, student) %>%
mutate(id=1:n()) %>%
melt(id=c("month", "id", "student")) %>%
dcast(... ~ student + variable, value.var="value")
# month id Amy_A Amy_B Bob_A Bob_B
# 1 1 1 9 6 3 5
# 2 1 2 8 5 5 3
# 3 2 1 7 7 2 4
# 4 2 2 6 6 6 1
# 5 3 1 6 8 1 6
# 6 3 2 9 7 5 5
The other solutions can also be used. Here I added an optional expression to arrange the final output by activity number:
library(tidyr)
jj %>%
gather(variable, value, -(month:student)) %>%
unite(temp, student, variable) %>%
group_by(temp) %>%
mutate(id=1:n()) %>%
dcast(... ~ temp) %>%
arrange(id)
# month id Amy_A Amy_B Bob_A Bob_B
# 1 1 1 9 6 3 5
# 2 2 2 7 7 2 4
# 3 3 3 6 8 1 6
# 4 1 4 8 5 5 3
# 5 2 5 6 6 6 1
# 6 3 6 9 7 5 5
The data.table syntax is compact because it allows for multiple value.var columns and will take care of the spread for us. We can then skip the melt -> cast process.
library(data.table)
setDT(jj)[, activityID := rowid(student)]
dcast(jj, ... ~ student, value.var=c("A", "B"))
# month activityID A_Amy A_Bob B_Amy B_Bob
# 1: 1 1 9 3 6 5
# 2: 1 4 8 5 5 3
# 3: 2 2 7 2 7 4
# 4: 2 5 6 6 6 1
# 5: 3 3 6 1 8 6
# 6: 3 6 9 5 7 5
Since tidyr 1.0.0, pivot_wider is the recommended replacement for spread, and you could do the following:
jj <- data.frame(month=rep(1:3,4),
student=rep(c("Amy", "Bob"), each=6),
A=c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
B=c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5))
library(tidyr)
pivot_wider(
jj,
names_from = "student",
values_from = c("A","B"),
names_sep = ".",
values_fn = list(A= list, B= list)) %>%
unchop(everything())
#> # A tibble: 6 x 5
#> month A.Amy A.Bob B.Amy B.Bob
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 9 3 6 5
#> 2 1 8 5 5 3
#> 3 2 7 2 7 4
#> 4 2 6 6 6 1
#> 5 3 6 1 8 6
#> 6 3 9 5 7 5
Created on 2019-09-14 by the reprex package (v0.3.0)
The twist in this problem is that month is not unique within each student. To solve this:
values_fn = list(A = list, B = list) puts the multiple values into a list
unchop(everything()) unnests the lists vertically; you can use unnest here as well
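The two steps can be looked at separately. The sketch below assumes the jj data from the question and uses unnest() in place of unchop(); nested and flat are illustrative names.

```r
library(tidyr)

jj <- data.frame(month = rep(1:3, 4),
                 student = rep(c("Amy", "Bob"), each = 6),
                 A = c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
                 B = c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5))

# Step 1: duplicates are collected into list-columns, so there is one
# row per month and every cell holds a vector of length 2.
nested <- pivot_wider(jj,
                      names_from = "student",
                      values_from = c("A", "B"),
                      names_sep = ".",
                      values_fn = list(A = list, B = list))

# Step 2: the list-columns are expanded in parallel, giving two rows
# per month.
flat <- unnest(nested, cols = c(A.Amy, A.Bob, B.Amy, B.Bob))
```

Inspecting nested before unnesting is a quick way to confirm that every cell holds the same number of values, which the parallel expansion relies on.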
If we create a unique sequence, then we can get the output in the correct format with pivot_wider
library(dplyr)
library(tidyr)
jj %>%
group_by(month, student) %>%
mutate(rn = row_number()) %>%
pivot_wider(names_from = 'student', values_from = c('A', 'B'),
names_sep='.') %>%
select(-rn)
# A tibble: 6 x 5
# Groups: month [3]
# month A.Amy A.Bob B.Amy B.Bob
# <int> <dbl> <dbl> <dbl> <dbl>
#1 1 9 3 6 5
#2 2 7 2 7 4
#3 3 6 1 8 6
#4 1 8 5 5 3
#5 2 6 6 6 1
#6 3 9 5 7 5
data
jj <- structure(list(month = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L,
1L, 2L, 3L), student = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("Amy", "Bob"), class = "factor"),
A = c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5), B = c(6, 7, 8,
5, 6, 7, 5, 4, 6, 3, 1, 5)), class = "data.frame", row.names = c(NA,
-12L))
Related
I have a data frame like this:
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)), t = c(1:5, 1:5), value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))
# Limits for desired cumulative sum (CumSum)
maxCumSum <- 8
minCumSum <- 0
What I would like to calculate is a cumulative sum of value by group (grp) within the values of maxCumSum and minCumSum. The respective table dt2 should look something like this:
grp t value CumSum
a 1 -1 0
a 2 5 5
a 3 9 8
a 4 -15 0
a 5 6 6
b 1 5 5
b 2 1 6
b 3 7 8
b 4 -11 0
b 5 9 8
Think of CumSum as a water storage which has a certain maximum capacity and the level of which cannot sink below zero.
The normal cumsum does obviously not do the trick since there are no limitations to maximum or minimum. Has anyone a suggestion how to achieve this? In the real dataframe there are of course more than 2 groups and far more than 5 times.
Many thanks!
What you can do is write a function that adds each value to the running total and clamps the result between minCumSum and maxCumSum, then apply it cumulatively with Reduce(accumulate = TRUE):
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)), t = c(1:5, 1:5), value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))
library(dplyr)
maxCumSum <- 8
minCumSum <- 0
f <- function(x, y) max(min(x + y, maxCumSum), minCumSum)
df %>%
group_by(grp) %>%
mutate(CumSum = Reduce(f, value, 0, accumulate = TRUE)[-1])
#> # A tibble: 10 × 4
#> # Groups: grp [2]
#> grp t value CumSum
#> <chr> <int> <dbl> <dbl>
#> 1 a 1 -1 0
#> 2 a 2 5 5
#> 3 a 3 9 8
#> 4 a 4 -15 0
#> 5 a 5 6 6
#> 6 b 1 5 5
#> 7 b 2 1 6
#> 8 b 3 7 8
#> 9 b 4 -11 0
#> 10 b 5 9 8
Created on 2022-07-04 by the reprex package (v2.0.1)
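The same idea can be sketched without dplyr. In the version below, clamped_cumsum is a hypothetical helper name, the limits are passed as arguments instead of being read from global variables, and ave() applies the helper within each group.

```r
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)),
                 t = c(1:5, 1:5),
                 value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))

# Running total that is clamped to [lower, upper] after every step.
clamped_cumsum <- function(x, lower = 0, upper = 8, init = 0) {
  step <- function(total, value) max(min(total + value, upper), lower)
  # accumulate = TRUE keeps every intermediate total; [-1] drops init.
  Reduce(step, x, init, accumulate = TRUE)[-1]
}

# ave() splits value by grp, applies the helper and puts the results
# back in the original row order.
df$CumSum <- ave(df$value, df$grp, FUN = clamped_cumsum)
```

Passing the limits as arguments makes it easier to reuse the helper with a different capacity per data set.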
Assume we have an email dataset with a sender and a recipient in every row. We want to find the next occurrence in the dataset for which the sender and the recipient are interchanged. So if sender==x & recipient==y, we are looking for the next row that has sender==y & recipient==x. Subsequently, we want to calculate the difference between counts for those observations. See the column diff_count for the desired output.
# creating the data.frame
id = 1:10
sender = c(1, 2, 3, 2, 3, 1, 2, 1, 2, 3)
recipient = c(2, 1, 2, 3, 1, 2, 3, 3, 1, 1)
count = c(1, 4, 5, 7, 12, 17, 24, 31, 34, 41)
df <- data.frame(id, sender, recipient, count)
# output should look like this
df$diff_count <- c(3, 13, 2, NA, 19, 17, NA, 10, NA, NA)
If there are no more observations that satisfy the requirement, then we simply fill in NA. Solution should be relatively easy with tidyverse, but I seem not to be able to do it.
Another dplyr-way without a custom function but several self joins:
library(dplyr)
# df is the data frame created in the question
df %>%
left_join(df,
by = c("sender" = "recipient", "recipient" = "sender"),
suffix = c("", ".y")) %>%
filter(id < id.y) %>%
group_by(id) %>%
slice_min(id.y) %>%
ungroup() %>%
mutate(diff_count = count.y - count) %>%
right_join(df) %>%
select(-matches("\\.(y|x)")) %>%
arrange(id)
returns
Joining, by = c("id", "sender", "recipient", "count")
# A tibble: 10 x 5
id sender recipient count diff_count
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 1 3
2 2 2 1 4 13
3 3 3 2 5 2
4 4 2 3 7 NA
5 5 3 1 12 19
6 6 1 2 17 17
7 7 2 3 24 NA
8 8 1 3 31 10
9 9 2 1 34 NA
10 10 3 1 41 NA
There should be easier ways, but below is one way using a custom function in tidyverse style:
library(dplyr)
calc_diff <- function(df, send, recp, cnt) {
df %>%
slice_tail(n = nrow(df) - cur_group_rows()) %>%
filter(sender == send, recipient == recp) %>%
slice_head(n = 1) %>%
pull(count) %>%
{ifelse(length(.) == 0, NA, .)} %>%
`-`(., cnt)
}
df %>%
rowwise(id) %>%
mutate(diff_count = calc_diff(df,
send = recipient,
recp = sender,
cnt = count))
#> # A tibble: 10 x 5
#> # Rowwise: id
#> id sender recipient count diff_count
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 1 3
#> 2 2 2 1 4 13
#> 3 3 3 2 5 2
#> 4 4 2 3 7 NA
#> 5 5 3 1 12 19
#> 6 6 1 2 17 17
#> 7 7 2 3 24 NA
#> 8 8 1 3 31 10
#> 9 9 2 1 34 NA
#> 10 10 3 1 41 NA
Created on 2021-08-20 by the reprex package (v2.0.1)
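For comparison, the row-by-row logic described in the question can be sketched in base R. The helper name next_reply_diff is hypothetical, and the sketch assumes the rows are already in chronological order.

```r
df <- data.frame(id = 1:10,
                 sender = c(1, 2, 3, 2, 3, 1, 2, 1, 2, 3),
                 recipient = c(2, 1, 2, 3, 1, 2, 3, 3, 1, 1),
                 count = c(1, 4, 5, 7, 12, 17, 24, 31, 34, 41))

# For every row, find the first later row with sender and recipient
# swapped and return the difference in count (NA if there is none).
next_reply_diff <- function(d) {
  rows <- seq_len(nrow(d))
  vapply(rows, function(i) {
    later <- which(rows > i &
                     d$sender == d$recipient[i] &
                     d$recipient == d$sender[i])
    if (length(later) == 0) NA_real_ else d$count[later[1]] - d$count[i]
  }, numeric(1))
}

df$diff_count <- next_reply_diff(df)
```

This scans the data once per row, so it is quadratic in the number of rows; for large data the join-based approach is likely to scale better.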
I have a database in R where there are some NAs in the variables. I would like to apply a logic function where the NAs would be filled with the immediately preceding value. Below is an example:
dados <- tibble::tibble(x = c(2, 3, 5, NA, 2, 1, NA, NA, 9, 3),
y = c(4, 1, 9, NA, 8, 5, NA, NA, 1, 2)
)
# A tibble: 10 x 2
x y
<dbl> <dbl>
1 2 4
2 3 1
3 5 9
4 NA NA
5 2 8
6 1 5
7 NA NA
8 NA NA
9 9 1
10 3 2
In this case, the 4th value of the variable x would be filled with a 5 and so on.
Thank you!
We could use fill from tidyr package:
library(tidyr)
library(dplyr)
dados %>%
fill(c(x,y), .direction = "down")
x y
<dbl> <dbl>
1 2 4
2 3 1
3 5 9
4 5 9
5 2 8
6 1 5
7 1 5
8 1 5
9 9 1
10 3 2
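We can use coalesce with lag. Note that this only fills an NA from the value directly above it, so the second NA of a consecutive run (row 8 here) stays NA: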
library(dplyr)
dados %>%
mutate(across(x:y, ~ coalesce(., lag(.))))
# A tibble: 10 x 2
x y
<dbl> <dbl>
1 2 4
2 3 1
3 5 9
4 5 9
5 2 8
6 1 5
7 1 5
8 NA NA
9 9 1
10 3 2
library(dplyr)
dados %>%
mutate(x = case_when(is.na(x) ~ lag(x),
TRUE ~ x),
y = case_when(is.na(y) ~ lag(y),
TRUE ~ y))
The following only works if the first value in a column is not NA; for the sake of clear and simple code, I leave handling that case as an exercise for you. We can solve this for one column as follows:
library(tibble)
dados <- tibble::tibble(x = c(2, 3, 5, NA, 2, 1, NA, NA, 9, 3),
y = c(4, 1, 9, NA, 8, 5, NA, NA, 1, 2)
)
#where are the NA?
pos <- dados$x |>
is.na() |>
which()
# replace
while(any(is.na(dados$x)))
dados$x[pos] <- dados$x[pos-1]
dados
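A vectorised base-R sketch that fills runs of any length and leaves a leading NA untouched is shown below. The helper name fill_down is hypothetical; the idea is to compute, for every position, the index of the last non-NA value seen so far.

```r
# Last observation carried forward for a single vector.
fill_down <- function(x) {
  # Position of the most recent non-NA value (0 while none has been seen).
  idx <- cummax(seq_along(x) * !is.na(x))
  # A zero index would silently drop elements, so turn it into NA,
  # which keeps leading NAs as NA in the result.
  idx[idx == 0] <- NA
  x[idx]
}

dados <- data.frame(x = c(2, 3, 5, NA, 2, 1, NA, NA, 9, 3),
                    y = c(4, 1, 9, NA, 8, 5, NA, NA, 1, 2))

# Apply the helper to every column.
dados[] <- lapply(dados, fill_down)
```

Because it works on index vectors rather than a loop, the helper needs no special handling for consecutive NAs.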
I'm currently having some issues setting up my dataframe in a correct way. I would like to end up with the following columns Participant ID, SpeakerDialect (TSpeaker), SpeakerNumber(TSpeaker) and Score.
The output I'm getting from Google Forms is 4 columns of scores and one with the timestamp (participant ID). Now here comes the trouble. I would like to add to the data frame some information about the video that each score was given to. I made it work by using the following code, but here the timestamp is not included. When adding the timestamp it completely messes it up. It is a repeated measures design, so the same timestamp will have to be repeated 4 times in the final dataframe.
trustworth1 <- read.csv('Danskernes holdninger til politiske udsagn 1.csv')
trustworth1 <- trustworth1 %>% select(Hvor.troværdig.er.personen., Hvor.troværdig.er.personen..1, Hvor.troværdig.er.personen..2, Hvor.troværdig.er.personen..3)
TSpeaker <- c('2', '3', '4', '1')
TDialect <- c('1', '2', '2', '1')
trustworth1 <- trustworth1 %>% t()
trustworth1 <- cbind(TSpeaker, TDialect, trustworth1) %>%
as.tibble()
trustworth1 <- unite(trustworth1, Score, starts_with('V'), sep = ", ", remove = FALSE, na.rm = FALSE)
trustworth1 <- trustworth1 %>% select(TSpeaker,TDialect, Score)
trustworth1 <- separate_rows(trustworth1, c(Score), convert = FALSE)
Test dataframe
TimeStamp <- c(1, 2, 3, 4, 5, 6, 7)
Speaker1 <- c(4, 7, 9, 3, 2, 4, 9)
Speaker2 <- c(7, 1, 9, 0, 2, 5, 10)
Speaker3 <- c(3, 1, 9, 2, 9, 5, 10)
Speaker4 <- c(1, 1, 6, 0, 6, 5, 1)
df <- data.frame(TimeStamp, Speaker1, Speaker2, Speaker3, Speaker4)
Dialect of speaker 1 is 1
Dialect of speaker 2 is 2
Dialect of speaker 3 is 1
Dialect of speaker 4 is 2
Ideally I would end up with a data frame with 4 rows per participant, one for each rating of the speakers
RAW DATA:
  TimeStamp                  Speaker2 Speaker3 Speaker4 Speaker1
  <chr>                         <int>    <int>    <int>    <int>
1 2020/12/07 11:33:39 AM CET        3        8        6        9
2 2020/12/07 12:16:33 PM CET        5        5        5        5
3 2020/12/07 12:29:11 PM CET        6        7        8        9
4 2020/12/07 12:47:39 PM CET        7        8        8        9
5 2020/12/07 1:04:01 PM CET         5        5        5        5
6 2020/12/07 1:05:33 PM CET         0        8        9        5
6 rows
Any ideas?
Here's a dplyr solution.
To bring in the dialect, I suggest using a merge/join operation, which will pair a dialect with every (known) speaker number. For that data, I'll use a frame as well:
dialects <- data.frame(SpeakerNumber = paste0("Speaker", 1:4), SpeakerDialect = c(1L, 2L, 1L, 2L))
Now it's a matter of reshaping/pivoting from a "wide" format to a "long" format:
library(dplyr)
library(tidyr) # pivot_longer
pivot_longer(df, -TimeStamp, names_to = "SpeakerNumber", values_to = "Score") %>%
left_join(dialects, by = "SpeakerNumber")
# # A tibble: 28 x 4
# TimeStamp SpeakerNumber Score SpeakerDialect
# <dbl> <chr> <dbl> <int>
# 1 1 Speaker1 4 1
# 2 1 Speaker2 7 2
# 3 1 Speaker3 3 1
# 4 1 Speaker4 1 2
# 5 2 Speaker1 7 1
# 6 2 Speaker2 1 2
# 7 2 Speaker3 1 1
# 8 2 Speaker4 1 2
# 9 3 Speaker1 9 1
# 10 3 Speaker2 9 2
# # ... with 18 more rows
You use the name SpeakerNumber, suggesting you only want the number from that field, and perhaps as a number itself. If that's the case, add
... %>%
mutate(SpeakerNumber = as.integer(gsub("^Speaker", "", SpeakerNumber)))
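If you would rather avoid the tidyverse, the same wide-to-long reshape and dialect lookup can be sketched in base R. This assumes the test data frame df from the question and the dialects frame defined above; long is an illustrative name.

```r
df <- data.frame(TimeStamp = c(1, 2, 3, 4, 5, 6, 7),
                 Speaker1 = c(4, 7, 9, 3, 2, 4, 9),
                 Speaker2 = c(7, 1, 9, 0, 2, 5, 10),
                 Speaker3 = c(3, 1, 9, 2, 9, 5, 10),
                 Speaker4 = c(1, 1, 6, 0, 6, 5, 1))
dialects <- data.frame(SpeakerNumber = paste0("Speaker", 1:4),
                       SpeakerDialect = c(1L, 2L, 1L, 2L))

# Stack the four score columns on top of each other; the timestamp is
# repeated once per speaker column.
speaker_cols <- names(df)[-1]
long <- data.frame(
  TimeStamp = rep(df$TimeStamp, times = length(speaker_cols)),
  SpeakerNumber = rep(speaker_cols, each = nrow(df)),
  Score = unlist(df[speaker_cols], use.names = FALSE)
)

# Attach the dialect of each speaker (merge() may reorder the rows).
long <- merge(long, dialects, by = "SpeakerNumber")
```

The result has 4 rows per timestamp, one for each speaker rating.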
Using the sample dataset, I think you want something like this. You can use 'speaker' and 'score' instead of var and val:
require(dplyr)
require(tidyr)
df %>% head
df %>% gather(var, val, Speaker1:Speaker4) %>%
head
TimeStamp var val
1 1 Speaker1 4
2 2 Speaker1 7
3 3 Speaker1 9
4 4 Speaker1 3
5 5 Speaker1 2
6 6 Speaker1 4
I have a dataframe like this
structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4,
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4,
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4,
2, 2, 7, 5, 2)), .Names = c("Love_ABC", "Love_CNN", "Hate_ABC", "Hate_CNN", "Love_CNBC", "Hate_CNBC"), row.names = c(NA,
8L), class = "data.frame")
I have made the following for loop
channels = c("ABC", "CNN", "CNBC")
for (channel in channels) {
dataframe <- dataframe %>%
mutate(ALL_channel = Love_channel + Hate_channel)
}
But when I run the for loop, R tells me that object 'Love_channel' was not found. Have I done something wrong in the for loop?
Here's a way with rlang. Note, reshaping the data is likely more straightforward. Non-standard evaluation (NSE) is a complicated topic.
for (channel in channels) {
DF <- DF %>%
mutate(!!sym(paste0("ALL_", channel)) := !!sym(paste0("Love_", channel)) + !!sym(paste0("Hate_", channel)))
}
DF
## Love_ABC Love_CNN Hate_ABC Hate_CNN Love_CNBC Hate_CNBC ALL_ABC ALL_CNN ALL_CNBC
## 1 1 1 6 6 1 2 7 7 3
## 2 3 3 3 2 2 3 6 5 5
## 3 4 4 6 4 4 4 10 8 8
## 4 6 2 5 5 5 2 11 7 7
## 5 3 6 3 3 6 2 6 9 8
## 6 2 7 6 7 7 7 8 14 14
## 7 5 2 5 2 6 5 10 4 11
## 8 1 6 3 6 3 2 4 12 5
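If the tidy-eval syntax feels heavy, the same loop can be sketched in base R by building the column names as strings and indexing with [[ ]]. This is a different technique from the rlang answer, not a rewrite of it; it assumes the column naming scheme Love_<channel> / Hate_<channel> from the question.

```r
dataframe <- data.frame(Love_ABC  = c(1, 3, 4, 6, 3, 2, 5, 1),
                        Love_CNN  = c(1, 3, 4, 2, 6, 7, 2, 6),
                        Hate_ABC  = c(6, 3, 6, 5, 3, 6, 5, 3),
                        Hate_CNN  = c(6, 2, 4, 5, 3, 7, 2, 6),
                        Love_CNBC = c(1, 2, 4, 5, 6, 7, 6, 3),
                        Hate_CNBC = c(2, 3, 4, 2, 2, 7, 5, 2))

channels <- c("ABC", "CNN", "CNBC")

for (channel in channels) {
  # paste0() builds the actual column name for this channel, and [[ ]]
  # looks the column up by that string.
  dataframe[[paste0("ALL_", channel)]] <-
    dataframe[[paste0("Love_", channel)]] + dataframe[[paste0("Hate_", channel)]]
}
```

The original loop failed because mutate() looked for a column literally named Love_channel; the loop variable is never substituted into a bare column name.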
This is a solution with dplyr and tidyr:
library(tidyr)
library(dplyr)
dataframe <- dataframe %>%
tibble::rowid_to_column()
dataframe %>%
pivot_longer(-rowid, names_to = c(NA, "channel"), names_sep = "_") %>%
pivot_wider(names_from = channel, names_prefix = "ALL_", values_from = value, values_fn = sum) %>%
right_join(dataframe, by = "rowid") %>%
select(-rowid)
#> # A tibble: 8 x 9
#> ALL_ABC ALL_CNN ALL_CNBC Love_ABC Love_CNN Hate_ABC Hate_CNN Love_CNBC Hate_CNBC
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 7 7 3 1 1 6 6 1 2
#> 2 6 5 5 3 3 3 2 2 3
#> 3 10 8 8 4 4 6 4 4 4
#> 4 11 7 7 6 2 5 5 5 2
#> 5 6 9 8 3 6 3 3 6 2
#> 6 8 14 14 2 7 6 7 7 7
#> 7 10 4 11 5 2 5 2 6 5
#> 8 4 12 5 1 6 3 6 3 2
The idea is to reshape it to make the sums easier. Then you can join the final result back to the initial dataframe.
Start by uniquely identifying each row with a rowid.
Reshape with pivot_longer so as to have all values neatly in one column. In this step you also split the names Love/Hate_channel in two and drop the Love/Hate part, since you are interested only in the channel (that is what the NA does!).
Reshape again: this time you want to get one column for each channel. In this step you also sum up what previously was Love and Hate together for each rowid and channel (that's what values_fn = sum does!). You also add a prefix (names_prefix = "ALL_") to each new column name so that the names match your expected final result.
With right_join you add the values back to the original dataframe. You have no need for rowid now, so you can remove it.