Setting up a dataframe in R

I'm having trouble setting up my data frame correctly. I would like to end up with the following columns: Participant ID, SpeakerDialect (TDialect), SpeakerNumber (TSpeaker) and Score.
The output I'm getting from Google Forms is four columns of scores and one with the timestamp (which serves as the participant ID). Now here comes the trouble: I would like to add to the data frame some information about the video each score was given to. The following code works, but it does not include the timestamp; when I add the timestamp, the result gets completely messed up. It is a repeated-measures design, so each timestamp has to be repeated 4 times in the final data frame.
trustworth1 <- read.csv('Danskernes holdninger til politiske udsagn 1.csv')
trustworth1 <- trustworth1 %>% select(Hvor.troværdig.er.personen., Hvor.troværdig.er.personen..1, Hvor.troværdig.er.personen..2, Hvor.troværdig.er.personen..3)
TSpeaker <- c('2', '3', '4', '1')
TDialect <- c('1', '2', '2', '1')
trustworth1 <- trustworth1 %>% t()
trustworth1 <- cbind(TSpeaker, TDialect, trustworth1) %>%
  as_tibble() # as.tibble() is deprecated in favour of as_tibble()
trustworth1 <- unite(trustworth1, Score, starts_with('V'), sep = ", ", remove = FALSE, na.rm = FALSE)
trustworth1 <- trustworth1 %>% select(TSpeaker,TDialect, Score)
trustworth1 <- separate_rows(trustworth1, c(Score), convert = FALSE)
Test dataframe
TimeStamp <- c(1, 2, 3, 4, 5, 6, 7)
Speaker1 <- c(4, 7, 9, 3, 2, 4, 9)
Speaker2 <- c(7, 1, 9, 0, 2, 5, 10)
Speaker3 <- c(3, 1, 9, 2, 9, 5, 10)
Speaker4 <- c(1, 1, 6, 0, 6, 5, 1)
df <- data.frame(TimeStamp, Speaker1, Speaker2, Speaker3, Speaker4)
Dialect of speaker 1 is 1
Dialect of speaker 2 is 2
Dialect of speaker 3 is 1
Dialect of speaker 4 is 2
Ideally I would end up with a data frame with 4 rows per participant, one for each rating of the speakers
RAW DATA:
  TimeStamp                   Speaker2 Speaker3 Speaker4 Speaker1
  <chr>                          <int>    <int>    <int>    <int>
1 2020/12/07 11:33:39 AM CET         3        8        6        9
2 2020/12/07 12:16:33 PM CET         5        5        5        5
3 2020/12/07 12:29:11 PM CET         6        7        8        9
4 2020/12/07 12:47:39 PM CET         7        8        8        9
5 2020/12/07 1:04:01 PM CET          5        5        5        5
6 2020/12/07 1:05:33 PM CET          0        8        9        5
6 rows
Any ideas?

Here's a dplyr solution.
To bring in the dialect, I suggest using a merge/join operation, which will pair a dialect with every (known) speaker number. For that data, I'll use a frame as well:
dialects <- data.frame(SpeakerNumber = paste0("Speaker", 1:4), SpeakerDialect = c(1L, 2L, 1L, 2L))
Now it's a matter of reshaping/pivoting from a "wide" format to a "long" format:
library(dplyr)
library(tidyr) # pivot_longer
pivot_longer(df, -TimeStamp, names_to = "SpeakerNumber", values_to = "Score") %>%
left_join(dialects, by = "SpeakerNumber")
# # A tibble: 28 x 4
# TimeStamp SpeakerNumber Score SpeakerDialect
# <dbl> <chr> <dbl> <int>
# 1 1 Speaker1 4 1
# 2 1 Speaker2 7 2
# 3 1 Speaker3 3 1
# 4 1 Speaker4 1 2
# 5 2 Speaker1 7 1
# 6 2 Speaker2 1 2
# 7 2 Speaker3 1 1
# 8 2 Speaker4 1 2
# 9 3 Speaker1 9 1
# 10 3 Speaker2 9 2
# # ... with 18 more rows
You use the name SpeakerNumber, suggesting you only want the number from that field, and perhaps as a number itself. If that's the case, add
... %>%
mutate(SpeakerNumber = as.integer(gsub("^Speaker", "", SpeakerNumber)))
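For readers who want to see what the pivot and join are doing, the same reshape can be sketched in base R. This is only an illustration on the test data frame from the question, using the dialect assignments listed there; the object name long is made up for this example.

```r
# Test data frame from the question
df <- data.frame(
  TimeStamp = c(1, 2, 3, 4, 5, 6, 7),
  Speaker1  = c(4, 7, 9, 3, 2, 4, 9),
  Speaker2  = c(7, 1, 9, 0, 2, 5, 10),
  Speaker3  = c(3, 1, 9, 2, 9, 5, 10),
  Speaker4  = c(1, 1, 6, 0, 6, 5, 1)
)

# Stack the four score columns on top of each other, repeating the
# timestamp once per speaker (wide -> long)
long <- data.frame(
  TimeStamp     = rep(df$TimeStamp, times = 4),
  SpeakerNumber = rep(1:4, each = nrow(df)),
  Score         = unlist(df[paste0("Speaker", 1:4)], use.names = FALSE)
)

# Look up the dialect by speaker number (speakers 1..4 -> dialects 1, 2, 1, 2)
long$SpeakerDialect <- c(1L, 2L, 1L, 2L)[long$SpeakerNumber]

# Sort so that each participant's four ratings sit together
long <- long[order(long$TimeStamp, long$SpeakerNumber), ]
rownames(long) <- NULL
head(long, 4)
```

Because the columns are selected by name, the order in which Google Forms exports the speaker columns should not matter.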

Using the sample dataset, I think you want something like this. You can use 'speaker' and 'score' instead of var and val.
require(dplyr)
require(tidyr)
df %>% head
df %>% gather(var, val, Speaker1:Speaker4) %>%
head
TimeStamp var val
1 1 Speaker1 4
2 2 Speaker1 7
3 3 Speaker1 9
4 4 Speaker1 3
5 5 Speaker1 2
6 6 Speaker1 4

Related

Count how many rows have the same ID and add the number in a new column

My dataframe contains data about political careers, such as a unique identifier (called: ui) column for each politician and the electoral term(called: electoral_term) in which they were elected. Since a politician can be elected in multiple electoral terms, there are multiple rows that contain the same ui.
Now I would like to add another column to my dataframe, that counts how many times the politician got re-elected.
So e.g. the politician with ui=1 was re-elected 2 times, since he occurred in 3 electoral_terms.
I already tried
df %>% count(ui)
But that only gives out a table which can't be added into my dataframe.
Thanks in advance!
We may use base R
df$reelected <- with(df, ave(ui, ui, FUN = length)-1)
-output
> df
ui electoral reelected
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
data
df <- structure(list(ui = c(1, 1, 1, 2, 3, 3), electoral = c(1, 2,
3, 2, 7, 9)), class = "data.frame", row.names = c(NA, -6L))
mydf <- tibble::tribble(~ui, ~electoral, 1, 1, 1, 2, 1, 3, 2, 2, 3, 7, 3, 9)
library(dplyr)
# add_count() keeps every row and appends the per-ui count as a column
mydf |>
  add_count(ui, name = "re_elected") |>
  mutate(re_elected = re_elected - 1)
# A tibble: 6 × 3
ui electoral re_elected
<dbl> <dbl> <dbl>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
library(tidyverse)
df %>%
group_by(ui) %>%
mutate(re_elected = n() - 1)
# A tibble: 6 × 3
# Groups: ui [3]
ui electoral re_elected
<dbl> <dbl> <dbl>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
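The count(ui) attempt in the question returns one row per ui, which is why it cannot be attached directly. As a rough base R sketch of the same idea, a per-ui tally can be indexed back onto the original rows by ui; the name terms is just an illustrative choice.

```r
df <- data.frame(ui = c(1, 1, 1, 2, 3, 3),
                 electoral = c(1, 2, 3, 2, 7, 9))

# One count per ui, comparable to what count(ui) returns
terms <- table(df$ui)

# Index the tally by each row's ui so it lines up with df, then subtract
# the first election to get the number of re-elections
df$reelected <- as.integer(terms[as.character(df$ui)]) - 1L
df
```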

How to create a new dataframe of consolidated values from multiple columns in R

I have a dataframe, df1, that looks like the following:
sample  99_Ape_1  93_Cat_1  87_Ape_2  84_Cat_2  90_Dog_1  92_Dog_2
A              2         3         1         7         4         6
B              5         9         7         0         3         7
C              6         8         9         2         3         0
D              3         9         0         5         8         3
I want to consolidate the dataframe by summing the values based on animal present in the header row, i.e. by "Ape", "Cat", "Dog", and end up with the following dataframe:
sample  Ape  Cat  Dog
A         3   10   10
B        12    9   10
C        15   10    3
D         3   14   11
I have created a list that represents all the animals called "animals_list"
I have then created a list of dataframes that subsets each animal into a separate dataframe with:
species_extract <- list()
for (i in seq_along(animals_list)) {
  species_extract[[i]] <- df1[, grep(animals_list[i], names(df1))]
}
I am then trying to sum each variable in the row by sample:
for (i in 1:length(species_extract)){
species_extract[[i]]$total <- rowSums(species_extract[[i]])
}
and then create a dataframe 'animal_total' by binding all values in the new 'total' column.
animal_total <- NULL
for (i in 1:length(species_extract)){
animal_total[i] <- cbind(species_extract[[i]]$total)
}
Unfortunately, this doesn't seem to work at all and I think I may have taken the wrong route. Any help would be really appreciated!
EDIT: my dataframe has over 300 animals, meaning incorporating use of my list of identifiers (animals_list) would be highly appreciated! I would also note that some column names do not follow the structure, "number_animal_number" and therefore I can't use a repetitive search (sorry!).
a data.table approach
library(data.table)
library(rlist)
#set data to data.table format
setDT(df1)
# split column 2:n by regex on column names
L <- split.default(df1[,-1], gsub(".*_(.*)_.*", "\\1", names(df1)[-1]))
# Bind together again
data.table(sample = df1$sample,
as.data.table(list.cbind(lapply(L, rowSums))))
# sample Ape Cat Dog
# 1: A 3 10 10
# 2: B 12 9 10
# 3: C 15 10 3
# 4: D 3 14 11
Update: After clarification:
This may work, depending on the other names of your animals, but it is a start:
library(dplyr)
library(tidyr)
library(stringr) # str_extract
df1 %>%
  pivot_longer(cols = -sample) %>%
  # take the text between the first and second underscore as the animal name
  mutate(name1 = str_extract(name, '(?<=\\_)(.*?)(?=\\_)')) %>%
  group_by(sample, name1) %>%
  summarise(sum = sum(value)) %>%
  pivot_wider(names_from = name1, values_from = sum)
Output:
sample Ape Cat Dog
<chr> <int> <int> <int>
1 A 3 10 10
2 B 12 9 10
3 C 15 10 3
4 D 3 14 11
First answer:
Here is how we could do it with dplyr:
library(dplyr)
df1 %>%
  mutate(Cat = rowSums(select(., contains("Cat"))),
         Ape = rowSums(select(., contains("Ape"))),
         Dog = rowSums(select(., contains("Dog")))) %>%
  select(sample, Ape, Cat, Dog)
sample Ape Cat Dog
<chr> <int> <int> <int>
1 A 3 10 10
2 B 12 9 10
3 C 15 10 3
4 D 3 14 11
An alternative data.table solution
library(data.table)
# Construct data table
dt <- as.data.table(list(sample = c("A", "B", "C", "D"),
`99_Ape_1` = c(2, 5, 6, 3),
`93_Cat_1` = c(3, 9, 8, 9),
`87_Ape_2` = c(1, 7, 9, 0),
`84_Cat_2` = c(7, 0, 2, 5),
`90_Dog_1` = c(4, 3, 3, 8),
`92_Dog_2` = c(6, 7, 0, 3)))
# Alternatively convert existing dataframe
# dt <- setDT(df)
# Use Regex pattern to drop ids from column names
names(dt) <- gsub("((^[0-9_]{3})|(_[0-9]{1}$))", "", names(dt))
# Pivot long (columns to rows)
dt <- melt(dt, id.vars = "sample")
# Aggregate sample by variable
dt <- dt[, .(value=sum(value)), by=.(sample, variable)]
# Pivot wide (rows to columns)
dcast(dt, sample ~ variable)
# sample Ape Cat Dog
# 1: A 3 10 10
# 2: B 12 9 10
# 3: C 15 10 3
# 4: D 3 14 11
Alternatively, leaving the column names as is (after comment from OP to previous answer) and assuming that there are multiple observations of the same samples:
dt <- as.data.table(list(sample = c("A", "B", "C", "D", "A"),
`99_Ape_1` = c(2, 5, 6, 3, 1),
`93_Cat_1` = c(3, 9, 8, 9, 1),
`87_Ape_2` = c(1, 7, 9, 0, 1),
`84_Cat_2` = c(7, 0, 2, 5, 1),
`90_Dog_1` = c(4, 3, 3, 8, 1),
`92_Dog_2` = c(6, 7, 0, 3, 1)))
dt
# sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
# 1: A 2 3 1 7 4 6
# 2: B 5 9 7 0 3 7
# 3: C 6 8 9 2 3 0
# 4: D 3 9 0 5 8 3
# 5: A 1 1 1 1 1 1
# Pivot long (columns to rows)
dt <- melt(dt, id.vars = "sample")
# Aggregate sample by variable
dt <- dt[, .(value=sum(value)), by=.(sample, variable)]
# Pivot wide (rows to columns)
dcast(dt, sample ~ variable)
# sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
# 1: A 3 4 2 8 5 7
# 2: B 5 9 7 0 3 7
# 3: C 6 8 9 2 3 0
# 4: D 3 9 0 5 8 3
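Since the question's edit asks to reuse animals_list rather than rely on a fixed column-name pattern, here is a minimal base R sketch along those lines. It assumes animals_list is a character vector of animal names and that each name appears literally in the relevant column names; the three-animal list below stands in for the real one.

```r
# Example data matching the question (check.names = FALSE keeps the
# column names that start with digits)
df1 <- data.frame(
  sample     = c("A", "B", "C", "D"),
  `99_Ape_1` = c(2, 5, 6, 3),
  `93_Cat_1` = c(3, 9, 8, 9),
  `87_Ape_2` = c(1, 7, 9, 0),
  `84_Cat_2` = c(7, 0, 2, 5),
  `90_Dog_1` = c(4, 3, 3, 8),
  `92_Dog_2` = c(6, 7, 0, 3),
  check.names = FALSE
)
animals_list <- c("Ape", "Cat", "Dog") # stand-in for the full list

# For each animal, find the columns whose name contains it and sum by row;
# sapply() binds the per-animal totals into a matrix with one column each
totals <- sapply(animals_list, function(animal) {
  cols <- grep(animal, names(df1), fixed = TRUE)
  rowSums(df1[, cols, drop = FALSE])
})

animal_total <- data.frame(sample = df1$sample, totals)
animal_total
```

With several hundred animals, plain substring matching may over-match when one name is contained in another, so a stricter pattern could be needed for those cases.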

How to find the next occurrence in a data.frame in R?

Assume we have an email dataset with a sender and a recipient in every row. We want to find the next occurrence in the dataset for which the sender and the recipient are interchanged. So if sender==x & recipient==y, we are looking for the next row that has sender==y & recipient==x. Subsequently, we want to calculate the difference between counts for those observations. See the column diff_count for the desired output.
# creating the data.frame
id = 1:10
sender = c(1, 2, 3, 2, 3, 1, 2, 1, 2, 3)
recipient = c(2, 1, 2, 3, 1, 2, 3, 3, 1, 1)
count = c(1, 4, 5, 7, 12, 17, 24, 31, 34, 41)
df <- data.frame(id, sender, recipient, count)
# output should look like this
df$diff_count <- c(3, 13, 2, NA, 19, 17, NA, 10, NA, NA)
If there are no more observations that satisfy the requirement, then we simply fill in NA. Solution should be relatively easy with tidyverse, but I seem not to be able to do it.
Another dplyr-way without a custom function but several self joins:
library(dplyr)
df %>%
  left_join(df,
            by = c("sender" = "recipient", "recipient" = "sender"),
            suffix = c("", ".y")) %>%
  filter(id < id.y) %>%          # keep only later rows with swapped roles
  group_by(id) %>%
  slice_min(id.y) %>%            # the next such row
  ungroup() %>%
  mutate(diff_count = count.y - count) %>%
  right_join(df) %>%             # restore rows that had no match
  select(-matches("\\.(y|x)")) %>%
  arrange(id)
returns
Joining, by = c("id", "sender", "recipient", "count")
# A tibble: 10 x 5
id sender recipient count diff_count
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 1 3
2 2 2 1 4 13
3 3 3 2 5 2
4 4 2 3 7 NA
5 5 3 1 12 19
6 6 1 2 17 17
7 7 2 3 24 NA
8 8 1 3 31 10
9 9 2 1 34 NA
10 10 3 1 41 NA
There should be easier ways, but below is one way using a custom function in tidyverse style:
library(dplyr)
calc_diff <- function(df, send, recp, cnt) {
df %>%
slice_tail(n = nrow(df) - cur_group_rows()) %>%
filter(sender == send, recipient == recp) %>%
slice_head(n = 1) %>%
pull(count) %>%
{ifelse(length(.) == 0, NA, .)} %>%
`-`(., cnt)
}
df %>%
rowwise(id) %>%
mutate(diff_count = calc_diff(df,
send = recipient,
recp = sender,
cnt = count))
#> # A tibble: 10 x 5
#> # Rowwise: id
#> id sender recipient count diff_count
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 1 3
#> 2 2 2 1 4 13
#> 3 3 3 2 5 2
#> 4 4 2 3 7 NA
#> 5 5 3 1 12 19
#> 6 6 1 2 17 17
#> 7 7 2 3 24 NA
#> 8 8 1 3 31 10
#> 9 9 2 1 34 NA
#> 10 10 3 1 41 NA
Created on 2021-08-20 by the reprex package (v2.0.1)
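For comparison, the same lookup can be written as a plain row-by-row search in base R. This is only a sketch on the example data; the helper name next_reply_diff is invented here and is not part of any package.

```r
id <- 1:10
sender <- c(1, 2, 3, 2, 3, 1, 2, 1, 2, 3)
recipient <- c(2, 1, 2, 3, 1, 2, 3, 3, 1, 1)
count <- c(1, 4, 5, 7, 12, 17, 24, 31, 34, 41)
df <- data.frame(id, sender, recipient, count)

# For row i, find the first later row where sender and recipient are swapped
# and return the difference in count (NA if there is no such row)
next_reply_diff <- function(i) {
  later <- seq_len(nrow(df)) > i
  hit <- which(later &
                 df$sender == df$recipient[i] &
                 df$recipient == df$sender[i])
  if (length(hit) == 0) NA_real_ else df$count[hit[1]] - df$count[i]
}

df$diff_count <- vapply(seq_len(nrow(df)), next_reply_diff, numeric(1))
df
```

This scans the remaining rows for every row, so it is quadratic in the number of rows; that is fine for small data but may be slow on a large email log.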

Dplyr transformation based on string filtering and conditions

I would like to transform a messy dataset in R, but I am having trouble figuring out how to do so. Below are an example dataset and the result I need to achieve:
dataset <- tribble(
~ID, ~DESC,
1, "3+1Â 81Â mÂ",
2, "2+1Â 90Â mÂ",
3, "3+KK 28Â mÂ",
4, "3+1 120 m (Mezone)")
dataset
dataset_tranformed <- tribble(
~ID, ~Rooms, ~Meters, ~Mezone, ~KK,
1, 4, 81,0, 0,
2, 3, 90,0,0,
3, 3, 28,0,1,
4, 4, 120,1, 0)
dataset_tranformed
The columns first need to be separated; however, using dataset %>% separate(DESC, c("size", "meters_squared", "Mezone"), sep = " ") does not work because (Mezone) is thrown away.
We can do this by extracting the components individually and evaluating the room expression (e.g. "3+1") to get the room count
library(dplyr)
library(stringr)
library(tidyr)
library(purrr) # map_dbl
dataset %>%
mutate(Rooms = map_dbl(DESC, ~
str_extract(.x, "^\\d+\\+\\d*") %>%
str_replace("\\+$", "+0") %>%
rlang::parse_expr(.) %>%
eval ),
Meters = str_extract(DESC, "(?<=\\s)\\d+(?=Â)"),
Mezone = +(str_detect(DESC, "Mezone")),
KK = +(str_detect(DESC, "KK"))) %>%
select(-DESC)
# A tibble: 4 x 5
# ID Rooms Meters Mezone KK
# <dbl> <dbl> <chr> <int> <int>
#1 1 4 81 0 0
#2 2 3 90 0 0
#3 3 3 28 0 1
#4 4 4 120 1 0
Or another option is extract and then make use of str_detect
dataset %>%
extract(DESC, into = c("Rooms1", "Rooms2", "Meters"),
"^(\\d+)\\+(\\d*)[^0-9]+(\\d+)", convert = TRUE, remove = FALSE) %>%
transmute(ID, Mezone = +(str_detect(DESC, "Mezone")),
KK = +(is.na(Rooms2)), Rooms = Rooms1 + replace_na(Rooms2, 0), Meters )
# A tibble: 4 x 5
# ID Mezone KK Rooms Meters
# <dbl> <int> <int> <dbl> <int>
#1 1 0 0 4 81
#2 2 0 0 3 90
#3 3 0 1 3 28
#4 4 1 0 4 120
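If parsing and evaluating strings feels heavy-handed, the same fields can be pulled out with one regular expression in base R. The sketch below assumes simplified ASCII descriptions (plain spaces, no stray encoding characters), which is not quite the data in the question; the names desc, m, extra and result are illustrative only.

```r
# Simplified, ASCII-only versions of the DESC values (an assumption)
desc <- c("3+1 81 m", "2+1 90 m", "3+KK 28 m", "3+1 120 m (Mezone)")

# Groups: rooms before "+", optional digits after "+", then the area
pattern <- "^([0-9]+)\\+([0-9]*)[^0-9]+([0-9]+)"
m <- do.call(rbind, regmatches(desc, regexec(pattern, desc)))

# An empty second group (as in "3+KK") counts as zero extra rooms
extra <- m[, 3]
extra[extra == ""] <- "0"

result <- data.frame(
  ID     = seq_along(desc),
  Rooms  = as.integer(m[, 2]) + as.integer(extra),
  Meters = as.integer(m[, 4]),
  Mezone = as.integer(grepl("Mezone", desc, fixed = TRUE)),
  KK     = as.integer(grepl("KK", desc, fixed = TRUE))
)
result
```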

Using spread with duplicate identifiers for rows

I have a long-form dataframe that has multiple entries for the same date and person.
jj <- data.frame(month=rep(1:3,4),
student=rep(c("Amy", "Bob"), each=6),
A=c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
B=c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5))
I want to convert it to wide form and make it like this:
month Amy.A Bob.A Amy.B Bob.B
1
2
3
1
2
3
1
2
3
1
2
3
My question is very similar to this. I have used the given code in the answer :
kk <- jj %>%
gather(variable, value, -(month:student)) %>%
unite(temp, student, variable) %>%
spread(temp, value)
but it gives following error:
Error: Duplicate identifiers for rows (1, 4), (2, 5), (3, 6), (13, 16), (14, 17), (15, 18), (7, 10), (8, 11), (9, 12), (19, 22), (20, 23), (21, 24)
Thanks in advance.
Note: I don't want to delete multiple entries.
Your answer was missing a mutate step that adds an id! Here is the solution using only the dplyr and tidyr packages.
jj %>%
gather(variable, value, -(month:student)) %>%
unite(temp, student, variable) %>%
group_by(temp) %>%
mutate(id=1:n()) %>%
spread(temp, value)
# A tibble: 6 x 6
# month id Amy_A Amy_B Bob_A Bob_B
# * <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 9 6 3 5
# 2 1 4 8 5 5 3
# 3 2 2 7 7 2 4
# 4 2 5 6 6 6 1
# 5 3 3 6 8 1 6
# 6 3 6 9 7 5 5
The issue is the two columns for both A and B. If we can make that one value column, we can spread the data as you would like. Take a look at the output for jj_melt when you use the code below.
library(reshape2)
jj_melt <- melt(jj, id=c("month", "student"))
jj_spread <- dcast(jj_melt, month ~ student + variable, value.var="value", fun=sum)
# month Amy_A Amy_B Bob_A Bob_B
# 1 1 17 11 8 8
# 2 2 13 13 8 5
# 3 3 15 15 6 11
I won't mark this as a duplicate since the other question did not summarize by sum, but the data.table answer could help with one additional argument, fun=sum:
library(data.table)
dcast(setDT(jj), month ~ student, value.var=c("A", "B"), fun=sum)
# month A_sum_Amy A_sum_Bob B_sum_Amy B_sum_Bob
# 1: 1 17 8 11 8
# 2: 2 13 8 13 5
# 3: 3 15 6 15 11
If you would like to use the tidyr solution, combine it with dcast to summarize by sum.
jj <- as.data.frame(jj) # convert back from data.table; without the assignment this only prints
library(tidyr)
jj %>%
gather(variable, value, -(month:student)) %>%
unite(temp, student, variable) %>%
dcast(month ~ temp, fun=sum)
# month Amy_A Amy_B Bob_A Bob_B
# 1 1 17 11 8 8
# 2 2 13 13 8 5
# 3 3 15 15 6 11
Edit
Based on your new requirements, I have added an activity column.
library(dplyr)
jj %>% group_by(month, student) %>%
mutate(id=1:n()) %>%
melt(id=c("month", "id", "student")) %>%
dcast(... ~ student + variable, value.var="value")
# month id Amy_A Amy_B Bob_A Bob_B
# 1 1 1 9 6 3 5
# 2 1 2 8 5 5 3
# 3 2 1 7 7 2 4
# 4 2 2 6 6 6 1
# 5 3 1 6 8 1 6
# 6 3 2 9 7 5 5
The other solutions can also be used. Here I added an optional expression to arrange the final output by activity number:
library(tidyr)
jj %>%
gather(variable, value, -(month:student)) %>%
unite(temp, student, variable) %>%
group_by(temp) %>%
mutate(id=1:n()) %>%
dcast(... ~ temp) %>%
arrange(id)
# month id Amy_A Amy_B Bob_A Bob_B
# 1 1 1 9 6 3 5
# 2 2 2 7 7 2 4
# 3 3 3 6 8 1 6
# 4 1 4 8 5 5 3
# 5 2 5 6 6 6 1
# 6 3 6 9 7 5 5
The data.table syntax is compact because it allows for multiple value.var columns and will take care of the spread for us. We can then skip the melt -> cast process.
library(data.table)
setDT(jj)[, activityID := rowid(student)]
dcast(jj, ... ~ student, value.var=c("A", "B"))
# month activityID A_Amy A_Bob B_Amy B_Bob
# 1: 1 1 9 3 6 5
# 2: 1 4 8 5 5 3
# 3: 2 2 7 2 7 4
# 4: 2 5 6 6 6 1
# 5: 3 3 6 1 8 6
# 6: 3 6 9 5 7 5
Since tidyr 1.0.0, pivot_wider is the recommended replacement for spread, and you could do the following:
jj <- data.frame(month=rep(1:3,4),
student=rep(c("Amy", "Bob"), each=6),
A=c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
B=c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5))
library(tidyr)
pivot_wider(
jj,
names_from = "student",
values_from = c("A","B"),
names_sep = ".",
values_fn = list(A= list, B= list)) %>%
unchop(everything())
#> # A tibble: 6 x 5
#> month A.Amy A.Bob B.Amy B.Bob
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 9 3 6 5
#> 2 1 8 5 5 3
#> 3 2 7 2 7 4
#> 4 2 6 6 6 1
#> 5 3 6 1 8 6
#> 6 3 9 5 7 5
Created on 2019-09-14 by the reprex package (v0.3.0)
The twist in this problem is that month is not unique per student. To solve this:
values_fn = list(A = list, B = list) puts the multiple values in a list
unchop(everything()) unnests the lists vertically; you can use unnest here as well
If we create a unique sequence, then we can get the output in the correct format with pivot_wider
library(dplyr)
library(tidyr)
jj %>%
group_by(month, student) %>%
mutate(rn = row_number()) %>%
pivot_wider(names_from = 'student', values_from = c('A', 'B'),
names_sep='.') %>%
select(-rn)
# A tibble: 6 x 5
# Groups: month [3]
# month A.Amy A.Bob B.Amy B.Bob
# <int> <dbl> <dbl> <dbl> <dbl>
#1 1 9 3 6 5
#2 2 7 2 7 4
#3 3 6 1 8 6
#4 1 8 5 5 3
#5 2 6 6 6 1
#6 3 9 5 7 5
data
jj <- structure(list(month = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L,
1L, 2L, 3L), student = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("Amy", "Bob"), class = "factor"),
A = c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5), B = c(6, 7, 8,
5, 6, 7, 5, 4, 6, 3, 1, 5)), class = "data.frame", row.names = c(NA,
-12L))
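The "Duplicate identifiers" error comes down to several rows sharing the same month/student key. As a rough base R sketch, the duplicates can be counted and a within-group counter added before reshaping; the column name id is an arbitrary choice for this example.

```r
jj <- data.frame(month = rep(1:3, 4),
                 student = rep(c("Amy", "Bob"), each = 6),
                 A = c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
                 B = c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5))

# How many rows repeat an earlier month/student combination?
n_dup <- sum(duplicated(jj[c("month", "student")]))
n_dup

# Number the repeated entries within each month/student group
jj$id <- ave(seq_len(nrow(jj)), jj$month, jj$student, FUN = seq_along)

# With id included, every row is uniquely identified (0 means no duplicates)
anyDuplicated(jj[c("month", "student", "id")])
```

Once the key is unique, any of the spreading approaches above can be applied without collapsing or dropping entries.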
