I want to take a tibble that represents dialogue and turn it into a .txt that can be manually edited in a text editor and then returned to a tibble for processing.
The key challenge I've had is separating the blocks of text in a way that they can be re-imported to a similar format after editing while preserving the "Speaker" designation.
Speed is important as the volume of files and the length of each text segment are large.
Here's the input tibble:
tibble::tribble(
~word, ~speakerTag,
"been", 1L,
"going", 1L,
"on", 1L,
"and", 1L,
"what", 1L,
"your", 1L,
"goals", 1L,
"are.", 1L,
"Yeah,", 2L,
"so", 2L,
"so", 2L,
"John", 2L,
"has", 2L,
"15", 2L
)
Here's the desired output in a .txt:
###Speaker 1###
been going on and what your goals are.
###Speaker 2###
Yeah, so so John has 15
Here's the desired return after correcting errors manually:
tibble::tribble(
~word, ~speakerTag,
"been", 1L,
"going", 1L,
"on", 1L,
"and", 1L,
"what", 1L,
"your", 1L,
"goals", 1L,
"in", 1L,
"r", 1L,
"Yeah,", 2L,
"so", 2L,
"so", 2L,
"John", 2L,
"hates", 2L,
"50", 2L
)
One way is to prepend a "Speaker" label, surrounded by newlines, to the first word of each run of consecutive speakerTag values:
library(data.table)
library(dplyr)
library(tidyr)
setDT(df)[, word := replace(word, 1, paste0("\n\nSpeaker",
first(speakerTag), '\n\n', first(word))), rleid(speakerTag)]
We can write this to a text file using
writeLines(paste(df$word, collapse = " "), 'Downloads/temp.txt')
It looks like this:
cat(paste(df$word, collapse = " "))
#Speaker1
#
#been going on and what your goals are.
#
#Speaker2
#
#Yeah, so so John has 15
To read it back into R, we can do:
read.table('Downloads/temp.txt', sep="\t", col.names = 'word') %>%
mutate(SpeakerTag = replace(word, c(FALSE, TRUE), NA)) %>%
fill(SpeakerTag) %>%
slice(seq(2, n(), 2)) %>%
separate_rows(word, sep = "\\s") %>%
filter(word != '')
# word SpeakerTag
#1 been Speaker1
#2 going Speaker1
#3 on Speaker1
#4 and Speaker1
#5 what Speaker1
#6 your Speaker1
#7 goals Speaker1
#8 are. Speaker1
#9 Yeah, Speaker2
#10 so Speaker2
#11 so Speaker2
#12 John Speaker2
#13 has Speaker2
#14 15 Speaker2
We can strip the "Speaker" prefix from the SpeakerTag column if it is not needed.
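The same round trip can also be sketched in base R while keeping the ###Speaker N### header format from the question. The helper names dialogue_to_text and text_to_dialogue are hypothetical, the sample data is a shortened version of the question's tibble, and the sketch assumes that every block of text follows a header line and that speaker tags are whole numbers.

```r
# Hypothetical helper: one header line plus one text line per run of the same speaker
dialogue_to_text <- function(df) {
  runs <- rle(df$speakerTag)                        # consecutive runs of one speaker
  run_id <- rep(seq_along(runs$values), runs$lengths)
  body <- tapply(df$word, run_id, paste, collapse = " ")
  # rbind + as.vector interleaves header 1, text 1, header 2, text 2, ...
  as.vector(rbind(sprintf("###Speaker %d###", runs$values), body))
}

# Hypothetical helper: parse the edited lines back into one row per word
text_to_dialogue <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]             # drop blank lines
  is_header <- grepl("^###Speaker [0-9]+###$", lines)
  tag <- as.integer(sub("^###Speaker ([0-9]+)###$", "\\1", lines[is_header]))
  block <- cumsum(is_header)                        # header index for every line
  words <- strsplit(trimws(lines[!is_header]), "\\s+")
  data.frame(
    word = unlist(words),
    speakerTag = rep(tag[block[!is_header]], lengths(words)),
    stringsAsFactors = FALSE
  )
}

# Shortened sample data (assumption: same columns as the question's tibble)
df <- data.frame(
  word = c("been", "going", "on", "Yeah,", "so", "John"),
  speakerTag = c(1L, 1L, 1L, 2L, 2L, 2L),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".txt")
writeLines(dialogue_to_text(df), path)              # edit this file by hand
round_trip <- text_to_dialogue(readLines(path))
```

Because the headers are matched line by line, an edited block may span several lines and still be assigned to the header above it.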
I have the following dataset
structure(list(Var1 = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), .Label = c("0", "1"), class = "factor"), Var2 = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("congruent", "incongruent"
), class = "factor"), Var3 = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), .Label = c("spoken", "written"), class = "factor"),
Freq = c(8L, 2L, 10L, 2L, 10L, 2L, 10L, 2L)), class = "data.frame", row.names = c(NA,
-8L))
I would like to add another column reporting the sum of each pair of consecutive rows. The final result would look like this:
I have proceeded like this
Table = as.data.frame(table(data_1$unimodal,data_1$cong_cond, data_1$presentation_mode)) %>%
mutate(Var1 = factor(Var1, levels = c('0', '1')))
row = Table %>% #is.factor(Table$Var1)
summarise(across(where(is.numeric),
~ .[Var1 == '0'] + .[Var1 == '1'],
.names = "{.col}_sum"))
column = c(rbind(row$Freq_sum,rep(NA, 4)))
Table$column = column
But I am looking for the quickest way possible, without splitting the work into separate steps. I have used the dplyr package here, but if you know of other ways, with map(), a for loop, or whatever method you deem best, please let me know.
This should do:
df$column <-
rep(colSums(matrix(df$Freq, 2)), each=2) * c(1, NA)
If you are fine with no NAs in the data frame, you can use a grouped sum instead:
df %>%
group_by(Var2, Var3) %>%
mutate(column = sum(Freq))
# A tibble: 8 × 5
# Groups: Var2, Var3 [4]
Var1 Var2 Var3 Freq column
<fct> <fct> <fct> <int> <int>
1 0 congruent spoken 8 10
2 1 congruent spoken 2 10
3 0 incongruent spoken 10 12
4 1 incongruent spoken 2 12
5 0 congruent written 10 12
6 1 congruent written 2 12
7 0 incongruent written 10 12
8 1 incongruent written 2 12
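For comparison, a base R sketch of the same grouped sum using ave(), with an optional second column that puts NA on the Var1 == "1" rows as in the question's layout. The data frame is rebuilt here from the question's dput() output, and the column name column_na is an assumption made for illustration.

```r
# Data rebuilt from the question's dput() output
df <- data.frame(
  Var1 = factor(rep(c("0", "1"), 4)),
  Var2 = factor(rep(rep(c("congruent", "incongruent"), each = 2), 2)),
  Var3 = factor(rep(c("spoken", "written"), each = 4)),
  Freq = c(8L, 2L, 10L, 2L, 10L, 2L, 10L, 2L)
)

# Sum of Freq within each Var2 x Var3 cell, repeated on every row of the cell
df$column <- ave(df$Freq, df$Var2, df$Var3, FUN = sum)

# Optional (hypothetical column name): keep the sum only on the Var1 == "0" rows
df$column_na <- ifelse(df$Var1 == "0", df$column, NA)
```

Unlike the matrix approach, this does not rely on the rows being ordered in pairs.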
I am doing some tricky data cleaning. I have one dataset (first extract below) that is the output from the digitization of PDF tables. Unfortunately, the columns were not digitized properly. Sometimes, what should be in column X3 ended up concatenated in column X2 with the last word of column X2...
What I am trying to do is move what should be in column X3 back to X3 and collapse the two rows in X2 together.
I have attached an extract of the output I am trying to create (second extract below).
Any idea how I can do this?
Thank you!
structure(list(X1 = c(111L, NA, 2L, NA, NA, 121L, NA, NA, 121L,
NA, NA, 141L, NA, NA, 141L, NA), X2 = structure(c(7L, 1L, 8L,
1L, 1L, 9L, 1L, 1L, 6L, 3L, 1L, 5L, 2L, 1L, 10L, 4L), .Label = c("",
"A - BWHITE", "ASMITH", "B - DBURNEY", "Garden Harris", "House M. Aba",
"House M. Bab", "House M. Cac", "Street M. Bak", "Villa Thomas"
), class = "factor"), X3 = structure(c(2L, 1L, 3L, 1L, 1L, 4L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "A",
"A - C", "D"), class = "factor")), class = "data.frame", row.names = c(NA,
-16L))
structure(list(X1 = c(111L, NA, 2L, NA, NA, 121L, NA, NA, 121L,
NA, NA, 141L, NA, NA, 141L), X2 = structure(c(4L, 1L, 5L, 1L,
1L, 6L, 1L, 1L, 3L, 1L, 1L, 2L, 1L, 1L, 7L), .Label = c("", "Garden Harris WHITE",
"House M. Aba SMITH", "House M. Bab", "House M. Cac", "Street M. Bak",
"Villa Thomas BURNEY"), class = "factor"), X3 = structure(c(2L,
1L, 4L, 1L, 1L, 6L, 1L, 1L, 2L, 1L, 1L, 3L, 1L, 1L, 5L), .Label = c("",
"A", "A - B", "A - C", "B - D", "D"), class = "factor")), class = "data.frame", row.names = c(NA,
-15L))
Follow up question here: Cleaning extract_tables conditional merge rows, systematic extraction
You could use tidyverse:
library(tidyr)
library(stringr)
library(dplyr)
df %>%
filter(X2 != "") %>%
mutate(
extract_name = lead(str_extract(X2, "(?<=[A-Z])[A-Z]+")),
extract_part = lead(str_extract(X2, "[A-Z](\\s-\\s[A-Z]){0,1}(?=[A-Z]+)")),
new_X2 = ifelse(!is.na(extract_name), paste(X2, extract_name), as.character(X2)),
new_X3 = ifelse(X3 != "", as.character(X3), extract_part)
) %>%
drop_na(X1) %>%
select(-extract_name, -extract_part)
which returns
   X1            X2    X3              new_X2 new_X3
1 111  House M. Bab     A        House M. Bab      A
2   2  House M. Cac A - C        House M. Cac  A - C
3 121 Street M. Bak     D       Street M. Bak      D
4 121  House M. Aba        House M. Aba SMITH      A
5 141 Garden Harris       Garden Harris WHITE  A - B
6 141  Villa Thomas       Villa Thomas BURNEY  B - D
Note: I don't think this approach is really stable regarding the regex used. For readability I filtered out some annoying rows containing NA and empty strings, you should remove those parts if necessary.
Here is how we could do it:
Credits to MartinGal for the regex "(?<=[A-Z])[A-Z]+" (upvote!)
1. Replace empty values with NA.
2. Use lead to pull the next row's X2 value up into X3 wherever X3 is NA.
3. Keep only the rows where X1 is not NA.
4. Extract the trailing uppercase word with str_extract and the regex "(?<=[A-Z])[A-Z]+", append it to X2 with str_c, and coalesce the result with the original X2 so that rows without a match keep their value.
5. Remove that word from X3 with str_remove_all and the same regex, keeping only the relevant part.
library(dplyr)
library(stringr)
df %>%
mutate(across(everything(), ~sub("^\\s*$", NA, .)),
X3= ifelse(is.na(X3), lead(X2), X3)) %>%
filter(!is.na(X1)) %>%
mutate(X2 = coalesce(str_c(X2," ", str_extract(X3, "(?<=[A-Z])[A-Z]+")), X2),
X3 = str_remove_all(X3, "(?<=[A-Z])[A-Z]+"))
Output:
X1 X2 X3
1 111 House M. Bab A
2 2 House M. Cac A - C
3 121 Street M. Bak D
4 121 House M. Aba SMITH A
5 141 Garden Harris WHITE A - B
6 141 Villa Thomas BURNEY B - D
This is a yucky one:
# Retype the data and nullify empty values;
# use X1 as a key: intermediateResult => data.frame
intermediateResult <- data.frame(
lapply(
transform(
replace(df, df == "", NA_character_),
X1 = na.omit(X1)[cumsum(!is.na(X1))]
),
as.character
)
)
# Re-structure the data:
# intermediateResult2 => data.frame
intermediateResult2 <- do.call(
rbind,
Filter(
function(y){
nrow(y) > 0
},
Map(
function(x){
z <- x[!is.na(x$X2),]
if(nrow(z) > 1 & is.na(z$X3[1])){
z$X3[1] <- z$X2[2]
head(z, 1)
}else{
z
}
},
with(
intermediateResult,
split(
intermediateResult,
paste(
X1,
cumsum(
(is.na(X2)
)
),
sep = " - "
)
)
)
)
)
)
# Regex it and hope for the best:
# result => data.frame
result <- data.frame(
transform(
intermediateResult2,
X2 = paste0(
X2,
ifelse(
(nchar(X3) == 1 | grepl("^\\w\\s+-\\s+\\w$", X3)),
"",
ifelse(
!(grepl("^\\w\\s+-\\s+\\w", X3)),
paste0(" ", substr(X3, 2, nchar(X3))),
paste0(" ", gsub("(^\\w\\s+-\\s+\\w)(.*)", "\\2", X3))
)
)
),
X3 = ifelse(
nchar(X3) == 1 | grepl("^\\w\\s+-\\s+\\w$", X3) ,
X3,
ifelse(
!(grepl("^\\w\\s+-\\s+\\w", X3)),
substr(X3, 1, 1),
gsub("(^\\w\\s+-\\s+\\w)(.*)", "\\1", X3)
)
)
),
row.names = NULL
)
I am using R to manipulate a large dataset (dataset) of 20,000+ rows. Three columns matter for this question: Trial_Nr (90 trials), seconds (increasing in 0.02-second increments), and threat (fixation on threat: 1 = yes, 0 = no, or NA). Within each trial, I need to know how long it takes, once they first fixate on threat (1), for them to stop fixating on threat (0). So, within each trial, I need to find the first threat = 1 and the subsequent threat = 0, and subtract the times. I can get the first threat with this code:
initalfixthreat <- dataset %>%
group_by(Trial_Nr) %>%
slice(which(threat == '1')[1])
I am stumped on how to get the subsequent threat=0 within that trial number.
Here is an example of the data (sorry don't know how to format it better):
So for Trial_Nr=1, I would be interested in 689.9 - 689.8 seconds.
For Trial_Nr=2, I would want 690.04 - 689.96.
Please let me know if I was unclear and thank you all for your help!
One approach is:
library(dplyr)
df %>%
group_by(Trial_Nr) %>%
filter(!is.na(threat)) %>%
mutate(flag = ifelse(threat == 1, 1, threat - lag(threat))) %>%
filter(abs(flag) == 1 & !duplicated(flag)) %>%
summarise(timediff = ifelse(length(seconds) == 1, NA, diff(seconds)))
# A tibble: 2 x 2
Trial_Nr timediff
<int> <dbl>
1 1 0.1
2 2 0.0800
Data:
df <- structure(list(Trial_Nr = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L), seconds = c(689.76, 689.78, 689.8, 689.82,
689.84, 689.86, 689.88, 689.9, 689.92, 689.94, 689.96, 689.98,
690, 690.02, 690.04), threat = c(0L, 0L, 1L, 1L, 1L, NA, NA,
0L, 1L, 0L, 1L, NA, NA, 1L, 0L)), class = "data.frame", row.names = c(NA,
-15L))
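A more explicit base R sketch of the same idea, which may be easier to adapt: for each trial, find the first threat == 1, then the first threat == 0 after it, and subtract the times. The helper name time_to_disengage is hypothetical, rows with NA threat are ignored as in the dplyr answer, and the data frame below repeats the sample data.

```r
# Sample data, same values as the df shown above
df <- data.frame(
  Trial_Nr = rep(c(1L, 2L), c(10, 5)),
  seconds = c(689.76, 689.78, 689.8, 689.82, 689.84, 689.86, 689.88, 689.9,
              689.92, 689.94, 689.96, 689.98, 690, 690.02, 690.04),
  threat = c(0L, 0L, 1L, 1L, 1L, NA, NA, 0L, 1L, 0L, 1L, NA, NA, 1L, 0L)
)

# Hypothetical helper: time from the first threat == 1 to the next threat == 0
time_to_disengage <- function(seconds, threat) {
  keep <- !is.na(threat)                       # ignore rows with missing threat
  seconds <- seconds[keep]
  threat <- threat[keep]
  first_on <- which(threat == 1)[1]            # NA if the trial never has threat == 1
  if (is.na(first_on)) return(NA_real_)
  later_off <- which(threat == 0 & seq_along(threat) > first_on)[1]
  if (is.na(later_off)) return(NA_real_)       # no threat == 0 after the first 1
  seconds[later_off] - seconds[first_on]
}

# One value per trial, named by Trial_Nr
timediff <- vapply(
  split(df, df$Trial_Nr),
  function(d) time_to_disengage(d$seconds, d$threat),
  numeric(1)
)
```

Trials with no threat == 1, or with no later threat == 0, come back as NA rather than raising an error.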
I have trouble combining slice and map.
I would like to do something similar to this: transform a compact person-period file into a long (sequential) person-period one. However, because my file is too big, I need to split the data first.
My data look like this
group id var ep dur
1 A 1 a 1 20
2 A 1 b 2 10
3 A 1 a 3 5
4 A 2 b 1 5
5 A 2 b 2 10
6 A 2 b 3 15
7 B 1 a 1 20
8 B 1 a 2 10
9 B 1 a 3 10
10 B 2 c 1 20
11 B 2 c 2 5
12 B 2 c 3 10
What I need is simply this (answer from this)
library(dplyr)
dt %>% slice(rep(1:n(),.$dur))
However, I would like to introduce a split(.$group) step.
How am I supposed to do that? For example,
dt %>% split(.$group) %>% map_df(slice(rep(1:n(),.$dur)))
does not work.
My desired output is the same as dt %>% slice(rep(1:n(),.$dur))
which is
group id var ep dur
1 A 1 a 1 20
2 A 1 a 1 20
3 A 1 a 1 20
4 A 1 a 1 20
5 A 1 a 1 20
6 A 1 a 1 20
7 A 1 a 1 20
8 A 1 a 1 20
9 A 1 a 1 20
10 A 1 a 1 20
.....
But I need to split this operation because the file is too big.
data
dt = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L, 2L), .Label = c("1", "2"), class = "factor"), var = structure(c(1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), ep = structure(c(1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor"), dur = c(20, 10, 5, 5, 10, 15, 20,
10, 10, 20, 5, 10)), .Names = c("group", "id", "var", "ep",
"dur"), row.names = c(NA, -12L), class = "data.frame")
map takes two arguments: a vector/list in .x and a function in .f. It then applies .f on all elements in .x.
The function you are passing to map is not formatted correctly. Try this:
library(dplyr)
library(purrr)
# Expand one group's rows: each row is repeated dur times
f <- function(x) x %>% slice(rep(1:n(), .$dur))
dt %>%
split(.$group) %>%
map_df(f)
You could also use it like this:
dt %>%
split(.$group) %>%
map_df(slice, rep(1:n(), dur))
This time you directly pass the slice function to map with additional parameters.
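The split, apply, combine pattern behind this answer can also be sketched in base R, which may help to see what map_df is doing. The toy data frame and the helper name expand_rows are assumptions made for illustration, not part of the original answer.

```r
# Toy data (assumption): same idea as dt, with fewer rows and columns
dt_small <- data.frame(
  group = c("A", "A", "B"),
  id = c(1, 2, 1),
  dur = c(2, 1, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: repeat every row of one piece dur times
expand_rows <- function(d) d[rep(seq_len(nrow(d)), d$dur), , drop = FALSE]

pieces <- lapply(split(dt_small, dt_small$group), expand_rows)  # one piece per group
long <- do.call(rbind, pieces)                                  # bind the pieces back together
rownames(long) <- NULL
```

Each piece is expanded on its own, so only one group's expanded rows are built at a time before the final bind.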
I'm not quite sure what your desired final output is, but you could use tidyr to nest the data that you want to repeat and a simple function to expand levels of your nested data, very similar to Tutuchan's answer.
expand_df <- function(df, repeats) {
df %>% slice(rep(1:n(), repeats))
}
dt %>%
tidyr::nest(var:ep) %>%
mutate(expanded = purrr::map2(data, dur, expand_df)) %>%
select(-data) %>%
tidyr::unnest()
Tutuchan's answer gives exactly the same output as your original approach - is that what you were looking for? I don't know if it will have any advantage over your original method.
For a sample dataframe:
df <- structure(list(animal.1 = structure(c(1L, 1L, 2L, 2L, 2L, 4L,
4L, 3L, 1L, 1L), .Label = c("cat", "dog", "horse", "rabbit"), class = "factor"),
animal.2 = structure(c(1L, 2L, 2L, 2L, 4L, 4L, 1L, 1L, 3L,
1L), .Label = c("cat", "dog", "hamster", "rabbit"), class = "factor"),
number = c(5L, 3L, 2L, 5L, 1L, 4L, 6L, 7L, 1L, 11L)), .Names = c("animal.1",
"animal.2","number"), class = "data.frame", row.names = c(NA,
-10L))
... I wish to make a new df in which rows with the same animal in both columns are added together. That is, multiple rows with the same animal in columns 1 and 2 are combined into one. For the data frame above, the result would read:
cat cat 16
dog dog 7
cat dog 3
etc. (rows with different animals would be left as they are). Importantly, the sum of 'number' would be the same in both data frames.
My real df has >400K observations, so any recommendation that can cope with a large dataset would be great!
Thanks in advance.
One option would be to use data.table. Convert the "data.frame" to a "data.table" with setDT(); where the "animal.1" value equals "animal.2", replace "number" with the sum of "number" after grouping by the two columns; and finally take the unique rows.
library(data.table)
setDT(df)[as.character(animal.1)==as.character(animal.2),
number:=sum(number) ,.(animal.1, animal.2)]
unique(df)
# animal.1 animal.2 number
#1: cat cat 16
#2: cat dog 3
#3: dog dog 7
#4: dog rabbit 1
#5: rabbit rabbit 4
#6: rabbit cat 6
#7: horse cat 7
#8: cat hamster 1
Or an option with dplyr. The approach is similar to data.table. We group by "animal.1", "animal.2", then replace the "number" with sum only when "animal.1" is equal to "animal.2", and get the unique rows
library(dplyr)
df %>%
group_by(animal.1, animal.2) %>%
mutate(number=replace(number,as.character(animal.1)==
as.character(animal.2),
sum(number))) %>%
unique()