I'm doing feature extraction from labelled Twitter data to use for predicting fake tweets. I've spent a lot of time on various GitHub methods, R libraries and Stack Overflow posts, but somehow I couldn't find a "direct" method of extracting features related to emojis, e.g. the number of emojis, whether the tweet contains an emoji (1/0), or even the occurrence of specific emojis (which might occur more often in fake/real news).
I'm not sure whether there is a point in showing reproducible code.
The "ore" library, for example, offers functions that gather all tweets in an object and extract emojis, but the formats are problematic (at least to me) when trying to create features out of the extractions, as mentioned above. The example below uses a WhatsApp text sample. I will add Twitter data from Kaggle to make it somewhat reproducible.
Twitter Dataset: https://github.com/sherylWM/Fake-News-Detection-using-Twitter/blob/master/FinalDataSet.csv
# save this to '_chat.txt' (it requires a login)
# https://www.kaggle.com/sarthaknautiyal/whatsappsample
library(ore)
library(dplyr)
emoji_src <- "https://raw.githubusercontent.com/laurenancona/twimoji/gh-pages/twitterEmojiProject/emoticon_conversion_noGraphic.csv"
emoji_fil <- basename(emoji_src)
if (!file.exists(emoji_fil)) download.file(emoji_src, emoji_fil)
emoji <- read.csv(emoji_fil, header=FALSE, stringsAsFactors = FALSE)
emoji_regex <- sprintf("(%s)", paste0(emoji$V2, collapse="|"))
compiled <- ore(emoji_regex)
chat <- readLines("_chat.txt", encoding = "UTF-8", warn = FALSE)
which(grepl(emoji_regex, chat, useBytes = TRUE))
## [1] 8 9 10 11 13 19 20 22 23 62 65 69 73 74 75 82 83 84 87 88 90 91
## [23] 92 93 94 95 107 108 114 115 117 119 122 123 124 125 130 135 139 140 141 142 143 144
## [45] 146 147 150 151 153 157 159 161 162 166 169 171 174 177 178 183 184 189 191 192 195 196
## [67] 199 200 202 206 207 209 220 221 223 224 225 226 228 229 234 235 238 239 242 244 246 247
## [89] 248 249 250 251 253 259 260 262 263 265 274 275 280 281 282 286 287 288 291 292 293 296
## [111] 302 304 305 307 334 335 343 346 348 351 354 355 356 358 361 362 382 389 390 391 396 397
## [133] 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419
## [155] 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 442 451 452
## [177] 454 459 463 465 466 469 471 472 473 474 475 479 482 484 485 486 488 490 492 493 496 503
## [199] 505 506 507 509 517 518 519 525 526 527 528 531 535 540 543 545 548 549 557 558 559 560
## [221] 566 567 571 572 573 574 576 577 578 580 587 589 591 592 594 597 600 601 603 608 609 625
## [243] 626 627 637 638 639 640 641 643 645 749 757 764
chat_emoji_lines <- chat[which(grepl(emoji_regex, chat, useBytes = TRUE))]
found_emoji <- ore.search(compiled, chat_emoji_lines, all=TRUE)
emoji_matches <- matches(found_emoji)
str(emoji_matches, 1)
## List of 254
## $ : chr [1:4] "\U0001f600" "\U0001f600" "\U0001f44d" "\U0001f44d"
## $ : chr "\U0001f648"
## $ : chr [1:2] "\U0001f44d" "\U0001f44d"
## $ : chr "\U0001f602"
## $ : chr [1:3] "\U0001f602" "\U0001f602" "\U0001f602"
## $ : chr [1:4] "\U0001f44c" "\U0001f44c" "\U0001f44c" "\U0001f44c"
## $ : chr [1:6] "\U0001f602" "\U0001f602" "\U0001f602" "\U0001f602" ...
## $ : chr "\U0001f600"
## $ : chr [1:5] "\U0001f604" "\U0001f604" "\U0001f604" "\U0001f603" ...
## $ : chr "\U0001f44d"
## ...
library(purrr) # flatten_chr() and map() below come from purrr
tibble(
V2 = flatten_chr(emoji_matches) %>%
map(charToRaw) %>%
map(as.character) %>%
map(toupper) %>%
map(~sprintf("\\x%s", .x)) %>%
map_chr(paste0, collapse="")
) %>%
left_join(emoji) %>%
count(V3, sort=TRUE)
## # A tibble: 89 x 2
## V3 n
## <chr> <int>
## 1 face with tears of joy 110
## 2 smiling face with smiling eyes 50
## 3 face with stuck-out tongue and winking eye 43
## 4 musical note 42
## 5 birthday cake 35
## 6 grinning face with smiling eyes 26
## 7 face with stuck-out tongue and tightly-closed eyes 24
## 8 grinning face 21
## 9 bouquet 17
## 10 thumbs up sign 17
## # ... with 79 more rows
Source: https://gist.github.com/hrbrmstr/e89eb173ae0333f50f94fe5086fedf8b
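For clarity, here is the kind of feature frame I'm ultimately after. This is just a sketch with a toy regex standing in for the emoji_regex built above (the real one is the long alternation of emoji byte sequences from the CSV):

```r
# Toy stand-in for emoji_regex; the real one is built from the CSV above
emoji_regex <- "(\U0001f600|\U0001f602|\U0001f44d)"
tweets <- c("great news \U0001f44d\U0001f44d", "no emoji here", "lol \U0001f602")

# Count regex matches per line; gregexpr() returns -1 when there is no match
count_matches <- function(txt, rx) {
  vapply(gregexpr(rx, txt), function(m) if (m[1] == -1L) 0L else length(m), integer(1))
}

n_emoji   <- count_matches(tweets, emoji_regex)   # number of emojis per tweet
has_emoji <- as.integer(n_emoji > 0)              # 1/0 indicator
features  <- data.frame(text = tweets, n_emoji, has_emoji)
```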
The "textclean" library offers two functions that replace emojis with word equivalents. Source: https://cran.r-project.org/web/packages/textclean/textclean.pdf
Another hit, from the description of CRAN's utf8 package:
Characters with codes above 0xffff, including most emoji, are not
supported on Windows.
Does anyone have any other method, direction, package/function I could use?
I wrote a function for this purpose in my package rwhatsapp.
As your example is a WhatsApp dataset, you can test it directly using the package (install via remotes::install_github("JBGruber/rwhatsapp")):
df <- rwhatsapp::rwa_read("_chat.txt")
#> Warning in readLines(x, encoding = encoding, ...): incomplete final line found
#> on '_chat.txt'
df
#> # A tibble: 392 x 6
#> time author text source emoji emoji_name
#> <dttm> <fct> <chr> <chr> <list> <list>
#> 1 2015-06-25 01:42:12 <NA> : Vishnu Gaud … /home/johan… <NULL> <NULL>
#> 2 2015-06-25 01:42:12 <NA> : You were added /home/johan… <NULL> <NULL>
#> 3 2016-12-18 01:57:38 Shahain :<image omitted> /home/johan… <NULL> <NULL>
#> 4 2016-12-21 21:54:46 Pankaj S… :<image omitted> /home/johan… <NULL> <NULL>
#> 5 2016-12-21 21:57:45 Shahain :Wow /home/johan… <NULL> <NULL>
#> 6 2016-12-21 22:48:51 Sakshi :<image omitted> /home/johan… <NULL> <NULL>
#> 7 2016-12-21 22:49:00 Sakshi :<image omitted> /home/johan… <NULL> <NULL>
#> 8 2016-12-21 22:50:12 Neha Wip… :Awsum😀😀👍🏼👍🏼 /home/johan… <chr … <chr [4]>
#> 9 2016-12-21 22:51:21 Sakshi :🙈 /home/johan… <chr … <chr [1]>
#> 10 2016-12-21 22:57:01 Ganguly :🙂🙂👍🏻👍🏻 /home/johan… <chr … <chr [4]>
#> # … with 382 more rows
I extract the emojis from the text and store them in a list column, since each text can contain multiple emojis. Use tidyr::unnest() to expand the list column.
library(tidyverse)
df %>%
select(time, emoji) %>%
unnest(emoji)
#> # A tibble: 654 x 2
#> time emoji
#> <dttm> <chr>
#> 1 2016-12-21 22:50:12 😀
#> 2 2016-12-21 22:50:12 😀
#> 3 2016-12-21 22:50:12 👍🏼
#> 4 2016-12-21 22:50:12 👍🏼
#> 5 2016-12-21 22:51:21 🙈
#> 6 2016-12-21 22:57:01 🙂
#> 7 2016-12-21 22:57:01 🙂
#> 8 2016-12-21 22:57:01 👍🏻
#> 9 2016-12-21 22:57:01 👍🏻
#> 10 2016-12-21 23:28:51 😂
#> # … with 644 more rows
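Coming back to the feature-extraction question: once the emojis sit in a list column, per-message features such as the emoji count and a 1/0 indicator fall out of lengths(). A minimal sketch on a toy tibble standing in for the rwa_read() output:

```r
library(tibble)
library(dplyr)

# Toy stand-in for the rwa_read() result: one emoji list entry per message
df <- tibble(
  text  = c("Awsum \U0001f600\U0001f600", "plain text", "\U0001f648"),
  emoji = list(c("\U0001f600", "\U0001f600"), NULL, "\U0001f648")
)

features <- df %>%
  mutate(n_emoji   = lengths(emoji),           # NULL entries count as 0
         has_emoji = as.integer(n_emoji > 0))  # 1/0 indicator
```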
You can use this function with any text. The only thing you need to do first is to store the text in a data.frame in a column called text (I use tibble here as it prints nicer):
df <- tibble::tibble(
text = readLines("/home/johannes/_chat.txt")
)
#> Warning in readLines("/home/johannes/_chat.txt"): incomplete final line found on
#> '/home/johannes/_chat.txt'
rwhatsapp::lookup_emoji(df, text_field = "text")
#> # A tibble: 764 x 3
#> text emoji emoji_name
#> <chr> <list> <list>
#> 1 25/6/15, 1:42:12 AM: Vishnu Gaud created this group <NULL> <NULL>
#> 2 25/6/15, 1:42:12 AM: You were added <NULL> <NULL>
#> 3 18/12/16, 1:57:38 AM: Shahain: <image omitted> <NULL> <NULL>
#> 4 21/12/16, 9:54:46 PM: Pankaj Sinha: <image omitted> <NULL> <NULL>
#> 5 21/12/16, 9:57:45 PM: Shahain: Wow <NULL> <NULL>
#> 6 21/12/16, 10:48:51 PM: Sakshi: <image omitted> <NULL> <NULL>
#> 7 21/12/16, 10:49:00 PM: Sakshi: <image omitted> <NULL> <NULL>
#> 8 21/12/16, 10:50:12 PM: Neha Wipro: Awsum😀😀👍🏼👍🏼 <chr [4]> <chr [4]>
#> 9 21/12/16, 10:51:21 PM: Sakshi: 🙈 <chr [1]> <chr [1]>
#> 10 21/12/16, 10:57:01 PM: Ganguly: 🙂🙂👍🏻👍🏻 <chr [4]> <chr [4]>
#> # … with 754 more rows
More details
The way this works under the hood is with a simple dictionary and matching approach. First I split the text into characters and put the characters in a data.frame together with the line id (this is a rewrite of unnest_tokens from tidytext):
lines <- readLines("/home/johannes/_chat.txt")
#> Warning in readLines("/home/johannes/_chat.txt"): incomplete final line found on
#> '/home/johannes/_chat.txt'
id <- seq_along(lines)
l <- stringi::stri_split_boundaries(lines, type = "character")
out <- tibble(id = rep(id, sapply(l, length)), emoji = unlist(l))
Then I match the characters against a dataset of emoji characters (see ?rwhatsapp::emojis for more info):
out <- add_column(out,
emoji_name = rwhatsapp::emojis$name[
match(out$emoji,
rwhatsapp::emojis$emoji)
])
out
#> # A tibble: 28,652 x 3
#> id emoji emoji_name
#> <int> <chr> <chr>
#> 1 1 "2" <NA>
#> 2 1 "5" <NA>
#> 3 1 "/" <NA>
#> 4 1 "6" <NA>
#> 5 1 "/" <NA>
#> 6 1 "1" <NA>
#> 7 1 "5" <NA>
#> 8 1 "," <NA>
#> 9 1 " " <NA>
#> 10 1 "1" <NA>
#> # … with 28,642 more rows
Now the new column contains either an emoji or NA when no emoji was found. After removing the NAs, just the emojis are left.
out <- out[!is.na(out$emoji_name), ]
out
#> # A tibble: 656 x 3
#> id emoji emoji_name
#> <int> <chr> <chr>
#> 1 8 😀 grinning face
#> 2 8 😀 grinning face
#> 3 8 👍🏼 thumbs up: medium-light skin tone
#> 4 8 👍🏼 thumbs up: medium-light skin tone
#> 5 9 🙈 see-no-evil monkey
#> 6 10 🙂 slightly smiling face
#> 7 10 🙂 slightly smiling face
#> 8 10 👍🏻 thumbs up: light skin tone
#> 9 10 👍🏻 thumbs up: light skin tone
#> 10 11 😂 face with tears of joy
#> # … with 646 more rows
The disadvantage of this approach is that you rely on the completeness of your emoji data. However, the dataset in the package includes all known emojis from the Unicode website (Emoji version 13).
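To get the last feature type from the question (occurrence of specific emojis), the matched names can simply be tabulated. A sketch on a toy version of the out data frame from above:

```r
# Toy stand-in for the `out` data frame above (line id + matched emoji name)
out <- data.frame(
  id         = c(8L, 8L, 9L, 10L, 10L),
  emoji_name = c("grinning face", "grinning face", "see-no-evil monkey",
                 "slightly smiling face", "slightly smiling face")
)

# Corpus-wide frequency of each specific emoji
freq <- sort(table(out$emoji_name), decreasing = TRUE)

# Or one count column per emoji name and one row per line, ready as features
feat <- as.data.frame.matrix(table(out$id, out$emoji_name))
```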
Related
I'm trying to use rBLAST for protein sequences but somehow it doesn't work. It works fine for nucleotide sequences, but for proteins it just doesn't return any match (I used a sequence from the searched dataset itself, so there has to be a match). The description says "This includes interfaces to blastn, blastp, blastx...", but the help file in RStudio says "Description Execute blastn from blast+". Did anybody run rBLAST for proteins?
Here's what I ran:
listF<-list.files("Trich_prot_fasta/")
fa<-paste0("Trich_prot_fasta/",listF[i])
makeblastdb(fa, dbtype = "prot", args="")
bl <- blast("Trich_prot_fasta/Tri5640_1_GeneModels_FilteredModels1_aa.fasta", type="blastp")
seq <- readAAStringSet("NDRkinase/testSeq.txt")
cl <- predict(bl, seq)
Result:
> cl <- predict(bl, seq)
Warning message: In predict.BLAST(bl, seq) : BLAST did not return a
match!
Tried to reproduce the error but everything worked as expected on my system (macOS BigSur 11.6 / R v4.1.1 / Rstudio v1.4.1717).
Given your blastn was successful, perhaps you are combining multiple fasta files for your protein fasta reference database? If that's the case, try concatenating them together and use the path to the file instead of an R object ("fa") when making your blastdb. Or perhaps:
makeblastdb(file = "Trich_prot_fasta/Tri5640_1_GeneModels_FilteredModels1_aa.fasta", dbtype = "prot")
Instead of:
makeblastdb(fa, dbtype = "prot", args="")
Also, please edit your question to include the output from sessionInfo() (might help narrow things down).
library(tidyverse)
#BiocManager::install("Biostrings")
#devtools::install_github("mhahsler/rBLAST")
library(rBLAST)
# Download an example fasta file:
# https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000001542/UP000001542_5722.fasta.gz
# Grab the first fasta sequence as "example_sequence.fasta"
listF <- list.files("~/Downloads/Trich_example", full.names = TRUE)
listF
#> [1] "~/Downloads/Trich_example/UP000001542_5722.fasta"
#> [2] "~/Downloads/Trich_example/example_sequence.fasta"
makeblastdb(file = "~/Downloads/Trich_example/UP000001542_5722.fasta", dbtype = "prot")
bl <- blast("~/Downloads/Trich_example/UP000001542_5722.fasta", type = "blastp")
seq <- readAAStringSet("~/Downloads/Trich_example/example_sequence.fasta")
cl <- predict(bl, seq)
cl
#> QueryID SubjectID Perc.Ident Alignment.Length
#> 1 Example_sequence_1 tr|A2D8A1|A2D8A1_TRIVA 100.000 694
#> 2 Example_sequence_1 tr|A2E4L0|A2E4L0_TRIVA 64.553 694
#> 3 Example_sequence_1 tr|A2E4L0|A2E4L0_TRIVA 32.436 669
#> 4 Example_sequence_1 tr|A2D899|A2D899_TRIVA 64.344 488
#> 5 Example_sequence_1 tr|A2D899|A2D899_TRIVA 31.004 458
#> 6 Example_sequence_1 tr|A2D899|A2D899_TRIVA 27.070 314
#> 7 Example_sequence_1 tr|A2D898|A2D898_TRIVA 54.915 468
#> 8 Example_sequence_1 tr|A2D898|A2D898_TRIVA 33.691 653
#> 9 Example_sequence_1 tr|A2D898|A2D898_TRIVA 32.936 671
#> 10 Example_sequence_1 tr|A2D898|A2D898_TRIVA 29.969 654
#> 11 Example_sequence_1 tr|A2D898|A2D898_TRIVA 26.694 487
#> 12 Example_sequence_1 tr|A2D898|A2D898_TRIVA 25.000 464
#> 13 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 39.106 716
#> 14 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 30.724 677
#> 15 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 29.257 417
#> 16 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 23.438 640
#> 17 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 22.981 718
#> 18 Example_sequence_1 tr|A2F4I3|A2F4I3_TRIVA 24.107 112
#> 19 Example_sequence_1 tr|A2FI39|A2FI39_TRIVA 33.378 740
#> 20 Example_sequence_1 tr|A2FI39|A2FI39_TRIVA 31.440 722
#> Mismatches Gap.Openings Q.start Q.end S.start S.end E Bits
#> 1 0 0 1 694 1 694 0.00e+00 1402.0
#> 2 243 2 1 692 163 855 0.00e+00 920.0
#> 3 410 15 22 671 1 646 3.02e-94 312.0
#> 4 173 1 205 692 1 487 0.00e+00 644.0
#> 5 308 7 22 476 1 453 3.55e-55 198.0
#> 6 196 5 13 294 173 485 4.12e-25 110.0
#> 7 211 0 1 468 683 1150 8.48e-169 514.0
#> 8 420 11 2 647 501 1147 1.61e-91 309.0
#> 9 396 10 2 666 363 985 5.78e-89 301.0
#> 10 406 11 16 664 195 801 1.01e-66 238.0
#> 11 297 10 208 662 21 479 1.60e-36 147.0
#> 12 316 7 11 469 29 465 3.04e-36 147.0
#> 13 386 4 2 667 248 963 1.72e-149 461.0
#> 14 411 10 2 625 66 737 8.34e-83 283.0
#> 15 286 5 129 542 14 424 2.66e-52 196.0
#> 16 421 15 5 607 365 972 3.07e-38 152.0
#> 17 407 21 77 662 27 730 1.25e-33 138.0
#> 18 81 3 552 661 3 112 2.10e-01 35.4
#> 19 421 9 3 675 394 1128 1.12e-115 375.0
#> 20 409 15 2 647 163 874 1.21e-82 285.0
...
Created on 2021-09-30 by the reprex package (v2.0.1)
Let's assume we somehow ended up with a data frame object (T2 in the example below) and we want to subset our original data with that data frame. Is there a way to do this without chaining | conditions in the subset call?
Here is a dataset I was playing with, where my attempts failed:
education = read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv", stringsAsFactors = FALSE)
colnames(education) = c("X", "State", "Region", "Urban.Population", "Per.Capita.Income", "Minor.Population", "Education.Expenditures")
head(education)
T1 = c(1,4,13,15,17,23,33,38)
T2 = education[T1,]$State
subset(education, State=="ME"| State=="MA" | State=="MI" | State=="MN" | State=="MO" | State=="MD" | State=="MS" | State=="MT")
subset(education, State==T2[3])
subset(education, State==T2)
PS: I created T2 as the states starting with M, but I don't want to use strings or anything like that. Just assume we somehow ended up with T2, which contains some states.
I'm not quite sure what would be an acceptable answer, but subset(education, State %in% T2) uses T2 as-is and does not use |. Does this solve your problem? It's almost the same approach as Jon Spring points out in the comments, but instead of specifying a vector we can just use T2 with %in%. You say T2 is a data.frame object, but in the data you provided it turns out to be a character vector.
education = read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv", stringsAsFactors = FALSE)
colnames(education) = c("X", "State", "Region", "Urban.Population", "Per.Capita.Income", "Minor.Population", "Education.Expenditures")
T1 = c(1,4,13,15,17,23,33,38)
T2 = education[T1,]$State
T2 # T2 is not a data.frame object (R 4.0)
#> [1] "ME" "MA" "MI" "MN" "MO" "MD" "MS" "MT"
subset(education, State %in% T2)
#> X State Region Urban.Population Per.Capita.Income Minor.Population
#> 1 1 ME 1 508 3944 325
#> 4 4 MA 1 846 5233 305
#> 13 13 MI 2 738 5439 337
#> 15 15 MN 2 664 4921 330
#> 17 17 MO 2 701 4672 309
#> 23 23 MD 3 766 5331 323
#> 33 33 MS 3 445 3448 358
#> 38 38 MT 4 534 4418 335
#> Education.Expenditures
#> 1 235
#> 4 261
#> 13 379
#> 15 378
#> 17 231
#> 23 330
#> 33 215
#> 38 302
But let's say T2 were an actual data.frame:
T2 = education[T1,]["State"]
T2 #check
#> State
#> 1 ME
#> 4 MA
#> 13 MI
#> 15 MN
#> 17 MO
#> 23 MD
#> 33 MS
#> 38 MT
Then we could coerce it into a vector by subsetting it with drop = TRUE.
subset(education, State %in% T2[, , drop = TRUE])
#> X State Region Urban.Population Per.Capita.Income Minor.Population
#> 1 1 ME 1 508 3944 325
#> 4 4 MA 1 846 5233 305
#> 13 13 MI 2 738 5439 337
#> 15 15 MN 2 664 4921 330
#> 17 17 MO 2 701 4672 309
#> 23 23 MD 3 766 5331 323
#> 33 33 MS 3 445 3448 358
#> 38 38 MT 4 534 4418 335
#> Education.Expenditures
#> 1 235
#> 4 261
#> 13 379
#> 15 378
#> 17 231
#> 23 330
#> 33 215
#> 38 302
Created on 2021-06-12 by the reprex package (v0.3.0)
I am splitting a data.frame into a list on the basis of its column names. What I want is to include an id column (id) in all elements of the resulting list, not just one item.
Presently I am doing it by binding the id column to each item of the list afterwards with map and bind_cols (I can write alternatives with Map/do.call/mapply etc. myself). What I want to know is whether there is a canonical way of doing it directly, maybe through a function argument of split.default or some other function, thus saving two or three extra steps.
Reproducible example
df <- data.frame(
stringsAsFactors = FALSE,
id = c("A", "B", "C"),
nm1_a = c(928L, 476L, 928L),
nm1_b = c(61L, 362L, 398L),
nm2_a = c(965L, 466L, 369L),
nm2_b = c(240L, 375L, 904L),
nm3_a = c(429L, 730L, 788L),
nm3_b = c(99L, 896L, 540L),
nm3_c = c(463L, 143L, 870L)
)
df
#> id nm1_a nm1_b nm2_a nm2_b nm3_a nm3_b nm3_c
#> 1 A 928 61 965 240 429 99 463
#> 2 B 476 362 466 375 730 896 143
#> 3 C 928 398 369 904 788 540 870
What I am doing presently
library(tidyverse)
split.default(df[-1], gsub('^(nm\\d+).*', '\\1', names(df)[-1])) %>%
map(~ .x %>% bind_cols('id' = df$id, .))
#> $nm1
#> id nm1_a nm1_b
#> 1 A 928 61
#> 2 B 476 362
#> 3 C 928 398
#>
#> $nm2
#> id nm2_a nm2_b
#> 1 A 965 240
#> 2 B 466 375
#> 3 C 369 904
#>
#> $nm3
#> id nm3_a nm3_b nm3_c
#> 1 A 429 99 463
#> 2 B 730 896 143
#> 3 C 788 540 870
What I want is exactly the same output, but is there any way to do it directly or a more canonical way?
Just for a diversity of options, here's what you said you didn't want to do. The pivot / split / pivot method scales better and adapts beyond keeping an ID based only on column position. It also uses the ID to do the reshaping, so it can be more flexible if you have other operations to do in the intermediate steps and don't know for sure that your row order will stay the same; that's one of the reasons I sometimes avoid binding columns. It also (at least for me) makes more sense to split data based on some variable than by groups of columns.
library(tidyr)
df %>%
pivot_longer(-id) %>%
split(stringr::str_extract(.$name, "^nm\\d+")) %>%
purrr::map(pivot_wider, id_cols = id, names_from = name)
#> $nm1
#> # A tibble: 3 x 3
#> id nm1_a nm1_b
#> <chr> <int> <int>
#> 1 A 928 61
#> 2 B 476 362
#> 3 C 928 398
#>
#> $nm2
#> # A tibble: 3 x 3
#> id nm2_a nm2_b
#> <chr> <int> <int>
#> 1 A 965 240
#> 2 B 466 375
#> 3 C 369 904
#>
#> $nm3
#> # A tibble: 3 x 4
#> id nm3_a nm3_b nm3_c
#> <chr> <int> <int> <int>
#> 1 A 429 99 463
#> 2 B 730 896 143
#> 3 C 788 540 870
You can make use of a temporary variable so that the code is cleaner and easy to understand.
common_cols <- 1
tmp <- df[-common_cols]
lapply(split.default(tmp, sub('^(nm\\d+).*', '\\1', names(tmp))),
function(x) cbind(df[common_cols], x))
#$nm1
# id nm1_a nm1_b
#1 A 928 61
#2 B 476 362
#3 C 928 398
#$nm2
# id nm2_a nm2_b
#1 A 965 240
#2 B 466 375
#3 C 369 904
#$nm3
# id nm3_a nm3_b nm3_c
#1 A 429 99 463
#2 B 730 896 143
#3 C 788 540 870
This one should be just two steps, split and replace.
Map(`[<-`, split.default(df[-1], substr(names(df)[-1], 1, 3)), 'id', value=df[1])
# $nm1
# nm1_a nm1_b id
# 1 928 61 A
# 2 476 362 B
# 3 928 398 C
#
# $nm2
# nm2_a nm2_b id
# 1 965 240 A
# 2 466 375 B
# 3 369 904 C
#
# $nm3
# nm3_a nm3_b nm3_c id
# 1 429 99 463 A
# 2 730 896 143 B
# 3 788 540 870 C
I'm looking for code to extract a time interval (500 ms) from a column (called time) for each trial onset, so that I can calculate a baseline over the first 500 ms of each trial.
The actual time in ms between two consecutive rows of the column varies, because the dataset is downsampled and only changes are reported, so I cannot just count a fixed number of rows to define the time interval.
I tried this:
baseline <- labchart %>%
dplyr::filter(time[1:(length(labchart$time)+500)]) %>%
dplyr::group_by(Participant, trialonset)
but only got error messages like:
Error: Argument 2 filter condition does not evaluate to a logical vector
And I am not sure whether (time[1:(length(labchart$Time)+500)]) would really give me the first 500 ms of each trial.
It's difficult to know exactly what you're asking here. I think what you're asking is how to group observations into 500ms periods given only time intervals between observations.
Suppose the data looks like this:
labchart <- data.frame(time = sample(50:300, 20, TRUE), data = rnorm(20))
labchart
#> time data
#> 1 277 -1.33120732
#> 2 224 -0.85356280
#> 3 80 -0.32012499
#> 4 255 0.32433366
#> 5 227 -0.49600772
#> 6 248 2.23246918
#> 7 138 -1.40170795
#> 8 115 -0.76525043
#> 9 159 0.14239351
#> 10 207 -1.53064873
#> 11 139 -0.82303066
#> 12 185 1.12473125
#> 13 239 -0.22491238
#> 14 117 -0.55809297
#> 15 147 0.83225435
#> 16 200 0.75178516
#> 17 170 -0.78484405
#> 18 208 1.21000589
#> 19 196 -0.74576650
#> 20 184 0.02459359
Then we can create a column for total elapsed time and which 500ms period the observation belongs to like this:
library(dplyr)
labchart %>%
mutate(elapsed = lag(cumsum(time), 1, 0),
period = 500 * (elapsed %/% 500))
#> time data elapsed period
#> 1 277 -1.33120732 0 0
#> 2 224 -0.85356280 277 0
#> 3 80 -0.32012499 501 500
#> 4 255 0.32433366 581 500
#> 5 227 -0.49600772 836 500
#> 6 248 2.23246918 1063 1000
#> 7 138 -1.40170795 1311 1000
#> 8 115 -0.76525043 1449 1000
#> 9 159 0.14239351 1564 1500
#> 10 207 -1.53064873 1723 1500
#> 11 139 -0.82303066 1930 1500
#> 12 185 1.12473125 2069 2000
#> 13 239 -0.22491238 2254 2000
#> 14 117 -0.55809297 2493 2000
#> 15 147 0.83225435 2610 2500
#> 16 200 0.75178516 2757 2500
#> 17 170 -0.78484405 2957 2500
#> 18 208 1.21000589 3127 3000
#> 19 196 -0.74576650 3335 3000
#> 20 184 0.02459359 3531 3500
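From there, the baseline the question asks about (the mean over the first 500 ms of each trial) is a group-and-filter away. A sketch that assumes a trial column marking trial membership; elapsed time is computed within each trial:

```r
library(dplyr)

# Toy data: a trial column plus time steps (ms) between consecutive rows
labchart <- data.frame(
  trial = rep(1:2, each = 5),
  time  = c(120, 150, 180, 200, 210,  90, 160, 240, 260, 300),
  data  = c(1, 2, 3, 4, 5,  2, 4, 6, 8, 10)
)

baseline <- labchart %>%
  group_by(trial) %>%
  mutate(elapsed = lag(cumsum(time), 1, 0)) %>%  # ms since trial onset
  filter(elapsed < 500) %>%                      # keep the first 500 ms
  summarise(baseline = mean(data))
```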
I wish to use a subquery-like structure in dplyr efficiently. For example, in SQL we write subqueries like:
select * from table where tbl_id in (select tbl_id from table1 where name='a');
Here, I assume we are not saving any results from the inner query. I want to use a similar structure in dplyr chains.
I have tried to use the result of one dplyr chain in another by putting it in brackets, but it doesn't work that way. I already know that we can save it as a temporary df and use it, but I don't want to save it.
Below are the two tables/data frames:
# Libraries
library(sqldf)
#> Loading required package: gsubfn
#> Loading required package: proto
#> Loading required package: RSQLite
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# Q 16 Write a query in SQL to find the name of those
# movies where one or more actors acted in two or more movies
movie <- read.csv("q14_movie.csv")
movie_cast <- read.csv("q16_movie_cast.csv")
print(movie)
#> mov_id mov_title mov_year mov_time mov_lang
#> 1 901 Vertigo 1958 128 English
#> 2 902 The Innocents 1961 100 English
#> 3 903 Lawrence of Arabia 1962 216 English
#> 4 904 The Deer Hunter 1978 183 English
#> 5 905 Amadeus 1984 160 English
#> 6 906 Blade Runner 1982 117 English
#> 7 907 Eyes Wide Shut 1999 159 English
#> 8 908 The Usual Suspects 1995 106 English
#> 9 909 Chinatown 1974 130 English
#> 10 910 Boogie Nights 1997 155 English
#> 11 911 Annie Hall 1977 93 English
#> 12 912 Princess Mononoke 1997 134 Japanese
#> 13 913 The Shawshank Redemption 1994 142 English
#> 14 914 American Beauty 1999 122 English
#> 15 915 Titanic 1997 194 English
#> 16 916 Good Will Hunting 1997 126 English
#> 17 917 Deliverance 1972 109 English
#> 18 918 Trainspotting 1996 94 English
#> 19 919 The Prestige 2006 130 English
#> 20 920 Donnie Darko 2001 113 English
#> 21 921 Slumdog Millionaire 2008 120 English
#> 22 922 Aliens 1986 137 English
#> 23 923 Beyond the Sea 2004 118 English
#> 24 924 Avatar 2009 162 English
#> 25 926 Seven Samurai 1954 207 Japanese
#> 26 927 Spirited Away 2001 125 Japanese
#> 27 928 Back to the Future 1985 116 English
#> 28 925 Braveheart 1995 178 English
#> mov_dt_rel mov_rel_country
#> 1 1958-08-24 UK
#> 2 1962-02-19 SW
#> 3 1962-12-11 UK
#> 4 1979-03-08 UK
#> 5 1985-01-07 UK
#> 6 1982-09-09 UK
#> 7 UK
#> 8 1995-08-25 UK
#> 9 1974-08-09 UK
#> 10 1998-02-16 UK
#> 11 1977-04-20 USA
#> 12 2001-10-19 UK
#> 13 1995-02-17 UK
#> 14 UK
#> 15 1998-01-23 UK
#> 16 1998-06-03 UK
#> 17 1982-10-05 UK
#> 18 1996-02-23 UK
#> 19 2006-11-10 UK
#> 20 UK
#> 21 2009-01-09 UK
#> 22 1986-08-29 UK
#> 23 2004-11-26 UK
#> 24 2009-12-17 UK
#> 25 1954-04-26 JP
#> 26 2003-09-12 UK
#> 27 1985-12-04 UK
#> 28 1995-09-08 UK
print(movie_cast)
#> act_id mov_id role
#> 1 101 901 John Scottie Ferguson
#> 2 102 902 Miss Giddens
#> 3 103 903 T.E. Lawrence
#> 4 104 904 Michael
#> 5 105 905 Antonio Salieri
#> 6 106 906 Rick Deckard
#> 7 107 907 Alice Harford
#> 8 108 908 McManus
#> 9 110 910 Eddie Adams
#> 10 111 911 Alvy Singer
#> 11 112 912 San
#> 12 113 913 Andy Dufresne
#> 13 114 914 Lester Burnham
#> 14 115 915 Rose DeWitt Bukater
#> 15 116 916 Sean Maguire
#> 16 117 917 Ed
#> 17 118 918 Renton
#> 18 120 920 Elizabeth Darko
#> 19 121 921 Older Jamal
#> 20 122 922 Ripley
#> 21 114 923 Bobby Darin
#> 22 109 909 J.J. Gittes
#> 23 119 919 Alfred Borden
sqldf('select * from movie m join movie_cast mc on
m.mov_id=mc.mov_id where mc.act_id in
(select act_id from movie_cast group by act_id having count(mov_id)>1)')
#> mov_id mov_title mov_year mov_time mov_lang mov_dt_rel
#> 1 914 American Beauty 1999 122 English
#> 2 923 Beyond the Sea 2004 118 English 2004-11-26
#> mov_rel_country act_id mov_id role
#> 1 UK 114 914 Lester Burnham
#> 2 UK 114 923 Bobby Darin
tmp <- movie_cast %>%
group_by(act_id) %>%
summarise(num_movies=n_distinct(mov_id)) %>%
filter(num_movies>1)
inner_join(movie,movie_cast,by='mov_id') %>% filter(act_id %in% tmp$act_id)
#> mov_id mov_title mov_year mov_time mov_lang mov_dt_rel
#> 1 914 American Beauty 1999 122 English
#> 2 923 Beyond the Sea 2004 118 English 2004-11-26
#> mov_rel_country act_id role
#> 1 UK 114 Lester Burnham
#> 2 UK 114 Bobby Darin
inner_join(movie,movie_cast,by='mov_id') %>%
filter(act_id %in% (movie_cast %>%
group_by(act_id) %>%
summarise(num_movies=n_distinct(mov_id)) %>%
filter(num_movies>1) %>%
select(act_id)))
#> [1] mov_id mov_title mov_year mov_time
#> [5] mov_lang mov_dt_rel mov_rel_country act_id
#> [9] role
#> <0 rows> (or 0-length row.names)
I wish to get the same results without saving tmp, as explained in the code. Thanks!
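For what it's worth, the reason the inline version above returns zero rows is that select() keeps a data frame, so %in% compares against a list. Swapping in pull(), which returns a plain vector, makes the inline subquery behave like the tmp version. A sketch on toy data in the same shape:

```r
library(dplyr)

# Toy data in the same shape as movie / movie_cast above
movie      <- data.frame(mov_id    = c(914, 923, 901),
                         mov_title = c("American Beauty", "Beyond the Sea", "Vertigo"))
movie_cast <- data.frame(act_id = c(114, 114, 101),
                         mov_id = c(914, 923, 901))

res <- inner_join(movie, movie_cast, by = "mov_id") %>%
  filter(act_id %in% (movie_cast %>%
                        group_by(act_id) %>%
                        summarise(num_movies = n_distinct(mov_id)) %>%
                        filter(num_movies > 1) %>%
                        pull(act_id)))   # pull() returns a vector; select() keeps a data frame
```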