There is a JSON file I would like to analyze:
https://dam-api.bfs.admin.ch/hub/api/dam/assets/21364129/master
It has data about four different referendums (all from the same date), with all the results for each region (Kanton) and at community level (Gemeinde). I tried to make a tidy df out of this, but I don't even understand the exact syntax of the unnest command (cols?).
After a lot of trial and error, I can get the data for a specific Kanton, for example like this:
flatten(df$schweiz$vorlagen$kantone[[4]]$gemeinden[[13]])
But there must be some way to get this data for all 26 cantons, and of course with the information about which of the four referendums the data belongs to. I guess that some tools from the purrr package might be helpful, but all the tutorials I have read so far have been a mystery to me.
So, in short: Is there a way to get this file tidy without hours of manual work?
Converting such a large and complex structure into a data frame requires specific knowledge of what you want your final output to be, and you really need to tailor your code to do it. In this case, I assume you want a list containing 4 data frames (one for each referendum), where each row gives the kanton, the gemeinde, and the various fields indicating the result for that gemeinde. This is complex, but something like the following should work:
url <- "https://dam-api.bfs.admin.ch/hub/api/dam/assets/21364129/master"
json <- jsonlite::read_json(url)
result <- lapply(json$schweiz$vorlagen, function(x) {
  dplyr::as_tibble(do.call(rbind, lapply(x$kantone, function(y) {
    k <- y$geoLevelname
    do.call(rbind, lapply(y$gemeinden, function(z) {
      cbind(data.frame(kantone = k, gemeinden = z$geoLevelname),
            as.data.frame(z$resultat))
    }))
  })))
})
So your result looks like this:
result
#> [[1]]
#> # A tibble: 2,156 x 10
#> kantone gemeinden gebietAusgezaeh~ jaStimmenInProz~ jaStimmenAbsolut
#> <chr> <chr> <lgl> <dbl> <int>
#> 1 Zürich Aeugst am Albis TRUE 25.4 194
#> 2 Zürich Affoltern am Albis TRUE 20.4 661
#> 3 Zürich Bonstetten TRUE 17.9 345
#> 4 Zürich Hausen am Albis TRUE 20.9 286
#> 5 Zürich Hedingen TRUE 19.3 269
#> 6 Zürich Kappel am Albis TRUE 24.1 93
#> 7 Zürich Knonau TRUE 21.6 161
#> 8 Zürich Maschwanden TRUE 21.1 49
#> 9 Zürich Mettmenstetten TRUE 19.3 368
#> 10 Zürich Obfelden TRUE 21.5 325
#> # ... with 2,146 more rows, and 5 more variables: neinStimmenAbsolut <int>,
#> # stimmbeteiligungInProzent <dbl>, eingelegteStimmzettel <int>,
#> # anzahlStimmberechtigte <int>, gueltigeStimmen <int>
#>
#> [[2]]
#> # A tibble: 2,156 x 10
#> kantone gemeinden gebietAusgezaeh~ jaStimmenInProz~ jaStimmenAbsolut
#> <chr> <chr> <lgl> <dbl> <int>
#> 1 Zürich Aeugst am Albis TRUE 60.3 470
#> 2 Zürich Affoltern am Albis TRUE 61.2 2014
#> 3 Zürich Bonstetten TRUE 59.8 1166
#> 4 Zürich Hausen am Albis TRUE 57.7 796
#> 5 Zürich Hedingen TRUE 58.5 826
#> 6 Zürich Kappel am Albis TRUE 49.6 192
#> 7 Zürich Knonau TRUE 56.8 431
#> 8 Zürich Maschwanden TRUE 52.6 123
#> 9 Zürich Mettmenstetten TRUE 58.3 1119
#> 10 Zürich Obfelden TRUE 57.2 868
#> # ... with 2,146 more rows, and 5 more variables: neinStimmenAbsolut <int>,
#> # stimmbeteiligungInProzent <dbl>, eingelegteStimmzettel <int>,
#> # anzahlStimmberechtigte <int>, gueltigeStimmen <int>
#>
#> [[3]]
#> # A tibble: 2,156 x 10
#> kantone gemeinden gebietAusgezaeh~ jaStimmenInProz~ jaStimmenAbsolut
#> <chr> <chr> <lgl> <dbl> <int>
#> 1 Zürich Aeugst am Albis TRUE 39.6 301
#> 2 Zürich Affoltern am Albis TRUE 35.7 1150
#> 3 Zürich Bonstetten TRUE 36.2 697
#> 4 Zürich Hausen am Albis TRUE 39.5 531
#> 5 Zürich Hedingen TRUE 37.1 509
#> 6 Zürich Kappel am Albis TRUE 42.9 162
#> 7 Zürich Knonau TRUE 36.7 273
#> 8 Zürich Maschwanden TRUE 37.1 86
#> 9 Zürich Mettmenstetten TRUE 39.2 742
#> 10 Zürich Obfelden TRUE 34.4 513
#> # ... with 2,146 more rows, and 5 more variables: neinStimmenAbsolut <int>,
#> # stimmbeteiligungInProzent <dbl>, eingelegteStimmzettel <int>,
#> # anzahlStimmberechtigte <int>, gueltigeStimmen <int>
#>
#> [[4]]
#> # A tibble: 2,156 x 10
#> kantone gemeinden gebietAusgezaeh~ jaStimmenInProz~ jaStimmenAbsolut
#> <chr> <chr> <lgl> <dbl> <int>
#> 1 Zürich Aeugst am Albis TRUE 39.0 300
#> 2 Zürich Affoltern am Albis TRUE 40.8 1321
#> 3 Zürich Bonstetten TRUE 42.8 826
#> 4 Zürich Hausen am Albis TRUE 41.6 570
#> 5 Zürich Hedingen TRUE 44.1 615
#> 6 Zürich Kappel am Albis TRUE 35.5 136
#> 7 Zürich Knonau TRUE 38.6 290
#> 8 Zürich Maschwanden TRUE 42.7 100
#> 9 Zürich Mettmenstetten TRUE 39.6 757
#> 10 Zürich Obfelden TRUE 40.5 610
#> # ... with 2,146 more rows, and 5 more variables: neinStimmenAbsolut <int>,
#> # stimmbeteiligungInProzent <dbl>, eingelegteStimmzettel <int>,
#> # anzahlStimmberechtigte <int>, gueltigeStimmen <int>
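Since the question mentions purrr, the same nested walk can also be written with map()/map_dfr(). This is an equivalent sketch of the lapply logic, shown here on a tiny synthetic stand-in for json$schweiz$vorlagen so it is self-contained; the field names (kantone, gemeinden, geoLevelname, resultat) are taken from the code above.

```r
library(purrr)
library(dplyr)

# Tiny synthetic stand-in for json$schweiz$vorlagen
vorlagen <- list(
  list(kantone = list(
    list(geoLevelname = "Zürich",
         gemeinden = list(
           list(geoLevelname = "Aeugst am Albis",
                resultat = list(jaStimmenInProzent = 25.4, jaStimmenAbsolut = 194)),
           list(geoLevelname = "Affoltern am Albis",
                resultat = list(jaStimmenInProzent = 20.4, jaStimmenAbsolut = 661))
         ))
  ))
)

# One tibble per referendum: walk vorlagen -> kantone -> gemeinden
result2 <- map(vorlagen, function(x) {
  map_dfr(x$kantone, function(y) {
    map_dfr(y$gemeinden, function(z) {
      bind_cols(tibble(kantone = y$geoLevelname, gemeinden = z$geoLevelname),
                as_tibble(z$resultat))
    })
  })
})
```

On the real file, replace `vorlagen` with `json$schweiz$vorlagen` from the code above.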
If you want them all in one big data frame, you can add a number or name column to indicate which referendum you are referring to, then bind all the data frames together.
Created on 2022-04-16 by the reprex package (v2.0.1)
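That binding step can be done in one call (my addition, not part of the original answer): bind_rows() with its .id argument creates the referendum-indicator column for you.

```r
library(dplyr)

# Small stand-in for the list of four data frames called `result` above
result <- list(tibble(kantone = "Zürich", jaStimmenInProzent = 25.4),
               tibble(kantone = "Zürich", jaStimmenInProzent = 60.3))

# .id records which list element (i.e. which referendum) each row came from
combined <- bind_rows(result, .id = "vorlage")
```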
I am trying to dynamically create and evaluate a function from a string input, and am hung up, again, on meta-programming/evaluation (https://adv-r.hadley.nz/metaprogramming.html). I have a feeling this is answered on SO, but I searched and wasn't able to figure out the solution from other posts; if there is an existing answer, please let me know and flag this as a duplicate. Thank you so much for your time and help! Below is a reprex of the issue.
library(dplyr)
library(purrr)
library(rlang)
library(palmerpenguins)
# Create data to join with penguins
penguin_colors <-
  tibble(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    color = c("orange", "purple", "green")
  )
# Create function to do specified join and print join type
foo <- function(JOINTYPE) {
  # DOESN'T RUN
  # JOINTYPE_join(penguins, penguin_colors, by = "species")
  # call2(sym(paste0(JOINTYPE, "_join")), x = penguins, y = penguin_colors, by = "species")
  print(JOINTYPE)
}
# Desired behavior of foo when JOINTYPE == "inner"
inner_join(penguins, penguin_colors, by = "species")
#> # A tibble: 344 x 9
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <chr> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torge… 39.1 18.7 181 3750
#> 2 Adelie Torge… 39.5 17.4 186 3800
#> 3 Adelie Torge… 40.3 18 195 3250
#> 4 Adelie Torge… NA NA NA NA
#> 5 Adelie Torge… 36.7 19.3 193 3450
#> 6 Adelie Torge… 39.3 20.6 190 3650
#> 7 Adelie Torge… 38.9 17.8 181 3625
#> 8 Adelie Torge… 39.2 19.6 195 4675
#> 9 Adelie Torge… 34.1 18.1 193 3475
#> 10 Adelie Torge… 42 20.2 190 4250
#> # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
#> # color <chr>
print("inner")
#> [1] "inner"
# Use function in for loop
for (JOINTYPE in c("inner", "left", "right")) {
  foo(JOINTYPE)
}
#> [1] "inner"
#> [1] "left"
#> [1] "right"
# Use function in vectorised fashion
walk(c("inner", "left", "right"), foo)
#> [1] "inner"
#> [1] "left"
#> [1] "right"
Created on 2020-10-27 by the reprex package (v0.3.0)
One option is to use get() to retrieve the appropriate function:
join <- function(JOINTYPE) {
  get(paste0(JOINTYPE, "_join"))
}
join("inner")(penguins, penguin_colors, by="species")
If using rlang, the more appropriate function here is rlang::exec:
join2 <- function(JOINTYPE, ...) {
  rlang::exec(paste0(JOINTYPE, "_join"), ...)
}
join2("inner", penguins, penguin_colors, by="species")
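A defensive variant (my addition, not from the answer above): restricting JOINTYPE with match.arg() makes a typo fail early, instead of get()/exec() reaching some unrelated object.

```r
library(dplyr)

join3 <- function(JOINTYPE = c("inner", "left", "right", "full"), ...) {
  JOINTYPE <- match.arg(JOINTYPE)  # errors on anything outside the allowed set
  rlang::exec(paste0(JOINTYPE, "_join"), ...)
}

# Small stand-ins for penguins / penguin_colors
a <- tibble(species = c("Adelie", "Gentoo"), n = c(1, 2))
b <- tibble(species = "Adelie", color = "orange")

res <- join3("inner", a, b, by = "species")
```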
I'm doing feature extraction from labelled Twitter data to use for predicting fake tweets. I've spent a lot of time on various GitHub methods, R libraries, and Stack Overflow posts, but somehow I couldn't find a "direct" method of extracting features related to emojis, e.g. the number of emojis, whether the tweet contains an emoji (1/0), or even occurrences of specific emojis (which might occur more often in fake/real news).
I'm not sure whether there is a point in showing reproducible code.
The "ore" library, for example, offers functions that gather all tweets in an object and extract emojis, but the formats are problematic (at least, to me) when trying to create features out of the extractions, as mentioned above. The example below uses a WhatsApp text sample. I will add Twitter data from Kaggle to make it somewhat reproducible.
Twitter Dataset: https://github.com/sherylWM/Fake-News-Detection-using-Twitter/blob/master/FinalDataSet.csv
# save this to '_chat.txt' (it requires a login)
# https://www.kaggle.com/sarthaknautiyal/whatsappsample
library(ore)
library(dplyr)
emoji_src <- "https://raw.githubusercontent.com/laurenancona/twimoji/gh-pages/twitterEmojiProject/emoticon_conversion_noGraphic.csv"
emoji_fil <- basename(emoji_src)
if (!file.exists(emoji_fil)) download.file(emoji_src, emoji_fil)
emoji <- read.csv(emoji_fil, header=FALSE, stringsAsFactors = FALSE)
emoji_regex <- sprintf("(%s)", paste0(emoji$V2, collapse="|"))
compiled <- ore(emoji_regex)
chat <- readLines("_chat.txt", encoding = "UTF-8", warn = FALSE)
which(grepl(emoji_regex, chat, useBytes = TRUE))
## [1] 8 9 10 11 13 19 20 22 23 62 65 69 73 74 75 82 83 84 87 88 90 91
## [23] 92 93 94 95 107 108 114 115 117 119 122 123 124 125 130 135 139 140 141 142 143 144
## [45] 146 147 150 151 153 157 159 161 162 166 169 171 174 177 178 183 184 189 191 192 195 196
## [67] 199 200 202 206 207 209 220 221 223 224 225 226 228 229 234 235 238 239 242 244 246 247
## [89] 248 249 250 251 253 259 260 262 263 265 274 275 280 281 282 286 287 288 291 292 293 296
## [111] 302 304 305 307 334 335 343 346 348 351 354 355 356 358 361 362 382 389 390 391 396 397
## [133] 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419
## [155] 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 442 451 452
## [177] 454 459 463 465 466 469 471 472 473 474 475 479 482 484 485 486 488 490 492 493 496 503
## [199] 505 506 507 509 517 518 519 525 526 527 528 531 535 540 543 545 548 549 557 558 559 560
## [221] 566 567 571 572 573 574 576 577 578 580 587 589 591 592 594 597 600 601 603 608 609 625
## [243] 626 627 637 638 639 640 641 643 645 749 757 764
chat_emoji_lines <- chat[which(grepl(emoji_regex, chat, useBytes = TRUE))]
found_emoji <- ore.search(compiled, chat_emoji_lines, all=TRUE)
emoji_matches <- matches(found_emoji)
str(emoji_matches, 1)
## List of 254
## $ : chr [1:4] "\U0001f600" "\U0001f600" "\U0001f44d" "\U0001f44d"
## $ : chr "\U0001f648"
## $ : chr [1:2] "\U0001f44d" "\U0001f44d"
## $ : chr "\U0001f602"
## $ : chr [1:3] "\U0001f602" "\U0001f602" "\U0001f602"
## $ : chr [1:4] "\U0001f44c" "\U0001f44c" "\U0001f44c" "\U0001f44c"
## $ : chr [1:6] "\U0001f602" "\U0001f602" "\U0001f602" "\U0001f602" ...
## $ : chr "\U0001f600"
## $ : chr [1:5] "\U0001f604" "\U0001f604" "\U0001f604" "\U0001f603" ...
## $ : chr "\U0001f44d"
## ...
data_frame(
  V2 = flatten_chr(emoji_matches) %>%
    map(charToRaw) %>%
    map(as.character) %>%
    map(toupper) %>%
    map(~sprintf("\\x%s", .x)) %>%
    map_chr(paste0, collapse="")
) %>%
  left_join(emoji) %>%
  count(V3, sort=TRUE)
## # A tibble: 89 x 2
## V3 n
## <chr> <int>
## 1 face with tears of joy 110
## 2 smiling face with smiling eyes 50
## 3 face with stuck-out tongue and winking eye 43
## 4 musical note 42
## 5 birthday cake 35
## 6 grinning face with smiling eyes 26
## 7 face with stuck-out tongue and tightly-closed eyes 24
## 8 grinning face 21
## 9 bouquet 17
## 10 thumbs up sign 17
## # ... with 79 more rows
Source: https://gist.github.com/hrbrmstr/e89eb173ae0333f50f94fe5086fedf8b
The "textclean" library offers two functions that replace emojis with word equivalents. Source: https://cran.r-project.org/web/packages/textclean/textclean.pdf
Another hit, from the CRAN utf8 package description:
Characters with codes above 0xffff, including most emoji, are not
supported on Windows.
Does anyone have any other method, direction, package/function I could use?
I wrote a function for this purpose in my package rwhatsapp.
As your example is a WhatsApp dataset, you can test it directly using the package (install via remotes::install_github("JBGruber/rwhatsapp")).
df <- rwhatsapp::rwa_read("_chat.txt")
#> Warning in readLines(x, encoding = encoding, ...): incomplete final line found
#> on '_chat.txt'
df
#> # A tibble: 392 x 6
#> time author text source emoji emoji_name
#> <dttm> <fct> <chr> <chr> <list> <list>
#> 1 2015-06-25 01:42:12 <NA> : Vishnu Gaud … /home/johan… <NULL> <NULL>
#> 2 2015-06-25 01:42:12 <NA> : You were added /home/johan… <NULL> <NULL>
#> 3 2016-12-18 01:57:38 Shahain :<image omitted> /home/johan… <NULL> <NULL>
#> 4 2016-12-21 21:54:46 Pankaj S… :<image omitted> /home/johan… <NULL> <NULL>
#> 5 2016-12-21 21:57:45 Shahain :Wow /home/johan… <NULL> <NULL>
#> 6 2016-12-21 22:48:51 Sakshi :<image omitted> /home/johan… <NULL> <NULL>
#> 7 2016-12-21 22:49:00 Sakshi :<image omitted> /home/johan… <NULL> <NULL>
#> 8 2016-12-21 22:50:12 Neha Wip… :Awsum😀😀👍🏼👍🏼 /home/johan… <chr … <chr [4]>
#> 9 2016-12-21 22:51:21 Sakshi :🙈 /home/johan… <chr … <chr [1]>
#> 10 2016-12-21 22:57:01 Ganguly :🙂🙂👍🏻👍🏻 /home/johan… <chr … <chr [4]>
#> # … with 382 more rows
I extract the emojis from text and store them in a list column as each text can contain multiple emojis. Use unnest to unnest the list column.
library(tidyverse)
df %>%
  select(time, emoji) %>%
  unnest(emoji)
#> # A tibble: 654 x 2
#> time emoji
#> <dttm> <chr>
#> 1 2016-12-21 22:50:12 😀
#> 2 2016-12-21 22:50:12 😀
#> 3 2016-12-21 22:50:12 👍🏼
#> 4 2016-12-21 22:50:12 👍🏼
#> 5 2016-12-21 22:51:21 🙈
#> 6 2016-12-21 22:57:01 🙂
#> 7 2016-12-21 22:57:01 🙂
#> 8 2016-12-21 22:57:01 👍🏻
#> 9 2016-12-21 22:57:01 👍🏻
#> 10 2016-12-21 23:28:51 😂
#> # … with 644 more rows
You can use this function with any text. The only thing you need to do first is to store the text in a data.frame in a column called text (I use tibble here as it prints nicer):
df <- tibble::tibble(
  text = readLines("/home/johannes/_chat.txt")
)
#> Warning in readLines("/home/johannes/_chat.txt"): incomplete final line found on
#> '/home/johannes/_chat.txt'
rwhatsapp::lookup_emoji(df, text_field = "text")
#> # A tibble: 764 x 3
#> text emoji emoji_name
#> <chr> <list> <list>
#> 1 25/6/15, 1:42:12 AM: Vishnu Gaud created this group <NULL> <NULL>
#> 2 25/6/15, 1:42:12 AM: You were added <NULL> <NULL>
#> 3 18/12/16, 1:57:38 AM: Shahain: <image omitted> <NULL> <NULL>
#> 4 21/12/16, 9:54:46 PM: Pankaj Sinha: <image omitted> <NULL> <NULL>
#> 5 21/12/16, 9:57:45 PM: Shahain: Wow <NULL> <NULL>
#> 6 21/12/16, 10:48:51 PM: Sakshi: <image omitted> <NULL> <NULL>
#> 7 21/12/16, 10:49:00 PM: Sakshi: <image omitted> <NULL> <NULL>
#> 8 21/12/16, 10:50:12 PM: Neha Wipro: Awsum😀😀👍🏼👍🏼 <chr [4]> <chr [4]>
#> 9 21/12/16, 10:51:21 PM: Sakshi: 🙈 <chr [1]> <chr [1]>
#> 10 21/12/16, 10:57:01 PM: Ganguly: 🙂🙂👍🏻👍🏻 <chr [4]> <chr [4]>
#> # … with 754 more rows
more details
The way this works under the hood is with a simple dictionary and matching approach. First I split the text into characters and put the characters in a data.frame together with the line id (this is a rewrite of unnest_tokens from tidytext):
lines <- readLines("/home/johannes/_chat.txt")
#> Warning in readLines("/home/johannes/_chat.txt"): incomplete final line found on
#> '/home/johannes/_chat.txt'
id <- seq_along(lines)
l <- stringi::stri_split_boundaries(lines, type = "character")
out <- tibble(id = rep(id, sapply(l, length)), emoji = unlist(l))
Then I match the characters with a dataset of emoji characters (see ?rwhatsapp::emojis for more info):
out <- add_column(out,
                  emoji_name = rwhatsapp::emojis$name[
                    match(out$emoji, rwhatsapp::emojis$emoji)
                  ])
out
#> # A tibble: 28,652 x 3
#> id emoji emoji_name
#> <int> <chr> <chr>
#> 1 1 "2" <NA>
#> 2 1 "5" <NA>
#> 3 1 "/" <NA>
#> 4 1 "6" <NA>
#> 5 1 "/" <NA>
#> 6 1 "1" <NA>
#> 7 1 "5" <NA>
#> 8 1 "," <NA>
#> 9 1 " " <NA>
#> 10 1 "1" <NA>
#> # … with 28,642 more rows
Now the new column contains either an emoji or NA when no emoji was found. After removing the NAs, just the emojis are left.
out <- out[!is.na(out$emoji_name), ]
out
#> # A tibble: 656 x 3
#> id emoji emoji_name
#> <int> <chr> <chr>
#> 1 8 😀 grinning face
#> 2 8 😀 grinning face
#> 3 8 👍🏼 thumbs up: medium-light skin tone
#> 4 8 👍🏼 thumbs up: medium-light skin tone
#> 5 9 🙈 see-no-evil monkey
#> 6 10 🙂 slightly smiling face
#> 7 10 🙂 slightly smiling face
#> 8 10 👍🏻 thumbs up: light skin tone
#> 9 10 👍🏻 thumbs up: light skin tone
#> 10 11 😂 face with tears of joy
#> # … with 646 more rows
The disadvantage of this approach is that you rely on the completeness of your emoji data. However, the dataset in the package includes all known emojis from the Unicode website (version 13).
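For the feature-engineering side of the question (an emoji count and a 0/1 indicator per tweet), a dictionary-free sketch is also possible with stringi, whose ICU regex engine understands Unicode emoji property classes. This is an approximation I'm adding, not part of the answer above: \p{Emoji_Presentation} misses text-presentation emoji (e.g. ☺ without a variation selector).

```r
library(stringi)

tweets <- c("Great news 😀😀👍", "no emoji here", "🙈")

# Count code points whose default presentation is emoji
n_emoji   <- stri_count_regex(tweets, "\\p{Emoji_Presentation}")
has_emoji <- as.integer(n_emoji > 0)  # 1/0 feature per tweet
```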
Question
I use time-series data regularly. Sometimes, I would like to transmute an entire data frame to obtain some data frame of growth rates, or shares, for example.
When using transmute this is relatively straightforward. But when I have a lot of columns to transmute and I want to keep the date column, I'm not sure if that's possible.
Below, using the economics data set, is an example of what I mean.
Example
library(dplyr)
economics %>%
  transmute(date,
            pce * 10,
            pop * 10,
            psavert * 10)
# A tibble: 574 x 4
date `pce * 10` `pop * 10` `psavert * 10`
<date> <dbl> <dbl> <dbl>
1 1967-07-01 5067 1987120 126
2 1967-08-01 5098 1989110 126
3 1967-09-01 5156 1991130 119
4 1967-10-01 5122 1993110 129
5 1967-11-01 5174 1994980 128
6 1967-12-01 5251 1996570 118
7 1968-01-01 5309 1998080 117
8 1968-02-01 5336 1999200 123
9 1968-03-01 5443 2000560 117
10 1968-04-01 5440 2002080 123
# ... with 564 more rows
Now, using transmute_at. The code below predictably removes date, since it is excluded in the .vars argument, but I haven't found a way of removing date there and reintroducing it in .funs so that the resulting data frame looks as it does above. Any ideas?
economics %>%
  transmute_at(.vars = vars(-c(date, uempmed, unemploy)),
               .funs = list("trans" = ~ . * 10))
# A tibble: 574 x 3
pce_trans pop_trans psavert_trans
<dbl> <dbl> <dbl>
1 5067 1987120 126
2 5098 1989110 126
3 5156 1991130 119
4 5122 1993110 129
5 5174 1994980 128
6 5251 1996570 118
7 5309 1998080 117
8 5336 1999200 123
9 5443 2000560 117
10 5440 2002080 123
# ... with 564 more rows
We can use if/else inside the function.
library(dplyr)
library(ggplot2)
data(economics)
economics %>%
  transmute_at(vars(date:psavert), ~ if (is.numeric(.)) . * 10 else .)
# A tibble: 574 x 4
# date pce pop psavert
# <date> <dbl> <dbl> <dbl>
# 1 1967-07-01 5067 1987120 126
# 2 1967-08-01 5098 1989110 126
# 3 1967-09-01 5156 1991130 119
# 4 1967-10-01 5122 1993110 129
# 5 1967-11-01 5174 1994980 128
# 6 1967-12-01 5251 1996570 118
# 7 1968-01-01 5309 1998080 117
# 8 1968-02-01 5336 1999200 123
# 9 1968-03-01 5443 2000560 117
#10 1968-04-01 5440 2002080 123
# … with 564 more rows
If we need to change the column names selectively, we can do this after the transmute_at:
library(stringr)
economics %>%
  transmute_at(vars(date:psavert), ~ if (is.numeric(.)) . * 10 else .) %>%
  rename_at(vars(-date), ~ str_c(., '_trans'))
# A tibble: 574 x 4
# date pce_trans pop_trans psavert_trans
# <date> <dbl> <dbl> <dbl>
# 1 1967-07-01 5067 1987120 126
# 2 1967-08-01 5098 1989110 126
# 3 1967-09-01 5156 1991130 119
# 4 1967-10-01 5122 1993110 129
# 5 1967-11-01 5174 1994980 128
# 6 1967-12-01 5251 1996570 118
# 7 1968-01-01 5309 1998080 117
# 8 1968-02-01 5336 1999200 123
# 9 1968-03-01 5443 2000560 117
#10 1968-04-01 5440 2002080 123
# … with 564 more rows
If we are changing the column names of all the selected columns in transmute_at, use list(trans = ...):
economics %>%
  transmute_at(vars(date:psavert), list(trans = ~ if (is.numeric(.)) . * 10 else .))
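Since dplyr 1.0, transmute_at() has been superseded; for readers on a newer dplyr, the same result can be sketched with across() and its .names argument:

```r
library(dplyr)
library(ggplot2)
data(economics)

# across() applies the function to a tidyselect range and renames via .names
res <- economics %>%
  transmute(date,
            across(pce:psavert, ~ .x * 10, .names = "{.col}_trans"))
```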
I wish to use a subquery-like structure in dplyr efficiently. For example, in SQL we write subqueries like
select * from table where tbl_id in (select tbl_id from table1 where name='a');
Here, I assume we are not saving any results from the inner query. I want to use a similar structure in dplyr chains.
I have tried to use the result of one dplyr chain in another by putting it in brackets, but it doesn't work that way. I already know that we can save it as a temporary df and use it, but I don't want to save it.
Below are the two tables/data frames:
# Libraries
library(sqldf)
#> Loading required package: gsubfn
#> Loading required package: proto
#> Loading required package: RSQLite
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# Q 16 Write a query in SQL to find the name of those
# movies where one or more actors acted in two or more movies
movie <- read.csv("q14_movie.csv")
movie_cast <- read.csv("q16_movie_cast.csv")
print(movie)
#> mov_id mov_title mov_year mov_time mov_lang
#> 1 901 Vertigo 1958 128 English
#> 2 902 The Innocents 1961 100 English
#> 3 903 Lawrence of Arabia 1962 216 English
#> 4 904 The Deer Hunter 1978 183 English
#> 5 905 Amadeus 1984 160 English
#> 6 906 Blade Runner 1982 117 English
#> 7 907 Eyes Wide Shut 1999 159 English
#> 8 908 The Usual Suspects 1995 106 English
#> 9 909 Chinatown 1974 130 English
#> 10 910 Boogie Nights 1997 155 English
#> 11 911 Annie Hall 1977 93 English
#> 12 912 Princess Mononoke 1997 134 Japanese
#> 13 913 The Shawshank Redemption 1994 142 English
#> 14 914 American Beauty 1999 122 English
#> 15 915 Titanic 1997 194 English
#> 16 916 Good Will Hunting 1997 126 English
#> 17 917 Deliverance 1972 109 English
#> 18 918 Trainspotting 1996 94 English
#> 19 919 The Prestige 2006 130 English
#> 20 920 Donnie Darko 2001 113 English
#> 21 921 Slumdog Millionaire 2008 120 English
#> 22 922 Aliens 1986 137 English
#> 23 923 Beyond the Sea 2004 118 English
#> 24 924 Avatar 2009 162 English
#> 25 926 Seven Samurai 1954 207 Japanese
#> 26 927 Spirited Away 2001 125 Japanese
#> 27 928 Back to the Future 1985 116 English
#> 28 925 Braveheart 1995 178 English
#> mov_dt_rel mov_rel_country
#> 1 1958-08-24 UK
#> 2 1962-02-19 SW
#> 3 1962-12-11 UK
#> 4 1979-03-08 UK
#> 5 1985-01-07 UK
#> 6 1982-09-09 UK
#> 7 UK
#> 8 1995-08-25 UK
#> 9 1974-08-09 UK
#> 10 1998-02-16 UK
#> 11 1977-04-20 USA
#> 12 2001-10-19 UK
#> 13 1995-02-17 UK
#> 14 UK
#> 15 1998-01-23 UK
#> 16 1998-06-03 UK
#> 17 1982-10-05 UK
#> 18 1996-02-23 UK
#> 19 2006-11-10 UK
#> 20 UK
#> 21 2009-01-09 UK
#> 22 1986-08-29 UK
#> 23 2004-11-26 UK
#> 24 2009-12-17 UK
#> 25 1954-04-26 JP
#> 26 2003-09-12 UK
#> 27 1985-12-04 UK
#> 28 1995-09-08 UK
print(movie_cast)
#> act_id mov_id role
#> 1 101 901 John Scottie Ferguson
#> 2 102 902 Miss Giddens
#> 3 103 903 T.E. Lawrence
#> 4 104 904 Michael
#> 5 105 905 Antonio Salieri
#> 6 106 906 Rick Deckard
#> 7 107 907 Alice Harford
#> 8 108 908 McManus
#> 9 110 910 Eddie Adams
#> 10 111 911 Alvy Singer
#> 11 112 912 San
#> 12 113 913 Andy Dufresne
#> 13 114 914 Lester Burnham
#> 14 115 915 Rose DeWitt Bukater
#> 15 116 916 Sean Maguire
#> 16 117 917 Ed
#> 17 118 918 Renton
#> 18 120 920 Elizabeth Darko
#> 19 121 921 Older Jamal
#> 20 122 922 Ripley
#> 21 114 923 Bobby Darin
#> 22 109 909 J.J. Gittes
#> 23 119 919 Alfred Borden
sqldf('select * from movie m join movie_cast mc on
m.mov_id=mc.mov_id where mc.act_id in
(select act_id from movie_cast group by act_id having count(mov_id)>1)')
#> mov_id mov_title mov_year mov_time mov_lang mov_dt_rel
#> 1 914 American Beauty 1999 122 English
#> 2 923 Beyond the Sea 2004 118 English 2004-11-26
#> mov_rel_country act_id mov_id role
#> 1 UK 114 914 Lester Burnham
#> 2 UK 114 923 Bobby Darin
tmp <- movie_cast %>%
  group_by(act_id) %>%
  summarise(num_movies = n_distinct(mov_id)) %>%
  filter(num_movies > 1)
inner_join(movie, movie_cast, by = 'mov_id') %>% filter(act_id %in% tmp$act_id)
#> mov_id mov_title mov_year mov_time mov_lang mov_dt_rel
#> 1 914 American Beauty 1999 122 English
#> 2 923 Beyond the Sea 2004 118 English 2004-11-26
#> mov_rel_country act_id role
#> 1 UK 114 Lester Burnham
#> 2 UK 114 Bobby Darin
inner_join(movie, movie_cast, by = 'mov_id') %>%
  filter(act_id %in% (movie_cast %>%
                        group_by(act_id) %>%
                        summarise(num_movies = n_distinct(mov_id)) %>%
                        filter(num_movies > 1) %>%
                        select(act_id)))
#> [1] mov_id mov_title mov_year mov_time
#> [5] mov_lang mov_dt_rel mov_rel_country act_id
#> [9] role
#> <0 rows> (or 0-length row.names)
I wish to get the same results without saving tmp, as explained in the code. Thanks!
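Not part of the original post, but the likely explanation for the empty result: %in% calls match() on its right-hand side, and select() returns a one-column data frame (a list), not a vector, so nothing matches. Extracting the vector with pull() makes the bracketed chain behave like the SQL subquery; a sketch on small stand-in tables:

```r
library(dplyr)

# Small stand-ins for the movie tables above
movie <- tibble(mov_id = c(901, 914, 923),
                mov_title = c("Vertigo", "American Beauty", "Beyond the Sea"))
movie_cast <- tibble(act_id = c(101, 114, 114),
                     mov_id = c(901, 914, 923))

res <- inner_join(movie, movie_cast, by = "mov_id") %>%
  filter(act_id %in% (movie_cast %>%
                        group_by(act_id) %>%
                        summarise(num_movies = n_distinct(mov_id)) %>%
                        filter(num_movies > 1) %>%
                        pull(act_id)))  # pull() returns a vector; select() does not
```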
I have built a matrix whose column names are those of a regressor subset that I want to insert into a regression model formula in R.
For example:
data$age is the response variable.
X is the design matrix whose column names are, for example, data$education and data$wage.
The problem is that the column names of X are not fixed (i.e. I don't know them in advance), so I tried to code this:
best_model <- lm(data$age ~ paste(colnames(x[, GA@solution == 1]), sep = "+"))
But it doesn't work.
Rather than writing the formula yourself, using the pipe (%>%) and dplyr::select() appropriately might be helpful. (Here, change your matrix to a data frame.)
library(tidyverse)
mpg
#> # A tibble: 234 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
#> 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
#> 3 audi a4 2 2008 4 manu… f 20 31 p comp…
#> 4 audi a4 2 2008 4 auto… f 21 30 p comp…
#> 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
#> 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
#> 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
#> 8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p comp…
#> 9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p comp…
#> 10 audi a4 q… 2 2008 4 manu… 4 20 28 p comp…
#> # ... with 224 more rows
Select
dplyr::select() subsets columns.
mpg %>%
  select(hwy, manufacturer, displ, cyl, cty) %>% # subsetting
  lm(hwy ~ ., data = .)
#>
#> Call:
#> lm(formula = hwy ~ ., data = .)
#>
#> Coefficients:
#> (Intercept) manufacturerchevrolet manufacturerdodge
#> 2.65526 -1.08632 -2.55442
#> manufacturerford manufacturerhonda manufacturerhyundai
#> -2.29897 -2.98863 -0.94980
#> manufacturerjeep manufacturerland rover manufacturerlincoln
#> -3.36654 -1.87179 -1.10739
#> manufacturermercury manufacturernissan manufacturerpontiac
#> -2.64828 -2.44447 0.75427
#> manufacturersubaru manufacturertoyota manufacturervolkswagen
#> -3.04204 -2.73963 -1.62987
#> displ cyl cty
#> -0.03763 0.06134 1.33805
Note that -col.name excludes that column. %>% enables the formula to use the . notation.
Tidyselect
Lots of data sets group their columns using underscores.
nycflights13::flights
#> # A tibble: 336,776 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 1 517 515 2 830
#> 2 2013 1 1 533 529 4 850
#> 3 2013 1 1 542 540 2 923
#> 4 2013 1 1 544 545 -1 1004
#> 5 2013 1 1 554 600 -6 812
#> 6 2013 1 1 554 558 -4 740
#> 7 2013 1 1 555 600 -5 913
#> 8 2013 1 1 557 600 -3 709
#> 9 2013 1 1 557 600 -3 838
#> 10 2013 1 1 558 600 -2 753
#> # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
For instance, both dep_delay and arr_delay are about delay time. Select helpers such as starts_with(), ends_with(), and contains() can handle these kinds of columns.
nycflights13::flights %>%
  select(starts_with("sched"),
         ends_with("delay"),
         distance)
#> # A tibble: 336,776 x 5
#> sched_dep_time sched_arr_time dep_delay arr_delay distance
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 515 819 2 11 1400
#> 2 529 830 4 20 1416
#> 3 540 850 2 33 1089
#> 4 545 1022 -1 -18 1576
#> 5 600 837 -6 -25 762
#> 6 558 728 -4 12 719
#> 7 600 854 -5 19 1065
#> 8 600 723 -3 -14 229
#> 9 600 846 -3 -8 944
#> 10 600 745 -2 8 733
#> # ... with 336,766 more rows
After that, just %>% lm().
nycflights13::flights %>%
  select(starts_with("sched"),
         ends_with("delay"),
         distance) %>%
  lm(dep_delay ~ ., data = .)
#>
#> Call:
#> lm(formula = dep_delay ~ ., data = .)
#>
#> Coefficients:
#> (Intercept) sched_dep_time sched_arr_time arr_delay
#> -0.151408 0.002737 0.000951 0.816684
#> distance
#> 0.001859
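Coming back to the original attempt of building a formula from a character vector of column names: base R's reformulate() does exactly that. The variable names below are hypothetical stand-ins for the selected columns of the design matrix X.

```r
# Hypothetical selected regressor names
vars <- c("education", "wage")

# reformulate() builds `response ~ term1 + term2 + ...` from character vectors
f <- reformulate(vars, response = "age")
f
#> age ~ education + wage

# then fit with: lm(f, data = data)
```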