Generate dummy according to quanteda dictionary lookup

I'm using quanteda to create dictionaries and look up terms.
Here is a reproducible example of my data:
dput(tweets[1:4, ])
structure(list(tweet_id = c("174457180812_10156824364270813",
"174457180812_10156824136360813", "174457180812_10156823535820813",
"174457180812_10156823868565813"), tweet_message = c("Climate change is a big issue",
"We should care about the environment", "Let's rethink environmental policies",
"#Davos WEF"
), date = c("2019-03-25T23:03:56+0000", "2019-03-25T21:10:36+0000",
"2019-03-25T21:00:03+0000", "2019-03-25T20:00:03+0000"), group = c("1",
"2", "3", "4")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Here is how I use my dictionary following a suggestion I got from this forum:
climate_corpus <- corpus(tweets, text_field = "tweet_message")
climatechange_dict <-
dictionary(list(climate = c("environment*", "climate change")))
groupeddfm <- tokens(climate_corpus) %>%
tokens_lookup(dictionary = climatechange_dict) %>%
dfm(groups = "group")
convert(groupeddfm, to = "data.frame")
What I need is a dummy variable in my original dataset "tweets" that equals 1 when tokens_lookup finds a dictionary term in a given observation (tweet). In my reproducible example, the dummy should be 1 for the first three observations (they contain dictionary terms) and 0 for the fourth (no dictionary terms).
I would really appreciate your help on this.
Many thanks!

library("quanteda")
## Package version: 2.0.1
tweets <- structure(
list(tweet_id = c(
"174457180812_10156824364270813",
"174457180812_10156824136360813", "174457180812_10156823535820813",
"174457180812_10156823868565813"
), tweet_message = c(
"Climate change is a big issue",
"We should care about the environment", "Let's rethink environmental policies",
"#Davos WEF"
), date = c(
"2019-03-25T23:03:56+0000", "2019-03-25T21:10:36+0000",
"2019-03-25T21:00:03+0000", "2019-03-25T20:00:03+0000"
), group = c(
"1",
"2", "3", "4"
)),
row.names = c(NA, -4L), class = c(
"tbl_df",
"tbl", "data.frame"
)
)
climate_corpus <- corpus(tweets, text_field = "tweet_message")
climatechange_dict <-
dictionary(list(climate = c("environment*", "climate change")))
groupeddfm <- tokens(climate_corpus) %>%
tokens_lookup(dictionary = climatechange_dict) %>%
dfm(groups = "group")
tweets$mentions_climate <- as.logical(groupeddfm[, "climate"])
tweets
## # A tibble: 4 x 5
## tweet_id tweet_message date group mentions_climate
## <chr> <chr> <chr> <chr> <lgl>
## 1 174457180812_1015… Climate change is a b… 2019-03-25T2… 1 TRUE
## 2 174457180812_1015… We should care about … 2019-03-25T2… 2 TRUE
## 3 174457180812_1015… Let's rethink environ… 2019-03-25T2… 3 TRUE
## 4 174457180812_1015… #Davos WEF 2019-03-25T2… 4 FALSE
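The grouped dfm above has one row per group, so the assignment lines up with the rows of tweets only because each tweet is its own group here. As a rough cross-check that does not depend on quanteda, the same dummy can be sketched in base R with grepl. The regular expression is my own approximation of the dictionary's glob patterns ("environment*" and "climate change"), not something quanteda produces, and as.integer turns the logical result into the 1/0 dummy asked for:

```r
# Base-R sketch (assumption: this regex approximates the quanteda dictionary)
tweet_message <- c(
  "Climate change is a big issue",
  "We should care about the environment",
  "Let's rethink environmental policies",
  "#Davos WEF"
)

# "environment*" becomes environment followed by any word characters;
# "climate change" is matched as a two-word phrase
pattern <- "\\benvironment\\w*|\\bclimate change\\b"

# TRUE/FALSE per tweet, converted to a 1/0 dummy
mentions_climate <- as.integer(
  grepl(pattern, tweet_message, ignore.case = TRUE, perl = TRUE)
)
mentions_climate
```

The same as.integer() conversion can be applied to the logical column created in the quanteda answer if a numeric dummy is preferred.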

Related

Hierarchy across rows for the same id

I have a data set with many observations for X individuals, and some individuals have several rows. Each row has a classification (the variable clinical_significance) that takes one of three values, in order of priority: definite disease, possible, colonization. I would like to keep only one row per individual, carrying the "highest" classification across that individual's rows: definite disease if present, otherwise possible, otherwise colonization. Any good suggestions on how to do this?
For instance, in the example below, I would like clinical_significance for ID #23 to be 'definite disease', as this outranks 'possible'.
id id_row number_of_samples species_ny clinical_significa…
18 1 2 MAC possible
18 2 2 MAC possible
20 1 2 scrofulaceum possible
20 2 2 scrofulaceum possible
23 1 2 MAC possible
23 2 2 MAC definite disease

Making a reproducible example:
df <- structure(
list(
id = c("18", "18", "20", "20", "23", "23"),
id_row = c("1","2", "1", "2", "1", "2"),
number_of_samples = c("2", "2", "2","2", "2", "2"),
species_ny = c("MAC", "MAC", "scrofulaceum", "scrofulaceum", "MAC", "MAC"),
clinical_significance = c("possible", "possible", "possible", "possible", "possible", "definite disease")
),
row.names = c(NA, -6L), class = c("data.frame")
)
The idea is to turn clinical significance into a factor, which is stored as an integer rather than a character (i.e. 1 = definite disease, 2 = possible, 3 = colonization). Then, for each ID, take the row with the lowest number.
library(dplyr)
df_prio <- df |>
  mutate(
    fct_clin_sig = factor(
      clinical_significance,
      levels = c("definite disease", "possible", "colonization")
    )
  ) |>
  group_by(id) |>
  # with_ties = FALSE keeps a single row per id even when classifications tie
  slice_min(fct_clin_sig, n = 1, with_ties = FALSE) |>
  ungroup()

I fixed it using
df <- df %>%
group_by(id) %>%
mutate(clinical_significance_new = ifelse(any(clinical_significance == "definite disease"), "definite disease", as.character(clinical_significance)))
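For comparison, here is a minimal base-R sketch of the same "rank, then keep the best row per id" idea. It uses a trimmed-down copy of the example data, and the names priority, rank, ordered_df and one_per_id are my own, chosen for illustration:

```r
# Trimmed copy of the example data (only the columns needed here)
df <- data.frame(
  id = c("18", "18", "20", "20", "23", "23"),
  clinical_significance = c("possible", "possible", "possible",
                            "possible", "possible", "definite disease"),
  stringsAsFactors = FALSE
)

# Priority order: earlier entries outrank later ones
priority <- c("definite disease", "possible", "colonization")

# Rank each row by its position in the priority vector (1 = highest)
rank <- match(df$clinical_significance, priority)

# Sort by id and rank, then keep the first (best-ranked) row per id
ordered_df <- df[order(df$id, rank), ]
one_per_id <- ordered_df[!duplicated(ordered_df$id), ]
one_per_id
```

Unlike the mutate/ifelse fix, this returns exactly one row per id, which is what the question asks for.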

replace and remove duplicate rows using ifelse

I have a data frame of postcodes with a regional/metro classification assigned. In some instances, due to the data source, the same postcode occurs with both a regional and a metro classification.
POSTCODE REGION
1 3000 METRO
2 3000 REGIONAL
3 3256 METRO
4 3145 METRO
I am wondering how to remove the duplicate row and replace the region with "SPLIT" in these instances.
I have tried the code below, but it reassigns the entire dataset with either "METRO" or "REGIONAL":
test <- within(PC_ACTM, REGION <- ifelse(duplicated("Postcode"), "SPLIT", REGION))
The desired output would be
POSTCODE REGION
1 3000 SPLIT
2 3256 METRO
3 3145 METRO
Example data:
dput(PC_ACTM)
structure(list(POSTCODE = c(3000L, 3000L, 3256L, 3145L), REGION = c("METRO",
"REGIONAL", "METRO", "METRO")), class = "data.frame", row.names = c("1",
"2", "3", "4"))

Based on your title, you're looking for an ifelse() solution; perhaps this will suit?
PC_ACTM <- structure(list(POSTCODE = c(3000L, 3000L, 3256L, 3145L),
REGION = c("METRO", "REGIONAL", "METRO", "METRO")),
class = "data.frame",
row.names = c("1", "2", "3", "4"))
PC_ACTM$REGION <- ifelse(duplicated(PC_ACTM$POSTCODE), "SPLIT", PC_ACTM$REGION)
PC_ACTM[!duplicated(PC_ACTM$POSTCODE, fromLast = TRUE),]
#> POSTCODE REGION
#> 2 3000 SPLIT
#> 3 3256 METRO
#> 4 3145 METRO
Created on 2022-04-07 by the reprex package (v2.0.1)

Consider ave to compute a sequential count and a total count by group, use ifelse to replace the value in any group with more than one row, and then subset to the last row of each group. The code below uses the native pipe |> available from base R 4.1.0:
test <- within(
  PC_ACTM, {
    # position of each row within its POSTCODE group
    PC_SEQ <- ave(seq_along(POSTCODE), POSTCODE, FUN = seq_along)
    # number of rows in each POSTCODE group
    PC_COUNT <- ave(seq_along(POSTCODE), POSTCODE, FUN = length)
    REGION <- ifelse(
      (PC_SEQ == PC_COUNT) & (PC_COUNT > 1), "SPLIT", REGION
    )
  }
) |> subset(
  subset = PC_SEQ == PC_COUNT, # SUBSET ROWS
  select = c(POSTCODE, REGION) # SELECT COLUMNS
) |> `row.names<-`(NULL) # RESET ROW NAMES
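Both answers above mark a postcode as "SPLIT" whenever it appears more than once. If the same postcode could also appear twice with the same region, a slightly different rule may be wanted: label it "SPLIT" only when its regions actually differ. A minimal sketch of that variant with base R's aggregate, where the name collapsed is my own and the result comes back sorted by POSTCODE rather than in the original row order:

```r
PC_ACTM <- data.frame(
  POSTCODE = c(3000L, 3000L, 3256L, 3145L),
  REGION = c("METRO", "REGIONAL", "METRO", "METRO"),
  stringsAsFactors = FALSE
)

# One row per POSTCODE: "SPLIT" if the group holds more than one distinct
# region, otherwise the single region it contains
collapsed <- aggregate(
  REGION ~ POSTCODE,
  data = PC_ACTM,
  FUN = function(r) if (length(unique(r)) > 1) "SPLIT" else r[1]
)
collapsed
```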

R: Passing multiple dataframe columns to dplyr::case_when() as condition while using column title as replacement

I want to pass all values in a data frame as conditions to dplyr::case_when() with stringr::str_detect(), using the respective column title as the replacement value.
I have these two data frames:
> print(city_stack)
# A tibble: 11 × 1
city
<chr>
1 Britz
2 Berlin-Reinickendorf
3 Berlin-Kladow
4 Berlin-Spindlersfeld
5 Berlin-Mahlsdorf
6 Berlin-Lichterfelde
7 Berlin-Spandau
8 Berlin-Biesdorf
9 Berlin-Niederschöneweide
10 Rüdersdorf bei Berlin
11 Berlin-Nordend
> print(districts_stack)
# A tibble: 10 × 2
Berlin Köln
<chr> <chr>
1 Adlershof Rodenkirchen
2 Altglienicke Chorweiler
3 Baumschulenweg Ehrenfeld
4 Biesdorf Kalk
5 Blankenburg Lindenthal
6 Blankenfelde Mülheim
7 Bohnsdorf Nippes
8 Britz Porz
9 Buch Kölner Zoo
10 Buckow Universität zu Köln
I tried using a nested for loop:
for (i in colnames(districts_stack)) {
  for (j in districts_stack[[i]]) {
    # replace the city with the column title i when it contains district j
    city_stack <- mutate(city_stack, city = case_when(
      str_detect(city, j) ~ i,
      TRUE ~ city
    ))
  }
}
While that totally works, this is extremely inefficient and gets problematic with the huge dataframe I am actually working with. I feel like there should be a more efficient solution using purrr::map(), but I wasn't able to come up with anything working.
dput() of the dataframes:
dput(city_stack[1:11,])
structure(list(city = c("Britz", "Berlin-Reinickendorf", "Berlin-Kladow",
"Berlin-Spindlersfeld", "Berlin-Mahlsdorf", "Berlin-Lichterfelde",
"Berlin-Spandau", "Berlin-Biesdorf", "Berlin-Niederschöneweide",
"Rüdersdorf bei Berlin", "Berlin-Nordend")), row.names = c(NA,
-11L), class = c("tbl_df", "tbl", "data.frame"))
> dput(districts_stack[1:10,1:2])
structure(list(Berlin = c("Adlershof", "Altglienicke", "Baumschulenweg",
"Biesdorf", "Blankenburg", "Blankenfelde", "Bohnsdorf", "Britz",
"Buch", "Buckow"), Köln = c("Rodenkirchen", "Chorweiler", "Ehrenfeld",
"Kalk", "Lindenthal", "Mülheim", "Nippes", "Porz", "Kölner Zoo",
"Universität zu Köln")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))

I'm not 100% sure what output you're looking for, but I believe this is a step in the right direction. Rather than looping over the district values and checking each one for matches, I propose reshaping the districts_stack data to long format and joining it to the city names with a regex-based (partial string) match.
As I understand it, that is what the loop is doing. The result is a single data frame in which you can replace the city value with if_else more easily.
I drew inspiration from this thread: dplyr: inner_join with a partial string match
library(tidyverse)
library(fuzzyjoin) # to join the data based on fuzzy matches to get results in one dataframe for easier manipulation
city_stack <- structure(list(city = c("Britz", "Berlin-Reinickendorf", "Berlin-Kladow",
"Berlin-Spindlersfeld", "Berlin-Mahlsdorf", "Berlin-Lichterfelde",
"Berlin-Spandau", "Berlin-Biesdorf", "Berlin-Niederschöneweide",
"Rüdersdorf bei Berlin", "Berlin-Nordend")), row.names = c(NA,
-11L), class = c("tbl_df", "tbl", "data.frame"))
districts_stack <- structure(list(Berlin = c("Adlershof", "Altglienicke", "Baumschulenweg",
"Biesdorf", "Blankenburg", "Blankenfelde", "Bohnsdorf", "Britz",
"Buch", "Buckow"), Köln = c("Rodenkirchen", "Chorweiler", "Ehrenfeld",
"Kalk", "Lindenthal", "Mülheim", "Nippes", "Porz", "Kölner Zoo",
"Universität zu Köln")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame")) %>%
pivot_longer(., cols = everything(), names_to='city', values_to='district') %>%
arrange(city)
city_stack %>% # left join to get all potential string matches, then mutate
regex_left_join(districts_stack, by = c(city = "district")) %>%
mutate(city.x = if_else(!is.na(city.y), city.y, city.x)) # city.y holds the column title (Berlin/Köln) of the matched district
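If adding fuzzyjoin is not an option, the inner loop over individual districts can also be collapsed into one regular expression per column, which leaves only a short loop over the column titles. This is a base-R sketch on a small subset of the data, not a benchmarked solution. The helper name label_city is hypothetical, and it assumes the district names contain no regex metacharacters:

```r
# Small subset of the question's data; "K\u00f6ln" is just "Köln" written
# with an escape so the snippet does not depend on the file encoding
city <- c("Britz", "Berlin-Kladow", "Berlin-Biesdorf", "Porz")
districts <- setNames(
  list(c("Biesdorf", "Britz", "Buch"), c("Porz", "Kalk")),
  c("Berlin", "K\u00f6ln")
)

# Hypothetical helper: replace a city by the column title whose districts it
# mentions (assumes district names are safe to use as regex alternatives)
label_city <- function(city, districts) {
  out <- city
  for (nm in names(districts)) {
    pattern <- paste(districts[[nm]], collapse = "|")
    hit <- grepl(pattern, city) # always test against the original names
    out[hit] <- nm
  }
  out
}

labelled <- label_city(city, districts)
labelled
```

Cities that match no district, such as "Berlin-Kladow" here, are left unchanged, mirroring the TRUE ~ city branch of the original case_when().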

To match a list type column with other columns in DF

I have a data frame with the approximate structure:
C1 C2 C3
1 c("XXX", "Y3") "XXX" "Y31"
2 c("SFM", "DD31", "DSDW") "SFF" "DD31"
The column C1 is a list. It was a string which I split into separate words. The other 2 columns are character.
I need to match C2 and C3 against C1 and, wherever there is a match (every row is guaranteed to have one), replace the matched value in C1 with another value. For example:
The first row has 2 matches, because a fuzzy match also counts as a match:
C1~C2: replace "XXX" in C1 with the modified value from C2 "XXX[TAG]"
C1~C3: replace "Y3" in C1 with the modified value from C3 "Y31[TAG]"
In general I understand how to do that: with a for loop, match function and regex but my knowledge does not allow me to combine everything together. Thank you in advance!
EDITED
What I have:
x <- structure(list(Description = list(c("2012", "Deere", "544K",
"Wheel", "Loader,"), c("Caterpillar","Model", "988", "Year", "1972")),
Manufacturer = c("john deere", "caterpillar"),
Model = c("544k", "988")), .Names = c("Description", "Manufacturer", "Model"), row.names = 4:5, class = "data.frame")
#> Description Manufacturer Model
#> 4 2012, Deere, 544K, Wheel, Loader, john deere 544k
#> 5 Caterpillar, Model, 988, Year, 1972 caterpillar 988
What I want to have:
x.new <- structure(list(Description = list(c("2012", "john deere[Manufacturer]", "544k[Model]",
"Wheel", "Loader,"), c("caterpillar[Manufacturer]","Model", "988[Model]", "Year", "1972")),
Manufacturer = c("john deere", "caterpillar"),
Model = c("544k", "988")), .Names = c("Description", "Manufacturer", "Model"), row.names = 4:5, class = "data.frame")
#> Description Manufacturer Model
#> 4 2012, john deere[Manufacturer], 544k[Model], Wheel, Loader, john deere 544k
#> 5 caterpillar[Manufacturer], Model, 988[Model], Year, 1972 caterpillar 988

With list columns, you'll need a lot of lapply and its multivariate equivalent, Map, which allow you to iterate over the list column and return a list which can be reassigned as a column. For example,
df <- structure(list(C1 = list(c("XXX", "Y3"), c("SFM", "DD31", "DSDW")),
C2 = c("XXX", "SFF"),
C3 = c("Y31", "DD31")),
.Names = c("C1", "C2", "C3"), row.names = c(NA, -2L), class = "data.frame")
df$C1_new <- Map(function(c1, c2, c3){
sapply(c1, function(x){
mtch <- grepl(x, c(c2, c3));
if (any(mtch)) {paste0(c(c2, c3)[mtch], '[', names(df)[-1][mtch], ']')} else {x}
})},
df$C1, df$C2, df$C3)
df
#> C1 C2 C3 C1_new
#> 1 XXX, Y3 XXX Y31 XXX[C2], Y31[C3]
#> 2 SFM, DD31, DSDW SFF DD31 SFM, DD31[C3], DSDW
There are many other ways to set this up, including packages like purrr and stringr that make the syntax simpler and more uniform. Vary as you like.
To apply to the second dataset listed, it works with some slight edits:
x <- structure(list(Description = list(c("2012", "Deere", "544K", "Wheel", "Loader,"),
c("Caterpillar","Model", "988", "Year", "1972")),
Manufacturer = c("john deere", "caterpillar"),
Model = c("544k", "988")),
.Names = c("Description", "Manufacturer", "Model"), row.names = 4:5, class = "data.frame")
x$Description <- Map(function(desc, mfr, mdl){
sapply(desc, function(wrd){
mtch <- grepl(wrd, c(mfr, mdl), ignore.case = TRUE);
if (any(mtch)) {paste0(c(mfr, mdl)[mtch], '[', names(x)[-1][mtch], ']')} else {wrd}
})},
x$Description, x$Manufacturer, x$Model)
x
#> Description Manufacturer Model
#> 4 2012, john deere[Manufacturer], 544k[Model], Wheel, Loader, john deere 544k
#> 5 caterpillar[Manufacturer], Model, 988[Model], Year, 1972 caterpillar 988
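One caveat with grepl as used above is that each word is treated as a regular expression, so a word containing characters such as ( or + could misbehave. A possible variant, sketched here under my own assumptions, matches literally with fixed = TRUE, lower-cases both sides to stay case-insensitive, and uses vapply so that every word yields exactly one string. The helper name tag_words is hypothetical:

```r
# Hypothetical helper: tag each word that occurs inside one of the fields.
# fields is a named character vector, e.g. c(Manufacturer = "john deere")
tag_words <- function(words, fields) {
  vapply(words, function(w) {
    # literal, case-insensitive substring match of the word in each field
    hit <- grepl(tolower(w), tolower(fields), fixed = TRUE)
    if (any(hit)) {
      # use the first matching field only, so the result is one string
      paste0(fields[hit][1], "[", names(fields)[hit][1], "]")
    } else {
      w
    }
  }, character(1), USE.NAMES = FALSE)
}

x <- data.frame(
  Manufacturer = c("john deere", "caterpillar"),
  Model = c("544k", "988"),
  stringsAsFactors = FALSE
)
x$Description <- list(
  c("2012", "Deere", "544K", "Wheel", "Loader,"),
  c("Caterpillar", "Model", "988", "Year", "1972")
)

# Apply the helper row by row over the list column
x$Description <- Map(
  function(desc, mfr, mdl) tag_words(desc, c(Manufacturer = mfr, Model = mdl)),
  x$Description, x$Manufacturer, x$Model
)
x$Description
```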

Comparing pairs of rows in a list of data frames

I have a list that's 1314 elements long. Each element is a data frame consisting of two rows and four columns.
Game.ID Team Points Victory
1 201210300CLE CLE 94 0
2 201210300CLE WAS 84 0
I would like to use the lapply function to compare points for each team in each game, and change Victory to 1 for the winning team.
I'm trying to use this function:
test_vic <- lapply(all_games, function(x) {if (x[1,3] > x[2,3]) {x[1,4] = 1}})
But the result it produces is a list 1314 elements long with just the Game ID and either a 1 or a null, a la:
$`201306200MIA`
[1] 1
$`201306160SAS`
NULL
How can I fix my code so that each data frame keeps its shape? (I'm guessing that solving the NULL part involves if-else, but I need to figure out the right syntax.)
Thanks.

Try
lapply(all_games, function(x) {x$Victory[which.max(x$Points)] <- 1; x})
Another option is to combine the list into a single data.table with rbindlist and then set Victory within each Game.ID:
library(data.table)
rbindlist(all_games)[,Victory:= +(Points==max(Points)) ,Game.ID][]
data
all_games <- list(structure(list(Game.ID = c("201210300CLE",
"201210300CLE"
), Team = c("CLE", "WAS"), Points = c(94L, 84L), Victory = c(0L,
0L)), .Names = c("Game.ID", "Team", "Points", "Victory"),
class = "data.frame", row.names = c("1",
"2")), structure(list(Game.ID = c("201210300CME", "201210300CME"
), Team = c("CLE", "WAS"), Points = c(90, 92), Victory = c(0L,
0L)), .Names = c("Game.ID", "Team", "Points", "Victory"),
row.names = c("1", "2"), class = "data.frame"))
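The reason the original attempt returned 1 or NULL is that an R function returns the value of its last evaluated expression: when the if condition is true, that is the assigned value 1, and when it is false there is no else branch, so the result is NULL. Returning the modified data frame explicitly keeps the shape. A minimal base-R sketch on simplified data, where the helper name mark_winner is my own:

```r
# Simplified version of the example data
all_games <- list(
  data.frame(Game.ID = "201210300CLE", Team = c("CLE", "WAS"),
             Points = c(94, 84), Victory = 0),
  data.frame(Game.ID = "201210300CME", Team = c("CLE", "WAS"),
             Points = c(90, 92), Victory = 0)
)

mark_winner <- function(game) {
  # 1 for the row(s) with the highest score, 0 otherwise
  game$Victory <- as.integer(game$Points == max(game$Points))
  game # return the whole data frame, not just the assigned value
}

result <- lapply(all_games, mark_winner)
result
```

In the case of a tie this marks both teams with 1, whereas which.max would mark only the first.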

You could try dplyr:
library(dplyr)
all_games %>%
bind_rows() %>%
group_by(Game.ID) %>%
mutate(Victory = row_number(Points)-1)
Which gives:
#Source: local data frame [4 x 4]
#Groups: Game.ID
#
# Game.ID Team Points Victory
#1 201210300CLE CLE 94 1
#2 201210300CLE WAS 84 0
#3 201210300CME CLE 90 0
#4 201210300CME WAS 92 1
