I want to remove rows containing specific strings, which I have stored in a separate vector.
I tried everything from
Delete rows containing specific strings in R and Remove Rows From Data Frame where a Row matches a String,
but it always removes every row and my output is empty.
I tried it with a small example and it works fine, but not with my input and my remove_list.
My input is:
ID Aufzeichnungen
<dbl> <chr>
1 1 "Aufzeichnungen"
2 1 "07.03.22 A: stechender Schmerz"
3 1 " scharfkantig"
4 1 "D/B:"
5 1 "T:"
6 1 "pat aht an 36 üz distal"
7 1 " seit paartagen"
8 1 "36 vipr++"
9 1 " perk-"
10 1 " keine c zu entdekcne"
11 1 "üz bilfuird"
12 1 "pat aufgekläörtggf RÖ um c auszuschileßen"
13 1 " pat verweigert RÖ aus Angst vor Strahlung"
14 1 " pat"
15 1 "aufgeklärt angst nicht nötig und c unter fllg oder apprx nicht auszuschließen"
16 1 ""
17 1 "pat knirscht"
18 1 " schiene empohlen"
19 1 " pat meldet sich.."
and I want to remove every row containing strings from this list:
remove_list <- paste(c("einverst", "empf", "raten", "aufgeklä", "nicht", "weiß nicht", "bespr", "soll",
"kein", "?", "raten", "klären", "überprüf", "erst, wenn", "verweiger",
"notwendig"), collapse = '|')
Logically, it should remove rows 10, 12, 13 and 15.
My code:
removed <- PKV[grep(remove_list, PKV$Aufzeichnungen, invert = TRUE), ]
removed <- PKV %>% filter(!grepl(remove_list, PKV$Aufzeichnungen))
and also every variant with str_detect.
But the output always looks like this:
# A tibble: 0 × 2
# Groups: ID [0]
# … with 2 variables: ID <dbl>, Aufzeichnungen <chr>
Thank you for your help!
We can first grep the indices of the rows that contain one of the remove_list words, then exclude them from your data.frame. Matching each word with fixed = TRUE treats it literally, which also sidesteps the unescaped "?" in your pattern:
remove_ind <- lapply(strsplit(remove_list, "\\|")[[1]],
                     \(x) grep(x, PKV$Aufzeichnungen, fixed = TRUE)) |>
  unlist() |>
  unique()
#> [1] 12 15 10 13
PKV[-remove_ind,]
Output:
ID Aufzeichnungen
1 1 Aufzeichnungen
2 1 07.03.22 A: stechender Schmerz
3 1 scharfkantig
4 1 D/B:
5 1 T:
6 1 pat aht an 36 üz distal
7 1 seit paartagen
8 1 36 vipr++
9 1 perk-
11 1 üz bilfuird
14 1 pat
16 1
17 1 pat knirscht
18 1 schiene empohlen
19 1 pat meldet sich..
Try this:
remove_list <- c("einverst", "empf", "raten", "aufgeklä", "nicht", "weiß nicht", "bespr", "soll",
"kein", "?", "raten", "klären", "überprüf", "erst, wenn", "verweiger",
"notwendig")
mylist <- c("notwendig","einverst","1","2" )
mylist[!mylist %in% remove_list]
#> [1] "1" "2"
Created on 2022-08-10 by the reprex package (v2.0.1)
I see that you have a question mark in remove_list, but ? has a special meaning in regex, so I suggest you escape it, i.e.
remove_list <- paste(c("einverst", "empf", "raten", "aufgeklä", "nicht", "weiß nicht", "bespr", "soll",
"kein", "\\?", "raten", "klären", "überprüf", "erst, wenn", "verweiger",
"notwendig"), collapse = '|')
Then select the right rows by negating grepl with !:
PKV[!grepl(remove_list, PKV$Aufzeichnungen), ]
Example of escaping ?:
#first rows of your data
dt <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 1), AUFZEICHNUNGEN = c("Aufzeichnungen",
"07.03.22 A: stechender Schmerz", " scharfkantig", "D/B:",
" keine c zu entdekcne", "pat aufgekläörtggf RÖ um c auszuschileßen",
" seit paartagen", "36 vipr++")), class = "data.frame", row.names = c(NA,
-8L))
#grepl without escaping ?
grepl("?", PKV$AUFZEICHNUNGEN)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#grepl with escaping ?
grepl("\\?", PKV$AUFZEICHNUNGEN)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
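With the pattern escaped like this, the str_detect variant mentioned in the question should work as well; a minimal sketch, assuming the dt example above:
library(dplyr)
library(stringr)
# keep only rows that match none of the terms in remove_list
dt %>% filter(!str_detect(AUFZEICHNUNGEN, remove_list))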
I have a dataframe with multiple values per cell, and I want to find and return the values that are only in one column.
ID <- c("1","1","1","2","2","2","3","3","3")
locus <- c("A","B","C","A","B","C","A","B","C")
Acceptable <- c("K44 Z33 G49","K72 QR123","B92 N12 PQ16 G99","V3","L89 I203 UPA66 QF29"," ","K44 Z33 K72","B92 PQ14","J22 M43 VC78")
Unacceptable <- c("K44 Z33 G48","K72 QR123 B22","B92 N12 PQ16 G99","V3 N9 Q7","L89 I203 UPA66 QF29","B8","K44 Z33"," ","J22 M43 VC78")
df <- data.frame(ID,locus,Acceptable,Unacceptable)
I want to make another column, Unique_values, that returns all the values that are present only in Unacceptable and not in Acceptable.
I already have a poorly optimized method to find the duplicates between the two columns:
df$Duplicate_values <- do.call(paste, c(df[,c("Acceptable","Unacceptable")], sep=" "))
df$Duplicate_values = sapply(strsplit(df$Duplicate_values, ' '), function(i)paste(i[duplicated(i)]))
#this is for cleaning up the text field so that it looks like the other columns
df$Duplicate_values = gsub("[^0-9A-Za-z///' ]"," ",df$Duplicate_values)
df$Duplicate_values = gsub("character 0",NA,df$Duplicate_values)
df$Duplicate_values = gsub("^c ","",df$Duplicate_values)
df$Duplicate_values = gsub(" "," ",df$Duplicate_values)
df$Duplicate_values = trimws(df$Duplicate_values)
(If anyone knows a faster method to return these duplicates, please let me know!)
I cannot use this method to find the unique values however, because it would then also return the unique values of the Acceptable column, which I do not want.
Any suggestions?
A similar approach using setdiff:
lA <- strsplit(df$Acceptable, " ")
lU <- strsplit(df$Unacceptable, " ")
df$Unique_values <- lapply(1:nrow(df), function(i) paste0(setdiff(lU[[i]], lA[[i]]), collapse = " "))
df
#> ID locus Acceptable Unacceptable Unique_values
#> 1 1 A K44 Z33 G49 K44 Z33 G48 G48
#> 2 1 B K72 QR123 K72 QR123 B22 B22
#> 3 1 C B92 N12 PQ16 G99 B92 N12 PQ16 G99
#> 4 2 A V3 V3 N9 Q7 N9 Q7
#> 5 2 B L89 I203 UPA66 QF29 L89 I203 UPA66 QF29
#> 6 2 C B8 B8
#> 7 3 A K44 Z33 K72 K44 Z33
#> 8 3 B B92 PQ14
#> 9 3 C J22 M43 VC78 J22 M43 VC78
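The same split vectors also cover the side question about the duplicates: intersect() keeps only the values present in both columns, which is a more direct alternative to the paste/gsub chain in the question. A minimal sketch reusing lA and lU from above:
# values that occur in both Acceptable and Unacceptable, per row
df$Duplicate_values <- sapply(seq_len(nrow(df)),
                              function(i) paste(intersect(lA[[i]], lU[[i]]), collapse = " "))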
I want to split info into Refseq ID, cDNA level change and Protein level change, where Refseq ID represents the substring from the start to the first colon :, cDNA level change is the substring between the first and second colon, and Protein level change is the substring after the second colon.
library(stringr)
df=read.csv("variant_calls.txt", sep="\t")
info=df["AAChange.refGene"]
id=stringr::str_extract(info, "SRR.(\\d{6})")
aa=info[!id]
> dput(info)
structure(list(AAChange.refGene = c("NM_002725:c.C301T:p.P101S",
"NM_001024940:c.T1054A:p.Y352N", "NM_001098209:c.T109C:p.S37P",
"NM_152539:c.G955A:p.E319K", "NM_032421:c.A2422G:p.T808A", "NM_003141:c.G431A:p.G144E",
"NM_006645:c.C749T:p.S250L", "NM_206927:c.C778A:p.P260T", "NM_012240:c.G209A:p.G70E",
"NM_152336:c.A382C:p.K128Q", "NM_002773:c.G750C:p.W250C", "NM_001797:c.C2125T:p.R709W",
"NM_058216:c.C797A:p.A266D", "NM_198977:c.C1543T:p.R515W", "NM_000307:c.C356T:p.A119V"
)), row.names = c(NA, -15L), class = "data.frame")
Expected output:
Refseq ID      cDNA level change   Protein level change
NM_001024940   c.T1054A            p.Y352N
NM_001098209   c.T109C             p.S37P
NM_152539      c.G955A             p.E319K
NM_032421      c.A2422G            p.T808A
Using base R with read.csv after replacing the first : with ,
read.csv(text = sub(":", ",", df$AAChange.refGene),
header = FALSE, col.names = c("id", "aa"))
id aa
1 NM_002725 c.C301T:p.P101S
2 NM_001024940 c.T1054A:p.Y352N
3 NM_001098209 c.T109C:p.S37P
4 NM_152539 c.G955A:p.E319K
5 NM_032421 c.A2422G:p.T808A
6 NM_003141 c.G431A:p.G144E
7 NM_006645 c.C749T:p.S250L
8 NM_206927 c.C778A:p.P260T
9 NM_012240 c.G209A:p.G70E
10 NM_152336 c.A382C:p.K128Q
11 NM_002773 c.G750C:p.W250C
12 NM_001797 c.C2125T:p.R709W
13 NM_058216 c.C797A:p.A266D
14 NM_198977 c.C1543T:p.R515W
15 NM_000307 c.C356T:p.A119V
If we don't need the last part after the :
read.csv(text = trimws(df$AAChange.refGene, whitespace = ":[^:]+",
which = "right"), header = FALSE, col.names = c("id", "aa"), sep = ":")
id aa
1 NM_002725 c.C301T
2 NM_001024940 c.T1054A
3 NM_001098209 c.T109C
4 NM_152539 c.G955A
5 NM_032421 c.A2422G
6 NM_003141 c.G431A
7 NM_006645 c.C749T
8 NM_206927 c.C778A
9 NM_012240 c.G209A
10 NM_152336 c.A382C
11 NM_002773 c.G750C
12 NM_001797 c.C2125T
13 NM_058216 c.C797A
14 NM_198977 c.C1543T
15 NM_000307 c.C356T
You could use sub for a base R option:
df$id <- sub(".*?:([^:]+):.*", "\\1", df$AAChange.refGene)
df$aa <- sub(".*?:", "", df$AAChange.refGene)
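If you also want the Refseq ID (the part before the first colon), one more sub call does it; a small addition, not part of the original answer, with an illustrative column name:
df$refseq <- sub(":.*", "", df$AAChange.refGene)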
In this case, you can try using separate() instead of regex.
library(tidyr)
info %>%
separate(AAChange.refGene,
c("Refseq ID", "cDNA level change", "Protein level change"),
sep = ":", extra = "drop")
Refseq ID cDNA level change Protein level change
1 NM_002725 c.C301T p.P101S
2 NM_001024940 c.T1054A p.Y352N
3 NM_001098209 c.T109C p.S37P
4 NM_152539 c.G955A p.E319K
5 NM_032421 c.A2422G p.T808A
6 NM_003141 c.G431A p.G144E
7 NM_006645 c.C749T p.S250L
8 NM_206927 c.C778A p.P260T
9 NM_012240 c.G209A p.G70E
10 NM_152336 c.A382C p.K128Q
11 NM_002773 c.G750C p.W250C
12 NM_001797 c.C2125T p.R709W
13 NM_058216 c.C797A p.A266D
14 NM_198977 c.C1543T p.R515W
15 NM_000307 c.C356T p.A119V
My favorite method for this kind of task is extract:
library(tidyr)
info %>%
extract(AAChange.refGene,
into = c("Refseq ID", "cDNA level change", "Protein level change"),
regex = "(.*):(.*):(.*)")
Refseq ID cDNA level change Protein level change
1 NM_002725 c.C301T p.P101S
2 NM_001024940 c.T1054A p.Y352N
3 NM_001098209 c.T109C p.S37P
4 NM_152539 c.G955A p.E319K
5 NM_032421 c.A2422G p.T808A
6 NM_003141 c.G431A p.G144E
7 NM_006645 c.C749T p.S250L
8 NM_206927 c.C778A p.P260T
9 NM_012240 c.G209A p.G70E
10 NM_152336 c.A382C p.K128Q
11 NM_002773 c.G750C p.W250C
12 NM_001797 c.C2125T p.R709W
13 NM_058216 c.C797A p.A266D
14 NM_198977 c.C1543T p.R515W
15 NM_000307 c.C356T p.A119V
For the sake of completeness, the fread() function from the data.table package is quite handy for cases like this one:
data.table::fread(text = info$AAChange.refGene, sep = ":", header = FALSE,
col.names = c("Refseq ID", "cDNA level change", "Protein level change"))
Refseq ID cDNA level change Protein level change
1: NM_002725 c.C301T p.P101S
2: NM_001024940 c.T1054A p.Y352N
3: NM_001098209 c.T109C p.S37P
4: NM_152539 c.G955A p.E319K
5: NM_032421 c.A2422G p.T808A
6: NM_003141 c.G431A p.G144E
7: NM_006645 c.C749T p.S250L
8: NM_206927 c.C778A p.P260T
9: NM_012240 c.G209A p.G70E
10: NM_152336 c.A382C p.K128Q
11: NM_002773 c.G750C p.W250C
12: NM_001797 c.C2125T p.R709W
13: NM_058216 c.C797A p.A266D
14: NM_198977 c.C1543T p.R515W
15: NM_000307 c.C356T p.A119V
In order to distinguish nonsense text (e.g. djsarejslslasdfhsl) from real (German) words, I would like to do an analysis of letter frequencies.
My idea is to calculate the relative frequencies of two-letter combinations ("te", "ex", "xt", "is" etc.) using a long text. Based on that information I would like to calculate the probability that a given word (or sentence) is real German.
But my first problem is how to extract and count all the two-letter combinations. I fear that using substring(string, start, stop) and increasing start and stop in a loop would not be very efficient. Do you have any ideas?
# A short sample text
text <- 'Es ist ein Freudentag – ohne Zweifel. Gesundheitsminister Alain Berset und der Bundesrat gehen weiter, als man annehmen durfte. Die Zertifikatspflicht wird aufgehoben, die Maskenpflicht gilt nur noch im ÖV und in Gesundheitseinrichtungen.
Die beste Meldung des Tages aber ist: Die Covid-19-Task-Force, inzwischen als «Task-Farce» verballhornt, wird auf Ende März aufgehoben – zwei Monaten früher als geplant. Die Dauerkritik war wohl mit ein Grund, dass dieses Gremium sich jetzt rasch auflösen will.
Keine Rosen ohne Dornen: Einzelne Punkte von Bersets Ausführungen geben zu denken.
Die «Isolationshaft» für positiv Getestete bleibt zwingend. Das ist Unsinn und steht in einem scharfen Kontrast zu den übrigen Öffnungsschritten. Die Grundimmunität der Bevölkerung beträgt über 90 Prozent, das Virus ist nicht mehr gefährlich, warum will man weiter Leute zu Hause einsperren? Wer schwer krank ist, geht von sich aus nicht zur Arbeit. Die krankheitsbedingte Bettruhe muss man den Menschen nicht vorschreiben.
Gesundheitsminister Berset findet, das Modell Task-Force habe eine interessante Möglichkeit aufgezeigt für die Zusammenarbeit zwischen Regierung und Wissenschaft. Unter Umständen eigne sich dieses Modell auch für andere Bereiche.
Nein danke, Herr Berset.
Die Task-Force war mit ihrem öffentlichen Dauer-Alarmismus und ihren haarsträubenden Falsch-Prognosen vor allem eine Manipulationsmaschine.
Und dann noch dies: Irgendwann während der heutigen Pressekonferenz gab Alain Berset zu verstehen, man habe mit all diesen Massnahmen die Bevölkerung schützen wollen. Vielleicht hatte man diese hehre Absicht einmal im Hinterkopf. Alle Massnahmen ab der zweiten Welle erfolgten nicht zum Schutz der Bevölkerung, sondern, um einen Zusammenbruch des Spital-Systems zu verhindern.
Doch jetzt stossen wir erst einmal auf das Ende der Apartheit an.'
# Some cleaning:
library(stringr)
text <- str_replace_all(text, "[^[:alnum:]]", " ")
text <- tolower(text)
words <- strsplit(text, "\\s+")[[1]]
words
for(word in words){
???
}
Clean, replacing any sequence of non-alphanumeric with a space
text = tolower(gsub("[^[:alnum:]]+", " ", text))
Find all pairs of sequential letters
twos = substring(text, 1:(nchar(text) - 1), 2:nchar(text))
but only keep those that did not overlap a space
twos[nchar(trimws(twos)) == 2L]
Here's the result
> twos[nchar(trimws(twos)) == 2L] |> table()
19 90 aa ab af ag äg ah äh ai al am an än ap ar är as at ät au äu ba be bl br
1 1 1 6 2 2 1 2 2 2 14 2 16 1 1 10 1 15 6 1 12 1 1 24 1 2
bs bt bu ce ch co da de dh di do du dw eb ed ef eg eh ei ek el em en ep er es
1 1 1 4 34 1 9 23 3 18 2 2 1 1 1 1 1 9 32 1 7 5 54 1 42 19
et eu ev ez fa fä fe ff fg fi fl fn fo fr ft fü ga ge gi gl gn gr gs gt ha he
12 3 3 1 2 1 4 2 3 2 3 1 4 2 3 4 1 19 2 1 2 3 1 4 8 17
hi hk hl hm hn ho hr ht hu hü hw ib ic id ie if ig ih ik il im in io ip ir is
3 1 1 3 2 3 9 11 1 1 1 2 16 1 18 2 4 2 2 3 3 28 2 1 5 12
it iu iv je ka ke kh ko kr kt la ld le lg lh li lk ll ln lö ls lt ma mä me mi
19 1 1 2 1 8 1 3 3 1 6 1 7 1 1 5 3 11 1 1 4 1 12 1 8 7
mm mo mö ms mu na nb nd ne nf ng ni nk nm nn no np nr ns nt nu nz ob oc od öf
3 3 1 2 3 4 1 23 13 1 10 8 5 2 4 3 1 1 6 10 2 3 2 3 2 2
og ög oh ol öl on op or os ös ov öv oz pa pe pf pi pl po pr pu ra rä rb rc rd
1 1 3 3 3 8 1 7 4 1 1 1 1 1 1 3 1 1 1 3 2 5 2 3 4 2
re rf rg rh ri rk rl rm rn ro rr rs rt ru rü rz sa sb sc se sf sh si sk sm sn
14 3 1 1 4 2 1 1 4 3 2 9 2 11 1 1 3 1 13 17 1 1 6 5 4 2
so sp sr ss st su sy ta tä te th ti tl to tr ts tt tu tz ub üb uc ud ue uf uh
2 3 1 9 17 3 1 7 2 24 1 6 1 1 4 6 3 1 4 1 2 2 1 2 6 1
üh ul um un ur ür us ut üt ve vi vo vö wa wä we wi wo ys ze zt zu zw
2 1 5 24 3 3 8 3 1 3 3 4 3 4 1 8 9 2 1 5 2 9 6
The algorithm generalizes to sequences of any number of letters by separating words with n - 1 spaces:
chartuples <-
function(text, n = 2)
{
n0 <- n - 1
text <- tolower(gsub(
"[^[:alnum:]]+", paste(rep(" ", n0), collapse = ""), text
))
tuples <- substring(text, 1:(nchar(text) - n0), n:nchar(text))
tuples[nchar(trimws(tuples)) == n]
}
This also makes it easy to look up the counts for any 'word':
counts <- table(chartuples(text))
counts[chartuples("djsarejslslasdfhsl")] |> as.vector()
(The NAs in the resulting vector are letter pairs not present in your original corpus.)
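From here, the scoring step described in the question is just a lookup of relative frequencies. A rough sketch of one simple scoring rule (mean bigram frequency of a word), assuming the counts table above; this is an illustration, not the only sensible choice:
# relative frequency of each bigram in the corpus
freqs <- counts / sum(counts)

# mean bigram frequency of a candidate word; bigrams never seen in the corpus count as 0
score_word <- function(word) {
  f <- freqs[chartuples(word)]
  mean(ifelse(is.na(f), 0, as.numeric(f)))
}

score_word("djsarejslslasdfhsl")  # close to 0 for nonsense
score_word("gesundheit")          # noticeably higher for a real German word
A real classifier would want smoothing and a proper probability model, but this is enough to separate nonsense strings from ordinary German words.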
words <- unlist(strsplit(text, '[^[:alnum:]]+'))
cmbs2 <- sapply(words, function(x) substring(x, len <- seq(nchar(x) - 1), len + 1), USE.NAMES = TRUE)
head(cmbs2) ## Just to show a few words.
$Es
[1] "Es"
$ist
[1] "is" "st"
$ein
[1] "ei" "in"
$Freudentag
[1] "Fr" "re" "eu" "ud" "de" "en" "nt" "ta" "ag"
$ohne
[1] "oh" "hn" "ne"
$Zweifel
[1] "Zw" "we" "ei" "if" "fe" "el"
If I'm not wrong, this should be pretty efficient:
tokens_char <- function(str, window = 2) {
# remove non-word characters
str <- stringi::stri_replace_all_regex(str, "\\W", "")
# lowercase
str <- tolower(str)
# prep window variable
win <- window - 1
len1 <- seq_len(nchar(str) - win)
# split into strings of length window
stringi::stri_sub(str, from = len1, to = len1 + win)
}
The key is stringi::stri_sub, which is a vectorised version of substr. A string is split by moving the window one character at a time, so "This text" is turned into "th" "hi" "is" "st" "te" "ex" "xt". After doing this, we can use some tidyverse code to count occurrences of tokens:
library(tidyverse)
tibble(
token = tokens_char(text, window = 2)
) %>%
count(token, sort = TRUE)
#> # A tibble: 308 × 2
#> token n
#> <chr> <int>
#> 1 en 55
#> 2 er 43
#> 3 ei 35
#> 4 ch 34
#> 5 nd 34
#> 6 in 28
#> 7 te 28
#> 8 be 24
#> 9 un 24
#> 10 de 23
#> # … with 298 more rows
Note that I also included a window argument, which I believe might be useful for your analysis.
tibble(
token = tokens_char(text, window = 3)
) %>%
count(token, sort = TRUE)
#> # A tibble: 851 × 2
#> token n
#> <chr> <int>
#> 1 die 16
#> 2 ich 16
#> 3 ein 15
#> 4 end 13
#> 5 sch 13
#> 6 und 12
#> 7 eit 11
#> 8 nde 10
#> 9 cht 9
#> 10 der 9
#> # … with 841 more rows
And finally, you can also first split your string into words, so that letters following each other across word boundaries do not count. For example, "This text" is turned into "th" "hi" "is" "te" "ex" "xt":
tokens_char_words <- function(str, window = 2) {
str <- unlist(tokenizers::tokenize_words(str))
# prep window variable
win <- window - 1
len1 <- lapply(nchar(str) - win, seq_len)
# split into strings of length window
unlist(stringi::stri_sub_all(str = str, from = len1, to = lapply(len1, function(x) x + win)))
}
tokens_char_words("This text", window = 2)
#> [1] "th" "hi" "is" "te" "ex" "xt"
Created on 2022-02-18 by the reprex package (v2.0.1)
I am trying to filter the Symbol column based on whether it shows up in the form \uxxxx.
This is easy to do visually: some look like $, ¢, £, while others look like \u058f, \u060b, \u07fe.
But I cannot seem to figure it out using stringi / dplyr.
library(dplyr)
library(stringi)
df <- structure(list(Character = c("\\u0024", "\\u00A2", "\\u00A3",
"\\u00A4", "\\u00A5", "\\u058F", "\\u060B", "\\u07FE", "\\u07FF",
"\\u09F2", "\\u09F3", "\\u09FB", "\\u0AF1", "\\u0BF9", "\\u0E3F",
"\\u17DB", "\\u20A0", "\\u20A1", "\\u20A2", "\\u20A3"),
Symbol = c("$", "¢", "£", "¤", "¥", "\u058f", "\u060b", "\u07fe", "\u07ff",
"৲", "৳", "\u09fb", "\u0af1", "\u0bf9", "฿", "៛", "₠",
"₡", "₢", "₣")), row.names = c(NA, 20L), class = "data.frame")
Character Symbol
1 \\u0024 $
2 \\u00A2 ¢
3 \\u00A3 £
4 \\u00A4 ¤
5 \\u00A5 ¥
6 \\u058F \u058f
7 \\u060B \u060b
8 \\u07FE \u07fe
9 \\u07FF \u07ff
10 \\u09F2 ৲
11 \\u09F3 ৳
12 \\u09FB \u09fb
13 \\u0AF1 \u0af1
14 \\u0BF9 \u0bf9
15 \\u0E3F ฿
16 \\u17DB ៛
17 \\u20A0 ₠
18 \\u20A1 ₡
19 \\u20A2 ₢
20 \\u20A3 ₣
What I've tried
I have tried variations on nchar but haven't had any luck:
df$Symbol %>% nchar
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
df$Symbol %>% stri_unescape_unicode %>% nchar
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
df$Symbol %>% stri_escape_unicode %>% nchar
# [1] 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
Question
How can I filter on the Symbol column for all the rows of the form $, ¢, £ etc (and conversely for rows like \u058f, \u060b, \u07fe)?
Edit:
The function glyphs_match() from the gdtools package is designed for this; however, it didn't quite return the expected result. I'm using Lucida Console as my font and obtain the following output when using glyphs_match(). There seems to be one glyph that isn't rendered but for which the function returns TRUE; perhaps other users can explain why that is the case.
df$glyph_match <- gdtools::glyphs_match(df$Symbol, fontfile = "C:\\WINDOWS\\Fonts\\lucon.TTF")
df
Character Symbol glyph_match
1 \\u0024 $ TRUE
2 \\u00A2 ¢ TRUE
3 \\u00A3 £ TRUE
4 \\u00A4 ¤ TRUE
5 \\u00A5 ¥ TRUE
6 \\u058F <U+058F> FALSE
7 \\u060B <U+060B> FALSE
8 \\u07FE <U+07FE> FALSE
9 \\u07FF <U+07FF> FALSE
10 \\u09F2 <U+09F2> FALSE
11 \\u09F3 <U+09F3> FALSE
12 \\u09FB <U+09FB> FALSE
13 \\u0AF1 <U+0AF1> FALSE
14 \\u0BF9 <U+0BF9> FALSE
15 \\u0E3F <U+0E3F> FALSE
16 \\u17DB <U+17DB> FALSE
17 \\u20A0 <U+20A0> FALSE
18 \\u20A1 ¢ TRUE
19 \\u20A2 <U+20A2> FALSE
20 \\u20A3 <U+20A3> TRUE
Earlier answer - may only work on Windows:
There will be variation depending on your font/system; for example, when running your code my output doesn't match what you've provided:
df <- structure(list(Character = c("\\u0024", "\\u00A2", "\\u00A3",
"\\u00A4", "\\u00A5", "\\u058F", "\\u060B", "\\u07FE", "\\u07FF",
"\\u09F2", "\\u09F3", "\\u09FB", "\\u0AF1", "\\u0BF9", "\\u0E3F",
"\\u17DB", "\\u20A0", "\\u20A1", "\\u20A2", "\\u20A3"),
Symbol = c("$", "¢", "£", "¤", "¥", "\u058f", "\u060b", "\u07fe", "\u07ff",
"৲", "৳", "\u09fb", "\u0af1", "\u0bf9", "฿", "៛", "₠",
"₡", "₢", "₣")), row.names = c(NA, 20L), class = "data.frame")
df
Character Symbol
1 \\u0024 $
2 \\u00A2 ¢
3 \\u00A3 £
4 \\u00A4 ¤
5 \\u00A5 ¥
6 \\u058F <U+058F>
7 \\u060B <U+060B>
8 \\u07FE <U+07FE>
9 \\u07FF <U+07FF>
10 \\u09F2 <U+09F2>
11 \\u09F3 <U+09F3>
12 \\u09FB <U+09FB>
13 \\u0AF1 <U+0AF1>
14 \\u0BF9 <U+0BF9>
15 \\u0E3F <U+0E3F>
16 \\u17DB <U+17DB>
17 \\u20A0 <U+20A0>
18 \\u20A1 ¢
19 \\u20A2 <U+20A2>
20 \\u20A3 <U+20A3>
But one crude way of checking whether the glyph exists is:
nchar(capture.output(cat(df$Symbol, sep = "\n"))) == 1
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] TRUE FALSE FALSE
So the glyphs can be filtered by:
library(dplyr)
df %>%
filter(nchar(capture.output(cat(Symbol, sep = "\n"))) == 1)
Character Symbol
1 \\u0024 $
2 \\u00A2 ¢
3 \\u00A3 £
4 \\u00A4 ¤
5 \\u00A5 ¥
6 \\u20A1 ¢
Use as.character.POSIXt to 'render' symbols and pad with spaces. Unicode characters in the form "\uxxxx" will be printed as a single character and all others will be larger; then you can filter according to length:
# To keep 'single char' symbols e.g. "$":
df %>% filter(nchar(as.character.POSIXt(Symbol)) >= 2)
# Or for 'unicode format' symbols e.g. "\u07fe":
df %>% filter(nchar(as.character.POSIXt(Symbol)) == 1)
If you have a long string as a 'symbol' (e.g. "aaaaaaaaaa₣") the padding will be increased and need to be accounted for e.g.
# To keep 'single char' symbols e.g. "$":
df %>% filter(nchar(as.character.POSIXt(Symbol)) >= 11)
# Or for 'unicode format' symbols e.g. "\u07fe":
df %>% filter(nchar(as.character.POSIXt(Symbol)) <= 10)
I am using the quanteda package by Ken Benoit and Paul Nulty to work with textual data.
My corpus contains texts with full German sentences, and I want to work only with the nouns of each text. One trick in German is to keep only the upper-case words, but that would fail at the beginning of a sentence.
Text1 <- c("Halle an der Saale ist die grünste Stadt Deutschlands")
Text2 <- c("In Hamburg regnet es immer, das ist also so wie in London.")
Text3 <- c("James Bond trinkt am liebsten Martini")
myCorpus <- corpus(c(Text1, Text2, Text3))
metadoc(myCorpus, "language") <- "german"
summary(myCorpus, showmeta = T)
myDfm <- dfm(myCorpus, tolower = F, remove_numbers = T,
remove = stopwords("german"), remove_punct = TRUE,
remove_separators = T)
topfeatures(myDfm, 20)
From this minimal example, I would like to retrieve:
"Halle", "Saale", "Stadt", "Deutschland", "Hamburg", "London", "Martini", "James", "Bond".
I assume I need a dictionary that defines verbs/nouns/etc. and the proper names (James Bond, Hamburg etc.), or is there a built-in function/dictionary?
Bonus Question: Does the solution work for English texts too?
You need some help from a part-of-speech tagger. Fortunately there is a great one, with a German language model, in the form of spaCy, and a package we wrote as a wrapper around it, spacyr. Installation instructions are at the spacyr page.
This code will do what you want:
txt <- c("Halle an der Saale ist die grünste Stadt Deutschlands",
"In Hamburg regnet es immer, das ist also so wie in London.",
"James Bond trinkt am liebsten Martini")
library("spacyr")
spacy_initialize(model = "de")
txtparsed <- spacy_parse(txt, tag = TRUE, pos = TRUE)
head(txtparsed, 20)
# doc_id sentence_id token_id token lemma pos tag entity
# 1 text1 1 1 Halle halle PROPN NE LOC_B
# 2 text1 1 1 an an ADP APPR LOC_I
# 3 text1 1 1 der der DET ART LOC_I
# 4 text1 1 1 Saale saale PROPN NE LOC_I
# 5 text1 1 1 ist ist AUX VAFIN
# 6 text1 1 1 die die DET ART
# 7 text1 1 1 grünste grünste ADJ ADJA
# 8 text1 1 1 Stadt stadt NOUN NN
# 9 text1 1 1 Deutschlands deutschlands PROPN NE LOC_B
# 10 text2 1 1 In in ADP APPR
# 11 text2 1 1 Hamburg hamburg PROPN NE LOC_B
# 12 text2 1 1 regnet regnet VERB VVFIN
# 13 text2 1 1 es es PRON PPER
# 14 text2 1 1 immer immer ADV ADV
# 15 text2 1 1 , , PUNCT $,
# 16 text2 1 1 das das PRON PDS
# 17 text2 1 1 ist ist AUX VAFIN
# 18 text2 1 1 also also ADV ADV
# 19 text2 1 1 so so ADV ADV
# 20 text2 1 1 wie wie CONJ KOKOM
(nouns <- with(txtparsed, subset(token, pos == "NOUN")))
# [1] "Stadt"
(propernouns <- with(txtparsed, subset(token, pos == "PROPN")))
# [1] "Halle" "Saale" "Deutschlands" "Hamburg" "London"
# [6] "James" "Bond" "Martini"
Here you can see that the nouns you wanted are marked in the simpler pos field as "PROPN" (proper noun) or "NOUN". The tag field is a more detailed, German-specific tagset that you could also select from.
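For instance, the German STTS tags "NN" (common noun) and "NE" (proper name) can be selected from tag directly; a small sketch that combines the two pos-based selections above:
(nouns_and_names <- with(txtparsed, subset(token, tag %in% c("NN", "NE"))))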
The lists of selected nouns can then be used in quanteda:
library("quanteda")
myDfm <- dfm(txt, tolower = FALSE, remove_numbers = TRUE,
remove = stopwords("german"), remove_punct = TRUE)
head(myDfm)
# Document-feature matrix of: 3 documents, 14 features (66.7% sparse).
# (showing first 3 documents and first 6 features)
# features
# docs Halle Saale grünste Stadt Deutschlands Hamburg
# text1 1 1 1 1 1 0
# text2 0 0 0 0 0 1
# text3 0 0 0 0 0 0
head(dfm_select(myDfm, pattern = propernouns))
# Document-feature matrix of: 3 documents, 8 features (66.7% sparse).
# (showing first 3 documents and first 6 features)
# features
# docs Halle Saale Deutschlands Hamburg London James
# text1 1 1 1 0 0 0
# text2 0 0 0 1 1 0
# text3 0 0 0 0 0 1
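As for the bonus question: yes, the same approach works for English; you just initialize an English spaCy model instead of the German one. A minimal sketch (the exact model name depends on your spaCy installation):
# close the German session before initialising a new model
spacy_finalize()
spacy_initialize(model = "en")  # or e.g. "en_core_web_sm" with newer spaCy versions
txtparsed_en <- spacy_parse("James Bond drinks his Martini in London.", pos = TRUE)
with(txtparsed_en, subset(token, pos %in% c("NOUN", "PROPN")))
The universal pos tags (NOUN, PROPN, ...) are the same across languages, so the selection code does not change; only the language-specific tag set differs.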