Removing a tweet/row if it contains any non-English word - r

I want to remove the whole tweet or row from a data frame if it contains any non-English word.
My data-frame looks like
text
1 | morning why didnt i go to sleep earlier oh well im seEING DNP TODAY!!
JIP UHH <f0><U+009F><U+0092><U+0096><f0><U+009F><U+0092><U+0096>
2 | #natefrancis00 #SimplyAJ10 <f0><U+009F><U+0098><U+0086><f0><U+009F
<U+0086> if only Alan had a Twitter hahaha
3 | #pchirsch23 #The_0nceler #livetennis Whoa whoa let’s not take this too
far now
4 | #pchirsch23 #The_0nceler #livetennis Well Pat that’s just not true
5 | One word #Shame on you! #Ji allowing looters to become president
The expected dataframe should be like this:
text
3 | #pchirsch23 #The_0nceler #livetennis Whoa whoa let’s not take this too
far now
4 | #pchirsch23 #The_0nceler #livetennis Well Pat that’s just not true
5 | One word #Shame on you! #Ji allowing looters to become president.

You want to preserve the alphanumeric characters along with some punctuation such as # and !. If the non-English content in your column shows up mainly as <unicode> tokens, then this should do:
For a data frame df with a text column, using grep with invert = TRUE to keep only the rows without <...> tokens:
new_str <- grep(df$text, pattern = "<[^>]+>", value = TRUE, invert = TRUE)
new_str[new_str != ""]
To put this back into your original text column, you can keep the row indices you need and set the others to NA:
idx <- grep(df$text, pattern = "<[^>]+>", invert = TRUE)
df$text[-idx] <- NA
For cleaning the tweets themselves, you can use the gsub function; see this post: cleaning tweets in R.
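As a minimal sketch of that kind of cleanup (the exact set of characters to keep is an assumption; adjust the character class to your needs), gsub can drop the <U+...> tokens and everything outside the characters you want to preserve:
clean_tweet <- function(x) {
  x <- gsub("<[^>]+>", "", x)            # drop <U+...>-style unicode tokens
  x <- gsub("[^A-Za-z0-9#!'. ]", "", x)  # keep alphanumerics, space, and #, !, ', .
  trimws(gsub("\\s+", " ", x))           # collapse repeated whitespace
}
clean_tweet("One word #Shame on you! <U+0096> #Ji")
# [1] "One word #Shame on you! #Ji"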

Related

How to remove data after certain characters

I need to know how to remove all characters from a value after the leading letter D and the next two digits. I am not sure how to start.
I have a data frame with a column of type character.
The column is called "Eircode".
The postal codes go from D01 to D24 (these are Dublin postal codes).
The values are entered with extra characters appended after the postal code, and I need to remove everything after the three-character Dublin code.
My dataframe is called "MainSchools"
So if the "Eircode" is D03P820, I need it to be D03 after my change.
I would preferably like to be able to do this with the Tidyverse package if possible.
You may use sub here:
df <- data.frame(Eircode = c("D15P820", "K78YD27", "D03P820"),
                 stringsAsFactors = FALSE)
df$Eircode <- sub("^(D(?:0[1-9]|1[0-9]|2[0-4])).*$", "\\1", df$Eircode)
df
Eircode
1 D15
2 K78YD27
3 D03
The regex pattern used above matches and captures Dublin postal codes as follows:
D        match the literal D
(?:      start a non-capturing group:
0[1-9]   01 through 09
|        OR
1[0-9]   10 through 19
|        OR
2[0-4]   20 through 24
)        close the group
The leading D and the alternation are wrapped in a capturing group, and .*$ consumes the remainder of the string. Then, we use \1 as the replacement in sub, leaving behind only the three-character Dublin postal code. Values that do not match the pattern, such as K78YD27, are left unchanged.
I like to use the stringr package for such operations.
library(dplyr)
library(stringr)
df %>% mutate(Eircode = str_extract(Eircode, '^[A-Z][0-9]{2}'))
Note that str_extract (rather than str_extract_all, which returns a list) keeps the column an ordinary character vector.
Output with the data from @Tim Biegeleisen's answer (note that this extracts the first three characters of every code, so K78YD27 becomes K78):
Eircode
1 D15
2 K78
3 D03

Splitting text in a dataframe into single words and detecting if they are part of a dictionary in R

I am trying to write a script that detects whether any word, out of an undefined number of words, is part of a dictionary.
To make this problem a bit more understandable, I have the following data:
Items | Descriptions |
-------------------------
Item1 | poster
Item2 | used cd music etc
Item3 | hckd herbal ingds.
Item4 | 823942 blc
So what I want to do now is to check, for each value in the Descriptions column, whether any of its single words is part of a dictionary or a self-created vector of strings.
So the result should look something like:
Items | Descriptions | inDictionary
--------------------------------------------------
Item1 | poster | TRUE
Item2 | used cd music etc | TRUE
Item3 | hckd herbal ingds. | TRUE
Item4 | 823942 blc | FALSE
For this example I just assume an English dictionary. In this specific case it is sufficient if only one word is part of the dictionary.
I already tried this with the qdapDictionaries library and tokenizers to tokenize the contents of the dataframe cells but I fail to get the check right for cells where I have more than one word.
Help is much appreciated,
Thank you!
As I don't know which dictionary you are working with, here's a description of how in principle you can go about this task:
Data:
df <- data.frame(Descriptions = c("cyber"," &%#","aah ingds.", "823942 blc"))
Let's say you work with the GradyAugmented dictionary from library(qdapDictionaries). You could paste the words in the dictionary together, separating them with the regex alternation marker |, and use grepl, which returns TRUE or FALSE, to check whether any of the dictionary words is contained in a df$Descriptions string:
df$inDict <- grepl(paste0("\\b(", paste(GradyAugmented[1:100], collapse = "|"), ")\\b"), df$Descriptions)
Result:
df
Descriptions inDict
1 cyber TRUE
2 &%# FALSE
3 aah ingds. TRUE
4 823942 blc FALSE
The dictionary may be very large and you may run into memory problems. In that case you can take a different route, via %in%:
df$inDict <- lapply(strsplit(df$Descriptions, " "), function(x) x %in% GradyAugmented)
Here the rows are lists, with one logical value per word:
df
Descriptions inDict
1 cyber TRUE
2 &%# FALSE
3 aah ingds. TRUE, FALSE
4 823942 blc FALSE, FALSE
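If you want a single TRUE/FALSE per row, as in the expected output above, you can wrap the membership check in any(); a minimal sketch building on the same strsplit:
# TRUE if at least one word of the description is in the dictionary
df$inDict <- sapply(strsplit(df$Descriptions, " "),
                    function(x) any(x %in% GradyAugmented))
df
  Descriptions inDict
1        cyber   TRUE
2          &%#  FALSE
3   aah ingds.   TRUE
4   823942 blc  FALSE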
Hope this helps.

Regex in R: how to fill dataframe with multiple matches to left and right of target string

(This is a follow-up to Regex in R: match collocates of node word.)
I want to extract word combinations (collocates) to the left and to the right of a target word (node) and store the three elements in a dataframe.
Data:
GO <- c("This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever.")
Aim:
The target word is the verb GO in any of its possible realizations, be it 'go', 'going', 'goes', 'gone', or 'went', and I'm interested in extracting 3 words to the left of GO and 3 words to the right of GO. The three words can cross sentence boundaries, but the extracted strings should not include punctuation.
What I've tried so far:
To extract left-hand collocates I've used str_extract_all from stringr:
unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))"))
[1] "This little sentence" " went on and" " It was" "s still"
[5] " And will" " and"
This captures most but not all matches and includes spaces.
The extraction of the node, by contrast, looks okay:
unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went"))
[1] "went" "went" "going" "Going" "going" "go" "go"
To extract the right hand collocates:
unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))
[1] " on and went" " on" " on for quite" " on for ages" " on" " on and on"
[7] " on forever"
Again the matches are incomplete and unwanted spaces are included.
And finally assembling all the matches in a dataframe throws an error:
collocates <- data.frame(
  Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")),
  Node = unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went")),
  Right = unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}")))
collocates
Error in data.frame(Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")), :
arguments imply differing number of rows: 6, 7
Expected output:
Left                 Node  Right
This little sentence went  on and went
went on and          went  on It was
on It was            going on for quite
quite a while        Going on for ages
ages It's still      going on And will
on And will          go    on and on
and on and           go    on forever
Does anyone know how to fix this? Suggestions much appreciated.
If you use quanteda, you can get the following result. When you deal with texts, you usually want to work in lowercase, so I converted capital letters with tolower(). I also removed . and , with gsub(). Then I applied kwic() to the text. If you do not mind losing capital letters, dots, and commas, this gets you pretty much what you want.
library(quanteda)
library(dplyr)
library(splitstackshape)
myvec <- c("go", "going", "goes", "gone", "went")
mytext <- gsub(x = tolower(GO), pattern = "\\.|,", replacement = "")
mydf <- kwic(x = mytext, pattern = myvec, window = 3) %>%
  as_tibble %>%
  select(pre, keyword, post) %>%
  cSplit(splitCols = c("pre", "post"), sep = " ", direction = "wide", type.convert = FALSE) %>%
  select(contains("pre"), keyword, contains("post"))
pre_1 pre_2 pre_3 keyword post_1 post_2 post_3
1: this little sentence went on and went
2: went on and went on it was
3: on it was going on for quite
4: quite a while going on for ages
5: ages it's still going on and will
6: on and will go on and on
7: and on and go on forever <NA>
A little late, but not too late for posterity or contemporaries doing collocation research on unannotated text, here's my own answer to my question. Full credit goes to @jazzurro's pointer to quanteda and his answer.
My question was: how to compute collocates of a given node word in a text and store the results in a dataframe (that's the part not addressed by @jazzurro).
Data:
GO <- c("This little sentence went on and went on. It was going on for quite a while.
Going on for ages. It's still going on. And will go on and on, and go on forever.")
Step 1: Prepare data for analysis
go <- gsub("[.!?;,:]", "", tolower(GO)) # get rid of punctuation
go <- gsub("'", " ", go)                # separate clitics from host
Step 2: Extract KWIC using regex pattern and argument valuetype = "regex"
concord <- kwic(go, "go(es|ing|ne)?|went", window = 3, valuetype = "regex")
concord
[text1, 4] this little sentence | went | on and went
[text1, 7] went on and | went | on it was
[text1, 11] on it was | going | on for quite
[text1, 17] quite a while | going | on for ages
[text1, 24] it s still | going | on and will
[text1, 28] on and will | go | on and on
[text1, 33] and on and | go | on forever
Step 3: Identify strings with fewer collocates than defined by window:
# Number of collocates on the left:
concord$nc_l <- lengths(strsplit(concord$pre, " ")); concord$nc_l
[1] 3 3 3 3 3 3 3 # nothing missing here
# Number of collocates on the right:
concord$nc_r <- lengths(strsplit(concord$post, " ")); concord$nc_r
[1] 3 3 3 3 3 3 2 # the last string has only two collocates
Step 4: Add NA to strings with missing collocates:
# define window:
window <- 3
# change string:
concord$post[!concord$nc_r == window] <- paste(concord$post[!concord$nc_r == window], NA, sep = " ")
Step 5: Fill the dataframe with slots for collocates and node, using str_extract from library stringr as well as regex lookarounds to determine the split points for the collocates:
library(stringr)
L3toR3 <- data.frame(
  L3 = str_extract(concord$pre, "^\\w+\\b"),
  L2 = str_extract(concord$pre, "(?<=\\s)\\w+\\b(?=\\s)"),
  L1 = str_extract(concord$pre, "\\w+\\b$"),
  Node = concord$keyword,
  R1 = str_extract(concord$post, "^\\w+\\b"),
  R2 = str_extract(concord$post, "(?<=\\s)\\w+\\b(?=\\s)"),
  R3 = str_extract(concord$post, "\\w+\\b$")
)
Result:
L3toR3
L3 L2 L1 Node R1 R2 R3
1 this little sentence went on and went
2 went on and went on it was
3 on it was going on for quite
4 quite a while going on for ages
5 it s still going on and will
6 on and will go on and on
7 and on and go on forever NA

Check which rows have each word in a string lowercase and space separated

I have a column with string values as shown below:
a <- c("iam best in the world", "you are awesome", "Iam Good")
and I need to check which rows have every word in the string lowercase and separated by spaces.
I know how to convert those to uppercase and space separated, but I need to find which rows are lowercase and space separated.
I have tried using
grepl("\\b([a-z])\\s([a-z])\\b",aa, perl = TRUE)
We can try using grepl with the pattern \b[a-z]+(?:\s+[a-z]+)*\b:
matches = a[grepl("\\b[a-z]+(?:\\s+[a-z]+)*\\b", a$some_col), ]
matches
v1 some_col
1 1 iam best in the world
2 2 you are awesome
Data:
a <- data.frame(v1 = c(1:3),
                some_col = c("iam best in the world", "you are awesome", "Iam Good"))
The regex pattern used matches an all-lowercase word, followed by a space and another all-lowercase word, the latter repeated zero or more times. Note that we place word boundaries around the pattern to ensure that we don't get false matches from words beginning with an uppercase letter.
x <- c("iam best in the world", "you are awesome", "Iam Good")
Here I did something different: first I split by space, then I check whether each word starts with a lowercase letter. The output is a list for each phrase containing only the lowercase words.
sapply(strsplit(x, " "), function(words) {
  words[grepl("^[a-z]", words)]
})
Another idea is to use stri_trans_totitle from the stringi package:
a[!stringi::stri_trans_totitle(as.character(a$some_col)) == a$some_col, ]
# v1 some_col
#1 1 iam best in the world
#2 2 you are awesome
We can convert the column to lowercase and compare with the actual value. Using @Tim's data:
a[tolower(a$some_col) == a$some_col, ]
# v1 some_col
#1 1 iam best in the world
#2 2 you are awesome
If we also need to check for space, we could add another condition with grepl
a[tolower(a$some_col) == a$some_col & grepl("\\s+", a$some_col), ]
We can use filter
library(dplyr)
a %>%
  filter(tolower(some_col) == some_col)
# v1 some_col
#1 1 iam best in the world
#2 2 you are awesome
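If the space check is also needed here, a minimal sketch combining it with the same filter (the grepl condition mirrors the base R version above):
library(dplyr)
a %>%
  filter(tolower(some_col) == some_col,
         grepl("\\s+", some_col))
# v1              some_col
#1  1 iam best in the world
#2  2       you are awesome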

R: Read in .csv file and convert into multiple column data frame

I am new to R and currently having plenty of trouble just reading a .csv file and converting it into a data.frame with 7 columns. Here is what I am doing:
gene_symbols_table <- as.data.frame(read.csv(file = "/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv", header = TRUE, sep = ","))
After that I am getting a data.frame with dim = 46761 x 1, but I need it to be 46761 x 7. I tried the following stackoverflow threads:
How can you read a CSV file in R with different number of columns
read.delim() - errors "more columns than column names" and "header and 'col.names' are of different lengths"
Split a column of a data frame to multiple columns
But somehow nothing is working in my case.
Here is how the table looks:
> head(gene_symbols_table, 3)
input.reason.matches.organism.name.primaryIdentifier.symbol.briefDescription.class.secondaryIdentifier
1 WBGene00008675 MATCH 1 Caenorhabditis elegans WBGene00008675 irld-26 Gene F11A5.7
2 WBGene00008676 MATCH 1 Caenorhabditis elegans WBGene00008676 oac-15 Gene F11A5.8
3 WBGene00008677 MATCH 1 Caenorhabditis elegans WBGene00008677 Gene F11A5.9
The .csv file in Excel looks like this:
input | reason | matches | organism.name | primaryIdentifier | symbol | briefDescription
WBGene00008675 | MATCH | 1 | Caenorhabditis elegans | WBGene00008675 | irld-26 | ...
...
The following code:
gene_symbols_table <- read.table(file = "/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv",
                                 header = FALSE, sep = ",",
                                 col.names = paste0("V", seq_len(7)), fill = TRUE)
Seems to be working. However, when I look at dim I can see right away that it is wrong: 20124 x 7. Then:
V1
1 input;reason;matches;organism.name;primaryIdentifier;symbol;briefDescription;class;secondaryIdentifier
2 WBGene00008675;MATCH;1;Caenorhabditis elegans;WBGene00008675;irld-26;;Gene;F11A5.7
3 WBGene00008676;MATCH;1;Caenorhabditis elegans;WBGene00008676;oac-15;;Gene;F11A5.8
(the remaining columns V2 through V5 are empty)
So, it is wrong
Other attempts at read.table are giving me the error specified in the second stackoverflow thread.
I have also tried splitting the data.frame with one column into 7, but so far no success.
The separator seems to be a semicolon rather than a comma, judging from what the output looks like. So either try specifying that, or you could try fread from the data.table package, which automatically detects the separator.
library(data.table)
gene_symbols_table <- as.data.frame(fread(file = "/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv", header = TRUE))
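For the first option, a minimal sketch of specifying the separator explicitly; read.csv2 uses ";" as its default separator (with read.table you would pass sep = ";" instead):
gene_symbols_table <- read.csv2(file = "/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv", header = TRUE)
dim(gene_symbols_table)  # the fields should now be split into separate columns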
