Subset dataframe with keywords

I have a dataframe consisting of Twitter data (ID number, follower_count, clean_text). I want to divide it into two subsets: one where keywords are present and one where they are not.
For example, I have the keywords stored as a value:
KeyWords <- c("abandon*", "abuse*", "agitat*" ,"attack*", "bad", "brutal*",
"care", "caring", "cheat*", "compassion*", "cruel*", "damag*",
"damn*", "destroy*", "devil*", "devot*", "disgust*", "envy*",
"evil*", "faith*","fault*", "fight*", "forbid*", "good", "goodness",
"greed*", "gross*", "hate", "heaven*", "hell", "hero*", "honest*",
"honor*", "hurt*","ideal*", "immoral*", "kill*", "liar*","loyal*",
"murder*", "offend*", "pain", "peace*","protest", "punish*","rebel*",
"respect", "revenge*", "ruin*", "safe*", "save", "secur*", "shame*",
"sin", "sinister", "sins", "slut*", "spite*", "steal*", "victim*",
"vile", "virtue*", "war", "warring", "wars", "whore*", "wicked*",
"wrong*", "benefit*", "harm*", "suffer*","value*") %>% paste0(collapse="|")
And I have made a subset (Data2) of my original dataframe (Data1) where Data2 consists of only the observations in Data1 where one or more of the keywords are present in the clean_text column. Like so:
Data2 <- Data1[with(Data1, grepl(paste0("\\b(?:",paste(KeyWords, collapse="|"),")\\b"), clean_text)),]
Now I want to make Data3, containing only the observations in Data1 where none of the keywords are present in the clean_text column. Is there a way to do the inverse of my keyword subsetting above? Or can I subtract Data2 from Data1 to get my new subset, Data3?

The "inverse" (logical NOT) operator in R is !, which flips TRUE to FALSE and vice versa. So, with your example, what you're looking for is:
Data3 <- Data1[!with(Data1, grepl(paste0("\\b(?:",paste(KeyWords, collapse="|"),")\\b"), clean_text)),]
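A minimal sketch of the same idea, computing the logical index once and reusing it for both subsets. The toy rows and the shortened keyword list are made up for illustration. One caveat worth checking: in a regular expression, "abandon*" means "abando" followed by zero or more "n", so with the trailing \\b it would not match "abandoned". The sketch assumes the trailing * was meant as "any further word characters" and rewrites it to \\w*.

```r
# Toy data standing in for Data1 (made-up rows, not the asker's data)
Data1 <- data.frame(
  ID = 1:4,
  clean_text = c("they abandoned the plan", "a good day",
                 "nothing here", "goodbye all"),
  stringsAsFactors = FALSE
)

# Shortened keyword list in the asker's wildcard style
KeyWords <- c("abandon*", "bad", "good")

# Assumption: a trailing "*" means "any further word characters"
regex_words <- sub("\\*$", "\\\\w*", KeyWords)
pattern <- paste0("\\b(?:", paste(regex_words, collapse = "|"), ")\\b")

# Compute the logical index once, then use it and its negation
has_keyword <- grepl(pattern, Data1$clean_text, perl = TRUE)
Data2 <- Data1[has_keyword, ]   # rows with at least one keyword
Data3 <- Data1[!has_keyword, ]  # rows with no keyword
```

Because Data2 and Data3 come from the same index, every row of Data1 lands in exactly one of them.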

Related

Why won't R recognize numbers 10 or above in my data frame?

I'm still fairly new to R, so maybe this is an obvious fix, but I'm having trouble getting R to recognize numbers above 9. As shown below, I'm trying to order the data frame userCommentDF and eventually remove any numComments entries of 10 or higher. When I order the data frame, everything is in order as it should be, except for entries of 10 or higher, which appear together, seemingly at random, in the middle of the dataset (between entries of 1 and 2). In addition, when I check using summary(), R tells me that there aren't any observations in that column above 9.
Here is the code I'm currently working on:
positions <- order(userCommentDF$numComments, decreasing = TRUE)
orderedUserComments <- userCommentDF[positions, ]
summary(userCommentDF$numComments > 9)
Here is the code I used to create the data frame:
userCommentDF <- data.frame("test", "data", "age", "gender", "location", stringsAsFactors = FALSE)
names(userCommentDF) <- c("user", "numComments", "age", "gender", "location")
userNumber <- 1
for (user in commenters) {
  userCount <- (count(comments, comAuthor == user)[2, 2])
  aDF <- data.frame(user, userCount, comAge[userNumber], comGender[userNumber], comAuthorLocation[userNumber])
  names(aDF) <- c("user", "numComments", "age", "gender", "location")
  userCommentDF <- rbind(userCommentDF, aDF)
  userNumber <- userNumber + 1
}
userCommentDF <- userCommentDF[-c(1), ]
Does anyone know why this might be happening or how I might fix it?
Thank you!
EDIT: I used as.numeric() which worked.
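A likely explanation for why as.numeric() helped: the seed row created with data.frame("test", "data", ...) makes every column character, and rbind() then coerces the counts to strings. Strings sort character by character, so "10" lands between "1" and "2", and a comparison with 9 is done as text. The sketch below reproduces this with made-up counts; the two-column layout is a simplification of the question's data frame.

```r
# Seed row of strings, mirroring how userCommentDF is initialised in the question
seed <- data.frame(user = "test", numComments = "data", stringsAsFactors = FALSE)

# Made-up counts for illustration
rows <- data.frame(user = c("a", "b", "c", "d"),
                   numComments = c(2, 10, 1, 12),
                   stringsAsFactors = FALSE)

userCommentDF <- rbind(seed, rows)[-1, ]

# The seed row fixed the column type, so the counts are now strings
type_before <- class(userCommentDF$numComments)
# 9 is coerced to "9", and "10" and "12" compare as smaller than "9"
any_above_9_before <- any(userCommentDF$numComments > 9)

# Converting restores numeric ordering and comparison
userCommentDF$numComments <- as.numeric(userCommentDF$numComments)
ordered <- userCommentDF[order(userCommentDF$numComments, decreasing = TRUE), ]
kept <- ordered[ordered$numComments < 10, ]
```

Building the data frame without a placeholder row of strings would avoid the conversion step altogether.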

Change column values depending on other column in R

I have a problem with my data frame.
I have a dataframe with 2 columns, 'word' and 'word_categories'. I created different variables which include the different words, e.g. 'noun' which includes all the nouns of the word column. I now want to change the labels in the word_categories column to the corresponding variable. So if the word in the word column is included in the object 'noun', I want the word_categories column to display 'noun'.
df <- read.csv("palm.csv")
noun <- c("house", ...)
adj <- c("hard", ...)
...
The data frame looks like the following. It includes other columns but they are fine.
word word_categories
house
car
hard
...
I now want to check whether each word is in any of the created objects and, if so, put the corresponding label in the word_categories column. So for 'house' the column should show noun, and for 'hard' it should show adjective. If the word is in none of the objects, it should show nothing or 'NA'.
I tried it with the following:
palm$word_categories <- ifelse(palm$word == noun, "noun",
                        ifelse(palm$word == adj, "adjective", ""))
This, however, doesn't work at all, and I have seven objects in total, so the statement becomes ridiculously long. How do I do it properly?

If the dataframe is called palm (you first call it df but later you use palm) and noun and adj are vectors as you define above, I would do:
library(dplyr)
palm <- palm %>%
  mutate(word_categories = case_when(word %in% noun ~ "noun",
                                     word %in% adj ~ "adjective",
                                     TRUE ~ NA_character_))

One way would be to create a named vector of your noun/adjective dictionaries to select each element. The name would be the word and the corresponding data would be noun, adjective etc. You didn't really supply any data so I made some up.
df <- data.frame(
  stringsAsFactors = FALSE,
  word = c("dog", "short", "bird", "cat", "short", "man")
)
nounName <- c('dog', 'cat', 'bird')
adjName <- c('quick', 'brown', 'short')
noun <- rep('noun', length(nounName))
adj <- rep('adjective', length(adjName))
names(noun) <- nounName
names(adj) <- adjName
partsofspeech <- c(noun, adj)
df$word_categories <- partsofspeech[df$word]
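With seven dictionaries, the same lookup idea can be written without naming each vector by hand. The following is a rough sketch in base R: it keeps the dictionaries in one named list (the words and the verb category here are invented, since the real vectors are not shown), flattens it with stack(), and uses match() so that unknown words become NA.

```r
# Hypothetical dictionaries; the real ones from the question are not shown
categories <- list(
  noun      = c("house", "car"),
  adjective = c("hard", "quick"),
  verb      = c("run")
)

palm <- data.frame(word = c("house", "car", "hard", "xyz"),
                   stringsAsFactors = FALSE)

# Long-format lookup: one row per word (values) with its category (ind)
lookup <- stack(categories)

# match() returns NA for words found in no dictionary, so those rows become NA
palm$word_categories <- as.character(lookup$ind)[match(palm$word, lookup$values)]
```

Adding another category then only means adding one more element to the list.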

Vectorized use of the substring function for a row selection of a dataframe with different length

My dataframe has a column named Code of type character, with values like b, b1, b110-b139, b110, b1100, b1101, ... (1602 entries).
I am trying to select all the entries that match the strings in a vector and all the ones that start with the same string.
So let's say I have the vector
Selection=c("b114","d2")
then I want all codes like b114, b1140, b1141, b1142, ... as well as d2, d200, d2000, d2001, d2002, d2003, etc.
What does work in principle is to create a new dataframe like this:
bTable <- TreeMapTable[substr(TreeMapTable$Code,1,4)=="b114"|substr(TreeMapTable$Code,1,2)=="d2",]
which gives me all the data I want, but requires me to manually type the condition for each entry, and I just want to give the script a vector with the strings.
I tried to do it like this:
SelectionL=nchar(Selection)
Beispieltable <- TreeMapTable[substr(TreeMapTable$Code,1,SelectionL)==Selection,]
but this somehow gives me only half of the required entries, and I confess I don't really know what it is doing. I know I could use a for loop, but from everything I have read so far, loops should be avoided and the problem should be solvable with vectors.

sample data
df <- data.frame( Code = c("b114", "b115", "b11456", "d2", "d12", "d200", "db114"),
stringsAsFactors = FALSE)
Selection=c("b114","d2")
answer
library( dplyr )
# create a regex pattern to filter on
pattern <- paste0( "^", Selection, collapse = "|" )
# keep only the rows where 'Code' starts with one of the entries from 'Selection'
df %>% filter( grepl( pattern, Code, perl = TRUE ) )
#     Code
# 1   b114
# 2 b11456
# 3     d2
# 4   d200
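As for why the attempt in the question returns only part of the rows: substr() and == both recycle their shorter argument along the rows, so row 1 is compared only with "b114", row 2 only with "d2", row 3 with "b114" again, and so on. A base R sketch that tests every row against every prefix, without building a regex, could look like this (it reuses the sample data above; startsWith() compares literally, so prefixes containing regex metacharacters need no escaping):

```r
df <- data.frame(Code = c("b114", "b115", "b11456", "d2", "d12", "d200", "db114"),
                 stringsAsFactors = FALSE)
Selection <- c("b114", "d2")

# One logical vector per prefix, each as long as the Code column
hits <- lapply(Selection, function(prefix) startsWith(df$Code, prefix))

# A row is kept if it starts with any of the prefixes
keep <- Reduce(`|`, hits)

result <- df[keep, , drop = FALSE]
```

The lapply() call still iterates over the prefixes, but each startsWith() call is vectorized over all rows.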

Transform list inside a column to data.table columns in r

I have a data.table object where each row in the column "status" holds a list. How can I create a new data.table that turns those lists into columns?
Columns "as is":
status, done_ratio, created_on, updated_on, closed_on
Columns "to be":
status_id, status_name, done_ratio, created_on, updated_on, closed_on
Code used:
basic_data <- tickets[, .(status,done_ratio,created_on,updated_on,closed_on)]
Result of view(basic_data): http://diegoquirino.pro.br/printdetela/print_duvida_datatablelistcolumn.png
I'd like to have ordinary columns (status_id, status_name) instead of a list.
Thanks in advance for any help.

The following code attempts to do this (assuming "tickets" is the object from the picture):
# Collecting basic data
basic_data <- tickets[, .(done_ratio,created_on,updated_on,closed_on)]
status_data <- data.frame("status_id" = integer(),"status_name" = character())
# Fetching Status (each list)
for (status in tickets[, status]) {
  status <- as.data.frame(rbind(status))
  names(status) <- c("status_id", "status_name")
  status_data <- rbind(status_data, status)
}
status_data <- cbind(status_data,basic_data)
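A loop-free variant of the same idea, as a sketch only: data.table's rbindlist() can stack a list of lists into a table in one call. It assumes each element of the status column is a list with two fields, an id followed by a name. The field names and the toy tickets table below are invented, since the real structure is only visible in the screenshot.

```r
library(data.table)

# Toy stand-in for the real tickets table (made-up values)
tickets <- data.table(
  status     = list(list(id = 1L, name = "New"),
                    list(id = 5L, name = "Closed")),
  done_ratio = c(0L, 100L)
)

# Stack the per-row lists into a two-column table, then rename
status_data <- rbindlist(tickets$status)
setnames(status_data, c("status_id", "status_name"))

# Put the new columns next to the remaining ordinary columns
basic_data <- cbind(status_data, tickets[, .(done_ratio)])
```

This avoids growing a data frame inside a loop, which gets slow for large tables.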

How to find unique name/character from a list of names in R

I have a huge list of company names. As illustrated below, if a column name is ABBEYCREST.DEAD...10.10.14...ASK.PRICE, then ABBEYCREST.DEAD...10.10.14... is the name of the company and ASK.PRICE marks the ask price data; when it ends with BID.PRICE, it is the bid price data. I want to identify the company for which only one column is available in the dataframe. My dataframe has column headers as illustrated below, so each company should have 2 columns: with 4000 companies there should be 8000 columns, but I have 7999 (my dataframe also has a date column, but I exclude it when I count columns).
df<-AskBid
ABBEYCREST.DEAD...10.10.14...ASK.PRICE
ABBEYCREST.DEAD...10.10.14...BID.PRICE
ABBOT.GROUP.DEAD...07.03.08...ASK.PRICE
ABBOT.GROUP.DEAD...07.03.08...BID.PRICE
ABERDEEN.ASSET.MAN..FULLY.PAID.23.09.05...ASK.PRICE
ABERDEEN.ASSET.MAN..FULLY.PAID.23.09.05...BID.PRICE
ABERDEEN.ASSET.MAN..NIL.PAID.23.09.05...ASK.PRICE
ABERDEEN.ASSET.MAN..NIL.PAID.23.09.05...BID.PRICE
ABERDEEN.FTBL.CLUB.DEAD...DEAD.04.08.03...ASK.PRICE
ABERDEEN.FTBL.CLUB.DEAD...DEAD.04.08.03...BID.PRICE
ABERTIS..IRS....BID.PRICE
ABGENIX..IRS..DEAD...12.11.07...ASK.PRICE
ABGENIX..IRS..DEAD...12.11.07...BID.PRICE
ABLON.GROUP.DEAD...31.05.13...ASK.PRICE
ABLON.GROUP.DEAD...31.05.13...BID.PRICE
ACAMBIS.DEAD...25.09.08...ASK.PRICE
ACAMBIS.DEAD...25.09.08...BID.PRICE
What I want to find is
missing <- df
ABERTIS..IRS....BID.PRICE
I would really appreciate your help. This is causing problems in my estimations.

You can remove the ASK.PRICE and BID.PRICE part and call duplicated twice (the second time on the reversed order):
cn <- readLines(textConnection(
"ABBEYCREST.DEAD...10.10.14...ASK.PRICE
ABBEYCREST.DEAD...10.10.14...BID.PRICE
ABBOT.GROUP.DEAD...07.03.08...ASK.PRICE
ABBOT.GROUP.DEAD...07.03.08...BID.PRICE
ABERDEEN.ASSET.MAN..FULLY.PAID.23.09.05...ASK.PRICE
ABERDEEN.ASSET.MAN..FULLY.PAID.23.09.05...BID.PRICE
ABERDEEN.ASSET.MAN..NIL.PAID.23.09.05...ASK.PRICE
ABERDEEN.ASSET.MAN..NIL.PAID.23.09.05...BID.PRICE
ABERDEEN.FTBL.CLUB.DEAD...DEAD.04.08.03...ASK.PRICE
ABERDEEN.FTBL.CLUB.DEAD...DEAD.04.08.03...BID.PRICE
ABERTIS..IRS....BID.PRICE
ABGENIX..IRS..DEAD...12.11.07...ASK.PRICE
ABGENIX..IRS..DEAD...12.11.07...BID.PRICE
ABLON.GROUP.DEAD...31.05.13...ASK.PRICE
ABLON.GROUP.DEAD...31.05.13...BID.PRICE
ACAMBIS.DEAD...25.09.08...ASK.PRICE
ACAMBIS.DEAD...25.09.08...BID.PRICE"))
## remove (ASK|BID).PRICE
cn.sub <- gsub("(ASK|BID)\\.PRICE$", "", cn)
cn[!(duplicated(cn.sub) | rev(duplicated(rev(cn.sub))))]
# [1] "ABERTIS..IRS....BID.PRICE"
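The double rev() above can also be expressed with the fromLast argument of duplicated(), and table() gives a quick cross-check by counting columns per company. A short sketch with made-up column names:

```r
# Made-up column names: company B has only a BID column
cn <- c("A.CO...ASK.PRICE", "A.CO...BID.PRICE",
        "B.CO...BID.PRICE",
        "C.CO...ASK.PRICE", "C.CO...BID.PRICE")

# Strip the suffix so both columns of a company share one key
base <- sub("(ASK|BID)\\.PRICE$", "", cn)

# A name is unpaired if its key is duplicated in neither direction
unpaired <- cn[!(duplicated(base) | duplicated(base, fromLast = TRUE))]

# Cross-check: companies whose key does not occur exactly twice
counts <- table(base)
odd_companies <- names(counts)[counts != 2]
```

The table() check would also flag a company that appears three or more times, which the duplicated() approach would not report.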

Here is another solution assuming text is the column name in the data frame read in:
library(dplyr)
df$text <- gsub("(ASK|BID)", "", df$text)
df %>% group_by(text) %>% filter(n() != 2)