After this expression
good.rows<-ifelse(nchar(ufo$DateOccurred)!=10 | nchar(ufo$DateReported)!=10,
FALSE, TRUE)
I expected to get a vector of Booleans, but I got
length(good.rows)
[1] 0
This is logical(empty), as I can see in RStudio. What can I do to solve this?
dput(head(ufo))
structure(list(DateOccured = structure(c(9412, 9413, 9131, 9260,
9292, 9428), class = "Date"), DateReported = structure(c(9412,
9414, 9133, 9260, 9295, 9427), class = "Date"), Location = c(" Iowa City, IA",
" Milwaukee, WI", " Shelton, WA", " Columbia, MO", " Seattle, WA",
" Brunswick County, ND"), ShortDescription = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), Duration = c(NA, "2 min.", NA, "2 min.", NA, "30 min."), LongDescription = c("Man repts. witnessing \"flash, followed by a classic UFO, w/ a tailfin at back.\" Red color on top half of tailfin. Became triangular.",
"Man on Hwy 43 SW of Milwaukee sees large, bright blue light streak by his car, descend, turn, cross road ahead, strobe. Bizarre!",
"Telephoned Report:CA woman visiting daughter witness discs and triangular ships over Squaxin Island in Puget Sound. Dramatic. Written report, with illustrations, submitted to NUFORC.",
"Man repts. son's bizarre sighting of small humanoid creature in back yard. Reptd. in Acteon Journal, St. Louis UFO newsletter.",
"Anonymous caller repts. sighting 4 ufo's in NNE sky, 45 deg. above horizon. (No other facts reptd. No return tel. #.)",
"Sheriff's office calls to rept. that deputy, 20 mi. SSE of Wilmington, is looking at peculiar, bright white, strobing light."
)), row.names = c(NA, 6L), class = "data.frame")
There are a couple of reasons why this could be happening:
Your dataset is empty; check this using the dim() function.
The columns are not of type character; check this using the class()
function.
If both of these are fine, try running the nchar(...) statements
separately.
Below I've created an example that works correctly, where I've gone through the steps mentioned above. In future, please provide a reproducible example as part of your question.
# Create sample data
ufo <- data.frame(DateOccurred=c("a","bb","ccc"),
DateReported=c("a","bb","ccc"),
stringsAsFactors = FALSE)
print(ufo)
# Check size of data (make sure data has rows and columns are of type Character)
dim(ufo)
class(ufo$DateOccurred)
class(ufo$DateReported)
# Check nchar statements (Should run without error/warnings)
nchar(ufo$DateOccurred)
nchar(ufo$DateReported)
# Actual
good.rows <- ifelse(nchar(ufo$DateOccurred)!=3 | nchar(ufo$DateReported)!=3,
FALSE, TRUE)
print(good.rows)
length(good.rows)
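One more thing that may be worth checking, going by the dput output in the question: the column there is spelled DateOccured, while the expression refers to ufo$DateOccurred. For a plain data.frame, $ with a name that does not exist returns NULL, and nchar(NULL) has length zero, which would explain an empty result. The columns are also of class Date, so the sketch below converts them with format() before counting characters. It rebuilds only two rows from the dput output; the use of format() is my assumption about what the length check is meant to test.

```r
# Two rows rebuilt from the dput output in the question (note the spelling "DateOccured")
ufo <- data.frame(DateOccured  = structure(c(9412, 9413), class = "Date"),
                  DateReported = structure(c(9412, 9414), class = "Date"))

names(ufo)                        # shows the real column names
length(nchar(ufo$DateOccurred))   # misspelled name gives NULL, so the length is 0

# Convert the Date columns to text first, then test for 10 characters (YYYY-MM-DD)
good.rows <- nchar(format(ufo$DateOccured)) == 10 &
             nchar(format(ufo$DateReported)) == 10
print(good.rows)
```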
I have tried to resolve this problem all day but without any improvement.
I am trying to replace the following abbreviations with the desired words in my dataset:
- Abbreviations: USA, H2O, Type 3, T3, bp
- Desired words: United States of America, Water, Type 3 Disease, Type 3 Disease, blood pressure
The input data is for example
[1] I have type 3, its considered the highest severe stage of the disease.
[2] Drinking more H2O will make your skin glow.
[3] Do I have T2 or T3? Please someone help.
[4] We don't have this on the USA but I've heard that will be available in the next 3 years.
[5] Having a high bp means that I will have to look after my diet?
The desired output is
[1] i have type 3 disease, its considered the highest severe stage of the disease.
[2] drinking more water will make your skin glow.
[3] do I have type 3 disease? please someone help.
[4] we don't have this in the united states of america but i've heard that will be available in the next 3 years.
[5] having a high blood pressure means that I will have to look after my diet?
I have tried the following code but without success:
data= read.csv(C:"xxxxxxx, header= TRUE")
lowercase= tolower(data$MESSAGE)
dict=list("\\busa\\b"= "united states of america", "\\bh2o\\b"=
"water", "\\btype 3\\b|\\bt3\\"= "type 3 disease", "\\bbp\\b"=
"blood pressure")
for(i in 1:length(dict1)){
lowercasea= gsub(paste0("\\b", names(dict)[i], "\\b"),
dict[[i]], lowercase)}
I know that I am definitely doing something wrong. Could anyone guide me on this? Thank you in advance.
If you need to replace only whole words (e.g. bp in Some bp. and not in bpcatalogue), you will have to build a regular expression out of the abbreviations using word boundaries and, since you have multiword abbreviations, also sort them by length in descending order (otherwise, e.g., type may trigger a replacement before type three).
Example code:
abbreviations <- c("USA", "H2O", "Type 3", "T3", "bp")
desired_words <- c("United States of America", "Water", "Type 3 Disease", "Type 3 Disease", "blood pressure")
df <- data.frame(abbreviations, desired_words, stringsAsFactors = FALSE)
x <- 'Abbreviations: USA, H2O, Type 3, T3, bp'
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
library(stringr)
str_replace_all(x,
paste0("\\b(",paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b"),
function(z) df$desired_words[df$abbreviations==z][[1]][1]
)
The paste0("\\b(",paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b") code creates a regex like \b(Type 3|USA|H2O|T3|bp)\b, which matches Type 3, USA, etc. as whole words only, since \b is a word boundary. If a match is found, stringr::str_replace_all replaces it with the corresponding desired_word.
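For the lower-cased messages from the question, the same idea can be sketched in base R with a loop. This is only an illustration under a few assumptions: the messages sit in a character vector (called messages here, a name I made up), the dictionary keys are plain lower-case strings without their own \\b markers, and the result is assigned back to the same variable on every pass so that earlier replacements are kept.

```r
# Sample messages copied from the question
messages <- c(
  "I have type 3, its considered the highest severe stage of the disease.",
  "Drinking more H2O will make your skin glow.",
  "Do I have T2 or T3? Please someone help.",
  "We don't have this on the USA but I've heard that will be available in the next 3 years.",
  "Having a high bp means that I will have to look after my diet?"
)

# Lower-case lookup: names are abbreviations, values are the replacements
dict <- c("usa"    = "united states of america",
          "h2o"    = "water",
          "type 3" = "type 3 disease",
          "t3"     = "type 3 disease",
          "bp"     = "blood pressure")

# Longest abbreviations first, so multiword keys are handled before short ones
keys <- names(dict)[order(-nchar(names(dict)))]

cleaned <- tolower(messages)
for (key in keys) {
  # Wrap each key in word boundaries and keep updating the same vector
  cleaned <- gsub(paste0("\\b", key, "\\b"), dict[[key]], cleaned, perl = TRUE)
}
print(cleaned)
```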
I have a dataset where sometimes the unit of measure is not separated from the number by a space, and I would like to add it in. I have a list of units of measure that may be used in the dataset and I want to make sure that every time they appear there is a space.
My data is something like:
mydata <- c("black box 125CM", "10KG white chair", "bottle of water 1000ML")
And I would like:
result <- c("black box 125 CM", "10 KG white chair", "bottle of water 1000 ML")
The units of measure that might appear:
measure <- c("ML", "MG", "F", "CM", "CPR", "FL", "CPS", "KG")
So far I have tried (but it is not working):
for (i in 1:NROW(measure)) {
replacement <- paste0("\\s", measure[i])
result <- gsub("(?<=[[:digit:]])"measure[i], replacement, mydata, perl = TRUE)
}
If it were for one substitution I would be able to do it with:
result <- gsub("(?<=[[:digit:]])MG", " MG", mydata, perl = TRUE)
I just do not know how I am supposed to write measure[i] in the gsub function, I cannot find the right syntax.
Any suggestions? Thank you very much in advance.
Regex lookbehind can do this.
gsub(paste0("(?<=[0-9])(", paste(measure, collapse = "|"), ")"), " \\1",
mydata, perl = TRUE)
# [1] "black box 125 CM" "10 KG white chair" "bottle of water 1000 ML"
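If you would rather keep the loop from the question, the pattern for each unit can be built with paste0(). A minimal sketch, assuming the same mydata and measure vectors as in the question; note that it starts from a copy of mydata and keeps updating that copy, and that the replacement is a literal space followed by the unit.

```r
mydata  <- c("black box 125CM", "10KG white chair", "bottle of water 1000ML")
measure <- c("ML", "MG", "F", "CM", "CPR", "FL", "CPS", "KG")

result <- mydata
for (unit in measure) {
  # Lookbehind for a digit, then the unit itself
  pattern <- paste0("(?<=[[:digit:]])", unit)
  # Update result (not mydata) so earlier substitutions are not lost
  result <- gsub(pattern, paste0(" ", unit), result, perl = TRUE)
}
print(result)
```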
mydata <- c("black box 125CM", "10KG white chair", "bottle of water 1000ML")
stringr::str_replace_all(mydata, "(?<=[0-9])(ML|MG|F|C(?:M|PR|PS)|FL|KG)", " \\1")
Gives
[1] "black box 125 CM" "10 KG white chair" "bottle of water 1000 ML"
Note the special handling of the three cases which all begin with C.
As an aside, if I were having to be this fussy about spaces, I'd also be minded to be fussy about getting the case of the SI units correct: "KG" is not kilogramme but Kelvin ⋅ 6.674×10^−11 m^3⋅kg^−1⋅s^−2, as near as I can figure it!
This is what I came up with and works for me.
mydata <- c("black box 125CM", "10KG white chair", "bottle of water 1000ML")
measure <- c("ML", "MG", "F", "CM", "CPR", "FL", "CPS", "KG")
measure <- paste(measure, collapse = "|")
result <- sub(paste0("(", measure, ")"), " \\1", mydata)
edit: This will also add a space if there is already one, so r2evans's solution is preferable.
If, as in the example, the measure always appears after the numbers, then this works:
sub("(\\d+)", "\\1 ", mydata)
[1] "black box 125 CM" "10 KG white chair" "bottle of water 1000 ML"
I'm trying to create a key-value store with the key being entities and the value being the average sentiment score of the entity in news articles.
I have a dataframe containing news articles and a list of entities called organization1 identified in those news articles by a classifier. The first row of the organization1 list contains the entities identified in the article on the first row of the news_us dataframe. I'm trying to iterate through the organization1 list and create a key-value store with the key being the entity name in the organization1 list and the value being the sentiment score of the news description in which the entity was mentioned.
I can get the sentiment scores for the entity from an article but I wanted to add them together and average the sentiment score.
library(syuzhet)
sentiment <- list()
organization1 <- list(NULL, "US", "Bath", "Animal Crossing", "World Health Organization",
NULL, c("Microsoft", "Facebook"))
news_us <- structure(list(title = c("Stocks making the biggest moves after hours: Bed Bath & Beyond, JC Penney, United Airlines and more - CNBC",
"Los Angeles mayor says 'very difficult to see' large gatherings like concerts and sporting events until 2021 - CNN",
"Bed Bath & Beyond shares rise as earnings top estimates, retailer plans to maintain some key investments - CNBC",
"6 weeks with Animal Crossing: New Horizons reveals many frustrations - VentureBeat",
"Timeline: How Trump And WHO Reacted At Key Moments During The Coronavirus Crisis : Goats and Soda - NPR",
"Michigan protesters turn out against Whitmer’s strict stay-at-home order - POLITICO"
), description = c("Check out the companies making headlines after the bell.",
"Los Angeles Mayor Eric Garcetti said Wednesday large gatherings like sporting events or concerts may not resume in the city before 2021 as the US grapples with mitigating the novel coronavirus pandemic.",
"Bed Bath & Beyond said that its results in 2020 \"will be unfavorably impacted\" by the crisis, and so it will not be offering a first-quarter nor full-year outlook.",
"Six weeks with Animal Crossing: New Horizons has helped to illuminate some of the game's shortcomings that weren't obvious in our first review.",
"How did the president respond to key moments during the pandemic? And how did representatives of the World Health Organization respond during the same period?",
"Many demonstrators, some waving Trump campaign flags, ignored organizers‘ pleas to stay in their cars and flooded the streets of Lansing, the state capital."
), name = c("CNBC", "CNN", "CNBC", "Venturebeat.com", "Npr.org",
"Politico")), na.action = structure(c(`35` = 35L, `95` = 95L,
`137` = 137L, `154` = 154L, `213` = 213L, `214` = 214L, `232` = 232L,
`276` = 276L, `321` = 321L), class = "omit"), row.names = c(NA,
6L), class = "data.frame")
setNames(lapply(news_us$description, get_sentiment), unlist(organization1))
#$US
#[1] 0
#$Bath
#[1] -0.4
#$`Animal Crossing`
#[1] -0.1
#$`World Health Organization`
#[1] 1.1
#$Microsoft
#[1] -0.6
#$Facebook
#[1] -1.9
tapply(sapply(news_us$description, get_sentiment), unlist(organization1), mean) #this line throws the error
Your problem seems to arise from the use of 'unlist'. Avoid this, as it drops the NULL values and concatenates list entries with multiple values.
Your organization1 list has 7 entries (two of which are NULL and one is length = 2). You should have 6 entries if this is to match the news_us data.frame - so something is out of sync there.
Let's assume the first 6 entries in organization1 are correct; I would bind them to your data.frame to avoid further 'sync errors':
news_us$organization1 = organization1[1:6]
Then you need to do the sentiment analysis on each row of the data.frame and bind the results to the organization1 value/s. The code below might not be the most elegant way to achieve this, but I think it does what you are looking for:
results = do.call("rbind", apply(news_us, 1, function(item){
if(!is.null(item$organization1[[1]])) cbind(item$organization1, get_sentiment(item$description))
}))
This code drops any rows where there were no detected organization1 values. It should also duplicate sentiment scores in the case of more than one organization1 being detected. The results will look like this (which I believe is your goal):
[,1] [,2]
[1,] "US" "-0.4"
[2,] "Bath" "-0.1"
[3,] "Animal Crossing" "1.1"
[4,] "World Health Organization" "-0.6"
The mean scores for each organization can then be collapsed using by, aggregate or similar.
[Edit: Examples of by and aggregate]
by(as.numeric(results[, 2]), results[, 1], mean)
aggregate(as.numeric(results[, 2]), list(results[, 1]), mean)
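The pair-then-average idea can also be sketched without apply(). The example below is only an illustration: the entities list and the scores vector are made-up stand-ins (the scores play the role of get_sentiment() applied to each description), chosen so that one entity appears in two articles.

```r
# Hypothetical entities per article; NULL means nothing was detected
entities <- list(NULL, "US", "Bath", c("US", "Bath"), NULL)
# Hypothetical per-article scores, standing in for get_sentiment(description)
scores <- c(0.5, 1.0, -0.4, 0.2, 0.3)

# Repeat each article's score once per detected entity; NULL entries drop out
pairs <- data.frame(entity = unlist(entities),
                    score  = rep(scores, times = lengths(entities)),
                    stringsAsFactors = FALSE)

# Average score per entity
avg_sentiment <- tapply(pairs$score, pairs$entity, mean)
print(avg_sentiment)
```

Using lengths() keeps scores and entities aligned, which is the step that unlist() on its own loses.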
I'm trying to create a key value store with the key being entities and the value being the average sentiment score of the entity in news articles.
I have a dataframe containing news articles and a list of entities called organization1 identified in those news articles by a classifier. The first row of the organization1 list contains the entities identified in the article on the first row of the news_us dataframe. I'm trying to iterate through the organization1 list and create a key-value store with the key being the entity in the organization1 list and the value being the sentiment score of the news description in which the entity was mentioned. The code I have doesn't change the scores in the sentiment list and I don't know why. My first guess was that I would have to use the $ operator on the sentiment list to add the value, but that didn't change anything either. Here is the code I have so far:
library(syuzhet)
sentiment <- list()
organization1 <- list(NULL, "US", "Bath", "Animal Crossing", "World Health Organization",
NULL, c("Microsoft", "Facebook"))
news_us <- structure(list(title = c("Stocks making the biggest moves after hours: Bed Bath & Beyond, JC Penney, United Airlines and more - CNBC",
"Los Angeles mayor says 'very difficult to see' large gatherings like concerts and sporting events until 2021 - CNN",
"Bed Bath & Beyond shares rise as earnings top estimates, retailer plans to maintain some key investments - CNBC",
"6 weeks with Animal Crossing: New Horizons reveals many frustrations - VentureBeat",
"Timeline: How Trump And WHO Reacted At Key Moments During The Coronavirus Crisis : Goats and Soda - NPR",
"Michigan protesters turn out against Whitmer’s strict stay-at-home order - POLITICO"
), description = c("Check out the companies making headlines after the bell.",
"Los Angeles Mayor Eric Garcetti said Wednesday large gatherings like sporting events or concerts may not resume in the city before 2021 as the US grapples with mitigating the novel coronavirus pandemic.",
"Bed Bath & Beyond said that its results in 2020 \"will be unfavorably impacted\" by the crisis, and so it will not be offering a first-quarter nor full-year outlook.",
"Six weeks with Animal Crossing: New Horizons has helped to illuminate some of the game's shortcomings that weren't obvious in our first review.",
"How did the president respond to key moments during the pandemic? And how did representatives of the World Health Organization respond during the same period?",
"Many demonstrators, some waving Trump campaign flags, ignored organizers‘ pleas to stay in their cars and flooded the streets of Lansing, the state capital."
), name = c("CNBC", "CNN", "CNBC", "Venturebeat.com", "Npr.org",
"Politico")), na.action = structure(c(`35` = 35L, `95` = 95L,
`137` = 137L, `154` = 154L, `213` = 213L, `214` = 214L, `232` = 232L,
`276` = 276L, `321` = 321L), class = "omit"), row.names = c(NA,
6L), class = "data.frame")
i = as.integer(0)
for(index in organizations1){
i <- i+1
if(is.character(index)) { #if entity is not null/NA
val <- get_sentiment(news_us$description[i], method = "afinn")
#print(val)
print(sentiment[[index[1]]])
sentiment[[index[1]]] <- sentiment[[index[1]]]+val
}
}
Here is the sentiment list after running the above code chunk:
$US
integer(0)
$Bath
integer(0)
$`Animal Crossing`
integer(0)
$`World Health Organization`
integer(0)
$`Apple TV`
integer(0)
$`Pittsburgh Steelers`
integer(0)
Whereas I would like it to look something like:
$US
1.3
$Bath
0.3
$`Animal Crossing`
2.4
$`World Health Organization`
1.2
$`Apple TV`
-0.7
$`Pittsburgh Steelers`
0.3
The value column can have multiple values for multiple entities identified in the article.
I am not sure how organization1 and news_us$description are related, but perhaps you meant to use them something like this?
library(syuzhet)
setNames(lapply(news_us$description, get_sentiment), unlist(organization1))
#$US
#[1] 0
#$Bath
#[1] -0.4
#$`Animal Crossing`
#[1] -0.1
#$`World Health Organization`
#[1] 1.1
#$Microsoft
#[1] -0.6
#$Facebook
#[1] -1.9
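A likely reason the loop in the question leaves every entry empty: sentiment starts as an empty list, so sentiment[[name]] is NULL the first time a name is seen, and NULL plus a number is a zero-length vector rather than the number. Below is a rough sketch of an accumulating loop that starts each key at zero and also counts occurrences, so that an average can be taken at the end. The entities and scores here are made-up stand-ins for organization1 and the get_sentiment() results.

```r
# Hypothetical entities per article; NULL means nothing was detected
entities <- list(NULL, "US", "Bath", c("US", "Bath"), NULL)
# Hypothetical per-article scores, standing in for get_sentiment(description)
scores <- c(0.5, 1.0, -0.4, 0.2, 0.3)

totals <- list()
counts <- list()
for (i in seq_along(entities)) {
  for (entity in entities[[i]]) {        # a NULL entry gives zero iterations
    if (is.null(totals[[entity]])) {     # first time this key is seen
      totals[[entity]] <- 0
      counts[[entity]] <- 0
    }
    totals[[entity]] <- totals[[entity]] + scores[i]
    counts[[entity]] <- counts[[entity]] + 1
  }
}

# Average score per entity, returned as a named list
sentiment <- Map(function(total, n) total / n, totals, counts)
print(sentiment)
```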
I have a dataset that I have tried to give a sample of using the dput command below. The problem I'm running into is trying to separate out the data by delimiter.
> dput(head(team_data))
structure(list(X1 = 2:6,
names2 = c("Andre Callender Seton Hall Preparatory School (West Orange, NJ)", "Gosder Cherilus Somerville (Somerville, MA)", "Justin Bell Mount Vernon (Alexandria, VA)", "Tom Anevski Elder (Cincinnati, OH)", "Brad Mueller Mars Area (Mars, PA)"),
pos2 = c("RB 5-10 185", "OT 6-7 270", "TE 6-3 250", "OT 6-5 265", "CB 6-0 170"), rating2 = c("0.8667 194 18 8", "0.8667 262 20 1", "0.8333 306 14 7", "0.8333 377 25 13", "0.8333 496 36 16"),
status2 = c("Enrolled 6/30/2003", "Enrolled 6/30/2003", "Enrolled 6/30/2003", "Enrolled 6/30/2003", "Enrolled 6/30/2003"), team = c("Boston-College", "Boston-College", "Boston-College", "Boston-College", "Boston-College"), year = c(2003L, 2003L, 2003L, 2003L, 2003L)),
.Names = c("X1", "names2", "pos2", "rating2", "status2", "team", "year"), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
The following is the code I am trying to execute on the above dataset. The following two functions work fine and as expected as far as I can tell.
library(rvest)
library(stringr)
library(tidyr)
library(readxl)
df2<-separate(data=team_data,col=pos2,into= c("Position","Height","Weight"),sep=" ")
df3<-separate(data=df2,col=rating2,into= c("Rating","National","Position","State Rank"),sep=" ")
But then I have significant trouble trying to further separate out the columns of the dataframe. I have tried various ways (examples below) but all of the pieces of code below produce the same error, "Error: Data source must be a dictionary".
df4<-separate(data=df3,col=names2,into= c("Name","Geo"),sep="(")
df4<-separate(data=df3,col=names2,into= c("Name","Geo"),sep='\\(|\\)')
df4<-separate(data=df3,col=status2,into= c("Date_Enrollment","Enroll_Status"),sep=" ")
df4<-separate(data=df3,col=status2,into= c("Date_Enrollment","Enroll_Status"),sep=" ")
The ultimate goal would be to separate out the "names2" column at the "(" and the "," and remove the ")" so that I would end up with 3 columns of data. For the other column ("status2") the goal would be to separate out the "Enrolled" from the date of enrollment.
From what I have read the error I'm getting indicates that I am duplicating column names, but I can't figure out where that is happening.
You are using Position twice, once in df2 and once in df3. This works for me:
team_data %>%
separate(col=pos2, into= c("Position","Height","Weight"), sep=" ") %>%
separate(col=rating2,into= c("Rating","National","Position2","State Rank"),sep=" ")%>%
separate(col=names2,into= c("Name","Geo"),sep="\\(") %>%
separate(col=status2,into= c("Enroll_Status","Date_Enrollment"),sep=" ")
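To get the three columns described in the question (split at the "(" and the ",", and drop the ")"), one option is a single regular expression with three capture groups. This is a base R sketch rather than a tidyr one; the column names Name, City and State are my own choice, and it assumes every value ends in "(City, XX)" with a two-letter state code, as in the sample.

```r
# Two values copied from the names2 column in the question
names2 <- c("Andre Callender Seton Hall Preparatory School (West Orange, NJ)",
            "Gosder Cherilus Somerville (Somerville, MA)")

# Group 1: text before " (", group 2: city, group 3: two-letter state
pattern <- "^(.*) \\((.*), ([A-Z]{2})\\)$"

parts <- data.frame(Name  = sub(pattern, "\\1", names2),
                    City  = sub(pattern, "\\2", names2),
                    State = sub(pattern, "\\3", names2),
                    stringsAsFactors = FALSE)
print(parts)
```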