Clustering with a constraint such that each cluster shares a common categorical value

I am trying to write a program that collects interests and availability from users and clusters them into groups based on similar interests and matching availability.
For example, four users could have the following availability and interests:
User 1:
Availability: Monday, Tuesday, Wednesday
Interest: Movies, Hiking
User 2:
Availability: Monday, Wednesday, Friday
Interest: Movies, Music
User 3:
Availability: Tuesday, Thursday, Saturday
Interest: Sports, Hiking
User 4:
Availability: Thursday, Friday
Interest: Hiking
Then the program could group the users into two clusters:
Cluster 1: User 1, User 2
Cluster 2: User 3, User 4
Since both User 1 and User 2 have the interest "Movies" and at least one matching day of availability (Monday), they are placed in the same cluster. Similarly, Users 3 and 4 (who share "Hiking" and Thursday) would be matched in a separate cluster.
I am looking for suggestions on how to implement such a clustering algorithm. How can I ensure that each cluster has at least one common day of availability? Ideally, each cluster would share a maximal number of interests and days of availability. Any other tips or suggestions would be greatly appreciated.
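For concreteness, here is a rough greedy sketch in R of one direction I have considered (the pairwise merging rule and the group size of two are my own assumptions, not a vetted algorithm): treat "at least one shared day" as a hard constraint and rank candidate pairs by the Jaccard similarity of their interests.
# Toy data: availability and interests per user, stored as character vectors
avail <- list(u1 = c("Mon", "Tue", "Wed"), u2 = c("Mon", "Wed", "Fri"),
              u3 = c("Tue", "Thu", "Sat"), u4 = c("Thu", "Fri"))
inter <- list(u1 = c("Movies", "Hiking"), u2 = c("Movies", "Music"),
              u3 = c("Sports", "Hiking"), u4 = c("Hiking"))
# Jaccard similarity of two sets: |intersection| / |union|
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
# Greedily merge the most interest-similar pair whose availability overlaps
users <- names(avail)
clusters <- list()
while (length(users) > 1) {
  pairs <- t(combn(users, 2))
  ok    <- apply(pairs, 1, function(p) length(intersect(avail[[p[1]]], avail[[p[2]]])) > 0)
  if (!any(ok)) break                      # no compatible pair left
  score <- apply(pairs, 1, function(p) jaccard(inter[[p[1]]], inter[[p[2]]]))
  score[!ok] <- -Inf                       # hard availability constraint
  best  <- pairs[which.max(score), ]
  clusters[[length(clusters) + 1]] <- best
  users <- setdiff(users, best)
}
clusters  # u3 + u4 merged first (share Thu and "Hiking"), then u1 + u2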

Related

Constrained K-means, R

I am currently using k-means to cluster my data; however, I want each cluster to appear only once in each given year. I have searched for answers for a whole night with no result. Does anyone have ideas on this problem using R, or is there a package I should look for? Thanks.
More background info:
I am trying to replicate clusters of relationships, using the reported gender, education level and birth year. I am doing this because this is survey data whose respondents are old people who sometimes report inaccurate age or education information. My main challenge is that I want only one of each cluster label in each survey year. For example, I do not want to see two cluster 3 labels in survey year 2000. My data looks like this:
survey year  relationship          gender  education level  birth year  k-means cluster
2000         41 (first daughter)   0       3                1997        1
2003         41 (first daughter)   0       3                1997        1
2000         42 (second daughter)  0       4                1999        2
2003         42 (second daughter)  0       4                1999        2
2000         42 (third daughter)   0       5                1999        2
2003         42 (third daughter)   0       5                2001        3
Thanks in advance.
--Update--
A more detailed description of the task:
The data set is panel survey data asking elders about their health status and their relationships (incl. sons, daughters, neighbors). Since these older people are sometimes imprecise about their family members' demographic information, such as birth year and education level, we might have to delete a large part of the data when records do not match.
(e.g., A reported his first son as 30 years old in 1997 but said he was 29 years old in 1999; this record is therefore problematic). My task is to save as much data as possible when the imprecision is not too high.
Therefore I first mutated columns to check the precision of each family member (e.g., birth year error %in% c(-1,2)). Next, I run k-means on the family members detected as imprecise. In this way, I save much of the data. Although I did not solve the problem above, it occurs so rarely that I can almost ignore or drop those observations.
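For reference, here is a sketch of one way the one-label-per-year constraint could be enforced after the fact (my own untested idea; the column and feature names are placeholders): run plain k-means, then within each survey year re-assign the rows to distinct centroids by solving a linear assignment problem with clue::solve_LSAP().
library(clue)  # solve_LSAP() solves the linear assignment problem
# dat: data frame with numeric feature columns and a survey-year column.
# Assumes no year has more rows than there are clusters (k).
constrained_kmeans <- function(dat, features, k, year_col = "survey_year") {
  X  <- scale(as.matrix(dat[, features]))
  km <- kmeans(X, centers = k)
  dat$cluster <- NA_integer_
  for (yr in unique(dat[[year_col]])) {
    idx <- which(dat[[year_col]] == yr)
    # cost[i, j]: squared distance from row i of this year to centroid j
    d    <- as.matrix(dist(rbind(X[idx, , drop = FALSE], km$centers)))
    cost <- d[seq_along(idx), length(idx) + seq_len(k), drop = FALSE]^2
    dat$cluster[idx] <- as.integer(solve_LSAP(cost))  # labels distinct within a year
  }
  dat
}
Each year's rows then carry distinct labels by construction, at the price of possibly moving a row away from its nearest centroid.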

Optimisation in R: Creating an exam schedule (ompr, ROI ?)

I'm scheduling an exam for a course (using R).
I have 36 students, all of whom need to be assigned time slots with one of 8 teachers on the exam day (1:1 exams).
Each student or teacher can obviously only be in one place at a time.
How do I run an optimisation to find a solution that uses the fewest time slots?
# Teachers
examiners <- 1:8
# Students
students <- 1:36
Some packages of interest: ompr, ROI?
Many thanks!
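For concreteness, here is a minimal sketch of how the MILP might look with ompr (my own untested formulation; with no constraints beyond the 1:1 exams the optimum is simply ceiling(36 / 8) = 5 slots, but the model extends to richer restrictions such as teacher unavailability):
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)
n_s <- 36  # students
n_e <- 8   # examiners
n_t <- 5   # upper bound on slots: ceiling(36 / 8)
model <- MIPModel() %>%
  # x[s, e, t] = 1 if student s meets examiner e in slot t
  add_variable(x[s, e, t], s = 1:n_s, e = 1:n_e, t = 1:n_t, type = "binary") %>%
  add_variable(used[t], t = 1:n_t, type = "binary") %>%
  # every student is examined exactly once
  add_constraint(sum_over(x[s, e, t], e = 1:n_e, t = 1:n_t) == 1, s = 1:n_s) %>%
  # an examiner sees at most one student per slot, and only in slots marked used
  add_constraint(sum_over(x[s, e, t], s = 1:n_s) <= used[t], e = 1:n_e, t = 1:n_t) %>%
  set_objective(sum_over(used[t], t = 1:n_t), "min")
result <- solve_model(model, with_ROI(solver = "glpk"))
subset(get_solution(result, x[s, e, t]), value > 0.5)  # the schedule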

Apply Sentimentr on Dataframe with Multiple Sentences in 1 String Per Row

I have a dataset where I am trying to get the sentiment by article. I have about 1000 articles; each article is a single string containing multiple sentences. Ideally, I would like to add another column that summarises the sentiment of each article. Is there an efficient way to do this using dplyr?
Below is an example dataset with just 2 articles.
date<- as.Date(c('2020-06-24', '2020-06-24'))
text <- c('3 more cops recover as PNP COVID-19 infections soar to 519', 'QC suspends processing of PWD IDs after reports of abuse in issuance of cards')
link<- c('https://newsinfo.inquirer.net/1296981/3-more-cops-recover-as-pnps-covid-19-infections-soar-to-519,3,10,4,11,9,8', 'https://newsinfo.inquirer.net/1296974/qc-suspends-processing-of-pwd-ids-after-reports-of-abuse-in-issuance-of-cards')
V4 <-c('MANILA, Philippines — Three more police officers have recovered from the new coronavirus disease, increasing the total number of recoveries in the Philippine National Police to (PNP) 316., This developed as the total number of COVID-19 cases in the PNP rose to 519 with one new infection and nine deaths recorded., In a Facebook post on Wednesday, the PNP also recorded 676 probable and 876 suspects for the disease., PNP chief Gen. Archie Gamboa previously said the force would will intensify its health protocols among its personnel after recording a recent increase in deaths., The latest fatality of the ailment is a police officer in Cebu City, which is under enhanced community quarantine as COVID-19 cases continued to surge there., ATM, \r\n\r\nFor more news about the novel coronavirus click here.\r\nWhat you need to know about Coronavirus.\r\n\r\n\r\n\r\nFor more information on COVID-19, call the DOH Hotline: (02) 86517800 local 1149/1150.\r\n\r\n \r\n \r\n \r\n\r\n \r\n , The Inquirer Foundation supports our healthcare frontliners and is still accepting cash donations to be deposited at Banco de Oro (BDO) current account #007960018860 or donate through PayMaya using this link .',
'MANILA, Philippines — Quezon City will halt the processing of identification cards to persons with disability for two days starting Thursday, June 25, so it could tweak its guidelines after reports that unqualified persons had issued with the said IDs., In a statement on Wednesday, Quezon City Mayor Joy Belmonte said the suspension would the individual who issued PWD ID cards to six members of a family who were not qualified but who paid P2,000 each to get the IDs., Belmonte said the suspect, who is a local government employee, was already issued with a show-cause order to respond to the allegation., According to city government lawyer Nino Casimir, the suspect could face a grave misconduct case that could result in dismissal., The IDs are issued to only to persons qualified under the Act Expanding the Benefits and Privileges of Persons with Disability (Republic Act No. 10754)., The IDs entitle PWDs to a 20 percent discount and VAT exemption on goods and services., /atm')
df<-data.frame(date, text, link, V4)
head(df)
So I have been looking at how to do this using the sentimentr package and wrote the code below. However, this only outputs each sentence's sentiment (I do this with a strsplit on ".,"), and I instead want to aggregate everything at the full-article level after applying this split.
library(dplyr)
library(tidyr)
library(sentimentr)

full <- df %>%
  group_by(V4) %>%
  mutate(V2 = strsplit(as.character(V4), "[.],")) %>%
  unnest(V2) %>%
  get_sentences() %>%
  sentiment()
The desired output I am looking for is simply an extra column in my df dataframe with sum(sentiment) for each article.
Additional info based on the answer below, using the same df as defined above:
df %>%
  group_by(V4) %>% # group_by not really needed
  mutate(V4 = gsub("[.],", ".", V4),
         sentiment_score = sentiment_by(V4))
# A tibble: 2 x 5
# Groups: V4 [2]
date text link V4 sentiment_score$e~ $word_count $sd $ave_sentiment
<date> <chr> <chr> <chr> <int> <int> <dbl> <dbl>
1 2020-06-24 3 more cops recover as P~ https://newsinfo.inquirer.net/1296~ "MANILA, Philippines — Three more police officers ~ 1 172 0.204 -0.00849
2 2020-06-24 QC suspends processing o~ https://newsinfo.inquirer.net/1296~ "MANILA, Philippines — Quezon City will halt the p~ 1 161 0.329 -0.174
Warning message:
Can't combine <sentiment_by> and <sentiment_by>; falling back to <data.frame>.
x Some attributes are incompatible.
i The author of the class should implement vctrs methods.
i See <https://vctrs.r-lib.org/reference/faq-error-incompatible-attributes.html>.
If you need the sentiment over the whole text, there is no need to split the text into sentences first; the sentiment functions take care of this. I replaced the "., " in your text with periods, as the sentiment functions need them; they also recognize that an abbreviation like "mr." is not the end of a sentence. If you use get_sentences() first, you get the sentiment per sentence and not over the whole text.
The function sentiment_by handles the sentiment over the whole text and averages it nicely. Check the help for the averaging.function option if you need to change this. The by part of the function can deal with any grouping you want to apply.
df %>%
  group_by(V4) %>% # group_by not really needed
  mutate(V4 = gsub("[.],", ".", V4),
         sentiment_score = sentiment_by(V4))
# A tibble: 2 x 5
# Groups: V4 [2]
date text link V4 sentiment_score$~ $word_count $sd $ave_sentiment
<date> <chr> <chr> <chr> <int> <int> <dbl> <dbl>
1 2020-06-24 3 more cops recov~ https://newsinfo.inquire~ "MANILA, Philippines — Three~ 1 172 0.204 -0.00849
2 2020-06-24 QC suspends proce~ https://newsinfo.inquire~ "MANILA, Philippines — Quezo~ 1 161 0.329 -0.174
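If you would rather have a plain numeric column (and avoid the vctrs warning shown earlier), one option, my suggestion rather than part of the original answer, is to keep only the ave_sentiment component:
library(dplyr)
library(sentimentr)

df %>%
  mutate(V4 = gsub("[.],", ".", V4),
         sentiment_score = sentiment_by(V4)$ave_sentiment)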

filtering text and storing the filtered sentence/paragraph into a new column

I am trying to extract some sentences from text data. I want to extract the sentences which contain the phrase "medical device company released". I can run the following code:
df_text <- unlist(strsplit(df$TD, "\\."))
df_text
df_text <- df_text[grep(pattern = "medical device company released", df_text, ignore.case = TRUE)]
df_text
Which gives me:
[1] "\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday"
So I extracted the sentences which contain the phrase "medical device company released". However, I want to store the results in a new column, aligned with the grp the sentence came from.
Expected output:
grp TD newCol
3613 text NA # does not contain the sentence
4973 text medical device company released
5570 text NA # does not contain the sentence
Data:
df <- structure(list(grp = c("3613", "4973", "5570"), TD = c(" Wal-Mart plans to add an undisclosed number of positions in areas including its store-planning operation and New York apparel office.\n\nThe moves, which began Tuesday, are meant to \"increase operational efficiencies, support our strategic growth plans and reduce overall costs,\" Wal-Mart spokesman David Tovar said.\n\nWal-Mart still expects net growth of tens of thousands of jobs at the store level this year, Tovar said.\n\nThe reduction in staff is hardly a new development for retailers, which have been cutting jobs at their corporate offices as they contend with the down economy. Target Corp. (TGT), Saks Inc. (SKS) and Best Buy Co. (BBY) are among retailers that have said in recent weeks they plan to pare their ranks.\n\nTovar declined to say whether the poor economy was a factor in Wal-Mart's decision.\n\nWal-Mart is operating from a position of comparative strength as one of the few retailers to consistently show positive growth in same-store sales over the past year as the recession dug in.\n\nWal-Mart is \"a fiscally responsible company that will manage its capital structure appropriately,\" said Todd Slater, retail analyst at Lazard Capital Markets.\n\nEven though Wal-Mart is outperforming its peers, the company \"is not performing anywhere near peak or optimum levels,\" Slater said. \"The consumer has cut back significantly.\"\n\nWal-Mart indicated it had regained some footing in January, when comparable-store sales rose 2.1%, after a lower-than-expected 1.7% rise in December.\n\nWal-Mart shares are off 3.2% to $47.68.\n\n-By Karen Talley, Dow Jones Newswires; 201-938-5106; karen.talley#dowjones.com [ 02-10-09 1437ET ]\n ",
" --To present new valve platforms Friday\n\n(Updates with additional comment from company, beginning in the seventh paragraph.)\n\n\n \n By Anjali Athavaley \n Of DOW JONES NEWSWIRES \n \n\nNEW YORK (Dow Jones)--Edwards Lifesciences Corp. (EW) said Friday that it expects earnings to grow 35% to 40%, excluding special items, in 2012 on expected sales of its catheter-delivered heart valves that were approved in the U.S. earlier this year.\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday. The catheter-delivered heart valve market is considered to have a multibillion-dollar market potential, but questions have persisted on how quickly the Edwards device, called Sapien, will be rolled out and who will be able to receive it.\n\nEdwards said it expects transcatheter valve sales between $560 million and $630 million in 2012, with $200 million to $260 million coming from the U.S.\n\nOverall, for 2012, Edwards sees total sales between $1.95 billion and $2.05 billion, above the $1.68 billion to $1.72 billion expected this year and bracketing the $2.01 billion expected on average by analysts surveyed by Thomson Reuters.\n\nThe company projects 2012 per-share earnings between $2.70 and $2.80, the midpoint of which is below the average analyst estimate of $2.78 on Thomson Reuters. Edwards estimates a gross profit margin of 73% to 75%.\n\nEdwards also reaffirmed its 2011 guidance, which includes earnings per share of $1.97 to $2.02, excluding special items.\n\nThe company said it continues to expect U.S. approval of its Sapien device for high-risk patients in mid-2012. Currently, the device is only approved in the U.S. for patients too sick for surgery.\n\nThe company added that a separate trial studying its newer-generation valve in a larger population is under way in the U.S. It expects U.S. approval of that device in 2014.\n\nEdwards also plans to present at its investor conference two new catheter-delivered valve platforms designed for different implantation methods. European trials for these devices are expected to begin in 2012.\n\nShares of Edwards, down 9% over the past 12 months, were inactive premarket. The stock closed at $63.82 on Thursday.\n\n-By Anjali Athavaley, Dow Jones Newswires; 212-416-4912; anjali.athavaley#dowjones.com [ 12-09-11 0924ET ]\n ",
" In September, the company issued a guidance range of 43 cents to 44 cents a share. \n\nFor the year, GE now sees earnings no lower than $1.81 a share to $1.83 a share. The previous forecast called for income of $1.80 to $1.83 a share. The new range brackets analyst projections of $1.82 a share. \n\nThe new targets represent double-digit growth from the respective year-earlier periods. Last year's third-quarter earnings were $3.87 billion, or 36 cents a share, excluding items; earnings for the year ended Dec. 31 came in at $16.59 billion, or $1.59 a share. [ 10-06-05 0858ET ] \n\nGeneral Electric also announced Thursday that it expects 2005 cash flow from operating activities to exceed $19 billion. \n\nBecause of the expected cash influx, the company increased its authorization for share repurchases by $1 billion to more than $4 billion. \n\nGE announced the updated guidance at an analysts' meeting Thursday in New York. A Web cast of the meeting is available at . \n\nThe company plans to report third-quarter earnings Oct. 14. \n\nShares of the Dow Jones Industrial Average component recently listed at $33.20 in pre-market trading, according to Inet, up 1.6%, or 52 cents, from Wednesday's close of $32.68. \n\nCompany Web site: \n\n-Jeremy Herron; Dow Jones Newswires; 201-938-5400; Ask Newswires#DowJones.com \n\nOrder free Annual Report for General Electric Co. \n\nVisit or call 1-888-301-0513 [ 10-06-05 0904ET ] \n "
)), class = "data.frame", row.names = c(NA, -3L))
We can get the data in separate rows, keeping grp intact, and keep only the sentences that contain "medical device company released".
library(dplyr)

df %>%
  tidyr::separate_rows(TD, sep = "\\.") %>%
  group_by(grp) %>%
  summarise(newCol = toString(grep(pattern = "medical device company released",
                                   TD, ignore.case = TRUE, value = TRUE)))
# grp newCol
# <chr> <chr>
#1 3613 ""
#2 4973 "\n\nThe medical device company released its financia…
#3 5570 ""
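To match the expected output exactly, the empty strings can then be converted to NA, for example with dplyr::na_if():
df %>%
  tidyr::separate_rows(TD, sep = "\\.") %>%
  group_by(grp) %>%
  summarise(newCol = toString(grep("medical device company released",
                                   TD, ignore.case = TRUE, value = TRUE))) %>%
  mutate(newCol = na_if(newCol, ""))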

Webscraping in R

I'm working on a project studying state-issued municipal bonds, but I am having trouble getting my data. Using the XML package and the code below, I was able to get some of it.
> nys="http://newyork.municipalbonds.com/bonds/issue/649787N87"
> nys.table=readHTMLTable(nys,asText=TRUE,which=4)
> nys.table=as.data.frame(nys.table)
> head(nys.table)
Trade Date Trade Time Maturity Date Coupon Price Yield Trade Amount Trade Type
1 2012-09-27 2:49pm 2013-Apr 5.000% 102.522 0.289 $270,000 Investor bought
2 2012-09-27 1:17pm 2013-Apr 5.000% 102.290 0.712 $45,000 Inter-dealer
But that site only offers a small sample for free. The official website, EMMA, has the data for free, but I'm having a terrible time scraping it. When I try the same approach as before, I end up with:
nys="http://emma.msrb.org/SecurityView/SecurityDetailsTrades.aspx?cusip=649787N87"
nys.table=readHTMLTable(nys,asText=TRUE)
nys.table=as.data.frame(nys.table)
head(nys.table)
data frame with 0 columns and 0 rows
From what I understand, and I'm fairly certain about this, there is a standard T&C page shown when you first navigate to the site in a web browser. After using htmlParse(nys), the output is identical to the page source of the T&C page, not the page where the data is actually located. So when the code runs, it is trying to find tables on the T&C page.
I figured this would be a fairly common problem, but so far I have not been able to find any posts where someone had a similar issue. If someone could point me in the right direction, I'd be greatly appreciative.
I finally got it to work. I had to use Web Developer in Firefox, which allowed me to see what name/value pair the site was setting for the Disclaimer cookie. Here it is:
library(RCurl)
library(XML)   # for htmlParse() / readHTMLTable()

nys <- "http://emma.msrb.org/SecurityView/SecurityDetailsTrades.aspx?cusip=649787N87"
txt <- getURLContent(nys, cookie = 'Disclaimer=Ratings')
readHTMLTable(htmlParse(txt, asText = TRUE))
$ctl00_mainContentArea_tradeSearchResults
Trade Date/Time   Settlement Date Price (%) Yield (%) Trade Amt ($) Trade Submission Type  
1 09/27/2012 : 02:49 PM 10/02/2012 102.5220 0.289 270,000 Customer bought
2 09/27/2012 : 01:17 PM 10/02/2012 102.29 0.712 45,000 Inter-dealer Trade
3 09/27/2012 : 01:17 PM 10/02/2012 102.29 0.712 45,000 Inter-dealer Trade
To get the next 100 rows, you have to post a form with the current "viewstate":
# Get next set
viewstate <- gsub('.*"__VIEWSTATE" value="([^"]*)".*', '\\1', txt)
txt <- postForm(nys,
                "__VIEWSTATE" = viewstate,
                "__EVENTTARGET" = "ctl00$mainContentArea$nextBottomButton",
                .opts = list(cookie = 'Disclaimer=Ratings'))
readHTMLTable(htmlParse(txt, asText = TRUE))
$ctl00_mainContentArea_tradeSearchResults
Trade Date/Time   Settlement Date Price (%) Yield (%) Trade Amt ($) Trade Submission Type  
1 06/27/2011 : 01:51 PM 06/30/2011 107.7350 0.65 600,000 Customer sold
2 06/22/2011 : 12:05 PM 06/27/2011 107.1960 0.957 8,000 Customer bought
3 06/22/2011 : 12:05 PM 06/27/2011 106.6960 1.226 8,000 Inter-dealer Trade
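To page through more of the table, the same viewstate trick can be repeated in a loop (a sketch, untested against the live site, assuming the form fields do not change between pages):
library(RCurl)
library(XML)

nys <- "http://emma.msrb.org/SecurityView/SecurityDetailsTrades.aspx?cusip=649787N87"
txt <- getURLContent(nys, cookie = 'Disclaimer=Ratings')

pages <- list()
for (i in 1:5) {
  # keep the trade table from the current page
  pages[[i]] <- readHTMLTable(htmlParse(txt, asText = TRUE))$ctl00_mainContentArea_tradeSearchResults
  # pull the current viewstate and "click" the next button to advance
  viewstate <- gsub('.*"__VIEWSTATE" value="([^"]*)".*', '\\1', txt)
  txt <- postForm(nys,
                  "__VIEWSTATE" = viewstate,
                  "__EVENTTARGET" = "ctl00$mainContentArea$nextBottomButton",
                  .opts = list(cookie = 'Disclaimer=Ratings'))
}
trades <- do.call(rbind, pages)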
