Find the minimum sum of K rows for a MxN matrix, where you need to have 1 item in each column - math

Let say below is a table describing the time it takes 5 people to finish 5 jobs.
Job1 Job2 Job3 Job4 Job5
Bob 1 2 3 4 5
Joe 2 1 3 4 5
Tom 2 3 1 4 5
May 2 3 1 2 5
Sue 2 3 4 5 1
If the company has enough money to hire all 5 people, then I can see that the minimum time to complete everything is
Bob do Job1
Joe do Job2
Tom do Job3
May do Job4
Sue do Job5
1 + 1 + 1 + 2 + 1 = 6 units
I found the fastest way to solve this is to use Hungary Algorthm
(Referenced find the minimum sum of matrix (n x n) that select only one in each row and column)
Also, if company can only hire 1 person, May will be hired because everyone took 15 units to finish, where May took only 13 units.
However, if company has enough budget to hire 3 people. What's the fastest algorithm to find out which 3 people I should hire?
In the above Example, should be Bob (or Joe) + May + Sue, their total will be
Either Bob do Job 1
Either Bob do Job 2
May do Job 3
May do Job 4
Sue do Job 5
1 + 2 + 1 + 2 + 1 = 7 units

Related

unnesting list inside of a list, inside of a list, inside of a list... while preserving id in R

I imported a JSON file with below structure:
link
I would like to transform it to a dataframe with 3 columns: ID group_name date_joined,
where ID is a element number from "data" list.
It should look like this:
ID group_name date_joined
1 aaa dttm
1 bbb dttm
1 ccc dttm
1 ddd dttm
2 eee dttm
2 aaa dttm
2 bbb dttm
2 fff dttm
2 ggg dttm
3 bbb dttm
3 ccc dttm
3 ggg dttm
3 mmm dttm
Using below code few times i get a dataframe with just 2 columns: group_name and date_joined
train2 <- do.call("rbind", train2)
sample file link
the following should work:
library(jsonlite)
train2 <- fromJSON("sample.json")
train2 <- train2[[1]]$groups$data
df <- data.frame(
ID = unlist(lapply(1:length(train2), function(x) rep.int(x,length(train2[[x]]$group_name)))),
group_name = unlist(lapply(1:length(train2),function(x) train2[[x]]$group_name)),
date_joined = unlist(lapply(1:length(train2),function(x) train2[[x]]$date_joined)))
output:
> df
ID group_name date_joined
1 1 Let's excercise together and lose a few kilo quicker - everyone is welcome! (Piastow) 2008-09-05 09:55:18.730066
2 1 Strongman competition 2008-05-22 21:25:22.572365
3 1 Fast food 4 life 2012-02-02 05:26:01.293628
4 1 alternative medicine - Hypnosis and bioenergotheraphy 2008-07-05 05:47:12.254848
5 2 Tom Cruise group 2009-06-14 16:48:28.606142
6 2 Babysitters (Sokoka) 2010-09-25 03:21:01.944684
7 2 Work abroad - join to find well paid work and enjoy the experience (Sokoka) 2010-09-21 23:44:39.499240
8 2 Tennis, Squash, Badminton, table tennis - looking for sparring partner (Sokoka) 2007-10-09 17:15:13.896508
9 2 Lost&Found (Sokoka) 2007-01-03 04:49:01.499555
10 3 Polish wildlife - best places 2007-07-29 18:15:49.603727
11 3 Politics and politicians 2010-10-03 21:00:27.154597
12 3 Pizza ! Best recipes 2010-08-25 22:26:48.331266
13 3 Animal rights group - join us if you care! 2010-11-02 12:41:37.753989
14 4 The Aspiring Writer 2009-09-08 15:49:57.132171
15 4 Nutrition & food advices 2010-12-02 18:19:30.887307
16 4 Game of thrones 2009-09-18 10:00:16.190795
17 5 The ultimate house and electro group 2008-01-02 14:57:39.269135
18 5 Pirates of the Carribean 2012-03-05 03:28:37.972484
19 5 Musicians Available Poland (Osieczna) 2009-12-21 13:48:10.887986
20 5 Housekeeping - looking for a housekeeper ? Join the group! (Osieczna) 2008-10-28 23:22:26.159789
21 5 Rooms for rent (Osieczna) 2012-08-09 12:14:34.190438
22 5 Counter strike - global ladderboard 2008-11-28 03:33:43.272435
23 5 Nutrition & food advices 2011-02-08 19:38:58.932003

Find the favorite and analyse sequence questions in R

We have a daily meeting when participants nominate each other to speak. The first person is chosen randomly.
I have a dataframe that consists of names and the order of speech every day.
I have a day1, a day2 ,a day3 , etc. in the columns.
The data in the rows are numbers, meaning the order of speech on that particular day.
NA means that the person did not participate on that day.
Name day1 day2 day3 day4 ...
Albert 1 3 1 ...
Josh 2 2 NA
Veronica 3 5 3
Tim 4 1 2
Stew 5 4 4
...
I want to create two analysis, first, I want to create a dataframe who has chosen who the most times. (I know that the result depends on if a participant was nominated before and therefore on that day that participant cannot be nominated again, I will handle it later, but for now this is enough)
It should look like this:
Name Favorite
Albert Stew
Josh Veronica
Veronica Tim
Tim Stew
...
My questions (feel free to answer only one if you can):
1. What code shall I use for it without having to manunally put the names in a different dataframe?
2. How shall I handle a tie, for example Josh chose Veronica and Tim first the same number of times? Later I want to visualise it and I have no idea how to handle ties.
I also would like to analyse the results to visualise strong connections.
Like to show that there are people who usually chose each other, etc.
Is there a good package that is specialised for these? Or how should I get to it?
I do not need DNA sequences, only this simple ones, but I have not found a suitable one yet.
Thanks for your help!
If I am not misunderstanding your problem, here is some code to get the number of occurences of who choose who as next speaker. I added a fourth day to have some count that is not 1. There are ties in the result, choosing the first couple of each group by speaker ('who') may be a solution :
df <- read.table(textConnection(
"Name,day1,day2,day3,day4
Albert,1,3,1,3
Josh,2,2,,2
Veronica,3,5,3,1
Tim,4,1,2,4
Stew,5,4,4,5"),header=TRUE,sep=",",stringsAsFactors=FALSE)
purrr::map(colnames(df)[-1],
function (x) {
who <- df$Name[order(df[x],na.last=NA)]
data.frame(who,lead(who),stringsAsFactors=FALSE)
}
) %>%
replyr::replyr_bind_rows() %>%
filter(!is.na(lead.who.)) %>%
group_by(who,lead.who.) %>% summarise(n=n()) %>%
arrange(who,desc(n))
Input:
Name day1 day2 day3 day4
1 Albert 1 3 1 3
2 Josh 2 2 NA 2
3 Veronica 3 5 3 1
4 Tim 4 1 2 4
5 Stew 5 4 4 5
Result:
# A tibble: 12 x 3
# Groups: who [5]
who lead.who. n
<chr> <chr> <int>
1 Albert Tim 2
2 Albert Josh 1
3 Albert Stew 1
4 Josh Albert 2
5 Josh Veronica 1
6 Stew Veronica 1
7 Tim Stew 2
8 Tim Josh 1
9 Tim Veronica 1
10 Veronica Josh 1
11 Veronica Stew 1
12 Veronica Tim 1

Classification according to unique values \

I have a data frame named as Records having 2 vectors Rank and Name
Rank Name
1 Ashish
1 Ashish
2 Ashish
3 Mark
4 Mark
1 Mark
3 Spencer
2 Spencer
1 Spencer
2 Mary
4 Joseph
I want that every name should be placed in either 1, 2 ,3 or 4 tag depending on their occurrence and uniqueness:
I want to create a new vector which will be named as Tagging
So The output should be:
Rank 1 has three unique elements Mark Spencer and Ashish so the tag is 1 for all three.
Rank 2 has one unique records which is Mary as Ashish has already been assigned tag 1 so Mary is tagged as 2.
Rank 3 has no unique records as Spencer and Mark has already been assigned 1 so I cannot tag 3 to anybody.
Rank 4 has one unique record Joseph so he gets tagged as 4.
Let me know which function can help me do this.
I do not want to use looping as this is 1000000 row database
The below solution follows the principle that the highest Rank of a person is going to be that person's tag too.
tbl <- read.table(header=TRUE, text='
Rank Name
1 Ashish
1 Ashish
2 Ashish
3 Mark
4 Mark
1 Mark
3 Spencer
2 Spencer
1 Spencer
2 Mary
4 Joseph
')
Ordering the 'tbl' dataframe by Rank
tbl_ord <- tbl[with(tbl,order(Rank)),]
Removing multiple occurrence of name within same Rank
> name_ord<- tbl_ord[duplicated(tbl_ord$Rank),]
> name_ord
Rank Name
2 1 Ashish
6 1 Mark
9 1 Spencer
8 2 Spencer
10 2 Mary
7 3 Spencer
11 4 Joseph
Displaying unique Names
#name_ord[unique(name_ord$Name),] #this will work too
> name_ord[!duplicated(name_ord$Name),]
Rank Name
2 1 Ashish
6 1 Mark
9 1 Spencer
10 2 Mary
11 4 Joseph
Using the setkey function of data.table package and unique:
library(data.table)
dt<-data.table(Rank=c(1,1,2,3,4,1,3,2,1,2,4), Name=c(rep("Ashish", 3), rep("Mark", 3), rep("Spencer", 3), "Mary", "Joseph"))
setkey(dt, Rank, Name)
dt<-unique(dt)
setkey(dt, Name)
dt<-unique(dt) # works because of the above setkey call which sorted it
setkey(dt, Rank) # if you want to order them by Rank again

Cross tabulation with n>2 categories in R: hide rows with zero cases

I'm trying to make a cross tabulation in R, and having its output resemble as much as possible what I'd get in an Excel pivot table. So, given this code:
set.seed(2)
df<-data.frame("ministry"=paste("ministry ",sample(1:3,20,replace=T)),"department"=paste("department ",sample(1:3,20,replace=T)),"program"=paste("program ",sample(letters[1:20],20,replace=F)),"budget"=runif(20)*1e6)
library(tables)
library(dplyr)
arrange(df,ministry,department,program)
tabular(ministry*department~((Count=budget)+(Avg=(mean*budget))+(Total=(sum*budget))),data=df)
which yields:
Avg Total
ministry department Count budget budget
ministry 1 department 1 5 479871 2399356
department 2 1 770028 770028
department 3 1 184673 184673
ministry 2 department 1 2 170818 341637
department 2 1 183373 183373
department 3 3 415480 1246440
ministry 3 department 1 0 NaN 0 <---- LOOK HERE
department 2 5 680102 3400509
department 3 2 165118 330235
How do I get the output to hide the rows with zero frequencies?
I'm using tables::tabular but any other package is good for me (as long as there's a way, even indirect, of outputting to html). This is for generating HTML or Latex using R Markdown and displaying the table with my script's results as Excel would, or as in the example above in a pivot-table like form. But without the superfluous row.
Thanks!
Why not just use dplyr?
df %>%
group_by(ministry, department) %>%
summarise(count = n(),
avg_budget = mean(budget, na.rm = TRUE),
tot_budget = sum(budget, na.rm = TRUE))
ministry department count avg_budget tot_budget
1 ministry 1 department 1 5 479871.1 2399355.6
2 ministry 1 department 2 1 770027.9 770027.9
3 ministry 1 department 3 1 184673.5 184673.5
4 ministry 2 department 1 2 170818.3 341636.5
5 ministry 2 department 2 1 183373.2 183373.2
6 ministry 2 department 3 3 415479.9 1246439.7
7 ministry 3 department 2 5 680101.8 3400508.8
8 ministry 3 department 3 2 165117.6 330235.3
While I don't understand at all how the tabular object is made (since it says it's a list but seems to behaves like a data frame), you can select cells as usual, so
> results <-tabular(ministry*department~((Count=budget)+(Avg=(mean*budget))+(Total=(sum*budget))),data=df)
> results[results[,1]!=0,]
Avg Total
ministry department Count budget budget
ministry 1 department 1 5 479871 2399356
department 2 1 770028 770028
department 3 1 184673 184673
ministry 2 department 1 2 170818 341637
department 2 1 183373 183373
department 3 3 415480 1246440
ministry 3 department 2 5 680102 3400509
department 3 2 165118 330235
That's the solution.
I just found out the solution thanks to this user's reply on another question https://stackoverflow.com/users/516548/g-grothendieck

selecting rows with specific conditions in R

I currently have a data that looks like this for multiple ids (that range until around 1600)
id year name status
1 1980 James 3
1 1981 James 3
1 1982 James 3
1 1983 James 4
1 1984 James 4
1 1985 James 1
1 1986 James 1
1 1987 James 1
2 1982 John 2
2 1983 John 2
2 1984 John 1
2 1985 John 1
I want to subset this data so that it only has the information for status=1 and the status right before that. I also want to eliminate multiple 1s and only save the first 1s. In conclusion I would want:
id year name status
1 1984 James 4
1 1985 James 1
2 1983 John 2
2 1984 John 1
I'm doing this because I'm in the process of figuring out in what year how many people from certain status changed to status 1. I only know the subset command and I don't think I can get this data from doing subset(data, subset=(status==1)). How could I save the information right before that
I want to add to this question one more time - I did not get same results when I applied the first reply to this question (which uses plr packages) and the third reply which uses duplicated command. I found out that the first reply preserved information accurately while the third one did not.
This does what you want.
library(plyr)
ddply(d, .(name), function(x) {
i <- match(1, x$status)
if (is.na(i))
NULL
else
x[c(i-1, i), ]
})
id year name status
1 1 1984 James 4
2 1 1985 James 1
3 2 1983 John 2
4 2 1984 John 1
Here's a solution - for each grouping of numbers (the cumsum bit), it looks at the first one and takes that and the previous row if status is 1:
library(data.table)
dt = data.table(your_df)
dt[dt[, if(status[1] == 1) c(.I[1]-1, .I[1]),
by = cumsum(c(0,diff(status)!=0))]$V1]
# id year name status
#1: 1 1984 James 4
#2: 1 1985 James 1
#3: 2 1983 John 2
#4: 2 1984 John 1
Using base R, here is a way to do this:
# this first line is how I imported your data after highlighting and copying (i.e. ctrl+c)
d<-read.table("clipboard",header=T)
# find entries where the subsequent row's "status" is equal to 1
# really what's going on is finding rows where "status" = 1, then subtracting 1
# to find the index of the previous row
e<-d[which(d$status==1)-1 ,]
# be careful if your first "status" entry = 1...
# What you want
# Here R will look for entries where "name" and "status" are both repeats of a
# previous row and where "status" = 1, and it will get rid of those entries
e[!(duplicated(e[,c("name","status")]) & e$status==1),]
id year name status
5 1 1984 James 4
6 1 1985 James 1
10 2 1983 John 2
11 2 1984 John 1
I like the data.table solution myself, but there actually is a way to do it with subset.
# import data from clipboard
x = read.table(pipe("pbpaste"),header=TRUE)
# Get the result table that you want
x1 = subset(x, status==1 |
c(status[-1],0)==1 )
result = subset(x1, !duplicated(cbind(name,status)) )

Resources