I have 60 students who need to be divided into 12 teams. The teams should be evenly distributed by gender and nationality.
What's an effective approach to accomplish this?
As an experiment, I tried a list of example data with 9 students to be divided into 3 teams:
A: Adam, Male, Sweden
B: Bob, Male, Norway
C: Charlie, Female, Denmark
D: David, Male, Denmark
E: Erica, Female, Sweden
F: Frida, Female, Norway
G: Gunnar, Male, Denmark
H: Hans, Male, Norway
I: Anna, Female, Sweden
I thought perhaps one way would be to first sort by nationality:
AEIBFHCDG
And then take every third student (n, n+3, n+6) to create teams evenly distributed by nationality:
ABC
EFD
IHG
Then look diagonally for the second iteration:
AFG
BDI
CEH
But I don't really get further than that. Any ideas on what approach I should use for this?
I think you should first decide which is more important: the distribution of gender or of nationality within a team. If you think gender is more important, do the following in Python:
# todo: read the students into a list called `students`
number_of_teams = 12
sorted_students = sorted(students)  # todo: use your own sort key
team_list = [[] for _ in range(number_of_teams)]  # a list of 12 teams; a team is modelled as a list
for i, student in enumerate(sorted_students):
    team_list[i % number_of_teams].append(student)
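A minimal runnable sketch of this sort-then-deal approach, applied to the 9-student sample from the question. The `(gender, nationality)` sort key is one possible choice, prioritising gender as discussed; the secondary attribute may not come out perfectly even.

```python
# Sketch: sort by (gender, nationality), then deal students round-robin.
# Student tuples are taken from the question's sample data.
students = [
    ("Adam", "Male", "Sweden"), ("Bob", "Male", "Norway"),
    ("Charlie", "Female", "Denmark"), ("David", "Male", "Denmark"),
    ("Erica", "Female", "Sweden"), ("Frida", "Female", "Norway"),
    ("Gunnar", "Male", "Denmark"), ("Hans", "Male", "Norway"),
    ("Anna", "Female", "Sweden"),
]
number_of_teams = 3

# Sorting by gender first keeps the gender balance as even as possible;
# nationality as a secondary key helps spread countries too.
sorted_students = sorted(students, key=lambda s: (s[1], s[2]))

teams = [[] for _ in range(number_of_teams)]
for i, student in enumerate(sorted_students):
    teams[i % number_of_teams].append(student)

for team in teams:
    print([name for name, gender, country in team])
```

With 60 students and 12 teams, only `students` and `number_of_teams` change; the deal loop stays the same.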
I would like to load multiple pages from a single website and extract specific attributes from different classes, as below. Then I would like to create a dataframe with the parsed information from the multiple pages.
Extract from multiple pages
for page in range(1,10):
    url = f"https://www.consilium.europa.eu/en/press/press-releases/?page={page}"
    res = requests.get(url)
    soup = bs(res.text, 'lxml')
Parsing
soup_content = soup.find_all('li', {'class':['list-item ceu clearfix','list-item gsc clearfix','list-item euco clearfix','list-item eg clearfix' ]})
datePublished = []
headline = []
description = []
urls = []
for i in range(len(soup_content)):
    datePublished.append(soup_content[i].find('span', {'itemprop': 'datePublished'}).attrs['content'])
    headline.append(soup_content[i].find('h3', {'itemprop': 'headline'}).get_text().strip())
    description.append(soup_content[i].find('p', {'itemprop': 'description'}).get_text().strip())
    urls.append('https://www.consilium.europa.eu{}'.format(soup_content[i].find('a', {'itemprop': 'url'}).attrs['href']))
To DataFrame
df = pd.DataFrame(data = zip(datePublished, headline, description, urls), columns=['date','title', 'description', 'link'])
df
To expand on my comments, this should work:
maxPage = 9
datePublished = []
headline = []
description = []
urls = []
for page in range(1, maxPage+1):
    url = f"https://www.consilium.europa.eu/en/press/press-releases/?page={page}"
    res = requests.get(url)
    print(f'[page {page:>3}]', res.status_code, res.reason, 'from', res.url)
    soup = BeautifulSoup(res.content, 'lxml')
    soup_content = soup.find_all('li', {'class':['list-item ceu clearfix','list-item gsc clearfix','list-item euco clearfix','list-item eg clearfix']})
    for i in range(len(soup_content)):
        datePublished.append(soup_content[i].find('span', {'itemprop': 'datePublished'}).attrs['content'])
        headline.append(soup_content[i].find('h3', {'itemprop': 'headline'}).get_text().strip())
        description.append(soup_content[i].find('p', {'itemprop': 'description'}).get_text().strip())
        urls.append('https://www.consilium.europa.eu{}'.format(soup_content[i].find('a', {'itemprop': 'url'}).attrs['href']))
When I ran it, 179 unique rows were collected [20 rows from all pages except the 7th, which had 19].
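To assemble the collected lists into a DataFrame, the `pd.DataFrame(zip(...))` pattern from your question works as-is. A small self-contained sketch with placeholder values (not live scrape results); the `drop_duplicates()` call is an optional guard in case an item appears on more than one page:

```python
import pandas as pd

# Placeholder values standing in for the scraped lists above.
datePublished = ["2023-01-30", "2023-01-30"]
headline = ["Headline A", "Headline B"]
description = ["Desc A", "Desc B"]
urls = ["https://www.consilium.europa.eu/a", "https://www.consilium.europa.eu/b"]

df = pd.DataFrame(
    data=zip(datePublished, headline, description, urls),
    columns=["date", "title", "description", "link"],
)
df = df.drop_duplicates()  # optional: drop rows repeated across pages
```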
There are different ways to reach your goal:
@Driftr95's answer modifies your code using range(), which is fine when iterating over a specific number of pages.
Use a while loop to stay flexible about the number of pages when you do not know the exact count. You can also add a counter if you want to break the loop after a certain number of iterations.
...
I would recommend the second option, and also avoiding the bunch of separate lists, because you have to ensure they all stay the same length. Instead, use a single list of dicts, which is more structured.
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

base_url = 'https://www.consilium.europa.eu'
path = '/en/press/press-releases'
url = base_url + path
data = []

while True:
    print(url)
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    for e in soup.select('li.list-item'):
        data.append({
            'date': e.find_previous('h2').text,
            'title': e.h3.text,
            'desc': e.p.text,
            'url': base_url + e.h3.a.get('href')
        })
    next_page = soup.select_one('li[aria-label="Go to the next page"] a[href]')
    if next_page:
        url = base_url + path + next_page.get('href')
    else:
        break

df = pd.DataFrame(data)
Output
|     | date | title | desc | url |
|-----|------|-------|------|-----|
| 0 | 30 January 2023 | Statement by the High Representative on behalf of the EU on the alignment of certain third countries concerning restrictive measures in view of the situation in the Democratic Republic of the Congo | Statement by the High Representative on behalf of the European Union on the alignment of certain third countries with Council Implementing Decision (CFSP) 2022/2398 of 8 December 2022 implementing Decision 2010/788/CFSP concerning restrictive measures in view of the situation in the Democratic Republic of the Congo. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/30/statement-by-the-high-representative-on-behalf-of-the-eu-on-the-alignment-of-certain-third-countries-concerning-restrictive-measures-in-view-of-the-situation-in-the-democratic-republic-of-the-congo/ |
| 1 | 30 January 2023 | Council adopts recommendation on adequate minimum income | The Council adopted a recommendation on adequate minimum income to combat poverty and social exclusion. Income support is considered adequate when it ensures a life in dignity at all stages of life. Member states are recommended to gradually achieve the adequate level of income support by 2030 at the latest, while safeguarding the sustainability of public finances. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/30/council-adopts-recommendation-on-adequate-minimum-income/ |
| 2 | 27 January 2023 | Forward look: 30 January - 12 February 2023 | Overview of the main subjects to be discussed at meetings of the Council of the EU over the next two weeks and upcoming media events. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/forward-look/ |
| 3 | 27 January 2023 | Russia: EU prolongs economic sanctions over Russia’s military aggression against Ukraine | The Council prolonged restrictive measures in view of Russia's actions destabilising the situation in Ukraine by six months. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/russia-eu-prolongs-economic-sanctions-over-russia-s-military-aggression-against-ukraine/ |
| 4 | 27 January 2023 | Media advisory – Agriculture and Fisheries Council meeting on 30 January 2023 | Main agenda items, approximate timing, public sessions and press opportunities. | https://www.consilium.europa.eu/en/press/press-releases/2023/01/27/media-advisory-agriculture-and-fisheries-council-meeting-on-30-january-2023/ |
| ... | ... | ... | ... | ... |
| 435 | 6 July 2022 | EU support to the African Union Mission in Somalia: Council approves further support under the European Peace Facility | The Council approved €120 million in support to the military component of AMISOM/ATMIS for 2022 under the European Peace Facility. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/06/eu-support-to-the-african-union-mission-in-somalia-council-approves-further-support-under-the-european-peace-facility/ |
| 436 | 6 July 2022 | Report by President Charles Michel to the European Parliament plenary session | Report by European Council President Charles Michel to the European Parliament plenary session on the outcome of the European Council meeting of 23-24 June 2022. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/06/report-by-president-charles-michel-to-the-european-parliament-plenary-session/ |
| 437 | 5 July 2022 | Declaration by the High Representative on behalf of the EU on the alignment of certain countries concerning restrictive measures against ISIL (Da’esh) and Al-Qaeda and persons, groups, undertakings and entities associated with them | Declaration by the High Representative on behalf of the European Union on the alignment of certain third countries with Council Decision (CFSP) 2022/950 of 20 June 2022 amending Decision (CFSP) 2016/1693 concerning restrictive measures against ISIL (Da’esh) and Al-Qaeda and persons, groups, undertakings and entities associated with them. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/05/declaration-by-the-high-representative-on-behalf-of-the-eu-on-the-alignment-of-certain-countries-concerning-restrictive-measures-against-isil-da-esh-and-al-qaeda-and-persons-groups-undertakings-and-entities-associated-with-them/ |
| 438 | 5 July 2022 | Remarks by President Charles Michel after his meeting in Skopje with Prime Minister of North Macedonia Dimitar Kovačevski | During his visit to North Macedonia, President Michel expressed his support for proposed compromise solution on the country's accession negotiations. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/05/remarks-by-president-charles-michel-after-his-meeting-in-skopje-with-prime-minister-of-north-macedonia-dimitar-kovacevski/ |
| 439 | 4 July 2022 | Readout of the telephone conversation between President Charles Michel and Prime Minister of Ethiopia Abiy Ahmed | President Charles Michel and Prime Minister of Ethiopia Abiy Ahmed valued their open and frank exchange and agreed to speak in the near future to take stock. | https://www.consilium.europa.eu/en/press/press-releases/2022/07/04/readout-of-the-telephone-conversation-between-president-charles-michel-and-prime-minister-of-ethiopia-abiy-ahmed/ |
| ... | ... | ... | ... | ... |
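The counter idea mentioned above can be sketched as follows. Here `fetch_next_url` is a hypothetical stand-in for the requests/BeautifulSoup lookup of the "Go to the next page" link, so the control flow runs without network access; in the real loop you would fetch and parse each `url` as in the example above.

```python
max_pages = 3  # hypothetical cap on iterations

def fetch_next_url(url):
    """Stand-in for the requests/BeautifulSoup lookup of the
    'Go to the next page' link; returns None when there is no next page."""
    page = int(url.rsplit("=", 1)[-1])
    return f"/en/press/press-releases/?page={page + 1}" if page < 10 else None

url = "/en/press/press-releases/?page=1"
visited = []
count = 0
while url:
    count += 1
    if count > max_pages:
        break  # counter-based exit, even though more pages exist
    visited.append(url)  # the real loop would fetch and parse `url` here
    url = fetch_next_url(url)
```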
I have a population categorized based on four characteristics: gender (female, male), age (young, middle-aged, old), geography (urban, rural) and education (lower, higher).
This leaves me with 24 possible combinations:
[1] Female Young Urban Lower
[2] Female Young Urban Higher
[3] Female Young Rural Lower
[4] Female Young Rural Higher
...
[24] Male Old Rural Higher
Let's say I want to draw a stratified sample of 25 people, among which:
Female: 13
Male: 12
---------------
Young: 7
Middle-aged: 8
Old: 10
---------------
Urban: 15
Rural: 10
---------------
Lower: 20
Higher: 5
In order to do so, I want to determine which combinations of the profiles above will allow me to achieve this distribution (e.g. 2x[1], 3x[2], 1x[3] ... 2x[24]), using R.
I thought I could (i) create a dataset with my 24 combinations having value 1-3 (using the crossing() function), (ii) calculate the products and (iii) check if they match my distribution. However, I do not even manage to create the base dataset because it is too large for my memory (3^24 = 282429536481)...
Is there someone who could help me achieve this in an easier way (with a loop maybe, that checks whether a combination matches and drops it immediately if it does not, in order to save memory; or just a much easier way I did not think of)?
Many thanks in advance.
Let's say I have the following data frame:
df <- data.frame(address=c('654 Peachtree St','890 River Rd','890 River Rd','890 River Rd','1234 Main St','1234 Main St','567 1st Ave','567 1st Ave'), city=c('Atlanta','Eugene','Eugene','Eugene','Portland','Portland','Pittsburgh','Etna'), state=c('GA','OR','OR','OR','OR','OR','PA','PA'), zip5=c('30308','97404','97404','97404','97201','97201','15223','15223'), zip9=c('30308-1929','97404-3253','97404-3253','97404-3253','97201-5717','97201-5000','15223-2105','15223-2105'), stringsAsFactors = FALSE)
  address          city       state zip5  zip9
1 654 Peachtree St Atlanta    GA    30308 30308-1929
2 890 River Rd     Eugene     OR    97404 97404-3253
3 890 River Rd     Eugene     OR    97404 97404-3253
4 890 River Rd     Eugene     OR    97404 97404-3253
5 1234 Main St     Portland   OR    97201 97201-5717
6 1234 Main St     Portland   OR    97201 97201-5000
7 567 1st Ave      Pittsburgh PA    15223 15223-2105
8 567 1st Ave      Etna       PA    15223 15223-2105
I'm considering any rows with a matching address and zip5 to be duplicates.
Filtering out or keeping duplicates based on these two columns is simple enough in R. What I'm trying to do is create a new column with a conditional label for each set of duplicates, ending up with something similar to this:
  address      city       state zip5  zip9       type
1 890 River Rd Eugene     OR    97404 97404-3253 Exact Match
2 890 River Rd Eugene     OR    97404 97404-3253 Exact Match
3 890 River Rd Eugene     OR    97404 97404-3253 Exact Match
4 1234 Main St Portland   OR    97201 97201-5717 Different Zip9
5 1234 Main St Portland   OR    97201 97201-5000 Different Zip9
6 567 1st Ave  Pittsburgh PA    15223 15223-2105 Different City
7 567 1st Ave  Etna       PA    15223 15223-2105 Different City
(I'd also be fine with a True/False column for each type of duplicate.)
I'm assuming the solution will be in some mutate+ifelse+boolean code, but I think it's the comparing within each duplicate subset that has me stuck...
Any advice?
Edit:
I don't believe this is a duplicate of Find duplicated rows (based on 2 columns) in Data Frame in R. I can use that solution to create a T/F column for each type of duplicate/group_by match, but I'm trying to create exclusive categories. How could my conditions also take differences into account? The exact match rows should show true only on the "exact match" column, and false for every other column. If I define my columns simply by feeding different combinations of columns to group_by, the exact match rows will never return a False.
I think the key is grouping by a "reference" variable (here, address makes sense) and then counting the number of unique items in each of the other columns. It's not a perfect solution, since my use of case_when will prioritize earlier options: if there are two different cities attributed to one address AND two different zip codes, you'll only see that there are two different cities (you'll need additional case_when statements if that matters). However, counting unique items is a reasonable heuristic if you don't need a perfectly granular solution.
library(dplyr)

df %>%
  group_by(address) %>%
  mutate(
    match_type = case_when(
      all(
        length(unique(city)) == 1,
        length(unique(state)) == 1,
        length(unique(zip5)) == 1,
        length(unique(zip9)) == 1) ~ "Exact Match",
      length(unique(city)) > 1 ~ "Different City",
      length(unique(state)) > 1 ~ "Different State",
      length(unique(zip5)) > 1 ~ "Different Zip5",
      length(unique(zip9)) > 1 ~ "Different Zip9"
    ))
Otherwise, you'll have to do iterative grouping (address + other variable) and mutate in a Boolean column as you alluded to.
Edit
One additional approach, if you need a more granular solution, is to add an id column (df %>% rowid_to_column("ID")) and then do a full join of the table to itself by address with suffixes (e.g. suffix = c("a","b")), filter out rows with matching IDs, call distinct() (since each comparison appears twice), and then mutate in Boolean columns for the pairwise comparisons. It may be too computationally intensive depending on the size of your dataset, but it should work on the scale of a few thousand rows if you have a reasonable amount of RAM.
I'm reading a book and found this code, which I tried, and I'm a little bit confused about the graph I'm getting.
This is a sample of the data:
consumption[sample(1:nrow(consumption), 5, replace=F),]
Food Units Year Amount
8 Fruits and Vegetables Pounds 1980 603.57948
31 Caloric sweeteners Pounds 1995 144.08113
16 Fruits and Vegetables Pounds 1985 630.24491
28 Eggs Number 1995 232.28203
19 Fish and Shellfist Pounds 1990 14.94411
And I'm getting this graph, in which the y-axis values are numbers from 1 to 20 rather than the correct Amounts.
What can I do so that the Amount values on the y axis show correctly?
The figure you show is just like the one in the book (R in a Nutshell) that provided you with the code. Actually, the book provides code for two different versions of the same plot. I suggest trying them both.
library(nutshell)
data(consumption)
library(lattice)
dotplot(Amount ~ Year | Food, consumption)
dotplot(Amount ~ Year | Food, consumption,
aspect="xy", scales=list(relation="sliced", cex=.4))
My data looks like this.
AK ALASKA DEPT OF PUBLIC SAFETY 1005-00-073-9421 RIFLE,5.56 MILLIMETER
AK ALASKA DEPT OF PUBLIC SAFETY 1005-00-073-9421 RIFLE,5.56 MILLIMETER
I am looking to filter the data in multiple different ways. For example, I filter by the type of equipment, such as column 4, with the code
rifle.off <- city.data[[i]][city.data[[i]][,4]=="RIFLE,5.56 MILLIMETER",]
Where city.data is a list of matrices with data from 31 cities (so I iterate through a for loop to isolate the rifle data for each city). I would like to also filter by the number in the third column. Specifically, I only need to filter by the first two digits, i.e. I would like to isolate all line items where the number in column 3 begins with '10'. How would I modify my above code to isolate only the first two digits but let all the other digits be anything?
Edit: providing an example of the city.data matrix, as requested. First off, city.data is a list made with:
city.data <- list(albuq, austin, baltimore, charlotte, columbus, dallas, dc, denver, detroit)
where each city name is a matrix. Each individual matrix is isolated by police department using:
phoenix <- vector()
for (i in 1:nrow(gun.mat)){
  if (gun.mat[i,2]=="PHOENIX DEPT OF PUBLIC SAFETY"){
    phoenix <- rbind(gun.mat[i,], phoenix)
  }
}
where gun.mat is just the original matrix containing all observations. phoenix looks like
state police.dept nsn type quantity price date.shipped name
AZ PHOENIX DEPT OF PUBLIC SAFETY 1240-01-411-1265 SIGHT,REFLEX 1 331 1 3/29/13 OPTICAL SIGHTING AND RANGING EQUIPMENT
AZ PHOENIX DEPT OF PUBLIC SAFETY 1240-01-411-1265 SIGHT,REFLEX 1 331 1 3/29/13 OPTICAL SIGHTING AND RANGING EQUIPMENT
AZ PHOENIX DEPT OF PUBLIC SAFETY 1240-01-411-1265 SIGHT,REFLEX 1 331 1 3/29/13 OPTICAL SIGHTING AND RANGING EQUIPMENT
Try this:
Take the original data that you have in the first block of the question and subset it:
Rifle556 <- subset(data, data$column4 == "RIFLE,5.56 MILLIMETER")
After that, subset again, keeping only the rows whose value in column 3 starts with "10":
Rifle55610 <- subset(Rifle556, grepl('^10', column3))
This way you have the data subset according to your condition.