Cross-Referencing Tables in Jupyter Notebook

I am having difficulty finding the right command to create a new table that contains the stats of specific players based on the names of players that are in a separate table I made. For example, I have a table that contains the names of multiple NBA players in the table Daddy_Gang and I want to pull their stats from the stats table into a new table that displays the stats of the players from Daddy_Gang.

First off, don't post photos of your datasets, code, errors, etc. Post the actual dataset (or a small sample), the actual code, and so on. No one really wants to manufacture a dataset and/or retype your code from an image.
That said, once you have your list of names you can use .isin().
Also double-check how you are creating that list: at the moment you are creating a list of lists.
So change that line to:
g = list(Daddy_Gang.values)
Code:
import pandas as pd

stats = pd.DataFrame(
    [['Aaron Gordon', 3, 45],
     ['Coby White', 5, 33],
     ['Zach LaVine', 7, 22],
     ['apad13', 0, 0]],
    columns=['Player', 'G', 'PTS'])

# the list of names to keep (build it from your Daddy_Gang table)
g = ['Coby White', 'Zach LaVine']
filtered_df = stats[stats['Player'].isin(g)]
Output:
print(filtered_df)
Player G PTS
1 Coby White 5 33
2 Zach LaVine 7 22
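For completeness, a join does the same job as .isin() and also brings across any extra columns from the lookup table. A minimal sketch, assuming Daddy_Gang stores the names in a column called Player (that column name is a guess):

```python
import pandas as pd

stats = pd.DataFrame(
    [['Aaron Gordon', 3, 45],
     ['Coby White', 5, 33],
     ['Zach LaVine', 7, 22],
     ['apad13', 0, 0]],
    columns=['Player', 'G', 'PTS'])

# Hypothetical lookup table of the players you care about
Daddy_Gang = pd.DataFrame({'Player': ['Coby White', 'Zach LaVine']})

# An inner merge keeps only the rows whose Player appears in both tables
filtered_df = stats.merge(Daddy_Gang, on='Player', how='inner')
```

Unlike .isin(), a merge would also carry over any additional columns Daddy_Gang might have (e.g. team or position).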

Related

Calculating row sums in data frame based on column names

I have a data frame with media spending for different media channels:
TV <- c(200,500,700,1000)
Display <- c(30,33,47,55)
Social <- c(20,21,22,23)
Facebook <- c(30,31,32,33)
Print <- c(50,51,52,53)
Newspaper <- c(60,61,62,63)
df_media <- data.frame(TV,Display,Social,Facebook, Print, Newspaper)
My goal is to calculate the row sums of specific columns based on their name.
For example: Per definition Facebook falls into the category of Social, so I want to add the Facebook column to the Social column and just have the Social column left. The same goes for Newspaper which should be added to Print and so on.
The challenge is that the names and the number of columns that belong to one category change from data set to data set, e.g. the next data set could contain Social, Facebook and Instagram which should be all summed up to Social.
There is a list of rules, which define which media types (column names) belong to each other, but I have to admit that I'm a bit clueless and can only think about a long set of if commands right now, but I hope there is a better solution.
I'm thinking about putting all the names that belong to each other in vectors and use them to find and summarize the relevant columns, but I have no idea, how to execute this.
Any help is appreciated.
You could do something along these lines, which allows columns to be absent from a particular data set (thanks to intersect and setdiff):
Define a set of rules, i.e. the columns that are going to be united/grouped together.
Create a vector d of the remaining columns.
Compute the rowSums of every subset of the data set defined in the rules.
Append the remaining columns.
cbind the columns of the list using do.call.
# Rules
rules <- list(social = c("Social", "Facebook", "Instagram"),
              printed = c("Print", "Newspaper"))
# Columns that are not going to be united
d <- setdiff(colnames(df_media), unlist(rules))

# Data frame
lapply(rules, function(x) rowSums(df_media[, intersect(colnames(df_media), x)])) |>
  append(df_media[, d]) |>
  do.call(cbind.data.frame, args = _)
social printed TV Display
1 50 110 200 30
2 52 112 500 33
3 54 114 700 47
4 56 116 1000 55
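For readers coming from pandas rather than R, the same rule-driven row sums can be sketched like this (the data and rules mirror the example above; this is an illustration, not part of the original answer):

```python
import pandas as pd

df_media = pd.DataFrame({
    "TV": [200, 500, 700, 1000],
    "Display": [30, 33, 47, 55],
    "Social": [20, 21, 22, 23],
    "Facebook": [30, 31, 32, 33],
    "Print": [50, 51, 52, 53],
    "Newspaper": [60, 61, 62, 63],
})

# Rules: which raw columns collapse into which category
rules = {"social": ["Social", "Facebook", "Instagram"],
         "printed": ["Print", "Newspaper"]}

# Row-sum each rule's columns; intersection() tolerates columns
# (like Instagram) that are absent from this particular data set
grouped = {name: df_media[df_media.columns.intersection(cols)].sum(axis=1)
           for name, cols in rules.items()}

# Keep every column not covered by a rule, untouched
rest = df_media.drop(columns=[c for cols in rules.values() for c in cols],
                     errors="ignore")

result = pd.concat([pd.DataFrame(grouped), rest], axis=1)
```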

How do I write data to a single column with different indexes?

Hi everybody. My program counts statistics for all groups. So, for example, I have groups 1, 2, 3, 4. The program processes them in cycles: 1-2; 1-3; 1-4; 2-3; 2-4; 3-4; but NOT 2-1, because that is the same as the first pair, and so on.
I created a new data set with columns c('n', 'v', 'd', 'pv', 'pvf') from 1 to 3. I would like it to be recorded in this format:
1:...
2:...
3:...
And from 4 to 5:
1-2:...
1-3:...
1-4:...
2-3:...
2-4:...
3-4:...
I wrote a small example:
res <- list()
for (i in 1:(ncol(mtcars) - 1)) {
  for (j in (i + 1):ncol(mtcars)) {
    res <- c(res, list(c(i, j, paste0(mtcars[, i], '_', mtcars[, j]))))
  }
}
res <- do.call(cbind, res)
How do I write data with different indexes to one cell and glue them together?

How to split large Excel file into multiple Excel files using R

I'm looking for a way to split up a large Excel file into a number of smaller Excel files using R.
Specifically, there are three things I would like to do:
I have a large data set consisting of information regarding students (their school, the area in which the school is located, test score A, test score B) that I would like to split up into individual files, one file per school containing all of the students attending that specific school.
I would also like all of the individual Excel files to contain an image covering the first row and columns A, B, C & D of every Excel file. The image will be the same for all the schools in the data set.
Lastly, I would also like the Excel files, after being created, to end up in individual folders on my desktop. Each folder's name would be the area in which its schools are located. An area has about 3-5 schools, so each folder would contain 3-5 Excel files, one for each school.
My data is structured like this:
Area    School   Student ID   Test score A   Test score B
North   A        134          24             31
North   A        221          26             33
South   B        122          22             21
South   B        126          25             25
I have data covering roughly 200 schools located in 5 different areas.
Any guidance on how to do this would be greatly appreciated!
As some of the comments have noted, this is hard to solve without knowing your specific operating environment and folder structure. I solved it on Windows 10 under the C-drive user folder, but you can customize it for your system. You will need a folder containing all of the school images, each saved under the name (or the ID created below) of its school, and all in the same format (JPG or PNG). You also need a folder created in advance for each Area you want to output to (openxlsx can write the files but cannot create the folders for you). Once those are set up, something like the following should work, but I would highly recommend referring to the openxlsx documentation for more detail:
library(dplyr)
library(openxlsx)

# Load your Excel file into a data frame
# g0 <- openxlsx::read.xlsx(<your excel file & sheet..see openxlsx documentation>)
# Replace this tibble with your actual Excel file; it is just an example
g0 <- tibble(
  Area = c("North", "North", "North", "North"),
  School = c("A", "A", "B", "B"),
  Student_ID = c(134, 221, 122, 126),
  test_score_a = c(24, 26, 22, 25),
  test_score_b = c(31, 33, 21, 25))

# Need a numeric school id for the loop
g0$school_id <- as.numeric(as.factor(g0$School))

# Loop through schools, filter with dplyr and create one workbook per school
for (i in 1:n_distinct(g0$school_id)) {
  g1 <- g0 %>%
    filter(school_id == i)
  school <- unique(as.character(g1$School))
  area <- unique(as.character(g1$Area))
  ## Create a new workbook with one worksheet
  wb <- createWorkbook(school)
  addWorksheet(wb, school)
  ## Insert the school's image
  ## (a direct path for the example; switch to a relative path once it works --
  ## note that system.file() is only for files shipped inside a package,
  ## so build the path with file.path() instead)
  img <- file.path("C:", "Users", "your name", "Documents", school,
                   paste0(school, ".jpg"))
  insertImage(wb, school, img, startRow = 5, startCol = 3, width = 6, height = 5)
  ## Save the workbook into the area's folder
  saveWorkbook(wb, file.path("C:", "Users", "your name", "Documents", area,
                             paste0(school, ".xlsx")), overwrite = TRUE)
}
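If R is not a hard requirement, the same one-file-per-school split can be sketched in Python with pandas. The folder layout and column names below are assumptions taken from the question, and the actual file-writing calls are left commented so the sketch runs without openpyxl installed:

```python
import os
import pandas as pd

# Toy data matching the structure described in the question
df = pd.DataFrame({
    "Area": ["North", "North", "South", "South"],
    "School": ["A", "A", "B", "B"],
    "Student ID": [134, 221, 122, 126],
    "Test score A": [24, 26, 22, 25],
    "Test score B": [31, 33, 21, 25],
})

# One sub-frame per school, destined for a folder named after its area
files = {}
for (area, school), grp in df.groupby(["Area", "School"]):
    path = os.path.join("output", area, f"{school}.xlsx")
    files[path] = grp
    # os.makedirs(os.path.dirname(path), exist_ok=True)
    # grp.to_excel(path, index=False)  # writing needs openpyxl installed
```

Unlike openxlsx in R, os.makedirs can create the per-area folders on the fly; openpyxl also supports inserting images into a sheet, though that part is not shown here.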

Create a new row to assign M/F to a column based on heading, referencing second table?

I am new to R (and coding in general) and am really stuck on how to approach this problem.
I have a very large data set; columns are sample ID# (~7000 samples) and rows are gene expression (~20,000 genes). Column headings are BIOPSY1-A, BIOPSY1-B, BIOPSY1-C, ..., BIOPSY200-Z. Each number (1-200) is a different patient, and each sample for that patient is a different letter (-A, -Z).
I would like to do some comparisons between samples that came from men and women. Gender is not included in this gene expression table. I have a separate file with patient numbers (BIOPSY1-200) and their gender M/F.
I would like to code something that will look at the column ID (ex: BIOPSY7-A), recognize that it includes "BIOPSY7" (but not == BIOPSY7 because there is BIOPSY7-A through BIOPSY7-Z), find "BIOPSY7" in the reference file, extrapolate M/F, and create a new row with M/F designation.
Honestly, I am so overwhelmed with coding this that I tried to open the file in Excel to manually input M/F, for the 7000 columns as it would probably be faster. However, the file is so large that Excel crashes when it opens.
Any input or resources that would put me on the right path would be extremely appreciated!!
I don't quite know what your data looks like, so I made up my own based on your description. I'm sure you can modify this answer to fit your needs and your dataset's structure:
library(data.table)
genderfile <-data.frame("ID"=c("BIOPSY1", "BIOPSY2", "BIOPSY3", "BIOPSY4", "BIOPSY5"),"Gender"=c("F","M","M","F","M"))
#you can just read in your gender file to r with the line below
#genderfile <- read.csv("~/gender file.csv")
View(genderfile)
df<-matrix(rnorm(45, mean=10, sd=5),nrow=3)
colnames(df)<-c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY1-C", "BIOPSY2-A", "BIOPSY2-B", "BIOPSY2-C","BIOPSY3-A", "BIOPSY3-B", "BIOPSY3-C","BIOPSY4-A", "BIOPSY4-B", "BIOPSY4-C","BIOPSY5-A", "BIOPSY5-B", "BIOPSY5-C")
df<-cbind(Gene=seq(1:3),df)
df<-as.data.frame(df)
#you can just read your main df into R with the line below; fread prevents the dashes from being turned into periods (you need the data.table package installed and loaded)
#df<-fread("~/first file.csv")
View(df)
Note that the following line of code removes the dash and letter from the column names of df (I removed the first column by df[,-c(1)] because it is the Gene id):
substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2)
#[1] "BIOPSY1" "BIOPSY1" "BIOPSY1" "BIOPSY2" "BIOPSY2" "BIOPSY2" "BIOPSY3" "BIOPSY3" "BIOPSY3" "BIOPSY4" "BIOPSY4"
#[12] "BIOPSY4" "BIOPSY5" "BIOPSY5" "BIOPSY5"
Now, we are ready to match the columns of df with the ID in genderfile to get the Gender column:
Gender<-genderfile[, "Gender"][match(substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2), genderfile[,"ID"])]
Gender
#[1] F F F M M M M M M F F F M M M
Last step is to add the Gender defined above as a row to the df:
df_withGender<-rbind(c("Gender", as.character(Gender)), df)
View(df_withGender)
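The same prefix-matching idea translates to pandas, for anyone who prefers Python; the patient IDs and expression values below are invented for illustration:

```python
import pandas as pd

# Invented lookup of patient ID -> gender
gender = pd.Series({"BIOPSY1": "F", "BIOPSY2": "M", "BIOPSY3": "M"})

# Invented expression table: one column per sample
expr = pd.DataFrame(
    [[10.1, 9.8, 12.3, 11.0]],
    columns=["BIOPSY1-A", "BIOPSY1-B", "BIOPSY2-A", "BIOPSY3-C"])

# Strip the trailing "-<letter>" to recover the patient ID, then look it up
patient = expr.columns.str.replace(r"-[A-Z]$", "", regex=True)
sex_row = gender.reindex(patient)
```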

Trying to scrape a playoff bracket from wikipedia using beautiful soup. How do I identify the correct columns?

I'm trying to scrape the NHL playoff bracket from Wikipedia, for the years 1988 on, using Beautiful Soup 4 in Python. Inconsistent formatting makes this hard (sometimes there is more than one team on a row; see https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs). I would like to identify the Team, Round, and Number of Games Won for every series in a given year.
Initially, I converted the table to text and used regular expressions to identify the teams and the information, but the ordering shifts depending on whether the brackets allow more than one team per row or not.
Now I'm trying to work my way down the rows and count things like the number of cells/column spans, but the results are inconsistent. I'm missing how the 4th-round teams are identified.
What I have so far is an attempt to count the number of cells before a cell with a team is reached...
import requests
from bs4 import BeautifulSoup as soup

hockeyteams = ['Anaheim', 'Arizona', 'Atlanta', 'Boston', 'Buffalo', 'Calgary',
               'Carolina', 'Chicago', 'Colorado', 'Columbus', 'Dallas', 'Detroit',
               'Edmonton', 'Florida', 'Hartford', 'Los Angeles', 'Minnesota',
               'Montreal', 'Nashville', 'New Jersey', 'Ottawa', 'Philadelphia',
               'Pittsburgh', 'Quebec', 'San Jose', 'St. Louis', 'Tampa Bay',
               'Toronto', 'Vancouver', 'Vegas', 'Washington', 'Winnipeg',
               'NY Rangers', 'NY Islanders']

full_link = 'https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs'

# fetch the content from the url
page_response = requests.get(full_link, timeout=5)
# use the html parser to parse the page
page_content = soup(page_response.content, "html.parser")
tables = page_content.find_all('table')

# identify the appropriate table
for table in tables:
    if ('Semi' in table.text) & ('Stanley Cup Finals' in table.text):
        bracket = table
        break

row_num = 0
for row in bracket.find_all('tr'):
    row_num += 1
    print(row_num, '#')
    colcnt = 0
    for col in row.find_all('td'):
        if "colspan" in col.attrs:
            colcnt += int(col.attrs['colspan'])
        else:
            colcnt += 1
        if col.text.strip(' \n') in hockeyteams:
            print(colcnt, col.text)
    print('col width:', colcnt)
Ultimately I'd like something like a dataframe that has:
Round, Team A, Team A Wins, Team B, Team B Wins
1, Tampa Bay, 4, NY Islanders, 1
2, Tampa Bay, 4, Montreal, 0
etc
That table can be scraped with pandas:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs#Playoff_bracket')
bracket = tables[2].dropna(axis=1, how='all').dropna(axis=0, how='all')
print(bracket)
The output is full of NaNs, but it has what I think you're looking for and you can modify it using standard pandas methods.
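If you still want the cell-counting approach from the question, the colspan arithmetic is easier to debug on a tiny inline table than on the live Wikipedia page. A minimal sketch (the HTML fragment is invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented fragment imitating a bracket row: empty spacer cells with
# colspan, then a team cell, then its number of wins
html = """
<table>
  <tr><td colspan="3"></td><td>Tampa Bay</td><td>4</td></tr>
  <tr><td colspan="5"></td><td>NY Islanders</td><td>1</td></tr>
</table>
"""

teams = {"Tampa Bay", "NY Islanders"}
positions = []
for row in BeautifulSoup(html, "html.parser").find_all("tr"):
    col = 0
    for cell in row.find_all("td"):
        col += int(cell.attrs.get("colspan", 1))  # spacers advance by their span
        if cell.text.strip() in teams:
            positions.append((cell.text.strip(), col))
```

The column position at which a team name appears then tells you which round that series belongs to.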
