Changing Column Names using a Key - r

I have a data frame with 3 letter key column headings, which looks like:
MFB MBB WBB
X X X
and another data frame with the full names:
Key Descr
MFB Men's Football
MBB Men's Basketball
WBB Women's Basketball
My question is, how would I go about renaming the columns so the original table looks like:
Men's Football Men's Basketball Women's Basketball
X X X
There are about 80 column headings I want to rename, so manually renaming each column is not desired. My guess is it could be done using a for loop or the 'map2' function from the 'purrr' library, but I am not sure where to start.

Similar to Rename multiple columns given character vectors of column names and replacement
To make your question fully reproducible, let's start with
library(tidyverse)
sports <- tibble(MFB=c("bears", "texans", "packers"),
MBB=c("bulls", "heat", "bucks"),
WBB=c("dream", "sky", "sun"))
pairs <- tibble(Key=c("MFB", "MBB", "WBB"),
Descr=c("Men's Football", "Men's Basketball", "Women's Basketball"))
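If the rows of pairs are in the same order as the columns of sports, then a simple
setNames(sports, pairs$Descr)
works. Otherwise, rename by key so that the order does not matter (rename_at is superseded in dplyr 1.0.0 and later, but it still works):
sports %>% rename_at(pairs$Key, ~pairs$Descr)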
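If you would rather not depend on dplyr, the same key-based rename can be sketched in base R with match(). This is only an illustration: the columns are shuffled on purpose, and the EXTRA column is a hypothetical one added to show that columns without a key are left alone.

```r
# Toy data from the answer above, with the columns deliberately shuffled
sports <- data.frame(
  WBB = c("dream", "sky", "sun"),
  MFB = c("bears", "texans", "packers"),
  MBB = c("bulls", "heat", "bucks"),
  EXTRA = 1:3,  # hypothetical column with no entry in the key table
  stringsAsFactors = FALSE
)
pairs <- data.frame(
  Key = c("MFB", "MBB", "WBB"),
  Descr = c("Men's Football", "Men's Basketball", "Women's Basketball"),
  stringsAsFactors = FALSE
)

# For every column name, find its row in the key table (NA if there is none)
idx <- match(names(sports), pairs$Key)
found <- !is.na(idx)

# Rename only the columns that have a key; the others keep their names
names(sports)[found] <- pairs$Descr[idx[found]]
names(sports)
```

Because the lookup goes through match(), it does not matter whether the key table is sorted like the column headings.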

Related

stringdist_join not merging data

I have three data frames that need to be merged. There are a few small differences between the competitor names in each data frame. For instance, one data frame might be missing the space between a person's middle and last name, while the other displays the name correctly (example: Sarah JaneDoe vs. Sarah Jane Doe). So, I used the fuzzyjoin package. When I run the code below, it just keeps running. I can't figure out how to fix this.
Can you identify where I went wrong?
library(fuzzyjoin)
library(tidyverse)
temp1 = read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/temp1.csv')
stats=read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/stats.csv')
winners = read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/winners.csv')
#perform fuzzy matching full join
star = stringdist_join(temp1, stats,
by='Name', #match based on Name
mode='full', #use full join
method = "jw", #use jw distance metric
max_dist=99,
distance_col='dist') %>%
group_by(Name.x)

Replace strings in a dataframe based on another dataframe

I have a 200k-row dataframe with a character column named "departament_name". Some of the values in this column contain a specific character, "?". For example: "GENERAL SAN MART?N", "UNI?N", etc.
I want to replace those values using another 750k-row dataframe that contains a column also named "departament_name", but where the values are correct. Following the example, they would be "GENERAL SAN MARTIN", "UNION", and so on.
Can I do this automatically using pattern recognition, without making a dictionary? (There are several values with this problem.)
My objective is to have a unified dataset from the two dataframes, with unique values for those problematic rows in "departament_name". I prefer tidyverse (mutate, stringr, etc.) if possible.
You can try using the stringdist_* joins from the fuzzyjoin package.
fuzzyjoin::stringdist_left_join(df1, df2, by = 'departament_name')
# departament_name.x departament_name.y
#1 GENERAL SAN MART?N GENERAL SAN MARTIN
#2 UNI?N UNION
Obviously, this works for the simple example you have shared but it might not give you 100% correct result for all the entries in your actual data. You can tweak the parameters max_dist and method as per your data. See ?fuzzyjoin::stringdist_join for more information about them.
data
df1 <- data.frame(departament_name = c("GENERAL SAN MART?N", "UNI?N"))
df2 <- data.frame(departament_name = c("GENERAL SAN MARTIN", "UNION"))
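To get a feel for what a distance threshold such as max_dist does, here is a rough base-R sketch of the same idea using utils::adist (edit distance) instead of fuzzyjoin. It is not the fuzzyjoin implementation; the extra value "ROSARIO" and the threshold of 2 are made-up assumptions for illustration.

```r
dirty <- c("GENERAL SAN MART?N", "UNI?N")
clean <- c("GENERAL SAN MARTIN", "UNION", "ROSARIO")  # "ROSARIO" is a made-up extra value

# Edit-distance matrix: one row per dirty name, one column per clean name
d <- utils::adist(dirty, clean)

# Closest clean candidate for each dirty name, and how far away it is
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(dirty), best)]

# Accept the candidate only if it is close enough (hypothetical threshold)
max_dist <- 2
fixed <- ifelse(best_dist <= max_dist, clean[best], dirty)
fixed
```

With data of the size described in the question, the full distance matrix would be far too large, so it probably makes sense to run this only on the unique values that contain "?" against the unique correct values.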

Creating horizontal dataframe from vertical table (with repeated variables)

Need to create usable dataframe using R or Excel
Variable1           ID      Variable2
Name of A person 1  002157  NULL
Drugs used          NULL    3.0
Days in hospital    NULL    2
Name of a surgeon   NULL    JOHN T.
Name of A person 2  002158  NULL
Drugs used          NULL    4.0
Days in hospital    NULL    5
Name of a surgeon   NULL    ADAM S.
I have a table exported from 1C (accounting software). It contains more than 20 thousand observations. A task is to analyze: How many drugs were used and how many days the patient stayed in the hospital.
For that reason, I need to transform this dataframe into a second dataframe that is suitable for analysis (from vertical to horizontal). Basically, I have to create a dataframe consisting of 4 columns: ID, Drugs used, Hospital stay, and Name of a surgeon. I am guessing that it requires two functions:
for ID, it must read the first dataframe and extract the filled rows
for Name of a surgeon, Drugs used and Days in hospital, the function has to check that the row corresponds to one of those variables and extract the data from the third column, adding it to the second dataframe.
In short, I have no idea how to do that. Could you help me write functions for R, or give me tips for Excel?
For R, I guess you want something like this:
1) Load the table. Make sure to substitute the "," with the separator that is used in your file (could be ";" or "\t" for tab, etc.). This assumes the file has a header row; na.strings makes R read the NULL entries as NA.
df = read.table("path/to/file", sep=",", header=TRUE, na.strings="NULL")
2) Create subset tables that contain only one row per patient.
id = subset(df, !is.na(ID)) # is.null() would test the whole column at once, not each row
drugs = subset(df, Variable1 %in% "Drugs used")
days = subset(df, Variable1 %in% "Days in hospital")
#...etc...
3) Make a new data frame that contains this information.
new_df = data.frame(
id = id$ID,
drugs = drugs$Variable2,
days = days$Variable2
#...etc... (put a comma between entries, but none after the last!)
)
EDIT:
Note that this approach only works if your table is basically perfect! Otherwise there might be shifts in the data.
#=====================================================
EDIT 2:
If you have an imperfect table, you might wanna do something like this:
Step 1.5) Change all NA values in the ID column to the patient ID from the row above. In your table these are labeled NULL; R only reads them as NA if you tell it to, for example with na.strings="NULL" when loading the file. Note that the is.na() function in the code below is specifically for NA, and will not work with NULL or "NULL" or other stuff:
for(i in seq_along(df$ID)){
if(is.na(df$ID[i])) df$ID[i] <- df$ID[i-1]
}
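As a side note, the same fill-down can be sketched without a loop. This is only an alternative illustration, and it assumes the very first ID is not NA (the loop above makes the same assumption).

```r
# IDs as they look after loading: only the "Name of A person" rows carry one
id <- c("002157", NA, NA, NA, "002158", NA, NA, NA)

# cumsum() counts how many non-missing IDs have been seen so far,
# which is exactly the position of the ID to carry forward
filled <- id[!is.na(id)][cumsum(!is.na(id))]
filled
```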
Then go back to step 2) above (you don't need the id subset, though) and change each data frame a little. As an example, for the drugs and days data frames:
drugs = drugs[, -1] #removes the first column
colnames(drugs) = c("ID","drugs") #renames the columns
days = days[, -1]
colnames(days) = c("ID", "days")
Then, instead of doing step 3 as above, use merge and choose the ID column as the merging column. Pass all=TRUE so that patients missing from one of the tables are kept:
new_df = merge(drugs, days, by="ID", all=TRUE)
Repeat this for the other subsetted data frames:
new_df = merge(new_df, surgeon, by="ID", all=TRUE)
# etc...
That is much more robust: even if some patients have a line that others don't have (e.g. days), the respective column in this new data frame will just contain an NA for those patients.
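Putting the steps together on the sample table from the question, a self-contained sketch could look like the following. The helper name pick is hypothetical, and the data frame is typed in by hand here instead of being read from the 1C export.

```r
# Sample table from the question, with NULL already read in as NA
df <- data.frame(
  Variable1 = c("Name of A person 1", "Drugs used", "Days in hospital", "Name of a surgeon",
                "Name of A person 2", "Drugs used", "Days in hospital", "Name of a surgeon"),
  ID = c("002157", NA, NA, NA, "002158", NA, NA, NA),
  Variable2 = c(NA, "3.0", "2", "JOHN T.", NA, "4.0", "5", "ADAM S."),
  stringsAsFactors = FALSE
)

# Step 1.5: carry each patient ID down to the rows that belong to it
for (i in seq_along(df$ID)[-1]) {
  if (is.na(df$ID[i])) df$ID[i] <- df$ID[i - 1]
}

# Step 2: one small table per variable, keeping only ID and the value
# (pick is a hypothetical helper, not part of any package)
pick <- function(label, new_name) {
  out <- df[df$Variable1 %in% label, c("ID", "Variable2")]
  names(out) <- c("ID", new_name)
  out
}

# Merge the small tables on ID; all = TRUE keeps patients with missing lines
new_df <- Reduce(
  function(x, y) merge(x, y, by = "ID", all = TRUE),
  list(pick("Drugs used", "drugs"),
       pick("Days in hospital", "days"),
       pick("Name of a surgeon", "surgeon"))
)

# The values arrive as text, so convert the numeric ones
new_df$drugs <- as.numeric(new_df$drugs)
new_df$days <- as.integer(new_df$days)
new_df
```

The IDs are kept as character on purpose, so that leading zeros such as in 002157 are not lost.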

Subset using 2 dataframes of different sizes R

I have 2 data frames of different sizes
Season: is a fixture list for Australian Rules Football
Strength: has ratings for different aspects of a team, for each team in the league
I want to create a for loop that, for each row of Season, finds the row in Strength that matches the home team and assigns that row to a variable HOME, and then does the same for AWAY.
HOME and AWAY will then be used to compute a probability, which is inserted in a new column of the Season data frame.
But I cannot get Strength to filter by Season in the loop. This is how I tried:
for(row in 1:nrow(Season)){
HOME<-strength%>%
filter(Season$HomeTeam == Strength$Team)
Away<-strength%>%
filter(Season$AwayTeam == Strength$Team)
}
I just keep receiving this error message:
longer object length is not a multiple of shorter object length
Any help would be appreciated
Thanks,
Dave
Answered in comments:
library(dplyr)
HOME <- Strength %>%
filter(Team %in% Season$HomeTeam)
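The message in the question comes from comparing two vectors of different lengths with ==. If the goal is one home and one away rating per fixture, a loop is not needed: match() can look up each team's row directly. The sketch below is only an illustration in base R; the team names and the Rating column are made up, since the question does not show the real columns of Strength.

```r
# Hypothetical fixture list and ratings table
Season <- data.frame(
  HomeTeam = c("Cats", "Swans", "Lions"),
  AwayTeam = c("Swans", "Lions", "Cats"),
  stringsAsFactors = FALSE
)
Strength <- data.frame(
  Team = c("Cats", "Swans", "Lions"),
  Rating = c(1.2, 0.9, 1.05),  # made-up ratings
  stringsAsFactors = FALSE
)

# Row of Strength that belongs to each fixture's home and away team
home_idx <- match(Season$HomeTeam, Strength$Team)
away_idx <- match(Season$AwayTeam, Strength$Team)

# Attach the ratings to the fixture list, one value per row of Season
Season$HomeRating <- Strength$Rating[home_idx]
Season$AwayRating <- Strength$Rating[away_idx]
Season
```

Any probability formula can then be computed column-wise from HomeRating and AwayRating and stored as another column of Season.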

How to create a lookup in R?

I have a data frame in R that examines the ELO rating of college football teams over the course of several decades.
Data Layout
Each row is a specific game, and the team listed under the Team.A column is a winning team while the team under Team.B is a losing team. Also, the ELO scores under Elo.A represent the score for Team.A and the ELO scores under Elo.B represent the score for Team.B for those games, respectively.
I want to create a time-series that, for instance, looks at all of the ELO scores in Elo.A and Elo.B for Minnesota. Is there a way in R that can pull the date and scores in both of those columns for that one school?
How about:
df[df$Team.A=="Minnesota" | df$Team.B=="Minnesota", ]
You can also select specific columns by putting c(...) in the space after the ','.
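To go from that filter to a single time series, the score has to be taken from Elo.A when the school is Team.A and from Elo.B otherwise. Here is a minimal sketch, assuming a Date column exists (the question does not name it) and using made-up games and scores:

```r
# Made-up games; Date is an assumed column name
df <- data.frame(
  Date = as.Date(c("2019-09-07", "2019-09-14", "2019-09-21")),
  Team.A = c("Minnesota", "Team X", "Minnesota"),
  Team.B = c("Team Y", "Minnesota", "Team Z"),
  Elo.A = c(1510, 1600, 1525),
  Elo.B = c(1480, 1500, 1450),
  stringsAsFactors = FALSE
)

team <- "Minnesota"
is_a <- df$Team.A == team  # games the team won
is_b <- df$Team.B == team  # games the team lost
keep <- is_a | is_b

# Take the score from whichever column the team appears in
series <- data.frame(
  Date = df$Date[keep],
  Elo = ifelse(is_a, df$Elo.A, df$Elo.B)[keep]
)
series <- series[order(series$Date), ]
series
```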
