Data frame subset according to matching values in R - r

I have a data.frame with information on racing performance on horses. I have a variable Competition.year that has a "Total" row and then a row for each year the horse competed. I also have a variable Competition.age that describes the age the horses were in each specific year they competed.
I am trying to create a subsetted df based on their best racing times and the age they were when they achieved it. In the "Total" row, the racing time included is their best one. So, I need to figure out how to tell R that, when the race time in Total row is equal to whenever it is they actually achieved that time, include the age they were then in the new data frame. I am super new to R so I have no idea where to even begin doing this, I've tried some stuff I've seen on other questions but I can't get it right. Any help would be much appreciated!
My df looks like this:
travdata <- data.frame(
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"Time.record.auto.start"=c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
"Time.record.volt.start"=c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
"Competition.age"=c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7))
The desired df should have 223 rows (since that is the total amount of horses I have) with columns Name, Competition.year=="Total", Time.record.auto.start, Time.record.volt.start and Competition.age

Firstly, I had to change your sample data to make sure all 5 variables only had 14 observations each. I did this by removing the final NA in the Competition.age variable. I also had to swap around the 94.3 and 98.3 values in the Time.record.volt.start variable so that the values lined up with what was expected in the Total column for the horse with Name equal to Pelson Poika.
Here is the corrected data:
travdata <- data.frame(
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"Time.record.auto.start"=c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
"Time.record.volt.start"=c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
"Competition.age"=c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7))
And here is a simple dplyr solution, which I think does what you want.
library(dplyr)
df1 <-
travdata %>% group_by(Name) %>% filter(Competition.year == "Total") %>% select(Name, Time.record.auto.start, Time.record.volt.start)
df2 <- travdata %>% filter(Competition.year != "Total")
df3 <-
inner_join(
df1,
df2,
by = c(
"Name" = "Name",
"Time.record.auto.start" = "Time.record.auto.start",
"Time.record.volt.start" = "Time.record.volt.start"
)
)
The dataframe df3 should return what you were after.

Related

Counting up occurrences in a column based on another column

I have a dataframe in R in which every observation is a paper. The continent column denotes which continent the paper looks at and the impacts column denotes which "impacts" it mentions (0 for not mentioned).
It looks like this:
What I am trying to do is this:
I was successful in producing the data in that second picture by doing the following:
water <- with(df,tapply(Water,Continent,FUN=sum))
colnames(water) <- "Water"
temp<- with(df,tapply(Temperature,Continent,FUN=sum))
colnames(temp) <- "Temperature")
new.df <- cbind(water,temp)
new.df2 <- cbind(newColName = rownames(new.df), new.df)
rownames(new.df) <- 1:nrow(new.df)
new.df2 <- melt(new.df2)
names(new.df2) <- c("Continent", "Impact", "Occurrences")
However, I have 50 columns I would like to do this for and am struggling to find a method that would do it succinctly. In sum what i am trying to do is find a easy way to:
sum each "impact" column (every column except the paper number and continent columns) BY the continent column
melt it so that for each Continent and each impact it tells me the number of occurrences that impact appears in that continent
I think a tidyverse approach would look like this, but hard to know for sure without testing on a portion of your data:
tidyr::pivot_longer(
df1,
-c(`paper#`, Continent),
names_to = "Impact",
values_to = "Occurrences"
) %>%
dplyr::group_by(Continent, Impact) %>%
dplyr::summarise(Occurrences = sum(Occurrences))

R- How do I use a lookup table containing threshold values that vary for different variables (columns) to replace values below those thresholds?

I am trying to streamline the process of auditing chemistry laboratory data. When we encounter data where an analyte is not detected I need to change the recorded result to a value equal to 1/2 of the level of detection (LOD) for the analytical method. I have LOD's contained within another dataframe to be used as a lookup table.
I have multiple columns representing data from different analytical tests, each with it's own unique LOD. Here's an example of the type of data I am working with:
library(tidyverse)
dat <- tibble("Lab_ID" = as.character(seq(1,10,1)),
"Tributary" = c('sawmill','paint', 'herring', 'water',
'paint', 'sawmill', 'bolt', 'water',
'herring', 'sawmill'),
"date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
"TP" = c(1.5,15.7,-2.3,7.6,0.1,45.6,12.2,-0.1,22.2,0.6),
"TN" = c(100.3,56.2,-10.5,0.4,-0.3,11.0,45.8,256.0,12.2,144.0),
"DOC" = c(56.0,120.3,-10.5,0.2,14.6,489.3,0.3,14.4,54.6,88.8))
dat
detect_level <- tibble("Parameter" = c('TP', 'TN', 'DOC'),
'LOD' = c(0.6, 11, 0.3)) %>%
mutate(halfLOD=LOD/2)
detect_level
I have poured over multiple other questions with a similar theme:
Change values in multiple columns of a dataframe using a lookup table
R - Match values from multiple columns in a data.frame to a lookup table.
Replace values in multiple columns using different thresholds
and gotten to a point where I have pivoted the data and split it out into a list of dataframes that are specific analytes:
dat %>%
pivot_longer(cols = c('TP','TN','DOC')) %>%
arrange(name) %>%
split(.$name)
I have tried to apply a function using map(), however I cannot figure out how to integrate the values from the lookup table (detect_level) into my code. If someone could help me continue this pipe, or finish the process to achieve a final product dat2 that should look like this I would appreciate it:
dat2 <- tibble("Lab_ID" = as.character(seq(1,10,1)),
"Tributary" = c('sawmill','paint', 'herring', 'water',
'paint', 'sawmill', 'bolt', 'water',
'herring', 'sawmill'),
"date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
"TP" = c(1.5,15.7,0.3,7.6,0.3,45.6,12.2,0.3,22.2,0.6),
"TN" = c(100.3,56.2,5.5,5.5,5.5,11.0,45.8,256.0,12.2,144.0),
"DOC" = c(56.0,120.3,0.15,0.15,14.6,489.3,0.3,14.4,54.6,88.8))
dat2
Another possibility would be from the closest similar question I have found is:
Lookup multiple column from a single table
Here's a snippet of code that I have adapted from this question, however, if you run it you will see that where values exist that are not found in detect_level an NA is returned. Additionally, it does not appear to have worked for $TN or $DOC, even in cases when the $LOD value from detect_level was present.
dat %>%
mutate(across(all_of(unique(detect_level$Parameter)),
~ {i1 <- detect_level$Parameter == cur_column()
detect_level$LOD[i1][match(., detect_level$LOD)]}))
I am not comfortable at all with the purrr language here and have only adapted this code from the question linked, so I would appreciate if this is the direction an answerer chooses, that they might comment code to explain briefly what is happening "under the hood".
Thank you in advance!
Perhaps this helps
library(dplyr)
dat %>%
mutate(across(all_of(detect_level$Parameter),
~ pmax(., detect_level$LOD[match(cur_column(), detect_level$Parameter)])))
For the updated case
dat %>%
mutate(across(all_of(detect_level$Parameter),
~ replace(., . < detect_level$LOD[match(cur_column(),
detect_level$Parameter)],detect_level$halfLOD[match(cur_column(),
detect_level$Parameter)])))

Aggregating two rows based on condition of different ID in R

I am dealing with a dataset of players statistics for a sport. There is an error in the data where one week a player who doesn't exist, has been attributed the data that belongs to a real player. I need to aggregate the two players data and delete the instance of the false players' row.
I need to adjust my preprocessing code to accommodate this so when I scrape future weeks data then I don't need to make manual adjustments.
df <- data.frame(Name = c("Bob","Ben","Bill"),
Team = c("Dogs","Cats","Birds"),
Runs = c(6, 4, 2)
I'd like to do something along the lines of aggregating the two rows based on their df$Name e.g. when df$Name == "Bob" & df$Name == "Bill" aggregate columns [3:40] -- these are my columns with numeric statistics, [1:2] have df$Name and df$Team.
It would depend on the type of aggregation you are trying to do. This looks like a perfect use of the group_by from the dplyr package. Consider the CO2 data set.
library(dplyr)
CO2 %>%
group_by(Plant) %>%
summarise(
n = n(), #Calculate number of rows in each group
meanUptake = mean(uptake) # Aggregate data and take mean for each group
) %>%
ungroup()
Here we take each group, in your case above it would be name. In the summarise, if you wish to include extra information (like team) include it within the summarise.

Dplyr solution for difference in row values based on two factor levels in separate columns

I am trying to use dplyr to calculate the difference between two row values based on factor levels in large data frame. In practical terms, I want the vote distance between two groups across each party within each country. For the data below, I would like to end up with a data frame with rows indicating the difference between the vote values for each group pair for each party level within each country level. The lag function does not seem to work with my data as the number of factor levels varies by country, meaning each country has a different total number of groups and parties. A small sample of the setup is below.
df1 <- data.frame(id = c(1:12),
country = c("a","a","a","a","a","a","b","b","b","b","b","b"),
group = c("x","y","z","x","y","z","x","y","z","x","y","z"),
party = c("d","d","d","e","e","e","d","d","d","e","e","e"),
vote = c(.15,.02,.7, .5, .6, .22,.47,.33,.09,.83,.77,.66))
This is how I would like the end product to look.
df2 <- data.frame(id= c(1:12),
country = c("a","a","a","b","b","b","a","a","a","b","b","b"),
group1 = c("x","x","y","x","x","y","x","x","y","x","x","y"),
group2 = c("y","z","z","y","z","z","y","z","z","y","z","z"),
party = c("d","d","d","d","d","d","e","e","e","e","e","e"),
dist = c(.13,-.5,-.68,.14,.38,.24,-.1,.28,.38,.06,.17,.11))
I have tried dcast previously and if I fill with the column I want, it doesn't line up and produces NA or 0 where there should be values. The lag function doesn't work in my case because the number of parties and groups are unique for each country and not fixed. Whenever I have tried different intervals for the lag the values are comparing across countries of across parties rather than across groups in some instances.
I have found solutions outside of dplyr but for parsimony in presenting code I am wondering if there is a way in dplyr. Also, the code I have is incredibly long and clunky, and uses six or seven packages just for this problem.
Thanks
We can use combn to create the difference
library(dplyr)
df1 %>%
group_by(country, party) %>%
mutate(dist = combn(vote, 2, FUN = function(x) x[1] - x[2]))
Another way is to use
library(tidyverse)
df1 %>%
left_join(df1 %>% select(-id), by = c("country", "party"), suffix = c("1", "2")) %>%
filter(group1 != group2) %>%
mutate(dist = vote1 - vote2)

Summing across rows conditional on groups with dplyr using select, group_by, and mutate

Problem: I'm making an aggregate market share variable in a car market with 286 distinct models sold and a total of 501 cars sold. This group share is based on only on car characteristic: cat= "compact", "midsize", "large" and yr=77,78,79,80,81, and the share, a small double variable; a total of 15 groups in the market.
Closest answer I've found: by mishabalyasin on community.rstudio: "Calculating rowwise totals and proportions using tidyeval?" link to post on community.rstudio.
Applying the principle of select-split-combine is the closest I've come to getting the correct answer is the 15 groups (15 x 3(cat, yr, s)):
df<- blp %>%
select(cat,yr,s) %>%
group_by(cat,yr) %>%
summarise(group_share = sum(s))
#in my actual data, this is what fills by group share to get what I want, but this isn't the desired pipele-based answer
blp$group_share=0 #initializing the group_share, the 50th col
for(i in 1:501){
for(j in 1:15){
if((blp[i,31]==df[j,1])&&(blp[i,3]==df[j,2])){ #if(sameCat & sameYr){blpGS=dfGS}
blp[i,50]=df[j,3]
}
}
}
This is great, but I know this can be done in one fell swoop... Hopefully, the idea is clear from what I've described above. A simple fix may be a loop and set by conditions on cat and yr, and that'd help, but I really am trying to get better at data wrangling with dplyr, so, any insight along that line to get the pipelining answer would be wonderful.
Example for the site: This example below doesn't work with the code I provided, but this is the "look" of my data. There is a problem with the share being a factor.
#45 obs, 3 cats, 5 yrs
cat=c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr=c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s=c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
blp=as.data.frame(cbind(unlist(lapply(cat,as.character,stringsAsFactors=FALSE)),as.numeric(yr),unlist(as.numeric(s))))
names(blp)<-c("cat","yr","s")
head(blp)
#note: one example of a group share would be summing the share from
(group_share.blp.large.81.s=(blp[cat== "large" &yr==81,]))
#works thanks to akrun: applying the code I provided for what leads to the 15 groups
df <- blp %>%
select(cat,yr,s) %>%
group_by(cat,yr) %>%
summarise(group_share = sum(as.numeric(as.character(s))))
#manually filling doesn't work, but this is what I'd want if I didn't want pipelining
blp$group_share=0
for(i in 1:45){
if( ((blp[i,1])==(df[j,1])) && (as.numeric(blp[i,2])==as.numeric(df[j,2]))){ #if(sameCat & sameYr){blpGS=dfGS}
blp[i,4]=df[j,3];
}
}
if I understood your problem correctly this should ideally help!
Here the only difference that instead of using summarize which will automatically result only in the grouped column and the summarized one you can use mutate to keep the original columns and add to them an aggregate one.
# Sample input
## 45 obs, 3 cats, 5 yrs
cat <- c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr <- c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s <- c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
# Calculation
blp <-
data.frame(cat, yr, s, stringsAsFactors = FALSE) %>% # To create dataframe
group_by(cat, yr) %>% # Grouping by category and year
mutate(group_share = sum(s, na.rm = TRUE)) %>% # Calculating sum share per category/year
ungroup()
Expected output
Expected output

Resources