Better way to apply which.max over dataframe - r

So I'm trying to learn R while playing with a dataset from https://www.kaggle.com/abcsds/pokemon:
data = read.csv("Pokemon.csv")
data$Name = sub(".*(Mega)", "Mega", data$Name) # replacing name duplications
I want to find all the Pokemon that have a maximum value in any of the columns (Total, Attack, HP, etc.).
I know I can do sapply(data[5:11], max, na.rm = TRUE) to find the max values, and things like
data[which.max(data$Total),]
data[which.max(data$HP),]
data[which.max(data$Attack),]
to find all the rows that have a max.
Is there a way I can use something like sapply in order to get all the rows without going through them sequentially?

I believe this is what you want to achieve.
I use the tidyverse for this. Since the data is in wide format with a separate column for each stat, I first convert it to long format with pivot_longer, then group_by the stats column and filter for the max of each group to get the desired result.
library(tidyverse)
data %>%
  select(c(2, 5:11)) %>%
  pivot_longer(-1, names_to = "stats") %>%
  group_by(stats) %>%
  filter(value == max(value))
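If you want to stay closer to base R and the sapply idea from the question, you can collect the row indices from which.max in one pass. A minimal sketch, assuming the stat columns sit in positions 5:11 as in the question; note that which.max returns only the first row per column, so ties are dropped, unlike the filter() version above:
max_rows <- sapply(data[5:11], which.max)  # one row index per stat column
data[unique(max_rows), ]                   # Pokemon holding at least one column maximum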

Related

Create a new column by match conditional in two columns in pairwise dataframe in R

I am trying to create a new column based on conditionally matching two columns that share the same factor levels, in order to summarise the data.
This is how the dataframe looks:
This is how I want to summarise it:
This is my proposed code, which does not work:
probe$Cluster <- "cluster"
for (i in 1:length(probe)) {
  if (is.na(probe[i, 4])) {
    probe[i, 9] <- probe[i, 6]
  }
  if (is.na(probe[i, 6])) {
    probe[i, 9] <- probe[i, 4]
  }
  if (identical(probe[i, 4], probe[i, 6])) {
    probe[i, 9] <- probe[i, 4]
  }
  if (!identical(probe[i, 4], probe[i, 6])) {
    probe[i, 9] <- probe[i, 4]
    rep(probe[i, 1:9]) %>% probe[i, 9] <- probe[i, 6]
  }
}
#Then create a summary of this like this:
Sum <- probe %>%
  group_by(Method, Cluster) %>%
  summarise(mean(relation, na.rm = FALSE),
            numberobservations = length(unique(GenA))) %>%
  data.frame()
Thank you for any advice.
Can't verify without sample data that I can load (without retyping from a picture), but it looks like you're going for something like this:
library(dplyr)
probe %>%
  mutate(Cluster = coalesce(ClusterA, ClusterB)) %>%  # use the 1st non-NA value from the two cluster columns
  group_by(Method, Cluster) %>%
  summarize(mean = mean(relation, na.rm = TRUE),
            numberobservations = n(),
            .groups = "drop")
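For reference, coalesce() takes the first non-NA value element-wise across its arguments, which is what replaces the chain of if statements in the question. A minimal illustration on made-up vectors:
library(dplyr)
coalesce(c(NA, "x", "y"), c("a", NA, "y"))
# [1] "a" "x" "y"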

How to find the mean and standard deviation of rows in dataframes with some having NAs and others not

I'm trying to find the mean and standard deviation for C and P separately.
I have toyed around with this so far:
C <- rowMeans(dplyr::select(total, C1:41), na.rm=TRUE)
This didn't yield what I needed it to.
Then I thought about just using the summary, but again it didn't give me what I needed.
So then I thought of using na.omit:
Of course though, this would take out all of the data since I have NAs throughout the dataframe.
What am I missing here? Is this a matter of aggregating my data into certain groups?
I know describeBy could produce these descriptives, but again I'm not sure how to do that.
So I think the angle I want to take is to order these, then aggregate and find totals, and then find the descriptives using describeBy in order to avoid NAs. I'm stuck though. Where am I going wrong?
Try using this:
library(dplyr)
total %>%
  # Select only columns that have S in their name,
  # i.e. SP and SC
  select(starts_with('S')) %>%
  # Get the data in long format, dropping NA values
  tidyr::pivot_longer(cols = everything(), values_drop_na = TRUE) %>%
  # Create a group based on whether the column name is a C column or a P column
  group_by(grp = c('Participant1', 'Participant2')[grepl('C\\d+', name) + 1]) %>%
  # Take the mean and standard deviation of each group
  summarise(mean = mean(value), sd = sd(value))
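The group_by() line works because grepl() returns TRUE/FALSE, and adding 1 turns that into an index of 1 or 2 into the label vector. A quick illustration with made-up column names (SC/SP are assumed, per the comment above):
nm <- c("SC1", "SP1", "SC2")
c('Participant1', 'Participant2')[grepl('C\\d+', nm) + 1]
# [1] "Participant2" "Participant1" "Participant2"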

2 Numeric Values In A Dataframe Field In R

I have a dataset in R with a little under 100 columns.
Some of the columns have numeric values expressed as 87+3 as opposed to 90.
I have been able to update each column with the following piece of code:
library(dplyr)
new_dataframe = dataframe %>%
  rowwise() %>%
  mutate(new_value = eval(parse(text = value)))
However, I would like to be able to update a list of 60 columns in a more efficient way than simply repeating this line for each column.
Can someone help me find a more efficient way?
We can use mutate_at
library(dplyr)
dataframe %>%
  rowwise() %>%
  mutate_at(1:60, list(new_value = ~ eval(parse(text = .))))
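In dplyr 1.0 and later, mutate_at() is superseded by across(). An equivalent sketch; the column positions 1:60 are carried over from the answer above and are an assumption about your data:
library(dplyr)
dataframe %>%
  rowwise() %>%
  mutate(across(1:60, ~ eval(parse(text = .x)), .names = "{.col}_new_value")) %>%
  ungroup()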

R spread across multiple value columns

My dataset looks like this -
dataset = data.frame(Site = c(rep('A', 6), rep('B', 6)),
                     Date = c(rep(c('2019-05-31', '2019-04-30', '2019-03-31'), 4)),
                     Question = c(rep('Q1', 3), rep('Q2', 3)),
                     Score = runif(12, 0.5, 1),
                     Average = runif(12, 0.5, 1))
I'd like to spread the columns so that the first two columns contain Site and Question, and the remaining columns are Score_Date and Average_Date.
Here's an example of what the first line of the resulting table would look like
Site Question Score_2019.03.31 Score_2019.04.30 Score_2019.05.31 Average_2019.03.31 Average_2019.04.30 Average_2019.05.31
A Q1 0.9117566 0.8661078 0.5624139 0.7246694 0.8870703 0.6401099
I tried using unite & spread from tidyr but got nowhere close to the result.
Any input would be highly appreciated.
Using tidyr and dplyr from the tidyverse, you could do the following:
library(tidyverse)
dataset %>%
  nest(Score, Average, .key = 'value_col') %>%
  spread(key = Date, value = value_col) %>%
  unnest(`2019-03-31`, `2019-04-30`, `2019-05-31`, .sep = "_")
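If you are on tidyr 1.0 or later, pivot_wider() (which supersedes spread) can produce the same layout in a single step. A sketch using the columns from the question:
library(tidyr)
dataset %>%
  pivot_wider(names_from = Date, values_from = c(Score, Average))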

How do I aggregate certain columns from data frame by a Unique ID?

I have statcast data per day dating back to 2016. I am attempting to aggregate this data to find the mean for each pitcher ID.
I have the following code:
aggpitch <- aggregate(pitchingstat, by = list(pitchingstat$PitcherID),
                      FUN = mean, na.rm = TRUE)
This function aggregates every single column. I am looking to aggregate only a certain number of columns.
How would I include only certain columns?
If you have more than one column that you'd like to summarize, you can use QAsena's approach and add the summarise_at function like so:
pitchingstat %>%
  group_by(PitcherID) %>%
  summarise_at(vars(col1:coln), mean, na.rm = TRUE)
Check out link below for more examples:
https://dplyr.tidyverse.org/reference/summarise_all.html
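In recent dplyr versions, summarise_at() is superseded by across() inside summarise(). An equivalent sketch, with col1:coln again standing in for the columns you actually want:
library(dplyr)
pitchingstat %>%
  group_by(PitcherID) %>%
  summarise(across(col1:coln, ~ mean(.x, na.rm = TRUE)), .groups = "drop")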
Replace the first argument (pitchingstat) with the name of the column you want to aggregate (or a vector thereof)
How about?:
library(tidyverse)
aggpitch <- pitchingstat %>%
  group_by(PitcherID) %>%
  summarise(pitcher_mean = mean(variable)) # replace 'variable' with your variable of interest here
or
library(tidyverse)
aggpitch <- pitchingstat %>%
  select(PitcherID, var_1, var_2) %>%
  group_by(PitcherID) %>%
  summarise(pitcher_mean = mean(var_1),
            pitcher_mean2 = mean(var_2))
I think this works but could use a dummy example of your data to play with.
