Finding the mean of two columns with two different classes/labels - r

Right now I'm trying to create a data frame that contains the mean of two columns for two separate labels/categories.
But I don't know how to calculate the mean for two columns; my code just returns the same mean for both winner and opponent/loser.
Currently, I'm using the tidyverse library.
Here is the original data frame:
winner_hand winner_ht winner_ioc winner_age opponent_hand opponent_ht opponent_ioc opponent_age result name
<chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <fct> <chr>
R 178 JPN 29.00479 R NA RUS 22.88569 winner Kei Nishikori
R NA RUS 22.88569 R 188 FRA 33.70568 winner Daniil Medvedev
R 178 JPN 29.00479 R 188 FRA 31.88227 winner Kei Nishikori
R 188 FRA 33.70568 R NA AUS 19.86858 winner Jo Wilfried Tsonga
R NA RUS 22.88569 R 196 CAN 28.01095 winner Daniil Medvedev
R 188 FRA 31.88227 R NA JPN 26.40383 winner Jeremy Chardy
My code:
age_summary <- game_data %>%
group_by(result) %>%
summarize(mean_age = mean(winner_age))
age_summary
Resulting Data frame:
result mean_age
<fct> <dbl>
winner 27.68495
loser 27.68495

If you want summaries from two columns, you need expressions for each column in the call to summarize().
Example with fake data, since your excerpt only has one value for the 'result' column:
library(tidyverse)
dat <- read_csv(
"result, winner_age, opponent_age
A, 5, 10
A, 6, 11
B, 12, 2
B, 13, 1")
dat %>%
group_by(result) %>%
# note: two expressions here:
summarise(mean_winner_age = mean(winner_age),
mean_opponent_age = mean(opponent_age))
output:
# A tibble: 2 x 3
result mean_winner_age mean_opponent_age
<chr> <dbl> <dbl>
1 A 5.5 10.5
2 B 12.5 1.5
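When several columns need the same summary, the per-column expressions can also be generated with across(). This is a minimal sketch, assuming dplyr 1.0 or later; the toy tibble mirrors the fake data above, and the `mean_{.col}` naming pattern is only an illustrative choice:

```r
library(dplyr)

# Toy data mirroring the fake example above
dat <- tibble(
  result       = c("A", "A", "B", "B"),
  winner_age   = c(5, 6, 12, 13),
  opponent_age = c(10, 11, 2, 1)
)

age_summary <- dat %>%
  group_by(result) %>%
  # apply mean() to both age columns; na.rm = TRUE skips missing values
  summarise(across(c(winner_age, opponent_age),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "mean_{.col}"))

age_summary
```

The na.rm = TRUE argument matters for columns such as winner_ht in the original data, which contain NA values.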

Related

Group by a variable in dataframe R

I have a dataframe like below,
Date      cat  cam  reg  per
22-01-05  A     60  120   50
22-01-05  B     20  100   20
22-01-08  A     30  150   20
22-01-08  B     30  100   30
But I want something like below,
Date      cam  reg  per
22-01-05   80  220  14.5
22-01-08   60  250  24
How to get this using R?
I am not sure why your expected per values are like that, but maybe you want the following:
df <- data.frame(Date = c("22-01-05", "22-01-05", "22-01-08", "22-01-08"),
cat = c("A", "B", "A", "B"),
cam = c(60,20,30,30),
reg = c(120,100,150,100),
per = c(50,20,20,30))
library(dplyr)
df %>%
group_by(Date) %>%
summarise(cam = sum(cam),
reg = sum(reg),
per = cam/reg)
#> # A tibble: 2 × 4
#> Date cam reg per
#> <chr> <dbl> <dbl> <dbl>
#> 1 22-01-05 80 220 0.364
#> 2 22-01-08 60 250 0.24
Created on 2022-07-07 by the reprex package (v2.0.1)
Using only the package dplyr (which is part of package tidyverse) just do:
df %>% group_by(Date) %>% summarise(cam = sum(cam),
reg = sum(reg),
per = 100*(cam/reg))
Date cam reg per
<chr> <int> <int> <dbl>
1 22-01-05 80 220 36.4
2 22-01-08 60 250 24
The nice thing with this syntax is, you can modify and add additional variables like sum, but also like mean, median, etc. in a very clean and structured way.
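As a sketch of that flexibility, the same summarise() call can mix several summary functions. The output column names `cam_total`, `cam_mean` and `reg_median` are illustrative and not from the question:

```r
library(dplyr)

df <- data.frame(Date = c("22-01-05", "22-01-05", "22-01-08", "22-01-08"),
                 cat = c("A", "B", "A", "B"),
                 cam = c(60, 20, 30, 30),
                 reg = c(120, 100, 150, 100))

by_date <- df %>%
  group_by(Date) %>%
  summarise(cam_total  = sum(cam),     # total per date
            cam_mean   = mean(cam),    # average per date
            reg_median = median(reg))  # median per date
by_date
```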
You can try this, but I don't know how to get the per values 14.5 and 24:
library(dplyr)
aggregate(cbind(cam, reg) ~ Date,df,sum) %>% mutate(per = 100*(cam/reg))
A data.frame: 2 × 4
Date cam reg per
<chr> <dbl> <dbl> <dbl>
22-01-05 80 220 36.36364
22-01-08 60 250 24.00000

How to create rate on R

I want to change my data so that it gives me the rate of pedestrians relative to each state's population. I am using a linear model and my summary values look like this:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.087061 0.029876 2.914 0.00438 **
intersection 0.009192 0.003086 2.978 0.00362 **
Here, my beta value intersection is .009192 and that is not meaningful because compared to a state that has a smaller population, this value might be nothing in comparison.
Below is a condensed version of my data without all the columns I use, but here is the link to the csv in case someone wants to download it from there.
> head(c)
# A tibble: 6 x 15
STATE STATENAME PEDS PERSONS PERMVIT PERNOTMVIT COUNTY COUNTYNAME CITY DAY MONTH YEAR LATITUDE LONGITUD
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Alabama 0 3 3 0 81 LEE (81) 2340 7 2 2019 32.7 -85.3
2 1 Alabama 0 2 2 0 55 ETOWAH (55) 1280 23 1 2019 34.0 -86.1
3 1 Alabama 0 4 4 0 29 CLEBURNE (29) 0 22 1 2019 33.7 -85.4
4 1 Alabama 1 1 1 1 55 ETOWAH (55) 2562 22 1 2019 34.0 -86.1
5 1 Alabama 0 1 1 0 3 BALDWIN (3) 0 18 1 2019 30.7 -87.8
6 1 Alabama 0 2 2 0 85 LOWNDES (85) 0 7 1 2019 32.2 -86.4
# … with 1 more variable: FATALS <dbl>
Here is the code I have that runs through the process I am doing. I don't see how I can change it so that each value is a rate (values like peds or type_int)
#Libraries
rm(list=ls()) # this is to clear anything in memory
library(leaflet)
library(tidyverse)
library(ggmap)
library(leaflet.extras)
library(htmltools)
library(ggplot2)
library(maps)
library(mapproj)
library(mapdata)
library(zoo)
library(tsibble)
setwd("~/Desktop/Statistics790/DataSets/FARS2019NationalCSV")
df <- read.csv("accident.csv")
state <- unique(df$STATE)
for(i in state){
df1<- df %>%
filter(STATE==i) %>%
dplyr::select(c(STATE,PEDS,DAY,MONTH,YEAR,TYP_INT)) %>%
mutate(date = as.Date(paste(YEAR, MONTH, DAY, sep = "-"), format = "%Y-%m-%d")) %>% # create a date
group_by(date) %>% # Group by State id and date
# summarise_at(.vars = vars(PEDS), sum)
summarise(pedday=sum(PEDS),intersection=mean(TYP_INT))
#ts1<-ts(df,start=c(2019,1,1), frequency=365)
setwd("~/Desktop/Statistics790/States_ts/figures")
plots<-df1 %>%
ggplot()+
geom_line(aes(x=date,y=pedday))+ylim(0,13)+
theme_bw()
ggsave(paste0("state_",i,".png"),width=8,height=6)
ts1<-ts(df1,start=c(2019,1,1), frequency=365)
setwd("~/Desktop/Statistics790/States_ts")
ts1 %>% write.csv(paste0("state_",i,".csv"),row.names = F)
#Plots
}
#date1<- as.character(df$date)
#df1<- df%>% filter(STATE=="1")
#ts2<-xts(df,order.by = as.Date(df$date,"%Y-%m-%d"))
setwd("~/Desktop/Statistics790/States_ts")
cat("\f")
#df <- read.csv(paste0("state_1.csv"))
#print("------Linear Model------")
#summary(lm(pedday~weather,data=df))
for(i in state){
print(paste0("-------------------------Analysis for State: ",i," -------------------------------"))
df <- read.csv(paste0("state_",i,".csv"))
print("------Linear Model------")
print(summary(lm(pedday~intersection,data=df)))
}
Collating my answers from the comments: you need to get state population data from an outside source such as the US Census https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html#par_textimage_1574439295, read it in, join it to your dataset, and then calculate rate as pedestrians per population, scaled for ease of reading on the graph. You can make your code faster by taking some of your calculations out of the loop. The code below assumes the census data is called 'census.csv' and has columns 'Geographic Area' for state and 'X2019' for the most recent population data available.
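library(dplyr)
scale <- 1e5  # placeholder value (rate per 100,000 residents); choose whatever scale reads well
pop <- read.csv('census.csv')  # read.csv() renames 'Geographic Area' to 'Geographic.Area'
df <- read.csv('accident.csv') %>%
  left_join(pop, by = c('STATENAME' = 'Geographic.Area')) %>%
  mutate(rate = (PEDS / X2019) * scale) %>%
  mutate(date = as.Date(paste(YEAR, MONTH, DAY, sep = "-"), format = "%Y-%m-%d"))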
The left_join will match state names and give each row a population value depending on its state, regardless of how many rows there are.
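To see the join behaviour in isolation, here is a small sketch with hypothetical stand-in tables. The `accidents` and `pop` data frames, the `state` column name and the population figures are all made up for illustration:

```r
library(dplyr)

# Hypothetical stand-ins for the accident and census tables
accidents <- data.frame(STATENAME = c("Alabama", "Alabama", "Alaska"),
                        PEDS = c(1, 0, 2))
pop <- data.frame(state = c("Alabama", "Alaska"),
                  X2019 = c(4900000, 730000))  # made-up round numbers

rates <- accidents %>%
  left_join(pop, by = c("STATENAME" = "state")) %>%  # one population per row
  mutate(rate = PEDS / X2019 * 1e5)                  # pedestrians per 100,000 residents
rates
```

Both Alabama rows receive the same population value, so the row count of the accident table is unchanged.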

How to mutate new column by using multiple variable's in another column r

Hi I would like to make a new variable, Foldchange, by using the mutate function. However I want to calculate this using values from the same column. Is there any way to calculate a new variable, fold change from within the same column without rearranging the table so that the table doesn't require splitting up?
Here is an example for clarity:
Plate Sample_ID Visit Bead_MFI Phago_Score mean_phago
<fct> <chr> <fct> <int> <dbl> <dbl>
4 100004 V1 1199 237. 253.
4 100077 V1 1522 405. 396.
4 100077 V2 1349 324. 814.
4 100004 V2 1518 466. 867
the output I would like is something like using:
test %>% group_by("Sample", "Plate") %>% mutate (Foldchange = ((mean_phago$V2-mean_phago$V1)/mean_phago$V1))
to get
Plate Sample_ID Visit Bead_MFI Phago_Score mean_phago Foldchange
<fct> <chr> <fct> <int> <dbl> <dbl> <dbl>
4 100004 V1 1199 237. 253. 2.42
4 100077 V1 1522 405. 396. 1.11
4 100077 V2 1349 324. 834. 1.11
4 100004 V2 1518 466. 867 2.42
Obviously I can't select based on the V1 and V2 values using this code, but that's just to illustrate. I'm hoping that this way I can keep my table intact; the fold change would have repeating values, but that's OK at this point.
Thanks in advance for any help, I'm still quite new to R!
Mari
We need to subset based on a logical condition
library(dplyr)
test %>%
  group_by(Plate, Sample_ID) %>%  # one group per sample on each plate
  mutate(Foldchange = (mean_phago[Visit == 'V2'] -
                       mean_phago[Visit == 'V1']) / mean_phago[Visit == 'V1'])
Or, if there is exactly one 'V1' and one 'V2' row per group, we can use diff
test %>%
  arrange(Plate, Sample_ID, Visit) %>%  # V1 sorts before V2 within each sample
  group_by(Plate, Sample_ID) %>%
  mutate(Foldchange = diff(mean_phago) / first(mean_phago))
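A self-contained sketch of the logical-subsetting approach, grouping by both Plate and Sample_ID as the question asks. The toy `test` tibble uses the mean_phago values from the desired output table in the question and omits the other columns:

```r
library(dplyr)

# mean_phago values taken from the desired output table in the question
test <- tibble(
  Plate      = factor(c(4, 4, 4, 4)),
  Sample_ID  = c("100004", "100077", "100077", "100004"),
  Visit      = factor(c("V1", "V1", "V2", "V2")),
  mean_phago = c(253, 396, 834, 867)
)

result <- test %>%
  group_by(Plate, Sample_ID) %>%
  # (V2 - V1) / V1 within each sample; the value is recycled to both rows
  mutate(Foldchange = (mean_phago[Visit == "V2"] - mean_phago[Visit == "V1"]) /
                       mean_phago[Visit == "V1"]) %>%
  ungroup()
result
```

This relies on each Plate and Sample_ID combination having exactly one V1 and one V2 row; otherwise the subsets have the wrong length and mutate() stops with an error.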

Trying to group data by region and summarize by date in R Studio on COVID19 epidemic

I'm an old FORTRAN and C programmer trying to learn R. I started working with data on the COVID19 epidemic and have run aground.
The data I'm working with started out as wide data, and I have converted it to row (long) data. It contains a daily count of cases by ProvinceState, Region/Country, Lat, Long, Date, Cases.
I want to filter the dataframe for Mainland China and summarize cases by date as a first step. The code below generates a NULL data set when I try to group the data.
Thanks for any help!
library(dplyr)
library(dygraphs)
library(lubridate)
library(tidyverse)
library(timeSeries)
# Set current working directory.
#
setwd("/Users/markmcleod/MarksRepository/Data")
# Read a Case csv files
#
Covid19ConfirmedWideData <- read.csv("Covid19Deaths.csv",header=TRUE,check.names = FALSE)
# count the number of days of data
#
Covid19ConfirmedDays = NCOL(Covid19ConfirmedWideData)
# Gather Wide Data columns starting at column 5 until NCOL() into RowData DataFrame
#
Covid19ConfirmedRowData <- gather(Covid19ConfirmedWideData, Date, Cases, 5:Covid19ConfirmedDays, na.rm = FALSE, convert = TRUE)
tibble(Covid19ConfirmedRowData)
# # A tibble: 2,204 x 1
# Covid19ConfirmedRowData$ProvinceState $CountryRegion $Lat $Long $Date $Cases
# <fct> <fct> <dbl> <dbl> <chr> <int>
# 1 Anhui Mainland China 31.8 117. 1/22/20 0
# 2 Beijing Mainland China 40.2 116. 1/22/20 0
# 3 Chongqing Mainland China 30.1 108. 1/22/20 0
# Transmute date from chr to date
#
Covid19ConfirmedFormatedData <- transmute(Covid19ConfirmedRowData,CountryRegion,Date=as.Date(Date,format="%m/%d/%Y"),Cases)
tibble(Covid19ConfirmedFormatedData)
# # A tibble: 2,204 x 1
# Covid19ConfirmedFormatedData$CountryRegion $Date $Cases
# <fct> <date> <int>
# 1 Mainland China 0020-01-22 0
# 2 Mainland China 0020-01-22 0
Covid19ConfirmedGroupedData <- Covid19ConfirmedFormatedData %>%
filter(Covid19ConfirmedFormatedData$CountryRegion=='Mainland China')
tibble(Covid19ConfirmedGroupedData)
# A tibble: 2,204 x 1
Covid19ConfirmedGroupedData[,1] [,2] [,3]
<dbl> <dbl> <dbl>
1 NA NA NA
It appears that I have a conflict in the libraries I am using.
I fell back to a previous version of the code and used only the following libraries.
library(dygraphs)
library(lubridate)
library(tidyverse)
The code seems to work again.
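Two details in the posted output are worth checking regardless of which libraries are loaded. The dates print as 0020-01-22 because %Y expects a four-digit year, whereas two-digit years such as 1/22/20 need %y. Also, if another attached package masks filter(), calling dplyr::filter() explicitly removes the ambiguity; that masking was the cause here is an assumption, not something the post confirms. A minimal sketch with invented case counts:

```r
library(dplyr)

# Toy long-format data; the counts are invented for illustration
covid <- data.frame(
  CountryRegion = c("Mainland China", "Mainland China", "Japan", "Mainland China"),
  Date  = c("1/22/20", "1/22/20", "1/22/20", "1/23/20"),
  Cases = c(1L, 2L, 5L, 4L)
)

china_by_date <- covid %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%y")) %>%  # %y = two-digit year
  dplyr::filter(CountryRegion == "Mainland China") %>%   # explicit namespace
  group_by(Date) %>%
  summarise(Cases = sum(Cases))
china_by_date
```

Inside filter() the column can be referenced by its bare name, so the `Covid19ConfirmedFormatedData$` prefix is not needed.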

Mutate an R tibble by looking up a value in another tibble

Wrangling data in R, I would like to mutate a tibble in such a way that the numerical values in the new column are being looked up in a different tibble.
Given a dataset of catheter removals:
# A tibble: 51 x 2
ExplYear RemovalReason
<dbl> <chr>
1 2018 Infection
2 2018 Dysfunction
3 2018 Infection
# ... etc.
where each row corresponds to a single catheter removal, I would like to add a column Implants that holds the total number of implantations in the year that the catheter was removed (explanted).
The implantation numbers are in a tibble impl_per_year:
# A tibble: 13 x 2
ImplYear n
<dbl> <int>
1 2006 14
2 2007 46
3 2008 64
# ... etc.
I have tried to mutate the first tibble with map and a helper function:
lookup = function(year) { impl_per_year[impl_per_year$ImplYear == year,]$n }
explants %>% mutate(Implants = map(ExplYear, lookup))
But this places lots of empty integer vectors into the Implants column:
# A tibble: 51 x 3
ExplYear RemovalReason Implants
<dbl> <chr> <list>
1 18 Infection <int [0]>
2 18 Dysfunction <int [0]>
3 18 Infection <int [0]>
# ... etc.
What is the mistake?
You should be able to simply join the two tables by year. If we call your first tibble ExplTibble and your second ImplTibble, using dplyr:
ExplTibble %>% left_join(ImplTibble, by = c("ExplYear" = "ImplYear"))
This should add a new column n containing the number of implants in each year.
library(tidyverse)
I altered your data so that my illustration wouldn't have a NULL output.
df <- tribble(
~ExplYear, ~RemovalReason,
2018, "Infection",
2017, "Dysfunction",
2016, "Infection")
impl_per_year <- tribble(
~ImplYear, ~n,
2017, 14,
2016, 46,
2016, 64
)
left_join is the function you're looking for. It's part of the dplyr::join family of functions that do this.
It's good to have the same names for "joining" variables, but in your case you need the by = c( ... ) option to let left_join know what you are joining by.
left_join(df, impl_per_year, by = c("ExplYear" = "ImplYear"))
# A tibble: 4 x 3
ExplYear RemovalReason n
<dbl> <chr> <dbl>
1 2018 Infection NA
2 2017 Dysfunction 14
3 2016 Infection 46
4 2016 Infection 64
Depending on what you want, consider right_join, inner_join, etc. until you get the output you are looking for. For example:
inner_join(df, impl_per_year, by = c("ExplYear" = "ImplYear"))
# A tibble: 3 x 3
ExplYear RemovalReason n
<dbl> <chr> <dbl>
1 2017 Dysfunction 14
2 2016 Infection 46
3 2016 Infection 64
... which gives only successful matches from both tibbles.
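The printed result in the question shows ExplYear values of 18 rather than 2018, which may be why the lookup returned empty vectors: no ImplYear equals 18. The following sketch works under that assumption. The `explants` tibble with two-digit years and the counts in `impl_per_year` are invented for illustration; it converts the years before joining and renames n to Implants:

```r
library(dplyr)

# Hypothetical data: two-digit removal years, invented implant counts
explants <- tibble(ExplYear = c(18, 18, 17),
                   RemovalReason = c("Infection", "Dysfunction", "Infection"))
impl_per_year <- tibble(ImplYear = c(2017, 2018), n = c(40L, 51L))

explants_with_impl <- explants %>%
  mutate(ExplYear = ExplYear + 2000) %>%   # assumes every year is 20xx
  left_join(impl_per_year, by = c("ExplYear" = "ImplYear")) %>%
  rename(Implants = n)
explants_with_impl
```

Because ImplYear is unique in this lookup table, the join keeps one row per removal.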
