I have a dataset of football teams and their Win/Loss results from 2009-2017. Currently the Wins and Losses are in the same column, one after the other, and I want to create a new column for the losses.
A sample of the data looks like:
Football <- data.frame (
Season = rep ("2009", 10),
Team = rep (c("ARI", "ARI", "ATL", "ATL", "BAL", "BAL", "BUF", "BUF", "CAR", "CAR")),
Value = c(10, 6, 7, 9, 7, 9, 6, 10, 8, 8)
)
I would like the final output to show:
Season Team Wins Losses
2009 ARI 10 6
2009 ATL 7 9
2009 BAL 7 9
and so on. There are also several other variables but the only one that changes for each Season/Team pair is the "Value".
I have tried several iterations of spread() and mutate() but they typically make many more columns (i.e. 2009.Wins, 2009.Losses, 2010.Wins, 2010.Losses) than I want.
Thanks for any help. I hope this post turns out alright, it's my first time posting.
Cheers, Jeremy
We create a column of "Winloss" and then spread to 'wide' format
library(tidyverse)
Football %>%
mutate(Winloss = rep(c("Win", "Loss"), n()/2)) %>%
spread(Winloss, Value)
# Season Team Loss Win
#1 2009 ARI 6 10
#2 2009 ATL 9 7
#3 2009 BAL 9 7
#4 2009 BUF 10 6
#5 2009 CAR 8 8
data
Football <- data.frame (
Season = rep ("2009", 10),
Team = rep (c("ARI", "ARI", "ATL", "ATL", "BAL", "BAL", "BUF", "BUF", "CAR", "CAR")),
Value = c(10, 6, 7, 9, 7, 9, 6, 10, 8, 8)
)
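In current versions of tidyr, spread() is superseded by pivot_wider(). The following is a minimal sketch of the same idea, assuming every Season/Team pair has exactly two rows with the win count listed first (the Result column name is made up for illustration):

```r
library(dplyr)
library(tidyr)

Football <- data.frame(
  Season = rep("2009", 10),
  Team = c("ARI", "ARI", "ATL", "ATL", "BAL", "BAL", "BUF", "BUF", "CAR", "CAR"),
  Value = c(10, 6, 7, 9, 7, 9, 6, 10, 8, 8)
)

wide <- Football %>%
  group_by(Season, Team) %>%
  # Assumption: first row of each pair is wins, second is losses
  mutate(Result = c("Wins", "Losses")[row_number()]) %>%
  ungroup() %>%
  pivot_wider(names_from = Result, values_from = Value)
```

Labelling within each Season/Team group (rather than alternating over the whole data frame) keeps the labels correct even if the rows are not perfectly interleaved across teams, and the columns come out in the Wins, Losses order requested in the question.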
Using reshape2 package
> Football$WL <- rep(c("Win", "Losses"), nrow(Football)/2)
>
> library(reshape2)
> dcast(Football, Season + Team ~ WL, value.var="Value")
Season Team Losses Win
1 2009 ARI 6 10
2 2009 ATL 9 7
3 2009 BAL 9 7
4 2009 BUF 10 6
5 2009 CAR 8 8
My (simplified) dataset consists of donor occupation and contribution amounts. I'm trying to determine what the average contribution amount by occupation is (note: donor occupations are often repeated in the column, so I use that as a grouping variable). Right now, I'm using two dplyr statements -- one to get a sum of contributions amount by each occupation and another to get a count of the number of donations from that specific occupation. I am then binding the dataframes with cbind and creating a new column with mutate, where I can divide the sum by the count.
Data example:
contributor_occupation contribution_receipt_amount
1 LISTING COORDINATOR 5.00
2 NOT EMPLOYED 2.70
3 TEACHER 2.70
4 ELECTRICAL DESIGNER 2.00
5 STUDENT 50.00
6 SOFTWARE ENGINEER 10.00
7 TRUCK DRIVER 2.70
8 NOT EMPLOYED 50.00
9 CONTRACTOR 5.00
10 ENGINEER 6.00
11 FARMER 2.70
12 ARTIST 50.00
13 CIRCUS ARTIST 100.00
14 CIRCUS ARTIST 27.00
15 INFORMATION SECURITY ANALYST 2.00
16 LAWYER 5.00
occupation2 <- b %>%
select(contributor_occupation, contribution_receipt_amount) %>%
group_by(contributor_occupation) %>%
summarise(total = sum(contribution_receipt_amount)) %>%
arrange(desc(contributor_occupation))
occupation3 <- b %>%
select(contributor_occupation) %>%
count(contributor_occupation) %>%
group_by(contributor_occupation) %>%
arrange(desc(contributor_occupation))
final_occ <- cbind(occupation2, occupation3[, 2]) # remove duplicate column
occ_avg <- final_occ %>%
select(contributor_occupation:n) %>%
mutate("Average Donation" = total/n) %>%
rename("Number of Donations"= n, "Occupation" = contributor_occupation, "Total Donated" = total)
occ_avg %>%
arrange(desc(`Average Donation`))
This gives me the result I want but seems like a very cumbersome process. It seems I get the same result by using the following code; however, I am confused as to why it works:
avg_donation_occupation <- b %>%
select(contributor_occupation, contribution_receipt_amount) %>%
group_by(contributor_occupation) %>%
summarize(avg_donation_by_occupation = sum(contribution_receipt_amount)/n()) %>%
arrange(desc(avg_donation_by_occupation))
Wouldn't dividing by n divide by the number of rows (i.e., number of occupations) as opposed to the number of people in that occupation (which is what I used the count function for previously)?
Thanks for the help clearing up any confusion!
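summarise() works on the grouped data, so n() returns the number of rows in the current group (here, the number of donations for that occupation), not the number of rows in the whole data frame. According to ?context
n() gives the current group size.
and ?mean
mean - Generic function for the (trimmed) arithmetic mean.
which is basically the sum of observations divided by the number of observations, so sum(x)/n() and mean(x) agree within each group. We can get the sum, the count and the mean in a single summarise call: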
library(dplyr)
out <- b %>%
group_by(Occupation = contributor_occupation) %>%
summarise(`Total Donated` = sum(contribution_receipt_amount),
`Number of Donations` = n(),
`Average Donation` = mean(contribution_receipt_amount),
#or
#`Average Donation` = `Total Donated`/`Number of Donations`,
.groups = 'drop') %>%
arrange(desc(`Average Donation`))
-output
out
# A tibble: 14 × 4
Occupation `Total Donated` `Number of Donations` `Average Donation`
<chr> <dbl> <int> <dbl>
1 CIRCUS ARTIST 127 2 63.5
2 ARTIST 50 1 50
3 STUDENT 50 1 50
4 NOT EMPLOYED 52.7 2 26.4
5 SOFTWARE ENGINEER 10 1 10
6 ENGINEER 6 1 6
7 CONTRACTOR 5 1 5
8 LAWYER 5 1 5
9 LISTING COORDINATOR 5 1 5
10 FARMER 2.7 1 2.7
11 TEACHER 2.7 1 2.7
12 TRUCK DRIVER 2.7 1 2.7
13 ELECTRICAL DESIGNER 2 1 2
14 INFORMATION SECURITY ANALYST 2 1 2
data
b <- structure(list(contributor_occupation = c("LISTING COORDINATOR",
"NOT EMPLOYED", "TEACHER", "ELECTRICAL DESIGNER", "STUDENT",
"SOFTWARE ENGINEER", "TRUCK DRIVER", "NOT EMPLOYED", "CONTRACTOR",
"ENGINEER", "FARMER", "ARTIST", "CIRCUS ARTIST", "CIRCUS ARTIST",
"INFORMATION SECURITY ANALYST", "LAWYER"), contribution_receipt_amount = c(5,
2.7, 2.7, 2, 50, 10, 2.7, 50, 5, 6, 2.7, 50, 100, 27, 2, 5)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16"))
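To see that n() is evaluated per group, here is a small sketch on toy data (the toy data frame and its column names are invented for illustration and are not part of the question's data):

```r
library(dplyr)

# Toy data: occupation "A" has two donations, "B" has one
toy <- data.frame(
  occupation = c("A", "A", "B"),
  amount = c(10, 20, 5)
)

res <- toy %>%
  group_by(occupation) %>%
  summarise(
    n_rows = n(),                     # size of the current group
    manual_avg = sum(amount) / n(),   # sum divided by group size
    avg = mean(amount),               # same value via mean()
    .groups = "drop"
  )
```

Here n_rows is 2 for "A" and 1 for "B", and manual_avg matches avg in every row, which is why the shorter version in the question gives the same result as the cbind approach.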
I have a data frame with three columns where each row is unique:
df1
# state val_1 season
# 1 NY 3 winter
# 2 NY 10 spring
# 3 NY 24 summer
# 4 BOS 14 winter
# 5 BOS 26 spring
# 6 BOS 19 summer
# 7 WASH 99 winter
# 8 WASH 66 spring
# 9 WASH 42 summer
I want to create a matrix with the state names for rows and the seasons for columns with val_1 as the values. I have previously used:
library(reshape2)
df <- acast(df1, state ~ season, value.var='val_1')
This used to create the desired matrix, with each state name appearing once. Recently, however, acast and dcast have been defaulting to the length function and returning 1's for the values. Can anyone recommend a solution?
data
state <- c('NY', 'NY', 'NY', 'BOS', 'BOS', 'BOS', 'WASH', 'WASH', 'WASH')
val_1 <- c(3, 10, 24, 14, 26, 19, 99, 66, 42)
season <- c('winter', 'spring', 'summer', 'winter', 'spring', 'summer',
'winter', 'spring', 'summer')
df1 <- data.frame(state, val_1, season)
You may define the fun.aggregate=.
library(reshape2)
acast(df1, state~season, value.var = 'val_1', fun.aggregate=sum)
# spring summer winter
# BOS 26 19 14
# NY 10 24 3
# WASH 66 42 99
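reshape2 falls back to length when no fun.aggregate is given and some state/season combination occurs more than once. A quick base R check for that, as a sketch (df_dup is a made-up variant of the sample data with one row repeated; whether the real data contains such repeats is an assumption):

```r
state <- c('NY', 'NY', 'NY', 'BOS', 'BOS', 'BOS', 'WASH', 'WASH', 'WASH')
val_1 <- c(3, 10, 24, 14, 26, 19, 99, 66, 42)
season <- c('winter', 'spring', 'summer', 'winter', 'spring', 'summer',
            'winter', 'spring', 'summer')
df1 <- data.frame(state, val_1, season)

# Number of rows whose state/season pair has already appeared earlier
n_dup_sample <- sum(duplicated(df1[c("state", "season")]))

# Hypothetical data with one repeated row, to show what a positive check looks like
df_dup <- rbind(df1, df1[1, ])
n_dup_repeated <- sum(duplicated(df_dup[c("state", "season")]))
```

If the count is zero, every cell of the cast has a single value and no aggregation is needed; if it is positive, choose fun.aggregate deliberately (sum, mean, ...) depending on how repeated rows should be combined.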
This also works
library(reshape2)
state = c('NY', 'NY', 'NY', 'BOS', 'BOS', 'BOS', 'WASH', 'WASH', 'WASH')
val_1 = c(3, 10, 24, 14, 26, 19, 99, 66, 42)
season = c('winter', 'spring', 'summer', 'winter', 'spring', 'summer', 'winter', 'spring', 'summer')
df1 = data.frame(state,
val_1,
season)
dcast(df1, state~season, value.var = 'val_1')
#> state spring summer winter
#> 1 BOS 26 19 14
#> 2 NY 10 24 3
#> 3 WASH 66 42 99
Created on 2022-04-08 by the reprex package (v2.0.1)
I am dealing with a data frame with the columns company name, division name, all_production_2017, bad_production_2017, ... and so on for many years back.
Now I am writing a function that takes a company name and a year as arguments and summarizes the company's production in that year, then sorts it in decreasing order of all_production_year.
I have already converted the year to a string and filtered the rows and columns required. But how can I sort by a specific column? I do not know how to access that column name because the year argument is its suffix.
Here is a rough sketch of the structure of my data frame.
structure(list(company = c("DLT", "DLT", "DLT", "MSF", "MSF", "MSF"),
division = c("Marketing", "CHANG1", "CAHNG2", "MARKETING", "CHANG1M", "CHANG2M"),
all_production_2000 = c(15, 25, 25, 10, 25, 18),
good_production_2000 = c(10, 24, 10, 8, 10, 10),
bad_production_2000 = c(2, 1, 2, 1, 3, 5)))
with data from 2000 to 2017
I want to write a function that, given the name of a company and a year, keeps only the rows and columns for that company and year and sorts them by all_production_thatyear in decreasing order.
This is what I have done so far.
ExportCompanyYear <- function(company.name, year){
year.string <- toString(year)
x <- filter(company.data, company == company.name) %>%
select(company, division, contains(year.string))
}
I just do not know how to sort in decreasing order because I do not know how to access the column name that contains the year argument.
You definitely need to reshape your data in such a way that year values could be passed as a parameter.
To create a reproducible example, I have added another year 2001 in the data.
df = data.frame(company = c("DLT", "DLT", "DLT", "MSF", "MSF", "MSF"), division = c("Marketing", "CHANG1", "CAHNG2", "MARKETING", "CHANG1M", "CHANG2M"), all_production_2000 = c(15, 25, 25, 10, 25, 18), good_production_2000 = c(10, 24, 10, 8, 10, 10), bad_production_2000 = c(2, 1, 2, 1, 3, 5),all_production_2001 = 2*c(15, 25, 25, 10, 25, 18), good_production_2001 = 2*c(10, 24, 10, 8, 10, 10), bad_production_2001 = 2*c(2, 1, 2, 1, 3, 5))
Now you can reshape the data using the reshape function in R.
Here, the variables "all_production","good_production","bad_production" are varying with time, and year values are changing for those variables.
So we specify v.names = c("all_production","good_production","bad_production"). Note that varying is given as a list with one vector of column names per variable, in the same order as v.names. With a plain vector of column names, reshape can pair the columns with v.names in the wrong order (for this data, good_production and bad_production can come out swapped).
df2 = reshape(df,direction="long",
              v.names = c("all_production","good_production","bad_production"),
              varying = list(c("all_production_2000","all_production_2001"),
                             c("good_production_2000","good_production_2001"),
                             c("bad_production_2000","bad_production_2001")),
              idvar = c("company","division"),
              timevar = "year",times = c(2000,2001))
For your data.frame you can specify times=2000:2017 and build each element of varying with, for example, paste0("all_production_", 2000:2017).
>df2
                   company  division year all_production good_production bad_production
DLT.Marketing.2000     DLT Marketing 2000             15              10              2
DLT.CHANG1.2000        DLT    CHANG1 2000             25              24              1
DLT.CAHNG2.2000        DLT    CAHNG2 2000             25              10              2
MSF.MARKETING.2000     MSF MARKETING 2000             10               8              1
MSF.CHANG1M.2000       MSF   CHANG1M 2000             25              10              3
MSF.CHANG2M.2000       MSF   CHANG2M 2000             18              10              5
DLT.Marketing.2001     DLT Marketing 2001             30              20              4
DLT.CHANG1.2001        DLT    CHANG1 2001             50              48              2
DLT.CAHNG2.2001        DLT    CAHNG2 2001             50              20              4
MSF.MARKETING.2001     MSF MARKETING 2001             20              16              2
MSF.CHANG1M.2001       MSF   CHANG1M 2001             50              20              6
MSF.CHANG2M.2001       MSF   CHANG2M 2001             36              20             10
Now you can filter and sort like this:
library(dplyr)
somefunc <- function(company.name, yearval){
  # keep one company and one year, then sort by total production, largest first
  df2 %>%
    filter(company == company.name, year == yearval) %>%
    arrange(desc(all_production))
}
>somefunc("DLT",2001)
  company  division year all_production good_production bad_production
1     DLT    CHANG1 2001             50              48              2
2     DLT    CAHNG2 2001             50              20              4
3     DLT Marketing 2001             30              20              4
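If reshaping is not convenient, the wide data can also be sorted directly by building the column name as a string and looking it up with dplyr's .data pronoun. This is only a sketch of that alternative: the data argument and the sort.col variable are additions for illustration and are not part of the original function.

```r
library(dplyr)

company.data <- data.frame(
  company = c("DLT", "DLT", "DLT", "MSF", "MSF", "MSF"),
  division = c("Marketing", "CHANG1", "CAHNG2", "MARKETING", "CHANG1M", "CHANG2M"),
  all_production_2000 = c(15, 25, 25, 10, 25, 18),
  good_production_2000 = c(10, 24, 10, 8, 10, 10),
  bad_production_2000 = c(2, 1, 2, 1, 3, 5)
)

ExportCompanyYear <- function(data, company.name, year) {
  # Build the name of the column to sort by, e.g. "all_production_2000"
  sort.col <- paste0("all_production_", year)
  data %>%
    filter(company == company.name) %>%
    select(company, division, contains(as.character(year))) %>%
    arrange(desc(.data[[sort.col]]))
}

res <- ExportCompanyYear(company.data, "DLT", 2000)
```

This keeps the original wide layout, at the cost of string manipulation for every column that depends on the year; the long format above scales better when several years are compared.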
Note that the sample data provided by the OP only contains data for the year 2000, but the approach below works the same way with more years.
A solution approach could be:
1. Convert the list to data.frame
2. Use gather from tidyr to arrange dataframe in way where filter can be applied
ll <- structure(list(company = c("DLT", "DLT", "DLT", "MSF", "MSF", "MSF"),
division = c("Marketing", "CHANG1", "CAHNG2", "MARKETING", "CHANG1M",
"CHANG2M"), all_production_2000 = c(15, 25, 25, 10, 25, 18),
good_production_2000 = c(10, 24, 10, 8, 10, 10),
bad_production_2000 = c(2, 1, 2, 1, 3, 5)))
df <- as.data.frame(ll)
library(tidyr)
gather(df, key = "key", value = "value", -c("company", "division"))
#result:
# company division key value
#1 DLT Marketing all_production_2000 15
#2 DLT CHANG1 all_production_2000 25
#3 DLT CAHNG2 all_production_2000 25
#4 MSF MARKETING all_production_2000 10
#5 MSF CHANG1M all_production_2000 25
#6 MSF CHANG2M all_production_2000 18
#7 DLT Marketing good_production_2000 10
#8 DLT CHANG1 good_production_2000 24
#9 DLT CAHNG2 good_production_2000 10
#10 MSF MARKETING good_production_2000 8
#11 MSF CHANG1M good_production_2000 10
#12 MSF CHANG2M good_production_2000 10
#13 DLT Marketing bad_production_2000 2
#14 DLT CHANG1 bad_production_2000 1
#15 DLT CAHNG2 bad_production_2000 2
Now, filter can be applied easily on above data.frame.
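As a sketch of that last step, the key column can be split into a measure and a year with base sub(), after which filtering and sorting are straightforward. The names measure, year and export_company_year are invented for this example.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  company = c("DLT", "DLT", "DLT", "MSF", "MSF", "MSF"),
  division = c("Marketing", "CHANG1", "CAHNG2", "MARKETING", "CHANG1M", "CHANG2M"),
  all_production_2000 = c(15, 25, 25, 10, 25, 18),
  good_production_2000 = c(10, 24, 10, 8, 10, 10),
  bad_production_2000 = c(2, 1, 2, 1, 3, 5)
)

long <- gather(df, key = "key", value = "value", -company, -division) %>%
  mutate(
    year = as.integer(sub(".*_", "", key)),   # text after the last underscore
    measure = sub("_[0-9]+$", "", key)        # text before the year suffix
  )

export_company_year <- function(data, company.name, yearval) {
  data %>%
    filter(company == company.name, year == yearval,
           measure == "all_production") %>%
    arrange(desc(value))
}

res <- export_company_year(long, "DLT", 2000)
```

The result holds one row per division of the chosen company, ordered by all_production for that year from largest to smallest.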
So I have this:
Staff Result Date Days
1 50 2007 4
1 75 2006 5
1 60 2007 3
2 20 2009 3
2 11 2009 2
And I want to get to this:
Staff Result Date Days
1 55 2007 7
1 75 2006 5
2 15 2009 5
I want to have the Staff ID and Date be unique in each row, but I want to sum 'Days' and mean 'Result'
I can't work out how to do this in R, I'm sure I need to do lots of aggregations but I keep getting different results to what I am aiming for.
Many thanks
The simplest way to do this is to group_by Staff and Date and summarise the results with the dplyr package:
require(dplyr)
df <- data.frame(Staff = c(1,1,1,2,2),
Result = c(50, 75, 60, 20, 11),
Date = c(2007, 2006, 2007, 2009, 2009),
Days = c(4, 5, 3, 3, 2))
df %>%
group_by(Staff, Date) %>%
summarise(Result = floor(mean(Result)),
Days = sum(Days)) %>%
data.frame
Staff Date Result Days
1 1 2006 75 5
2 1 2007 55 7
3 2 2009 15 5
You can aggregate on two variables by using a formula and then merge the two aggregates
merge(aggregate(Result ~ Staff + Date, data=df, mean),
aggregate(Days ~ Staff + Date, data=df, sum))
Staff Date Result Days
1 1 2006 75.0 5
2 1 2007 55.0 7
3 2 2009 15.5 5
Here is another option with data.table
library(data.table)
setDT(df)[, .(Result = floor(mean(Result)), Days = sum(Days)), .(Staff, Date)]
# Staff Date Result Days
#1: 1 2007 55 7
#2: 1 2006 75 5
#3: 2 2009 15 5
I have a long format dataframe dogs that I'm trying to reformat to wide using the reshape() function. It currently looks like so:
dogid month year trainingtype home school timeincomp
12345 1 2014 1 1 1 340
12345 2 2014 1 1 1 360
31323 12 2015 2 7 3 440
31323 1 2014 1 7 3 500
31323 2 2014 1 7 3 520
The dogid column holds the ids, one for each dog. The month column varies from 1 to 12 for the 12 months, and year from 2014 to 2015. Trainingtype varies from 1 to 2. Each dog has a timeincomp value for every month-year-trainingtype combination, so 48 entries per dog. Home and school vary from 1 to 8 and are constant per dog (every entry for the same dog has the same school and home). Time in comp is my response variable.
I would like my table to look like so:
dogid home school month1year2014trainingtype1 month2year2014trainingtype1
12345 1 1 340 360
31323 7 3 500 520
etc. (with columns for each month-year-trainingtype combination)
What parameters should I use in reshape to achieve this?
You can use the function dcast from package reshape2. It's easier to understand. The left side of the formula is the one that stays long, while the right side is the one that goes wide.
The fun.aggregate is the function to apply in case that there is more than 1 number per case. If you're sure you don't have repeated cases, you can use mean or sum
dcast(data, formula= dogid + home + school ~ month + year + trainingtype,
value.var = 'timeincomp',
fun.aggregate = sum)
I hope it works:
dogid home school 1_2014_1 2_2014_1 12_2015_2
1 12345 1 1 340 360 0
2 31323 7 3 500 520 440
In this case, using base reshape, you essentially want an interaction() of the three time variables to define your wide variables, so:
idvars <- c("dogid","home","school")
grpvars <- c("year","month","trainingtype")
outvar <- "timeincomp"
time <- interaction(dat[grpvars])
reshape(
cbind(dat[c(idvars,outvar)],time),
idvar=idvars,
timevar="time",
direction="wide"
)
# dogid home school timeincomp.2014.1.1 timeincomp.2014.2.1 timeincomp.2015.12.2
#1 12345 1 1 340 360 NA
#3 31323 7 3 500 520 440
You can do the same thing using the new replacement for reshape2, tidyr:
library(tidyr)
library(dplyr)
data %>% unite(newcol, c(year, month, trainingtype)) %>%
spread(newcol, timeincomp)
dogid home school 2014_1_1 2014_2_1 2015_12_2
1 12345 1 1 340 360 NA
2 31323 7 3 500 520 440
First, we unite the year, month and trainingtype columns into a new column called newcol, then we spread the data with timeincomp as our value variable.
The NA appears because there is no value for that combination; you can substitute another value by setting the fill argument of spread (for example, fill = 0).
With tidyr_1.0.0 and above, another option is pivot_wider
library(tidyverse)
df <- tribble(
~dogid, ~month, ~year, ~trainingtype, ~home, ~school, ~timeincomp,
12345, 1, 2014, 1, 1, 1, 340,
12345, 2, 2014, 1, 1, 1, 360,
31323, 12, 2015, 2, 7, 3, 440,
31323, 1, 2014, 1, 7, 3, 500,
31323, 2, 2014, 1, 7, 3, 520
)
df %>% pivot_wider(
id_cols = c(dogid,home, school),
names_from = c(month, year, trainingtype),
values_from = c(timeincomp),
)
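To get column names in the style asked for in the question (month1year2014trainingtype1 and so on), pivot_wider also accepts a names_glue template. This is a sketch that assumes tidyr 1.1.0 or later, where names_glue is available:

```r
library(tidyr)
library(tibble)

df <- tribble(
  ~dogid, ~month, ~year, ~trainingtype, ~home, ~school, ~timeincomp,
  12345, 1, 2014, 1, 1, 1, 340,
  12345, 2, 2014, 1, 1, 1, 360,
  31323, 12, 2015, 2, 7, 3, 440,
  31323, 1, 2014, 1, 7, 3, 500,
  31323, 2, 2014, 1, 7, 3, 520
)

wide <- pivot_wider(
  df,
  id_cols = c(dogid, home, school),
  names_from = c(month, year, trainingtype),
  values_from = timeincomp,
  # Template for the new column names, built from the names_from columns
  names_glue = "month{month}year{year}trainingtype{trainingtype}"
)
```

Combinations that a dog does not have (here the 2015 month 12 entry for dog 12345) are left as NA unless values_fill is supplied.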