I am trying to reshape my data frame using dcast(), but I get this error:
object 'newid' not found
I do not understand the error. This is the original data frame:
Grade Week Subject Location Marks
6 January English IND 76.50
6 January English US 52.50
7 January English IND 24.00
7 January English US 5.00
8 February English IND 63.00
8 February English US 40.25
9 February English IND 63.00
9 February English US 32.50
10 March English IND 27.00
10 March English US 4.50
11 March English IND 10.00
tmp <- plyr::ddply(monthTotalDataFinal, .(Subject, Grade),
transform,newid = paste(Subject))
d2 <- dcast(tmp, formula = Subject+newid ~ Grade+Location+Week,
value.var = 'Marks')
The required data frame is as follows:
Subject 6_IND 7_IND 6_US 7_US 8_IND 9_IND 8_US 9_US 10_IND 11_IND 10_US
English 77 24 53 5 63 63 40 33 27 10 5
Please give a suitable solution for this.
Using dplyr and tidyr, we can unite the Grade and Location columns and then use spread to get the data in wide format.
library(dplyr)
library(tidyr)
df %>%
unite(key, Grade, Location) %>%
select(-Week) %>%
spread(key, Marks)
# Subject 10_IND 10_US 11_IND 6_IND 6_US 7_IND 7_US 8_IND 8_US 9_IND 9_US
#1 English 27 4.5 10 76.5 52.5 24 5 63 40.25 63 32.5
Based on the comments, we might need to create an identifier column when there are multiple Subject values:
df %>%
unite(key, Grade, Location) %>%
select(-Week) %>%
group_by(key, Subject) %>%
mutate(row = row_number()) %>%
spread(key, Marks)
Since this is a dcast question, we can also use dcast from data.table:
library(data.table)
dcast(setDT(df), Subject ~ Grade + Location, value.var = 'Marks')
# Subject 6_IND 6_US 7_IND 7_US 8_IND 8_US 9_IND 9_US 10_IND 10_US 11_IND
#1: English 76.5 52.5 24 5 63 40.25 63 32.5 27 4.5 10
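If neither tidyr nor data.table is available, the same wide layout can probably be reached in base R. The sketch below is not from the original answers: it pastes Grade and Location into a helper column (key, an illustrative name) and calls reshape() on a four-row subset of the example data.

```r
# Minimal base R sketch; `key` and `wide` are illustrative names.
df <- data.frame(
  Grade    = c(6L, 6L, 7L, 7L),
  Week     = "January",
  Subject  = "English",
  Location = c("IND", "US", "IND", "US"),
  Marks    = c(76.5, 52.5, 24, 5),
  stringsAsFactors = FALSE
)

# Combine Grade and Location into one column that supplies the new headers
df$key <- paste(df$Grade, df$Location, sep = "_")

# Keep only the columns needed, then go from long to wide
wide <- reshape(df[c("Subject", "key", "Marks")],
                idvar = "Subject", timevar = "key", direction = "wide")

# reshape() prefixes the new columns with "Marks."; strip that prefix
names(wide) <- sub("^Marks\\.", "", names(wide))
wide
```

Like the dcast call, this assumes each Subject/Grade/Location combination occurs only once; with duplicates, reshape() keeps the first value and warns.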
data
df <- structure(list(Grade = c(6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L,
10L, 11L), Week = c("January", "January", "January", "January",
"February", "February", "February", "February", "March", "March",
"March"), Subject = c("English", "English", "English", "English",
"English", "English", "English", "English", "English", "English",
"English"), Location = c("IND", "US", "IND", "US", "IND", "US",
"IND", "US", "IND", "US", "IND"), Marks = c(76.5, 52.5, 24, 5,
63, 40.25, 63, 32.5, 27, 4.5, 10)), class = "data.frame",
row.names = c(NA,
-11L))
Say you have a database like gapminder with the population per country. Even though the current year is 2021, you also have predictions for the following years to come.
location 2020.0 2021.0 2022.0
Canada 5 7 9
China 23 34 54
Congo 1 2 3
and another database like this, vaccins
location date amount_of_vaccins
Canada 2020-01-02 50
China 2021-05-03 59
Congo 2022-03-05 34
How can I merge the population of each country into the second database, using the year of each date in the second database?
I managed to merge them by country like this:
merge(gapminder,vaccins, by = "location")
but I'm getting this
location date amount_of_vaccins 2020.0 2021.0 2022.0
Canada 2020-01-02 50 5 7 9
China 2021-05-03 59 23 34 54
Congo 2022-03-05 34 1 2 3
I'd like to have only a new variable giving the population of the country according to the year. Thank you.
You could do something like this with tidyverse.
library(tidyverse)
df1 <- df1 %>%
pivot_longer(!location, names_to = "date", values_to = "population") %>%
dplyr::mutate(year = str_sub(date, 1, 4))
df2 %>%
dplyr::mutate(year = str_sub(date, end = 4)) %>%
dplyr::left_join(., df1, by = c("location", "year")) %>%
dplyr::select(-c(date.y, year)) %>%
dplyr::rename(date = date.x)
Output
location date amount_of_vaccins population
1 Canada 2020-01-02 50 5
2 China 2021-05-03 59 34
3 Congo 2022-03-05 54 3
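The join above can also be approximated in base R without reshaping. This is only a sketch under the assumption that every year column in df1 is named like "2020.0" and that every location in df2 also appears in df1; pop and year_col are illustrative names.

```r
# Same example data as the answer (df1 wide by year, df2 one row per date)
df1 <- data.frame(
  location = c("Canada", "China", "Congo"),
  `2020.0` = c(5, 23, 1),
  `2021.0` = c(7, 34, 2),
  `2022.0` = c(9, 54, 3),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  location = c("Canada", "China", "Congo"),
  date = c("2020-01-02", "2021-05-03", "2022-03-05"),
  amount_of_vaccins = c(50, 59, 54),
  stringsAsFactors = FALSE
)

# Numeric matrix of populations: countries as row names, years as column names
pop <- as.matrix(df1[-1])
rownames(pop) <- df1$location

# Column name matching each vaccination date, e.g. "2021-05-03" -> "2021.0"
year_col <- paste0(substr(df2$date, 1, 4), ".0")

# Index the matrix with (row name, column name) pairs, one pair per row of df2
df2$population <- unname(pop[cbind(df2$location, year_col)])
df2
```

A location or year that is missing from df1 makes the matrix lookup fail with a subscript error, whereas left_join would return NA, so the join is the safer choice for incomplete data.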
Data
df1 <-
structure(
list(
location = c("Canada", "China", "Congo"),
`2020.0` = c(5, 23, 1),
`2021.0` = c(7, 34, 2),
`2022.0` = c(9, 54, 3)
),
class = "data.frame",
row.names = c(NA,-3L)
)
df2 <-
structure(
list(
location = c("Canada", "China", "Congo"),
date = c("2020-01-02",
"2021-05-03", "2022-03-05"),
amount_of_vaccins = c(50, 59, 54)
),
class = "data.frame",
row.names = c(NA,-3L)
)
I currently need to translate my dplyr code into base R code. My dplyr code gives me 3 columns: competitor sex, the Olympic season and the number of different sports. The code looks like this:
olympics %>%
group_by(Sex, Season, Sport) %>%
summarise(n()) %>%
group_by(Sex, Season) %>%
summarise(n()) %>%
setNames(c("Competitor_Sex", "Olympic_Season", "Num_Sports"))
My data structure looks like this.
structure(list(Name = c("A Lamusi", "Juhamatti Tapio Aaltonen",
"Andreea Aanei", "Jamale (Djamel-) Aarrass (Ahrass-)", "Nstor Abad Sanjun",
"Nstor Abad Sanjun"), Sex = c("M", "M", "F", "M", "M", "M"),
Age = c(23L, 28L, 22L, 30L, 23L, 23L), Height = c(170L, 184L,
170L, 187L, 167L, 167L), Weight = c(60, 85, 125, 76, 64,
64), Team = c("China", "Finland", "Romania", "France", "Spain",
"Spain"), NOC = c("CHN", "FIN", "ROU", "FRA", "ESP", "ESP"
), Games = c("2012 Summer", "2014 Winter", "2016 Summer",
"2012 Summer", "2016 Summer", "2016 Summer"), Year = c(2012L,
2014L, 2016L, 2012L, 2016L, 2016L), Season = c("Summer",
"Winter", "Summer", "Summer", "Summer", "Summer"), City = c("London",
"Sochi", "Rio de Janeiro", "London", "Rio de Janeiro", "Rio de Janeiro"
), Sport = c("Judo", "Ice Hockey", "Weightlifting", "Athletics",
"Gymnastics", "Gymnastics"), Event = c("Judo Men's Extra-Lightweight",
"Ice Hockey Men's Ice Hockey", "Weightlifting Women's Super-Heavyweight",
"Athletics Men's 1,500 metres", "Gymnastics Men's Individual All-Around",
"Gymnastics Men's Floor Exercise"), Medal = c(NA, "Bronze",
NA, NA, NA, NA), BMI = c(20.7612456747405, 25.1063327032136,
43.2525951557093, 21.7335354170837, 22.9481157445588, 22.9481157445588
)), .Names = c("Name", "Sex", "Age", "Height", "Weight",
"Team", "NOC", "Games", "Year", "Season", "City", "Sport", "Event",
"Medal", "BMI"), row.names = c(NA, 6L), class = "data.frame")
Does anyone know how to translate this into base R?
Since you are grouping twice in dplyr, you can call aggregate twice in base R:
setNames(aggregate(Name~Sex + Season,
aggregate(Name~Sex + Season + Sport, olympics, length), length),
c("Competitor_Sex", "Olympic_Season", "Num_Sports"))
# Competitor_Sex Olympic_Season Num_Sports
#1 F Summer 1
#2 M Summer 3
#3 M Winter 1
This gives the same output as the dplyr option:
library(dplyr)
olympics %>%
group_by(Sex, Season, Sport) %>%
summarise(n()) %>%
group_by(Sex, Season) %>%
summarise(n()) %>%
setNames(c("Competitor_Sex", "Olympic_Season", "Num_Sports"))
# Competitor_Sex Olympic_Season Num_Sports
# <chr> <chr> <int>
#1 F Summer 1
#2 M Summer 3
#3 M Winter 1
A base R option would be using aggregate twice
out <- aggregate(BMI ~ Sex + Season,
aggregate(BMI ~ Sex + Season + Sport, olympics, length), length)
names(out) <- c("Competitor_Sex", "Olympic_Season", "Num_Sports")
out
# Competitor_Sex Olympic_Season Num_Sports
#1 F Summer 1
#2 M Summer 3
#3 M Winter 1
It is similar to the OP's output
olympics %>%
group_by(Sex, Season, Sport) %>%
summarise(n()) %>%
group_by(Sex, Season) %>%
summarise(n()) %>%
setNames(c("Competitor_Sex", "Olympic_Season", "Num_Sports"))
# A tibble: 3 x 3
# Groups: Sex [2]
# Competitor_Sex Olympic_Season Num_Sports
# <chr> <chr> <int>
#1 F Summer 1
#2 M Summer 3
#3 M Winter 1
Or it can be done in a compact way with table from base R
table(sub(",[^,]+$", "", names(table(do.call(paste,
c(olympics[c("Sex", "Season", "Sport")], sep=","))))))
# F,Summer M,Summer M,Winter
# 1 3 1
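A single aggregate call that counts distinct sports directly may be easier to read than either nested version. The sketch below is an alternative under that assumption, not one of the original answers, and uses only the three relevant columns of the sample data.

```r
# Only the columns needed for the count, taken from the sample data
olympics <- data.frame(
  Sex    = c("M", "M", "F", "M", "M", "M"),
  Season = c("Summer", "Winter", "Summer", "Summer", "Summer", "Summer"),
  Sport  = c("Judo", "Ice Hockey", "Weightlifting", "Athletics",
             "Gymnastics", "Gymnastics"),
  stringsAsFactors = FALSE
)

# Count the number of distinct sports within each Sex/Season group
out <- aggregate(Sport ~ Sex + Season, data = olympics,
                 FUN = function(x) length(unique(x)))
names(out) <- c("Competitor_Sex", "Olympic_Season", "Num_Sports")
out
```

Note that the formula interface of aggregate drops rows with NA in any of the variables it uses, so a Sport column with missing values would be counted differently than in the dplyr version.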
I'm trying to use dplyr to summarize a dataset based on 2 groups: "year" and "area". This is what the dataset looks like:
Year Area Num
1 2000 Area 1 99
2 2001 Area 3 85
3 2000 Area 1 60
4 2003 Area 2 90
5 2002 Area 1 40
6 2002 Area 3 30
7 2004 Area 4 10
...
The end result should look something like this:
Year Area Mean
1 2000 Area 1 100
2 2000 Area 2 80
3 2000 Area 3 89
4 2001 Area 1 80
5 2001 Area 2 85
6 2001 Area 3 59
7 2002 Area 1 90
8 2002 Area 2 88
...
Excuse the values for "mean", they're made up.
The code for the example dataset:
df <- structure(list(
Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
Area = structure(c(1L, 3L, 1L, 2L, 1L, 3L, 4L),
.Label = c("Area 1", "Area 2", "Area 3", "Area 4"),
class = "factor"),
Num = structure(c(7L, 5L, 4L, 6L, 3L, 2L, 1L),
.Label = c("10", "30", "40", "60", "85", "90", "99"),
class = "factor")),
.Names = c("Year", "Area", "Num"),
class = "data.frame", row.names = c(NA, -7L))
df$Num <- as.numeric(df$Num)
Things I've tried:
df.meanYear <- df %>%
group_by(Year) %>%
group_by(Area) %>%
summarize_each(funs(mean(Num)))
But it just replaces every value with the mean, instead of the intended result.
If possible, please also provide alternative (i.e. non-dplyr) methods, because I'm still new to R.
Is this what you are looking for?
library(dplyr)
df <- group_by(df, Year, Area)
df <- summarise(df, avg = mean(Num))
We can use data.table
library(data.table)
setDT(df)[, .(avg = mean(Num)) , by = .(Year, Area)]
I had a similar problem in my code and fixed it with the .groups argument of summarise:
df %>%
group_by(Year,Area) %>%
summarise(avg = mean(Num), .groups="keep")
I also verified this with the example data. Because Num is a factor, as.numeric(df$Num) returns the internal level codes rather than the values, so I used as.numeric(as.character(df$Num)) instead:
Year Area avg
<dbl> <fct> <dbl>
1 2000 Area 1 79.5
2 2001 Area 3 85
3 2002 Area 1 40
4 2002 Area 3 30
5 2003 Area 2 90
6 2004 Area 4 10
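Since the question also asks for a non-dplyr method, here is a base R sketch with aggregate. It rebuilds the example data with factor() rather than structure(), which is an assumption made for brevity, and shows the factor conversion mentioned above.

```r
# Example data; Num starts out as a factor, as in the question
df <- data.frame(
  Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
  Area = factor(c("Area 1", "Area 3", "Area 1", "Area 2",
                  "Area 1", "Area 3", "Area 4")),
  Num  = factor(c("99", "85", "60", "90", "40", "30", "10"))
)

# as.numeric() on a factor returns level codes, not the values
codes <- as.numeric(df$Num)

# Convert through character to recover the real numbers
df$Num <- as.numeric(as.character(df$Num))

# Mean of Num for every Year/Area combination present in the data
res <- aggregate(Num ~ Year + Area, data = df, FUN = mean)
names(res)[3] <- "avg"
res
```

The rows come back ordered by Area and then Year; use res[order(res$Year, res$Area), ] if the dplyr ordering is preferred.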
I have a data.frame which looks like this (in reality 1M rows):
> df
                 R.DMA.NAMES quarter     daypart allpersons.imp rate                    station  spot.id
1 Wilkes.Barre.Scranton.Hztn  Q22014   afternoon            0.0   30                       WSWB 13048713
2                  Nashville  Q12014   primetime            0.0   50              COM NASHVILLE 11969260
3             Seattle.Tacoma  Q12014   primetime            6.1   51 ESPN SEATTLE, EVERETT ZONE 11898905
4               Jacksonville  Q42013 late fringe            2.3  130          Jacksonville WAWS 11617447
5                    Detroit  Q22014   overnight            0.0    0                       WKBD 12571421
6         South.Bend.Elkhart  Q42013   primetime           11.5  325                       WBND 11741171
dput(df)
structure(list(R.DMA.NAMES = c("Wilkes.Barre.Scranton.Hztn",
"Nashville", "Seattle.Tacoma", "Jacksonville", "Detroit", "South.Bend.Elkhart"
), quarter = structure(c(3L, 1L, 1L, 6L, 3L, 6L), .Label = c("Q12014",
"Q22013", "Q22014", "Q32013", "Q32014", "Q42013"), class = "factor"),
daypart = c("afternoon", "primetime", "primetime", "late fringe",
"overnight", "primetime"), allpersons.imp = c(0, 0, 6.1,
2.3, 0, 11.5), rate = c(30, 50, 51, 130, 0, 325), station = c("WSWB",
"COM NASHVILLE", "ESPN SEATTLE, EVERETT ZONE", "Jacksonville WAWS",
"WKBD", "WBND"), spot.id = c(13048713L, 11969260L, 11898905L,
11617447L, 12571421L, 11741171L)), .Names = c("R.DMA.NAMES",
"quarter", "daypart", "allpersons.imp", "rate", "station", "spot.id"
), row.names = c(NA, -6L), class = "data.frame")
I am using a ddply function to perform a calculation:
ddply(df, .(R.DMA.NAMES, station, quarter), function (x) {
  # x is the subset of rows for one group; use it rather than the full df
  cpi = sum(x$rate) / sum(x$allpersons.imp)
})
This creates a new data.frame which looks like this:
R.DMA.NAMES station quarter V1
1 Detroit WKBD Q22014 NaN
2 Jacksonville Jacksonville WAWS Q42013 56.521739
3 Nashville COM NASHVILLE Q12014 Inf
4 Seattle.Tacoma ESPN SEATTLE, EVERETT ZONE Q12014 8.360656
5 South.Bend.Elkhart WBND Q42013 28.260870
6 Wilkes.Barre.Scranton.Hztn WSWB Q22014 Inf
What I'd like to do is create a new column called "cpi" in my original df i.e. the applicable "cpi" value should appear against the particular row. Of course, the same value will repeat many times i.e. 8.36 will appear for every row which contains "Seattle.Tacoma" for R.DMA.NAMES, "ESPN SEATTLE, EVERETT ZONE" for station and Q12014 for quarter. I tried several things including:
transform(df, cpi = ddply(df, .(R.DMA.NAMES, station, quarter), function (x) {
cpi = sum(df$rate) / sum(df$allpersons.imp)
})
But this didn't work. Can someone explain why?
Use transform within ddply:
ddply(df, .(R.DMA.NAMES, station, quarter),
transform, cpi = sum(rate) / sum(allpersons.imp))
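For a data frame with around 1M rows, a base R version with ave may be worth trying, since it adds the group value to every row without splitting the data frame. This is a sketch on made-up data; the short column names (market, imp) and the grp helper are assumptions for illustration, not the asker's columns.

```r
# Small made-up example: two rows share a group, one row is on its own
df <- data.frame(
  market  = c("Seattle", "Seattle", "Detroit"),
  station = c("ESPN", "ESPN", "WKBD"),
  quarter = c("Q12014", "Q12014", "Q22014"),
  rate    = c(51, 49, 30),
  imp     = c(6, 4, 6),
  stringsAsFactors = FALSE
)

# One grouping factor for the market/station/quarter combination
grp <- interaction(df$market, df$station, df$quarter, drop = TRUE)

# ave() returns a vector as long as its input, with the group sum repeated
# on every row of the group, so the ratio lines up with the original rows
df$cpi <- ave(df$rate, grp, FUN = sum) / ave(df$imp, grp, FUN = sum)
df
```

As in the ddply version, groups whose impressions sum to zero give Inf or NaN, so those rows may need separate handling.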
I have a dataframe with some monthly data for 2 decades:
year month value
1960 January 925
1960 February 903
1960 March 1006
...
1969 December 892
1970 January 990
1970 February 866
...
1979 December 120
I would like to create a dataframe where I sum up the totals, for each decade, by month, as follows:
year month value
decade_60s January 4012
decade_60s February 8678
decade_60s March 9317
...
decade_60s December 3995
decade_70s January 8005
decade_70s February 9112
...
decade_70s December 325
I have been looking at the aggregate function, but this doesn't appear to be the right option.
I looked instead at some careful subsetting using the which function but this quickly became too messy.
For this kind of problem, what would be the correct approach? Will I need to use apply at some point, and if so, how?
I feel the temptation to use a for loop growing, but I don't think this would be the best way to improve my skills in R.
Thanks for the advice.
PS: The month value is an ordinal factor, if this matters.
aggregate is one way to go in base R.
First define the decade:
yourdata$decade <- cut(yourdata$year, breaks=c(1960,1970,1980), labels=c(60,70),
include.lowest=TRUE, right=FALSE)
Then aggregate the data
aggregate(value ~ decade + month, data=yourdata , sum)
Then order the result by decade and month to get the required output.
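The three steps can be sketched end to end as follows. The data here are a made-up six-row sample, and month is stored as a factor with calendar-ordered levels (an assumption based on the PS in the question) so that order() sorts January before February.

```r
# Made-up sample covering both decades
yourdata <- data.frame(
  year  = c(1960, 1960, 1969, 1970, 1979, 1979),
  month = factor(c("January", "February", "January",
                   "January", "February", "January"),
                 levels = month.name),
  value = c(925, 903, 892, 990, 866, 120)
)

# Step 1: assign each year to a decade, [1960, 1970) and [1970, 1980]
yourdata$decade <- cut(yourdata$year, breaks = c(1960, 1970, 1980),
                       labels = c(60, 70),
                       include.lowest = TRUE, right = FALSE)

# Step 2: sum the values per decade and month
totals <- aggregate(value ~ decade + month, data = yourdata, FUN = sum)

# Step 3: sort by decade first, then by calendar month
totals <- totals[order(totals$decade, totals$month), ]
totals
```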
plyr's count + gsub are definitely your friends here:
library(plyr)
dat <- structure(list(year = c(1960L, 1960L, 1960L, 1969L, 1970L, 1970L, 1979L),
month = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 1L),
.Label = c("December", "February", "January", "March"),
class = "factor"),
value = c(925L, 903L, 1006L, 892L, 990L, 866L, 120L)),
.Names = c("year", "month", "value"),
class = "data.frame", row.names = c(NA, -7L))
dat$decade <- gsub("[0-9]$", "0", dat$year)
count(dat, .(decade, month), wt_var=.(value))
## decade month freq
## 1 1960 December 892
## 2 1960 February 903
## 3 1960 January 925
## 4 1960 March 1006
## 5 1970 December 120
## 6 1970 February 866
## 7 1970 January 990
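To get labels in the decade_60s form shown in the question, the decade can also be derived with integer arithmetic instead of gsub. This is a base R sketch on made-up rows; the label format comes from the question, while the sprintf approach is an assumption about how one might build it.

```r
# Made-up sample rows spanning two decades
dat <- data.frame(
  year  = c(1960L, 1960L, 1969L, 1970L, 1970L, 1979L),
  month = c("January", "February", "January",
            "January", "February", "January"),
  value = c(925L, 903L, 892L, 990L, 866L, 120L),
  stringsAsFactors = FALSE
)

# Last two digits of the year, rounded down to the decade: 1969 -> 60
two_digit <- as.integer((dat$year %% 100L) %/% 10L * 10L)

# Zero-padded so that years such as 2005 would give "decade_00s"
dat$decade <- sprintf("decade_%02ds", two_digit)

# Sum the values per decade label and month
totals <- aggregate(value ~ decade + month, data = dat, FUN = sum)
totals
```

Two-digit labels are ambiguous across centuries (1960 and 2060 both map to decade_60s), so this only suits data confined to a single century, as in the question.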