I have a data.frame which looks like this (in reality 1M rows):
`> df
R.DMA.NAMES quarter daypart allpersons.imp rate station spot.id
1 Wilkes.Barre.Scranton.Hztn Q22014 afternoon 0.0 30 WSWB 13048713
2 Nashville Q12014 primetime 0.0 50 COM NASHVILLE 11969260
3 Seattle.Tacoma Q12014 primetime 6.1 51 ESPN SEATTLE, EVERETT ZONE 11898905
4 Jacksonville Q42013 late fringe 2.3 130 Jacksonville WAWS 11617447
5 Detroit Q22014 overnight 0.0 0 WKBD 12571421
6 South.Bend.Elkhart Q42013 primetime 11.5 325 WBND 11741171`
dput(df)
structure(list(R.DMA.NAMES = c("Wilkes.Barre.Scranton.Hztn",
"Nashville", "Seattle.Tacoma", "Jacksonville", "Detroit", "South.Bend.Elkhart"
), quarter = structure(c(3L, 1L, 1L, 6L, 3L, 6L), .Label = c("Q12014",
"Q22013", "Q22014", "Q32013", "Q32014", "Q42013"), class = "factor"),
daypart = c("afternoon", "primetime", "primetime", "late fringe",
"overnight", "primetime"), allpersons.imp = c(0, 0, 6.1,
2.3, 0, 11.5), rate = c(30, 50, 51, 130, 0, 325), station = c("WSWB",
"COM NASHVILLE", "ESPN SEATTLE, EVERETT ZONE", "Jacksonville WAWS",
"WKBD", "WBND"), spot.id = c(13048713L, 11969260L, 11898905L,
11617447L, 12571421L, 11741171L)), .Names = c("R.DMA.NAMES",
"quarter", "daypart", "allpersons.imp", "rate", "station", "spot.id"
), row.names = c(NA, -6L), class = "data.frame")
I am using a ddply function to perform a calculation:
ddply(df, .(R.DMA.NAMES, station, quarter), function (x) {
cpi = sum(x$rate) / sum(x$allpersons.imp)
})
This creates a new data.frame which looks like this:
R.DMA.NAMES station quarter V1
1 Detroit WKBD Q22014 NaN
2 Jacksonville Jacksonville WAWS Q42013 56.521739
3 Nashville COM NASHVILLE Q12014 Inf
4 Seattle.Tacoma ESPN SEATTLE, EVERETT ZONE Q12014 8.360656
5 South.Bend.Elkhart WBND Q42013 28.260870
6 Wilkes.Barre.Scranton.Hztn WSWB Q22014 Inf
What I'd like to do is create a new column called "cpi" in my original df i.e. the applicable "cpi" value should appear against the particular row. Of course, the same value will repeat many times i.e. 8.36 will appear for every row which contains "Seattle.Tacoma" for R.DMA.NAMES, "ESPN SEATTLE, EVERETT ZONE" for station and Q12014 for quarter. I tried several things including:
transform(df, cpi = ddply(df, .(R.DMA.NAMES, station, quarter), function (x) {
cpi = sum(df$rate) / sum(df$allpersons.imp)
})
But this didn't work! Can someone explain why?
Use transform within ddply:
ddply(df, .(R.DMA.NAMES, station, quarter),
transform, cpi = sum(rate) / sum(allpersons.imp))
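For readers without plyr, the same per-group value can be attached to every row in base R with ave(). This is only a sketch: the data frame `toy` and its numbers are made up for illustration, and `group_sum` is a hypothetical helper, not part of any package.

```r
# Hypothetical toy data: the first two rows share market, station and quarter.
toy <- data.frame(
  R.DMA.NAMES    = c("Seattle.Tacoma", "Seattle.Tacoma", "Detroit"),
  station        = c("ESPN", "ESPN", "WKBD"),
  quarter        = c("Q12014", "Q12014", "Q22014"),
  rate           = c(51, 49, 10),
  allpersons.imp = c(6, 4, 5),
  stringsAsFactors = FALSE
)

# ave() returns a vector as long as its input, with each group's
# summary repeated on every row that belongs to the group.
group_sum <- function(x) {
  ave(x, toy$R.DMA.NAMES, toy$station, toy$quarter, FUN = sum)
}

toy$cpi <- group_sum(toy$rate) / group_sum(toy$allpersons.imp)
toy$cpi
```

Because the result has one value per original row, it can be assigned straight into a new column without a join.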
I'm working on a March Madness project. I have a data frame df.A with every team and season.
For example:
Season Team Name Code
2003 Creighton 2003-1166
2003 Notre Dame 2003-1323
2003 Arizona 2003-1112
And another data frame df.B with the results of every game in every season:
WTeamScore LTeamScore WTeamCode LTeamCode
15 10 2003-1166 2003-1323
20 15 2003-1323 2003-1112
10 5 2003-1112 2003-1166
I'm trying to get a column in df.A that totals the number of points in both wins and losses. Basically:
Season Team Name Code Points
2003 Creighton 2003-1166 20
2003 Notre Dame 2003-1323 30
2003 Arizona 2003-1112 25
There are obviously thousands more rows in each data frame, but this is the general idea. What would be the best way of going about this?
Here is another option using tidyverse, where we can pivot df.B to long form, then get the sum for each team, then join back to df.A.
library(tidyverse)
df.B %>%
pivot_longer(everything(),names_pattern = "(WTeam|LTeam)(.*)",
names_to = c("rep", ".value")) %>%
group_by(Code) %>%
summarise(Points = sum(Score)) %>%
left_join(df.A, ., by = "Code")
Output
Season Team.Name Code Points
1 2003 Creighton 2003-1166 20
2 2003 Notre Dame 2003-1323 30
3 2003 Arizona 2003-1112 25
Data
df.A <- structure(list(Season = c(2003L, 2003L, 2003L), Team.Name = c("Creighton",
"Notre Dame", "Arizona"), Code = c("2003-1166", "2003-1323",
"2003-1112")), class = "data.frame", row.names = c(NA, -3L))
df.B <- structure(list(WTeamScore = c(15L, 20L, 10L), LTeamScore = c(10L,
15L, 5L), WTeamCode = c("2003-1166", "2003-1323", "2003-1112"
), LTeamCode = c("2003-1323", "2003-1112", "2003-1166")), class = "data.frame", row.names = c(NA,
-3L))
We may use match (from base R) between 'Code' in 'df.A' and 'WTeamCode' and 'LTeamCode' in df.B to get the matching indices, extract the corresponding 'Score' values, and add them (+):
df.A$Points <- with(df.A, df.B$WTeamScore[match(Code,
df.B$WTeamCode)] +
df.B$LTeamScore[match(Code, df.B$LTeamCode)])
-output
> df.A
Season TeamName Code Points
1 2003 Creighton 2003-1166 20
2 2003 Notre Dame 2003-1323 30
3 2003 Arizona 2003-1112 25
If there are nonmatches resulting in missing values (NA) from match, cbind the vectors to create a matrix and use rowSums with na.rm = TRUE
df.A$Points <- with(df.A, rowSums(cbind(df.B$WTeamScore[match(Code,
df.B$WTeamCode)],
df.B$LTeamScore[match(Code, df.B$LTeamCode)]), na.rm = TRUE))
data
df.A <- structure(list(Season = c(2003L, 2003L, 2003L), TeamName = c("Creighton",
"Notre Dame", "Arizona"), Code = c("2003-1166", "2003-1323",
"2003-1112")), class = "data.frame", row.names = c(NA, -3L))
df.B <- structure(list(WTeamScore = c(15L, 20L, 10L), LTeamScore = c(10L,
15L, 5L), WTeamCode = c("2003-1166", "2003-1323", "2003-1112"
), LTeamCode = c("2003-1323", "2003-1112", "2003-1166")),
class = "data.frame", row.names = c(NA,
-3L))
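One caveat with match(): it returns only the first matching row, so it fits the three-game example but would count just one win and one loss per team once a team appears in several games. A possible base R sketch for that case stacks winners and losers and sums per code with rowsum(). The `teams` and `games` data frames below are made up for illustration.

```r
# Hypothetical data: team "A" wins twice, so match() alone would miss a game.
teams <- data.frame(Code = c("A", "B", "C"), stringsAsFactors = FALSE)
games <- data.frame(
  WTeamScore = c(15, 20, 12),
  LTeamScore = c(10, 15, 8),
  WTeamCode  = c("A", "A", "B"),
  LTeamCode  = c("B", "C", "C"),
  stringsAsFactors = FALSE
)

# Stack winners and losers into one long (Code, Score) table.
long <- data.frame(
  Code  = c(games$WTeamCode, games$LTeamCode),
  Score = c(games$WTeamScore, games$LTeamScore),
  stringsAsFactors = FALSE
)

# rowsum() adds up Score within each Code; the result is a one-column
# matrix whose row names are the codes.
totals <- rowsum(long$Score, long$Code)

# Look up each team's total; teams without any game get 0 instead of NA.
teams$Points <- unname(totals[match(teams$Code, rownames(totals)), 1])
teams$Points[is.na(teams$Points)] <- 0
teams
```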
I have created a summary table like below
Name Sales
AS 71.5%
DY 88.4%
VH 44.6%
MY 86.9%
HU 42.3%
TT 67.2%
BG 0.0%
SA 85.3%
Now I want to replace every occurrence of 0.0% with "-".
I have tried:
tab[,2] <- paste0(tab[,2],"%")
tab[,2] <- replace(tab[,2],tab[,2]<0,"-")
but it is converting values like 8.0 and 7.0 to "-" as well.
Is there another solution?
The output should look like this:
Name Sales
AS 71.5%
DY 88.4%
BG -
so the whole function is like this, have three columns of os sales for each person
You can try this:
#Data
df <- structure(list(Name = structure(c(1L, 3L, 8L, 5L, 4L, 7L, 2L,
6L), .Label = c("AS", "BG", "DY", "HU", "MY", "SA", "TT", "VH"
), class = "factor"), Sales = c(71.5, 88.4, 44.6, 86.9, 42.3,
67.2, 0, 85.3)), class = "data.frame", row.names = c(NA, -8L))
#Code
index <- which(df$Sales==0)
df$Sales[index] <- '-'
Name Sales
1 AS 71.5
2 DY 88.4
3 VH 44.6
4 MY 86.9
5 HU 42.3
6 TT 67.2
7 BG -
8 SA 85.3
Update with new data
New data has been provided:
df2 <- structure(list(Name = c("AS", "DY", "VH", "MY", "HU", "TT", "BG",
"SA"), Sales = c("71.5%", "88.4%", "44.6%", "86.9%", "42.3%",
"67.2%", "0.0%", "85.3%")), class = "data.frame", row.names = c(NA,
-8L))
# Compare whole strings so that values such as "10.0%" are left alone
df2$Sales2 <- ifelse(df2$Sales == "0.0%", "-", df2$Sales)
Name Sales Sales2
1 AS 71.5% 71.5%
2 DY 88.4% 88.4%
3 VH 44.6% 44.6%
4 MY 86.9% 86.9%
5 HU 42.3% 42.3%
6 TT 67.2% 67.2%
7 BG 0.0% -
8 SA 85.3% 85.3%
Update with variable
Using first data df:
df$tab <- paste0(df$Sales, "%")
df$tab <- ifelse(nchar(df$tab) == 2, gsub("0%", "-", df$tab, fixed = TRUE), df$tab)
Name Sales tab
1 AS 71.5 71.5%
2 DY 88.4 88.4%
3 VH 44.6 44.6%
4 MY 86.9 86.9%
5 HU 42.3 42.3%
6 TT 67.2 67.2%
7 BG 0.0 -
8 SA 85.3 85.3%
Try this:
tab$Sales <- replace(tab$Sales, which(tab$Sales == 0), "-")
I'd also recommend looking into dplyr's mutate.
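If Sales is still numeric, one option is to test the number first and only then build the display string, which avoids pattern matching on text altogether. A minimal sketch, with a made-up `tab` (the "XX" row is added only to show that 10.0 is not touched):

```r
# Hypothetical summary table with numeric Sales.
tab <- data.frame(Name = c("AS", "BG", "XX"), Sales = c(71.5, 0, 10),
                  stringsAsFactors = FALSE)

# Decide on the number, then format: zero becomes "-", everything else
# is printed with one decimal place and a percent sign.
tab$Label <- ifelse(tab$Sales == 0, "-", sprintf("%.1f%%", tab$Sales))
tab$Label
```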
I have 2 data frames in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 data frames are matched by a primary key ('pid'). dfnew is a subset of dfold: all variables in dfnew are also in dfold, but with updated, imputed values (no NAs anymore). dfold also has additional variables that I will need in the analysis phase. I would like to merge the 2 data frames into dfmerge, updating the common variables from dfnew --> dfold while retaining the other pre-existing variables in dfold. I have tried merge(), match(), dplyr, and sqldf, but I obtain either a dfmerge with only the 75 updated variables (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I found (here) is an elegant but fairly long (10 lines) loop that eliminates the *.x variables after a merge by pid with the all.x = TRUE option. Could you please advise on a more efficient way to obtain this result, if one exists?
Thank you in advance
P.S.: To make things easier, I have created a minimal version of dfold and dfnew: dfnew now has 3 variables and no NAs, while dfold has 5 variables, NAs included. Here is the structure of the data frames:
dfold:
structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France",
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L,
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L,
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased",
"pid"), row.names = c(NA, 5L), class = "data.frame")
dfnew:
structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000,
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA,
3L), class = "data.frame")
Although the issue here is limited to just 2 variables, please keep in mind that the real scenario will involve 75 variables.
Alright, this solution assumes that you don't really need a merge but only want to update NA values within your dfold with imputed values in dfnew.
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 NA Yes 5
> dfnew
Age Salary pid
1 44 72000 1
2 27 48000 2
3 30 54000 3
4 38 61000 4
5 40 70000 5
To do this for a single column, try
# match() aligns dfnew rows to dfold by pid, even if row order or row counts differ
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[match(dfold$pid, dfnew$pid)], dfold$Salary)
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
Using it on the whole dataset was a bit trickier:
First define all common colnames except pid:
cols <- names(dfnew)[names(dfnew) != "pid"]
> cols
[1] "Age" "Salary"
Now use mapply to replace the NA values with ifelse:
dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[match(dfold$pid, dfnew$pid)], x), dfold[,cols], dfnew[,cols])
> dfold
Country Age Salary Purchased pid
1 France 44 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
This assumes that dfnew only includes columns that are present in dfold. If this is not the case, keep only the shared column names and drop pid:
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")
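When dfnew covers only some of the pids, as in the question's minimal example, the rows can be aligned with match() and the shared columns overwritten in one assignment. The following is a sketch under those assumptions, using cut-down, made-up data; unlike the ifelse approach it replaces every shared value for a matched pid, not only the NAs.

```r
# Hypothetical cut-down versions of dfold and dfnew.
dfold <- data.frame(pid = 1:5,
                    Age = c(NA, 27, 30, 38, 40),
                    Salary = c(72000, 48000, NA, 61000, NA),
                    Purchased = c("No", "Yes", "No", "No", "Yes"),
                    stringsAsFactors = FALSE)
dfnew <- data.frame(pid = c(3, 1, 2),          # fewer rows, different order
                    Age = c(30, 44, 27),
                    Salary = c(54000, 72000, 48000))

# Columns present in both data frames, excluding the key.
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")

# For each dfold row, the position of its pid in dfnew (NA if absent).
idx <- match(dfold$pid, dfnew$pid)
hit <- !is.na(idx)

# Overwrite the shared columns only for rows that exist in dfnew.
dfold[hit, cols] <- dfnew[idx[hit], cols]
dfold
```

Rows 4 and 5 have no counterpart in dfnew, so they keep their original values, including the NA in Salary.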
I'm trying to use dplyr to summarize a dataset based on 2 groups: "year" and "area". This is what the dataset looks like:
Year Area Num
1 2000 Area 1 99
2 2001 Area 3 85
3 2000 Area 1 60
4 2003 Area 2 90
5 2002 Area 1 40
6 2002 Area 3 30
7 2004 Area 4 10
...
The end result should look something like this:
Year Area Mean
1 2000 Area 1 100
2 2000 Area 2 80
3 2000 Area 3 89
4 2001 Area 1 80
5 2001 Area 2 85
6 2001 Area 3 59
7 2002 Area 1 90
8 2002 Area 2 88
...
Excuse the values for "mean", they're made up.
The code for the example dataset:
df <- structure(list(
Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
Area = structure(c(1L, 3L, 1L, 2L, 1L, 3L, 4L),
.Label = c("Area 1", "Area 2", "Area 3", "Area 4"),
class = "factor"),
Num = structure(c(7L, 5L, 4L, 6L, 3L, 2L, 1L),
.Label = c("10", "30", "40", "60", "85", "90", "99"),
class = "factor")),
.Names = c("Year", "Area", "Num"),
class = "data.frame", row.names = c(NA, -7L))
df$Num <- as.numeric(df$Num)
Things I've tried:
df.meanYear <- df %>%
group_by(Year) %>%
group_by(Area) %>%
summarize_each(funs(mean(Num)))
But it just replaces every value with the mean, instead of the intended result.
If possible, please also provide alternative (i.e. non-dplyr) methods, because I'm still new to R.
Is this what you are looking for?
library(dplyr)
df <- group_by(df, Year, Area)
df <- summarise(df, avg = mean(Num))
We can use data.table
library(data.table)
setDT(df)[, .(avg = mean(Num)) , by = .(Year, Area)]
I had a similar problem in my code; I fixed it with the .groups argument:
df %>%
group_by(Year,Area) %>%
summarise(avg = mean(Num), .groups="keep")
Also verified with the added example (as.numeric corrupted Num values, so I used as.numeric(as.character(df$Num)) to fix it):
Year Area avg
<dbl> <fct> <dbl>
1 2000 Area 1 79.5
2 2001 Area 3 85
3 2002 Area 1 40
4 2002 Area 3 30
5 2003 Area 2 90
6 2004 Area 4 10
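Since the question also asks for a non-dplyr route, here is a base R sketch using aggregate() with a formula. The data frame is rebuilt inline so the block runs on its own, and Num is converted through as.character() so the factor labels rather than the internal codes become numbers.

```r
# Rebuild the example data; Num starts out as a factor, as in the question.
df <- data.frame(
  Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
  Area = c("Area 1", "Area 3", "Area 1", "Area 2", "Area 1", "Area 3", "Area 4"),
  Num  = factor(c("99", "85", "60", "90", "40", "30", "10")),
  stringsAsFactors = FALSE
)

# Convert the factor labels (not the integer codes) to numbers.
df$Num <- as.numeric(as.character(df$Num))

# One row per Year/Area combination that occurs in the data.
res <- aggregate(Num ~ Year + Area, data = df, FUN = mean)
names(res)[names(res) == "Num"] <- "avg"
res
```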
Suppose I had the following data set.
Index Country Age   Time  Response
1     Germany 20-30 15-20 1
2     Germany 20-30 15-20 NA
3     Germany 20-30 15-20 1
4     Germany 20-30 15-20 0
5     France  20-30 30-40 1
And I would like to fill in the NA based on the criteria listed below
1. Find all exact matches of Country, Age and Time, i.e. Index 1, 3 and 4.
2. Select at random 1 value from the Response column of these matching rows, i.e. 1, 1 or 0.
3. Replace the NA with this new value.
And I would like it to continue on in the same manner for the rest of the NA's in the data set.
I'm new to 'R' and can't figure out how to code this.
Here is one approach using the "data.table" package:
library(data.table)
DT <- data.table(mydf, key = "Country,Age,Time")
DT[, R2 := ifelse(is.na(Response), sample(na.omit(Response), 1),
Response), by = key(DT)]
DT
# Index Country Age Time Response R2
# 1: 5 France 20-30 30-40 1 1
# 2: 6 France 20-30 30-40 NA 2
# 3: 7 France 20-30 30-40 2 2
# 4: 1 Germany 20-30 15-20 1 1
# 5: 2 Germany 20-30 15-20 NA 1
# 6: 3 Germany 20-30 15-20 1 1
# 7: 4 Germany 20-30 15-20 0 0
Similarly, in base R, you could try ave:
within(mydf, {
R2 <- ave(Response, Country, Age, Time, FUN = function(x) {
ifelse(is.na(x), sample(na.omit(x), 1), x)
})
})
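Two caveats apply to both versions above. First, sample(x, 1) treats a single number x >= 1 as the range 1:x, so a group with exactly one observed Response such as 5 could be filled with any value from 1 to 5. Second, the ifelse() form draws once per group, so every NA in a group receives the same value. The sketch below works around both by sampling positions rather than values; `fill_random` is a hypothetical helper name, and the data is a made-up variant of the sample data chosen to hit the single-value case.

```r
# Hypothetical helper: fill each NA with an independent random draw from
# the observed values of the same group; all-NA groups are left untouched.
fill_random <- function(x) {
  pool <- x[!is.na(x)]
  miss <- is.na(x)
  if (length(pool) > 0 && any(miss)) {
    # Sample positions, not values, so a pool of length one is safe.
    x[miss] <- pool[sample.int(length(pool), sum(miss), replace = TRUE)]
  }
  x
}

# Made-up data: the France group has a single observed Response of 5.
mydf <- data.frame(
  Country  = c("Germany", "Germany", "Germany", "Germany",
               "France", "France", "France"),
  Age      = "20-30",
  Time     = c("15-20", "15-20", "15-20", "15-20", "30-40", "30-40", "30-40"),
  Response = c(1L, NA, 1L, 0L, 5L, NA, NA),
  stringsAsFactors = FALSE
)

set.seed(1)  # only to make the random fill reproducible
mydf$R2 <- ave(mydf$Response, mydf$Country, mydf$Age, mydf$Time,
               FUN = fill_random)
mydf
```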
Sorry, forgot to share the sample data I was working with:
mydf <- structure(list(Index = 1:7, Country = c("Germany", "Germany",
"Germany", "Germany", "France", "France", "France"), Age = c("20-30",
"20-30", "20-30", "20-30", "20-30", "20-30", "20-30"), Time = c("15-20",
"15-20", "15-20", "15-20", "30-40", "30-40", "30-40"), Response = c(1L,
NA, 1L, 0L, 1L, NA, 2L)), .Names = c("Index", "Country", "Age",
"Time", "Response"), class = "data.frame", row.names = c(NA, -7L))