R moving average between data frame variables

I am trying to find a solution but haven't found one yet.
I have a data frame structured as follows:
country City 2014 2015 2016 2017 2018 2019
France  Paris  23   34   54   12   23   21
US      NYC     1    2    2   12   95   54
I want to find the moving average over every 3 years (e.g. 2014-16, 2015-17, etc.) and place the results in ad-hoc columns:
country City 2014 2015 2016 2017 2018 2019 2014-2016 2015-2017 2016-2018 2017-2019
France  Paris  23   34   54   12   23   21        37      33.3      29.7      18.7
US      NYC     1    2    2   12   95   54       etc       etc       etc       etc
Any hint?

1) Using the data shown reproducibly in the Note at the end, we apply rollmean to each column of the transposed data and then transpose back. We rollapply an appropriate paste over the year names to create the new column names.
library(zoo)
DF2 <- DF[-(1:2)]
cbind(DF, setNames(as.data.frame(t(rollmean(t(DF2), 3))),
  rollapply(names(DF2), 3, function(x) paste(range(x), collapse = "-"))))
giving:
country City 2014 2015 2016 2017 2018 2019 2014-2016 2015-2017 2016-2018 2017-2019
1 France Paris 23 34 54 12 23 21 37.000000 33.333333 29.66667 18.66667
2 US NYC 1 2 2 12 95 54 1.666667 5.333333 36.33333 53.66667
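For comparison, a base R sketch of the same idea (no zoo), building the 3-year index windows with embed() and averaging with rowMeans(); DF is the data frame from the Note:
# base R sketch: three-year means over a sliding window of year columns
yrs <- names(DF)[-(1:2)]
win <- embed(seq_along(yrs), 3)[, 3:1]   # rows are the index windows 1:3, 2:4, ...
ma <- sapply(seq_len(nrow(win)), function(i) rowMeans(DF[yrs[win[i, ]]]))
colnames(ma) <- apply(win, 1, function(j) paste(range(yrs[j]), collapse = "-"))
cbind(DF, ma)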
2) This could also be expressed using dplyr/tidyr/zoo like this:
library(dplyr)
library(tidyr)
library(zoo)
DF %>%
  pivot_longer(-c(country, City)) %>%
  group_by(country, City) %>%
  mutate(value = rollmean(value, 3, fill = NA),
         name = rollapply(name, 3, function(x) paste(range(x), collapse = "-"), fill = NA)) %>%
  ungroup %>%
  drop_na %>%
  pivot_wider %>%
  left_join(DF, ., by = c("country", "City"))
Note
Lines <- "country City 2014 2015 2016 2017 2018 2019
France Paris 23 34 54 12 23 21
US NYC 1 2 2 12 95 54 "
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, check.names = FALSE)


How to use a loop to create panel data by subsetting and merging a lot of different data frames in R?

I've looked around but I can't find an answer to this!
I've imported a large number of datasets to R.
Each dataset contains information for a single year (ex. df_2012, df_2013, df_2014 etc).
All the datasets have the same variables/columns (ex. varA_2012 in df_2012 corresponds to varA_2013 in df_2013).
I want to create a df with my id variable and varA_2012, varB_2012, varA_2013, varB_2013, varA_2014, varB_2014 etc
I'm trying to create a loop that helps me extract the few columns that I'm interested in (varA_XXXX, varB_XXXX) in each data frame and then do a full join based on my id var.
I haven't used R in a very long time...
So far, I've tried this:
id <- c("France", "Belgium", "Spain")
varA_2012 <- c(1,2,3)
varB_2012 <- c(7,2,9)
varC_2012 <- c(1,56,0)
varD_2012 <- c(13,55,8)
varA_2013 <- c(34,3,56)
varB_2013 <- c(2,53,5)
varC_2013 <- c(24,3,45)
varD_2013 <- c(27,13,8)
varA_2014 <- c(9,10,5)
varB_2014 <- c(95,30,75)
varC_2014 <- c(99,0,51)
varD_2014 <- c(9,40,1)
df_2012 <-data.frame(id, varA_2012, varB_2012, varC_2012, varD_2012)
df_2013 <-data.frame(id, varA_2013, varB_2013, varC_2013, varD_2013)
df_2014 <-data.frame(id, varA_2014, varB_2014, varC_2014, varD_2014)
year = c(2012:2014)
for(i in 1:length(year)) {
df_[i] <- df_[I][df_[i]$id, df_[i]$varA_[i], df_[i]$varB_[i], ]
list2env(df_[i], .GlobalEnv)
}
panel_df <- Reduce(function(x, y) merge(x, y, by="if"), list(df_2012, df_2013, df_2014))
I know that there are probably loads of errors in here.
Here are a couple of options; however, it's unclear what you want the expected output to look like.
If you want a wide format, then we can use tidyverse to do:
library(tidyverse)
results <-
  map(list(df_2012, df_2013, df_2014), function(x)
    x %>% dplyr::select(id, starts_with("varA"), starts_with("varB"))) %>%
  reduce(function(x, y) full_join(x, y, by = "id"))
Output
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75
However, if you need it in a long format, then we could pivot the data:
results %>%
pivot_longer(-id, names_to = c("variable", "year"), names_sep = "_")
Output
id variable year value
<chr> <chr> <chr> <dbl>
1 France varA 2012 1
2 France varB 2012 7
3 France varA 2013 34
4 France varB 2013 2
5 France varA 2014 9
6 France varB 2014 95
7 Belgium varA 2012 2
8 Belgium varB 2012 2
9 Belgium varA 2013 3
10 Belgium varB 2013 53
11 Belgium varA 2014 10
12 Belgium varB 2014 30
13 Spain varA 2012 3
14 Spain varB 2012 9
15 Spain varA 2013 56
16 Spain varB 2013 5
17 Spain varA 2014 5
18 Spain varB 2014 75
Or if using base R for the wide format, then we can do:
results <-
  lapply(list(df_2012, df_2013, df_2014), function(x)
    subset(x, select = c("id", names(x)[startsWith(names(x), "varA")],
                         names(x)[startsWith(names(x), "varB")])))
results <-
  Reduce(function(x, y)
    merge(x, y, all = TRUE, by = "id"), results)
From your initial for loop attempt, it seems the code below may help
> (df <- Reduce(merge, list(df_2012, df_2013, df_2014)))[grepl("^(id|var(A|B))",names(df))]
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75
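As a side note on the original loop: objects named df_<year> cannot be indexed as df_[i]; a hedged base R sketch that instead gathers them from the workspace with mget() (the pattern and object names here are illustrative):
years <- 2012:2014
dfs <- mget(paste0("df_", years))   # fetch df_2012, df_2013, df_2014 by name
dfs <- lapply(dfs, function(x) x[grepl("^(id|var[AB])", names(x))])   # keep id, varA_*, varB_*
panel_df <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE), dfs)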

How to sum conditional functions to grouped rows in R

So I have the following data frame:
customerid payment_month payment_date bill_month charges
1          January       22           January    30
1          February      15           February   21
1          March         2            March      33
1          May           4            April      43
1          May           4            May        23
1          June          13           June       32
2          January       12           January    45
2          February      15           February   56
2          March         2            March      67
2          April         4            April      65
2          May           4            May        54
2          June          13           June       68
3          January       25           January    45
3          February      26           February   56
3          March         30           March      67
3          April         1            April      65
3          June          1            May        54
3          June          1            June       68
(the id data is much larger) I want to calculate payment efficiency using the following formula:
efficiency = (amount paid not late / total bill amount) * 100
"Not late" means paying no later than the 21st day of the bill's month (paying January's bill on the 22nd of January is considered late).
I want to calculate the efficiency of each customer, with the expected output of:
customerid effectivity
1          59.90
2          100
3          37.46
I have tried the following code to calculate it for one id, and it works, but I want to apply it to every customer id and summarize the result into one column (effectivity) with one row per id. I have tried group_by, aggregate, and ifelse, but nothing works. What should I do?
df1 <- filter(df, (payment_month != bill_month & id == 1) |
                (payment_month == bill_month & payment_date > 21 & id == 1))
df2 <- filter(df, id == 1001)
x <- sum(df1$charges)
y <- sum(df2$charges)
100 - (x / y) * 100
An option using dplyr
library(dplyr)
df %>%
  group_by(customerid) %>%
  summarise(
    effectivity = sum(charges[payment_date <= 21 & payment_month == bill_month]) /
      sum(charges) * 100,
    .groups = "drop")
# A tibble: 3 x 2
#   customerid effectivity
#        <int>       <dbl>
# 1          1        59.9
# 2          2       100
# 3          3        37.5
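For reference, a base R sketch of the same calculation using split() instead of dplyr (assuming the same df):
# efficiency per customer without dplyr
eff <- sapply(split(df, df$customerid), function(d) {
  ontime <- d$payment_date <= 21 & d$payment_month == d$bill_month
  sum(d$charges[ontime]) / sum(d$charges) * 100
})
data.frame(customerid = names(eff), effectivity = round(eff, 2))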
Another dplyr option, converting the month names to numbers and flagging late payments explicitly:
df %>%
  group_by(customerid) %>%
  mutate(pay_month_number = match(payment_month, month.name),
         bill_month_number = match(bill_month, month.name),
         late = pay_month_number > bill_month_number |
           (pay_month_number == bill_month_number & payment_date > 21)) %>%
  summarise(efficiency = sum(charges[!late]) / sum(charges) * 100)
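And a data.table sketch of the same per-customer aggregation (assuming data.table; note that setDT() converts df by reference):
library(data.table)
setDT(df)[, .(effectivity = sum(charges[payment_date <= 21 & payment_month == bill_month]) /
                sum(charges) * 100),
          by = customerid]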

How to apply 10-year average filter in R on a dataframe?

How can I run a 10-year average filter on NBP in this data frame?
This is the head of the data frame:
> head(df3)
Year NBP
1 1850 35.454343
2 1851 4.5634543
3 1852 112.389182
4 1853 151.169251
5 1854 73.123145
6 1855 -72.309647
In reality I have years from 1850 to 2100. How can I apply a 10-year average filter to the variable NBP in this data frame and plot it over time?
One option would be the slider package's slide_dbl() function, which lets you create rolling variables. Here is the code:
library(slider)
library(dplyr)
set.seed(123)
# Example data
df <- data.frame(Year = 1990:2020, NBP = rnorm(31, 2, 0.5))
# Rolling 10-year mean: each year averaged with the 9 before it
df %>%
  mutate(rollingNBP = slide_dbl(NBP, mean, .before = 9, .complete = TRUE))
Output:
Year NBP rollingNBP
1 1990 1.8399718 NA
2 1991 1.3442388 NA
3 1992 1.7001958 NA
4 1993 1.9352947 NA
5 1994 2.4433681 NA
6 1995 1.9243020 NA
7 1996 2.1648956 NA
8 1997 0.3863386 NA
9 1998 1.6141041 NA
10 1999 2.1432743 1.749598
11 2000 1.3897440 1.704576
12 2001 2.2172752 1.791879
13 2002 2.4000884 1.861868
14 2003 1.9180345 1.860142
15 2004 2.6214594 1.877952
16 2005 1.5328075 1.838802
17 2006 2.1968543 1.841998
18 2007 2.2018157 2.023546
19 2008 1.5567816 2.017813
20 2009 1.3405312 1.937539
21 2010 2.0144220 2.000007
22 2011 1.7839351 1.956673
23 2012 2.8449363 2.001158
24 2013 2.6141964 2.070774
25 2014 2.1380117 2.022429
26 2015 1.4755122 2.016700
27 2016 1.7395653 1.970971
28 2017 2.8116013 2.031949
29 2018 1.4649659 2.022768
30 2019 2.8429436 2.173009
31 2020 1.8791551 2.159482
If you want to include a plot, you can use ggplot2:
library(ggplot2)
# Plot the rolling mean
df %>%
  mutate(rollingNBP = slide_dbl(NBP, mean, .before = 9, .complete = TRUE)) %>%
  ggplot(aes(x = Year, y = rollingNBP)) +
  geom_line()
Output: (line plot of the 10-year rolling mean over Year)
And if you want to see both series, try this:
library(tidyr)
# Plot the original series and the rolling mean together
df %>%
  mutate(rollingNBP = slide_dbl(NBP, mean, .before = 9, .complete = FALSE)) %>%
  pivot_longer(-Year) %>%
  ggplot(aes(x = Year, y = value, group = name, color = name)) +
  geom_line()
Output: (line plot of both NBP and rollingNBP over Year)
An option with rollmean from zoo
library(dplyr)
library(zoo)
df %>%
  mutate(rollingNBP = rollmeanr(NBP, k = 10, fill = NA))
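For completeness, base R can compute the same trailing 10-year mean with stats::filter() (a sketch; the stats:: prefix avoids the clash with dplyr::filter):
# sides = 1 averages each value with the 9 preceding ones; the first 9 entries are NA
df$rollingNBP <- as.numeric(stats::filter(df$NBP, rep(1 / 10, 10), sides = 1))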

Summarize the values of a column based on the values of more than two columns of the same dataframe [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 3 years ago.
I have the data frame below.
year <- c(2016, 2016, 2017, 2017, 2016, 2016, 2017, 2017)
city <- c("NY", "NY", "NY", "NY", "WS", "WS", "WS", "WS")
spec <- c("df", "df", "df", "df", "vb", "vb", "vb", "vb")
num <- c(45, 67, 89, 90, 45, 67, 89, 90)
df <- data.frame(year, city, spec, num)
I would like to know if it is possible to sum num based on the year, city, and spec columns, in order to bring it from this form:
year city spec num
1 2016 NY df 45
2 2016 NY df 67
3 2017 NY df 89
4 2017 NY df 90
5 2016 WS vb 45
6 2016 WS vb 67
7 2017 WS vb 89
8 2017 WS vb 90
to this:
year city spec num
1 2016 NY df 112
2 2017 NY df 179
3 2016 WS vb 112
4 2017 WS vb 179
Possible duplicate, but here is an answer:
library(tidyverse)
df %>%
  group_by(year, city, spec) %>%
  summarise(sum = sum(num))
which results in:
# A tibble: 4 x 4
# Groups: year, city [4]
year city spec sum
<dbl> <fct> <fct> <dbl>
1 2016 NY df 112
2 2016 WS vb 112
3 2017 NY df 179
4 2017 WS vb 179
One way is to use the sqldf package:
library(sqldf)
sqldf("select year, city, spec, sum(num) from df
       group by year, city, spec order by city")
year city spec sum(num)
1 2016 NY df 112
2 2017 NY df 179
3 2016 WS vb 112
4 2017 WS vb 179
Using dplyr
df %>%
  group_by(year, city, spec) %>%
  summarise(SumNum = sum(num)) %>%
  arrange(city)
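For a base R one-liner, aggregate() gives the same sums (row order may differ):
aggregate(num ~ year + city + spec, data = df, FUN = sum)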

Similarity score between vectors and creating column vectors based on a function

I have a sample in which I want to create an aggregate measure based on similarity scores of each person's movie interests. For example, consider the following data.
person <- c("John", "John", "Vikram", "Kris", "Kris", "Lara", "Mohi", "Mohi", "Mohi")
year <- c(2010, 2011, 2010, 2010, 2011, 2010, 2010, 2011, 2012)
sciencefiction <- c(4, 5, 0, 44, 32, 5, 32, 43, 33)
romantic <- c(19, 28, 56, 7, 4, 33, 2, 1, 2)
comedy <- c(22, 34, 22, 34, 44, 54, 54, 32, 44)
timespent <- c(30, 40, 100, 33, 22, 80, 96, 22, 34)
df <- data.frame(person, year, sciencefiction, romantic, comedy, timespent)
I want a variable called similarity score, which for person i is the sum, over all other persons j in the same year, of i's cosine similarity to j multiplied by the time j spent watching. For example, for John in 2010 it would be
score[John, 2010] = 0.8 * 100 + 0.6 * 33 + 0.98 * 80 + 0.73 * 96 = 248.28
The 0.8 is the cosine similarity (a.b / (|a||b|)) between John and Vikram, computed from the vectors formed by sciencefiction, romantic, and comedy (v[i] = 4i + 19j + 22k and v[j] = 0i + 7j + 34k), and 100 is the time Vikram spent watching movies in 2010. The comparisons for John are made and aggregated in the same way. Is there a way to do this operation in R to create a column called score with the above procedure? Thanks
I'll step through this solution. Skip down to the bottom for the overall result.
Up front: because 2012 only has one person (Mohi), there is no output for that year. You can easily capture this either by not filtering out self-comparisons (which should score 0) or by re-merging in the missing person/year rows (a sketch of the re-merge follows the TL;DR).
Update 2: your df$person needs to be character, so either create your data with
df <- data.frame(..., stringsAsFactors = FALSE)
or modify it in-place with
df$person <- as.character(df$person)
Dependencies
I'm using dplyr here primarily because I think it clearly communicates what is going on. There is nothing in the code that could not be replaced with base functions (or even data.table).
library(dplyr)
One could use tidyr::crossing instead of expand.grid and purrr::pmap instead of mapply. They have strengths but are mostly drop-in replacements, so I leave it up to the reader.
A simple geometric angle-calculation function, for simplicity/reference
angle <- function(a, b, zero = NaN) {
  num <- (a %*% b)
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) zero else (num / denom)
}
Update: if either of the vectors is all-0, then R calculates 0/0 as NaN. Depending on your use, it may make sense to change this to 0 or NA.
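A quick sanity check of angle() against the question's data (John vs Vikram in 2010; the same value appears in the angle column further down):
angle(c(4, 19, 22), c(0, 56, 22))
#           [,1]
# [1,] 0.8768294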
Identify unique combinations (not permutations)
df %>%
  distinct(year, person) %>%
  group_by(year) %>%
  do( expand.grid(person = .$person, person2 = .$person, stringsAsFactors = FALSE) ) %>%
  ungroup() %>%
  filter(person != person2) %>%
  mutate(
    p1 = pmin(person, person2),
    p2 = pmax(person, person2)
  ) %>%
  select(-person, -person2) %>%
  distinct()
# # A tibble: 13 × 3
# year p1 p2
# <dbl> <chr> <chr>
# 1 2010 John Vikram
# 2 2010 John Kris
# 3 2010 John Lara
# 4 2010 John Mohi
# 5 2010 Kris Vikram
# 6 2010 Lara Vikram
# 7 2010 Mohi Vikram
# 8 2010 Kris Lara
# 9 2010 Kris Mohi
# 10 2010 Lara Mohi
# 11 2011 John Kris
# 12 2011 John Mohi
# 13 2011 Kris Mohi
If you ran up through (but stopped at) the expand.grid, you'd end up with redundant pairs, e.g. "John, Vikram" and "Vikram, John". Because I'm inferring you are interested in pairwise combinations rather than permutations, the rest of that code block removes the redundant rows; a base R alternative follows.
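For reference, base R's combn() produces the same unordered pairs directly; a sketch for a single year:
p <- unique(df$person[df$year == 2010])
t(combn(p, 2))   # one row per unordered pair: 10 pairs, matching rows 1-10 above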
Bring in each person's data
(continuing in the pipe with the previous data)
... %>%
left_join(setNames(df, paste0(colnames(df), "1")), by = c("p1" = "person1", "year" = "year1")) %>%
left_join(setNames(df, paste0(colnames(df), "2")), by = c("p2" = "person2", "year" = "year2"))
# # A tibble: 13 × 11
# year p1 p2 sciencefiction1 romantic1 comedy1 timespent1 sciencefiction2 romantic2 comedy2 timespent2
# <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2010 John Vikram 4 19 22 30 0 56 22 100
# 2 2010 John Kris 4 19 22 30 44 7 34 33
# 3 2010 John Lara 4 19 22 30 5 33 54 80
# 4 2010 John Mohi 4 19 22 30 32 2 54 96
# 5 2010 Kris Vikram 44 7 34 33 0 56 22 100
# 6 2010 Lara Vikram 5 33 54 80 0 56 22 100
# 7 2010 Mohi Vikram 32 2 54 96 0 56 22 100
# 8 2010 Kris Lara 44 7 34 33 5 33 54 80
# 9 2010 Kris Mohi 44 7 34 33 32 2 54 96
# 10 2010 Lara Mohi 5 33 54 80 32 2 54 96
# 11 2011 John Kris 5 28 34 40 32 4 44 22
# 12 2011 John Mohi 5 28 34 40 43 1 32 22
# 13 2011 Kris Mohi 32 4 44 22 43 1 32 22
Calculate the angle per pair
... %>%
  mutate(
    angle = mapply(function(a, b, c, d, e, f) angle(c(a, b, c), c(d, e, f), zero = NA),
                   sciencefiction1, romantic1, comedy1,
                   sciencefiction2, romantic2, comedy2, SIMPLIFY = TRUE)
  ) %>%
  select(year, p1, p2, starts_with("timespent"), angle)
# A tibble: 13 × 6
# year p1 p2 timespent1 timespent2 angle
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
# 1 2010 John Vikram 30 100 0.8768294
# 2 2010 John Kris 30 33 0.6427461
# 3 2010 John Lara 30 80 0.9851037
# 4 2010 John Mohi 30 96 0.7347653
# 5 2010 Kris Vikram 33 100 0.3380778
# 6 2010 Lara Vikram 80 100 0.7948679
# 7 2010 Mohi Vikram 96 100 0.3440492
# 8 2010 Kris Lara 33 80 0.6428056
# 9 2010 Kris Mohi 33 96 0.9256539
# 10 2010 Lara Mohi 80 96 0.7881070
# 11 2011 John Kris 40 22 0.7311130
# 12 2011 John Mohi 40 22 0.5600843
# 13 2011 Kris Mohi 22 22 0.9533073
Finally, the score
... %>%
  group_by(year, person = p1) %>%
  summarize(
    score = angle %*% timespent2
  ) %>%
  ungroup()
# # A tibble: 6 × 3
# year person score
# <dbl> <chr> <dbl>
# 1 2010 John 258.23933
# 2 2010 Kris 174.09501
# 3 2010 Lara 155.14507
# 4 2010 Mohi 34.40492
# 5 2011 John 28.40634
# 6 2011 Kris 20.97276
I'm guessing the difference between my 258.24 and your 248.28 is due to the second vector (Vikram's values).
TL;DR
All at once:
df %>%
  distinct(year, person) %>%
  group_by(year) %>%
  do( expand.grid(person = .$person, person2 = .$person, stringsAsFactors = FALSE) ) %>%
  ungroup() %>%
  filter(person != person2) %>%
  mutate(
    p1 = pmin(person, person2),
    p2 = pmax(person, person2)
  ) %>%
  select(-person, -person2) %>%
  distinct() %>%
  # pairwise lookups
  left_join(setNames(df, paste0(colnames(df), "1")), by = c("p1" = "person1", "year" = "year1")) %>%
  left_join(setNames(df, paste0(colnames(df), "2")), by = c("p2" = "person2", "year" = "year2")) %>%
  # calc angles
  mutate(
    angle = mapply(function(a, b, c, d, e, f) angle(c(a, b, c), c(d, e, f)),
                   sciencefiction1, romantic1, comedy1,
                   sciencefiction2, romantic2, comedy2, SIMPLIFY = TRUE)
  ) %>%
  # calc scores
  group_by(year, person = p1) %>%
  summarize(
    score = angle %*% timespent2
  ) %>%
  ungroup()
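As noted up front, Mohi has no 2012 score because that year has no pairs. A sketch of the re-merge option, with res standing in for the final tibble above (the name is only for illustration):
res_full <- merge(unique(df[c("year", "person")]), res,
                  by = c("year", "person"), all.x = TRUE)   # score is NA where no pairs exist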
