Creating a calculated matrix in R [duplicate]

This question already has answers here:
Aggregate by multiple columns and reshape from long to wide
(4 answers)
Closed 2 years ago.
I have a table similar to this
Year Month Purchase_ind Value
2018 1 1 100
2018 1 1 100
2018 1 0 100
2018 2 1 2
2018 2 0 198
2018 3 1 568
2019 1 0 230
.
.
.
And I want to make a matrix with:
Year for Y axis
Month for X axis
In the value cells, I need (Value where Purchase_ind = 1) / Total Value; e.g. for 2018, month 1 that is (100 + 100) / (100 + 100 + 100) ≈ 0.67.
Having this as a result:
Month  2018  2019  2020
1      0.66  0     x
2      0.01  x     x
3      1     x     x
Thanks a lot for your help!

You can calculate the proportion for each Year/Month combination and cast the data to wide format:
library(dplyr)
df %>%
group_by(Year, Month) %>%
summarise(Value = sum(Value[Purchase_ind == 1])/sum(Value)) %>%
tidyr::pivot_wider(names_from = Year, values_from = Value)
#Add values_fill = 0 if you want 0's instead of `NA`'s
#tidyr::pivot_wider(names_from = Year, values_from = Value, values_fill = 0)
# Month `2018` `2019`
# <int> <dbl> <dbl>
#1 1 0.667 0
#2 2 0.01 NA
#3 3 1 NA
data
df <- structure(list(Year = c(2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2019L), Month = c(1L, 1L, 1L, 2L, 2L, 3L, 1L), Purchase_ind = c(1L,
1L, 0L, 1L, 0L, 1L, 0L), Value = c(100L, 100L, 100L, 2L, 198L,
568L, 230L)), class = "data.frame", row.names = c(NA, -7L))

Using data.table:
library(data.table)
DT <- data.table(year = c(2018,2018,2018,2018,2018,2018,2019),
month = c(1,1,1,2,2,3,1),
purchase_ind = c(1,1,0,1,0,1,0),
value = c(100,100,100,2,198,568,230))
DT[, value_ind := fifelse(purchase_ind == 1, value, 0)]
DT <- copy(DT[, .(calculate_session = sum(value_ind) / sum(value)), by = .(year, month)])
dcast(DT, month ~ year, value.var = 'calculate_session')
Output:
month 2018 2019
1: 1 0.6666667 0
2: 2 0.0100000 NA
3: 3 1.0000000 NA
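For reference, a more compact sketch of the same calculation that skips the helper column; this is an addition, assuming DT as first constructed above (before it is reassigned):
# same proportion, computed directly in the grouping step
dcast(DT[, .(calculate_session = sum(value[purchase_ind == 1]) / sum(value)),
         by = .(year, month)],
      month ~ year, value.var = 'calculate_session')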

In base R you could do:
(a <- prop.table(xtabs(Value ~ Month + Year + Purchase_ind, df), c(1, 2)))
, , Purchase_ind = 0

     Year
Month      2018      2019
    1 0.3333333 1.0000000
    2 0.9900000       NaN
    3 0.0000000       NaN

, , Purchase_ind = 1

     Year
Month      2018      2019
    1 0.6666667 0.0000000
    2 0.0100000       NaN
    3 1.0000000       NaN
Of course, if you only need Purchase_ind = 1, you can just subset it:
a[, , "1"] #or even a[, , 2]
     Year
Month      2018      2019
    1 0.6666667 0.0000000
    2 0.0100000       NaN
    3 1.0000000       NaN
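For completeness, another minimal base R sketch (an addition, assuming the df from the first answer), which returns the wide Month x Year matrix directly:
# share of Value with Purchase_ind == 1 per Month/Year cell; empty cells become NA
with(df, tapply(Value * (Purchase_ind == 1), list(Month, Year), sum) /
         tapply(Value, list(Month, Year), sum))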

Calculate difference between index date and date with first indicator

Suppose I have the following data frame with an index date and follow-up dates, with a "1" as a stop indicator. I want to write the date difference in days into the row with the stop indicator; if no stop indicator is present, I want the number of days from the index date to the last observation, written into that last row:
id date group indicator
1 15-01-2022 1 0
1 15-01-2022 2 0
1 16-01-2022 2 1
1 20-01-2022 2 0
2 18-01-2022 1 0
2 20-01-2022 2 0
2 27-01-2022 2 0
Want:
id date group indicator stoptime
1 15-01-2022 1 0 NA
1 15-01-2022 2 0 NA
1 16-01-2022 2 1 1
1 20-01-2022 2 0 NA
2 18-01-2022 1 0 NA
2 20-01-2022 2 0 NA
2 27-01-2022 2 0 9
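(For reference, the 1 in row 3 is the single day from 15-01-2022 to 16-01-2022, and the 9 in row 7 is the nine days from the index date 18-01-2022 to the last observation 27-01-2022.)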
Convert 'date' to Date class; then, grouped by 'id', find the position of the first 1 in 'indicator' (if none is found, use the last position, n()), and take the difference in days between the 'date' at that position and the first 'date':
library(dplyr)
library(lubridate)
df1 %>%
mutate(date = dmy(date)) %>%
group_by(id) %>%
mutate(ind = match(1, indicator, nomatch = n()),
stoptime = case_when(row_number() == ind ~
as.integer(difftime(date[ind], first(date), units = "days"))),
ind = NULL) %>%
ungroup
Output:
# A tibble: 7 × 5
id date group indicator stoptime
<int> <date> <int> <int> <int>
1 1 2022-01-15 1 0 NA
2 1 2022-01-15 2 0 NA
3 1 2022-01-16 2 1 1
4 1 2022-01-20 2 0 NA
5 2 2022-01-18 1 0 NA
6 2 2022-01-20 2 0 NA
7 2 2022-01-27 2 0 9
data
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), date = c("15-01-2022",
"15-01-2022", "16-01-2022", "20-01-2022", "18-01-2022", "20-01-2022",
"27-01-2022"), group = c(1L, 2L, 2L, 2L, 1L, 2L, 2L), indicator = c(0L,
0L, 1L, 0L, 0L, 0L, 0L)), class = "data.frame",
row.names = c(NA,
-7L))
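For comparison, a rough base R sketch of the same logic (an addition, not part of the answer; it assumes the rows of df1 are already ordered by id and date, as in the dput above):
df1$date <- as.Date(df1$date, format = "%d-%m-%Y")
df1$stoptime <- unlist(lapply(split(df1, df1$id), function(g) {
  i <- match(1, g$indicator, nomatch = nrow(g))  # stop row, or last row if none
  out <- rep(NA_integer_, nrow(g))
  out[i] <- as.integer(g$date[i] - g$date[1])    # days since the index date
  out
}))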

Cumulative sums by month in R

I want to transform my data from this
Month Expenditures
1 1
1 2
2 3
2 6
3 2
3 5
to this:
Month Cumulative_expenditures
1 3
2 12
3 19
but I can't seem to figure out how to do it.
I tried using the cumsum() function, but it accumulates over every observation; it doesn't distinguish between groups.
Any help would be much appreciated!
A two-step base R solution would be:
#Code
df1 <- aggregate(Expenditures ~ Month, data = mydf, sum)
#Create cum sum
df1$Expenditures <- cumsum(df1$Expenditures)
Output:
Month Expenditures
1 1 3
2 2 12
3 3 19
Some data used:
#Data
mydf <- structure(list(Month = c(1L, 1L, 2L, 2L, 3L, 3L), Expenditures = c(1L,
2L, 3L, 6L, 2L, 5L)), class = "data.frame", row.names = c(NA,
-6L))
Using dplyr:
library(dplyr)
df %>%
group_by(Month) %>%
summarise(Expenditures = sum(Expenditures), .groups = "drop") %>%
mutate(Expenditures = cumsum(Expenditures))
#> # A tibble: 3 x 2
#> Month Expenditures
#> <int> <int>
#> 1 1 3
#> 2 2 12
#> 3 3 19
Or in base R:
data.frame(Month = unique(df$Month),
           Expenditure = cumsum(tapply(df$Expenditures, df$Month, sum)))
#> Month Expenditure
#> 1 1 3
#> 2 2 12
#> 3 3 19
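If you already use data.table, an equivalent sketch of the same two steps (an addition, using the mydf defined above):
library(data.table)
as.data.table(mydf)[, .(Expenditures = sum(Expenditures)), by = Month][
  , Expenditures := cumsum(Expenditures)][]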
Here is another base R option using subset + ave
subset(
transform(df, Expenditures = cumsum(Expenditures)),
ave(rep(FALSE, nrow(df)), Month, FUN = function(x) seq_along(x) == length(x))
)
which gives
Month Expenditures
2 1 3
4 2 12
6 3 19
We can use base R
out <- with(df1, rowsum(Expenditures, Month))
data.frame(Month = row.names(out), Expenditure = cumsum(out))
# Month Expenditure
#1 1 3
#2 2 12
#3 3 19
Or more compactly
with(df1, stack(cumsum(rowsum(Expenditures, Month)[,1])))[2:1]
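A note on the compact version: stack() returns columns named values and ind, so the [2:1] only puts the month first. If you want the original column names back, a small added sketch:
setNames(with(df1, stack(cumsum(rowsum(Expenditures, Month)[, 1])))[2:1],
         c("Month", "Expenditures"))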
data
df1 <- structure(list(Month = c(1L, 1L, 2L, 2L, 3L, 3L), Expenditures = c(1L,
2L, 3L, 6L, 2L, 5L)), class = "data.frame", row.names = c(NA,
-6L))

Linear regression on split data in R

I want to make groups of data where measurements are done in multiple Year on the same species at the same Lat and Long. Then, I want to run linear regression on all those groups (using N as dependent variable and Year as independent variable).
Practice dataset:
Species Year Lat Long N
1 1 1999 1 1 5
2 1 2001 2 1 5
3 2 2010 3 3 4
4 2 2010 3 3 2
5 2 2011 3 3 5
6 2 2012 3 3 8
7 3 2007 8 7 -10
8 3 2019 8 7 100
9 2 2000 1 1 5
First, I averaged the data where multiple measurements were done in the same Year on the same Species at the same Lat and Long. Then I split the data based on Lat, Long and Species. However, this still groups rows together where Lat, Long and Species are not equal (see $`4` below). Furthermore, I want to remove groups with only one observation, since I only want to use data where measurements are available for multiple Years. How do I do this?
Data <- read.table("Dataset.txt", header = TRUE)
Agr_Data <- aggregate(N ~ Lat + Long + Year + Species, data = Data, mean)
Split_Data <- split(Agr_Data, Agr_Data$Lat + Agr_Data$Long + Agr_Data$Species)
Regression_Data <- lapply(Split_Data, function(Split_Data) lm(N~Year, data = Split_Data) )
Split_Data
$`3`
Lat Long Year Species N
1 1 1 1999 1 5
$`4`
Lat Long Year Species N
2 2 1 2001 1 5
3 1 1 2000 2 5
$`8`
Lat Long Year Species N
4 3 3 2010 2 3
5 3 3 2011 2 5
6 3 3 2012 2 8
$`18`
Lat Long Year Species N
7 8 7 2007 3 -10
8 8 7 2019 3 100
Desired output:
Lat Long Species Coefficients
3 3 2 2.5
8 7 3 9.167
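(For reference, the 9.167 is the slope (100 - (-10)) / (2019 - 2007) = 110/12 ≈ 9.17 for the Lat 8, Long 7 group, and the 2.5 is the least-squares slope of the averaged values 3, 5, 8 over the years 2010-2012.)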
Base R solution:
# 1. Import data:
df <- structure(list(Species = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 2L ),
Year = c(1999L, 2001L, 2010L, 2010L, 2011L, 2012L, 2007L, 2019L, 2000L),
Lat = c(1L, 2L, 3L, 3L, 3L, 3L, 8L, 8L, 1L),
Long = c(1L, 1L, 3L, 3L, 3L, 3L, 7L, 7L, 1L),
N = c(5L, 5L, 4L, 2L, 5L, 8L, -10L, 100L, 5L)),
class = "data.frame", row.names = c(NA, -9L ))
# 2. Aggregate data:
df <- aggregate(N ~ Lat + Long + Year + Species, data = df, mean)
# 3. Concatenate vecs to create grouping vec:
df$grouping_var <- paste(df$Species, df$Lat, df$Long, sep = ", ")
# 4. split apply combine lm:
coeff_n <- as.numeric(do.call("rbind", lapply(split(df, df$grouping_var),
  function(x) {
    # Fit N ~ Year within each group; single-row groups can't be fitted, so give NA
    ifelse(nrow(x) > 1, coef(lm(N ~ Year, data = x))[["Year"]], NA)
  })))
# 5. Create a dataframe of coeffs:
coeff_df <- data.frame(cbind(grouping_var = unique(df$grouping_var), coeff_n = coeff_n))
# 6. Merge the dataframes together:
df <- merge(df, coeff_df, by = "grouping_var", all.x = TRUE)
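For comparison, a hedged sketch (an addition, not part of the answer above) that goes straight to the two-row table from the question: keep only groups with more than one aggregated observation and regress N on Year within each. It assumes the df from step 1; running it after step 2's aggregation gives the same result.
agr    <- aggregate(N ~ Lat + Long + Year + Species, data = df, mean)
groups <- split(agr, interaction(agr$Lat, agr$Long, agr$Species, drop = TRUE))
groups <- Filter(function(g) nrow(g) > 1, groups)  # drop single-observation groups
do.call(rbind, lapply(groups, function(g) {
  data.frame(Lat = g$Lat[1], Long = g$Long[1], Species = g$Species[1],
             Coefficients = coef(lm(N ~ Year, data = g))[["Year"]])
}))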

R: weighted aggregation

I have a dataset on this form:
set.seed(4561) # Make the results reproducible
df=data.frame(
colour=rep(c("green","red","blue"),each=3),
year=rep("2017",9),
month=rep(c(1,2,3),3),
price=c(200,254,188,450,434,490,100,99,97),
work=ceiling(runif(9,30,60)),
gain=ceiling(runif(9,1,10)),
work_weighed_price=NA,
gain_weighed_price=NA
)
For each colour, year, month I have a price (output variable) and two input variables called gain and work. In reality I have many more input variables, but this suffices to show what I desire to do with my dataframe.
> df
colour year month price work gain work_weighed_price gain_weighed_price
1 green 2017 1 200 33 9 NA NA
2 green 2017 2 254 56 5 NA NA
3 green 2017 3 188 42 8 NA NA
4 red 2017 1 450 39 3 NA NA
5 red 2017 2 434 45 2 NA NA
6 red 2017 3 490 36 8 NA NA
7 blue 2017 1 100 50 8 NA NA
8 blue 2017 2 99 45 8 NA NA
9 blue 2017 3 97 56 4 NA NA
I wish to calculate the weighted gain and work (and also the weighted price), where the weight is the price for that month and year, divided by the sum of price across colours:
desired_output=data.frame(
year=rep("2017",3),
month=rep(c(1,2,3),1),
price=c(200*(200/(200+450+100))+450*(450/(200+450+100))+100*(100/(200+450+100)),
254*(254/(254+434+99))+434*(434/(254+434+99))+99*(99/(254+434+99)),
188*(188/(188+490+97))+490*(490/(188+490+97))+97*(97/(188+490+97))),
work_weighed_price=c(47*(200/(200+450+100))+44*(450/(200+450+100))+52*(100/(200+450+100)),
44*(254/(254+434+99))+42*(434/(254+434+99))+32*(99/(254+434+99)),
38*(188/(188+490+97))+52*(490/(188+490+97))+52*(97/(188+490+97))) ,
gain_weighed_price=c(5*(200/(200+450+100))+8*(450/(200+450+100))+10*(100/(200+450+100)),
3*(254/(254+434+99))+7*(434/(254+434+99))+9*(99/(254+434+99)),
2*(188/(188+490+97))+4*(490/(188+490+97))+9*(97/(188+490+97)))
)
> desired_output
year month price work_weighed_price gain_weighed_price
1 2017 1 336.6667 45.86667 7.466667
2 2017 2 333.7649 41.38755 5.960610
3 2017 3 367.5523 48.60387 4.140645
How would I attack this in R?
You can use the weighted.mean function:
library(dplyr)
df %>%
group_by(year, month) %>%
summarise_at(vars(price, work, gain),
funs(price_weighted = weighted.mean(., price)))
# # A tibble: 3 x 5
# # Groups: year [?]
# year month price_price_weighted work_price_weighted gain_price_weighted
# <int> <int> <dbl> <dbl> <dbl>
# 1 2017 1 337 45.9 7.47
# 2 2017 2 334 41.4 5.96
# 3 2017 3 368 48.6 4.14
Or, in data.table
library(data.table)
setDT(df)
df[, lapply(.SD, weighted.mean, price)
, .SDcols = c('price', 'work', 'gain')
, by = .(year, month)]
# year month price work gain
# 1: 2017 1 336.6667 45.86667 7.466667
# 2: 2017 2 333.7649 41.38755 5.960610
# 3: 2017 3 367.5523 48.60387 4.140645
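Note that funs() is deprecated in current dplyr; a hedged sketch of the same computation with across() (assumes dplyr >= 1.0):
library(dplyr)
df %>%
  group_by(year, month) %>%
  summarise(across(c(price, work, gain),
                   ~ weighted.mean(.x, price),   # weight every column by price
                   .names = "{.col}_weighted"),
            .groups = "drop")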
An approach using dplyr. The runif values in your example df don't line up with your desired output, which causes some confusion; in the code below I use a df (dput shown below) that is consistent with your desired output.
library(dplyr)
df %>%
group_by(year, month) %>%
mutate(weight = price / sum(price)) %>%
mutate_at(vars(price, work, gain), funs(weighed_price = . * weight)) %>%
summarise_at(vars(ends_with("weighed_price")), sum)
# # A tibble: 3 x 5
# # Groups: year [?]
# year month work_weighed_price gain_weighed_price price_weighed_price
# <int> <int> <dbl> <dbl> <dbl>
# 1 2017 1 45.9 7.47 337.
# 2 2017 2 41.4 5.96 334.
# 3 2017 3 48.6 4.14 368.
df:
structure(list(colour = c("green", "green", "green", "red", "red",
"red", "blue", "blue", "blue"), year = c(2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L), month = c(1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L), price = c(200L, 254L, 188L, 450L,
434L, 490L, 100L, 99L, 97L), work = c(47L, 44L, 38L, 44L, 42L,
52L, 52L, 32L, 52L), gain = c(5L, 3L, 2L, 8L, 7L, 4L, 10L, 9L,
9L), work_weighed_price = c(NA, NA, NA, NA, NA, NA, NA, NA, NA
), gain_weighed_price = c(NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("colour",
"year", "month", "price", "work", "gain", "work_weighed_price",
"gain_weighed_price"), class = "data.frame", row.names = c(NA,
-9L))
A base R solution could be the following sequence of tapply instructions.
fun_price <- function(x){
s <- sum(x)
sum(x*(x/s))
}
fun_weighted <- function(x, w){
s <- sum(w)
sum(x*(w/s))
}
desired <- data.frame(year = unique(df$year), month = sort(unique(df$month)))
desired$price <- with(df, tapply(price, month, FUN = fun_price))
desired$work_weighed_price <- with(df, tapply(work, month, FUN = fun_weighted, w = price))
desired$gain_weighed_price <- with(df, tapply(gain, month, FUN = fun_weighted, w = price))
desired
# year month price work_weighed_price gain_weighed_price
#1 2017 1 336.6667 40.74092 6.622405
#2 2017 2 333.7649 48.56834 4.984429
#3 2017 3 367.5523 44.65052 6.659170

R: Get the last entry from previous group

I have data like this:
Group Year Month Mean_Price
A 2013 6 200
A 2013 6 200
A 2014 2 100
A 2014 2 100
B 2014 1 130
I want to add another column which gets the last entry from the group above, like this:
Group Year Month Mean_Price Last_Mean_price
A 2013 6 200 x
A 2013 6 200 x
A 2014 2 100 200
A 2014 2 100 200
(This is where I am facing the problem: dplyr + lag just gets the previous row's entry, not the last row of the *previous group*.)
B 2014 1 130 x
B 2014 4 140 130
All help will be appreciated. Thanks!
I had asked a related question here: Get the (t-1) data within groups
But then I wasn't grouping by years and months
This may be one way to go. I am not sure how you want to group your data; here, I chose to group it by Group, Year, and Month. First, I create a vector containing the last element of each group, which I call foo.
library(dplyr)
group_by(mydf, Group, Year, Month) %>%
summarize(whatever = last(Mean_Price)) %>%
ungroup %>%
select(whatever) %>%
unlist -> foo
# whatever1 whatever2 whatever3 whatever4
# 200 100 130 140
Second, I rearranged foo for the later step: I added "x" at the first position and dropped the last element of foo.
### Arrange a vector
foo <- c("x", foo[-length(foo)])
Third, I added row numbers for each group in mydf using mutate(). Then, I replaced all numbers but 1 with "x".
group_by(mydf, Group, Year, Month) %>%
mutate(ind = row_number(),
ind = replace(ind, which(row_number(ind) != 1), "x")) -> temp
Finally, I identified the rows that have 1 in ind and assigned the vector foo to those rows.
temp$ind[temp$ind == 1] <- foo
temp
# Group Year Month Mean_Price ind
# (fctr) (int) (int) (int) (chr)
#1 A 2013 6 200 x
#2 A 2013 6 200 x
#3 A 2014 2 100 200
#4 A 2014 2 100 x
#5 B 2014 1 130 100
#6 B 2014 4 140 130
DATA
mydf <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), Year = c(2013L, 2013L, 2014L, 2014L,
2014L, 2014L), Month = c(6L, 6L, 2L, 2L, 1L, 4L), Mean_Price = c(200L,
200L, 100L, 100L, 130L, 140L)), .Names = c("Group", "Year", "Month",
"Mean_Price"), class = "data.frame", row.names = c(NA, -6L))
