Get the last entry from the previous group in R

I have data like this:
Group Year Month Mean_Price
A     2013 6     200
A     2013 6     200
A     2014 2     100
A     2014 2     100
B     2014 1     130
B     2014 4     140
I want to add another column which gets the last entry from the group above, like this:
Group Year Month Mean_Price Last_Mean_Price
A     2013 6     200        x
A     2013 6     200        x
A     2014 2     100        200
A     2014 2     100        200
B     2014 1     130        x
B     2014 4     140        130
The rows with 200 are where I am facing the problem: dplyr + lag just gets the previous row's entry, not the last entry of the *previous group*.
All help will be appreciated. Thanks!
I had asked a related question here: Get the (t-1) data within groups
But there I wasn't grouping by years and months.

This may be one way to go. I am not sure how you want to group your data; here, I chose to group it by Group, Year, and Month. First, I create a vector, foo, holding the last element of each group.
library(dplyr)

group_by(mydf, Group, Year, Month) %>%
    summarize(whatever = last(Mean_Price)) %>%
    ungroup %>%
    select(whatever) %>%
    unlist -> foo
foo
# whatever1 whatever2 whatever3 whatever4
#       200       100       130       140
Second, I rearranged foo for the later step: I added "x" in the first position and dropped the last element.
### Arrange the vector
foo <- c("x", foo[-length(foo)])
Third, I added row numbers for each group in mydf using mutate(). Then, I replaced all numbers but 1 with "x".
group_by(mydf, Group, Year, Month) %>%
    mutate(ind = row_number(),
           ind = replace(ind, which(row_number(ind) != 1), "x")) -> temp
Finally, I identified rows which have 1 in ind and assigned the vector, foo to the rows.
temp$ind[temp$ind == 1] <- foo
temp
# Group Year Month Mean_Price ind
# (fctr) (int) (int) (int) (chr)
#1 A 2013 6 200 x
#2 A 2013 6 200 x
#3 A 2014 2 100 200
#4 A 2014 2 100 x
#5 B 2014 1 130 100
#6 B 2014 4 140 130
DATA
mydf <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), Year = c(2013L, 2013L, 2014L, 2014L,
2014L, 2014L), Month = c(6L, 6L, 2L, 2L, 1L, 4L), Mean_Price = c(200L,
200L, 100L, 100L, 130L, 140L)), .Names = c("Group", "Year", "Month",
"Mean_Price"), class = "data.frame", row.names = c(NA, -6L))
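For completeness, the fill-down behaviour the question asks for (every row of a group receiving the previous group's last value, not just the first row) can be sketched with a summarise-then-join. This is a sketch, not the answer above: it assumes dplyr >= 1.0 (for the `.groups` argument) and uses NA where the question writes x.

```r
library(dplyr)

# Same data as in the DATA section above
mydf <- data.frame(Group = factor(c("A", "A", "A", "A", "B", "B")),
                   Year  = c(2013L, 2013L, 2014L, 2014L, 2014L, 2014L),
                   Month = c(6L, 6L, 2L, 2L, 1L, 4L),
                   Mean_Price = c(200L, 200L, 100L, 100L, 130L, 140L))

# Collapse each (Group, Year, Month) block to its last Mean_Price,
# shift that value down one block within each Group, then join it
# back so every row of a block receives it.
last_vals <- mydf %>%
    group_by(Group, Year, Month) %>%
    summarize(Last_Mean_Price = last(Mean_Price), .groups = "drop") %>%
    group_by(Group) %>%
    mutate(Last_Mean_Price = lag(Last_Mean_Price)) %>%
    ungroup()

result <- left_join(mydf, last_vals, by = c("Group", "Year", "Month"))
result
```

The first block of each Group gets NA (the question's "x"), and every row of a later block gets the previous block's last Mean_Price.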


Creating a calculated matrix in R [duplicate]

This question already has answers here:
Aggregate by multiple columns and reshape from long to wide
(4 answers)
Closed 2 years ago.
I have a table similar to this
Year Month Purchase_ind Value
2018 1 1 100
2018 1 1 100
2018 1 0 100
2018 2 1 2
2018 2 0 198
2018 3 1 568
2019 1 0 230
.
.
.
And I want to build a matrix with:
Year on the X axis (columns)
Month on the Y axis (rows)
In each cell, I need (Value where Purchase_ind = 1) / Total Value.
Having this as a result:
2018 2019 2020
1 0.66 0 x
2 0.01 x x
3 1 x x
Thanks a lot for your help!
You can calculate the proportion for each Year and Month and cast the data to wide format:
library(dplyr)

df %>%
    group_by(Year, Month) %>%
    summarise(Value = sum(Value[Purchase_ind == 1]) / sum(Value)) %>%
    tidyr::pivot_wider(names_from = Year, values_from = Value)

# Add values_fill = 0 if you want 0's instead of NA's:
# tidyr::pivot_wider(names_from = Year, values_from = Value, values_fill = 0)
# Month `2018` `2019`
# <int> <dbl> <dbl>
#1 1 0.667 0
#2 2 0.01 NA
#3 3 1 NA
data
df <- structure(list(Year = c(2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2019L), Month = c(1L, 1L, 1L, 2L, 2L, 3L, 1L), Purchase_ind = c(1L,
1L, 0L, 1L, 0L, 1L, 0L), Value = c(100L, 100L, 100L, 2L, 198L,
568L, 230L)), class = "data.frame", row.names = c(NA, -7L))
Using data.table:
library(data.table)

DT <- data.table(year = c(2018, 2018, 2018, 2018, 2018, 2018, 2019),
                 month = c(1, 1, 1, 2, 2, 3, 1),
                 purchase_ind = c(1, 1, 0, 1, 0, 1, 0),
                 value = c(100, 100, 100, 2, 198, 568, 230))

DT[, value_ind := fifelse(purchase_ind == 1, value, 0)]
DT <- DT[, .(calculate_session = sum(value_ind) / sum(value)), by = .(year, month)]
dcast(DT, month ~ year, value.var = 'calculate_session')
Output:
month 2018 2019
1: 1 0.6666667 0
2: 2 0.0100000 NA
3: 3 1.0000000 NA
In base R you could do:
(a <- prop.table(xtabs(Value ~ Month + Year + Purchase_ind, df), c(1, 2)))
, , Purchase_ind = 0

     Year
Month      2018      2019
    1 0.3333333 1.0000000
    2 0.9900000       NaN
    3 0.0000000       NaN

, , Purchase_ind = 1

     Year
Month      2018      2019
    1 0.6666667 0.0000000
    2 0.0100000       NaN
    3 1.0000000       NaN

(Cells with no observations come out as NaN, since 0/0 is NaN.) Of course, if you only need Purchase_ind = 1, you can just subset it:
a[, , "1"] # or even a[, , 2]
     Year
Month      2018      2019
    1 0.6666667 0.0000000
    2 0.0100000       NaN
    3 1.0000000       NaN

Repeatedly compare same portion of dataset to other portions of dataset based on index value in R

I have a dataframe that looks like the following:
state year value
1 1980 4
1 1981 5
1 1982 4
2 1980 2
2 1981 3
2 1982 4
100 1980 3
100 1981 2
100 1982 5
In the actual dataset, there are more states than are shown here. I would like to make a comparison between state 100 and all other states.
Specifically, for each state, I would like to find the difference between the value given by that state for a particular year and the value given for state 100 for that same year. Below, I have shown how I could compare the value for year 1980 between state 1 and state 100.
df_1 <- df %>% filter(state == 1)
df_100 <- df %>% filter(state == 100)
df_1_1980 <- df_1 %>% filter(year == 1980)
df_100_1980 <- df_100 %>% filter(year == 1980)
difference <- df_1_1980$value - df_100_1980$value
How could I do this for all the other states and years in the dataframe?
One possibility I have considered is making a dataframe composed only of the data from state 100 and then connecting it to the original dataframe, like this:
state year value state100 year100 value100
1 1980 4 100 1980 3
1 1981 5 100 1981 2
1 1982 4 100 1982 5
2 1980 2 100 1980 3
2 1981 3 100 1981 2
2 1982 4 100 1982 5
I could then subtract df$value from df$value100 for each row. I assume there is a better way of doing this.
We can filter out 'state' 100, left_join the result with the rows where 'state' is 100 (by 'year'), and take the difference between the two 'value' columns.
library(dplyr)

df %>%
    filter(state != 100) %>%
    left_join(df %>%
                  filter(state == 100) %>%
                  select(-state),
              by = 'year') %>%
    transmute(state, year, value = value.x, difference = value.x - value.y)
# state year value difference
#1 1 1980 4 1
#2 1 1981 5 3
#3 1 1982 4 -1
#4 2 1980 2 -1
#5 2 1981 3 1
#6 2 1982 4 -1
data
df <- structure(list(state = c(1L, 1L, 1L, 2L, 2L, 2L, 100L, 100L,
100L), year = c(1980L, 1981L, 1982L, 1980L, 1981L, 1982L, 1980L,
1981L, 1982L), value = c(4L, 5L, 4L, 2L, 3L, 4L, 3L, 2L, 5L)),
class = "data.frame", row.names = c(NA,
-9L))
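If dplyr is not available, the same join can be sketched in base R with merge(), under the same assumption that state 100 has exactly one row per year. The names `ref` and `out` are just illustrative.

```r
# Same data as in the data section above
df <- data.frame(state = c(1L, 1L, 1L, 2L, 2L, 2L, 100L, 100L, 100L),
                 year  = c(1980L, 1981L, 1982L, 1980L, 1981L, 1982L,
                           1980L, 1981L, 1982L),
                 value = c(4L, 5L, 4L, 2L, 3L, 4L, 3L, 2L, 5L))

# Keep only state 100 as a lookup table, then merge by year
ref <- df[df$state == 100, c("year", "value")]
names(ref)[2] <- "value100"

out <- merge(df[df$state != 100, ], ref, by = "year")
out$difference <- out$value - out$value100
out
```

Note that merge() sorts by the join key, so the row order differs from the dplyr output.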

Find common elements from multiple categories within dataframe?

This is based on the answer to a previous question.
df
year code
2009 a
2009 a
2009 b
2010 b
2010 b
2011 b
2011 c
2011 c
I want to select codes common to all years within df. Here it is "b". One solution is:
Reduce(intersect, list(unique(df$code[df$year == 2009]),
                       unique(df$code[df$year == 2010]),
                       unique(df$code[df$year == 2011])))
In practice, df contains about 15 years, thousands of codes, millions of rows, and multiple columns. For starters, the above command becomes quite long when all the years are included. Plus it's memory-consuming and slow. Is there sparser/faster code to do this?
As another idea, you could build a structure of occurrences per year, which can be handy and more efficient down the road than many pairwise intersections:
lvls <- list(y = unique(df$year), c = levels(df$code))

library(Matrix)
tab <- sparseMatrix(i = match(df$year, lvls$y),
                    j = match(df$code, lvls$c),
                    x = TRUE,
                    dimnames = lvls)
tab
#3 x 3 sparse Matrix of class "lgCMatrix"
# c
#y a b c
# 2009 | | .
# 2010 . | .
# 2011 . | |
And then:
colSums(tab) == nrow(tab)
# a b c
#FALSE TRUE FALSE
or, in this case, better:
colnames(tab)[diff(tab@p) == nrow(tab)]
#[1] "b"
"df" is:
df = structure(list(year = c(2009L, 2009L, 2009L, 2010L, 2010L, 2011L,
2011L, 2011L), code = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L), .Label = c("a", "b", "c"), class = "factor")), .Names = c("year",
"code"), class = "data.frame", row.names = c(NA, -8L))
Using tidyverse functions and considering dft1 as your input, you can try:
dft1 %>%
    unique() %>%
    group_by(code) %>%
    filter(n_distinct(year) == length(unique(dft1$year)))
which gives:
year code
<int> <chr>
1 2009 b
2 2010 b
3 2011 b
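For comparison, the same idea fits in a couple of lines of base R: cross-tabulate the distinct year/code pairs and keep the codes that appear in every year. This is a sketch using the "df" data defined above.

```r
# Same data as the "df" structure above
df <- data.frame(year = c(2009L, 2009L, 2009L, 2010L, 2010L,
                          2011L, 2011L, 2011L),
                 code = factor(c("a", "a", "b", "b", "b", "b", "c", "c")))

# Count distinct (year, code) pairs; a code common to all years has a
# non-zero entry in every row of the year-by-code table.
tab <- table(unique(df))
names(which(colSums(tab > 0) == nrow(tab)))
# "b"
```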

Replace values in a dataset based on more than one condition in R

I have a dataset that looks something like this:
age Year f.pop f.dc
1 1990 0 1
5 2001 200 4
1 1990 400 2
1 2001 50 3
5 2001 0 3
I want it to look like this:
age Year f.pop f.dc
1 1990 400 1
5 2001 200 4
1 1990 400 2
1 2001 50 3
5 2001 200 3
Basically, I want to replace zero values in the f.pop column of my dataset with f.pop values of rows that match in two other columns (Year and age). The f.dc column is largely irrelevant to this question, but I want to emphasize that these rows are not identical and must remain separate.
Here's my attempt:
for (i in 1:length(usbd$f.pop)) {
    if (usbd$f.pop[i] == 0) {
        iage <- usbd$age[i]
        iyear <- usbd$Year[i]
        index <- which(usbd$age == iage & usbd$Year == iyear)
        usbd$f.pop[i] <- usbd$f.pop[index]
    }
}
But this is incredibly slow. There must be a more efficient way.
Conditional replacement of values in a data.frame is useful but I'm not sure how to apply this to two conditions with potentially different indices.
We could use data.table to replace the '0' values in 'f.pop' (assuming that 'f.pop' value is unique for each 'age', 'Year' group). Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by age and Year (.(age, Year)), we assign 'f.pop' as the non zero value in 'f.pop' (f.pop := f.pop[f.pop!=0]).
library(data.table)
setDT(df1)[, f.pop := f.pop[f.pop != 0], by = .(age, Year)]
df1
# age Year f.pop f.dc
#1: 1 1990 400 1
#2: 5 2001 200 4
#3: 1 1990 400 2
#4: 1 2001 50 3
#5: 5 2001 200 3
data
df1 <- structure(list(age = c(1L, 5L, 1L, 1L, 5L), Year = c(1990L, 2001L,
1990L, 2001L, 2001L), f.pop = c(0L, 200L, 400L, 50L, 0L), f.dc = c(1L,
4L, 2L, 3L, 3L)), .Names = c("age", "Year", "f.pop", "f.dc"),
class = "data.frame", row.names = c(NA, -5L))
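A hypothetical dplyr analogue of the same replacement, under the same assumption that each (age, Year) group contains a single distinct non-zero f.pop (max() is used here just to pick that value out):

```r
library(dplyr)

# Same data as in the data section above
df1 <- data.frame(age   = c(1L, 5L, 1L, 1L, 5L),
                  Year  = c(1990L, 2001L, 1990L, 2001L, 2001L),
                  f.pop = c(0L, 200L, 400L, 50L, 0L),
                  f.dc  = c(1L, 4L, 2L, 3L, 3L))

# Within each (age, Year) group, swap zeros for the group's non-zero value
res <- df1 %>%
    group_by(age, Year) %>%
    mutate(f.pop = replace(f.pop, f.pop == 0, max(f.pop))) %>%
    ungroup()
res
```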

Calculate column sums for each combination of two grouping variables [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 7 years ago.
I have a dataset that looks something like this:
Type Age count1 count2 Year Pop1 Pop2 TypeDescrip
A 35 1 1 1990 30000 50000 alpha
A 35 3 1 1990 30000 50000 alpha
A 45 2 3 1990 20000 70000 alpha
B 45 2 1 1990 20000 70000 beta
B 45 4 5 1990 20000 70000 beta
I want to add the counts of the rows that are matching in the Type and Age columns. So ideally I would end up with a dataset that looks like this:
Type Age count1 count2 Year Pop1 Pop2 TypeDescrip
A 35 4 2 1990 30000 50000 alpha
A 45 2 3 1990 20000 70000 alpha
B 45 6 6 1990 20000 70000 beta
I've tried using nested duplicated() statements such as below:
typedup = duplicated(df$Type)
bothdup = duplicated(df[(typedup == TRUE),]$Age)
but this returns indices for which age or type are duplicated, not necessarily when one row has duplicates of both.
I've also tried tapply:
tapply(c(df$count1, df$count2), c(df$Age, df$Type), sum)
but this output is difficult to work with. I want to have a data.frame when I'm done.
I don't want to use a for-loop because my dataset is quite large.
Try
library(dplyr)
df1 %>%
    group_by(Type, Age) %>%
    summarise_each(funs(sum))
# Type Age count1 count2
#1 A 35 4 2
#2 A 45 2 3
#3 B 45 6 6
In the newer versions of dplyr
df1 %>%
    group_by(Type, Age) %>%
    summarise_all(sum)
Or using base R
aggregate(. ~ Type + Age, df1, FUN = sum)
# Type Age count1 count2
#1 A 35 4 2
#2 A 45 2 3
#3 B 45 6 6
Or
library(data.table)
setDT(df1)[, lapply(.SD, sum), .(Type, Age)]
# Type Age count1 count2
#1: A 35 4 2
#2: A 45 2 3
#3: B 45 6 6
Update
Based on the new dataset,
df2 %>%
    group_by(Type, Age, Pop1, Pop2, TypeDescrip) %>%
    summarise_each(funs(sum), matches('^count'))
# Type Age Pop1 Pop2 TypeDescrip count1 count2
#1 A 35 30000 50000 alpha 4 2
#2 A 45 20000 70000 beta 2 3
#3 B 45 20000 70000 beta 6 6
data
df1 <- structure(list(Type = c("A", "A", "A", "B", "B"), Age = c(35L,
35L, 45L, 45L, 45L), count1 = c(1L, 3L, 2L, 2L, 4L), count2 = c(1L,
1L, 3L, 1L, 5L)), .Names = c("Type", "Age", "count1", "count2"
), class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(Type = c("A", "A", "A", "B", "B"), Age = c(35L,
35L, 45L, 45L, 45L), count1 = c(1L, 3L, 2L, 2L, 4L), count2 = c(1L,
1L, 3L, 1L, 5L), Year = c(1990L, 1990L, 1990L, 1990L, 1990L),
Pop1 = c(30000L, 30000L, 20000L, 20000L, 20000L), Pop2 = c(50000L,
50000L, 70000L, 70000L, 70000L), TypeDescrip = c("alpha",
"alpha", "beta", "beta", "beta")), .Names = c("Type", "Age",
"count1", "count2", "Year", "Pop1", "Pop2", "TypeDescrip"),
class = "data.frame", row.names = c(NA, -5L))
@hannah You can also use SQL via the sqldf package:
library(sqldf)

sqldf("select Type, Age,
              sum(count1) as sum_count1,
              sum(count2) as sum_count2
       from df1
       group by Type, Age")
