How to sum based on three different conditions in R

The following is my data.
gcode code year P Q
1 101 2000 1 3
1 101 2001 2 4
1 102 2000 1 1
1 102 2001 4 5
1 102 2002 2 6
1 102 2003 6 5
1 103 1999 6 1
1 103 2000 4 2
1 103 2001 2 1
2 104 2000 1 3
2 104 2001 2 4
2 105 2001 4 5
2 105 2002 2 6
2 105 2003 6 5
2 105 2004 6 1
2 106 2000 4 2
2 106 2001 2 1
gcode 1 has three different codes: 101, 102 and 103. The only years that all three codes share are 2000 and 2001. I want to sum P and Q across the codes for each of these shared years and delete the remaining rows. I want to do the same for gcode 2 as well.
How can I get a result like this?
gcode year P Q
1 2000 1+1+4 3+1+2
1 2001 2+4+2 4+5+1
2 2001 2+4+2 4+5+1

We can split the data by gcode, subset each piece to the years that are present for every code in it, and then aggregate by gcode and year.
do.call(rbind, lapply(split(df, df$gcode), function(x) {
aggregate(cbind(P, Q)~gcode+year,
subset(x, year %in% Reduce(intersect, split(x$year, x$code))), sum)
}))
# gcode year P Q
#1.1 1 2000 6 6
#1.2 1 2001 8 10
#2 2 2001 8 10
Using dplyr with similar logic we can do
library(dplyr)
df %>%
group_split(gcode) %>%
purrr::map_df(. %>%
group_by(year) %>%
filter(n_distinct(code) == n_distinct(.$code)) %>%
group_by(gcode, year) %>%
summarise_at(vars(P:Q), sum))
data
df <- structure(list(gcode = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), code = c(101L, 101L, 102L, 102L,
102L, 102L, 103L, 103L, 103L, 104L, 104L, 105L, 105L, 105L, 105L,
106L, 106L), year = c(2000L, 2001L, 2000L, 2001L, 2002L, 2003L,
1999L, 2000L, 2001L, 2000L, 2001L, 2001L, 2002L, 2003L, 2004L,
2000L, 2001L), P = c(1L, 2L, 1L, 4L, 2L, 6L, 6L, 4L, 2L, 1L,
2L, 4L, 2L, 6L, 6L, 4L, 2L), Q = c(3L, 4L, 1L, 5L, 6L, 5L, 1L,
2L, 1L, 3L, 4L, 5L, 6L, 5L, 1L, 2L, 1L)), class = "data.frame",
row.names = c(NA, -17L))
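The `Reduce(intersect, ...)` step in the base R version above can be checked on its own. A minimal sketch, using only the code and year columns for gcode 1 (the object names `x`, `years_by_code` and `common` are just for illustration):

```r
# code/year pairs for gcode 1, copied from the example data
x <- data.frame(
  code = c(101, 101, 102, 102, 102, 102, 103, 103, 103),
  year = c(2000, 2001, 2000, 2001, 2002, 2003, 1999, 2000, 2001)
)

# one vector of years per code
years_by_code <- split(x$year, x$code)

# fold intersect() over the list: only years present for every code survive
common <- Reduce(intersect, years_by_code)
print(common)
```

Only 2000 and 2001 remain, which is why the other years are dropped before aggregating.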

An option using data.table package:
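years <- DT[, {
m <- min(year)
# tabulate() ignores values below 1, so shift the years to start at 1;
# otherwise the earliest year of each gcode is never counted
ty <- tabulate(year - m + 1L)
.(year = which(ty == uniqueN(code)) + m - 1L)
}, gcode]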
DT[years, on=.(gcode, year),
by=.EACHI, .(P=sum(P), Q=sum(Q))]
output:
gcode year P Q
1: 1 2000 6 6
2: 1 2001 8 10
3: 2 2001 8 10
data:
library(data.table)
DT <- fread("gcode code year P Q
1 101 2000 1 3
1 101 2001 2 4
1 102 2000 1 1
1 102 2001 4 5
1 102 2002 2 6
1 102 2003 6 5
1 103 1999 6 1
1 103 2000 4 2
1 103 2001 2 1
2 104 2000 1 3
2 104 2001 2 4
2 105 2001 4 5
2 105 2002 2 6
2 105 2003 6 5
2 105 2004 6 1
2 106 2000 4 2
2 106 2001 2 1")

I came up with the following solution. First, I counted how many times each year appears for each gcode. I also counted how many unique codes exist for each gcode. Then I joined the two results using left_join() and kept the rows that have the same values in n_year and n_code. Next, I joined the original data frame, which is called mydf. Finally, I defined groups by gcode and year, and summed up P and Q for each group.
library(dplyr)
left_join(count(mydf, gcode, year, name = "n_year"),
group_by(mydf, gcode) %>% summarize(n_code = n_distinct(code))) %>%
filter(n_year == n_code) %>%
left_join(mydf, by = c("gcode", "year")) %>%
group_by(gcode, year) %>%
summarize_at(vars(P:Q),
.funs = list(~sum(.)))
# gcode year P Q
# <int> <int> <int> <int>
#1 1 2000 6 6
#2 1 2001 8 10
#3 2 2001 8 10
Another idea
I was reviewing this question later and came up with the following idea, which is much simpler. First, I defined groups by gcode and year. For each group, I counted how many data points exist using add_count(). Then I defined groups again, this time by gcode only. For each gcode group, I kept the rows that satisfy n == n_distinct(code), where n is the column created by add_count(). If the number in n matches the number returned by n_distinct(), the year in that row is present for every code in the gcode. Finally, I defined groups by gcode and year again and summed up the values in P and Q.
group_by(mydf, gcode, year) %>%
add_count() %>%
group_by(gcode) %>%
filter(n == n_distinct(code)) %>%
group_by(gcode, year) %>%
summarize_at(vars(P:Q),
.funs = list(~sum(.)))
# This is the same code in data.table.
setDT(mydf)[, check := .N, by = .(gcode, year)][,
.SD[check == uniqueN(code)], by = gcode][,
lapply(.SD, sum), .SDcols = P:Q, by = .(gcode, year)][]
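The same counting idea can also be sketched in base R with ave(), with no extra packages. This is only an illustration: the names `n_year`, `n_code`, `keep` and `res` are made up here, and it assumes each code appears at most once per year within a gcode, as in the example data.

```r
mydf <- read.table(header = TRUE, text = "
gcode code year P Q
1 101 2000 1 3
1 101 2001 2 4
1 102 2000 1 1
1 102 2001 4 5
1 102 2002 2 6
1 102 2003 6 5
1 103 1999 6 1
1 103 2000 4 2
1 103 2001 2 1
2 104 2000 1 3
2 104 2001 2 4
2 105 2001 4 5
2 105 2002 2 6
2 105 2003 6 5
2 105 2004 6 1
2 106 2000 4 2
2 106 2001 2 1
")

# number of rows per gcode/year, repeated on every row of that group
n_year <- ave(mydf$code, mydf$gcode, mydf$year, FUN = length)

# number of distinct codes per gcode, repeated on every row of that gcode
n_code <- ave(mydf$code, mydf$gcode, FUN = function(v) length(unique(v)))

# a year is shared by all codes when the two counts agree
keep <- mydf[n_year == n_code, ]

res <- aggregate(cbind(P, Q) ~ gcode + year, data = keep, FUN = sum)
res <- res[order(res$gcode, res$year), ]
print(res)
```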

Related

creating a 'survival' dummy in R (panel data)

I would like to create a dummy in my unbalanced panel that equals 0 if the firm closed in the next period (here "closed" means the firm was not recorded in the next time period) and 1 otherwise.
My data look like this:
firm_id year
1 90
1 92
2 90
2 92
2 94
2 96
3 90
So my desired output would look like:
firm_id year dummy
1 90 1
1 92 0
2 90 1
2 92 1
2 94 1
2 96 1
3 90 0
I am unsure how to approach this problem. My original idea was to count the number of years recorded for each firm_id: if a firm has 4 years, always assign 1; if it has 3 years, assign 1 to the first 2 years and 0 to the 3rd. But then I discovered that some firms entered the panel later, so this method will not work. Is there a better approach that would solve this issue?
Assign 0 to dummy for every row that is the last entry of its firm_id and whose year is not the maximum year in the data. All other rows keep the value 1.
df$dummy <- 1
df$dummy[!duplicated(df$firm_id, fromLast = TRUE) & df$year != max(df$year)] <- 0
df
# firm_id year dummy
#1 1 90 1
#2 1 92 0
#3 2 90 1
#4 2 92 1
#5 2 94 1
#6 2 96 1
#7 3 90 0
This requires your data to be sorted by year, as in the example. If it is not sorted, you can use order() to sort it first.
df <- df[with(df, order(firm_id, year)), ]
data
df <- structure(list(firm_id = c(1L, 1L, 2L, 2L, 2L, 2L, 3L), year = c(90L,
92L, 90L, 92L, 94L, 96L, 90L)), class = "data.frame", row.names = c(NA, -7L))
We can use
library(dplyr)
df %>%
mutate(dummy = +(duplicated(firm_id, fromLast = TRUE) | year == max(year)))
firm_id year dummy
1 1 90 1
2 1 92 0
3 2 90 1
4 2 92 1
5 2 94 1
6 2 96 1
7 3 90 0
data
df <- structure(list(firm_id = c(1L, 1L, 2L, 2L, 2L, 2L, 3L), year = c(90L,
92L, 90L, 92L, 94L, 96L, 90L)), class = "data.frame", row.names = c(NA, -7L))
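A variant that does not depend on row order can be sketched with ave(), which computes each firm's last recorded year directly. The column name `last_year_of_firm` is hypothetical. Like the answers above, it treats any later record as survival and does not check that the record falls in the immediately following period.

```r
df <- data.frame(
  firm_id = c(1, 1, 2, 2, 2, 2, 3),
  year    = c(90, 92, 90, 92, 94, 96, 90)
)

# last year in which each firm is recorded, repeated on every row of the firm
last_year_of_firm <- ave(df$year, df$firm_id, FUN = max)

# 1 if the firm is seen again later, or if the row is in the final panel year
df$dummy <- as.integer(df$year < last_year_of_firm | df$year == max(df$year))
print(df)
```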

How to take data from one data.frame and put it in another with multiple conditions in R

I am having trouble figuring out how to do something. I have two data.frames, and I want to take variables from one data.frame and add them to the other under certain conditions. Here is an extract of my two data.frames:
Data.frame 1 :
ID YEAR_F
154 2005
432 2005
123 2007
Data.frame 2 :
ID Year_D Month DC1 DC2
154 2001 1 4 23
154 2001 2 56 22
154 2005 1 32 11
154 2005 2 12 10
432 2005 1 23 11
432 2006 1 23 10
432 2006 2 22 11
123 2001 1 12 34
123 2007 1 11 12
123 2007 2 11 11
123 2004 1 43 43
So I want to take DC1 and DC2 from the second data.frame and add them to my first data.frame, but only for the year given in the first data.frame. In addition, I want one column of DC1 and DC2 per month. In the end, my data.frame should look something like this:
data.frame 3 :
ID Year_D DC1_M1 DC1_M2 DC2_M1 DC2_M2
154 2005 32 12 11 10
432 2005 23 na 11 na
123 2007 11 11 12 11
I'm not really sure how to do it, especially because the structure of the second data.frame changes.
Thank you in advance!
We can pivot the second dataset to 'wide' format after filtering based on the 'YEAR_F' of first data and then do a join
library(dplyr)
library(tidyr)
df2 %>%
filter(Year_D %in% df1$YEAR_F) %>%
select(-Year_D) %>%
pivot_wider(names_from = Month, values_from = c(DC1, DC2)) %>%
right_join(df1) %>%
select(names(df1), everything())
output
# A tibble: 3 x 6
# ID YEAR_F DC1_1 DC1_2 DC2_1 DC2_2
# <int> <int> <int> <int> <int> <int>
#1 154 2005 32 12 11 10
#2 432 2005 23 NA 11 NA
#3 123 2007 11 11 12 11
Or using base R with merge and reshape
merge(df1, reshape(subset(df2, Year_D %in% df1$YEAR_F, select = -Year_D),
idvar = 'ID', direction = 'wide', timevar = 'Month'))
# ID YEAR_F DC1.1 DC2.1 DC1.2 DC2.2
#1 123 2007 11 12 11 11
#2 154 2005 32 11 12 10
#3 432 2005 23 11 NA NA
data
df1 <- structure(list(ID = c(154L, 432L, 123L), YEAR_F = c(2005L, 2005L,
2007L)), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(ID = c(154L, 154L, 154L, 154L, 432L, 432L, 432L,
123L, 123L, 123L, 123L), Year_D = c(2001L, 2001L, 2005L, 2005L,
2005L, 2006L, 2006L, 2001L, 2007L, 2007L, 2004L), Month = c(1L,
2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L), DC1 = c(4L, 56L, 32L,
12L, 23L, 23L, 22L, 12L, 11L, 11L, 43L), DC2 = c(23L, 22L, 11L,
10L, 11L, 10L, 11L, 34L, 12L, 11L, 43L)), class = "data.frame",
row.names = c(NA,
-11L))
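One caveat about the `%in%` filter used above: it keeps every row of df2 whose year appears anywhere in df1$YEAR_F, whatever its ID. That is fine for this sample, but a 2007 record for ID 432, for instance, would also be kept. A possible alternative, sketched in base R under that assumption, is to match on ID and year together before reshaping (the names `matched` and `wide` are illustrative):

```r
df1 <- data.frame(ID = c(154, 432, 123), YEAR_F = c(2005, 2005, 2007))
df2 <- data.frame(
  ID     = c(154, 154, 154, 154, 432, 432, 432, 123, 123, 123, 123),
  Year_D = c(2001, 2001, 2005, 2005, 2005, 2006, 2006, 2001, 2007, 2007, 2004),
  Month  = c(1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1),
  DC1    = c(4, 56, 32, 12, 23, 23, 22, 12, 11, 11, 43),
  DC2    = c(23, 22, 11, 10, 11, 10, 11, 34, 12, 11, 43)
)

# keep only df2 rows whose ID *and* year match a row of df1
matched <- merge(df1, df2,
                 by.x = c("ID", "YEAR_F"),
                 by.y = c("ID", "Year_D"))

# spread the months into columns: DC1.1, DC2.1, DC1.2, DC2.2
wide <- reshape(matched,
                idvar = c("ID", "YEAR_F"),
                timevar = "Month",
                direction = "wide")
print(wide)
```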

Linear regression on split data in R

I want to make groups of data where measurements are done in multiple Year on the same species at the same Lat and Long. Then, I want to run linear regression on all those groups (using N as dependent variable and Year as independent variable).
Practice dataset:
Species Year Lat Long N
1 1 1999 1 1 5
2 1 2001 2 1 5
3 2 2010 3 3 4
4 2 2010 3 3 2
5 2 2011 3 3 5
6 2 2012 3 3 8
7 3 2007 8 7 -10
8 3 2019 8 7 100
9 2 2000 1 1 5
First, I averaged data where multiple measurements were done in the same Year on the same Species at the same latitude and longitude . Then, I split data based on Lat, Long and Species. However, this still groups rows together where Lat, Long and Species are not equal ($ '4'). Furthermore, I want to remove $'1', since I only want to use data where multiple measurements are done over a number of Year. How do I do this?
Data <- read.table("Dataset.txt", header = TRUE)
Agr_Data <- aggregate(N ~ Lat + Long + Year + Species, data = Data, mean)
Split_Data <- split(Agr_Data, Agr_Data$Lat + Agr_Data$Long + Agr_Data$Species)
Regression_Data <- lapply(Split_Data, function(Split_Data) lm(N~Year, data = Split_Data) )
Split_Data
$`3`
Lat Long Year Species N
1 1 1 1999 1 5
$`4`
Lat Long Year Species N
2 2 1 2001 1 5
3 1 1 2000 2 5
$`8`
Lat Long Year Species N
4 3 3 2010 2 3
5 3 3 2011 2 5
6 3 3 2012 2 8
$`18`
Lat Long Year Species N
7 8 7 2007 3 -10
8 8 7 2019 3 100
Desired output:
Lat Long Species Coefficients
3 3 2 2.5
8 7 3 9.167
Base R solution:
# 1. Import data:
df <- structure(list(Species = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 2L ),
Year = c(1999L, 2001L, 2010L, 2010L, 2011L, 2012L, 2007L, 2019L, 2000L),
Lat = c(1L, 2L, 3L, 3L, 3L, 3L, 8L, 8L, 1L),
Long = c(1L, 1L, 3L, 3L, 3L, 3L, 7L, 7L, 1L),
N = c(5L, 5L, 4L, 2L, 5L, 8L, -10L, 100L, 5L)),
class = "data.frame", row.names = c(NA, -9L ))
# 2. Aggregate data:
df <- aggregate(N ~ Lat + Long + Year + Species, data = df, mean)
# 3. Concatenate vecs to create grouping vec:
df$grouping_var <- paste(df$Species, df$Lat, df$Long, sep = ", ")
# 4. split apply combine lm: fit N ~ Year per group and keep the slope
#    (NA for groups observed in only one year):
coeff_n <- sapply(split(df, df$grouping_var),
function(x){
if (nrow(x) > 1) coef(lm(N ~ Year, data = x))[["Year"]] else NA_real_
}
)
# 5. Create a dataframe of coeffs, keyed by the group names from split():
coeff_df <- data.frame(grouping_var = names(coeff_n), coeff_n = unname(coeff_n))
# 6. Merge the dataframes together:
df <- merge(df, coeff_df, by = "grouping_var", all.x = TRUE)
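To get a table shaped like the desired output, one option is to split on the three grouping columns themselves rather than on their sum, drop single-year groups, and collect one slope per group. A minimal base R sketch (the names `Data`, `agg`, `groups` and `result` are illustrative):

```r
Data <- data.frame(
  Species = c(1, 1, 2, 2, 2, 2, 3, 3, 2),
  Year    = c(1999, 2001, 2010, 2010, 2011, 2012, 2007, 2019, 2000),
  Lat     = c(1, 2, 3, 3, 3, 3, 8, 8, 1),
  Long    = c(1, 1, 3, 3, 3, 3, 7, 7, 1),
  N       = c(5, 5, 4, 2, 5, 8, -10, 100, 5)
)

# average repeated measurements within the same year, species and location
agg <- aggregate(N ~ Lat + Long + Year + Species, data = Data, FUN = mean)

# split on the combination of the three columns, not on their sum
groups <- split(agg, list(agg$Lat, agg$Long, agg$Species), drop = TRUE)

# keep only groups observed in more than one year
groups <- Filter(function(g) nrow(g) > 1, groups)

# one row per group with the slope of N on Year
result <- do.call(rbind, lapply(groups, function(g) {
  data.frame(Lat = g$Lat[1], Long = g$Long[1], Species = g$Species[1],
             Coefficient = coef(lm(N ~ Year, data = g))[["Year"]])
}))
rownames(result) <- NULL
print(result)
```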

Proper way to split data frames at multiple levels using ddply [duplicate]

Let's say I have a data frame like the following:
year stint ID W
1 2003 1 abc 10
2 2003 2 abc 3
3 2003 1 def 16
4 2004 1 abc 15
5 2004 1 def 11
6 2004 2 def 7
I would like to combine the data so that it looks like
year ID W
1 2003 abc 13
3 2003 def 16
4 2004 abc 15
5 2004 def 18
I found a way to combine the data as desired, but I'm very sure that there's a better way.
combinedData = unique(ddply(data, "ID", function(x) {
ddply(x, "year", function(y) {
data.frame(ID=x$ID, W=sum(y$W))
})
}))
combinedData[order(combinedData$year),]
This produces the following output:
year ID W
1 2003 abc 13
7 2003 def 16
4 2004 abc 15
10 2004 def 18
Specifically I don't like that I had to use unique (otherwise I get each unique combo of year,ID,W three times in the outputted data), and I don't like that the row numbers aren't sequential. How can I do this more cleanly?
Do this with base R:
aggregate(W~year+ID, df, sum)
# year ID W
#1 2003 abc 13
#2 2004 abc 15
#3 2003 def 16
#4 2004 def 18
data
df <- structure(list(year = c(2003L, 2003L, 2003L, 2004L, 2004L, 2004L
), stint = c(1L, 2L, 1L, 1L, 1L, 2L), ID = structure(c(1L, 1L,
2L, 1L, 2L, 2L), .Label = c("abc", "def"), class = "factor"),
W = c(10L, 3L, 16L, 15L, 11L, 7L)), .Names = c("year", "stint",
"ID", "W"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6"))
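To match the row order in the question and get sequential row numbers, the aggregate() result can be sorted and its row names reset. A small sketch (the name `combined` is illustrative):

```r
data <- data.frame(
  year  = c(2003, 2003, 2003, 2004, 2004, 2004),
  stint = c(1, 2, 1, 1, 1, 2),
  ID    = c("abc", "abc", "def", "abc", "def", "def"),
  W     = c(10, 3, 16, 15, 11, 7)
)

# sum W over stints for each year/ID combination
combined <- aggregate(W ~ year + ID, data = data, FUN = sum)

# sort by year, then ID, and renumber the rows 1..n
combined <- combined[order(combined$year, combined$ID), ]
rownames(combined) <- NULL
print(combined)
```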

Subset of data with criteria of two columns

I would like to create a subset of data that consists of Units that have a higher score in QTR 4 than QTR 1 (upward trend). Doesn't matter if QTR 2 or 3 are present.
Unit QTR Score
5 4 34
1 1 22
5 3 67
2 4 78
3 2 39
5 2 34
1 2 34
5 1 67
1 3 70
1 4 89
3 4 19
Subset would be:
Unit QTR Score
1 1 22
1 2 34
1 3 70
1 4 89
I've tried variants of something like this:
upward_subset <- subset(mydata,Unit if QTR=4~Score > QTR=1~Score)
Thank you for your time
If the dataframe is named "d", then this succeeds on your test set:
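diffs <- sapply(split(d, d$Unit),
function(dd) dd$Score[dd$QTR == 4][1] - dd$Score[dd$QTR == 1][1])
# keep units whose QTR 4 score exceeds their QTR 1 score (NA if a quarter is missing)
d[d$Unit %in% names(diffs)[which(diffs > 0)], ]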
#-------------
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
An alternative in two steps:
result <- unlist(
by(
test,
test$Unit,
function(x) x$Score[x$QTR==4] > x$Score[x$QTR==1])
)
test[test$Unit %in% names(result[result==TRUE]),]
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
A solution using data.table (Probably there are better versions than what I have at the moment).
Note: Assuming a QTR value for a given Unit is unique
Data:
df <- structure(list(Unit = c(5L, 1L, 5L, 2L, 3L, 5L, 1L, 5L, 1L, 1L,
3L), QTR = c(4L, 1L, 3L, 4L, 2L, 2L, 2L, 1L, 3L, 4L, 4L), Score = c(34L,
22L, 67L, 78L, 39L, 34L, 34L, 67L, 70L, 89L, 19L)), .Names = c("Unit",
"QTR", "Score"), class = "data.frame", row.names = c(NA, -11L
))
Solution:
library(data.table)
dt <- data.table(df, key=c("Unit", "QTR"))
dt[, Score[Score[QTR == 4] > Score[QTR == 1]], by=Unit]
Unit V1
1: 1 22
2: 1 34
3: 1 70
4: 1 89
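Another way to make the comparison explicit is to put the QTR 1 and QTR 4 scores side by side with merge(), so units missing either quarter drop out on their own. A base R sketch, assuming at most one row per Unit and QTR (the names `q1`, `q4`, `both` and `upward_units` are illustrative):

```r
d <- data.frame(
  Unit  = c(5, 1, 5, 2, 3, 5, 1, 5, 1, 1, 3),
  QTR   = c(4, 1, 3, 4, 2, 2, 2, 1, 3, 4, 4),
  Score = c(34, 22, 67, 78, 39, 34, 34, 67, 70, 89, 19)
)

q1 <- d[d$QTR == 1, c("Unit", "Score")]
q4 <- d[d$QTR == 4, c("Unit", "Score")]

# inner join: only units that have both a QTR 1 and a QTR 4 score remain
both <- merge(q1, q4, by = "Unit", suffixes = c("_q1", "_q4"))

# units with an upward trend from QTR 1 to QTR 4
upward_units <- both$Unit[both$Score_q4 > both$Score_q1]

upward_subset <- d[d$Unit %in% upward_units, ]
upward_subset <- upward_subset[order(upward_subset$Unit, upward_subset$QTR), ]
print(upward_subset)
```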
