Let's say I have a data frame like the following:
year stint ID W
1 2003 1 abc 10
2 2003 2 abc 3
3 2003 1 def 16
4 2004 1 abc 15
5 2004 1 def 11
6 2004 2 def 7
I would like to combine the data so that it looks like
year ID W
1 2003 abc 13
3 2003 def 16
4 2004 abc 15
5 2004 def 18
I found a way to combine the data as desired, but I'm very sure that there's a better way.
combinedData = unique(ddply(data, "ID", function(x) {
ddply(x, "year", function(y) {
data.frame(ID=x$ID, W=sum(y$W))
})
}))
combinedData[order(combinedData$year),]
This produces the following output:
year ID W
1 2003 abc 13
7 2003 def 16
4 2004 abc 15
10 2004 def 18
Specifically, I don't like that I had to use unique() (otherwise each combination of year, ID and W appears three times in the output), and I don't like that the row numbers aren't sequential. How can I do this more cleanly?
Do this with base R:
aggregate(W~year+ID, df, sum)
# year ID W
#1 2003 abc 13
#2 2004 abc 15
#3 2003 def 16
#4 2004 def 18
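If the row order and row names of the desired output matter, the aggregate() result can be sorted and renumbered afterwards. The following is a minimal sketch, assuming a copy of the question's data rebuilt inline with data.frame(); the name res is only illustrative.

```r
# Illustrative copy of the question's data (built inline so the sketch runs on its own)
df <- data.frame(
  year  = c(2003L, 2003L, 2003L, 2004L, 2004L, 2004L),
  stint = c(1L, 2L, 1L, 1L, 1L, 2L),
  ID    = c("abc", "abc", "def", "abc", "def", "def"),
  W     = c(10L, 3L, 16L, 15L, 11L, 7L)
)

# Sum W within each year/ID combination; stint is dropped because it is not in the formula
res <- aggregate(W ~ year + ID, data = df, FUN = sum)

# Sort by year, then ID, and reset the row names so they run 1..n
res <- res[order(res$year, res$ID), ]
rownames(res) <- NULL
res
```

Sorting after aggregating keeps the one-line aggregate() call unchanged and makes the ordering explicit.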
data
df <- structure(list(year = c(2003L, 2003L, 2003L, 2004L, 2004L, 2004L
), stint = c(1L, 2L, 1L, 1L, 1L, 2L), ID = structure(c(1L, 1L,
2L, 1L, 2L, 2L), .Label = c("abc", "def"), class = "factor"),
W = c(10L, 3L, 16L, 15L, 11L, 7L)), .Names = c("year", "stint",
"ID", "W"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6"))
I have longitudinal data in which respondents were recruited as a cohort. Right now, I have the year in which each person took the survey, but I want to create a new column that counts whether it is the first, second, or third time that person took the survey.
Original Table
PersonID SurveyYear SurveyQ1Rating SurveyQ2Rating Gender
12 2013 5 4 f
12 2012 4 4 f
12 2010 3 3 f
2 2007 4 4 m
2 2008 3 3 m
2 2009 3 5 m
2 2010 5 5 m
2 2013 2 2 m
5 2013 4 4 f
5 2014 5 5 f
Target Table (with a new column, SurveyTime, marking the ith time a person took the survey)
PersonID SurveyYear SurveyTime SurveyQ1Rating SurveyQ2Rating Gender
12 2013 3 5 4 f
12 2012 2 4 4 f
12 2010 1 3 3 f
2 2007 1 4 4 m
2 2008 2 3 3 m
2 2009 3 3 5 m
2 2010 4 5 5 m
2 2013 5 2 2 m
5 2013 1 4 4 f
5 2014 2 5 5 f
A base solution:
df |>
transform(SurveyTime = ave(SurveyYear, PersonID, FUN = rank))
Its dplyr equivalent:
library(dplyr)
df %>%
group_by(PersonID) %>%
mutate(SurveyTime = dense_rank(SurveyYear)) %>%
ungroup()
Data
df <- structure(list(PersonID = c(12L, 12L, 12L, 2L, 2L, 2L, 2L, 2L,
5L, 5L), SurveyYear = c(2013L, 2012L, 2010L, 2007L, 2008L, 2009L,
2010L, 2013L, 2013L, 2014L), SurveyQ1Rating = c(5L, 4L, 3L, 4L,
3L, 3L, 5L, 2L, 4L, 5L), SurveyQ2Rating = c(4L, 4L, 3L, 4L, 3L,
5L, 5L, 2L, 4L, 5L), Gender = c("f", "f", "f", "m", "m", "m",
"m", "m", "f", "f")), class = "data.frame", row.names = c(NA, -10L))
Using data.table
library(data.table)
# df is the data frame defined under "Data" above
setDT(df)[, SurveyTime := frank(SurveyYear), PersonID]
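One detail worth checking: rank() and frank() average tied values by default, while dense_rank() gives ties the same integer. The results agree as long as a person has no repeated SurveyYear. The sketch below shows a base R dense rank built from match(); the helper name dense and the shortened data frame survey are assumptions made for illustration only.

```r
# Shortened, illustrative version of the survey data
survey <- data.frame(
  PersonID   = c(12L, 12L, 12L, 2L, 2L, 5L, 5L),
  SurveyYear = c(2013L, 2012L, 2010L, 2007L, 2008L, 2013L, 2014L)
)

# Hypothetical helper: position of each value among the sorted distinct values
dense <- function(y) match(y, sort(unique(y)))

# Apply the helper within each person
survey$SurveyTime <- ave(survey$SurveyYear, survey$PersonID, FUN = dense)
survey

# With a repeated year the two approaches differ
rank(c(2010, 2010, 2012))   # 1.5 1.5 3.0
dense(c(2010, 2010, 2012))  # 1 1 2
```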
The following is my data.
gcode code year P Q
1 101 2000 1 3
1 101 2001 2 4
1 102 2000 1 1
1 102 2001 4 5
1 102 2002 2 6
1 102 2003 6 5
1 103 1999 6 1
1 103 2000 4 2
1 103 2001 2 1
2 104 2000 1 3
2 104 2001 2 4
2 105 2001 4 5
2 105 2002 2 6
2 105 2003 6 5
2 105 2004 6 1
2 106 2000 4 2
2 106 2001 2 1
gcode 1 has three different codes: 101, 102 and 103. The only years shared by all three codes are 2000 and 2001. I want to sum P and Q across the codes for those shared years and delete the rest of the data. I want to do the same for gcode 2.
How can I get the result like this?
gcode year P Q
1 2000 1+1+4 3+1+2
1 2001 2+4+2 4+5+1
2 2001 2+4+2 4+5+1
We can split the data by gcode, subset each piece to the years that are present for every code, and then aggregate by gcode and year.
do.call(rbind, lapply(split(df, df$gcode), function(x) {
aggregate(cbind(P, Q)~gcode+year,
subset(x, year %in% Reduce(intersect, split(x$year, x$code))), sum)
}))
# gcode year P Q
#1.1 1 2000 6 6
#1.2 1 2001 8 10
#2 2 2001 8 10
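The key step is finding the years common to every code within one gcode. As a small sketch of that step on its own, assuming only the code and year columns for gcode 1 (the names x, years_by_code and common_years are illustrative):

```r
# code/year pairs for gcode 1, copied from the question
x <- data.frame(
  code = c(101L, 101L, 102L, 102L, 102L, 102L, 103L, 103L, 103L),
  year = c(2000L, 2001L, 2000L, 2001L, 2002L, 2003L, 1999L, 2000L, 2001L)
)

# One vector of years per code
years_by_code <- split(x$year, x$code)

# Fold intersect() over the list: keep only years present in every vector
common_years <- Reduce(intersect, years_by_code)
common_years  # 2000 2001
```

Reduce() applies intersect() pairwise from left to right, so it works for any number of codes without writing the intersections out by hand.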
Using dplyr with similar logic we can do
library(dplyr)
df %>%
group_split(gcode) %>%
purrr::map_df(. %>%
group_by(year) %>%
filter(n_distinct(code) == n_distinct(.$code)) %>%
group_by(gcode, year) %>%
summarise_at(vars(P:Q), sum))
data
df <- structure(list(gcode = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), code = c(101L, 101L, 102L, 102L,
102L, 102L, 103L, 103L, 103L, 104L, 104L, 105L, 105L, 105L, 105L,
106L, 106L), year = c(2000L, 2001L, 2000L, 2001L, 2002L, 2003L,
1999L, 2000L, 2001L, 2000L, 2001L, 2001L, 2002L, 2003L, 2004L,
2000L, 2001L), P = c(1L, 2L, 1L, 4L, 2L, 6L, 6L, 4L, 2L, 1L,
2L, 4L, 2L, 6L, 6L, 4L, 2L), Q = c(3L, 4L, 1L, 5L, 6L, 5L, 1L,
2L, 1L, 3L, 4L, 5L, 6L, 5L, 1L, 2L, 1L)), class = "data.frame",
row.names = c(NA, -17L))
An option using data.table package:
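years <- DT[, {
  m <- min(year)
  # shift so the earliest year maps to 1; tabulate() ignores values below 1
  ty <- tabulate(year - m + 1L)
  # a year is kept when it occurs once for every code in the gcode
  .(year = which(ty == uniqueN(code)) + m - 1L)
}, gcode]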
DT[years, on=.(gcode, year),
by=.EACHI, .(P=sum(P), Q=sum(Q))]
output:
gcode year P Q
1: 1 2000 6 6
2: 1 2001 8 10
3: 2 2001 8 10
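The counting step can be tried in isolation with base R. This is a sketch under two assumptions: each code has at most one row per year, and the years are shifted so that the earliest one maps to 1, because tabulate() silently ignores values below 1. The variable names are illustrative.

```r
# Years for gcode 1, copied from the data; there are 3 distinct codes
year    <- c(2000L, 2001L, 2000L, 2001L, 2002L, 2003L, 1999L, 2000L, 2001L)
n_codes <- 3L

m <- min(year)

# Count occurrences of each year, indexed from the earliest year
counts <- tabulate(year - m + 1L)
counts  # 1 3 3 1 1

# Years that occur once per code are common to all codes
common <- which(counts == n_codes) + m - 1L
common  # 2000 2001

# Values below 1 are dropped by tabulate()
tabulate(c(0L, 1L, 1L))  # 2
```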
data:
library(data.table)
DT <- fread("gcode code year P Q
1 101 2000 1 3
1 101 2001 2 4
1 102 2000 1 1
1 102 2001 4 5
1 102 2002 2 6
1 102 2003 6 5
1 103 1999 6 1
1 103 2000 4 2
1 103 2001 2 1
2 104 2000 1 3
2 104 2001 2 4
2 105 2001 4 5
2 105 2002 2 6
2 105 2003 6 5
2 105 2004 6 1
2 106 2000 4 2
2 106 2001 2 1")
I came up with the following solution. First, I counted how many times each year appears within each gcode, and how many unique codes exist for each gcode. I joined the two results with left_join() and kept the rows where n_year and n_code are equal, that is, the years that are present for every code. I then joined the result back to the original data frame (called mydf here), grouped by gcode and year, and summed P and Q for each group.
library(dplyr)
left_join(count(mydf, gcode, year, name = "n_year"),
group_by(mydf, gcode) %>% summarize(n_code = n_distinct(code))) %>%
filter(n_year == n_code) %>%
left_join(mydf, by = c("gcode", "year")) %>%
group_by(gcode, year) %>%
summarize_at(vars(P:Q),
.funs = list(~sum(.)))
# gcode year P Q
# <int> <int> <int> <int>
#1 1 2000 6 6
#2 1 2001 8 10
#3 2 2001 8 10
Another idea
I was reviewing this question later and came up with the following idea, which is much simpler. First, I grouped by gcode and year and counted the rows in each group with add_count(), which creates a column n. Then I regrouped by gcode only and kept the rows where n == n_distinct(code). If the value in n matches the value returned by n_distinct(), the year in that row exists for every code. Finally, I grouped by gcode and year again and summed the values in P and Q.
group_by(mydf, gcode, year) %>%
add_count() %>%
group_by(gcode) %>%
filter(n == n_distinct(code)) %>%
group_by(gcode, year) %>%
summarize_at(vars(P:Q),
.funs = list(~sum(.)))
# This is the same code in data.table.
setDT(mydf)[, check := .N, by = .(gcode, year)][,
.SD[check == uniqueN(code)], by = gcode][,
lapply(.SD, sum), .SDcols = P:Q, by = .(gcode, year)][]
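The same count-and-compare idea can also be written in base R with ave(). This is a sketch rather than one of the posted answers, and it assumes each code has at most one row per year within a gcode; the names n_year, n_code and keep are illustrative.

```r
# Data from the question, rebuilt inline
df <- data.frame(
  gcode = rep(1:2, times = c(9, 8)),
  code  = c(101L, 101L, 102L, 102L, 102L, 102L, 103L, 103L, 103L,
            104L, 104L, 105L, 105L, 105L, 105L, 106L, 106L),
  year  = c(2000L, 2001L, 2000L, 2001L, 2002L, 2003L, 1999L, 2000L, 2001L,
            2000L, 2001L, 2001L, 2002L, 2003L, 2004L, 2000L, 2001L),
  P     = c(1L, 2L, 1L, 4L, 2L, 6L, 6L, 4L, 2L, 1L, 2L, 4L, 2L, 6L, 6L, 4L, 2L),
  Q     = c(3L, 4L, 1L, 5L, 6L, 5L, 1L, 2L, 1L, 3L, 4L, 5L, 6L, 5L, 1L, 2L, 1L)
)

# Rows per gcode/year, and distinct codes per gcode, aligned with the rows of df
n_year <- ave(df$code, df$gcode, df$year, FUN = length)
n_code <- ave(df$code, df$gcode, FUN = function(z) length(unique(z)))

# Keep years that appear for every code, then sum P and Q
keep <- df[n_year == n_code, ]
res  <- aggregate(cbind(P, Q) ~ gcode + year, data = keep, FUN = sum)
res
```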
I ran into a problem when trying to merge two data frames in R, and I need your help.
Suppose that I have following dataframes:
> Data_A
Code year score
A 1991 1
A 1992 2
A 1993 3
B 1991 3
B 1993 7
> Data_B
Code year l.score
A 1991 NA
A 1992 1
A 1993 2
A 1994 3
B 1991 NA
B 1992 3
B 1993 NA
B 1994 7
The desired result after merging them should look like this:
> Data_merge
Code year score l.score
A 1991 1 NA
A 1992 2 1
A 1993 3 2
B 1991 3 NA
B 1993 7 NA
In other words, the merge should keep only the combinations of the shared columns ("Code" and "year") that appear in Data_A. I tried merge(Data_A, Data_B, all = FALSE), but without success. Does anyone have an idea? Thanks for reading!
library(dplyr)
Data_A %>%
left_join(Data_B, by = c('Code', 'year'))
Code year score l.score
1 A 1991 1 NA
2 A 1992 2 1
3 A 1993 3 2
4 B 1991 3 NA
5 B 1993 7 NA
Your own solution seems to work for me (I am using R 3.6.1), so I am not sure why it fails for you:
> merge(Data_A, Data_B, all = FALSE)
Code year score X1.score
1 A 1991 1 NA
2 A 1992 2 1
3 A 1993 3 2
4 B 1991 3 NA
5 B 1993 7 NA
DATA
Data_A <- structure(list(Code = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), year = c(1991L, 1992L, 1993L, 1991L,
1993L), score = c(1L, 2L, 3L, 3L, 7L)), class = "data.frame", row.names = c(NA,
-5L))
Data_B <- structure(list(Code = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("A", "B"), class = "factor"), year = c(1991L,
1992L, 1993L, 1994L, 1991L, 1992L, 1993L, 1994L), X1.score = c(NA,
1L, 2L, 3L, NA, 3L, NA, 7L)), class = "data.frame", row.names = c(NA,
-8L))
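The same left join can be written in base R by naming the keys explicitly and setting all.x = TRUE, which keeps every row of Data_A even when Data_B has no match. A minimal sketch, assuming the two data frames are rebuilt inline with the l.score column name shown in the question:

```r
# Illustrative copies of the question's data
Data_A <- data.frame(
  Code  = c("A", "A", "A", "B", "B"),
  year  = c(1991L, 1992L, 1993L, 1991L, 1993L),
  score = c(1L, 2L, 3L, 3L, 7L)
)
Data_B <- data.frame(
  Code    = rep(c("A", "B"), each = 4),
  year    = rep(1991:1994, times = 2),
  l.score = c(NA, 1L, 2L, 3L, NA, 3L, NA, 7L)
)

# Left join: all rows of Data_A, matching columns of Data_B where available
merged <- merge(Data_A, Data_B, by = c("Code", "year"), all.x = TRUE)
merged
```

With all = FALSE (the default) merge() performs an inner join instead. The two give the same rows here only because every Code/year pair in Data_A also exists in Data_B.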
I want to group the data so that each group holds measurements taken in multiple years (Year) on the same Species at the same Lat and Long. Then I want to run a linear regression on each of those groups (using N as the dependent variable and Year as the independent variable).
Practice dataset:
Species Year Lat Long N
1 1 1999 1 1 5
2 1 2001 2 1 5
3 2 2010 3 3 4
4 2 2010 3 3 2
5 2 2011 3 3 5
6 2 2012 3 3 8
7 3 2007 8 7 -10
8 3 2019 8 7 100
9 2 2000 1 1 5
First, I averaged the data where multiple measurements were made in the same Year on the same Species at the same latitude and longitude. Then I split the data based on Lat, Long and Species. However, this still groups together rows where Lat, Long and Species are not all equal ($ '4'). Furthermore, I want to remove $'1', since I only want to use data where measurements were made in more than one Year. How do I do this?
Data <- read.table("Dataset.txt", header = TRUE)
Agr_Data <- aggregate(N ~ Lat + Long + Year + Species, data = Data, mean)
Split_Data <- split(Agr_Data, Agr_Data$Lat + Agr_Data$Long + Agr_Data$Species)
Regression_Data <- lapply(Split_Data, function(Split_Data) lm(N~Year, data = Split_Data) )
Split_Data
$`3`
Lat Long Year Species N
1 1 1 1999 1 5
$`4`
Lat Long Year Species N
2 2 1 2001 1 5
3 1 1 2000 2 5
$`8`
Lat Long Year Species N
4 3 3 2010 2 3
5 3 3 2011 2 5
6 3 3 2012 2 8
$`18`
Lat Long Year Species N
7 8 7 2007 3 -10
8 8 7 2019 3 100
Desired output:
Lat Long Species Coefficients
3 3 2 2.5
8 7 3 9.167
Base R solution:
# 1. Import data:
df <- structure(list(Species = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 2L ),
Year = c(1999L, 2001L, 2010L, 2010L, 2011L, 2012L, 2007L, 2019L, 2000L),
Lat = c(1L, 2L, 3L, 3L, 3L, 3L, 8L, 8L, 1L),
Long = c(1L, 1L, 3L, 3L, 3L, 3L, 7L, 7L, 1L),
N = c(5L, 5L, 4L, 2L, 5L, 8L, -10L, 100L, 5L)),
class = "data.frame", row.names = c(NA, -9L ))
# 2. Aggregate data:
df <- aggregate(N ~ Lat + Long + Year + Species, data = df, mean)
# 3. Concatenate vecs to create grouping vec:
df$grouping_var <- paste(df$Species, df$Lat, df$Long, sep = ", ")
# 4. Split-apply-combine: slope of N on Year per group (NA when a group has a single row):
coeff_n <- sapply(split(df, df$grouping_var), function(x) {
  if (nrow(x) > 1) coef(lm(N ~ Year, data = x))[["Year"]] else NA_real_
})
# 5. Create a data frame of slopes, keyed by the group names returned by split():
coeff_df <- data.frame(grouping_var = names(coeff_n), coeff_n = unname(coeff_n))
# 6. Merge the dataframes together:
df <- merge(df, coeff_df, by = "grouping_var", all.x = TRUE)
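To get a table shaped like the desired output, with one row per Lat/Long/Species group and only groups observed in more than one year, the groups can be filtered before fitting. This is a sketch under the assumption that the coefficient of interest is the slope on Year; the names obs, agg, groups and res are illustrative.

```r
# Practice data from the question, rebuilt inline
obs <- data.frame(
  Species = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 2L),
  Year    = c(1999L, 2001L, 2010L, 2010L, 2011L, 2012L, 2007L, 2019L, 2000L),
  Lat     = c(1L, 2L, 3L, 3L, 3L, 3L, 8L, 8L, 1L),
  Long    = c(1L, 1L, 3L, 3L, 3L, 3L, 7L, 7L, 1L),
  N       = c(5L, 5L, 4L, 2L, 5L, 8L, -10L, 100L, 5L)
)

# Average repeated measurements taken in the same year at the same place
agg <- aggregate(N ~ Lat + Long + Year + Species, data = obs, FUN = mean)

# Split on the combination of the three keys (not on their sum)
groups <- split(agg, list(agg$Lat, agg$Long, agg$Species), drop = TRUE)

# Keep only groups measured in more than one year
groups <- Filter(function(g) length(unique(g$Year)) > 1, groups)

# Fit N ~ Year per group and collect the slope
res <- do.call(rbind, lapply(groups, function(g) {
  data.frame(Lat = g$Lat[1], Long = g$Long[1], Species = g$Species[1],
             Coefficients = coef(lm(N ~ Year, data = g))[["Year"]])
}))
res <- res[order(res$Species, res$Lat, res$Long), ]
rownames(res) <- NULL
res
```

Passing a list of columns to split() groups by their combinations, which avoids the collisions that come from adding Lat, Long and Species together.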
I have a dataset that looks something like this:
age Year f.pop f.dc
1 1990 0 1
5 2001 200 4
1 1990 400 2
1 2001 50 3
5 2001 0 3
I want it to look like this:
age Year f.pop f.dc
1 1990 400 1
5 2001 200 4
1 1990 400 2
1 2001 50 3
5 2001 200 3
Basically, I want to replace zero values in the f.pop column of my dataset with f.pop values of rows that match in two other columns (Year and age). The f.dc column is largely irrelevant to this question, but I want to emphasize that these rows are not identical and must remain separate.
Here's my attempt:
for (i in 1:length(usbd$f.pop)) {
if (usbd$f.pop[i] == 0) {
iage = usbd$age[i]
iyear = usbd$Year[i]
index = which(usbd$age == iage & usbd$Year == iyear)
usbd$f.pop[i] = usbd$f.pop[index] }}
But this is incredibly slow. There must be a more efficient way.
Conditional replacement of values in a data.frame is useful but I'm not sure how to apply this to two conditions with potentially different indices.
We could use data.table to replace the 0 values in f.pop, assuming that each age/Year group has a single non-zero f.pop value. Convert the data.frame to a data.table with setDT(df1), group by age and Year (by = .(age, Year)), and assign the group's non-zero value to f.pop (f.pop := f.pop[f.pop != 0]).
library(data.table)
setDT(df1)[, f.pop:= f.pop[f.pop!=0] , by = .(age, Year)]
df1
# age Year f.pop f.dc
#1: 1 1990 400 1
#2: 5 2001 200 4
#3: 1 1990 400 2
#4: 1 2001 50 3
#5: 5 2001 200 3
data
df1 <- structure(list(age = c(1L, 5L, 1L, 1L, 5L), Year = c(1990L, 2001L,
1990L, 2001L, 2001L), f.pop = c(0L, 200L, 400L, 50L, 0L), f.dc = c(1L,
4L, 2L, 3L, 3L)), .Names = c("age", "Year", "f.pop", "f.dc"),
class = "data.frame", row.names = c(NA, -5L))
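For comparison, a base R version of the same group-wise fill can be sketched with ave(). It makes the same assumption that each age/Year group has one non-zero f.pop value, and it leaves a group unchanged when all of its values are zero. The data frame name usbd follows the question; the anonymous helper is illustrative.

```r
# Data from the question, rebuilt inline
usbd <- data.frame(
  age   = c(1L, 5L, 1L, 1L, 5L),
  Year  = c(1990L, 2001L, 1990L, 2001L, 2001L),
  f.pop = c(0L, 200L, 400L, 50L, 0L),
  f.dc  = c(1L, 4L, 2L, 3L, 3L)
)

# Within each age/Year group, spread the first non-zero value over the whole group
usbd$f.pop <- ave(usbd$f.pop, usbd$age, usbd$Year, FUN = function(v) {
  if (any(v != 0)) rep(v[v != 0][1], length(v)) else v
})
usbd
```

Because ave() works on whole groups at once, it avoids the row-by-row which() lookup in the original loop.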