How do I combine dataframes of unequal length based on a condition - r

I would like to know how to best combine the two following dataframes:
df1 <- data.frame(Date = c(1,2,3,4,5,6,7,8,9,10),
Altitude=c(100,101,101,102,103,99,98,99,89,70))
> df1
Date Altitude
1 1 100
2 2 101
3 3 101
4 4 102
5 5 103
6 6 99
7 7 98
8 8 99
9 9 89
10 10 70
df2 <- data.frame(Start = c(1,4,8),Stop = c(3,7,10),Longitude=c(10,12,13))
> df2
Start Stop Longitude
1 1 3 10
2 4 7 12
3 8 10 13
I basically need a third column in df1, holding the Longitude from df2 for the row where Date falls between Start and Stop, resulting in something like this:
Date Altitude Longitude
1 1 100 10
2 2 101 10
3 3 101 10
4 4 102 12
5 5 103 12
6 6 99 12
7 7 98 12
8 8 99 13
9 9 89 13
10 10 70 13
I've been trying all kinds of subsetting, filtering, ... but I just can't figure it out. Any help would be appreciated!
Kind regards

An idea via dplyr is to complete the start:stop sequence, unnest and merge, i.e.
library(dplyr)
df2 %>%
  mutate(Date = Map(seq, Start, Stop)) %>%  # list column: one sequence per interval
  tidyr::unnest(Date) %>%                   # one row per Date
  select(-Start, -Stop) %>%
  right_join(df1, by = 'Date')
which gives,
Longitude Date Altitude
1 10 1 100
2 10 2 101
3 10 3 101
4 12 4 102
5 12 5 103
6 12 6 99
7 12 7 98
8 13 8 99
9 13 9 89
10 13 10 70
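The same expand-then-join idea can be sketched in base R, for anyone who would rather not depend on tidyr. This is only an illustration, and it assumes Start and Stop are whole numbers so that seq() enumerates every possible Date; the names expanded and result are made up for the example.

```r
df1 <- data.frame(Date = c(1,2,3,4,5,6,7,8,9,10),
                  Altitude = c(100,101,101,102,103,99,98,99,89,70))
df2 <- data.frame(Start = c(1,4,8), Stop = c(3,7,10), Longitude = c(10,12,13))

# Expand every Start:Stop interval into one row per Date
expanded <- do.call(rbind, Map(
  function(from, to, lon) data.frame(Date = seq(from, to), Longitude = lon),
  df2$Start, df2$Stop, df2$Longitude
))

# Left join: keep every row of df1, with NA where no interval covers the Date
result <- merge(df1, expanded, by = "Date", all.x = TRUE)
result
```

Expanding the intervals is simple but creates one row per covered Date, so it is best suited to short ranges.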

Here is a tidyverse answer using the group_by and group_modify functions in the dplyr package (introduced in version 0.8.1 in May 2019).
library(dplyr)
df1 %>%
  group_by(Date, Altitude) %>%
  group_modify(~ data.frame(df2 %>%
                   filter(.x$Date >= Start, .x$Date <= Stop)) %>%
                 select(Longitude),
               .keep = TRUE)  # spelled `keep` in dplyr 0.8.x; keeps Date available in .x
For each unique combination in df1 of date and altitude (i.e. for each row), this finds the longitude corresponding to the date range in df2.
The output is a tibble:
# A tibble: 10 x 3
# Groups: Date, Altitude [10]
Date Altitude Longitude
<dbl> <dbl> <dbl>
1 1 100 10
2 2 101 10
3 3 101 10
4 4 102 12
5 5 103 12
6 6 99 12
7 7 98 12
8 8 99 13
9 9 89 13
10 10 70 13

Base R solution:
ind <- apply(df2, 1, function(x) which(df1$Date >= x[1] & df1$Date <= x[2]))
df1$Longitude <- unlist(Map(function(x,y) rep(y, length(x)), ind, df2$Longitude))
Output
Date Altitude Longitude
1 1 100 10
2 2 101 10
3 3 101 10
4 4 102 12
5 5 103 12
6 6 99 12
7 7 98 12
8 8 99 13
9 9 89 13
10 10 70 13
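If the intervals in df2 do not overlap, a lookup with findInterval() is another base R option that does not depend on the row order of df1. The following is a rough sketch under that non-overlap assumption; idx is just a helper name chosen for the example.

```r
df1 <- data.frame(Date = c(1,2,3,4,5,6,7,8,9,10),
                  Altitude = c(100,101,101,102,103,99,98,99,89,70))
df2 <- data.frame(Start = c(1,4,8), Stop = c(3,7,10), Longitude = c(10,12,13))

# findInterval() needs increasing break points, so sort the intervals by Start
df2 <- df2[order(df2$Start), ]

# For each Date, the index of the last interval whose Start is <= Date
idx <- findInterval(df1$Date, df2$Start)
idx[idx == 0] <- NA                                   # Date lies before the first Start

# A Date past the matching Stop falls in a gap between intervals: no match
idx[!is.na(idx) & df1$Date > df2$Stop[idx]] <- NA

df1$Longitude <- df2$Longitude[idx]
df1
```

Dates that fall outside every interval end up with NA rather than shifting the remaining values out of alignment.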

Related

Filling the missing values within each id in r

I have a dataframe having some rows missing value. Here is a sample dataframe:
df <- data.frame(id = c(1,1,1, 2,2,2, 3,3,3),
item = c(11,12,13, 24,25,26, 56,45,56),
score = c(5,5, NA, 6,6,6, 7,NA, 7))
> df
id item score
1 1 11 5
2 1 12 5
3 1 13 NA
4 2 24 6
5 2 25 6
6 2 26 6
7 3 56 7
8 3 45 NA
9 3 56 7
Grouping the dataset by the id column, I would like to fill those NA values with the score shared by the rest of the group.
The desired output is:
> df
id item score
1 1 11 5
2 1 12 5
3 1 13 5
4 2 24 6
5 2 25 6
6 2 26 6
7 3 56 7
8 3 45 7
9 3 56 7
Any ideas?
Thanks!
We can group by 'id' and fill
library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
fill(score, .direction = "downup") %>%
  ungroup()
Here is another option with base R
> transform(df, score = ave(score, id, FUN = function(x) mean(x, na.rm = TRUE)))
id item score
1 1 11 5
2 1 12 5
3 1 13 5
4 2 24 6
5 2 25 6
6 2 26 6
7 3 56 7
8 3 45 7
9 3 56 7
Another option is to create your own function, e.g.:
fill.in <- function(dataf) {
  dataf2 <- data.frame()
  ids <- unique(dataf$id)
  for (i in seq_along(ids)) {
    # rows for one id, with every score set to the group's non-missing maximum
    dataf1 <- subset(dataf, id %in% ids[i])
    dataf1$score <- max(dataf1$score, na.rm = TRUE)
    dataf2 <- rbind(dataf2, dataf1)
  }
  return(dataf2)
}
fill.in(df)
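The mean() and max() variants above overwrite every score in a group, which is only harmless because all observed scores within an id are equal here. As a hedged alternative, the sketch below touches only the NA entries and fills them with the first non-missing score of the same id; fill_first is a hypothetical helper name.

```r
df <- data.frame(id    = c(1,1,1, 2,2,2, 3,3,3),
                 item  = c(11,12,13, 24,25,26, 56,45,56),
                 score = c(5,5,NA, 6,6,6, 7,NA,7))

# Replace each NA with the first non-missing score of its id;
# a group with no observed score at all simply stays NA
fill_first <- function(s) {
  s[is.na(s)] <- s[!is.na(s)][1]
  s
}

df$score <- ave(df$score, df$id, FUN = fill_first)
df
```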

Sum column over specific rownumbers in grouped dataframe in R

I have a dataframe like this:
library(dplyr)
df = data.frame(
  x = 1:100,
  y = rep(1:10, each = 10)  # 10 groups of 10 rows each
) %>%
  group_by(y)
And I would like to compute the sum of x from the 3rd to the 6th row of each group of y.
I think this should be easy, but I just cannot figure it out at the moment.
In pseudocode I imagine something like this:
df %>%
mutate(
sum(x, ifelse(between(row_number(), 3,6)))
)
This of course does not work. I would like to solve it with a dplyr function, but I cannot think of a fast solution in base R either.
For the first group the sum would be 3 + 4 + 5 + 6 = 18.
One option could be:
df %>%
group_by(y) %>%
mutate(z = sum(x[row_number() %in% 3:6]))
x y z
<int> <int> <int>
1 1 1 18
2 2 1 18
3 3 1 18
4 4 1 18
5 5 1 18
6 6 1 18
7 7 1 18
8 8 1 18
9 9 1 18
10 10 1 18
You could also do this with filter() and summarise() and obtain a group-wise summary:
df %>%
group_by(y) %>%
mutate(rn = 1:n()) %>%
filter(rn %in% 3:6) %>%
summarise(x_sum = sum(x))
# A tibble: 10 x 2
y x_sum
<int> <int>
1 1 18
2 2 58
3 3 98
4 4 138
5 5 178
6 6 218
7 7 258
8 8 298
9 9 338
10 10 378
Update: If you want to sum multiple sequences from x then you can sum by index:
df %>%
group_by(y) %>%
mutate(sum_row3to6 = sum(x[3:6]),
sum_row1to4 = sum(x[1:4])
)
Output:
x y sum_row3to6 sum_row1to4
<int> <int> <int> <int>
1 1 1 18 10
2 2 1 18 10
3 3 1 18 10
4 4 1 18 10
5 5 1 18 10
6 6 1 18 10
7 7 1 18 10
8 8 1 18 10
9 9 1 18 10
10 10 1 18 10
First answer:
We could use slice() followed by summarise():
library(dplyr)
df %>%
group_by(y) %>%
slice(3:6) %>%
summarise(sum = sum(x))
Output:
y sum
<int> <int>
1 1 18
2 2 58
3 3 98
4 4 138
5 5 178
6 6 218
7 7 258
8 8 298
9 9 338
10 10 378
With data.table:
library(data.table)
df = data.frame(
  x = 1:100,
  y = rep(1:10, each = 10)
)
setDT(df)[rowid(y) %in% 3:6, list(sum_x = sum(x)), by = y][]
#> y sum_x
#> 1: 1 18
#> 2: 2 58
#> 3: 3 98
#> 4: 4 138
#> 5: 5 178
#> 6: 6 218
#> 7: 7 258
#> 8: 8 298
#> 9: 9 338
#> 10: 10 378
Created on 2021-05-21 by the reprex package (v2.0.0)
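Since the question also asks about base R, here is a minimal sketch with tapply(). It assumes the rows are already in the desired order within each group; rows and sums are names chosen for the example, and intersect() guards against groups with fewer than six rows.

```r
df <- data.frame(x = 1:100,
                 y = rep(1:10, each = 10))

rows <- 3:6  # positions within each group to sum over

# For every value of y, sum the x values found at positions 3 to 6 of that group
sums <- tapply(df$x, df$y,
               function(v) sum(v[intersect(rows, seq_along(v))]))
sums
```

The result is a named vector with one sum per group, comparable to the summarise() output above.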

add column to dataframes from 1 to unique length of existing grouped rows

Here is my example df:
df = read.table(text = 'colA
22
22
22
45
45
11
11
87
90
110
32
32', header = TRUE)
I just need to add a new col based on colA with values from 1 to the unique length of colA.
Expected output:
colA newCol
22 1
22 1
22 1
45 2
45 2
11 3
11 3
87 4
90 5
110 6
32 7
32 7
Here is what I tried without success:
library(dplyr)
new_df = df %>%
group_by(colA) %>%
mutate(newCol = seq(1, length(unique(df$colA)), by = 1))
Thanks
In base R, you can increment a counter each time the value of colA changes:
newcol = c(1, 1+cumsum(diff(df$colA) != 0))
newcol
[1] 1 1 1 2 2 3 3 4 5 6 7 7
The dplyr package has a function to get indices of group:
df$newcol = group_indices(df,colA)
This returns:
colA newcol
1 22 2
2 22 2
3 22 2
4 45 4
5 45 4
6 11 1
7 11 1
8 87 5
9 90 6
10 110 7
11 32 3
12 32 3
Note, however, that these indices follow the sorted order of the colA values, not their order of appearance.
You can also do it using factor:
df$newcol = as.numeric(factor(df$colA,levels=unique(df$colA)))
Another option: You can capitalize on the fact that factors are associated with underlying integers. First create a new factor variable with the same levels as the column, then transform it to numeric.
newCol <- factor(df$colA,
levels = unique(df$colA))
df$newCol <- as.numeric(newCol)
df
colA newCol
1 22 1
2 22 1
3 22 1
4 45 2
5 45 2
6 11 3
7 11 3
8 87 4
9 90 5
10 110 6
11 32 7
12 32 7
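A closely related sketch, shown here only as an illustration, skips the factor and uses match() against the unique values directly. Like the factor approach, it gives the same index to a value that reappears later, whereas the cumsum(diff()) approach would start a new index.

```r
df <- read.table(text = 'colA
22
22
22
45
45
11
11
87
90
110
32
32', header = TRUE)

# Position of each value in the vector of unique values, in order of first appearance
df$newCol <- match(df$colA, unique(df$colA))
df
```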

Sum of group but keep the same value for each row in r

I have a data frame and want to create new variables holding the sum within each ID and Group. A normal grouped sum reduces the number of rows; in my case I need to keep every row and repeat the group sum on each one.
ID <- c(rep(1,3), rep(3, 5), rep(4,4))
Group <-c(1,1,2,1,1,1,2,2,1,1,1,2)
x <- c(1:12)
y<- c(12:23)
df <- data.frame(ID,Group,x,y)
ID Group x y
1 1 1 1 12
2 1 1 2 13
3 1 2 3 14
4 3 1 4 15
5 3 1 5 16
6 3 1 6 17
7 3 2 7 18
8 3 2 8 19
9 4 1 9 20
10 4 1 10 21
11 4 1 11 22
12 4 2 12 23
The desired output has two more variables, "sumx" and "sumy", computed within each (ID, Group) combination:
ID Group x y sumx sumy
1 1 1 1 12 3 25
2 1 1 2 13 3 25
3 1 2 3 14 3 14
4 3 1 4 15 15 48
5 3 1 5 16 15 48
6 3 1 6 17 15 48
7 3 2 7 18 15 37
8 3 2 8 19 15 37
9 4 1 9 20 30 63
10 4 1 10 21 30 63
11 4 1 11 22 30 63
12 4 2 12 23 12 23
Any idea?
As short as:
df$sumx <- with(df,ave(x,ID,Group,FUN = sum))
df$sumy <- with(df,ave(y,ID,Group,FUN = sum))
We can use dplyr
library(dplyr)
df %>%
  group_by(ID, Group) %>%
  # across() needs dplyr >= 1.0.0; it replaces the deprecated mutate_each(funs(sum))
  mutate(across(c(x, y), sum, .names = "sum{.col}")) %>%
  ungroup()
If there are only two columns to sum, then
df %>%
group_by(ID, Group) %>%
mutate(sumx = sum(x), sumy = sum(y))
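For a running total within each group, use cumsum() inside a grouped mutate(); add further columns to mutate() as needed. Note that cumsum() accumulates row by row, so only the last row of each group holds the group sum that the question asks for. The names data12, Category, and GrossMarginRs come from the answerer's own data: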
library(dplyr)
data13 <- data12 %>%
group_by(Category) %>%
mutate(cum_Cat_GMR = cumsum(GrossMarginRs))
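To make the difference between sum and cumsum concrete, here is a small base R sketch on the question's own data. The column name cumx is made up for the comparison and is not something the question asks for.

```r
ID    <- c(rep(1, 3), rep(3, 5), rep(4, 4))
Group <- c(1,1,2,1,1,1,2,2,1,1,1,2)
x     <- 1:12
df    <- data.frame(ID, Group, x)

# Group total repeated on every row (what the question asks for)
df$sumx <- ave(df$x, df$ID, df$Group, FUN = sum)

# Running total within each group (what cumsum() produces instead)
df$cumx <- ave(df$x, df$ID, df$Group, FUN = cumsum)
df
```

Only on the last row of each (ID, Group) combination do sumx and cumx agree.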

Group Data in R for consecutive rows

If there's not a quick 1-3 liner for this in R, I'll definitely just use linux sort and a short python program using groupby, so don't bend over backwards trying to get something crazy working. Here's the input data frame:
df_in <- data.frame(
ID = c(1,1,1,1,1,2,2,2,2,2),
weight = c(150,150,151,150,150,170,170,170,171,171),
start_day = c(1,4,7,10,11,5,10,15,20,25),
end_day = c(4,7,10,11,30,10,15,20,25,30)
)
ID weight start_day end_day
1 1 150 1 4
2 1 150 4 7
3 1 151 7 10
4 1 150 10 11
5 1 150 11 30
6 2 170 5 10
7 2 170 10 15
8 2 170 15 20
9 2 171 20 25
10 2 171 25 30
I would like to do some basic aggregation by ID and weight, but only when the group is in consecutive rows of df_in. Specifically, the desired output is
df_desired_out <- data.frame(
ID = c(1,1,1,2,2),
weight = c(150,151,150,170,171),
min_day = c(1,7,10,5,20),
max_day = c(7,10,30,20,30)
)
ID weight min_day max_day
1 1 150 1 7
2 1 151 7 10
3 1 150 10 30
4 2 170 5 20
5 2 171 20 30
This question seems to be extremely close to what I want, but I'm having lots of trouble adapting it for some reason.
In dplyr, I would do this by creating another grouping variable for the consecutive rows. This is what cumsum(c(1, diff(weight) != 0)) does in the code chunk below: the counter goes up by one every time weight differs from the previous row. An example of this is also here.
The group creation can be done within group_by, and then you can proceed accordingly with making any summaries by group.
library(dplyr)
df_in %>%
group_by(ID, group_weight = cumsum(c(1, diff(weight) != 0)), weight) %>%
summarise(start_day = min(start_day), end_day = max(end_day))
Source: local data frame [5 x 5]
Groups: ID, group_weight [?]
ID group_weight weight start_day end_day
(dbl) (dbl) (dbl) (dbl) (dbl)
1 1 1 150 1 7
2 1 2 151 7 10
3 1 3 150 10 30
4 2 4 170 5 20
5 2 5 171 20 30
This approach does leave you with the extra grouping variable in the dataset, which can be removed, if needed, with select(-group_weight) after ungrouping.
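For comparison, the same run-counter idea can be sketched in base R. The version below is only an illustration: it starts a new run whenever either ID or weight changes, and the names change, run, and out are chosen for the example.

```r
df_in <- data.frame(
  ID        = c(1,1,1,1,1,2,2,2,2,2),
  weight    = c(150,150,151,150,150,170,170,170,171,171),
  start_day = c(1,4,7,10,11,5,10,15,20,25),
  end_day   = c(4,7,10,11,30,10,15,20,25,30)
)

# TRUE wherever a row starts a new run (ID or weight differs from the previous row)
change <- c(TRUE, diff(df_in$ID) != 0 | diff(df_in$weight) != 0)
df_in$run <- cumsum(change)

# Summarise each run of consecutive rows
out <- do.call(rbind, lapply(split(df_in, df_in$run), function(g) {
  data.frame(ID = g$ID[1], weight = g$weight[1],
             min_day = min(g$start_day), max_day = max(g$end_day))
}))
rownames(out) <- NULL
out
```

Including ID in the change test keeps two adjacent IDs with the same weight from being merged into one run.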
First we combine ID and weight. The quick-and-dirty way is using paste:
df_in$id_weight <- paste(df_in$ID, df_in$weight, sep='_')
df_in
ID weight start_day end_day id_weight
1 1 150 1 4 1_150
2 1 150 4 7 1_150
3 1 151 7 10 1_151
4 1 150 10 11 1_150
5 1 150 11 30 1_150
6 2 170 5 10 2_170
7 2 170 10 15 2_170
8 2 170 15 20 2_170
9 2 171 20 25 2_171
10 2 171 25 30 2_171
A safer way is to use interaction or group_indices; see "Combine values in 4 columns to a single unique value".
We can group consecutively using rle.
rlel <- rle(df_in$id_weight)$lengths
df_in$group <- rep(seq_along(rlel), times = rlel)
df_in
ID weight start_day end_day id_weight group
1 1 150 1 4 1_150 1
2 1 150 4 7 1_150 1
3 1 151 7 10 1_151 2
4 1 150 10 11 1_150 3
5 1 150 11 30 1_150 3
6 2 170 5 10 2_170 4
7 2 170 10 15 2_170 4
8 2 170 15 20 2_170 4
9 2 171 20 25 2_171 5
10 2 171 25 30 2_171 5
Now with the convenient group number we can summarize by group.
library(dplyr)
df_in %>%
group_by(group) %>%
summarize(id_weight = id_weight[1],
start_day = min(start_day),
end_day = max(end_day))
# A tibble: 5 x 4
group id_weight start_day end_day
<int> <chr> <dbl> <dbl>
1 1 1_150 1 7
2 2 1_151 7 10
3 3 1_150 10 30
4 4 2_170 5 20
5 5 2_171 20 30
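A base R attempt with aggregate() is shown below. Note that it groups by ID and weight only, without regard to whether rows are consecutive, and that it refers to a column named day, which df_in as posted above does not contain:
with(df_in, {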
aggregate(day, list('ID'=ID, 'weight'=weight),
function(x) c('min_day' = min(x), 'max_day' = max(x)))
})
Produces:
ID weight x.min_day x.max_day
1 1 150 1 5
2 1 151 3 3
3 2 170 1 3
4 2 171 4 5
