I have a dataframe:
set.seed(42)
ID <- sample(1:15, 100, replace = TRUE)
value <- sample(1:4, 100, replace = TRUE)
d <- data.frame(ID, value)
I want to group by ID and create a new column where each value is subtracted from all the others within the group. Just as sum() adds all of these values into a single column, how do I subtract?
library(dplyr)
d %>%
group_by(ID) %>%
# what's the - equivalent!
mutate(value_c = sub(value))
Thanks
J
Well, it's a somewhat odd calculation, but slightly to my own surprise, the following seems to do what you describe:
set.seed(42)
ID <- sample(1:15, 100, replace = TRUE)
value <- sample(1:4, 100, replace = TRUE)
d <- data.frame(ID, value)
library(dplyr)
d %>% group_by( ID ) %>%
mutate(
value_c = value*2 - sum(value)
) %>%
arrange( ID ) %>%
head( n=20 )
Produces:
# A tibble: 20 x 3
# Groups: ID [3]
ID value value_c
<int> <int> <dbl>
1 1 1 -12
2 1 1 -12
3 1 4 -6
4 1 1 -12
5 1 1 -12
6 1 2 -10
7 1 4 -6
8 2 4 -21
9 2 3 -23
10 2 3 -23
11 2 2 -25
12 2 1 -27
13 2 1 -27
14 2 3 -23
15 2 3 -23
16 2 1 -27
17 2 4 -21
18 2 4 -21
19 3 4 -8
20 3 4 -8
You multiply value by 2 because value itself is included in sum(value), which you didn't want; doubling it on the left-hand side cancels that out, so value_c ends up being each value minus the sum of the other values in the group.
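If it helps to see that subtraction written out, the same result can be expressed without the doubling (just a rephrasing of the formula above, still within the dplyr pipeline):
d %>% group_by( ID ) %>%
mutate(
# each value minus the sum of the other values in its group
value_c = value - (sum(value) - value)
)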
Here is a base R option using ave
transform(
d,
value_c = 2*value - ave(value,ID,FUN = sum)
)
An option with data.table
library(data.table)
setDT(d)[, value_c := 2 * value - sum(value), ID]
Related
I have a data frame containing numbers that I would like to bin according to their absolute value.
library(tidyverse)
dat <- data.frame(val = seq(-10, 10))
The following command accomplishes what I would like to do, but the values are hardcoded, which I need to avoid:
dat %>%
mutate(grp = case_when(abs(val) <= 5 ~ "Grp 1",
abs(val) <= 7 ~ "Grp 2",
TRUE ~ "Grp 3"))
How can I accomplish the same transformation, but instead using a named vector as the input:
grps <- c("Grp 1" = 5, "Grp 2" = 7)
So that I can add/remove groups as needed, for example, adding in "Grp 3" = 9?
Instead of using map or something that works element-by-element, we can do it vectorized with cut():
grps <- c("Grp 1" = 5, "Grp 2" = 7)
dat %>%
mutate(
grp = cut(abs(val), c(-Inf, grps, Inf), labels = c(names(grps), "Grp 3"))
)
# val grp
# 1 -10 Grp 3
# 2 -9 Grp 3
# 3 -8 Grp 3
# 4 -7 Grp 2
# 5 -6 Grp 2
# 6 -5 Grp 1
# 7 -4 Grp 1
# 8 -3 Grp 1
# 9 -2 Grp 1
# 10 -1 Grp 1
# 11 0 Grp 1
# 12 1 Grp 1
# 13 2 Grp 1
# 14 3 Grp 1
# 15 4 Grp 1
# 16 5 Grp 1
# 17 6 Grp 2
# 18 7 Grp 2
# 19 8 Grp 3
# 20 9 Grp 3
# 21 10 Grp 3
Note that grp is a factor; if you want a character column, just wrap the cut() call in as.character().
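For example, extending the named vector with "Grp 3" = 9 as you describe works with the same call, assuming the final catch-all label simply follows the pattern (here "Grp 4"):
grps <- c("Grp 1" = 5, "Grp 2" = 7, "Grp 3" = 9)
dat %>%
mutate(
grp = cut(abs(val), c(-Inf, grps, Inf), labels = c(names(grps), "Grp 4"))
)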
I want to create a cumulative sum by id, but it should not include the value of the row for which it is being calculated.
I've already tried cumsum(). However, I do not know how to tell it not to add the amount of the current row. The result column I am looking for is the third column, called "sum".
For example, for id 1 the first row has sum = 0, because that row itself should not be added. For id 1 and row 2, sum = 100 because the amount for id 1 before row 2 was 100, and so on.
id amount sum
1: 1 100 0
2: 1 20 100
3: 1 150 120
4: 2 60 0
5: 2 100 60
6: 1 30 270
7: 2 40 160
This is what I've tried:
library(data.table)
df[, sum := cumsum(amount), by = "id"]
data:
df <- data.table(id = c(1, 1, 1, 2, 2, 1, 2),
                 amount = c(100, 20, 150, 60, 100, 30, 40),
                 sum = c(0, 100, 120, 0, 60, 270, 160),
                 stringsAsFactors = FALSE)
You can do this without using lag:
df %>%
group_by(id) %>%
mutate(sum = cumsum(amount) - amount)
# A tibble: 7 x 3
# Groups: id [2]
id amount sum
<dbl> <dbl> <dbl>
#1 1 100 0
#2 1 20 100
#3 1 150 120
#4 2 60 0
#5 2 100 60
#6 1 30 270
#7 2 40 160
With dplyr -
df %>%
group_by(id) %>%
mutate(sum = lag(cumsum(amount), default = 0)) %>%
ungroup()
# A tibble: 7 x 3
id amount sum
<dbl> <dbl> <dbl>
1 1 100 0
2 1 20 100
3 1 150 120
4 2 60 0
5 2 100 60
6 1 30 270
7 2 40 160
Thanks to @thelatemail, here's the data.table version -
df[, sum := cumsum(shift(amount, fill=0)), by=id]
Here is an option in base R
df$Sum <- with(df, ave(amount, id, FUN = cumsum) - amount)
df$Sum
#[1] 0 100 120 0 60 270 160
Or, within each group, drop the last observation, take the cumsum, and prepend 0:
with(df, ave(amount, id, FUN = function(x) c(0, cumsum(x[-length(x)]))))
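The same expression can be assigned to a column as well (the column name sum2 is just illustrative):
df$sum2 <- with(df, ave(amount, id, FUN = function(x) c(0, cumsum(x[-length(x)]))))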
You can shift the values you're summing by using the lag function.
library(tidyverse)
df <- data.frame(id = c(1, 1, 1, 2, 2,1,2), amount = c(100, 20,
150,60,100,30,40),sum=c(0,100,120,0,60,270,160) ,stringsAsFactors =
FALSE)
df %>%
group_by(id) %>%
mutate(sum = cumsum(lag(amount, 1, default=0)))
# A tibble: 7 x 3
# Groups: id [2]
id amount sum
<dbl> <dbl> <dbl>
1 1 100 0
2 1 20 100
3 1 150 120
4 2 60 0
5 2 100 60
6 1 30 270
7 2 40 160
I want to compute the minimum distance between the current row and every row before it within each group. My data frame has several groups, and each group has multiple dates with longitude and latitude. I use a Haversine function to compute distance, and I need to apply this function as described above. The data frame looks like the following:
grp date long lat rowid
1 1 1995-07-01 11 12 1
2 1 1995-07-05 3 0 2
3 1 1995-07-09 13 4 3
4 1 1995-07-13 4 25 4
5 2 1995-03-07 12 6 1
6 2 1995-03-10 3 27 2
7 2 1995-03-13 34 8 3
8 2 1995-03-16 25 9 4
My current attempt uses purrrlyr::by_row, but the method is too slow. In practice, each group has thousands of dates and geographic positions. Here is part of my current attempt:
calc_min_distance <- function(df, grp.name, row){
df %>%
filter(
grp_name==grp.name
) %>%
filter(
row_number() <= row
) %>%
mutate(
last.lat = last(lat),
last.long = last(long),
rowid = 1:n()
) %>%
group_by(rowid) %>%
purrrlyr::by_row(
~haversinedistance.fnct(.$last.long, .$last.lat, .$long, .$lat),
.collate='rows',
.to = 'min.distance'
) %>%
filter(
row_number() < n()
) %>%
summarise(
min = min(min.distance)
) %>%
.$min
}
df_dist <-
df %>%
group_by(grp_name) %>%
mutate(rowid = 1:n()) %>%
group_by(grp_name, rowid) %>%
purrrlyr::by_row(
~calc_min_distance(df, .$grp_name,.$rowid),
.collate='rows',
.to = 'min.distance'
) %>%
ungroup %>%
select(-rowid)
Suppose, for illustration, that distance is defined as (lat + long) of the reference row minus (lat + long) of each row with a smaller rowid. My expected output for grp 1 is then the following:
grp date long lat rowid min.distance
1 1 1995-07-01 11 12 1 0
2 1 1995-07-05 3 0 2 -20
3 1 1995-07-09 13 4 3 -6
4 1 1995-07-13 4 25 4 6
How can I quickly compute the minimum distance between the current rowid and all rowids before it?
Here's how I would go about it. You need to calculate all the within-group pairwise distances anyway, so we'll use geosphere::distm, which is designed to do just that. I'd suggest stepping through my function line by line and looking at what it does; I think it will make sense.
library(geosphere)
find_min_dist_above = function(long, lat, fun = distHaversine) {
  # full pairwise distance matrix for this group's coordinates
  d = distm(x = cbind(long, lat), fun = fun)
  # blank out the lower triangle and diagonal so column j only keeps
  # distances to rows that come before row j
  d[lower.tri(d, diag = TRUE)] = NA
  # the first row has no earlier rows; define its minimum distance as 0
  d[1, 1] = 0
  # column-wise minimum of the remaining distances
  return(apply(d, MAR = 2, min, na.rm = TRUE))
}
df %>% group_by(grp) %>%
mutate(min.distance = find_min_dist_above(long, lat))
# # A tibble: 8 x 6
# # Groups: grp [2]
# grp date long lat rowid min.distance
# <int> <fct> <int> <int> <int> <dbl>
# 1 1 1995-07-01 11 12 1 0
# 2 1 1995-07-05 3 0 2 1601842.
# 3 1 1995-07-09 13 4 3 917395.
# 4 1 1995-07-13 4 25 4 1623922.
# 5 2 1995-03-07 12 6 1 0
# 6 2 1995-03-10 3 27 2 2524759.
# 7 2 1995-03-13 34 8 3 2440596.
# 8 2 1995-03-16 25 9 4 997069.
Using this data:
df = read.table(text = ' grp date long lat rowid
1 1 1995-07-01 11 12 1
2 1 1995-07-05 3 0 2
3 1 1995-07-09 13 4 3
4 1 1995-07-13 4 25 4
5 2 1995-03-07 12 6 1
6 2 1995-03-10 3 27 2
7 2 1995-03-13 34 8 3
8 2 1995-03-16 25 9 4', h = TRUE)
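Since find_min_dist_above takes the distance function as its fun argument, another geosphere metric can presumably be swapped in without changing the rest of the pipeline, e.g. a sketch with distGeo:
# same grouped call as above, just a different pairwise distance function
df %>% group_by(grp) %>%
mutate(min.distance = find_min_dist_above(long, lat, fun = distGeo))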
I need help with R programming. I have a data.frame B with one column:
x <- c("300","300","300","400","400","400","500","500","500", ... etc.)  # 2 million rows
and I need to create a second column holding a within-group rank. That column should look like:
y <- c(1,2,3,1,2,3,1,2,3, ... etc.)
I used a for loop:
B$y[1] <- 1
for (i in 2:length(B$x)) {
  B$y[i] <- ifelse(B$x[i] == B$x[i-1], B$y[i-1] + 1, 1)
}
The process ran for 4 hours.
So I need help speeding this up, or with an alternative approach.
Thanks for your answer.
Here is a solution with base R:
B <- data.frame(x = rep(c(300, 400, 500), sample(5:10, 3)))
B
B$y <- ave(B$x, B$x, FUN = seq_along)  # number the rows within each value of x
Here's an approach with dplyr that takes about 0.2 seconds on 2 million rows.
First I make sample data:
n = 2E6 # number of rows in test
library(dplyr)
sample_data <- data.frame(
x = round(runif(n = n, min = 1, max = 100000), digits = 0)
) %>%
arrange(x) # Optional, added to make output clearer so that each x is adjacent to the others that match.
Then I group by x and make y show which occurrence of x each row is within that group.
sample_data_with_rank <- sample_data %>%
group_by(x) %>%
mutate(y = row_number()) %>%
ungroup()
head(sample_data_with_rank, 20)
# A tibble: 20 x 2
x y
<dbl> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
11 1 11
12 1 12
13 1 13
14 1 14
15 1 15
16 2 1
17 2 2
18 2 3
19 2 4
20 2 5
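If you are open to data.table (used elsewhere on this page), a grouped row counter is another fast option; a sketch, assuming B holds the x column from the question:
library(data.table)
setDT(B)[, y := seq_len(.N), by = x]  # .N is the number of rows in each group of x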
I wish to find the day on which the variable hb first falls below 90 for each id. I can find the day on which hb reaches its minimum:
f <- function(x) 1:length(x) <= which.min(x)
ind <- as.logical(ave(df$hb, df$id, FUN = f))
dfhb <- df[ind, ]
maxday <- dfhb %>% group_by(id) %>% summarise(daymax = last(day))
However, I can't get it to work for hb < 90:
f2 <- function(x) 1:length(x) <= which(x <= 90)
ind <- as.logical(ave(df$hb, df$id, FUN = f2))
dfhb <- df[ind, ]
maxday <- dfhb %>% group_by(id) %>% summarise(daymax = last(day))
summary(maxday$daymax)
I would be very grateful for your advice, bw Annemarie
id day hb
1 1 95
1 2 NA
1 3 91
1 4 89
2 1 98
2 2 87
2 3 84
3 1 89
3 2 92
3 3 89
Here is an idea for obtaining the first observation that satisfies your threshold:
df %>%
group_by(id) %>%
summarise(daymax = which(hb<90)[1])
# A tibble: 3 × 2
# id daymax
# <int> <int>
#1 1 4
#2 2 2
#3 3 1
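Note that which(hb < 90)[1] returns the row position within each group; in this data the days happen to be numbered 1, 2, 3, ..., so position and day coincide. If the days were not consecutive, you could index into day instead, e.g.:
df %>%
group_by(id) %>%
summarise(daymax = day[which(hb < 90)[1]])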