r-How to add column in r - r

I have data table
Name Score
A 5
A 6
B 9
B 1
B 0
...
I want to calculate and add a column 'FScore'=max score to this table
My expected result
Name Score Fscore
A 5 6
A 6 6
B 9 9
B 1 9
B 0 9
Thank.

We can use the base R option ave
df$Fscore <- ave(df$Score, df$Name, FUN = max)
df
# Name Score Fscore
#1 A 5 6
#2 A 6 6
#3 B 9 9
#4 B 1 9
#5 B 0 9

If you are trying to find the maximum score for each Name value, you can use data.table as below.
# example data
d <- data.table(Name = c("A", "A", "B", "B", "B"),
Score = c(5, 6, 9, 1, 0))
# find max for each Name and save the value in a new column, Fscore
d[ , Fscore := max(Score), by=Name]
Result:
> print(d)
Name Score Fscore
1: A 5 6
2: A 6 6
3: B 9 9
4: B 1 9
5: B 0 9

Another option using dplyr could be:
df = data.frame(Name = c('a', 'a', 'b','b','b'), Score = c(5,6,9,1,0))
df %>% group_by(Name) %>% mutate(Fscore = max(Score))
Source: local data frame [5 x 3]
Groups: Name [2]
Name Score FScore
<fctr> <dbl> <dbl>
1 a 5 6
2 a 6 6
3 b 9 9
4 b 1 9
5 b 0 9

Related

Filter groups based on difference two highest values

I have the following dataframe called df (dput below):
> df
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 B 8
6 B 2
7 B 2
8 B 3
9 C 10
10 C 1
11 C 1
12 C 8
I would like to filter groups based on the difference between their highest value (max) and second highest value. The difference should be smaller equal than 2 (<=2), this means that group B should be removed because the highest value is 8 and the second highest value is 3 which is a difference of 5. The desired output should look like this:
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
So I was wondering if anyone knows how to filter groups based on the difference between their highest and second-highest value?
dput of df:
df<-structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B",
"C", "C", "C", "C"), value = c(5, 1, 1, 5, 8, 2, 2, 3, 10, 1,
1, 8)), class = "data.frame", row.names = c(NA, -12L))
Using dplyr
library(dplyr)
df %>%
group_by(group) %>%
filter(abs(diff(sort(value, decreasing=T)[1:2])) <= 2) %>%
ungroup()
# A tibble: 8 × 2
group value
<chr> <int>
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
A base R alternative
grp <- na.omit(aggregate(. ~ group, df, function(x)
abs(diff(sort(x, decreasing=T)[1:2])) <= 2))
do.call(rbind, c(mapply(function(g, v)
list(df[df$group == g & v,]), grp$group, grp$value), make.row.names=F))
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
I possibility would be to first create a vector with the groups that achieve your condition and then filter in the original data.frame. Here how I thought:
library(dplyr)
group_to_keep <-
df %>%
group_by(group) %>%
slice_max(n = 2,value) %>%
filter(abs(diff(value)) <= 2) %>%
pull(group) %>%
unique()
df %>%
filter(group %in% group_to_keep)
You can use ave.
df[ave(df$value, df$group, FUN=\(x) diff(sort(c(-x, Inf)))[1]) <= 2,]
# group value
#1 A 5
#2 A 1
#3 A 1
#4 A 5
#9 C 10
#10 C 1
#11 C 1
#12 C 8
In case you can sure that you have all the time at least two values you can use.
df[ave(df$value, df$group, FUN=\(x) diff(tail(sort(x), 2))) <= 2,]
df[ave(df$value, df$group, FUN=\(x) diff(sort(-x)[1:2])) <= 2,]

Create new column based on previous column by group; if missing, use NA

I am trying out to select a value by group from one column, and pass it as value in another column, extending for the whole group. This is similar to question asked here . BUt, some groups do not have this number: in that case, I need to fill the column with NAs. How to do this?
Dummy example:
dd1 <- data.frame(type = c(1,1,1),
grp = c('a', 'b', 'd'),
val = c(1,2,3))
dd2 <- data.frame(type = c(2,2),
grp = c('a', 'b'),
val = c(8,2))
dd3 <- data.frame(type = c(3,3),
grp = c('b', 'd'),
val = c(7,4))
dd <- rbind(dd1, dd2, dd3)
Create new column:
dd %>%
group_by(type) %>%
mutate(#val_a = ifelse(grp == 'a', val , NA),
val_a2 = val[grp == 'a'])
Expected outcome:
type grp val val_a # pass in `val_a` value of teh group 'a'
1 1 a 1 1
2 1 b 2 1
3 1 d 3 1
4 2 a 8 8
5 2 b 2 8
6 3 b 7 NA
7 3 d 4 NA # value for 'a' is missing from group 3
You were close with your first approach; use any to apply the condition to all observations in the group:
dd %>%
group_by(type) %>%
mutate(val_a = ifelse(any(grp == "a"), val[grp == "a"] , NA))
type grp val val_a
<dbl> <chr> <dbl> <dbl>
1 1 a 1 1
2 1 b 2 1
3 1 d 3 1
4 2 a 8 8
5 2 b 2 8
6 3 b 7 NA
7 3 d 4 NA
Try this:
dd %>%
group_by(type) %>%
mutate(val_a2 = val[which(c(grp == 'a'))[1]])
# # A tibble: 7 x 4
# # Groups: type [3]
# type grp val val_a2
# <dbl> <chr> <dbl> <dbl>
# 1 1 a 1 1
# 2 1 b 2 1
# 3 1 d 3 1
# 4 2 a 8 8
# 5 2 b 2 8
# 6 3 b 7 NA
# 7 3 d 4 NA
This also controls against the possibility that there could be more than one match, which may cause bad results (with or without a warning).

Subsetting dataframe in grouped data

I have a dataframe including a column of factors that I would like to subset to select every nth row, after grouping by factor level. For example,
my_df <- data.frame(col1 = c(1:12), col2 = rep(c("A","B", "C"), 4))
my_df
col1 col2
1 1 A
2 2 B
3 3 C
4 4 A
5 5 B
6 6 C
7 7 A
8 8 B
9 9 C
10 10 A
11 11 B
12 12 C
Subsetting to select every 2nd row should yield my_new_df as,
col1 col2
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
I tried in dplyr:
my_df %>% group_by(col2) %>%
my_df[seq(2, nrow(my_df), 2), ] -> my_new_df
I get an error:
Error: Can't subset columns that don't exist.
x Locations 4, 6, 8, 10, and 12 don't exist.
ℹ There are only 2 columns.
To see if the nrow function was a problem, I tried using the number directly. So,
my_df %>% group_by(col2) %>%
my_df[seq(2, 4, 2), ] -> my_new_df
Also gave an error,
Error: Can't subset columns that don't exist.
x Location 4 doesn't exist.
ℹ There are only 2 columns.
Run `rlang::last_error()` to see where the error occurred.
My expectation was that it would run the subsetting on each group of data and then combine them into 'my_new_df'. My understanding of how group_by works is clearly wrong but I am stuck on how to move past this error. Any help would much appreciated.
Try:
my_df %>%
group_by(col2)%>%
slice(seq(from = 2, to = n(), by = 2))
# A tibble: 6 x 2
# Groups: col2 [3]
col1 col2
<int> <chr>
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
You might want to ungroup after slicing if you want to do other operations not based on col2.
Here is a data.table option:
library(data.table)
data <- as.data.table(my_df)
data[(rowid(col2) %% 2) == 0]
col1 col2
1: 4 A
2: 5 B
3: 6 C
4: 10 A
5: 11 B
6: 12 C
Or base R:
my_df[as.logical(with(my_df, ave(col1, col2, FUN = function(x)
seq_along(x) %% 2 == 0))), ]
col1 col2
4 4 A
5 5 B
6 6 C
10 10 A
11 11 B
12 12 C

Expanding a data.frame based on (group) values from the data.frame

Lets say I have the following data frame:
tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9))
# A tibble: 2 x 3
user first last
<chr> <dbl> <dbl>
1 A 1 6
2 B 4 9
And want to create a tibble that looks like:
bind_rows(tibble(user = 'A', weeks = 1:6),
tibble(user = 'B', weeks = 4:9))
# A tibble: 12 x 2
user weeks
<chr> <int>
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 4
8 B 5
9 B 6
10 B 7
11 B 8
12 B 9
How could I go about doing this? I have tried:
tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9)) %>%
group_by(user) %>%
mutate(weeks = first:last)
I wonder if I should try a combination of complete map or nest?
One option is unnest after creating a sequence
library(dplyr)
library(purrr)
df1 %>%
transmute(user, weeks = map2(first, last, `:`)) %>%
unnest(weeks)
# A tibble: 12 x 2
# user weeks
# <chr> <int>
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 A 5
# 6 A 6
# 7 B 4
# 8 B 5
# 9 B 6
#10 B 7
#11 B 8
#12 B 9
Or another option is rowwise
df1 %>%
rowwise %>%
transmute(user, weeks = list(first:last)) %>%
unnest(weeks)
Or without any packages
stack(setNames(Map(`:`, df1$first, df1$last), df1$user))
Or otherwise written as
stack(setNames(do.call(Map, c(f = `:`, df1[-1])), df1$user))
data
df1 <- tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9))
One option involving dplyr and tidyr could be:
df %>%
uncount(last - first + 1) %>%
group_by(user) %>%
transmute(weeks = first + 1:n() - 1)
user weeks
<chr> <dbl>
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 4
8 B 5
9 B 6
10 B 7
11 B 8
12 B 9

Carry / use value from previous group

data:
structure(list(id = c(1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5),
ax = c("a", "a", "b", "b", "b", "b", "b", "b", "c", "c",
"d", "d", "e"), time = c(1, 3, 0, 2, 4, 5, 6, 8, 7, 9, 10,
11, 12)), .Names = c("id", "ax", "time"), class = c("data.table",
"data.frame"), row.names = c(NA, -13L))
looks like:
id ax time
1: 1 a 1
2: 1 a 3
3: 2 b 0
4: 2 b 2
5: 2 b 4
6: 2 b 5
7: 2 b 6
8: 2 b 8
9: 3 c 7
10: 3 c 9
11: 4 d 10
12: 4 d 11
13: 5 e 12
I want to have the max of the previous group next to the actual group:
desired output:
id ax time newCol
1: 1 a 1 NA
2: 1 a 3 NA
3: 2 b 0 3
4: 2 b 2 3
5: 2 b 4 3
6: 2 b 5 3
7: 2 b 6 3
8: 2 b 8 3
9: 3 c 7 8
10: 3 c 9 8
11: 4 d 10 9
12: 4 d 11 9
13: 5 e 12 11
Is it also possible to have the value of the "previous-previous" grp?
Interessted in baseR, data.table and tidyverse solutions
note:
Can be grouped by EITHER id or ax. The example is a little redundant here.
A data.table solution:
dtt.max <- dtt[, .(max = max(time)), by = ax]
dtt.max[, max.prev := shift(max)]
dtt[dtt.max, newCol := i.max.prev, on = 'ax']
# > dtt
# id ax time newCol
# 1: 1 a 1 NA
# 2: 1 a 3 NA
# 3: 2 b 0 3
# 4: 2 b 2 3
# 5: 2 b 4 3
# 6: 2 b 5 3
# 7: 2 b 6 3
# 8: 2 b 8 3
# 9: 3 c 7 8
# 10: 3 c 9 8
# 11: 4 d 10 9
# 12: 4 d 11 9
# 13: 5 e 12 11
data.table solution using id + 1
library(data.table)
merge(d, setDT(d)[, max(time), id + 1], all.x = TRUE)
Here is a dplyr approach. The key here is to group and ungroup when necessary:
df %>%
group_by(ax) %>%
mutate(new = time[n()]) %>%
ungroup() %>%
mutate(new = lag(new)) %>%
group_by(ax) %>%
mutate(new = new[1])
# A tibble: 13 x 4
# Groups: ax [5]
id ax time new
<dbl> <chr> <dbl> <dbl>
1 1. a 1. NA
2 1. a 3. NA
3 2. b 0. 3.
4 2. b 2. 3.
5 2. b 4. 3.
6 2. b 5. 3.
7 2. b 6. 3.
8 2. b 8. 3.
9 3. c 7. 8.
10 3. c 9. 8.
11 4. d 10. 9.
12 4. d 11. 9.
13 5. e 12. 11.
Assuming id is the same as group:
dfr <- dfr %>% group_by(id) %>% mutate(groupmax = max(time))
dfr$old_group_max <- dfr$groupmax[match(dfr$id - 1, dfr$id)]
The antepenultimate group is left as an exercise :-)
1) This uses no packages. It computes the maximum for each group giving Ag and and then lags it giving LagMax. Finally it left joins using merge that back into the original data frame DF:
Ag <- aggregate(time ~ id, DF, max)
LagMax <- transform(Ag, lagmax = c(NA, head(time, -1)), time = NULL)
merge(DF, LagMax, by = "id", all.x = TRUE)
giving:
id ax time lagmax
1 1 a 1 NA
2 1 a 3 NA
3 2 b 0 3
4 2 b 2 3
5 2 b 4 3
6 2 b 5 3
7 2 b 6 3
8 2 b 8 3
9 3 c 7 8
10 3 c 9 8
11 4 d 10 9
12 4 d 11 9
13 5 e 12 11
2) This sorts time within id so that we know that the maximum is the last value in each id group.
o <- order(factor(DF$id, levels = unique(DF$id)), DF$time)
Time <- DF$time[o]
lagmax <- function(r) if (r[1] == 1) NA else Time[r[1] - 1]
transform(DF, lagmax = ave(seq_along(id), id, FUN = lagmax))
In the question the time values are already sorted within id and if that is known to be the case the above could be shortened to:
lagmax <- function(r) if (r[1] == 1) NA else DF$time[r[1] - 1]
transform(DF, lagmax = ave(seq_along(id), id, FUN = lagmax))
3) This one-liner is a data.table translation of (2):
library(data.table)
DT <- copy(DF) # don't overwrite DF
setDT(DT)[, g:=rleid(id)][, lagmax := DT$time[.I[1]-1], keyby = c("g", "id")]
In the sample data in the question time is sorted within id and if that were known to be the case we could use the following shorter code in place of the last line above
setDT(DT)[, lagmax := DT$time[.I[1]-1], by = id]

Resources