Fill a column based on max values by condition in R

I need to fill a new column based on the max values per group.
So I have
A B C
1 1 0
1 9 0
2 5 0
2 10 0
2 15 0
3 1 0
3 2 0
4 5 0
4 6 0
I need to fill $C with 1 for each maximum value in $B per grouping of $A
So:
A B C
1 1 0
1 9 1
2 5 0
2 10 0
2 15 1
3 1 0
3 2 1
4 5 0
4 6 1
Appreciate the help

We can use base R's ave to compare each value with the maximum of its group; the unary + converts the logical result to 0/1
df$C <- +(with(df, B == ave(B, A, FUN = max)))
df
# A B C
#1 1 1 0
#2 1 9 1
#3 2 5 0
#4 2 10 0
#5 2 15 1
#6 3 1 0
#7 3 2 1
#8 4 5 0
#9 4 6 1
The same in dplyr would be
library(dplyr)
df %>%
group_by(A) %>%
mutate(C = +(B == max(B)))
We can also match on the index of the maximum value with which.max, which flags only the first maximum if a group has ties
df$C <- with(df, ave(B, A, FUN = function(x) seq_along(x) == which.max(x)))
and
df %>%
group_by(A) %>%
mutate(C = +(row_number() == which.max(B)))
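The two variants above differ when a group contains tied maxima: B == max(B) flags every tied row, while which.max flags only the first one. A minimal base R sketch of that difference, using a small made-up data frame (the name `ties` and its values are hypothetical, not from the question):

```r
# Hypothetical data: group 1 has two rows tied at the maximum B = 9
ties <- data.frame(A = c(1, 1, 1, 2, 2), B = c(9, 9, 1, 5, 10))

# Variant 1: compare with the group maximum, so every tied row gets a 1
ties$C_all <- +(with(ties, B == ave(B, A, FUN = max)))

# Variant 2: compare positions with which.max, so only the first maximum gets a 1
ties$C_first <- with(ties, ave(B, A, FUN = function(x) seq_along(x) == which.max(x)))

ties
```

Which behaviour is wanted depends on whether exactly one row per group should be marked.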

Related

Creating a new column with conditions in addition to the row value of the new column

Any ideas on how to create a new column B using the values of column A,
while also using the value of the row above in the newly created column B?
The value of B should be as follows:
A = 0: value of the row above.
A = 1: 1.
A = 2: value of the row above + 1.
Current dataframe + desired outcome
Dataframe Desired outcome
A A B
1 1 1
0 0 1
2 2 2
0 0 2
2 2 3
0 0 3
2 2 4
0 0 4
2 2 5
0 0 5
2 2 6
0 0 6
1 1 1
0 0 1
1 1 1
0 0 1
2 2 2
0 0 2
2 2 3
0 0 3
1 1 1
0 0 1
2 2 2
0 0 2
Data Frame
A <- c(1,0,2,0,2,0,2,0,2,0,2,0,1,0,1,0,2,0,2,0,1,0,2,0)
Bdesiredoutcome <- c(1,1,2,2,3,3,4,4,5,5,6,6,1,1,1,1,2,2,3,3,1,1,2,2)
df = data.frame(A,Bdesiredoutcome)
I tried using dplyr with mutate(), case_when() and lag(), but I keep running into errors caused by lag(). Using lag(A) instead cannot generate the desired outcome.
Any ideas on how to solve this problem?
df <- df %>%
mutate(B = case_when((A == 0) ~ lag(B),
(A == 1) ~ 1,
(A == 2) ~ (lag(B)+1)
))
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "function"
In addition: Warning message:
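We can create a grouping column with cumsum(A == 1), which starts a new group at every 1, and then create the 'B' column as the running count of non-zero values within each group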
library(dplyr)
df %>%
group_by(grp = cumsum(A == 1)) %>%
mutate(B = cumsum(A != 0)) %>%
ungroup %>%
select(-grp) %>%
as.data.frame
-output
A Bdesiredoutcome B
1 1 1 1
2 0 1 1
3 2 2 2
4 0 2 2
5 2 3 3
6 0 3 3
7 2 4 4
8 0 4 4
9 2 5 5
10 0 5 5
11 2 6 6
12 0 6 6
13 1 1 1
14 0 1 1
15 1 1 1
16 0 1 1
17 2 2 2
18 0 2 2
19 2 3 3
20 0 3 3
21 1 1 1
22 0 1 1
23 2 2 2
24 0 2 2
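The same grouping idea can be sketched in base R without dplyr. This is an illustration rather than part of the answer above; it assumes the first row has A == 1 (otherwise the leading rows fall into a group 0 and are counted from there), and `grp` is a helper name introduced here.

```r
A <- c(1,0,2,0,2,0,2,0,2,0,2,0,1,0,1,0,2,0,2,0,1,0,2,0)
Bdesiredoutcome <- c(1,1,2,2,3,3,4,4,5,5,6,6,1,1,1,1,2,2,3,3,1,1,2,2)
df <- data.frame(A, Bdesiredoutcome)

# grp increases by one at every A == 1, so each 1 starts a new block
grp <- cumsum(df$A == 1)

# Within each block, count the non-zero rows seen so far
df$B <- ave(as.integer(df$A != 0), grp, FUN = cumsum)

# Compare against the desired outcome from the question
all(df$B == df$Bdesiredoutcome)
```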
On your original question I got the following:
library(tidyverse)
library(lubridate)
df$date <-dmy(df$date)
df <- df %>%
arrange(id, date) %>%
group_by(id) %>%
mutate(daysbetween = replace_na(date - lag(date),0),
ind = 1,
NewA= case_when (daysbetween < 7 ~ 0, daysbetween > 7 ~ 1),
NewB= case_when (daysbetween < 85 ~ 0, daysbetween > 85 ~ 1),
A = case_when (1 + cumsum(ind*NewA) <= 6 ~ 1 + cumsum(ind*NewA),
1 + cumsum(ind*NewA) > 6 ~ 1 + cumsum(ind*NewA) - 6),
B = 1 + cumsum(ind*NewB))%>%
select(id, date, A, B)
It only works if the reset for A is at 6. I used cumsum() as suggested above.

column-wise operations depending on data on a data frame in R

I have a data frame with negative values in one column, something like this:
df <- data.frame("a" = 1:6,"b"= -(5:10), "c" = rep(8:6,2))
a b c
1 1 -5 8
2 2 -6 7
3 3 -7 6
4 4 -8 8
5 5 -9 7
6 6 -10 6
I want to convert this to a data frame with no negative values in "b" keeping row totals unchanged. I can use column "a" only if "c" is not big enough to absorb the negative values in "b".
The end result should look like this
a b c
1 1 0 3
2 2 0 1
3 2 0 0
4 4 0 0
5 3 0 0
6 2 0 0
I feel that sapply could be used, but I don't know how.
You can use pmin and pmax to get the new values for a, b and c.
df$c <- df$c + pmin(0, df$b)
df$b <- pmax(0, df$b)
df$a <- df$a + pmin(0, df$c)
df$c <- pmax(0, df$c)
df
# a b c
#1 1 0 3
#2 2 0 1
#3 2 0 0
#4 4 0 0
#5 3 0 0
#6 2 0 0
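To check the requirement that row totals stay unchanged, the pmin/pmax steps can be run against the question's data and compared with the original row sums. This is only a verification sketch; `totals_before` is a name introduced here, and it assumes "a" is always large enough to absorb what "c" cannot (as it is in this data), otherwise "a" itself would go negative.

```r
df <- data.frame(a = 1:6, b = -(5:10), c = rep(8:6, 2))
totals_before <- rowSums(df)

df$c <- df$c + pmin(0, df$b)  # push the negative part of b into c
df$b <- pmax(0, df$b)         # b is now non-negative
df$a <- df$a + pmin(0, df$c)  # push whatever c could not absorb into a
df$c <- pmax(0, df$c)         # c is now non-negative

# Row totals are preserved and b, c contain no negative values
all(rowSums(df) == totals_before)
all(df$b >= 0 & df$c >= 0)
```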
You could use dplyr:
library(dplyr)
df %>%
mutate(total=rowSums(.)) %>%
rowwise() %>%
mutate(c=max(b+c, 0),
b=max(b,0),
a=total - c - b) %>%
select(-total)
which returns
# A tibble: 6 x 3
# Rowwise:
a b c
<dbl> <dbl> <dbl>
1 1 0 3
2 2 0 1
3 2 0 0
4 4 0 0
5 3 0 0
6 2 0 0
Here is a base R solution.
df2 <- df
df2$c <- df$c + df$b
df2$a <- ifelse(df2$c < 0, df2$a + df2$c, df2$a)
df2[df2 < 0 ] <- 0
df2
# a b c
# 1 1 0 3
# 2 2 0 1
# 3 2 0 0
# 4 4 0 0
# 5 3 0 0
# 6 2 0 0

Only Use The First Match For Every N Rows

I have a data.frame that looks like this.
Date Number
1 1
2 0
3 1
4 0
5 0
6 1
7 0
8 0
9 1
I would like to create a new column that contains a 1 if the row holds the first 1 within its block of 3 rows, and a 0 otherwise. For example, this is how I would like the new data.frame to look
Date Number New
1 1 1
2 0 0
3 1 0
4 0 0
5 0 0
6 1 1
7 0 0
8 0 0
9 1 1
In every block of three rows we find the first 1 and mark it in the new column; otherwise we place a 0. Thank you.
Hmm, at first glance I thought akrun's answer gave me the solution. However, it is not exactly what I am looking for. Here is what @akrun's solution produces.
df1 = data.frame(Number = c(1,0,1,0,1,1,1,0,1,0,0,0))
head(df1,9)
Number
1 1
2 0
3 1
4 0
5 1
6 1
7 1
8 0
9 1
Attempt at solution:
df1 %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
mutate(New = +(Number == row_number()))
Number grp New
<dbl> <int> <int>
1 1 1 1
2 0 1 0
3 1 1 0
4 0 2 0
5 1 2 0 #should be a 1
6 1 2 0
7 1 3 1
8 0 3 0
9 1 3 0
As you can see, the code misses the 1 on row 5. I am looking for the first 1 in every chunk of three rows; everything else should be 0.
Sorry if I was unclear, akrun.
Edit: akrun's new answer is exactly what I am looking for. Thank you very much.
Here is an option: create a grouping column with gl (blocks of 3 rows), then compare row_number() with the index of the first 1 in each block. match returns only the index of the first match, and nomatch = 0 ensures that a block without any 1 gets all zeros.
library(dplyr)
df1 %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
mutate(New = +(row_number() == match(1, Number, nomatch = 0)))
# A tibble: 12 x 3
# Groups: grp [4]
# Number grp New
# <dbl> <int> <int>
# 1 1 1 1
# 2 0 1 0
# 3 1 1 0
# 4 0 2 0
# 5 1 2 1
# 6 1 2 0
# 7 1 3 1
# 8 0 3 0
# 9 1 3 0
#10 0 4 0
#11 0 4 0
#12 0 4 0
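For readers who prefer to avoid dplyr, the same logic can be sketched in base R with ave. This is an illustrative equivalent rather than part of the answer above; `grp` is a helper name introduced here, built with ceiling() instead of gl().

```r
df1 <- data.frame(Number = c(1,0,1,0,1,1,1,0,1,0,0,0))

# Block id for every row: rows 1-3 -> 1, rows 4-6 -> 2, and so on
grp <- ceiling(seq_len(nrow(df1)) / 3)

# Within each block, flag the position of the first 1 (none if the block has no 1)
df1$New <- ave(df1$Number, grp,
               FUN = function(x) seq_along(x) == match(1, x, nomatch = 0))

df1
```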
Looking at the logic, perhaps you want to check if Number == 1 and that the prior 2 values were both 0. If that is not correct please let me know.
library(dplyr)
df %>%
mutate(New = ifelse(Number == 1 & lag(Number, n = 1L, default = 0) == 0 & lag(Number, n = 2L, default = 0) == 0, 1, 0))
Output
Date Number New
1 1 1 1
2 2 0 0
3 3 1 0
4 4 0 0
5 5 0 0
6 6 1 1
7 7 0 0
8 8 0 0
9 9 1 1
You can replace the Number values with 0, except for the first occurrence of 1 in each block of 3 rows.
library(dplyr)
df %>%
group_by(gr = ceiling(row_number()/3)) %>%
mutate(New = replace(Number, -which.max(Number), 0)) %>%
#Or to be safe and specific use
#mutate(New = replace(Number, -which(Number == 1)[1], 0)) %>%
ungroup() %>% select(-gr)
# A tibble: 9 x 3
# Date Number New
# <int> <int> <int>
#1 1 1 1
#2 2 0 0
#3 3 1 0
#4 4 0 0
#5 5 0 0
#6 6 1 1
#7 7 0 0
#8 8 0 0
#9 9 1 1

R: df header columns are ordinal ranking and spread across columns for each observation

I have questionnaire data that looks like this:
items no_stars1 no_stars2 no_stars3 average satisfied bad
1 A 1 0 0 0 0 1
2 B 0 1 0 1 0 0
3 C 0 0 1 0 1 0
4 D 0 1 0 0 1 0
5 E 0 0 1 1 0 0
6 F 0 0 1 0 1 0
7 G 1 0 0 0 0 1
Basically, the header columns (number-of-stars rating and satisfaction) are ordinal rankings for each item. I would like to collapse no_stars (cols 2:4) and satisfactory (cols 5:7) into one column each, so that the output looks like this:
items no_stars satisfactory
1 A 1 1
2 B 2 2
3 C 3 3
4 D 2 3
5 E 3 2
6 F 3 3
7 G 1 1
$no_stars: 1 is for no_stars1, 2 for no_stars2, 3 for no_stars3
$satisfactory: 1 is for bad, 2 for average, 3 for satisfied
I have tried the code below
df$no_stars2[df$no_stars2 == 1] <- 2
df$no_stars3[df$no_stars3 == 1] <- 3
df$average[df$average == 1] <- 2
df$satisfied[df$satisfied == 1] <- 3
no_stars <- df$no_stars1 + df$no_stars2 + df$no_stars3
satisfactory <- df$bad + df$average + df$satisfied
tidy_df <- data.frame(items = df$items, no_stars, satisfactory)
tidy_df
Is there a function in R that can do the same thing, or
does anyone have a better and simpler solution?
Thanks
Just use max.col and set the column order so that the position of the 1 in each row equals the desired rank:
starsOrder<-c("no_stars1","no_stars2","no_stars3")
satOrder<-c("bad","average","satisfied")
data.frame(items=df$items,no_stars=max.col(df[,starsOrder]),
satisfactory=max.col(df[,satOrder]))
# items no_stars satisfactory
#1 A 1 1
#2 B 2 2
#3 C 3 3
#4 D 2 3
#5 E 3 2
#6 F 3 3
#7 G 1 1
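One caveat worth illustrating: max.col breaks ties at random by default. With exactly one 1 per row there are no ties, but a row of all zeros would tie across every column. A hedged sketch that passes ties.method = "first" to make the result deterministic, with the question's data rebuilt inline so the snippet runs on its own (`res` is a name introduced here):

```r
df <- data.frame(
  items     = c("A", "B", "C", "D", "E", "F", "G"),
  no_stars1 = c(1, 0, 0, 0, 0, 0, 1),
  no_stars2 = c(0, 1, 0, 1, 0, 0, 0),
  no_stars3 = c(0, 0, 1, 0, 1, 1, 0),
  average   = c(0, 1, 0, 0, 1, 0, 0),
  satisfied = c(0, 0, 1, 1, 0, 1, 0),
  bad       = c(1, 0, 0, 0, 0, 0, 1)
)

# Column order defines the rank: position 1 = lowest, position 3 = highest
starsOrder <- c("no_stars1", "no_stars2", "no_stars3")
satOrder   <- c("bad", "average", "satisfied")

res <- data.frame(
  items        = df$items,
  no_stars     = max.col(as.matrix(df[, starsOrder]), ties.method = "first"),
  satisfactory = max.col(as.matrix(df[, satOrder]), ties.method = "first")
)
res
```

Note that with ties.method = "first" an all-zero row would silently get rank 1, so such rows may need separate handling.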
Another tidyverse solution, making use of factor-to-integer conversions to encode no_stars and satisfactory and gathering from wide to long twice:
library(tidyverse)
df %>%
gather(no_stars, v1, starts_with("no_stars")) %>%
mutate(no_stars = as.integer(factor(no_stars))) %>%
gather(satisfactory, v2, average, satisfied, bad) %>%
filter(v1 > 0 & v2 > 0) %>%
mutate(satisfactory = as.integer(factor(
satisfactory, levels = c("bad", "average", "satisfied")))) %>%
select(-v1, -v2) %>%
arrange(items)
# items no_stars satisfactory
#1 A 1 1
#2 B 2 2
#3 C 3 3
#4 D 2 3
#5 E 3 2
#6 F 3 3
#7 G 1 1
While there may be more elegant solutions, using dplyr::case_when() gives you the flexibility to code things however you want:
library(dplyr)
df %>%
dplyr::mutate(
no_stars = dplyr::case_when(
no_stars1 == 1 ~ 1,
no_stars2 == 1 ~ 2,
no_stars3 == 1 ~ 3)
, satisfactory = dplyr::case_when(
average == 1 ~ 2,
satisfied == 1 ~ 3,
bad == 1 ~ 1)
)
# items no_stars1 no_stars2 no_stars3 average satisfied bad no_stars satisfactory
# 1 A 1 0 0 0 0 1 1 1
# 2 B 0 1 0 1 0 0 2 2
# 3 C 0 0 1 0 1 0 3 3
# 4 D 0 1 0 0 1 0 2 3
# 5 E 0 0 1 1 0 0 3 2
# 6 F 0 0 1 0 1 0 3 3
# 7 G 1 0 0 0 0 1 1 1
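Another option, using replace and coalesce (factor(s, unique(s)) numbers the levels by first appearance, so this relies on the categories first appearing in rank order in the data):
dat%>%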
replace(.==1,NA)%>%
replace_na(setNames(as.list(names(.)),names(.)))%>%
replace(.==0,NA)%>%
mutate(s=coalesce(!!!.[2:4]),
no_stars=as.numeric(factor(s,unique(s))),
t=coalesce(!!!.[5:7]),
satisfactory=as.numeric(factor(t,unique(t))))%>%
select(items,no_stars,satisfactory)
items no_stars satisfactory
1 A 1 1
2 B 2 2
3 C 3 3
4 D 2 3
5 E 3 2
6 F 3 3
7 G 1 1
Using apply and match:
data.frame(
items = df1$items,
no_stars = apply(df1[2:4], 1, match, x=1),
satisfactory = apply(df1[c(7,5:6)], 1, match, x=1))
# items no_stars satisfactory
# 1 A 1 1
# 2 B 2 2
# 3 C 3 3
# 4 D 2 3
# 5 E 3 2
# 6 F 3 3
# 7 G 1 1
data
df1 <- read.table(header=TRUE,stringsAsFactors=FALSE,text="
items no_stars1 no_stars2 no_stars3 average satisfied bad
1 A 1 0 0 0 0 1
2 B 0 1 0 1 0 0
3 C 0 0 1 0 1 0
4 D 0 1 0 0 1 0
5 E 0 0 1 1 0 0
6 F 0 0 1 0 1 0
7 G 1 0 0 0 0 1")

counts sequences in R

id random count
a 0 -1
a 1 1
a 1 2
a 0 -1
a 0 -2
a 1 1
a 0 -1
a 1 1
a 0 -1
b 0 -1
b 0 -2
b 1 1
b 0 -1
b 1 1
b 0 -1
b 0 -2
b 0 -3
id is a player and random is a binary 0 or 1. I want to create a count column that counts the runs of consecutive 1's and 0's by player (negative for runs of 0's, as shown above), preferably without loops since the dataset is very big.
I think this is what you're looking for:
library(data.table)
setDT(DF)[, count := seq_len(.N), by=.(id,rleid(random))]
which gives
id random count
1: a 0 1
2: a 1 1
3: a 1 2
4: a 0 1
5: a 0 2
6: a 1 1
7: a 0 1
8: a 1 1
9: a 0 1
10: b 0 1
11: b 0 2
12: b 1 1
13: b 0 1
14: b 1 1
15: b 0 1
16: b 0 2
17: b 0 3
(Since data.table 1.9.8 there is a small shortcut: setDT(DF)[, count := rowid(rleid(random)), by=id].)
You may also want identifiers for groups of runs:
DF[, rid := rleid(random), by=id]
which gives
id random count rid
1: a 0 1 1
2: a 1 1 2
3: a 1 2 2
4: a 0 1 3
5: a 0 2 3
6: a 1 1 4
7: a 0 1 5
8: a 1 1 6
9: a 0 1 7
10: b 0 1 1
11: b 0 2 1
12: b 1 1 2
13: b 0 1 3
14: b 1 1 4
15: b 0 1 5
16: b 0 2 5
17: b 0 3 5
If you read through the introductory materials on the package, you'll see that these variables can also be created in a single step.
Here's a dplyr solution:
library(dplyr)
dat %>%
transform(idx = c(0,cumsum(random[-1L] != random[-length(random)]))) %>%
group_by(id, idx) %>%
mutate(count = -1*cumsum(random == 0) + cumsum(random == 1)) %>%
ungroup() %>%
select(-idx)
Source: local data frame [17 x 3]
id random count
1 a 0 -1
2 a 1 1
3 a 1 2
4 a 0 -1
5 a 0 -2
6 a 1 1
7 a 0 -1
8 a 1 1
9 a 0 -1
10 b 0 -1
11 b 0 -2
12 b 1 1
13 b 0 -1
14 b 1 1
15 b 0 -1
16 b 0 -2
17 b 0 -3
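For completeness, the signed run counts can also be sketched in base R with rle and ave, with no extra packages. This is an illustration, not one of the posted answers; `signed_runs` is a hypothetical helper name, and the data frame is rebuilt from the question's table.

```r
DF <- data.frame(
  id     = rep(c("a", "b"), times = c(9, 8)),
  random = c(0,1,1,0,0,1,0,1,0, 0,0,1,0,1,0,0,0)
)

# Position within each run of equal values, negated for runs of 0's
signed_runs <- function(x) {
  run_pos <- sequence(rle(x)$lengths)  # 1, 2, 3, ... restarting at every run
  run_pos * ifelse(x == 1, 1, -1)
}

# Apply per player so that runs never continue across ids
DF$count <- ave(DF$random, DF$id, FUN = signed_runs)
DF
```

Because ave splits by id first, a run of 0's at the end of one player does not carry over into the next player.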
I think the easiest way to achieve this is the streak_run function from the runner package. streak_run is also the fastest, as shown in the benchmarks below.
Solution
library(runner)
df <- data.frame( id = 1:10, random = sample(c(0,1), 10, replace=TRUE))
df$count <- streak_run(df$random)
df$count[df$random==0] <- -df$count[df$random==0]
df
# id random count
#1 1 0 -1
#2 2 0 -2
#3 3 1 1
#4 4 1 2
#5 5 1 3
#6 6 1 4
#7 7 0 -1
#8 8 0 -2
#9 9 0 -3
#10 10 0 -4
Benchmarks
runner_example <- function(df){
df$count <- streak_run(df$random)
df$count[df$random==0] <- -df$count[df$random==0]
return(df)}
dplyr_example <- function(df){
df <- df %>% # assign the result; otherwise return(df) gives back the unmodified input
transform(idx = c(0,cumsum(random[-1L] != random[-length(random)]))) %>%
group_by(id, idx) %>%
mutate(count = -1*cumsum(random == 0) + cumsum(random == 1)) %>%
ungroup() %>%
select(-idx)
return(df)}
dt_example <- function(df){
setDT(df)[, count := seq_len(.N), by=.(id,rleid(random))]
return(df)}
library(dplyr);library(data.table)
library(microbenchmark); library(magrittr)
df <- data.frame( id = 1:2000L, random = sample(letters[1:2], 2000L, replace=TRUE))
microbenchmark(
dplyr = dplyr_example(df),
dt = dt_example(df),
runner = runner_example(df),
times=100
)
#Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 134388.839 164274.611 204478.048 188548.4975 222777.298 526019.563 100
# dt 1306.139 1710.665 2181.989 1941.3420 2380.953 5581.682 100
# runner 284.522 741.145 1022.456 853.5715 1004.553 7398.019 100
