dplyr creating new column based on some condition [duplicate] - r

I have the following df:
df<-data.frame(geo_num=c(11,12,22,41,42,43,77,71),
cust_id=c("A","A","B","C","C","C","D","D"),
sales=c(2,3,2,1,2,4,6,3))
> df
geo_num cust_id sales
1 11 A 2
2 12 A 3
3 22 B 2
4 41 C 1
5 42 C 2
6 43 C 4
7 77 D 6
8 71 D 3
I need to create a new column 'geo_num_new' that holds, for every 'cust_id' group, the first value of 'geo_num' in that group, as shown below:
> df_new
geo_num cust_id sales geo_num_new
1 11 A 2 11
2 12 A 3 11
3 22 B 2 22
4 41 C 1 41
5 42 C 2 41
6 43 C 4 41
7 77 D 6 77
8 71 D 3 77
thanks.

We could use first() after grouping by 'cust_id'. The single value returned for each group is recycled across all rows of that group.
library(dplyr)
df <- df %>%
  group_by(cust_id) %>%
  mutate(geo_num_new = first(geo_num)) %>%
  ungroup()
-output
df
# A tibble: 8 x 4
geo_num cust_id sales geo_num_new
<dbl> <chr> <dbl> <dbl>
1 11 A 2 11
2 12 A 3 11
3 22 B 2 22
4 41 C 1 41
5 42 C 2 41
6 43 C 4 41
7 77 D 6 77
8 71 D 3 77
Or use data.table
library(data.table)
setDT(df)[, geo_num_new := first(geo_num), by = cust_id]
or with base R
df$geo_num_new <- with(df, ave(geo_num, cust_id, FUN = function(x) x[1]))
Or an option with collapse
library(collapse)
tfm(df, geo_num_new = ffirst(geo_num, g = cust_id, TRA = "replace"))
geo_num cust_id sales geo_num_new
1 11 A 2 11
2 12 A 3 11
3 22 B 2 22
4 41 C 1 41
5 42 C 2 41
6 43 C 4 41
7 77 D 6 77
8 71 D 3 77
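As a dependency-free cross-check of the answers above, the same column can be sketched with match(), which returns the position of the first occurrence of each value. This is an alternative to the grouped approaches shown, not one of them, and it assumes the rows within each cust_id are already in the order you care about.

```r
# Same data as in the question
df <- data.frame(geo_num = c(11, 12, 22, 41, 42, 43, 77, 71),
                 cust_id = c("A", "A", "B", "C", "C", "C", "D", "D"),
                 sales   = c(2, 3, 2, 1, 2, 4, 6, 3))

# match(x, x) gives, for every row, the index of the first row
# that has the same cust_id
first_row <- match(df$cust_id, df$cust_id)

# Look up geo_num at those first-row positions
df$geo_num_new <- df$geo_num[first_row]
df
```

Because match() works on positions rather than groups, no grouping or ungrouping step is needed.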

Related

How to get data for rows that follow certain pattern

I have a data frame that looks something like this:
 x  y z
23  1 1
23  4 2
23 56 1
23 59 2
15 89 1
15 12 1
15 15 2
17 18 1
17 21 2
78 11 1
78 38 1
78 41 2
Now this data has certain pattern on column y and column z.
I want to keep all rows where, for a given value of x, column z has a row-wise pair of 1 followed by 2. Simply put, we need to remove every row that has 1 in column z when that 1 is not followed by 2 in the next row.
The final output should look like this:
 x  y z
23  1 1
23  4 2
23 56 1
23 59 2
15 12 1
15 15 2
17 18 1
17 21 2
78 38 1
78 41 2
You can do this:
library(dplyr)
df %>%
group_by(x) %>%
filter((((z == 1) & (lead(z) == 2)) | ((z == 2) & (lag(z) == 1))))
# A tibble: 10 × 3
# Groups: x [4]
x y z
<int> <int> <int>
1 23 1 1
2 23 4 2
3 23 56 1
4 23 59 2
5 15 12 1
6 15 15 2
7 17 18 1
8 17 21 2
9 78 38 1
10 78 41 2
library(tidyverse)
df <- data.frame(x = c(23,23,23,23,15,15,15,17,17,78,78,78),
y = c(1,4,56,59,89,12,15,18,21,11,38,41),
z = c(1,2,1,2,1,1,2,1,2,1,1,2))
df %>%
filter(!(z == 1 & lead(z) != 2))
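On the question's data both answers keep the same 10 rows, but they are not equivalent in general: the grouped version evaluates lead() and lag() within each x, while the ungrouped version looks across the boundary between x values and only ever removes rows with z == 1. A minimal sketch of the difference, using hypothetical data (toy is not from the question) in which one group ends with 1 and the next begins with 2:

```r
library(dplyr)

# Hypothetical data: x == 1 ends with z == 1, x == 2 starts with z == 2
toy <- data.frame(x = c(1, 1, 1, 2, 2, 2),
                  y = 1:6,
                  z = c(1, 2, 1, 2, 1, 2))

# Ungrouped: lead() sees the first row of the next x, so row 3 looks
# like a valid "1 followed by 2" and nothing is removed
ungrouped <- toy %>%
  filter(!(z == 1 & lead(z) != 2))

# Grouped: lead()/lag() stop at the group boundary, so the trailing 1
# in x == 1 and the leading 2 in x == 2 are both dropped
grouped <- toy %>%
  group_by(x) %>%
  filter((z == 1 & lead(z) == 2) | (z == 2 & lag(z) == 1)) %>%
  ungroup()

nrow(ungrouped)
grouped$y
```

If pairs must never span two values of x, the grouped form is the safer choice.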

How to use dplyr & casewhen, across groups and rows, with three outcomes?

This seems a simple question to me but I'm super stuck on it! My data looks like this:
Name round MatchNumber Score
<chr> <int> <int> <dbl>
1 A 1 1 48
2 B 1 1 66
3 C 1 2 74
4 D 1 2 62
5 E 1 3 61
6 F 1 3 63
7 G 1 4 63
8 H 1 4 63
9 E 2 1 51
10 D 2 1 59
11 A 2 2 50
12 H 2 2 78
13 B 2 3 51
14 G 2 3 47
15 C 2 4 72
16 F 2 4 73
All I want to do is create a new column Outcome from Score to designate that for every name, round and match, there is a Win, Loss or Draw. Ideally, this would be done via dplyr and likely via case_when, but I just can't get my head around the row-wise calculation and grouping. I've tried (but am stuck at) the following:
MatchOutcome <- ExampleData %>%
arrange(round, MatchNumber) %>%
group_by(Name, round, MatchNumber) %>%
mutate(Outcome = Score)
My ideal output would look like:
Name round MatchNumber Score Outcome
<chr> <int> <int> <dbl> <chr>
1 A 1 1 48 Loss
2 B 1 1 66 Win
3 C 1 2 74 Win
4 D 1 2 62 Loss
5 E 1 3 61 Loss
6 F 1 3 63 Win
7 G 1 4 63 Draw
8 H 1 4 63 Draw
9 E 2 1 51 Loss
10 D 2 1 59 Win
11 A 2 2 50 Loss
12 H 2 2 78 Win
13 B 2 3 51 Win
14 G 2 3 47 Loss
15 C 2 4 72 Loss
16 F 2 4 73 Win
Maybe something like this?
ExampleData %>%
group_by(round, MatchNumber) %>%
mutate(Outcome = case_when(Score == mean(Score) ~ "Draw",
Score == max(Score) ~ "Win",
TRUE ~ "Loss")) %>%
ungroup()
# A tibble: 16 x 5
Name round MatchNumber Score Outcome
<chr> <int> <int> <int> <chr>
1 A 1 1 48 Loss
2 B 1 1 66 Win
3 C 1 2 74 Win
4 D 1 2 62 Loss
5 E 1 3 61 Loss
6 F 1 3 63 Win
7 G 1 4 63 Draw
8 H 1 4 63 Draw
9 E 2 1 51 Loss
10 D 2 1 59 Win
11 A 2 2 50 Loss
12 H 2 2 78 Win
13 B 2 3 51 Win
14 G 2 3 47 Loss
15 C 2 4 72 Loss
16 F 2 4 73 Win
Data:
ExampleData <- read.table(text = "Name round MatchNumber Score
1 A 1 1 48
2 B 1 1 66
3 C 1 2 74
4 D 1 2 62
5 E 1 3 61
6 F 1 3 63
7 G 1 4 63
8 H 1 4 63
9 E 2 1 51
10 D 2 1 59
11 A 2 2 50
12 H 2 2 78
13 B 2 3 51
14 G 2 3 47
15 C 2 4 72
16 F 2 4 73")
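The order of the conditions in the answer matters, because case_when() uses the first condition that is TRUE for each row. In a drawn match both players also hold the maximum score, so the Draw test has to come before the Win test. A small sketch of this, using two matches taken from round 1 of the data (the names matches, draw_first and win_first are illustrative only):

```r
library(dplyr)

# Two matches from round 1 of the question's data
matches <- data.frame(Name        = c("A", "B", "G", "H"),
                      MatchNumber = c(1, 1, 4, 4),
                      Score       = c(48, 66, 63, 63))

# Draw tested first, as in the answer
draw_first <- matches %>%
  group_by(MatchNumber) %>%
  mutate(Outcome = case_when(Score == mean(Score) ~ "Draw",
                             Score == max(Score)  ~ "Win",
                             TRUE                 ~ "Loss")) %>%
  ungroup()

# Win tested first: the drawn match is labelled Win for both players
win_first <- matches %>%
  group_by(MatchNumber) %>%
  mutate(Outcome = case_when(Score == max(Score)  ~ "Win",
                             Score == mean(Score) ~ "Draw",
                             TRUE                 ~ "Loss")) %>%
  ungroup()

draw_first$Outcome
win_first$Outcome
```

Note that the Score == mean(Score) test identifies a draw only when each match has exactly two players, as it does here.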

Distinct in r within groups of data

How do I transform a dataframe (on the left) to dataframe (on the right)?
I am trying to do this via dplyr, by grouping into name and distinct, but it gives only 3 rows
df %>%
group_by(name) %>%
distinct(.,keep.all = T) %>%
View()
There is a simple way to access all the cells you want to change:
data <- data.frame(name = c(rep("A", 5), rep("B", 5), rep("C", 5)), subject = c(rep(1:5, 3)), marks = sample(1:100, 15))
> data
name subject marks
1 A 1 31
2 A 2 12
3 A 3 29
4 A 4 67
5 A 5 99
6 B 1 77
7 B 2 3
8 B 3 92
9 B 4 69
10 B 5 42
11 C 1 52
12 C 2 66
13 C 3 98
14 C 4 23
15 C 5 72
duplicated(data$name) accesses the relevant cells. But R has no way to leave a cell "blank", so to speak.
You can either set them to NA or fill them with an empty string:
data$name[duplicated(data$name)] <- NA
> data
name subject marks
1 A 1 31
2 <NA> 2 12
3 <NA> 3 29
4 <NA> 4 67
5 <NA> 5 99
6 B 1 77
7 <NA> 2 3
8 <NA> 3 92
9 <NA> 4 69
10 <NA> 5 42
11 C 1 52
12 <NA> 2 66
13 <NA> 3 98
14 <NA> 4 23
15 <NA> 5 72
data$name <- as.character(data$name)
data$name[duplicated(data$name)] <- ""
> data
name subject marks
1 A 1 30
2 2 52
3 3 5
4 4 48
5 5 99
6 B 1 14
7 2 20
8 3 34
9 4 55
10 5 53
11 C 1 38
12 2 27
13 3 67
14 4 12
15 5 77
To use the latter solution with a factor variable, you need to add "" as a factor label:
data$name <- factor(as.numeric(data$name), 1:4, c(levels(data$name), ""))
data$name[duplicated(data$name)] <- ""
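The reason the extra level is needed can be seen in a small sketch: assigning a value that is not among a factor's levels produces NA (with a warning). The example below uses a hypothetical factor and adds the level through levels<-, which is a different route from the factor() call above but should have the same effect.

```r
# Hypothetical factor column with repeated labels
name <- factor(c("A", "A", "B", "B", "C"))

# Without "" as a level, the assignment yields NA (and a warning)
no_level <- name
suppressWarnings(no_level[duplicated(no_level)] <- "")

# Adding "" to the levels first makes the assignment valid
with_level <- name
levels(with_level) <- c(levels(with_level), "")
with_level[duplicated(with_level)] <- ""

as.character(no_level)
as.character(with_level)
```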

add column to dataframes from 1 to unique length of existing grouped rows

Here is my example df:
df = read.table(text = 'colA
22
22
22
45
45
11
11
87
90
110
32
32', header = TRUE)
I just need to add a new col based on colA with values from 1 to the unique length of colA.
Expected output:
colA newCol
22 1
22 1
22 1
45 2
45 2
11 3
11 3
87 4
90 5
110 6
32 7
32 7
Here is what I tried without success:
library(dplyr)
new_df = df %>%
group_by(colA) %>%
mutate(newCol = seq(1, length(unique(df$colA)), by = 1))
Thanks
newcol = c(1, 1+cumsum(diff(df$colA) != 0))
[1] 1 1 1 2 2 3 3 4 5 6 7 7
The dplyr package has a function to get indices of group:
df$newcol = group_indices(df,colA)
This returns:
colA newcol
1 22 2
2 22 2
3 22 2
4 45 4
5 45 4
6 11 1
7 11 1
8 87 5
9 90 6
10 110 7
11 32 3
12 32 3
Though the index is not ordered according to the order of appearance.
You can also do it using factor:
df$newcol = as.numeric(factor(df$colA,levels=unique(df$colA)))
Another option: You can capitalize on the fact that factors are associated with underlying integers. First create a new factor variable with the same levels as the column, then transform it to numeric.
newCol <- factor(df$colA,
levels = unique(df$colA))
df$newCol <- as.numeric(newCol)
df
colA newCol
1 22 1
2 22 1
3 22 1
4 45 2
5 45 2
6 11 3
7 11 3
8 87 4
9 90 5
10 110 6
11 32 7
12 32 7
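The cumsum(diff()) answer and the factor answers agree on the example data, but they number different things: the first starts a new index whenever the value changes, the others assign one index per distinct value in order of first appearance. A short sketch with a hypothetical column in which a value reappears later shows the difference. Here match() against unique() is used as a base R stand-in for the factor approach.

```r
# Hypothetical column in which 22 reappears after 45
colA <- c(22, 22, 45, 22, 22)

# Run-based: a new index every time the value changes
run_id <- c(1, 1 + cumsum(diff(colA) != 0))

# Value-based: index by order of first appearance
value_id <- match(colA, unique(colA))

run_id
value_id
```

Which one is wanted depends on whether a repeated value further down should count as the same group or as a new one.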

How do I elegantly calculate a variable in an R data.frame that uses values in a previous row?

Here is a simple scenario I constructed:
Say I have the following:
set.seed(1)
id<-sample(3,10,replace = TRUE)
n<-1:10
x<-round(runif(10,30,40))
df<-data.frame(id,n,x)
df
id n x
1 1 1 32
2 2 2 32
3 2 3 37
4 3 4 34
5 1 5 38
6 3 6 35
7 3 7 37
8 2 8 40
9 2 9 34
10 1 10 38
How do I elegantly calculate x.lag, where x.lag is the previous x for the same id, or 0 if no previous value exists?
This is what I did but I'm not happy with it:
df$x.lag<-rep(0,10)
for (id in 1:3)
df[df$id==id,]$x.lag<-c(0,df[df$id==id,]$x)[1:sum(df$id==id)]
df
id n x x.lag
1 1 1 32 0
2 2 2 32 0
3 2 3 37 32
4 3 4 34 0
5 1 5 38 32
6 3 6 35 34
7 3 7 37 35
8 2 8 40 37
9 2 9 34 40
10 1 10 38 38
We can use data.table
library(data.table)
setDT(df)[, x.lag := shift(x, fill=0), id]
Or with dplyr
library(dplyr)
df %>%
group_by(id) %>%
mutate(x.lag = lag(x, default = 0))
Or using ave from base R
df$x.lag <- with(df, ave(x, id, FUN = function(x) c(0, x[-length(x)])))
df$x.lag
#[1] 0 0 32 0 32 34 35 37 40 38
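All three answers treat "previous" as the preceding row within the same id, so they rely on the data already being sorted by n. If that cannot be assumed, dplyr's lag() takes an order_by argument. A minimal sketch with a hypothetical shuffled data frame (shuffled and res are illustrative names):

```r
library(dplyr)

# Hypothetical data: rows within each id are not sorted by n
shuffled <- data.frame(id = c(1, 1, 1, 2, 2),
                       n  = c(3, 1, 2, 2, 1),
                       x  = c(30, 10, 20, 50, 40))

# lag() is computed along n within each id, then returned in the
# original row order
res <- shuffled %>%
  group_by(id) %>%
  mutate(x.lag = lag(x, default = 0, order_by = n)) %>%
  ungroup()

res$x.lag
```

Sorting with arrange(id, n) before the grouped mutate() would be an equally reasonable choice when the row order itself does not need to be preserved.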
