I've been looking at the various answers for similar issues, but can't see anything that quite answers my problem.
I have a large data table
Number_X Amount
1 100
2 100
1 100
3 100
1 100
2 100
I want to replace the amount with 50 for those rows where Number_X == 1.
I've tried
library(dplyr)
data <- data %>%
mutate(Amount = replace(Amount, Number_X == 1, 50))
but it doesn't change the value for Amount. How can I fix this?
library(data.table)
# convert df to a data.table by reference
setDT(df)
# set Amount to 50 only in rows where Number_X == 1
df[Number_X == 1, Amount := 50]
With large data, a data.table solution is often the most appropriate, because := updates the column by reference instead of copying the whole table.
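To make the by-reference update concrete, here is a minimal sketch on the six-row sample from the question. The object name `dt` is only an illustration, and `50L` is used so the column stays integer.

```r
library(data.table)

# Sample data from the question; `dt` is a hypothetical name
dt <- data.table(
  Number_X = c(1L, 2L, 1L, 3L, 1L, 2L),
  Amount   = c(100L, 100L, 100L, 100L, 100L, 100L)
)

# Update by reference: only rows where Number_X == 1 are touched,
# and no reassignment (dt <- ...) is needed
dt[Number_X == 1, Amount := 50L]

dt$Amount
#> [1]  50 100  50 100  50 100
```

Because the table is modified in place, any other variable pointing at the same data.table sees the change too; use copy() first if you need to keep the original.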
I don't see an issue with using replace() but you can also try to use if_else()
library(dplyr, warn.conflicts = FALSE)
data <- tibble(
Number_X = c(1L, 2L, 1L, 3L, 1L, 2L),
Amount = c(100L, 100L, 100L, 100L, 100L, 100L)
)
data %>%
mutate(Amount = replace(Amount, Number_X == 1, 50L))
#> # A tibble: 6 x 2
#> Number_X Amount
#> <int> <int>
#> 1 1 50
#> 2 2 100
#> 3 1 50
#> 4 3 100
#> 5 1 50
#> 6 2 100
data %>%
mutate(Amount = if_else(Number_X == 1, 50L, Amount))
#> # A tibble: 6 x 2
#> Number_X Amount
#> <int> <int>
#> 1 1 50
#> 2 2 100
#> 3 1 50
#> 4 3 100
#> 5 1 50
#> 6 2 100
Created on 2022-02-04 by the reprex package (v2.0.1)
Tip: Use dput() with your data to share it more easily:
dput(data)
#> structure(list(Number_X = c(1L, 2L, 1L, 3L, 1L, 2L), Amount = c(100L,
#> 100L, 100L, 100L, 100L, 100L)), class = c("tbl_df", "tbl", "data.frame"
#> ), row.names = c(NA, -6L))
If you want a tidyverse approach:
data %>%
mutate(Amount = ifelse(Number_X == 1, 50, Amount))
If you want almost the speed of data.table and the grammar of dplyr, you can consider dtplyr.
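A rough sketch of what that could look like, assuming the dtplyr package is installed. The names `lazy` and `res` are made up for illustration; `lazy_dt()` wraps the data so that dplyr verbs are translated to data.table code, and nothing is computed until the result is collected.

```r
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

data <- data.frame(
  Number_X = c(1L, 2L, 1L, 3L, 1L, 2L),
  Amount   = c(100L, 100L, 100L, 100L, 100L, 100L)
)

# Wrap the data; subsequent verbs are recorded, not run
lazy <- lazy_dt(data)

res <- lazy %>%
  mutate(Amount = ifelse(Number_X == 1, 50L, Amount)) %>%
  as.data.frame()   # collecting triggers the data.table computation

res$Amount
```

Calling show_query() on the lazy pipeline (before collecting) prints the generated data.table expression, which is a handy way to check the translation.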
I have a dataframe 'df' where I want to summarize how many times each 'user' has a higher 'total' value for each head-to-head 'game'. My data frame looks like this:
game user total
1 L 55
1 J 60
2 L 64
2 J 77
3 L 90
3 J 67
4 L 98
4 J 88
5 L 71
5 J 92
The summary would state that L had a higher total in 2 games and J had a higher total in 3 games.
Thank you!
Same approach as Vinay, using data.table
library(data.table)
setDT(df)
df[order(total), tail(.SD, 1), game][, .N, user]
#> user N
#> <char> <int>
#> 1: J 3
#> 2: L 2
Created on 2022-01-19 by the reprex package (v2.0.1)
Data used:
df <- structure(list(game = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L
), user = c("L", "J", "L", "J", "L", "J", "L", "J", "L", "J"),
total = c(55L, 60L, 64L, 77L, 90L, 67L, 98L, 88L, 71L, 92L
)), row.names = c(NA, -10L), class = "data.frame")
Assuming df is your dataframe the following should give you the long form summary.
df %>%
arrange(game,desc(total)) %>% #we sort descending to ensure winner row is first.
group_by(game) %>% # we group the rows per game, this allows for winner row to be first in each group
slice_head(n=1)%>% #get first row in each group i.e winner row
ungroup()
Output:
# A tibble: 5 × 3
game user total
<int> <chr> <int>
1 1 J 60
2 2 J 77
3 3 L 90
4 4 L 98
5 5 J 92
If you just want the user wise summary add count to the code as follows:
df %>%
arrange(game,desc(total)) %>% #we sort descending to ensure winner row is first.
group_by(game) %>% # we group the rows per game, this allows for winner row to be first in each group
slice_head(n=1) %>% #get first row in each group i.e winner row
ungroup() %>%
count(user)
Output:
# A tibble: 2 × 2
user n
<chr> <int>
1 J 3
2 L 2
We can group the data by game, slice_max and then count the resulting data.
library(tidyverse)
df %>% group_by(game) %>%
slice_max(total) %>%
ungroup() %>%
count(user)
#> # A tibble: 2 × 2
#> user n
#> <chr> <int>
#> 1 J 3
#> 2 L 2
Created on 2022-01-20 by the reprex package (v2.0.1)
Note that if there's a tie, slice_max() keeps both rows, so the game adds one to both users:
library(tidyverse)
df <-
read_table('game user total
1 L 60
1 J 60
2 L 64
2 J 77
3 L 90
3 J 67
4 L 98
4 J 88
5 L 71
5 J 92')
df %>% group_by(game) %>%
slice_max(total) %>%
ungroup() %>%
count(user)
#> # A tibble: 2 × 2
#> user n
#> <chr> <int>
#> 1 J 3
#> 2 L 3
Created on 2022-01-20 by the reprex package (v2.0.1)
Is this the type of output you want?
library(tidyverse)
df <- structure(list(game = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L),
user = c("L", "J", "L", "J", "L", "J", "L", "J", "L", "J"),
total = c(55L, 60L, 64L, 77L, 90L, 67L, 98L, 88L, 71L, 92L)
), class = "data.frame", row.names = c(NA, -10L))
df %>%
group_by(game) %>%
slice_max(order_by = total,
n = 1,
with_ties = TRUE) %>%
group_by(user) %>%
summarise(wins = n())
#> # A tibble: 2 × 2
#> user wins
#> <chr> <int>
#> 1 J 3
#> 2 L 2
Created on 2022-01-20 by the reprex package (v2.0.1)
Edit
If you have a draw, then the above method counts that as a 'win' for both users. To count a draw as 'no winner' for both users (e.g. shown in game 1, below), you could use:
library(tidyverse)
df <- structure(list(game = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L),
user = c("L", "J", "L", "J", "L", "J", "L", "J", "L", "J"),
total = c(55L, 55L, 64L, 77L, 90L, 67L, 98L, 88L, 71L, 92L)
), class = "data.frame", row.names = c(NA, -10L))
df
#> game user total
#> 1 1 L 55
#> 2 1 J 55
#> 3 2 L 64
#> 4 2 J 77
#> 5 3 L 90
#> 6 3 J 67
#> 7 4 L 98
#> 8 4 J 88
#> 9 5 L 71
#> 10 5 J 92
df %>%
group_by(game) %>%
distinct(total, .keep_all = TRUE) %>%
filter(n() >= 2) %>%
slice_max(order_by = total,
n = 1,
with_ties = FALSE) %>%
group_by(user) %>%
summarise(win = n())
#> # A tibble: 2 × 2
#> user win
#> <chr> <int>
#> 1 J 2
#> 2 L 2
Created on 2022-01-20 by the reprex package (v2.0.1)
I am looking to extract timepoints from a table.
The output should be the starting point in seconds (from column 2) and the duration of each series, but only if the stage lasts for at least 3 minutes according to the Seconds column, i.e. the same stage (0, 1, 2, 3 or 5) repeats for more than 6 consecutive rows of the Stage column.
So in this case the 0-series does not qualify, while the following 1-series does.
The desired output would be: 150, 8
i.e. starting at timepoint 150 and lasting for 8 rows.
I was experimenting with rle(), but haven't been successful yet.
Stage Seconds
0 0
0 30
0 60
0 90
0 120
1 150
1 180
1 210
1 240
1 270
1 300
1 330
1 360
1 390
0 420
Not sure how representative of your data this might be. This may be an option using dplyr
library(dplyr)
df %>%
mutate(grp = c(0, cumsum(abs(diff(stage))))) %>%
filter(stage == 1) %>%
group_by(grp) %>%
mutate(count = n() - 1) %>%
filter(row_number() == 1, count >= 6) %>%
ungroup() %>%
select(-c(grp, stage))
#> # A tibble: 4 x 2
#> seconds count
#> <dbl> <dbl>
#> 1 960 16
#> 2 1500 7
#> 3 2040 17
#> 4 2670 10
Created on 2021-09-23 by the reprex package (v2.0.0)
data
set.seed(123)
df <- data.frame(stage = sample(c(0, 1), 100, replace = TRUE, prob = c(0.2, 0.8)),
seconds = seq(0, by = 30, length.out = 100))
Similar to this answer, you can use data.table::rleid() with dplyr
df <- structure(list(Stage = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 0L), Seconds = c(0L, 30L, 60L, 90L, 120L,
150L, 180L, 210L, 240L, 270L, 300L, 330L, 360L, 390L, 420L)), class = "data.frame", row.names = c(NA,
-15L))
library(dplyr)
library(data.table)
df %>%
filter(Seconds > 0) %>%
group_by(grp = rleid(Stage)) %>%
filter(n() > 6)
#> # A tibble: 9 x 3
#> # Groups: grp [1]
#> Stage Seconds grp
#> <int> <int> <int>
#> 1 1 150 2
#> 2 1 180 2
#> 3 1 210 2
#> 4 1 240 2
#> 5 1 270 2
#> 6 1 300 2
#> 7 1 330 2
#> 8 1 360 2
#> 9 1 390 2
Created on 2021-09-23 by the reprex package (v2.0.0)
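Since the question mentions rle(), here is a base R sketch of the same idea that returns one row per qualifying run rather than all of its rows. The column names `Start`, `Rows` and `Steps` are made up for illustration, and the threshold of more than 6 rows is taken from the question.

```r
# Sample data from the question
df <- data.frame(
  Stage   = c(rep(0L, 5), rep(1L, 9), 0L),
  Seconds = seq(0, 420, by = 30)
)

runs   <- rle(df$Stage)
# Row index at which each run begins
starts <- cumsum(c(1, head(runs$lengths, -1)))
# Keep runs with more than 6 consecutive rows
keep   <- runs$lengths > 6

result <- data.frame(
  Stage = runs$values[keep],
  Start = df$Seconds[starts[keep]],
  Rows  = runs$lengths[keep],
  Steps = runs$lengths[keep] - 1  # 30-second intervals between first and last row
)
result
```

For the sample this gives a single run: stage 1 starting at 150 seconds, spanning 9 rows, i.e. 8 intervals. The "8" in the question seems to correspond to the interval count, which is also what the n() - 1 in the first answer computes.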
I have an example dataset:
Road Start End Cat
1 0 50 a
1 50 60 b
1 60 90 b
1 70 75 a
2 0 20 a
2 20 25 a
2 25 40 b
Trying to output following:
Road Start End Cat
1 0 50 a
1 50 90 b
1 70 75 a
2 0 25 a
2 25 40 b
My code doesn't work:
df %>% group_by(Road, cat)
%>% summarise(
min(Start),
max(End)
)
How can I achieve the results I wanted?
We can use rleid from data.table to get the run-length-id-encoding for grouping and then do the summarise
library(dplyr)
library(data.table)
df %>%
group_by(Road, grp = rleid(Cat)) %>%
summarise(Cat = first(Cat), Start = min(Start), End = max(End)) %>%
select(-grp)
# A tibble: 5 x 4
# Groups: Road [2]
# Road Cat Start End
# <int> <chr> <int> <int>
#1 1 a 0 50
#2 1 b 50 90
#3 1 a 70 75
#4 2 a 0 25
#5 2 b 25 40
Or using data.table methods
library(data.table)
setDT(df)[, .(Start = min(Start), End = max(End)), .(Road, Cat, grp = rleid(Cat))]
data
df <- structure(list(Road = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), Start = c(0L,
50L, 60L, 70L, 0L, 20L, 25L), End = c(50L, 60L, 90L, 75L, 20L,
25L, 40L), Cat = c("a", "b", "b", "a", "a", "a", "b")),
class = "data.frame", row.names = c(NA,
-7L))
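To see what rleid() contributes here, this small sketch prints the ids it produces for the sample columns. The vectors are copied from the data above.

```r
library(data.table)

Road <- c(1L, 1L, 1L, 1L, 2L, 2L, 2L)
Cat  <- c("a", "b", "b", "a", "a", "a", "b")

# A new id starts whenever the value differs from the previous one
rleid(Cat)
#> [1] 1 2 2 3 3 3 4

# Passing both columns starts a new id when either of them changes
rleid(Road, Cat)
#> [1] 1 2 2 3 4 4 5
```

With rleid(Cat) alone, rows 4 and 5 share id 3 even though they belong to different roads, which is why the answers above also group by Road.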
I have a df that takes this general form:
ID votes
1 65
1 85
2 100
2 20
2 95
3 50
3 60
I want to create a new df that takes the two highest values in votes for each ID and shows their difference. The new df should look like this:
ID margin
1 20
2 5
3 10
Is there a way to use dplyr for this?
An option would be to arrange by 'ID' and 'votes' (in either descending or ascending order), group by 'ID', and take the diff of the first two 'votes'
library(dplyr)
df1 %>%
arrange(ID, desc(votes)) %>%
group_by(ID) %>%
summarise(margin = abs(diff(votes[1:2])))
# A tibble: 3 x 2
# ID margin
# <int> <int>
#1 1 20
#2 2 5
#3 3 10
Or another option is
df1 %>%
group_by(ID) %>%
summarise(margin = max(votes) - max(votes[-which.max(votes)]))
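Or with slice and diff
df1 %>%
group_by(ID) %>%
# order() returns row positions (row_number() would return ranks), largest votes first
slice(order(votes, decreasing = TRUE)[1:2]) %>%
summarise(margin = abs(diff(votes)))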
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L), votes = c(65L,
85L, 100L, 20L, 95L, 50L, 60L)), class = "data.frame", row.names = c(NA,
-7L))
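For comparison, the same top-two margin can be sketched in base R without dplyr. The helper name `top_two_margin` is made up for this example.

```r
df1 <- data.frame(
  ID    = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
  votes = c(65L, 85L, 100L, 20L, 95L, 50L, 60L)
)

# Difference between the largest and second-largest value;
# returns NA when a group has fewer than two rows
top_two_margin <- function(v) {
  s <- sort(v, decreasing = TRUE)
  s[1] - s[2]
}

margin <- tapply(df1$votes, df1$ID, top_two_margin)

result <- data.frame(
  ID     = as.integer(names(margin)),
  margin = as.vector(margin)
)
result
```

On the sample data this gives margins of 20, 5 and 10 for IDs 1, 2 and 3, matching the desired output in the question.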
My dataset is set up as follows:
User Day
10 2
1 3
15 1
3 1
1 2
15 3
1 1
I'm trying to find the users that are present on all three days. I'm using the code below with the dplyr package:
MAU%>%
group_by(User)%>%
filter(c(1,2,3) %in% Day)
# but get this error message:
# Error in filter_impl(.data, quo) : Result must have length 12, not 3
Any idea how to fix this?
Using the input shown reproducibly in the Note at the end, drop duplicate rows, count the rows per User, and keep only those Users that appear on 3 days:
library(dplyr)
DF %>%
distinct %>%
count(User) %>%
filter(n == 3) %>%
select(User)
giving:
# A tibble: 1 x 1
User
<int>
1 1
Note
Lines <- "
User Day
10 2
1 3
15 1
3 1
1 2
15 3
1 1"
DF <- read.table(text = Lines, header = TRUE)
We can use all to get a single TRUE/FALSE from the logical vector 1:3 %in% Day
library(dplyr)
MAU %>%
group_by(User)%>%
filter(all(1:3 %in% Day))
# A tibble: 3 x 2
# Groups: User [1]
# User Day
# <int> <int>
#1 1 3
#2 1 2
#3 1 1
data
MAU <- structure(list(User = c(10L, 1L, 15L, 3L, 1L, 15L, 1L), Day = c(2L,
3L, 1L, 1L, 2L, 3L, 1L)), class = "data.frame", row.names = c(NA,
-7L))
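The reason the original filter() call fails is that c(1, 2, 3) %in% Day always returns three values, one per requested day, whereas filter() expects either one value per row or a single TRUE/FALSE per group. A small base R sketch, using the days of users 1 and 15 from the sample data (the variable names are made up for illustration):

```r
day_user1  <- c(3L, 2L, 1L)   # User 1 appears on all three days
day_user15 <- c(1L, 3L)       # User 15 is missing day 2

# One result per requested day, regardless of how many rows the user has
c(1, 2, 3) %in% day_user1
#> [1] TRUE TRUE TRUE
c(1, 2, 3) %in% day_user15
#> [1]  TRUE FALSE  TRUE

# all() collapses that to the single value filter() can recycle per group
all(c(1, 2, 3) %in% day_user1)
#> [1] TRUE
all(c(1, 2, 3) %in% day_user15)
#> [1] FALSE
```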