x=data.frame("l"=c('a','b','d','a','c','c','d'),
"t"=c(1:7))
x$l1 = ifelse(x$l=="a",1,
ifelse(x$l=="b",2,
ifelse(x$l=="c",3,
ifelse(x$l=="d" ,4,
ifelse(x$l=="d" & x$t==7,5,NA)))))
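As the final statement shows, when l == "d" & t == 7, l1 should equal 5, but it is set to 4 instead.
Nested ifelse() gives each element the value of the first condition that is TRUE for it, so every row with l == "d" is caught by the earlier test and never reaches the last one.
Simply reorder your conditions so that the more specific test comes first: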
x$l1 = ifelse(x$l=="a",1,
ifelse(x$l=="b",2,
ifelse(x$l=="c",3,
ifelse(x$l=="d" & x$t==7,5,
ifelse(x$l=="d" ,4,NA)))))
x
l t l1
1 a 1 1
2 b 2 2
3 d 3 4
4 a 4 1
5 c 5 3
6 c 6 3
7 d 7 5
Also consider using case_when() from the dplyr package; it would be much more readable.
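To make the ordering effect concrete, here is a minimal base R sketch using only the "d" branches from the question. The names general_first and specific_first are mine, purely for illustration.

```r
# Toy data from the question
x <- data.frame(l = c("a", "b", "d", "a", "c", "c", "d"), t = 1:7)

# General test first: row 7 is already claimed by l == "d", so it gets 4
general_first <- ifelse(x$l == "d", 4,
                 ifelse(x$l == "d" & x$t == 7, 5, NA))

# Specific test first: row 7 matches the combined test and gets 5
specific_first <- ifelse(x$l == "d" & x$t == 7, 5,
                  ifelse(x$l == "d", 4, NA))

print(general_first)   # NA NA 4 NA NA NA 4
print(specific_first)  # NA NA 4 NA NA NA 5
```

Only the order of the two tests differs between the calls, which is the whole fix.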
It is better not to use nested ifelse or case_when at all. This can be done much more easily and efficiently by merging with a key/value dataset:
library(dplyr)
# key/value table mapping each letter to its code
keydat <- data.frame(l = letters[1:4], l1 = 1:4)
left_join(x, keydat, by = "l") %>%
  mutate(l1 = replace(l1, l == 'd' & t == 7, 5))
# l t l1
#1 a 1 1
#2 b 2 2
#3 d 3 4
#4 a 4 1
#5 c 5 3
#6 c 6 3
#7 d 7 5
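If you would rather not load dplyr for this, the same key/value idea can be sketched in base R with a named vector as the lookup table. This is my own variation rather than part of the answer above, and the name key is hypothetical.

```r
x <- data.frame(l = c("a", "b", "d", "a", "c", "c", "d"), t = 1:7)

# Named vector acting as the key/value table
key <- c(a = 1, b = 2, c = 3, d = 4)

# Look up each letter by name, then patch the one special case
x$l1 <- unname(key[as.character(x$l)])
x$l1[x$l == "d" & x$t == 7] <- 5

print(x)
```

The as.character() call matters when l is a factor, since indexing by a factor would use its integer codes rather than its labels.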
Based on @Shree's answer, here is the tidyverse version:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
x <- data.frame("l"=c('a','b','d','a','c','c','d'),
"t"=c(1:7))
x %>%
as_tibble() %>%
mutate(l1 = case_when(
l == "a" ~ 1,
l == "b" ~ 2,
l == "c" ~ 3,
l == "d" & t == 7 ~ 5,
l == "d" ~ 4
))
#> # A tibble: 7 x 3
#> l t l1
#> <fct> <int> <dbl>
#> 1 a 1 1
#> 2 b 2 2
#> 3 d 3 4
#> 4 a 4 1
#> 5 c 5 3
#> 6 c 6 3
#> 7 d 7 5
Created on 2019-06-17 by the reprex package (v0.3.0)
I have a very large data frame that includes the integer columns state and state_cyclen. Every row is a gameframe: state describes the state the game is in at that frame, and state_cyclen indicates the n-th occurrence of that state (it is basically data.table::rleid(state)). Conditioning on state and cycling by state_cyclen, I need to import several columns from other definition data frames. The definition data frames store properties of each state, and their row order determines how these properties are cycled throughout the game (players encounter each game state many times).
A minimal example of the long data that should be left joined:
data <- data.frame(
state = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3),
state_cyclen = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 4, 4)
)
data
#> state state_cyclen
#> 1 1 1
#> 2 1 1
#> 3 2 1
#> 4 2 1
#> 5 3 1
#> 6 3 1
#> 7 1 2
#> 8 1 2
#> 9 2 2
#> 10 2 2
#> 11 3 2
#> 12 3 2
#> 13 2 3
#> 14 2 3
#> 15 3 3
#> 16 3 3
#> 17 3 3
#> 18 4 1
#> 19 4 1
#> 20 3 4
#> 21 3 4
Minimal example for definition data frames storing the ordering:
def_one <- data.frame(
prop = letters[1:3],
others = LETTERS[1:3]
)
def_two <- data.frame(
prop = letters[4:10],
others = LETTERS[4:10]
)
def_three <- data.frame(
prop = letters[11:12],
others = LETTERS[11:12]
)
I have a solution written in base R that gives the desired output, but it's neither very readable, nor probably very efficient.
# Add empty columns
data$prop <- NA
data$others <- NA
# Function that recycles a numeric vector bounded by an upper limit
bounded_vec_recyc <- function(vec, n) if(n == 1) vec else (vec - 1) %% n + 1
# My solution
vec_pos_one <- data[data[, "state"] == 1, ]$state_cyclen
vec_pos_one <- bounded_vec_recyc(vec_pos_one, n = nrow(def_one))
data[data[, "state"] == 1, ][, c("prop", "others")] <- def_one[vec_pos_one,]
vec_pos_two <- data[data[, "state"] == 2, ]$state_cyclen
vec_pos_two <- bounded_vec_recyc(vec_pos_two, n = nrow(def_two))
data[data[, "state"] == 2, ][, c("prop", "others")] <- def_two[vec_pos_two,]
vec_pos_three <- data[data[, "state"] == 3, ]$state_cyclen
vec_pos_three <- bounded_vec_recyc(vec_pos_three, n = nrow(def_three))
data[data[, "state"] == 3, ][, c("prop", "others")] <- def_three[vec_pos_three,]
data
#> state state_cyclen prop others
#> 1 1 1 a A
#> 2 1 1 a A
#> 3 2 1 d D
#> 4 2 1 d D
#> 5 3 1 k K
#> 6 3 1 k K
#> 7 1 2 b B
#> 8 1 2 b B
#> 9 2 2 e E
#> 10 2 2 e E
#> 11 3 2 l L
#> 12 3 2 l L
#> 13 2 3 f F
#> 14 2 3 f F
#> 15 3 3 k K
#> 16 3 3 k K
#> 17 3 3 k K
#> 18 4 1 <NA> <NA>
#> 19 4 1 <NA> <NA>
#> 20 3 4 l L
#> 21 3 4 l L
Created on 2022-08-30 with reprex v2.0.2
TLDR: As you can see, I am basically trying to merge these definition data frames one by one into the main data frame on the corresponding state, recycling the rows of each definition data frame while retaining their order, and using the state_cyclen column to keep track of the occurrences of each state throughout the game.
Is there a way to do this within the tidyverse or data.table that is faster or at least easier to read? I need this to be quite fast as I have many such gameframe files (in the hundreds) and they are lengthy (hundreds of thousands of rows).
P.S. Not sure if title is adequate for the operations I am doing, as I can imagine multiple ways of implementation. Edits on it are welcome.
Here, I make a lookup table combining the three sources. Then I join the data with the number of rows for each state, modify the state_cyclen in data using modulo with that number to be within the lookup range, then join.
library(tidyverse)
def <- bind_rows(def_one, def_two, def_three, .id = "state") %>%
mutate(state = as.numeric(state)) %>%
group_by(state) %>%
mutate(state_cyclen_adj = row_number()) %>%
ungroup()
data %>%
left_join(def %>% count(state)) %>%
# eg for row 15 we change 3 to 1 since the lookup table only has 2 rows
mutate(state_cyclen_adj = (state_cyclen - 1) %% n + 1) %>%
left_join(def)
Joining, by = "state"
Joining, by = c("state", "state_cyclen_adj")
state state_cyclen n state_cyclen_adj prop others
1 1 1 3 1 a A
2 1 1 3 1 a A
3 2 1 7 1 d D
4 2 1 7 1 d D
5 3 1 2 1 k K
6 3 1 2 1 k K
7 1 2 3 2 b B
8 1 2 3 2 b B
9 2 2 7 2 e E
10 2 2 7 2 e E
11 3 2 2 2 l L
12 3 2 2 2 l L
13 2 3 7 3 f F
14 2 3 7 3 f F
15 3 3 2 1 k K
16 3 3 2 1 k K
17 3 3 2 1 k K
18 4 1 NA NA <NA> <NA>
19 4 1 NA NA <NA> <NA>
20 3 4 2 2 l L
21 3 4 2 2 l L
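The modulo step is the heart of this, so here is a small standalone sketch of it. The helper name recycle_index and the toy lookup vector are mine, for illustration only.

```r
# Map a 1-based occurrence counter onto 1..n, wrapping around
recycle_index <- function(i, n) (i - 1) %% n + 1

occurrence <- 1:5
wrapped_2 <- recycle_index(occurrence, 2)  # 1 2 1 2 1
wrapped_3 <- recycle_index(occurrence, 3)  # 1 2 3 1 2

# Used as a row index, this cycles through a short lookup in order
lookup <- c("k", "l")
cycled <- lookup[recycle_index(occurrence, length(lookup))]
print(cycled)  # "k" "l" "k" "l" "k"
```

Subtracting 1 before the modulo and adding it back afterwards is what keeps the result 1-based, so the n-th occurrence maps to row n instead of row 0.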
Here is a data.table solution. Not sure it is easier to read, but pretty sure it is more efficient:
library(data.table)
dt <- rbind(setDT(def_one)[,state := 1],
setDT(def_two)[,state := 2],
setDT(def_three)[,state := 3])
dt[,state_cyclen := 1:.N,by = state]
data <- setDT(data)
data[dt[,.N,by = state],
state_cyclen := bounded_vec_recyc(state_cyclen,i.N),
on = "state",
by = .EACHI]
dt[data,on = c("state","state_cyclen")]
prop others state state_cyclen
1: a A 1 1
2: a A 1 1
3: d D 2 1
4: d D 2 1
5: k K 3 1
6: k K 3 1
7: b B 1 2
8: b B 1 2
9: e E 2 2
10: e E 2 2
11: l L 3 2
12: l L 3 2
13: f F 2 3
14: f F 2 3
15: k K 3 1
16: k K 3 1
17: k K 3 1
18: <NA> <NA> 4 1
19: <NA> <NA> 4 1
20: l L 3 2
21: l L 3 2
prop others state state_cyclen
By step:
I bind the def_one, def_two and def_three dataframes to create a data.table with the variable you need to merge
dt <- rbind(setDT(def_one)[,state := 1],
setDT(def_two)[,state := 2],
setDT(def_three)[,state := 3])
dt[,state_cyclen := 1:.N,by = state]
In case you want to merge a lot of dataframes, you can use rbindlist and a list of data.tables.
I then modify your state_cyclen in data to do the same recycling as you do:
dt[,.N,by = state]
state N
1: 1 3
2: 2 7
3: 3 2
gives the lengths you use to define your recycling.
data[dt[,.N,by = state],
state_cyclen := bounded_vec_recyc(state_cyclen,i.N),
on = "state",
by = .EACHI]
I use the by = .EACHI to modify the variable for each group during the merge, using the N variable from dt[,.N,by = state]
Then I just have to do the left join:
dt[data,on = c("state","state_cyclen")]
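For the rbindlist variant mentioned above, a minimal sketch could look like this. It assumes the definition tables sit in a list in state order (state 1 first); the name defs is hypothetical.

```r
library(data.table)

# Stand-ins for the definition tables, in state order (state 1, 2, 3)
defs <- list(
  data.table(prop = letters[1:3],   others = LETTERS[1:3]),
  data.table(prop = letters[4:10],  others = LETTERS[4:10]),
  data.table(prop = letters[11:12], others = LETTERS[11:12])
)

# For an unnamed list, idcol records each table's position in the list
dt <- rbindlist(defs, idcol = "state")

# Number the rows within each state, as in the step-by-step code above
dt[, state_cyclen := seq_len(.N), by = state]
print(dt)
```

This replaces the three setDT(...)[, state := k] calls, so adding a fourth definition table only means adding one list element.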
An option with nest/unnest
library(dplyr)
library(tidyr)
data %>%
nest_by(state) %>%
left_join(tibble(state = 1:3, dat = list(def_one, def_two, def_three))) %>%
mutate(data = list(bind_cols(data, if(!is.null(dat))
dat[data %>%
pull(state_cyclen) %>%
bounded_vec_recyc(., nrow(dat)),] else NULL)), dat = NULL) %>%
ungroup %>%
unnest(data)
-output
# A tibble: 21 × 4
state state_cyclen prop others
<dbl> <dbl> <chr> <chr>
1 1 1 a A
2 1 1 a A
3 1 2 b B
4 1 2 b B
5 2 1 d D
6 2 1 d D
7 2 2 e E
8 2 2 e E
9 2 3 f F
10 2 3 f F
# … with 11 more rows
I have an ID recording column containing the characters:
39299
30299
39299
39299
38744
38744
27222
39299
29000
38744
29000
29000
29000.
How can I create a new column that shows the number of repeats? 39299 is repeated 4 times, 30299 1 time, 38744 3 times, 27222 1 time, and 29000 4 times, so the output should look like c(4, 1, 3, 1, 4).
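You can do as.vector(table(your_vector)) to get the number of appearances of each number in a vector. Note that table() orders the counts by sorted value; to keep the order of first appearance, convert to a factor with levels = unique(vec) first, as below. It is also worth looking at the output of table() on its own.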
vec = c(2, 2, 3, 1, 4)
as.vector(table(factor(vec, levels = unique(vec))))
[1] 2 1 1 1
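Applied to the IDs from the question, a base R sketch might look like this. The second part uses ave() to get one count per row, which is my reading of "a new column"; the variable names are mine.

```r
ids <- c(39299, 30299, 39299, 39299, 38744, 38744, 27222,
         39299, 29000, 38744, 29000, 29000, 29000)

# One count per distinct ID, in order of first appearance
counts_by_first_seen <- as.vector(table(factor(ids, levels = unique(ids))))
print(counts_by_first_seen)  # 4 1 3 1 4

# One count per row, suitable as a new column next to the IDs
count_per_row <- ave(seq_along(ids), ids, FUN = length)
print(count_per_row)         # 4 1 4 4 3 3 1 4 4 3 4 4 4
```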
If this is a column in a data frame, you can use a dplyr solution.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# test data frame
df = data.frame(vec = c(2, 2, 3, 1, 4))
# add column with counts
df %>% add_count(vec)
#> vec n
#> 1 2 2
#> 2 2 2
#> 3 3 1
#> 4 1 1
#> 5 4 1
# count unique values in column
df %>% count(vec)
#> vec n
#> 1 1 1
#> 2 2 2
#> 3 3 1
#> 4 4 1
# count unique values and order as in the original data
df %>%
mutate(vec = factor(vec, levels = unique(vec)))%>%
count(vec)
#> vec n
#> 1 2 2
#> 2 3 1
#> 3 1 1
#> 4 4 1
Created on 2022-08-15 by the reprex package (v2.0.0)
This question arose while working on this question: "The R dplyr function arrange(ymd(col)) is not working".
We have this data frame:
df <- structure(list(record_id = 1:5, group = c("A", "B", "C", "D",
"E"), date_start = c("Apr-22", "Aug-21", "Jan-22", "Feb-22",
"Dec-21")), class = "data.frame", row.names = c(NA, -5L))
record_id group date_start
1 1 A Apr-22
2 2 B Aug-21
3 3 C Jan-22
4 4 D Feb-22
5 5 E Dec-21
We would like to sort date_start:
My first approach worked:
library(dplyr)
library(lubridate)
df %>%
mutate(date_start1 = myd(paste0(date_start,"-01"))) %>%
arrange(date_start1) %>%
select(-date_start1)
record_id group date_start
1 2 B Aug-21
2 5 E Dec-21
3 3 C Jan-22
4 4 D Feb-22
5 1 A Apr-22
Then I tried this, and it also worked:
library(dplyr)
library(lubridate)
df %>%
arrange(date_start1 = myd(paste0(date_start,"-01")))
record_id group date_start
1 2 B Aug-21
2 5 E Dec-21
3 3 C Jan-22
4 4 D Feb-22
5 1 A Apr-22
I would like to understand how a single arrange() can do the same as a combination of mutate(), arrange() and select().
What the code is not doing
The output of arrange() is perhaps surprising because you think it is doing the following:
Everything to the right of the = is just a function to create a vector.
time_col <- df$date_start %>%
paste0(.,"-01") %>%
myd() %>%
print()
#> [1] "2022-04-01" "2021-08-01" "2022-01-01" "2022-02-01" "2021-12-01"
The = of course is assignment to a new column:
df <- df %>%
mutate(date_start1 = time_col) %>%
print()
#> record_id group date_start date_start1
#> 1 1 A Apr-22 2022-04-01
#> 2 2 B Aug-21 2021-08-01
#> 3 3 C Jan-22 2022-01-01
#> 4 4 D Feb-22 2022-02-01
#> 5 5 E Dec-21 2021-12-01
You're then sorting on that variable:
df %>% arrange(date_start1)
#> record_id group date_start date_start1
#> 1 2 B Aug-21 2021-08-01
#> 2 5 E Dec-21 2021-12-01
#> 3 3 C Jan-22 2022-01-01
#> 4 4 D Feb-22 2022-02-01
#> 5 1 A Apr-22 2022-04-01
What the code is doing
If you look at the output, the code is not actually doing what is shown previously and then removing a column. The output lacks the new column date_start1 even though we never removed it manually:
df %>%
arrange(date_start1 = myd(paste0(date_start,"-01")))
#> record_id group date_start
#> 1 2 B Aug-21
#> 2 5 E Dec-21
#> 3 3 C Jan-22
#> 4 4 D Feb-22
#> 5 1 A Apr-22
The key then is to understand that you are not creating a new variable that is added to the data.frame, sorting on it, then removing it. Rather, you are passing a set of values (one per row) on which to sort.
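A base R analogue may make this clearer: order() takes a vector of sort keys that never has to live in the data frame. In this sketch I build the key by hand from month.abb (base R's English month abbreviations) instead of using lubridate, so it is an illustration of the idea and not what arrange() does internally.

```r
df <- data.frame(
  record_id = 1:5,
  group = c("A", "B", "C", "D", "E"),
  date_start = c("Apr-22", "Aug-21", "Jan-22", "Feb-22", "Dec-21")
)

# Sort key built outside the data frame: year * 12 + month number
month_num <- match(substr(df$date_start, 1, 3), month.abb)
year_num  <- as.integer(substr(df$date_start, 5, 6))
sort_key  <- year_num * 12 + month_num

# Reorder the rows by the key; no helper column is ever added to df
sorted <- df[order(sort_key), ]
print(sorted)
```

The rows come out in the order 2, 5, 3, 4, 1, the same as in the arrange() output, and the columns are unchanged.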
Why this is possible
This is permitted because you can pass any arbitrary vector, which may or may not be a function of the variables in the data. As noted in the documentation for arrange(), the second argument is:
Variables, or functions of variables. Use desc() to sort a variable in descending order.
All you are doing is passing a function of variables! This is why you can also do:
df %>%
arrange(1:nrow(df) + record_id)
#> record_id group date_start
#> 1 1 A Apr-22
#> 2 2 B Aug-21
#> 3 3 C Jan-22
#> 4 4 D Feb-22
#> 5 5 E Dec-21
I am trying to find a way to replicate the following code 4 times:
df3_1 <- df3_1 %>% add_row(.before = 2)
I tried the rep function, but it didn't work out. Is there a way to repeat this code 4 times so that I can add multiple blank rows in exactly this manner (i.e. relative to a specific row number)?
Thanks!
In base R, you can do:
n <- 4
nr <- nrow(df3_1)
df3_1[append(seq(nr), values = rep(nr + 1, n), after = 1), ]
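In case the indexing trick looks opaque, here is the same code on a toy data frame (the toy df3_1 is an assumption, since the question does not show the real one). Row nr + 1 does not exist, so asking for it returns a row of NAs.

```r
df3_1 <- data.frame(a = 1:5, b = letters[1:5])

n  <- 4
nr <- nrow(df3_1)

# Row index with n copies of a non-existent row spliced in after row 1
idx <- append(seq(nr), values = rep(nr + 1, n), after = 1)
print(idx)  # 1 6 6 6 6 2 3 4 5

# Indexing past the last row yields all-NA rows
out <- df3_1[idx, ]
print(out)
```

Note that the row names of the result are no longer 1 to 9; reset them with rownames(out) <- NULL if that matters.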
Or a tidyverse approach:
library(tibble)
n <- 4
df3_1 %>%
add_row(!!names(.)[1] := rep(NA, n), .before = 2)
I would think base::Reduce() or purrr::reduce() are best designed for this kind of job:
library(dplyr)
library(purrr)
df3_1 <- data.frame(a = 1:5,
b = letters[1:5])
# purrr reduce
reduce(1:4,
function(df, x){
add_row(df, .before = 2)
},
.init = df3_1)
#> a b
#> 1 1 a
#> 2 NA <NA>
#> 3 NA <NA>
#> 4 NA <NA>
#> 5 NA <NA>
#> 6 2 b
#> 7 3 c
#> 8 4 d
#> 9 5 e
# base R Reduce
Reduce(function(df, x){
add_row(df, .before = 2)
},
1:4,
init = df3_1)
#> a b
#> 1 1 a
#> 2 NA <NA>
#> 3 NA <NA>
#> 4 NA <NA>
#> 5 NA <NA>
#> 6 2 b
#> 7 3 c
#> 8 4 d
#> 9 5 e
You don't actually use the x parameter inside the function; it only determines how often the function is called, that is, how many blank rows are inserted.
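A stripped-down sketch with a plain counter in place of the data frame shows the same mechanics. accumulate = TRUE is only there to expose the intermediate values.

```r
# The function ignores x; the vector 1:4 only controls the number of calls
steps <- Reduce(function(acc, x) acc + 1, 1:4, init = 0, accumulate = TRUE)
print(steps)  # 0 1 2 3 4
```

The result has one entry for the initial value plus one per element of 1:4, so the function ran four times, exactly as add_row() does in the code above.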
Try this (updated, simplified version based on @27ϕ9's generalised approach):
df3_1 <- data.frame(a = 1:5,
b = letters[1:5])
library(dplyr)
df3_1 %>%
add_row(a = rep(NA, 4), .before = 2)
#> a b
#> 1 1 a
#> 2 NA <NA>
#> 3 NA <NA>
#> 4 NA <NA>
#> 5 NA <NA>
#> 6 2 b
#> 7 3 c
#> 8 4 d
#> 9 5 e
Created on 2020-07-01 by the reprex package (v0.3.0)
Let's say I have a data frame which looks something like this:
A <- c(1:100)
B <- c(0.5:100)
df <- data.frame(A,B)
And I want to get 25 random rows out of this data frame with
df[sample(nrow(df), size = 25, replace = FALSE),]
But now I want to repeat this sample function 100 times and save every result individually.
I've tried to use the repeat function but I can't find a way to save every result.
Thank you.
As mentioned in the comments, replicate() can do this, i.e.,
res <- replicate(100, df[sample(nrow(df), size = 25, replace = FALSE), ], simplify = FALSE)
An alternative is to use sapply (or lapply), i.e.,
res <- sapply(1:100, function(k) df[sample(nrow(df), size = 25, replace = FALSE), ], simplify = FALSE)
or
res <- lapply(1:100, function(k) df[sample(nrow(df), size = 25, replace = FALSE),])
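Here is a self-contained version you can run to check the shape of the result. The set.seed() call is my addition so that the samples are reproducible; the seed value itself is arbitrary.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

A <- 1:100
B <- 0.5:100
df <- data.frame(A, B)

# 100 independent samples of 25 rows each, stored as a list of data frames
res <- replicate(100,
                 df[sample(nrow(df), size = 25, replace = FALSE), ],
                 simplify = FALSE)

length(res)          # number of samples
unique(sapply(res, nrow))  # rows per sample
head(res[[1]])       # first rows of the first sample
```

Each result is then available individually as res[[1]], res[[2]], and so on.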
replicate() is a great option for this problem.
If you would like your final results in a single table with a column for the ID variable, you can use bind_rows() from the dplyr package. Here is a smaller example (3 samples from a data set of 10 rows) that may allow easier understanding of replicate()'s behavior:
library(dplyr, warn.conflicts = FALSE)
# make a smaller data set of 10 rows
d <- data.frame(
A = 1:10,
B = LETTERS[1:10]
) %>% print
#> A B
#> 1 1 A
#> 2 2 B
#> 3 3 C
#> 4 4 D
#> 5 5 E
#> 6 6 F
#> 7 7 G
#> 8 8 H
#> 9 9 I
#> 10 10 J
# create 3 samples, with each sample containing 4 rows
reps <- replicate(3, d[sample(nrow(d), 4, FALSE), ], simplify = FALSE) %>% print
#> [[1]]
#> A B
#> 2 2 B
#> 5 5 E
#> 6 6 F
#> 1 1 A
#>
#> [[2]]
#> A B
#> 3 3 C
#> 2 2 B
#> 5 5 E
#> 8 8 H
#>
#> [[3]]
#> A B
#> 4 4 D
#> 9 9 I
#> 3 3 C
#> 8 8 H
# bind the list elements into a single data frame, with an ID column for the sample
bind_rows(reps, .id = "sample_id")
#> sample_id A B
#> 1 1 2 B
#> 2 1 5 E
#> 3 1 6 F
#> 4 1 1 A
#> 5 2 3 C
#> 6 2 2 B
#> 7 2 5 E
#> 8 2 8 H
#> 9 3 4 D
#> 10 3 9 I
#> 11 3 3 C
#> 12 3 8 H
Created on 2019-12-02 by the reprex package (v0.3.0)
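If you prefer to stay in base R, roughly the same combined table can be sketched with cbind() and rbind(). This is an alternative to bind_rows(), not what it does internally; the names tagged and combined are mine, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

d <- data.frame(A = 1:10, B = LETTERS[1:10])
reps <- replicate(3, d[sample(nrow(d), 4), ], simplify = FALSE)

# Tag each sample with its position in the list, then stack them
tagged <- lapply(seq_along(reps), function(i) cbind(sample_id = i, reps[[i]]))
combined <- do.call(rbind, tagged)
print(combined)
```

One difference from bind_rows(): sample_id is an integer here, whereas .id = "sample_id" produces a character column.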