I'm trying to take a sequence of dates and, starting with the first date, select subsequent dates at intervals drawn from a normal distribution. At the moment I have code that picks the interval at random, but it then reuses that same interval every time. In this example, it selects a row every 12 days:
set.seed(123)
library(tidyverse)
library(lubridate)
start_date <- as.Date('2018-03-01')
end_date <- as.Date('2018-07-01')
seq_dates <- seq(ymd(start_date), ymd(end_date), by='1 days')
seq_dates <- seq_dates %>%
  as_tibble()  # as.tibble() is deprecated in favour of as_tibble()
seq_dates
seq_dates %>%
  filter(row_number() %% round(rnorm(n=1, mean=14, sd=3), 0) == 1)
Is there a way I can do this with dplyr, but select a row from the start date at a random interval every time? So from 2018-03-01 the next date might be 12 days later, then 14 days later, then 19 days later, etc?
library(dplyr)
set.seed(10)
n <- rnorm(50, 14, 3)
rows <- cumsum(round(n, 0))
diff(rows) # random ~normal increments used when selecting your rows
# [1] 13 10 12 15 15 10 13 9 13 17 16 13 17 16 14 11 13 17 15 12 7 12 8 10 13 12 11 14 13 8 14 17
# [33] 15 10 10 15 9 13 12 17 12 12 17 11 14 15 13 12 16
seq_dates %>%
slice(rows[rows <= n()])
# # A tibble: 9 x 1
# value
# <date>
# 1 2018-03-14
# 2 2018-03-27
# 3 2018-04-06
# 4 2018-04-18
# 5 2018-05-03
# 6 2018-05-18
# 7 2018-05-28
# 8 2018-06-10
# 9 2018-06-19
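Note that the result above begins at the first random offset rather than at the start date itself. If the start date should always be kept, one possible variant (a sketch only; the names gaps and picked are made up here, and pmax() is added as a guard against a zero or negative gap) is to put row 1 in front and shift the cumulative sums by one:

```r
library(dplyr)

set.seed(10)

# Daily sequence from the question, stored in a one-column tibble
seq_dates <- tibble(
  value = seq(as.Date("2018-03-01"), as.Date("2018-07-01"), by = "1 day")
)

# Random gaps in days, rounded to whole days and forced to be at least 1
gaps <- pmax(round(rnorm(50, mean = 14, sd = 3)), 1)

# Row 1 is the start date; every later row index adds one more gap
rows <- c(1, 1 + cumsum(gaps))

picked <- seq_dates %>%
  slice(rows[rows <= n()])

picked
diff(picked$value)  # the gaps (in days) that were actually used
```

Because the gaps are drawn up front, 50 draws is simply assumed to be more than enough to run past the end of the sequence; a longer date range would need more draws.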
Region  Age  Student Type  Values
A       17   Any           32
A       17   Full time     24
A       18   Any           27
A       18   Full time     19
B       17   Any           22
B       17   Full time     14
B       18   Any           80
B       18   Full time     75
I am working with this dataset in R. I am hoping to create a new row for each region and age, with student type "Part time" and a value equal to the "Any" value minus the "Full time" value. It seems I could use lag in the process, but I would prefer to be explicit and specify that it is "Any" - "Full time": this dataset is well organized, but there may be datasets where the entries are in the reverse order.
Ideally the result would look something like
Region  Age  Student Type  Values
A       17   Any           32
A       17   Full time     24
A       17   Part time     8
A       18   Any           27
A       18   Full time     19
A       18   Part time     8
B       17   Any           22
B       17   Full time     14
B       17   Part time     8
B       18   Any           80
B       18   Full time     75
B       18   Part time     5
Thank you!
You may try
library(dplyr)
df %>%
group_by(Region, Age) %>%
summarize(Student.Type = "Part time",
Values = abs(diff(Values))) %>%
rbind(., df) %>%
arrange(Region, Age, Student.Type)
Region Age Student.Type Values
<chr> <int> <chr> <int>
1 A 17 Any 32
2 A 17 Full time 24
3 A 17 Part time 8
4 A 18 Any 27
5 A 18 Full time 19
6 A 18 Part time 8
7 B 17 Any 22
8 B 17 Full time 14
9 B 17 Part time 8
10 B 18 Any 80
11 B 18 Full time 75
12 B 18 Part time 5
With dplyr, you could use group_modify() + add_row().
df %>%
group_by(Region, Age) %>%
group_modify(~ {
.x %>%
summarise(StudentType = "Part time", Values = -diff(Values)) %>%
add_row(.x, .)
}) %>%
ungroup()
# # A tibble: 12 × 4
# Region Age StudentType Values
# <chr> <int> <chr> <int>
# 1 A 17 Any 32
# 2 A 17 Full time 24
# 3 A 17 Part time 8
# 4 A 18 Any 27
# 5 A 18 Full time 19
# 6 A 18 Part time 8
# 7 B 17 Any 22
# 8 B 17 Full time 14
# 9 B 17 Part time 8
# 10 B 18 Any 80
# 11 B 18 Full time 75
# 12 B 18 Part time 5
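Both answers above rely on diff(), which depends on the row order within each group. Since the question asks for an explicit "Any" - "Full time", here is a hedged sketch that looks the two values up by label instead. The data frame is rebuilt from the question's table, the column name Student.Type is an assumption, and each group is assumed to contain exactly one "Any" row and one "Full time" row:

```r
library(dplyr)

# Example data rebuilt from the question's table
df <- data.frame(
  Region = rep(c("A", "B"), each = 4),
  Age = rep(c(17L, 18L), each = 2, times = 2),
  Student.Type = rep(c("Any", "Full time"), times = 4),
  Values = c(32L, 24L, 27L, 19L, 22L, 14L, 80L, 75L)
)

# Compute Values first: once Student.Type is overwritten inside summarise(),
# the original labels are no longer available for the lookup
part_time <- df %>%
  group_by(Region, Age) %>%
  summarise(
    Values = Values[Student.Type == "Any"] - Values[Student.Type == "Full time"],
    Student.Type = "Part time",
    .groups = "drop"
  )

result <- bind_rows(df, part_time) %>%
  arrange(Region, Age, Student.Type)

result
```

This gives the same numbers even if the "Full time" row comes before the "Any" row in some groups.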
So I have a dataset of parents and their children of the following form
Children_id Parent_id
10 1
11 1
12 1
13 2
14 2
What I want is a dataset of each child's siblings in long format, i.e.,
id sibling_id
10 11
10 12
11 10
11 12
12 10
12 11
13 14
14 13
What's the best way to achieve this, preferably using data.table?
Example data:
df <- data.frame("Children_id" = c(10,11,12,13,14), "Parent_id" = c(1,1,1,2,2))
The graph experts out there will probably have better solutions, but here is a data.table solution:
library(data.table)
library(magrittr)  # provides %>%, which data.table does not export
setDT(df)[df, on = .(Parent_id), allow.cartesian = TRUE] %>%
  .[Children_id != i.Children_id, .(id = i.Children_id, sibling = Children_id)]
Output:
id sibling
<num> <num>
1: 10 11
2: 10 12
3: 11 10
4: 11 12
5: 12 10
6: 12 11
7: 13 14
8: 14 13
In base R, we can use expand.grid after splitting
out <- do.call(rbind, lapply(split(df$Children_id, df$Parent_id), \(x)
subset(expand.grid(x, x), Var1 != Var2)[2:1]))
row.names(out) <- NULL
colnames(out) <- c("id", "sibling_id")
Output:
> out
id sibling_id
1 10 11
2 10 12
3 11 10
4 11 12
5 12 10
6 12 11
7 13 14
8 14 13
Or using data.table with CJ
library(data.table)
setDT(df)[, CJ(id = Children_id, sibling_id = Children_id),
Parent_id][id != sibling_id, .(id, sibling_id)]
id sibling_id
<num> <num>
1: 10 11
2: 10 12
3: 11 10
4: 11 12
5: 12 10
6: 12 11
7: 13 14
8: 14 13
A dplyr solution with inner_join:
library(dplyr)
inner_join(df, df, by = "Parent_id") %>%
select(id = Children_id.x, siblings = Children_id.y) %>%
filter(id != siblings)
id siblings
1 10 11
2 10 12
3 11 10
4 11 12
5 12 10
6 12 11
7 13 14
8 14 13
or another strategy:
library(dplyr)
library(tidyr)  # unnest() comes from tidyr, not dplyr
df %>%
  group_by(Parent_id) %>%
  mutate(siblings = list(unique(Children_id))) %>%
  unnest(siblings) %>%
  filter(Children_id != siblings)
I have a dataframe
data <- data.frame(v=c(15,25,24), x_val=c(12,7,2), y_val=c(6,6,18))
I want the resulting data to look like this with the data repeated in rows a specified number of times (here 2 times).
v1 x1 y1 v2 x2 y2 v3 x3 y3
15 12 6 25 7 6 24 2 18
15 12 6 25 7 6 24 2 18
I managed to get the data all in one row with the right column names but I'm not sure how to extend the column to a specified length with the values repeated. Further, how can I do this without loops? I want to run this with a larger dataset which can be quite slow with loops.
My code is below which gives the values in a single row.
r <- as.data.frame(matrix(nrow=1, ncol=1))  # placeholder column that cbind() extends
n <- 2
for (i in 1:nrow(data)) {
  datainarow <- data[i,]
  r <- cbind(r, as.data.frame(datainarow))
  colnames(r)[n] <- paste0("v",i)
  colnames(r)[n+1] <- paste0("x",i)
  colnames(r)[n+2] <- paste0("y",i)
  n <- n+3
}
Thank you!
You can use uncount() from the tidyr package.
If you already have your data in the single row format, just do:
n=4
data %>% tidyr::uncount(n)
# A tibble: 4 x 9
v1 v2 v3 x1 x2 x3 y1 y2 y3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 15 25 24 12 7 2 6 6 18
2 15 25 24 12 7 2 6 6 18
3 15 25 24 12 7 2 6 6 18
4 15 25 24 12 7 2 6 6 18
Here is one way to get that result from initial three row data frame
library(tidyverse)
n=4
data %>%
rename_all(~c("v","x","y")) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = v:y,names_sep = "") %>%
uncount(n)
This is a one-liner in base R
as.data.frame(t(as.vector(t(data))))[rep(1, 2),]
#> V1 V2 V3 V4 V5 V6 V7 V8 V9
#> 1 15 12 6 25 7 6 24 2 18
#> 1.1 15 12 6 25 7 6 24 2 18
Or if you wish to use the naming convention described, and have a more generalizable solution, you could use the following function:
expand_data <- function(data, reps) {
  df <- as.data.frame(t(as.vector(t(data))))[rep(1, reps),]
  # each original row contributes ncol(data) values, so repeat the row index that many times
  names(df) <- paste(names(data), rep(seq(nrow(data)), each = ncol(data)), sep = "_")
  rownames(df) <- NULL
  df
}
which allows:
expand_data(data, 10)
v_1 x_val_1 y_val_1 v_2 x_val_2 y_val_2 v_3 x_val_3 y_val_3
1 15 12 6 25 7 6 24 2 18
2 15 12 6 25 7 6 24 2 18
3 15 12 6 25 7 6 24 2 18
4 15 12 6 25 7 6 24 2 18
5 15 12 6 25 7 6 24 2 18
6 15 12 6 25 7 6 24 2 18
7 15 12 6 25 7 6 24 2 18
8 15 12 6 25 7 6 24 2 18
9 15 12 6 25 7 6 24 2 18
10 15 12 6 25 7 6 24 2 18
I have a panel data frame in R with many rows. I want to subset it to include only the last 10 days of each month (or the last observations falling within 10 days of the month's end). However, month lengths vary, and not every month has observations at the very end of the month. I need a subset that contains, for every month, the final 10 (or 5) days.
CIV50s = CIV50sub %>%
select(cusip, date, impl_volatility) %>%
group_by(year(date), month(date), cusip) %>%
summarize(impl_volatility = tail(impl_volatility, 1)) %>%
mutate(date = make_date(`year(date)`, `month(date)`))
I have tried this. However this only gives me the last day of the month observation. I need either the last 10 days or the last observations 10 days before the end of the month.
my dataset looks like this:
Here are two possible solutions. The first is quick but imprecise, as you can extract the day of each date and filter those from 21 onward. But this doesn't work precisely since months have different lengths.
library(dplyr)
library(lubridate)
df <- data.frame(t=seq(ymd('2018-01-01'),ymd('2019-01-01'),by='days'))
#extract day of month
df$day <- as.numeric(format(df$t,'%d'))
df %>% filter(day>=20) # can change this to 21 or other number
t day
1 2018-01-20 20
2 2018-01-21 21
3 2018-01-22 22
4 2018-01-23 23
5 2018-01-24 24
6 2018-01-25 25
7 2018-01-26 26
The other option is to add the length of each month, find the last 10 days, then filter based on the difference. Either option will work if you have missing days for the last days of each month.
df %>% mutate(month=as.numeric(format(t,'%m')),
month.length=case_when(month %in% c(1,3,5,7,8,10,12)~31,
month==2~28,
TRUE~30),
diff=month.length-day) %>%
filter(diff<=10)
t day month month.length diff
1 2018-01-21 21 1 31 10
2 2018-01-22 22 1 31 9
3 2018-01-23 23 1 31 8
4 2018-01-24 24 1 31 7
5 2018-01-25 25 1 31 6
6 2018-01-26 26 1 31 5
7 2018-01-27 27 1 31 4
8 2018-01-28 28 1 31 3
9 2018-01-29 29 1 31 2
10 2018-01-30 30 1 31 1
11 2018-01-31 31 1 31 0
12 2018-02-18 18 2 28 10
13 2018-02-19 19 2 28 9
14 2018-02-20 20 2 28 8
15 2018-02-21 21 2 28 7
16 2018-02-22 22 2 28 6
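The second option hard-codes February as 28 days, and diff<=10 keeps 11 calendar days per month (differences 0 through 10). A possible alternative, sketched here under the assumption that lubridate is acceptable, lets days_in_month() supply the month length so leap years are handled; the column names month_length and days_left are made up for this example:

```r
library(dplyr)
library(lubridate)

# Daily dates covering a leap year, so February has 29 days
df <- data.frame(t = seq(ymd("2020-01-01"), ymd("2020-12-31"), by = "days"))

last10 <- df %>%
  mutate(
    month_length = as.integer(days_in_month(t)),  # 29 for February 2020
    days_left = month_length - day(t)             # 0 on the last day of the month
  ) %>%
  filter(days_left < 10)                          # keeps the final 10 calendar days

head(last10, 12)
```

If the goal is instead the last 10 available observations of each month, whatever calendar days they fall on, grouping by year and month and then calling slice_tail(n = 10) would be the analogous approach.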
I need to find the two largest values in each row of a table and put the names of their columns in a new column. The data set is below:
ID Fail1 Fail2 Fail3 Fail4
43324 10 5 4 9
42059 12 7 6 11
43321 14 9 8 13
43414 16 11 10 15
41517 18 13 12 17
43711 20 15 14 19
55675 22 17 16 21
55769 24 19 18 23
55631 26 21 20 25
For every ID, I need the first and second largest causes of failure concatenated in a new column added to the same table.
Here is an approach which reshapes from wide to long format, picks the two max values for each ID and appends the respective column names as a new column to the original data.frame (using join):
library(data.table)
DF[melt(DF, id.var = "ID")[order(-value), .(top = toString(variable[1:2])), by = ID],
on = "ID"]
ID Fail1 Fail2 Fail3 Fail4 top
1: 55631 26 21 20 25 Fail1, Fail4
2: 55769 24 19 18 23 Fail1, Fail4
3: 55675 22 17 16 21 Fail1, Fail4
4: 43711 20 15 14 19 Fail1, Fail4
5: 41517 18 13 12 17 Fail1, Fail4
6: 43414 16 11 10 15 Fail1, Fail4
7: 43321 14 9 8 13 Fail1, Fail4
8: 42059 12 7 6 11 Fail1, Fail4
9: 43324 10 5 4 9 Fail1, Fail4
Data
library(data.table)
DF <- fread(
"ID Fail1 Fail2 Fail3 Fail4
43324 10 5 4 9
42059 12 7 6 11
43321 14 9 8 13
43414 16 11 10 15
41517 18 13 12 17
43711 20 15 14 19
55675 22 17 16 21
55769 24 19 18 23
55631 26 21 20 25"
)
Something like this, assuming your data frame is called dat:
dat$top2 = apply(dat[ , grepl("Fail", names(dat))], 1, function(r) {
paste(names(r)[which(rank(-r, ties.method="first") %in% c(1:2))], collapse=", ")
})
With ties.method="first" as written, ties are broken by column order, so exactly two column names are always returned. If you switch to ties.method="min", ties are kept instead: a tie for first gives all the columns that tie for first (even if there are more than two) and none of the columns that tie for second, while a tie for second with no tie for first gives the column with the highest value plus all columns that tie for second.
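To see how the choice of ties.method changes what gets picked, here is a small sketch on a single made-up row (the vector r and its values are invented for illustration) with a tie for second place:

```r
# One row with a tie for second place: Fail2 and Fail4 are both 9
r <- c(Fail1 = 10, Fail2 = 9, Fail3 = 4, Fail4 = 9)

# "first" breaks ties by position, so the ranks are always distinct
rank(-r, ties.method = "first")
top_first <- names(r)[rank(-r, ties.method = "first") %in% 1:2]

# "min" gives tied values the same (lowest) rank, so both tied columns are kept
rank(-r, ties.method = "min")
top_min <- names(r)[rank(-r, ties.method = "min") %in% 1:2]

top_first  # "Fail1" "Fail2"
top_min    # "Fail1" "Fail2" "Fail4"
```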
Here's a tidyverse version of @Uwe's answer:
library(tidyverse)
dat = dat %>% left_join(
dat %>%
gather(key, value, -ID) %>%
arrange(desc(value)) %>%
group_by(ID) %>%
slice(1:2) %>%
summarise(top2 = paste(key, collapse=", "))
)
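gather() still works but has been superseded in tidyr. A roughly equivalent sketch with pivot_longer() and slice_max() might look like the following; the two-row dat is a cut-down copy of the question's data, and with_ties = FALSE is an assumption that exactly two names are wanted per ID:

```r
library(dplyr)
library(tidyr)

# First two rows of the question's data, for illustration
dat <- data.frame(
  ID = c(43324, 42059),
  Fail1 = c(10, 12), Fail2 = c(5, 7), Fail3 = c(4, 6), Fail4 = c(9, 11)
)

top2 <- dat %>%
  pivot_longer(-ID, names_to = "key", values_to = "value") %>%
  group_by(ID) %>%
  slice_max(value, n = 2, with_ties = FALSE) %>%  # two largest values per ID
  summarise(top2 = paste(key, collapse = ", "))

# Join back by ID so the original row order of dat is preserved
result <- left_join(dat, top2, by = "ID")
result
```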