I would like to pivot the variables nclaims, npatients, and nproviders so that they show up underneath groups.
I believe I should be using pivot_longer, but my attempt below doesn't work.
library(tidyr)
ptype <- c(0,1,2,0,1)
groups <- c(rep(1,3), rep(2,2))
nclaims <- c(10,23,32,12,8)
nproviders <- c(2,4,5,1,1)
npatients <- c(8, 20, 29, 9, 6)
dta <- data.frame(ptype=ptype, groups=groups, nclaims=nclaims, nproviders=nproviders, npatients=npatients)
table <- pivot_longer(everything(dta), names_to = "groups", values_to=c("nclaims", "npatients", "nproviders"))
Desired output: the three count variables stacked into a single name column within each value of groups, with one column per ptype and 0 where a combination is missing (as in the result shown below).
We need to use pivot_longer, then pivot_wider:
dta %>%
  pivot_longer(nclaims:npatients) %>%
  # values_fill = 0 changes NA values to 0, as in your desired result
  pivot_wider(names_from = ptype, values_from = value,
              values_fill = 0)
groups name `0` `1` `2`
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 nclaims 10 23 32
2 1 nproviders 2 4 5
3 1 npatients 8 20 29
4 2 nclaims 12 8 0
5 2 nproviders 1 1 0
6 2 npatients 9 6 0
Another approach, using reshape2::recast():
library( reshape2 )
recast( dta, groups + variable ~ ptype, id.var = c("ptype", "groups") )
# groups variable 0 1 2
# 1 1 nclaims 10 23 32
# 2 1 nproviders 2 4 5
# 3 1 npatients 8 20 29
# 4 2 nclaims 12 8 NA
# 5 2 nproviders 1 1 NA
# 6 2 npatients 9 6 NA
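If you also want the zero-filled version from this approach, my understanding is that recast() forwards extra arguments on to dcast(), so passing dcast()'s fill argument should work; treat this as an untested assumption about the forwarding, continuing from the block above:
# fill = 0 is dcast()'s argument; this assumes recast() passes it through to dcast()
recast( dta, groups + variable ~ ptype, id.var = c("ptype", "groups"), fill = 0 )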
I have this data frame:
id <- c(0,1,2,3,4)
groupA_sample1_values <- c(10,11,12,13,14)
groupA_sample2_values <- c(20,21,22,23,24)
groupA_sample3_values <- c(30,31,32,33,34)
groupB_sample1_values <- c(40,41,42,43,44)
groupB_sample2_values <- c(50,51,52,53,54)
groupB_sample3_values <- c(60,61,62,63,64)
df <- data.frame(id,
                 groupA_sample1_values,
                 groupA_sample2_values,
                 groupA_sample3_values,
                 groupB_sample1_values,
                 groupB_sample2_values,
                 groupB_sample3_values)
df
and I am trying to obtain another table with these columns:
id, group, sample, value.
I believe I would have to extract the group name (groupA/groupB) with a regex, do the same for the sample number, and then melt it into a new data frame, but I'm not sure how to approach it.
Any help?
Try:
library( tidyverse )
df %>%
  pivot_longer( -id,
                names_to = c("group", "sample"),
                names_pattern = "group(.)_sample(.)_values",
                values_to = "value" )
# # A tibble: 30 x 4
# id group sample value
# <dbl> <chr> <chr> <dbl>
# 1 0 A 1 10
# 2 0 A 2 20
# 3 0 A 3 30
# 4 0 B 1 40
# 5 0 B 2 50
# 6 0 B 3 60
# 7 1 A 1 11
# 8 1 A 2 21
# 9 1 A 3 31
#10 1 B 1 41
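If you would rather have sample (or group) converted from character during the reshape, recent tidyr versions also accept a names_transform argument in pivot_longer(); a small variation on the answer above, where the list form of names_transform is the only new piece:
library( tidyverse )
df %>%
  pivot_longer( -id,
                names_to = c("group", "sample"),
                names_pattern = "group(.)_sample(.)_values",
                names_transform = list(sample = as.integer),
                values_to = "value" )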
I am struggling with a seemingly easy question. I have a data frame with 1 column and n rows (n is a multiple of 3). I would like to add a second column of integers like 1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,... How can I achieve this with dplyr as a general solution for any number of rows (always a multiple of 3)?
I tried this:
library(dplyr)
df <- tibble(Col1 = c(1:12)) %>%
  mutate(Col2 = rep(1:4, each = 3))
This works, but I would like a general solution for any number of rows with each = 3. Many thanks!
You can specify the each and length.out parameters in rep:
library(dplyr)
tibble(Col1 = c(1:12)) %>%
  mutate(Col2 = rep(row_number(), each = 3, length.out = n()))
# Col1 Col2
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
# 6 6 2
# 7 7 3
# 8 8 3
# 9 9 3
#10 10 4
#11 11 4
#12 12 4
We can use gl from base R:
library(dplyr)
df %>%
  mutate(col2 = as.integer(gl(n(), 3, n())))
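gl(n, k, length) generates a factor with n levels, each repeated k times and truncated to length, and as.integer() turns the levels into plain integers. A quick standalone check of what this produces for a 12-row case (12 standing in for n(); my addition, not part of the original answer):
as.integer(gl(12, 3, 12))
# [1] 1 1 1 2 2 2 3 3 3 4 4 4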
Integer division by 3 (%/% 3) over the sequence 0, 1, 2, ..., n results in 0, 0, 0, 1, 1, 1, ..., so adding 1 generates the desired sequence automatically; this will also do:
df %>% mutate(col2 = 1+ (row_number()-1) %/% 3)
# A tibble: 12 x 2
Col1 col2
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 4 2
5 5 2
6 6 2
7 7 3
8 8 3
9 9 3
10 10 4
11 11 4
12 12 4
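To see the integer-division step on its own, a quick base-R check (just an illustration of the arithmetic described above):
(0:11) %/% 3
# [1] 0 0 0 1 1 1 2 2 2 3 3 3
1 + (0:11) %/% 3
# [1] 1 1 1 2 2 2 3 3 3 4 4 4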
I've got a dataset with a number of vars (t01-t05 in the dummy example, but many more in the real dataset). I calculate the pred variable as the proportion of target == 1, i.e. the count of target == 1 divided by n(), for every group-level combination (the 5th element of ns_by_group_list). However, if the total number of people in that combination (the s variable) is less than 6, I need to use the pred value from the equivalent t01-t04 combination (the 4th element of ns_by_group_list); if that one is also less than 6, then from the t01-t03 combination (the 3rd element of ns_by_group_list), and so on. The final output should look like ns_by_group_list[[5]], but with pred values coming from different elements of ns_by_group_list.
I was thinking of renaming the pred and s vars in the different list elements to pred1, pred2, ..., pred5, pulling everything together into one data.frame, and then writing a long case_when statement... But surely there's a better/more elegant way to do it?
library(tibble)
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)
## functions ####
create_t_labels <- function(n) {
  paste0('t', str_pad(1:n, 2, 'left', '0'))
}
ns_by_group <- function(group_vars) {
  input %>%
    group_by_at(.vars = vars(group_vars)) %>%
    summarise(n = n()) %>% # total number of people in each group
    ungroup() %>%
    spread(key = target, value = n) %>%
    mutate(`0` = replace_na(`0`, 0),
           n = replace_na(`1`, 0),
           s = n + `0`,
           pred = round(n/s, 3)) %>%
    select(-c(`1`, `0`))
}
### input data ####
set.seed(1)
input <- tibble(
  target = sample(0:1, 50, replace = TRUE),
  t01 = sample(1:3, 50, replace = TRUE),
  t02 = rep(1:2, each = 25),
  t03 = rep(1:5, each = 10),
  t04 = rep(1, 50),
  t05 = rep(1:2, each = 25)
)
## calculations ####
group_combo_list <- map(1:5, create_t_labels)
group_combo_list <- map(group_combo_list, function(x) c(x, 'target'))
ns_by_group_list <- map(group_combo_list, ns_by_group)
Recursively joining and replacing, folding over the list from the last (most detailed) element backward:
reduce(
  ns_by_group_list,
  ~ {
    # .y is the accumulated (more detailed) table, .x the next less detailed one;
    # join them on the t-columns that .x contains
    left_join(.y, .x, by = grep("^t\\d+$", names(.x), value = TRUE),
              suffix = c("", ".replacement")) %>%
      mutate(pred = if_else(s < 6, pred.replacement, pred),
             s = if_else(s < 6, s.replacement, s)) %>%
      select(-ends_with(".replacement"))
  },
  .dir = "backward"
)
# # A tibble: 16 x 8
# t01 t02 t03 t04 t05 n s pred
# <int> <int> <int> <dbl> <int> <dbl> <dbl> <dbl>
# 1 1 1 1 1 1 1 16 0.562
# 2 1 1 2 1 1 1 16 0.562
# 3 1 2 3 1 2 2 12 0.583
# 4 1 2 4 1 2 4 6 0.667
# 5 1 2 5 1 2 1 12 0.583
# 6 2 1 1 1 1 3 13 0.385
# 7 2 1 2 1 1 2 6 0.333
# 8 2 1 3 1 1 0 13 0.385
# 9 2 2 4 1 2 1 6 0.5
# 10 2 2 5 1 2 2 6 0.5
# 11 3 1 1 1 1 0 8 0.125
# 12 3 1 2 1 1 1 8 0.125
# 13 3 1 3 1 1 0 8 0.125
# 14 3 2 3 1 2 0 7 0.714
# 15 3 2 4 1 2 1 7 0.714
# 16 3 2 5 1 2 4 7 0.714
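In case the fold order is unclear: with .dir = "backward", purrr::reduce() starts from the last element of the list, so in the formula above .y holds the accumulated (most detailed) table and .x the next, less detailed one. A tiny sketch of that calling pattern (my illustration, not part of the original answer):
library(purrr)
# backward reduction nests from the right: f(a, f(b, c)), with .y carrying the accumulated value
reduce(list("a", "b", "c"), ~ paste0("f(", .x, ", ", .y, ")"), .dir = "backward")
# [1] "f(a, f(b, c))"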
I want to compute the minimum distance between the current row and every row before it within each group. My data frame has several groups, and each group has multiple dates with longitude and latitude. I use a Haversine function to compute distance, and I need to apply this function as described above. The data frame looks like the following:
grp date long lat rowid
1 1 1995-07-01 11 12 1
2 1 1995-07-05 3 0 2
3 1 1995-07-09 13 4 3
4 1 1995-07-13 4 25 4
5 2 1995-03-07 12 6 1
6 2 1995-03-10 3 27 2
7 2 1995-03-13 34 8 3
8 2 1995-03-16 25 9 4
My current attempt uses purrrlyr::by_row, but the method is too slow. In practice, each group has thousands of dates and geographic positions. Here is part of my current attempt:
calc_min_distance <- function(df, grp.name, row){
  df %>%
    filter(
      grp_name == grp.name
    ) %>%
    filter(
      row_number() <= row
    ) %>%
    mutate(
      last.lat = last(lat),
      last.long = last(long),
      rowid = 1:n()
    ) %>%
    group_by(rowid) %>%
    purrrlyr::by_row(
      ~haversinedistance.fnct(.$last.long, .$last.lat, .$long, .$lat),
      .collate = 'rows',
      .to = 'min.distance'
    ) %>%
    filter(
      row_number() < n()
    ) %>%
    summarise(
      min = min(min.distance)
    ) %>%
    .$min
}
df_dist <-
  df %>%
  group_by(grp_name) %>%
  mutate(rowid = 1:n()) %>%
  group_by(grp_name, rowid) %>%
  purrrlyr::by_row(
    ~calc_min_distance(df, .$grp_name, .$rowid),
    .collate = 'rows',
    .to = 'min.distance'
  ) %>%
  ungroup %>%
  select(-rowid)
For illustration, suppose that distance is simply defined as (lat + long) of the reference row minus (lat + long) of each earlier row. My expected output for grp 1 is then the following:
grp date long lat rowid min.distance
1 1 1995-07-01 11 12 1 0
2 1 1995-07-05 3 0 2 -20
3 1 1995-07-09 13 4 3 -6
4 1 1995-07-13 4 25 4 6
How can I quickly compute the minimum distance between the current rowid and all rowids before it?
Here's how I would go about it. You need to calculate all the within-group pairwise distances anyway, so we'll use geosphere::distm, which is designed to do just that. I'd suggest stepping through the function line by line and looking at what it does; I think it will make sense.
library(geosphere)
find_min_dist_above = function(long, lat, fun = distHaversine) {
  # all pairwise distances within the group
  d = distm(x = cbind(long, lat), fun = fun)
  # blank out the diagonal and lower triangle so that column j only keeps
  # distances from row j to the rows before it
  d[lower.tri(d, diag = TRUE)] = NA
  # the first row has no earlier rows; define its minimum distance as 0
  d[1, 1] = 0
  return(apply(d, MAR = 2, min, na.rm = TRUE))
}
df %>%
  group_by(grp) %>%
  mutate(min.distance = find_min_dist_above(long, lat))
# # A tibble: 8 x 6
# # Groups: grp [2]
# grp date long lat rowid min.distance
# <int> <fct> <int> <int> <int> <dbl>
# 1 1 1995-07-01 11 12 1 0
# 2 1 1995-07-05 3 0 2 1601842.
# 3 1 1995-07-09 13 4 3 917395.
# 4 1 1995-07-13 4 25 4 1623922.
# 5 2 1995-03-07 12 6 1 0
# 6 2 1995-03-10 3 27 2 2524759.
# 7 2 1995-03-13 34 8 3 2440596.
# 8 2 1995-03-16 25 9 4 997069.
Using this data:
df = read.table(text = ' grp date long lat rowid
1 1 1995-07-01 11 12 1
2 1 1995-07-05 3 0 2
3 1 1995-07-09 13 4 3
4 1 1995-07-13 4 25 4
5 2 1995-03-07 12 6 1
6 2 1995-03-10 3 27 2
7 2 1995-03-13 34 8 3
8 2 1995-03-16 25 9 4', h = TRUE)
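As a cross-check against the toy (lat + long) definition in the question, here is a sketch of the same "minimum over all previous rows" logic without geosphere; min_toy_dist_above is a hypothetical helper name of my own, not part of the answer above:
library(dplyr)
# toy distance: (lat + long) of the current row minus (lat + long) of each earlier row
min_toy_dist_above <- function(long, lat) {
  s <- long + lat
  vapply(seq_along(s),
         function(i) if (i == 1) 0 else min(s[i] - s[seq_len(i - 1)]),
         numeric(1))
}
df %>%
  group_by(grp) %>%
  mutate(min.distance = min_toy_dist_above(long, lat))
# for grp 1 this reproduces the expected 0, -20, -6, 6 shown in the question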
I am working with gait-cycle data. I have 8 events marked for each ID and gait trial. The values "LFCH" and "RFCH" occur twice in each trial, as these represent the beginning and the end of the gait cycle for the left and right leg.
Sample Data Frame:
df <- data.frame(ID = rep(1:5, each = 16),
                 Gait_nr = rep(1:2, each = 8, times = 5),
                 Frame = rep(c(1,5,7,9,10,15,22,25), times = 10),
                 Marks = rep(c("LFCH", "LHL", "RFCH", "LTO", "RHL", "LFCH", "RTO", "RFCH"), times = 10))
head(df, 8)
ID Gait_nr Frame Marks
1 1 1 1 LFCH
2 1 1 5 LHL
3 1 1 7 RFCH
4 1 1 9 LTO
5 1 1 10 RHL
6 1 1 15 LFCH
7 1 1 22 RTO
8 1 1 25 RFCH
I would like to create something like
Total_gait_left = Frame[The last time Marks == "LFCH"] - Frame[The first time Marks == "LFCH"]
My current code solves the problem, but it depends on the position of the Frame values rather than on the actual values in Marks. Any individual who does not follow the normal gait pattern will get wrong values from this code.
library(tidyverse)
l <- df %>% group_by(ID, Gait_nr) %>% filter(grepl("L.+", Marks)) %>%
  summarize(Total_gait = Frame[4] - Frame[1],
            Side = "left")
r <- df %>% group_by(ID, Gait_nr) %>% filter(grepl("R.+", Marks)) %>%
  summarize(Total_gait = Frame[4] - Frame[1],
            Side = "right")
val <- union(l,r, by=c("ID", "Gait_nr", "Side")) %>% arrange(ID, Gait_nr, Side)
Can you help me make my code more robust by changing e.g. Frame[4] to something like Frame[the last time Marks == "LFCH"]?
If both LFCH and RFCH occur exactly twice per group, you can subset Frame inside summarise and use diff:
df %>%
  group_by(ID, Gait_nr) %>%
  summarise(
    left = diff(Frame[Marks == 'LFCH']),
    right = diff(Frame[Marks == 'RFCH'])
  )
# A tibble: 10 x 4
# Groups: ID [?]
# ID Gait_nr left right
# <int> <int> <dbl> <dbl>
# 1 1 1 14 18
# 2 1 2 14 18
# 3 2 1 14 18
# 4 2 2 14 18
# 5 3 1 14 18
# 6 3 2 14 18
# 7 4 1 14 18
# 8 4 2 14 18
# 9 5 1 14 18
#10 5 2 14 18
We can use first and last from the dplyr package.
library(dplyr)
df2 <- df %>%
  filter(Marks %in% "LFCH") %>%
  group_by(ID, Gait_nr) %>%
  summarise(Total_gait = last(Frame) - first(Frame)) %>%
  ungroup()
df2
# # A tibble: 10 x 3
# ID Gait_nr Total_gait
# <int> <int> <dbl>
# 1 1 1 14
# 2 1 2 14
# 3 2 1 14
# 4 2 2 14
# 5 3 1 14
# 6 3 2 14
# 7 4 1 14
# 8 4 2 14
# 9 5 1 14
# 10 5 2 14
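If you also want the right side from the first()/last() approach, one possible extension (a sketch of my own, not from the original answer) keeps both FCH marks and derives the side from the mark itself:
library(dplyr)
df %>%
  filter(Marks %in% c("LFCH", "RFCH")) %>%
  # after the filter, every remaining row is either LFCH or RFCH
  mutate(Side = ifelse(Marks == "LFCH", "left", "right")) %>%
  group_by(ID, Gait_nr, Side) %>%
  summarise(Total_gait = last(Frame) - first(Frame)) %>%
  ungroup()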