I have a data frame and wish to sort specific columns alphabetically in dplyr. I know I can use the code below to sort all columns, but I would only like to sort columns C, B and A alphabetically. I tried using the across function as I would effectively like to select columns C:A, but this did not work.
df <- data.frame(1:16)
df$Testinfo1 <- 1
df$Band <- 1
df$Alpha <- 1
df$C <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$B <- c(10,0,0,0,12,12,12,12,0,14,NA_real_,14,16,16,16,16)
df$A <- c(1,1,1,1,1,1,1,1,1,1,1,14,NA_real_,NA_real_,NA_real_,16)
df
df %>%
select(sort(names(.)))
A Alpha B Band C Testinfo1 X1.16
1: 1 1 10 1 10 1 1
2: 1 1 0 1 12 1 2
3: 1 1 0 1 14 1 3
4: 1 1 0 1 16 1 4
5: 1 1 12 1 10 1 5
6: 1 1 12 1 12 1 6
7: 1 1 12 1 14 1 7
8: 1 1 12 1 16 1 8
9: 1 1 0 1 10 1 9
10: 1 1 14 1 12 1 10
11: 1 1 NA 1 14 1 11
12: 14 1 14 1 16 1 12
13: NA 1 16 1 10 1 13
14: NA 1 16 1 12 1 14
15: NA 1 16 1 14 1 15
16: 16 1 16 1 16 1 16
My desired output is below:
X1.16 Testinfo1 Band Alpha A B C
1: 1 1 1 1 1 10 10
2: 2 1 1 1 1 0 12
3: 3 1 1 1 1 0 14
4: 4 1 1 1 1 0 16
5: 5 1 1 1 1 12 10
6: 6 1 1 1 1 12 12
7: 7 1 1 1 1 12 14
8: 8 1 1 1 1 12 16
9: 9 1 1 1 1 0 10
10: 10 1 1 1 1 14 12
11: 11 1 1 1 1 NA 14
12: 12 1 1 1 14 14 16
13: 13 1 1 1 NA 16 10
14: 14 1 1 1 NA 16 12
15: 15 1 1 1 NA 16 14
16: 16 1 1 1 16 16 16
You can use relocate() (from dplyr 1.0.0 onwards):
library(dplyr)
vars <- c("C", "B", "A")
df %>%
relocate(all_of(sort(vars)), .after = last_col())
If you are passing a character vector of names, wrap it in all_of(), which errors if any of the variables are missing, or in any_of(), which silently ignores missing ones.
You can do
sortcols <- c("A","B","C")
library(dplyr)
df %>%
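select(-all_of(sortcols), all_of(sort(sortcols)))
The -all_of(sortcols) part selects everything except the columns you want to sort; the sorted columns are then placed after those. Wrapping the character vector in all_of() avoids passing an external vector directly to select(), which is ambiguous and is flagged by recent versions of tidyselect.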
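For comparison, the same "everything else first, then the sorted columns" idea can be sketched in base R without any packages. The small df below is a made-up stand-in for the question's data, and the name `other` is only for illustration:

```r
# Toy stand-in for the question's data frame (values are arbitrary)
df <- data.frame(Testinfo1 = 1, Band = 1, C = 3, B = 2, A = 1)
sortcols <- c("C", "B", "A")

# Keep every other column in its current order, then append the sorted ones
other <- setdiff(names(df), sortcols)
res <- df[c(other, sort(sortcols))]
names(res)
```

Like the select() version, this always moves the sorted columns to the end of the data frame.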
A base R option, for the case where the columns that you want to sort are not at the end of the data frame.
To illustrate, we add a new column D whose position should not change.
df$D <- 1:16
cols_to_sort <- c('A', 'B', 'C')
inds <- match(cols_to_sort, names(df))
cols <- seq_along(df)
cols[cols %in% inds] <- inds
df[cols]
# X1.16 Testinfo1 Band Alpha A B C D
#1 1 1 1 1 1 10 10 1
#2 2 1 1 1 1 0 12 2
#3 3 1 1 1 1 0 14 3
#4 4 1 1 1 1 0 16 4
#5 5 1 1 1 1 12 10 5
#6 6 1 1 1 1 12 12 6
#7 7 1 1 1 1 12 14 7
#8 8 1 1 1 1 12 16 8
#9 9 1 1 1 1 0 10 9
#10 10 1 1 1 1 14 12 10
#11 11 1 1 1 1 NA 14 11
#12 12 1 1 1 14 14 16 12
#13 13 1 1 1 NA 16 10 13
#14 14 1 1 1 NA 16 12 14
#15 15 1 1 1 NA 16 14 15
#16 16 1 1 1 16 16 16 16
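The index trick above can be wrapped in a small function, shown here as a sketch. The name sort_cols_in_place and the toy data are hypothetical, and the function assumes every name in cols exists in the data frame. Unlike the snippet above, it does not need cols to be passed in alphabetical order:

```r
# Hypothetical helper: sort the named columns among the slots they already occupy
sort_cols_in_place <- function(df, cols) {
  slots <- sort(match(cols, names(df)))       # positions currently held by `cols`
  idx <- seq_along(df)
  idx[slots] <- match(sort(cols), names(df))  # fill those slots in alphabetical order
  df[idx]
}

# Toy data: C, B and A are interleaved with columns that must stay put
toy <- data.frame(id = 1, C = 3, B = 2, x = 0, A = 1)
res <- sort_cols_in_place(toy, c("C", "B", "A"))
names(res)
```

Columns that are not listed (id and x here) keep their original positions.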
I have this tibble
# Data
set.seed(1)
x <- tibble(values = round(rnorm(20, 10, 10), 0),
index = c(0,0,1,1,1,0,1,0,1,1,1,1,1,1,0,
1,1,0,0,0))
x
#> # A tibble: 20 x 2
#> values index
#> <dbl> <dbl>
#> 1 4 0
#> 2 12 0
#> 3 2 1
#> 4 26 1
#> 5 13 1
#> 6 2 0
#> 7 15 1
#> 8 17 0
#> 9 16 1
#> 10 7 1
#> 11 25 1
#> 12 14 1
#> 13 4 1
#> 14 -12 1
#> 15 21 0
#> 16 10 1
#> 17 10 1
#> 18 19 0
#> 19 18 0
#> 20 16 0
I'd like to create groups of rows where the values in the index column are consecutive ones. The final aim is to compute the sum of values for each group.
The expected tibble is something like:
# A tibble: 20 x 3
values index group
<dbl> <dbl> <chr>
1 4 0 NA
2 12 0 NA
3 2 1 A
4 26 1 A
5 13 1 A
6 2 0 NA
7 15 1 B
8 17 0 NA
9 16 1 C
10 7 1 C
11 25 1 C
12 14 1 C
13 4 1 C
14 -12 1 C
15 21 0 NA
16 10 1 D
17 10 1 D
18 19 0 NA
19 18 0 NA
20 16 0 NA
Thank you in advance for your advice.
You could use cumsum() on the runs identified by rle(), replacing the values where index is zero with NA. LETTERS has only 26 elements, so if there are more than 26 IDs it will need a minor modification.
library(dplyr)
x2 <- x %>%
mutate(id = LETTERS[replace(with(rle(index),
rep(cumsum(values), lengths)), index == 0, NA)])
Giving:
# A tibble: 20 x 3
values index id
<dbl> <dbl> <chr>
1 4 0 NA
2 12 0 NA
3 2 1 A
4 26 1 A
5 13 1 A
6 2 0 NA
7 15 1 B
8 17 0 NA
9 16 1 C
10 7 1 C
11 25 1 C
12 14 1 C
13 4 1 C
14 -12 1 C
15 21 0 NA
16 10 1 D
17 10 1 D
18 19 0 NA
19 18 0 NA
20 16 0 NA
To sum the values:
x2 %>%
group_by(id) %>%
summarise(sv = sum(values))
# A tibble: 5 x 2
id sv
* <chr> <dbl>
1 A 41
2 B 15
3 C 54
4 D 20
5 NA 109
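One possible form of that modification, sketched in base R, is to keep the run ids as integers instead of indexing into LETTERS. The vectors below are copied from the tibble printed in the question; the names runs, run_id and sums are only for illustration:

```r
values <- c(4, 12, 2, 26, 13, 2, 15, 17, 16, 7, 25, 14, 4, -12, 21, 10, 10, 19, 18, 16)
index  <- c(0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0)

runs   <- rle(index)
run_id <- cumsum(runs$values == 1)   # number of runs of ones seen so far
id     <- rep(run_id, runs$lengths)  # expand back to one id per row
id[index == 0] <- NA                 # rows outside a run of ones get no group

sums <- tapply(values, id, sum)      # tapply() drops the NA group
sums
```

Integer ids have no upper limit, and the per-group sums agree with the A to D rows above.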
An option with data.table
library(data.table)
setDT(x)[, group := LETTERS[as.integer(factor((NA^!index) *rleid(index)))]]
x
# values index group
# 1: 4 0 <NA>
# 2: 12 0 <NA>
# 3: 2 1 A
# 4: 26 1 A
# 5: 13 1 A
# 6: 2 0 <NA>
# 7: 15 1 B
# 8: 17 0 <NA>
# 9: 16 1 C
#10: 7 1 C
#11: 25 1 C
#12: 14 1 C
#13: 4 1 C
#14: -12 1 C
#15: 21 0 <NA>
#16: 10 1 D
#17: 10 1 D
#18: 19 0 <NA>
#19: 18 0 <NA>
#20: 16 0 <NA>
Or similar logic in dplyr
library(dplyr)
library(data.table) # provides rleid()
x %>%
mutate(group = LETTERS[as.integer(factor((NA^!index) *rleid(index)))])
# A tibble: 20 x 3
# values index group
# <dbl> <dbl> <chr>
# 1 4 0 <NA>
# 2 12 0 <NA>
# 3 2 1 A
# 4 26 1 A
# 5 13 1 A
# 6 2 0 <NA>
# 7 15 1 B
# 8 17 0 <NA>
# 9 16 1 C
#10 7 1 C
#11 25 1 C
#12 14 1 C
#13 4 1 C
#14 -12 1 C
#15 21 0 <NA>
#16 10 1 D
#17 10 1 D
#18 19 0 <NA>
#19 18 0 <NA>
#20 16 0 <NA>
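The `NA^!index` idiom is terse, so here is a small base R sketch of what each piece evaluates to. The short index vector is made up, and the rle() expression is only a stand-in for data.table::rleid():

```r
index <- c(0, 1, 1, 0, 1)

mask <- NA^!index   # NA where index is 0, 1 where index is 1 (NA^0 is 1 in R)
# Base R stand-in for rleid(): one integer per run of equal values
run <- with(rle(index), rep(seq_along(values), lengths))
# factor() renumbers the surviving run ids as 1, 2, ... before indexing LETTERS
grp <- LETTERS[as.integer(factor(mask * run))]
grp
```

Multiplying by the mask blanks out the runs of zeros, and factor() closes the gaps they leave in the numbering.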
I would like to write code to compute, within each group, the sum of lagged differences shown in the table below:
ID x rank U R Required Output Value
1 1 1 U1 R1 -
1 1 2 U2 R2 R2-U1
1 1 3 U3 R3 (R3-U2) + (R3-U1)
1 1 4 U4 R4 (R4-U3) + (R4-U2) + (R4-U1)
1 0 5 U5 R5 R5
1 0 6 U6 R6 R6
1 0 7 U7 R7 R7
2 1 1 U8 R8 -
2 1 2 U9 R9 R9-U8
2 1 3 U10 R10 (R10-U9) + (R10 - U8)
2 1 4 U11 R11 (R11-U10) + (R11 - U9) + (R11 - U8)
3 1 1 U12 R12 -
3 0 2 U13 R13 R13
3 0 3 U14 R14 R14
ID is the unique group identifier. x is a boolean; depending on its value, the required output is either the sum of differences with the previous values or the same-period value of R. "rank" is a rank-ordering column, and the maximum rank can differ from group to group. "U" and "R" are the main columns of interest.
To give a numerical example, I need the following:
ID x rank U R Required Output Value
1 1 1 10 7 -
1 1 2 9 11 1
1 1 3 10 10 1 + 0 = 1
1 1 4 11 13 3+4+3 = 10
1 0 5 7 8 8
1 0 6 8 8 8
1 0 7 5 7 7
2 1 1 3 2 -
2 1 2 9 15 12
2 1 3 13 14 16
2 1 4 1 14 17
3 1 1 12 1 -
3 0 2 14 9 9
3 0 3 1 11 11
R code to generate this table:
ID = c(rep(1,7),rep(2,4),rep(3,3))
x = c(rep(1,4),rep(0,3),rep(1,5),rep(0,2))
rank = c(1:7,1:4,1:3)
U = c(10,9,10,11,7,8,5,3,9,13,1,12,14,1)
R = c(7,11,10,13,8,8,7,2,15,14,14,1,9,11)
dat = cbind(ID,x,rank,U,R)
colnames(dat)=c("ID","x","rank","U","R")
Here's a tidyverse solution:
library(dplyr)
library(tidyr)
dat %>%
as_tibble() %>%
group_by(ID) %>%
mutate(output = ifelse(x, lag(rank) * R - lag(cumsum(U)), R))
Result:
# A tibble: 14 x 6
# Groups: ID [3]
ID x rank U R output
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 10 7 NA
2 1 1 2 9 11 1
3 1 1 3 10 10 1
4 1 1 4 11 13 10
5 1 0 5 7 8 8
6 1 0 6 8 8 8
7 1 0 7 5 7 7
8 2 1 1 3 2 NA
9 2 1 2 9 15 12
10 2 1 3 13 14 16
11 2 1 4 1 14 17
12 3 1 1 12 1 NA
13 3 0 2 14 9 9
14 3 0 3 1 11 11
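The one-liner works because the sum of (R_k - U_j) over all earlier rows j equals (k - 1) * R_k minus the cumulative sum of U up to row k - 1. A base R sketch that checks this against a brute-force computation, using the x == 1 rows of ID 1 from the question (the names brute and closed are only for illustration):

```r
U <- c(10, 9, 10, 11)
R <- c(7, 11, 10, 13)
k <- seq_along(R)

# Brute force: for each row, sum R[k] - U[j] over every earlier row j
brute <- sapply(k, function(i) if (i == 1) NA else sum(R[i] - U[seq_len(i - 1)]))

# Closed form: (k - 1) * R[k] minus the lagged cumulative sum of U
closed <- (k - 1) * R - c(NA, head(cumsum(U), -1))

rbind(brute, closed)
```

Both rows come out as NA, 1, 1, 10, matching the required output for ID 1.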
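Here is a base R solution using ave(). seq_along() is used rather than seq() so that a run of length one is handled correctly:
dat <- as.data.frame(dat)  # cbind() produced a matrix; within() needs a data frame
dat <- within(dat, output <- ave(R, ID, x, FUN = function(v) v * (seq_along(v) - 1)) -
ave(U, ID, x, FUN = function(v) c(NA, cumsum(v)[-length(v)])))
dat <- within(dat, output <- ifelse(x == 0, R, output))  # rows with x == 0 keep R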
such that
> dat
ID x rank U R output
1 1 1 1 10 7 NA
2 1 1 2 9 11 1
3 1 1 3 10 10 1
4 1 1 4 11 13 10
5 1 0 5 7 8 8
6 1 0 6 8 8 8
7 1 0 7 5 7 7
8 2 1 1 3 2 NA
9 2 1 2 9 15 12
10 2 1 3 13 14 16
11 2 1 4 1 14 17
12 3 1 1 12 1 NA
13 3 0 2 14 9 9
14 3 0 3 1 11 11
I have a data frame with two columns:
v1<- c(1,1,1,0,0,1,1,2,2,2,0,0,0,2,1,1,0,1,0,2)
v2<- c(5,5,10,-1,-5,9,7,6,1,5,3,-4,7,-6,-3,-1,7,1,5,3)
df<- data.frame(v1=v1, v2=v2)
> df
v1 v2
1 1 5
2 1 5
3 1 10
4 0 -1
5 0 -5
6 1 9
7 1 7
8 2 6
9 2 1
10 2 5
11 0 3
12 0 -4
13 0 7
14 2 -6
15 1 -3
16 1 -1
17 0 7
18 1 1
19 0 5
20 2 3
What I'm trying to do is replace values in v2 according to the following rule:
If there are successive 0s in v1 (only runs of two or more, so 1,0,1 won't count but 1,0,0,1 will, and so on), every v2 in that run should equal the v2 value at the first 0 of the run.
For example:
> df[3:6,]
v1 v2
3 1 10
4 0 -1
5 0 -5
6 1 9
#Must become
> df[3:6,]
v1 v2
3 1 10
4 0 -1
5 0 -1
6 1 9
Or also :
> df[10:14,]
v1 v2
10 2 5
11 0 3
12 0 -4
13 0 7
14 2 -6
#Must become
> df[10:14,]
v1 v2
10 2 5
11 0 3
12 0 3
13 0 3
14 2 -6
We can create the group with rleid (from data.table) and replace 'v2' with the first value of 'v2' only when all the values in 'v1' are 0
library(dplyr)
library(data.table)
df %>%
group_by(grp = rleid(v1)) %>%
mutate(v2 = if(all(v1 == 0)) first(v2) else v2) %>%
ungroup %>%
select(-grp)
# A tibble: 20 x 2
# v1 v2
# <dbl> <dbl>
# 1 1 5
# 2 1 5
# 3 1 10
# 4 0 -1
# 5 0 -1
# 6 1 9
# 7 1 7
# 8 2 6
# 9 2 1
#10 2 5
#11 0 3
#12 0 3
#13 0 3
#14 2 -6
#15 1 -3
#16 1 -1
#17 0 7
#18 1 1
#19 0 5
#20 2 3
Or using data.table (from @IceCreamToucan's comments)
library(data.table)
setDT(df)[, v2 := if(first(v1) == 0) first(v2) else v2, rleid(v1)]
Here is a solution with base R, where rle() and split() are used:
dfs <- split(df,findInterval(1:nrow(df),cumsum((r <- with(df,rle(v1)))$lengths),left.open = TRUE))
df <- Reduce(rbind,{dfs[r$values==0] <- Map(function(x) {x[2]<-head(x[2],1);x},dfs[r$values==0]);dfs})
which gives
> df
v1 v2
1 1 5
2 1 5
3 1 10
4 0 -1
5 0 -1
6 1 9
7 1 7
8 2 6
9 2 1
10 2 5
11 0 3
12 0 3
13 0 3
14 2 -6
15 1 -3
16 1 -1
17 0 7
18 1 1
19 0 5
20 2 3
DATA
v1<- c(1,1,1,0,0,1,1,2,2,2,0,0,0,2,1,1,0,1,0,2)
v2<- c(5,5,10,-1,-5,9,7,6,1,5,3,-4,7,-6,-3,-1,7,1,5,3)
df<- data.frame(v1=v1, v2=v2)
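A shorter base R sketch of the same run-based idea, offered as an alternative rather than as any answerer's method: rle() builds a run id, ave() picks the first v2 in each run, and only rows where v1 is 0 are overwritten. The names r, run and first_in_run are only for illustration:

```r
v1 <- c(1, 1, 1, 0, 0, 1, 1, 2, 2, 2, 0, 0, 0, 2, 1, 1, 0, 1, 0, 2)
v2 <- c(5, 5, 10, -1, -5, 9, 7, 6, 1, 5, 3, -4, 7, -6, -3, -1, 7, 1, 5, 3)
df <- data.frame(v1 = v1, v2 = v2)

r   <- rle(df$v1)
run <- rep(seq_along(r$lengths), r$lengths)   # one integer id per run of v1

# First v2 of every run, repeated along the run
first_in_run <- ave(df$v2, run, FUN = function(v) v[1])

# Overwrite v2 only inside runs of zeros; a lone 0 is replaced by itself
df$v2 <- ifelse(df$v1 == 0, first_in_run, df$v2)
df$v2
```

A single 0 forms a run of length one, so its v2 is left unchanged, which is why 1,0,1 needs no special case.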
I have a data frame like this (with many more rows):
id act_l_n pas_l_n act_q_p pas_q_p act_l_p pas_l_p act_q_n pas_q_n
1 14 8 14 10 21 11 21 11
2 19 9 11 17 22 11 20 11
Every column name contains information about 3 variables separated by '_' (each has 2 levels named act/pas, l/q, n/p). Values are scores corresponding to each combination of variables (i.e. 1 of 8 conditions).
I need to move 3 variables to 3 separate columns, mark their levels by digits, and move corresponding value to separate column called "score". So from 1st row of current data frame I'd get something like this:
id score actpas lq pn
1 14 1 1 1
1 8 2 1 1
1 14 1 2 2
1 10 2 2 2
1 21 1 1 2
1 11 2 1 2
1 21 1 2 1
1 11 2 2 1
I've tried wrangling this with dplyr using the gather and separate functions, but I can't really get what I need. Help with dplyr would be most appreciated!
If I understand correctly:
df<-read.table(textConnection(
"id,act_l_n,pas_l_n,act_q_p,pas_q_p,act_l_p,pas_l_p,act_q_n,pas_q_n
1,14,8,14,10,21,11,21,11
2,19,9,11,17,22,11,20,11"),
header=TRUE,sep=",")
library(tidyr)
library(dplyr)
gather(df,k,score,-id) %>% mutate(v1=1+as.integer(substr(k,1,3)=="pas")
,v2=1+as.integer(substr(k,5,5)=="q")
,v3=1+as.integer(substr(k,7,7)=="p")) %>%
select(-2) %>% arrange(id)
# id score v1 v2 v3
#1 1 14 1 1 1
#2 1 8 2 1 1
#3 1 14 1 2 2
#4 1 10 2 2 2
#5 1 21 1 1 2
#6 1 11 2 1 2
#7 1 21 1 2 1
#8 1 11 2 2 1
#9 2 19 1 1 1
#10 2 9 2 1 1
#11 2 11 1 2 2
#12 2 17 2 2 2
#13 2 22 1 1 2
#14 2 11 2 1 2
#15 2 20 1 2 1
#16 2 11 2 2 1
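The substr() calls rely on every column name having the same fixed widths. As an alternative sketch in base R (not the answerer's method), the names can be split on "_" and each part mapped to a level number with match(). The column names actpas, lq and pn come from the question; long, key and parts are only for illustration:

```r
df <- data.frame(id = 1:2,
                 act_l_n = c(14, 19), pas_l_n = c(8, 9),
                 act_q_p = c(14, 11), pas_q_p = c(10, 17),
                 act_l_p = c(21, 22), pas_l_p = c(11, 11),
                 act_q_n = c(21, 20), pas_q_n = c(11, 11))

score_cols <- setdiff(names(df), "id")

# Stack the score columns: one row per id and original column
long <- data.frame(id    = rep(df$id, times = length(score_cols)),
                   key   = rep(score_cols, each = nrow(df)),
                   score = unlist(df[score_cols], use.names = FALSE),
                   stringsAsFactors = FALSE)

# Split "act_l_n" into its three parts and number the levels
parts <- do.call(rbind, strsplit(long$key, "_", fixed = TRUE))
long$actpas <- match(parts[, 1], c("act", "pas"))
long$lq     <- match(parts[, 2], c("l", "q"))
long$pn     <- match(parts[, 3], c("n", "p"))

long <- long[order(long$id), c("id", "score", "actpas", "lq", "pn")]
head(long, 8)
```

Splitting on the separator keeps working if the level labels later differ in length.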
Given the following first two columns (id and time_diff), I want to generate the 'block' column
test
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5
The data is already sorted by id and time. The time_diff was computed based on the difference of the previous time and the time value for the row, given the same id. I want to create a block id which is an auto-increment value and increases when a new ID or a time_diff of >10 with the same id is encountered.
How can I achieve this in R?
Importing your data as a data frame with something like:
df = read.table(text='
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5')
You can do a one-liner like this to get occurrences satisfying your two conditions:
> new_col = as.vector(cumsum(
na.exclude(
c(F,diff(as.numeric(as.factor(df$id)))) | # change of id OR
df$time_diff > 10 # time_diff greater than 10
)
))
> new_col
[1] 0 0 0 0 0 1 2 2 2 2 3 3 4 4 4
And finally append this new column to your dataframe with cbind:
> cbind(df, block = c(0,new_col))
id time_diff block block
1 a NA 1 0
2 a 1 1 0
3 a 1 1 0
4 a 1 1 0
5 a 3 1 0
6 a 3 1 0
7 b NA 2 1
8 b 11 3 2
9 b 1 3 2
10 b 1 3 2
11 b 1 3 2
12 b 12 4 3
13 b 1 4 3
14 c NA 5 4
15 c 4 5 4
16 c 7 5 4
You will notice an offset between your wanted block variable and mine (mine starts at 0 rather than 1): correcting it is easy and can be done at several different steps, so I will leave it to you :)
Another variation of @Jealie's method would be:
with(test, cumsum(c(TRUE,id[-1]!=id[-nrow(test)])|time_diff>10))
#[1] 1 1 1 1 1 1 2 3 3 3 3 4 4 5 5 5
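This variation needs no NA handling because, in R, NA | TRUE is TRUE, and time_diff is only NA on rows where the id changes. A step-by-step base R sketch, with the vectors copied from the question (the names new_id, big_gap and starts are only for illustration):

```r
id <- c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b", "b", "c", "c", "c")
time_diff <- c(NA, 1, 1, 1, 3, 3, NA, 11, 1, 1, 1, 12, 1, NA, 4, 7)

new_id  <- c(TRUE, id[-1] != id[-length(id)])  # TRUE on the first row of each id
big_gap <- time_diff > 10                      # NA wherever time_diff is NA
starts  <- new_id | big_gap                    # NA | TRUE is TRUE, so no NA survives
block   <- cumsum(starts)                      # each TRUE opens a new block
block
```

If time_diff could be NA on a row where the id does not change, starts would contain NA and cumsum() would propagate it, so that case would need separate handling.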
After learning from Jealie and akrun, I came up with this idea.
mydf %>%
mutate(group = cumsum(time_diff > 10 |!duplicated(id)))
# id time_diff block group
#1 a NA 1 1
#2 a 1 1 1
#3 a 1 1 1
#4 a 1 1 1
#5 a 3 1 1
#6 a 3 1 1
#7 b NA 2 2
#8 b 11 3 3
#9 b 1 3 3
#10 b 1 3 3
#11 b 1 3 3
#12 b 12 4 4
#13 b 1 4 4
#14 c NA 5 5
#15 c 4 5 5
#16 c 7 5 5
Here is an approach using dplyr:
require(dplyr)
set.seed(999)
test <- data.frame(
id = rep(letters[1:4], each = 3),
time_diff = sample(4:15)
)
test %>%
mutate(
b = as.integer(factor(id)) - lag(as.integer(factor(id))),  # id is character in R >= 4.0, so go through factor()
more10 = time_diff > 10,
increment = pmax(b, more10, na.rm = TRUE),
increment = ifelse(row_number() == 1, 1, increment),
block = cumsum(increment)
) %>%
select(id, time_diff, block)
Try:
> df
id time_diff
1 a NA
2 a 1
3 a 1
4 a 1
5 a 3
6 a 3
7 b NA
8 b 11
9 b 1
10 b 1
11 b 1
12 b 12
13 b 1
14 c NA
15 c 4
16 c 7
block= c(1)
for(i in 2:nrow(df))
block[i] = ifelse(df$time_diff[i]>10 || df$id[i]!=df$id[i-1],
block[i-1]+1,
block[i-1])
df$block = block
df
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5