1 2 3 4 5
1 2013 A B 513 513
2 2013 B A 533 524
3 2013 B A 541 540
4 2013 B A 544 532
5 2013 E B 554 540
6 2014 F B 557 558
7 2014 F A 553 604
I have a tibble like the one above. How can I get the count of every combination of columns 2 and 3, so that I get 1 for A and B, 3 for B and A, 1 for E and B, and so on?
Group by those two variables and summarise. This is easy to do with the tidyverse, although I'd change the column names to text first.
library(tidyverse)
df %>%
group_by(col2, col3) %>%
summarise(count = n())
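As a minimal sketch of the same idea, dplyr's count() does the grouping and counting in one step. The column names col1 to col5 are assumptions (the original tibble has numeric names), and the data are retyped from the table in the question:

```r
library(dplyr)

# Assumed column names; the question's tibble uses 1..5 as names.
df <- data.frame(
  col1 = c(2013, 2013, 2013, 2013, 2013, 2014, 2014),
  col2 = c("A", "B", "B", "B", "E", "F", "F"),
  col3 = c("B", "A", "A", "A", "B", "B", "A"),
  col4 = c(513, 533, 541, 544, 554, 557, 553),
  col5 = c(513, 524, 540, 532, 540, 558, 604)
)

# One row per (col2, col3) combination, with the number of occurrences
counts <- df %>% count(col2, col3, name = "count")
counts
```

For this data, counts has five rows, and the B/A combination has a count of 3.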
First, apologies if this question was asked somewhere else, but I couldn't find an answer.
In R, I have a two-column data.frame with ID and Score values.
library(dplyr)
library(magrittr)
set.seed(1235) # for reproducible example
data.frame(ID = LETTERS[1:16],
Score = round(rnorm(n=16,mean = 1200, sd = 5 ), 0),
stringsAsFactors = F) -> tmp
head(tmp)
# ID Score
# 1 A 1203
# 2 B 1198
# 3 C 1197
# 4 D 1202
# 5 E 1200
# 6 F 1190
I want to create a new column called Position with numbers from 1 to nrow(tmp) corresponding to the decreasing order of the Score column.
I can do that in base R with:
tmp[order(tmp$Score, decreasing = T), "Position"] <- 1:nrow(tmp)
head(tmp[order(tmp$Position), ])
# ID Score Position
# 1 A 1211 1
# 8 H 1210 2
# 3 C 1209 3
# 4 D 1205 4
# 5 E 1202 5
# 16 P 1202 6
But I was wondering if there's a more elegant way to do it that abides by the tidyverse principles?
I tried this, but it doesn't work and I can't understand why:
tmp %>%
mutate(Position = order(Score, decreasing = T)) %>%
arrange(Position) %>%
head()
# ID Score Position
# 1 A 1211 1
# 2 L 1200 2
# 3 C 1209 3
# 4 D 1205 4
# 5 E 1202 5
# 6 G 1188 6
Here the ordering clearly didn't work.
Thanks!
We can use row_number
library(dplyr)
tmp %>%
mutate(Position2 = row_number(-Score))
Output:
# ID Score Position Position2
#1 A 1197 12 12
#2 B 1194 16 16
#3 C 1205 3 3
#4 D 1201 8 8
#5 E 1201 9 9
#6 F 1208 1 1
#7 G 1200 10 10
#8 H 1203 5 5
#9 I 1207 2 2
#10 J 1202 6 6
#11 K 1195 15 15
#12 L 1205 4 4
#13 M 1196 13 13
#14 N 1198 11 11
#15 O 1196 14 14
#16 P 1202 7 7
where 'Position' is the column created with order() by the OP's base R code.
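As for why the order() attempt in the question fails: order() returns the indices of the elements in sorted order, not the rank of each element, so the two only agree by coincidence. A small sketch on a made-up three-element vector (not the question's data):

```r
library(dplyr)

score <- c(1197, 1208, 1205)

# Indices that would sort the vector in decreasing order
ord <- order(score, decreasing = TRUE)    # 2 3 1

# Rank of each element when sorted in decreasing order
pos <- row_number(desc(score))            # 3 1 2

# In base R, applying order() twice turns the permutation into ranks
pos_base <- order(ord)                    # 3 1 2
```

Here ord says "the largest value is element 2", while pos says "element 1 is in third place", which is what the Position column needs.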
Similar to your order logic, we can arrange the data in decreasing order and create a position column that goes from 1 to the number of rows in the data.
library(dplyr)
tmp %>%
arrange(desc(Score)) %>%
mutate(position = 1:n())
# ID Score position
#1 F 1208 1
#2 I 1207 2
#3 C 1205 3
#4 L 1205 4
#5 H 1203 5
#6 J 1202 6
#7 P 1202 7
#8 D 1201 8
#9 E 1201 9
#10 G 1200 10
#11 N 1198 11
#12 A 1197 12
#13 M 1196 13
#14 O 1196 14
#15 K 1195 15
#16 B 1194 16
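Both answers break ties by row order (for example, C and L both score 1205 but get positions 3 and 4). If tied scores should share a position instead, dplyr's min_rank() or dense_rank() may be closer to what is wanted. A sketch on a made-up vector with one tie:

```r
library(dplyr)

score <- c(1205, 1208, 1205, 1197)

r_first <- row_number(desc(score))  # 2 1 3 4: ties broken by order of appearance
r_min   <- min_rank(desc(score))    # 2 1 2 4: ties share the lowest rank, gap after
r_dense <- dense_rank(desc(score))  # 2 1 2 3: ties share a rank, no gap
```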
Let's say I have a data frame df as below. In total, there are 20 rows, and there are four types of strings in the column string: "A", "B", "C" and "D".
no string position
1 B 650
2 C 651
3 B 659
4 C 660
5 C 662
6 B 663
7 D 668
8 D 670
9 C 671
10 B 672
11 C 673
12 A 681
13 C 682
14 B 683
15 C 684
16 D 690
17 A 692
18 C 693
19 D 694
20 C 695
By subtracting the previous row's position from each value in column position, I can get a fourth column, distance, with the following command:
df$distance <- ave(df$position, FUN=function(x) c(0, diff(x)))
This gives the distance from each row to the previous row, as below:
no string position distance
1 B 650 0
2 C 651 1
3 B 659 8
4 C 660 1
5 C 662 2
6 B 663 1
7 D 668 5
8 D 670 2
9 C 671 1
10 B 672 1
11 C 673 1
12 A 681 8
13 C 682 1
14 B 683 1
15 C 684 1
16 D 690 6
17 A 692 2
18 C 693 1
19 D 694 1
20 C 695 1
However, what I want is the distance in column position from each row to the nearest previous "C", as in the changed values in rows 7, 8 and 17 below:
no string position distance
1 B 650 0
2 C 651 1
3 B 659 8
4 C 660 1
5 C 662 2
6 B 663 1
7 D 668 6
8 D 670 8
9 C 671 1
10 B 672 1
11 C 673 1
12 A 681 8
13 C 682 1
14 B 683 1
15 C 684 1
16 D 690 6
17 A 692 8
18 C 693 1
19 D 694 1
20 C 695 1
How can I do that? Also, how can I get the distance to the nearest next "C" in column string as well?
This may not be an ideal solution, and there is probably a way to simplify it.
#Taken from your code
df$distance <- ave(df$position, FUN=function(x) c(0, diff(x)))
#logical values indicating occurrence of "C"
c_occur = df$string == "C"
#We can ignore first two values in each group since,
#First value is "C" and second value is correctly calculated from previous row
#Get the indices where we need to replace the values
inds_to_replace = which(ave(seq_along(c_occur), cumsum(c_occur), FUN = seq_along) > 2)
#Get the closest occurrence of "C" from the inds_to_replace
c_to_replace <- sapply(inds_to_replace, function(x) {
new_inds <- which(c_occur)
max(new_inds[(x - new_inds) > 0])
#To get distance from "nearest next "C" replace the above line with
#new_inds[which.max(x - new_inds < 0)]
})
#Replace the values
df$distance[inds_to_replace] <- df$position[inds_to_replace] -
df$position[c_to_replace]
df[inds_to_replace, ]
# no string position distance
#7 7 D 668 6
#8 8 D 670 8
#17 17 A 692 8
The following tidyverse approach reproduces your expected output.
Problem description: Calculate the difference in position of the current row with the previous string = "C" row; if there is no previous string = "C" row or the row itself has string = "C", then the distance is given by the difference in position between the current and previous row (irrespective of string).
library(tidyverse)
df %>%
mutate(nC = cumsum(string == "C")) %>%
group_by(nC) %>%
mutate(dist = cumsum(c(0, diff(position)))) %>%
ungroup() %>%
mutate(dist = if_else(dist == 0, c(0, diff(position)), dist)) %>%
select(-nC)
## A tibble: 20 x 4
# no string position dist
# <int> <fct> <int> <dbl>
# 1 1 B 650 0.
# 2 2 C 651 1.
# 3 3 B 659 8.
# 4 4 C 660 1.
# 5 5 C 662 2.
# 6 6 B 663 1.
# 7 7 D 668 6.
# 8 8 D 670 8.
# 9 9 C 671 1.
#10 10 B 672 1.
#11 11 C 673 1.
#12 12 A 681 8.
#13 13 C 682 1.
#14 14 B 683 1.
#15 15 C 684 1.
#16 16 D 690 6.
#17 17 A 692 8.
#18 18 C 693 1.
#19 19 D 694 1.
#20 20 C 695 1.
Sample data
df <- read.table(text =
"no string position
1 B 650
2 C 651
3 B 659
4 C 660
5 C 662
6 B 663
7 D 668
8 D 670
9 C 671
10 B 672
11 C 673
12 A 681
13 C 682
14 B 683
15 C 684
16 D 690
17 A 692
18 C 693
19 D 694
20 C 695", header = T)
Here is a data.table way:
library(data.table)
dtt[, distance := c(0, diff(position))]
dtt[cumsum(string == 'C') > 0,
distance := ifelse(seq_len(.N) == 1, distance, position - position[1]),
by = cumsum(string == 'C')]
# no string position distance
# 1: 1 B 650 0
# 2: 2 C 651 1
# 3: 3 B 659 8
# 4: 4 C 660 1
# 5: 5 C 662 2
# 6: 6 B 663 1
# 7: 7 D 668 6
# 8: 8 D 670 8
# 9: 9 C 671 1
# 10: 10 B 672 1
# 11: 11 C 673 1
# 12: 12 A 681 8
# 13: 13 C 682 1
# 14: 14 B 683 1
# 15: 15 C 684 1
# 16: 16 D 690 6
# 17: 17 A 692 8
# 18: 18 C 693 1
# 19: 19 D 694 1
# 20: 20 C 695 1
Here is dtt:
dtt <- data.table::data.table(no = 1:20, string = c("B", "C", "B", "C", "C",
"B", "D", "D", "C", "B", "C", "A", "C", "B", "C", "D", "A", "C",
"D", "C"), position = c(650L, 651L, 659L, 660L, 662L, 663L, 668L,
670L, 671L, 672L, 673L, 681L, 682L, 683L, 684L, 690L, 692L, 693L,
694L, 695L))
If you want to get distance to nearest next C for non-C rows, try this:
dtt[, distance := c(0, diff(position))]
dtt[, g := rev(cumsum(rev(string == 'C')))]
dtt[g > 0, distance := ifelse(seq_len(.N) == .N, distance, abs(position - position[.N])), by = g]
dtt[, g := NULL]
# no string position distance
# 1: 1 B 650 1
# 2: 2 C 651 1
# 3: 3 B 659 1
# 4: 4 C 660 1
# 5: 5 C 662 2
# 6: 6 B 663 8
# 7: 7 D 668 3
# 8: 8 D 670 1
# 9: 9 C 671 1
# 10: 10 B 672 1
# 11: 11 C 673 1
# 12: 12 A 681 1
# 13: 13 C 682 1
# 14: 14 B 683 1
# 15: 15 C 684 1
# 16: 16 D 690 3
# 17: 17 A 692 1
# 18: 18 C 693 1
# 19: 19 D 694 1
# 20: 20 C 695 1
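For comparison, here is a base R sketch that computes both distances with findInterval(). It assumes position is sorted in increasing order, and the names c_pos, prev_c, next_c, dist_prev and dist_next are made up for illustration. Note that it is stricter than the expected output in the question: every row (including "C" rows) gets the distance to the nearest "C" strictly before or after it, and rows with no such "C" get NA rather than the row-to-row difference.

```r
df <- data.frame(
  no = 1:20,
  string = c("B", "C", "B", "C", "C", "B", "D", "D", "C", "B",
             "C", "A", "C", "B", "C", "D", "A", "C", "D", "C"),
  position = c(650, 651, 659, 660, 662, 663, 668, 670, 671, 672,
               673, 681, 682, 683, 684, 690, 692, 693, 694, 695)
)

# Positions of all "C" rows (sorted, because position is sorted)
c_pos <- df$position[df$string == "C"]

# Number of "C" positions strictly below each position (0 if none)
n_before <- findInterval(df$position, c_pos, left.open = TRUE)
# Number of "C" positions at or below each position
n_upto <- findInterval(df$position, c_pos)

# Shift by one so that "no previous C" maps to NA
prev_c <- c(NA, c_pos)[n_before + 1]
# Indexing past the end of c_pos gives NA for "no next C"
next_c <- c_pos[n_upto + 1]

df$dist_prev <- df$position - prev_c
df$dist_next <- next_c - df$position
```

Rows 7, 8 and 17 get dist_prev values of 6, 8 and 8, matching the question, and rows 6 and 7 get dist_next values of 8 and 3, matching the data.table output above.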
I want to build all possible pairs of rows in a dataframe within each level of a categorical variable name, and then compute the differences between these rows within each level of name for all non-factor variables: row 1 - row 2, row 1 - row 3, …
set.seed(9)
df <- data.frame(
ID = 1:10,
name = as.factor(rep(LETTERS, each = 4)[1:10]),
X1 = sample(1001, 10),
X2 = sample(1001, 10),
bool = sample(c(TRUE, FALSE), 10, replace = TRUE),
fruit = as.factor(sample(c("Apple", "Orange", "Kiwi"), 10, replace = TRUE))
)
This is what the sample looks like:
ID name X1 X2 bool fruit
1 1 A 222 118 FALSE Apple
2 2 A 25 9 TRUE Kiwi
3 3 A 207 883 TRUE Orange
4 4 A 216 301 TRUE Kiwi
5 5 B 443 492 FALSE Apple
6 6 B 134 499 FALSE Kiwi
7 7 B 389 401 TRUE Kiwi
8 8 B 368 972 TRUE Kiwi
9 9 C 665 356 FALSE Apple
10 10 C 985 488 FALSE Kiwi
I want to get a dataframe of 13 rows which looks like:
ID name X1 X2 bool fruit
1 1-2 A 197 109 -1 Apple
2 1-3 A 15 -765 -1 Kiwi
…
Note that the factor fruit should be unchanged, but that is a bonus: above all, I want X1 and X2 to be differenced and the factor name to be kept.
I know I could use the combn function, but I do not see how. I would prefer a solution with the dplyr package and the group_by function.
I've managed to create all differences for consecutive rows with dplyr using
varnotfac <- names(df)[!sapply(df, is.factor )] # remove factorial variable
# but not logical variable
library(dplyr)
diff <- df%>%
group_by(name) %>%
mutate_at(varnotfac, funs(. - lead(.))) %>% #
na.omit()
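On current dplyr versions, funs() is deprecated and mutate_at() is superseded. A sketch of the same consecutive-row difference with across(), assuming dplyr 1.0.0 or later and using a small made-up data frame rather than the sample above:

```r
library(dplyr)

# Made-up data: two groups of two rows each
df <- data.frame(
  ID   = 1:4,
  name = factor(c("A", "A", "B", "B")),
  X1   = c(222, 25, 443, 134)
)

# Names of the non-factor columns
varnotfac <- names(df)[!sapply(df, is.factor)]

res <- df %>%
  group_by(name) %>%
  mutate(across(all_of(varnotfac), ~ .x - lead(.x))) %>%  # row i minus row i + 1
  ungroup() %>%
  na.omit()                                               # drop the last row of each group
```

Each group keeps one row here, with X1 equal to 197 for A and 309 for B.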
I could not find out how to keep all variables using filter_if / filter_at, so I used select_at. So, building on @Axeman's answer:
set.seed(9)
varnotfac <- names(df)[!sapply(df, is.factor )] # names of non-factorial variables
diff1<- df %>%
group_by(name) %>%
select_at(vars(varnotfac)) %>%
nest() %>%
mutate(data = purrr::map(data, ~as.data.frame(map(.x, ~combn(., 2, base::diff))))) %>%
unnest()
Or with the outer function, which is way faster than combn:
set.seed(9)
varnotfac <- names(df)[!sapply(df, is.factor )] # names of non-factorial variables
allpairs <- function(v){
y <- outer(v,v,'-')
z <- y[lower.tri(y)]
return(z)
}
diff2<- df %>%
group_by(name) %>%
select_at(vars(varnotfac)) %>%
nest() %>%
mutate(data = purrr::map(data, ~as.data.frame(map(.x, ~allpairs(.))))) %>%
unnest()
One can check that the two data frames obtained are the same with:
all.equal(diff1,diff2)
[1] TRUE
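To see why the two approaches agree, here is a small check on a single vector (the X1 values of group A from the sample). The lower triangle of outer(v, v, "-") read column by column visits the pairs in the same order as combn(), and each entry is the later element minus the earlier one, which is what diff() returns for a pair:

```r
allpairs <- function(v) {
  y <- outer(v, v, "-")   # y[i, j] = v[i] - v[j]
  y[lower.tri(y)]         # keep i > j, in column-major order
}

v <- c(222, 25, 207, 216)

via_outer <- allpairs(v)
via_combn <- combn(v, 2, diff)

via_outer
# -197  -15   -6  182  191    9
```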
My sample looks different...
ID name X1 X2 bool
1 1 A 222 118 FALSE
2 2 A 25 9 TRUE
3 3 A 207 883 TRUE
4 4 A 216 301 TRUE
5 5 B 443 492 FALSE
6 6 B 134 499 FALSE
7 7 B 389 401 TRUE
8 8 B 368 972 TRUE
9 9 C 665 356 FALSE
10 10 C 985 488 FALSE
Using this, and looking here, we can do:
library(dplyr)
library(tidyr)
library(purrr)
df %>%
group_by(name) %>%
nest() %>%
mutate(data = map(data, ~as.data.frame(map(.x, ~as.numeric(dist(.)))))) %>%
unnest()
# A tibble: 13 x 5
name ID X1 X2 bool
<fct> <dbl> <dbl> <dbl> <dbl>
1 A 1 197 109 1
2 A 2 15 765 1
3 A 3 6 183 1
4 A 1 182 874 0
5 A 2 191 292 0
6 A 1 9 582 0
7 B 1 309 7 0
8 B 2 54 91 1
9 B 3 75 480 1
10 B 1 255 98 1
11 B 2 234 473 1
12 B 1 21 571 0
13 C 1 320 132 0
This is unsigned though. Alternatively:
df %>%
group_by(name) %>%
nest() %>%
mutate(data = map(data, ~as.data.frame(map(.x, ~combn(., 2, diff))))) %>%
unnest()
# A tibble: 13 x 5
name ID X1 X2 bool
<fct> <int> <int> <int> <int>
1 A 1 -197 -109 1
2 A 2 -15 765 1
3 A 3 -6 183 1
4 A 1 182 874 0
5 A 2 191 292 0
6 A 1 9 -582 0
7 B 1 -309 7 0
8 B 2 -54 -91 1
9 B 3 -75 480 1
10 B 1 255 -98 1
11 B 2 234 473 1
12 B 1 -21 571 0
13 C 1 320 132 0
Writing the title for this was more difficult than expected.
I have data that look like this:
scenario type value
1 A U 922
2 A V 291
3 A W 731
4 A X 970
5 A Y 794
6 B U 827
7 B V 10
8 B W 517
9 B X 97
10 B Y 681
11 C U 26
12 C V 410
13 C W 706
14 C X 865
15 C Y 385
16 D U 473
17 D V 561
18 D W 374
19 D X 645
20 D Y 217
21 E U 345
22 E V 58
23 E W 437
24 E X 106
25 E Y 292
What I'm trying to do is subtract the value where type == "W" from all the values in each scenario. So, for example, after this command is done, scenario A would look like this:
scenario type value
1 A U 191
2 A V -440
3 A W 0
4 A X 239
5 A Y 63
...and so forth
I figure I can use dplyr::group_by() and mutate(), but I'm not sure what to put in the mutate command.
You can do this with dplyr. Inside mutate, you can pick out the value whose type is "W" and subtract it from the original values.
library(dplyr)
df %>% group_by(scenario) %>% mutate(value = value - value[which(type == "W")])
# A tibble: 25 x 3
# Groups: scenario [5]
# scenario type value
# <fct> <fct> <int>
# 1 A U 191
# 2 A V -440
# 3 A W 0
# 4 A X 239
# 5 A Y 63
# 6 B U 310
# 7 B V -507
# 8 B W 0
# 9 B X -420
#10 B Y 164
## ... with 15 more rows
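One caveat: the indexing above assumes every scenario has exactly one "W" row. A sketch of a variant using match(), which yields NA for a scenario without a "W" row instead of failing. Scenario "Z" below is a made-up group added only to show that case:

```r
library(dplyr)

df <- data.frame(
  scenario = c(rep("A", 5), rep("B", 5), rep("Z", 2)),
  type     = c("U", "V", "W", "X", "Y",
               "U", "V", "W", "X", "Y",
               "U", "V"),                      # hypothetical group with no "W"
  value    = c(922, 291, 731, 970, 794,
               827, 10, 517, 97, 681,
               100, 200)
)

res <- df %>%
  group_by(scenario) %>%
  # match() gives the index of the first "W", or NA if there is none
  mutate(value = value - value[match("W", type)]) %>%
  ungroup()
```

Scenarios A and B give the same values as the output above, and both rows of scenario Z become NA.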