dplyr:: create new column with order number of another column - r

First, apologies if this question has been asked elsewhere; I couldn't find an answer.
In R, I have a two-column data.frame with ID and Score values.
library(dplyr)
library(magrittr)
set.seed(1235) # for reproducible example
data.frame(ID = LETTERS[1:16],
Score = round(rnorm(n=16,mean = 1200, sd = 5 ), 0),
stringsAsFactors = F) -> tmp
head(tmp)
# ID Score
# 1 A 1203
# 2 B 1198
# 3 C 1197
# 4 D 1202
# 5 E 1200
# 6 F 1190
I want to create a new column called Position with numbers from 1 to nrow(tmp) corresponding to the decreasing order of the Score column.
I can do that in base R with:
tmp[order(tmp$Score, decreasing = T), "Position"] <- 1:nrow(tmp)
head(tmp[order(tmp$Position), ])
# ID Score Position
# 1 A 1211 1
# 8 H 1210 2
# 3 C 1209 3
# 4 D 1205 4
# 5 E 1202 5
# 16 P 1202 6
But I was wondering whether there's a more elegant way to do it that abides by tidyverse principles.
I tried the following, but it doesn't work and I can't understand why:
tmp %>%
mutate(Position = order(Score, decreasing = T)) %>%
arrange(Position) %>%
head()
# ID Score Position
# 1 A 1211 1
# 2 L 1200 2
# 3 C 1209 3
# 4 D 1205 4
# 5 E 1202 5
# 6 G 1188 6
Here the ordering clearly didn't work.
Thanks!

We can use row_number
library(dplyr)
tmp %>%
mutate(Position2 = row_number(-Score))
Output:
# ID Score Position Position2
#1 A 1197 12 12
#2 B 1194 16 16
#3 C 1205 3 3
#4 D 1201 8 8
#5 E 1201 9 9
#6 F 1208 1 1
#7 G 1200 10 10
#8 H 1203 5 5
#9 I 1207 2 2
#10 J 1202 6 6
#11 K 1195 15 15
#12 L 1205 4 4
#13 M 1196 13 13
#14 N 1198 11 11
#15 O 1196 14 14
#16 P 1202 7 7
Here, 'Position' is the column created by the OP's base R code using order(), and 'Position2' matches it.
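As a hedged aside on why the attempt in the question misbehaves: order() returns the indices that would sort the vector, not the position each element ends up in, so the two only coincide by accident. The sketch below uses a small made-up vector (x is illustrative, not the question's data) to show the difference. Applying order() a second time, or calling rank() on the negated values, gives the positions; as far as I know, dplyr documents row_number() as equivalent to rank(ties.method = "first").

```r
# Illustrative vector; not the data from the question
x <- c(30, 10, 20, 40)

# order() answers "which element comes 1st, 2nd, ...?"
sort_idx <- order(x, decreasing = TRUE)
sort_idx
# 4 1 3 2

# A position column answers "which place does each element take?"
pos_by_order <- order(sort_idx)
pos_by_rank <- rank(-x, ties.method = "first")
pos_by_order
# 2 4 3 1
pos_by_rank
# 2 4 3 1
```

Here 40 is the largest value, so the fourth element gets position 1, whereas order() reports 4 in the first slot.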

Similar to your order logic, we can arrange the data in decreasing order of Score and create a position column that runs from 1 to the number of rows in the data.
library(dplyr)
tmp %>%
arrange(desc(Score)) %>%
mutate(position = 1:n())
# ID Score position
#1 F 1208 1
#2 I 1207 2
#3 C 1205 3
#4 L 1205 4
#5 H 1203 5
#6 J 1202 6
#7 P 1202 7
#8 D 1201 8
#9 E 1201 9
#10 G 1200 10
#11 N 1198 11
#12 A 1197 12
#13 M 1196 13
#14 O 1196 14
#15 K 1195 15
#16 B 1194 16

Related

How to convert data into transactional format in R

I have the following dataset and I want to transform it into a transactional format.
sample_data<-data.frame(id=c(452,125,288,496,785,328,712,647),a=c(5,8,7,9,0,0,4,0),b=c(0,7,8,9,3,6,0,0),c=c(7,8,9,0,0,0,0,7),d=c(8,7,5,0,0,0,0,7))
sample_data
id a b c d
452 5 0 7 8
125 8 7 8 7
288 7 8 9 5
496 9 9 0 0
785 0 3 0 0
328 0 6 0 0
712 4 0 0 0
647 0 0 7 7
The desired output is as follows:
id item
452 a c d
125 a b c d
288 a b c d
496 a b
785 b
328 b
712 a
647 c d
How can I achieve this in R?
Is there an easier way of doing this?
Here is a tidyverse solution using pivot_longer, filter, and summarize.
library(dplyr)
library(stringr)
library(tidyr)
sample_data %>%
pivot_longer(a:d, names_to = "item") %>%
filter(value != 0) %>%
group_by(id) %>%
summarize(item = str_c(item, collapse = " "))
# A tibble: 8 x 2
id item
<dbl> <chr>
1 125 a b c d
2 288 a b c d
3 328 b
4 452 a c d
5 496 a b
6 647 c d
7 712 a
8 785 b
We can use apply to loop over the rows, take the names of the columns whose values are not 0, paste them together, and then cbind the result with the first column of the data:
cbind(sample_data[1], item = apply(sample_data[-1], 1,
function(x) paste(names(x)[x != 0], collapse = ' ')))
Output:
# id item
#1 452 a c d
#2 125 a b c d
#3 288 a b c d
#4 496 a b
#5 785 b
#6 328 b
#7 712 a
#8 647 c d
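To make the per-row step in the apply answer concrete, here is a minimal sketch of what the anonymous function does for a single row. The named vector row_452 is written out by hand from the first row of sample_data; the name is only for illustration.

```r
# First row of sample_data (id 452), as apply() would pass it to the function
row_452 <- c(a = 5, b = 0, c = 7, d = 8)

# Logical mask of the non-zero entries
keep <- row_452 != 0

# Keep the column names where the mask is TRUE and join them with spaces
item <- paste(names(row_452)[keep], collapse = " ")
item
# "a c d"
```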

Create multiple variables in a loop

Is there a way to create multiple variables in a loop? For example, if my data frame has a variable called 'test' (among others), how can I create a series of new variables called, say, 'test1', 'test2', ..., 'testn' that are defined as test^1, test^2, ..., test^n?
As an example
mynum <- 1:10
myletters <- letters[1:10]
mydf <- data.frame(mynum, myletters)
mydf
mynum myletters
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
for (i in 1:5)
{paste0(var, i) <- mynum^i
}
But it errors out.
I am trying to create variables like var1, var2, var3 etc which are mynum^1, mynum^2, mynum^3 etc.
Best regards
Deepak
You can use lapply to create new columns and combine them using do.call + cbind.
n <- 1:5
mydf[paste0('var', n)] <- do.call(cbind, lapply(n, function(x) mydf$mynum^x))
mydf
# mynum myletters var1 var2 var3 var4 var5
#1 1 a 1 1 1 1 1
#2 2 b 2 4 8 16 32
#3 3 c 3 9 27 81 243
#4 4 d 4 16 64 256 1024
#5 5 e 5 25 125 625 3125
#6 6 f 6 36 216 1296 7776
#7 7 g 7 49 343 2401 16807
#8 8 h 8 64 512 4096 32768
#9 9 i 9 81 729 6561 59049
#10 10 j 10 100 1000 10000 100000
Or with purrr's map_dfc
mydf[paste0('var', n)] <- purrr::map_dfc(n, ~mydf$mynum^.x)
Try this. Because mydf already has two columns, each new variable has to go in the next free column position, which is why I use i+2 in the loop. Here is the code:
#Data
mynum <- 1:10
myletters <- letters[1:10]
mydf <- data.frame(mynum, myletters,stringsAsFactors = F)
The loop:
#Loop
for (i in 1:5)
{
mydf[,i+2] <- mydf[,'mynum']^i
names(mydf)[i+2] <- paste0('var',i)
}
Output:
mynum myletters var1 var2 var3 var4 var5
1 1 a 1 1 1 1 1
2 2 b 2 4 8 16 32
3 3 c 3 9 27 81 243
4 4 d 4 16 64 256 1024
5 5 e 5 25 125 625 3125
6 6 f 6 36 216 1296 7776
7 7 g 7 49 343 2401 16807
8 8 h 8 64 512 4096 32768
9 9 i 9 81 729 6561 59049
10 10 j 10 100 1000 10000 100000
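The loop in the question fails because the left-hand side of <- has to be a name (or a replacement call), not a string built by paste0(). A minimal sketch of the same loop that assigns each column by name with [[<- rather than by position, reusing the var prefix from the question:

```r
mydf <- data.frame(mynum = 1:10, myletters = letters[1:10])

for (i in 1:5) {
  # Build the column name as a string and assign into the data frame by name
  mydf[[paste0("var", i)]] <- mydf$mynum^i
}

names(mydf)
# "mynum" "myletters" "var1" "var2" "var3" "var4" "var5"
```

Assigning by name means the loop does not depend on how many columns the data frame already has.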
An option with map
library(dplyr)
library(purrr)
library(stringr) # provides str_replace
map_dfc(1:5, ~ mydf$mynum^.x) %>%
rename_all(~ str_replace(., '\\.+', 'var')) %>%
bind_cols(mydf, .)

Subtract value to the nearest specific string in another column

Let's say I have a data frame df as below. In total, there are 20 rows, and column string contains four types of strings: "A", "B", "C" and "D".
no string position
1 B 650
2 C 651
3 B 659
4 C 660
5 C 662
6 B 663
7 D 668
8 D 670
9 C 671
10 B 672
11 C 673
12 A 681
13 C 682
14 B 683
15 C 684
16 D 690
17 A 692
18 C 693
19 D 694
20 C 695
By subtracting the previous row's position from the current row's, I can get a fourth column, distance, with the following command:
df$distance <- ave(df$position, FUN=function(x) c(0, diff(x)))
This gives the distance from each row to the previous row:
no string position distance
1 B 650 0
2 C 651 1
3 B 659 8
4 C 660 1
5 C 662 2
6 B 663 1
7 D 668 5
8 D 670 2
9 C 671 1
10 B 672 1
11 C 673 1
12 A 681 8
13 C 682 1
14 B 683 1
15 C 684 1
16 D 690 6
17 A 692 2
18 C 693 1
19 D 694 1
20 C 695 1
However, what I want is the distance in column position from each row to the nearest previous "C" in column string, which changes rows 7, 8 and 17 as shown below:
no string position distance
1 B 650 0
2 C 651 1
3 B 659 8
4 C 660 1
5 C 662 2
6 B 663 1
7 D 668 6
8 D 670 8
9 C 671 1
10 B 672 1
11 C 673 1
12 A 681 8
13 C 682 1
14 B 683 1
15 C 684 1
16 D 690 6
17 A 692 8
18 C 693 1
19 D 694 1
20 C 695 1
How can I do this? Also, how can I get the distance to the nearest next "C" in column string?
This may not be the ideal solution, and there may be a way to simplify it.
#Taken from your code
df$distance <- ave(df$position, FUN=function(x) c(0, diff(x)))
#logical values indicating occurrence of "C"
c_occur = df$string == "C"
#We can ignore first two values in each group since,
#First value is "C" and second value is correctly calculated from previous row
#Get the indices where we need to replace the values
inds_to_replace = which(ave(seq_along(df$string), cumsum(c_occur), FUN = seq_along) > 2)
#Get the closest occurrence of "C" from the inds_to_replace
c_to_replace <- sapply(inds_to_replace, function(x) {
new_inds <- which(c_occur)
max(new_inds[(x - new_inds) > 0])
#To get distance from "nearest next "C" replace the above line with
#new_inds[which.max(x - new_inds < 0)]
})
#Replace the values
df$distance[inds_to_replace] <- df$position[inds_to_replace] -
df$position[c_to_replace]
df[inds_to_replace, ]
# no string position distance
#7 7 D 668 6
#8 8 D 670 8
#17 17 A 692 8
The following tidyverse approach reproduces your expected output.
Problem description: Calculate the difference in position of the current row with the previous string = "C" row; if there is no previous string = "C" row or the row itself has string = "C", then the distance is given by the difference in position between the current and previous row (irrespective of string).
library(tidyverse)
df %>%
mutate(nC = cumsum(string == "C")) %>%
group_by(nC) %>%
mutate(dist = cumsum(c(0, diff(position)))) %>%
ungroup() %>%
mutate(dist = if_else(dist == 0, c(0, diff(position)), dist)) %>%
select(-nC)
## A tibble: 20 x 4
# no string position dist
# <int> <fct> <int> <dbl>
# 1 1 B 650 0.
# 2 2 C 651 1.
# 3 3 B 659 8.
# 4 4 C 660 1.
# 5 5 C 662 2.
# 6 6 B 663 1.
# 7 7 D 668 6.
# 8 8 D 670 8.
# 9 9 C 671 1.
#10 10 B 672 1.
#11 11 C 673 1.
#12 12 A 681 8.
#13 13 C 682 1.
#14 14 B 683 1.
#15 15 C 684 1.
#16 16 D 690 6.
#17 17 A 692 8.
#18 18 C 693 1.
#19 19 D 694 1.
#20 20 C 695 1.
Sample data
df <- read.table(text =
"no string position
1 B 650
2 C 651
3 B 659
4 C 660
5 C 662
6 B 663
7 D 668
8 D 670
9 C 671
10 B 672
11 C 673
12 A 681
13 C 682
14 B 683
15 C 684
16 D 690
17 A 692
18 C 693
19 D 694
20 C 695", header = T)
Here is a data.table way (dtt is defined further down):
library(data.table)
dtt[, distance := c(0, diff(position))]
dtt[cumsum(string == 'C') > 0,
distance := ifelse(seq_len(.N) == 1, distance, position - position[1]),
by = cumsum(string == 'C')]
# no string position distance
# 1: 1 B 650 0
# 2: 2 C 651 1
# 3: 3 B 659 8
# 4: 4 C 660 1
# 5: 5 C 662 2
# 6: 6 B 663 1
# 7: 7 D 668 6
# 8: 8 D 670 8
# 9: 9 C 671 1
# 10: 10 B 672 1
# 11: 11 C 673 1
# 12: 12 A 681 8
# 13: 13 C 682 1
# 14: 14 B 683 1
# 15: 15 C 684 1
# 16: 16 D 690 6
# 17: 17 A 692 8
# 18: 18 C 693 1
# 19: 19 D 694 1
# 20: 20 C 695 1
Here is dtt:
dtt <- data.table::data.table(no = 1:20, string = c("B", "C", "B", "C", "C",
"B", "D", "D", "C", "B", "C", "A", "C", "B", "C", "D", "A", "C",
"D", "C"), position = c(650L, 651L, 659L, 660L, 662L, 663L, 668L,
670L, 671L, 672L, 673L, 681L, 682L, 683L, 684L, 690L, 692L, 693L,
694L, 695L))
If you want to get distance to nearest next C for non-C rows, try this:
dtt[, distance := c(0, diff(position))]
dtt[, g := rev(cumsum(rev(string == 'C')))]
dtt[g > 0, distance := ifelse(seq_len(.N) == .N, distance, abs(position - position[.N])), by = g]
dtt[, g := NULL]
# no string position distance
# 1: 1 B 650 1
# 2: 2 C 651 1
# 3: 3 B 659 1
# 4: 4 C 660 1
# 5: 5 C 662 2
# 6: 6 B 663 8
# 7: 7 D 668 3
# 8: 8 D 670 1
# 9: 9 C 671 1
# 10: 10 B 672 1
# 11: 11 C 673 1
# 12: 12 A 681 1
# 13: 13 C 682 1
# 14: 14 B 683 1
# 15: 15 C 684 1
# 16: 16 D 690 3
# 17: 17 A 692 1
# 18: 18 C 693 1
# 19: 19 D 694 1
# 20: 20 C 695 1
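For comparison, here is a base R sketch of the "nearest previous C" rule using findInterval(). It is not taken from any of the answers above; it assumes position is strictly increasing, and the helper names c_pos, idx, prev_c and use_c are made up for illustration. Rows that are themselves "C", or that have no earlier "C", keep the plain row-to-row difference, as in the expected output of the question.

```r
string <- c("B", "C", "B", "C", "C", "B", "D", "D", "C", "B",
            "C", "A", "C", "B", "C", "D", "A", "C", "D", "C")
position <- c(650, 651, 659, 660, 662, 663, 668, 670, 671, 672,
              673, 681, 682, 683, 684, 690, 692, 693, 694, 695)

# Positions of all "C" rows (sorted, because position is increasing)
c_pos <- position[string == "C"]

# Index of the last "C" strictly before each row (0 if there is none)
idx <- findInterval(position, c_pos, left.open = TRUE)
prev_c <- c(NA, c_pos)[idx + 1]

# Start from the row-to-row difference, then overwrite the non-"C" rows
distance <- c(0, diff(position))
use_c <- string != "C" & !is.na(prev_c)
distance[use_c] <- position[use_c] - prev_c[use_c]

distance[c(7, 8, 17)]
# 6 8 8
```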

Subtracting a specific row's value from other values in a dplyr group_by() tbl

Writing the title for this was more difficult than expected.
I have data that look like this:
scenario type value
1 A U 922
2 A V 291
3 A W 731
4 A X 970
5 A Y 794
6 B U 827
7 B V 10
8 B W 517
9 B X 97
10 B Y 681
11 C U 26
12 C V 410
13 C W 706
14 C X 865
15 C Y 385
16 D U 473
17 D V 561
18 D W 374
19 D X 645
20 D Y 217
21 E U 345
22 E V 58
23 E W 437
24 E X 106
25 E Y 292
What I'm trying to do is subtract the value from type == W from all the values in each scenario. So, for example, after this command is done, scenario A would look like this:
scenario type value
1 A U 191
2 A V -440
3 A W 0
4 A X 239
5 A Y 63
...and so forth
I figure I can use dplyr::group_by() and mutate() but I'm not sure what to put in the mutate command
You can do this with dplyr. Inside mutate(), look up the value on the row where type is "W" within each scenario, then subtract it from every value in that group.
library(dplyr)
df %>% group_by(scenario) %>% mutate(value = value - value[which(type == "W")])
# A tibble: 25 x 3
# Groups: scenario [5]
# scenario type value
# <fct> <fct> <int>
# 1 A U 191
# 2 A V -440
# 3 A W 0
# 4 A X 239
# 5 A Y 63
# 6 B U 310
# 7 B V -507
# 8 B W 0
# 9 B X -420
#10 B Y 164
## ... with 15 more rows
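The same idea can be sketched in base R with a named lookup vector. This is an alternative to the answer above, not its method, and it assumes each scenario has exactly one type == "W" row. Only scenarios A and B from the question are included; the names w and value_rel are illustrative.

```r
df <- data.frame(
  scenario = rep(c("A", "B"), each = 5),
  type     = rep(c("U", "V", "W", "X", "Y"), times = 2),
  value    = c(922, 291, 731, 970, 794, 827, 10, 517, 97, 681)
)

# One reference value per scenario, named by scenario
is_w <- df$type == "W"
w <- df$value[is_w]
names(w) <- df$scenario[is_w]

# Look up each row's reference value by scenario name and subtract it
df$value_rel <- df$value - unname(w[df$scenario])
df$value_rel
# 191 -440 0 239 63 310 -507 0 -420 164
```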

Convert an r dataframe to correct format to use rep

I have a data frame of the form
A = data.frame(c(1485,1486,1701,1808))
names(A) <- c("ID")
and a second data frame of the form
B = data.frame(1:12)
names(B) <- "value"
I want to be able to use this with rep to form a second column in B such that I have
B$new <- rep(A,each = 3, length.out = 12)
giving
> B
value new
1 1 1485
2 2 1485
3 3 1485
4 4 1486
5 5 1486
6 6 1486
7 7 1701
8 8 1701
9 9 1701
10 10 1808
11 11 1808
12 12 1808
This works fine if I define A = c(1485,1486,1701,1808), but because A is a data frame it does not. How do I convert A into the correct form to use with rep? I have tried as.list, as.vector and as.integer without success.
As A is a data frame, you need to specify which column you want to repeat (here, ID).
B$new <- rep(A$ID,each = 3, length.out = 12)
B
# value new
#1 1 1485
#2 2 1485
#3 3 1485
#4 4 1486
#5 5 1486
#6 6 1486
#7 7 1701
#8 8 1701
#9 9 1701
#10 10 1808
#11 11 1808
#12 12 1808
In your case, this would also work without the length.out argument:
rep(A$ID,each = 3)
It repeats every ID in A three times, giving the same result.
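A short sketch of why the column has to be extracted first: a data frame is a list of columns, so rep() cannot repeat its values element by element, while the extracted column is a plain atomic vector that rep() handles as intended. The variable name new_col is only for illustration.

```r
A <- data.frame(ID = c(1485, 1486, 1701, 1808))

is.list(A)        # TRUE: a data frame is a list of columns
is.atomic(A$ID)   # TRUE: the extracted column is a plain numeric vector

# Repeat each ID three times; A[["ID"]] is equivalent to A$ID
new_col <- rep(A[["ID"]], each = 3)
length(new_col)
# 12
```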
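If A is a data frame, you can also pass it to rep directly and flatten the result. Note that sort() produces the grouped layout here only because the IDs are already in increasing order.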
A = data.frame(c(1485,1486,1701,1808))
names(A) <- c("ID")
B = data.frame(1:12)
names(B) <- "value"
B$new <- sort(unlist(rep(A,times=3)))
print(B)
value new
1 1 1485
2 2 1485
3 3 1485
4 4 1486
5 5 1486
6 6 1486
7 7 1701
8 8 1701
9 9 1701
10 10 1808
11 11 1808
12 12 1808
