Let's say I have a data frame df as below. It has 20 rows, and the column string contains four distinct values: "A", "B", "C" and "D".
no string position
1 B 650
2 C 651
3 B 659
4 C 660
5 C 662
6 B 663
7 D 668
8 D 670
9 C 671
10 B 672
11 C 673
12 A 681
13 C 682
14 B 683
15 C 684
16 D 690
17 A 692
18 C 693
19 D 694
20 C 695
By subtracting the previous row's position from each row's position, I can get a fourth column, distance, with the following command:
df$distance <- ave(df$position, FUN=function(x) c(0, diff(x)))
This gives the distance from each row to the previous row, as below:
no string position distance
1 B 650 0
2 C 651 1
3 B 659 8
4 C 660 1
5 C 662 2
6 B 663 1
7 D 668 5
8 D 670 2
9 C 671 1
10 B 672 1
11 C 673 1
12 A 681 8
13 C 682 1
14 B 683 1
15 C 684 1
16 D 690 6
17 A 692 2
18 C 693 1
19 D 694 1
20 C 695 1
However, what I want is the distance from each row's position to the position of the nearest previous row whose string is "C". This changes rows 7, 8 and 17, as shown below:
no string position distance
1 B 650 0
2 C 651 1
3 B 659 8
4 C 660 1
5 C 662 2
6 B 663 1
7 D 668 6
8 D 670 8
9 C 671 1
10 B 672 1
11 C 673 1
12 A 681 8
13 C 682 1
14 B 683 1
15 C 684 1
16 D 690 6
17 A 692 8
18 C 693 1
19 D 694 1
20 C 695 1
How can I do this? Also, how can I get the distance to the nearest next "C" in column string?
This may not be the ideal solution, and there is probably a way to simplify it.
#Taken from your code
df$distance <- ave(df$position, FUN=function(x) c(0, diff(x)))
#logical values indicating occurrence of "C"
c_occur = df$string == "C"
#Within each run that starts at a "C", the first row is the "C" itself and the
#second row is already correct (its previous row is that "C"),
#so only rows from the third onward need to be replaced.
#ave() is applied to an integer vector so the counter is not coerced to character
inds_to_replace = which(ave(seq_along(c_occur), cumsum(c_occur), FUN = seq_along) > 2)
#Get the closest occurrence of "C" from the inds_to_replace
c_to_replace <- sapply(inds_to_replace, function(x) {
new_inds <- which(c_occur)
max(new_inds[(x - new_inds) > 0])
#To get distance from "nearest next "C" replace the above line with
#new_inds[which.max(x - new_inds < 0)]
})
#Replace the values
df$distance[inds_to_replace] <- df$position[inds_to_replace] -
df$position[c_to_replace]
df[inds_to_replace, ]
# no string position distance
#7 7 D 668 6
#8 8 D 670 8
#17 17 A 692 8
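The loop above can probably be avoided. As a minimal sketch in base R (the names is_c, last_c, use_c and dist_prev_c are made up for this illustration, and position is stored as a plain numeric vector), cummax() can carry the row index of the most recent "C" forward, which gives the nearest previous "C" for every row in one pass:

```r
# Sample data from the question
df <- data.frame(
  no = 1:20,
  string = c("B", "C", "B", "C", "C", "B", "D", "D", "C", "B",
             "C", "A", "C", "B", "C", "D", "A", "C", "D", "C"),
  position = c(650, 651, 659, 660, 662, 663, 668, 670, 671, 672,
               673, 681, 682, 683, 684, 690, 692, 693, 694, 695)
)

is_c <- df$string == "C"

# Row index of the most recent "C" at or before each row (0 if none seen yet).
# For a non-"C" row this is the nearest previous "C".
last_c <- cummax(seq_along(is_c) * is_c)

# Default: distance to the previous row (0 for the first row)
dist_prev_c <- c(0, diff(df$position))

# Non-"C" rows that have an earlier "C": measure from that "C" instead
use_c <- !is_c & last_c > 0
dist_prev_c[use_c] <- df$position[use_c] - df$position[last_c[use_c]]

df$distance <- dist_prev_c
```

Rows before the first "C" keep the plain row-to-row difference, which matches the expected output in the question.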
The following tidyverse approach reproduces your expected output.
Problem description: Calculate the difference in position of the current row with the previous string = "C" row; if there is no previous string = "C" row or the row itself has string = "C", then the distance is given by the difference in position between the current and previous row (irrespective of string).
library(tidyverse)
df %>%
mutate(nC = cumsum(string == "C")) %>%
group_by(nC) %>%
mutate(dist = cumsum(c(0, diff(position)))) %>%
ungroup() %>%
mutate(dist = if_else(dist == 0, c(0, diff(position)), dist)) %>%
select(-nC)
## A tibble: 20 x 4
# no string position dist
# <int> <fct> <int> <dbl>
# 1 1 B 650 0.
# 2 2 C 651 1.
# 3 3 B 659 8.
# 4 4 C 660 1.
# 5 5 C 662 2.
# 6 6 B 663 1.
# 7 7 D 668 6.
# 8 8 D 670 8.
# 9 9 C 671 1.
#10 10 B 672 1.
#11 11 C 673 1.
#12 12 A 681 8.
#13 13 C 682 1.
#14 14 B 683 1.
#15 15 C 684 1.
#16 16 D 690 6.
#17 17 A 692 8.
#18 18 C 693 1.
#19 19 D 694 1.
#20 20 C 695 1.
Sample data
df <- read.table(text =
"no string position
1 B 650
2 C 651
3 B 659
4 C 660
5 C 662
6 B 663
7 D 668
8 D 670
9 C 671
10 B 672
11 C 673
12 A 681
13 C 682
14 B 683
15 C 684
16 D 690
17 A 692
18 C 693
19 D 694
20 C 695", header = T)
Here is a data.table way:
library(data.table)
dtt[, distance := c(0, diff(position))]
dtt[cumsum(string == 'C') > 0,
distance := ifelse(seq_len(.N) == 1, distance, position - position[1]),
by = cumsum(string == 'C')]
# no string position distance
# 1: 1 B 650 0
# 2: 2 C 651 1
# 3: 3 B 659 8
# 4: 4 C 660 1
# 5: 5 C 662 2
# 6: 6 B 663 1
# 7: 7 D 668 6
# 8: 8 D 670 8
# 9: 9 C 671 1
# 10: 10 B 672 1
# 11: 11 C 673 1
# 12: 12 A 681 8
# 13: 13 C 682 1
# 14: 14 B 683 1
# 15: 15 C 684 1
# 16: 16 D 690 6
# 17: 17 A 692 8
# 18: 18 C 693 1
# 19: 19 D 694 1
# 20: 20 C 695 1
Here is dtt (create it before running the code above):
dtt <- data.table(no = 1:20, string = c("B", "C", "B", "C", "C",
"B", "D", "D", "C", "B", "C", "A", "C", "B", "C", "D", "A", "C",
"D", "C"), position = c(650L, 651L, 659L, 660L, 662L, 663L, 668L,
670L, 671L, 672L, 673L, 681L, 682L, 683L, 684L, 690L, 692L, 693L,
694L, 695L))
If you want the distance to the nearest next "C" for non-"C" rows, try this:
dtt[, distance := c(0, diff(position))]
dtt[, g := rev(cumsum(rev(string == 'C')))]
dtt[g > 0, distance := ifelse(seq_len(.N) == .N, distance, abs(position - position[.N])), by = g]
dtt[, g := NULL]
# no string position distance
# 1: 1 B 650 1
# 2: 2 C 651 1
# 3: 3 B 659 1
# 4: 4 C 660 1
# 5: 5 C 662 2
# 6: 6 B 663 8
# 7: 7 D 668 3
# 8: 8 D 670 1
# 9: 9 C 671 1
# 10: 10 B 672 1
# 11: 11 C 673 1
# 12: 12 A 681 1
# 13: 13 C 682 1
# 14: 14 B 683 1
# 15: 15 C 684 1
# 16: 16 D 690 3
# 17: 17 A 692 1
# 18: 18 C 693 1
# 19: 19 D 694 1
# 20: 20 C 695 1
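For readers without data.table, the same "nearest next C" idea can be sketched in base R. This is only an illustration under the assumption that "C" rows keep the plain row-to-row difference, as in the output above; the names idx, next_c, use_c and dist_next_c are invented here. A reversed cummin() carries the index of the next "C" backwards:

```r
# Sample data from the question
df <- data.frame(
  no = 1:20,
  string = c("B", "C", "B", "C", "C", "B", "D", "D", "C", "B",
             "C", "A", "C", "B", "C", "D", "A", "C", "D", "C"),
  position = c(650, 651, 659, 660, 662, 663, 668, 670, 671, 672,
               673, 681, 682, 683, 684, 690, 692, 693, 694, 695)
)

is_c <- df$string == "C"
n <- nrow(df)

# Own row index for "C" rows, n + 1 as a "no C here" marker otherwise
idx <- ifelse(is_c, seq_len(n), n + 1L)

# Index of the nearest "C" at or after each row (n + 1 if there is none)
next_c <- rev(cummin(rev(idx)))

# Default: distance to the previous row (0 for the first row)
dist_next_c <- c(0, diff(df$position))

# Non-"C" rows followed by a "C": measure forward to that "C"
use_c <- !is_c & next_c <= n
dist_next_c[use_c] <- df$position[next_c[use_c]] - df$position[use_c]
```

Trailing non-"C" rows with no later "C" would keep the row-to-row difference; the sample data has none, since it ends on a "C".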
Related
I have the following table
Type Score
B 18
A 23
A 45
B 877
A 654
B 345
A 23445
A 45
A 432
B 22
B 4566
B 2
B 346
A 889
I would like to create a new column in which the scores of the Type A rows are replaced with 0, as shown below:
Type Score New_Score
B 18 18
A 23 0
A 45 0
B 877 877
A 654 0
B 345 345
A 23445 0
A 45 0
A 432 0
B 22 22
B 4566 4566
B 2 2
B 346 346
A 889 0
I have tried quite a few things in R, but none of them worked for me. Any help would be much appreciated.
Use this:
df$New_Score <- replace(df$Score, df$Type == 'A', 0)
Check
df <- read.table(text = 'Type Score
B 18
A 23
A 45
B 877
A 654
B 345
A 23445
A 45
A 432
B 22
B 4566
B 2
B 346
A 889', header = T)
df$New_Score <- replace(df$Score, df$Type == 'A', 0)
df
Type Score New_Score
1 B 18 18
2 A 23 0
3 A 45 0
4 B 877 877
5 A 654 0
6 B 345 345
7 A 23445 0
8 A 45 0
9 A 432 0
10 B 22 22
11 B 4566 4566
12 B 2 2
13 B 346 346
14 A 889 0
Using ifelse.
transform(dat, new_score=ifelse(Type == "B", Score, 0))
# Type Score new_score
# 1 B 18 18
# 2 A 23 0
# 3 A 45 0
# 4 B 877 877
# 5 A 654 0
# 6 B 345 345
# 7 A 23445 0
# 8 A 45 0
# 9 A 432 0
# 10 B 22 22
# 11 B 4566 4566
# 12 B 2 2
# 13 B 346 346
# 14 A 889 0
Using dplyr::mutate with case_when should solve the problem, I think.
library(dplyr)
df <- data.frame(Type=c("B","A","C","D","A","B","A"), Score = c(1,2,3,4,5,6,7))
df_new <- df %>% mutate(New_Score = case_when(
Type == "A" ~ 0, # Type A rows get a score of 0
TRUE ~ Score # all other rows keep their original score
)#end of case_when
)#end of mutate
df_new
Just for fun, here is another solution:
df$New_Score <- df$Score # start from a copy of Score
df$New_Score[df$Type == "A"] <- 0 # overwrite the Type A rows with 0
Output:
Type Score New_Score
1 B 18 18
2 A 23 0
3 A 45 0
4 B 877 877
5 A 654 0
6 B 345 345
7 A 23445 0
8 A 45 0
9 A 432 0
10 B 22 22
11 B 4566 4566
12 B 2 2
13 B 346 346
14 A 889 0
data:
structure(list(Type = c("B", "A", "A", "B", "A", "B", "A", "A",
"A", "B", "B", "B", "B", "A"), Score = c(18, 23, 45, 877, 654,
345, 23445, 45, 432, 22, 4566, 2, 346, 889), New_Score = c(18,
0, 0, 877, 0, 345, 0, 0, 0, 22, 4566, 2, 346, 0)), row.names = c(NA,
-14L), class = c("tbl_df", "tbl", "data.frame"))
We can use
dat$new_score <- ifelse(dat$Type == "B", dat$Score, 0)
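The answers above differ mainly in syntax. As a minimal sketch on a shortened version of the data (only the first five rows, to keep it brief), the following checks that three base R spellings of "set the Type A scores to 0" produce the same vector. The via_mult variant is an extra illustration that does not appear in the answers; it relies on TRUE/FALSE being treated as 1/0 in arithmetic.

```r
# First five rows of the question's table
df <- data.frame(
  Type  = c("B", "A", "A", "B", "A"),
  Score = c(18, 23, 45, 877, 654)
)

# replace(): overwrite the positions where the condition holds
via_replace <- replace(df$Score, df$Type == "A", 0)

# ifelse(): choose between 0 and the original score element by element
via_ifelse <- ifelse(df$Type == "A", 0, df$Score)

# Arithmetic on the logical mask: FALSE acts as 0, TRUE as 1
via_mult <- df$Score * (df$Type != "A")

df$New_Score <- via_replace
df
```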
First, apologies if this question has been asked elsewhere, but I couldn't find an answer.
In R, I have a two-column data.frame with ID and Score values.
library(dplyr)
library(magrittr)
set.seed(1235) # for reproducible example
data.frame(ID = LETTERS[1:16],
Score = round(rnorm(n=16,mean = 1200, sd = 5 ), 0),
stringsAsFactors = F) -> tmp
head(tmp)
# ID Score
# 1 A 1203
# 2 B 1198
# 3 C 1197
# 4 D 1202
# 5 E 1200
# 6 F 1190
I want to create a new column called Position with numbers from 1 to nrow(tmp) corresponding to the decreasing order of the Score column.
I can do that in base R with:
tmp[order(tmp$Score, decreasing = T), "Position"] <- 1:nrow(tmp)
head(tmp[order(tmp$Position), ])
# ID Score Position
# 1 A 1211 1
# 8 H 1210 2
# 3 C 1209 3
# 4 D 1205 4
# 5 E 1202 5
# 16 P 1202 6
But I was wondering whether there is a more elegant way to do it that abides by tidyverse principles.
I tried this, but it doesn't work and I can't understand why:
tmp %>%
mutate(Position = order(Score, decreasing = T)) %>%
arrange(Position) %>%
head()
# ID Score Position
# 1 A 1211 1
# 2 L 1200 2
# 3 C 1209 3
# 4 D 1205 4
# 5 E 1202 5
# 6 G 1188 6
Here the ordering clearly didn't work.
Thanks!
We can use row_number
library(dplyr)
tmp %>%
mutate(Position2 = row_number(-Score))
-output
# ID Score Position Position2
#1 A 1197 12 12
#2 B 1194 16 16
#3 C 1205 3 3
#4 D 1201 8 8
#5 E 1201 9 9
#6 F 1208 1 1
#7 G 1200 10 10
#8 H 1203 5 5
#9 I 1207 2 2
#10 J 1202 6 6
#11 K 1195 15 15
#12 L 1205 4 4
#13 M 1196 13 13
#14 N 1198 11 11
#15 O 1196 14 14
#16 P 1202 7 7
where 'Position' is the column created with order() in the OP's base R code.
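As for why the mutate(Position = order(...)) attempt in the question gives the wrong numbers: order() returns the row indices that would sort the vector, not the rank of each element. The small sketch below uses a made-up four-element score vector to show the difference. Applying order() twice, or using rank() with ties.method = "first" (which, as far as I know, is what dplyr documents row_number() to be equivalent to), gives the position of each element.

```r
# Hypothetical scores, chosen so that order and rank clearly differ
score <- c(1197, 1208, 1205, 1201)

# Indices that sort the vector in decreasing order:
# "the largest is element 2, then element 3, then 4, then 1"
ord <- order(score, decreasing = TRUE)

# Position of each element in that ordering:
# "element 1 is 4th, element 2 is 1st, element 3 is 2nd, element 4 is 3rd"
pos_via_order <- order(ord)
pos_via_rank  <- rank(-score, ties.method = "first")

ord            # 2 3 4 1
pos_via_order  # 4 1 2 3
pos_via_rank   # 4 1 2 3
```

The base R assignment in the question works because it writes 1:nrow(tmp) into the rows selected by order(), which is the inverse operation.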
Similar to your order logic, we can arrange the data in decreasing order of Score and create a position column that runs from 1 to the number of rows in the data.
library(dplyr)
tmp %>%
arrange(desc(Score)) %>%
mutate(position = 1:n())
# ID Score position
#1 F 1208 1
#2 I 1207 2
#3 C 1205 3
#4 L 1205 4
#5 H 1203 5
#6 J 1202 6
#7 P 1202 7
#8 D 1201 8
#9 E 1201 9
#10 G 1200 10
#11 N 1198 11
#12 A 1197 12
#13 M 1196 13
#14 O 1196 14
#15 K 1195 15
#16 B 1194 16
I have the following dataset and I want to transform it into a transactional format.
sample_data<-data.frame(id=c(452,125,288,496,785,328,712,647),a=c(5,8,7,9,0,0,4,0),b=c(0,7,8,9,3,6,0,0),c=c(7,8,9,0,0,0,0,7),d=c(8,7,5,0,0,0,0,7))
sample_data
id a b c d
452 5 0 7 8
125 8 7 8 7
288 7 8 9 5
496 9 9 0 0
785 0 3 0 0
328 0 6 0 0
712 4 0 0 0
647 0 0 7 7
The desired output is as follows:
id item
452 a c d
125 a b c d
288 a b c d
496 a b
785 b
328 b
712 a
647 c d
How can I achieve this in R?
Is there an easier way of doing this?
Here is a tidyverse solution using pivot_longer, filter, and summarize.
library(dplyr)
library(stringr)
library(tidyr)
sample_data %>%
pivot_longer(a:d, names_to = "item") %>%
filter(value != 0) %>%
group_by(id) %>%
summarize(item = str_c(item, collapse = " "))
# A tibble: 8 x 2
id item
<dbl> <chr>
1 125 a b c d
2 288 a b c d
3 328 b
4 452 a c d
5 496 a b
6 647 c d
7 712 a
8 785 b
We can use apply to loop over the rows, take the names of the columns whose values are not 0, and paste them together; then cbind the result with the first column of the data.
cbind(sample_data[1], item = apply(sample_data[-1], 1,
function(x) paste(names(x)[x != 0], collapse = ' ')))
-output
# id item
#1 452 a c d
#2 125 a b c d
#3 288 a b c d
#4 496 a b
#5 785 b
#6 328 b
#7 712 a
#8 647 c d
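If the items are needed as separate values per id rather than as one pasted string, a variation on the row-wise idea is to build a named list first and paste only at the end. This is a sketch in base R; the names item_names, nonzero, transactions and result are introduced here for illustration. It keeps the original row order of sample_data.

```r
sample_data <- data.frame(
  id = c(452, 125, 288, 496, 785, 328, 712, 647),
  a  = c(5, 8, 7, 9, 0, 0, 4, 0),
  b  = c(0, 7, 8, 9, 3, 6, 0, 0),
  c  = c(7, 8, 9, 0, 0, 0, 0, 7),
  d  = c(8, 7, 5, 0, 0, 0, 0, 7)
)

item_names <- names(sample_data)[-1]

# Logical matrix: TRUE where an item column is non-zero for that id
nonzero <- as.matrix(sample_data[-1]) != 0

# One character vector of item names per row, named by id
transactions <- lapply(seq_len(nrow(nonzero)),
                       function(i) item_names[nonzero[i, ]])
names(transactions) <- sample_data$id

# Collapse each vector into the space-separated form shown in the question
result <- data.frame(
  id   = sample_data$id,
  item = vapply(transactions, paste, character(1),
                collapse = " ", USE.NAMES = FALSE)
)
result
```

An id whose columns are all 0 would end up with an empty string in item, which may or may not be what is wanted.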
Writing the title for this was more difficult than expected.
I have data that look like this:
scenario type value
1 A U 922
2 A V 291
3 A W 731
4 A X 970
5 A Y 794
6 B U 827
7 B V 10
8 B W 517
9 B X 97
10 B Y 681
11 C U 26
12 C V 410
13 C W 706
14 C X 865
15 C Y 385
16 D U 473
17 D V 561
18 D W 374
19 D X 645
20 D Y 217
21 E U 345
22 E V 58
23 E W 437
24 E X 106
25 E Y 292
What I'm trying to do is subtract the value from type == W from all the values in each scenario. So, for example, after this command is done, scenario A would look like this:
scenario type value
1 A U 191
2 A V -440
3 A W 0
4 A X 239
5 A Y 63
...and so forth
I figure I can use dplyr::group_by() and mutate() but I'm not sure what to put in the mutate command
You can do this with dplyr. Inside mutate, pick out the value whose type is "W" within each scenario and subtract it from the original values.
library(dplyr)
df %>% group_by(scenario) %>% mutate(value = value - value[which(type == "W")])
# A tibble: 25 x 3
# Groups: scenario [5]
# scenario type value
# <fct> <fct> <int>
# 1 A U 191
# 2 A V -440
# 3 A W 0
# 4 A X 239
# 5 A Y 63
# 6 B U 310
# 7 B V -507
# 8 B W 0
# 9 B X -420
#10 B Y 164
## ... with 15 more rows
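The same subtraction can be sketched in base R as a lookup, using only scenarios A and B from the question to keep it short. The names w, baseline and value_adj are invented for this example, and it assumes at most one "W" row per scenario (match() takes the first one it finds). A scenario with no "W" row would get NA rather than a number.

```r
# Scenarios A and B from the question
df <- data.frame(
  scenario = rep(c("A", "B"), each = 5),
  type     = rep(c("U", "V", "W", "X", "Y"), times = 2),
  value    = c(922, 291, 731, 970, 794, 827, 10, 517, 97, 681)
)

# One row per scenario holding its "W" value
w <- df[df$type == "W", c("scenario", "value")]

# Look up each row's scenario in that table to get its baseline
baseline <- w$value[match(df$scenario, w$scenario)]

df$value_adj <- df$value - baseline
df
```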
This question is very similar to Sample random rows within each group in a data.table.
The difference is in a minor subtlety that I did not have enough reputation to discuss for that question itself.
Let's change Christopher Manning's initial data a little bit:
> DT = data.table(a=c(1,1,1,1:15,1,1), b=sample(1:1000,20))
> DT
a b
1: 1 102
2: 1 5
3: 1 658
4: 1 499
5: 2 632
6: 3 186
7: 4 761
8: 5 150
9: 6 423
10: 7 832
11: 8 883
12: 9 247
13: 10 894
14: 11 141
15: 12 891
16: 13 488
17: 14 101
18: 15 677
19: 1 400
20: 1 467
If we tried the question's solution:
> DT[,.SD[sample(.N,3)],by = a]
Error in sample.int(x, size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
This is because there are values in column a that occur only once. We cannot sample 3 times for values that occur fewer than three times without using replacement (which we do not want to do).
I am struggling to deal with this scenario. We want to sample 3 times when the number of occurrences is >= 3, but pull the number of occurrences if it is < 3. For example with our DT above we would want:
a b
1: 1 102
2: 1 5
3: 1 658
4: 2 632
5: 3 186
6: 4 761
7: 5 150
8: 6 423
9: 7 832
10: 8 883
11: 9 247
12: 10 894
13: 11 141
14: 12 891
15: 13 488
16: 14 101
17: 15 677
Maybe a solution could involve sorting the data.table like this, then using rle() lengths to find out which n to use in the sample function above:
> DT <- DT[order(DT$a),]
> DT
a b
1: 1 102
2: 1 5
3: 1 658
4: 1 499
5: 1 400
6: 1 467
7: 2 632
8: 3 186
9: 4 761
10: 5 150
11: 6 423
12: 7 832
13: 8 883
14: 9 247
15: 10 894
16: 11 141
17: 12 891
18: 13 488
19: 14 101
20: 15 677
> ifelse(rle(DT$a)$lengths >= 3, 3,rle(DT$a)$lengths)
[1] 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
If we replace "3" with n, this will return how much we should sample from a=1, a=2, a=3...
I have yet to find a way to incorporate this into a final solution. Any help would be appreciated!
I might be misunderstanding your question, but are you looking for something like this?
set.seed(123)
##
DT <- data.table(
a=c(1,1,1,1:15,1,1),
b=sample(1:1000,20))
##
R> DT[,.SD[sample(.N,min(.N,3))],by = a]
a b
1: 1 288
2: 1 881
3: 1 409
4: 2 937
5: 3 46
6: 4 525
7: 5 887
8: 6 548
9: 7 453
10: 8 948
11: 9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15 42
where we draw 3 samples from b for group a_i if a_i contains three or more rows; otherwise we draw only n rows, where n (n < 3) is the size of group a_i.
Just for demonstration, here are the 6 possible values of b for a=1 that we are sampling from (assuming you use the same random seed as above):
R> DT[order(a)][1:6,]
a b
1: 1 288
2: 1 788
3: 1 409
4: 1 881
5: 1 323
6: 1 996
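The min(.N, 3) idea does not depend on data.table. As a rough base R sketch of the same capped sampling (the names rows_by_group, picked and result are made up here), the row indices can be split by group and sampled with sample.int(). Sampling index positions and then subsetting avoids the well-known surprise where sample(x) on a single number x draws from 1:x instead of returning x.

```r
set.seed(123)
a <- c(1, 1, 1, 1:15, 1, 1)
b <- sample(1:1000, 20)

# Row indices belonging to each value of a
rows_by_group <- split(seq_along(a), a)

# From each group, draw at most 3 row indices without replacement
picked <- unlist(lapply(rows_by_group, function(rows) {
  rows[sample.int(length(rows), min(length(rows), 3))]
}), use.names = FALSE)

result <- data.frame(a = a[picked], b = b[picked])

# Number of rows kept per group: 3 for a = 1, 1 for every other value
table(result$a)
```

The per-group counts are also what pmin(table(a), 3) returns, which is the quantity the rle() attempt in the question was computing.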