Compare values of 2 uneven vectors and choose the bigger value - r

I have two vectors of values extracted from a .pdf file; they give the row positions of two different keywords. vector.1 holds the positions of the first keyword and vector.2 those of the second. I need to extract all the rows between each pair of positions, using something like the code below.
df.1 <- df[vector.1[1]:vector.2[1], ]
I managed to use a loop to go through all the vectors for other documents, but this particular file has uneven locations because of its structure.
vector.1 <- c(12, 85, 144, 188, 233, 285, 338, 384, 426, 469, 512, 558, 613, 669, 713, 758, 808, 859, 908, 964, 1046, 1090, 1126, 1149, 1216, 1267, 1346, 1423, 1464, 1513, 1560, 1607, 1665, 1718, 1763, 1810, 1856, 1908, 1938)
vector.2 <- c(48, 53, 111, 116, 155, 160, 198, 203, 250, 255, 303, 308, 350, 355, 392, 397, 435, 440, 478, 483, 523, 528, 578, 583, 635, 640, 679, 684, 723, 728, 773, 778, 824, 829, 871, 876, 929, 934, 1008, 1017, 1091, 1096, 1182, 1187, 1232, 1237, 1308, 1313, 1385, 1390, 1430, 1435, 1478, 1483, 1525, 1530, 1572, 1577, 1629, 1634, 1683, 1688, 1729, 1734, 1776, 1781, 1821, 1826, 1874, 1879, 1967, 1972)
As you can see, vector.1[2] is bigger than vector.2[2], and the actual end location is supposed to be vector.2[3]. Is there any way to write code that matches each vector.1[i] to the right element of vector.2, so that the desired result is something like the vector below?
vector.3 <- c(48, 111, 155, 198, 250, 303, 392, 435, ....)
Thank you!

vector.2[rowSums(outer(vector.1,vector.2,">"))+1]
[1] 48 111 155 198 250 303 350 392 435 478 523 578 635 679 723 773 824 871 929
[20] 1008 1091 1091 1182 1182 1232 1308 1385 1430 1478 1525 1572 1629 1683 1729 1776 1821 1874 1967
[39] 1967
You can also do vector.2[colSums(sapply(vector.1,">",vector.2))+1]
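To see why the outer() one-liner works, it may help to unpack it on a tiny input. This is only a sketch with shortened, made-up versions of vector.1 and vector.2, and it assumes vector.2 is sorted in increasing order, as it is in the question.

```r
vector.1 <- c(12, 85)
vector.2 <- c(48, 53, 111)

# gt[i, j] is TRUE when vector.1[i] > vector.2[j]
gt <- outer(vector.1, vector.2, ">")

# Number of vector.2 elements that lie below each start position
n_below <- rowSums(gt)

# In a sorted vector.2, the element right after those is the first one
# that is not below the start position
vector.2[n_below + 1]
```

The comparison matrix has length(vector.1) * length(vector.2) cells, which is fine for vectors of this size but grows quickly for long inputs.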

If you are trying to extract particular values from vector.2 into another vector.3, this should do, provided the wanted indexes form a sequence with a step of 2. You can generate such a sequence of indices with seq() by providing the start index, the end index and the by argument.
vector.3 <- vector.2[seq(1,length(vector.2),by=2)]
[1] 48 111 155 198 250 303 350 392 435 478
[11] 523 578 635 679 723 773 824 871 929 1008
[21] 1091 1182 1232 1308 1385 1430 1478 1525 1572 1629
[31] 1683 1729 1776 1821 1874 1967

Or do you want the smallest value of vector.2 that is larger than each value of vector.1?...
sapply(vector.1, function(x) min(vector.2[vector.2>x]))
[1] 48 111 155 198 250 303 350 392 435 478 523 578 635 679 723 773 824 871 929 1008 1091 1091 1182
[24] 1182 1232 1308 1385 1430 1478 1525 1572 1629 1683 1729 1776 1821 1874 1967 1967
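If vector.2 is sorted, the same "smallest value above each start" lookup can probably be done with findInterval(), and the result can then drive the row extraction from the question. The sketch below uses shortened vectors and a hypothetical stand-in for df; the names chunks, s and e are made up for illustration.

```r
# Hypothetical stand-in for the data frame read from the PDF
df <- data.frame(line = seq_len(200))

vector.1 <- c(12, 85, 144)
vector.2 <- c(48, 53, 111, 116, 155, 160)

# For a sorted vector.2, findInterval() counts the elements <= each start,
# so adding 1 indexes the first element strictly greater than it
vector.3 <- vector.2[findInterval(vector.1, vector.2) + 1]

# One data frame of rows per (start, end) pair
chunks <- Map(function(s, e) df[s:e, , drop = FALSE], vector.1, vector.3)
```

A start position larger than every value in vector.2 would give NA in vector.3, so that case needs a check before slicing.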

Related

Count the number of consecutive occurrences in a vector

I have the following vector:
vec <- c(28, 44, 45, 46, 47, 48, 61, 62, 70, 71, 82, 83, 104, 105, 111, 115, 125, 136, 137, 138, 146, 147, 158, 159, 160, 185, 186, 187, 188, 189, 190, 191, 192, 193, 209, 263, 264, 265, 266, 267, 268, 280, 283, 284, 308, 309, 318, 319, 324, 333, 334, 335, 347, 354)
Now I would like to count the runs of consecutive numbers in the vector that have a minimum length of two.
So here this would be valid for the following cases:
44, 45, 46, 47, 48
61, 62
70, 71
82, 83
104, 105
136, 137, 138
146, 147
158, 159, 160
185, 186, 187, 188, 189, 190, 191, 192, 193
263, 264, 265, 266, 267, 268
283, 284
308, 309
318, 319
333, 334, 335
So there are 14 cases of consecutive numbers, and I just need the integer 14 as output.
Anybody with an idea how to do that?
We can use the rle and diff functions:
a <- rle(diff(vec))
sum(a$values == 1)
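A small worked example may make the rle/diff idea easier to follow. The short vector here is made up for illustration; r, n_runs and run_sizes are hypothetical names.

```r
vec_small <- c(28, 44, 45, 46, 61, 62, 70)

# Differences between neighbours: 16 1 1 15 1 8
d <- diff(vec_small)

# rle() collapses repeated differences into (value, length) pairs
r <- rle(d)

# Every stretch of differences equal to 1 is one run of consecutive numbers
n_runs <- sum(r$values == 1)

# A stretch of k ones corresponds to k + 1 consecutive numbers
run_sizes <- r$lengths[r$values == 1] + 1L
```

Here n_runs counts the runs 44, 45, 46 and 61, 62, and run_sizes gives their lengths.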
diff and split will help
vec2 <- split(vec, cumsum(c(1, diff(vec) != 1)))
vec2[(sapply(vec2, function(x) length(x))>1)]
$`2`
[1] 44 45 46 47 48
$`3`
[1] 61 62
$`4`
[1] 70 71
$`5`
[1] 82 83
$`6`
[1] 104 105
$`10`
[1] 136 137 138
$`11`
[1] 146 147
$`12`
[1] 158 159 160
$`13`
[1] 185 186 187 188 189 190 191 192 193
$`15`
[1] 263 264 265 266 267 268
$`17`
[1] 283 284
$`18`
[1] 308 309
$`19`
[1] 318 319
$`21`
[1] 333 334 335
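Brute force:
vec <- sort(vec)
nconsecutive <- 0  # number of runs found so far
consecutive <- 0   # number of consecutive steps in the current run
p <- length(vec) - 1
for (i in seq_len(p)) {
  if ((vec[i + 1] - vec[i]) == 1) {
    consecutive <- consecutive + 1
  } else {
    # The run has ended: count it if it had at least one consecutive step
    if (consecutive > 0) {
      nconsecutive <- nconsecutive + 1
    }
    # Reset the step counter
    consecutive <- 0
  }
}
# Count a run that reaches the end of the vector
if (consecutive > 0) nconsecutive <- nconsecutive + 1
nconsecutive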
Here's another base R one-liner using tapply -
sum(tapply(vec, cumsum(c(TRUE, diff(vec) != 1)), length) > 1)
#[1] 14
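If the minimum run length might change later, the rle-based count can be wrapped in a small function. This is a sketch only: count_runs and its min_len argument are hypothetical names, not part of any answer above.

```r
vec <- c(28, 44, 45, 46, 47, 48, 61, 62, 70, 71, 82, 83, 104, 105, 111, 115,
         125, 136, 137, 138, 146, 147, 158, 159, 160, 185, 186, 187, 188, 189,
         190, 191, 192, 193, 209, 263, 264, 265, 266, 267, 268, 280, 283, 284,
         308, 309, 318, 319, 324, 333, 334, 335, 347, 354)

# Hypothetical helper: count runs of consecutive integers that contain
# at least min_len elements (min_len must be 2 or more)
count_runs <- function(x, min_len = 2) {
  stopifnot(min_len >= 2)
  r <- rle(diff(x))
  # A run of n consecutive numbers shows up as n - 1 differences equal to 1
  sum(r$values == 1 & r$lengths >= min_len - 1)
}

count_runs(vec)               # 14, matching the answers above
count_runs(vec, min_len = 3)  # only runs of three or more numbers
```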

How to create list-columns from list in R?

# Sample data
df <- tibble(id=1:2, xml_str=c("<?xml version='1.0'?><!DOCTYPE svg PUBLIC '-//W3C//DTD SVG 1.1//EN' 'http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd'><svg version='1.1' xmlns='http://www.w3.org/2000/svg'>'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M171, 160 L171, 160, 168, 159, 164, 159, 163, 159, 162, 159, 161, 159, 161, 158, 162, 158, 162, 157, 163, 156, 165, 156'/>'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M172, 226 L172, 226, 171, 213, 170, 212, 171, 212, 172, 212, 173, 212, 173, 211, 172, 211, 171, 211, 171, 212, 171, 215'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M153, 94 L153, 94, 150, 90, 150, 89, 150, 88, 150, 87, 150, 86, 150, 85, 150, 84, 150, 82, 150, 81, 150, 80, 150, 79'/>'/>'/>'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M346, 84 L346, 84, 346, 79, 347, 78, 347, 77, 348, 77, 348, 76, 348, 75, 348, 76, 348, 77, 349, 77, 348, 78'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M314, 67 L314, 67, 311, 76, 309, 76, 308, 77, 307, 77, 307, 76, 306, 76, 305, 76, 305, 77, 306, 77, 307, 77, 306, 77, 305, 79, 304, 80'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M313, 57 L313, 57, 321, 56, 321, 57, 321, 58'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M332, 58 L332, 58, 332, 57, 331, 57, 333, 57, 334, 57, 335, 57, 336, 58, 337, 58, 338, 58, 339, 58, 340, 58, 341, 58, 341, 59, 340, 60, 339, 60, 338, 60, 337, 60, 336, 60, 335, 60, 334, 60, 333, 60, 332, 60, 331, 60, 331, 59, 333, 58, 334, 58'/></svg>", "<?xml version='1.0'?><!DOCTYPE svg PUBLIC '-//W3C//DTD SVG 1.1//EN' 'http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd'><svg version='1.1' xmlns='http://www.w3.org/2000/svg'>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M315, 80 L315, 80, 321, 79, 320, 79, 318, 79, 317, 79'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M334, 83 L334, 83, 334, 82'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M315, 80 L315, 80, 315, 82, 315, 83, 
315, 84, 315, 85'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M315, 72 L315, 72'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M315, 69 L315, 69, 315, 70'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M332, 66 L332, 66, 332, 67'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M315, 56 L315, 56'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M315, 66 L315, 66, 315, 67'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M315, 72 L315, 72'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M332, 72 L332, 72, 333, 75'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M315, 72 L315, 72'/>\n<path fill='none' stroke='#ff0000' stroke-width='5' d='M334, 73 L334, 73, 333, 73'/></svg>"))
df <- df %>%
  rowwise() %>%
  mutate(nodes = (xml_str %>% read_xml() %>% xml_find_all(., "//@d") %>% as_list()))
With the data frame above, I want to extract all path-element d-nodes from the XML string and store them as a list in the same data frame, but I get the error "Column nodes must be length 1 (the group size), not 7".
The piping used in the mutate statement does return a single list.
I can leave out the 'rowwise()', but that simply expects length 2 instead of 1.
What am I missing here?
It's not exactly the way you're doing it, but you can use str_extract_all and regex to pull out the relevant string as a list of comma-separated strings
ans <-
df %>%
dplyr::mutate(dnodes = stringr::str_extract_all(xml_str, "(?<=[d]=')[^']+(?='\\/)"))
ans$dnodes
# [[1]]
# [1] "M171, 160 L171, 160, 168, 159, 164, 159, 163, 159, 162, 159, 161, 159, 161, 158, 162, 158, 162, 157, 163, 156, 165, 156"
# [2] "M172, 226 L172, 226, 171, 213, 170, 212, 171, 212, 172, 212, 173, 212, 173, 211, 172, 211, 171, 211, 171, 212, 171, 215"
# [3] "M153, 94 L153, 94, 150, 90, 150, 89, 150, 88, 150, 87, 150, 86, 150, 85, 150, 84, 150, 82, 150, 81, 150, 80, 150, 79"
# [4] "M346, 84 L346, 84, 346, 79, 347, 78, 347, 77, 348, 77, 348, 76, 348, 75, 348, 76, 348, 77, 349, 77, 348, 78"
# [5] "M314, 67 L314, 67, 311, 76, 309, 76, 308, 77, 307, 77, 307, 76, 306, 76, 305, 76, 305, 77, 306, 77, 307, 77, 306, 77, 305, 79, 304, 80"
# [6] "M313, 57 L313, 57, 321, 56, 321, 57, 321, 58"
# [7] "M332, 58 L332, 58, 332, 57, 331, 57, 333, 57, 334, 57, 335, 57, 336, 58, 337, 58, 338, 58, 339, 58, 340, 58, 341, 58, 341, 59, 340, 60, 339, 60, 338, 60, 337, 60, 336, 60, 335, 60, 334, 60, 333, 60, 332, 60, 331, 60, 331, 59, 333, 58, 334, 58"
# [[2]]
# [1] "M315, 80 L315, 80, 321, 79, 320, 79, 318, 79, 317, 79" "M334, 83 L334, 83, 334, 82"
# [3] "M315, 80 L315, 80, 315, 82, 315, 83, 315, 84, 315, 85" "M315, 72 L315, 72"
# [5] "M315, 69 L315, 69, 315, 70" "M332, 66 L332, 66, 332, 67"
# [7] "M315, 56 L315, 56" "M315, 66 L315, 66, 315, 67"
# [9] "M315, 72 L315, 72" "M332, 72 L332, 72, 333, 75"
# [11] "M315, 72 L315, 72" "M334, 73 L334, 73, 333, 73"
You can then split each extracted string on ", " to get a list holding one vector of strings per row:
ans <-
df %>%
dplyr::mutate(dnodes = stringr::str_extract_all(xml_str, "(?<=[d]=')[^']+(?='\\/)")) %>%
dplyr::mutate(dnodes = purrr::map(dnodes, ~unlist(strsplit(paste(.x, collapse=", "), ", "))))
ans$dnodes
# [[1]]
# [1] "M171" "160 L171" "160" "168" "159" "164" "159" "163" "159" "162"
# [11] "159" "161" "159" "161" "158" "162" "158" "162" "157" "163"
# [21] "156" "165" "156" "M172" "226 L172" "226" "171" "213" "170" "212"
# [31] "171" "212" "172" "212" "173" "212" "173" "211" "172" "211"
# [41] "171" "211" "171" "212" "171" "215" "M153" "94 L153" "94" "150"
# [51] "90" "150" "89" "150" "88" "150" "87" "150" "86" "150"
# [61] "85" "150" "84" "150" "82" "150" "81" "150" "80" "150"
# etc
Does this do what you want? I usually wrap the right side of my mutate(name = right_side) in list() to accomplish this.
df <- df %>%
  mutate(nodes = list(xml_str %>% read_xml() %>% xml_find_all(., "//@d")))
class(df$nodes)
"list"
class(df$nodes[[1]])
"xml_nodeset"
Not sure if you want the xml_nodeset objects or perhaps CPak's solution with actual strings is better for you.
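Another option that sidesteps the length check in mutate() is to build the list-column with lapply(), parsing one document per row. This is a minimal sketch, assuming the xml2 package is installed; the data frame below is a hypothetical, much shortened stand-in for the one in the question, and extract_d is a made-up helper name.

```r
library(xml2)

# Hypothetical, much shortened stand-in for the xml_str column
df <- data.frame(
  id = 1:2,
  xml_str = c(
    "<svg><path d='M1, 2 L3, 4'/><path d='M5, 6'/></svg>",
    "<svg><path d='M7, 8'/></svg>"
  ),
  stringsAsFactors = FALSE
)

# Parse one document and return the values of all its d attributes
extract_d <- function(s) xml_text(xml_find_all(read_xml(s), "//@d"))

# lapply() returns a list with one element per row, which is exactly
# the shape a list-column needs
df$nodes <- lapply(df$xml_str, extract_d)
```

Each element of df$nodes is then a character vector whose length can differ from row to row.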

Converting data frame into a list of lists in R

I would like to turn data.frame like this one:
dat = data.frame (
ConditionA = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
ConditionB = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5),
X = c(460, 382, 468, 618, 421, 518, 655, 656, 621, 552, 750, 725, 337, 328, 342, 549, 569, 523, 469, 429),
Y = c(437, 305, 498, 620, 381, 543, 214, 181, 183, 387, 439, 351, 327, 268, 276, 178, 375, 393, 312, 302)
)
into a list of lists like this (or similar):
lst = list(
list(
c(460, 382, 468, 618),
c(437, 305, 498, 620)
),
list(
c(421, 518, 655, 656, 621),
c(381, 543, 214, 181, 183)
),
list(
c(552, 750, 725),
c(387, 439, 351)
),
list(
c(337, 328, 342, 549),
c(327, 268, 276, 178)
),
list(
c(569, 523, 469, 429),
c(375, 393, 312, 302)
)
)
> lst
[[1]]
[[1]][[1]]
[1] 460 382 468 618
[[1]][[2]]
[1] 437 305 498 620
[[2]]
[[2]][[1]]
[1] 421 518 655 656 621
[[2]][[2]]
[1] 381 543 214 181 183
[[3]]
[[3]][[1]]
[1] 552 750 725
[[3]][[2]]
[1] 387 439 351
. . .
What would be the most efficient way to make such a conversion?
We can split on the 1st and 2nd columns, use drop = TRUE to remove the combinations with 0 elements, and convert each piece to a list:
lapply(split(dat[-(1:2)], dat[1:2], drop = TRUE), as.list)
Or using tidyverse
library(tidyverse)
dat %>%
  group_by(ConditionA, ConditionB) %>%
  nest() %>%
  mutate(data = map(data, as.list)) %>%
  pull(data)
Maybe this, using data.table
Data:
dat = data.frame (
ConditionA = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
ConditionB = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5),
X = c(460, 382, 468, 618, 421, 518, 655, 656, 621, 552, 750, 725, 337, 328, 342, 549, 569, 523, 469, 429),
Y = c(437, 305, 498, 620, 381, 543, 214, 181, 183, 387, 439, 351, 327, 268, 276, 178, 375, 393, 312, 302)
)
Code:
library('data.table')
setDT(dat)
dat[, list(list(as.list(.SD))),by = .(ConditionA, ConditionB)][, V1]
or this
dat[, list(list(list(.SD))),by = .(ConditionA, ConditionB)][, V1]
c(by(dat[3:4],dat[1:2],as.list))
[[1]]
[[1]]$X
[1] 460 382 468 618
[[1]]$Y
[1] 437 305 498 620
[[2]]
[[2]]$X
[1] 421 518 655 656 621
[[2]]$Y
[1] 381 543 214 181 183
[[3]]
[[3]]$X
[1] 552 750 725
[[3]]$Y
[1] 387 439 351
. . . .
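The answers above return lists whose elements keep the X and Y names. If the result should match the unnamed structure in the question exactly, the names can be stripped afterwards. A minimal sketch on a shortened copy of the data; groups is a hypothetical intermediate name.

```r
# Shortened copy of the data from the question
dat <- data.frame(
  ConditionA = c(1, 1, 1, 1, 1),
  ConditionB = c(1, 1, 2, 2, 2),
  X = c(460, 382, 421, 518, 655),
  Y = c(437, 305, 381, 543, 214)
)

# One data frame per observed combination of the two condition columns
groups <- split(dat[c("X", "Y")], dat[c("ConditionA", "ConditionB")], drop = TRUE)

# Turn each piece into a plain list of column vectors and drop all names
lst <- unname(lapply(groups, function(g) unname(as.list(g))))
```

Keeping the names is usually more convenient for later access such as lst[[1]]$X, so stripping them is only worthwhile when the unnamed form is required.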

How to merge multiple columns into one column?

I currently have data spread out across multiple columns in R. I am looking for a way to put this information into one column, as a vector for each of the individual rows.
Is there a function to do this?
For example, the data looks like this:
DF <- data.frame(id=rep(LETTERS, each=1)[1:26], replicate(26, sample(1001, 26)), Class=sample(c("Yes", "No"), 26, TRUE))
select(DF, cols=c("id", "X1","X2", "X23", "Class"))
How can I merge the columns "X1","X2", "X23" into a vector containing numeric type variables for each of the IDs?
Like this?
library(reshape2)
library(dplyr)  # for select() and %>%
df <- select(DF, id, X1, X2, X23, Class)
melt(df) %>% dcast(id ~ ., fun.aggregate = list)
Using id, Class as id variables
id .
1 A 422, 74, 439
2 B 879, 443, 923
3 C 575, 901, 749
4 D 813, 747, 21
5 E 438, 526, 675
6 F 863, 562, 474
7 G 103, 713, 918
8 H 585, 294, 525
9 I 115, 76, 175
10 J 953, 379, 926
11 K 679, 439, 377
12 L 816, 624, 538
13 M 678, 226, 142
14 N 667, 369, 586
15 O 795, 422, 248
16 P 165, 22, 612
17 Q 294, 476, 746
18 R 968, 368, 290
19 S 238, 481, 980
20 T 921, 482, 741
21 U 550, 15, 296
22 V 121, 358, 625
23 W 213, 313, 242
24 X 92, 77, 58
25 Y 607, 936, 350
26 Z 660, 42, 275
A note though: I do not know your final use case, but this strikes me as something you probably do not want to have. It is often more advisable to stick to tidy data, see e.g. https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
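If a list-column really is what is needed, it can also be built in base R without reshaping. The sketch below uses a small made-up data frame in place of DF; cols and merged are hypothetical names.

```r
# Small made-up stand-in for DF
DF <- data.frame(
  id  = LETTERS[1:3],
  X1  = c(1, 2, 3),
  X2  = c(4, 5, 6),
  X23 = c(7, 8, 9)
)

cols <- c("X1", "X2", "X23")

# For every row, collect the selected columns into one numeric vector
DF$merged <- lapply(seq_len(nrow(DF)), function(i) {
  unlist(DF[i, cols], use.names = FALSE)
})
```

Each element of DF$merged is a numeric vector, for example DF$merged[[1]] holds the X1, X2 and X23 values of row A.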

remove every 4 rows between two numbers

For example, I have a data frame with one column containing numbers. This is how it looks:
head(c1)
c
1 300
2 302
3 304
4 306
5 308
6 310
Here is the sample data frame.
c1 <- structure(list(c = c(300, 302, 304, 306, 308, 310, 312, 314,
316, 318, 320, 322, 324, 326, 328, 330, 332, 334, 336, 338, 340,
342, 344, 346, 348, 350, 352, 354, 356, 358, 360, 362, 364, 366,
368, 370, 372, 374, 376, 378, 380, 382, 384, 386, 388, 390, 392,
394, 396, 398, 400)), .Names = "c", row.names = c(NA, -51L), class = "data.frame")
I want to delete the rows between 300 and 310, between 310 and 320, and so on.
I want to end up with a data frame like this:
300
310
320
330
340
350
.
.
.
400
Any ideas how to do this? I found how to remove every nth row, but not how to remove every four rows between two numbers.
You can make use of the modulo operator %%. If you want the result as an atomic vector, you can run
c1$c[c1$c %% 10 == 0]
or if you want it as a data.frame with 1 column, you can use
c1[c1$c %% 10 == 0, , drop=FALSE]
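The modulo filter depends on the values themselves. If the rows to keep are defined by position instead (keep one row, drop the next four), indexing with seq() is an alternative. A minimal sketch that rebuilds the sample column with seq(); by_position and by_value are hypothetical names.

```r
# Same values as the sample data: 300, 302, ..., 400
c1 <- data.frame(c = seq(300, 400, by = 2))

# Keep rows 1, 6, 11, ... and drop the four rows in between
by_position <- c1[seq(1, nrow(c1), by = 5), , drop = FALSE]

# The value-based filter from the answer, for comparison
by_value <- c1[c1$c %% 10 == 0, , drop = FALSE]
```

On this data both approaches select the same 11 rows; they differ as soon as the values are not evenly spaced.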
