Error using function rle() in R

I have a data frame in R called pxlast; for example, to access the 5th column I use pxlast[[5]].
[1] 259.55 259.55 265.21 269.40 278.23 283.63 288.51 289.84 284.83 280.51 289.76 289.38 294.10 -1.00 -1.00 -1.00
[17] 300.30 303.86 311.65 303.29 296.44 295.13 297.22 294.60 299.65 290.23 295.80 -1.00 -1.00 -1.00 298.56 299.25
[33] 287.37 290.06 281.71 287.66 290.16 280.31 281.51 293.69 292.25 293.73 294.60 291.36 283.81 288.65 288.29 -1.00
[49] -1.00 -1.00 293.25 293.54 277.41 268.08 267.01 270.63 267.25 254.73 266.59 266.73 278.34 282.03 289.63 282.40
[65] 289.59 289.54 291.31 290.85 295.60 290.72 288.25 288.00 293.98 297.11 290.00 278.35 270.61 274.89 267.80 276.32
[81] 279.05 289.07 285.87 293.36 293.18 294.76 295.77 296.35 290.23 297.61 296.93 293.31 290.06 289.98 287.29 282.07
[97] 275.89 270.92 273.68 270.85 280.05 279.64 284.83 288.91 294.85 296.91 297.94 301.66 303.05 298.72 303.46 298.22
[113] 304.92 309.59 316.07 318.05 318.86 318.09 317.84 318.04 337.08 346.89 345.36 350.96 354.65 361.06 354.53 352.63
[129] 352.83 351.45 351.38 361.47 365.13 367.11 371.42 364.37 368.83 372.12 375.10 381.97 384.47 388.67 388.61 386.73
[145] 392.16 388.55 383.86 389.50 379.83 381.37 392.27 387.79 388.61 388.01 394.23 401.78 414.70 421.23 427.77 436.23
[161] 423.86 398.80 419.00 413.60 400.77 416.78 412.58 405.90 404.30 405.65 NA
As you can see, there are repeated values, for example the -1 values.
I want to return the values and indexes that are repeated more than X times, for example the values that are repeated at least 3 times.
This is my code for doing that.
runs = rle(pxlast[[5]])
pxlast[[5]][runs$lengths > 2]
The result is:
[1] 294.10 299.65 294.60
These should be the repeated elements of my vector (the -1 runs), but as you can see the values returned are incorrect.
Why?
I have been testing, and the rle function is returning the following lengths in my runs variable.
[1] 2 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[59] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[117] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
As you can see, the function groups the values that are the same; for example, the first 2 means that the first two numbers are equal. Because each run is collapsed into a single entry, the lengths vector does not line up with the indexes of my original vector, so I can't use it directly to return my repeated values.
If it were expanded in the following way (shown here for the first part of the vector), I could use it:
[1] 2 2 1 1 1 1 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
That way the total number of entries matches the length of the original vector.
Any idea how to solve this?

If we need to extract the values based on the rle output:
runs <- within.list(rle(pxlast[[5]]), {
  i1 <- lengths > 2
  values <- values[i1]
  lengths <- lengths[i1]
})
inverse.rle(runs)
Using a reproducible example
v1 <- c(2, 2, 1, 3, 3, 3, 2, 4, 4, 4, 5)
runs <- within.list(rle(v1), {
  i1 <- lengths > 2
  values <- values[i1]
  lengths <- lengths[i1]
})
inverse.rle(runs)
#[1] 3 3 3 4 4 4
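If the original positions are needed as well (the question asks for values and indexes), the run lengths can be expanded back to one flag per element. A minimal sketch, reusing v1 from above:
r <- rle(v1)
idx <- which(rep(r$lengths > 2, r$lengths))   # positions belonging to runs longer than 2
idx
#[1]  4  5  6  8  9 10
v1[idx]
#[1] 3 3 3 4 4 4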

This is a possible way:
df <- data.frame(lengths = as.numeric(runs$lengths), values = as.numeric(runs$values))
df[df[, "lengths"] > 2, ]
   lengths values
13       3     -1
25       3     -1
43       3     -1
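Note that the row names (13, 25, 43) are run numbers, not positions in pxlast[[5]]. If the starting index of each long run is wanted, the cumulative run lengths give it; a sketch, assuming runs is still the unmodified rle(pxlast[[5]]):
starts <- cumsum(runs$lengths) - runs$lengths + 1   # start index of every run
keep <- runs$lengths > 2
data.frame(start = starts[keep], lengths = runs$lengths[keep], values = runs$values[keep])
#  start lengths values
#1    14       3     -1
#2    28       3     -1
#3    48       3     -1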

Related

How to find the streaks of a particular value in R?

The rle() function returns a list with values and lengths. I have not found a way to subset the output to isolate the streaks of a particular value that does not involve calling rle() twice, or saving the output into an object to later subset (an added step).
For instance, for runs of heads (1's) in a series of fair coin tosses:
s <- sample(c(0,1),100,T)
rle(s)
Run Length Encoding
lengths: int [1:55] 1 2 1 2 1 2 1 2 2 1 ...
values : num [1:55] 0 1 0 1 0 1 0 1 0 1 ...
# Double-call:
rle(s)[[1]][rle(s)[[2]]==1]
[1] 2 2 2 2 1 1 1 1 6 1 1 1 2 2 1 1 2 2 2 2 2 3 1 1 4 1 2
# Adding an intermediate step:
> r <- rle(s)
> r$lengths[r$values==1]
[1] 2 2 2 2 1 1 1 1 6 1 1 1 2 2 1 1 2 2 2 2 2 3 1 1 4 1 2
I see that a very easy way of getting the streak lengths just for 1 is to simply tweak the rle() code (answer), but there may be an even simpler way.
In base R:
with(rle(s), lengths[values==1])
[1] 1 3 2 2 1 1 1 3 2 1 1 3 1 1 1 1 1 2 3 1 2 1 3 3 1 2 1 1 2
For a sequence of outcomes s, when interested solely in the lengths of the streaks of outcome oc:
sk <- function(s, oc){
  n <- length(s)
  y <- s[-1L] != s[-n]            # TRUE wherever the value changes
  i <- c(which(y), n)             # end position of each run
  diff(c(0L, i))[s[i] == oc]      # run lengths, keeping only runs of oc
}
So to get the lengths for 1:
sk(s,1)
[1] 2 2 2 2 1 1 1 1 6 1 1 1 2 2 1 1 2 2 2 2 2 3 1 1 4 1 2
and likewise for 0:
sk(s,0)
[1] 1 1 1 1 2 2 2 2 4 1 1 2 1 1 1 1 1 1 3 1 1 2 6 2 1 1 4 4
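If the start positions of the streaks are wanted alongside their lengths, the same run bookkeeping can be reused. A sketch built on rle (not part of the original answers):
streaks <- function(s, oc){
  r <- rle(s)
  ends <- cumsum(r$lengths)            # end position of each run
  starts <- ends - r$lengths + 1       # start position of each run
  data.frame(start = starts, length = r$lengths)[r$values == oc, ]
}
streaks(s, 1)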

Recoding a number to a string in R

I am new to R and I am trying to recode a numeric variable (1, 2, 3) to a string. I have seen how to do it, but I do not know why mine is not working; maybe it is because it should be from string to number? This is what I have, and thanks in advance!
cars$origin = as.factor(cars$origin)
cars$origin
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 2 2 2 2 2 1 1 1 1 1 3 1 3 1 1
[35] 1 1 1 1 1 1 1 1 2 2 2 3 3 2 1 3 1 2 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1
[69] 2 2 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 2 1 3 1 1 1
Levels: 1 2 3
cars$origin <- recode(cars$origin, "1='american';2='european';3='japan'")
Error: Argument 2 must be named, not unnamed
The factor function has the labels argument for that:
cars$origin <- factor(cars$origin,
                      levels = c(1, 2, 3),
                      labels = c("american", "european", "japan"))

Count if a word occurs in each row of a 4 million observation data set

I am using R and writing a script that counts if one of ~2000 words occurs in each row of a 4 million observation data file. The data set with observations (df) contains two columns, one with text (df$lead_paragraph), and one with a date (df$date).
Using the following, I can count if any of the words in a list (p) occur in each row of the lead_paragraph column of the df file, and output the answer as a new column.
df$pcount<-((rowSums(sapply(p, grepl, df$lead_paragraph,
ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1)
However, if I include too many words in the list p, running the code crashes R.
My alternate strategy is to simply break this into pieces, but I was wondering if there is a better, more elegant coding solution to use here. My inclination is to use a for loop, but everything I am reading suggests this is not preferred in R. I am pretty new to R and not a very good coder, so my apologies if this is not clear.
df$pcount1<-((rowSums(sapply(p[1:100], grepl, df$lead_paragraph,
ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1)
df$pcount2<-((rowSums(sapply(p[101:200], grepl, df$lead_paragraph,
ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1)
...
df$pcount22<-((rowSums(sapply(p[2101:2200], grepl, df$lead_paragraph,
ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1)
I didn't complete this... but this should point you in the right direction. It's faster using the data.table package, but hopefully this gives you an idea of the process.
I recreated your dataset using random dates and strings extracted from http://www.norvig.com/big.txt, stored in a data.frame named nrv_df:
library(stringi)
> head(nrv_df)
lead_para date
1 The Project Gutenberg EBook of The Adventures of Sherlock Holmes 2018-11-16
2 by Sir Arthur Conan Doyle 2019-06-05
3 15 in our series by Sir Arthur Conan Doyle 2017-08-08
4 Copyright laws are changing all over the world Be sure to check the 2014-12-17
5 copyright laws for your country before downloading or redistributing 2016-09-13
6 this or any other Project Gutenberg eBook 2015-06-15
> dim(nrv_df)
[1] 103598 2
I then randomly sampled words from the entire body to get 2000 unique words
> length(p)
[1] 2000
> head(p)
[1] "The" "Project" "Gutenberg" "EBook" "of" "Adventures"
> tail(p)
[1] "accomplice" "engaged" "guessed" "row" "moist" "red"
Then, to leverage the stringi package and use a regex that matches complete words, I joined the strings in vector p and collapsed them with a |, so that we are looking for any of the words with a word boundary before and after:
> p_join2 <- stri_join(sprintf("\\b%s\\b", p), collapse = "|")
> p_join2
[1] "\\bThe\\b|\\bProject\\b|\\bGutenberg\\b|\\bEBook\\b|\\bof\\b|\\bAdventures\\b|\\bSherlock\\b|\\bHolmes\\b|\\bby\\b|\\bSir\\b|\\bArthur\\b|\\bConan\\b|\\bDoyle\\b|\\b15\\b|\\bin\\b|\\bour\\b|\\bseries\\b|\\bCopyright\\b|\\blaws\\b|\\bare\\b|\\bchanging\\b|\\ball\\b|\\bover\\b|\\bthe\\b|\\bworld\\b|\\bBe\\b|\\bsure\\b|\\bto\\b|\\bcheck\\b|\\bcopyright\\b|\\bfor\\b|\\byour\\b|\\bcountry\\b|..."
And then simply count the words; you could do nrv_df$counts <- ... to add the result as a column:
> stri_count_regex(nrv_df$lead_para[25000:26000], p_join2, stri_opts_regex(case_insensitive = TRUE))
[1] 12 11 8 13 7 7 6 7 6 8 12 1 6 7 8 3 5 3 5 5 5 4 7 5 5 5 5 5 10 2 8 13 5 8 9 7 6 5 7 5 9 8 7 5 7 8 5 6 0 8 6
[52] 3 4 0 10 7 9 8 4 6 8 8 7 6 6 6 0 3 5 4 7 6 5 7 10 8 10 10 11
EDIT:
Since the exact number of matches is of no consequence (we only need to know whether there is at least one)...
First, a function that does the work for each paragraph and detects whether any of the strings in p2 exist in the body of lead_paragraph:
f <- function(i, j){
  if(any(stri_detect_fixed(i, j, omit_no_match = TRUE))){
    1
  } else {
    0
  }
}
Now, using the parallel library on Linux and testing only 1000 rows (since this is just an example), we get:
library(parallel)
library(stringi)
> rst <- mcmapply(function(x){
    f(i = x, j = p2)
  }, vdf2$lead_paragraph[1:1000],
  mc.cores = detectCores() - 2,
  USE.NAMES = FALSE)
> rst
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[70] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[139] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
[208] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[277] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[346] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1
[415] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[484] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[553] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[622] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[691] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[760] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[829] 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[898] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
[967] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
This also works:
library(corpus)
# simulate the problem as in #carl-boneri's answer
lead_para <- readLines("http://www.norvig.com/big.txt")
# get a random sample of 2000 word types
types <- text_types(lead_para, collapse = TRUE)
p <- sample(types, 2000)
# find whether each entry has at least one of the terms in `p`
ix <- text_detect(lead_para, p)
Even only using a single core, it's over 20 times faster than the previous solution:
system.time(ix <- text_detect(lead_para, p))
## user system elapsed
## 0.231 0.008 0.240
system.time(rst <- mcmapply(function(x) f(i = x, j = p_join2),
lead_para, mc.cores = detectCores() - 2,
USE.NAMES = FALSE))
## user system elapsed
## 11.604 0.240 5.805
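As a side note, the chunked sapply() approach from the question can often be avoided by collapsing the word list into a single alternation pattern and calling grepl() once; a sketch, assuming df and p as in the question (for very long word lists the pattern may still need to be split into a few chunks, and perl = TRUE tends to cope better with large alternations):
pat <- paste0("\\b(", paste(p, collapse = "|"), ")\\b")
df$pcount <- as.integer(grepl(pat, df$lead_paragraph, ignore.case = TRUE, perl = TRUE))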

Number of occurrences by lines in R

I have this array:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[75] 1 1 2 1 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2
[112] 2 1 1 2 2 2 2 2 2 1 2 1 1 2 1 1 2 1 1 2 1 1 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2
And I want to count the number of occurrences of '1' and '2'. From [1] to [70] and from [71] to the end.
I tried:
sum(x==1)
But this counts over the whole vector. How can I restrict it to a range of positions?
The function sum (from base R) returns the sum of all the values present in its argument.
You can set up the argument in the following way:
with x[a:b] you select a range (for example, a = 1 and b = 10 selects the elements from [1] to [10]);
with the operator == you check whether a specific value c occurs within that range, e.g. x[a:b] == c.
If you want to look for more than one value (for example c and d, where c == 1 and d == 2), you can use a simple addition to sum up your results.
Now you can just write: sum(x[a:b] == c) + sum(x[a:b] == d)
where a and b are your boundaries and c and d are the values you want to count.
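Applied to the vector in the question (assuming it is stored in x), counting both values over the two requested ranges looks like this; table() gives the same breakdown in a single call:
sum(x[1:70] == 1)            # occurrences of 1 in positions 1 to 70
sum(x[1:70] == 2)            # occurrences of 2 in positions 1 to 70
sum(x[71:length(x)] == 1)    # occurrences of 1 from position 71 to the end
sum(x[71:length(x)] == 2)    # occurrences of 2 from position 71 to the end
table(x[1:70])               # both counts at once
table(x[71:length(x)])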

Splitting a data frame into a list using intervals

I want to split a data frame like this
chr.pos nt.pos CNV
1 74355 0
1 431565 0
1 675207 0
1 783605 1
1 888149 1
1 991311 1
1 1089305 1
1 1177669 1
1 1279886 0
1 1406311 0
1 1491385 0
1 1579761 0
2 1670488 1
2 1758800 1
2 1834256 0
2 1902924 1
2 1978088 1
2 2063124 0
The point is to get a list of the intervals where chr.pos is the same and CNV = 1, while taking into account the CNV = 0 intervals that separate them:
[[1]]
1 783605 1
1 888149 1
1 991311 1
1 1089305 1
1 1177669 1
[[2]]
2 1670488 1
2 1758800 1
[[3]]
2 1902924 1
2 1978088 1
Any ideas?
You can use rle to create a grouping variable to use in split:
# create a group identifier
DF$GRP <- with(rle(DF$CNV), rep(seq_along(lengths),lengths))
# split a subset of DF which contains only CNV==1
split(DF[DF$CNV==1,],DF[DF$CNV==1,'GRP'] )
$`2`
chr.pos nt.pos CNV GRP
4 1 783605 1 2
5 1 888149 1 2
6 1 991311 1 2
7 1 1089305 1 2
8 1 1177669 1 2
$`4`
chr.pos nt.pos CNV GRP
13 2 1670488 1 4
14 2 1758800 1 4
$`6`
chr.pos nt.pos CNV GRP
16 2 1902924 1 6
17 2 1978088 1 6
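If the helper column should not appear in the final list (the desired output in the question shows only the original three columns), GRP can be dropped afterwards; a minimal sketch:
res <- split(DF[DF$CNV == 1, ], DF[DF$CNV == 1, "GRP"])
res <- lapply(res, function(d) d[, c("chr.pos", "nt.pos", "CNV")])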
