Use length and integer(0) to test a conditional expression - r

I am posting this question in the context of a statistical problem, but the issue I have is with the code itself.
I have the vector below and compute its autocorrelation, flagging the values that are significant at a 95% confidence level (there is only one significant value, at t+1). The Lag function is from the quantmod package.
x <- c(1,1,3,8,5,2,4,3,1,1,0,5,1,1,3,4,6,7,8,9,4,2,1)
a <- acf(x)
b <- a[[1]]                  # the autocorrelation values
c <- b[2:length(b)]          # drop lag 0 (note: this masks base::c)
possignificance_level <- qnorm((1 + 0.95)/2)/sqrt(sum(!is.na(x)))
possignificance_level
negsignificance_level <- -possignificance_level
negsignificance_level
poscorr <- which(possignificance_level < c)   # lags above the upper bound
negcorr <- which(negsignificance_level > c)   # lags below the lower bound
poscorr
negcorr
I would like to lag the series if there is any significant value above/below the 95% confidence interval in either poscorr or negcorr, or both. I attempted to use the expressions below without success. (I test with length(poscorr==0) and length(negcorr==0), since the length of the resulting vector is 0 when there is no autocorrelation. The result for negcorr is "integer(0)".)
posautorrelation <- if(length(poscorr==0)) Lag(x,0) else Lag(x,poscorr)
negautorrelation <- if(length(negcorr==0)) Lag(x,0) else Lag(x,negcorr)
Error in `colnames<-`(`*tmp*`, value = "Lag.") :
  length of 'dimnames' [2] not equal to array extent
Error during wrapup: cannot open the connection
I also tried:
posautorrelation <- if((poscorr==integer(0)) Lag(x,0) else Lag(x,poscorr)
Error: unexpected symbol in "posautorrelation <- if(length(poscorr==integer(0)) Lag"
Error during wrapup: cannot open the connection
negautorrelation <- if((negcorr==integer(0)) Lag(x,0) else Lag(x,negcorr)
Error: unexpected symbol in "negautorrelation <- if(length(negcorr==integer(0)) Lag"
Error during wrapup: cannot open the connection
I would like to know how to write the last two expressions so as to obtain two versions of x: one with a lag, and one with a 0 lag when there is no significant autocorrelation, using the fact that negcorr is integer(0).

The first part explains why the which(...) expression returns integer(0):
print(negsignificance_level)
# [1] -0.4086807
min(c)
# [1] -0.3432622
which(negsignificance_level > c)
# integer(0)
This, I believe, is the answer to your question: test the length of the result, i.e.
if (length(object) == 0) expression1 else expression2
posautorrelation <- if(length(poscorr) == 0) Lag(x,0) else Lag(x, poscorr)
posautorrelation
# [1] NA 1 1 3 8 5 2 4 3 1 1 0 5 1 1 3 4 6 7 8 9 4 2
negautorrelation <- if(length(negcorr) == 0) Lag(x,0) else Lag(x, negcorr)
negautorrelation
# [1] 1 1 3 8 5 2 4 3 1 1 0 5 1 1 3 4 6 7 8 9 4 2 1
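For completeness, a short demonstration of why the == comparison cannot work here: comparing against integer(0) yields an empty logical vector, and if() needs exactly one TRUE or FALSE.
integer(0) == integer(0)         # logical(0): comparisons with empty vectors are empty
if (logical(0)) "x"              # Error in if (logical(0)) "x" : argument is of length zero
length(integer(0)) == 0          # TRUE: length() always returns a single number
identical(negcorr, integer(0))   # TRUE when negcorr is empty; an alternative safe test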

Related

extract all two-character-combinations from a string

In order to distinguish nonsense text (e.g. djsarejslslasdfhsl) from real (German) words, I would like to analyse letter frequencies.
My idea is to calculate the relative frequencies of two-letter combinations ("te", "ex", "xt", "is" etc.) using a long text. Based on that information I would like to calculate the probability that a given word (or sentence) is real German.
But my first problem is how to extract all the two-letter combinations and count them. I fear that using substring(string, start, stop) and increasing the values of start and stop in a loop might not be a very efficient solution. Do you have any ideas?
# A short sample text
text <- 'Es ist ein Freudentag – ohne Zweifel. Gesundheitsminister Alain Berset und der Bundesrat gehen weiter, als man annehmen durfte. Die Zertifikatspflicht wird aufgehoben, die Maskenpflicht gilt nur noch im ÖV und in Gesundheitseinrichtungen.
Die beste Meldung des Tages aber ist: Die Covid-19-Task-Force, inzwischen als «Task-Farce» verballhornt, wird auf Ende März aufgehoben – zwei Monaten früher als geplant. Die Dauerkritik war wohl mit ein Grund, dass dieses Gremium sich jetzt rasch auflösen will.
Keine Rosen ohne Dornen: Einzelne Punkte von Bersets Ausführungen geben zu denken.
Die «Isolationshaft» für positiv Getestete bleibt zwingend. Das ist Unsinn und steht in einem scharfen Kontrast zu den übrigen Öffnungsschritten. Die Grundimmunität der Bevölkerung beträgt über 90 Prozent, das Virus ist nicht mehr gefährlich, warum will man weiter Leute zu Hause einsperren? Wer schwer krank ist, geht von sich aus nicht zur Arbeit. Die krankheitsbedingte Bettruhe muss man den Menschen nicht vorschreiben.
Gesundheitsminister Berset findet, das Modell Task-Force habe eine interessante Möglichkeit aufgezeigt für die Zusammenarbeit zwischen Regierung und Wissenschaft. Unter Umständen eigne sich dieses Modell auch für andere Bereiche.
Nein danke, Herr Berset.
Die Task-Force war mit ihrem öffentlichen Dauer-Alarmismus und ihren haarsträubenden Falsch-Prognosen vor allem eine Manipulationsmaschine.
Und dann noch dies: Irgendwann während der heutigen Pressekonferenz gab Alain Berset zu verstehen, man habe mit all diesen Massnahmen die Bevölkerung schützen wollen. Vielleicht hatte man diese hehre Absicht einmal im Hinterkopf. Alle Massnahmen ab der zweiten Welle erfolgten nicht zum Schutz der Bevölkerung, sondern, um einen Zusammenbruch des Spital-Systems zu verhindern.
Doch jetzt stossen wir erst einmal auf das Ende der Apartheit an.'
# Some cleaning:
library(stringr)
text <- str_replace_all(text, "[^[:alnum:]]", " ")
text <- tolower(text)
words <- strsplit(text, "\\s+")[[1]]
words
for(word in words){
  # ??? how do I extract and count the two-letter combinations here?
}
Clean, replacing any sequence of non-alphanumeric characters with a space:
text = tolower(gsub("[^[:alnum:]]+", " ", text))
Find all pairs of sequential letters:
twos = substring(text, 1:(nchar(text) - 1), 2:nchar(text))
but keep only those that do not overlap a space:
twos[nchar(trimws(twos)) == 2L]
Here's the result
> twos[nchar(trimws(twos)) == 2L] |> table()
19 90 aa ab af ag äg ah äh ai al am an än ap ar är as at ät au äu ba be bl br
1 1 1 6 2 2 1 2 2 2 14 2 16 1 1 10 1 15 6 1 12 1 1 24 1 2
bs bt bu ce ch co da de dh di do du dw eb ed ef eg eh ei ek el em en ep er es
1 1 1 4 34 1 9 23 3 18 2 2 1 1 1 1 1 9 32 1 7 5 54 1 42 19
et eu ev ez fa fä fe ff fg fi fl fn fo fr ft fü ga ge gi gl gn gr gs gt ha he
12 3 3 1 2 1 4 2 3 2 3 1 4 2 3 4 1 19 2 1 2 3 1 4 8 17
hi hk hl hm hn ho hr ht hu hü hw ib ic id ie if ig ih ik il im in io ip ir is
3 1 1 3 2 3 9 11 1 1 1 2 16 1 18 2 4 2 2 3 3 28 2 1 5 12
it iu iv je ka ke kh ko kr kt la ld le lg lh li lk ll ln lö ls lt ma mä me mi
19 1 1 2 1 8 1 3 3 1 6 1 7 1 1 5 3 11 1 1 4 1 12 1 8 7
mm mo mö ms mu na nb nd ne nf ng ni nk nm nn no np nr ns nt nu nz ob oc od öf
3 3 1 2 3 4 1 23 13 1 10 8 5 2 4 3 1 1 6 10 2 3 2 3 2 2
og ög oh ol öl on op or os ös ov öv oz pa pe pf pi pl po pr pu ra rä rb rc rd
1 1 3 3 3 8 1 7 4 1 1 1 1 1 1 3 1 1 1 3 2 5 2 3 4 2
re rf rg rh ri rk rl rm rn ro rr rs rt ru rü rz sa sb sc se sf sh si sk sm sn
14 3 1 1 4 2 1 1 4 3 2 9 2 11 1 1 3 1 13 17 1 1 6 5 4 2
so sp sr ss st su sy ta tä te th ti tl to tr ts tt tu tz ub üb uc ud ue uf uh
2 3 1 9 17 3 1 7 2 24 1 6 1 1 4 6 3 1 4 1 2 2 1 2 6 1
üh ul um un ur ür us ut üt ve vi vo vö wa wä we wi wo ys ze zt zu zw
2 1 5 24 3 3 8 3 1 3 3 4 3 4 1 8 9 2 1 5 2 9 6
The algorithm generalizes to sequences of any number of letters by separating words with n - 1 spaces, so that no n-tuple can span a word boundary:
chartuples <- function(text, n = 2)
{
    n0 <- n - 1
    ## separate words with n - 1 spaces so no tuple spans a word boundary
    text <- tolower(gsub(
        "[^[:alnum:]]+", paste(rep(" ", n0), collapse = ""), text
    ))
    tuples <- substring(text, 1:(nchar(text) - n0), n:nchar(text))
    tuples[nchar(trimws(tuples)) == n]
}
This is also easy to use for looking up the values of any 'word':
counts <- table(chartuples(text))
counts[chartuples("djsarejslslasdfhsl")] |> as.vector()
(the NAs in the resulting vector are combinations not present in your original corpus).
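From here, one way to get at the probability score described in the question might be a mean log-frequency per bigram (a sketch only; score_word and the floor for unseen bigrams are assumptions introduced here, not part of any package):
freqs <- counts / sum(counts)   # relative frequencies of all bigrams in the corpus

score_word <- function(word, freqs, floor = 1e-6) {
    p <- as.vector(freqs[chartuples(word)])   # look up each bigram's frequency
    p[is.na(p)] <- floor                      # unseen bigrams get a tiny probability
    mean(log(p))                              # higher (less negative) = more German-like
}

score_word("Freudentag", freqs)           # relatively high score
score_word("djsarejslslasdfhsl", freqs)   # much lower score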
Another base R option, keeping the combinations grouped by word:
words <- unlist(strsplit(text, '[^[:alnum:]]+'))
cmbs2 <- sapply(words,
                function(x) substring(x, len <- seq(nchar(x) - 1), len + 1),
                USE.NAMES = TRUE)
head(cmbs2) ## Just to show a few words.
$Es
[1] "Es"
$ist
[1] "is" "st"
$ein
[1] "ei" "in"
$Freudentag
[1] "Fr" "re" "eu" "ud" "de" "en" "nt" "ta" "ag"
$ohne
[1] "oh" "hn" "ne"
$Zweifel
[1] "Zw" "we" "ei" "if" "fe" "el"
If I'm not wrong, this should be pretty efficient:
tokens_char <- function(str, window = 2) {
  # remove non-word characters
  str <- stringi::stri_replace_all_regex(str, "\\W", "")
  # lowercase
  str <- tolower(str)
  # prep window variable
  win <- window - 1
  len1 <- seq_len(nchar(str) - win)
  # split into strings of length window
  stringi::stri_sub(str, from = len1, to = len1 + win)
}
The key is stringi::stri_sub, which is a vectorised version of substr. A string is split by moving the window one character at a time, so "This text" is turned into "th" "hi" "is" "st" "te" "ex" "xt". After doing this, we can use some tidyverse code to count occurrences of tokens:
library(tidyverse)
tibble(
  token = tokens_char(text, window = 2)
) %>%
  count(token, sort = TRUE)
#> # A tibble: 308 × 2
#> token n
#> <chr> <int>
#> 1 en 55
#> 2 er 43
#> 3 ei 35
#> 4 ch 34
#> 5 nd 34
#> 6 in 28
#> 7 te 28
#> 8 be 24
#> 9 un 24
#> 10 de 23
#> # … with 298 more rows
Note that I also included a window argument, which I believe might be useful for your analysis.
tibble(
  token = tokens_char(text, window = 3)
) %>%
  count(token, sort = TRUE)
#> # A tibble: 851 × 2
#> token n
#> <chr> <int>
#> 1 die 16
#> 2 ich 16
#> 3 ein 15
#> 4 end 13
#> 5 sch 13
#> 6 und 12
#> 7 eit 11
#> 8 nde 10
#> 9 cht 9
#> 10 der 9
#> # … with 841 more rows
And finally, you can also first split your string into words so that letters following each other over word boundaries do not count. For example, "This text" is turned into "th" "hi" "is" "te" "ex" "xt":
tokens_char_words <- function(str, window = 2) {
  str <- unlist(tokenizers::tokenize_words(str))
  # prep window variable
  win <- window - 1
  len1 <- lapply(nchar(str) - win, seq_len)
  # split each word into strings of length window
  unlist(stringi::stri_sub_all(str = str, from = len1, to = lapply(len1, function(x) x + win)))
}
tokens_char_words("This text", window = 2)
#> [1] "th" "hi" "is" "te" "ex" "xt"
Created on 2022-02-18 by the reprex package (v2.0.1)

Make HTML page (text) suitable for text analysis in R

I would like to do some text analytics on the text from the following web page:
https://narodne-novine.nn.hr/clanci/sluzbeni/full/2007_07_79_2491.html
I don't know how to convert this HTML to a tidy text object (every line of the text becoming a row in a data frame).
For example, just applying the html_text() function doesn't help:
url <- "https://narodne-novine.nn.hr/clanci/sluzbeni/full/2007_07_79_2491.html"
p <- rvest::read_html(url, encoding = "UTF-8") %>%
  rvest::html_text()
p
since I don't get separated rows.
That site has some very well-structured HTML, with the section headers and the body text of each section given their own align attributes. We can use that to extract your text by section:
library(rvest)
library(tidyverse)
pg <- read_html("https://narodne-novine.nn.hr/clanci/sluzbeni/full/2007_07_79_2491.html")
html_nodes(pg, xpath = ".//p[@align='center']/following-sibling::p[@align='justify']") %>%
  map_df(~{
    data_frame(
      section = html_node(.x, xpath = ".//preceding-sibling::p[@align='center'][1]") %>%
        html_text(trim = TRUE),
      section_text = html_text(.x, trim = TRUE)
    )
  })
## # A tibble: 38 x 2
## section section_text
## <chr> <chr>
## 1 Članak 1. "U Zakonu o autorskom pravu i srodnim pravima (»Narodne novine«, br. 167/03.) u \r\nčlanku 4. sta…
## 2 Članak 2. "U članku 8. stavku 2. točki 1. riječ: »standardi« briše se.\r\nU stavku 3. druga rečenica mijenj…
## 3 Članak 3. "U članku 20. stavku 2. riječi: »na području Republike Hrvatske« zamjenjuju se \r\nriječima: »na …
## 4 Članak 4. "U članku 32. stavku 5. točki 1. i 3. riječ: »naprava« zamjenjuje se riječju: \r\n»uređaja«.\r\nU…
## 5 Članak 5. U članku 39. stavku 1. riječi: »stavka 1.« brišu se.
## 6 Članak 6. "U članku 44. stavku 5. dodaje se rečenica koja glasi:\r\n»U slučaju sumnje, u drugim slučajevima…
## 7 Članak 7. "U članku 52. stavku 3. riječ: »korištenja« zamjenjuje se riječju: \r\n»iskorištavanja«."
## 8 Članak 8. U članku 86. iza riječi: »koji je« dodaje se riječ: »u«.
## 9 Članak 9. "U članku 98. u stavku 1. riječ: »tehnoloških« zamjenjuje se riječju: \r\n»tehničkih«.\r\nStavak …
## 10 Članak 10. "U članku 109. dodaje se stavak 3. koji glasi:\r\n»(3) Odredbe iz članka 20. ovoga Zakona o iscrp…
## # ... with 28 more rows
You'll need to double-check that the above didn't miss anything. Even if it did, it should be straightforward to expand upon the answer.
You can get individual lines broken out using the above as well:
library(stringi)   # for stri_split_lines()

html_nodes(pg, xpath = ".//p[@align='center']/following-sibling::p[@align='justify']") %>%
  map_df(~{
    data_frame(
      section = html_node(.x, xpath = ".//preceding-sibling::p[@align='center'][1]") %>%
        html_text(trim = TRUE),
      section_text = html_text(.x, trim = TRUE)
    )
  }) %>%
  mutate(section_text = stri_split_lines(section_text)) %>%
  unnest(section_text)
## # A tibble: 334 x 2
## section section_text
## <chr> <chr>
## 1 Članak 1. "U Zakonu o autorskom pravu i srodnim pravima (»Narodne novine«, br. 167/03.) u "
## 2 Članak 1. članku 4. stavak 2. mijenja se i glasi:
## 3 Članak 1. "»(2) Odredbe iz ovoga Zakona o definicijama pojedinih autorskih imovinskih "
## 4 Članak 1. "prava, o pravu na naknadu za reproduciranje autorskog djela za privatno ili "
## 5 Članak 1. "drugo vlastito korištenje, o pravu na naknadu za javnu posudbu, kao i o "
## 6 Članak 1. "iscrpljenju prava distribucije, iznimkama i ograničenjima autorskih prava, "
## 7 Članak 1. "početku tijeka i učincima isteka rokova trajanja autorskog prava, autorskom "
## 8 Članak 1. "pravu u pravnom prometu te o odnosu autorskog prava i prava vlasništva "
## 9 Članak 1. "primjenjuju se na odgovarajući način i za srodna prava, ako za njih nije što "
## 10 Članak 1. posebno određeno ili ne proizlazi iz njihove pravne naravi.«
## # ... with 324 more rows
The tidytext package has examples of how to perform further cleanup transformations to facilitate text mining.
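For instance, a minimal tidytext sketch (assuming the section tibble built above has been assigned to a variable, here called sections purely for illustration):
library(tidytext)
sections %>%
  unnest_tokens(word, section_text) %>%   # one lowercase word per row
  count(section, word, sort = TRUE)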

how to find the number that is most repeated in a window size with for loop

I have a data frame that looks like this:
Time DT5.0_Prediction
20:10:36.051 IST 3
20:10:36.150 IST 3
20:10:36.251 IST 3
20:10:36.350 IST 3
20:10:36.450 IST 3
20:10:36.551 IST 1
20:10:36.651 IST 1
20:10:36.750 IST 1
20:10:36.851 IST 3
20:10:36.952 IST 1
20:10:37.051 IST 1
20:10:37.151 IST 1
20:10:37.252 IST 1
20:10:37.351 IST 3
20:10:37.452 IST 1
20:10:37.551 IST 1
20:10:37.652 IST 1
20:10:37.752 IST 3
20:10:37.853 IST 1
20:10:37.953 IST 1
20:10:38.053 IST 1
20:10:38.152 IST 1
20:10:38.252 IST 1
20:10:38.352 IST 1
20:10:38.453 IST 1
20:10:38.554 IST 1
I want to use a window size of 10 and get the data to look like this:
Starting Time Ending time Mode
20:10:36.051 IST 20:10:36.952 IST 3
20:10:37.051 IST 20:10:37.953 IST 1
20:10:38.053 IST 20:10:38.955 IST 1
and so on
In the Mode column of the table above, "3" is the value repeated the most times in that particular window, and "1" is the most repeated value in the next consecutive window.
I used the following code:
a <- 1
for(i in 1:length(mydata[,2])){
  b <- a + 99
  mydata$StartTime[i] <- mydata$Time[a]
  mydata$EndTime[i] <- mydata$Time[b]
  mydata$mode1234567[i] <- ifelse(b <= nrow(mydata),
                                  count(mydata[a:min(b, nrow(mydata)), 2]),
                                  NA)
  a <- b + 1
}
Using frequency and count here is wrong...
Thanks in advance.
One way is to split the data frame every 10 rows and create a data frame from each element, i.e.
do.call(rbind,
        lapply(split(df, (seq_len(nrow(df)) - 1) %/% 10), function(i)
          data.frame(Starting_Time = i[1, 1],
                     Ending_Time = i[nrow(i), 1],
                     mode = Mode(i[[2]]))))
which gives,
Starting_Time Ending_Time mode
0 20:10:36.051_IST 20:10:36.952_IST 3
1 20:10:37.051_IST 20:10:37.953_IST 1
2 20:10:38.053_IST 20:10:38.554_IST 1
Where Mode is simply a custom function to calculate the mode, taken from this answer.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
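The same windowing can also be written with dplyr, if you prefer (a sketch assuming the column names from the question, Time and DT5.0_Prediction, and the Mode function above):
library(dplyr)
df %>%
  group_by(window = (row_number() - 1) %/% 10) %>%   # 0, 1, 2, ... per block of 10 rows
  summarise(Starting_Time = first(Time),
            Ending_Time = last(Time),
            Mode = Mode(DT5.0_Prediction))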

ggplot: geom_errorbar: Error in "var"- se : non-numeric argument to binary operator

How to handle error "Error in rate - se : non-numeric argument to binary operator"?
My code:
ggplot(df, aes(x = zone, y = rate, fill = race)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_errorbar(aes(ymin = rate - se, ymax = rate + se))
My data:
race<-c(1,2,1,2,1,2,1,2)
zone<-c(0,0,1,1,2,2,3,3)
rate<-c(10.9,7.7,12.9,9.2,12.5,9.6,10.4,5.1)
se <-c(0.001,0.103,0.066,0.099,0.0060,0.062,0.096,0.001)
df <- data.frame(race, zone, rate,se)
Any hints are appreciated.
I suspect you've got character data masquerading as numbers...
With what looks like your data:
> zorace
racecat zone rate se
1 1 0 10.886621 0.001159755
2 2 0 7.763123 0.103422900
3 1 1 12.926866 0.065986546
4 2 1 9.196214 0.098244182
5 1 2 12.487529 0.060695012
6 2 2 9.626924 0.062437645
7 1 3 10.378148 0.096269240
8 2 3 5.042412 0.001159755
I get no errors:
> ggplot(zorace, aes(x = zone, y = rate, fill = racecat)) +
+   geom_bar(stat = "identity", position = "dodge") +
+   geom_errorbar(aes(ymin = rate - se, ymax = rate + se))
But if I convert the se column to characters:
> zorace$se=as.character(zorace$se)
It looks almost the same:
> zorace
racecat zone rate se
1 1 0 10.886621 0.001159755
2 2 0 7.763123 0.1034229
3 1 1 12.926866 0.065986546
4 2 1 9.196214 0.098244182
5 1 2 12.487529 0.060695012
6 2 2 9.626924 0.062437645
7 1 3 10.378148 0.09626924
8 2 3 5.042412 0.001159755
BUT:
> ggplot(zorace, aes(x = zone, y = rate, fill = racecat)) +
+   geom_bar(stat = "identity", position = "dodge") +
+   geom_errorbar(aes(ymin = rate - se, ymax = rate + se))
Error in rate - se : non-numeric argument to binary operator
>
What does summary(zorace) tell you about the columns? I suspect you've accidentally converted something to character, or it's been read in as character from a file with non-numeric fields that were filtered out afterwards.
It's also odd that your "number" columns align to the left; I suspect there are spaces padding them out to a fixed length.
Convert back to numbers with:
zorace$se = as.numeric(as.character(zorace$se))
Converting to character first protects you in case se is a factor variable; calling as.numeric directly on a factor would return the underlying level codes 1 to N instead of the values.
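A quick way to run that check on your own df, in plain base R:
str(df)                  # shows each column's class; chr or Factor where you expect num is the culprit
sapply(df, is.numeric)   # should be TRUE for every column used in arithmetic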

Error in sortedXyData using R

I have a dataset composed of observations Y, regressors X and a grouping variable group.
Link to the dataset: data.txt
Using the nlme library, I can build a grouped data frame with:
ex1 <- groupedData(Y ~ X | group, data = mydata)
After that, I would like to apply the function sortedXyData in order to sort my data with respect to X. When I try:
sortedXyData("X","Y",ex1)
I get the following empty result with a warning:
[1] x y
<0 rows> (or 0-length row.names)
Warning message:
In sortedXyData.default("X", "Y", ex1) :
  NAs introduced by coercion
But if I try with a much simpler dataset such as:
X <- c(1.2,2.2,3.5,-3.8,9,3.7,4,8,7)
Y <- c(5,4,8,3,6,2,0,5,5)
group <- c(1,2,3,4,5,6,7,8,9)
group <- as.factor(group)
data1 <- data.frame(X,Y,group)
data2 <- groupedData(Y ~ X | group, data = data1)
sortedXyData("X","Y",data2)
then the output is:
x y
1 4 0
2 7 5
3 8 5
4 9 6
Warning message:
In sortedXyData.default("X", "Y", data2) :
  NAs introduced by coercion
We can see that the integer values in X are sorted, but the decimal values are missing entirely. It seems the problem comes from the way R converts these values. I don't know what to do in order to have all the values in X sorted.
You have something wrong with your setup, or you are not posting exactly what you did.
If I run your simple data set, I get proper sorting:
sortedXyData("X","Y",data2)
x y
1 -3.8 3
2 1.2 5
3 2.2 4
4 3.5 8
5 3.7 2
6 4.0 0
7 7.0 5
8 8.0 5
9 9.0 6
Make sure you have the latest version of R and of the packages in use.
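If updating doesn't help, the "NAs introduced by coercion" warning suggests X is being converted from something non-numeric. A check worth trying on mydata (a guess, since the linked data file isn't reproduced here):
str(mydata$X)   # should be num; if it is chr or Factor, convert:
mydata$X <- as.numeric(as.character(mydata$X))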
