Make HTML page (text) suitable for text analysis in R - r

I would like to do some text analytics on text from following web page:
https://narodne-novine.nn.hr/clanci/sluzbeni/full/2007_07_79_2491.html
I don't know how to convert this HTML into a tidy text object, where every line of the text becomes one row of a data frame.
For example, just applying the html_text() function doesn't help:
url <- "https://narodne-novine.nn.hr/clanci/sluzbeni/full/2007_07_79_2491.html"
p <- rvest::read_html(url, encoding = "UTF-8") %>%
rvest::html_text()
p
because the result is one long string with no separate rows.

That site has some very well-structured HTML with the headers and the body text of the section each given their own align attributes. We can use that to extract your text by section:
library(rvest)
library(tidyverse)
pg <- read_html("https://narodne-novine.nn.hr/clanci/sluzbeni/full/2007_07_79_2491.html")
html_nodes(pg, xpath = ".//p[@align='center']/following-sibling::p[@align='justify']") %>%
  map_df(~{
    tibble(
      section = html_node(.x, xpath = ".//preceding-sibling::p[@align='center'][1]") %>%
        html_text(trim = TRUE),
      section_text = html_text(.x, trim = TRUE)
    )
  })
## # A tibble: 38 x 2
## section section_text
## <chr> <chr>
## 1 Članak 1. "U Zakonu o autorskom pravu i srodnim pravima (»Narodne novine«, br. 167/03.) u \r\nčlanku 4. sta…
## 2 Članak 2. "U članku 8. stavku 2. točki 1. riječ: »standardi« briše se.\r\nU stavku 3. druga rečenica mijenj…
## 3 Članak 3. "U članku 20. stavku 2. riječi: »na području Republike Hrvatske« zamjenjuju se \r\nriječima: »na …
## 4 Članak 4. "U članku 32. stavku 5. točki 1. i 3. riječ: »naprava« zamjenjuje se riječju: \r\n»uređaja«.\r\nU…
## 5 Članak 5. U članku 39. stavku 1. riječi: »stavka 1.« brišu se.
## 6 Članak 6. "U članku 44. stavku 5. dodaje se rečenica koja glasi:\r\n»U slučaju sumnje, u drugim slučajevima…
## 7 Članak 7. "U članku 52. stavku 3. riječ: »korištenja« zamjenjuje se riječju: \r\n»iskorištavanja«."
## 8 Članak 8. U članku 86. iza riječi: »koji je« dodaje se riječ: »u«.
## 9 Članak 9. "U članku 98. u stavku 1. riječ: »tehnoloških« zamjenjuje se riječju: \r\n»tehničkih«.\r\nStavak …
## 10 Članak 10. "U članku 109. dodaje se stavak 3. koji glasi:\r\n»(3) Odredbe iz članka 20. ovoga Zakona o iscrp…
## # ... with 28 more rows
You'll need to double check that the above didn't miss anything. Even if it did it should be straightforward to expand upon the answer.
You can get individual lines broken out using the above as well:
html_nodes(pg, xpath = ".//p[@align='center']/following-sibling::p[@align='justify']") %>%
  map_df(~{
    tibble(
      section = html_node(.x, xpath = ".//preceding-sibling::p[@align='center'][1]") %>%
        html_text(trim = TRUE),
      section_text = html_text(.x, trim = TRUE)
    )
  }) %>%
  # stri_split_lines() comes from stringi, which the libraries above do not attach
  mutate(section_text = stringi::stri_split_lines(section_text)) %>%
  unnest(section_text)
## # A tibble: 334 x 2
## section section_text
## <chr> <chr>
## 1 Članak 1. "U Zakonu o autorskom pravu i srodnim pravima (»Narodne novine«, br. 167/03.) u "
## 2 Članak 1. članku 4. stavak 2. mijenja se i glasi:
## 3 Članak 1. "»(2) Odredbe iz ovoga Zakona o definicijama pojedinih autorskih imovinskih "
## 4 Članak 1. "prava, o pravu na naknadu za reproduciranje autorskog djela za privatno ili "
## 5 Članak 1. "drugo vlastito korištenje, o pravu na naknadu za javnu posudbu, kao i o "
## 6 Članak 1. "iscrpljenju prava distribucije, iznimkama i ograničenjima autorskih prava, "
## 7 Članak 1. "početku tijeka i učincima isteka rokova trajanja autorskog prava, autorskom "
## 8 Članak 1. "pravu u pravnom prometu te o odnosu autorskog prava i prava vlasništva "
## 9 Članak 1. "primjenjuju se na odgovarajući način i za srodna prava, ako za njih nije što "
## 10 Članak 1. posebno određeno ili ne proizlazi iz njihove pravne naravi.«
## # ... with 324 more rows
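For readers without stringi or tidyr at hand, the same one-row-per-line reshaping can be sketched in base R. The small data frame below is a made-up stand-in for the scraped result, not data from the page, and the column names simply mirror the ones used above.

```r
# Toy stand-in for the scraped sections (made-up values)
sections <- data.frame(
  section = c("Article 1.", "Article 2."),
  section_text = c("first line\r\nsecond line", "only line"),
  stringsAsFactors = FALSE
)

# Split each section's text on CRLF or LF line breaks
lines_per_section <- strsplit(sections$section_text, "\r?\n")

# Repeat each section label once per line and flatten to a long data frame
tidy_lines <- data.frame(
  section = rep(sections$section, lengths(lines_per_section)),
  section_text = unlist(lines_per_section),
  stringsAsFactors = FALSE
)
tidy_lines
```

Here rep() with lengths() plays the role of unnest(): each section label is repeated as many times as its text has lines.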
The tidytext package has examples of how to perform further cleanup transformations to facilitate text mining.
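As a rough idea of what such a cleanup step does, here is a minimal base R sketch that lowercases the text, splits it into words, and counts them. The sample lines are made up and deliberately ASCII-only; for text with diacritics, like the page above, the splitting pattern would need to be Unicode-aware, so treat this as an illustration rather than a replacement for tidytext.

```r
# Made-up ASCII sample lines standing in for the section_text column
lines <- c("The quick brown fox.", "The lazy dog; the END")

# Lowercase, then split on runs of non-letter characters
words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
words <- words[nzchar(words)]   # drop empty strings left by leading separators

# Count occurrences, most frequent first
word_counts <- sort(table(words), decreasing = TRUE)
word_counts
```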

Related

extract all two-character-combinations from a string

In order to identify nonsense text (e.g. djsarejslslasdfhsl) from real (German) words, I would like to do an analysis of letter-frequencies.
My idea is to calculate the relative frequencies of two-letter-combinations ("te", "ex", "xt", "is" etc.) using a long text. Based on that information I would like to calculate the probability that a given word (or sentence) is real German.
But my first problem is, how to extract all the two-letter-combinations and to count them? I fear that using substring(string, start, stop) and increasing the values of start and stop in a loop might not be a very efficient solution. Do you have any idea?
# A short sample text
text <- 'Es ist ein Freudentag – ohne Zweifel. Gesundheitsminister Alain Berset und der Bundesrat gehen weiter, als man annehmen durfte. Die Zertifikatspflicht wird aufgehoben, die Maskenpflicht gilt nur noch im ÖV und in Gesundheitseinrichtungen.
Die beste Meldung des Tages aber ist: Die Covid-19-Task-Force, inzwischen als «Task-Farce» verballhornt, wird auf Ende März aufgehoben – zwei Monaten früher als geplant. Die Dauerkritik war wohl mit ein Grund, dass dieses Gremium sich jetzt rasch auflösen will.
Keine Rosen ohne Dornen: Einzelne Punkte von Bersets Ausführungen geben zu denken.
Die «Isolationshaft» für positiv Getestete bleibt zwingend. Das ist Unsinn und steht in einem scharfen Kontrast zu den übrigen Öffnungsschritten. Die Grundimmunität der Bevölkerung beträgt über 90 Prozent, das Virus ist nicht mehr gefährlich, warum will man weiter Leute zu Hause einsperren? Wer schwer krank ist, geht von sich aus nicht zur Arbeit. Die krankheitsbedingte Bettruhe muss man den Menschen nicht vorschreiben.
Gesundheitsminister Berset findet, das Modell Task-Force habe eine interessante Möglichkeit aufgezeigt für die Zusammenarbeit zwischen Regierung und Wissenschaft. Unter Umständen eigne sich dieses Modell auch für andere Bereiche.
Nein danke, Herr Berset.
Die Task-Force war mit ihrem öffentlichen Dauer-Alarmismus und ihren haarsträubenden Falsch-Prognosen vor allem eine Manipulationsmaschine.
Und dann noch dies: Irgendwann während der heutigen Pressekonferenz gab Alain Berset zu verstehen, man habe mit all diesen Massnahmen die Bevölkerung schützen wollen. Vielleicht hatte man diese hehre Absicht einmal im Hinterkopf. Alle Massnahmen ab der zweiten Welle erfolgten nicht zum Schutz der Bevölkerung, sondern, um einen Zusammenbruch des Spital-Systems zu verhindern.
Doch jetzt stossen wir erst einmal auf das Ende der Apartheit an.'
# Some cleaning:
library(stringr)
text <- str_replace_all(text, "[^[:alnum:]]", " ")
text <- tolower(text)
words <- strsplit(text, "\\s+")[[1]]
words
for(word in words){
???
}
Clean, replacing any sequence of non-alphanumeric with a space
text = tolower(gsub("[^[:alnum:]]+", " ", text))
Find all pairs of sequential letters
twos = substring(text, 1:(nchar(text) - 1), 2:nchar(text))
but only keep those that did not overlap a space
twos[nchar(trimws(twos)) == 2L]
Here's the result
> twos[nchar(trimws(twos)) == 2L] |> table()
19 90 aa ab af ag äg ah äh ai al am an än ap ar är as at ät au äu ba be bl br
1 1 1 6 2 2 1 2 2 2 14 2 16 1 1 10 1 15 6 1 12 1 1 24 1 2
bs bt bu ce ch co da de dh di do du dw eb ed ef eg eh ei ek el em en ep er es
1 1 1 4 34 1 9 23 3 18 2 2 1 1 1 1 1 9 32 1 7 5 54 1 42 19
et eu ev ez fa fä fe ff fg fi fl fn fo fr ft fü ga ge gi gl gn gr gs gt ha he
12 3 3 1 2 1 4 2 3 2 3 1 4 2 3 4 1 19 2 1 2 3 1 4 8 17
hi hk hl hm hn ho hr ht hu hü hw ib ic id ie if ig ih ik il im in io ip ir is
3 1 1 3 2 3 9 11 1 1 1 2 16 1 18 2 4 2 2 3 3 28 2 1 5 12
it iu iv je ka ke kh ko kr kt la ld le lg lh li lk ll ln lö ls lt ma mä me mi
19 1 1 2 1 8 1 3 3 1 6 1 7 1 1 5 3 11 1 1 4 1 12 1 8 7
mm mo mö ms mu na nb nd ne nf ng ni nk nm nn no np nr ns nt nu nz ob oc od öf
3 3 1 2 3 4 1 23 13 1 10 8 5 2 4 3 1 1 6 10 2 3 2 3 2 2
og ög oh ol öl on op or os ös ov öv oz pa pe pf pi pl po pr pu ra rä rb rc rd
1 1 3 3 3 8 1 7 4 1 1 1 1 1 1 3 1 1 1 3 2 5 2 3 4 2
re rf rg rh ri rk rl rm rn ro rr rs rt ru rü rz sa sb sc se sf sh si sk sm sn
14 3 1 1 4 2 1 1 4 3 2 9 2 11 1 1 3 1 13 17 1 1 6 5 4 2
so sp sr ss st su sy ta tä te th ti tl to tr ts tt tu tz ub üb uc ud ue uf uh
2 3 1 9 17 3 1 7 2 24 1 6 1 1 4 6 3 1 4 1 2 2 1 2 6 1
üh ul um un ur ür us ut üt ve vi vo vö wa wä we wi wo ys ze zt zu zw
2 1 5 24 3 3 8 3 1 3 3 4 3 4 1 8 9 2 1 5 2 9 6
The algorithm seems to generalize to sequences of any number of letters, n, by separating words with n - 1 spaces:
chartuples <-
function(text, n = 2)
{
n0 <- n - 1
text <- tolower(gsub(
"[^[:alnum:]]+", paste(rep(" ", n0), collapse = ""), text
))
tuples <- substring(text, 1:(nchar(text) - n0), n:nchar(text))
tuples[nchar(trimws(tuples)) == n]
}
This is also easy to use for looking up the values of any 'word'
counts <- table(chartuples(text))
counts[chartuples("djsarejslslasdfhsl")] |> as.vector()
(The NAs in the resulting vector mark letter pairs that are not present in your original corpus.)
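To connect these counts to the goal stated in the question (judging whether a word looks like real German), one possible scoring rule is sketched below. The helper names char_ngrams and score_word, the add-one smoothing, and the tiny corpus are all assumptions made for illustration; they are not part of the answer above, and a real analysis would need a much longer text.

```r
# Character n-grams within words (same idea as chartuples(), redefined
# here so that the block is self-contained)
char_ngrams <- function(text, n = 2) {
  words <- unlist(strsplit(tolower(text), "[^[:alnum:]]+"))
  words <- words[nchar(words) >= n]
  unlist(lapply(words, function(w) {
    starts <- seq_len(nchar(w) - n + 1)
    substring(w, starts, starts + n - 1)
  }))
}

# Hypothetical score: mean log relative frequency of a word's bigrams.
# Add-one smoothing gives unseen bigrams a small nonzero probability.
score_word <- function(word, counts) {
  freq <- as.vector(counts[char_ngrams(word)])  # NA for unseen bigrams
  freq[is.na(freq)] <- 0
  probs <- (freq + 1) / (sum(counts) + length(counts) + 1)
  mean(log(probs))
}

# Tiny illustrative corpus taken from the sample text
corpus <- "Die Zertifikatspflicht wird aufgehoben, die Maskenpflicht gilt nur noch"
counts <- table(char_ngrams(corpus))

# A higher (less negative) score means the bigrams are more typical of the corpus
score_word("pflicht", counts)
score_word("djsarejslslasdfhsl", counts)
```

Averaging the log probabilities, rather than multiplying raw probabilities, keeps scores comparable between short and long words.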
words <- unlist(strsplit(text, '[^[:alnum:]]+'))
cmbs2 <- sapply(words, function(x)substring(x, len <- seq(nchar(x) - 1), len + 1),USE.NAMES = TRUE)
head(cmbs2) ## Just to show a few words.
$Es
[1] "Es"
$ist
[1] "is" "st"
$ein
[1] "ei" "in"
$Freudentag
[1] "Fr" "re" "eu" "ud" "de" "en" "nt" "ta" "ag"
$ohne
[1] "oh" "hn" "ne"
$Zweifel
[1] "Zw" "we" "ei" "if" "fe" "el"
If I'm not wrong, this should be pretty efficient:
tokens_char <- function(str, window = 2) {
# remove non-word characters
str <- stringi::stri_replace_all_regex(str, "\\W", "")
# lowercase
str <- tolower(str)
# prep window variable
win <- window - 1
len1 <- seq_len(nchar(str) - win)
# split into strings of length window
stringi::stri_sub(str, from = len1, to = len1 + win)
}
The key is stringi::stri_sub, which is a vectorised version of substr. The string is split by moving the window one character at a time, so "This text" is turned into "th" "hi" "is" "st" "te" "ex" "xt". After doing this, we can use some tidyverse code to count occurrences of tokens:
library(tidyverse)
tibble(
token = tokens_char(text, window = 2)
) %>%
count(token, sort = TRUE)
#> # A tibble: 308 × 2
#> token n
#> <chr> <int>
#> 1 en 55
#> 2 er 43
#> 3 ei 35
#> 4 ch 34
#> 5 nd 34
#> 6 in 28
#> 7 te 28
#> 8 be 24
#> 9 un 24
#> 10 de 23
#> # … with 298 more rows
Note that I also included a window argument, which I believe might be useful for your analysis.
tibble(
token = tokens_char(text, window = 3)
) %>%
count(token, sort = TRUE)
#> # A tibble: 851 × 2
#> token n
#> <chr> <int>
#> 1 die 16
#> 2 ich 16
#> 3 ein 15
#> 4 end 13
#> 5 sch 13
#> 6 und 12
#> 7 eit 11
#> 8 nde 10
#> 9 cht 9
#> 10 der 9
#> # … with 841 more rows
And finally, you can also first split your string into words so that letters following each other over word boundaries do not count. For example, "This text" is turned into "th" "hi" "is" "te" "ex" "xt":
tokens_char_words <- function(str, window = 2) {
str <- unlist(tokenizers::tokenize_words(str))
# prep window variable
win <- window - 1
len1 <- lapply(nchar(str) - win, seq_len)
# split into strings of length window
unlist(stringi::stri_sub_all(str = str, from = len1, to = lapply(len1, function(x) x + win)))
}
tokens_char_words("This text", window = 2)
#> [1] "th" "hi" "is" "te" "ex" "xt"
Created on 2022-02-18 by the reprex package (v2.0.1)

Histogram with density curves of other data

I want to recreate a histogram with overlaid density curves, but I have trouble fitting the density curves to the plot.
MWE of what I have achieved so far (the data in the tibble are just sample data):
tibble(home = sample(1:10,90, replace = T), away = sample(1:10,90, replace = T)) %>%
gather(key=Type, value=Value) %>%
ggplot(aes(x=Value,fill=Type)) +
geom_histogram(position="dodge")
UPDATE after answer by @Kota Mori
I adjusted the answer given by Kota Mori and got the following, which results in an error. Before I start, let's have a look at the datasets I want to use for the graph:
#Both Goals variables of this dataframe should be used for the histogram
actual
# A tibble: 90 x 7
season matchday club_name_home club_name_away goals_team_home goals_team_away sumgoals
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1819 21 ETuS Haltern TuS 05 Sinsen II 2 2 4
2 1819 21 VfL Ramsdorf Westfalia Gemen II 2 0 2
3 1819 21 FC RW Dorsten SV Altendorf-Ulfkotte 8 4 12
4 1819 21 SuS Hervest-Dorsten 1. SC BW Wulfen 0 0 0
5 1819 21 SV Lembeck SC Reken II 1 1 2
6 1819 21 RC Borken-Hoxfeld TSV Raesfeld 3 1 4
7 1819 21 TuS Velen Fenerbahce I. Marl 5 2 7
8 1819 21 BVH Dorsten SC Marl-Hamm 2 0 2
9 1819 21 1. SC BW Wulfen FC RW Dorsten 3 0 3
10 1819 21 BVH Dorsten SV Altendorf-Ulfkotte 2 0 2
# ... with 80 more rows
#Both Goals variables of this dataframe should be used for the density lines
poisson
# A tibble: 90 x 6
season matchday club_name_home club_name_away Goals_team_home Goals_team_away
<dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1819 21 ETuS Haltern TuS 05 Sinsen II 2 2
2 1819 21 VfL Ramsdorf Westfalia Gemen II 3 0
3 1819 21 FC RW Dorsten SV Altendorf-Ulfkotte 2 0
4 1819 21 SuS Hervest-Dorsten 1. SC BW Wulfen 0 4
5 1819 21 SV Lembeck SC Reken II 2 1
6 1819 21 RC Borken-Hoxfeld TSV Raesfeld 2 1
7 1819 21 TuS Velen Fenerbahce I. Marl 2 1
8 1819 21 BVH Dorsten SC Marl-Hamm 3 1
9 1819 21 1. SC BW Wulfen FC RW Dorsten 2 0
10 1819 21 BVH Dorsten SV Altendorf-Ulfkotte 2 1
# ... with 80 more rows
So I adjusted the answer by Kota Mori to end up with the following code :
sim_years <- 1819
actual <- read_rds(here::here(paste0("/data/database_match_results_",sim_years,".rds")))%>%
filter(between(matchday, 21, max(database_season$matchday)))
poisson <- missinggames
data <- rbind(data.frame(type="home", value=actual$goals_team_home, stringsAsFactors=FALSE),
data.frame(type="away", value=actual$goals_team_away, stringsAsFactors=FALSE))
estimate <- group_by(poisson %>% select(Goals_team_home,Goals_team_away), type) %>% summarize(mu=mean(value))
dens <- expand.grid(value=0:max(data$value), type=c("away", "home"),
stringsAsFactors=FALSE) %>%
inner_join(estimate) %>%
mutate(density=dpois(value, mu))
prop <- group_by(data, type, value) %>% summarize(count=n()) %>%
group_by(type) %>% mutate(prop=count/sum(count))
tmp_actual <- left_join(dens, prop) %>% replace_na(list(prop=0, count=0))
ggplot(tmp_actual, aes(x=value, weight=prop, fill=type)) +
geom_bar(position="dodge") +
geom_line(aes(value, density, color=type, weight=NULL))
Which results in the following error : 'Error: Mapping should be created with aes() or aes_().'
I think you need to calculate the Poisson parameters on your own, which turns out to be as easy as calculating the sample mean for each type.
The following code generates a graph similar to the example.
library(dplyr)
library(tidyr)  # for replace_na()
library(ggplot2)
data <- rbind(data.frame(type="home", value=rpois(90, 2.5), stringsAsFactors=FALSE),
data.frame(type="away", value=rpois(90, 1.5), stringsAsFactors=FALSE))
estimate <- group_by(data, type) %>% summarize(mu=mean(value))
dens <- expand.grid(value=0:max(data$value), type=c("away", "home"),
stringsAsFactors=FALSE) %>%
inner_join(estimate) %>%
mutate(density=dpois(value, mu))
prop <- group_by(data, type, value) %>% summarize(count=n()) %>%
group_by(type) %>% mutate(prop=count/sum(count))
tmp <- left_join(dens, prop) %>% replace_na(list(prop=0, count=0))
ggplot(tmp, aes(x=value, weight=prop, fill=type)) +
geom_bar(position="dodge") +
geom_line(aes(value, density, color=type, weight=NULL))
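The statement that the Poisson parameter can be estimated by the sample mean can be checked numerically. The sketch below uses simulated goals (the seed and the rate of 2.5 are arbitrary choices, not values from the question) and compares the sample mean with the value that maximizes the Poisson log-likelihood.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
goals <- rpois(90, lambda = 2.5)  # simulated data, not the real match results

# Closed-form estimate: the sample mean
mu_hat <- mean(goals)

# Numerical check: maximize the Poisson log-likelihood over mu
loglik <- function(mu) sum(dpois(goals, mu, log = TRUE))
opt <- optimize(loglik, interval = c(0.01, 10), maximum = TRUE)

# The two estimates should agree up to the optimizer's tolerance
c(sample_mean = mu_hat, numeric_mle = opt$maximum)
```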

Sum certain rows given 2 constraints in R

I am trying to write a conditional statement with the following constraints. Below is an example data frame showing the problem I am running into.
Row <- c(1,2,3,4,5,6,7)
La <- c(51.25,51.25,51.75,53.25,53.25,54.25,54.25)
Lo <- c(128.25,127.75,127.25,119.75,119.25,118.75,118.25)
Y <- c(5,10,2,4,5,7,9)
Cl <- c("EF","EF","EF","EF","NA","NA","CE")
d <- data.frame(Row,La,Lo,Y,Cl)
Row La Lo Y Cl
1 1 51.25 128.25 5 EF
2 2 51.25 127.75 10 EF
3 3 51.75 127.25 2 EF
4 4 53.25 119.75 4 EF
5 5 53.25 119.25 5 NA
6 6 54.25 118.75 7 NA
7 7 55.25 118.25 9 CE
For every row where "Cl" is NA, I would like to add its "Y" value to the "Y" of the closest row, that is, a row whose "La" and "Lo" values differ by 1.00 or less, and then drop the NA row. In effect, I want to remove the NA rows from the data frame without losing their "Y" values, by adding each value to its closest neighbor.
I would like the return data frame to look like this:
Row2 <- c(1,2,3,4,7)
La2 <- c(51.25,51.25,51.75,53.25,55.25)
Lo2 <- c(128.25,127.75,127.25,119.75,118.25)
Y2 <- c(5,10,2,9,16)
Cl2 <- c("EF","EF","EF","EF","CE")
d2 <- data.frame(Row2,La2,Lo2,Y2,Cl2)
Row2 La2 Lo2 Y2 Cl2
1 1 51.25 128.25 5 EF
2 2 51.25 127.75 10 EF
3 3 51.75 127.25 2 EF
4 4 53.25 119.75 9 EF
5 7 55.25 118.25 16 CE
recent edit: If NA row is close to one row in terms of Lo value and same closeness to another row in La value, join by La value. If there are 2 equally close rows of Lo and La values, join by smaller La value.
Thank you for the help!
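Here is a method to use if you can make some distance matrix m for the distances between all the (La, Lo) rows in your data. I use the output of dist, which is Euclidean distance. Each NA row is assigned to the non-NA row with the lowest distance, or to the earliest such row if the lowest distance is shared by more than one row. The code assumes the missing entries of Cl are real NA values; if they are the string "NA", as in the example data, convert them first, for example with d$Cl[d$Cl == "NA"] <- NA.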
w <- which(is.na(d$Cl))
m <- as.matrix(dist(d[c('La', 'Lo')]))
m[row(m) %in% w] <- NA
d$g <- replace(seq(nrow(d)), w, apply(m[,w], 2, which.min))
library(dplyr)
d %>%
group_by(g) %>%
summarise(La = La[!is.na(Cl)],
Lo = Lo[!is.na(Cl)],
Y = sum(Y),
Cl = Cl[!is.na(Cl)]) %>%
select(-g)
# # A tibble: 5 x 4
# La Lo Y Cl
# <dbl> <dbl> <dbl> <fct>
# 1 51.2 128. 5 EF
# 2 51.2 128. 10 EF
# 3 51.8 127. 2 EF
# 4 53.2 120. 9 EF
# 5 54.2 118. 16 CE
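The nearest-neighbour step can also be followed in isolation with base R only. This is a minimal sketch on rows 4 to 7 of the example data as defined in the question's code, with real NA values in Cl; the loop that moves Y is my own simplification, not the summarise approach used above.

```r
# Rows 4 to 7 of the example data, with real NA values in Cl
d <- data.frame(
  La = c(53.25, 53.25, 54.25, 54.25),
  Lo = c(119.75, 119.25, 118.75, 118.25),
  Y  = c(4, 5, 7, 9),
  Cl = c("EF", NA, NA, "CE"),
  stringsAsFactors = FALSE
)

w <- which(is.na(d$Cl))                    # rows to merge away
m <- as.matrix(dist(d[c("La", "Lo")]))     # pairwise Euclidean distances
m[w, ] <- NA                               # NA rows cannot be chosen as a neighbour

# For each NA row (a column of m), find the closest labelled row;
# drop = FALSE keeps m a matrix even when there is only one NA row
nearest <- apply(m[, w, drop = FALSE], 2, which.min)

# Move each NA row's Y onto its nearest labelled row, then drop the NA rows
for (i in seq_along(w)) {
  d$Y[nearest[i]] <- d$Y[nearest[i]] + d$Y[w[i]]
}
result <- d[-w, ]
result
```

Note that this sketch, like the answer above, always merges into the nearest row and does not enforce the 1.00 cutoff mentioned in the question.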

R conditionally map a repeated vector as column to a data frame

I would like to subtract a vector of means from the original values. I cannot figure out how to match the means to the values by their corresponding conditions. So far I have tried arranging the values so that they line up, but even that fails.
library("reshape")
require('plyr')
require("dplyr")
The dataframe:
n <- as.factor(rep(c(1:16), times=2))
s <- as.factor(rep(c("ja","nein"), each=8, times=2))
b <- as.factor(rep(c("red", "green","blue", "pink"),times=8))
zahl <- runif(32)
df <- data.frame(n, s, b, zahl)
the means as a column:
df.mean <- melt(data.frame(cast(df, b~s, mean)), id=1, measured=2:3)
my wrong version:
df.final <- df%>%
mutate(r=1:32,
trial=rep(1:2, each=16))%>%
#arrange(r,n,trial,s,b)%>% # this doesn't arrange the "ja"/"nein" rows to line up with the means
mutate(mean.bs=rep(df.mean[,3], times=4),
diff=zahl-mean.bs)
the results should be like:
n s b zahl trial mean.bs diff
1 1 ja red 0.49 1 0.8025 -0.3125
2 2 ja green 0.59 1 0.6200 -0.0300
3 3 ja blue 0.97 1 0.3175 0.6525
4 4 ja pink 0.04 1 0.5225 -0.4825
5 9 nein red 0.x 1 0.4775 0.x
6 10 nein green 0.x 1 0.3975 0.x
7 11 nein blue 0.x 1 0.5625 0.x
8 12 nein pink 0.x 1 0.3925 0.x
9 5 ja red 0.x 1 0.8025 -0.x # here means repeat
10 6 ja green 0.x 1 0.6200 -0.x
...
And maybe there is a more precise way to do it? (with condition ...)
thank you!
We can get the difference within the mutate itself
library(dplyr)
df %>%
group_by(b,s) %>%
mutate(mean.bs= mean(zahl), diff= zahl-mean.bs)
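The same group-wise centering can be sketched in base R with ave(), which returns the group mean aligned with each original row. The data frame below is a smaller made-up version of the one in the question (arbitrary seed, two colours only).

```r
set.seed(42)  # arbitrary seed so the random values are reproducible
df <- data.frame(
  s = rep(c("ja", "nein"), each = 4),
  b = rep(c("red", "green"), times = 4),
  zahl = runif(8)
)

# ave() computes mean(zahl) within each b/s combination, one value per row
df$mean.bs <- ave(df$zahl, df$b, df$s, FUN = mean)
df$diff <- df$zahl - df$mean.bs
df
```

Because the means are computed per group and returned in the original row order, no manual arranging or rep() of the means is needed.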
OK, I'm not 100% sure that's what you want to achieve (setting a seed before using randomized data is a good idea), but try this, picking up after your df.mean <- ... line:
colnames(df.mean) <- c("b","s","mean.bs")
df$trial <- rep(1:2, each=16)
df2 <- merge(df, df.mean, by=c("b", "s"))
df2$diff <- df2$zahl - df2$mean.bs
df2 <- df2[order(df2$trial, df2$n),]
rownames(df2) <- NULL
head(df2)
b s n zahl trial mean.bs diff
1 red ja 1 0.87370077 1 0.6972817 0.1764190
2 green ja 2 0.01389495 1 0.4272126 -0.4133177
3 blue ja 3 0.96772185 1 0.5276125 0.4401094
4 pink ja 4 0.80911187 1 0.3625441 0.4465678
5 red ja 5 0.47676424 1 0.6972817 -0.2205175
6 green ja 6 0.07390932 1 0.4272126 -0.3533033

Use length and integer(0) to test a conditional expression

I am posting this question, which refers to a statistical problem; however, the issue I have is with the code.
I have the vector below and compute which autocorrelation values are significant at a 95% confidence level. (There is only one significant value, at t+1.) I am using the quantmod package.
x<-c(1,1,3,8,5,2,4,3,1,1,0,5,1,1,3,4,6,7,8,9,4,2,1)
a<-acf(x)
b<-a[[1]]
c<-(b[2:length(b)])
posssignificance_level<-qnorm((1+0.95)/2)/sqrt(sum(!is.na(x)))
posssignificance_level
negsignificance_level<- -posssignificance_level
negsignificance_level
poscorr<-which(posssignificance_level<c)
negcorr<-which(negsignificance_level>c)
poscorr
negcorr
I would like to lag x if there is any significant value above or below the 95% confidence interval, in poscorr, negcorr, or both. I attempted to use the expressions below without success. (I use length(poscorr==0) and length(negcorr==0) because the length of the resulting vector is 0 when there is no autocorrelation. The result for negcorr is "integer(0)".)
posautorrelation <- if(length(poscorr==0)) Lag(x,0) else Lag(x,poscorr)
negautorrelation <- if(length(negcorr==0)) Lag(x,0) else Lag(x,negcorr)
Error in `colnames<-`(`*tmp*`, value = "Lag.") :
length of 'dimnames' [2] not equal to array extent
Error during wrapup: cannot open the connection
I also tried
posautorrelation <- if((poscorr==integer(0)) Lag(x,0) else Lag(x,poscorr)
Error: unexpected symbol in "posautorrelation <- if(length(poscorr==integer(0)) Lag"
Error during wrapup: cannot open the connection
negautorrelation <- if((negcorr==integer(0)) Lag(x,0) else Lag(x,negcorr)
Error: unexpected symbol in "negautorrelation <- if(length(negcorr==integer(0)) Lag"
Error during wrapup: cannot open the connection
I would like to know how to write the last two expressions so that I obtain two versions of x: one lagged by the significant lag, and one with a lag of 0 when there is no significant autocorrelation (that is, when the result is integer(0), as it is for negcorr).
The first part explains why the which(...) expression returns integer(0)
print(negsignificance_level)
# [1] -0.4086807
min(c)
# [1] -0.3432622
which(negsignificance_level > c)
# integer(0)
This is, I believe, the answer to your question:
if (length(object) == 0) expression1 else expression2
posautorrelation <- if(length(poscorr) == 0) Lag(x,0) else Lag(x, poscorr)
posautorrelation
# [1] NA 1 1 3 8 5 2 4 3 1 1 0 5 1 1 3 4 6 7 8 9 4 2
negautorrelation <- if(length(negcorr) == 0) Lag(x,0) else Lag(x, negcorr)
negautorrelation
# [1] 1 1 3 8 5 2 4 3 1 1 0 5 1 1 3 4 6 7 8 9 4 2 1
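The difference between length(v == 0) and length(v) == 0 is easy to miss, so here is a small base R sketch of it. The vectors empty and hits and the helper lag_or_zero are made up for illustration; the helper only picks the lag value and does not call Lag() itself.

```r
empty <- integer(0)   # what which() returns when nothing matches
hits  <- c(1L, 3L)    # what which() might return when two lags match

# Comparing first and then taking the length measures the comparison result,
# so it is 0 only for an empty vector and never tests for emptiness correctly
length(empty == 0)    # 0, which if() treats as FALSE
length(hits == 0)     # 2, which if() treats as TRUE

# Taking the length first and then comparing gives the intended test
length(empty) == 0    # TRUE
length(hits) == 0     # FALSE

# Hypothetical helper: fall back to lag 0 when there are no significant lags
lag_or_zero <- function(idx) if (length(idx) == 0) 0L else idx
lag_or_zero(empty)
lag_or_zero(hits)
```

This also explains the first error in the question: for the empty negcorr, length(negcorr==0) is 0, so the else branch runs and Lag() receives integer(0) as its lag.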

Resources