I have a dataset in which I paste values in a dplyr chain and collapse with the pipe character (e.g. " | "). If any of the values in the dataset are blank, I just get recurring pipe characters in the pasted list.
Some of the values look like this, for example:
badstring = "| | | | | | GHOULSBY,SCROGGINS | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CAT,JOHNSON | | | | | | | | | | | | BURGLAR,PALA | | | | | | | | |"
I want to match all the pipes that occur more than once and delete them, so that just the names appear like so:
correctstring = "| GHOULSBY,SCROGGINS | CAT,JOHNSON | BURGLAR,PALA |"
I tried the following, but to no avail:
mutate(names = gsub('[\\|]{2,}', '', name_list))
The difficulty in this question is in formulating a regex which can selectively remove every pipe, except the ones we want to remain as actual separators between terms. We can match on the following pattern:
\|\s+(?=\|)
and then replace each match with an empty string. This pattern removes any pipe (and the whitespace after it) so long as what follows is another pipe. No removal occurs when a pipe is followed by an actual term, or when it is followed by the end of the string. The lookahead (?=\|) requires perl=TRUE in gsub.
badstring = "| | | | | | GHOULSBY,SCROGGINS | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CAT,JOHNSON | | | | | | | | | | | | BURGLAR,PALA | | | | | | | | |"
result <- gsub("\\|\\s+(?=\\|)", "", badstring, perl=TRUE)
result
[1] "| GHOULSBY,SCROGGINS | CAT,JOHNSON | BURGLAR,PALA |"
Edit:
If you expect to have inputs like | | | which are devoid of any terms, and you would expect empty string as the output, then my solution would fail. I don't see an obvious way to modify the above regex, but you can handle this case with one more call to sub:
result <- sub("^\\|$", "", result)
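We can also fold this case into the original pattern with an alternation. The extra branch has to match the whole all-pipe string in one go: gsub tests ^ against the start of the original string, so a branch like ^\|$ cannot match the single pipe left over at the end of | | |.
result <- gsub("^(?:\\|\\s*)+$|\\|\\s+(?=\\|)", "", badstring, perl=TRUE)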
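The two-step version above (gsub, then sub) can be wrapped in a small helper and applied to a whole column. This is only a sketch: the function name squeeze_pipes and the test strings are made up for illustration.

```r
# Hypothetical helper: drop every pipe that is directly followed by another
# pipe, then blank out strings that contained no terms at all.
squeeze_pipes <- function(x) {
  x <- gsub("\\|\\s+(?=\\|)", "", x, perl = TRUE)  # pipe + whitespace before another pipe
  sub("^\\|$", "", x)                               # a lone pipe means "no terms"
}

# Made-up inputs: terms with blanks around them, no terms at all, already clean.
cleaned <- squeeze_pipes(c("| | | A,B | | | C,D | | |", "| | |", "| A,B |"))
print(cleaned)
```

Because gsub and sub are vectorised, the helper can be called inside mutate() on the pasted column without any further changes.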
I'm trying to create a dummy variable based on a character variable.
For example, I need to create a "newcat" variable that flags the codes from "I00" to "I99".
In the code I wrote, I list every code from I00 to I99 explicitly.
Is there a way to make this code more efficient, for example with a loop that iterates over the number after the letter?
Thank you in advance!!
mort <- mort %>%
mutate(newcat = ifelse(ucod=="I00" |
ucod=="I01" | ucod=="I02" | ucod=="I03" | ucod=="I04" | ucod=="I05" |
ucod=="I06" | ucod=="I07" | ucod=="I08" | ucod=="I09" | ucod=="I10" |
ucod=="I11" | ucod=="I12" | ucod=="I13" | ucod=="I14" | ucod=="I15" |
ucod=="I16" | ucod=="I17" | ucod=="I18" | ucod=="I19" | ucod=="I20" |
ucod=="I21" | ucod=="I22" | ucod=="I23" | ucod=="I24" | ucod=="I25" |
ucod=="I26" | ucod=="I27" | ucod=="I28" | ucod=="I29" | ucod=="I30" |
ucod=="I31" | ucod=="I32" | ucod=="I33" | ucod=="I34" | ucod=="I35" |
ucod=="I36" | ucod=="I37" | ucod=="I38" | ucod=="I39" | ucod=="I40" |
ucod=="I41" | ucod=="I42" | ucod=="I43" | ucod=="I44" | ucod=="I45" |
ucod=="I46" | ucod=="I47" | ucod=="I48" | ucod=="I49" | ucod=="I50" |
ucod=="I51" | ucod=="I52" | ucod=="I53" | ucod=="I54" | ucod=="I55" |
ucod=="I56" | ucod=="I57" | ucod=="I58" | ucod=="I59" | ucod=="I60" |
ucod=="I61" | ucod=="I62" | ucod=="I63" | ucod=="I64" | ucod=="I65" |
ucod=="I66" | ucod=="I67" | ucod=="I68" | ucod=="I69" | ucod=="I70" |
ucod=="I71" | ucod=="I72" | ucod=="I73" | ucod=="I74" | ucod=="I75" |
ucod=="I76" | ucod=="I77" | ucod=="I78" | ucod=="I79" | ucod=="I80" |
ucod=="I81" | ucod=="I82" | ucod=="I83" | ucod=="I84" | ucod=="I85" |
ucod=="I86" | ucod=="I87" | ucod=="I88" | ucod=="I89" | ucod=="I90" |
ucod=="I91" | ucod=="I92" | ucod=="I93" | ucod=="I94" | ucod=="I95" |
ucod=="I96" | ucod=="I97" | ucod=="I98" | ucod=="I99", 1, 0))
Try %in% instead of == with |
x <- c(paste0("I0", 0:9),paste0("I", c(10:99)))
mort %>%
mutate(newcat = ifelse(ucod %in% x, 1, 0))
Another option is to use a regex (str_detect comes from the stringr package):
library(stringr)
mort <- mort %>%
mutate(newcat = +str_detect(ucod, '^I[0-9]{2}$'))
Here ^ is a metacharacter that anchors the match at the beginning of the string. I[0-9]{2} matches the letter I followed by any two digits from 0 to 9, and $ is another metacharacter that anchors the match at the end of the string. So a matching string must start with I, continue with exactly two digits, and end there. Any string that does not match the pattern is flagged as FALSE, which the unary + converts to 0.
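For comparison, both ideas can be sketched in base R without dplyr or stringr. The sample codes below are made up; only the column name ucod comes from the question.

```r
# Made-up codes for illustration; `ucod` mirrors the column name in the question.
mort <- data.frame(ucod = c("I00", "I25", "I99", "I100", "J18", "C34", NA))

# Lookup approach: build "I00", "I01", ..., "I99" and test membership.
codes <- sprintf("I%02d", 0:99)
mort$newcat_in <- as.integer(mort$ucod %in% codes)

# Regex approach: the letter I followed by exactly two digits.
mort$newcat_re <- as.integer(grepl("^I[0-9]{2}$", mort$ucod))

print(mort)
```

Both columns agree on this sample. Note that %in% and grepl() both treat a missing code as a non-match, so NA becomes 0 here.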
How can I shorten the following code? It feels repetitive and lengthy, and I'm not sure how to select these variables and recode them in a succinct way. Any help is welcome!
data_France$X.1CTP2[data_France$X.1CTP2>7.01 | data_France$X.1CTP2<0.99]<-NA
data_France$X.1CTP3[data_France$X.1CTP3>7.01 | data_France$X.1CTP3<0.99]<-NA
data_France$X.1CTP4[data_France$X.1CTP4>7.01 | data_France$X.1CTP4<0.99]<-NA
data_France$X.1CTP5[data_France$X.1CTP5>7.01 | data_France$X.1CTP5<0.99]<-NA
data_France$X.1CTP6[data_France$X.1CTP6>7.01 | data_France$X.1CTP6<0.99]<-NA
data_France$X.1CTP7[data_France$X.1CTP7>7.01 | data_France$X.1CTP7<0.99]<-NA
data_France$X.1CTP8[data_France$X.1CTP8>7.01 | data_France$X.1CTP8<0.99]<-NA
data_France$X.1CTP9[data_France$X.1CTP9>7.01 | data_France$X.1CTP9<0.99]<-NA
data_France$X.1CTP10[data_France$X.1CTP10>7.01 | data_France$X.1CTP10<0.99]<-NA
data_France$X.1CTP11[data_France$X.1CTP11>7.01 | data_France$X.1CTP11<0.99]<-NA
data_France$X.1CTP12[data_France$X.1CTP12>7.01 | data_France$X.1CTP12<0.99]<-NA
data_France$X.1CTP13[data_France$X.1CTP13>7.01 | data_France$X.1CTP13<0.99]<-NA
data_France$X.1CTP14[data_France$X.1CTP14>7.01 | data_France$X.1CTP14<0.99]<-NA
data_France$X.1CTP15[data_France$X.1CTP15>7.01 | data_France$X.1CTP15<0.99]<-NA
data_France$X.2CTP1[data_France$X.2CTP1>7.01 | data_France$X.2CTP1<0.99]<-NA
data_France$X.2CTP3[data_France$X.2CTP3>7.01 | data_France$X.2CTP3<0.99]<-NA
data_France$X.2CTP4[data_France$X.2CTP4>7.01 | data_France$X.2CTP4<0.99]<-NA
data_France$X.2CTP5[data_France$X.2CTP5>7.01 | data_France$X.2CTP5<0.99]<-NA
data_France$X.2CTP6[data_France$X.2CTP6>7.01 | data_France$X.2CTP6<0.99]<-NA
data_France$X.2CTP7[data_France$X.2CTP7>7.01 | data_France$X.2CTP7<0.99]<-NA
data_France$X.2CTP8[data_France$X.2CTP8>7.01 | data_France$X.2CTP8<0.99]<-NA
data_France$X.2CTP9[data_France$X.2CTP9>7.01 | data_France$X.2CTP9<0.99]<-NA
data_France$X.2CTP10[data_France$X.2CTP10>7.01 | data_France$X.2CTP10<0.99]<-NA
data_France$X.2CTP11[data_France$X.2CTP11>7.01 | data_France$X.2CTP11<0.99]<-NA
data_France$X.2CTP12[data_France$X.2CTP12>7.01 | data_France$X.2CTP12<0.99]<-NA
data_France$X.2CTP13[data_France$X.2CTP13>7.01 | data_France$X.2CTP13<0.99]<-NA
data_France$X.2CTP14[data_France$X.2CTP14>7.01 | data_France$X.2CTP14<0.99]<-NA
data_France$X.2CTP15[data_France$X.2CTP15>7.01 | data_France$X.2CTP15<0.99]<-NA
data_France$X.3CTP1[data_France$X.3CTP1>7.01 | data_France$X.3CTP1<0.99]<-NA
data_France$X.3CTP2[data_France$X.3CTP2>7.01 | data_France$X.3CTP2<0.99]<-NA
data_France$X.3CTP4[data_France$X.3CTP4>7.01 | data_France$X.3CTP4<0.99]<-NA
data_France$X.3CTP5[data_France$X.3CTP5>7.01 | data_France$X.3CTP5<0.99]<-NA
data_France$X.3CTP6[data_France$X.3CTP6>7.01 | data_France$X.3CTP6<0.99]<-NA
data_France$X.3CTP7[data_France$X.3CTP7>7.01 | data_France$X.3CTP7<0.99]<-NA
data_France$X.3CTP8[data_France$X.3CTP8>7.01 | data_France$X.3CTP8<0.99]<-NA
data_France$X.3CTP9[data_France$X.3CTP9>7.01 | data_France$X.3CTP9<0.99]<-NA
data_France$X.3CTP10[data_France$X.3CTP10>7.01 | data_France$X.3CTP10<0.99]<-NA
data_France$X.3CTP11[data_France$X.3CTP11>7.01 | data_France$X.3CTP11<0.99]<-NA
data_France$X.3CTP12[data_France$X.3CTP12>7.01 | data_France$X.3CTP12<0.99]<-NA
data_France$X.3CTP13[data_France$X.3CTP13>7.01 | data_France$X.3CTP13<0.99]<-NA
data_France$X.3CTP14[data_France$X.3CTP14>7.01 | data_France$X.3CTP14<0.99]<-NA
data_France$X.3CTP15[data_France$X.3CTP15>7.01 | data_France$X.3CTP15<0.99]<-NA
data_France$X.4CTP1[data_France$X.4CTP1>7.01 | data_France$X.4CTP1<0.99]<-NA
data_France$X.4CTP2[data_France$X.4CTP2>7.01 | data_France$X.4CTP2<0.99]<-NA
data_France$X.4CTP3[data_France$X.4CTP3>7.01 | data_France$X.4CTP3<0.99]<-NA
data_France$X.4CTP5[data_France$X.4CTP5>7.01 | data_France$X.4CTP5<0.99]<-NA
data_France$X.4CTP6[data_France$X.4CTP6>7.01 | data_France$X.4CTP6<0.99]<-NA
data_France$X.4CTP7[data_France$X.4CTP7>7.01 | data_France$X.4CTP7<0.99]<-NA
data_France$X.4CTP8[data_France$X.4CTP8>7.01 | data_France$X.4CTP8<0.99]<-NA
data_France$X.4CTP9[data_France$X.4CTP9>7.01 | data_France$X.4CTP9<0.99]<-NA
data_France$X.4CTP10[data_France$X.4CTP10>7.01 | data_France$X.4CTP10<0.99]<-NA
data_France$X.4CTP11[data_France$X.4CTP11>7.01 | data_France$X.4CTP11<0.99]<-NA
data_France$X.4CTP12[data_France$X.4CTP12>7.01 | data_France$X.4CTP12<0.99]<-NA
data_France$X.4CTP13[data_France$X.4CTP13>7.01 | data_France$X.4CTP13<0.99]<-NA
data_France$X.4CTP14[data_France$X.4CTP14>7.01 | data_France$X.4CTP14<0.99]<-NA
data_France$X.4CTP15[data_France$X.4CTP15>7.01 | data_France$X.4CTP15<0.99]<-NA
data_France$X.5CTP1[data_France$X.5CTP1>7.01 | data_France$X.5CTP1<0.99]<-NA
data_France$X.5CTP2[data_France$X.5CTP2>7.01 | data_France$X.5CTP2<0.99]<-NA
data_France$X.5CTP3[data_France$X.5CTP3>7.01 | data_France$X.5CTP3<0.99]<-NA
data_France$X.5CTP4[data_France$X.5CTP4>7.01 | data_France$X.5CTP4<0.99]<-NA
data_France$X.5CTP6[data_France$X.5CTP6>7.01 | data_France$X.5CTP6<0.99]<-NA
data_France$X.5CTP7[data_France$X.5CTP7>7.01 | data_France$X.5CTP7<0.99]<-NA
data_France$X.5CTP8[data_France$X.5CTP8>7.01 | data_France$X.5CTP8<0.99]<-NA
data_France$X.5CTP9[data_France$X.5CTP9>7.01 | data_France$X.5CTP9<0.99]<-NA
data_France$X.5CTP10[data_France$X.5CTP10>7.01 | data_France$X.5CTP10<0.99]<-NA
data_France$X.5CTP11[data_France$X.5CTP11>7.01 | data_France$X.5CTP11<0.99]<-NA
data_France$X.5CTP12[data_France$X.5CTP12>7.01 | data_France$X.5CTP12<0.99]<-NA
data_France$X.5CTP13[data_France$X.5CTP13>7.01 | data_France$X.5CTP13<0.99]<-NA
data_France$X.5CTP14[data_France$X.5CTP14>7.01 | data_France$X.5CTP14<0.99]<-NA
data_France$X.5CTP15[data_France$X.5CTP15>7.01 | data_France$X.5CTP15<0.99]<-NA
data_France$X.6CTP1[data_France$X.6CTP1>7.01 | data_France$X.6CTP1<0.99]<-NA
data_France$X.6CTP2[data_France$X.6CTP2>7.01 | data_France$X.6CTP2<0.99]<-NA
data_France$X.6CTP3[data_France$X.6CTP3>7.01 | data_France$X.6CTP3<0.99]<-NA
data_France$X.6CTP4[data_France$X.6CTP4>7.01 | data_France$X.6CTP4<0.99]<-NA
data_France$X.6CTP5[data_France$X.6CTP5>7.01 | data_France$X.6CTP5<0.99]<-NA
data_France$X.6CTP7[data_France$X.6CTP7>7.01 | data_France$X.6CTP7<0.99]<-NA
data_France$X.6CTP8[data_France$X.6CTP8>7.01 | data_France$X.6CTP8<0.99]<-NA
data_France$X.6CTP9[data_France$X.6CTP9>7.01 | data_France$X.6CTP9<0.99]<-NA
data_France$X.6CTP10[data_France$X.6CTP10>7.01 | data_France$X.6CTP10<0.99]<-NA
data_France$X.6CTP11[data_France$X.6CTP11>7.01 | data_France$X.6CTP11<0.99]<-NA
data_France$X.6CTP12[data_France$X.6CTP12>7.01 | data_France$X.6CTP12<0.99]<-NA
data_France$X.6CTP13[data_France$X.6CTP13>7.01 | data_France$X.6CTP13<0.99]<-NA
data_France$X.6CTP14[data_France$X.6CTP14>7.01 | data_France$X.6CTP14<0.99]<-NA
data_France$X.6CTP15[data_France$X.6CTP15>7.01 | data_France$X.6CTP15<0.99]<-NA
data_France$X.7CTP1[data_France$X.7CTP1>7.01 | data_France$X.7CTP1<0.99]<-NA
data_France$X.7CTP2[data_France$X.7CTP2>7.01 | data_France$X.7CTP2<0.99]<-NA
data_France$X.7CTP3[data_France$X.7CTP3>7.01 | data_France$X.7CTP3<0.99]<-NA
data_France$X.7CTP4[data_France$X.7CTP4>7.01 | data_France$X.7CTP4<0.99]<-NA
data_France$X.7CTP5[data_France$X.7CTP5>7.01 | data_France$X.7CTP5<0.99]<-NA
data_France$X.7CTP6[data_France$X.7CTP6>7.01 | data_France$X.7CTP6<0.99]<-NA
data_France$X.7CTP8[data_France$X.7CTP8>7.01 | data_France$X.7CTP8<0.99]<-NA
data_France$X.7CTP9[data_France$X.7CTP9>7.01 | data_France$X.7CTP9<0.99]<-NA
data_France$X.7CTP10[data_France$X.7CTP10>7.01 | data_France$X.7CTP10<0.99]<-NA
data_France$X.7CTP11[data_France$X.7CTP11>7.01 | data_France$X.7CTP11<0.99]<-NA
data_France$X.7CTP12[data_France$X.7CTP12>7.01 | data_France$X.7CTP12<0.99]<-NA
data_France$X.7CTP13[data_France$X.7CTP13>7.01 | data_France$X.7CTP13<0.99]<-NA
data_France$X.7CTP14[data_France$X.7CTP14>7.01 | data_France$X.7CTP14<0.99]<-NA
data_France$X.7CTP15[data_France$X.7CTP15>7.01 | data_France$X.7CTP15<0.99]<-NA
data_France$X.8CTP1[data_France$X.8CTP1>7.01 | data_France$X.8CTP1<0.99]<-NA
data_France$X.8CTP2[data_France$X.8CTP2>7.01 | data_France$X.8CTP2<0.99]<-NA
data_France$X.8CTP3[data_France$X.8CTP3>7.01 | data_France$X.8CTP3<0.99]<-NA
data_France$X.8CTP4[data_France$X.8CTP4>7.01 | data_France$X.8CTP4<0.99]<-NA
data_France$X.8CTP5[data_France$X.8CTP5>7.01 | data_France$X.8CTP5<0.99]<-NA
data_France$X.8CTP6[data_France$X.8CTP6>7.01 | data_France$X.8CTP6<0.99]<-NA
data_France$X.8CTP7[data_France$X.8CTP7>7.01 | data_France$X.8CTP7<0.99]<-NA
data_France$X.8CTP9[data_France$X.8CTP9>7.01 | data_France$X.8CTP9<0.99]<-NA
data_France$X.8CTP10[data_France$X.8CTP10>7.01 | data_France$X.8CTP10<0.99]<-NA
data_France$X.8CTP11[data_France$X.8CTP11>7.01 | data_France$X.8CTP11<0.99]<-NA
data_France$X.8CTP12[data_France$X.8CTP12>7.01 | data_France$X.8CTP12<0.99]<-NA
data_France$X.8CTP13[data_France$X.8CTP13>7.01 | data_France$X.8CTP13<0.99]<-NA
data_France$X.8CTP14[data_France$X.8CTP14>7.01 | data_France$X.8CTP14<0.99]<-NA
data_France$X.8CTP15[data_France$X.8CTP15>7.01 | data_France$X.8CTP15<0.99]<-NA
data_France$X.9CTP1[data_France$X.9CTP1>7.01 | data_France$X.9CTP1<0.99]<-NA
data_France$X.9CTP2[data_France$X.9CTP2>7.01 | data_France$X.9CTP2<0.99]<-NA
data_France$X.9CTP3[data_France$X.9CTP3>7.01 | data_France$X.9CTP3<0.99]<-NA
data_France$X.9CTP4[data_France$X.9CTP4>7.01 | data_France$X.9CTP4<0.99]<-NA
data_France$X.9CTP5[data_France$X.9CTP5>7.01 | data_France$X.9CTP5<0.99]<-NA
data_France$X.9CTP6[data_France$X.9CTP6>7.01 | data_France$X.9CTP6<0.99]<-NA
data_France$X.9CTP7[data_France$X.9CTP7>7.01 | data_France$X.9CTP7<0.99]<-NA
data_France$X.9CTP8[data_France$X.9CTP8>7.01 | data_France$X.9CTP8<0.99]<-NA
data_France$X.9CTP10[data_France$X.9CTP10>7.01 | data_France$X.9CTP10<0.99]<-NA
data_France$X.9CTP11[data_France$X.9CTP11>7.01 | data_France$X.9CTP11<0.99]<-NA
data_France$X.9CTP12[data_France$X.9CTP12>7.01 | data_France$X.9CTP12<0.99]<-NA
data_France$X.9CTP13[data_France$X.9CTP13>7.01 | data_France$X.9CTP13<0.99]<-NA
data_France$X.9CTP14[data_France$X.9CTP14>7.01 | data_France$X.9CTP14<0.99]<-NA
data_France$X.9CTP15[data_France$X.9CTP15>7.01 | data_France$X.9CTP15<0.99]<-NA
data_France$X.10CTP1[data_France$X.10CTP1>7.01 | data_France$X.10CTP1<0.99]<-NA
data_France$X.10CTP2[data_France$X.10CTP2>7.01 | data_France$X.10CTP2<0.99]<-NA
data_France$X.10CTP3[data_France$X.10CTP3>7.01 | data_France$X.10CTP3<0.99]<-NA
data_France$X.10CTP4[data_France$X.10CTP4>7.01 | data_France$X.10CTP4<0.99]<-NA
data_France$X.10CTP5[data_France$X.10CTP5>7.01 | data_France$X.10CTP5<0.99]<-NA
data_France$X.10CTP6[data_France$X.10CTP6>7.01 | data_France$X.10CTP6<0.99]<-NA
data_France$X.10CTP7[data_France$X.10CTP7>7.01 | data_France$X.10CTP7<0.99]<-NA
data_France$X.10CTP8[data_France$X.10CTP8>7.01 | data_France$X.10CTP8<0.99]<-NA
data_France$X.10CTP9[data_France$X.10CTP9>7.01 | data_France$X.10CTP9<0.99]<-NA
data_France$X.10CTP11[data_France$X.10CTP11>7.01 | data_France$X.10CTP11<0.99]<-NA
data_France$X.10CTP12[data_France$X.10CTP12>7.01 | data_France$X.10CTP12<0.99]<-NA
data_France$X.10CTP13[data_France$X.10CTP13>7.01 | data_France$X.10CTP13<0.99]<-NA
data_France$X.10CTP14[data_France$X.10CTP14>7.01 | data_France$X.10CTP14<0.99]<-NA
data_France$X.10CTP15[data_France$X.10CTP15>7.01 | data_France$X.10CTP15<0.99]<-NA
data_France$X.11CTP1[data_France$X.11CTP1>7.01 | data_France$X.11CTP1<0.99]<-NA
data_France$X.11CTP2[data_France$X.11CTP2>7.01 | data_France$X.11CTP2<0.99]<-NA
data_France$X.11CTP3[data_France$X.11CTP3>7.01 | data_France$X.11CTP3<0.99]<-NA
data_France$X.11CTP4[data_France$X.11CTP4>7.01 | data_France$X.11CTP4<0.99]<-NA
data_France$X.11CTP5[data_France$X.11CTP5>7.01 | data_France$X.11CTP5<0.99]<-NA
data_France$X.11CTP6[data_France$X.11CTP6>7.01 | data_France$X.11CTP6<0.99]<-NA
data_France$X.11CTP7[data_France$X.11CTP7>7.01 | data_France$X.11CTP7<0.99]<-NA
data_France$X.11CTP8[data_France$X.11CTP8>7.01 | data_France$X.11CTP8<0.99]<-NA
data_France$X.11CTP9[data_France$X.11CTP9>7.01 | data_France$X.11CTP9<0.99]<-NA
data_France$X.11CTP10[data_France$X.11CTP10>7.01 | data_France$X.11CTP10<0.99]<-NA
data_France$X.11CTP12[data_France$X.11CTP12>7.01 | data_France$X.11CTP12<0.99]<-NA
data_France$X.11CTP13[data_France$X.11CTP13>7.01 | data_France$X.11CTP13<0.99]<-NA
data_France$X.11CTP14[data_France$X.11CTP14>7.01 | data_France$X.11CTP14<0.99]<-NA
data_France$X.11CTP15[data_France$X.11CTP15>7.01 | data_France$X.11CTP15<0.99]<-NA
data_France$X.12CTP1[data_France$X.12CTP1>7.01 | data_France$X.12CTP1<0.99]<-NA
data_France$X.12CTP2[data_France$X.12CTP2>7.01 | data_France$X.12CTP2<0.99]<-NA
data_France$X.12CTP3[data_France$X.12CTP3>7.01 | data_France$X.12CTP3<0.99]<-NA
data_France$X.12CTP4[data_France$X.12CTP4>7.01 | data_France$X.12CTP4<0.99]<-NA
data_France$X.12CTP5[data_France$X.12CTP5>7.01 | data_France$X.12CTP5<0.99]<-NA
data_France$X.12CTP6[data_France$X.12CTP6>7.01 | data_France$X.12CTP6<0.99]<-NA
data_France$X.12CTP7[data_France$X.12CTP7>7.01 | data_France$X.12CTP7<0.99]<-NA
data_France$X.12CTP8[data_France$X.12CTP8>7.01 | data_France$X.12CTP8<0.99]<-NA
data_France$X.12CTP9[data_France$X.12CTP9>7.01 | data_France$X.12CTP9<0.99]<-NA
data_France$X.12CTP10[data_France$X.12CTP10>7.01 | data_France$X.12CTP10<0.99]<-NA
data_France$X.12CTP11[data_France$X.12CTP11>7.01 | data_France$X.12CTP11<0.99]<-NA
data_France$X.12CTP13[data_France$X.12CTP13>7.01 | data_France$X.12CTP13<0.99]<-NA
data_France$X.12CTP14[data_France$X.12CTP14>7.01 | data_France$X.12CTP14<0.99]<-NA
data_France$X.12CTP15[data_France$X.12CTP15>7.01 | data_France$X.12CTP15<0.99]<-NA
data_France$X.13CTP1[data_France$X.13CTP1>7.01 | data_France$X.13CTP1<0.99]<-NA
data_France$X.13CTP2[data_France$X.13CTP2>7.01 | data_France$X.13CTP2<0.99]<-NA
data_France$X.13CTP3[data_France$X.13CTP3>7.01 | data_France$X.13CTP3<0.99]<-NA
data_France$X.13CTP4[data_France$X.13CTP4>7.01 | data_France$X.13CTP4<0.99]<-NA
data_France$X.13CTP5[data_France$X.13CTP5>7.01 | data_France$X.13CTP5<0.99]<-NA
data_France$X.13CTP6[data_France$X.13CTP6>7.01 | data_France$X.13CTP6<0.99]<-NA
data_France$X.13CTP7[data_France$X.13CTP7>7.01 | data_France$X.13CTP7<0.99]<-NA
data_France$X.13CTP8[data_France$X.13CTP8>7.01 | data_France$X.13CTP8<0.99]<-NA
data_France$X.13CTP9[data_France$X.13CTP9>7.01 | data_France$X.13CTP9<0.99]<-NA
data_France$X.13CTP10[data_France$X.13CTP10>7.01 | data_France$X.13CTP10<0.99]<-NA
data_France$X.13CTP11[data_France$X.13CTP11>7.01 | data_France$X.13CTP11<0.99]<-NA
data_France$X.13CTP12[data_France$X.13CTP12>7.01 | data_France$X.13CTP12<0.99]<-NA
data_France$X.13CTP14[data_France$X.13CTP14>7.01 | data_France$X.13CTP14<0.99]<-NA
data_France$X.13CTP15[data_France$X.13CTP15>7.01 | data_France$X.13CTP15<0.99]<-NA
data_France$X.14CTP1[data_France$X.14CTP1>7.01 | data_France$X.14CTP1<0.99]<-NA
data_France$X.14CTP2[data_France$X.14CTP2>7.01 | data_France$X.14CTP2<0.99]<-NA
data_France$X.14CTP3[data_France$X.14CTP3>7.01 | data_France$X.14CTP3<0.99]<-NA
data_France$X.14CTP4[data_France$X.14CTP4>7.01 | data_France$X.14CTP4<0.99]<-NA
data_France$X.14CTP5[data_France$X.14CTP5>7.01 | data_France$X.14CTP5<0.99]<-NA
data_France$X.14CTP6[data_France$X.14CTP6>7.01 | data_France$X.14CTP6<0.99]<-NA
data_France$X.14CTP7[data_France$X.14CTP7>7.01 | data_France$X.14CTP7<0.99]<-NA
data_France$X.14CTP8[data_France$X.14CTP8>7.01 | data_France$X.14CTP8<0.99]<-NA
data_France$X.14CTP9[data_France$X.14CTP9>7.01 | data_France$X.14CTP9<0.99]<-NA
data_France$X.14CTP10[data_France$X.14CTP10>7.01 | data_France$X.14CTP10<0.99]<-NA
data_France$X.14CTP11[data_France$X.14CTP11>7.01 | data_France$X.14CTP11<0.99]<-NA
data_France$X.14CTP12[data_France$X.14CTP12>7.01 | data_France$X.14CTP12<0.99]<-NA
data_France$X.14CTP13[data_France$X.14CTP13>7.01 | data_France$X.14CTP13<0.99]<-NA
data_France$X.14CTP15[data_France$X.14CTP15>7.01 | data_France$X.14CTP15<0.99]<-NA
data_France$X.15CTP1[data_France$X.15CTP1>7.01 | data_France$X.15CTP1<0.99]<-NA
data_France$X.15CTP2[data_France$X.15CTP2>7.01 | data_France$X.15CTP2<0.99]<-NA
data_France$X.15CTP3[data_France$X.15CTP3>7.01 | data_France$X.15CTP3<0.99]<-NA
data_France$X.15CTP4[data_France$X.15CTP4>7.01 | data_France$X.15CTP4<0.99]<-NA
data_France$X.15CTP5[data_France$X.15CTP5>7.01 | data_France$X.15CTP5<0.99]<-NA
data_France$X.15CTP6[data_France$X.15CTP6>7.01 | data_France$X.15CTP6<0.99]<-NA
data_France$X.15CTP7[data_France$X.15CTP7>7.01 | data_France$X.15CTP7<0.99]<-NA
data_France$X.15CTP8[data_France$X.15CTP8>7.01 | data_France$X.15CTP8<0.99]<-NA
data_France$X.15CTP9[data_France$X.15CTP9>7.01 | data_France$X.15CTP9<0.99]<-NA
data_France$X.15CTP10[data_France$X.15CTP10>7.01 | data_France$X.15CTP10<0.99]<-NA
data_France$X.15CTP11[data_France$X.15CTP11>7.01 | data_France$X.15CTP11<0.99]<-NA
data_France$X.15CTP12[data_France$X.15CTP12>7.01 | data_France$X.15CTP12<0.99]<-NA
data_France$X.15CTP13[data_France$X.15CTP13>7.01 | data_France$X.15CTP13<0.99]<-NA
data_France$X.15CTP14[data_France$X.15CTP14>7.01 | data_France$X.15CTP14<0.99]<-NA
Base R equivalent of @Cettt's answer:
## helper function to replace out-of-range elements with NA
rfun <- function(x) replace(x, which(x<0.99 | x>7.01), NA)
## identify which columns need to be changed
cnm <- grep("^X\\.[0-9]+CTP[0-9]+$", names(data_France))
## apply the helper to one column at a time
for (i in cnm) {
data_France[[i]] <- rfun(data_France[[i]])
}
You could also use lapply(), but sometimes the for loop is easier to understand and debug.
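The lapply() variant mentioned above might look like the following sketch. The small data frame is made up for illustration; only the X.<i>CTP<j> naming pattern and the two thresholds come from the question.

```r
# Made-up data; column names follow the question's X.<i>CTP<j> pattern.
data_France <- data.frame(id = 1:4,
                          X.1CTP2 = c(1, 7.5, 3, 0.5),
                          X.2CTP1 = c(8, 2, 0.99, 7.01))

# Positions of the columns to recode; `id` is left alone.
cols <- grep("^X\\.[0-9]+CTP[0-9]+$", names(data_France))

# Recode each selected column and write the results back in one assignment.
data_France[cols] <- lapply(data_France[cols], function(x) {
  replace(x, which(x < 0.99 | x > 7.01), NA)
})

print(data_France)
```

Values exactly equal to 0.99 or 7.01 are kept, because the comparisons are strict.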
I would recommend the dplyr package which has the mutate_at function.
In your case you could use it like this:
library(dplyr)
data_France %>%
as_tibble %>%
mutate_at(vars(matches("^X.[0-9]+CTP[0-9]+")), ~ifelse(.x < 0.99 | .x > 7.01, NA_real_, .x))
#Create a vector of variable names. There may be other ways to do this, like using
#regex or just taking the indices of the variable names (e.g., 1:225)
vars <- apply(expand.grid("X.", as.character(1:15), "CTP", as.character(1:15)),
1, paste0, collapse = "")
#Keep only names that exist in the data (the code in the question has no
#X.1CTP1, X.2CTP2, ... columns)
vars <- intersect(vars, names(data_France))
for (i in vars) {
data_France[[i]][data_France[[i]] > 7.01 | data_France[[i]] < 0.99] <- NA
}
If this is your entire data set (i.e., there are no other variables in the data), you can simply do
data_France[data_France > 7.01 | data_France < 0.99] <- NA
When rendering tables such as this one (using RStudio + knitr), there is unwanted indentation (see red zone in the image). How can I avoid such indentation?
I imagine there is some CSS involved, but if there was a way to even prevent rmarkdown from "considering" this as a list, it could simplify matters. This is needed for an R package, so heavy hacks are not really an option, but I'll gladly receive all suggestions. Thx.
The (grid) table:
+------------------------+------------------------------------+
| Variable               | Stats / Values                     |
+========================+====================================+
| SomeVar1               | mean (sd) : 1500000.5 (288675.28)\ |
| [numeric]              | min < med < max :\                 |
|                        | 1000001 < 1500000.5 < 2e+06\       |
|                        | IQR (CV) : 499999.5 (0.19)         |
+------------------------+------------------------------------+
| SomeVar2               | 1. AAAAAA\                         |
| [factor]               | 2. BBBBBB\                         |
|                        | 3. CCCCCC\                         |
|                        | 4. DDDDDD\                         |
|                        | 5. EEEEEE\                         |
|                        | 6. FFFFFF\                         |
|                        | 7. GGGGGG\                         |
|                        | 8. HHHHHH\                         |
|                        | 9. IIIIII\                         |
|                        | 10. JJJJJJ\                        |
|                        | [ 102917 others ]                  |
+------------------------+------------------------------------+
The rendered html table:
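One possible way to stop the cell from being read as a list, offered here as an assumption and not a confirmed fix: in pandoc markdown a backslash-escaped period ("1\.") is not treated as an ordered-list marker, so the cell contents could be escaped before the table is built. The cell strings below are copied from the table above; the variable names are made up.

```r
# Cell contents as they would appear inside the grid table.
cells <- c("1. AAAAAA", "2. BBBBBB", "10. JJJJJJ", "[ 102917 others ]")

# Escape the period after a leading number so "1." no longer looks like
# an ordered-list marker to the markdown parser.
escaped <- sub("^(\\s*[0-9]+)\\.", "\\1\\\\.", cells)

cat(escaped, sep = "\n")
```

The escape adds one character per line, so the column widths of the grid table would have to be computed after escaping.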
I have a list of publishers that looks like this:
+--------------+
| Site Name |
+--------------+
| Radium One |
| Euronews |
| EUROSPORT |
| WIRED |
| RadiumOne |
| Eurosport FR |
| Wired US |
| Eurosport |
| EuroNews |
| Wired |
+--------------+
I'd like to create the following result:
+--------------+----------------+
| Site Name | Publisher Name |
+--------------+----------------+
| Radium One | RadiumOne |
| Euronews | Euronews |
| EUROSPORT | Eurosport |
| WIRED | Wired |
| RadiumOne | RadiumOne |
| Eurosport FR | Eurosport |
| Wired US | Wired |
| Eurosport | Eurosport |
| EuroNews | Euronews |
| Wired | Wired |
+--------------+----------------+
I would like to understand how I can replicate this code I use in Power Query :
search first 4 characters
if Text.Start([Site Name],4) = "WIRE" then "Wired" else
search last 3 characters
if Text.End([Site Name],3) = "One" then "RadiumOne" else
If no match is found, then add "Rest"
It does not have to be case sensitive.
Using properCase from the ifultools package and gsub, we replace everything after the first word with "" (i.e. delete it) and treat the exceptional case of Radium separately. If you have many exceptions like the Radium case, please update your post with them so that we can find a neater solution than this hack :)
library("ifultools")
siteName=c("Radium One","Euronews","EUROSPORT","WIRED","RadiumOne","Eurosport FR","Wired US","Eurosport","EuroNews","Wired")
publisherName = gsub("^Radium$","Radiumone",gsub("\\s+.*","",properCase(siteName)))
# [1] "Radiumone" "Euronews" "Eurosport" "Wired" "Radiumone" "Eurosport" "Wired"
# [8] "Eurosport" "Euronews" "Wired"
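The Power Query rules from the question (check the first characters, then the last characters, otherwise "Rest") can also be sketched directly in base R. The two "WIRE" and "One" rules come from the question; the two Euro rules are assumptions added here so that the sample reproduces the desired table.

```r
site_name <- c("Radium One", "Euronews", "EUROSPORT", "WIRED", "RadiumOne",
               "Eurosport FR", "Wired US", "Eurosport", "EuroNews", "Wired")

# Case-insensitive comparison: upper-case once, then test prefixes/suffixes.
up <- toupper(site_name)

publisher_name <- ifelse(startsWith(up, "WIRE"), "Wired",
                  ifelse(endsWith(up, "ONE"), "RadiumOne",
                  ifelse(startsWith(up, "EUROS"), "Eurosport",  # assumed rule
                  ifelse(startsWith(up, "EURON"), "Euronews",   # assumed rule
                         "Rest"))))                             # no rule matched

print(data.frame(site_name, publisher_name))
```

The rules are tested in order, so an earlier rule wins if a name satisfies more than one.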
I'm able to do forecasts with an ARIMA model, but when I try to do a forecast for a linear model, I do not get any actual forecasts - it stops at the end of the data set (which isn't useful for forecasting since I already know what's in the data set). I've found countless examples online where using this same code works just fine, but I haven't found anyone else having this same error.
library("stats")
library("forecast")
y <- data$Mfg.Shipments.Total..USA.
model_a1 <- auto.arima(y)
forecast_a1 <- forecast.Arima(model_a1, h = 12)
The above code works perfectly. However, when I try to do a linear model....
model1 <- lm(y ~ Mfg.NO.Total..USA. + Mfg.Inv.Total..USA., data = data )
f1 <- forecast.lm(model1, h = 12)
I get an error message saying that I MUST provide a new data set (which seems odd to me, since the documentation for the forecast package says that it is an optional argument).
f1 <- forecast.lm(model1, newdata = x, h = 12)
If I do this, I am able to get the function to work, but the forecast only predicts values for the existing data - it doesn't predict the next 12 periods. I have also tried using the append function to add additional rows to see if that would fix the issue, but when trying to forecast a linear model, it immediately stops at the most recent point in the time series.
Here's the data that I'm using:
+------------+---------------------------+--------------------+---------------------+
| | Mfg.Shipments.Total..USA. | Mfg.NO.Total..USA. | Mfg.Inv.Total..USA. |
+------------+---------------------------+--------------------+---------------------+
| 2110-01-01 | 3.59746e+11 | 3.58464e+11 | 5.01361e+11 |
| 2110-01-01 | 3.59746e+11 | 3.58464e+11 | 5.01361e+11 |
| 2110-02-01 | 3.62268e+11 | 3.63441e+11 | 5.10439e+11 |
| 2110-03-01 | 4.23748e+11 | 4.24527e+11 | 5.10792e+11 |
| 2110-04-01 | 4.08755e+11 | 4.02769e+11 | 5.16853e+11 |
| 2110-05-01 | 4.08187e+11 | 4.02869e+11 | 5.18180e+11 |
| 2110-06-01 | 4.27567e+11 | 4.21713e+11 | 5.15675e+11 |
| 2110-07-01 | 3.97590e+11 | 3.89916e+11 | 5.24785e+11 |
| 2110-08-01 | 4.24732e+11 | 4.16304e+11 | 5.27734e+11 |
| 2110-09-01 | 4.30974e+11 | 4.35043e+11 | 5.28797e+11 |
| 2110-10-01 | 4.24008e+11 | 4.17076e+11 | 5.38917e+11 |
| 2110-11-01 | 4.11930e+11 | 4.09440e+11 | 5.42618e+11 |
| 2110-12-01 | 4.25940e+11 | 4.34201e+11 | 5.35384e+11 |
| 2111-01-01 | 4.01629e+11 | 4.07748e+11 | 5.55057e+11 |
| 2111-02-01 | 4.06385e+11 | 4.06151e+11 | 5.66058e+11 |
| 2111-03-01 | 4.83827e+11 | 4.89904e+11 | 5.70990e+11 |
| 2111-04-01 | 4.54640e+11 | 4.46702e+11 | 5.84808e+11 |
| 2111-05-01 | 4.65124e+11 | 4.63155e+11 | 5.92456e+11 |
| 2111-06-01 | 4.83809e+11 | 4.75150e+11 | 5.86645e+11 |
| 2111-07-01 | 4.44437e+11 | 4.40452e+11 | 5.97201e+11 |
| 2111-08-01 | 4.83537e+11 | 4.79958e+11 | 5.99461e+11 |
| 2111-09-01 | 4.77130e+11 | 4.75580e+11 | 5.93065e+11 |
| 2111-10-01 | 4.69276e+11 | 4.59579e+11 | 6.03481e+11 |
| 2111-11-01 | 4.53706e+11 | 4.55029e+11 | 6.02577e+11 |
| 2111-12-01 | 4.57872e+11 | 4.81454e+11 | 5.86886e+11 |
| 2112-01-01 | 4.35834e+11 | 4.45037e+11 | 6.04042e+11 |
| 2112-02-01 | 4.55996e+11 | 4.70820e+11 | 6.12071e+11 |
| 2112-03-01 | 5.04869e+11 | 5.08818e+11 | 6.11717e+11 |
| 2112-04-01 | 4.76213e+11 | 4.70666e+11 | 6.16375e+11 |
| 2112-05-01 | 4.95789e+11 | 4.87730e+11 | 6.17639e+11 |
| 2112-06-01 | 4.91218e+11 | 4.87857e+11 | 6.09361e+11 |
| 2112-07-01 | 4.58087e+11 | 4.61037e+11 | 6.19166e+11 |
| 2112-08-01 | 4.97438e+11 | 4.74539e+11 | 6.22773e+11 |
| 2112-09-01 | 4.86994e+11 | 4.85560e+11 | 6.23067e+11 |
| 2112-10-01 | 4.96744e+11 | 4.92562e+11 | 6.26796e+11 |
| 2112-11-01 | 4.70810e+11 | 4.64944e+11 | 6.23999e+11 |
| 2112-12-01 | 4.66721e+11 | 4.88615e+11 | 6.08900e+11 |
| 2113-01-01 | 4.51585e+11 | 4.50763e+11 | 6.25881e+11 |
| 2113-02-01 | 4.56329e+11 | 4.69574e+11 | 6.33157e+11 |
| 2113-03-01 | 5.04023e+11 | 4.92978e+11 | 6.31055e+11 |
| 2113-04-01 | 4.84798e+11 | 4.76750e+11 | 6.35643e+11 |
| 2113-05-01 | 5.04478e+11 | 5.04488e+11 | 6.34376e+11 |
| 2113-06-01 | 4.99043e+11 | 5.13760e+11 | 6.25715e+11 |
| 2113-07-01 | 4.75700e+11 | 4.69012e+11 | 6.34892e+11 |
| 2113-08-01 | 5.05244e+11 | 4.90404e+11 | 6.37735e+11 |
| 2113-09-01 | 5.00087e+11 | 5.04849e+11 | 6.34665e+11 |
| 2113-10-01 | 5.05965e+11 | 4.99682e+11 | 6.38945e+11 |
| 2113-11-01 | 4.78876e+11 | 4.80784e+11 | 6.34442e+11 |
| 2113-12-01 | 4.80640e+11 | 4.98807e+11 | 6.19458e+11 |
| 2114-01-01 | 4.56779e+11 | 4.57684e+11 | 6.36568e+11 |
| 2114-02-01 | 4.62195e+11 | 4.70312e+11 | 6.48982e+11 |
| 2114-03-01 | 5.19472e+11 | 5.25900e+11 | 6.47038e+11 |
| 2114-04-01 | 5.04217e+11 | 5.06090e+11 | 6.52612e+11 |
| 2114-05-01 | 5.14186e+11 | 5.11149e+11 | 6.58990e+11 |
| 2114-06-01 | 5.25249e+11 | 5.33247e+11 | 6.49512e+11 |
| 2114-07-01 | 4.99198e+11 | 5.52506e+11 | 6.57645e+11 |
| 2114-08-01 | 5.17184e+11 | 5.07622e+11 | 6.59281e+11 |
| 2114-09-01 | 5.23682e+11 | 5.24051e+11 | 6.55582e+11 |
| 2114-10-01 | 5.17305e+11 | 5.09549e+11 | 6.59237e+11 |
| 2114-11-01 | 4.71921e+11 | 4.70093e+11 | 6.57044e+11 |
| 2114-12-01 | 4.84948e+11 | 4.86804e+11 | 6.34120e+11 |
+------------+---------------------------+--------------------+---------------------+
Edit - Here's the code I used for adding new datapoints for forecasting.
library(xts)
library(mondate)
d <- as.mondate("2115-01-01")
d11 <- d + 11
seq(d, d11)
newdates <- seq(d, d11)
new_xts <- xts(order.by = as.Date(newdates))
new_xts$Mfg.Shipments.Total..USA. <- NA
new_xts$Mfg.NO.Total..USA. <- NA
new_xts$Mfg.Inv.Total..USA. <- NA
x <- append(data, new_xts)
Not sure if you ever figured this out, but just in case I thought I'd point out what's going wrong.
The documentation for forecast.lm says:
An optional data frame in which to look for variables with which to predict. If omitted, it is assumed that the only variables are trend and season, and h forecasts are produced.
so it's optional if trend and season are your only predictors.
The ARIMA model works because it's using lagged values of the time series in the forecast. For the linear model, it uses the given predictors (Mfg.NO.Total..USA. and Mfg.Inv.Total..USA. in your case) and thus needs their corresponding future values; without these, there are no independent variables to predict from.
In the edit, you added those variables to your future dataset, but they still have values of NA for all future points, thus the forecasts are also NA.
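To make the point concrete, here is a minimal sketch with base R's lm() and predict() in place of forecast.lm. The data are simulated, the column names are shortened stand-ins for the ones in the question, and holding the predictors at their last observed value is only a placeholder assumption for real future values.

```r
# Simulated data standing in for the question's series (values are made up).
set.seed(1)
n <- 60
dat <- data.frame(orders = 400 + cumsum(rnorm(n)),
                  inventory = 500 + cumsum(rnorm(n)))
dat$shipments <- 10 + 0.9 * dat$orders + 0.1 * dat$inventory + rnorm(n)

fit <- lm(shipments ~ orders + inventory, data = dat)

# Future predictor values must come from somewhere: a plan, a scenario, or a
# separate forecast. Here they are held at their last observed value
# (a naive assumption, for illustration only).
future <- data.frame(orders = rep(tail(dat$orders, 1), 12),
                     inventory = rep(tail(dat$inventory, 1), 12))

# One row per future period, with columns fit, lwr and upr.
fc <- predict(fit, newdata = future, interval = "prediction")
print(fc)
```

If the future rows held NA for the predictors, as in the question's edit, every row of the result would be NA as well.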
Gabe is correct. You need future values of your causals.
You should consider the Transfer Function modeling process instead of regression (which was developed for use with cross-sectional data). By prewhitening your X variables (i.e. building a model for each one), you can calculate the cross-correlation function to see any lead or lag relationship.
It is very apparent from the standardized graph of Y and the two X's that Inv.Total is a lead variable (b**-1). When Inv.Total moves down, so do shipments. In addition, there is an AR seasonal component beyond the causals that is driving the data. There are a few outliers as well, so this is a robust solution. I am the developer of the software used here, but this can be run in any tool.