How to sum integer arrays across rows - azure-data-explorer

How can I achieve the equivalent of summarize sum(Trend) by id, where Trend is an array of integers?
Input:
+-------+----------+---------+
| Id    | ParentId | Trend   |
+-------+----------+---------+
| C1-P1 | P1       | [1,2,3] |
| C2-P1 | P1       | [4,5,6] |
| C3-P1 | P1       | [1,1,1] |
| P1    |          |         |
| C1-P2 | P2       | [4,5,6] |
| C2-P2 | P2       | [7,8,9] |
| P2    |          |         |
+-------+----------+---------+
Needed Output:
+-------+----------+------------+
| Id    | ParentId | Trend      |
+-------+----------+------------+
| C1-P1 | P1       | [1,2,3]    |
| C2-P1 | P1       | [4,5,6]    |
| C3-P1 | P1       | [1,1,1]    |
| P1    |          | [6,8,10]   |
| C1-P2 | P2       | [4,5,6]    |
| C2-P2 | P2       | [7,8,9]    |
| P2    |          | [11,13,15] |
+-------+----------+------------+

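Please check if the query below solves your scenario.
It uses the mv-expand operator with the with_itemindex option to expand each array into one row per element while keeping the element's position. It then sums the values per ParentId and position, and rebuilds an array per ParentId with make_list.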
let _data = datatable(Id:string, ParentId:string, Trend:dynamic)
[
'C1-P1','P1', dynamic([1,2,3]),
'C2-P1', 'P1', dynamic([4,5,6]),
'C3-P1','P1',dynamic([1,1,1]),
'P1','',dynamic([]),
'C1-P2','P2',dynamic([4,5,6]),
'C2-P2','P2',dynamic([7,8,9]),
'P2', '', dynamic([])
];
_data
| mv-expand with_itemindex=x Trend to typeof(long)
| summarize sum(Trend) by ParentId, x
| summarize Trend=make_list(sum_Trend) by ParentId
| union (_data | where isnotempty( ParentId))
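For readers who want to check the arithmetic outside Kusto, here is a minimal Python sketch of the same idea: group the arrays by ParentId, then add them position by position. The function name sum_trends_by_parent is made up for this illustration, and the sketch assumes all arrays within a group have the same length (zip would silently truncate to the shortest one otherwise).

```python
from collections import defaultdict

# Child rows from the question: (Id, ParentId, Trend)
rows = [
    ("C1-P1", "P1", [1, 2, 3]),
    ("C2-P1", "P1", [4, 5, 6]),
    ("C3-P1", "P1", [1, 1, 1]),
    ("C1-P2", "P2", [4, 5, 6]),
    ("C2-P2", "P2", [7, 8, 9]),
]


def sum_trends_by_parent(rows):
    """Hypothetical helper: element-wise sum of the arrays per ParentId."""
    grouped = defaultdict(list)
    for _id, parent_id, trend in rows:
        grouped[parent_id].append(trend)
    # zip(*arrays) lines up the items that share an index, which is the role
    # the item index plays in the mv-expand query.
    return {
        parent_id: [sum(column) for column in zip(*arrays)]
        for parent_id, arrays in grouped.items()
    }


totals = sum_trends_by_parent(rows)
print(totals)  # {'P1': [6, 8, 10], 'P2': [11, 13, 15]}
```

The result matches the Trend values wanted for the P1 and P2 rows in the question.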

Related

How to create a variable based on character and number iteration in R?

I'm trying to create a dummy variable based on the character type variable.
For example, I need to create "newcat" variable ranging from "I00" to "I99".
In the code I wrote, I place all the characters from I00-I99.
But is there any way to make this code efficient with the loop to iterate number after the string?
Thank you in advance!!
mort <- mort %>%
mutate(newcat = ifelse(ucod=="I00" |
ucod=="I01" | ucod=="I02" | ucod=="I03" | ucod=="I04" | ucod=="I05" |
ucod=="I06" | ucod=="I07" | ucod=="I08" | ucod=="I09" | ucod=="I10" |
ucod=="I11" | ucod=="I12" | ucod=="I13" | ucod=="I14" | ucod=="I15" |
ucod=="I16" | ucod=="I17" | ucod=="I18" | ucod=="I19" | ucod=="I20" |
ucod=="I21" | ucod=="I22" | ucod=="I23" | ucod=="I24" | ucod=="I25" |
ucod=="I26" | ucod=="I27" | ucod=="I28" | ucod=="I29" | ucod=="I30" |
ucod=="I31" | ucod=="I32" | ucod=="I33" | ucod=="I34" | ucod=="I35" |
ucod=="I36" | ucod=="I37" | ucod=="I38" | ucod=="I39" | ucod=="I40" |
ucod=="I41" | ucod=="I42" | ucod=="I43" | ucod=="I44" | ucod=="I45" |
ucod=="I46" | ucod=="I47" | ucod=="I48" | ucod=="I49" | ucod=="I50" |
ucod=="I51" | ucod=="I52" | ucod=="I53" | ucod=="I54" | ucod=="I55" |
ucod=="I56" | ucod=="I57" | ucod=="I58" | ucod=="I59" | ucod=="I60" |
ucod=="I61" | ucod=="I62" | ucod=="I63" | ucod=="I64" | ucod=="I65" |
ucod=="I66" | ucod=="I67" | ucod=="I68" | ucod=="I69" | ucod=="I70" |
ucod=="I71" | ucod=="I72" | ucod=="I73" | ucod=="I74" | ucod=="I75" |
ucod=="I76" | ucod=="I77" | ucod=="I78" | ucod=="I79" | ucod=="I80" |
ucod=="I81" | ucod=="I82" | ucod=="I83" | ucod=="I84" | ucod=="I85" |
ucod=="I86" | ucod=="I87" | ucod=="I88" | ucod=="I89" | ucod=="I90" |
ucod=="I91" | ucod=="I92" | ucod=="I93" | ucod=="I94" | ucod=="I95" |
ucod=="I96" | ucod=="I97" | ucod=="I98" | ucod=="I99", 1, 0))
Try %in% instead of == with |
x <- c(paste0("I0", 0:9),paste0("I", c(10:99)))
mort %>%
mutate(newcat = ifelse(ucod %in% x, 1, 0))
Another option is to use regex:
mort <- mort %>%
mutate(newcat = +str_detect(ucod, '^I[0-9]{2}$'))
where ^ is a metacharacter that anchors the match at the beginning of the string. I[0-9]{2} matches the letter I followed by exactly two digits, and $ is another metacharacter that anchors the match at the end of the string. So a matched string must start with I, continue with two digits, and end there. Any string that does not match the pattern is flagged as FALSE, which the unary + converts to 0 (TRUE becomes 1). Note that str_detect comes from the stringr package.
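To see which codes the pattern accepts, the same regular expression can be tried in Python. This is only an illustrative sketch: the function name newcat and the sample codes are made up, and Python's re module stands in for stringr.

```python
import re

# Same pattern as in the R answer: "I" followed by exactly two digits.
# Caveat: in Python, $ also matches just before a trailing newline.
PATTERN = re.compile(r"^I[0-9]{2}$")


def newcat(code):
    """Return 1 if the code lies in I00..I99, else 0 (hypothetical helper)."""
    return 1 if PATTERN.match(code) else 0


codes = ["I00", "I99", "I100", "J10", "XI20", "I5"]
flags = [newcat(code) for code in codes]
print(dict(zip(codes, flags)))
# {'I00': 1, 'I99': 1, 'I100': 0, 'J10': 0, 'XI20': 0, 'I5': 0}
```

"I100" and "XI20" are rejected because of the two anchors; without them, any string merely containing "I" plus two digits would pass.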

manipulate multiple variables in a data frame

How can I shorten the following code? It feels repetitive and lengthy, and I'm not sure how to select those variables and do this recoding in a succinct way. Any help is welcome!
data_France$X.1CTP2[data_France$X.1CTP2>7.01 | data_France$X.1CTP2<0.99]<-NA
data_France$X.1CTP3[data_France$X.1CTP3>7.01 | data_France$X.1CTP3<0.99]<-NA
data_France$X.1CTP4[data_France$X.1CTP4>7.01 | data_France$X.1CTP4<0.99]<-NA
data_France$X.1CTP5[data_France$X.1CTP5>7.01 | data_France$X.1CTP5<0.99]<-NA
data_France$X.1CTP6[data_France$X.1CTP6>7.01 | data_France$X.1CTP6<0.99]<-NA
data_France$X.1CTP7[data_France$X.1CTP7>7.01 | data_France$X.1CTP7<0.99]<-NA
data_France$X.1CTP8[data_France$X.1CTP8>7.01 | data_France$X.1CTP8<0.99]<-NA
data_France$X.1CTP9[data_France$X.1CTP9>7.01 | data_France$X.1CTP9<0.99]<-NA
data_France$X.1CTP10[data_France$X.1CTP10>7.01 | data_France$X.1CTP10<0.99]<-NA
data_France$X.1CTP11[data_France$X.1CTP11>7.01 | data_France$X.1CTP11<0.99]<-NA
data_France$X.1CTP12[data_France$X.1CTP12>7.01 | data_France$X.1CTP12<0.99]<-NA
data_France$X.1CTP13[data_France$X.1CTP13>7.01 | data_France$X.1CTP13<0.99]<-NA
data_France$X.1CTP14[data_France$X.1CTP14>7.01 | data_France$X.1CTP14<0.99]<-NA
data_France$X.1CTP15[data_France$X.1CTP15>7.01 | data_France$X.1CTP15<0.99]<-NA
data_France$X.2CTP1[data_France$X.2CTP1>7.01 | data_France$X.2CTP1<0.99]<-NA
data_France$X.2CTP3[data_France$X.2CTP3>7.01 | data_France$X.2CTP3<0.99]<-NA
data_France$X.2CTP4[data_France$X.2CTP4>7.01 | data_France$X.2CTP4<0.99]<-NA
data_France$X.2CTP5[data_France$X.2CTP5>7.01 | data_France$X.2CTP5<0.99]<-NA
data_France$X.2CTP6[data_France$X.2CTP6>7.01 | data_France$X.2CTP6<0.99]<-NA
data_France$X.2CTP7[data_France$X.2CTP7>7.01 | data_France$X.2CTP7<0.99]<-NA
data_France$X.2CTP8[data_France$X.2CTP8>7.01 | data_France$X.2CTP8<0.99]<-NA
data_France$X.2CTP9[data_France$X.2CTP9>7.01 | data_France$X.2CTP9<0.99]<-NA
data_France$X.2CTP10[data_France$X.2CTP10>7.01 | data_France$X.2CTP10<0.99]<-NA
data_France$X.2CTP11[data_France$X.2CTP11>7.01 | data_France$X.2CTP11<0.99]<-NA
data_France$X.2CTP12[data_France$X.2CTP12>7.01 | data_France$X.2CTP12<0.99]<-NA
data_France$X.2CTP13[data_France$X.2CTP13>7.01 | data_France$X.2CTP13<0.99]<-NA
data_France$X.2CTP14[data_France$X.2CTP14>7.01 | data_France$X.2CTP14<0.99]<-NA
data_France$X.2CTP15[data_France$X.2CTP15>7.01 | data_France$X.2CTP15<0.99]<-NA
data_France$X.3CTP1[data_France$X.3CTP1>7.01 | data_France$X.3CTP1<0.99]<-NA
data_France$X.3CTP2[data_France$X.3CTP2>7.01 | data_France$X.3CTP2<0.99]<-NA
data_France$X.3CTP4[data_France$X.3CTP4>7.01 | data_France$X.3CTP4<0.99]<-NA
data_France$X.3CTP5[data_France$X.3CTP5>7.01 | data_France$X.3CTP5<0.99]<-NA
data_France$X.3CTP6[data_France$X.3CTP6>7.01 | data_France$X.3CTP6<0.99]<-NA
data_France$X.3CTP7[data_France$X.3CTP7>7.01 | data_France$X.3CTP7<0.99]<-NA
data_France$X.3CTP8[data_France$X.3CTP8>7.01 | data_France$X.3CTP8<0.99]<-NA
data_France$X.3CTP9[data_France$X.3CTP9>7.01 | data_France$X.3CTP9<0.99]<-NA
data_France$X.3CTP10[data_France$X.3CTP10>7.01 | data_France$X.3CTP10<0.99]<-NA
data_France$X.3CTP11[data_France$X.3CTP11>7.01 | data_France$X.3CTP11<0.99]<-NA
data_France$X.3CTP12[data_France$X.3CTP12>7.01 | data_France$X.3CTP12<0.99]<-NA
data_France$X.3CTP13[data_France$X.3CTP13>7.01 | data_France$X.3CTP13<0.99]<-NA
data_France$X.3CTP14[data_France$X.3CTP14>7.01 | data_France$X.3CTP14<0.99]<-NA
data_France$X.3CTP15[data_France$X.3CTP15>7.01 | data_France$X.3CTP15<0.99]<-NA
data_France$X.4CTP1[data_France$X.4CTP1>7.01 | data_France$X.4CTP1<0.99]<-NA
data_France$X.4CTP2[data_France$X.4CTP2>7.01 | data_France$X.4CTP2<0.99]<-NA
data_France$X.4CTP3[data_France$X.4CTP3>7.01 | data_France$X.4CTP3<0.99]<-NA
data_France$X.4CTP5[data_France$X.4CTP5>7.01 | data_France$X.4CTP5<0.99]<-NA
data_France$X.4CTP6[data_France$X.4CTP6>7.01 | data_France$X.4CTP6<0.99]<-NA
data_France$X.4CTP7[data_France$X.4CTP7>7.01 | data_France$X.4CTP7<0.99]<-NA
data_France$X.4CTP8[data_France$X.4CTP8>7.01 | data_France$X.4CTP8<0.99]<-NA
data_France$X.4CTP9[data_France$X.4CTP9>7.01 | data_France$X.4CTP9<0.99]<-NA
data_France$X.4CTP10[data_France$X.4CTP10>7.01 | data_France$X.4CTP10<0.99]<-NA
data_France$X.4CTP11[data_France$X.4CTP11>7.01 | data_France$X.4CTP11<0.99]<-NA
data_France$X.4CTP12[data_France$X.4CTP12>7.01 | data_France$X.4CTP12<0.99]<-NA
data_France$X.4CTP13[data_France$X.4CTP13>7.01 | data_France$X.4CTP13<0.99]<-NA
data_France$X.4CTP14[data_France$X.4CTP14>7.01 | data_France$X.4CTP14<0.99]<-NA
data_France$X.4CTP15[data_France$X.4CTP15>7.01 | data_France$X.4CTP15<0.99]<-NA
data_France$X.5CTP1[data_France$X.5CTP1>7.01 | data_France$X.5CTP1<0.99]<-NA
data_France$X.5CTP2[data_France$X.5CTP2>7.01 | data_France$X.5CTP2<0.99]<-NA
data_France$X.5CTP3[data_France$X.5CTP3>7.01 | data_France$X.5CTP3<0.99]<-NA
data_France$X.5CTP4[data_France$X.5CTP4>7.01 | data_France$X.5CTP4<0.99]<-NA
data_France$X.5CTP6[data_France$X.5CTP6>7.01 | data_France$X.5CTP6<0.99]<-NA
data_France$X.5CTP7[data_France$X.5CTP7>7.01 | data_France$X.5CTP7<0.99]<-NA
data_France$X.5CTP8[data_France$X.5CTP8>7.01 | data_France$X.5CTP8<0.99]<-NA
data_France$X.5CTP9[data_France$X.5CTP9>7.01 | data_France$X.5CTP9<0.99]<-NA
data_France$X.5CTP10[data_France$X.5CTP10>7.01 | data_France$X.5CTP10<0.99]<-NA
data_France$X.5CTP11[data_France$X.5CTP11>7.01 | data_France$X.5CTP11<0.99]<-NA
data_France$X.5CTP12[data_France$X.5CTP12>7.01 | data_France$X.5CTP12<0.99]<-NA
data_France$X.5CTP13[data_France$X.5CTP13>7.01 | data_France$X.5CTP13<0.99]<-NA
data_France$X.5CTP14[data_France$X.5CTP14>7.01 | data_France$X.5CTP14<0.99]<-NA
data_France$X.5CTP15[data_France$X.5CTP15>7.01 | data_France$X.5CTP15<0.99]<-NA
data_France$X.6CTP1[data_France$X.6CTP1>7.01 | data_France$X.6CTP1<0.99]<-NA
data_France$X.6CTP2[data_France$X.6CTP2>7.01 | data_France$X.6CTP2<0.99]<-NA
data_France$X.6CTP3[data_France$X.6CTP3>7.01 | data_France$X.6CTP3<0.99]<-NA
data_France$X.6CTP4[data_France$X.6CTP4>7.01 | data_France$X.6CTP4<0.99]<-NA
data_France$X.6CTP5[data_France$X.6CTP5>7.01 | data_France$X.6CTP5<0.99]<-NA
data_France$X.6CTP7[data_France$X.6CTP7>7.01 | data_France$X.6CTP7<0.99]<-NA
data_France$X.6CTP8[data_France$X.6CTP8>7.01 | data_France$X.6CTP8<0.99]<-NA
data_France$X.6CTP9[data_France$X.6CTP9>7.01 | data_France$X.6CTP9<0.99]<-NA
data_France$X.6CTP10[data_France$X.6CTP10>7.01 | data_France$X.6CTP10<0.99]<-NA
data_France$X.6CTP11[data_France$X.6CTP11>7.01 | data_France$X.6CTP11<0.99]<-NA
data_France$X.6CTP12[data_France$X.6CTP12>7.01 | data_France$X.6CTP12<0.99]<-NA
data_France$X.6CTP13[data_France$X.6CTP13>7.01 | data_France$X.6CTP13<0.99]<-NA
data_France$X.6CTP14[data_France$X.6CTP14>7.01 | data_France$X.6CTP14<0.99]<-NA
data_France$X.6CTP15[data_France$X.6CTP15>7.01 | data_France$X.6CTP15<0.99]<-NA
data_France$X.7CTP1[data_France$X.7CTP1>7.01 | data_France$X.7CTP1<0.99]<-NA
data_France$X.7CTP2[data_France$X.7CTP2>7.01 | data_France$X.7CTP2<0.99]<-NA
data_France$X.7CTP3[data_France$X.7CTP3>7.01 | data_France$X.7CTP3<0.99]<-NA
data_France$X.7CTP4[data_France$X.7CTP4>7.01 | data_France$X.7CTP4<0.99]<-NA
data_France$X.7CTP5[data_France$X.7CTP5>7.01 | data_France$X.7CTP5<0.99]<-NA
data_France$X.7CTP6[data_France$X.7CTP6>7.01 | data_France$X.7CTP6<0.99]<-NA
data_France$X.7CTP8[data_France$X.7CTP8>7.01 | data_France$X.7CTP8<0.99]<-NA
data_France$X.7CTP9[data_France$X.7CTP9>7.01 | data_France$X.7CTP9<0.99]<-NA
data_France$X.7CTP10[data_France$X.7CTP10>7.01 | data_France$X.7CTP10<0.99]<-NA
data_France$X.7CTP11[data_France$X.7CTP11>7.01 | data_France$X.7CTP11<0.99]<-NA
data_France$X.7CTP12[data_France$X.7CTP12>7.01 | data_France$X.7CTP12<0.99]<-NA
data_France$X.7CTP13[data_France$X.7CTP13>7.01 | data_France$X.7CTP13<0.99]<-NA
data_France$X.7CTP14[data_France$X.7CTP14>7.01 | data_France$X.7CTP14<0.99]<-NA
data_France$X.7CTP15[data_France$X.7CTP15>7.01 | data_France$X.7CTP15<0.99]<-NA
data_France$X.8CTP1[data_France$X.8CTP1>7.01 | data_France$X.8CTP1<0.99]<-NA
data_France$X.8CTP2[data_France$X.8CTP2>7.01 | data_France$X.8CTP2<0.99]<-NA
data_France$X.8CTP3[data_France$X.8CTP3>7.01 | data_France$X.8CTP3<0.99]<-NA
data_France$X.8CTP4[data_France$X.8CTP4>7.01 | data_France$X.8CTP4<0.99]<-NA
data_France$X.8CTP5[data_France$X.8CTP5>7.01 | data_France$X.8CTP5<0.99]<-NA
data_France$X.8CTP6[data_France$X.8CTP6>7.01 | data_France$X.8CTP6<0.99]<-NA
data_France$X.8CTP7[data_France$X.8CTP7>7.01 | data_France$X.8CTP7<0.99]<-NA
data_France$X.8CTP9[data_France$X.8CTP9>7.01 | data_France$X.8CTP9<0.99]<-NA
data_France$X.8CTP10[data_France$X.8CTP10>7.01 | data_France$X.8CTP10<0.99]<-NA
data_France$X.8CTP11[data_France$X.8CTP11>7.01 | data_France$X.8CTP11<0.99]<-NA
data_France$X.8CTP12[data_France$X.8CTP12>7.01 | data_France$X.8CTP12<0.99]<-NA
data_France$X.8CTP13[data_France$X.8CTP13>7.01 | data_France$X.8CTP13<0.99]<-NA
data_France$X.8CTP14[data_France$X.8CTP14>7.01 | data_France$X.8CTP14<0.99]<-NA
data_France$X.8CTP15[data_France$X.8CTP15>7.01 | data_France$X.8CTP15<0.99]<-NA
data_France$X.9CTP1[data_France$X.9CTP1>7.01 | data_France$X.9CTP1<0.99]<-NA
data_France$X.9CTP2[data_France$X.9CTP2>7.01 | data_France$X.9CTP2<0.99]<-NA
data_France$X.9CTP3[data_France$X.9CTP3>7.01 | data_France$X.9CTP3<0.99]<-NA
data_France$X.9CTP4[data_France$X.9CTP4>7.01 | data_France$X.9CTP4<0.99]<-NA
data_France$X.9CTP5[data_France$X.9CTP5>7.01 | data_France$X.9CTP5<0.99]<-NA
data_France$X.9CTP6[data_France$X.9CTP6>7.01 | data_France$X.9CTP6<0.99]<-NA
data_France$X.9CTP7[data_France$X.9CTP7>7.01 | data_France$X.9CTP7<0.99]<-NA
data_France$X.9CTP8[data_France$X.9CTP8>7.01 | data_France$X.9CTP8<0.99]<-NA
data_France$X.9CTP10[data_France$X.9CTP10>7.01 | data_France$X.9CTP10<0.99]<-NA
data_France$X.9CTP11[data_France$X.9CTP11>7.01 | data_France$X.9CTP11<0.99]<-NA
data_France$X.9CTP12[data_France$X.9CTP12>7.01 | data_France$X.9CTP12<0.99]<-NA
data_France$X.9CTP13[data_France$X.9CTP13>7.01 | data_France$X.9CTP13<0.99]<-NA
data_France$X.9CTP14[data_France$X.9CTP14>7.01 | data_France$X.9CTP14<0.99]<-NA
data_France$X.9CTP15[data_France$X.9CTP15>7.01 | data_France$X.9CTP15<0.99]<-NA
data_France$X.10CTP1[data_France$X.10CTP1>7.01 | data_France$X.10CTP1<0.99]<-NA
data_France$X.10CTP2[data_France$X.10CTP2>7.01 | data_France$X.10CTP2<0.99]<-NA
data_France$X.10CTP3[data_France$X.10CTP3>7.01 | data_France$X.10CTP3<0.99]<-NA
data_France$X.10CTP4[data_France$X.10CTP4>7.01 | data_France$X.10CTP4<0.99]<-NA
data_France$X.10CTP5[data_France$X.10CTP5>7.01 | data_France$X.10CTP5<0.99]<-NA
data_France$X.10CTP6[data_France$X.10CTP6>7.01 | data_France$X.10CTP6<0.99]<-NA
data_France$X.10CTP7[data_France$X.10CTP7>7.01 | data_France$X.10CTP7<0.99]<-NA
data_France$X.10CTP8[data_France$X.10CTP8>7.01 | data_France$X.10CTP8<0.99]<-NA
data_France$X.10CTP9[data_France$X.10CTP9>7.01 | data_France$X.10CTP9<0.99]<-NA
data_France$X.10CTP11[data_France$X.10CTP11>7.01 | data_France$X.10CTP11<0.99]<-NA
data_France$X.10CTP12[data_France$X.10CTP12>7.01 | data_France$X.10CTP12<0.99]<-NA
data_France$X.10CTP13[data_France$X.10CTP13>7.01 | data_France$X.10CTP13<0.99]<-NA
data_France$X.10CTP14[data_France$X.10CTP14>7.01 | data_France$X.10CTP14<0.99]<-NA
data_France$X.10CTP15[data_France$X.10CTP15>7.01 | data_France$X.10CTP15<0.99]<-NA
data_France$X.11CTP1[data_France$X.11CTP1>7.01 | data_France$X.11CTP1<0.99]<-NA
data_France$X.11CTP2[data_France$X.11CTP2>7.01 | data_France$X.11CTP2<0.99]<-NA
data_France$X.11CTP3[data_France$X.11CTP3>7.01 | data_France$X.11CTP3<0.99]<-NA
data_France$X.11CTP4[data_France$X.11CTP4>7.01 | data_France$X.11CTP4<0.99]<-NA
data_France$X.11CTP5[data_France$X.11CTP5>7.01 | data_France$X.11CTP5<0.99]<-NA
data_France$X.11CTP6[data_France$X.11CTP6>7.01 | data_France$X.11CTP6<0.99]<-NA
data_France$X.11CTP7[data_France$X.11CTP7>7.01 | data_France$X.11CTP7<0.99]<-NA
data_France$X.11CTP8[data_France$X.11CTP8>7.01 | data_France$X.11CTP8<0.99]<-NA
data_France$X.11CTP9[data_France$X.11CTP9>7.01 | data_France$X.11CTP9<0.99]<-NA
data_France$X.11CTP10[data_France$X.11CTP10>7.01 | data_France$X.11CTP10<0.99]<-NA
data_France$X.11CTP12[data_France$X.11CTP12>7.01 | data_France$X.11CTP12<0.99]<-NA
data_France$X.11CTP13[data_France$X.11CTP13>7.01 | data_France$X.11CTP13<0.99]<-NA
data_France$X.11CTP14[data_France$X.11CTP14>7.01 | data_France$X.11CTP14<0.99]<-NA
data_France$X.11CTP15[data_France$X.11CTP15>7.01 | data_France$X.11CTP15<0.99]<-NA
data_France$X.12CTP1[data_France$X.12CTP1>7.01 | data_France$X.12CTP1<0.99]<-NA
data_France$X.12CTP2[data_France$X.12CTP2>7.01 | data_France$X.12CTP2<0.99]<-NA
data_France$X.12CTP3[data_France$X.12CTP3>7.01 | data_France$X.12CTP3<0.99]<-NA
data_France$X.12CTP4[data_France$X.12CTP4>7.01 | data_France$X.12CTP4<0.99]<-NA
data_France$X.12CTP5[data_France$X.12CTP5>7.01 | data_France$X.12CTP5<0.99]<-NA
data_France$X.12CTP6[data_France$X.12CTP6>7.01 | data_France$X.12CTP6<0.99]<-NA
data_France$X.12CTP7[data_France$X.12CTP7>7.01 | data_France$X.12CTP7<0.99]<-NA
data_France$X.12CTP8[data_France$X.12CTP8>7.01 | data_France$X.12CTP8<0.99]<-NA
data_France$X.12CTP9[data_France$X.12CTP9>7.01 | data_France$X.12CTP9<0.99]<-NA
data_France$X.12CTP10[data_France$X.12CTP10>7.01 | data_France$X.12CTP10<0.99]<-NA
data_France$X.12CTP11[data_France$X.12CTP11>7.01 | data_France$X.12CTP11<0.99]<-NA
data_France$X.12CTP13[data_France$X.12CTP13>7.01 | data_France$X.12CTP13<0.99]<-NA
data_France$X.12CTP14[data_France$X.12CTP14>7.01 | data_France$X.12CTP14<0.99]<-NA
data_France$X.12CTP15[data_France$X.12CTP15>7.01 | data_France$X.12CTP15<0.99]<-NA
data_France$X.13CTP1[data_France$X.13CTP1>7.01 | data_France$X.13CTP1<0.99]<-NA
data_France$X.13CTP2[data_France$X.13CTP2>7.01 | data_France$X.13CTP2<0.99]<-NA
data_France$X.13CTP3[data_France$X.13CTP3>7.01 | data_France$X.13CTP3<0.99]<-NA
data_France$X.13CTP4[data_France$X.13CTP4>7.01 | data_France$X.13CTP4<0.99]<-NA
data_France$X.13CTP5[data_France$X.13CTP5>7.01 | data_France$X.13CTP5<0.99]<-NA
data_France$X.13CTP6[data_France$X.13CTP6>7.01 | data_France$X.13CTP6<0.99]<-NA
data_France$X.13CTP7[data_France$X.13CTP7>7.01 | data_France$X.13CTP7<0.99]<-NA
data_France$X.13CTP8[data_France$X.13CTP8>7.01 | data_France$X.13CTP8<0.99]<-NA
data_France$X.13CTP9[data_France$X.13CTP9>7.01 | data_France$X.13CTP9<0.99]<-NA
data_France$X.13CTP10[data_France$X.13CTP10>7.01 | data_France$X.13CTP10<0.99]<-NA
data_France$X.13CTP11[data_France$X.13CTP11>7.01 | data_France$X.13CTP11<0.99]<-NA
data_France$X.13CTP12[data_France$X.13CTP12>7.01 | data_France$X.13CTP12<0.99]<-NA
data_France$X.13CTP14[data_France$X.13CTP14>7.01 | data_France$X.13CTP14<0.99]<-NA
data_France$X.13CTP15[data_France$X.13CTP15>7.01 | data_France$X.13CTP15<0.99]<-NA
data_France$X.14CTP1[data_France$X.14CTP1>7.01 | data_France$X.14CTP1<0.99]<-NA
data_France$X.14CTP2[data_France$X.14CTP2>7.01 | data_France$X.14CTP2<0.99]<-NA
data_France$X.14CTP3[data_France$X.14CTP3>7.01 | data_France$X.14CTP3<0.99]<-NA
data_France$X.14CTP4[data_France$X.14CTP4>7.01 | data_France$X.14CTP4<0.99]<-NA
data_France$X.14CTP5[data_France$X.14CTP5>7.01 | data_France$X.14CTP5<0.99]<-NA
data_France$X.14CTP6[data_France$X.14CTP6>7.01 | data_France$X.14CTP6<0.99]<-NA
data_France$X.14CTP7[data_France$X.14CTP7>7.01 | data_France$X.14CTP7<0.99]<-NA
data_France$X.14CTP8[data_France$X.14CTP8>7.01 | data_France$X.14CTP8<0.99]<-NA
data_France$X.14CTP9[data_France$X.14CTP9>7.01 | data_France$X.14CTP9<0.99]<-NA
data_France$X.14CTP10[data_France$X.14CTP10>7.01 | data_France$X.14CTP10<0.99]<-NA
data_France$X.14CTP11[data_France$X.14CTP11>7.01 | data_France$X.14CTP11<0.99]<-NA
data_France$X.14CTP12[data_France$X.14CTP12>7.01 | data_France$X.14CTP12<0.99]<-NA
data_France$X.14CTP13[data_France$X.14CTP13>7.01 | data_France$X.14CTP13<0.99]<-NA
data_France$X.14CTP15[data_France$X.14CTP15>7.01 | data_France$X.14CTP15<0.99]<-NA
data_France$X.15CTP1[data_France$X.15CTP1>7.01 | data_France$X.15CTP1<0.99]<-NA
data_France$X.15CTP2[data_France$X.15CTP2>7.01 | data_France$X.15CTP2<0.99]<-NA
data_France$X.15CTP3[data_France$X.15CTP3>7.01 | data_France$X.15CTP3<0.99]<-NA
data_France$X.15CTP4[data_France$X.15CTP4>7.01 | data_France$X.15CTP4<0.99]<-NA
data_France$X.15CTP5[data_France$X.15CTP5>7.01 | data_France$X.15CTP5<0.99]<-NA
data_France$X.15CTP6[data_France$X.15CTP6>7.01 | data_France$X.15CTP6<0.99]<-NA
data_France$X.15CTP7[data_France$X.15CTP7>7.01 | data_France$X.15CTP7<0.99]<-NA
data_France$X.15CTP8[data_France$X.15CTP8>7.01 | data_France$X.15CTP8<0.99]<-NA
data_France$X.15CTP9[data_France$X.15CTP9>7.01 | data_France$X.15CTP9<0.99]<-NA
data_France$X.15CTP10[data_France$X.15CTP10>7.01 | data_France$X.15CTP10<0.99]<-NA
data_France$X.15CTP11[data_France$X.15CTP11>7.01 | data_France$X.15CTP11<0.99]<-NA
data_France$X.15CTP12[data_France$X.15CTP12>7.01 | data_France$X.15CTP12<0.99]<-NA
data_France$X.15CTP13[data_France$X.15CTP13>7.01 | data_France$X.15CTP13<0.99]<-NA
data_France$X.15CTP14[data_France$X.15CTP14>7.01 | data_France$X.15CTP14<0.99]<-NA
Base R equivalent of @Cettt's answer:
## helper function to replace out-of-range elements with NA
rfun <- function(x) replace(x, which(x < 0.99 | x > 7.01), NA)
## identify which columns need to be changed
cnm <- grep("^X\\.[0-9]+CTP[0-9]+", names(data_France))
## apply the helper to one column at a time
for (i in cnm) {
  data_France[[i]] <- rfun(data_France[[i]])
}
You could also use lapply(), but sometimes the for loop is easier to understand and debug.
I would recommend the dplyr package which has the mutate_at function.
In your case you could use it like this:
library(dplyr)
data_France %>%
as_tibble %>%
mutate_at(vars(matches("^X.[0-9]+CTP[0-9]+")), ~ifelse(.x < 0.99 | .x > 7.01, NA_real_, .x))
#Create a vector of variable names. There may be other ways to do this, like using
#regex or just taking the indices of the variables names (e.g., 1:225)
vars <- apply(expand.grid("X.", as.character(1:15), "CTP", as.character(1:15)),
1, paste0, collapse = "")
for (i in vars) {
data_France[[i]][data_France[[i]] > 7.01 | data_France[[i]] < 0.99] <- NA
}
If this is your entire data set (i.e., there are no other variables in the data), you can simply do
data_France[data_France > 7.01 | data_France < 0.99] <- NA
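The answers above share one idea: pick the columns by a name pattern, then blank out every value outside [0.99, 7.01]. A language-neutral sketch of that idea in plain Python follows. The tiny data dictionary, the "other" column and the helper name mask_out_of_range are invented for illustration, and None plays the role of NA.

```python
import re

# Toy stand-in for data_France: column name -> list of values (made-up numbers)
data = {
    "X.1CTP2": [0.5, 3.0, 7.5],
    "X.2CTP1": [1.0, 7.0, 8.2],
    "other": [100.0, -5.0, 0.0],  # must stay untouched
}

# Same naming scheme as in the question, with the dot escaped
COLUMN_PATTERN = re.compile(r"^X\.[0-9]+CTP[0-9]+$")


def mask_out_of_range(values, low=0.99, high=7.01):
    """Replace values below low or above high with None (hypothetical helper)."""
    return [
        None if value is not None and (value < low or value > high) else value
        for value in values
    ]


cleaned = {
    name: mask_out_of_range(column) if COLUMN_PATTERN.match(name) else list(column)
    for name, column in data.items()
}
print(cleaned)
```

Only the two columns whose names match the pattern are changed; the "other" column keeps its values even though they fall outside the range.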

Unable to forecast linear model in R

I'm able to do forecasts with an ARIMA model, but when I try to do a forecast for a linear model, I do not get any actual forecasts - it stops at the end of the data set (which isn't useful for forecasting since I already know what's in the data set). I've found countless examples online where using this same code works just fine, but I haven't found anyone else having this same error.
library("stats")
library("forecast")
y <- data$Mfg.Shipments.Total..USA.
model_a1 <- auto.arima(y)
forecast_a1 <- forecast.Arima(model_a1, h = 12)
The above code works perfectly. However, when I try to do a linear model....
model1 <- lm(y ~ Mfg.NO.Total..USA. + Mfg.Inv.Total..USA., data = data )
f1 <- forecast.lm(model1, h = 12)
I get an error message saying that I MUST provide a new data set (which seems odd to me, since the documentation for the forecast package says that it is an optional argument).
f1 <- forecast.lm(model1, newdata = x, h = 12)
If I do this, I am able to get the function to work, but the forecast only predicts values for the existing data - it doesn't predict the next 12 periods. I have also tried using the append function to add additional rows to see if that would fix the issue, but when trying to forecast a linear model, it immediately stops at the most recent point in the time series.
Here's the data that I'm using:
+------------+---------------------------+--------------------+---------------------+
| | Mfg.Shipments.Total..USA. | Mfg.NO.Total..USA. | Mfg.Inv.Total..USA. |
+------------+---------------------------+--------------------+---------------------+
| 2110-01-01 | 3.59746e+11 | 3.58464e+11 | 5.01361e+11 |
| 2110-01-01 | 3.59746e+11 | 3.58464e+11 | 5.01361e+11 |
| 2110-02-01 | 3.62268e+11 | 3.63441e+11 | 5.10439e+11 |
| 2110-03-01 | 4.23748e+11 | 4.24527e+11 | 5.10792e+11 |
| 2110-04-01 | 4.08755e+11 | 4.02769e+11 | 5.16853e+11 |
| 2110-05-01 | 4.08187e+11 | 4.02869e+11 | 5.18180e+11 |
| 2110-06-01 | 4.27567e+11 | 4.21713e+11 | 5.15675e+11 |
| 2110-07-01 | 3.97590e+11 | 3.89916e+11 | 5.24785e+11 |
| 2110-08-01 | 4.24732e+11 | 4.16304e+11 | 5.27734e+11 |
| 2110-09-01 | 4.30974e+11 | 4.35043e+11 | 5.28797e+11 |
| 2110-10-01 | 4.24008e+11 | 4.17076e+11 | 5.38917e+11 |
| 2110-11-01 | 4.11930e+11 | 4.09440e+11 | 5.42618e+11 |
| 2110-12-01 | 4.25940e+11 | 4.34201e+11 | 5.35384e+11 |
| 2111-01-01 | 4.01629e+11 | 4.07748e+11 | 5.55057e+11 |
| 2111-02-01 | 4.06385e+11 | 4.06151e+11 | 5.66058e+11 |
| 2111-03-01 | 4.83827e+11 | 4.89904e+11 | 5.70990e+11 |
| 2111-04-01 | 4.54640e+11 | 4.46702e+11 | 5.84808e+11 |
| 2111-05-01 | 4.65124e+11 | 4.63155e+11 | 5.92456e+11 |
| 2111-06-01 | 4.83809e+11 | 4.75150e+11 | 5.86645e+11 |
| 2111-07-01 | 4.44437e+11 | 4.40452e+11 | 5.97201e+11 |
| 2111-08-01 | 4.83537e+11 | 4.79958e+11 | 5.99461e+11 |
| 2111-09-01 | 4.77130e+11 | 4.75580e+11 | 5.93065e+11 |
| 2111-10-01 | 4.69276e+11 | 4.59579e+11 | 6.03481e+11 |
| 2111-11-01 | 4.53706e+11 | 4.55029e+11 | 6.02577e+11 |
| 2111-12-01 | 4.57872e+11 | 4.81454e+11 | 5.86886e+11 |
| 2112-01-01 | 4.35834e+11 | 4.45037e+11 | 6.04042e+11 |
| 2112-02-01 | 4.55996e+11 | 4.70820e+11 | 6.12071e+11 |
| 2112-03-01 | 5.04869e+11 | 5.08818e+11 | 6.11717e+11 |
| 2112-04-01 | 4.76213e+11 | 4.70666e+11 | 6.16375e+11 |
| 2112-05-01 | 4.95789e+11 | 4.87730e+11 | 6.17639e+11 |
| 2112-06-01 | 4.91218e+11 | 4.87857e+11 | 6.09361e+11 |
| 2112-07-01 | 4.58087e+11 | 4.61037e+11 | 6.19166e+11 |
| 2112-08-01 | 4.97438e+11 | 4.74539e+11 | 6.22773e+11 |
| 2112-09-01 | 4.86994e+11 | 4.85560e+11 | 6.23067e+11 |
| 2112-10-01 | 4.96744e+11 | 4.92562e+11 | 6.26796e+11 |
| 2112-11-01 | 4.70810e+11 | 4.64944e+11 | 6.23999e+11 |
| 2112-12-01 | 4.66721e+11 | 4.88615e+11 | 6.08900e+11 |
| 2113-01-01 | 4.51585e+11 | 4.50763e+11 | 6.25881e+11 |
| 2113-02-01 | 4.56329e+11 | 4.69574e+11 | 6.33157e+11 |
| 2113-03-01 | 5.04023e+11 | 4.92978e+11 | 6.31055e+11 |
| 2113-04-01 | 4.84798e+11 | 4.76750e+11 | 6.35643e+11 |
| 2113-05-01 | 5.04478e+11 | 5.04488e+11 | 6.34376e+11 |
| 2113-06-01 | 4.99043e+11 | 5.13760e+11 | 6.25715e+11 |
| 2113-07-01 | 4.75700e+11 | 4.69012e+11 | 6.34892e+11 |
| 2113-08-01 | 5.05244e+11 | 4.90404e+11 | 6.37735e+11 |
| 2113-09-01 | 5.00087e+11 | 5.04849e+11 | 6.34665e+11 |
| 2113-10-01 | 5.05965e+11 | 4.99682e+11 | 6.38945e+11 |
| 2113-11-01 | 4.78876e+11 | 4.80784e+11 | 6.34442e+11 |
| 2113-12-01 | 4.80640e+11 | 4.98807e+11 | 6.19458e+11 |
| 2114-01-01 | 4.56779e+11 | 4.57684e+11 | 6.36568e+11 |
| 2114-02-01 | 4.62195e+11 | 4.70312e+11 | 6.48982e+11 |
| 2114-03-01 | 5.19472e+11 | 5.25900e+11 | 6.47038e+11 |
| 2114-04-01 | 5.04217e+11 | 5.06090e+11 | 6.52612e+11 |
| 2114-05-01 | 5.14186e+11 | 5.11149e+11 | 6.58990e+11 |
| 2114-06-01 | 5.25249e+11 | 5.33247e+11 | 6.49512e+11 |
| 2114-07-01 | 4.99198e+11 | 5.52506e+11 | 6.57645e+11 |
| 2114-08-01 | 5.17184e+11 | 5.07622e+11 | 6.59281e+11 |
| 2114-09-01 | 5.23682e+11 | 5.24051e+11 | 6.55582e+11 |
| 2114-10-01 | 5.17305e+11 | 5.09549e+11 | 6.59237e+11 |
| 2114-11-01 | 4.71921e+11 | 4.70093e+11 | 6.57044e+11 |
| 2114-12-01 | 4.84948e+11 | 4.86804e+11 | 6.34120e+11 |
+------------+---------------------------+--------------------+---------------------+
Edit - Here's the code I used for adding new datapoints for forecasting.
library(xts)
library(mondate)
d <- as.mondate("2115-01-01")
d11 <- d + 11
seq(d, d11)
newdates <- seq(d, d11)
new_xts <- xts(order.by = as.Date(newdates))
new_xts$Mfg.Shipments.Total..USA. <- NA
new_xts$Mfg.NO.Total..USA. <- NA
new_xts$Mfg.Inv.Total..USA. <- NA
x <- append(data, new_xts)
Not sure if you ever figured this out, but just in case I thought I'd point out what's going wrong.
The documentation for forecast.lm describes the newdata argument as:
An optional data frame in which to look for variables with which to predict. If omitted, it is assumed that the only variables are trend and season, and h forecasts are produced.
So newdata is optional only if trend and season are your only predictors.
The ARIMA model works because it's using lagged values of the time series in the forecast. For the linear model, it uses the given predictors (Mfg.NO.Total..USA. and Mfg.Inv.Total..USA. in your case) and thus needs their corresponding future values; without these, there are no independent variables to predict from.
In the edit, you added those variables to your future dataset, but they still have values of NA for all future points, thus the forecasts are also NA.
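The point can be illustrated with a deliberately tiny regression in plain Python. This is a sketch with made-up numbers and a single predictor, not the forecast package's implementation; fit_line and predict are hypothetical helpers.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - slope * mean_x, slope


def predict(model, future_xs):
    """A regression forecast is only defined where the predictor is known."""
    intercept, slope = model
    return [None if x is None else intercept + slope * x for x in future_xs]


# Toy history (made-up): shipments = 1 + 2 * orders
orders = [1.0, 2.0, 3.0, 4.0]
shipments = [3.0, 5.0, 7.0, 9.0]
model = fit_line(orders, shipments)

missing = predict(model, [None, None])  # future predictors left as NA
known = predict(model, [5.0, 6.0])      # future predictors supplied
print(missing)  # [None, None]
print(known)    # [11.0, 13.0]
```

Appending rows of NA gives the model nothing to work with, so the forecasts are missing too. In practice the future predictor values have to come from somewhere, for example from their own forecasts or from planned values.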
Gabe is correct. You need future values of your causals.
You should consider the Transfer Function modeling process instead of regression (which was developed for use with cross-sectional data). By prewhitening your X variables (i.e., building a model for each one), you can calculate the cross-correlation function to see any lead or lag relationship.
It is very apparent from the standardized graph of Y and the two X's that Inv.Total is a lead variable (b**-1): when Inv.Total moves down, so do shipments. In addition, there is also an AR seasonal component beyond the causals that is driving the data. There are a few outliers as well, so this is a robust solution. I am a developer of the software used here, but this can be run in any tool.

Neo4j shortest path with rels in both directions

I have a graph set up with the function...
create (a:station {name:"a"}),
(b:station {name:"b"}),
(c:station {name:"c"}),
(d:station {name:"d"}),
(e:station {name:"e"}),
(f:station {name:"f"}),
(a)-[:CONNECTS_TO {time:8}]->(b),
(a)-[:CONNECTS_TO {time:4}]->(c),
(a)-[:CONNECTS_TO {time:10}]->(d),
(b)-[:CONNECTS_TO {time:3}]->(c),
(b)-[:CONNECTS_TO {time:9}]->(e),
(c)-[:CONNECTS_TO {time:40}]->(f),
(d)-[:CONNECTS_TO {time:5}]->(e),
(e)-[:CONNECTS_TO {time:3}]->(f)
and using the function
START startStation=node:node_auto_index(name = "a"), endStation=node:node_auto_index(name = "f")
MATCH p =(startStation)-[r*]->(endStation)
WITH extract(x IN rels(p)| x.time) AS Times, length(p) AS `Number of Stops`, reduce(totalTime = 0, x IN rels(p)| totalTime + x.time) AS `Total Time`, extract(x IN nodes(p)| x.name) AS Route
RETURN Route, Times, `Total Time`, `Number of Stops`
ORDER BY `Total Time`
and it returns the results...
+-------------------------------------------------------------+
| Route | Times | Total Time | Number of Stops |
+-------------------------------------------------------------+
| ["a","d","e","f"] | [10,5,3] | 18 | 3 |
| ["a","b","e","f"] | [8,9,3] | 20 | 3 |
| ["a","c","f"] | [4,40] | 44 | 2 |
| ["a","b","c","f"] | [8,3,40] | 51 | 3 |
+-------------------------------------------------------------+
This is fine, except that because it is a directed graph and there is no path from c -> b, it doesn't return (for instance) [a, c, b, e, f], which is a valid path of length 4.
So, if I add the inverse paths...
MATCH (START)-[r:CONNECTS_TO]->(END )
CREATE UNIQUE (START)<-[:CONNECTS_TO { time:r.time }]-(END )
And run the query again I get... (for paths length 1..4)...
+---------------------------------------------------------------------+
| Route | Times | Total Time | Number of Stops |
+---------------------------------------------------------------------+
| ["a","d","e","f"] | [10,5,3] | 18 | 3 |
| ["a","c","b","e","f"] | [4,3,9,3] | 19 | 4 |
| ["a","b","e","f"] | [8,9,3] | 20 | 3 |
| ["a","c","f"] | [4,40] | 44 | 2 |
| ["a","c","b","c","f"] | [4,3,3,40] | 50 | 4 |
| ["a","c","f","e","f"] | [4,40,3,3] | 50 | 4 |
| ["a","b","c","f"] | [8,3,40] | 51 | 3 |
| ["a","b","a","c","f"] | [8,8,4,40] | 60 | 4 |
| ["a","d","a","c","f"] | [10,10,4,40] | 64 | 4 |
+---------------------------------------------------------------------+
This does include the path [a, c, b, e, f], but it also includes [a, c, b, c, f], which uses c twice, and [a, c, f, e, f], which uses f (the destination?!) twice.
Is there a way of filtering the paths so each path only includes the same node once?
You could do a filtering after the fact, but it might not be the fastest thing.
Something like this:
START startStation=node:node_auto_index(name = "a"), endStation=node:node_auto_index(name = "f")
MATCH p = (startStation)-[r*..4]->(endStation)
WHERE length(reduce (a=[startStation], n IN nodes(p) | CASE WHEN n IN a THEN a ELSE a + n END)) = length(nodes(p))
WITH extract(x IN rels(p)| x.time) AS Times, length(p) AS `Number of Stops`, reduce(totalTime = 0, x IN rels(p)| totalTime + x.time) AS `Total Time`, extract(x IN nodes(p)| x.name) AS Route
RETURN Route, Times, `Total Time`, `Number of Stops`
ORDER BY `Total Time`
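As a cross-check of what the filtered result should look like, the same search can be sketched in plain Python: treat every connection as usable in both directions, walk at most four hops, and never revisit a station. The function names build_graph and simple_paths are made up for this sketch; it is not how Neo4j evaluates the query.

```python
# Connections from the question: (from, to, time)
EDGES = [
    ("a", "b", 8), ("a", "c", 4), ("a", "d", 10), ("b", "c", 3),
    ("b", "e", 9), ("c", "f", 40), ("d", "e", 5), ("e", "f", 3),
]


def build_graph(edges):
    """Adjacency list with every connection usable in both directions."""
    graph = {}
    for u, v, time in edges:
        graph.setdefault(u, []).append((v, time))
        graph.setdefault(v, []).append((u, time))
    return graph


def simple_paths(graph, start, end, max_hops):
    """All start-to-end routes of at most max_hops that visit no node twice."""
    results = []

    def walk(node, route, times):
        if node == end:
            results.append((route, times, sum(times)))
            return  # continuing would have to revisit the destination
        if len(times) == max_hops:
            return
        for neighbour, time in graph[node]:
            if neighbour not in route:  # the "each node only once" rule
                walk(neighbour, route + [neighbour], times + [time])

    walk(start, [start], [])
    return sorted(results, key=lambda result: result[2])


paths = simple_paths(build_graph(EDGES), "a", "f", 4)
for route, times, total in paths:
    print(route, times, total)
```

With the question's data this prints five routes with total times 18, 19, 20, 44 and 51, that is, the rows of the second result table minus the four that repeat a station.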
I created a GraphGist with your question and the answers as an executable, live document.
See here: Neo4j shortest path with rels in both directions

By group: sum of variable values under condition

I want to sum a variable's values by group while excluding certain values, conditional on another variable.
How can I do this elegantly without transposing?
In the table below, for each (fTicker, DATE_f) I want to sum the values of wght, excluding the wght of the row's own sTicker.
For example, excl_val for sTicker=A, given (fTicker=XLK, DATE_f=6/20/2003), is wght_AAPL_6/20/2003_XLK + wght_AA_6/20/2003_XLK, without the wght for sTicker=A.
+---------+---------+-----------+-------------+-------------+
| sTicker | fTicker | DATE_f | wght | excl_val |
+---------+---------+-----------+-------------+-------------+
| A | XLK | 6/20/2003 | 0.087600002 | 1.980834016 |
| A | XLK | 6/23/2003 | 0.08585 | 1.898560068 |
| A | XLK | 6/24/2003 | 0.085500002 | |
| AAPL | XLK | 6/20/2003 | 0.070080002 | |
| AAPL | XLK | 6/23/2003 | 0.06868 | |
| AAPL | XLK | 6/24/2003 | 0.068400002 | |
| AA | XLK | 6/20/2003 | 1.910754014 | |
| AA | XLK | 6/23/2003 | 1.829880067 | |
| AA | XLK | 6/24/2003 | 1.819775 | |
+---------+---------+-----------+-------------+-------------+
There are several fTicker groups with many sTicker in them (10 to 70), some sTicker may belong to several fTicker. The end result should be an excl_val for each sTicker on each DATE_f and for each fTicker.
I did it by transposing in SAS with resulting file about 6 gb but the same approach in R, blew memory up to 40 gb and it's basically unworkable.
In R, I got as far as this
weights$excl_val <- with(weights, aggregate(wght, list(fTicker, DATE_f), sum, na.rm=T))
but it's just a simple sum (without excluding the necessary observation), and there is a mismatch between row lengths. If I could condition the sum to exclude the sTicker observation's own wght from the summation, I think it might work.
About the excl_val length: I computed it in Excel for just 2 cells, which is why it's short.
Thank you!
Arsenio
When you have data in a data.frame, it is better if the rows are meaningful
(in particular, the columns should have the same length):
in this case, excl_val looks like a separate vector.
After putting the information it contains in the data.frame,
things become easier.
# Sample data
k <- 5
d <- data.frame(
sTicker = rep(LETTERS[1:k], k),
fTicker = rep(LETTERS[1:k], each=k),
DATE_f = sample( seq(Sys.Date(), length=2, by=1), k*k, replace=TRUE ),
wght = runif(k*k)
)
excl_val <- sample(d$wght, k)
# Add a "valid" column to the data.frame
d$valid <- ! d$wght %in% excl_val
# Compute the sum
library(plyr)
ddply(d, c("fTicker","DATE_f"), summarize, sum=sum(wght[valid]))
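Another way to read the question is as a "leave one out" sum: for each row, the total of wght in its (fTicker, DATE_f) group minus the row's own wght. That needs no transposing, only one pass to total the groups and one to subtract. Below is a minimal Python sketch of that reading. The helper name leave_one_out_sums is made up, and the sketch assumes each sTicker appears at most once per (fTicker, DATE_f) group.

```python
from collections import defaultdict

# Rows from the question's table: (sTicker, fTicker, DATE_f, wght)
rows = [
    ("A", "XLK", "6/20/2003", 0.087600002),
    ("AAPL", "XLK", "6/20/2003", 0.070080002),
    ("AA", "XLK", "6/20/2003", 1.910754014),
    ("A", "XLK", "6/23/2003", 0.08585),
    ("AAPL", "XLK", "6/23/2003", 0.06868),
    ("AA", "XLK", "6/23/2003", 1.829880067),
]


def leave_one_out_sums(rows):
    """Append excl_val = group total of wght minus the row's own wght."""
    totals = defaultdict(float)
    for _s_ticker, f_ticker, date_f, wght in rows:
        totals[(f_ticker, date_f)] += wght
    return [
        (s_ticker, f_ticker, date_f, wght, totals[(f_ticker, date_f)] - wght)
        for s_ticker, f_ticker, date_f, wght in rows
    ]


result = leave_one_out_sums(rows)
for row in result:
    print(row)
```

For sTicker A this gives about 1.980834 on 6/20/2003 and about 1.898560 on 6/23/2003, in line with the two excl_val cells the question computed by hand. The same "group total minus own value" trick carries over to R, for example with ave() or a grouped mutate.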
