Conditional vectorization - r
Given a data frame like below:
set.seed(123)
df1 <- data.frame(V1=sample(c(0,1,2),100,replace=TRUE),
V2=sample(c(2,3,4),100,replace=TRUE),
V3=sample(c(4,5,6),100,replace=TRUE),
V4=sample(c(6,7,8),100,replace=TRUE),
V5=sample(c(6,7,8),100,replace=TRUE))
I want to sum each row, starting from the first column with a value >= 2 and ending with the first column with a value > 6; if no value in the row exceeds 6, sum until the end of the row.
How would I do this in a vectorized fashion?
Update: This is not for any homework assignment. I just want more examples of vectorization code that I can study and learn from. I had to do something like the above before, but couldn't figure out the apply syntax for this particular task and resorted to for loops.
This appeared to me to be the most R-like approach, but I don't consider it "vectorized" in the R meaning of the term:
apply( df1, 1, function(x) sum( x[which(x>=2)[1]: min(which(x>6)[1], 5, na.rm=TRUE)] ) )
#---------
[1] 15 22 16 19 17 17 23 21 14 13 18 13 16 23 15 18 16 21 16 19 17 23 21 18
[25] 21 24 15 20 15 18 17 24 19 18 19 15 18 17 15 17 14 21 13 19 15 15 15 15
[49] 21 19 21 15 17 18 14 17 15 16 22 16 23 22 17 21 17 16 23 23 16 14 18 13
[73] 18 15 17 17 17 20 20 16 17 16 16 16 14 16 20 23 23 24 14 18 16 17 22 23
[97] 23 19 20 17
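To see what the one-liner does, it can help to run its pieces on single rows. This is only a sketch: the two rows below are made-up values in the same ranges as df1, not rows taken from it.

```r
# Hypothetical row laid out like a row of df1
x <- c(V1 = 1, V2 = 3, V3 = 5, V4 = 7, V5 = 8)

first <- which(x >= 2)[1]                               # position of the first value >= 2
last  <- min(which(x > 6)[1], length(x), na.rm = TRUE)  # position of the first value > 6
row_sum <- sum(x[first:last])                           # 3 + 5 + 7

# Hypothetical row with no value above 6: which(...)[1] is NA,
# so min(..., na.rm = TRUE) falls back to the last column
y <- c(V1 = 2, V2 = 4, V3 = 6, V4 = 6, V5 = 6)
y_last <- min(which(y > 6)[1], length(y), na.rm = TRUE)
y_sum  <- sum(y[which(y >= 2)[1]:y_last])               # whole row
```

The `na.rm = TRUE` in `min()` is what makes the "else sum until the end of the row" case work without an explicit `if`.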
Due to your sampling structure, we can vectorize quite easily.
We know that only the first column can be less than 2, so it is the only column that can be excluded at the start. Columns V2, V3 and V4 must always be included: V2 and V3 never exceed 6, and V4 is the first column that can. Column V5 is excluded only if column V4 is above 6.
So:
(df1$V1 == 2) * df1$V1 + df1$V2 + df1$V3 + df1$V4 + df1$V5 * !(df1$V4 > 6)
[1] 15 22 16 19 17 17 23 21 14 13 18 13 16 23 15 18 16 21 16 19 17 23 21 18 21 24 15 20 15 18 17 24 19 18
[35] 19 15 18 17 15 17 14 21 13 19 15 15 15 15 21 19 21 15 17 18 14 17 15 16 22 16 23 22 17 21 17 16 23 23
[69] 16 14 18 13 18 15 17 17 17 20 20 16 17 16 16 16 14 16 20 23 23 24 14 18 16 17 22 23 23 19 20 17
is your vectorized calculation. This is obviously much less general than the other answers here, but fits your question.
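A quick way to convince yourself that the shortcut matches the row-wise definition is to compute both on the same data and compare. The sketch below rebuilds df1 as in the question; the names by_row and shortcut are introduced here only for illustration.

```r
set.seed(123)
df1 <- data.frame(V1 = sample(c(0, 1, 2), 100, replace = TRUE),
                  V2 = sample(c(2, 3, 4), 100, replace = TRUE),
                  V3 = sample(c(4, 5, 6), 100, replace = TRUE),
                  V4 = sample(c(6, 7, 8), 100, replace = TRUE),
                  V5 = sample(c(6, 7, 8), 100, replace = TRUE))

# Row-by-row definition, as in the apply() answer
by_row <- apply(df1, 1, function(x)
  sum(x[which(x >= 2)[1]:min(which(x > 6)[1], 5, na.rm = TRUE)]))

# Column-wise shortcut that relies on the value ranges of each column
shortcut <- (df1$V1 == 2) * df1$V1 + df1$V2 + df1$V3 + df1$V4 +
  df1$V5 * !(df1$V4 > 6)

same <- isTRUE(all.equal(unname(by_row), shortcut))
same
```

The comparison holds only because of how the columns were sampled; with other value ranges the shortcut would need to be rederived.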
Using apply would be the most sensible solution. However, since we seem to be competing on who can answer this without using R-based loops, I humbly offer this
m<-as.matrix(df1)
start<-max.col(m>=2,ties="first")
end<-max.col(`[<-`(m>6,,ncol(m),TRUE),ties="first")
i<-t(matrix(1:ncol(m),nrow=ncol(m),ncol=nrow(m)))
rowSums(m*(i>=start & i<=end))
The output is the same as in the other answers.
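The individual steps are easier to follow on a tiny matrix. This is a sketch with two made-up rows; hit and res are names introduced here, and col(m) is used as an equivalent of the t(matrix(...)) construction.

```r
# Two hypothetical rows in the same value ranges as df1
m <- rbind(c(1, 3, 5, 7, 8),
           c(2, 4, 6, 6, 6))

# First column per row where the condition holds
start <- max.col(m >= 2, ties.method = "first")

# Force the last column to TRUE so every row has an end position;
# same effect as `[<-`(m > 6, , ncol(m), TRUE)
hit <- m > 6
hit[, ncol(m)] <- TRUE
end <- max.col(hit, ties.method = "first")

# Column index of every cell; start and end are recycled down the rows
i <- col(m)
res <- rowSums(m * (i >= start & i <= end))
res
```

Multiplying by the logical mask zeroes out every cell outside the start:end window, so a single rowSums() call does the rest.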
I am sure that there is a more elegant way but for a brute force approach you can write a function and pass it to apply.
First, define your example data
df <- data.frame(V1=sample(c(0,1,2),100,replace=TRUE),
V2=sample(c(2,3,4),100,replace=TRUE),
V3=sample(c(4,5,6),100,replace=TRUE),
V4=sample(c(6,7,8),100,replace=TRUE),
V5=sample(c(6,7,8),100,replace=TRUE))
Write a function that implements the conditional logic. which returns the positions in the vector where a condition holds. For "start", the [1] picks the position of the first value >= 2. For "end" there are two possible outcomes, so I used an if statement: if no value meets the condition > 6, "end" is set to the last position of the vector; otherwise it is set to the first position that meets the condition. Then it is just a matter of subsetting the vector from start to end and passing the result to sum.
sum.col <- function(x) {
  # position of the first value >= 2
  start <- which(x >= 2)[1]
  # positions of all values > 6
  end <- which(x > 6)
  if (length(end) == 0) {
    # no value above 6: sum to the end of the row
    end <- length(x)
  } else {
    # stop at the first value above 6
    end <- end[1]
  }
  sum(x[start:end])
}
Now we can pass the function to apply, which calls it on each row for us.
apply(df, FUN=sum.col, MARGIN = 1)
Related
Convert entire data set into numeric form during retrieving the file
I am working on a Shiny app and want to convert the entire data set into numeric form. I have used this code to retrieve the file from the local PC. What changes can be made so that the entire data set is converted into numeric form while it is being read?
datami <- reactive({
  file1 <- input$file
  if (is.null(file1)) { return() }
  read.csv(file = file1$datapath, sep = input$sep, header = input$header, stringsAsFactors = input$stringAsFactors)
})
output$table <- renderPrint({
  if (is.null(datami())) { return() }
  str(datami())
})
tabsetPanel(tabPanel("Data", div(h5("Data", style = "color:red")), verbatimTextOutput("table")))
Depending on how you want to deal with lower/uppercase letters (if you have them in your data), we could do one of the following:
MRE:
letter_variable <- c(letters, LETTERS)
Same numeric value for upper and lower case letters:
letter_variable_as_numeric1 <- as.numeric(factor(toupper(letter_variable), levels = LETTERS))
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
[22] 22 23 24 25 26  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
[43] 17 18 19 20 21 22 23 24 25 26
Different numeric value for upper and lower case letters:
letter_variable_as_numeric2 <- as.numeric(factor(letter_variable), levels = c(letters, LETTERS))
 [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41
[22] 43 45 47 49 51  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32
[43] 34 36 38 40 42 44 46 48 50 52
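The mapping from letters to numbers comes from the order of the factor levels. As a small sketch on a made-up vector x: here levels is supplied to factor() itself, which pins the codes to that explicit order. When levels is not given to factor(), the codes follow the locale's sort order, which is why the second output above interleaves lower and upper case.

```r
x <- c("b", "A", "c", "a")  # hypothetical input

# Case-insensitive codes: position in LETTERS after upper-casing
same_case <- as.numeric(factor(toupper(x), levels = LETTERS))

# Case-sensitive codes: a..z map to 1..26, A..Z map to 27..52
diff_case <- as.numeric(factor(x, levels = c(letters, LETTERS)))

same_case
diff_case
```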
Predict future values using Arma
I have sales data with 51 points and want to predict, say, 10 more future values. It is sales data and hence seasonal, but there are too few data points to estimate seasonality. When I fitted a time series model, it gave "103" as the result for every future prediction. I thought using ARMA would help, but after fitting an ARMA model and using forecast() I still got the same output. I am new to trending and forecasting and do not know whether there are methods other than regression for predicting future values. Kindly help.
Data:
Product
23 22 21 31 29 13 15 20 15 26 11 24 14 18 15 21 25 23 27 30 19 18 20 13 23 40 14 15 20 14 9 22 14 24 26 22 23 16 24 19 14 10 17 12 11 15 9 24 17 22 28
The code I used:
library("tseries")
arma<-arma(Product)
final<-forecast(arma,10)
Your question is missing some details, like the order of the ARMA model you want to fit and the code you used. Considering that 103 is way larger than any of the values you have given us, I'd suspect that there is an error in your code. Here's an implementation for ARMA(1,1) that should work:
data<-strsplit("23 22 21 31 29 13 15 20 15 26 11 24 14 18 15 21 25 23 27 30 19 18 20 13 23 40 14 15 20 14 9 22 14 24 26 22 23 16 24 19 14 10 17 12 11 15 9 24 17 22 28",split=" ")
data<-as.numeric(data[[1]])
mod<-arima(data,order=c(1,0,1))
pred<-predict(mod,n.ahead=10)
pred
plot(c(data,pred$pred),type="l")
EDIT: Ken is right, btw, an ARMA model doesn't seem that useful for your data and the predictions will mostly be given by the intercept term.
Fitting an ARIMA model to your data results in an ARIMA(0,0,0), which means that the fitted values depend on 0 previous observations and 0 previous fitting errors. This in turn means that the best predictor the ARIMA model can make (based on this data) is a constant. It will, for every observation, regardless of previous observations, predict the same value.
library(forecast)
df <- c(23, 22, 21, 31, 29, 13, 15, 20, 15, 26, 11, 24, 14, 18, 15, 21, 25, 23, 27, 30, 19, 18, 20, 13, 23, 40, 14, 15, 20, 14, 9, 22, 14, 24, 26, 22, 23, 16, 24, 19, 14, 10, 17, 12, 11, 15, 9, 24, 1, 7, 22, 28)
# auto.arima() selects the ARIMA(p, d, q) model with the lowest AICc (its default criterion):
(auto.arima(df))
# Series: ts
# ARIMA(0,0,0) with non-zero mean
#
# Coefficients:
#       intercept
#         19.0000
# s.e.     0.9642
#
# sigma^2 estimated as 49.29:  log likelihood=-174.62
# AIC=353.25   AICc=353.49   BIC=357.15
forecast(object = auto.arima(df), h = 10)
#    Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
# 53             19 10.00226 27.99774 5.239138 32.76086
# 54             19 10.00226 27.99774 5.239138 32.76086
# 55             19 10.00226 27.99774 5.239138 32.76086
# 56             19 10.00226 27.99774 5.239138 32.76086
# 57             19 10.00226 27.99774 5.239138 32.76086
# 58             19 10.00226 27.99774 5.239138 32.76086
# 59             19 10.00226 27.99774 5.239138 32.76086
# 60             19 10.00226 27.99774 5.239138 32.76086
# 61             19 10.00226 27.99774 5.239138 32.76086
# 62             19 10.00226 27.99774 5.239138 32.76086
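The "constant forecast" behaviour can also be checked with base R alone. The sketch below is not the forecast-package code from the answer: it fits the ARIMA(0,0,0) model directly with stats::arima() on the 51 values as posted in the question, and the names sales, fit and pred are introduced here for illustration.

```r
# The 51 values as posted in the question
sales <- scan(text = "23 22 21 31 29 13 15 20 15 26 11 24 14 18 15 21 25 23
                      27 30 19 18 20 13 23 40 14 15 20 14 9 22 14 24 26 22
                      23 16 24 19 14 10 17 12 11 15 9 24 17 22 28",
              quiet = TRUE)

# ARIMA(0,0,0) with a mean term: no AR, no differencing, no MA
fit  <- arima(sales, order = c(0, 0, 0))
pred <- predict(fit, n.ahead = 10)

# Every forecast is the estimated intercept, which should sit at
# (or numerically very close to) the sample mean
as.numeric(pred$pred)
mean(sales)
```

With no autoregressive or moving-average terms there is nothing to carry information from past observations forward, so the forecast cannot vary with the horizon.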
Finding connected component in undirected graph with union-find algorithm gives wrong answer
I'm using the union-find algorithm to find all the connected components of an undirected graph. For some unknown reason it gives a wrong answer. The input graph that I give is connected, and yet it outputs 2 different labels for that graph. I would be glad if someone could explain this anomaly. Are there some special undirected graphs where union-find may not work?
The graph is a little big, so I added a picture for better readability. I couldn't trim the graph, or the essence of the question would vanish.
Here, p and rk are parent and rank respectively.
Code
Union function
void union_find(int x, int y, int *p, int *rk) {
    int px = find(x, p);
    int py = find(y, p);
    if(px != py) {
        if(rk[px] < rk[py]) {
            p[px] = py;
        } else {
            p[py] = px;
            if(rk[px] == rk[py]) rk[px]++;
        }
    }
}
Find function
int find(int x , int *p) {
    return (p[x] == x) ? x : p[x] = find(p[x], p);
}
Sample Input
37 51
4 5 4 10 5 7 5 8 5 9 6 7 6 16 7 8 7 15 8 9 8 10 10 11 10 14 11 12 12 21 13 14 13 32 14 15 14 22 15 16 15 22 16 17 17 22 17 19 19 24 19 27 20 30 20 32 21 31 22 23 22 36 23 24 23 36 24 25 25 26 26 36 26 27 27 28 27 29 27 32 27 37 28 29 29 30 31 32 32 33 32 35 33 34 33 35 34 35 35 36 36 37
Photo of the graph
Please note that I'm not considering the nodes 1, 2, 3 and 18 and their edges.
[Image: Undirected Graph]
What is the function that allows you to avoid explicitly naming the dataset when using a variable in R?
I know there is a function for this and have used it in the past, but can't seem to remember or find it. Basically I'm trying to do the following:
data(mtcars)
missing_function(mtcars)
lm(mpg ~ cyl)
where "missing_function()" allows me to use the variables in the mtcars dataset without having to use the "$". For example: "mpg" instead of "mtcars$mpg".
Though generally frowned upon, attach will do that for you:
R> attach(mtcars)
R> mpg
 [1] 21 21 23 21 19 18 14 24 23 19 18 16 17 15 10 10 15 32 30 34 22 16 15 13 19 27 26 30
[29] 16 20 15 21
R>
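Since attach is described as frowned upon, it may be worth sketching the two usual alternatives that avoid "$" without changing the search path: the data argument of modelling functions, and with(). The names fit and avg are introduced here for illustration.

```r
# Modelling functions look up formula variables in `data`
fit <- lm(mpg ~ cyl, data = mtcars)
coef(fit)

# with() evaluates a single expression inside the data frame
avg <- with(mtcars, mean(mpg))
avg
```

Both keep the lookup local to one call, so variables in the global environment cannot be masked by columns of the data frame afterwards.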
Import stuff from a R file
I'm studying R, and I was asked to use the pi2000 data set, available at the simpleR library (http://www.math.csi.cuny.edu/Statistics/R/simpleR/index.html/simpleR.R). Then I downloaded this file. How do I import it to the command line?
Use the source function:
> source("http://www.math.csi.cuny.edu/Statistics/R/simpleR/index.html/simpleR.R")
You can then access the variables defined there:
> vacation
 [1] 23 12 10 34 25 16 27 18 28 13 14 20  8 21 23 33 30 13 16 14 38 19  6 11 15 21 10
[28] 39 42 25 12 17 19 26 20
For more information, type ?source into an R terminal.
Do you mean load the function into R? If so, use source():
source("http://www.math.csi.cuny.edu/Statistics/R/simpleR/index.html/simpleR.R")
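source() works the same way on a file that has already been downloaded. The sketch below does not use the simpleR file: it writes a one-line stand-in script to a temporary path, and the names script, env and pi_digits are made up for illustration.

```r
# Hypothetical stand-in for a downloaded .R file
script <- tempfile(fileext = ".R")
writeLines("pi_digits <- c(3, 1, 4, 1, 5, 9)", script)

# Evaluate the file in its own environment instead of the workspace
env <- new.env()
source(script, local = env)

ls(env)        # objects the script defined
env$pi_digits
```

Calling source(script) without local defines the objects in the global workspace instead, which is what the answers above do.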