Calculate a lag or lead mean in R

I need to calculate a lag or lead mean between two sequential values in a table and then output the means to a new column. I can write a for loop for this operation, but would prefer to avoid this so that the code is more flexible. Is it possible to do this operation in dplyr and tidyr? Below is an example data set and the desired result. Thanks in advance.
DATA = data.frame(POO = c(2, 4, 6, 8, 10 , 20))
RESULTS = data.frame(POO = c(2, 4, 6, 8, 10 , 20), YEY = c(0,3,5,7,9,15))

Use filter:
DATA$YEY <- filter(DATA$POO, c(1, 1)/2, sides = 1)
# POO YEY
#1 2 NA
#2 4 3
#3 6 5
#4 8 7
#5 10 9
#6 20 15
You can then substitute NA with 0, but I don't understand the logic behind that.
Note that filter gets masked by package dplyr unfortunately. You might need to use stats::filter, if you have attached dplyr.
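Because the name clash with dplyr is easy to trip over, here is a minimal sketch that calls the filter through its namespace and also derives the lead version in base R. The names lag_mean, lead_mean and the LEAD column are illustrative only, and replacing the leading NA with 0 simply follows the desired output in the question.

```r
DATA <- data.frame(POO = c(2, 4, 6, 8, 10, 20))

# Lag mean: average of each value and the one before it.
# stats::filter() returns a time series, so convert it back to a plain vector.
lag_mean <- as.numeric(stats::filter(DATA$POO, c(1, 1) / 2, sides = 1))

# Lead mean: average of each value and the one after it (base R only).
lead_mean <- c((head(DATA$POO, -1) + tail(DATA$POO, -1)) / 2, NA)

# Optional: replace the undefined first lag mean with 0, as in the question.
lag_mean[is.na(lag_mean)] <- 0

DATA$YEY <- lag_mean
DATA$LEAD <- lead_mean
print(DATA)
```

Calling stats::filter explicitly works whether or not dplyr is attached, so the snippet does not depend on the order in which packages were loaded.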

There's also a way in dplyr:
DATA %>%
mutate(YEY = (POO + lag(POO)) / 2)
This also has NA in the first row, which you could fix afterwards if you need to.
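As a sketch of that follow-up fix, the NA can be replaced inside the same mutate call with coalesce(), and lead() gives the forward-looking mean. The LEAD column name and the RESULT object are assumptions made for illustration.

```r
library(dplyr)

DATA <- data.frame(POO = c(2, 4, 6, 8, 10, 20))

RESULT <- DATA %>%
  mutate(
    # Lag mean; coalesce() swaps the NA in the first row for 0.
    YEY  = coalesce((POO + lag(POO)) / 2, 0),
    # Lead mean; the last row has no successor and stays NA.
    LEAD = (POO + lead(POO)) / 2
  )

print(RESULT)
```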

df1<-structure(list(POO = c(2, 4, 6, 8, 10, 20)), .Names = "POO", row.names = c(NA,
-6L), class = "data.frame")
library(dplyr)
library(zoo) # for rollmean function
df1 %>% # df1 is your data frame
mutate(TEY=rollmean(POO,2,fill=0,align="right"))
POO TEY
1 2 0
2 4 3
3 6 5
4 8 7
5 10 9
6 20 15
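The same rollmean call can also cover the lead case by changing the alignment. The sketch below uses a plain vector x and fill = NA instead of 0 so that undefined positions stay visible; both choices are assumptions for illustration.

```r
library(zoo)

x <- c(2, 4, 6, 8, 10, 20)

# align = "right": each window ends at the current value (lag mean).
lag_mean <- rollmean(x, k = 2, fill = NA, align = "right")

# align = "left": each window starts at the current value (lead mean).
lead_mean <- rollmean(x, k = 2, fill = NA, align = "left")

print(data.frame(POO = x, lag_mean = lag_mean, lead_mean = lead_mean))
```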

Related

Select only columns with column names that match values from rows in other df

I have two dataframes. One is a very large, wide dataset with hundreds of parameters. The other has 3 columns: one that identifies the parameters in the larger dataframe that have specification limits, and two for the lower and upper limits. I want to reduce the wide dataframe to just the columns that are listed in the limits dataframe. I feel like this is incredibly basic, but I cannot get it to work.
See below for an example and output that I would like.
df
df <- data.frame("par.1" = c(1, 1, 2, 3, 5), "par.2" = c(10, 11, 12, 11, 15),"par.3" = c(8, 8, 12, 8, 9),"par.4" = c(8, 8, 12, 8, 9))
limits
limits <- data.frame("parameter" = c("par.2", "par.4"), "lsl" = c(8,5), "usl" = c(16,15))
Here is the output I am looking for
df.reduced
par.2 par.4
1 10 8
2 11 8
3 12 12
4 11 8
5 15 9
Just subset df column names by values %in% the parameter column of limits
df[names(df) %in% limits$parameter]
par.2 par.4
1 10 8
2 11 8
3 12 12
4 11 8
5 15 9
Alternatively, use match:
df[match(limits$parameter, names(df))]
An option with intersect
df[intersect(names(df), as.character(limits$parameter))]
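The three variants differ in the column order they return, which only shows when the limits rows are not in the same order as the columns of df. The sketch below reverses the limits rows on purpose to show this; the names wanted, by_in, by_match, by_intersect and not_found are illustrative.

```r
df <- data.frame(par.1 = c(1, 1, 2, 3, 5), par.2 = c(10, 11, 12, 11, 15),
                 par.3 = c(8, 8, 12, 8, 9), par.4 = c(8, 8, 12, 8, 9))

# Same limits as in the question, but with the rows reversed.
limits <- data.frame(parameter = c("par.4", "par.2"),
                     lsl = c(5, 8), usl = c(15, 16),
                     stringsAsFactors = FALSE)

# as.character() guards against the column being a factor.
wanted <- as.character(limits$parameter)

by_in        <- df[names(df) %in% wanted]          # keeps the order of df
by_match     <- df[match(wanted, names(df))]       # keeps the order of limits
by_intersect <- df[intersect(names(df), wanted)]   # keeps the order of df

# match() would yield NA for a parameter that is absent from df,
# so it is worth checking for such names first.
not_found <- setdiff(wanted, names(df))

print(names(by_in))
print(names(by_match))
print(names(by_intersect))
print(not_found)
```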

Alternative to a nested for do loop

I have a data frame named df with 200+ variables and 300,000+ observations (200+ columns, 300,000+ rows).
The end goal of my R code is to find the outliers of each column and replace them with a certain value, say, NA. If a value is already NA, it should be skipped.
for (j in 1:ncol(df)){
outnumtext <- paste0('out_value <- boxplot.stats(df$',colnames(df[j]),')$out')
eval(parse(text=outnumtext))
for (k in 1:nrow(df)){
replacetext <- paste0('
if ((df[',k,',',j,'] %in% out_value) & !(is.na(df[',k,',',j,']))) {
df[',k,',',j,'] <- NA
} else if (is.na(df[',k,',',j,'])) {
next
} else {
next
}')
eval(parse(text=replacetext))
}
}
I discovered that using the for loop in r and looping through each and every one of the rows in every column, considerably slows down the running. Are there any alternatives to this?
Thank you very much in advance!
Edit P/S: The real code is not just replacing outliers with NA, but has several ways of dealing based on several conditions (where if & if else conditions will be executed accordingly). However my goal is to get a possible alternative in reducing the running time, thus I tried simplifying my original code as much as possible to get to the main point
You don't want to use loops for this. You could try dplyr::mutate_all().
It will still be slow over 300K+ rows, but should be better than the loop.
library(dplyr)
df <- df %>%
mutate_all(funs(ifelse(. %in% boxplot.stats(.)$out, NA, .)))
Example:
exdata <- structure(list(x = c(200, 6, 8, 2, 7, 1, 4, 9, 3, 5, 1000),
y = c(300, 1, 18, 3, 2, 16, 14, 9, 11, 6, 100)),
row.names = c(NA, -11L),
class = "data.frame")
exdata
x y
1 200 300
2 6 1
3 8 18
4 2 3
5 7 2
6 1 16
7 4 14
8 9 9
9 3 11
10 5 6
11 1000 100
exdata %>%
mutate_all(funs(ifelse(. %in% boxplot.stats(.)$out, NA, .)))
x y
1 NA NA
2 6 1
3 8 18
4 2 3
5 7 2
6 1 16
7 4 14
8 9 9
9 3 11
10 5 6
11 NA NA
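Note that funs() has been deprecated since dplyr 0.8.0 and mutate_all() was superseded by across() in dplyr 1.0.0. A sketch of the same idea with the current interface, plus a base R equivalent, could look like this; the helper name na_outliers and the result names are hypothetical.

```r
library(dplyr)

exdata <- data.frame(x = c(200, 6, 8, 2, 7, 1, 4, 9, 3, 5, 1000),
                     y = c(300, 1, 18, 3, 2, 16, 14, 9, 11, 6, 100))

# Hypothetical helper: set the boxplot outliers of one vector to NA.
# Existing NAs are left untouched because NA is never among the outliers.
na_outliers <- function(v) replace(v, v %in% boxplot.stats(v)$out, NA)

# dplyr (>= 1.0.0): apply the helper to every column.
cleaned <- exdata %>%
  mutate(across(everything(), na_outliers))

# Base R equivalent: lapply over the columns, keeping the data frame shape.
cleaned_base <- exdata
cleaned_base[] <- lapply(exdata, na_outliers)

print(cleaned)
```

Both versions work on whole columns at once, which is where the speed-up over the row-by-row eval(parse()) loop comes from. More elaborate per-condition rules could go inside the helper, for example with nested ifelse() calls.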

how to combine multiple columns with grep and sum the values in r

I have following dataframe in r
Engine General Ladder.winch engine.phe subm.gear.box aux.engine pipeline.maintain pipeline pipe.line engine.mpd
1 12 22 2 4 2 4 5 6 7
and so on with more than 10000 rows.
Now I want to combine columns and add their values to reduce the columns to broader categories, e.g. Engine, engine.phe, aux.engine and engine.mpd should be combined into an Engine category with all the values added. Likewise pipeline.maintain, pipeline and pipe.line should be combined into Pipeline, and the remaining columns should be added under a General category.
Desired dataframe would be
Engine Pipeline General
12 15 38
How can I do it in r?
There are many ways to do it; this is a more straightforward approach:
# Example data.frame
dtf <- structure(list(Engine = c(1, 0, 1),
General = c(12, 3, 15), Ladder.winch = c(22, 28, 26),
engine.phe = c(2, 1, 0), subm.gear.box = c(4, 4, 10),
aux.engine = c(2, 3, 1), pipeline.maintain = c(4, 5, 1),
pipeline = c(5, 5, 2), pipe.line = c(6, 8, 2), engine.mpd = c(7, 8, 19)),
.Names = c("Engine", "General", "Ladder.winch", "engine.phe",
"subm.gear.box", "aux.engine", "pipeline.maintain",
"pipeline", "pipe.line", "engine.mpd"),
row.names = c(NA, -3L), class = "data.frame")
with(dtf, data.frame(Engine=Engine+engine.phe+aux.engine+engine.mpd,
Pipeline=pipeline.maintain+pipeline+pipe.line,
General=General+Ladder.winch+subm.gear.box))
# Engine Pipeline General
# 1 12 15 38
# 2 12 18 35
# 3 21 5 51
# a more generalized and 'greppy' solution
cnames <- tolower(colnames(dtf))
data.frame(Engine=rowSums(dtf[, grep("eng", cnames)]),
Pipeline=rowSums(dtf[, grep("pip", cnames)]),
General=rowSums(dtf[, !grepl("eng|pip", cnames)]))
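One caveat of the grep version: if a pattern matches exactly one column, dtf[, idx] drops to a plain vector and rowSums() fails. A small sketch that guards against this with drop = FALSE follows; the helper name sum_matching and its invert argument are hypothetical.

```r
dtf <- data.frame(Engine = c(1, 0, 1), General = c(12, 3, 15),
                  Ladder.winch = c(22, 28, 26), engine.phe = c(2, 1, 0),
                  subm.gear.box = c(4, 4, 10), aux.engine = c(2, 3, 1),
                  pipeline.maintain = c(4, 5, 1), pipeline = c(5, 5, 2),
                  pipe.line = c(6, 8, 2), engine.mpd = c(7, 8, 19))

# Hypothetical helper: row sums over the columns whose names match `pattern`
# (or, with invert = TRUE, over the columns that do not match).
sum_matching <- function(data, pattern, invert = FALSE) {
  hit <- grepl(pattern, names(data), ignore.case = TRUE)
  if (invert) hit <- !hit
  # drop = FALSE keeps a data frame even when only one column matches.
  rowSums(data[, hit, drop = FALSE])
}

combined <- data.frame(Engine   = sum_matching(dtf, "eng"),
                       Pipeline = sum_matching(dtf, "pip"),
                       General  = sum_matching(dtf, "eng|pip", invert = TRUE))
print(combined)
```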
Here is an option that extracts the relevant words from the column names and uses tapply to get the sums. str_extract_all returns a list ('lst'). Replace the elements that have zero length with 'GENERAL'. Then unlist the dataset and sum it with tapply, grouping by the replicated 'lst' (one label per column) and the row index of 'df1'.
library(stringr)
lst <- str_extract_all(toupper(sub("(pipe)\\.", "\\1", names(df1))),
"ENGINE|PIPELINE|GENERAL")
lst[lengths(lst)==0] <- "GENERAL"
t(tapply(unlist(df1), list(unlist(lst)[col(df1)], row(df1)), FUN = sum))
# ENGINE GENERAL PIPELINE
#1 12 38 15
It is usually better to store your data in long format. Therefore, my proposal would be to approach your problem as follows:
1 - get your data in long format
library(reshape2)
dfl <- melt(df)
2 - create 'engine' and 'pipeline'-vectors
e_vec <- c("Engine","engine.phe","aux.engine","engine.mpd")
p_vec <- c("pipeline.maintain","pipeline","pipe.line")
3 - create a category column
dfl$newcat <- c("general","engine","pipeline")[1 + dfl$variable %in% e_vec + 2*(dfl$variable %in% p_vec)]
The result:
> dfl
variable value newcat
1 Engine 1 engine
2 General 12 general
3 Ladder.winch 22 general
4 engine.phe 2 engine
5 subm.gear.box 4 general
6 aux.engine 2 engine
7 pipeline.maintain 4 pipeline
8 pipeline 5 pipeline
9 pipe.line 6 pipeline
10 engine.mpd 7 engine
Now you can use aggregate to get the final result:
> aggregate(value ~ newcat, dfl, sum)
newcat value
1 engine 12
2 general 38
3 pipeline 15
myfactors = ifelse(grepl("engine", names(df), ignore.case = TRUE), "Engine",
ifelse(grepl("pipe|pipeline", names(df), ignore.case = TRUE), "Pipeline",
"General"))
data.frame(lapply(split.default(df, myfactors), rowSums))
# Engine General Pipeline
#1 12 38 15
#2 12 35 18
#3 21 51 5
df is the data from this answer

How to reorder columns of a data.frame with na condition? [duplicate]

This is possibly a simple question, but I do not know how to order columns alphabetically.
test = data.frame(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
# C A B
# 1 0 4 1
# 2 2 2 3
# 3 4 4 8
# 4 7 7 3
# 5 8 8 2
I like to order the columns by column names alphabetically, to achieve
# A B C
# 1 4 1 0
# 2 2 3 2
# 3 4 8 4
# 4 7 3 7
# 5 8 2 8
For others I want my own defined order:
# B A C
# 1 1 4 0
# 2 3 2 2
# 3 8 4 4
# 4 3 7 7
# 5 2 8 8
Please note that my datasets are huge, with 10000 variables. So the process needs to be more automated.
You can use order on the names, and use that to order the columns when subsetting:
test[ , order(names(test))]
A B C
1 4 1 0
2 2 3 2
3 4 8 4
4 7 3 7
5 8 2 8
For your own defined order, you will need to define your own mapping of the names to the ordering. This would depend on how you would like to do this, but swapping whatever function would do this with order above should give your desired output.
You may for example have a look at Order a data frame's rows according to a target vector that specifies the desired order, i.e. you can match your data frame names against a target vector containing the desired column order.
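As a minimal sketch of matching against a target vector, the column order can be taken directly from a character vector of names. The vectors target and front below are assumptions for illustration; with thousands of columns they would typically be read from a file or built programmatically.

```r
test <- data.frame(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))

# Full custom order: every column is listed in the target vector.
target <- c("B", "A", "C")
custom <- test[, match(target, names(test))]

# Partial custom order: listed columns first, the rest alphabetically.
front <- c("B")
ordered_names <- c(intersect(front, names(test)),
                   sort(setdiff(names(test), front)))
partial <- test[, ordered_names]

print(names(custom))
print(names(partial))
```

The intersect() call silently skips names in front that do not exist in the data frame, which may or may not be the desired behaviour.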
Here's the obligatory dplyr answer in case somebody wants to do this with the pipe.
test %>%
select(sort(names(.)))
test = data.frame(C=c(0,2,4, 7, 8), A=c(4,2,4, 7, 8), B=c(1, 3, 8,3,2))
With the following simple replacement, the reordering can be done by hand (but only if the data frame does not have many columns):
test <- test[, c("A", "B", "C")]
for others:
test <- test[, c("B", "A", "C")]
An alternative option is to use str_sort() from the stringr library with the argument numeric = TRUE. This correctly orders column names that include numbers, not just alphabetically:
str_sort(c("V3", "V1", "V10"), numeric = TRUE)
# [1] "V1" "V3" "V10"
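To show how this applies to a data frame rather than a bare vector, here is a small sketch. The data frame wide and its column names are made up for illustration.

```r
library(stringr)

# Made-up data frame whose column names contain numbers.
wide <- data.frame(V10 = 1:2, V3 = 3:4, V1 = 5:6)

# Plain alphabetical sort places "V10" before "V3".
alpha <- wide[, sort(names(wide))]

# Numeric-aware sort orders the names by the embedded number.
natural <- wide[, str_sort(names(wide), numeric = TRUE)]

print(names(alpha))
print(names(natural))
```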
test[,sort(names(test))]
sort on names of columns can work easily.
If you only want one or more columns in the front and don't care about the order of the rest:
require(dplyr)
test %>%
select(B, everything())
So to have a specific column come first, then the rest alphabetically, I'd propose this solution:
test[, c("myFirstColumn", sort(setdiff(names(test), "myFirstColumn")))]
Here is what I found out to achieve a similar problem with my data set.
First, do what James mentioned above, i.e.
test[ , order(names(test))]
Second, use the everything() function in dplyr to move specific columns of interest (e.g., "D", "G", "K") at the beginning of the data frame, putting the alphabetically ordered columns after those ones.
select(test, D, G, K, everything())
Similar to other syntax above but for learning - can you sort by column names?
sort(colnames(test[1:ncol(test)] ))
another option is..
mtcars %>% dplyr::select(order(names(mtcars)))
In data.table you can use the function setcolorder:
setcolorder reorders the columns of data.table, by reference, to the
new order provided.
Here a reproducible example:
library(data.table)
test = data.table(C = c(0, 2, 4, 7, 8), A = c(4, 2, 4, 7, 8), B = c(1, 3, 8, 3, 2))
setcolorder(test, c(order(names(test))))
test
#> A B C
#> 1: 4 1 0
#> 2: 2 3 2
#> 3: 4 8 4
#> 4: 7 3 7
#> 5: 8 2 8
Created on 2022-07-10 by the reprex package (v2.0.1)
