How can I create a Start/Stop condition within a subset? - r

I'm fairly new to R, so I apologize in advance if I don't choose the best words to explain my issue.
My problem is that I'd like to create a subset of a dataset (old) which has several columns. So far no problem...
My subset should start when the value (x) in one of the columns reaches its highest point, and stop right after x has dropped down again to its lowest point.
Then I'd like to create a new dataset (new) from this subset of the data (old).
As there are multiple positions in my original dataset (old) where the value x behaves as described above, I'd like to have a new dataset (new1, new2, new...) for every subset I create.
I hope I was clear in what I'd like to say. If more information is needed, I'm happy to provide it.
Thanks a lot for your help.

If for instance you have
x <- c(5,4,3,2,1,2,3,4,5,4,3,2,1,2,3,2,1)
Then
direction <- sign(diff(x))
will give a series of +1s and -1s indicating whether x is on an upward or downward swing. We're only interested in downward swings, so let's label upward points with NA, and downward points in the nth swing with the number n:
run <- rle(direction)
run$values[run$values==1] <- NA
run$values[!is.na(run$values)] <- 1:sum(!is.na(run$values))
Now it seems you want to include the last point in a run of downward points (at that point the sign is already positive, because the next point is higher). So we need to extend each downward run by one, and shorten each upward run by one:
run$lengths <- run$lengths + ifelse(is.na(run$values), -1, +1)
swing <- inverse.rle(run)
plot(x, col=swing)
should colour downward runs in different colours, and omit upward runs. You've now got a variable that labels the runs, and you can split your data.frame by
split(myDataFrame, swing)
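You will need to check what happens if the series starts or finishes on an upward swing: the adjusted run lengths add up to length(x) only when x both starts and ends on a downward swing, as it does here.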
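If the series may start or end on an upward swing, one alternative is to label each point directly from the drops on either side of it, instead of adjusting run lengths. This is only a sketch: label_down_swings is a hypothetical helper name, the data frame old is made up for illustration, and plateaus (equal neighbours) are assumed to end a swing.

```r
# Hypothetical helper: label each point with the number of the downward
# swing it belongs to (peak through trough), or NA if it is in no swing.
label_down_swings <- function(x) {
  drop <- diff(x) < 0                 # TRUE where the next value is lower
  down_next <- c(drop, FALSE)         # point is followed by a drop
  down_prev <- c(FALSE, drop)         # point is preceded by a drop
  in_down <- down_next | down_prev
  id <- cumsum(in_down & !down_prev)  # a new swing starts at each peak
  ifelse(in_down, id, NA)
}

x <- c(5,4,3,2,1,2,3,4,5,4,3,2,1,2,3,2,1)
old <- data.frame(x = x, t = seq_along(x))      # made-up example data
pieces <- split(old, label_down_swings(old$x))  # NA-labelled rows are dropped
names(pieces) <- paste0("new", seq_along(pieces))
sapply(pieces, nrow)
```

pieces$new1, pieces$new2 and so on then hold one data frame per downward swing; keeping them in a list is usually easier to work with than creating separate variables.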

Here is an option where we check when direction changes with diff, and then split along that. First, make some data:
df <- data.frame(x=rep(c(1:3, 2:1), 3))
Then:
dir.vec <- c(diff(df$x) <= 0, tail(diff(df$x) <= 0, 1)) # has drop started?
split.vec <- cumsum(c(0, diff(dir.vec)) < 0) # which drop # is this?
split(df[dir.vec, , drop = FALSE], split.vec[dir.vec]) # split drops by drop num
Original:
x
1 1
2 2
3 3
4 2
5 1
6 1
7 2
8 3
9 2
10 1
11 1
12 2
13 3
14 2
15 1
Split:
$`0`
x
3 3
4 2
5 1
$`1`
x
8 3
9 2
10 1
$`2`
x
13 3
14 2
15 1
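To see how the two helper vectors line up with the rows, they can be printed next to the data. A small sketch that repeats the definitions from this answer so it runs on its own (the intermediate name d is introduced here only for readability):

```r
df <- data.frame(x = rep(c(1:3, 2:1), 3))
d <- diff(df$x) <= 0
dir.vec <- c(d, tail(d, 1))                   # TRUE where the next value is not higher
                                              # (the last row repeats the previous flag)
split.vec <- cumsum(c(0, diff(dir.vec)) < 0)  # increments on the row after each drop ends
cbind(df, dir.vec, split.vec)
```

Because the comparison is `<= 0`, a plateau counts as part of a drop. In this data that is what keeps row 5 in the first group, since row 6 has the same value.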

Related

Backward moving average with varying moving window size to keep the output series size same as the original time series in R

I have a long time series of a variable. I want to compute a backward moving average with a window size of 20. If I keep this window size fixed, the output series will be shortened by 20 (I mean the first 20 values will be NAs), but I want the output series to have the same length as the original series, with no NAs. To achieve this, I want to vary the window size at the beginning: for the first 20 values of the original time series, the window size can be 1, 2, 3, ..., 20, respectively, and after that it stays at 20. How can this be done?
Here is the sample data and desired output with window size 3:
Days  Original_Values  Desired_Output
1     2                2
2     4                2
3     1                3
4     3                7/3
5     5                8/3
6     6                9/3
7     4                14/3
8     9                15/3
Using the input shown reproducibly in the Note at the end, use rollapplyr with offsets -1, -2, -3 and the argument partial = TRUE, which lets it use fewer than the specified offsets when fewer are available. The first element cannot be calculated because it has no prior elements, so use the fill argument to fill it in with the first original value.
library(zoo)
DF2 <- transform(DF, roll =
rollapplyr(Original, list(-(1:3)), mean, partial = TRUE, fill = Original[1]))
with(DF2, identical(Desired, roll)) # check that result matches Desired
## [1] TRUE
Note
Lines <- "
Days Original Desired
1 2 2
2 4 2
3 1 3
4 3 7/3
5 5 8/3
6 6 9/3
7 4 14/3
8 9 15/3"
DF <- read.table(text = Lines, header = TRUE)
DF <- transform(DF, Desired = sapply(Desired, function(x) eval(parse(text = x))))
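For readers who want to see what this computes without zoo, the same backward average (the mean of up to k previous values, excluding the current one) can be sketched in base R. backward_mean is a hypothetical helper, not part of any package, and filling the first element with the first original value follows the choice made in the answer.

```r
# Hypothetical helper: mean of up to k values before each position
backward_mean <- function(v, k) {
  sapply(seq_along(v), function(i) {
    if (i == 1) return(v[1])          # no earlier values: reuse the first one
    mean(v[max(1, i - k):(i - 1)])    # window grows until k values are available
  })
}

Original <- c(2, 4, 1, 3, 5, 6, 4, 9)
backward_mean(Original, 3)
```

For the window of 20 mentioned in the question, the call would be backward_mean(v, 20) on the full series v. An explicit loop like this is slower than rollapplyr on long series, so it is mainly useful for checking results.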

How can I troubleshoot the delete row function

I am attempting to delete a row like this:
data <- data[-1645,]
However, after running the code, the row is still there. I can tell because there is an outlier in that row that shows up on all my graphs, and when I view the data I can sort a column to easily find the offending outlier. I have had no trouble deleting rows in the past. Has anyone run into anything similar? I do understand the limitations of outlier removal and I don't typically remove outliers; however, for a number of reasons I would like to see what the data look like without this one (in this case, all other values in the response variable are between -1 and 0, and in this row the value is 10^4).
You really need to provide more information, but there are several ways you can troubleshoot the problem. The first one is to print out the line you are removing:
data[1645, ]
Is that the outlier? You did not tell us how you identified the outlier. If lines have been removed from the data frame, the row names are not changed but the index values are changed, e.g.
set.seed(42)
x <- sample.int(25)
y <- sample.int(25)
data <- data.frame(x, y)
head(data)
# x y
# 1 17 2
# 2 5 8
# 3 1 3
# 4 10 1
# 5 4 10
# 6 18 11
data <- data[-c(5, 10, 15, 20, 25), ]
head(data)
# x y
# 1 17 2
# 2 5 8
# 3 1 3
# 4 10 1
# 6 18 11
# 7 25 15
data[6, ]
# x y
# 7 25 15
data["6", ]
# x y
# 6 18 11
Notice that the 6th row of the data has a row name of "7" but the row with name "6" is the 5th row in the data frame because we deleted the 5th row. The which function will give you the index value, but if you identified the outlier by looking at the printout, you got the row name and that may be different from the index. If we want to remove values in x greater than 24, here is one way to do that:
data[data$x<25, ]
After playing around with the data, I think the best explanation is that the indexing is off. This is in line with what dcarlson was saying: the code could be removing the 1,645th row, which just isn't labelled as such. I think the best solution is to use subset:
data <- subset(data, Yield.Decline < 100)
This is more robust than removing a row by its position (the line can be accidentally run multiple times without erroneously removing additional rows).
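The difference between row names and positions, and the fact that a value-based filter can be re-run safely, can be checked on a tiny example. This is a sketch with made-up data; the column names id and y and the threshold 100 are assumptions for illustration only.

```r
# Made-up data: one obvious outlier in y
data <- data.frame(id = 1:6, y = c(-0.2, -0.5, 10000, -0.1, -0.9, -0.3))
data <- data[-2, ]                  # an earlier deletion shifts positions

rownames(data)[data$y > 100]        # row name printed next to the outlier
which(data$y > 100)                 # its actual position, which differs

clean <- data[data$y < 100, ]       # remove by value
again <- clean[clean$y < 100, ]     # re-running changes nothing
identical(clean, again)
```

If the column may contain NA, note that subset() drops rows where the condition is NA, whereas logical indexing with [ returns NA rows for them.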

Looping through items on a list in R

This may be a simple question, but I'm fairly new to R.
What I want to do is perform some kind of addition on the indexes of a list, such that once the index passes the maximum it goes back to the first value in the list and starts over from there.
For example:
x <-2
data <- c(0,1,2,3,4,5,6,7,8,9,10,11)
data[x]
1
data[x+12]
1
data[x+13]
3
or something functionally equivalent. In the end I want to be able to do something like
v=6
x=8
y=9
z=12
values <- c(v,x,y,z)
data <- c(0,1,2,3,4,5,6,7,8,9,10,11)
set <- c(data[values[1]],data[values[2]], data[values[3]],data[values[4]])
set
5 7 8 11
values <- values + 8
set
1 3 4 7
I've tried some things with adding and subtracting the length of my list, but it does not work well on the lower numbers.
I hope this was a clear enough explanation,
thanks in advance!
We don't need a loop here, as a vector can be indexed by another vector of any length
data[values]
#[1] 5 7 8 11
NOTE: Both objects are vectors, not lists
If the index needs to wrap around past the end of the vector, take it modulo the length
values <- values + 8
data[(values - 1) %% length(data) + 1]
#[1] 1 3 4 7
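Because the result should depend only on positions, it is worth checking the wrap-around on a vector whose elements are not numbers. A minimal sketch using 1-based modular arithmetic; wrap_index is a hypothetical helper name and the letters vector is made up for illustration.

```r
# Hypothetical helper: map any positive index onto 1..n, wrapping around
wrap_index <- function(i, n) (i - 1) %% n + 1

data <- letters[1:12]
values <- c(6, 8, 9, 12)

data[wrap_index(values, length(data))]      # indexes within range are unchanged
data[wrap_index(values + 8, length(data))]  # 14, 16, 17, 20 wrap to 2, 4, 5, 8
```

Subtracting 1 before the modulo and adding it back afterwards is what keeps an index equal to the length from wrapping to 0.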

Twofold, consecutive row selecting starting at different rows in R

I have got the following problem. I have a data.frame with an x and y column representing some points in space:
X<-c(18.25743,18.25783,18.25823,18.25850,18.25863,18.25878,
18.25885,18.25912,18.25943,18.25962,18.25978,18.26000,
18.26022,18.26051,18.26070,18.26095,18.26118,18.26140,
18.26189,18.26250,18.26310,18.26390)
Y<-c(44.69561,44.69564,44.69567,44.69567,44.69586,
44.69600,44.69637,44.69671,44.69691,44.69701,44.69720,
44.69740,44.69763,44.69774,44.69787,44.69790,44.69791,
44.69795,44.69812,44.69802,44.69812,44.69834)
eDF<-data.frame(X,Y)
Now my problem is that they are "sorted" wrongly for plotting. So what I need is a function that groups together the rows of the two points which belong together (in a list of lists):
1 and 12 is ID1
2 and 13 is ID2
3 and 14 is ID3
...
11 and 22 is ID11
Every list created this way within the list of lists should have its own unique ID (just numbered from 1 to the end). I need this because I have this problem in all my data, which have different lengths.
It would be great if the starting point of the second consecutive row selection (the 12) were flexible, always taking the first row after half of the data, i.e. (rownumber/2)+1, which is 12 in this example.
Well, I have tried some things and I think I'm on the right track, but I can't figure out a solution by myself.
This function is pretty close, but I can't manage to make it start at different rows (1 and 12):
lapply(2:nrow(eDF), function(x) eDF[(x-1):x,])
I also tried to figure it out with seq, and it would do what I need if I could make a list of lists by connecting both code samples. I also need to replace the hard-coded start and end numbers with a dynamic solution.
eDF[(seq(1,to=11,by=1)),] # selecting rows 1 to 11
eDF[(seq(12,to=nrow(eDF),by=1)),] #selecting rows 12 to end
Anyone any ideas?
I don't know if you needed an ID column inside of the new list but another way would be:
#create the IDs
eDF$ID <- rep(1:11,2)
#split the data.frame according to those
mylist <- split(eDF, eDF$ID)
Output:
mylist
$`1`
X Y ID
1 18.25743 44.69561 1
12 18.26000 44.69740 1
$`2`
X Y ID
2 18.25783 44.69564 2
13 18.26022 44.69763 2
$`3`
X Y ID
3 18.25823 44.69567 3
14 18.26051 44.69774 3
$`4`
X Y ID
4 18.2585 44.69567 4
15 18.2607 44.69787 4
#and so on...
You could simply do split(eDF, rep(1:11, 2)) if you don't need the ID column.
We can modify the OP's lapply code
lapply(1:11, function(i) eDF[c(i, i+11),])
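Since the question asks for the split point to follow the number of rows, the hard-coded 11 can be replaced by half of nrow(). A sketch under the assumption that the data frame always has an even number of rows; pair_halves is a hypothetical helper name and the small data frame is made up so the example runs on its own.

```r
# Hypothetical helper: pair row i of the first half with row i of the second half
pair_halves <- function(df) {
  half <- nrow(df) / 2
  stopifnot(half == floor(half))    # assumes an even number of rows
  lapply(seq_len(half), function(i) df[c(i, i + half), ])
}

pts <- data.frame(X = c(1, 2, 3, 11, 12, 13),
                  Y = c(5, 6, 7, 15, 16, 17))   # made-up points
pairs <- pair_halves(pts)
pairs[[2]]                          # rows 2 and 5
```

With the data from the question the call would be pair_halves(eDF), and the position in the returned list serves as the ID.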

finding set of multinomial combinations

Let's say I have a vector of integers 1:6
w=1:6
I am attempting to obtain a matrix of 90 rows and 6 columns that contains the multinomial combinations from these 6 integers taken as 3 groups of size 2.
6!/(2!*2!*2!)=90
So, columns 1 and 2 of the matrix would represent group 1, columns 3 and 4 would represent group 2 and columns 5 and 6 would represent group 3. Something like:
1 2 3 4 5 6
1 2 3 5 4 6
1 2 3 6 4 5
1 2 4 5 3 6
1 2 4 6 3 5
...
Ultimately, I would want to expand this to other multinomial combinations of limited size (because the numbers get large rather quickly) but I am having trouble getting things to work. I've found several functions that do binomial combinations (only 2 groups) but I could not locate any functions that do this when the number of groups is greater than 2.
I've tried two approaches to this:
Building up the matrix from nothing using for loops, and attempting things with the reshape package (thinking there might be something for this with melt())
Working backwards from the permutation matrix (720 rows) by attempting to retain unique rows within groups and/or removing duplicated rows within groups
Neither worked for me.
The permutation matrix can be obtained with
library(gtools)
dat=permutations(6, 6, set=TRUE, repeats.allowed=FALSE)
I think working backwards from the full permutation matrix is a bit excessive, but I'm trying anything at this point.
Is there a package with a prebuilt function for this? Does anyone have any ideas how I should proceed?
Here is how you can implement your "working backwards" approach:
gps <- list(1:2, 3:4, 5:6)
get.col <- function(x, j) x[, j]
is.ordered <- function(x) !colSums(diff(t(x)) < 0)
is.valid <- Reduce(`&`, Map(is.ordered, Map(get.col, list(dat), gps)))
dat <- dat[is.valid, ]
nrow(dat)
# [1] 90
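If generating all 720 permutations first is a concern, the same rows can also be built up directly with base R's combn: choose the first group, then the second group from what is left, and the remainder forms the third. This is a sketch for the 3 groups of size 2 case only, not a general multinomial routine; the names first, rows and res are introduced here for illustration.

```r
w <- 1:6
first <- combn(w, 2, simplify = FALSE)         # 15 choices for group 1
rows <- lapply(first, function(g1) {
  rest <- setdiff(w, g1)                       # the 4 remaining values
  second <- combn(rest, 2, simplify = FALSE)   # 6 choices for group 2
  lapply(second, function(g2) c(g1, g2, setdiff(rest, g2)))  # group 3 is what is left
})
res <- do.call(rbind, unlist(rows, recursive = FALSE))
dim(res)
head(res, 3)
```

Each row lists the groups in columns 1-2, 3-4 and 5-6, with values ascending inside each group, which is the layout described in the question.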
