Store every value in a sequence except some values - r

If I do the following to a string of letters:
x <- 'broke'
y <- nchar(x)
z <- sequence(y)
How do I store every value of the z that isn't the first, last, or middle values of the sequence.
In this example if z is 1 2 3 4 5 then the desired output would be 2 4
in the case of 1 2 3 4 nothing would be stored however, In the case of say 1 2 3 4 5 6 , 2 and 5 would be stored and so on

if (length(z) %% 2) {
z[-c(1, ceiling(length(z)/2), length(z))]
} else
z[-c(1, c(1,0) + floor(length(z)/2), length(z))]

Related

Shifting positions of values in a single column

This is my first question, so please let me know if I made any mistakes in the ask.
I am trying to create a dataframe which has multiple columns all containing the same values in the same order, but shifted in position. Where the first value from each column is moved to the end, and everything else is shifted up.
For example, I would like to convert a data frame like this:
example = data.frame(x=c(1,2,3,4), y=c(1,2,3,4), z=c(1,2,3,4), w=c(1,2,3,4)
Which looks like this
x y z w
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
into this:
x y z w
1 2 3 4
2 3 4 1
3 4 1 2
4 1 2 3
In the new dataframe, the "peak" or # 4 has moved progressively up in rows.
I've seen advice on how to shift columns up and down, but just replacing the remaining values with zeroes or NA. But I don't know how to shift the column up and replace the bottom-most value with what was formerly at the top.
Thanks in advance for any help.
In base R, we can update with Map by removing the sequence of elements while appending values from the end
example[-1] <- Map(function(x, y) c(tail(x, -y),
head(x, y)), example[-1], head(seq_along(example), -1))
example
# x y z w
#1 1 2 3 4
#2 2 3 4 1
#3 3 4 1 2
#4 4 1 2 3
Or another option is embed
example[] <- embed(unlist(example), 4)[1:4, 4:1]

how does the order function break ties in R?

I've read most of the similar questions here, but I'm still having a hard time understanding how passing arguments in the order function break ties.
The example introduced in the R documentation shows that :
order(x <- c(1,1,3:1,1:4,3), y <- c(9,9:1), z <- c(2,1:9))
returns
[1] 6 5 2 1 7 4 10 8 3 9
However, what does it mean when y is 'breaking ties' of x, and z 'breaking ties' of y? the x vector is:
[1] 1 1 3 2 1 1 2 3 4 3
and the y vector is:
[1] 9 9 8 7 6 5 4 3 2 1
Also, if I eliminate z from the first function,
order(x <- c(1,1,3:1,1:4,3), y <- c(9,9:1))
it returns :
[1] 6 5 1 2 7 4 10 8 3 9
so I'm unclear how the numbers in the y vector are relevant with ordering the four 1s, the two 2s, and the three 3s in x. I would very much appreciate the help. Thanks!
Let's take a look at
idx <- order(x <- c(1,1,3:1,1:4,3), y <- c(9,9:1), z <- c(2,1:9))
idx;
#[1] 6 5 2 1 7 4 10 8 3 9
First thing to note is that
x[idx]
# [1] 1 1 1 1 2 2 3 3 3 4
So idx orders entries in x from smallest to largest values.
Values in y and z affect how order treats ties in x.
Take entries x[5] = 1 and x[6] = 1. Since there is a tie here, order looks up entries at the corresponding positions in y, i.e. y[5] = 6 and y[6] = 5. Since y[6] < y[5], the entries in x are sorted x[6] < x[5].
If there is a tie in y as well, order will look up entries in the next vector z. This happens for x[1] = 1 and x[2] = 2, where both y[1] = 9 and y[2] = 9. Here z breaks the tie because z[2] = 1 < z[1] = 2 and therefore x[2] < x[1].

remove duplicate row based only of previous row

I'm trying to remove duplicate rows from a data frame, based only on the previous row. The duplicate and unique functions will remove all duplicates, leaving you only with unique rows, which is not what I want.
I've illustrated the problem here with a loop. I need to vectorize this because my actual data set is much to large to use a loop on.
x <- c(1,1,1,1,3,3,3,4)
y <- c(1,1,1,1,3,3,3,4)
z <- c(1,2,1,1,3,2,2,4)
xy <- data.frame(x,y,z)
xy
x y z
1 1 1 1
2 1 1 2
3 1 1 1
4 1 1 1 #this should be removed
5 3 3 3
6 3 3 2
7 3 3 2 #this should be removed
8 4 4 4
# loop that produces desired output
toRemove <- NULL
for (i in 2:nrow(xy)){
test <- as.vector(xy[i,] == xy[i-1,])
if (!(FALSE %in% test)){
toRemove <- c(toRemove, i) #build a vector of rows to remove
}
}
xy[-toRemove,] #exclude rows
x y z
1 1 1 1
2 1 1 2
3 1 1 1
5 3 3 3
6 3 3 2
8 4 4 4
I've tried using dplyr's lag function, but it only works on single columns, when I try to run it over all 3 columns it doesn't work.
ifelse(xy[,1:3] == lag(xy[,1:3],1), NA, xy[,1:3])
Any advice on how to accomplish this?
Looks like we want to remove if the row is same as above:
# make an index, if cols not same as above
ix <- c(TRUE, rowSums(tail(xy, -1) == head(xy, -1)) != ncol(xy))
# filter
xy[ix, ]
Why don't you just iterate the list while keeping track of the previous row to compare it to the next row?
If this is true at some point: remember that row position and remove it from the list then start iterating from the beginning of the list.
Don't delete row while iterating because you will get concurrent modification error.

R - Using for loop to conditionally change values in a dataframe

All of the variables are on the same scale in the data.frame 1-5.
Example of data.frame
rpi_invert
A B C D
5 2 4 1
3 5 5 2
1 1 3 4
For all values that equal 5 I would like to change it to 1.
for 4 change to 2.
for 2 change to 4.
for 1 change to 5.
Example of data.frame after values have been changed.
rpi_invert
A B C D
1 4 2 5
3 1 1 4
5 5 3 2
What I have tired.
for(b in colnames(rpi_invert)){
rpi_invert[[b]][rpi_invert[[b]] == 5] <- 1
rpi_invert[[b]][rpi_invert[[b]] == 4] <- 2
rpi_invert[[b]][rpi_invert[[b]] == 2] <- 4
rpi_invert[[b]][rpi_invert[[b]] == 1] <- 5
}
This will only change the values in the first row and not the second column.
for(b in colnames(rpi_invert)){
rpi_invert <- ifelse(rpi_invert[[b]] == 5,1,
ifelse(rpi_invert[[b]] == 4,2,
ifelse(rpi_invert[[b]] == 2,4,
ifelse(rpi_invert[[b]] == 1,5,rpi_invert[[b]]))))
}
But this gives me the error:
Error in rpi_invert[[b]] : subscript out of bounds
If I try to the same methods for an individual column instead of looping through the data.frame then both methods work so I am not sure what is the problem.
I am sure what I am trying to do can be done more efficiently without a for loop probably with some type of apply function but I am not sure how.
Any help will be appreciated please let me know if further information is needed.
You can try (if your data.frame is df):
3-(df-3)
# A B C D
#1 1 4 2 5
#2 3 1 1 4
#3 5 5 3 2
or, same but written a bit differently: 6-df

Find unique set of strings in vector where vector elements can be multiple strings

I have a series of batch records that are labeled sequentially. Sometimes batches overlap.
x <- c("1","1","1/2","2","3","4","5/4","5")
> data.frame(x)
x
1 1
2 1
3 1/2
4 2
5 3
6 4
7 5/4
8 5
I want to find the set of batches that are not overlapping and label those periods. Batch "1/2" includes both "1" and "2" so it is not unique. When batch = "3" that is not contained in any previous batches, so it starts a new period. I'm having difficulty dealing with the combined batches, otherwise this would be straightforward. The result of this would be:
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
My experience is in more functional programming paradigms, so I know the way I did this is very un-R. I'm looking for the way to do this in R that is clean and simple. Any help is appreciated.
Here's my un-R code that works, but is super clunky and not extensible.
x <- c("1","1","1/2","2","3","4","5/4","5")
p <- 1 #period number
temp <- NULL #temp variable for storing cases of x (batches)
temp[1] <- x[1]
period <- NULL
rl <- 0 #length to repeat period
for (i in 1:length(x)){
#check for "/", split and add to temp
if (grepl("/", x[i])){
z <- strsplit(x[i], "/") #split character
z <- unlist(z) #convert to vector
temp <- c(temp, z, x[i]) #add to temp vector for comparison
}
#check if x in temp
if(x[i] %in% temp){
temp <- append(temp, x[i]) #add to search vector
rl <- rl + 1 #increase length
} else {
period <- append(period, rep(p, rl)) #add to period vector
p <- p + 1 #increase period count
temp <- NULL #reset
rl <- 1 #reset
}
}
#add last batch
rl <- length(x) - length(period)
period <- append(period, rep(p,rl))
df <- data.frame(x,period)
> df
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
R has functional paradigm influences, so you can solve this with Map and Reduce. Note that this solution follows your approach in unioning seen values. A simpler approach is possible if you assume batch numbers are consecutive, as they are in your example.
x <- c("1","1","1/2","2","3","4","5/4","5")
s<-strsplit(x,"/")
r<-Reduce(union,s,init=list(),acc=TRUE)
p<-cumsum(Map(function(x,y) length(intersect(x,y))==0,s,r[-length(r)]))
data.frame(x,period=p)
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
What this does is first calculate a cumulative union of seen values. Then, it maps across this to determine the places where none of the current values have been seen before. (Alternatively, this second step could be included within the reduce, but this would be wordier without support for destructuring.) The cumulative sum provides the "period" numbers based on the number of times the intersections have come up empty.
If you do make the assumption that the batch numbers are consecutive then you can do the following instead
x <- c("1","1","1/2","2","3","4","5/4","5")
s<-strsplit(x,"/")
n<-mapply(function(x) range(as.numeric(x)),s)
p<-cumsum(c(1,n[1,-1]>n[2,-ncol(n)]))
data.frame(x,period=p)
For the same result (not repeated here).
A little bit shorter:
x <- c("1","1","1/2","2","3","4","5/4","5")
x<-data.frame(x=x, period=-1, stringsAsFactors = F)
period=0
prevBatch=-1
for (i in 1:nrow(x))
{
spl=unlist(strsplit(x$x[i], "/"))
currentBatch=min(spl)
if (currentBatch<prevBatch) { stop("Error in sequence") }
if (currentBatch>prevBatch)
period=period+1;
x$period[i]=period;
prevBatch=max(spl)
}
x
Here's a twist on the original that uses tidyr to split the data into two columns so it's easier to use:
# sample data
x <- c("1","1","1/2","2","3","4","5/4","5")
df <- data.frame(x)
library(tidyr)
# separate x into two columns, with second NA if only one number
df <- separate(df, x, c('x1', 'x2'), sep = '/', remove = FALSE, convert = TRUE)
Now df looks like:
> df
x x1 x2
1 1 1 NA
2 1 1 NA
3 1/2 1 2
4 2 2 NA
5 3 3 NA
6 4 4 NA
7 5/4 5 4
8 5 5 NA
Now the loop can be a lot simpler:
period <- 1
for(i in 1:nrow(df)){
period <- c(period,
# test if either x1 or x2 of row i are in any x1 or x2 above it
ifelse(any(df[i, 2:3] %in% unlist(df[1:(i-1),2:3])),
period[i], # if so, repeat the terminal value
period[i] + 1)) # else append the terminal value + 1
}
# rebuild df with x and period, which loses its extra initializing value here
df <- data.frame(x = df$x, period = period[2:length(period)])
The resulting df:
> df
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3

Resources