Removing points in a vector based on another

I have two vectors of data, x and y. Let's say the first one is the distance and the second is the temperature.
How can I remove from both x and y all points whose distance between two consecutive points (xi - xi-1) is lower than a constant distance 'd'?
x <- c(1, 2, 3, 8, 12)
y <- c(10, 12, 11, 9, 12)
For example, to remove points with a distance smaller than 5:
x = 1, 2 (out as 2-1 < 5), 3 (out as 3-1 < 5), 8, 12 (kept as the last point even though 12-8 < 5)
x = (1,8,12)
y = (10,9,12)

Here is one idea assuming that your first and last elements are never removed,
v1 <- setNames(x, y)[c(TRUE, (diff(x) >= 5)[-(length(x)-1)], TRUE)]
#10 9 12
# 1 8 12
#To make it a bit more clear on how the named vector is structured (still a vector)
names(v1)
#[1] "10" "9" "12" <- Note: I get 9 whereas you get 11
unname(v1)
#[1] 1 8 12
Or you can make it a function,
rm_elements <- function(x, y, n){
v1 <- setNames(x, y)[c(TRUE, (diff(x) >= n)[-(length(x)-1)], TRUE)]
return(list(x = unname(v1), y = as.numeric(names(v1))))
}
rm_elements(x, y, 5)
#$x
#[1] 1 8 12
#$y
#[1] 10 9 12
EDIT: To accommodate your comment about having them in a data frame, we can alter the function a bit to accept a data frame (no matter how you name the variables) and return a subset of that data frame, i.e.
rm_elements <- function(df, n){
v1 <- df[c(TRUE, (diff(df[[1]]) >= n)[-(nrow(df)-1)], TRUE),]
return(v1)
}
#Make a data frame from the vectors,
d1 <- data.frame(x=x, y=y)
rm_elements(d1, 5)
which gives,
   x  y
1  1 10
4  8  9
5 12 12
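
The question's example drops 3 because 3-1 < 5, which compares each point with the last point that was kept rather than with its immediate neighbour. If that is the intended rule, a loop-based variant may be closer to it. This is a minimal sketch under that assumption; the helper name keep_by_last_kept is hypothetical, and the first and last points are always kept.

```r
# Hypothetical helper: keep a point only if it lies at least `d` beyond the
# last point that was kept. The first and last points are always kept.
keep_by_last_kept <- function(x, d) {
  if (length(x) == 0) return(logical(0))
  keep <- logical(length(x))
  keep[1] <- TRUE
  last <- x[1]
  for (i in seq_along(x)[-1]) {
    if (x[i] - last >= d) {
      keep[i] <- TRUE
      last <- x[i]
    }
  }
  keep[length(x)] <- TRUE  # assumption: the last point is never removed
  keep
}

x <- c(1, 2, 3, 8, 12)
y <- c(10, 12, 11, 9, 12)
idx <- keep_by_last_kept(x, 5)
x[idx]  # 1 8 12
y[idx]  # 10 9 12
```

Using one logical index for both vectors keeps x and y aligned without storing y in the names of x.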

Related

Partitioning Data creates unexpected results

I am trying to partition my data to a 60% Training and 40% Test Set using the following code.
split <- sample.split(divdat, SplitRatio = 0.6)
split
train.div <- subset(divdat, split == "TRUE")
test.div <- subset(divdat, split == "FALSE")
However, this code splits my data as if it were 50/50. I have two hundred observations and I get 100 observations in each set. Any ideas what I am doing wrong here?

The function sample.split splits not by row, but by labels. To make it work, change the first argument of sample.split to the column where you store the labels. Then you'll observe a 60/40 ratio of training/test sets, i.e.
library(caTools)
divdat <- data.frame(id = 1:10, chars = letters[1:10], labels = c("X", "Y"))
split <- sample.split(divdat$labels, SplitRatio = 0.6)
train.div <- subset(divdat, split)   # rows where split is TRUE
test.div <- subset(divdat, !split)   # rows where split is FALSE
train.div
test.div
Output:
> train.div
   id chars labels
2   2     b      Y
3   3     c      X
5   5     e      X
6   6     f      Y
9   9     i      X
10 10     j      Y
> test.div
  id chars labels
1  1     a      X
4  4     d      Y
7  7     g      X
8  8     h      Y
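
If the split does not need to preserve label proportions, a plain row-wise split with base R's sample() is another option. This is a sketch of that different technique, not of sample.split itself; the example data and the names n_train and train_idx are made up for illustration.

```r
set.seed(123)  # only for reproducibility of the illustration

# Made-up data with 200 observations, as in the question
divdat <- data.frame(id = 1:200, value = rnorm(200))

# Draw 60% of the row numbers at random, without replacement
n_train <- round(0.6 * nrow(divdat))
train_idx <- sample(nrow(divdat), n_train)

train.div <- divdat[train_idx, ]
test.div <- divdat[-train_idx, ]

nrow(train.div)  # 120
nrow(test.div)   # 80
```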

r - find maximum length "chain" of numerically increasing pairs of numbers

I have a two column dataframe of number pairs:
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
> dfPairs
   ODD EVEN
1    1   10
2    1    8
3    1    2
4    3    2
5    3    6
6    3    4
7    5    2
8    7    6
9    7    8
10   9    4
11   9    8
Each row of this dataframe is a pair of numbers, and I would like to find the longest possible numerically increasing combination of pairs. Conceptually, this is analogous to making a chain of linked number pairs, with the added conditions that 1) links can only be formed using the same number and 2) the final chain must increase numerically.
For instance, row three is pair (1,2), which increases left to right. The next link in the chain would need to have a 2 in the EVEN column and increase right to left, such as row four (3,2). Then the pattern repeats, so the next link would need to have a 3 in the ODD column, and increase left to right, such as rows 5 or 6. The chain doesn't have to start at 1, or end at 9 - this was simply a convenient example.
If you try to make all possible linked pairs, you will find that many unique chains of various lengths are possible. I would like to find the longest possible chain. In my real data, I will likely encounter a situation in which more than one chain ties for the longest, in which case I would like all of these returned.
The final result should return the longest possible chain that meets these requirements as a dataframe, or a list of dataframes if more than one solution is possible, containing only the rows in the chain.
Thanks in advance. This one has been perplexing me all morning.

Edited to deal with df that does not start at 1 and returns maximum chains rather than chain lengths
Take advantage of graph data structure using igraph
Your data, dfPairs
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
New data, dfTest
ODD <- c(3,3,3,5,7,7,9,9)
EVEN <- c(2,6,4,2,6,8,4,8)
dfTest <- data.frame(ODD, EVEN)
Make graph of your data. A key to my solution is to rbind the reverse (rev(dfPairs)) of the data frame to the original data frame. This will allow for building directional edges from odd numbers to even numbers. Graphs can be used to construct directional paths fairly easily.
library(igraph)
library(dplyr)
GPairs <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1))
GTest <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfTest, c("X1", "X2")), setNames(rev(dfTest), c("X1", "X2"))), X1))
Here's the first three elements of all_simple_paths(GPairs, 1) (starting at 1)
[[1]]
+ 2/10 vertices, named, from f8e4f01:
[1] 1 2
[[2]]
+ 3/10 vertices, named, from f8e4f01:
[1] 1 2 3
[[3]]
+ 4/10 vertices, named, from f8e4f01:
[1] 1 2 3 4
I create a function to 1) convert all simple paths to list of numeric vectors, 2) filter each numeric vector for only elements that satisfy left->right increasing, and 3) return the maximum chain of left->right increasing numeric vector
max_chain_only_increasing <- function(gpath) {
list_vec <- lapply(gpath, function(v) as.numeric(names(unclass(v)))) # convert to list of numeric vector
only_increasing <- lapply(list_vec, function(v) v[1:min(which(v >= dplyr::lead(v, default=tail(v, 1))))]) # subset vector for only elements that are left->right increasing
return(unique(only_increasing[lengths(only_increasing) == max(lengths(only_increasing))])) # return the longest chain(s)
}
This is the output of the above function using all paths that start from 1
max_chain_only_increasing(all_simple_paths(GPairs, 1))
# [[1]]
# [1] 1 2 3 6 7 8 9
Now, I'll output (header) of max chains starting with each unique element in dfPairs, your original data
start_vals <- sort(unique(unlist(dfPairs)))
# [1] 1 2 3 4 5 6 7 8 9 10
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GPairs, i)))
names(max_chains) <- start_vals
# $`1`
# [1] 1 2 3 6 7 8 9
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# etc
And finally with dfTest, the newer data
start_vals <- sort(unique(unlist(dfTest)))
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GTest, i)))
names(max_chains) <- start_vals
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# $`6`
# [1] 6 7 8 9
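
Because every usable link goes from a smaller number to a larger one, the links form a directed acyclic graph, so the longest chains can also be found without enumerating all simple paths. The following is a minimal base R sketch of that alternative (a dynamic-programming pass over the nodes, not the igraph approach above). The names edges, longest_from and longest are illustrative, and the result is a list of number sequences rather than rows of dfPairs.

```r
ODD <- c(1, 1, 1, 3, 3, 3, 5, 7, 7, 9, 9)
EVEN <- c(10, 8, 2, 2, 6, 4, 2, 6, 8, 4, 8)
dfPairs <- data.frame(ODD, EVEN)

# Stack each pair in both directions, then keep only the increasing direction
edges <- rbind(setNames(dfPairs, c("X1", "X2")),
               setNames(dfPairs[, 2:1], c("X1", "X2")))
edges <- edges[edges$X2 > edges$X1, ]

nodes <- sort(unique(c(edges$X1, edges$X2)))

# longest_from[["v"]] holds every longest chain that starts at node v.
# Nodes are visited from largest to smallest, so all successors of a node
# have already been solved when the node itself is reached.
longest_from <- list()
for (v in rev(nodes)) {
  nxt <- unique(edges$X2[edges$X1 == v])
  if (length(nxt) == 0) {
    chains <- list(v)
  } else {
    cand <- unlist(lapply(nxt, function(w) longest_from[[as.character(w)]]),
                   recursive = FALSE)
    cand <- lapply(cand, function(ch) c(v, ch))
    chains <- cand[lengths(cand) == max(lengths(cand))]
  }
  longest_from[[as.character(v)]] <- chains
}

# Collect the overall longest chain(s), ties included
all_chains <- unlist(longest_from, recursive = FALSE)
longest <- unname(all_chains[lengths(all_chains) == max(lengths(all_chains))])
longest
# [[1]]
# [1] 1 2 3 6 7 8 9
```

Each node is solved once from the results of its successors, which avoids the path enumeration that all_simple_paths performs.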

In spite of Cpak's efforts I ended up writing my own function to solve this. In essence, I realized I could turn the right-to-left chain links into left-to-right links by using this section of code from Cpak's answer:
output <- arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1)
To ensure the resulting chains were sequential, I deleted all decreasing links:
output$increase <- with(output, ifelse(X2>X1, "Greater", "Less"))
output <- filter(output, increase == "Greater")
output <- select(output, -increase)
I realized that if I split the dataframe output by unique values in X1, I could join each of these dataframes sequentially by joining the last column of the first dataframe to the first column of the next dataframe, which would create rows of sequentially increasing chains. The only problem I needed to resolve was the NAs in the last column of the merged dataframe. So I ended up splitting the joined dataframe after each merge, shifting the split-off rows to remove the NAs, and rbinding the result back together.
This is the actual code:
out_split <- split(output, output$X1)
df_final <- Reduce(join_shift, out_split)
The function, join_shift, is this:
join_shift <- function(dtf1,dtf2){
abcd <- full_join(dtf1, dtf2, setNames(colnames(dtf2)[1], colnames(dtf1)[ncol(dtf1)]))
abcd[is.na(abcd)]<-0
colnames(abcd)[ncol(abcd)] <- "end"
# print(abcd)
abcd_na <- filter(abcd, end==0)
# print(abcd_na)
abcd <- filter(abcd, end != 0)
abcd_na <- abcd_na[moveme(names(abcd_na), "end first")]
# print(abcd_na)
names(abcd_na) <- names(abcd)
abcd<- rbind(abcd, abcd_na)
z <- length(colnames(abcd))
colnames(abcd)<- c(paste0("X", 1:z))
# print(abcd)
return(abcd)
}
Finally, I found there were a lot of columns that contained only zeros, so I wrote this to delete them and trim the final dataframe:
df_final_trim = df_final[,colSums(df_final) > 0]
Overall I'm happy with this. I imagine it could be a little more elegant, but it works on everything I have tried, including some rather large and complicated data. It produces about 241,700 solutions from a dataset of 700 pairs.
I also used a moveme function that I found on stackoverflow (see below). I employed it to move NA values around to achieve the shift aspect of the join_shift function.
moveme <- function (invec, movecommand) {
movecommand <- lapply(strsplit(strsplit(movecommand, ";")[[1]],
",|\\s+"), function(x) x[x != ""])
movelist <- lapply(movecommand, function(x) {
Where <- x[which(x %in% c("before", "after", "first",
"last")):length(x)]
ToMove <- setdiff(x, Where)
list(ToMove, Where)
})
myVec <- invec
for (i in seq_along(movelist)) {
temp <- setdiff(myVec, movelist[[i]][[1]])
A <- movelist[[i]][[2]][1]
if (A %in% c("before", "after")) {
ba <- movelist[[i]][[2]][2]
if (A == "before") {
after <- match(ba, temp) - 1
}
else if (A == "after") {
after <- match(ba, temp)
}
}
else if (A == "first") {
after <- 0
}
else if (A == "last") {
after <- length(myVec)
}
myVec <- append(temp, values = movelist[[i]][[1]], after = after)
}
myVec
}

Loop a sequence in R (standardize and winsorize dataframe)

I'm trying to loop this sequence of steps in r for a data frame.
Here is my data:
ID Height Weight
a 100 80
b 80 90
c na 70
d 120 na
....
Here is my code so far
winsorize2 <- function(x) {
Min <- which(x == min(x))
Max <- which(x == max(x))
ord <- order(x)
x[Min] <- x[ord][length(Min)+1]
x[Max] <- x[ord][length(x)-length(Max)]
x}
df<-read.csv("data.csv")
df2 <- scale(df[,-1], center = TRUE, scale = TRUE)
id<-df$Type
full<-data.frame(id,df2)
full[is.na(full)] <- 0
full[, -1] <- sapply(full[,-1], winsorize2)
What I'm trying to do is this: standardize a data frame, then winsorize the standardized data frame using the function winsorize2, i.e. replace the most extreme values with the next most extreme value. This is then repeated 10 times. How do I write a loop for this? I'm confused because in the sequence I've already replaced the NAs with 0s, so should I remove this step from the loop too?
Edit: After discussion with #ekstroem, we decided to change the code to introduce the boundaries
df<-read.csv("data.csv")
id<-df$Type
df2<- scale(df[,-1], center = TRUE, scale = TRUE)
df2[is.na(df2)] <- 0
df2[df2<=-3] = -3
df2[df2>=3] = 3
df3<-df2 #trying to loop again
df3<- scale(df3, center = TRUE, scale = TRUE)
df3[is.na(df3)] <- 0
df3[df3<=-3] = -3
df3[df3>=3] = 3

There are some boundary issues that are not fully specified in your code, but maybe the following can be used (using base R and not super efficient)
wins2 <- function(x, n=1) {
xx <- sort(unique(x))
x[x<=xx[n]] <- xx[n+1]
x[x>=xx[length(xx)-n]] <- xx[length(xx)-n]
x
}
This yields:
x <- 1:11
wins2(x,1)
[1] 2 2 3 4 5 6 7 8 9 10 10
wins2(x,3)
[1] 4 4 4 4 5 6 7 8 8 8 8
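
For the looping part of the question, one possible sketch wraps the standardize, zero-fill and clamp steps from the edited question in a for loop. The function name standardize_clamp and its arguments times and bound are hypothetical, the demo data is made up, and the ID column is assumed to have been dropped already. Note that values set to 0 in one pass are treated as ordinary data in the next pass, so whether the zero-fill belongs inside the loop depends on what you want those cells to mean.

```r
# Hypothetical helper: repeat "standardize, zero-fill NAs, clamp" `times` times
standardize_clamp <- function(m, times = 10, bound = 3) {
  m <- as.matrix(m)
  for (i in seq_len(times)) {
    m <- scale(m, center = TRUE, scale = TRUE)  # column-wise z-scores
    m[is.na(m)] <- 0                            # NAs become the column mean (0)
    m[m < -bound] <- -bound                     # clamp the lower tail
    m[m > bound] <- bound                       # clamp the upper tail
  }
  m
}

# Made-up data in the shape of the question's example
demo <- data.frame(Height = c(100, 80, NA, 120, 95, 300),
                   Weight = c(80, 90, 70, NA, 85, 60))
res <- standardize_clamp(demo, times = 10)
range(res)  # always within [-3, 3]
```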

Replicate variable based off match of two other variables in R

I've got a seemingly simple question that I can't answer: I've got three vectors:
x <- c(1,2,3,4)
weight <- c(5,6,7,8)
y <- c(1,1,1,2,2,2)
I want to create a new vector that replicates the values of weight for each time an element in x matches y such that it produces the following new weight vector associated with y:
y_weight <- c(5,5,5,6,6,6)
Any thoughts on how to do this (either loop or vectorized)? Thanks

You want the match function.
match(y, x)
This returns the indices of the matches; then use those to build your new weight vector:
weight[match(y, x)]
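
Spelled out on the vectors from the question, the lookup works in two steps. The intermediate name idx is only for illustration.

```r
x <- c(1, 2, 3, 4)
weight <- c(5, 6, 7, 8)
y <- c(1, 1, 1, 2, 2, 2)

idx <- match(y, x)        # position in x of each element of y: 1 1 1 2 2 2
y_weight <- weight[idx]   # 5 5 5 6 6 6

# Values of y that do not occur in x yield NA
weight[match(c(1, 9), x)] # 5 NA
```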

#Using plyr
library(plyr)
df<-as.data.frame(cbind(x,weight)) # converting to dataframe
df<-rename(df,c(x="y")) # rename x as y for joining dataframes
y<-as.data.frame(y) # converting to dataframe
mydata <- join(df, y, by = "y",type="right")
> mydata
  y weight
1 1      5
2 1      5
3 1      5
4 2      6
5 2      6
6 2      6

Apply over two data frames

I'm using R, and I have two data.frames, A and B. They both have 6 rows, but A has 25000 columns (genes), and B has 30 columns. I'd like to apply a function with two arguments f(x,y) where x is every column of A and y is every column of B. So far it looks like this:
i = 1
for (x in A){
j = 1
for (y in B){
out[i,j] <- f(x,y)
j = j + 1
}
i = i + 1
}
I have two issues with this: from my Python programming I associate keeping track of counters like this as crufty, and from my R programming I am nervous of for loops. However, I can't quite see how to apply apply (or even if I should apply apply) to this problem and was hoping someone might enlighten me. I need to treat f() as atomic (it's actually cor.test()) for now.

Since you are using data frames, it might be faster to use lapply or sapply to do this (especially given the size of your data frames). For example,
x <- data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8), col3=c(9,10,11,12))
y <- data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8))
bl <- lapply(x, function(u){
lapply(y, function(v){
f(u,v) # Function with column from x and column from y as inputs
})
})
out <- matrix(unlist(bl), ncol=ncol(y), byrow=TRUE)
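
A variant of the nested idea uses sapply so the result comes back as a matrix directly, with rows for the columns of A and columns for the columns of B. This is a sketch on made-up data; cor_est is a hypothetical wrapper that keeps only the estimate from cor.test().

```r
set.seed(42)  # only for reproducibility of the illustration
A <- data.frame(a = runif(6), b = runif(6), c = runif(6))
B <- data.frame(z = rnorm(6), y = rnorm(6))

# Hypothetical wrapper: return just the correlation estimate as a plain number
cor_est <- function(u, v) unname(cor.test(u, v)$estimate)

# The inner sapply runs over the columns of A, the outer one over the columns of B
out <- sapply(B, function(v) sapply(A, function(u) cor_est(u, v)))

dim(out)       # 3 2
dimnames(out)  # rows a, b, c; columns z, y
```

The row and column names of out come from the data frames, so no counters are needed to locate a given pair.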

Some data
nrows <- 6
A <- data.frame(a = runif(nrows), b = runif(nrows), c = runif(nrows))
B <- data.frame(z = rnorm(nrows), y = rnorm(nrows))
The trick: remember columns with expand.grid
counter <- expand.grid(seq_along(A), seq_along(B))
f <- function(x)
{
cor.test(A[, x["Var1"]], B[, x["Var2"]])$estimate
}
Now we only need 1 call to apply.
stats <- apply(counter, 1, f)
names(stats) <- paste(names(A)[counter$Var1], names(B)[counter$Var2], sep = ",")
stats

Nesting the apply calls works, though the syntax is not the easiest.
x<-data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8), col3=c(9,10,11,12))
y<-data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8))
z<-apply(x,2,function(col,df2)
{
apply(df2,2,function(col2,col1)
{
col2+col1
},col)
},y)
z
     col1 col2 col3
[1,]    2    6   10
[2,]    4    8   12
[3,]    6   10   14
[4,]    8   12   16
[5,]    6   10   14
[6,]    8   12   16
[7,]   10   14   18
[8,]   12   16   20