With a vector, say:
a <- c(1,1,1, 2,2,2,2, 1,1, 3,3,3,3)
I need to know (1) at which index values the element values change, and (2) the element value before and after each change.
Using the above example, the output would look something like:
i before after
3 1 2
7 2 1
9 1 3
I suppose I can convert the values into a data.frame and lag/shift the column but am wondering if there is a better approach. I am also trying to avoid looping over the vector, if possible.
I also don't need the results in data.frame / spreadsheet format; feel free to propose a different output format.
Here is an option
idx <- which(a != c(a[-1], NA))
data.frame(
i = idx,
before = a[idx],
after = a[idx + 1])
# i before after
#1 3 1 2
#2 7 2 1
#3 9 1 3
You can roll everything into a function
f <- function(x) {
idx <- which(x != c(x[-1], NA))
data.frame(
i = idx,
before = x[idx],
after = x[idx + 1])
}
f(a)
giving the same output as above.
with(rle(a), data.frame(i = head(cumsum(lengths), -1),
before = head(values, -1),
after = values[-1]))
# i before after
#1 3 1 2
#2 7 2 1
#3 9 1 3
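A further variant, offered only as a sketch: because a is numeric, diff() can flag the positions where neighbouring elements differ. The helper name change_points is hypothetical (it does not come from the answers above), and the sketch assumes the vector contains no NA values.

```r
a <- c(1,1,1, 2,2,2,2, 1,1, 3,3,3,3)

# change_points is a hypothetical helper name; assumes a numeric vector without NAs
change_points <- function(x) {
  idx <- which(diff(x) != 0)   # positions i where x[i] differs from x[i + 1]
  data.frame(i = idx, before = x[idx], after = x[idx + 1])
}

res <- change_points(a)
res
```

For character or factor vectors diff() is not applicable, so the comparison-based or rle-based versions above are the safer choice there.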
I have a data frame that looks like this:
a b
1 1
2 2
3 3
4 4
5 5
I want to implement the following slicing: if the window is 2, I take the first 2 elements of column a and the last 2 elements of column b, sum them element-wise, and take the minimum of the sums (which I have already done). For the data frame above this gives
a b result
1 4 5
2 5 7
and reports 5.
But (and this is the problem) when the window equals the number of rows of the data frame, I want to take the first element of column a and the last element of column b and just sum them. In my data frame that would be 1 + 5 = 6.
My attempt is below, but I don't know how to put this if/else into the function:
library(tidyverse)
a = seq(1,5,1)
b = seq(1,5,1)
w = tibble(a,b);w
w[1,1]+w[nrow(w),2]
im = function(mat,window){
if(window == nrow(mat))
mat[1,1] + mat[nrow(mat),2]
else
SA = mat%>%
dplyr::select(1)%>%
dplyr::slice_head(n=window)
SB = mat%>%
dplyr::select(2)%>%
dplyr::slice_tail(n=window)
margin = min(SA+SB)
return(margin)
}
im(w,5)
To test it, let's say I have a vector of windows:
vec = seq(1,5,1)
I want to run the function im that I have created for every element of vec.
How can I do this in R?
Any help is appreciated.
You can do it with a bunch of if/else:
f <- function(n, data){
if(n == 1){
data.frame(a = head(data[1], n),
b = head(data[2], n)) |>
transform(result = a + b)
}
else{
if(n == nrow(data)) n <- 1
data.frame(a = head(data[1], n),
b = tail(data[2], n)) |>
transform(result = a + b)
}
}
output
lapply(vec, f, data = w)
[[1]]
a b result
1 1 1 2
[[2]]
a b result
1 1 4 5
2 2 5 7
[[3]]
a b result
1 1 3 4
2 2 4 6
3 3 5 8
[[4]]
a b result
1 1 2 3
2 2 3 5
3 3 4 7
4 4 5 9
[[5]]
a b result
1 1 5 6
You can use the ifelse function; I find it simplifies things more than an if/else statement.
Try this, it should work:
library(tidyverse)
a = seq(1,5,1)
b = seq(1,5,1)
w = tibble(a,b)
im = function(mat,window){
SA = unlist(#unlist to transform list type result of ifelse to vector
ifelse(
# here is the condition
window==nrow(mat),
# if condition is TRUE return first element
mat%>%
dplyr::select(1)%>%
dplyr::slice_head(n=1),
# if condition is FALSE return first window elements
mat%>%
dplyr::select(1)%>%
dplyr::slice_head(n=window))
)
SB = unlist(
ifelse(window==nrow(mat),
mat%>%
dplyr::select(2)%>%
dplyr::slice_tail(n=1),
mat%>%
dplyr::select(2)%>%
dplyr::slice_tail(n=window))
)
margin = min(SA+SB)
return(margin)
}
im(w,5)
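For comparison, here is a minimal base R sketch of the same logic with an explicit, braced if/else, applied to every window in vec with sapply. The name im2 is hypothetical, and the sketch assumes column 1 is a and column 2 is b, as in the question.

```r
a <- seq(1, 5, 1)
b <- seq(1, 5, 1)
w <- data.frame(a, b)

# im2 is a hypothetical name; assumes column 1 is a and column 2 is b
im2 <- function(mat, window) {
  if (window == nrow(mat)) {
    # special case: first element of a plus last element of b
    mat[[1]][1] + mat[[2]][nrow(mat)]
  } else {
    SA <- head(mat[[1]], window)  # first `window` elements of a
    SB <- tail(mat[[2]], window)  # last `window` elements of b
    min(SA + SB)                  # smallest element-wise sum
  }
}

vec <- seq(1, 5, 1)
margins <- sapply(vec, im2, mat = w)  # one margin per window size
margins
```

Because mat is passed by name, sapply hands each element of vec to the window argument. The braces matter: without them only the first statement after else belongs to the else branch.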
I'm trying to completely delete rows in a dataset for cases with matching variables (case ID) with the help of this function I wrote:
del_row_func <- function(x){
for(i in 1:length(x$FALL_ID)){
for(j in 1:length(x$FALL_ID)){
if(x$FALL_ID[i] == x$FALL_ID[j] & i != j){
x[-i, ]
}
}
}
}
Does anybody have an idea why it doesn't work?
The reason your code didn't work was that you weren't modifying or returning x. However, there is a better way to remove all rows with a duplicated ID:
dat = data.frame(FALL_ID = c(1, 2, 2, 3), y = 1:4)
dat
# FALL_ID y
# 1 1 1
# 2 2 2
# 3 2 3
# 4 3 4
dat[!duplicated(dat$FALL_ID) & !duplicated(dat$FALL_ID, fromLast = TRUE), ]
# FALL_ID y
# 1 1 1
# 4 3 4
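To make the point about returning a value concrete, here is a small sketch of a function version. In the original loop, x[-i, ] builds a new data frame that is never assigned, and the function returns nothing useful. The name drop_repeated_ids is hypothetical; the sketch counts occurrences per ID with ave() and returns the filtered rows.

```r
dat <- data.frame(FALL_ID = c(1, 2, 2, 3), y = 1:4)

# drop_repeated_ids is a hypothetical helper name
drop_repeated_ids <- function(x) {
  # number of rows sharing each row's FALL_ID
  n <- ave(seq_along(x$FALL_ID), x$FALL_ID, FUN = length)
  x[n == 1, ]  # the filtered data frame is the function's return value
}

res <- drop_repeated_ids(dat)
res
```

The caller still has to assign the result (as in res above), because R functions do not modify their arguments in place.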
I'm trying to apply a function to a data frame using ddply from the plyr package, but I'm getting some results that I don't understand. I have 3 questions about the results.
Given:
mydf<- data.frame(c(12,34,9,3,22,55),c(1,2,1,1,2,2)
, c(0,1,2,1,1,2))
colnames(mydf)[1] <- 'n'
colnames(mydf)[2] <- 'x'
colnames(mydf)[3] <- 'x1'
mydf looks like this:
n x x1
1 12 1 0
2 34 2 1
3 9 1 2
4 3 1 1
5 22 2 1
6 55 2 2
Question #1
If I do:
k <- function(x) {
mydf$z <- ifelse(x == 1, 0, mydf$n)
return (mydf)
}
mydf <- ddply(mydf, c("x") , .fun = k, .inform = TRUE)
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "z", value = structure(c(12, 34, 9, :
replacement has 3 rows, data has 6
Error: with piece 1:
n x x1
1 12 1 0
2 9 1 2
3 3 1 1
I get this error regardless of whether I specify the variable to split by as c("x"), "x", or .(x). I don't understand why I'm getting this error message.
Question #2
But, what I really want to do is set up an if/else function because my dataset has variables x1, x2, x3, and x4 and I want to take those variables into account as well. But when I try something simple such as:
j <- function(x) {
if(x == 1){
mydf$z <- 0
} else {
mydf$z <- mydf$n
}
return(mydf)
}
mydf <- ddply(mydf, x, .fun = j, .inform = TRUE)
I get:
Warning messages:
1: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
2: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
Question #3
I'm confused about when to use function() and when to use function(x). Using function() for either j() or k() gives me a different error:
Error in .fun(piece, ...) : unused argument (piece)
Error: with piece 1:
n x x1 z
1 12 1 0 12
2 9 1 2 9
3 3 1 1 3
4 12 1 0 12
5 9 1 2 9
6 3 1 1 3
7 12 1 0 12
8 9 1 2 9
9 3 1 1 3
10 12 1 0 12
11 9 1 2 9
12 3 1 1 3
where column z is not correct. Yet I see a lot of functions written as function().
I sincerely appreciate any comments that can help me out with this
There's a lot that needs explaining here. Let's start with the simplest case. In your first example, all you need is:
mydf$z <- with(mydf,ifelse(x == 1,0,n))
An equivalent ddply solution might look like this:
ddply(mydf,.(x),transform,z = ifelse(x == 1,0,n))
Probably your biggest source of confusion is that you seem to not understand what is being passed as arguments to functions within ddply.
Consider your first attempt:
k <- function(x) {
mydf$z <- ifelse(x == 1, 0, mydf$n)
return (mydf)
}
The way ddply works is that it splits mydf up into several smaller data frames, based on the values in the column x. That means that each time ddply calls k, the argument passed to k is a data frame; specifically, a subset of your primary data frame.
So within k, x is a subset of mydf, with all the columns. You should not be trying to modify mydf from within k. Modify x, and then return the modified version. (That is, if you must write your own function; the options I showed above are better.) So we might rewrite your k like this:
k <- function(x) {
x$z <- ifelse(x$x == 1, 0, x$n)
return (x)
}
Note that you've made things confusing by using x both as the argument to k and as the name of one of your columns.
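To see what "each piece is a data frame" means without plyr, here is a rough base R sketch of the same split-apply-combine idea. It only approximates what ddply does and is not plyr's implementation; the names add_z and piece are hypothetical, chosen to avoid the x/x name clash.

```r
mydf <- data.frame(n  = c(12, 34, 9, 3, 22, 55),
                   x  = c(1, 2, 1, 1, 2, 2),
                   x1 = c(0, 1, 2, 1, 1, 2))

# add_z is a hypothetical name; `piece` is one per-group subset of mydf
add_z <- function(piece) {
  piece$z <- ifelse(piece$x == 1, 0, piece$n)  # modify the piece, not mydf
  piece                                        # return the modified piece
}

pieces <- split(mydf, mydf$x)                  # one data frame per value of x
out <- do.call(rbind, lapply(pieces, add_z))   # apply to each piece, then recombine
out
```

Inside add_z, piece$x has one element per row of the piece, which is why the vectorised ifelse() is used rather than a plain if(), which expects a single TRUE or FALSE.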
Fake data for illustration:
df <- data.frame(a=c(1,2,3,4,5), b=c(2,2,2,2,NA),
c=c(NA,2,3,4,5))
This would get me the answer I want IF it weren't for the NA values:
df$count <- with(df, (a==1) + (b==2) + (c==3))
Also, would there be an even more elegant way if I was only interested in, e.g. variables==2?
df$count <- with(df, (a==2) + (b==2) + (c==2))
Many thanks!
The following works for your specific example, but I have a suspicion that your real use case is more complicated:
df$count <- apply(df,1,function(x){sum(x == 1:3,na.rm = TRUE)})
> df
a b c count
1 1 2 NA 2
2 2 2 2 1
3 3 2 3 2
4 4 2 4 1
5 5 NA 5 0
but this general approach should work. For instance, your second example would be something like this:
df$count <- apply(df,1,function(x){sum(x == 2,na.rm = TRUE)})
or more generally you could allow yourself to pass in a variable for the comparison:
df$count <- apply(df,1,function(x,compare){sum(x == compare,na.rm = TRUE)},compare = 1:3)
Another way is to subtract your target vector from each row of your data.frame, negate and then do rowSums with na.rm=TRUE:
target <- 1:3
rowSums(!(df-rep(target,each=nrow(df))),na.rm=TRUE)
[1] 2 1 2 1 0
target <- rep(2,3)
rowSums(!(df-rep(target,each=nrow(df))),na.rm=TRUE)
[1] 1 3 1 1 0
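A related sketch, under the assumption that every column is numeric and that target holds one value per column in column order: sweep() can compare each column with its target directly, which avoids the subtract-and-negate step. The names hits and count are introduced here for illustration.

```r
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c(2, 2, 2, 2, NA),
                 c = c(NA, 2, 3, 4, 5))

target <- 1:3  # one target value per column, in column order

# logical matrix: TRUE where a cell equals its column's target, NA where the cell is NA
hits <- sweep(as.matrix(df), 2, target, FUN = "==")

count <- unname(rowSums(hits, na.rm = TRUE))  # NA cells are simply not counted
count
```

For the "all columns equal 2" case, target <- rep(2, ncol(df)) should work the same way.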
I am new to R and am trying to accomplish the following task efficiently.
I have a data.frame, x, with columns: start, end, val1, val2, val3, val4. The columns are sorted/ordered by start.
For each start, I first have to find all the entries in x that share the same start. Because the data frame is ordered, they will be consecutive. If a particular start occurs only once, I ignore it. Then, for the entries that share a start, let's say there are 3 entries for one particular start, as shown below:
entries for start=10
start end val1 val2 val3 val4
10 25 8 9 0 0
10 55 15 200 4 9
10 30 4 8 0 1
Then, I have to take 2 rows at a time and perform a fisher.test on the 2x4 matrices of val1:4. That is,
row1:row2 => fisher.test(matrix(c(8,15,9,200,0,4,0,9), nrow=2))
row1:row3 => fisher.test(matrix(c(8,4,9,8,0,0,0,1), nrow=2))
row2:row3 => fisher.test(matrix(c(15,4,200,8,4,0,9,1), nrow=2))
The code I wrote does this with traditional for-loops. I was wondering if it could be vectorized or improved in any way.
f_start = as.factor(x$start) #convert start to factor to get count
tab_f_start = table(f_start) # tabulate to get the count for each start
o_start1 = NULL
o_end1 = NULL
o_start2 = NULL
o_end2 = NULL
p_val = NULL
for (i in 1:length(tab_f_start)) {
# check if there are more than 1 entries with same start
if ( tab_f_start[i] > 1) {
# get all rows for current start
cur_entry = x[x$start == as.integer(names(tab_f_start[i])),]
# loop over all combinations to obtain p-values
ctr = tab_f_start[i]
for (j in 1:(ctr-1)) {
for (k in (j+1):ctr) {
# store start and end values separately
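o_start1 = c(o_start1, cur_entry$start[j])
o_end1 = c(o_end1, cur_entry$end[j])
o_start2 = c(o_start2, cur_entry$start[k])
o_end2 = c(o_end2, cur_entry$end[k])
# construct matrix from rows j and k of the current group
m1 = c(cur_entry$val1[j], cur_entry$val1[k])
m2 = c(cur_entry$val2[j], cur_entry$val2[k])
m3 = c(cur_entry$val3[j], cur_entry$val3[k])
m4 = c(cur_entry$val4[j], cur_entry$val4[k])
m = matrix(c(m1,m2,m3,m4), nrow=2)
p_val = c(p_val, fisher.test(m)$p.value) # keep only the p-value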
}
}
}
}
result=data.frame(o_start1, o_end1, o_start2, o_end2, p_val)
Thank you!
As @Ben Bolker suggested, you can use the plyr package to do this compactly. The first step is to create a wider data frame that contains the desired row pairs. The row pairs are generated using the combn function:
set.seed(1)
x <- data.frame( start = c(1,2,2,2,3,3,3,3),
end = 1:8,
v1 = sample(8), v2 = sample(8), v3 = sample(8), v4 = sample(8))
require(plyr)
z <- ddply(x, .(start), function(d) if (nrow(d) == 1) NULL
else {
row_pairs <- combn(nrow(d),2)
cbind( a = d[ row_pairs[1,], ],
b = d[ row_pairs[2,], ] )
})[, -1]
The second step is to extract the p.value from applying the fisher.test to each row-pair:
result <- ddply(z, .(a.start, a.end, b.start, b.end),
function(d)
fisher.test(matrix(unlist( d[, -c(1,2,7,8) ]),
nrow=2, byrow=TRUE))$p.value )
> result
a.start a.end b.start b.end V1
1 2 2 2 3 0.33320784
2 2 2 2 4 0.03346192
3 2 3 2 4 0.84192284
4 3 5 3 6 0.05175017
5 3 5 3 7 0.65218289
6 3 5 3 8 0.75374989
7 3 6 3 7 0.34747011
8 3 6 3 8 0.10233072
9 3 7 3 8 0.52343422
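If you would rather avoid the plyr dependency, the same two steps can be sketched in base R with split(), combn() and apply(). This is only an illustration: the data below are made up (three rows share start = 10, one row has a unique start), the column names follow the question, and pair_tests is a hypothetical helper name.

```r
# Hypothetical data: three rows share start = 10, one row has a unique start
x <- data.frame(start = c(5, 10, 10, 10),
                end   = c(9, 25, 55, 30),
                val1  = c(1, 8, 15, 4),
                val2  = c(2, 9, 20, 8),
                val3  = c(3, 3, 4, 2),
                val4  = c(4, 2, 9, 1))

# pair_tests is a hypothetical helper: all row pairs within one start group
pair_tests <- function(d) {
  if (nrow(d) < 2) return(NULL)   # starts that occur only once are ignored
  pairs <- combn(nrow(d), 2)      # each column holds one pair of row indices
  vals <- as.matrix(d[, c("val1", "val2", "val3", "val4")])
  data.frame(start1 = d$start[pairs[1, ]], end1 = d$end[pairs[1, ]],
             start2 = d$start[pairs[2, ]], end2 = d$end[pairs[2, ]],
             # each pair of rows forms a 2x4 table for fisher.test
             p_val  = apply(pairs, 2, function(p) fisher.test(vals[p, ])$p.value))
}

result <- do.call(rbind, lapply(split(x, x$start), pair_tests))
result
```

The loops over pairs are still there, hidden inside apply(); fisher.test itself is not vectorised, so the gain is mainly in readability rather than speed.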