Input
final_table =
Chr start end num seg.mean seg.mean.1 seg.mean.2
1 68580000 68640000 A8430 0.7000 0.1440 0.1032
1 115900000 116260000 B8430 0.0039 2.7202 2.7202
1 173500000 173680000 C5 -1.7738 -0.0746 -0.2722
How can I make a new data.frame where the values of columns 5 through 7 are set to:
-1, if value < -0.679
0, if -0.679 <= value <= 0.450
+1, if value > 0.450
Expected output
Chr start end num seg.mean seg.mean.1 seg.mean.2
1 68580000 68640000 A8430 1 0 0
1 115900000 116260000 B8430 0 1 1
1 173500000 173680000 C5 -1 0 0
Try this:
# read the data in
df <- read.table(header = TRUE, text="Chr start end num seg.mean seg.mean.1 seg.mean.2
1 68580000 68640000 A8430 0.7000 0.1440 0.1032
1 115900000 116260000 B8430 0.0039 2.7202 2.7202
1 173500000 173680000 C5 -1.7738 -0.0746 -0.2722")
# get the names of the columns you want to change
cols <- names(df[5:length(df)])
# define a function that maps each value range to its new code
fun_cond <- function(x) {
  ifelse(x < -0.679, -1,
         ifelse(x >= -0.679 & x <= 0.450, 0, 1))
}
# copy the data frame so the original is not overwritten
new_df <- df
# use data.table to apply the function to those columns
library(data.table)
setDT(new_df)[ , (cols) := lapply(.SD, fun_cond), .SDcols = cols]
output:
Chr start end num seg.mean seg.mean.1 seg.mean.2
1: 1 68580000 68640000 A8430 1 0 0
2: 1 115900000 116260000 B8430 0 1 1
3: 1 173500000 173680000 C5 -1 0 0
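Note that setDT() converts new_df to a data.table in place (by reference). If you want a plain data.frame again afterwards, setDF() converts it back, also without copying (a small sketch):
setDF(new_df)   # back to a plain data.frame, again by reference
class(new_df)
# [1] "data.frame"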
The same thing without using any additional packages:
cols <- names(df[5:length(df)])
fun_cond <- function(x) {
  ifelse(x < -0.679, -1,
         ifelse(x >= -0.679 & x <= 0.450, 0, 1))
}
new_df <- df
new_df[5:length(df)] <- lapply(new_df[5:length(df)], fun_cond)
I'd use the cut function and apply it to the last three columns individually.
Here's a simple example:
original = data.frame(a=c(rep("A", 2), rep("B", 2)), seg.mean=c(-1, 0, 0.4, 0.5));
original$segmented = cut(original$seg.mean, c(-Inf, -0.679, 0.450, Inf), labels = c(-1,0,1))
One thing to be careful about: the new column will be a factor. If you need numerical values, note that as.numeric() on a factor returns the underlying level codes, so convert via as.character() first.
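For example, a small sketch continuing from original above (the segmented_num name is just for illustration):
# as.numeric() on a factor returns the internal level codes (here 1, 2, 3),
# so go through as.character() first to recover -1, 0, 1
original$segmented_num <- as.numeric(as.character(original$segmented))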
You can also try to use labels=FALSE which will give you numerical values (but likely 1,2,3 rather than -1,0,1). You can fix that by subtracting 2:
original$segmented = cut(original$seg.mean, c(-Inf, -0.679, 0.450, Inf), labels = FALSE)-2
You can directly replace the values in the data frame by subsetting:
df[, 5:7] <- ifelse(df[, 5:7] < -0.679, -1,
                    ifelse(df[, 5:7] <= 0.450, 0, 1))
Related
I have a large xts object, but the example here is a two-column data.frame subset of the data. I would like to calculate (in a new column) the cumulative product of the first column df$rt whenever the second column df$dd is less than 0. Whenever df$dd is 0 I want to reset the accumulation to 0, so that the next time df$dd is less than 0 the cumulative product of df$rt starts over.
The following example dataframe adds the desired outcome as column three df$crt, for reference. Note that some rounding has been applied.
df <- data.frame(
rt = c(0, 0.0171, 0.0796, 0.003, 0.0754, -0.0314, 0.0275, -0.0323, 0.0364, 0.0473, -0.0021),
dd = c(0, -0.0657, -0.0013, 0, -0.018, -0.0012, 0, 0, 0, -0.0016, -0.0856),
crt = c(0, 0.171, 0.0981, 0, 0.0754, 0.0415, 0, 0, 0, 0.473, 0.045)
)
I have tried various combinations of with, ifelse and cumprod like:
df$crt <- with(df, ifelse(df$dd<0, cumprod(1+df$rt)-1, 0))
However, this does not reset the cumulative product after a 0 in df$dd; it only writes a 0 and then continues the previous accumulation of df$rt once df$dd is below zero again.
I think I am missing a counter of some sort to initiate the reset. Note that the dataframe I'm working with to implement this is large.
Create a grouping column by taking the cumulative sum of the logical vector (dd == 0), so that it increments by 1 at each position where dd is 0. Then use replace with the condition so the cumulative product of 'rt' is computed only where 'dd' is not equal to 0.
library(dplyr)
df %>%
  group_by(grp = cumsum(dd == 0)) %>%
  mutate(crt1 = replace(dd, dd != 0, cumprod(1 + rt[dd != 0]) - 1)) %>%
  ungroup %>%
  select(-grp)
Output:
# A tibble: 11 x 4
rt dd crt crt1
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 0
2 0.0171 -0.0657 0.171 0.0171
3 0.0796 -0.0013 0.0981 0.0981
4 0.003 0 0 0
5 0.0754 -0.018 0.0754 0.0754
6 -0.0314 -0.0012 0.0415 0.0416
7 0.0275 0 0 0
8 -0.0323 0 0 0
9 0.0364 0 0 0
10 0.0473 -0.0016 0.473 0.0473
11 -0.0021 -0.0856 0.045 0.0451
Or using base R
with(df, ave(rt * (dd != 0), cumsum(dd == 0), FUN = function(x)
replace(x, x != 0, (cumprod(1 + x[x != 0]) - 1))))
Output:
[1] 0.00000000 0.01710000 0.09806116 0.00000000 0.07540000 0.04163244 0.00000000 0.00000000 0.00000000 0.04730000 0.04510067
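To keep the base R result as a column (a sketch mirroring the crt1 column of the dplyr output):
df$crt1 <- with(df, ave(rt * (dd != 0), cumsum(dd == 0), FUN = function(x)
  replace(x, x != 0, cumprod(1 + x[x != 0]) - 1)))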
I can't nest an apply function to repeat it many times with other combinations of columns
I need to get the percentage of sp1==1 & s1==1 and of sp2==1 & s1==1 with respect to s1, and the same with respect to s2, s3, ... s1000. Here is a short example:
x <- data.frame("sp1"=rep(0:1, times=5),
"sp2"=rep(0:1, each=5),
"s1" = rep(0:1, times=10),
"s2" = rep(0:1, each=2),
"s3" = rep(1:0, each=2))
> x
sp1 sp2 s1 s2 s3
1 0 0 0 0 1
2 1 0 1 0 1
3 0 0 0 1 0
4 1 0 1 1 0
5 0 0 0 0 1
6 1 1 1 0 1
7 0 1 0 1 0
8 1 1 1 1 0
9 0 1 0 0 1
10 1 1 1 0 1
11 0 0 0 1 0
12 1 0 1 1 0
13 0 0 0 0 1
14 1 0 1 0 1
15 0 0 0 1 0
16 1 1 1 1 0
17 0 1 0 0 1
18 1 1 1 0 1
19 0 1 0 1 0
20 1 1 1 1 0
Now I wrote a function to calculate the percentage with respect to s1:
r <- as.data.frame(sapply(x[,1:2],
function(i) sum(i ==1 & x$s1 == 1)/sum(i ==1)))
> r
sapply(x[, 1:2], function(i) sum(i == 1 & x$s1 == 1)/sum(i == 1))
sp1 1.0
sp2 0.6
I want to build a df with all the percentages of sp1, sp2, sp3, ... sp200 with respect to s1, s2, s3, ... s1000:
> r
s1 s2 s3 ... s1000
sp1 1.0 0.5 0.5
sp2 0.6 0.5 0.5
...
sp200
I've tried to write a function over both groups of variables, one for the sp's and another for the s's:
intento <- as.data.frame(sapply(i=x[,1:2],
j=x[,3:5],
function(i,j)sum(i ==1 & j == 1)/sum(i ==1)))
But logically that's not the way:
Error in match.fun(FUN) : argument "FUN" is missing, with no default
We can separate the columns based on their names and use sapply on them:
sp_cols <- grep("^sp", names(x))
s_cols <- grep("^s\\d+", names(x))
sapply(x[sp_cols], function(i) sapply(x[s_cols],
function(j) sum(i == 1 & j == 1)/sum(i == 1)))
If you have only 1's and 0's as values in the columns, this can be reduced to:
sapply(x[s_cols], function(i) sapply(x[sp_cols], function(j) sum(i & j)/sum(j)))
# s1 s2 s3
#sp1 1.0 0.5 0.5
#sp2 0.6 0.5 0.5
You're looking for outer. Your function just needs to be Vectorized.
FUN <- Vectorize(function(i,j) sum(x[i] == 1 & x[j] == 1)/sum(x[i] == 1))
outer(1:2, 3:5, FUN)
# [,1] [,2] [,3]
# [1,] 1.0 0.5 0.5
# [2,] 0.6 0.5 0.5
You could refine this using grep to find the columns automatically
outer(grep("sp", names(x)), grep("s\\d+", names(x)), FUN)
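The outer() result has no dimnames; if you want labelled rows and columns, one option (a sketch) is to reuse the grep indices:
sp_cols <- grep("sp", names(x))
s_cols <- grep("s\\d+", names(x))
res <- outer(sp_cols, s_cols, FUN)
dimnames(res) <- list(names(x)[sp_cols], names(x)[s_cols])
res
#      s1  s2  s3
# sp1 1.0 0.5 0.5
# sp2 0.6 0.5 0.5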
A similar approach is to use lapply(x, function(x) which(x == 1)) and then use that down the road. The thought process is that we might as well store the information instead of repeatedly checking it.
#as suggested by #Ronak
sp_cols <- grep("^sp", names(x))
s_cols <- grep("^s\\d+", names(x))
x_l_zero <- lapply(x, function(x) which(x == 1))
sapply(x_l_zero[s_cols], function(x)
  sapply(x_l_zero[sp_cols], function(y) length(intersect(x, y)) / length(y)))
s1 s2 s3
sp1 1.0 0.5 0.5
sp2 0.6 0.5 0.5
@Ronak has the fastest solution, which is more or less the OP's code with the error addressed.
Unit: microseconds
expr min lq mean median uq max neval
jay.sf_outer_FUN 1190.8 1240.85 1360.103 1284.50 1337.30 2627.0 100
cole_which_apply 268.4 289.00 454.609 306.05 322.00 7610.7 100
ronak_1_unsimple 181.3 193.95 321.863 209.95 233.40 6227.4 100
ronak_2_simple 228.5 241.25 342.354 250.65 276.05 7478.4 100
akrun_dplyr 5218.7 5506.05 6108.997 5721.80 6081.65 25147.3 100
Code for performance:
library(microbenchmark)
library(tidyverse)
##data set
x <- data.frame("sp1"=rep(0:1, times=5),
"sp2"=rep(0:1, each=5),
"s1" = rep(0:1, times=10),
"s2" = rep(0:1, each=2),
"s3" = rep(1:0, each=2))
#for jay.sf
FUN <- Vectorize(function(i,j) sum(x[i] == 1 & x[j] == 1)/sum(x[i] == 1))
#names of columns
sp_cols <- grep("^sp", names(x))
s_cols <- grep("^s\\d+", names(x))
sp_cols_nam <- grep("^sp", names(x), value = T)
s_cols_nam <- grep("^s\\d+", names(x), value = T)
#benchmark
microbenchmark(
outer_FUN = {
outer(sp_cols, s_cols, FUN)
}
, apply_heaven = {
x_l_zero <- lapply(x, function(x) which(x == 1))
sapply(x_l_zero[s_cols], function(x) sapply(x_l_zero[sp_cols] , function(y) length(intersect(x,y))/length(y)))
}
, ronak_1_unsimple = {
sapply(x[sp_cols], function(i) sapply(x[s_cols],
function(j) sum(i == 1 & j == 1)/sum(i == 1)))
}
, ronak_2_simple = {
sapply(x[s_cols], function(i) sapply(x[sp_cols], function(j) sum(i & j)/sum(j)))
}
, akrun_dplyr = {
crossing(nm1 = sp_cols_nam,
nm2 = s_cols_nam) %>%
mutate(val = pmap_dbl(., ~ sum(x[..1] ==1 & x[..2] == 1)/sum(x[..1]))) %>%
spread(nm2, val)
}
)
Here is an option with tidyverse
library(tidyverse)
crossing(nm1 = names(x)[startsWith(names(x), "sp")],
nm2 = grep("^s\\d+", names(x), value = TRUE)) %>%
mutate(val = pmap_dbl(., ~ sum(x[..1] ==1 & x[..2] == 1)/sum(x[..1]))) %>%
spread(nm2, val)
# A tibble: 2 x 4
# nm1 s1 s2 s3
# <chr> <dbl> <dbl> <dbl>
#1 sp1 1 0.5 0.5
#2 sp2 0.6 0.5 0.5
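Note that spread() is superseded in newer tidyr; the same reshape with pivot_wider() would look like this (a sketch, otherwise identical to the code above):
crossing(nm1 = names(x)[startsWith(names(x), "sp")],
         nm2 = grep("^s\\d+", names(x), value = TRUE)) %>%
  mutate(val = pmap_dbl(., ~ sum(x[..1] == 1 & x[..2] == 1)/sum(x[..1]))) %>%
  pivot_wider(names_from = nm2, values_from = val)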
I'm trying to write a function with nested if-else in R. How can I convert a data.frame so that the values of the seg.mean columns are set according to the conditions below?
Input
df <- read.table(header = TRUE, text="Chr start end num seg.mean seg.mean.1 seg.mean.2
1 68580000 68640000 A8430 0.7000 0 0.1032
1 115900000 116260000 B8430 0.0039 2.7202 2.7202
1 173500000 173680000 C5 -1.7738 -2.0746 -0.2722")
Conditions:
x > 0 & x < 1 : 1
x >= 1 : 2
x < 0 & x > -1 : -1
x <= -1 : -2
x == 0 : 0
Expected output
df <- read.table(header = TRUE, text="Chr start end num seg.mean seg.mean.1 seg.mean.2
1 68580000 68640000 A8430 1 0 1
1 115900000 116260000 B8430 1 2 2
1 173500000 173680000 C5 -2 -2 -1")
fun_cond <- function(x) { ifelse( x >= 1, 2,ifelse( x > 0 & x < 1, 1),ifelse( x <= 1, 2,ifelse( x < 0 & x > -1, -1)))}
new_df[5:length(df)] <- lapply(new_df[5:length(df)], fun_cond)
I think what you want is this:
fun_cond <- function(x){
  ifelse(x >= 1, 2, ifelse(x < 1 & x > 0, 1, ifelse(x < 0 & x > -1, -1, -2)))
}
x <- c(-1, 1, 0, 0, 1, -1, 0.5, 0.3, -0.4)
> fun_cond(x)
#[1] -2 2 -2 -2 2 -2 1 1 -1
Try it out...
Note that x == 0 maps to -2 with this version; there is no explicit x == 0 case, so zero falls through to the final else.
If you want 0 mapped to 0, then use:
fun_cond <- function(x){
  ifelse(x >= 1, 2, ifelse(x < 1 & x > 0, 1, ifelse(x == 0, 0, ifelse(x < 0 & x > -1, -1, -2))))
}
x <- c(-1, 1, 0, 0, 1, -1, 0.5, 0.3, -0.4)
> fun_cond(x)
#[1] -2 2 0 0 2 -2 1 1 -1
Try cut in base R:
cols <- grep("seg.mean", names(df))
res <- sapply(cols, function(i)
cut(df[,i], breaks = c(-Inf, -1, 0, 1, Inf), labels = c(-2,-1,1,2)))
# to leave zeros untouched
res[df[cols]==0] <- 0
If you want to get your expected output:
df[cols] <- res
# Chr start end num seg.mean seg.mean.1 seg.mean.2
# 1 1 68580000 68640000 A8430 1 0 1
# 2 1 115900000 116260000 B8430 1 2 2
# 3 1 173500000 173680000 C5 -2 -2 -1
I have a data frame in R in which I would like to convert all negative values to zero in every column except the id columns.
id1 id2 var1 var2 var3
-1 -1 0 -33 5
-1 -2 9 -10 -1
I can convert all columns with a line like:
temp[temp < 0] <- 0
But I can't adjust it to only a subset of columns. I've tried:
temp[temp < 0, -c(1,2)] <- 0
But this errors with "non-existent rows not allowed".
A small edit to your variant:
temp[,-c(1,2)][temp[, -c(1,2)] < 0] <- 0
You can try using replace:
> mydf[-c(1, 2)] <- replace(mydf[-c(1, 2)], mydf[-c(1, 2)] < 0, 0)
> mydf
id1 id2 var1 var2 var3
1 -1 -1 0 0 5
2 -1 -2 9 0 0
We can use data.table
setDT(d1)
for(j in grep('^var', names(d1))){
  set(d1, i = which(d1[[j]] < 0), j = j, value = 0)
}
d1
# id1 id2 var1 var2 var3
# 1: -1 -1 0 0 5
# 2: -1 -2 9 0 0
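The loop with set() updates d1 by reference without copying. The same update can be written as a one-liner with := and .SD (a sketch, an alternative to the loop above):
cols <- grep('^var', names(d1), value = TRUE)
d1[, (cols) := lapply(.SD, function(v) replace(v, v < 0, 0)), .SDcols = cols]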
There might be fancier or more compact ways, but here's a vectorised replacement you can apply to the var columns:
mytable <- read.table(textConnection("
id1 id2 var1 var2 var3
-1 -1 0 -33 5
-1 -2 9 -10 -1"), header = TRUE)
mytable[, grep("^var", names(mytable))] <-
apply(mytable[, grep("^var", names(mytable))], 2, function(x) ifelse(x < 0, 0, x))
mytable
## id1 id2 var1 var2 var3
## 1 -1 -1 0 0 5
## 2 -1 -2 9 0 0
You could use pmax:
dat <- data.frame(id1=c(-1,-1), id2=c(-1,-2), var1=c(0,9), var2=c(-33,-10), var3=c(5,-1))
dat[,-c(1,2)] <- matrix(pmax(unlist(dat[,-c(1,2)]),0), nrow=nrow(dat))
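A column-wise variant of the same idea (a sketch), which avoids the unlist/matrix round trip:
# pmax() is vectorised, so applying it to each column keeps the data.frame shape
dat[, -c(1, 2)] <- lapply(dat[, -c(1, 2)], pmax, 0)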
I have an hourly value. I want to count how many consecutive hours the value has been zero since the last time it was not zero. This is an easy job for a spreadsheet or a for loop, but I am hoping for a snappy vectorized one-liner to accomplish the task.
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
df <- data.frame(x, zcount = NA)
df$zcount[1] <- ifelse(df$x[1] == 0, 1, 0)
for(i in 2:nrow(df))
df$zcount[i] <- ifelse(df$x[i] == 0, df$zcount[i - 1] + 1, 0)
Desired output:
R> df
x zcount
1 1 0
2 0 1
3 1 0
4 0 1
5 0 2
6 0 3
7 1 0
8 1 0
9 0 1
10 0 2
William Dunlap's posts on R-help are the place to look for all things related to run lengths. His f7 from this post is
f7 <- function(x) { tmp <- cumsum(x); tmp - cummax((!x) * tmp) }
and in the current situation it is called as f7(!x). In terms of performance:
> x <- sample(0:1, 1000000, TRUE)
> system.time(res7 <- f7(!x))
user system elapsed
0.076 0.000 0.077
> system.time(res0 <- cumul_zeros(x))
user system elapsed
0.345 0.003 0.349
> identical(res7, res0)
[1] TRUE
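As a quick sanity check on the question's short vector (a sketch):
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
f7(!x)
# [1] 0 1 0 1 2 3 0 0 1 2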
Here's a way, building on Joshua's rle approach: (EDITED to use seq_len and lapply as per Marek's suggestion)
> (!x) * unlist(lapply(rle(x)$lengths, seq_len))
[1] 0 1 0 1 2 3 0 0 1 2
UPDATE. Just for kicks, here's another way to do it, around 5 times faster:
cumul_zeros <- function(x) {
  x <- !x
  rl <- rle(x)
  len <- rl$lengths
  v <- rl$values
  cumLen <- cumsum(len)
  z <- x
  # replace the 0 at the end of each zero-block in z by the
  # negative of the length of the preceding 1-block....
  iDrops <- c(0, diff(v)) < 0
  z[cumLen[iDrops]] <- -len[c(iDrops[-1], FALSE)]
  # ... to ensure that the cumsum below does the right thing.
  # We zap the cumsum with x so only the cumsums for the 1-blocks survive:
  x * cumsum(z)
}
Try an example:
> cumul_zeros(c(1,1,1,0,0,0,0,0,1,1,1,0,0,1,1))
[1] 0 0 0 1 2 3 4 5 0 0 0 1 2 0 0
Now compare times on a million-length vector:
> x <- sample(0:1, 1000000,T)
> system.time( z <- cumul_zeros(x))
user system elapsed
0.15 0.00 0.14
> system.time( z <- (!x) * unlist( lapply( rle(x)$lengths, seq_len)))
user system elapsed
0.75 0.00 0.75
Moral of the story: one-liners are nicer and easier to understand, but not always the fastest!
rle will "count how many consecutive hours the value has been zero since the last time it was not zero", but not in the format of your "desired output".
Note the lengths for the elements where the corresponding values are zero:
rle(x)
# Run Length Encoding
# lengths: int [1:6] 1 1 1 3 2 2
# values : num [1:6] 1 0 1 0 1 0
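If you do want to expand those run lengths into per-element counts, one way (a sketch) is to emit a counting sequence for each zero run and zeros elsewhere:
r <- rle(x)
unlist(mapply(function(len, val) if (val == 0) seq_len(len) else rep(0, len),
              r$lengths, r$values))
# [1] 0 1 0 1 2 3 0 0 1 2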
A simple base R approach:
ave(!x, cumsum(x), FUN = cumsum)
#[1] 0 1 0 1 2 3 0 0 1 2
One-liner, not exactly super elegant:
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
unlist(lapply(split(x, c(0, cumsum(abs(diff(!x == 0))))), function(x) (x[1] == 0) * seq(length(x))))
Using purrr::accumulate() is very straightforward, so this tidyverse solution may add some value here. I must acknowledge it is definitely not the fastest, as it calls the same function length(x) times.
library(purrr)
accumulate(x==0, ~ifelse(.y!=0, .x+1, 0))
[1] 0 1 0 1 2 3 0 0 1 2