I have this data frame:
set.seed(007)
x <- data.frame(v1 = sample(1:100, 50),
                v2 = sample(1:100, 50),
                v3 = sample(1:100, 50),
                v4 = sample(1:100, 50),
                v5 = sample(1:100, 50))
I need to count the values across the rows (v1:v5) between these intervals: <25; 25-49; 50-74; >=75.
I tried with:
x$less.25 <- rowSums(x < 25, na.rm=TRUE)
x$between.25_49 <- rowSums(x >= 25 & x < 50, na.rm=TRUE)
x$between.50_74 <- rowSums(x >= 50 & x < 75, na.rm=TRUE)
x$greater.75 <- rowSums(x >= 75, na.rm=TRUE)
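If I have understood your problem correctly (the counts are taken over v1:v5 only, so that the newly added count columns are never counted themselves):
vals <- x[, paste0("v", 1:5)]  # only the original value columns
x$less.25 <- apply(vals, 1, function(r) sum(r < 25))
x$between.25_49 <- apply(vals, 1, function(r) sum(r >= 25 & r < 50))
x$between.50_74 <- apply(vals, 1, function(r) sum(r >= 50 & r < 75))
x$greater.75 <- apply(vals, 1, function(r) sum(r >= 75))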
This gives
v1 v2 v3 v4 v5 less.25 between.25_49 between.50_74 greater.75
1 99 58 40 10 70 1 1 2 1
2 40 72 49 90 87 0 2 1 2
3 12 76 99 19 71 2 0 1 2
4 7 61 38 20 43 2 2 1 0
5 24 70 62 28 45 1 2 2 0
6 76 37 33 76 83 0 2 0 3
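If the number of intervals grows, the same counts can be obtained in one pass. This is only a rough sketch, assuming the value columns are exactly v1 to v5 and the intervals are left-closed ([25, 50) and so on); the names vals, breaks, labels, bins and counts are just illustrative.

```r
set.seed(7)
x <- data.frame(v1 = sample(1:100, 50),
                v2 = sample(1:100, 50),
                v3 = sample(1:100, 50),
                v4 = sample(1:100, 50),
                v5 = sample(1:100, 50))

# Work on the value columns only, as a matrix
vals <- as.matrix(x[, paste0("v", 1:5)])

# Interval boundaries; right = FALSE makes each interval [lower, upper)
breaks <- c(-Inf, 25, 50, 75, Inf)
labels <- c("less.25", "between.25_49", "between.50_74", "greater.75")
bins <- cut(vals, breaks = breaks, labels = labels, right = FALSE)

# Cross-tabulate row index against interval: one row of counts per data row
counts <- table(row(vals), bins)
x <- cbind(x, as.data.frame.matrix(counts))
head(x)
```

Each row of the four new columns should add up to 5, which is a quick sanity check.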
So I have this dataframe and I aim to add a new variable based on others:
 Qi Age c_gen
  1  56    13
  2  43    15
  5  31     6
  3  67     8
I want to create a variable called c_sep such that:
if Qi==1 or Qi==2, c_sep takes a random number between (c_gen + 6) and Age;
if Qi==3 or Qi==4, c_sep takes a random number between (Age-15) and Age;
and c_sep is 0 otherwise,
so my data would look something like this:
 Qi Age c_gen c_sep
  1  56    13    24
  2  43    15    13
  5  31     6     0
  3  67     8    40
Any ideas, please?
In base R, you can do something along the lines of:
dat <- read.table(text = "Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8", header = TRUE)
set.seed(100)
dat$c_sep <- 0
dat$c_sep[dat$Qi %in% c(1,2)] <- apply(dat[dat$Qi %in% c(1,2),], 1, \(row) sample(
(row["c_gen"]+6):row["Age"], 1
)
)
dat$c_sep[dat$Qi %in% c(3,4)] <- apply(dat[dat$Qi %in% c(3,4),], 1, \(row) sample(
(row["Age"]-15):row["Age"], 1
)
)
dat
# Qi Age c_gen c_sep
# 1 1 56 13 28
# 2 2 43 15 43
# 3 5 31 6 0
# 4 3 67 8 57
If you are doing it more than twice you might want to put this in a function - depending on your requirements.
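As a rough sketch of what such a function could look like: rand_between() below is a hypothetical helper (not from any package) that draws one random integer per row between two bounds. It uses runif() rather than sample(), which sidesteps the case where the range collapses to a single number and sample() would draw from 1:n instead.

```r
# Hypothetical helper: one random integer between lo and hi (inclusive) per element
rand_between <- function(lo, hi) {
  lower <- pmin(lo, hi)   # tolerate bounds given in either order
  upper <- pmax(lo, hi)
  lower + floor(runif(length(lower)) * (upper - lower + 1))
}

dat <- read.table(text = "Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8", header = TRUE)

set.seed(100)
dat$c_sep <- 0
is12 <- dat$Qi %in% c(1, 2)
is34 <- dat$Qi %in% c(3, 4)
dat$c_sep[is12] <- rand_between(dat$c_gen[is12] + 6, dat$Age[is12])
dat$c_sep[is34] <- rand_between(dat$Age[is34] - 15, dat$Age[is34])
dat
```

Because the helper works on whole vectors, there is no need for apply() over rows.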
Try this (assuming your data frame is called df):
df$c_sep <- ifelse(df$Qi == 1 | df$Qi == 2,
                   sapply(1:nrow(df),
                          \(x) sample(seq(df$c_gen[x] + 6, df$Age[x]), 1)),
                   ifelse(df$Qi == 3 | df$Qi == 4,
                          sapply(1:nrow(df),
                                 \(x) sample(seq(df$Age[x] - 15, df$Age[x]), 1)),
                          0))
output
Qi Age c_gen c_sep
1 1 56 13 41
2 2 43 15 42
3 5 31 6 0
4 3 67 8 58
A tidyverse option:
library(tidyverse)
df <- tribble(
~Qi, ~Age, ~c_gen,
1, 56, 13,
2, 43, 15,
5, 31, 6,
3, 67, 8
)
df |>
rowwise() |>
mutate(c_sep = case_when(
Qi <= 2 ~ sample(seq(c_gen + 6, Age, 1), 1),
between(Qi, 3, 4) ~ sample(seq(Age - 15, Age, 1), 1),
TRUE ~ 0
)) |>
ungroup()
#> # A tibble: 4 × 4
#> Qi Age c_gen c_sep
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 56 13 39
#> 2 2 43 15 41
#> 3 5 31 6 0
#> 4 3 67 8 54
Created on 2022-06-29 by the reprex package (v2.0.1)
I have a dataset whose scores need to be converted from a continuous scale to a categorical one. Each value should be placed into one of 10 equal-width categories based on the minimum and maximum of its column. So if the minimum is 1 and the maximum is 100, there will be 10 categories such that 1-10 = 1, 11-20 = 2, 21-30 = 3, ..., 91-100 = 10. Here's what my data looks like:
df <- as.data.frame(cbind(test1 = sample(13:52, 15),
test2 = sample(16:131, 15)))
> df
test1 test2
1 44 131
2 26 83
3 74 41
4 6 73
5 83 20
6 63 110
7 23 29
8 42 64
9 41 40
10 10 96
11 2 39
12 14 24
13 67 30
14 51 59
15 66 37
So far I have a function:
trail.bin <- function(data, col, min, max) {
for(i in 1:10) {
for(e in 0:9) {
x <- as.data.table(data)
mult <- (max - min)/10
x[col >= min+(e*mult) & col < min+(i*mult),
col := i]
}
}
return(x)
}
What I'm trying to do is take the minimum and maximum, find what the spacing of intervals would be (mult), then use two loops on a data.table reference syntax. The outcome I'm hoping for is:
df2
test1 test2
1 5 131
2 3 83
3 8 41
4 1 73
5 9 20
6 7 110
7 3 29
8 5 64
9 5 40
10 2 96
11 1 39
12 2 24
13 7 30
14 6 59
15 7 37
Thanks!
You could create a function using cut
library(data.table)
trail.bin <- function(data, col, n) {
data[, (col) := lapply(.SD, cut, n, labels = FALSE), .SDcols = col]
return(data)
}
setDT(df)
trail.bin(df, 'test1', 10)
You can also pass multiple columns
trail.bin(df, c('test1', 'test2'), 10)
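For reference, here is roughly what equal-width binning does, written out by hand. This is a sketch only: bin_equal_width() is a made-up name, it assumes the column has no NAs and is not constant, and it will not necessarily agree with cut() at bin edges, because cut() with a single number of breaks uses right-closed intervals and slightly widens the outer limits.

```r
# Hypothetical helper: assign each value to one of n equal-width bins
# spanning [min(v), max(v)]; the maximum is folded into the last bin
bin_equal_width <- function(v, n = 10) {
  width <- (max(v) - min(v)) / n
  pmin(floor((v - min(v)) / width) + 1, n)
}

# test1 column as printed in the question
test1 <- c(44, 26, 74, 6, 83, 63, 23, 42, 41, 10, 2, 14, 67, 51, 66)
bins <- bin_equal_width(test1)
data.frame(test1, bins)
```

The bin width here comes from the observed minimum and maximum of the column, so the categories depend on the data rather than on fixed cut points such as 10, 20, 30.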
[ First Stack question please be kind :) ]
I'm creating multiple new columns in a data frame based on multiple conditional statements of existing columns - all essentially new combinations of columns.
For example, if there are 4 columns (a:d), I need new columns of all combinations (abcd, abc, abd, etc) and a 0/1 coding based on threshold data in a:d.
A toy data example and the desired outcome are included below. However, the solution needs to be scalable: there are 4 base columns, but I need all combinations of 2, 3 and 4 columns, not just the 3-column ones (abc, abd, ..., ab, ac, ad, ...; total n = 11).
[Background for context: this is actually flow cytometry data from multipotent stem cells that can grow into colonies of all lineage cell types (multipotent, or abcd) or progressively more restricted populations (only abc, or abd, ab, ac, etc.)]
# Toy data set
set.seed(123)
df <- tibble(a = c(sample(10:50, 10)),
b = c(sample(10:50, 10)),
c = c(sample(10:50, 10)),
d = c(sample(10:50, 10)))
The current code produces the desired result; however, it needs 11 lines of repetitive code, which is error-prone, and I hope there is a more elegant solution:
df %>%
mutate(
abcd = if_else(a > 30 & b > 20 & c > 30 & d > 30, 1, 0),
abc = if_else(a > 30 & b > 20 & c > 30 & d <= 30, 1, 0),
abd = if_else(a > 30 & b > 20 & c <= 30 & d > 30, 1, 0),
acd = if_else(a > 30 & b <= 20 & c > 30 & d > 30, 1, 0),
bcd = if_else(a <= 30 & b > 20 & c > 30 & d > 30, 1, 0))
As I understand your question, for each row you just need to find which columns meet the criteria defined in your if_else() conditions. This solution adds a column to your df that contains the combination for each row. It is probably also faster than multiple if_else() conditions. Finally, the new column can be used for ordering or grouping.
# define the threshold levels for all columns
threshold = c(a=30, b=20, c=30, d=30)
# get names of columns meeting the threshold and paste names
df$combn <- apply(df, 1, function(x) {
paste(names(x)[x > threshold], collapse = "")
})
> df
# A tibble: 10 x 5
a b c d combn
<int> <int> <int> <int> <chr>
1 21 49 46 49 bcd
2 41 28 37 46 abcd
3 25 36 34 36 bcd
4 43 31 47 40 abcd
5 44 13 48 10 ac
6 11 42 35 27 bc
7 28 18 29 48 d
8 40 11 30 17 a
9 46 20 19 20 a
10 24 40 14 43 bd
If I get that correctly, you want to categorize each row into exactly one class, so getting the category name as concatenation of threshold tests should be enough. Then you can get 0/1 columns using spread():
df %>%
mutate(
a_ = if_else(a > 30, 'a', 'x'),
b_ = if_else(b > 20, 'b', 'x'),
c_ = if_else(c > 30, 'c', 'x'),
d_ = if_else(d > 30, 'd', 'x'),
all_ = paste0(a_, b_, c_, d_),
one_ = 1) %>%
spread(all_, one_, fill = 0) %>%
select(-ends_with("_"))
Gives
# A tibble: 10 x 11
a b c d abcd axcx axxx xbcd xbcx xbxd xxxd
<int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11 42 35 27 0 0 0 0 1 0 0
2 21 49 46 49 0 0 0 1 0 0 0
3 24 40 14 43 0 0 0 0 0 1 0
4 25 36 34 36 0 0 0 1 0 0 0
5 28 18 29 48 0 0 0 0 0 0 1
6 40 11 30 17 0 0 1 0 0 0 0
7 41 28 37 46 1 0 0 0 0 0 0
8 43 31 47 40 1 0 0 0 0 0 0
9 44 13 48 10 0 1 0 0 0 0 0
10 46 20 19 20 0 0 1 0 0 0 0
(You can use '' instead of 'x', but then spread() will overwrite some of your original columns.)
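If you need a 0/1 column for every one of the 11 combinations, whether or not it occurs in the data, one possible sketch is to build the combination names with combn() and compare them against a per-row label. The three-row df below reuses rows from the output above purely for illustration; above, label and combos are illustrative names.

```r
# Small example; the thresholds are the ones used in the question
df <- data.frame(a = c(21, 41, 44),
                 b = c(49, 28, 13),
                 c = c(46, 37, 48),
                 d = c(49, 46, 10))
threshold <- c(a = 30, b = 20, c = 30, d = 30)
vars <- names(threshold)

# TRUE where a value exceeds the threshold of its column
above <- sweep(as.matrix(df[vars]), 2, threshold, ">")

# Per-row label, e.g. "bcd" when b, c and d pass but a does not
label <- apply(above, 1, function(r) paste(vars[r], collapse = ""))

# All combinations of 2, 3 and 4 column names: ab, ac, ..., abcd
combos <- unlist(lapply(2:length(vars),
                        function(k) combn(vars, k, paste, collapse = "")))

# One 0/1 column per combination
for (cb in combos) df[[cb]] <- as.integer(label == cb)
df
```

Rows where fewer than two columns pass the threshold simply get 0 in every combination column.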
I currently have a data frame taken from a data feed of events that happened in chronological order. I would like to add a new column to each row of my data that corresponds to the previous event's endx if the prior event type is 1, and the previous event's x if the prior event type is not 1.
e.g.
player_id <- c(12, 17, 26, 3)
event_type <- c(1, 3, 1, 10)
x <- c(65, 34, 43, 72)
endx <- c(68, NA, 47, NA)
df <- data.frame(player_id, event_type, x, endx)
df
player_id event_type x endx
1 12 1 65 68
2 17 3 34 NA
3 26 1 43 47
4 3 10 72 NA
so the end result would be:
player_id event_type x endx previous
1 12 1 65 68 NA
2 17 3 34 NA 68
3 26 1 43 47 34
4 3 10 72 NA 47
We can use if_else
library(dplyr)
df %>%
mutate(previous = if_else(lag(event_type)==1, lag(endx), lag(x)))
# player_id event_type x endx previous
#1 12 1 65 68 NA
#2 17 3 34 NA 68
#3 26 1 43 47 34
#4 3 10 72 NA 47
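I am sure this isn't the most succinct way, but you can use a loop and indexing.
df$previous <- NA
for (i in 2:nrow(df)) {
  # use the previous row's endx if its event_type is 1, otherwise its x
  df[i, "previous"] <- if (df[i - 1, "event_type"] == 1) df[i - 1, "endx"] else df[i - 1, "x"]
}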
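The same idea can also be written in base R without a loop or extra packages, by shifting the columns down one row by hand. This is a sketch assuming the rows are already in chronological order; prev_type, prev_endx and prev_x are illustrative names.

```r
player_id <- c(12, 17, 26, 3)
event_type <- c(1, 3, 1, 10)
x <- c(65, 34, 43, 72)
endx <- c(68, NA, 47, NA)
df <- data.frame(player_id, event_type, x, endx)

n <- nrow(df)
# Shift each column down by one row; the first row has no previous event
prev_type <- c(NA, df$event_type[-n])
prev_endx <- c(NA, df$endx[-n])
prev_x    <- c(NA, df$x[-n])

# Previous endx when the previous event type is 1, otherwise previous x
df$previous <- ifelse(prev_type == 1, prev_endx, prev_x)
df
```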
I need to automate the process of getting the next number(s) in the given sequence.
Can we make a function which takes two inputs:
- a vector of numbers, e.g. c(3, 7, 13, 21)
- how many next numbers to return
seqNext <- function(sequ, n) {
..
}
seqNext( c(3,7,13,21), 3)
# 31 43 57
seqNext( c(37,26,17,10), 1)
# 5
By the power of maths!
x1 <- c(3,7,13,21)
dat <- data.frame(x=seq_along(x1), y=x1)
predict(lm(y ~ poly(x, 2), data=dat), newdata=list(x=5:15))
# 1 2 3 4 5 6 7 8 9 10 11
# 31 43 57 73 91 111 133 157 183 211 241
When dealing with successive differences that change their sign, the pattern of output values ends up switching from decreasing to increasing:
x2 <- c(37,26,17,10)
dat <- data.frame(x=seq_along(x2), y=x2)
predict(lm(y ~ poly(x,2), data=dat), newdata=list(x=1:10))
# 1 2 3 4 5 6 7 8 9 10
#37 26 17 10 5 2 1 2 5 10
# -(11) -(9) -(7) -(5) -(3) -(1) -(-1) -(-3) -(-5)   <- amount subtracted at each step
#     -2   -2   -2   -2   -2    -2    -2    -2       <- change in that amount
As a function:
seqNext <- function(x,n) {
L <- length(x)
dat <- data.frame(x=seq_along(x), y=x)
unname(
predict(lm(y ~ poly(x, 2), data=dat), newdata=list(x=seq(L+1,L+n)))
)
}
seqNext(x1,5)
#[1] 31 43 57 73 91
seqNext(x2,5)
#[1] 5 2 1 2 5
This is also easily extensible to circumstances where the pattern might be n orders deep, e.g.:
x3 <- c(100, 75, 45, 5, -50)
diff(x3)
#[1] -25 -30 -40 -55
diff(diff(x3))
#[1] -5 -10 -15
diff(diff(diff(x3)))
#[1] -5 -5
seqNext <- function(x,n,degree=2) {
L <- length(x)
dat <- data.frame(x=seq_along(x), y=x)
unname(
predict(lm(y ~ poly(x, degree), data=dat), newdata=list(x=seq(L+1,L+n)))
)
}
seqNext(x3,n=5,deg=3)
#[1] -125 -225 -355 -520 -725
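seqNext <- function(x, n) {
  # d holds the last two first differences of x
  k <- length(x); d <- diff(x[(k - 2):k])
  # extend from the last value, assuming the second difference diff(d) stays constant
  x[k] + 1:n * d[2] + cumsum(1:n) * diff(d[1:2])
}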
seqNext(c(3,7,13,21),3)
# [1] 31 43 57
seqNext(c(37,26,17,10),1)
# [1] 5
seqNext(c(137,126,117,110),10)
# [1] 105 102 101 102 105 110 117 126 137 150
seqNext(c(105,110,113,114),5)
# [1] 113 110 105 98 89
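The function above looks only at the last three values, so it assumes a constant second difference. For patterns that are more orders deep, a sketch along the same lines can keep differencing until the differences stop changing and then rebuild the sequence from the deepest level upward. seq_next_diff() is a made-up name, and the sketch assumes exact (integer-like) inputs so that comparing differences with 0 is safe.

```r
# Hypothetical helper: extend x by n terms using repeated differences
seq_next_diff <- function(x, n) {
  lasts <- numeric(0)   # last value at each difference level
  d <- x
  # Difference until the current level is constant (or has one value left)
  while (length(d) > 1 && any(diff(d) != 0)) {
    lasts <- c(lasts, d[length(d)])
    d <- diff(d)
  }
  lasts <- c(lasts, d[length(d)])

  out <- numeric(n)
  for (i in seq_len(n)) {
    # Update each level from the one below it, deepest first
    for (j in rev(seq_along(lasts))[-1]) {
      lasts[j] <- lasts[j] + lasts[j + 1]
    }
    out[i] <- lasts[1]
  }
  out
}

seq_next_diff(c(3, 7, 13, 21), 3)
# [1] 31 43 57
seq_next_diff(c(100, 75, 45, 5, -50), 5)
# [1] -125 -225 -355 -520 -725
```

If the differences never become constant, the loop stops once a single value is left and treats that value as the constant, which is a guess rather than a detected pattern.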