I have a dataset to which I am trying to apply a ruleset. I would like to find out which rule an observation hit (if any) and record the result.
Here's an example. The first data frame df contains the observations. The second contains the ruleset rs. The third contains the desired result fn.
My question is how do I take the rule strings, apply each one to each observation until I get a match and then record which rule was hit? I would prefer a tidy solution but this seems like it might require a loop. Any insights are appreciated.
df <- data.frame(ID = c("A", "B", "C"),
x1 = c(1, 2, 3),
x2 = c(0L, 1L, 0L))
rs <- data.frame(RID = c(1, 2),
Rule = c("x1 <= 2 & x2 == 0L",
"x1 > 2 & x2 == 0L"))
fn <- data.frame(ID = c("A", "B", "C"),
x1 = c(1, 2, 3),
x2 = c(0L, 1L, 0L),
Rule = c(1, NA, 2))
> df
ID x1 x2
1 A 1 0
2 B 2 1
3 C 3 0
> rs
RID Rule
1 1 "x1 <= 2 & x2 == 0L"
2 2 "x1 > 2 & x2 == 0L"
> fn
ID x1 x2 Rule
1 A 1 0 1
2 B 2 1 NA
3 C 3 0 2
Try this using parse and eval. The output is a list column because more than one rule can match the same row.
library(dplyr)
df %>%
rowwise() %>%
mutate(Rule = list(rs$RID[sapply(rs$Rule, function(x)
eval(parse(text = x)))])) %>%
data.frame()
ID x1 x2 Rule
1 A 1 0 1
2 B 2 1
3 C 3 0 2
Edit: for big data sets, data.table may be worth trying:
library(data.table)
setDT(df)
df[, Rule := rs$RID[sapply(rs$Rule, function(x)
eval(parse(text = x)))], by = 1:NROW(df)]
df
ID x1 x2 Rule
1: A 1 0 1
2: B 2 1 NA
3: C 3 0 2
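If only the first matching rule is needed, each rule string can also be evaluated once over the whole data frame instead of once per row. This is a minimal base R sketch, assuming "first match wins" in the order the rules appear in rs; the names hits and first_hit are illustrative only.

```r
df <- data.frame(ID = c("A", "B", "C"),
                 x1 = c(1, 2, 3),
                 x2 = c(0L, 1L, 0L))
rs <- data.frame(RID = c(1, 2),
                 Rule = c("x1 <= 2 & x2 == 0L",
                          "x1 > 2 & x2 == 0L"))

# One logical column per rule; each rule is evaluated with the columns of df in scope
hits <- do.call(cbind, lapply(rs$Rule, function(r) eval(parse(text = r), envir = df)))

# Index of the first rule that is TRUE in each row (NA if no rule matches)
first_hit <- apply(hits, 1, function(h) which(h)[1])

# Translate the rule position into its RID
df$Rule <- rs$RID[first_hit]
df
```

As with any eval(parse(...)) approach, this assumes the rule strings come from a trusted source.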
Ok this example should clarify what I am looking for
set.seed(123456789)
df <- data.frame(
x1 = sample(c(0,1), size = 10, replace = TRUE),
x2 = sample(c(0,1), size = 10, replace = TRUE),
z1 = sample(c(0,1), size = 10, replace = TRUE)
)
I want to select all rows where x1 and x2 both equal 1. That is,
df[df$x1==1 & df$x2==1,]
which returns
x1 x2 z1
1 1 1 1
4 1 1 1
6 1 1 1
10 1 1 0
but I want to do it in a way that scales to many x variables (e.g. x1,x2,...x40), so I would like to index the columns by "x" rather than having to write df$x1==1 & df$x2==1 &... & df$x40==1. Note that I care about having the z1 variable in the resulting data set (i.e. while the rows are selected based on the x variables, I am not looking to select the x columns only). Is it possible?
A possible solution, based on dplyr:
library(dplyr)
set.seed(123456789)
df <- data.frame(
x1 = sample(c(0,1), size = 10, replace = TRUE),
x2 = sample(c(0,1), size = 10, replace = TRUE),
z1 = sample(c(0,1), size = 10, replace = TRUE)
)
df %>%
filter(if_all(starts_with("x"), ~ .x == 1)) # if_all() replaces the deprecated across() inside filter()
#> x1 x2 z1
#> 1 1 1 1
#> 2 1 1 1
#> 3 1 1 1
#> 4 1 1 0
Here is a base R way with Reduce applied to the data.frame's rows.
cols <- grep("^x", names(df))
i <- apply(df[cols], 1, \(x) Reduce(`&`, x == 1L))
df[i,]
# x1 x2 z1
#1 1 1 1
#4 1 1 1
#6 1 1 1
#10 1 1 0
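Another base R option is to count, per row, how many of the x columns equal 1 and keep the rows where that count equals the number of x columns. The sketch below uses a small hand-made data frame (not the seeded one above) so the result is easy to check by eye; keep and res are illustrative names.

```r
df <- data.frame(x1 = c(1, 0, 1, 1),
                 x2 = c(1, 1, 0, 1),
                 z1 = c(0, 1, 1, 1))

# Positions of all columns whose name starts with "x"
cols <- grep("^x", names(df))

# TRUE for rows in which every x column equals 1
keep <- rowSums(df[cols] == 1) == length(cols)

# All columns are kept, including z1
res <- df[keep, ]
res
```

Here only rows 1 and 4 survive. If the x columns can contain NA, rowSums would need na.rm = TRUE or an explicit decision on how missing values should count.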
I have a data frame whose column names I would like to prefix with "high_" or "low_". Columns X2 to X4 should get the prefix "high_" (e.g. high_X2) and columns X5 to X7 the prefix "low_" (e.g. low_X6).
Please see an example below.
X1 X2 X3 X4 X5 X6 X7
a 1 0 1 1 1 1 0
b 2 2 1 1 1 1 0
result
X1 high_X2 high_X3 high_X4 low_X5 low_X6 low_X7
a 1 0 1 1 1 1 0
b 2 2 1 1 1 1 0
You can use rep and paste -
names(df)[-1] <- paste(rep(c('high', 'low'), each = 3), names(df)[-1], sep = '_')
df
# X1 high_X2 high_X3 high_X4 low_X5 low_X6 low_X7
#a 1 0 1 1 1 1 0
#b 2 2 1 1 1 1 0
If you want to rely on a range of columns, dplyr code would be easier.
library(dplyr)
df %>%
rename_with(~paste('high', ., sep = '_'), X2:X4) %>%
rename_with(~paste('low', ., sep = '_'), X5:X7)
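The same "range of columns" idea can be sketched in base R by looking up the positions of the first and last column name. add_prefix is a hypothetical helper written for this example, not part of any package, and it assumes both names exist and appear in that order.

```r
df <- data.frame(X1 = c(1L, 2L), X2 = c(0L, 2L), X3 = c(1L, 1L),
                 X4 = c(1L, 1L), X5 = c(1L, 1L), X6 = c(1L, 1L),
                 X7 = c(0L, 0L), row.names = c("a", "b"))

# Prefix every name between `from` and `to` (inclusive) with `prefix`
add_prefix <- function(nms, from, to, prefix) {
  idx <- match(from, nms):match(to, nms)
  nms[idx] <- paste(prefix, nms[idx], sep = "_")
  nms
}

names(df) <- add_prefix(names(df), "X2", "X4", "high")
names(df) <- add_prefix(names(df), "X5", "X7", "low")
names(df)
```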
The base solution (which is more straightforward for this kind of thing, imo)
df <- data.frame(X1=c(a=1L,b=2L),
X2=c(a=0L,b=2L),
X3=c(a=1L,b=1L),
X4=c(a=1L,b=1L),
X5=c(a=1L,b=1L),
X6=c(a=1L,b=1L),
X7=c(a=1L,b=1L))
cn <- colnames(df)
cond <- as.integer(substr(cn,2L,nchar(cn))) %% 2L == 0L
colnames(df)[cond] <- paste0(cn[cond],"_is_pair")
A tidyverse solution (a bit more awkward due to the tidyeval)
library(dplyr)
library(stringr)
library(tidyselect)
df <- data.frame(X1=c(a=1L,b=2L),
X2=c(a=0L,b=2L),
X3=c(a=1L,b=1L),
X4=c(a=1L,b=1L),
X5=c(a=1L,b=1L),
X6=c(a=1L,b=1L),
X7=c(a=1L,b=1L))
is_pair <- function(vars = peek_vars(fn = "is_pair")) {
vars[as.integer(str_sub(vars,2L,nchar(vars))) %% 2L == 0L]
}
df %>% rename_with(~paste0(.x,"_is_pair"),
is_pair())
I could not find a way to check whether elements of a categorical vector lie between other categorical elements of the same vector.
A dataframe is given:
id letter
1 B
2 A
3 B
4 B
5 C
6 B
7 A
8 B
9 C
Everything I found is related to numerical values and to the notion of general order (rather than to index of an element in a specific vector).
I want to add a new column with boolean values (1 if B is between A and C; 0 if B is between C and A) to the dataframe,
id letter between
1 B 0
2 A NA
3 B 1
4 B 1
5 C NA
6 B 0
7 A NA
8 B 1
9 C NA
A combination of rle (run length encoding) and zoo::rollapply is one option:
library(zoo)
d <- structure(list(id = 1:9,
letter = structure(c(2L, 1L, 2L, 2L, 3L, 2L, 1L, 2L, 3L),
.Label = c("A", "B", "C"),
class = "factor")),
class = "data.frame", row.names = c(NA, -9L))
rl <- rle(as.numeric(d$letter))
rep(rollapply(c(NA, rl$values, NA),
3,
function(x) if (x[2] == 2)
ifelse(x[1] == 1 && x[3] == 3, 1, 0)
else NA),
rl$lengths)
# [1] 0 NA 1 1 NA 0 NA 1 NA
Explanation
With rle you identify blocks of consecutive values.
With rollapply you "roll" a function with a given window size (here 3) over a vector.
Our vector rl$values contains the value of each block, and the function we apply to it is pretty straightforward:
if the second element is anything but a 2 (corresponding to B) return NA
if the second element is a 2 and element 1 is an A and element 3 is a C return 1 and 0 otherwise
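The two rules above can also be sketched without zoo by shifting the run values one position in each direction. This is a base R illustration on a plain character vector; prev_val, next_val and run_res are names made up for the example.

```r
letter <- c("B", "A", "B", "B", "C", "B", "A", "B", "C")

# Collapse consecutive duplicates into runs
rl <- rle(letter)

# Value of the run before and after each run (NA at the edges)
prev_val <- c(NA, head(rl$values, -1))
next_val <- c(tail(rl$values, -1), NA)

# NA for runs that are not "B"; otherwise 1 if the run sits between A and C, else 0
run_res <- ifelse(rl$values != "B",
                  NA_integer_,
                  as.integer(prev_val %in% "A" & next_val %in% "C"))

# Expand the per-run result back to one value per original element
between <- rep(run_res, rl$lengths)
between
```

Using %in% rather than == means a B run at the very start or end of the vector is coded 0 instead of NA, which matches the desired output in the question.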
A different tidyverse possibility could be:
df %>%
group_by(grp = with(rle(letter), rep(seq_along(lengths), lengths))) %>%
filter(row_number() == 1) %>%
ungroup() %>%
mutate(res = ifelse(lag(letter, default = first(letter)) == "A" &
lead(letter, default = last(letter)) == "C", 1, 0)) %>%
select(-letter, -grp) %>%
full_join(df, by = c("id" = "id")) %>%
arrange(id) %>%
fill(res) %>%
mutate(res = ifelse(letter != "B", NA, res))
id res letter
<int> <dbl> <chr>
1 1 0 B
2 2 NA A
3 3 1 B
4 4 1 B
5 5 NA C
6 6 0 B
7 7 NA A
8 8 1 B
9 9 NA C
First, it groups by a run-length ID and keeps the first row of each run. Second, it checks the condition. Third, it performs a full join with the original df on the "id" column. Finally, it arranges by "id", fills the missing values and assigns NA to rows where "letter" is not "B".
It's unclear from the question whether "A" and "C" must alternate, though that's implied because there is no coding for "B" between "A" and "A" or vice versa. Supposing that they do, for the vector
x = c("B", "A", "B", "B", "C", "B", "A", "B", "C")
map to numeric values c(A=1, B=0, C=-1) and form the cumulative sum
v = cumsum(c(A=1, B=0, C=-1)[x])
(increment by 1 when encountering "A", decrement by one when "C"). Replace positions not corresponding to "B" with NA
v[x != "B"] = NA
giving
> v
B A B B C B A B C
0 NA 1 1 NA 0 NA 1 NA
This could be captured as a function
fun = function(x, map = c(A = 1, B = 0, C = -1)) {
x = map[x]
v = cumsum(x)
v[x != 0] = NA
v
}
and used to transform a data.frame or tibble, e.g.,
tibble(x) %>% mutate(v = fun(x))
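For a data frame whose letter column is a factor, it may be safer to convert to character before indexing the map, since indexing a named vector with a factor uses the integer level codes rather than the labels. The sketch below is a variant under the same alternating assumption; between_ac is an illustrative name.

```r
between_ac <- function(x, map = c(A = 1, B = 0, C = -1)) {
  x <- as.character(x)   # index the map by label, not by factor code
  steps <- map[x]        # +1 at "A", 0 at "B", -1 at "C"
  v <- cumsum(steps)     # 1 while "inside" an A...C stretch, 0 otherwise
  v[x != "B"] <- NA      # only "B" rows carry a result
  unname(v)
}

df <- data.frame(id = 1:9,
                 letter = factor(c("B", "A", "B", "B", "C", "B", "A", "B", "C")))
df$between <- between_ac(df$letter)
df
```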
Here's one solution, which I hope is fairly easy conceptually. For 'special' cases such as B being at the top or bottom of the list, or having an A or a C on both sides, I've set such values to 0.
# Create dummy data - you use your own
df <- data.frame(id=1:100, letter=sample(c("A", "B", "C"), 100, replace=TRUE))
# Copy down info on whether A or C is above each B
acup <- df$letter
for(i in 2:nrow(df))
if(df$letter[i] == "B")
acup[i] <- acup[i-1]
# Copy up info on whether A or C is below each B
acdown <- df$letter
for(i in (nrow(df) - 1):1) # same sequence as nrow(df):2 - 1, written without relying on operator precedence
if(df$letter[i] == "B")
acdown[i] <- acdown[i+1]
# Set appropriate values for column 'between'
df$between <- NA
df$between[acup == "A" & acdown == "C"] <- 1
df$between[df$letter == "B" & is.na(df$between)] <- 0 # Includes special cases
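You can use the lead and lag functions to look at the letters immediately before and after each row and then mutate as below. Note that this only inspects immediate neighbours, so a B in the middle of a run of three or more B's would get NA: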
library(dplyr)
df %>%
mutate(letter_lag = lag(letter, 1),
letter_lead = lead(letter, 1)) %>%
mutate(between = case_when(letter_lag == "A" | letter_lead == "C" ~ 1,
letter_lag == "C" | letter_lead == "A" ~ 0,
TRUE ~ NA_real_)) %>%
select(id, letter, between)
id letter between
1 1 B 0
2 2 A NA
3 3 B 1
4 4 B 1
5 5 C NA
6 6 B 0
7 7 A NA
8 8 B 1
9 9 C NA
Given this data, the first 4 columns (rowid, order, line, special), I need to create a column, numSpecial as such:
rowid order line special numSpecial
1 A 01 X 1
2 B 01 0
3 B 02 X 2
4 B 03 X 2
5 C 01 X 1
6 C 02 0
Where numSpecial is the number of special lines (special = X) within each order, assigned only to lines that are themselves special; otherwise it is 0.
I first tried adding a column that simply concats 'order' with 'X', call it orderX, and would look like:
orderX
AX
BX
BX
BX
CX
CX
Then do a sum of order & special in orderX:
df$numSpecial <- sum(paste(order, special, sep = "") %in% orderX)
But that doesn't work; it returns the total over all rows, repeated on every row:
numSpecial
4
4
4
4
4
4
I then tried as.data.table, but I'm not getting the expected results using:
as.data.table(mydf)[, numSpecial := sum(paste(order, special, sep = "") %in% orderX), by = rowid]
However, that returns just 1 or 0 for each row rather than the per-order sums:
numSpecial
1
0
1
1
1
0
Where am I going wrong with these? I shouldn't have to create that orderX column either I don't think, but I can't figure out the way to get this count right. It's similar to a countif in excel which is easy to do.
There are probably several ways, but you could just multiply the per-order count by a TRUE/FALSE flag for "X" being present:
dat[, numSpecial := sum(special == "X") * (special == "X"), by=order]
dat
# rowid order line special numSpecial
#1: 1 A 1 X 1
#2: 2 B 1 0
#3: 3 B 2 X 2
#4: 4 B 3 X 2
#5: 5 C 1 X 1
#6: 6 C 2 0
You could also do it a bit differently like:
dat[, numSpecial := 0L][special == "X", numSpecial := .N, by=order]
Where dat was:
library(data.table)
dat <- structure(list(rowid = 1:6, order = c("A", "B", "B", "B", "C",
"C"), line = c(1L, 1L, 2L, 3L, 1L, 2L), special = c("X", "",
"X", "X", "X", "")), .Names = c("rowid", "order", "line", "special"
), row.names = c(NA, -6L), class = "data.frame")
setDT(dat)
You could use ave with a dummy variable (just filled with 1s):
df$numSpecial <- ifelse(df$special == "X", ave(rep(1,nrow(df)), df$order, df$special, FUN = length), 0)
df
# rowid order line special numSpecial
#1 1 A 1 X 1
#2 2 B 1 0
#3 3 B 2 X 2
#4 4 B 3 X 2
#5 5 C 1 X 1
#6 6 C 2 0
Note I read in your data without the numSpecial column.
Using the dplyr package:
library(dplyr)
df %>% group_by(order) %>%
mutate(numSpecial = ifelse(special=="X", sum(special=="X"), 0))
rowid order special numSpecial
1 1 A X 1
2 2 B 0
3 3 B X 2
4 4 B X 2
5 5 C X 1
6 6 C 0
One other option using base R only would be to use aggregate:
# Your data
df <- data.frame(rowid = 1:6, order = c("A", "B", "B", "B", "C", "C"), special = c("X", "", "X", "X", "X", ""))
# Make the counts
dat <- with(df,aggregate(x=list(answer=special),by=list(order=order,special=special),FUN=function(x) sum(x=="X")))
# Merge back to original dataset:
dat.fin <- merge(df,dat,by=c('order','special'))
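Note that merge returns the rows sorted by the key columns, so the original row order is lost. A possible variant, sketched here under the assumption that rowid reflects the original order, names the count column numSpecial directly and restores the order afterwards; counts and res are illustrative names.

```r
df <- data.frame(rowid = 1:6,
                 order = c("A", "B", "B", "B", "C", "C"),
                 special = c("X", "", "X", "X", "X", ""))

# Count the "X" rows per (order, special) combination; non-special groups get 0
counts <- aggregate(x = list(numSpecial = df$special == "X"),
                    by = list(order = df$order, special = df$special),
                    FUN = sum)

# Attach the counts to every row, then put the rows back in their original order
res <- merge(df, counts, by = c("order", "special"))
res <- res[order(res$rowid), ]
res
```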
I want to transform a factor column of an R data frame into indicator variables, using another column as the index.
Given the following representation
StudentID Subject
1 A
1 B
2 A
2 C
3 A
3 B
I need the following representation, using StudentID as the index
StudentID SubjectA SubjectB SubjectC
1 1 1 0
2 1 0 1
3 1 1 0
We can use table
table(df1)
# Subject
#StudentID A B C
# 1 1 1 0
# 2 1 0 1
# 3 1 1 0
If we need a data.frame
as.data.frame.matrix(table(df1))
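To get closer to the exact layout asked for (a StudentID column plus SubjectA, SubjectB, SubjectC), the table result can be post-processed. This is a sketch assuming df1 has exactly the two columns shown in the question; ind is an illustrative name, and counts are capped at 1 in case a student/subject pair appears more than once.

```r
df1 <- data.frame(StudentID = c(1, 1, 2, 2, 3, 3),
                  Subject = c("A", "B", "A", "C", "A", "B"))

# Cross-tabulate students against subjects and turn the result into a data frame
ind <- as.data.frame.matrix(table(df1$StudentID, df1$Subject))

# Convert counts to 0/1 indicators
ind[] <- lapply(ind, function(v) as.integer(v > 0))

# Prefix the subject columns and move the student IDs from row names into a column
names(ind) <- paste0("Subject", names(ind))
ind <- cbind(StudentID = as.numeric(rownames(ind)), ind)
rownames(ind) <- NULL
ind
```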
Here's how I got it, using dcast from reshape2 as suggested in the comment above
library(reshape2)
ID <- c(1, 1, 2, 2, 3, 3)
Subject <- c('A', 'B', 'A', 'C', 'A', 'B')
data <- data.frame(ID, Subject)
data <- dcast(data, ID ~ Subject)
data[is.na(data)] <- 0
f <- function(x) {
x <- gsub('[A-Z]', 1, x)
}
as.data.frame(apply(data, 2, f))
# ID A B C
#1 1 1 1 0
#2 2 1 0 1
#3 3 1 1 0
Now that I look at this solution it may not be very efficient. But it is much more dynamic than some other solutions. There might also be a way to use data.table directly but I cannot figure it out. This might help though:
library(data.table)
df <- structure(list(StudentID = c(1, 1, 2, 2, 3, 3),
Subject = structure(c(1L,
2L, 1L, 3L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("StudentID",
"Subject"), row.names = c(NA, -6L), class = "data.frame")
df <- data.table(df)
### here we pull the unique subjects; these become the indicator columns
subjects <- as.character(unique(df$Subject))
### here we group by student ID and paste together the subjects each student has
x <- df[,list("Values"=paste(Subject,collapse="_")),by=StudentID]
### then we go through each student and match the unique subjects against theirs
tmp <- strsplit(x$Values,"_")
res <- do.call(rbind,lapply(tmp,function(i) match(subjects,i)))
### change the results to the indicator variable desired
res[!is.na(res)] <- 1
res[is.na(res)] <- 0
res <- data.frame("StudentID"=x$StudentID,res)
colnames(res) <- c("StudentID",subjects)