Here is the data set for reproducibility:
a=c(90.41,37.37,18.98)
b=c(103.39,39.44,51.68)
c=c(83.51,36.41,47.46)
d=c(94.60,38.57,50.22)
e=c(95.04,38.81,50.49)
xx=t(data.frame(a,b,c,d,e))
df=data.frame(xx)
And here is the if/else function I am trying to run on the data frame:
classify=function(df){
if (df>=110) {
class="5"}
else if (df<110 & df>=103){
class="4"}
else if (df<103 & df>=95){
class="3"}
else if (df<95 & df>=76){
class="2"}
else if (df<76){
class="1"}
else {class="none"}
}
However, what I want the if/else function to produce is a new data frame that looks like this:
df
X1 X2 X3
a 2 1 1
b 4 1 1
c 2 1 1
d 2 1 1
e 3 1 1
I am unsure as to how to do this so any help would be super appreciated. I anticipate something is wrong in the if/else function itself but I am quite inexperienced and I don't know how to detect errors in the script that easily. Thank you!
sapply(df, function(x) as.numeric(as.character(cut(x, c(-Inf, 76, 95, 103, 110, Inf), labels = 1:5, right = FALSE))))
X1 X2 X3
[1,] 2 1 1
[2,] 4 1 1
[3,] 2 1 1
[4,] 2 1 1
[5,] 3 1 1
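Use cut to set the intervals (its 2nd argument) and the labels (its 3rd argument). Note that cut closes intervals on the right by default, so pass right = FALSE if a value that sits exactly on a break (e.g. 95) should go to the higher class, as your >= conditions require. cut returns a factor, so convert it back to numeric if you prefer numbers. Since you want to run the function over the full data frame, use sapply or lapply.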
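To make the boundary behaviour concrete, here is a minimal sketch. The test values in x are made up for illustration and are not part of the question's data; they sit on or next to the break points so the difference between right-closed and left-closed intervals is visible.

```r
# Hypothetical test values placed on or next to the break points
x <- c(75.9, 76, 95, 103, 110)
breaks <- c(-Inf, 76, 95, 103, 110, Inf)

# Default: intervals are (lower, upper], so 76 still falls in class 1
right_closed <- as.integer(cut(x, breaks, labels = 1:5))

# right = FALSE: intervals are [lower, upper), matching the >= conditions
left_closed <- as.integer(cut(x, breaks, labels = 1:5, right = FALSE))

# findInterval uses left-closed intervals by default
by_find_interval <- findInterval(x, breaks)

print(rbind(x, right_closed, left_closed, by_find_interval))
```

With the question's data no value lies exactly on a break, so both settings give the same table there; the distinction only matters for values such as 76 or 95.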
You can do this with findInterval. All you have to do is to pass it a non-decreasing vector of break points.
classify <- function(DF, breaks = c(-Inf, 76, 95, 103, 110, Inf)){
f <- function(x, breaks) findInterval(x, breaks)
DF[] <- lapply(DF, f, breaks)
DF
}
classify(df)
# X1 X2 X3
#a 2 1 1
#b 4 1 1
#c 2 1 1
#d 2 1 1
#e 3 1 1
Quite a similar approach to your example, using case_when from dplyr:
library(dplyr)
classify <- function(x){
case_when(
x >= 110 ~ "5",
x >= 103 & x < 110 ~ "4",
x >= 95 & x < 103 ~ "3",
x >= 76 & x < 95 ~ "2",
x < 76 ~ "1",
TRUE ~ "none"
)
}
a = c(90.41, 37.37, 18.98)
b = c(103.39, 39.44, 51.68)
c = c(83.51, 36.41, 47.46)
d = c(94.60, 38.57, 50.22)
e = c(95.04, 38.81, 50.49)
df <- data.frame(matrix(c(a, b, c, d, e), ncol = 3, byrow = TRUE))
mutate_all(df, classify)
# X1 X2 X3
#1 2 1 1
#2 4 1 1
#3 2 1 1
#4 2 1 1
#5 3 1 1
If df also contains special values (-Inf, Inf, NaN, NA), for example:
df
# X1 X2 X3
#1 -Inf 37.37 18.98
#2 103.39 NaN 51.68
#3 83.51 36.41 47.46
#4 94.60 Inf 50.22
#5 95.04 38.81 NA
The results look like this:
mutate_all(df, classify)
# X1 X2 X3
#1 1 1 1
#2 4 none 1
#3 2 1 1
#4 2 5 1
#5 3 1 none
I have a "small" square matrix that I want to add to a "big" matrix. The big matrix contains all the rows and columns of the small matrix plus extras. I want to add the values where the indices are in common and just keep the values from the big one where that index is not contained in the small one. Unfortunately, all the data is copied on the addition so it takes a long time and can temporarily spike memory when the matrices are large.
I have tried adding subsets using matrices and data.frames, as well as a data.table method using rbindlist. Both the data.frame and matrix methods seem to cause a memory copy (why?), and the rbindlist method is not ideal because it requires a melt and a dcast, which temporarily spikes memory by inflating the number of rows.
Is there any way to just change the values of some items in a matrix without causing a copy of the entire matrix?
Here are my attempts:
MList <- list(M1,M2)
unionCols <- Reduce(union, lapply(MList, colnames))
MTotal <- matrix(as.double(rep(0,(length(unionCols))^2)), nrow = length(unionCols))
rownames(MTotal) <- colnames(MTotal) <- unionCols
DFTotal <- as.data.frame(MTotal)
DFList <- lapply(MList, as.data.frame)
for(i in 1:length(MList)){
tracemem(MTotal)
tracemem(DFTotal)
mCol <- match(colnames(MList[[i]]), colnames(MTotal))
MTotal[mCol,mCol] <- MTotal[mCol,mCol] + MList[[i]] # this causes a copy
DFTotal[mCol,mCol] <- DFTotal[mCol,mCol] + DFList[[i]] # this causes a copy
}
M1
M2
MTotal
# rbindlist method
.AggDMCMatsSingleM2 <- function(M1, M2){
.MyMelt <- function(M){
DT <- setnames(reshape2::melt(M, id.vars = colnames(M)), c('Var1','Var2'), c('row','col'))
}
M_total <- as.matrix(data.table::dcast(rbindlist(lapply(list(M1,M2), .MyMelt)),
formula = as.formula(row ~ col),
value.var = 'value',
fun.aggregate = sum,
fill = 0),
rownames = 'row')
return(M_total)
}
M1
M2
.AggDMCMatsSingleM2(M1,M2)
If I follow what you are asking, we can add directly into the big matrix by indexing it with the row and column names of the small matrix:
big_matrix<-matrix(data=rep(1, 25), nrow=5,
dimnames = list(c(LETTERS[1:5]),
c(letters[1:5])))
# a b c d e
#A 1 1 1 1 1
#B 1 1 1 1 1
#C 1 1 1 1 1
#D 1 1 1 1 1
#E 1 1 1 1 1
small_matrix<-matrix(data=c(1:9), nrow=3,
dimnames = list(c(LETTERS[2:4]),
c(letters[2:4])))
# b c d
#B 1 4 7
#C 2 5 8
#D 3 6 9
big_matrix[rownames(small_matrix), colnames(small_matrix)] <-
big_matrix[rownames(small_matrix), colnames(small_matrix)] + small_matrix
# a b c d e
#A 1 1 1 1 1
#B 1 2 5 8 1
#C 1 3 6 9 1
#D 1 4 7 10 1
#E 1 1 1 1 1
More complex test:
big_matrix<-matrix(data=rep(1, 25), nrow=5,
dimnames = list(c(LETTERS[1:5]),
c(letters[1:5])))
# a b c d e
#A 1 1 1 1 1
#B 1 1 1 1 1
#C 1 1 1 1 1
#D 1 1 1 1 1
#E 1 1 1 1 1
small_matrix<-matrix(data=c(1:9), nrow=3,
dimnames = list(c("A", "D", "C"),
c(letters[c(2:4)])))
# b c d
#A 1 4 7
#D 2 5 8
#C 3 6 9
big_matrix[rownames(small_matrix), colnames(small_matrix)] <-
big_matrix[rownames(small_matrix), colnames(small_matrix)] + small_matrix
big_matrix
# a b c d e
#A 1 2 5 8 1
#B 1 1 1 1 1
#C 1 4 7 10 1
#D 1 3 6 9 1
#E 1 1 1 1 1
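The same name-based indexing can be wrapped in a loop when there are several small matrices, as in the question's MList. This is only a sketch: add_into, s1 and s2 are hypothetical names invented for illustration, and the sketch does not by itself guarantee that R avoids copying the big matrix (whether a copy happens depends on how many references to the object exist at the time of the assignment).

```r
# Hypothetical helper: add every small matrix into `big` by dimnames
add_into <- function(big, smalls) {
  for (m in smalls) {
    r <- rownames(m)
    cl <- colnames(m)
    # Only the cells named in the small matrix are updated
    big[r, cl] <- big[r, cl] + m
  }
  big
}

big <- matrix(1, nrow = 5, ncol = 5,
              dimnames = list(LETTERS[1:5], letters[1:5]))
s1 <- matrix(1:9, nrow = 3,
             dimnames = list(LETTERS[2:4], letters[2:4]))
# Rows and columns given out of order on purpose
s2 <- matrix(1:4, nrow = 2,
             dimnames = list(c("E", "A"), c("a", "e")))

res <- add_into(big, list(s1, s2))
print(res)
```

Because the indexing is by name, the small matrices may list their rows and columns in any order, as long as every name also exists in the big matrix.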
I was wondering whether there is a way to automate (e.g., with a loop) the subtractions (X2-X1), (X3-X1), (X3-X2) in my data below and add the results as three new columns to the data?
m="
id X1 X2 X3
A 1 0 4
B 2 2 2
C 3 4 1"
data <- read.table(text = m, h = T)
This is very similar to this question; we basically just need to change the function that we are using in map2_dfc:
library(tidyverse)
combn(names(data)[-1], 2) %>%
map2_dfc(.x = .[1,], .y = .[2,],
.f = ~transmute(data, !!paste0(.y, "-", .x) := !!sym(.y) - !!sym(.x))) %>%
bind_cols(data, .)
#> id X1 X2 X3 X2-X1 X3-X1 X3-X2
#> 1 A 1 0 4 -1 3 4
#> 2 B 2 2 2 0 0 0
#> 3 C 3 4 1 1 -2 -3
With combn:
dif <- combn(data[-1], 2, \(x) x[, 2] - x[, 1])
colnames(dif) <- combn(names(data)[-1], 2, \(x) paste(x[2], x[1], sep = "-"))
cbind(data, dif)
# id X1 X2 X3 X2-X1 X3-X1 X3-X2
#1 A 1 0 4 -1 3 4
#2 B 2 2 2 0 0 0
#3 C 3 4 1 1 -2 -3
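Since the question asks about a loop, here is a minimal base R sketch of the same idea written as an explicit for loop over the column pairs. The data frame is rebuilt inline so the block is self-contained; out, lo and hi are names chosen here for illustration.

```r
# Same data as in the question, built directly
data <- data.frame(id = c("A", "B", "C"),
                   X1 = c(1, 2, 3),
                   X2 = c(0, 2, 4),
                   X3 = c(4, 2, 1))

# Each column of `pairs` holds one pair of column names: (X1,X2), (X1,X3), (X2,X3)
pairs <- combn(names(data)[-1], 2)

out <- data
for (k in seq_len(ncol(pairs))) {
  lo <- pairs[1, k]
  hi <- pairs[2, k]
  # New column named e.g. "X2-X1" holding the later column minus the earlier one
  out[[paste(hi, lo, sep = "-")]] <- data[[hi]] - data[[lo]]
}
print(out)
```

The loop scales to any number of X columns because combn enumerates the pairs.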
I have, for example, a dataset like this:
data <- data.frame(matrix(c(1,2,2,3,4,5,5,"a","a","b","a","a","a","b"), nrow = 7, ncol = 2, byrow = F))
X1 X2
1 a
2 a
2 b
3 a
4 a
5 a
5 b
then I add another variable "tag", initially set to 0.
data$tag <- 0
X1 X2 tag
1 a 0
2 a 0
2 b 0
3 a 0
4 a 0
5 a 0
5 b 0
I'd like to have "tag" equal to 1 for each row that is repeated, like:
X1 X2 tag
1 a 0
2 a 1
2 b 1
3 a 0
4 a 0
5 a 1
5 b 1
I used the following code:
for (i in data$X1) {
for (j in 1:length(data$X1)) {
if (j==2) {data$tag[j] <- 1}
}
}
but it doesn't work the way I would like. I'd like the second loop (j) to run inside the first one, with j starting from 1 again every time X1 changes.
How can I manage it?
Thanks a lot
Maybe you can try ave
within(
data,
tag <- +(ave(X1, X1, FUN = length) > 1)
)
which gives
X1 X2 tag
1 1 a 0
2 2 a 1
3 2 b 1
4 3 a 0
5 4 a 0
6 5 a 1
7 5 b 1
You can use duplicated from both the ends in base R :
data$tag <- as.integer(duplicated(data$X1) |
duplicated(data$X1, fromLast = TRUE))
data
# X1 X2 tag
#1 1 a 0
#2 2 a 1
#3 2 b 1
#4 3 a 0
#5 4 a 0
#6 5 a 1
#7 5 b 1
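To see why both directions are needed, here is a small sketch on the X1 values alone (x, fwd and bwd are names used only for this illustration). duplicated on its own flags the second and later occurrences, and fromLast = TRUE flags the first and earlier ones, so combining them with | marks every member of a repeated group.

```r
x <- c(1, 2, 2, 3, 4, 5, 5)

fwd <- duplicated(x)                   # marks later occurrences: rows 3 and 7
bwd <- duplicated(x, fromLast = TRUE)  # marks earlier occurrences: rows 2 and 6

tag <- as.integer(fwd | bwd)
print(data.frame(x, fwd, bwd, tag))
```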
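An option with add_count, which counts the rows per X1 value; the count is then turned into the 0/1 tag and the helper column n is dropped:
library(dplyr)
data %>%
add_count(X1) %>%
mutate(tag = +(n > 1), n = NULL)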
I have this sample data frame where columns a to d are reference columns and columns x1 to x3 need to be filled with new values.
Here is the code to reproduce the data frame:
df1 <- data.frame(a = c(0,1,0,1), b = c(0,0,1,1), c = c(0,0,0,0), d =
c(1,0,0,1), x1= c(NA, NA, NA, NA), x2= c(NA, NA, NA, NA), x3= c(NA, NA, NA, NA))
I want to give new values to x1 -x3 based on different value combination from column a, b, c, d. My pseudocode is as follows:
for df1[ , "x1"]:
if a = 1: then return 1
else: return 0
for df1[ , "x2"]:
if a = 1 & b = 1: then return 1
else: return 0
for df1[ , "x3"]:
all conditions: return 1
Ideally, all the values in x1 and x2 will be changed according to their given conditions. x3 should be filled with 1 no matter what. Can anyone suggest an efficient way to loop and parse through those columns, please?
You don't need loops:
df1$x1 <- df1$a
df1$x2 <- as.integer(df1$a & df1$b)
df1$x3 <- 1
Result:
a b c d x1 x2 x3
1 0 0 0 1 0 0 1
2 1 0 0 0 1 0 1
3 0 1 0 0 0 0 1
4 1 1 0 1 1 1 1
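A short sketch of why this works, and of its limits. The vectors a and b here are made-up values (including a non-binary 3) rather than the question's data. The & operator treats any nonzero number as TRUE, and as.integer turns the logical result back into 0/1. Copying a column directly, as done for x1, only yields a 0/1 flag when that column is already binary.

```r
a <- c(0, 1, 3)
b <- c(1, 1, 0)

and_flag <- as.integer(a & b)   # nonzero counts as TRUE
x1_copy  <- a                   # plain copy: keeps the 3
x1_flag  <- as.integer(a == 1)  # explicit test for "a equals 1"

print(data.frame(a, b, and_flag, x1_copy, x1_flag))
```

For non-binary columns an explicit comparison such as a == 1 states the intended condition unambiguously.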
Edit:
If columns a-d are not binary values (0 or 1) you still can use the same expressions to create columns x1-3. Let's say you have this data frame:
a b c d x1 x2 x3
1 0 0 1 5 NA NA NA
2 3 9 2 1 NA NA NA
3 4 2 3 5 NA NA NA
4 2 1 4 1 NA NA NA
And your conditions are:
x1 = 1 if (b >= 2) and (d < 4), 0 otherwise
x2 = 1 if (a > b) and (b < d), 0 otherwise
x3 = always 1
You can use the same methodology:
df1$x1 <- as.integer(df1$b >= 2 & df1$d < 4)
df1$x2 <- as.integer(df1$a > df1$b & df1$b < df1$d)
df1$x3 <- 1
Result:
a b c d x1 x2 x3
1 0 0 1 5 0 0 1
2 3 9 2 1 1 0 1
3 4 2 3 5 0 1 1
4 2 1 4 1 0 0 1
Suppose I have the data.frame below where treat == 1 means that the id received treatment and prob is the calculated probability that treat == 1.
set.seed(1)
df <- data.frame(id = 1:10, treat = sample(0:1, 10, replace = T))
df$prob <- ifelse(df$treat, rnorm(10, .8, .1), rnorm(10, .4, .4))
df
id treat prob
1 1 0 0.3820266
2 2 0 0.3935239
3 3 1 0.8738325
4 4 1 0.8575781
5 5 0 0.6375605
6 6 1 0.9511781
7 7 1 0.8389843
8 8 1 0.7378759
9 9 1 0.5785300
10 10 0 0.6479303
To minimize selection bias, I now wish to create pseudo treatment and control groups on the basis of the values of treat and prob:
When any id with treat == 1 is within 0.1 prob of any id with treat == 0, I want the value of group to be "treated".
When any id with treat == 0 is within 0.1 prob of any id with treat == 1, I want the value of group to be "control".
Below is an example of what I'd like the result to be.
df$group <- c(NA, NA, NA, NA, 'control', NA, NA, 'treated', 'treated', 'control')
df
id treat prob group
1 1 0 0.3820266 <NA>
2 2 0 0.3935239 <NA>
3 3 1 0.8738325 <NA>
4 4 1 0.8575781 <NA>
5 5 0 0.6375605 control
6 6 1 0.9511781 <NA>
7 7 1 0.8389843 <NA>
8 8 1 0.7378759 treated
9 9 1 0.5785300 treated
10 10 0 0.6479303 control
How would I go about doing this? In the example above, matching is done with replacements, but a solution without replacements would be welcome, too.
You can try
foo <- function(x){
ctrl_prob <- x$prob[x$treat == 0] # probabilities of the untreated ids
trt_prob <- x$prob[x$treat == 1] # probabilities of the treated ids
tmp <- sapply(seq_len(nrow(x)), function(y){
# compare against every prob of the other group, not only its min and max
if(x$treat[y] == 1){
if(any(abs(x$prob[y] - ctrl_prob) <= 0.1)) "treated" else NA_character_
}else{
if(any(abs(x$prob[y] - trt_prob) <= 0.1)) "control" else NA_character_
}})
cbind(x, group = tmp)
}
foo(df)
id treat prob group
1 1 0 0.3820266 <NA>
2 2 0 0.3935239 <NA>
3 3 1 0.8738325 <NA>
4 4 1 0.8575781 <NA>
5 5 0 0.6375605 control
6 6 1 0.9511781 <NA>
7 7 1 0.8389843 <NA>
8 8 1 0.7378759 treated
9 9 1 0.5785300 treated
10 10 0 0.6479303 control
I think this problem is well suited for cut in base R. Here is how you can do it in a vectorized way:
f <- function(r) {
x <- cut(df[r,]$prob, breaks = c(df[!r,]$prob-0.1, df[!r,]$prob+0.1))
df[r,][!is.na(x),]$id
}
ones <- df$treat==1
df$group <- NA
df[df$id %in% f(ones),]$group <- "treated"
df[df$id %in% f(!ones),]$group <- "control"
> df
# id treat prob group
# 1 1 0 0.3820266 <NA>
# 2 2 0 0.3935239 <NA>
# 3 3 1 0.8738325 <NA>
# 4 4 1 0.8575781 <NA>
# 5 5 0 0.6375605 control
# 6 6 1 0.9511781 <NA>
# 7 7 1 0.8389843 <NA>
# 8 8 1 0.7378759 treated
# 9 9 1 0.5785300 treated
# 10 10 0 0.6479303 control
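One caveat about pooling all lower and upper bounds into a single breaks vector: cut sorts the breaks and labels every value that lies between the smallest and the largest of them, so a value sitting in a gap between two distant probabilities of the other group would still count as a match. The sketch below is an alternative, not the answer's method: it compares all pairs with outer. The function name match_groups and the tol argument are hypothetical, and the demo data frame copies the probabilities printed in the question rather than relying on set.seed.

```r
# Hypothetical helper: flag ids whose prob is within `tol` of any id in the other group
match_groups <- function(d, tol = 0.1) {
  is_trt <- d$treat == 1
  # Rows correspond to treated ids, columns to untreated ids
  close <- abs(outer(d$prob[is_trt], d$prob[!is_trt], "-")) <= tol
  group <- rep(NA_character_, nrow(d))
  group[is_trt][rowSums(close) > 0] <- "treated"
  group[!is_trt][colSums(close) > 0] <- "control"
  d$group <- group
  d
}

# Values copied from the question's printed data frame
demo <- data.frame(
  id = 1:10,
  treat = c(0, 0, 1, 1, 0, 1, 1, 1, 1, 0),
  prob = c(0.3820266, 0.3935239, 0.8738325, 0.8575781, 0.6375605,
           0.9511781, 0.8389843, 0.7378759, 0.5785300, 0.6479303)
)
res <- match_groups(demo)
print(res)

# Gap case (made-up values): 0.5 is not within 0.1 of either 0.1 or 0.9
gap <- match_groups(data.frame(id = 1:3, treat = c(0, 0, 1),
                               prob = c(0.1, 0.9, 0.5)))
print(gap)
```

On the question's data this gives the same groups as shown above; the gap example is where the two approaches would differ. The pairwise matrix has one cell per treated/untreated pair, so memory grows with the product of the two group sizes.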
Perhaps not the most elegant but it seems to work for me:
library(dplyr)
df %>% group_by(id,treat) %>% mutate(group2 = ifelse(treat==1,
ifelse(any(abs(prob-df[df$treat==0,3])<0.1),"treated",NA_character_), # treated id vs. probs of treat==0
ifelse(any(abs(prob-df[df$treat==1,3])<0.1),"control",NA_character_))) # untreated id vs. probs of treat==1
Is this what you want?
#Base R:
apply(df[df$treat == 1, ],1, function(x){
ifelse(any(df[df$treat == 0, 'prob'] -.1 < x[3] & x[3] < df[df$treat == 0, 'prob'] +.1), 'treated', NA)
})
You can invert the treat condition (swap treat == 1 and treat == 0) to get the control group, and then attach both results to your df.