I have a small problem and hope someone can help me.
I have a dataframe like this:
df <- data.frame(foo = 1:20, bar = c(0,0,1,0,0,0,1,2,0,0,1,2,3,0,0,0,1,2,3,4))
and want to have a result like this:
df_result <- data.frame(foo = 1:20, bar = c(0,0,1,0,0,0,2,2,0,0,3,3,3,0,0,0,4,4,4,4))
How do I do this without using a while loop?
Using ave in base R:
with(df, as.integer(bar > 0) * (ave(bar, cumsum(bar == 0), FUN = max)))
#[1] 0 0 1 0 0 0 2 2 0 0 3 3 3 0 0 0 4 4 4 4
where cumsum(bar == 0) creates the groups (a new group starts at every 0), ave computes the maximum within each group, and as.integer(bar > 0) keeps the values that are 0 as 0.
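To make the three pieces easier to follow, here is a minimal sketch that computes each intermediate vector separately. The names grp, grp_max and result are only illustrative and do not appear in the answer above.

```r
df <- data.frame(foo = 1:20,
                 bar = c(0,0,1,0,0,0,1,2,0,0,1,2,3,0,0,0,1,2,3,4))

# A new group id starts at every zero, so each run 0,1,2,... shares one id
grp <- cumsum(df$bar == 0)

# Maximum of bar within each group, repeated for every row of that group
grp_max <- ave(df$bar, grp, FUN = max)

# Rows where bar is 0 are multiplied by FALSE (0), so they stay 0
result <- grp_max * (df$bar > 0)
result
```

The zero that opens each run belongs to the same group as the run that follows it, which is why the final multiplication is needed to reset it to 0.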
I have got a data set that looks like this:
COMPANY DATABREACH CYBERBACKGROUND
A 1 2
B 0 2
C 0 1
D 0 2
E 1 1
F 1 2
G 0 2
H 0 2
I 0 2
J 0 2
Now I want to do the following: in 40% of the cases where the column DATABREACH has the value 1, I want CYBERBACKGROUND to take the value 2. I figure there must be some function to do this, but I cannot find it.
ind <- which(df$DATABREACH == 1)
ind <- ind[rbinom(length(ind), 1, prob = 0.4) > 0]
df$CYBERBACKGROUND[ind] <- 2
The above is a bit more efficient in that it draws random numbers only for the rows that strictly need them. If you aren't concerned (11000 doesn't seem too high), you can reduce that to
df$CYBERBACKGROUND <-
ifelse(df$DATABREACH == 1 & rbinom(nrow(df), 1, prob = 0.4) > 0,
2, df$CYBERBACKGROUND)
We may use sample to pick exactly 40% (rounded up) of the rows where DATABREACH is 1:
library(dplyr)
df1 <- df1 %>%
mutate(CYBERBACKGROUND = replace(CYBERBACKGROUND,
sample(which(DATABREACH == 1), ceiling(sum(DATABREACH) * 0.4)), 2))
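The two styles of answer differ in one respect that may matter: rbinom gives each breached row an independent 40% chance, so the number of changed rows varies from run to run, whereas sample changes a fixed number of rows. Below is a minimal base R sketch of the fixed-count variant, using the data from the question. The seed and the names breached, n_pick and picked are assumptions made for illustration.

```r
set.seed(1)  # hypothetical seed, only so the sketch is reproducible

df <- data.frame(
  COMPANY         = LETTERS[1:10],
  DATABREACH      = c(1, 0, 0, 0, 1, 1, 0, 0, 0, 0),
  CYBERBACKGROUND = c(2, 2, 1, 2, 1, 2, 2, 2, 2, 2)
)

breached <- which(df$DATABREACH == 1)      # candidate rows
n_pick   <- ceiling(0.4 * length(breached)) # exactly 40%, rounded up

# Sample positions rather than values: sample(x, n) treats a single
# number x as 1:x, which would be wrong if only one row were breached
picked <- breached[sample.int(length(breached), n_pick)]

df$CYBERBACKGROUND[picked] <- 2
```

With three breached rows this always changes two of them. Whether rounding up is the right choice depends on what "40%" should mean for small counts.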
I have a data frame with a column containing a long string of events separated by _. I want to count particular patterns, and several possible combinations of them, within that string. In the use case below, I want to count occurrences of events A and B only, and nothing else.
If A and B occur together as A_B or B_A, or if the pair repeats any number of times (A_B_A, B_A_B, and so on), I want to count each of those combinations, including multiple occurrences of them.
Example data frame:
participant <- c("A", "B", "C")
trial <- c(1,1,2)
string_pattern <- c("A_B_A_C_A_B", "B_A_B_A_C_D_A_B", "A_B_C_A_B")
df <- data.frame(participant, trial, string_pattern)
Expected output:
participant trial string_pattern A_B B_A A_B_A B_A_B B_A_B_A
1. A 1 A_B_A_C_A_B 2 1 1 0 0
2. B 1 B_A_B_A_C_D_A_B 2 2 1 1 1
3. C 2 A_B_C_A_B 2 0 0 0 0
My code:
revised_df <- df %>%
dplyr::mutate(A_B = stringr::str_count(string_pattern, "A_B"),
B_A = stringr::str_count(string_pattern, "B_A"),
B_A_B = stringr::str_count(string_pattern, "B_A_B"))
My approach gets complicated as the number of combinations increases. Hence, I am looking for a better solution.
You could write a function to solve this:
m <- function(s){
a <- seq(nchar(s)-1)
start <- rep(a, rev(a))
stop <- ave(start, start, FUN = \(x)seq_along(x)+x)
b <- substring(s, start, stop)
gsub('(?<=\\B)|(?=\\B)', '_', b, perl = TRUE)
}
n <- function(x){
names(x) <- x
a <- strsplit(gsub("_", '', gsub("_[^AB]+_", ':', x)), ':')
b <- t(table(stack(lapply(a, \(y)unlist(sapply(y, m))))))
data.frame(pattern=x, as.data.frame.matrix(b), row.names = NULL)
}
n(string_pattern)
pattern A_B A_B_A B_A B_A_B B_A_B_A
1 A_B_A_C_A_B 2 1 1 0 0
2 B_A_B_A_C_D_A_B 2 1 2 1 1
3 A_B_C_A_B 2 0 0 0 0
Try this, which counts in each row's string the pattern given by the current column name:
library(dplyr)
library(stringr)
df |>
mutate(A_B = 0, B_A = 0, A_B_A = 0, B_A_B = 0, B_A_B_A = 0) |>
mutate(across(A_B:B_A_B_A, ~ str_count(string_pattern, cur_column())))
participant trial string_pattern A_B B_A A_B_A B_A_B B_A_B_A
1 A 1 A_B_A_C_A_B 2 1 1 0 0
2 B 1 B_A_B_A_C_D_A_B 2 2 1 1 1
3 C 2 A_B_C_A_B 2 0 0 0 0
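One caveat with counting regex matches: matches are non-overlapping, so in a string such as A_B_A_B_A the pattern A_B_A is counted once even though it occurs at two positions. If overlapping occurrences should count, a lookahead can be used. The sketch below is base R only; count_overlapping and patterns are hypothetical names introduced here, and the patterns are assumed to contain no regex metacharacters.

```r
string_pattern <- c("A_B_A_C_A_B", "B_A_B_A_C_D_A_B", "A_B_C_A_B")
patterns <- c("A_B", "B_A", "A_B_A", "B_A_B", "B_A_B_A")

# Count every start position at which `pattern` occurs, overlaps included.
# The lookahead (?=...) matches without consuming characters.
count_overlapping <- function(strings, pattern) {
  hits <- gregexpr(paste0("(?=", pattern, ")"), strings, perl = TRUE)
  # gregexpr returns -1 for "no match", so count positive positions only
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# One column per pattern, one row per input string
counts <- sapply(patterns, function(p) count_overlapping(string_pattern, p))
counts

count_overlapping("A_B_A_B_A", "A_B_A")
```

On the sample data the counts agree with the expected output in the question; the two approaches only diverge when a pattern overlaps itself.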
According to the following table, I have many different teachers (10, 11, 12, ...) with different opinions (1, 2, ...; for example 1: Very Good, 2: Good, ...) of each class (1, 2, 3, ...).
Some teachers have no opinion about some classes.
class Teacher-code Opinion
1 12 1
1 13 1
1 14 1
2 11 3
2 13 1
3 10 1
3 11 2
3 12 1
3 13 1
This is a sample of my table, but I have many records. I want a symmetric matrix of teachers containing the number of classes on which they share the same opinion. For example, teachers 12 and 13 have the same opinion on classes 1 and 3, so their intersection element is 2. Teachers 14 and 13 share an opinion only on the first class. I want to get the following matrix:
[10] [11] [12] [13] [14]
[10] 0 0 1 1 0
[11] 0 0 0 0 0
[12] 1 0 0 2 1
[13] 1 0 2 0 1
[14] 0 0 1 1 0
This is a base R solution based on a general approach for finding common rows between data frames. Maybe it will be helpful.
Create a function that finds the overlap between two teachers, that is, the rows that share values in certain columns (in this case, class and Opinion). merge identifies the overlapping rows, and nrow counts them.
Using outer you can generate a matrix over all pairs of teachers. The function passed to outer needs to be vectorized, hence the Vectorize wrapper.
the_teachers <- sort(unique(df$Teacher_code))
get_num_classes <- function(x, y) {
nrow(
merge(
df[df$Teacher_code == x, c("class", "Opinion")],
df[df$Teacher_code == y, c("class", "Opinion")]
)
)
}
mat <- outer(the_teachers, the_teachers, Vectorize(get_num_classes))
diag(mat) <- 0
dimnames(mat) <- list(the_teachers, the_teachers)
mat
Output
10 11 12 13 14
10 0 0 1 1 0
11 0 0 0 0 0
12 1 0 0 2 1
13 1 0 2 0 1
14 0 0 1 1 0
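For larger tables, the same counts can probably be obtained without one merge per pair. The following is a rough sketch of an alternative technique, not the method used above: build a teacher by (class, opinion) incidence matrix and multiply it by its transpose. The names key and incidence are illustrative, and the data frame is typed in from the question's table.

```r
df <- data.frame(
  class        = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  Teacher_code = c(12, 13, 14, 11, 13, 10, 11, 12, 13),
  Opinion      = c(1, 1, 1, 3, 1, 1, 2, 1, 1)
)

# One label per distinct (class, opinion) combination
key <- interaction(df$class, df$Opinion, drop = TRUE)

# Rows are teachers, columns are combinations; 1 if the teacher gave it
incidence <- 1 * (unclass(table(df$Teacher_code, key)) > 0)

# Entry [i, j] counts the combinations shared by teachers i and j
mat <- tcrossprod(incidence)
diag(mat) <- 0
mat
```

This assumes each teacher gives at most one opinion per class, as in the sample.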
Edit: Based on comment, there is interest in identifying the fraction of (teacher pairs sharing same opinions in same class) / (teacher pairs sharing same class). Building off the same logic, you could modify the function as below. A separate merge determines the number of classes the two teachers have in common. If this number is not zero, a second merge determines the number of opinions shared by the pair, and the function returns the ratio of the two. If there are no classes shared, the function just returns zero. Depending on the size of the data and the concordance between teachers, this may be optimized further.
get_num_classes <- function(x, y) {
same_class <- nrow(
merge(
df[df$Teacher_code == x, "class", drop = FALSE],
df[df$Teacher_code == y, "class", drop = FALSE]
)
)
if (same_class != 0) {
same_opinion <- nrow(
merge(
df[df$Teacher_code == x, c("class", "Opinion")],
df[df$Teacher_code == y, c("class", "Opinion")]
)
)
return(same_opinion / same_class)
} else {
return(0)
}
}
Here is a mostly base R option, defining a user function f in which aggregate + pmin + vecsets::vintersect are applied:
library(vecsets)
f <- function(df) {
u <- aggregate(. ~ Teacher_code, df, I)
res <- do.call(
pmin,
lapply(
u[c("class", "Opinion")],
function(x) outer(x, x, FUN = function(...) lengths(Vectorize(vintersect)(...)))
)
)
`dimnames<-`(`diag<-`(res, 0), rep(list(u[["Teacher_code"]]), 2))
}
and you will see
> f(df)
10 11 12 13 14
10 0 0 1 1 0
11 0 0 0 0 0
12 1 0 0 2 1
13 1 0 2 0 1
14 0 0 1 1 0
I mean, it's some pretty hideous code (I'm sure someone can do something better), but I think it gets you the result you need (this isn't homework, is it -.- ?). Also, in the future it makes life easier if you provide data using dput().
library(dplyr)
library(tidyr)
dat <- tibble(
class = c( 1,1,1,2,2,3,3,3,3),
Teacher_code = c(12,13,14,11,13,10,11,12,13),
Opinion = c(1,1,1,3,1,1,2,1,1)
)
dat2 <- complete(dat, class, Teacher_code)
classes <- unique(dat2$class)
teachers <- unique(dat2$Teacher_code)
len_teachers <- length(teachers)
mat <- matrix(nrow = len_teachers, ncol = len_teachers)
for(i in seq_along(teachers)){
for( j in seq_along(teachers)){
same_opinion <- 0
for(k in classes){
opinion_i <- dat2 %>% filter(Teacher_code == teachers[[i]] , class == k) %>% pull(Opinion)
opinion_j <- dat2 %>% filter(Teacher_code == teachers[[j]] , class == k) %>% pull(Opinion)
same_opinion <- same_opinion + (opinion_i == opinion_j & !(is.na(opinion_i) | is.na(opinion_j)))
}
mat[i,j] <- same_opinion
}
}
I have a data set and would like to do two things:
Set certain row values in Col A to 0 based on values in Col B
Create a new column with values of either 0 or 1 based on the edited values in Col A
My current approach is shown below - the issue is I occasionally get an error:
Error in `[<-.data.frame`(`*tmp*`, "OCS_dose", value = 0) :
replacement has 1 row, data has 0
This happens because the numbers I generate are randomly selected, and on certain trials there are no rows to update in Col A based on the values in Col B.
Here is an example of my code that causes the error:
pbo_IFNlow_data[pbo_IFNlow_data$OCS_status == 0,]['OCS_dose'] <- 0
OCS_status is either a 0 or 1 that is generated using:
pbo_OCS_status_low <- sample(c(0,1), replace = TRUE,
size = pbo_n_IFNlow, prob=c(1-.863, 0.863))
Therefore on occasion, I have no 0's... In my mind R should then just not try to update anything.
Is there a better way to do what I am trying to do?
Here is a more complete segment of my code:
pbo_OCS_status_low <- sample(c(0,1), replace = TRUE, size = pbo_n_IFNlow, prob=c(1-.863, 0.863)) #on OCS = 1
#OCS dose
pbo_OCS_dose_low <- rtruncnorm(pbo_n_IFNlow, a=0, b=Inf, mean=12.8, sd=8.1)
#IFN boolean flag
pbo_IFN_low <- rep(0, pbo_n_IFNlow)
#SLEDAI score
pbo_SLEDAI_low <- rtruncnorm(pbo_n_IFNlow, a=0, b=Inf, mean=11.1, sd=4.4)
#Response criteria met for SRI score reduction
pbo_SRI_low <- sample(c(0,1), replace = TRUE, size = pbo_n_IFNlow, prob=c(1-0.423, 0.423))
pbo_IFNlow_data <- cbind(IFN_status=pbo_IFN_low,
OCS_status=pbo_OCS_status_low,
OCS_dose=pbo_OCS_dose_low,
SLEDAI=pbo_SLEDAI_low,
SRI_response=pbo_SRI_low)
pbo_IFNlow_data <- data.frame(pbo_IFNlow_data)
#set those off OCS to 0
pbo_IFNlow_data[pbo_IFNlow_data$OCS_status == 0,]['OCS_dose'] <- 0
#stratification factor for OCS dosage
pbo_IFNlow_data$OCS_lessthan10 <- "temp"
pbo_IFNlow_data[pbo_IFNlow_data$OCS_dose < 10, ]['OCS_lessthan10'] <- 1
pbo_IFNlow_data[pbo_IFNlow_data$OCS_dose >= 10, ]['OCS_lessthan10'] <- 0
#stratification factor for SLE score
pbo_IFNlow_data$SLE_lessthan10 <- "temp"
pbo_IFNlow_data[pbo_IFNlow_data$SLEDAI < 10, ]['SLE_lessthan10'] <- 1
pbo_IFNlow_data[pbo_IFNlow_data$SLEDAI >= 10, ]['SLE_lessthan10'] <- 0
It would be easier if we had a minimal reproducible example. If I understand your question correctly, you may want to try the ifelse function in R:
df <- data.frame(colA = seq(1, 10), colB = seq(11, 20))
# Set certain row values in Col A to 0 based on values in Col B
df$colA <- ifelse(df$colB > 15, 0, df$colA)
# Create a new column with values of either 0
# or 1 based on the edited values in Col A
df$colC <- ifelse(df$colA == 0, 1, 0)
print(df)
## colA colB colC
## 1 1 11 0
## 2 2 12 0
## 3 3 13 0
## 4 4 14 0
## 5 5 15 0
## 6 0 16 1
## 7 0 17 1
## 8 0 18 1
## 9 0 19 1
## 10 0 20 1
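As for the error in the question itself: an expression of the form data[rows, ]['col'] <- 0 first extracts the matching rows and then assigns a column to that subset, which fails when the subset has zero rows. Indexing the column vector directly avoids this, because assigning to an empty selection of a vector is simply a no-op. Here is a minimal sketch with a small made-up data frame (dat and its values are assumptions, chosen so that no row has OCS_status equal to 0):

```r
# Hypothetical data in which every subject is on OCS (no zeros at all)
dat <- data.frame(
  OCS_status = rep(1, 5),
  OCS_dose   = c(12.5, 3.2, 8.8, 20.1, 9.9)
)

# Nothing matches, nothing is changed, and no error is raised
dat$OCS_dose[dat$OCS_status == 0] <- 0

# The 0/1 stratification flag follows directly from the comparison,
# without the "temp" placeholder and two separate assignments
dat$OCS_lessthan10 <- as.integer(dat$OCS_dose < 10)
dat
```

The same pattern would apply to the SLEDAI flag.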
I have a data set that looks as follows
df <- data.frame( name = c("a", "b", "c"),
judgement1= c(5, 0, NA),
judgement2= c(1, 1, NA),
judgement3= c(2, 1, NA))
I want to reshape the dataframe to look like this
# name judgement1 judgement2 judgement3
# a 1 0 0
# a 1 0 0
# a 1 0 0
# a 1 0 0
# a 1 0 0
# b 1 0 0
# b 0 1 0
# b 0 0 1
And so on. I have seen that untable is recommended on some other threads, but it does not appear to work with the current version of R. Is there a package that can convert summarised counts into individual observations?
You could try something like this:
df <- data.frame( name = c("a", "b", "c"),
judgement1= c(5, 0, NA),
judgement2= c(1, 1, NA),
judgement3= c(2, 1, NA))
rep.vec <- colSums(df[colnames(df) %in% paste0("judgement", 1:nrow(df))], na.rm = TRUE)
want <- data.frame(name = df$name, diag(nrow(df)))
colnames(want)[-1] <- paste0("judgement", 1:nrow(df))
(want <- want[rep(1:nrow(want), rep.vec), ])
I wrote a function that works to give you your desired output:
untabl <- function(df, id.col, count.cols) {
df[is.na(df)] <- 0 # replace NAs
out <- lapply(count.cols, function(x) { # for each column with counts
z <- df[rep(1:nrow(df), df[,x]), ] # replicate rows
z[, -c(id.col)] <- 0 # set all other columns to zero
z[, x] <- 1 # replace the count values with 1
z
})
out <- do.call(rbind, out) # combine the list
out <- out[order(out[,c(id.col)]),] # reorder (you can change this)
rownames(out) <- NULL # return to simple row numbers
out
}
untabl(df = df, id.col = 1, count.cols = c(2,3,4))
# name judgement1 judgement2 judgement3
#1 a 1 0 0
#2 a 1 0 0
#3 a 1 0 0
#4 a 1 0 0
#5 a 1 0 0
#6 a 0 1 0
#7 a 0 0 1
#8 a 0 0 1
#9 b 0 1 0
#10 b 0 0 1
And for your reference, reshape::untable consists of the following code:
function (df, num)
{
df[rep(1:nrow(df), num), ]
}
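The same rep idea can be applied per cell rather than per row. The sketch below, in base R, repeats each (name, judgement) cell as many times as its count and turns every repetition into a one-hot row. The names counts, cells, n, idx and out are illustrative, and treating NA as 0 is an assumption.

```r
df <- data.frame(name = c("a", "b", "c"),
                 judgement1 = c(5, 0, NA),
                 judgement2 = c(1, 1, NA),
                 judgement3 = c(2, 1, NA))

counts <- as.matrix(df[-1])
counts[is.na(counts)] <- 0          # assumption: a missing count means zero

# All (row, column) cells, ordered by row first so names stay together
cells <- expand.grid(col = seq_len(ncol(counts)), row = seq_len(nrow(counts)))
n     <- counts[cbind(cells$row, cells$col)]

# Repeat each cell index by its count, as untable does for whole rows
idx <- rep(seq_len(nrow(cells)), n)

# Row i of the identity matrix is the one-hot encoding of column i
out <- data.frame(name = df$name[cells$row[idx]],
                  diag(ncol(counts))[cells$col[idx], , drop = FALSE])
names(out)[-1] <- colnames(counts)
out
```

The column sums of the result reproduce the original column totals, which is a quick way to check the expansion.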