I have got a data set that looks like this:
COMPANY  DATABREACH  CYBERBACKGROUND
A        1           2
B        0           2
C        0           1
D        0           2
E        1           1
F        1           2
G        0           2
H        0           2
I        0           2
J        0           2
Now I want to create the following: in 40% of the cases where the column DATABREACH has the value 1, I want CYBERBACKGROUND to take the value 2. I figure there must be some function to do this, but I cannot find it.
ind <- which(df$DATABREACH == 1)                     # rows where DATABREACH is 1
ind <- ind[rbinom(length(ind), 1, prob = 0.4) > 0]   # keep each row with probability 0.4
df$CYBERBACKGROUND[ind] <- 2
The above is a bit more efficient in that it only draws randomness for as many rows as strictly required. If you aren't concerned about that (11000 rows doesn't seem too high), you can reduce it to
df$CYBERBACKGROUND <-
ifelse(df$DATABREACH == 1 & rbinom(nrow(df), 1, prob = 0.4) > 0,
2, df$CYBERBACKGROUND)
We may use
library(dplyr)
df <- df %>%
  mutate(CYBERBACKGROUND = replace(CYBERBACKGROUND,
    sample(which(DATABREACH == 1), ceiling(sum(DATABREACH) * 0.4)), 2))
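Note that rbinom flips each qualifying row independently, so you only get 40% on average, whereas sample hits the target count exactly. A minimal base R sketch of the exact-count version, assuming df as in the question:
ones <- which(df$DATABREACH == 1)                    # qualifying rows
picked <- sample(ones, ceiling(0.4 * length(ones)))  # exactly 40% of them, rounded up
df$CYBERBACKGROUND[picked] <- 2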
I have a small problem and hope someone can help me.
I have a dataframe like this:
df <- data.frame(foo = 1:20, bar = c(0,0,1,0,0,0,1,2,0,0,1,2,3,0,0,0,1,2,3,4))
and want to get a result like this:
df_result <- data.frame(foo = 1:20, bar = c(0,0,1,0,0,0,2,2,0,0,3,3,3,0,0,0,4,4,4,4))
How do I do this without using a while loop?
Using ave in base R :
with(df, as.integer(bar > 0) * (ave(bar, cumsum(bar == 0), FUN = max)))
#[1] 0 0 1 0 0 0 2 2 0 0 3 3 3 0 0 0 4 4 4 4
where cumsum(bar == 0) is used to create groups, ave is used to calculate the max in each group, and as.integer(bar > 0) keeps the values that are 0 as 0.
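To see the pieces, here is a sketch of the intermediates on the sample df:
grp <- cumsum(df$bar == 0)          # a new group starts at every 0
mx  <- ave(df$bar, grp, FUN = max)  # max of bar within each group
as.integer(df$bar > 0) * mx         # zero out the positions where bar == 0
#[1] 0 0 1 0 0 0 2 2 0 0 3 3 3 0 0 0 4 4 4 4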
I am trying to create a program to iterate through an R data.table. I am trying to avoid for loops because, as far as I know, they are slow.
#creation of the data table
library(data.table)
col <- c(0, 1, 0, 1, 0, 1)
Priority <- c(1, 2, 3, 4, 5, 6) #1 highest, 6 lowest
IEC_category <- c("a", "b", "c", "d", "e", "f")
eventlog_overlap.dt <- data.table(col, Priority, IEC_category)
#comparison and assignment of the priority
if (eventlog_overlap.dt$col == 1){
  if (eventlog_overlap.dt$Priority <= shift(eventlog_overlap.dt$Priority, 1)){
    eventlog_overlap.dt$AlarmaPrior <- eventlog_overlap.dt$IEC_category #write the actual category
  } else {
    eventlog_overlap.dt$AlarmaPrior <- shift(eventlog_overlap.dt$IEC_category, 1) #write the previous category
  }
} else {
  eventlog_overlap.dt$AlarmaPrior <- NA
}
Please provide the desired result. A dplyr attempt:
library(dplyr)
library(hablar)
col <- c(0, 1, 0, 1, 0, 1)
Priority <- c(1,2,3,4,5,6) #1 highest, 6 lowest
IEC_category <- c("a","b","c","d","e","f")
df <- data.frame(col,Priority, IEC_category)
df %>%
mutate(AlarmaPrior = if_else_(col == 1,
if_else_(Priority <= lag(Priority),
IEC_category,
lag(IEC_category)), NA))
gives you:
col Priority IEC_category AlarmaPrior
1 0 1 a <NA>
2 1 2 b a
3 0 3 c <NA>
4 1 4 d c
5 0 5 e <NA>
6 1 6 f e
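Since the question is about data.table, here is a fifelse-based sketch of the same logic (my reading of the intended rule, using shift() for the previous row):
library(data.table)
eventlog_overlap.dt[, AlarmaPrior := fifelse(
  col == 1,
  fifelse(Priority <= shift(Priority), IEC_category, shift(IEC_category)),
  NA_character_
)]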
I need to transform a categorical attribute vector into a "same attribute matrix" using R.
For example I have a vector which reports gender of N people (male = 1, female = 0). I need to convert this vector into a NxN matrix named A (with people names on rows and columns), where each cell Aij has the value of 1 if two persons (i and j) have the same gender and 0 otherwise.
Here is an example with 3 persons, first male, second female, third male, which produces this vector:
c(1, 0, 1)
I want to transform it into this matrix:
A = matrix( c(1, 0, 1, 0, 1, 0, 1, 0, 1), nrow=3, ncol=3, byrow = TRUE)
As lmo said in a comment, it's impossible to know the structure of your dataset, so what follows is just an example for you to see how it could be done.
First, make up some data.
set.seed(3488) # make the results reproducible
x <- LETTERS[1:5]
y <- sample(0:1, 5, TRUE)
df <- data.frame(x, y)
Now tabulate it according to your needs
A <- outer(df$y, df$y, function(a, b) as.integer(a == b))
dimnames(A) <- list(df$x, df$x)
A
# A B C D E
#A 1 1 1 0 0
#B 1 1 1 0 0
#C 1 1 1 0 0
#D 0 0 0 1 1
#E 0 0 0 1 1
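Applied to the original 3-person vector, the same outer() call reproduces the matrix from the question:
v <- c(1, 0, 1)
outer(v, v, function(a, b) as.integer(a == b))
#     [,1] [,2] [,3]
#[1,]    1    0    1
#[2,]    0    1    0
#[3,]    1    0    1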
I wanted to create a vector of counts if possible.
For example: I have a vector
x <- c(3, 0, 2, 0, 0)
How can I create a frequency vector for all integers between 0 and 3? Ideally I want to get a vector like this:
[1] 3 0 1 1
which gives me the counts of 0, 1, 2, and 3 respectively.
Much appreciated!
You can do
table(factor(x, levels=0:3))
Simply using table(x) is not enough, since it drops levels that never occur (here, 1).
Or with tabulate, which is faster:
tabulate(factor(x, levels = min(x):max(x)))
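If the values are known to be small non-negative integers, you can skip building a factor altogether by shifting the values by one (a sketch; tabulate only counts positive integers):
tabulate(x + 1, nbins = 4)  # bin i holds the count of value i - 1
#> [1] 3 0 1 1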
You can do this using rle (I made this in minutes, so sorry if it's not optimized enough).
x <- c(3, 0, 2, 0, 0)
r <- rle(x)
f <- function(v) sum(r$lengths[r$values == v])  # total length of runs with value v
s <- sapply(0:3, f)
data.frame(x = 0:3, freq = s)
# x freq
#1 0 3
#2 1 0
#3 2 1
#4 3 1
You can just use table():
a <- table(x)
a
x
#0 2 3
#3 1 1
Then you can subset it:
a[names(a)==0]
#0
#3
Or convert it into a data.frame if you're more comfortable working with that:
u<-as.data.frame(table(x))
u
# x Freq
#1 0 3
#2 2 1
#3 3 1
Edit 1:
For levels:
a<- as.data.frame(table(factor(x, levels=0:3)))
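which yields all four levels, including the zero count for 1:
a
#  Var1 Freq
#1    0    3
#2    1    0
#3    2    1
#4    3    1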
I am trying to create a column in a very large data frame (~ 2.2 million rows) that calculates the cumulative sum of 1's for each factor level, and resets when a new factor level is reached. Below is some basic data that resembles my own.
itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4', 'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
df <- data.frame(itemcode, goodp)
I would like the output variable, cum.goodp, to look like this:
cum.goodp <- c(0, 1, 2, 0, 1, 1, 2, 0, 0, 1, 1, 1, 2, 0, 1)
I get that there is a lot out there using the canonical split-apply-combine approach, which is conceptually intuitive, but I tried using the following:
k <- transform(df, cum.goodp = goodp * ave(goodp, c(0L, cumsum(diff(goodp) != 0)), FUN = seq_along, by = itemcode))
When I try to run this code it's very, very slow. I get that transform is part of the reason why (the 'by' doesn't help either). There are over 70K different values for the itemcode variable, so it should probably be vectorized. Is there a way to vectorize this, using cumsum? If not, any help whatsoever would be truly appreciated. Thanks so much.
A base R approach is to calculate cumsum over the whole vector, and capture the geometry of the sub-lists using run-length encoding. Figure out the start of each group, and create new groups
start <- c(TRUE, itemcode[-1] != itemcode[-length(itemcode)]) | !goodp
f <- cumsum(start)
Summarize these as a run-length encoding, and calculate the overall sum
r <- rle(f)
x <- cumsum(goodp)
Then use the geometry to get the offset that each embedded sum needs to be corrected by
offset <- c(0, x[cumsum(r$lengths)])
and calculate the updated value
x - rep(offset[-length(offset)], r$lengths)
Here's a function
cumsumByGroup <- function(x, f) {
    start <- c(TRUE, f[-1] != f[-length(f)]) | !x  # TRUE at each new group and at each 0
    r <- rle(cumsum(start))                        # run-length geometry of those groups
    x <- cumsum(x)                                 # single cumsum over the whole vector
    offset <- c(0, x[cumsum(r$lengths)])           # running total entering each group
    x - rep(offset[-length(offset)], r$lengths)    # subtract so each group restarts at 0
}
Here's the result applied to the sample data
> cumsumByGroup(goodp, itemcode)
[1] 0 1 2 0 1 1 2 0 0 1 1 1 2 0 1
and it's performance
> n <- 1 + rpois(1000000, 1)
> goodp <- sample(c(0, 1), sum(n), TRUE)
> itemcode <- rep(seq_along(n), n)
> system.time(cumsumByGroup(goodp, itemcode))
user system elapsed
0.55 0.00 0.55
The dplyr solution takes about 70s.
@alexis_laz's solution is both elegant and about 2 times faster than mine:
cumsumByGroup1 <- function(x, f) {
    start <- c(TRUE, f[-1] != f[-length(f)]) | !x
    cs <- cumsum(x)
    # (cs - x) is the running total just before each element; cummax carries
    # the value recorded at each group start forward through the group
    cs - cummax((cs - x) * start)
}
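A quick sanity check on the sample data from the question (re-created here) confirms the two functions agree:
goodp    <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
itemcode <- rep(c("a1", "a2", "a3", "a4", "a5", "a6"), c(5, 2, 1, 2, 1, 4))
identical(cumsumByGroup(goodp, itemcode), cumsumByGroup1(goodp, itemcode))
#> [1] TRUE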
With the modified example input/output you could use the following base R approach (among others):
transform(df, cum.goodpX = ave(goodp, itemcode, cumsum(goodp == 0), FUN = cumsum))
# itemcode goodp cum.goodp cum.goodpX
#1 a1 0 0 0
#2 a1 1 1 1
#3 a1 1 2 2
#4 a1 0 0 0
#5 a1 1 1 1
#6 a2 1 1 1
#7 a2 1 2 2
#8 a3 0 0 0
#9 a4 0 0 0
#10 a4 1 1 1
#11 a5 1 1 1
#12 a6 1 1 1
#13 a6 1 2 2
#14 a6 0 0 0
#15 a6 1 1 1
Note: I added column cum.goodp to the input df and created a new column cum.goodpX so you can easily compare the two.
But of course you can use many other approaches with packages, either what #MartinMorgan suggested or for example using dplyr or data.table, to name just two options. Those may be a lot faster than base R approaches for large data sets.
Here's how it would be done in dplyr:
library(dplyr)
df %>%
group_by(itemcode, grp = cumsum(goodp == 0)) %>%
mutate(cum.goodpX = cumsum(goodp))
A data.table option was already provided in the comments to your question.
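For completeness, here is a data.table sketch along the same lines as the dplyr version above (my reconstruction, not necessarily the version from the comments):
library(data.table)
setDT(df)[, cum.goodpX := cumsum(goodp),
          by = .(itemcode, grp = cumsum(goodp == 0))]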