I'm fairly new to R and I have the following issue.
I have a dataframe like this:
A | B | C | E | F |G
1 02 XXX XXX XXX 1
1 02 XXX XXX XXX 1
2 02 XXX XXX XXX NA
2 02 XXX XXX XXX NA
3 02 XXX XXX XXX 1
3 Z1 XXX XXX XXX 1
4 02 XXX XXX XXX 2
....
M 02 XXX XXX XXX 1
The thing is that the dataframe possibly has 150k rows or more, and I need to generate another dataframe grouping by A (which is an ID) and count the following occurrences:
When B is 02 and G has 1 <- V
When B is 02 and G is NA <- W
When B is Z1 and G has 1 <- X
When B is Z1 and G is NA <- Y
Any other kind of occurrence <- Z
For this simple example, the result should look something like this
A | V | W | X | Y | Z
1 2 0 0 0 0
2 0 2 0 0 0
3 1 0 1 0 0
4 0 0 0 0 1
...
M 1 0 0 0 0
At this point I managed to get the results using a for loop:
get_counters <- function(df){
  counters <- data.frame(matrix(ncol = 6, nrow = length(unique(df$A))))
  colnames(counters) <- c("A", "V", "W", "X", "Y", "Z")
  counters$A <- unique(df$A)
  for (i in 1:nrow(counters)) {
    counters$V[i] <- sum(df$A == counters$A[i] & df$B == "02" & df$G == 1, na.rm = TRUE)
    counters$W[i] <- sum(df$A == counters$A[i] & df$B == "02" & is.na(df$G), na.rm = TRUE)
    counters$X[i] <- sum(df$A == counters$A[i] & df$B == "Z1" & df$G == 1, na.rm = TRUE)
    counters$Y[i] <- sum(df$A == counters$A[i] & df$B == "Z1" & is.na(df$G), na.rm = TRUE)
    counters$Z[i] <- sum(df$A == counters$A[i] & (df$B == "Z1" | df$B == "02") & df$G != 1, na.rm = TRUE)
  }
  return(counters)
}
Trying that on a small test dataframe returns all the correct results, but with the real data it is extremely slow. I'm not sure how to use the apply functions. It seems like a simple problem, but I have not found an answer. So far I've assumed that I could do it by using apply with the sum statement from my for loop (maybe with group_by(A)), but I receive all kinds of errors.
counters$V <- df%>%
group_by(A)%>%
sum(df$A == counters$A& df$B == "02" &df$G == 1, na.rm = TRUE)
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
In addition: Warning message:
In df$A== counters$A:
longer object length is not a multiple of shorter object length
If I change the function to not use a for loop and not use $ (I get an error saying "$ operator is invalid for atomic vectors"), I either get more errors or weird unreadable results (large lists that contain more values than the original dataframe, huge empty matrices, etc.).
Is there a simple (maybe not simple but fast and efficient) way to solve this problem? Thanks in advance.
You can do this very quickly using data.table.
Creating Dummy Data:
set.seed(123)
counters <- data.frame(A = rep(1:100000, each = 3),
                       B = sample(c("02","Z1"), size = 300000, replace = TRUE),
                       G = sample(c(1,NA), size = 300000, replace = TRUE))
All I am doing is counting the instances of the combination, then reshaping the data in the format you need:
library(data.table)
setDT(counters)
counters[,comb := paste0(B,"_",G)]
dcast(counters, A ~ comb, fun.aggregate = length, value.var = "A")
A 02_1 02_NA Z1_1 Z1_NA
1: 1 0 2 1 0
2: 2 1 0 1 1
3: 3 0 0 2 1
4: 4 1 1 0 1
5: 5 0 1 2 0
---
99996: 99996 0 1 1 1
99997: 99997 0 2 1 0
99998: 99998 2 0 1 0
99999: 99999 1 0 1 1
100000: 100000 0 2 0 1
I adopted a naming convention that is a bit more extensible (the new columns indicate what combination you are counting), but if you want to use the labels from your question instead, replace the comb := line with one line per category, like the following:
counters[B == "02" & G == 1, comb := "V"]
counters[B == "02" & is.na(G), comb := "W"]
....
But I think the above is a bit more flexible.
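If you'd rather get exactly the V/W/X/Y/Z columns from the question without data.table, here is a minimal base R sketch. It assumes that "any other kind of occurrence" means every row not matched by the first four rules. The sample df and the names category, g_is_one and counts are made up for illustration.

```r
# Hypothetical sample mirroring the question's layout (only A, B and G matter here)
df <- data.frame(
  A = c(1, 1, 2, 2, 3, 3, 4),
  B = c("02", "02", "02", "02", "02", "Z1", "02"),
  G = c(1, 1, NA, NA, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Start with every row in the catch-all class Z, then overwrite the four named cases
category <- rep("Z", nrow(df))
g_is_one <- !is.na(df$G) & df$G == 1
category[df$B == "02" & g_is_one]    <- "V"
category[df$B == "02" & is.na(df$G)] <- "W"
category[df$B == "Z1" & g_is_one]    <- "X"
category[df$B == "Z1" & is.na(df$G)] <- "Y"

# Fixing the levels keeps all five columns even when a class never occurs
category <- factor(category, levels = c("V", "W", "X", "Y", "Z"))

# Cross-tabulate ID against class, then turn the table into a data frame
counts <- as.data.frame.matrix(table(df$A, category))
counts <- cbind(A = rownames(counts), counts)
rownames(counts) <- NULL
counts
```

Everything here is vectorised over the whole column, so there is no per-ID loop. Note that A comes back as character because it is taken from the table's row names.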
Hello, I need help with programming in R. I have a data.frame B with four columns:
x<- c(1,2,1,2,1,2,1,2,1,2,1,2,.......etc.)
y<-c(5,5,8,8,12,12,19,19,30,30,50,50,...etc.)
z<- c("2018-11-08","2018-11-08","2018-11-09","2018-11-09","2018-11-11","2018-11-11","2018-11-20","2018-11-20","2018-11-29","2018-11-29","2018-11-30","2018-11-30",.......etc.)
m<-c(0,1,1,0,1,1,0,1,0,1,0,1,...etc.)
It has 2 million rows, and I need to create a new column, which should look like this:
t<-c(0,1,0,0,0,0,0,1,0,1,0,1,....)
With a loop, the code looks like this:
B$t[1] <- ifelse(B$y[1]==B$y[2] & B$z[1]==B$z[2] & B$x[1]==2 & B$m[1]==1,1,0)
for (i in 2:length(B$z))
{
  B$t[i] <- ifelse(B$y[i]==B$y[i-1] & B$z[i]==B$z[i-1] & B$x[i]==2 & B$m[i]==1 & B$m[i]!=B$m[i-1],1,0)
}
I do not want to use a loop.
I only use base R.
I also have a second question, for a data.frame E:
x<- c(1,2,3,1,2,3,1,2,3,1,2,3,.......etc.)
y<-c(5,5,5,8,8,8,12,12,12,19,19,19,30,30,30,50,50,50,...etc.)
z<- c("2018-11-08","2018-11-08","2018-11-08","2018-11-09","2018-11-09","2018-11-09","2018-11-11","2018-11-11","2018-11-11","2018-11-20","2018-11-20","2018-11-20","2018-11-29","2018-11-29","2018-11-29","2018-11-30","2018-11-30","2018-11-30",.......etc.)
m<-c(0,1,1,0,0,1,0,1,0,1,0,1,0,0,1,...etc.)
It also has 2 million rows, and I need to create a new column, which should look like this:
t<-c(0,1,0,0,1,....)
With a loop, the code looks like this:
E$t[1] <- ifelse(E$y[1]==E$y[2] & E$z[1]==E$z[2] & E$x[1]==2 & E$m[1]==1,1,0)
E$t[2] <- ifelse(E$y[2]==E$y[3] & E$z[2]==E$z[3] & E$x[2]==3 & E$m[2]==1,1,0)
for (i in 3:length(E$y))
{
  E$t[i] <- ifelse(E$y[i]==E$y[i-2] & E$z[i]==E$z[i-2] & E$x[i]==3 & E$m[i]==1 &
                     E$m[i-1]==0 & E$m[i-2]==0,1,0)
}
I do not want to use a loop.
I only use base R.
Here is a solution with base R:
N <- nrow(B)
B$t <- ifelse(B$y==c(NA, B$y[-N]) & B$z==c(NA, B$z[-N]) & B$x==2 & B$m==1 & B$m!=c(NA, B$m[-N]), 1, 0)
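One thing to watch with the one-liner above: row 1 has no previous row, so it is compared against NA. If the first row happens to have x == 2 and m == 1, the whole condition is NA and so is t[1]. Here is a small sketch of an NA-safe variant, assuming an unknown comparison should count as 0. The helper prev() and the variable cond are names invented for this example.

```r
# Sample data copied from the first twelve rows shown in the question
B <- data.frame(
  x = rep(1:2, times = 6),
  y = rep(c(5, 8, 12, 19, 30, 50), each = 2),
  z = rep(c("2018-11-08", "2018-11-09", "2018-11-11",
            "2018-11-20", "2018-11-29", "2018-11-30"), each = 2),
  m = c(0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# prev() is a helper made up for this sketch: the value one row up, NA for row 1
prev <- function(v) c(NA, v[-length(v)])

cond <- B$y == prev(B$y) & B$z == prev(B$z) &
  B$x == 2 & B$m == 1 & B$m != prev(B$m)

# Treat an NA condition (only possible in row 1) as "no match"
B$t <- as.integer(!is.na(cond) & cond)
B$t
# [1] 0 1 0 0 0 0 0 1 0 1 0 1
```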
Here is a solution with data.table:
library("data.table")
B <- data.table(
x= c(1,2,1,2,1,2,1,2,1,2,1,2), y= c(5,5,8,8,12,12,19,19,30,30,50,50),
z= c("2018-11-08", "2018-11-08", "2018-11-09", "2018-11-09", "2018-11-11", "2018-11-11", "2018-11-20",
"2018-11-20", "2018-11-29", "2018-11-29", "2018-11-30", "2018-11-30"),
m= c(0,1,1,0,1,1,0,1,0,1,0,1)
)
B[, t := ifelse(y==c(NA, y[- .N]) & z==c(NA, z[- .N]) & x==2 & m==1 & m!=c(NA, m[- .N]), 1, 0)]
or (if logical is acceptable)
B[, t := (y==c(NA, y[- .N]) & z==c(NA, z[- .N]) & x==2 & m==1 & m!=c(NA, m[- .N]))]
or using shift()
B[, t := (y==shift(y) & z==shift(z) & x==2 & m==1 & m!=shift(m))]
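The second question (data frame E) can probably be handled the same way in base R by looking two rows back instead of one. This is only a sketch of the condition in the question's loop for rows 3 onward. It assumes that this rule is what is wanted and that the first two rows, which have no row two positions earlier, should get 0. The helper lag_by() and the sample E are made up for illustration.

```r
# Hypothetical sample shaped like data frame E in the question
E <- data.frame(
  x = rep(1:3, times = 3),
  y = rep(c(5, 8, 12), each = 3),
  z = rep(c("2018-11-08", "2018-11-09", "2018-11-11"), each = 3),
  m = c(0, 1, 1, 0, 0, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# lag_by() is a helper invented for this sketch: shift v down by k rows, padding with NA
lag_by <- function(v, k) c(rep(NA, k), head(v, -k))

cond <- E$y == lag_by(E$y, 2) & E$z == lag_by(E$z, 2) &
  E$x == 3 & E$m == 1 &
  lag_by(E$m, 1) == 0 & lag_by(E$m, 2) == 0

# Rows without two predecessors compare against NA; count them as 0 (an assumption)
E$t <- as.integer(!is.na(cond) & cond)
E$t
# [1] 0 0 0 0 0 1 0 0 0
```

With data.table, shift(m, 2) would play the role of lag_by(m, 2).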
With dplyr you can use if_else and lag:
library(dplyr)
dat %>%
mutate(t = if_else(
y == lag(y) & z == lag(z) & x == 2 & m == 1 & m != lag(m), 1, 0)
) # mutate lets you create a new variable in dat (named t here)
# x y z m t
# 1 1 5 2018-11-08 0 0
# 2 2 5 2018-11-08 1 1
# 3 1 8 2018-11-09 1 0
# 4 2 8 2018-11-09 0 0
# 5 1 12 2018-11-11 1 0
# 6 2 12 2018-11-11 1 0
# 7 1 19 2018-11-20 0 0
# 8 2 19 2018-11-20 1 1
# 9 1 30 2018-11-29 0 0
# 10 2 30 2018-11-29 1 1
# 11 1 50 2018-11-30 0 0
# 12 2 50 2018-11-30 1 1
Data:
x<- c(1,2,1,2,1,2,1,2,1,2,1,2)
y<-c(5,5,8,8,12,12,19,19,30,30,50,50)
z<- c("2018-11-08","2018-11-08","2018-11-09","2018-11-09","2018-11-11","2018-11-11","2018-11-20","2018-11-20","2018-11-29","2018-11-29","2018-11-30","2018-11-30")
m<-c(0,1,1,0,1,1,0,1,0,1,0,1)
dat <- data.frame(x, y, z, m)
Here's a problem I haven't been able to solve.
Suppose we have the following code:
## A data frame named a
a <- data.frame(A = c(0,0,1,1,1), B = c(1,0,1,0,0), C = c(0,0,1,1,0), D = c(0,0,1,1,0), E = c(0,1,1,0,1))
## 1st function calculates all the combinations of the colnames of a; the output is a character vector named items2
items2 <- c()
countI <- 1
while(countI <= ncol(a)){
  for(i in countI){
    countJ <- countI + 1
    while(countJ <= ncol(a)){
      for(j in countJ){
        items2 <- c(items2, paste(colnames(a[i]), colnames(a[j]), collapse = '', sep = ""))
      }
      countJ <- countJ + 1
    }
    countI <- countI + 1
  }
}
And here's the code I'm trying to fix (the output is a numeric vector called count_1):
## 2nd function
colnames(a) <- NULL ## just for facilitating the calculation
count_1 <- numeric(ncol(a)*2)
countI <- 1
while(countI <= ncol(a)){
  for(i in countI){
    countJ <- countI + 1
    while(countJ <= ncol(a)){
      for(j in countJ){
        s <- a[, i]
        p <- a[, j]
        count_1[i*2] <- as.integer(s[i] == p[j] & s[i] == 1)
      }
      countJ <- countJ + 1
    }
    countI <- countI + 1
  }
}
But when I execute this code in the RStudio console, it returns an unexpected result:
count_1
[1] 0 0 0 0 0 1 0 1 0 0
However, I am expecting the following result:
count_1
[1] 1 2 2 2 1 1 1 1 2 1
For a detailed explanation, see the image on Dropbox at the following URL:
https://www.dropbox.com/s/5ylt8h8wx3zrvy7/IMAG1074.jpg?dl=0
I'll try to explain a little more.
I posted the first piece of code only as an example, to show the kind of result I'm looking for.
What I want from the second piece of code is, for each pair of columns, the number of rows in which both columns equal 1. The counter starts at 0 and is incremented by 1 for every row where both columns (AB, for example) are 1. This is repeated for every combination of two columns (AC, AD, AE, BC, BD, BE, CD, CE, and then DE); the number of combinations is n!/(2!(n-2)!). For example, if I have the following data frame:
a =
A B C D E
0 1 0 0 0
0 0 0 0 1
1 1 1 1 1
1 0 0 1 0
1 0 1 0 1
Then the number of rows in which the number 1 occurs in both of the first two columns is counted as follows (note that I set colnames(a) <- NULL just to simplify the work and make things clearer):
0 1 0 0 0
0 0 0 0 1
1 1 1 1 1
1 0 0 1 0
1 0 1 0 1
### Example 1: #####################################################
so from here I put (for columns A and B (AB))
s <- a[, i]
## s is equal to
## [1] 0 0 1 1 1
p <- a[, j]
## p is equal to
## [1] 1 0 1 0 0
Then I count the rows in which the number 1 occurs in both vectors at the same position, i.e. a[, i] == 1 & a[, j] == 1 & a[, i] == a[, j], and for this example the result will be [1] 1
### Example 2: #####################################################
From here I put (for columns A and D (AD))
s <- a[, i]
## s is equal to
## [1] 0 0 1 1 1
p <- a[, j]
## p is equal to
## [1] 0 0 1 1 0
Then I count the rows in which the number 1 occurs in both vectors at the same position, i.e. a[, i] == 1 & a[, j] == 1 & a[, i] == a[, j], and for this example the result will be [1] 2
And so on,
I'll have a numeric vector named count_1 equal to:
[1] 1 2 2 2 1 1 1 1 2 1
where each element of count_1 corresponds to one combination of two columns (without the names of the data frame):
AB AC AD AE BC BD BE CD CE DE
1 2 2 2 1 1 1 1 2 1
Not clear what you're after at all.
As to the first code chunk, that is some ugly R coding involving a whole bunch of unnecessary while/for loops.
You can get the same result items2 in one single line.
items2 <- sort(toupper(unlist(sapply(1:4, function(i)
sapply(5:(i+1), function(j)
paste(letters[i], letters[j], sep = ""))))));
items2;
# [1] "AB" "AC" "AD" "AE" "BC" "BD" "BE" "CD" "CE" "DE"
As to the second code chunk, please explain what you're trying to calculate. It's likely that these while/for loops are as unnecessary as in the first case.
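As an aside, base R's combn() can build the same pair labels directly from a vector of names, which avoids hard-coding letters and the 1:4 / 5 bounds. A minimal sketch, with col_names standing in for colnames(a):

```r
# col_names stands in for colnames(a); it is typed out here so the snippet runs on its own
col_names <- c("A", "B", "C", "D", "E")

# combn() enumerates all pairs and applies paste(..., collapse = "") to each one
items2 <- combn(col_names, 2, paste, collapse = "")
items2
# [1] "AB" "AC" "AD" "AE" "BC" "BD" "BE" "CD" "CE" "DE"
```

The pairs come out already in AB, AC, ..., DE order, so no sort() is needed.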
Update
Note that this is based on a as defined at the beginning of your post. Your expected output is based on a different a, that you changed further down the post.
There is no need for a for/while loop; each of the two "functions" can be written as a one-liner.
# Your sample dataframe a
a <- data.frame(A = c(0,0,1,1,1), B = c(1,0,1,0,0), C = c(0,0,1,1,0), D = c(0,0,1,1,0), E = c(0,1,1,0,1))
# Function 1
items2 <- toupper(unlist(sapply(1:(ncol(a) - 1), function(i) sapply(ncol(a):(i+1), function(j)
paste(letters[i], letters[j], sep = "")))));
# Function 2
count_1 <- unlist(sapply(1:(ncol(a) - 1), function(i) sapply(ncol(a):(i+1), function(j)
sum(a[, i] + a[, j] == 2))));
# Add names and sort
names(count_1) <- items2;
count_1 <- count_1[order(names(count_1))];
# Output
count_1;
#AB AC AD AE BC BD BE CD CE DE
# 1 2 2 2 1 1 1 2 1 1
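For what it's worth, the same counts can also be sketched with combn(), which walks the column pairs in AB, AC, ..., DE order so no reordering is needed afterwards. This assumes a only contains 0/1 values and uses a as defined at the top of the post. The argument name idx is just a label chosen for this example.

```r
a <- data.frame(A = c(0, 0, 1, 1, 1), B = c(1, 0, 1, 0, 0), C = c(0, 0, 1, 1, 0),
                D = c(0, 0, 1, 1, 0), E = c(0, 1, 1, 0, 1))

# For each pair of column indices, count the rows where both columns equal 1
count_1 <- combn(ncol(a), 2, function(idx) sum(a[[idx[1]]] == 1 & a[[idx[2]]] == 1))

# Label each count with the concatenated column names of its pair
names(count_1) <- combn(names(a), 2, paste, collapse = "")
count_1
# AB AC AD AE BC BD BE CD CE DE
#  1  2  2  2  1  1  1  2  1  1
```

Taking the labels from names(a) rather than letters means it keeps working if the columns are renamed.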