count the number of distinct variables in a group - r

I have a data frame such as this:
df <- data.frame(
ID = c('123','124','125','126'),
Group = c('A', 'A', 'B', 'B'),
V1 = c(1,2,1,0),
V2 = c(0,0,1,0),
V3 = c(1,1,0,3))
which returns:
ID Group V1 V2 V3
1 123 A 1 0 1
2 124 A 2 0 1
3 125 B 1 1 0
4 126 B 0 0 3
and I would like to return a table that indicates if a variable is represented in the group or not:
Group V1 V2 V3
A 1 0 1
B 1 1 1
I want this in order to count the number of distinct variables in each group.

We can do this with base R
aggregate(.~Group, df[-1], function(x) as.integer(sum(x)>0))
# Group V1 V2 V3
#1 A 1 0 1
#2 B 1 1 1
Or using rowsum from base R
+(rowsum(df[-(1:2)], df$Group)>0)
# V1 V2 V3
#A 1 0 1
#B 1 1 1
Or with by from base R
+(do.call(rbind, by(df[3:5], df['Group'], FUN = colSums))>0)
# V1 V2 V3
#A 1 0 1
#B 1 1 1
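The question ultimately asks for the number of distinct variables per group. As a minimal sketch, assuming the 0/1 table from the rowsum approach above is what is wanted, the count is the row sum of that table. The names present and n_vars are illustrative only.

```r
df <- data.frame(
  ID = c('123', '124', '125', '126'),
  Group = c('A', 'A', 'B', 'B'),
  V1 = c(1, 2, 1, 0),
  V2 = c(0, 0, 1, 0),
  V3 = c(1, 1, 0, 3))

# TRUE/FALSE per group and variable: is the variable represented in the group?
present <- rowsum(df[-(1:2)], df$Group) > 0

# Number of distinct variables represented in each group
n_vars <- rowSums(present)
n_vars
# A B
# 2 3
```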

Have you tried the following (group_by is from dplyr)?
unique(group_by(mtcars,cyl)$cyl)
Output: [1] 6 4 8

Related

how to replace a true value based on column name and additional table?

I have the following data. v1:v4 are boolean (TRUE/FALSE)
df1
id v1 v2 v3 v4
1 T T F F
2 F F T F
3 T F F F
4 F T T T
df2
var weight
v1 1
v2 4
v3 2
v4 5
I first need to replace the TRUE values of each variable with the weight given for that variable name in the secondary table df2. So any TRUE under the v1 column, for example, will become 1. FALSE will always be 0.
Then a Status variable should indicate whether the row contains a single non-zero value or multiple non-zero values:
df.out
id v1 v2 v3 v4 Status
1 1 4 0 0 Multiple
2 0 0 2 0 Single
3 1 0 0 0 Single
4 0 4 2 5 Multiple
Here is one option with tidyverse. Loop across the column names specified in the 'var' column of 'df2' and replace the TRUE values with the corresponding 'weight' element by matching the column name (cur_column()) with the 'var' column. Then create the 'Status' column based on the number of non-zero elements in each row using rowSums
library(dplyr)
df1 %>%
mutate(across(df2$var,
~ replace(., ., df2$weight[match(cur_column(), df2$var)]))) %>%
mutate(Status = case_when(rowSums(.[df2$var] > 0) > 1
~ 'Multiple', TRUE ~ 'Single'))
-output
id v1 v2 v3 v4 Status
1 1 1 4 0 0 Multiple
2 2 0 0 2 0 Single
3 3 1 0 0 0 Single
4 4 0 4 2 5 Multiple
Or using base R
df1new <- cbind(df1[1], setNames(df2$weight,
df2$var)[col(df1[df2$var])] * df1[df2$var])
df1new$Status <- c("Single", "Multiple")[1 + (rowSums(df1new[df2$var] > 0) > 1)]
-output
> df1new
id v1 v2 v3 v4 Status
1 1 1 4 0 0 Multiple
2 2 0 0 2 0 Single
3 3 1 0 0 0 Single
4 4 0 4 2 5 Multiple
Or another option is Map from base R
lst1 <- Map(`*`, df1[df2$var], df2$weight)
cbind(df1[1], lst1, Status = c('Single', 'Multiple')[1 + (rowSums(df1[-1]) > 1)])
id v1 v2 v3 v4 Status
1 1 1 4 0 0 Multiple
2 2 0 0 2 0 Single
3 3 1 0 0 0 Single
4 4 0 4 2 5 Multiple
data
df1 <- structure(list(id = 1:4, v1 = c(TRUE, FALSE, TRUE, FALSE), v2 = c(TRUE,
FALSE, FALSE, TRUE), v3 = c(FALSE, TRUE, FALSE, TRUE), v4 = c(FALSE,
FALSE, FALSE, TRUE)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(var = c("v1", "v2", "v3", "v4"), weight = c(1L,
4L, 2L, 5L)), class = "data.frame", row.names = c(NA, -4L))
This can be done very simply with a matrix operation, where x holds the logical columns of df1 and y holds the weights from df2
x <- matrix(c(T,F,T,F,T,F,F,T,F,T,F,T,F,F,F,T), nrow =4 , ncol = 4)
y <- c(1,4,2,5)
z <- x %*% diag(y)
z
result is
[,1] [,2] [,3] [,4]
[1,] 1 4 0 0
[2,] 0 0 2 0
[3,] 1 0 0 0
[4,] 0 4 2 5
Then count the non-zero entries in each row and bind that count on as Status:
Status <- rowSums(z > 0)
res <- as.data.frame(cbind(z, Status))
V1 V2 V3 V4 Status
1 1 4 0 0 2
2 0 0 2 0 1
3 1 0 0 0 1
4 0 4 2 5 3
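The matrix idea above can also be applied directly to df1 and df2 so that the column names and the Single/Multiple labels are kept. This is a minimal base R sketch, assuming df2$var lists exactly the logical columns of df1. The names w and out are illustrative.

```r
df1 <- data.frame(id = 1:4,
                  v1 = c(TRUE, FALSE, TRUE, FALSE),
                  v2 = c(TRUE, FALSE, FALSE, TRUE),
                  v3 = c(FALSE, TRUE, FALSE, TRUE),
                  v4 = c(FALSE, FALSE, FALSE, TRUE))
df2 <- data.frame(var = c("v1", "v2", "v3", "v4"),
                  weight = c(1L, 4L, 2L, 5L),
                  stringsAsFactors = FALSE)

# Multiply each logical column by its weight (TRUE counts as 1, FALSE as 0)
w <- sweep(as.matrix(df1[df2$var]), 2, df2$weight, `*`)

# Label each row by how many non-zero values it contains
out <- data.frame(id = df1$id, w,
                  Status = ifelse(rowSums(w > 0) > 1, "Multiple", "Single"))
out
```

Selecting the columns with df1[df2$var] keeps the weights aligned with the columns even if df1 stores them in a different order.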

How can I partition market basket items into clusters?

I have a data set as follows (I took a simple example, but the real data set is much bigger):
V1 V2 V3 V4
1 1 0 0 1
2 0 1 1 0
3 0 0 1 0
4 1 1 1 1
5 0 1 1 0
6 1 0 0 1
7 0 0 0 1
8 0 1 1 1
9 1 0 1 0
10 0 1 1 0
...
where V1, V2, V3, ..., Vn are items and 1, 2, 3, 4, ..., 1000 are transactions. I want to partition these items into k clusters such that each cluster contains the items that appear together most frequently in the same transactions.
To determine the number of times each pair of items appears together, I tried a cross table and got the following results:
V1 V2 V3 V4
V1 4 1 2 3
V2 1 5 5 2
V3 2 5 7 2
V4 3 2 2 5
For this small example, if I want to create 2 clusters (k=2) such that each cluster contains 2 items (to keep the clusters balanced), I will get:
Cluster1={V1,V4}
Cluster2={V2,V3}
because:
1) V1 appears most frequently with V4: (V1,V4)=3 > (V1,V3) > (V1,V2), and the same holds for V4.
2) V2 appears most frequently with V3: (V2,V3)=5 > (V2,V4) > (V2,V1), and the same holds for V3.
How can I do this partition in R, and for a bigger data set?
I think you are asking about clustering. It is not quite the same as what you are doing above, but you could use hclust to look for similarity between variables with a suitable distance measure.
For example
plot(hclust(dist(t(df),method="binary")))
produces a dendrogram of the items (plot not shown).
You should look at ?dist to see if this distance measure is meaningful in your context, and ?hclust for other things you can do once you have your dendrogram (such as identifying clusters).
Or you could use your crosstab as a distance matrix (perhaps take the reciprocal of the values, and then as.dist).
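The crosstab-as-distance idea might look like the sketch below, using the ten transactions from the question. The +1 in the reciprocal is an assumption added here so that pairs which never co-occur do not produce a division by zero. The names items, co, d, hc and clusters are illustrative.

```r
# The ten example transactions from the question (rows) by four items (columns)
items <- matrix(c(1, 0, 0, 1,
                  0, 1, 1, 0,
                  0, 0, 1, 0,
                  1, 1, 1, 1,
                  0, 1, 1, 0,
                  1, 0, 0, 1,
                  0, 0, 0, 1,
                  0, 1, 1, 1,
                  1, 0, 1, 0,
                  0, 1, 1, 0),
                ncol = 4, byrow = TRUE,
                dimnames = list(NULL, paste0("V", 1:4)))

# Co-occurrence counts: entry (i, j) is the number of transactions with both items
co <- crossprod(items)

# Frequent co-occurrence should mean small distance, so take a reciprocal
d <- as.dist(1 / (co + 1))

# Hierarchical clustering (complete linkage by default), cut into k = 2 clusters
hc <- hclust(d)
clusters <- cutree(hc, k = 2)
split(names(clusters), clusters)
# $`1`
# [1] "V1" "V4"
#
# $`2`
# [1] "V2" "V3"
```

Note that cutree does not enforce equal cluster sizes. The balanced result here comes from the data, not from a constraint.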
library(data.table)
data:
df<-
fread("
V1 V2 V3
1 1 0 0
2 0 0 1
3 0 0 1
4 1 1 1
5 0 0 1
6 1 0 0
7 0 0 0
8 0 1 1
9 1 0 1
10 0 1 1
")[,-1]
code:
setDT(df)
sapply(names(df),function(x){
df[get(x)==1,lapply(.SD,sum,na.rm=T),.SDcols=names(df)]
})
result:
V2 V3 V4
V2 4 1 2
V3 1 3 3
V4 2 3 7
df <- read.table(text="
ID V1 V2 V3
1 1 0 0
2 0 0 1
3 0 0 1
4 1 1 1
5 0 0 1
6 1 0 0
7 0 0 0
8 0 1 1
9 1 0 1
10 0 1 1
", header = TRUE)
k = 3 # number of clusters
library(dplyr)
df %>%
# group and count on all except the first id column
group_by_at(2:ncol(df)) %>%
# get the counts, and collect all the transaction ids
summarize(n = n(), tran_ids = paste(ID, collapse = ',')) %>%
ungroup() %>%
# grab the top k summarizations
top_n(k, n)
# V1 V2 V3 n tran_ids
# <int> <int> <int> <int> <chr>
# 1 0 0 1 3 2,3,5
# 2 0 1 1 2 8,10
# 3 1 0 0 2 1,6
You can transpose your table and use standard methods of clustering. Thus, you will cluster the items. The features are the transactions.
Geometric approaches such as kmeans can be used. Alternatively, you can use mixture models, which provide information criteria (like BIC) for selecting the number of clusters. Here is an R script:
require(VarSelLCM)
my.data <- as.data.frame(t(df))
# To consider Gaussian mixture
# Alternatively Poisson mixture can be considered by converting each column into integer.
for (j in 1:ncol(my.data)) my.data[,j] <- as.numeric(my.data[,j])
## Clustering by considering all the variables as discriminative
# Number of clusters is between 1 and 6
res.all <- VarSelCluster(my.data, 1:6, vbleSelec = FALSE)
# partition
res.all@partitions@zMAP
# shiny application
VarSelShiny(res.all)
## Clustering with variable selection
# Number of clusters is between 1 and 6
res.selec <- VarSelCluster(my.data, 1:6, vbleSelec = TRUE)
# partition
res.selec@partitions@zMAP
# shiny application
VarSelShiny(res.selec)

R: Applying a function to every entry in dataframe

I want to make every element in the dataframe (except for the ID column) become a 0 if it is any number other than 1.
I have:
ID A B C D E
abc 5 3 1 4 1
def 4 1 3 2 5
I want:
ID A B C D E
abc 0 0 1 0 1
def 0 1 0 0 0
I am having trouble figuring out how to apply this to every entry in every column and row.
Here is my code:
apply(dat.lec, 2 , function(y)
if(!is.na(y)){
if(y==1){y <- 1}
else{y <-0}
}
else {y<- NA}
)
Thank you for your help!
No need for implicit or explicit looping.
# Sample data
set.seed(2016);
df <- as.data.frame(matrix(sample(10, replace = TRUE), nrow = 2));
df <- cbind.data.frame(id = sample(letters, 2), df);
df;
# id V1 V2 V3 V4 V5
#1 k 2 9 5 7 1
#2 g 2 2 2 9 1
# Replace all entries != 1 with 0's
df[, -1][df[, -1] != 1] <- 0;
df;
# id V1 V2 V3 V4 V5
#1 k 0 0 0 0 1
#2 g 0 0 0 0 1
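The attempt in the question also tried to keep NA values as NA. A hedged alternative sketch that does this compares each column with 1 and converts the result to integer, since NA == 1 is NA. It is shown here on the data from the question and assumes every column except ID is numeric.

```r
dat.lec <- data.frame(ID = c("abc", "def"),
                      A = c(5, 4), B = c(3, 1), C = c(1, 3),
                      D = c(4, 2), E = c(1, 5),
                      stringsAsFactors = FALSE)

# For every column except ID: 1 stays 1, any other number becomes 0, NA stays NA
dat.lec[-1] <- lapply(dat.lec[-1], function(col) as.integer(col == 1))
dat.lec
#    ID A B C D E
# 1 abc 0 0 1 0 1
# 2 def 0 1 0 0 0

# NA handling on a small illustrative vector
as.integer(c(1, 5, NA) == 1)
# [1]  1  0 NA
```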

R - Add a column from values of others dataframes with conditions

My dataframe is :
dataMDS <- data.frame(FID=c(1,1), IID=c("CD03577","50016"), SOL=c(0,0), C1=c(0.00332472,-0.00154285))
> dataMDS
FID IID SOL C1
1 1 CD03577 0 0.00332472
2 1 50016 0 -0.00154285
I would like to add a new column plates with values from two other data frames:
platesRAC <- data.frame(V1=c(1,1), V2=c("CD03577","CD0371"), V3=c("2011-01-12_RAC1","2011-01-27_RAC5"))
> platesRAC
V1 V2 V3
1 1 CD03577 2011-01-12_RAC1
2 1 CD0371 2011-01-27_RAC5
platesDESIR <- data.frame(V1=c(1,1,1), V2=c("50015","50016","50017"), V3=c("2011-11-23_DESIR9","2011-11-23_DESIR9","2011-11-23_DESIR8"))
> platesDESIR
V1 V2 V3
1 1 50015 2011-11-23_DESIR9
2 1 50016 2011-11-23_DESIR9
3 1 50017 2011-11-23_DESIR8
I would like to get the value in V3 from platesRAC OR platesDESIR when V2 == IID and add this value in a new column plates in dataMDS.
I tried with merge :
new <- merge(x = dataMDS, y = platesRAC, by.x = "IID", by.y = 'V2', all = TRUE)
FID IID SOL C1 V1 V3
1 1 CD03577 0 0.00332472 1 2011-01-12_RAC1
2 1 50016 0 -0.00154285 NA <NA>
And of course I get NA values, because IID 50016 is in platesDESIR and not in platesRAC. I don't know how to express an OR (|) so that I don't get NA values.
Also, I don't want the V1 column after merging, just the V3 column renamed to plates.
The results I would like to have :
FID IID SOL C1 plates
1 1 CD03577 0 0.00332472 2011-01-12_RAC1
2 1 50016 0 -0.00154285 2011-11-23_DESIR9
Thanks for any help
This is not a merge but a match, done after binding platesRAC and platesDESIR together:
bindRACDESIR = rbind(platesRAC, platesDESIR)
dataMDS$plates <- bindRACDESIR$V3[match(dataMDS$IID,bindRACDESIR$V2)]
And the result is :
FID IID SOL C1 plates
1 1 CD03577 0 0.00332472 2011-01-12_RAC1
2 1 50016 0 -0.00154285 2011-11-23_DESIR9
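If a merge is still preferred, one possible sketch is to bind the two lookup tables first, keep only the key and value columns, and then do a left join. The name lookup is illustrative. Unlike match, merge sorts the rows by the key and moves the key column to the front, and it would duplicate rows if an IID appeared in both lookup tables.

```r
dataMDS <- data.frame(FID = c(1, 1), IID = c("CD03577", "50016"),
                      SOL = c(0, 0), C1 = c(0.00332472, -0.00154285),
                      stringsAsFactors = FALSE)
platesRAC <- data.frame(V1 = c(1, 1), V2 = c("CD03577", "CD0371"),
                        V3 = c("2011-01-12_RAC1", "2011-01-27_RAC5"),
                        stringsAsFactors = FALSE)
platesDESIR <- data.frame(V1 = c(1, 1, 1), V2 = c("50015", "50016", "50017"),
                          V3 = c("2011-11-23_DESIR9", "2011-11-23_DESIR9",
                                 "2011-11-23_DESIR8"),
                          stringsAsFactors = FALSE)

# One lookup table with only the key (V2) and value (V3), renamed for the join
lookup <- rbind(platesRAC, platesDESIR)[c("V2", "V3")]
names(lookup) <- c("IID", "plates")

# Left join: keep every row of dataMDS and attach the matching plate
new <- merge(dataMDS, lookup, by = "IID", all.x = TRUE)
new
```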

R Cumulative calculation based on prior values in the same column

This is what my dataframe looks like. V3 is my desired Column. V3 is not available to me.
library(data.table)
dt <- fread('
Level V1 V2
0 10 2
1 0 3
1 0 2
1 0 2 ')
I am trying to calculate V3 based on prior values of V3. The V3 formula is:
New value of V3 = ((prior value of V3 + prior value of V3 * V2) * Level) + V1
1st Row V3 = (NA+NA*2)*0 + 10 = 10 (there is no prior value, so only V1 counts)
2nd Row V3 = (10+10*3)*1 + 0 =40
3rd Row V3 = (40+40*2)*1 + 0 =120
4th Row V3 = (120+120*2)*1 + 0 = 360
The output should look like this.
Level V1 V2 V3
0 10 2 10
1 0 3 40
1 0 2 120
1 0 2 360
I was trying:
dt[,V3:= (cumsum(V3+V3*V2)*Level)+V1]
I reworked your efforts in the comments to get the desired result:
dt[,V3:=cumprod( c(V1[1] ,(Level*(1 + V2))[-1]) ) ]
dt
Level V1 V2 V3
1: 0 10 2 10
2: 1 0 3 40
3: 1 0 2 120
4: 1 0 2 360
I didn't actually get an error (only a warning) with dt[,V3:= V1[1] * cumprod((Level*(1 + V2))[-1])]. Using the [-1] shortened the cumprod with no extension, and resulted in recycling.
Within data.table
dt[,{ lag.V3=c(0, V3[-.N]) ; V3 = (lag.V3 + lag.V3 * V2 )* Level + V1 }]
Output
[1] 10 40 120 360
Here is one way to do it in dplyr
dt %>%
mutate(V4=lag(V3) + lag(V3)*V2 + V1,
V4=ifelse(is.na(V4), 0, V4))
Level V1 V2 V3 V4
1 0 10 2 10 0
2 1 0 3 40 40
3 1 0 2 120 120
4 1 0 2 360 360
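For reference, the recurrence from the question can also be written as an explicit loop, which makes the dependence on the prior value visible. This is a minimal sketch that assumes the missing prior value in the first row is treated as 0. The variable prev is illustrative, and a plain data frame is used so that no package is needed.

```r
dt <- data.frame(Level = c(0, 1, 1, 1),
                 V1 = c(10, 0, 0, 0),
                 V2 = c(2, 3, 2, 2))

V3 <- numeric(nrow(dt))
prev <- 0  # assumption: no prior value before the first row, so start from 0
for (i in seq_len(nrow(dt))) {
  # New V3 = ((prior V3 + prior V3 * V2) * Level) + V1
  V3[i] <- (prev + prev * dt$V2[i]) * dt$Level[i] + dt$V1[i]
  prev <- V3[i]
}
dt$V3 <- V3
dt
#   Level V1 V2  V3
# 1     0 10  2  10
# 2     1  0  3  40
# 3     1  0  2 120
# 4     1  0  2 360
```

The loop is slower than cumprod on large tables, but it also handles rows where V1 is non-zero after the first row, which the cumprod shortcut does not.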
