Data Manipulation in R Project: compare rows - r

I'm looking to compare values within a dataset.
Every row starts with a unique ID followed by a couple of binary variables.
The data looks like this:
row.name v1 v2 v3 ...
1 0 0 0
2 1 1 1
3 1 0 1
I want to know which values are the same (if equal assign value of 1) and which are different (if not equal assign value of 0) for all unique pairings.
For example in column v1: row1 == 0 and row2 == 1, which should result in an assignment of 0.
So, the output should look like this
id1 id2 v1 v2 v3 ...
1 2 0 0 0 ...
1 3 0 1 0 ...
2 3 1 0 1 ...
I'm looking for an efficient way of doing this for more than 1000 rows...

There's no way to do this without expanding each combination of rows, so with 1000 rows, it is going to take a bit of time. But here is a solution:
dat <- read.table(header=TRUE, text="row.name v1 v2 v3
1 0 0 0
2 1 1 1
3 1 0 1")
Create the index rows:
indices <- t(combn(dat$row.name, 2))
colnames(indices) <- c('id1', 'id2')
Loop through the index rows, and collect the comparisons:
res1 <- t(apply(indices, 1, function(x) as.numeric(dat[x[1],-1] == dat[x[2],-1])))
colnames(res1) <- names(dat[-1])
Put them together:
result <- cbind(indices, res1)
result
## id1 id2 v1 v2 v3
## [1,] 1 2 0 0 0
## [2,] 1 3 0 1 0
## [3,] 2 3 1 0 1
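For larger inputs, the per-pair apply call can be replaced by two matrix subsetting operations. The following is a minimal sketch rather than part of the answer above; it assumes every column after the first holds 0/1 values, and the names m, idx, same and result2 are only illustrative.

```r
dat <- read.table(header = TRUE, text = "row.name v1 v2 v3
1 0 0 0
2 1 1 1
3 1 0 1")

m <- as.matrix(dat[-1])        # comparison columns only
idx <- combn(nrow(dat), 2)     # 2 x choose(n, 2) matrix of row positions

# Compare all "left" rows against all "right" rows in one vectorised step
same <- (m[idx[1, ], , drop = FALSE] == m[idx[2, ], , drop = FALSE]) * 1L

result2 <- cbind(id1 = dat$row.name[idx[1, ]],
                 id2 = dat$row.name[idx[2, ]],
                 same)
result2
```

Because the pairs are addressed by row position and the IDs are looked up afterwards, this variant should also work when row.name is not simply 1, 2, 3, and so on. With 1000 rows there are 499,500 pairs, so the result is still large.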

Related

Is there a way to add two contiguous columns in R?

I have a data frame where each observation is spread across two columns: columns 1 and 2 represent individual 1, columns 3 and 4 represent individual 2, and so on.
Basically, I want to add each pair of contiguous columns to get each individual's real score.
In this example, V1 and V2 represent individual I, and V3 and V4 represent individual II. The resulting data frame will have half as many columns and the same number of rows, and each value will be the sum of the values in two contiguous columns.
Data
V1 V2 V3 V4
1 0 0 1 1
2 1 0 0 0
3 0 1 1 1
4 0 1 0 1
Desired output
I II
1 0 2
2 1 0
3 1 2
4 1 1
I tried something like this
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
b <- matrix(NA, nrow = nrow(a), ncol = ncol(a))
for (i in seq(2,ncol(a),by=2)){
for (k in 1:nrow(a)){
b[k,i] <- a[k,i] + a[k,i-1]
}
}
b <- as.data.frame(b)
b <- b[,-c(seq(1,length(b),by=2))]
Is there a way to make it simpler?
We could use split.default to split the data and then do rowSums by looping over the list
sapply(split.default(a, as.integer(gl(ncol(a), 2, ncol(a)))), rowSums)
1 2
[1,] 0 2
[2,] 1 0
[3,] 1 2
[4,] 1 1
You can use vector recycling to select columns and add them.
res <- a[c(TRUE, FALSE)] + a[c(FALSE, TRUE)]
names(res) <- paste0('col', seq_along(res))
res
# col1 col2
#1 0 2
#2 1 0
#3 1 2
#4 1 1
dplyr's approach with row-wise operations (rowwise is a special type of grouping per row)
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
library(dplyr)
a%>%
rowwise()%>%
transmute(I=sum(c(V1,V2)),
II=sum(c(V3,V4)))
or alternatively with a built-in row-wise variant of the sum
a %>% transmute(I = rowSums(across(1:2)),
II = rowSums(across(3:4)))
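When the number of column pairs is not known in advance, a base R variant can build the grouping programmatically. This is a sketch that assumes an even number of numeric columns; grp and pair_sums are illustrative names, and the Roman numeral labels only mirror the desired output in the question.

```r
a <- data.frame(V1 = c(0, 1, 0, 0), V2 = c(0, 0, 1, 1),
                V3 = c(1, 0, 1, 0), V4 = c(1, 0, 1, 1))

# One group label per column: 1 1 2 2 ...
grp <- rep(seq_len(ncol(a) / 2), each = 2)

# rowsum() adds rows by group, so transpose, sum, and transpose back
pair_sums <- t(rowsum(t(as.matrix(a)), group = grp))
colnames(pair_sums) <- as.character(as.roman(seq_len(ncol(pair_sums))))
pair_sums
```

The result is a matrix; wrap it in as.data.frame() if a data frame is needed.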

How can I partition market basket items into clusters?

I have a data set as follows (I took a simple example, but the real data set is much bigger):
V1 V2 V3 V4
1 1 0 0 1
2 0 1 1 0
3 0 0 1 0
4 1 1 1 1
5 0 1 1 0
6 1 0 0 1
7 0 0 0 1
8 0 1 1 1
9 1 0 1 0
10 0 1 1 0
...
where V1, V2, V3, ..., Vn are items and 1, 2, 3, 4, ..., 1000 are transactions. I want to partition these items into k clusters such that each cluster contains the items that appear together most frequently in the same transactions.
To determine the number of times each pair of items appears together, I tried a cross table and got the following results:
V1 V2 V3 V4
V1 4 1 2 3
V2 1 5 5 2
V3 2 5 7 2
V4 3 2 2 5
For this small example, if I want to create 2 clusters (k=2) such that each cluster contains 2 items (to keep the clusters balanced), I will get:
Cluster1={V1,V4}
Cluster2={V2,V3}
because:
1) V1 appears most frequently with V4: (V1,V4)=3 > (V1,V3) > (V1,V2), and the same holds for V4.
2) V2 appears most frequently with V3: (V2,V3)=5 > (V2,V4) > (V2,V1), and the same holds for V3.
How can I do this partition with R, and for a bigger data set?
I think you are asking about clustering. It is not quite the same as what you are doing above, but you could use hclust to look for similarity between variables with a suitable distance measure.
For example
plot(hclust(dist(t(df),method="binary")))
produces a dendrogram of the items (plot not shown).
You should look at ?dist to see if this distance measure is meaningful in your context, and ?hclust for other things you can do once you have your dendrogram (such as identifying clusters).
Or you could use your crosstab as a distance matrix (perhaps take the reciprocal of the values, and then as.dist).
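The crosstab idea can be sketched as follows, using the question's ten example transactions. The conversion 1 / (1 + co) is an assumption made here for illustration (the +1 only avoids dividing by zero for pairs that never co-occur); other transformations may suit the data better. Note that hclust does not force clusters to be of equal size.

```r
# The question's ten example transactions (rows) over four items (columns)
df <- data.frame(
  V1 = c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  V2 = c(0, 1, 0, 1, 1, 0, 0, 1, 0, 1),
  V3 = c(0, 1, 1, 1, 1, 0, 0, 1, 1, 1),
  V4 = c(1, 0, 0, 1, 0, 1, 1, 1, 0, 0)
)

# Item-by-item co-occurrence counts (the cross table from the question)
co <- crossprod(as.matrix(df))

# Assumed conversion: frequently co-occurring pairs get small distances
d <- as.dist(1 / (1 + co))

hc <- hclust(d)                 # complete linkage by default
clusters <- cutree(hc, k = 2)   # cut the tree into two groups
clusters
```

On this small example V1 and V4 end up in one group and V2 and V3 in the other, which matches the partition described in the question.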
library(data.table)
data:
df<-
fread("
V1 V2 V3
1 1 0 0
2 0 0 1
3 0 0 1
4 1 1 1
5 0 0 1
6 1 0 0
7 0 0 0
8 0 1 1
9 1 0 1
10 0 1 1
")[,-1]
code:
setDT(df)
sapply(names(df),function(x){
df[get(x)==1,lapply(.SD,sum,na.rm=T),.SDcols=names(df)]
})
result:
V2 V3 V4
V2 4 1 2
V3 1 3 3
V4 2 3 7
df <- read.table(text="
ID V1 V2 V3
1 1 0 0
2 0 0 1
3 0 0 1
4 1 1 1
5 0 0 1
6 1 0 0
7 0 0 0
8 0 1 1
9 1 0 1
10 0 1 1
", header = TRUE)
k = 3 # number of clusters
library(dplyr)
df %>%
# group and count on all except the first id column
group_by_at(2:ncol(df)) %>%
# get the counts, and collect all the transaction ids
summarize(n = n(), tran_ids = paste(ID, collapse = ',')) %>%
ungroup() %>%
# grab the top k summarizations
top_n(k, n)
# V1 V2 V3 n tran_ids
# <int> <int> <int> <int> <chr>
# 1 0 0 1 3 2,3,5
# 2 0 1 1 2 8,10
# 3 1 0 0 2 1,6
You can transpose your table and use standard clustering methods. That way you cluster the items, and the transactions are the features.
Geometric approaches such as kmeans can be used. Alternatively, you can use mixture models, which provide information criteria (like BIC) for selecting the number of clusters. Here is an R script:
require(VarSelLCM)
my.data <- as.data.frame(t(df))
# To consider Gaussian mixture
# Alternatively Poisson mixture can be considered by converting each column into integer.
for (j in 1:ncol(my.data)) my.data[,j] <- as.numeric(my.data[,j])
## Clustering by considering all the variables as discriminative
# Number of clusters is between 1 and 6
res.all <- VarSelCluster(my.data, 1:6, vbleSelec = FALSE)
# partition
res.all@partitions@zMAP
# shiny application
VarSelShiny(res.all)
## Clustering with variable selection
# Number of clusters is between 1 and 6
res.selec <- VarSelCluster(my.data, 1:6, vbleSelec = TRUE)
# partition
res.selec@partitions@zMAP
# shiny application
VarSelShiny(res.selec)

Split a string in column and count occurrence of characters

I have a very large file with dimensions 47,685 x 10,541. In that file, there are no spaces between the characters in each row of the second column, as follows:
File # 1
Row1 01205201207502102102…..
Row2 20101020100210201022…..
Row3 21050210210001120120…..
I want to do some statistics on that file and maybe delete some columns or rows. So, using R, I want to add one space between every two characters in the second column to get something like this:
File # 2
Row1 0 1 2 0 5 2 0 1 2 0 7 5 0 2 1 0 2 1 0 2…..
Row2 2 0 1 0 1 0 2 0 1 0 0 2 1 0 2 0 1 0 2 2…..
Row3 2 1 0 0 0 2 1 0 2 1 0 0 0 1 1 2 0 1 2 0…..
Then, after I finish editing, I want to remove the spaces between the characters in the second column, so the final format will be just like File # 1.
What is the best and fastest way to do that?
Updated to address the column count as well (from your comments).
Here is a solution using tidyr and stringr. It assumes that all strings in the second column have the same length. The solution gives you both row-wise and column-wise counts. It is written in a very basic, step-by-step manner; the same could be achieved with fewer lines of code.
library(stringr)
library(tidyr)
data<-data.frame( Column.1 = c("01205", "20705", "27057"),
stringsAsFactors = FALSE)
count<-str_count(data$Column.1) # Get the length of the string in column 2
index<-1:count[1] # Generate an index based on the length
# Count the number of 5 and 7 in each string by row and add it as new column
data$Row.count_5 <- str_count(data$Column.1, "5")
data$Row.count_7 <- str_count(data$Column.1, "7")
new.data <- separate(data, Column.1, into = paste("V", 1:count[1], sep = ""), sep = index)
new.data$'NA' <- NULL
new.data
Column_count_5 <- apply(new.data[1:5],2,FUN=function(x) sum(x == 5))
Column_count_7 <- apply(new.data[1:5],2,FUN=function(x) sum(x == 7))
column_count <- as.data.frame(t(data.frame(Column_count_5,Column_count_7)))
library(plyr)
Final.df<- rbind.fill(new.data,column_count)
rownames(Final.df)<-c("Row1","Row2","Row3", "Column.count_5","Column.count_7")
Final.df
output
V1 V2 V3 V4 V5 Row.count_5 Row.count_7
Row1 0 1 2 0 5 1 0
Row2 2 0 7 0 5 1 1
Row3 2 7 0 5 7 1 2
Column.count_5 0 0 0 1 2 NA NA
Column.count_7 0 1 1 0 1 NA NA
Sample data
data<-data.frame( Column.1 = c("01205", "20705", "27057"),
stringsAsFactors = FALSE)
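For the split, edit and re-join round trip asked about in the question, a base R sketch could look like the following. It assumes all strings have the same length; the names chars, edited and joined, and the choice to drop the third position, are purely illustrative.

```r
x <- c(Row1 = "01205", Row2 = "20705", Row3 = "27057")

# Split every string into single characters; one matrix row per string
chars <- do.call(rbind, strsplit(x, ""))

# Example edit: drop the third character position from every row
edited <- chars[, -3, drop = FALSE]

# Count how often "5" occurs per row and per column
fives_by_row <- rowSums(edited == "5")
fives_by_col <- colSums(edited == "5")

# Collapse each row back into one string without spaces
joined <- apply(edited, 1, paste, collapse = "")
joined
```

Working on a character matrix avoids inserting and removing spaces at all. For a file of the size mentioned in the question the matrix would be large, so memory use may need checking first.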

Count in buckets (Total by Row, aka Tabulate) [duplicate]

Imagine a group of three machines (a, b, c) capturing data in a series of tests. I need to count, per test, how many times each possible outcome occurred.
Using this test data and sample output, how might you solve it? (Assume that the test results may be numbers or alpha.)
library(data.table)
tests <- data.table(
a = c(1,2,2,3,0),
b = c(1,2,3,0,3),
c = c(2,2,3,0,2)
)
sumry <- data.table(
V0 = c(0,0,0,2,1),
V1 = c(2,0,0,0,0),
V2 = c(1,3,1,0,1),
V3 = c(0,0,2,1,1),
v4 = c(0,0,0,0,0)
)
tests
sumry
The output from sumry shows a column for each possible outcome/value (prefixed with V as in 'value' measured). Note: the sumry output indicates that there is the potential for a value of 4 but that is not observed in any of the test data here and therefore is always zero.
> tests
a b c
1: 1 1 2
2: 2 2 2
3: 2 3 3
4: 3 0 0
5: 0 3 2
> sumry
V0 V1 V2 V3 v4
1: 0 2 1 0 0
2: 0 0 3 0 0
3: 0 0 1 2 0
4: 2 0 0 1 0
5: 1 0 1 1 0
The V0 column of sumry indicates how many times the value zero is observed from any machine in each test. For this set of test data, zero is only observed in the 4th and 5th tests. The same holds true for V1-V4.
I'm sure there's a simple name for this.
Here's one solution built around tabulate():
res <- suppressWarnings(do.call(rbind,apply(tests+1L,1L,tabulate)));
colnames(res) <- paste0('V',seq(0L,len=ncol(res)));
res;
## V0 V1 V2 V3
## [1,] 0 2 1 0
## [2,] 0 0 3 0
## [3,] 0 0 1 2
## [4,] 2 0 0 1
## [5,] 1 0 1 1
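The version above relies on rbind recycling vectors of different lengths (hence the suppressWarnings), which can give wrong counts for other data: a row such as 0, 2, 2 would be padded with a repeated value rather than a zero. A sketch that fixes the number of bins instead is shown below; max_value is an assumed upper bound on the outcomes, taken from the question's sumry, and a plain data frame stands in for the data.table.

```r
tests <- data.frame(a = c(1, 2, 2, 3, 0),
                    b = c(1, 2, 3, 0, 3),
                    c = c(2, 2, 3, 0, 2))

max_value <- 4   # assumed largest possible outcome, as in the question's sumry

# Shift by 1 because tabulate() counts positive integers starting at 1
counts <- t(apply(as.matrix(tests) + 1, 1, tabulate, nbins = max_value + 1))
colnames(counts) <- paste0("V", 0:max_value)
counts
```

Every row now has the same length, so no recycling takes place and unobserved values such as 4 get an explicit zero column. For non-numeric outcomes, table() on a factor with fixed levels is one possible substitute for tabulate().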

Remove rows that are associated with certain column values

I am new to R. My data consist of an X matrix of 0s and 1s with associated y values.
I need to remove the observations belonging to columns that have fewer than 10 ones. So I sum the columns of X and collect the names of those columns in a vector. Then I drop the rows (y values) that have a one in those columns, and finally I need to remove the columns themselves, because they will contain only zeros.
I am getting this error, and I don't know how to fix it or improve the code:
Error in -Col[i] : invalid argument to unary operator
Here is the code
a0=rep(1,40)
a=rep(0:1,20)
b=c(rep(1,20),rep(0,20))
c0=c(rep(0,12),rep(1,28))
c1=c(rep(1,5),rep(0,35))
c2=c(rep(1,8),rep(0,32))
c3=c(rep(1,23),rep(0,17))
x=matrix(cbind(a0,a,b,c0,c1,c2,c3),nrow=40,ncol=7)
nam <- paste("V",1:7,sep="")
colnames(x)<-nam
dat <- cbind(y=rnorm(40,50,7),x)
#===================================
toSum <- apply(dat,2,sum)
Col <- Val <- NULL
for(i in 1:length(toSum)){
if(toSum[i]<10){
Col <- c(Col,colnames(dat)[i])
Val <- c(Val,toSum[i])}
}
for(i in 1:length(Col)){
indx <- dat[,Col[i]]==0
datnw <- dat[indx,]
datnw2 <- datnw[,-Col[i]]
}
Can someone help, please? I am not sure if there is a way to get the positions of the columns in the Col vector. I have around 1500 columns in my original data.
Thanks
This should do the trick
datnw2 <- dat[, -which(toSum<10)]
This allows you to avoid the loop
head(datnw2)
y V1 V2 V3 V4 V7
[1,] 60.88166 1 0 1 0 1
[2,] 54.35388 1 1 1 0 1
[3,] 39.78881 1 0 1 0 1
[4,] 44.20074 1 1 1 0 1
[5,] 42.27351 1 0 1 0 1
[6,] 43.52390 1 1 1 0 1
Edit: Some pointers
toSum<10 gives you a logical vector whose length is the same as length(toSum)
which(toSum<10) gives you the positions of the elements meeting the condition
Since you want to select those columns from dat for which toSum<10 is FALSE, you have to leave the others out with dat[, -which(toSum<10)], which means: choose all columns except 6 and 7, which are the ones meeting the condition toSum<10
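One caveat with negative which() indexing is the case where no column meets the condition. A small sketch with made-up data (three indicator columns instead of seven, and set.seed only for reproducibility) shows how a logical mask behaves differently; keep and none_low are illustrative names.

```r
set.seed(1)  # only to make the illustrative y column reproducible
x <- cbind(V1 = rep(1, 40), V2 = rep(0:1, 20), V3 = c(rep(1, 5), rep(0, 35)))
dat <- cbind(y = rnorm(40, 50, 7), x)
toSum <- colSums(dat)

keep <- !(toSum < 10)                 # TRUE for the columns to keep
datnw2 <- dat[, keep, drop = FALSE]
colnames(datnw2)

# If no column falls below the threshold, the logical mask keeps everything,
# whereas -which(...) becomes -integer(0) and selects zero columns
none_low <- toSum < 0
ncol(dat[, !none_low, drop = FALSE])          # 4
ncol(dat[, -which(none_low), drop = FALSE])   # 0
```

For that reason the logical form dat[, !(toSum < 10)] may be the safer habit when the condition can legitimately match nothing.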
Using your example data, if you want to find which rows (i.e. observations) have fewer than 10 1s
rs <- rowSums(dat[, -1]) < 10
If you want to know which columns (i.e. variables) have less than 10 "presences" then
cs <- colSums(dat[, -1]) < 10
R> cs
V1 V2 V3 V4 V5 V6 V7
FALSE FALSE FALSE FALSE TRUE TRUE FALSE
Both rs and cs are logical variables that can be used to index to remove rows/columns.
To get rid of the columns we use:
dat2 <- dat
# cs covers only V1..V7, so prepend TRUE to keep the y column
dat2 <- dat2[, c(TRUE, !cs)]
head(dat2)
R> head(dat2)
y V1 V2 V3 V4 V7
[1,] 47.61253 1 0 1 0 1
[2,] 60.51697 1 1 1 0 1
[3,] 53.69815 1 0 1 0 1
[4,] 53.79534 1 1 1 0 1
[5,] 49.04329 1 0 1 0 1
[6,] 42.04286 1 1 1 0 1
Next it seems that you are concerned that some rows will now be all zero? Is that what you are trying to do with the final step? That doesn't appear to be the case here, so perhaps the way of removing the columns I show has solved that problem too?
R> rowSums(dat2[,-1])
[1] 3 4 3 4 3 4 3 4 3 4 3 4 4 5 4 5 4 5 4 5 3 4 3 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
[39] 2 3
