I have an array and would like a measure of the similarity between the values in each pair of columns. That is, I want to compare the columns pairwise, row by row, and increment a counter each time their values match. The measure would then be at its maximum for two identical columns.
Essentially my problem is the same as the one discussed here: "R: Compare all the columns pairwise in matrix", except that I do not want empty cells to be counted.
With the example data created from code derived from the linked page:
data1 <- c("", "B", "", "", "")
data2 <- c("A", "", "", "", "")
data3 <- c("", "", "C", "", "A")
data4 <- c("", "", "", "", "")
data5 <- c("", "", "C", "", "A")
data6 <- c("", "B", "C", "", "")
my.matrix <- cbind(data1, data2, data3, data4, data5, data6)
similarity.matrix <- matrix(nrow=ncol(my.matrix), ncol=ncol(my.matrix))
for(col in 1:ncol(my.matrix)){
  matches <- my.matrix[,col] == my.matrix
  match.counts <- colSums(matches)
  match.counts[col] <- 0
  similarity.matrix[,col] <- match.counts
}
I obtain:
similarity.matrix =
V1 V2 V3 V4 V5 V6
1 0 3 2 4 2 4
2 3 0 2 4 2 2
3 2 2 0 3 5 3
4 4 4 3 0 3 3
5 2 2 5 3 0 3
6 4 2 3 3 3 0
which also counts pairs of empty cells as matches.
My desired output would be:
expected.output =
V1 V2 V3 V4 V5 V6
1 0 0 0 0 0 1
2 0 0 0 0 0 0
3 0 0 0 0 2 1
4 0 0 0 0 0 0
5 0 0 2 0 0 1
6 1 0 1 0 1 0
Thanks,
Matt
The following is the answer from akrun:
First, change the blank cells to NA:
is.na(my.matrix) <- my.matrix==''
and then drop the NAs when computing match.counts:
similarity.matrix <- matrix(nrow=ncol(my.matrix), ncol=ncol(my.matrix))
for(col in 1:ncol(my.matrix)){
  matches <- my.matrix[,col] == my.matrix
  match.counts <- colSums(matches, na.rm=TRUE)
  match.counts[col] <- 0
  similarity.matrix[,col] <- match.counts
}
This did indeed give me my desired output:
V1 V2 V3 V4 V5 V6
1 0 0 0 0 0 1
2 0 0 0 0 0 0
3 0 0 0 0 2 1
4 0 0 0 0 0 0
5 0 0 2 0 0 1
6 1 0 1 0 1 0
thank you.
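For larger matrices the explicit loop can be avoided. What follows is only a sketch of one alternative and not part of akrun's answer: for each distinct non-empty value it builds a 0/1 indicator matrix and uses crossprod() to count, for every pair of columns, the rows where both hold that value. The names vals and sim are made up for this illustration.

```r
my.matrix <- cbind(data1 = c("", "B", "", "", ""),
                   data2 = c("A", "", "", "", ""),
                   data3 = c("", "", "C", "", "A"),
                   data4 = c("", "", "", "", ""),
                   data5 = c("", "", "C", "", "A"),
                   data6 = c("", "B", "C", "", ""))

# Distinct values, ignoring the empty string
vals <- setdiff(unique(as.vector(my.matrix)), "")

# For each value v, crossprod() of the 0/1 indicator matrix counts the rows
# in which both columns equal v; summing over v gives the match counts
sim <- Reduce(`+`, lapply(vals, function(v) crossprod((my.matrix == v) * 1)))

# A column is not compared with itself
diag(sim) <- 0
sim
```

Because empty cells never produce an indicator of 1, they cannot contribute to any count, which is the same effect the NA conversion has in the loop version.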
Problem - Data Wrangling:
I want to fine-tune the grading of a multiple-choice exam with 5 items per question (A, B, C, D, E) by applying a coefficient to each possible item. For this I need to do some data wrangling:
Input:
library(tibble)
(
df <- tribble(
~id, ~Q1, ~Q2, ~Q3,
#|----|------|------|------|
1, "CDE", "A", "AD",
2, "CDE", "AB", "AD",
3, "DE", "BC", "AD")
)
Expected output :
id Q1_A Q1_B Q1_C Q1_D Q1_E Q2_A Q2_B Q2_C Q2_D Q2_E Q3_A Q3_B Q3_C Q3_D Q3_E
 1    0    0    1    1    1    1    0    0    0    0    1    0    0    1    0
 2    0    0    1    1    1    1    1    0    0    0    1    0    0    1    0
 3    0    0    0    1    1    0    1    1    0    0    1    0    0    1    0
We could split each string into single characters and use mtabulate from qdapTools:
library(qdapTools)
cbind(df[1], do.call(cbind, lapply(df[-1],
function(x) mtabulate(strsplit(x, "")))))
Or, in base R: split each column's values with strsplit, get the frequency counts with table, and then cbind the list elements.
cbind(df[1], do.call(cbind, lapply(df[-1], function(x) {
  x1 <- strsplit(x, "")
  as.data.frame.matrix(table(data.frame(
    ind = rep(seq_along(x1), lengths(x1)),
    val = factor(unlist(x1), levels = LETTERS[1:5]))))
})))
-output
# id Q1.A Q1.B Q1.C Q1.D Q1.E Q2.A Q2.B Q2.C Q2.D Q2.E Q3.A Q3.B Q3.C Q3.D Q3.E
#1 1 0 0 1 1 1 1 0 0 0 0 1 0 0 1 0
#2 2 0 0 1 1 1 1 1 0 0 0 1 0 0 1 0
#3 3 0 0 0 1 1 0 1 1 0 0 1 0 0 1 0
Another base R option
cbind(
  df[1],
  `colnames<-`(
    do.call(
      cbind,
      lapply(
        df[-1],
        function(x) {
          t(sapply(
            strsplit(x, ""),
            function(v) table(factor(v, levels = LETTERS[1:5]))
          ))
        }
      )
    ),
    paste0(rep(names(df)[-1], each = 5), "_", LETTERS[1:5])
  )
)
which gives
id Q1_A Q1_B Q1_C Q1_D Q1_E Q2_A Q2_B Q2_C Q2_D Q2_E Q3_A Q3_B Q3_C Q3_D Q3_E
1 1 0 0 1 1 1 1 0 0 0 0 1 0 0 1 0
2 2 0 0 1 1 1 1 1 0 0 0 1 0 0 1 0
3 3 0 0 0 1 1 0 1 1 0 0 1 0 0 1 0
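The one-liners from the other posters are very clever but hard to decipher.
This is a more readable solution, in my opinion:
ABCDE <- LETTERS[1:5]
# TRUE/FALSE matrix with one column per letter: does the answer contain it?
one_col_to_five <- function(col) sapply(ABCDE, grepl, col)
(proper_df <- do.call(cbind, lapply(df[, -1], one_col_to_five)))
(proper_df <- as.data.frame(cbind(df$id, proper_df)))
# each = 5 keeps the names aligned with the column order Q1_A..Q1_E, Q2_A.., Q3_A..
names(proper_df) <- c("id", paste(rep(names(df[-1]), each = 5), ABCDE, sep = "_"))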
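Since the stated goal is to apply a coefficient to each item, here is a minimal sketch of how the 0/1 columns could be used once they exist. The coefficient values are purely hypothetical (+1 for items assumed to be correct, -1 for all others), and the names ind, coefs and scores are made up for the illustration; the question does not say what the real coefficients are.

```r
# Answers as in the question (a plain data.frame instead of a tibble)
df <- data.frame(id = 1:3,
                 Q1 = c("CDE", "CDE", "DE"),
                 Q2 = c("A", "AB", "BC"),
                 Q3 = c("AD", "AD", "AD"),
                 stringsAsFactors = FALSE)
items <- LETTERS[1:5]

# One 0/1 column per question/item pair, in the order Q1_A ... Q3_E
ind <- do.call(cbind, lapply(names(df)[-1], function(q) {
  m <- sapply(items, function(it) as.integer(grepl(it, df[[q]], fixed = TRUE)))
  colnames(m) <- paste(q, items, sep = "_")
  m
}))

# Hypothetical coefficients: -1 by default, +1 for items assumed correct
coefs <- setNames(rep(-1, ncol(ind)), colnames(ind))
coefs[c("Q1_C", "Q1_D", "Q1_E", "Q2_A", "Q3_A", "Q3_D")] <- 1

# Weighted score per respondent: indicator matrix times coefficient vector
scores <- drop(ind %*% coefs)
data.frame(id = df$id, score = scores)
```

With these made-up weights, respondent 2 gains a point for Q2_A and loses one for Q2_B, so Q2 contributes nothing to that score.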
I want to sum two data frames like,
> ab
1 2 3 4 5
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
> cd
2 3 4 5
1 1 1 1 1
2 1 1 1 1
4 1 1 1 1
So that the elements are summed by the corresponding rows and columns names in the larger data frame, such that,
> ab
1 2 3 4 5
1 0 1 1 1 1
2 0 1 1 1 1
3 0 0 0 0 0
4 0 1 1 1 1
The code for the data frames is:
library(plyr)  # rename(x, c(old = new)) below is assumed to be plyr::rename()
a <- array(0, c(4,5))
ab <- data.frame(a, row.names = c(1,2,3,4))
ab <- rename(ab, c("X1" = "1", "X2" = "2", "X3" = "3", "X4" = "4", "X5"= "5"))
c <- array(1, c(3,4))
cd <- data.frame(c, row.names= c(1,2,4))
cd <- rename(cd, c("X1"="2", "X2"="3", "X3"="4", "X4"= "5"))
Any help would be really appreciated, thanks.
If the order of dimension names is identical:
ab <- as.matrix(ab)
cd <- as.matrix(cd)
ab[rownames(ab) %in% rownames(cd), colnames(ab) %in% colnames(cd)] <-
ab[rownames(ab) %in% rownames(cd), colnames(ab) %in% colnames(cd)] +
cd
ab
# 1 2 3 4 5
#1 0 1 1 1 1
#2 0 1 1 1 1
#3 0 0 0 0 0
#4 0 1 1 1 1
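If the dimension names are not in the same order, indexing by name instead of by logical mask should still line the cells up. A small sketch, with cd deliberately shuffled to show this (the matrices are rebuilt here so the example is self-contained):

```r
# Same shapes as in the question, but cd's rows and columns are out of order
ab <- matrix(0, nrow = 4, ncol = 5,
             dimnames = list(as.character(1:4), as.character(1:5)))
cd <- matrix(1, nrow = 3, ncol = 4,
             dimnames = list(c("4", "1", "2"), c("5", "3", "2", "4")))

# Character indexing picks the cells of ab in cd's order, so the addition
# and the assignment both match on row and column names
ab[rownames(cd), colnames(cd)] <- ab[rownames(cd), colnames(cd)] + cd
ab
```

This assumes every row and column name of cd also exists in ab; a name missing from ab would raise a "subscript out of bounds" error.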
You could use a data.table join:
ab$rn <- rownames(ab)
cd$rn <- rownames(cd)
library(data.table)
setDT(ab)
setDT(cd)
abm <- melt(ab, id = "rn")
cdm <- melt(cd, id = "rn")
abm[cdm, value := i.value + value, on = .(rn, variable)]
res <- dcast(abm, rn ~ variable)
res[, rn := NULL]
res
# 1 2 3 4 5
#1: 0 1 1 1 1
#2: 0 1 1 1 1
#3: 0 0 0 0 0
#4: 0 1 1 1 1
I have a dataset like this
df <- data.frame("col1" = c("a", "b", "a", "c", "d", "e", "f", "c"), "col2" = c("v2", "v2", "v2", "v3", "v4", "v1", "v2", "v4"), "index" = c(3,1,3,0,1,2,3,0))
And I hope to get a matrix like this:
v1 v2 v2 v3 v4
a 0 3 3 0 0
b 0 1 0 0 0
c 0 0 0 0 0
d 0 0 0 0 1
e 2 0 0 0 0
f 0 3 0 0 0
Thank you very much for your answer!!
Your groups have no unique identifier, and some values (v2) are repeated. We can complete the col1 and col2 values, filling index with 0, then create a unique identifier within each group (col1), and finally spread the values.
library(tidyverse)
df %>%
complete(col1, col2, fill = list(index = 0)) %>%
group_by(col1) %>%
mutate(col2 = paste0("V", row_number())) %>%
spread(col2, index, fill = 0)
# col1 V1 V2 V3 V4 V5
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 a 0 3 3 0 0
#2 b 0 1 0 0 0
#3 c 0 0 0 0 0
#4 d 0 0 0 1 0
#5 e 2 0 0 0 0
#6 f 0 3 0 0 0
We can do this easily in base R
xtabs(index ~ col1 + col2, unique(df))
# col2
#col1 v1 v2 v3 v4
# a 0 3 0 0
# b 0 1 0 0
# c 0 0 0 0
# d 0 0 0 1
# e 2 0 0 0
# f 0 3 0 0
NOTE: No packages loaded
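The unique() call matters here because xtabs sums the left-hand side over all rows that share the same col1/col2 combination. A short sketch of the difference on the question's data (with_dups and deduped are names chosen just for this illustration):

```r
df <- data.frame(col1 = c("a", "b", "a", "c", "d", "e", "f", "c"),
                 col2 = c("v2", "v2", "v2", "v3", "v4", "v1", "v2", "v4"),
                 index = c(3, 1, 3, 0, 1, 2, 3, 0))

# The row (a, v2, 3) occurs twice, so without unique() it is summed twice
with_dups <- xtabs(index ~ col1 + col2, df)
deduped   <- xtabs(index ~ col1 + col2, unique(df))

with_dups["a", "v2"]
deduped["a", "v2"]
```

Note that unique() only drops rows that are identical in all three columns; two rows with the same col1 and col2 but different index values would still be added together.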
I have a data frame that consists only of 0s and 1s. So for each individual, instead of having one column with a factor value (e.g. low price, 4 rooms), I have
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0
2 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1
3 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0
4 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0
How can I transform the dataset in R so that I create new columns (e.g. number of rooms) and give the position of the 1 (e.g. in the 4th column) the value vhigh?
I have multiple explanatory variables I need to do this for. The 21 columns represent 6 variables for 1000+ observations. The result should look something like this:
   PurchaseP NumberofRooms ...
1  vhigh     4
2  low       4
3  vhigh     1
4  vhigh     2
I only did it for the first 2 explanatory variables here, but essentially it repeats like this, with each explanatory variable having 3-4 possible factor values.
V1:V4 = purchase price, V5:V8 = number of rooms, V9:V11 = floors, and so on.
In my head something like this could work:
Create an if statement to give each 1 a value depending on its column position, e.g. if the value in V4 is 1, then name it "vhigh", and do this for each Vx.
Then combine the columns V1:V4, V5:V8, V9:V11 (depending on whether the variable has 3 or 4 possible factor/integer values) while ignoring 0 values.
Would this work, or is there a simpler approach? How would one code this in R?
Here is an approach that should work for you. I wrote a function, which will take as arguments your data.frame, the columns representing one of your variables of interest (e.g. purchase price is stored in columns 1 to 4), and the names of the levels you would like as a result. The function will then return the result you requested. You'll need to write this out for the 6 variables you are interested in.
I'll simulate some data and illustrate the approach.
df <- data.frame(matrix(rep(c(0,0,0,1, 1,0,0,0, 1,0,0,0,0,0,0,1), 2),
                        nrow = 4, byrow = TRUE))
df
#> X1 X2 X3 X4 X5 X6 X7 X8
#> 1 0 0 0 1 1 0 0 0
#> 2 1 0 0 0 0 0 0 1
#> 3 0 0 0 1 1 0 0 0
#> 4 1 0 0 0 0 0 0 1
We'll say that the first four columns are the purchase price, from vlow to vhigh, and the second four are the number of rooms (1:4). We'll write a function that takes this information as arguments and returns the result:
rangeToCol <- function(df,        # your data.frame
                       range,     # the columns that encode the category of interest
                       lev.names  # the names of the category levels
                       ) {
  tdf <- df[range]
  lev.names[unlist(apply(tdf, 1, function(rw) which(rw == 1)))]
}
new.df <- data.frame(PurchaseP = rangeToCol(df, 1:4,
                                            c('vlow', 'low', 'high', 'vhigh')),
                     NumberofRooms = rangeToCol(df, 5:8, c(1:4)))
new.df
#> PurchaseP NumberofRooms
#> 1 vhigh 1
#> 2 vlow 4
#> 3 vhigh 1
#> 4 vlow 4
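One caveat with rangeToCol: a row with no 1 (or more than one) in the range changes the length of the unlist() result, so the values would no longer line up with the rows. The following is a sketch of a variant, not the answer's own code, that uses max.col() and returns NA for such rows. The function name range_to_factor is made up for this illustration.

```r
df <- data.frame(matrix(rep(c(0,0,0,1, 1,0,0,0, 1,0,0,0, 0,0,0,1), 2),
                        nrow = 4, byrow = TRUE))

range_to_factor <- function(df, range, lev.names) {
  m <- as.matrix(df[range])
  # Column position of the first maximum in each row, i.e. of the 1
  idx <- max.col(m, ties.method = "first")
  # Rows without exactly one 1 are ambiguous, so mark them as missing
  idx[rowSums(m == 1) != 1] <- NA
  factor(lev.names[idx], levels = lev.names)
}

new.df <- data.frame(
  PurchaseP     = range_to_factor(df, 1:4, c("vlow", "low", "high", "vhigh")),
  NumberofRooms = range_to_factor(df, 5:8, 1:4)
)
new.df
```

Returning a factor with all levels declared keeps levels that happen not to occur in the data, which may be convenient for later modelling.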
I'm reading in data from a csv file where each row contains some number of individual strings:
e.g.
data.csv ->
x,f,t,h,b,g
d,g,h
g,h,a,s,d
f
q,w,e,r,t,y,u,i,o
data <- read.csv("data.csv", header = FALSE)
I want to transform this input into a data frame where the columns are the set of unique strings present in the input. In this case, the columns would be the set of strings {x,f,t,h,b,g,d,a,s,q,w,e,r,y,u,i,o}. Additionally, the new data frame should contain a row for each row in the input data frame such that a column will have the value 1 if the column's name was present in that row in the input data frame, or 0 if the column's name was not present in that input row.
In this example, the desired output would be the following:
x f t h b g d a s q w e r y u i o
----------------------------------
1 | 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
2 | 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
3 | 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
4 | 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 | 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
The code below is what I currently have. However, the output df ends up being a data frame with what appear to be the correct columns, but 0 rows.
I'm very inexperienced at R, and this was my attempt at putting together something that works. It seems to work as expected up until the call to apply(), which unexpectedly doesn't add anything to df.
data <- read.csv("data.csv", header = FALSE)
columnNames = c()
for (row in data) {
  for (eventName in row) {
    if (!(eventName %in% columnNames)) {
      columnNames = c(columnNames, eventName)
    }
  }
}
columnNames = t(columnNames)
df = data.frame(columnNames)
colnames(df) = columnNames
df = df[-1,]
apply(data, 1, function(row, df) {
  dat = data.frame(columnNames)
  colnames(dat) = columnNames
  dat = dat[-1,]
  for (eventName in row) {
    if (eventName != "") {
      dat[1,eventName] = 1
    }
  }
  df = rbind(df, dat)
}, df)
After the script finishes it tells me there were many warnings of the following two forms:
9: In `[<-.factor`(`*tmp*`, iseq, value = 1) : invalid factor level, NA generated
10: In `[<-.factor`(`*tmp*`, iseq, value = 1) :
invalid factor level, NA generated
We can use mtabulate after splitting the column on commas:
library(qdapTools)
mtabulate(strsplit(as.character(df1[,1]), ","))
Or with base R methods: split the column on commas, set the names of the resulting list to the sequence of rows, convert the list to a data.frame (stack), change the 'values' column to a factor with the levels specified, and then get the frequencies with table.
table(transform(stack(setNames(strsplit(as.character(df1[,1]), ","), 1:nrow(df1)))[2:1],
values = factor(values, levels = unique(values))))
#
# x f t h b g d a s q w e r y u i o
# 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
# 2 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
# 3 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
# 4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 5 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
Update
If the data is spread over several columns rather than a single one:
mtabulate(apply(df2, 1, FUN = function(x) x[x!=""]))
Or
as.data.frame.matrix(table(transform(stack(setNames(apply(df2, 1,
FUN = function(x) x[x!=""]),
1:nrow(df2)))[2:1], values = factor(values, levels = unique(values)))))
#
# x f t h b g d a s q w e r y u i o
# 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
# 2 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
# 3 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
# 4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 5 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
data
df1 <- structure(list(V1 = c("x,f,t,h,b,g", "d,g,h", "g,h,a,s,d", "f",
"q,w,e,r,t,y,u,i,o")), .Names = "V1", class = "data.frame",
row.names = c(NA, -5L))
df2 <- structure(list(v1 = c("x", "d", "g", "f", "q"), v2 = c("f", "g",
"h", "", "w"), v3 = c("t", "h", "a", "", "e"), v4 = c("h", "",
"s", "", "r"), v5 = c("b", "", "d", "", "t"), v6 = c("g", "",
"", "", "y"), v7 = c("", "", "", "", "u"), v8 = c("", "", "",
"", "i"), v9 = c("", "", "", "", "o")), .Names = c("v1", "v2",
"v3", "v4", "v5", "v6", "v7", "v8", "v9"), row.names = c(NA,
-5L), class = "data.frame")
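On why the original attempt left df with 0 rows: the function passed to apply() works on its own copy of df, so the assignment df = rbind(df, dat) inside it never changes the df outside. The "invalid factor level" warnings are most likely because the strings were stored as factors (the default for data.frame() and read.csv() before R 4.0). The following is a minimal base R sketch that sidesteps both issues by reading the rows as plain text; the names lines, tokens, cols and ind are made up for the illustration.

```r
# Stand-in for the file contents; with a real file one could use
# lines <- readLines("data.csv") instead
lines <- c("x,f,t,h,b,g", "d,g,h", "g,h,a,s,d", "f", "q,w,e,r,t,y,u,i,o")

# One character vector of strings per input row
tokens <- strsplit(lines, ",", fixed = TRUE)

# Unique strings in order of first appearance become the columns
cols <- unique(unlist(tokens))

# For each row, 1 if the column's string occurs in that row, else 0
ind <- t(vapply(tokens, function(tk) as.integer(cols %in% tk),
                integer(length(cols))))
colnames(ind) <- cols

out <- as.data.frame(ind)
out
```

Reading the file with readLines() also avoids read.csv() having to guess the number of columns for rows of different lengths.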