Generating sub tables in R

I need to translate some Python code to R. What I need to do is sample random rows from a larger table multiple times, so I can use the samples later. Here is an illustration:
library(data.table)
library(dplyr)
test_table <- data.table(replicate(10, sample(0:1, 10, rep=TRUE)))
test_table
This gives a 10 x 10 table of 0s and 1s (the exact values vary per run).
So for instance one can get a sample:
sample <- sample_n(test_table, 2)
sample
which returns a 2 x 10 table (again, the values vary per run).
However, I don't understand the result when taking multiple samples:
kSampleSize <- 2
kNumSamples <- 3
samples <- replicate(kNumSamples, sample_n(test_table, kSampleSize))
samples
may give a confusing printout (omitted here, as it varies per run) that doesn't really look like a "list of samples". I expected samples[1] to give a result similar to sample, but instead I get a weird result:
[[1]]
[1] 1 0
Am I doing something wrong? Am I misinterpreting the output? Is a "list of samples" something to expect in Python but not in R?

There is a simplify argument to replicate that determines whether R attempts to simplify the returned object into a less complicated data structure.
simplify defaults to TRUE, and in this case it collapses the returned list of data frames into a single matrix whose cells are the individual columns, which is why indexing it behaves so oddly. Specifying simplify = FALSE turns this behavior off.
kSampleSize <- 2
kNumSamples <- 3
replicate(kNumSamples, sample_n(test_table, kSampleSize), simplify = FALSE)
Returns a list of three data frames, preserving the original data structure:
[[1]]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1: 1 0 0 0 1 0 0 1 0 1
2: 1 1 1 0 0 1 0 0 1 1
[[2]]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1: 1 1 0 1 0 1 0 1 0 0
2: 1 1 1 1 1 0 0 1 0 1
[[3]]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1: 0 0 1 0 1 1 0 0 1 1
2: 1 1 1 1 0 0 1 0 0 0
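With the list in hand, each sample is retrieved with double-bracket indexing:
samples <- replicate(kNumSamples, sample_n(test_table, kSampleSize), simplify = FALSE)
samples[[1]]  # the first sample: a 2 x 10 table, just like `sample` above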

Related

Transforming Dataset with only 0 and 1 values

I'm unsure of what to call this, so I'll try to describe the issue in layman's terms. I have a dataframe that consists only of 0s and 1s. So for each individual, instead of having one column with a factor value (e.g. low price, 4 rooms), I have
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0
2 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1
3 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0
4 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0
How can I transform the dataset in R so that I create new columns (e.g. number of rooms) and map the position of the 1 to a value (e.g. a 1 in the 4th column becomes "vhigh")?
I have multiple explanatory variables I need to do this for. The 21 columns represent 6 variables for 1000+ observations. The result should be something like this:
  PurchaseP  NumberofRooms  ...
1 vhigh      4
2 low        4
3 vhigh      1
4 vhigh      2
I've only done this for the first 2 explanatory variables here, but essentially it repeats like this, with each explanatory variable having 3-4 possible factor values.
V1:V4 = purchase price, V5:V8 = number of rooms, V9:V11 = floors, and so on.
In my head something like this could work:
Create an if statement that gives each 1 a value depending on its column position, e.g. if the value in V4 is 1, then assign "vhigh", and do this for each Vx.
Then combine the columns V1:V4, V5:V8, V9:V11 (depending on whether the variable has 3 or 4 possible factor/integer values) while ignoring 0 values.
Would this work, or is there a simpler approach? How would one code this in R?
If the dataset contains a single 1 per row, this is a pretty simple problem.
Here is your data according to your picture (please edit your question to include code instead of a picture):
df <- data.frame(r1 = 0, r2 = 1, r3 = 0)
rownames(df) <- 1
Then you simply have to sum your columns with the room numbers as weights:
df$room <- df$r1 * 1 + df$r2 * 2 + df$r3 * 3
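As an aside, the same weighted sum extends to any number of indicator columns via a matrix product (a minimal sketch, assuming the columns are ordered r1, r2, r3):
df$room <- as.vector(as.matrix(df[, c("r1", "r2", "r3")]) %*% 1:3)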
You can use the function which(), similar to:
apply(df, 1, function(x) {    # here x is one row
  idx <- which(x == 1)[1]     # position of the (first) 1 in that row
  idx
})
The interesting part is using which(x == 1) on each row. This gives you a vector of all indices that contain a one; the first of these can be used in your case (assuming you only have one 1 per line), otherwise aggregation needs to be discussed. The resulting column can then be turned into a factor by giving sensible names to the various indices.
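A minimal sketch of that last step, on toy data, assuming V1:V4 encode the purchase price with exactly one 1 per row (the level labels other than "vhigh" are my guesses at the asker's levels):
dd <- data.frame(V1 = c(0, 1, 0), V2 = c(0, 0, 0), V3 = c(0, 0, 0), V4 = c(1, 0, 1))
idx <- apply(dd, 1, function(x) which(x == 1)[1])   # column position of the 1
dd$PurchaseP <- factor(idx, levels = 1:4, labels = c("low", "med", "high", "vhigh"))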

How can I partition market basket items into clusters?

I have a data set as follows (I took a simple example, but the real data set is much bigger):
V1 V2 V3 V4
1 1 0 0 1
2 0 1 1 0
3 0 0 1 0
4 1 1 1 1
5 0 1 1 0
6 1 0 0 1
7 0 0 0 1
8 0 1 1 1
9 1 0 1 0
10 0 1 1 0
...
where V1, V2, V3, ..., Vn are items and 1, 2, 3, 4, ..., 1000 are transactions. I want to partition these items into k clusters such that each cluster contains the items that appear together most frequently in the same transactions.
To determine the number of times each pair of items appears together, I tried a crosstable and got the following results:
V1 V2 V3 V4
V1 4 1 2 3
V2 1 5 5 2
V3 2 5 7 2
V4 3 2 2 5
For this small example, if I want to create 2 clusters (k = 2) such that each cluster must contain 2 items (to maintain the balance between clusters), I will get:
Cluster1={V1,V4}
Cluster2={V2,V3}
because:
1) V1 appears most frequently with V4: (V1,V4) = 3 > (V1,V3) > (V1,V2), and the same holds for V4.
2) V2 appears most frequently with V3: (V2,V3) = 5 > (V2,V4) > (V2,V1), and the same holds for V3.
How can I do this partitioning in R, and for a bigger data set?
I think you are asking about clustering. It is not quite the same as what you are doing above, but you could use hclust to look for similarity between variables with a suitable distance measure.
For example
plot(hclust(dist(t(df),method="binary")))
produces a dendrogram of the four variables.
You should look at ?dist to see if this distance measure is meaningful in your context, and ?hclust for other things you can do once you have your dendrogram (such as identifying clusters).
Or you could use your crosstab as a distance matrix (perhaps take the reciprocal of the values, and then as.dist).
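A sketch of that second suggestion, using the crosstab from the question (taking reciprocals so that high co-occurrence means small distance):
co <- matrix(c(4, 1, 2, 3,
               1, 5, 5, 2,
               2, 5, 7, 2,
               3, 2, 2, 5),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("V", 1:4), paste0("V", 1:4)))
d <- as.dist(1 / co)   # as.dist uses the lower triangle; the diagonal is ignored
hc <- hclust(d)
cutree(hc, k = 2)      # yields the expected {V1, V4} and {V2, V3} split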
library(data.table)
data:
df <-
  fread("
ID V1 V2 V3
1 1 0 0
2 0 0 1
3 0 0 1
4 1 1 1
5 0 0 1
6 1 0 0
7 0 0 0
8 0 1 1
9 1 0 1
10 0 1 1
")[, -1]
code:
setDT(df)
sapply(names(df), function(x) {
  # keep the rows where item x was bought, then sum every column over them
  df[get(x) == 1, lapply(.SD, sum, na.rm = TRUE), .SDcols = names(df)]
})
result:
   V1 V2 V3
V1  4  1  2
V2  1  3  3
V3  2  3  7
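Incidentally, the same co-occurrence matrix falls out of a single matrix cross-product on the 0/1 data, which scales better to large data sets:
m <- as.matrix(df)
crossprod(m)   # t(m) %*% m: diagonal = item counts, off-diagonal = pair counts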
df <- read.table(text="
ID V1 V2 V3
1 1 0 0
2 0 0 1
3 0 0 1
4 1 1 1
5 0 0 1
6 1 0 0
7 0 0 0
8 0 1 1
9 1 0 1
10 0 1 1
", header = TRUE)
k = 3 # number of clusters
library(dplyr)
df %>%
  # group on all columns except the first ID column
  group_by_at(2:ncol(df)) %>%
  # count each pattern and collect the ids of the transactions sharing it
  summarize(n = n(), tran_ids = paste(ID, collapse = ',')) %>%
  ungroup() %>%
  # grab the top k patterns by count
  top_n(k, n)
# V1 V2 V3 n tran_ids
# <int> <int> <int> <int> <chr>
# 1 0 0 1 3 2,3,5
# 2 0 1 1 2 8,10
# 3 1 0 0 2 1,6
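In dplyr 1.0.0 and later, slice_max() supersedes top_n(), so the last step can also be written as:
slice_max(n, n = k)   # equivalent to top_n(k, n)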
You can transpose your table and use standard clustering methods; this clusters the items, with the transactions as features.
Geometric approaches like kmeans can be used. Alternatively, you can use mixture models, which provide information criteria (like BIC) for selecting the number of clusters. Here is an R script:
require(VarSelLCM)
my.data <- as.data.frame(t(df))
# To consider Gaussian mixture
# Alternatively Poisson mixture can be considered by converting each column into integer.
for (j in 1:ncol(my.data)) my.data[,j] <- as.numeric(my.data[,j])
## Clustering by considering all the variables as discriminative
# Number of clusters is between 1 and 6
res.all <- VarSelCluster(my.data, 1:6, vbleSelec = FALSE)
# partition
res.all@partitions@zMAP
# shiny application
VarSelShiny(res.all)
## Clustering with variable selection
# Number of clusters is between 1 and 6
res.selec <- VarSelCluster(my.data, 1:6, vbleSelec = TRUE)
# partition
res.selec@partitions@zMAP
# shiny application
VarSelShiny(res.selec)

Dummy variable where two continuous variables are equal in R?

Data set cwm looks like this:
V1 V2 V3
 1  2  ?
 3  5  ?
 4  4  ?
NA  9  ?
NA NA  ?
I want to create a dummy variable V3 that is 1 if V1 = V2, 0 otherwise, producing NA in any case where an NA is involved.
After I have done a similar thing for another pair of columns to produce dummy variable V5, I need to create a variable V6, where 1 means neither V3 nor V5 = 1, 2 means exactly one of V3 or V5 = 1, and 3 means both V3 and V5 = 1.
V3 V5 V6
1 0 ?
1 0 ?
0 0 ?
1 1 ?
If done correctly, V3 = {0, 0, 1, NA, NA} and V6 = {2, 2, 1, 3}.
Best approach?
df = read.table(text="V1 V2
1 2
3 5
4 4
NA 9
NA NA",
header = TRUE, na.strings="NA")
V3 = as.numeric(df$V1 == df$V2)
V3
[1] 0 0 1 NA NA
df2 = read.table(text="V3 V5
1 0
1 0
0 0
1 1",
header = TRUE)
V6 = df2$V3 + df2$V5 + 1
V6
[1] 2 2 1 3
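This works because V3 and V5 are 0/1 dummies: their sum counts how many of the two equal 1, and adding 1 maps {0, 1, 2} onto the required {1, 2, 3}. In the first part, any NA in V1 or V2 propagates through == and as.numeric(), so the NA rows come out as NA automatically.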

Recode and condense variables

I'm working on the output of an online questionnaire and have some trouble handling the data. This is the setup: 200 images have been rated on two 9-point scales, totaling 400 combinations. Unfortunately, the data hasn't been encoded as 400 variables with values ranging from 1 to 9; instead, for each scale-image combination, 9 binary variables have been encoded, looking like this for two image-scale combinations:
Part. V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 1 0 0
3 0 0 1 0 0 0 0 0 0
As you can see, there are also some NA values in the data set. That's because, of the 400 combinations, each participant only rated a randomised 50. Given the 400 combinations, we have a total of 3600 variables in the data set. I would now like to condense and recode those values so that R walks through the variables in intervals of 9, recodes the binary 1 into a value from 1 to 9 depending on its position on the scale, and then condenses everything into 400 combination variables. In the end, it should look something like this:
Part. C1 C2
1      3  2
2      7
3         3
I've looked into the reshape package, but couldn't exactly figure out the way to do this.
Any suggestions?
Using apply family functions:
#dummy data
df <- read.table(text = "
Part.,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18
1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,,,,,,,,,
3,,,,,,,,,,0,0,1,0,0,0,0,0,0
", header = TRUE, sep = ",")
# result
# cbind - column bind, puts columns side by side
cbind(
  # the first column is the "Part." column
  df[, "Part.", drop = FALSE],
  # the other columns come from the code below;
  # sapply returns a matrix, convert it to a data.frame so we can cbind
  as.data.frame(
    # step through the data columns 9 at a time: 2 to 10, then 11 to 19, etc.
    sapply(seq(2, ncol(df), 9), function(i)
      # for each block of 9 columns, find the position of the 1
      # in each row using which()
      apply(df[, i:(i + 8)], 1, function(j) which(j == 1)))
  )
)
#output
#   Part. V1 V2
# 1     1  3  2
# 2     2  7
# 3     3     3
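A small variant that returns NA (rather than an empty slot) for the combinations a participant did not rate, by taking only the first match; the C1, C2 names follow the asker's desired output:
res <- sapply(seq(2, ncol(df), 9), function(i)
  apply(df[, i:(i + 8)], 1, function(j) which(j == 1)[1]))
colnames(res) <- paste0("C", seq_len(ncol(res)))
cbind(df[, "Part.", drop = FALSE], res)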
Here is a solution for a small example. I did it for only 2 possible outcomes, so v1 is outcome 1 for picture 1, v2 is outcome 2 for picture 1, v3 is outcome 1 for picture 2, and so on. If you have 9 possible outcomes, you have to change id <- rep(1:2, each = 2) to id <- rep(1:n, each = 9), where n is the total number of pictures. Also change the 2 in final <- matrix(nrow = nrow(dat), ncol = ncol(dat)/2) to 9.
I hope that helps.
dat <- data.frame(v1 = c(NA, 0, 1, 0), v2 = c(NA, 1, 0, 1),
                  v3 = c(0, 1, NA, 0), v4 = c(1, 0, NA, 1))
id <- rep(1:2, each = 2)   # which picture each column belongs to
final <- matrix(nrow = nrow(dat), ncol = ncol(dat) / 2)
for (i in unique(id)) {
  wdat <- dat[, which(id == i)]   # the columns for picture i
  for (j in 1:nrow(wdat)) {
    if (is.na(wdat[j, 1])) {
      final[j, i] <- NA   # picture not rated by this participant
    } else {
      final[j, i] <- which(wdat[j, ] == 1)   # position of the 1
    }
  }
}
The input and output for my example:
> dat
v1 v2 v3 v4
1 NA NA 0 1
2 0 1 1 0
3 1 0 NA NA
4 0 1 0 1
> final
[,1] [,2]
[1,] NA 2
[2,] 2 1
[3,] 1 NA
[4,] 2 2

Data Manipulation in R Project: compare rows

I'm looking to compare values within a dataset.
Every row starts with a unique ID followed by a couple of binary variables.
The data looks like this:
row.name v1 v2 v3 ...
1 0 0 0
2 1 1 1
3 1 0 1
I want to know which values are the same (if equal, assign a value of 1) and which are different (if not equal, assign a value of 0) for all unique pairings.
For example, in column v1: row1 == 0 and row2 == 1, which should result in an assignment of 0.
So the output should look like this:
id1 id2 v1 v2 v3 ...
1 2 0 0 0 ...
1 3 0 1 0 ...
2 3 1 0 1 ...
I'm looking for an efficient way of doing this for more than 1000 rows...
There's no way to do this without expanding each combination of rows, so with 1000 rows, it is going to take a bit of time. But here is a solution:
dat <- read.table(header=T, text="row.name v1 v2 v3
1 0 0 0
2 1 1 1
3 1 0 1")
Create the index rows:
indices <- t(combn(dat$row.name, 2))
colnames(indices) <- c('id1', 'id2')
Loop through the index rows, and collect the comparisons:
res1 <- t(apply(indices, 1, function(x) as.numeric(dat[x[1],-1] == dat[x[2],-1])))
colnames(res1) <- names(dat[-1])
Put them together:
result <- cbind(indices, res1)
result
## id1 id2 v1 v2 v3
## [1,] 1 2 0 0 0
## [2,] 1 3 0 1 0
## [3,] 2 3 1 0 1
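A more vectorized variant of the same idea, in case the apply() loop becomes slow: index both rows of every pair straight out of the matrix and compare them elementwise (same combn() indices as above):
m <- as.matrix(dat[, -1])
idx <- t(combn(nrow(m), 2))
res2 <- (m[idx[, 1], ] == m[idx[, 2], ]) + 0L   # logical -> 0/1, keeps dimensions
result2 <- cbind(id1 = dat$row.name[idx[, 1]],
                 id2 = dat$row.name[idx[, 2]], res2)
result2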
