Reducing crosstab size by frequency of responses - r

Excuse my neophyte question - I'm new to R and pretty unversed in statistics.
I have a simple contingency table representing the number of queries per user for a group of web pages gathered over a period of time. There are about 15,000 total observations. This works out to a table of around 100 users viewing 50 groups of pages.
Since a 50x100 matrix is unwieldy to visualize, I would like to present a subset of this table sorted by the largest aggregates: either by column (page groups), by row (users), or perhaps by the largest row-by-column counts. For example, I might choose the top 20 users and the top 10 groups, or the top 99% of row-by-column counts.
Ideally, I end up with a table that still represents the major interactions between the most represented users and the page groups.
Is this a reasonable approach? Will I lose a large amount of statistical significance, and is there a way to compare the significance before and after?
I must admit that I still don't know how to sort and subset a table based on two factors without resorting to row-by-column manipulation.

# simulate two 10-level categorical variables and cross-tabulate them
S <- trunc(10 * runif(1000))
R <- trunc(10 * runif(1000))
RStab <- table(R, S)
str(RStab)
# 'table' int [1:10, 1:10] 6 12 10 13 10 7 8 6 9 10 ...
# - attr(*, "dimnames")=List of 2
# ..$ R: chr [1:10] "0" "1" "2" "3" ...
# ..$ S: chr [1:10] "0" "1" "2" "3" ...
rowSums( RStab[ order(rowSums(RStab)) , order(colSums(RStab) ) ])
# 8 0 1 3 2 5 9 4 6 7
# 90 94 96 99 100 101 101 103 107 109
colSums( RStab[ order(rowSums(RStab)) , order(colSums(RStab) ) ])
#  6   0   3   5   7   2   4   8   9   1
# 80  91  94  96  98 100 106 109 112 114
The 5 highest marginals for rows and columns:
RStab[ order(rowSums(RStab)) , order(colSums(RStab) ) ][ 6:10, 6:10]
#    S
# R    2  4  8  9  1
#   5 14 10 12 10 12
#   9  6  8  9 10 13
#   4 10 10  8  8 18
#   6  9 12 12 17  8
#   7 14 10 14 12  9
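If the concern is about "losing significance", one way to compare the association before and after subsetting is to run the same test of independence on both tables. This is only a sketch, assuming a chi-squared test is the comparison intended:
# chi-squared test of independence on the full 10 x 10 table
chisq.test(RStab)
# and on the top-5 x top-5 subset shown above
top <- RStab[ order(rowSums(RStab)) , order(colSums(RStab)) ][ 6:10, 6:10]
chisq.test(top)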
However, it does sound as though you might be a little shaky on the statistical issues. Can you explain more fully what you mean by "losing a large amount of significance"? What sort of statistical test were you thinking of?

Related

Multivariable Partition with dplyr

I have a large data frame with over 1 million observations. Two of my independent variables, A and B, have 18 and 72 numerically labelled categories respectively. For simplicity's sake, assume the categories are labelled 1-18 and 1-72. I'd like to partition all of my data into 36 groups, each defined by a 6 x 6 block of categories (A 1-6 with B 1-6, A 1-6 with B 7-12, etc.).
Currently, I am using dplyr's mutate with 36 nested ifelse statements, such as mutate(partition = ifelse(A <= 6 & B <= 6, 1, ifelse(...))), but this is tedious and difficult to change should I want to make partitions of different sizes.
Another way of describing it is that there are 18 * 72 = 1296 unique combinations of parameters A and B, and I would like to partition these 1296 combinations into 36 groups of 36 combinations each, with the flexibility to change the group size and the number of groups.
I really feel like there should be a better way to partition my data, but nothing comes to mind immediately. The only other idea I have is to use expand.grid and a join of sorts. What other methods exist that allow me to partition my data?
The example below shows roughly how I would like my data to appear.
A B Partition
1 1 1
1 2 1
1 3 1
1 4 1
1 5 1
1 6 1
2 1 1
... ... ...
6 6 1
7 1 2
... ... ...
12 71 12
12 72 12
13 1 13
... ... ...
18 70 36
18 71 36
18 72 36
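One arithmetic alternative to the nested ifelse chain is to compute block indices with ceiling(). This is only a sketch: the desired output above elides most rows, so the exact partition numbering is an assumption (this one runs over A-blocks within each band of B-blocks, matching the first and last rows shown), and df here is hypothetical demo data:
library(dplyr)
df <- expand.grid(A = 1:18, B = 1:72)   # hypothetical: all category combinations
a_size <- 6                             # block width for A (changeable)
b_size <- 6                             # block width for B (changeable)
n_a_blocks <- ceiling(18 / a_size)      # number of blocks along A
df <- df %>%
  mutate(partition = (ceiling(B / b_size) - 1) * n_a_blocks + ceiling(A / a_size))
Changing a_size and b_size immediately yields partitions of different sizes, with no ifelse chain to rewrite.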

Hierarchical clustering with a specific number of data points in each cluster

I'm clustering a set of words using "Hierarchical Clustering". I want each cluster to contain a certain number of words, for example 2 words, or 3 words.
I'm trying to modify existing code for this clustering.
I just set the value of max(d) to Inf as well:
Lm[min(d), ] <- sl
Lm[, min(d)] <- sl
if (length(cluster) > 2) {  # if it's already clustered with more than 2 points,
  # don't cluster them again by setting values to Inf
  Lm[min(d), min(d)] <- Inf
  Lm[max(d), max(d)] <- Inf
  Lm[max(d), ] <- Inf
  Lm[, max(d)] <- Inf
  Lm[min(d), ] <- Inf
  Lm[, min(d)] <- Inf
}
However, it doesn't give me the expected results, so I was wondering whether this is a correct approach. How can I do this type of clustering with a size constraint in R?
An example of the results that I got:
row V1 V2
166 -194 -38
167 166 -1
...
240 239 239
241 240 240
242 241 241
243 242 242
244 243 243
This will be tough to optimize, and it can produce arbitrarily bad results, because your size constraint goes against the principles of clustering.
Consider the one-dimensional data set -100, -1, 1, 100. Assuming you want to limit the cluster size to 2 elements. Hierarchical clustering will first merge -1 and +1 because they are closest. Now they have reached maximum size, so the only option is now to cluster -100 and +100, the worst possible result - this cluster is as big as the entire data set.
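A tiny sketch of that example in R: the first merge that hclust performs joins -1 and 1, after which a two-element size cap would leave only -100 and 100 to pair up.
x <- c(-100, -1, 1, 100)
hc <- hclust(dist(x))
hc$merge[1, ]   # the first merge joins observations 2 and 3, i.e. -1 and 1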
Just to give you an example of what I meant by partitional clustering:
library(cluster)
data("ruspini")
desired_cluster_size <- 3L
corresponding_num_clusters <- round(nrow(ruspini) / desired_cluster_size)
km <- kmeans(ruspini, corresponding_num_clusters)
table(km$cluster)
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#  3  3  2  4  2  2  2  1  3  3  2  3  2  3  3  2  6  3  2  1  3  6  2  8  4
This definitely can't guarantee how many observations you'll have in each group, and it's not deterministic, but it at least gives you an approximation. In the tabulated results above you can see that many of the 25 clusters ended up with 2 or 3 elements.

Mapping dataframe column values to an n by n matrix

I'm trying to map the column values of a data.frame object (consisting of a large number of bilateral trade records among 161 countries) to a 161 x 161 adjacency matrix (also of data.frame class), such that each cell represents the dyadic trade flow between any two countries.
The data looks like this
# load the example data from a Dropbox folder
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
length(unique(example_data$rid))
[1] 139
length(unique(example_data$pid))
[1] 161
where rid is the reporter id and pid is the (trade) partner id; a country's rid and pid are the same. Each id in the rid column is matched with multiple rows in the pid column, each carrying a TradeValue.
However, there are some problems with this data. First, countries (usually developing countries) that did not report trade statistics have no data to be extracted, so their ids are absent from the rid column (such as country 1). On the other hand, those country ids may enter the pid column through other countries' reporting (in which case the reporters tend to be developed countries). Hence, the rid column contains only some of the country ids (139 out of 161), while the pid column has all 161.
What I'm attempting to do is to map this example_data data frame to a 161 x 161 adjacency matrix, using rid for rows and pid for columns, where each cell represents the TradeValue between two country ids. To this end, there are a couple of things I need to tackle:
1. Fill in the country ids that are missing from the rid column of example_data and, temporarily, set all cell values in their rows to 0.
2. Impute those "0" cells with the bilateral trade statistics reported by the partner countries; if the corresponding statistics are still unavailable, leave the "0" cells as they are.
For example, for a 5-country dataframe of the following form
rid pid TradeValue
2 1 50
2 3 45
2 4 7
2 5 18
3 1 24
3 2 45
3 4 88
3 5 12
5 1 27
5 2 18
5 3 12
5 4 92
The desired output should look like this
pid_1 pid_2 pid_3 pid_4 pid_5
rid_1 0 50 24 0 27
rid_2 50 0 45 7 18
rid_3 24 45 0 88 12
rid_4 0 7 88 0 92
rid_5 27 18 12 92 0
but off the top of my head, I could not figure out how. It would be really appreciated if someone could help me with this.
# df1 is the 5-country example data frame above
df1$rid <- factor(df1$rid, levels = 1:5, labels = paste("rid", 1:5, sep = "_"))
df1$pid <- factor(df1$pid, levels = 1:5, labels = paste("pid", 1:5, sep = "_"))
data.table::dcast(df1, rid ~ pid, fill = 0, drop = FALSE, value.var = "TradeValue")
# rid pid_1 pid_2 pid_3 pid_4 pid_5
#1 rid_1 0 0 0 0 0
#2 rid_2 50 0 45 7 18
#3 rid_3 24 45 0 88 12
#4 rid_4 0 0 0 0 0
#5 rid_5 27 18 12 92 0
The secrets/tricks:
use factor variables to tell R which values are possible, as well as their order.
in data.table's dcast, use fill = 0 (fill zero where you have nothing) and drop = FALSE (make entries for factor levels that aren't observed).
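Building on that, a possible sketch for the imputation step, assuming a zero cell means "not reported" rather than genuinely zero trade: convert the dcast result to a matrix and copy each zero cell from its mirrored cell (the partner's report).
wide <- as.data.frame(
  data.table::dcast(df1, rid ~ pid, fill = 0, drop = FALSE, value.var = "TradeValue")
)
m <- as.matrix(wide[-1])              # drop the rid column
rownames(m) <- as.character(wide[[1]])
m[m == 0] <- t(m)[m == 0]             # fill each zero from the transposed cell
m
On the 5-country example this reproduces the desired symmetric output, including the previously empty rid_1 and rid_4 rows.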

How can I tell a for loop in R to regenerate a sample if the sample contains a certain pair of species?

I am creating 1000 random communities (vectors) from a species pool of 128, with certain operations applied to each community and stored in a new vector. For simplicity, I have been practicing by writing code using 10 random communities from a species pool of 20. The problem is that there are a couple of pairs of species such that if one of the pairs is generated in a random community, I need that community to be thrown out and a new one regenerated. I have been able to code that if the pair is found in a community, that community (vector) is labeled NA. I also know how to tell the loop to skip that vector using the "next" command. But with both of these options, I do not get all of the communities that I need.
Here is my code using the NA option, but again it ends up leaving me short of communities.
C <- 1:20
D <- numeric(10)
for (i in 1:10) {
  X <- sample(C, size = 5, replace = FALSE)
  if (10 %in% X & 11 %in% X) X <- NA
  if (1 %in% X & 2 %in% X) X <- NA
  print(X)
  D[i] <- sum(X)
}
print(D)
This is what my result looks like:
[1] 5 1 7 3 14
[1] 20 8 3 18 17
[1] NA
[1] NA
[1] 4 7 1 5 3
[1] 16 1 11 3 12
[1] 14 3 8 10 15
[1] 7 6 18 3 17
[1] 6 5 7 3 20
[1] 16 14 17 7 9
> print(D)
[1] 30 66 NA NA 20 43 50 51 41 63
Thanks so much!
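One common pattern for regenerating instead of discarding (a sketch, not from the original thread): wrap the draw in repeat and accept the sample only once it avoids both forbidden pairs.
C <- 1:20
D <- numeric(10)
for (i in 1:10) {
  repeat {
    X <- sample(C, size = 5, replace = FALSE)
    # accept the sample only if it contains neither forbidden pair
    if (!(10 %in% X & 11 %in% X) & !(1 %in% X & 2 %in% X)) break
  }
  D[i] <- sum(X)
}
print(D)
This keeps drawing replacements, so D always ends up with all 10 community sums.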

Reverting to Factor Codes in R

Let's say I have a data.frame that looks like this:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
and I apply a factor:
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
Now, how would I convert it back to the integer codes?
as.numeric(df.test[1]) ## fails with an error
But this works:
as.numeric(df.test$a)
Why is that?
Actually, Joshua's links are not applicable here, because the task is not converting from a factor with levels that have a numeric interpretation. Your original effort that produced an error was almost correct; it was missing only a comma before the 1:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
as.numeric(df.test[,1])
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# [19] 19 20 21 22 23 24 25 26
Or you could have used "[["
> as.numeric(df.test[[1]])
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26
as.numeric will convert a factor to numeric:
as.numeric(df.test$a)
Accessing a column by name gives you a factor vector, which can be converted to numeric.
However, a data frame is a list (of columns), and when you use the single bracket operator and a single number on a list, you get a list of length one. The same applies for data frames, so df.test[1] gets you column one as a new data frame, which cannot be coerced by as.numeric(). I did not know this!
> str(df.test$a)
Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
> str(df.test[1])
'data.frame': 26 obs. of 1 variable:
$ a: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
To respond to your edit: keep in mind that a factor has two parts: 1) the labels, and 2) the underlying integer codes. The two answers I linked to in my comment were about converting the labels to numeric. If you just want the internal codes, use as.integer(df.test$a), as demonstrated in the examples section of ?factor. aL3xa answered your question about why as.numeric(df.test[1]) throws an error.
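A quick sketch of that distinction, using a factor whose labels happen to look numeric:
f <- factor(c("10", "20", "30"))
as.integer(f)                 # internal codes: 1 2 3
as.numeric(as.character(f))   # labels converted to numbers: 10 20 30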
