Permute labels in a dataframe but for pairs of observations - r

I'm not sure the title is clear, but I want to shuffle a column in a dataframe. Shuffling every individual row is simple with sample(); what I want instead is to shuffle the labels so that the pair of observations from the same sample always receives the same label.
For instance, I have the following dataframe df1:
>df1
sampleID groupID A B C D E F
438 1 1 0 0 0 0 0
438 1 0 0 0 0 1 1
386 1 1 1 1 0 0 0
386 1 0 0 0 1 0 0
438 2 1 0 0 0 1 1
438 2 0 1 1 0 0 0
582 2 0 0 0 0 0 0
582 2 1 0 0 0 1 0
597 1 0 1 0 0 0 1
597 1 0 0 0 0 0 0
I want to randomly shuffle the labels here for groupID for each sample, not observation, so that the result looks like:
>df2
sampleID groupID A B C D E F
438 1 1 0 0 0 0 0
438 1 0 0 0 0 1 1
386 2 1 1 1 0 0 0
386 2 0 0 0 1 0 0
438 1 1 0 0 0 1 1
438 1 0 1 1 0 0 0
582 1 0 0 0 0 0 0
582 1 1 0 0 0 1 0
597 2 0 1 0 0 0 1
597 2 0 0 0 0 0 0
Notice that in column 2 (groupID), sample 386 is now 2 (for both observations).
I have searched around but haven't found anything that works the way I want. What I have now is just shuffling the second column. I tried to use dplyr as follows:
df2 <- df1 %>%
group_by(sampleID) %>%
mutate(groupID = sample(df1$groupID, size=2))
But of course that only takes all the group IDs and randomly selects 2.
Any tips or suggestions would be appreciated!

One technique would be to extract the unique combinations so you have one row per sampleID/groupID combination, then shuffle the labels and merge the shuffled labels back to the main table. Here's what that would look like:
library(dplyr)
df1 %>%
  distinct(sampleID, groupID) %>%
  mutate(shuffle_groupID = sample(groupID)) %>%
  # join back on the original keys; the new labels are in shuffle_groupID
  inner_join(df1, by = c("sampleID", "groupID"))
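For anyone who wants to check the idea without dplyr, here is a minimal base R sketch of the same steps (unique combinations, shuffle, map back). The toy df1 keeps only column A from the question, and the fixed seed and the helper names (pairs, key_obs, key_pairs) are my own assumptions, not part of the answer above.

```r
# Toy version of df1 with a single data column (A) to keep the sketch short.
df1 <- data.frame(
  sampleID = c(438, 438, 386, 386, 438, 438, 582, 582, 597, 597),
  groupID  = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1),
  A        = c(1, 0, 1, 0, 1, 0, 0, 1, 0, 0)
)

set.seed(42)  # assumption: fixed seed only so the run is repeatable

# One row per (sampleID, groupID) combination, as distinct() would return.
pairs <- unique(df1[c("sampleID", "groupID")])

# Permute the labels across those combinations.
pairs$shuffle_groupID <- sample(pairs$groupID)

# Map every observation back to its combination and copy the new label over.
key_obs   <- paste(df1$sampleID, df1$groupID)
key_pairs <- paste(pairs$sampleID, pairs$groupID)
df2 <- df1
df2$groupID <- pairs$shuffle_groupID[match(key_obs, key_pairs)]

# Sanity check: every original pair of observations shares one new label.
labels_per_pair <- tapply(df2$groupID, key_obs, function(x) length(unique(x)))
all(labels_per_pair == 1)
```

Because the labels are permuted rather than redrawn, the number of combinations carrying each label stays the same as in the original data.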

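Using dplyr nest_by and tidyr unnest. Note that nest_by() returns a rowwise data frame, so it needs ungroup() before sampling; otherwise sample() only ever sees one groupID at a time and the labels are not permuted:
library(dplyr)
library(tidyr)  # unnest() comes from tidyr, not dplyr
df1 |>
  nest_by(sampleID, groupID) |>
  ungroup() |>
  mutate(groupID = sample(groupID)) |>  # permute labels across the nested rows
  unnest(cols = c(data))
# A tibble: 10 x 3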
sampleID groupID A
<dbl> <int> <dbl>
1 386 1 1
2 386 1 0
3 438 1 0
4 438 1 0
5 438 1 0
6 438 1 1
7 582 2 0
8 582 2 0
9 597 1 1
10 597 1 0

Related

count the number of occurrences for each variable using dplyr

Here is my data frame (tibble) df:
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460
<dbl> <dbl> <dbl> <dbl> <dbl>
1 61 0 70 0 0
2 0 0 127 0 0
3 318 0 2 0 0
4 1 0 0 0 0
5 1 0 67 0 0
6 0 0 0 139 0
7 0 0 0 0 0
8 113 0 0 0 0
9 0 0 1 0 0
10 0 0 0 1 0
For each column/variable, I would like to count the number of rows with value greater than 10. In this case, column 1 would be 3, column 2 would be zero, etc. This is a test data frame, and I would like to do this for many columns.
We can use colSums on a logical matrix
colSums(df > 10, na.rm = TRUE)
Or using dplyr
library(dplyr)
df %>%
summarise_all(~ sum(. > 10, na.rm = TRUE))
I think
library(dplyr)
df %>% summarise_all(~sum(.>10))
will do what you want.
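As a quick check of the colSums approach, here is a small sketch that rebuilds the sample tibble from the question as a plain data frame and also shows why na.rm = TRUE matters. The df_na copy with an injected NA is my own addition for illustration.

```r
# The ten rows shown in the question, as a plain data frame.
df <- data.frame(
  ENSG00000000003 = c(61, 0, 318, 1, 1, 0, 0, 113, 0, 0),
  ENSG00000000005 = rep(0, 10),
  ENSG00000000419 = c(70, 127, 2, 0, 67, 0, 0, 0, 1, 0),
  ENSG00000000457 = c(0, 0, 0, 0, 0, 139, 0, 0, 0, 1),
  ENSG00000000460 = rep(0, 10)
)

# df > 10 is a logical matrix; summing TRUEs per column counts the rows.
counts <- colSums(df > 10, na.rm = TRUE)
counts  # 3 0 3 1 0, in column order

# Hypothetical missing value: without na.rm the whole column count becomes NA.
df_na <- df
df_na[1, 1] <- NA
colSums(df_na > 10)[1]                # NA
colSums(df_na > 10, na.rm = TRUE)[1]  # 2
```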

Making adjacency matrix using group information

I am relatively new to R and I am having issues creating an adjacency matrix using group characteristics.
I have a data frame that looks like this:
distid villageid hhid group1 group2 group3 group4
1 1 111 0 1 0 0
1 1 112 1 1 1 0
1 2 121 1 1 0 1
1 2 122 1 0 0 1
2 1 211 1 1 0 0
2 1 212 1 1 1 1
2 2 221 0 0 1 0
2 2 222 0 1 1 0
I need to create an adjacency matrix where if a hhid is in the same distid, villageid and group then they are all fully connected.
So my final matrix should look something like this
hhid 111 112 121 122 211 212 221 222
111 0 1 0 0 0 0 0 0
112 1 0 0 0 0 0 0 0
121 0 0 0 1 0 0 0 0
122 0 0 0 0 0 0 0 0
211 0 0 0 0 0 1 0 0
212 0 0 0 0 1 0 0 0
221 0 0 0 0 0 0 0 1
222 0 0 0 0 0 0 1 0
We assume that two elements should be regarded as adjacent if they share at least one group and are in the same dist and village.
Using the input in the Note, create the adjacency matrices for groups, for distid and for villageid, then multiply them together element-wise and zero out the diagonal.
m1 <- sign(crossprod(t(DF[-(1:3)])))
m2 <- +outer(DF$distid, DF$distid, "==")
m3 <- +outer(DF$villageid, DF$villageid, "==")
m4 <- 1 - diag(nrow(DF))
m <- m1 * m2 * m3 * m4
dimnames(m) <- list(DF$hhid, DF$hhid)
giving:
> m
111 112 121 122 211 212 221 222
111 0 1 0 0 0 0 0 0
112 1 0 0 0 0 0 0 0
121 0 0 0 1 0 0 0 0
122 0 0 1 0 0 0 0 0
211 0 0 0 0 0 1 0 0
212 0 0 0 0 1 0 0 0
221 0 0 0 0 0 0 0 1
222 0 0 0 0 0 0 1 0
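To see why the m1 step works, here is a small sketch on three of the households. Multiplying the 0/1 membership matrix by its transpose counts the groups each pair shares, and sign() turns any positive count into 1. The matrix X and the hh-prefixed row names are made up for this illustration.

```r
# Group membership (group1..group4) for three households from the question.
X <- rbind(
  hh111 = c(0, 1, 0, 0),
  hh112 = c(1, 1, 1, 0),
  hh121 = c(1, 1, 0, 1)
)
colnames(X) <- paste0("group", 1:4)

# Entry (i, j) is the number of groups that households i and j share.
shared <- tcrossprod(X)  # X %*% t(X), equivalent to crossprod(t(X))
shared

# Collapse counts to a 0/1 "shares at least one group" indicator.
sign(shared)
```

Here hh112 and hh121 share two groups, yet they end up unconnected in m because they live in different villages, which is what the multiplication by m3 enforces.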
Graph
library(igraph)
g <- graph_from_adjacency_matrix(m)
plot(g)
Note
The input in reproducible form.
Lines <- "distid villageid hhid group1 group2 group3 group4
1 1 111 0 1 0 0
1 1 112 1 1 1 0
1 2 121 1 1 0 1
1 2 122 1 0 0 1
2 1 211 1 1 0 0
2 1 212 1 1 1 1
2 2 221 0 0 1 0
2 2 222 0 1 1 0"
DF <- read.table(text = Lines, header = TRUE)

R dplyr nested dummy coding

I need to recode a data set of test responses for use in another application (a program called BLIMP that imputes missing values). Specifically, I need to represent the test items and subscale assignments with dummy codes.
Here I create a data frame that holds the responses to a 10-item test for two persons in a nested format. These data are a simplified version of the actual input table.
library(tidyverse)
df <- tibble(
person = rep(101:102, each = 10),
item = as.factor(rep(1:10, 2)),
response = sample(1:4, 20, replace = T),
scale = as.factor(rep(rep(1:2, each = 5), 2))
) %>% mutate(
scale_last = case_when(
as.integer(scale) != lead(as.integer(scale)) | is.na(lead(as.integer(scale))) ~ 1,
TRUE ~ NA_real_
)
)
The columns of df contain:
person: ID numbers for the persons (10 rows for each person)
item: test items 1-10 for each person. Note how the items are nested within each person.
response: score for each item
scale: the test has two subscales. Items 1-5 are assigned to subscale 1, and items 6-10 are assigned to subscale 2.
scale_last: a code of 1 in this column indicates that the item is the last item in its assigned sub scale. This characteristic becomes important below.
I then create dummy codes for the items using the recipes package.
library(recipes)
dum <- df %>%
recipe(~ .) %>%
step_dummy(item, one_hot = T) %>%
prep(training = df) %>%
bake(new_data = df)
print(dum, width = Inf)
# person response scale scale_last item_X1 item_X2 item_X3 item_X4 item_X5 item_X6 item_X7
# <int> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 101 2 1 NA 1 0 0 0 0 0 0
# 2 101 3 1 NA 0 1 0 0 0 0 0
# 3 101 3 1 NA 0 0 1 0 0 0 0
# 4 101 1 1 NA 0 0 0 1 0 0 0
# 5 101 1 1 1 0 0 0 0 1 0 0
# 6 101 1 2 NA 0 0 0 0 0 1 0
# 7 101 3 2 NA 0 0 0 0 0 0 1
# 8 101 4 2 NA 0 0 0 0 0 0 0
# 9 101 2 2 NA 0 0 0 0 0 0 0
#10 101 4 2 1 0 0 0 0 0 0 0
#11 102 2 1 NA 1 0 0 0 0 0 0
#12 102 1 1 NA 0 1 0 0 0 0 0
#13 102 2 1 NA 0 0 1 0 0 0 0
#14 102 3 1 NA 0 0 0 1 0 0 0
#15 102 2 1 1 0 0 0 0 1 0 0
#16 102 1 2 NA 0 0 0 0 0 1 0
#17 102 4 2 NA 0 0 0 0 0 0 1
#18 102 2 2 NA 0 0 0 0 0 0 0
#19 102 4 2 NA 0 0 0 0 0 0 0
#20 102 3 2 1 0 0 0 0 0 0 0
# item_X8 item_X9 item_X10
# <dbl> <dbl> <dbl>
# 1 0 0 0
# 2 0 0 0
# 3 0 0 0
# 4 0 0 0
# 5 0 0 0
# 6 0 0 0
# 7 0 0 0
# 8 1 0 0
# 9 0 1 0
#10 0 0 1
#11 0 0 0
#12 0 0 0
#13 0 0 0
#14 0 0 0
#15 0 0 0
#16 0 0 0
#17 0 0 0
#18 1 0 0
#19 0 1 0
#20 0 0 1
The output shows the item dummy codes represented in the columns with the item_ prefix. For downstream processing, I need a further level of recoding. Within each subscale, the items must be dummy-coded relative to the last item of the subscale. Here’s where the scale_last variable comes into play; this variable identifies the rows in the output that need to be recoded.
For example, the first of these rows is row 5, the row for the last item (item 5) in subscale 1 for person 101. In this row the value of column item_X5 needs to be recoded from 1 to 0. In the next row to be recoded (row 10), it is the value of item_X10 that needs to be recoded from 1 to 0. And so on.
I’m struggling for the right combination of dplyr verbs to accomplish this. What’s tripping me up is the need to isolate specific cells within specific rows to be recoded.
Thanks in advance for any help!
We can use mutate_at to replace the values in the "item" columns with 0 in the rows where scale_last == 1:
library(dplyr)
dum %>% mutate_at(vars(starts_with("item")), ~replace(., scale_last == 1, 0))
# A tibble: 20 x 14
# person response scale scale_last item_X1 item_X2 item_X3 item_X4 item_X5
# <int> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 101 2 1 NA 1 0 0 0 0
# 2 101 3 1 NA 0 1 0 0 0
# 3 101 1 1 NA 0 0 1 0 0
# 4 101 1 1 NA 0 0 0 1 0
# 5 101 3 1 1 0 0 0 0 0
# 6 101 4 2 NA 0 0 0 0 0
# 7 101 4 2 NA 0 0 0 0 0
# 8 101 3 2 NA 0 0 0 0 0
# 9 101 2 2 NA 0 0 0 0 0
#10 101 4 2 1 0 0 0 0 0
#11 102 2 1 NA 1 0 0 0 0
#12 102 1 1 NA 0 1 0 0 0
#13 102 4 1 NA 0 0 1 0 0
#14 102 4 1 NA 0 0 0 1 0
#15 102 4 1 1 0 0 0 0 0
#16 102 3 2 NA 0 0 0 0 0
#17 102 4 2 NA 0 0 0 0 0
#18 102 1 2 NA 0 0 0 0 0
#19 102 4 2 NA 0 0 0 0 0
#20 102 4 2 1 0 0 0 0 0
# … with 5 more variables: item_X6 <dbl>, item_X7 <dbl>, item_X8 <dbl>,
# item_X9 <dbl>, item_X10 <dbl>
In base R, we can use lapply
cols <- grep("^item", names(dum))
dum[cols] <- lapply(dum[cols], function(x) replace(x, dum$scale_last == 1, 0))
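Since scale_last is NA in most rows, it may help to see the recode with the NA handling spelled out. This is a minimal sketch on a made-up four-item data frame (not the real dum), where the logical index hit is my own helper name.

```r
# Tiny stand-in for dum: two subscales of two items each, one person.
dum <- data.frame(
  scale_last = c(NA, 1, NA, 1),
  item_X1 = c(1, 0, 0, 0),
  item_X2 = c(0, 1, 0, 0),
  item_X3 = c(0, 0, 1, 0),
  item_X4 = c(0, 0, 0, 1)
)

cols <- grep("^item", names(dum))

# Rows flagged as the last item of a subscale; NA is treated as "not flagged".
hit <- !is.na(dum$scale_last) & dum$scale_last == 1

# Zero out every item dummy in the flagged rows.
dum[hit, cols] <- 0

rowSums(dum[cols])  # 1 0 1 0: the flagged rows no longer carry a dummy code
```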

creating a DTM from a 3 column CSV file with r

I have a CSV file with 600k lines and 3 columns: the first contains a disease name, the second a gene, and the third a number. I have roughly 4k diseases and 16k genes, so the disease and gene names are often repeated. It looks something like this:
cholera xx45 12
Cancer xx65 1
cholera xx65 0
I would like to make a DTM (document-term matrix) using R. I've been trying to use the Corpus command from the tm library, but Corpus doesn't reduce the number of diseases and its size is still around 600k. I'd love to understand how to transform that file into a DTM.
I'm sorry for not being that precise, totally starting with computer science things as a bio guy :)
Cheers!
If you're not concerned with the number in the third column, then you can accomplish what I think you're trying to do using only the first two columns (disease and gene).
Example with some simulated data:
library(data.table)
# Create a table with 100k rows: 10k random gene names (at most 6,500 distinct,
# recycled to fill the table) paired with 40 different diseases
df <- data.frame(
  gene = sapply(1:10000, function(x) paste(c(sample(LETTERS, size=2), sample(10, size=1)), collapse="")),
  disease = sample(40, size=100000, replace=TRUE)
)
table(df) creates a large matrix, nGenes rows long and nDiseases columns wide. Looking at just the first 6 rows, which is what head() returns by default (because it's so large and sparse):
head(table(df))
disease
gene 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
AB10 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
AB2 1 1 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 1 0 1 0 1
AB3 0 1 0 0 2 1 1 0 0 1 0 0 0 0 0 2 1 0 0 1 0 0 1 0 3 0 1
AB4 0 0 1 0 0 1 0 2 1 1 0 1 0 0 1 1 1 1 0 1 0 2 0 0 0 1 1
AB5 0 1 0 1 0 0 2 2 0 1 1 1 0 1 0 0 2 0 0 0 0 0 0 1 1 1 0
AB6 0 0 2 0 2 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0
disease
gene 28 29 30 31 32 33 34 35 36 37 38 39 40
AB10 0 0 1 2 1 0 0 1 0 0 0 0 0
AB2 0 0 0 0 0 0 0 0 0 0 0 0 0
AB3 0 0 1 1 1 0 0 0 0 0 1 1 0
AB4 0 0 1 2 1 1 1 1 1 2 0 3 1
AB5 0 2 1 1 0 0 3 4 0 1 1 0 2
AB6 0 0 0 0 0 0 0 1 0 0 0 0 0
Alternatively, you can exclude the counts of 0 and only include combinations that actually exist. Easy aggregation can be done with data.table, e.g. (continuing from the above example)
library(data.table)
dt <- data.table(df)
dt[, .N, by=list(gene, disease)]
which gives a frequency table like the following:
gene disease N
1: HA5 20 2
2: RF9 10 3
3: SD8 40 2
4: JA7 35 4
5: MJ2 1 2
---
75872: FR10 26 1
75873: IC5 40 1
75874: IU2 20 1
75875: IG5 13 1
75876: DW7 21 1
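If the number in the third column should end up in the cells instead of a count, base R's xtabs can sum it per disease/gene pair. This is only a sketch on the three rows from the question; the column name score is an assumption, since the question does not name the columns.

```r
# The three example rows from the question.
dat <- data.frame(
  disease = c("cholera", "Cancer", "cholera"),
  gene    = c("xx45", "xx65", "xx65"),
  score   = c(12, 1, 0)  # hypothetical name for the third column
)

# One row per disease, one column per gene; cells hold the summed score,
# and combinations that never occur are filled with 0.
m <- xtabs(score ~ disease + gene, data = dat)

m["cholera", "xx45"]  # 12
m["Cancer", "xx45"]   # 0, this combination is not in the file
dim(m)                # 2 2
```

With roughly 4k diseases and 16k genes the dense result has about 64 million cells, so it may be worth looking at the sparse = TRUE argument of xtabs, which returns a sparse matrix and needs the Matrix package.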

Create block diagonal data frame in R

I have a data set that looks like this:
Person Team
114 1
115 1
116 1
117 1
121 1
122 1
123 1
214 2
215 2
216 2
217 2
221 2
222 2
223 2
"Team" ranges from 1 to 33, and teams vary in terms of size (i.e., there can be 5, 6, or 7 members, depending on the team). I need to create a data set into something that looks like this:
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
The sizes of the individual blocks are given by the number of people in a team. How can I do this in R?
You could use bdiag from the Matrix package. For example:
library(Matrix)
bdiag(matrix(1, ncol=7, nrow=7), matrix(1, ncol=7, nrow=7))
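Another idea, although I guess this is less efficient/elegant than RStudent's:
DF = data.frame(Person = sample(100, 21), Team = rep(1:5, c(3,6,4,5,3)))
DF
# team sizes, in team order (assumes DF is sorted by Team)
lengths = tapply(DF$Person, DF$Team, length)
mat = matrix(0, sum(lengths), sum(lengths))
# for each team, list all (row, col) index pairs of an a x a block,
# shifted by b = number of people in the preceding teams, and set them to 1
mat[do.call(rbind,
            mapply(function(a, b) arrayInd(seq_len(a ^ 2), c(a, a)) + b,
                   lengths, cumsum(c(0, lengths[-length(lengths)])),
                   SIMPLIFY = FALSE))] = 1
mat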
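A third option, sketched here as an assumption rather than taken from either answer, is to compare the Team column with itself using outer(). When the rows are sorted by team, the "same team" indicator is exactly the block diagonal pattern, and the block sizes follow the team sizes automatically.

```r
# The example data from the question: two teams of seven people.
DF <- data.frame(
  Person = c(114:117, 121:123, 214:217, 221:223),
  Team   = rep(1:2, each = 7)
)

# Sort by team so that each team's members are contiguous.
DF <- DF[order(DF$Team), ]

# TRUE where two people are on the same team; unary + converts to 0/1.
mat <- +outer(DF$Team, DF$Team, "==")
dimnames(mat) <- list(DF$Person, DF$Person)

mat["114", "123"]  # 1, same team
mat["123", "214"]  # 0, different teams
```

This builds a dense n x n matrix, so for very large data the bdiag approach with sparse blocks may be the better fit.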
