coercing data frame rows to matrix in R - r

I'm unsure of better terminology for my question, so forgive me for the long-winded approach.
I'm trying to use two identifying variables, id and duration, to fill up the rows of a matrix where the columns denote half-hour periods (so there should be 6 for a 3-hour period) and the rows are a given person's activities in those time periods. If the activities do not fill up the matrix, a dummy variable should be used instead. I've written an example below which should help clarify.
Example:
data has 3 columns, id, activity, and duration. id and duration should serve as identifying variables and activity should serve as the variable in the matrix.
data <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3),
activity = c("a", "b", "c", "d", "e", "b", "b", "a"),
duration = c(60, 30, 90, 45, 30, 15, 60, 100))
For the example, I used a 3-hour duration, hence the 6 columns in the matrix. The matrix below is the wanted output. DUMMY entries appear where the total duration of a person's activities does not sum to the duration of the matrix. In this example, the total duration is 180 minutes (3 hours * 60), so person 2, whose activity durations sum to 75 (45 + 30), gets the DUMMY variable once the activities for the first 75 minutes are done.
mat <- t(matrix(c("a", "a", "b", "c", "c", "c",
"d", "d", "e", "DUMMY", "DUMMY", "DUMMY",
"b", "b", "b", "a", "a", "a"),
nrow = 6, ncol = 3))
colnames(mat) <- c("0", "30", "60", "90", "120", "150")
I'm unsure how to fill the matrix mat above with the data above. Any help would be appreciated. Please let me know if the question needs to be made clearer.
EDIT: edited output
EDIT2: Added matrix column names
EDIT3: Added info on dummy variable
EDIT4: Would it be easier if I added start and end time instead of duration?

An approach would be to locate the activities for every 30-min interval by "id":
ints = seq(0, by = 30, length.out = 6)
data2 = do.call(rbind,
lapply(split(data, data$id),
function(d) {
dur = d$duration
i = findInterval(ints, c(cumsum(c(0, dur[-length(dur)])), sum(dur)))
data.frame(id = d$id[1], ints = ints, activity = d$activity[i])
}))
And on the new "data.frame":
tapply(as.character(data2$activity), data2[c("id", "ints")], identity)
# ints
#id 0 30 60 90 120 150
# 1 "a" "a" "b" "c" "c" "c"
# 2 "d" "d" "e" NA NA NA
# 3 "b" "b" "b" "a" "a" "a"
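The result above leaves NA where the question asks for DUMMY. A minimal base R sketch of one way to fill those in directly; the name `rows` and the use of `c(0, cumsum(duration))` as break points are choices made for this illustration, not part of the answer above:

```r
data <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3),
                   activity = c("a", "b", "c", "d", "e", "b", "b", "a"),
                   duration = c(60, 30, 90, 45, 30, 15, 60, 100),
                   stringsAsFactors = FALSE)

# Start of each half-hour slot, in minutes
ints <- seq(0, by = 30, length.out = 6)

rows <- lapply(split(data, data$id), function(d) {
  # Break points: start of every activity, plus the end of the last one
  i <- findInterval(ints, c(0, cumsum(d$duration)))
  act <- d$activity[i]        # slots past the last activity index beyond the vector and give NA
  act[is.na(act)] <- "DUMMY"  # replace those with the placeholder
  act
})

mat <- do.call(rbind, rows)   # one row per id
colnames(mat) <- ints
mat
# row "2" is: "d" "d" "e" "DUMMY" "DUMMY" "DUMMY"
```

A slot is assigned to whichever activity is running at the slot's start time, which is the same rule the answer above uses.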

Related

R add all combinations of three values of a vector to a three-dimensional array

I have a data frame with two columns. The first one "V1" indicates the objects on which the different items of the second column "V2" are found, e.g.:
V1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C")
V2 <- c("a","b","c","d","a","c","d","a","b","d","e")
df <- data.frame(V1, V2)
"A" for example contains "a", "b", "c", and "d". What I am looking for is a three dimensional array with dimensions of length(unique(V2)) (and the names "a" to "e" as dimnames).
For each unique value of V1 I want all possible combinations of three V2 items (e.g. for "A" it would be c("a", "b", "c"), c("a", "b", "d"), and c("b", "c", "d")).
Each of these "three-item-co-occurrences" should be regarded as a coordinate in the three-dimensional array and therefore be added to the frequency count which the values in the array should display. The outcome should be the following array
ar <- array(data = c(0,0,0,0,0,0,0,1,2,1,0,1,0,2,0,0,2,2,0,1,0,1,0,1,0,
0,0,1,2,1,0,0,0,0,0,1,0,0,1,0,2,0,1,0,1,1,0,0,1,0,
0,1,0,2,0,1,0,0,1,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,
0,2,2,0,1,2,0,1,0,1,2,1,0,0,0,0,0,0,0,0,1,1,0,0,0,
0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0),
dim = c(5, 5, 5),
dimnames = list(c("a", "b", "c", "d", "e"),
c("a", "b", "c", "d", "e"),
c("a", "b", "c", "d", "e")))
I was wondering about the 3D symmetry of your result. It took me a while to understand that you want to have all permutations of all combinations.
library(gtools) #for the permutations
foo <- function(x) {
#all combinations:
combs <- combn(x, 3, simplify = FALSE)
#all permutations for each of the combinations:
combs <- do.call(rbind, lapply(combs, permutations, n = 3, r = 3))
#tabulate:
do.call(table, lapply(asplit(combs, 2), factor, levels = letters[1:5]))
}
#apply grouped by V1, then sum the results
res <- Reduce("+", tapply(df$V2, df$V1, foo))
#check
all((res - ar)^2 == 0)
#[1] TRUE
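If gtools is not available, the same "all permutations of all combinations" count can probably be obtained in base R by enumerating every ordered triple of distinct items per group. This is only a sketch under that assumption; `count_triples` and `lv` are names made up for the illustration:

```r
V1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C")
V2 <- c("a", "b", "c", "d", "a", "c", "d", "a", "b", "d", "e")
lv <- sort(unique(V2))  # common dimnames for every group's table

count_triples <- function(items, lv) {
  items <- unique(items)
  # every ordered triple of items from this group
  g <- expand.grid(i = items, j = items, k = items, stringsAsFactors = FALSE)
  # keep only triples of three different items
  g <- g[g$i != g$j & g$i != g$k & g$j != g$k, ]
  # tabulate on the full set of levels so all tables have the same shape
  table(factor(g$i, levels = lv),
        factor(g$j, levels = lv),
        factor(g$k, levels = lv))
}

# one table per V1 group, then add them up
res <- Reduce("+", lapply(split(V2, V1), count_triples, lv = lv))
res["a", "b", "d"]  # 2: "a", "b", "d" occur together in "A" and in "C"
```

expand.grid() grows with the cube of the number of items per group, so this is only reasonable for small groups.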
I used to use the crossjoin CJ() to retain the pairwise count of all combinations of two different V2 items
res <- setDT(df)[,CJ(unique(V2), unique(V2)), V1][V1!=V2,
.N, .(V1,V2)][order(V1,V2)]
This code creates a data frame res with three columns. V1 and V2 contain the respective items of V2 from the original data frame df, and N contains the count (how many times the two items appear with the same value of V1 in the original data frame df).
Now, I found that I could perform this crossjoin with three 'dimensions' as well by just adding another unique(V2) and adapting the rest of the code accordingly.
The result is a data frame with four columns. V1, V2, and V3 indicate the original V2 items and N again shows the number of mutual appearances with the same original V1 objects.
res <- setDT(df)[,CJ(unique(V2), unique(V2), unique(V2)), V1][V1!=V2 & V1 != V3 & V2 != V3,
.N, .(V1,V2,V3)][order(V1,V2,V3)]
The advantage of this code is that empty combinations (those which do not appear at all) are not considered. It worked with 1,000,000 unique values in V1 and over 600 unique items in V2, which would otherwise have required an extremely large 600 x 600 x 600 array.

How do I get all pairs of values in a variable based on shared values in a different variable

My problem is perhaps a little difficult to formulate, hence I haven't found any solutions yet, but I'll try:
I want to find all pairs of values in a variable based on whether they share any value in another variable. Maybe the following example can illustrate it more clearly.
In a 2 variable data frame like this:
data.frame(scaffold = c("A", "A", "B", "B", "B", "C", "C", "D"),
geneID = c("162", "276", "64", "276", "281", "64", "162", "162"))
#> scaffold geneID
#> A 162
#> A 276
#> B 64
#> B 276
#> B 281
#> C 64
#> C 162
#> D 162
... I want to find all pairs of "scaffolds" A, B, C, and D that share any of the "geneID"s 64, 162, 276, and 281, so that the above would become a data frame with all pairs of scaffolds in 2 new columns like this:
data.frame(V1 = c("A", "A", "A", "B", "C"), V2 =c("B", "C", "D", "C", "D"))
#> V1 V2
#> A B
#> A C
#> A D
#> B C
#> C D
Obviously A and B is the same pair as B and A, so these should be removed somehow, but that's probably easy. Afterwards, this data frame needs to be combined with a data frame containing x/y coordinates of the scaffolds for drawing a line between the pairs on top of a plot with the scaffolds.
I do have a working for-loop to do the job, but I need to replace that with a much faster alternative. I'll spare you the code, it's complicated and doesn't always do it right. Running it on just 20 scaffolds can take seconds, but I need to do it on thousands. I was hoping a series of dplyr or data.table functions could do the job as they probably are as fast as it gets, but I haven't been able to get my head around how.
I hope you can help me, or perhaps something similar is already in another thread I just wasn't able to find.
A performance comparison of the two solutions by @Florian and @Roman can be found at http://rpubs.com/kasperskytte/SO_question_48407650
Here is a possible solution. Note that I modified your example df so A and C share both 162 and 64, and we have to make sure that this group does not occur twice in the output.
df = data.frame(scaffold = c("A", "A", "B", "B", "B", "C", "C", "D","A"),
geneID = c("162", "276", "64", "276", "281", "64", "162", "162", "64"), stringsAsFactors = FALSE)
y = split(df$scaffold,df$geneID)
unique(do.call(rbind,(lapply(y[which(sapply(y, length) > 1)],function(x){t(combn(sort(x),2))}))))
Output:
[,1] [,2]
[1,] "A" "C"
[2,] "A" "D"
[3,] "C" "D"
[4,] "A" "B"
[5,] "B" "C"
How it works: First we split the data into groups based on df$geneID; the result we call y. Then we lapply over every element of y that has more than 1 element in it a function that gives us all n possible combinations of 2 as an n x 2 matrix. By calling sort() on x inside this function we make removing duplicates easier later on, because we then rbind this list into one large matrix and call unique() on the result to remove duplicates.
Hope this helps!
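For comparison, the same pairs can also be found with a self-join on geneID. This is a different technique from the split()/combn() approach above, sketched here in base R with merge(); the column names scaffold.x and scaffold.y come from merge()'s default suffixes, and the variable name `pairs` is just for illustration:

```r
df <- data.frame(scaffold = c("A", "A", "B", "B", "B", "C", "C", "D"),
                 geneID = c("162", "276", "64", "276", "281", "64", "162", "162"),
                 stringsAsFactors = FALSE)

# join the table to itself: one row per pair of scaffolds sharing a geneID
pairs <- merge(df, df, by = "geneID")

# keep each unordered pair once (A-B but not B-A, and no A-A)
pairs <- pairs[pairs$scaffold.x < pairs$scaffold.y, c("scaffold.x", "scaffold.y")]

# a pair that shares several genes would appear several times
pairs <- unique(pairs)
pairs
```

The self-join creates one row per pair per shared gene before filtering, so memory use grows quickly when a gene is shared by many scaffolds.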
See the comments in the code.
xy <- data.frame(scaffold = c("A", "A", "B", "B", "B", "C", "C", "D"),
geneID = c("162", "276", "64", "276", "281", "64", "162", "162"))
# split by gene
xy1 <- split(xy, f = xy$geneID)
# find all combinations
out <- sapply(xy1, FUN = function(x) {
x$scaffold <- as.character(x$scaffold)
# add NA so that we can remove any cases that have a single scaffold
tryCatch(t(combn(x$scaffold, 2)), error = function(e) NA)
}, simplify = FALSE)
# remove NAs and some fiddling to get the desired format
out <- out[!is.na(out)]
out <- do.call(rbind, out)
# sort the data
out <- t(apply(out, MARGIN = 1, FUN = function(x) sort(x)))
# remove duplicates
out <- out[!duplicated(out), ]
out
[,1] [,2]
[1,] "A" "C"
[2,] "A" "D"
[3,] "C" "D"
[4,] "A" "B"
[5,] "B" "C"

How to delete all rows with counterparts and the counterparts themselves?

Please, have a look at the following data frame:
df <- data.frame(col1 = c(1, -2, -1, 3, 2 , 2),
col2 = c("a", "b", "a", "c", "b", "b"),
col3 = c("d", "e", "f", "g", "h", "i"))
My goal is to delete all rows in df with negative counterparts and the counterparts themselves. Now, what do I mean by a "negative counterpart"? A row has a negative counterpart if there is another row with the same number in col1 but with a minus, and the same value in col2. The value in col3 does not matter. Rows can have multiple counterparts. In this case, only one of them should be deleted. Thus, in the above example, the final data frame should contain only the fourth and either the fifth or sixth row.
The real df has approx. 4*10^5 rows and 25 columns. Most rows do not have a counterpart. So, my idea was to build a for loop that checks, for each row whose value in col1 is less than 0, whether there is a positive counterpart. But I am struggling with the "checking" part.
for (i in seq_len(nrow(df))) {
if (df$col1[i] < 0) {
# check for positive counterparts here
}
}
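One way to avoid the loop entirely, as a rough base R sketch: number the rows within each (col1, col2) group, so that the k-th row with value x pairs up with the k-th row with value -x, and drop every row whose partner exists. The names `rank_in_group`, `key`, `counter_key` and `kept` are invented for this illustration, and it assumes col1 never contains 0 (a zero would count as its own counterpart):

```r
df <- data.frame(col1 = c(1, -2, -1, 3, 2, 2),
                 col2 = c("a", "b", "a", "c", "b", "b"),
                 col3 = c("d", "e", "f", "g", "h", "i"))

# position of each row among the rows with identical col1 and col2: 1, 2, ...
rank_in_group <- ave(seq_len(nrow(df)), df$col1, df$col2, FUN = seq_along)

# a row's own key, and the key its counterpart would have (sign flipped, same rank)
key         <- paste(df$col1, df$col2, rank_in_group)
counter_key <- paste(-df$col1, df$col2, rank_in_group)

# keep only the rows whose counterpart key does not occur
kept <- df[!(counter_key %in% key), ]
kept
# rows 4 and 6 remain (col3 "g" and "i")
```

Because the second row with col1 = 2 has rank 2 and there is no second -2 row, it survives, which matches the requirement that only one of several counterparts is deleted.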

Counting number of elements in a character column by levels of a factor column in a dataframe

I am a beginner in R. I have a dataframe in which there are two factor columns. One column is a company column, second is a product column. There are several missing values in product column and so I want to count the number of values in product column for each company (or each level of the company variable). I tried table, and count function in plyr package but they only seem to work with numeric variables. Please help!
Let's say the data frame looks like this:
df <- data.frame(company= c("A", "B", "C", "D", "A", "B", "C", "C", "D", "D"), product = c(1, 1, 2, 3, 4, 3, 3, NA, NA, NA))
So the output I am looking for is -
A 2
B 2
C 3
D 2
Thanks in advance!!
A dplyr solution.
library(dplyr)
df %>%
filter(!is.na(product)) %>%
group_by(company) %>%
count()
# A tibble: 4 × 2
comp n
<fctr> <int>
1 A 2
2 B 2
3 C 3
4 D 1
We can use rowsum from base R
with(df, rowsum(+!is.na(prod), comp))
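To see what the one-liner does: `!is.na(prod)` is TRUE for every non-missing value, the unary `+` turns that into 1/0, and rowsum() adds those up per group. A small self-contained sketch, assuming the comp/prod data used elsewhere in this thread (`counts` is just an illustrative name):

```r
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D", "D")
prod <- c(1, 1, 2, 3, 4, 3, 3, 1, NA, NA)

# 1 where prod is present, 0 where it is NA
present <- +!is.na(prod)

# sum the 1/0 flags within each company; the result is a one-column matrix
counts <- rowsum(present, comp)
counts[, 1]
# A = 2, B = 2, C = 3, D = 1
```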
Assuming your df is one of the following:
CASE 1) As given in the question
Data for df:
options(stringsAsFactors = FALSE)
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D","D" )
prod <- c(1,1,2,3,4,3,3,1,NA,NA)
df <- data.frame(comp=comp,prod=prod)
Program:
df$prodflag <- !is.na(df$prod)
tapply(df$prodflag , df$comp,sum)
Output:
> tapply(df$prodflag , df$comp,sum)
A B C D
2 2 3 1
#########################################################################
CASE 2) In case stringsAsFactors is on and prod is a character column in which even the NAs are quoted strings (so "NA" becomes a factor level), you can do:
Data:
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D","D" )
prod <- c("a","a","b","c","d","c","c","a","NA","NA")
df <- data.frame(comp=comp,prod=prod,stringsAsFactors = TRUE)
Solution:
df$prodflag <- as.numeric(!as.character(df$prod)=="NA")
tapply(df$prodflag , df$comp,sum)
#########################################################################
CASE 3) In case the prod is a character and stringsAsFactors is on but NAs are not quoted then you can do:
Data:
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D","D" )
prod <- c("a","a","b","c","d","c","c","a",NA,NA)
df <- data.frame(comp=comp,prod=prod,stringsAsFactors = TRUE)
Solution:
df$prodflag <- as.numeric(!is.na(df$prod))
tapply(df$prodflag , df$comp,sum)
Moral of the story: we should understand our data first, and then we can choose the logic which best suits our need.

Grouping factor levels in a data.table

I'm trying to combine factor levels in a data.table & wondering if there's a data.table-y way to do so.
Example:
library(data.table)
DT = data.table(id = 1:20, ind = as.factor(sample(8, 20, replace = TRUE)))
I want to say types 1,3,8 are in group A; 2 and 4 are in group B; and 5,6,7 are in group C.
Here's what I've been doing, which has been quite slow in the full version of the problem:
DT[ind %in% c(1, 3, 8), grp := as.factor("A")]
DT[ind %in% c(2, 4), grp := as.factor("B")]
DT[ind %in% c(5, 6, 7), grp := as.factor("C")]
Another approach, suggested by this related question, would I guess translate like so:
DT[ , grp := ind]
levels(DT$grp) = c("A", "B", "A", "B", "C", "C", "C", "A")
Or perhaps (given I've got 65 underlying groups and 18 aggregated groups, this feels a little neater)
DT[ , grp := ind]
lev <- letters[1:8]
lev[c(1, 3, 8)] <- "A"
lev[c(2, 4)] <- "B"
lev[5:7] <- "C"
levels(DT$grp) <- lev
Both of these seem unwieldy; does this seem like the appropriate way to do this in data.table?
For reference, I timed a beefed up version of this with 10,000,000 observations and some more subgroup/supergroup levels. My original approach is slowest (having to run all those logic checks is costly), the second the fastest, and the third a close second. But I like the readability of that approach better.
(Keying DT before searching speeds things up, but it only halves the gap vis-a-vis the latter two methods)
Update:
I recently learned of a much simpler way to re-associate factor levels from this question and a closer reading of ?levels. No merges, correspondence table, etc. necessary, just pass a named list to levels:
levels(DT$ind) = list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)
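A small base R illustration of that named-list form, on a plain factor rather than a data.table column (the vectors `ind` and `grp` here are made up for the example):

```r
# a factor with the eight original levels "1" to "8"
ind <- factor(c(1, 3, 8, 2, 4, 5, 6, 7, 1), levels = 1:8)

# work on a copy so the original levels stay available
grp <- ind

# each list name becomes a new level; its elements are the old levels it absorbs
levels(grp) <- list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)

levels(grp)        # "A" "B" "C"
as.character(grp)  # "A" "A" "A" "B" "B" "C" "C" "C" "A"
```

Any old level that is not mentioned in the list turns into NA, so the list should cover every level present in the data.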
Original Answer:
As suggested by @Arun, we have the option of creating the correspondence as a separate data.table, then joining it to the original:
match_dt = data.table(ind = as.factor(1:12),
grp = as.factor(c("A", "B", "A", "B", "C", "C",
"C", "A", "D", "E", "F", "D")))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
We can also do this in (what I consider to be) the more readable fashion like so (with marginal speed costs):
levels <- letters[1:12]
levels[c(1, 3, 8)] <- "A"
levels[c(2, 4)] <- "B"
levels[5:7] <- "C"
levels[c(9, 12)] <- "D"
levels[10] <- "E"
levels[11] <- "F"
match_dt <- data.table(ind = as.factor(1:12),
grp = as.factor(levels))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
