count of records within levels of a factor - r

I am trying to populate a field in a table (or create a separate vector altogether, whichever is easier) with consecutive numbers from 1 to n, where n is the total number of records that share the same factor level, and then back to 1 for the next level, etc. That is, for a table like this
data<-matrix(c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)),ncol=1)
the result should be a new column (e.g. "sample") as follows:
sample<-c(1,2,3,4,1,2,3,1,2,3,4,1,2)

You can get it as follows, using ave:
data <- data.frame(data)
new <- ave(rep(1,nrow(data)),data$data,FUN=cumsum)
all.equal(new,sample) # check if it's right.
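A slightly shorter variant of the same idea, offered as a sketch: instead of taking the cumulative sum of a vector of ones, let ave() apply seq_along within each group. The vector name g is just an illustration and is not part of the original question.

```r
# Grouping vector equivalent to the single column in the question
g <- c(rep("A", 4), rep("B", 3), rep("C", 4), rep("D", 2))

# ave() splits the positions by group and applies seq_along to each piece,
# so every group is numbered from 1 upward in its original row order
idx <- ave(seq_along(g), g, FUN = seq_along)
idx
```

Because ave() groups by value rather than by position, this also works when the rows of a level are not contiguous.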

You can use the rle function together with lapply:
sample <- unlist(lapply(rle(data[,1])$lengths,FUN=function(x){1:x}))
data <- cbind(data,sample)
Or even better, you can combine rle and sequence in the following one-liner (thanks to @Arun's suggestion):
data <- cbind(data,sequence(rle(data[,1])$lengths))
> data
[,1] [,2]
[1,] "A" "1"
[2,] "A" "2"
[3,] "A" "3"
[4,] "A" "4"
[5,] "B" "1"
[6,] "B" "2"
[7,] "B" "3"
[8,] "C" "1"
[9,] "C" "2"
[10,] "C" "3"
[11,] "C" "4"
[12,] "D" "1"
[13,] "D" "2"
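One caveat worth keeping in mind: rle() counts runs of consecutive equal values, so the approach above matches the question only when the rows are already sorted by level. A small sketch with a made-up, unsorted vector g shows the difference from a group-wise count:

```r
# Hypothetical unsorted input: levels A and B are interleaved
g <- c("A", "B", "A", "B", "B")

# Run-based numbering restarts whenever the value changes
by_run <- sequence(rle(g)$lengths)

# Level-based numbering keeps counting across non-adjacent rows
by_level <- ave(seq_along(g), g, FUN = seq_along)

by_run    # 1 1 1 1 2
by_level  # 1 1 2 2 3
```

If the data may be unsorted, either sort first or use a group-wise function such as ave().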

There are lots of different ways of achieving this, but I prefer to use ddply() from plyr because the logic seems very consistent to me. I think it makes more sense to be working with a data.frame (your title talks about levels of a factor):
dat <- data.frame(ID = c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)))
library(plyr)
ddply(dat, .(ID), summarise, sample = 1:length(ID))
# ID sample
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 B 1
# 6 B 2
# 7 B 3
# 8 C 1
# 9 C 2
# 10 C 3
# 11 C 4
# 12 D 1
# 13 D 2

My answer:
sample <- unlist(lapply(levels(factor(data)), function(x)seq_len(sum(factor(data)==x))))

factors <- unique(data)
f1 <- length(which(data == factors[1]))
...
fn <- length(which(data == factors[length(factors)]))
You can use a for loop or the 'apply' family to automate that part.
Then,
sample <- c(1:f1, 1:f2, ..., 1:fn)
Once again you can use a for loop for that part. Here is the full script you can use:
data<-matrix(c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)),ncol=1)
factors <- unique(data)
f <- c()
for(i in 1:length(factors)) {
f[i] <- length(which(data == factors[i]))
}
sample <- c()
for(i in 1:length(f)) {
sample <- c(sample, 1:f[i])
}
> sample
[1] 1 2 3 4 1 2 3 1 2 3 4 1 2
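For comparison, here is a sketch of the same two steps without explicit loops: vapply() does the counting and sequence() builds the runs. The names lev, counts and sample_idx are my own, and like the loop version this assumes the rows are already grouped by level.

```r
data <- matrix(c(rep("A", 4), rep("B", 3), rep("C", 4), rep("D", 2)), ncol = 1)

# Distinct levels in order of first appearance
lev <- unique(data[, 1])

# Number of rows per level; vapply guarantees one integer per level
counts <- vapply(lev, function(l) sum(data[, 1] == l), integer(1),
                 USE.NAMES = FALSE)

# sequence(c(4, 3, 4, 2)) is the same as c(1:4, 1:3, 1:4, 1:2)
sample_idx <- sequence(counts)
sample_idx
```

Growing a vector with c() inside a loop copies it on every iteration, which is why the vectorised form tends to scale better on large inputs.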

Merging list of lists with comma separated in R

I have a data set like this.
Reproducible data:
test <- list(c("1"),c("2"),c("3"),c(c("a"),c("b")),c("d"))
Desired output is:
1
2
3
a,b
d
I have tried
output <- do.call(rbind, test)
You can use toString to collapse every element in the list.
out <- data.frame(result = sapply(test, toString))
out
# result
#1 1
#2 2
#3 3
#4 a, b
#5 d
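Note that toString joins with ", " (comma plus space), while the desired output in the question has "a,b" without a space. If the exact separator matters, a sketch with paste and an explicit collapse argument would be the following; the names collapsed and out are only illustrative.

```r
test <- list(c("1"), c("2"), c("3"), c("a", "b"), c("d"))

# paste(..., collapse = ",") joins each element's values with a bare comma;
# vapply checks that every result is a single string
collapsed <- vapply(test, paste, character(1), collapse = ",")

out <- data.frame(result = collapsed, stringsAsFactors = FALSE)
out
```

The original do.call(rbind, test) attempt does not give this because the list elements have different lengths, so rbind recycles the shorter ones instead of collapsing the longer one.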
We can use tidyverse methods
Loop over the list with map
Paste the elements using str_c
Return a single column dataset by adding the suffix _dfr in map
library(dplyr)
library(purrr)
library(stringr)
map_dfr(test, ~ tibble(col1 = str_c(.x, collapse=",")))
# A tibble: 5 x 1
col1
<chr>
1 1
2 2
3 3
4 a,b
5 d
Actually you are already close to your goal. Maybe the code below can help you
> do.call(rbind, Map(toString, test))
[,1]
[1,] "1"
[2,] "2"
[3,] "3"
[4,] "a, b"
[5,] "d"
Maybe this could help too beside all the very good answers you received:
test |>
lapply(\(x) paste(x, collapse = ",")) |>
cbind()
[,1]
[1,] "1"
[2,] "2"
[3,] "3"
[4,] "a,b"
[5,] "d"

Pairwise matrix to list of outcomes

Suppose we have a matrix M
M <- matrix(c(1:9),3,3)
diag(M) <- NA
M
[,1] [,2] [,3]
[1,] NA 4 7
[2,] 2 NA 8
[3,] 3 6 NA
where each entry describes the outcomes of pairwise interactions. Each interaction of row i with column j is interpreted as "object i outperformed object j X times". Examples: Object 2 performs better than object 1 in 2 cases. Object 1 performs better than object 3 in 7 cases.
Is there a quick way to transform this matrix into an object holding this information in a format where each row fully describes the interactions between two objects? The goal is something like this:
[,1] [,2] [,3] [,4]
[1,] "OBJ1" "OBJ2" "N1" "N2"
[2,] "1" "2" "4" "2"
[3,] "1" "3" "7" "3"
[4,] "2" "3" "8" "6"
where the first two columns give the objects that are compared while columns 3 and 4 describe how often OBJ1 outperformed OBJ2 and vice versa. The interpretation of the first row is: Object 1 has outperformed Object 2 4 times, whereas Object 2 has outperformed Object 1 2 times. I have been playing around with reshape2 and aggregating without useful results so far.
Maybe you can try the code below
inds <- t(combn(dim(M)[1], 2))
Mout <- `colnames<-`(
cbind(inds, M[inds], M[inds[, 2:1]]),
do.call(paste0, rev(expand.grid(1:2, c("Obj", "N"))))
)
which gives
> Mout
Obj1 Obj2 N1 N2
[1,] 1 2 4 2
[2,] 1 3 7 3
[3,] 2 3 8 6
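The key step in that answer is indexing a matrix with a two-column matrix: each row (i, j) of the index picks the single element M[i, j]. Here is the same idea written out more verbosely as a sketch that returns a data frame; the names pairs and res are mine.

```r
M <- matrix(1:9, 3, 3)
diag(M) <- NA

# All unordered pairs of objects, one pair per row: (1,2), (1,3), (2,3)
pairs <- t(combn(nrow(M), 2))

res <- data.frame(
  OBJ1 = pairs[, 1],
  OBJ2 = pairs[, 2],
  N1   = M[pairs],                        # M[i, j]: times OBJ1 beat OBJ2
  N2   = M[pairs[, 2:1, drop = FALSE]]    # M[j, i]: times OBJ2 beat OBJ1
)
res
```

Swapping the two index columns is what flips the lookup from M[i, j] to M[j, i].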
Another solution could be:
M <- matrix(c(1:9),3,3)
diag(M) <- NA
M1 <- M
M[upper.tri(M, diag=TRUE)] <- NA
M1[lower.tri(M1, diag=TRUE)] <- NA
R1 = reshape2::melt(M1, na.rm=TRUE, value.name="N1")
R2 = reshape2::melt(M, na.rm=TRUE, value.name="N2")
R1$N2 <- R2$N2
rownames(R1) <- NULL
Output:
> R1
Var1 Var2 N1 N2
1 1 2 4 2
2 1 3 7 3
3 2 3 8 6
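One thing to watch with this approach: the two melted tables line up here because the matrix is 3 x 3. As far as I can tell, for larger matrices the upper and lower triangles are listed in different orders, so assigning R2$N2 by position can pair the wrong cells. A sketch that keeps the pairs aligned, using only base R and a hypothetical 4 x 4 matrix:

```r
M <- matrix(1:16, 4, 4)
diag(M) <- NA

up  <- upper.tri(M)                 # logical mask of cells with i < j
idx <- which(up, arr.ind = TRUE)    # row/col positions, in the same order as M[up]

res <- data.frame(
  OBJ1 = idx[, "row"],
  OBJ2 = idx[, "col"],
  N1   = M[up],       # M[i, j]
  N2   = t(M)[up]     # t(M)[i, j] equals M[j, i], read in the same cell order
)
res
```

Reading both values through the same mask is what guarantees that N1 and N2 refer to the same pair of objects.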

cbind coerces a data frame to matrix

I'm having trouble when using cbind. Prior to using cbind the object is a data.frame of two character vectors.
After I add a column using cbind, the data.frame object changes class to matrix. I've tried as.vector, declaring h as an empty character vector, etc. but couldn't fix it. Thank you for any suggestions and help.
output <- data.frame(h = character(), st = character()) ## empty dataframe
st <- state.abb
h <- (rep("a", 50))
output <- cbind(output$h, h) ## output changes to matrix class here
output <- cbind(output, st) ## adding a second column
I guess you may not need cbind().
output <- data.frame(state = state.abb, h = rep("a", 50))
head(output)
state h
1 AL a
2 AK a
3 AZ a
4 AR a
5 CA a
6 CO a
@Ken I'm not sure what you actually want to obtain but it may be easier if variables are kept in a list. Below is an example.
state <- state.abb
h <- rep("a", 50)
lst <- list(state = state, h = h)
mat <- as.matrix(do.call(cbind, lst))
head(mat)
state h
[1,] "AL" "a"
[2,] "AK" "a"
[3,] "AZ" "a"
[4,] "AR" "a"
[5,] "CA" "a"
[6,] "CO" "a"
df <- as.data.frame(do.call(cbind, lst))
head(df)
state h
1 AL a
2 AK a
3 AZ a
4 AR a
5 CA a
6 CO a
As a complement of info, notice that you could use single bracket notation to make it work with something close to your original code:
data
output <- data.frame(h = letters[1:5],st = letters[6:10])
h2 <- (rep("a", 5))
This won't work
cbind(output$h, h2)
# h2
# [1,] "1" "a"
# [2,] "2" "a"
# [3,] "3" "a"
# [4,] "4" "a"
# [5,] "5" "a"
class(cbind(output$h, h2)) # matrix
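The result is a matrix, and the factor has been coerced to its integer codes (this happens when h is a factor, which was the default for strings in data.frame() before R 4.0.0).
This will work: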
cbind(output["h"], h2)
# h h2
# 1 a a
# 2 b a
# 3 c a
# 4 d a
# 5 e a
class(cbind(output["h"], h2)) # data.frame
Note that with double brackets (output[["h"]]) you'll have the same inadequate result as when using the dollar notation.
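To summarise the behaviour discussed in this thread with a small sketch: cbind() only returns a data frame when at least one of its arguments is a data frame. The vectors below are made up for illustration.

```r
h  <- rep("a", 5)
st <- letters[1:5]

# Two plain vectors: cbind() builds a character matrix
m <- cbind(h, st)
class(m)

# A data frame as one argument: cbind() dispatches to the data.frame method
df <- cbind(data.frame(h = h), st)
class(df)

# Adding a column by assignment also keeps the data frame intact
df2 <- data.frame(h = h)
df2$st <- st
class(df2)
```

In the original code, output$h extracts a plain vector, so by the time cbind() is called no data frame is involved any more.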

R: transposing and splitting a row with a delimiter.

I have a table
rawData <- as.data.frame(matrix(c(1,2,3,4,5,6,"a,b,c","d,e","f"),nrow=3,ncol=3))
1 4 a,b,c
2 5 d,e
3 6 f
I would like to convert to
1 2 3
4 5 6
a d f
b e
c
So far I can transpose and split the third column; however, I'm lost as to how to reconstruct a new table with the format outlined above.
new = t(rawData)
for (e in 1:ncol(new)){
s<-strsplit(new[3:3,e], split=",")
print(s)
}
I tried creating new vectors for each iteration but I'm not sure how to efficiently put each one back into a dataframe. Would be grateful for any help. thanks!
You can use stri_list2matrix from the stringi package:
library(stringi)
rawData <- as.data.frame(matrix(c(1,2,3,4,5,6,"a,b,c","d,e","f"),nrow=3,ncol=3),stringsAsFactors = FALSE)
d1 <- t(rawData[,1:2])
rownames(d1) <- NULL
d2 <- stri_list2matrix(strsplit(rawData$V3,split=','))
rbind(d1,d2)
# [,1] [,2] [,3]
# [1,] "1" "2" "3"
# [2,] "4" "5" "6"
# [3,] "a" "d" "f"
# [4,] "b" "e" NA
# [5,] "c" NA NA
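If you would rather avoid the extra package, a base R sketch of the same padding step is to extend each split piece to the maximum length with `length<-`, which fills with NA. The data frame below is rebuilt with explicit column names, and the names parts, padded and result are my own.

```r
rawData <- data.frame(V1 = 1:3, V2 = 4:6, V3 = c("a,b,c", "d,e", "f"),
                      stringsAsFactors = FALSE)

# Split the third column into a list of character vectors
parts <- strsplit(rawData$V3, ",", fixed = TRUE)

# Pad every vector to the longest length; sapply binds them as columns
n <- max(lengths(parts))
padded <- sapply(parts, `length<-`, n)

# Transpose the first two columns and stack the padded pieces underneath
top <- unname(t(as.matrix(rawData[, 1:2])))
result <- rbind(top, padded)
result
```

The result is a character matrix, since rbind has to coerce the numeric rows to match the split strings.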
You can also use cSplit from my "splitstackshape" package.
By default, it just creates additional columns after splitting the input:
library(splitstackshape)
cSplit(rawData, "V3")
# V1 V2 V3_1 V3_2 V3_3
# 1: 1 4 a b c
# 2: 2 5 d e NA
# 3: 3 6 f NA NA
You can just transpose that to get your desired output.
t(cSplit(rawData, "V3"))
# [,1] [,2] [,3]
# V1 "1" "2" "3"
# V2 "4" "5" "6"
# V3_1 "a" "d" "f"
# V3_2 "b" "e" NA
# V3_3 "c" NA NA

R - Splitting a column text into 2 columns without delimiter

I need to manipulate the following data frame (data) so that the PATCH_CODE column is split into 2 resulting columns where the 1st column contains the letter of the string and the 2nd column contains the number as in the 2nd example dataframe below.
EDIT: PATCH_CODE is not always 2 characters; occasionally it is a single letter, in which case I need to force a 1 into the resulting code column.
initial data frame: head(data,4)
PATCH_CODE TERR PC1
A1 MENS_10 0.8629186
A3 MENS_10 -0.2703238
B1 MENS_10 0.9516067
B2 MENS_10 -0.1722446
resulting data frame:
PATCH CODE TERR PC1
A 1 MENS_10 0.8629186
A 3 MENS_10 -0.2703238
B 1 MENS_10 0.9516067
B 2 MENS_10 -0.1722446
I have seen examples of how to accomplish this when the column to be split has an identifiable text delimiter such as a comma by using colsplit in reshape but I have failed to find a solution for a structure like mine. Is this possible?
output of str(data)
'data.frame': 240 obs. of 3 variables:
$ PATCH_CODE: Factor w/ 42 levels "A","A1","A2",..: 2 3 4 7 8 12 13 16 17 18 ...
$ TERR : Factor w/ 19 levels "MENS_10","MENS_14",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PC1 : num 0.548 1.228 0.273 5.548 3.853 ...
You can use strsplit. Passing an empty string as a delimiter results in a split at each character.
a <- c("A1", "B1", "C2", "D5", "R3")
strsplit(a, "")
[[1]]
[1] "A" "1"
[[2]]
[1] "B" "1"
[[3]]
[1] "C" "2"
[[4]]
[1] "D" "5"
[[5]]
[1] "R" "3"
If you want to put that in a matrix
> do.call(rbind, strsplit(a, ""))
[,1] [,2]
[1,] "A" "1"
[2,] "B" "1"
[3,] "C" "2"
[4,] "D" "5"
[5,] "R" "3"
By the sounds of your description, strsplit should work fine. If your data are a little more complicated, you can also look at a possible regex-based solution.
For this particular example, try:
do.call(rbind, strsplit(as.character(mydf$PATCH_CODE),
split = "(?<=[a-zA-Z])(?=[0-9])",
perl = TRUE))
# [,1] [,2]
# [1,] "A" "1"
# [2,] "A" "3"
# [3,] "B" "1"
# [4,] "B" "2"
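The EDIT in the question mentions codes that consist of a single letter and should get a 1. Neither strsplit variant produces a second piece for those, so here is a sketch that handles them by stripping the digits and the letters separately with sub(). The sample vector codes, including the bare "A", is made up for illustration.

```r
# as.character() guards against PATCH_CODE being a factor, as in str(data)
codes <- as.character(c("A1", "A3", "B1", "B2", "A"))

# Letter part: drop any trailing digits
patch <- sub("[0-9]+$", "", codes)

# Number part: drop the leading letters; bare letters leave an empty string
num <- sub("^[A-Za-z]+", "", codes)
num[num == ""] <- "1"   # force a 1 where no number was present

res <- data.frame(PATCH = patch, CODE = as.integer(num),
                  stringsAsFactors = FALSE)
res
```

This also copes with multi-digit codes such as "A10", which splitting on the empty string would break into three pieces.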
