R: Select multiple values from vector of sequences - r

In R I'm trying to figure out how to select multiple values from a predefined vector of sequences (e.g. indices = c(1:3, 4:6, 10:12, ...)). In other words, if I want a new vector with the 3rd, 5th, and 7th entries in "indices", what syntax should I use to get back a vector with just those sequences intact, e.g. c(10:12, ...)?

If I understand correctly, you want the 3rd, 5th, and 7th entries in c(1:3, 4:6, 10:12, ...), which means you want to extract specific sets of indices from a vector.
When you do something like c(1:3, 4:6, ...), the resulting vector isn't what it sounds like you want. Instead, use list(1:3, 4:6, ...). Then you can do this:
indices <- list(1:3, 4:6, 10:12, 14:16, 18:20)
x <- rnorm(100)
x[c(indices[[3]], indices[[5]])]
This is equivalent to:
x[c(10:12, 18:20)]
That is in turn equivalent to:
x[c(10, 11, 12, 18, 19, 20)]
Please let me know if I've misinterpreted your question.
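If more than a couple of sequences are needed, the same idea can be written without repeating indices[[...]] for each one. A minimal sketch, assuming the list from above (the name picked is just for illustration): subset the list with single brackets, then flatten it with unlist.

```r
indices <- list(1:3, 4:6, 10:12, 14:16, 18:20)
set.seed(1)            # only so the example is reproducible
x <- rnorm(100)

# single brackets keep a list of the chosen sequences; unlist flattens them
picked <- unlist(indices[c(3, 5)])
picked
x[picked]
```

Here picked holds the positions 10 to 12 followed by 18 to 20, so x[picked] matches x[c(10:12, 18:20)].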

What you are looking for is how to subset data. Most commonly it is done using square bracket notation:
sample data:
my_vector <- c(100:120)
my_vector
# 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
values you want taken out:
indices <- c(1:3, 4:6, 10:12)
indices
# 1 2 3 4 5 6 10 11 12
subsetting using bracket notation
my_vector[indices]
# 100 101 102 103 104 105 109 110 111
There is also a function called subset that can do this as well, although it selects by a logical condition rather than by position.
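A short sketch of the difference, assuming the same my_vector and indices as above: subset expects a logical condition, so positions have to be turned into one first (here with seq_along and %in%).

```r
my_vector <- 100:120
indices <- c(1:3, 4:6, 10:12)

# positional subsetting with brackets
by_position <- my_vector[indices]

# subset() takes a logical condition, so build one from the positions
by_condition <- subset(my_vector, seq_along(my_vector) %in% indices)

identical(by_position, by_condition)
```

For plain positional selection the bracket form is the more direct of the two.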

Related

lapply with different indices in list of lists

I'm trying to get the output of a certain column ($new_age = numeric values) within lists of lists.
Data is "my_groups", which consists of 28 lists. Those lists have lists themselves of irregular size:
92 105 96 86 91 94 73 100 87 89 88 90 112 82 95 83 94 106
91 101 86 81 89 68 89 87 109 73 (len_df)
The 1st list has 92 lists, the 2nd 105 etc. ... until the 28th list with 73 lists.
First, I want my function to iterate through the 28 years of data and second, within these years I want to iterate through len_df, since $new_age is in the nested lists.
What I tried is this:
test <- lapply(seq(1:28), function(i) sapply(seq(1:len_df), function(j) (my_groups[[i]][[j]]$new_age) ) )
However, the index is out of bounds and I'm not sure how to combine two different indices for the nested lists. Unlist is not ideal, since I have to treat the data as separate groups and sorted for each year.
Expected output: $new_age (numeric values) for each of the 28 years e.g. 1st = 92 values, 2nd = 105 values etc.
Any idea how to make this work? Thank you!
Here are a few different approaches:
1) whole object approach Assuming that the input is L shown reproducibly in the Note at the end and that what is wanted is a list of length(L) numeric vectors, i.e. list(1:2, 3:5), consisting of the new_age values:
lapply(L, sapply, `[[`, "new_age")
giving:
[[1]]
[1] 1 2
[[2]]
[1] 3 4 5
2) indices If you want to do it using indices, as in the code shown in question, then using seq_along:
ix <- seq_along(L)
lapply(ix, function(i) sapply(seq_along(L[[i]]), function(j) L[[i]][[j]]$new_age))
3) unlist To use unlist form an appropriate grouping variable using rep and split into separate vectors by it. This assumes that new_age are the only leaves which may or may not be the case in your data but is the case in the reproducible example in the Note at the end.
split(unname(unlist(L)), rep(seq_along(L), lengths(L)))
Note
L <- list(list(list(new_age = 1), list(new_age = 2)),
list(list(new_age = 3), list(new_age = 4), list(new_age = 5)))
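A likely reason the original attempt goes out of bounds is that len_df holds 28 different lengths, so 1:len_df can only use the first one (92) for every year. A sketch of a variant that never needs the lengths, using the same L as in the Note; vapply is swapped in for sapply here only because it checks that each new_age is a single number:

```r
L <- list(list(list(new_age = 1), list(new_age = 2)),
          list(list(new_age = 3), list(new_age = 4), list(new_age = 5)))

# outer loop: one element per year; inner loop: one new_age per nested list
ages <- lapply(L, function(year) {
  vapply(year, function(entry) entry$new_age, numeric(1))
})

lengths(ages)   # number of values extracted for each year
```

Each element of ages is a numeric vector for one year, so the groups stay separate.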

How to get a specific column from a matrix in r?

I have a matrix as following, how can I extract the desired column with [?
MX <- matrix(101:112,ncol=3)
MX[,2]
# [1] 105 106 107 108
`[`(MX, c(1:4,2))
# [1] 101 102 103 104 102
Obviously, this does not extract the 2nd column as I intuitively guessed; it returns elements 1 to 4 followed by the 2nd element of the matrix.
In other words, I am asking how to express MX[,2] as a call to the `[` function.
Please advise. Thanks.
Keep the row index as blank
`[`(MX, ,2)
#[1] 105 106 107 108
or if we need to extract selected rows (1:4) of a specific column (2), specify the row, column index without concatenating. c will turn the row and column index to a single vector instead of two
`[`(MX, 1:4, 2)
#[1] 105 106 107 108
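A small sketch confirming that the function form and the bracket form agree, and showing that named arguments such as drop can be passed the same way (col2 and col2_matrix are illustrative names):

```r
MX <- matrix(101:112, ncol = 3)

# empty row argument: take every row of column 2
col2 <- `[`(MX, , 2)

# drop = FALSE keeps the result as a one-column matrix
col2_matrix <- `[`(MX, , 2, drop = FALSE)

identical(col2, MX[, 2])
dim(col2_matrix)
```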

Condense/merge cells in a table in R

I'm trying to do something with a table in R.
The table comes into the script like this
M P
Position1 34 56
Position2 45 23
Position3 89 78
Position1 56 45
Position3 54 35
Position2 56 89
And after analyzing this script, ideally, I'd like a final output to be this:
M P
Position1 90 101
Position2 101 112
Position3 143 113
Basically I sum the total number across the positions for M and P. I was wondering if there was an easier way to do this. The positions will be at random. Is there a way to potentially split the data table by the position?
You can use summarise_each from dplyr if you have multiple columns and a big dataset, provided the data is a data.frame (from the post, it is not clear whether you have a matrix or a data.frame). Note that recent dplyr releases deprecate summarise_each and funs in favour of summarise with across.
library(dplyr)
dat %>%
group_by(Pos) %>%
summarise_each(funs(sum=sum(., na.rm=TRUE)))
# Pos M P
#1 Position1 90 101
#2 Position2 101 112
#3 Position3 143 113
Or another option I would use for bigger datasets is data.table. From the benchmarks by @Ananda Mahto, it is the clear winner in speed.
library(data.table)
setDT(dat)[, lapply(.SD, sum, na.rm=TRUE), by=Pos]
# Pos M P
#1: Position1 90 101
#2: Position2 101 112
#3: Position3 143 113
If you are using a matrix and do not want to convert it to a data.frame with a new column for the row names (although that option may still be efficient), you can use by:
do.call(rbind, by(m1, list(rownames(m1)), colSums, na.rm=TRUE))
# M P
#Position1 90 101
#Position2 101 112
#Position3 143 113
Or a slightly more efficient method when dealing with matrices
library(reshape2)
acast(melt(m1), Var1~Var2, value.var="value", sum, na.rm=TRUE)
# M P
#Position1 90 101
#Position2 101 112
#Position3 143 113
data
The rownames are added as a column as data.frame won't allow duplicate rownames.
dat <- structure(list(Pos = c("Position1", "Position2", "Position3",
"Position1", "Position3", "Position2"), M = c(34L, 45L, 89L,
56L, 54L, 56L), P = c(56L, 23L, 78L, 45L, 35L, 89L)), .Names = c("Pos",
"M", "P"), class = "data.frame", row.names = c(NA, -6L))
m1 <- structure(c(34, 45, 89, 56, 54, 56, 56, 23, 78, 45, 35, 89), .Dim = c(6L,
2L), .Dimnames = list(c("Position1", "Position2", "Position3",
"Position1", "Position3", "Position2"), c("M", "P")))
One more, just for fun. This one produces the structure you show in the post.
t(sapply(split(dat[-1], dat$Pos), colSums))
# M P
# Position1 90 101
# Position2 101 112
# Position3 143 113
This answer only applies if you are dealing with a matrix (like the "m1" dataset shared in @akrun's answer):
xtabs(Freq ~ Var1 + Var2, data.frame(as.table(m1)))
# Var2
# Var1 M P
# Position1 90 101
# Position2 101 112
# Position3 143 113
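For the matrix case, base R also has rowsum, which adds up rows that share a grouping value. This is not one of the approaches given above, just a sketch using the same m1 (rebuilt here with matrix so the block runs on its own):

```r
m1 <- matrix(c(34, 45, 89, 56, 54, 56,
               56, 23, 78, 45, 35, 89),
             ncol = 2,
             dimnames = list(c("Position1", "Position2", "Position3",
                               "Position1", "Position3", "Position2"),
                             c("M", "P")))

# sum the rows of m1 within each distinct row name
sums <- rowsum(m1, group = rownames(m1))
sums
```

The result is again a matrix, with one row per position and the groups in sorted order.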
You can also use aggregate, as follows:
> ddf
V1 V2 V3
1 Position1 34 56
2 Position2 45 23
3 Position3 89 78
4 Position1 56 45
5 Position3 54 35
6 Position2 56 89
> a1 = aggregate(V2~V1, ddf, sum)
> a2 = aggregate(V3~V1, ddf, sum)
> merge(a1, a2)
V1 V2 V3
1 Position1 90 101
2 Position2 101 112
3 Position3 143 113
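The two aggregate calls and the merge can probably be collapsed into a single call by binding the value columns with cbind on the left-hand side of the formula. A sketch, with ddf rebuilt from the printed table above:

```r
ddf <- data.frame(
  V1 = c("Position1", "Position2", "Position3",
         "Position1", "Position3", "Position2"),
  V2 = c(34, 45, 89, 56, 54, 56),
  V3 = c(56, 23, 78, 45, 35, 89)
)

# cbind() on the left-hand side sums both columns in one pass
res <- aggregate(cbind(V2, V3) ~ V1, data = ddf, FUN = sum)
res
```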
First get your rownames
rows<-unique(rownames(yourDataFrame))
Make sure unique is there or we'll get a lot of duplicates
Then you can do a couple of different things here, the package plyr would come in handy, but just using base R you can use lapply to calculate your sums
result<-lapply(rows, function(rname){
# keep every row with this name; indexing by the name alone returns only the first match
subsetDF<-yourDataFrame[rownames(yourDataFrame)==rname, , drop=FALSE]
apply(subsetDF, 2, sum)
}
)
To break it down, you take all your unique rownames, and in lapply subset your data to just the rows with that rowname. Next, you apply sum over the columns of that subset, and the results are collected in a list. You could then do something like do.call(rbind, result) to get your resulting table, with one row per rowname.
Definitely not the most efficient way to do it, but it's the first thing I thought of
What you want is the aggregate function.
Say you have your table stored as data, with the position labels in the first column, then try
condensedData <- aggregate(data[-1], by=list(position=data[[1]]), FUN=sum, na.rm=TRUE)
If that doesn't do exactly what you want, try experimenting with the aggregate function. The important inputs are by and FUN. by tells aggregate which columns you want the result to be identified uniquely by, while FUN tells aggregate what to do to combine numbers with the same by. FUN can be "sum", "mean", etc...

data.matrix() when character involved

In order to calculate the highest contribution of a row per ID I have a beautiful script which works when the IDs are numeric. Today however I found out that IDs can also contain characters (for instance ABC10101). For the function to work, the dataset is converted to a matrix. However data.matrix(df) does not support characters. Can the code be altered in order for the function to work with all kinds of IDs (character, numeric, etc.)? Currently I wrote a quick workaround which converts IDs to numeric when ID=character, but that will slow the process down for large datasets.
Example with code (function: extract the first entry with the highest contribution, so if 2 entries have the same contribution it selects the first):
Note: in this example ID is interpreted as a factor and data.matrix() converts it to a numeric value. In the code below the type of the ID column should be character and the output should be as shown at the bottom. The order of the IDs must remain the same.
tc <- textConnection('
ID contribution uniqID
ABCUD022221 40 101
ABCUD022221 40 102
ABCUD022222 20 103
ABCUD022222 10 104
ABCUD022222 90 105
ABCUD022223 75 106
ABCUD022223 15 107
ABCUD022223 10 108 ')
df <- read.table(tc,header=TRUE)
#Function that needs to be altered
uniqueMaxContr <- function(m, ID = 1, contribution = 2) {
  t(
    vapply(
      split(1:nrow(m), m[,ID]),
      function(i, x, contribution) x[i, , drop=FALSE][which.max(x[i,contribution]),],
      m[1,], x=m, contribution=contribution
    )
  )
}
df<-data.matrix(df) #only works when ID is numeric
highestdf<-uniqueMaxContr(df)
highestdf<-as.data.frame(highestdf)
In this case the outcome should be:
ID contribution uniqID
ABCUD022221 40 101
ABCUD022222 90 105
ABCUD022223 75 106
Others might be able to make it more concise, but this is my attempt at a data.table solution:
tc <- textConnection('
ID contribution uniqID
ABCUD022221 40 101
ABCUD022221 40 102
ABCUD022222 20 103
ABCUD022222 10 104
ABCUD022222 90 105
ABCUD022223 75 106
ABCUD022223 15 107
ABCUD022223 10 108 ')
df <- read.table(tc,header=TRUE)
library(data.table)
dt <- as.data.table(df)
setkey(dt,uniqID)
dt2 <- dt[,list(contribution=max(contribution)),by=ID]
setkeyv(dt2,c("ID","contribution"))
setkeyv(dt,c("ID","contribution"))
dt[dt2,mult="first"]
## ID contribution uniqID
## [1,] ABCUD022221 40 101
## [2,] ABCUD022222 90 105
## [3,] ABCUD022223 75 106
EDIT -- more concise solution
You can use .SD which is the subset of the data.table for the grouping, and then use which.max to extract a single row.
in one line
dt[,.SD[which.max(contribution)],by=ID]
## ID contribution uniqID
## [1,] ABCUD022221 40 101
## [2,] ABCUD022222 90 105
## [3,] ABCUD022223 75 106
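If data.table is not available, roughly the same result can be had in base R while keeping ID as a character column. This is only a sketch, not the original function: it splits the data.frame by ID in order of first appearance and keeps the first row with the maximum contribution in each group (first_max, groups and highest are illustrative names).

```r
df <- read.table(text = "ID contribution uniqID
ABCUD022221 40 101
ABCUD022221 40 102
ABCUD022222 20 103
ABCUD022222 10 104
ABCUD022222 90 105
ABCUD022223 75 106
ABCUD022223 15 107
ABCUD022223 10 108", header = TRUE, stringsAsFactors = FALSE)

# which.max returns the first maximum, so ties keep the earliest row
first_max <- function(d) d[which.max(d$contribution), , drop = FALSE]

# levels = unique(...) preserves the order in which the IDs appear
groups <- split(df, factor(df$ID, levels = unique(df$ID)))
highest <- do.call(rbind, lapply(groups, first_max))
rownames(highest) <- NULL
highest
```

No conversion to a matrix is involved, so the ID column stays character throughout.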

R : how to Detect Pattern in Matrix By Row

I have a big matrix with 4 columns, containing normalized values (by column, mean ~ 0 and standard deviation = 1)
I would like to see if there is a pattern in the matrix, and if yes I would like to cluster rows by pattern, by pattern I mean values in a given row example
for row N
if value in column 1 < column 2 < column 3 < column 4 then it is let's say a pattern 1
Basically there is 4^4 = 256 possible patterns (in theory)
Is there a way in R to do this ?
Thanks in advance
Rad
Yes. (Although the number of distinct permutations is only 24 = 4*3*2. After one value is chosen, there are only three possible second values, and after the second is specified there are only two more orderings left.) The order function applied to each row should give the desired 1,2,3, 4 permutations:
mtx <- matrix(rnorm(10000), ncol=4)
res <- apply(mtx, 1, function(x) paste( order(x), collapse=".") )
> table(res)
res
1.2.3.4 1.2.4.3 1.3.2.4 1.3.4.2 1.4.2.3 1.4.3.2
     98     112      95     120     114     118
2.1.3.4 2.1.4.3 2.3.1.4 2.3.4.1 2.4.1.3 2.4.3.1
    101     114     105     102     104     122
3.1.2.4 3.1.4.2 3.2.1.4 3.2.4.1 3.4.1.2 3.4.2.1
    105      82     107      90      97      86
4.1.2.3 4.1.3.2 4.2.1.3 4.2.3.1 4.3.1.2 4.3.2.1
     99      93     100     108     118     110
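To go from the pattern labels to actual clusters of rows, the labels can be used as a grouping variable. A small deterministic sketch (a hand-made three-row matrix instead of random data, so the grouping is easy to check; rows_by_pattern is an illustrative name):

```r
mtx <- rbind(c( 0.1, 0.2, 0.3, 0.4),   # strictly increasing
             c( 0.4, 0.3, 0.2, 0.1),   # strictly decreasing
             c(-1.0, 0.0, 1.0, 2.0))   # increasing again

# label each row by the ordering of its columns
pattern <- apply(mtx, 1, function(x) paste(order(x), collapse = "."))

# group the row numbers by their pattern label
rows_by_pattern <- split(seq_len(nrow(mtx)), pattern)
rows_by_pattern

factorial(4)   # number of possible orderings of 4 distinct values
```

Rows 1 and 3 share the label 1.2.3.4 and land in the same group, while row 2 gets 4.3.2.1.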
