R select names from a list using logical vector - r

Suppose I have a named list and I want to get the names of those components that share at least one element with another vector.
This is my list, neighbors:
neighbors[[1]]
[1] "CNBP" "IGF2BP1" "RPL3|OK/SW-cl.32"
[4] "HNRNPC" "PURA|hCG_45299" "RPS3A"
"Cnbp" "Mis12|DN-393H17.5"
neighbors[[2]]
[1] "NIN" "PRKACA" "AURKA|RP5-1167H4.6"
[4] "GSK3B" "AMOT" "UBC"
and my vector of interest
mtop
[1] "TUBA1A" "DNAJB1" "MME"
[4] "PRKCB" "PARK2|KB-152G3.1" "UBC"
For example, I would like to return the name of neighbors[2], which has UBC in common with mtop.
Any ideas?

First off, your data. Your output appears somewhat strange. If this is not what you have, consider using dput to dump these variables in a reproducible way.
mtop <- c("TUBA1A", "DNAJB1", "MME",
"PRKCB", "PARK2|KB-152G3.1", "UBC")
neighbors <- list(c("CNBP", "IGF2BP1", "RPL3|OK/SW-cl.32",
"HNRNPC", "PURA|hCG_45299", "RPS3A",
"Cnbp", "Mis12|DN-393H17.5"),
c("NIN", "PRKACA", "AURKA|RP5-1167H4.6",
"GSK3B", "AMOT", "UBC"))
To select those elements of list neighbors which have at least one vector element in common with mtop, you can use this command:
matching <- sapply(neighbors, function(l) length(intersect(mtop, l)) > 0)
print(neighbors[matching])
This will print neighbors[2], as it has "UBC" in common with mtop. It does this via the logical vector matching, which seems to be what your question asked for.
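Since the question asks for names, here is a minimal sketch of the same idea on a named list. The names "first" and "second" and the shortened vectors are hypothetical, chosen only for illustration, and vapply is swapped in for sapply so the result is guaranteed to be a logical vector.

```r
mtop <- c("TUBA1A", "DNAJB1", "MME", "PRKCB", "PARK2|KB-152G3.1", "UBC")

# Hypothetical names and shortened contents, for illustration only
neighbors <- list(
  first  = c("CNBP", "IGF2BP1", "RPS3A"),
  second = c("NIN", "PRKACA", "UBC")
)

# One TRUE/FALSE per list element: does it share anything with mtop?
matching <- vapply(neighbors,
                   function(l) length(intersect(mtop, l)) > 0,
                   logical(1))

# Use the logical vector to pick names rather than elements
hit_names <- names(neighbors)[matching]
print(hit_names)
```

If the list has no names, names(neighbors) is NULL, and which(matching) gives the positions instead.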
If you want to take position into account, i.e. only select neighbors[2] because "UBC" is in position 6 in both vectors, then you should use this command:
matching <- sapply(neighbors, function(l) any(l == mtop))
However, this will create a warning, as neighbors[[1]] is longer than mtop.
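One way to avoid that warning, assuming you only care about positions that exist in both vectors, is to compare over the common length. The helper same_position below is a hypothetical name, not part of base R.

```r
mtop <- c("TUBA1A", "DNAJB1", "MME",
          "PRKCB", "PARK2|KB-152G3.1", "UBC")
neighbors <- list(c("CNBP", "IGF2BP1", "RPL3|OK/SW-cl.32",
                    "HNRNPC", "PURA|hCG_45299", "RPS3A",
                    "Cnbp", "Mis12|DN-393H17.5"),
                  c("NIN", "PRKACA", "AURKA|RP5-1167H4.6",
                    "GSK3B", "AMOT", "UBC"))

# Hypothetical helper: TRUE if x and y agree at any shared position
same_position <- function(x, y) {
  n <- min(length(x), length(y))      # ignore positions only one vector has
  any(x[seq_len(n)] == y[seq_len(n)])
}

matching <- vapply(neighbors, same_position, logical(1), y = mtop)
print(matching)
```

Truncating both vectors first means no recycling takes place, so nothing is compared against a wrapped-around element.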
If you want the names common to both your data structures, you can use this code:
intersect(unlist(neighbors), mtop)
If you need something else, you have to be more specific in your question, i.e. give an explicit example of what the output should look like, and cover all the possible input configurations that might lead to structurally different output.

How about:
l <- lapply(neighbors, function(x) x[x %in% mtop])
This returns a list in which each element holds only the entries that also occur in the vector mtop.
Now select only those elements which have non-zero length:
names(l)[sapply(l, length) > 0]
You can combine these into one line:
names(neighbors)[sapply(neighbors, function(x) any(mtop %in% x))]

Related

Comparing character lists in R

I have two lists of characters that I read in from Excel files.
One is a very long list of all bird species that have been documented in a region (allBirds) and another is a list of species that were recently seen in a sample location (sampleBirds), which is much shorter. I want to write a section of code that will compare the lists and tell me which sampleBirds show up in the allBirds list. Both are lists of characters.
I have tried:
# load xlsx files
Full_table <- read_excel("Full_table.xlsx")
Pathogen_table <- read_excel("pathogens.xlsx")
# read species column into a new dataframe
species <-c(as.data.frame(Full_table[,7], drop=FALSE))
pathogens <- c(as.data.frame(Pathogen_table[,3], drop=FALSE))
intersect(pathogens, species)
intersect(species, pathogens)
but intersect is outputting lists of 0, which I know cannot be true, any suggestions?
Maybe you can try match() function or "==".
You need to run the intersect on the individual columns that are stored in the list:
> a <- c(data.frame(c1=as.factor(c('a', 'q'))))
> b <- c(data.frame(c1=as.factor(c('w', 'a'))))
> intersect(a,b)
list()
> intersect(a$c1,b$c1)
[1] "a"
This will probably do in your case (read_excel returns a tibble, so use [[ to extract the column as a plain vector):
intersect(Full_table[[7]], Pathogen_table[[3]])
Or, if you insist on keeping the intermediate objects:
intersect(pathogens[[1]], species[[1]])
where [[1]] selects the first (and only) column stored in each list. Note that by using c(as.data.frame(... you are converting the data.frame to a regular list. I'd go with only as.data.frame(....
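As a self-contained illustration of comparing the columns themselves rather than the objects that hold them, here is a small sketch. The data frames and the column name species are made up, and %in% is shown alongside intersect because it answers "which sampleBirds show up in allBirds" directly.

```r
# Made-up data standing in for the two Excel sheets
allBirds <- data.frame(species = c("robin", "wren", "heron", "kestrel"),
                       stringsAsFactors = FALSE)
sampleBirds <- data.frame(species = c("wren", "kestrel", "dodo"),
                          stringsAsFactors = FALSE)

# [[ extracts the column as a character vector
common <- intersect(sampleBirds[["species"]], allBirds[["species"]])

# Logical vector aligned with sampleBirds: TRUE where the species is known
seen <- sampleBirds[["species"]] %in% allBirds[["species"]]

print(common)
print(sampleBirds[seen, , drop = FALSE])
```

The %in% form keeps the result lined up with the rows of sampleBirds, which is convenient if other columns need to come along.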

Binding data to a dataframe in each 'for' run under different columns each time to compute average of each column finally

I am trying to do 10-fold-cross-validation in R. In each for run a new row with several columns will be generated, each column will have an appropriate name, I want the results of each 'for' to go under the appropriate column, so that at end I will be able to compute the average value for each column. In each 'for' run results that are generated belong to different columns than the previous for, therefore the names of the columns should also be checked. Is it possible to do it anyway? Or maybe it would be better to just compute the averages for the columns on the spot?
for(i in seq(from=1, to=8200, by=820)){
fold <- df_vector[i:i+819,]
y_fold_vector <- df_vector[!(rownames(df_vector) %in% rownames(folding)),]
alpha_coefficient <- solve(K_training, y_fold_vector)
test_points <- df_matrix[rownames(df_matrix) %in% rownames(K_training), colnames(df_matrix) %in% rownames(folding)]
predictions <- rbind(predictions, crossprod(alpha_coefficient,test_points))
}
You are having a problem with operator precedence in R. The line should be:
fold <- df_vector[ i:(i+819), ]
Consider:
> i=1
> i:i+189
[1] 190
Lack of a simple example (or any comments on what your code is supposed to be doing) prevents any testing of the rest of the code, but you can find the precedence of operators at ?Syntax. Unary minus is higher, but binary "+" is lower than ":".
(It's also unclear what the folding vector is supposed to be. You only defined a fold value and it wasn't a vector since you addressed it as you would a dataframe.)
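To see the precedence issue in isolation, here is a small sketch that builds the row indices for each fold without touching any data. The fold size of 820 and the total of 8200 rows come from the loop in the question; the names starts and fold_rows are illustrative.

```r
i <- 1
wrong <- i:i + 819     # parsed as (i:i) + 819, a single number
right <- i:(i + 819)   # the intended run of 820 consecutive indices

# Row indices for all ten folds
starts <- seq(from = 1, to = 8200, by = 820)
fold_rows <- lapply(starts, function(s) s:(s + 819))

print(wrong)
print(length(right))
print(range(fold_rows[[10]]))
```

Each element of fold_rows can then be used as df_vector[fold_rows[[k]], ] inside the loop.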

How to subset a list based on the length of its elements in R

In R I have a function (ip2coordinates from the package RDSTK) which looks up 11 fields of data for each IP address you supply.
I have a list of IP's called ip.addresses:
> head(ip.addresses)
[1] "128.177.90.11" "71.179.12.143" "66.31.55.111" "98.204.243.187" "67.231.207.9" "67.61.248.12"
Note: Those or any other IP's can be used to reproduce this problem.
So I apply the function to that object with sapply:
ips.info <- sapply(ip.addresses, ip2coordinates)
and get a list called ips.info as my result. This is all good and fine, but I can't do much more with a list, so I need to convert it to a dataframe. The problem is that not all IP addresses are in the databases thus some list elements only have 1 field and I get this error:
> ips.df <- as.data.frame(ips.info)
Error in data.frame(`128.177.90.10` = list(ip.address = "128.177.90.10", :
arguments imply differing number of rows: 1, 0
My question is -- "How do I remove the elements with missing/incomplete data or otherwise convert this list into a data frame with 11 columns and 1 row per IP address?"
I have tried several things.
First, I tried to write a loop that removes elements with less than a length of 11
for (i in 1:length(ips.info)){
if (length(ips.info[i]) < 11){
ips.info[i] <- NULL}}
This leaves some records with no data and makes others say "NULL", but even those with "NULL" are not detected by is.null
Next, I tried the same thing with double square brackets and get
Error in ips.info[[i]] : subscript out of bounds
I also tried complete.cases() to see if it could potentially be useful
Error in complete.cases(ips.info) : not all arguments have the same length
Finally, I tried a variation of my for loop which was conditioned on length(ips.info[[i]] == 11 and wrote complete records to another object, but somehow it results in an exact copy of ips.info
Here's one way you can accomplish this using the built-in Filter function
#input data
library(RDSTK)
ip.addresses<-c("128.177.90.10","71.179.13.143","66.31.55.111","98.204.243.188",
"67.231.207.8","67.61.248.15")
ips.info <- sapply(ip.addresses, ip2coordinates)
#data.frame creation
lengthIs <- function(n) function(x) length(x)==n
do.call(rbind, Filter(lengthIs(11), ips.info))
or if you prefer not to use a helper function
do.call(rbind, Filter(function(x) length(x)==11, ips.info))
Alternative solution based on base package.
# find non-complete elements
ids.to.remove <- sapply(ips.info, function(i) length(i) < 11)
# remove found elements
ips.info <- ips.info[!ids.to.remove]
# create data.frame
df <- do.call(rbind, ips.info)
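Because ip2coordinates needs a network lookup, here is an offline sketch of the same filter-then-bind pattern on mock data. The addresses, the field names, and the choice of 3 fields instead of 11 are all assumptions made for illustration. Unlike the answers above, each element is converted with as.data.frame before binding, so the result is a regular data frame rather than a matrix of list cells.

```r
# Mock lookup results: the second address was "not found" and has only one field
ips.info <- list(
  "10.0.0.1" = list(ip.address = "10.0.0.1", latitude = 1.5, longitude = 2.5),
  "10.0.0.2" = list(ip.address = "10.0.0.2"),
  "10.0.0.3" = list(ip.address = "10.0.0.3", latitude = 3.5, longitude = 4.5)
)
n_fields <- 3  # would be 11 for the real lookup

# Keep only the complete records
complete <- Filter(function(x) length(x) == n_fields, ips.info)

# Turn each record into a one-row data frame, then stack the rows
ips.df <- do.call(rbind,
                  lapply(complete, as.data.frame, stringsAsFactors = FALSE))
print(ips.df)
```

The loop in the question fails for a related reason: ips.info[i] is always a list of length 1, so the test has to use ips.info[[i]], and deleting elements while iterating shifts the remaining indices.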

R colon operator on list of matrices

I've created a list of matrices in R. In all matrices in the list, I'd like to "pull out" the collection of matrix elements of a particular index. I was thinking that the colon operator might allow me to implement this in one line. For example, here's an attempt to access the [1,1] elements of all matrices in a list:
myList = list() #list of matrices
myList[[1]] = matrix(1:9, nrow=3, ncol=3, byrow=TRUE) #arbitrary data
myList[[2]] = matrix(2:10, nrow=3, ncol=3, byrow=TRUE)
#I expected the following line to output myList[[1]][1,1], myList[[2]][1,1]
slice = myList[[1:2]][1,1] #prints error: "incorrect number of dimensions"
The final line of the above code throws the error "incorrect number of dimensions."
For reference, here's a working (but less elegant) implementation of what I'm trying to do:
#assume myList has already been created (see the code snippet above)
slice = c()
for(x in 1:2) {
slice = c(slice, myList[[x]][1,1])
}
#this works. slice = [1 2]
Does anyone know how to do the above operation in one line?
Note that my "list of matrices" could be replaced with something else. If someone can suggest an alternative "collection of matrices" data structure that allows me to perform the above operation, then this will be solved.
Perhaps this question is silly...I really would like to have a clean one-line implementation though.
Two things. First, the difference between [ and [[. The relevant sentence from ?'[':
The most important distinction between [, [[ and $ is that the [ can
select more than one element whereas the other two select a single
element.
So you probably want to do myList[1:2]. Second, you can't combine subsetting operations in the way you describe. Once you do myList[1:2] you will get a list of two matrices. A list typically has only one dimension, so doing myList[1:2][1,1] is nonsensical in your case. (See comments for exceptions.)
You might try lapply instead: lapply(myList,'[',1,1).
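The following sketch spells out what each form of indexing returns for the list in the question. The variable names sub, rec and slice are illustrative, and vapply is used as a stricter stand-in for lapply that returns a plain vector.

```r
myList <- list(matrix(1:9,  nrow = 3, ncol = 3, byrow = TRUE),
               matrix(2:10, nrow = 3, ncol = 3, byrow = TRUE))

sub <- myList[1:2]     # single bracket: still a list, here with two matrices
rec <- myList[[1:2]]   # double bracket with a vector indexes recursively:
                       # the same as myList[[1]][[2]], a single number

# Apply the [1, 1] subscript to each matrix in turn
slice <- vapply(myList, function(m) m[1, 1], integer(1))

print(class(sub))
print(rec)
print(slice)
```

This is why myList[[1:2]][1,1] complains about dimensions: by the time [1,1] is applied, only a single number is left.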
If your matrices will all have same dimension, you could store them in a 3-dimensional array. That would certainly make indexing and extracting elements easier ...
## One way to get your data into an array
a <- array(c(myList[[1]], myList[[2]]), dim=c(3,3,2))
## Extract the slice containing the upper left element of each matrix
a[1,1,]
# [1] 1 2
This works:
> sapply(myList,"[",1,1)
[1] 1 2
edit: oh, sorry, I see almost the same idea toward the end of an earlier answer. But sapply probably comes closer to what you want, anyway

how do I get the difference between two R named lists?

OK, I've got two named lists, one is "expected" and one is "observed". They may be complex in structure, with arbitrary data types. I want to get a new list containing just those elements of the observed list that are different from what's in the expected list. Here's an example:
Lexp <- list(a=1, b="two", c=list(3, "four"))
Lobs <- list(a=1, c=list(3, "four"), b="ni")
Lwant <- list(b="ni")
Lwant is what I want the result to be. I tried this:
> setdiff(Lobs, Lexp)
[[1]]
[1] "ni"
Nope, that loses the name, and I don't think setdiff pays attention to the names. Order clearly doesn't matter here, and I don't want a=1 to match with b=1.
Not sure what a good approach is... Something that loops over a list of names(Lobs)? Sounds clumsy and non-R-like, although workable... Got any elegant ideas?
At least in this case
Lobs[!(Lobs %in% Lexp)]
gives you what you want.
OK, I found one slightly obtuse answer, using the plyr package:
> Lobs[laply(names(Lobs), function(x) !identical(Lobs[[x]], Lexp[[x]]))]
$b
[1] "ni"
So, it takes the names of the observed list, uses double-bracket indexing and the identical() function to compare the elements by name, then uses the logical vector that results from laply() to index into the original observed list.
Anyone got a better/cleaner/sexier/faster way?
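For comparison, here is a sketch of the same name-based comparison in base R, with vapply standing in for plyr's laply. The names differs and Ldiff are illustrative. It assumes every element of Lobs is named, and an element whose name is missing from Lexp counts as different because Lexp[[n]] is then NULL.

```r
Lexp <- list(a = 1, b = "two", c = list(3, "four"))
Lobs <- list(a = 1, c = list(3, "four"), b = "ni")

# TRUE for every name whose observed value is not identical to the expected one
differs <- vapply(names(Lobs),
                  function(n) !identical(Lobs[[n]], Lexp[[n]]),
                  logical(1))

Ldiff <- Lobs[differs]
print(Ldiff)
```

Because elements are looked up by name, the different ordering of b and c in the two lists does not matter.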
