Pre-Processing / Formatting Data - r

I have two vectors in R:
list1 <- c("ABCDEF", "FEDCBA", "AA-BB-CCCC", "ABCDEFGH-IJK", "ZZZZ")
list2 <- c("ABCDEF", "FEDCBA:XA",
"AA-BB-CCCC-01","AA-BB-CCCC-21:ABC", "ABCDEFGH-IJK-1X",
"AKDWXFE-XXY")
I'd like to compare the two lists -- with list1 being the 'correct' list. If an item in list1 does not appear in list2, then print out 'Add [item in list1]'; if an item in list2 is not in list1, then print out 'delete [item in list2]'. I would like to find partial matches. For example, list1 has 'FEDCBA' and list2 has 'FEDCBA:XA' -- this would be an acceptable partial match. The same goes for list2 having AA-BB-CCCC-21:ABC while list1 has AA-BB-CCCC (this is also an acceptable partial match).

It looks like homework to me, but OK, let us make it a teaching moment.
First, let us find out which elements of list1 have matches in list2. We will use grepl for that; given one pattern, it returns a logical vector with one TRUE/FALSE value for each element of list2.
library(tidyverse)
list1_has_match <- map_lgl(list1, ~ any(grepl(., list2)))
msg <- sprintf("Add [%s]", list1[ !list1_has_match ])
In the above code, I use map_lgl to run the any(grepl(...)) expression for each element of list1 and return a logical vector. Any element that has a FALSE value in that vector is not present in list2 and should be added.
Next, we do the same the other way around. However, we still have to use the elements of list1 as patterns, which is why this step gets a bit more complicated. In each call within map_dfr, we generate a named logical vector corresponding to one element of list1. Since we use map_dfr, each of these vectors becomes a row in a data frame, so the columns of the result correspond to the elements of list2. A column with no TRUE in it belongs to an element of list2 that matches nothing in list1.
map1 <- map_dfr(list1, ~ set_names(grepl(., list2), list2))
list2_has_match <- map_lgl(map1, any)
msg <- c(msg,
sprintf("delete [%s]", list2[ !list2_has_match ]))
And now print the messages:
cat(msg, sep="\n")

Related

How to eliminate an element (data frame) from a list in R if it contains a specific word

I have a dataframe. I split this dataframe into subframes of 6 rows each, stored in a list.
If the word "#ERROR" appears anywhere in one of those subframes, I want to delete that whole subframe (even if only one row contains the word) and get back a list with fewer data frames. Then I am going to convert the list back into a data frame. My problem is that I have tried different code and cannot figure out how to remove a subframe containing a specific word from the list.
I tried the following:
a<-dataset
View(a)
my.list<-split(a, rep(1:119, each = 6))
z=lapply(1:length(my.list), function(i) my.list[[i]] != "#ERROR")
but what I get is 119 elements of TRUE/FALSE values. I want to eliminate the subframes that contain a FALSE. Can anyone please help?
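Try using sapply, as it will return a vector instead of a list like lapply does. Wrapping the comparison in all() gives one TRUE/FALSE per subframe, which can be used to index the list: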
new.list <- my.list[sapply(1:length(my.list), function(i)
all(my.list[[i]] != "#ERROR"))]
Or, a bit simpler, with Filter:
new.list <- Filter(function(x) all(x != "#ERROR"), my.list)

Converting specific parts of lists to a dataframe

I have a large list of 2 elements containing lists of species containing lists of 25 vectors, resembling a set like this:
l1 <- list(time=runif(100), space=runif(100))
l2 <- list(time=runif(100), space=runif(100))
list1 <- list(test1=list(species1=l1, species2=l2),test2=list(species1=l1, species2=l2))
I think it's essentially a list of lists of lists of vectors.
I want to create a data.frame from all space-vectors of all 'species' in just one of the two sublists:
final <- as.data.frame(cbind(unlist(list1[[2]]$species1$space), unlist(list1[[2]]$species2$space)))
names(final) <- names(list1[[2]])
Essentially, I need a loop/apply command that navigates through every species in list1[[2]] and picks all vectors called space.
Thank you very much!
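We can use nested apply calls (lapply over the sublists, sapply over the species) to extract the 'space' elements:
data.frame(lapply(list1, function(x)
sapply(x, "[", 'space')))
This returns the space columns of both sublists. To restrict it to one sublist, as in the question, extract from list1[[2]] only:
as.data.frame(lapply(list1[[2]], `[[`, "space"))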

Subsetting elements in list a based on elements list b

I have 2 lists:
c1 <- c("e","f","g","h")
c2 <- c("j","k","l","m")
list1 <- list(c1,c2)
i1 <- c(1,3)
i2 <- c(2,4)
list2 <- list(i1,i2)
I would like to subset the character vectors in list1 based on the integer vectors in list2. This way I would end up with a new list (list3) containing c1 (with only e and g) and c2 (with only k and m). I'm currently looking into the possibilities of plyr, so the solution should preferably use plyr. I tried something similar to this, but to no avail:
list3 <- llply(list1, function (x) x[list2])
You could try base R with Map, which is more compact than the llply version. Basically, you have two lists with the same number of elements and want to subset each element of the first list ("list1") using the index vector in the corresponding element of the second ("list2"). Map pairs up the corresponding elements of "list1" and "list2" and subsets using [:
Map(`[`, list1, list2)
which is the same as
Map(function(x,y) x[y], list1, list2)
Or using llply from plyr (you don't really need llply, since this could be achieved with lapply itself). The key is to pair up the corresponding elements of both lists. One way to link them, when they have the same number of elements, is seq_along, which gives the sequence of positions in one list (here 1:2); use that index to pick the corresponding elements of both lists and then subset:
llply(seq_along(list1), function(i) list1[[i]][list2[[i]]])

How to check if a nested list of strings contain the same elements

I have nested lists of strings:
mylist1 <- list(
list(c("banana"),c("banana","tomato"))
, list(c("", "nut"), c("nut", "orange"))
)
mylist2 <- list(
list(c("orange","nut"), c("nut", ""))
, list(c("tomato","banana"),c("banana"))
)
mylist3 <- list(
list(c("orange","nut"), c("nut"))
, list(c("tomato","banana"),c("banana"))
)
Note: In the above example mylist1 and mylist2 would be equal. But mylist3 is different from mylist1 and mylist2, as the sublist with the empty string and "nut", c("nut", ""), is missing.
The order of the elements in the lists is not important. I want a function that compares two such lists and returns a boolean indicating whether they are equal when disregarding the order of elements.
Essentially my nested lists of type string represent mathematical sets. I want to compare two such nested lists, but as they represent sets the order is not important. I want to get a boolean (true/false) value back.
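If you don't have duplicated items in your lists, you can treat them as sets and use set operations. Keep in mind that setequal only ignores the order of the top-level elements, so everything nested below that level must already be in a consistent order (for example, sorted) before comparing.
So it would be something like:
(set1 <- c(mylist1, NA))
(set2 <- c(mylist2, NA))
setequal(set1, set2)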
You can always compare each element of the first list to every element of the second, but that takes a long time and can become impractical for large inputs: it is a nested loop, so O(n^2).
The other option that comes to mind is to sort both lists and then compare them. Since both are sorted, you only have to check each element of list 1 against the element at the same position in list 2.
The second option is theoretically faster. Assuming sorting a list takes O(n log n), the total is 2 * n log(n) + n (the n being the number of element comparisons between the two sorted lists), which is O(n log n).
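The sort-then-compare option does not depend on R. As a rough sketch, here it is in Python, assuming the nested lists contain only strings and lists; canonical and same_nested_sets are hypothetical helper names chosen for illustration, not library functions.

```python
def canonical(item):
    """Recursively sort a nested list so that element order no longer matters."""
    if isinstance(item, str):
        return item
    # Canonicalise the children first, then sort them by their repr so that
    # strings and lists can be ordered together without a TypeError.
    return sorted((canonical(child) for child in item), key=repr)


def same_nested_sets(a, b):
    """True if a and b hold the same elements at every level, ignoring order."""
    return canonical(a) == canonical(b)


# The three example lists from the question, written as Python lists.
mylist1 = [[["banana"], ["banana", "tomato"]], [["", "nut"], ["nut", "orange"]]]
mylist2 = [[["orange", "nut"], ["nut", ""]], [["tomato", "banana"], ["banana"]]]
mylist3 = [[["orange", "nut"], ["nut"]], [["tomato", "banana"], ["banana"]]]

print(same_nested_sets(mylist1, mylist2))
print(same_nested_sets(mylist1, mylist3))
```

Because each level is sorted rather than deduplicated, repeated elements are counted, so this compares multisets; that matches the "no duplicated items" assumption above.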

For a unique item in list1, get the corresponding maximum value in list2 in python

I have two lists of elements (list1 and list2) that I generated after parsing a two-column file. list1 contains items that are repeated a varying number of times (e.g. a,a,a,b,b,c,c,c,c,d,d) and list2 contains their corresponding values, which are either repeated, like in list1, or unique.
What I want to do is, for each distinct item in list1, get the maximum corresponding number. I am thinking of doing it in Python by initializing a dictionary and, using a condition, populating it with the unique items from list1 as keys and the corresponding maximum value from list2 as values.
I would appreciate any help.
Thanks
You can combine those two lists into one list of pairs using zip:
# You probably want the values in list2 to be ints
# (wrapped in list() because map returns a one-shot iterator in Python 3)
list2 = list(map(int, list2))
# Combines each item in list1 with the corresponding one in list2
pairs = zip(list1, list2)
Then to create a dictionary of the max values, you can just go through those pairs:
max_values = {}
for key, value in pairs:
    current_value = max_values.get(key)  # None if the key isn't there.
    if current_value is None or current_value < value:
        max_values[key] = value
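Putting the two steps together, a minimal self-contained sketch might look like this. The sample values in list1 and list2 are made up for illustration (the question does not show the real file contents), and max_per_key is a hypothetical helper name.

```python
def max_per_key(keys, values):
    """Return a dict mapping each distinct key to its largest paired value."""
    max_values = {}
    for key, value in zip(keys, values):
        current = max_values.get(key)  # None if the key has not been seen yet
        if current is None or value > current:
            max_values[key] = value
    return max_values


# Hypothetical sample data standing in for the two parsed columns.
list1 = ["a", "a", "a", "b", "b", "c"]
list2 = ["3", "7", "5", "2", "2", "9"]  # parsed values often arrive as strings

# Convert to int first so the comparison is numeric, not alphabetical.
result = max_per_key(list1, [int(v) for v in list2])
print(result)
```

Comparing against None explicitly, rather than using a default such as 0, keeps the sketch correct when all values for a key are negative.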
