Comparing specific columns from embedded data frames within separate lists - r

Suppose I have two lists with the following embedded data frames:
# Data frames to intersect
US <- data.frame("Group" = c(1,2,3), "Age" = c(21,20,17), "Name" = c("John","Dora","Helen"))
CA <- data.frame("Group" = c(2,3,4), "Age" = c(21,20,19), "Name" = c("John","Dora","Dan"))
JP <- data.frame("Group" = c(4,5,6), "Age" = c(16,15,14), "Name" = c("Mac","Hector","Jack"))
# Lists to compare----
list1<-list(US,CA,JP)
names(list1)<-c("US","CA","JP")
# List 2 can serve as a "reference list," a duplicate of the first.
list2<-list(US,CA,JP)
names(list2)<-c("US","CA","JP")
I have a second list that serves as a "reference list" for the first. It is a copy and is only meant to be used as a reference in some operation, such as a for loop. What I want to do is intersect the values from only the first column (e.g. Group) and store the intersected output in separate data frames or matrices. I do not want to intersect data frames that have the same name (i.e. the List 1 US groups should not be intersected with the List 2 US groups).
Ideally, the result would be a final list of data frames, one for each possible pairing, named after the pair it came from. The final output would look something like this:
print(comb_list)
$US_CA
Group
1 2
2 3
$US_JP
data frame with 0 columns and 0 rows
$CA_JP
Group
1 4
Would it be possible to create this as a for-loop?

Sure, that looks doable with a nested for loop. There's no need to copy the initial list; the loop can iterate over the same list twice. I'd suggest using dplyr for its handy filter and select functions.
library(dplyr)
comb_list <- list()
for (i in 1:length(list1)) {
  for (j in 1:length(list1)) {
    # don't intersect a country with itself
    if (names(list1)[i] != names(list1)[j]) {
      value <- filter(list1[[i]], Group %in% list1[[j]]$Group)
      value <- select(value, Group)
      name <- paste0(names(list1)[i], "_", names(list1)[j])
      name_alt <- paste0(names(list1)[j], "_", names(list1)[i])
      # don't store equivalent country intersections, i.e. US_CA and CA_US
      if (!name %in% names(comb_list) && !name_alt %in% names(comb_list)) {
        comb_list[[name]] <- value
      }
    }
  }
}
print(comb_list)
$US_CA
Group
1 2
2 3
$US_JP
[1] Group
<0 rows> (or 0-length row.names)
$CA_JP
Group
1 4
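The same pairing can also be sketched in base R without a nested loop, using combn() to enumerate each unordered pair of names exactly once. This is an alternative sketch rather than the method above; note that a pair with no shared groups yields an empty one-column data frame.

```r
US <- data.frame(Group = c(1, 2, 3), Age = c(21, 20, 17), Name = c("John", "Dora", "Helen"))
CA <- data.frame(Group = c(2, 3, 4), Age = c(21, 20, 19), Name = c("John", "Dora", "Dan"))
JP <- data.frame(Group = c(4, 5, 6), Age = c(16, 15, 14), Name = c("Mac", "Hector", "Jack"))
list1 <- list(US = US, CA = CA, JP = JP)

# every unordered pair of list names, one pair per column
pairs <- combn(names(list1), 2)

# intersect the Group columns of each pair
comb_list <- lapply(seq_len(ncol(pairs)), function(k) {
  a <- list1[[pairs[1, k]]]$Group
  b <- list1[[pairs[2, k]]]$Group
  data.frame(Group = intersect(a, b))
})
names(comb_list) <- paste(pairs[1, ], pairs[2, ], sep = "_")
```

Because combn() never pairs a name with itself and never emits both US_CA and CA_US, the two guard conditions from the loop are not needed here.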

Related

How do I convert my code that transforms nested lists to a dataframe into a function?

I have 18 lists, one for each condition. Inside each condition's list there are 10 lists, one for each participant. Within each participant's list there is a list of anywhere between 1 and 20 values of type double. To clarify, here is code that reproduces the list for one condition; remember that I have 18 of these, all slightly different.
Participant_List <- list()
for (i in 1:10) {
Scores <- list()
for (k in sample(1:5, replace = TRUE)) {
Scores[[k]] <- sample(1:7, sample(1:10), replace = TRUE)
}
Participant_List[[i]] <- Scores
}
Now with some help, I got code to transform the list of one condition into a data frame in a long format:
#convert each participant's list to a data frame
x_dataframes <- lapply(seq_along(Participant_List), function(curParticipant){
return(data.frame(Participant = curParticipant,
Score = unlist(Participant_List[[curParticipant]])))
})
#combine the list of dataframes into one dataframe
x_combined <- do.call("rbind", x_dataframes)
I would like to create a function containing this code to be able to simply apply this to the other conditions. I came up with the following, where I first create a list containing the conditions I have, called Hypo1_lists and then I feed this into the function below:
function(Hypo1_lists){
#convert each participant's list to a data frame
x_dataframe <- lapply(seq_along(Hypo1_lists), function(curParticipant){
return(data.frame(Participant = curParticipant,
Score = unlist(Hypo1_lists[[curParticipant]])))
#combine the list of dataframes into one dataframe
Hypo1_lists <- do.call("rbind", x_dataframe)
})
}
But this outputs one nested list...I want to store the outputs in separate data frames (one for each condition), the same I get from the code before I put it into a function.
You mistakenly put the binding step inside the function passed to lapply, where it sits after return() and is never reached. Move it outside the lapply call:
myf <- function(Hypo1_lists){
  # convert each participant's list to a data frame
  x_dataframe <- lapply(seq_along(Hypo1_lists), function(curParticipant){
    return(data.frame(Participant = curParticipant,
                      Score = unlist(Hypo1_lists[[curParticipant]])))
  })
  # combine the list of data frames into one data frame
  Hypo1_lists <- do.call("rbind", x_dataframe)
  return(Hypo1_lists)
}
myf(Participant_List)
Participant Score
1 1 2
2 1 1
3 1 6
4 1 3
Also, don't forget to return something from your main function.
To apply this function to a nested list:
full <- list(Participant_List, Participant_List)
names(full) <- c("firstname", "secondname")
full_result <- lapply(full, myf)
summary(full_result)
Length Class Mode
firstname 2 data.frame list
secondname 2 data.frame list
To retrieve your second result for example, just use full_result[[2]] which is of class data.frame
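If a single long data frame across all conditions is ever more convenient than a list, the per-condition results can be tagged and stacked. The following is a minimal sketch under assumptions: cond_a and cond_b are small made-up conditions, and the Condition column name is hypothetical.

```r
# the function from above, repeated so the sketch is self-contained
myf <- function(participants) {
  per_participant <- lapply(seq_along(participants), function(i) {
    data.frame(Participant = i, Score = unlist(participants[[i]]))
  })
  do.call("rbind", per_participant)
}

# two small made-up conditions (hypothetical data)
cond_a <- list(list(c(1, 2), 3), list(4))
cond_b <- list(list(5), list(c(6, 7)))
full <- list(cond_a = cond_a, cond_b = cond_b)

# one data frame per condition
full_result <- lapply(full, myf)

# optionally tag each data frame with its condition name and stack them
tagged <- Map(function(d, nm) cbind(Condition = nm, d),
              full_result, names(full_result))
all_conditions <- do.call("rbind", tagged)
rownames(all_conditions) <- NULL
```

The list full_result keeps one data frame per condition, as asked; all_conditions is only an optional extra view of the same data.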

How to loop through multiple row index ranges to create separate dataframes for each row index range - R

I'm basically taking a non-normalized dataset and converting it into one that I can load into a SQL Server table. Using the example code below, is there a more efficient way to do this without having to explicitly list the row indices of "ASRN1" or the data frames I want to spread, merge and bind? I have hundreds of datasets to loop through, and some might have 3 sets of ASRN1, SERVICE and OCR, while others may have 30.
Columns<-c("SERIVCE ORDER", "SERVICE ORDER DATE", "ASRN1", "SERVICE","OCR","ASRN1","SERVICE","OCR", "ASRN1", "SERVICE", "OCR", "COMMENTS")
Values<-c("peanuts", "06/09/2020","1111", "abcd","xxxx", "2222", "efgh", "yyyy", "3333", "ijkl", "zzzz", "zippitydoda" )
df <- data.frame(Columns, Values)
a = which(df$Columns == "ASRN1",arr.ind=FALSE, useNames = TRUE)[1]
b = which(df$Columns == "ASRN1",arr.ind=FALSE, useNames = TRUE)[2]
c = which(df$Columns == "ASRN1",arr.ind=FALSE, useNames = TRUE)[3]
dfa<-spread(unique(df[0:(a-1),]),Columns,Values)
dfb<-spread(df[a:(b-1),],Columns, Values)
dfc<-spread(df[b:(c-1),],Columns,Values)
dfe<-spread(tail(df,-c+1),Columns,Values)
dff<-merge(dfa,dfb)
dfg<-merge(dfa,dfc)
dfh<-merge(dfa,dfe)
dfj<-dplyr::bind_rows(dff, dfg,dfh)
Consider using by to split the data frame on the Columns values, then build a list of single-column data frames and combine them with cbind at the end. This assumes that every repeated label is repeated the same number of times and that all other labels appear once.
# BUILD LIST OF VECTORS
vec_list <- by(df, df$Columns, function(sub) {
  # RENAME COLUMNS
  tmp <- setNames(sub, c("Columns", as.character(sub$Columns[1])))
  # REMOVE FIRST COLUMN
  tmp <- transform(tmp, Columns = NULL)
})
# CBIND ALL DF ELEMENTS
final_df <- do.call(cbind.data.frame, vec_list)
final_df
# ASRN1 COMMENTS OCR SERIVCE ORDER SERVICE SERVICE ORDER DATE
# 1 1111 zippitydoda xxxx peanuts abcd 06/09/2020
# 2 2222 zippitydoda yyyy peanuts efgh 06/09/2020
# 3 3333 zippitydoda zzzz peanuts ijkl 06/09/2020
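Another way to avoid listing the row indices by hand is to number the repeating blocks with cumsum(). The sketch below is an alternative under assumptions, not the approach above: it assumes every block starts with "ASRN1" and that every label occurring more than once belongs to a block.

```r
Columns <- c("SERIVCE ORDER", "SERVICE ORDER DATE", "ASRN1", "SERVICE", "OCR",
             "ASRN1", "SERVICE", "OCR", "ASRN1", "SERVICE", "OCR", "COMMENTS")
Values <- c("peanuts", "06/09/2020", "1111", "abcd", "xxxx", "2222", "efgh",
            "yyyy", "3333", "ijkl", "zzzz", "zippitydoda")
df <- data.frame(Columns, Values, stringsAsFactors = FALSE)

# labels that occur more than once form the repeating blocks
repeated <- df$Columns %in% df$Columns[duplicated(df$Columns)]

# labels that occur once become constant columns on every output row
header <- as.list(setNames(df$Values[!repeated], df$Columns[!repeated]))

# number the blocks: the counter increases at every "ASRN1"
block <- cumsum(df$Columns[repeated] == "ASRN1")

# build one output row per block, then stack the rows
rows <- lapply(split(df[repeated, ], block), function(b) {
  cols <- c(header, setNames(as.list(b$Values), b$Columns))
  do.call(data.frame, c(cols, list(check.names = FALSE, stringsAsFactors = FALSE)))
})
final_df <- do.call(rbind, rows)
```

The number of blocks is discovered from the data, so the same code handles a dataset with 3 sets or 30.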

Use a vector/index as a row name in a dataframe using rbind

I think I'm missing something super simple, but I can't find a solution that directly addresses what I need: I've got a data frame that has a letter as the row name and two columns of numerical values. As part of a loop I'm running, I create a new vector (from an index) that holds both a letter and a number (e.g. "f2"), which I then need to use as the name of a new row, with two numbers added next to it (based on some other section of code, but I'm fine with that). What I get instead is the name of the vector/index itself as the row name, and I'm not sure whether I'm missing a feature of rbind or something else that would make this easy.
Example code:
#Data frame and vector creation
row.names <- letters[1:5]
vector.1 <- c(1:5)
vector.2 <- c(2:6)
vector.3 <- letters[6:10]
data.frame <- data.frame(vector.1,vector.2)
rownames(data.frame) <- row.names
data.frame
index.vector <- "f2"
#what I want the data frame to look like with the new row
data.frame <- rbind(data.frame, "f2" = c(6,11))
data.frame
#what the data frame looks like when I attempt to use a vector as a row name
data.frame <- rbind(data.frame, index.vector = c(6,11))
data.frame
#"why" I can't just type "f" every time
index.vector2 = paste(index.vector, "2", sep="")
data.frame <- rbind(data.frame, index.vector2 = c(6,11))
data.frame
In my loop the "index.vector" is a random sample, hence where I can't just write the letter/number in as a row name, so need to be able to create the row name from a vector or from the index of the sample.
The loop runs and a random number of new rows will be created, so I can't specify what number the row is that needs a new name - unless there's a way to just do it for the newest or bottom row every time.
Any help would be appreciated!
Not elegant, but works:
new_row <- data.frame(setNames(list(6, 11), colnames(data.frame)), row.names = paste(index.vector, "2", sep=""))
data.frame <- rbind(data.frame, new_row)
data.frame
# vector.1 vector.2
# a 1 2
# b 2 3
# c 3 4
# d 4 5
# e 5 6
# f22 6 11
I understood the problem but was not able to resolve the issue directly, so I am suggesting an alternative way to achieve the same result.
Alternative solution: collect your row labels in a vector while binding rows in your loop, then assign the row names to your data frame at the end.
#Data frame and vector creation
row.names <- letters[1:5]
vector.1 <- c(1:5)
vector.2 <- c(2:6)
vector.3 <- letters[6:10]
data.frame <- data.frame(vector.1,vector.2)
#loop starts
index.vector <- "f2"
data.frame <- rbind(data.frame,c(6,11))
row.names<-append(row.names,index.vector)
#loop ends
rownames(data.frame) <- row.names
data.frame
output:
vector.1 vector.2
a 1 2
b 2 3
c 3 4
d 4 5
e 5 6
f2 6 11
Hope this would be helpful.
If you manipulate the data frame with rbind, then the newest elements will always be at the "bottom" of your data frame. Hence you could also set a single row name by
rownames(data.frame)[nrow(data.frame)] = "new_name"
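Inside a loop, that last-row approach might look like the sketch below. It assumes the labels come from elsewhere; new_labels is a hypothetical stand-in for the randomly sampled names, and the data frame is called df here to avoid masking the data.frame() function. The underlying reason the original attempt fails is that rbind() uses the argument name itself as the row label, so index.vector = c(6, 11) labels the row "index.vector" rather than looking up the variable's value.

```r
df <- data.frame(vector.1 = 1:5, vector.2 = 2:6, row.names = letters[1:5])

# hypothetical labels that would come from the random sample in the loop
new_labels <- c("f2", "g7")

for (label in new_labels) {
  df <- rbind(df, c(6, 11))        # the new row is appended at the bottom
  rownames(df)[nrow(df)] <- label  # rename only that last row
}
```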

Creating a count matrix from factor level occurrences in a list of dataframes

Since I cannot share my real data, here are two small text files representing the first 5 lines of two of my input files:
https://www.dropbox.com/sh/s0rmi2zotb3dx3o/AAAq0G3LbOokfN8MrYf7jLofa?dl=0
I read all textfiles in the working directory into a list, cut some columns, set new names and subset by a numerical cutoff in the third column:
all.files <- list.files(pattern = ".*.txt")
data.list <- lapply(all.files, function(x)read.table(x, sep="\t"))
names(data.list) <- all.files
data.list <- lapply(data.list, function(x) x[,1:3])
new.names<-c("query", "sbjct", "ident")
data.list <- lapply(data.list, setNames, new.names)
new.list <- lapply(data.list, function(x) subset(x, ident>99))
I end up with a list of data frames, each consisting of three columns.
Now, I want to:
1. count the occurrences of factors in the column "sbjct" in all data frames in the list, and
2. build a matrix from the counts, in which rows = factor levels of "sbjct" and columns = occurrences in each data frame.
For each data frame in the list, a new object with two columns (sbjct/counts) should be created, named after the original data frame in the original list. In the end, all the new objects should be merged (with cbind, for example), and empty cells (data absent) should be filled with zeros, resulting in a "sbjct x counts" matrix.
For example, if I had a single data frame, dplyr would help me like this:
library(dplyr)
some.object <- some.dataframe %>%
group_by(sbjct) %>%
summarise(counts = length(sbjct))
>some.object
Source: local data frame [5 x 2]
sbjct counts
1 AB619702.1.1454 1
2 EU287121.1.1497 1
3 HM062118.1.1478 1
4 KC437137.1.1283 1
5 Yq2He155 1
But it seems it cannot be applied to lists of dataframes.
Add a column to each data set that acts as an indicator (call it Ndata) of which data set each observation comes from, then rbind all the data sets together.
A cross table of sbjct by Ndata then gives the matrix you are looking for.
Here is some code to clarify:
t=c("a","b","c","d","e","f")
set.seed(10)
d1=data.frame(sbjt=sample(t,sample(20,1),rep=T))
d2=data.frame(sbjt=sample(t,sample(20,1),rep=T))
d3=data.frame(sbjt=sample(t,sample(20,1),rep=T))
d1$Ndata=rep("d1",nrow(d1))
d2$Ndata=rep("d2",nrow(d2))
d3$Ndata=rep("d3",nrow(d3))
all=rbind(d1,d2,d3)
ct=table(all$sbjt,all$Ndata)
ct looks like this:
> ct
d1 d2 d3
a 1 0 0
b 4 0 1
c 2 2 1
d 3 1 0
e 1 0 0
>
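For a list of data frames such as new.list, the same counting can be sketched without stacking, by tabulating each data frame against a shared set of levels. The new.list below is a hypothetical stand-in for the filtered list in the question, and all_levels and count_matrix are illustrative names.

```r
# hypothetical stand-in for the filtered list from the question
new.list <- list(
  file1.txt = data.frame(sbjct = c("a", "b", "b"), ident = c(100, 99.5, 100)),
  file2.txt = data.frame(sbjct = c("b", "c"), ident = c(99.9, 100))
)

# all sbjct values seen in any data frame
all_levels <- sort(unique(unlist(lapply(new.list, function(x) as.character(x$sbjct)))))

# count each level per data frame; levels absent from a data frame count as 0
count_matrix <- sapply(new.list, function(x) {
  as.vector(table(factor(x$sbjct, levels = all_levels)))
})
rownames(count_matrix) <- all_levels
```

Fixing the factor levels up front is what fills the empty cells with zeros, and the column names are taken from the names of the list.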

compare values of data frames with different number of rows

I defined the following function, which takes two data frames, DF_TAGS_LIST and DF_epc_list. Each data frame has a single column, but with a different number of rows. I want to search for each value of DF_TAGS_LIST in DF_epc_list and, if it is found, store it in another data frame.
One example of DF_TAGS_LIST:
TAGS_LIST
3036029B539869100000000B
3036029B537663000000002A
3036029B5398694000000009
3036029B539869400000000C
3036029B5398690000000006
3036029B5398692000000007
And one example of DF_epc_list:
EPC
3036029B539869100000000B
3036029B537663000000002A
3036029B5398690000000006
3036029B5398692000000007
3036029B5398691000000006
3036029B5376630000000034
3036029B53986940000000WF
3036029B5398694000000454
3036029B5398690000000234
3036029B53986920000000FG
In this case, I would like one dataframe output that had the following values:
FOUND_TAGS
3036029B5398690000000006
3036029B5398692000000007
3036029B539869100000000B
3036029B537663000000002A
My function is:
FOUND_COMPARE_TAGS<-function(DF_TAGS_LIST, DF_epc_list){
DF_epc_list<-toString(DF_epc_list)
DF_TAGS_LIST<-toString(DF_TAGS_LIST)
DF_found_epc_tags <- data.frame(DF_found_epc_tags=intersect(DF_TAGS_LIST$DF_TAGS_LIST, DF_epc_list$DF_epc_list)); setdiff(union(DF_TAGS_LIST$DF_TAGS_LIST, DF_epc_list$DF_epc_list), DF_found_epc_tags$DF_found_epc_tags)
#DF_found_epc_tags <- data.frame(DF_found_epc_tags = DF_TAGS_LIST[unique(na.omit(match(DF_epc_list$DF_epc_list, DF_TAGS_LIST$DF_TAGS_LIST))),])
return(DF_found_epc_tags)
}
It now returns an empty data frame with two columns. I have only recently started programming in R.
You can use %in% or (as I mentioned in my comment) intersect:
DF_TAGS_LIST[DF_TAGS_LIST$TAGS_LIST %in% DF_epc_list$EPC, , drop = FALSE]
# TAGS_LIST
# 1 3036029B539869100000000B
# 2 3036029B537663000000002A
# 5 3036029B5398690000000006
# 6 3036029B5398692000000007
intersect(DF_TAGS_LIST$TAGS_LIST, DF_epc_list$EPC)
# [1] "3036029B539869100000000B" "3036029B537663000000002A"
# [3] "3036029B5398690000000006" "3036029B5398692000000007"
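# Alternative: stack both columns and keep values that occur twice (assumes no duplicates within either column)
FOUND_TAGS <- data.frame(FOUND_TAGS = c(as.character(DF_TAGS_LIST$TAGS_LIST), as.character(DF_epc_list$EPC)))
FOUND_TAGS <- FOUND_TAGS[duplicated(FOUND_TAGS), , drop = FALSE]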
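Wrapped back into the function from the question, the intersect approach might look like the sketch below. It assumes the columns are named TAGS_LIST and EPC as in the examples, and it drops the toString() calls, since toString() collapses its input into a single string and leaves no column to compare.

```r
# returns the tags from DF_TAGS_LIST that also occur in DF_epc_list
FOUND_COMPARE_TAGS <- function(DF_TAGS_LIST, DF_epc_list) {
  found <- intersect(as.character(DF_TAGS_LIST$TAGS_LIST),
                     as.character(DF_epc_list$EPC))
  data.frame(FOUND_TAGS = found, stringsAsFactors = FALSE)
}

# shortened versions of the example data
DF_TAGS_LIST <- data.frame(TAGS_LIST = c("3036029B539869100000000B",
                                         "3036029B537663000000002A",
                                         "3036029B5398694000000009"))
DF_epc_list <- data.frame(EPC = c("3036029B539869100000000B",
                                  "3036029B537663000000002A",
                                  "3036029B5398690000000234"))
result <- FOUND_COMPARE_TAGS(DF_TAGS_LIST, DF_epc_list)
```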
