Combining common IDs in 2 Lists of data tables - r

I have two lists, each containing a few thousand data tables. The data tables contain ids, and each id will only appear once within each list. Additionally, each data table has different columns, though it will share column names with some other data tables. For example, in my lists created below, id 1 appears in the 1st data table in list1 and the 2nd data table in list2. In the first list id 1 has data for columns 'a' and 'd', and in the second list it has columns 'a' and 'b'.
library(data.table)
# Create 2 lists of data tables
list1 <- list(data.table(id=c(1,3), a=c(0,0), d=c(1,1)),
              data.table(id=c(2,4), b=c(1,0), c=c(2,1), f=c(3,1)),
              data.table(id=c(5,6), a=c(4,0), b=c(2,1)))
list2 <- list(data.table(id=c(2,3,6), c=c(0,0,1), d=c(1,1,0), e=c(0,1,2)),
              data.table(id=c(1,4,5), a=c(1,0,3), b=c(2,1,2)))
What I need to do is find each id in both lists and average its values. For example, for id 1:
list id a b d
list1 1 0 NA 1
list2 1 1 2 NA
NA values are treated as 0, so the result for id 1 should be:
id a b d
1 0.5 1 0.5
Next, the top 3 column names are selected and ordered based on their values so that the result is:
id top3
1 b d a
This needs to be repeated for all ids. I have code that can achieve this (below), but for a large list with thousands of data tables and over a million ids it is very slow.
top3 <- NULL                             # will hold the results
for (i in 1:6){                          # i is the id to be searched for
  for (j in 1:length(list1)){
    if (i %in% list1[[j]]$id){
      listnum1 <- j
      rownum1 <- which(list1[[j]]$id==i)
      break
    }
  }
  for (j in 1:length(list2)){
    if (i %in% list2[[j]]$id){
      listnum2 <- j
      rownum2 <- which(list2[[j]]$id==i)
      break
    }
  }
  v1 <- data.table(setDF(list1[[listnum1]])[rownum1,]) # Converting to data.frame using setDF and extracting the row is faster than using data.table
  v2 <- data.table(setDF(list2[[listnum2]])[rownum2,])
  bind <- rbind(v1, v2, fill=TRUE)       # Combines the two rows and fills in columns they don't have in common
  for (j in 1:ncol(bind)){               # Convert NAs to 0
    set(bind, which(is.na(bind[[j]])), j, 0)
  }
  means <- colMeans(bind[,2:ncol(bind),with=F])  # Average the two rows
  col_ids <- as.data.table(t(names(sort(means)[length(means):(length(means)-2)])))
  # select and order the top 3 column names and bind to the results
  top3 <- rbind(top3, cbind(id=i, top3=data.table(do.call("paste", c(col_ids[,1:min(length(col_ids),3),with=F], sep=" ")))))
}
id top3.V1
1: 1 b d a
2: 2 f c d
3: 3 d e c
4: 4 f c b
5: 5 a b
6: 6 e c b
When I run this code on my full data set (which has a few million IDs) it only makes it through about 400 ids after about 60 seconds. It would take days to go through the entire data set. Converting each list into 1 much larger data table is not an option; there are 100,000 possible columns so it becomes too large. Is there a faster way to achieve the desired result?

Melt down the individual data.tables and you won't run into the issue of wasted memory:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id', variable.factor = F))[
  # find number of "rows" per id
  , nvals := max(rle(sort(variable))$lengths), by = id][
  # compute the means, assuming that missing values are equal to 0
  , sum(value)/nvals[1], by = .(id, variable)][
  # extract top 3 values
  order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
# id V1
#1: 1 b a d
#2: 2 f c b
#3: 3 d e a
#4: 4 b c f
#5: 5 a b
#6: 6 e b c
Or instead of rle you can do:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
  , .(vals = sum(value), nvals = .N), by = .(id, variable)][
  , vals := vals / max(nvals), by = id][
  order(-vals), paste(head(variable, 3), collapse = " "), keyby = id]
Or better yet, as Frank points out, don't even bother with the mean; within an id every sum is divided by the same count, so ranking by the sum gives the same top 3:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
  , sum(value), by = .(id, variable)][
  order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
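If you want to sanity-check the speed-up on a slice of your real data, a quick timing harness is easy to set up (a sketch; it assumes the microbenchmark package is installed, and you could add your original loop as a second expression to compare):
library(microbenchmark)
microbenchmark(
  melt_approach = rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
    , sum(value), by = .(id, variable)][
    order(-V1), paste(head(variable, 3), collapse = " "), keyby = id],
  times = 10
)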

Not sure about the performance, but this should avoid the for-loop:
library(plyr)
library(dplyr)
a <- ldply(list1, data.frame)
b <- ldply(list2, data.frame)
dat <- full_join(a,b)
This will give you a single data frame:
id a d b c f e
1 1 0 1 NA NA NA NA
2 3 0 1 NA NA NA NA
3 2 NA NA 1 2 3 NA
4 4 NA NA 0 1 1 NA
5 5 4 NA 2 NA NA NA
6 6 0 NA 1 NA NA NA
7 2 NA 1 NA 0 NA 0
8 3 NA 1 NA 0 NA 1
9 6 NA 0 NA 1 NA 2
10 1 1 NA 2 NA NA NA
11 4 0 NA 1 NA NA NA
12 5 3 NA 2 NA NA NA
By summarising based on id:
means <- function(x) mean(x, na.rm=T)
output <- dat %>% group_by(id) %>% summarise_each(funs(means))
id a d b c f e
1 1 0.5 1 2.0 NA NA NA
2 2 NaN 1 1.0 1 3 0
3 3 0.0 1 NaN 0 NaN 1
4 4 0.0 NaN 0.5 1 1 NaN
5 5 3.5 NaN 2.0 NaN NaN NaN
6 6 0.0 0 1.0 1 NaN 2
Listing the top 3 through sapply will give you the same resulting table (but as a matrix, each column corresponding to id)
sapply(1:nrow(output), function(x) sort(output[x,-1], decreasing=T)[1:3] %>% names)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "b" "f" "d" "c" "a" "e"
[2,] "d" "d" "e" "f" "b" "b"
[3,] "a" "b" "a" "b" NA "c"
** Updated **
Since the data is going to be large, it's prudent to create some functions that choose and combine the appropriate data.frames for each id.
(i) find all the ids present in each list
id_list1 <- lapply(list1, "[[", "id")
id_list2 <- lapply(list2, "[[", "id")
(ii) find out which tables within each list contain ids 1 to 6
id_l1<-lapply(1:6, function(x) sapply(id_list1, function(y) any(y==x) %>% unlist))
id_l2<-lapply(1:6, function(x) sapply(id_list2, function(y) any(y==x) %>% unlist))
(iii) create a function to combine the appropriate data.frames for a specific id
id_who <- function(x){
  a <- data.frame(list1[id_l1[[x]]])
  a <- a[a$id==x, ]
  b <- data.frame(list2[id_l2[[x]]])
  b <- b[b$id==x, ]
  full_join(a, b)
}
new <- lapply(1:6, id_who)
new
[[1]]
id a d b
1 1 0 1 NA
2 1 1 NA 2
[[2]]
id b c f d e
1 2 1 2 3 NA NA
2 2 NA 0 NA 1 0
[[3]]
id a d c e
1 3 0 1 0 1
[[4]]
id b c f a
1 4 0 1 1 NA
2 4 1 NA NA 0
[[5]]
id a b
1 5 4 2
2 5 3 2
[[6]]
id a b c d e
1 6 0 1 1 0 2
output<-ldply(new, summarise_each, funs(means))
Output will be the same as the above.
The advantage of this approach is that you can easily put logical breaks into the process, either at (ii) or (iii).
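For example, you could run (iii) over the ids in batches and checkpoint between them; a rough sketch reusing id_who and means from above (n_ids, batch_size and the checkpoint file are made up for illustration):
n_ids <- 6                          # with the real data, the total number of ids
batch_size <- 2                     # hypothetical chunk size
results <- list()
for (start in seq(1, n_ids, by = batch_size)) {
  ids <- start:min(start + batch_size - 1, n_ids)
  results[[length(results) + 1]] <- ldply(lapply(ids, id_who), summarise_each, funs(means))
  # a natural point to saveRDS(results, "checkpoint.rds") or print progress
}
output <- rbind.fill(results)       # plyr::rbind.fill pads columns that differ between batches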

Replace missing values with row means if exactly N missing values per row

I have a data matrix with a different number of missing values per row. What I want is to replace the missing values with row means if the number of missing values per row is N (let's say 1).
I have already created a solution for this problem but it's a very inelegant one so I'm looking for something else.
My solution:
#SAMPLE DATA
a <- c(rep(c(1:4, NA), 2))
b <- c(rep(c(1:3, NA, 5), 2))
c <- c(rep(c(1:3, NA, 5), 2))
df <- as.matrix(cbind(a,b,c), ncol = 3, nrow = 10)
#CALCULATING THE NUMBER OF MISSING VALUES PER ROW
miss_row <- rowSums(apply(as.matrix(df), c(1,2), function(x) {
  sum(is.na(x)) +
    sum(x == "", na.rm=TRUE)
}))
df <- cbind(df, miss_row)
#CALCULATING THE ROW MEANS FOR ROWS WITH 1 MISSING VALUE
row_mean <- ifelse(df[,4] == 1, rowMeans(df[,1:3], na.rm = TRUE), NA)
df <- cbind(df, row_mean)
Here is the way I mentioned in a comment, with more details:
# create your matrix
df <- cbind(a, b, c) # already a matrix, you don't need as.matrix there
# Get number of missing values per row (is.na is vectorised so you can apply it directly on the entire matrix)
nb_NA_row <- rowSums(is.na(df))
# Replace missing values row-wise by the row mean when there is N NA in the row
N <- 1 # the given example
df[nb_NA_row==N] <- rowMeans(df, na.rm=TRUE)[nb_NA_row==N]
# check df
df
# a b c
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
# [4,] 4 NA NA
# [5,] 5 5 5
# [6,] 1 1 1
# [7,] 2 2 2
# [8,] 3 3 3
# [9,] 4 NA NA
#[10,] 5 5 5
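Note that the assignment above selects every cell of the qualifying rows (not just the NA cells) and fills them all with that row's mean; here that happens to give the same result because the non-missing values already equal the row mean. If you want to be sure only the NA cells are touched, a variant starting again from the original matrix df would be:
idx <- is.na(df) & nb_NA_row == N                    # NA cells in rows with exactly N missing values
df[idx] <- rowMeans(df, na.rm = TRUE)[row(df)[idx]]  # fill each such cell with its row mean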
# to get the miss_row / row_mean columns from your post, start again from the original data
df <- data.frame(cbind(a, b, c))
df$miss_row <- rowSums(is.na(df))
df$row_mean <- NA
df$row_mean[df$miss_row == 1] <- rowMeans(df[df$miss_row == 1,1:3],na.rm = TRUE)
# a b c miss_row row_mean
# 1 1 1 1 0 NA
# 2 2 2 2 0 NA
# 3 3 3 3 0 NA
# 4 4 NA NA 2 NA
# 5 NA 5 5 1 5
# 6 1 1 1 0 NA
# 7 2 2 2 0 NA
# 8 3 3 3 0 NA
# 9 4 NA NA 2 NA
# 10 NA 5 5 1 5
(This gives your expected output, which seems not to be completely in line with your text, but for this see comments and duplicate link)
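If you also want to write row_mean back into the missing cells of those rows (rather than only storing it in its own column), a small follow-up sketch on the data.frame version above:
for (j in 1:3) {
  idx <- df$miss_row == 1 & is.na(df[[j]])  # missing cells in rows with exactly 1 NA
  df[[j]][idx] <- df$row_mean[idx]
}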

merge columns that have the same name r

I am working in R with a dataset that is created from mongodb with the use of mongolite.
I am getting a list that looks like so:
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4
I would like to merge the dataset to look like this:
_id A B
1 a 1
2 k 4
1 b 2
2 l 3
1 e 5
2 c 3
1 NA NA
2 d 4
The NAs in the last columns are there because the columns are named from the first entry, and if a later entry has more columns than that, those extra columns don't get names assigned to them (if I get help for this as well it would be awesome, but it's not the reason I am here).
Also the number of columns might differ for different subsets of the dataset.
I have tried melt(), but since it is a list and not a data frame it doesn't work as expected. I have tried stack(), but it didn't work because the columns have the same name and some of them don't even have a name.
I know this is a very weird situation and appreciate any help.
Thank you.
using library(magrittr)
data:
df <- fread("
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4 ",header=T)
setDF(df)
Code:
df2 <- df[,-1]                                      # drop the _id column
odds <- df2 %>% ncol %>% {(1:.)%%2} %>% as.logical  # TRUE for the odd (A) columns
even <- df2 %>% ncol %>% {!(1:.)%%2}                # TRUE for the even (B) columns
cbind(df[,1,drop=F],
      A=unlist(df2[,odds]),
      B=unlist(df2[,even]),
      row.names=NULL)
result:
# _id A B
# 1 1 a 1
# 2 2 k 4
# 3 1 b 2
# 4 2 l 3
# 5 1 e 5
# 6 2 c 3
# 7 1 <NA> NA
# 8 2 d 4
We can use data.table, assuming A and B always follow each other. I created an example with 2 sets of NAs in the header. With grep we can find the columns fread has named V8 etc.; using R's recycling of vectors, you can rename multiple headers in one go (if in your case these are named differently, change the pattern in the grep command). Then we reshape the data to long form via melt.
library(data.table)
df <- fread("
_id A B A B A B NA NA NA NA
1 a 1 b 2 e 5 NA NA NA NA
2 k 4 l 3 c 3 d 4 e 5",
header = TRUE)
df
_id A B A B A B A B A B
1: 1 a 1 b 2 e 5 <NA> NA <NA> NA
2: 2 k 4 l 3 c 3 d 4 e 5
# assuming A B are always following each other. Can be done in 1 statement.
cols <- names(df)
cols[grep(pattern = "^V", x = cols)] <- c("A", "B")
names(df) <- cols
# melt data (if df is a data.frame, replace df with setDT(df))
df_melted <- melt(df, id.vars = 1,
                  measure.vars = patterns(c('A', 'B')),
                  value.name = c('A', 'B'))
df_melted
_id variable A B
1: 1 1 a 1
2: 2 1 k 4
3: 1 2 b 2
4: 2 2 l 3
5: 1 3 e 5
6: 2 3 c 3
7: 1 4 <NA> NA
8: 2 4 d 4
9: 1 5 <NA> NA
10: 2 5 e 5
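If you want exactly the layout from the question, without the variable index, that helper column can simply be dropped afterwards:
df_melted[, variable := NULL]
df_melted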
Thank you for your help; both answers were great inspirations. @Andre Elrico's solution worked better on the reproducible example, but @phiver's worked better on my overall problem.
By using both of those I came up with the following.
library(data.table)
#The data were in a list of lists called list for this example
temp <- as.data.table(matrix(t(sapply(list, '[', seq(max(sapply(list, length))))),
                             nrow = m))
# m here is the number of lists in list
cols <- names(temp)
cols[grep(pattern = "^V", x = cols)] <- c("B", "A")
# They need to be the opposite way round because the first column is going to be replaced with id, and this way they fall on the correct columns afterwards
cols[1] <- "id"
names(temp) <- cols
l <- melt.data.table(temp, id.vars = 1,
                     measure.vars = patterns(c("A", "B")),
                     value.name = c("A", "B"))
That way I can use this also if I have more than 2 columns that I need to manipulate like that.
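For instance, with three repeating column groups the same patterns() call extends naturally; a small self-contained sketch (the toy table and its column names are made up):
dt3 <- fread("
id A B C A B C
1 a 1 10 b 2 20
2 k 4 40 l 3 30", header = TRUE)
melt(dt3, id.vars = "id",
     measure.vars = patterns(c("A", "B", "C")),
     value.name = c("A", "B", "C"))
#    id variable A B  C
# 1:  1        1 a 1 10
# 2:  2        1 k 4 40
# 3:  1        2 b 2 20
# 4:  2        2 l 3 30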

Create a counter in a for loop in R

I'm an inexperienced user of R and I need to create something quite complicated.
My dataset looks like this:
[dataset screenshot: column A holds the individual (a-e), with numeric columns B and C and a column D to be completed]
a,b,c,d,e are different individuals.
I want to complete the D column as follows: at the last line for each individual in column A, D = sum(C)/(B-1).
Expected results should look like:
[results screenshot: the same table with D filled in on the last row of each individual]
D4=sum(C2:C4)/(B4-1)=0.5
D6=sum(C5:C6)/(B6-1)=1, etc.
I attempted to deal with it with something like:
for(i in 2:NROW(dataset)){
  dataset[i,4] <- ifelse(
    (dataset[i,1]==dataset[i-1,1]), sum(dataset[i,3])/(dataset[i,2]-1), NA
  )
}
But it is obviously not sufficient, as it computes the D value for all the rows and not only the last for each individual, and it does not calculate the sum of C values for this individual.
And I really don't know how to figure it out. Do you guys have any advice ?
Many thanks.
If I understood your question correctly, then this is one approach to get to the desired result:
df <- data.frame(
  A=c("a","a","a","b","b","c","c","c","d","e","e"),
  B=c(3,3,3,2,2,3,3,3,1,2,2),
  C=c(NA,1,0,NA,1,NA,0,1,NA,NA,0),
  stringsAsFactors = FALSE)

for(i in 2:NROW(df)){
  df[i,4] <- ifelse(
    (df[i,1]!=df[i+1,1] | i == nrow(df)), sum(df[df$A == df[i,1],]$C, na.rm=TRUE)/(df[i,2]-1), NA
  )
}
This code results in the following table:
A B C V4
1 a 3 NA NA
2 a 3 1 NA
3 a 3 0 0.5
4 b 2 NA NA
5 b 2 1 1.0
6 c 3 NA NA
7 c 3 0 NA
8 c 3 1 0.5
9 d 1 NA NaN
10 e 2 NA NA
11 e 2 0 0.0
The ifelse first tests if the individual of the current row of column A is different than the individual in the next row OR if it's the last row.
If it is the last row with this individual it takes the sum of column C (ignoring the NAs) of the rows with the individual present in column A divided by the value in column B minus one.
Otherwise it puts an NA in the fourth column.
Using dplyr you can try generating D for all rows and then removing it where not required:
library(dplyr)
df %>%          # df is the sample data frame built above
  group_by(A, B) %>%
  dplyr::mutate(D = sum(C, na.rm=TRUE)/(B-1)) %>%
  dplyr::mutate(D = if_else(row_number() == n(), D, as.double(NA)))
which gives:
Source: local data frame [11 x 4]
Groups: A, B [5]
A B C D
<chr> <dbl> <dbl> <dbl>
1 a 3 NA NA
2 a 3 1 NA
3 a 3 0 0.5
4 b 2 NA NA
5 b 2 1 1.0
6 c 3 NA NA
7 c 3 0 NA
8 c 3 1 0.5
9 d 1 NA NaN
10 e 2 NA NA
11 e 2 0 0.0

Correlations by grouping twice in R, using dplyr or aggregate?

My (toy) data looks like:
Item_Id Location_Id date price
1 A 5372 1 .5
2 A 5372 2 NA
3 A 5372 3 1
4 A 6065 1 1
5 A 6065 2 1
6 A 6065 3 3
7 A 7000 1 NA
8 A 7000 2 NA
9 A 7000 3 NA
10 B 5372 1 3
11 B 5372 2 NA
12 B 5372 3 1
13 B 6065 1 2
14 B 6065 2 1
15 B 6065 3 3
16 B 7000 1 8
17 B 7000 2 NA
18 B 7000 3 9
In reality there are hundreds of unique item_Ids and location_Ids.
Data
Item_Id=c(rep('A',9),rep('B',9))
Location_Id=rep(c(rep(5372,3),rep(6065,3),rep(7000,3)),2)
date = rep(1:3,6)
price = c(0.5,NA,1,1,1,3,NA,NA,NA,3,NA,1,2,1,3,8,NA,9)
df = data.frame(Item_Id,Location_Id,date,price)
I ultimately want to get the median correlation (over locations) of the price series for every item with every other item. I tried writing a loop in the hope that it would be quick (not finished):
for(item in items){
  remainingitems = items[items!=item]
  for(item2 in remainingitems){
    cortemp = numeric(0)
    for(locat in locations){
      print(locat)
      a = pricepanel[pricepanel$Item_Id==item &
                       pricepanel$Location_Id==locat,]$price
      b = pricepanel[pricepanel$Item_Id==item2 &
                       pricepanel$Location_Id==locat,]$price
      cortemp = c(cortemp, cor(cbind(a,b), use="pairwise.complete.obs")[2])
    }
  }
}
But I stopped because it was much too slow: the innermost loop alone took several minutes, and there are hundreds of stores and items. Basically I want to get the correlation matrix (every product with every other product) for every location, and then take the element-wise median across those matrices.
I expect there is an efficient way to do this, but I am new to this kind of thing in R. I tried reading up on dplyr since I suspect the solution lies there, but I got stuck.
The interim output would be something like:
$5372
A B
A 1 -1
B -1 1
$6065
A B
A 1 0.8660254
B 0.8660254 1
$7000
A B
A 1 NA
B NA 1
Then the final would take the elementwise median of all those location matrices.
Final:
A B
A 1 -.0669873
B -.0669873 1
You could get the "interim" output using dplyr and tidyr:
library(dplyr)
library(tidyr)
cors <- df %>% spread(Item_Id, price) %>%
  group_by(Location_Id) %>%
  do(correlation = cor(.[, -(1:2)], use = "pairwise.complete.obs"))
The way that this works is that the spread function (from tidyr) spreads the As, Bs, Cs etc into their own columns:
df %>% spread(Item_Id, price)
# Location_Id date A B
# 1 5372 1 0.5 3
# 2 5372 2 NA NA
# 3 5372 3 1.0 1
# 4 6065 1 1.0 2
# 5 6065 2 1.0 1
# 6 6065 3 3.0 3
# 7 7000 1 NA 8
# 8 7000 2 NA NA
# 9 7000 3 NA 9
(This should work with any number of "Items"- A, B, C, D...) The group_by(Location_Id) function then tells the code to operate within each location. Finally the do command tells it to find the correlation of the columns within each group (. is a placeholder for "the data within each group"), while ignoring the first two columns, Location_Id and date.
The above code produces a result that looks like:
# Source: local data frame [3 x 2]
# Groups: <by row>
#
# Location_Id correlation
# 1 5372 <dbl[2,2]>
# 2 6065 <dbl[2,2]>
# 3 7000 <dbl[2,2]>
The correlation column is a list of your three within-location matrices. At that point you can use the solution in this question to take the elementwise median:
apply(simplify2array(cors$correlation), c(1,2), median, na.rm = TRUE)
Here's a possible split apply solution using base R
lapply(split(df[, c("Item_Id", "price")], df$Location_Id),
function(x) {
cor(matrix(x$price, nrow = nrow(x)/length(unique(x$Item_Id))), use ="pairwise.complete.obs")
} )
# $`5372`
# [,1] [,2]
# [1,] 1 -1
# [2,] -1 1
#
# $`6065`
# [,1] [,2]
# [1,] 1.0000000 0.8660254
# [2,] 0.8660254 1.0000000
#
# $`7000`
# [,1] [,2]
# [1,] NA NA
# [2,] NA 1
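If you want the item labels on those matrices too, one option is to name the columns of the intermediate matrix; a sketch building on the same split/apply idea:
lapply(split(df[, c("Item_Id", "price")], df$Location_Id),
       function(x) {
         m <- matrix(x$price,
                     nrow = nrow(x)/length(unique(x$Item_Id)),
                     dimnames = list(NULL, as.character(unique(x$Item_Id))))
         cor(m, use = "pairwise.complete.obs")
       })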
And here's a solution similar to @David's, using the data.table package
library(data.table)
DT <- dcast.data.table(as.data.table(df),
                       Location_Id + date ~ Item_Id,
                       value.var = "price")[, -2, with = FALSE]
Res <- DT[, .(Res = list(cor(.SD, use = "pairwise.complete.obs"))), Location_Id]
You can then view the cor matrices using
Res$Res
# [[1]]
# A B
# A 1 -1
# B -1 1
#
# [[2]]
# A B
# A 1.0000000 0.8660254
# B 0.8660254 1.0000000
#
# [[3]]
# A B
# A NA NA
# B NA 1
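The element-wise median across locations can then be taken the same way as in the dplyr answer, reusing Res:
apply(simplify2array(Res$Res), c(1, 2), median, na.rm = TRUE)
#            A          B
# A  1.0000000 -0.0669873
# B -0.0669873  1.0000000
which matches the Final matrix in the question.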

R colSums for every two rows

I am struggling with the following (easy) problem but cannot find a good solution to it. Consider a df as follows:
test<-c("A","B","C","D","E","F")
test2<-sample(1:6)
test3<-data.frame(test,test2)
I would like to have a third column that, in the second row, shows the ratio of rows 1:2 of column 2, in the fourth row the ratio of rows 3:4 of column 2, and in the sixth row the ratio of rows 5:6 of column 2. My df is far larger, otherwise I would have done it by hand :)
Any suggestions on how to do that? I know that you can get the difference with the diff command, but the ratio? And how do I bind two rows together? split() does not seem to do that.
This should be pretty fast:
test3$ratio <- NA
test3$ratio[c(FALSE, TRUE)] <- test3$test2[c(FALSE, TRUE)] /
test3$test2[c(TRUE, FALSE)]
Using a loop (instead of 6 below, put the number of the last row of your large data frame):
test3$ratio <- NA  # create the column before filling it row by row
for( i in seq(2,6,by=2)) {
  test3$ratio[i] <- with(test3, test2[i-1]/test2[i])
}
> test3
test test2 ratio
1 A 3 NA
2 B 5 0.6000000
3 C 4 NA
4 D 6 0.6666667
5 E 1 NA
6 F 2 0.5000000
You can use gl to generate your groups:
temp <- within(test3, {
  Sums <- ave(test2, gl(nrow(test3)/2, 2), FUN = function(x) x[2]/x[1])
  Sums[c(TRUE, FALSE)] <- NA
})
temp
# test test2 Sums
# 1 A 2 NA
# 2 B 6 3.000000
# 3 C 3 NA
# 4 D 4 1.333333
# 5 E 1 NA
# 6 F 5 5.000000
Alternatively (and similar to flodel's answer), you can use head and tail:
test3$Sums <- NA
test3$Sums[c(FALSE, TRUE)] <- (tail(c(0, test3$test2), -1)/
head(c(0, test3$test2), -1))[c(FALSE, TRUE)]
test3
# test test2 Sums
# 1 A 2 NA
# 2 B 6 3.000000
# 3 C 3 NA
# 4 D 4 1.333333
# 5 E 1 NA
# 6 F 5 5.000000
For the above, the sample data was:
set.seed(1)
test<-c("A","B","C","D","E","F")
test2<-sample(1:6)
test3<-data.frame(test,test2)
