Maintain NAs after aggregation in R

I have a data frame as follows
test_df<-data.frame(col1=c(1,NA,NA,4,5),col2=c(3,NA,NA,5,6),col3=c("a","b","c","d","c"))
test_df
col1 col2 col3
1 3 a
NA NA b
NA NA c
4 5 d
5 6 c
I am aggregating data based on col3
agg_test<-aggregate(list(test_df$col1,test_df$col2),by=list(test_df$col3),sum,na.rm=T)
agg_test
Col3 col1 col2
a 1 3
b 0 0
c 5 6
d 4 5
From what I know, for the summation to be correct we need to state explicitly what should be done with NAs; here I have specified that NAs are to be removed before summing, and I guess that internally R treats the removed NAs as 0 and sums up according to the by condition. However, I need to treat the NAs and 0s in my data differently, and therefore have to keep the NAs that are genuine (in this case the observations for b are NA, not 0). How can I achieve this?
Expected output:
Col3 col1 col2
a 1 3
b NA NA
c 5 6
d 4 5
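A side note on where those 0s come from: with na.rm = TRUE the NAs are dropped before summing, so an all-NA group leaves an empty vector behind, and the sum of an empty vector is defined as 0; R does not convert the NAs to 0. A quick check in the console:
sum(c(NA, NA), na.rm = TRUE)
# [1] 0
sum(numeric(0))
# [1] 0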

library(data.table)
# replace() overwrites every value in a group with the group sum unless the
# whole group is NA; unique() then keeps one row per group
unique(setDT(test_df)[, lapply(.SD, function(x)
  replace(x, !all(is.na(x)), sum(x, na.rm = TRUE))), by = col3])
# col3 col1 col2
#1: a 1 3
#2: b NA NA
#3: c 5 6
#4: d 4 5
test_df1 <- test_df
test_df1$col2[2] <- 2
unique(setDT(test_df1)[, lapply(.SD, function(x)
  replace(x, !all(is.na(x)), sum(x, na.rm = TRUE))), by = col3])
# col3 col1 col2
#1: a 1 3
#2: b NA 2
#3: c 5 6
#4: d 4 5
Update
Or, using the more compact code suggested by @Arun:
test_df1$col2[5] <- NA
setDT(test_df1)[, lapply(.SD,
  function(x) sum(x, na.rm = !all(is.na(x)))), by = col3]
# col3 col1 col2
#1: a 1 3
#2: b NA 2
#3: c 5 NA
#4: d 4 5
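The trick is that na.rm is computed per group: when every value in the group is NA, na.rm stays FALSE and sum() returns NA; otherwise the NAs are dropped. In isolation:
sum(c(NA, NA), na.rm = !all(is.na(c(NA, NA))))  # NA, since na.rm is FALSE
sum(c(1, NA),  na.rm = !all(is.na(c(1, NA))))   # 1, since na.rm is TRUE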

It sounds like (based on your comments to requests for clarification) you want to aggregate your groups so that you get NA if all the values are missing, and otherwise the sum of the non-missing values. You can pass aggregate a user-defined function with this behaviour:
aggregate(list(test_df$col1, test_df$col2), by = list(test_df$col3),
          function(x) ifelse(all(is.na(x)), NA, sum(x, na.rm = TRUE)))
# Group.1 c.1..NA..NA..4..5. c.3..NA..NA..5..6.
# 1 a 1 3
# 2 b NA NA
# 3 c 5 6
# 4 d 4 5
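The awkward column names in that output (c.1..NA..NA..4..5. and so on) come from passing unnamed vectors; naming the list elements keeps them readable. A small variation on the same call:
aggregate(list(col1 = test_df$col1, col2 = test_df$col2),
          by = list(col3 = test_df$col3),
          function(x) ifelse(all(is.na(x)), NA, sum(x, na.rm = TRUE)))
#   col3 col1 col2
# 1    a    1    3
# 2    b   NA   NA
# 3    c    5    6
# 4    d    4    5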

Related

Skip NAs when using Reduce() in data.table

I'm trying to get the cumulative sum of data.table rows and was able to find this code in another stackoverflow post:
devDF1[,names(devDF1):=Reduce(`+`,devDF1,accumulate=TRUE)]
It does what I need, except that when it comes across a row that starts with an NA, it replaces every element in that row with NA instead of the cumulative sum of the other elements in the row. I don't want to replace the NAs with 0s, because I'll need this output for further processing and don't want the same final cumsum duplicated in the rows. Is there any way I can adjust that piece of code to ignore the NAs? Or is there an alternative that gets the cumulative sum of the rows in a data.table while ignoring NAs?
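One possible adjustment, sketched here under the assumption that devDF1 is all numeric and that NA cells should simply contribute nothing to the running total, is to keep the row-wise Reduce() but swap `+` for an NA-tolerant helper (if the NA positions must stay NA in the output, an extra masking step like the one in the answer below is still needed):
na_plus <- function(x, y) {
  x[is.na(x)] <- 0  # treat missing values as contributing nothing
  y[is.na(y)] <- 0
  x + y
}
devDF1[, names(devDF1) := Reduce(na_plus, .SD, accumulate = TRUE)]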
Consider this example:
library(data.table)
dt <- data.table(a = 1:5, b = c(3, NA, 1, 2, 4), c = c(NA, 1, NA, 3, 4))
dt
# a b c
#1: 1 3 NA
#2: 2 NA 1
#3: 3 1 NA
#4: 4 2 3
#5: 5 4 4
If you want to carry the previous cumulative value forward at the NA positions, you can use:
dt[, names(dt) := lapply(.SD, function(x) cumsum(replace(x, is.na(x), 0))),
   .SDcols = names(dt)]
dt
# a b c
#1: 1 3 0
#2: 3 3 1
#3: 6 4 1
#4: 10 6 4
#5: 15 10 8
If you want to keep NA as NA:
dt[, names(dt) := lapply(.SD, function(x) {
  x1 <- cumsum(replace(x, is.na(x), 0))
  x1[is.na(x)] <- NA
  x1
}), .SDcols = names(dt)]
dt
# a b c
#1: 1 3 NA
#2: 3 NA 1
#3: 6 4 NA
#4: 10 6 4
#5: 15 10 8

How to do a complex wide-to-long operation for network analysis

I have survey data that includes who the respondent is (iAmX), who they work with (withX), how frequently they work with each partner (freqX), and how satisfied they are with each partner (likeX). Participants can select multiple options for who they are and who they work with.
I would like to go from something like this, with one row per respondent:
df <- read.table(header=T, text='
id iAmA iAmB iAmC withA withB withC freqA freqB freqC likeA likeB likeC
1 X X NA X X NA 3 2 NA 3 2 NA
2 NA NA X X NA NA 5 NA NA 5 NA NA
')
To something like this, with one row per combination, where "from" is who the actor is and "to" is who they work with:
goal <- read.table(header=T, text='
id from to freq like
1 A A 3 3
1 B A 3 3
1 A B 2 2
1 B B 2 2
2 C A 5 5
')
I have tried some melt, gather, and reshape functions but frankly I think I'm just not up to the logic puzzle today. I would really appreciate some help!
Although I must admit I have not fully understood the OP's logic, the code below reproduces the expected goal.
The key points here are data.table's incarnation of melt(), which can reshape multiple measure columns simultaneously, and the cross-join function CJ().
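For readers who have not met CJ() before, it simply builds every combination of its arguments (a small illustration, unrelated to the data above):
library(data.table)   # CJ() ships with data.table
CJ(from = c("A", "B"), to = c("A", "B"))
#    from to
# 1:    A  A
# 2:    A  B
# 3:    B  A
# 4:    B  B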
library(data.table)
# reshape multiple measure columns simultaneously
cols <- c("iAm", "with", "freq", "like")
long <- melt(setDT(df), measure.vars = patterns(cols),
             value.name = cols, variable.name = "to")[
  # rename factor levels
  , to := forcats::fct_relabel(to, function(x) LETTERS[as.integer(x)])]
# create combinations for each id
combi <- long[, CJ(from = na.omit(to[iAm == "X"]), to = na.omit(to[with == "X"])), by = id]
# join to append freq and like
result <- combi[long, on = .(id, to), nomatch = 0L][, -c("iAm", "with")]
# reorder result
setorder(result, id)
result
id from to freq like
1: 1 A A 3 3
2: 1 B A 3 3
3: 1 A B 2 2
4: 1 B B 2 2
5: 2 C A 5 5
The intermediate results are
long
id to iAm with freq like
1: 1 A X X 3 3
2: 2 A <NA> X 5 5
3: 1 B X X 2 2
4: 2 B <NA> <NA> NA NA
5: 1 C <NA> <NA> NA NA
6: 2 C X <NA> NA NA
and
combi
id from to
1: 1 A A
2: 1 A B
3: 1 B A
4: 1 B B
5: 2 C A

Merge columns that have the same name in R

I am working in R with a dataset that is created from mongodb with the use of mongolite.
I am getting a list that looks like so:
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4
I would like to merge the dataset to look like this:
_id A B
1 a 1
2 k 4
1 b 2
2 l 3
1 e 5
2 c 3
1 NA NA
2 d 4
The NAs in the last columns are there because the columns are named from the first entry, and if a later entry has more columns than that, the extra columns don't get names assigned to them (if I get help for this as well it would be awesome, but it's not the reason I am here).
Also the number of columns might differ for different subsets of the dataset.
I have tried melt(), but since it is a list and not a data frame it doesn't work as expected. I have tried stack(), but it didn't work because the columns have the same name and some of them don't even have a name.
I know this is a very weird situation and appreciate any help.
Thank you.
Using library(magrittr).
data:
df <- fread("
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4 ",header=T)
setDF(df)
Code:
df2 <- df[, -1]
odds <- df2 %>% ncol %>% {(1:.) %% 2} %>% as.logical
even <- df2 %>% ncol %>% {!(1:.) %% 2}
cbind(df[, 1, drop = FALSE],
      A = unlist(df2[, odds]),
      B = unlist(df2[, even]),
      row.names = NULL)
result:
# _id A B
# 1 1 a 1
# 2 2 k 4
# 3 1 b 2
# 4 2 l 3
# 5 1 e 5
# 6 2 c 3
# 7 1 <NA> NA
# 8 2 d 4
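The two magrittr pipelines are just building alternating logical selectors for the odd and even columns; the same selection can be written in base R if preferred (a sketch, reusing the same df2):
odds <- seq_len(ncol(df2)) %% 2 == 1
even <- !odds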
We can use data.table, assuming A and B always follow each other. I created an example with two sets of NAs in the header. With grep() we can find the columns fread() has auto-named (V8 and so on), and thanks to R's recycling of vectors we can rename several headers in one go; if in your case these columns are named differently, change the pattern in the grep() call. Then we reshape the data via melt().
library(data.table)
df <- fread("
_id A B A B A B NA NA NA NA
1 a 1 b 2 e 5 NA NA NA NA
2 k 4 l 3 c 3 d 4 e 5",
header = TRUE)
df
_id A B A B A B A B A B
1: 1 a 1 b 2 e 5 <NA> NA <NA> NA
2: 2 k 4 l 3 c 3 d 4 e 5
# assuming A B are always following each other. Can be done in 1 statement.
cols <- names(df)
cols[grep(pattern = "^V", x = cols)] <- c("A", "B")
names(df) <- cols
# melt data (if df is a data.frame, replace df with setDT(df))
df_melted <- melt(df, id.vars = 1,
                  measure.vars = patterns(c('A', 'B')),
                  value.name = c('A', 'B'))
df_melted
_id variable A B
1: 1 1 a 1
2: 2 1 k 4
3: 1 2 b 2
4: 2 2 l 3
5: 1 3 e 5
6: 2 3 c 3
7: 1 4 <NA> NA
8: 2 4 d 4
9: 1 5 <NA> NA
10: 2 5 e 5
Thank you for your help; both answers were great inspirations.
Even though @Andre Elrico's solution worked better on the reproducible example, @phiver's worked better on my overall problem.
By using both of them I came up with the following.
library(data.table)
#The data were in a list of lists called list for this example
temp <- as.data.table(matrix(t(sapply(list, '[', seq(max(sapply(list, length))))),
                             nrow = m))
# m here is the number of lists in list
cols <- names(temp)
cols[grep(pattern = "^V", x = cols)] <- c("B", "A")
#They need to be the opposite way because the first column is going to be substituted with id, and this way they fall on the correct column after that
cols[1] <- "id"
names(temp) <- cols
l <- melt.data.table(temp, id.vars = 1,
                     measure.vars = patterns(c("A", "B")),
                     value.name = c("A", "B"))
That way I can use this also if I have more than 2 columns that I need to manipulate like that.
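The sapply(list, '[', seq(max(...))) part is a padding trick: indexing a vector past its end with [ returns NAs, so every list element is stretched to the same length before being bound into a matrix. In isolation, with a throwaway list called lst:
lst <- list(1:3, 1:5)
sapply(lst, `[`, seq_len(max(lengths(lst))))
#      [,1] [,2]
# [1,]    1    1
# [2,]    2    2
# [3,]    3    3
# [4,]   NA    4
# [5,]   NA    5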

Combining common IDs in 2 Lists of data tables

I have two lists, each containing a few thousand data tables. The data tables contain id's and each id will only appear once within each list. Additionally, each data table will have different columns, though they will share column names with some other data tables. For example, in my lists created below, id 1 appears in the 1st data table in list1 and the 2nd data table in list2. In the first list id 1 has data for columns 'a' and 'd' and in the second list it has columns for 'a' and 'b'.
library(data.table)
# Create 2 lists of data frames
list1 <- list(data.table(id = c(1, 3), a = c(0, 0), d = c(1, 1)),
              data.table(id = c(2, 4), b = c(1, 0), c = c(2, 1), f = c(3, 1)),
              data.table(id = c(5, 6), a = c(4, 0), b = c(2, 1)))
list2 <- list(data.table(id = c(2, 3, 6), c = c(0, 0, 1), d = c(1, 1, 0), e = c(0, 1, 2)),
              data.table(id = c(1, 4, 5), a = c(1, 0, 3), b = c(2, 1, 2)))
What I need to do is find each id in both lists and average its results.
list id a b d
list1 1 0 NA 1
list2 1 1 2 NA
NA values are treated as 0, so the result for id 1 should be:
id a b d
1 0.5 1 0.5
Next, the top 3 column names are selected and ordered based on their values so that the result is:
id top3
1 b d a
This needs to be repeated for all id's. I have code that can achieve this (below), but for a large list with thousands of data tables and over a million ids it is very slow.
top3 <- NULL  # results accumulate here
for (i in 1:6){ # i is the id to be searched for
  for (j in 1:length(list1)){
    if (i %in% list1[[j]]$id){
      listnum1 <- j
      rownum1 <- which(list1[[j]]$id == i)
      break
    }
  }
  for (j in 1:length(list2)){
    if (i %in% list2[[j]]$id){
      listnum2 <- j
      rownum2 <- which(list2[[j]]$id == i)
      break
    }
  }
  v1 <- data.table(setDF(list1[[listnum1]])[rownum1, ]) # Converting to data.frame using setDF and extracting the row is faster than using data.table
  v2 <- data.table(setDF(list2[[listnum2]])[rownum2, ])
  bind <- rbind(v1, v2, fill = TRUE) # Combines two rows and fills in columns they don't have in common
  for (j in 1:ncol(bind)){ # Convert NAs to 0
    set(bind, which(is.na(bind[[j]])), j, 0)}
  means <- colMeans(bind[, 2:ncol(bind), with = F]) # Average the two rows
  col_ids <- as.data.table(t(names(sort(means)[length(means):(length(means) - 2)])))
  # select and order the top 3 ids and bind to a data frame
  top3 <- rbind(top3, cbind(id = i, top3 = data.table(do.call("paste", c(col_ids[, 1:min(length(col_ids), 3), with = F], sep = " ")))))
}
id top3.V1
1: 1 b d a
2: 2 f c d
3: 3 d e c
4: 4 f c b
5: 5 a b
6: 6 e c b
When I run this code on my full data set (which has a few million IDs) it only makes it through about 400 ids after about 60 seconds. It would take days to go through the entire data set. Converting each list into 1 much larger data table is not an option; there are 100,000 possible columns so it becomes too large. Is there a faster way to achieve the desired result?
Melt down the individual data.table's and you won't run into the issue of wasted memory:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id', variable.factor = F))[
# find number of "rows" per id
, nvals := max(rle(sort(variable))$lengths), by = id][
# compute the means, assuming that missing values are equal to 0
, sum(value)/nvals[1], by = .(id, variable)][
# extract top 3 values
order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
# id V1
#1: 1 b a d
#2: 2 f c b
#3: 3 d e a
#4: 4 b c f
#5: 5 a b
#6: 6 e b c
Or instead of rle you can do:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
, .(vals = sum(value), nvals = .N), by = .(id, variable)][
, vals := vals / max(nvals), by = id][
order(-vals), paste(head(variable, 3), collapse = " "), keyby = id]
Or better yet, as Frank points out, don't even bother with the mean: dividing by a per-id constant doesn't change the ranking of the variables within an id, so the plain sum is enough to pick the top 3:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
, sum(value), by = .(id, variable)][
order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
Not sure about the performance, but this should avoid the for-loop:
library(plyr)
library(dplyr)
a <- ldply(list1, data.frame)
b <- ldply(list2, data.frame)
dat <- full_join(a,b)
This will give you a single data frame:
id a d b c f e
1 1 0 1 NA NA NA NA
2 3 0 1 NA NA NA NA
3 2 NA NA 1 2 3 NA
4 4 NA NA 0 1 1 NA
5 5 4 NA 2 NA NA NA
6 6 0 NA 1 NA NA NA
7 2 NA 1 NA 0 NA 0
8 3 NA 1 NA 0 NA 1
9 6 NA 0 NA 1 NA 2
10 1 1 NA 2 NA NA NA
11 4 0 NA 1 NA NA NA
12 5 3 NA 2 NA NA NA
By summarising based on id:
means <- function(x) mean(x, na.rm=T)
output <- dat %>% group_by(id) %>% summarise_each(funs(means))
id a d b c f e
1 1 0.5 1 2.0 NA NA NA
2 2 NaN 1 1.0 1 3 0
3 3 0.0 1 NaN 0 NaN 1
4 4 0.0 NaN 0.5 1 1 NaN
5 5 3.5 NaN 2.0 NaN NaN NaN
6 6 0.0 0 1.0 1 NaN 2
Listing the top 3 through sapply will give you the same resulting table (but as a matrix, each column corresponding to id)
sapply(1:nrow(output), function(x) sort(output[x,-1], decreasing=T)[1:3] %>% names)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "b" "f" "d" "c" "a" "e"
[2,] "d" "d" "e" "f" "b" "b"
[3,] "a" "b" "a" "b" NA "c"
Updated
Since the data is going to be large, it's prudent to create some functions that can choose and combine the appropriate data frames for each id.
(i) Find all the ids present in both lists:
id_list1 <- lapply(list1, "[[", "id")
id_list2 <- lapply(list2, "[[", "id")
(ii) Find out which table within each list contains ids 1 to 6:
id_l1<-lapply(1:6, function(x) sapply(id_list1, function(y) any(y==x) %>% unlist))
id_l2<-lapply(1:6, function(x) sapply(id_list2, function(y) any(y==x) %>% unlist))
(iii) Create a function that combines the appropriate data frames for a given id:
id_who <- function(x){
  a <- data.frame(list1[id_l1[[x]]])
  a <- a[a$id == x, ]
  b <- data.frame(list2[id_l2[[x]]])
  b <- b[b$id == x, ]
  full_join(a, b)
}
new <- lapply(1:6, id_who)
new
[[1]]
id a d b
1 1 0 1 NA
2 1 1 NA 2
[[2]]
id b c f d e
1 2 1 2 3 NA NA
2 2 NA 0 NA 1 0
[[3]]
id a d c e
1 3 0 1 0 1
[[4]]
id b c f a
1 4 0 1 1 NA
2 4 1 NA NA 0
[[5]]
id a b
1 5 4 2
2 5 3 2
[[6]]
id a b c d e
1 6 0 1 1 0 2
output <- ldply(new, summarise_each, funs(means))
Output will be the same as the above.
The advantage of this approach is that you can easily put logical breaks into the process, either at step (ii) or step (iii).

Reshaping data.table with cumulative sum

I want to reshape a data.table, and include the historic (cumulative summed) information for each variable. The No variable indicates the chronological order of measurements for object ID. At each measurement additional information is found. I want to aggregate the known information at each timestamp No for object ID.
Let me demonstrate with an example:
For the following data.table:
df <- data.table(ID = c(1, 1, 1, 2, 2, 2, 2),
                 No = c(1, 2, 3, 1, 2, 3, 4),
                 Variable = c('a', 'b', 'a', 'c', 'a', 'a', 'b'),
                 Value = c(2, 1, 3, 3, 2, 1, 5))
df
ID No Variable Value
1: 1 1 a 2
2: 1 2 b 1
3: 1 3 a 3
4: 2 1 c 3
5: 2 2 a 2
6: 2 3 a 1
7: 2 4 b 5
I want to reshape it to this:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
So I want the summed values of Value per Variable for each (ID, No) combination, cumulative over No.
I can get the result without the cumulative part by doing
dcast(df, ID+No~Variable, value.var="Value")
which results in the non-cumulative variant:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 NA 1 NA
3: 1 3 3 NA NA
4: 2 1 NA NA 3
5: 2 2 2 NA NA
6: 2 3 1 NA NA
7: 2 4 NA 5 NA
Any ideas how to make this cumulative? The original data.table has over 250,000 rows, so efficiency matters.
EDIT: I just used a, b, c as an example; the original file has about 40 different levels. Furthermore, the NAs are important: there are also Value entries of 0, which mean something different from NA.
POSSIBLE SOLUTION
Okay, so I've found a working solution. It is far from efficient, since it enlarges the original table.
The idea is to duplicate each row TotalNo - No times, where TotalNo is the maximum No per ID. Then the original dcast function can be used to extract the dataframe. So in code:
df[,TotalNo := .N, by=ID]
df2 <- df[rep(seq(nrow(df)), (df$TotalNo - df$No + 1))] #create duplicates
df3 <- df2[order(ID, No)]#, No:= seq_len(.N), by=.(ID, No)]
df3[,No:= seq(from=No[1], to=TotalNo[1], by=1), by=.(ID, No)]
df4 <- dcast(df3,
             formula = ID + No ~ Variable,
             value.var = "Value", fill = NA, fun.aggregate = sum)
It is not really nice, because the creation of duplicates uses more memory. I think it can be further optimized, but so far it works for my purposes. In the sample code it goes from 7 rows to 16 rows, in the original file from 241,670 rows to a whopping 978,331. That's over a factor 4 larger.
SOLUTION
Eddi has improved my solution's computing time on the full dataset (2.08 seconds for Eddi's versus 4.36 seconds for mine). Those are numbers I can work with! Thanks everybody!
Your solution is good, but you're adding too many rows, which is unnecessary if you compute the cumsum beforehand:
# add useful columns
df[, TotalNo := .N, by = ID][, CumValue := cumsum(Value), by = .(ID, Variable)]
# do a rolling join to extend the missing values, and then dcast
dcast(df[df[, .(No = seq(No[1], TotalNo[1])), by = .(ID, Variable)],
         on = c('ID', 'Variable', 'No'), roll = TRUE],
      ID + No ~ Variable, value.var = 'CumValue')
# ID No a b c
#1: 1 1 2 NA NA
#2: 1 2 2 1 NA
#3: 1 3 5 1 NA
#4: 2 1 NA NA 3
#5: 2 2 2 NA 3
#6: 2 3 3 NA 3
#7: 2 4 3 5 3
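If the roll = TRUE part is unfamiliar: in a data.table join x[i, roll = TRUE], any row of i without an exact match on the last join column takes the value from the nearest preceding row of x, i.e. last observation carried forward. A toy illustration, unrelated to df above:
library(data.table)
lookup <- data.table(No = c(1, 3), val = c(10, 30), key = "No")
lookup[data.table(No = 1:4), on = "No", roll = TRUE]
#    No val
# 1:  1  10
# 2:  2  10
# 3:  3  30
# 4:  4  30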
Here's a standard way:
library(zoo)
df[, cv := cumsum(Value), by = .(ID, Variable)]
DT = dcast(df, ID + No ~ Variable, value.var="cv")
lvls = sort(unique(df$Variable))
DT[, (lvls) := lapply(.SD, na.locf, na.rm = FALSE), by=ID, .SDcols=lvls]
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
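Here na.locf() ("last observation carried forward") from zoo is what fills the gaps left by dcast: each NA is replaced by the most recent non-NA value above it, and na.rm = FALSE keeps leading NAs as NA. In isolation:
zoo::na.locf(c(NA, 1, NA, NA, 3), na.rm = FALSE)
# [1] NA  1  1  1  3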
One alternative is a custom-built cumulative sum function. This is exactly the method in @David Arenburg's comment, but with a custom cumulative sum function substituted in.
EDIT: Now using @eddi's much more efficient custom cumulative sum function.
cumsum.na <- function(z){
  Reduce(function(x, y) if (is.na(x) && is.na(y)) NA else sum(x, y, na.rm = T),
         z, accumulate = T)
}
cols <- sort(unique(df$Variable))
res <- dcast(df, ID + No ~ Variable, value.var = "Value")[, (cols) := lapply(.SD, cumsum.na), .SDcols = cols, by = ID]
res
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
This definitely isn't the most efficient approach, but it gets the job done and gives you an admittedly very slow cumulative sum function that handles NAs the way you want.
