Merge columns that have the same name in R

I am working in R with a dataset that is created from mongodb with the use of mongolite.
I am getting a list that looks like so:
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4
I would like to merge the dataset to look like this:
_id A B
1 a 1
2 k 4
1 b 2
2 l 3
1 e 5
2 c 3
1 NA NA
2 d 4
The NAs in the last columns are there because the columns are named from the first entry; if a later entry has more columns than the first, the extra columns don't get names assigned to them. (If I get help with this as well it would be awesome, but it's not the reason I am here.)
Also the number of columns might differ for different subsets of the dataset.
I have tried melt(), but since the data is a list and not a data frame it doesn't work as expected. I have also tried stack(), but it didn't work because the columns share the same names and some of them don't have a name at all.
I know this is a very weird situation and appreciate any help.
Thank you.

Using library(magrittr), plus library(data.table) for fread() and setDF().
data:
df <- fread("
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4 ",header=T)
setDF(df)
Code:
df2 <- df[,-1]
odds<- df2 %>% ncol %>% {(1:.)%%2} %>% as.logical
even<- df2 %>% ncol %>% {!(1:.)%%2}
cbind(df[,1,drop=F],
A=unlist(df2[,odds]),
B=unlist(df2[,even]),
row.names=NULL)
result:
# _id A B
# 1 1 a 1
# 2 2 k 4
# 3 1 b 2
# 4 2 l 3
# 5 1 e 5
# 6 2 c 3
# 7 1 <NA> NA
# 8 2 d 4
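The odd/even indexing above is tied to exactly two repeating columns. As a rough generalisation, the same idea can be sketched in base R by cutting the non-id columns into consecutive blocks and stacking the blocks. The helper name stack_blocks, its arguments, and the small data frame are made up for illustration; the sketch assumes the repeating columns are contiguous and come in complete blocks.

```r
# Hypothetical helper (not from the answer above): stack repeating column
# blocks by position. Assumes the columns after the id column come in
# complete, contiguous blocks of `width` columns.
stack_blocks <- function(df, id_col = 1, width = 2, new_names = c("A", "B")) {
  body <- df[, -id_col, drop = FALSE]
  stopifnot(ncol(body) %% width == 0, length(new_names) == width)
  starts <- seq(1, ncol(body), by = width)
  pieces <- lapply(starts, function(s) {
    piece <- body[, s:(s + width - 1), drop = FALSE]
    names(piece) <- new_names                 # give every block the same names
    cbind(df[, id_col, drop = FALSE], piece)  # repeat the id next to each block
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Made-up wide data with two (A, B) blocks
wide <- data.frame(id = 1:2,
                   a1 = c("a", "k"), b1 = c(1, 4),
                   a2 = c("b", "l"), b2 = c(2, 3),
                   stringsAsFactors = FALSE)
out <- stack_blocks(wide)
print(out)
```

With width = 3 and three names, the same helper would stack triples instead of pairs.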

We can use data.table, assuming A and B always follow each other. I created an example with 2 sets of NAs in the header. With grep we can find the columns that fread has named V8 etc. Thanks to R's recycling of vectors, you can rename multiple headers in one go. If these columns are named differently in your case, change the pattern in the grep command. Then we reshape the data with melt.
library(data.table)
df <- fread("
_id A B A B A B NA NA NA NA
1 a 1 b 2 e 5 NA NA NA NA
2 k 4 l 3 c 3 d 4 e 5",
header = TRUE)
df
_id A B A B A B A B A B
1: 1 a 1 b 2 e 5 <NA> NA <NA> NA
2: 2 k 4 l 3 c 3 d 4 e 5
# assuming A B are always following each other. Can be done in 1 statement.
cols <- names(df)
cols[grep(pattern = "^V", x = cols)] <- c("A", "B")
names(df) <- cols
# melt data (if df is a data.frame, replace df with setDT(df))
df_melted <- melt(df, id.vars = 1,
measure.vars = patterns(c('A', 'B')),
value.name=c('A', 'B'))
df_melted
_id variable A B
1: 1 1 a 1
2: 2 1 k 4
3: 1 2 b 2
4: 2 2 l 3
5: 1 3 e 5
6: 2 3 c 3
7: 1 4 <NA> NA
8: 2 4 d 4
9: 1 5 <NA> NA
10: 2 5 e 5
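The renaming step relies on vector recycling, which can be tried in isolation. This is a minimal base R sketch; the column names below are invented, and rep_len() is used here instead of the bare assignment so that the recycling is explicit and works without a warning even when the number of unnamed columns is odd.

```r
# Made-up header: three named columns followed by auto-named ones
cols <- c("_id", "A", "B", "V4", "V5", "V6", "V7")

# Positions of the auto-generated names (V4, V5, ...)
unnamed <- grep(pattern = "^V", x = cols)

# Recycle c("A", "B") over however many unnamed columns there are
cols[unnamed] <- rep_len(c("A", "B"), length(unnamed))
print(cols)
```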

Thank you for your help; both answers were great inspirations.
Even though @Andre Elrico gave a solution that worked better on the reproducible example, @phiver gave a solution that worked better on my overall problem.
By combining both I came up with the following.
library(data.table)
# The data were in a list of lists called list for this example
temp <- as.data.table(matrix(t(sapply(list, '[', seq(max(sapply(list, length))))),
nrow = m))
# m here is the number of lists in list
cols <- names(temp)
cols[grep(pattern = "^V", x = cols)] <- c("B", "A")
# They need to be in the opposite order because the first column is going to be replaced with id; this way they fall on the correct columns after that
cols[1] <- "id"
names(temp) <- cols
l <- melt.data.table(temp, id.vars = 1,
measure.vars = patterns(c("A", "B")),
value.name = c("A", "B"))
That way I can also use this if I have more than 2 columns that I need to manipulate like that.
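The padding trick inside the sapply() call can be seen on its own: indexing a vector past its end returns NA, so every record is stretched to the length of the longest one. This is a small base R sketch; the records list is invented and only stands in for what mongolite returns.

```r
# Hypothetical ragged records, standing in for the list of lists
records <- list(
  c("1", "a", "1", "b", "2"),
  c("2", "k", "4", "l", "3", "d", "4")
)

# Length of the longest record
width <- max(lengths(records))

# Indexing beyond the end of a vector yields NA, which pads the short records;
# sapply() returns one column per record, so transpose to one row per record
padded <- t(sapply(records, `[`, seq_len(width)))
print(padded)
```

Because every row then has the same length, the result converts cleanly to a data.table or data frame.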

Related

How to do a complex wide-to-long operation for network analysis

I have survey data that includes who the respondent is (iAmX), who they work with (withX), how frequently they work with each partner (freqX), and how satisfied they are with each partner (likeX). Participants can select multiple options for who they are and who they work with.
I would like to go from something like this, with one row per respondent:
df <- read.table(header=T, text='
id iAmA iAmB iAmC withA withB withC freqA freqB freqC likeA likeB likeC
1 X X NA X X NA 3 2 NA 3 2 NA
2 NA NA X X NA NA 5 NA NA 5 NA NA
')
To something like this, with one row per combination, where "from" is who the actor is and "to" is who they work with:
goal <- read.table(header=T, text='
id from to freq like
1 A A 3 3
1 B A 3 3
1 A B 2 2
1 B B 2 2
2 C A 5 5
')
I have tried some melt, gather, and reshape functions but frankly I think I'm just not up to the logic puzzle today. I would really appreciate some help!
Although I must admit I have not fully understood OP's logic, the code below reproduces the expected goal.
The key points here are data.table's incarnation of the melt() function which is able to reshape multiple measure columns simultaneously and the cross join function CJ().
library(data.table)
# reshape multiple measure columns simultaneously
cols <- c("iAm", "with", "freq", "like")
long <- melt(setDT(df), measure.vars = patterns(cols),
value.name = cols, variable.name = "to")[
# rename factor levels
, to := forcats::fct_relabel(to, function(x) LETTERS[as.integer(x)])]
# create combinations for each id
combi <- long[, CJ(from = na.omit(to[iAm == "X"]), to = na.omit(to[with == "X"])), by = id]
# join to append freq and like
result <- combi[long, on = .(id, to), nomatch = 0L][, -c("iAm", "with")]
# reorder result
setorder(result, id)
result
id from to freq like
1: 1 A A 3 3
2: 1 B A 3 3
3: 1 A B 2 2
4: 1 B B 2 2
5: 2 C A 5 5
The intermediate results are
long
id to iAm with freq like
1: 1 A X X 3 3
2: 2 A <NA> X 5 5
3: 1 B X X 2 2
4: 2 B <NA> <NA> NA NA
5: 1 C <NA> <NA> NA NA
6: 2 C X <NA> NA NA
and
combi
id from to
1: 1 A A
2: 1 A B
3: 1 B A
4: 1 B B
5: 2 C A
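For readers who want to see the cross join step without data.table, here is a rough base R analogue for a single respondent using expand.grid(). The vectors i_am and works_with are invented stand-ins for the iAm and with columns of respondent 1; note that expand.grid() varies the first column fastest, so the row order differs from the sorted output of CJ().

```r
# Made-up answers of one respondent: "X" marks a selected option
i_am       <- c(A = "X", B = "X", C = NA)
works_with <- c(A = "X", B = "X", C = NA)

# Keep only the selected options
from <- names(i_am)[!is.na(i_am)]
to   <- names(works_with)[!is.na(works_with)]

# Every (from, to) combination, one row each
pairs <- expand.grid(from = from, to = to, stringsAsFactors = FALSE)
print(pairs)
```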

Appending data frames in R based on column names

I am relatively new to R, so bear with me. I have a list of data frames that I need to combine into one data frame, like so:
dfList <- list(
df1 = data.frame(x=letters[1:2],y=1:2),
df2 = data.frame(x=letters[3:4],z=3:4)
)
comes out as:
$df1
x y
1 a 1
2 b 2
$df2
x z
1 c 3
2 d 4
I want to combine the common columns and add any columns that are not already there. The result would be:
final result
  x y z
1 a 1
2 b 2
3 c   3
4 d   4
Is this even possible?
Yep, it's pretty easy, actually:
library(dplyr)
df_merged <- bind_rows(dfList)
df_merged
x y z
1 a 1 NA
2 b 2 NA
3 c NA 3
4 d NA 4
And if you don't want NA in the empty cells, you can replace them like this:
df_merged[is.na(df_merged)] <- 0 # or whatever you want to replace NA with
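If dplyr is not available, roughly the same result can be put together in base R by first giving every data frame the full set of columns and then calling rbind(). This is only a sketch of the idea, not how bind_rows() is implemented; the names all_cols, filled and combined are made up.

```r
dfList <- list(
  df1 = data.frame(x = letters[1:2], y = 1:2, stringsAsFactors = FALSE),
  df2 = data.frame(x = letters[3:4], z = 3:4, stringsAsFactors = FALSE)
)

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(dfList, names)))

# Add each missing column as NA, then put the columns in a common order
filled <- lapply(dfList, function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- NA
  d[all_cols]
})

# Stack the now-compatible data frames
combined <- do.call(rbind, unname(filled))
rownames(combined) <- NULL
print(combined)
```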
Just using do.call with rbind.fill (from the plyr package):
library(plyr)
do.call(rbind.fill, dfList)
x y z
1 a 1 NA
2 b 2 NA
3 c NA 3
4 d NA 4
You could do that with base function merge():
merge(dfList$df1, dfList$df2, by = "x", all = TRUE)
# x y z
# 1 a 1 NA
# 2 b 2 NA
# 3 c NA 3
# 4 d NA 4
Or with dplyr package with function full_join:
dplyr::full_join(dfList$df1, dfList$df2, by = "x")
# x y z
# 1 a 1 NA
# 2 b 2 NA
# 3 c NA 3
# 4 d NA 4
Both keep every row from either data frame, matching rows on x.
Hope that works for you.

Replacing values from another data frame based on the information in the first column in R

I'm trying to merge information from two different data frames, but the problem begins with uneven dimensions and with wanting to match on the information in a column rather than on the column index. The merge function in R and the joins in dplyr don't work with my data.
I have two data frames (one is a subset of the other, with updated info in the last column):
df1=data.frame(Name = print(LETTERS[1:9]), val = seq(1:3), Case = c("NA","1","NA","NA","1","NA","1","NA","NA"))
Name val Case
1 A 1 NA
2 B 2 1
3 C 3 NA
4 D 1 NA
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 NA
9 I 3 NA
Some rows in the Case column in df1 have to be changed with the info in the df2 below:
df2 = data.frame(Name = c("A","D","H"), val = seq(1:3), Case = "1")
Name val Case
1 A 1 1
2 D 2 1
3 H 3 1
So there's nothing important in the val column, however I added it into the examples since I want to indicate that I have more columns than two and also my real data is way bigger than the examples.
Basically, I want to change specific rows by checking the information in the first columns (in this case, they're unique letters) and in the end I still want to have df1 as a final data frame.
For a better explanation, I want to see something like this:
Name val Case
1 A 1 1
2 B 2 1
3 C 3 NA
4 D 1 1
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 1
9 I 3 NA
Note the changed information for A, D and H.
Thanks.
%in% from base R comes to the rescue.
df1=data.frame(Name = print(LETTERS[1:9]), val = seq(1:3), Case = c("NA","1","NA","NA","1","NA","1","NA","NA"), stringsAsFactors = F)
df2 = data.frame(Name = c("A","D","H"), val = seq(1:3), Case = "1", stringsAsFactors = F)
# index df2$Case with match() so each replacement lines up with its own Name
df1$Case <- ifelse(df1$Name %in% df2$Name, df2$Case[match(df1$Name, df2$Name)], df1$Case)
df1
Output:
> df1
Name val Case
1 A 1 1
2 B 2 1
3 C 3 NA
4 D 1 1
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 1
9 I 3 NA
Here is what I would do using dplyr:
library(dplyr)
df1 %>%
left_join(df2, by = c("Name")) %>%
mutate(val = if_else(is.na(val.y), val.x, val.y),
Case = if_else(is.na(Case.y), Case.x, Case.y)) %>%
select(Name, val, Case)

Combining common IDs in 2 Lists of data tables

I have two lists, each containing a few thousand data tables. The data tables contain id's and each id will only appear once within each list. Additionally, each data table will have different columns, though they will share column names with some other data tables. For example, in my lists created below, id 1 appears in the 1st data table in list1 and the 2nd data table in list2. In the first list id 1 has data for columns 'a' and 'd' and in the second list it has columns for 'a' and 'b'.
library(data.table)
# Create 2 lists of data tables
list1 <- list(data.table(id=c(1,3), a=c(0,0), d=c(1,1)),
data.table(id=c(2,4), b=c(1,0), c=c(2,1), f=c(3,1)),
data.table(id=c(5,6), a=c(4,0), b=c(2,1)))
list2 <- list(data.table(id=c(2,3,6), c=c(0,0,1), d=c(1,1,0), e=c(0,1,2)),
data.table(id=c(1,4,5), a=c(1,0,3), b=c(2,1,2)))
What I need to do is find the id in each list, and average their results.
list id a b d
list1 1 0 NA 1
list2 1 1 2 NA
NA values are treated as 0, so the result for id 1 should be:
id a b d
1 0.5 1 0.5
Next, the top 3 column names are selected and ordered based on their values so that the result is:
id top3
1 b d a
This needs to be repeated for all id's. I have code that can achieve this (below), but for a large list with thousands of data tables and over a million ids it is very slow.
top3 <- data.table() # accumulator for the per-id results (assumed to start empty)
for (i in 1:6){ # i is the id to be searched for
  for (j in 1:length(list1)){
    if (i %in% list1[[j]]$id){
      listnum1 <- j
      rownum1 <- which(list1[[j]]$id==i)
      break
    }
  }
  for (j in 1:length(list2)){
    if (i %in% list2[[j]]$id){
      listnum2 <- j
      rownum2 <- which(list2[[j]]$id==i)
      break
    }
  }
  v1 <- data.table(setDF(list1[[listnum1]])[rownum1,]) # Converting to data.frame using setDF and extracting the row is faster than using data.table
  v2 <- data.table(setDF(list2[[listnum2]])[rownum2,])
  bind <- rbind(v1, v2, fill=TRUE) # Combines two rows and fills in columns they don't have in common
  for (j in 1:ncol(bind)){ # Convert NAs to 0
    set(bind, which(is.na(bind[[j]])), j, 0)
  }
  means <- colMeans(bind[,2:ncol(bind),with=F]) # Average the two rows
  col_ids <- as.data.table(t(names(sort(means)[length(means):(length(means)-2)])))
  # select and order the top 3 ids and bind to a data frame
  top3 <- rbind(top3, cbind(id=i, top3=data.table(do.call("paste", c(col_ids[,1:min(length(col_ids),3),with=F], sep=" ")))))
}
id top3.V1
1: 1 b d a
2: 2 f c d
3: 3 d e c
4: 4 f c b
5: 5 a b
6: 6 e c b
When I run this code on my full data set (which has a few million IDs) it only makes it through about 400 ids after about 60 seconds. It would take days to go through the entire data set. Converting each list into 1 much larger data table is not an option; there are 100,000 possible columns so it becomes too large. Is there a faster way to achieve the desired result?
Melt down the individual data.tables and you won't run into the issue of wasted memory:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id', variable.factor = F))[
# find number of "rows" per id
, nvals := max(rle(sort(variable))$lengths), by = id][
# compute the means, assuming that missing values are equal to 0
, sum(value)/nvals[1], by = .(id, variable)][
# extract top 3 values
order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
# id V1
#1: 1 b a d
#2: 2 f c b
#3: 3 d e a
#4: 4 b c f
#5: 5 a b
#6: 6 e b c
Or instead of rle you can do:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
, .(vals = sum(value), nvals = .N), by = .(id, variable)][
, vals := vals / max(nvals), by = id][
order(-vals), paste(head(variable, 3), collapse = " "), keyby = id]
Or better yet, as Frank points out, don't even bother with the mean:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
, sum(value), by = .(id, variable)][
order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
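The reason the mean can be skipped is that each id's sums are all divided by the same constant, which does not change their ranking. Here is a small base R sketch of the same "reshape to long, sum, take the top 3" idea for readers without data.table. The helper to_long and the cut-down lists are invented for illustration, and plain data frames stand in for the data tables.

```r
# Cut-down, made-up versions of the two lists
list1 <- list(data.frame(id = c(1, 3), a = c(0, 0), d = c(1, 1)),
              data.frame(id = c(5, 6), a = c(4, 0), b = c(2, 1)))
list2 <- list(data.frame(id = c(1, 5), a = c(1, 3), b = c(2, 2)))

# Hypothetical helper: one row per (id, column) pair
to_long <- function(d) {
  vars <- setdiff(names(d), "id")
  data.frame(id = rep(d$id, times = length(vars)),
             variable = rep(vars, each = nrow(d)),
             value = unlist(d[vars], use.names = FALSE),
             stringsAsFactors = FALSE)
}

# Stack everything; absent (id, column) pairs simply have no row,
# which amounts to treating them as 0 in the sum
long <- do.call(rbind, lapply(c(list1, list2), to_long))
totals <- aggregate(value ~ id + variable, data = long, FUN = sum)

# Top 3 column names per id, ranked by the summed value
top3 <- sapply(split(totals, totals$id), function(g) {
  paste(head(g$variable[order(-g$value)], 3), collapse = " ")
})
print(top3)
```

Only the long table is ever materialised, so the 100,000 possible columns never have to exist side by side.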
Not sure about the performance, but this should avoid the for-loop:
library(plyr)
library(dplyr)
a <- ldply(list1, data.frame)
b <- ldply(list2, data.frame)
dat <- full_join(a,b)
This will give you a single data frame:
id a d b c f e
1 1 0 1 NA NA NA NA
2 3 0 1 NA NA NA NA
3 2 NA NA 1 2 3 NA
4 4 NA NA 0 1 1 NA
5 5 4 NA 2 NA NA NA
6 6 0 NA 1 NA NA NA
7 2 NA 1 NA 0 NA 0
8 3 NA 1 NA 0 NA 1
9 6 NA 0 NA 1 NA 2
10 1 1 NA 2 NA NA NA
11 4 0 NA 1 NA NA NA
12 5 3 NA 2 NA NA NA
By summarising based on id:
means <- function(x) mean(x, na.rm=T)
output <- dat %>% group_by(id) %>% summarise_each(funs(means))
id a d b c f e
1 1 0.5 1 2.0 NA NA NA
2 2 NaN 1 1.0 1 3 0
3 3 0.0 1 NaN 0 NaN 1
4 4 0.0 NaN 0.5 1 1 NaN
5 5 3.5 NaN 2.0 NaN NaN NaN
6 6 0.0 0 1.0 1 NaN 2
Listing the top 3 through sapply gives the same result (but as a matrix, with each column corresponding to an id):
sapply(1:nrow(output), function(x) sort(output[x,-1], decreasing=T)[1:3] %>% names)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "b" "f" "d" "c" "a" "e"
[2,] "d" "d" "e" "f" "b" "b"
[3,] "a" "b" "a" "b" NA "c"
** Updated **
Since the data is going to be large, it's prudent to create some functions that can choose and combine appropriate data.frame for each id.
(i) find all the ids present in each list
id_list1 <- lapply(list1, "[[", "id")
id_list2 <- lapply(list2, "[[", "id")
(ii) find which table within each list contains each of the ids 1 to 6
id_l1<-lapply(1:6, function(x) sapply(id_list1, function(y) any(y==x) %>% unlist))
id_l2<-lapply(1:6, function(x) sapply(id_list2, function(y) any(y==x) %>% unlist))
(iii) create a function that combines the appropriate data frames for a specific id
id_who<-function(x){
a <- data.frame(list1[id_l1[[x]]])
a <- a[a$id==x, ]
b <- data.frame(list2[id_l2[[x]]])
b <- b[b$id==x, ]
full_join(a,b)
}
new <- lapply(1:6, id_who)
new
[[1]]
id a d b
1 1 0 1 NA
2 1 1 NA 2
[[2]]
id b c f d e
1 2 1 2 3 NA NA
2 2 NA 0 NA 1 0
[[3]]
id a d c e
1 3 0 1 0 1
[[4]]
id b c f a
1 4 0 1 1 NA
2 4 1 NA NA 0
[[5]]
id a b
1 5 4 2
2 5 3 2
[[6]]
id a b c d e
1 6 0 1 1 0 2
output<-ldply(new, summarise_each, funs(means))
Output will be the same as the above.
The advantage of this process is that you can easily put in logical breaks in the process, either in (ii) or (iii).

Condensing Data Frame in R

I just have a simple question. I really appreciate everyone's input; you have been a great help to my project. I have an additional question about data frames in R.
I have a data frame that looks similar to this:
C <- c("","","","","","","","A","B","D","A","B","D","A","B","D")
D <- c(NA,NA,NA,2,NA,NA,1,1,4,2,2,5,2,1,4,2)
G <- list(C=C,D=D)
T <- as.data.frame(G)
T
C D
1 NA
2 NA
3 NA
4 2
5 NA
6 NA
7 1
8 A 1
9 B 4
10 D 2
11 A 2
12 B 5
13 D 2
14 A 1
15 B 4
16 D 2
I would like to condense all the repeated characters into one, so that it looks similar to this:
J B C E
1 2 1
2 A 1 2 1
3 B 4 5 4
4 D 2 2 2
So of course, the data is all the same, it is just that it is condensed and new columns are formed to hold the data. I am sure there is an easy way to do it, but from the books I have looked through, I haven't seen anything for this!
EDIT: I edited the example because it wasn't working with the answers so far. I wonder if the NAs, blanks, and unevenness from the blanks are contributing?
Here's a reshape solution:
require(reshape)
cast(T, C ~ ., function(x) x)
Changed T to my.df to avoid a bad habit. This returns a list, which may not be what you want, but you can convert from there.
C <- c("A","B","D","A","B","D","A","B","D")
D <- c(1,4,2,2,5,2,1,4,2)
my.df <- data.frame(id=C,val=D)
ret <- function(x) x
by.df <- by(my.df$val,INDICES=my.df$id,ret)
This seems to get the results you are looking for. I'm assuming it's OK to remove the NA values since that matches the desired output you show.
T <- na.omit(T)
T$ind <- ave(1:nrow(T), T$C, FUN = seq_along)
reshape(T, direction = "wide", idvar = "C", timevar = "ind")
# C D.1 D.2 D.3
# 4 2 1 NA
# 8 A 1 2 1
# 9 B 4 5 4
# 10 D 2 2 2
library(reshape2)
dcast(T, C ~ ind, value.var = "D", fill = "")
# C 1 2 3
# 1 2 1
# 2 A 1 2 1
# 3 B 4 5 4
# 4 D 2 2 2
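The key step in the last answer is the within-group counter built with ave(): it numbers the 1st, 2nd, 3rd... occurrence of each label, and that number becomes the column in the wide layout. Here is a minimal base R sketch on made-up vectors (grp, val and the other names are not from the answer), using tapply() for the final spread:

```r
# Made-up labels and values; "A" occurs three times, "B" and "D" twice
grp <- c("A", "B", "D", "A", "B", "D", "A")
val <- c(1, 4, 2, 2, 5, 2, 1)

# Number each occurrence within its group: 1st, 2nd, 3rd, ...
occurrence <- ave(seq_along(grp), grp, FUN = seq_along)

# Spread to wide form: one row per label, one column per occurrence.
# Each cell holds at most one value, so taking the first is enough;
# label/occurrence pairs that never occur are left as NA.
wide <- tapply(val, list(grp, occurrence), FUN = function(v) v[1])
print(wide)
```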
