I want to amend several columns in a consecutive way based on serial ID without rbind as columns headings are not identical - r

Instead of copying and pasting the corresponding columns in Excel, I want to fill in several columns one data set after another, matching rows on a serial ID named addr.
Assume my data sets are like these
df1 <- data.frame(addr=c('a','b','c','d'),
num = c(1,2,3,4),
x=c(1, NA,4,5));df1
df2 <- data.frame(addr=c('e','f','g'),
num=c(100,200,500));df2
var<-intersect(names(df1), names(df2));var
combined.df<-merge(x = df1, y = df2, by = var, all=T);combined.df
df3 <- data.frame(addr=c('e','f','g'),
x=c(5,7,NA));df3
var<-intersect(names(df3), names(combined.df));var
combined.df<-merge(x = combined.df, y = df3, by = var, all=T);combined.df
The current output is
addr x num
1 a 1 1
2 b NA 2
3 c 4 3
4 d 5 4
5 e 5 NA
6 e NA 100
7 f 7 NA
8 f NA 200
9 g NA 500
The desired output is
addr x num
1 a 1 1
2 b NA 2
3 c 4 3
4 d 5 4
5 e 5 100
6 f 7 200
7 g NA 500
i.e.: Overwrite empty columns without deleting prior full cells
Any advice will be greatly appreciated

To automate this with a for loop, put every dataset except the first in a list and copy the first dataset to 'out'. Then loop over the list, merge 'out' with each element using the intersect of the two datasets' names as the by argument, and assign (<-) the result back to 'out'.
out <- df1
lst1 <- list(df2, df3)
for(i in seq_along(lst1)) {
out <- merge(out, lst1[[i]],
by = intersect(names(out), names(lst1[[i]])), all = TRUE)
}
Then we group the output by 'addr' and summarise all other columns, keeping the non-NA value where one exists and returning NA otherwise.
library(dplyr)
out %>%
group_by(addr) %>%
summarise(across(everything(),
~ if(all(is.na(.))) NA_real_ else .[!is.na(.)]), .groups = 'drop')
-output
# addr x num
# <chr> <dbl> <dbl>
#1 a 1 1
#2 b NA 2
#3 c 4 3
#4 d 5 4
#5 e 5 100
#6 f 7 200
#7 g NA 500
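The same merge-then-collapse idea can be sketched in base R without dplyr. This is only an illustration under stated assumptions: merge_all and first_non_na are hypothetical helper names (not from the answer), and each addr is assumed to have at most one non-NA value per column.

```r
# Data from the question
df1 <- data.frame(addr = c("a", "b", "c", "d"), num = c(1, 2, 3, 4), x = c(1, NA, 4, 5))
df2 <- data.frame(addr = c("e", "f", "g"), num = c(100, 200, 500))
df3 <- data.frame(addr = c("e", "f", "g"), x = c(5, 7, NA))

# Hypothetical helper: full outer merge on whatever columns two data sets share
merge_all <- function(a, b) merge(a, b, by = intersect(names(a), names(b)), all = TRUE)
merged <- Reduce(merge_all, list(df1, df2, df3))

# Hypothetical helper: first non-NA value of a vector, or NA if there is none
first_non_na <- function(v) {
  v <- v[!is.na(v)]
  if (length(v)) v[1] else NA
}

# Collapse the duplicated addr rows, one value per column
value_cols <- setdiff(names(merged), "addr")
collapsed <- aggregate(merged[value_cols], by = list(addr = merged$addr), FUN = first_non_na)
collapsed
```

Reduce() plays the role of the for loop, and aggregate() plays the role of group_by() plus summarise().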

Related

Adding an index column representing a repetition of a dataframe in R

I have a dataframe in R that I'd like to repeat several times, and I want to add in a new variable to index those repetitions. The best I've come up with is using mutate + rbind over and over, and I feel like there has to be an efficient dataframe method I could be using here.
Here's an example: df <- data.frame(x = 1:3, y = letters[1:3]) gives us the dataframe
x y
1 a
2 b
3 c
I'd like to repeat that say 3 times, with an index that looks like this:
x y index
1 a 1
2 b 1
3 c 1
1 a 2
2 b 2
3 c 2
1 a 3
2 b 3
3 c 3
Using the rep function, I can get the first two columns, but not the index column. The best I've come up with so far (using dplyr) is:
df2 <-
df %>%
mutate(index = 1) %>%
rbind(df %>% mutate(index = 2)) %>%
rbind(df %>% mutate(index = 3))
This obviously doesn't work if I need to repeat my dataframe more than a handful of times. It feels like the kind of thing that should be easy to do using dataframe methods, but I haven't been able to find anything.
Grateful for any tips!
You can use this code for as many copies of the data frame as you like; you just have to set the n argument.
replicate() takes two main arguments: n, the number of times to reproduce the data set, and expr, the data set itself. With simplify = FALSE the result is a list whose elements are copies of the data set.
We then pass that list to imap_dfr() from the purrr package to give each copy its own id. .x is each element of the list (here a data frame) and .y is the position of that element, so the first copy gets id = 1, the second id = 2, and so on.
library(dplyr)
library(purrr)
replicate(3, df, simplify = FALSE) %>%
imap_dfr(~ .x %>%
mutate(id = .y))
x y id
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
In base R you can use the following code:
do.call(rbind,
mapply(function(x, z) {
x$id <- z
x
}, replicate(3, df, simplify = FALSE), 1:3, SIMPLIFY = FALSE))
x y id
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
You can use rerun to repeat the dataframe n times and add an index column using bind_rows (rerun() is deprecated as of purrr 1.0.0; replicate(n, df, simplify = FALSE) builds the same list) -
library(dplyr)
library(purrr)
n <- 3
df <- data.frame(x = 1:3, y = letters[1:3])
bind_rows(rerun(n, df), .id = 'index')
# index x y
#1 1 1 a
#2 1 2 b
#3 1 3 c
#4 2 1 a
#5 2 2 b
#6 2 3 c
#7 3 1 a
#8 3 2 b
#9 3 3 c
In base R, we can repeat the row index 3 times.
transform(df[rep(1:nrow(df), n), ], index = rep(1:n, each = nrow(df)))
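The row-index trick above can be wrapped in a small function. This is a sketch only; repeat_with_index is a hypothetical name, and seq_len() is used instead of 1:nrow(df) so that an empty data frame does not produce the index c(1, 0).

```r
# Hypothetical helper: stack n copies of df and label each copy
repeat_with_index <- function(df, n) {
  rows <- rep(seq_len(nrow(df)), times = n)      # 1,2,3,1,2,3,...
  out <- df[rows, , drop = FALSE]
  out$index <- rep(seq_len(n), each = nrow(df))  # 1,1,1,2,2,2,...
  rownames(out) <- NULL                          # drop the 1.1, 2.1 style row names
  out
}

df <- data.frame(x = 1:3, y = letters[1:3])
res <- repeat_with_index(df, 3)
res
```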
One more way
n <- 3
map_dfr(seq_len(n), ~ df %>% mutate(index = .x))
x y index
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3

Binding lists of data frames while preserving index

There are many posts on converting a list of data frames to a single data frame (ex. here). However, I need to do that while preserving the information about which list element the resulting rows came from, i.e. I need to preserve an index of the original list. If the list is unnamed, I need just an index number, while if the list is named I need to save the name of the original list element. How can I do that?
With the data like below:
foo <- list(data.frame(x=c('a', 'b', 'c'),y = c(1,2,3)),
data.frame(x=c('d', 'e', 'f'),y = c(4,5,6)))
I need an output like this:
index x y
1 1 a 1
2 1 b 2
3 1 c 3
4 2 d 4
5 2 e 5
6 2 f 6
while with named elements of a list:
foo <- list(df1 = data.frame(x=c('a', 'b', 'c'),y = c(1,2,3)),
df2 = data.frame(x=c('d', 'e', 'f'),y = c(4,5,6)))
the output would be:
index x y
1 df1 a 1
2 df1 b 2
3 df1 c 3
4 df2 d 4
5 df2 e 5
6 df2 f 6
We can use imap to get the index (mutate and %>% come from dplyr)
library(dplyr)
library(purrr)
imap_dfr(foo, ~ .x %>%
mutate(index = .y))
Or with map
map_dfr(foo, .f = cbind, .id = 'index')
# index x y
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 2 d 4
#5 2 e 5
#6 2 f 6
Or use Map from base R, where we loop through the elements of 'foo' and a corresponding sequence of 'foo', cbind to create a new column and then rbind the list elements
do.call(rbind, Map(cbind, index = seq_along(foo), foo))
# index x y
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 2 d 4
#5 2 e 5
#6 2 f 6
You could name the list if it is unnamed and then loop over them using lapply to create list of dataframes and rbind them together as one dataframe.
names(foo) <- if (is.null(names(foo))) seq_len(length(foo)) else names(foo)
do.call("rbind", lapply(names(foo), function(x) cbind(index = x, foo[[x]])))
# index x y
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 2 d 4
#5 2 e 5
#6 2 f 6
For named list
do.call("rbind", lapply(names(foo), function(x) cbind(index = x, foo[[x]])))
# index x y
#1 df1 a 1
#2 df1 b 2
#3 df1 c 3
#4 df2 d 4
#5 df2 e 5
#6 df2 f 6
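Both cases can be handled by one base R helper that does not modify foo in place. This is a sketch under the assumption that every list element is a data frame with the same columns; bind_with_index is a hypothetical name.

```r
# Hypothetical helper: row-bind a list of data frames, keeping the element
# name (or the position, for an unnamed list) in an 'index' column
bind_with_index <- function(lst) {
  ids <- if (is.null(names(lst))) seq_along(lst) else names(lst)
  pieces <- Map(function(d, id) cbind(index = id, d), lst, ids)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

unnamed <- list(data.frame(x = c("a", "b", "c"), y = c(1, 2, 3)),
                data.frame(x = c("d", "e", "f"), y = c(4, 5, 6)))
named <- list(df1 = unnamed[[1]], df2 = unnamed[[2]])

res_unnamed <- bind_with_index(unnamed)
res_named <- bind_with_index(named)
```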

merge columns that have the same name r

I am working in R with a dataset that is created from mongodb with the use of mongolite.
I am getting a list that looks like so:
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4
I would like to merge the dataset to look like this:
_id A B
1 a 1
2 k 4
1 b 2
2 l 3
1 e 5
2 c 3
1 NA NA
2 d 4
The NAs in the last columns are there because the columns are named from the first entry and if a later entry has more columns than that they don't get names assigned to them, (if I get help for this as well it would be awesome but it's not the reason I am here).
Also the number of columns might differ for different subsets of the dataset.
I have tried melt() but since it is a list and not a dataframe it doesn't work as expected. I have tried stack() but it didn't work because the columns have the same name and some of them don't even have a name.
I know this is a very weird situation and appreciate any help.
Thank you.
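Using library(magrittr) for the pipe and library(data.table) for fread() and setDF().
data:
library(data.table)
library(magrittr)
df <- fread("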
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4 ",header=T)
setDF(df)
Code:
df2 <- df[,-1]
odds<- df2 %>% ncol %>% {(1:.)%%2} %>% as.logical
even<- df2 %>% ncol %>% {!(1:.)%%2}
cbind(df[,1,drop=F],
A=unlist(df2[,odds]),
B=unlist(df2[,even]),
row.names=NULL)
result:
# _id A B
# 1 1 a 1
# 2 2 k 4
# 3 1 b 2
# 4 2 l 3
# 5 1 e 5
# 6 2 c 3
# 7 1 <NA> NA
# 8 2 d 4
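The odd/even column split can also be written in plain base R. The sketch below makes some assumptions that are not in the question: the wide data is already a data frame, the value columns strictly alternate A, B, A, B, and the unique column names (A1, B1, ...) are invented so the example can be built with data.frame().

```r
# Hypothetical wide data with the same values as the question
wide <- data.frame(id = c(1, 2),
                   A1 = c("a", "k"), B1 = c(1, 4),
                   A2 = c("b", "l"), B2 = c(2, 3),
                   A3 = c("e", "c"), B3 = c(5, 3),
                   A4 = c(NA, "d"),  B4 = c(NA, 4),
                   stringsAsFactors = FALSE)

vals <- wide[-1]                     # everything except the id column
is_a <- seq_along(vals) %% 2 == 1    # TRUE for the 1st, 3rd, 5th, ... value column

# unlist() concatenates the selected columns one after another
long <- data.frame(id = rep(wide$id, times = sum(is_a)),
                   A = unlist(vals[is_a], use.names = FALSE),
                   B = unlist(vals[!is_a], use.names = FALSE),
                   stringsAsFactors = FALSE)
long
```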
We can use data.table, assuming A and B always follow each other. I created an example with 2 sets of NAs in the header. With grep we can find the columns that fread has named V8 etc. Because R recycles vectors, you can rename multiple headers in one go. If in your case these are named differently, change the pattern in the grep command. Then we reshape the data with melt.
library(data.table)
df <- fread("
_id A B A B A B NA NA NA NA
1 a 1 b 2 e 5 NA NA NA NA
2 k 4 l 3 c 3 d 4 e 5",
header = TRUE)
df
_id A B A B A B A B A B
1: 1 a 1 b 2 e 5 <NA> NA <NA> NA
2: 2 k 4 l 3 c 3 d 4 e 5
# assuming A B are always following each other. Can be done in 1 statement.
cols <- names(df)
cols[grep(pattern = "^V", x = cols)] <- c("A", "B")
names(df) <- cols
# melt data (if df is a data.frame, replace df with setDT(df))
df_melted <- melt(df, id.vars = 1,
measure.vars = patterns(c('A', 'B')),
value.name=c('A', 'B'))
df_melted
_id variable A B
1: 1 1 a 1
2: 2 1 k 4
3: 1 2 b 2
4: 2 2 l 3
5: 1 3 e 5
6: 2 3 c 3
7: 1 4 <NA> NA
8: 2 4 d 4
9: 1 5 <NA> NA
10: 2 5 e 5
Thank you for your help, they were great inspirations.
Even though @Andre Elrico gave a solution that worked better on the reproducible example, @phiver gave a solution that worked better on my overall problem.
By using both those I came up with the following.
library(data.table)
#The data were in a list of lists called list for this example
temp <- as.data.table(matrix(t(sapply(list, '[', seq(max(sapply(list, length))))),
nrow = m))
# m here is the number of lists in list
cols <- names(temp)
cols[grep(pattern = "^V", x = cols)] <- c("B", "A")
#They need to be the opposite way because the first column is going to be substituted with id, and this way they fall on the correct column after that
cols[1] <- "id"
names(temp) <- cols
l <- melt.data.table(temp, id.vars = 1,
measure.vars = patterns(c("A", "B")),
value.name = c("A", "B"))
That way I can use this also if I have more than 2 columns that I need to manipulate like that.

Replacing the values from another data from based on the information in the first column in R

I'm trying to merge information from two different data frames, but the problem is that their dimensions differ and I want to match on the values in a column rather than on the column index. The merge function in R and the joins in dplyr don't work with my data.
I have two data frames (one is a subset of the other, with updated info in the last column):
df1=data.frame(Name = print(LETTERS[1:9]), val = seq(1:3), Case = c("NA","1","NA","NA","1","NA","1","NA","NA"))
Name val Case
1 A 1 NA
2 B 2 1
3 C 3 NA
4 D 1 NA
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 NA
9 I 3 NA
Some rows in the Case column in df1 have to be changed with the info in the df2 below:
df2 = data.frame(Name = c("A","D","H"), val = seq(1:3), Case = "1")
Name val Case
1 A 1 1
2 D 2 1
3 H 3 1
So there's nothing important in the val column, however I added it into the examples since I want to indicate that I have more columns than two and also my real data is way bigger than the examples.
Basically, I want to change specific rows by checking the information in the first columns (in this case, they're unique letters) and in the end I still want to have df1 as a final data frame.
for a better explanation, I want to see something like this:
Name val Case
1 A 1 1
2 B 2 1
3 C 3 NA
4 D 1 1
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 1
9 I 3 NA
Note changed information for A,D and H.
Thanks.
%in% from base R comes to the rescue.
df1=data.frame(Name = print(LETTERS[1:9]), val = seq(1:3), Case = c("NA","1","NA","NA","1","NA","1","NA","NA"), stringsAsFactors = F)
df2 = data.frame(Name = c("A","D","H"), val = seq(1:3), Case = "1", stringsAsFactors = F)
df1$Case <- ifelse(df1$Name %in% df2$Name, df2$Case[df2$Name %in% df1$Name], df1$Case)
df1
Output:
> df1
Name val Case
1 A 1 1
2 B 2 1
3 C 3 NA
4 D 1 1
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 1
9 I 3 NA
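One caveat: the ifelse() call above recycles the three values of df2$Case across all nine rows, so it only gives the right answer because every replacement value is "1". A sketch that looks up each replacement by name with match() is shown below. The df2 values "2" and "3" are invented for illustration so the lookup is visible, and Case uses real NA values rather than the string "NA".

```r
df1 <- data.frame(Name = LETTERS[1:9], val = rep(1:3, 3),
                  Case = c(NA, "1", NA, NA, "1", NA, "1", NA, NA),
                  stringsAsFactors = FALSE)
# Hypothetical update table with distinct values per name
df2 <- data.frame(Name = c("A", "D", "H"), val = 1:3,
                  Case = c("1", "2", "3"),
                  stringsAsFactors = FALSE)

idx <- match(df1$Name, df2$Name)       # row of df2 for each df1 name, NA if absent
hit <- !is.na(idx)
df1$Case[hit] <- df2$Case[idx[hit]]    # overwrite only the matched rows
df1
```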
Here is what I would do using dplyr:
library(dplyr)
df1 %>%
left_join(df2, by = c("Name")) %>%
mutate(val = if_else(is.na(val.y), val.x, val.y),
Case = if_else(is.na(Case.y), Case.x, Case.y)) %>%
select(Name, val, Case)

Merge and sum two different data.tables [duplicate]

I've 2 different data.tables. I need to merge and sum based on a row values. The examples of two tables are given as Input below and expected output shown below.
Input
Table 1
X A B
A 3
B 4 6
C 5
D 9 12
Table 2
X A B
A 1 5
B 6 8
C 7 14
D 5
E 1 1
F 2 3
G 5 6
Expected Output:
X A B
A 4 5
B 10 14
C 12 14
D 14 12
E 1 1
F 2 3
G 5 6
We can do this by rbinding the two tables and then do a group by sum
library(data.table)
rbindlist(list(df1, df2))[, lapply(.SD, sum, na.rm = TRUE), by = X]
# X A B
#1: A 4 5
#2: B 10 14
#3: C 12 14
#4: D 14 12
#5: E 1 1
#6: F 2 3
#7: G 5 6
Or using a similar approach with dplyr
library(dplyr)
bind_rows(df1, df2) %>%
group_by(X) %>%
# across() replaces the deprecated summarise_all(funs(...)) idiom
summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
Note: Here, we assume that the blanks are NA and the 'A' and 'B' columns are numeric/integer class
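Merge your tables together first, then do the sum. If you later want to drop the individual values you can do so easily.
out <- merge(df1, df2, by = "X", all = TRUE)
# merge() suffixes the clashing columns (A.x, A.y, B.x, B.y), so sum each pair
out$A <- rowSums(cbind(out$A.x, out$A.y), na.rm = TRUE)
out$B <- rowSums(cbind(out$B.x, out$B.y), na.rm = TRUE)
out$A.x <- out$A.y <- out$B.x <- out$B.y <- NULL # drop original values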
The code below does the required job for all numeric columns at once
library(dplyr)
Table = Table1 %>% full_join(Table2) %>%
group_by(X) %>% summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
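For comparison, the stack-then-sum approach can be sketched in base R with rbind() and aggregate(). The data frames t1 and t2 are hypothetical names that recreate the question's tables, with the blank cells entered as NA as the note above assumes.

```r
t1 <- data.frame(X = c("A", "B", "C", "D"),
                 A = c(3, 4, 5, 9),
                 B = c(NA, 6, NA, 12))
t2 <- data.frame(X = c("A", "B", "C", "D", "E", "F", "G"),
                 A = c(1, 6, 7, 5, 1, 2, 5),
                 B = c(5, 8, 14, NA, 1, 3, 6))

stacked <- rbind(t1, t2)   # both tables share the same column names
# Sum A and B within each X; na.rm = TRUE is passed through to sum()
totals <- aggregate(stacked[c("A", "B")], by = list(X = stacked$X),
                    FUN = sum, na.rm = TRUE)
totals
```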
