adding unique rows from one data frame to another - r

I have a data frame which comprises a subset of records contained in a 2nd data frame. I would like to add the record rows of the 2nd data frame that are not common in the first data frame to the first... Thank you.

If you want all unique rows from both dataframes, this would work:
df1 <- data.frame(X = c('A','B','C'), Y = c(1,2,3))
df2 <- data.frame(X = 'A', Y = 1)
df <- rbind(df1,df2)
no.dupes <- df[!duplicated(df),]
no.dupes
# X Y
#1 A 1
#2 B 2
#3 C 3
But it won't work if there are duplicate rows in either data frame that you want to preserve.
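If duplicates within either data frame do need to be preserved, one option is to append only those rows of df2 that have no match in df1. The following is a minimal base-R sketch, not the answerer's code. row_key is a hypothetical helper that pastes each row into a single string, and it assumes both data frames have the same columns in the same order and that no value contains the separator character.

```r
# df1 deliberately contains a duplicated row that should survive.
df1 <- data.frame(X = c("A", "A", "B"), Y = c(1, 1, 2))
df2 <- data.frame(X = c("A", "B", "C"), Y = c(1, 2, 3))

# Hypothetical helper: collapse each row into one string key.
row_key <- function(d) do.call(paste, c(d, sep = "\r"))

# Keep only the rows of df2 whose key does not occur in df1.
new_rows <- df2[!(row_key(df2) %in% row_key(df1)), , drop = FALSE]

# Append them; the duplicated (A, 1) row in df1 is left untouched.
combined <- rbind(df1, new_rows)
rownames(combined) <- NULL
combined
```

Here only the (C, 3) row is appended, so combined has four rows.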

You should look at dplyr's distinct() and bind_rows() functions.
Better still, provide some dummy data to work on and the expected output.
Suppose you have two data frames, a and b, and you want to add the unique rows of a to b:
a = data.frame(
x = c(1,2,3,1,4,3),
y = c(5,2,3,5,3,3)
)
b = data.frame(
x = c(6,2,2,3,3),
y = c(19,13,12,3,1)
)
library(dplyr)
distinct(a) %>% bind_rows(.,b)
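For readers without dplyr, roughly the same result can be sketched in base R with unique() and rbind(). This is an assumed equivalent for this example data, not part of the original answer.

```r
a <- data.frame(x = c(1, 2, 3, 1, 4, 3), y = c(5, 2, 3, 5, 3, 3))
b <- data.frame(x = c(6, 2, 2, 3, 3), y = c(19, 13, 12, 3, 1))

# Drop repeated rows within a only, then stack b underneath.
a_unique <- unique(a)
result <- rbind(a_unique, b)
rownames(result) <- NULL
result
```

Note that only a is de-duplicated. The row (3, 3) still appears twice in the result because it also occurs in b.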

Related

Adding a column to every dataframe in a list with the name of the list element

I have a list containing multiple data frames, and each list element has a unique name. The structure is similar to this dummy data
a <- data.frame(z = rnorm(20), y = rnorm(20))
b <- data.frame(z = rnorm(30), y = rnorm(30))
c <- data.frame(z = rnorm(40), y = rnorm(40))
d <- data.frame(z = rnorm(50), y = rnorm(50))
my.list <- list(a,b,c,d)
names(my.list) <- c("a","b","c","d")
I want to create a column in each of the data frames that holds the name of its respective list element. My goal is to merge all the list elements into a single data frame and still know which data frame each row came from originally. The end result I'm looking for is something like this:
z y group
1 0.6169132 0.09803228 a
2 1.1610584 0.50356131 a
3 0.6399438 0.84810547 a
4 1.0878453 1.00472105 b
5 -0.3137200 -1.20707112 b
6 1.1428834 0.87852556 b
7 -1.0651735 -0.18614224 c
8 1.1629891 -0.30184443 c
9 -0.7980089 -0.35578381 c
10 1.4651651 -0.30586852 d
11 1.1936547 1.98858128 d
12 1.6284174 -0.17042835 d
My first thought was to use mutate to assign the list element name to a column in each respective data frame, but it appears that when used within lapply, names() refers to the column names, not the list element names
test <- lapply(my.list, function(x) mutate(x, group = names(x)))
Error: Column `group` must be length 20 (the number of rows) or one, not 2
Any suggestions as to how I could approach this problem?
There is no need to mutate; just bind the list using dplyr's bind_rows:
library(tidyverse)
my.list %>%
bind_rows(.id = "groups")
Obviously requires that the list is named.
We can use Map from base R
Map(cbind, my.list, group = names(my.list))
Or with imap from purrr
library(dplyr)
library(purrr)
imap(my.list, ~ .x %>% mutate(group = .y))
Or if the intention is to create a single data.frame
library(data.table)
rbindlist(my.list, idcol = 'groups')
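The original lapply attempt fails because the function only receives each data frame, so names(x) returns its column names. Iterating over the list names instead avoids that. The following is a minimal base-R sketch using smaller dummy data than the question; the object names tagged and combined are illustrative only.

```r
set.seed(42)
my.list <- list(
  a = data.frame(z = rnorm(3), y = rnorm(3)),
  b = data.frame(z = rnorm(2), y = rnorm(2))
)

# Loop over the element names so each name is available inside the function.
tagged <- lapply(names(my.list), function(nm) {
  cbind(my.list[[nm]], group = nm)
})

# Stack the tagged data frames into one.
combined <- do.call(rbind, tagged)
combined
```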

R- How to rearrange rows in a data frame with a foreign key from another data frame

I'm having a bit of trouble trying to figure out how to rearrange rows in a data frame in R.
I have two data frames whose rows are in different orders, and both have an ID that identifies the tuples.
Now I would like to reorder data frame 1 (ID 1) so that it is in the same order as data frame 2 (ID2).
Many thanks in advance.
Create a column of ascending integers in data frame 2 to encode the ordering. Then merge that column to data frame 1 and sort on it.
library(dplyr)
df1 <- tibble(
id = c(1, 2, 3),
col1 = c('a', 'b', 'c')
)
df2 <- tibble(
id = c(3, 1, 2),
col2 = c('c', 'a', 'b')
)
df2$ordering <- seq_len(nrow(df2))
df1_ordered <- df1 %>%
left_join(df2, by = 'id') %>%
arrange(ordering)
We can use match to match the IDs and then reorder df1 based on the result. Using @Chris's data:
df1[match(df2$id, df1$id),]
# id col1
# <dbl> <chr>
#1 3 c
#2 1 a
#3 2 b
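To see why this works, it can help to look at the index vector that match() produces. The sketch below uses plain data frames and made-up IDs (10, 20, 30) so that the positions are easy to tell apart from the ID values.

```r
df1 <- data.frame(id = c(10, 20, 30), col1 = c("a", "b", "c"))
df2 <- data.frame(id = c(30, 10, 20), col2 = c("c", "a", "b"))

# For each id in df2, find its row position in df1.
idx <- match(df2$id, df1$id)
idx
# 3 1 2

# Indexing df1 with those positions puts it in df2's order.
df1_ordered <- df1[idx, ]
df1_ordered$id
# 30 10 20
```

If df2 contains an ID that is missing from df1, match() returns NA for it and the corresponding row of the result is filled with NA, so it may be worth checking anyNA(idx) first.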

Subset columns of one data frame (by name) by information from second data frame

I would appreciate a solution for the following problem: I have the following example data frame:
df1 = data_frame(Tom = c(1,2,3,4), Tina = c(5,6,7,8), Todd = c(9,10,11,12), Brit = c(1,2,3,4))
I have a second data frame with information about Tom, Tina etc.
df2 = data_frame(ID = c("Tom","Todd","Tina","Brit"), value = c(1,3,2,1))
Now I would like to subset columns from data frame df1 if the "value" in df2 fulfils a particular condition, e.g. df2$value == 1 | df2$value == 2
The resulting table should look like:
desired_result_look_like = data_frame(Tom = c(1,2,3,4), Tina = c(5,6,7,8), Brit = c(1,2,3,4))
Thanks for your help.
Because you're using row values in one data frame to select the columns in another data frame, the solution isn't particularly clean, but if you wanted to stick with this approach, you could create a third data frame that filters the second data frame based on your conditions, then select the column names in the first data frame that correspond with values in the filtered data frame. The code would look something like this:
library(dplyr)
df2_filtered <- df2 %>% filter(value == 1 | value == 2)
desired_result <- df1[ , colnames(df1) %in% df2_filtered$ID]
(This is operating under the assumption that in your posted "desired result", you meant to include Tina instead of Todd)
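The same idea can be sketched without dplyr by building a character vector of the names to keep. This version uses base data frames rather than the question's data_frame() calls; keep and result are illustrative names.

```r
df1 <- data.frame(Tom = c(1, 2, 3, 4), Tina = c(5, 6, 7, 8),
                  Todd = c(9, 10, 11, 12), Brit = c(1, 2, 3, 4))
df2 <- data.frame(ID = c("Tom", "Todd", "Tina", "Brit"),
                  value = c(1, 3, 2, 1), stringsAsFactors = FALSE)

# IDs whose value meets the condition.
keep <- df2$ID[df2$value %in% c(1, 2)]

# intersect() keeps df1's column order and ignores IDs with no matching column.
result <- df1[, intersect(names(df1), keep), drop = FALSE]
names(result)
# "Tom" "Tina" "Brit"
```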

Matching data from unequal length data frames in r

This seems like it should be really simple. I have two data frames of unequal length in R. One is simply a random subset of the larger data set, so they contain exactly the same data and share an identical UniqueID. What I would like to do is put an indicator, say a 0 or 1, in the larger data set that says whether the row is in the smaller data set.
I can use which(long$UniqID %in% short$UniqID) but I can't seem to figure out how to match this indicator back to the long data set
Made some sample data.
long<-data.frame(UniqID=sample(letters[1:20],20))
short<-data.frame(UniqID=sample(letters[1:20],10))
You can use %in% without which() to get values TRUE and FALSE and then with as.numeric() convert them to 0 and 1.
long$sh<-as.numeric(long$UniqID %in% short$UniqID)
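If you would rather keep the which() call from the question, the row positions it returns can be used to fill in the indicator. This is a small sketch under the same sample-data assumptions; hits is an illustrative name.

```r
set.seed(1)
long <- data.frame(UniqID = sample(letters[1:20], 20))
short <- data.frame(UniqID = sample(letters[1:20], 10))

# which() gives the row positions in long that also occur in short.
hits <- which(long$UniqID %in% short$UniqID)

# Start with all zeros, then flag the matching rows.
long$sh <- 0
long$sh[hits] <- 1
```

The result is the same as the as.numeric() one-liner, just built in two steps.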
I'll use @AnandaMahto's data to illustrate another way using duplicated, which works whether or not you have a unique ID.
Case 1: Has unique id column
set.seed(1)
df1 <- data.frame(ID = 1:10, A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
transform(df1, indicator = 1 * duplicated(rbind(df2, df1)[, "ID",
drop=FALSE])[-seq_len(nrow(df2))])
Case 2: Has no unique id column
set.seed(1)
df1 <- data.frame(A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
transform(df1, indicator = 1 * duplicated(rbind(df2, df1))[-seq_len(nrow(df2))])
The answers so far are good. However, a question was raised: "What if there wasn't a 'UniqID' column?"
At that point, perhaps merge can be of assistance:
Here's an example using merge and %in% where an ID is available:
set.seed(1)
df1 <- data.frame(ID = 1:10, A = rnorm(10), B = rnorm(10))
df2 <- df1[sample(10, 4), ]
temp <- merge(df1, df2, by = "ID")$ID
df1$matches <- as.integer(df1$ID %in% temp)
And, a similar example where an ID isn't available.
set.seed(1)
df1_NoID <- data.frame(A = rnorm(10), B = rnorm(10))
df2_NoID <- df1_NoID[sample(10, 4), ]
temp <- merge(df1_NoID, df2_NoID, by = "row.names")$Row.names
df1_NoID$matches <- as.integer(rownames(df1_NoID) %in% temp)
You can directly use the logical vector as a new column:
long$Indicator <- 1*(long$UniqID %in% short$UniqID)
See if this can get you started:
long <- data.frame(UniqID=sample(1:100)) #creating a long data frame
short <- data.frame(UniqID=long[sample(1:100, 30), ]) #creating a short one with the same ids.
long$indicator <- long$UniqID %in% short$UniqID #creating an indicator column in long.
> head(long)
UniqID indicator
1 87 TRUE
2 15 TRUE
3 100 TRUE
4 40 FALSE
5 89 FALSE
6 21 FALSE

Extract factor column from data frame

My data frame is breaking when I extract some rows from a factor column:
data.df = data.frame(x = factor(letters[1:10]))
data.temp = data.df[1:3, ]
print(data.temp)
How can I avoid that? I also need the column name to be kept. Thanks!
You can add the argument drop=FALSE to keep the result as a data frame.
data.df = data.frame(x = factor(letters[1:10]))
data.temp = data.df[1:3, ,drop=FALSE]
print(data.temp)
x
1 a
2 b
3 c
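The difference is easy to confirm by checking the class of each result. In this short sketch, dropped and kept are illustrative names.

```r
data.df <- data.frame(x = factor(letters[1:10]))

# With a single column, [ simplifies the result to a bare factor by default.
dropped <- data.df[1:3, ]
class(dropped)
# "factor"

# drop = FALSE keeps the data frame structure and the column name.
kept <- data.df[1:3, , drop = FALSE]
class(kept)
# "data.frame"
names(kept)
# "x"
```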
