R two table merge - r

I have two data.frames, df1 and df2:
df1=data.frame(id=c(1,2,2),var1=c(3,5,5),var3=c(2,3,4))
df2=data.frame(id=c(1,1,2,2),var1=c('NONE','NONE','NONE','NONE'),var3=c(2,4,6,5))
Now I want to merge them into one data.frame. First, I need to recode df2$var1 with the value of df1$var1 wherever df2$id matches df1$id. For example, df1 has var1=3 for id=1, so every row of df2 with id=1 should get var1=3. The result should look like this:
df1=data.frame(id=c(1,2,2),var1=c(3,5,5),var3=c(2,3,4))
df2=data.frame(id=c(1,1,2,2),var1=c(3,3,5,5),var3=c(2,4,6,5))
Second, I want to merge the two data.frames and drop the duplicate rows. The result should look like this:
df=data.frame(id=c(1,1,2,2,2,2),var1=c(3,3,5,5,5,5),var2=c(2,4,3,4,6,5))
Sorry, this is my first time using Stack Overflow, and English isn't my native language.

library(dplyr)
union_all(df1, df2) %>%
distinct() %>%
arrange(id, var1)
id var1 var3
1 1 3 2
2 1 3 4
3 2 5 3
4 2 5 4
5 2 5 6
6 2 5 5
First I tried dplyr::union, but it changed the row order.
So in the end I used union_all and then sorted the result with arrange.

I think this is what you want.
library(sqldf)
sqldf("select b.id, a.var1, b.var3 from df1 a left join df2 b on a.id = b.id")
id var1 var3
1 1 3 2
2 1 3 4
3 2 5 5
4 2 5 6
5 2 5 5
6 2 5 6
This is the same as the example you gave of your desired result, except for the 3rd column of the 3rd and 4th row. I believe that is due to a typo in your example, however if I am mistaken about this please let me know (and just explain why those values would be different and I will update my answer accordingly).
By the way, there are multiple ways to do this, but I find this one to be quick and easy.

With merge:
# look up var1 by id value (match) rather than by row position
df2$var1 <- df1$var1[match(df2$id, df1$id)]
df2
id var1 var3
1 1 3 2
2 1 3 4
3 2 5 6
4 2 5 5
df <- merge(df1, df2, by='id')[-2:-3]
df
id var1.y var3.y
1 1 3 2
2 1 3 4
3 2 5 6
4 2 5 5
5 2 5 6
6 2 5 5
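For comparison, both steps from the question can be sketched in base R without extra packages. This is a minimal sketch, assuming that every id in df2 also occurs in df1 and that all df1 rows with the same id share one var1 value; the name combined is made up for illustration.

```r
# Data from the question
df1 <- data.frame(id = c(1, 2, 2), var1 = c(3, 5, 5), var3 = c(2, 3, 4))
df2 <- data.frame(id = c(1, 1, 2, 2),
                  var1 = c("NONE", "NONE", "NONE", "NONE"),
                  var3 = c(2, 4, 6, 5))

# Step 1: recode df2$var1 by looking up each id in df1
# (match() returns the first df1 row with that id)
df2$var1 <- df1$var1[match(df2$id, df1$id)]

# Step 2: stack both data frames, drop duplicate rows, sort by id
combined <- unique(rbind(df1, df2))
combined <- combined[order(combined$id), ]
rownames(combined) <- NULL
combined
```

Because order() keeps tied rows in their original order, the df1 rows stay ahead of the df2 rows within each id.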

Related

How to assign values in one column to other columns in wide data using R

There is a wide data set, a simple example is
df<-data.frame("id"=c(1:6),
"ax"=c(1,2,2,3,4,4),
"bx"=c(7,8,8,9,10,10),
"cx"=c(11,12,12,13,14,14))
I'm looking for a way to assign the values in "ax" to columns "bx" and "cx". Imagine we have thousands of columns that we intend to replace with "ax", so I want this to be done in an automated way using R. The expected output looks like
df<-data.frame("id"=c(1:6),
"ax"=c(1,2,2,3,4,4),
"bx"=c(1,2,2,3,4,4),
"cx"=c(1,2,2,3,4,4))
I've thought of, and tried, using mutate_at and ends_with, but this has not worked for me. For example, I tried
df %>%
mutate_at(vars(ends_with("x")), labels = "ax")
and this prints an error. Not sure what's wrong or what's to be added to get this working, so I would like to request your help on this. Thank you very much!
A simple way using base R would be :
change_cols <- grep('x$', names(df))
df[change_cols] <- df$ax
df
# id ax bx cx
#1 1 1 1 1
#2 2 2 2 2
#3 3 2 2 2
#4 4 3 3 3
#5 5 4 4 4
#6 6 4 4 4
I would suggest this tidyverse approach using across() to select the range of variables you want:
library(tidyverse)
#Data
df<-data.frame("id"=c(1:6),
"ax"=c(1,2,2,3,4,4),
"bx"=c(7,8,8,9,10,10),
"cx"=c(11,12,12,13,14,14))
#Mutate
df %>% mutate(across(c(bx:cx), ~ ax))
Output:
id ax bx cx
1 1 1 1 1
2 2 2 2 2
3 3 2 2 2
4 4 3 3 3
5 5 4 4 4
6 6 4 4 4
Another option with mutate_at()
df %>%
mutate_at(vars(matches("x$")), ~ax)
# id ax bx cx
# 1 1 1 1 1
# 2 2 2 2 2
# 3 3 2 2 2
# 4 4 3 3 3
# 5 5 4 4 4
# 6 6 4 4 4

Merge multiple data frames with partially matching rows

I have data frames with lists of elements such as NAMES. The data frames contain different names, but most of the names match. I'd like to combine all of them into one list in which I can see whether a name is missing from any of the data frames.
DATA sample for df1:
X x
1 1 rh_Structure/Focus_S
2 2 rh_Structure/Focus_C
3 3 lh_Structure/Focus_S
4 4 lh_Structure/Focus_C
5 5 RH_Type-Function-S
6 6 RH_REFERENT-S
and for df2
X x
1 1 rh_Structure/Focus_S
2 2 rh_Structure/Focus_C
3 3 lh_Structure/Focus_S
4 4 lh_Structure/Focus_C
5 5 UCZESTNIK
6 6 COACH
and expected result would be:
NAME. df1 df2
1 COACH NA 6
2 lh_Structure/Focus_C 4 4
3 lh_Structure/Focus_S 3 3
4 RH_REFERENT-S 6 NA
5 rh_Structure/Focus_C 2 2
6 rh_Structure/Focus_S 1 1
7 RH_Type-Function-S 5 NA
8 UCZESTNIK NA 5
I can do that with merge.data.frame(df1,df2,by = "x", all=T),
but then I can't do it with more data frames of a similar structure. Any help would be appreciated.
It might be easier to work with this in a long form. Just rbind all the datasets below one another with a flag for which dataset they came from. Then it's relatively straightforward to get a tabulation of all the missing values (and as an added bonus, you can see if you have any duplicates in any of the source datasets):
dfs <- c("df1","df2")
dfall <- do.call(rbind, Map(cbind, mget(dfs), src=dfs))
table(dfall$x, dfall$src)
# df1 df2
# COACH 0 1
# lh_Structure/Focus_C 1 1
# lh_Structure/Focus_S 1 1
# RH_REFERENT-S 1 0
# rh_Structure/Focus_C 1 1
# rh_Structure/Focus_S 1 1
# RH_Type-Function-S 1 0
# UCZESTNIK 0 1
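If the goal is the wide layout from the question (one column of X values per data frame) for any number of data frames, one possible sketch is to rename X to the data frame's name and fold merge over the list with Reduce. The names dfs, renamed and wide are illustrative, and the sample data is retyped from the question.

```r
# Sample data retyped from the question
df1 <- data.frame(X = 1:6,
                  x = c("rh_Structure/Focus_S", "rh_Structure/Focus_C",
                        "lh_Structure/Focus_S", "lh_Structure/Focus_C",
                        "RH_Type-Function-S", "RH_REFERENT-S"))
df2 <- data.frame(X = 1:6,
                  x = c("rh_Structure/Focus_S", "rh_Structure/Focus_C",
                        "lh_Structure/Focus_S", "lh_Structure/Focus_C",
                        "UCZESTNIK", "COACH"))

# A named list scales to any number of data frames
dfs <- list(df1 = df1, df2 = df2)

# Rename the X column after its data frame so the columns stay distinguishable
renamed <- Map(function(d, nm) setNames(d[c("x", "X")], c("x", nm)),
               dfs, names(dfs))

# Full outer join of all data frames on x; names missing from one become NA
wide <- Reduce(function(a, b) merge(a, b, by = "x", all = TRUE), renamed)
wide
```

Adding a third data frame only requires adding it to dfs; the rest of the code stays the same.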

Add rows to dataframe from another dataframe, based on a vector

I'd like to add rows to a dataframe based on a vector within the dataframe. Here are the dataframes (df2 is the one I'd like to add rows to; df1 is the one I'd like to take the rows from):
ID=c(1:5)
x=c(rep("a",3),rep("b",2))
y=c(rep(0,5))
df1=data.frame(ID,x,y)
df2=df1[2:4,1:2]
df2$y=c(5,2,3)
df1
ID x y
1 1 a 0
2 2 a 0
3 3 a 0
4 4 b 0
5 5 b 0
df2
ID x y
2 2 a 5
3 3 a 2
4 4 b 3
I'd like to add to df2 any rows of df1 that aren't already in df2, based on the ID vector, so my output data frame would look like this:
ID x y
1 a 0
5 b 0
2 a 5
3 a 2
4 b 3
Can anyone see a way of doing this neatly, please? I need to do it for a lot of dataframes, all with different numbers of rows. I've tried using merge or rbind but I haven't been able to work out how to do it based on the vector.
Thank you!
A solution with dplyr:
library(dplyr)
bind_rows(df2,anti_join(df1,df2,by="ID"))
# ID x y
#1 2 a 5
#2 3 a 2
#3 4 b 3
#4 1 a 0
#5 5 b 0
You can do the following:
missingIDs <- which(!df1$ID %in% df2$ID) #check which df1 ID's are not in df2, see function is.element()
df.toadd <- df1[missingIDs,] #define the data frame to add to df2
result <- rbind(df.toadd, df2) #use rbind to add it
result
ID x y
1 1 a 0
5 5 b 0
2 2 a 5
3 3 a 2
4 4 b 3
What about this one-liner?
rbind(df2, df1[!df1$ID %in% df2$ID,])
ID x y
2 2 a 5
3 3 a 2
4 4 b 3
1 1 a 0
5 5 b 0
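Since the question mentions repeating this for many data frames, the one-liner could be wrapped in a small helper. This is only a sketch: the function name add_missing_rows and its arguments target, donor and key are hypothetical, and it assumes donor contains every column of target.

```r
# Data from the question
ID <- 1:5
x <- c(rep("a", 3), rep("b", 2))
y <- rep(0, 5)
df1 <- data.frame(ID, x, y)
df2 <- df1[2:4, 1:2]
df2$y <- c(5, 2, 3)

# Append to `target` the rows of `donor` whose key is not yet in `target`
add_missing_rows <- function(target, donor, key = "ID") {
  is_new <- !donor[[key]] %in% target[[key]]
  # select donor columns in target's order so rbind lines them up
  rbind(target, donor[is_new, names(target), drop = FALSE])
}

result <- add_missing_rows(df2, df1)
result
```

With several targets in a list, something like lapply(list_of_targets, add_missing_rows, donor = df1) would apply the same step to each.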

Double left join in dplyr to recover values

I've checked this issue but couldn't find a matching entry.
Say you have 2 DFs:
df1:mode df2:sex
1 1
2 2
3
And a DF3 where most of the combinations are not present, e.g.
mode | sex | cases
1 1 9
1 1 2
2 2 7
3 1 2
1 2 5
and you want to summarise it with dplyr, obtaining all combinations (with nonexistent ones set to 0):
mode | sex | cases
1 1 11
1 2 5
2 1 0
2 2 7
3 1 2
3 2 0
If you do a single left join (left_join(df1,df3)), you recover the modes not in df3, but sex appears as NA, and the same happens if you do left_join(df2,df3).
So how can you do both left joins to recover all absent combinations, with cases=0? dplyr preferred, but sqldf is an option.
Thanks in advance, p.
The development version of tidyr, tidyr_0.2.0.9000, has a new function called complete that I saw the other day that seems like it was made for just this sort of situation.
The help page says:
This is a wrapper around expand(), left_join() and replace_na that's
useful for completing missing combinations of data. It turns
implicitly missing values into explicitly missing values.
To add the missing combinations of df3 and fill with 0 values instead, you would do:
library(tidyr)
library(dplyr)
df3 %>% complete(mode, sex, fill = list(cases = 0))
mode sex cases
1 1 1 9
2 1 1 2
3 1 2 5
4 2 1 0
5 2 2 7
6 3 1 2
7 3 2 0
You would still need to group_by and summarise to get the final output you want.
df3 %>% complete(mode, sex, fill = list(cases = 0)) %>%
group_by(mode, sex) %>%
summarise(cases = sum(cases))
Source: local data frame [6 x 3]
Groups: mode
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
First, here's your data in a more friendly, reproducible format
df1 <- data.frame(mode=1:3)
df2 <- data.frame(sex=1:2)
df3 <- data.frame(mode=c(1,1,2,3,1), sex=c(1,1,2,1,2), cases=c(9,2,7,2,5))
I don't see an option for a full outer join in dplyr, so I'm going to use base R here to merge df1 and df2 to get all mode/sex combinations (merge acts as a cross join here because the two data frames share no columns). Then I left join that to the data and replace NA values with zero.
library(dplyr)
mm <- merge(df1,df2) %>% left_join(df3)
mm$cases[is.na(mm$cases)] <- 0
mm %>% group_by(mode,sex) %>% summarize(cases=sum(cases))
which gives
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
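The same result can also be reached in base R alone. The sketch below is one possible way, not the method of either answer: it sums the cases first, builds every mode/sex combination with expand.grid, and then fills the gaps with 0. The names totals, grid and out are illustrative.

```r
df3 <- data.frame(mode  = c(1, 1, 2, 3, 1),
                  sex   = c(1, 1, 2, 1, 2),
                  cases = c(9, 2, 7, 2, 5))

# Sum cases for the combinations that actually occur
totals <- aggregate(cases ~ mode + sex, data = df3, FUN = sum)

# Every mode/sex combination, whether or not it occurs in df3
grid <- expand.grid(mode = 1:3, sex = 1:2)

# Keep all rows of the grid; absent combinations get NA, then 0
out <- merge(grid, totals, by = c("mode", "sex"), all.x = TRUE)
out$cases[is.na(out$cases)] <- 0
out <- out[order(out$mode, out$sex), ]
rownames(out) <- NULL
out
```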

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R which is similar to the following. My real ’df’ data frame is much bigger than this one, but I have simplified it as much as possible to avoid confusion.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column ’a’ (and regarding only those records which have number ’1’ in column ’id’) number '1' occurred 3 times and number '3' occurred 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
To explain again: in column ’a’ (and regarding only those observations which have number ’2’ in column ’id’) number '1' occurred 4 times, number '2' occurred 3 times and number '3' occurred 3 times.
So this is what I would like to do: calculate the occurrences of numbers for each custom-defined subset (and then collect these values into a data frame). I know it is not a difficult task, but the PROBLEM is that I will have to change the input ’df’ data frame on a regular basis, so both the number of rows and the number of columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, ”automatic” way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1'
(the dimensions are ordered id, variable, value) you could just do
> dftab[1,'a',3]
[1] 7
A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[df$id==x[1],-1],2,table))
However, when a group doesn't contain every value in some column, as with column a for id 1, the result for that id will be a list rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
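One way to get a matrix for every id, including groups where a value never occurs, is to count against a fixed set of levels. The following is a hedged sketch rather than part of the answer above; the function name count_by_group and the variable lv are made up for illustration.

```r
# Data from the question
id <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <- c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <- c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <- c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <- c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <- data.frame(id, a, b, c, d, e)

count_by_group <- function(df, group = "id") {
  value_cols <- setdiff(names(df), group)
  # all values seen in any column, so every table has the same rows
  lv <- sort(unique(unlist(df[value_cols])))
  lapply(split(df[value_cols], df[[group]]), function(g) {
    # factor() with fixed levels makes table() report zero counts too
    sapply(g, function(col) table(factor(col, levels = lv)))
  })
}

res <- count_by_group(df)
res[["1"]]
```

Here res[["1"]]["2", "a"] is 0 instead of being dropped, so each list element has the same rows and columns even if the number of rows or columns in df changes.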
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
library(plyr)
ColTables <- function(df) {
counts <- list()
for(a in names(df)[names(df) != "id"]) {
counts[[a]] <- table(df[a])
}
return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
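A similar long-format result can be sketched without the helper column by converting a two-way table to a data frame. This is an alternative under the assumption that zero counts should be listed as well, which aggregate() leaves out; the name counts is illustrative and only the id and a columns from the question are retyped here.

```r
# id and a columns from the question
id <- rep(1:3, each = 10)
a <- c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
df <- data.frame(id, a)

# table() counts every a/id combination; as.data.frame() makes it long
counts <- as.data.frame(table(a = df$a, id = df$id), responseName = "freq")
counts
```

Note that a and id come back as factors, and the combination a = 2, id = 1 appears with freq 0.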
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))
