I have two data frames which initially contain the same elements, but after eliminating some rows from one of them they are no longer the same length.
x <-c(4,2,3,6,7,3,1,8,5,2,4,1,2,6,3)
y <-c(1,4,2,3,6,7,3,1,8,5,2,3,1,4,3)
z <-c(4,2,3,1,8,5,2,4,1)
k <-c(1,4,2,3,1,8,5,2,3)
df1 <- data.frame(x,y)
df2 <- data.frame(z,k)
I would like to find a way to add a column to the second data frame (df2) that references the row index of the corresponding row in the first data frame (df1), so it results in a new data frame as follows (a would be the index reference from df1).
df3
a z k
1 1 4 1
2 2 2 4
3 3 3 2
4 7 1 3
5 8 8 1
6 9 5 8
7 10 2 5
8 11 4 2
9 12 1 3
I could manually create a column of all the rows that were eliminated, or use
library(sqldf)
a1NotIna2 <- (sqldf('SELECT * FROM df1 EXCEPT SELECT * FROM df2'))
a1NotIna2
x y
1 2 1
2 3 3
3 3 7
4 6 3
5 6 4
6 7 6
I have tried using which, without success, on this last expression to find the rows of df1 that were eliminated, so that I could remove those positions from a sequence vector of length nrow(df1) and obtain an index vector like the one in df3.
Any help is welcome.
A generic solution (for data.frames with any number of columns), using pmatch, which uses each match in df1 at most once, so duplicated rows stay aligned:
transform(df2, a=pmatch(do.call(paste0, df2), do.call(paste0, df1)))
# z k a
#1 4 1 1
#2 2 4 2
#3 3 2 3
#4 1 3 7
#5 8 1 8
#6 5 8 9
#7 2 5 10
#8 4 2 11
#9 1 3 12
You can get the first matching row of df1 for each row in df2 with:
match(paste(df2$z, df2$k), paste(df1$x, df1$y))
# [1] 1 2 3 7 8 9 10 11 7
Unfortunately this won't stay aligned when you have duplicated rows, because match always returns the first hit; for instance we got index 7 for the last row of df2 instead of 12.
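A base-R workaround for the duplicate problem is to append a running occurrence counter to each key before matching, so the k-th copy of a row in df2 lines up with the k-th copy in df1 (a sketch of the same idea pmatch applies implicitly):
key1 <- paste(df1$x, df1$y)
key2 <- paste(df2$z, df2$k)
# append 1, 2, ... to repeated keys so duplicates become distinct
key1 <- paste(key1, ave(seq_along(key1), key1, FUN = seq_along))
key2 <- paste(key2, ave(seq_along(key2), key2, FUN = seq_along))
match(key2, key1)
# [1]  1  2  3  7  8  9 10 11 12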
Related
I have a large dataset of matched pairs (id1 and id2) and would like to create an index variable to enable me to merge these pairs into rows.
As such, the first row would be index 1 and from then on the index will increase by 1, unless either id1 or id2 match any of the values in previous rows. Where this is the case, the previously attributed index should be applied.
I have looked for weeks and most solutions seem to fall short of what I need.
Here's some data to replicate what I have:
id1 <- c(1,2,2,4,6,7,9,11)
id2 <- c(2,3,4,5,7,8,10,2)
df <- data.frame(id1, id2)
df
id1 id2
1 1 2
2 2 3
3 2 4
4 4 5
5 6 7
6 7 8
7 9 10
8 11 2
And here's what I hope to achieve:
#wanted result
index <- c(1,1,1,1,2,2,3,1)
df_indexed <- cbind(df,index)
df_indexed
id1 id2 index
1 1 2 1
2 2 3 1
3 2 4 1
4 4 5 1
5 6 7 2
6 7 8 2
7 9 10 3
8 11 2 1
It may be easier to do this with igraph: treat each pair as an edge, and each connected component of the resulting graph becomes one index group.
library(igraph)
g <- graph_from_data_frame(df)   # one edge per (id1, id2) pair
df$index <- components(g)$membership[as.character(df$id1)]
df$index
#[1] 1 1 1 1 2 2 3 1
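If you'd rather not depend on igraph, here is a minimal base-R sketch of the same connected-components idea, using a simple union-find over the ids (written for clarity rather than speed):
ids <- unique(c(df$id1, df$id2))
parent <- setNames(ids, ids)              # each id starts as its own root
find <- function(i) {                     # follow parent links up to the root
  while (parent[[as.character(i)]] != i) i <- parent[[as.character(i)]]
  i
}
for (r in seq_len(nrow(df))) {            # union the two ids of each pair
  r1 <- find(df$id1[r]); r2 <- find(df$id2[r])
  if (r1 != r2) parent[[as.character(r2)]] <- r1
}
roots <- vapply(df$id1, find, numeric(1))
df$index <- match(roots, unique(roots))   # renumber roots as 1, 2, 3, ...
df$index
# [1] 1 1 1 1 2 2 3 1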
I have two data frames with duplicated columns, data1 and data2. I am running a for loop in which each iteration merges one column of data1 with all of the columns of data2. For example
data1:
1 1 3 4 4
2 5 2 4 2
2 2 8 8 0
data2
1 4 5 4 5
2 9 3 4 5
2 7 4 8 0
columns 1 and 4 are duplicated between data1 and data2. In the first iteration, it merges
1
2
2
with data2
1 4 5 4 5
2 9 3 4 5
2 7 4 8 0
so the desired result is
1 4 5 4 5
2 9 3 4 5
2 7 4 8 0
Then it goes to the second column
1
5
2
and it merges with data2
1 4 5 4 5
2 9 3 4 5
2 7 4 8 0
The desired result will be
1 1 4 5 4 5
5 2 9 3 4 5
2 2 7 4 8 0
My idea was to use a combine or merge function, but neither achieves the desired output:
for (i in 1:ncol(data1)) {
  datam_merge <- merge(data1[i], data2)
}
Any suggestion is appreciated!
This should do the trick:
data3 <- dplyr::left_join(data2, data1)
head(data3)
The left_join() function determines which columns data2 has in common with data1, joins on those, and then appends only the columns of data1 that data2 is missing.
I noticed that your "desired result" is dropping column 5 from data1. Was this intentional, or is your desired output a new dataframe that has all of the columns from data1 and data2 without any duplicates?
This is another approach that may be a more generalized solution:
data3 <- dplyr::inner_join(data1, data2)
This also joins on the shared columns, but it keeps only the rows that have a match in both data frames, rather than every row of data2.
Let me know if this is what you were looking for!
Edit:
Here's my example:
data1 <- data.frame(c(1,2,2),c(1,5,2),c(3,2,8),c(4,4,8),c(4,2,0))
names(data1) <- c("A","B","C","D","E")
data2 <- data.frame(c(1,2,2),c(4,9,7),c(5,3,4),c(4,4,8),c(5,5,0))
names(data2) <- c("A","F","G","D","H")
## columns 'A' and 'D' are in common, but we only need one of each letter ('A' through 'E').
data3 <- dplyr::left_join(data2, data1)
head(data3)
A F G D H B C E
1 1 4 5 4 5 1 3 4
2 2 9 3 4 5 5 2 2
3 2 7 4 8 0 2 8 0
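For reference, base R's merge() performs the same natural join on all shared column names (here A and D), though it may reorder the rows:
data3_base <- merge(data2, data1)   # natural join on the shared columns
data3_base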
I want to delete all rows containing a value larger than 7 in any cell, either testing across all columns or only across specific columns.
a <- c(3,6,99,7,8,9)
b <- c(99,6,3,4,5,6)
c <- c(2,5,6,7,8,3)
df <- data.frame(a, b, c)
a b c
1 3 99 2
2 6 6 5
3 99 3 6
4 7 4 7
5 8 5 8
6 9 6 3
V1:
I want to delete all rows containing values larger than 7, regardless of the column.
# result V1
a b c
2 6 6 5
4 7 4 7
V2:
I want to delete all rows containing values larger than 7 in columns b and c
# result V2
a b c
2 6 6 5
3 99 3 6
4 7 4 7
6 9 6 3
There are plenty of similar problems on Stack Overflow, but I couldn't find a solution to this one. So far I can only find the rows that include a 7, using res <- df[rowSums(df != 7) < ncol(df), ].
rowSums() of the logical matrix df > 7 gives the number of TRUE values in each row; we get 0 when a row has no TRUE at all. Negating the result turns 0 into TRUE and every non-zero count into FALSE, and this logical vector can be used for subsetting.
df[!rowSums(df >7),]
# a b c
#2 6 6 5
#4 7 4 7
For 'V2', we use the same principle, except that we build the logical matrix on a subset of 'df', i.e. selecting only the second and third columns.
df[!rowSums(df[-1] >7),]
# a b c
#2 6 6 5
#3 99 3 6
#4 7 4 7
#6 9 6 3
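The same filters can also be written with dplyr (assuming version 1.0.0 or later, where if_all() is available), which some find more readable:
library(dplyr)
# V1: keep rows where every column is <= 7
filter(df, if_all(everything(), ~ .x <= 7))
# V2: only test columns b and c
filter(df, if_all(c(b, c), ~ .x <= 7))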
I have a dataframe with an id variable, which may be duplicated. I want to split it into two dataframes: one containing only the entries whose ids are duplicated, the other containing only the entries whose ids are unique. What is the best way of doing this?
For example, say I had the data frame:
dataDF <- data.frame(id = c(1,1,2,3,4,4,5,6),
a = c(1,2,3,4,5,6,7,8),
b = c(8,7,6,5,4,3,2,1))
i.e. the following
id a b
1 1 1 8
2 1 2 7
3 2 3 6
4 3 4 5
5 4 5 4
6 4 6 3
7 5 7 2
8 6 8 1
I want to get the following dataframes:
id a b
1 1 1 8
2 1 2 7
5 4 5 4
6 4 6 3
and
id a b
3 2 3 6
4 3 4 5
7 5 7 2
8 6 8 1
I am currently doing this as follows
dupeIds <- unique(subset(dataDF, duplicated(dataDF$id))$id)
uniqueDF <- subset(dataDF, !id %in% dupeIds)
dupeDF <- subset(dataDF, id %in% dupeIds)
which seems to work, but it feels a bit off to subset three times. Is there a simpler way of doing this? Thanks
Use duplicated twice, once top down, and once bottom up, and then use split to get it all in a list, like this:
split(dataDF, duplicated(dataDF$id) | duplicated(dataDF$id, fromLast = TRUE))
# $`FALSE`
# id a b
# 3 2 3 6
# 4 3 4 5
# 7 5 7 2
# 8 6 8 1
#
# $`TRUE`
# id a b
# 1 1 1 8
# 2 1 2 7
# 5 4 5 4
# 6 4 6 3
If you need to split this out into separate data.frames in your workspace (not sure why you would need to do that), assign names to the list items (eg names(mylist) <- c("nodupe", "dupe")) and then use list2env.
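For example, a small sketch of that step (the names "nodupe" and "dupe" are just illustrative):
mylist <- split(dataDF, duplicated(dataDF$id) | duplicated(dataDF$id, fromLast = TRUE))
names(mylist) <- c("nodupe", "dupe")    # FALSE -> unique ids, TRUE -> duplicated ids
list2env(mylist, envir = globalenv())   # creates data frames `nodupe` and `dupe`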
I have a data.frame that contains many columns. I want to keep the rows that have no NAs in 4 of these columns. The complication is that the other columns are allowed to have NAs in them, so I can't apply complete.cases or is.na to the whole data frame. What's the most efficient way to do this?
You can still use complete.cases(). Just apply it to the desired columns (columns 1:4 in the example below) and then use the logical vector it returns to select valid rows from the entire data.frame.
set.seed(4)
x <- as.data.frame(replicate(6, sample(c(1:10,NA))))
x[complete.cases(x[1:4]),]
# V1 V2 V3 V4 V5 V6
# 1 7 4 6 8 10 5
# 2 1 2 5 5 1 2
# 5 6 8 4 10 6 6
# 6 2 6 9 3 4 4
# 7 4 3 3 1 2 1
# 9 8 5 2 7 7 3
# 10 10 10 1 2 5 NA
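The same column selection also works by name, and tidyr offers an equivalent one-liner (assuming a version of tidyr where drop_na() accepts a column selection):
x[complete.cases(x[c("V1", "V2", "V3", "V4")]), ]   # select the columns by name
tidyr::drop_na(x, V1:V4)                            # tidyr equivalent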