Separate unique and duplicate entries in dataframe based off id - r

I have a dataframe with an id variable, which may be duplicated. I want to split it into two dataframes: one containing only the rows whose id is duplicated, the other containing only the rows whose id is unique. What is the best way of doing this?
For example, say I had the data frame:
dataDF <- data.frame(id = c(1,1,2,3,4,4,5,6),
a = c(1,2,3,4,5,6,7,8),
b = c(8,7,6,5,4,3,2,1))
i.e. the following
id a b
1 1 1 8
2 1 2 7
3 2 3 6
4 3 4 5
5 4 5 4
6 4 6 3
7 5 7 2
8 6 8 1
I want to get the following dataframes:
id a b
1 1 1 8
2 1 2 7
5 4 5 4
6 4 6 3
and
id a b
3 2 3 6
4 3 4 5
7 5 7 2
8 6 8 1
I am currently doing this as follows
dupeIds <- unique(subset(dataDF, duplicated(dataDF$id))$id)
uniqueDF <- subset(dataDF, !id %in% dupeIds)
dupeDF <- subset(dataDF, id %in% dupeIds)
This seems to work, but subsetting three times feels a bit off. Is there a simpler way of doing this? Thanks

Use duplicated twice, once top down, and once bottom up, and then use split to get it all in a list, like this:
split(dataDF, duplicated(dataDF$id) | duplicated(dataDF$id, fromLast = TRUE))
# $`FALSE`
# id a b
# 3 2 3 6
# 4 3 4 5
# 7 5 7 2
# 8 6 8 1
#
# $`TRUE`
# id a b
# 1 1 1 8
# 2 1 2 7
# 5 4 5 4
# 6 4 6 3
If you need to split this out into separate data.frames in your workspace (not sure why you would need to do that), assign names to the list items (e.g. names(mylist) <- c("nodupe", "dupe")) and then use list2env.
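As a rough sketch of that last step, assuming the split result is stored in a list called parts (a name chosen here only for illustration) and that the data frames should land in the global environment:

```r
dataDF <- data.frame(id = c(1, 1, 2, 3, 4, 4, 5, 6),
                     a  = c(1, 2, 3, 4, 5, 6, 7, 8),
                     b  = c(8, 7, 6, 5, 4, 3, 2, 1))

# TRUE for every row whose id occurs more than once
isDupe <- duplicated(dataDF$id) | duplicated(dataDF$id, fromLast = TRUE)

# "parts" is an illustrative name; split() labels the groups "FALSE" and "TRUE"
parts <- split(dataDF, isDupe)

# Rename by label rather than by position, so this still works
# when one of the two groups happens to be empty and is missing
newNames <- c("FALSE" = "nodupe", "TRUE" = "dupe")
names(parts) <- unname(newNames[names(parts)])

# Create one data frame per list element in the workspace
invisible(list2env(parts, envir = .GlobalEnv))
```

After this runs, nodupe and dupe exist as ordinary data frames. Keeping them in the list is usually easier to work with, which is probably why the answer questions the need for this step.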

Related

I would like to extract the columns of each element of a list in R

I'd like to extract the 3rd column (c) of each element in this list and store the result.
(I've listed the data frame in this example so that it looks like the long list of lists I have):
set.seed(59)
df<- data.frame(a=c(1,4,5,2),b=c(9,2,7,4),c=c(5,2,9,4))
df1<- data.frame(df,2*df)
df1<- list(df,2*df)
[[1]]
a b c
1 1 9 5
2 4 2 2
3 5 7 9
4 2 4 4
[[2]]
a b c
1 2 18 10
2 8 4 4
3 10 14 18
4 4 8 8
Seems fairly simple for just one element
> df1[[1]]["c"]
c
1 5
2 2
3 9
4 4
> df1["c"] # cries again
[[1]]
NULL
All I want to see is:
[[1]]
c
1 5
2 2
3 9
4 4
[[2]]
c
1 10
2 4
3 18
4 8
Thanks in advance
Use lapply:
data <- lapply(df1, function(x) x[, 'c', drop = FALSE])
data
#[[1]]
# c
#1 5
#2 2
#3 9
#4 4
#[[2]]
# c
#1 10
#2 4
#3 18
#4 8
When you select a single column with x[, 'c'], R drops the result to the lowest possible dimension, which is a vector in this case. drop = FALSE is needed to keep it as a dataframe.
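To make the effect of drop visible, here is a small sketch that reuses the list from the question. The variable names as_vector, as_frame and cols are made up for this example:

```r
df  <- data.frame(a = c(1, 4, 5, 2), b = c(9, 2, 7, 4), c = c(5, 2, 9, 4))
df1 <- list(df, 2 * df)

# Matrix-style indexing with a single column drops to a plain vector
as_vector <- df1[[1]][, "c"]
is.data.frame(as_vector)   # FALSE

# drop = FALSE keeps the one-column data frame
as_frame <- df1[[1]][, "c", drop = FALSE]
is.data.frame(as_frame)    # TRUE

# List-style indexing (x["c"]) never drops, so this is an equivalent shortcut
cols <- lapply(df1, "[", "c")
```

The last line passes "c" as the single index to `[`, which is the same as writing x["c"] for each element, so every result stays a data frame.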

Change the order of numerically named columns in r

If I have a dataframe like the one below which has numerical column names
example = data.frame(`1`=c(1,8,3,9), `2`=c(3,2,3,3), `3`=c(5,2,5,4), `4`=c(1,2,3,4), `5`=c(2,5,7,8), check.names = FALSE)
Which looks like this:
1 2 3 4 5
1 3 5 1 2
8 2 2 2 5
3 3 5 3 7
9 3 4 4 8
And I want to arrange it so that the column names start with three and proceed through five and back to one, like this:
3 4 5 1 2
5 1 2 1 3
2 2 5 8 2
5 3 7 3 3
4 4 8 9 3
I know how to rearrange the position of a single column in a dataset, but I'm not sure how to do this with more than one column in this particular order.
We can index the columns by position, concatenating (c) two sequences (:) of column indices
example[c(3:5, 1:2)]
# 3 4 5 1 2
#1 5 1 2 1 3
#2 2 2 5 8 2
#3 5 3 7 3 3
#4 4 4 8 9 3
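As the column names are all numeric and happen to match the column positions, we can convert them to numeric and index with them (a numeric index always selects by position, not by name)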
v1 <- as.numeric(names(example))
example[c(v1[3:5], v1[1:2])]
Or simply do
example[c(names(example)[3:5], names(example)[1:2])]
Or another way is with head and tail
example[c(tail(names(example), 3), head(names(example), 2))]
data
example <- data.frame(`1`=c(1,8,3,9), `2`=c(3,2,3,3),
`3`=c(5,2,5,4), `4`=c(1,2,3,4), `5`=c(2,5,7,8), check.names = FALSE)
R does not make it easy to create columns with numbers as names (data.frame needs check.names = FALSE for that). If you do have such columns, you can use match to find the positions of the column names in the order you want.
example[match(c(3:5, 1:2), names(example))]
# 3 4 5 1 2
#1 5 1 2 1 3
#2 2 2 5 8 2
#3 5 3 7 3 3
#4 4 4 8 9 3
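The difference between selecting by name and selecting by position only shows up once the names stop matching the positions. The following sketch illustrates this with a deliberately shuffled copy of the data. The names wanted, shuffled, by_name and by_position are made up for this example:

```r
example <- data.frame(`1` = c(1, 8, 3, 9), `2` = c(3, 2, 3, 3),
                      `3` = c(5, 2, 5, 4), `4` = c(1, 2, 3, 4),
                      `5` = c(2, 5, 7, 8), check.names = FALSE)

# Target order, given as column names (character), not positions
wanted <- c("3", "4", "5", "1", "2")

# A copy whose columns are out of order: names are now "5" "2" "4" "1" "3"
shuffled <- example[c(5, 2, 4, 1, 3)]

# Character index (or match on names) selects by name
by_name <- shuffled[wanted]
names(by_name)       # "3" "4" "5" "1" "2"

# Numeric index selects by position, whatever the names are
by_position <- shuffled[c(3:5, 1:2)]
names(by_position)   # "4" "1" "3" "5" "2"
```

So the name-based variants (a character vector or match) are the safer choice when the column order of the input is not guaranteed.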

Order data frame by column and display WITH indices

I have the following R data frame
> df
a
1 3
3 2
4 1
5 3
6 6
7 7
8 2
10 8
I order it by the a column with the order function df[ order(df), ]:
[1] 1 2 2 3 3 6 7 8
This is the result I want, BUT how can I list the whole data frame with the permuted indices?
The only thing that works is the following, but it seems sloppy and I don't really understand what it does:
> df[ order(df), c(1,1) ] # I want this but without the a.1 column!!!!
a a.1
4 1 1
3 2 2
8 2 2
1 3 3
5 3 3
6 6 6
7 7 7
10 8 8
Thanks
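If we need the indices as well, use sort with index.return = TRUE. Note that the returned ix column holds positions within df$a (1 to 8 here), not the original row names.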
data.frame(sort(df$a, index.return=TRUE))
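If the goal is to keep the original row names (4, 3, 8, ...) next to the sorted values, a possible alternative is to index the data frame with drop = FALSE. This is a sketch under the assumption that the row names in the question are what is meant by "indices". The data frame is rebuilt here from the printed output, and sorted is an illustrative name:

```r
# Rebuild the data frame from the question, including its row names
df <- data.frame(a = c(3, 2, 1, 3, 6, 7, 2, 8),
                 row.names = c("1", "3", "4", "5", "6", "7", "8", "10"))

# With a single column, df[order(df$a), ] would drop to a vector and
# lose the row names; drop = FALSE keeps the data frame
sorted <- df[order(df$a), , drop = FALSE]
sorted
```

This prints a single column a with the row names carried along, without the duplicated a.1 column that c(1,1) produces.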

How to reverse a column in R

I have a dataframe as described below. I want to reverse the order of column B without disturbing the order of the rest of the dataframe. Column B currently holds 5,4,3,2,1 and I want to change it to 1,2,3,4,5. I don't want to sort the dataframe, as that would change the overall ordering.
A B C
1 5 6
2 4 8
3 3 5
4 2 5
5 1 3
You can replace just that column:
x$B <- rev(x$B)
On your data:
> x$B <- rev(x$B)
> x
A B C
1 1 1 6
2 2 2 8
3 3 3 5
4 4 4 5
5 5 5 3
transform is also handy for this:
> transform(x, B = rev(B))
A B C
1 1 1 6
2 2 2 8
3 3 3 5
4 4 4 5
5 5 5 3
This doesn't modify x so you need to assign the result to something (perhaps back to x).

How to only keep the columns with same names between two data frames?

I have two data frames like the following:
a<-c(1,3,4,5,6,8)
b<-c(2,3,4,2,6,7)
c<-c(2,5,6,3,5,6)
df1<-data.frame(a,b,c)
d<-c(3,4,5,6,7,8)
e<-c(1,2,3,2,1,1)
c<-c(1,3,4,5,6,2)
df2<-data.frame(d,e,c)
> df1
a b c
1 1 2 2
2 3 3 5
3 4 4 6
4 5 2 3
5 6 6 5
6 8 7 6
> df2
d e c
1 3 1 1
2 4 2 3
3 5 3 4
4 6 2 5
5 7 1 6
6 8 1 2
I want to combine the two data frames and keep only the columns with the same names. The final data frame should look like this:
> df3
c1 c2
1 2 1
2 5 3
3 6 4
4 3 5
5 5 6
6 6 2
My real data frames have hundreds of columns, so I need code to do this job. Can anyone help me?
Find out which names belong to both dataframes and then bind them:
eqnames <- names(df1)[names(df1) %in% names(df2)]
df3 <- cbind(df1[eqnames], df2[eqnames])
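You can then rename the columns, suffixing each name with the number of the data frame it came from (this also stays correct when more than one column is shared):
names(df3) <- paste0(rep(eqnames, 2), rep(1:2, each = length(eqnames)))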
Resulting in:
> df3
c1 c2
1 2 1
2 5 3
3 6 4
4 3 5
5 5 6
6 6 2
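Since the real data has many columns, it may help to see the same idea with more than one shared name. The sketch below uses small made-up data frames and intersect, which returns the same set of names as the %in% expression above. The name shared is chosen for illustration:

```r
# Made-up data: columns b and c are shared, a and d are not
df1 <- data.frame(a = 1:3,   b = 4:6,   c = 7:9)
df2 <- data.frame(b = 10:12, c = 13:15, d = 16:18)

# Column names present in both data frames, in df1's order
shared <- intersect(names(df1), names(df2))

# Suffix each block with the number of its source before binding,
# so the origin of every column stays visible
df3 <- cbind(setNames(df1[shared], paste0(shared, 1)),
             setNames(df2[shared], paste0(shared, 2)))
names(df3)   # "b1" "c1" "b2" "c2"
```

Renaming each block before cbind avoids having to work out afterwards which of the duplicated names came from which data frame.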
