This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Closed 8 years ago.
I want to partition a dataframe so that elements unique in a certain column are separated from the non-unique elements. So the dataframe below will be separated into two dataframes like so:
id v1 v2
1 1 2 3
2 1 1 1
3 2 1 1
4 3 1 2
5 4 5 6
6 4 3 1
to
id v1 v2
1 2 1 1
2 3 1 2
and
id v1 v2
1 1 2 3
2 1 1 1
3 4 5 6
4 4 3 1
where they are split on the uniqueness of the id column. duplicated doesn't work on its own in this situation because rows 1 and 5 in the top dataframe are not flagged as duplicates, i.e. duplicated returns FALSE for the first occurrence of each value.
EDIT
I went with
dups <- df[duplicated(df$id) | duplicated(df$id, fromLast=TRUE), ]
uniq <- df[!duplicated(df$id) & !duplicated(df$id, fromLast=TRUE), ]
which ran very quickly with my 250,000 row dataframe.
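As a self-contained sketch, here is the approach from the edit applied to the example data from the question (the two duplicated() calls together flag every occurrence of a repeated id, including the first):

```r
# Example data from the question
df <- data.frame(id = c(1, 1, 2, 3, 4, 4),
                 v1 = c(2, 1, 1, 1, 5, 3),
                 v2 = c(3, 1, 1, 2, 6, 1))

# duplicated() alone misses the first occurrence; OR-ing it with the
# fromLast pass flags every row whose id appears more than once
dup_flag <- duplicated(df$id) | duplicated(df$id, fromLast = TRUE)

dups <- df[dup_flag, ]   # rows with repeated ids
uniq <- df[!dup_flag, ]  # rows with unique ids
```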
I think the easiest way to approach this problem is with data.table: add a count of rows by id and see where that count is greater than 1.
Your data
data <- read.table(header=T,text="
id v1 v2
1 2 3
1 1 1
2 1 1
3 1 2
4 5 6
4 3 1
")
Code to split the data
library(data.table)
setDT(data)
data[, Count := .N, by=id]
Unique table by id
data[Count==1]
id v1 v2 Count
1: 2 1 1 1
2: 3 1 2 1
Non-unique table by id
data[Count>1]
id v1 v2 Count
1: 1 2 3 2
2: 1 1 1 2
3: 4 5 6 2
4: 4 3 1 2
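The same split can also be sketched with dplyr (an alternative to the answer above, not part of it; assumes dplyr is installed):

```r
library(dplyr)

data <- data.frame(id = c(1, 1, 2, 3, 4, 4),
                   v1 = c(2, 1, 1, 1, 5, 3),
                   v2 = c(3, 1, 1, 2, 6, 1))

# n() inside a grouped filter counts rows per id
uniq <- data %>% group_by(id) %>% filter(n() == 1) %>% ungroup()
dups <- data %>% group_by(id) %>% filter(n() > 1) %>% ungroup()
```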
Related
I have the following df
df <- data.frame(value = c(1,2,3,4,5,6,7,8,9,10), win=c(1,1,1,2,2,3,4,4,5,5))
> df
value win
1 1 1
2 2 1
3 3 1
4 4 2
5 5 2
6 6 3
7 7 4
8 8 4
9 9 5
10 10 5
And I wanted to keep only the rows where the variable win appears in at least 3 rows. So if I look into
> table(df$win)
1 2 3 4 5
3 2 1 2 2
I know that I will only want to keep the rows where win == 1. But how do I do that for a big data frame?
I was thinking of having a vector which would give me the unique values of df$win
xx <- unique(df$win)
> xx
[1] 1 2 3 4 5
And then somehow make a loop that counts which rows satisfy df$win == xx and extracts only those rows, but I wasn't able to make it work, so if any of you could help me I would be very thankful!
Edit
Expected output (only for this example, though; doing subset(df, win == "1") is not useful in general, as I don't know in advance which values of win will appear in at least 3 rows)
> new_df
value win
1 1 1
2 2 1
3 3 1
If you have a big dataset, use data.table
library(data.table)
setDT(df)[, if(.N>=3) .SD, win]
Output:
win value
1: 1 1
2: 1 2
3: 1 3
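If you'd rather avoid the data.table dependency, a base-R sketch of the same filter uses ave() to attach each group's size to its rows:

```r
df <- data.frame(value = 1:10, win = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5))

# ave() returns, for each row, the size of its win group;
# keep rows whose group has at least 3 members
new_df <- df[ave(df$win, df$win, FUN = length) >= 3, ]
```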
This question already has answers here:
Select columns larger than a value
(3 answers)
Closed 3 years ago.
I have a data frame with many columns and rows, for example
id column1 column2 column3
1 2 3 5
2 3 2 6
3 4 1 3
4 1 1 2
5 3 3 2
6 5 2 1
How can I select the columns (except id) whose max value is at least a certain value, like 5 in the example data?
So the selected data should be:
id column1 column3
1 2 5
2 3 6
3 4 3
4 1 2
5 3 2
6 5 1
I would appreciate any help on my question. Thank you very much!
Multiple ways to do this.
Using base R
cbind(df[1], df[-1][sapply(df[-1], function(x) any(x >=5))])
# id column1 column3
#1 1 2 5
#2 2 3 6
#3 3 4 3
#4 4 1 2
#5 5 3 2
#6 6 5 1
We could also use colSums on the logical matrix obtained by comparing the values with >= 5
cbind(df[1], df[-1][colSums(df[-1] >= 5) > 0])
Or with Filter
cbind(df[1], Filter(function(x) any(x >= 5), df[-1]))
Or using dplyr
library(dplyr)
bind_cols(df[1], df %>%
  select(-1) %>%
  select_if(~ any(. >= 5)))
This requires first finding the column maxima and then subsetting the data frame accordingly, as in
df[c(TRUE, apply(df[-1], 2, max) >= 5)]
# id column1 column3
# 1 1 2 5
# 2 2 3 6
# 3 3 4 3
# 4 4 1 2
# 5 5 3 2
# 6 6 5 1
where
apply(df[-1], 2, max)
# column1 column2 column3
# 5 3 6
and adding TRUE also preserves the id column.
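In newer dplyr (>= 1.0), select_if is superseded by select() with where(); a sketch on the same data (column values taken from the question):

```r
library(dplyr)

df <- data.frame(id = 1:6,
                 column1 = c(2, 3, 4, 1, 3, 5),
                 column2 = c(3, 2, 1, 1, 3, 2),
                 column3 = c(5, 6, 3, 2, 2, 1))

# where() keeps every column whose maximum reaches 5; listing id first
# (together with select()'s de-duplication) preserves it at the front
result <- df %>% select(id, where(~ max(.x) >= 5))
```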
This question already has answers here:
Fill missing dates by group
(3 answers)
Fastest way to add rows for missing time steps?
(4 answers)
Closed 3 years ago.
I have a data frame of ids with number column
df <- read.table(text="
id nr
1 1
2 1
1 2
3 1
1 3
", header=TRUE)
I'd like to create a new dataframe from it, where each id gets every unique nr from the df dataframe. As you may notice, id 3 has only nr 1, but not 2 and 3. So the result should be:
result <- read.table(text="
id nr
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
", header=TRUE)
You can use expand.grid as follows:
library(dplyr)
result <- expand.grid(id = unique(df$id), nr = unique(df$nr)) %>%
  arrange(id)
result
id nr
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
We can do:
tidyr::expand(df,id,nr)
# A tibble: 9 x 2
id nr
<int> <int>
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
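With data.table, CJ() builds the same cross join of the unique values (sorted by key by default); a sketch, as an alternative to the answers above, assuming the question's data:

```r
library(data.table)

df <- data.frame(id = c(1, 2, 1, 3, 1), nr = c(1, 1, 2, 1, 3))

# CJ() is data.table's keyed cross join: every id paired with every nr
result <- CJ(id = unique(df$id), nr = unique(df$nr))
```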
I would like to create an adjacency list from a dataset like the following:
id group
1 1
2 1
3 1
4 2
5 2
The connected id are those who are in the same group. Therefore, I would like to get the following adjacency list:
id id2
1 2
1 3
2 1
2 3
3 1
3 2
4 5
5 4
I am struggling to figure out how to do it. In particular, I have found a solution where order does not matter (split and expand.grid by group on large data set). In my case it does matter, so I would not like to have those observations dropped.
Maybe something like this, using data.table:
require(data.table)
dt <- fread('id group
1 1
2 1
3 1
4 2
5 2')
dt[, expand.grid(id, id), by = group][Var1 != Var2][, -1]
# Var1 Var2
# 1: 2 1
# 2: 3 1
# 3: 1 2
# 4: 3 2
# 5: 1 3
# 6: 2 3
# 7: 5 4
# 8: 4 5
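A base-R alternative (not in the answer above) is a self-join with merge(), which likewise keeps both directions of each pair:

```r
df <- data.frame(id = 1:5, group = c(1, 1, 1, 2, 2))

# Self-join on group produces every within-group pair, including
# self-pairs, which the filter then removes
adj <- merge(df, df, by = "group", suffixes = c("", "2"))
adj <- adj[adj$id != adj$id2, c("id", "id2")]
```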
I have a large data frame with the columns V1 and V2. It is representing an edgelist. I want to create a third column, COUNT, which counts how many times that exact edge appears. For example, if V1 == 1 and V2 == 2, I want to count how many other times V1 == 1 and V2 == 2, combine them into one row and put the count in a third column.
Data <- data.frame(
V1 = c(1,1),
V2 = c(2,2)
)
I've tried something like new = aggregate(V1 ~ V2,data=df,FUN=length) but it's not working for me.
...or maybe use data.table:
library(data.table)
df<-data.table(v1=c(1,2,3,4,5,1,2,3,1),v2=c(2,3,4,5,6,2,3,4,3))
df[ , count := .N, by=.(v1,v2)] ; df
v1 v2 count
1: 1 2 2
2: 2 3 2
3: 3 4 2
4: 4 5 1
5: 5 6 1
6: 1 2 2
7: 2 3 2
8: 3 4 2
9: 1 3 1
Assuming the data is structured as:
df<-data.frame(v1=c(1,2,3,4,5,1,2,3),v2=c(2,3,4,5,6,2,3,4),stringsAsFactors = FALSE)
> df
v1 v2
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
6 1 2
7 2 3
8 3 4
Using the ddply function from the plyr package to get the count of all edge pairs:
library(plyr)
df2 <- ddply(df, .(v1, v2), function(df) c(count = nrow(df)))
> df2
v1 v2 count
1 1 2 2
2 2 3 2
3 3 4 2
4 4 5 1
5 5 6 1
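A base-R sketch of the same count without plyr, using aggregate() on a constant helper column (the count column name is illustrative):

```r
df <- data.frame(v1 = c(1, 2, 3, 4, 5, 1, 2, 3),
                 v2 = c(2, 3, 4, 5, 6, 2, 3, 4))

# Summing a column of 1s per (v1, v2) pair yields the pair's frequency
df2 <- aggregate(count ~ v1 + v2, data = transform(df, count = 1), FUN = sum)
```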