This question already has answers here:
Aggregate column - how to handle uneven dataframe
(4 answers)
Closed yesterday.
I want to add a new column with a 3-number repeat 1,1,1,2,2,2,3,3,3 until the end of each group (Chr) within the data frame. This would be easy if every group's length were divisible by 3, but I am not sure how to do this when a group's length is not a multiple of 3. What happens with the last repeat within each group?
original <- data.frame(Chr = c("chr1","chr1","chr1","chr1","chr1","chr2","chr2","chr2","chr2","chr2","chr3"),
                       value = c(1,3,1,3,5,6,3,1,3,5,0),
                       seq = c(1,2,3,4,5,1,2,3,4,5,6))
modified <- data.frame(Chr = c("chr1","chr1","chr1","chr1","chr1","chr2","chr2","chr2","chr2","chr2","chr3"),
                       value = c(1,3,1,3,5,6,3,1,3,5,0),
                       seq = c(1,2,3,4,5,1,2,3,4,5,6),
                       rep = c(1,1,1,2,2,1,1,1,2,2,1))
We could use the gl() function:
library(dplyr)
original %>%
  mutate(rep = as.integer(gl(n(), 3, n())), .by = Chr)
Chr value seq rep
1 chr1 1 1 1
2 chr1 3 2 1
3 chr1 1 3 1
4 chr1 3 4 2
5 chr1 5 5 2
6 chr2 6 1 1
7 chr2 3 2 1
8 chr2 1 3 1
9 chr2 3 4 2
10 chr2 5 5 2
11 chr3 0 6 1
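For comparison, a base-R sketch (an alternative, not part of the answer above) produces the same grouping with ceiling() on a within-group row index, assuming the rows are already ordered within each Chr:
# ave() applies the function to each Chr group; seq_along(i) is the
# within-group row index, and ceiling(... / 3) bins it in threes, so a
# trailing partial block simply becomes the last, shorter group.
original$rep <- ave(seq_along(original$Chr), original$Chr,
                    FUN = function(i) ceiling(seq_along(i) / 3))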
This question already has answers here:
Filtering a dataframe showing only duplicates
(4 answers)
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Closed 3 months ago.
I want to select rows with duplicated id but keep both rows in the resulting dataset. Here is the original dataset:
dd <- data.frame(id = c(1,1,2,2,3,4,4,5,6,7,7),
                 coder = c(1,2,1,2,1,1,2,1,1,1,2))
dd
id coder
1 1
1 2
2 1
2 2
3 1
4 1
4 2
5 1
6 1
7 1
7 2
In the end, I want this:
id coder
1 1
1 2
2 1
2 2
4 1
4 2
7 1
7 2
I tried subset(dd, duplicated(id)), but it keeps only one row per duplicated id:
id coder
1 2
2 2
4 2
7 2
How can I achieve this?
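One common approach (a sketch, not taken from the linked answers) is to flag every id that occurs more than once and keep all of its rows:
# duplicated() marks later occurrences; fromLast = TRUE marks earlier ones,
# so the OR of the two covers every row of a duplicated id.
dd[duplicated(dd$id) | duplicated(dd$id, fromLast = TRUE), ]

# Equivalent subset() form:
subset(dd, id %in% id[duplicated(id)])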
This question already has an answer here:
Incrementing an ID number each time a condition is met
(1 answer)
Closed 1 year ago.
I have a data.frame ordered by ID with a column of numeric values that I would like to bin into groups, increasing the group number only when a certain target value/trigger is surpassed. I haven't had success with seq(), seq_along(), or cumsum() with data.table, but I'm sure there must be a way.
An example data.frame with the desired group column is below. In this example, the sequence generating the group column should increase only when a number >= 300 appears in the value column.
dat = data.frame(ID=1:10, value=c(0,2,1,12,68,300,41,0,72959,51), group=c(1,1,1,1,1,2,2,2,3,3))
> dat
ID value group
1 1 0 1
2 2 2 1
3 3 1 1
4 4 12 1
5 5 68 1
6 6 300 2
7 7 41 2
8 8 0 2
9 9 72959 3
10 10 51 3
We may use cumsum() on a logical vector to create the group:
library(dplyr)
dat %>%
  mutate(group2 = cumsum(value >= 300) + 1)
Output:
ID value group group2
1 1 0 1 1
2 2 2 1 1
3 3 1 1 1
4 4 12 1 1
5 5 68 1 1
6 6 300 2 2
7 7 41 2 2
8 8 0 2 2
9 9 72959 3 3
10 10 51 3 3
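The same logical-vector idea works outside dplyr as well; a base-R and data.table sketch, assuming dat as defined above:
# Base R: value >= 300 is a logical vector, cumsum() starts a new run at
# every TRUE, and + 1 makes the group numbers start at 1.
dat$group2 <- cumsum(dat$value >= 300) + 1

# data.table equivalent:
library(data.table)
setDT(dat)[, group2 := cumsum(value >= 300) + 1L]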
This question already has answers here:
Delete duplicate rows in two columns simultaneously [duplicate]
(2 answers)
Closed 6 years ago.
I have got the following data.frame:
df = read.table(text = 'a b c d
1 12 2 1
1 13 2 1
1 3 3 1
2 12 6 2
2 11 2 2
2 14 2 2
1 12 1 2
1 13 2 2
2 11 4 3', header = TRUE)
I need to remove the rows which have the same observations based on columns a and b, so that the results would be:
a b c d
1 12 2 1
1 13 2 1
1 3 3 1
2 12 6 2
2 11 2 2
2 14 2 2
Thank you for any help.
We can use duplicated() on the first two columns to keep only the first occurrence of each (a, b) pair:
df[!duplicated(df[1:2]), ]
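A dplyr alternative (a sketch, not from the linked answers) keeps the first row of each (a, b) combination:
library(dplyr)
# .keep_all = TRUE retains the remaining columns (c and d) from the first
# row found for each distinct (a, b) pair.
df %>% distinct(a, b, .keep_all = TRUE)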
This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Closed 8 years ago.
I want to partition a dataframe so that elements unique in a certain column are separated from the non-unique elements. So the dataframe below would be split into two dataframes like so:
id v1 v2
1 1 2 3
2 1 1 1
3 2 1 1
4 3 1 2
5 4 5 6
6 4 3 1
to
id v1 v2
1 2 1 1
2 3 1 2
and
id v1 v2
1 1 2 3
2 1 1 1
3 4 5 6
4 4 3 1
where they are split on the uniqueness of the id column. duplicated() alone doesn't work in this situation because rows 1 and 5 in the top dataframe are not considered duplicates, i.e. the first occurrence returns FALSE in duplicated().
EDIT
I went with
dups <- df[duplicated(df$id) | duplicated(df$id, fromLast = TRUE), ]
uniq <- df[!duplicated(df$id) & !duplicated(df$id, fromLast = TRUE), ]
which ran very quickly with my 250,000 row dataframe.
I think the easiest way to approach this problem is with data.table: count the rows per id and then split on whether the count is greater than 1.
Your data
data <- read.table(header=T,text="
id v1 v2
1 2 3
1 1 1
2 1 1
3 1 2
4 5 6
4 3 1
")
Code to split the data:
library(data.table)
setDT(data)
data[, Count := .N, by=id]
Unique table by id
data[Count==1]
id v1 v2 Count
1: 2 1 1 1
2: 3 1 2 1
Non-unique table by id
data[Count>1]
id v1 v2 Count
1: 1 2 3 2
2: 1 1 1 2
3: 4 5 6 2
4: 4 3 1 2
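If you want both subsets in one step and without the helper column, a short follow-up sketch (assuming the data.table code above has already added Count):
# split() returns a two-element list: $`FALSE` holds ids seen exactly once,
# $`TRUE` holds ids seen more than once; !"Count" drops the helper column.
parts <- split(data[, !"Count"], data$Count > 1)
uniq <- parts[["FALSE"]]
dups <- parts[["TRUE"]]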
I have a dataframe that looks like
day.of.week count
1 0 3
2 3 1
3 4 1
4 5 1
5 6 3
and another like
day.of.week count
1 0 17
2 1 6
3 2 1
4 3 1
5 4 5
6 5 1
7 6 13
I want to add the values from the first data frame to the second based on day.of.week. I was trying to use ddply:
total = ddply(merge(total, subtotal, all.x = TRUE, all.y = TRUE),
              .(day.of.week), summarize, count = sum(count))
which almost works, but merge combines rows that share a value. For instance, in the example above for day.of.week = 5: rather than keeping two records each with count one, the merge produces one record with count one, so instead of a total count of two I get a total count of one.
day.of.week count
1 0 3
2 0 17
3 1 6
4 2 1
5 3 1
6 4 1
7 4 5
8 5 1
9 6 3
10 6 13
There is no need to merge. You can simply do:
ddply(rbind(d1, d2), .(day.of.week), summarize, sum_count = sum(count))
I have assumed that both data frames have the identical column names day.of.week and count.
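A base-R equivalent of the same rbind-and-summarise idea (again assuming both frames are named d1 and d2 and share the columns day.of.week and count):
# aggregate() sums count within each day.of.week over the stacked rows, so a
# day present in only one frame simply keeps its single count.
aggregate(count ~ day.of.week, data = rbind(d1, d2), FUN = sum)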
In addition to the suggestion Ben gave you about using merge, you could also do this simply using subsetting:
d1 <- read.table(textConnection(" day.of.week count
1 0 3
2 3 1
3 4 1
4 5 1
5 6 3"),sep="",header = TRUE)
d2 <- read.table(textConnection(" day.of.week count1
1 0 17
2 1 6
3 2 1
4 3 1
5 4 5
6 5 1
7 6 13"),sep = "",header = TRUE)
d2[match(d1[, 1], d2[, 1]), 2] <- d2[match(d1[, 1], d2[, 1]), 2] + d1[, 2]
> d2
day.of.week count1
1 0 20
2 1 6
3 2 1
4 3 2
5 4 6
6 5 2
7 6 16
This assumes no repeated day.of.week rows, since match will return only the first match.
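If some days appear in d1 but not in d2, the match() index contains NA and the assignment fails; a merge-based sketch (using this answer's column names, count in d1 and count1 in d2, and starting from the frames as read in above, before the match assignment) handles that case:
# all = TRUE keeps days present in only one frame; the missing side becomes
# NA, which rowSums(..., na.rm = TRUE) treats as zero.
m <- merge(d1, d2, by = "day.of.week", all = TRUE)
m$total <- rowSums(m[, c("count", "count1")], na.rm = TRUE)
m[, c("day.of.week", "total")]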