How to delete rows if repeated more than 5 times? [duplicate] - r

I'm really desperately looking for an answer.
I have only one column with duplicated IDs.
I want output like this:
ID
a
a
a
a
a
b
b
b
b
b
So if there are 6 a's, the 6th row (and any after it) should be deleted.

Here are a couple of options. After grouping by the 'ID' column, slice the first 5 rows of each group (with head and row_number()):
library(dplyr)
df1 %>%
group_by(ID) %>%
slice(head(row_number(), 5))
or use filter with a logical expression based on row_number(), again after grouping by the 'ID' column:
df1 %>%
group_by(ID) %>%
filter(row_number() < 6)

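In base R, we can use ave with seq_along to number the rows within each ID, then subset. Passing seq_along(ID) rather than ID as the first argument keeps the result numeric; with a character ID, ave returns character values and the comparison with 5 is done on strings, which breaks once a group has 10 or more rows.
subset(df, ave(seq_along(ID), ID, FUN = seq_along) <= 5)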
# ID
#1 a
#2 a
#3 a
#4 a
#5 a
#7 b
#8 b
#9 b
#10 b
#11 b
data
df <- structure(list(ID = c("a", "a", "a", "a", "a", "a", "b", "b",
"b", "b", "b")), class = "data.frame", row.names = c(NA, -11L))
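As a rough sketch of how the base R approach generalises, the row cap can be wrapped in a small helper. The name keep_first_n and its arguments are made up for this illustration and are not part of any package; it assumes the rows are already in the order you want to keep.

```r
# Hypothetical helper (not from any package): keep at most n rows per ID,
# in their original order.
keep_first_n <- function(data, id_col, n) {
  ids <- data[[id_col]]
  # Number the rows within each ID: 1, 2, 3, ... (kept numeric on purpose)
  position <- ave(seq_along(ids), ids, FUN = seq_along)
  # drop = FALSE keeps a data frame even when there is only one column
  data[position <= n, , drop = FALSE]
}

df <- data.frame(ID = c(rep("a", 6), rep("b", 5)))
result <- keep_first_n(df, "ID", 5)
table(result$ID)
```

Because the within-group position is numeric, the same helper should also behave for groups with 10 or more rows.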

Related

Adding new information to a table upon matching rows

I have very basic knowledge of R. I have two tabs (A and B) with rows I want to compare - some values match and some don't. I want R to find the matching elements and add the text value "E" to a pre-existing row in tab A if this is the case.
Example:
Tab A
ID Existing?
1 A
2 B
3 C
4 D
5 E
Tab B
ID
1 D
2 B
3 Y
4 A
5 W
Upon match:
Tab A
ID Existing?
1 A E
2 B E
3 C
4 D E
5 E
I have found information online on how to match tables but none on how to write new information when the match takes place.
Please explain like I'm 5... I have no programming background.
Thank you in advance!
Use match to find the elements of df1$ID that are also in df2$ID, and ifelse to set those rows to "E" and all others to NA.
df1 <- data.frame(ID = LETTERS[1:5])
df2 <- data.frame(ID = c("D", "B", "Y", "A", "W"))
df1$Existing <- ifelse(match(df1$ID, df2$ID), "E", NA)
ID Existing
1 A E
2 B E
3 C <NA>
4 D E
5 E <NA>
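To see why this works, it can help to look at what match returns on its own. A small sketch with the same IDs (the vector names ids_a and ids_b are only for this illustration): match gives the position of each element of the first vector in the second, or NA when there is none, while %in% gives plain TRUE/FALSE, which is often easier to read inside ifelse.

```r
ids_a <- c("A", "B", "C", "D", "E")
ids_b <- c("D", "B", "Y", "A", "W")

# Position of each element of ids_a within ids_b; NA means "not found"
pos <- match(ids_a, ids_b)
pos
# 4 2 NA 1 NA

# Logical version of the same question
found <- ids_a %in% ids_b

# The same flag column, built from the logical vector
existing <- ifelse(found, "E", NA)
existing
```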
Another solution - using dplyr - would be to join the two dataframes, where you have added the column Existing to the one being joined:
library(dplyr, warn.conflicts = FALSE)
df1 <- tibble(ID = LETTERS[1:5])
df2 <- tibble(ID = c("D", "B", "Y", "A", "W"))
df1 %>%
left_join(df2 %>% mutate(Existing = "E"))
#> Joining, by = "ID"
#> # A tibble: 5 x 2
#> ID Existing
#> <chr> <chr>
#> 1 A E
#> 2 B E
#> 3 C <NA>
#> 4 D E
#> 5 E <NA>
This will set all matching IDs to E and all non-matching to NA.
# data
tab1 <- structure(list(ID = c("A", "B", "C", "D", "E"), Existing = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_)), class = "data.frame", row.names = c(NA,
-5L))
tab2 <- structure(list(ID = c("D", "B", "Y", "A", "W")), class = "data.frame", row.names = c(NA,
-5L))
There are many ways to skin this cat. In base R, you could try, e.g.,
tab1$Existing[tab1$ID %in% tab2$ID] <- 'E'
In practice, for anything more complicated than tables with 6 rows, you could try dplyr:
library(dplyr)
tab1 %>% mutate(Existing = ifelse(ID %in% tab2$ID, 'E',NA))
Another useful tool -- with a slightly differing syntax -- is data.table.
library(data.table)
setDT(tab1) # converts to data.table by reference, no reassignment needed
setDT(tab2)
tab1[, Existing := ifelse(ID %in% tab2$ID, 'E', NA)]
Note that mutate and := play roughly the same role here. If you work more with R, you will probably develop an affinity for one of the "dialects" above.
EDIT: To drop the rows with NA values (in dplyr), you could either do:
tab1 %>% mutate(Existing = ifelse(ID %in% tab2$ID, 'E',NA)) %>%
filter(!is.na(Existing))
Or, piggy-backing on @jpiversen's solution:
df1 %>%
inner_join(df2 %>% mutate(Existing = "E"))
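For comparison, the two joins above can be sketched in base R with merge. This is only an illustration on the example data; note that merge sorts the result by the key column, so the row order may differ from the dplyr joins on other inputs.

```r
tab1 <- data.frame(ID = c("A", "B", "C", "D", "E"))
tab2 <- data.frame(ID = c("D", "B", "Y", "A", "W"))

# Flag every ID in tab2, then join the flag onto tab1
tab2$Existing <- "E"

# Keep all rows of tab1; IDs without a match get NA (like left_join)
left <- merge(tab1, tab2, by = "ID", all.x = TRUE)

# Keep only the matching rows (like inner_join)
inner <- merge(tab1, tab2, by = "ID")

left
inner
```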

Selecting most frequent combinations while removing the parts of that combination R

I have a dataset with combinations and their frequency as shown below in an example. The idea is to find all combinations (every name has to be used) to have the highest possible value for count (frequency).
Person 1 Person 2 Count
A B 4
A D 4
A C 3
B C 2
C D 1
B D 0
A, B, C and D are names of people and count is the frequency of a combination of two people.
In this example the highest count can be reached by having an AD and BC combination, which sums to 6 (4+2). If we take AB and CD as a combination the total sum of count will be lower (5, 4+1).
I would like to have a dataset looking like this as an answer:
Person 1 Person 2 Count
A D 4
B C 2
How can I create this dataset from the original, without duplicate names and with the highest possible total count? So if there is an AD combination, there cannot be another combination that includes A or D.
I tried the following code, but it does not give me the desired dataset:
dat <- data %>%
arrange(desc(count))
count = 0
while (nrow(dat)>0){
print(dat[1,])
dat <- dat %>%
filter(!(X1==X1[1]|X1==X2[1]|X2==X1[1]|X2==X2[1]))
}
dat is the arranged dataset shown in the first table. I print the first row (the one with the highest count) and delete all combinations that include either of its names (because I can use a name only once). This loops until there are no rows left.
This code gives the following dataset:
Person 1 Person 2 Count
A B 4
C D 1
Thank you in advance.
There is probably a more elegant solution with igraph, but here is my approach:
Using your data
your_data <- tibble::tribble(
~Person.1, ~Person.2, ~Count,
"A", "B", 4L,
"A", "D", 4L,
"A", "C", 3L,
"B", "C", 2L,
"C", "D", 1L,
"B", "D", 0L
)
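and assuming Person.1 and Person.2 are in alphabetical order within each row, you can do the following (the combinat and rlist packages must be installed, and the \(x) shorthand needs R 4.1 or later)
library(dplyr) # provides arrange() and first()
library(purrr)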
with(your_data, unique(c(Person.1, Person.2))) %>%
combinat::permn(\(x) split(x, (seq_along(x) + 1) %/% 2) %>%
map(sort) %>%
map_dfr(set_names, c("Person.1", "Person.2"))) %>%
map(~ arrange(.x, Person.1)) %>%
unique() %>%
imap(~ dplyr::left_join(.x, your_data)) %>%
rlist::list.sort((sum(Count))) %>%
first()
returning the desired
# A tibble: 2 x 3
Person.1 Person.2 Count
<chr> <chr> <int>
1 A D 4
2 B C 2
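If you would rather avoid the extra packages, the same exhaustive search can be sketched in base R with a small recursive function. This is only an illustration: the names pair_count and best_pairing are made up here, pairs missing from the table are assumed to count as 0, and, like the permutation approach, it tries every possible pairing, so it is only practical for a small number of people.

```r
pairs <- data.frame(
  p1 = c("A", "A", "A", "B", "C", "B"),
  p2 = c("B", "D", "C", "C", "D", "D"),
  count = c(4, 4, 3, 2, 1, 0)
)

# Count for an unordered pair; 0 if the pair is not listed (assumption)
pair_count <- function(a, b) {
  hit <- (pairs$p1 == a & pairs$p2 == b) | (pairs$p1 == b & pairs$p2 == a)
  if (any(hit)) pairs$count[hit][1] else 0
}

# Pair the first remaining person with each possible partner, solve the
# rest recursively, and keep the split with the highest total count.
best_pairing <- function(people) {
  if (length(people) < 2) return(list(total = 0, pairs = NULL))
  first <- people[1]
  best <- list(total = -Inf, pairs = NULL)
  for (partner in people[-1]) {
    sub <- best_pairing(setdiff(people, c(first, partner)))
    this_count <- pair_count(first, partner)
    if (this_count + sub$total > best$total) {
      chosen <- data.frame(p1 = first, p2 = partner, count = this_count)
      best <- list(total = this_count + sub$total,
                   pairs = rbind(chosen, sub$pairs))
    }
  }
  best
}

people <- sort(unique(c(as.character(pairs$p1), as.character(pairs$p2))))
result <- best_pairing(people)
result$pairs
result$total
```

Unlike the greedy loop in the question, this compares complete pairings, so picking AB first does not lock out the better AD plus BC split.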

Modify a data frame converting colnames into factor

I'm analyzing some data structured like "df" in the example, and I need to convert it into something like the "example" object below:
a<- c(1:3)
b<- c(1:3)
c<- c(1:3)
df<- data.frame(a, b, c)
col1<- c("a","a","a", "b", "b", "b", "c", "c", "c")
col2<- rep(1:3,3)
example<- data.frame(col1, col2)
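We can use pivot_longer from tidyr; names_to and values_to set the names of the two new columns
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = everything(), names_to = "col1", values_to = "col2")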
A quick base R solution is stack:
stack(df)
values ind
1 1 a
2 2 a
3 3 a
4 1 b
5 2 b
6 3 b
7 1 c
8 2 c
9 3 c
You can also use gather() from the tidyr package (it has since been superseded by pivot_longer(), but still works)
gather(df, colnames(df), key = "col1", value = "col2")
key and value serve as the new column names in the resulting data frame. In tidyverse pipe syntax, use it as follows
df %>%
gather(colnames(df), key = "col1", value = "col2")
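To get exactly the column names and order of the "example" object with base R only, one possible sketch is to rename and reorder the output of stack. The object name long is just for this illustration.

```r
df <- data.frame(a = 1:3, b = 1:3, c = 1:3)

# stack() returns the columns "values" and "ind" (a factor of column names)
stacked <- stack(df)

# Rename and reorder so the result matches the "example" object
long <- data.frame(
  col1 = as.character(stacked$ind),
  col2 = stacked$values
)
long
```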

R: group_id by changing row values

1) First, I have this data frame:
df <- data.frame(value=c("a","a","a", "b", "b", "b", "a", "a", "a"),
desired_id=c(1,1,1,2,2,2,3,3,3))
How do I generate the desired_id column?
My groups are assigned by row order.
That is, every time the value column changes, I want the next higher group index to be assigned.
I tried df$desired_id_replicate <- df %>% group_by(value) %>% group_indices
but that doesn't work as all value=="a" will be assigned the same group indices.
2) Second, I have this data frame:
df <- data.frame(value=c("a","a","a", "b", "b", "b", "a", "a", "a"),
value2=c("a","a","c", "b", "b", "c", "a", "a", "d"),
desired_id=c(1,1,2,3,3,4,5,5,6))
How do I generate the desired_id from the value and value2 columns?
My groups are assigned row-wise again. That is, every time the combination of value and value2 changes, the next higher desired_id should be assigned.
Similar to the above, I tried df$desired_id_replicate <- df %>% group_by(value, value2) %>% group_indices
but that doesn't work as all value=="a"&value2=="a" will be assigned the same group indices.
Thank you!
We can use rleid (run-length-encoding id) from data.table, which increments the id by 1 each time an element differs from the previous one
library(data.table)
library(dplyr)
df%>%
mutate(newcol = rleid(value))
and for the second dataset, it would be
df %>%
mutate(new = rleid(value, value2))
# value value2 desired_id new
#1 a a 1 1
#2 a a 1 1
#3 a c 2 2
#4 b b 3 3
#5 b b 3 3
#6 b c 4 4
#7 a a 5 5
#8 a a 5 5
#9 a d 6 6
Or with rle from base R
df$newcol <- with(rle(df$value), rep(seq_along(values), lengths))
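The rle line above handles a single column. For the second data frame, a base R sketch of the same idea is to flag every row where either column differs from the row before it and take the cumulative sum of those flags. The column name run_id is only for this illustration.

```r
df <- data.frame(
  value  = c("a", "a", "a", "b", "b", "b", "a", "a", "a"),
  value2 = c("a", "a", "c", "b", "b", "c", "a", "a", "d")
)
v1 <- as.character(df$value)
v2 <- as.character(df$value2)
n <- nrow(df)

# TRUE for the first row and for every row that differs from the previous
# one in at least one of the two columns
changed <- c(TRUE, v1[-1] != v1[-n] | v2[-1] != v2[-n])

# Each change starts a new group, so the running total is the group id
df$run_id <- cumsum(changed)
df$run_id
# 1 1 2 3 3 4 5 5 6
```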

Select minimum data of grouped data - keeping all columns [duplicate]

I am running into a wall here.
I have a dataframe, many rows.
Here is schematic example.
#myDf
ID c1 c2 myDate
A 1 1 01.01.2015
A 2 2 02.02.2014
A 3 3 03.01.2014
B 4 4 09.09.2009
B 5 5 10.10.2010
C 6 6 06.06.2011
....
I need to group my dataframe by ID, then select the row with the oldest date in each group, and write the output into a new dataframe, keeping all columns.
ID c1 c2 myDate
A 3 3 03.01.2014
B 4 4 09.09.2009
C 6 6 06.06.2011
....
That is how I approach it:
test <- myDf %>%
group_by(ID) %>%
mutate(date == as.Date(myDate, format = "%d.%m.%Y")) %>%
filter(date == min(b2))
To verify: the nrow of my resulting dataframe should be the same as the number of values unique returns.
unique(myDf$ID) %>% length == nrow(test)
FALSE
Does not work. I tried this:
newDf <- ddply(.data = myDf,
.variables = "ID",
.fun = function(piece){
take.this.row <- piece$myDate %>% as.Date(format="%d.%m.%Y") %>% which.min
piece[take.this.row,]
})
That does run forever. I terminated it.
Why is the first approach not working and what would be a good way to approach the problem?
Since you have a fairly large dataset, data.table may be a better fit. Here is a data.table version of the solution, which will likely be quicker than dplyr on large data:
library(data.table)
df <- data.table(ID=c("A","A","A","B","B","C"),c1=1:6,c2=1:6,
myDate=c("01.01.2015","02.02.2014",
"03.01.2014","09.09.2009","10.10.2010","06.06.2011"))
df[,myDate:=as.Date(myDate, '%d.%m.%Y')]
df_new <- df[ df[, .I[myDate == min(myDate)], by=ID]$V1 ]
df_new
ID c1 c2 myDate
1: A 3 3 2014-01-03
2: B 4 4 2009-09-09
3: C 6 6 2011-06-06
PS: you can use setDT(mydf) to transform data.frame to data.table.
After grouping by 'ID', we can use which.min to get the index of 'myDate' (after converting to Date class), and we extract the rows with slice.
library(dplyr)
df1 %>%
group_by(ID) %>%
slice(which.min(as.Date(myDate, '%d.%m.%Y')))
# ID c1 c2 myDate
# (chr) (int) (int) (chr)
#1 A 3 3 03.01.2014
#2 B 4 4 09.09.2009
#3 C 6 6 06.06.2011
data
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6,
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
"09.09.2009", "10.10.2010", "06.06.2011")), .Names = c("ID",
"c1", "c2", "myDate"), class = "data.frame", row.names = c(NA,
-6L))
If you wanted to just use the base functions you can also go with the aggregate and merge functions.
# data (from response above)
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6,
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
"09.09.2009", "10.10.2010", "06.06.2011")),
.Names = c("ID","c1", "c2", "myDate"),
class = "data.frame", row.names = c(NA,-6L))
# convert your date column to POSIXct object
df1$myDate = as.POSIXct(df1$myDate,format="%d.%m.%Y")
# Use the aggregate function to find the minimum date in each group.
# Here the variable of interest is the myDate column and the grouping
# variable is the "ID" column.
# The result is a new data frame with the columns "ID" and "myDate",
# holding the minimum date for each ID.
df2 = aggregate(list(myDate = df1$myDate),list(ID = df1$ID),
function(x){x[which(x == min(x))]})
df2
# Use the merge function to merge your original data frame with the
# data from the aggregate function
merge(df1,df2)
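Another base R sketch, for what it's worth: sort by ID and date, then keep the first row of each ID. It needs no packages and keeps all columns. As far as I can tell, the first attempt in the question fails because mutate(date == ...) compares instead of assigning (it needs a single =), and the filter refers to b2, which is not a column in the example.

```r
df1 <- data.frame(
  ID = c("A", "A", "A", "B", "B", "C"),
  c1 = 1:6,
  c2 = 1:6,
  myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
             "09.09.2009", "10.10.2010", "06.06.2011")
)

# Parse the day.month.year strings into Date values for sorting
parsed <- as.Date(as.character(df1$myDate), format = "%d.%m.%Y")

# Sort by ID, then by date, so the oldest row comes first within each ID
sorted <- df1[order(as.character(df1$ID), parsed), ]

# duplicated() is FALSE for the first row of each ID, i.e. the oldest one
oldest <- sorted[!duplicated(sorted$ID), ]
oldest
```

If several rows of one ID share the oldest date, this keeps only the first of them, so the result always has one row per ID.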
