This question already has an answer here:
Create a group index for values connected directly and indirectly
(1 answer)
Closed 2 years ago.
I have a dataset with two variables. As a simple example:
df <- data.frame(X1 = c("A", "B", "C", "C", "D", "D", "E"), X2 = c(1, 2, 2, 3, 4, 5, 1))
I would like to group them by the first component or the second, the desired output would be a third column with the following values:
c(1,2,2,2,3,3,1)
If I use dplyr with group_by() on both columns and cur_group_id(), each distinct pair gets its own group, so I obtain
c(1,2,3,4,5,6,7)
Can anyone suggest an easy way, in base R, dplyr or data.table, to obtain the desired grouping?
Thank you
Perhaps igraph could be a helpful tool for you:
library(igraph)
df$grp <- membership(components(graph_from_data_frame(df, directed = FALSE)))[df$X1]
which gives
> df
X1 X2 grp
1 A 1 1
2 B 2 2
3 C 2 2
4 C 3 2
5 D 4 3
6 D 5 3
7 E 1 1
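To see what the one-liner does, it can be split into steps (the intermediate names g and memb are only for illustration):
library(igraph)
# build an undirected graph whose vertices are the values of X1 and X2;
# each row of df becomes an edge, so values sharing a row are connected
g <- graph_from_data_frame(df, directed = FALSE)
# components() finds the connected components; membership() returns a
# named vector mapping each vertex to its component id
memb <- membership(components(g))
# look up the component id of each row's X1 value
df$grp <- memb[df$X1]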
Data
> dput(df)
structure(list(X1 = c("A", "B", "C", "C", "D", "D", "E"), X2 = c(1L,
2L, 2L, 3L, 4L, 5L, 1L)), row.names = c(NA, -7L), class = "data.frame")
This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 21 days ago.
I have a coding problem regarding subsetting my dataset. I would like to subset my data under the following conditions: (1) keep exactly one observation per ID, and (2) if an ID has event = 1 at any point, keep a row with event = 1, without losing any IDs.
An example dataset looks like this:
ID event
A 1
A 1
A 0
A 1
B 0
B 0
B 0
C 0
C 1
Desired output
A 1
B 0
C 1
I imagine this would be done with dplyr, something like df %>% group_by(ID), but I'm unsure how to prioritize selecting a row with event = 1 without dropping IDs whose only rows have event = 0.
Any help would be appreciated - thank you very much.
We may use
aggregate(event ~ ID, df1, max)
ID event
1 A 1
2 B 0
3 C 1
Or with dplyr
library(dplyr)
df1 %>%
  group_by(ID) %>%
  slice_max(event, n = 1, with_ties = FALSE) %>%
  ungroup()
# A tibble: 3 × 2
ID event
<chr> <int>
1 A 1
2 B 0
3 C 1
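A data.table equivalent, shown here only as a sketch (it is not part of the answer above):
library(data.table)
# max(event) per ID is 1 if the ID ever had an event, otherwise 0
setDT(df1)[, .(event = max(event)), by = ID]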
data
df1 <- structure(list(ID = c("A", "A", "A", "A", "B", "B", "B", "C",
"C"), event = c(1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L)),
class = "data.frame", row.names = c(NA,
-9L))
I have a data table which includes NAs in some cells as below.
Data table:
  Category Numeric Numeric.null
1        A       1            1
2     <NA>       2            2
3     <NA>       3            4
4        D       4            7
5     <NA>       5            6
6     <NA>       6            8
7        E       7           11
I want to carry the value in the 1st row of the Category column down into the following two rows that contain NA, without changing the other columns (Numeric and Numeric.null). The same goes for the 4th row of Category: repeat its value into the 5th and 6th rows, again leaving the other columns untouched.
Desired output: the same table, with each NA in Category replaced by the most recent non-NA value above it (A in rows 2-3, D in rows 5-6) and all other columns unchanged.
I'm just learning R programming. I have tried the rep function, but I couldn't make it work. Please help me.
We can use fill from tidyr
library(dplyr)
library(tidyr)
df1 <- df1 %>%
  fill(Category)
df1
# Category Numeric Numeric.null
#1 A 1 1
#2 A 2 2
#3 A 3 4
#4 D 4 7
#5 D 5 6
#6 D 6 8
#7 E 7 11
Or using data.table with na.locf0 from zoo
library(data.table)
library(zoo)
setDT(df1)[, Category := na.locf0(Category)][]
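If you prefer not to load extra packages, a base R sketch is possible as well (it assumes the first value of Category is not NA):
# index the non-NA values by a running count of how many non-NA values
# have been seen so far, which carries each value forward over the NAs
df1$Category <- with(df1, Category[!is.na(Category)][cumsum(!is.na(Category))])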
data
df1 <- structure(list(Category = c("A", NA, NA, "D", NA, NA, "E"), Numeric = 1:7,
Numeric.null = c(1L, 2L, 4L, 7L, 6L, 8L, 11L)),
class = "data.frame", row.names = c(NA,
-7L))
This question already has answers here:
Remove group from data.frame if at least one group member meets condition
(4 answers)
Closed 1 year ago.
Problem:
I want to remove all the rows of a group if one of its rows has a certain value in another column (similar to the questions linked below). The main difference is that the removal should only happen when that row also matches a criterion in a second column.
Making a practice df
prac_df <- data.frame(
  subj = rep(1:4, each = 2),
  ias = rep(c('A', 'B'), times = 4),
  fixations = c(17, 14, 0, 0, 15, 0, 8, 6)
)
So my data frame looks like this.
subj ias fixations
1 1 A 17
2 1 B 14
3 2 A 0
4 2 B 0
5 3 A 15
6 3 B 0
7 4 A 8
8 4 B 6
I want to remove all of subject 2 because it has a fixations value of 0 in a row where ias is A. However, I do not want to remove subject 3: although it also has a 0, that 0 is in a row where ias is B.
My attempt so far.
new.df <- prac_df[with(prac_df, ave(prac_df$fixations != 0, subj, FUN = all)),]
However, this is missing the part that only drops a subject when the 0 occurs in a row with A in the ias column. I've attempted various uses of & and if, but I feel like there's a clean way I just don't know of.
My goal is to make a df like this.
subj ias fixations
1 1 A 17
2 1 B 14
3 3 A 15
4 3 B 0
5 4 A 8
6 4 B 6
Thank you very much!
Related questions:
R: Remove rows from data frame based on values in several columns
How to remove all rows belonging to a particular group when only one row fulfills the condition in R?
We group by 'subj' and then filter based on the logical condition created with any and !
library(dplyr)
df1 %>%
  group_by(subj) %>%
  filter(!any(fixations == 0 & ias == "A"))
# subj ias fixations
# <int> <chr> <int>
#1 1 A 17
#2 1 B 14
#3 3 A 15
#4 3 B 0
#5 4 A 8
#6 4 B 6
Or use all with |
df1 %>%
  group_by(subj) %>%
  filter(all(fixations != 0 | ias != "A"))
The same approach can be used with ave from base R
df1[with(df1, !ave(fixations==0 & ias =="A", subj, FUN = any)),]
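The same condition also translates to data.table; the following is only a sketch, not part of the original answer:
library(data.table)
# keep a subject's rows only if no row has fixations == 0 together with ias == "A"
setDT(df1)[, if (!any(fixations == 0 & ias == "A")) .SD, by = subj]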
data
df1 <- structure(list(subj = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), ias = c("A",
"B", "A", "B", "A", "B", "A", "B"), fixations = c(17L, 14L, 0L,
0L, 15L, 0L, 8L, 6L)), .Names = c("subj", "ias", "fixations"),
class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
This question already has answers here:
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
How do I get a contingency table?
(6 answers)
Closed 4 years ago.
I have a large data set with two columns: ID and Property. Several rows may share the same ID, meaning that one ID can have many different properties (a categorical variable). I want to add dummy variables for Property and end up with a data frame that has one distinct ID per row, indicating with 1/0 whether that ID has each property. The original data has 2 million rows and 10,000 distinct properties, so ideally I would shrink the number of rows by combining identical IDs while adding one dummy column per property.
R crashes when I use the following code:
for (t in unique(df$property)) {
  df3[paste("property", t, sep = "")] <- ifelse(df$property == t, 1, 0)
}
So I am wondering: what is the most efficient way to add dummy variable columns for a large data set in R?
We can just use table
as.data.frame.matrix(table(df1))
# A B C D
#1 1 1 0 0
#3 0 0 1 0
#4 0 0 0 1
#5 0 0 0 2
Or an efficient approach would be dcast from data.table
library(data.table)
dcast(setDT(df1), a ~ b, value.var = "a", fun.aggregate = length)
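Given the sizes mentioned in the question (2 million rows, 10,000 distinct properties), a dense 0/1 matrix may not fit in memory. A sparse contingency table is one option; this sketch uses xtabs() with sparse = TRUE (which requires the Matrix package) on the example data, whose columns are a and b (with the real data the formula would be ~ ID + Property):
library(Matrix)
# sparse ID x property indicator table; only the non-zero cells are stored
xtabs(~ a + b, data = df1, sparse = TRUE)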
data
df1 <- structure(list(a = c(1L, 1L, 3L, 4L, 5L, 5L), b = c("A", "B",
"C", "D", "D", "D")), .Names = c("a", "b"), row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")
This question already has an answer here:
Create a group index for values connected directly and indirectly
(1 answer)
Closed 4 years ago.
I'm trying to link together pairs of unique IDs using R. In the example below, two columns (ID1 and ID2) indicate linkage, and I want to group the rows that are connected. Here A is linked to B, which is linked to D, which is linked to E; because these are all connected, I want to group them together. Next, X is linked to both Y and Z; because these are also connected, I want to assign them to a single group as well. How can I tackle this in R?
Thanks!
Example data:
ID1 ID2
A B
B D
D E
X Y
X Z
dput representation:
structure(list(id1 = structure(c(1L, 2L, 3L, 4L, 4L), .Label = c("A", "B", "D", "X"), class = "factor"),
    id2 = structure(1:5, .Label = c("B", "D", "E", "Y", "Z"), class = "factor")),
    .Names = c("id1", "id2"), row.names = c(NA, -5L), class = "data.frame")
Output needed:
ID1 ID2 GROUP
A B 1
B D 1
D E 1
X Y 2
X Z 2
As mentioned by @Frank in the comments, you can use igraph:
library(igraph)
idf <- graph_from_data_frame(df)
components(idf)$membership
Which gives:
A B D X E Y Z
1 1 1 2 1 2 2
Should you want to assign the result back to rows of df:
merge(df, stack(components(idf)$membership), by.x = "id1", by.y = "ind", all.x = TRUE)
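If you want a GROUP column exactly as in the desired output, keeping the original row order, you can also index the membership vector by id1 directly; this is just a sketch building on the code above:
# the membership vector is named by vertex, so indexing it with each row's
# id1 value (converted from factor to character) returns that row's component id
df$GROUP <- components(idf)$membership[as.character(df$id1)]
df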