EDITED:
I have a very simple question. I have a dataframe (already given) with repeated rows. I want to identify each unique row and add a column with an ID number.
The original table has thousands of rows, but I have simplified it here. A toy df can be created like this:
df <- data.frame(var1 = c('a', 'a', 'a', 'b', 'c', 'c', 'a'),
var2 = c('d', 'd', 'd', 'e', 'f', 'f', 'c'))
For each unique row, I want a numeric ID:
var1 var2 ID
1 a d 1
2 a d 1
3 a d 1
4 b e 2
5 c f 3
6 c f 3
7 a c 4
/EDITED
Here is a base R solution using cumsum + duplicated, i.e.,
df$ID <- cumsum(!duplicated(df))
such that
> df
var1 var2 ID
1 a d 1
2 a d 1
3 a d 1
4 b e 2
5 c f 3
6 c f 3
7 a c 4
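One caveat: cumsum(!duplicated(df)) numbers the rows correctly only when identical rows sit next to each other, as they do in the toy data. A minimal sketch of an order-independent variant follows; df_demo and the "\r" separator are assumptions made for illustration (the separator is assumed not to occur in the data).

```r
# Hypothetical data in which a duplicate row is separated from its first occurrence
df_demo <- data.frame(var1 = c("a", "b", "a"),
                      var2 = c("d", "e", "d"))

# Build one key per row; "\r" is assumed not to occur in the data
key <- do.call(paste, c(df_demo, sep = "\r"))

# Position of each key among the unique keys, in order of first appearance
df_demo$ID <- match(key, unique(key))

# For comparison: the cumsum approach gives the third row the ID 2 here
cumsum_id <- cumsum(!duplicated(df_demo[c("var1", "var2")]))
```

With this data df_demo$ID is 1, 2, 1, whereas cumsum_id is 1, 2, 2.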
EDIT
Well, the question was completely changed by OP. For the updated question we can do
df$ID <- match(paste0(df$var1, df$var2), unique(paste0(df$var1, df$var2)))
Original answer
One way would be to use uncount from tidyr
library(dplyr)
df %>% mutate(ID = row_number()) %>% tidyr::uncount(ID, .remove = FALSE)
# var1 var2 ID
#1 a d 1
#2 b e 2
#2.1 b e 2
#3 c f 3
#3.1 c f 3
#3.2 c f 3
In base R we can create a row number column in the dataframe and repeat rows based on that.
df$ID <- seq(nrow(df))
df[rep(df$ID, df$ID), ]
data
df <- structure(list(var1 = structure(1:3, .Label = c("a", "b", "c"
), class = "factor"), var2 = structure(1:3, .Label = c("d", "e",
"f"), class = "factor")), row.names = c(NA, -3L), class = "data.frame")
I'm quite new to R. I have a large dataframe approximating the following:
df <- data.frame(
source = c('a', 'b', 'c', 'e'),
partner = c('b', 'c', 'e', 'a'),
info = c(1,2,3,4)
)
For each row in the dataframe I want to get the info column from the partner and concatenate it to the source row. I'm doing this by building a second dataframe in the following way:
prt <- unlist(df$partner)
collect_partner <- function(x, df) {
df[df[, 'source'] == x, 'info']
}
prt_df <- do.call("rbind", lapply(prt, collect_partner, df)) # slow
final_df <- cbind(df, prt_df)
However, this approach is very slow and I'm sure there must be a better way. Unfortunately I'm finding it hard to articulate what I'm trying to do, so solutions aren't forthcoming from googling etc. Any suggestions would be much appreciated!
If you work with the tidyverse, I'd use a left_join of the data frame with, essentially, itself. I first create a data frame that contains only the source and info columns. To make sure there is only one value per unique entry in source, I use distinct (not strictly necessary).
Then, I join the data to the original data frame:
library(dplyr)
df <- data.frame(
source = c('a', 'b', 'c', 'e'),
partner = c('b', 'c', 'e', 'a'),
info = c(1,2,3,4)
)
source_info <- df %>%
select(source, prt_df = info) %>%
distinct(source, .keep_all = TRUE)
df %>%
left_join(source_info, by = c("partner" = "source"))
#> source partner info prt_df
#> 1 a b 1 2
#> 2 b c 2 3
#> 3 c e 3 4
#> 4 e a 4 1
Created on 2023-02-13 by the reprex package (v1.0.0)
With base R using sapply and ==
# which() gives the matching row position; index info with it to get the partner's value
df$prt_df <- sapply(df$partner, function(x) df$info[which(x == df$source)])
df
source partner info prt_df
1 a b 1 2
2 b c 2 3
3 c e 3 4
4 e a 4 1
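A vectorised alternative to the per-element lookup is match(), which finds the position of each partner in source; that position is then used to index info. This is only a sketch: df_lookup is a hypothetical copy of the data, and its info values (10 to 40) are an assumption chosen so that the looked-up values differ from the row positions.

```r
# Hypothetical copy of the example data with info values that differ from row numbers
df_lookup <- data.frame(
  source  = c("a", "b", "c", "e"),
  partner = c("b", "c", "e", "a"),
  info    = c(10, 20, 30, 40)
)

# match() returns, for each partner, the row where it appears as a source
partner_row <- match(df_lookup$partner, df_lookup$source)

# Index info by that row; partners with no matching source become NA
df_lookup$prt_df <- df_lookup$info[partner_row]
```

If source contains repeated values, match() uses the first occurrence only, so this assumes one row per source.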
Using data.table
library(data.table)
dt <- as.data.table(df)
dt[, prt_df := lapply(partner, function(x) info[which(x == source)])]
dt
source partner info prt_df
1: a b 1 2
2: b c 2 3
3: c e 3 4
4: e a 4 1
On a slightly modified set dt_m with repeated and missing values.
dt_m[, prt_df := lapply(partner, function(x) info[which(x == source)])]
dt_m
source partner info prt_df
1: a b 1
2: a c 2 3
3: c e 3 4
4: e a 4 1,2
modified data
dt_m <- structure(list(source = c("a", "a", "c", "e"), partner = c("b",
"c", "e", "a"), info = c(1, 2, 3, 4)), row.names = c(NA, -4L),
class = c("data.table", "data.frame"))
Suppose I have a data frame with a single column that contains letters a, b, c, d, e.
a
b
c
d
e
In R, is it possible to extract a single letter, such as 'a', and produce all possible paired combinations between 'a' and the other letters (with no duplications)? Could the combn command be used in this case?
a b
a c
a d
a e
We can use data.frame
data.frame(col1 = 'a', col2 = setdiff(df1$V1, "a"))
-output
col1 col2
1 a b
2 a c
3 a d
4 a e
data
df1 <- structure(list(V1 = c("a", "b", "c", "d", "e")),
class = "data.frame", row.names = c(NA,
-5L))
Update:
With .before=1 argument the code is shorter :-)
df %>%
mutate(col_a = first(col1), .before=1) %>%
slice(-1)
With dplyr you can:
library(dplyr)
df %>%
mutate(col2 = first(col1)) %>%
slice(-1) %>%
select(col2, col1)
Output:
col2 col1
<chr> <chr>
1 a b
2 a c
3 a d
4 a e
You could use
expand.grid(x=df[1,], y=df[2:5,])
which returns
x y
1 a b
2 a c
3 a d
4 a e
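As for the combn part of the question: combn can be used, although it generates every pair first, so the rows then have to be filtered down to those containing the chosen letter. A small sketch, where letters_vec, target, all_pairs and target_pairs are illustrative names rather than anything from the answers above:

```r
letters_vec <- c("a", "b", "c", "d", "e")
target <- "a"

# combn() returns a 2 x 10 matrix of all unordered pairs; transpose to one pair per row
all_pairs <- t(combn(letters_vec, 2))

# Keep only the pairs in which the target letter appears in either position
keep <- all_pairs[, 1] == target | all_pairs[, 2] == target
target_pairs <- all_pairs[keep, , drop = FALSE]
```

For a single fixed letter, the setdiff and expand.grid approaches avoid building the pairs that are thrown away here.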
I would like to add values to a column based on non-unique values in another column. For example, say I have a dataframe with a currently empty column that looks like this:
Site  Species Richness
A     0
A     0
A     0
B     0
B     0
I want to assign known species richness values for each site. Let's say site A has species richness 3, and site B has species richness 5. I would like the output to be:
Site  Species Richness
A     3
A     3
A     3
B     5
B     5
How do I input species richness values for specific sites?
I've tried this:
rows_update(df, tibble(Site = A, richness = 3))
rows_update(df, tibble(Site = B, richness = 5))
But I get an error message saying "'x' key values are not unique"
Any help would be appreciated!
Here we could use a data.table join with on and assign the matching 'SpeciesRichness' values by reference with :=. This should be more efficient:
library(data.table)
setDT(df)[data.table(Site = c('A','B'), SpeciesRichness = c(3, 5)),
SpeciesRichness := i.SpeciesRichness, on = .(Site)]
The issue with rows_update is that the by column must uniquely identify the rows in both data sets. From ?rows_update:
The two tables are matched by a set of key variables whose values must uniquely identify each row.
With 'df', the values are replicated 3 times for 'A' and 2 for 'B'. Using dplyr, we can do a left_join
library(dplyr)
df %>%
left_join(tibble(Site = c('A', 'B'), new = c(3, 5)), by = "Site") %>%
transmute(Site, SpeciesRichness = new)
-output
# Site SpeciesRichness
#1 A 3
#2 A 3
#3 A 3
#4 B 5
#5 B 5
data
df <- structure(list(Site = c("A", "A", "A", "B", "B"),
SpeciesRichness = c(0L,
0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -5L
))
You can create a dataframe with Site and Richness value and join them together.
In base R :
df1 <- data.frame(Site = rep(c('A', 'B'), c(3, 2)))
df2 <- data.frame(Site = c('A', 'B'), richness = c(3, 5))
df1 <- merge(df1, df2)
df1
# Site richness
#1 A 3
#2 A 3
#3 A 3
#4 B 5
#5 B 5
You can also use match :
df1$richness <- df2$richness[match(df1$Site, df2$Site)]
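The same lookup can also be sketched with a named vector instead of a second data frame. The names df_sites and richness_by_site are illustrative assumptions, and Site is converted to character so that a factor column would not be indexed by its integer codes.

```r
# Hypothetical copy of the site column from the question
df_sites <- data.frame(Site = c("A", "A", "A", "B", "B"))

# Known richness per site, stored as a named vector
richness_by_site <- c(A = 3, B = 5)

# Index the named vector by site name; unname() drops the A/B labels from the result
df_sites$SpeciesRichness <- unname(richness_by_site[as.character(df_sites$Site)])
```

Sites that are missing from richness_by_site would come back as NA.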
You could define the values then use case_when
x <- 3
y <- 5
df %>%
mutate(SpeciesRichness= case_when(Site=="A" ~ x,
Site=="B" ~ y))
Output:
Site SpeciesRichness
1 A 3
2 A 3
3 A 3
4 B 5
5 B 5
I'd like to fill the NA values in the F2 column with the most common F2 value within each F1 group.
F1 F2
1 A C
2 B D
3 A NA
4 A C
5 B NA
Desired outcome:
F1 F2
1 A C
2 B D
3 A C
4 A C
5 B D
Thank you for help
Here is a base R solution. First define a function for Mode (Taken from here) and then apply it to your data frame, i.e.
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
df$F2 <- with(df, ave(F2, F1, FUN = function(i) replace(i, is.na(i), Mode(i))))
df
# F1 F2
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
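Because Mode() counts NA like any other value, a group whose first or most frequent entry is NA can end up with NA as its mode. A hedged sketch of a variant that ignores NA when counting; mode_non_na, fill_mode and df_gaps are hypothetical names, and the sample data is an assumption built to trigger that case:

```r
# Most frequent non-NA value; returns NA when the group has no non-NA values
mode_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Replace NA in `values` with the per-group mode of the non-NA values
fill_mode <- function(values, groups) {
  ave(values, groups, FUN = function(v) replace(v, is.na(v), mode_non_na(v)))
}

# Hypothetical data: group B starts with NA, group C is NA only
df_gaps <- data.frame(F1 = c("A", "B", "A", "A", "B", "C"),
                      F2 = c("C", NA, NA, "C", "D", NA))
df_gaps$F2 <- fill_mode(df_gaps$F2, df_gaps$F1)
```

Groups with no observed value (C here) are left as NA, since there is nothing to fill from.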
Here is one way using dplyr :
library(dplyr)
df %>%
group_by(F1) %>%
mutate(F2 = replace(F2, is.na(F2),
names(sort(table(F2), decreasing = TRUE)[1])))
# F1 F2
# <chr> <chr>
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
In case of ties, preference is given to lexicographic order.
Try this:
First, in df2, I count the non-missing F2 values within each F1 group and keep the one with the highest count. That gives the most common F2 value per F1 group. I then join it back onto the original data.frame, use mutate to fill the missing F2 values from the new variable F2_fill, and finally drop F2_fill from the data.frame.
library(tidyverse)
df <- tribble(
~F1, ~F2,
'A', 'C',
'B' , 'D',
'A' ,NA,
'A', 'C',
'B', NA)
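df2 <- df %>%
filter(!is.na(F2)) %>% # drop NAs first so they cannot win the count
count(F1, F2) %>%
group_by(F1) %>%
slice_max(n, n = 1, with_ties = FALSE) %>% # exactly one row per F1, even with ties
ungroup() %>%
select(F1, F2_fill = F2)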
df3 <- left_join(df,df2, by="F1") %>%
mutate(F2 = ifelse(is.na(F2), F2_fill,F2)) %>%
select(-F2_fill)
You can use ave with table and which.max, and then subset with is.na so that only the missing entries are replaced. This works when F2 is a character column.
i <- is.na(x$F2)
x$F2[i] <- ave(x$F2, x$F1, FUN=function(y) names(which.max(table(y))))[i]
x
# F1 F2
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
Data:
x <- data.frame(F1 = c("A", "B", "A", "A", "B")
, F2 = c("C", "D", NA, "C", NA))
I have a df that looks like this:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
5 e 0.11258865
6 f 0.07228970
7 g 0.05673759
8 h 0.05319149
9 i 0.03989362
I would like to subset it using the cumulative sum of the column value: starting from the first row, I want to keep rows until the running total of value exceeds 0.6. My desired output is:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
I have tried df2[, colSums[,5]>=0.6] but obviously colSums is expecting an array
Thanks in advance
Here's an approach:
df2[seq(which(cumsum(df2$value) >= 0.6)[1]), ]
The result:
name value
1 a 0.2001942
2 b 0.1799645
3 c 0.1425701
4 d 0.1425701
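The one-liner fails with an error when the running total never reaches the threshold, because which() then returns nothing. A minimal sketch that wraps the same idea in a helper and guards that case; first_rows_reaching is a hypothetical name, and returning all rows in the fallback is an assumption about the desired behaviour:

```r
# Keep rows from the top until the running total of `column` first reaches `threshold`
first_rows_reaching <- function(data, column, threshold) {
  running <- cumsum(data[[column]])
  hit <- which(running >= threshold)
  if (length(hit) == 0) return(data)  # threshold never reached: keep every row
  data[seq_len(hit[1]), , drop = FALSE]
}

# Data from the question
df2 <- data.frame(
  name  = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
  value = c(0.20019421, 0.17996454, 0.14257010, 0.14257010, 0.11258865,
            0.07228970, 0.05673759, 0.05319149, 0.03989362)
)

top_rows <- first_rows_reaching(df2, "value", 0.6)
```

Here top_rows contains rows a to d, matching the desired output in the question.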
I'm not sure I understand exactly what you are trying to do, but I think cumsum should be able to help.
First to make this reproducible, let's use dput so others can help:
df <- structure(list(name = structure(1:9, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i"), class = "factor"), value = c(0.20019421,
0.17996454, 0.1425701, 0.1425701, 0.11258865, 0.0722897, 0.05673759,
0.05319149, 0.03989362)), .Names = c("name", "value"), class = "data.frame", row.names = c(NA,
-9L))
Then look at what cumsum(df$value) provides:
cumsum(df$value)
# [1] 0.2001942 0.3801587 0.5227289 0.6652990 0.7778876 0.8501773 0.9069149 0.9601064 1.0000000
Finally, subset accordingly:
subset(df, cumsum(df$value) <= 0.6)
# name value
# 1 a 0.2001942
# 2 b 0.1799645
# 3 c 0.1425701
subset(df, cumsum(df$value) >= 0.6)
# name value
# 4 d 0.14257010
# 5 e 0.11258865
# 6 f 0.07228970
# 7 g 0.05673759
# 8 h 0.05319149
# 9 i 0.03989362