Create numerical discrete values if values in a column are equal in R

I have a column of IDs in a dataframe that sometimes has duplicates; take for example:
ID
209
315
109
315
451
209
What I want to do is take this column and create another column that indicates which ID the row belongs to, i.e. I want it to look like:
ID   ID Category
209  1
315  2
109  3
315  2
451  4
209  1
Essentially, I want to loop through the IDs and, if one equals a previous ID, mark it as belonging to the same ID; if it is a new ID, create a new indicator for it.
Does anyone know of a quick function in R that does this? Or have any other suggestions?
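For reference, a minimal base-R sketch of the idea (assuming the column lives in a data frame called data): match() against the unique IDs returns each ID's position in order of first appearance, which is exactly the category number wanted here.
data <- data.frame(ID = c(209, 315, 109, 315, 451, 209))
# match() gives each ID's index within unique(data$ID),
# i.e. a first-appearance group number: 1 2 3 2 4 1
data$IDCategory <- match(data$ID, unique(data$ID))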

Convert to a factor with levels ordered by unique() (i.e. order of appearance in the data set), then to numeric:
data$IDCategory <- as.numeric(factor(data$ID, levels = unique(data$ID)))
#> data
# ID IDCategory
#1 209 1
#2 315 2
#3 109 3
#4 315 2
#5 451 4
#6 209 1

library(tidyverse)
data <- tibble(ID= c(209,315,109,315,451,209))
data %>%
  left_join(
    data %>%
      distinct(ID) %>%
      mutate(`ID Category` = row_number())
  )
#> Joining, by = "ID"
#> # A tibble: 6 × 2
#> ID `ID Category`
#> <dbl> <int>
#> 1 209 1
#> 2 315 2
#> 3 109 3
#> 4 315 2
#> 5 451 4
#> 6 209 1
Created on 2022-03-10 by the reprex package (v2.0.0)

df <- df %>%
  dplyr::mutate(`ID Category` = as.numeric(interaction(ID, drop = TRUE)))
Answer with data.table
library(data.table)
df <- as.data.table(df)
df[, `ID Category` := as.numeric(interaction(ID, drop = TRUE))]
The pro of this solution is that you can create a unique ID for a group of variables. Here you only need ID, but if you want a unique ID for, say, the pair [ID, Location], you could:
data <- tibble(ID= c(209,209,209,315,315,315), Location = c("A","B","C","A","A","B"))
data <- data %>%
  dplyr::mutate(`ID Category` = as.numeric(interaction(ID, Location, drop = TRUE)))
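One caveat: as.numeric(interaction(...)) numbers the groups by sorted factor level, not by order of first appearance, so on the original example it would give 109 category 1 rather than 3. If first-appearance numbering matters, a small sketch to renumber afterwards:
# match() against the unique codes renumbers the groups in order of appearance
data$`ID Category` <- match(data$`ID Category`, unique(data$`ID Category`))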

Another way:
merge(data,
      data.frame(ID = unique(data$ID),
                 ID.Category = seq_along(unique(data$ID))),
      sort = FALSE)
# ID ID.Category
# 1 209 1
# 2 209 1
# 3 315 2
# 4 315 2
# 5 109 3
# 6 451 4
data:
tibble(ID = c(209,315,109,315,451,209)) -> data

Related

mutate across with vectorized function parameters

I know the "across" paradigm is "many columns, one function", so this might not be possible. The idea is that I want to apply the same function to several columns, with a parameter that varies by column.
I got this to work using cur_column(), but it basically amounts to looking up the parameters one by one rather than providing a vector of parameters equal in length to the number of columns.
This first block produces what I want, but I'm wondering if there's a cleaner way.
library(dplyr)
df = data.frame(column1 = 1:100, column2 = 1:100)
parameters = data.frame(
  column_names = c('column1', 'column2'),
  parameters = c(10, 100))
custom_function = function(x, addend){
  x + addend
}
df2 = df %>% mutate(
  across(parameters$column_names,
         ~ custom_function(., addend = parameters %>%
             filter(column_names == cur_column()) %>%
             pull(parameters))))
What I would like to do for the last line would look like:
df2 = df %>% mutate(
  across(parameters$column_names, ~ custom_function(., addend = parameters$parameters)))
We can do this in base with mapply:
mapply(`+`, df[,parameters$column_names], parameters$parameters)
##> column1 column2
##> [1,] 11 101
##> [2,] 12 102
##> [3,] 13 103
##> [4,] 14 104
##> [5,] 15 105
##> ...
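Note that mapply() returns a matrix rather than a data frame; a small sketch of writing the result back into df, assuming the same df and parameters objects as above:
# assigning the matrix back replaces the selected columns of df
df[parameters$column_names] <- mapply(`+`, df[, parameters$column_names], parameters$parameters)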
I think a mapping function operating on the parameters would be easier than across() on the data:
library(tidyverse)
with(parameters, as_tibble(map2(df[column_names], parameters, custom_function)))
#> # A tibble: 100 x 2
#> column1 column2
#> <dbl> <dbl>
#> 1 11 101
#> 2 12 102
#> 3 13 103
#> 4 14 104
#> 5 15 105
#> 6 16 106
#> 7 17 107
#> 8 18 108
#> 9 19 109
#> 10 20 110
#> # ... with 90 more rows
Created on 2022-12-15 with reprex v2.0.2
We could either use match() on the column name (cur_column()) against the column_names column of 'parameters' and extract the corresponding 'parameters' value to use as input to custom_function:
library(dplyr)
df %>%
  mutate(across(all_of(parameters$column_names),
                ~ custom_function(.x, parameters$parameters[match(cur_column(),
                                                                  parameters$column_names)])))
-output
column1 column2
1 11 101
2 12 102
3 13 103
4 14 104
5 15 105
6 16 106
7 17 107
8 18 108
...
Or convert the two-column data.frame to a named vector (deframe) and extract the value directly by name:
library(tibble)
params <- deframe(parameters)
df %>%
  mutate(across(all_of(names(params)),
                ~ custom_function(.x, params[cur_column()])))
-output
column1 column2
1 11 101
2 12 102
3 13 103
4 14 104
5 15 105
6 16 106
...
Edit: this doesn't work on the second column, most likely because the extra cur_column() argument is evaluated only once rather than per column. I don't think across is what you want here; one of the other answers is better.
You could also just add the column to your function like this:
library(dplyr)
df = data.frame(column1 = 1:100, column2 = 1:100)
parameters = data.frame(
  column_names = c('column1', 'column2'),
  parameters = c(10, 100))
custom_function = function(x, column, parameters){
  addend = parameters %>%
    filter(column_names == column) %>%
    pull(parameters)
  x + addend
}
df2 = df %>% mutate(
  across(parameters$column_names, custom_function, cur_column(), parameters))

Remove non-increasing rows based on other columns' values

I have a data frame in R and I want to remove all rows that are not increasing in Column3. Each value has to be higher than or equal to the previous one, but the check has to respect the other columns: Column3 has to increase within each level of Column1 [A-B], ordered by Column2 [1:4]. Here, group [B] has to be removed because 199 > 197.
PS: these are CO2 measurements for many plots and dates. When the CO2 measurements are not monotonic over time, the measurement is wrong.
Column1 Column2 Column3
A       1       200
A       2       202
A       3       204
A       4       207
B       1       199
B       2       197
B       3       200
B       4       202
You can use diff() to determine if a group is increasing.
subset(df, ave(Column3, Column1, FUN = \(x) all(diff(x) >= 0)) == 1)
# Column1 Column2 Column3
# 1 A 1 200
# 2 A 2 202
# 3 A 3 204
# 4 A 4 207
Its dplyr equivalent:
library(dplyr)
df %>%
  group_by(Column1) %>%
  filter(all(diff(Column3) >= 0)) %>%
  ungroup()
There may be an easier way to go about it, but here is an approach:
If you want just the observation that violates the condition removed (here, the row with value 197), try this:
df %>%
  group_by(Column1) %>%
  mutate(del = (lag(Column3) > Column3)) %>%
  filter(!del | is.na(del)) %>%
  select(-del)
Output:
# Column1 Column2 Column3
# <chr> <int> <int>
# 1 A 1 200
# 2 A 2 202
# 3 A 3 204
# 4 A 4 207
# 5 B 1 199
# 6 B 3 200
# 7 B 4 202
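Note that the lag() comparison only drops rows lower than their immediate predecessor; if a dip lasted several rows (say 200, 150, 180), the 180 would survive even though it sits below the earlier 200. A sketch using the running maximum instead, which drops every row below the highest value seen so far in its group:
df %>%
  group_by(Column1) %>%
  filter(Column3 >= cummax(Column3)) %>%
  ungroup()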
If you want to remove all the observations from a given group where the condition is not met (here, group B):
df %>%
  group_by(Column1) %>%
  mutate(del = any((lag(Column3) > Column3), na.rm = TRUE)) %>%
  filter(!del) %>%
  select(-del)
Output:
# Column1 Column2 Column3
# <chr> <int> <int>
# 1 A 1 200
# 2 A 2 202
# 3 A 3 204
# 4 A 4 207
Data used in this example:
df <- read.table(text = "Column1 Column2 Column3
A 1 200
A 2 202
A 3 204
A 4 207
B 1 199
B 2 197
B 3 200
B 4 202", header = TRUE)

Create new columns based on count of occurrences in another data frame in R

I want to create a new column in df1 based on the count of non-empty entries per ID in the RM columns of df2.
(df1 and df2 were shown as screenshots here; a reproducible version of both is given below.)
For ID 100 in df1 there is no RM assigned in df2, for ID 103 there are 2 RMs, and for ID 108 there are 3 RMs.
So my final data frame should be df1 with a Total column holding those counts (100 → 0, 101 → 0, 103 → 2, 105 → 1, 108 → 3, 109 → 0).
I tried the merge function with a left join, but I'm not sure how to count the number of non-empty cells.
Here is the sample dataset
df1 <- data.frame(ID = c(100,101,103,105,108,109),
                  channel = c("A","C","C","C","D","D"),
                  duration = c(12,23,56,89,73,76))
df2 <- data.frame(ID = c(100,103,109,105,101,108),
                  RM1 = c("","john","","Miller","","Maddy"),
                  RM2 = c("","Ryan","","","","sean"),
                  RM3 = c("","","","","","Arvind"))
Add the aggregate to df2 and merge:
df2$Total <- rowSums(df2[, -1] != "")
merge(
  df1,
  df2[c("ID", "Total")],
  by = "ID",
  all.x = TRUE
)
ID channel duration Total
1 100 A 12 0
2 101 C 23 0
3 103 C 56 2
4 105 C 89 1
5 108 D 73 3
6 109 D 76 0
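If the RM columns can also contain NA (the example only has empty strings), a small sketch that counts NA cells as empty too:
# na.rm = TRUE makes NA cells count as empty rather than propagating NA
df2$Total <- rowSums(df2[, -1] != "", na.rm = TRUE)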
Here is a tidyverse solution. Join by ID, reshape to long format, and count the values that are not blank.
df1 <- data.frame(ID = c(100,101,103,105,108,109),
                  channel = c("A","C","C","C","D","D"),
                  duration = c(12,23,56,89,73,76))
df2 <- data.frame(ID = c(100,103,109,105,101,108),
                  RM1 = c("","john","","Miller","","Maddy"),
                  RM2 = c("","Ryan","","","","sean"),
                  RM3 = c("","","","","","Arvind"))
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
inner_join(df1, df2) %>%
  pivot_longer(cols = starts_with("RM")) %>%
  group_by(ID, channel, duration) %>%
  summarise(Total = sum(value != ""), .groups = "drop")
#> Joining, by = "ID"
#> # A tibble: 6 x 4
#> ID channel duration Total
#> <dbl> <chr> <dbl> <int>
#> 1 100 A 12 0
#> 2 101 C 23 0
#> 3 103 C 56 2
#> 4 105 C 89 1
#> 5 108 D 73 3
#> 6 109 D 76 0
Created on 2022-03-23 by the reprex package (v2.0.1)

What is this arrange function in the second line doing here?

I'm currently reviewing R for Data Science and encountered this chunk of code.
The question for this code is as follows. I don't understand the necessity of the arrange() call here. Doesn't arrange() just reorder the rows?
library(tidyverse)
library(nycflights13)
flights %>%
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum) %>%
  mutate(delay_gt1hr = dep_delay > 60) %>%
  mutate(before_delay = cumsum(delay_gt1hr)) %>%
  filter(before_delay < 1) %>%
  count(sort = TRUE)
However, the output does differ with and without the arrange call, as shown below:
#with the arrange function
tailnum n
<chr> <int>
1 N954UW 206
2 N952UW 163
3 N957UW 142
4 N5FAAA 117
5 N38727 99
6 N3742C 98
7 N5EWAA 98
8 N705TW 97
9 N765US 97
10 N635JB 94
# ... with 3,745 more rows
and
#Without the arrange function
tailnum n
<chr> <int>
1 N952UW 215
2 N315NB 161
3 N705TW 160
4 N961UW 139
5 N713TW 128
6 N765US 122
7 N721TW 120
8 N5FAAA 117
9 N945UW 104
10 N19130 101
# ... with 3,774 more rows
I'd appreciate it if you can help me understand this. Why is it necessary to include the arrange function here?
Yes, arrange just orders the rows, but you are filtering on cumsum() after that, and cumsum() depends on row order. In the flights example, before_delay = cumsum(delay_gt1hr) counts the long delays a plane has had so far, so arranging by tailnum and date makes filter(before_delay < 1) keep only the flights that come chronologically before a plane's first delay of more than an hour; without arrange, "before" refers to whatever order the rows happen to be in.
Here is a simplified example to demonstrate how the output differs with and without arrange:
library(dplyr)
df <- data.frame(a = 1:5, b = c(7, 8, 9, 1, 2))
df %>% filter(cumsum(b) < 20)
# a b
#1 1 7
#2 2 8
df %>% arrange(b) %>% filter(cumsum(b) < 20)
# a b
#1 4 1
#2 5 2
#3 1 7
#4 2 8

How do I add a third column to a first dataframe and fill it with the values from a column of a second dataframe whose ID matches the one in the first df?

I have the following dataframes df1 and df2 (the true ones have around a million rows):
df1 <- data.frame(ID=c(23425, 84733, 49822, 39940), X=c(312,354,765,432))
df2 <- data.frame(ID=c(23425, 49822), Y=c(111,222))
And I want to add an additional column Z to data frame df1. Each time an ID from df1 matches an ID from df2, the corresponding Y value must be added in that new column. If there is no match, a zero must be added.
The result must be this:
df <- data.frame(ID=c(23425,84733, 49822, 39940), X=c(312,354,765,432), Z=c(111,0,222,0))
I stored the IDs from the second dataframe in a vector and used a loop, but it takes forever.
I believe what you want is a join:
library(dplyr)
df1 %>%
  left_join(df2)
#> Joining, by = "ID"
#> # A tibble: 4 × 3
#> ID X Y
#> <dbl> <dbl> <dbl>
#> 1 23425 312 111
#> 2 84733 354 NA
#> 3 49822 765 222
#> 4 39940 432 NA
If you want it exactly the way you have it with a new column name and zeroes instead of NA, you can add a few more lines:
library(tidyr)
df1 %>%
  left_join(df2) %>%
  rename(Z = Y) %>%
  replace_na(replace = list(Z = 0))
#> Joining, by = "ID"
#> # A tibble: 4 × 3
#> ID X Z
#> <dbl> <dbl> <dbl>
#> 1 23425 312 111
#> 2 84733 354 0
#> 3 49822 765 222
#> 4 39940 432 0
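A sketch of the same cleanup using coalesce() from dplyr instead of replace_na(), if preferred:
df1 %>%
  left_join(df2) %>%
  mutate(Z = coalesce(Y, 0)) %>%
  select(-Y)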
We can use data.table to do a join and replace the NA with 0
library(data.table)
setDT(df1)[df2, Z := Y, on = .(ID)][is.na(Z), Z:= 0]
df1
# ID X Z
#1: 23425 312 111
#2: 84733 354 0
#3: 49822 765 222
#4: 39940 432 0
You can accomplish this simply with merge(). merge() will match the Y values in df2 to df1 by "ID". If you specify the all = TRUE argument, the Y value will be NA when no match in df2 is found for an ID in df1. (Strictly, all = TRUE keeps all rows from both data frames, a full outer join; it behaves like all.x = TRUE here because every ID in df2 also appears in df1.)
Merge the two datasets by ID, keeping all values in each dataset:
df <- merge(df1, df2, by = "ID", all = TRUE)
df
ID X Y
1 23425 312 111
2 39940 432 NA
3 49822 765 222
4 84733 354 NA
If you want no match to be specified by 0 instead of NA, just replace that value in the Y column.
df$Y <- ifelse(is.na(df$Y), 0, df$Y)
df
ID X Y
1 23425 312 111
2 39940 432 0
3 49822 765 222
4 84733 354 0
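Since the real data has around a million rows, a loop-free base-R sketch using match() may also be worth noting; it preserves df1's row order and avoids a join entirely:
# match() returns, for each ID in df1, its position in df2$ID (NA when absent)
idx <- match(df1$ID, df2$ID)
df1$Z <- ifelse(is.na(idx), 0, df2$Y[idx])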
