Remove non-increasing rows based on other columns' values - R

I have a data frame in R and I want to remove all rows that are not increasing in Column3. Each value has to be greater than or equal to the previous one. My main difficulty is that the values have to increase within the groups defined by the other columns: in my example, Column3 has to increase within each Column1 group [A-B] as Column2 goes from 1 to 4. Here, group B of Column1 has to be removed because 199 > 197.
PS: These are CO2 measurements for many plots and dates. When the CO2 measurements are not monotonic over time, the measurement is wrong.
Column1  Column2  Column3
A        1        200
A        2        202
A        3        204
A        4        207
B        1        199
B        2        197
B        3        200
B        4        202

You can use diff() to determine if a group is increasing.
subset(df, ave(Column3, Column1, FUN = \(x) all(diff(x) >= 0)) == 1)
# Column1 Column2 Column3
# 1 A 1 200
# 2 A 2 202
# 3 A 3 204
# 4 A 4 207
Its dplyr equivalent:
library(dplyr)
df %>%
group_by(Column1) %>%
filter(all(diff(Column3) >= 0)) %>%
ungroup()
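Both versions compare each value with the one on the row before it, so they assume the rows of each group are already in Column2 order. Below is a minimal base R sketch of what diff() sees per group, with an explicit sort added as a precaution. The names df_sorted, steps, increasing, and kept are illustrative and not part of the answer above.

```r
df <- read.table(text = "Column1 Column2 Column3
A 1 200
A 2 202
A 3 204
A 4 207
B 1 199
B 2 197
B 3 200
B 4 202", header = TRUE)

# Sort by group and time so that diff() compares consecutive measurements
df_sorted <- df[order(df$Column1, df$Column2), ]

# Successive differences of Column3 within each group
steps <- lapply(split(df_sorted$Column3, df_sorted$Column1), diff)

# A group is kept only if none of its differences is negative
increasing <- vapply(steps, function(d) all(d >= 0), logical(1))

# Keep the rows belonging to the increasing groups
kept <- df_sorted[df_sorted$Column1 %in% names(increasing)[increasing], ]
```

Printing steps shows the single negative step in group B (199 to 197), which is what causes the whole group to be dropped.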

There may be an easier way to go about it, but here is an approach:
If you want just the observation that violates the condition removed (here, the observation with value 197 in it), try this:
df %>% group_by(Column1) %>%
mutate(del = (lag(Column3) > Column3)) %>%
filter(!del|is.na(del)) %>%
select(-del)
Output:
# Column1 Column2 Column3
# <chr> <int> <int>
# 1 A 1 200
# 2 A 2 202
# 3 A 3 204
# 4 A 4 207
# 5 B 1 199
# 6 B 3 200
# 7 B 4 202
If you want to remove all the observations from a given group where the condition is not met (here, group B):
df %>% group_by(Column1) %>%
mutate(del = any((lag(Column3) > Column3), na.rm = TRUE)) %>%
filter(!del) %>%
select(-del)
Output:
# Column1 Column2 Column3
# <chr> <int> <int>
# 1 A 1 200
# 2 A 2 202
# 3 A 3 204
# 4 A 4 207
Data used in this example:
df <- read.table(text = "Column1 Column2 Column3
A 1 200
A 2 202
A 3 204
A 4 207
B 1 199
B 2 197
B 3 200
B 4 202", header = TRUE)
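One caveat for the row-wise variant: comparing each value only with its immediate predecessor can leave a non-monotonic series when a drop is followed by a partial recovery (for example, in 200, 197, 198 the value 198 is kept although it is below 200). A possible alternative, sketched here in base R under the assumption that the earlier, higher reading is the one to trust, compares each value with the running maximum of its group via cummax(). The names running_max and kept are illustrative.

```r
df <- read.table(text = "Column1 Column2 Column3
A 1 200
A 2 202
A 3 204
A 4 207
B 1 199
B 2 197
B 3 200
B 4 202", header = TRUE)

# Running maximum of Column3 within each Column1 group
running_max <- ave(df$Column3, df$Column1, FUN = cummax)

# Keep a row only if it is at least as large as every earlier value in its group
kept <- df[df$Column3 >= running_max, ]

# The partial-recovery case: only the first value survives
partial <- c(200, 197, 198)
partial_keep <- partial >= cummax(partial)
```

On the example data this keeps the same seven rows as the lag() version; the two approaches differ only in cases like the partial vector.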

Related

R: create new rows from preexistent dataframe

I want to create new rows based on the values of pre-existing rows in my dataset. There are two catches: first, some cell values need to remain constant while others have to increase by 1. Second, I need to cycle through every row the same number of times.
I think it will be easier to understand with data.
Here is where I am starting from:
mydata <- data.frame(id=c(10012000,10012002,10022000,10022002),
col1=c(100,201,44,11),
col2=c("A","C","B","A"))
Here is what I want:
mydata2 <- data.frame(id=c(10012000,10012001,10012002,10012003,10022000,10022001,10022002,10022003),
col1=c(100,100,201,201,44,44,11,11),
col2=c("A","A","C","C","B","B","A","A"))
Note how I add +1 in the id column cell for each new row but col1 and col2 remain constant.
Thank you
library(tidyverse)
mydata |>
mutate(id = map(id, \(x) c(x, x+1))) |>
unnest(id)
#> # A tibble: 8 × 3
#> id col1 col2
#> <dbl> <dbl> <chr>
#> 1 10012000 100 A
#> 2 10012001 100 A
#> 3 10012002 201 C
#> 4 10012003 201 C
#> 5 10022000 44 B
#> 6 10022001 44 B
#> 7 10022002 11 A
#> 8 10022003 11 A
Created on 2022-04-14 by the reprex package (v2.0.1)
You could use a tidyverse approach:
library(dplyr)
library(tidyr)
mydata %>%
group_by(id) %>%
uncount(2) %>%
mutate(id = first(id) + row_number() - 1) %>%
ungroup()
This returns
# A tibble: 8 x 3
id col1 col2
<dbl> <dbl> <chr>
1 10012000 100 A
2 10012001 100 A
3 10012002 201 C
4 10012003 201 C
5 10022000 44 B
6 10022001 44 B
7 10022002 11 A
8 10022003 11 A
library(data.table)
setDT(mydata)
final <- setorder(rbind(copy(mydata), mydata[, id := id + 1]), id)
# id col1 col2
# 1: 10012000 100 A
# 2: 10012001 100 A
# 3: 10012002 201 C
# 4: 10012003 201 C
# 5: 10022000 44 B
# 6: 10022001 44 B
# 7: 10022002 11 A
# 8: 10022003 11 A
I think this should do it:
library(dplyr)
df1 <- arrange(rbind(mutate(mydata, id = id + 1), mydata), id, col2)
Gives:
id col1 col2
1 10012000 100 A
2 10012001 100 A
3 10012002 201 C
4 10012003 201 C
5 10022000 44 B
6 10022001 44 B
7 10022002 11 A
8 10022003 11 A
In base R, for nostalgic reasons:
mydata2 <- as.data.frame(lapply(mydata, function(col) rep(col, each = 2)))
mydata2$id <- mydata2$id + 0:1
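The base R idea generalizes to any number of copies per row: repeat the row indices, then add an offset that restarts at 0 for each original row. A minimal sketch, where n_copies, expanded, and offsets are hypothetical names introduced for illustration:

```r
mydata <- data.frame(id = c(10012000, 10012002, 10022000, 10022002),
                     col1 = c(100, 201, 44, 11),
                     col2 = c("A", "C", "B", "A"))

n_copies <- 2  # how many rows each original row should become

# Repeat every row n_copies times, keeping the copies adjacent
expanded <- mydata[rep(seq_len(nrow(mydata)), each = n_copies), ]

# Offsets 0, 1, ..., n_copies - 1, restarted for every original row
offsets <- rep(seq_len(n_copies) - 1, times = nrow(mydata))
expanded$id <- expanded$id + offsets

# Drop the duplicated row names left over from the indexing
rownames(expanded) <- NULL
```

With n_copies set to 2 this reproduces the desired mydata2; a larger value would add further consecutive ids per row.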

Create numerical discrete values if values in a column equal in R

I have a column of IDs in a dataframe that sometimes has duplicates, take for example,
ID
209
315
109
315
451
209
What I want to do is take this column and create another column that indicates what ID the row belongs to. i.e. I want it to look like,
ID   ID Category
209  1
315  2
109  3
315  2
451  4
209  1
Essentially, I want to loop through the IDs: if an ID equals a previous one, I indicate that it belongs to the same ID, and if it is a new ID, I create a new indicator for it.
Does anyone know whether there is a quick function in R that does this? Or have any other suggestions?
Convert to factor with levels ordered with unique (order of appearance in the data set) and then to numeric:
data$IDCategory <- as.numeric(factor(data$ID, levels = unique(data$ID)))
#> data
# ID IDCategory
#1 209 1
#2 315 2
#3 109 3
#4 315 2
#5 451 4
#6 209 1
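The same numbering can also be obtained with match(), which returns the position of each ID in the vector of unique IDs. Since unique() keeps the order of first appearance, that position is the category. A small sketch with the example IDs (the data frame is rebuilt here so the snippet runs on its own):

```r
data <- data.frame(ID = c(209, 315, 109, 315, 451, 209))

# Position of each ID among the distinct IDs, in order of first appearance
data$IDCategory <- match(data$ID, unique(data$ID))
```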
library(tidyverse)
data <- tibble(ID= c(209,315,109,315,451,209))
data %>%
left_join(
data %>%
distinct(ID) %>%
mutate(`ID Category` = row_number())
)
#> Joining, by = "ID"
#> # A tibble: 6 × 2
#> ID `ID Category`
#> <dbl> <int>
#> 1 209 1
#> 2 315 2
#> 3 109 3
#> 4 315 2
#> 5 451 4
#> 6 209 1
Created on 2022-03-10 by the reprex package (v2.0.0)
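# Note: interaction() numbers the groups in sorted level order (109 -> 1, 209 -> 2, ...),
# not in order of first appearance.
df <- df %>%
dplyr::mutate(`ID Category` = as.numeric(interaction(ID, drop=TRUE)))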
Answer with data.table
library(data.table)
df <- as.data.table(df)
df <- df[
j = `ID Category` := as.numeric(interaction(ID, drop=TRUE))
]
The advantage of this solution is that you can create a unique ID for a combination of variables. Here you only need ID, but if you want a unique ID for, say, each [ID, Location] pair, you can:
data <- tibble(ID= c(209,209,209,315,315,315), Location = c("A","B","C","A","A","B"))
data <- data %>%
dplyr::mutate(`ID Category` = as.numeric(interaction(ID, Location, drop=TRUE)))
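Note that interaction() numbers groups by the order of the factor levels rather than by the order in which they first appear. If first-appearance numbering is wanted for a combination of columns, one possible base R sketch is to build the combined key and pass it through match(). The column name ID_Category and the variable key are illustrative.

```r
data <- data.frame(ID = c(209, 209, 209, 315, 315, 315),
                   Location = c("A", "B", "C", "A", "A", "B"))

# One combined key per row, e.g. "209.A"
key <- interaction(data$ID, data$Location, drop = TRUE)

# Number the keys in order of first appearance
data$ID_Category <- match(key, unique(key))
```

Rows four and five share the key 315.A and therefore receive the same category.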
Another way:
merge(data,
data.frame(ID = unique(data$ID),
ID.Category = seq_along(unique(data$ID))
), sort = F)
# ID ID.Category
# 1 209 1
# 2 209 1
# 3 315 2
# 4 315 2
# 5 109 3
# 6 451 4
data:
tibble(ID = c(209,315,109,315,451,209)) -> data

Counting Number of Times Each Row is Duplicated in R

In my dataset, I want to count the number of times each row appears in my dataset, which consists of five columns. I tried using table; however, this seems to only work with seeing how many times one column, not multiple, is duplicated since I get the error
attempt to make a table with >= 2^31 elements
As a quick example, say my dataframe is as follows:
dat <- data.frame(
SSN = c(204,401,204,666,401),
Name=c("Blossum","Buttercup","Blossum","MojoJojo","Buttercup"),
Age = c(7,8,7,43,8),
Gender = c(0,0,0,1,0)
)
How do I add another column with how many times each row appears in this dataframe?
With dplyr, we could group by all columns:
library(dplyr)
dat %>%
group_by(across(everything())) %>%
mutate(n = n())
# # A tibble: 5 x 5
# # Groups: SSN, Name, Age, Gender [3]
# SSN Name Age Gender n
# <dbl> <chr> <dbl> <dbl> <int>
# 1 204 Blossum 7 0 2
# 2 401 Buttercup 8 0 2
# 3 204 Blossum 7 0 2
# 4 666 MojoJojo 43 1 1
# 5 401 Buttercup 8 0 2
(mutate(n = n()) has a shortcut, add_tally(), if you prefer. Use summarize(n = n()) or count() if you want to collapse the data frame to the unique rows while adding counts.)
Using the data.table package: setDT() converts the data.frame into a data.table in place.
The := operator modifies dat in place, adding the count of rows (.N) grouped by all columns of dat (by = names(dat)).
Note: the result of an in-place modification is returned invisibly, so you need to print it explicitly or append [] (dat[, ...][]).
library(data.table)
setDT(dat)
dat[,by=names(dat),N:=.N][]
#> SSN Name Age Gender N
#> 1: 204 Blossum 7 0 2
#> 2: 401 Buttercup 8 0 2
#> 3: 204 Blossum 7 0 2
#> 4: 666 MojoJojo 43 1 1
#> 5: 401 Buttercup 8 0 2
or (to collapse lines)
setDT(dat)
dat[,by=names(dat),.N]
#> SSN Name Age Gender N
#> 1: 204 Blossum 7 0 2
#> 2: 401 Buttercup 8 0 2
#> 3: 666 MojoJojo 43 1 1
We can use add_count without grouping as well
library(dplyr)
dat %>%
add_count(across(everything()))
-output
# SSN Name Age Gender n
#1 204 Blossum 7 0 2
#2 401 Buttercup 8 0 2
#3 204 Blossum 7 0 2
#4 666 MojoJojo 43 1 1
#5 401 Buttercup 8 0 2
I am not sure which is your desired output. Below are some base R options
> aggregate(
+ cnt ~ .,
+ cbind(dat, cnt = 1),
+ sum
+ )
SSN Name Age Gender cnt
1 204 Blossum 7 0 2
2 401 Buttercup 8 0 2
3 666 MojoJojo 43 1 1
> transform(
+ cbind(dat, n = 1),
+ n = ave(n, SSN, Name, Age, Gender, FUN = sum)
+ )
SSN Name Age Gender n
1 204 Blossum 7 0 2
2 401 Buttercup 8 0 2
3 204 Blossum 7 0 2
4 666 MojoJojo 43 1 1
5 401 Buttercup 8 0 2
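As for the error in the question: table() applied to a whole data frame cross-tabulates the columns, so it tries to allocate one cell for every combination of their distinct values, which can quickly exceed 2^31. A sketch that sidesteps this by collapsing each row into a single key first (the separator and the names key and n are arbitrary choices for illustration):

```r
dat <- data.frame(
  SSN = c(204, 401, 204, 666, 401),
  Name = c("Blossum", "Buttercup", "Blossum", "MojoJojo", "Buttercup"),
  Age = c(7, 8, 7, 43, 8),
  Gender = c(0, 0, 0, 1, 0)
)

# Paste all columns of a row into one string; "\r" is unlikely to occur in the data
key <- do.call(paste, c(dat, sep = "\r"))

# For every row, the number of rows sharing its key
dat$n <- ave(seq_along(key), key, FUN = length)
```

This only ever stores one count per row, so memory use grows with the number of rows rather than with the number of possible value combinations.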

Remove if unit only has one observation

I have a long form of clinical data that looks something like this:
patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
outcome <- c(1,1,1,1,1,NA,1,NA,NA,NA,NA,NA)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
A patient should be kept in the database only if they have 2 or 3 observations (so patients that have complete data for 0 or only 1 time points should be thrown out). So for this example my desired result is this:
patientid <- c(100,100,100,101,101,101)
outcome <- c(1,1,1,1,1,NA)
time <- c(1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
Hence patients 102 and 104 are thrown out of the database because they were missing the outcome variable in 2 or 3 of the time points.
We can group by 'patientid', count the non-NA 'outcome' values in each group with sum(!is.na(outcome)), and filter to keep the patients with more than one:
library(dplyr)
Data %>%
group_by(patientid) %>%
filter(sum(!is.na(outcome)) > 1) %>%
ungroup()
-output
# A tibble: 6 x 3
# patientid outcome time
# <dbl> <dbl> <dbl>
#1 100 1 1
#2 100 1 2
#3 100 1 3
#4 101 1 1
#5 101 1 2
#6 101 NA 3
A base R option using subset + ave
subset(
Data,
ave(!is.na(outcome), patientid, FUN = sum) > 1
)
giving
patientid outcome time
1 100 1 1
2 100 1 2
3 100 1 3
4 101 1 1
5 101 1 2
6 101 NA 3
A data.table option
library(data.table)
setDT(Data)[, Y := sum(!is.na(outcome)), patientid][Y > 1, ][, Y := NULL][]
or a simpler one (thanks @akrun)
setDT(Data)[Data[, .I[sum(!is.na(outcome)) > 1], .(patientid)]$V1]
which gives
patientid outcome time
1: 100 1 1
2: 100 1 2
3: 100 1 3
4: 101 1 1
5: 101 1 2
6: 101 NA 3
library(dplyr)
Data %>%
group_by(patientid) %>%
# Sum the outcome values per patient; this equals the number of non-missing
# observations only because every recorded outcome here is 1
mutate(observation = sum(outcome, na.rm = TRUE)) %>%
filter(observation >= 2) %>%
ungroup()
output:
# A tibble: 6 x 4
patientid outcome time observation
<dbl> <dbl> <dbl> <dbl>
1 100 1 1 3
2 100 1 2 3
3 100 1 3 3
4 101 1 1 2
5 101 1 2 2
6 101 NA 3 2
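To see why these filters keep patients 100 and 101, it can help to look at the per-patient counts directly. Below is a base R sketch that makes the threshold explicit; min_obs, n_obs, keep_ids, and kept are illustrative names, not taken from the answers above.

```r
patientid <- c(100, 100, 100, 101, 101, 101, 102, 102, 102, 104, 104, 104)
outcome <- c(1, 1, 1, 1, 1, NA, 1, NA, NA, NA, NA, NA)
time <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)
Data <- data.frame(patientid = patientid, outcome = outcome, time = time)

# Number of non-missing outcomes per patient (names are the patient IDs)
n_obs <- tapply(!is.na(Data$outcome), Data$patientid, sum)

# Patients with at least min_obs recorded outcomes
min_obs <- 2
keep_ids <- as.numeric(names(n_obs)[n_obs >= min_obs])

kept <- Data[Data$patientid %in% keep_ids, ]
```

Counting with !is.na() rather than summing the outcome values keeps the filter valid even if outcome takes values other than 1.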

How can I create an incremental ID column based on whenever one of two variables are encountered?

My data came to me like this (but with 4000+ records). The following is data for 4 patients. Every time you see surgery OR age reappear, it is referring to a new patient.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
So to say again, every time surgery or age appear (surgery isn't always there, but age is), those records and the ones after pertain to the same patient until you see surgery or age appear again.
Thus I somehow need to add an ID column with this data:
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,4)
testdat$ID = ID
I know how to transpose and melt and all that to put the data into regular format, but how can I create that ID column?
Advice on relevant tags to use is helpful!
Assuming that surgery and/or age are always the first pieces of information for each patient, and that each patient has at least one piece of information afterward that is neither age nor surgery, this is a solution.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
library(dplyr)
library(tidyr)
# Use a tibble and get rid of factors.
dfTest = as_tibble(testdat) %>%
mutate_all(as.character)
# Flag rows that could start a new patient, keep only the first row of each run of such rows, then number the patients.
dfTest = dfTest %>%
mutate(couldBeStart = col1 %in% c("surgery", "age")) %>%
mutate(isStart = couldBeStart & !lag(couldBeStart, default = FALSE)) %>%
mutate(patientID = cumsum(isStart)) %>%
select(-couldBeStart, -isStart)
# # A tibble: 17 x 3
# col1 col2 patientID
# <chr> <chr> <int>
# 1 surgery yes 1
# 2 age 54 1
# 3 weight 153 1
# 4 albumin normal 1
# 5 abiotics 2 1
# 6 surgery no 2
# 7 age 65 2
# 8 weight 134 2
# 9 BAPPS yes 2
# 10 abiotics 1 2
# 11 surgery yes 3
# 12 age 61 3
# 13 weight 210 3
# 14 age 46 4
# 15 weight 178 4
# 16 BAPPS no 4
# 17 albumin low 4
# Get the data to a wide workable format.
dfTest %>% spread(col1, col2)
# # A tibble: 4 x 7
# patientID abiotics age albumin BAPPS surgery weight
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2 54 normal NA yes 153
# 2 2 1 65 NA yes no 134
# 3 3 NA 61 NA NA yes 210
# 4 4 NA 46 low no NA 178
Using dplyr:
library(dplyr)
testdat = testdat %>%
mutate(patient_counter = cumsum(col1 == 'surgery' | (col1 == 'age' & lag(col1 != 'surgery', default = TRUE))))
This works by flagging each row whose col1 value is 'surgery', or is 'age' and not preceded by 'surgery' (default = TRUE makes this also hold when the very first row is 'age'). It then uses cumsum() to get the cumulative sum of the resulting logical vector, which increases by one at each new patient.
You can try the following
keywords <- c('surgery', 'age')
lgl <- testdat$col1 %in% keywords
testdat$ID <- cumsum(c(0, diff(lgl)) == 1) + 1
col1 col2 ID
1 surgery yes 1
2 age 54 1
3 weight 153 1
4 albumin normal 1
5 abiotics 2 1
6 surgery no 2
7 age 65 2
8 weight 134 2
9 BAPPS yes 2
10 abiotics 1 2
11 surgery yes 3
12 age 61 3
13 weight 210 3
14 age 46 4
15 weight 178 4
16 BAPPS no 4
17 albumin low 4
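All three answers rest on the same idea: mark the rows where a new patient starts, then take the cumulative sum of those marks. A step-by-step base R sketch on a shortened, made-up vector (not the question's data) shows the intermediate values; is_key, prev_is_key, starts, and id are illustrative names.

```r
# Hypothetical toy input: three patients, the second one without "surgery"
col1 <- c("surgery", "age", "weight", "age", "weight", "surgery", "age")

# TRUE where the row is one of the two marker variables
is_key <- col1 %in% c("surgery", "age")

# The same flag shifted down by one row (FALSE before the first row)
prev_is_key <- c(FALSE, head(is_key, -1))

# A patient starts at a marker row that does not directly follow another marker row
starts <- is_key & !prev_is_key

# Running count of starts gives the patient ID
id <- cumsum(starts)
```

Here starts is TRUE at positions 1, 4, and 6, so id takes the values 1, 1, 1, 2, 2, 3, 3.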
