I want to create new rows based on the values of pre-existing rows in my dataset. There are two catches: first, some cell values need to stay constant while others have to increase by 1. Second, I need to cycle through every row the same number of times.
I think it will be easier to understand with data.
Here is where I am starting from:
mydata <- data.frame(id = c(10012000, 10012002, 10022000, 10022002),
                     col1 = c(100, 201, 44, 11),
                     col2 = c("A", "C", "B", "A"))
Here is what I want:
mydata2 <- data.frame(id = c(10012000, 10012001, 10012002, 10012003, 10022000, 10022001, 10022002, 10022003),
                      col1 = c(100, 100, 201, 201, 44, 44, 11, 11),
                      col2 = c("A", "A", "C", "C", "B", "B", "A", "A"))
Note how I add +1 to the id for each new row while col1 and col2 remain constant.
Thank you
library(tidyverse)
mydata |>
  mutate(id = map(id, \(x) c(x, x + 1))) |>
  unnest(id)
#> # A tibble: 8 × 3
#> id col1 col2
#> <dbl> <dbl> <chr>
#> 1 10012000 100 A
#> 2 10012001 100 A
#> 3 10012002 201 C
#> 4 10012003 201 C
#> 5 10022000 44 B
#> 6 10022001 44 B
#> 7 10022002 11 A
#> 8 10022003 11 A
Created on 2022-04-14 by the reprex package (v2.0.1)
You could use a tidyverse approach:
library(dplyr)
library(tidyr)
mydata %>%
  group_by(id) %>%
  uncount(2) %>%
  mutate(id = first(id) + row_number() - 1) %>%
  ungroup()
This returns
# A tibble: 8 x 3
id col1 col2
<dbl> <dbl> <chr>
1 10012000 100 A
2 10012001 100 A
3 10012002 201 C
4 10012003 201 C
5 10022000 44 B
6 10022001 44 B
7 10022002 11 A
8 10022003 11 A
library(data.table)
setDT(mydata)
# copy() keeps the original rows; `:=` then adds 1 to id by reference
# (note this also modifies mydata itself), and setorder() sorts the result by id
final <- setorder(rbind(copy(mydata), mydata[, id := id + 1]), id)
# id col1 col2
# 1: 10012000 100 A
# 2: 10012001 100 A
# 3: 10012002 201 C
# 4: 10012003 201 C
# 5: 10022000 44 B
# 6: 10022001 44 B
# 7: 10022002 11 A
# 8: 10022003 11 A
I think this should do it:
library(dplyr)
df1 <- arrange(rbind(mutate(mydata, id = id + 1), mydata), id, col2)
Gives:
id col1 col2
1 10012000 100 A
2 10012001 100 A
3 10012002 201 C
4 10012003 201 C
5 10022000 44 B
6 10022001 44 B
7 10022002 11 A
8 10022003 11 A
In base R, for nostalgic reasons:
# Repeat every column value twice, then add 0:1 (recycled down the column),
# so each pair of duplicated rows gets consecutive ids
mydata2 <- as.data.frame(lapply(mydata, function(col) rep(col, each = 2)))
mydata2$id <- mydata2$id + 0:1
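If each row had to be repeated more than twice, the same idea generalizes; a minimal sketch, where k is a hypothetical repetition count (the question uses 2):
k <- 3  # hypothetical repetition count
mydata2 <- as.data.frame(lapply(mydata, function(col) rep(col, each = k)))
# seq_len(k) - 1 is 0, 1, ..., k - 1, recycled down the column,
# so each block of k duplicated rows gets consecutive ids
mydata2$id <- mydata2$id + (seq_len(k) - 1)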
I have a column of IDs in a dataframe that sometimes has duplicates, take for example,
ID
209
315
109
315
451
209
What I want to do is take this column and create another column that indicates what ID the row belongs to. i.e. I want it to look like,
ID    ID Category
209   1
315   2
109   3
315   2
451   4
209   1
Essentially, I want to loop through the IDs and, if an ID equals a previous one, indicate that it belongs to the same ID; if it is a new ID, create a new indicator for it.
Does anyone know of a quick function in R that does this? Or have any other suggestions?
Convert to factor with levels ordered by unique() (order of appearance in the data set) and then to numeric:
data$IDCategory <- as.numeric(factor(data$ID, levels = unique(data$ID)))
#> data
# ID IDCategory
#1 209 1
#2 315 2
#3 109 3
#4 315 2
#5 451 4
#6 209 1
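The same first-appearance numbering can also be obtained with match(); a minimal sketch on the same data:
# match() returns the position of each ID within unique(data$ID),
# which lists the IDs in order of first appearance
data$IDCategory <- match(data$ID, unique(data$ID))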
library(tidyverse)
data <- tibble(ID= c(209,315,109,315,451,209))
data %>%
  left_join(
    data %>%
      distinct(ID) %>%
      mutate(`ID Category` = row_number())
  )
#> Joining, by = "ID"
#> # A tibble: 6 × 2
#> ID `ID Category`
#> <dbl> <int>
#> 1 209 1
#> 2 315 2
#> 3 109 3
#> 4 315 2
#> 5 451 4
#> 6 209 1
Created on 2022-03-10 by the reprex package (v2.0.0)
df <- df %>%
  dplyr::mutate(`ID Category` = as.numeric(interaction(ID, drop = TRUE)))
Answer with data.table
library(data.table)
df <- as.data.table(df)
df <- df[
  j = `ID Category` := as.numeric(interaction(ID, drop = TRUE))
]
The pro of this solution is that you can create a unique ID for a group of variables. Here you only need ID, but if you wanted a unique ID for, say, the pair [ID, Location], you could:
data <- tibble(ID = c(209, 209, 209, 315, 315, 315), Location = c("A", "B", "C", "A", "A", "B"))
data <- data %>%
  dplyr::mutate(`ID Category` = as.numeric(interaction(ID, Location, drop = TRUE)))
another way:
merge(data,
      data.frame(ID = unique(data$ID),
                 ID.Category = seq_along(unique(data$ID))),
      sort = FALSE)
# ID ID.Category
# 1 209 1
# 2 209 1
# 3 315 2
# 4 315 2
# 5 109 3
# 6 451 4
data:
tibble(ID = c(209,315,109,315,451,209)) -> data
In my dataset, which consists of five columns, I want to count the number of times each row appears. I tried using table(); however, that seems to only work for a single column rather than several, since I get the error
attempt to make a table with >= 2^31 elements
As a quick example, say my dataframe is as follows:
dat <- data.frame(
  SSN = c(204, 401, 204, 666, 401),
  Name = c("Blossum", "Buttercup", "Blossum", "MojoJojo", "Buttercup"),
  Age = c(7, 8, 7, 43, 8),
  Gender = c(0, 0, 0, 1, 0)
)
How do I add another column with how many times each row appears in this dataframe?
With dplyr, we could group by all columns:
dat %>%
  group_by(across(everything())) %>%
  mutate(n = n())
# # A tibble: 5 x 5
# # Groups: SSN, Name, Age, Gender [3]
# SSN Name Age Gender n
# <dbl> <chr> <dbl> <dbl> <int>
# 1 204 Blossum 7 0 2
# 2 401 Buttercup 8 0 2
# 3 204 Blossum 7 0 2
# 4 666 MojoJojo 43 1 1
# 5 401 Buttercup 8 0 2
(mutate(n = n()) has a shortcut, add_tally(), if you prefer. Use summarize(n = n()) or count() if you want to collapse the data frame to the unique rows while adding counts.)
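For example, a minimal sketch of the collapsing variant (assuming dplyr >= 1.0 for across()):
library(dplyr)
# Collapse to one row per distinct combination of all columns,
# with the number of occurrences in n
dat %>%
  count(across(everything()))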
Using the data.table package. setDT() transforms the data.frame into a data.table in place.
In-place (:=) modification of dat adds the count (.N) of rows grouped by all columns of dat (by = names(dat)).
Note: the result of an in-place modification is returned invisibly, so you need to print it explicitly or append [] (dat[, ...][]).
setDT(dat)
dat[, N := .N, by = names(dat)][]
#> SSN Name Age Gender N
#> 1: 204 Blossum 7 0 2
#> 2: 401 Buttercup 8 0 2
#> 3: 204 Blossum 7 0 2
#> 4: 666 MojoJojo 43 1 1
#> 5: 401 Buttercup 8 0 2
Or, to collapse the rows:
setDT(dat)
dat[, .N, by = names(dat)]
#> SSN Name Age Gender N
#> 1: 204 Blossum 7 0 2
#> 2: 401 Buttercup 8 0 2
#> 3: 666 MojoJojo 43 1 1
We can use add_count without grouping as well
library(dplyr)
dat %>%
  add_count(across(everything()))
-output
# SSN Name Age Gender n
#1 204 Blossum 7 0 2
#2 401 Buttercup 8 0 2
#3 204 Blossum 7 0 2
#4 666 MojoJojo 43 1 1
#5 401 Buttercup 8 0 2
I am not sure which is your desired output. Below are some base R options
aggregate(
  cnt ~ .,
  cbind(dat, cnt = 1),
  sum
)
SSN Name Age Gender cnt
1 204 Blossum 7 0 2
2 401 Buttercup 8 0 2
3 666 MojoJojo 43 1 1
transform(
  cbind(dat, n = 1),
  n = ave(n, SSN, Name, Age, Gender, FUN = sum)
)
SSN Name Age Gender n
1 204 Blossum 7 0 2
2 401 Buttercup 8 0 2
3 204 Blossum 7 0 2
4 666 MojoJojo 43 1 1
5 401 Buttercup 8 0 2
I have a long form of clinical data that looks something like this:
patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
outcome <- c(1,1,1,1,1,NA,1,NA,NA,NA,NA,NA)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
A patient should be kept in the database only if they have 2 or 3 observations (so patients that have complete data for 0 or only 1 time point should be thrown out). So for this example my desired result is this:
patientid <- c(100,100,100,101,101,101)
outcome <- c(1,1,1,1,1,NA)
time <- c(1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
Hence patients 102 and 104 are thrown out of the database because they were missing the outcome variable at 2 or 3 of the time points.
We can sum the non-NA elements of 'outcome', grouped by 'patientid', and use that sum in a logical expression to keep only the patients having more than one non-NA 'outcome':
library(dplyr)
Data %>%
  group_by(patientid) %>%
  filter(sum(!is.na(outcome)) > 1) %>%
  ungroup()
-output
# A tibble: 6 x 3
# patientid outcome time
# <dbl> <dbl> <dbl>
#1 100 1 1
#2 100 1 2
#3 100 1 3
#4 101 1 1
#5 101 1 2
#6 101 NA 3
A base R option using subset + ave
subset(
  Data,
  ave(!is.na(outcome), patientid, FUN = sum) > 1
)
giving
patientid outcome time
1 100 1 1
2 100 1 2
3 100 1 3
4 101 1 1
5 101 1 2
6 101 NA 3
A data.table option
setDT(Data)[, Y := sum(!is.na(outcome)), patientid][Y > 1, ][, Y := NULL][]
or a simpler one (thanks @akrun), which collects the row indices (.I) of the qualifying groups and subsets by them
setDT(Data)[Data[, .I[sum(!is.na(outcome)) > 1], .(patientid)]$V1]
which gives
patientid outcome time
1: 100 1 1
2: 100 1 2
3: 100 1 3
4: 101 1 1
5: 101 1 2
6: 101 NA 3
library(dplyr)
Data %>%
  group_by(patientid) %>%
  # sum the outcome values per patient; because outcome is 1 or NA here,
  # this equals the number of non-missing observations
  mutate(observation = sum(outcome, na.rm = TRUE)) %>%
  filter(observation >= 2) %>%
  ungroup()
output:
# A tibble: 6 x 4
patientid outcome time observation
<dbl> <dbl> <dbl> <dbl>
1 100 1 1 3
2 100 1 2 3
3 100 1 3 3
4 101 1 1 2
5 101 1 2 2
6 101 NA 3 2
My data came to me like this (but with 4000+ records). The following is data for 4 patients. Every time you see surgery OR age reappear, it is referring to a new patient.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
So, to say it again: every time surgery or age appears (surgery isn't always there, but age is), that record and the ones after it pertain to the same patient, until you see surgery or age appear again.
Thus I somehow need to add an ID column with this data:
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,4)
testdat$ID = ID
I know how to transpose and melt and all that to put the data into regular format, but how can I create that ID column?
Advice on relevant tags to use is helpful!
Assuming that surgery and age will be the first two pieces of information for each patient, and that each patient will have some information afterward that is not age or surgery, this is a solution.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
library(tidyverse)

# Use a tibble and get rid of factors.
dfTest = as_tibble(testdat) %>%
  mutate_all(as.character)
# A little dplyr magic to find the start of a new patient, then give them an id.
dfTest = dfTest %>%
  mutate(couldBeStart = if_else(col1 == "surgery" | col1 == "age", TRUE, FALSE)) %>%
  mutate(isStart = couldBeStart & !lag(couldBeStart, default = FALSE)) %>%
  mutate(patientID = cumsum(isStart)) %>%
  select(-couldBeStart, -isStart)
# # A tibble: 17 x 3
# col1 col2 patientID
# <chr> <chr> <int>
# 1 surgery yes 1
# 2 age 54 1
# 3 weight 153 1
# 4 albumin normal 1
# 5 abiotics 2 1
# 6 surgery no 2
# 7 age 65 2
# 8 weight 134 2
# 9 BAPPS yes 2
# 10 abiotics 1 2
# 11 surgery yes 3
# 12 age 61 3
# 13 weight 210 3
# 14 age 46 4
# 15 weight 178 4
# 16 BAPPS no 4
# 17 albumin low 4
# Get the data to a wide workable format.
dfTest %>% spread(col1, col2)
# # A tibble: 4 x 7
# patientID abiotics age albumin BAPPS surgery weight
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2 54 normal NA yes 153
# 2 2 1 65 NA yes no 134
# 3 3 NA 61 NA NA yes 210
# 4 4 NA 46 low no NA 178
Using dplyr:
library(dplyr)
testdat = testdat %>%
  mutate(patient_counter = cumsum(col1 == 'surgery' | (col1 == 'age' & lag(col1 != 'surgery'))))
This works by checking whether the col1 value is either 'surgery' or 'age', provided 'age' is not preceded by 'surgery'. It then uses cumsum() to get the cumulative sum of the resulting logical vector.
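To see the intermediate step, a small sketch (assuming the same testdat and library(dplyr) as above, so that dplyr's lag() is used):
library(dplyr)
# Logical vector marking the first row of each patient block:
# TRUE for 'surgery', or for 'age' that is not immediately preceded by 'surgery'
starts <- with(testdat, col1 == 'surgery' | (col1 == 'age' & lag(col1 != 'surgery')))
# The running count of these starts is the patient id for every row
cumsum(starts)
#> [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 4 4 4 4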
You can try the following
keywords <- c('surgery', 'age')
# TRUE wherever col1 is one of the keyword rows
lgl <- testdat$col1 %in% keywords
# diff(lgl) is 1 where the keyword flag switches from FALSE to TRUE, i.e. at the
# start of each patient block after the first; adding 1 accounts for the first patient
testdat$ID <- cumsum(c(0, diff(lgl)) == 1) + 1
col1 col2 ID
1 surgery yes 1
2 age 54 1
3 weight 153 1
4 albumin normal 1
5 abiotics 2 1
6 surgery no 2
7 age 65 2
8 weight 134 2
9 BAPPS yes 2
10 abiotics 1 2
11 surgery yes 3
12 age 61 3
13 weight 210 3
14 age 46 4
15 weight 178 4
16 BAPPS no 4
17 albumin low 4