I have a dataset in R in long format. The IDs do not all appear the same number of times (e.g. one ID might have a single row, another might appear in 79 rows).
e.g.
ID V1 V2
1 B 0
1 A 1
1 C 0
2 C 0
3 A 0
3 C 0
I want to create a variable that is 1 on every row of an ID if any row for that ID has V2 == 1, and 0 otherwise
e.g.
ID V1 V2 V3
1 B 0 1
1 A 1 1
1 C 0 1
2 C 0 0
3 A 0 0
3 C 0 0
In base R, we can use any() to test each group and ave() to apply the test by ID.
DF$V3 <- with(DF, ave(V2, ID, FUN = function(x) any(x == 1)))
DF
# ID V1 V2 V3
#1 1 B 0 1
#2 1 A 1 1
#3 1 C 0 1
#4 2 C 0 0
#5 3 A 0 0
#6 3 C 0 0
data
DF <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), V1 = c("B", "A",
"C", "C", "A", "C"), V2 = c(0L, 1L, 0L, 0L, 0L, 0L)), .Names = c("ID",
"V1", "V2"), class = "data.frame", row.names = c(NA, -6L))
Here's a tidyverse solution.
If V2 can only be 0 or 1:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(V3 = max(V2))
If V2 can take other values and you want to check that it is exactly 1:
df %>%
group_by(ID) %>%
mutate(V3 = as.numeric(any(V2 == 1)))
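The difference between the two variants only shows up when V2 holds values other than 0 and 1. Here is a minimal base R sketch of that distinction, using ave() instead of dplyr so it runs without extra packages; the small data frame `toy` is made up for illustration and is not the asker's data.

```r
# Hypothetical data: ID 1 has a V2 value of 2 but never exactly 1
toy <- data.frame(ID = c(1L, 1L, 2L, 2L, 3L),
                  V2 = c(0L, 2L, 0L, 1L, 0L))

# Group maximum: only equals the desired flag when V2 is strictly 0/1
by_max <- ave(toy$V2, toy$ID, FUN = max)

# Explicit test: 1 only when some row in the group has V2 == 1
by_any <- ave(toy$V2, toy$ID, FUN = function(x) any(x == 1))

by_max
by_any
```

For ID 1 the group maximum is 2, while the any(V2 == 1) flag is 0, so the two columns disagree there and agree elsewhere.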
Another base R option is
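rs <- rowsum(DF$V2, DF$ID)  # per-ID sums of V2; row names are the ID values
DF$V3 <- +(DF$ID %in% rownames(rs)[rs[, 1] > 0])  # look up by ID value, not row position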
The matrix I have looks something like this:
Plot A B C
1 1 0 0
2 1 0 1
3 1 1 0
And I have a dataframe that looks like this
A 5
B 4
C 2
What I would like to do is replace the "1" values in the matrix with the corresponding values in the dataframe, like this:
Plot A B C
1 5 0 0
2 5 0 2
3 5 4 0
Any suggestions on how to do this in R? Thank you!
An option with tidyverse
library(dplyr)
df1 %>%
mutate(across(all_of(df2$col1),
~ replace(.x, .x== 1, df2$col2[match(cur_column(), df2$col1)])))
-output
Plot A B C
1 1 5 0 0
2 2 5 0 2
3 3 5 4 0
data
df1 <- structure(list(Plot = 1:3, A = c(1L, 1L, 1L), B = c(0L, 0L, 1L
), C = c(0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(col1 = c("A", "B", "C"), col2 = c(5, 4, 2)),
class = "data.frame", row.names = c(NA,
-3L))
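A base R alternative, offered as a sketch rather than a drop-in replacement: if the indicator columns contain only 0 and 1 (an assumption taken from the example), multiplying each column by its lookup value gives the same result as replacing the 1s. The names `cols` and `res` are introduced here for illustration.

```r
df1 <- data.frame(Plot = 1:3, A = c(1L, 1L, 1L), B = c(0L, 0L, 1L),
                  C = c(0L, 1L, 0L))
df2 <- data.frame(col1 = c("A", "B", "C"), col2 = c(5, 4, 2))

cols <- df2$col1   # columns to update, in the lookup table's order
res <- df1
# Multiply column j by df2$col2[j]; valid only for strictly 0/1 columns
res[cols] <- sweep(as.matrix(df1[cols]), 2, df2$col2, `*`)
res
```

If a column could hold values other than 0 and 1, the replace()-based version above is the safer choice, since it only touches cells equal to 1.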
I have something like the following:
person_ID visit date
1 2/25/2001
1 2/30/2001
1 4/2/2001
2 3/18/2004
3 9/22/2004
3 10/27/2004
3 5/15/2008
I want to add another column to see if the person has a reoccurring observation within 90 days, like:
person_ID visit date reoccurrence
1 2/25/2001 1
1 2/30/2001 1
1 4/2/2001 0
2 3/18/2004 0
3 9/22/2004 1
3 10/27/2004 0
3 5/15/2008 0
any help is appreciated, thank you!
Assuming the second date is a typo (2/30/2001 is not a valid date), convert 'visit_date' to Date class, group by 'person_ID', take the difference in days between the next and the current 'visit_date', check whether it is less than 90, and replace the NA produced for the last visit of each person with 0
library(dplyr)
library(lubridate)
library(tidyr)
df1 <- df1 %>%
mutate(visit_date = mdy(visit_date)) %>%
group_by(person_ID) %>%
mutate(reoccurrence = replace_na(+(difftime(lead(visit_date),
visit_date, units = 'day') < 90), 0)) %>%
ungroup
-output
# A tibble: 7 x 3
# person_ID visit_date reoccurrence
# <int> <date> <dbl>
#1 1 2001-02-25 1
#2 1 2001-02-28 1
#3 1 2001-04-02 0
#4 2 2004-03-18 0
#5 3 2004-09-22 1
#6 3 2004-10-27 0
#7 3 2008-05-15 0
Or using data.table
library(data.table)
setDT(df1)[, visit_date := as.IDate(visit_date, '%m/%d/%Y')
][, reoccurrence := +(difftime(shift(visit_date, type = 'lead'),
visit_date, units = 'days') < 90), by = person_ID  # compare within each person only
][is.na(reoccurrence), reoccurrence := 0L]
Or with base R
df1$visit_date <- as.Date(df1$visit_date, '%m/%d/%Y')
with(df1, ave(as.integer(visit_date), person_ID, FUN =
function(x) c(+(diff(x) < 90), 0)))
#[1] 1 1 0 0 1 0 0
data
df1 <- structure(list(person_ID = c(1L, 1L, 1L, 2L, 3L, 3L, 3L), visit_date = c("2/25/2001",
"2/28/2001", "4/2/2001", "3/18/2004", "9/22/2004", "10/27/2004",
"5/15/2008")), row.names = c(NA, -7L), class = "data.frame")
Base R variant:
reoccur <- function(x, lim=90) {
m <- outer(x, x, `-`)
m[upper.tri(m, diag=TRUE)] <- NA
colSums(!is.na(m) & m >= 0 & m <= lim) > 0
}
### make your dates *dates*
dat$visit <- as.Date(dat$visit, format="%m/%d/%Y")
### calculate if you have reoccurrences
ave(as.numeric(dat$visit), dat$person_ID, FUN=reoccur)
# [1] 1 1 0 0 1 0 0
Data:
dat <- structure(list(person_ID = c(1L, 1L, 1L, 2L, 3L, 3L, 3L), visit = c("2/25/2001", "2/27/2001", "4/2/2001", "3/18/2004", "9/22/2004", "10/27/2004", "5/15/2008")), class = "data.frame", row.names = c(NA, -7L))
(I changed "2/30/2001" to "2/27/2001" to get a real Date out of it.)
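The reason the invalid date has to be changed can be checked directly. A small sketch, assuming the same "%m/%d/%Y" format as above (the vector `visits` is illustrative): as.Date() returns NA for a day that does not exist, and that NA would then propagate into the date differences.

```r
visits <- c("2/25/2001", "2/30/2001", "4/2/2001")

# February 30 does not exist, so that element parses to NA
parsed <- as.Date(visits, format = "%m/%d/%Y")

# Flag unparseable dates before computing differences
bad <- is.na(parsed)
visits[bad]
```

Checking is.na() on the parsed column first makes it easier to spot and correct such entries in the raw data.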
I have two data frames that I want to merge. Where a column exists in both, the values from one data frame should be added to the values of the other; columns that exist in only one should be carried over unchanged.
For example:
df1
No A B C D
1 1 0 1 0
2 0 1 2 1
3 0 0 1 0
df2
No A B E F
1 1 0 1 1
2 0 1 2 1
3 2 1 1 0
Finally, I want the output table like this.
df
No A B C D E F
1 2 0 1 0 1 1
2 0 2 2 1 2 1
3 2 1 1 0 1 0
Note: I did try merge(), but in this case, it did not work.
Any help/suggestion would be appreciated.
Reproducible sample data
df1 <-
structure(list(No = 1:3, A = c(1L, 0L, 0L), B = c(0L, 1L, 0L),
C = c(1L, 2L, 1L), D = c(0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
df2 <-
structure(list(No = 1:3, A = c(1L, 0L, 2L), B = c(0L, 1L, 1L),
E = c(1L, 2L, 1L), F = c(1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
You can also carry out this operation by left_joining these two data frames:
library(dplyr)
library(stringr)
df1 %>%
left_join(df2, by = "No") %>%
mutate(across(ends_with(".x"), ~ .x + get(str_replace(cur_column(), "\\.x", "\\.y")))) %>%
rename_with(~ str_replace(., "\\.x", ""), ends_with(".x")) %>%
select(!ends_with(".y"))
No A B C D E F
1 1 2 0 1 0 1 1
2 2 0 2 2 1 2 1
3 3 2 1 1 0 1 0
You can first row-bind the two dataframes and then compute the sum of each column while 'grouping' by the No column. This can be done like so:
library(dplyr)
bind_rows(df1, df2) %>%
group_by(No) %>%
# lambda form; passing extra arguments through across() is deprecated in dplyr >= 1.1.0
summarise(across(c(A, B, C, D, E, `F`), ~ sum(.x, na.rm = TRUE)),
.groups = "drop")
If a particular column doesn't exist in one dataframe (i.e. columns E and F), values will be padded with NA. Adding the na.rm = TRUE argument (to be passed to sum()) means that these values will get treated like zeros.
Using data.table :
library(data.table)
rbindlist(list(df1, df2), fill = TRUE)[, lapply(.SD, sum, na.rm = TRUE), No]
# No A B C D E F
#1: 1 2 0 1 0 1 1
#2: 2 0 2 2 1 2 1
#3: 3 2 1 1 0 1 0
We can also use base R (R 4.1.0 or later, for the native pipe and lambda syntax). Collect the data frames in a list ('lst1') and find the union of their column names ('nm1'). Then loop over the list, use setdiff to add any missing columns filled with 0, rbind the elements, and use aggregate to get the sum grouped by 'No'
lst1 <- mget(ls(pattern= '^df\\d+$'))
nm1 <- lapply(lst1, names) |>
{\(x) Reduce(union, x)}()
lapply(lst1, \(x) {x[setdiff(nm1, names(x))] <- 0; x}) |>
{\(x) do.call(rbind, x)}() |>
{\(dat) aggregate(.~ No, data = dat, FUN = sum, na.rm = TRUE,
na.action = na.pass)}()
# No A B C D E F
#1 1 2 0 1 0 1 1
#2 2 0 2 2 1 2 1
#3 3 2 1 1 0 1 0
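Another base R route, sketched here under the assumption that every column other than 'No' is numeric: reshape each data frame to long form with stack(), bind the pieces, and let xtabs() do the summing. The names `long` and `tab` are introduced for illustration only.

```r
df1 <- data.frame(No = 1:3, A = c(1L, 0L, 0L), B = c(0L, 1L, 0L),
                  C = c(1L, 2L, 1L), D = c(0L, 1L, 0L))
df2 <- data.frame(No = 1:3, A = c(1L, 0L, 2L), B = c(0L, 1L, 1L),
                  E = c(1L, 2L, 1L), F = c(1L, 1L, 0L))

# Long form: one row per (No, column name, value); No is recycled per column
long <- do.call(rbind, lapply(list(df1, df2), function(d)
  data.frame(No = d$No, stack(d[-1]))))

# Sum values for each No / column combination; absent combinations count as 0
tab <- xtabs(values ~ No + ind, data = long)
tab
```

The result is a contingency table rather than a data frame; as.data.frame.matrix(tab) converts it if a data frame is needed.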
I am looking to subtract multiple rows from the same row within a dataframe.
For example:
Group A B C
A 3 1 2
B 4 0 3
C 4 1 1
D 2 1 2
This is what I want it to look like:
Group A B C
B 1 -1 1
C 1 0 -1
D -1 0 0
So in other words:
Row B - Row A
Row C - Row A
Row D - Row A
Thank you!
Here's a dplyr solution:
library(dplyr)
df %>%
mutate(across(A:C, ~ . - .[1])) %>%
filter(Group != "A")
This gives us:
Group A B C
1: B 1 -1 1
2: C 1 0 -1
3: D -1 0 0
Here's an approach with base R:
data[-1] <- do.call(rbind,
apply(data[-1],1,function(x) x - data[1,-1])
)
data[-1,]
# Group A B C
#2 B 1 -1 1
#3 C 1 0 -1
#4 D -1 0 0
Data:
data <- structure(list(Group = c("A", "B", "C", "D"), A = c(3L, 4L, 4L,
2L), B = c(1L, 0L, 1L, 1L), C = c(2L, 3L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-4L))
We could also replicate the first row and subtract it from the rest
cbind(data[-1, 1, drop = FALSE], data[-1, -1] - data[1, -1][col(data[-1, -1])])
-output
# Group A B C
#2 B 1 -1 1
#3 C 1 0 -1
#4 D -1 0 0
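The same "subtract row A from every other row" idea can also be written with sweep(), which subtracts a vector from each row of a matrix by default. This is only a sketch on the example data; `baseline`, `diffs` and `result` are names introduced here.

```r
data <- data.frame(Group = c("A", "B", "C", "D"),
                   A = c(3L, 4L, 4L, 2L),
                   B = c(1L, 0L, 1L, 1L),
                   C = c(2L, 3L, 1L, 2L))

baseline <- unlist(data[1, -1])   # values of row A, one per numeric column

# MARGIN = 2 aligns baseline with the columns; default FUN is subtraction
diffs <- sweep(as.matrix(data[-1, -1]), 2, baseline)

result <- data.frame(Group = data$Group[-1], diffs)
result
```

Converting to a matrix first assumes all non-Group columns are numeric, as they are in the example.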
I have a data frame in the following format
1 2 a b c
1 a b 0 0 0
2 b 0 0 0
3 c 0 0 0
I want to fill columns a through c with a 1/0 flag that says whether the column name appears in column 1 or 2
1 2 a b c
1 a b 1 1 0
2 b 0 1 0
3 c 0 0 1
I have a dataset of about 530,000 records, 4 description columns, and 95 output columns, so a for loop is not practical. I have tried code in the following format, but it was too time-consuming:
for(i in 3:5) {
  for(j in 1:3) {
    for(k in 1:2) {
      if(df[j,k]==colnames(df)[i]) df[j, i]=1
    }
  }
}
Is there an easier, more efficient way to achieve the same output?
Thanks in advance!
One option is mtabulate from qdapTools
library(qdapTools)
df1[-(1:2)] <- mtabulate(as.data.frame(t(df1[1:2])))[-3]
df1
# 1 2 a b c
#1 a b 1 1 0
#2 b 0 1 0
#3 c 0 0 1
Or we melt the dataset after converting to matrix, use table to get the frequencies, and assign the output to the columns that are numeric.
library(reshape2)
df1[-(1:2)] <- table(melt(as.matrix(df1[1:2]))[-2])[,-1]
Or we can 'paste' the first two columns and use cSplit_e to get the binary format.
library(splitstackshape)
cbind(df1[1:2], cSplit_e(as.data.table(do.call(paste, df1[1:2])),
'V1', ' ', type='character', fill=0, drop=TRUE))
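If installing extra packages is not an option, a vectorized base R sketch is to loop over the output column names only (95 in the real data) and compare the description columns against each name in one go. This assumes the description columns are character, as in the example; `desc_cols` and `out_cols` are names made up for this illustration.

```r
df1 <- data.frame(`1` = c("a", "b", "c"), `2` = c("b", "", ""),
                  a = 0L, b = 0L, c = 0L, check.names = FALSE)

desc_cols <- 1:2                 # columns holding the labels
out_cols  <- names(df1)[3:5]     # indicator columns, named after the labels

# For each output name, flag rows where any description column equals it
df1[out_cols] <- +sapply(out_cols, function(nm)
  rowSums(df1[desc_cols] == nm) > 0)
df1
```

Each comparison is done on whole columns at once, so the work scales with the number of output columns rather than with rows times columns times description columns.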
data
df1 <- structure(list(`1` = c("a", "b", "c"), `2` = c("b", "", ""),
a = c(0L, 0L, 0L), b = c(0L, 0L, 0L), c = c(0L, 0L, 0L)), .Names = c("1",
"2", "a", "b", "c"), class = "data.frame", row.names = c("1",
"2", "3"))