How to find rows with same values in two columns? - r

It's a little hard to explain, but I'm trying to compare the "cpf" column of two different data frames, df1 and df2. I want to identify where a "cpf" value appears in both data frames (the matching values can be in different rows). After that, I want to fill the NA values in df1 with the values from df2 where they are available.
df1
cpf x y
1 21 NA NA
2 32 NA NA
3 43 NA NA
4 54 NA NA
5 65 NA NA
df2
cpf x y
1 54 5 10
2 0 NA NA
3 65 3 2
4 0 NA NA
5 0 NA NA
I want the following result
df3
cpf x y
1 21 NA NA
2 32 NA NA
3 43 NA NA
4 54 5 10
5 65 3 2

We could do a join on 'cpf' and use fcoalesce
library(data.table)
setDT(df1)[df2, c('x', 'y') := .(fcoalesce(x, i.x),
fcoalesce(y, i.y)), on = .(cpf)]
-output
df1
# cpf x y
#1: 21 NA NA
#2: 32 NA NA
#3: 43 NA NA
#4: 54 5 10
#5: 65 3 2
Or using coalesce from dplyr after a left_join
library(dplyr)
left_join(df1, df2, by = 'cpf') %>%
transmute(cpf, x = coalesce(x.x, x.y), y = coalesce(y.x, y.y))
# cpf x y
#1 21 NA NA
#2 32 NA NA
#3 43 NA NA
#4 54 5 10
#5 65 3 2
In base R, we can use match
i1 <- match(df1$cpf, df2$cpf, nomatch = 0)
i2 <- match(df2$cpf, df1$cpf, nomatch = 0)
df1[i2, -1] <- df2[i1, -1]
data
df1 <- structure(list(cpf = c(21L, 32L, 43L, 54L, 65L), x = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), y = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), row.names = c("1",
"2", "3", "4", "5"), class = "data.frame")
df2 <- structure(list(cpf = c(54L, 0L, 65L, 0L, 0L), x = c(5L, NA, 3L,
NA, NA), y = c(10L, NA, 2L, NA, NA)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
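A variant of the base R approach that overwrites only the cells that are NA in df1 can be sketched as follows. It assumes each relevant cpf value appears at most once in df2 (match returns the first hit), and the loop over c("x", "y") is my own illustration rather than part of the code above.

```r
# Same data as in the question, rebuilt so the sketch is self-contained
df1 <- data.frame(cpf = c(21L, 32L, 43L, 54L, 65L),
                  x = NA_integer_, y = NA_integer_)
df2 <- data.frame(cpf = c(54L, 0L, 65L, 0L, 0L),
                  x = c(5L, NA, 3L, NA, NA),
                  y = c(10L, NA, 2L, NA, NA))

# For every row of df1, the row of df2 with the same cpf (NA if there is none)
idx <- match(df1$cpf, df2$cpf)

for (col in c("x", "y")) {
  # Fill only cells that are NA in df1 and have a matching row in df2
  fill <- is.na(df1[[col]]) & !is.na(idx)
  df1[[col]][fill] <- df2[[col]][idx[fill]]
}
df1
```

Because the assignment is driven by idx, the result does not depend on the matching rows appearing in the same order in both data frames.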

library(dplyr)
df1 %>%
left_join(df2, by = "cpf") %>%
select(cpf, x = x.y, y = y.y)
Output:
cpf x y
1 21 NA NA
2 32 NA NA
3 43 NA NA
4 54 5 10
5 65 3 2

Another base R option using merge
merge(df1,
df2,
by = "cpf",
all.x = TRUE,
suffixes = c(".x", "")
)[names(df1)]
gives
cpf x y
1 21 NA NA
2 32 NA NA
3 43 NA NA
4 54 5 10
5 65 3 2

Related

Change variable in a column with a for loop using R

In my data frame, I have a column named Localisation. I want to change the data that is stored in it.
Thalamus, External capsule or Lenticulate for 0,
Cerebellum or Brain stem for 2 and
Frontal, Occipital, Parietal or Temporal for 1
I’m trying to do a For loop for this operation. My for statement doesn’t seem to be right, because I received
Error in for (. in i) 15:nrow(Localisation) :
4 arguments passed to 'for' which requires 3
and I don’t know how to compose my if statement.
dfM <- dfM %>%
for(i in dfM$Localisation) if(i = "Thalamus"| "External capsule"| "Lenticulate"){
dfM$Localisation <- "0"
} else if ( i = "Cerebellum"| "Brain stem") {
dfM$Localisation <- "2"
} else {
dfM$Localisation <- "1"
}
I know similar questions have been asked multiple times, but I can’t find a way to work with my data.
dput(head(dfM))
structure(list(new_id = c("5", "9", "10", "16", "30", "31"),
Localisation = c("Frontal", "Thalamus", "Occipital ", "Frontal",
"External capsule", "Cerebellum"), HIV.CT.initial = c(0L,
1L, 1L, 1L, 0L, 0L), Anticoagulant = c("Warfarin", "DOAC",
"Warfarin", "Warfarin", "Warfarin", "Warfarin"), Sex = c(1L,
1L, 1L, 2L, 2L, 1L), HTA = c(1L, 1L, 1L, 1L, 1L, 0L), Systolic_BP = c(116L,
169L, 164L, 109L, 134L, 146L), Diastolic_BP = c(70L, 65L,
80L, 60L, 75L, 85L), ACO = c(1L, 0L, 1L, 1L, 1L, 1L), Type.NACO = c(NA,
2L, NA, NA, NA, NA), Dose.ACO = c(NA, NA, NA, NA, NA, NA),
APT = c(0L, 0L, 1L, 0L, 1L, 0L), INR = c(4.2, 1.1, 1.9, 1.3,
3.6, 2.8), GLW = c(13L, 15L, 14L, 14L, 15L, 15L), GLW.Prog = c(NA,
NA, NA, NA, 14L, NA), mRS = c(4L, 3L, 4L, 6L, 6L, 5L), Time.to.scan = c(NA,
NA, NA, NA, NA, NA), Chx = c(0L, 0L, 0L, 0L, 0L, 0L), Type.Chx = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), Date.Chx = c("",
"", "", "", "", ""), décès.hospit = c(0L, 0L, 0L, 1L, 1L,
0L), décès.HIP = c(NA, NA, NA, 2L, 1L, NA), X3.mois = c(NA,
NA, NA, NA, NA, 1L), X6.mois = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), X12.mois = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), X3.mois.1 = c(NA, NA, NA, 1L, 1L, 1L), X6.mois.1 = c(NA,
NA, NA, 1L, 1L, 1L), X12.mois.1 = c(NA, NA, NA, 1L, 1L, 1L
), X.1.y = c(NA, NA, NA, NA, NA, NA), ID = c(5L, 9L, 10L,
16L, 30L, 31L)), row.names = c(NA, 6L), class = "data.frame")
Up front: in general, don't use for in a %>%-pipe, it almost never is what you intend.
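If you do want a loop, it has to live outside the pipe, index into the vector, and test membership with %in% rather than = and |. A minimal sketch, using only the Localisation values from your dput (the name recoded is mine):

```r
Localisation <- c("Frontal", "Thalamus", "Occipital ", "Frontal",
                  "External capsule", "Cerebellum")

# Pre-allocate the result, then fill it one element at a time
recoded <- character(length(Localisation))
for (k in seq_along(Localisation)) {
  value <- Localisation[k]
  if (value %in% c("Thalamus", "External capsule", "Lenticulate")) {
    recoded[k] <- "0"
  } else if (value %in% c("Cerebellum", "Brain stem")) {
    recoded[k] <- "2"
  } else {
    recoded[k] <- "1"   # everything else, including "Frontal" and "Occipital "
  }
}
recoded
```

This works, but a vectorized approach is usually shorter and faster in R.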
Since you have the %>%, I'll infer you're using dplyr and/or other related packages. Here's one way:
library(dplyr)
dfM %>%
mutate(Localisation = case_when(
Localisation %in% c("Thalamus", "External capsule", "Lenticulate") ~ "0",
Localisation %in% c("Cerebellum", "Brain stem") ~ "2",
TRUE ~ "1")
)
# new_id Localisation HIV.CT.initial Anticoagulant Sex HTA Systolic_BP Diastolic_BP ACO Type.NACO Dose.ACO APT INR GLW GLW.Prog mRS Time.to.scan Chx Type.Chx Date.Chx décès.hospit décès.HIP X3.mois X6.mois X12.mois X3.mois.1 X6.mois.1 X12.mois.1 X.1.y ID
# 1 5 1 0 Warfarin 1 1 116 70 1 NA NA 0 4.2 13 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 5
# 2 9 0 1 DOAC 1 1 169 65 0 2 NA 0 1.1 15 NA 3 NA 0 NA 0 NA NA NA NA NA NA NA NA 9
# 3 10 1 1 Warfarin 1 1 164 80 1 NA NA 1 1.9 14 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 10
# 4 16 1 1 Warfarin 2 1 109 60 1 NA NA 0 1.3 14 NA 6 NA 0 NA 1 2 NA NA NA 1 1 1 NA 16
# 5 30 0 0 Warfarin 2 1 134 75 1 NA NA 1 3.6 15 14 6 NA 0 NA 1 1 NA NA NA 1 1 1 NA 30
# 6 31 2 0 Warfarin 1 0 146 85 1 NA NA 0 2.8 15 NA 5 NA 0 NA 0 NA 1 NA NA 1 1 1 NA 31
Perhaps relevant: one of your values, "Occipital ", has a space at the end. If you have inadvertent spaces at times and need the matching to work regardless, you can replace each Localisation %in% with trimws(Localisation) %in% to reduce the chance of a misclassification.
Another method is to maintain a frame of translations:
translations <- tribble(
~Localisation, ~NewLocalisation,
"Thalamus", "0",
"External capsule", "0",
"Lenticulate", "0",
"Cerebellum", "2",
"Brain stem", "2"
)
dfM %>%
select(new_id, Localisation) %>% # just to reduce the display here on SO
left_join(., translations, by = "Localisation")
# new_id Localisation NewLocalisation
# 1 5 Frontal <NA>
# 2 9 Thalamus 0
# 3 10 Occipital <NA>
# 4 16 Frontal <NA>
# 5 30 External capsule 0
# 6 31 Cerebellum 2
I intentionally left the "default" step out to highlight that the NA values are natural: those are the ones not specifically identified.
dfM %>%
left_join(., translations, by = "Localisation") %>%
mutate(NewLocalisation = if_else(is.na(NewLocalisation), "1", NewLocalisation)) %>%
mutate(Localisation = NewLocalisation) %>%
select(-NewLocalisation)
# new_id Localisation HIV.CT.initial Anticoagulant Sex HTA Systolic_BP Diastolic_BP ACO Type.NACO Dose.ACO APT INR GLW GLW.Prog mRS Time.to.scan Chx Type.Chx Date.Chx décès.hospit décès.HIP X3.mois X6.mois X12.mois X3.mois.1 X6.mois.1 X12.mois.1 X.1.y ID
# 1 5 1 0 Warfarin 1 1 116 70 1 NA NA 0 4.2 13 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 5
# 2 9 0 1 DOAC 1 1 169 65 0 2 NA 0 1.1 15 NA 3 NA 0 NA 0 NA NA NA NA NA NA NA NA 9
# 3 10 1 1 Warfarin 1 1 164 80 1 NA NA 1 1.9 14 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 10
# 4 16 1 1 Warfarin 2 1 109 60 1 NA NA 0 1.3 14 NA 6 NA 0 NA 1 2 NA NA NA 1 1 1 NA 16
# 5 30 0 0 Warfarin 2 1 134 75 1 NA NA 1 3.6 15 14 6 NA 0 NA 1 1 NA NA NA 1 1 1 NA 30
# 6 31 2 0 Warfarin 1 0 146 85 1 NA NA 0 2.8 15 NA 5 NA 0 NA 0 NA 1 NA NA 1 1 1 NA 31
The reason I offer this alternative is that sometimes it's easier to maintain a separate table (in Excel, CSV form, etc.) in this fashion for automated processing of data. It doesn't offer any capability that the case_when or @dcarlson's ifelse methods do not.
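The same lookup-table idea can also be sketched in base R with a named character vector instead of a join. The names lookup and recoded are illustrative, and trimws is applied so that a value such as "Occipital " is matched on its trimmed form.

```r
Localisation <- c("Frontal", "Thalamus", "Occipital ", "Frontal",
                  "External capsule", "Cerebellum")

# Names are the original values, elements are the new codes
lookup <- c("Thalamus" = "0", "External capsule" = "0", "Lenticulate" = "0",
            "Cerebellum" = "2", "Brain stem" = "2")

# Indexing by name returns NA for values that are not in the table
recoded <- unname(lookup[trimws(Localisation)])

# Anything not listed in the table gets the default code
recoded[is.na(recoded)] <- "1"
recoded
```

As with the join, the unmatched values show up as NA first, and the default is applied as a separate, explicit step.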

Create a new variable with an existing variable name in a data frame, filling it when matching a non NA value in each of the variable lists

I want to create a column C in dfABy that holds the name of whichever existing column, A or B, has a non-NA value in that row. For example, my data frame is:
>dfABy
A B
56 NA
NA 45
NA 77
67 NA
NA 65
The result I expect is:
> dfABy
A B C
56 NA A
NA 45 B
NA 77 B
67 NA A
NA 65 B
One option using dplyr could be:
df %>%
rowwise() %>%
mutate(C = names(.[!is.na(c_across(everything()))]))
A B C
<int> <int> <chr>
1 56 NA A
2 NA 45 B
3 NA 77 B
4 67 NA A
5 NA 65 B
Or with the addition of purrr:
df %>%
mutate(C = pmap_chr(across(A:B), ~ names(c(...)[!is.na(c(...))])))
You can use max.col over is.na values to get the column numbers where non-NA value is present. From those numbers you can get the column names.
dfABy$C <- names(dfABy)[max.col(!is.na(dfABy))]
dfABy
# A B C
#1 56 NA A
#2 NA 45 B
#3 NA 77 B
#4 67 NA A
#5 NA 65 B
If there is more than one non-NA value in a row, take a look at the ties.method argument in ?max.col to see how ties are handled.
data
dfABy <- structure(list(A = c(56L, NA, NA, 67L, NA), B = c(NA, 45L, 77L,
NA, 65L)), class = "data.frame", row.names = c(NA, -5L))
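To illustrate the ties.method argument, here is a small sketch on a made-up data frame d (not the question's data) whose third row has non-NA values in both columns:

```r
# Hypothetical data: row 3 has a value in both A and B
d <- data.frame(A = c(56L, NA, 70L), B = c(NA, 45L, 80L))

# "first" picks the leftmost non-NA column when a row has several
first_name <- names(d)[max.col(!is.na(d), ties.method = "first")]

# "last" picks the rightmost one instead
last_name <- names(d)[max.col(!is.na(d), ties.method = "last")]

first_name
last_name
```

The default, ties.method = "random", chooses among tied columns at random, so setting it explicitly keeps the result reproducible. Note that a row consisting only of NA is also a tie and would still be assigned a column name.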
Using the data.table package (after converting the data frame with setDT) I recommend:
library(data.table)
setDT(dfABy)[, C := apply(cbind(dfABy), 1, function(x) names(x[!is.na(x)]))]
creating the following output:
A B C
1 56 NA A
2 NA 45 B
3 NA 77 B
4 67 NA A
5 NA 65 B
This is just another solution; the other proposed solutions are better.
library(dplyr)
library(purrr)
df %>%
rowwise() %>%
mutate(C = detect_index(c(A, B), ~ !is.na(.x)),
C = names(.[C]))
# A tibble: 5 x 3
# Rowwise:
A B C
<dbl> <dbl> <chr>
1 56 NA A
2 NA 45 B
3 NA 77 B
4 67 NA A
5 NA 65 B

Replace consecutive NA values with the sum of another column

I am trying to replace each run of consecutive NA values with the sum of the corresponding values from another column, but I'm a little confused.
What the data looks like
df
# Distance Distance2
# 1 160 8
# 2 20 NA
# 3 30 15
# 4 100 11
# 5 35 NA
# 6 42 NA
# 7 10 NA
# 8 10 2
# 9 9 NA
# 10 20 NA
And I am looking to get a result like this
df
# Distance Distance2
# 1 160 8
# 2 20 20
# 3 30 15
# 4 100 11
# 5 35 87
# 6 42 87
# 7 10 87
# 8 10 2
# 9 9 29
# 10 20 29
Thanks in advance for your help
We can use rleid to create groups and replace NA with sum of Distance values.
library(data.table)
setDT(df)[, Distance_new := replace(Distance2, is.na(Distance2),
sum(Distance)), rleid(Distance2)]
df
# Distance Distance2 Distance_new
# 1: 160 8 8
# 2: 20 NA 20
# 3: 30 15 15
# 4: 100 11 11
# 5: 35 NA 87
# 6: 42 NA 87
# 7: 10 NA 87
# 8: 10 2 2
# 9: 9 NA 29
#10: 20 NA 29
We can also use rleid (from data.table) inside dplyr:
library(dplyr)
df %>%
group_by(gr = rleid(Distance2)) %>%
mutate(Distance_new = replace(Distance2, is.na(Distance2), sum(Distance)))
data
df <- structure(list(Distance = c(160L, 20L, 30L, 100L, 35L, 42L, 10L,
10L, 9L, 20L), Distance2 = c(8L, NA, 15L, 11L, NA, NA, NA, 2L,
NA, NA)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10"))
You can group by consecutive NAs and replace with the sum, i.e.
library(dplyr)
df %>%
group_by(grp = cumsum(c(TRUE, diff(is.na(df$Distance2)) != 0))) %>%
mutate(Distance2 = replace(Distance2, is.na(Distance2), sum(Distance)))
# A tibble: 10 x 3
# Groups: grp [6]
Distance Distance2 grp
<int> <int> <int>
1 160 8 1
2 20 20 2
3 30 15 3
4 100 11 3
5 35 87 4
6 42 87 4
7 10 87 4
8 10 2 5
9 9 29 6
10 20 29 6
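The same grouping by runs of NA can be sketched in base R alone with cumsum and ave. The helper names na_flag, run_id and run_sum are mine, and the data is the df from the question.

```r
df <- data.frame(
  Distance  = c(160L, 20L, 30L, 100L, 35L, 42L, 10L, 10L, 9L, 20L),
  Distance2 = c(8L, NA, 15L, 11L, NA, NA, NA, 2L, NA, NA)
)

# TRUE where Distance2 is missing
na_flag <- is.na(df$Distance2)

# A new run starts whenever the NA status changes from one row to the next
run_id <- cumsum(c(TRUE, diff(na_flag) != 0))

# Sum of Distance within each run, repeated for every row of that run
run_sum <- ave(df$Distance, run_id, FUN = sum)

# Only the NA cells are overwritten
df$Distance2[na_flag] <- run_sum[na_flag]
df
```

The sums are computed for every run, but only the rows flagged in na_flag are replaced, so the existing Distance2 values stay untouched.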
We can use fcoalesce
library(data.table)
setDT(df)[, Distance2 := fcoalesce(Distance2, sum(Distance)),
rleid(Distance2)]
data
df <- structure(list(Distance = c(160L, 20L, 30L, 100L, 35L, 42L, 10L,
10L, 9L, 20L), Distance2 = c(8L, NA, 15L, 11L, NA, NA, NA, 2L,
NA, NA)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10"))

How can I use merge so that I have data for all times?

I'm trying to reshape my data so that every entity (Class) has a row for every possible time (month). Here's what I have:
Class Value month
A 10 1
A 12 3
A 9 12
B 11 1
B 10 8
From the data above, I want to get the following data;
Class Value month
A 10 1
A NA 2
A 12 3
A NA 4
....
A 9 12
B 11 1
B NA 2
....
B 10 8
B NA 9
....
B NA 12
So I want every Class to have a row for each month from 1 to 12.
How can I do this? I'm currently trying the merge function, but I'd appreciate any other approaches.
We can use tidyverse
library(tidyverse)
df1 %>%
complete(Class, month = min(month):max(month)) %>%
select(all_of(names(df1))) %>% #if we need to be in the same column order
as.data.frame() #if needed to convert to 'data.frame'
In base R using merge (where df is your data):
res <- data.frame(Class=rep(levels(df$Class), each=12), value=NA, month=1:12)
merge(df, res, by = c("Class", "month"), all.y = TRUE)[,c(1,3,2)]
# Class Value month
# 1 A 10 1
# 2 A NA 2
# 3 A 12 3
# 4 A NA 4
# 5 A NA 5
# 6 A NA 6
# 7 A NA 7
# 8 A NA 8
# 9 A NA 9
# 10 A NA 10
# 11 A NA 11
# 12 A 9 12
# 13 B 11 1
# 14 B NA 2
# 15 B NA 3
# 16 B NA 4
# 17 B NA 5
# 18 B NA 6
# 19 B NA 7
# 20 B 10 8
# 21 B NA 9
# 22 B NA 10
# 23 B NA 11
# 24 B NA 12
df <- structure(list(Class = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), Value = c(10L, 12L, 9L, 11L, 10L), month = c(1L,
3L, 12L, 1L, 8L)), .Names = c("Class", "Value", "month"), class = "data.frame", row.names = c(NA,
-5L))
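A variant of the merge approach that builds the full Class by month grid with expand.grid, so it does not depend on Class being a factor, might look like this sketch. The names grid and full are illustrative, and Class is a plain character column here.

```r
df <- data.frame(Class = c("A", "A", "A", "B", "B"),
                 Value = c(10L, 12L, 9L, 11L, 10L),
                 month = c(1L, 3L, 12L, 1L, 8L))

# Every combination of Class and month 1..12
grid <- expand.grid(Class = unique(df$Class), month = 1:12,
                    stringsAsFactors = FALSE)

# Keep all rows of the grid; months without data get NA in Value
full <- merge(grid, df, by = c("Class", "month"), all.x = TRUE)

# Restore the original column order and sort by Class, then month
full <- full[order(full$Class, full$month), c("Class", "Value", "month")]
head(full, 4)
```

Using all.x = TRUE with the grid as the first argument is what guarantees one row per Class and month.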
To add to @akrun's answer, if you want to replace the NA values with 0, you can do the following:
library(dplyr)
library(tidyr)
df1 %>%
complete(Class, month = min(month):max(month)) %>%
mutate(Value = ifelse(is.na(Value),0,Value))

lag function in ifelse

I have a dataframe df with a structure like this:
val1 val2 val3
1 12 NA
2 14 NA
3 54 54
1 35 4
2 3 5
3 7 NA
4 8 NA
5 9 NA
Expected value:
val1 val2 val3 val4
1 12 NA 12
2 14 NA 12
3 54 54 54
1 35 4 35
2 3 5 3
3 7 NA 3
4 8 NA 3
5 9 NA 3
Problem:
I need a new column val4 with the following condition
df$val4 <- ifelse(df$val1 == 1, df$val2, ifelse(is.na(df$val3), lag(df$val4), df$val2))
This leads to
Error in hasTsp(x) : attempt to set an attribute on NULL
Condition:
val4 is equal to the value of val2 when val1 is equal to 1 (val3 does not matter)
val4 is equal to the previous value of val4 when val3 is NA (and val1 is not equal to 1)
P.S: I know I can use for loop here, but that would be very slow!
We can use data.table with zoo. Convert the 'data.frame' to a 'data.table' (setDT(df)) and create 'val4' by multiplying 'val2' by a vector of 1s and NAs (NA^is.na(val3) returns NA where 'val3' is NA and 1 elsewhere). Then, for the rows where 'val1' is 1, assign 'val2' to 'val4'. Finally, replace the remaining NA elements with the previous non-NA element using na.locf.
library(data.table)
library(zoo)
setDT(df)[, val4 := val2 * NA^is.na(val3)
][val1==1, val4 := val2
][, val4 := na.locf(val4)][]
# val1 val2 val3 val4
#1: 1 12 NA 12
#2: 2 14 NA 12
#3: 3 54 54 54
#4: 1 35 4 35
#5: 2 3 5 3
#6: 3 7 NA 3
#7: 4 8 NA 3
#8: 5 9 NA 3
More code explanation
`is.na` returns a `logical` vector
setDT(df)[, is.na(val3)]
#[1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
To change the TRUE values to NA and the others to 1
setDT(df)[, NA^is.na(val3)]
#[1] NA NA 1 1 1 NA NA NA
Multiply by 'val2'
setDT(df)[, val2 * NA^is.na(val3)]
#[1] NA NA 54 35 3 NA NA NA
and the rest is just assignment based on the logical condition in 'i'
data
df <- structure(list(val1 = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L), val2 = c(12L,
14L, 54L, 35L, 3L, 7L, 8L, 9L), val3 = c(NA, NA, 54L, 4L, 5L,
NA, NA, NA)), .Names = c("val1", "val2", "val3"), class = "data.frame",
row.names = c(NA, -8L))
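The nested ifelse in the question cannot work as written because df$val4 does not exist yet when lag is called on it. A base R sketch of the same logic without zoo: mark the rows that take val2 directly, then carry the index of the most recent such row forward with cummax. The names keep and idx are mine, and the sketch assumes the first row is one of the marked rows (true here, since val1 starts at 1).

```r
df <- data.frame(val1 = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L),
                 val2 = c(12L, 14L, 54L, 35L, 3L, 7L, 8L, 9L),
                 val3 = c(NA, NA, 54L, 4L, 5L, NA, NA, NA))

# Rows where val4 is taken straight from val2
keep <- df$val1 == 1 | !is.na(df$val3)

# Row number for kept rows, 0 otherwise; cummax carries the last kept row forward
idx <- cummax(seq_along(keep) * keep)

# Look up val2 at the most recent kept row
df$val4 <- df$val2[idx]
df
```

This plays the role of na.locf in the answer above: instead of filling NA values forward, it fills the row index forward and looks up val2 once.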
