Replacing NAs in a column with the values of other column - r

I wonder how to replace NAs in a column with the values of other column in R using dplyr. MWE is below.
Letters <- LETTERS[1:5]
Char <- c("a", "b", NA, "d", NA)
df1 <- data.frame(Letters, Char)
df1
library(dplyr)
df1 %>%
mutate(Char1 = ifelse(Char != NA, Char, Letters))
Letters Char Char1
1 A a NA
2 B b NA
3 C <NA> NA
4 D d NA
5 E <NA> NA

You can use coalesce:
library(dplyr)
df1 <- data.frame(Letters, Char, stringsAsFactors = F)
df1 %>%
mutate(Char1 = coalesce(Char, Letters))
Letters Char Char1
1 A a a
2 B b b
3 C <NA> C
4 D d d
5 E <NA> E
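The original attempt returns NA in every row because any comparison with NA, including Char != NA, itself evaluates to NA, so ifelse() never sees a TRUE or FALSE. As a minimal base R sketch of the same replacement, assuming character columns (hence stringsAsFactors = FALSE), is.na() can be used for the test instead:

```r
Letters <- LETTERS[1:5]
Char <- c("a", "b", NA, "d", NA)
df1 <- data.frame(Letters, Char, stringsAsFactors = FALSE)

# Comparing with NA always yields NA, so the original test is never TRUE or FALSE
all(is.na(df1$Char != NA))

# is.na() detects the missing values; take Letters there and Char elsewhere
df1$Char1 <- ifelse(is.na(df1$Char), df1$Letters, df1$Char)
df1
```

coalesce() stays the more concise choice when dplyr is already loaded; the is.na() form is only meant to show where the first attempt went wrong.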

Related

Group values in rows according into similar columns

I had a column with multiple comma-separated values in each cell, like this:
ColumnX1
A,D,C,B,F,E,G
F,A,B,E,G,C
C,D,G,F,A,T
I split the data with
Species_Data2 <- data.frame(str_split_fixed(Species_Data$Other.Anopheline.species, ",", 21))
which gave me a data frame like this:
X1 X2 X3 X4 X5 X6 X7
A D C B F E G
F A B E G NA C
C D G F A T NA
I wanted to make a dataframe like:
X1 X2 X3 X4 X5 X6 X7 X8
A B C D E F G NA
A B C NA E F G NA
A NA C D NA F G T
and then use the column names as row values:
Colnames
'A' 'B' 'C' 'D' 'E' 'F' 'G' 'T'
A B C D E F G NA
A B C NA E F G NA
A NA C D NA F G T
I tried sorting the values, but that does not work well.
Comes up with O values though....
If I understand correctly, the OP wants to rearrange the data so that there is a separate column for each letter. If a letter is present in a row, then the letter appears in the appropriate column/row of the reshaped data. NA indicates that a letter is missing in a row. In addition, the letter columns should be arranged in alphabetical order.
1. dplyr/tidyr approach
If we start with the data.frame resulting from the OP's call to stringr::str_split_fixed(), we need to reshape the split data from wide to long format, remove empty entries, order the rows so that the columns appear in letter order, and reshape to wide format again. For reshaping, a row id is required. To achieve the desired output, pivot_wider() has to be called with the names_from = value parameter:
library(dplyr)
library(tidyr)
as.data.frame(stringr::str_split_fixed(DF$ColumnX1, ",", 21)) %>%
mutate(rn = row_number()) %>%
pivot_longer(-rn) %>%
filter(value != "") %>%
arrange(as.character(value)) %>%
pivot_wider(id_cols = rn, names_from = value)
rn A B C D E F G T
<int> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 1 A B C D E F G NA
2 2 A B C NA E F G NA
3 3 A NA C D NA F G T
2. data.table approach
If we start from the original, unsplit data, there is a much more concise variant which uses data.table's dcast() for reshaping:
library(data.table)
setDT(DF)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF)][, dcast(.SD, nrow ~ V1)]
nrow A B C D E F G T
1: 1 A B C D E F G <NA>
2: 2 A B C <NA> E F G <NA>
3: 3 A <NA> C D <NA> F G T
If required, the additional row id column can be removed in both approaches.
Data
DF <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G",
"F,A,B,E,G,C",
"C,D,G,F,A,T")
)
EDIT: Duplicate values
In a comment, the OP has disclosed that the production dataset contains duplicate values.
In case of duplicate values, dcast() uses the length() function by default to aggregate the data.
With a modified dataset DF2 which contains duplicate values in rows 1 and 2, the original data.table approach returns:
library(data.table)
setDT(DF2)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF2)][, dcast(.SD, nrow ~ V1)]
nrow A B C D E F G T
1: 1 1 1 2 1 1 1 1 0
2: 2 1 1 1 0 1 2 1 0
3: 3 1 0 1 1 0 1 1 1
Here, the number of duplicate letters is shown.
The expected behaviour can be restored by removing the duplicate values before reshaping by using unique():
setDT(DF2)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF2)][
, dcast(unique(.SD), nrow ~ V1)]
nrow A B C D E F G T
1: 1 A B C D E F G <NA>
2: 2 A B C <NA> E F G <NA>
3: 3 A <NA> C D <NA> F G T
Also the dplyr/tidyr approach needs to be modified by specifying an appropriate aggregation function in the call to pivot_wider():
library(dplyr)
library(tidyr)
as.data.frame(stringr::str_split_fixed(DF2$ColumnX1, ",", 21)) %>%
mutate(rn = row_number()) %>%
pivot_longer(-rn) %>%
filter(value != "") %>%
arrange(as.character(value)) %>%
pivot_wider(id_cols = rn, names_from = value, values_fn = list(value = unique))
Data with duplicate values
DF2 <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G,C",
"F,A,B,E,G,C,F",
"C,D,G,F,A,T")
)
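For comparison, the same reshaping can be sketched in base R without any reshaping package. This is only an illustration, not part of the answer above: the names parts, all_letters, wide and result are made up here, and the data is assumed to be a character column (hence stringsAsFactors = FALSE). Because each row is only tested for the presence of a letter, duplicate values need no special handling.

```r
DF2 <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G,C",
                               "F,A,B,E,G,C,F",
                               "C,D,G,F,A,T"),
                  stringsAsFactors = FALSE)

# Split each cell into its letters (one character vector per row)
parts <- strsplit(DF2$ColumnX1, ",", fixed = TRUE)

# All distinct letters, in alphabetical order, become the columns
all_letters <- sort(unique(unlist(parts)))

# For each row: keep the letter where it is present, NA where it is missing
wide <- t(sapply(parts, function(p) {
  ifelse(all_letters %in% p, all_letters, NA_character_)
}))
colnames(wide) <- all_letters

result <- as.data.frame(wide, stringsAsFactors = FALSE)
result
```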

Alternate empty column among dataframe

I have this df:
df = data.frame(aa = letters[1:5],
bb = letters[1:5],
cc = letters[1:5],
dd = letters[1:5])
df2 = c('ee', 'ff', 'gg')
df[df2] = NA
And I want to have this output:
ee aa bb ff cc dd gg
NA a a NA a a NA
NA b b NA b b NA
NA c c NA c c NA
NA d d NA d d NA
NA e e NA e e NA
Is there an elegant way to do so instead of:
df = df[,c('ee', 'aa', 'bb', 'ff', 'cc', 'dd', 'gg')] ??
Here is one option. Based on the input/output, each block of two existing columns should be preceded by one new column. We create a matrix 'm1' of the column names, split it by column, and concatenate each list element with one element of 'df2' to build a vector of column names in the specified order ('un1'). Using that, a 'data.frame' of NA is created (through the matrix route), and the columns of 'df' are then assigned into it:
m1 <- matrix(names(df), 2, 2)
un1 <- c(unlist(Map(c, df2[seq_len(nrow(m1))],
split(m1, col(m1)))), df2[length(df2)])
dfN <- as.data.frame(matrix(NA, ncol =length(un1),
nrow = nrow(df), dimnames = list(NULL, un1)))
dfN[names(df)] <- df
dfN
# ee aa bb ff cc dd gg
#1 NA a a NA a a NA
#2 NA b b NA b b NA
#3 NA c c NA c c NA
#4 NA d d NA d d NA
#5 NA e e NA e e NA
Another option is add_column from tibble. We split the dataset into a list of data.frames in blocks of 'k' (= 2) columns, loop over the list and its index sequence with map2, add a new column at the beginning of each block (add_column), bind the pieces into a single data.frame (map2_dfc), and then add the remaining column at the end:
library(tidyverse)
k <- 2
l1 <- split.default(df, as.integer(gl(ncol(df), k, ncol(df))))
i1 <- seq_along(l1)
nm1 <- tail(names(df), 1)
l1 %>%
map2_dfc(., i1, ~
.x %>%
add_column(!! df2[.y] := NA, .before = 1)) %>%
add_column(!!df2[-i1] := NA, .after = nm1)
# ee aa bb ff cc dd gg
#1 NA a a NA a a NA
#2 NA b b NA b b NA
#3 NA c c NA c c NA
#4 NA d d NA d d NA
#5 NA e e NA e e NA
If the names of the empty columns do not matter much, you can also use a for loop. It builds the desired data frame under the name df2:
df = data.frame(aa = letters[1:5],
bb = letters[1:5],
cc = letters[1:5],
dd = letters[1:5])
df2 = NA
for (i in 1:(ncol(df) / 2)) {
df2 <- data.frame(df2, df[, (i*2-1):(i*2)], NA)
}
Column names can be added later if needed as
colnames(df2)[seq(1,ncol(df2),3)] <- c('ee', 'ff', 'gg')
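Another way to avoid typing the column order by hand is to compute it and index once. The sketch below assumes the existing columns come in blocks of k = 2 and that there is exactly one more new column than there are blocks; new_cols, old_cols, blocks, lead and ord are illustrative names, not taken from the answers above.

```r
df <- data.frame(aa = letters[1:5],
                 bb = letters[1:5],
                 cc = letters[1:5],
                 dd = letters[1:5])
new_cols <- c("ee", "ff", "gg")
k <- 2

# Group the existing column names into blocks of k
old_cols <- names(df)
blocks <- split(old_cols, ceiling(seq_along(old_cols) / k))

# Add the empty columns, then compute the interleaved order
df[new_cols] <- NA
lead <- new_cols[seq_along(blocks)]              # one new column in front of each block
ord <- c(unlist(Map(c, lead, blocks), use.names = FALSE),
         new_cols[-seq_along(blocks)])           # remaining new column goes last
df <- df[ord]
names(df)
```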

R - Merge and Replace Column If ID Found on Another Data Frame

I have two data frames as below and am trying to improve my code so that the Letters column in df1 is replaced with the Letters2 column in df2 where the IDs match.
df1 <- data.frame(ID = c(1,3,2,4,5), Letters = LETTERS[1:5], stringsAsFactors = F)
df2 <- data.frame(ID = c(1,3,4), Letters2 = "F", stringsAsFactors = F)
desired:
ID Letters
1 F
2 C
3 F
4 F
5 E
It would be like doing the following, but in one line:
desired <- merge(df1, df2, by = "ID", all.x = T)
desired$Letters <- ifelse(is.na(desired$Letters2), desired$Letters, desired$Letters2)
desired$Letters2 <- NULL
Try this:
library(tidyverse)
df1 %>%
left_join(df2) %>%
mutate(Letters = coalesce(Letters2, Letters), Letters2 = NULL)
Joining, by = "ID"
ID Letters
1 1 F
2 3 F
3 2 C
4 4 F
5 5 E
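We could use 'ID' to locate the rows in 'df1' and change the values in 'Letters' to those of 'Letters2' (which are all 'F's). Because the IDs in 'df1' are not in row order, match() is used to look up the row positions rather than indexing by 'ID' directly
df1$Letters[match(df2$ID, df1$ID)] <- df2$Letters2
df1
# ID Letters
#1 1 F
#2 3 F
#3 2 C
#4 4 F
#5 5 E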
Or using data.table
library(data.table)
setDT(df1)[df2, Letters := Letters2, on = .(ID)]
df1
# ID Letters
#1: 1 F
#2: 3 F
#3: 2 C
#4: 4 F
#5: 5 E

Creating new variable in dataframe based on matching values from other dataframe

I have two dataframes, df1 and df2, of which two columns have partly matching values, however in completely different order; also, the values are unique in df2 but may be repeated in df1.
What I'd like to do is transfer into df1, not the matching values, but values associated with them in another variable in df2; for one value in df1, "G", I do not want the associated value to be transferred but rather just NA.
Consider df1 and df2:
df1 <- data.frame(
x = c("A", NA, "L", "G", "C", "F", NA, "J", "G", "K")
)
df2 <- data.frame(
a = LETTERS[1:10],
b = 1:10 # these are the values to be transferred into df1$z
)
df1$z <- ifelse(df1$x=="G", NA, ifelse(df1$x %in% df2$a, df2$b[df2$a %in% df1$x], NA))
The values to be transferred from df2 into df1 are in df2$b. I've tried the above ifelse() string but the resulting values in df1$z are only partly correct. Where's the mistake?
I think this does what you want:
df1$z <- df2$b[match(df1$x,df2$a)]
df1$z[df1$x=='G']=NA
Output:
> df1
x z
1 A 1
2 <NA> NA
3 L NA
4 G NA
5 C 3
6 F 6
7 <NA> NA
8 J 10
9 G NA
10 K NA
Another option is a join with dplyr:
library(dplyr)
left_join(df1, df2, by = c("x" = "a")) %>%
mutate(b = ifelse(x == "G", NA, b))
# x b
# 1 A 1
# 2 <NA> NA
# 3 L NA
# 4 G NA
# 5 C 3
# 6 F 6
# 7 <NA> NA
# 8 J 10
# 9 G NA
# 10 K NA
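As for where the original ifelse() goes wrong: df2$b[df2$a %in% df1$x] keeps one value per matching row of df2, so it is shorter than df1 and gets recycled out of alignment, whereas match() returns one position per row of df1. A small sketch of the difference, using the question's data with stringsAsFactors = FALSE assumed and the illustrative names short and pos:

```r
df1 <- data.frame(
  x = c("A", NA, "L", "G", "C", "F", NA, "J", "G", "K"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  a = LETTERS[1:10],
  b = 1:10,
  stringsAsFactors = FALSE
)

# %in% filters df2: one value per matching row of df2, not one per row of df1
short <- df2$b[df2$a %in% df1$x]
length(short)   # shorter than nrow(df1), so ifelse() recycles it

# match() gives, for each df1$x, the row of df2 that holds it (NA if absent)
pos <- match(df1$x, df2$a)
df1$z <- df2$b[pos]

# Blank out "G"; %in% is used so that NA in x is treated as "not G"
df1$z[df1$x %in% "G"] <- NA
df1
```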

Matching columns with other columns in data frames and adding certain columns of matching values

I have tried searching for something but cannot find it. I have found similar threads but still they don't get what I want. I know there should be an easy way to do this without writing a loop function. Here it goes
I have two data frames, df1 and df2:
df1 <- data.frame(ID = c("a", "b", "c", "d", "e", "f"), y = 1:6 )
df2 <- data.frame(x = c("a", "c", "g", "f"), f=c("M","T","T","M"), obj=c("F70", "F60", "F71", "F82"))
df2$f <- as.factor(df2$f)
Now I want to match the "ID" column of df1 with the "x" column of df2, and add new columns to df1 from the matching rows of df2. The final output of df1 should look like this:
ID y obj f1 f2
a 1 F70 M NA
b 2 NA NA NA
c 3 F60 NA T
d 4 NA NA NA
e 5 NA NA NA
f 6 F82 M NA
We can do this with tidyverse by joining the two datasets and then spreading the 'f' column:
library(tidyverse)
left_join(df1, df2, by = c(ID = "x")) %>%
group_by(f) %>%
spread(f, f) %>%
select(-6) %>%
rename(f1 = M, f2 = T)
# A tibble: 6 × 5
# ID y obj f1 f2
#* <chr> <int> <fctr> <fctr> <fctr>
#1 a 1 F70 M NA
#2 b 2 NA NA NA
#3 c 3 F60 NA T
#4 d 4 NA NA NA
#5 e 5 NA NA NA
#6 f 6 F82 M NA
Or a similar approach with data.table
library(data.table)
dcast(setDT(df2)[df1, on = .(x = ID)], x+obj + y ~ f, value.var = 'f')[, -6, with = FALSE]
Here is a base R process.
# combine the data.frames
dfNew <- merge(df1, df2, by.x="ID", by.y="x", all.x=TRUE)
# add f1 and f2 variables
dfNew[c("f1", "f2")] <- lapply(c("M", "T"),
function(i) factor(ifelse(as.character(dfNew$f) == i, i, NA)))
# remove original factor variable
dfNew <- dfNew[-3]
ID y obj f1 f2
1 a 1 F70 M <NA>
2 b 2 <NA> <NA> <NA>
3 c 3 F60 <NA> T
4 d 4 <NA> <NA> <NA>
5 e 5 <NA> <NA> <NA>
6 f 6 F82 M <NA>
