Replicate rows with missing values and replace missing values by vector - r

I have a dataframe in which a column has some missing values.
I would like to replicate the rows with the missing values N times, where N is the length of a vector which contains replacements for the missing values.
I first define a replacement vector, then my starting data.frame, then my desired result and finally my attempt to solve it. Unfortunately that didn't work...
> replace_values <- c('A', 'B', 'C')
> data.frame(value = c(3, 4, NA, NA), result = c(5, 3, 1,2))
value result
1 3 5
2 4 3
3 NA 1
4 NA 2
> data.frame(value = c(3, 4, replace_values, replace_values), result = c(5, 3, rep(1, 3),rep(2, 3)))
value result
1 3 5
2 4 3
3 A 1
4 B 1
5 C 1
6 A 2
7 B 2
8 C 2
> t <- data.frame(value = c(3, 4, NA, NA), result = c(5, 3, 1,2))
> mutate(t, value = ifelse(is.na(value), replace_values, value))
value result
1 3 5
2 4 3
3 C 1
4 A 2

You can try a tidyverse solution, where d is your starting data.frame:
library(dplyr)
library(tidyr)
d %>%
  mutate(value=ifelse(is.na(value), paste0(replace_values, collapse=","), value)) %>%
  separate_rows(value, sep=",") %>%
  select(value, everything())
value result
1 3 5
2 4 3
3 A 1
4 B 1
5 C 1
6 A 2
7 B 2
8 C 2
The idea is to replace each NA with the comma-collapsed 'replace_values', then split the collapsed strings back into separate rows using tidyr's separate_rows function. Finally, select() arranges the columns according to your expected output.
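To make the collapse-then-split idea concrete, here is a minimal base R sketch of the same steps. It is not the code separate_rows runs internally; the names `collapsed`, `pieces` and `out` are made up for illustration, and `d` is assumed to be the starting data.frame from the question.

```r
replace_values <- c("A", "B", "C")
d <- data.frame(value = c(3, 4, NA, NA), result = c(5, 3, 1, 2))

# Step 1: replace each NA by the comma-collapsed replacements ("A,B,C");
# the numeric values are coerced to character along the way
collapsed <- ifelse(is.na(d$value), paste(replace_values, collapse = ","), d$value)

# Step 2: split every entry on commas, giving one character vector per row
pieces <- strsplit(collapsed, ",", fixed = TRUE)

# Step 3: unlist the pieces and repeat each 'result' once per piece
out <- data.frame(value = unlist(pieces),
                  result = rep(d$result, times = lengths(pieces)))
out
```

The 'value' column ends up as character, because it has to hold both numbers and letters.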

We can do an rbind here using base R (here 'df1' is the starting data.frame). Create a logical vector that flags where 'value' is NA ('i1') and count the NA elements by taking its sum ('n'). Then build a data.frame that repeats 'replace_values' 'n' times and repeats each 'result' element belonging to an NA row length(replace_values) times, and 'rbind' it with the subset of the dataset whose 'value' is not NA.
i1 <- is.na(df1$value)
n <- sum(i1)
rbind(df1[!i1,],
data.frame(value = rep(replace_values, n),
result = rep(df1$result[i1], each = length(replace_values))))
# value result
#1 3 5
#2 4 3
#3 A 1
#4 B 1
#5 C 1
#6 A 2
#7 B 2
#8 C 2
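If the expanded rows should stay where the NA rows were, rather than being appended at the end, the same replication can be done with row indices. This is only a sketch: `expand_na_rows` is a hypothetical helper name, not something from the answers above.

```r
# Hypothetical helper: expand each row whose `col` is NA into one row per
# replacement value, keeping the rows in their original positions.
expand_na_rows <- function(df, col, replacements) {
  is_missing <- is.na(df[[col]])
  # Complete rows appear once, incomplete rows length(replacements) times
  times <- ifelse(is_missing, length(replacements), 1L)
  idx <- rep(seq_len(nrow(df)), times = times)
  out <- df[idx, , drop = FALSE]
  # The column must hold numbers and letters, so it becomes character
  out[[col]] <- as.character(out[[col]])
  out[[col]][is_missing[idx]] <- rep(replacements, times = sum(is_missing))
  rownames(out) <- NULL
  out
}

replace_values <- c("A", "B", "C")
df1 <- data.frame(value = c(3, 4, NA, NA), result = c(5, 3, 1, 2))
expanded <- expand_na_rows(df1, "value", replace_values)
expanded
```

For the example data this gives the same eight rows as the rbind approach, since the NA rows already sit at the end.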

Related

Replacing NA's with numbers based on the numbers preceding them in R data frame

I have the following two columns:
ID Number
A 1
A 1
B NA
C NA
C NA
D 3
D 3
D 3
F NA
G 6
H NA
I want the NAs in the Number column to be replaced by the next integer number following the Non-NA number that precedes them. That new number should then stay the same as long as the ID in the ID column doesn't change.
So for example, using the example columns above, if the number associated with ID "A" is 1, and ID "B" below it has NAs, I want those NAs replaced with the number 2. Then, if ID "C" has NAs, they should be replaced with 3. We move down to ID "D". This ID has the number 3 in the Number column, so nothing changes. ID "F" below it has NAs, so they get replaced with 4, etc.
Here is what my data frame should look like:
ID Number
A 1
A 1
B 2
C 3
C 3
D 3
D 3
D 3
F 4
G 6
H 7
How would I be able to code this in R, preferably using dplyr?
I came up with the following, though I am not totally sure if the logic is correct, so please test it:
df <- data.frame(ID=c('A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'F', 'G', 'H'),
Number=c(1, 1, NA, NA, NA, 3, 3, 3, NA, 6, NA))
library(zoo)
a <- as.integer(factor(df$ID))
b <- zoo::na.locf(a - df$Number)
df$Number <- a - b
Resulting in:
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7
Some explanation:
a simply relabels the groups with ascending integers: 1 1 2 3 3 4 4 4 5 6 7. This is almost the desired result, but we have to account for cases where the values in df$Number get ahead of/behind this running integer label.
b tracks the difference between a and df$Number with a forward fill (na.locf): 0 0 0 0 0 1 1 1 1 0 0. Places with a non-zero indicate the correction that should be applied to a to "reset" the running labels, based on the values observed in df$Number.
a - b applies the correction alluded to in the above point: 1 1 2 3 3 3 3 3 4 6 7.
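The three steps can also be traced without zoo. The sketch below assumes a small hand-written forward fill (`fill_forward` is a hypothetical helper, not part of zoo) that uses 0 for any leading NAs.

```r
# Hypothetical base R forward fill: carry the last non-NA value forward,
# using `default` wherever no non-NA value has been seen yet.
fill_forward <- function(x, default = 0) {
  idx <- cumsum(!is.na(x))            # how many non-NA values seen so far
  c(default, x[!is.na(x)])[idx + 1]   # idx 0 (leading NA) maps to `default`
}

df <- data.frame(ID = c("A", "A", "B", "C", "C", "D", "D", "D", "F", "G", "H"),
                 Number = c(1, 1, NA, NA, NA, 3, 3, 3, NA, 6, NA))

a <- as.integer(factor(df$ID))          # running group label
offset <- fill_forward(a - df$Number)   # plays the role of b
filled <- a - offset
filled
```

Note that as.integer(factor(df$ID)) numbers the IDs by sorted level order. That matches the order of appearance here only because the IDs already appear alphabetically; match(df$ID, unique(df$ID)) would number them by first appearance instead.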
One hiccup I noted is if the values start with NA; in that case using na.locf will return something smaller than the length of the dataframe. The fix I came up with was to manually prepend a value (0), forward fill, then remove the value, but G. Grothendieck in the comments provided a nicer way to fix this 😊:
# Note: the first 5 values are NA
df <- data.frame(ID=c('A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'F', 'G', 'H'),
Number=c(NA, NA, NA, NA, NA, 3, 3, 3, NA, 6, NA))
library(zoo)
a <- as.integer(factor(df$ID))
b <- na.fill(na.locf0(a - df$Number), 0)
df$Number <- a - b
Result:
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7
Data
ID <- c("A","A","B","C","C","D","D","D","F","G","H")
Number <- c(1,1,NA_real_,NA_real_,NA_real_,3,3,3,NA_real_,6,NA_real_)
df <- data.frame(ID,Number,stringsAsFactors = F)
dplyr approach
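library(dplyr)
library(tidyr)  # for fill()
# work on one row per ID
df2 <- df[!df$ID%>%duplicated(),]%>%
  mutate(Number2=ifelse(is.na(Number),1,0))%>%  # flag IDs with a missing Number
  group_by(grp=cumsum(Number2==0))%>%           # start a new group at each known Number
  mutate(cumulative=cumsum(Number2))%>%         # count NAs since the last known Number
  ungroup%>%
  fill(Number)%>%                               # carry the last known Number forward
  mutate(Number=Number+cumulative)%>%
  select(ID,Number)
base::merge(df%>%select(-Number),df2,by="ID",all.x=T)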
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7
or in one LONG line:
df%>%select(-Number)%>%merge(df[!df$ID%>%duplicated(),]%>%
mutate(Number2=ifelse(is.na(Number),1,0))%>%
group_by(grp=cumsum(Number2==0))%>%
mutate(cumulative=cumsum(Number2))%>%
ungroup%>%
fill(Number)%>%
mutate(Number=Number+cumulative)%>%
select(ID,Number),
by="ID",all.x=T)
original answer:
df2 <- df[!df$ID%>%duplicated(),]
while(sum(is.na(df2$Number))!=0){
df2$Number[is.na(df2$Number)] <- c(lag(df2$Number)+1)[is.na(df2$Number)]
}
base::merge(df%>%select(-Number),df2,by="ID",all.x=T)
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7

Select first non-NA value by row [duplicate]

This question already has answers here:
How to implement coalesce efficiently in R
I have data like this:
df <- data.frame(id=c(1, 2, 3, 4), A=c(6, NA, NA, 4), B=c(3, 2, NA, NA), C=c(4, 3, 5, NA), D=c(4, 3, 1, 2))
id A B C D
1 1 6 3 4 4
2 2 NA 2 3 3
3 3 NA NA 5 1
4 4 4 NA NA 2
For each row: If the row has non-NA values in column "A", I want that value to be entered into a new column 'E'. If it doesn't, I want to move on to column "B", and that value entered into E. And so on. Thus, the new column would be E = c(6, 2, 5, 4).
I wanted to use the ifelse function, but I am not quite sure how to do this.
tidyverse
library(dplyr)
mutate(df, E = coalesce(A, B, C, D))
# id A B C D E
# 1 1 6 3 4 4 6
# 2 2 NA 2 3 3 2
# 3 3 NA NA 5 1 5
# 4 4 4 NA NA 2 4
coalesce effectively returns, for each position, the first non-NA value across the vectors it is given. It is modeled on SQL's COALESCE.
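To show what "first non-NA at each position" means, here is a minimal base R sketch of the same behaviour. `coalesce_base` is a hypothetical name and this is not how dplyr implements coalesce; it assumes all vectors have the same length.

```r
# Hypothetical helper: walk through the vectors left to right and keep the
# current value unless it is NA, in which case take the next vector's value.
coalesce_base <- function(...) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), list(...))
}

df <- data.frame(id = c(1, 2, 3, 4),
                 A = c(6, NA, NA, 4),
                 B = c(3, 2, NA, NA),
                 C = c(4, 3, 5, NA),
                 D = c(4, 3, 1, 2))

E <- coalesce_base(df$A, df$B, df$C, df$D)
E
```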
base R
df$E <- apply(df[,-1], 1, function(z) na.omit(z)[1])
df
# id A B C D E
# 1 1 6 3 4 4 6
# 2 2 NA 2 3 3 2
# 3 3 NA NA 5 1 5
# 4 4 4 NA NA 2 4
na.omit removes all of the NA values, and [1] makes sure we always return just the first of the remaining ones. The advantage of [1] over (say) head(., 1) is that head returns a zero-length vector if there are no non-NA elements, whereas .[1] always returns at least an NA (indicating that the row had no non-NA value to offer).
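A quick check of what each expression returns for a row with no non-NA value (the variable names are illustrative only):

```r
all_na <- c(NA_real_, NA_real_)

with_index <- na.omit(all_na)[1]         # indexing past the end yields NA
with_head  <- head(na.omit(all_na), 1)   # nothing left to return

length(with_index)   # one element, which is NA
length(with_head)    # no elements at all
```

Inside apply, results of differing lengths would make it return a list instead of a vector, which is why the [1] form is the safer choice for building a new column.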

subtracting the greater column from smaller columns in a dataframe in R

I have the input below and I would like to subtract the two columns, always subtracting the lower value from the higher one, because I don't want negative values in the result.
Sometimes the higher value is in the first column (PaternalOrigin) and other times in the second column (MaternalOrigin).
Input:
df <- PaternalOrigin MaternalOrigin
16 20
3 6
11 0
1 3
1 4
3 11
and the dput output is this:
df <- structure(list(PaternalOrigin = c(16, 3, 11, 1, 1, 3), MaternalOrigin = c(20, 6, 0, 3, 4, 11)), colnames = c("PaternalOrigin", "MaternalOrigin"), row.names= c(NA, -6L), class="data.frame")
Thus, my expected output would look like:
df2 <- PaternalOrigin MaternalOrigin Results
16 20 4
3 6 3
11 0 11
1 3 2
1 4 3
3 11 8
Please, can someone advise me?
Thanks.
We can wrap with abs
transform(df, Results = abs(PaternalOrigin - MaternalOrigin))
# PaternalOrigin MaternalOrigin Results
#1 16 20 4
#2 3 6 3
#3 11 0 11
#4 1 3 2
#5 1 4 3
#6 3 11 8
Or we can assign it to 'Results'
df$Results <- with(df, abs(PaternalOrigin - MaternalOrigin))
Or using data.table
library(data.table)
setDT(df)[, Results := abs(PaternalOrigin - MaternalOrigin)]
Or with dplyr
library(dplyr)
df %>%
mutate(Results = abs(PaternalOrigin - MaternalOrigin))
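An equivalent formulation that mirrors the wording of the question ("subtract the lowest from the highest") uses the element-wise pmax and pmin functions from base R. This is a sketch of an alternative, not one of the answers above; for two numeric columns it gives the same result as abs.

```r
df <- data.frame(PaternalOrigin = c(16, 3, 11, 1, 1, 3),
                 MaternalOrigin = c(20, 6, 0, 3, 4, 11))

# Row-wise larger value minus row-wise smaller value
df$Results <- with(df, pmax(PaternalOrigin, MaternalOrigin) -
                       pmin(PaternalOrigin, MaternalOrigin))
df
```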

in R find duplicates by column 1 and filter by not NA column 3

I have a dataframe:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c(1,NA,2,4,NA,1,2,2)
df <-data.frame(a,b,c)
I have a dataframe with some duplicate values in column 1, but when I de-duplicate it with the duplicated function, the row that is kept seems to be chosen at random:
dedup_df = df[!duplicated(df$a), ]
How can I ensure that the output returns me the row that does not contain an NA on column c ?
I tried to use the dplyr package but the output prints only a result
library(dplyr)
options(dplyr.print_max = Inf )
df %>% ## source dataframe
group_by(a) %>% ## grouped by variable
filter(!is.na(c) ) %>% ## filter by Gross value
as.data.frame(dedup_df)
Your use of the duplicated function to remove duplicate observations (rows) from a data frame, using a column as key, is correct.
But it seems you are worried that it may keep a row that contains NA in another column and drop a row that contains a non-NA value.
I'll use your example, but with a slight modification:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c(NA,1,2,4,NA,1,2,2)
df <-data.frame(a,b,c)
> df
a b c
1 A 1 NA
2 A 1 1
3 A 2 2
4 B 4 4
5 B 1 NA
6 B 1 1
7 C 2 2
8 C 2 2
In this case, your dedup_df contains an NA for the first value.
> dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
1 A 1 NA
4 B 4 4
7 C 2 2
Solution:
Reorder df by column c first and then use the same command. Ordering by column c sends all NAs to the end of the data frame, so when duplicated runs, any row with an NA comes after the non-NA rows for the same key and is tagged TRUE (a duplicate) whenever an earlier row without NA exists.
df = df[order(df$c),]
dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
2 A 1 1
6 B 1 1
7 C 2 2
You can also reorder in descending order
df = df[order(df$c,decreasing = T),]
dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
4 B 4 4
3 A 2 2
7 C 2 2
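A variant of the same reordering idea, as a sketch: sort by the key column first and by is.na(c) second, so that the groups stay together and, within each group, non-NA rows come first in their original order. The name `sorted` is made up for illustration.

```r
df <- data.frame(a = c(rep("A", 3), rep("B", 3), rep("C", 2)),
                 b = c(1, 1, 2, 4, 1, 1, 2, 2),
                 c = c(NA, 1, 2, 4, NA, 1, 2, 2))

# FALSE sorts before TRUE, so rows with a non-NA `c` lead each group;
# order() leaves remaining ties in their original order
sorted <- df[order(df$a, is.na(df$c)), ]
dedup_df <- sorted[!duplicated(sorted$a), ]
dedup_df
```

With this ordering the row kept for each key is the first non-NA row as it appeared in the original data, rather than the one with the smallest or largest value of c.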

Special use of colSums(), na.rm = TRUE only if 1 or fewer are missing

I need to sum some columns in a data.frame with the following rule: a column's sum should be NA if more than one observation is missing; if one or fewer are missing, the column should be summed regardless (ignoring the missing value).
Say I have some data like this,
dfn <- data.frame(
a = c(3, 3, 0, 3),
b = c(1, NA, 0, NA),
c = c(0, 3, NA, 1))
dfn
a b c
1 3 1 0
2 3 NA 3
3 0 0 NA
4 3 NA 1
and I apply my rule, summing the columns with fewer than 2 missing values, so I get something like this:
a b c
1 3 1 0
2 3 NA 3
3 0 0 NA
4 3 NA 1
5 9 NA 4
I've played around with colSums(dfn, na.rm = FALSE) and colSums(dfn, na.rm = TRUE). In my real data there are more than three columns and more than 4 rows. I imagine I can count the missing values somehow and use that as a rule?
I don't think you can do this with colSums alone, but you can add to its result using ifelse:
colSums(dfn,na.rm=TRUE) + ifelse(colSums(is.na(dfn)) > 1, NA, 0)
a b c
9 NA 4
Nothing wrong with @James' answer, but here's a slightly cleaner way:
colSums(apply(dfn, 2, function(col) replace(col, match(NA, col), 0)))
# a b c
# 9 NA 4
match(NA, col) returns the index of the first NA in col, replace replaces it with 0 and returns the new column, and apply returns a matrix with all of the new columns.
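Either approach can be wrapped in a small function if the tolerated number of missing values might change. The sketch below is an assumption-laden generalisation, not part of the answers: `col_sums_tolerant` and its `max_missing` argument are hypothetical names.

```r
# Hypothetical helper: column sums that ignore NAs, but return NA for any
# column with more than `max_missing` missing values.
col_sums_tolerant <- function(df, max_missing = 1) {
  n_missing <- colSums(is.na(df))
  sums <- colSums(df, na.rm = TRUE)
  sums[n_missing > max_missing] <- NA
  sums
}

dfn <- data.frame(a = c(3, 3, 0, 3),
                  b = c(1, NA, 0, NA),
                  c = c(0, 3, NA, 1))

totals <- col_sums_tolerant(dfn)
with_total <- rbind(dfn, totals)   # append the sums as a fifth row
with_total
```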
