I have a data frame like this:
df <- data.frame(v1 = 10:14, v2 = c(NA, 1, NA, 3, 6), v3 = c(1, NA, NA, 9, 4))
v1 v2 v3
1 10 NA 1
2 11 1 NA
3 12 NA NA
4 13 3 9
5 14 6 4
I now want to fill the NAs with the value of the previous column, so it looks like this:
v1 v2 v3
1 10 10 1
2 11 1 1
3 12 12 12
4 13 3 9
5 14 6 4
I know how to do this manually, like this:
df$v2 <- ifelse(is.na(df$v2), df$v1, df$v2)
How can I automate this for a full data frame with many columns?
You can do this with fill from tidyr:
library(dplyr)
library(tidyr)
data.frame(t(df)) %>%
fill(., names(.)) %>%
t()
Result:
v1 v2 v3
X1 10 10 1
X2 11 1 1
X3 12 12 12
X4 13 3 9
X5 14 6 4
Note:
I transposed df, filled every column downward, then transposed it back to the original orientation. The final t() returns a matrix, so wrap the result in as.data.frame() if you need a data frame.
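for (i in 2:ncol(df))
  df[, i] <- ifelse(is.na(df[, i]), df[, i - 1], df[, i])
Because the loop runs left to right, a value is carried across a streak of consecutive NA columns. If you don't want this, reverse the order of the indexes in the for loop declaration (ncol(df):2), so that each NA is filled only from the original value of the column immediately to its left.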
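The difference between the two loop orders can be seen on the question's data. The following is a minimal sketch; `fill_from_left` is a hypothetical helper name introduced only for this illustration.

```r
df <- data.frame(v1 = 10:14, v2 = c(NA, 1, NA, 3, 6), v3 = c(1, NA, NA, 9, 4))

# Hypothetical helper: fill NAs in each column listed in idx
# from the column immediately to its left.
fill_from_left <- function(d, idx) {
  for (i in idx) {
    d[, i] <- ifelse(is.na(d[, i]), d[, i - 1], d[, i])
  }
  d
}

forward  <- fill_from_left(df, 2:ncol(df))  # left to right: values propagate
backward <- fill_from_left(df, ncol(df):2)  # right to left: one step only

forward$v3   # 1 1 12 9 4  (row 3 receives 12, carried over from v1 via v2)
backward$v3  # 1 1 NA 9 4  (row 3 stays NA because v2 was still NA)
```

In row 3 both v2 and v3 are NA, so only the left-to-right order reaches back to v1.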
Another option using Reduce with ifelse:
df[] <- Reduce(function(x, y) ifelse(is.na(y), x, y), df, accumulate = TRUE)
df
# v1 v2 v3
#1 10 10 1
#2 11 1 1
#3 12 12 12
#4 13 3 9
#5 14 6 4
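You could use apply, but note that the output will be a matrix. This assumes the first column contains no NA:
t(apply(df, 1, function(x) {
  # index into the non-NA values so that each NA gets the last value seen before it
  replace(x, is.na(x), x[!is.na(x)][cumsum(!is.na(x))][is.na(x)])
}))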
# v1 v2 v3
#[1,] 10 10 1
#[2,] 11 1 1
#[3,] 12 12 12
#[4,] 13 3 9
#[5,] 14 6 4
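If a data frame is needed rather than a matrix, the same row-wise idea can be wrapped up as below. This is a sketch in base R; `locf` is a hypothetical helper written for this example (not the zoo function), and it leaves leading NAs as NA.

```r
df <- data.frame(v1 = 10:14, v2 = c(NA, 1, NA, 3, 6), v3 = c(1, NA, NA, 9, 4))

# Hypothetical helper: carry the last non-NA value forward within a vector.
locf <- function(x) {
  idx <- cumsum(!is.na(x))               # number of non-NA values seen so far
  unname(c(NA, x[!is.na(x)])[idx + 1])   # slot 1 (NA) covers leading NAs
}

# apply() returns one column per input row, so transpose back,
# then convert the matrix to a data frame and restore the column names.
filled <- as.data.frame(t(apply(df, 1, locf)))
names(filled) <- names(df)
filled
```

Note that all columns come back as double, because apply() converts the data frame to a numeric matrix first.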
Using na.locf from zoo:
library(zoo)
data.frame(t(apply(df, 1, function(x) na.locf(x))))
v1 v2 v3
1 10 10 1
2 11 1 1
3 12 12 12
4 13 3 9
5 14 6 4
I actually have a pretty easy task to do, but I just cannot find a solution. I have one df with 2 columns of numbers and 1 column of 3 different strings. I want to add now a 4th column V4 that I want to fill with the values of V1 and V2, depending on the V3 column.
> df
V1 V2 V3
1 1 6 P
2 2 7 P
3 3 8 N
4 4 9 B
5 5 10 P
6 6 11 B
7 7 12 N
8 8 13 N
9 9 14 P
10 10 15 P
structure(list(V1 = 1:10, V2 = 6:15, V3 = c("P", "P", "N", "B", "P", "B", "N", "N", "P", "P")), row.names = c(NA, -10L), class = "data.frame")
For "P" I want to take V1, for "N" I want to take V2, and for "B" I ideally want both values next to each other (V1|V2), but without turning them into characters; they have to stay numeric. If that is not possible, then the higher number should be filled in.
My output should look like this (as numeric). If it is not possible to display 4|9 or something similar as a numeric, then just the higher of the two numbers.
V1 V2 V3 V4
1 1 6 P 1
2 2 7 P 2
3 3 8 N 8
4 4 9 B 4|9
5 5 10 P 5
6 6 11 B 6|11
7 7 12 N 12
8 8 13 N 13
9 9 14 P 9
10 10 15 P 10
I found a lot on how to do this when simply filling a column, but I can't find any examples of filling a column with values from other columns based on 3 conditions. I tried if-statements with loops and subsets, but have failed so far.
We may create the condition with case_when.
library(dplyr)
library(stringr)
df %>%
mutate(V4 = case_when(V3 == 'B' ~ str_c(V1, V2, sep = '|'),
V3 == 'P' ~ as.character(V1),
TRUE ~ as.character(V2)))
-output
df
V1 V2 V3 V4
1 1 6 P 1
2 2 7 P 2
3 3 8 N 8
4 4 9 B 4|9
5 5 10 P 5
6 6 11 B 6|11
7 7 12 N 12
8 8 13 N 13
9 9 14 P 9
10 10 15 P 10
If we need a numeric column and the 'B' should be NA
df %>%
mutate(V4 = case_when(V3 == 'P' ~ V1,
V3 == 'N' ~ V2))
-output
V1 V2 V3 V4
1 1 6 P 1
2 2 7 P 2
3 3 8 N 8
4 4 9 B NA
5 5 10 P 5
6 6 11 B NA
7 7 12 N 12
8 8 13 N 13
9 9 14 P 9
10 10 15 P 10
Or, if we need a numeric column and the larger of the two values for 'B', use pmax, which returns the element-wise (here: per-row) maximum:
df %>%
mutate(V4 = case_when(V3 == 'P' ~ V1,
V3 == 'N' ~ V2,
V3 == 'B' ~ pmax(V1, V2)))
-output
V1 V2 V3 V4
1 1 6 P 1
2 2 7 P 2
3 3 8 N 8
4 4 9 B 9
5 5 10 P 5
6 6 11 B 11
7 7 12 N 12
8 8 13 N 13
9 9 14 P 9
10 10 15 P 10
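For comparison, the pmax variant can also be sketched in base R with nested ifelse calls. This is only an illustration and not part of the answer above; the data frame is rebuilt from the values in the question.

```r
df <- data.frame(V1 = 1:10, V2 = 6:15,
                 V3 = c("P", "P", "N", "B", "P", "B", "N", "N", "P", "P"))

# "P" takes V1, "N" takes V2, anything else ("B") takes the larger value.
df$V4 <- ifelse(df$V3 == "P", df$V1,
         ifelse(df$V3 == "N", df$V2, pmax(df$V1, df$V2)))

is.numeric(df$V4)  # TRUE: the column stays numeric
df$V4              # 1 2 8 9 5 11 12 13 9 10
```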
Here is an example:
df<-data.frame(v1=rep(1:2, 4),
v2=rep(c("a", "b"), each=4),
v3=paste0(rep(1:2, each=4), rep(c("m", "n", "o", "p"), each=2)),
v4=c(1,2, NA, NA, 3,4, NA,NA),
v5=c(5,6, NA, NA, 7,8, NA,NA),
v6=c(9,10, NA, NA, 11,12, NA,NA))
df
v1 v2 v3 v4 v5 v6
1 1 a 1m 1 5 9
2 2 a 1m 2 6 10
3 1 a 1n NA NA NA
4 2 a 1n NA NA NA
5 1 b 2o 3 7 11
6 2 b 2o 4 8 12
7 1 b 2p NA NA NA
8 2 b 2p NA NA NA
What I want is this: if v1, v2 and v3 are the same, ignoring the last letter of v3, fill the NAs from the matching rows that are not NA. In this case, row 3's NAs should be filled from row 1, because both rows give 1a1 once the trailing m or n is ignored. So the desired output would be:
v1 v2 v3 v4 v5 v6
1 1 a 1m 1 5 9
2 2 a 1m 2 6 10
3 1 a 1n 1 5 9
4 2 a 1n 2 6 10
5 1 b 2o 3 7 11
6 2 b 2o 4 8 12
7 1 b 2p 3 7 11
8 2 b 2p 4 8 12
I don't know but I think this is a simpler way of producing your results
library(tidyverse)
df %>%
group_by(v1,v2) %>%
fill(v4:v6)
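Adding the v3 logic (ungroup before dropping the helper column, because select() always keeps grouping variables):
df %>%
mutate(v7 = v3 %>% as.character() %>% parse_number()) %>%
group_by(v1, v2, v7) %>%
fill(v4:v6) %>%
ungroup() %>%
select(-v7)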
Here is a solution that recodes v3 into a variable that only takes into account the numeric part.
library(dplyr)
library(tidyr)   # provides fill()
library(stringr)
# Extract the numeric part of the string in v3
df$v7 <- str_extract(df$v3, "[[:digit:]]+")
df %>%
group_by(v1, v2, v7) %>%
fill(v4:v6)
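The same grouped fill can be sketched without any packages, using ave() to apply a fill function within each group. This assumes, as in the example data, that the filled rows come before the NA rows inside each group; `locf` is a hypothetical helper defined here.

```r
df <- data.frame(v1 = rep(1:2, 4),
                 v2 = rep(c("a", "b"), each = 4),
                 v3 = paste0(rep(1:2, each = 4), rep(c("m", "n", "o", "p"), each = 2)),
                 v4 = c(1, 2, NA, NA, 3, 4, NA, NA),
                 v5 = c(5, 6, NA, NA, 7, 8, NA, NA),
                 v6 = c(9, 10, NA, NA, 11, 12, NA, NA))

# Hypothetical helper: carry the last non-NA value forward.
locf <- function(x) c(NA, x[!is.na(x)])[cumsum(!is.na(x)) + 1]

# Group key: v1, v2 and v3 with its trailing letter(s) removed.
grp <- interaction(df$v1, df$v2, sub("[^0-9]+$", "", df$v3), drop = TRUE)

for (col in c("v4", "v5", "v6")) {
  df[[col]] <- ave(df[[col]], grp, FUN = locf)
}
df
```

ave() returns the results in the original row order, so no re-sorting is needed afterwards.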
Here's a solution using data.table and zoo which ignores v3 column's last letter:
library(data.table)
setDT(df)[, match_cols := paste0(v1, v2, substr(v3, 1, nchar(as.character(v3)) - 1))
  ][, id := .GRP, by = match_cols
  ][, v4 := zoo::na.locf(v4, na.rm = FALSE), by = id
  ][, v5 := zoo::na.locf(v5, na.rm = FALSE), by = id
  ][, v6 := zoo::na.locf(v6, na.rm = FALSE), by = id
  ][, c("match_cols", "id") := NULL]
df
# v1 v2 v3 v4 v5 v6
#1: 1 a 1m 1 5 9
#2: 2 a 1m 2 6 10
#3: 1 a 1n 1 5 9
#4: 2 a 1n 2 6 10
#5: 1 b 2o 3 7 11
#6: 2 b 2o 4 8 12
#7: 1 b 2p 3 7 11
#8: 2 b 2p 4 8 12
Using na.locf from zoo
library(zoo)
library(data.table)
setDT(df)[, na.locf(.SD),.(v1, v2)]
# v1 v2 v3 v4 v5 v6
#1: 1 a 1m 1 5 9
#2: 1 a 1n 1 5 9
#3: 2 a 1m 2 6 10
#4: 2 a 1n 2 6 10
#5: 1 b 2o 3 7 11
#6: 1 b 2p 3 7 11
#7: 2 b 2o 4 8 12
#8: 2 b 2p 4 8 12
If we want to add the condition in 'v3'
setDT(df)[, names(df)[4:6] := na.locf(.SD),.(v1, v2, sub("\\D+", "", v3))][]
# v1 v2 v3 v4 v5 v6
#1: 1 a 1m 1 5 9
#2: 2 a 1m 2 6 10
#3: 1 a 1n 1 5 9
#4: 2 a 1n 2 6 10
#5: 1 b 2o 3 7 11
#6: 2 b 2o 4 8 12
#7: 1 b 2p 3 7 11
#8: 2 b 2p 4 8 12
I have a list containing two matrices:
a <- list("m1"=matrix(1:9, nrow = 3, ncol = 3),
"m2"=matrix(1:9, nrow = 3, ncol = 3))
I want to row-bind the two matrices and, to distinguish the rows, add a column that contains the name of the matrix each row came from. I can bind the rows using rbind:
b <- do.call(rbind, a) %>% as.data.frame
which yields
V1 V2 V3
1 1 4 7
2 2 5 8
3 3 6 9
4 1 4 7
5 2 5 8
6 3 6 9
But how do I add a column containing the names? I can do b$id <- c("m1","m1","m1","m2","m2","m2"), but there must be an easier way than this (?)
Here's how to do it in dplyr / purrr
a %>% purrr::map(as.data.frame) %>% dplyr::bind_rows(.id = "origin")
origin V1 V2 V3
1 m1 1 4 7
2 m1 2 5 8
3 m1 3 6 9
4 m2 1 4 7
5 m2 2 5 8
6 m2 3 6 9
That converts the matrices to data-frames before row-binding them.
You can use bind_rows on a list of matrices. But it doesn't return what you expect.
a %>% bind_rows(.id = "origin")
# A tibble: 9 x 3
origin m1 m2
<chr> <int> <int>
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 1 5 5
6 1 6 6
7 1 7 7
8 1 8 8
9 1 9 9
This happens because m1 and m2 are atomic vectors (a matrix is a vector with a dim attribute) of the same length, and bind_rows treats a list of equal-length vectors as the columns of a single data frame. So the latter call is equivalent to
bind_rows(data.frame(m1 = as.vector(a$m1), m2 = as.vector(a$m2)), .id = "origin")
So make sure you convert your matrices to data frames before you bind them together.
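The point that a matrix is just a vector with dimensions can be checked directly in base R. A small illustration, independent of dplyr:

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

is.atomic(m)                      # TRUE: a matrix is an atomic vector
length(m)                         # 9: one element per cell, not per row
dim(m)                            # 3 3: the only thing that makes it a matrix
flat <- as.vector(m)              # dropping dim gives the plain vector
identical(flat, 1:9)              # TRUE (column-major order)

# Converting first gives one column per matrix column instead.
ncol(as.data.frame(m))            # 3
```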
You can do:
b <- do.call(rbind.data.frame, a)
# V1 V2 V3
#m1.1 1 4 7
#m1.2 2 5 8
#m1.3 3 6 9
#m2.1 1 4 7
#m2.2 2 5 8
#m2.3 3 6 9
or, if you are not happy with this,
b <- do.call(rbind.data.frame, a)
b$id <- sub("[.].+", "", rownames(b))
# V1 V2 V3 id
#m1.1 1 4 7 m1
#m1.2 2 5 8 m1
#m1.3 3 6 9 m1
#m2.1 1 4 7 m2
#m2.2 2 5 8 m2
#m2.3 3 6 9 m2
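Another way to build the id column, sketched in base R, is to repeat each list name once per row of its matrix. This avoids parsing the row names and also works when the matrices have different numbers of rows.

```r
a <- list("m1" = matrix(1:9, nrow = 3, ncol = 3),
          "m2" = matrix(1:9, nrow = 3, ncol = 3))

b <- as.data.frame(do.call(rbind, a))

# One name per row: "m1" repeated nrow(m1) times, then "m2", and so on.
b$id <- rep(names(a), times = vapply(a, nrow, integer(1)))
b
```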
Some example of my data:
library(tidyverse)
set.seed(1234)
df <- tibble(
v1 = c(1:6),
v2 = rnorm(6, 5, 2) %>% round,
v3 = rnorm(6, 4, 2) %>% round,
v4 = rnorm(6, 4, 1) %>% round %>% lag(1),
v5 = rnorm(6, 6, 2) %>% round %>% lag(2),
v6 = rnorm(6, 5, 3) %>% round %>% lag(3),
v7 = rnorm(6, 5, 3) %>% round %>% lag(4))
v1 v2 v3 v4 v5 v6 v7
1 1 3 3 NA NA NA NA
2 2 6 3 3 NA NA NA
3 3 7 3 4 4 NA NA
4 4 0 2 5 11 3 NA
5 5 6 3 4 6 1 8
6 6 6 2 3 5 7 4
I want to shift the columns along the diagonal that separates the NA cells from the filled cells.
So, desired output looks like this:
v1 v2 v3 v4 v5 v6 v7
1 NA NA 3 3 4 3 8
2 NA 3 3 4 11 1 4
3 1 6 3 5 6 7 NA
4 2 7 2 4 5 NA NA
5 3 0 3 4 NA NA NA
6 4 6 2 NA NA NA NA
7 5 6 NA NA NA NA NA
8 6 NA NA NA NA NA NA
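Each column is shifted relative to v3: the columns to its left move down by 1, 2, ... rows, and the columns to its right move up by 1, 2, 3, ... rows.
I tried to achieve this inside dplyr::mutate_all(), but I failed to iterate over the columns with the lag() and lead() functions.
EDIT: following #wibeasley's advice, I came up with this: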
df %>%
mutate(dummy1 = c(3:8)) %>%
gather("var", "val", -dummy1) %>%
mutate(
dummy2 = sub("v", "", var, fixed = TRUE),
dummy3 = dummy1 - as.numeric(dummy2) + 1) %>%
select(-dummy1, -dummy2) %>%
spread(var, val) %>%
slice(-c(1:4)) %>% select(-dummy3)
Looks ugly, but works.
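The shifting rule can also be sketched in base R: column j is moved by 3 - j rows (positive means down, negative means up) inside a frame that is two rows longer than the input. The data below are typed in from the printed table rather than regenerated with set.seed; `shift_col` is a hypothetical helper, and the pivot position 3 (column v3) is an assumption taken from the question.

```r
df <- data.frame(v1 = 1:6,
                 v2 = c(3, 6, 7, 0, 6, 6),
                 v3 = c(3, 3, 3, 2, 3, 2),
                 v4 = c(NA, 3, 4, 5, 4, 3),
                 v5 = c(NA, NA, 4, 11, 6, 5),
                 v6 = c(NA, NA, NA, 3, 1, 7),
                 v7 = c(NA, NA, NA, NA, 8, 4))

# Hypothetical helper: place x into a vector of length n_out, moved by k rows.
# Values that would land outside 1..n_out are dropped.
shift_col <- function(x, k, n_out) {
  out <- rep(NA, n_out)
  pos <- seq_along(x) + k
  keep <- pos >= 1 & pos <= n_out
  out[pos[keep]] <- x[keep]
  out
}

pivot <- 3                   # v3 stays in place
n_out <- nrow(df) + pivot - 1  # room for v1, which moves down by 2 rows

shifted <- as.data.frame(
  Map(function(x, j) shift_col(x, pivot - j, n_out), df, seq_along(df))
)
shifted
```

The values dropped at the top of v4 to v7 are exactly the NAs introduced by lag(), so no data are lost.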
We can use lapply to handle each column, moving the NAs to the end of the column.
df[] <- lapply(df, function(x) c(x[!is.na(x)], x[is.na(x)]))
df
# # A tibble: 6 x 7
# v1 v2 v3 v4 v5 v6 v7
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 3 3 3 4 3 8
# 2 2 6 3 4 11 1 4
# 3 3 7 3 5 6 7 NA
# 4 4 0 2 4 5 NA NA
# 5 5 6 3 3 NA NA NA
# 6 6 6 2 NA NA NA NA
Hi, in relation to the question here:
Dynamically replace row in dataframe with vector
I have a data.frame for example:
d <- read.table(text=' V1 V2 V3 V4 V5 V6 V7
1 1 a 2 3 4 9 6
2 1 b 2 2 4 5 NA
3 1 c 1 3 4 5 8
4 1 d 1 2 3 6 9
5 2 a 1 2 3 4 5
6 2 b 1 4 5 6 7
7 2 c 1 2 3 5 8
8 2 d 2 3 6 7 9', header=TRUE)
Now I want to take one row, for example the first one (1a), and:
1. Get the min and max value from that row. In this case min = 2 and max = 9 (note that some values in between are missing; for example, there is no 5, 7, or 8 in that row).
2. Replace that row with the full sequence from min to max, which extends it (the row becomes longer than all the others, as it runs from 2 to 9: 2, 3, 4, 5, 6, 7, 8, 9). The whole data.frame should then be automatically extended with NA columns for the other rows, which are not as long as the one I replaced.
The following code achieves this:
row.to.change <- 1
(new.row <- seq(min(d[row.to.change,c(-1, -2)], na.rm=TRUE), max(d[row.to.change,c(-1,-2)], na.rm=TRUE)))
(num.add <- length(new.row) - ncol(d) + 2)
# [1] 3
if (num.add > 0) {
d <- cbind(d, replicate(num.add, rep(NA, nrow(d))))
} else if (num.add <= 0) {
new.row <- c(new.row, rep(NA, -num.add))
}
Finally, it assigns the new row and renames the extended data.frame's columns with the default names:
d[row.to.change,c(-1, -2)] <- new.row
colnames(d) <- paste0("V", seq_len(ncol(d)))
This works for the row that I specify in row.to.change, but how can I make it work for, say, all rows that have a 'b' in the second column? Something like "do this where d$V2 == 'b'", in case the data.frame is 5000 rows long.
You have already solved it. Just wrap your code in a function and then apply it to each row of your data.
rtc <- function(row.to.change) {
(new.row <- seq(min(d[row.to.change,c(-1, -2)], na.rm=TRUE), max(d[row.to.change,c(-1,-2)], na.rm=TRUE)))
(num.add <- length(new.row) - ncol(d) + 2)
# [1] 3
if (num.add <= 0) {
new.row <- c(new.row, rep(NA, -num.add))
}
new.row
}
#d2=d
newr=lapply(1:nrow(d),rtc) # for the whole data
# for specific condition, like lines with "b" in V2 change to:
# newr=lapply(1:nrow(d),function(z)if(d$V2[z]=="b")rtc(z) else as.numeric(d[z,c(-1, -2)]))
mxl=max(sapply(newr,length))
newr=lapply(newr,function(z)if(length(z)<mxl)c(z,rep(NA,mxl-length(z))) else z)
if (ncol(d)-2 < mxl) {
d <- cbind(d, replicate(mxl-ncol(d)+2, rep(NA, nrow(d))))
}
d[,c(-1, -2)] <- do.call(rbind,newr)
colnames(d) <- paste0("V", seq_len(ncol(d)))
d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 1 a 2 3 4 5 6 7 8 9 NA
2 1 b 2 3 4 5 NA NA NA NA NA
3 1 c 1 2 3 4 5 6 7 8 NA
4 1 d 1 2 3 4 5 6 7 8 9
5 2 a 1 2 3 4 5 NA NA NA NA
6 2 b 1 2 3 4 5 6 7 NA NA
7 2 c 1 2 3 4 5 6 7 8 NA
8 2 d 2 3 4 5 6 7 8 9 NA
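The conditional case asked about in the question (expand only the rows where V2 is 'b') can be sketched as a self-contained script. This is one possible arrangement under the assumption that the first two columns are identifiers and all remaining columns are numeric; the names `vals`, `expand`, `rows`, `padded` and `out` are introduced here for illustration.

```r
d <- read.table(text = ' V1 V2 V3 V4 V5 V6 V7
1 1 a 2 3 4 9 6
2 1 b 2 2 4 5 NA
3 1 c 1 3 4 5 8
4 1 d 1 2 3 6 9
5 2 a 1 2 3 4 5
6 2 b 1 4 5 6 7
7 2 c 1 2 3 5 8
8 2 d 2 3 6 7 9', header = TRUE)

vals   <- d[, -(1:2)]   # the numeric part of each row
expand <- d$V2 == "b"   # rows to replace with a full min:max sequence

rows <- lapply(seq_len(nrow(d)), function(i) {
  x <- as.numeric(unlist(vals[i, ], use.names = FALSE))
  if (expand[i]) as.numeric(seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE))) else x
})

# Pad every row with NA up to the longest one, then rebuild the data frame.
width  <- max(lengths(rows))
padded <- t(vapply(rows, function(x) c(x, rep(NA, width - length(x))),
                   numeric(width)))
out <- cbind(d[, 1:2], padded)
colnames(out) <- paste0("V", seq_len(ncol(out)))
out
```

Rows that do not match the condition keep their original values and are only padded with NA on the right.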