Carry forward NA values for multiple columns in R

I need to carry the non-NA values forward from one column to the next, filling in the NAs. Example data is below:
df <- data.frame(a = c(1,2,NA,NA,NA,NA,NA,NA,NA,NA),
b =c(NA,NA,3,4,NA,NA,NA,NA,NA,NA),
c = c(NA,NA,NA,NA,5,6,NA,NA,NA,NA),
d = c(NA,NA,NA,NA,NA,NA,7,8,NA,NA),
e = c(NA,NA,NA,NA,NA,NA,NA,NA,9,10))
I have tried a loop with the na.locf function from zoo, but this only carries forward the values of the immediately preceding column:
columns <- seq(2,ncol(df))
output <- list()
for (i in columns){
output[[i]] <- t(zoo::na.locf(t(df[,(i-1):i])))[,2]
}
The expected output would be like
expected_output <- data.frame(a = c(1,2,NA,NA,NA,NA,NA,NA,NA,NA),
b = c(1,2,3,4,NA,NA,NA,NA,NA,NA),
c = c(1,2,3,4,5,6,NA,NA,NA,NA),
d = c(1,2,3,4,5,6,7,8,NA,NA),
e = c(1,2,3,4,5,6,7,8,9,10))

Transpose df, apply na.locf, transpose again and replace df contents with that to make it a data frame with the correct names.
library(zoo)
out <- replace(df, TRUE, t(na.locf(t(df), fill = NA)))
identical(out, expected_output)
## [1] TRUE
This also works and is similar except it applies na.locf0 to each row instead of applying na.locf to the transpose.
out <- replace(df, TRUE, t(apply(df, 1, na.locf0)))
identical(out, expected_output)
## [1] TRUE
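For readers without zoo, the same row-wise carry-forward can be sketched in base R. This is only an illustration on a smaller data frame of the same staircase shape; the helper name locf is hypothetical and is not part of zoo or of the answer above.

```r
# Small data frame in the same staircase shape as the question's df.
df <- data.frame(a = c(1, NA, NA),
                 b = c(NA, 2, NA),
                 c = c(NA, NA, 3))

# Hypothetical helper: last observation carried forward within one vector.
# cumsum(!is.na(v)) counts the non-NA values seen so far; that count indexes
# into the non-NA values, with an NA placeholder in front for leading NAs.
locf <- function(v) {
  seen <- cumsum(!is.na(v))
  c(NA, v[!is.na(v)])[seen + 1]
}

# apply() over rows returns one column per input row, so transpose back.
out <- df
out[] <- t(apply(as.matrix(df), 1, locf))
out
```

Assigning into out[] keeps the data frame class and the column names, which is the same job replace(df, TRUE, ...) does in the answer.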

Related

Coerce specific column to "double" within a dataframe list

Let's say I have a list of dataframes
myList <- list(df1 = data.frame(A = as.character(sample(10)), B =
rep(1:2, 10)), df2 = data.frame(A = as.character(sample(10)), B = rep(1:2, 10)) )
I want to coerce column A in each dataframe to double.
I'm trying:
myList = sapply(myList,simplify = FALSE, function(x){
x$A <- as.double(x$A) })
But this returns only the coerced values, not the data frames with their column names.
I also tried with dplyr and mutate_if, but with no success.
We can use lapply with transform in base R
myList2 <- lapply(myList, transform, A = as.double(A))
Or use map with mutate from tidyverse
library(dplyr)
library(purrr)
myList2 <- map(myList, ~ .x %>%
mutate(A = as.double(A)))
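The issue in the OP's code is that the function does not return the data frame x. Its last expression is the assignment, so only the coerced vector comes back. Returning x explicitly fixes it: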
myList2 <- sapply(myList, simplify = FALSE,
function(x){
x$A <- as.double(x$A)
x
})
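The return-value behaviour can be seen in isolation on a single data frame. The sketch below uses made-up data and two hypothetical function names (coerce_only, coerce_and_return) purely for illustration.

```r
# One data frame shaped like the elements of myList.
x <- data.frame(A = c("3", "1", "2"), B = 1:3, stringsAsFactors = FALSE)

# The last expression is an assignment, so the function returns the
# assigned value (the numeric vector), not the modified data frame.
coerce_only <- function(x) {
  x$A <- as.double(x$A)
}

# Ending with x returns the whole data frame.
coerce_and_return <- function(x) {
  x$A <- as.double(x$A)
  x
}

res1 <- coerce_only(x)
res2 <- coerce_and_return(x)
is.data.frame(res1)  # FALSE: just the coerced column
is.data.frame(res2)  # TRUE
```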

Conditional dataframe slicing

I would like to remove the rows of this dataframe in which the pattern ,2) appears in only one of the columns.
As an example: in this dataframe, each column is a character class (representing a vector in each position):
A c(0,1) c(1,1)
B c(0,2) c(0,1)
C c(1,1) c(0,1)
D c(1,2) c(0,2)
I would like to subset it, removing row B, as the pattern is present in one of the columns but not in the other.
I tried to use grep, but I don't know how to specify the conditional statement.
How can I achieve this?
For a single column we would do this (calling your data d)
d[!grepl(",2)", d$column_name, fixed = TRUE), ]
But we need to check all the columns and find rows that have exactly one match. For this, we'll convert to matrix and use rowSums to count the matches by row:
n_occurrences = rowSums(matrix(grepl(",2)", as.matrix(d), fixed = TRUE), nrow = nrow(d)))
d[n_occurrences != 1, ]
# V1 V2 V3
# 1 A c(0,1) c(1,1)
# 3 C c(1,1) c(0,1)
# 4 D c(1,2) c(0,2)
Using this sample data:
d = read.table(text = 'A c(0,1) c(1,1)
B c(0,2) c(0,1)
C c(1,1) c(0,1)
D c(1,2) c(0,2)')
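The counting step may be easier to follow when the matches are built one column at a time. The following is a sketch of the same idea, assuming the first column (V1) only holds row labels and can be skipped; the variable names hits and kept are made up for the example.

```r
d <- read.table(text = "A c(0,1) c(1,1)
B c(0,2) c(0,1)
C c(1,1) c(0,1)
D c(1,2) c(0,2)", stringsAsFactors = FALSE)

# One logical column per data column; V1 holds the labels, so drop it.
hits <- sapply(d[-1], function(col) grepl(",2)", col, fixed = TRUE))

# Number of columns that contain the pattern, per row.
n_occurrences <- rowSums(hits)

# Keep rows with zero matches or with a match in every column.
kept <- d[n_occurrences != 1, ]
kept$V1
```

With a single-row data frame sapply would return a plain vector instead of a matrix, so this form is meant for data with at least two rows.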
Not as elegant as the selected answer above, but you can also split into two variables at the blank space and then create separate indices.
library(dplyr)
df = data.frame(v1=c('c(0,1) c(1,1)','c(0,2) c(0,1)',
'c(1,1) c(0,1)','c(1,2) c(0,2)'))
empty_omit <- function(vec) vec[vec!='']
get_even <- function(vec) vec[seq_along(vec) %% 2 == 0]
get_odd <- function(vec) vec[seq_along(vec) %% 2 ==1]
df$v2 = strsplit(df$v1, ' ') %>% unlist() %>% empty_omit %>% get_odd()
df$v3 = strsplit(df$v1, ' ') %>% unlist() %>% empty_omit %>% get_even()
idx_v2 = grepl(",2)", df$v2, fixed = TRUE)
idx_v3 = grepl(",2)", df$v3, fixed = TRUE)
# keep rows where the pattern is in both columns or in neither
df[idx_v2 == idx_v3, ]

Paste leading zero in columns A and B if column A meets condition

Data:
A B
"2058600192", "2058644"
"4087600101", "4087601"
"30138182591","30138011"
I am trying to add one leading 0 to columns A and B if column A is 10 characters.
This is what I have written so far:
for (i in 1:nrow(data)) {
if (nchar(data$A[i]) == 10) {
data$A[i] <- paste0(0, data$A)
data$B[i] <- paste0(0, data$B)
}
}
But I'm getting the following warning:
number of items to replace is not a multiple of replacement length
I've also tried using a dplyr solution, but I'm not sure how to mutate two columns based on one column. Any insight would be appreciated.
Your solution was already pretty good. You just made some very small mistakes. This code would give the correct output:
data <- data.frame(A = c("2058600192","4087600101","30138182591"), B = c("2058644","4087601","30138011"))
for (i in 1:nrow(data)) {
if (nchar(data$A[i]) == 10) {
data$A[i] <- paste0(0, data$A[i])
data$B[i] <- paste0(0, data$B[i])
}
}
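The only difference is data$A[i] <- paste0(0, data$A[i]) instead of data$A[i] <- paste0(0, data$A). Without the [i], paste0 returns one value per row of the column, and R tries to store all of them in the single element data$A[i], which triggers the warning.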
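The length mismatch behind the warning can be checked directly, and because nchar and paste0 are vectorised the loop can also be dropped. This is a sketch on the question's data; needs_zero is a name chosen for the example.

```r
data <- data.frame(A = c("2058600192", "4087600101", "30138182591"),
                   B = c("2058644", "4087601", "30138011"),
                   stringsAsFactors = FALSE)

# paste0() is vectorised: given the whole column it returns one string per row.
length(paste0(0, data$A))     # 3 values, but data$A[i] has room for only 1
length(paste0(0, data$A[1]))  # 1 value, matching the single slot

# Loop-free version: compute the condition once from the original A,
# then pad both columns at those positions.
needs_zero <- nchar(data$A) == 10
data$A[needs_zero] <- paste0("0", data$A[needs_zero])
data$B[needs_zero] <- paste0("0", data$B[needs_zero])
data
```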
You can get the index where the number of characters is equal to 10 and replace those values using lapply for multiple columns.
inds <- nchar(df$A) == 10
df[] <- lapply(df, function(x) replace(x, inds, paste0('0', x[inds])))
#If you want to replace only specific columns
#df[c('A', 'B')] <- lapply(df[c('A', 'B')], function(x)
# replace(x, inds, paste0('0', x[inds])))
df
# A B
#1 02058600192 02058644
#2 04087600101 04087601
#3 30138182591 30138011
data
df <- structure(list(A = c(2058600192, 4087600101, 30138182591), B = c(2058644L,
4087601L, 30138011L)), class = "data.frame", row.names = c(NA, -3L))
Just in case you were interested in using dplyr here's another solution using transmute.
df %>%
# Need to transmute B first, so that nchar is evaluated on the original A column and not on the one with leading zeros
transmute(B = ifelse(nchar(A) == 10, paste0(0, B), B),
A = ifelse(nchar(A) == 10, paste0(0, A), A)) %>%
# Just change the order of the columns to the original one
select(A,B)
Another way you can try
library(dplyr)
library(stringr)
df %>%
mutate(A = ifelse(str_length(A) == 10, str_pad(A, width = 11, side = "left", pad = 0), A),
B = ifelse(grepl("^0", A), paste0("0", B), B))
# A B
# 1 02058600192 02058644
# 2 04087600101 04087601
# 3 30138182591 30138011
str_length detects the length of the string, and str_pad adds the leading zero.
grepl then detects the strings in column A that start with a zero, so that a leading zero is added to column B for the same rows.
You may use the vectorized ifelse function here. The condition is computed once from the original column A, because it decides the padding for both columns:
pad <- nchar(data$A) == 10
data$A <- ifelse(pad, paste0("0", data$A), data$A)
data$B <- ifelse(pad, paste0("0", data$B), data$B)
data
            A        B
1 02058600192 02058644
2 04087600101 04087601
3 30138182591 30138011

R add columns based on modifying other columns of dataframes within a list

I would like to add a new column D, containing the first part of column B, to each data.frame in a list. But I'm not sure how to address individual columns of data frames inside a list.
create some data
df1 <- data.frame(A = "hey", B = "wass.7", C = "up")
df2 <- data.frame(A = "how", B = "are.1", C = "you")
dfList <- list(df1,df2)
desired output:
# a new column removing the last part of column B
[[1]]
A B C D
1 hey wass.7 up wass
[[2]]
A B C D
1 how are.1 you are
for each data frame I did this, which worked
df1$D<-sub('\\..*', '', df1$B)
in a function I tried this, which is probably not correctly addressing the columns and returns "unexpected symbol..."
dfList <- lapply(rapply(dfList, function(x)
x$D<-sub('\\..*', '', x$B) how = "list"),
as.data.frame)
the lapply(rapply) part is copied from Using gsub in list of dataframes with R
Check this out
lapply(dfList, function(x){
x$D <-sub('\\..*', '', x$B);
x
})
[[1]]
A B C D
1 hey wass.7 up wass
[[2]]
A B C D
1 how are.1 you are
The rapply solution does work. However, you needed a comma before the how argument to resolve the error. Additionally, you will NOT be able to add a new column this way, only replace existing ones: rapply is recursive, so it runs sub on every element of the nested list, that is, across ALL columns of ALL dataframes.
Otherwise use a simple lapply as in @JilberUrbina's answer.
df1 <- data.frame(A = "hey", B = "wass.7", C = "up", stringsAsFactors = F)
df2 <- data.frame(A = "how", B = "are.1", C = "you", stringsAsFactors = F)
dfList <- list(df1,df2)
dfList <- lapply(rapply(dfList, function(x)
sub('\\..*', '', x), how = "list"),
as.data.frame)
dfList
# [[1]]
# A B C
# 1 hey wass up
# [[2]]
# A B C
# 1 how are you
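For the original goal of adding D while keeping B untouched, lapply can also be combined with transform, which returns the whole data frame so nothing has to be returned by hand. This is a sketch under the same sample data; dfList2 is a name chosen for the example.

```r
df1 <- data.frame(A = "hey", B = "wass.7", C = "up", stringsAsFactors = FALSE)
df2 <- data.frame(A = "how", B = "are.1", C = "you", stringsAsFactors = FALSE)
dfList <- list(df1, df2)

# transform() evaluates the expression for D inside each data frame,
# so B refers to that data frame's own column.
dfList2 <- lapply(dfList, transform, D = sub("\\..*", "", B))
dfList2
```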

Get the longest element of a list

Suppose you have a list of data.frames like
dfs <- list(
a = data.frame(x = c(1:4, 7:10), a = runif(8)),
b = data.frame(x = 1:10, b = runif(10)),
c = data.frame(x = 1:10, c = runif(10))
)
I would now like to extract the longest data.frame or data.frames in this list. How?
I am stuck at this point:
library(plyr)
lengths <- lapply(dfs, nrow)
longest <- max(lengths)
There are two built-in functions in R that could solve your question in my opinion:
which.max: returns the index of the first element of your list that is equal to the max
> which.max(lengths)
[1] 2
which: returns all indexes that are TRUE
Here:
> which(lengths==longest)
[1] 2 3
Then you can subset your list to the desired element:
dfs[which(lengths==longest)]
will return b and c in your example.
cnt <- sapply(dfs, nrow)
dfs[cnt == max(cnt)]
Or if you only need the first occurrence of the maximum length:
dfs[which.max(cnt)]
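Both variants rely on the row counts being a plain vector rather than a list, since max() does not accept a list such as the one lapply returns. A self-contained sketch with fixed row counts (made-up data, using vapply so the result type is explicit):

```r
dfs <- list(
  a = data.frame(x = 1:8),
  b = data.frame(x = 1:10),
  c = data.frame(x = 1:10)
)

# vapply() returns a named integer vector, so max() and == work directly.
cnt <- vapply(dfs, nrow, integer(1))

all_longest <- names(dfs[cnt == max(cnt)])   # every data frame of maximal length
first_longest <- names(dfs[which.max(cnt)])  # only the first of them
all_longest
first_longest
```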
