fill missing values with value from previous column - r

I have a data.frame with some columns that contain missing values, and I want the missing values to be filled in with data from the previous column. For example:
country <- c('a','b','c')
yr01 <- c(15,16,7)
yr02 <- c(NA,18,NA)
yr03 <- c(20,22,NA)
yr04 <- c(15,18,19)
tab <- data.frame(country,yr01,yr02,yr03,yr04)
tab
country yr01 yr02 yr03 yr04
1 a 15 NA 20 15
2 b 16 18 22 18
3 c 7 NA NA 19
How can I replace each NA with the previous value in the same row? For example, for country a, column yr02 should equal 15, and for country c, columns yr02 and yr03 should be 7. Thanks!

It's usually easier to work with columns, but we can apply the standard answer from the R-FAQ "Replace NAs with latest non-NA value" to each row.
tab[-1] = t(apply(tab[-1], 1, zoo::na.locf))
tab
# country yr01 yr02 yr03 yr04
# 1 a 15 15 20 15
# 2 b 16 18 22 18
# 3 c 7 7 7 19
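If you would rather avoid the zoo dependency, the same row-wise fill can be sketched in base R. Note that zoo::na.locf drops leading NAs by default (na.rm = TRUE), so a row that starts with NA would come back shorter than the others. The helper name fill_forward below is hypothetical (it is not part of any package), and it keeps leading NAs as NA.

```r
# Hypothetical helper: carry the last non-NA value forward within one vector.
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))         # position of the latest non-NA seen so far (0 if none yet)
  c(NA, x[!is.na(x)])[idx + 1]     # idx 0 maps to the leading NA placeholder
}

tab <- data.frame(country = c("a", "b", "c"),
                  yr01 = c(15, 16, 7),
                  yr02 = c(NA, 18, NA),
                  yr03 = c(20, 22, NA),
                  yr04 = c(15, 18, 19))

# apply() returns one column per input row, so transpose back before assigning
tab[-1] <- t(apply(tab[-1], 1, fill_forward))
tab
```

Because the helper works on a plain vector, it can also be used column-wise with lapply() if the data is ever reshaped.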

Related

Keep the row if the specific column is the minimum value of that row

I cannot share the dataset but I will explain it as best as I can.
The dataset has 50 columns, 48 of which are in Y/m/d h:m:s format. The data also contains many NAs, but they must not be removed.
Let's say there is a column B. I want to remove the rows where the value of B is not the earliest in that row.
How can I do this in R? For example, the original would be like this:
df <- data.frame(
A = c(11,19,17,6,13),
B = c(18,9,5,16,12),
C = c(14,15,8,87,16))
A B C
1 11 18 14
2 19 9 15
3 17 5 8
4 6 16 87
5 13 12 16
but I want this:
A B C
1 19 9 15
2 17 5 8
3 13 12 16
You could use apply() to find the minimum for each row.
df |> subset(B == apply(df, 1, min, na.rm = TRUE))
# A B C
# 2 19 9 15
# 3 17 5 8
# 5 13 12 16
The tidyverse equivalent is
library(tidyverse)
df %>% filter(B == pmap_dbl(across(A:C), min, na.rm = TRUE))
If you are willing to use data.table, you could do the following for the example.
library(data.table)
setDT(df)
df[(B < A & B < C)]
A B C
1: 19 9 15
2: 17 5 8
3: 13 12 16
More generally, you could do
df <- as.data.table(df)
df[, min := do.call(pmin, .SD)][B == min, !"min"]
.SDcols in the first [ lets you control which columns the minimum is taken over, e.g. if you want to exclude some. I am not very familiar with the inner workings of data.table, but I believe that creating this new column is probably memory-efficient.
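Since the question mentions that the real data contains many NAs, here is a minimal base R sketch of the same row-minimum filter that ignores NAs. The NA values in this example are made up for illustration, and which() is used so that a comparison that evaluates to NA does not produce an all-NA row in the result.

```r
# Example data with NAs added for illustration (not from the original question)
df <- data.frame(A = c(11, 19, 17, NA, 13),
                 B = c(18,  9,  5, 16, 12),
                 C = c(14, 15, NA, 87, 16))

# Row-wise minimum across all columns, ignoring NAs
row_min <- do.call(pmin, c(as.list(df), list(na.rm = TRUE)))

# Keep rows where B is the row minimum; which() drops NA comparisons
kept <- df[which(df$B == row_min), ]
kept
```

pmin() also works on date-time vectors of the same class, so the same pattern should carry over to POSIXct columns, provided all columns passed in share that class.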

Binding rows from list with meaningful duplicates in R [duplicate]

I need to merge different data frames from a list by row while keeping some information contained in the duplicate rows. Each row contains daily observations of some variables (stock prices), and each data frame covers a different time span (years). From one data frame to the next, some variables may change (the columns are the stocks inside the index). bind_rows from dplyr does a great job of simply adding columns for the new variables and leaving NAs elsewhere.
The problem is that some of the data frames contain the last day of the previous period (which has therefore already been bound from the previous data frame), but they differ slightly in the variables shown (columns). I don't want to drop one of the duplicate rows entirely, because both contain information I need; I would rather merge them. The duplicate rows contain either the same value (because they refer to the same day) or one NA and one value (because they refer to different variables in the set). How can I do this?
The problem can be summarized in the following example:
library(dplyr)
df_1 <- data.frame(Date=c(1:4),A=c(20,30,20,30),B=c(15,16,15,16))
df_2 <- data.frame(Date=c(4:7),A=c(30,35,60,40),C=c(15,18,25,20))
dfs<-list(df_1,df_2)
bind_rows(dfs)
Outcome:
Date A B C
1 1 20 15 NA
2 2 30 16 NA
3 3 20 15 NA
4 4 30 16 NA
5 4 30 NA 15
6 5 35 NA 18
7 6 60 NA 25
8 7 40 NA 20
Desired outcome:
Date A B C
1 1 20 15 NA
2 2 30 16 NA
3 3 20 15 NA
4 4 30 16 15
5 5 35 NA 18
6 6 60 NA 25
7 7 40 NA 20
Instead of binding rows you can do a full join by Date and A column.
library(dplyr)
full_join(df_1, df_2, by = c('Date', 'A'))
#Thanks to @duckmayr for the suggestion.
#  Date  A  B  C
#1    1 20 15 NA
#2    2 30 16 NA
#3    3 20 15 NA
#4    4 30 16 15
#5    5 35 NA 18
#6    6 60 NA 25
#7    7 40 NA 20
In base R, this can be done as:
merge(df_1, df_2, by = c('Date', 'A'), all = TRUE)
If the data frames are in a list, we can use purrr::reduce
purrr::reduce(dfs, full_join, by = c('Date', 'A'))
Or
Reduce(function(x, y) merge(x, y, by = c('Date', 'A'), all = TRUE), dfs)
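As a sketch of how the Reduce approach extends to more than two data frames, the example below adds a third, hypothetical data frame df_3 (not part of the original question) that overlaps df_2 on its last day. The merge function must use its own arguments x and y so that each step combines the running result with the next list element. This assumes that A agrees wherever dates overlap; if it did not, joining on both Date and A would keep two rows for that date.

```r
df_1 <- data.frame(Date = 1:4, A = c(20, 30, 20, 30), B = c(15, 16, 15, 16))
df_2 <- data.frame(Date = 4:7, A = c(30, 35, 60, 40), C = c(15, 18, 25, 20))
# Hypothetical third period, overlapping df_2 on Date 7
df_3 <- data.frame(Date = 7:8, A = c(40, 45), D = c(1, 2))

dfs <- list(df_1, df_2, df_3)

# Full outer join, folded over the list: x is the result so far, y the next data frame
merged <- Reduce(function(x, y) merge(x, y, by = c("Date", "A"), all = TRUE), dfs)
merged
```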

R data table: modify column values by referencing other columns by name

I have a melted data table with a column containing values that refer to other column names within the same table. I want to replace each row within that same column with the row value of the referenced column.
library("data.table")
## Example input data table
DT_input <- data.table(A=c(1:10),
B=c(11:20),
C=c(21:30),
replace=c(rep("A", 5), rep("B", 3), rep("C", 2)))
## Desired output data table
DT_output <- data.table(A=c(1:10),
B=c(11:20),
C=c(21:30),
replace=c(1:5, 16:18, 29:30))
My old approach shown here is very slow because of the for loop:
## Attempted looping solution
for (kRow in seq_len(nrow(DT_input))) {
e <- parse(text = DT_input[kRow, Variable])
DT_input[kRow, Variable := eval(e)]
}
If we need a vectorized approach, we can use row/column (matrix) indexing from base R, after converting the data.table to a data.frame:
df1 <- as.data.frame(DT_input)
# one (row, column) pair per row: the row number and the position of the referenced column
i1 <- cbind(seq_len(nrow(df1)), match(df1$replace, names(df1)[-4]))
df1$replace <- df1[-4][i1]
df1$replace
#[1] 1 2 3 4 5 16 17 18 29 30
With data.table, an option is Map or a for loop without the eval, but it would still not be vectorized.
An option using data.table:
DT_input[, rn := .I]
DT_input[, replace :=
DT_input[, DT_input[.SD, on=c("rn", .BY$replace), get(.BY$replace)], .(replace)]$V1
]
output:
A B C replace
1: 1 11 21 1
2: 2 12 22 2
3: 3 13 23 3
4: 4 14 24 4
5: 5 15 25 5
6: 6 16 26 16
7: 7 17 27 17
8: 8 18 28 18
9: 9 19 29 29
10: 10 20 30 30
It will be slower than akrun's base R method.
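To see why the two-column index in the base R answer works, here is a small standalone sketch on a plain matrix. When a matrix is subset with a two-column numeric matrix, each row of the index is read as one (row, column) pair, so a single subsetting call picks one cell per row. The names m, target, idx and picked are only for illustration.

```r
# Same values as the A, B, C columns in the example
m <- matrix(1:30, nrow = 10, dimnames = list(NULL, c("A", "B", "C")))
target <- c(rep("A", 5), rep("B", 3), rep("C", 2))

# Column 1: row number; column 2: position of the referenced column
idx <- cbind(seq_len(nrow(m)), match(target, colnames(m)))

picked <- m[idx]   # one element per (row, column) pair
picked
```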

R Plot values that select column based on a provided index value

I'm trying to figure out how to plot some values in a peculiar way. Say I have the example data below:
set.seed(100)
test.df <- as.data.frame(matrix(1:36,nrow=6))
test.df$V7 <- sample(1:6,6)
test.df$V8 <- seq(1:6)
colnames(test.df) <- c("col1","col2","col3","col4","col5","col6","index","id")
test.df
col1 col2 col3 col4 col5 col6 index id
1 1 7 13 19 25 31 2 1
2 2 8 14 20 26 32 6 2
3 3 9 15 21 27 33 3 3
4 4 10 16 22 28 34 1 4
5 5 11 17 23 29 35 4 5
6 6 12 18 24 30 36 5 6
I want to plot values from the first 6 columns, using the "index" column to select which column (1-6) to take each value from. This would be the y axis. The x axis would be "id". For example, the first y value would be 7 because index selects column 2 for the first row. The second y value would be 32 because the index value indicates column 6.
Please let me know if I can clarify anything else. I'm fairly new to plotting in R (ggplot2 or otherwise), so any help is much appreciated.
This is not a problem of ggplot2.
First, you can create a column "y":
test.df[, "y"] <- 0
for (i in (1:nrow(test.df))) {
test.df[i, "y"] <- test.df[i, paste0("col", test.df[i, "index"])]
}
Then you can do the plotting, with plot:
plot(y ~ id, data = test.df, type = "l")
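The loop can also be replaced by a single vectorized lookup. The sketch below assumes the index values shown in the printed example (they are typed in directly rather than drawn with sample(), since the draw depends on the R version) and uses matrix indexing with one (row, column) pair per row.

```r
test.df <- as.data.frame(matrix(1:36, nrow = 6))
names(test.df) <- paste0("col", 1:6)
test.df$index <- c(2, 6, 3, 1, 4, 5)   # copied from the printed example above
test.df$id <- 1:6

# Pull the six value columns into a matrix, then pick one cell per row
vals <- as.matrix(test.df[paste0("col", 1:6)])
test.df$y <- vals[cbind(seq_len(nrow(test.df)), test.df$index)]
test.df$y
```

The plot(y ~ id, data = test.df, type = "l") call from the answer works unchanged on the result.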

R - Subset rows of a data frame on a condition in all the columns

I want to subset the rows of a data frame on a single condition applied to all the columns, avoiding the use of subset.
I understand how to subset on a single column, but I cannot generalize this to all columns (without naming every column).
Initial data frame :
V1 V2 V3
1 1 8 15
2 2 0 16
3 3 10 17
4 4 11 18
5 5 0 19
6 0 13 20
7 7 14 21
In this example, I want to subset the rows without zeros.
Expected output :
V1 V2 V3
1 1 8 15
2 3 10 17
3 4 11 18
4 7 14 21
Thanks
# create your data
a <- c(1,2,3,4,5,0,7)
b <- c(8,0,10,11,0,13,14)
c <- c(15,16,17,18,19,20,21)
data <- cbind(a, b, c)
# keep rows whose minimum is above 0, i.e. rows without any 0
# (this relies on the data being non-negative)
data[apply(data, 1, min) > 0,]
A solution using the rowSums function after comparing each value to 0.
# creating your data
data <- data.frame(a = c(1,2,3,4,5,0,7),
b = c(8,0,10,11,0,13,14),
c = c(15,16,17,18,19,20,21))
# Selecting rows containing no 0.
data[which(rowSums(as.matrix(data)==0) == 0),]
Another way
data[-unique(row(data)[grep("^0$", unlist(data))]),]
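The answers above differ in what they assume about the data. As a small sketch with made-up values, the example below shows that the min-based filter also drops rows containing negative numbers, while the rowSums-based filter removes only rows that actually contain a 0.

```r
# Made-up data with one negative value, for illustration only
data <- data.frame(a = c(1, -2, 3, 0),
                   b = c(8,  5, 0, 4))

# min-based filter: row 2 is dropped as well, because its minimum is negative
keep_min <- data[apply(data, 1, min) > 0, ]

# rowSums-based filter: counts zeros per row and keeps rows with none
keep_zero <- data[rowSums(data == 0) == 0, ]

keep_min
keep_zero
```

If the data can contain NAs, rowSums(data == 0, na.rm = TRUE) is one way to keep the comparison from turning whole rows into NA.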
