I have two datasets with the same dimensions and the same column names. The goal is to check whether NA values exist in one of the datasets and replace them with the corresponding values from the other dataset, as shown in the example below.
I tried to solve the problem with nested for loops, but that did not work.
df is a new data frame filled with NAs
loop = for (a in 1:nrow(data1)) {
for (b in 1:ncol(data1)) {
for (c in 1:nrow(data2)) {
for (d in 1:ncol(data2)) {
for (x in 1:nrow(df)) {
for (y in 1:ncol(df)) {
df[x,y]<- ifelse(data1[a,b] != "NA", data1[a,b], data2[c,d])
return(df)
}
}
}
}
}
}
Example
# The first data frame
structure(list(age = c(23, 22, 21, 20), gender = c("M", "F",
NA, "F")), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
# age gender
# 1 23 M
# 2 22 F
# 3 21 NA
# 4 20 F
# The second data frame
structure(list(age = c(23, 22, 21, 20), gender = c("M", "F",
"M", "F")), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
# age gender
# 1 23 M
# 2 22 F
# 3 21 M
# 4 20 F
Desired output
Age Gender
23 M
22 F
21 M
20 F
You might try this:
library(tibble)

df1 <- tibble(age = c(23,22,21,20),
gender = c("M", "F", NA, "F"))
# -------------------------------------------------------------------------
#> df1
# # A tibble: 4 x 2
# age gender
# <dbl> <chr>
# 1 23 M
# 2 22 F
# 3 21 NA
# 4 20 F
# -------------------------------------------------------------------------
df2 <- tibble(age = c(23,22,21,20),
gender = c("M", "F", "M", "F"))
# -------------------------------------------------------------------------
#> df2
# # A tibble: 4 x 2
# age gender
# <dbl> <chr>
# 1 23 M
# 2 22 F
# 3 21 M
# 4 20 F
# -------------------------------------------------------------------------
# find which entries of df1$gender are NA
df1.na <- is.na(df1$gender)
#> df1.na
# [1] FALSE FALSE TRUE FALSE
# -------------------------------------------------------------------------
# use the values in df2 to replace the NAs in df1 (note that this is position-based, so the rows of both data frames must be in the same order)
df1$gender[df1.na] <- df2$gender[df1.na]
df1
# -------------------------------------------------------------------------
#> df1
# A tibble: 4 x 2
# age gender
# <dbl> <chr>
# 1 23 M
# 2 22 F
# 3 21 M
# 4 20 F
# -------------------------------------------------------------------------
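The same position-based replacement can be applied to every column the two data frames share, not just gender. The following is a minimal base R sketch, assuming both data frames have their rows in the same order; the names filled, col and missing_rows are illustrative and are not part of the answer above.

```r
# Example data from the question, as plain data frames
df1 <- data.frame(age = c(23, 22, 21, 20),
                  gender = c("M", "F", NA, "F"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(age = c(23, 22, 21, 20),
                  gender = c("M", "F", "M", "F"),
                  stringsAsFactors = FALSE)

# Work on a copy so df1 itself is left untouched
filled <- df1

# For each shared column, fill the NA positions from the same positions in df2
for (col in intersect(names(df1), names(df2))) {
  missing_rows <- is.na(filled[[col]])
  filled[[col]][missing_rows] <- df2[[col]][missing_rows]
}

filled
```

Because each column is handled separately, numeric columns stay numeric and character columns stay character.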
This can be done using the natural_join function from the rqdatatable library. The function does require an index to merge on, so we will need to create one.
Creating a reproducible example will help other people help you. Here I've created two simple data frames that should cover most cases for your problem.
# Create example data
tbl1 <-
data.frame(
w = c(1, 2, 3, 4),
x = c(1, 2, 3, NA),
y = c(1, 2, 3, 4),
z = c(1, NA, NA, NA)
)
tbl2 <-
data.frame(
w = c(9, 9, 9, 9), # check a value doesn't overwrite a value
x = c(1, 2, 3, 4), # check NA gets filled in
y = c(1, 2, 3, NA), # check NA doesn't overwrite a value
z = c(9, NA, NA, NA) # check NA in both stays NA
)
# Create join index
tbl1$indx <- 1:nrow(tbl1)
tbl2$indx <- 1:nrow(tbl2)
# Use natural_join
library("rqdatatable")
natural_join(tbl1, tbl2, by = "indx")
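If rqdatatable is not available, the same index-based fill can be approximated in base R. This is only a sketch of the idea (merge on the index, then prefer the first table's value unless it is NA), not a reimplementation of natural_join; the names merged, value_cols and result are made up for illustration.

```r
# Same example tables as above, including the join index
tbl1 <- data.frame(w = c(1, 2, 3, 4), x = c(1, 2, 3, NA),
                   y = c(1, 2, 3, 4), z = c(1, NA, NA, NA))
tbl2 <- data.frame(w = c(9, 9, 9, 9), x = c(1, 2, 3, 4),
                   y = c(1, 2, 3, NA), z = c(9, NA, NA, NA))
tbl1$indx <- 1:nrow(tbl1)
tbl2$indx <- 1:nrow(tbl2)

# Align rows on the index; clashing column names get the suffixes .a and .b
merged <- merge(tbl1, tbl2, by = "indx", suffixes = c(".a", ".b"))

# For every non-index column, keep tbl1's value unless it is NA
value_cols <- setdiff(names(tbl1), "indx")
result <- data.frame(indx = merged$indx)
for (col in value_cols) {
  a <- merged[[paste0(col, ".a")]]
  b <- merged[[paste0(col, ".b")]]
  result[[col]] <- ifelse(is.na(a), b, a)
}

result
```

Merging on an explicit index rather than relying on row position means the two tables do not have to be sorted the same way.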
This question has been asked a couple of times, but I have yet to find a satisfactory answer that works.
I have a dataframe:
grouping1 <- rep(c('a','b'),times=47350)
grouping2 <- rep(c('A','B', 'C', 'D', 'E'), times=18940)
observations <- rep(c(14, 16, 12, 11, 15, 15,15,18,20,34,12), times=9470)
my_data <- as.data.frame(cbind(grouping1,grouping2,observations))
I would like to group over my grouping variables to pass a different value to 'times' in rep() for each group:
new_data <- my_data %>%
group_by(grouping1,grouping2,grouping3) %>%
mutate(sim_count = rep(1:100, times=observations, each=1))
But the 'times' argument is invalid, no matter whether I pipe in a list of values from 'observations', iterate over 'observations' from the data frame, iterate through observations in a for loop, etc. I think there must be an easy fix, but I'm not seeing it. Thank you in advance.
EDIT: Thanks to everyone for their patience; they helped me better envision the data structure and how I could better explain the problem. Here's the solution I came up with:
new_data <- my_data %>%
distinct(grouping1,grouping2,.keep_all=T) %>%
rowwise() %>%
mutate(sim_count = list(rep(1:100,times=observations,each=1))) %>%
unnest_longer(sim_count) %>%
arrange(sim_count)
We can make a list-column and then tidyr::unnest it:
my_data %>%
group_by(grouping1, grouping2, grouping3) %>%
mutate(sim_count = lapply(observations, function(obs) rep(1:100, times = obs, each = 1))) %>%
ungroup() %>%
tidyr::unnest(sim_count)
# # A tibble: 8,300 x 5
# grouping1 grouping2 grouping3 observations sim_count
# <chr> <chr> <chr> <dbl> <int>
# 1 a A 1 14 1
# 2 a A 1 14 2
# 3 a A 1 14 3
# 4 a A 1 14 4
# 5 a A 1 14 5
# 6 a A 1 14 6
# 7 a A 1 14 7
# 8 a A 1 14 8
# 9 a A 1 14 9
# 10 a A 1 14 10
# # ... with 8,290 more rows
Data
my_data <- structure(list(grouping1 = c("a", "a", "a", "b", "b", "b"), grouping2 = c("A", "A", "B", "B", "C", "C"), grouping3 = c("1", "2", "3", "4", "5", "6"), observations = c(14, 16, 12, 11, 15, 15)), class = "data.frame", row.names = c(NA, -6L))
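The likely reason the original mutate() call fails is that rep() expects times to be either a single number or a vector with one entry per element of x, whereas inside a grouped mutate() the observations column is a vector whose length is the group size. Below is a base R sketch of the list-then-flatten idea from the answer, using the sample data above; sim_list, row_index and expanded are illustrative names.

```r
my_data <- data.frame(
  grouping1 = c("a", "a", "a", "b", "b", "b"),
  grouping2 = c("A", "A", "B", "B", "C", "C"),
  grouping3 = c("1", "2", "3", "4", "5", "6"),
  observations = c(14, 16, 12, 11, 15, 15),
  stringsAsFactors = FALSE
)

# One call to rep() per row, so 'times' is always a single number
sim_list <- lapply(my_data$observations, function(obs) rep(1:100, times = obs))

# Repeat each original row once per simulated value, then attach the values
row_index <- rep(seq_len(nrow(my_data)), times = lengths(sim_list))
expanded <- my_data[row_index, ]
expanded$sim_count <- unlist(sim_list)
rownames(expanded) <- NULL

nrow(expanded)
```

With these six rows the result has 100 times the sum of observations rows, which matches the 8,300 rows reported in the tibble output above.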
Maybe we can try the following data.table option
library(data.table)

setDT(my_data)[
,
.(observations,
sim_count = rep(1:100, times = observations, each = 1)
), grouping1:grouping3
]
I have a dataset (the real one has more than 100 groups),
and I want to use dplyr to create a variable y for each group, setting the first value of y in each group to 1 and computing each later value from the previous row:
Second y = 1 * first x + 2 * first y
The result would be:
I tried creating a column y with every value set to 1 and then using
df%>% group_by(group)%>% mutate(var=shift(x)+2*shift(y))%>% ungroup()
but then the formula always uses the initial value of y (1) rather than the previously computed value:
Second y = 1* first x + 2*1
Could someone give me some ideas about this? Thank you!
The dput of my result data is:
structure(list(group = c("a", "a", "a", "a", "a", "b", "b", "b" ), x =
c(1, 2, 3, 4, 5, 6, 7, 8), y = c(1, 3, 8, 19, 42, 1, 8, 23)),
row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame" ))
To perform such a calculation we can use accumulate from purrr or Reduce in base R.
Since you are already using dplyr, we can use accumulate:
library(dplyr)
df %>%
group_by(group) %>%
mutate(y1 = purrr::accumulate(x[-n()], ~.x * 2 + .y, .init = 1))
# group x y y1
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 1
#2 a 2 3 3
#3 a 3 8 8
#4 a 4 19 19
#5 a 5 42 42
#6 b 6 1 1
#7 b 7 8 8
#8 b 8 23 23
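For the base R route mentioned above, a sketch with Reduce() and ave() could look like the following. It assumes the same recurrence (next y = 2 * previous y + previous x, starting at 1); running_y is a hypothetical helper name.

```r
df <- data.frame(
  group = c("a", "a", "a", "a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  stringsAsFactors = FALSE
)

# Running value within one group: start at 1, then fold in every x except the last
running_y <- function(x) {
  Reduce(function(prev_y, prev_x) 2 * prev_y + prev_x,
         head(x, -1), init = 1, accumulate = TRUE)
}

# ave() applies the helper within each group and keeps the original row order
df$y1 <- ave(df$x, df$group, FUN = running_y)
df
```

Dropping the last x with head(x, -1) plays the same role as x[-n()] in the accumulate() call: the initial value already supplies one element, so the result has the same length as the group.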
I've looked at the discussion What is the difference between as.tibble(), as_data_frame(), and tbl_df()? to figure out why a replace_na function (shown below) works on data frames but not on tibbles. Could you help me understand why it doesn't work on tibbles? How can the function be modified so it can work for both data.frame and tibble?
Data
library(dplyr)
#dput(df1)
df1 <- structure(list(id = c(1, 2, 3, 4), gender = c("M", "F", NA, "F"
), grade = c("A", NA, NA, NA), age = c(2, NA, 2, NA)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
#dput(df2)
df2 <- structure(list(id = c(1, 2, 3, 4), gender = c("M", "F", "M",
"F"), grade = c("A", "A", "B", "NG"), age = c(22, 23, 21, 19)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
replace function
replace_na <- function(df_to, df_from) {
replace(df_to, is.na(df_to), df_from[is.na(df_to)])
}
Usage
replace_na(df1,df2)
Error: Must use a vector in [, not an object of class matrix.
Call rlang::last_error() to see a backtrace
Called from: abort(error_dim_column_index(j))
However, coercing both arguments to data frames produces the desired output, as shown below.
replace_na(as.data.frame(df1), as.data.frame(df2))
# id gender grade age
# 1 1 M A 2
# 2 2 F A 23
# 3 3 M B 2
# 4 4 F NG 19
Thank you.
is.na() returns a logical matrix for a data frame:
is.na(df1)
#> id gender grade age
#> [1,] FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE TRUE TRUE
#> [3,] FALSE TRUE TRUE FALSE
#> [4,] FALSE FALSE TRUE TRUE
The base data.frame class supports subsetting with a matrix; tbl_df is more strict, and does not.
as.data.frame(df2)[is.na(df1)]
#> [1] "M" "A" "B" "NG" "23" "19"
df2[is.na(df1)]
#> Must use a vector in `[`, not an object of class matrix.
To make your replace_na() function work with a tbl_df you need to do the operation separately for each column. For example, with recursion:
replace_na <- function(x, y) {
if (is.data.frame(x)) {
x[] <- Map(replace_na, x, y)
return(x)
}
replace(x, is.na(x), y[is.na(x)])
}
replace_na(df1, df2)
#> # A tibble: 4 x 4
#> id gender grade age
#> <dbl> <chr> <chr> <dbl>
#> 1 1 M A 2
#> 2 2 F A 23
#> 3 3 M B 2
#> 4 4 F NG 19
This method is also generally faster:
replace_na_vec <- function(x, y) {
replace(x, is.na(x), y[is.na(x)])
}
df1_10k <- do.call("rbind", replicate(10000, df1, simplify = FALSE))
df2_10k <- do.call("rbind", replicate(10000, df2, simplify = FALSE))
bench::mark(
check = FALSE,
new = replace_na(df1, df2),
old = replace_na_vec(as.data.frame(df1), as.data.frame(df2)),
new_10k = replace_na(df1_10k, df2_10k),
old_10k = replace_na_vec(as.data.frame(df1_10k), as.data.frame(df2_10k))
)
#> # A tibble: 4 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 new 74.01us 97.79us 7295. 0B 12.6
#> 2 old 269.97us 529.93us 1845. 81.02KB 8.23
#> 3 new_10k 1.82ms 2.75ms 338. 4.27MB 32.3
#> 4 old_10k 94.29ms 104.05ms 9.68 10.24MB 2.42
Created on 2019-09-12 by the reprex package (v0.3.0)
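Speed aside, the column-wise version has another practical advantage that is easy to check: extracting values from a mixed-type data frame with a logical matrix goes through a character matrix, so the replacement values arrive as strings. The sketch below uses plain data frames and two illustrative helper names (replace_matrix, replace_by_column) to compare the column types after each approach.

```r
df_to <- data.frame(id = c(1, 2, 3, 4),
                    gender = c("M", "F", NA, "F"),
                    age = c(2, NA, 2, NA),
                    stringsAsFactors = FALSE)
df_from <- data.frame(id = c(1, 2, 3, 4),
                      gender = c("M", "F", "M", "F"),
                      age = c(22, 23, 21, 19),
                      stringsAsFactors = FALSE)

# Matrix-based replacement, as in the question
replace_matrix <- function(df_to, df_from) {
  replace(df_to, is.na(df_to), df_from[is.na(df_to)])
}

# Column-by-column replacement, following the idea of the answer
replace_by_column <- function(x, y) {
  x[] <- Map(function(a, b) replace(a, is.na(a), b[is.na(a)]), x, y)
  x
}

by_matrix <- replace_matrix(df_to, df_from)
by_column <- replace_by_column(df_to, df_from)

# Compare the type of the numeric column after each approach
class(by_matrix$age)
class(by_column$age)
```

In this setup the matrix-based result should end up with age stored as character, while the column-wise result keeps it numeric.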
Sample data.frame:
structure(list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
Output:
df
# a b c
# 1 1 4 7
# 2 2 5 8
# 3 3 6 9
I'd like to get the first and third columns, but I want to subset by name and also by column index.
df[, "a"]
# [1] 1 2 3
df[, 3]
# [1] 7 8 9
df[, c("a", 3)]
# Error in `[.data.frame`(df, , c("a", 3)) : undefined columns selected
df[, c(match("a", names(df)), 3)]
# a c
# 1 1 7
# 2 2 8
# 3 3 9
Are there functions or packages that allow for clean/simple syntax, as in the third example, while also achieving the result of the fourth example?
Maybe use dplyr?
For interactive use - i.e., if you know ahead of time the name of the column you want to select
library(dplyr)
df %>% select(a, 3)
If you do not know the name of the column in advance, and want to pass it as a variable,
x <- names(df)[1]
x
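# [1] "a"
# select_() is deprecated; all_of() (dplyr >= 1.0.0) selects columns named in a character vector
df %>% select(all_of(x), 3)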
Either way the output is
# a c
#1 1 7
#2 2 8
#3 3 9
In base R you can combine subset with select.
df <- structure(list(a = c(1, 2, 3),
b = c(4, 5, 6), c = c(7, 8, 9)),
.Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
df <- subset(df, select = c(a, 3))
You can index names(df) without using dplyr:
df <- structure(list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
df[,c("a",names(df)[3]) ]
Output:
a c
1 1 7
2 2 8
3 3 9
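If this comes up often, the match() idea from the question can be wrapped in a small helper that accepts names and positions side by side. This is a hypothetical base R function (select_mixed is not part of any package), sketched under the assumption that each argument is a single column name or a single position.

```r
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# Accept any mix of column names and column positions
select_mixed <- function(data, ...) {
  spec <- list(...)
  idx <- vapply(spec, function(s) {
    # Names are translated to positions; numbers are used as they are
    if (is.character(s)) match(s, names(data)) else as.integer(s)
  }, integer(1))
  if (anyNA(idx)) stop("unknown column in selection")
  data[, idx, drop = FALSE]
}

select_mixed(df, "a", 3)
```

Passing the selectors as separate arguments avoids the coercion that makes c("a", 3) fail: c() turns the 3 into the string "3", which is then looked up as a column name.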
Given a data frame
df=data.frame(
E=c(1,1,2,1,3,2,2),
N=c(4,4,10,4,3,2,2)
)
I would like to create a third column that labels matching rows: whenever two or more rows have the same value in E and also the same value in N, they count as a match, and each distinct match gets its own letter. Rows without a match get NA.
dfx=data.frame(
E=c(1,1,2,1,3,2,2,3, 2),
N=c(4,4,10,4,3,2,2,6, 10),
matched=c("A", "A", "B","A", NA, "C", "C", NA, "B")
)
Thanks!
Here, df is:
df <- structure(list(E = c(1, 1, 2, 1, 3, 2, 2, 3, 2), N = c(4, 4,
10, 4, 3, 2, 2, 6, 10)), .Names = c("E", "N"), row.names = c(NA,
-9L), class = "data.frame")
You can do:
dfx <- transform(df, matched = {
i <- as.character(interaction(df[c("E", "N")]))
tab <- table(i)[order(unique(i))]
LETTERS[match(i, names(tab)[tab > 1])]
})
# E N matched
# 1 1 4 A
# 2 1 4 A
# 3 2 10 B
# 4 1 4 A
# 5 3 3 <NA>
# 6 2 2 C
# 7 2 2 C
# 8 3 6 <NA>
# 9 2 10 B
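An alternative sketch that assigns letters in order of first appearance, without relying on how table() sorts its names, uses duplicated() on a combined key. The names key, has_match and groups are illustrative, and the approach assumes no more than 26 distinct matches, since it draws on LETTERS.

```r
df <- data.frame(E = c(1, 1, 2, 1, 3, 2, 2, 3, 2),
                 N = c(4, 4, 10, 4, 3, 2, 2, 6, 10))

# One key per row that combines both columns
key <- paste(df$E, df$N, sep = "_")

# A row has a match if its key occurs more than once
has_match <- key %in% key[duplicated(key)]

# Distinct matched keys, in order of first appearance
groups <- unique(key[has_match])

# Unmatched rows are not in 'groups', so match() gives NA and so does LETTERS[NA]
df$matched <- LETTERS[match(key, groups)]
df
```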