I imagine this question is not unique, but I struggled to find the right search terms, so if it is a duplicate please point me to the existing post!
I have a data frame:
test <- data.frame(x = c("a", "b", "c", "d", "e"))
x
1 a
2 b
3 c
4 d
5 e
And I'd like to replace SOME of the values using a separate data frame:
metadata <- data.frame(
a = c("c", "d"),
b = c("REPLACE_1", "REPLACE_2"))
Resulting in:
x
1 a
2 b
3 REPLACE_1
4 REPLACE_2
5 e
A base R solution using match + replace:
test <- within(test, x <- replace(as.character(x), match(metadata$a, x), as.character(metadata$b)))
such that
> test
x
1 a
2 b
3 REPLACE_1
4 REPLACE_2
5 e
Importing your data with stringsAsFactors = FALSE and using dplyr and stringr, you can do:
library(dplyr)
library(stringr)
test %>%
mutate(x = str_replace_all(x, setNames(metadata$b, metadata$a)))
x
1 a
2 b
3 REPLACE_1
4 REPLACE_2
5 e
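One caveat with this approach: str_replace_all interprets the names of the vector as regular expressions and replaces every match inside a string, so a value that merely contains a key would be altered too. A minimal sketch of one way around that, assuming the keys contain no regex metacharacters, is to anchor each key with ^ and $. The vectors x, keys and values below are hypothetical stand-ins for test$x, metadata$a and metadata$b:

```r
library(stringr)

# Hypothetical data: "cd" merely contains the keys and should stay untouched
x <- c("a", "cd", "c", "d")
keys <- c("c", "d")
values <- c("REPLACE_1", "REPLACE_2")

# Anchor each key so that only whole-string matches are replaced
patterns <- setNames(values, paste0("^", keys, "$"))
result <- str_replace_all(x, patterns)
print(result)
```

If the keys can contain characters such as "." or "(", they would need escaping first, so a match-based lookup may be the simpler option in that case.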
Or using the basic idea from @Sotos (a match-based lookup that falls back to the original value where there is no match):
test %>%
mutate(x = coalesce(metadata$b[match(x, metadata$a)], x))
You can do,
test$x[test$x %in% metadata$a] <- na.omit(metadata$b[match(test$x, metadata$a)])
# x
#1 a
#2 b
#3 REPLACE_1
#4 REPLACE_2
#5 e
Here's one approach, though I presume there are shorter ones:
library(dplyr)
test %>%
left_join(metadata, by = c("x" = "a")) %>%
mutate(b = coalesce(b, x))
# x b
#1 a a
#2 b b
#3 c REPLACE_1
#4 d REPLACE_2
#5 e e
Note: I have made the data types match by loading metadata as character, not factors:
metadata <- data.frame(stringsAsFactors = FALSE,
a = c("c", "d"),
b = c("REPLACE_1", "REPLACE_2"))
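If you want the result back in a single x column rather than alongside it, a small variation on the join is to finish with transmute instead of mutate. This is only a sketch under the same assumption as above (both columns are character); the object name result is just for illustration:

```r
library(dplyr)

test <- data.frame(x = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
metadata <- data.frame(a = c("c", "d"),
                       b = c("REPLACE_1", "REPLACE_2"),
                       stringsAsFactors = FALSE)

result <- test %>%
  left_join(metadata, by = c("x" = "a")) %>%  # b is NA where there is no match
  transmute(x = coalesce(b, x))               # keep b if present, else x; drop b

print(result)
```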
You can use match to make this update join.
i <- match(metadata$a, test$x)
test$x[i] <- metadata$b
# test
# x
#1 a
#2 b
#3 REPLACE_1
#4 REPLACE_2
#5 e
Or:
i <- match(test$x, metadata$a)
j <- !is.na(i)
test$x[j] <- metadata$b[i[j]]
test
# x
#1 a
#2 b
#3 REPLACE_1
#4 REPLACE_2
#5 e
Data:
test <- data.frame(x = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
metadata <- data.frame(
a = c("c", "d"),
b = c("REPLACE_1", "REPLACE_2"), stringsAsFactors = FALSE)
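The two variants differ on less tidy input. Matching test$x against metadata$a (the second form) copes with repeated values in test$x and with keys that never occur in test$x, whereas the first form only touches the first occurrence of each key and fails on an unmatched key. A small sketch with made-up data to illustrate:

```r
# Hypothetical data: "c" appears twice, and the key "z" has no match in test$x
test <- data.frame(x = c("a", "c", "c", "e"), stringsAsFactors = FALSE)
metadata <- data.frame(a = c("c", "z"),
                       b = c("REPLACE_1", "REPLACE_2"),
                       stringsAsFactors = FALSE)

i <- match(test$x, metadata$a)   # position of each x in the lookup, NA if absent
j <- !is.na(i)                   # rows that actually have a replacement
test$x[j] <- metadata$b[i[j]]

print(test$x)
```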
I have the following data frame of letters, with some blank (NA) slots in the lowercase column:
letters_df <- data.frame(caps = LETTERS[1:10], lows = letters[c(1,2,11,11,11,11,11,11,11,10)])
letters_df[letters_df == "k"] <- NA
letters_df
To fill in some of the blanks I am using this new data frame I constructed
new_letters <- data.frame(caps = c("C", "D", "F", "G", "H"),
lows = c("c", "d", "f", "g", "h"))
Following on from a previous question I am using dplyr mutate and case_when as follows
letters_df %>%
mutate(lows = case_when(
caps %in% new_letters$caps ~ new_letters$lows,
TRUE ~ lows))
However, the result does not add in the missing letters and throws an error asking for a vector of the same length as the letters_df column. I thought I had a good handle on the syntax here. Can anyone help me see where I am going wrong?
This is a typical case that the rows_* functions from dplyr can handle:
library(dplyr)
letters_df %>%
rows_patch(new_letters, by = "caps")
caps lows
1 A a
2 B b
3 C c
4 D d
5 E <NA>
6 F f
7 G g
8 H h
9 I <NA>
10 J j
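For context, rows_patch only fills in values that are NA in the original, while rows_update overwrites whatever is there. A minimal sketch with two toy data frames (old and new are hypothetical names) showing the difference:

```r
library(dplyr)

old <- data.frame(key = c("A", "B"), val = c("a", NA), stringsAsFactors = FALSE)
new <- data.frame(key = c("A", "B"), val = c("x", "b"), stringsAsFactors = FALSE)

patched <- rows_patch(old, new, by = "key")   # only the NA in row B is filled
updated <- rows_update(old, new, by = "key")  # both rows are overwritten

print(patched)
print(updated)
```

Both functions expect every key in the second data frame to exist in the first, so they suit the case here, where new_letters only contains letters already present in letters_df.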
You could consider using a left_join combined with coalesce:
library(dplyr)
letters_df %>%
left_join(new_letters, by = "caps") %>%
mutate(lows = coalesce(lows.x, lows.y), .keep = "unused")
This returns
caps lows
1 A a
2 B b
3 C c
4 D d
5 E <NA>
6 F f
7 G g
8 H h
9 I <NA>
10 J j
As an alternative that is more similar to your approach, you could transform your new_letters data.frame into a lookup vector returning the same result:
lookup <- tibble::deframe(new_letters)
letters_df %>%
mutate(lows = case_when(caps %in% names(lookup) ~ lookup[caps],
TRUE ~ lows))
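As for why the original case_when failed: every right-hand side has to be either length 1 or as long as the column, but new_letters$lows has 5 elements against 10 rows. Indexing the lookup by caps (or using match) produces a vector that is aligned with the rows. A base R sketch of the same idea, where aligned and filled are just illustrative names:

```r
letters_df <- data.frame(caps = LETTERS[1:10],
                         lows = letters[c(1, 2, 11, 11, 11, 11, 11, 11, 11, 10)],
                         stringsAsFactors = FALSE)
letters_df$lows[letters_df$lows == "k"] <- NA
new_letters <- data.frame(caps = c("C", "D", "F", "G", "H"),
                          lows = c("c", "d", "f", "g", "h"),
                          stringsAsFactors = FALSE)

# One replacement candidate per row of letters_df (NA where there is no entry)
aligned <- new_letters$lows[match(letters_df$caps, new_letters$caps)]

# Fill only the missing values
filled <- ifelse(is.na(letters_df$lows), aligned, letters_df$lows)
print(filled)
```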
How can you transform empty strings to NA in a variable?
So the context is I have 4 datasets that are combined (4 surveys). In one of those datasets and in one variable in particular, the "NA" data are classified as (empty strings). How do I modify/transform/mutate those (empty strings) as NA without affecting the rest of the dataset?
What I saw that I think could work is this:
dat <- dat %>% mutate_all(na_if, "")
The problem, obviously, is that this applies to every column in the data.
What worked was
nav <- c('', ' ')
ECEMD <- transform(ECEMD, CHILDREN_DIM = replace (CHILDREN_DIM, CHILDREN_DIM %in% nav, NA))
Assuming data something like this:
dat
# want_na non_na
# 1 A A
# 2 B B
# 3
# 4 D D
# 5
# 6 F F
# 7
# 8 H H
# 9 I I
# 10 J J
Then define a vector containing all the values you want to replace with NA, e.g.
nav <- c('', ' ')
and replace them in a defined variable, "want_na" in this example.
dat <- transform(dat, want_na=replace(want_na, want_na %in% nav, NA))
dat
# want_na non_na
# 1 A A
# 2 B B
# 3 <NA>
# 4 D D
# 5 <NA>
# 6 F F
# 7 <NA>
# 8 H H
# 9 I I
# 10 J J
Data:
dat <- structure(list(want_na = c("A", "B", "", "D", "", "F", "", "H",
"I", "J"), non_na = c("A", "B", "", "D", "", "F", "", "H", "I",
"J")), class = "data.frame", row.names = c(NA, -10L))
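If the blank entries can contain any amount of whitespace, listing every variant in nav gets tedious. A possible alternative, sketched here on a small made-up data frame, is to trim the values before comparing them to the empty string:

```r
# Hypothetical data with an empty string and a whitespace-only string
dat <- data.frame(want_na = c("A", "", "  ", "D"),
                  non_na  = c("A", "", "  ", "D"),
                  stringsAsFactors = FALSE)

# %in% never returns NA, so existing NA values are left alone
blank <- trimws(dat$want_na) %in% ""
dat$want_na[blank] <- NA

print(dat)
```

Only want_na is touched; non_na keeps its empty strings.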
I'd like to fill the NA values in the F2 column with the most common F2 value within each F1 group.
F1 F2
1 A C
2 B D
3 A NA
4 A C
5 B NA
Desired outcome:
F1 F2
1 A C
2 B D
3 A C
4 A C
5 B D
Thank you for help
Here is a base R solution. First define a function for the mode (taken from another post) and then apply it to your data frame, i.e.
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
df$F2 <- with(df, ave(F2, F1, FUN = function(i) replace(i, is.na(i), Mode(i))))
df
# F1 F2
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
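Note that Mode as written counts NA as a value, so in a group where NA is the most frequent entry nothing would be filled. A possible variation, assuming F2 is a character column, drops the NAs before counting and returns NA for groups with no observed value. The helper name mode_non_na and the extra group "C" are made up for this sketch:

```r
# Mode of the non-missing values; NA if the group has none
mode_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_character_)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical data: group "C" has no observed F2 at all
df <- data.frame(F1 = c("A", "B", "A", "A", "B", "C"),
                 F2 = c("C", "D", NA, "C", NA, NA),
                 stringsAsFactors = FALSE)

df$F2 <- ave(df$F2, df$F1,
             FUN = function(v) replace(v, is.na(v), mode_non_na(v)))
print(df)
```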
Here is one way using dplyr :
library(dplyr)
df %>%
group_by(F1) %>%
mutate(F2 = replace(F2, is.na(F2),
names(sort(table(F2), decreasing = TRUE)[1])))
# F1 F2
# <chr> <chr>
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
In case of ties, preference is given to lexicographic order.
Try this:
First, in df2, I count the F2 values within each F1 group, drop the missing ones, and keep the value with the highest count; this gives the most common F2 value per group. I then join it back onto the original data frame, use mutate to fill the missing F2 values from the new F2_fill column, and finally drop F2_fill.
library(tidyverse)
df <- tribble(
~F1, ~F2,
'A', 'C',
'B' , 'D',
'A' ,NA,
'A', 'C',
'B', NA)
df2 <- df %>%
group_by(F1) %>%
count(F2) %>%
filter(!is.na(F2)) %>% # drop NA first so that max(n) ignores the NA count
filter(n == max(n)) %>%
select(-n) %>%
rename(F2_fill = F2)
df3 <- left_join(df,df2, by="F1") %>%
mutate(F2 = ifelse(is.na(F2), F2_fill,F2)) %>%
select(-F2_fill)
You can use ave with table and which.max, and then subset with is.na; this works when F2 is a character column.
i <- is.na(x$F2)
x$F2[i] <- ave(x$F2, x$F1, FUN=function(y) names(which.max(table(y))))[i]
x
# F1 F2
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
Data:
x <- data.frame(F1 = c("A", "B", "A", "A", "B")
, F2 = c("C", "D", NA, "C", NA))
I am struggling to fill a column based on a condition. Maybe my approach is heading in the wrong direction; I don't know. My conditions are as follows:
if a row contains 2 "b"s and 1 "a", write "B" in the column "match"
if a row contains 2 "c"s, write "C" in the column "match"
for anything else, fill with NA
So far I have done the following, but I can see it is not right: the counts are computed over the entire data frame rather than row by row, and it still doesn't work.
set.seed(123)
df_letters <- data.frame(basket1 = sample(letters[1:3], 5, replace = TRUE, prob = c(0.85,0.10,0.5)),
basket2 = sample(letters[1:3], 5, replace = TRUE, prob = c(0.10,0.85,0.5)),
basket3 = sample(letters[1:3], 5, replace = TRUE, prob=c(0.5,0.10,0.85)),
stringsAsFactors = FALSE)
df_letters %>% mutate(match = ifelse(sum(as.character(as.vector(df_letters)) == "c")==2, "C",
ifelse((sum(as.character(as.vector(df_letters)) == "b")==2) & (sum(as.character(as.vector(df_letters)) == "a")==1) ,"B", NA )))
My desired output is:
> df_letters
basket1 basket2 basket3 match
1 a b b B
2 c b c C
3 a c a <NA>
4 c b c C
5 b b c <NA>
Many thanks in advance!
One dplyr option could be:
df_letters %>%
mutate(match = case_when(rowSums(select(., starts_with("basket")) == "b") == 2 & rowSums(select(., starts_with("basket")) == "a") == 1 ~ "B",
rowSums(select(., starts_with("basket")) == "c") == 2 ~ "C",
TRUE ~ NA_character_))
basket1 basket2 basket3 match
1 a b b B
2 c b c C
3 a c a <NA>
4 c b c C
5 b b c <NA>
This is how to achieve this in base R:
df_letters$match <- apply(df_letters, 1, function(x) {
count <- as.list(table(x))
ifelse(count$a == 1 && count$b == 2, "B", ifelse(count$c == 2, "C", NA_character_))
})
The idea is to convert the table object to list to access counts by element.
Output
basket1 basket2 basket3 match
1 a b b B
2 c b c C
3 a c a <NA>
4 c b c C
5 b b c <NA>
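The apply version relies on count$a being NULL when a letter is absent from a row, which is a little fragile. A vectorised base R sketch that counts each letter per row with rowSums avoids that; the data frame below is typed in by hand from the desired output rather than sampled, and n_a, n_b, n_c are just illustrative names:

```r
df_letters <- data.frame(basket1 = c("a", "c", "a", "c", "b"),
                         basket2 = c("b", "b", "c", "b", "b"),
                         basket3 = c("b", "c", "a", "c", "c"),
                         stringsAsFactors = FALSE)

# Work on the basket columns only, as a character matrix
baskets <- as.matrix(df_letters[c("basket1", "basket2", "basket3")])

# Per-row counts of each letter (0 when the letter is absent)
n_a <- rowSums(baskets == "a")
n_b <- rowSums(baskets == "b")
n_c <- rowSums(baskets == "c")

df_letters$match <- ifelse(n_b == 2 & n_a == 1, "B",
                           ifelse(n_c == 2, "C", NA_character_))
print(df_letters)
```

Selecting the basket columns explicitly also means the code can be re-run after the match column has been added.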
I have a table df that looks like this:
a <- c(10,20, 20, 20, 30)
b <- c("u", "u", "u", "r", "r")
c <- c("a", "a", "b", "b", "b")
df <- data.frame(a,b,c)
I would like to create a new table that contains the mean of col a, grouped by variable c. And I would like to have a column with the counts of the occurrence of b types within each group c.
I would therefore like the result table to look like df2:
a_m <- c(15, 23.3)
c <- c("a", "b")
counts_b <-c("2 u", "1 u, 2 r")
df2 <- data.frame(a_m, c, counts_b)
What I have so far is:
df2 <- df %>% group_by(c) %>% summarise(a_m = mean(a, na.rm = TRUE))
I do not know how to add the column counts_b in the example df2.
Giulia
Here's a way using a little table magic:
df %>%
group_by(c) %>%
summarise(a_mean = mean(a),
b_list = paste(names(table(b)), table(b), collapse = ', '))
# A tibble: 2 x 3
c a_mean b_list
<fct> <dbl> <chr>
1 a 15.0 r 0, u 2
2 b 23.3 r 2, u 1
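The zero count ("r 0") shows up because b is a factor there, so table reports every level. If you only want the values that actually occur in each group, in the "2 u" style from the question, one possible base R sketch is below. The helper count_label is a made-up name, and the labels come out in alphabetical order within each group rather than in the order shown in the question:

```r
df <- data.frame(a = c(10, 20, 20, 20, 30),
                 b = c("u", "u", "u", "r", "r"),
                 c = c("a", "a", "b", "b", "b"),
                 stringsAsFactors = FALSE)

# "count value" labels for the values present in v, e.g. "2 r, 1 u"
count_label <- function(v) {
  tab <- table(v)
  tab <- tab[tab > 0]   # drop unused levels if v happens to be a factor
  paste(tab, names(tab), collapse = ", ")
}

df2 <- data.frame(c = sort(unique(df$c)),
                  a_m = as.vector(tapply(df$a, df$c, mean)),
                  counts_b = as.vector(tapply(df$b, df$c, count_label)),
                  stringsAsFactors = FALSE)
print(df2)
```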
Here is another solution using reshape2. The output format may be more convenient to work with: each value of b has its own column containing its number of occurrences.
library(reshape2)
out1 <- dcast(df, c ~ b, value.var="c", fun.aggregate=length)
c r u
1 a 0 2
2 b 2 1
out2 <- df %>% group_by(c) %>% summarise(a_m = mean(a))
# A tibble: 2 x 2
c a_m
<fctr> <dbl>
1 a 15.00000
2 b 23.33333
df2 <- merge(out1, out2, by = "c")
c r u a_m
1 a 0 2 15.00000
2 b 2 1 23.33333