I have data, a simplified version of which looks like this:
df_current <- data.frame(
start = c('yes', rep('no', 5), 'yes', rep('no', 3)),
season = c('banana', rep('to update', 5), 'apple', rep('to update', 3)),
stringsAsFactors = F
)
Let's say that the "start" variable indicates when a new season starts, and I can use that in combination with a date variable (not included) to indicate where apple and banana season start. Once this is done, I want to update the rest of the rows in the "season" column. All of the rows which currently have the value "to update" should be updated to have the value of the type of fruit whose season has most recently started (the rows are arranged by date). In other words, I want the data to look like this:
df_desired <- data.frame(
start = c('yes', rep('no', 5), 'yes', rep('no', 3)),
season = c(rep('banana', 6), rep('apple', 4)),
stringsAsFactors = F
)
I had assumed that something like the following would work:
updated <- df_current %>%
rowwise() %>%
mutate(season = case_when(
season != 'to update' ~ season,
season == 'to update' ~ lag(season)
))
However, that generates NAs at all the 'to update' values.
An easy way would be to replace "to update" with NA and then use fill.
library(dplyr)
library(tidyr)
df_current %>%
mutate(season = replace(season, season == "to update", NA)) %>%
fill(season)
# start season
#1 yes banana
#2 no banana
#3 no banana
#4 no banana
#5 no banana
#6 no banana
#7 yes apple
#8 no apple
#9 no apple
#10 no apple
Using the same logic you can also use zoo::na.locf to fill missing values with latest non-missing values.
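As a rough sketch of the zoo route, assuming the zoo package is installed (the intermediate name season_na is only illustrative), the same two steps could look like this:

```r
library(zoo)

df_current <- data.frame(
  start  = c("yes", rep("no", 5), "yes", rep("no", 3)),
  season = c("banana", rep("to update", 5), "apple", rep("to update", 3)),
  stringsAsFactors = FALSE
)

# Step 1: turn the placeholder into a real missing value
season_na <- replace(df_current$season, df_current$season == "to update", NA)

# Step 2: carry the last observed value forward;
# na.rm = FALSE keeps the vector length unchanged even if it starts with NA
df_current$season <- na.locf(season_na, na.rm = FALSE)
```

If the first row could itself be "to update", it stays NA here, since there is nothing earlier to carry forward.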
The NAs arise because rowwise() makes mutate() evaluate one row at a time, so season holds only a single value in each case_when evaluation and lag(season) always returns NA. Here is another base R solution that uses rle:
x <- rle(df_current$season)
x
#> Run Length Encoding
#> lengths: int [1:4] 1 5 1 3
#> values : chr [1:4] "banana" "to update" "apple" "to update"
x$values[x$values == "to update"] <- x$values[which(x$values == "to update") - 1]
x
#> Run Length Encoding
#> lengths: int [1:4] 1 5 1 3
#> values : chr [1:4] "banana" "banana" "apple" "apple"
df_current$season <- inverse.rle(x)
df_current
#> start season
#> 1 yes banana
#> 2 no banana
#> 3 no banana
#> 4 no banana
#> 5 no banana
#> 6 no banana
#> 7 yes apple
#> 8 no apple
#> 9 no apple
#> 10 no apple
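To check the cause of the NAs in the original rowwise() attempt, it may help to call lag directly. This is a small sketch using a shortened, made-up season vector; the names lag_single and lag_column are only illustrative.

```r
library(dplyr)

season <- c("banana", "to update", "to update", "apple", "to update")

# What rowwise() hands to lag(): one value at a time, so nothing precedes it
lag_single <- dplyr::lag("to update")

# What lag() sees without rowwise(): the whole column, shifted down by one
lag_column <- dplyr::lag(season)
```

Note that dropping rowwise() alone would not be enough: a single lag only reaches one row back, so longer runs of "to update" would still pick up "to update" from the row above. That is why the answers here fill values forward instead.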
We can use na_if
library(dplyr)
library(tidyr)
df_current %>%
mutate(season = na_if(season, "to update")) %>%
fill(season)
# start season
#1 yes banana
#2 no banana
#3 no banana
#4 no banana
#5 no banana
#6 no banana
#7 yes apple
#8 no apple
#9 no apple
#10 no apple
This might be a simple answer, but I am having issues finding this solution and could use help, please.
> fruit.names <- c(rep("apple",3), rep("pear",3), rep("pepper", 3), rep("rice",3))
> adj <- c(rep("red", 3), rep("not round", 2), "yellow", rep("hot", 3), "grain", "white", "starch")
> df.start <- data.frame(fruit.names, adj)
> df.start
fruit.names adj
1 apple red
2 apple red
3 apple red
4 pear not round
5 pear not round
6 pear yellow
7 pepper hot
8 pepper hot
9 pepper hot
10 rice grain
11 rice white
12 rice starch
I need code that lists only the unique values of df.start$fruit.names for which every row has the same value in df.start$adj.
So the results would look like this. I'd prefer to use only base R, if possible (i.e. no tidyr/dplyr.)
> df.results
fruit.names adj
1 apple red
2 pepper hot
A couple ways:
base R
ind <- ave(df.start$adj, df.start$fruit.names, FUN = function(z) length(unique(z)) == 1) == "TRUE"
unique(df.start[ind,])
# fruit.names adj
# 1 apple red
# 7 pepper hot
The need to check against the string "TRUE" is because ave requires that its return value is the same class as the input vector, so the output is coerced.
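The coercion can be seen directly, and one way to sidestep it is to run ave over integer row positions instead of the character column. This is only a sketch; the names flag, n_adj and res are illustrative, and stringsAsFactors = FALSE is set explicitly so that adj is a character vector.

```r
df.start <- data.frame(
  fruit.names = c(rep("apple", 3), rep("pear", 3), rep("pepper", 3), rep("rice", 3)),
  adj = c(rep("red", 3), rep("not round", 2), "yellow", rep("hot", 3),
          "grain", "white", "starch"),
  stringsAsFactors = FALSE
)

# ave() writes FUN's result back into a copy of its first argument,
# so a logical result stored in a character vector becomes "TRUE"/"FALSE"
flag <- ave(df.start$adj, df.start$fruit.names,
            FUN = function(z) length(unique(z)) == 1)
class(flag)

# Alternative: give ave() integer row positions and count distinct adj values,
# so the result stays numeric and can be compared with == 1
n_adj <- ave(seq_along(df.start$adj), df.start$fruit.names,
             FUN = function(i) length(unique(df.start$adj[i])))
res <- unique(df.start[n_adj == 1, ])
```

Here res should hold the apple/red and pepper/hot rows, the same as the version that compares against "TRUE".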
dplyr
(Offered for the crowd, though I know you said you preferred base R.)
library(dplyr)
df.start %>%
group_by(fruit.names) %>%
filter(length(unique(adj)) == 1) %>%
ungroup() %>%
distinct()
# # A tibble: 2 x 2
# fruit.names adj
# <chr> <chr>
# 1 apple red
# 2 pepper hot
I would like to optimize my code.
I'm using str_detect to make a lot of selections. To make the code easier to maintain, I would like the filter pattern to come from an externally defined object. I can do that, but only by wrapping the object in as.character(). Is it possible to do it in a tidy way?
Here is a working example demonstrating the issue. This is the classical way, and it works:
> tbl %>% mutate(twentys = case_when(
+ str_detect(fruit, "20") ~ T) )
# A tibble: 4 x 3
x fruit twentys
<int> <chr> <lgl>
1 1 apple 20 TRUE
2 2 banana 20 TRUE
3 3 pear 10 NA
4 4 pineapple 10 NA
This is how I imagined I could do it, but it doesn't work:
> twenty <- 20
> tbl %>% mutate(twentys = case_when(
+ str_detect(fruit, twenty) ~ T) )
Error: Problem with `mutate()` input `twentys`.
x no applicable method for 'type' applied to an object of class "c('double', 'numeric')"
i Input `twentys` is `case_when(str_detect(fruit, twenty) ~ T)`.
Run `rlang::last_error()` to see where the error occurred.
This is the cumbersome way, using as.character(), that I would like to optimize.
> tbl %>% mutate(twentys = case_when(
+ str_detect(fruit, as.character(twenty)) ~ T) )
# A tibble: 4 x 3
x fruit twentys
<int> <chr> <lgl>
1 1 apple 20 TRUE
2 2 banana 20 TRUE
3 3 pear 10 NA
4 4 pineapple 10 NA
You can use grepl if you don't want to convert twenty to character.
library(dplyr)
tbl %>% mutate(twentys = case_when(grepl(twenty, fruit) ~ TRUE))
# x fruit twentys
#1 1 apple 20 TRUE
#2 2 banana 20 TRUE
#3 3 pear 10 NA
#4 4 pineapple 10 NA
data
tbl <- structure(list(x = 1:4, fruit = c("apple 20", "banana 20", "pear 10",
"pineapple 10")), class = "data.frame", row.names = c(NA, -4L))
twenty <- 20
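The reason grepl accepts the number is that it coerces a non-character pattern with as.character itself. A small base R sketch of that behaviour follows; the hits_* names and the "v1.5" strings are made up for illustration.

```r
fruit <- c("apple 20", "banana 20", "pear 10", "pineapple 10")
twenty <- 20

# grepl() converts a non-character pattern to a string before matching
hits_numeric <- grepl(twenty, fruit)
hits_string  <- grepl("20", fruit)

# fixed = TRUE treats the pattern as literal text rather than a regular
# expression, which matters for values such as 1.5 where "." would
# otherwise match any character
hits_fixed <- grepl(1.5, c("v1.5", "v125"), fixed = TRUE)
```

So the conversion still happens; it is just done inside grepl rather than in your own code.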
We can use str_detect
library(dplyr)
library(stringr)
tbl %>%
mutate(twentys = case_when(str_detect(fruit, str_c(twenty)) ~ TRUE))
Or wrap with paste
tbl %>%
mutate(twentys = case_when(str_detect(fruit, paste(twenty)) ~ TRUE))
data
tbl <- structure(list(x = 1:4, fruit = c("apple 20", "banana 20", "pear 10",
"pineapple 10")), class = "data.frame", row.names = c(NA, -4L))
twenty <- 20
Say I have a list of people who used the cafeteria:
Fruits ID Date
apple 1 100510
apple 2 100710
banana 2 110710
banana 1 120910
kiwi 2 120710
apple 3 100210
kiwi 3 110810
I want to select the people who have taken both apple and banana, so that my new dataset contains only the IDs that meet this inclusion criterion:
ID
1
2
(because only ID 1 and 2 had both apple and banana in the dataset)
what code should I use in R?
In base R you could do something like
data.frame(ID = names(which(sapply(split(df$Fruits, df$ID), function(x) {
"apple" %in% x & "banana" %in% x
}))))
#> ID
#> 1 1
#> 2 2
This will give you the name of the IDs that contain both "apple" and "banana"
If you want the subset of the data frame containing these rows you can do:
df[df$ID %in% names(which(sapply(split(df$Fruits, df$ID), function(x) {
"apple" %in% x & "banana" %in% x
}))),]
#> Fruits ID Date
#> 1 apple 1 100510
#> 2 apple 2 100710
#> 3 banana 2 110710
#> 4 banana 1 120910
#> 5 kiwi 2 120710
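If the set of fruits to check may grow, the same idea can be written against a vector of required values. This is a sketch only: required, has_all and ids are hypothetical names, and the data frame is rebuilt from the table in the question.

```r
df <- data.frame(
  Fruits = c("apple", "apple", "banana", "banana", "kiwi", "apple", "kiwi"),
  ID     = c(1, 2, 2, 1, 2, 3, 3),
  Date   = c(100510, 100710, 110710, 120910, 120710, 100210, 110810),
  stringsAsFactors = FALSE
)

required <- c("apple", "banana")  # hypothetical: fruits every selected ID must have

# One logical per ID: TRUE when every required fruit appears for that ID
has_all <- tapply(df$Fruits, df$ID, function(x) all(required %in% x))

# IDs that satisfy the criterion (as character, taken from the group names)
ids <- names(has_all)[has_all]

# Rows belonging to those IDs
df_subset <- df[df$ID %in% ids, ]
```

Adding a third fruit then only means extending required.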
You can use dplyr package to check if any of the entries is an apple AND any of the entries is a banana per ID:
library(dplyr)
df <- data.frame(Fruits = c("apple", "apple", "banana", "banana", "kiwi", "apple", "kiwi"),
ID = c(1,2,2,1,2,3,3),
Date = c(100510,100710,110710,120910, 120710,100210,110810))
df %>%
group_by(ID) %>%
filter(any(Fruits == "apple") & any(Fruits == "banana")) %>%
ungroup() %>%
select(ID) %>%
distinct()
In this case, the result is
# A tibble: 2 x 1
ID
<dbl>
1 1
2 2
I have data in the form of ID and food:
adf<-data.frame(ID=c("a","a","a","b","b","b","b","c","c"),
foods=c("apple","orange","banana","apple","banana","tomato","pear","pear","onion"))
I also have a list of required foods that each ID is being measured for completion against:
required_foods<-c("apple","tomato")
I am interested in producing a column called "missing_foods" that houses a comma-separated list of any and all foods in the required_foods that don't exist in the foods column of my data, as grouped by ID.
In the desired_output below is an example of what I'm hoping to accomplish.
desired_output<-data.frame(ID=c("a","a","a","b","b","b","b","c","c"),
foods=c("apple","orange","banana","apple","banana","tomato","pear","pear","onion"),
missing_foods=c("tomato","tomato","tomato","","","","","apple,tomato","apple,tomato"))
My attempts at solving this so far have been fruitless. Ideally, I'm hoping for a dplyr answer that will have the flexibility to allow for required_foods lists of varying lengths. I will ultimately be making multiple required_... lists and hoping to produce a new column for each one.
My attempts:
adf2<-adf%>%
group_by(ID)%>%
mutate(missing_foods= !(required_foods %in% foods))
adf2<-adf%>%
group_by(ID)%>%
mutate(missing_foods= paste(!(required_foods %in% foods),sep=","))
adf2<-adf%>%
group_by(ID)%>%
mutate(missing_foods= for (f in 1:length(required_foods)){
ifelse(f %in% required_foods,paste0(""),
paste0(f,","))
})
Any help would be greatly appreciated.
Here, we are using the desired_output data.frame as the 'adf' dataset values are not the same as in the 'desired_output'. After grouping by 'ID', get the elements from 'required_foods' that are not in 'foods' with setdiff, paste them together (str_c), and replace any NA (when all the elements are found) with blank ("")
library(dplyr)
library(stringr)
library(tidyr)
desired_output %>%
group_by(ID) %>%
mutate(newmissing_foods = replace_na(str_c(setdiff(required_foods,
foods), collapse=", ")[1], ''))
# A tibble: 9 x 4
# Groups: ID [3]
# ID foods missing_foods newmissing_foods
# <fct> <fct> <fct> <chr>
#1 a apple "tomato" "tomato"
#2 a orange "tomato" "tomato"
#3 a banana "tomato" "tomato"
#4 b apple "" ""
#5 b banana "" ""
#6 b tomato "" ""
#7 b pear "" ""
#8 c pear "apple,tomato" "apple, tomato"
#9 c onion "apple,tomato" "apple, tomato"
In the OP's code, it is just creating a logical vector
!(required_foods %in% foods)
which should be used to subset the 'required_foods'
desired_output %>%
group_by(ID) %>%
mutate(newmissing_foods = paste(required_foods[!(required_foods %in%
foods)], collapse=", "))
Or using data.table
library(data.table)
setDT(desired_output)[, newmissing_foods := paste(required_foods[!(required_foods %in%
foods)], collapse=", "), ID]
NOTE: toString is a wrapper for paste(., collapse = ", ")
We can group_by ID and use setdiff to get the items in required_foods that are not present in foods, then collapse them into a comma-separated value with toString.
library(dplyr)
adf %>%
group_by(ID) %>%
mutate(missing_foods = toString(setdiff(required_foods, foods)))
# ID foods missing_foods
# <fct> <fct> <chr>
#1 a apple "tomato"
#2 a orange "tomato"
#3 a banana "tomato"
#4 b apple ""
#5 b banana ""
#6 b tomato ""
#7 b pear ""
#8 c pear "apple, tomato"
#9 c onion "apple, tomato"
The same can be done with data.table as well
library(data.table)
setDT(adf)[, missing_foods := toString(setdiff(required_foods, foods)), ID]
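Since the question mentions several required_... lists with one new column each, here is a base R sketch of how the setdiff/toString idea might be looped over a named list. The list required and its second entry fruit are made up for illustration and are not part of the original data.

```r
adf <- data.frame(
  ID    = c("a", "a", "a", "b", "b", "b", "b", "c", "c"),
  foods = c("apple", "orange", "banana", "apple", "banana",
            "tomato", "pear", "pear", "onion"),
  stringsAsFactors = FALSE
)

# Hypothetical named list: one entry per requirement set
required <- list(
  foods = c("apple", "tomato"),
  fruit = c("apple", "banana", "pear")
)

for (nm in names(required)) {
  # ave() recycles the single string returned for each ID across that ID's rows
  adf[[paste0("missing_", nm)]] <- ave(
    adf$foods, adf$ID,
    FUN = function(f) toString(setdiff(required[[nm]], f))
  )
}
```

Each list entry produces a column named missing_<entry name>, so adding another requirement set only means adding another element to the list.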
I'm starting with the following data:
df <- data.frame(Person=c("Ada","Ada","Bob","Bob","Carl","Carl"), Day=c(1,2,2,1,1,2), Fruit=c("Apple","X","Apple","X","X","Orange"))
Person Day Fruit
1 Ada 1 Apple
2 Ada 2 X
3 Bob 2 Apple
4 Bob 1 X
5 Carl 1 X
6 Carl 2 Orange
And I want to loop through every person and replace the unknown fruit X with either Apple or Orange while making sure that if it's Orange one day, it should be Apple the next day, and vice versa.
For Ada: Day 1 = Apple, meaning Day 2 = X <- Orange
I don't know where to start other than:
library(dplyr)
df %>%
group_by(Person)
any suggestions for direction?
Another solution using case_when from dplyr:
library(dplyr)
# Changing datatypes to character instead of factor
df[] <- lapply(df, as.character)
# Optional, but this line will convert all columns to appropriate datatype, eg. Day will be integer
df <- readr::type_convert(df)
df %>%
group_by(Person) %>%
mutate(
Contains_Apple = any(Fruit == "Apple"),
Contains_Orange = any(Fruit == "Orange"),
Fruit = case_when(
Fruit == "X" & Contains_Apple == F ~ "Apple",
Fruit == "X" & Contains_Orange == F ~ "Orange",
TRUE ~ Fruit
)
)
# A tibble: 6 x 5
# Groups: Person [3]
Person Day Fruit Contains_Apple Contains_Orange
<chr> <int> <chr> <lgl> <lgl>
1 Ada 1 Apple T F
2 Ada 2 Orange T F
3 Bob 2 Apple T F
4 Bob 1 Orange T F
5 Carl 1 Apple F T
6 Carl 2 Orange F T
Remove the Contains_Apple and Contains_Orange by:
df %>%
group_by(Person) %>%
mutate(Contains_Apple = any(Fruit == "Apple"),
Contains_Orange = any(Fruit == "Orange"),
Fruit = case_when(Fruit == "X" & Contains_Apple == F ~ "Apple",
Fruit == "X" & Contains_Orange == F ~ "Orange",
TRUE ~ Fruit)) %>%
select(Person, Day, Fruit) %>%
ungroup()
# A tibble: 6 x 3
Person Day Fruit
<chr> <int> <chr>
1 Ada 1 Apple
2 Ada 2 Orange
3 Bob 2 Apple
4 Bob 1 Orange
5 Carl 1 Apple
6 Carl 2 Orange
Here is one idea using case_when to check if each group already has "Apple" or "Orange", and then assign the opposite value if Fruit is "X".
Notice that I added stringsAsFactors = FALSE when creating the example data frame, which aims to avoid the creation of factor columns.
library(dplyr)
library(tidyr)
df %>%
group_by(Person) %>%
mutate(Fruit = case_when(
Fruit %in% "X" & any(Fruit %in% "Apple") ~ "Orange",
Fruit %in% "X" & any(Fruit %in% "Orange") ~ "Apple",
TRUE ~ Fruit
)) %>%
ungroup()
# # A tibble: 6 x 3
# Person Day Fruit
# <chr> <dbl> <chr>
# 1 Ada 1.00 Apple
# 2 Ada 2.00 Orange
# 3 Bob 2.00 Apple
# 4 Bob 1.00 Orange
# 5 Carl 1.00 Apple
# 6 Carl 2.00 Orange
DATA
df <- data.frame(Person=c("Ada","Ada","Bob","Bob","Carl","Carl"),
Day=c(1,2,2,1,1,2),
Fruit=c("Apple","X","Apple","X","X","Orange"),
stringsAsFactors = FALSE)
Simple with looping:
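fruity_loop <- function(frame) {
  ops <- c('Apple', 'Orange')
  # Work on character values; a factor column cannot take new values by assignment
  frame$Fruit <- as.character(frame$Fruit)
  for (x in seq_len(nrow(frame))) {
    if (frame$Fruit[x] == 'X') {
      # Use the known fruits of the same person rather than the previous row,
      # which may belong to someone else (and does not exist when x is 1)
      known <- setdiff(frame$Fruit[frame$Person == frame$Person[x]], 'X')
      frame$Fruit[x] <- if (ops[1] %in% known) ops[2] else ops[1]
    }
  }
  frame
}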
Example:
fruity_loop(df)