R: replace NA values in a dataframe column with existing values from other rows of the same column

I have the following dataframe:
FOOD ID DATE PRICE DES
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
3 1/25/2020 4 Tomato
3 1/25/2020 NA Tomato
3 1/1/2019 NA Tomato
3 1/1/2019 5 Tomato
I need to replace the NA values, where possible, with a price that is available for the same FOOD ID and the same DATE. Expected output:
FOOD ID DATE PRICE DES
1 1/1/2020 100 Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 100 Tuna
3 1/25/2020 4 Tomato
3 1/25/2020 4 Tomato
3 1/1/2019 5 Tomato
3 1/1/2019 5 Tomato
Is there an easy way to perform this task without using a for loop?
I guess one way could be to use dplyr: group the data by FOOD ID and DATE and get an "average" PRICE, delete the PRICE column from the original dataframe, and finally merge the grouped data with the original dataframe. But this seems an odd way to do it.
Thanks for the help.

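library(dplyr)
library(tidyr)
# fill missing prices from other rows of the same FOOD_ID/DATE group
df %>%
  group_by(FOOD_ID, DATE) %>%
  fill(PRICE, .direction = 'updown')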
# A tibble: 8 x 4
# Groups: FOOD_ID, DATE [3]
FOOD_ID DATE PRICE DES
<int> <chr> <int> <chr>
1 1 1/1/2020 100 Tuna
2 1 1/1/2020 100 Tuna
3 1 1/1/2020 100 Tuna
4 1 1/1/2020 100 Tuna
5 3 1/25/2020 4 Tomato
6 3 1/25/2020 4 Tomato
7 3 1/1/2019 5 Tomato
8 3 1/1/2019 5 Tomato

We can use the data itself to feed prices back in.
Data:
df <- read.table(header = TRUE, text= "FOOD_ID DATE PRICE DES
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
3 1/25/2020 4 Tomato
3 1/25/2020 NA Tomato
3 1/1/2019 NA Tomato
3 1/1/2019 5 Tomato")
Find distinct prices for each product on each date.
library(dplyr)
prices <- df %>%
  filter(!is.na(PRICE)) %>%
  group_by(FOOD_ID, DATE, DES) %>%
  distinct(FOOD_ID, .keep_all = TRUE)
Join these prices back into the original dataframe, which assigns a price to every row of each product and date. (I removed the original PRICE column first because it is fed back in from the prices dataframe.)
new_df <- df %>%
select(-PRICE) %>%
left_join(prices, by = c('FOOD_ID', 'DATE', 'DES'))
Output of new_df:
FOOD_ID DATE DES PRICE
1 1 1/1/2020 Tuna 100
2 1 1/1/2020 Tuna 100
3 1 1/1/2020 Tuna 100
4 1 1/1/2020 Tuna 100
5 3 1/25/2020 Tomato 4
6 3 1/25/2020 Tomato 4
7 3 1/1/2019 Tomato 5
8 3 1/1/2019 Tomato 5
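The same group-wise fill can also be sketched in base R with ave(), without dplyr or tidyr. This is only a minimal sketch and not part of the answers above: the helper name fill_group is hypothetical, and it assumes that all known prices within one FOOD_ID/DATE group agree, so the first non-missing one can stand in for the missing ones.

```r
# Example data from the question
df <- read.table(header = TRUE, text = "FOOD_ID DATE PRICE DES
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
3 1/25/2020 4 Tomato
3 1/25/2020 NA Tomato
3 1/1/2019 NA Tomato
3 1/1/2019 5 Tomato")

# Hypothetical helper: replace the NAs in x with the first non-missing value.
# If the group has no known value at all, x is returned unchanged.
fill_group <- function(x) {
  known <- x[!is.na(x)]
  if (length(known) > 0) x[is.na(x)] <- known[1]
  x
}

# ave() applies fill_group separately to each FOOD_ID/DATE combination
df$PRICE <- ave(df$PRICE, df$FOOD_ID, df$DATE, FUN = fill_group)
df
```

Groups without any known price simply keep their NA values, which matches the "where possible" requirement in the question.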

Related

Collapse rows with same identifier and columns and retain all values in r

I have a dataframe that contains several fields related to an identifier but some are disjointed:
id store manager fruit vegetable
1 Grocery1 Joe apple NA
1 Grocery1 Joe lemon NA
1 Grocery1 Joe NA zucchini
2 Grocery2 Amy orange NA
2 Grocery2 Amy NA asparagus
2 Grocery2 Amy NA spinach
3 Grocery3 Bill NA NA
I want the dataframe to look like:
id store manager fruit vegetable
1 Grocery1 Joe apple zucchini
1 Grocery1 Joe lemon zucchini
2 Grocery2 Amy orange asparagus
2 Grocery2 Amy orange spinach
3 Grocery3 Bill NA NA
Is there a way to easily do this?
You can use tidyr::fill to fill the NA values within each group, then keep only the non-duplicated rows with distinct.
library(dplyr)
library(tidyr)
df %>%
group_by(store, manager) %>%
fill(fruit, vegetable, .direction = "updown") %>%
distinct()
# A tibble: 5 × 5
# Groups: store, manager [3]
id store manager fruit vegetable
<int> <chr> <chr> <chr> <chr>
1 1 Grocery1 Joe apple zucchini
2 1 Grocery1 Joe lemon zucchini
3 2 Grocery2 Amy orange asparagus
4 2 Grocery2 Amy orange spinach
5 3 Grocery3 Bill NA NA

How does ggplot2 evaluate factor levels?

Consider this example dataframe:
library(tidyverse)
df <- tibble(item = c("Banana", "Ananas", "Apple", "Blueberry", "Orange",
"Spinach", "Cabbage", "Broccoli", "Carrot", "Eggplant"),
category = c(rep("Fruit", 5), rep("Vegetable", 5)),
n = c(57, 19, 14, 11, 8, 318, 70, 33, 31, 23))
First, I attribute factor levels to item by n. I also add a variable item_fct in order to display the factor levels of item in the output table for a better understanding of my problem:
df %>%
mutate(item = fct_reorder(item, n),
item_fct = as.numeric(item))
# A tibble: 10 x 4
item category n item_fct
<fct> <chr> <dbl> <dbl>
1 Banana Fruit 57 8
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 10
7 Cabbage Vegetable 70 9
8 Broccoli Vegetable 33 7
9 Carrot Vegetable 31 6
10 Eggplant Vegetable 23 5
This output table and the factor levels make sense to me: item is numbered from 1 to 10 by n in ascending order.
But if I group it by category, I get the following output:
df %>%
group_by(category) %>%
mutate(item = fct_reorder(item, n),
item_fct = as.numeric(item))
# A tibble: 10 x 4
# Groups: category [2]
item category n item_fct
<fct> <chr> <dbl> <dbl>
1 Banana Fruit 57 5
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 5
7 Cabbage Vegetable 70 4
8 Broccoli Vegetable 33 3
9 Carrot Vegetable 31 2
10 Eggplant Vegetable 23 1
Here, the factor levels are rather confusing to me. The items, depending on the group, share the same factor levels (1-5). I would still have expected levels from 1-10, but ordered by n depending on the respective group. Like this:
# A tibble: 10 x 4
item category n item_fct
<chr> <chr> <dbl> <dbl>
1 Banana Fruit 57 5
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 10
7 Cabbage Vegetable 70 9
8 Broccoli Vegetable 33 8
9 Carrot Vegetable 31 7
10 Eggplant Vegetable 23 6
My Question:
Ultimately, it's hard for me to understand why ggplot2 produces the following plot with this code:
df %>%
group_by(category) %>%
mutate(item = fct_reorder(item, n),
item_fct = as.numeric(item)) %>%
ggplot(aes(n, item, fill = category)) +
geom_col()
If the items are factored from 1-5 in their respective groups and thus share the same overall factor levels, then why does ggplot2 display the bars by the grouped variable?
This plot is actually my desired outcome, but I don't understand the logic behind it. Considering the factor levels, I would expect bars of the same factor level to be next to each other (basically starting with "Eggplant", then "Orange", then "Carrot", then "Blueberry" etc.).
I would've assumed that the following table would produce the plot, because here the factor levels are ordered in the same way the bars are ordered in the plot:
# A tibble: 10 x 4
item category n item_fct
<chr> <chr> <dbl> <dbl>
1 Banana Fruit 57 5
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 10
7 Cabbage Vegetable 70 9
8 Broccoli Vegetable 33 8
9 Carrot Vegetable 31 7
10 Eggplant Vegetable 23 6
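One way to investigate this is to inspect the levels of the whole column after the grouped mutate, instead of as.numeric(item), which is evaluated inside each group. The sketch below is a hedged illustration, not an authoritative answer: it assumes a recent dplyr (1.0 or later), where the per-group factors appear to be combined into a single column whose levels are the union of the group levels in order of appearance. The column item_all is a name introduced here only for illustration.

```r
library(dplyr)
library(forcats)

df <- tibble(item = c("Banana", "Ananas", "Apple", "Blueberry", "Orange",
                      "Spinach", "Cabbage", "Broccoli", "Carrot", "Eggplant"),
             category = c(rep("Fruit", 5), rep("Vegetable", 5)),
             n = c(57, 19, 14, 11, 8, 318, 70, 33, 31, 23))

res <- df %>%
  group_by(category) %>%
  mutate(item = fct_reorder(item, n),
         item_fct = as.numeric(item)) %>%  # evaluated per group, on a 5-level factor
  ungroup() %>%
  mutate(item_all = as.numeric(item))      # evaluated on the combined column

# Levels of the combined factor, which is what ggplot2 sees
levels(res$item)
res
```

If the assumption holds, levels(res$item) lists all ten items, the Fruit group first and the Vegetable group second, each ordered by n. That would explain the plot: ggplot2 orders the bars by the levels of the final column, while item_fct only records the position of each item within its own group at the time the group was processed.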

Join dataframes in dplyr by characters

So I have two dataframes:
DF1
X Y ID
banana 14 1
orange 20 2
pineapple 1 3
guava 300 4
grapes 1 5
DF2
Store State ID
Walmart NY 1
Sears AL 1;2
Target DC 3
Old Navy PA 3
Popeye's HA 5
Footlocker NJ 4;5
I join with the following and get:
df1 %>%
inner_join(df2, by = "ID")
X Y ID Store State
banana 14 1 Walmart NY
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
grapes 1 5 Popeye's HA
But because of the semicolons, the join does not capture those data points. The end result should look like this:
X Y ID Store State
banana 14 1 Walmart NY
banana 14 1 Sears AL
orange 20 2 Sears AL
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
guava 300 4 Footlocker NJ
grapes 1 5 Popeye's HA
grapes 1 5 Footlocker NJ
Using separate_rows from tidyr in combination with dplyr will get you there.
I called the first table fruit and the second one stores.
library(dplyr)
library(tidyr)
fruit %>%
inner_join(separate_rows(stores, ID) %>% mutate(ID = as.integer(ID)))
Joining, by = "ID"
X Y ID Store State
1 banana 14 1 Walmart NY
2 banana 14 1 Sears AL
3 orange 20 2 Sears AL
4 pineapple 1 3 Target DC
5 pineapple 1 3 Old Navy PA
6 guava 300 4 Footlocker NJ
7 grapes 1 5 Popeye's HA
8 grapes 1 5 Footlocker NJ
With base R, we can use strsplit with merge
lst1 <- strsplit(DF2$ID, ";")
merge(DF1, transform(DF2[rep(seq_len(nrow(DF2)),
lengths(lst1)), 1:2], ID = unlist(lst1)))
# ID X Y Store State
#1 1 banana 14 Walmart NY
#2 1 banana 14 Sears AL
#3 2 orange 20 Sears AL
#4 3 pineapple 1 Target DC
#5 3 pineapple 1 Old Navy PA
#6 4 guava 300 Footlocker NJ
#7 5 grapes 1 Popeye's HA
#8 5 grapes 1 Footlocker NJ

Use count() in more than one column and order results

I have a dataframe called table like this
a m g c1 c2 c3 c4
1 2015 5 13 bread wine <NA> <NA>
2 2015 8 30 wine eggs rice cake
3 2015 1 21 wine rice eggs <NA>
...
I want to count the elements in columns c1 to c4 and order them by frequency.
I tried to use:
library(plyr)
c<-count(table,"c1")
But I don't know how to count more than one column.
Then I want to use arrange(c,desc(freq)) to order them, but when I try it with one column the value NA is always on top, and I only want the top 3 elements, like this:
c freq
1 wine 3
2 eggs 2
3 rice 2
Can someone please suggest a solution for this? Thanks.
Use melt and table:
df1 <- read.table(text="a m g c1 c2 c3 c4
2015 5 13 bread wine NA NA
2015 8 30 wine eggs rice cake
2015 1 21 wine rice eggs NA", header=TRUE, stringsAsFactors=FALSE)
library(reshape2)  # assumption: melt() here is reshape2::melt, which accepts a matrix
c_col <- melt(as.matrix(df1[,4:7]))
sort(table(c_col$value),decreasing=TRUE)
wine eggs rice bread cake
3 2 2 1 1
With qdapTools, using the example dataframe (named table) from the question:
library(qdapTools)
counts <- data.frame(count=sort(colSums(mtabulate(table[,4:7])), decreasing=TRUE))
subset(counts,rownames(counts)!='<NA>')[1:3,1,drop=FALSE] #remove <NA>, select top 3 elements
# count
# wine 3
# eggs 2
# rice 2
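A base R variant of the same idea, as a minimal sketch without extra packages: unlist() stacks the four columns into one vector, table() counts the values (it drops NA by default), and head() keeps the three most frequent ones. The names counts, top and top3 are only illustrative.

```r
df1 <- read.table(text = "a m g c1 c2 c3 c4
2015 5 13 bread wine NA NA
2015 8 30 wine eggs rice cake
2015 1 21 wine rice eggs NA", header = TRUE, stringsAsFactors = FALSE)

# Stack columns c1..c4 into one character vector and count each value;
# table() ignores NA unless told otherwise
counts <- sort(table(unlist(df1[, c("c1", "c2", "c3", "c4")])), decreasing = TRUE)

# Keep the three most frequent items and turn them into a data frame
top <- head(counts, 3)
top3 <- data.frame(c = names(top), freq = as.vector(top),
                   stringsAsFactors = FALSE)
top3
```

Because NA is dropped at the counting step, no separate filtering of an <NA> row is needed before taking the top three.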

How to populate values of one row conditional of another row in R?

I inherited a data set coded in an unusual way. I would like to learn a less verbose way of reshaping it. The data frame looks like this:
# Input.
participant = c(rep("John",6), rep("Mary",6))
day = c(rep(1,3), rep(2,3), rep(1,3), rep(2,3))
likes = c("apples", "apples", "18", "apples", "apples", "7", "bananas", "bananas", "24", "bananas", "bananas", "3")
question = rep(c(1,1,0),4)
number = c(rep(18,3), rep(7,3), rep(24,3), rep(3,3))
df = data.frame(participant, day, question, likes)
participant day question likes
1 John 1 1 apples
2 John 1 1 apples
3 John 1 0 18
4 John 2 1 apples
5 John 2 1 apples
6 John 2 0 7
7 Mary 1 1 bananas
8 Mary 1 1 bananas
9 Mary 1 0 24
10 Mary 2 1 bananas
11 Mary 2 1 bananas
12 Mary 2 0 3
As you can see, the column likes is heterogeneous. When question equals 0, likes conveys a number chosen by the participants, not their preferred fruit. So I would like to re-code it in a new column as follows:
participant day question likes number
1 John 1 1 apples 18
2 John 1 1 apples 18
3 John 1 0 18 18
4 John 2 1 apples 7
5 John 2 1 apples 7
6 John 2 0 7 7
7 Mary 1 1 bananas 24
8 Mary 1 1 bananas 24
9 Mary 1 0 24 24
10 Mary 2 1 bananas 3
11 Mary 2 1 bananas 3
12 Mary 2 0 3 3
My current solution with base R involves subsetting the initial data frame, creating a lookup table, changing the column names and then merging the lookup table with the original data frame. But this involves several steps and I suspect there is a simpler solution. I think that tidyr might be the answer, but I don't know how to use it to spread values in one column (likes) conditional on other columns (day and question).
Do you have any suggestions? Thanks a lot!
Using the data set above, you can try the following. You group your data by participant and day and look for a row with question == 0 for each group.
library(dplyr)
group_by(df, participant, day) %>%
mutate(age = as.numeric(as.character(likes[which(question == 0)])))
Or as alistaire suggested, you can use grep() too.
group_by(df, participant, day) %>%
mutate(age = as.numeric(grep('\\d+', likes, value = TRUE)))
# participant day question likes age
# (fctr) (dbl) (dbl) (fctr) (dbl)
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
If you want to use data.table, you can do:
library(data.table)
setDT(df)[, age := as.numeric(as.character(likes[which(question == 0)])),
by = list(participant, day)]
NOTE
The present data set is a new one. Jota's answer works for the deleted data set.
Addressing the new example data:
# create a key column, overwrite it later
df$number <- paste0(df$participant, df$day) # use as a key
# create lookup table
lookup <- df[!is.na(as.numeric(as.character(df$likes))), c("number", "likes")]
# use lookup to overwrite df$number with the appropriate number
df$number <- lookup$likes[match(df$number, lookup$number)]
# participant day question likes number
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
The warning about NAs being introduced by coercion is expected, because the non-numeric strings are converted to numeric (as.numeric(as.character(df$likes))).
If your data are ordered as in the example, you can use na.locf from the zoo package:
library(zoo)
df$age <- na.locf(as.numeric(as.character(df$likes)), fromLast = TRUE)
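The group-wise lookup used in the answers above can also be sketched in base R with ave(), which does not depend on the row order. This is a minimal sketch under one assumption: each participant/day group contains exactly one entry of likes that consists only of digits. The helper name pick_number is hypothetical.

```r
# Example data from the question, with likes kept as character
participant <- c(rep("John", 6), rep("Mary", 6))
day <- c(rep(1, 3), rep(2, 3), rep(1, 3), rep(2, 3))
likes <- c("apples", "apples", "18", "apples", "apples", "7",
           "bananas", "bananas", "24", "bananas", "bananas", "3")
question <- rep(c(1, 1, 0), 4)
df <- data.frame(participant, day, question, likes, stringsAsFactors = FALSE)

# Hypothetical helper: return the first all-digit entry of a group
pick_number <- function(v) v[grepl("^[0-9]+$", v)][1]

# ave() recycles the single value returned per participant/day group
df$number <- as.numeric(ave(df$likes, df$participant, df$day, FUN = pick_number))
df
```

Unlike the na.locf approach, this does not require the numeric row to come last within each group, but it does rely on the digit pattern rather than on question == 0.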

Resources