duplicate data and dates - r

I am a bit of a newbie to R and have two questions. I have a dataframe, say FruitsNew:
Fruit
1 Apples
2 Oranges
3 Bananas
Q1) I want to duplicate the data and add monthly dates starting from 31-May-2000 to the above, for example
Fruit date
1 Apples 2000-05-31
2 Oranges 2000-05-31
3 Bananas 2000-05-31
4 Apples 2000-06-30
5 Oranges 2000-06-30
6 Bananas 2000-06-30
and so on...
Q2) After I obtain the above, I merge it with a Sales dataset, which is only available yearly at the end of May, so it looks like this:
Fruit date sales
1 Apples 2000-05-31 1000
2 Oranges 2000-05-31
3 Bananas 2000-05-31 500
4 Apples 2000-06-30
5 Oranges 2000-06-30
6 Bananas 2000-06-30
...
7 Apples 2001-05-31 2000
8 Oranges 2001-05-31 200
9 Bananas 2001-05-31 600
Oranges have no sales for 2000, so I would like to fill them with 0 for all the monthly dates between 05/31/2000 and the next available sales data, which occurs on 05/31/2001.
The other fruits should keep the same sales number between 05/31/2000 and 05/31/2001, and so on.
The above is just an example, but the idea is: if sales are missing, fill in the previously available sales number; if there is no previously available number, fill in 0.
Something like this
Fruit date sales
1 Apples 2000-05-31 1000
2 Oranges 2000-05-31 0
3 Bananas 2000-05-31 500
4 Apples 2000-06-30 1000
5 Oranges 2000-06-30 0
6 Bananas 2000-06-30 500
7 Apples 2001-05-31 2000
8 Oranges 2001-05-31 200
9 Bananas 2001-05-31 600

Let's assume your first dataframe is named core and your yearly sales dataframe is named year.sale; the merged result is stored in merg.yr:
merg.yr <- merge(core, year.sale, by.x=1:2, by.y=1:2, all.x=TRUE)
merg.yr[is.na(merg.yr)] <- 0
To build the core df, I created dates on the first of each month and then subtracted 1 from each to get the last day of the prior month. I then repeated each date three times and let the data.frame function recycle the fruits to fill in the rest:
core <- data.frame(fruit =c('Apples','Oranges','Bananas'),
date=rep( as.Date(seq(ISOdate(2000, 6,1),
ISOdate(2001,6,1), by='month')) -1,
each=3)
)
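The merge above sets every missing value to 0, so it does not carry a fruit's last known sales forward as Q2 asks. A minimal base R sketch of that fill follows. The contents of year.sale are an assumption taken from the question's example, and carry_forward() is a hypothetical helper name, not part of any package.

```r
# Assumed yearly sales data, copied from the example in the question.
year.sale <- data.frame(
  fruit = c("Apples", "Bananas", "Apples", "Oranges", "Bananas"),
  date  = as.Date(c("2000-05-31", "2000-05-31",
                    "2001-05-31", "2001-05-31", "2001-05-31")),
  sales = c(1000, 500, 2000, 200, 600)
)

# Month-end dates: the first of each following month minus one day.
fruits     <- c("Apples", "Oranges", "Bananas")
month_ends <- seq(as.Date("2000-06-01"), as.Date("2001-06-01"), by = "month") - 1
core <- data.frame(
  fruit = rep(fruits, times = length(month_ends)),
  date  = rep(month_ends, each = length(fruits))
)

# Left join, then sort so each fruit's rows are in date order.
merged <- merge(core, year.sale, by = c("fruit", "date"), all.x = TRUE)
merged <- merged[order(merged$fruit, merged$date), ]

# Hypothetical helper: last non-missing value so far, or 0 if there is none yet.
carry_forward <- function(x) {
  idx <- cumsum(!is.na(x))        # how many known values seen up to each row
  c(0, x[!is.na(x)])[idx + 1]     # idx == 0 picks the leading 0
}
merged$sales <- ave(merged$sales, merged$fruit, FUN = carry_forward)
```

With this data, Oranges stay at 0 until 2001-05-31, while Apples and Bananas repeat their May 2000 figures until the next yearly value arrives.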

Related

calculate difference between Rows in R by setting specific target date

I am new to R. My df is as follows, and I would like to set my benchmark comparison date to 2010-02-01 and compare every result against the row with this date.
Here is my data frame; I want to be able to generate the DIFF column with R:
DATE FRUIT LOCATION VALUE DIFF
2010-01-01 Apple USA 2 -2
2010-02-01 Apple USA 4 0
2020-11-01 Apple USA 100 96
2020-12-01 Apple USA 54 50
2010-01-01 Apple China 0 -4
2010-02-01 Apple China 4 0
2020-11-01 Apple China 40 36
2020-12-01 Apple China 44 40
2010-01-01 Banana USA 1 -1
2010-02-01 Banana USA 2 0
2020-11-01 Banana USA 12 10
2020-12-01 Banana USA 13 11
2010-01-01 Banana China 0 -100
2010-02-01 Banana China 100 0
2020-11-01 Banana China 130 30
2020-12-01 Banana China 145 45
Thank you!
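Using dplyr, you can do the following (column names as in the table above):
library(dplyr)
compare_date <- as.Date('2010-02-01')
df %>%
  mutate(DATE = as.Date(DATE)) %>%
  # one baseline row per fruit/location pair
  group_by(FRUIT, LOCATION) %>%
  # subtract the value recorded on the comparison date within each group
  mutate(DIFF = VALUE - VALUE[match(compare_date, DATE)]) -> result
result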
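The same baseline subtraction can be sketched in base R with a lookup table and merge(). This uses only a subset of the question's rows. The names base and BASE are chosen here for illustration, and the sketch assumes exactly one row per FRUIT/LOCATION pair on the comparison date.

```r
# A subset of the question's data (Apple/USA and Banana/China only).
df <- data.frame(
  DATE = as.Date(c("2010-01-01", "2010-02-01", "2020-11-01",
                   "2010-01-01", "2010-02-01", "2020-11-01")),
  FRUIT    = rep(c("Apple", "Banana"), each = 3),
  LOCATION = rep(c("USA", "China"), each = 3),
  VALUE    = c(2, 4, 100, 0, 100, 130)
)

compare_date <- as.Date("2010-02-01")

# Lookup table: the value on the comparison date for each fruit/location pair.
base <- df[df$DATE == compare_date, c("FRUIT", "LOCATION", "VALUE")]
names(base)[names(base) == "VALUE"] <- "BASE"

# Attach the baseline to every row of its group and take the difference.
result <- merge(df, base, by = c("FRUIT", "LOCATION"))
result$DIFF <- result$VALUE - result$BASE
result <- result[order(result$FRUIT, result$LOCATION, result$DATE), ]
```

A group with no row on the comparison date would drop out of this inner merge; adding all.x = TRUE keeps it with a DIFF of NA.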

R replace NA values in a dataframe column with existing values in other rows and same column

I have the following dataframe:
FOOD ID DATE PRICE DES
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
3 1/25/2020 4 Tomato
3 1/25/2020 NA Tomato
3 1/1/2019 NA Tomato
3 1/1/2019 5 Tomato
I would need to replace (where/when possible) the NA values when a price for the same FOOD ID and same DATE is available. Expected output:
FOOD ID DATE PRICE DES
1 1/1/2020 100 Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 100 Tuna
3 1/25/2020 4 Tomato
3 1/25/2020 4 Tomato
3 1/1/2019 5 Tomato
3 1/1/2019 5 Tomato
Without using a for loop, is there a way I could easily perform such a task?
I guess one way could be to use dplyr, group the data by FOOD ID and DATE to get an "average" PRICE, delete the PRICE column from the original dataframe, and finally merge the grouped data with the original dataframe, but this seems an odd way to do it.
Thanks for the help.
library(dplyr)
library(tidyr)
df %>%
  group_by(FOOD_ID, DATE) %>%
  fill(PRICE, .direction = 'updown')
# A tibble: 8 x 4
# Groups: FOOD_ID, DATE [3]
FOOD_ID DATE PRICE DES
<int> <chr> <int> <chr>
1 1 1/1/2020 100 Tuna
2 1 1/1/2020 100 Tuna
3 1 1/1/2020 100 Tuna
4 1 1/1/2020 100 Tuna
5 3 1/25/2020 4 Tomato
6 3 1/25/2020 4 Tomato
7 3 1/1/2019 5 Tomato
8 3 1/1/2019 5 Tomato
We can use the data itself to feed prices back in.
Data:
df <- read.table(header = TRUE, text= "FOOD_ID DATE PRICE DES
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
3 1/25/2020 4 Tomato
3 1/25/2020 NA Tomato
3 1/1/2019 NA Tomato
3 1/1/2019 5 Tomato")
Find distinct prices for each product on each date.
prices <- df %>%
filter(!is.na(PRICE)) %>%
group_by(FOOD_ID, DATE, DES) %>%
distinct(FOOD_ID, .keep_all = TRUE)
Join these prices back into the original dataframe, which assigns a price to every row for each product and date. (I removed the original PRICE column first because it is fed back in from the prices df.)
new_df <- df %>%
select(-PRICE) %>%
left_join(prices, by = c('FOOD_ID', 'DATE', 'DES'))
Output of new_df:
FOOD_ID DATE DES PRICE
1 1 1/1/2020 Tuna 100
2 1 1/1/2020 Tuna 100
3 1 1/1/2020 Tuna 100
4 1 1/1/2020 Tuna 100
5 3 1/25/2020 Tomato 4
6 3 1/25/2020 Tomato 4
7 3 1/1/2019 Tomato 5
8 3 1/1/2019 Tomato 5
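If dplyr and tidyr are not available, a base R sketch with ave() can do the same per-group fill. It takes the first non-missing PRICE within each FOOD_ID and DATE group, which assumes a group never holds two different prices; a group with no price at all stays NA.

```r
df <- read.table(header = TRUE, text = "FOOD_ID DATE PRICE DES
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
3 1/25/2020 4 Tomato
3 1/25/2020 NA Tomato
3 1/1/2019 NA Tomato
3 1/1/2019 5 Tomato")

# Within each FOOD_ID/DATE group, replace NA with the first known price.
df$PRICE <- ave(df$PRICE, df$FOOD_ID, df$DATE, FUN = function(p) {
  known <- p[!is.na(p)]
  if (length(known) > 0) p[is.na(p)] <- known[1]
  p
})
```

Unlike the join approach, this keeps the original row and column order.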

Use count() in more than one column and order results

I have a dataframe called table like this
a m g c1 c2 c3 c4
1 2015 5 13 bread wine <NA> <NA>
2 2015 8 30 wine eggs rice cake
3 2015 1 21 wine rice eggs <NA>
...
I want to count the elements in column c1 to c4 and order them
I tried to use:
library(plyr)
c<-count(table,"c1")
But I don't know how to count more than one column.
Then I want to use arrange(c,desc(freq)) to order them, but when I try with one column the value NA is always on top, and I want only the top 3 elements, like this:
c freq
1 wine 3
2 eggs 2
3 rice 2
Can someone please suggest a solution for this? Thanks.
Use melt (from the reshape2 package) and table:
library(reshape2)
df1 <- read.table(text="a m g c1 c2 c3 c4
2015 5 13 bread wine NA NA
2015 8 30 wine eggs rice cake
2015 1 21 wine rice eggs NA", header=TRUE, stringsAsFactors=FALSE)
c_col <- melt(as.matrix(df1[,4:7]))
sort(table(c_col$value),decreasing=TRUE)
wine eggs rice bread cake
3 2 2 1 1
With qdapTools, using the example dataframe (named table) provided:
library(qdapTools)
counts <- data.frame(count=sort(colSums(mtabulate(table[,4:7])), decreasing=TRUE))
subset(counts,rownames(counts)!='<NA>')[1:3,1,drop=FALSE] #remove <NA>, select top 3 elements
# count
# wine 3
# eggs 2
# rice 2
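The same count can also be sketched without extra packages: unlist() flattens columns c1 to c4 into one vector, and table() drops NA values by default. The name top3 is chosen here for illustration.

```r
df1 <- read.table(text = "a m g c1 c2 c3 c4
2015 5 13 bread wine NA NA
2015 8 30 wine eggs rice cake
2015 1 21 wine rice eggs NA", header = TRUE, stringsAsFactors = FALSE)

# Flatten c1..c4 into a single character vector and count each item.
counts <- sort(table(unlist(df1[, 4:7])), decreasing = TRUE)

# Keep the three most frequent items.
top3 <- head(counts, 3)
top3
```

Converting with as.data.frame(top3) gives a two-column data frame if the item/frequency layout from the question is needed.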

How to populate values of one row conditional of another row in R?

I inherited a data set coded in an unusual way. I would like to learn a less verbose way of reshaping it. The data frame looks like this:
# Input.
participant = c(rep("John",6), rep("Mary",6))
day = c(rep(1,3), rep(2,3), rep(1,3), rep(2,3))
likes = c("apples", "apples", "18", "apples", "apples", "7", "bananas", "bananas", "24", "bananas", "bananas", "3")
question = rep(c(1,1,0),4)
number = c(rep(18,3), rep(7,3), rep(24,3), rep(3,3))
df = data.frame(participant, day, question, likes)
participant day question likes
1 John 1 1 apples
2 John 1 1 apples
3 John 1 0 18
4 John 2 1 apples
5 John 2 1 apples
6 John 2 0 7
7 Mary 1 1 bananas
8 Mary 1 1 bananas
9 Mary 1 0 24
10 Mary 2 1 bananas
11 Mary 2 1 bananas
12 Mary 2 0 3
As you can see, the column likes is heterogeneous. When question equals 0, likes conveys a number chosen by the participants, not their preferred fruit. So I would like to re-code it in a new column as follows:
participant day question likes number
1 John 1 1 apples 18
2 John 1 1 apples 18
3 John 1 0 18 18
4 John 2 1 apples 7
5 John 2 1 apples 7
6 John 2 0 7 7
7 Mary 1 1 bananas 24
8 Mary 1 1 bananas 24
9 Mary 1 0 24 24
10 Mary 2 1 bananas 3
11 Mary 2 1 bananas 3
12 Mary 2 0 3 3
My current solution with base R involves subsetting the initial data frame, creating a lookup table, changing the column names and then merging the lookup table with the original data frame. But this involves several steps and I suspect there is a simpler solution. I think that tidyr might be the answer, but I don't know how to use it to spread values in one column (likes) conditional on other columns (day and question).
Do you have any suggestions? Thanks a lot!
Using the data set above, you can try the following. You group your data by participant and day and look for a row with question == 0 for each group.
library(dplyr)
group_by(df, participant, day) %>%
mutate(age = as.numeric(as.character(likes[which(question == 0)])))
Or as alistaire suggested, you can use grep() too.
group_by(df, participant, day) %>%
mutate(age = as.numeric(grep('\\d+', likes, value = TRUE)))
# participant day question likes age
# (fctr) (dbl) (dbl) (fctr) (dbl)
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
If you want to use data.table, you can do:
library(data.table)
setDT(df)[, age := as.numeric(as.character(likes[which(question == 0)])),
by = list(participant, day)]
NOTE
The present data set is a new one. Jota's answer works for the deleted data set.
Addressing the new example data:
# create a key column, overwrite it later
df$number <- paste0(df$participant, df$day) # use as a key
# create lookup table
lookup <- df[!is.na(as.numeric(as.character(df$likes))), c("number", "likes")]
# use lookup to overwrite df$number with the appropriate number
df$number <- lookup$likes[match(df$number, lookup$number)]
# participant day question likes number
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
The warning about NAs being introduced by coercion is expected, because as.numeric(as.character(df$likes)) tries to convert the fruit names to numbers.
If your data are ordered as in the example, you can use na.locf from the zoo package:
library(zoo)
df$age <- na.locf(as.numeric(as.character(df$likes)), fromLast = TRUE)
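A base R sketch that avoids both the coercion warning and the dependence on row order is to use the question == 0 rows as a lookup table. It assumes exactly one question == 0 row per participant and day; key and is_num are names chosen here for illustration.

```r
participant <- c(rep("John", 6), rep("Mary", 6))
day         <- c(rep(1, 3), rep(2, 3), rep(1, 3), rep(2, 3))
likes       <- c("apples", "apples", "18", "apples", "apples", "7",
                 "bananas", "bananas", "24", "bananas", "bananas", "3")
question    <- rep(c(1, 1, 0), 4)
df <- data.frame(participant, day, question, likes, stringsAsFactors = FALSE)

# Rows where likes holds the chosen number rather than a fruit.
is_num <- df$question == 0

# One key per participant/day combination.
key <- paste(df$participant, df$day)

# Look up each row's key among the numeric rows and copy that number over.
df$number <- as.numeric(df$likes[is_num])[match(key, key[is_num])]
```

Only the question == 0 entries are converted, so as.numeric() never sees a fruit name.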

In R: Duplicate rows except for the first row based on condition

I have a data.table dt:
library(data.table)
names <- c("john","mary","mary","mary","mary","mary","mary","tom","tom","tom","mary","john","john","john","tom","tom")
dates <- c(as.Date("2010-06-01"),as.Date("2010-06-01"),as.Date("2010-06-05"),as.Date("2010-06-09"),as.Date("2010-06-13"),as.Date("2010-06-17"),as.Date("2010-06-21"),as.Date("2010-07-09"),as.Date("2010-07-13"),as.Date("2010-07-17"),as.Date("2010-06-01"),as.Date("2010-08-01"),as.Date("2010-08-05"),as.Date("2010-08-09"),as.Date("2010-09-03"),as.Date("2010-09-04"))
shifts_missed <- c(2,11,11,11,11,11,11,6,6,6,1,5,5,5,0,2)
shift <- c("Day","Night","Night","Night","Night","Night","Night","Day","Day","Day","Day","Night","Night","Night","Night","Day")
df <- data.frame(names=names, dates=dates, shifts_missed=shifts_missed, shift=shift)
dt <- as.data.table(df)
names dates shifts_missed shift
john 2010-06-01 2 Day
mary 2010-06-01 11 Night
mary 2010-06-05 11 Night
mary 2010-06-09 11 Night
mary 2010-06-13 11 Night
mary 2010-06-17 11 Night
mary 2010-06-21 11 Night
tom 2010-07-09 6 Day
tom 2010-07-13 6 Day
tom 2010-07-17 6 Day
mary 2010-06-01 1 Day
john 2010-08-01 5 Night
john 2010-08-05 5 Night
john 2010-08-09 5 Night
tom 2010-09-03 0 Night
tom 2010-09-04 2 Day
Ultimately, what I want is to get the following:
names dates shifts_missed shift count
john 2010-06-01 2 Day 1
mary 2010-06-01 11 Night 1
mary 2010-06-05 11 Night 1
mary 2010-06-09 11 Night 1
mary 2010-06-13 11 Night 1
mary 2010-06-17 11 Night 1
mary 2010-06-21 11 Night 1
tom 2010-07-09 6 Day 1
tom 2010-07-13 6 Day 1
tom 2010-07-17 6 Day 1
mary 2010-06-01 1 Day 1
john 2010-08-01 5 Night 1
john 2010-08-05 5 Night 1
john 2010-08-09 5 Night 1
tom 2010-09-03 0 Night 0
tom 2010-09-04 2 Day 1
john 2010-06-01 2 Night 1
mary 2010-06-05 11 Day 1
mary 2010-06-09 11 Day 1
mary 2010-06-13 11 Day 1
mary 2010-06-17 11 Day 1
mary 2010-06-21 11 Day 1
tom 2010-07-09 6 Night 1
tom 2010-07-13 6 Night 1
tom 2010-07-17 6 Night 1
john 2010-08-05 5 Day 1
john 2010-08-09 5 Day 1
tom 2010-09-04 2 Night 1
As you can see, the second half of the data is almost a duplicate of the first half. However, if shifts_missed = 0, it should not be duplicated, and if shifts_missed is odd, the first row should not be duplicated but the remaining rows should. It should then add a 1 in the count column for all except when shifts_missed = 0.
I've seen some answers that speak about !duplicate or unique, but these values in shifts_missed are not unique. I'm sure this isn't overly complicated and is probably a multi-step process, but I can't figure out how to isolate the first rows of the odd shifts_missed column.
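# Flag the rows to duplicate: every row of a group whose first shifts_missed is even,
# and every row except the first when it is odd.
dt[, is.in := if (shifts_missed[1] %% 2 == 0) TRUE else c(FALSE, rep(TRUE, .N - 1)),
   by = .(names, shift)]
rbind(dt, dt[is.in & shifts_missed != 0])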
Adding the extra column part should be obvious.
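One way the remaining steps might look is sketched below on a reduced version of the data. The expected output in the question also swaps Day and Night in the duplicated rows; treating that as a requirement is an assumption read off that output, as is setting count to 0 only where shifts_missed is 0. The names dup and result are chosen here for illustration.

```r
library(data.table)

# Reduced example: one even group, one odd group, one zero row.
dt <- data.table(
  names = c("john", "mary", "mary", "mary", "tom"),
  dates = as.Date(c("2010-06-01", "2010-06-01", "2010-06-05",
                    "2010-06-09", "2010-09-03")),
  shifts_missed = c(2, 11, 11, 11, 0),
  shift = c("Day", "Night", "Night", "Night", "Night")
)

# Same flag as in the answer above.
dt[, is.in := if (shifts_missed[1] %% 2 == 0) TRUE else c(FALSE, rep(TRUE, .N - 1)),
   by = .(names, shift)]

# Rows to duplicate, with the shift swapped (assumption, see lead-in).
dup <- dt[is.in & shifts_missed != 0]
dup[, shift := ifelse(shift == "Day", "Night", "Day")]

# Stack originals and duplicates, drop the helper flag, add the count column.
result <- rbind(dt, dup)
result[, is.in := NULL]
result[, count := as.integer(shifts_missed != 0)]
```

Here john's single row and mary's second and third rows are duplicated, while tom's row with shifts_missed = 0 appears once with count 0.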

Resources