Insert row to fill in missing date in R [duplicate]

This question already has answers here:
Insert rows for missing dates/times
(9 answers)
How to add only missing Dates in Dataframe
(3 answers)
Closed 3 years ago.
I have a dataset that looks something like this:
Person date Amount
A 2019-01 900
A 2019-03 600
A 2019-04 300
A 2019-05 0
B 2019-04 1200
B 2019-07 800
B 2019-08 400
B 2019-09 0
As you'll notice in the "date" column, there are missing dates, such as '2019-02' for person A and '2019-05' and '2019-06' for person B. I would like to insert rows with the missing date and amount equal to the one before it (see expected result below).
I have tried grouping by Person, but I don't know how to proceed from there. I've also tried converting the 'date' and 'Amount' columns to lists and filling in the gaps before putting them back into the dataframe. I was wondering if there is a more convenient way of doing this, in particular one that gets the same result without having to extract lists from the original dataframe.
Ideally, I would like to end up with a dataframe that looks something like this:
Person date Amount
A 2019-01 900
A 2019-02 900
A 2019-03 600
A 2019-04 300
A 2019-05 0
B 2019-04 1200
B 2019-05 1200
B 2019-06 1200
B 2019-07 800
B 2019-08 400
B 2019-09 0
I hope I was able to make my problem clear.
Thanks in advance.

We can first convert date to an actual Date object (date1) by pasting "-01" onto the end. Then, using complete, we create a monthly sequence of dates for each Person, which adds a row (with Amount set to NA) for every missing month. We then use fill to carry the previous Amount forward into those rows, and finally strip the "-01" from date1 again to get date back in its original form.
library(dplyr)
library(tidyr)
df %>%
  mutate(date1 = as.Date(paste0(date, "-01"))) %>%
  group_by(Person) %>%
  complete(date1 = seq(min(date1), max(date1), by = "1 month")) %>%
  fill(Amount) %>%
  mutate(date = sub("-01$", "", date1)) %>%
  select(-date1)
# Person date Amount
# <fct> <chr> <int>
# 1 A 2019-01 900
# 2 A 2019-02 900
# 3 A 2019-03 600
# 4 A 2019-04 300
# 5 A 2019-05 0
# 6 B 2019-04 1200
# 7 B 2019-05 1200
# 8 B 2019-06 1200
# 9 B 2019-07 800
#10 B 2019-08 400
#11 B 2019-09 0
data
df <- structure(list(Person = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("A", "B"), class = "factor"), date = structure(c(1L,
2L, 3L, 4L, 3L, 5L, 6L, 7L), .Label = c("2019-01", "2019-03",
"2019-04", "2019-05", "2019-07", "2019-08", "2019-09"), class = "factor"),
Amount = c(900L, 600L, 300L, 0L, 1200L, 800L, 400L, 0L)),
class = "data.frame", row.names = c(NA, -8L))
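For comparison, the same complete-then-fill idea can be sketched in base R without extra packages. This is only an illustrative sketch: the helper name fill_months is made up for this example, and it assumes that date is a "YYYY-MM" string and that each Person has at most one row per month.

```r
# Sample data from the question, with plain character columns
df <- data.frame(
  Person = c("A", "A", "A", "A", "B", "B", "B", "B"),
  date   = c("2019-01", "2019-03", "2019-04", "2019-05",
             "2019-04", "2019-07", "2019-08", "2019-09"),
  Amount = c(900L, 600L, 300L, 0L, 1200L, 800L, 400L, 0L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: expand one person's rows to every month in range
fill_months <- function(d) {
  d <- d[order(d$date), ]                              # "YYYY-MM" sorts chronologically
  observed   <- as.Date(paste0(d$date, "-01"))         # first day of each observed month
  all_months <- seq(min(observed), max(observed), by = "1 month")
  # for each month, index of the latest observed month that is not after it
  idx <- findInterval(as.numeric(all_months), as.numeric(observed))
  data.frame(Person = d$Person[1],
             date   = format(all_months, "%Y-%m"),
             Amount = d$Amount[idx],                   # carries the previous Amount forward
             stringsAsFactors = FALSE)
}

res <- do.call(rbind, lapply(split(df, df$Person), fill_months))
rownames(res) <- NULL
res
```

findInterval does the "last value carried forward" lookup here, so no NA rows have to be created and filled afterwards.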

Related

Reshaping data.frame with a by-group where id variable repeats [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 2 years ago.
I want to reshape/ rearrange a dataset, that is stored as a data.frame with 2 columns:
id (non-unique, i.e. can repeat over several rows) --> stored as character
value --> stored as numeric value (range 1:3)
Sample data:
library(dplyr)
id <- as.character(1001:1003)
val_list <- data.frame(sample(1:3, size=12, replace=TRUE))
have <- data.frame(cbind(rep(id, 4), val_list))
colnames(have) <- c("id", "values")
have <- have %>% arrange(id)
This gives me the following output:
id values
1 1001 2
2 1001 2
3 1001 2
4 1001 3
5 1002 2
6 1002 3
7 1002 2
8 1002 2
9 1003 1
10 1003 3
11 1003 1
12 1003 2
What I want:
want <- data.frame(cbind(have[1:4, 2],
have[5:8, 2],
have[9:12, 2]))
colnames(want) <- id
Output of want:
1001 1002 1003
1 2 2 1
2 2 3 3
3 2 2 1
4 3 2 2
My original dataset has more than 1000 distinct "id" values and more than 50 "value" entries per id.
I want to slice the dataset to get a new data.frame in which each "id" becomes one column listing its "value" entries.
It is possible to solve this with a loop, but I would like a vectorized solution.
If possible with base R as "one-liner", but other solutions also appreciated.
You can create a row index within each id and use pivot_wider.
library(dplyr)
have %>%
  group_by(id) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  tidyr::pivot_wider(names_from = id, values_from = values) %>%
  select(-row)
# (the values shown depend on the random sample that was drawn)
# A tibble: 4 x 3
# `1001` `1002` `1003`
# <int> <int> <int>
#1 1 3 1
#2 3 2 3
#3 2 2 3
#4 2 2 3
Or using data.table
library(data.table)
dcast(setDT(have), rowid(id)~id, value.var = 'values')
data
have <- structure(list(id = c(1001L, 1001L, 1001L, 1001L, 1002L, 1002L,
1002L, 1002L, 1003L, 1003L, 1003L, 1003L), values = c(2L, 2L,
2L, 3L, 2L, 3L, 2L, 2L, 1L, 3L, 1L, 2L)), class = "data.frame",
row.names = c(NA, -12L))
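Since the question also asks for a base R one-liner, one possible sketch is to split the values by id and bind the pieces as columns. It assumes every id has the same number of rows (otherwise data.frame() would fail on unequal column lengths); the sample values are the ones printed in the question.

```r
# Long-format sample data, id stored as character as in the question
have <- data.frame(
  id     = rep(c("1001", "1002", "1003"), each = 4),
  values = c(2L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 1L, 3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# split() returns one vector of values per id; data.frame() turns the list
# into columns, and check.names = FALSE keeps the numeric-looking names as-is
want <- data.frame(split(have$values, have$id), check.names = FALSE)
want
```

The order of values within each column follows the row order of have, so no explicit row index is needed.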

How to select top 3 columns/variables according to the sum values of each column?

I have a dataset like this
>head(grocery)
milk bread juice honey eggs beef ... (140 more variables)
1 4 3 1 4 2
2 5 4 2 4 3
1 2 6 0 7 0
0 1 5 3 3 1
4 10 2 1 5 8
I want to select the 3 columns with the largest sums and show their rank, column name and sum in descending order, like this:
1 eggs 23
2 bread 22
3 juice 20
How can I do this?
Thank you very much for helping!
With dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
  gather(key, value) %>%
  group_by(key) %>%
  summarise(Sum = sum(value)) %>%
  arrange(desc(Sum)) %>%
  top_n(3, Sum)
# A tibble: 3 x 2
#   key     Sum
#   <chr> <int>
# 1 eggs     23
# 2 bread    22
# 3 juice    20
Data:
df <- structure(list(milk = c(1L, 2L, 1L, 0L, 4L), bread = c(4L, 5L,
2L, 1L, 10L), juice = c(3L, 4L, 6L, 5L, 2L), honey = c(1L, 2L,
0L, 3L, 1L), eggs = c(4L, 4L, 7L, 3L, 5L), beef = c(2L, 3L, 0L,
1L, 8L)), class = "data.frame", row.names = c(NA, -5L))
Original answer
In base R you can compute the sum of each column, sort the resulting values in decreasing order, take the first 3 and cbind them to get the desired output:
cbind(sort(colSums(dat), T)[1:3])
# [,1]
#eggs 23
#bread 22
#juice 20
Updated answer
...how I can go back to the original data set from this solution?...
Here I subset the original data set by the names of the three columns with the largest column sums. There is probably a better solution, but this is the one I can find right now.
dat[, names(sort(colSums(dat), T)[1:3])]
# eggs bread juice
#1 4 4 3
#2 4 5 4
#3 7 2 6
#4 3 1 5
#5 5 10 2
Data:
dat <- read.table(
text = "milk bread juice honey eggs beef
1 4 3 1 4 2
2 5 4 2 4 3
1 2 6 0 7 0
0 1 5 3 3 1
4 10 2 1 5 8",
stringsAsFactors = F,
header = T
)
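If the rank should appear as its own column, as in the layout the question asks for, the colSums approach above can be wrapped in a small data frame. This is a sketch only; the column names rank, item and total are chosen for illustration.

```r
# Sample data from the question
dat <- data.frame(
  milk  = c(1, 2, 1, 0, 4),
  bread = c(4, 5, 2, 1, 10),
  juice = c(3, 4, 6, 5, 2),
  honey = c(1, 2, 0, 3, 1),
  eggs  = c(4, 4, 7, 3, 5),
  beef  = c(2, 3, 0, 1, 8)
)

# Sum every column, sort in decreasing order, keep the three largest
sums <- sort(colSums(dat), decreasing = TRUE)[1:3]

# One row per selected column: rank, column name, column sum
top3 <- data.frame(rank  = seq_along(sums),
                   item  = names(sums),
                   total = unname(sums),
                   stringsAsFactors = FALSE)
top3
```

The original columns can still be recovered afterwards with dat[, top3$item].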
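Take a look at this tiny example:
# build a small data frame (a plain list would make apply() fail)
data <- data.frame(c = c(1,2,3,4),
                   b = c(4,5,6,7),
                   a = c(1,1,1,1))
# column sums in one call
apply(data, 2, sum)
# the same with an explicit loop over the column names
arraysum <- NULL
for(i in names(data)){
  arraysum$name <- append(arraysum$name, i)
  arraysum$sum <- append(arraysum$sum, sum(data[[i]]))
}
arraysum$sum
# column names ordered by decreasing sum
arraysum$name[order(arraysum$sum, decreasing = TRUE)]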

Removing duplicates based on 3 columns in R

I have a data set of 300k+ cases in which a customer id may be repeated several times. Each customer id also has a date and a rank associated with it. I'd like to keep only one row per customer id: the one with the latest date and, if an id has duplicate rows with the same date, the one with the rank closest to 1. An example of my data is like this:
Customer.ID Date Rank
576293 8/13/2012 2
576293 11/16/2015 6
581252 11/22/2013 4
581252 11/16/2011 6
581252 1/4/2016 5
581600 1/12/2015 3
581600 1/12/2015 2
582560 4/13/2016 1
591674 3/21/2012 6
586334 3/30/2014 1
Ideal outcome would then be like this:
Customer.ID Date Rank
576293 11/16/2015 6
581252 1/4/2016 5
581600 1/12/2015 2
582560 4/13/2016 1
591674 3/21/2012 6
586334 3/30/2014 1
With the desired output of the OP clarified:
We can also do this with base R, which will be faster than the dplyr approach below that uses group_by(Customer.ID), since we do not have to loop over all unique Customer.ID values:
df <- df[order(-df$Customer.ID, as.Date(df$Date, format="%m/%d/%Y"), -df$Rank, decreasing=TRUE),]
res <- df[!duplicated(df$Customer.ID),]
Notes:
First, sort by Customer.ID in ascending order followed by Date in descending order followed by Rank in ascending order.
Remove the duplicates in Customer.ID so that only the first row for each Customer.ID is kept.
The result using your posted data as a data frame df (without converting the Date column) in ascending order for Customer.ID:
print(res)
## Customer.ID Date Rank
##2 576293 11/16/2015 6
##5 581252 1/4/2016 5
##7 581600 1/12/2015 2
##8 582560 4/13/2016 1
##10 586334 3/30/2014 1
##9 591674 3/21/2012 6
Data:
df <- structure(list(Customer.ID = c(591674L, 586334L, 582560L, 581600L,
581252L, 576293L), Date = c("3/21/2012", "3/30/2014", "4/13/2016",
"1/12/2015", "1/4/2016", "11/16/2015"), Rank = c(6L, 1L, 1L,
2L, 5L, 6L)), .Names = c("Customer.ID", "Date", "Rank"), row.names = c(9L,
10L, 8L, 7L, 5L, 2L), class = "data.frame")
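The sort-then-deduplicate steps of the base R answer can be traced on the ten example rows from the question. The sketch below is a variant rather than the answer's exact call: instead of decreasing = TRUE with negated keys, it negates only the numeric form of the date, which some readers may find easier to follow.

```r
# The ten example rows from the question
df <- data.frame(
  Customer.ID = c(576293L, 576293L, 581252L, 581252L, 581252L,
                  581600L, 581600L, 582560L, 591674L, 586334L),
  Date = c("8/13/2012", "11/16/2015", "11/22/2013", "11/16/2011", "1/4/2016",
           "1/12/2015", "1/12/2015", "4/13/2016", "3/21/2012", "3/30/2014"),
  Rank = c(2L, 6L, 4L, 6L, 5L, 3L, 2L, 1L, 6L, 1L),
  stringsAsFactors = FALSE
)

# Parse the month/day/year strings so dates compare chronologically
d <- as.Date(df$Date, format = "%m/%d/%Y")

# Customer.ID ascending, Date descending (negated day count), Rank ascending
sorted <- df[order(df$Customer.ID, -as.numeric(d), df$Rank), ]

# duplicated() flags every row after the first per id, so negating it keeps
# the latest date and, on ties, the lowest rank
res <- sorted[!duplicated(sorted$Customer.ID), ]
res
```

Customer 581600 has two rows with the same date, and the one with Rank 2 is kept because it sorts ahead of Rank 3.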
If you want to keep only the row with the latest date (and, on ties, the lowest rank) for each Customer.ID, you can do the following using dplyr:
library(dplyr)
res <- df %>% arrange(desc(Date), Rank) %>%
  group_by(Customer.ID) %>% slice(1) %>%
  ungroup() %>% arrange(Customer.ID)
Notes:
Sort using arrange by Date in descending order and Rank in ascending order, then group_by Customer.ID.
slice(1) to keep only the first row from each Customer.ID.
Finally, ungroup and sort by Customer.ID to get your desired result.
Using your data as a data frame df with the Date column converted to the Date class:
print(res)
## # A tibble: 6 x 3
## Customer.ID Date Rank
## <int> <date> <int>
##1 576293 2015-11-16 6
##2 581252 2016-01-04 5
##3 581600 2015-01-12 2
##4 582560 2016-04-13 1
##5 586334 2014-03-30 1
##6 591674 2012-03-21 6
Data:
df <- structure(list(Customer.ID = c(576293L, 576293L, 581252L, 581252L,
581252L, 581600L, 581600L, 582560L, 591674L, 586334L), Date = structure(c(15565,
16755, 16031, 15294, 16804, 16447, 16447, 16904, 15420, 16159
), class = "Date"), Rank = c(2L, 6L, 4L, 6L, 5L, 3L, 2L, 1L,
6L, 1L)), .Names = c("Customer.ID", "Date", "Rank"), row.names = c(NA,
-10L), class = "data.frame")

reshaping a dataframe for prediction

I just picked up the reshape package today and I'm having some trouble understanding how it works.
I have the following dataframe:
name workoutnum time weight raceid final_position
tommy 1 12 140 1 2
tommy 2 14 140 1 2
tommy 3 11 140 1 2
sarah 1 10 115 1 1
sarah 2 10 115 1 1
sarah 3 11 115 1 1
sarah 4 15 115 1 1
How would I put all this in one row? So the dataframe would look like:
name workoutnum1 workoutnum2 workoutnum3 workoutnum4 time1 time2 time3 time4 weight raceid final_position
tommy 1 1 1 0 12 14 11 NA 140 1 2
sarah 1 1 1 1 10 10 11 15 115 1 1
So all columns would be attached to the workout values.
Is this even the proper way to do it?
reshape seems like a natural part of what you want to do, but won't get you all the way there.
Here's a reshape2 approach that fully melts the data, then casts it back to data.frame, with some tweaks along the way to get the desired output.
Note that in the call to melt(), the variables in the id.vars arguments will remain wide. Then in dcast(), the variable that'll be cast wide is on the RHS of the ~.
library(reshape2)
library(dplyr)
# fully melt the data
d_melt <- melt(d, id.vars = c("name", "raceid", "position", "weight"))
# index the variables within name and variable
d_melt <- d_melt %>%
group_by(name, variable) %>%
mutate(i = row_number(),
wide_variable = paste0(variable, i))
# cast as wide
d_wide <- dcast(d_melt, name + raceid + position + weight ~ wide_variable, value.var = "value")
# replace the workoutnum indices with indicators for missingness
d_wide %>% mutate(across(matches("workoutnum\\d"), ~ ifelse(!is.na(.x), 1L, 0L)))
# name raceid position weight time1 time2 time3 time4 workoutnum1 workoutnum2
# 1 sarah 1 1 115 10 10 11 15 1 1
# 2 tommy 1 2 140 12 14 11 NA 1 1
# workoutnum3 workoutnum4
# 1 1 1
# 2 1 0
Data:
d <- structure(list(name = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("sarah", "tommy"), class = "factor"), workoutnum = c(1L, 2L, 3L, 1L, 2L, 3L, 4L), time = c(12L, 14L, 11L, 10L, 10L, 11L, 15L), weight = c(140L, 140L, 140L, 115L, 115L, 115L, 115L), raceid = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), position = c(2L, 2L, 2L, 1L, 1L, 1L, 1L)), .Names = c("name", "workoutnum", "time", "weight", "raceid", "position"), class = "data.frame", row.names = c(NA, -7L))
Here's an approach using dcast from "data.table", which reshapes a little more like the reshape function in base R.
The only change I've made to the data is the inclusion of another "time" variable, though, as pointed out by @rawr in the comments, it almost seems like your "workoutnum" is the time variable.
I've used getanID from my "splitstackshape" package to generate the "time" variable, but you can create this variable in many different ways.
library(splitstackshape)
dcast(getanID(mydf, c("name", "raceid", "final_position")),
name + raceid + final_position ~ .id,
value.var = c("workoutnum", "time", "weight"))
## name raceid final_position workoutnum_1 workoutnum_2 workoutnum_3
## 1: sarah 1 1 1 2 3
## 2: tommy 1 2 1 2 3
## workoutnum_4 time_1 time_2 time_3 time_4 weight_1 weight_2 weight_3 weight_4
## 1: 4 10 10 11 15 115 115 115 115
## 2: NA 12 14 11 NA 140 140 140 NA
If you're using getanID, you can also use reshape like this:
reshape(getanID(mydf, c("name", "raceid", "final_position")),
idvar = c("name", "raceid", "final_position"), timevar = ".id",
direction = "wide")
## name raceid final_position workoutnum.1 time.1 weight.1 workoutnum.2 time.2
## 1: tommy 1 2 1 12 140 2 14
## 2: sarah 1 1 1 10 115 2 10
## weight.2 workoutnum.3 time.3 weight.3 workoutnum.4 time.4 weight.4
## 1: 140 3 11 140 NA NA NA
## 2: 115 3 11 115 4 15 115
but dcast would be more efficient in general.
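Following the comment that workoutnum already behaves like the time variable, a package-free sketch with base reshape() is also possible. It assumes weight is constant within each name, so it is treated as an id column here, and it rebuilds the 0/1 workoutnum indicators from whether a time value is present.

```r
# Sample data from the question, using final_position as the column name
d <- data.frame(
  name           = c("tommy", "tommy", "tommy", "sarah", "sarah", "sarah", "sarah"),
  workoutnum     = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
  time           = c(12L, 14L, 11L, 10L, 10L, 11L, 15L),
  weight         = c(140L, 140L, 140L, 115L, 115L, 115L, 115L),
  raceid         = 1L,
  final_position = c(2L, 2L, 2L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Spread time over the workout number: one time<k> column per workout
wide <- reshape(d,
                idvar     = c("name", "raceid", "final_position", "weight"),
                timevar   = "workoutnum",
                v.names   = "time",
                direction = "wide",
                sep       = "")

# Indicator columns: 1 if that workout has a recorded time, 0 otherwise
time_cols <- paste0("time", 1:4)
wide[paste0("workoutnum", 1:4)] <- lapply(wide[time_cols],
                                          function(x) as.integer(!is.na(x)))
rownames(wide) <- NULL
wide
```

Workouts that a person did not do show up as NA in the time column and 0 in the matching indicator.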

Calculate column sums for each combination of two grouping variables [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 7 years ago.
I have a dataset that looks something like this:
Type Age count1 count2 Year Pop1 Pop2 TypeDescrip
A 35 1 1 1990 30000 50000 alpha
A 35 3 1 1990 30000 50000 alpha
A 45 2 3 1990 20000 70000 alpha
B 45 2 1 1990 20000 70000 beta
B 45 4 5 1990 20000 70000 beta
I want to add the counts of the rows that are matching in the Type and Age columns. So ideally I would end up with a dataset that looks like this:
Type Age count1 count2 Year Pop1 Pop2 TypeDescrip
A 35 4 2 1990 30000 50000 alpha
A 45 2 3 1990 20000 70000 alpha
B 45 6 6 1990 20000 70000 beta
I've tried using nested duplicated() statements such as below:
typedup = duplicated(df$Type)
bothdup = duplicated(df[(typedup == TRUE),]$Age)
but this returns indices for which age or type are duplicated, not necessarily when one row has duplicates of both.
I've also tried tapply:
tapply(c(df$count1, df$count2), c(df$Age, df$Type), sum)
but this output is difficult to work with. I want to have a data.frame when I'm done.
I don't want to use a for-loop because my dataset is quite large.
Try
library(dplyr)
df1 %>%
group_by(Type, Age) %>%
summarise_each(funs(sum))
# Type Age count1 count2
#1 A 35 4 2
#2 A 45 2 3
#3 B 45 6 6
In the newer versions of dplyr
df1 %>%
group_by(Type, Age) %>%
summarise_all(sum)
Or using base R
aggregate(.~Type+Age, df1, FUN=sum)
# Type Age count1 count2
#1 A 35 4 2
#2 A 45 2 3
#3 B 45 6 6
Or
library(data.table)
setDT(df1)[, lapply(.SD, sum), .(Type, Age)]
# Type Age count1 count2
#1: A 35 4 2
#2: A 45 2 3
#3: B 45 6 6
Update
Based on the new dataset,
df2 %>%
group_by(Type, Age,Pop1, Pop2, TypeDescrip) %>%
summarise_each(funs(sum), matches('^count'))
# Type Age Pop1 Pop2 TypeDescrip count1 count2
#1 A 35 30000 50000 alpha 4 2
#2 A 45 20000 70000 beta 2 3
#3 B 45 20000 70000 beta 6 6
data
df1 <- structure(list(Type = c("A", "A", "A", "B", "B"), Age = c(35L,
35L, 45L, 45L, 45L), count1 = c(1L, 3L, 2L, 2L, 4L), count2 = c(1L,
1L, 3L, 1L, 5L)), .Names = c("Type", "Age", "count1", "count2"
), class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(Type = c("A", "A", "A", "B", "B"), Age = c(35L,
35L, 45L, 45L, 45L), count1 = c(1L, 3L, 2L, 2L, 4L), count2 = c(1L,
1L, 3L, 1L, 5L), Year = c(1990L, 1990L, 1990L, 1990L, 1990L),
Pop1 = c(30000L, 30000L, 20000L, 20000L, 20000L), Pop2 = c(50000L,
50000L, 70000L, 70000L, 70000L), TypeDescrip = c("alpha",
"alpha", "beta", "beta", "beta")), .Names = c("Type", "Age",
"count1", "count2", "Year", "Pop1", "Pop2", "TypeDescrip"),
class = "data.frame", row.names = c(NA, -5L))
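The grouped sum from the Update can also be sketched with base aggregate(), putting the count columns on the left-hand side with cbind() and the descriptive columns on the right. As in the Update, Year is left out of the grouping, so it does not appear in the result.

```r
# Same values as df2 above
df2 <- data.frame(
  Type        = c("A", "A", "A", "B", "B"),
  Age         = c(35L, 35L, 45L, 45L, 45L),
  count1      = c(1L, 3L, 2L, 2L, 4L),
  count2      = c(1L, 1L, 3L, 1L, 5L),
  Year        = 1990L,
  Pop1        = c(30000L, 30000L, 20000L, 20000L, 20000L),
  Pop2        = c(50000L, 50000L, 70000L, 70000L, 70000L),
  TypeDescrip = c("alpha", "alpha", "beta", "beta", "beta"),
  stringsAsFactors = FALSE
)

# Sum count1 and count2 within each combination of the grouping columns
out <- aggregate(cbind(count1, count2) ~ Type + Age + Pop1 + Pop2 + TypeDescrip,
                 data = df2, FUN = sum)
out
```

Only combinations that actually occur in the data are returned, so the result has one row per observed group.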
@hannah you can also use SQL via the sqldf package:
library(sqldf)
sqldf("select
Type, Age,
sum(count1) as sum_count1,
sum(count2) as sum_count2
from
df
group by
Type, Age
")
