Change the form while merging multiple data frames - r

I have several data frames that are all in the same format, like:
price <- data.frame(Year= c(2001, 2002, 2003),
A=c(1,2,3),B=c(2,3,4), C=c(4,5,6))
size <- data.frame(Year= c(2001, 2002, 2003),
A=c(1,2,3),B=c(2,3,4), C=c(4,5,6))
performance <- data.frame(Year= c(2001, 2002, 2003),
A=c(1,2,3),B=c(2,3,4), C=c(4,5,6))
> price
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
> size
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
> performance
Year A B C
1 2001 1 2 4
2 2002 2 3 5
3 2003 3 4 6
I want to merge these data frames, but with the result in a different form. The desired output is:
> df
name Year price size performance
1 A 2001 1 1 1
2 A 2002 2 2 2
3 A 2003 3 3 3
4 B 2001 2 2 2
5 B 2002 3 3 3
6 B 2003 4 4 4
7 C 2001 4 4 4
8 C 2002 5 5 5
9 C 2003 6 6 6
which arranges the data by name and then by date. Since I have over 2000 names and 180 dates in each of the 20 data frames, it is impractical to do this by typing in each name by hand.

You need to convert your data frames to long format and then join them together:
library(tidyverse)
price_long <- price %>% gather(key, value = "price", -Year)
size_long <- size %>% gather(key, value = "size", -Year)
performance_long <- performance %>% gather(key, value = "performance", -Year)
price_long %>%
left_join(size_long) %>%
left_join(performance_long)
Joining, by = c("Year", "key")
Joining, by = c("Year", "key")
Year key price size performance
1 2001 A 1 1 1
2 2002 A 2 2 2
3 2003 A 3 3 3
4 2001 B 2 2 2
5 2002 B 3 3 3
6 2003 B 4 4 4
7 2001 C 4 4 4
8 2002 C 5 5 5
9 2003 C 6 6 6
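With 20 data frames, writing one gather call and one join per frame gets repetitive. The following is a minimal sketch of the same long-then-join idea over a named list, using pivot_longer (which superseded gather in tidyr 1.0.0). The names frames, long, and result are illustrative and not from the original answer, and the final arrange gives the name-then-year order the question asks for.

```r
library(dplyr)
library(tidyr)
library(purrr)

price <- data.frame(Year = c(2001, 2002, 2003),
                    A = c(1, 2, 3), B = c(2, 3, 4), C = c(4, 5, 6))
size <- price
performance <- price

# Named list: each name becomes the value column of that frame
frames <- list(price = price, size = size, performance = performance)

# Reshape every frame to long format: Year, name, <frame name>
long <- imap(frames, function(d, nm) {
  pivot_longer(d, -Year, names_to = "name", values_to = nm)
})

# Join all long frames on the shared keys, then order by name and year
result <- reduce(long, left_join, by = c("Year", "name")) %>%
  arrange(name, Year) %>%
  select(name, Year, everything())

result
```

Adding another data frame only requires adding it to the list.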

You can use data.table:
library(data.table)
a=list(price=price,size=size,performance=performance)
dcast(melt(rbindlist(a,T,idcol = "name"),1:2),variable+Year~name)
variable Year performance price size
1: A 2001 1 1 1
2: A 2002 2 2 2
3: A 2003 3 3 3
4: B 2001 2 2 2
5: B 2002 3 3 3
6: B 2003 4 4 4
7: C 2001 4 4 4
8: C 2002 5 5 5
9: C 2003 6 6 6
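The one-liner above packs three steps into a single nested call. As a sketch, here is the same pipeline unrolled with named arguments; the intermediate names frames, stacked, long, and wide are illustrative only.

```r
library(data.table)

price <- data.frame(Year = c(2001, 2002, 2003),
                    A = c(1, 2, 3), B = c(2, 3, 4), C = c(4, 5, 6))
frames <- list(price = price, size = price, performance = price)

# 1. Stack the frames; idcol records which frame each row came from
stacked <- rbindlist(frames, use.names = TRUE, idcol = "name")

# 2. Melt A, B, C into rows, keeping name and Year as identifiers
long <- melt(stacked, id.vars = c("name", "Year"))

# 3. Spread the frame names back out into one column each
wide <- dcast(long, variable + Year ~ name, value.var = "value")

wide
```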

We can combine the data frames, gather and spread the combined data frame.
library(tidyverse)
dat <- list(price, size, performance) %>%
setNames(c("price", "size", "performance")) %>%
bind_rows(.id = "type") %>%
gather(name, value, A:C) %>%
spread(type, value) %>%
arrange(name, Year)
dat
# Year name performance price size
# 1 2001 A 1 1 1
# 2 2002 A 2 2 2
# 3 2003 A 3 3 3
# 4 2001 B 2 2 2
# 5 2002 B 3 3 3
# 6 2003 B 4 4 4
# 7 2001 C 4 4 4
# 8 2002 C 5 5 5
# 9 2003 C 6 6 6

dplyr::bind_rows comes in quite handy in such scenarios. A solution can be:
library(tidyverse)
bind_rows(list(price = price, size = size, performance = performance), .id="Type") %>%
gather(Key, Value, - Type, -Year) %>%
spread(Type, Value)
# Year Key performance price size
# 1 2001 A 1 1 1
# 2 2001 B 2 2 2
# 3 2001 C 4 4 4
# 4 2002 A 2 2 2
# 5 2002 B 3 3 3
# 6 2002 C 5 5 5
# 7 2003 A 3 3 3
# 8 2003 B 4 4 4
# 9 2003 C 6 6 6
The above solution is very similar to the one by @www. It just avoids the use of setNames.

To round it out, here's a package-free base R answer.
# gather the data.frames into a list
myList <- mget(ls())
Note that the three data.frames are the only objects in my environment.
# get the final data.frame
Reduce(merge,
Map(function(x, y) setNames(cbind(x[1], stack(x[-1])), c("Year", y, "ID")),
myList, names(myList)))
This returns
Year ID performance price size
1 2001 A 1 1 1
2 2001 B 2 2 2
3 2001 C 4 4 4
4 2002 A 2 2 2
5 2002 B 3 3 3
6 2002 C 5 5 5
7 2003 A 3 3 3
8 2003 B 4 4 4
9 2003 C 6 6 6
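Two caveats about the base R answer: mget(ls()) only works if the workspace holds nothing but the data frames, and merge sorts by Year first rather than by name. A minimal sketch that builds the list explicitly and reorders at the end could look like this (the names long and merged are illustrative):

```r
price <- data.frame(Year = c(2001, 2002, 2003),
                    A = c(1, 2, 3), B = c(2, 3, 4), C = c(4, 5, 6))
size <- price
performance <- price

# Explicit named list instead of mget(ls())
myList <- list(price = price, size = size, performance = performance)

# Stack each frame to long format and name the value column after the frame
long <- Map(function(x, y) setNames(cbind(x[1], stack(x[-1])), c("Year", y, "ID")),
            myList, names(myList))

# Merge on the shared columns (Year, ID), then sort by name and year
merged <- Reduce(merge, long)
merged <- merged[order(merged$ID, merged$Year), c("ID", "Year", names(myList))]
rownames(merged) <- NULL

merged
```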


Remove columns that are all NA for at least one level of a factor

I am hoping to tidy a dataframe by removing variables that are empty for any level of a grouping factor. It is fairly easy to remove columns that are entirely empty, however there appears to be no simple way to apply this selection over groups.
## Data
site<-c("A","A","A","A","A","B","B","B","B","B")
year<-c("2000","2001","2002","2003","2004","2000","2001","2002","2003","2004")
species_A<-c(1,2,3,4,5,NA,NA,NA,NA,NA)
species_B<-c(1,2,NA,4,5,NA,3,4,5,6)
species_C<-c(1,2,3,4,5,2,3,4,5,6)
dat<-data.frame(site,year,species_A,species_B,species_C)
site year species_A species_B species_C
1 A 2000 1 1 1
2 A 2001 2 2 2
3 A 2002 3 NA 3
4 A 2003 4 4 4
5 A 2004 5 5 5
6 B 2000 NA NA 2
7 B 2001 NA 3 3
8 B 2002 NA 4 4
9 B 2003 NA 5 5
10 B 2004 NA 6 6
## Remove columns with any NAs
dat %>%
group_by(site) %>%
select(where( ~!any(is.na(.x))))
## which returns
site year species_C
<chr> <chr> <dbl>
1 A 2000 1
2 A 2001 2
3 A 2002 3
4 A 2003 4
5 A 2004 5
6 B 2000 2
7 B 2001 3
8 B 2002 4
9 B 2003 5
10 B 2004 6
## Alternatively, if I use "all" in select, it only drops columns that are NA in every row of the whole data frame.
dat %>%
group_by(site) %>%
select(where( ~!all(is.na(.x))))
## however I am trying to get...
site year species_B species_C
1 A 2000 1 1
2 A 2001 2 2
3 A 2002 NA 3
4 A 2003 4 4
5 A 2004 5 5
6 B 2000 NA 2
7 B 2001 3 3
8 B 2002 4 4
9 B 2003 5 5
10 B 2004 6 6
It seems like this should be fairly straightforward but for whatever reason I cannot seem to get it to work.
Thanks!
Another option:
dat %>%
select(site, dat %>%
group_by(site) %>%
summarise(across(everything(), ~!all(is.na(.x))))%>%
ungroup() %>%
select(-site) %>%
select(where(all))%>%
names())
site year species_B species_C
1 A 2000 1 1
2 A 2001 2 2
3 A 2002 NA 3
4 A 2003 4 4
5 A 2004 5 5
6 B 2000 NA 2
7 B 2001 3 3
8 B 2002 4 4
9 B 2003 5 5
10 B 2004 6 6
We can split by site, then use select(where(~!all(is.na(.x)))) to drop the all-NA columns from each data frame, and finally subset dat by the intersection of the column names.
library(dplyr)
library(purrr)
dat %>% split(.$site) %>%
map(\(x) select(x, where(~!all(is.na(.x)))))%>%
map(names)%>%
reduce(intersect)%>%
dat[.]
Or, for a purrr-only solution:
library(purrr)
dat %>% split(.$site) %>%
map(~discard(., ~all(is.na(.x))))%>%
map(names)%>%
reduce(intersect)%>%
dat[.]
As an alternative, we can call summarise twice: once on grouped data to tell if any group is all-NAs, and a second call to obtain the final logical vector. Then subset dat with the logical vector:
library(dplyr)
dat %>% group_by(site) %>%
summarise(across(.fns = ~all(is.na(.x))))%>%
summarise(across(.fns = ~!(is.logical(.x) && any(.x))))%>%
unlist()%>%
dat[,.]
Or:
dat %>% group_by(site) %>%
summarise(across(.fns = ~all(is.na(.x))))%>%
map_lgl(~!(is.logical(.x) && any(.x)))%>%
dat[,.]
Output:
site year species_B species_C
1 A 2000 1 1
2 A 2001 2 2
3 A 2002 NA 3
4 A 2003 4 4
5 A 2004 5 5
6 B 2000 NA 2
7 B 2001 3 3
8 B 2002 4 4
9 B 2003 5 5
10 B 2004 6 6
You could convert to long format, drop every species that is all NA within at least one site, then convert back to wide format.
library(tidyverse)
dat %>%
tidyr::pivot_longer(!c(site, year), names_to = "species", values_to = "values") %>%
dplyr::group_by(site, species) %>%
dplyr::mutate(allNA = all(is.na(values))) %>%
dplyr::ungroup(site) %>%
dplyr::filter(!any(allNA == TRUE)) %>%
dplyr::select(-allNA) %>%
tidyr::pivot_wider(names_from = "species", values_from = "values")
Output
# A tibble: 10 × 4
site year species_B species_C
<chr> <chr> <dbl> <dbl>
1 A 2000 1 1
2 A 2001 2 2
3 A 2002 NA 3
4 A 2003 4 4
5 A 2004 5 5
6 B 2000 NA 2
7 B 2001 3 3
8 B 2002 4 4
9 B 2003 5 5
10 B 2004 6 6
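For comparison, the same rule can be written in base R without any packages: a column is dropped if, for at least one site, every value in that site is NA. This is a minimal sketch; the helper names drop_col and keep are illustrative.

```r
dat <- data.frame(
  site = rep(c("A", "B"), each = 5),
  year = rep(c("2000", "2001", "2002", "2003", "2004"), times = 2),
  species_A = c(1, 2, 3, 4, 5, NA, NA, NA, NA, NA),
  species_B = c(1, 2, NA, 4, 5, NA, 3, 4, 5, 6),
  species_C = c(1, 2, 3, 4, 5, 2, 3, 4, 5, 6)
)

# For every non-grouping column: is it entirely NA within any single site?
value_cols <- setdiff(names(dat), "site")
drop_col <- sapply(dat[value_cols], function(col) {
  any(tapply(is.na(col), dat$site, all))
})

# Keep the grouping column plus every column that passed the check
keep <- c("site", value_cols[!drop_col])
result <- dat[keep]

result
```

Here species_A is dropped because it is all NA for site B, while species_B survives because each site has at least one observed value.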

R: How can I group rows in a dataframe, ID rows meeting a condition, then delete prior rows for the group?

I have a dataframe of customers (identified by ID number), the number of units of two products they bought in each of four years, and a final column identifying the year in which new customers first purchased (the 'key' column). The problem: the dataframe includes rows from the years prior to new customers purchasing for the first time. I need to delete these rows. For example, this dataframe:
customer year item.A item.B key
1 1 2000 NA NA <NA>
2 1 2001 NA NA <NA>
3 1 2002 1 5 new.customer
4 1 2003 2 6 <NA>
5 2 2000 NA NA <NA>
6 2 2001 NA NA <NA>
7 2 2002 NA NA <NA>
8 2 2003 2 7 new.customer
9 3 2000 2 4 <NA>
10 3 2001 6 4 <NA>
11 3 2002 2 5 <NA>
12 3 2003 1 8 <NA>
needs to look like this:
customer year item.A item.B key
1 1 2002 1 5 new.customer
2 1 2003 2 6 <NA>
3 2 2003 2 7 new.customer
4 3 2000 2 4 <NA>
5 3 2001 6 4 <NA>
6 3 2002 2 5 <NA>
7 3 2003 1 8 <NA>
I thought I could do this using dplyr/tidyr - a combination of group, lead/lag, and slice (or perhaps filter and drop_na) but I can't figure out how to delete backwards in the customer group once I've identified the rows meeting the condition "key"=="new.customer". Thanks for any suggestions (code for the full dataframe below).
a<-c(1,1,1,1,2,2,2,2,3,3,3,3)
b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)
c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)
d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)
e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA)
df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)
df
As a first step, I mark existing customers (customer 3 in this case) in the key column, then keep only the rows from the first non-NA key onward within each customer:
library(dplyr)
df %>%
group_by(customer) %>%
mutate(
key = as.character(key), # can be avoided if key is a character to begin with
key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)
) %>%
filter(cumsum(!is.na(key)) > 0) %>%
ungroup()
# A tibble: 7 x 5
customer year C D key
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2002 1 5 new
2 1 2003 2 6 NA
3 2 2003 2 7 new
4 3 2000 2 4 existing
5 3 2001 6 4 NA
6 3 2002 2 5 NA
7 3 2003 1 8 NA
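The filter step relies on a running count: within each customer, cumsum(!is.na(key)) stays at 0 until the first non-NA key and is positive from that row on. Here is a base R sketch of the same idea that exposes the intermediate vector; the names first_row, flagged, and running are illustrative.

```r
df <- data.frame(
  customer = rep(1:3, each = 4),
  year = rep(2000:2003, times = 3),
  C = c(NA, NA, 1, 2, NA, NA, NA, 2, 2, 6, 2, 1),
  D = c(NA, NA, 5, 6, NA, NA, NA, 7, 4, 4, 5, 8),
  key = c(NA, NA, "new", NA, NA, NA, NA, "new", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Mark customers who already have purchases in their first row
first_row <- !duplicated(df$customer)
df$key[first_row & (!is.na(df$C) | !is.na(df$D))] <- "existing"

# Running count of non-NA keys within each customer
flagged <- as.integer(!is.na(df$key))
running <- ave(flagged, df$customer, FUN = cumsum)

# Keep rows at or after the first flagged row of each customer
result <- df[running > 0, ]

result
```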

apply lag or lead in increasing order for the dataframe

df1 <- read.csv("C:/Users/uni/DS-project/df1.csv")
df1
year value
1 2000 1
2 2001 2
3 2002 3
4 2003 4
5 2004 5
6 2000 1
7 2001 2
8 2002 3
9 2003 4
10 2004 5
11 2000 1
12 2001 2
13 2002 3
14 2003 4
15 2004 5
16 2000 1
17 2001 2
18 2002 3
19 2003 4
20 2004 5
I want to drop leading rows from each block so that I get the output shown below.
The data consist of the same 5 yearly observations repeated n times. The first block is kept whole; from the second block we drop 2000 and its value; from the third block we drop 2000 and 2001 and their values; from the fourth block we drop 2000, 2001, and 2002 and their values; and so on.
output:
year value
2000 1
2001 2
2002 3
2003 4
2004 5
2001 2
2002 3
2003 4
2004 5
2002 3
2003 4
2004 5
2003 4
2004 5
please help.
Just for fun, here is a vectorized solution using matrix subsetting:
m <- matrix(rep(TRUE, nrow(df1)), 5)
m[upper.tri(m)] <- FALSE
df1[m,]
# year value
# 1 2000 1
# 2 2001 2
# 3 2002 3
# 4 2003 4
# 5 2004 5
# 7 2001 2
# 8 2002 3
# 9 2003 4
# 10 2004 5
# 13 2002 3
# 14 2003 4
# 15 2004 5
# 19 2003 4
# 20 2004 5
Below grp is 1 for each row of the first group, 2 for the second and so on. Seq is 1, 2, 3, ... for the successive rows of each grp. Now just pick out those rows for which Seq is at least as large as grp. This has the effect of removing the first i-1 rows from the ith group for i = 1, 2, ... .
grp <- cumsum(df1$year == 2000)
Seq <- ave(grp, grp, FUN = seq_along)
subset(df1, Seq >= grp)
We could alternately write this in the less general form:
subset(df1, 1:5 >= rep(1:4, each = 5))
In any case the output from either subset statement is:
year value
1 2000 1
2 2001 2
3 2002 3
4 2003 4
5 2004 5
7 2001 2
8 2002 3
9 2003 4
10 2004 5
13 2002 3
14 2003 4
15 2004 5
19 2003 4
20 2004 5
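Since the question reads df1 from a local CSV file, the snippets cannot be run as posted. A minimal sketch that rebuilds the same 20 rows in code and applies the grp/Seq approach is shown here; the rep-based construction is an assumption about the file's contents, based on the printed data.

```r
# Rebuild the data shown in the question: four blocks of 2000..2004
df1 <- data.frame(year = rep(2000:2004, times = 4),
                  value = rep(1:5, times = 4))

# Block number: increases by one each time the year resets to 2000
grp <- cumsum(df1$year == 2000)

# Position of each row within its block
Seq <- ave(grp, grp, FUN = seq_along)

# Keep rows whose position is at least the block number
out <- subset(df1, Seq >= grp)

out
```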
Using dplyr:
library(dplyr)
df1 %>%
group_by(g = cumsum(year == 2000)) %>%
filter(row_number() >= g) %>%
ungroup %>%
select(-g)
# # A tibble: 14 x 2
# year value
# <int> <int>
# 1 2000 1
# 2 2001 2
# 3 2002 3
# 4 2003 4
# 5 2004 5
# 6 2001 2
# 7 2002 3
# 8 2003 4
# 9 2004 5
# 10 2002 3
# 11 2003 4
# 12 2004 5
# 13 2003 4
# 14 2004 5
Using lapply():
to <- nrow(df1) / 5 - 1
df1[-unlist(lapply(1:to, function(x) seq(1:x) + 5*x)), ]
year value
1 2000 1
2 2001 2
3 2002 3
4 2003 4
5 2004 5
7 2001 2
8 2002 3
9 2003 4
10 2004 5
13 2002 3
14 2003 4
15 2004 5
19 2003 4
20 2004 5
Where unlist(lapply(1:to, function(x) seq(1:x) + 5*x)) are the indices to skip:
[1] 6 11 12 16 17 18
Using sequence:
df1[5-rev(sequence(2:5)-1),]
# year value
# 1 2000 1
# 2 2001 2
# 3 2002 3
# 4 2003 4
# 5 2004 5
# 2.1 2001 2
# 3.1 2002 3
# 4.1 2003 4
# 5.1 2004 5
# 3.2 2002 3
# 4.2 2003 4
# 5.2 2004 5
# 4.3 2003 4
# 5.3 2004 5
How it works:
5-rev(sequence(2:5)-1)
# [1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5
rev(sequence(2:5)-1)
# [1] 4 3 2 1 0 3 2 1 0 2 1 0 1 0
sequence(2:5)-1
# [1] 0 1 0 1 2 0 1 2 3 0 1 2 3 4
sequence(2:5)
# [1] 1 2 1 2 3 1 2 3 4 1 2 3 4 5

What is the most elegant way to standardize a time series by the value in the first period?

I have a data frame with sales by product and year and would like to create a column that divides each product-year by the value of Sales in 2000, separately by product, in order to create "adjusted sales" (adj_Sales).
library(plyr)
df <- data.frame(Product=gl(3,3,labels=c("A","B", "C")),
Year=factor(rep(2000:2002,3)),
Sales=1:9)
print(df)
# Product Year Sales
# 1 A 2000 1
# 2 A 2001 2
# 3 A 2002 3
# 4 B 2000 4
# 5 B 2001 5
# 6 B 2002 6
# 7 C 2000 7
# 8 C 2001 8
# 9 C 2002 9
The following code works, but isn't very elegant since it:
a) creates an intermediate data frame (base_sales),
b) uses a merge with the intermediate data frame (base_sales) and the original (df),
c) requires a step to rename the Sales column to Sales_2000, and
d) creates an undesired Sales_2000 column.
Is there a way to do this all at once using plyr or dplyr?
base_sales <- df[df$Year==2000, c("Product","Sales")]
base_sales <- plyr::rename(base_sales, c("Sales" = "Sales_2000"))
print(base_sales)
# Product Sales_2000
# 1 A 1
# 4 B 4
# 7 C 7
df2 <- merge(df,base_sales,by="Product")
df2$adj_Sales <- df2$Sales / df2$Sales_2000
print(df2)
# Product Year Sales Sales_2000 adj_Sales
# 1 A 2000 1 1 1.0000
# 2 A 2001 2 1 2.0000
# 3 A 2002 3 1 3.0000
# 4 B 2000 4 4 1.0000
# 5 B 2001 5 4 1.2500
# 6 B 2002 6 4 1.5000
# 7 C 2000 7 7 1.0000
# 8 C 2001 8 7 1.1429
# 9 C 2002 9 7 1.2857
We can directly create the columns with mutate from dplyr.
library(dplyr)
df %>%
group_by(Product) %>%
mutate(Sales_2000= Sales[Year==2000], adj_sales=Sales/Sales_2000)
# Product Year Sales Sales_2000 adj_sales
#1 A 2000 1 1 1.000000
#2 A 2001 2 1 2.000000
#3 A 2002 3 1 3.000000
#4 B 2000 4 4 1.000000
#5 B 2001 5 4 1.250000
#6 B 2002 6 4 1.500000
#7 C 2000 7 7 1.000000
#8 C 2001 8 7 1.142857
#9 C 2002 9 7 1.285714
Or using data.table
library(data.table)
setDT(df)[, c('Sales_2000', 'adj_sales') := {tmp=Sales[Year==2000]
list(tmp, Sales/tmp)}, by = Product]
# Product Year Sales Sales_2000 adj_sales
#1: A 2000 1 1 1.000000
#2: A 2001 2 1 2.000000
#3: A 2002 3 1 3.000000
#4: B 2000 4 4 1.000000
#5: B 2001 5 4 1.250000
#6: B 2002 6 4 1.500000
#7: C 2000 7 7 1.000000
#8: C 2001 8 7 1.142857
#9: C 2002 9 7 1.285714
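If the Sales_2000 helper column is not wanted at all, the base value can be looked up without storing it. This is a base R sketch of that idea using match; the names is_base and base_sales are illustrative.

```r
df <- data.frame(Product = gl(3, 3, labels = c("A", "B", "C")),
                 Year = factor(rep(2000:2002, 3)),
                 Sales = 1:9)

# Rows that hold the base-year value for each product
is_base <- df$Year == "2000"

# For every row, look up the base-year sales of its product
base_sales <- df$Sales[is_base][match(df$Product, df$Product[is_base])]

# Only the adjusted column is added to the data frame
df$adj_Sales <- df$Sales / base_sales

df
```

This assumes every product has exactly one row for the base year; a product without one would get NA.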

Add a "rank" column to a data frame

I have a dataframe with counts of different items, in different years:
df <- data.frame(item = rep(c('a','b','c'), 3),
year = rep(c('2010','2011','2012'), each=3),
count = c(1,4,6,3,8,3,5,7,9))
And I would like to add a "year.rank" column, which gives an item's rank within a given year, where a higher count leads to a higher "rank". With the above, it would look like:
item year count year.rank
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
I know I could do this for the whole data frame using order(df$count), but I'm not sure how I would do it by year.
There is a rank function to help you with that:
transform(df,
year.rank = ave(count, year,
FUN = function(x) rank(-x, ties.method = "first")))
item year count year.rank
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
data.table version for practice:
library(data.table)
DT <- as.data.table(df)
DT[,yrrank:=rank(-count,ties.method="first"),by=year]
item year count yrrank
1: a 2010 1 3
2: b 2010 4 2
3: c 2010 6 1
4: a 2011 3 2
5: b 2011 8 1
6: c 2011 3 3
7: a 2012 5 3
8: b 2012 7 2
9: c 2012 9 1
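Using the order function. Note that order returns a sorting permutation rather than ranks, so it is applied twice to get ranks:
transform(df, x = ave(count, year, FUN = function(x) order(order(x, decreasing = TRUE))))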
item year count x
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
EDIT
You can also use plyr here:
library(plyr)
ddply(df, .(year), transform, x = order(order(count, decreasing = TRUE)))
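A caution on order-based approaches: a single call to order gives the positions that would sort the vector, which is not the same thing as each element's rank. The two happen to agree on the example data, but not in general. A small sketch with a made-up vector shows the difference:

```r
x <- c(4, 1, 6)

# Positions that sort x in decreasing order: 6 is 3rd, 4 is 1st, 1 is 2nd
perm <- order(x, decreasing = TRUE)

# Rank of each element when the largest gets rank 1
rnk <- rank(-x, ties.method = "first")

# Applying order twice converts the permutation into ranks
rnk2 <- order(order(-x))

perm  # 3 1 2
rnk   # 2 3 1
rnk2  # 2 3 1
```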
Using dplyr you could do it as follows:
library(dplyr) # 0.4.1
df %>%
group_by(year) %>%
mutate(yrrank = row_number(-count))
#Source: local data frame [9 x 4]
#Groups: year
#
# item year count yrrank
#1 a 2010 1 3
#2 b 2010 4 2
#3 c 2010 6 1
#4 a 2011 3 2
#5 b 2011 8 1
#6 c 2011 3 3
#7 a 2012 5 3
#8 b 2012 7 2
#9 c 2012 9 1
It is the same as:
df %>%
group_by(year) %>%
mutate(yrrank = rank(-count, ties.method = "first"))
Note that the resulting data is still grouped by "year". If you want to remove the grouping you can simply extend the pipe with %>% ungroup().
While using the answers given by others, I found that the following runs faster than the transform and dplyr variants:
df$year.rank <- ave(df$count, df$year, FUN = function(x) rank(-x, ties.method = "first"))
