R: How do I avoid getting an error when merging two data frames (group by/summarise)?

I have a big data frame of 80,000 rows. It was created by combining individual data frames from different years. The origin variable indicates the year of the entry's original data frame.
Here is an example of the first few rows of the big data frame, showing how the data frames from 2003 and 2011 were combined.
df_1:
ID City State origin
1 NY NY 2003
2 NY NY 2003
3 SF CA 2003
1 NY NY 2011
3 SF CA 2011
2 NY NY 2011
4 LA CA 2011
5 SD CA 2011
Now I want to create a new variable called first_appearance that takes the min of the origin variable for each ID:
final_df:
ID City State origin first_appearance
1 NY NY 2003 2003
2 NY NY 2003 2003
3 SF CA 2003 2003
1 NY NY 2011 2003
3 SF CA 2011 2003
2 NY NY 2011 2003
4 LA CA 2011 2011
5 SD CA 2011 2011
So far, I've tried using:
prestep_final <- df_1 %>% group_by(ID) %>% summarise(first_appearance = min(origin))
final_df <- merge(prestep_final, df_1, by = "ID")
prestep_final works and produces a data frame with the ID and first_appearance.
Unfortunately, the merge step doesn't work and yields a data frame with only NA entries.
How can I improve my code so that I can produce a table like final_df above? I'd appreciate any suggestions and don't have package preferences.

If you change summarise to mutate, you get your desired result without merging:
library(tidyverse)
df <- tibble::tribble(
  ~ID, ~City, ~State, ~origin,
  1, 'NY', 'NY', 2003,
  2, 'NY', 'NY', 2003,
  3, 'SF', 'CA', 2003,
  1, 'NY', 'NY', 2011,
  3, 'SF', 'CA', 2011,
  2, 'NY', 'NY', 2011,
  4, 'LA', 'CA', 2011,
  5, 'SD', 'CA', 2011
)
df %>%
  group_by(ID) %>%
  mutate(first_appearance = min(origin))
#> # A tibble: 8 x 5
#> # Groups: ID [5]
#> ID City State origin first_appearance
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 NY NY 2003 2003
#> 2 2 NY NY 2003 2003
#> 3 3 SF CA 2003 2003
#> 4 1 NY NY 2011 2003
#> 5 3 SF CA 2011 2003
#> 6 2 NY NY 2011 2003
#> 7 4 LA CA 2011 2011
#> 8 5 SD CA 2011 2011
Created on 2020-06-10 by the reprex package (v0.3.0)
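A small follow-up, in case the grouping is not needed downstream: ungroup() drops it so later verbs don't accidentally run per group.
df %>%
  group_by(ID) %>%
  mutate(first_appearance = min(origin)) %>%
  ungroup()   # drop the grouping once first_appearance is computed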

An option with data.table
library(data.table)
setDT(df)[, first_appearance := min(origin), ID]
Or in base R
df$first_appearance <- with(df, ave(origin, ID, FUN = min))
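If you prefer to keep your original two-step summarise-then-merge idea, a dplyr join works as well. A minimal sketch, assuming the tibble df defined in the first answer:
library(dplyr)
# one row per ID with the earliest origin
prestep <- df %>%
  group_by(ID) %>%
  summarise(first_appearance = min(origin))
# join the per-ID minimum back onto every row of the original data
final_df <- left_join(df, prestep, by = "ID")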

Related

Replace NA with minimum Group Value R

I'm struggling with transforming my data and would appreciate some help.
year name start
2010 Emma 1998
2011 Emma 1998
2012 Emma 1998
2009 John na
2010 John na
2012 John na
2007 Louis na
2012 Louis na
The aim is to replace all NAs with the minimum value of year for every name group, so the data looks like this:
year name start
2010 Emma 1998
2011 Emma 1998
2012 Emma 1998
2009 John 2009
2010 John 2009
2012 John 2009
2007 Louis 2007
2012 Louis 2007
Note: either all start values of one name group are NAs or none
I tried to use
mydf %>% group_by(name) %>% mutate(start= ifelse(is.na(start), min(year, na.rm = T), start))
but got this error
x `start` must return compatible vectors across groups
There are a lot of similar problems here.
Some people here used the ave function or worked with data.table, neither of which seems to fit my problem.
My base R approach must be something like
df$A <- ifelse(is.na(df$A), df$B, df$A)
however, I can't seem to properly combine it with min() and group_by().
Thank you for any help!
I changed the column name to 'Year' to avoid a name collision:
dat %>%
dplyr::group_by(name) %>%
dplyr::mutate(start = dplyr::if_else(start == "na", min(Year), start))
# A tibble: 8 x 3
# Groups: name [3]
Year name start
<chr> <chr> <chr>
1 2010 Emma 1998
2 2011 Emma 1998
3 2012 Emma 1998
4 2009 John 2009
5 2010 John 2009
6 2012 John 2009
7 2007 Louis 2007
8 2012 Louis 2007
We can use na.aggregate
library(dplyr)
library(zoo)
dat %>%
group_by(name) %>%
mutate(start = na.aggregate(na_if(start, "na"), FUN = min))
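For completeness, the same replacement can be sketched in base R with ave. This assumes dat stores the literal string "na" for missing values and that Year and start are character columns, as in the output above:
# turn the literal "na" strings into real NA values
dat$start[dat$start == "na"] <- NA
# per-name minimum Year, repeated for every row of that name
grp_min <- ave(dat$Year, dat$name, FUN = min)
# fill the missing start values with the group minimum
dat$start <- ifelse(is.na(dat$start), grp_min, dat$start)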

Obtaining back incidence data from cumulative data?

I have a data frame with date data and cumulative counts.
I am trying to do the reverse of cumsum to get the daily counts, but also to get the counts per group.
I am trying to go from dataframe A to dataframe B.
I am using R and tidyr.
Here is the code:
df <- data.frame(cum_count = c(5, 14, 50, 5, 14, 50),
                 state = c("Alabama", "Alabama", "Alabama", "NY", "NY", "NY"),
                 Year = c(2012:2014, 2012:2014))
Dataframe A
cum_count state Year
1 5 Alabama 2012
2 14 Alabama 2013
3 50 Alabama 2014
4 5 NY 2012
5 14 NY 2013
6 50 NY 2014
Dataframe B
cum_count state Year
1 5 Alabama 2012
2 9 Alabama 2013
3 36 Alabama 2014
4 5 NY 2012
5 9 NY 2013
6 36 NY 2014
I have tried using the diff function:
df <- df %>%
  group_by(state) %>%
  mutate(daily_count = diff(cum_count))
But I get
Error: Column daily_count must be length 3 (the number of rows) or one, not 2
Let me know what you think.
Thanks!
diff returns a vector one shorter than its input, while mutate requires the output column to have the same length as the original (or length 1, which can be recycled). We can prepend a value, either NA or the first value of 'cum_count':
library(dplyr)
df %>%
  group_by(state) %>%
  mutate(daily_count = c(first(cum_count), diff(cum_count)))
# A tibble: 6 x 4
# Groups: state [2]
# cum_count state Year daily_count
# <dbl> <fct> <int> <dbl>
#1 5 Alabama 2012 5
#2 14 Alabama 2013 9
#3 50 Alabama 2014 36
#4 5 NY 2012 5
#5 14 NY 2013 9
#6 50 NY 2014 36
Or, for this purpose, use lag and subtract it from the column itself (replace_na comes from tidyr):
library(tidyr)
df %>%
  group_by(state) %>%
  mutate(daily_count = replace_na(cum_count - lag(cum_count), first(cum_count)))
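A data.table version of the same reverse cumsum, sketched assuming the df defined in the question; shift() lags cum_count within each state, and fill = 0 makes the first difference equal the first cumulative count:
library(data.table)
setDT(df)[, daily_count := cum_count - shift(cum_count, fill = 0), by = state]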

merge two data frames based on matching rows of multiple columns

Below are the summary and structure of the two data sets I tried to merge, claimants and unemp; they can be found here: claims.csv and unemp.csv
> tbl_df(claimants)
# A tibble: 6,960 × 5
X County Month Year Claimants
<int> <fctr> <fctr> <int> <int>
1 1 ALAMEDA Jan 2007 13034
2 2 ALPINE Jan 2007 12
3 3 AMADOR Jan 2007 487
4 4 BUTTE Jan 2007 3496
5 5 CALAVERAS Jan 2007 644
6 6 COLUSA Jan 2007 1244
7 7 CONTRA COSTA Jan 2007 8475
8 8 DEL NORTE Jan 2007 328
9 9 EL DORADO Jan 2007 2120
10 10 FRESNO Jan 2007 19974
# ... with 6,950 more rows
> tbl_df(unemp)
# A tibble: 6,960 × 7
County Year Month laborforce emplab unemp unemprate
* <chr> <int> <chr> <int> <int> <int> <dbl>
1 Alameda 2007 Jan 743100 708300 34800 4.7
2 Alameda 2007 Feb 744800 711000 33800 4.5
3 Alameda 2007 Mar 746600 713200 33300 4.5
4 Alameda 2007 Apr 738200 705800 32400 4.4
5 Alameda 2007 May 739100 707300 31800 4.3
6 Alameda 2007 Jun 744900 709100 35800 4.8
7 Alameda 2007 Jul 749600 710900 38700 5.2
8 Alameda 2007 Aug 746700 709600 37000 5.0
9 Alameda 2007 Sep 748200 712100 36000 4.8
10 Alameda 2007 Oct 749000 713000 36100 4.8
# ... with 6,950 more rows
I thought first I should change all the factor columns to character columns.
unemp[sapply(unemp, is.factor)] <- lapply(unemp[sapply(unemp, is.factor)], as.character)
claimants[sapply(claimants, is.factor)] <- lapply(claimants[sapply(claimants, is.factor)], as.character)
m <-merge(unemp, claimants, by = c("County", "Month", "Year"))
dim(m)
[1] 0 10
As dim(m) shows, the resulting data frame has 0 rows, even though all 6,960 rows should match each other uniquely.
To verify that the two data frames have matching combinations of the 3 columns 'County', 'Month', and 'Year', I reordered and rearranged these columns within the data frames as below:
a <- claimants[ order(claimants[,"County"], claimants[,"Month"], claimants[,"Year"]), ]
b <- unemp[ order(unemp[,"County"], unemp[,"Month"], unemp[,"Year"]), ]
b[2:4] <- b[c(2,4,3)]
a[2:4] %in% b[2:4]
[1] TRUE TRUE TRUE
This last output confirms that all 'County', 'Month', and 'Year' columns match each other in these two dataframes.
I have tried looking into the documentation for merge and could not figure out where I go wrong. I have also tried the inner_join function from dplyr:
> m <- inner_join(unemp[2:8], claimants[2:5])
Joining, by = c("County", "Year", "Month")
> dim(m)
[1] 0 8
I am missing something and don't know what; I would appreciate help understanding this. I know I should not have to rearrange the rows by the three columns to run merge; R should identify the matching rows and merge the non-matching columns.
The claimants df has the counties in all uppercase, while the unemp df has them in mixed case.
I used options(stringsAsFactors = FALSE) when reading in your data. One more suggestion: drop the X column in both, since it doesn't seem useful.
options(stringsAsFactors = FALSE)
library(dplyr)   # for inner_join() below
claims <- read.csv("claims.csv",header=TRUE)
claims$X <- NULL
unemp <- read.csv("unemp.csv",header=TRUE)
unemp$X <- NULL
unemp$County <- toupper(unemp$County)
m <- inner_join(unemp, claims)
dim(m)
# [1] 6960 8
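The same fix can also be written as one dplyr pipeline; a sketch assuming the claims and unemp data frames read in above, and that the county casing is the only mismatch:
library(dplyr)
m <- unemp %>%
  mutate(County = toupper(County)) %>%   # match the all-uppercase counties in claims
  inner_join(claims, by = c("County", "Month", "Year"))
dim(m)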

Merging data by 2 variables in R

I am attempting to merge two data sets. In the past I have used merge() with by equal to the variable I want to merge by. However, now I would like to do so with two variables. My first data set looks something like this:
Year Winning_Tm Losing_Tm
2011 Texas Washington
2012 Alabama South Carolina
2013 Tennessee Texas
Then I have another data set with a rank of each team (this is very simplified) for each year. Like this:
Year Team Rank
2011 Texas 32
2011 Washington 34
2012 South Carolina 45
2012 Alabama 12
2013 Texas 6
2013 Tennessee 51
I would like to merge them so I have a data set that looks like this:
Year Winning_Tm Winning_TM_rank Losing_Tm Losing_Tm_rank
2011 Texas 32 Washington 34
2012 Alabama 12 South Carolina 45
2013 Tennessee 51 Texas 6
My hope is that there is a simple way to do this but it may be more complicated. Thanks!
I reproduced your data (try to include a dput of it next time):
A <- data.frame(
  Year = c(2011, 2012, 2013),
  Winning_Tm = c("Texas","Alabama","Tennessee"),
  Losing_Tm = c("Washington","South Carolina", "Texas"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  Year = c("2011","2011","2012","2012","2013","2013"),
  Team = c("Texas","Washington","South Carolina","Alabama","Texas","Tennessee"),
  Rank = c(32,34,45,12,6,51),
  stringsAsFactors = FALSE
)
You can melt the first dataframe using the reshape2 package:
library(reshape2)
A <- melt(A, id.vars = "Year")
names(A)[3] <- "Team"
Now it looks like this:
> A
Year variable Team
1 2011 Winning_Tm Texas
2 2012 Winning_Tm Alabama
3 2013 Winning_Tm Tennessee
4 2011 Losing_Tm Washington
5 2012 Losing_Tm South Carolina
6 2013 Losing_Tm Texas
You can then merge the datasets together by the two columns of interest:
AB <- merge(A, B, by=c("Year","Team"))
Which looks like this:
> AB
Year Team variable Rank
1 2011 Texas Winning_Tm 32
2 2011 Washington Losing_Tm 34
3 2012 Alabama Winning_Tm 12
4 2012 South Carolina Losing_Tm 45
5 2013 Tennessee Winning_Tm 51
6 2013 Texas Losing_Tm 6
Then using the reshape command from base R you can change AB to a wide format:
reshape(AB, idvar = "Year", timevar = "variable", direction = "wide")
The result:
Year Team.Winning_Tm Rank.Winning_Tm Team.Losing_Tm Rank.Losing_Tm
1 2011 Texas 32 Washington 34
3 2012 Alabama 12 South Carolina 45
5 2013 Tennessee 51 Texas 6
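The same melt-merge-reshape can be sketched with tidyr's pivot_longer()/pivot_wider() instead of reshape2 and reshape(). This assumes the A and B data frames defined above; B stores Year as character there, so A's Year is converted first to keep the join keys compatible:
library(dplyr)
library(tidyr)
A %>%
  mutate(Year = as.character(Year)) %>%          # B's Year is character
  pivot_longer(c(Winning_Tm, Losing_Tm),
               names_to = "Result", values_to = "Team") %>%
  left_join(B, by = c("Year", "Team")) %>%
  pivot_wider(names_from = Result, values_from = c(Team, Rank))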
Two separate merges will do it. You would need to wrap your list of by variables in c(), and since the variables have different names, you need by.x and by.y. Afterward you could rename the rank variables.
I'll call your data winlose and teamrank, respectively. Then you'd need:
first_merge <- merge(winlose, teamrank, by.x = c('Year', 'Winning_Tm'), by.y = c('Year', 'Team'))
second_merge <- merge(first_merge, teamrank, by.x = c('Year', 'Losing_Tm'), by.y = c('Year', 'Team'))
Renaming the variables:
names(second_merge)[names(second_merge) == 'Rank.x'] <- 'Winning_Tm_rank'
names(second_merge)[names(second_merge) == 'Rank.y'] <- 'Losing_Tm_rank'
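The two merges can also be written as dplyr joins, which avoids the Rank.x/Rank.y renaming; a sketch using the same winlose and teamrank names:
library(dplyr)
winlose %>%
  left_join(teamrank, by = c("Year", "Winning_Tm" = "Team")) %>%
  rename(Winning_Tm_rank = Rank) %>%
  left_join(teamrank, by = c("Year", "Losing_Tm" = "Team")) %>%
  rename(Losing_Tm_rank = Rank)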
If you are familiar with SQL, a rather complicated but fast way to do this all in one step would be:
library(sqldf)
res <- sqldf("SELECT l.*,
max(case when l.Winning_Tm = r.Team then r.Rank else 0 end) as Winning_Tm_rank,
max(case when l.Losing_Tm = r.Team then r.Rank else 0 end) as Losing_Tm_rank
FROM df1 as l
inner join df2 as r
on (l.Winning_Tm = r.Team
OR l.Losing_Tm = r.Team)
AND l.Year = r.Year
group by l.Year, l.Winning_Tm, l.Losing_Tm")
res
Year Winning_Tm Losing_Tm Winning_Tm_rank Losing_Tm_rank
1 2011 Texas Washington 32 34
2 2012 Alabama South_Carolina 12 45
3 2013 Tennessee Texas 51 6
Data:
df1 <- read.table(header=T,text="Year Winning_Tm Losing_Tm
2011 Texas Washington
2012 Alabama South_Carolina
2013 Tennessee Texas")
df2<- read.table(header=T,text="Year Team Rank
2011 Texas 32
2011 Washington 34
2012 South_Carolina 45
2012 Alabama 12
2013 Texas 6
2013 Tennessee 51")

R: Find top, mid and bottom values to create a category column in dplyr

I would like to create a 'Category' column in the below dataset based on the sales and year.
set.seed(30)
df <- data.frame(
  Year = rep(2010:2015, each = 6),
  Country = rep(c('India', 'China', 'Japan', 'USA', 'Germany', 'Russia'), 6),
  Sales = round(runif(18, 100, 900))
)
head(df)
Year Country Sales
1 2010 India 661
2 2010 China 888
3 2010 Japan 285
4 2010 USA 272
5 2010 Germany 332
6 2010 Russia 660
Categories are:
Top 2 countries with highest sales in each year: Category - 1
Bottom 2 countries with lowest sales in each year: Category - 3
Remaining countries by year: Category - 2
Expected dataset might look like:
Year Country Sales Category
1 2010 India 661 1
2 2010 China 888 1
3 2010 Japan 285 3
4 2010 USA 272 3
5 2010 Germany 332 2
6 2010 Russia 660 2
You don't need much here; just group_by Year, arrange from greatest to least Sales, and then add a new column with mutate: 1 for the top two rows in each year, 3 for the bottom two, and 2 for everything in between:
df %>% group_by(Year) %>%
arrange(desc(Sales)) %>%
mutate(Category = c(1, 1, rep(2, n()-4), 3, 3))
# Source: local data frame [36 x 4]
# Groups: Year [6]
#
# Year Country Sales Category
# (int) (fctr) (dbl) (dbl)
# 1 2010 China 491 1
# 2 2010 USA 436 1
# 3 2010 Japan 391 2
# 4 2010 Germany 341 2
# 5 2010 Russia 218 3
# 6 2010 India 179 3
# 7 2011 Japan 873 1
# 8 2011 India 819 1
# 9 2011 Russia 418 2
# 10 2011 China 279 2
# .. ... ... ... ...
It will fail with fewer than four countries, but that doesn't sound like an issue from the question.
We can use cut to create a 'Category' column after grouping by "Year".
library(dplyr)
df %>%
group_by(Year) %>%
mutate(Category = as.numeric(cut(-Sales, breaks=c(-Inf,
quantile(-Sales, prob = c(0, .5, 1))))))
Or using data.table
library(data.table)
setDT(df)[order(-Sales), Category := if(.N > 4) rep(1:3,
c(2, .N - 4, 2)) else rep(seq(.N), each = ceiling(.N/3)) ,by = Year]
This should also work when there are fewer than 4 elements in a "Year", e.g. if we remove the first five observations in 2010.
df1 <- df[-(1:5),]
setDT(df1)[order(-Sales), Category := if(.N > 4) rep(1:3,
c(2, .N - 4, 2)) else rep(seq(.N), each = ceiling(.N/3)) ,by = Year]
head(df1)
# Year Country Sales Category
#1: 2010 Russia 218 1
#2: 2011 India 819 1
#3: 2011 China 279 2
#4: 2011 Japan 873 1
#5: 2011 USA 213 3
#6: 2011 Germany 152 3
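A rank-based dplyr sketch that handles small groups without hard-coding the group size, assuming the df defined in the question; case_when() sends the two highest Sales in each Year to category 1, the two lowest to 3, and everything else to 2:
library(dplyr)
df %>%
  group_by(Year) %>%
  mutate(rank_desc = rank(-Sales, ties.method = "first"),
         Category = case_when(rank_desc <= 2      ~ 1,
                              rank_desc > n() - 2 ~ 3,
                              TRUE                ~ 2)) %>%
  select(-rank_desc) %>%
  ungroup()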
