R- ddply function - r

Hey everyone I have a dataset with about 8 for which I want to calculate the largest volume for each combination of city and year.
The dataset looks like this:
city     sales  volume  year  avg_price
abilene  239    12313   2000  7879
kansas   2324   18765   2000  2424
nyc      2342   987651  2000  3127
abilene  3432   34342   2001  1234
nyc      2342   10000   2001  3127
kansas   176    3130    2001  879
kansas   123    999650  2002  2424
abilene  3432   34342   2002  1234
nyc      2342   98000   2002  3127
I want my dataset to look like this :
city year volume
nyc 2000 987651
abilene 2001 34342
kansas 2002 999650
I used ddply (from the plyr package) to find the maximum volume of each city.
newdf=ddply(df,c('city','year'),summarise, max(volume))
However, this gives me the maximum volume for each city in each year. I just want the maximum volume across all cities for each year. Thank you.

library(dplyr)
df %>% # df is your data frame
  group_by(year) %>%
  filter(volume == max(volume))
Source: local data frame [3 x 5]
Groups: year
city sales volume year avg_price
1 nyc 2342 987651 2000 3127
2 abilene 3432 34342 2001 1234
3 kansas 123 999650 2002 2424
# Updated: if you are grouping by both city and year
df %>% # df is your data frame
  group_by(year, city) %>%
  filter(volume == max(volume))
Source: local data frame [9 x 5]
Groups: year, city
city sales volume year avg_price
1 abilene 239 12313 2000 7879
2 kansas 2324 18765 2000 2424
3 nyc 2342 987651 2000 3127
4 abilene 3432 34342 2001 1234
5 nyc 2342 10000 2001 3127
6 kansas 176 3130 2001 879
7 kansas 123 999650 2002 2424
8 abilene 3432 34342 2002 1234
9 nyc 2342 98000 2002 3127
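For readers who prefer to avoid a package dependency, here is a minimal base R sketch of the per-year maximum. It rebuilds the sample data from the question (the column name avg_price and the object name top_by_year are my own choices) and uses ave() in place of group_by() and filter(); it is an alternative to the dplyr answer, not a replacement for it.

```r
# Sample data copied from the question
df <- data.frame(
  city      = c("abilene", "kansas", "nyc", "abilene", "nyc",
                "kansas", "kansas", "abilene", "nyc"),
  sales     = c(239, 2324, 2342, 3432, 2342, 176, 123, 3432, 2342),
  volume    = c(12313, 18765, 987651, 34342, 10000, 3130, 999650, 34342, 98000),
  year      = c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002),
  avg_price = c(7879, 2424, 3127, 1234, 3127, 879, 2424, 1234, 3127),
  stringsAsFactors = FALSE
)

# ave() returns, for every row, the maximum volume within that row's year;
# keeping the rows that equal it selects the top city (or cities, on ties) per year
is_year_max <- df$volume == ave(df$volume, df$year, FUN = max)
top_by_year <- df[is_year_max, c("city", "year", "volume")]
rownames(top_by_year) <- NULL
top_by_year
```

If two cities tie for the maximum in a year, both rows are kept, which matches the behaviour of the filter() call above.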

Related

construct a pseudo panel based on similar values for some variables in R

I have two questions in one. I have 20 data frames, each corresponding to a given year (from 2000 to 2020). They all have the same columns. 1) I want to merge them based on matching observations for a list of variables (columns), so I can construct a panel. 2) When merging, I also want to rename the columns by adding a suffix indicating the year.
For example, take three data frames:
df1
year_sample birth_date country work_establishment Wage
2014 1995 US X2134 1700
2014 1996 US X26 1232
2014 1992 CANADA X26 2553
2014 1990 FRANCE X4T346 6574
2014 1983 BELGIUM X2E43 1706
2014 1975 US X2134 1000
2014 1969 CHINA XXZT55 996
df2
year_sample birth_date country work_establishment Wage
2015 1995 US X2134 1756
2015 1996 US X26 1230
2015 1992 CANADA X26 2700
2015 1990 FRANCE X4T346 6574
2015 1975 US X2134 1000
2015 1979 GERMANY X35555 2435
df3
year_sample birth_date country work_establishment Wage
2016 1995 US X2134 1750
2016 1996 US X26 1032
2016 1992 CANADA X26 2353
2016 1990 FRANCE X4T346 6574
2016 1955 MALI X2244 1000
2016 1979 GERMANY X35555 2435
If observations have the same values for c(birth_date, country, work_establishment), I will consider them to be the same person. I therefore want:
df_final
id birth_date country work_establishment Wage_2014 Wage_2015 Wage_2016
1 1995 US X2134 1700 1756 1750
2 1996 US X26 1232 1230 1032
3 1992 CANADA X26 2553 2700 2353
4 1990 FRANCE X4T346 6574 6574 6574
I know that if I had just two data frames I could do:
df_final <- transform(merge(df1,df2, by=c("birth_date", "country", "work_establishment"), suffixes=c("_2014", "_2015")))
But I can't manage to do it for several dataframes at once.
Thank you!
You can get all the dataframes in a list.
list_df <- mget(paste0('df', 1:3))
#OR
#list_df <- list(df1, df2, df3)
Then, in each data frame, add the year_sample value as a suffix to the 'Wage' column name, drop the year column, and use Reduce to merge the data frames into one.
result <- Reduce(
  function(x, y) merge(x, y, by = c("birth_date", "country", "work_establishment")),
  lapply(list_df, function(x) {
    # column 5 is 'Wage'; rename it to e.g. 'Wage_2014'
    names(x)[5] <- paste('Wage', x$year_sample[1], sep = '_')
    # drop the first column (year_sample)
    x[-1]
  })
)
result
# birth_date country work_establishment Wage_2014 Wage_2015 Wage_2016
#1 1990 FRANCE X4T346 6574 6574 6574
#2 1992 CANADA X26 2553 2700 2353
#3 1995 US X2134 1700 1756 1750
#4 1996 US X26 1232 1230 1032
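The desired df_final in the question also has an id column, which the result above lacks. The sketch below is one possible variation, using a shortened copy of the sample data: it renames Wage by name rather than by position and adds a running id after the merge. The names keys, prep and panel are hypothetical helpers introduced here for illustration.

```r
# Shortened versions of the question's data frames (three rows each)
df1 <- data.frame(year_sample = 2014, birth_date = c(1995, 1996, 1969),
                  country = c("US", "US", "CHINA"),
                  work_establishment = c("X2134", "X26", "XXZT55"),
                  Wage = c(1700, 1232, 996), stringsAsFactors = FALSE)
df2 <- data.frame(year_sample = 2015, birth_date = c(1995, 1996, 1979),
                  country = c("US", "US", "GERMANY"),
                  work_establishment = c("X2134", "X26", "X35555"),
                  Wage = c(1756, 1230, 2435), stringsAsFactors = FALSE)
df3 <- data.frame(year_sample = 2016, birth_date = c(1995, 1996, 1955),
                  country = c("US", "US", "MALI"),
                  work_establishment = c("X2134", "X26", "X2244"),
                  Wage = c(1750, 1032, 1000), stringsAsFactors = FALSE)

# Columns that identify "the same person" across years
keys <- c("birth_date", "country", "work_establishment")

# Rename Wage to Wage_<year> by name, then drop year_sample
prep <- function(d) {
  names(d)[names(d) == "Wage"] <- paste("Wage", d$year_sample[1], sep = "_")
  d[setdiff(names(d), "year_sample")]
}

# Inner-join all years on the key columns, then add a running id
panel <- Reduce(function(x, y) merge(x, y, by = keys),
                lapply(list(df1, df2, df3), prep))
panel <- cbind(id = seq_len(nrow(panel)), panel)
panel
```

Because merge() performs an inner join by default, only people observed in every year survive, which is what the expected df_final in the question shows.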

Merging two data frames with different rows in R

I have two data frames. The first one looks like
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
The second one contains all the countries that are in the first data frame, plus a few more countries, for the year 2018. It looks like this:
Country Year production
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I would like to merge the two data frames, and the final table should look like this:
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
Austria 1996 NA
Japan 1996 NA
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I've tried several functions including full_join, merge, and rbind but they didn't work. Does anybody have any ideas?
With dplyr and tidyr, you may use:
bind_rows(df1, df2) %>%
complete(Country, Year)
Country Year production
<chr> <int> <int>
1 Austria 1996 NA
2 Austria 2018 56
3 France 1996 12
4 France 2018 29
5 Germany 1996 11
6 Germany 2018 27
7 Greece 1996 15
8 Greece 2018 44
9 Japan 1996 NA
10 Japan 2018 66
11 UK 1996 17
12 UK 2018 46
13 USA 1996 24
14 USA 2018 99
Consider base R with expand.grid and merge (and avoid any dependencies should you be a package author):
# BUILD DF OF ALL POSSIBLE COMBINATIONS OF COUNTRY AND YEAR
all_country_years <- expand.grid(Country=unique(c(df_96$Country, df_18$Country)),
Year=c(1996, 2018))
# MERGE (LEFT JOIN)
final_df <- merge(all_country_years, rbind(df_96, df_18), by=c("Country", "Year"),
all.x=TRUE)
# ORDER DATA AND RESET ROW NAMES
final_df <- data.frame(with(final_df, final_df[order(Year, Country),]),
row.names = NULL)
final_df
# Country Year production
# 1 Germany 1996 11
# 2 France 1996 12
# 3 Greece 1996 15
# 4 UK 1996 17
# 5 USA 1996 24
# 6 Austria 1996 NA
# 7 Japan 1996 NA
# 8 Germany 2018 27
# 9 France 2018 29
# 10 Greece 2018 44
# 11 UK 2018 46
# 12 USA 2018 99
# 13 Austria 2018 56
# 14 Japan 2018 66
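The base R snippet above refers to df_96 and df_18 without defining them. Below is a self-contained sketch of the same expand.grid and merge idea, assuming those names hold the two tables from the question with Country stored as character; it derives the years from the data instead of hard-coding them. The object names stacked and grid are my own.

```r
# The two tables from the question, with Country kept as character
df_96 <- data.frame(Country = c("Germany", "France", "Greece", "UK", "USA"),
                    Year = 1996,
                    production = c(11, 12, 15, 17, 24),
                    stringsAsFactors = FALSE)
df_18 <- data.frame(Country = c("Germany", "France", "Greece", "UK", "USA",
                                "Austria", "Japan"),
                    Year = 2018,
                    production = c(27, 29, 44, 46, 99, 56, 66),
                    stringsAsFactors = FALSE)

stacked <- rbind(df_96, df_18)

# Every Country x Year combination that should appear in the result
grid <- expand.grid(Country = unique(stacked$Country),
                    Year = unique(stacked$Year),
                    stringsAsFactors = FALSE)

# Left join: combinations with no data get NA in production
final_df <- merge(grid, stacked, by = c("Country", "Year"), all.x = TRUE)
final_df <- final_df[order(final_df$Year, final_df$Country), ]
rownames(final_df) <- NULL
final_df
```

With character columns, the rows within each year come out in alphabetical order by Country rather than in the order shown in the question.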

Group by DF and then Filter using dplyr

This might be relatively easy in dplyr. The sample question uses the Lahman package data.
Which players managed both NYA and NYN (values of teamID)?
library(dplyr)
# get master player table
players <- Lahman::People
# get manager table
managers <- Lahman::Managers
# merge players to managers on the shared playerID column
manager_tbl <-
  managers %>%
  left_join(players, by = "playerID")
I want to get the rows for every playerID that has a row for both NYA and NYN under teamID.
How would I go about doing this? I'm guessing that I would need to group by playerID. berrayo01 is one of the answers.
After grouping by 'playerID', keep only the groups in which both 'NYA' and 'NYN' appear in 'teamID':
library(dplyr)
manager_tbl %>%
group_by(playerID) %>%
filter(all(c("NYA", "NYN") %in% teamID))
# A tibble: 69 x 35
# Groups: playerID [4]
# playerID yearID teamID lgID inseason G W L rank plyrMgr birthYear birthMonth birthDay birthCountry birthState birthCity deathYear deathMonth deathDay deathCountry deathState
# <chr> <int> <fct> <fct> <int> <int> <int> <int> <int> <fct> <int> <int> <int> <chr> <chr> <chr> <int> <int> <int> <chr> <chr>
# 1 stengca… 1934 BRO NL 1 153 71 81 6 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 2 stengca… 1935 BRO NL 1 154 70 83 5 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 3 stengca… 1936 BRO NL 1 156 67 87 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 4 stengca… 1938 BSN NL 1 153 77 75 5 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 5 stengca… 1939 BSN NL 1 152 63 88 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 6 stengca… 1940 BSN NL 1 152 65 87 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 7 stengca… 1941 BSN NL 1 156 62 92 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 8 stengca… 1942 BSN NL 1 150 59 89 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 9 stengca… 1943 BSN NL 2 107 47 60 6 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
#10 stengca… 1949 NYA AL 1 155 97 57 1 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# … with 59 more rows, and 14 more variables: deathCity <chr>, nameFirst <chr>, nameLast <chr>, nameGiven <chr>, weight <int>, height <int>, bats <fct>, throws <fct>, debut <chr>,
# finalGame <chr>, retroID <chr>, bbrefID <chr>, deathDate <date>, birthDate <date>
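To see the logic of the all(... %in% ...) test without installing Lahman, here is a small base R sketch on made-up data. The player IDs and rows below are hypothetical and are not taken from the Lahman tables; tapply() plays the role of group_by() plus the per-group test.

```r
# Hypothetical toy data: one row per manager stint
managers_toy <- data.frame(
  playerID = c("a01", "a01", "b02", "b02", "c03"),
  teamID   = c("NYA", "NYN", "NYA", "BOS", "NYN"),
  stringsAsFactors = FALSE
)

# For each playerID, check whether BOTH target teams occur among its teamIDs
has_both <- tapply(managers_toy$teamID, managers_toy$playerID,
                   function(teams) all(c("NYA", "NYN") %in% teams))

# Keep every row belonging to a player who passed the test
both_tbl <- managers_toy[managers_toy$playerID %in% names(has_both)[has_both], ]
both_tbl
```

Only the player with rows for both teams is kept; players with just one of the two teams are dropped entirely, as in the dplyr answer.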

Grouping and/or Counting in R

I'm trying to 're-count' a column in R and am having issues cleaning up the data. I'm cleaning the data by location, and the problem appears once I change CA to California.
all_location <- read.csv("all_location.csv", stringsAsFactors = FALSE)
all_location <- count(all_location, location)
all_location <- all_location[with(all_location, order(-n)), ]
all_location
A tibble: 100 x 2
location n
<chr> <int>
1 CA 3216
2 Alaska 2985
3 Nevada 949
4 Washington 253
5 Hawaii 239
6 Montana 218
7 Puerto Rico 149
8 California 126
9 Utah 83
10 NA 72
In the output above, there are both CA and California. Below, I use grep and replace to change CA to California. However, my issue is that the result still shows two separate California rows instead of one.
ca1 <- grep("CA",all_location$location)
all_location$location <- replace(all_location$location,ca1,"California")
all_location
A tibble: 100 x 2
location n
<chr> <int>
1 California 3216
2 Alaska 2985
3 Nevada 949
4 Washington 253
5 Hawaii 239
6 Montana 218
7 Puerto Rico 149
8 California 126
9 Utah 83
10 NA 72
My goal would be to combine both to a total under n.
all_location$location[substr(all_location$location, 1, 5) %in% "Calif" ] <- "California"
This makes sure that everything starting with "Calif" becomes "California".
I am assuming that one of the existing California entries may contain a trailing space (e.g. "California "), which would explain why this is happening.
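Another possible explanation, going by the code in the question: count() is called before the replacement, so the table is already aggregated and renaming CA afterwards cannot merge the two rows. A minimal base R sketch of re-aggregating after the rename follows, using a few of the counts from the question's output; the object name combined is my own, and an exact match on "CA" is used as an assumption, since grep("CA", ...) would also match any other value containing those letters.

```r
# A few rows of the already-counted table from the question
all_location <- data.frame(
  location = c("CA", "Alaska", "Nevada", "California"),
  n        = c(3216L, 2985L, 949L, 126L),
  stringsAsFactors = FALSE
)

# Rename with an exact match on "CA"
all_location$location[all_location$location == "CA"] <- "California"

# Re-aggregate: sum n over rows that now share the same location
combined <- aggregate(n ~ location, data = all_location, FUN = sum)
combined <- combined[order(-combined$n), ]
combined
```

Alternatively, doing the replacement on the raw data before calling count() should avoid the duplicate row in the first place.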

Data transformation, almost like when you use cast and melt

I don't know what this data transformation is called, nor whether a function exists for it.
My data has this shape:
rank abbrv country eci_value delta year
(int) (fctr) (fctr) (dbl) (int) (int)
1 30 BRA Brazil 0.5588656 2 1995
2 47 URY Uruguay 0.2098838 -14 1995
3 52 PAN Panama 0.1164776 2 1995
4 56 ARG Argentina 0.0013733 7 1995
5 58 VEN Venezuela -0.0329851 11 1995
6 64 COL Colombia -0.2216275 -2 1995
And I want a data frame with just the information in "year", "rank" and "country", presented in this way:
country 1995 1996 1997 1998 ...
Peru rank1995 rank1996 rank1997 rank1998 ...
Brazil rank1995 rank1996 rank1997 rank1998 ...
Chile rank1995 rank1996 rank1997 rank1998 ...
... ... ... ... ...
The variable "year" ranges from 1995 to 2014, and the rank varies each year.
I've thought of using the melt and dcast functions from the reshape2 package, but I couldn't get anything useful out of them.
Thanks
This could work for you. Here is an example using dplyr and tidyr, using your small sample above (you will have to test on a larger data set or provide one).
library(dplyr)
library(tidyr)
df
# rank abbrv country eci_value delta year
#1 30 BRA Brazil 0.5588656 2 1995
#2 47 URY Uruguay 0.2098838 -14 1995
#3 52 PAN Panama 0.1164776 2 1995
#4 56 ARG Argentina 0.0013733 7 1995
#5 58 VEN Venezuela -0.0329851 11 1995
#6 64 COL Colombia -0.2216275 -2 1995
df %>% select(country, year, rank) %>% spread(year, rank)
# country 1995
#1 Argentina 56
#2 Brazil 30
#3 Colombia 64
#4 Panama 52
#5 Uruguay 47
#6 Venezuela 58
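With only one year in the sample, the wide layout is hard to see. The sketch below uses base R's reshape() on a tiny long table with two years; the 1995 ranks come from the question, while the 1996 ranks are invented purely for illustration. Note that reshape() names the new columns rank.1995, rank.1996 and so on, rather than using the bare year as spread() does.

```r
# Long format: one row per country and year
# (1995 ranks are from the question; 1996 ranks are made up for illustration)
long <- data.frame(
  country = c("Brazil", "Uruguay", "Brazil", "Uruguay"),
  year    = c(1995, 1995, 1996, 1996),
  rank    = c(30, 47, 28, 50),
  stringsAsFactors = FALSE
)

# Wide format: one row per country, one rank column per year
wide <- reshape(long, idvar = "country", timevar = "year", direction = "wide")
rownames(wide) <- NULL
wide
```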

Resources