Looking up values without a loop in R

I need to look up a value in a data frame based on multiple criteria in another data frame. For example:
A=
Country Year Number
USA 1994 455
Canada 1997 342
Canada 1998 987
to which I need to add a column named "Rate", with values looked up from
B=
Year USA Canada
1993 21 654
1994 41 321
1995 56 789
1996 85 123
1997 65 456
1998 1 999
So that the final data frame is
C=
Country Year Number Rate
USA 1994 455 41
Canada 1997 342 456
Canada 1998 987 999
In other words: Look up year and country from A in B and result is C. I would like to do this without a loop. I would like a general approach, such that I would be able to look up based on more than two criteria.
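For reference, here is one loop-free base R sketch of this kind of multi-criteria lookup (not one of the posted answers; it assumes A and B are the data frames shown above). The idea is to reshape B to long form, build a composite key from the criteria columns, and match on it; adding more criteria just means pasting more columns into the key.
# sketch: long form of B, then a vectorised multi-column lookup via match
B_long <- data.frame(Year = rep(B$Year, 2),
                     Country = rep(c("USA", "Canada"), each = nrow(B)),
                     Rate = c(B$USA, B$Canada))
A$Rate <- B_long$Rate[match(paste(A$Country, A$Year),
                            paste(B_long$Country, B_long$Year))]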

Here's another way using data.table that doesn't require converting the 2nd data table to long form:
require(data.table) # 1.9.6+
A[B, Rate := get(Country), by=.EACHI, on="Year"]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
where A and B are data.tables, and Country is of character type.
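For a reproducible setup matching the data shown in the question (an assumption about how A and B were built):
library(data.table) # 1.9.6+
A <- data.table(Country = c("USA", "Canada", "Canada"),
                Year = c(1994L, 1997L, 1998L),
                Number = c(455L, 342L, 987L))
B <- data.table(Year = 1993:1998,
                USA = c(21L, 41L, 56L, 85L, 65L, 1L),
                Canada = c(654L, 321L, 789L, 123L, 456L, 999L))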

We can melt the second dataset from 'wide' to 'long' format, then merge it with the first dataset to get the expected output.
library(reshape2)
res <- merge(A, melt(B, id.var='Year'),
             by.x = c('Country', 'Year'), by.y = c('variable', 'Year'))
names(res)[4] <- 'Rate'
res
# Country Year Number Rate
#1 Canada 1997 342 456
#2 Canada 1998 987 999
#3 USA 1994 455 41
Or we can use gather from tidyr and right_join to get this done.
library(dplyr)
library(tidyr)
gather(B, Country, Rate, -Year) %>%
  right_join(., A)
# Year Country Rate Number
#1 1994 USA 41 455
#2 1997 Canada 456 342
#3 1998 Canada 999 987
Or, as @DavidArenburg mentioned in the comments, this can also be done with data.table. We convert the 'data.frame' to 'data.table' (setDT(A)), melt the second dataset, and join on 'Country' and 'Year'.
library(data.table)#v1.9.6+
setDT(A)[melt(setDT(B), 1L, variable.name = "Country", value.name = "Rate"),
         on = c("Country", "Year"),
         nomatch = 0L]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
Or a shorter version (if we are not too picky about variable names):
setDT(A)[melt(B, 1L), on = c(Country = "variable", Year = "Year"), nomatch = 0L]
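If needed, melt's default value-column name can be fixed afterwards (a small follow-up sketch; res is a hypothetical name for the saved join result):
res <- setDT(A)[melt(B, 1L), on = c(Country = "variable", Year = "Year"), nomatch = 0L]
setnames(res, "value", "Rate")  # 'value' is melt's default value column name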

Casting and Melting with reshape in R

As an example let's say that I have the following dataframe:
datas = data.frame(Variables = c("Power", "Happiness", "Power", "Happiness"),
                   Country = c("France", "France", "UK", "UK"),
                   y2000 = c(1213, 1872, 1726, 2234),
                   y2001 = c(1234, 2345, 6433, 9082))
Resulting in the following output:
Variables Country y2000 y2001
1 Power France 1213 1234
2 Happiness France 1872 2345
3 Power UK 1726 6433
4 Happiness UK 2234 9082
I would like to reshape this dataframe as follows:
Year Country Power Happiness
1 2000 France 1213 1872
2 2001 France 1234 2345
3 2000 UK 1726 2234
4 2001 UK 6433 9082
I started out with:
q2=cast(datas, Country~Variables, value="2000")
But then got the following error:
Aggregation requires fun.aggregate: length used as default
Error in `[.data.frame`(sort_df(data, variables), , c(variables, "value"), :
undefined columns selected
Any suggestions?
Also: Would it matter for the solution that my dataframe is really big (417120 by 62)?
Perhaps you're interested in a tidyverse alternative:
library(tidyverse)
df %>%
  gather(Year, val, -Variables, -Country) %>%
  spread(Variables, val)
# Country Year Happiness Power
#1 France 2000 1872 1213
#2 France 2001 2345 1234
#3 UK 2000 2234 1726
#4 UK 2001 9082 6433
Or using reshape2::melt and reshape2::dcast
reshape2::dcast(
  reshape2::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
  Country + Year ~ Variables)
# Country Year Happiness Power
#1 France 2000 1872 1213
#2 France 2001 2345 1234
#3 UK 2000 2234 1726
#4 UK 2001 9082 6433
Or (identically) using data.table::melt and data.table::dcast
data.table::dcast(
  data.table::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
  Country + Year ~ Variables)
# Country Year Happiness Power
#1 France 2000 1872 1213
#2 France 2001 2345 1234
#3 UK 2000 2234 1726
#4 UK 2001 9082 6433
In terms of performance/runtime, I would expect the data.table or tidyr solutions to be the most efficient. You can check by running a microbenchmark on some larger sample data, as shown below.
Sample data
df <- read.table(text =
  " Variables Country 2000 2001
  1 Power France 1213 1234
  2 Happiness France 1872 2345
  3 Power UK 1726 6433
  4 Happiness UK 2234 9082", header = TRUE)
colnames(df)[3:4] <- c("2000", "2001")
Benchmark analysis
The following are the results of a microbenchmark analysis of the four methods, based on a (slightly) larger 78x22 sample dataset.
set.seed(2017)
df <- data.frame(
  Variables = rep(c("Power", "Happiness", "something_else"), 26),
  Country = rep(LETTERS[1:26], each = 3),
  matrix(sample(10000, 20 * 26 * 3), nrow = 26 * 3))
colnames(df)[3:ncol(df)] <- 2000:2019
library(microbenchmark)
library(tidyr)
res <- microbenchmark(
  reshape2 = {
    reshape2::dcast(
      reshape2::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
      Country + Year ~ Variables)
  },
  tidyr = {
    df %>%
      gather(Year, val, -Variables, -Country) %>%
      spread(Variables, val)
  },
  datatable = {
    data.table::dcast(
      data.table::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
      Country + Year ~ Variables)
  },
  reshape = {
    reshape::cast(reshape::melt(df), Country + variable ~ Variables)
  }
)
res
#Unit: milliseconds
# expr min lq mean median uq max neval
# reshape2 3.088740 3.449686 4.313044 3.919372 5.112560 7.856902 100
# tidyr 4.482361 4.982017 6.215872 5.771133 6.931964 28.293377 100
# datatable 3.179035 3.511542 4.861192 4.040188 5.123103 46.010810 100
# reshape 27.371094 30.226222 32.425667 32.504644 34.118499 41.286803 100
library(ggplot2)
autoplot(res)
Based on the results above, I would strongly recommend using tidyr instead of reshape, or at least reshape2 instead of reshape, as reshape2 fixes many of reshape's performance issues.
In reshape itself, you have to melt datas first:
> cast(melt(datas), Country + variable ~ Variables)
Using Variables, Country as id variables
Country variable Happiness Power
1 France y2000 1872 1213
2 France y2001 2345 1234
3 UK y2000 2234 1726
4 UK y2001 9082 6433
And then renaming and converting the columns as necessary.
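For instance (a hypothetical cleanup step, assuming the cast result is saved as res):
res <- cast(melt(datas), Country + variable ~ Variables)
# rename the melt-generated 'variable' column and strip the 'y' prefix
names(res)[names(res) == "variable"] <- "Year"
res$Year <- as.integer(sub("^y", "", res$Year))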
In reshape2 the code is identical, but you would use dcast instead of cast. tidyr, as in @Maurits Evers's solution above, is a better choice, and most development has shifted from reshape2 to the tidyverse.

Aggregate/Group_by second minimum value in R

I have used either group_by() in dplyr or the aggregate() function to aggregate across columns in R. For my current problem I want to group by individual, taking the second-lowest value of one column (Number) and then the lowest of another (Year). So, if my data looks like this:
Number Individual Year Value
123 M. Smith 2010 234
435 M. Smith 2011 346
435 M. Smith 2012 356
524 M. Smith 2015 432
119 J. Jones 2010 345
119 J. Jones 2012 432
254 J. Jones 2013 453
876 J. Jones 2014 654
I want it to become:
Number Individual Year Value
435 M. Smith 2011 346
254 J. Jones 2013 453
Thank you.
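For reference, a loop-free base R sketch of one way to do this (assuming the dt data frame built in the answer below): keep each individual's rows with the second-smallest distinct Number, then take the earliest Year among those.
# second-smallest distinct value of a vector
second_min <- function(x) sort(unique(x))[2]
res <- do.call(rbind, lapply(split(dt, dt$Individual), function(d) {
  d <- d[d$Number == second_min(d$Number), ]
  d[which.min(d$Year), ]
}))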
We can use the dplyr package. dt2 is the final output. The idea is to filter out the minimum in the Number column, then arrange the data frame by Individual, Number, and Year. Finally, select the first row of each group.
# Load package
library(dplyr)
# Create example data frame
dt <- read.table(text = "Number Individual Year Value
123 'M. Smith' 2010 234
435 'M. Smith' 2011 346
435 'M. Smith' 2012 356
524 'M. Smith' 2015 432
119 'J. Jones' 2010 345
119 'J. Jones' 2012 432
254 'J. Jones' 2013 453
876 'J. Jones' 2014 654",
header = TRUE, stringsAsFactors = FALSE)
# Process the data
dt2 <- dt %>%
  group_by(Individual) %>%
  filter(Number != min(Number)) %>%
  arrange(Individual, Number, Year) %>%
  slice(1)
We can use dplyr
library(dplyr)
df1 %>%
  group_by(Individual) %>%
  arrange(Individual, Number) %>%
  filter(Number != max(Number)) %>%
  slice(which.max(Number))
# A tibble: 2 x 4
# Groups: Individual [2]
# Number Individual Year Value
# <int> <chr> <int> <int>
#1 254 J. Jones 2013 453
#2 435 M. Smith 2011 346

R: How to spread, group_by, summarise and mutate at the same time

I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but with the full data I only seem to get endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows'. After hours of reading the dplyr docs and trying things, I've given up. Can anyone help with this code?
data %>%
  spread(Year, Orders) %>%
  group_by(CountryName) %>%
  summarise_all(.funs = c(Sum = 'sum'), na.rm = TRUE) %>%
  mutate(percent_inc = 100 * ((`2014_Sum` - `2015_Sum`) / `2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
set.seed(2)
dat = data.frame(Country = sample(LETTERS[1:5], 500, replace = TRUE),
                 Year = sample(2014:2015, 500, replace = TRUE),
                 Orders = sample(-1:20, 500, replace = TRUE))
dat %>% group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm = TRUE)) %>%
  spread(Year, sum_orders) %>%
  mutate(Pct = (`2014` - `2015`) / `2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country = sample(LETTERS[1:5], 500, replace = TRUE),
                 Year = sample(2010:2015, 500, replace = TRUE),
                 Orders = sample(-1:20, 500, replace = TRUE))
dat %>% group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm = TRUE)) %>%
  group_by(Country) %>%
  arrange(Country, Year) %>%
  mutate(Pct = c(NA, -diff(sum_orders)) / lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1: You're likely getting the error 'duplicate identifiers for rows' because of spread. spread wants to make N columns out of your N unique values, but it needs to know which unique row to place those values in. If a value-combination occurs more than once, for instance:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
(which shows up twice), then spread cannot tell which row the data should go in. The quick fix is to add a unique row identifier before spreading: data %>% mutate(row = row_number()) %>% spread(...).
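A sketch of that quick fix (hypothetical, since the full data aren't shown; data is the OP's object from the question):
data %>%
  mutate(row = row_number()) %>%  # unique id so spread can place each value
  spread(Year, Orders)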
Error 2: You're likely getting the error 'sum not meaningful for factors' because of summarise_all. summarise_all operates on all columns, but some of your columns contain strings (or factors): what would United Kingdom + United Kingdom equal? Try instead summarise(`2014_Sum` = sum(`2014`), `2015_Sum` = sum(`2015`)) (note the backticks, which non-syntactic column names like 2014 require).
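Putting both fixes together, the intended pipeline might look like this (a sketch under the same assumptions):
data %>%
  mutate(row = row_number()) %>%
  spread(Year, Orders) %>%
  group_by(CountryName) %>%
  summarise(`2014_Sum` = sum(`2014`, na.rm = TRUE),
            `2015_Sum` = sum(`2015`, na.rm = TRUE)) %>%
  mutate(percent_inc = 100 * (`2014_Sum` - `2015_Sum`) / `2014_Sum`)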

R: Find top, mid and bottom values to create a category column in dplyr

I would like to create a 'Category' column in the below dataset based on the sales and year.
set.seed(30)
df <- data.frame(
  Year = rep(2010:2015, each = 6),
  Country = rep(c('India', 'China', 'Japan', 'USA', 'Germany', 'Russia'), 6),
  Sales = round(runif(18, 100, 900))
)
head(df)
Year Country Sales
1 2010 India 661
2 2010 China 888
3 2010 Japan 285
4 2010 USA 272
5 2010 Germany 332
6 2010 Russia 660
Categories are:
Top 2 countries with highest sales in each year: Category - 1
Bottom 2 countries with lowest sales in each year: Category - 3
Remaining countries by year: Category - 2
Expected dataset might look like:
Year Country Sales Category
1 2010 India 661 1
2 2010 China 888 1
3 2010 Japan 285 3
4 2010 USA 272 3
5 2010 Germany 332 2
6 2010 Russia 660 2
You don't need much here; just group_by Year, arrange from greatest to least Sales, and then use mutate to build the Category column: 1 for the top two rows, 3 for the bottom two, and 2 for everything in between:
df %>% group_by(Year) %>%
  arrange(desc(Sales)) %>%
  mutate(Category = c(1, 1, rep(2, n() - 4), 3, 3))
# Source: local data frame [36 x 4]
# Groups: Year [6]
#
# Year Country Sales Category
# (int) (fctr) (dbl) (dbl)
# 1 2010 China 491 1
# 2 2010 USA 436 1
# 3 2010 Japan 391 2
# 4 2010 Germany 341 2
# 5 2010 Russia 218 3
# 6 2010 India 179 3
# 7 2011 Japan 873 1
# 8 2011 India 819 1
# 9 2011 Russia 418 2
# 10 2011 China 279 2
# .. ... ... ... ...
It will fail with fewer than four countries, but that doesn't sound like an issue from the question.
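If that ever did matter, one way to guard against small groups (a sketch, not part of the original answer; assumes dplyr is loaded) is to rank within each year:
df %>%
  group_by(Year) %>%
  mutate(rnk = rank(-Sales, ties.method = "first"),
         Category = ifelse(rnk <= 2, 1, ifelse(rnk > n() - 2, 3, 2))) %>%
  select(-rnk)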
We can use cut to create a 'Category' column after grouping by "Year".
library(dplyr)
df %>%
  group_by(Year) %>%
  mutate(Category = as.numeric(cut(-Sales,
    breaks = c(-Inf, quantile(-Sales, prob = c(0, .5, 1))))))
Or using data.table
library(data.table)
setDT(df)[order(-Sales), Category := if (.N > 4) rep(1:3, c(2, .N - 4, 2))
                                     else rep(seq(.N), each = ceiling(.N / 3)),
          by = Year]
This should also work when there are fewer than 4 elements in a given "Year", e.g. if we remove the first five observations from 2010.
df1 <- df[-(1:5),]
setDT(df1)[order(-Sales), Category := if (.N > 4) rep(1:3, c(2, .N - 4, 2))
                                      else rep(seq(.N), each = ceiling(.N / 3)),
           by = Year]
head(df1)
# Year Country Sales Category
#1: 2010 Russia 218 1
#2: 2011 India 819 1
#3: 2011 China 279 2
#4: 2011 Japan 873 1
#5: 2011 USA 213 3
#6: 2011 Germany 152 3

Reducing rows and expanding columns of data.frame in R

I have this data.frame in R.
> a <- data.frame(year = c(2001,2001,2001,2001), country = c("Japan", "Japan","US","US"), type = c("a","b","a","b"), amount = c(35,67,39,45))
> a
year country type amount
1 2001 Japan a 35
2 2001 Japan b 67
3 2001 US a 39
4 2001 US b 45
How should I transform this into a data.frame that looks like this?
year country type.a type.b
1 2001 Japan 35 67
2 2001 US 39 45
Basically I want the number of rows to be the number of (year x country) pairs, and I want to create additional columns for each type.
A base R solution, though it requires renaming columns and rows:
reshape(a, v.names="amount", timevar="type", idvar="country", direction="wide")
year country amount.a amount.b
1 2001 Japan 35 67
3 2001 US 39 45
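The renaming might then look like this (a hypothetical cleanup step):
wide <- reshape(a, v.names = "amount", timevar = "type",
                idvar = "country", direction = "wide")
names(wide) <- sub("^amount\\.", "type.", names(wide))  # amount.a -> type.a
rownames(wide) <- NULL  # reset the row names left over from reshape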
A reshape2 solution:
library(reshape2)
dcast(a, year+country ~ paste("type", type, sep="."), value.var="amount")
year country type.a type.b
1 2001 Japan 35 67
2 2001 US 39 45
Another way would be to use spread in the tidyr package and rename in the dplyr package to deliver the expected outcome.
library(dplyr)
library(tidyr)
spread(a, type, amount) %>%
  rename(type.a = a, type.b = b)
# year country type.a type.b
#1 2001 Japan 35 67
#2 2001 US 39 45
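As a side note, in current tidyr (1.0.0 and later) spread is superseded by pivot_wider, which can attach the prefix directly; a minimal sketch:
library(tidyr)
pivot_wider(a, names_from = type, values_from = amount, names_prefix = "type.")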
