Two-way table with mean of a third variable in R

Here's my problem. I have a table, of which I show a sample here. I would like to have Country as rows, Stars as columns, and the mean of Price for each combination. I used aggregate, which gave me the information I want, but not in the shape I want.
The table looks like this:
Country Stars Price
1 Canada 4 567
2 China 2 435
3 Russia 3 456
4 Canada 5 687
5 Canada 4 432
6 Russia 3 567
7 China 4 1200
8 Russia 3 985
9 Canada 2 453
10 Russia 3 234
11 Russia 4 546
12 Canada 3 786
13 China 2 456
14 China 3 234
15 Russia 4 800
16 China 5 987
I used this code:
aggregate(Stars[, 3], list(Country = Stars$Country, Stars = Stars$Stars), mean)
Output:
Country Stars x
1 Canada 2 453.0
2 China 2 445.5
3 Canada 3 786.0
4 China 3 234.0
5 Russia 3 560.5
6 Canada 4 499.5
7 China 4 1200.0
8 Russia 4 673.0
9 Canada 5 687.0
10 China 5 987.0
Here x stands for the mean; I would also like to rename x to "price mean".
So the goal would be to have one country per row, the number of stars as columns, and the mean of the price for each pair.
Thank you very much.

It seems you want an Excel-like pivot table. The pivottabler package helps a lot here; besides displaying results, it also generates nice HTML tables.
library(pivottabler)
qpvt(df, "Country", "Stars", "mean(Price)")
          2      3                 4       5     Total
Canada  453    786               499.5   687     585
China   445.5  234              1200     987     662.4
Russia         560.5             673             598
Total   448    543.666666666667  709     837     614.0625
For formatting, use the format argument:
qpvt(df, "Country", "Stars", "mean(Price)", format = "%.2f")
          2       3        4        5       Total
Canada  453.00  786.00   499.50   687.00   585.00
China   445.50  234.00  1200.00   987.00   662.40
Russia          560.50   673.00            598.00
Total   448.00  543.67   709.00   837.00   614.06
For HTML output, use qhpvt instead:
qhpvt(df, "Country", "Stars", "mean(Price)")
(Output: the same pivot table rendered as an HTML table.)
Note: tidyverse and base R methods are also possible and easy too; see the sketch below.
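For instance, a minimal tidyverse sketch (assuming the sample data is in a data frame df, as in the qpvt call above): compute the mean per Country/Stars pair, then spread Stars into columns. A side benefit is that the value column gets a proper name, as asked above.
library(dplyr)
library(tidyr)

df %>%
  group_by(Country, Stars) %>%
  summarise(price_mean = mean(Price), .groups = "drop") %>%
  pivot_wider(names_from = Stars, values_from = price_mean)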

To obtain a two-way table of means in base R, you can use tapply; wrapping it in with() avoids having to attach the data:
with(df, tapply(Price, list(Country, Stars), mean))

Add multiple columns lagged by one year

I need to add a 1-year-lagged version of multiple columns from my dataframe. Here's my data:
data <- data.frame(
  Year = c("2011","2011","2011","2012","2012","2012","2013","2013","2013"),
  Country = c("America","China","India","America","China","India","America","China","India"),
  Value1 = c(234,443,754,334,117,112,987,903,476),
  Value2 = c(2,4,5,6,7,8,1,2,2))
And I want to add two columns that contain Value1 and Value2 at t-1 (the expected result is the one shown in the answers below). How can I do this? What would be the correct way to lag my variables by year?
Thanks in advance!
Using data.table:
library(data.table)
setDT(data)
cols <- grep("^Value", colnames(data), value = TRUE)
# shift() lags each Value column by one row within each Country
data[, paste0(cols, "_lag") := lapply(.SD, shift), .SDcols = cols, by = Country]
# Year Country Value1 Value2 Value1_lag Value2_lag
# 1: 2011 America 234 2 NA NA
# 2: 2011 China 443 4 NA NA
# 3: 2011 India 754 5 NA NA
# 4: 2012 America 334 6 234 2
# 5: 2012 China 117 7 443 4
# 6: 2012 India 112 8 754 5
# 7: 2013 America 987 1 334 6
# 8: 2013 China 903 2 117 7
# 9: 2013 India 476 2 112 8
In dplyr (>= 1.1.0), use lag by group:
library(dplyr)

data %>%
  mutate(across(contains("Value"), lag, .names = "{col}_lagged"), .by = Country)
Year Country Value1 Value2 Value1_lagged Value2_lagged
1 2011 America 234 2 NA NA
2 2011 China 443 4 NA NA
3 2011 India 754 5 NA NA
4 2012 America 334 6 234 2
5 2012 China 117 7 443 4
6 2012 India 112 8 754 5
7 2013 America 987 1 334 6
8 2013 China 903 2 117 7
9 2013 India 476 2 112 8
Below 1.1.0:
data %>%
  group_by(Country) %>%
  mutate(across(contains("Value"), lag, .names = "{col}_lagged")) %>%
  ungroup()
Another way using dplyr to get the job done:
library(dplyr)
data_lagged <- data %>%
  group_by(Country) %>%
  mutate(Value1_Lagged = lag(Value1),
         Value2_Lagged = lag(Value2),
         Year = as.integer(as.character(Year)) + 1)

data_final <- cbind(data, data_lagged[, c("Value1_Lagged", "Value2_Lagged")])
data_final
Output:
Year Country Value1 Value2 Value1_Lagged Value2_Lagged
1 2011 America 234 2 NA NA
2 2011 China 443 4 NA NA
3 2011 India 754 5 NA NA
4 2012 America 334 6 234 2
5 2012 China 117 7 443 4
6 2012 India 112 8 754 5
7 2013 America 987 1 334 6
8 2013 China 903 2 117 7
9 2013 India 476 2 112 8
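A base R equivalent is also possible. Here's a minimal sketch using ave() to shift each value column within Country (it assumes the rows are already sorted by Year within each country, as in this data):
# shift a vector down by one position, padding with NA
lag1 <- function(x) c(NA, head(x, -1))

data$Value1_lag <- ave(data$Value1, data$Country, FUN = lag1)
data$Value2_lag <- ave(data$Value2, data$Country, FUN = lag1)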

Filling NAs within a variable depending on where they are located

I have data of the following sort:
> df <- data.frame(Date = rep(seq(2000:2014), 2),
+ Country=c(rep("Italy",15),rep("Germany",15)),
+ var1= c(NA, NA, NA, NA, NA, 20:21, NA, NA, NA, 27:28, NA, NA, NA, NA, NA, NA, NA, 74:77, NA, NA, 68:70, NA, NA))
>
> df
Date Country var1
1 1 Italy NA
2 2 Italy NA
3 3 Italy NA
4 4 Italy NA
5 5 Italy NA
6 6 Italy 20
7 7 Italy 21
8 8 Italy NA
9 9 Italy NA
10 10 Italy NA
11 11 Italy 27
12 12 Italy 28
13 13 Italy NA
14 14 Italy NA
15 15 Italy NA
16 1 Germany NA
17 2 Germany NA
18 3 Germany NA
19 4 Germany NA
20 5 Germany 74
21 6 Germany 75
22 7 Germany 76
23 8 Germany 77
24 9 Germany NA
25 10 Germany NA
26 11 Germany 68
27 12 Germany 69
28 13 Germany 70
29 14 Germany NA
30 15 Germany NA
>
> df1 <- data.frame(Date = rep(seq(2000:2014), 2),
+ Country=c(rep("Italy",15),rep("Germany",15)),
+ var1= c(15.67052, 16.45405, 17.27675, 18.14059, 19.04762, 20:21, 22.36173, 23.81176, 25.35582, 27:28, 29.03704, 30.11249, 31.22777, 63.12417, 65.68326, 68.34609, 71.11688, 74:77, 73.87488, 70.8766, 68:70, 72.05882, 76.3599))
>
> df1
Date Country var1
1 1 Italy 15.67052
2 2 Italy 16.45405
3 3 Italy 17.27675
4 4 Italy 18.14059
5 5 Italy 19.04762
6 6 Italy 20.00000
7 7 Italy 21.00000
8 8 Italy 22.36173
9 9 Italy 23.81176
10 10 Italy 25.35582
11 11 Italy 27.00000
12 12 Italy 28.00000
13 13 Italy 29.03704
14 14 Italy 30.11249
15 15 Italy 31.22777
16 1 Germany 63.12417
17 2 Germany 65.68326
18 3 Germany 68.34609
19 4 Germany 71.11688
20 5 Germany 74.00000
21 6 Germany 75.00000
22 7 Germany 76.00000
23 8 Germany 77.00000
24 9 Germany 73.87488
25 10 Germany 70.87660
26 11 Germany 68.00000
27 12 Germany 69.00000
28 13 Germany 70.00000
29 14 Germany 72.05882
30 15 Germany 76.35990
As you can see, I have:
"beginning NAs", i.e. NAs at the beginning of the sampling period for each country;
"within NAs", i.e. NAs in the middle of the sampling period, forming gaps in the data;
"ending NAs", i.e. NAs at the end of the sampling period, where the data is not available yet (the last couple of years).
I have a threefold goal:
Make the "beginning NAs" grow at the same rate as the first two non-NAs are growing, or at the rate at which another proxy country is growing.
Make the "within NAs" grow at a CAGR (compound annual growth rate), i.e. each "within NA" growth factor is the ratio of the next available non-NA over the previous value, raised to one over the number of years separating them. E.g. for Italy in year 9, the growth rate should be (27/(value in year 8))^(1/3), since year 11 is three years after year 8 (see the sketch after this question).
Make the "ending NAs" grow in different ways. The default should be to simply grow as the last two non-NAs are growing, but where needed I have to make country-specific adjustments based on assumptions (e.g. some countries where data is not available will be assumed to grow at the same rate as a proxy country). Perhaps a loop that breaks for the exceptions is best for this goal.
The reason it's so messy is that I have to automate an Excel data-processing routine that was done by hand in the past. So, much as I'd like to change things (e.g. simply interpolate the within NAs), I should follow what was done before.
Feel free to get creative; however, I'd love to remain within the dplyr framework.
Thanks in advance
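A minimal sketch for the "within NAs" goal (the cagr_fill helper below is hypothetical, not an existing function): filling interior gaps at a constant compound growth rate is the same as linear interpolation on the log scale, so stats::approx() on log(var1) reproduces the CAGR rule and matches the "within" values shown in df1. The beginning and ending NAs would still need the extrapolation rules from goals 1 and 3.
library(dplyr)

# hypothetical helper: fill interior NAs at a constant compound growth rate
# between the surrounding non-NA values (linear interpolation on the log scale)
cagr_fill <- function(x) {
  known <- which(!is.na(x))
  if (length(known) < 2) return(x)
  inner <- seq(min(known), max(known))
  x[inner] <- exp(approx(known, log(x[known]), xout = inner)$y)
  x
}

df %>%
  group_by(Country) %>%
  mutate(var1_filled = cagr_fill(var1)) %>%
  ungroup()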

Backtrack values in R for a logic

My request is slightly complicated.
Below is what my data looks like.
S.no Date City Sales diff Indicator
1 1 1/1/2017 New York 2795 0 0
2 2 1/31/2017 New York 4248 1453 0
3 3 3/2/2017 New York 1330 -2918 1
4 4 4/1/2017 New York 3535 2205 0
5 5 5/1/2017 New York 4330 795 0
6 6 5/31/2017 New York 3360 -970 1
7 7 6/30/2017 New York 2238 -1122 1
8 8 1/1/2017 Paris 1451 0 0
9 9 1/31/2017 Paris 2339 888 0
10 10 3/2/2017 Paris 2029 -310 1
11 11 4/1/2017 Paris 1850 -179 1
12 12 5/1/2017 Paris 2800 950 1
13 13 5/31/2017 Paris 1986 -814 0
14 14 6/30/2017 Paris 3776 1790 0
15 15 1/1/2017 London 1646 0 0
16 16 1/31/2017 London 3575 1929 0
17 17 3/2/2017 London 1161 -2414 1
18 18 4/1/2017 London 1766 605 0
19 19 5/1/2017 London 2799 1033 0
20 20 5/31/2017 London 2761 -38 1
21 21 6/30/2017 London 1048 -1713 1
diff is the current month's Sales minus the previous month's Sales within each group, and Indicator flags whether diff is negative or positive.
I want to compute a logic for each group starting from the last row and moving to the first row, i.e. in reverse order.
Scanning in reverse order, I want to find the Sales value where the Indicator was 1, then compare that captured Sales value to a threshold value (2000) for the next steps.
Below are the two cases of the comparison (captured Sales vs. threshold):
a. If the captured Sales value at the first Indicator = 1 row (scanning from the last row to the first) is less than 2000, store the captured values in a new dataset for each group.
b. If the captured Sales value at the first Indicator = 1 row is greater than 2000, skip that row, move on to the next row where Indicator = 1, and repeat steps a) and b).
I want the result in a new dataset with a single row for each City, giving the Sales value for the aforementioned logic along with the Date.
I simply want to understand how I can implement this logic in R. Will the rle function help?
Result:
S.no Date City Value(Sales)
3. 3/2/2017 New York 1330
11. 4/1/2017 Paris 1850
21. 6/30/2017 London 1048
Thanks,
J
If we assume that your data is already arranged in ascending order, you can do the following with base R:
threshold <- 2000
my_new_df <- my_df[my_df$Indicator == 1 & my_df$Sales < threshold, ]
my_new_df
# S.no Date City Sales diff Indicator
# 3 3 2017-03-02 New York 1330 -2918 1
# 11 11 2017-04-01 Paris 1850 -179 1
# 17 17 2017-03-02 London 1161 -2414 1
# 21 21 2017-06-30 London 1048 -1713 1
Now we have all rows where the Indicator equals one and the Sales value is less than our threshold. But London has two rows and we only want the last one:
my_new_df <- my_new_df[!duplicated(my_new_df$City, fromLast = T),
c("S.no", "Date", "City", "Sales")]
my_new_df
# S.no Date City Sales
# 3 3 2017-03-02 New York 1330
# 11 11 2017-04-01 Paris 1850
# 21 21 2017-06-30 London 1048
With the fromLast argument to duplicated, we start checking from the last row whether the City has already appeared in the data set, so only the last qualifying row per City is kept.
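The same logic can also be sketched in dplyr (an equivalent of the base R approach above, not a different method): filter the qualifying rows, then keep the last one per City.
library(dplyr)

my_df %>%
  filter(Indicator == 1, Sales < 2000) %>%
  group_by(City) %>%
  slice_tail(n = 1) %>%
  ungroup() %>%
  select(S.no, Date, City, Sales)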

Merge two dataframes on a column with fuzzy match in R

I have two dataframes, one for 2008, the other for 2004 and 2012 data. Examples of the dataframes are below.
df_08 <- read.table(text = c("
observation year x code_location location
1 2008 300 23-940 town no. 1 town no. 1
2 2008 234 23-941 town no. 2 town no. 2
3 2008 947 23-942 city St 23 city St 23
4 2008 102 23-943 Mtn town 5 Mtn town 5 "), header = TRUE)
df_04_12 <- read.table(text = c("
observation year y code_location location
1 2004 124 23-940 town no. 1 town no. 1
2 2004 395 23-345 town # 2 town # 2
3 2004 1349 23-942 city St 23 city St 23
4 2012 930 53-443 Mtn town 5 Mtn town 5
5 2012 185 99-999 town no. 1 town no. 1
6 2012 500 23-941 town Number 2 town Number 2
7 2012 185 34-942 city Street 23 city Street 23
8 2012 195 23-943 Mt town 5 Mt town 5 "), header = TRUE)
I want to merge df_08 to df_04_12 using the location variable (the codes are not consistent across years). However, slight variations in the location name, e.g. Mtn vs. Mt or no. vs. #, result in no match. Given these slight variations between location names, is there a way to merge these dataframes and get the following? I currently do not have any code for this since I am not sure how to match the locations for a merge.
observation year y code_location location.x location.y y.y
1 2004 124 23-940 town no. 1 town no. 1 town no. 1 300
2 2004 395 23-345 town # 2 town # 2 town no. 2 234
3 2004 1349 23-942 city St 23 city St 23 city St 23 947
4 2012 930 53-443 Mtn town 5 Mtn town 5 Mtn town 5 102
5 2012 185 99-999 town no. 1 town no. 1 town no. 1 300
6 2012 500 23-941 town Number 2 town Number 2 town no. 2 234
7 2012 185 34-942 city Street 23 city Street 23 city St 23 947
8 2012 195 23-943 Mt town 5 Mt town 5 Mtn town 5 102
You can use Levenshtein distance on character variables, but there is no way to account for symbols. I would suggest you strip all of the symbols before merging and then use the stringdist package. There is no clean solution for this problem; you will have to develop your own method as it relates to your data.
Among the methods used in fuzzy matching are string-distance calculations and Soundex transformations of the data; you just have to find out what is appropriate for your data.
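For example, a minimal sketch with stringdist (it assumes both frames have a location column as in the question; the clean() helper and the Jaro-Winkler method are my choices, not a prescribed recipe): strip symbols, compute pairwise string distances, and match each 2004/2012 location to its nearest 2008 location.
library(stringdist)

# hypothetical cleaner: lower-case and drop everything except letters, digits, spaces
clean <- function(x) gsub("[^a-z0-9 ]", "", tolower(x))

# pairwise Jaro-Winkler distances, then the nearest df_08 row for each row
d <- stringdistmatrix(clean(df_04_12$location), clean(df_08$location), method = "jw")
best <- apply(d, 1, which.min)

merged <- cbind(df_04_12,
                location_08 = df_08$location[best],
                x = df_08$x[best])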

Add new column to long dataframe from another dataframe?

Say that I have two dataframes. One lists the names of soccer players, the teams they have played for, and the number of goals they scored on each team. The other contains the soccer players' names and ages. How do I add a "names_age" column to the goals dataframe, i.e. the age for the players in the first column "names" (not for "teammates_names")? And how do I add an additional column with the teammates' ages? In short, I'd like two age columns: one for the first set of players and one for the second set.
> AGE_DF
names age
1 Sam 20
2 Jon 21
3 Adam 22
4 Jason 23
5 Jones 24
6 Jermaine 25
> GOALS_DF
names goals team teammates_names teammates_goals teammates_team
1 Sam 1 USA Jason 1 HOLLAND
2 Sam 2 ENGLAND Jason 2 PORTUGAL
3 Sam 3 BRAZIL Jason 3 GHANA
4 Sam 4 GERMANY Jason 4 COLOMBIA
5 Sam 5 ARGENTINA Jason 5 CANADA
6 Jon 1 USA Jones 1 HOLLAND
7 Jon 2 ENGLAND Jones 2 PORTUGAL
8 Jon 3 BRAZIL Jones 3 GHANA
9 Jon 4 GERMANY Jones 4 COLOMBIA
10 Jon 5 ARGENTINA Jones 5 CANADA
11 Adam 1 USA Jermaine 1 HOLLAND
12 Adam 1 ENGLAND Jermaine 1 PORTUGAL
13 Adam 4 BRAZIL Jermaine 4 GHANA
14 Adam 3 GERMANY Jermaine 3 COLOMBIA
15 Adam 2 ARGENTINA Jermaine 2 CANADA
What I have tried: I successfully got this to work using a for loop, but the actual data I am working with has thousands of rows and this takes a long time. I would like a vectorized approach, but I'm having trouble coming up with one.
Try merge or match.
Here's merge (which is likely to screw up your row ordering and can sometimes be slow):
merge(AGE_DF, GOALS_DF, all = TRUE)
Here's match, which makes use of basic indexing and subsetting. Assign the result to a new column, as shown below.
AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
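For instance, to create both requested age columns (a sketch reusing the question's column names; "teammates_age" is my name for the second column):
GOALS_DF$names_age     <- AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
GOALS_DF$teammates_age <- AGE_DF$age[match(GOALS_DF$teammates_names, AGE_DF$names)]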
Here's another option to consider: Convert your dataset into a long format first, and then do the merge. Here, I've done it with melt and "data.table":
library(reshape2)
library(data.table)
setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
            measure.vars = c("names", "teammates_names"),
            value.name = "names"),
       names)[as.data.table(AGE_DF)]
# rn goals team teammates_goals teammates_team variable names age
# 1: 1 1 USA 1 HOLLAND names Sam 20
# 2: 2 2 ENGLAND 2 PORTUGAL names Sam 20
# 3: 3 3 BRAZIL 3 GHANA names Sam 20
# 4: 4 4 GERMANY 4 COLOMBIA names Sam 20
# 5: 5 5 ARGENTINA 5 CANADA names Sam 20
# 6: 6 1 USA 1 HOLLAND names Jon 21
## <<SNIP>>
# 28: 13 4 BRAZIL 4 GHANA teammates_names Jermaine 25
# 29: 14 3 GERMANY 3 COLOMBIA teammates_names Jermaine 25
# 30: 15 2 ARGENTINA 2 CANADA teammates_names Jermaine 25
# rn goals team teammates_goals teammates_team variable names age
I've added the rownames so you can use dcast to get back to the wide format and retain the row ordering if that's important.
