Here's my problem. I have a table, a sample of which is shown below. I would like Country as rows, Stars as columns, and the mean of Price for each combination. I used aggregate, which gave me the information I want, but not in the shape I want.
The table looks like this:
Country Stars Price
1 Canada 4 567
2 China 2 435
3 Russia 3 456
4 Canada 5 687
5 Canada 4 432
6 Russia 3 567
7 China 4 1200
8 Russia 3 985
9 Canada 2 453
10 Russia 3 234
11 Russia 4 546
12 Canada 3 786
13 China 2 456
14 China 3 234
15 Russia 4 800
16 China 5 987
I used this code:
aggregate(Stars[, 3], list(Country = Stars$Country, Stars = Stars$Stars), mean)
Output:
Country Stars x
1 Canada 2 453.0
2 China 2 445.5
3 Canada 3 786.0
4 China 3 234.0
5 Russia 3 560.5
6 Canada 4 499.5
7 China 4 1200.0
8 Russia 4 673.0
9 Canada 5 687.0
10 China 5 987.0
Here x stands for the mean; I would also like to rename x to "price mean".
So the goal is to have one country per row, the number of stars as columns, and the mean price for each pair.
Thank you very much.
It seems you want an Excel-like pivot table. The pivottabler package helps a lot here; it also generates nice HTML tables (apart from printing results).
library(pivottabler)
qpvt(df, "Country", "Stars", "mean(Price)")
2 3 4 5 Total
Canada 453 786 499.5 687 585
China 445.5 234 1200 987 662.4
Russia 560.5 673 598
Total 448 543.666666666667 709 837 614.0625
For formatting, use the format argument:
qpvt(df, "Country", "Stars", "mean(Price)", format = "%.2f")
2 3 4 5 Total
Canada 453.00 786.00 499.50 687.00 585.00
China 445.50 234.00 1200.00 987.00 662.40
Russia 560.50 673.00 598.00
Total 448.00 543.67 709.00 837.00 614.06
For HTML output, use qhpvt instead:
qhpvt(df, "Country", "Stars", "mean(Price)")
Output
Note: tidyverse and base R methods are also possible and just as easy; a tidyverse sketch follows the tapply example below.
To obtain a two-way table of means in base R, after attaching the data you can use:
tapply(Price, list(Country, Stars), mean)
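For completeness, here is a minimal dplyr/tidyr sketch of the same reshaping (assuming the data frame is called df with the Country, Stars, and Price columns shown above):
library(dplyr)
library(tidyr)
df %>%
  group_by(Country, Stars) %>%
  summarise(price_mean = mean(Price), .groups = "drop") %>%
  pivot_wider(names_from = Stars, values_from = price_mean)
Naming the summary column price_mean also answers the request to rename x.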
I have df1:
State date fips score score1
1 Alabama 2020-03-24 1 242 0
2 Alabama 2020-03-26 1 538 3
3 Alabama 2020-03-28 1 720 4
4 Alabama 2020-03-21 1 131 0
5 Alabama 2020-03-15 1 23 0
6 Alabama 2020-03-18 1 51 0
7 Texas 2020-03-14 2 80 0
7 Texas 2020-03-16 2 102 0
7 Texas 2020-03-20 2 702 1
8 Texas 2020-03-23 2 1005 1
I would like to find the date on which each State surpasses a score of 100, and then select the row 7 days after that date. For example, Alabama passes 100 on March 21st, so I would like to keep the March 28th data.
State date fips score score1
3 Alabama 2020-03-28 1 720 4
8 Texas 2020-03-23 2 1005 1
Here is a solution using tidyverse and lubridate.
library(tidyverse)
library(lubridate)
df %>%
  # Convert the date column to Date format
  mutate_at(vars(date), ymd) %>%
  # Group by State
  group_by(State) %>%
  # Keep only scores over 100
  filter(score > 100) %>%
  # Keep the row dated 7 days after the first date with a score over 100
  filter(date == min(date) + days(7))
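If the row exactly 7 days later might be missing for some states, a slightly more forgiving variant (just a sketch, same assumptions as above) keeps the first row on or after that date:
df %>%
  mutate_at(vars(date), ymd) %>%
  group_by(State) %>%
  filter(score > 100) %>%
  # first row on or after the 7-day mark, in case that exact date is absent
  filter(date >= min(date) + days(7)) %>%
  slice_min(date, n = 1, with_ties = FALSE)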
Using a by approach (assuming the row at date + 7 days is available):
res <- do.call(rbind, by(dat, dat$state, function(x) {
  st <- x[x$cases > 100, ]
  st[as.Date(st$date) == as.Date(st$date[1]) + 7, ]
}))
head(res)
# date state fips cases deaths
# Alabama 2020-03-27 Alabama 1 639 4
# Alaska 2020-04-04 Alaska 2 169 3
# Arizona 2020-03-28 Arizona 4 773 15
# Arkansas 2020-03-28 Arkansas 5 409 5
# California 2020-03-15 California 6 478 6
# Colorado 2020-03-21 Colorado 8 475 6
I have two data frames, Table_1 and Table_2, and I need to add a column "Index" to Table_2 with value 1 for rows that match Table_1 and 0 for the others.
Basically, I need to match the "Pol", "Cty", "STATE", and "CRP" columns from Table_1 against the "Pol_No", "CTY", "STATE", and "CRP" columns from Table_2.
I prefer the data.table method.
Table_1:
Pol Cty Avg STATE CRP
85010 23 1123 MO 11
75022 23 1123 MO 11
35014 143 450 MO 11
.
.
Table_2:
STATE CTY CRP Pol_No Plan Price
AL 1 11 150410 90 4563
AL 1 21 45023 90 5402
MO 143 11 85010 90 2522
.
.
Desired output is below.
Table_2:
STATE CTY CRP Pol_No Plan Price Index
AL 1 11 150410 90 4563 0
AL 1 21 45023 90 5402 0
MO 143 11 85010 90 2522 1
.
.
How can I achieve this in R?
Any help is appreciated.
Thanks.
Here's an entirely data.table solution:
merge(t1,t2,by.x='Pol', by.y='Pol_No', all.y=TRUE)[,c('STATE.y','CTY', 'Cty', 'CRP.y', 'Pol', 'Plan', 'Price')]
#-----
STATE.y CTY Cty CRP.y Pol Plan Price
1: AL 1 NA 21 45023 90 5402
2: MO 143 23 11 85010 90 2522
3: AL 1 NA 11 150410 90 4563
#--------
t3 <- merge(t1,t2,by.x='Pol', by.y='Pol_No', all.y=TRUE)[ ,
c('STATE.y','CTY', 'Cty', 'CRP.y', 'Pol', 'Plan', 'Price')]
t3[ , index := as.numeric(!is.na(Cty))]
t3
#--------
STATE.y CTY Cty CRP.y Pol Plan Price index
1: AL 1 NA 21 45023 90 5402 0
2: MO 143 23 11 85010 90 2522 1
3: AL 1 NA 11 150410 90 4563 0
To get the column names right after merge(...), I first looked at:
merge(t1,t2,by.x='Pol', by.y='Pol_No', all.y=TRUE)
Pol Cty Avg STATE.x CRP.x STATE.y CTY CRP.y Plan Price
1: 45023 NA NA <NA> NA AL 1 21 90 5402
2: 85010 23 1123 MO 11 MO 143 11 90 2522
3: 150410 NA NA <NA> NA AL 1 11 90 4563
I think this is a straightforward multi-column join:
library(dplyr)
t2 %>%
left_join(transmute(t1, CTY=Cty, STATE, Index=1L), by=c("CTY", "STATE")) %>%
mutate(Index = if_else(is.na(Index), 0L, Index))
# STATE CTY CRP Pol_No Plan Price Index
# 1 AL 1 11 150410 90 4563 0
# 2 AL 1 21 45023 90 5402 0
# 3 MO 143 11 85010 90 2522 1
EDIT
I've been trying to learn data.table, so I thought I'd give this a try. It feels a little clumsy to me; I'm sure there is a way to streamline it (see the update-join sketch after the output below).
t1 <- setDT(t1); t2 <- setDT(t2)
For convenience, set the column names to be the same (I'm not sure how to make it happen easily otherwise) ... one is "Cty", the other is "CTY". Make them the same.
colnames(t1)[2] <- "CTY"
Now, the merge.
t1[, .(CTY, STATE, CRP, Index = 1)][t2, on = c("CTY", "STATE", "CRP")]
# CTY STATE CRP Index Pol_No Plan Price
# 1: 1 AL 11 NA 150410 90 4563
# 2: 1 AL 21 NA 45023 90 5402
# 3: 143 MO 11 1 85010 90 2522
Notes:
the first bracket-op is selecting just the three joining columns and assigning the fourth, Index;
the actual join is in the second bracket-op; the first is just a selection
typically data.table operations work by reference, but not merges or selections like this, so the result is returned without modifying the underlying structure; to keep it, we'll need to store it (back in t2, perhaps)
It's close ... now just update the Index field, since it's either 1 where the data co-exists or NA otherwise.
t2 <- t1[, .(CTY, STATE, CRP, Index = 1)][t2, on = c("CTY", "STATE", "CRP")]
t2[, Index := as.integer(!is.na(Index))]
t2
# CTY STATE CRP Index Pol_No Plan Price
# 1: 1 AL 11 0 150410 90 4563
# 2: 1 AL 21 0 45023 90 5402
# 3: 143 MO 11 1 85010 90 2522
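For what it's worth, one way to streamline this is an update join (just a sketch, assuming the setDT() calls and the Cty -> CTY rename above):
t2[, Index := 0L]                                    # default: no match
t2[t1, on = c("CTY", "STATE", "CRP"), Index := 1L]   # flag rows that match t1
t2
This flags matching rows in place without creating an intermediate table.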
Data:
t1 <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
Pol Cty Avg STATE CRP
85010 23 1123 MO 11
75022 23 1123 MO 11
35014 143 450 MO 11')
t2 <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
STATE CTY CRP Pol_No Plan Price
AL 1 11 150410 90 4563
AL 1 21 45023 90 5402
MO 143 11 85010 90 2522')
This is not the nicest solution, but it works with data.table. You need sqldf, which works on both data frames and data tables.
library(data.table)
df1 <- data.table(Pol = c(85010, 75022, 35014), Cty = c(23, 23, 143),
                  Avg = c(1123, 1123, 450), STATE = c("MO", "MO", "MO"),
                  CRP = c(11, 11, 11))
df2 <- data.table(STATE = c("AL", "AL", "MO"), CTY = c(1, 1, 143),
                  CRP = c(11, 21, 11), Pol_No = c(150410, 45023, 85010),
                  Plan = c(90, 90, 90), Price = c(4563, 5402, 2522))
library(sqldf)
# left join on the policy number
df <- sqldf("select df2.STATE, df2.CTY, df2.CRP, df2.Pol_No, df2.Plan, df2.Price, df1.Pol from df2 left join df1 on df1.Pol = df2.Pol_No")
# create the index
df$index <- ifelse(is.na(df$Pol), 0, 1)
# drop the extra column
df$Pol <- NULL
> df
STATE CTY CRP Pol_No Plan Price index
1 AL 1 11 150410 90 4563 0
2 AL 1 21 45023 90 5402 0
3 MO 143 11 85010 90 2522 1
I have two dataframes, one for 2008, the other for 2004 and 2012 data. Examples of the dataframes are below.
df_08 <- read.table(text = c("
observation year x code_location location
1 2008 300 23-940 town no. 1 town no. 1
2 2008 234 23-941 town no. 2 town no. 2
3 2008 947 23-942 city St 23 city St 23
4 2008 102 23-943 Mtn town 5 Mtn town 5 "), header = TRUE)
df_04_12 <- read.table(text = c("
observation year y code_location location
1 2004 124 23-940 town no. 1 town no. 1
2 2004 395 23-345 town # 2 town # 2
3 2004 1349 23-942 city St 23 city St 23
4 2012 930 53-443 Mtn town 5 Mtn town 5
5 2012 185 99-999 town no. 1 town no. 1
6 2012 500 23-941 town Number 2 town Number 2
7 2012 185 34-942 city Street 23 city Street 23
8 2012 195 23-943 Mt town 5 Mt town 5 "), header = TRUE)
I want to merge df_08 to df_04_12 using the location variable (the codes are not consistent across years). However, slight variations in the location names, e.g. "Mtn" vs. "Mt" or "no." vs. "#", result in no match. Given these slight variations between location names, is there a way to merge these data frames and get the following? I currently do not have any code for this, since I am not sure how to match locations for a merge.
observation year y code_location location.x location.y y.y
1 2004 124 23-940 town no. 1 town no. 1 town no.1 300
2 2004 395 "23-345 town # 2" "town # 2" "town no. 2" 234
3 2004 1349 23-942 city St 23 city St 23 city St 23 947
4 2012 930 53-443 Mtn town 5 Mtn town 5 Mtn town 5 102
5 2012 185 99-999 town no. 1 town no. 1 town no. 1 300
6 2012 500 23-941 town Number 2 town Number 2 town no. 2 234
7 2012 185 34-942 city Street 23 city Street 23 city St 23 947
8 2012 195 23-943 Mt town 5 Mt town 5 Mtn town 5 102
You can use Levenshtein distance on character variables, but there is no way to account for symbols. I would suggest you strip all of the symbols before the merge and then use the stringdist package. There is no clean solution for this problem; you will have to develop your own method as it relates to your data.
Some of the methods used in fuzzy matching are string distance calculations and Soundex transformations of the data; you just have to find out what is appropriate for your data. A rough sketch follows.
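As a rough illustration only (not a definitive solution), here is a sketch that strips symbols, lower-cases the locations, and then uses stringdist::amatch to find the closest df_08 location for each df_04_12 row; the clean_loc helper and the maxDist cutoff are assumptions you would have to tune to your data:
library(stringdist)
# hypothetical helper: lower-case and drop symbols before comparing
clean_loc <- function(x) gsub("[^a-z0-9 ]", "", tolower(x))
# index of the closest df_08 location for each df_04_12 location
idx <- amatch(clean_loc(df_04_12$location), clean_loc(df_08$location),
              method = "lv", maxDist = 5)   # tune maxDist for your data
merged <- cbind(df_04_12,
                location.y = df_08$location[idx],
                y.y        = df_08$x[idx])
Rows with no sufficiently close match get NA, which you would then have to inspect by hand.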
Say that I have two data frames. One lists the names of soccer players, the teams they have played for, and the number of goals they have scored on each team. The other contains the soccer players' names and ages. How do I add a "names_age" column to the goals data frame that holds the age for the players in the first column, "names", not for "teammates_names"? How do I then add an additional column with the teammates' ages? In short, I'd like two age columns: one for the first set of players and one for the second set.
> AGE_DF
names age
1 Sam 20
2 Jon 21
3 Adam 22
4 Jason 23
5 Jones 24
6 Jermaine 25
> GOALS_DF
names goals team teammates_names teammates_goals teammates_team
1 Sam 1 USA Jason 1 HOLLAND
2 Sam 2 ENGLAND Jason 2 PORTUGAL
3 Sam 3 BRAZIL Jason 3 GHANA
4 Sam 4 GERMANY Jason 4 COLOMBIA
5 Sam 5 ARGENTINA Jason 5 CANADA
6 Jon 1 USA Jones 1 HOLLAND
7 Jon 2 ENGLAND Jones 2 PORTUGAL
8 Jon 3 BRAZIL Jones 3 GHANA
9 Jon 4 GERMANY Jones 4 COLOMBIA
10 Jon 5 ARGENTINA Jones 5 CANADA
11 Adam 1 USA Jermaine 1 HOLLAND
12 Adam 1 ENGLAND Jermaine 1 PORTUGAL
13 Adam 4 BRAZIL Jermaine 4 GHANA
14 Adam 3 GERMANY Jermaine 3 COLOMBIA
15 Adam 2 ARGENTINA Jermaine 2 CANADA
What I have tried: I've successfully gotten this to work using a for loop. The actual data that I am working with has thousands of rows, and this takes a long time. I would like a vectorized approach, but I'm having trouble coming up with one.
Try merge or match.
Here's merge (which is likely to screw up your row ordering and can sometimes be slow):
merge(AGE_DF, GOALS_DF, all = TRUE)
Here's match, which makes use of basic indexing and subsetting. Assign the result to a new column, of course.
AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
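For example, a minimal sketch of assigning both age columns this way (the teammates_age name is just a suggestion; the question only names names_age):
# age of the player in "names"
GOALS_DF$names_age <- AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
# age of the player in "teammates_names"
GOALS_DF$teammates_age <- AGE_DF$age[match(GOALS_DF$teammates_names, AGE_DF$names)]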
Here's another option to consider: Convert your dataset into a long format first, and then do the merge. Here, I've done it with melt and "data.table":
library(reshape2)
library(data.table)
setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
            measure.vars = c("names", "teammates_names"),
            value.name = "names"), names)[as.data.table(AGE_DF)]
# rn goals team teammates_goals teammates_team variable names age
# 1: 1 1 USA 1 HOLLAND names Sam 20
# 2: 2 2 ENGLAND 2 PORTUGAL names Sam 20
# 3: 3 3 BRAZIL 3 GHANA names Sam 20
# 4: 4 4 GERMANY 4 COLOMBIA names Sam 20
# 5: 5 5 ARGENTINA 5 CANADA names Sam 20
# 6: 6 1 USA 1 HOLLAND names Jon 21
## <<SNIP>>
# 28: 13 4 BRAZIL 4 GHANA teammates_names Jermaine 25
# 29: 14 3 GERMANY 3 COLOMBIA teammates_names Jermaine 25
# 30: 15 2 ARGENTINA 2 CANADA teammates_names Jermaine 25
# rn goals team teammates_goals teammates_team variable names age
I've added the row names so you can use dcast to get back to the wide format and retain the row ordering if it's important.
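A rough sketch of that last step (assuming the joined long-format result above is stored in res; the spread column names such as age_names and age_teammates_names are whatever dcast generates by default):
res <- setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
                   measure.vars = c("names", "teammates_names"),
                   value.name = "names"), names)[as.data.table(AGE_DF)]
# back to one row per original row (rn), with separate name and age columns
dcast(res, rn + goals + team + teammates_goals + teammates_team ~ variable,
      value.var = c("names", "age"))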