Backtrack values in R for a logic

My request is slightly complicated.
Below is what my data looks like.
S.no Date City Sales diff Indicator
1 1 1/1/2017 New York 2795 0 0
2 2 1/31/2017 New York 4248 1453 0
3 3 3/2/2017 New York 1330 -2918 1
4 4 4/1/2017 New York 3535 2205 0
5 5 5/1/2017 New York 4330 795 0
6 6 5/31/2017 New York 3360 -970 1
7 7 6/30/2017 New York 2238 -1122 1
8 8 1/1/2017 Paris 1451 0 0
9 9 1/31/2017 Paris 2339 888 0
10 10 3/2/2017 Paris 2029 -310 1
11 11 4/1/2017 Paris 1850 -179 1
12 12 5/1/2017 Paris 2800 950 1
13 13 5/31/2017 Paris 1986 -814 0
14 14 6/30/2017 Paris 3776 1790 0
15 15 1/1/2017 London 1646 0 0
16 16 1/31/2017 London 3575 1929 0
17 17 3/2/2017 London 1161 -2414 1
18 18 4/1/2017 London 1766 605 0
19 19 5/1/2017 London 2799 1033 0
20 20 5/31/2017 London 2761 -38 1
21 21 6/30/2017 London 1048 -1713 1
diff is the current month's Sales minus the previous month's Sales within each group, and Indicator marks whether diff is negative or positive.
I want to apply a logic to each group starting from the last row and moving to the first row, i.e. in reverse order.
Walking in reverse, I want to capture the value of Sales wherever Indicator is 1, and then compare that captured Sales value to a threshold value (2000) in the next steps.
Below are the two cases for the comparison (captured Sales vs. threshold).
a. If the captured Sales value at the first Indicator = 1 row (going from the last row to the first) is less than 2000, store the captured value in a new dataset for that group.
b. If the captured Sales value at that row is greater than 2000, skip that Indicator = 1 row, move to the next row where Indicator = 1, and repeat steps a) and b).
I want to collect the result in a new dataset that has a single row for each City, giving me the Sales value selected by this logic along with its Date.
I simply want to understand how I can implement this logic in R. Will the rle function help?
Result:
S.no Date City Value(Sales)
3. 3/2/2017 New York 1330
11. 4/1/2017 Paris 1850
21. 6/30/2017 London 1048
Thanks,
J

If we assume that your data is already arranged in ascending order, you can do the following with base R:
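For a self-contained example, here is one way to construct the posted data as the my_df used below (a sketch; dates are written in ISO format to match the output shown further down):
my_df <- read.table(header = TRUE, text = '
S.no Date City Sales diff Indicator
1 2017-01-01 "New York" 2795 0 0
2 2017-01-31 "New York" 4248 1453 0
3 2017-03-02 "New York" 1330 -2918 1
4 2017-04-01 "New York" 3535 2205 0
5 2017-05-01 "New York" 4330 795 0
6 2017-05-31 "New York" 3360 -970 1
7 2017-06-30 "New York" 2238 -1122 1
8 2017-01-01 Paris 1451 0 0
9 2017-01-31 Paris 2339 888 0
10 2017-03-02 Paris 2029 -310 1
11 2017-04-01 Paris 1850 -179 1
12 2017-05-01 Paris 2800 950 1
13 2017-05-31 Paris 1986 -814 0
14 2017-06-30 Paris 3776 1790 0
15 2017-01-01 London 1646 0 0
16 2017-01-31 London 3575 1929 0
17 2017-03-02 London 1161 -2414 1
18 2017-04-01 London 1766 605 0
19 2017-05-01 London 2799 1033 0
20 2017-05-31 London 2761 -38 1
21 2017-06-30 London 1048 -1713 1')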
threshold <- 2000
my_new_df <- my_df[my_df$Indicator == 1 & my_df$Sales < threshold, ]
my_new_df
# S.no Date City Sales diff Indicator
# 3 3 2017-03-02 New York 1330 -2918 1
# 11 11 2017-04-01 Paris 1850 -179 1
# 17 17 2017-03-02 London 1161 -2414 1
# 21 21 2017-06-30 London 1048 -1713 1
Now we have all rows where the Indicator is equal to one and the Sales value is less than our threshold. But London has two rows and we only want the last one:
my_new_df <- my_new_df[!duplicated(my_new_df$City, fromLast = TRUE),
                       c("S.no", "Date", "City", "Sales")]
my_new_df
# S.no Date City Sales
# 3 3 2017-03-02 New York 1330
# 11 11 2017-04-01 Paris 1850
# 21 21 2017-06-30 London 1048
With the fromLast argument to duplicated(), the check for whether a City has already appeared starts from the last row, so !duplicated(..., fromLast = TRUE) keeps the last occurrence of each City.
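For intuition, here is a minimal sketch of how fromLast changes which occurrence gets flagged as a duplicate:
x <- c("New York", "Paris", "London", "London")
duplicated(x)                   # FALSE FALSE FALSE  TRUE (flags the later London)
duplicated(x, fromLast = TRUE)  # FALSE FALSE  TRUE FALSE (flags the earlier London)
Negating the fromLast version therefore keeps the last row per City.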

Related

Two way table with mean of a third variable R

Here's my problem. I have a table, of which I show a sample here. I would like to have Country as rows, Stars as columns, and the mean of Price for each combination. I used aggregate, which gave me the info that I want but not in the shape I want.
The table looks like this:
Country Stars Price
1 Canada 4 567
2 China 2 435
3 Russia 3 456
4 Canada 5 687
5 Canada 4 432
6 Russia 3 567
7 China 4 1200
8 Russia 3 985
9 Canada 2 453
10 Russia 3 234
11 Russia 4 546
12 Canada 3 786
13 China 2 456
14 China 3 234
15 Russia 4 800
16 China 5 987
I used this code:
aggregate(Stars[, 3], list(Country = Stars$Country, Stars = Stars$Stars), mean)
Output:
Country Stars x
1 Canada 2 453.0
2 China 2 445.5
3 Canada 3 786.0
4 China 3 234.0
5 Russia 3 560.5
6 Canada 4 499.5
7 China 4 1200.0
8 Russia 4 673.0
9 Canada 5 687.0
10 China 5 987.0
Here x stands for the mean; I would like to rename x to "price mean"...
So the goal would be to have one country per row and the number of stars as columns, with the mean price for each pair.
Thank you very much.
It seems you want an Excel-like pivot table. The pivottabler package helps a lot here. It also generates nice HTML tables (apart from displaying results in the console).
library(pivottabler)
qpvt(df, "Country", "Stars", "mean(Price)")
        2      3                  4      5     Total
Canada  453    786                499.5  687   585
China   445.5  234                1200   987   662.4
Russia         560.5              673          598
Total   448    543.666666666667   709    837   614.0625
For formatting, use the format argument:
qpvt(df, "Country", "Stars", "mean(Price)", format = "%.2f")
        2       3       4        5       Total
Canada  453.00  786.00  499.50   687.00  585.00
China   445.50  234.00  1200.00  987.00  662.40
Russia          560.50  673.00           598.00
Total   448.00  543.67  709.00   837.00  614.06
For HTML output, use qhpvt instead.
qhpvt(df, "Country", "Stars", "mean(Price)")
(Rendered HTML table omitted.)
Note: tidyverse and base R methods are also possible and are just as easy.
To obtain a two-way table of means in base R, after attaching the data you can use:
tapply(Price, list(Country,Stars), mean)
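For reference, here is a tidyverse sketch of the same pivot (assuming the sample is in a data frame df with columns Country, Stars and Price):
library(dplyr)
library(tidyr)

df %>%
  group_by(Country, Stars) %>%
  summarise(price_mean = mean(Price), .groups = "drop") %>%
  pivot_wider(names_from = Stars, values_from = price_mean, names_sort = TRUE)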

How to select a row based on date conditions of another row?

I have df1:
State date fips score score1
1 Alabama 2020-03-24 1 242 0
2 Alabama 2020-03-26 1 538 3
3 Alabama 2020-03-28 1 720 4
4 Alabama 2020-03-21 1 131 0
5 Alabama 2020-03-15 1 23 0
6 Alabama 2020-03-18 1 51 0
7 Texas 2020-03-14 2 80 0
7 Texas 2020-03-16 2 102 0
7 Texas 2020-03-20 2 702 1
8 Texas 2020-03-23 2 1005 1
I would like to find the date on which a State surpasses a score of 100, and then select the row 7 days after that date. For example, Alabama passes 100 on March 21st, so I would like to keep the March 28th row.
State date fips score score1
3 Alabama 2020-03-28 1 720 4
8 Texas 2020-03-23 2 1005 1
Here is a solution using tidyverse and lubridate.
library(tidyverse)
library(lubridate)

df %>%
  # Convert the date column to Date class
  mutate_at(vars(date), ymd) %>%
  # Group by State
  group_by(State) %>%
  # Drop scores of 100 or below
  filter(score > 100) %>%
  # Keep only the row dated exactly 7 days after the first date with a score over 100
  filter(date == min(date) + days(7))
Using a base R by approach (assuming a row dated exactly date + 7 is available):
res <- do.call(rbind, by(dat, dat$state, function(x) {
  st <- x[x$cases > 100, ]
  st[as.Date(st$date) == as.Date(st$date[1]) + 7, ]
}))
head(res)
# date state fips cases deaths
# Alabama 2020-03-27 Alabama 1 639 4
# Alaska 2020-04-04 Alaska 2 169 3
# Arizona 2020-03-28 Arizona 4 773 15
# Arkansas 2020-03-28 Arkansas 5 409 5
# California 2020-03-15 California 6 478 6
# Colorado 2020-03-21 Colorado 8 475 6
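If no observation falls exactly 7 days later, here is a sketch of a more tolerant variant (using the column names from the question; taking the first row at least 7 days out is an assumption):
library(dplyr)
library(lubridate)

df %>%
  mutate(date = ymd(date)) %>%
  group_by(State) %>%
  # Keep only rows after the score passes 100
  filter(score > 100) %>%
  # Keep rows at least 7 days after the first such date
  filter(date >= min(date) + days(7)) %>%
  # Take the earliest remaining row per State
  slice_min(date, n = 1) %>%
  ungroup()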

Issue with indexing using two data frames in R

I have two data frames, Table_1 and Table_2, and I need to add a column "Index" to Table_2, with the value 1 for rows that match Table_1 and 0 for the others.
Basically, I need to match the "Pol", "CTY", "STATE" and "CRP" columns from Table_1 with the "STATE", "CTY", "CRP" and "Pol_No" columns from Table_2.
I prefer the data.table method.
Table_1:
Pol Cty Avg STATE CRP
85010 23 1123 MO 11
75022 23 1123 MO 11
35014 143 450 MO 11
.
.
Table_2:
STATE CTY CRP Pol_No Plan Price
AL 1 11 150410 90 4563
AL 1 21 45023 90 5402
MO 143 11 85010 90 2522
.
.
Desired output as below.
Table_2:
STATE CTY CRP Pol_No Plan Price Index
AL 1 11 150410 90 4563 0
AL 1 21 45023 90 5402 0
MO 143 11 85010 90 2522 1
.
.
How can I achieve this in R?
Any help is appreciated.
Thanks.
Here's an entirely data.table solution:
merge(t1, t2, by.x = 'Pol', by.y = 'Pol_No', all.y = TRUE)[
  , c('STATE.y', 'CTY', 'Cty', 'CRP.y', 'Pol', 'Plan', 'Price')]
#-----
STATE.y CTY Cty CRP.y Pol Plan Price
1: AL 1 NA 21 45023 90 5402
2: MO 143 23 11 85010 90 2522
3: AL 1 NA 11 150410 90 4563
#--------
t3 <- merge(t1, t2, by.x = 'Pol', by.y = 'Pol_No', all.y = TRUE)[
  , c('STATE.y', 'CTY', 'Cty', 'CRP.y', 'Pol', 'Plan', 'Price')]
t3[ , index := as.numeric(!is.na(Cty))]
t3
#--------
STATE.y CTY Cty CRP.y Pol Plan Price index
1: AL 1 NA 21 45023 90 5402 0
2: MO 143 23 11 85010 90 2522 1
3: AL 1 NA 11 150410 90 4563 0
To get the column names right after the merge(), I first looked at:
merge(t1, t2, by.x = 'Pol', by.y = 'Pol_No', all.y = TRUE)
Pol Cty Avg STATE.x CRP.x STATE.y CTY CRP.y Plan Price
1: 45023 NA NA <NA> NA AL 1 21 90 5402
2: 85010 23 1123 MO 11 MO 143 11 90 2522
3: 150410 NA NA <NA> NA AL 1 11 90 4563
I think this is a straightforward multi-column join:
library(dplyr)
t2 %>%
  left_join(transmute(t1, CTY = Cty, STATE, Index = 1L), by = c("CTY", "STATE")) %>%
  mutate(Index = if_else(is.na(Index), 0L, Index))
# STATE CTY CRP Pol_No Plan Price Index
# 1 AL 1 11 150410 90 4563 0
# 2 AL 1 21 45023 90 5402 0
# 3 MO 143 11 85010 90 2522 1
EDIT
I've been trying to learn data.table, so I thought I'd give this a try. It feels a little clumsy to me; I'm sure there is a way to streamline it.
t1 <- setDT(t1); t2 <- setDT(t2)
For convenience, set the join column names to be the same (I'm not sure how to make the join work easily otherwise): one is "Cty", the other is "CTY". Make them the same.
colnames(t1)[2] <- "CTY"
Now, the merge.
t1[, .(CTY, STATE, CRP, Index = 1)][t2, on = c("CTY", "STATE", "CRP")]
# CTY STATE CRP Index Pol_No Plan Price
# 1: 1 AL 11 NA 150410 90 4563
# 2: 1 AL 21 NA 45023 90 5402
# 3: 143 MO 11 1 85010 90 2522
Notes:
- the first bracket-op selects just the three joining columns and assigns the fourth, Index;
- the actual join is in the second bracket-op; the first is just a selection;
- data.table ops typically work by reference, but not merges or selections like this, so the result is returned without modifying the underlying structure; to keep it, we need to store it (back in t2, perhaps).
It's close ... now just update the Index field, since it's either 1 where the data co-exists or NA otherwise.
t2 <- t1[, .(CTY, STATE, CRP, Index = 1)][t2, on = c("CTY", "STATE", "CRP")]
t2[, Index := as.integer(!is.na(Index))]
t2
# CTY STATE CRP Index Pol_No Plan Price
# 1: 1 AL 11 0 150410 90 4563
# 2: 1 AL 21 0 45023 90 5402
# 3: 143 MO 11 1 85010 90 2522
Data:
t1 <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
Pol Cty Avg STATE CRP
85010 23 1123 MO 11
75022 23 1123 MO 11
35014 143 450 MO 11')
t2 <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
STATE CTY CRP Pol_No Plan Price
AL 1 11 150410 90 4563
AL 1 21 45023 90 5402
MO 143 11 85010 90 2522')
This is not the nicest solution, but it works with data.table. You need sqldf, which works on both data frames and data tables.
library(data.table)
df1<-data.table(Pol=c(85010,75022,35014),Cty=c(23,23,143), Avg=c(1123,1123,450),STATE=c("MO","MO","MO"), CRP=c(11,11,11))
df2=data.table(STATE=c("AL","AL","MO"),CTY=c(1,1,143),CRP=c(11,21,11),Pol_No=c(150410,45023,85010),Plan=c(90,90,90),Price=c(4563,5402,2522))
library(sqldf)

# Left join df1 onto df2 by policy number
df <- sqldf("select df2.STATE, df2.CTY, df2.CRP, df2.Pol_No, df2.Plan, df2.Price, df1.Pol
             from df2 left join df1 on df1.Pol = df2.Pol_No")
# Create the index
df$index <- ifelse(is.na(df$Pol), 0, 1)
# Delete the extra column
df$Pol <- NULL
> df
STATE CTY CRP Pol_No Plan Price index
1 AL 1 11 150410 90 4563 0
2 AL 1 21 45023 90 5402 0
3 MO 143 11 85010 90 2522 1
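For completeness, the same left-join-plus-index idea can be written as a data.table update join; this sketch matches on the policy number alone, as the sqldf query above does, using the df1/df2 tables just defined:
# Start everything at 0, then flip matched rows to 1 by reference.
df2[, Index := 0L]
df2[df1, on = .(Pol_No = Pol), Index := 1L]
df2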

merge two dataframes on column with fuzzy match in R

I have two dataframes, one for 2008, the other for 2004 and 2012 data. Examples of the dataframes are below.
df_08 <- read.table(text = '
observation year x code_location location
1 2008 300 "23-940 town no. 1" "town no. 1"
2 2008 234 "23-941 town no. 2" "town no. 2"
3 2008 947 "23-942 city St 23" "city St 23"
4 2008 102 "23-943 Mtn town 5" "Mtn town 5"', header = TRUE, comment.char = "")
df_04_12 <- read.table(text = '
observation year y code_location location
1 2004 124 "23-940 town no. 1" "town no. 1"
2 2004 395 "23-345 town # 2" "town # 2"
3 2004 1349 "23-942 city St 23" "city St 23"
4 2012 930 "53-443 Mtn town 5" "Mtn town 5"
5 2012 185 "99-999 town no. 1" "town no. 1"
6 2012 500 "23-941 town Number 2" "town Number 2"
7 2012 185 "34-942 city Street 23" "city Street 23"
8 2012 195 "23-943 Mt town 5" "Mt town 5"', header = TRUE, comment.char = "")
I want to merge df_08 into df_04_12 using the location variable (the codes are not consistent across years). However, slight variations in the location names, e.g. Mtn vs. Mt or no. vs. #, result in no match. Given these slight variations between location names, is there a way to merge these dataframes and get the following? I currently do not have any code for this, since I am not sure how to match the locations for a merge.
observation year y code_location location.x location.y y.y
1 2004 124 "23-940 town no. 1" "town no. 1" "town no. 1" 300
2 2004 395 "23-345 town # 2" "town # 2" "town no. 2" 234
3 2004 1349 "23-942 city St 23" "city St 23" "city St 23" 947
4 2012 930 "53-443 Mtn town 5" "Mtn town 5" "Mtn town 5" 102
5 2012 185 "99-999 town no. 1" "town no. 1" "town no. 1" 300
6 2012 500 "23-941 town Number 2" "town Number 2" "town no. 2" 234
7 2012 185 "34-942 city Street 23" "city Street 23" "city St 23" 947
8 2012 195 "23-943 Mt town 5" "Mt town 5" "Mtn town 5" 102
You can use Levenshtein distance on character variables, but there is no way to account for symbols. I would suggest you strip all symbols before merging and then use the stringdist package. There is no clean solution for this problem; you will have to develop your own method as it relates to your data.
Some of the methods used in fuzzy matching are string-distance calculations and Soundex transformations of the data; you just have to find out what is appropriate for your data.
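As a sketch of that approach with the stringdist package (the normalize helper, the "jw" method and the maxDist cutoff are assumptions you would tune to your data):
library(stringdist)

# Hypothetical cleaner: lower-case, map "#" to "no.", drop other symbols.
normalize <- function(x) {
  x <- tolower(x)
  x <- gsub("#", "no.", x, fixed = TRUE)
  gsub("[^a-z0-9. ]", "", x)
}

# Approximate-match each 2004/2012 location to its closest 2008 location.
idx <- amatch(normalize(df_04_12$location), normalize(df_08$location),
              method = "jw", maxDist = 0.25)
matched <- df_08[idx, c("location", "x")]
names(matched) <- c("location_08", "x_08")
merged <- cbind(df_04_12, matched)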

Add new column to long dataframe from another dataframe?

Say that I have two dataframes. One lists the names of soccer players, the teams they have played for, and the number of goals they scored on each team. The other contains the soccer players' names and ages. How do I add a "names_age" column to the goals dataframe giving the age of the players in the first column, "names" (not for "teammates_names")? And how do I add an additional column with the teammates' ages? In short, I'd like two age columns: one for the first set of players and one for the second set.
> AGE_DF
names age
1 Sam 20
2 Jon 21
3 Adam 22
4 Jason 23
5 Jones 24
6 Jermaine 25
> GOALS_DF
names goals team teammates_names teammates_goals teammates_team
1 Sam 1 USA Jason 1 HOLLAND
2 Sam 2 ENGLAND Jason 2 PORTUGAL
3 Sam 3 BRAZIL Jason 3 GHANA
4 Sam 4 GERMANY Jason 4 COLOMBIA
5 Sam 5 ARGENTINA Jason 5 CANADA
6 Jon 1 USA Jones 1 HOLLAND
7 Jon 2 ENGLAND Jones 2 PORTUGAL
8 Jon 3 BRAZIL Jones 3 GHANA
9 Jon 4 GERMANY Jones 4 COLOMBIA
10 Jon 5 ARGENTINA Jones 5 CANADA
11 Adam 1 USA Jermaine 1 HOLLAND
12 Adam 1 ENGLAND Jermaine 1 PORTUGAL
13 Adam 4 BRAZIL Jermaine 4 GHANA
14 Adam 3 GERMANY Jermaine 3 COLOMBIA
15 Adam 2 ARGENTINA Jermaine 2 CANADA
What I have tried: I've successfully got this to work using a for loop. However, the actual data that I am working with has thousands of rows, and the loop takes a long time. I would like a vectorized approach, but I'm having trouble coming up with one.
Try merge or match.
Here's merge (which is likely to screw up your row ordering and can sometimes be slow):
merge(AGE_DF, GOALS_DF, all = TRUE)
Here's match, which makes use of basic indexing and subsetting. Assign the result to a new column, of course.
AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
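For example, to build the two requested age columns (the names names_age and teammates_age are just suggestions):
GOALS_DF$names_age     <- AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
GOALS_DF$teammates_age <- AGE_DF$age[match(GOALS_DF$teammates_names, AGE_DF$names)]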
Here's another option to consider: Convert your dataset into a long format first, and then do the merge. Here, I've done it with melt and "data.table":
library(reshape2)
library(data.table)
setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
            measure.vars = c("names", "teammates_names"),
            value.name = "names"),
       names)[as.data.table(AGE_DF)]
# rn goals team teammates_goals teammates_team variable names age
# 1: 1 1 USA 1 HOLLAND names Sam 20
# 2: 2 2 ENGLAND 2 PORTUGAL names Sam 20
# 3: 3 3 BRAZIL 3 GHANA names Sam 20
# 4: 4 4 GERMANY 4 COLOMBIA names Sam 20
# 5: 5 5 ARGENTINA 5 CANADA names Sam 20
# 6: 6 1 USA 1 HOLLAND names Jon 21
## <<SNIP>>
# 28: 13 4 BRAZIL 4 GHANA teammates_names Jermaine 25
# 29: 14 3 GERMANY 3 COLOMBIA teammates_names Jermaine 25
# 30: 15 2 ARGENTINA 2 CANADA teammates_names Jermaine 25
# rn goals team teammates_goals teammates_team variable names age
I've added the row names so you can use dcast to get back to the wide format and retain the row ordering if it's important.
