Aggregate column R - r

I am new here and have a problem with the following data:
Year Market Winner BID
1 1990 ABC Apple 0.1260
2 1990 ABC Apple 0.1395
3 1990 EFG Pear 0.1350
4 1991 EFG Apple 0.1113
5 1991 EFG Orange 0.1094
For each year and separately for the two markets (i.e., ABC, EFG), examine the
combined data for Apple and Pear on the bid price variable BID for presence
of potential outliers. Identify instances where you observe the presence of
potential outliers.
So far I have only managed to separate the data by year:
y <- c(1, seq(300))
year1991 <- subset(X, y < 39)
year1991
Year1991 <- year1991[, c(1,2,3,5)]
Year1991
Now I need help with the right R command to select (view) only the ABC rows
of the Market column while keeping the values of the other columns.
Is it possible to do several separations at once, or must it be done step by step?
Could you give me a tip on how to exclude rows if I want to view the data in
this manner?
Year Market Winner BID
1 1990 ABC Apple 0.1260
2 1990 ABC Apple 0.1395
Year Market Winner BID
1 1990 EFG Pear 0.1350
That is, I am trying to split on 'Market' but still see all the other column values.
Thanks in advance :)

> df
Year Market Winner BID
1 1990 ABC Apple 0.1260
2 1990 ABC Apple 0.1395
3 1990 EFG Pear 0.1350
4 1991 EFG Apple 0.1113
5 1991 EFG Orange 0.1094
library(plyr)
# Then you can break up the data into chunks of year x market.
# I split your data.frame into a list. You can do further things with that list.
# alternatively, you can use ddply and add a function to do your hw bit and collate all
# results back into a final data.frame. This should be a helpful start.
> dlply(df, .(Year,Market))
$`1990.ABC`
Year Market Winner BID
1 1990 ABC Apple 0.1260
2 1990 ABC Apple 0.1395
$`1990.EFG`
Year Market Winner BID
3 1990 EFG Pear 0.135
$`1991.EFG`
Year Market Winner BID
4 1991 EFG Apple 0.1113
5 1991 EFG Orange 0.1094
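For comparison, here is a rough base R sketch of the same split, plus one possible outlier check. It is only an illustration under assumptions: the data frame name bids is made up, and boxplot.stats() (the 1.5 x IQR rule) is just one way to define a "potential outlier"; the assignment may have a different definition in mind.

```r
# Sample data copied from the question; the name `bids` is a placeholder.
bids <- data.frame(
  Year   = c(1990, 1990, 1990, 1991, 1991),
  Market = c("ABC", "ABC", "EFG", "EFG", "EFG"),
  Winner = c("Apple", "Apple", "Pear", "Apple", "Orange"),
  BID    = c(0.1260, 0.1395, 0.1350, 0.1113, 0.1094),
  stringsAsFactors = FALSE
)

# View only the ABC rows while keeping every column.
abc_only <- subset(bids, Market == "ABC")

# Keep Apple and Pear, then split into one data frame per year x market.
fruit  <- subset(bids, Winner %in% c("Apple", "Pear"))
groups <- split(fruit, list(fruit$Year, fruit$Market), drop = TRUE)

# Within each group, keep the rows whose BID boxplot.stats() reports as "out"
# (assumed outlier rule: beyond 1.5 x IQR from the hinges).
flagged  <- lapply(groups, function(g) g[g$BID %in% boxplot.stats(g$BID)$out, ])
outliers <- do.call(rbind, flagged)
```

With only one or two rows per group, as in this sample, nothing gets flagged; on the full data set outliers would collect the suspicious rows from every year and market.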

Related

Aggregating observations based on category in R

I have a set of agricultural data in R that looks something like this:
State District Year Crop Production Area
1 State A District 1 2000 Banana 1254.00 2000.00
2 State A District 1 2000 Apple 175.00 176.00
3 State A District 1 2000 Wheat 321.00 641.00
4 State A District 1 2000 Rice 1438.00 175.00
5 State A District 1 2000 Cashew 154.00 1845.00
6 State A District 1 2000 Peanut 2076.00 439.00
7 State B District 2 2000 Banana 3089.00 1987.00
8 State B District 2 2000 Apple 309.00 302.00
9 State B District 2 2000 Wheat 401.00 230.00
10 State B District 2 2000 Rice 1832.00 2134.00
11 State B District 2 2000 Cashew 991.00 1845.00
12 State B District 2 2000 Peanut 2311.00 1032.00
I want to aggregate the area and production values by crop type, but keep the state, district and year details, so that it would look something like:
State District Year Crop Production Area
1 State A District 1 2000 Fruit 1429.00 2176.00
2 State A District 1 2000 Grain 1759.00 816.00
3 State A District 1 2000 Nut 2230.00 2284.00
4 State B District 2 2000 Fruit 3398.00 2289.00
5 State B District 2 2000 Grain 2233.00 2364.00
6 State B District 2 2000 Nut 3302.00 2877.00
What's the best way to go about this?
Using dplyr & forcats:
library(dplyr)
library(forcats)
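df %>%
  mutate(crop_type = fct_recode(Crop, Fruit = "Apple", Fruit = "Banana",
                                Grain = "Wheat", Grain = "Rice",
                                Nut = "Cashew", Nut = "Peanut")) %>%
  # group on the recoded crop_type (not the original Crop) so crops are pooled
  group_by(State, District, Year, crop_type) %>%
  # the desired output holds totals, so sum rather than average
  summarize(Production = sum(Production),
            Area = sum(Area))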
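If you would rather stay in base R, a named lookup vector plus aggregate() can do the same job. This is a sketch, not the only way: the names crops and crop_type are placeholders, and the data are typed in from the question.

```r
# Data from the question; `crops` is a placeholder name.
crops <- data.frame(
  State      = rep(c("State A", "State B"), each = 6),
  District   = rep(c("District 1", "District 2"), each = 6),
  Year       = 2000,
  Crop       = rep(c("Banana", "Apple", "Wheat", "Rice", "Cashew", "Peanut"), times = 2),
  Production = c(1254, 175, 321, 1438, 154, 2076, 3089, 309, 401, 1832, 991, 2311),
  Area       = c(2000, 176, 641, 175, 1845, 439, 1987, 302, 230, 2134, 1845, 1032),
  stringsAsFactors = FALSE
)

# Map each crop to its category with a named character vector.
crop_type <- c(Banana = "Fruit", Apple = "Fruit",
               Wheat  = "Grain", Rice  = "Grain",
               Cashew = "Nut",   Peanut = "Nut")
crops$Crop <- unname(crop_type[crops$Crop])

# Sum Production and Area within each State/District/Year/category.
totals <- aggregate(cbind(Production, Area) ~ State + District + Year + Crop,
                    data = crops, FUN = sum)
totals <- totals[order(totals$State, totals$Crop), ]
rownames(totals) <- NULL
```

Any crop missing from the lookup vector would become NA and be dropped by aggregate(), so it is worth checking anyNA(crops$Crop) on real data.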

How to do Group By Rollup in R? (Like SQL)

I have a dataset and I want to perform something like Group By Rollup like we have in SQL for aggregate values.
Below is a reproducible example. I know aggregate works really well, as explained here, but it is not a satisfactory fit for my case.
year<- c('2016','2016','2016','2016','2017','2017','2017','2017')
month<- c('1','1','1','1','2','2','2','2')
region<- c('east','west','east','west','east','west','east','west')
sales<- c(100,200,300,400,200,400,600,800)
df<- data.frame(year,month,region,sales)
df
year month region sales
1 2016 1 east 100
2 2016 1 west 200
3 2016 1 east 300
4 2016 1 west 400
5 2017 2 east 200
6 2017 2 west 400
7 2017 2 east 600
8 2017 2 west 800
Now what I want to do is aggregate (sum by year-month-region) and add the new aggregate rows to the existing data frame,
e.g. there should be two additional rows like those below, with 'USA' as the region name for the aggregated rows:
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 USA 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 USA 2000
I have figured out a way (below) but I am very sure that there exists an optimum solution for this OR a better workaround than mine
df1<- setNames(aggregate(df$sales, by=list(df$year,df$month, df$region), FUN=sum),
c('year','month','region', 'sales'))
df2<- setNames(aggregate(df$sales, by=list(df$year,df$month), FUN=sum),
c('year','month', 'sales'))
df2$region<- 'USA' ## added a new column- region- for total USA
df2<- df2[, c('year','month','region', 'sales')] ## reordering the columns of df2
df3<- rbind(df1,df2)
df3<- df3[order(df3$year,df3$month,df3$region),] ## order by
rownames(df3)<- NULL ## renumbered the rows after order by
df3
Thanks for the support!
melt/dcast in the reshape2 package can do subtotalling. After running dcast we replace "(all)" in the month column with the month using na.locf from the zoo package:
library(reshape2)
library(zoo)
m <- melt(df, measure.vars = "sales")
dout <- dcast(m, year + month + region ~ variable, fun.aggregate = sum, margins = "month")
dout$month <- na.locf(replace(dout$month, dout$month == "(all)", NA))
giving:
> dout
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 (all) 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 (all) 2000
In the recent development version of data.table (1.10.5) you can use a new feature called "grouping sets" to produce subtotals:
library(data.table)
setDT(df)
res = groupingsets(df, .(sales=sum(sales)), sets=list(c("year","month"), c("year","month","region")), by=c("year","month","region"))
setorder(res, na.last=TRUE)
res
# year month region sales
#1: 2016 1 east 400
#2: 2016 1 west 600
#3: 2016 1 NA 1000
#4: 2017 2 east 800
#5: 2017 2 west 1200
#6: 2017 2 NA 2000
You can replace NA with USA using res[is.na(region), region := "USA"].
plyr::ddply(df, c("year", "month", "region"), plyr::summarise, sales = sum(sales))
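Another base R option, sketched here under the assumption that a single extra total level is all you need: stack a copy of the data with region overwritten by "USA", then aggregate once. The names sales_df, stacked and rollup are placeholders.

```r
# Data from the question; `sales_df` is a placeholder name.
sales_df <- data.frame(
  year   = c("2016", "2016", "2016", "2016", "2017", "2017", "2017", "2017"),
  month  = c("1", "1", "1", "1", "2", "2", "2", "2"),
  region = c("east", "west", "east", "west", "east", "west", "east", "west"),
  sales  = c(100, 200, 300, 400, 200, 400, 600, 800),
  stringsAsFactors = FALSE
)

# Every row appears twice: once under its own region, once under "USA".
stacked <- rbind(sales_df, transform(sales_df, region = "USA"))

# Make region a factor so that "USA" sorts after the real regions.
stacked$region <- factor(stacked$region,
                         levels = c(sort(unique(sales_df$region)), "USA"))

# One aggregate call now yields both the regional sums and the totals.
rollup <- aggregate(sales ~ year + month + region, data = stacked, FUN = sum)
rollup <- rollup[order(rollup$year, rollup$month, rollup$region), ]
rownames(rollup) <- NULL
```

This doubles the data in memory, so for very large tables the data.table grouping sets approach is probably the better fit.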

R aggregating on date then character

I have a table that looks like the following:
Year Country Variable 1 Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 2 5
1971 UK 2 3
1971 UK 1 5
1971 USA 2 2
1972 USA 1 1
1972 USA 2 5
I'd be grateful if someone could tell me how I can aggregate the data to group it first by year, then country with the sum of variable 1 and variable 2 coming afterwards so the output would be:
Year Country Sum Variable 1 Sum Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 5 13
1971 USA 2 2
1972 USA 3 6
This is the code I've tried to no avail (the real dataframe is 125,000 rows by 30+ columns hence the subset. Please be kind, I'm new to R!)
#making subset from data
GT2 <- subset(GT1, select = c("iyear", "country_txt", "V1", "V2"))
#making sure data types are correct
GT2[,2]=as.character(GT2[,2])
GT2[,3] <- as.numeric(as.character( GT2[,3] ))
GT2[,4] <- as.numeric(as.character( GT2[,4] ))
#removing NA values
GT2Omit <- na.omit(GT2)
#trying to aggregate - i.e. group by year, then country with the sum of Variable 1 and Variable 2 being shown
aggGT2 <-aggregate(GT2Omit, by=list(GT2Omit$iyear, GT2Omit$country_txt), FUN=sum, na.rm=TRUE)
Your aggregate is almost correct:
> aggGT2 <-aggregate(GT2Omit[3:4], by=GT2Omit[c("country_txt", "iyear")], FUN=sum, na.rm=TRUE)
> aggGT2
country_txt iyear V1 V2
1 UK 1970 1 3
2 USA 1970 1 3
3 UK 1971 5 13
4 USA 1971 2 2
5 USA 1972 3 6
dplyr is almost always the answer nowadays.
library(dplyr)
aggGT1 <- GT1 %>% group_by(iyear, country_txt) %>% summarize(sv1=sum(V1), sv2=sum(V2))
Having said that, it is good to learn basic R functions like aggregate and by.
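For what it is worth, aggregate also has a formula interface that reads quite naturally for this. A small sketch using the sample table from the question (the column names iyear, country_txt, V1 and V2 follow the question's code; GT2 here is a toy stand-in for the real data):

```r
# Toy version of the question's table.
GT2 <- data.frame(
  iyear       = c(1970, 1970, 1971, 1971, 1971, 1971, 1972, 1972),
  country_txt = c("UK", "USA", "UK", "UK", "UK", "USA", "USA", "USA"),
  V1          = c(1, 1, 2, 2, 1, 2, 1, 2),
  V2          = c(3, 3, 5, 3, 5, 2, 1, 5),
  stringsAsFactors = FALSE
)

# Left of ~ : the columns to sum; right of ~ : the grouping columns.
aggGT2 <- aggregate(cbind(V1, V2) ~ iyear + country_txt, data = GT2, FUN = sum)

# Sort by year, then country, to match the requested layout.
aggGT2 <- aggGT2[order(aggGT2$iyear, aggGT2$country_txt), ]
rownames(aggGT2) <- NULL
```

Note that the formula interface drops rows with NA in any of the listed columns by default (na.action = na.omit), so a separate na.omit step is not needed for those columns.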

Clustering / Matching Over Many Dimensions in R

I have a very large and complex data set with many observations of companies. Some of the observations of the companies are redundant and I need to make a key to map the redundant observations to a single one. However the only way to tell if they are actually representing the same company is through the similarity of a variety of variables. I think the appropriate approach is a kind of clustering based on a variety of conditions or perhaps even some kind of propensity score matching. Perhaps I just need flexible tools for making a complex kind of similarity matrix.
Unfortunately, I am not quite sure how to go about that in R. Most of the tools I've seen for clustering and categorizing seem to do so with either numerical distance or categorical data, but don't seem to allow multiple conditions or user specified conditions.
Below I've tried to create a smaller, public example of the kind of data I am working with and the result I am trying to produce. There are some conditions that must apply, for example, the location must be the same. There are some features that may associate one with another, for example var1 and var2. Then there are some features that may associate one with another, but they must not conflict, such as var3.
An additional layer of complexity is that the kind of association I am trying to use to map the redundant observation varies. For example, id1 and id2 are the same company redundantly entered into the data twice. In one place its name is "apples" and another "red apples". They share the same location, var1 value and var3 (after adjusting for formatting). Similarly ids 3, 5 and 6, are also really just one company, though much of the input for each is different. Some clusters would identify multiple observations, others would only have one. Ideally I would like to find a way to categorize or associate the observations based on several conditions, for example:
1. Test that the location is the same
2. Test whether var3 is different
3. Test whether one name is a substring of another
4. Test the edit distance of names
5. Test the similarity of var1 and var2 between observations
Anyways, hopefully there are better, more flexible tools for this than what I am finding or someone has experience with this kind of data work in R. Any and all suggestions and advice are much appreciated!
Data
id name location var1 var2 var3
1 apples US 1 abc 12345
2 red apples US 1 NA 12-345
3 green apples Mexico 2 def 235-92
4 bananas Brazil 2 abc NA
5 oranges Mexico 2 NA 23592
6 green apple Mexico NA def NA
7 tangerines Honduras NA abc 3498
8 mango Honduras 1 NA NA
9 strawberries Honduras NA abcd 3498
10 strawberry Honduras NA abc 3498
11 blueberry Brazil 1 abcd 2348
12 blueberry Brazil 3 abc NA
13 blueberry Mexico NA def 1859
14 bananas Brazil 1 def 2348
15 blackberries Honduras NA abc NA
16 grapes Mexico 6 qrs NA
17 grapefruits Brazil 1 NA 1379
18 grapefruit Brazil 2 bcd 1379
19 mango Brazil 3 efaq NA
20 fuji apples US 4 NA 189-35
Result
id name location var1 var2 var3 Result
1 apples US 1 abc 12345 1
2 red apples US 1 NA 12-345 1
3 green apples Mexico 2 def 235-92 3
4 bananas Brazil 2 abc NA 4
5 oranges Mexico 2 NA 23592 3
6 green apple Mexico NA def NA 3
7 tangerines Honduras NA abc 3498 7
8 mango Honduras 1 NA NA 8
9 strawberries Honduras NA abcd 3498 7
10 strawberry Honduras NA abc 3498 7
11 blueberry Brazil 1 abcd 2348 11
12 blueberry Brazil 3 abc NA 11
13 blueberry Mexico NA def 1859 13
14 bananas Brazil 1 def 2348 11
15 blackberries Honduras NA abc NA 15
16 grapes Mexico 6 qrs NA 16
17 grapefruits Brazil 1 NA 1379 17
18 grapefruit Brazil 2 bcd 1379 17
19 mango Brazil 3 efaq NA 19
20 fuji apples US 4 NA 189-35 20
Thanks in advance for your time and help!
library(stringdist)
getMatches <- function(df, tolerance = 6) {
  out <- integer(nrow(df))
  for (row in seq_len(nrow(df))) {
    dists <- numeric(nrow(df))
    for (col in seq_len(ncol(df))) {
      tempDist <- stringdist(df[row, col], df[, col], method = "lv")
      # WARNING: Matches NA perfectly.
      tempDist[is.na(tempDist)] <- 0
      dists <- dists + tempDist
    }
    dists[row] <- Inf
    min_dist <- min(dists)
    if (min_dist < tolerance) {
      out[row] <- which.min(dists)
    } else {
      out[row] <- row
    }
  }
  return(out)
}
test$Result <- getMatches(test[, -1])
Where test is your data. This almost certainly needs some refining and some postprocessing. It creates a column with the index of the closest match. If it can't find a match within the given tolerance, it returns the row's own index.
EDIT: I will attempt some more later.
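One possible way to encode the rule-based conditions from the question directly is to build a logical "linked" matrix from pairwise tests and then give every connected group the smallest id in it. The sketch below uses only base R and only the first six rows of the example data. The specific rule (same location, no var3 conflict, and either equal var3, a substring name, or an edit distance of at most 2) and the threshold are my assumptions, not something the question specifies, and var1/var2 are ignored.

```r
# First six rows of the question's data (var1 and var2 left out for brevity).
companies <- data.frame(
  id       = 1:6,
  name     = c("apples", "red apples", "green apples",
               "bananas", "oranges", "green apple"),
  location = c("US", "US", "Mexico", "Brazil", "Mexico", "Mexico"),
  var3     = c("12345", "12-345", "235-92", NA, "23592", NA),
  stringsAsFactors = FALSE
)

# Normalise var3 by stripping everything except digits (NA stays NA).
v3 <- gsub("[^0-9]", "", companies$var3)
nm <- companies$name

# Pairwise conditions, each an n x n logical matrix.
same_location <- outer(companies$location, companies$location, "==")
var3_known    <- outer(!is.na(v3), !is.na(v3), "&")
var3_equal    <- var3_known & outer(v3, v3, "==")
var3_conflict <- var3_known & !outer(v3, v3, "==")
name_substring <- outer(nm, nm, function(a, b) {
  mapply(function(x, y) grepl(x, y, fixed = TRUE) || grepl(y, x, fixed = TRUE),
         a, b, USE.NAMES = FALSE)
})
name_close <- adist(nm) <= 2  # assumed edit-distance threshold

# Assumed rule: a hard location match, no var3 conflict, plus one soft match.
linked <- same_location & !var3_conflict &
  (var3_equal | name_substring | name_close)

# Propagate the smallest row index through each connected group.
cluster <- seq_len(nrow(companies))
repeat {
  updated <- vapply(seq_along(cluster),
                    function(i) min(cluster[linked[i, ]]), integer(1))
  if (identical(updated, cluster)) break
  cluster <- updated
}
companies$Result <- companies$id[cluster]
```

Because links are propagated, ids 5 and 6 end up together via id 3 even though they do not match each other directly. The pairwise matrices grow with the square of the row count, so on a large data set you would probably want to split by location first and run this within each block.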

Merging irregular time series

I have two time series, one being a daily time series and the other one a discrete one. In my case I have share prices and ratings I need to merge but in a way that the merged time series keeps the daily dates according to the stock prices and that the rating is fitted to the daily data by ticker and date.
A simple merge command would only look for the exact date and ticker and apply NA to non-fitting cases. But I would like to look for the exact matches and fill the dates between with last rating.
Daily time series:
ticker date stock.price
AA US Equity 2004-09-06 1
AA US Equity 2004-09-07 2
AA US Equity 2004-09-08 3
AA US Equity 2004-09-09 4
AA US Equity 2004-09-10 5
AA US Equity 2004-09-11 6
Discrete time series
ticker date Rating Last_Rating
AA US Equity 2004-09-08 A A+
AA US Equity 2004-09-11 AA A
AAL LN Equity 2005-09-08 BB BB
AAL LN Equity 2007-09-09 AA AA-
ABE SM Equity 2006-09-10 AA AA-
ABE SM Equity 2009-09-11 AA AA-
Required Output:
ticker date stock.price Rating
AA US Equity 2004-09-06 1 A+
AA US Equity 2004-09-07 2 A+
AA US Equity 2004-09-08 3 A
AA US Equity 2004-09-09 4 A
AA US Equity 2004-09-10 5 A
AA US Equity 2004-09-11 6 AA
I would be very grateful for your help.
Maybe this is the solution you want.
The function na.locf in the time series package zoo can be used to carry values forward (or backward).
library(zoo)
library(plyr)
options(stringsAsFactors=FALSE)
daily_ts <- data.frame(
  ticker = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
  date = c(1, 2, 3, 4, 1, 2, 3, 4),
  stock.price = c(1.1, 1.2, 1.3, 1.4, 4.1, 4.2, 4.3, 4.4)
)
discrete_ts <- data.frame(
  ticker = c('A', 'A', 'B', 'B'),
  date = c(2, 4, 2, 4),
  Rating = c('A', 'AA', 'BB', 'BB-'),
  Last_Rating = c('A+', 'A', 'BB+', 'BB')
)
# Outer-join on ticker and date, then fill the gaps per ticker:
# Rating is carried forward, Last_Rating is carried backward.
res <- ddply(
  merge(daily_ts, discrete_ts, by = c("ticker", "date"), all = TRUE),
  "ticker",
  function(x)
    data.frame(
      x[, c("ticker", "date", "stock.price")],
      Rating = na.locf(x$Rating, na.rm = FALSE),
      Last_Rating = na.locf(x$Last_Rating, na.rm = FALSE, fromLast = TRUE)
    )
)
# Before the first rating change, fall back to Last_Rating, then drop that column.
res <- within(
  res,
  Rating <- ifelse(
    is.na(Rating),
    Last_Rating, Rating
  )
)[, setdiff(colnames(res), "Last_Rating")]
res
Gives
# ticker date stock.price Rating
#1 A 1 1.1 A+
#2 A 2 1.2 A
#3 A 3 1.3 A
#4 A 4 1.4 AA
#5 B 1 4.1 BB+
#6 B 2 4.2 BB
#7 B 3 4.3 BB
#8 B 4 4.4 BB-
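As a possible alternative without zoo or plyr, findInterval() can look up, for each daily date, the most recent rating change on or before it. This is a sketch with one ticker taken from the question; the helper name lookup_rating is made up, and it assumes that dates before the first rating change should take that change's Last_Rating, as in the required output.

```r
# One ticker from the question's example data.
daily <- data.frame(
  ticker      = "AA US Equity",
  date        = as.Date("2004-09-06") + 0:5,
  stock.price = 1:6,
  stringsAsFactors = FALSE
)
ratings <- data.frame(
  ticker      = c("AA US Equity", "AA US Equity"),
  date        = as.Date(c("2004-09-08", "2004-09-11")),
  Rating      = c("A", "AA"),
  Last_Rating = c("A+", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rating in force on each date for one ticker.
lookup_rating <- function(tk, dates) {
  r <- ratings[ratings$ticker == tk, ]
  r <- r[order(r$date), ]
  # idx = number of rating changes on or before each date (0 = none yet).
  idx <- findInterval(as.numeric(dates), as.numeric(r$date))
  # Before the first change, use the rating that the first change replaced.
  ifelse(idx == 0, r$Last_Rating[1], r$Rating[pmax(idx, 1)])
}

# Fill the Rating column ticker by ticker, leaving the row order untouched.
daily$Rating <- NA_character_
for (tk in unique(daily$ticker)) {
  rows <- daily$ticker == tk
  daily$Rating[rows] <- lookup_rating(tk, daily$date[rows])
}
```

A ticker with no entries in ratings simply ends up with NA in the Rating column.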
