Sum over a column and remove duplicates simultaneously [duplicate]

I have a sample dataframe "data" as follows:
X Y Month Year income
2281205 228120 3 2011 1000
2281212 228121 9 2010 1100
2281213 228121 12 2010 900
2281214 228121 3 2011 9000
2281222 228122 6 2010 1111
2281223 228122 9 2010 3000
2281224 228122 12 2010 1889
2281225 228122 3 2011 778
2281243 228124 12 2010 1111
2281244 228124 3 2011 200
2281282 228128 9 2010 7889
2281283 228128 12 2010 2900
2281284 228128 3 2011 3400
2281302 228130 9 2010 1200
2281303 228130 12 2010 2000
2281304 228130 3 2011 1900
2281352 228135 9 2010 2300
2281353 228135 12 2010 1333
2281354 228135 3 2011 2340
I want to use ddply to compute the total income for each Y (not X), but only if there are four observations for that Y (for example, Y = 228122 has months 6, 9, 12 of 2010 and month 3 of 2011). If there are fewer than four observations (for example, for Y = 228130), I want to simply ignore it. I use the following commands in R for this purpose:
require(plyr)
# the data are in the data csv file
data<-read.csv("data.csv")
# convert Y (integers) into factors
data$Y <- as.factor(data$Y)
# get the count of each unique Y
count<-ddply(data,.(Y), summarize, freq=length(Y))
# get the sum of each unique Y
sum<-ddply(data,.(Y),summarize,tot=sum(income))
# keep the total only if the number of observations for the Y is at least 4
colbind<-cbind(count,sum)
finalsum<-subset(colbind,freq>3)
My output is as follows:
>colbind
Y freq Y tot
1 228120 1 228120 1000
2 228121 3 228121 11000
3 228122 4 228122 6778
4 228124 2 228124 1311
5 228128 3 228128 14189
6 228130 3 228130 5100
7 228135 3 228135 5973
>finalsum
Y freq Y.1 tot
3 228122 4 228122 6778
The above code works, but requires many steps. So, I would like to know whether there is a simpler way of performing the above task (using the plyr package).

As pointed out in a comment, you can do multiple operations inside summarize().
This reduces your code to one line of ddply() and one line of subsetting, which is easy enough with the [ operator:
x <- ddply(data, .(Y), summarize, freq=length(Y), tot=sum(income))
x[x$freq > 3, ]
Y freq tot
3 228122 4 6778
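Equivalently, base R's subset() reads naturally here: subset(x, freq > 3) returns the same rows.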
This is also exceptionally easy with the data.table package:
library(data.table)
data.table(data)[, list(freq=length(income), tot=sum(income)), by=Y][freq > 3]
Y freq tot
1: 228122 4 6778
In fact, counting the rows in each group has its own shortcut in data.table: the .N symbol:
data.table(data)[, list(freq=.N, tot=sum(income)), by=Y][freq > 3]
Y freq tot
1: 228122 4 6778
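As a side note, data.table(data) makes a copy each time; for larger data you can convert once, by reference, with setDT() and get the same result:
setDT(data)   # converts the data.frame in place, no copy
data[, list(freq = .N, tot = sum(income)), by = Y][freq > 3]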

I think the dplyr package is faster than plyr::ddply and more elegant.
testData <- read.table(file = "clipboard", header = TRUE)
require(dplyr)
testData %>%
  group_by(Y) %>%
  summarise(total = sum(income), freq = n()) %>%
  filter(freq > 3)
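For completeness, a base R sketch of the same count-sum-filter using only aggregate() (no packages needed, assuming the same data frame as above):
# one aggregate() call returning both the count and the sum per Y
agg <- aggregate(income ~ Y, data = data, FUN = function(v) c(freq = length(v), tot = sum(v)))
agg <- do.call(data.frame, agg)   # flatten the matrix column into income.freq / income.tot
agg[agg$income.freq > 3, ]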

Related

multiplying column from data frame 1 by a condition found in data frame 2

I have two separate data frames. For each row of data frame 1, I want to look up its year in data frame 2 and multiply a column from data frame 1 by the number found there. So, for example, imagine my first data frame is:
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
year price
1 2001 1000
2 2003 1000
3 2001 1000
4 2004 1000
5 2006 1000
6 2007 1000
7 2008 1000
8 2008 1000
9 2001 1000
10 2009 1000
11 2001 1000
Now, I have a second data frame which includes inflation conversion rates (code from #akrun):
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data <- inf_data %>%
  mutate(final_inf = cumprod(1 + ref_inf/100))
ref_year ref_inf final_inf
1 2010 2.0 1.020000
2 2009 3.0 1.050600
3 2008 1.0 1.061106
4 2007 2.2 1.084450
5 2006 1.3 1.098548
6 2005 1.5 1.115026
7 2004 1.9 1.136212
8 2003 1.8 1.156664
9 2002 1.9 1.178640
10 2001 1.9 1.201035
What I want to do: for example, the first row of data frame 1 is the year 2001, so I go and find the conversion factor for 2001 in data frame 2, which is 1.201035, and then multiply the price in data frame 1 by it.
So the result should look like this:
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
Is there any way to do this without using if and else commands?
We can do a join on 'year' with 'ref_year' and create the new column by assigning (:=) the product of 'price' and 'final_inf':
library(data.table)
setDT(df)[inf_data, after_conv := price * final_inf, on = .(year = ref_year)]
Output:
df
# year price after_conv
# 1: 2001 1000 1201.035
# 2: 2003 1000 1156.664
# 3: 2001 1000 1201.035
# 4: 2004 1000 1136.212
# 5: 2006 1000 1098.548
# 6: 2007 1000 1084.450
# 7: 2008 1000 1061.106
# 8: 2008 1000 1061.106
# 9: 2001 1000 1201.035
#10: 2009 1000 1050.600
#11: 2001 1000 1201.035
Since the data is already being processed by dplyr, we can also solve this problem with dplyr. A dplyr-based solution joins the data with the reference data by year and calculates after_conv.
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
library(dplyr)
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data %>%
  mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
  rename(year = ref_year) %>%
  left_join(df, .) %>%
  mutate(after_conv = price * final_inf) %>%
  select(year, price, after_conv)
We use left_join() to keep the rows in the original order of df, and to ensure rows in inf_data only contribute to the output if they match at least one row in df. We use . to reference the data already in the pipeline as the right side of the join, merging in final_inf so we can use it in the subsequent mutate(). We then select() the three result columns we need.
...and the output:
Joining, by = "year"
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
We can save the result back to df by ending the pipeline with an assignment:
inf_data %>%
  mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
  rename(year = ref_year) %>%
  left_join(df, .) %>%
  mutate(after_conv = price * final_inf) %>%
  select(year, price, after_conv) -> df
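For completeness, the same lookup can be sketched in base R with match(), assuming inf_data already carries the final_inf column computed above:
# match() lines each year in df up with its row in inf_data; no join or loop needed
df$after_conv <- df$price * inf_data$final_inf[match(df$year, inf_data$ref_year)]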

How can I change row and column indexes of a dataframe in R?

I have a dataframe in R which has three columns: Product_Name (names of books), Year, and Units (number of units sold in that year). It looks like this:
Product_Name Year Units
A Modest Proposal 2011 10000
A Modest Proposal 2012 11000
A Modest Proposal 2013 12000
A Modest Proposal 2014 13000
Animal Farm 2011 8000
Animal Farm 2012 9000
Animal Farm 2013 11000
Animal Farm 2014 15000
Catch 22 2011 1000
Catch 22 2012 2000
Catch 22 2013 3000
Catch 22 2014 4000
....
I intend to make an R Shiny dashboard with this, where I want to keep the year as a drop-down menu option, for which I want the dataframe in the following format:
A Modest Proposal Animal Farm Catch 22
2011 10000 8000 1000
2012 11000 9000 2000
2013 12000 11000 3000
2014 13000 15000 4000
or the other way round, where the Product Names are row indexes and the Years are column indexes; either way works.
How can I do this in R?
Your general issue is transforming long data to wide data. For this, you can use data.table's dcast function (amongst many others):
dt = data.table(
Name = c(rep('A', 4), rep('B', 4), rep('C', 4)),
Year = c(rep(2011:2014, 3)),
Units = rnorm(12)
)
> dt
Name Year Units
1: A 2011 -0.26861318
2: A 2012 0.27194732
3: A 2013 -0.39331361
4: A 2014 0.58200101
5: B 2011 0.09885381
6: B 2012 -0.13786098
7: B 2013 0.03778400
8: B 2014 0.02576433
9: C 2011 -0.86682584
10: C 2012 -1.34319590
11: C 2013 0.10012673
12: C 2014 -0.42956207
> dcast(dt, Year ~ Name, value.var = 'Units')
Year A B C
1: 2011 -0.2686132 0.09885381 -0.8668258
2: 2012 0.2719473 -0.13786098 -1.3431959
3: 2013 -0.3933136 0.03778400 0.1001267
4: 2014 0.5820010 0.02576433 -0.4295621
For the next time, it is easier if you provide a reproducible example, so that the people assisting you do not have to manually recreate your data structure :)
You need to use pivot_wider from the tidyr package. I assume your data is saved in df; you also need the dplyr package for %>% (piping).
library(tidyr)
library(dplyr)
df %>%
  pivot_wider(names_from = Product_Name, values_from = Units)
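If you really want the Years as row indexes (as in your desired layout) rather than as a column, a sketch building on the pivot_wider() result; the df_wide name is just illustrative:
df_wide <- df %>% pivot_wider(names_from = Product_Name, values_from = Units)
df_wide <- as.data.frame(df_wide)   # tibbles discourage row names, so convert first
rownames(df_wide) <- df_wide$Year
df_wide$Year <- NULL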
Assuming that your dataframe is ordered by Product_Name and by Year, I will generate artificial data similar to your dataframe; try this:
Col_1 <- sort(rep(LETTERS[1:3], 4))
Col_2 <- rep(2011:2014, 3)
# artificial data
resp <- ceiling(rnorm(12, 5000, 500))
uu <- data.frame(Col_1, Col_2, resp)
uu
# output is
Col_1 Col_2 resp
1 A 2011 5297
2 A 2012 4963
3 A 2013 4369
4 A 2014 4278
5 B 2011 4721
6 B 2012 5021
7 B 2013 4118
8 B 2014 5262
9 C 2011 4601
10 C 2012 5013
11 C 2013 5707
12 C 2014 5637
>
> # Here starts
> output <- aggregate(uu$resp, list(uu$Col_1), function(x) {x})
> output
Group.1 x.1 x.2 x.3 x.4
1 A 5297 4963 4369 4278
2 B 4721 5021 4118 5262
3 C 4601 5013 5707 5637
>
output2 <- output [, -1]
colnames(output2) <- levels(as.factor(uu$Col_2))
rownames(output2) <- levels(as.factor(uu$Col_1))
# transpose the matrix
> t(output2)
A B C
2011 5297 4721 4601
2012 4963 5021 5013
2013 4369 4118 5707
2014 4278 5262 5637
> # or convert to data.frame
> as.data.frame(t(output2))
A B C
2011 5297 4721 4601
2012 4963 5021 5013
2013 4369 4118 5707
2014 4278 5262 5637
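As an aside, if all you need is this cross-tabulation, base R's xtabs() gets there in one call on the same uu data; it sums resp within each year/group cell:
xtabs(resp ~ Col_2 + Col_1, data = uu)   # years as rows, groups as columns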

Make subset with specific values of a column with grep

I have the following data set:
usd year
1 65.09 1997
2 69.28 1998
3 71.18 1999Q1
4 72.12 1999Q2
5 70.68 1999Q3
6 71.01 1999Q4
7 71.45 2000Q1
8 72.02 2000Q2
9 72.29 2000Q3
10 71.12 2000Q4
I want to have the means of every year:
usd year
1 65.09 1997
2 69.28 1998
3 71.24 1999
7 71.72 2000
I know how I can do it if I only have years without the quarter. Is there a way to extract the years? Maybe with grep?
I have found a solution using the stringr package:
mydata <- data.frame(usd = c(65.09, 69.28, 71.18, 72.12, 70.68, 71.01, 71.45, 72.02, 72.29, 71.12),
                     year = c("1997", "1998", "1999Q1", "1999Q2", "1999Q3", "1999Q4",
                              "2000Q1", "2000Q2", "2000Q3", "2000Q4"))
library(stringr)
mydata$year <- str_extract(mydata$year, "[[:digit:]]{4}")
mydata <- aggregate(usd ~ year, mydata, mean)
mydata
year usd
1 1997 65.0900
2 1998 69.2800
3 1999 71.2475
4 2000 71.7200
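For what it's worth, base R can do the same extraction without stringr; since the year is always the first four characters, substr() suffices (reusing mydata from above):
mydata$year <- substr(mydata$year, 1, 4)   # keep only the 4-digit year prefix
aggregate(usd ~ year, mydata, mean)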

Avoid For-Loops in R

I'm sure this question has been posed before, but I'd like some input on my specific case. In return for your help, I'll use an interesting example.
Sean Lahman provides giant datasets of MLB baseball statistics, available free on his website (http://www.seanlahman.com/baseball-archive/statistics/).
I'd like to use this data to answer the following question: What is the average number of home runs per game recorded for each decade in the MLB?
Below I've pasted all relevant script:
teamdata = read.csv("Teams.csv", header = TRUE)
decades = c(1870,1880,1890,1900,1910,1920,1930,1940,1950,1960,1970,1980,1990,2000,2010,2020)
meanhomers = c()
# note: when i reaches length(decades), decades[i+1] is NA, so the subset is
# empty and the final element of meanhomers comes out as NaN
for(i in 1:length(decades)){
  meanhomers[i] = mean(teamdata$HR[teamdata$yearID >= decades[i] & teamdata$yearID < decades[i+1]])
}
My primary question is, how could this answer have been determined without resorting to the dreaded for-loop?
Side question: What simple script would have generated the decades vector for me?
(For those interested in the answer to the baseball question, see below.)
meanhomers
[1] 4.641026 23.735849 34.456522 20.421053 25.755682 61.837500 84.012500
[8] 80.987500 130.375000 132.166667 120.093496 126.700000 148.737410 173.826667
[15] 152.973333 NaN
Edit for clarity: Turns out I answered the wrong question; the answer provided above indicates the number of home runs per team per year, not per game. A little fix of the denominator would get the correct result.
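(A sketch of that denominator fix, still loop-based, dividing total HR by total games G in each decade:)
hrpergame = c()
for(i in 1:(length(decades)-1)){
  in_dec = teamdata$yearID >= decades[i] & teamdata$yearID < decades[i+1]
  hrpergame[i] = sum(teamdata$HR[in_dec]) / sum(teamdata$G[in_dec])
}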
Here's a data.table example. Because others showed how to use cut, I took another route for splitting the data into decades:
setDT(teamdata)   # read.csv() gives a data.frame; convert it first (requires library(data.table))
teamdata[, list(HRperYear = mean(HR)), by = 10*floor(yearID/10)]
However, the original question mentions average HRs per game, not per year (though the code and answers clearly deal with HRs per year).
Here's how you could compute average HRs per game (and average games per team per year):
teamdata[, list(HRperYear = mean(HR), HRperGame = sum(HR)/sum(G), games = mean(G)), by = 10*floor(yearID/10)]
floor HRperYear HRperGame games
1: 1870 4.641026 0.08911866 52.07692
2: 1880 23.735849 0.21543555 110.17610
3: 1890 34.456522 0.25140108 137.05797
4: 1900 20.421053 0.13686067 149.21053
5: 1910 25.755682 0.17010657 151.40909
6: 1920 61.837500 0.40144445 154.03750
7: 1930 84.012500 0.54593453 153.88750
8: 1940 80.987500 0.52351325 154.70000
9: 1950 130.375000 0.84289640 154.67500
10: 1960 132.166667 0.81977946 161.22222
11: 1970 120.093496 0.74580935 161.02439
12: 1980 126.700000 0.80990313 156.43846
13: 1990 148.737410 0.95741873 155.35252
14: 2000 173.826667 1.07340167 161.94000
15: 2010 152.973333 0.94427984 162.00000
(The low average game totals in the 1980's and 1990's are due to the 1981 and 1994-5 player strikes).
PS: Nicely-written question, but it would be extra nice for you to provide a fully reproducible example so that I don't have to go and download the CSV to answer your question. Making dummy data is OK.
You can use seq to generate sequences.
decades <- seq(1870, 2020, by=10)
You can use cut to split up numeric variables into intervals.
teamdata$decade <- cut(teamdata$yearID, breaks=decades, dig.lab=4)
Basically it creates a factor with one level for each decade (as specified by the breaks). The dig.lab=4 is just so it prints the years as e.g. "1870" not "1.87e+03".
See ?cut for further configuration (e.g. is '1980' included in this decade or the next one, & so on. You can even configure the labels if you think you'll use them.)
Then to do something for each decade, use the plyr package (data.table and dplyr are other options, but I think plyr has the easiest learning curve, and your data does not seem large enough to need data.table).
library(plyr)
ddply(teamdata, .(decade), summarize, meanhomers=mean(HR))
decade meanhomers
1 (1870,1880] 4.930233
2 (1880,1890] 25.409091
3 (1890,1900] 35.115702
4 (1900,1910] 20.068750
5 (1910,1920] 27.284091
6 (1920,1930] 67.681250
7 (1930,1940] 84.050000
8 (1940,1950] 84.125000
9 (1950,1960] 130.718750
10 (1960,1970] 133.349515
11 (1970,1980] 117.745968
12 (1980,1990] 127.584615
13 (1990,2000] 155.053191
14 (2000,2010] 170.226667
15 (2010,2020] 152.775000
Mine is a little different to yours because my intervals are (, ] whereas yours are [, ). Can adjust cut to switch these around.
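For the record, the switch is cut()'s right argument; right = FALSE makes the intervals left-closed like your loop:
teamdata$decade <- cut(teamdata$yearID, breaks=decades, right=FALSE, dig.lab=4)   # [1870,1880), [1880,1890), ...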
You can also use the sqldf package in order to use SQL queries on the data.
Here is the code:
library(sqldf)
sqldf("select floor(yearID/10)*10 as decade,avg(hr) as count
from Teams
group by decade;")
decade count
1 1870 4.641026
2 1880 23.735849
3 1890 34.456522
4 1900 20.421053
5 1910 25.755682
6 1920 61.837500
7 1930 84.012500
8 1940 80.987500
9 1950 130.375000
10 1960 132.166667
11 1970 120.093496
12 1980 126.700000
13 1990 148.737410
14 2000 173.826667
15 2010 152.973333
aggregate is handy for this sort of thing. You can use your decades object with findInterval to put the years into bins:
aggregate(HR ~ findInterval(yearID, decades), data=teamdata, FUN=mean)
## findInterval(yearID, decades) HR
## 1 1 4.641026
## 2 2 23.735849
## 3 3 34.456522
## 4 4 20.421053
## 5 5 25.755682
## 6 6 61.837500
## 7 7 84.012500
## 8 8 80.987500
## 9 9 130.375000
## 10 10 132.166667
## 11 11 120.093496
## 12 12 126.700000
## 13 13 148.737410
## 14 14 173.826667
## 15 15 152.973333
Note that the intervals used are left-closed, as you desire. Also note that the intervals need not be regular. Yours are, which leads to the "side question" of how to produce the decades vector: don't even compute it. Instead, directly compute which decade each year falls in:
aggregate(HR ~ I(10 * (yearID %/% 10)), data=teamdata, FUN=mean)
## I(10 * (yearID%/%10)) HR
## 1 1870 4.641026
## 2 1880 23.735849
## 3 1890 34.456522
## 4 1900 20.421053
## 5 1910 25.755682
## 6 1920 61.837500
## 7 1930 84.012500
## 8 1940 80.987500
## 9 1950 130.375000
## 10 1960 132.166667
## 11 1970 120.093496
## 12 1980 126.700000
## 13 1990 148.737410
## 14 2000 173.826667
## 15 2010 152.973333
I usually prefer the formula interface to aggregate as used above, but you can get better names directly by using the non-formula interface. Here's the example for each of the above:
with(teamdata, aggregate(list(mean.HR=HR), list(Decade=findInterval(yearID,decades)), FUN=mean))
## Decade mean.HR
## 1 1 4.641026
## ...
with(teamdata, aggregate(list(mean.HR=HR), list(Decade=10 * (yearID %/% 10)), FUN=mean))
## Decade mean.HR
## 1 1870 4.641026
## ...
dplyr::group_by, mixed with cut is a good option here, and avoids looping. The decades vector is just a stepped sequence.
decades <- seq(1870,2020,by=10)
cut breaks the data into categories, which I've labelled by the decades themselves for clarity.
teamdata$decade <- cut(teamdata$yearID, breaks=decades, right=FALSE, labels=decades[1:(length(decades)-1)])
Then dplyr handles the grouped summarise as neatly as you could hope
library(dplyr)
teamdata %>% group_by(decade) %>% summarise(meanhomers=mean(HR))
# decade meanhomers
# (fctr) (dbl)
# 1 1870 4.641026
# 2 1880 23.735849
# 3 1890 34.456522
# 4 1900 20.421053
# 5 1910 25.755682
# 6 1920 61.837500
# 7 1930 84.012500
# 8 1940 80.987500
# 9 1950 130.375000
# 10 1960 132.166667
# 11 1970 120.093496
# 12 1980 126.700000
# 13 1990 148.737410
# 14 2000 173.826667
# 15 2010 152.973333
