calculate differences in dataframe - r

I have a dataframe that looks like this:
set.seed(50)
data.frame(distance=c(rep("long", 5), rep("short", 5)),
year=rep(2002:2006),
mean.length=rnorm(10))
distance year mean.length
1 long 2002 0.54966989
2 long 2003 -0.84160374
3 long 2004 0.03299794
4 long 2005 0.52414971
5 long 2006 -1.72760411
6 short 2002 -0.27786453
7 short 2003 0.36082844
8 short 2004 -0.59091244
9 short 2005 0.97559055
10 short 2006 -1.44574995
I need to calculate the difference in mean.length between long and short in each year. What's the fastest way of doing this?

Here's one way using plyr:
set.seed(50)
df <- data.frame(distance=c(rep("long", 5),rep("short", 5)),
year=rep(2002:2006),
mean.length=rnorm(10))
library(plyr)
aggregation.fn <- function(df) {
data.frame(year=df$year[1],
diff=(df$mean.length[df$distance == "long"] -
df$mean.length[df$distance == "short"]))}
new.df <- ddply(df, "year", aggregation.fn)
Gives you
> new.df
year diff
1 2002 0.8275344
2 2003 -1.2024322
3 2004 0.6239104
4 2005 -0.4514408
5 2006 -0.2818542
A second way
df <- df[order(df$year, df$distance), ]
n <- dim(df)[1]
df$new.year <- c(1, df$year[2:n] != df$year[1:(n-1)])
df$diff <- c(-diff(df$mean.length), NA)
df$diff[!df$new.year] <- NA
new.df.2 <- df[!is.na(df$diff), c("year", "diff")]
all(new.df.2 == new.df) # TRUE

Use tapply() and apply() like this (here x is the data frame from the question):
apply(
with(x, tapply(mean.length, list(year, distance), FUN=mean)),
1,
diff
)
2002 2003 2004 2005 2006
-0.8275344 1.2024322 -0.6239104 0.4514408 0.2818542
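Note that diff() computes short minus long for each year, so the signs are the opposite of the other answers. This works because tapply creates a tabular summary by year and distance: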
with(x, tapply(mean.length, list(year, distance), FUN=mean))
long short
2002 0.54966989 -0.2778645
2003 -0.84160374 0.3608284
2004 0.03299794 -0.5909124
2005 0.52414971 0.9755906
2006 -1.72760411 -1.4457499

Since you seem to have paired values and the data.frame is ordered, you can do this:
res <- with(DF, mean.length[distance=="long"]-mean.length[distance=="short"])
names(res) <- unique(DF$year)
# 2002 2003 2004 2005 2006
#0.8275344 -1.2024322 0.6239104 -0.4514408 -0.2818542
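This should be quite fast, but it is not as safe as the other answers because it relies on both assumptions: every year has exactly one long and one short value, and the rows are in the same year order within each group.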
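If the ordering assumption cannot be guaranteed, one option is to pair the rows explicitly by merging the two subsets on year. This is a minimal base R sketch, not part of the answer above; the names `shuffled`, `long_rows`, `short_rows`, and `paired` are made up for illustration, and the rows are shuffled only to show that order no longer matters.

```r
set.seed(50)
DF <- data.frame(distance = rep(c("long", "short"), each = 5),
                 year = rep(2002:2006, times = 2),
                 mean.length = rnorm(10))

# Shuffle the rows so the "ordered" assumption no longer holds
shuffled <- DF[sample(nrow(DF)), ]

# Split into the two groups, keeping only the columns needed for pairing
long_rows  <- shuffled[shuffled$distance == "long",  c("year", "mean.length")]
short_rows <- shuffled[shuffled$distance == "short", c("year", "mean.length")]

# merge() pairs rows by year (inner join) and sorts the result by year
paired <- merge(long_rows, short_rows, by = "year",
                suffixes = c(".long", ".short"))
paired$diff <- paired$mean.length.long - paired$mean.length.short
paired[, c("year", "diff")]
```

Because merge() performs an inner join by default, a year that is missing from either group simply drops out of `paired` instead of silently misaligning the subtraction.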

You've received some good answers for computing the specific question at hand. It may make sense for you to consider reshaping your data into a wide format. Here are two options:
reshape(df, direction = "wide", idvar = "year", timevar = "distance")
#---
year mean.length.long mean.length.short
1 2002 0.54966989 -0.2778645
2 2003 -0.84160374 0.3608284
3 2004 0.03299794 -0.5909124
4 2005 0.52414971 0.9755906
5 2006 -1.72760411 -1.4457499
#package reshape2 is probably easier to use.
library(reshape2)
dcast(df, year ~ distance, value.var = "mean.length")
#---
year long short
1 2002 0.54966989 -0.2778645
2 2003 -0.84160374 0.3608284
3 2004 0.03299794 -0.5909124
4 2005 0.52414971 0.9755906
5 2006 -1.72760411 -1.4457499
You can easily compute your new statistics now.
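As a concrete illustration of that last step, here is a small sketch using the base reshape() call from above and then subtracting the two wide columns. The variable name `wide` and the new column `diff` are chosen here for illustration only.

```r
set.seed(50)
df <- data.frame(distance = rep(c("long", "short"), each = 5),
                 year = rep(2002:2006, times = 2),
                 mean.length = rnorm(10))

# One row per year, one column per distance
wide <- reshape(df, direction = "wide", idvar = "year", timevar = "distance")

# The difference is now a plain column-wise subtraction
wide$diff <- wide$mean.length.long - wide$mean.length.short
wide[, c("year", "diff")]
```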

Related

multiplying column from data frame 1 by a condition found in data frame 2

I have two separate data frames. For each row of data frame 1, I want to look up the same year in data frame 2 and multiply a column from data frame 1 by the number found there. For example, imagine my first data frame is:
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
year price
1 2001 1000
2 2003 1000
3 2001 1000
4 2004 1000
5 2006 1000
6 2007 1000
7 2008 1000
8 2008 1000
9 2001 1000
10 2009 1000
11 2001 1000
Now, I have a second data frame which includes inflation conversion rates (code from #akrun):
library(dplyr)  # needed for %>% and mutate()
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year <- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data <- inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100))
ref_year ref_inf final_inf
1 2010 2.0 1.020000
2 2009 3.0 1.050600
3 2008 1.0 1.061106
4 2007 2.2 1.084450
5 2006 1.3 1.098548
6 2005 1.5 1.115026
7 2004 1.9 1.136212
8 2003 1.8 1.156664
9 2002 1.9 1.178640
10 2001 1.9 1.201035
For example, the first row of data frame 1 has the year 2001, so I look up the conversion rate for 2001 in data frame 2, which is 1.201035, and then multiply the price in data frame 1 by that rate.
So the result should look like this:
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
Is there any way to do this without using if and else statements?
We can do a join on the 'year' with 'ref_year' and create the new column by assigning (:=) the output of product of 'price' and 'final_inf'
library(data.table)
setDT(df)[inf_data, after_conv := price * final_inf, on = .(year = ref_year)]
-output
df
# year price after_conv
# 1: 2001 1000 1201.035
# 2: 2003 1000 1156.664
# 3: 2001 1000 1201.035
# 4: 2004 1000 1136.212
# 5: 2006 1000 1098.548
# 6: 2007 1000 1084.450
# 7: 2008 1000 1061.106
# 8: 2008 1000 1061.106
# 9: 2001 1000 1201.035
#10: 2009 1000 1050.600
#11: 2001 1000 1201.035
Since the data is already being processed by dplyr, we can also solve this problem with dplyr. A dplyr based solution joins the data with the reference data by year and calculates after_conv.
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
library(dplyr)
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
rename(year = ref_year) %>%
left_join(df,.) %>%
mutate(after_conv = price * final_inf ) %>%
select(year,price,after_conv)
We use left_join() to keep the data ordered in the original order of df as well as ensure rows in inf_data only contribute to the output if they match at least one row in df. We use . to reference the data already in the pipeline as the right side of the join, merging in final_inf so we can use it in the subsequent mutate() function. We then select() to keep the three result columns we need.
...and the output:
Joining, by = "year"
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
We can save the result to the original df by writing the result of the pipeline to df.
inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
rename(year = ref_year) %>%
left_join(df,.) %>%
mutate(after_conv = price * final_inf ) %>%
select(year,price,after_conv) -> df
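For comparison, the same lookup can be sketched in base R without any packages, using match() to find the row of inf_data that corresponds to each year in df. This is an illustrative alternative rather than part of either answer above; `idx` is a helper name introduced here.

```r
year <- c(2001, 2003, 2001, 2004, 2006, 2007, 2008, 2008, 2001, 2009, 2001)
price <- rep(1000, length(year))
df <- data.frame(year, price)

ref_inf <- c(2, 3, 1, 2.2, 1.3, 1.5, 1.9, 1.8, 1.9, 1.9)
ref_year <- seq(2010, 2001)
inf_data <- data.frame(ref_year, ref_inf)
inf_data$final_inf <- cumprod(1 + inf_data$ref_inf / 100)

# For every year in df, find the position of the matching ref_year
idx <- match(df$year, inf_data$ref_year)

# Years with no match would give NA here rather than an error
df$after_conv <- df$price * inf_data$final_inf[idx]
df
```

Like left_join(), this keeps the original row order of df, and a year that is absent from inf_data produces NA in after_conv.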

Adding data points in a column by factors in R

The data.frame my_data consists of two columns ("PM2.5" and "year") and around 6,400,000 rows. It holds data points for pollutant levels of "PM2.5" for the years 1999, 2002, 2005 and 2008.
This is what I have done to the data.frame:
{
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
}
I want to find the sum of all PM2.5 levels (i.e. the sum of all data points under PM2.5) for each year. How can I do it?
[Image: the first 20 rows of the data.frame. Because the data is sorted by year, only 1999 is visible.]
Say this is your data:
library(plyr) # <- don't forget to tell us what libraries you are using
# give us an easy sample set
my_data <- data.frame(year=sample(c("1999","2002","2005","2008"), 10, replace=TRUE), PM2.5 = rnorm(10,mean = 5))
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
> my_data
year PM2.5
1 1999 5.556852
2 2002 5.508820
3 2002 4.836500
4 2002 3.766266
5 2005 6.688936
6 2005 5.025600
7 2005 4.041670
8 2005 4.614784
9 2005 4.352046
10 2008 6.378134
One way to do it (out of many, many ways already shown by a simple google search):
> with(my_data, (aggregate(PM2.5, by=list(year), FUN="sum")))
Group.1 x
1 1999 5.556852
2 2002 14.111586
3 2005 24.723037
4 2008 6.378134
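Two of those other ways, sketched here in base R on the sample values printed above (the data frame is typed in by hand so the example does not depend on the random draw): tapply() returns a named vector of totals, and the formula interface of aggregate() keeps the original column names instead of Group.1 and x. The name `totals` is just for illustration.

```r
my_data <- data.frame(
  year = factor(c("1999", "2002", "2002", "2002", "2005",
                  "2005", "2005", "2005", "2005", "2008")),
  PM2.5 = c(5.556852, 5.508820, 4.836500, 3.766266, 6.688936,
            5.025600, 4.041670, 4.614784, 4.352046, 6.378134)
)

# Named vector: one total per year
totals <- tapply(my_data$PM2.5, my_data$year, sum)
totals

# Data frame with columns named year and PM2.5
aggregate(PM2.5 ~ year, data = my_data, FUN = sum)
```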

Aggregate using a certain value

I'm trying to use the aggregate function in R to get the total EMISSIONS, organized by YEAR, but only for rows where FIPS is equal to 24510. The following code gives me the right result, but it also adds the EMISSIONS summed across all other FIPS values. What am I missing here?
This is the function I'm using:
sum <- aggregate(NEI$Emissions, list(Year = NEI$year, NEI$fips == 24510), sum);
This is the output:
Year Group.2 x
1 1999 FALSE 7329692.557
2 2002 FALSE 5633326.582
3 2005 FALSE 5451611.723
4 2008 FALSE 3462343.556
5 1999 TRUE 3274.180
6 2002 TRUE 2453.916
7 2005 TRUE 3091.354
8 2008 TRUE 1862.282
This is the output that I would like:
Year x
1 1999 3274.180
2 2002 2453.916
3 2005 3091.354
4 2008 1862.282
Should I be using subset separately or can this be done with aggregate alone?
Using this sample
set.seed(15)
NEI <- data.frame(year=2000:2004, fips=rep(c(24510,57399), each=5), Emissions=rnorm(10))
you could use the command
mysum <- aggregate(Emissions~year, subset(NEI, fips == 24510), sum);
to get
year Emissions
1 2000 0.2588229
2 2001 1.8311207
3 2002 -0.3396186
4 2003 0.8971982
5 2004 0.4880163
(also, don't save a value to a variable named sum -- that will conflict with the base function sum())
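To address the "aggregate alone" part of the question: the formula method of aggregate() also accepts a subset argument, so the filtering can happen inside the same call. A short sketch on the sample data above, with `via_subset_fn` and `via_subset_arg` as illustrative names:

```r
set.seed(15)
NEI <- data.frame(year = 2000:2004,
                  fips = rep(c(24510, 57399), each = 5),
                  Emissions = rnorm(10))

# Filtering first with subset(), as in the answer above
via_subset_fn <- aggregate(Emissions ~ year, subset(NEI, fips == 24510), sum)

# Filtering inside aggregate() through its subset argument
via_subset_arg <- aggregate(Emissions ~ year, data = NEI, FUN = sum,
                            subset = fips == 24510)

# Both approaches should give the same table
all.equal(via_subset_fn, via_subset_arg)
```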

Long Format Function [duplicate]

I am having trouble with a project. I created a dataframe (called dat) in long format (I copied the first 3 rows below) and I want to calculate, for example, the mean of the Pretax Income of all banks in the United States for the years 2000 to 2011. How would I do that? I have hardly any experience in R. I am sorry if the answer is too obvious, but I couldn't find anything and I have already spent a lot of time on the project. Thank you in advance!
            KeyItem                  Bank       Country Year        Value
1     Pretax Income WELLS_FARGO_&_COMPANY UNITED STATES 2011 2.365600e+10
2      Total Assets WELLS_FARGO_&_COMPANY UNITED STATES 2011 1.313867e+12
3 Total Liabilities WELLS_FARGO_&_COMPANY UNITED STATES 2011 1.172180e+12
The following should get you started. You basically need to do two things: subset, and aggregate. I'll demonstrate a base R solution and a data.table solution.
First, some sample data.
set.seed(1) # So you can reproduce my results
dat <- data.frame(KeyItem = rep(c("Pretax", "TotalAssets", "TotalLiabilities"),
times = 30),
Bank = rep(c("WellsFargo", "BankOfAmerica", "ICICI"),
each = 30),
Country = rep(c("UnitedStates", "India"), times = c(60, 30)),
Year = rep(c(2000:2009), each = 3, times = 3),
Value = runif(90, min=300, max=600))
Let's aggregate the mean of the "Pretax" values by "Country" and "Year", but only for the years 2001 to 2005.
aggregate(Value ~ Country + Year,
dat[dat$KeyItem == "Pretax" & dat$Year >= 2001 & dat$Year <=2005, ],
mean)
# Country Year Value
# 1 India 2001 399.7184
# 2 UnitedStates 2001 464.1638
# 3 India 2002 443.5636
# 4 UnitedStates 2002 560.8373
# 5 India 2003 562.5964
# 6 UnitedStates 2003 370.9591
# 7 India 2004 404.0050
# 8 UnitedStates 2004 520.4933
# 9 India 2005 567.6595
# 10 UnitedStates 2005 493.0583
Here's the same thing in data.table
library(data.table)
DT <- data.table(dat, key = "Country,Bank,Year")
subset(DT, KeyItem == "Pretax")[Year %between% c(2001, 2005),
mean(Value), by = list(Country, Year)]
# Country Year V1
# 1: India 2001 399.7184
# 2: India 2002 443.5636
# 3: India 2003 562.5964
# 4: India 2004 404.0050
# 5: India 2005 567.6595
# 6: UnitedStates 2001 464.1638
# 7: UnitedStates 2002 560.8373
# 8: UnitedStates 2003 370.9591
# 9: UnitedStates 2004 520.4933
# 10: UnitedStates 2005 493.0583
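For the single number asked about in the question (the mean pretax value across all banks in one country over a range of years), the same subset-then-summarise idea can be sketched as follows. It reuses the made-up sample data from this answer, so the labels "Pretax" and "UnitedStates" and the years 2000 to 2009 are properties of that sample, not of the real dataset; `us_pretax`, `overall_mean`, and `yearly_mean` are illustrative names.

```r
set.seed(1)
dat <- data.frame(KeyItem = rep(c("Pretax", "TotalAssets", "TotalLiabilities"),
                                times = 30),
                  Bank = rep(c("WellsFargo", "BankOfAmerica", "ICICI"),
                             each = 30),
                  Country = rep(c("UnitedStates", "India"), times = c(60, 30)),
                  Year = rep(c(2000:2009), each = 3, times = 3),
                  Value = runif(90, min = 300, max = 600))

# Keep only the pretax rows for one country within the year range
us_pretax <- subset(dat, KeyItem == "Pretax" & Country == "UnitedStates" &
                         Year >= 2000 & Year <= 2011)

# One overall mean across all banks and years
overall_mean <- mean(us_pretax$Value)

# Or one mean per year, as a named vector
yearly_mean <- tapply(us_pretax$Value, us_pretax$Year, mean)
```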

R table conversion

Hello I am working with a table with these characteristics:
2000 0.051568
2000 0.04805
2002 0.029792
2002 0.056141
2008 0.047285
2008 0.038989
And I need to convert it to something like this:
2000 2002 2008
0.051568 0.029792 0.047285
0.04805 0.056141 0.038989
I would be grateful if somebody could give me a solution.
Here's a relatively simple solution:
# CREATE ORIGINAL DATA.FRAME
df <- read.table(text="2000 0.051568
2000 0.04805
2002 0.029792
2002 0.056141
2008 0.047285
2008 0.038989", header=FALSE)
names(df) <- c("year", "value")
# MODIFY ITS LAYOUT
df2 <- as.data.frame(split(df$value, df$year))
df2
# X2000 X2002 X2008
# 1 0.051568 0.029792 0.047285
# 2 0.048050 0.056141 0.038989
I'm guessing you are new to R, so I'm going to guess what you mean and give you some more correct terminology. If I guess wrong, then at least this may help you to clarify the question.
In R, a table is a special case of a matrix that arises from cross-tabulation. What I think you have (or want) to start with is a data.frame. A data.frame is a set of columns with potentially different types, but all the same length; it is "rectangular" in that sense. Generally, elements in the same positions in the columns (that is, each row) of a data.frame are related to each other. The columns of a data.frame have names, as can the rows.
long <- data.frame(year=c(2000,2000,2002,2002,2008,2008),
val=c(0.051568, 0.04805, 0.029792,
0.056141, 0.047285, 0.038989))
Which when printed looks like
> long
year val
1 2000 0.051568
2 2000 0.048050
3 2002 0.029792
4 2002 0.056141
5 2008 0.047285
6 2008 0.038989
By itself, this isn't enough, because for your desired output, you need to specify which value for, say, 2000 is in the first row and which is in the second (etc., if there were more). In your example, it is just the order they are in.
long$targetrow = 1:2
Which makes long now look like
> long
year val targetrow
1 2000 0.051568 1
2 2000 0.048050 2
3 2002 0.029792 1
4 2002 0.056141 2
5 2008 0.047285 1
6 2008 0.038989 2
Now you can use reshape on it.
reshape(long, idvar="targetrow", timevar="year", direction="wide")
which gives
> reshape(long, idvar="targetrow", timevar="year", direction="wide")
targetrow val.2000 val.2002 val.2008
1 1 0.051568 0.029792 0.047285
2 2 0.048050 0.056141 0.038989
More complicated transformations are possible using the reshape2 package, but this should get you started.
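The `1:2` recycling used for targetrow only works when every year has exactly two values. A slightly more general sketch, assuming the within-year order of the rows is the order wanted in the output, numbers the rows inside each year with ave(); the variable name `wide` is introduced here for illustration.

```r
long <- data.frame(year = c(2000, 2000, 2002, 2002, 2008, 2008),
                   val = c(0.051568, 0.04805, 0.029792,
                           0.056141, 0.047285, 0.038989))

# Number the rows 1, 2, ... separately within each year
long$targetrow <- ave(seq_along(long$year), long$year, FUN = seq_along)

wide <- reshape(long, idvar = "targetrow", timevar = "year", direction = "wide")
wide
```

If some years have fewer values than others, reshape() fills the missing cells with NA.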
I may be misunderstanding the question, but is ?reshape what you are looking for?
from the examples:
summary(Indometh)
wide <- reshape(Indometh, v.names="conc", idvar="Subject", timevar="time", direction="wide")
wide
