Calculation and replacement in R

I have a dataset like the following and I need to compare the value of each year (2005-2009) with the average value of (2002-2004).
Year Firm R
2002 A 30
2003 A 11
2004 A 1
2005 A 7
2006 A 15
2007 A 20
2008 A 3.5
2009 A 8
2002 B 24
2003 B 30
2004 B 25
2005 B 5.2
2006 B 11.8
2007 B 78
2008 B 90
2009 B 57
The issue is that I need to calculate the average of 2002-2004 for each firm and replace the values in the years 2002-2004 with that average. For example, the new dataset should look like this:
Year Firm R
2002 A 14
2003 A 14
2004 A 14
2005 A 7
2006 A 15
2007 A 20
2008 A 3.5
2009 A 8
2002 B 26.333
2003 B 26.333
2004 B 26.333
2005 B 5.2
2006 B 11.8
2007 B 78
2008 B 90
2009 B 57
I have tried to use the following code:
df$R[df$Year==2002 & df$Year==2003 & df$Year==2004] = (df$R[df$Year==2002] + df$R[df$Year==2003] + df$R[df$Year==2004])/3
but when I apply it, nothing changes.
I hope you can help with this issue.

You can use data.table for this if you like:
library(data.table)
year <- c(rep(seq(2002,2009,1),2))
firm <- c(rep("A",8),rep("B",8))
r <- c(30,11,1,7,15,20,3.5,8,24,30,25,5.2,11.8,78,90,57)
aa <- data.table(year,firm,r)
aa[year>=2002 & year<=2004, r:= mean(r), by = firm]
Giving this result :
year firm r
1: 2002 A 14.00000
2: 2003 A 14.00000
3: 2004 A 14.00000
4: 2005 A 7.00000
5: 2006 A 15.00000
6: 2007 A 20.00000
7: 2008 A 3.50000
8: 2009 A 8.00000
9: 2002 B 26.33333
10: 2003 B 26.33333
11: 2004 B 26.33333
12: 2005 B 5.20000
13: 2006 B 11.80000
14: 2007 B 78.00000
15: 2008 B 90.00000
16: 2009 B 57.00000

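There are two mistakes in your code: you are not grouping by firm name, and you are using & instead of |. A single row can never have Year equal to 2002, 2003 and 2004 at the same time, so the & condition selects no rows. In my example, test.txt is a tab-separated file with the same input as in the question.
The code below should help you achieve what you need.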
library(dplyr)
df <- read.delim('test.txt', header = T, sep = '\t')
print(df)
# get unique firm names for grouping
firms <- unique(df$Firm)
# for each firm, calculate mean and update it
for (f in firms){
df$R[df$Firm == f & (df$Year==2002 | df$Year==2003 | df$Year==2004)] =
sum(df$R[df$Firm == f & (df$Year==2002 | df$Year==2003 | df$Year==2004)])/3
}
print(df)
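The same idea can be sketched without a loop. This is a minimal base R sketch, assuming the question's data is rebuilt by hand as df; the names n_selected and base are only illustrative. It first shows that the & condition selects no rows, then uses %in% and ave() to replace the values by firm.

```r
# Rebuild the question's data (values copied from the question)
df <- data.frame(
  Year = rep(2002:2009, times = 2),
  Firm = rep(c("A", "B"), each = 8),
  R    = c(30, 11, 1, 7, 15, 20, 3.5, 8,
           24, 30, 25, 5.2, 11.8, 78, 90, 57)
)

# A year cannot equal 2002, 2003 and 2004 at once, so this counts zero rows
n_selected <- sum(df$Year == 2002 & df$Year == 2003 & df$Year == 2004)

# %in% is the "or" test; ave() computes the mean separately for each firm
base <- df$Year %in% 2002:2004
df$R[base] <- ave(df$R[base], df$Firm[base], FUN = mean)
print(df)
```

Rows outside 2002-2004 are left untouched because only the positions flagged in base are assigned.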

Try this dplyr version (data is the data frame from the question):
library(tidyverse)
data %>%
filter(Year < 2005) %>% # keep only the base years 2002-2004
group_by(Firm) %>% # one group per firm
summarise(m = mean(R)) %>% # base-period mean per firm, stored in column m
left_join(data, by = "Firm") %>% # attach each firm's mean to all of its original rows
mutate(R = ifelse(Year < 2005, m, R)) %>% # overwrite R with the mean in the base years only
select(Year, Firm, R) -> newdata # keep the original columns and store the result as newdata

Related

Subtract specific rows and rename the result

Is it possible to subtract certain rows from each other and rename the result?
year <- c(2005,2005,2005,2006,2006,2006,2007,2007,2007)
category <- c("a","b","c","a","b","c", "a", "b", "c")
value <- c(2,2,10,3,3,12,4,4,16)
df <- data.frame(year, category,value, stringsAsFactors = FALSE)
And this is how the result should look:
year category value
2005 a 2
2005 b 2
2005 c 4
2006 a 3
2006 b 3
2006 c 12
2007 a 4
2007 b 4
2007 c 16
2005 c-b 2
2006 c-b 9
2007 c-b 12
You can use group_modify:
library(tidyverse)
df %>%
group_by(year) %>%
group_modify(~ add_row(.x, category = "c-b", value = .x$value[.x$category == "c"] - .x$value[.x$category == "b"]))
# A tibble: 12 x 3
# Groups: year [3]
year category value
<dbl> <chr> <dbl>
1 2005 a 2
2 2005 b 2
3 2005 c 10
4 2005 c-b 8
5 2006 a 3
6 2006 b 3
7 2006 c 12
8 2006 c-b 9
9 2007 a 4
10 2007 b 4
11 2007 c 16
12 2007 c-b 12
See the subset() function.
Example:
subset_df <- subset(df, category == "c")
If you want to know which rows you are dealing with, use which():
rows <- which(df$category == "c")
subset_df <- df[rows, ]
You can then rename the selected rows, supplying one name per row:
row.names(subset_df) <- paste0("c_", subset_df$year)
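For the subtraction itself, here is a minimal base R sketch. It assumes every year has exactly one "b" row and one "c" row; the helper names c_rows, b_rows, diff_rows and result are illustrative and not from the question.

```r
df <- data.frame(
  year     = c(2005, 2005, 2005, 2006, 2006, 2006, 2007, 2007, 2007),
  category = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
  value    = c(2, 2, 10, 3, 3, 12, 4, 4, 16),
  stringsAsFactors = FALSE
)

# Pull out the "c" and "b" rows and line them up by year
c_rows <- df[df$category == "c", ]
b_rows <- df[df$category == "b", ]
b_rows <- b_rows[match(c_rows$year, b_rows$year), ]

# One new row per year holding the difference c - b
diff_rows <- data.frame(
  year     = c_rows$year,
  category = "c-b",
  value    = c_rows$value - b_rows$value,
  stringsAsFactors = FALSE
)

# Append the new rows below the original data
result <- rbind(df, diff_rows)
print(result)
```

The match() call guards against the "b" and "c" rows being stored in a different year order.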

Create groups based on time period

How can I create a new grouping variable for my data based on 5-year steps?
So from this:
group <- c(rep("A", 7), rep("B", 10))
year <- c(2008:2014, 2005:2014)
dat <- data.frame(group, year)
group year
1 A 2008
2 A 2009
3 A 2010
4 A 2011
5 A 2012
6 A 2013
7 A 2014
8 B 2005
9 B 2006
10 B 2007
11 B 2008
12 B 2009
13 B 2010
14 B 2011
15 B 2012
16 B 2013
17 B 2014
To this:
> dat
group year period
1 A 2008 2005_2009
2 A 2009 2005_2009
3 A 2010 2010_2014
4 A 2011 2010_2014
5 A 2012 2010_2014
6 A 2013 2010_2014
7 A 2014 2010_2014
8 B 2005 2005_2009
9 B 2006 2005_2009
10 B 2007 2005_2009
11 B 2008 2005_2009
12 B 2009 2005_2009
13 B 2010 2010_2014
14 B 2011 2010_2014
15 B 2012 2010_2014
16 B 2013 2010_2014
17 B 2014 2010_2014
I guess I could use cut(dat$year, breaks = ??) but I don't know how to set the breaks.
Here is one way of doing it:
dat$period <- paste(min <- floor(dat$year/5)*5, min+4,sep = "_")
I guess the trick here is to get the biggest whole number smaller than your year with the floor(year/x)*x function.
Here is a version that should work generally:
x <- 5
yearstart <- 2000
dat$period <- paste(min <- floor((dat$year-yearstart)/x)*x+yearstart,
min+x-1,sep = "_")
Use yearstart to make sure that a chosen year (e.g. 2000) starts a group, even when that year is not a multiple of x.
cut should do the job if you create actual Date objects from your 'year' column.
## convert 'year' column to dates
yrs <- paste0(dat$year, "-01-01")
yrs <- as.Date(yrs)
## create cuts of 5 years and add them to data.frame
dat$period <- cut(yrs, "5 years")
## create desired factor levels
library(lubridate)
lvl <- as.Date(levels(dat$period))
lvl <- paste(year(lvl), year(lvl) + 4, sep = "_")
levels(dat$period) <- lvl
head(dat)
group year period
1 A 2008 2005_2009
2 A 2009 2005_2009
3 A 2010 2010_2014
4 A 2011 2010_2014
5 A 2012 2010_2014
6 A 2013 2010_2014
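The question asked how to set the breaks for cut() directly. A minimal sketch on the numeric year column, assuming the periods should start in 2005 and be closed on the left; the names breaks and labels are illustrative.

```r
dat <- data.frame(
  group = c(rep("A", 7), rep("B", 10)),
  year  = c(2008:2014, 2005:2014)
)

# Break points at the start of each 5-year block, plus one closing break
breaks <- seq(2005, 2015, by = 5)   # 2005, 2010, 2015
labels <- paste(head(breaks, -1), head(breaks, -1) + 4, sep = "_")

# right = FALSE makes the intervals [2005, 2010) and [2010, 2015)
dat$period <- cut(dat$year, breaks = breaks, labels = labels, right = FALSE)
print(dat)
```

Years outside the range covered by breaks would become NA, so the sequence has to span the whole data.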

Conditional subsetting gone wrong in R

So I have this fairly basic problem with R subsetting, but because I'm a newbie I don't know how to solve it properly. There's example of some panel data I have:
idnr year sales space municipality pop
1 1 2004 110000 1095 136 71377
2 1 2005 110000 1095 136 71355
3 1 2006 110000 1095 136 71837
4 1 2007 120000 1095 136 72956
5 2 2004 35000 800 136 71377
6 3 2004 45000 1000 136 71377
7 3 2005 45000 1000 2584 23135
8 3 2006 45000 1000 2584 23258
9 3 2007 45000 1000 2584 23407
10 4 2005 180000 5000 2584 23254
11 4 2006 220000 5000 2584 23135
12 4 2007 250000 5000 2584 23258
So my problem is that I want to subset data using conditions for both year = 2004 AND (not or) year = 2005. However it doesn't seem to work. Code:
tab3 <- stores[stores$year==2004 & stores$year==2005, c("idnr","year")]
What I am trying to say is that I need to select data which existed in both 2004 and 2005, cause some entries existed either in 2004 or 2005, but not in both and hence should be excluded. Using data above as an example, this should be the output:
idnr year
1 2004
1 2005
3 2004
3 2005
Update:
I was hoping that akrun's method may work for selecting data entries, which appeared ONLY in 2005. Such that:
idnr year
4 2005
Unfortunately, it doesn't. Instead it groups both idnr's which appeared in 2004&2005 with those which appeared only in 2005. Any ideas?
Here is an option using "data.table". Convert the dataset ("df") to a "data.table" using setDT. Set the "year" column as the key (setkey(..)). Subset the rows that have 2004 or 2005 in the "year" column (J(c(2004,..))) and select the first two columns with 1:2.
library(data.table) # data.table_1.9.5
DT1 <- setkey(setDT(df),year)[J(c(2004,2005)), 1:2, with=FALSE]
DT1
# idnr year
#1: 1 2004
#2: 2 2004
#3: 3 2004
#4: 1 2005
#5: 3 2005
#6: 4 2005
Update
Based on the updated expected result, we can check whether there are more than one unique "year" entries (uniqueN(year)>1) per "idnr" group, get the row index (.I) as a column ("V1") and subset the data.table "DT1".
DT1[DT1[, .I[uniqueN(year)>1], idnr]$V1,]
# idnr year
#1: 1 2004
#2: 1 2005
#3: 3 2004
#4: 3 2005
Or everything in one liner
setDT(df)[year %in% 2004:2005, if(uniqueN(year) > 1L) year, idnr]
# idnr V1
# 1: 1 2004
# 2: 1 2005
# 3: 3 2004
# 4: 3 2005
Or a base R option would be
indx <- with(df, ave(year==2004, idnr, FUN=any)& ave(year==2005,
idnr, FUN=any) & year %in% 2004:2005)
df[indx,1:2]
# idnr year
#1 1 2004
#2 1 2005
#6 3 2004
#7 3 2005
Update2
Based on the dataset and the expected result shown, we can check whether the first value of "year" is 2005 for each "idnr" group. If it is, subset the first observation (.SD[1L,..]) and select the columns that are needed. This assumes the rows are sorted by year within each "idnr".
setDT(df)[,if(year[1L]==2005) .SD[1L,1,with=FALSE], by = idnr]
# idnr year
#1: 4 2005
Or
setDT(df)[df[,.I[year[1L]==2005] , by = idnr]$V1[1L], 1:2, with=FALSE]
# idnr year
#1: 4 2005
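Both selections can also be written with set operations on the ids. This is a minimal base R sketch, assuming only the idnr and year columns of the question's data (rebuilt by hand as stores) and reading "only in 2005" as "present in 2005 but not in 2004"; the helper names are illustrative.

```r
stores <- data.frame(
  idnr = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 4),
  year = c(2004:2007, 2004, 2004:2007, 2005:2007)
)

ids_2004 <- unique(stores$idnr[stores$year == 2004])
ids_2005 <- unique(stores$idnr[stores$year == 2005])

# Stores observed in both years
both <- intersect(ids_2004, ids_2005)
in_both <- stores[stores$idnr %in% both & stores$year %in% 2004:2005, ]

# Stores observed in 2005 but not in 2004
only_2005_ids <- setdiff(ids_2005, ids_2004)
only_2005 <- stores[stores$idnr %in% only_2005_ids & stores$year == 2005, ]

print(in_both)
print(only_2005)
```

Unlike the check on the first value of "year", this does not depend on the rows being sorted by year within each idnr.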
If you want to subset with either year == 2004 or year == 2005, you need to use the | operator instead of & in your actual approach:
tab3 <- stores[stores$year == 2004 | stores$year == 2005, c("idnr", "year")]
Which results:
#> tab3
# idnr year
#1 1 2004
#2 1 2005
#5 2 2004
#6 3 2004
#7 3 2005
#10 4 2005
Or using dplyr:
library(dplyr)
tab3 <- stores %>% select(idnr, year) %>% filter(year == 2004 | year == 2005)
More concisely:
tab3 <- stores %>% select(idnr, year) %>% filter(year %in% c(2004, 2005))

R: Function “diff” over various groups

While searching for a solution to my problem I found this thread: Function "diff" over various groups in R. I've got a very similar question so I'll just work with the example there.
This is what my desired output should look like:
name class year diff
1 a c1 2009 NA
2 a c1 2010 67
3 b c1 2009 NA
4 b c1 2010 20
I have two variables which form subgroups - class and name. So I want to compare only the values which have the same name and class. I also want to have the differences from 2009 to 2010. If there is no 2008, diff 2009 should return NA (since it can't calculate a difference).
I'm sure it works very similarly to the other thread but I just can't make it work. I used this code too (and simply solved the ascending year by sorting the data differently), but somehow R still manages to calculate a difference and does not return NA.
ddply(df, .(class, name), summarize, year=head(year, -1), value=diff(value))
Using the data set from the other post, I would do something like
library(data.table)
df <- df[df$year != 2008, ]
setkey(setDT(df), class, name, year)
df[, diff := lapply(.SD, function(x) c(NA, diff(x))),
.SDcols = "value", by = list(class, name)]
Which returns
df
# name class year value diff
# 1: a c1 2009 33 NA
# 2: a c1 2010 100 67
# 3: b c1 2009 80 NA
# 4: b c1 2010 90 10
# 5: a c2 2009 80 NA
# 6: a c2 2010 90 10
# 7: b c2 2009 90 NA
# 8: b c2 2010 100 10
# 9: a c3 2009 90 NA
#10: a c3 2010 100 10
#11: b c3 2009 80 NA
#12: b c3 2010 99 19
Using dplyr
df %>%
filter(year!=2008)%>%
arrange(name, class, year)%>%
group_by(class, name)%>%
mutate(diff=c(NA,diff(value)))
# Source: local data frame [12 x 5]
# Groups: class, name
# name class year value diff
# 1 a c1 2009 33 NA
# 2 a c1 2010 100 67
# 3 a c2 2009 80 NA
# 4 a c2 2010 90 10
# 5 a c3 2009 90 NA
# 6 a c3 2010 100 10
# 7 b c1 2009 80 NA
# 8 b c1 2010 90 10
# 9 b c2 2009 90 NA
# 10 b c2 2010 100 10
# 11 b c3 2009 80 NA
# 12 b c3 2010 99 19
Update:
With relative difference
df %>%
filter(year!=2008)%>%
arrange(name, class, year)%>%
group_by(class, name)%>%
mutate(diff1=c(NA,diff(value)), rel_diff=round(diff1/lag(value),2))
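A base R alternative is ave() with two grouping variables. This is a minimal sketch, assuming the 2008 rows have already been dropped and using only the four c1 rows shown in the output above.

```r
df <- data.frame(
  name  = c("a", "a", "b", "b"),
  class = c("c1", "c1", "c1", "c1"),
  year  = c(2009, 2010, 2009, 2010),
  value = c(33, 100, 80, 90)
)

# Sort so that years are ascending within each class/name pair
df <- df[order(df$class, df$name, df$year), ]

# ave() applies the function within each class/name group and keeps row order;
# the first year of a group has no predecessor, so its difference is NA
df$diff <- ave(df$value, df$class, df$name,
               FUN = function(x) c(NA, diff(x)))
print(df)
```

Padding with NA keeps the result the same length as each group, which ave() requires.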

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data) in which the Ctry column holds the country name. If the number of NAs in any column (for example, Carx) is larger than 3 for a country, I want to drop that country from my data frame. For example,
Country A has 2 NA
Country B has 4 NA
Country C has 3 NA
I want to drop country B from my data frame. I have a data frame like this (this is for illustration; my actual data frame is very large):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave, to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count the NAs in each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data=DF$Carx,INDICES=DF$Ctry,FUN=function(x)sum(is.na(x)))
validCtry <-names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT :
if you have more columns to check, you could adapt the previous code as follows:
res <- by(data=DF,INDICES=DF$Ctry,
FUN=function(x){
return(sum(is.na(x$Carx)) <= 3 &&
sum(is.na(x$Barx)) <= 3 &&
sum(is.na(x$Tarx)) <= 3)
})
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
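When there are many columns to check, the per-column conditions can be replaced by one count matrix. This is a minimal sketch, not the answer's method: Barx and its values are hypothetical and added only for illustration, and check_cols and na_counts are illustrative names.

```r
# Carx is copied from the question; Barx is a hypothetical second column
DF <- data.frame(
  Ctry = rep(c("A", "B", "C"), each = 6),
  year = rep(2000:2005, times = 3),
  Carx = c(23, 18, 20, NA, 24, 18,
           NA, NA, NA, NA, 18, 16,
           NA, NA, 24, 21, NA, 24),
  Barx = c(1, 2, 3, 4, 5, 6,
           1, 2, 3, 4, 5, 6,
           NA, NA, NA, NA, NA, 6)
)

check_cols <- c("Carx", "Barx")

# NA counts: one row per country, one column per checked variable
na_counts <- rowsum(+is.na(DF[check_cols]), DF$Ctry)

# Keep countries where no checked column has more than 3 NAs
validCtry <- rownames(na_counts)[apply(na_counts <= 3, 1, all)]
result <- DF[DF$Ctry %in% validCtry, ]
print(result)
```

With these made-up Barx values, country C is dropped as well, because it has five NAs in Barx even though Carx alone would pass.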
Since you mention that your data is very large, you could try a solution with dplyr and see whether it is faster than the base R solutions. If the other solutions are fast enough, just ignore this one.
library(dplyr)
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)
