First differences panel data depending on extra variables R

I have a panel dataset which looks like the following
ID Model Month Country Activations avg_price
1 VW Golf 2012-01 NL 23 5000
1 VW Golf 2012-02 NL 2 5500
1 VW Golf 2012-01 FR 8 6000
1 VW Golf 2012-02 FR 34 7000
2 Audi TT 2012-01 NL 8 6900
Now, I want to take first differences for the Activations and avg_price variables. I do this using the diff(data$Activations) function from the plm package, but first I have to transform the data frame using pdata.frame(data). So:
data_fd = pdata.frame(data)
data_fd$Activations = diff(data_fd$Activations)
This returns the following error using the data above: duplicate couples (id-time) in resulting pdata.frame. This is because I have data on different countries and when I aggregate the data over all the countries (so total Activations and avg_price and only one id-month combination) this works fine. However, I want now to take the first differences also using the Country variable.
My dataframe should, then, look like:
ID Model Month Country Activations avg_price
1 VW Golf 2012-01 NL NA NA
1 VW Golf 2012-02 NL -21 500
1 VW Golf 2012-01 FR NA NA
1 VW Golf 2012-02 FR 26 1000
etc
Does anyone know how I can make this happen?

Have a look; is this what you want?
lag_new <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L), Model = structure(c(2L,
2L, 2L, 2L, 1L), .Label = c("Audi TT", "VW Golf"), class = "factor"),
Month = structure(c(1L, 2L, 1L, 2L, 1L), .Label = c("2012-01",
"2012-02"), class = "factor"), Country = structure(c(2L,
2L, 1L, 1L, 2L), .Label = c("FR", "NL"), class = "factor"),
Activations = c(23L, 2L, 8L, 34L, 8L), avg_price = c(5000L,
5500L, 6000L, 7000L, 6900L), Activations_new = c(NA, -21L,
6L, 26L, -26L), avg_price_new = c(NA, 500L, 500L, 1000L,
-100L)), row.names = c(NA, -5L), class = "data.frame")
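# dplyr::lag shifts the vector by one position; stats::lag does not shift a plain vector
lag_new$Activations_new <- lag_new$Activations - dplyr::lag(lag_new$Activations)
lag_new$avg_price_new <- lag_new$avg_price - dplyr::lag(lag_new$avg_price)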
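The lagged differences above run down the whole column, so they also cross Country and ID boundaries, whereas the question asks for NA at the start of each ID-Country series. A minimal base R sketch of per-group first differences follows. It assumes that Month sorts chronologically as a "YYYY-MM" string and that there are no missing months within a series; the helper first_diff and the _fd column names are introduced here for illustration only.

```r
data <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 2L),
  Model = c("VW Golf", "VW Golf", "VW Golf", "VW Golf", "Audi TT"),
  Month = c("2012-01", "2012-02", "2012-01", "2012-02", "2012-01"),
  Country = c("NL", "NL", "FR", "FR", "NL"),
  Activations = c(23L, 2L, 8L, 34L, 8L),
  avg_price = c(5000L, 5500L, 6000L, 7000L, 6900L),
  stringsAsFactors = FALSE
)

# Sort so that months are consecutive within each ID-Country series
data <- data[order(data$ID, data$Country, data$Month), ]

# One grouping key per ID-Country series
key <- paste(data$ID, data$Country)

# Hypothetical helper: first difference, with NA for the first row of a group
first_diff <- function(x) c(NA, diff(x))

data$Activations_fd <- ave(data$Activations, key, FUN = first_diff)
data$avg_price_fd <- ave(data$avg_price, key, FUN = first_diff)
data
```

With plm, the duplicate (id-time) error presumably goes away in the same spirit once the individual index is unique per ID-Country combination rather than per ID alone, but that is an assumption to check against the pdata.frame documentation.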

Related

How to get counts for every levels of a factor in a data frame in r and get observations with low counts

I'm super new to R and programming in general, so my description of stuff may be a bit off. I'll try to be as clear as possible. I have a dataframe df.train that has hundreds of factor levels with obfuscated entries that may or may not be unique to their respective factor. I'm trying to get the row id of every observation that has at least one factor level whose count is below some given amount. If there is a nice way to get this, I would like that answer. If not, this is my current attempt at a solution and where I got stuck, with a sample dataframe:
structure(list(GKUhYLAE = structure(c(1L, 2L, 1L, 1L, 2L, 1L), .Label = c("DDOFi",
"fVvMw"), class = "factor"), OnTaJkLa = structure(c(1L, 1L, 1L,
1L, 1L, 1L), .Label = c("LDyDX", "sbxXu"), class = "factor"),
SsZAZLma = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("fSMSz",
"Hltat"), class = "factor"), BMmgMRvd = structure(c(2L, 1L,
2L, 2L, 2L, 2L), .Label = c("IjEdt", "QZujc"), class = "factor"),
OMtioXZZ = c(3L, 21L, 30L, 21L, 12L, 12L), bIBQTaHw = structure(c(1L,
3L, 1L, 3L, 3L, 2L), .Label = c("ALZyK", "qqkkL", "wQABW"
), class = "factor")), row.names = 1013:1018, class = "data.frame")
which gives me the following table
GKUhYLAE OnTaJkLa SsZAZLma BMmgMRvd OMtioXZZ bIBQTaHw
1013 DDOFi LDyDX fSMSz QZujc 3 ALZyK
1014 fVvMw LDyDX fSMSz IjEdt 21 wQABW
1015 DDOFi LDyDX fSMSz QZujc 30 ALZyK
1016 DDOFi LDyDX fSMSz QZujc 21 wQABW
1017 fVvMw LDyDX fSMSz QZujc 12 wQABW
1018 DDOFi LDyDX fSMSz QZujc 12 qqkkL
I can run the following to figure out which factors have counts lower than a certain amount:
library (dplyr)
library(plyr)
df.test.a = lapply( df.test[,!names( df.test) %in% c("id")], count)
df.test.freqcount <- as.data.frame(do.call(rbind, df.test.a))
df.test.list = df.test.freqcount[which( df.test.freqcount$freq <2),]
This returns:
x freq
BMmgMRvd.1 IjEdt 1
OMtioXZZ.1 NA 1
OMtioXZZ.4 NA 1
bIBQTaHw.2 qqkkL 1
(The leftmost column is the column name with a .x after it, which I assume is the factor level.) Here is where I'm stuck. I figured the best way to get what I want is to make a vector with entries TRUE or 1 when that row of my dataframe has at least one column with a factor level that is in my list of flagged levels. I can't figure out how to do this or how to construct a suitable list of flagged levels. What I want to write is this:
df.test.freqcount[which( df.test.freqcount$freq <2),]$x
Since the names are not unique and I'm getting NA values I don't expect, instead of checking by $x I want to check by the column.factorlevel that's a hidden column on the left of the table I gave, if possible.
What I would like as output is:
low_count_factor
1013 TRUE
1014 TRUE
1015 TRUE
1016 FALSE
1017 FALSE
1018 TRUE
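One possible way to get such a flag without building a separate list of flagged levels is to look up, column by column, how often each row's own value occurs. The sketch below is an assumption about what is wanted, not a tested answer from the thread: it treats every column (including the integer one) as categorical, uses a threshold of 2 as in the attempt above, and rebuilds only four of the sample columns with data.frame() for brevity.

```r
df.test <- data.frame(
  GKUhYLAE = c("DDOFi", "fVvMw", "DDOFi", "DDOFi", "fVvMw", "DDOFi"),
  BMmgMRvd = c("QZujc", "IjEdt", "QZujc", "QZujc", "QZujc", "QZujc"),
  OMtioXZZ = c(3L, 21L, 30L, 21L, 12L, 12L),
  bIBQTaHw = c("ALZyK", "wQABW", "ALZyK", "wQABW", "wQABW", "qqkkL"),
  row.names = 1013:1018,
  stringsAsFactors = FALSE
)

threshold <- 2  # values occurring fewer than this many times are "low count"

# For each column, count occurrences of every value, then map the counts
# back onto the rows and compare against the threshold
low <- sapply(df.test, function(col) {
  counts <- table(col)
  as.vector(counts[as.character(col)]) < threshold
})

# A row is flagged if any of its columns holds a low-count value
low_count_factor <- rowSums(low) > 0
data.frame(low_count_factor, row.names = rownames(df.test))
```

On this sample the flagged rows are 1013, 1014, 1015 and 1018, which agrees with the desired output listed in the question.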

Linear combinations of rows with matching row attributes in data.table

I would like to subtract corresponding rows by month in a data.table.
Here is the example table:
monthly_date sector_order Retail Sales Trend Sales
1: 2014-12-01 1 42123.87 42279.64
2: 2015-11-01 1 44181.69 43620.22
3: 2015-12-01 1 43207.97 43605.21
4: 2014-12-01 30 14972.60 15025.74
5: 2015-11-01 30 15969.98 15685.36
6: 2015-12-01 30 15478.42 15675.09
Is there an elegant way to get a 3-row table in which the rows with sector_order==30 are subtracted from the rows with sector_order==1?
I can obviously brute-force it with two data frames. Is there a more general data.table way?
Here is an option
library(data.table)
data[, .(RetailSales = RetailSales[1L] - RetailSales[.N],
TrendSales = TrendSales[1L] - TrendSales[.N]), by = monthly_date]
# monthly_date RetailSales TrendSales
#1: 2014-12-01 27151.27 27253.90
#2: 2015-11-01 28211.71 27934.86
#3: 2015-12-01 27729.55 27930.12
Or, as @MichaelChirico suggested, a more elegant solution:
data[order(-sector_order),.(RetailSales = diff(RetailSales),
TrendSales = diff(TrendSales)), by = monthly_date]
Or, as @Frank suggested:
data[order(-sector_order),
.SD[2]-.SD[1]
# lapply(.SD, diff) # also works here
, by=monthly_date, .SDcols=c("RetailSales","TrendSales")]
data
data = setDT(structure(list(monthly_date = structure(c(1L, 2L, 3L, 1L, 2L,
3L), .Label = c("2014-12-01", "2015-11-01", "2015-12-01"), class = "factor"),
sector_order = c(1L, 1L, 1L, 30L, 30L, 30L), RetailSales = c(42123.87,
44181.69, 43207.97, 14972.6, 15969.98, 15478.42), TrendSales = c(42279.64,
43620.22, 43605.21, 15025.74, 15685.36, 15675.09), grp = c(1L,
2L, 3L, 1L, 2L, 3L)), .Names = c("monthly_date", "sector_order",
"RetailSales", "TrendSales", "grp"), class = "data.frame", row.names = c(NA,
-6L)))
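For comparison, the "brute force with two data frames" route mentioned in the question can be sketched in base R with merge(). This is only an illustration of that baseline, not one of the answers above; the names sales, wide and result, and the suffixes ".1" and ".30", are chosen here.

```r
sales <- data.frame(
  monthly_date = rep(c("2014-12-01", "2015-11-01", "2015-12-01"), times = 2),
  sector_order = rep(c(1L, 30L), each = 3),
  RetailSales = c(42123.87, 44181.69, 43207.97, 14972.60, 15969.98, 15478.42),
  TrendSales = c(42279.64, 43620.22, 43605.21, 15025.74, 15685.36, 15675.09),
  stringsAsFactors = FALSE
)

# Put the two sectors side by side, matched on month
wide <- merge(
  sales[sales$sector_order == 1, ],
  sales[sales$sector_order == 30, ],
  by = "monthly_date", suffixes = c(".1", ".30")
)

# Sector 1 minus sector 30, month by month
result <- data.frame(
  monthly_date = wide$monthly_date,
  RetailSales = wide$RetailSales.1 - wide$RetailSales.30,
  TrendSales = wide$TrendSales.1 - wide$TrendSales.30
)
result
```

Unlike the diff()-based versions, this matches rows explicitly on monthly_date, so a month that is missing for one sector is dropped instead of being differenced against the wrong row.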

Replacing loop in dplyr R

So I am trying to write a function with dplyr without a loop, and here is something I do not know how to do.
Say we have tv stations (x,y,z) and months (2,3). If I group by this say we get
this output also with summarised numeric value
TV months value
x 2 52
y 2 87
z 2 65
x 3 180
y 3 36
z 3 99
This is for the evaluated brand.
Then I will have many brands, which I need to filter to keep only those with a value >= 0.8 * the value of the evaluated brand and <= 1.2 * the value of the evaluated brand.
So, for example, from the table below I would only want to keep the first two rows, and this should be done for all month and TV combinations.
brand TV MONTH value
sdg x 2 60
sdfg x 2 55
shs x 2 120
sdg x 2 11
sdga x 2 5000
As @akrun said, you need to use a combination of merging and subsetting. Here's a base R solution.
m <- merge(df, data, by.x=c("TV", "MONTH"), by.y=c("TV", "months"))
m[m$value.x >= m$value.y*0.8 & m$value.x <= m$value.y*1.2,][,-5]
# TV MONTH brand value.x
#1 x 2 sdg 60
#2 x 2 sdfg 55
Data
data <- structure(list(TV = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("x",
"y", "z"), class = "factor"), months = c(2L, 2L, 2L, 3L, 3L,
3L), value = c(52L, 87L, 65L, 180L, 36L, 99L)), .Names = c("TV",
"months", "value"), class = "data.frame", row.names = c(NA, -6L
))
df <- structure(list(brand = structure(c(2L, 1L, 4L, 2L, 3L), .Label = c("sdfg",
"sdg", "sdga", "shs"), class = "factor"), TV = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "x", class = "factor"), MONTH = c(2L,
2L, 2L, 2L, 2L), value = c(60L, 55L, 120L, 11L, 5000L)), .Names = c("brand",
"TV", "MONTH", "value"), class = "data.frame", row.names = c(NA,
-5L))
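Since the question asks about dplyr, the same merge-then-subset idea can be sketched with inner_join() and filter(). This is a rough equivalent of the base R answer, assuming dplyr is installed; the suffixes ".brand" and ".ref" are names chosen here, and the data is rebuilt with character columns so the join keys have matching types.

```r
library(dplyr)

# Reference values for the evaluated brand, one row per TV/month
data <- data.frame(
  TV = c("x", "y", "z", "x", "y", "z"),
  months = c(2L, 2L, 2L, 3L, 3L, 3L),
  value = c(52L, 87L, 65L, 180L, 36L, 99L),
  stringsAsFactors = FALSE
)

# Candidate brands to be filtered
df <- data.frame(
  brand = c("sdg", "sdfg", "shs", "sdg", "sdga"),
  TV = "x",
  MONTH = 2L,
  value = c(60L, 55L, 120L, 11L, 5000L),
  stringsAsFactors = FALSE
)

res <- df %>%
  # Attach the reference value for the same TV station and month
  inner_join(data, by = c("TV", "MONTH" = "months"),
             suffix = c(".brand", ".ref")) %>%
  # Keep brands within +/- 20% of the reference value
  filter(value.brand >= 0.8 * value.ref,
         value.brand <= 1.2 * value.ref) %>%
  select(brand, TV, MONTH, value = value.brand)
res
```

Because the join covers every TV/month pair at once, no loop over combinations is needed.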

Subsetting in a for loop

My dataset has 34,000 rows and 353 columns. One column is location and it has 11,000 unique values. I want to subset the dataset within a for loop. I can do this by creating a new data frame for each subset, but I want the subsets to form a single data frame. I have included a sample dataset below
structure(list(X = structure(c(1L, 1L, 1L, 1L, 3L, 3L, 3L, 2L,
3L), .Label = c("Car", "DOG", "House"), class = "factor"), Y = c(20L,
20L, 20L, 20L, 410L, 410L, 410L, 410L, 60L), Z = structure(c(1L,
3L, 8L, 1L, 7L, 5L, 2L, 4L, 6L), .Label = c("ARGENTINA", "BERLIN GERMANY",
"BUENOS AIRES ARGENTINA", "DUBLIN IRELAND", "FROM AUSTRIA", "GERMANY",
"IN TRANSIT FROM GERMANY", "RIVER PLATE ARGENTINA"), class = "factor"),
K = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "A", class = "factor")),
.Names = c("X", "Y", "Z", "K"), class = "data.frame", row.names = c(NA, -9L))
I can use the following code to create new data frames
l=c("ARGENTINA","IRELAND")
for(i in l){
assign(paste("newdata",i,sep=""),
subset(TESTL[which(grepl(i,TESTL$Z)&
!grepl("IN TRANSIT",TESTL$Z)&!grepl("FROM",TESTL$Z)),],
select=c("X","Y","Z")))}
However I want to create a single new dataframe to hold all the subsets. I have tried the following code
d<-data.frame()
for(i in l){d<-rbind(d,c(
subset(TESTL[which(grepl(i,TESTL$Z) & !grepl("IN TRANSIT",TESTL$Z)
& !grepl("FROM",TESTL$Z)),],
select=c("X","Y","Z")))}
I get the following errors
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "DOG") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = "DUBLIN IRELAND") :
invalid factor level, NA generated
I have attempted to convert the factors to characters with no success. Any help appreciated
I think you are making your life rather difficult by using assign here and trying to store the subsets in separate data frames. Try something more like this:
l <- c("ARGENTINA","IRELAND")
res <- setNames(vector("list",length(l)),l)
for (i in seq_along(l)){
res[[i]] <- dat[grepl(l[i],dat$Z) & !grepl("IN TRANSIT",dat$Z) & !grepl("FROM",dat$Z),c("X","Y","Z")]
}
> res
$ARGENTINA
X Y Z
1 Car 20 ARGENTINA
2 Car 20 BUENOS AIRES ARGENTINA
3 Car 20 RIVER PLATE ARGENTINA
4 Car 20 ARGENTINA
$IRELAND
X Y Z
8 DOG 410 DUBLIN IRELAND
> do.call("rbind",res)
X Y Z
ARGENTINA.1 Car 20 ARGENTINA
ARGENTINA.2 Car 20 BUENOS AIRES ARGENTINA
ARGENTINA.3 Car 20 RIVER PLATE ARGENTINA
ARGENTINA.4 Car 20 ARGENTINA
IRELAND DOG 410 DUBLIN IRELAND
The warnings occur because the first iteration of the loop (ARGENTINA) creates the factor variables X and Z, and the second iteration (IRELAND) then tries to add values that are not among their existing factor levels. So:
First, convert the factor variables in TESTL to character:
for (i in names(TESTL) [grep ("factor", sapply (TESTL, class))]) {
TESTL[[i]] <- as.character (TESTL[[i]])
}
Then it will work with the next code:
d <- data.frame(stringsAsFactors=F)
for(i in l){d <- rbind(d,
TESTL [grepl(i,TESTL$Z) & !grepl("FROM|IN TRANSIT", TESTL$Z), c("X", "Y", "Z")])}
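If the per-location grouping is not needed, the loop can probably be dropped altogether by combining the search terms into one regular expression. The sketch below assumes the entries of l are plain strings without regex metacharacters and rebuilds the sample data with character columns. Note that rows come back in their original order and a row matching several terms appears only once, which differs slightly from the loop-and-rbind results.

```r
TESTL <- data.frame(
  X = c("Car", "Car", "Car", "Car", "House", "House", "House", "DOG", "House"),
  Y = c(20L, 20L, 20L, 20L, 410L, 410L, 410L, 410L, 60L),
  Z = c("ARGENTINA", "BUENOS AIRES ARGENTINA", "RIVER PLATE ARGENTINA",
        "ARGENTINA", "IN TRANSIT FROM GERMANY", "FROM AUSTRIA",
        "BERLIN GERMANY", "DUBLIN IRELAND", "GERMANY"),
  K = "A",
  stringsAsFactors = FALSE
)

l <- c("ARGENTINA", "IRELAND")

# One pattern matching any of the locations: "ARGENTINA|IRELAND"
wanted <- grepl(paste(l, collapse = "|"), TESTL$Z)
# Exclude rows that are in transit or only mention an origin
excluded <- grepl("FROM|IN TRANSIT", TESTL$Z)

d <- TESTL[wanted & !excluded, c("X", "Y", "Z")]
d
```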

'Complex' aggregation function in dcast from reshape2

I have a dataframe in long form for which I need to aggregate several observations taken on a particular day.
Example data:
long <- structure(list(Day = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"),
Genotype = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), View = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor"), variable = c(1496L, 1704L,
1738L, 1553L, 1834L, 1421L, 1208L, 1845L, 1325L, 1264L, 1920L,
1735L)), .Names = c("Day", "Genotype", "View", "variable"), row.names = c(NA, -12L),
class = "data.frame")
> long
Day Genotype View variable
1 1 A 1 1496
2 1 A 2 1704
3 1 A 3 1738
4 1 B 1 1553
5 1 B 2 1834
6 1 B 3 1421
7 2 A 1 1208
8 2 A 2 1845
9 2 A 3 1325
10 2 B 1 1264
11 2 B 2 1920
12 2 B 3 1735
I need to aggregate each genotype for each day by taking the cube root of the product of each view. So for genotype A on day 1, (1496 * 1704 * 1738)^(1/3). Final dataframe would look like:
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Have been going round and round with reshape2 for the last couple of days, but not getting anywhere. Help appreciated!
I'd probably use plyr and ddply for this task:
library(plyr)
ddply(long, .(Day, Genotype), summarize,
summary = prod(variable) ^ (1/3))
#-----
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Or this with dcast:
dcast(data = long, Day + Genotype ~ .,
value.var = "variable", function(x) prod(x) ^ (1/3))
#-----
Day Genotype NA
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Another solution without additional packages:
aggregate(list(Summary=long$variable),by=list(Day=long$Day,Genotype=long$Genotype),function(x) prod(x)^(1/length(x)))
Day Genotype Summary
1 1 A 1642.418
2 2 A 1434.695
3 1 B 1593.633
4 2 B 1614.790
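The cube root of a product of three values is their geometric mean, so the same summary can be computed on the log scale, which avoids very large intermediate products when there are many views per group. A minimal sketch, assuming all values are strictly positive; geo_mean is a helper name introduced here.

```r
long <- data.frame(
  Day = factor(rep(1:2, each = 6)),
  Genotype = rep(rep(c("A", "B"), each = 3), times = 2),
  View = factor(rep(1:3, times = 4)),
  variable = c(1496L, 1704L, 1738L, 1553L, 1834L, 1421L,
               1208L, 1845L, 1325L, 1264L, 1920L, 1735L)
)

# Geometric mean via logs: exp(mean(log(x))) equals prod(x)^(1/length(x))
# for strictly positive x
geo_mean <- function(x) exp(mean(log(x)))

res <- aggregate(variable ~ Day + Genotype, data = long, FUN = geo_mean)
res
```

Like the aggregate() answer above, this adapts to the number of views per group instead of hard-coding 1/3.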
