I am having a terrible time running 'ddply' over two variables in what seems like it should be a simple command.
Sample data (df):
Brand Day Rev RVP
A 1 2535.00 195.00
B 1 1785.45 43.55
C 1 1730.87 32.66
A 2 920.00 230.00
B 2 248.22 48.99
C 3 16466.00 189.00
A 1 2535.00 195.00
B 3 1785.45 43.55
C 3 1730.87 32.66
A 4 920.00 230.00
B 5 248.22 48.99
C 4 16466.00 189.00
I am using the command:
df2<-ddply(df, .(Brand, Day), summarize, Rev=mean(Rev), RVP=sum(RVP))
My dataframe has about 2600 observations, and there are 45 levels of "Brand" and up to 300 levels of "Day" (which is coded using 'difftime').
I am able to easily use 'ddply' when simply grouping by "Day," but when I also try to group by "Brand," my computer freezes up.
Thoughts?
You should read through the help pages for aggregate, by, ave, and tapply, paying close attention to the types and names of the arguments each one expects, and then run their examples. The main thing Hadley Wickham did with plyr and reshape/reshape2 was to impose some degree of regularity, but it came at the expense of speed. I understand why he did it, especially whenever I try to use the base reshape function, or whenever I forget, as I repeatedly do, which of these functions requires a list, which requires the FUN= argument label, and which needs interaction() for the grouping variable, since they are all somewhat different.
> aggregate(df[3:4], df[1:2], FUN = mean)
Brand Day Rev RVP
1 A 1 2535.000 195.00
2 B 1 1785.450 43.55
3 C 1 1730.870 32.66
4 A 2 920.000 230.00
5 B 2 248.220 48.99
6 B 3 1785.450 43.55
7 C 3 9098.435 110.83
8 A 4 920.000 230.00
9 C 4 16466.000 189.00
10 B 5 248.220 48.99
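Note that the original ddply call took the mean of Rev but the sum of RVP, while a single aggregate call applies one function to every measure column. As a sketch (column names taken from the sample data above), you can run two aggregate calls and merge the results:
rev.mean <- aggregate(Rev ~ Brand + Day, data = df, FUN = mean)
rvp.sum <- aggregate(RVP ~ Brand + Day, data = df, FUN = sum)
merge(rev.mean, rvp.sum, by = c("Brand", "Day"))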
I am new to R and I would love it if you could help me with this, because I am having serious difficulties.
I have unbalanced panel data that shows companies' monthly performance relative to the rest of the market in dollar terms (e.g. this month company 1 made $1000 more than the market average). Each of these companies decided on a strategy (1 through 8) when it entered the market. These strategies are nested in two different groups (a, b), so that strategies 1, 2, and 3 are part of group a, while strategies 4 through 8 are part of group b. I need a ranking of the strategies from best to worst.
I have discretized my DV so that it now only shows whether a company performed higher or lower than the market in a given month. However, I am not sure this is the right approach, because I then lose how much better or worse each company performed relative to the market each month.
My data looks like this:
ID Main Strategy YearMonth DiffPerformance Control1 Control2 DiffPerformanceHL
1 a 2 201706 9.037 2 57 H
1 a 2 201707 4.371 2 57 H
1 a 2 201708 1.633 2 57 H
1 a 2 201709 -3.521 2 59 L
1 a 2 201710 13.096 2 59 H
1 a 2 201711 5.070 2 60 H
1 a 2 201712 4.25 2 60 H
2 b 5 201904 6.78 4 171 H
2 b 5 201905 -15.26 4 169 L
2 b 5 201906 7.985 4 169 H
Here ID is the company, Main is the group (a or b), Strategy is 1 through 8 and nested as previously stated, YearMonth is the specific month, DiffPerformance is the DV as a continuous variable, Control1 is a categorical variable (1 through 6) that is static over time, Control2 is a count variable that changes over time, and DiffPerformanceHL is the discretized DV.
Can you please help me figure out how to create a nested logit model in R? I would be super appreciative.
Thanks
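A minimal sketch, assuming "nested" here means strategies nested within groups (a multilevel logistic regression via lme4) rather than a discrete-choice nested logit, and using the column names from the sample above:
library(lme4)
# model P(H) on the discretized DV; Strategy is nested in Main, and repeated
# observations per company are captured by a random intercept for ID
df$DiffPerformanceHL <- factor(df$DiffPerformanceHL, levels = c("L", "H"))
df$Control1 <- factor(df$Control1)
m <- glmer(DiffPerformanceHL ~ Control1 + Control2 + (1 | Main/Strategy) + (1 | ID),
           data = df, family = binomial)
summary(m)
ranef(m)  # strategy-level intercepts give one way to rank the strategies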
I have a dataset containing several variables, and I wish to run a statistical test (Kruskal–Wallis) for each variable separately.
My data (df) look like this: carbon and nitrogen content for different agricultural managements (see the name column).
I have 16 groups (to simplify, I'll say I have 8 groups).
Extract of the data:
name N_cont C_cont agriculture
C_ero 1,064 8,380 1
C_ero 0,961 8,086 1
C_ero 0,977 8,331 1
Ds_ero 1,767 17,443 2
Ds_ero 1,802 18,264 2
Ds_ero 2,083 20,112 2
Ms_ero 1,547 14,380 3
Ms_ero 1,566 15,313 3
Ms_ero 1,505 14,760 3
Md_ero 1,512 14,303 4
Md_ero 1,656 15,331 4
Md_ero 1,500 13,788 4
C_upsl 1,121 10,581 5
C_upsl 1,159 10,460 5
C_upsl 1,223 10,171 5
Ds_upsl 1,962 20,656 6
Ds_upsl 1,784 16,780 6
Ds_upsl 1,720 17,482 6
Ms_upsl 1,578 16,228 7
Ms_upsl 1,634 15,331 7
Ms_upsl 1,394 13,419 7
Md_upsl 1,286 11,824 8
Md_upsl 1,241 11,452 8
Md_upsl 1,317 11,932 8
I have already converted agriculture to a factor:
df$agriculture<-factor(df$agriculture)
I can run statistical tests comparing all 16 groups, e.g.:
kruskal.test(df$C_cont, df$agriculture)
But now I would like to run tests on just specific subsets of the 8 groups, e.g. those whose name contains C (conventional) or Ds (direct seeding), or those containing ero (eroding site) or upsl (upper slope).
I did try grep and split, but it did not work, because the dimensions of x and y must be the same.
Do you have any ideas?
You can subset with grepl. Assuming you want the rows whose name contains DS, upsl, or C (note that grepl is case-sensitive, so "DS" will not match the "Ds" in your data; add ignore.case = TRUE if you want it to), then
df[grepl("(DS)|(upsl)|(C)", df$name), ]
# name N_cont C_cont agriculture
#1 C_ero 1,064 8,380 1
#2 C_ero 0,961 8,086 1
#3 C_ero 0,977 8,331 1
#13 C_upsl 1,121 10,581 5
#14 C_upsl 1,159 10,460 5
#15 C_upsl 1,223 10,171 5
#16 Ds_upsl 1,962 20,656 6
#17 Ds_upsl 1,784 16,780 6
#18 Ds_upsl 1,720 17,482 6
#19 Ms_upsl 1,578 16,228 7
#20 Ms_upsl 1,634 15,331 7
#21 Ms_upsl 1,394 13,419 7
#22 Md_upsl 1,286 11,824 8
#23 Md_upsl 1,241 11,452 8
#24 Md_upsl 1,317 11,932 8
If you do not want to hard-code the name values, you can also try
x <- c("C", "DS", "upsl")
df[grepl(paste0(x, collapse = "|"), df$name), ]
which would also yield the same result.
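For the test itself, a sketch applying kruskal.test to such a subset (e.g. the conventional sites, i.e. names containing C; droplevels removes the now-empty factor levels):
sub <- df[grepl("C", df$name), ]
sub$agriculture <- droplevels(sub$agriculture)
kruskal.test(sub$C_cont, sub$agriculture)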
Load the data.table package.
library(data.table)
Create a subset of the group you want to run your stats on. If your data frame is df, then:
DT <- data.table(df)
DT[like(name, "C_")]
... or use the sqldf package. Note that in SQL LIKE patterns _ is a single-character wildcard (and % matches any sequence), so a literal underscore has to be escaped:
library(sqldf)
sqldf("select * from df where name like 'C!_%' escape '!'")
Please excuse the very novice question, but I'm trying to create a new column in a data frame that contains percentages based on other columns. For example, the data I'm working with is similar to the following, where the That column is a binary factor (i.e. presence or absence of "that"), the Verb column is the individual verb (i.e. verbs that may or may not be followed by "that"), and the Freq column indicates the frequency of each individual verb.
That Verb Freq
1 That believe 3
2 NoThat think 4
3 That say 3
4 That believe 3
5 That think 4
6 NoThat say 3
7 NoThat believe 3
8 NoThat think 4
9 That say 3
10 NoThat think 4
What I want is to add another column that provides the overall rate of "that" expression (coded as "That") for each of the different verbs. Something like the following:
That Verb Freq Perc.That
1 That believe 3 33.3
2 NoThat think 4 25.0
3 That say 3 33.3
4 That believe 3 33.3
5 That think 4 25.0
6 NoThat say 3 33.3
7 NoThat believe 3 33.3
8 NoThat think 4 25.0
9 That say 3 33.3
10 NoThat think 4 25.0
It may be that I've missed a similar question elsewhere. If so, my apologies. Nevertheless, thanks in advance for any help.
You want to use the ddply function from the plyr package:
#install.packages('plyr')
library(plyr)
dat # your data frame
ddply(dat, .(verb), transform, perc.that = freq/sum(freq))
# that verb freq perc.that
#1 That believe 3 0.3333333
#2 That believe 3 0.3333333
#3 NoThat believe 3 0.3333333
#4 That say 3 0.3333333
#...
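If you want percentages rather than proportions, multiply by 100. A base-R alternative, sketched with ave and the capitalized column names from the question:
dat$Perc.That <- with(dat, ave(Freq, Verb, FUN = function(f) 100 * f / sum(f)))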
Currently, my dataframe is in wide format, and I want to do a factorial repeated-measures analysis with two between-subject factors (sex & org) and a within-subject factor (tasktype). Below I've illustrated how my data look with a sample (the actual dataset has a lot more variables). The variables starting with '1_' and '2_' belong to measurements during task 1 and task 2 respectively. This means that 1_FD_H_org and 2_FD_H_org are the same measurement, but for tasks 1 and 2 respectively.
id sex org task1 task2 1_FD_H_org 1_FD_H_text 2_FD_H_org 2_FD_H_text 1_apv 2_apv
2 F T Correct 2 69.97 68.9 116.12 296.02 10 27
6 M T Correct 2 53.08 107.91 73.73 333.15 16 21
7 M T Correct 2 13.82 30.9 31.8 78.07 4 9
8 M T Correct 2 42.96 50.01 88.81 302.07 4 24
9 F H Correct 3 60.35 102.9 39.81 96.6 15 10
10 F T Incorrect 3 78.61 80.42 55.16 117.57 20 17
I want to analyze whether there is a difference between the two tasks on e.g. FD_H_org for the different groups/conditions (sex & org).
How do I reshape my data so I can analyze it with a model like this?
ezANOVA(data=df, dv=.(FD_H_org), wid=.(id), between=.(sex, org), within=.(task))
I think the correct format of my data should look like this:
id sex org task outcome FD_H_org FD_H_text apv
2 F T 1 Correct 69.97 68.9 10
2 F T 2 2 116.12 296.02 27
6 M T 1 Correct 53.08 107.91 16
6 M T 2 2 73.73 333.15 21
But I'm not sure. I tried to achieve this with the reshape2 package but couldn't figure out how to do it. Can anybody help?
I think you probably need to rebuild it by binding the two subsets of columns together with rbind(). The only issue here is that your outcome columns imply different data types, so I forced them both to text:
require(plyr)
dt<-read.table(file="dt.txt",header=TRUE,sep=" ") # this was to bring in your data
newtab=rbind(
ddply(dt,.(id,sex,org),summarize, task=1, outcome=as.character(task1), FD_H_org=X1_FD_H_org, FD_H_text=X1_FD_H_text, apv=X1_apv),
ddply(dt,.(id,sex,org),summarize, task=2, outcome=as.character(task2), FD_H_org=X2_FD_H_org, FD_H_text=X2_FD_H_text, apv=X2_apv)
)
newtab[order(newtab$id),]
id sex org task outcome FD_H_org FD_H_text apv
1 2 F T 1 Correct 69.97 68.90 10
7 2 F T 2 2 116.12 296.02 27
2 6 M T 1 Correct 53.08 107.91 16
8 6 M T 2 2 73.73 333.15 21
3 7 M T 1 Correct 13.82 30.90 4
9 7 M T 2 2 31.80 78.07 9
4 8 M T 1 Correct 42.96 50.01 4
10 8 M T 2 2 88.81 302.07 24
5 9 F H 1 Correct 60.35 102.90 15
11 9 F H 2 3 39.81 96.60 10
6 10 F T 1 Incorrect 78.61 80.42 20
12 10 F T 2 3 55.16 117.57 17
EDIT - obviously you don't need plyr for this (and it may slow things down) unless you're doing further transformations. This is the code with no non-standard dependencies:
newcolnames <- c("id","sex","org","task","outcome","FD_H_org","FD_H_text","apv")
# column 3 (org) is selected twice; the duplicate (auto-named org.1) is
# overwritten with the task number below, then all columns are renamed
t1 <- dt[, c(1,2,3,3,4,6,8,10)]
t1$org.1 <- 1
t1$task1 <- as.character(t1$task1)  # force the outcome to text, as above
colnames(t1) <- newcolnames
t2 <- dt[, c(1,2,3,3,5,7,9,11)]
t2$org.1 <- 2
t2$task2 <- as.character(t2$task2)
colnames(t2) <- newcolnames
newt <- rbind(t1, t2)
newt[order(newt$id), ]
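The question mentions reshape2; for completeness, here is a sketch of the same reshape with base R's stats::reshape (column names assumed as read.table creates them, with both task columns converted to character first so the stacked outcome column has a single type):
dt$task1 <- as.character(dt$task1)
dt$task2 <- as.character(dt$task2)
long <- reshape(dt, direction = "long", idvar = "id", timevar = "task", times = 1:2,
                varying = list(c("task1", "task2"),
                               c("X1_FD_H_org", "X2_FD_H_org"),
                               c("X1_FD_H_text", "X2_FD_H_text"),
                               c("X1_apv", "X2_apv")),
                v.names = c("outcome", "FD_H_org", "FD_H_text", "apv"))
long[order(long$id), ]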
I have a data.frame in panel format (country-year), and I need to calculate the mean of a variable by country for each five-year period. So I used the 'cast' function from the 'reshape' package, and it worked. Now I need to put this information (the mean by quinquennium) back into the old data.frame so I can run some regressions. How can I do that? Below I provide an example to illustrate what I want:
library(reshape)  # for melt() and cast()
set.seed(2)
fake <- data.frame(y=rnorm(20), x=rnorm(20), country=rep(letters[1:2], each=10), year=rep(1:10,2), quinquenio=rep(rep(1:2, each=5), 2))
fake.m <- melt(fake, id.vars=c("country", "year", "quinquenio"))
cast(fake.m, country ~ quinquenio, mean, subset=variable=="x", na.rm=T)
Now, everything is fine, and I get what I wanted: the mean of x, by country and quinquennium. Next, I would like to put these means back into the data.frame fake, like this:
y x country year quinquenio mean.x
1 -0.89691455 2.090819205 a 1 1 0.8880242
2 0.18484918 -1.199925820 a 2 1 0.8880242
3 1.58784533 1.589638200 a 3 1 0.8880242
4 -1.13037567 1.954651642 a 4 1 0.8880242
5 -0.08025176 0.004937777 a 5 1 0.8880242
6 0.13242028 -2.451706388 a 6 2 -0.2978375
7 0.70795473 0.477237303 a 7 2 -0.2978375
8 -0.23969802 -0.596558169 a 8 2 -0.2978375
9 1.98447394 0.792203270 a 9 2 -0.2978375
10 -0.13878701 0.289636710 a 10 2 -0.2978375
11 0.41765075 0.738938604 b 1 1 0.2146461
12 0.98175278 0.318960401 b 2 1 0.2146461
13 -0.39269536 1.076164354 b 3 1 0.2146461
14 -1.03966898 -0.284157720 b 4 1 0.2146461
15 1.78222896 -0.776675274 b 5 1 0.2146461
16 -2.31106908 -0.595660499 b 6 2 -0.8059598
17 0.87860458 -1.725979779 b 7 2 -0.8059598
18 0.03580672 -0.902584480 b 8 2 -0.8059598
19 1.01282869 -0.559061915 b 9 2 -0.8059598
20 0.43226515 -0.246512567 b 10 2 -0.8059598
I'd appreciate any tip in the right direction. Thanks in advance.
P.S.: the reason I need this is that I'll run a regression with quinquennial data, and for some variables (like per capita income) I have information for all years, so I decided to average them over each five-year period.
I'm sure there's an easy way to do this with reshape, but my brain defaults to plyr first:
require(plyr)
ddply(fake, c("country", "quinquenio"), transform, mean.x = mean(x))
This is quite hackish, but here is one way to use reshape, building off your earlier work:
zz <- cast(fake.m, country ~ quinquenio, mean, subset=variable=="x", na.rm=T)
merge(fake, melt(zz), by = c("country", "quinquenio"))
though I'm positive there has to be a better solution.
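The merged column coming out of melt(zz) is called value (assuming the reshape package's default name); a quick rename makes it match the desired mean.x:
fake2 <- merge(fake, melt(zz), by = c("country", "quinquenio"))
names(fake2)[names(fake2) == "value"] <- "mean.x"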
Here's a more old school approach using ave and with; ave returns the group means already aligned with the original row order, so no reordering is needed:
fake$mean.x <- with(fake, ave(x, country, quinquenio))