Integrating Data - r

I have a large data frame as follows which is a subset of a larger data frame.
tree=data.frame(INVYR=tree$INVYR,
DIA=tree$DIA,PLOT=tree$PLOT,SPCD=tree$SPCD,
D.2=tree$D.2, BA.T=tree$BA.T)
What I am attempting to do is calculate the total BA.T per Plot per Year (plots are remeasured in subsequent years). I do this by ...
x<-aggregate(tree$BA.T,list(tree$INVYR,tree$PLOT),FUN=sum)
x$PLOT<-x$Group.2
x<- x[with(x, order(Group.1,Group.2)), ]
This gives me the data frame...
x=data.frame(Group.1,Group.2,x,PLOT)
Where Group.1 is the INVYR, Group.2 is the PLOT, and x is total BA.T per plot per year. So far this works great. Here is where my problem begins. I then want to integrate this back into my original tree data.frame. If I merge the data by plot it doesn't account for year and quadrupoles the data set because of the four remeasurements. I can't run an if statement because the data set is not equal lengths. The data.frame I wish to accompolish is
tree=data.frame(INVYR, DIA, PLOT, SPCD, D.2, BA.T, x)
where x is the total BA.T for the given INVYR and PLOT of that record.
Any thoughts would be greatly appreciated. Thanks.
Edit
INVYR=rbind(1982,1982,1982,1982,1982,1995,1995,1995,1995,1995,2000,2000,2000,2000,2000)
PLOT=rbind(1,1,2,2,3,1,1,2,2,3,1,1,2,2,3)
BA.T=rbind(.1,.2,.3,.4,.2,.3,.5,.8,.3,.6,.7,.2,.1,1,1.02)
tree=data.frame(INVYR,PLOT,BA.T)
head(tree)
x<-aggregate(tree$BA.T,list(tree$INVYR,tree$PLOT),FUN=sum)
x$PLOT<-x$Group.2
x$INVYR<-x$Group.1
x<- x[with(x, order(Group.1,Group.2)), ]
head(x)

On solution is to use package reshape2.
library(reshape2)
melt(data=tree,id.vars=c('INVYR','PLOT')) ## Notice the choice of the id!the keys!
dcast(tree.m,formula=...~variable,fun.aggregate=sum)
INVYR PLOT BA.T
1 1982 1 0.30
2 1982 2 0.70
3 1982 3 0.20
4 1995 1 0.80
5 1995 2 1.10
6 1995 3 0.60
7 2000 1 0.90
8 2000 2 1.10
9 2000 3 1.02

Related

create list and generate descriptives for each variable

I want to generate descriptive statistics for multiple variables at a time (close to 50), rather than writing out the code several times.
Here is a very basic example of data:
id var1 var2
1 1 3
2 2 3
3 1 4
4 2 4
I typically write out each line of code to get a frequency count and descriptives, like so:
library(psych)
table(df$var1)
table(df1$var2)
describe(df1$var1)
describe(df1$var2)
I would like to create a list and get the output from these analyses, rather than writing out 100 lines of code. I tried this, but it is not working:
variable_list<-list(df1$var, df2$var)
for (variable in variable_list){
table(df$variable_list))
describe(df$variable_list))}
Does anyone have advice on getting this to work?
The describe from psych can take a data.frame and returns the descriptive statistics for each column
library(psych)
describe(df1)
# vars n mean sd median trimmed mad min max range skew kurtosis se
#id 1 4 2.5 1.29 2.5 2.5 1.48 1 4 3 0 -2.08 0.65
#var1 2 4 1.5 0.58 1.5 1.5 0.74 1 2 1 0 -2.44 0.29
#var2 3 4 3.5 0.58 3.5 3.5 0.74 3 4 1 0 -2.44 0.29
If it is subset of columns, specify either column index or column name to select and subset the dataset
describe(df1[2:3])
Another option is descr from collapse
library(collapse)
descr(slt(df1, 2:3))
Or to select numeric columns
descr(num_vars(df1))
Or for factors
descr(fact_vars(df1))

Stacking two data frame columns into a single separate data frame column in R

I will present my question in two ways. First, requesting a solution for a task; and second, as a description of my overall objective (in case I am overthinking this and there is an easier solution).
1) Task Solution
Data context: each row contains four price variables (columns) representing (a) the price at which the respondent feels the product is too cheap; (b) the price that is perceived as a bargain; (c) the price that is perceived as expensive; (d) the price that is too expensive to purchase.
## mock data set
a<-c(1,5,3,4,5)
b<-c(6,6,5,6,8)
c<-c(7,8,8,10,9)
d<-c(8,10,9,11,12)
df<-as.data.frame(cbind(a,b,c,d))
## result
# a b c d
#1 1 6 7 8
#2 5 6 8 10
#3 3 5 8 9
#4 4 6 10 11
#5 5 8 9 12
Task Objective: The goal is to create a single column in a new data frame that lists all of the unique values contained in a, b, c, and d.
price
#1 1
#2 3
#3 4
#4 5
#5 6
...
#12 12
My initial thought was to use rbind() and unique()...
price<-rbind(df$a,df$b,df$c,df$d)
price<-unique(price)
...expecting that a, b, c and d would stack vertically.
[Pseudo illustration]
a[1]
a[2]
a[...]
a[n]
b[1]
b[2]
b[...]
b[n]
etc.
Instead, the "columns" are treated as rows and stacked horizontally.
V1 V2 V3 V4 V5
1 1 5 3 4 5
2 6 6 5 6 8
3 7 8 8 10 9
4 8 10 9 11 12
How may I stack a, b, c and d such that price consists of only one column ("V1") that contains all twenty responses? (The unique part I can handle separately afterwards).
2) Overall Objective: The Bigger Picture
Ultimately, I want to create a cumulative share of population for each price (too cheap, bargain, expensive, too expensive) at each price point (defined by the unique values described above). For example, what percentage of respondents felt $1 was too cheap, what percentage felt $3 or less was too cheap, etc.
The cumulative shares for bargain and expensive are later inverted to become not.bargain and not.expensive and the four vectors reside in a data frame like this:
buckets too.cheap not.bargain not.expensive too.expensive
1 0.01 to 0.50 0.000000000 1 1 0
2 0.51 to 1.00 0.000000000 1 1 0
3 1.01 to 1.50 0.000000000 1 1 0
4 1.51 to 2.00 0.000000000 1 1 0
5 2.01 to 2.50 0.001041667 1 1 0
6 2.51 to 3.00 0.001041667 1 1 0
...
from which I may plot something that looks like this:
Above, I accomplished my plotting objective using defined price buckets ($0.50 ranges) and the hist() function.
However, the intersections of these lines have meanings and I want to calculate the exact price at which any of the lines cross. This is difficult when the x-axis is defined by price range buckets instead of a specific value; hence the desire to switch to exact values and the need to generate the unique price variable.
[Postscript: This analysis is based on Peter Van Westendorp's Price Sensitivity Meter (https://en.wikipedia.org/wiki/Van_Westendorp%27s_Price_Sensitivity_Meter) which has known practical limitations but is relevant in the context of my research which will explore consumer perceptions of value under different treatments rather than defining an actual real-world price. I mention this for two reasons 1) to provide greater insight into my objective in case another approach comes to mind, and 2) to keep the thread focused on the mechanics rather than whether or not the Price Sensitivity Meter should be used.]
We can unlist the data.frame to a vector and get the sorted unique elements
sort(unique(unlist(df)))
When we do an rbind, it creates a matrix and unique of matrix calls the unique.matrix
methods('unique')
#[1] unique.array unique.bibentry* unique.data.frame unique.data.table* unique.default unique.IDate* unique.ITime*
#[8] unique.matrix unique.numeric_version unique.POSIXlt unique.warnings
which loops through the rows as the default MARGIN is 1 and then looks for unique elements. Instead, if we use the 'price', either as.vector or c(price) converts into vector
sort(unique(c(price)))
#[1] 1 3 4 5 6 7 8 9 10 11 12
If we use unique.default
sort(unique.default(price))
#[1] 1 3 4 5 6 7 8 9 10 11 12

Subtracting Values in Previous Rows: Ecological Lifetable Construction

I was hoping I could get some help. I am constructing a life table, not for insurance, but for ecology (a cross-sectional of the population of a any kind of wild fauna), so essentially censoring variables like smoker/non-smoker, pregnant, gender, health-status, etc.:
AgeClass=C(1,2,3,4,5,6)
SampleSize=c(100,99,87,46,32,19)
for(i in 1:6){
+ PropSurv=c(Sample/100)
+ }
> LifeTab1=data.frame(cbind(AgeClass,Sample,PropSurv))
Which gave me this:
ID AgeClas Sample PropSurv
1 1 100 1.00
2 2 99 0.99
3 3 87 0.87
4 4 46 0.46
5 5 32 0.32
6 6 19 0.19
I'm now trying to calculate those that died in each row (DeathInt) by taking the initial number of those survived and subtracting it by the number below it (i.e. 100-99, then 99-87, then 87-46, so on and so forth). And try to look like this:
ID AgeClas Sample PropSurv DeathInt
1 1 100 1.00 1
2 2 99 0.99 12
3 3 87 0.87 41
4 4 46 0.46 14
5 5 32 0.32 13
6 6 19 0.19 NA
I found this and this, and I wasn't sure if they answered my question as these guys subtracted values based on groups. I just wanted to subtract values by row.
Also, just as a side note: I did a for() to get the proportion that survived in each age group. I was wondering if there was another way to do it or if that's the proper, easiest way to do it.
Second note: If any R-users out there know of an easier way to do a life-table for ecology, do let me know!
Thanks!
If you have a vector x, that contains numbers, you can calculate the difference by using the diff function.
In your case it would be
LifeTab1$DeathInt <- c(-diff(Sample), NA)

Populate a column with forecasts of panel data using data.table in R

I have a panel data with "entity" and "year". I have a column "x" with values that i consider like time series. I want to create a new column "xp" where for each "entity" I give, for each "year", the value obtained from the forecast of the previous 5 years. If there are less than 5 previous values available, xp=NA.
For the sake of generality, the forecast is the output of a function built in R from a couple of predefinite functions found in some packages like "forecast". If it is easier with a specific function, let's use forecast(auto.arima(x.L5:x.L1),h=1).
For now, I use data.table in R because it is so much faster for all the other manipulations I make on my dataset.
However, what I want to do is not data.table 101 and I struggle with it.
I would so much appreciate a bit of your time to help me on that.
Thanks.
Here is an extract of what i would like to do:
entity year x xp
1 1980 21 NA
1 1981 23 NA
1 1982 32 NA
1 1983 36 NA
1 1984 38 NA
1 1985 45 42.3 =f((21,23,32,36,38))
1 1986 50 48.6 =f((23,32,36,38,45))
2 1991 2 NA
2 1992 4 NA
2 1993 6 NA
2 1994 8 NA
2 1995 10 NA
2 1996 12 12.4 =f((2,4,6,8,10))
2 1997 14 13.9 =f((4,6,8,10,12))
...
As suggested by Eddi, I found a way using rollapply:
DT <- data.table(mydata)
DT <- DT[order(entity,year)]
DT[,xp:=rollapply(.SD$x,5,timeseries,align="right",fill=NA,by="entity"]
with:
timeseries <- function(x){
fit <- auto.arima(x)
value <- as.data.frame(forecast(fit,h=1))[1,1]
return(value)
}
For a sample of mydata, it works perfectly. However, when I use the whole dataset (150k lines), after some computing time, i have the following error message:
Error in seq.default(start.at,NROW(data),by = by) : wrong sign in 'by' argument
Where does it come from?
Can it come from the "5" parameter in rollapply and from some specifities of certain entities in the dataset (not enough data...)?
Thanks again for your time and help.

Adding abline to augPred plot

apologies for what is likely to be a very basic question, I am very new to R.
I am looking to read off my augPred plot in order to average out the values to provide a prediction between a time period.
> head(tthm.groupeddata)
Grouped Data: TTHM ~ Yearmon | WSZ_Code
WSZ_Code Treatment_Code Year Month TTHM CL2_FREE BrO3 Colour PH TURB Yearmon
1 2 3 1996 1 30.7 0.35 0.00030 0.75 7.4 0.055 Jan 1996
2 6 1 1996 2 24.8 0.25 0.00055 0.75 6.9 0.200 Feb 1996
3 7 4 1996 2 60.4 0.05 0.00055 0.75 7.1 0.055 Feb 1996
4 7 4 1996 2 58.1 0.15 NA 0.75 7.5 0.055 Feb 1996
5 7 4 1996 3 62.2 0.20 NA 2.00 7.6 0.055 Mar 1996
6 5 2 1996 3 40.3 0.15 0.00140 2.00 7.7 0.055 Mar 1996
This is my model:
modellme<- lme(TTHM ~ Yearmon, random = ~ 1|WSZ_Code, data=tthm.groupeddata)
and my current plot:
plot(augPred(modellme, order.groups=T),xlab="Date", ylab="TTHM concentration", main="TTHM Concentration with Time for all Water Supply Zones")
I would like a way to read off the graph by either placing lines between a specific time period in a specific WSZ_Code (my group) and averaging the values between this period...
Of course any other way/help or guidance would be much appreciated!
Thanks in advance
I don't think we can tell whether it is "entirely incorrect", since you have not described the question and have not included any data. (The plotting question is close to being entirely incorrect, though.) I can tell you that the answer is NOT to use abline, since augPred objects are plotted with plot.augPred which returns (and plots) a lattice object. abline is a base graphic function and does not share a coordinate system with the lattice device. Lattice objects are lists that can be modified. Your plot probably had different panels at different levels of WSZ_Code, but the location of the desired lines is entirely unclear especially since you trail off with an ellipsis. You refer to "times" but there is no "times" variable.
There are lattice functions such as trellis.focus and update.trellis that allow one to apply modifications to lattice objects. You would first assign the plot object to a named variable, make mods to it and then plot() it again.
help(package='lattice')
?Lattice
(If this is a rush job, you might be better off making any calculations by hand and using ImageMagick to edit pdf or png output.)

Resources