Get z-standardized scores within each group - R

Here is the data.
set.seed(23)
data <- data.frame(ID=rep(1:12), group=rep(1:3,times=4), value=(rnorm(12,mean=0.5, sd=0.3)))
ID group value
1 1 1 0.4133934
2 2 2 0.6444651
3 3 3 0.1350871
4 4 1 0.5924411
5 5 2 0.3439465
6 6 3 0.3673059
7 7 1 0.3202062
8 8 2 0.8883733
9 9 3 0.7506174
10 10 1 0.3301955
11 11 2 0.7365258
12 12 3 0.1502212
I want to get z-standardized scores within each group, so I tried:
library(weights)
data_split<-split(data, data$group) #split the dataframe
stan<-lapply(data_split, function(x) stdz(x$value)) #compute z-scores within group
However, this is not what I want: it returns a list of z-scores per group, whereas I would like to add them to the data frame as a new variable following 'value'.
How can I do that? Suggestions with sample code would be greatly appreciated.

Use this instead:
within(data, stan <- ave(value, group, FUN=stdz))
There is no need to call split or lapply.

One way using data.table package:
library(data.table)
library(weights)
set.seed(23)
data <- data.table(ID=rep(1:12), group=rep(1:3,times=4), value=(rnorm(12,mean=0.5, sd=0.3)))
setkey(data, ID)
dataNew <- data[, list(ID, stan = stdz(value)), by = 'group']
The result is:
group ID stan
1: 1 1 -0.6159312
2: 1 4 0.9538398
3: 1 7 -1.0782747
4: 1 10 0.7403661
5: 2 2 -1.2683237
6: 2 5 0.7839781
7: 2 8 0.8163844
8: 2 11 -0.3320388
9: 3 3 0.6698418
10: 3 6 0.8674548
11: 3 9 -0.2131335
12: 3 12 -1.3241632

I tried Ferdinand.Kraft's solution but it didn't work for me. The stdz function is not part of base R (it comes from the weights package loaded in the question). Moreover, within was awkward to use on a large dataset with many variables. I think the easiest way is:
data$value.s <- ave(data$value, data$group, FUN=scale)
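As a quick sanity check, here is a minimal sketch that needs no extra packages. The helper name zscore is my own (it is not from any package), and it assumes a plain, unweighted z-score: subtract the group mean and divide by the group standard deviation.

```r
set.seed(23)
data <- data.frame(ID = 1:12,
                   group = rep(1:3, times = 4),
                   value = rnorm(12, mean = 0.5, sd = 0.3))

# hypothetical helper: unweighted z-score of a numeric vector
zscore <- function(x) (x - mean(x)) / sd(x)

# ave() applies zscore within each group and returns the results in the original row order
data$value.s <- ave(data$value, data$group, FUN = zscore)

# each group should now have mean 0 and standard deviation 1
tapply(data$value.s, data$group, mean)
tapply(data$value.s, data$group, sd)
```

If the two tapply calls do not return values that are (numerically) 0 and 1 for every group, the grouping variable was probably not the one intended.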

Add the new column while in your function, and have the function return the whole data frame.
stanL<-lapply(data_split, function(x) {
x$stan <- stdz(x$value)
x
})
stan <- do.call(rbind, stanL)

Related

(Using a custom function to) Sum above N rows in a datatable (dataframe) by groups

I need a function that, for each row, sums that row and the N rows above it (N+1 rows in total), by group, in data frames (data tables).
An equivalent function for a vector would be something like the one below. (Please forgive me if it is inefficient.)
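Function1 <- function(x, N) {
  y <- vector(length = length(x))
  for (i in 1:length(x)) {
    if (i <= N) {
      y[i] <- sum(x[1:i])        # fewer than N rows above: sum from the start
    } else if (i > N) {
      y[i] <- sum(x[(i - N):i])  # current value plus the N values above it
    }
  }
  return(y)
}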
Function1(c(1,2,3,4,5,6),3)
#[1] 1 3 6 10 14 18 # Sums previous (above) 4 values (rows)
I wanted to use this function with sapply, like below:
sapply(X=DF<-data.frame(A=c(1:10), B=2), FUN=Function1(N=3))
but couldn't, because I could not figure out how to set a default for x in my function. So I built another function for data frames.
Function2 <- function(x, N) {
  if (is.data.frame(x)) {
    y <- data.frame()
    for (j in 1:ncol(x)) {
      for (i in 1:nrow(x)) {
        if (i <= N) {
          y[i, j] <- sum(x[1:i, j])
        } else if (i > N) {
          y[i, j] <- sum(x[(i - N):i, j])
        }
      }
    }
    return(y)
  }
}
DF<-data.frame(A=c(1:10), B=2)
Function2(DF, 2)
#    V1 V2
# 1   1  2
# 2   3  4
# 3   6  6
# 4   9  6
# 5  12  6
# 6  15  6
# 7  18  6
# 8  21  6
# 9  24  6
# 10 27  6
However, I still need to do this by group. For example, take the following data frame with a character column:
DF<-data.frame(Name=rep(c("A","B"),each=5), A=c(1:10), B=2)
I would like to apply my function by the group "Name", which would result in:
A 1 2
A 3 4
A 6 6
A 9 6
A 12 6
B 6 2
B 13 4
B 21 6
B 24 6
B 27 6
#Perform function2 separately for group A and B.
I was hoping to use the function with the data.table package (by=Groups), but couldn't figure out how.
What would be the best way to do this?
(Also, it would be really nice if I could learn how to make my Function1 work with sapply.)
With data.table, we group by 'Name', loop over the columns in .SD (these can be restricted with .SDcols; here all the remaining columns are of interest, so we do not specify it) and apply Function1 to each, passing 2 as N:
library(data.table)
setDT(DF)[, lapply(.SD, Function1, 2), Name]
# Name A B
# 1: A 1 2
# 2: A 3 4
# 3: A 6 6
# 4: A 9 6
# 5: A 12 6
# 6: B 6 2
# 7: B 13 4
# 8: B 21 6
# 9: B 24 6
#10: B 27 6
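On the sapply part of the question: extra arguments for the function can be passed by name after FUN, so x never needs a default. A minimal sketch, with Function1 rewritten more compactly (this rewrite is mine; it should give the same results as the original for N >= 0):

```r
# sum each element together with the N elements above it (fewer near the top)
Function1 <- function(x, N) {
  y <- numeric(length(x))
  for (i in seq_along(x)) {
    y[i] <- sum(x[max(1, i - N):i])
  }
  y
}

DF <- data.frame(A = 1:10, B = 2)

# sapply passes each column as x; N = 3 is forwarded to Function1
res <- sapply(DF, Function1, N = 3)
res
```

Here res is a numeric matrix with one column per column of DF. Writing FUN=Function1(N=3) fails because it calls Function1 immediately, without x, instead of handing the function itself to sapply.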

adding row/column total data when aggregating data using plyr and reshape2 package in R

I create aggregate tables most of the time during my work using the flow below:
library(plyr)      # provides ddply
library(reshape2)  # provides dcast
set.seed(1)
temp.df <- data.frame(var1=sample(letters[1:5],100,replace=TRUE),
                      var2=sample(11:15,100,replace=TRUE))
# counts for every var1/var2 combination
temp.output <- ddply(temp.df,
                     c("var1","var2"),
                     function(df) {
                       data.frame(count=nrow(df))
                     })
# totals per var2 across all values of var1
temp.output.all <- ddply(temp.df,
                         c("var2"),
                         function(df) {
                           data.frame(var1="all",
                                      count=nrow(df))
                         })
temp.output <- rbind(temp.output,temp.output.all)
temp.output[,"var1"] <- factor(temp.output[,"var1"],levels=c(letters[1:5],"all"))
temp.output <- dcast(temp.output,formula=var2~var1,value.var="count",fill=0)
I am starting to feel silly writing this "boilerplate" code to include the row/column totals every time I create a new aggregate table. Is there some way to skip it?
Looking at your desired output (now that I'm in front of a computer), perhaps you should look at the margins argument of dcast:
library(reshape2)
dcast(temp.df, var2 ~ var1, value.var = "var2",
fun.aggregate=length, margins = "var1")
# var2 a b c d e (all)
# 1 11 3 1 6 4 2 16
# 2 12 1 3 6 5 5 20
# 3 13 5 9 3 6 1 24
# 4 14 4 7 3 6 2 22
# 5 15 0 5 1 5 7 18
Also look into the addmargins function in base R.
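For the base R route, a minimal sketch using table and addmargins on the same temp.df might look like this. The object names counts, with.total and both are my own; the exact counts depend on your R version's random number generator, so no output is shown.

```r
set.seed(1)
temp.df <- data.frame(var1 = sample(letters[1:5], 100, replace = TRUE),
                      var2 = sample(11:15, 100, replace = TRUE))

# cross-tabulate: rows are var2, columns are var1
counts <- table(var2 = temp.df$var2, var1 = temp.df$var1)

# margin = 2 appends a "Sum" column holding the total across var1 for each var2
with.total <- addmargins(counts, margin = 2)

# with no margin argument, both a "Sum" row and a "Sum" column are added
both <- addmargins(counts)
with.total
```

The result is a table rather than a data frame; as.data.frame.matrix(with.total) is one way to convert it if a data frame is needed downstream.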

R sum rows in a telecommunication matrix

I have a big matrix df with a length of over 3000 rows. I am programming in R. It looks like this:
df: person1 person2 calls
1 3 5
1 4 7
2 11 6
3 1 5
3 2 1
3 4 13
and so on.
What I want to do is get the total number of calls that each person made and received, in two matrices. This would look like this:
calls:                  received:
person  madecalls       person  receivedcalls
1       12              1       5
2       6               2       1
3       19              3       5
                        4       20
                        11      6
Can anyone help me with this problem?
Thanks!
Use the aggregate function:
made.calls <- aggregate(df$calls, by = list(person = df$person1), FUN = sum)
The plyr way:
library(plyr)
ddply(df, .(person1), function(x) data.frame( madecalls = sum(x$calls) ))
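The answers above only cover the calls made. A minimal sketch for both tables, using the formula interface of aggregate on the six example rows from the question (the result names made and received are my own):

```r
df <- data.frame(person1 = c(1, 1, 2, 3, 3, 3),
                 person2 = c(3, 4, 11, 1, 2, 4),
                 calls   = c(5, 7, 6, 5, 1, 13))

# calls made: sum calls per caller (person1)
made <- aggregate(calls ~ person1, data = df, FUN = sum)
names(made) <- c("person", "madecalls")

# calls received: sum calls per callee (person2)
received <- aggregate(calls ~ person2, data = df, FUN = sum)
names(received) <- c("person", "receivedcalls")

made
received
```

If a single table is preferred, merge(made, received, by = "person", all = TRUE) would combine them, with NA where a person made or received no calls.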

Appending results of dlply function to original table

This question builds on the answer that Simon and James provided here
The dlply function worked well to give me Y estimates within my data subsets. Now, my challenge is getting these Y estimates and residuals back into the original data frame to calculate goodness of fit statistics and for further analysis.
I was able to use cbind to convert the dlply output lists to row vectors, but this doesn't quite work as the result is (sorry about the poor markdown).
model <- function(df){ glm(Y~D+O+A+log(M), family=poisson(link="log"), data=df)}
Modrpt <- ddply(msadata, "Dmsa", function(x)coef(model(x)))
Modest <- cbind(dlply(msadata, "Dmsa", function(x) fitted.values(model(x))))
Subset name | Y_Estimates
-------------------------
Dmsa 1 | c(4353.234, 234.34,...
Dmsa 2 | c(998.234, 2543.55,...
This doesn't really answer the mail, because I need to get the individual Y estimates (separated by commas in the Y_estimates column of the Modest data frame) into my msadata data frame.
Ideally, and I know this is incorrect, but I'll put it here for an example, I'd like to do something like this:
msadata$Y_est <- cbind(dlply(msadata, "Dmsa", function(x)fitted.values(model(x))))
If I can decompose the list into individual Y estimates, I could join this to my msadata data frame by "Dmsa". I feel like this is very similar to Michael's answer here, but something is needed to separate the list elements prior to employing Michael's suggestion of join() or merge(). Any ideas?
In the previous question, I proposed a data.table solution. I think it is better suited to what you want to do, since you want to fit models by group and then combine the results with the original data.
library(data.table)
DT <- as.data.table(df)
models <- DT[,{
mod= glm(Y~D+O+A+log(M), family=poisson(link="log"))
data.frame(res= mod$residuals,
fit=mod$fitted.values,
mod$model)
},
by = Dmsa]
Here is an application with some data:
## create some data
set.seed(1)
d.AD <- data.frame(
counts = sample(c(10:30),18,rep=TRUE),
outcome = gl(3,1,18),
treatment = gl(3,6),
type = sample(c(1,2),18,rep=TRUE) ) ## type is the grouping variable
## coerce data to a data.table
library(data.table)
DT <- as.data.table(d.AD)
## apply models
DT[,{mod= glm(formula = counts ~ outcome + treatment,
family = poisson())
data.frame(res= mod$residuals,
fit=mod$fitted.values,
mod$model)},
by = type]
type res fit counts outcome treatment
1: 1 -3.550408e-01 23.25729 15 1 1
2: 1 2.469211e-01 23.25729 29 1 1
3: 1 9.866698e-02 25.48543 28 3 1
4: 1 5.994295e-01 18.13147 29 1 2
5: 1 4.633974e-16 23.00000 23 2 2
6: 1 1.576093e-01 19.86853 23 3 2
7: 1 -3.933199e-01 18.13147 11 1 2
8: 1 -3.456991e-01 19.86853 13 3 2
9: 1 6.141856e-02 22.61125 24 1 3
10: 1 4.933908e-02 24.77750 26 3 3
11: 1 -1.154845e-01 22.61125 20 1 3
12: 2 9.229985e-02 15.56349 17 1 1
13: 2 5.805515e-03 21.87302 22 2 1
14: 2 -1.004589e-01 15.56349 14 1 1
15: 2 2.537653e-16 14.00000 14 1 2
16: 2 -1.603110e-01 21.43651 18 1 3
17: 2 1.662347e-01 21.43651 25 1 3
18: 2 -4.214963e-03 30.12698 30 2 3
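If you would rather stay in base R, a rough sketch of the same idea is to fit per group with split and lapply, then put the fitted values back in the original row order with unsplit. This reuses the example data above; the names fits, fit and res are my own, and res here is the response residual (observed minus fitted), which is not the same quantity as mod$residuals.

```r
set.seed(1)
d.AD <- data.frame(
  counts    = sample(10:30, 18, replace = TRUE),
  outcome   = gl(3, 1, 18),
  treatment = gl(3, 6),
  type      = sample(c(1, 2), 18, replace = TRUE))  # type is the grouping variable

# fit one Poisson model per group and keep only the fitted values
fits <- lapply(split(d.AD, d.AD$type), function(g) {
  fitted(glm(counts ~ outcome + treatment, family = poisson(), data = g))
})

# unsplit() reverses split(), so each fitted value lands on its original row
d.AD$fit <- unsplit(fits, d.AD$type)
d.AD$res <- d.AD$counts - d.AD$fit  # response residuals
head(d.AD)
```

Because the fitted values end up as ordinary columns of the original data frame, no join or merge step is needed afterwards.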

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result has to live in two columns, because an ordinary data frame column holds a single atomic value per row (unless you resort to list columns or complex numbers). head with a negative argument drops that many elements from the tail; try head(1:10, -2). rep is repetition and c is concatenation. The <- assignment adds a new column if it is not there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
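On the second part of the question (summing or averaging the offset group): once the lagged values sit in their own columns, vectorized row functions can be applied directly. A minimal sketch, where the column names sum2 and mean2 are my own:

```r
df <- data.frame(a = 1:10)
df$b1 <- c(NA, head(df$a, -1))      # value one row above
df$b2 <- c(NA, NA, head(df$a, -2))  # value two rows above

# row-wise sum and mean over the two lag columns;
# rows without a complete group (the first two) stay NA
df$sum2  <- rowSums(df[, c("b1", "b2")])
df$mean2 <- rowMeans(df[, c("b1", "b2")])
df
```

Passing na.rm = TRUE to rowSums or rowMeans would instead use whatever values are available in the incomplete rows.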
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8

Resources