Frequency table with ddply function in R

ID<-c("R1","R2","R2","R3","R3","R4","R4","R4","R4","R3","R3","R3","R3","R2","R2","R2","R5","R6")
event<-c("a","b","b","M","s","f","y","b","a","a","a","a","s","c","c","b","m","a")
df<-data.frame(ID,event)
How can I modify the code below to get the table shown underneath it? And second, how can I get the average frequency for each event level? For example, the average frequency for a would be (1+3+1+1)/4.
ddply(df,.(ID),summarise,N=sum(!is.na(ID)),frequency=length(event))
ID N Number-event-level levels frequency
R1 1 1 a a=1
R2 5 2 b,c b=3,c=2
R3 6 3 M,a,s M=1,a=3,s=2
R4 4 4 f,y,b,a f=1,y=1,b=1,a=1
R5 1 1 m m=1
R6 1 1 a a=1

Here's an answer for the first question:
ddply(df, .(ID), summarise,
      N = length(event),
      Number.event.level = length(unique(event)),
      levels = paste(sort(unique(event)), collapse=","),
      frequency = paste(paste(sort(unique(event)), table(event)[table(event) > 0], sep="="), collapse=","))
# ID N Number.event.level levels frequency
# 1 R1 1 1 a a=1
# 2 R2 5 2 b,c b=3,c=2
# 3 R3 6 3 a,M,s a=3,M=1,s=2
# 4 R4 4 4 a,b,f,y a=1,b=1,f=1,y=1
# 5 R5 1 1 m m=1
# 6 R6 1 1 a a=1
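As an aside, plyr has since been superseded by dplyr; a roughly equivalent sketch of the same summary in dplyr (untested, assumes dplyr >= 1.0):
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(N = n(),
            Number.event.level = n_distinct(event),
            levels = paste(sort(unique(as.character(event))), collapse = ","),
            frequency = {
              tab <- table(as.character(event))  # as.character() drops unused factor levels
              paste(paste(names(tab), tab, sep = "="), collapse = ",")
            })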
For your second question, it seems like you want to get the average frequency when the frequency is greater than 0. If that's the case, you can do this:
apply(table(df),2,function(x) mean(x[x>0]))
# a b c f m M s y
# 1.5 2.0 2.0 1.0 1.0 1.0 2.0 1.0
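To see why this works: table(df) cross-tabulates ID against event, and apply(..., 2, ...) then averages the non-zero counts in each event column:
table(df)
#     event
# ID   a b c f m M s y
#   R1 1 0 0 0 0 0 0 0
#   R2 0 3 2 0 0 0 0 0
#   R3 3 0 0 0 0 1 2 0
#   R4 1 1 0 1 0 0 0 1
#   R5 0 0 0 0 1 0 0 0
#   R6 1 0 0 0 0 0 0 0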
Update
If you want to do that last part for each level of a third variable and you still want to use ddply(), you could do the following:
df1 <- rbind(df,df)
df1$cat <- rep(c("a","b"),each=nrow(df))
ddply(df1,.(cat),function(y) apply(table(y),2,function(x) mean(x[x>0])))
# cat a b c f m M s y
# 1 a 1.5 2 2 1 1 1 2 1
# 2 b 1.5 2 2 1 1 1 2 1

Related

Finding the maximum value of a variable

I would like to find the maximum value of a variable (column) and then retain that value and all values below it, along with the corresponding values from all other variables (columns) in the data frame. All rows above this point should be excluded. Included are scripts for an example data frame (df) and the expected result (df2), i.e. what I am trying to achieve. I would be grateful for some script to do this.
Ba <- c(1,1,1,2,2)
Sr <- c(1,1,1,2,2)
Mn <- c(1,1,2,1,1)
df <- data.frame(Ba, Sr, Mn)
df
# Ba Sr Mn
# 1 1 1 1
# 2 1 1 1
# 3 1 1 2
# 4 2 2 1
# 5 2 2 1
This is what I want to achieve in R:
Ba2 <- c(1,2,2)
Sr2 <- c(1,2,2)
Mn2 <- c(2,1,1)
df2 <- data.frame(Ba2, Sr2, Mn2)
df2
# Ba2 Sr2 Mn2
# 1 1 1 2
# 2 2 2 1
# 3 2 2 1
You can subset df with the row sequence running from the smallest per-column which.max to nrow(df):
df[min(sapply(df, which.max)):nrow(df),]
# Ba Sr Mn
#3 1 1 2
#4 2 2 1
#5 2 2 1
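Step by step: sapply(df, which.max) returns the row of each column's first maximum, and the smallest of these is where the kept rows should start:
sapply(df, which.max)
# Ba Sr Mn
#  4  4  3
min(sapply(df, which.max))
# [1] 3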
Does this work?
df[max(apply(df, 1, which.max)):nrow(df),]
#   Ba Sr Mn
# 3  1  1  2
# 4  2  2  1
# 5  2  2  1
(Careful: apply(df, 1, which.max) returns the column position of each row's maximum, so it coincides with the right starting row here but is not reliable in general; the column-wise approach above is safer.)
Using cummax:
library(dplyr)
library(purrr)
df %>%
  filter(cummax(invoke(pmax, cur_data())) == max(cur_data()))
#   Ba Sr Mn
# 1  1  1  2
# 2  2  2  1
# 3  2  2  1
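The same idea also works in base R without the dplyr/purrr dependencies; a small sketch, assuming all columns are numeric:
# row-wise maxima via pmax, then keep rows from the first time the running max hits the global max
df[cummax(do.call(pmax, df)) == max(df), ]
#   Ba Sr Mn
# 3  1  1  2
# 4  2  2  1
# 5  2  2  1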

R counting strings variables in each row of a dataframe

I have a dataframe that looks something like this, where each row represents a sample and contains repeats of the same strings:
> df
V1 V2 V3 V4 V5
1 a a d d b
2 c a b d a
3 d b a a b
4 d d a b c
5 c a d c c
I want to be able to create a new dataframe where, ideally, the headers would be the string variables from the previous dataframe (a, b, c, d) and each row would hold the number of occurrences of the respective variable in the original dataframe. Using the example from above, this would look like
> df2
a b c d
1 2 1 0 2
2 2 1 1 1
3 2 2 0 1
4 1 1 1 2
5 1 0 3 1
In my actual dataset, there are hundreds of variables, and thousands of samples, so it'd be ideal if I could automatically pull out the names from the original dataframe, and alphabetize them into the headers for the new dataframe.
You may try
library(qdapTools)
mtabulate(as.data.frame(t(df)))
Or
mtabulate(split(as.matrix(df), row(df)))
Or using base R
Un1 <- sort(unique(unlist(df)))
t(apply(df, 1, function(x) table(factor(x, levels=Un1))))
You can stack the columns and then use table:
table(cbind(id = 1:nrow(mydf),
            stack(lapply(mydf, as.character)))[c("id", "values")])
# values
# id a b c d
# 1 2 1 0 2
# 2 2 1 1 1
# 3 2 2 0 1
# 4 1 1 1 2
# 5 1 0 3 1
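For comparison, a tidyverse sketch of the same tabulation (assumes tidyr >= 1.1 for names_sort; columns may be character or factor):
library(dplyr)
library(tidyr)
df %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id) %>%
  count(id, value) %>%
  pivot_wider(names_from = value, values_from = n,
              values_fill = 0, names_sort = TRUE) %>%
  select(-id)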

Convert datafile from wide to long format to fit ordinal mixed model in R

I am dealing with a dataset that is in wide format, as in
> data=read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
> data
factor1 factor2 count_1 count_2 count_3
1 a a 1 2 0
2 a b 3 0 0
3 b a 1 2 3
4 b b 2 2 0
5 c a 3 4 0
6 c b 1 1 0
where factor1 and factor2 are different factors which I would like to take along (in fact I have more than 2, but that shouldn't matter), and count_1 to count_3 are counts of aggressive interactions on an ordinal scale (3>2>1). I would now like to convert this dataset to long format, to get something like
factor1 factor2 aggression
1 a a 1
2 a a 2
3 a a 2
4 a b 1
5 a b 1
6 a b 1
7 b a 1
8 b a 2
9 b a 2
10 b a 3
11 b a 3
12 b a 3
13 b b 1
14 b b 1
15 b b 2
16 b b 2
17 c a 1
18 c a 1
19 c a 1
20 c a 2
21 c a 2
22 c a 2
23 c a 2
24 c b 1
25 c b 2
Would anyone happen to know how to do this without using for...to loops, e.g. using package reshape2? (I realize it should work using melt, but I just haven't been able to figure out the right syntax yet)
Edit: For those of you who might also need this kind of functionality, here is Ananda's answer below wrapped into a little function:
widetolong.ordinal <- function(data, factors, responses, responsename) {
  library(reshape2)
  data$ID <- 1:nrow(data)                              # add an ID to preserve row order
  dL <- melt(data, id.vars = c("ID", factors))         # `melt` the data
  dL <- dL[order(dL$ID), ]                             # sort the molten data
  dL[, responsename] <- match(dL$variable, responses)  # convert responses to ordinal scores
  dL[, responsename] <- factor(dL[, responsename], ordered = TRUE)
  dL <- dL[dL$value != 0, ]                            # drop rows where `value == 0`
  out <- dL[rep(rownames(dL), dL$value), c(factors, responsename)]  # use `rep` to "expand" the data.frame and drop unwanted columns
  rownames(out) <- NULL
  return(out)
}
# example
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
widetolong.ordinal(data,c("factor1","factor2"),c("count_1","count_2","count_3"),"aggression")
melt from "reshape2" will only get you part of the way through this problem. To go the rest of the way, you just need to use rep from base R:
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
library(reshape2)
## Add an ID if the row order is important to you
data$ID <- 1:nrow(data)
## `melt` the data
dL <- melt(data, id.vars=c("ID", "factor1", "factor2"))
## Sort the molten data, if necessary
dL <- dL[order(dL$ID), ]
## Extract the numeric portion of the "variable" variable
dL$aggression <- gsub("count_", "", dL$variable)
## Drop rows where `value == 0`
dL <- dL[dL$value != 0, ]
## Use `rep` to "expand" your `data.frame`.
## Drop any unwanted columns at this point.
out <- dL[rep(rownames(dL), dL$value), c("factor1", "factor2", "aggression")]
This is what the output finally looks like. If you want to remove the funny row names, just use rownames(out) <- NULL.
out
# factor1 factor2 aggression
# 1 a a 1
# 7 a a 2
# 7.1 a a 2
# 2 a b 1
# 2.1 a b 1
# 2.2 a b 1
# 3 b a 1
# 9 b a 2
# 9.1 b a 2
# 15 b a 3
# 15.1 b a 3
# 15.2 b a 3
# 4 b b 1
# 4.1 b b 1
# 10 b b 2
# 10.1 b b 2
# 5 c a 1
# 5.1 c a 1
# 5.2 c a 1
# 11 c a 2
# 11.1 c a 2
# 11.2 c a 2
# 11.3 c a 2
# 6 c b 1
# 12 c b 2
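For completeness, the same expansion can be written with tidyr alone these days; a sketch assuming tidyr >= 1.0 for pivot_longer() and uncount():
library(dplyr)
library(tidyr)
data %>%
  pivot_longer(starts_with("count_"), names_to = "aggression",
               names_prefix = "count_", values_to = "n") %>%
  filter(n > 0) %>%
  uncount(n) %>%  # repeat each row `n` times
  mutate(aggression = factor(aggression, ordered = TRUE))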

Remove columns with same value from a dataframe

I've got a data frame like this one
1 1 1 K 1 K K
2 1 2 K 1 K K
3 8 3 K 1 K K
4 8 2 K 1 K K
1 1 1 K 1 K K
2 1 2 K 1 K K
I want to remove all the columns with the same value, i.e. K, so my result will be like this
1 1 1 1
2 1 2 1
3 8 3 1
4 8 2 1
1 1 1 1
2 1 2 1
I tried to iterate over the columns with a for loop, but I didn't get anywhere. Any ideas?
To select columns with more than one value regardless of type:
uniquelength <- sapply(d,function(x) length(unique(x)))
d <- subset(d, select=uniquelength>1)
(Oops, as Roman's comment points out, this could knock out your column 5 as well.) Maybe instead (edit: thanks to the comments!):
isfac <- sapply(d,inherits,"factor")
d <- subset(d,select=!isfac | uniquelength>1)
or
d <- d[,!isfac | uniquelength>1]
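For reference, on the example data uniquelength comes out as 4 2 3 1 1 1 1, so V4 to V7 are the constant columns but only V4, V6, and V7 are factors. Note that since R 4.0, read.table() returns character rather than factor columns by default, so a more robust variant of the test might cover both types (a hedged sketch):
ischr <- sapply(d, inherits, what=c("factor","character"))  # TRUE for factor or character columns
d[, !ischr | uniquelength > 1]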
Here's a solution that'll work to remove any replicated columns (including, e.g., pairs of replicated character, numeric, or factor columns). That's how I read the OP's question, and even if it's a misreading, it seems like an interesting question as well.
df <- read.table(text="
1 1 1 K 1 K K
2 1 2 K 1 K K
3 8 3 K 1 K K
4 8 2 K 1 K K
1 1 1 K 1 K K
2 1 2 K 1 K K")
# Need to run duplicated() in 'both directions', since it considers
# the first occurrence to be **not** a duplicate.
repdCols <- as.logical(duplicated(as.list(df), fromLast=FALSE) +
                       duplicated(as.list(df), fromLast=TRUE))
# [1] FALSE FALSE FALSE TRUE FALSE TRUE TRUE
df[!repdCols]
# V1 V2 V3 V5
# 1 1 1 1 1
# 2 2 1 2 1
# 3 3 8 3 1
# 4 4 8 2 1
# 5 1 1 1 1
# 6 2 1 2 1
Another way to do this is with the higher-order function Filter(). Here is the code:
to_keep <- function(x) any(is.numeric(x), length(unique(x)) > 1)
Filter(to_keep, d)
A one-liner solution:
df2 <- df[sapply(df, function(x) !is.factor(x) | length(unique(x))>1 )]
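A dplyr sketch with the same keep-numeric-or-varying logic (assumes dplyr >= 1.0 for where()):
library(dplyr)
df2 <- df %>% select(where(~ is.numeric(.x) || n_distinct(.x) > 1))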

Applying a function repeatedly to many subjects

I have a data frame as follows,
> mydata
date station treatment subject par
A a 0 R1 1.3
A a 0 R1 1.4
A a 1 R2 1.4
A a 1 R2 1.1
A b 0 R1 1.5
A b 0 R1 1.8
A b 1 R2 2.5
A b 1 R2 9.5
B a 0 R1 0.3
B a 0 R1 8.2
B a 1 R2 7.3
B a 1 R2 0.2
B b 0 R1 9.4
B b 0 R1 3.2
B b 1 R2 3.5
B b 1 R2 2.4
....
where:
date is a factor with 2 levels A/B;
station is a factor with 2 levels a/b;
treatment is a factor with 2 levels 0/1;
subject are the replicates R1 to R20 assigned to treatment (10 to treatment 0 and 10 to treatment 1);
and
par is my parameter, which is a repeated measurement of particle size for each subject at each date and station.
What I need to do is:
divide par into 10 equal bins and count the number of observations in each bin. This has to be done within subsets of mydata defined by each combination of date, station, and subject. The final outcome has to be a data frame myres as follows:
> myres
date station treatment bin.centre freq
A a 0 1.2 4
A a 0 1.3 3
A a 0 1.4 2
A a 0 1.5 1
A a 1 1.2 4
A a 1 1.3 3
A a 1 1.4 2
A a 1 1.5 1
B b 0 2.3 5
B b 0 2.4 4
B b 0 2.5 3
B b 0 2.6 2
B b 1 2.3 5
B b 1 2.4 4
B b 1 2.5 3
B b 1 2.6 2
....
this is what I've done so far (par here stands for mydata$par):
# define the number of bins
num.bins <- 10
# define the width of each bin
bin.width <- (max(par) - min(par)) / num.bins
# define the lower and upper boundaries of each bin
bins <- seq(from = min(par), to = max(par), by = bin.width)
# define the centre of each bin
bin.centre <- seq(min(bins) + bin.width/2, max(bins) - bin.width/2, by = bin.width)
# create a vector to store the frequency in each bin
freq <- numeric(length(bins) - 1)
# this is the loop that counts the frequency of particles between the lower
# and upper boundaries of each bin and stores the result in freq
for (i in 1:num.bins) {
  freq[i] <- length(which(par >= bins[i] & par < bins[i+1]))
}
# create the data frame with the results
res <- data.frame(bin.centre, freq)
My first approach was to subset mydata manually with subset() for each combination of subject, station, and date, apply the above sequence of commands to each subset, and then build the final data frame by combining the individual res objects with rbind(), but this procedure was very convoluted and prone to error.
What I would like to do is automate the above procedure so that it calculates the binned frequency distribution for each subject. My intuition is that the best way to do this is to write a function that estimates the particle distribution and then apply it to each subject via a loop. However, I am not sure how to do this. Any suggestions would be really appreciated.
Thanks,
matteo.
You can do this in a few steps using the functionality in the plyr package. This allows you to split your data into the desired chunks, apply a statistic to each chunk, and combine the results.
First I set up some dummy data:
set.seed(1)
n <- 100
dat <- data.frame(
  date=sample(LETTERS[1:2], n, replace=TRUE),
  station=sample(letters[1:2], n, replace=TRUE),
  treatment=sample(0:1, n, replace=TRUE),
  subject=paste("R", sample(1:2, n, replace=TRUE), sep=""),
  par=runif(n, 0, 5)
)
head(dat)
date station treatment subject par
1 A b 0 R2 3.2943880
2 A a 0 R1 0.9253498
3 B a 1 R1 4.7718907
4 B b 0 R1 4.4892425
5 A b 0 R1 4.7184853
6 B a 1 R2 3.6184538
Now I use the base R function cut() to divide par into equal-sized bins:
dat$bin <- cut(dat$par, breaks=10)
Now for the fun bit. Load the plyr package and use ddply() to split, apply, and combine. Because you want a frequency count, we can use length() to count the number of times each replicate appears in each bin:
library(plyr)
res <- ddply(dat, .(date, station, treatment, bin),
             summarise, freq=length(treatment))
head(res)
date station treatment bin freq
1 A a 0 (0.00422,0.501] 1
2 A a 0 (0.501,0.998] 2
3 A a 0 (1.5,1.99] 4
4 A a 0 (1.99,2.49] 2
5 A a 0 (2.49,2.99] 2
6 A a 0 (2.99,3.48] 1
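If you need numeric bin centres (as in myres) rather than the interval labels, one option is to parse the labels that cut() produces; a small sketch, assuming the default "(lo,hi]" label format:
res$bin.centre <- sapply(strsplit(gsub("[]()]", "", as.character(res$bin)), ","),
                         function(x) mean(as.numeric(x)))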
