Duplicating rows based on a multiplier - r

I'm using R and having some trouble manipulating my data. I've identified bee-collected pollen by type and recorded each type's relative volume ("adjusted_volume" below, i.e. how much pollen is on a slide). I'm now trying to calculate average pollen usage by bees at each of my 14 sites. My data looks like this:
head(pollen)
site treatment hive_code pollen_type adjusted_volume
A conventional 4 alnus_spp 248.5
B conventional 4 alnus_spp 71.0
B conventional 7 alnus_spp 35.5
My plan was to dcast and gather to get the amount of each pollen type per site...
data1 <- dcast(pollen, site + treatment ~ pollen_type, length)
data2 <- gather(data1, pollen_type, count, alnus_spp:vaccinium_corymbosum, factor_key=TRUE, na.rm=TRUE)
But that doesn't account for the differences in volume for each entry. I might be thinking about this the wrong way, but is there a way to multiply each row by the adjusted_volume number in the dcast function? So the first row would count as 248.5 alnus_spp at site A instead of just 1 record?
Thanks for your help in advance! And sorry if I'm going about this in a ridiculous way!
Edit:
This worked! Thanks all!
x <- ddply(pollen, .(site, pollen_type, treatment, hive_code), summarise, tot_pollen = sum(adjusted_volume))
> head(x)
site pollen_type treatment hive_code tot_pollen
A alnus_spp conventional 1 497.0
A alnus_spp conventional 5 142.0
A graminaceae_spp conventional 1 29.0

I think something like this might get at what you are looking for:
ddply(pollen, .(site, treatment, pollen_type), summarise, tot_pollen = sum(adjusted_volume))
This should summarize the volumes of pollen by site, treatment, and pollen_type.
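For what it's worth, you can also get the weighted totals directly from your original dcast call by summing adjusted_volume instead of counting rows; a sketch using reshape2 (untested against your full data):
library(reshape2)
# sum the volumes instead of counting records
data1 <- dcast(pollen, site + treatment ~ pollen_type,
               value.var = "adjusted_volume", fun.aggregate = sum)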
Good luck!

Related

Suggestion with aggregation of data in R

Hello, I have a data frame with more than 3,632,200 observations, and I'm trying to extract some useful information from it. I have cleaned it a bit, so this is what the data looks like now:
Order Lane Days
18852324 796005 - Ahmedabad 2
232313 796008 - Delhi 5
63963231 796005 - Ahmedabad 5
23501231 788152 - Chennai 1
2498732 796008 - Delhi 2
231413 796005 - Ahmedabad 3
75876876 796012 - Chennai 4
14598676 796008 - Delhi 4
Order contains the order IDs (all unique), Lane is the path on which the order was delivered (lanes can repeat across orders), and Days is calculated with R's difftime function as the difference between the order's delivered date and its created date.
What I'm trying to achieve is a per-lane summary of delivery performance.
I can calculate the day by which 98% of orders are delivered using the quantile function in R for each lane.
But how do I get the percentage of orders fulfilled by day 1 through day 5 across the various lanes?
Any help would be highly appreciated.
Thank you!
Hard to tell without the full data, but maybe something like this:
library(purrr)
# df = your data, with columns Order, Lane, Days
max_days <- max(df$Days)

# for one lane: the share of its orders delivered on each day
aggregate_fun <- function(x){
  days <- factor(x$Days, levels = 1:max_days)  # keep days with zero orders
  prop.table(table(days))
}

by_lane <- split(df, df$Lane)
results <- reduce(lapply(by_lane, aggregate_fun), rbind)
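Since "fulfilled by day n" is cumulative, you may want running totals of those proportions; a small follow-up sketch, assuming the results matrix from above:
# cumulative share of orders fulfilled by each day, per lane
cumulative <- t(apply(results, 1, cumsum))
round(100 * cumulative, 1)  # as percentages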

Finding Specific Means and Medians in R

I am working on a school project in R looking at swimming data compiled from 8 different teams across each of 13 events, over 6 years. I have appended over 8700 rows of data and am trying to work out how to extract the specific means I am looking for. For example, I would like to look at the progression of mean times for team 1 in event 3 for men. Thanks!
You can subset your data frame to only include the relevant rows, e.g.
ss = subset(df, team == 1 & event == 3)
mean(ss$times)
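To see the progression over the 6 years rather than a single overall mean, you could group the subset by year; a sketch, assuming columns named team, event, sex, year, and times (substitute your actual names):
ss <- subset(df, team == 1 & event == 3 & sex == "M")
# mean time per year shows the progression
aggregate(times ~ year, data = ss, FUN = mean)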

Table of average score of peer per percentile

I'm quite a newbie in R, so I'm interested in whether my solution is a good one. It works, but it may be (a bit) long-winded, and I'd like your advice on whether the way I solved it is a sensible approach; it would also help me learn new techniques and functions in R.
I have a dataset of students identified by id, along with the school they are matched to and the score they obtained on a specific test (so, in short, 3 variables: id, match, and score).
I need to construct the following table: for students between two percentiles of score, I need the average (over those students) of the mean score of the school each one is matched to. In other words, for each school I take the mean score of the students matched to it, and then I average those school means within each percentile class (yes, the same school's mean can appear more than once in this calculation). In plain English, it answers: "A student in the x-th score percentile will, on average, be matched to a school of this average quality."
Here is a small example (the same data as in the code below): five students with scores 18, 4, 15, 8, 24, matched to schools a, b, a, b, c, so the school means are a = 16.5, b = 6, c = 24.
In that case, if I take the median (15) as the split point (rather than percentiles), I would like to obtain:
[0,15] : 9.5
(15,24] : 20.25
So for students with a score between 0 and 15, I take the average of the mean scores of the schools they are matched to (note that b's mean appears twice, but that's OK).
Here is how I did it:
match <- c("a", "b", "a", "b", "c")
score <- c(18, 4, 15, 8, 24)
# 1) bin the scores into deciles
scoreQuant <- cut(score, quantile(score, probs = seq(0, 1, 0.1), na.rm = TRUE))
# 2) mean score of each school
AvgeSchScore <- tapply(score, match, mean, na.rm = TRUE)
# 3) attach each student's school mean
AvgScore <- 0
for(i in 1:length(score)) {
  AvgScore[i] <- AvgeSchScore[match[i]]
}
# 4) average the school means within each score bin
results <- tapply(AvgScore, scoreQuant, mean, na.rm = TRUE)
Is there a more direct way of doing it? I think the weak point is step 3), the loop; maybe apply() would be better? I'm not sure how to use it here (I tried to write my own function, but it crashed, so I brute-forced it).
Thanks :)
The main fix is to eliminate the for loop with:
AvgScore <- AvgeSchScore[match]
R allows you to subset in ways that many other languages do not. tapply returns a vector named after the levels of the grouping factor, so indexing AvgeSchScore with the match vector looks up each student's school mean by name.
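Putting it together, steps 2) to 4) of your approach collapse to this (using your example vectors):
AvgeSchScore <- tapply(score, match, mean, na.rm = TRUE)
AvgScore <- AvgeSchScore[match]  # named subsetting replaces the loop
results <- tapply(AvgScore, scoreQuant, mean, na.rm = TRUE)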
data.table
If you would like to try data.table you may see speed improvements.
library(data.table)
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
dt <- data.table(id=1:5, match, score)
scoreQuant <- cut(dt$score,quantile(dt$score,probs=seq(0,1,0.1),na.rm=TRUE))
dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]
# scoreQuant V1
#1: (17.4,19.2] 16.5
#2: NA 6.0
#3: (12.2,15] 16.5
#4: (7.2,9.4] 6.0
#5: (21.6,24] 24.0
It may be faster than base R. (The NA row appears because cut() leaves out the lowest score unless you pass include.lowest = TRUE; you can set that, or simply drop the row afterwards.)

How do I generate a dataframe displaying the number of unique pairs between two vectors, for each unique value in one of the vectors?

First of all, I apologize for the title. I really don't know how to succinctly explain this issue in one sentence.
I have a dataframe where each row represents some aspect of a hospital visit by a patient. A single patient might have thousands of rows for dozens of hospital visits, and each hospital visit could account for several rows.
One column is Medical.Record.Number, which corresponds to patient IDs, and the other is Patient.ID.Visit, which corresponds to an ID for an individual hospital visit. I am trying to calculate the number of hospital visits each patient has had.
For example:
Medical.Record.Number    Patient.ID.Visit
AAAXXX           1111
AAAXXX           1112
AAAXXX           1113
AAAZZZ           1114
AAAZZZ           1114
AAABBB           1115
AAABBB           1116
would produce the following:
Medical.Record.Number   Number.Of.Visits
AAAXXX          3
AAAZZZ          1
AAABBB          2
The solution I am currently using is the following, where "data" is my dataframe:
#this function returns the number of unique hospital visits associated with the
#supplied record number
countVisits <- function(record.number){
  visits.by.number <- data$Patient.ID.Visit[which(data$Medical.Record.Number == record.number)]
  return(length(unique(visits.by.number)))
}
recordNumbers <- unique(data$Medical.Record.Number)
visits <- integer()
for (record in recordNumbers){
  visits <- c(visits, countVisits(record))
}
visit.counts <- data.frame(recordNumbers, visits)
This works, but it is pretty slow. I am dealing with potentially millions of rows of data, so I'd like something efficient. From what little I know about R, I know there's usually a faster way to do things without using a for-loop.
This essentially looks like a table() operation after you take out duplicates. First, some sample data
#sample data
dd<-read.table(text="Medical.Record.Number Patient.ID.Visit
AAAXXX 1111
AAAXXX 1112
AAAXXX 1113
AAAZZZ 1114
AAAZZZ 1114
AAABBB 1115
AAABBB 1116", header=T)
then you could do
tt <- table(Medical.Record.Number=unique(dd)$Medical.Record.Number)
as.data.frame(tt, responseName="Number.Of.Visits") #to get a data.frame rather than named vector (table)
# Medical.Record.Number Number.Of.Visits
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
Or you could also think of this as an aggregation problem
aggregate(Patient.ID.Visit~Medical.Record.Number, dd, function(x) length(unique(x)))
# Medical.Record.Number Patient.ID.Visit
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
There are many ways to do this; @MrFlick provided a handful of perfectly valid approaches. Personally, I'm fond of the data.table package. It's faster on large data frames, and I find the logic more intuitive than the base functions. I'd check it out if you are having problems with execution time.
library(data.table)
med.dt <- data.table(med_tbl)  # med_tbl = your data frame
num.visits.dt <- med.dt[, .(num_visits = length(unique(Patient.ID.Visit))),
                        by = Medical.Record.Number]
data.table should be much faster than data.frame on large tables.
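For completeness (not part of the original answers), the same count is also a short pipeline in dplyr:
library(dplyr)
visit.counts <- dd %>%
  group_by(Medical.Record.Number) %>%
  summarise(Number.Of.Visits = n_distinct(Patient.ID.Visit))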

Good ways to code complex tabulations in R?

Does anyone have any good thoughts on how to code complex tabulations in R?
I am afraid I might be a little vague on this, but I want to set up a script to create a bunch of tables of a complexity analogous to the Statistical Abstract of the United States.
e.g.: http://www.census.gov/compendia/statab/tables/09s0015.pdf
And I would like to avoid a whole bunch of rbind and cbind statements.
In SAS, I have heard, there is a table creation specification language; I was wondering if there was something of similar power for R?
Thanks!
It looks like you want to apply a number of different calculations to some data, grouping it by one field (in the example, by state)?
There are many ways to do this. See this related question.
You could use Hadley Wickham's reshape package (see reshape homepage). For instance, if you wanted the mean, sum, and count functions applied to some data grouped by a value (this is meaningless, but it uses the airquality data from reshape):
> library(reshape)
> names(airquality) <- tolower(names(airquality))
> # melt the data to just include month and temp
> aqm <- melt(airquality, id="month", measure="temp", na.rm=TRUE)
> # cast by month with the various relevant functions
> cast(aqm, month ~ ., function(x) c(mean(x),sum(x),length(x)))
month X1 X2 X3
1 5 66 2032 31
2 6 79 2373 30
3 7 84 2601 31
4 8 84 2603 31
5 9 77 2307 30
Or you can use the by() function, where the index represents the grouping variable (in your case, the states). Rather than applying one function (e.g. mean), you can supply your own function that does several things at once, for instance function(x) { c(mean(x), length(x)) }, and then run do.call("rbind", ...) on the output, as sketched below.
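A minimal sketch of that by() pattern, reusing the lower-cased airquality data from above:
# group temp by month and compute several statistics per group
res <- by(airquality$temp, airquality$month,
          function(x) c(mean = mean(x), n = length(x)))
# stack the per-group results into a matrix
do.call("rbind", res)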
Also, you might give some consideration to using a reporting package such as Sweave (with xtable) or Jeffrey Horner's brew package. There is a great post on the learnr blog about creating repetitive reports that shows how to use it.
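For instance, xtable can turn the matrix from the by() sketch above straight into a LaTeX table (a tiny illustration; its options go well beyond this):
library(xtable)
print(xtable(do.call("rbind", res), caption = "Mean temp and count by month"))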
Another option is the plyr package.
library(plyr)
names(airquality) <- tolower(names(airquality))
ddply(airquality, "month", function(x){
  # solar.r contains NAs, so use na.rm = TRUE to avoid NA results
  with(x, c(meantemp = mean(temp), maxtemp = max(temp),
            nonsense = max(temp) - min(solar.r, na.rm = TRUE)))
})
Here is an interesting blog posting on this topic. The author tries to create a report analogous to the United Nations' World Population Prospects: The 2008 Revision report.
Hope that helps,
Charlie
