R: Modifying Subsets of Dataframe using Calculations on that Subset

I am going to ask my question through example, because I don't know what the best way to phrase it in general is. Using the ChickWeight dataset built into R:
> head(ChickWeight)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
> tail(ChickWeight)
weight Time Chick Diet
573 155 12 50 4
574 175 14 50 4
575 205 16 50 4
576 234 18 50 4
577 264 20 50 4
578 264 21 50 4
I can use ddply (from the plyr package) to calculate the mean weight for each unique Diet, for example:
> ddply(ChickWeight, .(Diet), summarise, mean_weight=mean(weight, na.rm=TRUE))
Diet mean_weight
1 1 102.6455
2 2 122.6167
3 3 142.9500
4 4 135.2627
What should I do if I want to easily create a data frame that modifies the 'weight' column in ChickWeight by dividing it by the mean_weight of its corresponding diet?

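A solution with data.table that's short, fast and readable:
library(data.table)
cw <- data.table(ChickWeight)
# := adds the column by reference; by=Diet computes the mean within each diet
cw[, pct_mw_diet := weight / mean(weight, na.rm = TRUE), by = Diet]
Now cw has a column pct_mw_diet holding each weight as a proportion of the mean weight for its diet (multiply by 100 for a percentage).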
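For comparison, here is a minimal base R sketch of the same per-group division, assuming no extra packages are wanted. The column name rel_weight is made up for illustration; ave() returns the group mean repeated for every row, so the division lines up row by row.

```r
# ChickWeight ships with R; drop its groupedData classes to get a plain data frame
cw <- as.data.frame(ChickWeight)

# ave() returns, for every row, the mean weight of that row's Diet
diet_mean <- ave(cw$weight, cw$Diet, FUN = function(w) mean(w, na.rm = TRUE))

# rel_weight is a hypothetical column name: weight relative to the diet mean
cw$rel_weight <- cw$weight / diet_mean

head(cw)
```

By construction, the mean of rel_weight within each diet is 1, which is a quick sanity check.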

Related

How to write code for Level 2 data for Multilevel Modeling using nlme package

I am struggling with how to describe level 2 data in my Multilevel Model in R.
I am using the nlme package.
I have longitudinal data with repeated measures. I have repeated observations for every subject across many days.
The Goal:
Level 1 would be the individual observations within the subject ID
Level 2 would be the differences between overall means between subject IDs (Cluster).
I am trying to determine if Test scores are significantly affected by study time, and to see if it's significantly different within subjects and between subjects.
How would I write the script if I want to do "Between Subjects" ?
Here is my script for Level 1 Model
model1 <- lme(fixed = TestScore~Studytime, random =~1|SubjectID, data=dataframe, na.action=na.omit)
Below is my example dataframe
`Subject ID` Observations TestScore Studytime
1 1 1 50 600
2 1 2 72 900
3 1 3 82 627
4 1 4 90 1000
5 1 5 81 300
6 1 6 37 333
7 2 1 93 900
8 2 2 97 1000
9 2 3 99 1200
10 2 4 85 600
11 3 1 92 800
12 3 2 73 900
13 3 3 81 1000
14 3 4 96 980
15 3 5 99 1300
16 4 1 47 600
17 4 2 77 900
18 4 3 85 950
I appreciate the help!
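One common way to separate the two levels, sketched here under assumptions rather than offered as a definitive answer, is to split Studytime into a subject mean (between subjects) and a deviation from that mean (within subjects), then enter both as fixed effects. The names dat, st_between, st_within and model2 are made up for this sketch, and with only four subjects in the example data the between-subject estimate is very imprecise.

```r
# Example data from the question, with SubjectID as a factor
dat <- data.frame(
  SubjectID = factor(rep(1:4, times = c(6, 4, 5, 3))),
  TestScore = c(50, 72, 82, 90, 81, 37,  93, 97, 99, 85,
                92, 73, 81, 96, 99,  47, 77, 85),
  Studytime = c(600, 900, 627, 1000, 300, 333,  900, 1000, 1200, 600,
                800, 900, 1000, 980, 1300,  600, 900, 950)
)

# Level 2 (between subjects): each subject's mean study time, repeated per row
dat$st_between <- ave(dat$Studytime, dat$SubjectID)

# Level 1 (within subject): deviation of each observation from the subject mean
dat$st_within <- dat$Studytime - dat$st_between

# Fit only if nlme is installed; both predictors enter as fixed effects
if (requireNamespace("nlme", quietly = TRUE)) {
  model2 <- nlme::lme(fixed = TestScore ~ st_within + st_between,
                      random = ~ 1 | SubjectID,
                      data = dat, na.action = na.omit)
  print(summary(model2)$tTable)
}
```

In this setup the st_within coefficient describes how a subject's score moves with their own study time, while st_between compares subjects who study more or less on average.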

How to create a data frame with all ordinal variables as columns and with frequencies of specific event

I have an ordinal data frame which has answers in the survey format. I want to convert each factor into a possible column so as to get them by frequencies of a specific event.
I have tried apply and dplyr to get the frequencies, but neither produced the layout I want:
as.data.frame(apply(mtfinal, 2, table))
and
mtfinalf<-mtfinal %>%
group_by(q28) %>%
summarise(freq=n())
Expected Results in the form of data.frame
Frequency table with respect to q28's factors
q28 sex1 sex2 race1 race2 race3 race4 race5 race6 race7 age1 age2
2 0
3 0
4 23
5 21
Actual Results
$age
1 2 3 4 5 6 7
6 2 184 520 507 393 170
$sex
1 2
1239 543
$grade
1 2 3 4
561 519 425 277
$race7
1 2 3 4 5 6
179 21 27 140 17 1307
7
91
$q8
1 2 3 4 5
127 259 356 501 539
$q9
1 2 3 4 5
993 224 279 86 200
$q28
2 3 4 5
1034 533 94 121
This will give you a count for each unique combination of q28, age, race7 and sex. What you are asking for is impossible as a single table, since there would be overlaps between the levels of sex, race and age. Note that the race column is named race7 in your output.
mtfinalf<-mtfinal %>%
group_by(q28,age,race7,sex) %>%
tally()
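If the overlap is acceptable, meaning each respondent is counted once per variable, one possible reading of the expected layout is a set of two-way tables against q28 placed side by side. The sketch below uses a made-up toy data frame (toy) because mtfinal is not available; the helper crosstab is hypothetical too.

```r
# Hypothetical toy survey standing in for mtfinal
set.seed(1)
toy <- data.frame(
  q28 = factor(sample(2:5, 60, replace = TRUE), levels = 2:5),
  sex = factor(sample(1:2, 60, replace = TRUE), levels = 1:2),
  age = factor(sample(1:3, 60, replace = TRUE), levels = 1:3)
)

# Cross-tabulate q28 against one variable and prefix the columns (sex1, sex2, ...)
crosstab <- function(v) {
  tab <- as.data.frame.matrix(table(toy$q28, toy[[v]]))
  names(tab) <- paste0(v, names(tab))
  tab
}

# Bind the per-variable tables column-wise; rows are the levels of q28
wide <- do.call(cbind, lapply(c("sex", "age"), crosstab))
wide <- cbind(q28 = rownames(wide), wide)
rownames(wide) <- NULL
wide
```

Each block of columns (sex*, age*) sums to the q28 row total, so the same respondent appears once in every block.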

Calculate the mean of a column for each batch of n rows in R

Suppose I have a data frame like this...
> head(x)
round value
1 1 0.37207016
2 2 0.51954917
3 3 -0.70684976
4 4 0.76105557
5 5 0.09252876
6 6 -2.42223178
> tail(x)
round value
95 95 -0.6799075
96 96 -0.4109732
97 97 0.9740048
98 98 -0.8877499
99 99 0.1501041
100 100 -0.5415825
...and I want to get the mean value over each 10-round interval. I've posted one answer below, but this seems like a common thing to want to do, so is there a more straightforward way?
I can do some gymnastics to create a data frame with an extra column for the "batch" index, and then group by that to calculate the mean.
> y <- data.frame(x$round, x$value, rep(1:10, each=10))
> colnames(y) <- c("round","value", "batch")
> head(y)
round value batch
1 1 0.37207016 1
2 2 0.51954917 1
3 3 -0.70684976 1
4 4 0.76105557 1
5 5 0.09252876 1
6 6 -2.42223178 1
> tail(y)
round value batch
95 95 -0.6799075 10
96 96 -0.4109732 10
97 97 0.9740048 10
98 98 -0.8877499 10
99 99 0.1501041 10
100 100 -0.5415825 10
> tapply(y$value, y$batch, mean)
1 2 3 4 5 6
-0.13784753 -0.15969468 0.41346173 0.09019686 -0.26467052 -0.29677632
7 8 9 10
0.06489254 0.17609739 0.35029525 -0.19669901
Try using integer division (%/%) on the row index. You need to subtract 1 first so that the first group has 10 rows:
tapply(x$value, (seq_len(nrow(x)) - 1) %/% 10, mean)
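A minimal runnable sketch of the integer-division idea, using simulated values because the original x is not available. The seed and the names batch, batch_means and alt are assumptions for illustration.

```r
# Simulated stand-in for x: 100 rounds with random values
set.seed(42)
x <- data.frame(round = 1:100, value = rnorm(100))

# Rows 1-10 map to batch 0, rows 11-20 to batch 1, and so on
batch <- (seq_len(nrow(x)) - 1) %/% 10
batch_means <- tapply(x$value, batch, mean)

# Alternative when nrow(x) is a multiple of 10: matrix() fills column-wise,
# so each column holds 10 consecutive values
alt <- colMeans(matrix(x$value, nrow = 10))

batch_means
```

The matrix version is shorter but only works when the number of rows divides evenly into batches; the tapply version handles a ragged last batch.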

summing a range of columns in data frame

I am having trouble summing select columns within a data frame, a basic problem for which I've seen numerous similar, but not identical, questions and answers on StackOverflow.
With this perhaps overly complex data frame:
site<-c(223,257,223,223,257,298,223,298,298,211)
moisture<-c(7,7,7,7,7,8,7,8,8,5)
shade<-c(83,18,83,83,18,76,83,76,76,51)
sampleID<-c(158,163,222,107,106,166,188,186,262,114)
bluestm<-c(3,4,6,3,0,0,1,1,1,0)
foxtail<-c(0,2,0,4,0,1,1,0,3,0)
crabgr<-c(0,0,2,0,33,0,2,1,2,0)
johnson<-c(0,0,0,7,0,8,1,0,1,0)
sedge1<-c(2,0,3,0,0,9,1,0,4,0)
sedge2<-c(0,0,1,0,1,0,0,1,1,1)
redoak<-c(9,1,0,5,0,4,0,0,5,0)
blkoak<-c(0,22,0,23,0,23,22,17,0,0)
my.data<-data.frame(site,moisture,shade,sampleID,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak)
I want to sum the counts of each plant species (bluestem, foxtail, etc.; columns 5-12 in this example) within each site, by summing rows that have the same site number. I also want to keep the information about moisture and shade (these are consistent within a site, but may also be the same between sites), and I want a new column that counts the number of rows summed.
the result would look like this
site,moisture,shade,NumSamples,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak
211,5,51,1,0,0,0,0,0,1,0,0
223,7,83,4,13,5,4,8,6,1,14,45
257,7,18,2,4,2,33,0,0,1,1,22
298,8,76,3,2,4,3,9,13,2,9,40
The problem I am having is that my real data sets (and I have several of them) have from 50 to 300 plant species, and I want to refer to a range of columns (in this case, [5:12]) instead of my.data$foxtail, my.data$sedge1, etc., which is going to be very difficult with 300 species.
I know I can start off by deleting the column I don't need (sampleID)
my.data$sampleID <- NULL
but then how do I get the sums? I've messed with the aggregate command and with ddply, and have seen lots of examples which call particular column names, but just haven't gotten anything to work. I recognize this is a variant of a commonly asked and simple type of question, but I've spent hours without resolving it on my own. So, apologies for my stupidity!
This works ok:
x <- aggregate(my.data[,5:12], by=list(site=my.data$site, moisture=my.data$moisture, shade=my.data$shade), FUN=sum, na.rm=TRUE)
library(dplyr)
my.data %>%
group_by(site) %>%
tally %>%
left_join(x)
site n moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 1 5 51 0 0 0 0 0 1 0 0
2 223 4 7 83 13 5 4 8 6 1 14 45
3 257 2 7 18 4 2 33 0 0 1 1 22
4 298 3 8 76 2 4 3 9 13 2 9 40
Or to do it all in dplyr
my.data %>%
group_by(site) %>%
tally %>%
left_join(my.data) %>%
group_by(site,moisture,shade,n) %>%
summarise_each(funs(sum=sum)) %>%
select(-sampleID)
site moisture shade n bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 5 51 1 0 0 0 0 0 1 0 0
2 223 7 83 4 13 5 4 8 6 1 14 45
3 257 7 18 2 4 2 33 0 0 1 1 22
4 298 8 76 3 2 4 3 9 13 2 9 40
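Try the following using base R (this assumes sampleID has already been removed from my.data, so the species are columns 4 to 11):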
outdf<-data.frame(site=numeric(),moisture=numeric(),shade=numeric(),bluestm=numeric(),foxtail=numeric(),crabgr=numeric(),johnson=numeric(),sedge1=numeric(),sedge2=numeric(),redoak=numeric(),blkoak=numeric())
my.data$basic = with(my.data, paste(site, moisture, shade))
for(b in unique(my.data$basic)) {
outdf[nrow(outdf)+1,1:3] = unlist(strsplit(b,' '))
for(i in 4:11)
outdf[nrow(outdf),i]= sum(my.data[my.data$basic==b,i])
}
outdf
site moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 223 7 83 13 5 4 8 6 1 14 45
2 257 7 18 4 2 33 0 0 1 1 22
3 298 8 76 2 4 3 9 13 2 9 40
4 211 5 51 0 0 0 0 0 1 0 0
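The same result can be sketched in base R without an explicit loop, by referring to the species columns as a range and merging the sums with a row count. This is one possible approach, not a claim about which answer is best; the names key, species, sums, counts and result are made up here, and the column range 5:12 assumes sampleID is still present.

```r
# Data from the question
my.data <- data.frame(
  site     = c(223, 257, 223, 223, 257, 298, 223, 298, 298, 211),
  moisture = c(7, 7, 7, 7, 7, 8, 7, 8, 8, 5),
  shade    = c(83, 18, 83, 83, 18, 76, 83, 76, 76, 51),
  sampleID = c(158, 163, 222, 107, 106, 166, 188, 186, 262, 114),
  bluestm  = c(3, 4, 6, 3, 0, 0, 1, 1, 1, 0),
  foxtail  = c(0, 2, 0, 4, 0, 1, 1, 0, 3, 0),
  crabgr   = c(0, 0, 2, 0, 33, 0, 2, 1, 2, 0),
  johnson  = c(0, 0, 0, 7, 0, 8, 1, 0, 1, 0),
  sedge1   = c(2, 0, 3, 0, 0, 9, 1, 0, 4, 0),
  sedge2   = c(0, 0, 1, 0, 1, 0, 0, 1, 1, 1),
  redoak   = c(9, 1, 0, 5, 0, 4, 0, 0, 5, 0),
  blkoak   = c(0, 22, 0, 23, 0, 23, 22, 17, 0, 0)
)

key     <- my.data[, c("site", "moisture", "shade")]  # grouping columns
species <- my.data[, 5:12]                            # species as a column range

# Sum every species column within each site/moisture/shade group
sums <- aggregate(species, by = key, FUN = sum)

# Count rows per group by summing a column of ones
counts <- aggregate(list(NumSamples = rep(1, nrow(my.data))), by = key, FUN = sum)

# Key columns come first, then NumSamples, then the species sums
result <- merge(counts, sums, by = c("site", "moisture", "shade"))
result
```

With 300 species only the range in species changes, for example 5:ncol(my.data).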

ddply type functionality on multiple dataframes

I have two dataframes that are structured as follows:
Dataframe A:
id sqft traf month
1 1030 16 35 1
1 1030 15 32 2
2 1027 1 31 1
2 1027 2 31 2
Dataframe B:
id price frequency month day
1 1030 8 196 1 1
2 1030 9 101 1 15
3 1030 10 156 1 30
4 1030 3 137 2 1
5 1030 7 190 2 15
6 1027 6 188 1 1
7 1027 1 198 1 15
8 1027 2 123 1 30
9 1027 4 185 2 1
10 1027 5 122 2 15
I want to output certain types of summary statistics (centered around each unique ID) from both of these dataframes. This would be easy with ddply if, say, I wanted the mean price for each ID for each month (split by id and month) from Dataframe B, or if I wanted the average ratio of sqft to traf for each id (split by id).
But what would be a potential solution if I wanted to make combined variables from both dataframes. For instance, how would I get the average price for each id/month (Dataframe B) divided by sqft for each id/month?
The two dataframes are measured at different frequencies, which makes combining them difficult. The only solution I've found so far is to ddply the first dataframe to extract average sqft/id/month and then pass that value into a second ddply call on the second dataframe.
Is there a more efficient/less convoluted way to do this? I would be splitting both dataframes on the same variables (id and month).
Thanks in advance for any suggestions!
In the case of the sample data, you could merge the two data sets like this (by specifying all.y = TRUE you can make sure that all rows of dfb are kept and, in this case, corresponding entries of dfa are repeated accordingly)
dfall <- merge(dfa, dfb, by = c("id", "month"), all.y=TRUE)
# id month sqft traf price frequency day
#1 1027 1 1 31 6 188 1
#2 1027 1 1 31 1 198 15
#3 1027 1 1 31 2 123 30
#4 1027 2 2 31 4 185 1
#5 1027 2 2 31 5 122 15
#6 1030 1 16 35 8 196 1
#7 1030 1 16 35 9 101 15
#8 1030 1 16 35 10 156 30
#9 1030 2 15 32 3 137 1
#10 1030 2 15 32 7 190 15
Then, you can use ddply as usual:
ddply(dfall, .(id, month), mutate, newcol = mean(price)/sqft)
# id month sqft traf price frequency day newcol
#1 1027 1 1 31 6 188 1 3.0000000
#2 1027 1 1 31 1 198 15 3.0000000
#3 1027 1 1 31 2 123 30 3.0000000
#4 1027 2 2 31 4 185 1 2.2500000
#5 1027 2 2 31 5 122 15 2.2500000
#6 1030 1 16 35 8 196 1 0.5625000
#7 1030 1 16 35 9 101 15 0.5625000
#8 1030 1 16 35 10 156 30 0.5625000
#9 1030 2 15 32 3 137 1 0.3333333
#10 1030 2 15 32 7 190 15 0.3333333
Edit: if you're looking for better performance, consider using dplyr instead of plyr. The equivalent dplyr code (including the merge) is:
library(dplyr)
dfall <- dfb %>%
left_join(., dfa, by = c("id", "month")) %>%
group_by(id, month) %>%
dplyr::mutate(newcol = mean(price)/sqft) # I added dplyr:: to avoid confusion with plyr::mutate
Of course, you could also check out data.table which is also very efficient.
AFAIK ddply is not designed to be used with different data frames at the same time.
dplyr does well here. This code merges the data frames, gets price and sqft means by unique id/month combination, then creates a new variable pricePerSqft.
require(dplyr)
dfa %>%
left_join(dfb, by = c("id", "month")) %>%
group_by(id, month) %>%
summarize(
avgPrice = mean(price),
avgSqft = mean(sqft)) %>%
mutate(pricePerSqft = round(avgPrice / avgSqft, 2))
Here's the result:
id month avgPrice avgSqft pricePerSqft
1 1027 1 3.0 1 3.00
2 1027 2 4.5 2 2.25
3 1030 1 9.0 16 0.56
4 1030 2 5.0 15 0.33
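For readers without plyr or dplyr, the aggregate-then-merge idea can also be sketched in base R. The data frames below are rebuilt from the sample data in the question; avg_price, out and price_per_sqft are names made up for this sketch.

```r
# Dataframe A: one row per id and month
dfa <- data.frame(
  id    = c(1030, 1030, 1027, 1027),
  sqft  = c(16, 15, 1, 2),
  traf  = c(35, 32, 31, 31),
  month = c(1, 2, 1, 2)
)

# Dataframe B: several rows per id and month
dfb <- data.frame(
  id        = rep(c(1030, 1027), each = 5),
  price     = c(8, 9, 10, 3, 7, 6, 1, 2, 4, 5),
  frequency = c(196, 101, 156, 137, 190, 188, 198, 123, 185, 122),
  month     = rep(c(1, 1, 1, 2, 2), times = 2),
  day       = rep(c(1, 15, 30, 1, 15), times = 2)
)

# Reduce dfb to one mean price per id/month so both tables share a granularity
avg_price <- aggregate(price ~ id + month, data = dfb, FUN = mean)

# Join on the shared keys, then compute the combined variable
out <- merge(avg_price, dfa[, c("id", "month", "sqft")], by = c("id", "month"))
out$price_per_sqft <- out$price / out$sqft
out
```

Aggregating before merging keeps one row per id/month, which avoids repeating sqft across the daily rows of dfb.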
