I have a data frame with daily average temperature data in it, structured like so:
'data.frame': 4666 obs. of 6 variables:
$ Site : chr "EB" "FFCE" "IB" "FFCE" ...
$ Date : Date, format: "2013-01-01" "2013-01-01" "2013-01-01" "2014-01-01" ...
$ Day : int 1 1 1 1 1 1 1 1 1 1 ...
$ Year : int 2013 2013 2013 2014 2014 2014 2014 2015 2015 2015 ...
$ Month: int 1 1 1 1 1 1 1 1 1 1 ...
$ Temp : num 28.5 28.3 28.3 27 27.8 ...
I am attempting to produce a summary table that totals the number of days in a year, per site, above certain temperature thresholds, e.g. 25°C, 26°C.
I can achieve this manually using dplyr, like so:
Days_above = Site_Daily_average %>%
  group_by(Year, Site) %>%
  summarise("23" = sum(Temp > 23), "24" = sum(Temp > 24), "25" = sum(Temp > 25),
            "26" = sum(Temp > 26), "27" = sum(Temp > 27), "28" = sum(Temp > 28),
            "29" = sum(Temp > 29), "30" = sum(Temp > 30), "31" = sum(Temp > 31),
            "ABOVE THRESHOLD" = sum(Temp > maxthreshold)) %>%
  as.data.frame()
Which produces a table like so:
Year Site 23 24 25 26 27 28 29 30 31 ABOVE THRESHOLD
1 2012 EB 142 142 142 91 64 22 0 0 0 0
2 2012 FFCE 238 238 238 210 119 64 0 0 0 0
3 2012 IB 238 238 238 218 138 87 1 0 0 0
4 2013 EB 115 115 115 115 115 109 44 0 0 0
5 2013 FFCE 223 223 216 197 148 114 94 0 0 0
6 2013 IB 365 365 365 348 299 194 135 3 0 0
...
However, as you can see, the code is fairly verbose. The problem I am having is producing this same output for a sequence of temperature thresholds, i.e. Tempclasses = seq(16, 32, 0.25).
That would take a long time to type out manually. I feel like this is a very simple calculation, and there should be a way to get dplyr to take each value in the sequence vector, apply this function, and produce the output as a complete table. Sorry if that was unclear; I am relatively new to R.
Any suggestions would be welcome, thank you.
Here's a tidyverse approach, likewise using mtcars for illustration:
library(tidyverse)

mtcars %>%
  mutate(threshold = cut(mpg,
                         breaks = seq(10, max(mtcars$mpg) + 10, 5),
                         labels = seq(10, max(mtcars$mpg) + 5, 5))) %>%
  group_by(cyl, threshold) %>%
  tally() %>%
  ungroup() %>%
  complete(threshold, nesting(cyl), fill = list(n = 0)) %>%
  arrange(desc(threshold)) %>%
  group_by(cyl) %>%
  mutate(N_above = cumsum(n)) %>%
  select(-n) %>%
  arrange(cyl, threshold)
threshold cyl N_above
1 10 4 11
2 15 4 11
3 20 4 11
4 25 4 6
5 30 4 4
6 35 4 0
7 10 6 7
8 15 6 7
9 20 6 3
10 25 6 0
11 30 6 0
12 35 6 0
13 10 8 14
14 15 8 8
15 20 8 0
16 25 8 0
17 30 8 0
18 35 8 0
If you want the final data in wide format, add a spread at the end and remove the arrange:
... %>%
  select(-n) %>%
  spread(threshold, N_above)
cyl 10 15 20 25 30 35
1 4 11 11 11 6 4 0
2 6 7 7 3 0 0 0
3 8 14 8 0 0 0 0
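Applied to the temperature data in the question, the same pipeline might look like this (a sketch, assuming the column names from the question; note that with these breaks any Temp at or below 16 falls outside the intervals and becomes NA):

Tempclasses <- seq(16, 32, 0.25)

Site_Daily_average %>%
  mutate(threshold = cut(Temp, breaks = c(Tempclasses, Inf), labels = Tempclasses)) %>%
  group_by(Year, Site, threshold) %>%
  tally() %>%
  ungroup() %>%
  complete(threshold, nesting(Year, Site), fill = list(n = 0)) %>%
  arrange(desc(threshold)) %>%
  group_by(Year, Site) %>%
  mutate(N_above = cumsum(n)) %>%  # days with Temp above each threshold value
  select(-n) %>%
  spread(threshold, N_above)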
As @dww commented, we can use cut to get the required format. I have tried this on the mtcars dataset, where we create a range from 10 to 35 in steps of 5 for the mpg column.
df <- mtcars
df$group <- cut(df$mpg, seq(10, 35, 5))
and then we group by cyl and use table to count how many values fall into the respective buckets.
table(df$cyl, df$group)
# (10,15] (15,20] (20,25] (25,30] (30,35]
#4 0 0 5 2 4
#6 0 4 3 0 0
#8 6 8 0 0 0
Now, any value in the (10,15] bucket is also at most 20, so the count for the (15,20] bucket should also include the count from the (10,15] bucket, and the count for the (20,25] bucket should include both of the previous counts. Hence, we need a row-wise cumsum for this table:
t(apply(table(df$cyl, df$group), 1, cumsum))
# (10,15] (15,20] (20,25] (25,30] (30,35]
# 4 0 0 5 7 11
# 6 0 4 7 7 7
# 8 6 14 14 14 14
For your case, the code would go:
Site_Daily_average$group <- cut(Site_Daily_average$Temp, seq(16, 32, 0.25))
# then build a Year/Site-by-group table and take row-wise cumulative sums
t(apply(table(interaction(Site_Daily_average$Year, Site_Daily_average$Site),
              Site_Daily_average$group), 1, cumsum))
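If you would rather keep the exact "days above threshold" logic of your original summarise, you can also loop over the thresholds and bind the results (a sketch, assuming the same data frame and column names as in the question):

library(tidyverse)

thresholds <- seq(16, 32, 0.25)

map_dfr(thresholds, function(t) {
  Site_Daily_average %>%
    group_by(Year, Site) %>%
    summarise(days = sum(Temp > t), .groups = "drop") %>%
    mutate(threshold = t)
}) %>%
  pivot_wider(names_from = threshold, values_from = days)  # one column per threshold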
Below is the sample data. The task at hand is to sum quarter1 and quarter2 for ownership code 30 but exclude indcode 115, and from there add a new row that contains this sum. In Excel this is very simple, but I'm hoping to automate it a bit using R. The bottom half of the desired result is below. My first question is: would I pivot_wider so that I am summing columns rather than rows?
area <- c(000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000)
indcode <- c(110,111,112,113,114,115,110,111,112,113,114,115)
quarter1 <- c(NA,2,4,6,16,3,NA,1,2,3,8,2)
quarter2 <- c(2,3,5,7,22,1,9,1,2,4,11,1)
ownership <- c(00,00,00,00,00,00,30,30,30,30,30,30)
employment <- data.frame(area,indcode,quarter1,quarter2,ownership)
area indcode quarter1 quarter2 ownership
000000 111 1 1 30
000000 112 2 2 30
000000 113 3 4 30
000000 114 8 11 30
000000 115 2 1 30
000000 993 14 18 30
I've assumed you want this done for area groups, but if not you can delete the group_by(area) line.
employment %>%
  group_by(area) %>%
  summarize(
    across(quarter1:quarter2, ~ sum(.x[ownership == 30 & indcode != 115], na.rm = TRUE)),
    indcode = 993,
    ownership = 30
  ) %>%
  bind_rows(employment, .)
# area indcode quarter1 quarter2 ownership
# 1 0 111 2 3 0
# 2 0 112 4 5 0
# 3 0 113 6 7 0
# 4 0 114 16 22 0
# 5 0 115 3 1 0
# 6 0 111 1 1 30
# 7 0 112 2 2 30
# 8 0 113 3 4 30
# 9 0 114 8 11 30
# 10 0 115 2 1 30
# 11 0 993 14 18 30
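For comparison, a base R sketch of the same aggregation (assuming a single area, as in the sample data; for several areas you would aggregate per area instead):

keep <- employment$ownership == 30 & employment$indcode != 115
sums <- data.frame(area      = unique(employment$area),
                   indcode   = 993,
                   quarter1  = sum(employment$quarter1[keep], na.rm = TRUE),
                   quarter2  = sum(employment$quarter2[keep], na.rm = TRUE),
                   ownership = 30)
rbind(employment, sums)  # appends the new summary row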
I have a dataframe which is characterized by many different IDs. For every ID there are multiple events, characterized by the cumulative time elapsed between events (hours) and the duration of each event (seconds). So it would look something like:
Id <- c(1,1,1,1,1,1,2,2,2,2,2)
cumulative_time<-c(0,3.58,8.88,11.19,21.86,29.54,0,5,14,19,23)
duration<-c(188,124,706,53,669,1506.2,335,349,395,385,175)
test = data.frame(Id,cumulative_time,duration)
> test
Id cumulative_time duration
1 1 0.00 188.0
2 1 3.58 124.0
3 1 8.88 706.0
4 1 11.19 53.0
5 1 21.86 669.0
6 1 29.54 1506.2
7 2 0.00 335.0
8 2 5.00 349.0
9 2 14.00 395.0
10 2 19.00 385.0
11 2 23.00 175.0
I would like to group by the ID and then restructure each group by sampling in cumulative windows of, say, 10 hours, and within each 10-hour window sum the durations that occurred in that interval. The bins should span, say, 0 to 30 hours; thus there would be 3 bins.
I looked at the cut function and managed to hack something together within a dataframe; even as a new R user I know it isn't pretty:
test_cut = test %>%
  mutate(bin_durations = cut(test$cumulative_time, breaks = c(0, 10, 20, 30),
                             labels = c("10", "20", "30"), include.lowest = TRUE)) %>%
  group_by(Id, bin_durations) %>%
  mutate(total_duration = sum(duration)) %>%
  select(Id, bin_durations, total_duration) %>%
  distinct()
which gives the output:
test_cut
Id bin_durations total_duration
1 1 10 1018.0
2 1 20 53.0
3 1 30 2175.2
4 2 10 684.0
5 2 20 780.0
6 2 30 175.0
Ultimately I want the interval window and number of bins to be arbitrary; if I have a span of 5000 hours and want to bin in 1-hour samples, I would use breaks = seq(0, 5000, 1) and, for the bins, labels = as.character(seq(1, 5000, 1)).
This will also be applied to a very large data frame, so computational speed is somewhat desired.
A dplyr solution would be great since I am applying the binning per group.
My guess is there is a nice interaction between cut and perhaps split to generate the desired output.
Thanks in advance.
Update
After testing, I find that even my current implementation isn't quite what I'd like. If I say:
n = 3

test_cut = test %>%
  mutate(bin_durations = cut(test$cumulative_time, breaks = seq(0, 30, n),
                             labels = as.character(seq(n, 30, n)),
                             include.lowest = TRUE)) %>%
  group_by(Id, bin_durations) %>%
  mutate(total_duration = sum(duration)) %>%
  select(Id, bin_durations, total_duration) %>%
  distinct()
I get
test_cut
# A tibble: 11 x 3
# Groups: Id, bin_durations [11]
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 3 188
2 1 6 124
3 1 9 706
4 1 12 53
5 1 24 669
6 1 30 1506.
7 2 3 335
8 2 6 349
9 2 15 395
10 2 21 385
11 2 24 175
Where there are no occurrences in a bin I should just get 0 in the duration column, rather than an omission.
Thus, it should look like:
test_cut
# A tibble: 11 x 3
# Groups: Id, bin_durations [11]
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 3 188
2 1 6 124
3 1 9 706
4 1 12 53
5 1 15 0
6 1 18 0
7 1 21 0
8 1 24 669
9 1 27 0
10 1 30 1506.
11 2 3 335
12 2 6 349
13 2 9 0
14 2 12 0
15 2 15 395
16 2 18 0
17 2 21 385
18 2 24 175
19 2 27 0
20 2 30 0
Here is one idea, via integer division (%/%):

library(tidyverse)

test %>%
  group_by(Id, grp = cumulative_time %/% 10) %>%
  summarise(total_duration = sum(duration))
which gives,
# A tibble: 6 x 3
# Groups: Id [?]
Id grp total_duration
<dbl> <dbl> <dbl>
1 1 0 1018
2 1 1 53
3 1 2 2175.
4 2 0 684
5 2 1 780
6 2 2 175
To address your updated issue, we can use complete in order to add the missing rows. So, for the same example, binning in 3-hour intervals:
test %>%
  group_by(Id, grp = cumulative_time %/% 3) %>%
  summarise(total_duration = sum(duration)) %>%
  ungroup() %>%
  complete(Id, grp = seq(min(grp), max(grp)), fill = list(total_duration = 0))
which gives,
# A tibble: 20 x 3
Id grp total_duration
<dbl> <dbl> <dbl>
1 1 0 188
2 1 1 124
3 1 2 706
4 1 3 53
5 1 4 0
6 1 5 0
7 1 6 0
8 1 7 669
9 1 8 0
10 1 9 1506.
11 2 0 335
12 2 1 349
13 2 2 0
14 2 3 0
15 2 4 395
16 2 5 0
17 2 6 385
18 2 7 175
19 2 8 0
20 2 9 0
We could make these changes:
test$cumulative_time can be simply cumulative_time
breaks could be factored out and then used in the cut as shown
the second mutate could be changed to summarize, in which case the select and distinct are not needed
it is always a good idea to close any group_by with a matching ungroup, or in the case of summarize we can use .groups = "drop"
add complete to insert 0 for levels not present
Implementing these changes we have:
library(dplyr)
library(tidyr)

breaks <- seq(0, 40, 10)

test %>%
  mutate(bin_durations = cut(cumulative_time, breaks = breaks,
                             labels = breaks[-1], include.lowest = TRUE)) %>%
  group_by(Id, bin_durations) %>%
  summarize(total_duration = sum(duration), .groups = "drop") %>%
  complete(Id, bin_durations, fill = list(total_duration = 0))
giving:
# A tibble: 8 x 3
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 10 1018
2 1 20 53
3 1 30 2175.
4 1 40 0
5 2 10 684
6 2 20 780
7 2 30 175
8 2 40 0
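For very large data, a data.table sketch of the same binning (assuming the test data frame above): dcast() with drop = FALSE keeps the empty bins, and fill = 0 zeroes them.

library(data.table)

dt <- as.data.table(test)
breaks <- seq(0, 30, 10)
dt[, bin := cut(cumulative_time, breaks = breaks,
                labels = breaks[-1], include.lowest = TRUE)]

# one row per Id, one column per bin; empty Id/bin combinations become 0
dcast(dt, Id ~ bin, value.var = "duration", fun.aggregate = sum,
      drop = FALSE, fill = 0)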
Imagine this is the structure of my data hrd:
'data.frame': 14999 obs. of 2 variables:
$ left : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2
$ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
I want to know the percentage of people who have left (0 = stayed, 1 = left) for each level of sales.
This is the closest I've come:
hrd %>% group_by(sales) %>% count(left)
However, the output is this:
sales left n
<fctr> <fctr> <int>
1 accounting 0 563
2 accounting 1 204
3 hr 0 524
4 hr 1 215
5 IT 0 954
6 IT 1 273
7 management 0 539
8 management 1 91
9 marketing 0 655
10 marketing 1 203
11 product_mng 0 704
12 product_mng 1 198
13 RandD 0 666
14 RandD 1 121
15 sales 0 3126
16 sales 1 1014
17 support 0 1674
18 support 1 555
19 technical 0 2023
20 technical 1 697
I'm trying something like this:
hrd %>% group_by(sales)
%>% summarise(count = n() )
%>% mutate( leaving_rate = count(left == 1 )/ count )
But the error message is saying
Error: object 'left' not found
Don't use summarise() first, because it truncates your data frame to a summary version, dropping the column "left" (and any other unmentioned, non-grouping variables) and keeping only "sales" (the grouping variable) and "count" (the mentioned variable).
You can do it in one summarise call like this (left is a factor, so compare to the level rather than summing it):
hrd %>%
  group_by(sales) %>%
  summarise(percent_left = mean(left == "1"))
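Equivalently, starting from the count() output you already have (a sketch):

hrd %>%
  count(sales, left) %>%
  group_by(sales) %>%
  mutate(leaving_rate = n / sum(n)) %>%  # share of each group that stayed/left
  filter(left == "1")                    # keep only the leaving rate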
I am having trouble summing selected columns within a data frame, a basic problem for which I've seen numerous similar, but not identical, questions/answers on Stack Overflow.
With this perhaps overly complex data frame:
site<-c(223,257,223,223,257,298,223,298,298,211)
moisture<-c(7,7,7,7,7,8,7,8,8,5)
shade<-c(83,18,83,83,18,76,83,76,76,51)
sampleID<-c(158,163,222,107,106,166,188,186,262,114)
bluestm<-c(3,4,6,3,0,0,1,1,1,0)
foxtail<-c(0,2,0,4,0,1,1,0,3,0)
crabgr<-c(0,0,2,0,33,0,2,1,2,0)
johnson<-c(0,0,0,7,0,8,1,0,1,0)
sedge1<-c(2,0,3,0,0,9,1,0,4,0)
sedge2<-c(0,0,1,0,1,0,0,1,1,1)
redoak<-c(9,1,0,5,0,4,0,0,5,0)
blkoak<-c(0,22,0,23,0,23,22,17,0,0)
my.data<-data.frame(site,moisture,shade,sampleID,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak)
I want to sum the counts of each plant species (bluestem, foxtail, etc. - columns 5-12 in this example) within each site, by summing rows that have the same site number. I also want to keep the information about moisture and shade (these are consistent within a site, but may also be the same between sites), and I want a new column that counts the number of rows summed.
The result would look like this:
site,moisture,shade,NumSamples,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak
211,5,51,1,0,0,0,0,0,1,0,0
223,7,83,4,13,5,4,8,6,1,14,45
257,7,18,2,4,2,33,0,0,1,1,22
298,8,76,3,2,4,3,9,13,2,9,40
The problem I am having is that my real data sets (and I have several of them) have from 50 to 300 plant species, and I want to refer to a range of columns (in this case, [5:12]) instead of my.data$foxtail, my.data$sedge1, etc., which is going to be very difficult with 300 species.
I know I can start off by deleting the column I don't need (sampleID):
my.data$sampleID <- NULL
but then how do I get the sums? I've messed with the aggregate command and with ddply, and have seen lots of examples which call particular column names, but just haven't gotten anything to work. I recognize this is a variant of a commonly asked and simple type of question, but I've spent hours without resolving it on my own. So, apologies for my stupidity!
This works ok:
x <- aggregate(my.data[, 5:12],
               by = list(site = my.data$site, moisture = my.data$moisture,
                         shade = my.data$shade),
               FUN = sum, na.rm = TRUE)

library(dplyr)
my.data %>%
  group_by(site) %>%
  tally() %>%
  left_join(x)
site n moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 1 5 51 0 0 0 0 0 1 0 0
2 223 4 7 83 13 5 4 8 6 1 14 45
3 257 2 7 18 4 2 33 0 0 1 1 22
4 298 3 8 76 2 4 3 9 13 2 9 40
Or to do it all in dplyr
my.data %>%
  group_by(site) %>%
  tally() %>%
  left_join(my.data) %>%
  group_by(site, moisture, shade, n) %>%
  summarise_each(funs(sum = sum)) %>%
  select(-sampleID)
site moisture shade n bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 5 51 1 0 0 0 0 0 1 0 0
2 223 7 83 4 13 5 4 8 6 1 14 45
3 257 7 18 2 4 2 33 0 0 1 1 22
4 298 8 76 3 2 4 3 9 13 2 9 40
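Note that summarise_each() and funs() are deprecated in current dplyr; an across() version of the same idea (a sketch) also yields the row count directly:

library(dplyr)

my.data %>%
  group_by(site, moisture, shade) %>%
  summarise(NumSamples = n(),             # rows summed per site
            across(bluestm:blkoak, sum),  # sum every species column at once
            .groups = "drop")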
Try the following using base R (it assumes sampleID has already been dropped, i.e. my.data$sampleID <- NULL, so the species occupy columns 4 to 11):
outdf <- data.frame(site = numeric(), moisture = numeric(), shade = numeric(),
                    bluestm = numeric(), foxtail = numeric(), crabgr = numeric(),
                    johnson = numeric(), sedge1 = numeric(), sedge2 = numeric(),
                    redoak = numeric(), blkoak = numeric())
my.data$basic <- with(my.data, paste(site, moisture, shade))

for (b in unique(my.data$basic)) {
  outdf[nrow(outdf) + 1, 1:3] <- unlist(strsplit(b, ' '))
  for (i in 4:11)
    outdf[nrow(outdf), i] <- sum(my.data[my.data$basic == b, i])
}
outdf
site moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 223 7 83 13 5 4 8 6 1 14 45
2 257 7 18 4 2 33 0 0 1 1 22
3 298 8 76 2 4 3 9 13 2 9 40
4 211 5 51 0 0 0 0 0 1 0 0
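Since the real worry is naming 300 species columns, it's also worth noting that base R's rowsum() takes the whole block by position (a sketch; moisture, shade, and the row counts can be merged back in as in the answers above):

# sums columns 5:12 of the original my.data (before dropping sampleID) by site,
# without naming any species column
counts <- rowsum(my.data[, 5:12], group = my.data$site)
counts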
Here is a sample data frame:
> df = data.frame(rep(seq(0, 120, length.out=6), times = 2), c(sample(1:50, 4),
+ NA, NA, NA, sample(1:50, 5)))
> colnames(df) = c("Time", "Pat1")
> df
Time Pat1
1 0 33
2 24 48
3 48 7
4 72 8
5 96 NA
6 120 NA
7 0 NA
8 24 1
9 48 6
10 72 28
11 96 31
12 120 32
NAs which have to be replaced are identified by which and logical operators:
x = which(is.na(df$Pat1) & df$Time == 0)
I know the na.locf() command from zoo, but it replaces all NAs. How can I replace only the NAs at the positions in x in a multi-column df?
EDIT: Here is a link to my original dataset: link
And that's how far I get:
require(reshape2)
require(zoo)
pad.88 <- read.csv2("pad_88.csv")
colnames(pad.88) = c("Time", "Increment", "Side", 4:length(pad.88)-3)
attach(pad.88)
x = which(Time == 240 & Increment != 5)
pad.88 = pad.88[c(1:x[1], x[1]:x[2], x[2]:x[3], x[3]:x[4], x[4]:x[5], x[5]:x[6],x[6]:x[7], x[7]:x[8], x[8]:nrow(pad.88)),]
y = which(duplicated(pad.88))
pad.88$Time[y] = 0
pad.88$Increment[y] = Increment[x] + 1
z = which(is.na(pad.88[4:ncol(pad.88)] & pad.88$Time == 0), arr.ind=T)
a = na.locf(pad.88[4:ncol(pad.88)])
My next step is something like pat.cols[z] = a[z], which doesn't work.
This is how the result should look:
Time Increment Side 1 2 3 4 5 ...
150 4 0 27,478 24,076 27,862 20,001 25,261
165 4 0 27,053 24,838 27,231 20,001 NA
180 4 0 27,599 24,166 27,862 20,687 NA
195 4 0 27,114 23,403 27,862 20,001 NA
210 4 0 26,993 24,076 27,189 19,716 NA
225 4 0 26,629 24,21 26,221 19,887 NA
240 4 0 26,811 26,228 26,431 20,001 NA
0 5 1 26,811 26,228 26,431 20,001 25,261
15 5 1 ....
The last valid value in col 5 is 25,261. This value replaces the NA at Time 0/Col 5.
You can change it so that x records all the NA positions, and then use the first and last of those to identify the locations you want.
df
Time Pat1
1 0 36
2 24 13
3 48 32
4 72 38
5 96 NA
6 120 NA
7 0 NA
8 24 5
9 48 10
10 72 7
11 96 25
12 120 28
x <- which(is.na(df$Pat1))
df[rev(x)[1],"Pat1"] <- df[x[1]-1,"Pat1"]
df
Time Pat1
1 0 36
2 24 13
3 48 32
4 72 38
5 96 NA
6 120 NA
7 0 38
8 24 5
9 48 10
10 72 7
11 96 25
12 120 28
For the multi-column example use the same idea in a sapply call:
cbind(df[1], sapply(df[-1], function(x) {
  y <- which(is.na(x))
  x[rev(y)[1]] <- x[y[1] - 1]
  x
}))
Time Pat1 Pat2
1 0 41 42
2 24 8 30
3 48 3 41
4 72 14 NA
5 96 NA NA
6 120 NA NA
7 0 14 41
8 24 5 37
9 48 29 48
10 72 31 11
11 96 50 43
12 120 46 21
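Alternatively, since zoo is already loaded, one sketch is to run na.locf() on each column and copy back only the positions flagged by the which() condition (shown here for the two-column df from the top of the question):

library(zoo)

x <- which(is.na(df$Pat1) & df$Time == 0)  # only the NAs to be replaced
filled <- na.locf(df$Pat1, na.rm = FALSE)  # last observation carried forward
df$Pat1[x] <- filled[x]                    # overwrite just those positions

For several patient columns, the same three lines can be wrapped in lapply() over the columns.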