Consider this toy data frame. I would like to create a new data frame that keeps only the rows where "birds" is below its average and "wolfs" is less than the two top values after the maximum value of "wolfs". For this data frame, that should leave only the rows with userid 543, 608, 987, 225, 988 and 556.
I used these two lines of code for the first constraint but couldn't find a solution for the second constraint.
df$filt <- ifelse(df$birds < mean(df$birds), 1, 0)
df1 <- df[which(df$filt == 1), ]
How can I implement the second constraint?
Here is the toy dataframe:
df <- read.table(text = "userid target birds wolfs
222 1 9 7
444 1 8 4
234 0 2 8
543 1 2 3
678 1 8 3
987 0 1 2
294 1 7 1
608 0 1 5
123 1 9 7
321 1 8 7
226 0 2 7
556 0 2 3
334 1 6 3
225 0 1 1
999 0 3 9
988 0 1 1 ",header = TRUE)
subset(df, birds < mean(birds) & wolfs < sort(unique(wolfs), decreasing = TRUE)[3])
## userid target birds wolfs
## 4 543 1 2 3
## 6 987 0 1 2
## 8 608 0 1 5
## 12 556 0 2 3
## 14 225 0 1 1
## 16 988 0 1 1
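Here is a solution, though some of the constraints may not be clear to me, because it seemed to match a different row from your desired output.
avbi <- mean(df$birds)
# Third-largest element of wolfs, counting duplicates; here it is 7 because 9 and 8 each occur only once.
ttw <- sort(df$wolfs, decreasing = TRUE)[3]
df[df$birds < avbi & df$wolfs < ttw, ]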
userid target birds wolfs
4 543 1 2 3
6 987 0 1 2
8 608 0 1 5
12 556 0 2 3
14 225 0 1 1
16 988 0 1 1
Or with dplyr:
library(dplyr)
df %>% filter(birds < avbi & wolfs < ttw)
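The two answers above differ only in whether unique() is applied before taking the third element. On the toy data both give 7, but they can diverge when the largest values are repeated. A minimal sketch, using a hypothetical vector (not the question's data) in which the maximum occurs twice:

```r
# Hypothetical vector (an assumption for illustration): the maximum, 9, occurs twice.
wolfs <- c(9, 9, 8, 7, 7, 5, 3)

# Third-largest element, counting duplicates: the repeated 9 uses up a slot.
with_ties <- sort(wolfs, decreasing = TRUE)[3]

# Third-largest distinct value: the maximum and the two values after it.
distinct <- sort(unique(wolfs), decreasing = TRUE)[3]

# Values kept under each threshold.
kept_with_ties <- wolfs[wolfs < with_ties]
kept_distinct  <- wolfs[wolfs < distinct]
```

If "the two top values after the maximum" is meant in terms of distinct values, the unique() version is probably the safer choice.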
Consider a data frame df with an extract from a web server access log, with two fields (sample below; duration is in msec and, to simplify the example, let's ignore the date).
time,duration
18:17:26.552,8
18:17:26.632,10
18:17:26.681,12
18:17:26.733,4
18:17:26.778,5
18:17:26.832,5
18:17:26.889,4
18:17:26.931,3
18:17:26.991,3
18:17:27.040,5
18:17:27.157,4
18:17:27.209,14
18:17:27.249,4
18:17:27.303,4
18:17:27.356,13
18:17:27.408,13
18:17:27.450,3
18:17:27.506,13
18:17:27.546,3
18:17:27.616,4
18:17:27.664,4
18:17:27.718,3
18:17:27.796,10
18:17:27.856,3
18:17:27.909,3
18:17:27.974,3
18:17:28.029,3
qplot(time, duration, data=df) gives me a graph of the duration. I'd like to superimpose a line showing the number of requests for each minute. Ideally, this line would have a single data point per minute, at the :30 sec point. If that's too complicated, an acceptable alternative is a step line with the same value (the count of requests) throughout each minute.
One way is to use trunc(df$time, units=c("mins")), then calculate the count of requests per minute into a new column, then graph it.
I'm asking if there is, perhaps, a more direct way to accomplish the above. Thanks.
The following may be helpful. Create a data frame (ddf below) with a steps column and plot it:
time duration sec sec2 diffsec2 step30s steps
1 18:17:26.552 8 26.552 552 0 0 0
2 18:17:26.632 10 26.632 632 80 1 1
3 18:17:26.681 12 26.681 681 49 0 0
4 18:17:26.733 4 26.733 733 52 1 1
5 18:17:26.778 5 26.778 778 45 0 0
6 18:17:26.832 5 26.832 832 54 1 1
7 18:17:26.889 4 26.889 889 57 1 2
8 18:17:26.931 3 26.931 931 42 0 0
9 18:17:26.991 3 26.991 991 60 1 1
10 18:17:27.040 5 27.040 040 -951 0 0
11 18:17:27.157 4 27.157 157 117 1 1
12 18:17:27.209 14 27.209 209 52 1 2
13 18:17:27.249 4 27.249 249 40 0 0
14 18:17:27.303 4 27.303 303 54 1 1
15 18:17:27.356 13 27.356 356 53 1 2
16 18:17:27.408 13 27.408 408 52 1 3
17 18:17:27.450 3 27.450 450 42 0 0
18 18:17:27.506 13 27.506 506 56 1 1
19 18:17:27.546 3 27.546 546 40 0 0
20 18:17:27.616 4 27.616 616 70 1 1
21 18:17:27.664 4 27.664 664 48 0 0
22 18:17:27.718 3 27.718 718 54 1 1
23 18:17:27.796 10 27.796 796 78 1 2
24 18:17:27.856 3 27.856 856 60 1 3
25 18:17:27.909 3 27.909 909 53 1 4
26 18:17:27.974 3 27.974 974 65 1 5
27 18:17:28.029 3 28.029 029 -945 0 0
library(ggplot2)
ggplot(ddf) + geom_point(aes(x=time, y=duration)) + geom_line(aes(x=time, y=steps, group=1), color='red')
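The per-minute count described in the question (truncate to the minute, count the requests, place one point at :30) can be sketched in base R as follows. This is only an illustration under stated assumptions: log_df and its rows are hypothetical sample data in the same format as the log extract, and the names per_min, requests and mid are made up here.

```r
# Hypothetical sample rows in the same "time,duration" format as the log extract,
# spread over two minutes so that the per-minute counts differ.
log_df <- data.frame(
  time = c("18:17:26.552", "18:17:26.632", "18:17:59.100",
           "18:18:03.040", "18:18:27.157"),
  duration = c(8, 10, 12, 5, 4),
  stringsAsFactors = FALSE
)

# Truncate each timestamp to its minute by keeping the "HH:MM" prefix.
log_df$minute <- substr(log_df$time, 1, 5)

# Count the requests that fall in each minute.
per_min <- as.data.frame(table(minute = log_df$minute),
                         responseName = "requests",
                         stringsAsFactors = FALSE)

# Place one point per minute at the :30 second mark
# (the date part defaults to the current date and is ignored here).
per_min$mid <- as.POSIXct(paste0(per_min$minute, ":30"),
                          format = "%H:%M:%S", tz = "UTC")

per_min
```

Assuming the time column of the plotted data is parsed to POSIXct in the same way, per_min could then be added to a ggplot2 plot with something like geom_line(data = per_min, aes(x = mid, y = requests)), or with geom_step() for the step-line alternative.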
I am having trouble summing selected columns within a data frame, a basic problem for which I've seen numerous similar, but not identical, questions and answers on StackOverflow.
With this perhaps overly complex data frame:
site<-c(223,257,223,223,257,298,223,298,298,211)
moisture<-c(7,7,7,7,7,8,7,8,8,5)
shade<-c(83,18,83,83,18,76,83,76,76,51)
sampleID<-c(158,163,222,107,106,166,188,186,262,114)
bluestm<-c(3,4,6,3,0,0,1,1,1,0)
foxtail<-c(0,2,0,4,0,1,1,0,3,0)
crabgr<-c(0,0,2,0,33,0,2,1,2,0)
johnson<-c(0,0,0,7,0,8,1,0,1,0)
sedge1<-c(2,0,3,0,0,9,1,0,4,0)
sedge2<-c(0,0,1,0,1,0,0,1,1,1)
redoak<-c(9,1,0,5,0,4,0,0,5,0)
blkoak<-c(0,22,0,23,0,23,22,17,0,0)
my.data<-data.frame(site,moisture,shade,sampleID,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak)
I want to sum the counts of each plant species (bluestem, foxtail, etc.; columns 5-12 in this example) within each site, by summing rows that have the same site number. I also want to keep the information about moisture and shade (these are consistent within a site, but may also be the same between sites), and I want a new column that holds the number of rows summed.
The result would look like this:
site,moisture,shade,NumSamples,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak
211,5,51,1,0,0,0,0,0,1,0,0
223,7,83,4,13,5,4,8,6,1,14,45
257,7,18,2,4,2,33,0,0,1,1,22
298,8,76,3,2,4,3,9,13,2,9,40
The problem I am having is that my real data sets (and I have several of them) have from 50 to 300 plant species, and I want to refer to a range of columns (in this case, [5:12]) instead of my.data$foxtail, my.data$sedge1, etc., which is going to be very difficult with 300 species.
I know I can start off by deleting the column I don't need (sampleID), after which the species columns become 4:11:
my.data$sampleID <- NULL
But then how do I get the sums? I've messed with the aggregate command and with ddply, and have seen lots of examples which call particular column names, but just haven't gotten anything to work. I recognize this is a variant of a commonly asked and simple type of question, but I've spent hours without resolving it on my own. So, apologies for my stupidity!
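This works OK:
# Columns 5:12 are the species, assuming sampleID is still present as column 4.
x <- aggregate(my.data[, 5:12], by = list(site = my.data$site, moisture = my.data$moisture, shade = my.data$shade), FUN = sum, na.rm = TRUE)
library(dplyr)
# Count the rows per site, then attach the per-site sums.
my.data %>%
  group_by(site) %>%
  tally() %>%
  left_join(x, by = "site")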
site n moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 1 5 51 0 0 0 0 0 1 0 0
2 223 4 7 83 13 5 4 8 6 1 14 45
3 257 2 7 18 4 2 33 0 0 1 1 22
4 298 3 8 76 2 4 3 9 13 2 9 40
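Or to do it all in dplyr:
# summarise_each() and funs() are deprecated in current dplyr; across() is the replacement.
my.data %>%
  group_by(site) %>%
  tally() %>%
  left_join(my.data, by = "site") %>%
  group_by(site, moisture, shade, n) %>%
  summarise(across(everything(), sum)) %>%
  select(-sampleID)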
site moisture shade n bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 5 51 1 0 0 0 0 0 1 0 0
2 223 7 83 4 13 5 4 8 6 1 14 45
3 257 7 18 2 4 2 33 0 0 1 1 22
4 298 8 76 3 2 4 3 9 13 2 9 40
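Try the following using base R:
# Assumes sampleID has already been dropped, so the species are in columns 4:11 of my.data.
outdf<-data.frame(site=numeric(),moisture=numeric(),shade=numeric(),bluestm=numeric(),foxtail=numeric(),crabgr=numeric(),johnson=numeric(),sedge1=numeric(),sedge2=numeric(),redoak=numeric(),blkoak=numeric())
# One key per distinct site/moisture/shade combination.
my.data$basic = with(my.data, paste(site, moisture, shade))
for(b in unique(my.data$basic)) {
  # Convert the key parts back to numbers so the first three columns stay numeric.
  outdf[nrow(outdf)+1,1:3] = as.numeric(unlist(strsplit(b,' ')))
  for(i in 4:11)
    outdf[nrow(outdf),i] = sum(my.data[my.data$basic==b,i])
}
outdf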
site moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 223 7 83 13 5 4 8 6 1 14 45
2 257 7 18 4 2 33 0 0 1 1 22
3 298 8 76 2 4 3 9 13 2 9 40
4 211 5 51 0 0 0 0 0 1 0 0
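For comparison, the same result, including the row count, can be sketched with base R alone by selecting the species through a column range, as the question asks. This is one possible approach rather than a definitive one; the names keys, species, sums, counts and result are made up here, and the range 5:12 assumes sampleID is still in the data frame as column 4.

```r
# Data from the question, built directly as a data frame.
my.data <- data.frame(
  site     = c(223,257,223,223,257,298,223,298,298,211),
  moisture = c(7,7,7,7,7,8,7,8,8,5),
  shade    = c(83,18,83,83,18,76,83,76,76,51),
  sampleID = c(158,163,222,107,106,166,188,186,262,114),
  bluestm  = c(3,4,6,3,0,0,1,1,1,0),
  foxtail  = c(0,2,0,4,0,1,1,0,3,0),
  crabgr   = c(0,0,2,0,33,0,2,1,2,0),
  johnson  = c(0,0,0,7,0,8,1,0,1,0),
  sedge1   = c(2,0,3,0,0,9,1,0,4,0),
  sedge2   = c(0,0,1,0,1,0,0,1,1,1),
  redoak   = c(9,1,0,5,0,4,0,0,5,0),
  blkoak   = c(0,22,0,23,0,23,22,17,0,0)
)

# Grouping columns, and the species selected by position (assumes sampleID is column 4).
keys    <- c("site", "moisture", "shade")
species <- names(my.data)[5:12]

# Sum every species column within each site/moisture/shade group.
sums <- aggregate(my.data[species], by = my.data[keys], FUN = sum)

# Count how many rows went into each group.
counts <- aggregate(list(NumSamples = my.data$sampleID), by = my.data[keys], FUN = length)

# Combine counts and sums; merge() sorts the result on the key columns by default.
result <- merge(counts, sums, by = keys)
result
```

With 300 species only the range in names(my.data)[5:12] would need to change.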