if statement with ddply function - r

I am trying to use the if statement with ddply but am having issues with the if statement.
An example dataset is:
data<-data.frame(Gear=c(rep("S",10),rep("C",10)),TowSurvey=c(0,0,1,1,0,1,1,1,1,0),TowCom=c(0,1,1,1,0,1,1,1,1,0),
StationID=c(1,2,3,4,5,6,7,8,9,10),Totwght=c(2,8,6,4,12,9,56,7,89,10),Totexpwght=c(5,8,12,45,89,56,23,78,56,41),
Expnum=c(1,5,6,98,45,2,6,3,7,45),Exp=c(56,25,85,74,1,23,56,45,89,75))
My first try was
if(data$Gear=="S" & data$TowSurvey== 1 | data$Gear=="C" & data$TowCom== 1){
datad<-ddply(data, .(StationID,Gear), summarize,Totwghtpertow=sum(Totwght),
Totexppertow=sum(Totexpwght),Totnum =sum(Expnum),Totexpnum=sum(Exp))}
print(datad)
But the records that don't meet the if statement criteria are included in datad.
Then I found this post: Aggregate (count) rows that match a condition, group by unique values.
So my second attempt based on the answer from the post was
datad<-ddply(data, .(StationID,Gear), summarize,Totwghtpertow=sum(Totwght[Gear=="S" & TowSurvey== 1 | Gear=="C" & TowCom== 1]))
I only tried with one column as a test and am getting the same results. Any help would be appreciated in trying to figure this out.
Thanks

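Your first attempt cannot filter rows, because if evaluates a single logical value, not a vector with one element per row. With a longer condition, R 4.2.0 and later stop with an error; older versions warn and use only the first element.
You don't need an if statement here. Subsetting your data first will do just fine.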
data_sub <- subset(data, (data$Gear=="S" & data$TowSurvey== 1) | (data$Gear=="C" & data$TowCom== 1))
You can run your ddply statement using data_sub rather than data.
Or, if you're going to be using this a lot, you can wrap it in a function:
datad_func <- function(data){
data_sub <- subset(data, (data$Gear=="S" & data$TowSurvey== 1) | (data$Gear=="C" & data$TowCom== 1))
datad<-ddply(data_sub, .(StationID,Gear), summarize,Totwghtpertow=sum(Totwght),
Totexppertow=sum(Totexpwght),Totnum =sum(Expnum),Totexpnum=sum(Exp))
rm('data_sub')
print(datad)
}
datad_func(data)
StationID Gear Totwghtpertow Totexppertow Totnum Totexpnum
1 2 C 8 8 5 25
2 3 C 6 12 6 85
3 3 S 6 12 6 85
4 4 C 4 45 98 74
5 4 S 4 45 98 74
6 6 C 9 56 2 23
7 6 S 9 56 2 23
8 7 C 56 23 6 56
9 7 S 56 23 6 56
10 8 C 7 78 3 45
11 8 S 7 78 3 45
12 9 C 89 56 7 89
13 9 S 89 56 7 89
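To make the difference between if and subsetting concrete, here is a minimal sketch that rebuilds part of the question's data and uses base R's aggregate in place of ddply (a swapped-in technique, so that the example runs without plyr). The names keep and totals are made up for illustration.

```r
# Rebuild the question's data; the 10-element columns are recycled to 20 rows.
data <- data.frame(Gear = c(rep("S", 10), rep("C", 10)),
                   TowSurvey = c(0, 0, 1, 1, 0, 1, 1, 1, 1, 0),
                   TowCom = c(0, 1, 1, 1, 0, 1, 1, 1, 1, 0),
                   StationID = 1:10,
                   Totwght = c(2, 8, 6, 4, 12, 9, 56, 7, 89, 10))

# The condition is a logical vector with one value per row, which is why
# it cannot be the condition of a single if().
keep <- with(data, (Gear == "S" & TowSurvey == 1) | (Gear == "C" & TowCom == 1))
length(keep)  # one TRUE/FALSE per row
sum(keep)     # number of rows that pass the filter

# Use the logical vector to subset, then summarise per group.
data_sub <- data[keep, ]
totals <- aggregate(Totwght ~ StationID + Gear, data = data_sub, FUN = sum)
print(totals)
```

The number of rows in totals should match the 13 rows of the ddply output above, since each StationID and Gear pair occurs once in this data.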

plyr is not well suited to subsetting inside the summarizing call, so you can subset before or after, as @scribbles said.
You could also try dplyr and pipe them together:
library(dplyr)
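data %>%
  # inside filter(), refer to columns by bare name rather than data$...
  filter((Gear == "S" & TowSurvey == 1) | (Gear == "C" & TowCom == 1)) %>%
  group_by(StationID, Gear) %>%
  # summarise_each() and funs() are deprecated; across() needs dplyr >= 1.0.0
  summarise(across(c(Totwght, Totexpwght, Expnum, Exp), sum))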

Related

R countif and sum on multiple columns matching elements in specified vector

I am applying this function to my dataset column DL1 on another vector as below and receiving the results expected
table(df$DL1[df$DL1 %in% undefined_dl_codes])
Result:
0 10 30 3B 4 49 54 5A 60 7 78 8 90
24 366 4 3 665 40 1 1 14 8 4 87 1
However, I also have columns DL2, DL3 and DL4, which hold the same kind of data. How can I apply the function to all four columns and get one combined summary as the result?
Any help highly appreciated!
May not be the best of the methods, however you could do the following
table(c(df$DL1[df$DL1 %in% undefined_dl_codes],
df$DL2[df$DL2 %in% undefined_dl_codes],
df$DL3[df$DL3 %in% undefined_dl_codes],
df$DL4[df$DL4 %in% undefined_dl_codes]
)
)
Using Raghuveer's solution, I simplified it further:
attach(df)
table(c(DL1,DL2,DL3,DL4)[c(DL1,DL2,DL3,DL4) %in% undefined_dl_codes])
detach(df)
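The same combined count can also be written without attach, which avoids leaving df on the search path if something fails in between. This is a minimal sketch on made-up stand-in data: the values in df and undefined_dl_codes are hypothetical, and only the column names come from the question. Converting with as.character guards against factor columns being combined by their integer codes.

```r
# Hypothetical stand-in data; column names follow the question.
df <- data.frame(DL1 = c("10", "4", "99", "4"),
                 DL2 = c("4", "10", "8", "77"),
                 DL3 = c("8", "8", "4", "10"),
                 DL4 = c("55", "4", "10", "10"),
                 stringsAsFactors = FALSE)
undefined_dl_codes <- c("10", "4", "8")

# Stack the four columns into one character vector, then count matches.
dl_cols <- c("DL1", "DL2", "DL3", "DL4")
all_codes <- unlist(lapply(df[dl_cols], as.character), use.names = FALSE)
counts <- table(all_codes[all_codes %in% undefined_dl_codes])
print(counts)
```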

How to use apply function instead of for loop if you have multiple if conditions to be executed

1st DF:
t.d
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
names(t.d) <- c("ID","A","B","C")
t.d$FinalTime <- c("7/30/2009 08:18:35","9/30/2009 19:18:35","11/30/2009 21:18:35","13/30/2009 20:18:35","15/30/2009 04:18:35")
t.d$InitTime <- c("6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35")
>t.d
ID A B C FinalTime InitTime
1 1 6 11 16 7/30/2009 08:18:35 6/30/2009 9:18:35
2 2 7 12 17 9/30/2009 19:18:35 6/30/2009 9:18:35
3 3 8 13 18 11/30/2009 21:18:35 6/30/2009 9:18:35
4 4 9 14 19 13/30/2009 20:18:35 6/30/2009 9:18:35
5 5 10 15 20 15/30/2009 04:18:35 6/30/2009 9:18:35
2nd DF:
> s.d
F D E Time
1 10 19 28 6/30/2009 08:18:35
2 11 20 29 8/30/2009 19:18:35
3 12 21 30 9/30/2009 21:18:35
4 13 22 31 01/30/2009 20:18:35
5 14 23 32 10/30/2009 04:18:35
6 15 24 33 11/30/2009 04:18:35
7 16 25 34 12/30/2009 04:18:35
8 17 26 35 13/30/2009 04:18:35
9 18 27 36 15/30/2009 04:18:35
Output to be:
From DF "t.d" I have to calculate the time interval for each row between "FinalTime" and "InitTime" (InitTime will always be less than FinalTime).
Another DF "temp" from "s.d" has to be formed having data only within the above time interval, and then the most recent values of "F","D","E" have to be taken and attached to the 'ith' row of "t.d" from which the time interval was calculated.
Also we have to see if the newly formed DF "temp" has the following conditions true:
Here 'j' is the row index:
if ((temp$F[j] < 35.5) + (temp$D[j] >= 100) >= 1)
{
temp$Flag[j] <- 1
} else{
temp$Flag[j] <- 0
}
Originally I have 3 million rows in the dataframe and 20 columns in each DF.
I have solved the above problem with a for loop, but it takes 2 to 3 days because there are so many rows.
(I may also have to add new columns to the resulting DF when multiple conditions are satisfied on a row.)
Can anybody suggest a different technique? Like using apply functions?
My suggestion is:
use lapply over row indices
handle in the function call your if branches
return either your dataframe or NULL
combine everything with rbind
replace lapply with mclapply from the 'parallel' package to run the calls in parallel (this relies on forking, which is not available on Windows)
resultList <- lapply(seq_len(nrow(t.d)), function(i) {
  # do stuff for row i of t.d; 'condition' and 'df' are placeholders
  if (condition) {
    return(df)
  } else {
    return(NULL)
  }
})
resultDF <- do.call(rbind, resultList)
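As a rough illustration of that outline, here is a small runnable sketch. The toy data are hypothetical: plain numbers stand in for parsed date-times, and only a few of the question's columns are kept. Rows of t.d with no matching rows in s.d return NULL, and rbind drops them.

```r
# Hypothetical toy data: numeric times stand in for parsed date-times.
t.d <- data.frame(ID = 1:3, InitTime = c(0, 0, 0), FinalTime = c(5, 12, 1))
s.d <- data.frame(Time = c(2, 4, 8, 11),
                  F = c(40, 30, 50, 36),
                  D = c(10, 20, 150, 5))

resultList <- lapply(seq_len(nrow(t.d)), function(i) {
  # Rows of s.d that fall inside the interval of row i of t.d
  in_window <- s.d$Time >= t.d$InitTime[i] & s.d$Time <= t.d$FinalTime[i]
  temp <- s.d[in_window, ]
  if (nrow(temp) == 0) return(NULL)      # nothing to attach for this row
  latest <- temp[which.max(temp$Time), ] # most recent record in the window
  data.frame(ID = t.d$ID[i], F = latest$F, D = latest$D,
             # same test as (F < 35.5) + (D >= 100) >= 1 in the question
             Flag = as.integer(latest$F < 35.5 | latest$D >= 100))
})
resultDF <- do.call(rbind, resultList)
print(resultDF)
```

With 3 million rows the per-row filtering of s.d is still the expensive part, so this sketch shows the structure rather than a tuned solution.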

How to subset rows group-wise in R?

My question title is probably not appropriate, sorry for that. I have a CSV file named "table_parameter". The data look like this:
time Avg.PM10 sill range nugget
1 1 2012030101 52.269231 0.11054330 45574.072 0.037261216
2 2 2012030102 55.314286 0.20250974 87306.391 0.048315377
3 3 2012030103 56.038095 0.17711558 56806.827 0.034956709
4 4 2012030104 55.904762 0.16466350 104767.669 0.030752835
5 5 2012030105 57.123810 0.23638953 87306.391 0.037308364
6 6 2012030106 58.542857 0.24130317 87306.391 0.042108754
7 7 2012030107 60.066667 0.20362439 87306.391 0.037353980
8 8 2012030108 63.790476 0.19417801 87306.391 0.034144464
.
.
.
My data frame has a variable named time that holds hourly timestamps from 1 March 2012 to 7 March 2012 in numeric form. For example, 1 March 2012, 1:00 a.m. is written as 2012030101, and so on.
I want to subset this data frame by time of day, keeping only the morning hours (1:00 a.m. to 5:00 a.m.) of each of the 7 days. That is, I want all rows with time in 2012030101 to 2012030105, 2012030201 to 2012030205, ..., 2012030701 to 2012030705. In other words, I want a data frame like the one below:
time Avg.PM10 sill range nugget
1 49 49 2012030301 17.371429 0.7154449 48239.54 0.17163448
2 50 50 2012030302 17.811321 1.1201199 117603.55 0.12425337
3 51 51 2012030303 17.094340 0.5799705 55103.16 0.12061258
4 52 52 2012030304 16.679245 0.8486774 86725.77 0.15210005
5 53 53 2012030305 16.885714 1.2408621 154677.61 0.09743375
6 73 73 2012030401 21.619048 0.4417369 104767.67 0.08567888
7 74 74 2012030402 20.485714 2.0271124 215474.54 0.06340464
8 75 75 2012030403 20.552381 0.4509354 104767.67 0.06319812
9 76 76 2012030404 20.104762 0.4438798 104767.67 0.05639840
10 77 77 2012030405 20.133333 0.5050201 104767.67 0.09037341
.
.
.
For doing this I wrote these code:
table<-read.csv("table_parameter.csv")
table
table_morning<-subset(table, time %in% c(2012030101:2012030105,
2012030201:2012030205,
2012030301:2012030305,
2012030401:2012030405,
2012030501:2012030505,
2012030601:2012030605,
2012030701:2012030705) & Avg.PM10 <=30)
table_morning
But this code is not efficient: as you can see, I wrote out all the hour values to subset on. Doing the same for 90 days would be very tedious. So how can I do this subsetting efficiently? If you have any questions, please let me know.
You could use substring, like below:
table_morning <- subset(table, substring(time, 9, 10) %in% c("01", "02","03","04", "05") & Avg.PM10 <=30)
I would extract the hour from the time and then filter accordingly.
For example:
library(dplyr)
data_orpheus = read.csv('table_parameter.csv')
data_orpheus$hour = as.numeric(substr(as.character(data_orpheus$time),9,10))
data_morning = data_orpheus %>% filter(hour >= 1 & hour <= 5)
The dplyr operator %>% is not necessary; you could filter with data_morning = data_orpheus[with(data_orpheus, hour >= 1 & hour <= 5), ]
Update
I am still learning dplyr, so here is a beautiful one-liner that does it all:
data_morning = read.csv('table_parameter.csv') %>% # Read CSV
mutate(hours = as.numeric(substr(time,9,10))) %>% # Extract hours
filter(hours >= 1 & hours <= 5) %>% # Keep only mornings
select(-hours) # Drop hours, if not needed
head(data_morning)
X time Avg.PM10 sill range nugget
1 1 2012030101 52.26923 0.1105433 45574.07 0.03726122
2 2 2012030102 55.31429 0.2025097 87306.39 0.04831538
3 3 2012030103 56.03810 0.1771156 56806.83 0.03495671
4 4 2012030104 55.90476 0.1646635 104767.67 0.03075283
5 5 2012030105 57.12381 0.2363895 87306.39 0.03730836
6 25 2012030201 67.10476 0.1434977 72755.33 0.03003781
Thanks a lot for the other answers. Here is my own version, based on them, for future reference:
table<-read.csv("table_parameter.csv")
times<- as.numeric(substr(table$time,9,10))
table_morning<- subset(table, times>=1 & times<=5 & Avg.PM10<=30)
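Because time is stored as a number in the form YYYYMMDDHH, the hour can also be taken arithmetically as the remainder after division by 100, without converting to text. The sketch below uses a small made-up sample in the question's format; the names tab and tab_morning are hypothetical.

```r
# Hypothetical sample in the question's format (YYYYMMDDHH stored as a number).
tab <- data.frame(time = c(2012030101, 2012030105, 2012030106,
                           2012030203, 2012030223),
                  Avg.PM10 = c(52.3, 28.1, 19.0, 17.4, 12.2))

# The last two digits are the hour of day.
hour <- tab$time %% 100

# Keep morning hours (1 to 5) with Avg.PM10 at or below 30.
tab_morning <- tab[hour >= 1 & hour <= 5 & tab$Avg.PM10 <= 30, ]
print(tab_morning)
```

The same filter works unchanged for 90 days of data, since it never lists individual timestamps.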

How to update values with dplyr

I'm currently trying to update values in a data.frame using dplyr, but I don't know if it is possible to replace a subset of values.
# the net4 table
head(net4)
Source: local data frame [6 x 4]
temps2 NNET NET ave
1 18 2 4 36
2 18 2 4 36
3 22 2 4 44
4 18 2 4 36
5 22 2 4 44
6 27 3 4 36
# I would like to do the same command line as below:
subs <- (net4$ave < 10 & net4$ave!=net4$temps2)
net4$ave[subs] <- with(net4[subs,], temps2/NNET*NET)
Thanks
Use mutate and ifelse
mutate(net4,
ave = ifelse(ave < 10 & ave != temps2, temps2 / NNET * NET, ave)
)
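To see what the ifelse expression does row by row, here is a minimal sketch using base R's transform in place of mutate (a swapped-in function, chosen only so the example runs without dplyr; the ifelse call is the same). The values in net4 are made up so that the condition is true for at least one row.

```r
# Hypothetical values; column names follow the question.
net4 <- data.frame(temps2 = c(18, 22, 27, 9),
                   NNET = c(2, 2, 3, 3),
                   NET = c(4, 4, 4, 3),
                   ave = c(36, 5, 36, 9))

# Rows where the condition is TRUE get the recomputed value;
# all other rows keep their current ave.
net4_new <- transform(net4,
                      ave = ifelse(ave < 10 & ave != temps2,
                                   temps2 / NNET * NET, ave))
print(net4_new)
```

Only the second row changes here: the fourth row has ave below 10 but equal to temps2, so it is left alone.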

Selecting top finite number of rows for each unique value of a column in a data frame in R

I have a data frame with 3 columns: a, b, and c. There are multiple rows for each unique value of column a. I want to select the top 5 rows for each unique value of column a. Column c holds some value, and the data frame is already sorted by it in descending order, so ordering is not a problem. Can anyone suggest how to do this in R?
Stealing @ptocquin's example, here's how you can use the base function by. You can flatten the result using do.call (see below).
> by(data = data, INDICES = data$a, FUN = function(x) head(x, 5))
# or by(data = data, INDICES = data$a, FUN = head, 5)
data$a: 1
a b c
21 1 0.1188552 1.6389895
41 1 1.0182033 1.4811359
61 1 -0.8795879 0.7784072
81 1 0.6485745 0.7734652
31 1 1.5102255 0.7107957
------------------------------------------------------------
data$a: 2
a b c
15 2 -1.09704040 1.1710693
85 2 0.42914795 0.8826820
65 2 -1.01480957 0.6736782
45 2 -0.07982711 0.3693384
35 2 -0.67643885 -0.2170767
------------------------------------------------------------
A similar thing could be achieved by splitting your data.frame based on a and then using lapply to step through each element subsetting first n rows.
split.data <- split(data, data$a)
subsetted.data <- lapply(split.data, FUN = function(x) head(x, 5)) # or ..., FUN = head, 5) like above
flatten.data <- do.call("rbind", subsetted.data)
head(flatten.data)
a b c
1.21 1 0.11885516 1.63898947
1.41 1 1.01820329 1.48113594
1.61 1 -0.87958790 0.77840718
1.81 1 0.64857445 0.77346517
1.31 1 1.51022545 0.71079568
2.15 2 -1.09704040 1.17106930
2.85 2 0.42914795 0.88268205
2.65 2 -1.01480957 0.67367823
2.45 2 -0.07982711 0.36933837
2.35 2 -0.67643885 -0.21707668
Here is my try :
library(plyr)
data <- data.frame(a=rep(sample(1:20,10),10),b=rnorm(100),c=rnorm(100))
data <- data[rev(order(data$c)),]
head(data, 15)
a b c
28 6 1.69611039 1.720081
91 11 1.62656460 1.651574
70 9 -1.17808386 1.641954
6 15 1.23420550 1.603140
23 7 0.70854914 1.588352
51 11 -1.41234359 1.540738
19 10 2.83730734 1.522825
49 10 0.39313579 1.370831
80 9 -0.59445323 1.327825
59 10 -0.55538404 1.214901
18 6 0.08445888 1.152266
86 15 0.53027267 1.066034
69 10 -1.89077464 1.037447
62 1 -0.43599566 1.026505
3 7 0.78544009 1.014770
result <- ddply(data, .(a), "head", 5)
head(result, 15)
a b c
1 1 -0.43599566 1.02650544
2 1 -1.55113486 0.36380251
3 1 0.68608364 0.30911430
4 1 -0.85406406 0.05555500
5 1 -1.83894595 -0.11850847
6 5 -1.79715809 0.77760033
7 5 0.82814909 0.22401278
8 5 -1.52726859 0.06745849
9 5 0.51655092 -0.02737905
10 5 -0.44004646 -0.28106808
11 6 1.69611039 1.72008079
12 6 0.08445888 1.15226601
13 6 -1.99465060 0.82214319
14 6 0.43855489 0.76221979
15 6 -2.15251353 0.64417757
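Another base R option, offered only as a sketch and not taken from the answers above, numbers the rows within each group using ave and keeps those with a within-group position up to n. Unlike ddply, it keeps the original row order and row names. The data here are a small made-up example already sorted by c in descending order, and n is 2 to keep the output short.

```r
# Hypothetical data, already sorted by c in descending order.
data <- data.frame(a = c(1, 2, 1, 2, 1, 1, 2, 1),
                   c = c(9, 8, 7, 6, 5, 4, 3, 2))

# Position of each row within its group of a (1 = first row of that group).
rank_in_group <- ave(seq_len(nrow(data)), data$a, FUN = seq_along)

# Keep the first 2 rows of every group.
top2 <- data[rank_in_group <= 2, ]
print(top2)
```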
