How to update values with dplyr - r

I'm currently trying to update values from a data.frame using dplyr butI don't know if it is possible to replace a subset of values?
# the net4 table
head(net4)
Source: local data frame [6 x 4]
temps2 NNET NET ave
1 18 2 4 36
2 18 2 4 36
3 22 2 4 44
4 18 2 4 36
5 22 2 4 44
6 27 3 4 36
# I would like to do the same command line as below:
subs <- (net4$ave < 10 & net4$ave!=net4$temps2)
net4$ave[subs] <- with(net4[subs,], temps2/NNET*NET)
Thanks

Use mutate and ifelse
mutate(net4,
ave = ifelse(ave < 10 & ave != temp2, temps2 / NNET * NET, ave)
)

Related

How to use apply function instead of for loop if you have multiple if conditions to be excecuted

1st DF:
t.d
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
names(t.d) <- c("ID","A","B","C")
t.d$FinalTime <- c("7/30/2009 08:18:35","9/30/2009 19:18:35","11/30/2009 21:18:35","13/30/2009 20:18:35","15/30/2009 04:18:35")
t.d$InitTime <- c("6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35")
>t.d
ID A B C FinalTime InitTime
1 1 6 11 16 7/30/2009 08:18:35 6/30/2009 9:18:35
2 2 7 12 17 9/30/2009 19:18:35 6/30/2009 9:18:35
3 3 8 13 18 11/30/2009 21:18:35 6/30/2009 9:18:35
4 4 9 14 19 13/30/2009 20:18:35 6/30/2009 9:18:35
5 5 10 15 20 15/30/2009 04:18:35 6/30/2009 9:18:35
2nd DF:
> s.d
F D E Time
1 10 19 28 6/30/2009 08:18:35
2 11 20 29 8/30/2009 19:18:35
3 12 21 30 9/30/2009 21:18:35
4 13 22 31 01/30/2009 20:18:35
5 14 23 32 10/30/2009 04:18:35
6 15 24 33 11/30/2009 04:18:35
7 16 25 34 12/30/2009 04:18:35
8 17 26 35 13/30/2009 04:18:35
9 18 27 36 15/30/2009 04:18:35
Output to be:
From DF "t.d" I have to calculate the time interval for each row between "FinalTime" and "InitTime" (InitTime will always be less than FinalTime).
Another DF "temp" from "s.d" has to be formed having data only within the above time interval, and then the most recent values of "F","D","E" have to be taken and attached to the 'ith' row of "t.d" from which the time interval was calculated.
Also we have to see if the newly formed DF "temp" has the following conditions true:
here 'j' represents value for each row:
if(temp$F[j] < 35.5) + (temp$D[j] >= 100) >= 1)
{
temp$Flag <- 1
} else{
temp$Flag <- 0
}
Originally I have 3 million rows in the dataframe and 20 columns in each DF.
I have solved the above problem using "for loop" but it obviously takes 2 to 3 days as there are a lot of rows.
(Also if I have to add new columns to the resultant DF if multiple conditions get satisfied on each row?)
Can anybody suggest a different technique? Like using apply functions?
My suggestion is:
use lapply over row indices
handle in the function call your if branches
return either your dataframe or NULL
combine everything with rbind
by replacing lapply with mclapply from the 'parallel' package, your code gets executed in parallel.
resultList <- lapply(1:nrow(t.d), function(i){
do stuff
if(condition){
return(df)
}else{
return(NULL)
}
resultDF <- do.call(rbind, resultList)

if statement with ddply function

I am trying to use the if statement with ddply but am having issues with the if statement.
An example dataset is:
data<-data.frame(Gear=c(rep("S",10),rep("C",10)),TowSurvey=c(0,0,1,1,0,1,1,1,1,0),TowCom=c(0,1,1,1,0,1,1,1,1,0),
StationID=c(1,2,3,4,5,6,7,8,9,10),Totwght=c(2,8,6,4,12,9,56,7,89,10),Totexpwght=c(5,8,12,45,89,56,23,78,56,41),
Expnum=c(1,5,6,98,45,2,6,3,7,45),Exp=c(56,25,85,74,1,23,56,45,89,75))
My first try was
if(data$Gear=="S" & data$TowSurvey== 1 | data$Gear=="C" & data$TowCom== 1){
datad<-ddply(data, .(StationID,Gear), summarize,Totwghtpertow=sum(Totwght),
Totexppertow=sum(Totexpwght),Totnum =sum(Expnum),Totexpnum=sum(Exp))}
print(datad)
But the records that don't meet the if statement criteria are included in datad.
Then I found this post: Aggregate (count) rows that match a condition, group by unique values. Aggregate (count) rows that match a condition, group by unique values
So my second attempt based on the answer from the post was
datad<-ddply(data, .(StationID,Gear), summarize,Totwghtpertow=sum(Totwght[Gear=="S" & TowSurvey== 1 | Gear=="C" & TowCom== 1]))
I only tried with one column as a test and am getting the same results. Any help would be appreciated in trying to figure this out.
Thanks
If you run your first attempt you should actually get an error message since if can only evaluate a logical vector of length 1.
You really don't need an if statement here. Subsetting your data will do just fine.
data_sub <- subset(data, (data$Gear=="S" & data$TowSurvey== 1) | (data$Gear=="C" & data$TowCom== 1))
You can run your ddply statement using data_sub rather than data.
Or if you're going to be using the a lot you can wrap it in a function:
datad_func <- function(data){
data_sub <- subset(data, (data$Gear=="S" & data$TowSurvey== 1) | (data$Gear=="C" & data$TowCom== 1))
datad<-ddply(data_sub, .(StationID,Gear), summarize,Totwghtpertow=sum(Totwght),
Totexppertow=sum(Totexpwght),Totnum =sum(Expnum),Totexpnum=sum(Exp))
rm('data_sub')
print(datad)
}
datad_func(data)
StationID Gear Totwghtpertow Totexppertow Totnum Totexpnum
1 2 C 8 8 5 25
2 3 C 6 12 6 85
3 3 S 6 12 6 85
4 4 C 4 45 98 74
5 4 S 4 45 98 74
6 6 C 9 56 2 23
7 6 S 9 56 2 23
8 7 C 56 23 6 56
9 7 S 56 23 6 56
10 8 C 7 78 3 45
11 8 S 7 78 3 45
12 9 C 89 56 7 89
13 9 S 89 56 7 89
plyr is not so good at subsetting in the function, so you can do it before or after like #scribbles said.
You could also try dplyr and pipe them together:
library(dplyr)
data %>% filter((data$Gear == "S" & data$TowSurvey == 1) | (data$Gear == "C" & data$TowCom == 1)) %>%
group_by(StationID, Gear) %>%
summarise_each(funs(sum), Totwght, Totexpwght, Expnum, Exp)

ggplot2 is plotting a line strangely

i am trying to plot the time series x_t = A + (-1)^t B
To do this i am using the following code. The problem is, that the ggplot is wrong.
require (ggplot2)
set.seed(42)
N<-2
A<-sample(1:20,N)
B<-rnorm(N)
X<-c(A+B,A-B)
dat<-sapply(1:N,function(n) X[rep(c(n,N+n),20)],simplify=FALSE)
dat<-data.frame(t=rep(1:20,N),w=rep(A,each=20),val=do.call(c,dat))
ggplot(data=dat,aes(x=t, y=val, color=factor(w)))+
geom_line()+facet_grid(w~.,scale = "free")
looking at the head of dat everything looks right:
> head(dat)
t w val
1 1 12 10.5533
2 2 12 13.4467
3 3 12 10.5533
4 4 12 13.4467
5 5 12 10.5533
6 6 12 13.4467
So the lower (blue) line should only have values 10.5533 and 13.4467. But it also takes different values. What is wrong in my code?
Thanks in advance for any help
You really should be more careful before asserting that something is "wrong". The way you are creating dat the rows are not ordered by dat$t, so head(...) is not displaying the extra values:
head(dat[order(dat$w,dat$t),],10)
# t w val
# 21 1 18 18.43530
# 61 1 18 18.36313
# 22 2 18 19.56470
# 62 2 18 17.63687
# 23 3 18 18.43530
# 63 3 18 18.36313
# 24 4 18 19.56470
# 64 4 18 17.63687
# 25 5 18 18.43530
# 65 5 18 18.36313
Note the row numbers.

Sum multiple columns [duplicate]

This question already has an answer here:
Summarizing multiple columns with data.table
(1 answer)
Closed 3 years ago.
I am trying to write a function that will sum the column(s) in the data frame according to the values in the first two columns.For example I have a matrix M,
Crs gr P_7 P_8
38 1 3 16
38 1 12 45
38 1 9 28
40 2 3 9
40 2 14 29
40 1 4 3
40 2 8 2
I want to sum the columns according to column1(crs) first and then column2(gr). Result will be,
Crs gr P_7 P_8
38 1 24 89
40 2 25 40
40 1 4 3
Currently I am using,
M <- M[, list(sum(P_7),sum(P_8)), by=list(Crs,gr)]
But the problem with this, is that I have to define the names of columns which wont be fixed. So, I was wondering how can I do this without defining the names of the columns.
Thanks in advance!
You're looking for this:
M[, lapply(.SD, sum), by = list(Crs, gr)]
The package plyr has some magic for situations just like this. Use a combination of ddply and numcolwise, like this:
library(plyr)
ddply(dat, .(Crs, gr), numcolwise(sum))
results in:
Crs gr P_7 P_8
1 38 1 24 89
2 40 1 4 3
3 40 2 25 40

Selecting top finite number of rows for each unique value of a column in a data fame in R

I have a data frame with 3 columns. a,b,c. There are multiple rows corresponding to each unique value of column a. I want to select top 5 rows corresponding to each unique value of column a. column c is some value and the data frame is already sorted by it in descending order, so that would not be a problem. Can anyone please suggest how can I do this in R.
Stealing #ptocquin's example, here's how you can use base function by. You can flatten the result using do.call (see below).
> by(data = data, INDICES = data$a, FUN = function(x) head(x, 5))
# or by(data = data, INDICES = data$a, FUN = head, 5)
data$a: 1
a b c
21 1 0.1188552 1.6389895
41 1 1.0182033 1.4811359
61 1 -0.8795879 0.7784072
81 1 0.6485745 0.7734652
31 1 1.5102255 0.7107957
------------------------------------------------------------
data$a: 2
a b c
15 2 -1.09704040 1.1710693
85 2 0.42914795 0.8826820
65 2 -1.01480957 0.6736782
45 2 -0.07982711 0.3693384
35 2 -0.67643885 -0.2170767
------------------------------------------------------------
A similar thing could be achieved by splitting your data.frame based on a and then using lapply to step through each element subsetting first n rows.
split.data <- split(data, data$a)
subsetted.data <- lapply(split.data, FUN = function(x) head(x, 5)) # or ..., FUN = head, 5) like above
flatten.data <- do.call("rbind", subsetted.data)
head(flatten.data)
a b c
1.21 1 0.11885516 1.63898947
1.41 1 1.01820329 1.48113594
1.61 1 -0.87958790 0.77840718
1.81 1 0.64857445 0.77346517
1.31 1 1.51022545 0.71079568
2.15 2 -1.09704040 1.17106930
2.85 2 0.42914795 0.88268205
2.65 2 -1.01480957 0.67367823
2.45 2 -0.07982711 0.36933837
2.35 2 -0.67643885 -0.21707668
Here is my try :
library(plyr)
data <- data.frame(a=rep(sample(1:20,10),10),b=rnorm(100),c=rnorm(100))
data <- data[rev(order(data$c)),]
head(data, 15)
a b c
28 6 1.69611039 1.720081
91 11 1.62656460 1.651574
70 9 -1.17808386 1.641954
6 15 1.23420550 1.603140
23 7 0.70854914 1.588352
51 11 -1.41234359 1.540738
19 10 2.83730734 1.522825
49 10 0.39313579 1.370831
80 9 -0.59445323 1.327825
59 10 -0.55538404 1.214901
18 6 0.08445888 1.152266
86 15 0.53027267 1.066034
69 10 -1.89077464 1.037447
62 1 -0.43599566 1.026505
3 7 0.78544009 1.014770
result <- ddply(data, .(a), "head", 5)
head(result, 15)
a b c
1 1 -0.43599566 1.02650544
2 1 -1.55113486 0.36380251
3 1 0.68608364 0.30911430
4 1 -0.85406406 0.05555500
5 1 -1.83894595 -0.11850847
6 5 -1.79715809 0.77760033
7 5 0.82814909 0.22401278
8 5 -1.52726859 0.06745849
9 5 0.51655092 -0.02737905
10 5 -0.44004646 -0.28106808
11 6 1.69611039 1.72008079
12 6 0.08445888 1.15226601
13 6 -1.99465060 0.82214319
14 6 0.43855489 0.76221979
15 6 -2.15251353 0.64417757

Resources