Writing the outcome of a nested loop to a vector object in R

I have the following data read into R as a data frame named "data_old":
yes year month
1 15 2004 5
2 9 2005 6
3 15 2006 3
4 12 2004 5
5 14 2005 1
6 15 2006 7
. . ... .
. . ... .
I have written a small loop which goes through the data and sums up the yes variable for each month/year combination:
year_f <- c(2004:2006)
month_f <- c(1:12)
for (i in year_f){
  for (j in month_f){
    x <- subset(data_old, month == j & year == i, select="yes")
    if (nrow(x) > 0){
      print(sum(x))
    } else {
      print("Nothing")
    }
  }
}
My question is this: I can print the sum for each month/year combination in the terminal, but how do I store it in a vector? (The nested loop is giving me headaches trying to figure this out.)
Thomas

Another way,
library(plyr)
ddply(data_old,.(year,month),function(x) sum(x[1]))
year month V1
1 2004 5 27
2 2005 1 14
3 2005 6 9
4 2006 3 15
5 2006 7 15

Forget the loops, you want to use an aggregation function. There's a recent discussion of them in this SO question.
with(data_old, tapply(yes, list(year, month), sum))
is one of many solutions.
Also, you don't need to use c() when you aren't concatenating anything. Plain 1:12 is fine.
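
To make the shape of the tapply() result concrete, here is a minimal sketch that assumes data_old contains only the six rows shown in the question (the real data has more). The result is a year-by-month matrix with NA for combinations that never occur, which can then be flattened to a vector or turned into a long table.

```r
# Assumption: only the six rows printed in the question
data_old <- data.frame(
  yes   = c(15, 9, 15, 12, 14, 15),
  year  = c(2004, 2005, 2006, 2004, 2005, 2006),
  month = c(5, 6, 3, 5, 1, 7)
)

# Rows are years, columns are months; missing combinations are NA
sums <- with(data_old, tapply(yes, list(year, month), sum))

# Look up one combination by its (character) dimnames
sums["2004", "5"]

# Flatten to a plain vector (column-major order, NAs included)
sums_vec <- as.vector(sums)

# Or keep only the combinations that exist, in long format
long <- na.omit(as.data.frame(as.table(sums)))
```

The long table uses the default column names Var1, Var2 and Freq, which you may want to rename.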

Just to add a third option:
aggregate(yes ~ year + month, FUN=sum, data=data_old)
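
If you do want to keep the loop and answer the literal question, a minimal sketch is to preallocate a vector and fill one slot per iteration. The names totals, combos and k are made up for this illustration, and data_old is assumed to hold only the six rows shown in the question.

```r
# Assumption: only the six rows printed in the question
data_old <- data.frame(
  yes   = c(15, 9, 15, 12, 14, 15),
  year  = c(2004, 2005, 2006, 2004, 2005, 2006),
  month = c(5, 6, 3, 5, 1, 7)
)

year_f  <- 2004:2006
month_f <- 1:12

# One slot per year/month combination, NA until filled.
# expand.grid varies month fastest, matching the loop order below.
combos <- expand.grid(month = month_f, year = year_f)
totals <- rep(NA_real_, nrow(combos))
names(totals) <- paste(combos$year, combos$month, sep = "-")

k <- 0
for (i in year_f) {
  for (j in month_f) {
    k <- k + 1
    x <- subset(data_old, month == j & year == i, select = "yes")
    if (nrow(x) > 0) {
      totals[k] <- sum(x)  # store instead of print
    }
  }
}

totals[!is.na(totals)]  # only the combinations that have data
```

Combinations without data stay NA, which replaces the "Nothing" message from the original loop.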

Related

SQL `lead()` equivalent in R

I want to do something like LEAD(mes) OVER(PARTITION BY CODIGO_CLIENTE ORDER BY mes) mes_2 in R, but I don't know of a similar function.
I have no clue how to work it out.
Since you shared no data and desired output, here is an example with lead() from the dplyr package. The example is from the Help page of lead(). This can give you a good idea of what you can do with this function.
library(dplyr)
df <- data.frame(year = 2000:2005, value = (0:5) ^ 2)
scrambled <- df[sample(nrow(df)), ]
year value
1 2000 0
5 2004 16
3 2002 4
4 2003 9
2 2001 1
6 2005 25
right <- mutate(scrambled, `next` = lead(value, order_by = year))
arrange(right, year)
year value next
1 2000 0 1
2 2001 1 4
3 2002 4 9
4 2003 9 16
5 2004 16 25
6 2005 25 NA
Since you're new to R I suggest you read a bit on the dplyr package. Also, to make it easier for the people trying to help you, please provide more details next time!
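
For the PARTITION BY part of the SQL, here is a hedged base R sketch. The data frame clientes and its values are invented for illustration; only the column names CODIGO_CLIENTE and mes come from the question. The idea is to sort by client and month, then shift each client's values up by one position with ave().

```r
# Hypothetical data; only the column names come from the question
clientes <- data.frame(
  CODIGO_CLIENTE = c("A", "A", "A", "B", "B"),
  mes            = c(3, 1, 2, 5, 4)
)

# ORDER BY mes within each client
clientes <- clientes[order(clientes$CODIGO_CLIENTE, clientes$mes), ]

# LEAD(mes) OVER (PARTITION BY CODIGO_CLIENTE):
# drop the first value of each group and pad the end with NA
clientes$mes_2 <- ave(clientes$mes, clientes$CODIGO_CLIENTE,
                      FUN = function(x) c(x[-1], NA))

clientes
```

With dplyr, the same result should be reachable by calling group_by(CODIGO_CLIENTE) before mutate() and using lead(mes, order_by = mes) inside it.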

Transpose column and group dataframe [duplicate]

This question already has answers here:
How to reshape data from long to wide format
I'm trying to change a dataframe in R to group multiple rows by a measurement. The table has a location (km), a size (mm), a count of things in that size bin, a site and a year. I want to take the sizes, make a column from each one (2, 4 and 6 in this example), and place the corresponding count into the row for that location, site and year.
It seems like a combination of transposing and grouping, but I can't figure out a way to accomplish this in R. I've looked at t(), dcast() and aggregate(), but those aren't really close at all.
So I would go from something like this:
df <- data.frame(km=c(rep(32,3),rep(50,3)), mm=rep(c(2,4,6),2), count=sample(1:25,6), site=rep("A", 6), year=rep(2013, 6))
km mm count site year
1 32 2 18 A 2013
2 32 4 2 A 2013
3 32 6 12 A 2013
4 50 2 3 A 2013
5 50 4 17 A 2013
6 50 6 21 A 2013
To this:
km site year mm_2 mm_4 mm_6
1 32 A 2013 18 2 12
2 50 A 2013 3 17 21
Edit: I tried the solution in a suggested duplicate, but it did not work for me; I'm not really sure why. The answer below worked better.
As suggested in the comment above, we can use the sep argument in spread:
library(tidyr)
spread(df, mm, count, sep = "_")
km site year mm_2 mm_4 mm_6
1 32 A 2013 4 20 1
2 50 A 2013 15 14 22
As you mentioned dcast(), here is a method using it.
set.seed(1)
df <- data.frame(km=c(rep(32,3),rep(50,3)),
                 mm=rep(c(2,4,6),2),
                 count=sample(1:25,6),
                 site=rep("A", 6),
                 year=rep(2013, 6))
library(reshape2)
dcast(df, ... ~ mm, value.var="count")
# km site year 2 4 6
# 1 32 A 2013 13 10 20
# 2 50 A 2013 3 17 1
And if you want a bit of a challenge you can try the base function reshape().
df2 <- reshape(df, v.names="count", idvar=c("km", "site", "year"), timevar="mm", direction="wide")
colnames(df2) <- sub("count.", "mm_", colnames(df2), fixed=TRUE)
df2
# km site year mm_2 mm_4 mm_6
# 1 32 A 2013 13 10 20
# 4 50 A 2013 3 17 1

Conditional cumulative subtraction

This is what my data.table looks like:
library(data.table)
dt <- fread('
Year Total Shares Balance
2017 10 1 10
2016 12 2 9
2015 10 2 7
2014 10 3 6
2013 10 NA 3
')
**Balance** is my desired column. I am trying to compute a cumulative subtraction: take the first value of Total, which is 10 (it should also be the first value of the Balance field), and then cumulatively subtract the values in Shares. So the second value is 10-1 = 9, the third value is 9-2 = 7, and so on. There is one condition: if the Year is 2014, then subtract the Shares value after dividing it by 2. So the fourth value is 7-(2/2) = 6 and the fifth value is 6-3 = 3. I want to end the calculation at the last row.
My attempt is:
dt[, Balance:= ifelse( Year == 2014, cumsum(Total[1]-Shares/2), cumsum(Total[1] - Shares))]
Here is one method.
dt[, Balance2 := Total[1] - cumsum(shift(Shares * (1 - (0.5 *(Year == 2015))), fill=0))]
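shift is used to create a lagged variable, and its first element is filled with 0 using fill=0. The remaining elements are calculated as Shares * (1 - (0.5 *(Year == 2015))), which returns Shares except when Year == 2015, in which case Shares * 0.5 is returned. The condition tests for 2015 rather than 2014 because the Shares values are lagged by one row, so the halved value from the 2015 row is the one subtracted in the 2014 row.
This returns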
dt
Year Total Shares Balance Balance2
1: 2017 10 1 10 10
2: 2016 12 2 9 9
3: 2015 10 2 7 7
4: 2014 10 3 6 6
5: 2013 10 NA 3 3
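
To see the arithmetic step by step without data.table, here is a minimal base R sketch of the same lag-then-cumsum idea. The intermediate names adjusted and lagged are made up for this illustration; the data are copied from the question.

```r
# Data copied from the question
Year   <- c(2017, 2016, 2015, 2014, 2013)
Total  <- c(10, 12, 10, 10, 10)
Shares <- c(1, 2, 2, 3, NA)

# Halve the Shares value on the 2015 row (it is subtracted on the 2014 row)
adjusted <- Shares * ifelse(Year == 2015, 0.5, 1)   # 1 2 1 3 NA

# Lag by one row: nothing is subtracted on the first row,
# and the trailing NA is dropped
lagged <- c(0, head(adjusted, -1))                  # 0 1 2 1 3

# Start from the first Total and subtract the running sum
Balance2 <- Total[1] - cumsum(lagged)               # 10 9 7 6 3
```

Because the NA in Shares sits on the last row, it is lagged out of the calculation and never reaches cumsum().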
FWIW, I wanted to provide a function-based alternative that would allow for more flexible calculations in the cumulative differences, indexing, etc. I have also read in the data with read.table.
dt <- read.table(header=TRUE, text='
Year Total Shares Balance
2017 10 1 10
2016 12 2 9
2015 10 2 7
2014 10 3 6
2013 10 NA 3
')
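makeNewBalance <- function(dt) {
  # Preallocate the result instead of growing it from NULL
  output <- numeric(nrow(dt))
  for (i in seq_len(nrow(dt))) {
    if (i == 1) {
      output[i] <- dt$Total[i]
    } else {
      # Subtract the previous row's Shares, halved when the current Year is 2014
      output[i] <- output[i-1] - ifelse(dt$Year[i] == 2014,
                                        dt$Shares[i-1] / 2,
                                        dt$Shares[i-1])
    }
  }
  return(output)
}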
dt$NewBalance <- makeNewBalance(dt)
which also returns
> dt
Year Total Shares Balance NewBalance
1 2017 10 1 10 10
2 2016 12 2 9 9
3 2015 10 2 7 7
4 2014 10 3 6 6
5 2013 10 NA 3 3

Sum column values that match year in another column in R

I have the following dataframe
y<-data.frame(c(2007,2008,2009,2009,2010,2010),c(10,13,10,11,9,10),c(5,6,5,7,4,7))
colnames(y)<-c("year","a","b")
I want a final data.frame that sums, within each year, the values in y$a into a new "a" column and the values in y$b into a new "b" column, so that it looks like this:
year a b
2007 10 5
2008 13 6
2009 21 12
2010 19 11
The following loop has done it for me,
years<- as.numeric(levels(factor(y$year)))
add.a<- numeric(length(y[,1]))
add.b<- numeric(length(y[,1]))
for(i in years){
  ind<- which(y$year==i)
  add.a[ind]<- sum(as.numeric(as.character(y[ind,"a"])))
  add.b[ind]<- sum(as.numeric(as.character(y[ind,"b"])))
}
y.final<-data.frame(y$year,add.a,add.b)
colnames(y.final)<-c("year","a","b")
y.final<-subset(y.final,!duplicated(y.final$year))
but I just think there must be a faster command. Any ideas?
Kindest regards,
Marco
The aggregate function is a good choice for this sort of operation; type ?aggregate for more information about it.
aggregate(cbind(a,b) ~ year, data = y, sum)
# year a b
#1 2007 10 5
#2 2008 13 6
#3 2009 21 12
#4 2010 19 11
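
As a cross-check, base R's rowsum() computes the same grouped column sums. This is only a sketch of an alternative, not a replacement for the aggregate() call; the names sums and y_final are chosen for illustration.

```r
# Data from the question
y <- data.frame(year = c(2007, 2008, 2009, 2009, 2010, 2010),
                a    = c(10, 13, 10, 11, 9, 10),
                b    = c(5, 6, 5, 7, 4, 7))

# rowsum() adds up the rows of a and b that share the same year;
# the years end up as row names of the result
sums <- rowsum(y[, c("a", "b")], group = y$year)

# Move the years back into a regular column
y_final <- data.frame(year = as.numeric(rownames(sums)), sums,
                      row.names = NULL)
y_final
```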

R - Bootstrap by several column criteria

So what I have is data of cod weights at different ages. This data is taken at several locations over time.
What I would like to create is "weight at age", basically a mean value of the weights at a certain age. I want to do this for each location in each year.
However, the ages are not sampled the same way (all old fish caught are measured, while younger fish are subsampled), so I can't just compute a normal average; I would like to bootstrap the samples.
The bootstrap should draw 5 random weight values at a given age, compute their mean, repeat this 1000 times, and then average the means. Values should be drawn with replacement (replace), so the same value can be used again. This should be done for each age in every AreaCode for every year. Dependent factors: Year-location-Age.
So here's an example of what my data could look like.
df <- data.frame( Year= rep(c(2000:2008),2), AreaCode = c("39G4", "38G5","40G5"), Age = c(0:8), IndWgt = c(rnorm(18, mean=5, sd=3)))
> df
Year AreaCode Age IndWgt
1 2000 39G4 0 7.317489899
2 2001 38G5 1 7.846606144
3 2002 40G5 2 0.009212455
4 2003 39G4 3 6.498688035
5 2004 38G5 4 3.121134937
6 2005 40G5 5 11.283096043
7 2006 39G4 6 0.258404136
8 2007 38G5 7 6.689780137
9 2008 40G5 8 10.180511929
10 2000 39G4 0 5.972879108
11 2001 38G5 1 1.872273650
12 2002 40G5 2 5.552962065
13 2003 39G4 3 4.897882549
14 2004 38G5 4 5.649438631
15 2005 40G5 5 4.525012587
16 2006 39G4 6 2.985615831
17 2007 38G5 7 8.042884181
18 2008 40G5 8 5.847629941
AreaCode contains the different locations; in reality I have 85 different levels. The time series stretches over 1991-2013 and the ages over 0-15. IndWgt contains the weight. My whole data frame has 185726 rows.
Also, not every age exists for every location and every year. I don't know if this would be a problem; I mention it only so that the script isn't based on references to certain row numbers. There are some NA values in the weight column, but I could just remove them beforehand.
I was thinking that I maybe should use replicate, and apply or another plyr function. I've tried to understand the boot function but I don't really know if I would write my arguments under statistics, and in that case how. So yeah, basically I have no idea.
I would be thankful for any help I can get!
How about this with plyr. I think from the question you wanted to bootstrap only the "young" fish weights and use actual means for the older ones. If not, just replace the ifelse() statement with its last argument.
require(plyr)
#cod<-read.csv("cod.csv",header=T) #I loaded your data from csv
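bootstrap<-function(Age,IndWgt){
  # Age is constant within each Year/AreaCode/Age group, so test its first value
  if (Age[1] > 2) {
    res <- mean(IndWgt) # old fish mean
  } else {
    # young fish bootstrap: mean of 1000 resampled means of 5 draws each
    res <- mean(replicate(1000, mean(sample(IndWgt, 5, replace = TRUE))))
  }
  return(res)
}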
ddply(cod,.(Year,AreaCode,Age),summarize,boot_mean=bootstrap(Age,IndWgt))
Year AreaCode Age boot_mean
1 2000 39G4 0 6.650294
2 2001 38G5 1 4.863024
3 2002 40G5 2 2.724541
4 2003 39G4 3 5.698285
5 2004 38G5 4 4.385287
6 2005 40G5 5 7.904054
7 2006 39G4 6 1.622010
8 2007 38G5 7 7.366332
9 2008 40G5 8 8.014071
PS: If you want to sample all ages in the same way, no need for the function, just:
ddply(cod,.(Year,AreaCode,Age),
summarize,
boot_mean=mean(replicate(1000,mean(sample(IndWgt,5,replace = TRUE)))))
Since you don't provide enough code, it's too hard (lazy) for me to test it properly. You should get your first step using the following code. If you wrap this into replicate, you should get your end result that you can average.
part.result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = data, FUN = function(x) {
rws <- length(x)
get.em <- sample(x, size = 5, replace = TRUE)
out <- mean(get.em)
out
})
To handle any missing combination of year/age/location, you could probably add an if statement checking for NULL/NA and producing a warning and/or skipping the iteration.
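
One caveat worth knowing with sample(): when x is a single number of at least 1, sample(x, ...) draws from 1:x rather than returning x itself, so a group containing only one fish would give wrong weights. A minimal sketch that avoids this by resampling indices with sample.int() is below. The helper name boot_mean and the toy data are assumptions for illustration, not the asker's real data.

```r
set.seed(42)  # only to make the sketch reproducible

# Hypothetical toy data in the shape described in the question:
# 4 fish per Year/AreaCode/Age combination
cod <- expand.grid(fish = 1:4, Age = 0:1,
                   AreaCode = c("39G4", "38G5"), Year = 2000:2001)
cod$IndWgt <- rnorm(nrow(cod), mean = 5, sd = 1)

# Bootstrap mean: average of n_rep means of n_draw values drawn with replacement
boot_mean <- function(x, n_draw = 5, n_rep = 1000) {
  x <- x[!is.na(x)]  # drop missing weights
  mean(replicate(n_rep, {
    idx <- sample.int(length(x), n_draw, replace = TRUE)  # safe for length-1 x
    mean(x[idx])
  }))
}

# One bootstrap mean per Year/AreaCode/Age combination that exists in the data
result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = cod, FUN = boot_mean)
result
```

Because aggregate() only visits combinations that actually occur, missing Year/AreaCode/Age combinations are skipped without any extra checks.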

Resources