adding rnorm to a column in loop - r

I am running simulations and am trying to add error to a column repeatedly, specifically to the column titled Ao. In my output, the first 30 rows are correct: the initial data, then the first year of altered data (error added to Ao). After that, where I would like 30 years of added error, I get repeats of Year 2 for Ao up to Year 30. My goal is to add error after each year of sampling, i.e. Year 2 is Year 1 Ao + error, Year 3 is Year 2 Ao + error, and so on. Any help? Cheers.
for(t in 1:30){
Error<-rnorm(1000,0,1)
m<-rep(year1data$m,30)
r<-rep(year1data$r,30)
a<-rep(year1data$a,30)
g<-rep(year1data$g,30)
Year<-rep(2:31, each=TotSpecies)
Species<-1:TotSpecies
Ao<-year1data$Ao+sample(Error,TotSpecies,replace=FALSE)
TotSpeciesdata<-data.frame(Species,Year,Ao,m,r,a,g)
TotSpeciesdata<-rbind(year1data,TotSpeciesdata)
}
> TotSpeciesdata
Species Year Ao m r a g
1 1 1 25.770783 43 119.110786 3.2305180 2.6526471
2 2 1 53.908914 138 161.894541 0.7342070 0.1151602
3 3 1 2.010732 226 193.820489 2.2890904 3.6248105
4 4 1 23.742254 332 17.315335 1.4009572 2.0037931
5 5 1 4.291080 63 187.591209 0.2563995 2.1553908
6 6 1 4.691113 343 116.267867 0.3899113 3.3950085
7 7 1 604.133044 224 132.240197 3.0410743 0.7985524
8 8 1 13.332567 166 5.367118 0.7921644 1.7861011
9 9 1 3.759268 141 212.340970 2.8733737 2.7123141
10 10 1 3.647390 209 259.400858 0.1249936 0.6594659
11 11 1 23.731109 10 114.171147 2.2437372 0.9867591
12 12 1 85.116996 69 167.412993 0.8306823 2.8905148
13 13 1 31.684280 277 177.025460 2.7618332 2.9245554
14 14 1 30.657523 205 21.710438 2.7661347 1.5911379
15 15 1 12.240410 85 210.121109 2.8827455 3.0418454
16 1 2 27.038097 43 119.110786 3.2305180 2.6526471
17 2 2 54.251600 138 161.894541 0.7342070 0.1151602
18 3 2 2.010636 226 193.820489 2.2890904 3.6248105
19 4 2 22.699369 332 17.315335 1.4009572 2.0037931
20 5 2 4.542589 63 187.591209 0.2563995 2.1553908
21 6 2 3.607833 343 116.267867 0.3899113 3.3950085
22 7 2 604.480756 224 132.240197 3.0410743 0.7985524
23 8 2 13.663513 166 5.367118 0.7921644 1.7861011
24 9 2 2.138715 141 212.340970 2.8733737 2.7123141
25 10 2 3.642769 209 259.400858 0.1249936 0.6594659
26 11 2 22.897993 10 114.171147 2.2437372 0.9867591
27 12 2 85.490897 69 167.412993 0.8306823 2.8905148
28 13 2 31.689202 277 177.025460 2.7618332 2.9245554
29 14 2 30.644419 205 21.710438 2.7661347 1.5911379
30 15 2 12.050207 85 210.121109 2.8827455 3.0418454
31 1 3 27.038097 43 119.110786 3.2305180 2.6526471
32 2 3 54.251600 138 161.894541 0.7342070 0.1151602
33 3 3 2.010636 226 193.820489 2.2890904 3.6248105
34 4 3 22.699369 332 17.315335 1.4009572 2.0037931
35 5 3 4.542589 63 187.591209 0.2563995 2.1553908
36 6 3 3.607833 343 116.267867 0.3899113 3.3950085
37 7 3 604.480756 224 132.240197 3.0410743 0.7985524
38 8 3 13.663513 166 5.367118 0.7921644 1.7861011
39 9 3 2.138715 141 212.340970 2.8733737 2.7123141
40 10 3 3.642769 209 259.400858 0.1249936 0.6594659
41 11 3 22.897993 10 114.171147 2.2437372 0.9867591
42 12 3 85.490897 69 167.412993 0.8306823 2.8905148
43 13 3 31.689202 277 177.025460 2.7618332 2.9245554
44 14 3 30.644419 205 21.710438 2.7661347 1.5911379
45 15 3 12.050207 85 210.121109 2.8827455 3.0418454

The main problem you have with your approach is the line:
TotSpeciesdata<-data.frame(Species,Year,Ao,m,r,a,g)
Year is a vector of length 30 * TotSpecies, but all the others are only TotSpecies long. In effect, every column except Year is recycled 30 times when you create the data frame, which leads to the year 2 data being repeated 30 times, among other things. Using Year <- rep(t + 1, TotSpecies) fixes the recycling, though each year's Ao would also have to be built from the previous year's Ao rather than from year1data$Ao, and the rbind would have to append to the accumulated result rather than to year1data. That said, here is an alternate approach:
This will, for each species, create an incrementing random walk starting with Ao for that species for 5 years (just did that for display purposes):
set.seed(1)
year1data <- data.frame(species=1:10, year=1, Ao=runif(10, 1, 700))
TotSpeciesData <- do.call(
  rbind,
  lapply(
    split(year1data, year1data$species),
    function(data)
      with(
        data,
        # year + 0:5 numbers the six rows 1..6; Ao is the start value followed by five cumulative steps
        data.frame(species=species, year=year + 0:5, Ao=c(Ao, Ao + cumsum(rnorm(5))))
      )
  )
)
head(TotSpeciesData, 15)
Note I excluded columns m-g since they don't seem directly relevant to your particular question, but you can add them relatively easily. I also only did 5 years in addition to year 1 so you can see the results here, but that is also easy to change:
species year Ao
1.1 1 1 186.5906
1.2 1 2 185.7701
1.3 1 3 186.2575
1.4 1 4 186.9958
1.5 1 5 187.5716
1.6 1 6 187.2662
2.1 2 1 261.1146
2.2 2 2 262.6264
2.3 2 3 263.0162
2.4 2 4 262.3950
2.5 2 5 260.1803
2.6 2 6 261.3052
3.1 3 1 401.4245
3.2 3 2 401.3796
3.3 3 3 401.3634
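For readers who prefer to keep the explicit loop from the question, here is a minimal sketch of one way it could be written so that each year builds on the previous one. The sample year1data, the value of TotSpecies, and the helper names results and nxt are assumptions made for illustration; columns m, r, a and g are left out for brevity.

```r
set.seed(1)
TotSpecies <- 15  # assumed number of species
year1data <- data.frame(Species = 1:TotSpecies, Year = 1,
                        Ao = runif(TotSpecies, 1, 700))

# Keep one data frame per year in a list, then bind once at the end
results <- vector("list", 31)
results[[1]] <- year1data
for (t in 1:30) {
  nxt <- results[[t]]                          # start from the previous year
  nxt$Year <- t + 1
  nxt$Ao <- nxt$Ao + rnorm(TotSpecies, 0, 1)   # previous Ao plus fresh error
  results[[t + 1]] <- nxt
}
TotSpeciesdata <- do.call(rbind, results)
```

Collecting the yearly frames in a list and calling rbind once avoids overwriting the accumulated result on every pass through the loop.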

It has been pointed out that the code you provided above, or at least my edited version of it, repeats itself every 15 years rather than producing unique values year by year in a step-wise fashion. I edited it as shown below:
TotSpeciesData <- do.call(
rbind, #bind the table by rows
lapply( #applying the function in list form
split(year1data, year1data$Species), #splits data into groups by species
function(data)
with(
data,
data.frame(Species=Species, Year=1:Community, Ao=c(Ao, Ao + cumsum(rnorm((TotSpecies-1),0,2))),m=m, r=r, a=a, g=g) #data frame is Species, Year,
) ) )
TotSpeciesData$Ao[TotSpeciesData$Ao<0]<-0 #any values less than 0 go to 0
TotSpeciesData<-TotSpeciesData[order(TotSpeciesData$Year),] #orders the data frame by Year
When I run this:
TotSpeciesData[TotSpeciesData$Species==1 & TotSpeciesData$Year %in% c(1,2,16,17),]
I get output showing that the data repeats itself:
Species Year Ao m r a g
1.1 1 1 48.49161 239 332.9625 3.791778 2.723104
1.2 1 2 49.62851 239 332.9625 3.791778 2.723104
1.16 1 16 48.49161 239 332.9625 3.791778 2.723104
1.17 1 17 49.62851 239 332.9625 3.791778 2.723104
Any thoughts on why this happens?
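A likely cause, assuming Community is 30 and TotSpecies is 15: the Ao vector built inside data.frame() has only TotSpecies values (one start value plus TotSpecies - 1 steps), while Year has Community values, so data.frame() silently recycles Ao and year 16 repeats year 1. The sketch below draws Community - 1 steps instead; the sample year1data and both constants are assumptions, and columns m, r, a and g are omitted.

```r
set.seed(1)
TotSpecies <- 15  # assumed number of species
Community  <- 30  # assumed number of years
year1data <- data.frame(Species = 1:TotSpecies,
                        Ao = runif(TotSpecies, 1, 700))

TotSpeciesData <- do.call(
  rbind,
  lapply(
    split(year1data, year1data$Species),
    function(data)
      with(data, data.frame(
        Species = Species,
        Year = 1:Community,
        # one start value plus Community - 1 steps gives exactly Community values
        Ao = c(Ao, Ao + cumsum(rnorm(Community - 1, 0, 2)))
      ))
  )
)
TotSpeciesData$Ao[TotSpeciesData$Ao < 0] <- 0  # floor negative values at 0
sp1 <- TotSpeciesData[TotSpeciesData$Species == 1, ]
```

With matching lengths nothing is recycled, so the rows of sp1 for years 16 and 17 no longer duplicate those for years 1 and 2.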

Related

Get the average of the values of one column for the values in another

I was not sure how to ask this question. I am trying to find the average tone when an initiative is mentioned, and also when a topic and a goal (or achievement) are mentioned. My data frame (df) has many mentions of 70 initiatives, meaning it has 500+ rows of data but only 70 distinct Initiatives.
My data looks like this:
> tabmean
Initiative Topic Goals Achievements Tone
1 52 44 2 2 2
2 294 42 2 2 2
3 103 31 2 2 2
4 52 41 2 2 2
5 87 26 2 1 1
6 52 87 2 2 2
7 136 81 2 2 2
8 19 7 2 2 1
9 19 4 2 2 2
10 0 63 2 2 2
11 0 25 2 2 2
12 19 51 2 2 2
13 52 51 2 2 2
14 108 94 2 2 1
15 52 89 2 2 2
16 110 37 2 2 2
17 247 25 2 2 2
18 66 95 2 2 2
19 24 49 2 2 2
20 24 110 2 2 2
I want to find the mean Tone when an Initiative is mentioned, as well as the mean Tone when an Initiative, a Topic and a Goal are mentioned at the same time. The codes for Tone are: positive (1), neutral (2), negative (3), and both positive and negative (4). Goals and Achievements are coded yes (1) and no (2).
I have used this code:
GoalMeanTone <- tabmean %>%
group_by(Initiative,Topic,Goals,Tone) %>%
summarize(averagetone = mean(Tone))
which gives this output:
GoalMeanTone
# A tibble: 454 x 5
# Groups: Initiative, Topic, Goals [424]
Initiative Topic Goals Tone averagetone
<chr> <chr> <chr> <chr> <dbl>
1 0 104 2 0 NA
2 0 105 2 0 NA
3 0 22 2 0 NA
4 0 25 2 0 NA
5 0 29 2 0 NA
6 0 30 2 1 NA
7 0 31 1 1 NA
8 0 42 1 0 NA
9 0 44 2 0 NA
10 0 44 NA 0 NA
# ... with 444 more rows
Note that an Initiative value of 0 means "other initiative".
I have also tried this code:
library(plyr)
GoalMeanTone2 <- ddply( tabmean, .(Initiative), function(x) mean(tabmean$Tone) )
which gives this output:
> GoalMeanTone2
Initiative V1
1 0 NA
2 1 NA
3 101 NA
4 102 NA
5 103 NA
6 104 NA
7 105 NA
8 107 NA
9 108 NA
10 110 NA
Note that in both cases I get NAs instead of an average for Tone.
I have removed the NAs from the Tone column and have also tried removing all the other missing values in the df (only about 30 values were deleted).
I have also recoded the values for Tone:
tabmean<-Meantable %>% mutate(Tone=recode(Tone,
`1`="1",
`2`="0",
`3`="-1",
`4`="2"))
I still cannot get the average tone for an initiative. Maybe the solution is more obvious than I think, but I am stuck and have no idea how to proceed.
I'd be very grateful for better code to do this. Thanks!
I'm not completely sure what you mean by 'the average tone when an initiative is mentioned', but if you want the average tone for, say, Initiative 1, you could try the following:
tabmean %>% filter(Initiative == 1) %>% summarise(avg_tone = mean(Tone, na.rm = TRUE))
Note that (1) you have to pass na.rm = TRUE to mean() if the column you are summarising has missing values, otherwise the result is NA, and (2) the columns must be numeric: check with str(tabmean) and, if needed, convert with tabmean <- tabmean %>% mutate(Tone = as.numeric(Tone)). Your output shows Tone as <chr>, and mean() of a character column returns NA.
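To get one average per Initiative (or per Initiative, Topic and Goals combination) rather than for a single Initiative, Tone has to be numeric and must not be one of the grouping columns. The following is a minimal base R sketch using aggregate() instead of dplyr; the small tabmean below is a hypothetical sample stored as character to mimic the <chr> columns in the question, and by_initiative and by_combo are illustrative names.

```r
# Hypothetical sample in the shape of tabmean, with columns stored as character
tabmean <- data.frame(
  Initiative = c("52", "294", "103", "52", "87", "52"),
  Topic      = c("44", "42", "31", "41", "26", "87"),
  Goals      = c("2", "2", "2", "2", "2", "2"),
  Tone       = c("2", "2", "2", "2", "1", "2"),
  stringsAsFactors = FALSE
)

# mean() needs numbers, so convert Tone before summarising
tabmean$Tone <- as.numeric(tabmean$Tone)

# Average Tone per Initiative
by_initiative <- aggregate(Tone ~ Initiative, data = tabmean, FUN = mean)

# Average Tone per Initiative / Topic / Goals combination
by_combo <- aggregate(Tone ~ Initiative + Topic + Goals, data = tabmean, FUN = mean)
```

The formula interface of aggregate() drops rows with missing values by default, so no separate na.rm argument is needed here.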

Choosing the correct fixed and random variables in a generalized linear mixed model (GLMM) in a longitudinal study (repeated measures)

I want to explore the relationship between the abundance of an organism and several possible explanatory factors. I am unsure which variables should be treated as fixed and which as random in the GLMM.
I have a dataset with the number of snails at different sites within a national park (all sites are under the same climatic conditions). However, there are local parameters whose effects on snail abundance haven't been studied yet.
This is a longitudinal study, with repeated measures over time (every month, for almost two years). Snails were counted in the field, always at the same 21 sites (each site has a 6x6 square meter plot, delimited with wooden stakes).
In case it influences the analysis, note that some parameters may vary over time, such as the vegetation cover in each plot or the presence of the snail's natural predator (measured with yes/no values). Others are constant because they are specific to each site, such as the distance to the nearest riverbed or the type of soil.
Here is a subset of my data:
> snail.data
site time snails vegetation_cover predator type_soil distant_riverbed
1 1 1 9 NA n 1 13
2 1 2 7 0.8 n 1 13
3 1 3 13 1.4 n 1 13
4 1 4 14 0.6 n 1 13
5 1 5 12 1.6 n 1 13
10 2 1 0 NA n 1 136
11 2 2 0 0.0 n 1 136
12 2 3 0 0.0 n 1 136
13 2 4 0 0.0 n 1 136
14 2 5 0 0.0 n 1 136
19 3 1 1 NA n 2 201
20 3 2 0 0.0 n 2 201
21 3 3 0 0.0 y 2 201
22 3 4 3 0.0 n 2 201
23 3 5 2 0.0 n 2 201
28 4 1 0 NA n 2 104
29 4 2 0 0.0 n 2 104
30 4 3 0 0.0 y 2 104
31 4 4 0 0.0 n 2 104
32 4 5 0 0.0 n 2 104
37 5 1 1 NA n 3 65
38 5 2 0 2.4 n 3 65
39 5 3 3 2.2 n 3 65
40 5 4 2 2.2 n 3 65
41 5 5 4 2.0 y 3 65
46 6 1 1 NA n 3 78
47 6 2 2 3.0 n 3 78
48 6 3 7 2.8 n 3 78
49 6 4 3 1.8 n 3 78
50 6 5 6 1.2 y 3 78
55 7 1 14 NA n 3 91
56 7 2 21 2.8 n 3 91
57 7 3 16 2.6 n 3 91
58 7 4 15 1.6 n 3 91
59 7 5 8 2.0 n 3 91
I'm interested in investigating whether the number of snails differs significantly between sites and whether those differences are related to specific parameters.
So far the best statistical approach I have found is a generalized linear mixed model, but I'm struggling to choose the correct fixed and random variables. My reasoning is that, although I'm checking for differences among sites (by comparing the number of snails), the focus of the study is the other parameters described above, so site would be a random factor.
My question is: should 'site' and 'time' be treated as random factors and the local parameters as fixed variables? Should I include interactions between time and other factors?
I have set up my command as follows:
library(lme4)
mixed_model <- glmer(snails ~ vegetation_cover + predator + type_soil + distant_riverbed + (1|site) + (1|time), data = snail.data, family = poisson)
Is this the correct syntax for what I have described?

Cumulative function for a specific range of values

I have a table with a column "Age" that has values from 1 to 10, and a column "Population" that gives a value for each age. I want to generate a cumulative sum of Population such that the resulting values cover ages 1 and above, 2 and above, and so on. That is, the resulting vector should be (203, 180, and so on). Any help would be appreciated!
Age Population Withdrawn
1 23 3
2 12 2
3 32 2
4 33 3
5 15 4
6 10 1
7 19 2
8 18 3
9 19 1
10 22 5
You can use cumsum and rev:
df$sum_above <- rev(cumsum(rev(df$Population)))
The result:
> df
Age Population sum_above
1 1 23 203
2 2 12 180
3 3 32 168
4 4 33 136
5 5 15 103
6 6 10 88
7 7 19 78
8 8 18 59
9 9 19 41
10 10 22 22
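As a self-contained check of the idea, the sketch below rebuilds the table from the question and compares the rev/cumsum result with an equivalent formulation, total minus the running sum plus the current row. The column name sum_above_alt is an assumption added for illustration.

```r
df <- data.frame(
  Age        = 1:10,
  Population = c(23, 12, 32, 33, 15, 10, 19, 18, 19, 22)
)

# Reverse, accumulate, reverse back: row i holds the sum over ages >= i
df$sum_above <- rev(cumsum(rev(df$Population)))

# Equivalent form: total minus everything strictly before the current row
df$sum_above_alt <- sum(df$Population) - cumsum(df$Population) + df$Population
```

Both columns hold the same values, starting at 203 for age 1 and ending at 22 for age 10.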

Reshaping a df in a specific way in R [duplicate]

I am struggling to reshape this df into a different layout. I have this:
ID task mean sd mode
1 0 2 10 1.5 223
2 0 2 21 2.4 213
3 0 2 24 4.3 232
4 1 3 26 2.2 121
5 1 3 29 1.3 433
6 1 3 12 2.3 456
7 2 4 45 4.3 422
8 2 4 67 5.3 443
9 2 4 34 2.1 432
and I would like to reshape it as follows, discarding sd and mode and placing the means side by side in each row:
ID task mean mean1 mean2
1 0 2 10 21 24
2 1 3 26 29 12
3 2 4 45 67 34
Thanks a lot for your help in advance
You need to create a new column first by which we can pivot the mean values. Using data.table, this approach works:
library(data.table)
dt <- data.table(df) # Convert to data.table
dcast(dt[, nr := seq_len(.N), by = ID], # number the rows within each ID
      ID + task ~ nr,
      value.var = "mean")
# ID task 1 2 3
#1: 0 2 10 21 24
#2: 1 3 26 29 12
#3: 2 4 45 67 34
Afterwards, you can rename the columns to whatever you want them to be called.
reshape(cbind(df, time = ave(df$ID, df$ID, FUN = seq_along)),
        direction = 'wide', idvar = c('ID', 'task'),
        drop = c('sd', 'mode'), sep = '')
## ID task mean1 mean2 mean3
## 1 0 2 10 21 24
## 4 1 3 26 29 12
## 7 2 4 45 67 34
Data
df <- data.frame(ID=c(0L,0L,0L,1L,1L,1L,2L,2L,2L),task=c(2L,2L,2L,3L,3L,3L,4L,4L,4L),mean=c(
10L,21L,24L,26L,29L,12L,45L,67L,34L),sd=c(1.5,2.4,4.3,2.2,1.3,2.3,4.3,5.3,2.1),mode=c(223L,
213L,232L,121L,433L,456L,422L,443L,432L));

Rank function to rank multiple variables in R

I am trying to rank multiple numeric variables (700+ variables) in my data and am not sure exactly how to do this, as I am still fairly new to R.
I do not want to overwrite the original variables with the ranked values, so I need to create a new rank variable for each numeric variable.
From reading other posts, I believe the assign and transform functions together with rank may be able to solve this. I tried implementing it as below (sample data and code) and am struggling to get it to work.
The output dataset, in addition to the variables xcount, xvisit and ysales, needs to be populated with the variables xcount_rank, xvisit_rank and ysales_rank containing the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it creates the rank values as intervals such as (101, 230], (230, 450], whereas I would like the rank variable to hold 1, 2, etc., up to 10 categories as per the splits I did. Is there any way to achieve this?
input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
I would like each record to show the number of the group it falls into when the interval values are ranked.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2
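The same two results can also be obtained without dplyr. The sketch below is a base R variant using lapply on the second sample from the question; the bin_ prefix and the helper names num_cols, rank_cols and bin_cols are assumptions for illustration.

```r
input <- read.table(header = FALSE, text = "101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id", "xcount", "xvisit", "ysales")

num_cols  <- names(input)[2:4]
rank_cols <- paste("rank", num_cols, sep = "_")
bin_cols  <- paste("bin", num_cols, sep = "_")

# Plain ranks, ties broken by order of appearance
input[rank_cols] <- lapply(input[num_cols], rank, ties.method = "first")

# Equal-width bins; labels = FALSE returns the bin number instead of the interval
input[bin_cols] <- lapply(input[num_cols], cut, breaks = 3, labels = FALSE)
```

Selecting the columns by name rather than by position means the approach scales to 700+ variables as long as num_cols lists them.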
