I have a data frame with 1666 rows. I would like to add a column with a repeating sequence of 1:5 to use with cut() to do cross validation. It would look like this:
Y x1 x2 Id1
1 .15 3.6 1
0 1.1 2.2 2
0 .05 3.3 3
0 .45 2.8 4
1 .85 3.1 5
1 1.01 2.9 1
... ... ... ...
I've tried the following 2 ways but get an error message as it seems to only add numbers in increments of the full seq() argument:
> tr2$Id1 <- rep(seq(1,5,1), (nrow(tr2)/5))
Error in `$<-.data.frame`(`*tmp*`, "Id", value = c(1, 2, 3, 4, 5, 1, 2, :
replacement has 1665 rows, data has 1666
> tr2$Id1 <- rep(seq(1,5,1), (nrow(tr2)/5) + (nrow(tr2)%%5))
Error in `$<-.data.frame`(`*tmp*`, "Id", value = c(1, 2, 3, 4, 5, 1, 2, :
replacement has 1670 rows, data has 1666
Any suggestions?
Use the length.out argument of rep() or rep_len (a "faster simplified version" [of rep]):
length.out: non-negative integer. The desired length of the output vector
Here is an example using the built-in dataset cars.
str(cars)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
Add grouping column:
cars$group <- rep(1:3, length.out = 50L)
Inspect the result:
head(cars)
speed dist group
1 4 2 1
2 4 10 2
3 7 4 3
4 7 22 1
5 8 16 2
6 9 10 3
tail(cars)
speed dist group
45 23 54 3
46 24 70 1
47 24 92 2
48 24 93 3
49 24 120 1
50 25 85 2
Something like this?
df <- data.frame(rnorm(1666))
df$cutter <- rep(1:5, length.out=1666)
tail(df)
rnorm.1666. cutter
1661 0.11693169 1
1662 -1.12508091 2
1663 0.25441847 3
1664 -0.06045037 4
1665 -0.17242921 5
1666 -0.85366242 1
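To avoid hard-coding the row count, the same idea can be written against nrow(). This is only a minimal sketch: tr2 is the name used in the question, but the data frame built here is a stand-in with made-up values.

```r
# Stand-in for the question's data frame; the values are made up.
set.seed(1)
tr2 <- data.frame(Y = rbinom(1666, 1, 0.5), x1 = runif(1666), x2 = runif(1666))

# rep_len() recycles 1:5 until the vector has exactly nrow(tr2) elements,
# so it works even when the row count is not a multiple of 5.
tr2$Id1 <- rep_len(1:5, nrow(tr2))

# Fold sizes differ by at most one row.
fold_sizes <- table(tr2$Id1)
```

If the row order is meaningful, shuffling the labels with sample(tr2$Id1) is a common way to randomize fold membership.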
Related
I have a large df with values in a column with the sequence pattern: seq(1,3000,10).
I need to change every value in the column so that
1 = 1
11 = 2
21 = 3
31 = 4
41 = 5
The order of these numbers is jumbled in places, so I need a rule that converts every 1 to 1, 11 to 2, 21 to 3, 31 to 4 and so on for thousands of numbers with this sequence pattern.
Example
x <- seq(1, 100, by = 10)
# [1] 1 11 21 31 41 51 61 71 81 91
You can use %/%:
x %/% 10 + 1
# [1] 1 2 3 4 5 6 7 8 9 10
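Because %/% is vectorised, the order of the input does not matter. A small sketch with a jumbled vector (the values are illustrative, not from the question's data):

```r
# Jumbled values that follow the 1, 11, 21, ... pattern
x <- c(21, 1, 41, 11, 1, 31)

# Integer-divide by 10, then shift so that 1 maps to 1
idx <- x %/% 10 + 1
idx
# [1] 3 1 5 2 1 4
```

For a data frame column the same expression applies directly, e.g. df$col %/% 10 + 1, where df and col are placeholder names.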
I have the following dataset:
Class Budget Total Rank
A 120 1926 58 5 9 2 10 3
B 120 3146 52 6 15 1 6 7 8 9
C 120 2358 51 2 1 4
D 120 3252 57 5 16 0.5 9 7 6 33 4 6
I would like to get the maximum and minimum value for each row starting from the column after the Rank (i.e., those columns that don't have titles).
What I want is to include the max and min within the data frame like:
Class Budget Total Rank max min
A 120 1926 58 10 2 5 9 2 10 3
B 120 3146 52 15 1 6 15 1 6 7 8 9
C 120 2358 51 4 1 2 1 4
D 120 3252 57 33 0.5 5 16 0.5 9 7 6 33 4 6
How can I do that?
Try the following:
vals <- df[, 5:ncol(df)] # the untitled value columns after Rank, selected once
df[, "Max"] <- apply(vals, 1, max, na.rm = TRUE)
df[, "Min"] <- apply(vals, 1, min, na.rm = TRUE)
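A self-contained sketch of the row-wise approach, assuming the untitled columns are read in as V5, V6, ... and padded with NA where a row has fewer values. The column names and the two rows of data are assumptions made for illustration.

```r
# Hypothetical version of the first two rows; V5..V8 are assumed names
df <- data.frame(Class = c("A", "B"), Budget = 120,
                 Total = c(1926, 3146), Rank = c(58, 52),
                 V5 = c(5, 6), V6 = c(9, 15), V7 = c(2, 1), V8 = c(10, NA))

# Select the value columns once, before any new column is appended
vals <- df[, 5:ncol(df)]

# apply() with MARGIN = 1 works row by row; na.rm = TRUE skips the padding
df$Max <- apply(vals, 1, max, na.rm = TRUE)
df$Min <- apply(vals, 1, min, na.rm = TRUE)
```

Selecting the columns once keeps the second call from picking up the newly added Max column.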
In Matloff's The Art of R programming, he uses the function below (z12) to demonstrate the use of a vector-valued function.
z12 <- function(z) return(c(z, z^2))
My question is: When applying the function to 1:8, why does it return 1 2 3 4 ... 1 4 9 16 ... and not 1, 1, 2, 4, 3, 9 ...? After all, isn't z^2 right next to z in the return statement?
The c() is the concatenation operator. It joins two vectors end to end. You can do
c(1, 2)
# [1] 1 2
c(1:3, 9:11)
# [1] 1 2 3 9 10 11
So the function you've defined is running
c(1:8, (1:8)^2)
# [1] 1 2 3 4 5 6 7 8 1 4 9 16 25 36 49 64
So c() joins the two vectors after each one has been fully computed, not element by element as they are being computed.
We can change the function to
z12 <- function(z) c(rbind(z, z^2))
z12(1:8)
#[1] 1 1 2 4 3 9 4 16 5 25 6 36 7 49 8 64
An alternative would be to use rep and recycling.
zN <- function(x) rep(x, each=2)^c(1:2)
Now, give it a try.
zN(1:8)
[1] 1 1 2 4 3 9 4 16 5 25 6 36 7 49 8 64
Or, to reproduce the order of the original function:
zN2 <- function(x) x^rep(1:2, each=length(x))
zN2(1:8)
[1] 1 2 3 4 5 6 7 8 1 4 9 16 25 36 49 64
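To see why the rbind() version interleaves the values, it may help to look at the intermediate matrix. A short sketch with a smaller input:

```r
z <- 1:4

# rbind() stacks the two vectors as rows of a 2 x 4 matrix:
# row 1 holds z, row 2 holds z^2
m <- rbind(z, z^2)

# c() drops the dimensions; matrices are stored column by column,
# so each z[i] is immediately followed by z[i]^2
interleaved <- c(m)
interleaved
# [1]  1  1  2  4  3  9  4 16
```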
I am doing simulations and am trying to add error to a column repeatedly, specifically to the column titled Ao. In my output, the first 30 rows are correct: we have the initial data and the first year of altered data (error added to Ao). After that, where I would like to have 30 years of added error, I get repeats of Year 2 for Ao up to year 30. My goal is to add error after each year of sampling, i.e. Year 2 is Year 1 Ao + error, Year 3 is Year 2 Ao + error, and so on. Any helpers? Cheers.
for(t in 1:30){
Error<-rnorm(1000,0,1)
m<-rep(year1data$m,30)
r<-rep(year1data$r,30)
a<-rep(year1data$a,30)
g<-rep(year1data$g,30)
Year<-rep(2:31, each=TotSpecies)
Species<-1:TotSpecies
Ao<-year1data$Ao+sample(Error,TotSpecies,replace=FALSE)
TotSpeciesdata<-data.frame(Species,Year,Ao,m,r,a,g)
TotSpeciesdata<-rbind(year1data,TotSpeciesdata)
}
> TotSpeciesdata
Species Year Ao m r a g
1 1 1 25.770783 43 119.110786 3.2305180 2.6526471
2 2 1 53.908914 138 161.894541 0.7342070 0.1151602
3 3 1 2.010732 226 193.820489 2.2890904 3.6248105
4 4 1 23.742254 332 17.315335 1.4009572 2.0037931
5 5 1 4.291080 63 187.591209 0.2563995 2.1553908
6 6 1 4.691113 343 116.267867 0.3899113 3.3950085
7 7 1 604.133044 224 132.240197 3.0410743 0.7985524
8 8 1 13.332567 166 5.367118 0.7921644 1.7861011
9 9 1 3.759268 141 212.340970 2.8733737 2.7123141
10 10 1 3.647390 209 259.400858 0.1249936 0.6594659
11 11 1 23.731109 10 114.171147 2.2437372 0.9867591
12 12 1 85.116996 69 167.412993 0.8306823 2.8905148
13 13 1 31.684280 277 177.025460 2.7618332 2.9245554
14 14 1 30.657523 205 21.710438 2.7661347 1.5911379
15 15 1 12.240410 85 210.121109 2.8827455 3.0418454
16 1 2 27.038097 43 119.110786 3.2305180 2.6526471
17 2 2 54.251600 138 161.894541 0.7342070 0.1151602
18 3 2 2.010636 226 193.820489 2.2890904 3.6248105
19 4 2 22.699369 332 17.315335 1.4009572 2.0037931
20 5 2 4.542589 63 187.591209 0.2563995 2.1553908
21 6 2 3.607833 343 116.267867 0.3899113 3.3950085
22 7 2 604.480756 224 132.240197 3.0410743 0.7985524
23 8 2 13.663513 166 5.367118 0.7921644 1.7861011
24 9 2 2.138715 141 212.340970 2.8733737 2.7123141
25 10 2 3.642769 209 259.400858 0.1249936 0.6594659
26 11 2 22.897993 10 114.171147 2.2437372 0.9867591
27 12 2 85.490897 69 167.412993 0.8306823 2.8905148
28 13 2 31.689202 277 177.025460 2.7618332 2.9245554
29 14 2 30.644419 205 21.710438 2.7661347 1.5911379
30 15 2 12.050207 85 210.121109 2.8827455 3.0418454
31 1 3 27.038097 43 119.110786 3.2305180 2.6526471
32 2 3 54.251600 138 161.894541 0.7342070 0.1151602
33 3 3 2.010636 226 193.820489 2.2890904 3.6248105
34 4 3 22.699369 332 17.315335 1.4009572 2.0037931
35 5 3 4.542589 63 187.591209 0.2563995 2.1553908
36 6 3 3.607833 343 116.267867 0.3899113 3.3950085
37 7 3 604.480756 224 132.240197 3.0410743 0.7985524
38 8 3 13.663513 166 5.367118 0.7921644 1.7861011
39 9 3 2.138715 141 212.340970 2.8733737 2.7123141
40 10 3 3.642769 209 259.400858 0.1249936 0.6594659
41 11 3 22.897993 10 114.171147 2.2437372 0.9867591
42 12 3 85.490897 69 167.412993 0.8306823 2.8905148
43 13 3 31.689202 277 177.025460 2.7618332 2.9245554
44 14 3 30.644419 205 21.710438 2.7661347 1.5911379
45 15 3 12.050207 85 210.121109 2.8827455 3.0418454
The main problem you have with your approach is the line:
TotSpeciesdata<-data.frame(Species,Year,Ao,m,r,a,g)
This is a problem because Year is a vector of length 30 * TotSpecies, while all the others are only TotSpecies long. In effect, every column except Year is recycled 30 times when you create the data frame, which leads to the year 2 data being repeated 30 times, among other things. If you just have Year <- rep(t + 1, TotSpecies) I think your logic will work fine. That said, here is an alternate approach:
This will, for each species, create an incrementing random walk starting with Ao for that species for 5 years (just did that for display purposes):
set.seed(1)
year1data <- data.frame(species=1:10, year=1, Ao=runif(10, 1, 700))
TotSpeciesData <- do.call(
  rbind,
  lapply(
    split(year1data, year1data$species),
    function(data)
      with(
        data,
        # year runs 1:6: the starting year plus 5 simulated years
        data.frame(species=species, year=1:6, Ao=c(Ao, Ao + cumsum(rnorm(5))))
) ) )
head(TotSpeciesData, 15)
Note I excluded columns m-g since they don't seem directly relevant to your particular question, but you can add them relatively easily. I also only did 5 years in addition to year 1 so you can see the results here, but that is also easy to change:
species year Ao
1.1 1 1 186.5906
1.2 1 2 185.7701
1.3 1 3 186.2575
1.4 1 4 186.9958
1.5 1 5 187.5716
1.6 1 6 187.2662
2.1 2 1 261.1146
2.2 2 2 262.6264
2.3 2 3 263.0162
2.4 2 4 262.3950
2.5 2 5 260.1803
2.6 2 6 261.3052
3.1 3 1 401.4245
3.2 3 2 401.3796
3.3 3 3 401.3634
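For comparison, the loop-based idea from the question (each year is the previous year's Ao plus fresh error) could be sketched as below. This is only an illustration under assumptions: the three-species year1data is made up, and the columns m, r, a and g are left out.

```r
set.seed(1)                        # assumption: seed only for reproducibility
TotSpecies <- 3                    # hypothetical, kept small for display
year1data <- data.frame(Species = 1:TotSpecies, Year = 1,
                        Ao = c(25.8, 53.9, 2.0))

years <- vector("list", 31)        # year 1 plus 30 simulated years
years[[1]] <- year1data
for (t in 2:31) {
  current <- years[[t - 1]]        # start from the previous year, not year 1
  current$Year <- t
  current$Ao <- current$Ao + rnorm(TotSpecies, 0, 1)  # add fresh error
  years[[t]] <- current
}

# Bind once at the end instead of overwriting the result inside the loop
TotSpeciesdata <- do.call(rbind, years)
```

Storing each year in a list and binding once at the end keeps every year's values and avoids growing the data frame inside the loop.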
It has been pointed out that the code you provided above, or at least my edited version of it, repeats itself every 15 years rather than being unique from year to year in a step-wise fashion. I edited it as shown below:
TotSpeciesData <- do.call(
rbind, #bind the table by rows
lapply( #applying the function in list form
split(year1data, year1data$Species), #splits data into groups by species
function(data)
with(
data,
data.frame(Species=Species, Year=1:Community, Ao=c(Ao, Ao + cumsum(rnorm((TotSpecies-1),0,2))),m=m, r=r, a=a, g=g) #data frame is Species, Year,
) ) )
TotSpeciesData$Ao[TotSpeciesData$Ao<0]<-0 #any values less than 0 go to 0
TotSpeciesData<-TotSpeciesData[order(TotSpeciesData$Year),] #orders the data frame by Year
When I do this code:
TotSpeciesData[TotSpeciesData$Species==1 & TotSpeciesData$Year %in% c(1,2,16,17),]
I end up with an output showing that the data is repeating itself.
Species Year Ao m r a g
1.1 1 1 48.49161 239 332.9625 3.791778 2.723104
1.2 1 2 49.62851 239 332.9625 3.791778 2.723104
1.16 1 16 48.49161 239 332.9625 3.791778 2.723104
1.17 1 17 49.62851 239 332.9625 3.791778 2.723104
Any comments toward this?
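One possible explanation, assuming TotSpecies is 15 and Community is 30 as the output suggests: c(Ao, Ao + cumsum(rnorm(TotSpecies - 1, 0, 2))) has only 15 elements, and data.frame() recycles it to match Year = 1:Community. The sketch below reproduces the effect with made-up numbers and then builds a walk of length Community instead.

```r
set.seed(1)
TotSpecies <- 15   # assumption, inferred from the 15-year repeat
Community  <- 30   # assumption: number of years wanted
Ao <- 48.5         # made-up starting value

# Walk of length TotSpecies: data.frame() silently recycles it to 30 rows
short_walk <- c(Ao, Ao + cumsum(rnorm(TotSpecies - 1, 0, 2)))
recycled <- data.frame(Year = 1:Community, Ao = short_walk)
recycled$Ao[c(1, 16)]   # the same value twice: year 16 repeats year 1

# Walk of length Community: one distinct value per year
long_walk <- c(Ao, Ao + cumsum(rnorm(Community - 1, 0, 2)))
fixed <- data.frame(Year = 1:Community, Ao = long_walk)
```

If that reading is right, replacing TotSpecies - 1 with Community - 1 inside rnorm() would give each year its own value.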
Is there a way - other than a for loop - to generate new variables in an R dataframe, which will be all the possible 2-way interactions between the existing ones?
i.e. supposing a dataframe with three numeric variables V1, V2, V3, I would like to generate the following new variables:
Inter.V1V2 (= V1 * V2)
Inter.V1V3 (= V1 * V3)
Inter.V2V3 (= V2 * V3)
Example using for loop :
x <- read.table(textConnection('
V1 V2 V3 V4
1 9 25 18
2 5 20 10
3 4 30 12
4 4 34 16'
), header=TRUE)
dim.init <- dim(x)[2]
for (i in 1: (dim.init - 1) ) {
for (j in (i + 1) : (dim.init) ) {
x[dim(x)[2] + 1] <- x[i] * x[j]
names(x)[dim(x)[2]] <- paste("Inter.V",i,"V",j,sep="")
}
}
Here is a one liner for you that also works if you have factors:
> model.matrix(~(V1+V2+V3+V4)^2,x)
(Intercept) V1 V2 V3 V4 V1:V2 V1:V3 V1:V4 V2:V3 V2:V4 V3:V4
1 1 1 9 25 18 9 25 18 225 162 450
2 1 2 5 20 10 10 40 20 100 50 200
3 1 3 4 30 12 12 90 36 120 48 360
4 1 4 4 34 16 16 136 64 136 64 544
attr(,"assign")
[1] 0 1 2 3 4 5 6 7 8 9 10
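If only the interaction columns are wanted, they can be picked out of the model matrix by name. A minimal sketch using the question's data; the formula ~ .^2 is shorthand for all main effects plus all two-way interactions of the columns in data.

```r
x <- data.frame(V1 = 1:4, V2 = c(9, 5, 4, 4),
                V3 = c(25, 20, 30, 34), V4 = c(18, 10, 12, 16))

# Full model matrix: intercept, main effects and two-way interactions
mm <- model.matrix(~ .^2, data = x)

# Interaction columns are the ones whose names contain a colon
inter <- mm[, grepl(":", colnames(mm), fixed = TRUE), drop = FALSE]
colnames(inter)
# [1] "V1:V2" "V1:V3" "V1:V4" "V2:V3" "V2:V4" "V3:V4"
```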
Here you go, using combn and apply:
> x2 <- t(apply(x, 1, combn, 2, prod))
Setting the column names can be done with two paste commands:
> colnames(x2) <- paste("Inter.V", combn(1:4, 2, paste, collapse="V"), sep="")
Lastly, if you want all your variables together, just cbind them:
> x <- cbind(x, x2)
> x
V1 V2 V3 V4 Inter.V1V2 Inter.V1V3 Inter.V1V4 Inter.V2V3 Inter.V2V4 Inter.V3V4
1 1 9 25 18 9 25 18 225 162 450
2 2 5 20 10 10 40 20 100 50 200
3 3 4 30 12 12 90 36 120 48 360
4 4 4 34 16 16 136 64 136 64 544
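A variant of the same idea that pairs up column names instead of working row by row, so the products are computed on whole columns. This is a sketch, and var_pairs is a name introduced here for illustration.

```r
x <- data.frame(V1 = 1:4, V2 = c(9, 5, 4, 4),
                V3 = c(25, 20, 30, 34), V4 = c(18, 10, 12, 16))

# 2 x 6 character matrix: each column is one pair of variable names
var_pairs <- combn(names(x), 2)

# Multiply the two columns named in each pair; the result is a 4 x 6 matrix
inter <- apply(var_pairs, 2, function(p) x[[p[1]]] * x[[p[2]]])
colnames(inter) <- paste0("Inter.", apply(var_pairs, 2, paste, collapse = ""))

res <- cbind(x, inter)
```

Because the names come from names(x), the labels stay correct even if the columns are not called V1 to V4.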
I think this question should be complemented with the poly/polym functions, which go further: they generate not only the interactions between the variables, but also their powers up to the selected degree. They can also produce orthogonal interactions, which may be very useful.
The direct solution to the problem as asked would be:
> polym(x$V1, x$V2, x$V3, x$V4, degree = 2, raw = TRUE)
1.0.0.0 2.0.0.0 0.1.0.0 1.1.0.0 0.2.0.0 0.0.1.0 1.0.1.0 0.1.1.0 0.0.2.0 0.0.0.1 1.0.0.1 0.1.0.1 0.0.1.1 0.0.0.2
[1,] 1 1 9 9 81 25 25 225 625 18 18 162 450 324
[2,] 2 4 5 10 25 20 40 100 400 10 20 50 200 100
[3,] 3 9 4 12 16 30 90 120 900 12 36 48 360 144
[4,] 4 16 4 16 16 34 136 136 1156 16 64 64 544 256
attr(,"degree")
[1] 1 2 1 2 2 1 2 2 2 1 2 2 2 2
Columns 4, 7, 8, 11, 12, and 13 contain the interactions requested in the question. The other columns hold the linear and squared terms of each variable. If you would like to get orthogonal interactions, just set raw = FALSE.
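Rather than counting column positions by hand, the interaction columns can be picked out from the column names, which encode the exponent of each variable. A sketch under the assumption that the names have the dot-separated form shown in the output above:

```r
x <- data.frame(V1 = 1:4, V2 = c(9, 5, 4, 4),
                V3 = c(25, 20, 30, 34), V4 = c(18, 10, 12, 16))
p <- polym(x$V1, x$V2, x$V3, x$V4, degree = 2, raw = TRUE)

# Split names such as "1.1.0.0" into one exponent per variable
exps <- strsplit(colnames(p), ".", fixed = TRUE)

# With degree = 2, a pairwise interaction has exactly two exponents equal to 1
is_inter <- vapply(exps, function(e) sum(e == "1") == 2, logical(1))

# unclass() gives a plain matrix before subsetting
inter <- unclass(p)[, is_inter, drop = FALSE]
```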