I am trying to create a specialized summary 'matrix' for my supervisor, and would like R to export it in a clean, readable form. To tailor it to our project, I am basically building it from scratch. My problem is that I can't figure out how to get a data frame I create to behave like an imported one, specifically with regard to headers.
I am most comfortable dealing with imported data frames with headers, and calling specific columns by name instead of by column number:
iris$Sepal.Length
with(iris,Sepal.Length)
iris['Sepal.Length']
Now, if I want to create a data frame (or matrix, I'm not entirely sure what the difference is), I have tried the following:
groups<-c("Group 1", "Group 2")
factors<-c("Fac 1", "Fac 2", "Fac 3","Fac 4", "Fac 5")
x<-1:10
y<-11:20
z<-21:30
data<-cbind(groups, factors, x, y, z)
names(data) #returns NULL
data$x #clearly doesn't return the column 'x' since the matrix 'data' has no names
data<-data.frame(cbind(groups, factors, x, y, z))
names(data) #confirms that there are header names
So, I have created a data frame that has the columns x, y and z, but in reality I don't have a premade column to start off with. If I knew how many rows of data there would be I could simply do:
data<-data.frame(1:10)
data$x<-x
data$y<-y
data$z<-z
I tried creating an empty data frame, but it is one element big, and if I try to append a vector to it (of any length greater than 1), I get an error:
data<-data.frame(0)
data$x<-x #returns an error
My best guess is to pass through the data once to find out how many rows of data I will have (there are several factor levels, and the summary matrix will have a row for each possible combination of factors). Then I can get the data frame started with a simple:
data<-data.frame(1:n) #where n would be how many rows of data I would have
And follow through by creating individual vectors for each summary statistic I want and appending them to the data frame with $.
Another solution I tried to play with was creating a matrix and filling in each element as I calculate it within a loop. I know the apply family is better than a loop, but to make my summary table tailored to my needs I would need to run an apply function then try to pull the individual data:
means<-with(iris,tapply(iris[,4],Species,mean))
means[1] #This returns the species and the mean petal width. What I need is the numeric part of this, as I will have my own headers, or possibly a separate summary table for each species.
I'm not sure if extracting the numerical information from the apply output is better / any easier than simply constructing my own loop to calculate the required statistics. It would be a nested loop that would first sort by group (2 runs), then an internal loop that would run by factors (5 runs) for a total of 10 runs through the data. I was thinking of creating an empty matrix, and simply saving the data in the appropriate cell when it is calculated. My problem, again, is calling a specific column in a matrix by name. I have tried:
m<-matrix(0,ncol=5)
m[1,1]<-'Groups'
m[1,2]<-'Factors'
m[1,3]<-'Mean.x'
m[1,4]<-'Mean.y'
m[1,5]<-'Mean.z'
names(m) #Returns NULL
My desired output would look like:
Groups Factors Mean.x Mean.y Mean.z
Group 1 Fac 1
Group 1 Fac 2
Group 1 Fac 3
Etc, for all combinations of groups and factors.
You can use ddply from the plyr package for that. Assume your original data frame is mydata and the data frame where you store the result is newdata:
library(plyr)
newdata<-ddply(mydata,.(Groups,Factors),summarize,mean.x=mean(x),mean.y=mean(y),mean.z=mean(z))
Example: mydata<-iris
> newdata<-ddply(mydata,.(Species),colwise(mean))
> newdata
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
I think this is what you're looking for, but I'm a little confused by your question in general. This will basically give you a pivot table of the means of columns x, y, and z, grouped by the columns 'groups' and 'factors':
aggregate(.~groups+factors, data=data, FUN="mean")
groups factors x y z
1 Group 1 Fac 1 1 1 1
2 Group 2 Fac 1 7 6 1
3 Group 1 Fac 2 8 7 1
4 Group 2 Fac 2 3 2 1
5 Group 1 Fac 3 4 3 1
6 Group 2 Fac 3 9 8 1
7 Group 1 Fac 4 10 9 1
8 Group 2 Fac 4 5 4 1
9 Group 1 Fac 5 6 5 1
10 Group 2 Fac 5 2 10 1
or with the iris data grouped by Species:
aggregate(.~Species, data=iris, FUN="mean")
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
UPDATE: To calculate the mean of only certain columns, you can either pass just the appropriate columns of your dataset to the aggregate function (perhaps by calling subset) or modify the formula like this:
aggregate(cbind(Sepal.Length,Sepal.Width)~Species, data=iris, FUN="mean")
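On the question's side point about pulling only the numeric part out of tapply output: the sketch below is one way to do it, assuming the built-in iris data. The names summary_tbl, Group and Mean.Petal.Width are made up for illustration.

```r
# tapply returns a named 1-d array: one mean per species.
means <- with(iris, tapply(Petal.Width, Species, mean))

# as.numeric() drops the names and keeps only the values.
setosa_mean <- as.numeric(means["setosa"])

# Assemble a summary table with custom headers
# (Group and Mean.Petal.Width are illustrative names).
summary_tbl <- data.frame(Group = names(means),
                          Mean.Petal.Width = as.numeric(means),
                          stringsAsFactors = FALSE)
summary_tbl
```

From there, write.csv(summary_tbl, "summary.csv", row.names = FALSE) would export the table; the file name is a placeholder.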
I am not entirely sure if that's what you are looking for but there are several options to add “stuff” to data frames:
To add a variable, just type data$newname <- NA (no need to know the length of the data frame or pass a vector, all rows will be filled with NA)
To append data use rbind (the data you are adding should be another data frame with the same variables)
To fix your example, first create an empty data frame and append data as it comes:
data <- data.frame(x=numeric())
data <- rbind(data, data.frame(x))
The previous example had only one variable (x) but you can also define a data frame with several variables and no rows:
data <- data.frame(x=numeric(),
y=numeric(),
a=character(),
b=factor(levels=c("Factor 1", "Factor 2")))
You don't need to know how many rows you will have, but the data you are adding needs to have the same structure. If that's not the case, you need to create columns with missing values in both data frames as needed, e.g.
data1 <- data.frame(x=1:10, y=1)
data2 <- data.frame(y=2, z=100:110)
rbind(data1, data2) # Error
data1$z <- NA
data2$x <- NA
rbind(data1, data2) # Now it works
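For the layout asked for in the question (one row per combination of group and factor), the skeleton can also be built up front rather than grown row by row. This is a rough sketch, assuming the groups and factors vectors from the question; summary_tbl and the value 3.5 are placeholders.

```r
groups  <- c("Group 1", "Group 2")
factors <- c("Fac 1", "Fac 2", "Fac 3", "Fac 4", "Fac 5")

# One row per Groups x Factors combination (Factors varies fastest),
# then reorder the columns to match the desired layout.
summary_tbl <- expand.grid(Factors = factors, Groups = groups,
                           stringsAsFactors = FALSE)[, c("Groups", "Factors")]

# Empty statistic columns; the single NA is recycled to every row.
summary_tbl$Mean.x <- NA_real_
summary_tbl$Mean.y <- NA_real_
summary_tbl$Mean.z <- NA_real_

# Fill one cell by row condition and column name (3.5 is a placeholder value).
is_row <- summary_tbl$Groups == "Group 1" & summary_tbl$Factors == "Fac 2"
summary_tbl[is_row, "Mean.x"] <- 3.5
```

Because the columns are named, cells can be addressed by header inside a loop without knowing column positions.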
I have a very large data frame and a set of adjustment coefficients that I wish to apply to certain years, with each coefficient applied to one and only one year. The code below tries, for each row, to select the right coefficient, and return a vector containing dat in the unaffected years and dat times that coefficient in the selected years, which is to replace dat.
year <- rep(1:5, times = c(2,2,2,2,2))
dat <- 1:10
df <- tibble(year, dat)
adjust = c(rep(0, 4), rep(c(1 + 0.1*1:3), c(2,2,2)))
df %>% mutate(dat = ifelse(year < 5, year, dat*adjust[[year - 2]]))
When I run this, I get the following error:
Evaluation error: attempt to select more than one element in vectorIndex.
I am pretty sure this is because the extraction operator [[ treats year as the entire vector year rather than the year of the current row, so there is then a vectorized subtraction, whereupon [[ chokes on the vector-valued index.
I know there are many ways to solve this problem. I have a particularly ugly way involving nested ifelse’s working now. My question is, is there any way to do what I was trying to do in an R- and dplyr- idiomatic way? In some ways this seems like a filter or group_by problem, since we want to treat rows or groups of rows as distinct entities, but I have not found a way of doing so that is any cleaner.
It seems like there are some functions which are easier to define or to think of as row-by-row rather than as the product of entire vectors. I could produce a single vector containing the correct adjustment for each year, but since the number of rows per year varies, I would still have to apply a multi-valued conditional test to construct that vector, so the same problem arises.
Or doesn’t it?
You need to use [ instead of [[ for vector indexing. Also, year - 2 produces non-positive indices for years 1 and 2, which causes further problems. If you want to map year to adjust by index position, you can use replace with a mask that marks the years to be modified:
df %>%
mutate(dat = {
mask = year > 2;
replace(year, mask, dat[mask] * adjust[year[mask] - 2])
})
# A tibble: 10 x 2
# year1 dat1
# <int> <dbl>
# 1 1 1.0
# 2 1 1.0
# 3 2 2.0
# 4 2 2.0
# 5 3 5.5
# 6 3 6.6
# 7 4 8.4
# 8 4 9.6
# 9 5 11.7
#10 5 13.0
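Another way to avoid row-by-row thinking is a lookup vector named by year. The sketch below uses base R and a plain data.frame instead of a tibble, and assumes the coefficients are 1.1, 1.2 and 1.3 for years 3, 4 and 5; that mapping is my reading of the question, not something it states outright. Unlike the code above, it leaves dat unchanged in the unaffected years, which is what the question's prose describes.

```r
year <- rep(1:5, each = 2)
dat  <- 1:10
df   <- data.frame(year, dat)

# Assumed coefficients, named by the year they apply to.
adjust <- c("3" = 1.1, "4" = 1.2, "5" = 1.3)

# Vectorised lookup: one coefficient per row; years with no entry give NA.
coef <- unname(adjust[as.character(df$year)])
coef[is.na(coef)] <- 1   # unlisted years are left unchanged

df$dat <- df$dat * coef
df
```

The number of rows per year does not matter here, since the lookup is done once per row by name.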
I have a data frame 'math.numeric' with 32 variables. Each row represents a student and each variable is an attribute. The students have been put into 5 groups based on their final grade.
The data looks as follows:
head(math.numeric)
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason ... group
1 1 18 2 1 1 4 4 1 5 1 2
1 1 17 2 1 2 1 1 1 3 1 2
1 1 15 2 2 2 1 1 1 3 3 3
1 1 15 2 1 2 4 2 2 4 2 4
1 1 16 2 1 2 3 3 3 3 2 3
1 2 16 2 2 2 4 3 4 3 4 4
I am performing t-tests on each variable for group 1 vs. all the other groups to identify significantly different attributes with this group. I am looking to pull out the p-values for each test such as:
t.test(subset(math.numeric$school, math.numeric$group == 1),
subset(math.numeric$school, math.numeric$group != 1))$p.value
t.test(subset(math.numeric$sex, math.numeric$group == 1),
subset(math.numeric$sex, math.numeric$group != 1))$p.value
t.test(subset(math.numeric$age, math.numeric$group == 1),
subset(math.numeric$age, math.numeric$group != 1))$p.value
I have been trying to figure out how I can create a loop to do this instead of writing out each test one at a time. I have tried a for loop, and lapply, but so far I have not had any luck.
I am fairly new to this, so any help would be appreciated.
Courtney
Your example data is not sufficient to actually carry out t-tests on all subgroups. For that reason, I use the iris dataset, which contains 3 species of plants: Setosa, Versicolor, and Virginica. These are my groups. You will have to adjust your code accordingly. Below I show how to test one group versus all other groups, one group versus each other group, and all combinations of individual groups.
One group versus all other groups combined:
First, let's say I want to compare Versicolor and Virginica to Setosa, i.e. Setosa is my group 1 to which all other groups should be compared. An easy way to achieve what you want is the following:
sapply(names(iris)[-ncol(iris)], function(x){
t.test(iris[iris$Species=="setosa", x],
iris[iris$Species!="setosa", x])$p.value
})
Sepal.Length Sepal.Width Petal.Length Petal.Width
7.709331e-32 1.035396e-13 1.746188e-69 1.347804e-60
Here, I have supplied the names of the variables in the dataset, names(iris), excluding the column that holds the grouping variable with [-ncol(iris)] (since it is the last column), as a vector to sapply, which passes each name in turn as the argument to the function that I have defined.
One group versus each of the other groups:
In case you want to make groupwise comparisons for all groups, the following may be helpful: First, create a dataframe of all group x variable combinations that you are going to do, excluding the grouping variable itself and the reference group, of course. This can be achieved by:
comps <- expand.grid(unique(iris$Species)[-1], # excluding Setosa as reference group
names(iris)[-ncol(iris)] # excluding group column
)
head(comps)
Var1 Var2
1 versicolor Sepal.Length
2 virginica Sepal.Length
3 versicolor Sepal.Width
4 virginica Sepal.Width
5 versicolor Petal.Length
6 virginica Petal.Length
Here, Var1 are the different species, and Var2 the different variables for which comparisons are to be done. The reference group 1 or Setosa is implicit in this case. Now, I can use apply to create the tests. I do this by using each row of comps as argument with two elements, the first of which indicates which group's turn it is, and the second argument indicates which variable should be compared. These will be used to subset the original dataframe.
comps$pval <- apply(comps, 1, function(x) {
t.test(iris[iris$Species=="setosa", x[2]], iris[iris$Species==x[1], x[2]])$p.value
} )
where group 1 aka Setosa is hard-coded in the function. This gives me a dataframe with p-values for all combinations (with Setosa as reference group) so that they are easy to look up:
head(comps)
Var1 Var2 pval
1 versicolor Sepal.Length 3.746743e-17
2 virginica Sepal.Length 3.966867e-25
3 versicolor Sepal.Width 2.484228e-15
4 virginica Sepal.Width 4.570771e-09
5 versicolor Petal.Length 9.934433e-46
6 virginica Petal.Length 9.269628e-50
All combinations of groups:
You can expand the above easily to produce a dataframe that contains p-values of t-tests for each combination of groups. One approach would be:
comps <- expand.grid(unique(iris$Species), unique(iris$Species), names(iris)[-ncol(iris)])
This now has three columns. The first two are the groups, and the third the variable:
head(comps)
Var1 Var2 Var3
1 setosa setosa Sepal.Length
2 versicolor setosa Sepal.Length
3 virginica setosa Sepal.Length
4 setosa versicolor Sepal.Length
5 versicolor versicolor Sepal.Length
6 virginica versicolor Sepal.Length
You can use this to carry out the tests:
comps$pval <- apply(comps, 1, function(x) {
t.test(iris[iris$Species==x[1], x[3]], iris[iris$Species==x[2], x[3]])$p.value
} )
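Note that the full grid also tests each group against itself and runs every pair in both orders. If only the unique pairs are wanted, something like the following sketch with combn might do; the column names group1, group2 and variable are my own.

```r
vars  <- names(iris)[-ncol(iris)]             # the four measurement columns
pairs <- t(combn(levels(iris$Species), 2))    # each unique pair once, one per row

# One row per (pair, variable) combination.
comps <- data.frame(
  group1   = rep(pairs[, 1], times = length(vars)),
  group2   = rep(pairs[, 2], times = length(vars)),
  variable = rep(vars, each = nrow(pairs)),
  stringsAsFactors = FALSE
)

# Run the tests row by row; USE.NAMES = FALSE keeps the result an unnamed vector.
comps$pval <- mapply(function(g1, g2, v) {
  t.test(iris[iris$Species == g1, v], iris[iris$Species == g2, v])$p.value
}, comps$group1, comps$group2, comps$variable, USE.NAMES = FALSE)

head(comps)
```

With three species and four variables this gives 12 tests instead of 36.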
I get an error message: what should I do?
t.test may throw an error if the sample size is too small or if the values are constant for one group. This is problematic since it might only occur for specific groups, and you may not know in advance which ones. Yet the error will abort the entire call to apply, and you will not be able to see any results.
A way to circumvent this and to identify the problematic groups is to wrap t.test in dplyr::failwith (see also ?tryCatch; note that dplyr has since deprecated failwith in favor of purrr::possibly). To show how this works, consider the following:
smalln <- data.frame(a=1, b=2)
t.test(smalln$a, smalln$b)
> Error in t.test.default(smalln$a, smalln$b) : not enough 'x' observations
failproof.t <- failwith(default="Some default of your liking", t.test, quiet = TRUE)
failproof.t(smalln$a, smalln$b)
[1] "Some default of your liking"
That way, whenever t.test would throw an error, you get a character string as the result instead, and the computation continues with the other groups. Needless to say, you could also set default to a number, or anything else. It does not have to be a character string.
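The same idea can be sketched in base R with tryCatch, which avoids depending on any package. safe_pvalue is a hypothetical helper name, and returning NA is just one possible default.

```r
# Return the p-value, or NA when t.test() stops with an error.
safe_pvalue <- function(x, y) {
  tryCatch(t.test(x, y)$p.value,
           error = function(e) NA_real_)
}

# Too few observations: t.test() would error, the wrapper returns NA.
p_small <- safe_pvalue(1, 2)

# Enough data: behaves like a plain t.test() call.
p_ok <- safe_pvalue(iris$Sepal.Length[iris$Species == "setosa"],
                    iris$Sepal.Length[iris$Species == "versicolor"])
```

A numeric default such as NA keeps the result column numeric, so failed tests can be found afterwards with is.na().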
Statistical disclaimer:
Having said all of this, note that conducting several t-tests is not necessarily good statistical practice. You may want to adjust your p-values to account for multiple testing, or you may want to use alternative test procedures that conduct joint tests.
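As a minimal sketch of the adjustment step, base R's p.adjust can be applied to the vector of p-values. The choice of the Holm and Benjamini-Hochberg methods here is only an illustration, not a recommendation for your data.

```r
# p-values for setosa versus all other species, one per measurement column.
pvals <- sapply(names(iris)[-ncol(iris)], function(v) {
  t.test(iris[iris$Species == "setosa", v],
         iris[iris$Species != "setosa", v])$p.value
})

# Holm controls the family-wise error rate; BH controls the false discovery rate.
p_holm <- p.adjust(pvals, method = "holm")
p_bh   <- p.adjust(pvals, method = "BH")

cbind(raw = pvals, holm = p_holm, BH = p_bh)
```

Adjusted p-values are never smaller than the raw ones, so any attribute that stays below your threshold after adjustment is a safer candidate.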
How's this?
pvals <- numeric() #the vector of p values
k <- 1 #in case you choose to use a subset not continuing from 1
# "for(i in seq(1,5))" is just doing the pvalues for the first 5 columns. You could do a
# list, like "c(1,2,4)" (in place of "seq(1,5)"), to do tests for columns 1, 2, and 4.
# To do all of the columns, try "for(i in seq(1,(ncol(math.numeric)-1)))".
for(i in seq(1,5)){
# using your code to grab the p-values and store them in the kth element of "pvals"
pvals[k] <- t.test(subset(math.numeric[,i], math.numeric$group == 1),
subset(math.numeric[,i], math.numeric$group != 1))$p.value
#iterating the "pvals" vector entry counter
k=k+1
}
pvals #printing the p values for each test
Consider splitting the data frame by group and using mapply() across the columns. With SIMPLIFY=FALSE the output is a list with one complete test result per column: statistic, parameter, p-value, confidence interval, etc.
# FILTER ROWS AND SUBSET NUMERIC COLS (ALL BUT THE LAST, GROUP, COLUMN)
numcols <- seq_len(ncol(df) - 1)
group1df <- df[df$group==1, numcols]
othgroupdf <- df[df$group!=1, numcols]
# T-TEST FCT
tfct <- function(v1, v2){
t.test(v1, v2)
}
# RUN T-TESTS BY COL, SAVE RESULTS TO LIST
ttests <- mapply(tfct, group1df, othgroupdf, SIMPLIFY=FALSE)
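To get from the test objects to a plain vector of p-values, something like the sketch below may help. It uses iris as stand-in data with a numeric group column (setosa plays group 1), and it passes SIMPLIFY = FALSE so that mapply returns one full result per column rather than a simplified matrix.

```r
# Stand-in data: four numeric columns plus a numeric group column in last position.
df <- data.frame(iris[, 1:4], group = as.integer(iris$Species))

num_cols   <- seq_len(ncol(df) - 1)          # every column except the group column
group1df   <- df[df$group == 1, num_cols]
othgroupdf <- df[df$group != 1, num_cols]

# One complete t-test result (class "htest") per column.
ttests <- mapply(t.test, group1df, othgroupdf, SIMPLIFY = FALSE)

# Pull the p-value out of each result.
pvals <- sapply(ttests, function(tt) tt$p.value)
pvals
```

Other components, such as tt$statistic or tt$conf.int, can be extracted in the same way.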
Suppose I have a data frame in R where I would like to use 2 columns "factor1" and "factor2" as factors and I need to calculate mean value for all other columns per each pair of the above mentioned factors. After running the code below, the last line gives the following warnings:
Warning messages:
1: In split.default(seq_along(x), f, drop = drop, ...) :
data length is not a multiple of split variable
...
Why is it happening and what should I do to make it right?
Thanks.
Here is my code:
# Create data frame
myDataFrame <- data.frame(factor1=c(1,1,1,2,2,2,3,3,3), factor2=c(3,3,3,4,4,4,5,5,5), val1=c(1,2,3,4,5,6,7,8,9), val2=c(9,8,7,6,5,4,3,2,1))
# Split by 2 columns (factors)
splitDataFrame <- split(myDataFrame, list(myDataFrame$factor1, myDataFrame$factor2))
# Calculate mean value for each column per each pair of factors
splitMeanValues <- lapply(splitDataFrame, function(x) apply(x, 2, mean))
# Combine back to reduced table whereas there is only one value (mean) per each pair of factors
MeanValues <- unsplit(splitMeanValues, list(unique(myDataFrame$factor1), unique(myDataFrame$factor2)))
EDIT1: Added data frame creation (see above)
If you need to calculate the mean for all other columns than the factors, you can use the formula syntax of aggregate()
aggregate(.~factor1+factor2, myDataFrame, FUN=mean)
That returns
factor1 factor2 val1 val2
1 1 3 2 8
2 2 4 5 5
3 3 5 8 2
Your split() method didn't work because when you unsplit you must have the same number of rows as when you split your data. You were reducing the number of rows for each group to just one row. Plus, unsplit really should be used with the exact same list of factors that was used to do the split; otherwise groups may get out of order. You could do a split, then lapply some collapsing function, and then rbind the list back into a single data.frame if you really wanted, but for a simple mean, aggregate is probably best.
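For completeness, the split, lapply and rbind route could look roughly like this. It is a sketch using the data frame from the question; pieces and means are names I made up, and drop = TRUE is there to skip factor combinations that never occur.

```r
myDataFrame <- data.frame(factor1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                          factor2 = c(3, 3, 3, 4, 4, 4, 5, 5, 5),
                          val1    = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                          val2    = c(9, 8, 7, 6, 5, 4, 3, 2, 1))

# Split by both factors; drop = TRUE discards empty combinations.
pieces <- split(myDataFrame,
                list(myDataFrame$factor1, myDataFrame$factor2),
                drop = TRUE)

# Collapse each piece to a one-row data frame of column means.
means <- lapply(pieces, function(d) as.data.frame(t(colMeans(d))))

# Stack the one-row pieces back into a single data frame.
MeanValues <- do.call(rbind, means)
rownames(MeanValues) <- NULL
MeanValues
```

The factor columns survive because their mean within a group is just the group's own value.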
The same result can be obtained with summaryBy() in the doBy package. Although it's pretty much the same as aggregate() in this case.
> library(doBy)
> summaryBy( . ~ factor1+factor2, data = myDataFrame)
# factor1 factor2 val1.mean val2.mean
# 1 1 3 2 8
# 2 2 4 5 5
# 3 3 5 8 2
Have you tried aggregate?
aggregate(myDataFrame$val1, by=list(factor1=myDataFrame$factor1), FUN=mean)
aggregate(myDataFrame$val1, by=list(factor2=myDataFrame$factor2), FUN=mean)
I am trying to create a new data frame which is identical in the number of columns (but not rows) of an existing data frame. All columns are of identical type, numeric. I need to sample each column of the original data frame (n=241 samples, replace=T) and add those samples to the new data frame at the same column number as the original data frame.
My code so far:
#create the new data frame
tree.df <- data.frame(matrix(nrow=0, ncol=72))
#give same column names as original data frame (data3)
colnames(tree.df)<-colnames(data3)
#populate with NA values
tree.df[1:241,]=NA
#sample original data frame column wise and add to new data frame
for (i in colnames(data3)){
rbind(sample(data3[i], 241, replace = T),tree.df)}
The code isn't working out. Any ideas on how to get this to work?
Use the fact that a data frame is a list, and pass to lapply to perform a column-by-column operation.
Here's an example, taking 5 elements from each column in iris:
as.data.frame(lapply(iris, sample, size=5, replace=TRUE))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.7 3.2 1.7 0.2 versicolor
## 2 5.8 3.1 1.5 1.2 setosa
## 3 6.0 3.8 4.9 1.9 virginica
## 4 4.4 2.5 5.3 0.2 versicolor
## 5 5.1 3.1 3.3 0.3 setosa
There are several issues here. Probably the one that is causing things not to work is that you are trying to access a column of the data frame data3. To do that, you use the following data3[, i]. Note the comma. That separates the row index from the column index.
Additionally, since you already know how big your data frame will be, allocate the space from the beginning:
tree.df <- data.frame(matrix(nrow = 241, ncol = 72))
tree.df is already prepopulated with missing (NA) values so you don't need to do it again. You can now rewrite your for loop as
for (i in colnames(data3)){
tree.df[, i] <- sample(data3[, i], 241, replace = TRUE)
}
Notice I spelled out TRUE. This is better practice than using T because T can be reassigned. Compare:
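T # TRUE
T <- FALSE # allowed, because T is an ordinary variable
T # now FALSE
TRUE <- FALSE # error: TRUE is a reserved word and cannot be reassigned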