This question already has answers here:
Sample random rows in dataframe
(13 answers)
Closed 3 years ago.
Suppose I have a dataset of 90,000 x 17 (n x p), where n is the number of observations and p is the number of variables, and I would like to take a random sample of 20% of the rows from the whole dataset. How can this be done in R?
After taking a random sample I will be performing cluster analysis accordingly.
I tried the answers to other questions, but they did not give me what I needed.
You can do it with sample_frac from dplyr. Here is an example with the iris dataset:
library(dplyr)
#data(iris)
sample20 <- iris %>% sample_frac(0.2)
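For comparison, here is a minimal base R sketch of the same 20% sample, assuming no packages are available. The names n_rows, sample_size and sample_idx are illustrative, and the seed is only there to make the draw repeatable.

```r
# Base R sketch: draw 20% of the rows of iris without replacement
set.seed(123)                            # assumption: fixed seed for reproducibility
n_rows <- nrow(iris)                     # 150 rows in iris
sample_size <- round(0.2 * n_rows)       # number of rows to keep
sample_idx <- sample(n_rows, size = sample_size)  # random row positions, no repeats
sample20 <- iris[sample_idx, ]
```

In dplyr 1.0.0 and later, sample_frac() is superseded by slice_sample(prop = 0.2), which should behave the same way for this purpose.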
This question already has answers here:
Mean per group in a data.frame [duplicate]
(8 answers)
Closed 1 year ago.
I have this data frame of about 35'000 observations. The problem is that there are about 5'000 occurrences (as exemplified by the first two and last two rows of the image) whereby I have two observations relating to the same COD_DOM but with differing values of RENDIMENTO. What I would like is to calculate the average RENDIMENTO for all COD_DOM which appear twice and thus keep only one observation with the average value.
If your data.frame is just these two columns, you should be able to use:
library(dplyr)
# df stands for your own data frame
new_df <- df %>%
  group_by(COD_DOM) %>%
  summarize(RENDIMENTO = mean(RENDIMENTO))
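As a rough illustration of the same per-group mean without dplyr, the sketch below uses base R's aggregate() on a small made-up data frame. The values are hypothetical and only the column names come from the question.

```r
# Hypothetical toy data: COD_DOM 1 and 3 each appear twice
df <- data.frame(
  COD_DOM    = c(1, 1, 2, 3, 3),
  RENDIMENTO = c(100, 200, 50, 10, 30)
)

# One row per COD_DOM, holding the mean RENDIMENTO of that group
avg <- aggregate(RENDIMENTO ~ COD_DOM, data = df, FUN = mean)
print(avg)
```

COD_DOM values that appear only once keep their original RENDIMENTO, since the mean of a single value is that value.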
This question already has answers here:
Remove last N rows in data frame with the arbitrary number of rows
(4 answers)
Closed 2 years ago.
I have a dataset consisting of 250 observations. I want to select all observations except the last. I know I can do this with the following code, but how can I do it if I do not know the exact number of observations?
data(mtcars)
## skipping last observation and selecting all others
mtcars_lag <- mtcars[1:31, ]
## skipping first observation and selecting all
mtcars_forward <- mtcars[2:32, ]
Using nrow() gets you the number of observations in the dataset. mtcars_subset <- mtcars[1:(nrow(mtcars)-1), ] will fetch you all observations except the last one.
EDIT: Added parentheses in line with suggestion from MrFlick.
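A short sketch of two alternatives that avoid writing the row count by hand: head() and tail() accept a negative n meaning "all but this many rows". The variable names mirror the question. One reason to consider them is that 1:(nrow(x) - 1) evaluates to c(1, 0) for a one-row data frame and so still returns that row, whereas head(x, -1) returns zero rows.

```r
data(mtcars)

# All rows except the last one
mtcars_lag <- head(mtcars, -1)

# All rows except the first one
mtcars_forward <- tail(mtcars, -1)

# Edge case: a one-row data frame
one_row <- mtcars[1, ]
nrow(one_row[1:(nrow(one_row) - 1), ])  # 1, because 1:0 is c(1, 0)
nrow(head(one_row, -1))                 # 0
```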
This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
Closed 5 years ago.
I have a dataset, like this mushrooms <- read.csv("mushrooms.csv") and now I already have a mushrooms.training_set which is 1/3 of the whole dataset. For both variables, typeof() returns list.
Now, I want to select the rows in the original dataset mushrooms, that are not in the mushrooms.training_set. How would I do this? I have tried the following:
mushrooms[c(!mushrooms.training_set),] but this returns something in the order of 64K rows.
mushrooms[!mushrooms.training_set,]
mushrooms[!duplicated(mushrooms.training_set)]
Can anyone help me out?
From where you are in the question, you can use dplyr::setdiff:
library(dplyr)
mushrooms.test = setdiff(mushrooms, mushrooms.training_set)
But most of the time it's easier to create the test set at the same time as the training set. Lots of examples here at How to split data into training and test sets?
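A possible sketch of creating both sets at once, using iris as a stand-in because mushrooms.csv is not available here. It assumes dplyr 1.0.0 or later for slice_sample(); the row_id column and the names full, training_set and test_set are illustrative. Joining on an explicit row id also keeps duplicated rows, which a set difference on whole rows would collapse.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed so the split is repeatable

# Tag every row with an id so rows can be matched unambiguously
full <- iris %>% mutate(row_id = row_number())

# Roughly one third of the rows go to the training set
training_set <- full %>% slice_sample(n = 50)

# The test set is every row whose id is not in the training set
test_set <- anti_join(full, training_set, by = "row_id")
```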
This question already has answers here:
backtransform `scale()` for plotting
(9 answers)
Closed 8 years ago.
I'm new to R. I would like to transform a set of numbers I have scaled using scale() back to the original raw ones.
Here is the code I used to scale the numbers:
dataCluster <- dataFinal[, c(1)]
data_z <- as.data.frame(lapply(dataCluster, scale))
clusters <- kmeans (na.roughfix(data_z), 3)
where:
dataFinal is a data frame (3 columns x 100 rows)
clusters is a "data matrix" (3 columns x 3 rows).
I would like to create a clustersRaw with the raw values.
Can anyone help?
I don't know if this is going to solve it, since you didn't provide your data. However:
#create a matrix 10x3
mat<-matrix(1:30,ncol=3)
#scale it
x<-scale(mat)
#restore it
t(t(x)*attr(x,"scaled:scale")+attr(x,"scaled:center"))
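The same back-transformation can also be written with sweep(), and applied to k-means centers, which is what the question asks for. This is only a sketch on the toy matrix above: the choice of 2 clusters and the names restored, km and centers_raw are assumptions for illustration.

```r
# Toy matrix, scaled column by column
mat <- matrix(1:30, ncol = 3)
x <- scale(mat)

col_sd   <- attr(x, "scaled:scale")   # per-column standard deviations
col_mean <- attr(x, "scaled:center")  # per-column means

# Undo the scaling: multiply each column by its sd, then add its mean
restored <- sweep(sweep(x, 2, col_sd, `*`), 2, col_mean, `+`)

# Cluster on the scaled data, then express the centers in raw units
set.seed(1)
km <- kmeans(x, centers = 2)
centers_raw <- sweep(sweep(km$centers, 2, col_sd, `*`), 2, col_mean, `+`)
```

The centers returned by kmeans() are in scaled units, so they need the same sd and mean that scale() stored as attributes on x.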
This question already has answers here:
How to split a data frame?
(8 answers)
Closed 8 years ago.
I have a data set called data, which I am splitting into 2 new data sets, which I will call test and train.
I want the splitting to be random, without replacement.
Using the code below, I get train to be a new data frame with 35 elements:
rows_in_test <- 35 # number of rows to randomly select
rows_in_train <- nrow(data) - rows_in_test
train <- data[sample(nrow(data), rows_in_test), ]
Is there a nice way in R to assign the complement of train to a new data set called test? I am thinking there must be a function for this?
myData <- data.frame(a = c(1:20), b = c(101:120))
set.seed(123) # to be able to replicate random sampling later
trainRows <- runif(nrow(myData)) > 0.25 # randomly put aside 25% of the data
train <- myData[trainRows, ] # has 13 rows
test <- myData[!trainRows, ] # has 7 rows
# following method to select a fixed no. of samples - in this case selecting 5 rows
testRows2 <- sort(sample(c(1:nrow(myData)), 5, replace = FALSE))
train2 <- myData[-testRows2, ]
test2 <- myData[testRows2, ]
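Applied to the variable names from the question, a sketch of the complement could look like the following. The toy data frame and the name train_idx are assumptions. setdiff() on the row positions is used instead of negative indexing because data[-train_idx, ] returns zero rows when train_idx happens to be empty.

```r
set.seed(1)  # assumption: fixed seed for a repeatable split
data <- data.frame(id = 1:100, value = rnorm(100))  # hypothetical stand-in data

rows_in_train <- 35
train_idx <- sample(nrow(data), rows_in_train)       # sampled without replacement
test_idx  <- setdiff(seq_len(nrow(data)), train_idx)  # every position not in train

train <- data[train_idx, ]
test  <- data[test_idx, ]
```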