sample function in R

I have just started learning R using RStudio and I have some, perhaps basic, questions.
One of them concerns the "sample" function.
More specifically, my dataset consists of 402224 observations of 147 variables. My task is to take a sample of 50 observations and then produce a data frame from it, and so on.
But when I run sample like this
y = sample(mydata, 50, replace = TRUE, prob = NULL)
the result is a dataset with 402224 observations of 50 variables. That is, the sampling is done on variables and not on observations.
Do you have any idea why this happens?
Thank you in advance.

If you want to create a data frame of 50 observations with replacement from your data frame, you can try:
mydata[sample(nrow(mydata), 50, replace=TRUE), ]
Alternatively, you can use the sample_n function from the dplyr package (add replace = TRUE to sample with replacement):
library(dplyr)
sample_n(mydata, 50)
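As a quick check, the row-indexing approach can be tried on a small stand-in data frame. This is only a sketch: the name `toy` and its columns are made up for illustration, and `set.seed()` is there just to make the draw repeatable.

```r
# Hypothetical stand-in for mydata: 1000 rows, 3 columns
set.seed(42)
toy <- data.frame(id = 1:1000,
                  x  = rnorm(1000),
                  g  = sample(letters, 1000, replace = TRUE))

# Draw 50 row numbers with replacement, then index rows
# (the trailing comma means "these rows, all columns")
rows <- sample(nrow(toy), 50, replace = TRUE)
toy_sample <- toy[rows, ]

dim(toy_sample)   # 50 rows, all 3 columns kept
```

Sampling from `nrow(toy)` rather than from `toy` itself is what makes the draw operate on observations.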

The other answers people have been giving are to select rows, but it looks like you are after columns. You can still accomplish this in a similar way.
Here's a sample df.
df = data.frame(a = 1:5, b = 6:10, c = 11:15)
> df
  a  b  c
1 1  6 11
2 2  7 12
3 3  8 13
4 4  9 14
5 5 10 15
Then, to randomly select 2 columns and all observations we could do this
> df[ , sample(1:ncol(df), 2)]
   c a
1 11 1
2 12 2
3 13 3
4 14 4
5 15 5
So, what you'll want to do is something like this
y = mydata[ , sample(1:ncol(mydata), 50)]

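That is because a data frame is a list of columns, so sample() treats each column as one element and draws columns rather than rows.
Try the following to sample rows instead:
library(data.table)
set.seed(10)
dt <- data.table(mydata)
dt[sample(.N, 50)]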
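The column-sampling behaviour described in the question can be reproduced on a small example. The data frame `small` below is hypothetical; the sketch only illustrates that a data frame's "elements", as far as `sample()` is concerned, are its columns.

```r
# A data frame is stored as a list of columns
small <- data.frame(a = 1:5, b = 6:10, c = 11:15)   # hypothetical example data

is.list(small)     # TRUE
length(small)      # 3, the number of columns, not the number of rows

# sample() on the data frame therefore draws columns (with replacement here)
set.seed(1)
cols <- sample(small, 2, replace = TRUE)
dim(cols)          # 5 rows, 2 columns
```

The same thing happens with the 147-column data set in the question: 50 columns are drawn with replacement, and every row is kept.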

Related

R | Create a cross matrix

I don't know how or where to start, but I hope someone can help. It's the first time I have used R like this, so even a keyword or a recommendation of where to look it up would be helpful.
My dataframe looks like this:
set.seed(1)
df <- data.frame(
X = sample(c(1, 2, 3), 50, replace = TRUE),
Y = sample(c(1, 2, 3), 50, replace = TRUE))
And I would like to get a cross table like this:
using
length(which(df$X == & df$Y == ))
I could calculate the data with R and fill it in my Excel-sheet but there has to be a better option.
Thank you in advance.
Try this base R solution:
#Data
set.seed(1)
df <- data.frame(
X = sample(c(1, 2, 3), 50, replace = TRUE),
Y = sample(c(1, 2, 3), 50, replace = TRUE))
#Code
addmargins(table(df$X,df$Y))
Output:
       1  2  3 Sum
  1    6  7  5  18
  2    4  6  9  19
  3    5  5  3  13
  Sum 15 18 17  50
You can also change the order of your variables like this:
#Code2
addmargins(table(df$Y,df$X))
Output:
       1  2  3 Sum
  1    6  4  5  15
  2    7  6  5  18
  3    5  9  3  17
  Sum 18 19 13  50
In order to export to MS Excel, you use this code:
library(xlsx)
#Transform to dataframe
d1 <- as.data.frame.matrix(addmargins(table(df$X,df$Y)))
#Export
write.xlsx(d1,file='myexample.xlsx','Sheet1')
If the data have only two columns, just pass the data.frame object to table.
addmargins(table(df))
If the data include more than two columns, you can subset its variables before passing it to table().
addmargins(table(df[c("X", "Y")]))
You can also pass a formula to xtabs().
addmargins(xtabs( ~ X + Y, df))
All of the above give
     Y
X      1  2  3 Sum
  1    5  6  3  14
  2    2  6  6  14
  3   13  4  5  22
  Sum 20 16 14  50
To export the table to an excel file, you can use write.xlsx() from openxlsx.
library(openxlsx)
tab <- addmargins(xtabs( ~ X + Y, df))
write.xlsx(tab, "foo.xlsx")
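To see how `table()` relates to the manual `length(which(...))` counting mentioned in the question, one cell can be checked by hand. This is a sketch only; the cell X = 2, Y = 3 is picked arbitrarily for illustration.

```r
set.seed(1)
df <- data.frame(
  X = sample(c(1, 2, 3), 50, replace = TRUE),
  Y = sample(c(1, 2, 3), 50, replace = TRUE))

tab <- table(df$X, df$Y)

# Manual count for one cell: X == 2 and Y == 3
manual <- length(which(df$X == 2 & df$Y == 3))

# Dimnames of a table are character, so index with "2" and "3"
tab["2", "3"] == manual   # TRUE
sum(tab)                  # 50, one count per row of df
```

Each cell of the table is exactly one such count, so `table()` replaces nine separate `length(which(...))` calls here.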

R conditional random samples

How can I get random samples based on conditional values? For example, I have the following data frame:
GROUP CLASS AGE
A     1     10
A     2     15
B     1     10
B     2     17
C     1     12
C     2     14
I need to get a sample of 30 records for each of the GROUPs, but only from CLASS = 1, all compiled into one sample data frame.
I know how to get a sample of 30 records, but I don't know how to create a condition that loops through the different GROUPs and filters the CLASS.
ran.sample = sample(nrow(df_all), 30)
df = df_all[ran.sample, ]
Any ideas?
Thanks
Try this:
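# keep only the CLASS 1 rows of the full data (df_all in the question)
newdf <- df_all[df_all$CLASS == 1, ]
# split by GROUP, draw 30 rows from each piece, and stack the pieces
do.call(rbind, lapply(split(newdf, newdf$GROUP), function(x) x[sample(nrow(x), 30), ]))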
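One caveat with the split/lapply approach: `sample(nrow(x), 30)` fails if a group has fewer than 30 CLASS 1 rows. A possible guard is sketched below. The data are simulated to match the shape in the question, and the helper name `sample_per_group` is made up for this example.

```r
# Hypothetical data in the shape of the question: GROUP, CLASS, AGE
set.seed(123)
df_all <- data.frame(
  GROUP = sample(c("A", "B", "C"), 300, replace = TRUE),
  CLASS = sample(1:2, 300, replace = TRUE),
  AGE   = sample(10:17, 300, replace = TRUE))

# Sample up to n rows within each GROUP of `data`
sample_per_group <- function(data, n) {
  parts <- split(data, data$GROUP)
  picked <- lapply(parts, function(x) {
    # min() guards against groups with fewer than n rows
    x[sample(nrow(x), min(n, nrow(x))), , drop = FALSE]
  })
  do.call(rbind, picked)
}

class1 <- df_all[df_all$CLASS == 1, ]   # filter first, then sample
result <- sample_per_group(class1, 30)
table(result$GROUP)                     # at most 30 rows per group
```

Filtering on CLASS before splitting keeps the per-group sample sizes meaningful, since each group is then sampled only from its eligible rows.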

How to re-arrange a data.frame

I am interested in re-arranging a data.frame in R. Bear with me as I stumble through a reproducible example.
I have a nominal variable which can have one of two values. Currently this nominal variable is a column. Instead I would like to have two columns, representing the two values this nominal variable can have. Here is an example data frame. s is the nominal variable with values t and c.
n <- c(1,1,2,2,3,3,4,4)
s <- c("t","c","t","c","t","c","t","c")
b <- c(11,23,6,5,12,16,41,3)
mydata <- data.frame(n, s, b)
I would rather have a data frame that looked like this
n.n <- c(1,2,3,4)
trt <- c(11,6,12,41)
cnt <- c(23,5,16,3)
new.data <- data.frame(n.n, trt, cnt)
I am sure there is a way to use mutate or possibly tidyr but I am not sure what the best route is and my data frame that I would like to re-arrange is quite large.
You want spread:
library(dplyr)
library(tidyr)
new.data <- mydata %>% spread(s,b)
  n  c  t
1 1 23 11
2 2  5  6
3 3 16 12
4 4  3 41
How about unstack(mydata, b~s):
   c  t
1 23 11
2  5  6
3 16 12
4  3 41
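For completeness, the same long-to-wide step can be sketched with base R's `reshape()`, which needs no extra packages. The `b.t` / `b.c` column names are what `reshape()` generates by default, not names from the question. In current tidyr, `spread()` is marked as superseded by `pivot_wider()`, so `pivot_wider(mydata, names_from = s, values_from = b)` would be the tidyr equivalent.

```r
n <- c(1, 1, 2, 2, 3, 3, 4, 4)
s <- c("t", "c", "t", "c", "t", "c", "t", "c")
b <- c(11, 23, 6, 5, 12, 16, 41, 3)
mydata <- data.frame(n, s, b)

# One row per n, one column of b per level of s
wide <- reshape(mydata, idvar = "n", timevar = "s", direction = "wide")

names(wide)   # "n" "b.t" "b.c"
wide$b.t      # 11 6 12 41
wide$b.c      # 23 5 16 3
```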

table() by group in R

Sample data:
a <- sample(1:4, 100, replace = T)
b <- sample(0:1, 100, replace = T)
d <- data.frame(a, b)
I want to achieve this output automatically for all levels of a:
table(d$b[d$a==1])
table(d$b[d$a==2])
table(d$b[d$a==3])
table(d$b[d$a==4])
I could do a for-loop, but that is not in the spirit of R.
for (i in unique(d$a)) {
print(table(d$b[d$a==i]))
}
Rather, I want to use one of the many list functions in R.
I tried to use ddply from the plyr package:
ddply(d, ~a, function(x) table(b))
But that is just the same as table(d$b) repeated four times.
How do I apply the table() function to each group in a using a list-function, preferably ddply?
You can use table with multiple arguments:
table(d$a,d$b)
     0  1
  1 15 10
  2  6 16
  3 13 10
  4 20 10
Or, if you only have the data you want to tabulate in the data.frame, it will handle it for you if you pass in the data.frame:
table(d)
   b
a    0  1
  1 15 10
  2  6 16
  3 13 10
  4 20 10
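If separate per-group tables are really wanted, as in the question's for-loop, a split/lapply version might look like the sketch below. The seed is an addition here so that the draw is repeatable; the resulting counts will not match the output shown above.

```r
set.seed(1)   # seed added for repeatability (not in the question)
a <- sample(1:4, 100, replace = TRUE)
b <- sample(0:1, 100, replace = TRUE)
d <- data.frame(a, b)

# Split b by the levels of a, then tabulate each piece;
# the result is a named list with one table per level of a
per_group <- lapply(split(d$b, d$a), table)

names(per_group)   # "1" "2" "3" "4"
per_group[["1"]]   # same counts as table(d$b[d$a == 1])
```

The two-way `table(d$a, d$b)` holds the same counts in a single object, which is usually easier to work with than a list of tables.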

How do I take subsets of a data frame according to a grouping in R?

I have an aggregation problem which I cannot figure out how to perform efficiently in R.
Say I have the following data:
group1 <- c("a","b","a","a","b","c","c","c","c",
"c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
"banana","durian","lemon","lime",
"raspberry","durian","peach","nectarine",
"banana","lemon","guava","blackberry","grape")
df <- data.frame(group1,group2,value)
I am interested in sampling from the data frame df such that I randomly pick only a single row from each combination of factors group1 and group2.
As you can see, the results of table(df$group1,df$group2)
    1 2 3 4 5 6
  a 2 1 2 1 0 0
  b 2 2 1 1 0 0
  c 0 0 1 1 2 1
shows that some combinations are seen more than once, while others are never seen. For those that are seen more than once (e.g., group1="a" and group2=3), I want to randomly pick only one of the corresponding rows and return a new data frame that has only that subset of rows. That way, each possible combination of the grouping factors is represented by only a single row in the data frame.
One important aspect here is that my actual data sets can contain anywhere from 500,000 rows to >2,000,000 rows, so it is important to be mindful of performance.
I am relatively new at R, so I have been having trouble figuring out how to generate this structure correctly. One attempt looked like this (using the plyr package):
choice <- function(x,label) {
  cbind(x[sample(1:nrow(x),1),],data.frame(state=label))
}
df <- ddply(df[,c("group1","group2","value")],
            .(group1,group2),
            choice,
            label="test")
Note that in this case, I am also adding an extra column to the data frame called "state", whose value comes from the label argument passed through ddply. However, I killed this after about 20 min.
In other cases, I have tried using aggregate or by or tapply, but I never know exactly what the specified function is getting, what it should return, or what to do with the result (especially for by).
I am trying to switch from python to R for exploratory data analysis, but this type of aggregation is crucial for me. In python, I can perform these operations very rapidly, but it is inconvenient as I have to generate a separate script/data structure for each different type of aggregation I want to perform.
I want to love R, so please help! Thanks!
Uri
Here is the plyr solution
library(plyr)
set.seed(1234)
ddply(df, .(group1, group2), summarize,
      value = value[sample(length(value), 1)])
This gives us
   group1 group2      value
1       a      1      apple
2       a      2  nectarine
3       a      3     banana
4       a      4      apple
5       b      1      grape
6       b      2 blackberry
7       b      3      guava
8       b      4      lemon
9       c      3     durian
10      c      4     durian
11      c      5  raspberry
12      c      6       lime
EDIT: With a data frame that big, you are better off using data.table
library(data.table)
dt = data.table(df)
dt[, list(value = value[sample(length(value), 1)]), by = list(group1, group2)]
EDIT 2: Performance comparison: data.table is roughly 14 times faster
group1 = sample(letters, 1000000, replace = T)
group2 = sample(LETTERS, 1000000, replace = T)
value = runif(1000000, 0, 1)
df = data.frame(group1, group2, value)
dt = data.table(df)
f1_dtab = function() {
  dt[, list(value = value[sample(length(value), 1)]), by = list(group1, group2)]
}
f2_plyr = function() {
  ddply(df, .(group1, group2), summarize,
        value = value[sample(length(value), 1)])
}
f3_by = function() {
  do.call(rbind, by(df, list(grp1 = df$group1, grp2 = df$group2),
                    FUN = function(x) { x[sample(nrow(x), 1), ] }))
}
library(rbenchmark)
benchmark(f1_dtab(), f2_plyr(), f3_by(), replications = 10)
test      replications elapsed relative
f1_dtab()           10   4.764  1.00000
f2_plyr()           10  68.261 14.32851
f3_by()             10  67.369 14.14127
One more way:
with(df, tapply(value, list( group1, group2), length))
   1  2  3  4  5  6
a  2  1  2  1 NA NA
b  2  2  1  1 NA NA
c NA NA  1  1  2  1
# Now use tapply to sample within groups
# `resample` fn is from the sample help page:
# it avoids sample()'s surprising behaviour when a group has only one value
# (sample(x) on a single number x samples from 1:x instead).
resample <- function(x, ...) x[sample.int(length(x), ...)]
# Create a row index
df$idx <- 1:NROW(df)
rowidxs <- with(df, unique( c( # the `c` function will make a matrix into a vector
    tapply(idx, list( group1, group2),
           function (x) resample(x, 1) ))))
rowidxs
# [1] 1 5 NA 12 16 NA 3 15 6 4 14 10 NA NA 7 NA NA 8
df[rowidxs[!is.na(rowidxs)] , ]
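Another base R option, offered as a sketch rather than a benchmarked alternative, is to shuffle the rows once and keep the first row seen for each (group1, group2) combination. After a uniform shuffle, the first row of each combination is a uniform pick from that combination. The names `shuffled` and `one_each` are made up for this example.

```r
group1 <- c("a","b","a","a","b","c","c","c","c",
            "c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
           "banana","durian","lemon","lime",
           "raspberry","durian","peach","nectarine",
           "banana","lemon","guava","blackberry","grape")
df <- data.frame(group1, group2, value)

set.seed(1234)
# 1. Put the rows in random order
shuffled <- df[sample(nrow(df)), ]
# 2. Keep only the first occurrence of each group1/group2 combination
one_each <- shuffled[!duplicated(shuffled[c("group1", "group2")]), ]
# 3. Optional: sort for readability
one_each <- one_each[order(one_each$group1, one_each$group2), ]

nrow(one_each)   # 12, one row per observed combination
```

This avoids calling a function once per group, which is often what makes split-apply approaches slow when there are many groups.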

Resources