Sample data:
a <- sample(1:4, 100, replace = T)
b <- sample(0:1, 100, replace = T)
d <- data.frame(a, b)
I want to achieve this output automatically for all levels of a:
table(d$b[d$a==1])
table(d$b[d$a==2])
table(d$b[d$a==3])
table(d$b[d$a==4])
I could do a for-loop, but that is not in the spirit of R.
for (i in unique(d$a)) {
print(table(d$b[d$a==i]))
}
Rather, I want to use one of R's many list functions.
I tried to use ddply from the plyr package:
ddply(d, ~a, function(x) table(b))
But that is just the same as table(d$b) repeated four times.
How do I apply the table() function to each group in a using a list function, preferably ddply?
You can use table with multiple arguments:
table(d$a,d$b)
0 1
1 15 10
2 6 16
3 13 10
4 20 10
Or, if you only have the data you want to tabulate in the data.frame, it will handle it for you if you pass in the data.frame:
table(d)
b
a 0 1
1 15 10
2 6 16
3 13 10
4 20 10
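If a separate table per level of a is really what is wanted, as in the question, one possible sketch uses split() with lapply(). The seed and the name per_group are assumptions added for illustration, so the counts will not match the output shown above.

```r
# Sample data as in the question (seed added here only for reproducibility)
set.seed(42)
a <- sample(1:4, 100, replace = TRUE)
b <- sample(0:1, 100, replace = TRUE)
d <- data.frame(a, b)

# split() cuts b into one vector per level of a;
# lapply() then tabulates each piece and returns a named list
per_group <- lapply(split(d$b, d$a), table)

# The element named "1" holds the same counts as table(d$b[d$a == 1])
per_group[["1"]]
```

The list is named by the levels of a, so per_group[["3"]] gives the table for a == 3.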
I don't know how or where to start, but I hope someone can help. It's the first time I've used R like this, so even a keyword or a recommendation of where to look it up would be helpful.
My dataframe looks like this:
set.seed(1)
df <- data.frame(
X = sample(c(1, 2, 3), 50, replace = TRUE),
Y = sample(c(1, 2, 3), 50, replace = TRUE))
And I would like to get a cross table like this:
using
length(which(df$X == & df$Y == ))
I could calculate the data with R and fill it in my Excel-sheet but there has to be a better option.
Thank you in advance.
Try this base R solution:
#Data
set.seed(1)
df <- data.frame(
X = sample(c(1, 2, 3), 50, replace = TRUE),
Y = sample(c(1, 2, 3), 50, replace = TRUE))
#Code
addmargins(table(df$X,df$Y))
Output:
1 2 3 Sum
1 6 7 5 18
2 4 6 9 19
3 5 5 3 13
Sum 15 18 17 50
You can also change the order of your variables like this:
#Code2
addmargins(table(df$Y,df$X))
Output:
1 2 3 Sum
1 6 4 5 15
2 7 6 5 18
3 5 9 3 17
Sum 18 19 13 50
In order to export to MS Excel, you use this code:
library(xlsx)
#Transform to dataframe
d1 <- as.data.frame.matrix(addmargins(table(df$X,df$Y)))
#Export
write.xlsx(d1,file='myexample.xlsx','Sheet1')
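As a quick check that the cross table holds the same numbers as the manual counting from the question, the following sketch compares one cell. The names tab and manual are chosen here for illustration.

```r
set.seed(1)
df <- data.frame(
  X = sample(c(1, 2, 3), 50, replace = TRUE),
  Y = sample(c(1, 2, 3), 50, replace = TRUE))

tab <- table(df$X, df$Y)

# Manual count for a single cell, in the style of the question
manual <- length(which(df$X == 2 & df$Y == 3))

# Table cells can be indexed by their level labels as character strings
tab["2", "3"] == manual
```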
If the data have only two columns, just pass the data.frame object to table.
addmargins(table(df))
If the data include more than two columns, you can subset its variables before passing it to table().
addmargins(table(df[c("X", "Y")]))
You can also pass a formula to xtabs().
addmargins(xtabs( ~ X + Y, df))
All of the above give
Y
X 1 2 3 Sum
1 5 6 3 14
2 2 6 6 14
3 13 4 5 22
Sum 20 16 14 50
To export the table to an Excel file, you can use write.xlsx() from openxlsx.
library(openxlsx)
tab <- addmargins(xtabs( ~ X + Y, df))
write.xlsx(tab, "foo.xlsx")
I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is to take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to one group, take all of the columns of this group).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
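One caveat that the answer does not raise: when sample() is given a single number n as its first argument, it samples from 1:n, so a group consisting of just one column could return the wrong index. The sketch below sidesteps this by sampling positions with sample.int(). The helper name sample_idx is hypothetical.

```r
set.seed(1)
dframe <- as.data.frame(matrix(rnorm(60), ncol = 30))
colnames(dframe) <- rep(c("A", "B", "C"), times = c(10, 14, 6))

# Hypothetical helper: draw positions within ix rather than values,
# so a group with a single column index is handled correctly
sample_idx <- function(ix, n) ix[sample.int(length(ix), min(n, length(ix)))]

idx  <- split(seq_along(dframe), names(dframe))
keep <- unlist(lapply(idx, sample_idx, n = 7))
sub  <- dframe[, keep]

# Columns kept per group: 7 from A, 7 from B, all 6 from C
table(names(dframe)[keep])
```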
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
  lapply(unique(colnames(dframe)), function(x) {
    cols <- which(colnames(dframe) == x)
    # Keep the whole group if it has at most nc columns, else sample nc of them
    if (length(cols) > nc) cols <- sample(cols, nc, replace = FALSE)
    dframe[, cols]
  }))
It might look complicated, but it really just takes all columns of a group if there are at most nc of them, and samples nc random columns if there are more than nc.
And to restore your original column-name scheme, gsub does the trick:
colnames(res) <- gsub('\\.[[:digit:]]+$','',colnames(res))
I have just started learning R using RStudio and I have, perhaps, some basic questions.
One of them regards the "sample" function.
More specifically, my dataset consists of 402224 observations of 147 variables. My task is to take a sample of 50 observations and then produce a dataframe and so on.
But when the function sample is executed
y = sample(mydata, 50, replace = TRUE, prob = NULL)
the result is a dataset with 402224 observations of 50 variables. That is, the sampling is done over variables and not observations.
Do you have any idea why this happens?
Thank you in advance.
If you want to create a data frame of 50 observations with replacement from your data frame, you can try:
mydata[sample(nrow(mydata), 50, replace=TRUE), ]
Alternatively, you can use the sample_n function from the dplyr package:
sample_n(mydata, 50)
The other answers people have been giving are to select rows, but it looks like you are after columns. You can still accomplish this in a similar way.
Here's a sample df.
df = data.frame(a = 1:5, b = 6:10, c = 11:15)
> df
a b c
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
Then, to randomly select 2 columns and all observations we could do this
> df[ , sample(1:ncol(df), 2)]
c a
1 11 1
2 12 2
3 13 3
4 14 4
5 15 5
So, what you'll want to do is something like this
y = mydata[ , sample(1:ncol(mydata), 50)]
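That is because sample works on vectors, and a data frame is a list of columns, so sample(mydata, 50) draws 50 columns rather than 50 rows.
To sample rows with data.table, try the following:
library(data.table)
set.seed(10)
dt <- data.table(mydata)
# .N is the number of rows, so this draws 50 row indices
dt[sample(.N, 50)]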
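The difference between sampling columns and sampling rows can be seen on a small made-up data frame. The object names small, cols and rows are used here only for illustration.

```r
set.seed(10)
small <- data.frame(a = 1:5, b = 6:10, c = 11:15)

# A data frame is a list of columns, so sample() draws columns
cols <- sample(small, 2)
dim(cols)   # 5 rows, 2 columns

# Sampling row indices and subsetting draws observations instead
rows <- small[sample(nrow(small), 3), ]
dim(rows)   # 3 rows, 3 columns
```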
Suppose the data frame is like this:
df <- data.frame(x = c(1,7,8,15,24,100,9,19,128))
How do I create a new variable that satisfies the following condition:
y = 1 if 1<=x<=7
y = 2 if 8<=x<=14
y = 3 if 15<=x<=21
...
y = k if 1+7*(k-1)<= x<= 7+7*(k-1)
so that I can have the new data frame like this
df <- data.frame(y = c(1,1,2,3,4,15, 2,3, 19))
I am wondering if a for loop can be applied in this case.
Via simple algebra, you can do:
df$y <- floor((df$x+6)/7)
df
# x y
# 1 1 1
# 2 7 1
# 3 8 2
# 4 15 3
# 5 24 4
# 6 100 15
# 7 9 2
# 8 19 3
# 9 128 19
In R you will often find it easier (less typing and less thinking) to use vectorized operators than for loops for simple computations like this. In this case we performed calls to +, /, and floor over a whole vector instead of looping and using them on each element.
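For comparison, here is the same computation written once as a vectorized expression and once as the explicit loop the question asks about. The names y_vec and y_loop are illustrative.

```r
df <- data.frame(x = c(1, 7, 8, 15, 24, 100, 9, 19, 128))

# Vectorized: the arithmetic is applied to the whole column at once
y_vec <- floor((df$x + 6) / 7)

# Equivalent explicit loop, shown only for comparison
y_loop <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  y_loop[i] <- floor((df$x[i] + 6) / 7)
}

# Both approaches give the same result
identical(y_vec, y_loop)
```

The loop works, but it needs a preallocated result vector and an index, which the vectorized form avoids.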
I have an aggregation problem which I cannot figure out how to perform efficiently in R.
Say I have the following data:
group1 <- c("a","b","a","a","b","c","c","c","c",
"c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
"banana","durian","lemon","lime",
"raspberry","durian","peach","nectarine",
"banana","lemon","guava","blackberry","grape")
df <- data.frame(group1,group2,value)
I am interested in sampling from the data frame df such that I randomly pick only a single row from each combination of factors group1 and group2.
As you can see, the results of table(df$group1,df$group2)
1 2 3 4 5 6
a 2 1 2 1 0 0
b 2 2 1 1 0 0
c 0 0 1 1 2 1
shows that some combinations are seen more than once, while others are never seen. For those that are seen more than once (e.g., group1="a" and group2=3), I want to randomly pick only one of the corresponding rows and return a new data frame that has only that subset of rows. That way, each possible combination of the grouping factors is represented by only a single row in the data frame.
One important aspect here is that my actual data sets can contain anywhere from 500,000 rows to >2,000,000 rows, so it is important to be mindful of performance.
I am relatively new at R, so I have been having trouble figuring out how to generate this structure correctly. One attempt looked like this (using the plyr package):
choice <- function(x,label) {
cbind(x[sample(1:nrow(x),1),],data.frame(state=label))
}
df <- ddply(df[,c("group1","group2","value")],
.(group1,group2),
choice,
label="test")
Note that in this case, I am also adding an extra column to the data frame, called "state", whose value is passed through the extra label argument of the ddply call. However, I killed this after about 20 min.
In other cases, I have tried using aggregate or by or tapply, but I never know exactly what the specified function is getting, what it should return, or what to do with the result (especially for by).
I am trying to switch from Python to R for exploratory data analysis, but this type of aggregation is crucial for me. In Python, I can perform these operations very rapidly, but it is inconvenient as I have to generate a separate script/data structure for each different type of aggregation I want to perform.
I want to love R, so please help! Thanks!
Uri
Here is the plyr solution
set.seed(1234)
ddply(df, .(group1, group2), summarize,
value = value[sample(length(value), 1)])
This gives us
group1 group2 value
1 a 1 apple
2 a 2 nectarine
3 a 3 banana
4 a 4 apple
5 b 1 grape
6 b 2 blackberry
7 b 3 guava
8 b 4 lemon
9 c 3 durian
10 c 4 durian
11 c 5 raspberry
12 c 6 lime
EDIT. With a data frame that big, you are better off using data.table
library(data.table)
dt = data.table(df)
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
EDIT 2: Performance comparison: data.table is about 14 times faster in this benchmark
group1 = sample(letters, 1000000, replace = T)
group2 = sample(LETTERS, 1000000, replace = T)
value = runif(1000000, 0, 1)
df = data.frame(group1, group2, value)
dt = data.table(df)
f1_dtab = function() {
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
}
f2_plyr = function() {ddply(df, .(group1, group2), summarize, value =
value[sample(length(value), 1)])
}
f3_by = function() {do.call(rbind,by(df,list(grp1 = df$group1,grp2 = df$group2),
FUN = function(x){x[sample(nrow(x),1),]}))
}
library(rbenchmark)
benchmark(f1_dtab(), f2_plyr(), f3_by(), replications = 10)
test replications elapsed relative
f1_dtab() 10 4.764 1.00000
f2_plyr() 10 68.261 14.32851
f3_by() 10 67.369 14.14127
One more way:
with(df, tapply(value, list( group1, group2), length))
1 2 3 4 5 6
a 2 1 2 1 NA NA
b 2 2 1 1 NA NA
c NA NA 1 1 2 1
# Now use tapply to sample within groups
# `resample` fn is from the sample help page:
# Avoids an error with sample when only one value in a group.
resample <- function(x, ...) x[sample.int(length(x), ...)]
#Create a row index
df$idx <- 1:NROW(df)
rowidxs <- with(df, unique( c( # the `c` function will make a matrix into a vector
tapply(idx, list( group1, group2),
function (x) resample(x, 1) ))))
rowidxs
# [1] 1 5 NA 12 16 NA 3 15 6 4 14 10 NA NA 7 NA NA 8
df[rowidxs[!is.na(rowidxs)] , ]
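Another base R sketch, not taken from the answers above: shuffle the rows once, then keep the first row seen for each combination with duplicated(). Because the shuffle is uniform, the first row of each group is a uniformly random member of that group. The names shuffled and picked are illustrative, and no timing is claimed for this approach.

```r
set.seed(1234)
group1 <- c("a","b","a","a","b","c","c","c","c",
            "c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
           "banana","durian","lemon","lime",
           "raspberry","durian","peach","nectarine",
           "banana","lemon","guava","blackberry","grape")
df <- data.frame(group1, group2, value)

# Put the rows in random order
shuffled <- df[sample(nrow(df)), ]

# Keep only the first occurrence of each (group1, group2) combination
picked <- shuffled[!duplicated(shuffled[c("group1", "group2")]), ]

# Sort for readability
picked <- picked[order(picked$group1, picked$group2), ]
picked
```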