I have an Excel file that has multiple sheets. Each sheet looks like this, with some excess data at the bottom:
    A       B     C     D ...
1   time    USA   USA   USA
2           MD    CA    PX
3           pork  peas  nuts
4   jan-11  4     2     2
5   feb-11  4     9     3
6   mar-11  8     8     3
.
.
workbook1 | workbook2 ...
The file is 11 mb, but when I try to use
sheet<-readWorksheetFromFile("excelfile.xlsx", sheet = 1)
I get
Error: OutOfMemoryError (Java): Java heap space
The data takes up a different number of rows and columns in each worksheet; I want to write something that produces this for each sheet.
I am trying to convert each column into the following:
country state product unit time
USA MD pork 3 jan-11
USA MD pork 3 feb-11
USA MD pork 3 mar-11
...
..
.
Is there any way to do this in R?
If your spreadsheet is full of formulas, you might need to convert those to values to get them to be read in easily. Otherwise, I would suggest using a tool like this one (among others out there) to convert all the sheets in a workbook to CSV files and work from there.
If you've gotten that far, here's something that can be tried for the "reshaping" part of your question. Here, we'll assume that "A" actually represents a CSV file, the contents of which are the six lines shown as sample data in your question:
## Create some sample data
A <- tempfile()
writeLines(sep = "\n", con = A,
           text = c("time, USA, USA, USA",
                    ", MD, CA, PX",
                    ", pork, peas, nuts",
                    "jan-11, 4, 2, 2",
                    "feb-11, 4, 9, 3",
                    "mar-11, 8, 8, 3"))
The first thing I would do is read in the headers and the data separately. To read the headers separately, use nrows to specify the number of rows that contain the header information. To read the data separately, specify skip to skip the header rows.
B <- read.csv(A, header = FALSE, skip = 3, strip.white = TRUE)
Bnames <- read.csv(A, header = FALSE, nrows = 3, strip.white = TRUE)
Use apply to paste the header rows together to form the names for the resulting data.frame:
names(B) <- apply(Bnames, 2, function(x) paste(x[x != ""], collapse = "_"))
B
# time USA_MD_pork USA_CA_peas USA_PX_nuts
# 1 jan-11 4 2 2
# 2 feb-11 4 9 3
# 3 mar-11 8 8 3
Now comes the part of converting the data from a "wide" to a "long" format. There are many ways to do this, some using base R too, but the most direct is to use melt and colsplit from the "reshape2" package:
library(reshape2)
BL <- melt(B, id.vars="time")
cbind(BL[c("time", "value")],
      colsplit(BL$variable, "_",
               c("country", "state", "product")))
# time value country state product
# 1 jan-11 4 USA MD pork
# 2 feb-11 4 USA MD pork
# 3 mar-11 8 USA MD pork
# 4 jan-11 2 USA CA peas
# 5 feb-11 9 USA CA peas
# 6 mar-11 8 USA CA peas
# 7 jan-11 2 USA PX nuts
# 8 feb-11 3 USA PX nuts
# 9 mar-11 3 USA PX nuts
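For reference, here is a base R sketch of the same wide-to-long step (assuming the B built above): stack the value columns by hand, then split the pasted-together names back into country/state/product.
BL2 <- data.frame(time = B$time,
                  variable = rep(names(B)[-1], each = nrow(B)),
                  value = unlist(B[-1], use.names = FALSE))
info <- do.call(rbind, strsplit(as.character(BL2$variable), "_", fixed = TRUE))
cbind(BL2[c("time", "value")],
      setNames(as.data.frame(info, stringsAsFactors = FALSE),
               c("country", "state", "product")))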
Unfortunately, XLConnect is unlikely to work in your application. I can confirm that on a system with 8GB RAM, running Win 7 64bit and 64bit R 3.0.2, XLConnect fails with a 22MB .xlsx file, with the same error that you are getting. As @Ista pointed out, and as explained here, after restarting R and before doing anything else:
options(java.parameters = "-Xmx4096m")
library(XLConnect)
wb <- loadWorkbook("myWorkBook.xlsx")
sheet <- readWorksheet(wb,"Data")
avoids the error. However, the import still takes more than an hour(!!).
In contrast, as @Gaffi pointed out, once the sheet "Data" is saved to a csv file (~7MB), it can be imported as follows:
library(data.table)
system.time(sheet <- fread("Data.csv"))
user system elapsed
0.84 0.00 0.86
in less than 1 second. In my test case sheet has 6 columns and ~376,000 rows.
Sorry about this "second answer", but you really had two questions... @Ananda's solution for reshaping your data is extremely elegant. This is just another way to think about it.
If you transpose the input matrix you get a new matrix, where the first column is country, the second column is city, the third column is "type" (for lack of a better term), and the actual data is in the other columns (so, there is one additional column for every "time").
So a different approach is to transpose first and then melt the new matrix. This avoids creating all the concatenated column names and splitting them back later. The problem is that melt.data.frame is exceptionally inefficient with a very large number of columns (which you would have here), so doing it this way would be 10X slower than @Ananda's approach.
A solution is to use melt.array (just call melt(...) with an array rather than a data frame). As shown below, this approach is ~20X faster with larger datasets (yours was 11MB).
library(reshape) # for melt(...)
library(microbenchmark) # for microbenchmark(...)
# this is just to model your situation with more realistic size
# create a large data frame (250 columns of country, city, type; 1000 rows of time)
df <- rep(c("USA","UK","FR","CHN","GER"),each=50) # time + 250 columns
df <- rbind(df,rep(c(c("NY","SF","CHI","BOS","LA")),each=10))
df <- rbind(df,rep(c("pork","peas","nuts","fruit","other")))
df <- rbind(df,matrix(sample(1:1000,250*1000,replace=T),ncol=250))
df <- cbind(c("time","","",
              as.character(as.Date(1:1000,origin="2010-01-01"))),df)
df <- data.frame(df) # big warning here about duplicated row names; not important
# @Ananda's approach:
transform.orig <- function(df){
  B <- df[-(1:3),]
  Bnames <- df[1:3,]
  names(B) <- apply(Bnames, 2, function(x) paste(x[x != ""], collapse = "_"))
  BL <- melt(B, id.vars="time")
  final <- cbind(BL[c("time", "value")],
                 colsplit(BL$variable, "_",
                          c("country", "state", "product")))
  return(final)
}
# transpose approach:
transform.new <- function(df) {
  zz <- t(df)
  times <- t(zz[1,4:ncol(zz)])
  colnames(zz) <- c("country","city","type", times)
  data <- melt(zz[-1,-(1:3)],varnames=c("id","time"))
  final <- cbind(country=rep(zz[-1,1],each=ncol(zz)-3),
                 city   =rep(zz[-1,2],each=ncol(zz)-3),
                 type   =rep(zz[-1,3],each=ncol(zz)-3),
                 data[,-1])
  return(final)
}
# benchmark
microbenchmark(transform.orig(df),transform.new(df), times=5, unit="s")
Unit: seconds
expr min lq median uq max neval
transform.orig(df) 9.2511679 9.6986330 9.889457 10.1518191 10.3354328 5
transform.new(df) 0.4383197 0.4724145 0.474212 0.5815531 0.6886383 5
For reading the data from Excel, try the openxlsx package. It uses C++ instead of Java and handles larger Excel files better.
To reshape your data, look at the tidyr package; the gather function could help you out.
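A rough sketch of that combination, assuming a sheet laid out like the example in the question (three header rows, then data) and the same "excelfile.xlsx" file name; the exact arguments may need adjusting:
library(openxlsx)
library(tidyr)
raw <- read.xlsx("excelfile.xlsx", sheet = 1, colNames = FALSE)
# paste the three header rows into single column names, dropping blanks/NA
hdr <- apply(raw[1:3, ], 2, function(x) paste(x[!is.na(x) & x != ""], collapse = "_"))
dat <- setNames(raw[-(1:3), ], hdr)
long <- gather(dat, key = "variable", value = "value", -time)
long <- separate(long, variable, into = c("country", "state", "product"), sep = "_")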
Related
I have a folder of 10,000+ csv files stored on my hard drive. Each csv is for a species and gives presence in raster cells (so over 5 million cells if the species were present in every cell on earth).
I need to read each file and use dplyr to join to other data frames and summarise, then return a summary df. I don't have a server to run this on and it's stalling my desktop. It works with a subset of 17 species csvs, but even then it's slow.
This is similar to a few other questions about dealing with big data, but I can't figure out the right combination of packages like data.table, bigmemory, and future. I think the really slow part is the dplyr commands, as opposed to reading the files, but I'm not sure.
I'm not sure if it is possible to answer this without the files, but they're huge, so I'm not sure how to make it reproducible.
spp_ids <- <vector of the species ids, in this case 17 of them>
spp_list <- <datafame with ids of the 17 spp in the folder>
spp_info <- <dataframe with the species id and then some other columns>
cellid_df <- <big df with 5 million+ cell ids and corresponding region names>
# Loop
spp_regions <- future_lapply(spp_ids, FUN = function(x) {
  csv_file <- file.path("//filepathtoharddrivefolder",
                        sprintf('chrstoremove_%s.csv', x)) # I pull just the id number from the file names
  # summarise number of regions and cells
  spp_region_summary <- data.table::fread(csv_file, sep = ",") %>%
    dplyr::mutate(spp_id = x) %>%
    dplyr::filter(presence == 1) %>% # select cell ids where the species is present
    dplyr::left_join(cellid_df, by = "cell_id") %>%
    dplyr::group_by(region, spp_id) %>%
    dplyr::summarise(num_cells = length(presence)) %>%
    dplyr::ungroup()
  # add some additional information
  spp_region_summary <- spp_region_summary %>%
    dplyr::left_join(spp_info, by = "spp_id") %>%
    dplyr::left_join(spp_list, by = "spp_id") %>%
    dplyr::select(region, spp_id, num_cells)
  return(spp_region_summary)
})
spp_regions_df <- dplyr::bind_rows(spp_regions)
fwrite(spp_regions_df,"filepath.csv")
Haven't worked with this much data before so I've never had to leave the tidyverse!
I've tried to reproduce this. I generated 10 million rows for cellid_df and each individual file. It only took about 40 seconds for 15 "files" (using reprex added an extra 20 seconds).
If you can leave your laptop running for half a day or so, this should do it.
A couple of suggestions:
You can write to file if you're worried about memory issues.
Since spp_id is unique in each iteration, you can add it in after the merge. It will save some time.
The "additional information" can be joined to the final dataframe since it is keyed on spp_id. In data.table, left_join(X,Y,by='id') will become Y[X,on='id']
library(data.table)
spp_ids <- 1:15
set.seed(123)
N <- 1e7 # number of cell_ids
# Dummy cell ids + regions
cellid_df <- data.table(cell_id=1:N,region=sample(state.abb,N,replace = T))
head(cellid_df)
#> cell_id region
#> 1: 1 NM
#> 2: 2 IA
#> 3: 3 IN
#> 4: 4 AZ
#> 5: 5 TN
#> 6: 6 WY
#
outfile <- 'test.csv'
if (file.exists(outfile))
  file.remove(outfile)
a=Sys.time()
l <- lapply(spp_ids, function(x){
  # Generate random file with cell_id and presence
  spp_file <- data.table(cell_id=1:N,presence=round(runif(N)))
  present_cells <- cellid_df[spp_file[presence==1],on='cell_id'] # Filter and merge
  spp_region_summary <- present_cells[,.(spp_id=x,num_cells=.N),by=.(region)] # Summarise and add
  setcolorder(spp_region_summary,c('spp_id','region','num_cells')) # Reorder the columns if you want
  fwrite(spp_region_summary,outfile,append = file.exists(outfile)) # Write the summary to disk to avoid memory issues
  # If you want to keep it in memory, you can return it and use rbindlist
  # spp_region_summary
})
b=Sys.time()
b-a
#> Time difference of 1.019157 mins
# Check lines in file = (No of species) x (No of regions) + 1
R.utils::countLines(outfile)
#> Registered S3 method overwritten by 'R.oo':
#> method from
#> throw.default R.methodsS3
#> [1] 751
#> attr(,"lastLineHasNewline")
#> [1] TRUE
Created on 2019-12-20 by the reprex package (v0.3.0)
I'm trying to write R code to read data from a mess of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first column is a date and the second column has "Monthly return" as the header. In this example, the data starts in cell B5:
How do I automate the search of Excel cells for my "Monthly return" string using R?
At the moment, the best idea I can come up with is to read everything into R starting at cell A1 and sort out the mess in the resulting (huge) matrices. I'm hoping for a more elegant solution.
I haven't found a way to do this elegantly, but I'm very familiar with this problem (getting data from FactSet PA reports -> Excel -> R, right?). I understand different reports have different formats, and this can be a pain.
For a slightly different version of annoyingly formatted spreadsheets, I do the following. It's not the most elegant (it requires two reads of the file) but it works. I like reading the file twice, to make sure the columns are of the correct type, and with good headers. It's easy to mess up column imports, so I'd rather have my code read the file twice than go through and clean up columns myself, and the read_excel defaults, if you start at the right row, are pretty good.
Also, it's worth noting that as of today (2017-04-20), readxl had an update. I installed the new version to see if that would make this very easy, but I don't believe that's the case, although I could be mistaken.
library(readxl)
library(stringr)
library(dplyr)
f_path <- file.path("whatever.xlsx")
if (!file.exists(f_path)) {
  f_path <- file.choose()
}
# I read this twice, temp_read to figure out where the data actually starts...
# Maybe you need something like this -
# excel_sheets <- readxl::excel_sheets(f_path)
# desired_sheet <- which(stringr::str_detect(excel_sheets,"2 Factor Brinson Attribution"))
desired_sheet <- 1
temp_read <- readxl::read_excel(f_path,sheet = desired_sheet)
skip_rows <- NULL
col_skip <- 0
search_string <- "Monthly Returns"
max_cols_to_search <- 10
max_rows_to_search <- 10
# Note, for the - 0, you may need to add/subtract a row if you end up skipping too far later.
while (length(skip_rows) == 0) {
  col_skip <- col_skip + 1
  if (col_skip == max_cols_to_search) break
  skip_rows <- which(stringr::str_detect(temp_read[1:max_rows_to_search, col_skip][[1]], search_string)) - 0
}
# ... now we re-read from the known good starting point.
real_data <- readxl::read_excel(
  f_path,
  sheet = desired_sheet,
  skip = skip_rows
)
# You likely don't need this if you start at the right row
# But given that all weird spreadsheets are weird in their own way
# You may want to operate on the col_skip, maybe like so:
# real_data <- real_data %>%
# select(-(1:col_skip))
Okay, since the format was specified as xls, here is an update from csv to the correctly suggested xls loading.
library(readxl)
data <- readxl::read_excel(".../sampleData.xls", col_names = FALSE)
You would get something similar to:
data <- structure(list(V1 = structure(c(6L, 5L, 3L, 7L, 1L, 4L, 2L), .Label = c("",
"Apr 14", "GROSS PERFROANCE DETAILS", "Mar-14", "MC Pension Fund",
"MY COMPANY PTY LTD", "updated by JS on 6/4/2017"), class = "factor"),
V2 = structure(c(1L, 1L, 1L, 1L, 4L, 3L, 2L), .Label = c("",
"0.069%", "0.907%", "Monthly return"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -7L))
then you can dynamically filter on the "Monthly return" cell and identify your matrix.
targetCell <- which(data == "Monthly return", arr.ind = T)
returns <- data[(targetCell[1] + 1):nrow(data), (targetCell[2] - 1):targetCell[2]]
With a general purpose package like readxl, you'll have to read twice, if you want to enjoy automatic type conversion. I assume you have some sort of upper bound on the number of junk rows at the front? Here I assumed that was 10. I'm iterating over worksheets in one workbook, but the code would look pretty similar if iterating over workbooks. I'd write one function to handle a single worksheet or workbook then use lapply() or purrr::map(). This function will encapsulate the skip-learning read and the "real" read.
library(readxl)
two_passes <- function(path, sheet = NULL, n_max = 10) {
  first_pass <- read_excel(path = path, sheet = sheet, n_max = n_max)
  skip <- which(first_pass[[2]] == "Monthly return")
  message("For sheet '", if (is.null(sheet)) 1 else sheet,
          "' we'll skip ", skip, " rows.")
  read_excel(path, sheet = sheet, skip = skip)
}
(sheets <- excel_sheets("so.xlsx"))
#> [1] "sheet_one" "sheet_two"
sheets <- setNames(sheets, sheets)
lapply(sheets, two_passes, path = "so.xlsx")
#> For sheet 'sheet_one' we'll skip 4 rows.
#> For sheet 'sheet_two' we'll skip 6 rows.
#> $sheet_one
#> # A tibble: 6 × 2
#> X__1 `Monthly return`
#> <dttm> <dbl>
#> 1 2017-03-14 0.00907
#> 2 2017-04-14 0.00069
#> 3 2017-05-14 0.01890
#> 4 2017-06-14 0.00803
#> 5 2017-07-14 -0.01998
#> 6 2017-08-14 0.00697
#>
#> $sheet_two
#> # A tibble: 6 × 2
#> X__1 `Monthly return`
#> <dttm> <dbl>
#> 1 2017-03-14 0.00907
#> 2 2017-04-14 0.00069
#> 3 2017-05-14 0.01890
#> 4 2017-06-14 0.00803
#> 5 2017-07-14 -0.01998
#> 6 2017-08-14 0.00697
In these cases it's important to know the possible conditions of your data. I'm going to assume that you only want to remove columns and rows that don't conform to your table.
I have this Excel book:
I added 3 blank columns at the left because when I loaded the file in R the program omitted them. That's to confirm that R omits empty columns at the left.
First: load data
library(xlsx)
dat <- read.xlsx('book.xlsx', sheetIndex = 1)
head(dat)
MY.COMPANY.PTY.LTD NA.
1 MC Pension Fund <NA>
2 GROSS PERFORMANCE DETAILS <NA>
3 updated by IG on 20/04/2017 <NA>
4 <NA> Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
Second: I added some columns with NA and '' values, in case your data contains some:
dat$x2 <- NA
dat$x4 <- NA
head(dat)
MY.COMPANY.PTY.LTD NA. x2 x4
1 MC Pension Fund <NA> NA NA
2 GROSS PERFORMANCE DETAILS <NA> NA NA
3 updated by IG on 20/04/2017 <NA> NA NA
4 <NA> Monthly return NA NA
5 Mar-14 0.0097 NA NA
6 Apr-14 6e-04 NA NA
Third: Remove columns where all values are NA or ''. I have had to deal with that kind of problem in the past.
colSelect <- apply(dat, 2, function(x) !(length(x) == length(which(x == '' | is.na(x)))))
dat2 <- dat[, colSelect]
head(dat2)
MY.COMPANY.PTY.LTD NA.
1 MC Pension Fund <NA>
2 GROSS PERFORMANCE DETAILS <NA>
3 updated by IG on 20/04/2017 <NA>
4 <NA> Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
Fourth: Keep only rows with complete observations (that's what I suppose you want from your example).
rowSelect <- apply(dat2, 1, function(x) !any(is.na(x)))
dat3 <- dat2[rowSelect, ]
head(dat3)
MY.COMPANY.PTY.LTD NA.
5 Mar-14 0.0097
6 Apr-14 6e-04
7 May-14 0.0189
8 Jun-14 0.008
9 Jul-14 -0.0199
10 Ago-14 0.00697
Finally, if you want to keep the header you can do something like this:
colnames(dat3) <- as.matrix(dat2[which(rowSelect)[1] - 1, ])
or
colnames(dat3) <- c('Month', as.character(dat2[which(rowSelect)[1] - 1, 2]))
dat3
Month Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
7 May-14 0.0189
8 Jun-14 0.008
9 Jul-14 -0.0199
10 Ago-14 0.00697
Here is how I would tackle it.
STEP 1
Read the excel spreadsheet in without the headers.
STEP 2
Find the row index for your string Monthly return in this case
STEP 3
Filter from the identified row (or column or both), prettify a little and done.
Here is what a sample function looks like. It works for your example no matter where it is in the spreadsheet. You can play around with regex to make it more robust.
Function Definition:
library(xlsx)
extract_return <- function(path = getwd(), filename = "Mysheet.xlsx", sheetnum = 1){
  filepath = paste(path, "/", filename, sep = "")
  input = read.xlsx(filepath, sheetnum, header = FALSE)
  start_idx = which(input == "Monthly return", arr.ind = TRUE)[1]
  output = input[start_idx:dim(input)[1],]
  rownames(output) <- NULL
  colnames(output) <- c("Date","Monthly Return")
  output = output[-1, ]
  return(output)
}
Example:
final_df <- extract_return(
  path = "~/Desktop",
  filename = "Apr2017.xlsx",
  sheetnum = 2)
No matter how many rows or columns you may have, the idea remains the same. Give it a try and let me know.
This is a tidy alternative that avoids the multiple reads issue discussed above. However, when doing benchmarks, Rafael Zayas's answer still wins out.
library("tidyxl")
library("unpivotr")
library("tidyr")
library("dplyr")
tidy_solution <- function() {
  raw <- xlsx_cells("messyExcel.xlsx")
  start <- raw %>%
    filter_all(any_vars(. %in% c("Monthly return"))) %>%
    select(row, col)
  month.col <- raw %>%
    filter(row >= start$row + 1, col == start$col - 1) %>%
    pivot_wider(date, col)
  return.col <- raw %>%
    filter(row >= start$row + 1, col == start$col) %>%
    pivot_wider(numeric, col)
  output <- cbind(month.col, return.col)
}
# My Solution
expr min lq mean median uq max neval
tidy_solution() 29.0372 30.40305 32.13793 31.36925 32.9812 56.6455 100
# Rafael's
expr min lq mean median uq max neval
original_solution() 21.4405 23.8009 25.86874 25.10865 26.99945 59.4128 100
grep("2014",dat)[1]
This gives you the first column containing a year. Or use "-14" or whatever you have for years.
In a similar way, grep("Monthly",dat)[1] gives you the second column.
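Putting the two together, a small sketch (it assumes dat was read without headers, so "Monthly return" sits in the data itself, and that the dates contain "-14"):
date_col   <- grep("-14", dat)[1]       # column holding the dates
return_col <- grep("Monthly", dat)[1]   # column holding the returns
start_row  <- grep("Monthly", as.character(dat[[return_col]]))[1] + 1  # first data row
returns    <- dat[start_row:nrow(dat), c(date_col, return_col)]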
I have a data set containing quotations of indexes (S&P500, CAC40, ...) for every 5 minutes of the last 3 years, which makes it quite huge. I am trying to create new columns containing the performance of the index at each time (i.e. (quotation at [TIME] / quotation at yesterday's close) - 1) for each index. I began this way (my data is named temp):
listIndexes <- list("CAC","SP","MIB") # there are a lot more
listTime <- list(900,905,910,...1735) # every 5 minutes
for (j in 1:length(listTime)){
  Time <- listTime[j]
  for (i in 1:length(listIndexes)) {
    Index <- listIndexes[i]
    temp[[paste0(Index,"perf",Time)]] <- temp[[paste0(Index,Time)]]/temp[[paste0(Index,"close")]]-1
    # other stuff to do but with the same concept
  }
}
but it is quite long. Is there a way to get rid of the for loop(s) or to make the creation of those variables quicker? I read some stuff about the apply functions and their derivatives, but I do not see if and how they should be used here.
My data looks like this :
date CACcloseyesterday CAC1000 CAC1005 ... CACclose ... SP1000 ... SPclose
20140105 3999 4000 40001.2 4005 .... 2000 .... 2003
20140106 4005 4004 40003.5 4002 .... 2005 .... 2002
...
and my desired output would be a new column (more exactly, a new column for each time and each index) added to temp:
date CACperf1000 CACperf1005... SPperf1000...
20140106 (4004/4005)-1 (4003.5/4005)-1 .... (2005/2003)-1 # the close used is the one of the day before
idem for the following day
I wrote (4004/4005)-1 just to show the calculation, but the result should be a number: -0.0002496879
It looks like you want to generate every combination of Index and Time. Each Index-Time combination is a column in temp and you want to calculate a new perf column by comparing each Index-Time column against a specific Index close column. And your problem is that you think there should be an easier (less error-prone) way to do this.
We can remove one of the for-loops by generating all the necessary column names beforehand using something like expand.grid.
listIndexes <-list("CAC","SP","MIB")
listTime <- list(900, 905, 910, 915, 920)
df <- expand.grid(Index = listIndexes, Time = listTime,
                  stringsAsFactors = FALSE)
df$c1 <- paste0(df$Index, "perf", df$Time)
df$c2 <- paste0(df$Index, df$Time)
df$c3 <- paste0(df$Index, "close")
head(df)
#> Index Time c1 c2 c3
#> 1 CAC 900 CACperf900 CAC900 CACclose
#> 2 SP 900 SPperf900 SP900 SPclose
#> 3 MIB 900 MIBperf900 MIB900 MIBclose
#> 4 CAC 905 CACperf905 CAC905 CACclose
#> 5 SP 905 SPperf905 SP905 SPclose
#> 6 MIB 905 MIBperf905 MIB905 MIBclose
Then only one loop is required, and it's for iterating over each batch of column names and doing the calculation.
for (row_i in seq_len(nrow(df))) {
  this_row <- df[row_i, ]
  temp[[this_row$c1]] <- temp[[this_row$c2]] / temp[[this_row$c3]] - 1
}
An alternative solution would be to reshape your data into a form that makes this transformation much simpler. For instance, convert it into a long, tidy format with Date, Index, Time, Value, and ClosingValue columns and operate directly on just the two relevant columns, as sketched below.
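A rough sketch of that tidy reshape using tidyr and dplyr, assuming column names follow the pattern in your example (e.g. "CAC1000" for quotes and "CACclose" for closes); the regular expressions would need adjusting to your real names:
library(dplyr)
library(tidyr)
# index/time quote columns (e.g. CAC1000) into long form
long <- temp %>%
  pivot_longer(cols = matches("^[A-Z]+[0-9]+$"),
               names_to = c("Index", "Time"),
               names_pattern = "^([A-Z]+)([0-9]+)$",
               values_to = "Value")
# closing quote columns (e.g. CACclose) into long form
close <- temp %>%
  pivot_longer(cols = matches("^[A-Z]+close$"),
               names_to = "Index",
               names_pattern = "^([A-Z]+)close$",
               values_to = "Close") %>%
  select(date, Index, Close)
# one mutate() then covers every index and time at once
perf <- long %>%
  left_join(close, by = c("date", "Index")) %>%
  mutate(perf = Value / Close - 1)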
I'm still learning R and have been given the task of grouping a long list of students into groups of four based on another variable. I have loaded the data into R as a data frame. How do I sample entire rows without replacement, one from each of 4 levels of a variable and have R output the data into a spreadsheet?
So far I have been tinkering with a for loop and the sample function, but I'm quickly getting over my head. Any suggestions? Here is a sample of what I'm attempting to do. Given:
Last.Name <- c("Picard","Troi","Riker","La Forge", "Yar", "Crusher", "Crusher", "Data")
First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data")
Email <- c("a#a.com","b#b.com", "c#c.com", "d#d.com", "e#e.com", "f#f.com", "g#g.com", "h#h.com")
Section <- c(1,1,2,2,3,3,4,4)
df <- data.frame(Last.Name,First.Name,Email,Section)
I want to randomly select a Star Trek character from each section and end up with 2 groups of 4. I would want the entire row's worth of information to make it over to a new data frame containing all groups with their corresponding group number.
I'd use the wonderful package 'dplyr'
require(dplyr)
random_4 <- df %>% group_by(Section) %>% slice(sample(c(1,2),1))
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Troi Deanna b#b.com 1
2 La Forge Geordi d#d.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Picard Jean-Luc a#a.com 1
2 Riker William c#c.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
%>% means 'and then'
The code is read as:
Take DF AND THEN for all 'Section', select by position (slice) 1 or 2. Voila.
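One note: slice(sample(c(1,2),1)) assumes every section has exactly two rows. If a section can hold more students, a sketch of the same idea that samples from however many rows each group has (n() is the group size inside a grouped slice):
random_4 <- df %>% group_by(Section) %>% slice(sample(n(), 1))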
I suppose you have 8 students: First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data").
If you wish to randomly assign a section number to the 8 students, and assuming you would like each section to have 2 students, then you can either permute Section <- c(1, 1, 2, 2, 3, 3, 4, 4) or permute the list of the students.
First approach, permute the sections:
> assigned_section <- print(sample(Section))
[1] 1 4 3 2 2 3 4 1
Then the following data frame gives the assignments:
assigned_students <- data.frame(First.Name, assigned_section)
Second approach, permute the students:
> assigned_students <- print(sample(First.Name))
[1] "Data" "Geordi" "Tasha" "William" "Deanna" "Beverly" "Jean-Luc" "Wesley"
Then, the following data frame gives the assignments:
assigned_students <- data.frame(assigned_students, Section)
Alex, thank you. Your answer wasn't exactly what I was looking for, but it inspired the correct one for me. I had been thinking about the process from a far too complicated point of view. Instead of having R select rows and put them into a new data frame, I decided to have R assign a random number to each of the students and then sort the data frame by that number:
First, I broke up the data frame into sections:
df1<- subset(df, Section ==1)
df2<- subset(df, Section ==2)
df3<- subset(df, Section ==3)
df4<- subset(df, Section ==4)
Then I randomly generated a group number 1 through 4.
Groupnumber <-sample(1:4,4, replace=F)
Next, I told R to bind the columns:
Assigned1 <- cbind(df1,Groupnumber)
*I ran the group number generator and cbind in alternating order until I got through the whole set (I wanted to make sure the order of the numbers was unique for each section); see the compact sketch after the rbind below.
Finally row binding the data set back together:
Final_List<-rbind(Assigned1,Assigned2,Assigned3,Assigned4)
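For reference, here is a compact sketch of those alternating generate-and-cbind steps, wrapped in split() and lapply(); it assumes every student within a section should get a distinct group number:
Final_List <- do.call(rbind, lapply(split(df, df$Section), function(section_df) {
  # a fresh random permutation of group numbers for each section
  cbind(section_df, Groupnumber = sample(nrow(section_df)))
}))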
Thank you everyone who looked this over. I am new to data science, R, and stackoverflow, but as I learn more I hope to return the favor.
I'd suggest the randomizr package to "block assign" according to section. The block_ra function lets you do this in an easy-to-read one-liner.
install.packages("randomizr")
library(randomizr)
df$group <- block_ra(block_var = df$Section,
                     condition_names = c("group_1", "group_2"))
You can inspect the resulting sets in a variety of ways. Here it is with base R subsetting:
df[df$group == "group_1",]
Last.Name First.Name Email Section group
2 Troi Deanna b#b.com 1 group_1
3 Riker William c#c.com 2 group_1
6 Crusher Beverly f#f.com 3 group_1
7 Crusher Wesley g#g.com 4 group_1
df[df$group == "group_2",]
Last.Name First.Name Email Section group
1 Picard Jean-Luc a#a.com 1 group_2
4 La Forge Geordi d#d.com 2 group_2
5 Yar Tasha e#e.com 3 group_2
8 Data Data h#h.com 4 group_2
If you want to roll your own:
set <- tapply(1:nrow(df), df$Section, FUN = sample, size = 1)
df[set,] # show the sampled set
df[-set,] # show the complimentary set
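One caveat with that roll-your-own version: sample() on a length-one numeric vector samples from 1:x, so a section containing a single row could misbehave. A slightly safer sketch samples a position within each group instead:
set <- tapply(seq_len(nrow(df)), df$Section,
              FUN = function(idx) idx[sample(length(idx), 1)])
df[set,]  # show the sampled set
df[-set,] # show the complementary set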
I have a data set that includes a whole bunch of data about students, including their current school, zipcode of former residence, and a score:
students <- read.table(text = "zip school score
43050 'Hunter' 202.72974236
48227 'NYU' 338.49571519
48227 'NYU' 223.48658339
32566 'CCNY' 310.40666224
78596 'Columbia' 821.59318662
78045 'Columbia' 853.09842034
60651 'Lang' 277.48624384
32566 'Lang' 315.49753763
32566 'Lang' 80.296556533
94941 'LIU' 373.53839238
",header = TRUE,sep = "")
I want a heap of summary data about it, per school. How many students from each school are in the data set, how many unique zipcodes per school, average and cumulative score. I know I can get this by using tapply to create a bunch of tmp frames:
tmp.mean <- data.frame(tapply(students$score, students$school, mean))
tmp.sum <- data.frame(tapply(students$score, students$school, sum))
tmp.unique.zip <- data.frame(tapply(students$zip, students$school, function(x) length(unique(x))))
tmp.count <- data.frame(tapply(students$zip, students$school, function(x) length(x)))
Giving them better column names:
colnames(tmp.unique.zip) <- c("Unique zips")
colnames(tmp.count) <- c("Count")
colnames(tmp.mean) <- c("Mean Score")
colnames(tmp.sum) <- c("Total Score")
And using cbind to tie them all back together again:
school.stats <- cbind(tmp.mean, tmp.sum, tmp.unique.zip, tmp.count)
I think the cleaner way to do this is:
library(plyr)
school.stats <- ddply(students, .(school), summarise,
                      record.count = length(score),
                      unique.r.zips = length(unique(zip)),
                      mean.dist = mean(score),
                      total.dist = sum(score)
)
The resulting data looks about the same (actually, the ddply approach is cleaner and includes the schools as a column instead of as row names). Two questions: is there a better way to find out how many records are associated with each school? And am I using ddply efficiently here? I'm new to it.
If performance is an issue, you can also use data.table:
require(data.table)
tab_s<-data.table(students)
setkey(tab_s,school)
tab_s[,list(total=sum(score),
            avg=mean(score),
            unique.zips=length(unique(zip)),
            records=length(score)),
      by="school"]
school total avg unique.zips records
1: Hunter 202.7297 202.7297 1 1
2: NYU 561.9823 280.9911 1 2
3: CCNY 310.4067 310.4067 1 1
4: Columbia 1674.6916 837.3458 2 2
5: Lang 673.2803 224.4268 2 3
6: LIU 373.5384 373.5384 1 1
Comments seem to be in general agreement: this looks good.
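For completeness, here is a dplyr sketch of the same summary; n() answers the record-count question directly (just another option alongside ddply and data.table):
library(dplyr)
students %>%
  group_by(school) %>%
  summarise(record.count  = n(),            # rows per school
            unique.r.zips = n_distinct(zip),
            mean.dist     = mean(score),
            total.dist    = sum(score))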