How remain pairs by below criteria? (explained my criteria) - r

While doing my data work I have this problem.
Data is as below,
row_number var1 var2
1 1921 16
2 1922 16
3 1921 17
4 1922 17
5 1703 29
6 1704 29
7 1705 29
8 1703 30
9 1704 30
10 1705 30
11 1703 31
12 1704 31
13 1705 31
I want to make pairs by only using unique var1 and unique var2.
In other words, 1~4 rows can be a group and I only need to remain 1st and 4th column. And, 5~13 rows can be an another group and I only need to remain this pair (1703 29, 1704 30, 1705 31). That is, I want to have this outcome
row_number var1 var2
1 1921 16
4 1922 17
5 1703 29
9 1704 30
13 1705 31
I have much more observations.

Suppose your data is in a dataframe named d. Then
out <- data.frame(row_number = NA, var1 = NA, var2 = NA)
for (i in 1:nrow(d)) {
if (!(d[i, "var1" ] %in% out[, "var1"]) & !(d[i, "var2"] %in% out[, "var2"])) {
out <- rbind(out, d[i,])
}
}
out <- out[-1, ]
out
# row_number var1 var2
# 2 1 1921 16
# 4 4 1922 17
# 5 5 1703 29
# 9 9 1704 30
# 13 13 1705 31
gives you your desired result by iterating through rows of d and extracting only rows where neither var1 nor var2 has previously appeared in the output dataframe.

Related

find max column value in r conditional on another column

I have a data frame of baseball player information:
playerID nameFirst nameLast bats throws yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
81955 rolliji01 Jimmy Rollins B R 2007 1 PHI NL 162 716 139 212 38 20 30 94 41 6 49 85 5
103358 wilsowi02 Willie Wilson B R 1980 1 KCA AL 161 705 133 230 28 15 3 49 79 10 28 81 3
93082 suzukic01 Ichiro Suzuki L R 2004 1 SEA AL 161 704 101 262 24 5 8 60 36 11 49 63 19
83973 samueju01 Juan Samuel R R 1984 1 PHI NL 160 701 105 191 36 19 15 69 72 15 28 168 2
15201 cashda01 Dave Cash R R 1975 1 PHI NL 162 699 111 213 40 3 4 57 13 6 56 34 5
75531 pierrju01 Juan Pierre L L 2006 1 CHN NL 162 699 87 204 32 13 3 40 58 20 32 38 0
HBP SH SF GIDP average
81955 7 0 6 11 0.2960894
103358 6 5 1 4 0.3262411
93082 4 2 3 6 0.3721591
83973 7 0 1 6 0.2724679
15201 4 0 7 8 0.3047210
75531 8 10 1 6 0.2918455
I want to return a maximum value of the batting average ('average') column where the at-bats ('AB') are greater than 100. There are also 'NaN' in the average column.
If you want to return the entire row for which the two conditions are TRUE, you can do something like this.
library(tidyverse)
data <- tibble(
AB = sample(seq(50, 150, 10), 10),
avg = c(runif(9), NaN)
)
data %>%
filter(AB >= 100) %>%
filter(avg == max(avg, na.rm = TRUE))
Where the first filter is to only keep rows where AB is greater than or equal to 100 and the second filter is to select the entire row where it is max. If you want to to only get the maximum value, you can do something like this:
data %>%
filter(AB >= 100) %>%
summarise(max = max(avg, na.rm = TRUE))

Expand data.frame by creating duplicates based on group condition (3)

Starting from this SO question.
Example data.frame:
df = read.table(text = 'ID Day Count Count_group
18 1933 6 15
33 1933 6 15
37 1933 6 15
18 1933 6 15
16 1933 6 15
11 1933 6 15
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 2
677 1798 2 2
778 888 4 8
111 888 4 8
88 888 4 8
10 888 4 8
37 887 2 4
26 887 2 4
8 886 1 2
56 885 1 1
22 120 2 6
34 120 2 6
88 119 1 6
99 118 2 5
12 118 2 5
90 117 1 3
22 115 2 2
99 115 2 2', header = TRUE)
The Count col shows the total number of ID values per each Day and the Count_group col shows the sum of the ID values per each Day, Day - 1, Day -2, Day -3 and Day -4.
e.g. 1933 = Count_group 15 because Count 6 (1933) + Count 5 (1932) + Count 3 (1931) + Count 1 (1930) + Count 0 (1929).
What I need to do is to create duplicated observations per each Count_group and add them to it in order to show per each Count_group its Day, Day - 1, Day -2, Day -3 and Day -4.
e.g. Count_group = 15 is composed by the Count values of Day 1933, 1932, 1931, 1930 (and 1929 not present in the df). So the five days needs to be included in the Count_group = 15. The next one will be Count_group = 9, composed by 1932, 1931, 1930, 1929 and 1928; etc...
Desired output:
ID Day Count Count_group
18 1933 6 15
33 1933 6 15
37 1933 6 15
18 1933 6 15
16 1933 6 15
11 1933 6 15
111 1932 5 15
34 1932 5 15
60 1932 5 15
88 1932 5 15
18 1932 5 15
33 1931 3 15
13 1931 3 15
56 1931 3 15
23 1930 1 15
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 9
13 1931 3 9
56 1931 3 9
23 1930 1 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 12
6 1799 4 12
52 1799 4 12
133 1799 4 12
112 1798 2 12
677 1798 2 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 6
677 1798 2 6
112 1798 2 2
677 1798 2 2
778 888 4 8
111 888 4 8
88 888 4 8
10 888 4 8
37 887 2 8
26 887 2 8
8 886 1 8
56 885 1 8
37 887 2 4
26 887 2 4
8 886 1 4
56 885 1 4
8 886 1 2
56 885 1 2
56 885 1 1
22 120 2 6
34 120 2 6
88 119 1 6
99 118 2 6
12 118 2 6
90 117 1 6
88 119 1 6
99 118 2 6
12 118 2 6
90 117 1 6
22 115 2 6
99 115 2 6
99 118 2 5
12 118 2 5
90 117 1 5
22 115 2 5
99 115 2 5
90 117 1 3
22 115 2 3
99 115 2 3
22 115 2 2
99 115 2 2
(note that different group of 5 days each one have been separated by a blank line in order to make them clearer)
I have got different data.frames which are grouped by n days and therefore I would like to adapt the code (by changing it a little) specifically for each of them.
Thanks
A generalised version of my previous answer...
#first add grouping variables
days <- 5 #grouping no of days
df$smalldaygroup <- c(0,cumsum(sapply(2:nrow(df),function(i) df$Day[i]!=df$Day[i-1]))) #individual days
df$bigdaygroup <- c(0,cumsum(sapply(2:nrow(df),function(i) df$Day[i]<df$Day[i-1]-days+1))) #blocks of linked days
#duplicate days in each big group
df2 <- lapply(split(df,df$bigdaygroup),function(x) {
n <- max(x$Day)-min(x$Day)+1 #number of consecutive days in big group
dayvec <- (max(x$Day):min(x$Day)) #possible days in range
daylog <- dayvec[dayvec %in% x$Day] #actual days in range
pattern <- data.frame(base=rep(dayvec,each=days))
pattern$rep <- sapply(1:nrow(pattern),function(i) pattern$base[i]+1-sum(pattern$base[1:i]==pattern$base[i])) #indices to repeat
pattern$offset <- match(pattern$rep,daylog)-match(pattern$base,daylog) #offsets (used later)
pattern <- pattern[(pattern$base %in% x$Day) & (pattern$rep %in% x$Day),] #remove invalid elements
#store pattern in list as offsets needed in next loop
return(list(df=split(x,x$smalldaygroup)[match(pattern$rep,daylog)],pat=pattern))
})
#change the Count_group to previous value in added entries
df2 <- lapply(df2,function(L) lapply(1:length(L$df),function(i) {
x <- L$df[[i]]
offset <- L$pat$offset #pointer to day to copy Count_group from
x$Count_group <- L$df[[i-offset[i]]]$Count_group[1]
return(x)
}))
df2 <- do.call(rbind,unlist(df2,recursive=FALSE)) #bind back together
df2[,5:6] <- NULL #remove grouping variables
head(df2,30) #ignore rownames!
ID Day Count Count_group
01.1 18 1933 6 15
01.2 33 1933 6 15
01.3 37 1933 6 15
01.4 18 1933 6 15
01.5 16 1933 6 15
01.6 11 1933 6 15
02.7 111 1932 5 15
02.8 34 1932 5 15
02.9 60 1932 5 15
02.10 88 1932 5 15
02.11 18 1932 5 15
03.12 33 1931 3 15
03.13 13 1931 3 15
03.14 56 1931 3 15
04 23 1930 1 15
05.7 111 1932 5 9
05.8 34 1932 5 9
05.9 60 1932 5 9
05.10 88 1932 5 9
05.11 18 1932 5 9
06.12 33 1931 3 9
06.13 13 1931 3 9
06.14 56 1931 3 9
07 23 1930 1 9
08.12 33 1931 3 4
08.13 13 1931 3 4
08.14 56 1931 3 4
09 23 1930 1 4
010 23 1930 1 1
11.16 6 1800 6 12
I attach a rather mechanical method, but I believe it is a good starting point.
I have noticed that in your original table the entry
ID Day Count Count_group
18 1933 6 14
is duplicated; I have left it untouched for sake of clarity.
Structure of the approach:
Read original data
Generate list of data frames, for each Day
Generate final data frame, collapsing the list in 2.
1. Read original data
We start with
df = read.table(text = 'ID Day Count Count_group
18 1933 6 14
33 1933 6 14
37 1933 6 14
18 1933 6 14
16 1933 6 14
11 1933 6 14
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 2
677 1798 2 2
778 888 4 7
111 888 4 7
88 888 4 7
10 888 4 7
37 887 2 4
26 887 2 4
8 886 1 2
56 885 1 1', header = TRUE)
# ordered vector of unique values for "Day"
ord_day <- unique(df$Day[order(df$Day)])
ord_day
[1] 885 886 887 888 1798 1799 1800 1930 1931 1932 1933
2. Generate list of data frames, for each Day
For each element in ord_day we introduce a data.frame as element of a list called df_new_aug.
Such data frames are defined through a for loop for all values in ord_day except ord_day[2] and ord_day[1] which are treated separately.
Idea behind the looping: for each unique ord_day[i] with i > 2 we check which days between ord_day[i-1] and ord_day[i-2] (or both!) contribute (through the variable "Count") to the value "Count_Group" at ord_day[i].
We therefore introduce if else statements in the loop.
Here we go
# Recursive generation of the list of data.frames (for days > 886)
#-----------------------------------------------------------------
df_new <- list()
df_new_aug <- list()
# we exclude cases i=1, 2: they are manually treated below
for ( i in 3: length(ord_day) ) {
# is "Count_Group" for ord_day[i] equal to the sum of "Count" at ord_day[i-1] and ord_day[i-2]?
if ( unique(df[df$Day == ord_day[i], "Count_group"]) == unique(df[df$Day == ord_day[i], "Count"]) +
unique(df[df$Day == ord_day[i-1], "Count"]) + unique(df[df$Day == ord_day[i-2], "Count"])
) {
# we create columns ID | Day | Count
df_new[[i]] <- data.frame(df[df$Day == ord_day[i] | df$Day == ord_day[i-1] | df$Day == ord_day[i-2],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[i]] <- data.frame( df_new[[i]],
Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
} else if (unique(df[df$Day == ord_day[i], "Count_group"]) == unique(df[df$Day == ord_day[i], "Count"]) +
unique(df[df$Day == ord_day[i-1], "Count"]) ) #only "Count" at i and i-1 contribute to "Count_group" at i
{
df_new[[i]] <- data.frame(df[df$Day == ord_day[i] | df$Day == ord_day[i-1],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[2]
df_new_aug[[i]] <- data.frame(df_new[[i]],
Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
} else #only "Count" at i contributes to "Count_group" at i
df_new[[i]] <- data.frame(df[df$Day == ord_day[i],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[i]] <- data.frame(df_new[[i]],
Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
#closing the for loop
}
# for ord_day[2] = "886" (both "Count" at i =2 and i = 1 contribute to "Count_group" at i=2)
#-------------------------------------------------------------------------------------
df_new[[2]] <- data.frame(df[df$Day == ord_day[2] | df$Day == ord_day[1],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[2]
df_new_aug[[2]] <- data.frame(df_new[[2]],
Count_group = rep(unique(df[df$Day == ord_day[2], "Count_group"]), nrow(df_new[[2]]) ) )
# for ord_day[1] = "885" (only "count" at i = 1 contributes to "Count_group" at i =1)
#------------------------------------------------------------------------------------
df_new[[1]] <- data.frame(df[df$Day == ord_day[1], c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[1]] <- data.frame(df_new[[1]], Count_group = rep(unique(df[df$Day == ord_day[1], "Count_group"]), nrow(df_new[[1]]) ) )
# produced list
df_new_aug
3. Generate final data frame, collapsing the list in 2.
We collapse df_new_aug through an ugly loop, but other solutions (for example with Reduce() and merge() are possible):
# merging the list (mechanically): final result
df_result <- df_new_aug[[1]]
for (i in 1:10){
df_result <- rbind(df_result, df_new_aug[[i+1]])
}
One arrives at df_result and the analysis is stopped.

How can I select the 10 largest values from three different columns and save them in a new data frame in R?

Var1 <- 90:115
Var2 <- 1:26
Var3 <- 52:27
data <- data.frame(Var1, Var2, Var3)
Hi, I want to select from each column the 10 largest values and save them in a new data frame? I know that in my example the new data frame will contain 20 rows but I don't understand the correct workflow.
That's what I'm expecting:
Var1 Var2 Var3
90 1 52
91 2 51
92 3 50
93 4 49
94 5 48
95 6 47
96 7 46
97 8 45
98 9 44
99 10 43
106 17 36
107 18 35
108 19 34
109 20 33
110 21 32
111 22 31
112 23 30
113 24 29
114 25 28
115 26 27
I can solve my problem for three column with this approach
df <- subset(data, Var1 >=106 | Var2 >=17 | Var3 >=43)
but if I have to do that for 50+ columns it's not really the best solution.
This can be done by looping over the columns with lapply, sort them, and get the first 10 values with head
data.frame(lapply(data, function(x) head(sort(x,
decreasing=TRUE) ,10)))
If we need the first 10 rows, just use
head(data, 10)
Update
Based on the OP's edit
data[sort(Reduce(union,lapply(data, function(x)
order(x,decreasing=TRUE)[1:10]))),]
I think this is what you want:
data[sort(unique(c(sapply(data,order,decreasing=T)[1:10,]))),]
Basically index the top 10 elements from each column, merge them and remove duplicate, reorder and extract it from the original data.
A direct answer to your question:
nv1 <- sort(Var1,decreasing = TRUE)[1:10]
nv2 <- sort(Var2,decreasing = TRUE)[1:10]
nv3 <- sort(Var2,decreasing = TRUE)[1:10]
nd <- data.frame(nv1, nv2, nv3)
But why would you want to do such a thing? You're breaking the order of the data -- Var3 is increasing and the others are decreasing. Perhaps you want a list, rather than a data frame?
This might help:
thresh <- sapply(data,sort,decreasing=T)[10,]
data[!!rowSums(sapply(1:ncol(data),function(x) data[,x]>=thresh[x])),]
First, a vector thresh is defined, which contains the tenth largest value of each column. Then we perform a loop over the columns to check if any of the values is larger than or equal to the corresponding threshold value. The !! is a shorthand notation for as.logical(), which (owing to the combination with rowSums) selects those rows where at least one of the values is above or equal to the threshold. In your example this yields the output:
# Var1 Var2 Var3
#1 90 1 52
#2 91 2 51
#3 92 3 50
#4 93 4 49
#5 94 5 48
#6 95 6 47
#7 96 7 46
#8 97 8 45
#9 98 9 44
#10 99 10 43
#17 106 17 36
#18 107 18 35
#19 108 19 34
#20 109 20 33
#21 110 21 32
#22 111 22 31
#23 112 23 30
#24 113 24 29
#25 114 25 28
#26 115 26 27
Which is equal to the output that you obtain with the command you posted:
#> identical(data[!!rowSums(sapply(1:ncol(data),function(x) data[,x]>=thresh[x])),], subset(data, Var1 >=106 | Var2 >=17 | Var3 >=43))
[1] TRUE

Transforming long format data to short format by segmenting dates that include redundant observations

I have a data set that is long format and includes exact date/time measurements of 3 scores on a single test administered between 3 and 5 times per year.
ID Date Fl Er Cmp
1 9/24/2010 11:38 15 2 17
1 1/11/2011 11:53 39 11 25
1 1/15/2011 11:36 39 11 39
1 3/7/2011 11:28 95 58 2
2 10/4/2010 14:35 35 9 6
2 1/7/2011 13:11 32 7 8
2 3/7/2011 13:11 79 42 30
3 10/12/2011 13:22 17 3 18
3 1/19/2012 14:14 45 15 36
3 5/8/2012 11:55 29 6 11
3 6/8/2012 11:55 74 37 7
4 9/14/2012 9:15 62 28 18
4 1/24/2013 9:51 82 45 9
4 5/21/2013 14:04 135 87 17
5 9/12/2011 11:30 98 61 18
5 9/15/2011 13:23 55 22 9
5 11/15/2011 11:34 98 61 17
5 1/9/2012 11:32 55 22 17
5 4/20/2012 11:30 23 4 17
I need to transform this data to short format with time bands based on month (i.e. Fall=August-October; Winter=January-February; Spring=March-May). Some bands will include more than one observation per participant, and as such, will need a "spill over" band. An example transformation for the Fl scores below.
ID Fall1Fl Fall2Fl Winter1Fl Winter2Fl Spring1Fl Spring2Fl
1 15 NA 39 39 95 NA
2 35 NA 32 NA 79 NA
3 17 NA 45 NA 28 74
4 62 NA 82 NA 135 NA
5 98 55 55 NA 23 NA
Notice that dates which are "redundant" (i.e. more than 1 Aug-Oct observation) spill over into Fall2fl column. Dates that occur outside of the desired bands (i.e. November, December, June, July) should be deleted. The final data set should have additional columns that include Fl Er and Cmp.
Any help would be appreciated!
(Link to .csv file with long data http://mentor.coe.uh.edu/Data_Example_Long.csv )
This seems to do what you are looking for, but doesn't exactly match your desired output. I haven't looked at your sample data to see whether the problem lies with your sample desired output or the transformations I've done, but you should be able to follow along with the code to see how the transformations were made.
## Convert dates to actual date formats
mydf$Date <- strptime(gsub("/", "-", mydf$Date), format="%m-%d-%Y %H:%M")
## Factor the months so we can get the "seasons" that you want
Months <- factor(month(mydf$Date), levels=1:12)
levels(Months) <- list(Fall = c(8:10),
Winter = c(1:2),
Spring = c(3:5),
Other = c(6, 7, 11, 12))
mydf$Seasons <- Months
## Drop the "Other" seasons
mydf <- mydf[!mydf$Seasons == "Other", ]
## Add a "Year" column
mydf$Year <- year(mydf$Date)
## Add a "Times" column
mydf$Times <- as.numeric(ave(as.character(mydf$Seasons),
mydf$ID, mydf$Year, FUN = seq_along))
## Load "reshape2" and use `dcast` on just one variable.
## Repeat for other variables by changing the "value.var"
dcast(mydf, ID ~ Seasons + Times, value.var="Fluency")
# ID Fall_1 Fall_2 Winter_1 Winter_2 Spring_2 Spring_3
# 1 1 15 NA 39 39 NA 95
# 2 2 35 NA 32 NA 79 NA
# 3 3 17 NA 45 NA 29 NA
# 4 4 62 NA 82 NA 135 NA
# 5 5 98 55 55 NA 23 NA

Error in sort.list(y) whlie using 'Strata()' in R

When I run the command:
H <-length(table(data$Team))
n.h <- rep(5,H)
strata(data, stratanames=data$Team,size=n.h,method="srswor"),
I get the error statement:
'Error in sort.list(y) : 'x' must be atomic for 'sort.list' Have you called 'sort' on a list?'
Please help me how can I get this stratified sample. The variable 'Team' is 'Factor' type.
Data is as below:
zz <- "Team League.ID Player Salary POS G GS InnOuts PO A
ANA AL molinjo0 335000 C 73 57 1573 441 37
ANA AL percitr0 7833333 P 3 0 149 1 3
ARI NL bautida0 4000000 RF 141 135 3536 265 8
ARI NL estalbo0 550000 C 7 3 92 19 2
ARI NL finlest0 7000000 CF 104 102 2689 214 5
ARI NL koplomi0 330000 P 72 0 260 6 23
ARI NL sparkst0 500000 P 27 18 362 8 21
ARI NL villaos0 325000 P 17 0 54 0 4
ARI NL webbbr01 335000 P 33 35 624 13 41
ATL NL francju0 750000 1B 125 71 1894 627 48
ATL NL hamptmi0 14625000 P 35 29 517 13 37
ATL NL marreel0 3000000 LF 90 42 1125 80 4
ATL NL ortizru0 6200000 P 32 34 614 7 38
BAL AL surhobj0 800000 LF 100 31 805 69 0"
data <- read.table(text=zz, header=T)
This should work:
library(sampling)
H <- length(levels(data$Team))
n.h <- rep(5, H)
strata(data, stratanames=c("Team"), size=n.h, method="srswor")
stratanames should be a list of column names, not a reference to the actual column data.
Update:
Now that example data is available, I see another problem: you are sampling without-replacement (wor), but your samples are bigger that the available data. You need to sample with replacement in this case
smpl <- strata(data, stratanames=c("Team"), size=n.h, method="srswr")
BTW, you get the actual data with:
sampledData <- getdata(data, smpl)
This doesn't really answer your question, but a long time ago, I wrote a function called stratified that might be of use to you.
I've posted it here as a GitHub Gist.
Notice that when you have asked for samples that are bigger than your data, it just returns all of the relevant rows.
output <- stratified(data, "Team", 5)
# Some groups
# ---ANA, ATL, BAL---
# contain fewer observations than desired number of samples.
# All observations have been returned from those groups.
table(output$Team)
#
# ANA ARI ATL BAL
# 2 5 4 1
output
# Team League.ID Player Salary POS G GS InnOuts PO A
# 1 ANA AL molinjo0 335000 C 73 57 1573 441 37
# 2 ANA AL percitr0 7833333 P 3 0 149 1 3
# 9 ARI NL webbbr01 335000 P 33 35 624 13 41
# 7 ARI NL sparkst0 500000 P 27 18 362 8 21
# 8 ARI NL villaos0 325000 P 17 0 54 0 4
# 3 ARI NL bautida0 4000000 RF 141 135 3536 265 8
# 6 ARI NL koplomi0 330000 P 72 0 260 6 23
# 12 ATL NL marreel0 3000000 LF 90 42 1125 80 4
# 13 ATL NL ortizru0 6200000 P 32 34 614 7 38
# 10 ATL NL francju0 750000 1B 125 71 1894 627 48
# 11 ATL NL hamptmi0 14625000 P 35 29 517 13 37
# 14 BAL AL surhobj0 800000 LF 100 31 805 69 0
I'll add official documentation to the function at some point, but here's a summary to help you get the best use out of it:
The arguments to stratified are:
df: The input data.frame
group: A character vector of the column or columns that make up the "strata".
size: The desired sample size.
If size is a value less than 1, a proportionate sample is taken from each stratum.
If size is a single integer of 1 or more, that number of samples is taken from each stratum.
If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
replace: For sampling with replacement.

Resources