groups of different size randomly selected within different classes - r

i have such a difficult question (at least to me) that i spend 2 hours just writing it. Complete impossible to program it by my self. I try to be very clear and i´m sorry if i didn´t. I´m doing this in a very rustic way in excel, but i really need to program this.
i have a data.frame like this
id_pix id_lote clase f1 f2
45 4 Sg 2460 2401
46 4 Sg 2620 2422
47 4 Sg 2904 2627
48 5 M 2134 2044
49 5 M 2180 2104
50 5 M 2127 2069
83 11 S 2124 2062
84 11 S 2189 2336
85 11 S 2235 2162
86 11 S 2162 2153
87 11 S 2108 2124
with 17451 "id_pixel"(rows), 2080 "id_lote" and 9 "clase"
this is the "id_lote" count per "clase" (v1 is the id_lote count)
clase v1
1: S 1099
2: P 213
3: Sg 114
4: M 302
5: Alg 27
6: Az 77
7: Po 228
8: Cit 13
9: Ma 7
i need to split the "id_lote" randomly within the "clase". I mean i have 1099 "id_lote" for the "S" "clase" that are 9339 "id_pixel" (rows) and i want to randomly select 50 % of "id_lote" that are x "id_pixel"(rows). And do this for every "clase" considering that the size (number of "id_lote") of every "clase" are different. I also would like to be able to change the size of the selection (50 %, 30 %, etc). And i also want to keep the not selected set of "id_lote". I hope some one can help me with this!
here is the reproducible example
this is the data with 2 clase (S and Az), with 6 id_lote and 13 id_pixel
id_pix id_lote clase f1 f2
1 1 S 2909 2381
2 1 S 2515 2663
3 1 S 2628 3249
30 2 S 3021 2985
31 2 S 3020 2596
71 9 S 4725 4404
72 9 S 4759 4943
75 11 S 2728 2225
218 21 Az 4830 3007
219 21 Az 4574 2761
220 21 Az 5441 3092
1155 126 Az 7209 2449
1156 126 Az 7035 2932
and one result could be:
id_pix id_lote clase f1 f2
1 1 S 2909 2381
2 1 S 2515 2663
3 1 S 2628 3249
75 11 S 2728 2225
1155 126 Az 7209 2449
1156 126 Az 7035 2932
were 50% of id_lote were randomly selected in clase "S" (2 of 4 id_lote) but all the id_pixel in selected id_lote were keeped. The same for clase "Az", one id_lote was randomly selected (1 of 2 in this case) and all the id_pixel in selected id_lote were keeped.
what colemand77 proposed helped a lot. I think dplyr package is usefull for this but i think that if i do
df %>%
group_by(clase, id_lote) %>%
sample_frac(.3, replace = FALSE)
i get the 30 % of the data of each clase but not grouped by id_lote like i need! I mean 30 % of the rows (id_pixel) were selected instead of id_lote.
i hope this example help to understand what i want to do and make it usefull for everybody. I´m sorry if i wasn´t clear enough the first time.
Thanks a lot!

First glimpse I'd say the dplyr package is your friend here.
df %>%
group_by(clase, id_lote) %>%
sample_frac(.3, replace = FALSE)
so you first use group_by() and include the grouping levels you want to sample from, then you use sample_frac to sample the fraction of the results you want for each group.
As near as I can tell this is what you are asking for. If not, please consider re-stating your question to include either a reproducible example or clarify. Cheers.
to "keep" the not-selected members, I would add a column of unique ids, and use an anti-join anti_join()(also from the dplyr package) to find the id's that are not in common between the two data.frames (the results of the sampling and the original).
## Update ##
I'm understanding better now, I believe. Think about this as a two step process...
1) you want to select x% (50 in example) of the id_lote from each clase and return those id_lote #s (i'm assuming that a given id_lote does not exist for multiple clase?)
2) you want to see all of the id_pixels that correspond to each id_lote, all in one data.frame
I've broken this down into multiple steps for illustration, not because it is the fastest / prettiest.
raw data: (couldn't read your data into R.)
df<-data.frame(id_pix = c(1:200),
id_lote = sample(1:20,200, replace = TRUE),
clase = sample(letters[seq_along(1:10)], 200, replace = TRUE),
f1 = sample(1000:2000,200, replace = TRUE),
f2 = sample(2000:3000,200, replace = TRUE))
1) figure out which id_lote correspond to which clase - for this we use the dplyr summarise function and store it in a variable
summary<-df %>%
ungroup() %>%
group_by(clase, id_lote) %>%
summarise()
returns:
Source: local data frame [125 x 2]
Groups: clase
clase id_lote
1 a 1
2 a 2
3 a 4
4 a 5
5 a 6
6 a 7
7 a 8
8 a 9
9 a 11
10 a 12
.. ... ...
then we sample to get the 30% of the id_lote for each clase..
sampled_summary <- summary %>%
group_by(clase) %>%
sample_frac(.3,replace = FALSE)
so the result of this is a data table with two columns, (clase and id_lote) with 30% of the id_lotes shown for each clase.
2) ok so now we have the id_lotes randomly selected from each class but not the id_pix that are associated with that class. To accomplish this we do a join to get the corresponding full data set including the id_pix, etc.
result <- sampled_summary %>%
left_join(df)
The above makes a copy of the data set a bunch, so if you have a substantial data set you could just do it all at one go:
result <- df %>%
ungroup() %>%
group_by(clase, id_lote) %>%
summarise() %>%
group_by(clase) %>%
sample_frac(.5,replace = FALSE) %>%
left_join(df)
if this doesn't get you what you want, let me know and we'll take another crack at it.

Related

tidyr::gather can't find my value variables

I have an extremely large data.frame. I reproduce part of it.
RECORDING_SESSION_LABEL condition TRIAL_INDEX IA_LABEL IA_DWELL_TIME
1 23 match 1 eyes 3580
2 23 match 1 nose 2410
3 23 match 1 mouth 1442
4 23 match 1 face 841
5 23 mismatch 3 eyes 1817
6 23 mismatch 3 nose 1724
7 23 mismatch 3 mouth 1600
8 23 mismatch 3 face 1136
9 23 mismatch 4 eyes 4812
10 23 mismatch 4 nose 3710
11 23 mismatch 4 mouth 4684
12 23 mismatch 4 face 1557
13 23 mismatch 6 eyes 4645
14 23 mismatch 6 nose 2321
15 23 mismatch 6 mouth 674
16 23 mismatch 6 face 684
17 23 match 7 eyes 1062
18 23 match 7 nose 1359
19 23 match 7 mouth 215
20 23 match 7 face 0
I need to calculate the percentage of IA_DWELL_TIME for each IA_LABEL in each trial index. For that, I first put IA_label in different columns
data_IA_DWELL_TIME <- tidyr::spread(data_IA_DWELL_TIME, key = IA_LABEL, value = IA_DWELL_TIME)
For calculating the percentage, I create a new dataframe:
data_IA_DWELL_TIME_percentage <-data_IA_DWELL_TIME
data_IA_DWELL_TIME_percentage$eyes <- 100*(data_IA_DWELL_TIME$eyes/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
data_IA_DWELL_TIME_percentage$nose <- 100*(data_IA_DWELL_TIME$nose/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
data_IA_DWELL_TIME_percentage$mouth <- 100*(data_IA_DWELL_TIME$mouth/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
data_IA_DWELL_TIME_percentage$face <- 100*(data_IA_DWELL_TIME$face/(rowSums(data_IA_DWELL_TIME[,c("eyes","nose","mouth","face")])))
So all is fine, and I get the wanted output. The problem is when I want to put the columns back to the rows
data_IA_DWELL_TIME_percentage <- tidyr::gather(key = IA_LABEL, value = IA_DWELL_TIME,-RECORDING_SESSION_LABEL,-condition, -TRIAL_INDEX)
I obtain this error:
Error in tidyr::gather(key = IA_LABEL, value = IA_DWELL_TIME,
-RECORDING_SESSION_LABEL, : object 'RECORDING_SESSION_LABEL' not found
>
Any idea of what is going on here? Thanks!
As explained, you're not referring to your data frame in the gather statement.
However, you could avoid the need for referring to it altogether and put the second part in a dplyr pipeline, like below:
library(dplyr)
library(tidyr)
data_IA_DWELL_TIME <- spread(data_IA_DWELL_TIME, key = IA_LABEL, value = IA_DWELL_TIME)
data_IA_DWELL_TIME %>%
mutate_at(
vars(eyes, nose, mouth, face),
~ 100 * (. / (rowSums(data_IA_DWELL_TIME[, c("eyes", "nose", "mouth", "face")])))
) %>%
gather(key = IA_LABEL, value = IA_DWELL_TIME,-RECORDING_SESSION_LABEL,-condition, -TRIAL_INDEX)

Using ddply across numerous variables when calculating descriptive statistics

Here's my data. It shows the amount of fish I found at three different sites.
Selidor.Bay Enlades.Bay Cumphrey.Bay
1 39 29 187
2 70 370 50
3 13 44 52
4 0 65 20
5 43 110 220
6 0 30 266
What I would like to do is create a script to calculate basic statistics for each site.
If I re-arrange the data by stacking it. I.e :
values site
1 29 Selidor.Bay
2 370 Selidor.Bay
3 44 Selidor.Bay
4 65 Enlades.Bay
I'm able to use the following:
data <- ddply(df, c("site"), summarise,
N = length(values),
mean = mean(values),
sd = sd(values),
se = sd / sqrt(N),
sum = sum(values)
)
data.
My question is how can I use the script without having to stack my dataframe?
Thanks.
A slight variation on #docendodiscimus' comment:
library(reshape2)
library(dplyr)
DF %>%
melt(variable.name="site") %>%
group_by(site) %>%
summarise_each(funs( n(), mean, sd, se=sd(.)/sqrt(n()), sum ), value)
# site n mean sd se sum
# 1 Selidor.Bay 6 27.5 27.93385 11.40395 165
# 2 Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. There is likely some analogous function in the tidyr package.

R - setting equiprobability over a specific variable when sampling

I have a data set with more than 2 millions entries which I load into a data frame.
I'm trying to grab a subset of the data. I need around 10000 entries but I need the entries to be picked with equal probability on one variable.
This is what my data looks like with str(data):
'data.frame': 2685628 obs. of 3 variables:
$ category : num 3289 3289 3289 3289 3289 ...
$ id: num 8064180 8990447 747922 9725245 9833082 ...
$ text : chr "text1" "text2" "text3" "text4" ...
You've noticed that I have 3 variables : category,id and text.
I have tried the following :
> sample_data <- data[sample(nrow(data),10000,replace=FALSE),]
Of course this works, but the probability of sample if not equal. Here is the output of count(sample_data$category) :
x freq
1 3289 707
2 3401 341
3 3482 160
4 3502 243
5 3601 1513
6 3783 716
7 4029 423
8 4166 21
9 4178 894
10 4785 31
11 5108 121
12 5245 2178
13 5637 387
14 5946 1484
15 5977 117
16 6139 664
Update: Here is the output of count(data$category) :
x freq
1 3289 198142
2 3401 97864
3 3482 38172
4 3502 59386
5 3601 391800
6 3783 201409
7 4029 111075
8 4166 6749
9 4178 239978
10 4785 6473
11 5108 32083
12 5245 590060
13 5637 98785
14 5946 401625
15 5977 28769
16 6139 183258
But when I try setting the probability I get the following error :
> catCount <- length(unique(data$category))
> probabilities <- rep(c(1/catCount),catCount)
> train_set <- data[sample(nrow(data),10000,prob=probabilities),]
Error in sample.int(x, size, replace, prob) :
incorrect number of probabilities
I understand that the sample function is randomly picking between the row number but I can't figure out how to associate that with the probability over the categories.
Question : How can I sample my data over an equal probability for the category variable?
Thanks in advance.
I guess you could do this with some simple base R operation, though you should remember that you are using probabilities here within sample, thus getting the exact amount per each combination won't work using this method, though you can get close enough for large enough sample.
Here's an example data
set.seed(123)
data <- data.frame(category = sample(rep(letters[1:10], seq(1000, 10000, by = 1000)), 55000))
Then
probs <- 1/prop.table(table(data$category)) # Calculating relative probabilities
data$probs <- probs[match(data$category, names(probs))] # Matching them to the correct rows
set.seed(123)
train_set <- data[sample(nrow(data), 1000, prob = data$probs), ] # Sampling
table(train_set$category) # Checking frequencies
# a b c d e f g h i j
# 94 103 96 107 105 99 100 96 107 93
Edit: So here's a possible data.table equivalent
library(data.table)
setDT(data)[, probs := .N, category][, probs := .N/probs]
train_set <- data[sample(.N, 1000, prob = probs)]
Edit #2: Here's a very nice solution using the dplyr package contributed by #Khashaa and #docendodiscimus
The nice thing about this solution is that it returns the exact sample size within each group
library(dplyr)
train_set <- data %>%
group_by(category) %>%
sample_n(1000)
Edit #3:
It seems that data.table equivalent to dplyr::sample_n would be
library(data.table)
train_set <- setDT(data)[data[, sample(.I, 1000), category]$V1]
Which will also return the exact sample size within each group

How to obtain a new table after filtering only one column in an existing table in R?

I have a data frame having 20 columns. I need to filter / remove noise from one column. After filtering using convolve function I get a new vector of values. Many values in the original column become NA due to filtering process. The problem is that I need the whole table (for later analysis) with only those rows where the filtered column has values but I can't bind the filtered column to original table as the number of rows for both are different. Let me illustrate using the 'age' column in 'Orange' data set in R:
> head(Orange)
Tree age circumference
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
Convolve filter used
smooth <- function (x, D, delta){
z <- exp(-abs(-D:D/delta))
r <- convolve (x, z, type='filter')/convolve(rep(1, length(x)),z,type='filter')
r <- head(tail(r, -D), -D)
r
}
Filtering the 'age' column
age2 <- smooth(Orange$age, 5,10)
data.frame(age2)
The number of rows for age column and age2 column are 35 and 15 respectively. The original dataset has 2 more columns and I like to work with them also. Now, I only need 15 rows of each column corresponding to the 15 rows of age2 column. The filter here removed first and last ten values from age column. How can I apply the filter in a way that I get truncated dataset with all columns and filtered rows?
You would need to figure out how the variables line up. If you can add NA's to age2 and then do Orange$age2 <- age2 followed by na.omit(Orange) you should have what you want. Or, equivalently, perhaps this is what you are looking for?
df <- tail(head(Orange, -10), -10) # chop off the first and last 10 observations
df$age2 <- age2
df
Tree age circumference age2
11 2 1004 156 915.1678
12 2 1231 172 876.1048
13 2 1372 203 841.3156
14 2 1582 203 911.0914
15 3 118 30 948.2045
16 3 484 51 1008.0198
17 3 664 75 955.0961
18 3 1004 108 915.1678
19 3 1231 115 876.1048
20 3 1372 139 841.3156
21 3 1582 140 911.0914
22 4 118 32 948.2045
23 4 484 62 1008.0198
24 4 664 112 955.0961
25 4 1004 167 915.1678
Edit: If you know the first and last x observations will be removed then the following works:
x <- 2
df <- tail(head(Orange, -x), -x) # chop off the first and last x observations
df$age2 <- age2

Count rows for selected column values and remove rows based on count in R

I am new to R and am trying to work on a data frame from a csv file (as seen from the code below). It has hospital data with 46 columns and 4706 rows (one of those columns being 'State'). I made a table showing counts of rows for each value in the State column. So in essence the table shows each state and the number of hospitals in that state. Now what I want to do is subset the data frame and create a new one without the entries for which the state has less than 20 hospitals.
How do I count the occurrences of values in the State column and then remove those that count up to less than 20? Maybe I am supposed to use the table() function, remove the undesired data and put that into a new data frame using something like lappy(), but I'm not sure due to my lack of experience in programming with R.
Any help will be much appreciated. I have seen other examples of removing rows that have certain column values in this site, but not one that does that based on the count of a particular column value.
> outcome <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
> hospital_nos <- table(outcome$State)
> hospital_nos
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI
17 98 77 77 341 72 32 8 6 180 132 1 19 109 30 179 124 118 96 114 68 45 37 134
MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX UT VA
133 108 83 54 112 36 90 26 65 40 28 185 170 126 59 175 51 12 63 48 116 370 42 87
VI VT WA WI WV WY
2 15 88 125 54 29
Here is one way to do it. Starting with the following data frame :
df <- data.frame(x=c(1:10), y=c("a","a","a","b","b","b","c","d","d","e"))
If you want to keep only the rows with more than 2 occurrences in df$y, you can do :
tab <- table(df$y)
df[df$y %in% names(tab)[tab>2],]
Which gives :
x y
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
6 6 b
And here is a one line solution with the plyr package :
ddply(df, "y", function(d) {if(nrow(d)>2) d else NULL})

Resources