Random selection based on a variable in an R data frame [duplicate] - r

This question already has answers here:
Take the subsets of a data.frame with the same feature and select a single row from each subset
(3 answers)
Closed 7 years ago.
I have a data frame with 1000 rows. It is a dataset of animals from different breeds, but I have more animals from some breeds than from others. What I want to do is select a random sample from the over-represented breeds so that all breeds end up with the same number of observations.
In detail: I have 400 Holstein animals, 300 Jersey, 100 Hereford, 150 Nelore, and 50 Canchim. I want to randomly select 50 animals from each breed, so I would have a total of 250 animals at the end. I know how to make random draws using runif, but I am not sure how to apply that in my case.
My data looks like:
Breed    ID Trait1 Trait2 Trait3
Holstein  1     11     22     44
Jersey    2     22     33     55
Nelore    3     33     44     66
Nelore    4     44     55     77
Canchim   5     55     66     88
I have tried:
Data = data[!!ave(seq_along(data$Breed), unique(data$Breed), FUN=function(x) sample(x, 50) == x),]
However, it does not work, and I am not allowed to install the dplyr package on the server that I am using.
Thanks in advance.

You can split your animals data frame on the breed, apply a function to each chunk that randomly extracts 50 rows, and then bind the chunks back together:
animals.split <- split(animals, animals$Breed)
animals.list <- lapply(animals.split, function(x) {
  x[sample(nrow(x), 50), ]
})
result <- do.call(rbind, animals.list)
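Note that sample(nrow(x), 50) throws an error if a chunk has fewer than 50 rows, so capping the draw at the group size keeps the code safe for smaller breeds. A minimal variant of the same idea:
animals.list <- lapply(split(animals, animals$Breed), function(x) {
  # draw at most 50 rows; breeds with fewer than 50 animals are kept whole
  x[sample(nrow(x), min(nrow(x), 50)), ]
})
result <- do.call(rbind, animals.list)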

Related

How to fill a new data frame based on a value and the result of a calculation integrating this very value?

For graphical purposes, I want to create a new data frame with two columns.
The first column is the dose of the treatment received (i; 10 grammes up to 200 grammes).
The second column must be filled with the result of a calculation corresponding to the dose received, i.e. the percentage of patients developing the disease at the corresponding dose, which is given by the formula below.
The dose is extracted from a much larger dataset (data_fcpa) of more than 1,000 rows (patients).
percent_i <- round(prop.table(table(data_fcpa$n_chir_act[data_fcpa$cyproterone_dose > i] > 1))[2] * 100, 1)
I know how to create a new data frame (df) with the doses I want to explore:
df <- data.frame(dose = seq(10, 200, by = 10))
names(df) <- c("cpa_dose")
> df
cpa_dose
1 10
2 20
3 30
4 40
5 50
6 60
7 70
8 80
9 90
10 100
11 110
12 120
13 130
14 140
15 150
16 160
17 170
18 180
19 190
20 200
For example, for a dose of 10 grammes the result is:
> round(prop.table(table(data_fcpa$n_chir_act[data_fcpa$cyproterone_dose > 10] > 1))[2] * 100, 1)
TRUE
11.7
I suspect that a loop is needed to produce output like the small example provided below, but I have no idea how to do it.
cpa_dose percentage
1 10 11.7
2 20
3 30
4 40
Any suggestions are welcome.
Thank you in advance for your help.
It seems that you are describing a situation where you want to show predicted effects from a statistical model. In that case, ggeffects is your best friend.
library(tidyverse)
library(ggeffects)
lm(mpg ~ hp, mtcars) %>%
ggpredict() %>%
as_tibble()
By the way, in order to answer your question properly, you should provide some data and show what you have tried.
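For the literal loop the question asks about, sapply() over the dose column also works. A minimal sketch, assuming data_fcpa really has the n_chir_act and cyproterone_dose columns used in the question:
df$percentage <- sapply(df$cpa_dose, function(i) {
  # % of patients with more than one event among those above dose i
  round(prop.table(table(data_fcpa$n_chir_act[data_fcpa$cyproterone_dose > i] > 1))[2] * 100, 1)
})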

R lapply(): Change all columns within all data frames in a list to numeric, then convert all values to percentages

Question:
I am a little stumped as to how I can batch process as.numeric() (or any other function for that matter) for columns in a list of data frames.
I understand that I can view specific data frames or columns within this list by using:
> my.list[[1]]
# or columns within this data frame using:
> my.list[[1]][1]
But my trouble comes when I try to apply this within an lapply() function to change all of the data from integer to numeric.
# Example of what I am trying to do (pseudocode)
> my.list[[each data frame in list]][each column in data frame] <-
    as.numeric(my.list[[each data frame in list]][each column in data frame])
If you can assist me in any way, or know of any resources that can help me out I would appreciate it.
Background:
My data frames are structured as in the example below, where I have 5 habitat types and information on how much area an individual species' home range extends into each habitat:
# Example data
spp.1.data <- data.frame(Habitat.A = c(100, 45, 0, 9, 0),
                         Habitat.B = c(0, 0, 203, 45, 89),
                         Habitat.C = c(80, 22, 8, 9, 20),
                         Habitat.D = c(8, 59, 77, 83, 69),
                         Habitat.E = c(23, 15, 99, 0, 10))
I have multiple data frames with the above structure which I have assigned to a list object:
all.spp.data <- list(spp.1.data, spp.2.data, ..., spp.n.data)
I am then trying to coerce every column in each data frame to numeric with as.numeric() so I can create data frames of % habitat use, i.e.:
# data, which is now numeric as per Phil's code ;)
data.numeric <- lapply(data, function(x) {
  x[] <- lapply(x, as.numeric)
  x
})
> head(data.numeric[[1]])
Habitat.A Habitat.B Habitat.C Habitat.D Habitat.E
1 100 0 80 8 23
2 45 0 22 59 15
3 0 203 8 77 99
4 9 45 9 83 0
5 0 89 20 69 10
EDIT: I would like to sum every row in all data frames.
# Add row at the end of each data frame populated by rowSums()
f <- function(i) {
  data.numeric[[i]]$Sums <- rowSums(data.numeric[[i]])
  data.numeric[[i]]
}
data.numeric.SUM <- lapply(seq_along(data.numeric), f)
head(data.numeric.SUM[[1]])
Habitat.A Habitat.B Habitat.C Habitat.D Habitat.E Sums
1 100 0 80 8 23 211
2 45 0 22 59 15 141
3 0 203 8 77 99 387
4 9 45 9 83 0 146
5 0 89 20 69 10 188
EDIT: This is the code I used to convert the values within the data frames to % habitat use:
# Used Phil's logic to convert all numbers to percentages
data.numeric.SUM.perc <- lapply(data.numeric.SUM, function(x) {
  x[] <- (x[] / x[, 6]) * 100
  x
})
> head(data.numeric.SUM.perc[[1]])
  Habitat.A Habitat.B Habitat.C Habitat.D Habitat.E Sums
1        47         0        38         4        11  100
2        32         0        16        42        11  100
3         0        52         2        20        26  100
4         6        31         6        57         0  100
5         0        47        11        37         5  100
(values rounded to whole percentages)
This is still not the most condensed way to do this, but it did the trick for me.
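For reference, prop.table() can do the row-wise division in one step, skipping the intermediate Sums column entirely. A minimal sketch, assuming each data frame holds only numeric habitat columns:
data.perc <- lapply(data.numeric, function(x) {
  # divide each row by its row total and express as percentages
  as.data.frame(prop.table(as.matrix(x), margin = 1) * 100)
})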
Thank you, Phil, Val and Leo P, for helping with this problem.
I'd do this a bit more explicitly:
all.spp.data <- lapply(all.spp.data, function(x) {
  x[] <- lapply(x, as.numeric)
  x
})
As a personal preference, this clearly conveys to me that I'm looping over each column in a data frame, and looping over each data frame in a list.
If you really want to do it all with lapply, here's a way to go:
lapply(all.spp.data, function(x) do.call(cbind, lapply(seq_len(ncol(x)), function(y) as.numeric(x[, y]))))
This uses a nested lapply call. The outer one binds each single data.frame to x. The inner one binds each column index of x to y, so every column can be referenced as x[, y].
Since this splits everything up into single vectors, do.call(cbind, ...) is needed to bring the result back to a matrix. If you prefer, you can wrap it in as.data.frame() to get back the original type.
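One wrinkle: do.call(cbind, ...) on bare numeric vectors drops the column names. A small sketch under the same assumptions that restores them (the all.spp.numeric name is arbitrary):
all.spp.numeric <- lapply(all.spp.data, function(x) {
  m <- do.call(cbind, lapply(seq_len(ncol(x)), function(y) as.numeric(x[, y])))
  colnames(m) <- names(x)  # cbind() on unnamed vectors loses the names
  as.data.frame(m)
})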

aggregate over multiple columns [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 6 years ago.
Hey, I have some data that looks like this:
ExpNum Compound Peak Tau SS
1 a 100 30 50
2 a 145 23 45
3 b 78 45 56
4 b 45 43 23
5 c 344 23 56
I'd like to find the mean based on the Compound name.
What I have:
Norm_Table$Norm_Peak = aggregate(data[[3]], by = list(Compound), FUN = normalization)
This is fine, but I have this code repeated 3 times, changing only the data[[x]] index. Would lapply work here? Or a for loop?
A dplyr solution:
library(dplyr)
data %>%
  group_by(Compound) %>%
  summarize_each(funs(mean), -ExpNum)
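A base R equivalent, for comparison, uses aggregate()'s formula interface (assuming the data frame is named data, as above):
# mean of every column except ExpNum, grouped by Compound
aggregate(cbind(Peak, Tau, SS) ~ Compound, data = data, FUN = mean)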

How to find the group of rows of a data frame where an error occurs

I have a two-column data frame containing thousands of IDs, where each ID has hundreds of data rows; in other words, a data frame of about 6 million rows. I am grouping this data frame by ID (using either dplyr or data.table) and performing a "tso" (outlier detection) function on each group. The problem is that after hours of computation it returns an error related to the ARIMA specification of one of the IDs. The question is: how can I identify the ID (or the row number) where my function throws the error? (If I can detect it, I can remove that ID from the data frame.)
I tried to perform my function manually on subgroups of this data frame, but I cannot reach the erroneous ID that way because there are thousands of IDs, so it would take me weeks to find it.
library(tsoutliers)  # provides tso()
library(data.table)

outlier.detection <- function(x, iter) {
  y <- as.ts(x)
  out2 <- tso(y, maxit.iloop = iter, tsmethod = "auto.arima",
              remove.method = "bottom-up", cval = 3)
  y[out2$outliers$ind] <- NA
  return(y)
}

df <- data.table(outlying1)
setkey(df, id)
test <- df[, list(new.weight = outlier.detection(weight, iter = 1)), by = id]
The above function finds the anomalies and replaces them with NAs. Here is an example:
ID weight
1 a 50
2 a 50
3 a 51
4 a 51.5
5 a 52
6 b 80
7 b 81
8 b 81.5
9 b 90
10 b 82
After processing, it will look like the following:
ID weight
1 a 50
2 a 50
3 a 51
4 a 51.5
5 a 52
6 b 80
7 b 81
8 b 81.5
9 b NA
10 b 82
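One way to pinpoint the failing group is to wrap the call in tryCatch() so the grouped computation reports the offending ID instead of aborting. A sketch built on the question's own objects; the safe.detection wrapper name is made up here:
safe.detection <- function(x, id, iter) {
  tryCatch(outlier.detection(x, iter),
           error = function(e) {
             # report which group broke, then keep going with NAs
             message("tso failed for ID ", id, ": ", conditionMessage(e))
             rep(NA_real_, length(x))
           })
}
test <- df[, list(new.weight = safe.detection(weight, .BY$id, iter = 1)), by = id]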

Avoid using a loop to get sum of rows in R, where I want to start and stop the sum on different columns for each row

I am relatively new to R, coming from Stata. I have a data frame with 100+ columns and thousands of rows. Each row has a start value, a stop value, and 100+ columns of numerical values. The goal is to get, for each row, the sum of the values from the column corresponding to the start value through the column corresponding to the stop value. This is straightforward to do in a loop, which looks like this (the data frame is df, the start column is start, the stop column is stop):
for (i in 1:nrow(df)) {
  df$out[i] <- rowSums(df[i, df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?
You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
take <- outer(seq_len(ncol(DF) - 2) + 2, DF$start - 1, ">") &
        outer(seq_len(ncol(DF) - 2) + 2, DF$end + 1, "<")
diag(as.matrix(DF[, -(1:2)]) %*% take)
#[1] 7 19 31 43 55
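Note that diag(A %*% take) materializes a full n-by-n product just to read off its diagonal. Multiplying elementwise by the transposed mask and taking rowSums() gives the same numbers with far less memory, reusing the objects above:
# same result as the diag() line, without the full matrix product
rowSums(as.matrix(DF[, -(1:2)]) * t(take))
#[1]  7 19 31 43 55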
If you are dealing with values of all the same types, you typically want to do things in matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=T)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=T))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to apply a function to each row, here the
# function sums each row excluding the first two values
# from the value in the start column to the value in the
# end column
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data frame to a matrix with as.matrix. You can also run apply directly on your data frame as shown, which should still be reasonably fast. The real problem with your code is that you are modifying a data frame nrow times, and modifying data frames is very slow. By using apply you get around that by generating your answer (the $out column) in one go, which you can then cbind back to your data frame (so you modify the data frame just once).
