Random generation of numbers using R

I have some data on zebu (beef animals) that are labeled 1-40. I need to divide them into 4 groups of 10 each, chosen randomly to remove any bias, and I need to use R and Excel. Thank you, please help.

There are ways of doing this that require less code, but here's a verbose example that lets me explain what's happening.
Here's the dataset I'll be using, since I don't know exactly what your data look like.
beef <-
data.frame(number = 1:40, weight = round(rnorm(40, mean = 2000, sd = 500)))
Because your animals are numbered from 1 to 40, you can create a new dataframe that contains those numbers with a random group number (1 to 4) as the second column.
# Shuffle a vector that holds each group number exactly 10 times,
# so every group ends up with 10 animals
num_group <- data.frame(
  number = 1:40,
  group = sample(rep(1:4, each = 10))
)
Join the two dataframes together and you have your answer.
merge(beef, num_group)
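The less verbose route mentioned at the start could look like the sketch below. It shuffles the animal numbers 1 to 40 once and cuts the shuffled vector into four blocks of ten, so every group has exactly ten animals. The seed value and the name `groups` are illustrative assumptions, not part of the original answer.

```r
# Fix the seed only so the allocation can be reproduced; any value works
set.seed(42)

# Shuffle the 40 animal numbers, then cut the shuffled vector into 4 blocks of 10
groups <- split(sample(1:40), rep(1:4, each = 10))

# Each list element holds the animal numbers of one group
lengths(groups)
```

Each element of `groups` can then be matched back to the `number` column of the data.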

To shuffle the data in Excel, follow this tip:
Create a new column in your data and fill it with =RAND().
This generates a random number in every row; sort the data by that column and the rows are shuffled.
Then load the data into R, take 10 rows at a time, and assign a group to each block.
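The last step, done in R, might look like the sketch below. It assumes the sheet has already been sorted by the RAND() column and read into a data frame called `shuffled`; that name and the `number` column are assumptions, and the shuffled order is simulated here with `sample()` because the original post does not show the file.

```r
# Stand-in for the sheet after sorting by the RAND() column in Excel;
# in practice this would come from reading the exported file
set.seed(1)
shuffled <- data.frame(number = sample(1:40))

# Rows 1-10 become group 1, rows 11-20 group 2, and so on
shuffled$group <- rep(1:4, each = 10)

table(shuffled$group)
```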

Related

Dividing one dataframe into many with names in R

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be a data type that the split function can convert to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
A = sample(letters, size = 70000000, replace = TRUE),
B = rpois(70000000, lambda = 1)
)
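Because the question mentions splitting sequentially by rows, a variant that keeps neighbouring rows together may also be useful. The sketch below works on a small stand-in data frame; `small`, `n_parts`, and `part_id` are illustrative names, and the same idea should carry over to `dat`.

```r
# Small stand-in data frame so the example runs quickly
small <- data.frame(A = letters[1:10], B = 1:10)

n_parts <- 3

# Contiguous block index: rows 1-4 -> 1, rows 5-8 -> 2, rows 9-10 -> 3
# (the last part may be shorter than the others)
part_id <- ceiling(seq_len(nrow(small)) / ceiling(nrow(small) / n_parts))

parts <- split(small, part_id)
names(parts) <- paste0("newdf_", seq_along(parts))

# Work on parts$newdf_1, parts$newdf_2, ... then stack them back together
recombined <- do.call(rbind, unname(parts))
```

Keeping the pieces in one named list, rather than as separate objects, makes it easy to loop over them and to recombine them in a single call.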
Here's a tidyverse based solution. Try using read_csv_chunked().
# practice data
library(tidyverse)
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
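A wrapper along those lines might look like the following sketch. The function name `read_matching` and its arguments are hypothetical, and the column names `string` and `value` follow the practice data above. The file is written to a temporary path so the example is self-contained.

```r
library(readr)

# Small practice file with the same two columns as above
path <- tempfile(fileext = ".csv")
write_csv(data.frame(string = rep(c("a", "b", "c"), times = 100),
                     value  = seq_len(300)),
          path)

# Hypothetical wrapper: read the file chunk by chunk and keep only the rows
# whose `string` column equals `target`
read_matching <- function(path, target, chunk_size = 50) {
  keep_rows <- function(x, pos) x[x$string == target, ]
  read_csv_chunked(path,
                   DataFrameCallback$new(keep_rows),
                   chunk_size = chunk_size,
                   col_types = cols(string = col_character(),
                                    value  = col_double()))
}

only_b <- read_matching(path, "b")
```

Only the matching rows of each chunk are kept in memory, which is the point of reading in chunks.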
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?

Subsetting dataframe rows based on decimals in R?

I am quite new to R and have quite a challenging question. I have a large dataframe of 110,000 rows representing high-resolution data from a sediment core. I would like to select multiple rows based on depth (which is recorded in mm to 3 decimal places). Of course, I do not have the time to go through the entire dataframe and pick the rows that I need. I would like to select rows based on the decimal part of the number rather than the leading digits, i.e. subset to a dataframe in which all the .035 values are returned. So far I have tried the which() function but had no luck
newdata <- Linescan_EN18218[which(Linescan_EN18218$Position.mm.== .035),]
Can anyone offer any hints/suggestions on how I can solve this problem? Link to the first part of the dataframe csv
Welcome to Stack Overflow.
Can you please describe further what you mean by "had no luck"? Did you get an error message or an empty data.frame?
In principle, your method should work. I have replicated it with simulated data.
n = 100
test <- data.frame(
a = 1:n,
b = rnorm(n = n),
c = sample(c(0.1, 0.035, 0.0001), size = n, replace = TRUE)
)
newdata <- test[which(test$c == 0.035),]
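If the goal is to match on the fractional part only (for example 1.035, 2.035, 12.035), comparing the whole value with `== .035` will not find those rows. One possible approach, offered as a sketch rather than as the method of the answer above, is to convert the depth to whole thousandths and compare the remainder after division by 1000, which also sidesteps floating-point equality problems. The data frame `core` is made up for illustration; only the column name `Position.mm.` comes from the question.

```r
core <- data.frame(
  Position.mm. = c(1.035, 2.100, 12.035, 7.350, 0.035),
  value        = c(10, 20, 30, 40, 50)
)

# Depth in whole thousandths of a millimetre (e.g. 12.035 -> 12035)
thousandths <- round(core$Position.mm. * 1000)

# Keep rows whose three decimal digits are 035
newdata <- core[thousandths %% 1000 == 35, ]
```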

Binning an unevenly distributed column in R

I have a column in R with an uneven distribution, similar to an exponential distribution. I want to normalize the data and then bin it into successive buckets.
I saw the following links, which help with normalizing the data but say nothing about binning it into categories.
Normalizing data in R
Standardize data columns in R
Example of how an unevenly distributed column would look, but with a lot more rows:
dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
Qty = c(1,1,1,2,3,13,30,45))
I want the column binned into 5 categories, which may look like:
dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
Qty = c(1,1,1,2,3,13,30,45),
Binned_Category = c(1,1,1,1,2,3,4,5))
The Binned_Category above is only a sample; the real values for the given data may differ. I just wanted to show what the output should look like.
This will help:
num_bins <- 5
findInterval(dat$Qty, unique(quantile(dat$Qty, probs = seq(0, 1, 1/num_bins))))
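Applied to the example data from the question, this could be used as in the sketch below, which stores the result in a new column. Because the break points come from quantiles, skewed values are spread across bins by rank rather than by magnitude. The name `breaks` is illustrative, the column name `Binned_Category` is taken from the question, and the exact bins depend on the data.

```r
dat <- data.frame(Id = 1:8, Qty = c(1, 1, 1, 2, 3, 13, 30, 45))
num_bins <- 5

# Quantile break points; unique() drops duplicates caused by tied values
breaks <- unique(quantile(dat$Qty, probs = seq(0, 1, 1 / num_bins)))

# Index of the interval that each Qty falls into
dat$Binned_Category <- findInterval(dat$Qty, breaks)

table(dat$Binned_Category)
```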

Summing rows grouped by another parameter in R

I am trying to calculate some rates for time on condition parameters, and have written the following, which successfully calculates the desired rates. But, I'm sure there must be a more succinct way to do this using the data.table methods. Any suggestions?
Background on what I'm trying to achieve with the code.
For each run number there are 10 record numbers. Each record number refers to a value bin (the full range of values for each parameter is split into 10 equal sized bins). The values are counts of time spent in each bin. I am trying to sum the counts for P1 over each run number (calling this opHours for the run number). I then want to divide each of the bin counts by the opHours to show the proportion of each run that is spent in each bin.
library(data.table)
#### Create dummy parameter values
P1 <- rnorm(2000,400, 50);
Date <- seq(from=as.Date("2010/1/1"), by = "day", length.out = length(P1));
RECORD_NUMBER <- rep(1:10, 200);
RUN_NUMBER <- rep(1:200, each=10, len = 2000);
#### Combine the dummy parameters into a dataframe
data <- data.frame(Date, RECORD_NUMBER, RUN_NUMBER, P1);
#### Calculating operating hours for each run
setDT(data);
running_hours_table <- data[ , .(opHours = sum(P1)), by = .(RUN_NUMBER)];
#### Set the join keys for the data and running_hours tables
setkey(data, RUN_NUMBER);
setkey(running_hours_table, RUN_NUMBER);
#### Join the operating hours onto the data by RUN_NUMBER
data <- data[running_hours_table];
data$P1.countRate <- (data$P1 / data$opHours)
Is it possible to generate the opHours column in the data table without first creating a separate table and then joining them back together?
data[ , opHours := sum(P1), by = .(RUN_NUMBER)]
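Building on that, the proportion itself can be added in the same grouped style, without a separate table or join. The following is a sketch on a small made-up table; the object name `runs` is an assumption, while the column names follow the question.

```r
library(data.table)

set.seed(1)
runs <- data.table(
  RUN_NUMBER    = rep(1:3, each = 10),
  RECORD_NUMBER = rep(1:10, times = 3),
  P1            = rnorm(30, mean = 400, sd = 50)
)

# Total per run, added by reference as a new column
runs[, opHours := sum(P1), by = RUN_NUMBER]

# Share of each run spent in each bin, computed within each run
runs[, P1.countRate := P1 / sum(P1), by = RUN_NUMBER]
```

Within each run the `P1.countRate` values sum to 1, which is a quick sanity check on the grouping.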
You should probably read some materials about data.table:
wiki Getting-started
or
data.table.cheat.sheet

Random Assignment of Groups

How do I randomly assign a group of people into four treatment groups and a control group, given that I have a list of their names on an excel document?
Get the randomizr package
install.packages("randomizr")
library(randomizr)
Use complete random assignment, which holds the number of units assigned to each condition fixed across randomizations (unlike sample with replace = TRUE):
Z <- complete_ra(N = 100, num_arms = 5)
table(Z)
If you have 100 names (number them as such) then you can assign them to one of 5 groups with
split(1:100, sample(1:5, 100, replace = TRUE))
split(x, f) splits x into groups according to f, for which I've used sample to sample 100 occurrences of the numbers 1 to 5 (with replacement).
Take these numbered names from your list.
(Note: you didn't specify equal groups).
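If equal group sizes are wanted, one base R option is to shuffle a vector that contains each label the same number of times. The sketch below assumes 100 people and made-up labels (`control`, `treatment_1` to `treatment_4`); in practice the `name` column would be read from the Excel file.

```r
set.seed(2024)

# Stand-in for the list of names from the spreadsheet
people <- data.frame(name = paste0("person_", 1:100))
labels <- c("control", paste0("treatment_", 1:4))

# 20 copies of each label, shuffled, so every arm gets exactly 20 people
people$arm <- sample(rep(labels, length.out = nrow(people)))

table(people$arm)
```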
Alternatively, the caret package can handle this quite nicely for you: https://topepo.github.io/caret/data-splitting.html
