Binning an unevenly distributed column in R

I have a column in R with an uneven distribution, similar to an exponential distribution. I want to normalize the data and then bin it into buckets.
I found the following links, which help with normalizing the data, but nothing on binning the data into different categories.
Normalizing data in R
Standardize data columns in R
Example of how an unevenly distributed column might look (the real data has many more rows):
dat <- data.frame(Id  = c(1, 2, 3, 4, 5, 6, 7, 8),
                  Qty = c(1, 1, 1, 2, 3, 13, 30, 45))
I want the column binned into 5 categories, which may look like:
dat <- data.frame(Id  = c(1, 2, 3, 4, 5, 6, 7, 8),
                  Qty = c(1, 1, 1, 2, 3, 13, 30, 45),
                  Binned_Category = c(1, 1, 1, 1, 2, 3, 4, 5))
The Binned_Category above is just a sample; the values may not look like this for the given data in the real world. I only wanted to show how I want the output to look.

This will help:
num_bins <- 5
# cut points at the quantiles; unique() guards against duplicated quantile values
findInterval(dat$Qty, unique(quantile(dat$Qty, prob = seq(0, 1, 1 / num_bins))))
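As a minimal sketch of how the pieces fit together (assuming the dat defined in the question), you can store the breaks and write the bucket numbers back into the data frame:
num_bins <- 5
# quantile-based cut points; unique() drops duplicates that appear when many rows share a value
breaks <- unique(quantile(dat$Qty, probs = seq(0, 1, 1 / num_bins)))
# assign each Qty to its quantile bucket (1 = lowest, num_bins = highest)
dat$Binned_Category <- findInterval(dat$Qty, breaks)
dat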

Related

Dealing with Missing Values for one Variable in R

I'm currently dealing with a data set that has missing values, but they are only missing for one single variable. I was trying to determine whether they are missing at random, so that I can simply remove them from the data frame. Hence, I am trying to find potential correlations between the NA's in the data frame and the values of the other variables. I found the following code online:
library("VIM")
data(sleep)
x <- as.data.frame(abs(is.na(sleep)))
head(sleep)
head(x)
y <- x[which(sapply(x, sd) > 0)]
cor(y)
However, this only shows how the missing values themselves are correlated, in case they are distributed across several variables.
Is there a way to find not the correlation between the missing values in a data frame, but the correlation between the missing values of one variable and the values of another variable? For example, if you have a survey which optionally asks for family income, how could you determine in R whether the missing values are correlated with, e.g., low income?
library(finalfit)
library(dplyr)
df <- data.frame(
  A = c(1, 2, 4, 5),
  B = c(55, 44, 3, 6),
  C = c(NA, 4, NA, 5)
)
# pairs plot comparing each variable's distribution between missing and observed values
df %>%
  missing_pairs("A", "C")

Subsetting dataframe rows based on decimals in R?

I am quite new to R and have quite a challenging question. I have a large data frame consisting of 110,000 rows representing high-resolution data from a sediment core. I would like to select multiple rows based on depth (which is recorded in mm to 3 decimal places). Of course, I do not have the time to go through the entire data frame and pick the rows that I need. I would like to be able to select rows based on the decimal part of the number rather than the integer part, i.e. subset to a data frame where all the .035 values would be returned. I have so far tried using the which() function but had no luck:
newdata <- Linescan_EN18218[which(Linescan_EN18218$Position.mm. == .035), ]
Can anyone offer any hints/suggestions on how I can solve this problem? Link to the first part of the dataframe csv
Welcome to Stack Overflow!
Can you please describe further what you mean by "had no luck"? Did you get an error message or an empty data.frame?
In principle, your method should work. I have replicated it with simulated data.
n <- 100
test <- data.frame(
  a = 1:n,
  b = rnorm(n = n),
  c = sample(c(0.1, 0.035, 0.0001), size = n, replace = TRUE)
)
newdata <- test[which(test$c == 0.035), ]
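If the goal is to match on the decimal part regardless of the integer part (e.g. both 12.035 and 147.035 should be returned), a sketch along these lines should work; rounding guards against floating-point representation issues, and Position.mm. is the column name used in the question.
# keep rows whose fractional part is .035, whatever the whole-number part is
frac <- Linescan_EN18218$Position.mm. %% 1
newdata <- Linescan_EN18218[which(round(frac, 3) == 0.035), ]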

How to take subsets of two different Data Frames for comparison - by a random sample?

I would like to compare two different data frames. Both data frames consist of an equal number of rows and columns. The first data frame (1) contains purchase probabilities from 0 to 1, whereas the second data frame (2) is coded binary and represents real purchases by a user.
My struggle now is how to take a RANDOM subset from df (1) that is ALSO THE SAME subset in df (2), so I can compare the two.
For example: how can I take a subset of 100 users (rows) and two products (columns) of df (1) which are the same as in df (2)? Is this possible, or do I have to re-manipulate my data frames first? In general it is possible to join both data frames by user_ID, if that is important.
# FIRST DF CONSISTS OF PROBABILITIES
df_probabilities <- data.frame(matrix(runif(20000,0,1), nrow=1000,ncol = 20))
# SECOND DF CONSISTS OF BINARY DATA
library("Matrix")
df_binary <- data.frame(as.matrix(rsparsematrix(1000, 20,nnz = 800, rand.x = runif)))
df_binary[df_binary > 0] = 1
Randomized index-based sampling should help:
set.seed(100)
# draw the SAME 100 user rows and 2 product columns for both data frames
user_index    <- sample(nrow(df_probabilities), 100)
product_index <- sample(ncol(df_probabilities), 2)
subs1 <- df_probabilities[user_index, product_index]
subs2 <- df_binary[user_index, product_index]
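As a hedged usage sketch (not part of the original answer): because both subsets share the same rows and columns, element-wise comparisons line up directly, for example the average predicted probability where a purchase did and did not happen:
mean(as.matrix(subs1)[as.matrix(subs2) == 1])   # mean probability for actual purchases
mean(as.matrix(subs1)[as.matrix(subs2) == 0])   # mean probability for non-purchases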

Random generation of numbers using R

I have some data that involves zebu (beef animals) that are labeled 1-40. I need to divide them into 4 groups of 10 each. I need to choose them randomly to remove any bias, and I need to use R and Excel. Thank you, please help.
There are ways of doing this that require less code, but here's a verbose example that lets me explain what's happening.
Here's the dataset I'll be using since I don't know exactly how your data look.
beef <- data.frame(number = 1:40,
                   weight = round(rnorm(40, mean = 2000, sd = 500)))
Because your animals are numbered from 1 to 40, you can create a new data frame that contains those numbers together with a randomly assigned group number (1 to 4) as the second column. Shuffling ten copies of each group label guarantees that every group ends up with exactly 10 animals.
num_group <- data.frame(
  number = 1:40,
  # ten copies of each group label, randomly shuffled across the 40 animals
  group  = sample(rep(1:4, each = 10))
)
Join the two dataframes together and you have your answer.
merge(beef, num_group)
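To double-check the group sizes (a small sanity check, not part of the original answer):
table(num_group$group)   # should show exactly 10 animals in each of the 4 groups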
To shuffle the data in Excel, follow this tip:
Create a new column in your data and fill it with RAND(). This generates a random number for each row; sort by that column and your data are shuffled.
Then load the data into R, take 10 rows at a time, and assign a group to each block.
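The same shuffle-and-slice idea can be written directly in R (a sketch of the approach described above, reusing the beef data frame from the other answer):
set.seed(1)
beef$rand  <- runif(nrow(beef))          # random number per row, like RAND() in Excel
beef       <- beef[order(beef$rand), ]   # sort by the random column to shuffle the rows
beef$group <- rep(1:4, each = 10)        # first 10 shuffled rows -> group 1, next 10 -> group 2, ...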

Mean of triplicate

I've just cleaned up a data frame that I scraped from an Excel spreadsheet by, amongst other things, removing percentage signs from some of the numbers; see Removing Percentages from a Data Frame.
The data have twenty-four rows representing the parameters and results from eight experiments done in triplicate, e.g. what one would get from:
DF1 <- data.frame(X = 1:24, Y = 2 * (1:24), Z = 3 * (1:24))
I want to find the mean of each of the triplicates (which, fortunately, are in sequential order) and create a new data frame with eight rows and the same number of columns.
I tried to do this using
DF2 <- data.frame(replicate(3,sapply(DF1, mean)))
which gave me the mean of each whole column, repeated three times. What I wanted is a data frame like
data.frame(X = c(2,5,8,11,14,17,20,23), Y = c(4,10,16,22,28,34,40,46), Z = c(6,15,24,33,42,51,60,69))
which I worked out by hand; it's supposed to be the reduced result.
Thanks, ...
Any help would be gratefully received.
Nice task for codegolf!
aggregate(DF1, list(rep(1:8, each=3)), mean)[,-1]
To be more general, you should replace 8 with nrow(DF1) / 3 (the number of triplicate groups).
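For example, a generalized version (a sketch assuming consecutive groups of 3 rows) would be:
n_rep  <- 3                                              # rows per experiment (triplicates)
groups <- rep(seq_len(nrow(DF1) / n_rep), each = n_rep)
aggregate(DF1, list(groups), mean)[, -1]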
... or, my favorite, using matrix multiplication:
# diag(8)[rep(1:8, each = 3), ] is a 24 x 8 indicator matrix that sums each triplicate
t(t(DF1) %*% diag(8)[rep(1:8, each = 3), ] / 3)
This works:
foo <- matrix(unlist(by(data = DF1, INDICES = rep(1:8, each = 3), FUN = colMeans)),
              nrow = 8, byrow = TRUE)
colnames(foo) <- colnames(DF1)
Look at ?by.
