I have a set of levels in R that I generate with cut, e.g. fractional values between 0 and 1, broken into 0.1-wide bins:
> frac <- cut(c(0, 1), breaks=10)
> levels(frac)
[1] "(-0.001,0.1]" "(0.1,0.2]" "(0.2,0.3]" "(0.3,0.4]" "(0.4,0.5]"
[6] "(0.5,0.6]" "(0.6,0.7]" "(0.7,0.8]" "(0.8,0.9]" "(0.9,1]"
Given a vector v containing continuous values between [0.0, 1.0], how do I count the frequency of elements in v that fall within each level in levels(frac)?
I may customize the number of breaks and/or the interval from which I make the levels, so I'm looking for a way to do this with standard R commands that lets me build a two-column data frame: one column for the levels as factors, and a second column for the fraction (or percentage) of the total elements of v that fall within each level.
Note: The following does not work:
> table(frac)
frac
(-0.001,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
1 0 0 0 0 0
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
0 0 0 1
If I use cut on v directly, then I do not get the same levels when I run cut on different vectors, because the range of values (their minimum and maximum) differs between arbitrary vectors; so while I may have the same number of breaks, the level intervals will not be the same.
My goal is to take different vectors and bin them to the same set of levels. Hopefully this helps clarify my question. Thanks for any assistance.
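To illustrate the problem with two made-up vectors (example values assumed), cut derives the intervals from each vector's own range:
v1 <- c(0.05, 0.50, 0.95)
v2 <- c(0.20, 0.40, 0.60)
levels(cut(v1, breaks = 10))  # intervals spanning roughly 0.05 to 0.95
levels(cut(v2, breaks = 10))  # different intervals, spanning roughly 0.2 to 0.6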
Amend frac to actually represent your desired intervals, and then use the table function:
x = runif(100) # For example.
frac = cut(x, breaks = seq(0, 1, 0.1))
table(frac)
Result:
frac
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8]
14 9 8 10 8 12 7 7
(0.8,0.9] (0.9,1]
16 9
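To get the two-column data frame the question asks for (each level plus the fraction of elements falling in it), one option is to wrap the same table in prop.table(); a minimal sketch building on the code above:
# Each level and the fraction of x it contains
data.frame(level = levels(frac),
           fraction = as.numeric(prop.table(table(frac))))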
Introduce the extremes c(0, 1) into v, apply the same cut, then drop those two helper values:
library(dplyr)
#dummy data
set.seed(1)
v <- round(runif(7), 2)
#result
data.frame(v,
           vFrac = cut(c(0, 1, v), breaks = 10)[-c(1, 2)]) %>%
  group_by(vFrac) %>%
  mutate(vFreq = n())
# Source: local data frame [7 x 3]
# Groups: vFrac [6]
#
# v vFrac vFreq
# <dbl> <fctr> <int>
# 1 0.27 (0.2,0.3] 1
# 2 0.37 (0.3,0.4] 1
# 3 0.57 (0.5,0.6] 1
# 4 0.91 (0.9,1] 2
# 5 0.20 (0.1,0.2] 1
# 6 0.90 (0.8,0.9] 1
# 7 0.94 (0.9,1] 2
Alternatively, hist() can do the binning and counting without plotting (v as in the question):
frac = seq(0, 1, by = 0.1)
ranges = paste(head(frac, -1), frac[-1], sep = " - ")
# plot = FALSE returns the counts without drawing; include.lowest puts values equal to 0 in the first bin
freq = hist(v, breaks = frac, include.lowest = TRUE, plot = FALSE)
data.frame(range = ranges, frequency = freq$counts)
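To get the fractional column the question asks for, one could also divide the counts by the total (a small sketch reusing ranges and freq from above):
data.frame(range = ranges, fraction = freq$counts / sum(freq$counts))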
Use findInterval instead of cut:
library(plyr)
v <- data.frame(v = runif(100, 0, 1))
v$x <- findInterval(v$v, seq(0, 1, by = 0.1)) * 0.1
ddply(v, .(x), summarize, n = length(x))
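If you'd rather avoid plyr, the same per-bin count is available in base R (a sketch using the v built above):
# Count rows per bin with base table(); responseName labels the count column
as.data.frame(table(x = v$x), responseName = "n")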
frac = seq(0, 1, 0.1)
set.seed(42); v = rnorm(10, 0.5, 0.2)
# Count the values of v falling in each (frac[i], frac[i+1]] interval
sapply(1:(length(frac) - 1), function(i) sum(frac[i] < v & frac[i+1] >= v))
#[1] 0 0 0 1 3 2 1 1 1 1
I want to add a percentage density column next to the frequency column in the data frame below, as well as sum values for the frequency and percentage density columns. The percentage density column shows the weight of each bin as a fraction of the total: if there are 10 values in total and the frequency for one bin is 3, then its percentage density is 3/10 = 0.3. The percentage density column should sum to 1.0.
data <- c(3.968, 3.534, 4.032, 3.912, 3.572, 4.014, 3.682, 3.608, 3.669, 3.705,
4.023, 3.588, 3.945, 3.871, 3.744, 3.711, 3.645, 3.977, 3.888, 3.948)
sortdata <- sort(data)
as.data.frame(table(cut(sortdata,breaks=seq(3.50,4.15,by=0.05))))
Try the prop.table() function. You can wrap it around a table and then combine it with your frequency counts as such:
# Proportion Of The Total For Each Cut
prop.table(table(cut(sortdata,breaks=seq(3.50,4.15,by=0.05))))
# Data Frame With Frequencies And Proportions Combined
data.df <- as.data.frame(
  cbind(
    table(cut(sortdata, breaks = seq(3.50, 4.15, by = 0.05))),
    prop.table(table(cut(sortdata, breaks = seq(3.50, 4.15, by = 0.05))))
  )
)
names(data.df) <- c("Freq","Pct")
data.df
# Check Sum Of Pct Equals 1
sum(data.df$Pct) == 1
Piping Freq into proportions():
res <- as.data.frame(table(cut(sortdata, breaks=seq(3.50, 4.15, by=0.05)))) |>
transform(Dens=proportions(Freq))
res
# Var1 Freq Dens
# 1 (3.5,3.55] 1 0.05
# 2 (3.55,3.6] 2 0.10
# 3 (3.6,3.65] 2 0.10
# 4 (3.65,3.7] 2 0.10
# 5 (3.7,3.75] 3 0.15
# 6 (3.75,3.8] 0 0.00
# 7 (3.8,3.85] 0 0.00
# 8 (3.85,3.9] 2 0.10
# 9 (3.9,3.95] 3 0.15
# 10 (3.95,4] 2 0.10
# 11 (4,4.05] 3 0.15
# 12 (4.05,4.1] 0 0.00
# 13 (4.1,4.15] 0 0.00
## check
res$Dens |> sum()
# [1] 1
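The question also asks for sum values under each column; one way to append a totals row (a sketch reusing res; rbind() for data frames expands the factor levels in Var1):
rbind(res, data.frame(Var1 = "Total", Freq = sum(res$Freq), Dens = sum(res$Dens)))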
I'm back to using R after using SAS for a few years, and I'm relearning everything again.
I have a dataset with the variable Lot_Size, which contains continuous data from 0.1980028 to 1.2000000 acres. I'd like to categorize this variable based on these demarcations:
0 - 1/3 acre = 0
1/3 - 2/3 acre = 1
2/3 - 1 acre = 2
1+ acre = 3
Into a new variable LS_cat.
I've explored the mutate command, but I keep getting errors. Does anyone have any ideas?
UPDATE
Thanks for responding - both solutions worked perfectly. Since this was a learning experience for me, I'll add to the question.
I actually misunderstood the question posed to me. If I were to make dummy variables for each category noted previously, how would I do that? For example, if Lot_Size is 0 - 1/3 of an acre, I want the variable ls_1_3 to be 1; if it's not, I'd like it to be 0. Would I use the ifelse command?
Use case_when().
library(tidyverse)
set.seed(123)
my_df <- tibble(
lot_size = runif(n = 10, min = 0.1980028, max = 1.2)
)
my_df |> mutate(
ls_cat = case_when(lot_size < 1 / 3 ~ 0,
lot_size < 2 / 3 ~ 1,
lot_size < 1 ~ 2,
TRUE ~ 3)
)
#> A tibble: 10 x 2
#> lot_size ls_cat
#> <dbl> <dbl>
#> 1 0.486 1
#> 2 0.988 2
#> 3 0.608 1
#> 4 1.08 3
#> 5 1.14 3
#> 6 0.244 0
#> 7 0.727 2
#> 8 1.09 3
#> 9 0.751 2
#> 10 0.656 1
case_when() is usually a sound solution when there are more than two options (if_else() when there are just two), but in this case there's a simpler math(s) solution.
my_df <- tibble(lot_size = seq(0, 1.2, by = 0.1))
my_df$ls_cat <- ceiling((my_df$lot_size*3)-0.99)
Though, this may be less instructive on R programming.
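A quick spot-check (example values assumed) that the arithmetic agrees with the explicit thresholds:
lot <- c(0.10, 0.40, 0.70, 0.90, 1.10)
ceiling((lot * 3) - 0.99)
# [1] 0 1 2 2 3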
For your follow-on question, ifelse() works well, e.g.
Base:
my_df$ls_1_3 <- ifelse(my_df$lot_size < 1/3, 1, 0)
Or Tidyverse:
my_df <- my_df %>%
mutate(ls_1_3 = if_else(lot_size < 1/3, 1, 0))
NB: if_else() is a more pedantic version of ifelse(). Both should work equally well here, but if_else() is better for catching possible errors.
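If you need one dummy variable per category rather than just ls_1_3, a sketch extending the same pattern (the other column names are made up):
my_df <- my_df %>%
  mutate(ls_1_3   = if_else(lot_size < 1/3, 1, 0),
         ls_2_3   = if_else(lot_size >= 1/3 & lot_size < 2/3, 1, 0),
         ls_1     = if_else(lot_size >= 2/3 & lot_size < 1, 1, 0),
         ls_over1 = if_else(lot_size >= 1, 1, 0))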
We can use findInterval:
Lot_Size <- seq(0.2, 1.2, len=10)
Lot_Size
# [1] 0.2000000 0.3111111 0.4222222 0.5333333 0.6444444 0.7555556 0.8666667 0.9777778 1.0888889 1.2000000
findInterval(Lot_Size, c(0, 1/3, 2/3, 1, Inf), rightmost.closed = TRUE) - 1L
# [1] 0 0 1 1 1 2 2 2 3 3
In this case it returns the index of the matching interval within the breaks vector, which we then convert to your 0-based categories with the trailing - 1L (integer 1).
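Applied to the question's setup, this could look like the following (data frame and column names assumed):
# Hypothetical usage: create LS_cat directly on a data frame column
dat$LS_cat <- findInterval(dat$Lot_Size, c(0, 1/3, 2/3, 1, Inf),
                           rightmost.closed = TRUE) - 1L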
cut it.
dat <- transform(dat, Lot_Size_cat =
  cut(Lot_Size, breaks = c(0, 1/3, 2/3, 1, Inf), labels = 0:3,
      include.lowest = TRUE))
dat
# X1 Lot_Size Lot_Size_cat
# 1 0.77436849 1.0509024 3
# 2 0.19722419 0.2819626 0
# 3 0.97801384 0.8002238 2
# 4 0.20132735 0.9272001 2
# 5 0.36124443 0.6396998 1
# 6 0.74261194 1.0990851 3
# 7 0.97872844 1.1648617 3
# 8 0.49811371 0.7221819 2
# 9 0.01331584 1.1915689 3
# 10 0.25994613 0.4076475 1
Data:
set.seed(666)
n <- 10
dat <- data.frame(X1 = runif(n),
                  Lot_Size = sample(seq(0.1980028, 1.2, 1e-7), n, replace = TRUE))
Could anyone explain how to change the negative values in the below dataframe?
We have been asked to create a data structure that gives the output below.
# > df
# x y z
# 1 a -2 3
# 2 b 0 4
# 3 c 2 -5
# 4 d 4 6
Then we have to use control flow operators and/or vectorisation to multiply only the negative values by 10.
I have tried many different ways but cannot get this to work. I get an error when I try to use a loop, because of the letters column.
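For reference, a data frame matching the expected output could be built like this (a sketch):
df <- data.frame(x = c("a", "b", "c", "d"),
                 y = c(-2, 0, 2, 4),
                 z = c(3, 4, -5, 6))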
Create indices of the negative values and multiply by 10, i.e.
i1 <- which(df < 0, arr.ind = TRUE)
df[i1] <- as.numeric(df[i1]) * 10
# x y z
#1 a -20 3
#2 b 0 4
#3 c 2 -50
#4 d 4 6
First find the numeric columns of the data frame, then multiply the negative values by 10.
cols <- sapply(df, is.numeric)
# Multiply negative values by 10 and positive values by 1
df[cols] <- df[cols] * ifelse(sign(df[cols]) == -1, 10, 1)
df
# x y z
#1 a -20 3
#2 b 0 4
#3 c 2 -50
#4 d 4 6
Using dplyr -
library(dplyr)
df <- df %>% mutate(across(where(is.numeric), ~. * ifelse(sign(.) == -1, 10, 1)))
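Equivalently, one could branch on the value itself rather than its sign (a small variation on the same idea):
df <- df %>%
  mutate(across(where(is.numeric), ~ if_else(.x < 0, .x * 10, .x)))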
How can I see the original values post-normalization, or change them back in the final output? I want to change my final output back to their original values, or at least close to them, considering I aggregate and take the mean.
I have a dataset that has 10 columns and 5,000 rows. After cleaning up the data and selecting which columns and rows I want, I run a normalization code.
Then I run a kmeans and get my output. How can I see what the values were changed to after normalization? For example, if I have Regions 1, 2, 3, 4, and 5, and post-normalization they change to 0.00, 0.25, 0.5, 0.75, and 1, is there a way to change them back to the original values in the kmeans output?
normalize = function(X) {
  return(abs(X - min(X)) / (max(X) - min(X)))
}
df_age_norm = as.data.frame(lapply(df_age, normalize))
clusters = kmeans(df_age_norm, 9)[['cluster']]
df_age_norm$clusters = clusters
df_age_norm = aggregate(df_age_norm[, 1:4], list(df_age_norm$clusters), FUN = mean)
Head of dataset before normalization
Age HHIncome Region MaritalStatus group
18 11000 5 0 1
18 11000 5 1 1
18 12000 2 0 1
18 12000 4 0 1
18 13000 1 0 1
Head of dataset after normalization
Age HHIncome Region MaritalStatus group
0 0.001879699 1.00 0 0
0 0.001879699 1.00 1 0
0 0.002819549 0.25 0 0
0 0.002819549 0.75 0 0
0 0.003759398 0.00 0 0
This solution is inspired by the base R function scale, which centers and scales a vector by subtracting its mean value and dividing by its standard deviation. These two values, mean(x) and sd(x), are returned as attributes.
x <- -4:5
y <- scale(x)
attributes(y)
#$dim
#[1] 10 1
#
#$`scaled:center`
#[1] 0.5
#
#$`scaled:scale`
#[1] 3.02765
I have therefore rewritten the function normalize to also set and return min(x) and max(x) as attributes; they will be used later to denormalize.
normalize <- function(X, na.rm = FALSE) {
  if(na.rm) X <- X[!is.na(X)]
  Min <- min(X)
  Max <- max(X)
  Y <- X - Min
  if(Min != Max) Y <- Y/(Max - Min)
  attr(Y, "scaled:min") <- Min
  attr(Y, "scaled:max") <- Max
  Y
}
denormalize <- function(X){
  Min <- attr(X, "scaled:min")
  Max <- attr(X, "scaled:max")
  attr(X, "scaled:min") <- NULL
  attr(X, "scaled:max") <- NULL
  Y <- if(Min != Max) X*(Max - Min) else X
  Y <- Y + Min
  Y
}
df_age_norm <- as.data.frame(lapply(df_age, normalize))
df_age_2 <- as.data.frame(lapply(df_age_norm, denormalize))
df_age_2
# Age HHIncome Region MaritalStatus group
#1 18 11000 5 0 1
#2 18 11000 5 1 1
#3 18 12000 2 0 1
#4 18 12000 4 0 1
#5 18 13000 1 0 1
Data.
df_age <- read.table(text = "
Age HHIncome Region MaritalStatus group
18 11000 5 0 1
18 11000 5 1 1
18 12000 2 0 1
18 12000 4 0 1
18 13000 1 0 1
", header = TRUE)
I have the following sample code to make one data frame containing information for more than one ID.
I want to classify the values into defined categories: at a specific, given time (e.g. here t=10), I want to see the percentage change with respect to the baseline value and return the value of the matching category in the output.
I have explained the detailed steps of my calculation below.
a=c(100,105,126,130,150,100,90,76,51,40)
t=c(0,5,10,20,30)
t=rep(t,2)
ID=c(1,1,1,1,1,2,2,2,2,2)
data=data.frame(ID,t,a)
My desired calculation:
1) For every ID, the a value at t=0 is the baseline.
2) Computation: at the given t=10 (which I have to provide), take the corresponding a value:
%Change (answer) = (taken a value - baseline) / baseline
3) Compare the answer against the following defined categories:
1 - if answer > 0.25
2 - if -0.30 < answer < 0.25
3 - if -1.0 < answer < -0.30
4 - if answer == -1.0
4) Return the value of the category.
Desired Output
ID My_Answer
1 1
2 3
Can anybody help me with this? I understand the flow of my computation, but I am not aware of an efficient way of doing it, as I have so many IDs in the data frame.
Thank you.
It's easier to do math with columns than with rows. So the first step is to move baseline numbers into their own columns, then use cut to define these groups:
library(dplyr)
library(tidyr)
foo <- data %>%
  filter(t == 0) %>%
  left_join(data %>% filter(t != 0), by = "ID") %>%
  mutate(percentchange = (a.y - a.x) / a.x,
         My_Answer = cut(percentchange, breaks = c(-1, -0.3, 0.25, Inf),
                         right = FALSE, include.lowest = TRUE,
                         labels = c("g3", "g2", "g1")),
         My_Answer = as.character(My_Answer),
         My_Answer = ifelse(percentchange == -1, "g4", My_Answer))
foo
ID t.x a.x t.y a.y percentchange My_Answer
1 1 0 100 5 105 0.05 g2
2 1 0 100 10 126 0.26 g1
3 1 0 100 20 130 0.30 g1
4 1 0 100 30 150 0.50 g1
5 2 0 100 5 90 -0.10 g2
6 2 0 100 10 76 -0.24 g2
7 2 0 100 20 51 -0.49 g3
8 2 0 100 30 40 -0.60 g3
You can see that this lets us calculate My_Answer for all values at once. If you want to find the values for t = 10, you can just pull out those rows and keep the relevant columns:
foo %>%
  filter(t.y == 10) %>%
  select(ID, t = t.y, My_Answer)
ID t My_Answer
1 1 10 g1
2 2 10 g2
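For comparison, a base-R sketch of the same baseline-join idea, computing only the requested t = 10 categories (numeric codes instead of the g labels, using the data from the question):
# Join each ID's baseline (t == 0) onto its t == 10 row, then categorise
base <- setNames(data[data$t == 0, c("ID", "a")], c("ID", "baseline"))
m <- merge(data[data$t == 10, ], base, by = "ID")
m$change <- (m$a - m$baseline) / m$baseline
m$My_Answer <- ifelse(m$change >= 0.25, 1,
               ifelse(m$change >= -0.3, 2,
               ifelse(m$change == -1, 4, 3)))
m[, c("ID", "My_Answer")]
#   ID My_Answer
# 1  1         1
# 2  2         2
This matches the g1/g2 result above: by the stated category boundaries, ID 2's change of -0.24 falls in category 2.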