I'd like to create a variable that bins values from another variable based on a bin width.
The data would look something like this if I wanted to create a bin variable based on counts where:
1 to 5 = 1
6 to 10 = 2
11 to 15 = 3
Without hand-recoding each bin, is there a function that can do something like this in R?
Since it looks like you want to get a numeric rather than a factor result, try something like trunc((mydata$count-1)/5)+1
e.g.
mydata$bucket = trunc((mydata$count-1)/5)+1
There's also the ceiling function, which is a little simpler:
mydata$bucket = ceiling(mydata$count/5)
See ?round; the same help page also documents ceiling, floor, and trunc.
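As a quick sanity check (with made-up counts), the two formulas agree for positive integer counts:
counts <- 1:15  # hypothetical counts
trunc((counts - 1)/5) + 1
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
ceiling(counts/5)
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3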
So on your data:
mydata = data.frame(spend = c(21,32,34,43,36,39,33,47,47,47,25,50,44,44),
                    count = c(3L,1L,2L,15L,1L,8L,1L,11L,15L,11L,3L,12L,11L,4L))
mydata$bucket = ceiling(mydata$count/5)
Which gives:
> mydata
   spend count bucket
1     21     3      1
2     32     1      1
3     34     2      1
4     43    15      3
5     36     1      1
6     39     8      2
7     33     1      1
8     47    11      3
9     47    15      3
10    47    11      3
11    25     3      1
12    50    12      3
13    44    11      3
14    44     4      1
Yeah, it's called the cut function:
?cut
You can use the generic cut() function. For a numeric vector x, the method has these arguments:
> args(cut.default)
function (x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE,
    dig.lab = 3L, ordered_result = FALSE, ...)
The argument breaks is central here. It is either a number of intervals or a vector of “breakpoints” defining your intervals. Note that by default (right = TRUE) all intervals are left-open and right-closed, i.e. (a, b], so by creating an object x containing the numbers from 1 to 100 and defining a vector of breakpoints brk = {1, 20, 50, 100}, you will get these results (after using table() on the result):
> x <- 1:100
> brk <- c(1,20,50,100)
> table(cut(x = x, breaks = brk))
  (1,20]  (20,50] (50,100]
      19       30       50
You can see that the first interval is (1,20], so 1 is not part of it and the first observation will become a missing value NA (as will all other observations outside the defined intervals).
By setting include.lowest = TRUE, R includes the lowest value (i.e., the first interval is closed on the left as well), so I think this will produce what you want:
> x <- 1:100
> brk <- c(1,20,50,100)
> table(cut(x = x, breaks = brk, include.lowest = TRUE))
  [1,20]  (20,50] (50,100]
      20       30       50
The argument right reverses the whole process: with right = FALSE, intervals are left-closed and right-open, i.e. [a, b), and include.lowest will then close the last interval (i.e., include the highest value in the last category).
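To illustrate, reusing x and brk from above:
> table(cut(x = x, breaks = brk, right = FALSE, include.lowest = TRUE))
  [1,20)  [20,50) [50,100]
      19       30       51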
As the resulting object will be of class "factor", you might consider setting ordered_result to TRUE, producing an ordered factor object (classes "ordered" and "factor").
Labelling, etc. is optional (see ?cut).
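For instance, combining labels and ordered_result (the label names here are invented for illustration):
> brks <- cut(x, breaks = brk, include.lowest = TRUE,
+             labels = c("low", "mid", "high"), ordered_result = TRUE)
> brks[c(1, 25, 75)]
[1] low  mid  high
Levels: low < mid < high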
The cut function can also accomplish the binning while keeping the result numeric; you just need to use the labels parameter. Passing labels = FALSE makes cut return the integer bin codes directly (with labels = 1:30 you would instead get a factor whose levels merely look numeric):
myData$bucket <- cut(myData$counts, breaks = 30, labels = FALSE)
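A minimal sketch on made-up data (myData and the 30 breaks come from the answer above; the counts are invented):
myData <- data.frame(counts = sample(1:90, 100, replace = TRUE))
myData$bucket <- cut(myData$counts, breaks = 30, labels = FALSE)
str(myData$bucket)  # integer bin codes, not a factor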
I am working on subsetting multiple variables in a dataset to remove data points that are not useful. When I enter the subset command for the first variable and check the dataset, the variable has been properly subset. However, after doing the same with the second variable, the first is no longer subset in the dataset. It seems as though the second subset command is overriding the first. In the example I came up with below the first variable (Height) is no longer subset once I subset the second variable (Weight). Any thoughts on how to resolve this?
rTestDataSet = TestDataSet
rTestDataSet = subset(TestDataSet, TestDataSet$Height < 4)
rTestDataSet = subset(TestDataSet, TestDataSet$Weight < 3)
You are applying both subsets to the original data. What you need to do is apply one subset, save it to a variable, and then apply the second subset to this new variable. Also, as already pointed out, you don't need the $ when using subset.
Try this:
Make some reproducible data:
set.seed(50)
TestDataSet <- data.frame(Height = sample(1:10, 30, replace = TRUE),
                          Weight = sample(1:10, 30, replace = TRUE))
rTestDataSet = TestDataSet
rTestDataSet = subset(rTestDataSet, Height < 4)
rTestDataSet
   Height Weight
3       3      5
6       1      7
9       1      4
10      2      5
12      3      9
14      1      1
15      3      1
19      1      8
20      2      9
22      2      8
28      3      6
rTestDataSet = subset(rTestDataSet, Weight < 3)
rTestDataSet
   Height Weight
14      1      1
15      3      1
Why not use the tidyverse? Chain the operations together to create your own logic. Instead of subset, you can use filter to get the rows you want conditionally:
library(tidyverse)
TestDataSet %>%
  filter(Height < 4) %>%
  filter(Weight < 3)
or
TestDataSet %>%
  filter(Height < 4 & Weight < 3)
I'm using the epiR package as it does nice 2-by-2 contingency tables with odds ratios and population attributable fractions.
As is common, my data is coded
0 = No
1 = Yes
So when I do
table(var_1, var_2)
the output comes out as a table with the 0 (No) row and column first and the 1 (Yes) row and column last.
For its input, though, epiR wants the top-left square to be Exposed +VE / Outcome +VE, i.e. the top-left square should be var_1 == 1 and var_2 == 1.
Currently I do this by recoding the zeroes to 2, or alternatively by setting the variable as a factor and using relevel. Both of these are slightly annoying for other analyses, as in general I want Outcome +VE to come after Outcome -VE.
So I wondered if there is an easy way (perhaps within table) to flip the orientation of the table so that it essentially inverts the ordering of the rows/columns?
Hope the above makes sense - happy to provide clarification if not.
Edit: Thanks for the suggestions below; just for clarification, I want to be able to do this when calling table on existing data frame variables, i.e. when what I am doing is table(data$var_1, data$var_2), ideally without having to create a whole new object.
A table is essentially a matrix, so you can just index the rows and columns in reverse order.
xy <- table(data.frame(value = rbinom(100, size = 1, prob = 0.5),
                       variable = letters[1:2]))
xy
     variable
value  a  b
    0 20 22
    1 30 28
xy[2:1, 2:1]
     variable
value  b  a
    1 28 30
    0 22 20
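For a table of any size, the same idea with computed indices:
xy[rev(seq_len(nrow(xy))), rev(seq_len(ncol(xy)))]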
Using factor levels:
# reproducible example (adapted from Roman's answer)
df1 <- data.frame(value = rbinom(100, size = 1, prob = 0.5),
                  variable = letters[1:2])
table(df1)
#      variable
# value  a  b
#     0 32 23
#     1 18 27

# convert to factor, specify levels
df1$value <- factor(df1$value, levels = c("1", "0"))
df1$variable <- factor(df1$variable, levels = c("b", "a"))
table(df1)
#      variable
# value  b  a
#     1 27 18
#     0 23 32
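Per the edit in the question, both approaches also work inline on data frame columns, without creating a whole new object (a sketch assuming 0/1-coded columns var_1 and var_2 in data):
table(data$var_1, data$var_2)[2:1, 2:1]
# or relevel on the fly
table(factor(data$var_1, levels = c(1, 0)),
      factor(data$var_2, levels = c(1, 0)))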
I want to drop all unused labels from a data.set.
Let's assume this example data.set (which is a class from the memisc package).
library(memisc)
d <- data.set(a = sample(1:10), b=rep(c(14,72),5))
labels(d$b) <- c('First' = 14, 'no-use' = 33, 'Second' = 72)
The resulting data.set:
Data set with 10 observations and 2 variables
    a      b
 1  4  First
 2  1 Second
 3  9  First
 4  8 Second
 5  7  First
 6 10 Second
 7  5  First
 8  3 Second
 9  2  First
10  6 Second
You can see that only two values of b are used, but it has three labels.
> labels(d$b)
Values and labels:
   14 'First'
   33 'no-use'
   72 'Second'
How can I drop the unused label (33) from there? The point is that all unused labels should be dropped, and I don't know in advance which ones are unused. I would know how to remove 33 explicitly, but that is not the goal.
From basic-R data.frames I know the droplevels() function; it would be nice to have something like droplabels().
This isn't very compact, but you could use the following
labels(d$b) <- labels(d$b)[seq_len(length(unique(d$b)))]
Update
The snippet above keeps the first labels rather than the used ones, so here it would drop '72' instead of the unused '33'. Regardless, the following will drop any unused labels:
labels(d$b) <- labels(d$b)[labels(d$b)@values %in% unique(d$b)]
The following will drop all unused labels for every variable in the data set:
for (i in seq_along(d)) {
  if (!is.null(labels(d[[i]]))) {
    labels(d[[i]]) <- labels(d[[i]])[labels(d[[i]])@values %in% unique(d[[i]])]
  }
}
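Applied to the d from the question, the unused label should then disappear:
> labels(d$b)
Values and labels:
   14 'First'
   72 'Second'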
I have a data frame consisting of about 22 fields, some system ids and some measurements, such as
bsystemid dcesystemid lengthdecimal heightquantity
     2218          58            22            263
     2219          58            22            197
     2220          58            22            241
What I want:
1. loop through a list of field ids
2. define a function to test for a condition
3. such that both x and y can vary
Where does the definition of y belong so that both x and y can vary? Or do I need a different structure?
This code block works for a single field and value of y:
varlist4 <- names(brg)[c(6)]
f1 <- function(x, y) count(brg[, x] < y)
lapply(varlist4, f1, y = c(7.5))
This code block executes, but the counts are off:
varlist4 <- names(brg)[c(6,8,10,12)]
f1 <- function(x, y) count(brg[, x] < y)
lapply(varlist4, f1, y = c(7.5,130,150,0))
For example,
varlist4 <- names(brg)[c(6)]
f1 <- function(x, y) count(brg[, x] < y)
lapply(varlist4, f1, y = c(7.5))
returns (correctly),
      x freq
1 FALSE 9490
2  TRUE  309
3    NA   41
whereas the multiple x,y block of code above returns this for the first case,
      x freq
1 FALSE 4828
2  TRUE 4971
3    NA   41
Thanks for any comments.
Update:
What I would like is to automate counting of occurrences of values in specified fields in a df, meeting some condition. The conditions are numeric constants or text strings, one for each field. For example, I might want to count occurrences meeting the condition >360 in field1, >0 in field2, etc. What I thus mean by allowing x and y to vary is reading x and y vectors with the field names and corresponding conditions into a looping structure.
I'd like to automate this task because it involves around 30 tables, each with up to 50 or so fields. And I'll need to do it twice, scanning once for values exceeding a maximum and once for values less than a minimum. Better still might be loading the conditions into a table and referencing that in the loop. That may be the next step but I'd like to understand this piece first.
This working example
t1 <- 18:29
t2 <- c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
t3 <- c(1.2,-0.2,-0.3,1.2,2.2,0.4,0.6,0.4,-0.8,-0.1,5.0,3.1)
t <- data.frame(v1 = t1, v2 = t2, v3 = t3)
varlist <- names(t)[c(1)]
f1 <- function(x, y) count(t[, x] > y)
lapply(varlist, f1, y = c(27))
illustrates the correct answer for the first field, returning
      x freq
1 FALSE   10
2  TRUE    2
But if I add in other fields and the corresponding conditions (the y's) I get something different for the first case:
varlist <- names(t)[c(1,2,3)]
f1 <- function(x, y) count(t[, x] > y)
lapply(varlist, f1, y = c(27,83,3))
[[1]]
      x freq
1 FALSE    8
2  TRUE    4

[[2]]
      x freq
1 FALSE    1
2  TRUE   11

[[3]]
      x freq
1 FALSE   11
2  TRUE    1
My sense is I'm not going about structuring the y part correctly.
Thanks for any comments.
You can use mapply. (The problem with your lapply approach is that the whole y vector is passed in full to every call, so a comparison like t[, x] > y recycles all the thresholds along the rows of each column, which is why your counts are off.) Let's create some data:
set.seed(123) # to get exactly the same results
brg = data.frame(x = rnorm(100), y=rnorm(100), z=rnorm(100))
brg$x[c(10, 15)] = NA # some NAs
brg$y[c(12, 21)] = NA # more NAs
Then you need to define the function to do the job. The function .f1 counts the data and ensures there are always three levels (TRUE, FALSE, NA). Then f1 uses .f1 in an mapply call so that both x and y can vary. Finally, there are some improvements to the output (clearer column names).
f1 = function(x, y, data) {
  .f1 = function(x, y, data) {
    out = factor(data[, x] < y,
                 levels = c("TRUE", "FALSE", NA), exclude = NULL)
    return(table(out))
  }
  out = mapply(.f1, x, y, MoreArgs = list(data = data))  # check ?mapply
  colnames(out) = paste0(x, "<", y)  # clearer names for the output
  return(out)
}
Finally, the test:
varlist = names(brg)
threshold = c(0, 1, 1000)
f1(x = varlist, y = threshold, data = brg)
And you should get
       x<0 y<1 z<1000
TRUE    46  87    100
FALSE   52  11      0
<NA>     2   2      0
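Since the update mentions loading the conditions into a table, here is a hedged sketch of that next step (the conds data frame and its column names are invented):
conds <- data.frame(field = c("x", "y", "z"),
                    max = c(0, 1, 1000),
                    stringsAsFactors = FALSE)
f1(x = conds$field, y = conds$max, data = brg)  # same result as above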
I am trying to get all the columns of my data frame onto the same scale.
Right now I have something like this, where a is on a 0-1 scale, b is on a 0-100 scale, and c is on a 1-5 scale:
a b c
0 89 4
1 93 3
0 88 5
How would I get it all onto a 0-100 scale, like this?
a b c
0 89 80
100 93 60
0 88 100
I hope that is somewhat clear.
I have tried scale() but cannot seem to get it to work.
Using scale, if dat is the name of your data frame:
## for one column
dat$a <- scale(dat$a, center = FALSE, scale = max(dat$a, na.rm = TRUE)/100)
## for every column of your data frame
dat <- data.frame(lapply(dat, function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE)/100)))
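A quick check on the question's data (note that b is rescaled to its own maximum of 93, so its values change slightly):
dat <- data.frame(a = c(0, 1, 0), b = c(89, 93, 88), c = c(4, 3, 5))
dat <- data.frame(lapply(dat, function(x) scale(x, center = FALSE,
                                                scale = max(x, na.rm = TRUE)/100)))
dat
#     a         b   c
# 1   0  95.69892  80
# 2 100 100.00000  60
# 3   0  94.62366 100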
For a simple case like this, you could also write your own function.
fn <- function(x) x * 100/max(x, na.rm = TRUE)
fn(c(0,1,0))
# [1] 0 100 0
## to one column
dat$a <- fn(dat$a)
## to all columns of your data frame
dat <- data.frame(lapply(dat, fn))
In my experience this is still unanswered: what if one of the columns had a -2? The current answer would not produce a 0-100 scale. While I appreciate the answer, when I attempted it on variables that run from -100 to 100, some values were still negative.
I have a solution in case this applies to you:
rescale <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 100
dat <- data.frame(lapply(dat, rescale))
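For example, on a column that runs from -100 to 100 (made-up values):
rescale(c(-100, -50, 0, 100))
# [1]   0  25  50 100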
Even simpler and more flexible for other scales is the rescale() function from the scales package. If you wanted to scale from 3 to 50 for some reason, you could set the to parameter to c(3,50) instead of c(0,100) here. Additionally, you can set the from parameter if your data needs to fit the scale of another dataset (i.e. the min/max of your data should not equal the min/max of the scale you want to set). Here I've provided an example where 0 is the midpoint between -100 and 100, so rescaling to 0-100 now places 0 at 50 (the halfway point).
# 0 to 100 scaling
rescale(1:10, to = c(0,100))
# [1] 0.00000 11.11111 22.22222 33.33333 44.44444 55.55556 66.66667 77.77778 88.88889
# [10] 100.00000
# use 'from' to indicate the extended range of values
rescale(seq(0,100,10), to = c(0,100), from = c(-100,100))
# [1] 50 55 60 65 70 75 80 85 90 95 100