R subset dataframe, skip part of interval - r

I'm trying to subset my total data (including all the other variables) to an interval of zipcodes EXCLUDING a certain part of that interval. Quite new to R and I can't get it to work. (Zipcode = postnr)
I have over 100 000 zipcodes (postnr) and want all values for individuals in zipcodes 10 000-12 999 and 15 600-16 880 in my dataset
Attempt 1
Datan <- subset(Data2, Data2$postnr >= 10000 & Data2$postnr <= 16880)
Datant <- subset(Datan, Datan$postnr >= 15600 & Datan$postnr < 13000)
Datan returns 313 000 obs. of 26 variables and Datant returns 0 obs. of 26 variables..
Attempt 2
attach(Data2)
Data5 <- Data2 %>% filter(between(postnr, 10000, 12999) & between(postnr, 15600, 16880))
Data5 returns 0 observations...
I have thousands of values for all my variables inside those intervals. What am I doing wrong?

If you think about and versus or, you've got it. As it is, you're really close!
Can a number be between 1 and 2 and between 3 and 5? Nope. But if I said, can a number be between 1 and 2 or between 3 and 5? Yup.
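A quick way to see this in R (a toy vector, purely illustrative):
x <- c(1.5, 4, 2.5)
x > 1 & x < 2 & x > 3 & x < 5      # FALSE FALSE FALSE: no value can satisfy both ranges
(x > 1 & x < 2) | (x > 3 & x < 5)  # TRUE TRUE FALSE: either range will do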
Updated
For subset:
Datan <- subset(Data2, postnr >= 10000 & postnr <= 12999 |
                       postnr >= 15600 & postnr <= 16880)
The vertical pipe | means 'or'.
For dplyr:
(I assume it's dplyr with filter.) You don't need to attach the data; filter() will look up the variable names in Data2 since it's in the pipe (which it is).
Data5 <- Data2 %>% filter(between(postnr, 10000, 12999) |
                          between(postnr, 15600, 16880))

I have no data, so I cannot properly test this, but the following should work.
Note the or operator (|) to specify two different conditions.
library(data.table)
dt <- as.data.table(Data2)
dt[(postnr >= 10000 & postnr <= 12999) | (postnr >= 15600 & postnr <= 16880)]
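data.table also offers the %between% operator (inclusive on both ends), which reads a little more cleanly; a sketch under the same assumptions:
dt[postnr %between% c(10000, 12999) | postnr %between% c(15600, 16880)]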

How to subset in RStudio properly?

I have created the following data frame:
age <- c(21,35,829,2)
sex <- c("m","f","m","c")
height <- c(181,173,171,166)
weight <- c(69,58,75,60)
dat <- as.data.frame(cbind(age,sex,height,weight), stringsAsFactors = FALSE)
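# (cbind() builds a character matrix here, which is why age must be converted back to numeric below)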
dat$age <- as.numeric(age)
dat
I now want to choose only the rows of students who are older than 20 or younger than 80.
Why does this work : dat[dat$age<20| dat$age>80,] ; subset(dat, age < 20 | age > 80)
But this does not: dat[dat$age>20| dat$age<80,] ; subset(dat, age > 20 | age < 80)
I can subset the rows that are NOT younger than 80 or older than 20, but not those that are actually inside this interval.
What is the mistake?
Thanks in advance.
Because your condition allows basically every possible age. Think about it: your conditions are independent (because you are using the | operator), so every row that fits either one of them is selected by your filter. Every age in your data.frame is higher than 20, OR, if it isn't, it is certainly lower than 80.
If you want to select every row with age between 20 and 80, you need to make the conditions joint by changing the logical operator, like this:
dat[dat$age>20 & dat$age<80,]
subset(dat, age > 20 & age < 80)
Resulting in this:
age sex height weight
1 21 m 181 69
2 35 f 173 58
Now, if you want to select all the rows that are outside of this interval, you can negate the logical condition with the ! operator, as suggested by @r2evans in the comment section. It would be something like this:
dat[!(dat$age > 20 & dat$age < 80),]
subset(dat, !(age > 20 & age < 80))
Resulting in this:
age sex height weight
3 829 m 171 75
4 2 c 166 60
Why not use dplyr filter?
library(dplyr)
df_age <- dat %>%
  dplyr::filter(age > 20,
                age < 80)
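dplyr also provides between() for this kind of range test; note it is inclusive at both ends, so ages of exactly 20 or 80 would pass, a small difference from the strict inequalities above:
df_age <- dat %>% dplyr::filter(between(age, 20, 80))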

Go through an R dataframe and increase the salary at certain conditions

I have this sample of my dataframe (df):
age salary
1 25 20000
2 35 22000
3 31 23500
4 24 19200
5 27 27900
6 32 31010
I want to increase the salary by 11% for people who are aged above 30 and whose salary is not the maximum salary in the table. I wrote this loop:
for(row in df){
if (row$age > 30 & row$salary != max(df$salary)){
row$salary = row$salary * 0.11
}
}
but I get salaries lower than the originals rather than an increase.
Would really appreciate any help.
Here's one way without explicit ifelse. It should be one of the fastest ways to do this:
df$new_salary <- with(df, salary + 0.11*salary*(age > 30)*(salary != max(salary)))
The reason you're getting smaller numbers is that row$salary * 0.11 replaces each salary with 11% of its value instead of adding 11% (that would be * 1.11). Also, for (row in df) iterates over the columns of a data frame, not its rows, so assignments to row never update df. Try this vectorized replacement instead:
k <- max(df$salary)
df[df$age > 30 & df$salary != k, 'salary'] <- df[df$age > 30 & df$salary != k, 'salary'] * 1.11
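An equivalent vectorized alternative with base R's ifelse (a sketch, not from the answers above):
df$salary <- ifelse(df$age > 30 & df$salary != max(df$salary),
                    df$salary * 1.11,   # add 11% where both conditions hold
                    df$salary)          # otherwise leave unchanged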

R : Create specific bin based on data range

I am attempting to repeatedly add a "fixed number" to a numeric vector depending on a specified bin size. However, the "fixed number" is dependent on the data range.
For instance, I have a data range of 10 to 1010, and I wish to separate the data into 100 bins.
Since 1010 - 10 = 1000,
and 1000 / 100 (the number of bins specified) = 10,
the ideal data would look like this:
bin1 - 10 (initial data)
bin2 - 20 (initial data + 10)
bin3 - 30 (initial data + 20)
bin4 - 40 (initial data + 30)
bin100 - 1010 (initial data + 1000)
Now the real data is slightly more complex: there is not just one data range but several. Hopefully the example below clarifies.
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
Ideally I wish to get something like
10 20
20 30
30 40
.. ..
5000 5015
5015 5030
5030 5045
.. ..
4857694 4858096 # Note: theoretically these would have decimal places,
# but I do not want any decimal places
4858096 4858498
.. ..
So far I was thinking along the lines of the function below, but it seems inefficient because:
1) I have to retype the function 100 times (because my number of bins is 100), and
2) I can't find a way to repeat the function along my values. In other words, my function can only deal with the range 10-1010 and not the next one, 5000-6500.
# The range of the variable
width <- end - start
# The number of bins required
bin_size <- 100
# The width of each bin
bin_count <- width/bin_size
# Create a function
f1 <- function(x,y){
c(x[1],
x[1] + y[1],
x[1] + y[1]*2,
x[1] + y[1]*3)
}
f1(x= start,y=bin_count)
[1] 10 20 30 40
Perhaps any hint or ideas would be greatly appreciated. Thanks in advance!
After a few hours of trying, I managed to answer my own question, so I thought I'd share it. I used the package "binr" and its function "bins" to get the required bins. Please find my attempt below; it's slightly different from the intended output, but for my purpose it is still okay.
library(binr)
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
tmp_list_start <- list() # Create an empty list
# This just extracts the output from the "bins" function into a list
for (i in seq_along(start)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
# Now I need to convert one of the outputs from bins into numeric values
s <- gsub(",.*", "", names(tmp$binct))
s <- gsub("\\[","",s)
tmp_list_start[[i]] <- as.numeric(s)
}
# Repeating the same thing with a slight modification to get the end value of each bin
tmp_list_end <- list()
for (i in seq_along(end)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
e <- gsub(".*,", "", names(tmp$binct))
e <- gsub("]","",e)
tmp_list_end[[i]] <- as.numeric(e)
}
v1 <- unlist(tmp_list_start)
v2 <- unlist(tmp_list_end)
df <- data.frame(start=v1, end=v2)
head(df)
start end
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
6 61 70
Pardon my crappy code; please share if there is a better way of doing this. It would be nice if someone could comment on how to wrap this into a function.
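One immediate tidy-up of the above: the two loops differ only in which edge of the bin they extract, so a single pass can collect both (a sketch reusing the same binr call):
tmp_list <- lapply(seq_along(start), function(i) {
  tmp <- bins(start[i]:end[i], target.bins = 100, max.breaks = 100)
  rng <- gsub("\\[|\\]", "", names(tmp$binct))  # strip the brackets
  data.frame(start = as.numeric(gsub(",.*", "", rng)),
             end   = as.numeric(gsub(".*,", "", rng)))
})
df <- do.call(rbind, tmp_list)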
Here's a way that may help with base R:
bin_it <- function(START, END, BINS = 100) {
range <- END-START
jump <- range/BINS
v1 <- c(START, seq(START+jump+1, END, jump))
v2 <- seq(START+jump-1, END, jump)+1
data.frame(v1, v2)
}
It uses seq to create the vectors of numbers leading up to the ending number (BINS defaults to 100 here, matching your bin count). It may not work for every case, but for the ranges you gave it should produce the desired output.
bin_it(10, 1010)
v1 v2
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
bin_it(5000, 6500)
v1 v2
1 5000 5015
2 5016 5030
3 5031 5045
4 5046 5060
5 5061 5075
bin_it(4857694, 4897909)
v1 v2
1 4857694 4858096
2 4858097 4858498
3 4858499 4858900
4 4858901 4859303
5 4859304 4859705
6 4859706 4860107
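To apply bin_it across all the start/end pairs from the question at once, base R's Map can drive it (a sketch, assuming the start and end vectors defined above):
all_bins <- do.call(rbind, Map(bin_it, start, end))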

dplyr idiom for summarize() a filtered-group-by, and also replace any NAs due to missing rows

I am computing a dplyr::summarize across a dataframe of sales data.
I do a group-by (S,D,Y), then within each group compute medians and means of X for weeks 5..43, then merge those back into the parent df. Variable X is sales. X is never NA (i.e. there are no explicit NAs anywhere in df), but if there is no data (as in, no sales) for a given S,D,Y and set of weeks, there will simply be no row with those values in df (take that to mean zero sales for that particular set of parameters). In other words, I want to impute X=0 in any structurally missing rows, but I hope I don't need to melt/cast the original df, to avoid bloat (similar to cast(fill....,add.missing=T) or caret::preProcess()).
Two questions about my code idiom:
Is it better to use subsetting inside summarize than dplyr::filter, given that filter physically drops rows, so I have to assign the results to df.tmp and then left-join it back to the original df (as below)? Also, big subsetting expressions repeated on every single line of the summarize computations make the code harder to read.
Should I worry (or not) about caching the rows or logical indices of the subsetting operation, in the general case where I might be computing, say, n=20 new summary variables?
Not all combinations of S,D,Y groups and the week filter have rows, so how do I get summarize to replace NA on any missing rows? Currently I do it as below.
Sorry, both the code and dataset are proprietary, but here's the code idiom, and below it is code you should run first to generate the sample data:
# Compute median, mean of X across wks 5..43, for that set of S,D,Y-values
# Issue a) filter() or repeatedly use subset() within each calculation?
df.tmp <- df %.% group_by(S,D,Y) %.% filter(Week>=5 & Week<=43) %.%
summarize(ysd_med543_X = median(X),
ysd_mean543_X = mean(X)
) %.% ungroup()
# Issue b) how to replace NAs in groups where the group_by-and-filter gave empty output?
# can you merge this code with the summarize above?
df <- left_join(df, df.tmp, copy=F)
newcols <- match(c('ysd_mean543_X','ysd_med543_X'), names(df))
df[!complete.cases(df[,newcols]), newcols] <- c(0.0,0.0)
and run this first to generate sample-data:
set.seed(1234)
rep_vector <- function(vv, n) {
unlist(as.vector(lapply(vv, function(...) {rep(...,n)} )))
}
n=7
m=3
df = data.frame(S = rep_vector(10:12, n), D = 20:26,
Y = rep_vector(2005:2007, n),
Week = round(52*runif(m*n)),
X = 4e4*runif(m*n) + 1e4 )
# Now drop some rows, to model structurally missing rows
I <- sort(sample(1:nrow(df),0.6*nrow(df)))
df = df[I,]
require(dplyr)
I don't think this has anything to do with the feature you've linked in the comments (because, IIUC, that feature has to do with unused factor levels). Once you filter your data, IMO summarise should not (or rather can't?) include those groups in the results (with the exception of factors). You should clarify this with the developers on their project page.
I'm by no means a dplyr expert, but I think, firstly, it'd be better to filter first and then do group_by + summarise; otherwise you'll be filtering within each group, which is unnecessary. That is:
df.tmp <- df %.% filter(Week>=5 & Week<=43) %.% group_by(S,D,Y) %.% ...
This is just so that you're aware of it for any future cases.
IMO, it's better to use mutate here instead of summarise, as it'll remove the need for left_join, IIUC. That is:
df.tmp <- df %.% group_by(S,D,Y) %.% mutate(
md_X = median(X[Week >=5 & Week <= 43]),
mn_X = mean(X[Week >=5 & Week <= 43]))
Here we still have the issue of replacing the NA/NaN. There's no easy/direct way to sub-assign here, so you'll have to use ifelse, once again IIUC. This would be a little nicer if mutate supported multi-statement expressions.
What I have in mind is something like:
df.tmp <- df %.% group_by(S,D,Y) %.% mutate(
{ tmp = Week >= 5 & Week <= 43;
md_X = ifelse(length(tmp), median(X[tmp]), 0),
mn_X = ifelse(length(tmp), mean(X[tmp]), 0)
})
So, we'll probably have to work around it in this manner:
df.tmp = df %.% group_by(S,D,Y) %.% mutate(tmp = Week >=5 & Week <= 43)
df.tmp %.% mutate(md_X = ifelse(tmp[1L], median(X), 0),
mn_X = ifelse(tmp[1L], mean(X), 0))
Or to put things together:
df %.% group_by(S,D,Y) %.% mutate(tmp = Week >=5 & Week <= 43,
md_X = ifelse(tmp[1L], median(X), 0),
mn_X = ifelse(tmp[1L], mean(X), 0))
# S D Y Week X tmp md_X mn_X
# 1 10 20 2005 6 22107.73 TRUE 22107.73 22107.73
# 2 10 23 2005 32 18751.98 TRUE 18751.98 18751.98
# 3 10 25 2005 33 31027.90 TRUE 31027.90 31027.90
# 4 10 26 2005 0 46586.33 FALSE 0.00 0.00
# 5 11 20 2006 12 43253.80 TRUE 43253.80 43253.80
# 6 11 22 2006 27 28243.66 TRUE 28243.66 28243.66
# 7 11 23 2006 36 20607.47 TRUE 20607.47 20607.47
# 8 11 24 2006 28 22186.89 TRUE 22186.89 22186.89
# 9 11 25 2006 15 30292.27 TRUE 30292.27 30292.27
# 10 12 20 2007 15 40386.83 TRUE 40386.83 40386.83
# 11 12 21 2007 44 18049.92 FALSE 0.00 0.00
# 12 12 26 2007 16 35856.24 TRUE 35856.24 35856.24
which doesn't require df.tmp.
HTH
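For readers on current dplyr (where %>% has long since replaced %.%), here is an untested sketch of the same filter + summarize + left_join idiom; replace_na comes from tidyr, and the .groups argument assumes dplyr >= 1.0:
library(dplyr)
library(tidyr)
df.tmp <- df %>%
  filter(Week >= 5, Week <= 43) %>%
  group_by(S, D, Y) %>%
  summarize(ysd_med543_X = median(X),
            ysd_mean543_X = mean(X),
            .groups = "drop")
df <- df %>%
  left_join(df.tmp, by = c("S", "D", "Y")) %>%
  replace_na(list(ysd_med543_X = 0, ysd_mean543_X = 0))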

Create shingles from two vectors

I'd like to classify the values of a data frame according to two columns. Let's say, I've got the following data frame:
my.df <- data.frame(a=c(1:20), b=c(61:80))
And now I want to subdivide it into 8 areas by dividing the 2D scatterplot into 4 equal parts and then overlaying a rectangle in the middle that consists of a quarter of each of the 4 parts. So far I've been using the following tedious approach:
ar <- range(my.df$a)
br <- range(my.df$b)
aint <- seq(ar[1], ar[2], by=(ar[2]-ar[1])/4)
bint <- seq(br[1], br[2], by=(br[2]-br[1])/4)
my.df$z <- NA
my.df[which(my.df$a < aint[3] & my.df$b < bint[3]),"z"] <- 1
my.df[which(my.df$a < aint[3] & my.df$b >= bint[3]),"z"] <- 2
...
my.df[which(my.df$z == 1 & my.df$a >= aint[2] & my.df$b >= bint[2]),"z"] <- 5
...
I am sure there must be a neater and more general way to do it, i.e. by writing a general function, but I am struggling to write one myself.
Also, I was surprised to see that after all of this the class of the column z is automatically set to shingle. Why is that? How does R "know" that this is a shingle?
I'd approach it by cutting it into 16 groups first (x and y into 4 groups independently) and then combining them back together into fewer groups.
my.df$a.q <- cut(my.df$a, breaks=4, labels=1:4)
my.df$b.q <- cut(my.df$b, breaks=4, labels=1:4)
my.df$a.b.q <- paste(my.df$a.q, my.df$b.q, sep=".")
my.df$z <- c("1.1"=1, "1.2"=1, "1.3"=2, "1.4"=2,
"2.1"=1, "2.2"=3, "2.3"=4, "2.4"=2,
"3.1"=5, "3.2"=6, "3.3"=7, "3.4"=8,
"4.1"=5, "4.2"=5, "4.3"=8, "4.4"=8)[my.df$a.b.q]
This seems reasonable:
plot(my.df$a, my.df$b, col=my.df$z)
With some data with more coverage:
set.seed(1234)
my.df <- data.frame(a=runif(1000, 1, 20), b=runif(1000, 61, 80))
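To make this general, the lookup can be wrapped in a function; a sketch, where the name classify_zones and the zmap argument are illustrative rather than from the answer above:
classify_zones <- function(a, b, zmap) {
  a.q <- cut(a, breaks = 4, labels = 1:4)
  b.q <- cut(b, breaks = 4, labels = 1:4)
  zmap[paste(a.q, b.q, sep = ".")]  # index the named zone map by combined quartile labels
}
zmap <- c("1.1"=1, "1.2"=1, "1.3"=2, "1.4"=2,
          "2.1"=1, "2.2"=3, "2.3"=4, "2.4"=2,
          "3.1"=5, "3.2"=6, "3.3"=7, "3.4"=8,
          "4.1"=5, "4.2"=5, "4.3"=8, "4.4"=8)
my.df$z <- classify_zones(my.df$a, my.df$b, zmap)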
