Data cleaning using subset with 2 conditions on same variable - r

I am a newbie to R.
I have a dataset ITEproduction_2014.2015 and I only want to see data points between 4 and 39 days. Currently I use 2 separate lines to create a subset.
Can I do this in 1 line? Something like Date.Difference > 3 and < 40?
ITEproduction_2014.2015 <- subset(ITEproduction_2014.2015,Date.Difference>3)
ITEproduction_2014.2015 <- subset(ITEproduction_2014.2015,Date.Difference<40)
thanks in advance,
Dirk

A little googling would have solved your problem; for example, read about logical operators in R.
Like this?
ITEproduction_2014.2015<-subset(ITEproduction_2014.2015,Date.Difference>3 & Date.Difference<40)

Avoid using subset altogether if you can. See the warning in the help file:
?subset()
If you like the syntax of subset(), and prefer it to standard subsetting functions like [, you can use dplyr:
library(dplyr)
ITEproduction_2014.2015 %>%
dplyr::filter(
Date.Difference > 3,
Date.Difference < 40
)
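For completeness, the "standard subsetting" route with [ could look roughly like the sketch below. The data frame production and its id column are made up for illustration; only Date.Difference comes from the question.

```r
# Hypothetical stand-in for ITEproduction_2014.2015
production <- data.frame(
  id = 1:6,
  Date.Difference = c(2, 4, 20, 39, 40, NA)
)

# Keep rows with 4 to 39 days; which() drops rows where the condition is NA
keep <- which(production$Date.Difference > 3 & production$Date.Difference < 40)
cleaned <- production[keep, ]
cleaned$id
```

Without which(), a logical index containing NA returns an all-NA row for each missing value, whereas subset() and filter() silently drop those rows.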

Related

How to subset and sum a column in R in one line of code

I'm wondering how to combine subsetting my data and summing a column within that subset data in one line. I can easily do it in two, but I have so many dataframes to do this for, so I want to minimize the copy/pasting/slight editing for each dataset.
Here are the two lines of code I know I can do:
sumE_df201 = subset(df201, t>=55)
test = sum(sumE_df201)$e
I tried to combine them into one as such, and received the following error:
sumE_df201 = sum(subset(df201, t>=55))$e
>Error in sum(subset(df201, t >= 55))$e :
>$ operator is invalid for atomic vectors
If anyone has insight on how to do this properly, I would appreciate it. I'm sure in the end, me doing two lines and copying it for all dataframes would take less time (I edit them with ctrl+f and replace, when I can, but still). But I am trying to improve my R literacy.
Example/junk data here:
t= 1:121
e= rnorm(t, mean=t, sd=1)
junk1 = 301:421
junk2 = 501:621
df201 = cbind(t, e, junk1, junk2)
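The reason that $ does not work is that df201 was built with cbind(), so it is a matrix, and subset(df201, t>=55) returns a matrix too. A matrix is an atomic vector, and $ is not defined for atomic vectors (in the one-line attempt, $ is additionally applied to the single number returned by sum()). You can see more help by ?"$".
One way is to use indexing:
sum(subset(df201, t>=55)[, "e"])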
# 5897.988
Another way is converting it to a data frame and then using $
sum(as.data.frame(subset(df201, t>=55))$e)
# 5897.988
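An option with dplyr, converting the matrix to a data frame first because filter() has no method for matrices:
library(dplyr)
as.data.frame(df201) %>%
  filter(t >= 55) %>%
  pull(e) %>%
  sum()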
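Since the question mentions many data frames, one way to avoid the copy/paste is to wrap the step in a small function and apply it over a list. This is only a sketch: make_junk(), sum_e() and the name df202 are invented here, and the cutoff of 55 is taken from the question.

```r
set.seed(1)  # only so the junk data is repeatable

# Hypothetical helper that builds a matrix shaped like df201
make_junk <- function() {
  t <- 1:121
  cbind(t = t, e = rnorm(length(t), mean = t, sd = 1))
}
datasets <- list(df201 = make_junk(), df202 = make_junk())

# Sum column e over rows where t >= cutoff; accepts matrices and data frames
sum_e <- function(d, cutoff = 55) {
  d <- as.data.frame(d)
  sum(d$e[d$t >= cutoff])
}

# One named total per dataset
totals <- sapply(datasets, sum_e)
totals
```

Keeping the datasets in a named list means adding another one only changes the list, not the summing code.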

dplyr's removed function? Calculating a mean for several columns in the data frame in R

I would like to calculate the mean of several columns in my data frame. I wanted to select them using the ‘:’ in the dplyr package. The variable names are: Mcheck5_1_1, Mcheck5_2_1, ..., Mcheck5_8_1 (so there are 8 in total). I learnt that I can select them by
select(df, Mcheck5_1_1:Mcheck5_8_1)
in an online course taught by Roger Peng (https://www.youtube.com/watch?v=aywFompr1F4&feature=youtu.be) at 4min33sec.
However, R complained:
Error in select(df, Mcheck5_1_1:Mcheck5_8_1) :
unused argument (Mcheck5_1_1:Mcheck5_8_1)
I also couldn’t find anyone else using this ‘:’ feature on Google. I suspect this feature no longer exists?
Right now, I use the following code to solve the problem:
idx = grep("Mcheck5_1_1", names(df))
df$avg = rowMeans(df[, idx:(idx+7)], na.rm = TRUE)
(I hesitate to index those columns by number (e.g., df[138]) for fear that their positions might vary.)
However, I think this solution is not elegant enough. Could you advise me whether there is another way to do it? Is it still possible to use the colon (:) method to index my variables nowadays, and did I just make a mistake in my code? Thanks all.
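Try dplyr::select(df, Mcheck5_1_1:Mcheck5_8_1). It is likely a package conflict: another loaded package that also exports a select() function (MASS is a common one) may be masking dplyr's version. See here for a related question.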
To calculate the mean for each of those columns:
library(magrittr)
library(purrr)
df %>%
dplyr::select(Mcheck5_1_1:Mcheck5_8_1) %>%
map(mean)
Maybe contains() can help, since it matches column names by substring. In your case it would be: select(df, contains("Mcheck5_"))
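If the goal is the per-row average from the question (df$avg) rather than one mean per column, a base R sketch could look up the first and last column by name and take rowMeans() over that range. The toy df below is invented; only the Mcheck5_*_1 naming pattern comes from the question.

```r
# Hypothetical data frame with the eight Mcheck5 columns plus unrelated ones
df <- data.frame(id = 1:2)
for (k in 1:8) df[[paste0("Mcheck5_", k, "_1")]] <- c(k, k + 1)
df$other <- c("a", "b")

# Locate the range by name, so it survives columns being added or moved
first <- match("Mcheck5_1_1", names(df))
last  <- match("Mcheck5_8_1", names(df))

# Row-wise mean over the eight columns
df$avg <- rowMeans(df[, first:last], na.rm = TRUE)
df$avg
```

This assumes the eight columns sit next to each other; if they might not, selecting with grep("^Mcheck5_", names(df)) would be the safer choice.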

Subselection of a variable

I have a problem with selecting a variable that should contain a certain range of values. I want to split up my variable into 3 categories, namely small, medium and big. A piece of context: I have a variable named obj_hid_WOONOPP, which is the size in m2 and ranges from 16 to 375. My dataset is called datalogitvar.
I'm sorry I have no reproducible code. But since I think it's a rather simple question, I hope it can be answered nonetheless. The code that I'm using is as follows:
datalogitvar$size_small<- as.numeric(obj_hid_WOONOPP>="15" & obj_hid_WOONOPP<="75" )
datalogitvar$size_medium<- as.numeric(obj_hid_WOONOPP>="76" & obj_hid_WOONOPP<="100" )
datalogitvar$size_large<- as.numeric(obj_hid_WOONOPP>="101")
When I run this, I do get a result, just not the result I'm hoping for. For example, the small category also contains very high numbers. It seems that (since I define "75") it also takes values of "175" since it contains "75". I've been thinking about it and I feel it reads my data as text and not as numbers. However, I do say as.numeric, so I'm a bit confused. Can someone explain how I can make sure I create these 3 variables with the proper ranges? I feel I'm close but the result is useless so far.
Thank you so much for helping.
For a question like this you can replicate your problem with a publicly available dataset like mtcars.
And regarding your code
1) you will need to name the dataset for DATASET$obj_hid_WOONOPP on the right side of your code.
2) Why are you using quotes around your numeric values? These quotes prevent the numbers from being treated as numbers. They are instead treated as string values.
I think you want to use something like the code I've written below.
mtcars$mpg_small <- as.numeric(mtcars$mpg >= 15 & mtcars$mpg <= 20)
mtcars$mpg_medium <- as.numeric(mtcars$mpg > 20 & mtcars$mpg <= 25)
mtcars$mpg_large <- as.numeric(mtcars$mpg > 25)
Just to illustrate your problem:
a <- "75"
b <- "175"
a > b
# TRUE, because strings are compared character by character and "7" sorts after "1"
a < b
# FALSE
Strings don't compare as you'd expect them to.
Two ideas come to mind, though an example of code would be helpful.
First, look into the documentation for cut(), which can be used to convert a numeric vector into a factor based on cut-points that you set.
Second, as @MrFlick points out, your code could be rewritten so that as.numeric() is run on a character vector containing strings that you want to convert to numeric values, and THEN the Boolean comparisons such as > or & are performed.
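A minimal sketch of the cut() idea, assuming the boundaries from the question (up to 75 is small, 76 to 100 is medium, above 100 is large). The vector woonopp is made-up data standing in for obj_hid_WOONOPP.

```r
# Size in m2; hypothetical values within the 16 to 375 range from the question
woonopp <- c(16, 75, 76, 100, 101, 375)

# Intervals are (15,75], (75,100], (100,Inf]; right-closed is the default
size <- cut(woonopp,
            breaks = c(15, 75, 100, Inf),
            labels = c("small", "medium", "large"))

table(size)
```

This gives a single factor column instead of three 0/1 columns; most modelling functions in R expand a factor into dummy variables on their own.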
To build on @Joe's answer:
mtcars$mpg_small <- (as.numeric(mtcars$mpg) >= 15 &
(as.numeric(mtcars$mpg) <= 20))
Also be careful: if your vector of strings obj_hid_WOONOPP contains some values that cannot be coerced into numerics, they will become NA (with a warning).
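To see what that coercion does in practice, here is a small sketch with an invented character vector raw; the "n/a" entry stands for any value that is not a number.

```r
# Hypothetical character column with one value that is not a number
raw <- c("75", "175", "n/a")

# as.numeric() warns and yields NA for "n/a"; suppressWarnings() is optional
num <- suppressWarnings(as.numeric(raw))

num >= 15 & num <= 75      # TRUE FALSE NA
sum(is.na(num))            # how many values failed to convert
```

Counting the NAs right after the conversion is a cheap way to check whether the column was as clean as expected.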

Is it possible to find the indexes of blacklisted dates in a sample vector of dates?

I have a tricky problem in R that I just can't seem to solve without resorting to a loop.
I start with a vector of timeDates:
library(timeDate)
dates <- timeDate(c("2014-01-01","2008-01-02","2008-01-03","2008-01-04"))
I would like to find the indexes of any dates in a preset blacklist:
dateBlacklist <- timeDate(c("2008-01-02","2008-01-03"))
The result would be something like:
indexesOfBlacklistedDates <- c(2,3)
An ugly solution:
indexesOfBlacklistedDates <- which(timeDate:::as.character.timeDate(dates) %in% timeDate:::as.character.timeDate(dateBlacklist))
Another, not so ugly, solution (similar to @agstudy's answer)
which(as.character(dates) %in% as.character(dateBlacklist))
Elegant solution :)
match(as.character(dateBlacklist), as.character(dates))
[1] 2 3
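The which(... %in% ...) and match() versions agree on the example data, but they can differ when a date is repeated or when a blacklist entry is missing. A sketch with plain character strings standing in for the as.character() output (the extra values are invented):

```r
# Plain character dates stand in for the converted timeDate vectors
dates <- c("2014-01-01", "2008-01-02", "2008-01-03", "2008-01-02")
dateBlacklist <- c("2008-01-02", "2008-01-03", "2010-05-05")

# Every position in dates whose value is blacklisted
which(dates %in% dateBlacklist)

# First position of each blacklist entry, NA when the entry is absent
match(dateBlacklist, dates)
```

So if the aim is "all indexes of blacklisted dates", the %in% form is probably the closer fit; match() answers "where does each blacklist entry first occur".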

R- Please help. Having trouble writing for loop to lag date

I am attempting to write a for loop which will take subsets of a dataframe by person id and then lag the EXAMDATE variable by one for comparison. So a given row will have the original EXAMDATE and also a variable EXAMDATE_LAG which will contain the value of the EXAMDATE one row before it.
for (i in length(uniquerid))
{
temp <- subset(part2test, RID==uniquerid[i])
temp$EXAMDATE_LAG <- temp$EXAMDATE
temp2 <- data.frame(lag(temp, -1, na.pad=TRUE))
temp3 <- data.frame(cbind(temp,temp2))
}
It seems that I am creating the new variable just fine but I know that the lag won't work properly because I am missing steps. Perhaps I have also misunderstood other peoples' examples on how to use the lag function?
So that this can be fully answered: there are a handful of things wrong with your code. Lucaino has pointed one out. Each time through your loop you create (or overwrite) temp, temp2, and temp3, so you'll be left with only the output of the last iteration.
However, this isn't something that needs a loop. Instead you can make use of the vectorized nature of R:
> x <- 1:10
> c(x[-1], NA)
[1] 2 3 4 5 6 7 8 9 10 NA
So if you combine that notion with a library like plyr that splits data nicely you should have a workable solution. If I've missed something or this doesn't solve your problem, please provide a reproducible example.
library(plyr)
myLag <- function(x) {
c(x[-1], NA)
}
ddply(part2test, .(RID), transform, EXAMDATE_LAG=myLag(EXAMDATE))
You could also do this in base R using split or the data.table package using its by= argument.
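A base R version might look like the sketch below. Note that c(x[-1], NA) gives each row the value of the row after it; since the question asks for the value one row before, this sketch shifts the other way. The data and the helper lag_one() are invented, and exam dates are shown as plain numbers to keep the example short.

```r
# Hypothetical data: two persons (RID) with exam days as plain numbers
part2test <- data.frame(
  RID      = c(1, 1, 1, 2, 2),
  EXAMDATE = c(10, 20, 30, 5, 15)
)

# Previous-row value: shift down by one and pad the front with NA
lag_one <- function(x) c(NA, head(x, -1))

# ave() applies the function within each RID group and keeps the row order
part2test$EXAMDATE_LAG <- ave(part2test$EXAMDATE, part2test$RID, FUN = lag_one)
part2test
```

The first exam of each person gets NA because there is no earlier row within that RID. The rows are assumed to be sorted by date within each person already.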

Resources