I have a dataframe, and I'd like to subset by picking out all the rows that conform to a condition on the factor value for year:
subset_df <- df[ (which(df$year < '1972') || (df$year > '1982')),]
My problem is that the line above returns the whole dataframe, df.
Forgive me if this is too basic or simple, but I cannot figure out where the flaw lies.
I'm suspecting there is something regarding || which I don't understand, or my other theory is that arr.ind=T somehow plays a role. Either that, or the nature of the which() function is a little more complicated than I think it is.
If anyone has any insight, I'd greatly appreciate it. Thanks for your time.
PS: yes, this works as expected and returns the correct subset; ie, there isn't a flaw in my dataframe:
test_df <- df[ (which(df$year < '1972')), ]
as does its counterpart for 1982.
Note that from the helpfile you can read (See ?"|"):
For |, & and xor a logical or raw vector... and...For ||, && and isTRUE, a length-one logical vector.
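Therefore you may want to change your || to |, and I think which() is not required here. In the R versions where || accepted longer vectors it only looked at the first element, so your condition collapsed to a single TRUE, and df[TRUE, ] returns every row; since R 4.3.0 such a call is an error instead.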
subset_df <- df[ df$year < '1972' | df$year > '1982',]
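To make the element-wise behaviour concrete, here is a minimal sketch on made-up data. The data frame and its values are hypothetical, and year is assumed to be stored as character strings, as the quoted comparisons in the question suggest.

```r
# Hypothetical data; 'year' is assumed to be character, not numeric
df <- data.frame(year  = c("1970", "1975", "1980", "1985"),
                 value = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)

# Element-wise OR: one TRUE/FALSE per row, usable directly as a row index
keep <- df$year < "1972" | df$year > "1982"
subset_df <- df[keep, ]

# Comparing as numbers avoids relying on string ordering
yr <- as.numeric(df$year)
keep_num <- yr < 1972 | yr > 1982
```

Both keep and keep_num select the first and last rows here; converting to numeric is probably the safer habit if the years are really numbers.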
How can I convert the following code from Stata to R?
gen a01sb=cond(b01~=1 & c01~=1, a01, 0)
I know that it is sorted by and includes an if-else-condition but I don't know how to code this in R.
Thanks in advance!
In Stata both != and ~= mean "not equals", but in R only != is equivalent. The ifelse function is usually applied to data frame columns, but it works on any vectors, with vectorized logical operators such as & used in the first argument:
a01sb <- ifelse( (b01 != 1)& (c01 != 1), a01, 0) # inner parens used for clarity
(There would be no sorting. Sorting would not make a great deal of sense if trying to keep results associated with the vectors on which the calculations are made.)
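A small sketch of the translation on made-up vectors (the values are hypothetical stand-ins for the Stata variables). The last lines show one difference that may be worth checking on real data: in R a comparison involving NA yields NA, so ifelse() returns NA for that element.

```r
# Hypothetical vectors standing in for the Stata variables
a01 <- c(5, 6, 7, 8)
b01 <- c(1, 2, 2, 1)
c01 <- c(2, 1, 3, 1)

# Keep a01 where both b01 and c01 differ from 1, otherwise use 0
a01sb <- ifelse(b01 != 1 & c01 != 1, a01, 0)
a01sb
# → 0 0 7 0

# A missing value in the condition propagates to the result
with_na <- ifelse(NA != 1 & 2 != 1, 5, 0)
with_na
# → NA
```

If missing values should count as "not equal to 1", the condition would need an explicit is.na() check.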
I am working on a data frame and have extracted one of the columns, which holds hour data from 0 to 23. I am adding one more column for the part of the day, based on the hour. I executed the for loop below but I am getting an error. Can somebody tell me what is wrong with the syntax below and how to correct it?
for(i in data$Requesthours) {
if(data$Requesthours>=0 & data$Requesthours<3) {
data$Partoftheday <- "Midnight"
} else if(data$Requesthours>=3 & data$Requesthours<6) {
data$Partoftheday <- "Early Morning"
} else if(data$Requesthours>=6 & data$Requesthours<12) {
data$Partoftheday <- "Morning"
} else if(data$Requesthours>=12 & data$Requesthours<16) {
data$Partoftheday <- "Afternoon"
} else if(data$Requesthours>=16 & data$Requesthours<20) {
data$Partoftheday <- "Evening"
} else if(data$Requesthours>=20 & data$Requesthours<=23) {
data$Partoftheday <- "Night"
}
}
Still waiting for you to post your bug, but here's an R coding tip which will reduce this to a one-liner (and bypass your bug). Also it'll be way faster (it's vectorized, unlike your for-loop and if-else-ladder).
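data$Partoftheday <- as.character(
cut(data$Requesthours,
# intervals are right-closed (a, b] by default, so each break is the
# last whole hour of its range: (-1,2], (2,5], (5,11], ...
breaks=c(-1,2,5,11,15,19,23),
labels=c('Midnight', 'Early Morning', 'Morning', 'Afternoon', 'Evening', 'Night')
)
)
# see Notes on cut() at bottom to explain this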
Now back to your bug: you're confused about how to iterate over a column in R. for(i in data$Requesthours) iterates over the values in the column, not over row indices, so you're confusing indices with data values. Also, you make i an iterator but never refer to i anywhere inside the loop; you refer back to data$Requesthours, which is an entire column, not a single value (how do the loop contents know which value you're referring to? They don't). You could use an ugly explicit index loop like for (i in seq_len(nrow(data))) ... or for (i in seq_along(data$Requesthours)) ... and then access data$Requesthours[i], but please don't. Because...
One of the huge idiomatic things about learning R is that, generally, when you write a for-loop to iterate over a dataframe or a df column, you should stop to think (or research) whether there is a vectorized function in R that does what you want. cut, ifelse, sum, mean, max, diff, sd, ... all work on whole vectors, as do all the arithmetic and logical operators. 'Vectorized' means you can feed them an entire (column) vector as input, and functions like cut and ifelse produce an entire (column) vector as output which you can directly assign to your new column. Very simple, very fast, very powerful. Generally beats the pants off for-loops. Please read R-intro.html, esp. Section 2 about vector assignment.
And if you can't find or write a vectorized fn, there's also the *apply family of functions apply, sapply, lapply, ... to apply any arbitrary function you want to a list/vector/dataframe/df column.
Notes on cut()
cut(data, breaks, labels, ...) is a function where data is your input vector (e.g. your selected column data$Requesthours), breaks is a vector of integer or numeric, and labels is a vector to name the output. The length of labels is one less than the length of breaks, since 7 breaks divide your data into 6 ranges.
We want the output vector to be string, not categorical, hence we apply as.character() to the output from cut()
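Since your first if-else comparison is (hr>=0 & hr<3), we have to fiddle the lowest cutoff from 0 to -1, otherwise hr==0 would wrongly give NA: cut() intervals are right-closed, (a, b], by default, so the lowest break itself is excluded and every other break value belongs to the interval it ends. (The parameter include.lowest=TRUE would also keep hr==0, but it only closes the lowest interval and does not change how the interior break values are assigned; right=FALSE switches all intervals to left-closed, [a, b).)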
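As a sketch of the right=FALSE alternative mentioned in these notes, the following checks cut() against the conditions of the original if-else ladder for every whole hour. The vector hours and the name part_labels are made up for the illustration.

```r
hours <- 0:23  # every whole hour of the day

part_labels <- c("Midnight", "Early Morning", "Morning",
                 "Afternoon", "Evening", "Night")

# right = FALSE makes the intervals left-closed: [0,3), [3,6), ..., [20,24),
# which matches conditions of the form hr >= a & hr < b
part <- as.character(cut(hours,
                         breaks = c(0, 3, 6, 12, 16, 20, 24),
                         labels = part_labels,
                         right  = FALSE))

# Spot-check the boundary hours
part[hours %in% c(0, 2, 3, 5, 6, 23)]
# → "Midnight" "Midnight" "Early Morning" "Early Morning" "Morning" "Night"
```

With left-closed intervals the breaks can be written exactly as they appear in the ladder, and no hour ends up as NA.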
if(data$Requesthours>=0 & data$Requesthours<3) (and other similar ifs) make no sense since data$Requesthours is a vector. You should try either of the following:
Solution 1:
for(i in seq_along(data$Requesthours)) {
if(data$Requesthours[i]>=0 & data$Requesthours[i]<3)
data$Partoftheday[i] <- "Midnight"
....
}
This solution is slow like hell and really ugly, but it would work.
Solution 2:
data$Partoftheday[data$Requesthours>=0 & data$Requesthours<3] <- "Midnight"
...
Solution 3 = what was proposed by smci
Please help me!
I have quite big data set containing bank accounts
It is organised in a following way:
V1 - register number of a bank
V2 - date of account value record
V3 - account number
all remaining V-s are for values themselves (in cur, metals, etc)
I need to filter by account number, keeping every column of the table but only the rows for specific account numbers. Here is the code I use:
filelist = list.files(pattern = ".txt")
datalist = lapply(filelist, function(x)read.table(x, header=FALSE, sep = ";"))
all_data = do.call("rbind", datalist)
r_d <- rename(all_data, c("V1"="Number", "V2"="Dates", "V3"="Account"))
r_d$Account <- as.character(r_d$Account)
f_d <- filter(all_data, r_d$Account >= 42301 & r_d$Account <= 42315 |
r_d$Account >= 20202 & r_d$Account <= 20210 |
r_d$Account == 98010 | r_d$Account == 98015)
The problem is that the output of this code is a table containing only NAs; everything becomes NA, even though those account numbers exist, and I am absolutely sure of that.
If I use Account in filter instead of r_d$Account, R writes me that object Account does not exist. Which I also do not understand.
Please, correct me.
There are several things wrong with your code. The reason you are getting NAs is that you are passing NULLs all over the place. Did you ever look at r_d$Account? When you see problems in your code, you should start by going through things piecemeal, step by step; in this case you'll see that r_d$Account gives you NULL. Why? Because you did not rename the columns correctly. colnames(r_d) will be revealing.
First, rename either does non-standard evaluation with un-quoted arguments, or rename_ takes a vector of character=character pairs. These might work (I can't know for certain, since I'm not about to transcribe your image of data ... please provide copyable output from dput next time!):
# non-standard evaluation
rename(all_data, Number=V1, Dates=V2, Account=V3)
# standard-evaluation #1:
rename_(all_data, Number="V1", Dates="V2", Account="V3")
# standard-evaluation #2
rename_(all_data, .dots = c("Number"="V1", "Dates"="V2", "Account"="V3"))
From there, if you step through your code, you should see that r_d$Account is no longer NULL.
Second, is there a reason you create r_d but still reference all_data? There are definitely times when you need to do this kind of thing; this is not one of them, as it is too prone to problems (e.g., if the row order or dimensions of one of them change).
Third, because you convert $Account to character, it is really inappropriate to use inequality comparisons. Though it is certainly legal to do so ("1" < "2" is TRUE), it will run into problems, such as "11" < "2" is also TRUE, and "3" < "22" is FALSE. I'm not saying that you should avoid conversion to string; I think it is appropriate. Your use of account ranges is perplexing: an account number should be categorical, not ordinal, so selecting a range of account numbers is illogical.
Fourth, even assuming that account numbers should be ordinal and ranges make sense, your use of filter can be improved, but only if you either (a) accept comparisons of stringified numbers (where "3" > "22"), or (b) keep them as integers. First, you should not be referencing r_d$ within an NSE dplyr function. (Edit: you also need to group your logic with parentheses.) This is a literal translation of your code:
f_d <- filter(r_d, (Account >= 42301 & Account <= 42315) |
(Account >= 20202 & Account <= 20210) |
Account == 98010 | Account == 98015)
You can make this perhaps more readable with:
f_d <- filter(r_d,
Account %in% c(98010, 98015) |
between(Account, 42301, 42315) |
between(Account, 20202, 20210)
)
Perhaps a better way to do it, assuming $Account is character, would be to determine which accounts are appropriate based on some other criteria (open date, order date, something else from a different column), and once you have a vector of account numbers, do
filter(r_d,
Account %in% vector_of_interesting_account_numbers)
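To illustrate the string-comparison pitfall and option (b) of keeping the accounts as integers, here is a sketch on made-up rows. It uses plain base-R subsetting rather than dplyr so that it runs without any package; the data frame contents are hypothetical, and the ranges are the ones from the question.

```r
# String comparison goes character by character, not by numeric value
s1 <- "11" < "2"   # TRUE
s2 <- "3" < "22"   # FALSE
n1 <- 11 < 2       # FALSE

# Hypothetical data with the renamed columns
r_d <- data.frame(Number  = c(1, 1, 2, 2, 3),
                  Account = c("42305", "20205", "98010", "50000", "4230"),
                  stringsAsFactors = FALSE)

# Compare as integers so that ranges mean what they say
acct <- as.integer(r_d$Account)
keep <- acct %in% c(98010L, 98015L) |
        (acct >= 42301L & acct <= 42315L) |
        (acct >= 20202L & acct <= 20210L)

f_d <- r_d[keep, ]   # rows for accounts 42305, 20205 and 98010
```

The same logical expression should carry over to filter() once the column is named Account and referenced without the r_d$ prefix.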
I have data, which I want to do a PCA with. In order to do so I want to log the data because the range of my data is very high (from 0 to four digits). (if you have a better method I'd also be interested :)
The data contains zero values, which should of course be excluded from the log. So How do I do this in R-Cran?
What I do is:
logmydata<-log(mydata)
This also logs the zero values, which returns -Inf, and I don't want that!
I think this should be very easy, but maybe because it is so basic I couldn’t find it. I'm just a beginner, sorry for that!
All the best!
Lukas
do you just want to make the zeros NAs?
mydata[ mydata == 0] <-NA
or remove them from the analysis altogether
nozeromydata <- mydata[ mydata != 0 ]
if you don't like either of these suggestions, I say:
log( mydata + 1 )
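A quick sketch of how the NA and log(x + 1) options behave on a made-up vector (mydata here is hypothetical; real data would typically be a matrix or data frame, where the same expressions apply element-wise). log1p() is base R's built-in for log(1 + x).

```r
mydata <- c(0, 1, 10, 1000)   # hypothetical values spanning several magnitudes

# Option 1: turn zeros into NA before taking the log
with_na <- mydata
with_na[with_na == 0] <- NA
logged_na <- log(with_na)     # first element is NA instead of -Inf

# Option 2: log(x + 1), which maps 0 to 0 and keeps every value finite
logged_p1 <- log1p(mydata)
```

The NA route leaves missing values that the PCA step then has to deal with, which is one reason log1p() may be the more convenient choice here.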
I am trying to calculate a simple ratio using data.table. Different files have different tmax values, so that is why I need ifelse. When I debug this, the dt looks good. The tmaxValue is a single value (the first "t=60" encountered in this case), but t0Value is all of the "t=0" values in dt.
summaryDT <- calculate_Ratio(reviewDT[,list(Result, Time), by=key(reviewDT)])
calculate_Ratio <- function(dt){
tmaxValue <- ifelse(grepl("hhep", inFile, ignore.case = TRUE),
dt[which(dt[,Time] == "t=240min"),Result],
ifelse(grepl("hlm",inFile, ignore.case = TRUE),
dt[which(dt[,Time] == "t=60"),Result],
dt[which(dt[,Time] == "t=30"),Result]))
t0Value <- dt[which(dt[,Time] == "t=0"),Result]
return(dt[,Ratio:=tmaxValue/t0Value])
}
What I am getting out is the Result for tmaxValue divided by all of the Results for all of the t0Values, but what I want is a single ratio for each unique by.
Thanks for the help.
You didn't provide a reproducible example, but typically using ifelse is the wrong thing to do.
Try using if(...) ... else ... instead.
ifelse(test, yes, no) acts very weird: It produces a result with the attributes and length from test and the values from yes or no.
...so in your case you should get something without attributes and of length one - and that's probably not what you wanted, right?
[UPDATE] ...Hmm or maybe it is since you say that tmaxValue is a single value...
Then the problem isn't in calculating tmaxValue? Note that ifelse is still the wrong tool for the job...
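The length behaviour of ifelse() described above can be seen in a minimal sketch. The vector yes_values is made up; it stands in for any multi-element result, such as several matching rows.

```r
yes_values <- c(10, 20, 30)   # hypothetical multi-element result

# ifelse() returns something shaped like 'test', which has length one here,
# so only the first element of yes_values survives
from_ifelse <- ifelse(TRUE, yes_values, 0)
from_ifelse
# → 10

# if/else returns the chosen object unchanged
from_if <- if (TRUE) yes_values else 0
from_if
# → 10 20 30
```

This would be consistent with tmaxValue coming out as a single value (the first match) while t0Value, computed without ifelse(), keeps all of its values.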