How can I convert the following code from Stata to R?
gen a01sb=cond(b01~=1 & c01~=1, a01, 0)
I know that it is sorted by and includes an if-else-condition but I don't know how to code this in R.
Thanks in advance!
In Stata both != and ~= mean "not equal", but in R only != does. ifelse() is often applied to data frame columns, but it works on any vectors, and its first argument can combine vectorized logical operators such as &:
a01sb <- ifelse((b01 != 1) & (c01 != 1), a01, 0) # inner parens used for clarity
(There would be no sorting. Sorting would not make a great deal of sense if trying to keep results associated with the vectors on which the calculations are made.)
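A minimal sketch of the same translation on data frame columns. The data frame d and its values are made up for illustration; only the column names come from the question.

```r
# Hypothetical toy data; the column names mirror the question.
d <- data.frame(a01 = c(10, 20, 30, 40),
                b01 = c(1, 2, 2, 1),
                c01 = c(2, 1, 2, 1))

# Keep a01 where neither b01 nor c01 equals 1, otherwise use 0.
d$a01sb <- ifelse(d$b01 != 1 & d$c01 != 1, d$a01, 0)
print(d$a01sb)
```

Rows where the condition itself evaluates to NA come back as NA from ifelse(), so missing values in b01 or c01 may need separate handling.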
I have a problem selecting a range of values from a variable. I want to split my variable into 3 categories: small, medium and big. Some context: I have a variable named obj_hid_woonopp, which is a size in m2 and ranges from 16 to 375. My dataset is called datalogitvar.
I'm sorry I have no reproducible code. But since I think it's a rather simple question I hope it can be answered nonetheless. The code that I'm using is as follows
datalogitvar$size_small<- as.numeric(obj_hid_WOONOPP>="15" & obj_hid_WOONOPP<="75" )
datalogitvar$size_medium<- as.numeric(obj_hid_WOONOPP>="76" & obj_hid_WOONOPP<="100" )
datalogitvar$size_large<- as.numeric(obj_hid_WOONOPP>="101")
When I run this, I do get a result, just not the result I'm hoping for. For example, the small category also contains very high numbers. It seems that (since I define "75") it also takes values of "175" since it contains "75". I've been thinking about it and I feel it reads my data as text and not as numbers. However, I do say as.numeric, so I'm a bit confused. Can someone explain how I can make sure I create these 3 variables with the proper range? I feel I'm close but the result is useless so far.
Thank you so much for helping.
For a question like this you can replicate your problem with a publicly available dataset like mtcars.
And regarding your code
1) you will need to name the dataset for DATASET$obj_hid_WOONOPP on the right side of your code.
2) Why are you using quotes around your numeric values? These quotes prevent the numbers from being treated as numbers. They are instead treated as string values.
I think you want to use something like the code I've written below.
mtcars$mpg_small <- as.numeric(mtcars$mpg >= 15 & mtcars$mpg <= 20)
mtcars$mpg_medium <- as.numeric(mtcars$mpg > 20 & mtcars$mpg <= 25)
mtcars$mpg_large <- as.numeric(mtcars$mpg > 25)
Just to illustrate your problem:
a <- "75"
b <- "175"
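a > b # TRUE, even though 75 > 175 is FALSE for numbers
a < b # FALSE, even though 75 < 175 is TRUE for numbers
Strings are compared character by character, so they don't compare as you'd expect numbers to.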
Two ideas come to mind, though an example of code would be helpful.
First, look into the documentation for cut(), which can be used to convert numeric vector into factors based on cut-points that you set.
Second, as @MrFlick points out, your code could be rewritten so that as.numeric() is run on the character vector containing the strings that you want to convert to numeric values, and THEN the Boolean comparisons such as > or & are performed.
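The first idea, cut(), might look like the sketch below. The size vector is invented sample data, and the cut-points 75 and 100 are taken from the ranges in the question; adjust them if the intended boundaries differ.

```r
# Hypothetical sizes in m2; the real column would be datalogitvar$obj_hid_woonopp.
size <- c(16, 75, 76, 100, 101, 375)

# With the default right = TRUE the intervals are (15, 75], (75, 100], (100, Inf].
size_cat <- cut(size,
                breaks = c(15, 75, 100, Inf),
                labels = c("small", "medium", "big"))
print(table(size_cat))
```

This gives a single factor column rather than three 0/1 columns, which is often more convenient for modelling functions that expand factors into dummy variables themselves.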
To build on @Joe's answer:
mtcars$mpg_small <- (as.numeric(mtcars$mpg) >= 15 &
(as.numeric(mtcars$mpg) <= 20))
Also be careful, if your vector of strings obj_hid_WOONOPP contains some values that cannot be coerced into numerics, they will become NA.
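A small sketch of that coercion caveat, using an invented character vector (the value "n/a" is only an example of something that cannot be parsed as a number):

```r
x <- c("16", "75", "175", "n/a")          # hypothetical character column

# Comparing strings: "175" sorts before "75" because "1" sorts before "7".
print("175" <= "75")

# Convert first, then compare; "n/a" cannot be parsed and becomes NA.
num <- suppressWarnings(as.numeric(x))
small <- as.numeric(num >= 15 & num <= 75)
print(num)
print(small)
```

The NA carries through the comparison, so the resulting indicator is NA for that row rather than 0.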
I am working on a data frame and have extracted one of the columns, which holds hour data from 0 to 23. I am adding one more column for the part of the day, based on the hour. I ran the for loop below but I am getting an error. Can somebody tell me what is wrong with the syntax below and how to correct it?
for(i in data$Requesthours) {
if(data$Requesthours>=0 & data$Requesthours<3) {
data$Partoftheday <- "Midnight"
} else if(data$Requesthours>=3 & data$Requesthours<6) {
data$Partoftheday <- "Early Morning"
} else if(data$Requesthours>=6 & data$Requesthours<12) {
data$Partoftheday <- "Morning"
} else if(data$Requesthours>=12 & data$Requesthours<16) {
data$Partoftheday <- "Afternoon"
} else if(data$Requesthours>=16 & data$Requesthours<20) {
data$Partoftheday <- "Evening"
} else if(data$Requesthours>=20 & data$Requesthours<=23) {
data$Partoftheday <- "Night"
}
}
Still waiting for you to post your bug, but here's an R coding tip which will reduce this to a one-liner (and bypass your bug). Also it'll be way faster (it's vectorized, unlike your for-loop and if-else-ladder).
data$Partoftheday <- as.character(
  cut(data$Requesthours,
      breaks=c(0,3,6,12,16,20,24),
      labels=c('Midnight', 'Early Morning', 'Morning', 'Afternoon', 'Evening', 'Night'),
      right=FALSE
  )
)
# see Notes on cut() at bottom to explain this
Now back to your bug: You're confused about how to iterate over a column in R. for(i in data$Requesthours) iterates over the values in the column, but you're confusing indices with data values. Also you make i an iterator, but then you don't refer to i anywhere inside the loop; you refer back to data$Requesthours, which is an entire column, not a single value (how do the loop contents know which value you're referring to? They don't). You could use an ugly explicit index-loop like for (i in 1:nrow(data)) ... or for (i in seq_len(nrow(data))) ... then access data$Requesthours[i], but please don't. Because...
One of the huge idiomatic things about learning R is that whenever you write a for-loop to iterate over a dataframe or a df column, you should stop to think (or research) whether there is a vectorized function in R that does what you want. cut, ifelse, diff, ... are vectorized, as are all the arithmetic and logical operators, and summaries like sum, mean, max, sd take a whole vector at once. 'Vectorized' means you can feed them an entire (column) vector as input, and they produce an entire (column) vector as output which you can directly assign to your new column. Very simple, very fast, very powerful. Generally beats the pants off for-loops. Please read R-intro.html, esp. Section 2 about vector assignment.
And if you can't find or write a vectorized fn, there's also the *apply family of functions (apply, sapply, lapply, ...) to apply any arbitrary function you want to a list/vector/dataframe/df column.
Notes on cut()
cut(x, breaks, labels, ...) is a function where x is your input vector (e.g. your selected column data$Requesthours), breaks is a numeric vector of cut points, and labels is a vector naming the output levels. The length of labels is one less than the length of breaks, since 7 breaks divide your data into 6 ranges.
We want the output vector to be string, not categorical, hence we apply as.character() to the output from cut().
Since your if-else comparisons are of the form (hr>=0 & hr<3), each range has to include its lower bound and exclude its upper bound, which is what right=FALSE does. With the default right=TRUE the ranges would be (0,3], (3,6], ..., so hr==0 would wrongly give NA, hr==3 would be 'Midnight', hr==6 would be 'Early Morning', etc. (There is a parameter include.lowest=TRUE/FALSE, but it's not needed here: with right=FALSE it would only add the value 24 to the last range.)
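A quick way to check the bin edges against the if-else ladder in the question is to run cut() over all 24 hours. This is only a sketch; hours, labs, part and part_default are illustrative names, not part of the original code.

```r
hours <- 0:23
labs  <- c("Midnight", "Early Morning", "Morning", "Afternoon", "Evening", "Night")

# Left-closed intervals [0,3), [3,6), ..., [20,24), matching the >= / < ladder.
part <- as.character(cut(hours, breaks = c(0, 3, 6, 12, 16, 20, 24),
                         labels = labs, right = FALSE))

# Default right = TRUE gives (a, b] intervals, so hour 3 falls into the first bin.
part_default <- as.character(cut(hours, breaks = c(-1, 3, 6, 12, 16, 20, 24),
                                 labels = labs))

# Compare the two labelings at the boundary hours 0, 3 and 6.
print(data.frame(hours, part, part_default)[c(1, 4, 7), ])
```

The two columns disagree exactly at the boundary hours (3, 6, 12, 16, 20), which is where the choice of right matters.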
if(data$Requesthours>=0 & data$Requesthours<3) (and other similar ifs) make no sense since data$Requesthours is a vector. You should try either of the following:
Solution 1:
for(i in seq_along(data$Requesthours)) {
if(data$Requesthours[i]>=0 & data$Requesthours[i]<3)
data$Partoftheday[i] <- "Midnight"
....
}
This solution is slow like hell and really ugly, but it would work.
Solution 2:
data$Partoftheday[data$Requesthours>=0 & data$Requesthours<3] <- "Midnight"
...
Solution 3 = what was proposed by smci
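As a sketch of Solution 2 on made-up data, here are just the first two ranges from the question; the remaining ranges would follow the same pattern. The sample values are hypothetical.

```r
data <- data.frame(Requesthours = c(0, 2, 3, 5))   # hypothetical sample
data$Partoftheday <- NA_character_                   # create the column first

# Each assignment fills only the rows where the logical index is TRUE.
data$Partoftheday[data$Requesthours >= 0 & data$Requesthours < 3] <- "Midnight"
data$Partoftheday[data$Requesthours >= 3 & data$Requesthours < 6] <- "Early Morning"
print(data)
```

Initialising the column with NA_character_ makes any hour that no range covers easy to spot afterwards.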
I am trying to do 10-fold cross-validation in R. Each run of the for loop generates a new row with several named columns, and I want the results of each run to go under the appropriate column, so that at the end I can compute the average value for each column. The results generated in each run belong to different columns than those of the previous run, so the column names would also have to be checked. Is it possible to do this? Or would it be better to just compute the averages for the columns on the spot?
for(i in seq(from=1, to=8200, by=820)){
fold <- df_vector[i:i+819,]
y_fold_vector <- df_vector[!(rownames(df_vector) %in% rownames(folding)),]
alpha_coefficient <- solve(K_training, y_fold_vector)
test_points <- df_matrix[rownames(df_matrix) %in% rownames(K_training), colnames(df_matrix) %in% rownames(folding)]
predictions <- rbind(predictions, crossprod(alpha_coefficient,test_points))
}
You are having problems with the precedence of dyadic operators in R. It should be:
fold <- df_vector[ i:(i+819), ]
Consider:
> i=1
> i:i+189
[1] 190
Lack of a simple example (or any comments on what your code is supposed to be doing) prevents any testing of the rest of the code, but you can find the precedence of operators at ?Syntax. Unary "-" is higher, but binary "+" is lower than ":".
(It's also unclear what the folding vector is supposed to be. You only defined a fold value and it wasn't a vector since you addressed it as you would a dataframe.)
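The precedence rule can be checked directly. The sketch below uses the fold size 819 and the seq() call from the question; the variable names a, b and starts are only for illustration.

```r
i <- 1
a <- i:i + 819      # parsed as (i:i) + 819, a single number
b <- i:(i + 819)    # the intended index range of 820 rows
print(a)
print(length(b))

# Unary minus binds tighter than ":", so this is (-1):2.
print(-1:2)

# Fold start positions produced by the loop header in the question.
starts <- seq(from = 1, to = 8200, by = 820)
print(length(starts))
```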
I have a dataframe, and I'd like to subset by picking out all the rows that conform to a condition on the factor value for year:
subset_df <- df[ (which(df$year < '1972') || (df$year > '1982')),]
My problem is that the line above returns the whole dataframe, df.
Forgive me if this is too basic or simple, but I cannot figure out where the flaw lies.
I'm suspecting there is something regarding || which I don't understand, or my other theory is that arr.ind=T somehow plays a role. Either that, or the nature of the which() function is a little more complicated than I think it is.
If anyone has any insight, I'd greatly appreciate it. Thanks for your time.
PS: yes, this works as expected and returns the correct subset; ie, there isn't a flaw in my dataframe:
test_df <- df[ (which(df$year < '1972')), ]
as does its counterpart for 1982.
Note that from the helpfile you can read (See ?"|"):
For |, & and xor a logical or raw vector... and...For ||, && and isTRUE, a length-one logical vector.
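Therefore you may want to change your || to |, and I think which() is not required here. || yields a single TRUE or FALSE rather than one value per row, and indexing rows with a single TRUE selects the whole dataframe.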
subset_df <- df[ df$year < '1972' | df$year > '1982',]
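A minimal sketch of the element-wise version, assuming year really is a factor whose labels look like numbers (the data below is invented). Comparing an unordered factor with < returns NA with a warning, so the sketch converts the labels to numbers first.

```r
df <- data.frame(year = factor(c("1970", "1975", "1980", "1985")),
                 value = 1:4)               # hypothetical data

yr <- as.numeric(as.character(df$year))     # factor -> labels -> numbers
keep <- yr < 1972 | yr > 1982               # element-wise OR, one value per row
subset_df <- df[keep, ]
print(subset_df)

# A single TRUE is recycled over all rows, which is why the original
# expression returned the whole dataframe.
print(nrow(df[TRUE, ]))
```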
I have used the which() function to generate indices that tell me which values in variable x of a 9-variable dataframe are above 1024 and to tell me which values in variable y are above 768.
Now I want to generate a new dataframe that includes all the values of the original dataframe except for all values returned by which(dataframe$x > 1024) or which(dataframe$y > 768).
What functions can I use to generate a new dataframe from the old dataframe minus those indexed values?
I apologize if my language is not standardized to typical R vocabulary, I just started working with R. Thanks.
You can use logical vectors for subsetting. Try dataframe[dataframe$x <= 1024 & dataframe$y <= 768,] which is the same as dataframe[!(dataframe$x > 1024 | dataframe$y > 768),].
You would benefit from reading an introduction to R.
If logical vectors are not intuitive for you, you may prefer using subset().
In your case:
subset(dataframe, !(x > 1024 | y > 768))
However you should pay attention to NA's and (c&p from the manual of subset):
This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
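The NA caveat can be seen on a small invented data frame: subset() drops rows where the condition is NA, plain logical indexing with [ returns an all-NA row for them, and wrapping the condition in which() drops them again. This is a sketch with hypothetical values.

```r
dataframe <- data.frame(x = c(100, 2000, NA, 500),
                        y = c(100, 100, 100, 900))   # hypothetical data

cond <- dataframe$x <= 1024 & dataframe$y <= 768     # TRUE, FALSE, NA, FALSE

by_subset <- subset(dataframe, x <= 1024 & y <= 768) # NA treated as FALSE
by_index  <- dataframe[cond, ]                       # NA yields an all-NA row
by_which  <- dataframe[which(cond), ]                # which() ignores NA

print(nrow(by_subset))
print(nrow(by_index))
print(nrow(by_which))
```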