Subselection of a variable - r

I have a problem with selecting a variable that should contain a certain range of values. I want to split up my variable into 3 categories. Namely; small, medium and big. A piece of context. I have a variable named obj_hid_woonopp which is (size in m2) and it goes from 16-375. And my dataset is called datalogitvar.
I'm sorry I have no reproduceable code. But since I think it's a rather simple question I hope it can be answered nonetheless. The code that I'm using is as follows
datalogitvar$size_small<- as.numeric(obj_hid_WOONOPP>="15" & obj_hid_WOONOPP<="75" )
datalogitvar$size_medium<- as.numeric(obj_hid_WOONOPP>="76" & obj_hid_WOONOPP<="100" )
datalogitvar$size_large<- as.numeric(obj_hid_WOONOPP>="101")
When I run this, I do get a result. Just not the result I'm hoping for. For example the small category also contains very high numbers. It seems that (since i define "75") it also takes values of "175" since it contains "75". I've been thinking about it and I feel it reads my data as text and not as numbers. However I do say as.numeric so I'm a bit confused. Can someone explain to me how I make sure I create these 3 variables with the proper range? I feel I'm close but the result is useless so far.
Thank you so much for helping.

For a question like this you can replicate your problem with a publicly available dataset like mtcars.
And regarding your code
1) you will need to name the dataset for DATASET$obj_hid_WOONOPP on the right side of your code.
2) Why are you using quotes around your numeric values? These quotes prevent the numbers from being treated as numbers. They are instead treated as string values.
I think you want to use something like the code I've written below.
mtcars$mpg_small <- as.numeric(mtcars$mpg >= 15 & mtcars$mpg <= 20)
mtcars$mpg_medium <- as.numeric(mtcars$mpg > 20 & mtcars$mpg <= 25)
mtcars$mpg_large <- as.numeric(mtcars$mpg > 25)

Just to illustrate your problem:
a <- "75"
b <- "175"
a > b
TRUE (75 > 175)
a < b
FALSE (75 < 175)
Strings don't compare as you'd expect them to.

Two ideas come to mind, though an example of code would be helpful.
First, look into the documentation for cut(), which can be used to convert numeric vector into factors based on cut-points that you set.
Second, as #MrFlick points out, your code could be rewritten so that as.numeric() is run on a character vector containing strings that you want to convert to numeric values THEN perform Boolean comparisons such as > or &.
To build on #Joe
mtcars$mpg_small <- (as.numeric(mtcars$mpg) >= 15 &
(as.numeric(mtcars$mpg) <= 20))
Also be careful, if your vector of strings obj_hid_WOONOPP contains some values that cannot be coerced into numerics, they will become NA.

Related

how to transform columns of a data frame according to the values in a vector in R?

I am trying to normalize some columns on a data frame so they have the same mean. The solution I am now implementing, even though it works, feels like there is a simpler way of doing this.
# we make a copy of women
w = women
# print out the col Means
colMeans(women)
height weight
65.0000 136.7333
# create a vector of factors to normalize with
factor = colMeans(women)/colMeans(women)[1]
# normalize the copy of women that we previously made
for(i in 1:length(factor)){w[,i] <- w[,i] / factor[i]}
#We achieved our goal to have same means in the columns
colMeans(w)
height weight
65 65
I can come up with the same thing easily ussing apply but is there something easier like just doing women/factor and get the correct answer?
By the way, what does women/factor actually doing? as doing:
colMeans(women/factor)
height weight
49.08646 98.40094
Is not the same result.
Can use mapply too
colMeans(mapply("/", w, factor))
Re your question re what does women/factor do, so women is a data.frame with two columns, while factor is numeric vector of length two. So when you do women/factor, R takes each entry of women (i.e. women[i,j]) and divides it once by factor[1] and then factor[2]. Because factor is shorter in length than women, R rolls factor over and over again.
You can see, for example, that every second entry of women[, 1]/factor equals to every second entry of women[, 1] (because factor[1] equals to 1)
One way of doing this is using sweep. By default this function subtracts a summary statistic from each row, but you can also specify a different function to perform. In this case a division:
colMeans(sweep(women, 2, factor, '/'))
Also:
rowMeans(t(women)/factor)
#height weight
#65 65
Regarding your question:
I can come up with the same thing easily ussing apply but is there something easier like just doing women/factor and get the correct answer? By the way, what does women/factor actually doing?
women/factor ## is similar to
unlist(women)/rep(factor,nrow(women))
What you need is:
unlist(women)/rep(factor, each=nrow(women))
or
women/rep(factor, each=nrow(women))
In my solution, I didn't use rep because factor gets recycled as needed.
t(women) ##matrix
as.vector(t(women))/factor #will give same result as above
or just
t(women)/factor #preserve the dimensions for ?rowMeans
In short, column wise operations are happening here.

How to use negative values in which() statement in R?

I want to exclude the rows in which x has values less than or equal to -10, so I wrote this:
newdata <- data[which(data$x> -10), ]
Is this right or I need to put -10 in double quotation marks?
Thank you.
(Decided to upgrade this from a comment to an answer.)
Using double quotation marks is not wise: it will mess you up in some quite surprising ways. For example, 1 > "-10" is FALSE (!!) because of the way in which R compares strings.
R's use of <- for assignment may get you in trouble; if you want x<-10 to do the comparison rather than assign the value 10 to x, you need either spaces x < -10 or parentheses (x<(-10)). However, this doesn't arise with the > comparison.
You can always use parentheses if you're worried (x > (-10)); the only drawback is that things get harder to read if you use too many (e.g., data[(which(((data$x)>(-10)))),])).
As pointed out in the comments, R is an interactive environment; if you can't figure something like this out from the documentation or other help sources, you should just try a small example and convince yourself that it works.
For example:
x <- c(-20,-15,-10,-4,0)
x[x>-10]
## -4 0

How to ingnore/exclude specific values in R-Cran, e.g. Zero

I have data, which I want to do a PCA with. In order to do so I want to log the data because the range of my data is very high (from 0 to four digits). (if you have a better method I'd also be interested :)
The data contains zero values, which should of course be excluded from the log. So How do I do this in R-Cran?
What I do is:
logmydata<-log(mydata)
It logs then also the zero values which returns -inf, which I don't like!
I think this should be very easy, but maybe because it is so basic I couldn’t find it. I'm just a beginner, sorry for that!
All the best!
Lukas
do you just want to make the zeros NAs?
mydata[ mydata == 0] <-NA
or remove them from the analysis altogether
nozeromydata <- mydata[ mydaya != 0 ]
if you don't like either of these suggestions, I say:
log( mydata + 1 )

is.na() in R for loop not quite understood

I am confused by the behavior of is.na() in a for loop in R.
I am trying to make a function that will create a sequence of numbers, do something to a matrix, summarize the resulting matrix based on the sequence of numbers, then modify the sequence of numbers based on the summary and repeat. I made a simple version of my function because I think it still gets at my problem.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq))
##generate a table where the row names are those numbers
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10))
##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <-
count(temp.results[,1])$freq
##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
## the idea would be to keep cutting this sequence of numbers down with
## successive iterations until the desired number of iterations per row in
## details.table was reached. in other words, in the real code i'd do
## something to details.table in the next line
print(rich.seq)
}
}
##call the function
test(desired.iterations=4, max.iterations=2)
On the first run through the for loop the rich.seq looks like I'd expect it to, where 5 & 6 are no longer in the sequence because both ended up with more than 4 iterations. However, on the second run, it spits out something unexpected.
UPDATE
Thanks for your help and also my apologies. After re-reading my original post it is not only less than clear, but I hadn't realized count was part of the plyr package, which I call in my full function but wasn't calling here. I'll try and explain better.
What I have working at the moment is a function that takes a matrix, randomizes it (in any of a number of different ways), then calculates some statistics on it. These stats are temporarily stored in a table--temp.results--where temp.results[,1] is the sum of the non zero elements in each column, and temp.results[,2] is a different summary statistic for that column. I save these results to a csv file (and append them to the same file at subsequent iterations), because looping through it and rbinding hogs a lot of memory.
The problem is that certain column sums (temp.results[,1]) are sampled very infrequently. In order to sample those sufficiently requires many many iterations, and the resulting .csv files would stretch into the hundreds of gigabytes.
What I want to do is create and then update a table (details.table) at each iteration that keeps track of how many times each column sum actually got sampled. When a given element in the table reaches the desired.iterations, I want it to be excluded from the vector rich.seq, so that only columns that haven't received the desired.iterations are actually saved to the csv file. The max.iterations argument will be used in a break() statement in case things are taking too long.
So, what I was expecting in the example case is the exact same line for rich.seq for both iterations, since I didn't actually do anything to change it. I believe that flodel is definitely right that my problem lies in comparing a matrix (details.table) of length longer than rich.seq, leading to unexpected results. However, I don't want the dimensions of details.table to change. Perhaps I can solve the problem implementing %in% somehow when I redefine rich.seq in the for loop?
I agree you should improve your question. However, I think I can spot what is going wrong.
You compute details.table before the for loop. It is a matrix with same length as rich.seq when it was first initialized (length(4:34), i.e. 31).
Inside the for loop, details.table < desired.iterations | is.na(details.table) is then a logical vector of length 31. On the first loop iteration,
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
will result in reducing the length of rich.seq. But on the second loop iteration, unless details.table is redefined (not the case), you are trying to subset rich.seq by a logical vector of longer length than rich.seq. This will certainly lead to unexpected results.
You probably meant to redefine details.table as part of your for loop.
(Also I am surprised to see you never used temp.results[,2].)
Thanks to flodel for setting me off on the right track. It had nothing to do with is.na but rather the lengths of vectors I was comparing.
That said, I set the initial values of the details.table to zero to avoid the added complexity of the is.na statement.
This code works, and can be modified to do what I described above.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq)) ##generate a table where the row names are those numbers
details.table[,1] <- 0
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10)) ##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <- count(temp.results[,1])$freq ##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- row.names(details.table)[details.table[,1] < desired.iterations]
print(rich.seq)
}
}
Rather than trying to cut down the rich.seq I just redefine it every iteration based on whatever happens with details.table during the previous iteration.

Bandwidth selection using NP package

New to R and having problem with a very simple task! I have read a few columns of .csv data into R, the contents of which contains of variables that are in the natural numbers plus zero, and have missing values. After trying to use the non-parametric package, I have two problems: first, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error that "number of regression data and response data do not match". Why do I get this, as I have the same number of elements in each vector?
Second, I would like to call the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic, it is just a vector with natural numbers and NA's?
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way of subsetting data. Note the difference in various ways of subsetting. data$x and data[,i] (where i = column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they call for data = (your data). In that case they work with column names

Resources