Different results subsetting with column names - r

I apologize if I'm duplicating a question but I'm a newbie and I couldn't find the answer (probably because I lack the jargon).
I generated a data frame like so:
x1 <- c(1,2,3,4,5)
x2 <- c("a", "b", "c", "d", "e")
df <- data.frame(x1,x2)
x1 x2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
Then I tried to subset conditioning on the first column like this
df[df$x1>3, "x2"]
The result was as expected
[1] d e
However when I try
df["x1" >3, "x2"]
[1] a b c d e
R seems to ignore the conditional statement and returns the whole column x2. Is there a way of evaluating conditional statements (<,>,==) using the column names?
EDIT: I think I found the answer partially: R evaluates
"some text" > 1000
[1] TRUE
and that explains why I get all the rows.
The question remains: what is a good way of evaluating conditional statements using column names?

I won't go into a long explanation because I think you'll see the issue clearly with a few examples. Basically, if you want to refer to the columns by their names as character strings, you need a construct like this
df[df[["x1"]] > 3, "x2"]
# [1] d e
# Levels: a b c d e
What was happening with your second try is this
"x1" > 3
# [1] TRUE
And then basically what you did was this
df[TRUE, "x2"]
# [1] a b c d e
# Levels: a b c d e
giving all elements. As for why "x1" > 3 is TRUE: when a character string is compared with a number, R coerces the number to a character string and compares the two as strings, so "x1" > 3 is evaluated as "x1" > "3". Letters sort after digits, so the result is TRUE. A character is not always greater than a number, though: "1000" > 3 is FALSE, because "1" sorts before "3".
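A minimal sketch of that comparison, showing that the number is turned into a string before the comparison is made. The variable names are illustrative only and are not part of the original answer.

```r
# Comparing a string with a number coerces the number to a string first,
# so these two comparisons are the same thing.
same_as_string <- identical("x1" > 3, "x1" > "3")

# Strings are compared character by character, not by numeric value:
# "1" sorts before "3", so this comparison is FALSE.
digits_first <- "1000" > 3

# The example from the question's EDIT: "s" sorts after "1".
text_vs_number <- "some text" > 1000

print(c(same_as_string, digits_first, text_vs_number))
```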

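Your question could have many answers, depending on the context and the type of data you're working with. In this particular case, df[x1 > 3, "x2"] also returns d and e, but only because the vector x1 that you used to build the data frame still exists in your workspace; the expression never looks at the column df$x1. To refer to the column itself, use df[df$x1 > 3, "x2"] or df[df[["x1"]] > 3, "x2"].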
The first argument is for rows and the second argument is for columns. Essentially, you are saying to return all df rows where x1 is greater than 3. In terms of columns, you want only column x2. You'll get it pretty quickly once you tweak around with the different statements. Hope this helps.
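For reference, here are a few common ways to filter rows on a condition while referring to the column by name. This is a sketch rather than the only approach. The variable col is a hypothetical name introduced for illustration, and stringsAsFactors = FALSE is set so the result is a plain character vector regardless of R version.

```r
# Build the data frame without leaving x1 and x2 in the workspace,
# so every expression below has to look inside df.
df <- data.frame(x1 = c(1, 2, 3, 4, 5),
                 x2 = c("a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)

# 1. Column name held in a character variable (col is hypothetical).
col <- "x1"
by_name <- df[df[[col]] > 3, "x2"]

# 2. with() evaluates the expression using the columns of df.
by_with <- with(df, x2[x1 > 3])

# 3. subset() also evaluates the condition inside df.
by_subset <- subset(df, x1 > 3, select = x2)$x2

print(by_name)
```

All three return the same values. The first form is the one to use when the column name is only known at run time.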

Related

Compare 2 vectors and add missing values from target vector in R

I am using R and I have a vector, correct, whose names are all the target names:
correct <- c("a","b","c","d","e")
names(correct) <- c("A","B","C","D","E")
The vector I have to modify, tofix, has names that may be missing some of the names in correct. In the example below it is missing "C" and "E".
tofix <- c(2,5,4)
names(tofix) <- c("A","B","D")
So I want to fix it in a way that the resulting vector, fixed, contains the same names as in correct and with the same order, and when the name is missing adds 0 as a value, like the below:
fixed <- c(2,5,0,4,0)
names(fixed) <- names(correct)
Any idea how to do this in R? I tried with multiple if statements and for loops, but time complexity was far from ideal.
Many thanks in advance.
You may try
fixed <- rep(0, length(correct))
fixed[match(names(tofix), names(correct))] <- tofix
names(fixed) <- names(correct)
fixed
A B C D E
2 5 0 4 0
unlist(modifyList(as.list(table(names(correct))*0), as.list(tofix)))
A B C D E
2 5 0 4 0
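Another possible variant, sketched here under the assumption that tofix contains no genuine NA values: index tofix by the target names, which yields NA for every missing name, then replace those NAs with 0.

```r
correct <- c(A = "a", B = "b", C = "c", D = "d", E = "e")
tofix   <- c(A = 2, B = 5, D = 4)

# Indexing by a name that is absent from tofix returns NA for that slot.
fixed <- tofix[names(correct)]
names(fixed) <- names(correct)

# Fill the slots that were missing in tofix with 0.
# Caveat: a real NA already present in tofix would also become 0.
fixed[is.na(fixed)] <- 0

print(fixed)
```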

For every value in vector: extract value from an appropriate row in a dataframe

I think the best way to explain my question is by an example:
we have a vector:
vector1 <- c(1,2,3,3,5,6,3,7,7)
and a dataframe:
ID VAL
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
I want to create a vector that will look like this:
vector2 <- c("a","b","c","c","e","f","c","g","g")
Sounds very simple and probably is very simple with some trick that I don't know about.
I tried with "%in%", but it produced a vector of the values from the rows of the dataframe that are present in the vector, whereas my goal is a vector of the values from the dataframe that correspond, in order, to the values in the vector.
Thank you.
Thank you David, following your suggestion I was able to solve my problem.
I needed to do some preparation first, though, because I had oversimplified the example.
Actually, keeping the naming convention from my example, the "ID" column contained strings, so the dataframe looked like this:
ID VAL
one a
two b
three c
four d
five e
six f
seven g
eight h
And vector1 looked like this: (one,two,three,three,five,six,three,seven,seven)
Then, I figured I should rename the rownames of the dataframe to the names in "ID" and then perform the command you have suggested.
My preparation looked like this:
rownames(dataframe) <- dataframe$ID
vector2 <- dataframe[vector1, "VAL"]
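As an alternative that leaves the row names untouched, the lookup can also be done with match(). This is a sketch of a different technique, not the suggestion referred to above. The data frame is rebuilt here from the example so the block runs on its own.

```r
dataframe <- data.frame(
  ID  = c("one", "two", "three", "four", "five", "six", "seven", "eight"),
  VAL = c("a", "b", "c", "d", "e", "f", "g", "h"),
  stringsAsFactors = FALSE
)
vector1 <- c("one", "two", "three", "three", "five",
             "six", "three", "seven", "seven")

# match() gives, for each element of vector1, the row position of the
# first matching ID; indexing VAL by those positions does the lookup.
vector2 <- dataframe$VAL[match(vector1, dataframe$ID)]

print(vector2)
```

Unlike %in%, which only reports whether each value is present, match() returns positions, so the order and the repeats of vector1 are preserved.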

Sort a list or objects containing different types in R

It is probably easier to explain myself using an example.
Let's say I have a list s:
s <- list( c(5,3,4,3,6),c("A","B","C","D","E"))
All sub-vectors of s always have the same number of elements. NA values are not allowed. The vectors contain different types.
What I want to achieve is:
rank v1 v2
1 3 "B"
2 3 "D"
3 4 "C"
4 5 "A"
5 6 "E"
Basically, I want to sort the list by the first vector (in ascending order) and then, in case of a tie, by the second vector in lexicographical order. In the C++ world the only thing I would need to do is define operator< for my object, but I am pretty new to R and I am running out of ideas.
The best strategy I have found is to loop over the elements and calculate a rank value (a double) for each pair (e.g. 3 "B" gets the highest rank and 6 "E" the lowest), store the results in another vector, and sort it. However, this solution is not great, because finding a good ranking function can be tricky and it is not easy to generalize.
It seems such a common problem that there has to be a better way. Can anyone point me in the right direction?
Thanks for your help.
Use order():
s <- data.frame(v1=c(5,3,4,3,6), v2=c("A","B","C","D","E"))
s[order(s$v1, s$v2), ]
v1 v2
2 3 B
4 3 D
3 4 C
1 5 A
5 6 E
Note that I transformed your list to a data frame. Since a data frame is itself a list (with all elements the same length) this shouldn't be a problem in your case.
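If the data has to stay a plain list, the same idea can be sketched without the conversion: compute the ordering once with order() and apply it to every sub-vector. The names ord and sorted are illustrative only.

```r
s <- list(c(5, 3, 4, 3, 6), c("A", "B", "C", "D", "E"))

# do.call() passes the sub-vectors to order() as successive sort keys:
# the first vector decides, the second breaks ties.
ord <- do.call(order, s)

# Apply the same permutation to every sub-vector of the list.
sorted <- lapply(s, function(v) v[ord])

print(sorted)
```

Because every element of the list is used as a key in turn, this generalizes to any number of sub-vectors of equal length.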

mapping over the rows of a data frame

Suppose I have a data frame with columns c1, ..., cn, and a function f that takes in the columns of this data frame as arguments.
How can I apply f to each row of the data frame to get a new data frame?
For example,
x = data.frame(letter=c('a','b','c'), number=c(1,2,3))
# x is
# letter | number
# a | 1
# b | 2
# c | 3
f = function(letter, number) { paste(letter, number, sep='') }
# desired output is
# a1
# b2
# c3
How do I do this? I'm guessing it's something along the lines of {s,l,t}apply(x, f), but I can't figure it out.
As @greg points out, paste() can do this. I suspect your example is a simplification of a more general problem. After struggling with this in the past, as illustrated in this previous question, I ended up using the plyr package for this type of thing. plyr does a LOT more, but for tasks like this it's easy:
> require(plyr)
> adply(x, 1, function(x) f(x$letter, x$number))
X1 V1
1 1 a1
2 2 b2
3 3 c3
You'll want to rename the output columns, I'm sure.
While I was typing this, @joshua showed an alternative method using ddply. The difference in my example is that adply treats the input data frame as an array, so it does not need the "group by" variable row that @joshua created. His way is exactly how I was doing it until Hadley tipped me off to the adply() approach in the aforementioned question.
paste(x$letter, x$number, sep = "")
I think you were thinking of something like this, but note that the apply family of functions do not return data.frames. They will also attempt to coerce your data.frame to a matrix before applying the function.
apply(x,1,function(x) paste(x,collapse=""))
So you may be more interested in ddply from the plyr package.
> x$row <- 1:NROW(x)
> ddply(x, "row", function(df) paste(df[[1]],df[[2]],sep=""))
row V1
1 1 a1
2 2 b2
3 3 c3

Why is recode in R not changing the original values?

I'm trying to use recode in R (from the car package) and it is not working. I read in data from a .csv file into a data frame called results. Then, I replace the values in the column Built_year, according to the following logic.
recode(results$Built_year,
"2 ='1950s';3='1960s';4='1970s';5='1980s';6='1990s';7='2000 or later'")
When I check results$Built_year after doing this step, it appears to have worked. However, it does not store this value, and returns to its previous value. I don't understand why.
Thanks.
(at the moment something is going wrong and I can't see any of the icons for formatting)
You need to assign to a new variable.
Taking the example from recode in the car package
R> x <- rep(1:3, 3)
R> x
[1] 1 2 3 1 2 3 1 2 3
R> newx <- recode(x, "c(1,2)='A'; else='B'")
R> newx
[1] "A" "A" "B" "A" "A" "B" "A" "A" "B"
R>
By the way, the package is called car, not cars.
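To illustrate the point about assignment without requiring the car package, here is a minimal base R sketch. It swaps recode for a named lookup vector. The small results data frame and the decade vector are made up for illustration.

```r
# Hypothetical stand-in for the data read from the .csv file.
results <- data.frame(Built_year = c(2, 3, 7, 5))

# Lookup table from the old codes to the new labels.
decade <- c("2" = "1950s", "3" = "1960s", "4" = "1970s",
            "5" = "1980s", "6" = "1990s", "7" = "2000 or later")

# Computing the recoded values without assigning them changes nothing.
decade[as.character(results$Built_year)]
unchanged <- identical(results$Built_year, c(2, 3, 7, 5))

# Assigning the result back is what actually updates the column.
# With car this would be:
#   results$Built_year <- recode(results$Built_year, "2='1950s'; ...")
results$Built_year <- unname(decade[as.character(results$Built_year)])

print(results)
```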
car::recode (like R functions in general) does not work like the SPSS Recode function: it returns a new value instead of changing the variable in place, so if you apply a transformation to a variable, you must assign the result to a variable, as Dirk said. I don't use car::recode myself, although it's quite straightforward. Learn how to deal with factors instead. As far as I can see, you can apply as.numeric(results$Built_year) and get the same effect. IMHO, using car::recode in this manner is trivial. You only want to change a factor to numeric, right? Well, you'll be surprised when you see that:
> x <- factor(letters[1:10])
> x
[1] a b c d e f g h i j
Levels: a b c d e f g h i j
> mode(x)
[1] "numeric"
> as.numeric(x)
[1] 1 2 3 4 5 6 7 8 9 10
And, boy, do I like answering questions that refer to factors... =) Get familiar with factors, and you'll see the magic of "recode" in R! =) Rob Kabacoff's site is a good starting point.
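One related pitfall is worth a short sketch: as.numeric() on a factor returns the internal level codes, not the numbers shown in the labels. The factor below is made up for illustration.

```r
# A factor whose labels look like numbers.
f <- factor(c("10", "20", "10"))

# as.numeric() returns the internal integer codes of the levels.
codes <- as.numeric(f)

# Converting through character recovers the numbers in the labels.
values <- as.numeric(as.character(f))

print(codes)
print(values)
```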

Resources