I'm trying to use recode in R (from the car package) and it is not working. I read in data from a .csv file into a data frame called results. Then, I replace the values in the column Built_year, according to the following logic.
recode(results$Built_year,
"2 ='1950s';3='1960s';4='1970s';5='1980s';6='1990s';7='2000 or later'")
When I check results$Built_year after doing this step, it appears to have worked. However, it does not store this value, and returns to its previous value. I don't understand why.
Thanks.
(at the moment something is going wrong and I can't see any of the icons for formatting)
You need to assign to a new variable.
Taking the example from recode in the car package
R> x <- rep(1:3, 3)
R> x
[1] 1 2 3 1 2 3 1 2 3
R> newx <- recode(x, "c(1,2)='A'; else='B'")
R> newx
[1] "A" "A" "B" "A" "A" "B" "A" "A" "B"
R>
By the way, the package is called car, not cars.
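Applied to the data frame from the question, the same fix is to assign the recoded vector back to the column. This is a minimal sketch, assuming the car package is installed. The small results data frame is a made-up stand-in for the .csv data, and the base R lookup in the else branch is only there so the sketch still runs without car:

```r
# Hypothetical stand-in for the data frame read from the .csv file
results <- data.frame(Built_year = c(2, 3, 4, 5, 6, 7, 3))

spec <- "2='1950s';3='1960s';4='1970s';5='1980s';6='1990s';7='2000 or later'"

if (requireNamespace("car", quietly = TRUE)) {
  # recode() returns a new vector; assigning it back is what stores the change
  results$Built_year <- car::recode(results$Built_year, spec)
} else {
  # Fallback without car: a named lookup vector indexed by the old codes
  decades <- c("2" = "1950s", "3" = "1960s", "4" = "1970s",
               "5" = "1980s", "6" = "1990s", "7" = "2000 or later")
  results$Built_year <- unname(decades[as.character(results$Built_year)])
}

results$Built_year
```

Without the assignment, recode() only prints its result and results$Built_year keeps its old values.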
car::recode (like R itself) does not work like the SPSS Recode function: if you apply a transformation to a variable, you must assign the result to a variable, as Dirk said. I don't use car::recode, although it's quite straightforward... learn how to deal with factors... as far as I can see, you can apply as.numeric(results$Built_year) and get the same effect. IMHO, using car::recode in this manner is trivial. You only want to change a factor to numeric, right... Well, you'll be surprised when you see that:
> x <- factor(letters[1:10])
> x
[1] a b c d e f g h i j
Levels: a b c d e f g h i j
> mode(x)
[1] "numeric"
> as.numeric(x)
[1] 1 2 3 4 5 6 7 8 9 10
And, boy, do I like answering questions that refer to factors... =) Get familiar with factors, and you'll see the magic of "recode" in R! =) Rob Kabacoff's site is a good starting point.
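One factor behaviour worth knowing before relying on as.numeric: it returns the internal integer codes, not the labels. A small sketch with a made-up factor whose labels look like numbers:

```r
# Made-up factor whose labels look like numbers
f <- factor(c("10", "20", "10", "30"))

codes  <- as.numeric(f)                # internal codes: position in levels(f)
values <- as.numeric(as.character(f))  # the numbers the labels spell out
also   <- as.numeric(levels(f))[f]     # same values, converting each level once

codes
values
```

So as.numeric on a factor matches the original values only when the codes happen to coincide with them.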
The following exchange arose in a discussion about whether a vector in R needs a dimension:
M. JORGENSEN (Dept of Stat, U of Waikato, NZ):
"Would it not make sense to have dim(A)=length(A) for all vectors?"
B.D. RIPLEY (Dept of Applied Statistics, Oxford, UK):
"No. A one-dimensional array and a vector are not the same thing.
There are subtle differences, such as what names() means (see ?names).
That a 1D array and a vector print in the same way does occasionally
lead to confusion, but then you also cannot tell from your printout that A
has type integer and not double.
......
My questions:
(1) I cannot figure out the subtle difference in what names() means, and
(2) I cannot produce a concrete example of the "cannot tell from your printout that A
has type integer and not double" issue.
Any help to clarify JORGENSEN-RIPLEY discussion (with concrete examples in R) will be appreciated.
To address the first question, let's first create a vector and a 1-d array:
(vector <- 1:10)
#> [1] 1 2 3 4 5 6 7 8 9 10
(arr_1d <- array(1:10, dim = 10))
#> [1] 1 2 3 4 5 6 7 8 9 10
If we give the objects some names, we can see the difference
that Ripley alludes to by looking at the attributes:
names(vector) <- letters[1:10]
names(arr_1d) <- letters[1:10]
attributes(vector)
#> $names
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
attributes(arr_1d)
#> $dim
#> [1] 10
#>
#> $dimnames
#> $dimnames[[1]]
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
That is, the 1-d array doesn't actually have a names attribute,
but rather a dimnames attribute (which is a list, not a vector),
the first element of which names() actually accesses.
This is covered in the "Note" section in ?names:
For vectors, the names are one of the attributes with restrictions on
the possible values. For pairlists, the names are the tags and
converted to and from a character vector.
For a one-dimensional array the names attribute really is
dimnames[[1]].
Here we also see the lack of a dim
attribute for vectors. (A related SO answer covers the differences between arrays and vectors, too.)
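The same distinction can be checked programmatically rather than by reading the attributes() printout. A short sketch, using the hypothetical names v and a for a fresh vector and 1-d array:

```r
v <- 1:3
a <- array(1:3, dim = 3)
names(v) <- c("a", "b", "c")
names(a) <- c("a", "b", "c")

attrs_v <- names(attributes(v))   # which attributes the vector carries
attrs_a <- names(attributes(a))   # which attributes the 1-d array carries

# names() on the 1-d array reads the first element of dimnames
same_names <- identical(names(a), dimnames(a)[[1]])

# is.vector() is FALSE once an object has attributes other than names
c(is.vector(v), is.vector(a), is.array(v), is.array(a))
```

Here attrs_v contains only "names", while attrs_a contains "dim" and "dimnames".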
The additional attributes and the way they are stored mean that
1-d arrays always take up a little more memory than their vector equivalent:
# devtools::install_github("r-lib/lobstr")
lobstr::obj_size(vector)
#> 848 B
lobstr::obj_size(arr_1d)
#> 1,056 B
However, that's about the only reason I can think of why one
would want to have separate types for vectors and 1-d arrays. I would assume this was really the question that Jorgensen was
asking, i.e. why have a separate vector type without the dim
attribute at all; and I don't think Ripley really addresses that.
I'd be very interested to hear other rationale for this.
As for point 2), when you create a vector with : from whole-number
endpoints, it is an integer:
vector <- 1:10
typeof(vector)
#> [1] "integer"
A double with the same values will print the same:
double <- as.numeric(vector)
typeof(double)
#> [1] "double"
double
#> [1] 1 2 3 4 5 6 7 8 9 10
But integers and doubles are not the same thing:
identical(vector, double)
#> [1] FALSE
The differences between integers and doubles in R are subtle, the main
one being that integers take up less space in memory.
lobstr::obj_size(vector)
#> 88 B
lobstr::obj_size(double)
#> 168 B
See this answer for a more comprehensive overview of the differences between integers and doubles.
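Two further differences are easy to check directly: str() reveals the type that print() hides, and integer arithmetic can overflow to NA where doubles do not. A brief sketch (the variable names are illustrative only):

```r
int_v <- 1:3          # created with `:` from whole numbers
dbl_v <- c(1, 2, 3)   # plain numeric literals are doubles

print(int_v)          # prints the same as dbl_v
print(dbl_v)
str(int_v)            # str() does show the storage type
str(dbl_v)

same_type  <- identical(int_v, dbl_v)  # compares type as well as values
same_value <- all(int_v == dbl_v)      # compares values only

# Integer arithmetic overflows to NA (with a warning); doubles do not
overflow_int <- suppressWarnings(.Machine$integer.max + 1L)
overflow_dbl <- .Machine$integer.max + 1
```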
Created on 2018-07-09 by the reprex package (v0.2.0.9000).
I apologize if I'm duplicating a question but I'm a newbie and I couldn't find the answer (probably because I lack the jargon).
I generated a data frame like so:
x1 <- c(1,2,3,4,5)
x2 <- c("a", "b", "c", "d", "e")
df <- data.frame(x1,x2)
x1 x2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
Then I tried to subset conditioning on the first column like this
df[df$x1>3, "x2"]
The result was as expected
[1] d e
However when I try
df["x1" >3, "x2"]
[1] a b c d e
R seems to ignore the conditional statement and returns the whole column x2. Is there a way of evaluating conditional statements (<,>,==) using the column names?
EDIT: I think I found the answer partially: R evaluates
"some text" > 1000
[1] TRUE
and that explains why I get all the rows.
The question remains: what is a good way of evaluating conditional statements using column names?
I won't go into a long explanation because I think you'll be able to see the issue clearly with a few examples. But basically, if you want to use the character data frame names, you will need a construct like this
df[df[["x1"]] > 3, "x2"]
# [1] d e
# Levels: a b c d e
What was happening with your second try is this
"x1" > 3
# [1] TRUE
And then basically what you did was this
df[TRUE, "x2"]
# [1] a b c d e
# Levels: a b c d e
giving all elements. The reason is coercion: when a character string is compared with a number, R converts the number to a string and compares the two as strings (see ?Comparison), so "x1" > 3 is really "x1" > "3". A character is therefore not always greater than a number; the result depends on how the two strings sort.
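A quick way to see what the comparison is doing is to try a string that sorts before the number's text form. This is only an illustrative sketch; the exact ordering of strings depends on the locale, though digits sort before letters in the common ones:

```r
num_as_text <- as.character(3)  # what the number is turned into before comparing

cmp_name <- "x1" > 3    # compared as "x1" versus "3"
cmp_ten  <- "10" > 9    # compared as "10" versus "9", character by character

cmp_name
cmp_ten
```

The second comparison is FALSE even though 10 is greater than 9 numerically, because "1" sorts before "9".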
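Your question could have many answers, especially depending on the context and the type of data you're working with. In this particular case though, you could simply use df[x1 > 3, "x2"]. Note that this works only because the vector x1 used to build the data frame still exists in your workspace; df[df$x1 > 3, "x2"] refers to the column itself.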
The first argument is for rows and the second argument is for columns. Essentially, you are saying to return all df rows where x1 is greater than 3. In terms of columns, you want only column x2. You'll get it pretty quickly once you tweak around with the different statements. Hope this helps.
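For reference, here are a few ways to write the condition against the column itself, including the case where the column name is held in a string. This is a sketch using the data frame from the question; col, by_name, by_with and by_subset are illustrative names:

```r
df <- data.frame(x1 = c(1, 2, 3, 4, 5),
                 x2 = c("a", "b", "c", "d", "e"))

col <- "x1"                          # column name held as a string
by_name   <- df[df[[col]] > 3, "x2"] # [[ ]] extracts the column by name
by_with   <- with(df, x2[x1 > 3])    # with() looks x1 and x2 up inside df
by_subset <- subset(df, x1 > 3)$x2   # subset() evaluates the condition inside df

by_name
```

All three select the rows where the column x1 exceeds 3, regardless of what other objects exist in the workspace.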
I want to write a function that is doing the same as the SPSS command AUTORECODE.
AUTORECODE recodes the values of string and numeric variables to consecutive integers and puts the recoded values into a new variable called a target variable.
At first I tried this way:
AUTORECODE <- function(variable = NULL){
A <- sort(unique(variable))
B <- seq(1:length(unique(variable)))
REC <- Recode(var = variable, recodes = "A = B")
return(REC)
}
But this causes an error. I think the problem is caused by the way A and B are passed to the recodes argument. That's why I tried
eval(parse(text = paste("REC <- Recode(var = variable, recodes = 'c(",A,") = c(",B,")')")))
within the function. But this isn't the right solution.
Ideas?
factor may be simply what you need, as James suggested in a comment. It stores the values as integers behind the scenes (as seen by str) and just outputs the corresponding labels. This may also be very useful, as R has lots of commands for working with factors appropriately; for example, when fitting linear models, it creates all the "dummy" variables for you.
> x <- LETTERS[c(4,2,3,1,3)]
> f <- factor(x)
> f
[1] D B C A C
Levels: A B C D
> str(f)
Factor w/ 4 levels "A","B","C","D": 4 2 3 1 3
If you do just need the numbers, use as.integer on the factor.
> n <- as.integer(f)
> n
[1] 4 2 3 1 3
An alternate solution is to use match, but if you're starting with floating-point numbers, watch out for floating-point traps. factor converts everything to characters first, which effectively rounds floating-point numbers to a certain number of digits, making floating-point traps less of a concern.
> match(x, sort(unique(x)))
[1] 4 2 3 1 3
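Wrapped up as a function, the factor approach might look like the sketch below. The name autorecode is hypothetical, and this covers only the consecutive-integer part of what SPSS AUTORECODE does:

```r
# Hypothetical helper: recode values to consecutive integers in sorted order
autorecode <- function(variable) {
  as.integer(factor(variable))
}

from_strings <- autorecode(c("D", "B", "C", "A", "C"))
from_numbers <- autorecode(c(10, 5, 10, 7))  # numeric input is sorted numerically

from_strings
from_numbers
```

Assign the result to a new column to mimic the SPSS target variable.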
It is probably easier to explain myself using an example.
Let's say I have a list s:
s <- list( c(5,3,4,3,6),c("A","B","C","D","E"))
All sub-vectors of s always have the same number of elements. NA values are not allowed. The vectors contain different types.
What I want to achieve is:
rank v1 v2
1 3 "B"
2 3 "D"
3 4 "C"
4 5 "A"
5 6 "E"
Basically, I want to sort the list based on the first vector (in ascending order) and then, in case of a tie, look at the second vector using lexicographic order. In the C++ world the only thing I would need to do is define operator< for my object; however, I am pretty new to R and I am running out of ideas.
The best strategy that I have found is to loop over the elements and calculate a rank value (double) for each pair (e.g. 3 "B" would get the highest rank and 6 "E" the lowest), store the results in another vector and sort it. However, this solution is not great, because finding a good ranking function can be tricky and it is not very easy to generalize.
This seems to me such a common problem that there has to be a better way. Can anyone point me in the right direction?
Thanks for your help.
Use order():
s <- data.frame(v1=c(5,3,4,3,6), v2=c("A","B","C","D","E"))
s[order(s$v1, s$v2), ]
v1 v2
2 3 B
4 3 D
3 4 C
1 5 A
5 6 E
Note that I transformed your list to a data frame. Since a data frame is itself a list (with all elements the same length) this shouldn't be a problem in your case.
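If the data has to stay a list, the same ordering can be applied to each element directly. This is a sketch, assuming every element of s has the same length, as stated in the question; ord and sorted are illustrative names:

```r
s <- list(c(5, 3, 4, 3, 6), c("A", "B", "C", "D", "E"))

# order() accepts several keys; do.call passes each list element as one key,
# so ties in the first vector are broken by the second, and so on
ord <- do.call(order, s)

# Reorder every element of the list with the same permutation
sorted <- lapply(s, `[`, ord)

sorted
```

Because do.call hands over every element, this generalizes to lists with more than two vectors.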
I'm trying to draw a boxplot of a list of values in ggplot2, but the problem is that it doesn't know how to deal with lists. What should I try?
E.g.:
k <- list(c(1,2,3,4,5),c(1,2,3,4),c(1,3,6,8,14),c(1,3,7,8,10,37))
k
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 1 2 3 4
[[3]]
[1] 1 3 6 8 14
[[4]]
[1] 1 3 7 8 10 37
If I pass k as an argument to boxplot() it will handle it flawlessly and produce a nice (well not so nice... hehehe) boxplot with the range of all the values as the Y-axis and the list index (each element) as the X-axis.
How should I achieve the exact same effect with ggplot2? I think that data frames or matrices are not an option because the vectors are of different lengths.
Thanks
The answer is that you don't. ggplot2 is designed to work with data frames, particularly long form data frames. That means you need your data as one tall vector, with a grouping factor:
library(ggplot2)
# One tall vector of values plus a grouping label per list element
d <- data.frame(x = unlist(k),
                grp = rep(letters[seq_along(k)], times = lengths(k)))
ggplot(d, aes(x = grp, y = x)) + geom_boxplot()
And as pointed out in the comments, melt achieves the same result as this manual reshaping and is much simpler. I guess I like to make things difficult.
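As another shortcut, base R's stack() builds the same long data frame from a named list. This is a swapped-in alternative to melt, not the approach from the comments, and the plotting step assumes ggplot2 is installed:

```r
k <- list(c(1, 2, 3, 4, 5), c(1, 2, 3, 4), c(1, 3, 6, 8, 14), c(1, 3, 7, 8, 10, 37))

# stack() needs names; it returns columns `values` and `ind` (a factor of names)
d <- stack(setNames(k, letters[seq_along(k)]))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d, aes(x = ind, y = values)) + geom_boxplot()
  # print(p) to draw the plot
}

head(d)
```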