I want to write a function that does the same as the SPSS command AUTORECODE.
AUTORECODE recodes the values of string and numeric variables to consecutive integers and puts the recoded values into a new variable called a target variable.
At first I tried this way:
AUTORECODE <- function(variable = NULL){
A <- sort(unique(variable))
B <- seq(1:length(unique(variable)))
REC <- Recode(var = variable, recodes = "A = B")
return(REC)
}
But this causes an error. I think the problem is the way A and B are passed to the recodes argument. That's why I tried
eval(parse(text = paste("REC <- Recode(var = variable, recodes = 'c(",A,") = c(",B,")')")))
within the function. But this isn't the right solution.
Ideas?
factor may be simply what you need, as James suggested in a comment. It stores the values as integers behind the scenes (as shown by str) and just prints the corresponding labels. This can also be very useful because R has many functions that handle factors appropriately; for example, when fitting linear models, it creates all the "dummy" variables for you.
> x <- LETTERS[c(4,2,3,1,3)]
> f <- factor(x)
> f
[1] D B C A C
Levels: A B C D
> str(f)
Factor w/ 4 levels "A","B","C","D": 4 2 3 1 3
If you do just need the numbers, use as.integer on the factor.
> n <- as.integer(f)
> n
[1] 4 2 3 1 3
An alternative solution is to use match, but if you're starting with floating-point numbers, watch out for floating-point traps. factor converts everything to character first, which effectively rounds floating-point numbers to a certain number of digits, making floating-point traps less of a concern.
> match(x, sort(unique(x)))
[1] 4 2 3 1 3
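Putting this together, a minimal sketch of an AUTORECODE-like helper could look like the following. The name autorecode and its handling of NA are assumptions made for illustration; they are not taken from SPSS or from any package.

```r
# Hypothetical helper: recode values to consecutive integers in sorted order.
autorecode <- function(variable) {
  # sort() drops NA by default, so NA inputs stay NA in the result.
  lev <- sort(unique(variable))
  match(variable, lev)
}

recoded_chr <- autorecode(c("D", "B", "C", "A", "C"))
recoded_chr
# [1] 4 2 3 1 3

recoded_num <- autorecode(c(10.5, 3, 10.5, 7))
recoded_num
# [1] 3 1 3 2
```

Because it relies on match, the floating-point caveat above applies to numeric input.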
I'm quite new to R programming, so I am just learning here and there. I recently came across these lines
x <- as.factor(rep(1:4, 2))
x
# [1] 1 2 3 4 1 2 3 4
# Levels: 1 2 3 4
But if I do
x <- factor(rep(1:4, 2))
that gives me the same result. So what is the difference between factor and as.factor? I get how factor is pulling same numbers out and making them levels, but I don't get what the exact differences are between factor and as.factor.
Four common atomic data types in R (in least to most flexible order) are: logical, integer, double, and character.
All elements of an atomic vector must be the same type, so when we attempt to combine different types they will be coerced to the most flexible type.
For example:
str(c("a", 1))
#> chr [1:2] "a" "1"
Coercion often happens automatically. We can also explicitly coerce with as.factor(), as.character(), as.double(), as.integer(), or as.logical().
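As a small illustration of both kinds of coercion (the variable name f is only for this sketch), note in particular that as.integer() on a factor returns the internal codes rather than the numbers shown in the labels:

```r
# Implicit coercion moves towards the most flexible type in the vector.
typeof(c(TRUE, 1L))    # "integer"
typeof(c(1L, 2.5))     # "double"
typeof(c(2.5, "a"))    # "character"

# Explicit coercion of a factor: as.integer() gives the internal codes.
f <- factor(c("10", "20", "10"))
as.integer(f)                 # 1 2 1
as.integer(as.character(f))   # 10 20 10
```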
So, as pointed out by @alistaire, to understand the difference between factor() and as.factor(), you must understand 'coercion'.
You can read about coercion at https://www.safaribooksonline.com/library/view/r-in-a/9781449358204/ch05s08.html
Also, as a beginner, you should read about data structures in R at
http://adv-r.had.co.nz/Data-structures.html
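One observable difference, offered here as a supplement rather than as part of the answer above: as.factor() returns an existing factor unchanged, while factor() rebuilds it and drops unused levels. A short sketch (the variable names are arbitrary):

```r
f <- factor(c("a", "b", "c"))[1:2]   # level "c" is now unused

levels(as.factor(f))   # "a" "b" "c"  (input returned as-is)
levels(factor(f))      # "a" "b"      (unused level dropped)

# For a non-factor input such as rep(1:4, 2), both give the same
# levels and the same integer codes.
x <- rep(1:4, 2)
identical(levels(factor(x)), levels(as.factor(x)))           # TRUE
identical(as.integer(factor(x)), as.integer(as.factor(x)))   # TRUE
```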
Part of a function I'm working on uses the following code to take a data frame and reorder its columns on the basis of the largest (absolute) value in each column.
ord <- order(abs(apply(dfm,2,function(x) x[which(abs(x) == max(abs(x)), arr.ind = TRUE)])))
For the most part, this works fine, but with the dataset I'm working on, I occasionally get data that looks like this:
a <- rnorm(10,5,7); b <- rnorm(10,0,1); c <- rep(1,10)
dfm <- data.frame(A = a, B = b, C = c)
> dfm
             A          B C
1    0.6438373 -1.0487023 1
2   10.6882204  0.7665011 1
3  -16.9203506 -2.5047946 1
4   11.7160291 -0.1932127 1
5   13.0839793  0.2714989 1
6   11.4904625  0.5926858 1
7   -5.9559206  0.1195593 1
8    4.6305924 -0.2002087 1
9   -2.2235623 -0.2292297 1
10   8.4390810  1.1989515 1
When that happens, the above code returns a "non-numeric argument to mathematical function" error at the abs() step. (And if I get rid of the abs() step because I know, due to transformation, my data will be all positive, order() returns: "unimplemented type 'list' in 'orderVector1'".) This is because which() returns all the 1's in column C, which in turn makes apply() spit out a list, rather than a nice tidy vector.
My question is this: How can I make which() JUST return one value for column C in this case? Alternately, is there a better way to write this code to do what I want it to (reorder the columns of a matrix based on the largest value in each column, whether or not that largest value is duplicated) that won't have this problem?
If you want to select just the first element of the result, you can subset it with [1]:
ord <- order(abs(apply(dfm,2,function(x) x[which(abs(x) == max(abs(x)), arr.ind = TRUE)][1])))
To order the columns by their maximum element (in absolute value), you can do
dfm[order(apply(abs(dfm),2,max))]
Your code, with @CarlosCinelli's correction, should work fine, though.
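To see why taking the maximum directly avoids the list problem, here is a sketch on a small fixed data frame (the values are made up for illustration and replace the random data in the question). vapply() is used in place of apply() only so that the result type is checked.

```r
# Column C holds the same value in every row, so its maximum is tied.
dfm <- data.frame(A = c(1, -20, 3), B = c(0.5, -2, 1), C = c(1, 1, 1))

# max() returns one value per column even when the maximum is tied,
# so the result is always a plain numeric vector.
max_abs <- vapply(dfm, function(x) max(abs(x)), numeric(1))
max_abs                      # A = 20, B = 2, C = 1

ordered_dfm <- dfm[order(max_abs)]
names(ordered_dfm)           # "C" "B" "A"

# If the signed value is needed, which.max() picks the first maximum only.
signed_max <- vapply(dfm, function(x) x[which.max(abs(x))], numeric(1))
signed_max                   # A = -20, B = -2, C = 1
```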
I have a vector v <- c(6,8,5,5,8) of which I can obtain the unique values using
> u <- unique(v)
> u
[1] 6 8 5
Now I need an index i = [2,3,1,1,3] that returns the original vector v when indexed into u.
> u[i]
[1] 6 8 5 5 8
I know such an index can be generated automatically in Matlab (the ci index), but it does not seem to be part of the standard repertoire in R. Is anyone aware of a function that can do this?
The background is that I have several vectors with anonymized IDs that are long character strings:
ids
"PTefkd43fmkl28en==3rnl4"
"cmdREW3rFDS32fDSdd;32FF"
"PTefkd43fmkl28en==3rnl4"
"PTefkd43fmkl28en==3rnl4"
"cmdREW3rFDS32fDSdd;32FF"
To reduce the file size and simplify the code, I want to transform them into integers of the sort
ids
1
2
1
1
2
and found that the index of the unique vector does just this. Since there are many rows, I am hesitant to write a function that loops over each element of the unique vector and wonder whether there is a more efficient way — or a completely different way to transform the character strings into matching integers.
Try with match
df1$ids <- with(df1, match(ids, unique(ids)) )
df1$ids
#[1] 1 2 1 1 2
Or we can convert to factor and coerce to integer
with(df1,as.integer(factor(ids, levels=unique(ids))))
#[1] 1 2 1 1 2
Using u and v: the index shown in the OP's post (2, 3, 1, 1, 3) only reproduces v if 'u' is sorted, so sort it first
u <- sort(unique(v))
match(v, u)
#[1] 2 3 1 1 3
Or using findInterval. Make sure that 'u' is sorted.
findInterval(v,u)
#[1] 2 3 1 1 3
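The following sketch compares the variants side by side and checks that each index reproduces v. The names u_sorted, i and i_sorted are introduced here for illustration. It also shows one caveat with findInterval: it does interval lookup rather than exact matching, so a value that is absent from u still gets an index.

```r
v <- c(6, 8, 5, 5, 8)

# Index into the unsorted unique values (order of first appearance).
u <- unique(v)                   # 6 8 5
i <- match(v, u)                 # 1 2 3 3 2
identical(u[i], v)               # TRUE

# Index into the sorted unique values, as in the question.
u_sorted <- sort(u)              # 5 6 8
i_sorted <- match(v, u_sorted)   # 2 3 1 1 3
identical(u_sorted[i_sorted], v) # TRUE

# A value that is not in u_sorted: findInterval() still returns an index,
# whereas match() returns NA.
findInterval(7, u_sorted)        # 2
match(7, u_sorted)               # NA
```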
I need to compare the values stored in two variables. The variables have different lengths. For example
x = c(1,2,3,4,5,6,7,8,9,10)
and
y = c(2,6,11,12,13)
I need an answer showing that 2 and 6 are present in both variables. I need this to be done in R. Can anyone help, please?
The intersect function avoids the need for @mdsumner's simple indexing:
> x = c(1,2,3,4,5,6,7,8,9,10)
> y = c(2,6,11,12,13)
> intersect(x,y)
[1] 2 6
Whole bunch of set operators to be found here: help(intersect)
Posted after the added requirement that some sort of tolerance be allowed: You could sequentially check one set of values against all the others in the second set or you could do it all at once with outer(). Once you have the outer result as a logical matrix there remains the task of referring back to the values, but expand.grid seems capable of handling that:
expand.grid(x,y)[outer(x,y, FUN=function(x,y) abs(x-y) < 0.01), ]
# Var1 Var2
#2 2 2
#16 6 6
After posting, it occurred to me that your values were sorted. It turns out that this extraction from expand.grid() also works when the vectors are unsorted.
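To illustrate why a tolerance can matter, here is a sketch with values chosen so that exact comparison misses a match. The data, the tolerance of 1e-8 and the names is_close and hits are assumptions for this example only.

```r
x <- c(0.1 + 0.2, 1, 2)
y <- c(0.3, 2, 5)

# Exact matching misses 0.3, because 0.1 + 0.2 is not exactly 0.3.
intersect(x, y)
# [1] 2

# Logical matrix: is_close[i, j] is TRUE when x[i] and y[j] differ by < tol.
tol <- 1e-8   # assumed tolerance; pick one suited to the data
is_close <- outer(x, y, function(a, b) abs(a - b) < tol)

# Keep every x that is close to at least one y.
hits <- x[rowSums(is_close) > 0]
hits
# [1] 0.3 2.0
```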
x[x %in% y]
[1] 2 6
Or, more explicitly:
x[match(x, y, nomatch = 0) > 0]
[1] 2 6
Note that you actually chain together the results of the match with simple indexing into the input values.
See ?match.
Consider the following code:
> a <- data.frame(name=c('a','b','c'))
> b <- data.frame(type=a$name[1])
> c <- data.frame(type=c(a$name[1],a$name[2]))
> b
type
1 a
> c
type
1 1
2 2
Why does b$type have a value of a, the actual assigned value, whereas c$type takes the value of the index number (1 and 2)?
Well, a$name is a factor, not a character vector, and you can't concatenate factors like that (because the c function currently doesn't handle factors). Factors are really integer vectors with a levels attribute (and a class), so the c function just uses the integer values. This can probably be considered a bug.
One way to combine factors is by using unlist, which has special code for this case:
c <- data.frame(type=unlist(list(a$name[1], a$name[2])))
Another way is to convert to character vectors:
c <- data.frame(type=c(as.character(a$name[1]), as.character(a$name[2])))
A third way is to use a character vector from the start:
a <- data.frame(name=c('a','b','c'), stringsAsFactors=FALSE)
c <- data.frame(type=c(a$name[1],a$name[2]))
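The approaches above can be checked with a short sketch. stringsAsFactors is set explicitly here because its default, like the behaviour of c() on factors, depends on the R version; the name combined is introduced only for this example.

```r
# Force a$name to be a factor regardless of the version's default.
a <- data.frame(name = c("a", "b", "c"), stringsAsFactors = TRUE)

is.factor(a$name)            # TRUE
as.integer(a$name[1:2])      # 1 2  (the internal codes)

# unlist() on a list of factors returns a factor with the labels intact.
combined <- unlist(list(a$name[1], a$name[2]))
as.character(combined)       # "a" "b"
levels(combined)             # "a" "b" "c"

# Going through character vectors gives plain strings.
c(as.character(a$name[1]), as.character(a$name[2]))   # "a" "b"
```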