Concatenating data frame values in R not working as expected - r

Consider the following code:
> a <- data.frame(name=c('a','b','c'))
> b <- data.frame(type=a$name[1])
> c <- data.frame(type=c(a$name[1],a$name[2]))
> b
type
1 a
> c
type
1 1
2 2
Why does b$type have a value of a, the actual assigned value, whereas c$type takes the value of the index number (1 and 2)?

Well, a$name is a factor, not a character vector, and you can't concatenate factors like that (because the c function currently doesn't handle factors). Factors are really integer vectors with a levels attribute (and a class), so the c function just uses the integer values. This can probably be considered a bug.
One way to combine factors is to use unlist, which has special code for this case:
c <- data.frame(type=unlist(list(a$name[1], a$name[2])))
Another way is to convert to character vectors:
c <- data.frame(type=c(as.character(a$name[1]), as.character(a$name[2])))
A third way is to use a character vector from the start:
a <- data.frame(name=c('a','b','c'), stringsAsFactors=FALSE)
c <- data.frame(type=c(a$name[1],a$name[2]))
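The three workarounds above can be checked side by side. This is only a minimal sketch: it builds the factor explicitly with factor() so that it does not depend on the stringsAsFactors default, and the names via_unlist, via_char and codes are made up for illustration. As far as I know, R 4.0.0 changed that default to FALSE and R 4.1.0 gave c() a method for factors, so a recent installation may not reproduce the output shown in the question.

```r
# Build the factor explicitly so the result does not depend on the
# stringsAsFactors default of the installed R version.
a <- data.frame(name = factor(c("a", "b", "c")))

# unlist() on a list of factors returns a factor with the combined levels.
via_unlist <- unlist(list(a$name[1], a$name[2]))

# Converting to character first sidesteps factors entirely.
via_char <- c(as.character(a$name[1]), as.character(a$name[2]))

# These are the underlying integer codes that showed up in the question.
codes <- as.integer(a$name[1:2])

print(via_unlist)
print(via_char)
print(codes)
```

If only a subset of an existing factor is needed, plain indexing such as a$name[1:2] also keeps the labels.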

Related

Compare 2 vectors and add missing values from target vector in R

I am using R and I have a vector, correct, whose names contain all the target names:
correct <- c("a","b","c","d","e")
names(correct) <- c("A","B","C","D","E")
The vector I have to modify, tofix, may be missing some of the names found in correct. In the example below, it is missing "C" and "E".
tofix <- c(2,5,4)
names(tofix) <- c("A","B","D")
I want the resulting vector, fixed, to have the same names as correct, in the same order, with 0 as the value wherever a name is missing from tofix, like this:
fixed <- c(2,5,0,4,0)
names(fixed) <- names(correct)
Any idea how to do this in R? I tried with multiple if statements and for loops, but time complexity was far from ideal.
Many thanks in advance.
You may try
fixed <- rep(0, length(correct))
fixed[match(names(tofix), names(correct))] <- tofix
names(fixed) <- names(correct)
fixed
A B C D E
2 5 0 4 0
unlist(modifyList(as.list(table(names(correct))*0), as.list(tofix)))
A B C D E
2 5 0 4 0
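A variation on the same idea is to index by name instead of by position. This is a sketch using the vectors from the question; it assumes every name in tofix also occurs in correct, since an unknown name would be appended to the result rather than ignored.

```r
correct <- c("a", "b", "c", "d", "e")
names(correct) <- c("A", "B", "C", "D", "E")
tofix <- c(2, 5, 4)
names(tofix) <- c("A", "B", "D")

# Start from a zero vector that already carries the target names.
fixed <- setNames(rep(0, length(correct)), names(correct))

# Assigning by name overwrites only the entries present in tofix.
fixed[names(tofix)] <- tofix

print(fixed)
```

Both this and the match version are vectorised, so neither needs a loop over the names.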

How to find vectors with duplicate values in a row?

I have a lot of vectors, which looks something like this:
a <- c(0,0,0,1,1)
b <- c(1,0,0,0,1)
c <- c(0,0,1,1,1)
Each of these vectors contains a value that is repeated three times in succession.
I need to identify these repetitions. The key condition is that the repeated values directly follow one another.
duplicated() will not help, at least not in base R.
I need to identify such vectors so that I can remove them.
A vector that is suitable for my work:
d <- c(1,0,1,0,0)
An unsuitable vector:
e <- c(1,1,1,0,0)
You might want to take a look at the rle function from base R or the rleid function from data.table.
rle(c(0,0,0,1,1))
Run Length Encoding
lengths: int [1:2] 3 2
values : num [1:2] 0 1
library(data.table)
rleid(c(0,0,0,1,1))
[1] 1 1 1 2 2
Both will look at runs of the same number. The rle function returns a list of lengths and values, and the rleid function returns a vector counting up each time the number in the series changes.
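Building on rle, one possible way to flag and drop the unwanted vectors is sketched below. The helper name has_run and its threshold argument n are made up for illustration, and the sketch assumes the vectors are collected in a list.

```r
# TRUE if any value occurs at least n times in a row.
has_run <- function(x, n = 3) {
  any(rle(x)$lengths >= n)
}

# The example vectors from the question, gathered in a named list.
vectors <- list(
  a = c(0, 0, 0, 1, 1),
  b = c(1, 0, 0, 0, 1),
  c = c(0, 0, 1, 1, 1),
  d = c(1, 0, 1, 0, 0),
  e = c(1, 1, 1, 0, 0)
)

# Flag every vector that contains a run of three or more equal values.
flagged <- vapply(vectors, has_run, logical(1))

# Keep only the vectors without such a run.
kept <- vectors[!flagged]
print(names(kept))
```

With these inputs only d survives, which matches the "suitable" example in the question.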

r - Force which() to return only first match

Part of a function I'm working on uses the following code to take a data frame and reorder its columns on the basis of the largest (absolute) value in each column.
ord <- order(abs(apply(dfm,2,function(x) x[which(abs(x) == max(abs(x)), arr.ind = TRUE)])))
For the most part, this works fine, but with the dataset I'm working on, I occasionally get data that looks like this:
a <- rnorm(10,5,7); b <- rnorm(10,0,1); c <- rep(1,10)
dfm <- data.frame(A = a, B = b, C = c)
> dfm
A B C
1 0.6438373 -1.0487023 1
2 10.6882204 0.7665011 1
3 -16.9203506 -2.5047946 1
4 11.7160291 -0.1932127 1
5 13.0839793 0.2714989 1
6 11.4904625 0.5926858 1
7 -5.9559206 0.1195593 1
8 4.6305924 -0.2002087 1
9 -2.2235623 -0.2292297 1
10 8.4390810 1.1989515 1
When that happens, the above code returns a "non-numeric argument to mathematical function" error at the abs() step. (And if I get rid of the abs() step because I know, due to transformation, my data will be all positive, order() returns: "unimplemented type 'list' in 'orderVector1'".) This is because which() returns all the 1's in column C, which in turn makes apply() spit out a list, rather than a nice tidy vector.
My question is this: How can I make which() JUST return one value for column C in this case? Alternately, is there a better way to write this code to do what I want it to (reorder the columns of a matrix based on the largest value in each column, whether or not that largest value is duplicated) that won't have this problem?
If you want to select just the first element of the result, you can subset it with [1]:
ord <- order(abs(apply(dfm,2,function(x) x[which(abs(x) == max(abs(x)), arr.ind = TRUE)][1])))
To order the columns by their maximum element (in absolute value), you can do
dfm[order(apply(abs(dfm),2,max))]
Your code, with @CarlosCinelli's correction, should work fine, though.
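Another option that avoids which() altogether is which.max(), which returns only the position of the first maximum. The sketch below uses a small made-up data frame (not the data from the question) with a constant column C to show that ties no longer produce a list.

```r
# Toy data for illustration only; column C is constant, so every row ties.
dfm <- data.frame(
  A = c(0.6, -16.9, 13.1),
  B = c(-1.0, -2.5, 0.3),
  C = c(1, 1, 1)
)

# which.max() returns a single index even when the maximum is duplicated,
# so each column yields exactly one number.
first_extreme <- vapply(dfm, function(x) x[which.max(abs(x))], numeric(1))

# Order the columns by the size of that extreme value.
ord <- order(abs(first_extreme))
reordered <- dfm[ord]

print(names(reordered))
```

vapply() with numeric(1) also fails loudly if a column ever returns more than one value, instead of silently building a list.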

R: Index to unique vector that returns original

I have a vector v <- c(6,8,5,5,8) of which I can obtain the unique values using
> u <- unique(v)
> u
[1] 6 8 5
Now I need an index i = [2,3,1,1,3] that returns the original vector v when indexed into u.
> u[i]
[1] 6,8,5,5,8
I know such an index can be generated automatically in Matlab, the ci index, but does not seem to be part of the standard repertoire in R. Is anyone aware of a function that can do this?
The background is that I have several vectors with anonymized IDs that are long character strings:
ids
"PTefkd43fmkl28en==3rnl4"
"cmdREW3rFDS32fDSdd;32FF"
"PTefkd43fmkl28en==3rnl4"
"PTefkd43fmkl28en==3rnl4"
"cmdREW3rFDS32fDSdd;32FF"
To reduce the file size and simplify the code, I want to transform them into integers of the sort
ids
1
2
1
1
2
and found that the index of the unique vector does just this. Since there are many rows, I am hesitant to write a function that loops over each element of the unique vector and wonder whether there is a more efficient way — or a completely different way to transform the character strings into matching integers.
Try with match
df1$ids <- with(df1, match(ids, unique(ids)) )
df1$ids
#[1] 1 2 1 1 2
Or we can convert to factor and coerce to numeric
with(df1,as.integer(factor(ids, levels=unique(ids))))
#[1] 1 2 1 1 2
Using u and v. Based on the index 'i' shown in the OP's post, 'u' must have been sorted
u <- sort(unique(v))
match(v, u)
#[1] 2 3 1 1 3
Or using findInterval. Make sure that 'u' is sorted.
findInterval(v,u)
#[1] 2 3 1 1 3
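To confirm that such an index really reproduces the original, here is a short round-trip check. It is a sketch that uses the values from the question directly, without the df1 data frame.

```r
ids <- c("PTefkd43fmkl28en==3rnl4",
         "cmdREW3rFDS32fDSdd;32FF",
         "PTefkd43fmkl28en==3rnl4",
         "PTefkd43fmkl28en==3rnl4",
         "cmdREW3rFDS32fDSdd;32FF")

# Unique values in order of first appearance, and the index into them.
u_ids <- unique(ids)
i_ids <- match(ids, u_ids)

# Indexing the unique values with the index gives back the original vector.
print(i_ids)
print(identical(u_ids[i_ids], ids))

# The same round trip for the numeric example, without sorting u.
v <- c(6, 8, 5, 5, 8)
u <- unique(v)
i <- match(v, u)
print(i)
```

Without sorting, the index for v is 1 2 3 3 2; the 2 3 1 1 3 in the question only arises when u is sorted first.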

AUTORECODE from SPSS to R

I want to write a function that is doing the same as the SPSS command AUTORECODE.
AUTORECODE recodes the values of string and numeric variables to consecutive integers and puts the recoded values into a new variable called a target variable.
At first I tried this way:
AUTORECODE <- function(variable = NULL){
A <- sort(unique(variable))
B <- seq(1:length(unique(variable)))
REC <- Recode(var = variable, recodes = "A = B")
return(REC)
}
But this causes an error. I think the problem is caused by the way A and B are passed to the recodes argument. That's why I tried
eval(parse(text = paste("REC <- Recode(var = variable, recodes = 'c(",A,") = c(",B,")')")))
within the function. But this isn't the right solution.
Ideas?
factor may simply be what you need, as James suggested in a comment. It stores the values as integers behind the scenes (as shown by str) and just prints the corresponding labels. This may also be very useful because R has lots of functions that work with factors appropriately; for example, when fitting linear models, it creates all the "dummy" variables for you.
> x <- LETTERS[c(4,2,3,1,3)]
> f <- factor(x)
> f
[1] D B C A C
Levels: A B C D
> str(f)
Factor w/ 4 levels "A","B","C","D": 4 2 3 1 3
If you do just need the numbers, use as.integer on the factor.
> n <- as.integer(f)
> n
[1] 4 2 3 1 3
An alternate solution is to use match, but if you're starting with floating-point numbers, watch out for floating-point traps. factor converts everything to characters first, which effectively rounds floating-point numbers to a certain number of digits, making floating-point traps less of a concern.
> match(x, sort(unique(x)))
[1] 4 2 3 1 3
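Wrapped up as a function, the factor approach might look like the sketch below. The name autorecode is hypothetical, and the function only returns the consecutive integer codes; it does not try to reproduce any other behaviour of the SPSS command.

```r
# Recode a character or numeric vector to consecutive integers,
# numbered according to the sorted unique values.
autorecode <- function(variable) {
  as.integer(factor(variable))
}

x <- LETTERS[c(4, 2, 3, 1, 3)]
rec_chr <- autorecode(x)
print(rec_chr)

# Numeric input is ranked by value before being labelled.
rec_num <- autorecode(c(10.5, 2, 10.5, 7))
print(rec_num)
```

Missing values stay NA in the result, because factor() does not create a level for NA by default.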
