I have a data frame whose columns consist of randomly distributed values and NAs, as seen below:
a<-c("S","E","NA","S","NA")
b<-c("A","NA","M","G","K")
c<-c("I","NA","NA","NA","L")
meh<-data.frame(a,b,c)
# [,1] [,2] [,3] [,4] [,5]
#a "S" "E" "NA" "S" "NA"
#b "A" "NA" "M" "G" "K"
#c "I" "NA" "NA" "NA" "L"
I want to remove all NAs and shift the non-NAs to the left, so that it looks like this:
# [,1] [,2] [,3] [,4]
#a "S" "E" "S"
#b "A" "M" "G" "K"
#c "I" "L"
Any ideas?
We can also use stri_list2matrix
library(stringi)
stri_list2matrix(lapply(meh, function(x) x[x!='NA']), fill='', byrow=TRUE)
# [,1] [,2] [,3] [,4]
#[1,] "S" "E" "S" ""
#[2,] "A" "M" "G" "K"
#[3,] "I" "L" "" ""
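If you would rather not depend on stringi, the same padding can be sketched in base R. This is only a rough equivalent, assuming the "NA" values are strings as in the question; `kept`, `width` and `shifted` are names made up for the illustration.

```r
# Same data as the question: "NA" is a string here, not a real NA.
meh <- data.frame(
  a = c("S", "E", "NA", "S", "NA"),
  b = c("A", "NA", "M", "G", "K"),
  c = c("I", "NA", "NA", "NA", "L"),
  stringsAsFactors = FALSE
)

# Drop the "NA" strings from each column.
kept <- lapply(meh, function(x) x[x != "NA"])

# Pad every vector with "" up to the longest length, then stack them as rows.
width <- max(lengths(kept))
shifted <- t(sapply(kept, function(x) c(x, rep("", width - length(x)))))
shifted
```

The result is a 3 x 4 character matrix with row names a, b and c, padded on the right with empty strings.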
It might help if you specified what you want to do with the data afterwards, but here's a way to remove the NAs from each column and store the result in a variable. This assumes you actually have real NAs, so I changed your example dataset, following the comments above, to use NA rather than "NA".
a<-c("S","E",NA,"S",NA)
b<-c("A",NA,"M","G","K")
c<-c("I",NA,NA,NA,"L")
meh<-data.frame(a,b,c)
newcol<-na.omit(meh$a) #Removes all NA's from your column
newcol<-newcol[1:length(newcol)] #Gives you an output without any NA's
The same can be done with each row like jeremycg suggests, using lapply.
lapply(seq_len(nrow(meh)), function(x) meh[x,][!is.na(meh[x,])])
Once the vectors are all different sizes, it doesn't make sense to cbind them back into wide form. Instead, keep the data in long form and drop the NAs:
library(dplyr)
library(tidyr)
meh %>%
gather(variable, value) %>%
filter(!is.na(value))
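For comparison, a base R sketch of the same long-form idea, assuming real NAs as in the modified example. It uses `stack()`, which names its output columns `values` and `ind`; `long` is just an illustrative name.

```r
meh <- data.frame(
  a = c("S", "E", NA, "S", NA),
  b = c("A", NA, "M", "G", "K"),
  c = c("I", NA, NA, NA, "L"),
  stringsAsFactors = FALSE
)

# stack() puts all columns into one "values" column,
# with "ind" recording which column each value came from.
long <- stack(meh)

# Keep only the rows that hold a real value.
long <- long[!is.na(long$values), ]
long
```

This leaves 9 rows: 3 from a, 4 from b and 2 from c.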
I need to write a function in R that has no input but randomly selects a set of 13 pairs of letters.
And the output of such function has to be a 2 x 13 matrix. But the letters can appear only once, meaning they cannot be repeated within a row or amongst rows.
So far, I've come up with this:
f <- function(){
x <- letters[1:26]
return(matrix(sample(x,13, replace = F), 2, 13))
}
I've managed to make sure letters do not repeat within a row (with replace = F), but I don't know how to make sure letters from one row do not appear again in the other row.
Any ideas?
You don't need to generate two vectors. Sample all 26 letters once and fill a 2 x 13 matrix with them:
x <- letters[1:26]
matrix(sample(x,26,replace = F),2,13)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] "s" "m" "h" "z" "q" "y" "w" "x" "p" "n" "e" "o" "j"
[2,] "r" "b" "d" "v" "u" "a" "k" "i" "f" "l" "g" "c" "t"
Here is the shorthand version
x <- letters
matrix(sample(x),2)
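Since the question asks for a function with no input, the shorthand can be wrapped as below. This is a minimal sketch; `random_pairs` and `m` are hypothetical names, and the last two lines only check the no-repeat requirement.

```r
# Hypothetical helper: takes no arguments, as the question requires.
random_pairs <- function() {
  # sample(letters) is a random permutation of all 26 letters,
  # so no letter can appear twice anywhere in the matrix.
  matrix(sample(letters), nrow = 2, ncol = 13)
}

m <- random_pairs()
dim(m)                       # 2 13
anyDuplicated(as.vector(m))  # 0 means no letter is repeated
```

Each column of `m` is one pair, and every letter of the alphabet is used exactly once.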
I'm trying to convert a data set in a long format panel structure to an adjacency matrix or edge list to make network graphs. The data set contains articles each identified by an ID-number. Each article can appear several times under a number of categories. Hence I have a long format structure at the moment:
ID <- c(1,1,1,2,2,2,3,3)
Category <- c("A","B","C","B","E","H","C","E")
dat <- data.frame(ID,Category)
I want to convert this into an adjacency matrix or edge list, where the edge list should look something like this:
A B
A C
B C
B E
B H
E H
C E
Edit: I have tried dat <- merge(ID, Category, by="Category") but it returns the error message Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
Thanks in advance
Update: I ended up using the crossprod(table(dat)) from the comments, but the solution suggested by Navy Cheng below works just as well
This code will work
do.call(rbind,lapply(split(dat, dat$ID), function(x){
t(combn(as.vector(x$Category), 2))
}))
Update
As @Parfait suggested, you can use by instead of split + lapply.
1) Use by to group the nodes ("A", "B", "C" ...) by ID;
2) Use combn to create the edges between the nodes in each group, and t to transpose each matrix so the results can be combined with rbind
> edge.list <- by(dat, dat$ID, function(x) t(combn(as.vector(x$Category), 2)))
> edge.list
dat$ID: 1
[,1] [,2]
[1,] "A" "B"
[2,] "A" "C"
[3,] "B" "C"
------------------------------------------------------------
dat$ID: 2
[,1] [,2]
[1,] "B" "E"
[2,] "B" "H"
[3,] "E" "H"
------------------------------------------------------------
dat$ID: 3
[,1] [,2]
[1,] "C" "E"
3) Then merge the list
> do.call(rbind, edge.list)
[,1] [,2]
[1,] "A" "B"
[2,] "A" "C"
[3,] "B" "C"
[4,] "B" "E"
[5,] "B" "H"
[6,] "E" "H"
[7,] "C" "E"
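The question's update mentions `crossprod(table(dat))` for the adjacency-matrix form. A small sketch of what that produces on the example data follows; zeroing the diagonal is an extra step assumed here, not something stated in the thread.

```r
ID <- c(1, 1, 1, 2, 2, 2, 3, 3)
Category <- c("A", "B", "C", "B", "E", "H", "C", "E")
dat <- data.frame(ID, Category)

# table(dat) is an ID x Category incidence table; crossprod(m) is t(m) %*% m,
# so entry [i, j] counts the IDs under which categories i and j both appear.
adj <- crossprod(table(dat))

# The diagonal counts how often each category occurs; drop it for a plain adjacency matrix.
diag(adj) <- 0
adj
```

Here `adj["A", "B"]` is 1 because both categories appear under ID 1, while `adj["A", "E"]` is 0 because no article has both.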
If you are willing to convert your data.frame to a data.table, this problem can be solved cleanly, and with many rows it will be much faster.
library(data.table)
dat<-data.table(dat)
Basically, you can apply functions to columns of the data.table in the j argument and group with the by argument. You want all the combinations of categories taken two at a time for each ID, which looks like this:
dat[,combn(Category,2),by=ID]
However, stopping here keeps the ID column and, by default, creates a column called V1 that flattens the matrix returned by combn into a vector of categories, rather than the two-column edge list you need. By chaining a second call you can reshape that vector into a matrix, as you would with any single vector. In one line of code this looks like:
dat[,combn(Category,2),by=ID][,matrix(V1,ncol=2,byrow = T)]
Remember that the vector column we wish to convert to a matrix is called V1 by default and also we want the 2-column matrix to be created by row instead of the default which is by column. Hope that helps and let me know if I need to add anything to my explanation. Good luck!
What R command generates all possible ordered combinations of length k?
For example from this vector:
a,b,c,d
I want to generate all combinations of length 3, but only those where the order is conserved:
a,b,c
a,b,d
a,c,d
b,c,d
Or, if I have this vector:
a,b,7,d,e
I want to do the same for length 2:
a,b
a,7
a,d
a,e
b,7
b,d
b,e
7,d
7,e
d,e
combn doesn't work here because it gives you all possible combinations including reversed ones such as
c,b
In simple cases I could try to do it with expand.grid but both methods would need further processing.
Maybe there is a base function (or package) able to do what I want or even accepting more complex conditions.
PS: When I say "ordered" I'm speaking about the order of appearance in the starting vector. I don't mean the typographic order, though in my example they are the same.
You can use combn in base R:
vec <- c("a", "b", "c", "d")
len <- 2
combn(length(vec), len, function(x) vec[x])
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] "a" "a" "a" "b" "b" "c"
#[2,] "b" "c" "d" "c" "d" "d"
Of length 3:
combn(length(vec), 3, function(x) vec[x])
# [,1] [,2] [,3] [,4]
#[1,] "a" "a" "a" "b"
#[2,] "b" "b" "c" "c"
#[3,] "c" "d" "d" "d"
Or, as @Sotos pointed out in the comments:
combn(vec, len)
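To check this against the second example in the question, here is a short sketch using the vector a, b, 7, d, e with length 2. `vec2` and `res` are illustrative names; the transpose only puts one combination per row.

```r
vec2 <- c("a", "b", "7", "d", "e")

# combn picks positions in increasing order, so each combination keeps
# the order of appearance in vec2 and no reversed pair is produced.
res <- t(combn(vec2, 2))
apply(res, 1, paste, collapse = ",")
# "a,b" "a,7" "a,d" "a,e" "b,7" "b,d" "b,e" "7,d" "7,e" "d,e"
```

The ten rows match the list in the question, including "a,7" even though "7" would sort before "a" alphabetically.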
I have the following matrix:
V1 V2 V3 V4
[1,] "d" "e" "i" "NA"
[2,] "j" "e" "i" "NA"
[3,] "j" "n" "k" "l"
[4,] "j" "k" "l" "m"
[5,] "j" "k" "i" "NA"
[6,] "o" "n" "NA" "NA"
I am trying to count the number of elements per row that are not NA, but the usual ways, like !is.na(MATRIX), are not working. I always get 4 as the answer. I presume this is because the program is treating "NA" as a character string, but I do not know how to fix this.
'NA' is not NA_character_ so is.na does not work. Just use
rowSums(MATRIX != 'NA')
If the NAs are character strings, convert them to real NA with mat[mat=="NA"] <- NA and then use the solution in Sotos' comment
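Both suggestions can be tried on the matrix from the question. This sketch rebuilds it by hand (column names omitted), and `mat` is a copy so the original stays untouched.

```r
MATRIX <- matrix(c("d", "e", "i",  "NA",
                   "j", "e", "i",  "NA",
                   "j", "n", "k",  "l",
                   "j", "k", "l",  "m",
                   "j", "k", "i",  "NA",
                   "o", "n", "NA", "NA"),
                 ncol = 4, byrow = TRUE)

# Option 1: compare against the string "NA" directly.
rowSums(MATRIX != "NA")
# 3 3 4 4 3 2

# Option 2: convert the strings to real NA first, then use is.na().
mat <- MATRIX
mat[mat == "NA"] <- NA
rowSums(!is.na(mat))
# 3 3 4 4 3 2
```

Option 2 is probably the better choice if the matrix will be used further, since other NA-aware functions will then behave as expected.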
Similar to my question at Using a sample list as a template for sampling from a larger list without wraparound, how can I now do this allowing for a wrap-around?
Thus, if I have a vector of letters:
> all <- letters
> all
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
and then I define a reference sample from letters as follows:
> refSample <- c("j","l","m","s")
in which the spacing between elements is 2 (1st to 2nd), 1 (2nd to 3rd) and 6 (3rd to 4th), how can I then select n samples from all that have identical, wrap-around spacing between their elements to refSample? For example, "a","c","d","j"; "q","s","t","z"; and "r","t","u","a" would be valid samples, but "a","c","d","k" would not.
Again, parameterised for a function is best.
I would have left it as an exercise but here goes --
all <- letters
refSample <- c("j","l","m","s")
pick_matches <- function(n, ref, full, wrap = FALSE) {
iref <- match(ref,full)
spaces <- diff(iref)
tot_space <- sum(spaces)
N <- length( full ) - 1
max_start <- N - tot_space*(1-wrap)
starts <- sample(0:max_start, n, replace = TRUE)
return( sapply( starts, function(s) full[ 1 + cumsum(c(s, spaces)) %% (N+1) ] ) )
}
> set.seed(1)
> pick_matches(5, refSample, all, wrap = FALSE)
[,1] [,2] [,3] [,4] [,5]
[1,] "e" "g" "j" "p" "d"
[2,] "g" "i" "l" "r" "f"
[3,] "h" "j" "m" "s" "g"
[4,] "n" "p" "s" "y" "m"
> pick_matches(5, refSample, all, wrap = TRUE)
[,1] [,2] [,3] [,4] [,5]
[1,] "x" "y" "r" "q" "b"
[2,] "z" "a" "t" "s" "d"
[3,] "a" "b" "u" "t" "e"
[4,] "g" "h" "a" "z" "k"
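To sanity-check a sample against the reference, the spacing can be compared modulo the length of the full vector. This is a small sketch, not part of the answer above; `same_spacing` is a hypothetical helper name.

```r
all <- letters
refSample <- c("j", "l", "m", "s")

# Hypothetical checker: TRUE if candidate has the same spacing as ref,
# counting steps that wrap around the end of full.
same_spacing <- function(candidate, ref, full) {
  n <- length(full)
  identical(diff(match(candidate, full)) %% n,
            diff(match(ref, full)) %% n)
}

same_spacing(c("r", "t", "u", "a"), refSample, all)  # TRUE  (u -> a wraps, 6 steps)
same_spacing(c("q", "s", "t", "z"), refSample, all)  # TRUE
same_spacing(c("a", "c", "d", "k"), refSample, all)  # FALSE (last gap is 7)
```

Applying it to each column of the pick_matches output, for example with `apply(out, 2, same_spacing, ref = refSample, full = all)` where `out` is the returned matrix, should give TRUE throughout.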