Reducing lists in R to match another list

Suppose I have a dataframe 'H', like so
C1 C2
a 1
b 1
c 2
d 3
e 4
f 4
g 5
and a list X (as.factor) that goes
"1" "2" "4"
Using the match command,
X2=H[match(X,H$C2),]
only reduces H to three rows, with just one instance of each element of X present (a, c, e). What command should I use to reduce H so that every row whose C2 value is found in X is kept (i.e., the reduced table should contain a, b, c, e, f)?
Cheers.

> H[H$C2 %in% X,]
C1 C2
1 a 1
2 b 1
3 c 2
5 e 4
6 f 4
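To see why the two calls differ, here is a minimal sketch that rebuilds H and X from the question. The construction is an assumption, since the question only shows printed output. match returns one position per element of X (the first hit only), while %in% returns one TRUE/FALSE per row of H.

```r
# Rebuild the example data (assumed construction; only the printout is given)
H <- data.frame(C1 = letters[1:7],
                C2 = c(1, 1, 2, 3, 4, 4, 5),
                stringsAsFactors = FALSE)
X <- factor(c("1", "2", "4"))

# match(): for each element of X, the position of its FIRST occurrence in H$C2
first_hits <- match(X, H$C2)
first_hits
# [1] 1 3 5

# %in%: for each row of H, whether its C2 value occurs anywhere in X
all_hits <- H$C2 %in% X
all_hits
# [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE

# Keep every matching row, not just the first one per value
H[all_hits, ]
```

Both functions convert the factor to character before comparing, so the factor X matches the numeric column here. Writing as.character(X) explicitly makes that conversion visible.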

Related

count characters based on the order they appear

How does one count the characters in a single string based on the order in which they appear? Below is a minimal example:
x <- "abbccdddaab"
My first thought was this, but it counts the characters irrespective of order:
table(unlist(strsplit(x, "\\b")))
a b c d
3 3 2 3
But the desired output is:
a b c d a b
1 2 2 3 2 1
I would imagine the solution would require a for loop?
We can use rle instead of table. rle checks whether adjacent elements are the same and returns a list with the value of each run (values) and its length (lengths).
out <- rle(strsplit(x, "\\b")[[1]])
setNames(out$lengths, out$values)
# a b c d a b
# 1 2 2 3 2 1
Using data.table::rleid :
x <- "abbccdddaab"
tmp <- strsplit(x, "\\b")[[1]]
table(data.table::rleid(tmp))
#1 2 3 4 5 6
#1 2 2 3 2 1
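If data.table is not available, the same run ids can be sketched in base R. The version below is an illustration rather than part of either answer: it starts a new run wherever a character differs from its predecessor, and it keeps the letters as names, which the rleid table above does not. The variable names (chars, run_id, counts) are made up for this example.

```r
x <- "abbccdddaab"
chars <- strsplit(x, "")[[1]]   # split into single characters

# TRUE at the start of each run: the first character, and every
# character that differs from the one before it
is_start <- c(TRUE, chars[-1] != chars[-length(chars)])

# Cumulative sum of run starts gives a run id per character
run_id <- cumsum(is_start)
run_id
# [1] 1 2 2 3 3 4 4 4 5 5 6

# Count characters per run and label each count with the run's letter
counts <- setNames(unname(lengths(split(chars, run_id))), chars[is_start])
counts
# a b c d a b
# 1 2 2 3 2 1
```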

How to skip rows that are not completely filled in R

So, I'm trying to read an Excel file. Some of the rows are empty in some of the columns, but not in all of them. I want to skip all the rows that are not complete, i.e., that don't have information in all of the columns. For example:
In this case I would like to skip the lines 1,5,6,7,8 and so on.
There is probably a more elegant way of doing it, but one possible solution is to count the number of non-NA elements in each row and keep only the rows where that count equals the number of columns.
Using this dummy example:
df <- data.frame(A = LETTERS[1:6],
B = c(sample(1:10,5),NA),
C = letters[1:6])
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
6 F NA f
Using apply, you can count the number of non-NA elements in each row:
v <- apply(df,1, function(x) length(na.omit(x)))
[1] 3 3 3 3 3 2
Then keep only the rows where that count equals the number of columns (these are the complete rows):
df1 <- df[v == ncol(df),]
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
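As a shorter alternative to counting per row, base R's complete.cases returns TRUE for rows without any NA. The sketch below uses fixed values for column B instead of sample() so that the result is reproducible. It assumes that the empty Excel cells arrive in R as NA, which depends on how the file is read.

```r
# Same shape as the dummy example, with fixed values in place of sample()
df <- data.frame(A = LETTERS[1:6],
                 B = c(5, 9, 1, 3, 4, NA),
                 C = letters[1:6])

# complete.cases() gives one logical per row: TRUE if the row has no NA
keep <- complete.cases(df)
keep
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

df_complete <- df[keep, ]
nrow(df_complete)
# [1] 5
```

na.omit(df) keeps the same rows in a single call.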
Does this answer your question?

Grouping of two variables in r

I have a data frame like below.
dat <- data.frame(v1=c("a","b","c","c","a","w","f"),
v2=c("z","a","a","w","p","e","h"))
v1 v2
1 a z
2 b a
3 c a
4 c w
5 a p
6 w e
7 f h
I want to add a group column based on whether these letters appear in the same row.
v1 v2 gp
1 a z 1
2 b a 1
3 c a 1
4 c w 1
5 a p 1
6 w e 1
7 f h 2
My idea is to first assign the first row to group 1; then any row where v1 or v2 is "a" or "z" is also assigned to group 1.
There are chained cases like rows 3 and 4: "c" is assigned to group 1 because v2 is "a" in row 3, and "w" is assigned to group 1 because v1 in row 4 is "c", which was already assigned to group 1. But my list is very long, so I cannot keep checking all the "descendants" by hand.
I wonder if there is a way to group these letters, and return a list with group number. Something like the below table will do.
letter gp
a 1
b 1
c 1
e 1
f 2
h 2
w 1
z 1
One way to solve this is to consider the letters as vertices of a graph and being in the same row as a link between the vertices. Then what you are asking for is the connected components of the graph. All of that is easy using the igraph package in R.
library(igraph)
G = graph_from_edgelist(as.matrix(dat), directed=FALSE)
letters = sort(unique(c(as.character(dat$v1), as.character(dat$v2))))
(gp = components(G)$membership[letters])
a b c e f h p w z
1 1 1 1 2 2 1 1 1
If you want a data.frame containing this information
(Groups = data.frame(letters, gp, row.names=NULL))
letters gp
1 a 1
2 b 1
3 c 1
4 e 1
5 f 2
6 h 2
7 p 1
8 w 1
9 z 1
In order to think through why this works, it may help you to look at the graph that was created and think how that represents your problem.
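To make the connected-components idea concrete without igraph, here is a small base-R sketch that uses label propagation. This is a swapped-in technique for illustration, not what components() does internally. Every letter starts in its own group, and each pass gives both letters of a row the smaller of their two labels, until nothing changes. The names nodes and label are invented for this example.

```r
dat <- data.frame(v1 = c("a", "b", "c", "c", "a", "w", "f"),
                  v2 = c("z", "a", "a", "w", "p", "e", "h"),
                  stringsAsFactors = FALSE)

# Every distinct letter starts with its own label
nodes <- sort(unique(c(dat$v1, dat$v2)))
label <- setNames(seq_along(nodes), nodes)

repeat {
  old <- label
  for (i in seq_len(nrow(dat))) {
    ends <- c(dat$v1[i], dat$v2[i])
    # Letters in the same row must share a group: use the smaller label
    label[ends] <- min(label[ends])
  }
  if (identical(label, old)) break   # stop once a full pass changes nothing
}

# Renumber the surviving labels as 1, 2, ...
gp <- setNames(match(label, unique(label)), nodes)
gp
# a b c e f h p w z
# 1 1 1 1 2 2 1 1 1

# Attach the group to each row via its v1 letter
dat$gp <- unname(gp[dat$v1])
dat$gp
# [1] 1 1 1 1 1 1 2
```

Each pass scans every row, so for a very long list the igraph version is likely the better choice. The loop is mainly useful for seeing how groups merge through shared letters.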

Matching and merging headers in R

In R, I want to match and merge two matrices.
For example,
> A
ID a b c d e f g
1 ex 3 8 7 6 9 8 4
2 am 7 5 3 0 1 8 3
3 ple 8 5 7 9 2 3 1
> B
col1
1 a
2 c
3 e
4 f
Then, I want to match the header of matrix A against the first column of matrix B.
The final result should be a matrix like below.
> C
ID a c e f
1 ex 3 7 9 8
2 am 7 3 1 8
3 ple 8 7 2 3
*(My original data has more than 500 columns and more than 20,000 rows.)
Are there any tips for that? Would really appreciate your help.
*Also, if matrix B is instead laid out like below,
> B
col1 col2 col3 col4
1 a c e f
How to make the matrix C in this case?
You want:
A[, c('ID', B[, 1])]
For the second case, you want to use row number 1 of the second matrix, instead of its first column.
A[, c('ID', B[1, ])]
If B is a data.frame instead of a matrix, the syntax changes somewhat. You can use B$col1 instead of B[, 1]. To select by row, you need to convert the result to a vector with unlist(B[1, ]), because selecting a row of a data.frame returns another data.frame.
You can use a subset:
cbind(ID = A$ID, A[names(A) %in% B$col1])
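The printed output in the question suggests that A and B are data frames, so the sketch below rebuilds them that way (an assumption). It also adds one safeguard that is not in the answers: intersect drops any name in B that is not a column of A, so the subset does not fail with an "undefined columns selected" error when there are 500+ columns and some names are missing.

```r
# Rebuild the example as data frames (assumed; only printouts are given)
A <- data.frame(ID = c("ex", "am", "ple"),
                a = c(3, 7, 8), b = c(8, 5, 5), c = c(7, 3, 7),
                d = c(6, 0, 9), e = c(9, 1, 2), f = c(8, 8, 3),
                g = c(4, 3, 1),
                stringsAsFactors = FALSE)
B <- data.frame(col1 = c("a", "c", "e", "f"), stringsAsFactors = FALSE)

# Keep only the requested names that really are columns of A,
# in the order given by B
keep <- intersect(B$col1, names(A))

C <- A[, c("ID", keep)]
names(C)
# [1] "ID" "a"  "c"  "e"  "f"
```

Selecting by name returns the columns in the order listed in B, whereas the names(A) %in% B$col1 form returns them in the order they occur in A.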

Filtering a data frame in R by row names from a second data frame

I have the data.frame :
df1<-data.frame("Sp1"=1:6,"Sp2"=7:12,"Sp3"=13:18)
rownames(df1)=c("A","B","C","D","E","F")
df1
Sp1 Sp2 Sp3
A 1 7 13
B 2 8 14
C 3 9 15
D 4 10 16
E 5 11 17
F 6 12 18
I filter df1 by a cutoff value for rowSums(df1) and return sites (row names) that I want to include in downstream analysis.
include<-rownames(df1[rowSums(df1)>=22,])
include
[1] "B" "C" "D" "E" "F"
I have a second data.frame :
df2<-data.frame(site.x=c("A","B","C"), site.y=c("D","E","F"),score=1:3)
site.x site.y score
1 A D 1
2 B E 2
3 C F 3
I want to filter df2 so that it only includes rows where both df2$site.x and df2$site.y appear among the sites listed in 'include', i.e. filtering out the row containing "A" and returning:
site.x site.y score
2 B E 2
3 C F 3
I have tried :
filter<-df2$site.x == include & df2$site.y == include
filtered<-df2[filter,]
Thanks for any advice!
ANSWER
use %in%
filter<-df2$site.x %in% include & df2$site.y %in% include
filtered<-df2[filter,]
filtered
site.x site.y score
2 B E 2
3 C F 3
For me, it works with:
filter<-df2$site.x %in% include & df2$site.y %in% include
df2[filter,]
In fact, you've put df1 instead of df2 in the last two lines of your question.
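The reason == does not work here is that it compares the two vectors position by position and recycles the shorter one, whereas %in% tests membership. The sketch below shows both on the data from the question. include is typed in directly rather than derived from df1, and stringsAsFactors = FALSE is an assumption made so that the columns are plain character vectors.

```r
include <- c("B", "C", "D", "E", "F")
df2 <- data.frame(site.x = c("A", "B", "C"),
                  site.y = c("D", "E", "F"),
                  score  = 1:3,
                  stringsAsFactors = FALSE)

# == pairs elements by position: "A" with "B", "B" with "C", and so on,
# recycling site.x to length 5 (R warns that the lengths do not match)
elementwise <- suppressWarnings(df2$site.x == include)
elementwise
# [1] FALSE FALSE FALSE FALSE FALSE

# %in% asks, for each site, whether it occurs anywhere in include
keep <- df2$site.x %in% include & df2$site.y %in% include
keep
# [1] FALSE  TRUE  TRUE

filtered <- df2[keep, ]
filtered$site.x
# [1] "B" "C"
```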
