Extract subset of data - r

Ok, I have a matrix of values with certain identifiers, such as:
A 2
B 3
C 4
D 5
E 6
F 7
G 8
I would like to pull out a subset of these values (using R) based on a list of the identifiers ("B", "D", "E") for example, so I would get the following output:
B 3
D 5
E 6
I'm sure there's an easy way to do this (some sort of apply?) but I can't seem to figure it out. Any ideas? Thanks!

If the letters are the row names, then you can just use this:
m <- matrix(2:8, dimnames = list(LETTERS[1:7], NULL))
m[c("B","D","E"),]
# B D E
# 3 5 6
Note that there is a subtle but very important difference between: m[c("B","D","E"),] and m[rownames(m) %in% c("B","D","E"),]. Both return the same rows, but not necessarily in the same order.
The former uses the character vector c("B","D","E") as an index into m. As a result, the rows are returned in the order of the character vector. For instance:
# result depends on order in c(...)
m[c("B","D","E"),]
# B D E
# 3 5 6
m[c("E","D","B"),]
# E D B
# 6 5 3
The second method, using %in%, creates a logical vector of length nrow(m). Each element is TRUE if the corresponding row name is present in c("B","D","E"), and FALSE otherwise. Indexing with a logical vector returns rows in the original order:
# result does NOT depend on order in c(...)
m[rownames(m) %in% c("B","D","E"),]
# B D E
# 3 5 6
m[rownames(m) %in% c("E","D","B"),]
# B D E
# 3 5 6
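The two approaches also differ when a requested identifier does not exist. The following is a minimal sketch of that case, assuming an identifier "Z" (made up here for illustration) that has no matching row:

```r
m <- matrix(2:8, dimnames = list(LETTERS[1:7], NULL))
ids <- c("E", "Z", "B")   # "Z" is a hypothetical identifier with no matching row

# Indexing by name fails outright when a name is missing
by_name <- tryCatch(m[ids, ], error = function(e) "error")

# %in% ignores the missing name and keeps the original row order
by_in <- m[rownames(m) %in% ids, ]

# To keep the requested order and still skip missing names, filter the ids first
by_order <- m[ids[ids %in% rownames(m)], ]
```

Filtering the identifiers before indexing is one way to get the ordering of the first method with the tolerance of the second.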
This is probably more than you wanted to know...

Your matrix:
> m <- matrix(2:8, dimnames = list(LETTERS[1:7]))
You can use %in% to select the desired rows. If the original matrix only has a single column, using drop = FALSE will keep the matrix structure. Otherwise the result will be converted to a named vector.
> m[rownames(m) %in% c("B", "D", "E"), , drop = FALSE]
# [,1]
# B 3
# D 5
# E 6
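To make the effect of drop = FALSE concrete, here is a small sketch comparing the two results on the same matrix (the variable names keep, v and s are only for illustration):

```r
m <- matrix(2:8, dimnames = list(LETTERS[1:7], NULL))
keep <- rownames(m) %in% c("B", "D", "E")

v <- m[keep, ]                 # dimensions dropped: a named integer vector
s <- m[keep, , drop = FALSE]   # dimensions kept: a 3 x 1 matrix

is.matrix(v)   # FALSE
dim(s)         # 3 1
```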

Related

R how to merge 2 list [duplicate]

I'm trying to set the default value for a function parameter to a named numeric. Is there a way to create one in a single statement? I checked ?numeric and ?vector but it doesn't seem so. Perhaps I can convert/coerce a matrix or data.frame and achieve the same result in one statement? To be clear, I'm trying to do the following in one shot:
test = c( 1 , 2 )
names( test ) = c( "A" , "B" )
The setNames() function is made for this purpose. As described in Advanced R and ?setNames:
test <- setNames(c(1, 2), c("A", "B"))
How about:
c(A = 1, B = 2)
A B
1 2
...as a side note, the structure function allows you to set ALL attributes, not just names:
structure(1:10, names=letters[1:10], foo="bar", class="myclass")
Which would produce
a b c d e f g h i j
1 2 3 4 5 6 7 8 9 10
attr(,"foo")
[1] "bar"
attr(,"class")
[1] "myclass"
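As a rough illustration of how attributes set this way behave afterwards (the attribute name foo is just the placeholder from the example above), names survive subsetting while a custom attribute does not:

```r
x <- structure(1:3, names = c("a", "b", "c"), foo = "bar")

foo_value <- attr(x, "foo")   # read a single attribute by name
y <- x[2:3]                   # subsetting keeps names but drops "foo"

names(y)         # "b" "c"
attr(y, "foo")   # NULL
```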
The convention for naming vector elements is the same as with lists:
newfunc <- function(A=1, B=2) { body} # the parameters are an 'alist' with two items
If instead you wanted this to be a parameter that was a named vector (the sort of function that would handle arguments supplied by apply):
newfunc <- function(params =c(A=1, B=2) ) { body} # a vector with two elements
If instead you wanted this to be a parameter that was a named list:
newfunc <- function(params =list(A=1, B=2) ) { body}
# a single parameter (with two elements in a list structure)
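Since { body} above is only a placeholder, here is a runnable sketch of the named-vector variant. The function name weighted_sum and its body are hypothetical; only the shape of the default argument comes from the answer:

```r
# Hypothetical function; only the shape of the default argument matters here
weighted_sum <- function(x, params = c(A = 1, B = 2)) {
  # Elements are looked up by name, so their order in params does not matter
  unname(params["A"] * x + params["B"] * x^2)
}

weighted_sum(3)                            # defaults: 1 * 3 + 2 * 9
weighted_sum(3, params = c(B = 1, A = 0))  # overridden: 0 * 3 + 1 * 9
```

Looking elements up by name inside the function means callers can pass the vector in any order.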
magrittr offers a nice and clean solution.
library(magrittr)
result = c(1,2) %>% set_names(c("A", "B"))
print(result)
A B
1 2
You can also use it to transform data.frames into vectors.
df = data.frame(value=1:10, label=letters[1:10])
vec = extract2(df, 'value') %>% set_names(df$label)
vec
a b c d e f g h i j
1 2 3 4 5 6 7 8 9 10
df
value label
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
To expand upon @joran's answer (I couldn't get this to format correctly as a comment): If the named vector is assigned to a variable, the values of A and B are accessed via subsetting using the [ function. Use the names to subset the vector the same way you might use the index number to subset:
my_vector = c(A = 1, B = 2)
my_vector["A"] # subset by name
# A
# 1
my_vector[1] # subset by index
# A
# 1
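A few related cases, sketched with an extra element C that is not in the original example and is added purely for illustration:

```r
my_vector <- c(A = 1, B = 2, C = 3)   # C is added here purely for illustration

picked <- my_vector[c("C", "A")]   # several names at once, in the order requested
bare <- my_vector[["B"]]           # [[ returns the bare value without its name
missing_name <- my_vector["Z"]     # an unknown name gives NA rather than an error
```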

Applying a function to get comma separated string using multiple columns in DataFrame and create a third column

I am trying to sort names in a row and create a comma separated string which would create another column.
This being my sample data.frame .
df=data.frame(A=c("A","K","B","D","F"),B =c("E","C","D","A","K"))
A B
1 A E
2 K C
3 B D
4 D A
5 F K
The Output I am trying to get would be like this
A B C
1 A E A , E
2 K C C , K
3 B D B , D
4 D A A , D
5 F K F , K
So far I have tried this :
lapply(df,FUN=paste(sort(df$A,df$B),collapse=" , "))
mapply(FUN= function(x,y)paste(sort(x,y),collapse=" , "),df$A,df$B)
Here I am trying to sort column values and paste them using ',' to create a unique pair name.
Any help is appreciated.
If you only have 2 columns, you can use pmax and pmin to avoid any costly looping code. E.g.:
with(lapply(df, as.character), paste(pmin(A,B),pmax(A,B),sep=",") )
#[1] "A,E" "C,K" "B,D" "A,D" "F,K"
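To store the result as the new column with the " , " separator from the question, a minimal sketch follows. It assumes the data frame is built with stringsAsFactors = FALSE so that no conversion from factor is needed:

```r
df <- data.frame(A = c("A", "K", "B", "D", "F"),
                 B = c("E", "C", "D", "A", "K"),
                 stringsAsFactors = FALSE)   # keep the columns as character

# pmin/pmax pick the smaller/larger string of each pair, element by element
df$C <- paste(pmin(df$A, df$B), pmax(df$A, df$B), sep = " , ")
```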
You can do it with mapply, but since your data are factors, you need to coerce to character so they sort properly:
df$C <- mapply(function(x, y){paste(sort(c(as.character(x), as.character(y))),
collapse = ',')}, df$A, df$B)
df
# A B C
# 1 A E A,E
# 2 K C C,K
# 3 B D B,D
# 4 D A A,D
# 5 F K F,K
To simplify a bit, you can just use apply to iterate over the rows:
apply(df, 1, function(x){paste(sort(x), collapse = ',')})
Since it treats df as a matrix, it converts everything to character, which happens to be what you want for the sample data.
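Unlike the pmin/pmax approach, the apply version is not limited to two columns. A small sketch with a made-up three-column data frame (df3 is hypothetical, not from the question):

```r
# df3 is a hypothetical three-column example
df3 <- data.frame(A = c("C", "B"), B = c("A", "C"), C = c("B", "A"),
                  stringsAsFactors = FALSE)

# Sort the values within each row, then join them into one key per row
keys <- unname(apply(df3, 1, function(x) paste(sort(x), collapse = ",")))
```

Both rows contain the same three letters in a different order, so they end up with the same key.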
Also see tidyr::unite for pasting two columns together, though it can't easily sort.
Try this
> for( i in 1:nrow(df)){
+ df$C[i]<-paste0(as.character(unlist(sort(df[i,1:2]))),collapse=" , ")
+ }
> df
A B C
1 A E A , E
2 K C C , K
3 B D B , D
4 D A A , D
5 F K F , K

Add list of columns above a certain threshold

Say I have a dataframe:
df <- data.frame(rbind(c(10,1,5,4), c(6,0,3,10), c(7,1,10,10)))
colnames(df) <- c("a", "b", "c", "d")
df
a b c d
10 1 5 4
6 0 3 10
7 1 10 10
And a vector of numbers (which correspond to the four column names a,b,c,d)
threshold <- c(7,1,5,8)
I need to compare each row in the data frame to the vector. When the value in the data frame meets or exceeds that in the vector, I need to return the column name. The output would be:
a b c d cols
10 1 5 4 a,b,c #10>7, 1>=1, 5>=5
6 0 3 10 d #10>8
7 1 10 10 a,b,c,d #7>=7, 1>=1, 10>=5, 10>=8
The column cols can be a string that simply lists the columns where the value is exceeded.
Is there any clever way to do this? I'm migrating an old Excel function and I can write a loop or something, but I thought there almost had to be a better way.
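You do not need which, and the desired output is a comma-separated string. Note that this code assumes df has a leading id column, as in the output below; with the four-column df from the question, drop both [-1] subscripts: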
df$cols <- apply(df[-1], 1, function(x) toString(names(df)[-1][x >= threshold]))
df
id a b c d cols
1 aa 10 1 5 4 a, b, c
2 bb 6 0 3 10 d
3 cc 7 1 10 10 a, b, c, d
We can also try
i1 <- which(df >=threshold[col(df)], arr.ind=TRUE)
df$cols <- unname(tapply(names(df)[i1[,2]], i1[,1], toString))
df$cols
#[1] "a, b, c" "d" "a, b, c, d"
You can try this:
df$cols <- apply(df[, 2:5], 1, function(x) names(df[, 2:5])[which(x >= threshold)])
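For the data frame exactly as defined in the question (four columns, no id column), one possible sketch uses sweep to compare every column with its own threshold and then collapses the matching names. The intermediate name hit is an assumption for illustration:

```r
df <- data.frame(rbind(c(10, 1, 5, 4), c(6, 0, 3, 10), c(7, 1, 10, 10)))
colnames(df) <- c("a", "b", "c", "d")
threshold <- c(7, 1, 5, 8)

# Logical matrix: TRUE where a value meets or exceeds its column's threshold
hit <- sweep(as.matrix(df), 2, threshold, ">=")

# One comma-separated string of column names per row
df$cols <- unname(apply(hit, 1, function(r) toString(colnames(hit)[r])))
```

toString is used so that each row yields a single string rather than a list element.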

Find the index of the row in data frame that contain one element in a string vector

If I have a data.frame like this
df <- data.frame(col1 = c(letters[1:4],"a"),col2 = 1:5,col3 = letters[10:14])
df
col1 col2 col3
1 a 1 j
2 b 2 k
3 c 3 l
4 d 4 m
5 a 5 n
I want to get the indices of the rows that contain one of the elements in c("a", "k", "n"); in this example, the result should be 1, 2, 5.
If you have a large data frame and you wish to check all columns, try this
x <- c("a", "k", "n")
Reduce(union, lapply(x, function(a) which(rowSums(df == a) > 0)))
# [1] 1 5 2
and of course you can sort the end result.
s <- c('a','k','n');
which(df$col1%in%s|df$col3%in%s);
## [1] 1 2 5
Here's another solution. This one works on the entire data.frame, and happens to capture the search strings as element names (you can get rid of those via unname()):
sapply(s,function(s) which(apply(df==s,1,any))[1]);
## a k n
## 1 2 5
Original second solution:
sort(unique(rep(1:nrow(df),ncol(df))[as.matrix(df)%in%s]));
## [1] 1 2 5
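Another possible variant, offered as a sketch rather than as one of the original answers, checks every column with %in% and combines the per-column results with a logical OR:

```r
df <- data.frame(col1 = c(letters[1:4], "a"), col2 = 1:5, col3 = letters[10:14])
s <- c("a", "k", "n")

# One logical vector per column: TRUE where the cell is one of the search strings
hit_by_col <- lapply(df, function(col) col %in% s)

# A row matches if any of its columns matched
rows <- which(Reduce(`|`, hit_by_col))
```

This returns the indices already sorted and does not require naming the columns to search.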

R: operate over subset of columns in data.table

I'm trying to implement a data.table for my relatively large datasets and I can't figure out how to operate a function over multiple columns in the same row. Specifically, I want to create a new column that contains a specifically-formatted tally of the values (i.e., a histogram) in a subset of columns. It is kind of like table() but that also includes 0 entries and is sorted--so, if you know of a better/faster method I'd appreciate that too!
Simplified test case:
DF<-data.frame("A"=c("a","d","a"),"B"=c("b","a","a"),"C"=c("c","a","a"),"D"=c("a","b","c"),"E"=c("a","a","c"))
DT<-as.data.table(DF)
> DT
A B C D E
1: a b c a a
2: d a a b a
3: a a a c c
my klunky histogram function:
histo<-function(vec){
foo<-c("a"=0,"b"=0,"c"=0,"d"=0)
for(i in vec){foo[i]=foo[i]+1}
return(foo)}
>histo(unname(unlist(DF[1,])))
a b c d
3 1 1 0
>histo(unname(unlist(DF[2,])))
a b c d
3 1 0 1
>histo(unname(unlist(DF[3,])))
a b c d
3 0 2 0
Pseudocode of the desired function and output:
>DT[,his:=some_func_with_histo(A:E)]
>DT
A B C D E his
1: a b c a a (3,1,1,0)
2: d a a b a (3,1,0,1)
3: a a a c c (3,0,2,0)
df <- data.table(DF)
df$hist <- unlist(apply(df, 1, function(x) {
list(
sapply(letters[1:4], function(d) {
b <- sum(!is.na(grep(d,x)))
assign(d, b)
}))
}), recursive=FALSE)
Your df$hist column is a list, with each value named:
> df
A B C D E hist
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
> df$hist
[[1]]
a b c d
3 1 1 0
[[2]]
a b c d
3 1 0 1
[[3]]
a b c d
3 0 2 0
NOTE: Answer has been updated in response to the OP's request and mnel's comment
OK, how do you like that solution:
library(data.table)
DT <- data.table(A=c("a","d","a"),
B=c("b","a","a"),
C=c("c","a","a"),
D=c("a","b","c"),
E=c("a","a","c"))
fun <- function(vec, char) {
sum(vec==char)
}
DT[, Vec_Nr:= paste(Vectorize(fun, 'char')(.SD, letters[1:4]), collapse=","),
by=1:nrow(DT),
.SDcols=LETTERS[1:5]]
A B C D E Vec_Nr
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
I basically split up your problem into several steps:
First, I define a function fun that gives me the number of occurrences of one character. To see how that function works, just call
fun(c("a", "a", "b"), "b")
[1] 1
Next, I vectorize this function, because you want the count not for just one character such as "b" but for many. To pass a vector of arguments to a function, use Vectorize. To see how that works, just type
Vectorize(fun, "char")(c("a", "a", "b"), c("a", "b"))
a b
2 1
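As a rough sketch of what Vectorize does here, the following compares it with an explicit sapply over the characters. The extra search character "z" is an assumption added to show a zero count:

```r
fun <- function(vec, char) sum(vec == char)
vfun <- Vectorize(fun, "char")   # only char is vectorized; vec is passed whole each time

vec <- c("a", "a", "b")
chars <- c("a", "b", "z")        # "z" is included to show a zero count

from_vectorize <- vfun(vec, chars)
from_sapply <- sapply(chars, function(ch) fun(vec, ch))
```

Both calls loop over chars one element at a time while holding vec fixed, which is why the result has one count per search character.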
Next, I collapse the results into one string and save that as a new column. Note that I deliberately used letters and LETTERS here to show you how to make this more dynamic.
EDIT (also see below): Provided you first convert column classes to character, e.g., with DT <- DT[,lapply(.SD,as.character)]...
By using factor, you can convert vec and pass the values (a,b,c,d) in one step:
histo2 <- function(x) table(factor(x,levels=letters[1:4]))
Then you can iterate over rows by passing by=1:nrow(DT).
DT[,as.list(histo2(.SD)),by=1:nrow(DT)]
This gives...
nrow a b c d
1: 1 3 1 1 0
2: 2 3 1 0 1
3: 3 3 0 2 0
This also counts across columns. This works because .SD is a special variable holding the subset of data associated with each group defined by by. In this case, that subset is a data.table consisting of a single row. histo2(DT[1]) works the same way.
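The factor-with-fixed-levels trick can be tried without data.table. The sketch below uses base R only and assumes character columns (stringsAsFactors = FALSE); the names one_row and counts are illustrative:

```r
histo2 <- function(x) table(factor(x, levels = letters[1:4]))

# Fixed levels guarantee a slot for every letter, including absent ones
one_row <- histo2(c("a", "b", "c", "a", "a"))
as.vector(one_row)   # counts for a, b, c, d, with a zero for the absent "d"

DF <- data.frame(A = c("a", "d", "a"), B = c("b", "a", "a"), C = c("c", "a", "a"),
                 D = c("a", "b", "c"), E = c("a", "a", "c"),
                 stringsAsFactors = FALSE)

# One row of counts per row of DF
counts <- t(vapply(seq_len(nrow(DF)),
                   function(i) as.vector(histo2(unlist(DF[i, ]))),
                   integer(4)))
colnames(counts) <- letters[1:4]
```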
EDIT (responding to OP's comment): Oh, sorry, I instinctively replaced your first line with
DF<-data.frame("A"=c("a","d","a"),"B"=c("b","a","a"),"C"=c("c","a","a"),"D"=c("a","b","c"),"E"=c("a","a","c")
,stringsAsFactors=FALSE)
since I dislike using factors except when making tables. If you do not want to convert your factor columns to character columns in this way, this will work:
histo3 <- function(x) table(factor(sapply(x,as.character),levels=letters[1:4]))
To put the output into a single column, you use := as you suggested...
DT[,hist:=list(list(histo3(.SD))),by=1:nrow(DT)]
The list(list()) part is key; I always figure this out by trial-and-error. Now DT looks like this:
A B C D E hist
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
You might find that it's a pain to access the information directly from your new column. For example, to access the "a" column of the "histogram", I think the fastest route is...
DT[,hist[[1]][["a"]],by=1:nrow(DT)]
My initial suggestion created an auxiliary data.table with just the counts. I think it's cleaner to do whatever you want to do with the counts in that data.table and then cbind it back. If you choose to store it in a column, you can always create the auxiliary data.table later with
DT[,as.list(hist[[1]]),by=1:nrow(DT)]
You are correct about using .SDcols. For your example, ...
cols = c("A","C")
histname = paste(c("hist",cols),collapse="")
DT[,(histname):=list(list(histo3(.SD))),by=1:nrow(DT),.SDcols=cols]
This gives
A B C D E hist histAC
1: a b c a a 3,1,1,0 1,0,1,0
2: d a a b a 3,1,0,1 1,0,0,1
3: a a a c c 3,0,2,0 2,0,0,0
