Select specific names from list of dataframes in R

Sample data:
df <- data.frame(names=letters[1:10],name1=rnorm(10,1,1),name2=rexp(10,2))
list <- list(df,df)
vec_name <- c("f","i","c") # desired row names
For each data frame in the list, I would like to select the rows named in vec_name:
Desired outcome:
[[1]]
names value1 value2
6 nd:f -1.6323952 0.3117470
9 nd:i 1.8270855 0.2475741
3 nd:c 0.6978422 0.4695581 # the ordering does matter; must be as seen in vec_name
[[2]]
names value1 value2
6 ad:f -1.6323952 0.3117470
9 ad:i 1.8270855 0.2475741
3 ad:c 0.6978422 0.4695581
Desired output 2: a single data frame, which I believe would just be do.call(rbind,list).
However, the clean names from vec_name should be used instead:
names value1 value2
1 f -1.6323952 0.3117470
2 i 1.8270855 0.2475741
3 c 0.6978422 0.4695581
4 f -1.6323952 0.3117470
5 i 1.8270855 0.2475741
6 c 0.6978422 0.4695581
I have tried sapply; lapply ... for example:
lapply(list, function(x) x[grepl(vec_name,x$names),])
EDIT : PLEASE SEE THE EDITED QUESTION ABOVE.

You were almost there. The warning message was saying:
Warning messages:
1: In grepl(vec_name, x$names) :
argument 'pattern' has length > 1 and only the first element will be used
The reason is that you pass a vector to grepl, which expects a single regular expression as its pattern (see ?regex). What you want to do is match the contents:
lapply(list, function(x) x[match(vec_name,x$names),])
This gives you a list of data.frame objects. To combine them afterwards, use:
do.call(rbind, lapply(list, function(x) x[match(vec_name,x$names),]))
Or use ldply from the plyr package:
library(plyr)
ldply(list, function(x) x[match(vec_name,x$names),])
# names name1 name2
# 1 f 2.01421228 0.4489627
# 2 i 0.28899891 0.8323940
# 3 c -0.01746007 1.5309936
# 4 f 2.01421228 0.4489627
# 5 i 0.28899891 0.8323940
# 6 c -0.01746007 1.5309936
One remark: avoid using the names of built-in functions, such as list, for your variables, to prevent unwanted side effects.
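A minimal sketch of why match() is the right tool here, assuming the sample data from the question. The name dfs is hypothetical (chosen to avoid masking list), and set.seed is added only for reproducibility. match() returns row positions in the order of vec_name, whereas %in% keeps the data frame's own row order.

```r
# Toy data modelled on the question; dfs is a hypothetical name
set.seed(1)
df <- data.frame(names = letters[1:10],
                 name1 = rnorm(10, 1, 1),
                 name2 = rexp(10, 2))
dfs <- list(df, df)
vec_name <- c("f", "i", "c")

# match() gives the row position of each vec_name element, in vec_name order
idx_match <- match(vec_name, df$names)

# %in% only flags membership, so rows come back in their original order
idx_in <- which(df$names %in% vec_name)

# Select per data frame, then stack the results
selected <- lapply(dfs, function(x) x[match(vec_name, x$names), ])
combined <- do.call(rbind, selected)
```

Here idx_match follows the f, i, c ordering while idx_in is ascending, which is the difference the question cares about.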
Update
Taking the comments into account (vec_name does not exactly match the names in the data.frame), you should first clean the names and then do the match. This assumes that your 'uncleaned' names consist of the cleaned name preceded by a prefix and a colon (':'); if that is not the case, adapt the regex in the gsub statement:
ldply(list, function(x) x[match(vec_name, gsub(".*:(.*)", "\\1", x$names)),])
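The same idea can be sketched in base R, without plyr. The data below are made up to mimic the prefixed names in the question ("nd:f", "ad:f"), and clean_and_select is a hypothetical helper name. It also writes the clean names back, as the second desired output asks.

```r
# Made-up data with prefixed names, mimicking the question's desired outcome
df1 <- data.frame(names = paste0("nd:", letters[1:10]), value1 = 1:10)
df2 <- data.frame(names = paste0("ad:", letters[1:10]), value1 = 11:20)
dfs <- list(df1, df2)
vec_name <- c("f", "i", "c")

# Hypothetical helper: strip everything up to the colon, then match
clean_and_select <- function(x) {
  cleaned <- sub("^.*:", "", x$names)
  out <- x[match(vec_name, cleaned), ]
  out$names <- vec_name   # replace prefixed names with the clean ones
  out
}

result <- do.call(rbind, lapply(dfs, clean_and_select))
rownames(result) <- NULL  # renumber rows 1..n
```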

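For the first output:
output1 <- lapply(list, function(elt) {
  # for each name in vec_name, find where it occurs in this element's names
  resmatch <- sapply(vec_name, function(x) regexpr(x, elt$names))
  # keep the matching rows, in the order given by vec_name
  elt <- elt[apply(resmatch, 2, function(rg) which(rg > 0)), ]
  colnames(elt) <- c("names", "value1", "value2")
  return(elt)
})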
>output1
[[1]]
names value1 value2
6 nd:f -0.2132962 0.7618105
9 nd:i -0.6580247 0.6010379
3 nd:c 0.9302625 0.1490061
[[2]]
names value1 value2
6 nd:f -0.2132962 0.7618105
9 nd:i -0.6580247 0.6010379
3 nd:c 0.9302625 0.1490061
For the second output, you can do what you intended:
output2<-do.call(rbind,output1)
> output2
names value1 value2
6 nd:f -0.2132962 0.7618105
9 nd:i -0.6580247 0.6010379
3 nd:c 0.9302625 0.1490061
61 nd:f -0.2132962 0.7618105
91 nd:i -0.6580247 0.6010379
31 nd:c 0.9302625 0.1490061

Related

R regular expression for p#q#c#

What would the regular expression be to encompass variable names such as p3q10000c150 and p29q2990c98? I want to add all variables in the format of p-any number-q-any number-c-any number to a list in R.
Thanks!

I think you are looking for something like matches function in dplyr::select:
df = data.frame(1:10, 1:10, 1:10, 1:10)
names(df) = c("p3q10000c150", "V1", "p29q2990c98", "V2")
library(dplyr)
df %>%
select(matches("^p\\d+q\\d+c\\d+$"))
Result:
p3q10000c150 p29q2990c98
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
matches in select allows you to use regex to extract variables.

If your objective is to pull out the 3 numbers and put them in a 3 column data frame or matrix then any of these alternatives would do it.
The regular expression in #1 matches p and then one or more digits and then q and then one or more digits and then c and one or more digits. The parentheses form capture groups which are placed in the corresponding columns of the prototype data frame given as the third argument.
In #2 each non-digit ("\\D") is replaced with a space and then read.table reads in the data using the indicated column names.
In #3 we convert each element of the input to DCF format, namely c("\np: 3\nq: 10000\nc: 150", "\np: 29\nq: 2990\nc: 98") and then read it in using read.dcf and convert the columns to numeric. This creates a matrix whereas the prior two alternatives create data frames.
The second alternative seems simplest but the third one is more general in that it does not hard code the header names or the number of columns. (If we used col.names = strsplit(input, "\\d+")[[1]] in #2 then it would be similarly general.)
# 1
strcapture("p(\\d+)q(\\d+)c(\\d+)", input,
data.frame(p = character(), q = character(), c = character()))
# 2
read.table(text = gsub("\\D", " ", input), col.names = c("p", "q", "c"))
# 3
apply(read.dcf(textConnection(gsub("(\\D)", "\n\\1: ", input))), 2, as.numeric)
The first two above give this data.frame and the third one gives the corresponding numeric matrix.
p q c
1 3 10000 150
2 29 2990 98
Note: The input is assumed to be:
input <- c("p3q10000c150", "p29q2990c98")
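As a variation on #1, here is a sketch that assumes you want integer columns rather than character ones: the prototype data frame passed to strcapture determines the column types, so an integer prototype converts the captured digits directly. The anchors ^ and $ are an addition, not part of the original answer.

```r
input <- c("p3q10000c150", "p29q2990c98")

# The prototype's column types decide how each capture group is converted
proto <- data.frame(p = integer(), q = integer(), c = integer())

# Anchored pattern (an assumption here) so that only whole names match
parsed <- utils::strcapture("^p(\\d+)q(\\d+)c(\\d+)$", input, proto)
```

Strings that do not match the pattern produce a row of NA values, which makes malformed names easy to spot.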

Try:
x <- c("p3q10000c150", "p29q2990c98")
sapply(strsplit(x, "[pqc]"), function(i){
setNames(as.numeric(i[-1]), c("p", "q", "c"))
})
# [,1] [,2]
# p 3 29
# q 10000 2990
# c 150 98

I'll assume you have a data frame called df with variables names names(df). If you want to only retain the variables with the structure p<somenumbers>q<somenumbers>c<somenumbers> you could use the regex that Wiktor Stribiżew suggested in the comments like this:
valid_vars <- grepl("p\\d+q\\d+c\\d", names(df))
df2 <- df[, valid_vars]
grepl() returns a vector of TRUE and FALSE values, indicating which elements of names(df) follow the structure you described. You then use that output to subset your data frame.
For clarity, observe:
var_names_test <- c("p3q10000c150", "p29q2990c98", "var1")
grepl("p\\d+q\\d+c\\d", var_names_test)
# [1] TRUE TRUE FALSE
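One caveat, sketched below with made-up variable names: the unanchored pattern matches anywhere inside a string, so a name that merely contains the p#q#c# structure also passes. Anchoring with ^ and $, and adding + after the last \\d, restricts the match to whole names. Using drop = FALSE is an extra precaution so the result stays a data frame even when only one column matches.

```r
# Made-up names; the last one only contains the pattern
var_names <- c("p3q10000c150", "p29q2990c98", "var1", "xp1q2c3_old")

loose  <- grepl("p\\d+q\\d+c\\d", var_names)      # matches substrings too
strict <- grepl("^p\\d+q\\d+c\\d+$", var_names)   # whole name must match

# Toy data frame with those column names
df <- as.data.frame(matrix(1:8, nrow = 2, dimnames = list(NULL, var_names)))
df2 <- df[, strict, drop = FALSE]
```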

R: match () only returns first occurrence

I have a dataframe
names2 <- c('AdagioBarber','AdagioBarber', 'Beethovan','Beethovan')
Value <- c(33,55,21,54)
song.data <- data.frame(names2,Value)
I would like to arrange it according to this character vector
names <- c('Beethovan','Beethovan','AdagioBarber','AdagioBarber')
I am using match() to achieve this
data.frame(song.data[match((names), (song.data$names2)),])
The problem is that match returns only the first occurrence of each name:
names2 Value
3 Beethovan 21
3.1 Beethovan 21
1 AdagioBarber 33
1.1 AdagioBarber 33

You can use order, as @zx8754 and @Evan Friedland have pointed out.
> name.order <- c('Beethovan','AdagioBarber')
> song.data$names2 <- factor(song.data$names2, levels= name.order)
> song.data[order(song.data$names2), ]
names2 Value
3 Beethovan 21
4 Beethovan 54
1 AdagioBarber 33
2 AdagioBarber 55
Basically, factor turns the strings into integers and creates a lookup table of which integer corresponds to which string. The levels argument specifies what you want that lookup table to be. Without that argument, the levels are the sorted unique values (alphabetical order for strings), not the order of appearance.
So for example:
> as.numeric(factor(letters[1:5]))
[1] 1 2 3 4 5
> as.numeric(factor(letters[1:5], levels=c("d","b","e","a","c")))
[1] 4 2 5 1 3
Note: Make absolutely sure that every (correctly spelled) value appears in the name.order vector; any value missing from levels becomes NA in the factor, and order places those rows last.
(sort can sort the factor itself by its levels, but it cannot reorder the rows of a data frame, which is why order is used here.)
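If you would rather not convert the column to a factor, the same ordering can be sketched with match() inside order(). This is an alternative to the approach above, not what the answer itself does; rank_in_order is a hypothetical name.

```r
song.data <- data.frame(
  names2 = c("AdagioBarber", "AdagioBarber", "Beethovan", "Beethovan"),
  Value  = c(33, 55, 21, 54)
)
name.order <- c("Beethovan", "AdagioBarber")

# Default factor levels are the sorted unique values
default_levels <- levels(factor(c("b", "a", "c")))

# Rank each row by the position of its name in name.order, then order by rank;
# rows with the same rank keep their original relative order
rank_in_order <- match(song.data$names2, name.order)
sorted <- song.data[order(rank_in_order), ]
```

Because every row is ranked (rather than every name being looked up once), duplicated names are all retained, which is what the plain match() call in the question could not do.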

How to unquote string in R to access column in data table

Suppose I have a data.table called mysample. It has multiple columns, two of them being weight and height. I can access the weight column by typing:
mysample[,weight]
But when I try to write mysample[,colnames(mysample)[1]] I cannot see the elements of weight. Is there something wrong with my code?

Please refer to section 1.1 of data.table FAQ: http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf
colnames(mysample)[1] evaluates to the character vector "weight", and the second argument, j, of a data.table is an expression that is evaluated within the scope of the data.table. Thus "weight" evaluates to the character vector "weight" itself, and you do not see the elements of the weight column. To actually subset the weight column, try:
mysample[,colnames(mysample)[1], with = FALSE]
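A small sketch of two ways to select a column by a name held in a string, using made-up weight and height values. The requireNamespace guard is only there so the snippet does nothing when data.table is not installed.

```r
# Made-up sample data; runs only if data.table is available
if (requireNamespace("data.table", quietly = TRUE)) {
  mysample <- data.table::data.table(weight = c(60, 72, 81),
                                     height = c(165, 178, 183))
  col <- colnames(mysample)[1]

  # with = FALSE treats j as column names, returning a one-column data.table
  by_with <- mysample[, col, with = FALSE]

  # [[ ]] extracts the column itself as a plain vector
  as_vector <- mysample[[col]]
}
```

The [[ ]] form behaves the same on a data.frame, so it is convenient in code that has to accept both.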

Your syntax should work for data frames. data.table has its unique rules.
df <- data.frame(a=1:3, b=4:6)
df
a b
1 1 4
2 2 5
3 3 6
df[,"a"]
[1] 1 2 3
df$a
[1] 1 2 3
df[,1]
[1] 1 2 3
df[,colnames(df)[1]]
[1] 1 2 3

r create a column that contains the objects names inside a lapply function

I would like to create a column that contains each object's name inside an lapply call. As a placeholder I call the function I need name.of.x.as.strig.function(). I am not sure how to write it; maybe a combination of assign, do.call and paste. So far that approach has only led me into deeper trouble, and I am quite sure there is a more R-like solution.
# generates a list of dataframes,
data <- list(data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)))
# assigns names to dataframe
names(data) <- list("one","two", "tree", "four")
# subsets the second column into the object data.anova
data.anova <- lapply(data, function(x){x <- x[[2]];
return(matrix(x))})
This should allow me to create a column inside the dataframe that contains its name, for all matrices inside the list
data.anova <- lapply(data, function(x){
x$id <- name.of.x.as.strig.function(x)
return(x)})
I would like to retrieve:
3 one
3 one
3 two
3 two
...
Any input is highly appreciated.
Search history: function to retrieve object name as string, R get name of an object inside lapply...

Can it be that you are just looking for stack?
stack(lapply(data, `[[`, 2))
# values ind
# 1 3 one
# 2 3 one
# 3 3 two
# 4 3 two
# 5 3 tree
# 6 3 tree
# 7 3 four
# 8 3 four
(Or, using your original approach: stack(lapply(data, function(x) {x <- x[[2]]; x})))
If this is the case, melt from "reshape2" would also work.

Loop through the indices of data.anova, and use that to fetch both the data and the names:
data.anova <- lapply(seq_along(data.anova), function(i){
x <- as.data.frame(data.anova[[i]])
x$id <- names(data.anova)[i]
return(x)})
This produces:
# [[1]]
# V1 id
# 1 3 one
# 2 3 one
# [[2]]
# V1 id
# 1 3 two
# 2 3 two
# [[3]]
# V1 id
# 1 3 tree
# 2 3 tree
# [[4]]
# V1 id
# 1 3 four
# 2 3 four
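Note that looping over indices returns an unnamed list. As a sketch of an alternative, Map() can walk over the list and its names in parallel and keeps the names on the result. The column names a, b, value and id are assumptions made for this illustration.

```r
# Named list of small data frames, modelled on the question
data <- list(
  one  = data.frame(a = c(1, 2), b = c(3, 3)),
  two  = data.frame(a = c(1, 2), b = c(3, 3)),
  tree = data.frame(a = c(1, 2), b = c(3, 3))
)

# Map passes each element together with its name
with_id <- Map(function(x, nm) data.frame(value = x[[2]], id = nm),
               data, names(data))

# Stack into one data frame if a single table is wanted
combined <- do.call(rbind, with_id)
```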

Extract data elements found in a single column

Here is what my data look like.
id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{
As you can see, there can be multiple codes concatenated into a single column, separated by {. It is also possible for a row to have no interest_string values at all.
How can I manipulate this data frame to extract the values into a format like this:
id interest
1 YI
1 Z0
1 ZI
2 ZO
3 <NA>
4 ZT
I need to complete this task with R.
Thanks in advance.

This is one solution
out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))
out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
interest = unlist(out, use.names = FALSE))
Giving:
R> out
id interest
1 1 YI
2 1 Z0
3 1 ZI
4 2 ZO
5 3 <NA>
6 4 ZT
Explanation
The first line of solution simply splits each element of the interest_string factor in data object dat, using \\{ as the split indicator. This indicator has to be escaped and in R that requires two \. (Actually it doesn't if you use fixed = TRUE in the call to strsplit.) The resulting object is a list, which looks like this for the example data
R> out
[[1]]
[1] "YI" "Z0" "ZI"
[[2]]
[1] "ZO"
[[3]]
[1] "<NA>"
[[4]]
[1] "ZT"
We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.
Hence, in the second line, we bind, column-wise (specifying the data frame method so we get a data frame returned) the original id values, each one repeated the required number of times, to the strsplit list (out). By unlisting this list, we unwrap it to a vector which is of the required length as given by your expected output. We get the number of times we need to replicate each id value from the lengths of the components of the list returned by strsplit.
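The two steps can be put together in a self-contained sketch. It assumes the missing entry is a real NA rather than the string "<NA>", and it uses lengths() in place of sapply(out, length); parts is a hypothetical name.

```r
# Example data with a genuine missing value in row 3
dat <- data.frame(
  id = 1:4,
  interest_string = c("YI{Z0{ZI{", "ZO{", NA, "ZT{"),
  stringsAsFactors = FALSE
)

# Step 1: split on the literal "{"; an NA input yields a single NA
parts <- strsplit(dat$interest_string, "{", fixed = TRUE)

# Step 2: repeat each id once per code and flatten the list
out <- data.frame(
  id       = rep(dat$id, times = lengths(parts)),
  interest = unlist(parts),
  stringsAsFactors = FALSE
)
```

Because strsplit returns one NA for a missing input, the row with no codes still appears once in the result instead of being dropped.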

A nice and tidy data.table solution:
library(data.table)
DT <- data.table( read.table( textConnection("id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{"), header=TRUE))
DT$interest_string <- as.character(DT$interest_string)
DT[, {
list(interest=unlist(strsplit( interest_string, "{", fixed=TRUE )))
}, by=id]
gives me
id interest
1: 1 YI
2: 1 Z0
3: 1 ZI
4: 2 ZO
5: 3 <NA>
6: 4 ZT
