R: how to find and extract values in a data frame

I have a character vector in R with 330000 values e.g.
amp184660
amp947
amp53303
amp364886
amp121615
and a data frame like this:
I want to find each value from my character vector in the first column of the data frame, i.e. "Assay Name", and then output its corresponding chromosome position, i.e. "Chrom", into a new vector. I want to do this as quickly as possible, as there are about 330k entries and doing this via grep over a loop would take about 12 hours to finish.
Any ideas?
Thanks
Jason.

I would suggest %in%, which is likely to be faster than merge. Here's a toy example:
## Assume that "x" is your data.frame
set.seed(1)
x <- data.frame(Assay = sample(letters, 30, replace = TRUE),
                Chrom = 4, ChromPos = rnorm(30))
## And that "y" is your vector you want to match
y <- c("a", "b", "c", "d", "e")
## Here's how you can use %in%
x[x$Assay %in% y, ]
# Assay Chrom ChromPos
# 10 b 4 0.6198257
# 12 e 4 -0.1557955
# 24 d 4 1.1000254
# 27 a 4 -0.2533617
## And can also directly extract a specific column
x[x$Assay %in% y, "ChromPos"]
# [1] 0.6198257 -0.1557955 1.1000254 -0.2533617
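If the output needs to line up with the lookup vector (one position per ID, in the original order), match() may be a better fit than %in%; here is a minimal sketch reusing the toy x and y from above:
## match() returns the row index in x of the first match for each element of y
## (NA where there is no match), so the result keeps the order of y
idx <- match(y, x$Assay)
x$ChromPos[idx]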

# assume your df called your_data_frame and vector called your_character_vector
# check.names = FALSE keeps the space in "Assay Name" so the merge key matches
vector_frame <- data.frame("Assay Name" = your_character_vector, check.names = FALSE)
merge(vector_frame, your_data_frame, by = "Assay Name")[, 3]
Note that I changed the column notation from $Chrom to [, 3] because you wanted the third column, and R will have renamed the column in the $ call (e.g. to Chrom.Pos..bp. or something similar). If you type the $ and press TAB in the RStudio editor, it will show you the options.
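A minimal runnable sketch of the same idea with made-up values (your_data_frame, your_character_vector and the column contents here are just placeholders):
your_data_frame <- data.frame("Assay Name" = c("amp947", "amp53303", "amp121615"),
                              Chrom = c(4, 7, 2),
                              ChromPos = c(100, 200, 300),
                              check.names = FALSE)
your_character_vector <- c("amp947", "amp53303")
vector_frame <- data.frame("Assay Name" = your_character_vector, check.names = FALSE)
merge(vector_frame, your_data_frame, by = "Assay Name")[, 3]
# [1] 200 100
Note that merge sorts the result by the key, so "amp53303" comes before "amp947"; if the original order of your vector matters, the match() approach shown earlier is the safer choice.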

Just in case runtime is still a problem, using the data.table package is approx. 100x faster than merge and 50x faster than %in%:
library(data.table)
dt <- as.data.table( yourDataFrame )
setkey( dt, Assay )
dt[ J(yourVector) ]
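On newer data.table versions you can also do the join ad hoc with the on argument, which avoids setting a key first (same placeholder names as above):
dt[ .(yourVector), on = "Assay" ]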

Related

Why won't R recognize data frame column names within lists?

HEADLINE: Is there a way to get R to recognize data.frame column names contained within lists in the same way that it can recognize free-floating vectors?
SETUP: Say I have a vector named varA:
(varA <- 1:6)
# [1] 1 2 3 4 5 6
To get the length of varA, I could do:
length(varA)
#[1] 6
and if the variable was contained within a larger list, the variable and its length could still be found by doing:
list <- list(vars = "varA")
length(get(list$vars[1]))
#[1] 6
PROBLEM:
This no longer works when the variable is a data frame column rather than a free-standing vector, and I don't know how to work around it:
rows <- 1:6
cols <- c("colA")
(df <- data.frame(matrix(NA,
                         nrow = length(rows),
                         ncol = length(cols),
                         dimnames = list(rows, cols))))
# colA
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
list <- list(vars = "varA",
             cols = "df$colA")
length(get(list$vars[1]))
#[1] 6
length(get(list$cols[1]))
#Error in get(list$cols[1]) : object 'df$colA' not found
Though this contrived example seems inane, because I could always use the simple length(variable) approach, I'm actually interested in writing data from hundreds of variables of varying lengths into their respective data frame columns, so keeping them in a list I could iterate through would be very helpful. I've tried everything I can think of, but it may simply not be possible in R, especially since I cannot find any posts with solutions to this issue.
You could try:
> length(eval(parse(text = list$cols[1])))
[1] 6
Or:
list <- list(vars = "varA",
             cols = "colA")
length(df[, list$cols[1]])
[1] 6
Or with regex:
list <- list(vars = "varA",
             cols = "df$colA")
length(df[, sub(".*\\$", "", list$cols[1])])
[1] 6
If you are truly working with a data frame d, then nrow(d) is the length of all of the variables in d. There should be no reason to use length in this case.
If you are actually working with a list x containing variables of potentially different lengths, then you should use the [[ operator to extract those variables by name (see ?Extract):
x <- list(a = 1:10, b = rnorm(20L))
l <- list(vars = "a")
length(x[[l$vars[1L]]]) # 10
If you insist on using get (you shouldn't), then you need to supply a second argument telling it where to look for the variable (see ?get):
length(get(l$vars[1L], x)) # 10
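For the stated goal (writing hundreds of variables of varying lengths into columns), another option is to keep the variables themselves in a named list rather than storing their names as strings, and then iterate over the list directly. A small sketch (vars and the loop below are purely illustrative):
vars <- list(varA = 1:6, varB = rnorm(20))
lengths(vars)   # lengths of every element at once
# varA varB
#    6   20
for (nm in names(vars)) {
  message(nm, " has length ", length(vars[[nm]]))
}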

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train on data converted from a document-term matrix to a data frame. There are separate fields for the positive and negative comments, so I want to add a string to the column names to serve as a "tag" that differentiates the same word coming from the different fields. For example, the word hello can appear in both the positive and negative comment fields (and thus be represented as a column in my data frame), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with them. Any help would be greatly appreciated.
Use the colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package; it doesn't create any copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
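If you are already using the tidyverse, recent dplyr versions offer rename_with(), which applies a renaming function to all (or selected) columns; a sketch, assuming a data frame like the one above:
library(dplyr)
df <- data.frame(a = c(1, 2), b = c(3, 4))
df <- rename_with(df, ~ paste0(.x, "_tag"))
names(df)
# [1] "a_tag" "b_tag"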

R - number of unique values in a column of data frame

For a data frame df, I need to find the number of unique values in some_col. I tried the following:
length(unique(df["some_col"]))
but this is not giving the expected result. However, length(unique(some_vector)) works on a vector and gives the expected result.
Some preceding steps showing how the df is created:
df <- read.csv(file, header=T)
typeof(df) #=> "list"
typeof(unique(df["some_col"])) #=> "list"
length(unique(df["some_col"])) #=> 1
Try with [[ instead of [. [ returns a list (a data.frame in fact), [[ returns a vector.
df <- data.frame(some_col = c(1,2,3,4),
                 another_col = c(4,5,6,7))
length(unique(df[["some_col"]]))
#[1] 4
class( df[["some_col"]] )
[1] "numeric"
class( df["some_col"] )
[1] "data.frame"
You're getting a value of 1 because the list is of length 1 (1 column), even though that 1 element contains several values.
You need to use
length(unique(unlist(df[c("some_col")])))
When you select a column with df[c("some_col")] or df["some_col"], it is pulled as a list (a one-column data frame). unlist() converts it into a vector, which you can then work with easily. When you select the column with df$some_col, it is pulled as a vector directly.
I think you might just be missing a comma.
Try
length(unique(df[,"some_col"]))
In response to comment :
df <- data.frame(cbind(A=c(1:10),B=rep(c("A","B"),5)))
df["B"]
Output :
B
1 A
2 B
3 A
4 B
5 A
6 B
7 A
8 B
9 A
10 B
and
length(unique(df["B"]))
Output:
[1] 1
which is the same incorrect/undesirable output the OP posted.
HOWEVER, with a comma:
df[,"B"]
Output :
[1] A B A B A B A B A B
Levels: A B
and
length(unique(df[,"B"]))
now gives the correct/desired output, which in this example is 2:
[1] 2
The reason is that df["some_col"] returns a data.frame, and calling length on a data.frame counts its columns, which here is 1, while df[,"some_col"] returns a vector, and calling length on a vector correctly returns the number of elements. So you see, a comma (,) makes all the difference.
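A quick way to see the difference is to compare length() on the two forms of extraction (using the df built above):
length(df["B"])    # 1  -- a one-column data.frame, so one element
length(df[, "B"])  # 10 -- a vector with ten elements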
Using the tidyverse:
df %>%
  select("some_col") %>%
  n_distinct()
The data.table package contains the convenient shorthand uniqueN. From the documentation
uniqueN is equivalent to length(unique(x)) when x is an atomic vector, and to nrow(unique(x)) when x is a data.frame or data.table. The number of unique rows is computed directly, without materialising the intermediate unique data.table, and is therefore faster and more memory efficient.
You can use it with a data frame:
df <- data.frame(some_col = c(1,2,3,4),
                 another_col = c(4,5,6,7))
data.table::uniqueN(df[['some_col']])
[1] 4
or if you already have a data.table
dt <- setDT(df)
dt[,uniqueN(some_col)]
[1] 4
Here is another option:
df %>%
  distinct(column_name) %>%
  count()
or the same without the pipe (distinct() and count() are still dplyr functions):
count(distinct(df, column_name))
Checking benchmarks on the web, you will see that distinct() is fast.

R data.table intersection of all groups

I want to have the intersection of all groups of a data table. So for the given data:
data.table(a=c(1,2,3, 2, 3,2), myGroup=c("x","x","x", "y", "z","z"))
I want to have the result:
2
I know that
Reduce(intersect, list(c(1,2,3), c(2), c(3,2)))
will give me the desired result, but I haven't figured out how to produce that list of groups from a data.table query.
I would try using Reduce in the following way (assuming dt is your data)
Reduce(intersect, dt[, .(list(unique(a))), myGroup]$V1)
## [1] 2
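A base-R way to produce the per-group list the question asks about is split(); a small sketch, assuming dt is the data.table above:
Reduce(intersect, split(dt$a, dt$myGroup))
## [1] 2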
Here's one approach.
nGroups <- length(unique(dt[,myGroup]))
dt[, if(length(unique(myGroup))==nGroups) .BY else NULL, by="a"][[1]]
# [1] 2
And here it is with some explanatory comments.
## Mark down the number of groups in your data set
nGroups <- length(unique(dt[,myGroup]))
## Then, use `by="a"` to examine in turn subsets formed by each value of "a".
## For subsets having the full complement of groups
## (i.e. those for which `length(unique(myGroup))==nGroups)`,
## return the value of "a" (stored in .BY).
## For the other subsets, return NULL.
dt[, if(length(unique(myGroup))==nGroups) .BY else NULL, by="a"][[1]]
# [1] 2
If that code and the comments aren't clear on their own, a quick glance at the following might help. Basically, the approach above is just looking for and reporting the value of a for those groups that return x,y,z in column V1 below.
dt[,list(list(unique(myGroup))), by="a"]
# a V1
# 1: 1 x
# 2: 2 x,y,z
# 3: 3 x,z

Remove quotes from vector element in order to use it as a value

Suppose that I have a vector x whose elements I want to use to extract columns from a matrix or data frame M.
If x[1] = "A", I cannot use M$x[1] to extract the column with header name A, because M$A is recognized while M$"A" is not. How can I remove the quotes so that M$x[1] is M$A rather than M$"A" in this instance?
Don't use $ in this case; use [ instead. Here's a minimal example (if I understand what you're trying to do).
mydf <- data.frame(A = 1:2, B = 3:4)
mydf
# A B
# 1 1 3
# 2 2 4
x <- c("A", "B")
x
# [1] "A" "B"
mydf[, x[1]] ## As a vector
# [1] 1 2
mydf[, x[1], drop = FALSE] ## As a single column `data.frame`
# A
# 1 1
# 2 2
I think you would find your answer in the R Inferno. Start around Circle 8, "Believing It Does as Intended", in one of the "string not the name" sub-sections. You might also find some explanation in the line "The main difference is that $ does not allow computed indices, whereas [[ does" from the help page at ?Extract.
Note that this approach is taken because the question asks about extracting columns from a matrix or data frame, in which case the [row, column] mode of extraction is really the way to go anyway (the $ approach would not work with a matrix at all).
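As the ?Extract quote above notes, [[ accepts computed indices where $ does not, so the same lookup also works like this (with the mydf and x defined above):
mydf[[x[1]]]  # the column name comes from the vector, equivalent to mydf$A here
# [1] 1 2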
