R: accessing variable column names for subsetting

The following works and does what I want it to do:
dat<-subset(data,NLI.1 %in% NLI)
However, I may need to subset via a different column (e.g. NLI.2 or NLI.3). I've tried
NLI_col<-"NLI.1"
NLI_col<-subset(data,select=NLI_col)
dat<-subset(data,NLI_col %in% NLI)
Unsurprisingly this doesn't work. How do I use NLI_col to achieve the result from the code that does work?
It was requested that I give an example of what data looks like. Here:
NLI.1<-c(NA,NA,NA,NA,NA,1,2,2,2,NA,2,2,2,2,2,2,2,NA,NA,2,2,2,2,NA,2,2,2,2,2,2,2,NA,NA,NA,NA,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,1,2,2,2,2,2,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,2,2,1,2,2,2)
NLI.2<-c(NA,NA,NA,NA,NA,NA,2,2,2,NA,NA,2,2,2,2,2,2,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,NA,NA,NA,NA,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,1,2,2,2,2,2,2,2)
NLI.3<-c(NA,35,40,NA,10,NA,31,NA,14,NA,NA,15,17,NA,NA,16,10,15,14,39,17,35,14,14,22,10,15,0,34,23,13,35,32,2,14,10,14,10,10,10,40,10,13,13,10,10,10,13,13,25,10,35,NA,13,NA,10,40,0,0,20,40,10,14,40,10,10,10,10,13,10,8,NA,NA,14,NA,10,28,10,10,15,15,16,10,10,35,16,NA,NA,NA,NA,30,19,14,30,10,10,8,10,21,10,10,35,15,34,10,39,NA,10,10,6,16,10,10,10,10,34,10)
other<-c(NA,NA,511,NA,NA,NA,NA,NA,849,NA,NA,NA,NA,1324,1181,832,1005,166,204,1253,529,317,294,NA,514,801,534,1319,272,315,572,96,666,236,842,980,290,843,904,528,27,366,540,560,659,107,63,20,1184,1052,214,46,139,310,872,891,651,687,434,1115,1289,455,764,938,1188,105,757,719,1236,982,710,NA,NA,632,NA,546,747,941,1257,99,133,61,249,NA,NA,1080,NA,645,19,107,486,1198,276,777,738,1073,539,1096,686,505,104,5,55,553,1023,1333,NA,NA,969,691,1227,1059,358,991,1019,NA,1216)
data<-cbind(NLI.1,NLI.2,NLI.3,other)
NLI<-c(10,13)
With this, after subsetting I should get all the rows with tens and thirteens in data$NLI.3 if NLI_col <- "NLI.3".
Since this is relatively trivial, I am guessing this is a duplicate question (my apologies), but the hours drag on and I still can't find a solution.

Seems like you are unnecessarily using subset. Try this:
NLI_col <- 'NLI.3'
head(data[,NLI_col] %in% NLI)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE
head(data[data[,NLI_col] %in% NLI, ])
## NLI.1 NLI.2 NLI.3 other
## 5 NA NA 10 NA
## 17 2 2 10 1005
## 26 2 2 10 801
## 31 2 2 13 572
## 36 2 2 10 980
## 38 2 2 10 843
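If the column name needs to vary, one option is to wrap that indexing in a small helper; a minimal sketch (the function name subset_by_col is mine, not from the question):
# hypothetical helper: keep rows of d where column `col` has a value in `values`
subset_by_col <- function(d, col, values) {
  d[d[, col] %in% values, , drop = FALSE]
}
subset_by_col(data, "NLI.3", NLI)   # same rows as the head() output above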

I'm not sure I am following the question exactly. Are you asking to just subset the rows of NLI.3 that contain a 10 or a 13? Is it more complicated than that?
If you just want to get those rows....
df[which(df$NLI.3 == 10 | df$NLI.3 == 13), ]
Assuming your data is in a data frame. Also, I changed the name of the data frame from 'data' to 'df', since calling it 'data' can lead to issues.
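Note that cbind() in the question actually builds a matrix, not a data frame. A minimal sketch of constructing df as a data frame from the example vectors, and of combining this answer with the variable column name from the question:
df <- data.frame(NLI.1, NLI.2, NLI.3, other)
NLI_col <- "NLI.3"
df[which(df[[NLI_col]] %in% NLI), ]   # same result, with the column chosen by name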


In R, concatenate a conditional vector of strings [closed]

I am struggling to combine information from two tables I have in R. I want to search to see whether a string of characters in one data frame is present in another data frame. If it is, I want to record the name for that string and append it to a new data frame.
Here's what I am working with:
df_repeats
sequence promoter_numbers promotors
1 AAAAAAAAAAAA 715 NA
2 AAAAAAAAAAAC 61 NA
3 AAAAAAAAAAAG 184 NA
df_promotors
gene promotor_coordinates sequence
1 Xkr4_1 range=chr1:3671549-36… GAGCTAGTTCTCTTTTCCCTGGTTACTAGCCATGTCCCTCCTCCCA…
2 Rp1_2 range=chr1:4360255-43… CACACACACACACACACACACACACATGTAACAATGAAACAAAAAG…
3 Rp1_1 range=chr1:4409254-44… AGGTATAACTTGGTAAAGACTTTGAAGTAAACAAGAACAAACAGCT…
I am trying to see which gene repeat sequences in df_repeats are present in the sequence column of df_promotors. My goal is to create a new data frame so I can perform some visualizations. I've been struggling to create something like the example below:
df_repeat_occurances
sequence promotor_numbers in_genes
1 AAAAAAAAAAAA 715 Rp1_2
2 AAAAAAAAAAAC 61 Xkr4_1, Rp1_2
3 AAAAAAAAAAAG 184 Xkr4_1
I tried to write a nested loop that searches through and, if there's a match, appends the gene to df_repeats in place of the NA, with the row names to be changed later. But I am completely lost on how to do this, or whether it's even a good way to combine the information from the two tables into one. Here's what I tried and could not get to work:
for (i in 1:nrow(df_repeats)) {
  x = df_repeats$sequence[i]
  for (j in 1:nrow(df_promotors)) {
    if (grepl(x, df_promotors$sequence[j])) {
      y = df_promotors$gene[j]
      df_repeats$sequence[i] = c(df_repeats$sequence[i], " ", y)
    }
  }
}
First time ever posting and asking for help, so any guidance or pointers would be greatly appreciated!!!
Welcome to SO. In the future, please include a reproducible example, as I did below, with some meaningful names such as "result".
Also mark any acceptable answer as "accepted".
The best approach is to separate the different computation steps.
#First, define a reproducible example
sequences <- c("AAA", "BBB", "CCC", 'DDD')
promnb <- 1:4
result <- data.frame(sequences, promnb)
genes_names <- paste0("gene_", letters[1:4])
sequence <- c('BBB', 'ABC', 'AAA', 'AAA')
df_proms <- data.frame(genes_names, sequence)
# genes_names sequence
# 1 gene_a BBB
# 2 gene_b ABC
# 3 gene_c AAA
# 4 gene_d AAA
# 1: check in which genes each sequence is present using grepl
# sapply applies the function to each element of result$sequences; each call returns one logical per row of df_proms:
in_genes <- sapply(result$sequences, function(x) grepl(x, df_proms$sequence))
# AAA BBB CCC DDD
# [1,] FALSE TRUE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE
# [3,] TRUE FALSE FALSE FALSE
# [4,] TRUE FALSE FALSE FALSE
#2: replace TRUE or FALSE by the names of the genes
in_genes_names <- data.frame(ifelse(in_genes, paste0(genes_names), ""))
#3: finally, paste each column of the last df to get all the names of the genes that contain this sequence
result$in_genes <- sapply(in_genes_names, paste, collapse = " ")
result$in_genes <- trimws(result$in_genes)
# By the way, you'd probably want to keep a list of the matches
# you can also include this list as a column of the result df
result$in_genes_list <- sapply(in_genes_names, list)
result
# sequences promnb in_genes in_genes_list
# 1 AAA 1 gene_c gene_d , , gene_c, gene_d
# 2 BBB 2 gene_a gene_a, , ,
# 3 CCC 3 , , ,
# 4 DDD 4 , , ,
You may try the following sapply loop:
df_repeats$in_genes <- sapply(df_repeats$sequence, function(x)
  toString(df_promotors$gene[grepl(x, df_promotors$sequence)]))
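For a quick check on a toy example (the values below are made up for illustration; only the column names follow the question):
df_repeats <- data.frame(sequence = c("AAA", "CCC"))
df_promotors <- data.frame(gene = c("gene_a", "gene_b"),
                           sequence = c("xxAAAyy", "zzAAAqq"))
df_repeats$in_genes <- sapply(df_repeats$sequence, function(x)
  toString(df_promotors$gene[grepl(x, df_promotors$sequence)]))
df_repeats
#   sequence       in_genes
# 1      AAA gene_a, gene_b
# 2      CCC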

Change default NA from logical to character

Is there a way to change the default NA (missing) from logical to character (NA_character_) for an entire R session?
For example, if you load a CSV where one column is empty, it will be filled with NA, and the class of that NA will be logical. For this question, we want a way to ensure that it will always be NA_character_. Not to be confused with the literal string "NA".
More examples:
> class(NA)
"logical" # No!
> class(NA_character_)
"character" # Yes! but for NA!
Not sure if I understand, but you could specify the na.strings argument.
Example:
df <- read.table(text='
a b c d e
1 56 43.0 12 1 NA
2 23 NA 7 2 45
3 15 90.7 10 3 2
4 10 30.5 2 4 NA', na.strings="", as.is=T)
And:
> class(df$b)
[1] "character"
>
As far as I can see, the answer is no:
From the documentation of ?NA:
Details
The NA of character type is distinct from the string "NA". Programmers who need to specify an explicit missing string should use NA_character_ (rather than "NA") or set elements to NA using is.na<-.
I browsed through the list of input parameters to the 'options' function and nothing seems to apply here.
I think the best and safest way is to explicitly define the NAs where they are likely to be encountered. For the CSV case in your example, I would recommend the readr package, where 'col_types' is used to define the column classes.
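A minimal sketch of that readr approach (assuming the readr package is installed; the inline CSV text below is just for illustration):
library(readr)
csv_text <- "a,b\n1,\n2,x\n"   # column b has an empty field in the first row
df <- read_csv(csv_text, col_types = cols(a = col_integer(), b = col_character()))
class(df$b)
# [1] "character"
df$b
# [1] NA  "x"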

Remove All Columns where the last row is not equal to specific value x [duplicate]

I have a data frame (DF) that looks like this:
DF <- rbind (c(10,20,30,40,50), c(21,68,45,33,21), c(11,98,32,10,30), c(50,70,70,70,50))
10 20 30 40 50
21 68 45 33 21
11 98 32 10 30
50 70 70 70 50
In my scenario, x would be 50, so my resulting data frame (resultDF) will look like this:
10 50
21 21
11 30
50 50
How can I do this in R? I have attempted using subset as below, but it doesn't work as I expect:
resultDF <- subset(DF, DF[nrow(DF),] == 50)
Error in x[subset & !is.na(subset), vars, drop = drop] :
(subscript) logical subscript too long
I have solved it. My subsetting was inaccurate. I used the following piece of code to get the result I needed:
resultDF <- DF[, DF[nrow(DF),] == 50]
Your issue with subset() was only about the syntax for calling it with a logical column vector (its third arg, not its second). You can either use subset() or plain logical indexing. The latter is recommended.
The help page ?subset tells you its optional second arg ('subset') is a logical row-vector, and its optional third arg ('select') is a logical column-vector:
subset: logical expression indicating elements or rows to keep:
missing values are taken as false.
select: expression, indicating columns to select from a data frame.
So you want to call it with this logical column-vector:
> DF[nrow(DF),] == 50
[1] TRUE FALSE FALSE FALSE
There are two syntactical ways to leave subset()'s second arg default and pass the third arg:
# Explicitly pass the third arg by name...
> subset(DF, select=(DF[nrow(DF),] == 50) )
# Leave 2nd arg empty, it will default (to NULL)...
> subset(DF, , (DF[nrow(DF),] == 50) )
[,1] [,2]
[1,] 10 50
[2,] 21 21
[3,] 11 30
[4,] 50 50
The second way is probably preferable, as it looks like generic row,col-indexing and doesn't require you to know the third arg's name.
(As a mnemonic, in R and SQL terminology 'select' implicitly means column indices, while 'filter'/'subset' implicitly means row indices; in data.table terminology these are the j- and i-indices, respectively.)
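One small caveat worth adding (my note, not part of the answer above): with plain logical indexing, a single matching column is dropped to a vector unless drop = FALSE is supplied:
resultDF <- DF[, DF[nrow(DF), ] == 50, drop = FALSE]  # keeps the two-dimensional shape even with one match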

Change multiple dataframes in a loop

I have, for example, these three datasets (in my case there are many more, with a lot of variables):
data_frame1 <- data.frame(a=c(1,5,3,3,2), b=c(3,6,1,5,5), c=c(4,4,1,9,2))
data_frame2 <- data.frame(a=c(6,0,9,1,2), b=c(2,7,2,2,1), c=c(8,4,1,9,2))
data_frame3 <- data.frame(a=c(0,0,1,5,1), b=c(4,1,9,2,3), c=c(2,9,7,1,1))
On each data frame I want to add a variable resulting from a transformation of an existing variable in that data frame. I would like to do this with a loop. For example:
datasets <- c("data_frame1","data_frame2","data_frame3")
vars <- c("a","b","c")
for (i in datasets){
  for (j in vars){
    # here I need code that creates a new variable with transformed values
    # I thought this would work, but it didn't...
    get(i)$new_var <- log(get(i)[,j])
  }
}
Do you have any suggestions?
Moreover, it would be great if it were also possible to assign the new column names (in this case new_var) from a character string, so I could create the new variables with another for loop nested inside the other two.
I hope I haven't been too tangled in explaining my problem.
Thanks in advance.
You can put your data frames in a list and use lapply to process them one by one, so there is no need for a loop in this case.
For example you can do this :
data_frame1 <- data.frame(a=c(1,5,3,3,2), b=c(3,6,1,5,5), c=c(4,4,1,9,2))
data_frame2 <- data.frame(a=c(6,0,9,1,2), b=c(2,7,2,2,1), c=c(8,4,1,9,2))
data_frame3 <- data.frame(a=c(0,0,1,5,1), b=c(4,1,9,2,3), c=c(2,9,7,1,1))
ll <- list(data_frame1, data_frame2, data_frame3)
lapply(ll, function(df){
  df$log_a <- log(df$a)              ## new column with the log of a
  df$tans_col <- df$a + df$b + df$c  ## new column with the sum of some columns, or any other transformation
  df
})
The first data frame then becomes:
[[1]]
a b c log_a tans_col
1 1 3 4 0.0000000 8
2 5 6 4 1.6094379 15
3 3 1 1 1.0986123 5
4 3 5 9 1.0986123 17
5 2 5 2 0.6931472 9
I had the same need and also wanted to change the columns in my actual list of data frames.
I found a great method here (the purrr::map2 method in the question works for data frames with different columns), followed by
list2env(list_of_dataframes, .GlobalEnv)
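A minimal sketch of that pattern (the list and its names are illustrative): keep the data frames in a named list, transform them with lapply, then push the modified versions back into the global environment:
list_of_dataframes <- list(data_frame1 = data_frame1,
                           data_frame2 = data_frame2,
                           data_frame3 = data_frame3)
list_of_dataframes <- lapply(list_of_dataframes, function(df) {
  df$log_a <- log(df$a)   # any per-data-frame transformation
  df
})
list2env(list_of_dataframes, .GlobalEnv)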

Linking two datasets

I have a dataset called "J_BL5H1", which includes:
Var1 Freq
4 10
8 10
10 13
11 7
13 3
17 10
19 10
25 1
26 4
27 8
53 13
From this dataset, I want to pull out each Var1 separately, and I want to call this new data something like J_BL5H1JNVar1Number, where Var1Number denotes a specific Var1, e.g. 4, 8, 10.
I will use this:
J_BL5H1JNVar1Number <- J_BL5H1$Freq[1]
Here, I want Var1Number to be replaced with the "Var1" value from the old data.
For example, if I want to know Freq[4], my new data should be called "J_BL5H1JN11": the Var1Number is automatically replaced by the Var1 of Freq[4], in this case 11.
I hope I have stated my problem clearly. Thanks.
First, use paste to create the names of the data sets:
data.string <- "J_BL5H1LN"
split.var <- "Var1"
data.sets <- paste(data.string, J_BL5H1[, split.var], sep = "")
Then use a loop to assign the corresponding values to the data sets:
for( i in seq_along(data.sets) ) assign(data.sets[i], J_BL5H1[i, "Freq"])
Now you have the data sets in your workspace:
ls()
By the way, if you want to access the different data sets without actually calling them every time, you can access them by name using the get function:
sapply(data.sets, get)
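A quick sketch of what this produces, using the first few rows of the Var1/Freq table above (assuming it is stored in a data frame named J_BL5H1):
J_BL5H1 <- data.frame(Var1 = c(4, 8, 10, 11),
                      Freq = c(10, 10, 13, 7))
data.sets <- paste("J_BL5H1LN", J_BL5H1[, "Var1"], sep = "")
for (i in seq_along(data.sets)) assign(data.sets[i], J_BL5H1[i, "Freq"])
J_BL5H1LN11
# [1] 7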
