For someone new to R, what is the best way to view the range of a number of variables? I've run the summary command on the entire dataset; can I call range() on the entire dataset as well, or do I need to create a separate object for each variable in the dataset?
For an individual variable, you can use range(). To see the range of multiple variables, combine range() with one of the apply functions. See below for an example.
range(iris$Sepal.Length)
# [1] 4.3 7.9
sapply(iris[, 1:4], range)
#      Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,]          4.3         2.0          1.0         0.1
# [2,]          7.9         4.4          6.9         2.5
(Only the first four columns of iris were selected, since the fifth is a factor and range() doesn't apply to factors.)
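If you don't want to pick the numeric columns by position, a small variation (not part of the original answer) is to select them programmatically:
# select the numeric columns automatically instead of by index
num_cols <- sapply(iris, is.numeric)
sapply(iris[, num_cols], range)
# gives the same matrix as above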
I have a dataset with multiple date variables and want to create subsets where I can filter out certain rows by specifying the desired dates for those variables.
To be more precise: each row in the dataset represents a patient case in a psychiatric hospital and contains all the seclusions applied. So for each case there is either no seclusion, or the seclusions are documented as seclusion_date1, seclusion_date2, ..., seclusion_enddate1, seclusion_enddate2, ... (depending on how many seclusions occurred).
My plan is to create a subset with only those cases where either no seclusion is documented, or seclusion_date1 (the first seclusion) is after 2019-06-30 and all the possible seclusion_enddates (1, 2, 3, ...) are before 2020-05-01. Cases with seclusions before 2019-06-30 or after 2020-05-01 would be excluded.
I'm very new to R, so my attempts are probably way off. I appreciate any help or ideas.
I tried it with the subset function in R.
To filter all possible seclusion_enddates at once, I tried to use starts_with and I tried writing a loop.
all_seclusion_enddates <- function() { c(WMdata, any_of(c("seclusion_enddate")), starts_with("seclusion_enddate")) }
Error: `any_of()` must be used within a selecting function.
and then my plan would have been: cohort_2_before <- subset(WMdata, seclusion_date1 >= "2019-07-01" & all_seclusion_enddates <= "2020-04-30")
loop:
for(i in 1:53) { cohort_2_before <- subset(WMdata, seclusion_date1 >= "2019-07-01" & ((paste0("seclusion_enddate", i))) <= "2020-04-30" & restraint_date1 >= "2019-07-01" & ((paste0('seclusion_enddate', i))) <= "2020-04-30") }
Result: A subset with 0 obs. was created.
Since you don't provide a reproducible example, I can't see your specific problem, but I can help with the core issue.
any_of, starts_with and the like are selection helpers used by the tidyverse set of packages to pick columns; they can only be used within tidyverse selecting functions to control their behavior, which is why you got that error. They are probably still the right tools for this problem, though, so here's how you can use them:
Starting with the built-in dataset iris, we use the filter_at function from dplyr (enter ?filter_at in the R console to read the help). This function filters (selects specific rows of) a data.frame (given to the .tbl argument) based on a criterion (given to the .vars_predicate argument), which is applied to specific columns chosen by the selectors given to the .vars argument.
library(dplyr)
iris %>%
  filter_at(vars(starts_with('Sepal')), all_vars(. > 4))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.7 4.4 1.5 0.4 setosa
2 5.2 4.1 1.5 0.1 setosa
3 5.5 4.2 1.4 0.2 setosa
In this example, we take the dataframe iris, pass it into filter_at with the %>% pipe command, then tell it to look only in columns which start with 'Sepal', then tell it to select rows where all the selected columns match the given condition: value > 4. If we wanted rows where any column matched the condition, we could use any_vars(.>4).
You can add multiple conditions by piping it into other filter functions:
iris %>%
  filter_at(vars(starts_with('Sepal')), all_vars(. > 4)) %>%
  filter(Petal.Width > 0.3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.7 4.4 1.5 0.4 setosa
Here we filter the previous result again to get the rows that also have Petal.Width > 0.3.
In your case, you'd want to make sure your date values are stored as Date (with as.Date), then filter on seclusion_date1 and on vars(starts_with('seclusion_enddate')).
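As a rough sketch of how that could look for your cohort (untested, since there's no reproducible example; WMdata, the column names and the cutoff dates are taken from your description):
library(dplyr)
cohort_2_before <- WMdata %>%
  # convert the seclusion date columns to Date (assuming they are stored as text)
  mutate_at(vars(starts_with("seclusion")), as.Date) %>%
  # keep cases with no first seclusion, or a first seclusion after 2019-06-30
  filter(is.na(seclusion_date1) | seclusion_date1 >= as.Date("2019-07-01")) %>%
  # and require every documented end date to be before 2020-05-01
  filter_at(vars(starts_with("seclusion_enddate")),
            all_vars(is.na(.) | . <= as.Date("2020-04-30")))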
I have a single-cell RNA-seq dataset that I have been analyzing in R: a data frame with 205 columns and 15,000 rows, where each column is a cell and each row is a gene.
I have an annotation matrix that holds the identity of each cell, for example patient ID, disease status, etc.
I want to do different comparisons based on the grouping info provided by the annotation matrix.
I know that in Python, you can create a dictionary that is attached to the cell IDs.
What is an efficient way in R to perform subsetting of the same dataset in different ways?
So far what I have been doing is:
EC_index <- subset(annotation_index_LN, conditions == "EC_LN")
CP_index <- subset(annotation_index_LN, conditions == "CP_LN")
CD69pos <- subset(annotation_index_LN, CD69 == 100)
EC_CD69pos <- subset(EC_index, CD69 == 100)
EC_CD69pos <- subset(EC_CD69pos, id %in% colnames(manual_normalized))
CP_CD69pos <- subset(CP_index, CD69 == 100)
CP_CD69pos <- subset(CP_CD69pos, id %in% colnames(manual_normalized))
This probably won't entirely answer your question, but even before you begin trying to subset your data, you might want to think about converting it into a SummarizedExperiment. This is a type of object that can hold annotation data for both features and samples and will keep everything properly referenced if you decide to subset samples, remove rows, etc. It is commonly implemented by packages hosted on Bioconductor, which has loads of tutorials on various genomics pipelines where you can find more detailed information.
http://bioconductor.org/help/course-materials/
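For instance, a minimal sketch of how the pieces fit together (toy data and made-up annotation names, not your actual objects):
library(SummarizedExperiment)
# toy genes-by-cells matrix and matching cell annotation (illustrative values)
counts <- matrix(rpois(20, 5), nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))
cell_info <- DataFrame(patient = c("P1", "P1", "P2", "P2", "P3"),
                       condition = c("EC", "CP", "EC", "CP", "EC"),
                       row.names = colnames(counts))
se <- SummarizedExperiment(assays = list(counts = counts), colData = cell_info)
# subsetting columns keeps the expression values and the annotation in sync
se_ec <- se[, se$condition == "EC"]
assay(se_ec)    # expression values for the EC cells only
colData(se_ec)  # the matching annotation rows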
The following uses the iris data in R, since you haven't given a minimal example of your data.
For that you need an R package that provides %>%: it comes from the magrittr package, but is also re-exported by dplyr.
If you have to do a lot of subsetting, put the following in a function and pass the arguments on to subset.
library(magrittr)
library(stringr)  # for str_match

iris %>%
  subset(Species == "setosa" & Petal.Width == 0.2 & Petal.Length == 1.4) %>%
  subset(select = !is.na(str_match(colnames(iris), "Len")))
# Sepal.Length Petal.Length
# 1 5.1 1.4
# 2 4.9 1.4
# 5 5.0 1.4
# 9 4.4 1.4
# 29 5.2 1.4
# 34 5.5 1.4
# 48 4.6 1.4
# 50 5.0 1.4
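For example, a sketch of wrapping that pattern into a reusable function, as suggested above (the function and argument names are just illustrative):
# keep only the columns whose names match a pattern; uses %>% and str_match
# from the packages loaded above
select_matching <- function(df, pattern) {
  subset(df, select = !is.na(str_match(colnames(df), pattern)))
}
iris %>%
  subset(Species == "setosa" & Petal.Width == 0.2 & Petal.Length == 1.4) %>%
  select_matching("Len")
# same output as above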
I am a student working with the iris dataset in R, which has 3 flower species.
I am supposed to create, in one statement, a new vector from the Petal.Length vector that is the same except that for the virginica species I take the log base 10 value. I am not sure how to tell R to take the log base 10 of only the virginica values in the Petal.Length column while keeping the values for the other two species unchanged.
Use square brackets in R to subset data. The generic form is object[object operator condition]. For example, iris$Petal.Length[iris$Species == "virginica"] is equivalent to saying "show me the Petal.Length values only for the rows where Species equals virginica".
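For example, one way to build the requested vector in a single statement is with ifelse() (a sketch; the variable name is just illustrative):
# take log10 of Petal.Length only where Species is "virginica",
# keep the original values for the other two species
new_petal_length <- ifelse(iris$Species == "virginica",
                           log10(iris$Petal.Length),
                           iris$Petal.Length)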
More a curiosity than a question: is it possible to perform an operation only on specific columns of a data frame while maintaining the data frame's original structure?
For example, suppose I simply want to add 1 to the first 4 columns of the iris dataset, because the 5th column is a factor and it makes no sense to add values to it.
1. ignoring the factor column
Just perform the operation without worrying about the warning message:
ex <- iris[,] + 1
head(ex, 2)
#gives
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.1 4.5 2.4 1.2 NA
2 5.9 4.0 2.4 1.2 NA
so the original 5th column loses its values because of the nonsensical operation.
2. excluding the last column
Exclude the column's index from the operation:
ex <- iris[,-c(5)] + 1
head(ex, 2)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 6.1 4.5 2.4 1.2
2 5.9 4.0 2.4 1.2
but doing so I have to perform a cbind afterwards to recover the original column (not a big deal with this data frame).
I was wondering if there is a smarter solution for this operation. Imagine the data frame is very big: with cbind one loses the original positions of the columns, and it could be quite tricky to restore them.
Thanks to all
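(For illustration only, not an answer from the original thread: one common approach that avoids cbind entirely is to operate on the numeric columns and assign the result back in place, so the data frame keeps its shape and column order.)
ex <- iris
# add 1 to the first four columns and assign back in place;
# the factor column and the original column order are untouched
ex[1:4] <- ex[1:4] + 1
head(ex, 2)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          6.1         4.5          2.4         1.2  setosa
# 2          5.9         4.0          2.4         1.2  setosa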
I am trying to create a new data frame that is identical in number of columns (but not rows) to an existing data frame. All columns are of identical type, numeric. I need to sample each column of the original data frame (n=241 samples, replace=T) and add those samples to the new data frame at the same column position as in the original data frame.
My code so far:
#create the new data frame
tree.df <- data.frame(matrix(nrow=0, ncol=72))
#give same column names as original data frame (data3)
colnames(tree.df)<-colnames(data3)
#populate with NA values
tree.df[1:241,]=NA
#sample original data frame column wise and add to new data frame
for (i in colnames(data3)){
rbind(sample(data3[i], 241, replace = T),tree.df)}
The code isn't working out. Any ideas on how to get this to work?
Use the fact that a data frame is a list, and pass it to lapply to perform a column-by-column operation.
Here's an example, taking 5 elements from each column in iris:
as.data.frame(lapply(iris, sample, size=5, replace=TRUE))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.7 3.2 1.7 0.2 versicolor
## 2 5.8 3.1 1.5 1.2 setosa
## 3 6.0 3.8 4.9 1.9 virginica
## 4 4.4 2.5 5.3 0.2 versicolor
## 5 5.1 3.1 3.3 0.3 setosa
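If I've understood your setup, applying the same idea to your own objects would look something like this (data3 and the 241-row sample size are taken from your question):
# resample 241 values from each column of data3, with replacement
tree.df <- as.data.frame(lapply(data3, sample, size = 241, replace = TRUE))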
There are several issues here. Probably the one causing things not to work is how you are trying to access a column of the data frame data3. To do that, use data3[, i]. Note the comma: it separates the row index from the column index. (Without it, data3[i] returns a one-column data frame rather than a vector, so sample() does not behave as you expect.)
Additionally, since you already know how big your data frame will be, allocate the space from the beginning:
tree.df <- data.frame(matrix(nrow = 241, ncol = 72))
tree.df is already prepopulated with missing (NA) values so you don't need to do it again. You can now rewrite your for loop as
for (i in colnames(data3)) {
  tree.df[, i] <- sample(data3[, i], 241, replace = TRUE)
}
Notice I spelled out TRUE. This is better practice than using T because T can be reassigned. Compare:
T
# [1] TRUE
T <- FALSE
T
# [1] FALSE
TRUE <- FALSE
# Error in TRUE <- FALSE : invalid (do_set) left-hand side to assignment