How to extract values across all years in R?

I am following the code given in: https://github.com/CornellLabofOrnithology/ebird-best-practices/blob/master/03_covariates.Rmd
On arriving at this code:
# extract landcover values within neighborhoods, only needed most recent year
lc_extract_ext <- landcover[[paste0("y", max_lc_year)]] %>%
  exact_extract(r_cells, progress = FALSE)
lc_extract_cnt <- map(lc_extract_ext, ~ count(., landcover = value)) %>%
  tibble(id = r_cells$id, data = .)
lc_extract_pred <- unnest(lc_extract_cnt, data)
I want to extract landcover values within neighborhoods across all years. Instead of using the max_lc_year bit of code, I used this:
# I have thought of using
all_lc_year <- names(landcover) %>%
  str_extract("[0-9]{4}") %>%
  as.integer()
This should extract all years; however, it returns this error at this piece of code:
lc_extract_cnt <- map(lc_extract_ext, ~ count(., landcover = value)) %>%
  tibble(id = r_cells$id, data = .)
Error: Problem with mutate() input landcover.
x object 'value' not found
i Input landcover is value.
I'm thinking I need to loop over the years, either with map or a for loop, and bind the resulting data frames together.
EDIT:
I noticed the previous steps did not work. Here they are, corrected and working!
I have uploaded the landcover .tif files onto my Github here:
https://github.com/lime-n/Landcover.git
You can then implement this code to stack them together in R:
library(sf)
library(raster)
library(exactextractr)
library(viridis)
library(tidyverse)
# resolve namespace conflicts
select <- dplyr::select
map <- purrr::map
projection <- raster::projection
landcover <- list.files("insert_folder_name_here", "^modis_mcd12q1_umd",
                        full.names = TRUE) %>%
  stack()
# label layers with year
landcover <- names(landcover) %>%
  str_extract("(?<=modis_mcd12q1_umd_)[0-9]{4}") %>%
  paste0("y", .) %>%
  setNames(landcover, .)
landcover
neighborhood_radius <- 5 * ceiling(max(res(landcover))) / 2
I have uploaded the raster used to produce the r_cells data on my GitHub here: https://github.com/lime-n/Landcover/blob/master/prediction-surface.tif
Use the following code to get r_cells:
r <- raster("data/prediction-surface.tif")
r_centers <- rasterToPoints(r, spatial = TRUE) %>%
  st_as_sf() %>%
  transmute(id = row_number())
r_cells <- st_buffer(r_centers, dist = neighborhood_radius)
I understand the download and setup may take a few minutes. However, this code has been bugging me for weeks! Any help is appreciated.

I found that running the code individually for each year, using all_lc_year[1], all_lc_year[2], etc., and then combining all the rows with rbind(lc_extract_pred, lc_extract_pred_two, lc_extract_pred_three, ...) works.
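For reference, here is a rough sketch of what that per-year loop could look like with purrr (my own adaptation, not verified against the full dataset; it assumes the landcover stack and r_cells built above, with layers labelled "y" plus the year as set earlier):
# sketch: loop over the layer names, run the single-year extraction for each,
# and bind the per-year results together with a year column added
lc_extract_pred_all <- map_dfr(names(landcover), function(yr) {
  landcover[[yr]] %>%
    exact_extract(r_cells, progress = FALSE) %>%
    map(~ count(., landcover = value)) %>%
    tibble(id = r_cells$id, data = .) %>%
    unnest(data) %>%
    mutate(year = as.integer(str_extract(yr, "[0-9]{4}")))
})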
EDIT:
I found that the problem is in this code:
lc_extract_cnt <- map(lc_extract_ext, ~ count(., landcover = value)) %>%
  tibble(id = r_cells$id, data = .)
A similar problem has been mentioned here. When the full stack is extracted, each element of lc_extract_ext holds one column per year (plus coverage_fraction) rather than a single value column, so I was effectively trying to count many different columns under the single column landcover, which is why the code never worked. The fix is to replace landcover = value with the column names count() actually needs, and to start from
lc_extract_ext <- landcover %>%
instead of only the maximum year, so that all years are included.
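For completeness, a hedged sketch of what that all-columns version might look like (my own reconstruction rather than the exact code I ran):
lc_extract_ext <- landcover %>%
  exact_extract(r_cells, progress = FALSE)

# each list element has one column per year layer plus coverage_fraction,
# so pivot to long form and count per year
lc_extract_pred <- map2_dfr(lc_extract_ext, r_cells$id, function(x, cell_id) {
  x %>%
    select(-coverage_fraction) %>%
    pivot_longer(everything(), names_to = "year", values_to = "landcover") %>%
    count(year, landcover) %>%
    mutate(id = cell_id)
})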

Related

R Script - TopGO Package

I am an R beginner. I am using the topGO R package to perform gene ontology enrichment analysis.
In the last part of the script (#create topGO object), I get this error:
Error in .local(.Object, ...) : allGenes must be a factor with 2 levels
Could you please help me with this? I really appreciate your help. Thanks
Below I have reported the R script I have used. I would also like to upload my starting dataset, but I don't know how to do it.
library(tidyverse)
library(data.table)
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("topGO")
library(topGO)
# make table from meta data holding grouping information for each sample
# in this case landrace, cultivar ...
getwd()
setwd("C:/Users/iaia-/Desktop/")
background <- fread('anno.variable.out.txt') %>%
  as_tibble() %>%
  # the ID is the actual GO term
  mutate(id = as.numeric(id)) %>%
  na.omit() %>%
  # according to pannzer2, ARGOT is the best scoring algorithm
  filter(str_detect(type, 'ARGOT'),
         # choose a PPV value
         # according to the pannzer2 manual, there is no 'best' option
         # Philipp advised to use 0.4, 0.6, 0.8
         # maybe I write a loop to check differences later
         PPV >= 0.6) %>%
  # create an ontology column with the 3 ontology options that topGO supports
  # 'BP' - biological process, 'MF' - molecular function, 'CC' - cellular component
  mutate(ontology = sapply(strsplit(type, '_'), '[', 1)) %>%
  # select the 3 columns we need
  dplyr::select(id, ontology, qpid) %>%
  # add GO: to the GO ids
  dplyr::mutate(id = paste0('GO:', id))
foreground <- fread('LIST.txt',header=F)
colnames(foreground) <- c("rowname")
#rename type to group to prevent confusion
#dplyr::rename(group=type)
#for all groups together
all_results <- tibble()
for (o in unique(background$ontology)){
  # filter background for a certain ontology, e.g. BIOLOGICAL PROCESS
  ont_background <- filter(background, ontology == o)
  annAT <- split(ont_background$qpid, ont_background$id)
  # filter foreground for a group
  fg_genes <- foreground %>% pull(rowname)
  ont_background <- ont_background %>%
    mutate(present = as.factor(ifelse(qpid %in% fg_genes, 1, 0))) %>%
    dplyr::select(-id) %>%
    distinct() %>%
    pull(present, name = qpid)
  # create topGO object
  GOdata <- new("topGOdata", ontology = o, allGenes = ont_background, nodeSize = 5,
                annot = annFUN.GO2genes, GO2genes = annAT)
  weight01.fisher <- runTest(GOdata, statistic = "fisher")
  results <- GenTable(GOdata, classicFisher = weight01.fisher,
                      topNodes = ifelse(length(GOdata@graph@nodes) < 30,
                                        length(GOdata@graph@nodes), 30)) %>%
    dplyr::rename(pvalue = 6) %>%
    mutate(ontology = o,
           pvalue = as.numeric(pvalue))
  all_results <- bind_rows(all_results, results)
}
#all_results %>% pull(GO.ID) %>% writeClipboard()
resultspvalue <- all_results %>% dplyr::select(GO.ID, pvalue)
write_tsv(resultspvalue, "GOTermsrep_pvalue.txt")
It's tough to say for sure without knowing what ont_background looked like before you modified it, but based on these threads I think there's a problem with the final ont_background object you are using to make the topGO object in your final "#create topGO object" chunk.
It looks like the allGenes argument (ont_background) needs to be a vector of p-values with names() set to the gene names, in whatever format you are using for the other arguments.
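For example, a minimal sketch of that scored version (assuming a hypothetical deg_table with gene_id and pvalue columns, plus the annAT mapping from your loop): when allGenes is a named numeric vector of scores, topGO also expects a geneSel function saying which scores count as significant.
gene_scores <- setNames(deg_table$pvalue, deg_table$gene_id)  # hypothetical input table

GOdata <- new("topGOdata",
              ontology = "BP",
              allGenes = gene_scores,
              geneSel = function(p) p < 0.05,  # which genes count as interesting
              nodeSize = 5,
              annot = annFUN.GO2genes,
              GO2genes = annAT)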

R Function to Create Custom Data Frames from Larger Data Frame

OK, so I found somewhat similar questions already, but I'm not quite getting it. Here is my example. I have a very large table of data with a basic setup like the small example below. I will try to explain clearly what I want to do. I'm guessing it's easier than I think, but I'm not really good at creating functions or for loops at this point, and I'm guessing that's what I need. Here is the basic setup for my data.
test_year <- c(2019,2019,2019,2020,2020,2020,2021,2021,2021)
SN <- c(1001,1002,1003,1004,1005,1006,1007,1008,1009)
Owner <- c("Adam","Bob","Bob","Carl","Adam","Bob","Adam","Carl","Adam")
ObsA <- c(0,0,1,1,0,1,1,NA,1)
ObsB <- c(1,1,1,0,0,0,0,0,1)
ObsC <- c(0,0,0,0,1,1,0,0,0)
df <- data.frame(test_year, SN, Owner, ObsA, ObsB, ObsC)
From this, I need to be able to create smaller data frames by selecting individual observation columns. So if this were a small data set:
df_A <- df %>% select(test_year, SN, Owner, ObsA)
and then have a data frame for each of the other observations. And yes, it is easier to select the columns I want rather than the ones I don't, since most of the selected columns will be standard and I just need to change which observation is picked out of the more than 40 in my real data.
From these smaller data frames I will be doing numerous other operations, including making multiple tables and graphs. As examples, the following are similar to the types of graphs I will make (with some additional formatting that is simple enough). Notice, too, that these graphs have a title based on (though not identical to) the selected column.
df_A[is.na(df_A)] = 0
df_A

df_A %>%
  group_by(test_year) %>%
  summarize(n = n(), obs = sum(ObsA)) %>%
  ggplot(aes(x = test_year, y = 100*obs/n)) +
  ggtitle("Observation A") +
  geom_point()

df_A %>%
  group_by(Owner) %>%
  summarize(n = n(), obs = sum(ObsA)) %>%
  ggplot(aes(x = Owner, y = 100*obs/n)) +
  ggtitle("Observation A") +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_x_discrete()
As I said, additional analysis will also need to be done, so I need help figuring out how to structure a function to do what I'm after. Thanks!
Here is a way to return a list of plots.
Split all the 'Obs' columns into a list of data frames, then use imap to pass each data frame along with its column name (to use as the title).
library(tidyverse)

common_cols <- 1:3
df[is.na(df)] = 0

list_plots <- df %>%
  select(starts_with('Obs')) %>%
  split.default(names(.)) %>%
  imap(~ {
    tmp <- df[common_cols] %>% bind_cols(.x)
    tmp %>%
      group_by(test_year) %>%
      summarize(n = n(), obs = sum(.data[[.y]])) %>%
      ggplot(aes(x = factor(test_year), y = 100*obs/n)) +
      geom_point() +
      labs(x = 'Year', y = 'ratio', title = .y)
  })
Individual plots can be accessed with list_plots[[1]], list_plots[[2]], etc., or by name, e.g. list_plots$ObsA.
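If you also want the Owner-based bar charts from the question, the same pattern should work; a sketch under the same assumptions, using geom_col() in place of geom_bar(stat = "identity"):
owner_plots <- df %>%
  select(starts_with('Obs')) %>%
  split.default(names(.)) %>%
  imap(~ {
    df[common_cols] %>%
      bind_cols(.x) %>%
      group_by(Owner) %>%
      summarize(n = n(), obs = sum(.data[[.y]])) %>%
      ggplot(aes(x = Owner, y = 100*obs/n)) +
      geom_col() +
      coord_flip() +
      labs(title = .y)
  })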

How to make dot plot with multiple data points for single variable?

I would like to create a dot plot for my data set. I know how to create a normal dot plot for treatment comparisons or similar data sets using ggplot. I have the following data and would like to create a dot plot with three different colors. Please suggest how to prepare the data for this dot plot. If I had a single data point in NP and P it would be easy to plot, as I have already worked with similar data, but I can't see how to handle this kind of data. I can use the ggplot package in R.
The variable W always has a single data point, while NP and P have varying numbers of data points, i.e. sometimes one value in NP and sometimes three, and the same for P, as shown in the table.
Here is the screenshot of my data.
Sorry for my language.
I agree my data is a mess. I googled and did some coding to get the plot. I used the tidyverse and dplyr packages, but there is still a problem with the y-axis: it is very clumsy. I used the following code:
d <- read.table("Data1.txt", header = TRUE, sep = "\t", stringsAsFactors = NA)
df <- data.frame(d)
df <- df %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(!ID, names_to = "colid", values_to = "val") %>%
  separate_rows(val, sep = "\t", convert = TRUE) %>%
  mutate(ID = as_factor(ID))
Then I plotted the graph with ggplot:
ggplot(df, aes(x = ID, y = val, color = colid)) +
  geom_point(size = 1.5) +
  theme(axis.text.x = element_text(angle = 90))
The output is this. I tried to adjust the y-axis with ylim and scale_y_discrete(), but nothing worked. Please suggest a way to rectify it.
This needs quite a few data-cleaning steps, as suggested by user Dan Adams in the comments. This was kind of fun, and it helped me procrastinate on my own thesis.
I am using a function from a very famous thread which offers a way to split columns when the number of resulting columns is unknown.
P.S. The way you shared the data was less than ideal.
#your data is unreadable without this awesome package
# devtools::install_github("alistaire47/read.so")
library(tidyverse)
df <- read.so::read_md("|ID| |W| |NP| |P|
|:-:| |:-:| |:-:| |:-:|
|1| |4.161| |1.3,1.5| |1.5,2.8|
|2| |0.891| |1.33,1.8,1.79| |1.6|
|3| |7.91| |4.3| |0.899,1.43,0.128|
|40| |2.1| |1.4,0.99,7.9,0.32| |0.6,0.5,1.57|") %>%select(-starts_with("x"))
#> Warning: Missing column names filled in: 'X2' [2], 'X4' [4], 'X6' [6]
# from this thread https://stackoverflow.com/a/47060452/7941188
split_into_multiple <- function(column, pattern = ", ", into_prefix){
  cols <- str_split_fixed(column, pattern, n = Inf)
  cols[which(cols == "")] <- NA
  cols <- as_tibble(cols)
  m <- dim(cols)[2]
  names(cols) <- paste(into_prefix, 1:m, sep = "_")
  cols
}
# apply this over the columns of interest
ls_cols <- lapply(c("NP", "P"), function(x) split_into_multiple(df[[x]], pattern = ",", x))
# bind it to the single columns of the old data frame
# convert character columns to numeric
# apply pivot longer twice (there might be more direct options, but I won't be
# bothered to do too much here)
df_new <-
  bind_cols(df[c("ID", "W")], ls_cols) %>%
  pivot_longer(cols = c(-ID, -W), names_sep = "_", names_to = c(".value", "value")) %>%
  mutate(across(c(P, NP), as.numeric)) %>%
  select(-value) %>%
  pivot_longer(W:P, names_to = c("var"), values_to = "value")
# The new tidy data can easily be plotted
ggplot(df_new, aes(ID, value, color = var)) +
  geom_point()
#> Warning: Removed 12 rows containing missing values (geom_point).
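On the "more direct options" point: since the multi-value cells are just comma-separated strings, separate_rows() may get you there without the column-splitting helper. A sketch, assuming the same df read in above:
df_alt <- df %>%
  mutate(across(-ID, as.character)) %>%
  pivot_longer(c(W, NP, P), names_to = "var", values_to = "value") %>%
  separate_rows(value, sep = ",", convert = TRUE)

ggplot(df_alt, aes(factor(ID), value, color = var)) +
  geom_point()
Because convert = TRUE turns value numeric, the y-axis comes out continuous rather than the clumsy discrete axis from the question.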

Calculate pairwise correlation in R using dplyr::mutate

I have a large data frame where every row contains enough data to calculate a correlation using specific columns of the data frame, and I want to add a new column containing the calculated correlations.
Here is a summary of what I would like to do (this attempt uses dplyr):
example_data %>%
  mutate(pearsoncor = cor(x = X001_F5_000_A:X030_F5_480_C, y = X031_H5_000_A:X060_H5_480_C))
Obviously it is not working this way, as I get only NAs in the pearsoncor column. Does anyone have a suggestion? Is there an easy way to do this?
Best,
Example data frame
With tidyr, you can separately gather all the x- and y-variables you'd like to compare. You get a tibble containing the correlation coefficients and their p-values for every combination you provided.
library(dplyr)
library(tidyr)
example_data %>%
  gather(x_var, x_val, X001_F5_000_A:X030_F5_480_C) %>%
  gather(y_var, y_val, X031_H5_000_A:X060_H5_480_C) %>%
  group_by(x_var, y_var) %>%
  summarise(cor_coef = cor.test(x_val, y_val)$estimate,
            p_val = cor.test(x_val, y_val)$p.value)
Edit, an update some years later:
library(tidyr)
library(purrr)
library(broom)
library(dplyr)
longley %>%
  pivot_longer(GNP.deflator:Armed.Forces, names_to = "x_var", values_to = "x_val") %>%
  pivot_longer(Population:Employed, names_to = "y_var", values_to = "y_val") %>%
  nest(data = c(x_val, y_val)) %>%
  mutate(cor_test = map(data, ~ cor.test(.x$x_val, .x$y_val)),
         tidied = map(cor_test, tidy)) %>%
  unnest(tidied)
Here is a solution using the reshape2 package to melt() the data frame into long form so that each value has its own row. The original wide-form data has 60 values per row for each of the 6 genes, while the melted long-form data frame has 360 rows, one for each value. Then we can easily use summarize() from dplyr to calculate the correlations without loops.
library(reshape2)
library(dplyr)
names1 <- names(example_data)[4:33]
names2 <- names(example_data)[34:63]
example_data_longform <- melt(example_data, id.vars = c('Gene','clusterFR','clusterHR'))
example_data_longform %>%
  group_by(Gene, clusterFR, clusterHR) %>%
  summarize(pearsoncor = cor(x = value[variable %in% names1],
                             y = value[variable %in% names2]))
You could also generate more detailed results, as in Eudald's answer, using do():
detailed_r <- example_data_longform %>%
  group_by(Gene, clusterFR, clusterHR) %>%
  do(cor = cor.test(x = .$value[.$variable %in% names1],
                    y = .$value[.$variable %in% names2]))
This outputs a tibble with the cor column being a list with the results of cor.test() for each gene. We can use lapply() to extract output from the list.
lapply(detailed_r$cor, function(x) c(x$estimate, x$p.value))
I had the same problem a few days back, and I know loops are not optimal in R but that's the only thing I could think of:
df$r <- rep(0, nrow(df))
df$cor_p <- rep(0, nrow(df))
for (i in 1:nrow(df)){
  ct <- cor.test(as.numeric(df[i, cols_A]), as.numeric(df[i, cols_B]))
  df$r[i] <- ct$estimate
  df$cor_p[i] <- ct$p.value
}
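If you prefer to stay in dplyr, a rowwise() version of the same idea might look like this (a sketch, assuming cols_A and cols_B are character vectors of the column names used in the loop above, and that those columns are numeric):
library(dplyr)

df_cor <- df %>%
  rowwise() %>%
  mutate(r = cor.test(c_across(all_of(cols_A)),
                      c_across(all_of(cols_B)))$estimate,
         cor_p = cor.test(c_across(all_of(cols_A)),
                          c_across(all_of(cols_B)))$p.value) %>%
  ungroup()
Calling cor.test() twice per row is wasteful but keeps the sketch simple; storing the test result in a list-column would avoid the repetition.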

R purrr extracting multiple items from a list and converting to a data frame

I am currently learning purrr in R. I have code which does the following:
Uses the psych package in R to get the mean, SD, range, etc. from a list of questions.
Returns those statistics in a single data frame where the list item is added to the table as a column; in the case below it's schools.
Below is an example where I'm about 90% there, I think. All I want to do is add the names of the schools to the data frame as a column so as to be able to chart them afterwards. Can anyone help? The method below loses the names as soon as the bind_rows() command is run.
library(lavaan)
library(tidyverse)
# function pulls the mean, sd, range, kurtosis and skew
get_stats <- function(x){
  row_names <- rownames(x)
  mydf_temp <- x %>%
    dplyr::select(mean, sd, range, kurtosis, skew) %>%
    mutate_if(is.numeric, round, digits = 2) %>%
    filter(complete.cases(.))
  mydf_temp
}

# Generate the data for the reproducible example
mydf <- HolzingerSwineford1939 %>%
  select(school, starts_with("x")) %>%
  psych::describeBy(., group = .$school, digits = 2)

# Gets the summary statistics per school
stats_summ <- mydf %>%
  map(get_stats) %>%
  bind_rows()
We can use the .id argument of bind_rows():
mydf %>%
  map(get_stats) %>%
  bind_rows(., .id = 'group')
Using a reproducible example with the iris dataset:
mydf <- iris %>%
  psych::describeBy(., group = .$Species, digits = 2)
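Putting the two together with the iris example, just to show the expected shape of the output (the group column should hold the species names):
iris %>%
  psych::describeBy(., group = .$Species, digits = 2) %>%
  map(get_stats) %>%
  bind_rows(.id = 'group')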
