Run a function through multiple dataframes in R

There are many other questions that read similarly to this one, but none address my query (in a way I understand).
I have multiple dataframes, 'snps', 'snp2', 'snp3', 'snp4' and 'snp5', which all have the same format:
library(tidyr)
library(TwoSampleMR)
glimpse(snps)
Rows: 4,873
Columns: 4
$ chrpos <chr> "6:39016574-39055519", "6:39016574-39055519", "6:39016574-39055519"
$ target <chr> "GL", "GL", "GL"
$ query <chr> "6:39016574-39055519", "6:39016574-39055519", "6:39016574-39055519"
$ name <chr> "rs113920163", "rs183723208", "rs555466268"
The only thing that differs is the 'target' column.
I want to run a function from the TwoSampleMR package
glc <- extract_outcome_data(snps = snps$name, outcomes = 'ukb-a-583')
But I want to do this for the other dataframes, in a loop
I made a list of all the names in each of the 'snps..' dataframes called targets
glimpse(targets)
Rows: 11
Columns: 1
$ target <chr> "GL", "ML", "HL", "TD", "ED"
And have been trying to loop through
list <- targets$target
for(i in list)
l <- list()
{
[[i]] <- extract_outcome_data(snps = targets$target, outcomes = 'ukb-a-583')
}
This runs, but the object it creates, 'l', is empty, and there is no error message, so I don't know what to change.
Thanks!
*** EDIT ***
I am now trying
my_files <- list.files(pattern = "*_files.txt")
my_data <- lapply(my_files, read.table)
for (i in 1:length(my_data)){
dat <- extract_outcome_data(snps = my_data$[i]$targets, outcomes='ukb-a-583'))))
}
Everything works up to the for loop, and the for loop itself runs without error. However, it does not extract the information that I need. I think it's due to the 'snps = my_data$[i]$targets' bit. How do I access the column 'targets' in each of the dataframes in the my_data list?
Thanks!
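On the question in the edit: an element of a list is reached with [[i]] and a column of that element with $, so the expression would be my_data[[i]]$name (or whichever column holds the SNP IDs) rather than my_data$[i]$targets. Below is a minimal sketch under stated assumptions: the two data frames hold made-up values, the SNP IDs are assumed to sit in a column called name as in the glimpse() output, and lookup_outcome() is a hypothetical stand-in for extract_outcome_data() so that the example runs without network access.

```r
# Two small data frames standing in for the real 'snps' tables (made-up values).
my_data <- list(
  GL = data.frame(target = "GL", name = c("rs1", "rs2"), stringsAsFactors = FALSE),
  ML = data.frame(target = "ML", name = c("rs3", "rs4", "rs5"), stringsAsFactors = FALSE)
)

# Hypothetical stand-in for extract_outcome_data(); it only echoes its input.
lookup_outcome <- function(snps, outcomes) {
  data.frame(SNP = snps, outcome = outcomes, stringsAsFactors = FALSE)
}

# Create the result list BEFORE the loop, then fill one element per data frame.
results <- vector("list", length(my_data))
names(results) <- names(my_data)
for (i in seq_along(my_data)) {
  results[[i]] <- lookup_outcome(snps = my_data[[i]]$name, outcomes = "ukb-a-583")
}

# The same thing without an explicit loop.
results2 <- lapply(my_data, function(df) lookup_outcome(df$name, "ukb-a-583"))
```

Keeping the results in a named list means each one can later be retrieved as results$GL, results$ML and so on. In the loop from the original attempt, the result list was created inside the loop header and assigned to with a bare [[i]], so nothing was ever stored.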

Related

loop through a filter call in R

I have a dataframe of genetic data called 'windows'
glimpse(windows)
Rows: 3,000
Columns: 5
$ chrpos <chr> "1:104159999-104168402", "1:104159999-104168402", "1:104159999-104168402"
$ target <chr> "AMY2A", "CFD13", "PUTA"
$ name <chr> "rs5753", "rs70530", "rs21111"
$ chr <chr> "1", "1", "1"
$ pos <int> 104560629, 104562750, 104557705
I want to split it into separate dataframes, one for each value in the column 'target'
AMY2A<- filter(windows, target == 'AMY2A')
write.table("am2ya.txt", header=T)
Which works fine, but is laborious to do for each value of 'target'.
How do I loop through all of them and save each one out as a text file, in one go?
list <- windows$target
for(i in list)
df.list[[i]]<-filter(windows, target == list[[i]])
Which gives the error:
Error in `filter()`:
! Problem while computing `..1 = target == list[[i]]`.
Caused by error in `list[[i]]`:
! subscript out of bounds
And when I google saving out multiple txt files, the results only cover how to read in multiple text files.
Any help would be great, thanks!
Or, using the tidyverse, something like
windows %>%
  group_by(target) %>%
  group_walk(
    function(.x, .y) {
      write.table(.x, paste0(tolower(.y$target[1]), ".txt"))
    },
    .keep=TRUE
  )
[Untested code, since your question is not reproducible.]
The .keep=TRUE keeps the target column in the text file. If that's unnecessary, feel free to delete it. Details are in the online doc.
Probably something like:
library(dplyr)
# unique() so that each target is processed once, not once per row
lst <- unique(windows$target)
for(i in seq_along(lst)){
  dat <- filter(windows, target == lst[[i]])
  write.table(dat, paste0(tolower(lst[[i]]), ".txt"))
}
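The same idea can also be sketched in base R with split(), which cuts a data frame into a named list with one piece per distinct value. This is only an illustration: the small windows table below uses made-up rows, files are written to tempdir() rather than the working directory, and the tab-separated output settings are an assumption rather than anything the question asked for.

```r
# Toy version of 'windows' (made-up rows, same column names as in the question).
windows <- data.frame(
  target = c("AMY2A", "AMY2A", "CFD13", "PUTA"),
  name = c("rs5753", "rs70530", "rs21111", "rs99"),
  stringsAsFactors = FALSE
)

# Write into a temporary directory so the example leaves no files behind.
out_dir <- tempdir()

# One data frame per distinct target, named by the target value.
pieces <- split(windows, windows$target)

# Save each piece as <target in lower case>.txt.
for (nm in names(pieces)) {
  out_file <- file.path(out_dir, paste0(tolower(nm), ".txt"))
  write.table(pieces[[nm]], out_file, row.names = FALSE, quote = FALSE, sep = "\t")
}
```

Because pieces is a named list, the individual data frames remain available afterwards as pieces$AMY2A, pieces$CFD13 and so on.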

Beginner using pipes

I am a beginner and I'm trying to find the most efficient way to change the name of the first column for many CSV files that I will be creating. Once I have created the CSV files, I am loading them into R as follows:
data <- read.csv('filename.csv')
I have used the names() function to do the name change of a single file:
names(data)[1] <- 'Y'
However, I would like to find the most efficient way of combining/piping this name change to read.csv so the same name change is applied to every file when they are opened. I tried to write a 'simple' function to do this:
addName <- function(data) {
names(data)[1] <- 'Y'
data
}
However, I do not yet fully understand the syntax for writing a function and I can't get this to work.
Note
If you were expecting your original addName function to "mutate" an existing object like so
x <- data.frame(Column_1 = c(1, 2, 3), Column_2 = c("a", "b", "c"))
# Try (unsuccessfully) to change title of "Column_1" to "Y" in x.
addName(x)
# Print x.
x
please be aware that R passes by value rather than by reference, so x itself would remain unchanged:
  Column_1 Column_2
1        1        a
2        2        b
3        3        c
Any "mutation" would be achieved by overwriting x with the return value of the function
x <- addName(x)
# Print x.
x
in which case x itself would obviously be changed:
  Y Column_2
1 1        a
2 2        b
3 3        c
Answer
Anyway, here's a solution that compactly incorporates pipes (%>% from the magrittr package) and a custom function. Please note that without the linebreaks and comments, which I have added for clarity, this could be condensed to only a few lines of code.
# The dplyr package helps with easy renaming, and it includes the magrittr pipe.
library(dplyr)
# ...
filenames <- c("filename1.csv", "filename2.csv", "filename3.csv")
# A function to take a CSV filename and give back a renamed dataset taken from that file.
addName <- function(filename) {
  # Read in the named file as a data.frame.
  read.csv(file = filename) %>%
    # Take the resulting data.frame, and rename its first column as "Y";
    # quotes are optional, unless the name contains spaces: "My Column"
    # or `My Column` are needed then.
    dplyr::rename(Y = 1)
}
# Get a list of all the renamed datasets, as taken by addName() from each of the filenames.
all_files <- sapply(filenames, FUN = addName,
                    # Keep the list structure, in which each element is a
                    # data.frame.
                    simplify = FALSE,
                    # Name each list element by its filename, to help keep track.
                    USE.NAMES = TRUE)
In fact, you could easily rename any columns you desire, all in one fell swoop:
dplyr::rename(Y = 1, 'X' = 2, "Z" = 3, "Column 4" = 4, `Column 5` = 5)
This will read a vector of filenames, change the name of the first column of each one to "Y" and store all of the files in a list.
filenames <- c("filename1.csv","filename2.csv")
addName <- function(filename) {
data <- read.csv(filename)
names(data)[1] <- 'Y'
data
}
files <- list()
# seq_along() also behaves correctly when filenames is empty
for (i in seq_along(filenames)) {
  files[[i]] <- addName(filenames[i])
}
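To try this approach without real files, the sketch below first writes two small CSV files (made-up contents) into tempdir() and then applies the same addName() function to each of them with lapply(). The file names and column names are placeholders.

```r
# Create two throwaway CSV files so the example is self-contained.
tmp <- tempdir()
filenames <- file.path(tmp, c("filename1.csv", "filename2.csv"))
for (f in filenames) {
  write.csv(data.frame(a = 1:3, b = letters[1:3]), f, row.names = FALSE)
}

# Read one file and rename its first column to "Y".
addName <- function(filename) {
  data <- read.csv(filename)
  names(data)[1] <- "Y"
  data
}

# Apply the function to every file; the result is a list of data frames.
files <- lapply(filenames, addName)

# Label each list element with the file it came from.
names(files) <- basename(filenames)
```

Each renamed data frame can then be picked out by file name, for example files[["filename1.csv"]].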

R, creating variables on the fly in a list using assign statement

I want to create variable names on the fly inside a list and assign them values in R, but I am unable to get the desired result. Here is the logic of my code:
Upon the function call: dat_in <- readf(1,2), an input file is read based on a product and site. After reading, a particular column (13th, here) is assigned to a variable aot500. I want to have this variable return from the function for each combination of product and site. For example, I need variables name in the list statement as aot500.AF, aot500.CM, aot500.RB to be returned from this function. I am having trouble in the return statement. There is no error but there is nothing in dat_in. I expect it to have dat_in$aot500.AF etc. Please inform what is wrong in the return statement. Furthermore, I want to read files for all combinations in a single call to the function, say using a for loop and I wonder how would the return statement handle list of more variables.
prod <- c('inv','tot')
site <- c('AF','CM','RB')
readf <- function(pp, kk) {
fname.dsa <- paste("../data/site_data_",prod[pp],"/daily_",site[kk],".dat",sep="")
inp.aod <- read.csv(fname.dsa,skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
aot500 <- inp.aod[,13]
return(list(assign(paste("aot500",siteabbr[kk],sep="."),aot500)))
}
There is almost never a need to use assign(). We can solve the problem in two steps: read the files into a list, then give the list elements names.
(Not tested as we don't have your files)
prod <- c('inv', 'tot')
site <- c('AF', 'CM', 'RB')
# get combo of site and prod
prod_site <- expand.grid(prod, site)
colnames(prod_site) <- c("prod", "site")
# Step 1: read the files into a list
res <- lapply(seq_len(nrow(prod_site)), function(i){
  fname.dsa <- paste0("../data/site_data_",
                      prod_site[i, "prod"],
                      "/daily_",
                      prod_site[i, "site"],
                      ".dat")
  inp.aod <- read.csv(fname.dsa,
                      skip = 4,
                      stringsAsFactors = FALSE,
                      na.strings = "N/A")
  inp.aod[, 13]
})
# Step 2: assign names to a list
names(res) <- paste("aot500", prod_site$prod, prod_site$site, sep = ".")
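One reason dat_in$aot500.AF is not available with the original return statement: assign() creates a variable in the current environment and hands back only the value, so list(assign(...)) is a one-element list with no name. A small sketch with made-up numbers shows the difference between that and naming the list element directly with setNames().

```r
aot500 <- c(0.12, 0.30, 0.25)   # made-up values standing in for column 13
site <- "AF"

# What the original return statement builds: the value is kept, the name is lost.
unnamed <- list(assign(paste("aot500", site, sep = "."), aot500))
names(unnamed)

# Naming the list element itself makes it reachable with $.
named <- setNames(list(aot500), paste("aot500", site, sep = "."))
named$aot500.AF
```

Inside a function, returning a list built like named gives the caller dat_in$aot500.AF without any use of assign().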
I propose two answers, one based on dplyr and one based on base R.
You'll probably have to adapt the filename in the readAOT_500 function to your particular case.
Base R answer
#' Function that reads AOT_500 from the given product and site file
#' @param prodsite character vector containing 2 elements
#' name of a product and name of a site
readAOT_500 <- function(prodsite,
                        selectedcolumn = c("AOT_500"),
                        path = tempdir()){
  cat(path, prodsite)
  filename <- paste0(path, prodsite[1],
                     prodsite[2], ".csv")
  dtf <- read.csv(filename, stringsAsFactors = FALSE)
  dtf <- dtf[selectedcolumn]
  dtf$prod <- prodsite[1]
  dtf$site <- prodsite[2]
  return(dtf)
}
# Load one file for example
readAOT_500(c("inv", "AF"))
listofsites <- list(c("inv","AF"),
c("tot","AF"),
c("inv", "CM"),
c( "tot", "CM"),
c("inv", "RB"),
c("tot", "RB"))
# Load all files in a list of data frames
prodsitedata <- lapply(listofsites, readAOT_500)
# Combine all data frames together
prodsitedata <- Reduce(rbind,prodsitedata)
dplyr answer
I use Hadley Wickham's packages to clean data.
library(dplyr)
library(tidyr)
daily_CM <- read.csv("~/downloads/daily_CM.dat",skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
# Generate all combinations of product and site.
prodsite <- expand.grid(prod = c('inv','tot'),
site = c('AF','CM','RB')) %>%
# Group variables to use do() later on
group_by(prod, site)
Create 6 fake files by sampling from the data you provided
You can skip this section when you have real data.
I used various sample length so that the number of observations
differs for each site.
prodsite$samplelength <- sample(1:495,nrow(prodsite))
prodsite %>%
do(stuff = write.csv(sample_n(daily_CM,.$samplelength),
paste0(tempdir(),.$prod,.$site,".csv")))
Read many files using dplyr::do()
prodsitedata <- prodsite %>%
do(read.csv(paste0(tempdir(),.$prod,.$site,".csv"),
stringsAsFactors = FALSE))
# Select only the columns you are interested in
prodsitedata2 <- prodsitedata %>%
select(prod, site, AOT_500)

How to efficiently create the same variables for each element of a list?

I am a long-time Stata user but am trying to familiarize myself with the syntax and logic of R. I am wondering if you could help me with writing more efficient codes as shown below (The "The Not-so-efficient Codes")
The goal is to (A) read several files (each of which represents the data of a year), (B) create the same variables for each file, and (C) combine the files into a single one for statistical analysis. I have finished revising "part A", but am struggling with the rest, particularly part B. Could you give me some ideas as to how to proceed, e.g. use unlist to unlist data.l first, or lapply to each element of data.l? I appreciate your comments-thanks.
More Efficient Codes: Part A
# Create an empty list
data.l = list()
# Create a list of file names
fileList=list.files(path="C:/My Data", pattern=".dat")
# Read the ".dat" files into a single list
data.l = sapply(fileList, readLines)
The Not-so-efficient Codes: Part A, B and C
setwd("C:/My Data")
# Part A: Read the data. Each "dat" file is text file and each line in the file has 300 characters.
dx2004 <- readLines("2004.INJVERBT.dat")
dx2005 <- readLines("2005.INJVERBT.dat")
dx2006 <- readLines("2006.INJVERBT.dat")
# Part B-1: Create variables for each year of data
dt2004 <-data.frame(hhx = substr(dx2004,7,12),fmx = substr(dx2004,13,14),
iphow = substr(dx2004,19,318),stringsAsFactors = FALSE)
dt2005 <-data.frame(hhx = substr(dx2005,7,12),fmx = substr(dx2005,13,14),
iphow = substr(dx2005,19,318),stringsAsFactors = FALSE)
dt2006 <-data.frame(hhx = substr(dx2006,7,12),fmx = substr(dx2006,13,14),
iphow = substr(dx2006,19,318),stringsAsFactors = FALSE)
# Part B-2: Create the "iid" variable for each year of data
dt2004$iid<-paste0("2004",dt2004$hhx, dt2004$fmx, dt2004$fpx, dt2004$ipepno)
dt2005$iid<-paste0("2005",dt2005$hhx, dt2005$fmx, dt2005$fpx, dt2005$ipepno)
dt2006$iid<-paste0("2006",dt2006$hhx, dt2006$fmx, dt2006$fpx, dt2006$ipepno)
# Part C: Combine the three years of data into a single one
data = rbind(dt2004,dt2005, dt2006)
You are almost there. It's a combination of lapply and do.call/rbind to work with lapply's list output.
Consider this example:
test1 = "Thisistextinputnumber1"
test2 = "Thisistextinputnumber2"
test3 = "Thisistextinputnumber3"
data.l = list(test1, test2, test3)
makeDF <- function(inputText){
  DF <- data.frame(hhx = substr(inputText, 7, 12),
                   fmx = substr(inputText, 13, 14),
                   iphow = substr(inputText, 19, 318),
                   stringsAsFactors = FALSE)
  DF <- within(DF, iid <- paste(hhx, fmx, iphow))
  return(DF)
}
do.call(rbind, lapply(data.l, makeDF))
Here test1, test2, test3 represent your dx200X, and data.l should be the list format you get from the efficient version of Part A.
In makeDF you create your desired data.frame. The do.call(rbind, ) is somewhat standard if you work with lapply-return values.
You also might want to consider checking out the data.table-package which features the function rbindlist, replacing any do.call-rbind construction (and is much faster), next to other great utility for large data sets.
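The question's iid variable also starts with the year, which lives in the file name rather than in the lines themselves. One way to carry it along, sketched here under assumptions, is to keep the list named by file and pass both the lines and the name to the helper with Map(). The input lines below are made up, make_df() is a hypothetical helper, and only the hhx and fmx columns from the question are reproduced.

```r
# Stand-in for the list produced by sapply(fileList, readLines):
# one character vector of lines per file, named by file (made-up content).
raw <- list(
  "2004.INJVERBT.dat" = c("xxxxxx123456ab", "xxxxxx654321cd"),
  "2005.INJVERBT.dat" = c("xxxxxx111111ef")
)

# Hypothetical helper: build the variables for one file and prefix iid with the year.
make_df <- function(lines, file_name) {
  year <- substr(file_name, 1, 4)          # first four characters of the file name
  df <- data.frame(hhx = substr(lines, 7, 12),
                   fmx = substr(lines, 13, 14),
                   stringsAsFactors = FALSE)
  df$iid <- paste0(year, df$hhx, df$fmx)
  df
}

# Map() walks over the lines and the file names in parallel;
# do.call(rbind, ...) stacks the per-file data frames.
combined <- do.call(rbind, Map(make_df, raw, names(raw)))
```

This covers parts B and C in one pass, since the per-year data frames never need to exist as separate named objects.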

Error in drop && !has.j : invalid 'x' type in 'x && y’ when using sum(complete.cases) Windows7 R3.2.1

I am very new to programming, both in R and in general.
Here is my goal for writing this script:
I have 332 csv files. I want to, “Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.”
The outline of the function is as follows:
complete <- function(directory, id = 1:332) {
## 'directory' is a character vector of length 1 indicating
## the location of the CSV files
## 'id' is an integer vector indicating the monitor ID numbers
## to be used
## Return a data frame of the form:
## id nobs
## 1 117
## 2 1041
## ...
## where 'id' is the monitor ID number and 'nobs' is the
## number of complete cases
}
Example output would look like this:
source("complete.R")
complete("specdata", 1)
## id nobs
## 1 1 117
complete("specdata", c(2, 4, 8, 10, 12))
## id nobs
## 1 2 1041
## 2 4 474
## 3 8 192
## 4 10 148
## 5 12 96
My script so far looks like this:
setwd("C:/users/beachlb/Desktop/R_Programming/specdata") #this is the local directory on my computer where all 332 csv files are stored
complete <- function(directory, id = 1:332) {
  files_list <- list.files(directory, full.names=TRUE) #creates a list of files from within the specified directory
  dat <- data.frame() #creates an empty data frame that we can use to add data to
  for (i in id) {
    dat <- rbind(dat, read.csv(files_list[i])) #loops through the 332 csv files, rbinding them together into one data frame called dat
  }
  dat$nobs <- sum(complete.cases(dat)) #add the column nobs to dat, populated with number of rows of complete cases in the dataframe
  dat_subset <- dat[which(dat[, "ID"] %in% id),] #subsets dat so that only the desired cases are included in output when function is run
  dat_subset[, "ID", "nobs"] #prints all rows of the desired data frame for the named columns
}
When I run my function as is, I get this error: "Error in drop && !has.j : invalid 'x' type in 'x && y'". I am not sure what is throwing that error. I would appreciate any advice on what could be causing it and how I can resolve it. Pointers to literature and/or tutorials that would help me strengthen the coding skills needed to avoid this error would also be appreciated.
Preface: I am not sure if I should ask this question on a separate thread. Right now, my function is written to populate the total number of complete cases for all rows (for all 332 files), instead of specifically calculating the number of complete cases for a given monitor id and putting that into the column nobs for that ID only. (Note that each file is named after the monitor id and contains only cases from that monitor, such that 001.csv = output from monitor 1, 002.csv = output from monitor 2). Therefore, I am hoping for someone to help point me to a resource for how to subset dat so that when the nobs column populates, each row in the nobs column gives the number of complete cases for each id number.
complete <- function(directory, id = 1:332) {
files_list <- list.files(directory, full.names=TRUE)
nobs <- c()
for (i in id) {
dat <- read.csv(files_list[i])
nobs <- c(nobs, sum(complete.cases(dat)))
}
data.frame(id,nobs)
}
You were close. But you shouldn't read in all of the files at once and then find the complete cases. It will not separate the results by id for you. Instead I just edited your code a little bit.
Test
complete("specdata", c(2,4,8,10,12))
id nobs
1 2 1041
2 4 474
3 8 192
4 10 148
5 12 96
I have no idea what is throwing the error, but I would recommend avoiding the process that is leading up to it. Your situation would benefit greatly from vectorization. I don't think this code will work out of the box, but should be on the right path:
#* Get the file names of the CSV files to read
files <- list.files(getwd(), pattern = "\\d{3}[.]csv$")
#* Read in all of the CSV files into a list of data frames
DataFrames <- lapply(files, read.csv)
#* Calculate the number of complete cases in each file
CompleteCases <- vapply(DataFrames,
function(df) sum(complete.cases(df)),
numeric(1))
#* Produce a data frame with the file name, and the number of complete cases in the file.
data.frame(file = basename(files),
nobs = CompleteCases)
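To see the vectorized version run end to end, the sketch below builds two tiny CSV files in a temporary folder and counts the complete cases per file. The folder name and the sulfate and nitrate columns with their values are made up for illustration; only the three-digit file naming follows the question.

```r
# Build a small throwaway directory with two monitor files (made-up data).
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ID = 1, sulfate = c(1.0, NA, 3.0), nitrate = c(0.5, 0.2, NA)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(ID = 2, sulfate = c(2, 4), nitrate = c(0.1, 0.3)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

# Full paths of the files named like 001.csv, 002.csv, ...
files <- list.files(demo_dir, pattern = "\\d{3}[.]csv$", full.names = TRUE)

# Read each file and count the rows that have no missing values.
nobs <- vapply(files,
               function(f) sum(complete.cases(read.csv(f))),
               numeric(1))

# One row per file: file name and number of complete cases.
result <- data.frame(file = basename(files), nobs = unname(nobs))
```

Reading and counting one file at a time keeps the counts separate per monitor, which is what the assignment asks for.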
You are making a silly mistake in the last line
dat_subset[, "ID", "nobs"] # incorrect code and will generate the error
#Error in drop && length(x) == 1L : invalid 'x' type in 'x && y'
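Base R does not allow a comma-separated list of column names inside [ ]. For a data frame, the third position inside [ ] is the drop argument, so "nobs" is passed as drop, and the check drop && ... fails because a character string is not a logical value. You should combine the names into a character vector and pass it as one argument, as follows: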
dat_subset[, c("ID", "nobs")]
above is the correct way of subsetting on multiple columns.
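The difference is easy to check on a toy data frame (made-up values). tryCatch() is used here only so that the failing call does not stop the script; it stores the error message instead.

```r
df <- data.frame(ID = 1:3, nobs = c(10, 20, 30), other = letters[1:3])

# Two columns selected with a single character vector: returns a data frame.
ok <- df[, c("ID", "nobs")]

# Two separate strings: the second one lands in the 'drop' argument and fails.
failed <- tryCatch(df[, "ID", "nobs"], error = function(e) conditionMessage(e))
```

After running this, ok has the two requested columns and failed holds the text of the error rather than any data.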
