Get function or Passing character variable names in R - r

I have a data frame named MC_df and need a list of some of its variables to pass to my function. Building the list by hand with vari_names_MCdf <- list(MC_df$Age, MC_df$Year, MC_df$Education, MC_df$City, MC_df$Country) works, but because the data set is large, I wrote the names of all the variables I want to include into a CSV file. Now I am stuck trying to use the names read from that CSV file to access the variables. Here is the data, MC_df:
MC_df <- data.frame(ID=c(1:5),Age=c(18,25,30,22,19),
Year=c(2003,2010,2008,2010,2015),
Education = c("<12","13-15",">15","<12",">15"),
City = c("SD","MS","LA","CV","SD"),
Country=c("US","CA","CA","CA","US"),
Group=c(rep("group",5)))
Building the lists manually, by naming every variable I want, works:
vari_names_MCdf <- list(MC_df$Age, MC_df$Year, MC_df$Education, MC_df$City, MC_df$Country)
vari_group_MCdf <- list(MC_df$Group,MC_df$Group,MC_df$Group,MC_df$Group,MC_df$Group)
Now I write the variable names to a CSV file and read it back with read.csv. This helps in the future: I only need to edit the CSV file instead of listing the variables from MC_df by hand.
df <- data.frame(names=c("Age","Year","Education","City","Country"),
text=c("Age Continuos","Year Category","Eduaction Group","City Category","Country Group"))
write.csv(df,"vari.csv")
var <- read.csv("vari.csv", stringsAsFactors = FALSE)
var
X names text
1 1 Age Age Continuos
2 2 Year Year Category
3 3 Education Eduaction Group
4 4 City City Category
5 5 Country Country Group
I try
vari_names <- as.list(var$names)
vari_text <- as.list(var$text)
get("MC_df")$vari_names[[1]]
vari_names_new <- lapply(vari_names, function(i) paste0("MC_df$",i, sep=""))
get(vari_names_new[[1]])
None of these attempts work. How can I use the variable names from the CSV file to pull the matching columns out of the data set?
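One possible approach, sketched under the assumption that var$names holds column names exactly as they appear in MC_df, is to index the data frame by name with [[ ]] rather than pasting "MC_df$" into a string. get() looks up an object by its name, so it cannot evaluate an expression such as "MC_df$Age". The trimmed-down MC_df and the var_names vector below are illustrative stand-ins for the question's data and for var$names.

```r
# Trimmed-down copy of the question's data frame (illustrative only)
MC_df <- data.frame(
  ID = 1:5,
  Age = c(18, 25, 30, 22, 19),
  Year = c(2003, 2010, 2008, 2010, 2015),
  Group = rep("group", 5),
  stringsAsFactors = FALSE
)

# Stand-in for var$names as read from vari.csv
var_names <- c("Age", "Year")

# Look each column up by its name; [[ ]] accepts a character string
vari_names_MCdf <- lapply(var_names, function(nm) MC_df[[nm]])

# One copy of the Group column per selected variable
vari_group_MCdf <- rep(list(MC_df$Group), length(var_names))

# get() resolves the object name only; [[ ]] then selects the column
first_var <- get("MC_df")[[var_names[1]]]
```

If a named list is more convenient, as.list(MC_df[var_names]) should give the same columns with their names attached.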

Related

When I save and load RDA files the name changes?

I have a parameter called country_name that holds the name of the country I am interested in, and I sometimes change it when I run my code. I would like my RDA file to reflect that name once it is saved and loaded back into the environment.
Currently what happens is this:
Name the country:
country_name <- "Ireland"
Create a simple data frame:
x <- 10
vars_2_keep <- data.frame(x)
vars_2_keep
It contains x=10
x
1 10
I save it, renaming the data frame with the country name so that when I do this with a different country I will have country specific information:
save(vars_2_keep, file=paste("my_data_", country_name[1], ".rda", sep = ""))
I delete everything, and load it back in:
rm(list=ls())
load(file='my_data_Ireland.rda')
Unfortunately, in the environment instead of my data frame being called "my_data_Ireland", it is still called vars_2_keep.
How can I rename this data frame to my_data_ followed by country_name[1] (which in this example would be my_data_Ireland)?
Thank you
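A minimal sketch of one way around this: load() always restores objects under the names they were saved with, so the file name has no effect on the object name. Saving a single object with saveRDS() and binding the result of readRDS() with assign() lets the name be built from country_name. The obj_name and rds_path variables and the use of tempdir() are assumptions made for illustration.

```r
country_name <- "Ireland"
vars_2_keep <- data.frame(x = 10)

# Build the desired object name and a matching file path (temporary folder here)
obj_name <- paste0("my_data_", country_name[1])
rds_path <- file.path(tempdir(), paste0(obj_name, ".rds"))

# saveRDS() stores the object itself, without its name
saveRDS(vars_2_keep, rds_path)
rm(vars_2_keep)

# readRDS() returns the object; assign() binds it to the constructed name
assign(obj_name, readRDS(rds_path))
```

After this runs, the environment should contain a data frame called my_data_Ireland. Keeping the per-country data frames in a named list is often easier to work with than creating variables through assign().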

Extract all values automatically from multiple xml files into data frame

I'm new to R and trying to parse over 100k XML files into one CSV file. I used code from a previously asked question, and it works perfectly if I state each column name explicitly. My XML files have too many fields to write out by hand, so I want to add all the columns to the data frame without explicitly writing the column headings. I'm using exactly the same code, except that I have more rows listing column names rather than just zip code and amount.
require(XML)
require(plyr)
setwd("LOCATION_OF_XML_FILES")
xmlfiles <- list.files(pattern = "*.xml")
dat <- ldply(seq(xmlfiles), function(i){
doc <- xmlTreeParse(xmlfiles[i], useInternal = TRUE)
zipcode <- xmlValue(doc[["//ZipCode"]])
amount <- xmlValue(doc[["//AwardAmount"]])
return(data.frame(zip = zipcode, amount = amount))
})
write.csv(dat, "zipamount.csv", row.names=FALSE)
Hopefully the xmlToDataFrame() function will do what you want.
It assumes the XML document is a root node whose child nodes are a sequence of records, and that each record has simple elements. It then extracts them into a data.frame.
Consider a sample XML document
<doc>
<record><a>1</a><b>2</b></record>
<record><a>10</a><b>20</b><c>bob</c></record>
<record><a>20</a><b>30</b></record>
</doc>
xmlToDataFrame() returns
a b c
1 1 2 <NA>
2 10 20 bob
3 20 30 <NA>

NULL values, regression / correlation switch

I have a dataset with, let's say, 2 variables. I want to do some regression testing, but quite a few of the numeric observations are "NULL". I still want to use these observations, but I don't want to convert "NULL" to a specific number such as 99999.
I keep trying all the different ways after googling and it doesn't work.
Benny2 <- read_excel("C:/Users/EH9508/Desktop/Benny2.xlsx")
I have two variables, "Days" and "Amount"; both contain numeric values and "NULL".
Any help would be appreciated.
You can convert the file/sheet to csv from Excel (save as > csv) and then:
mydata <- read.csv("path/to/file.csv")
If you don't have access to Excel, you can read the file directly with the xlsx package (read.xlsx needs to be told which sheet to read, here the first):
library("xlsx")
mydata <- read.xlsx("path/to/file.xlsx", sheetIndex = 1)
If you put the csv/xlsx file in R's working directory, you can give the file name without the path, as in read.xlsx("file.xlsx", sheetIndex = 1).
If you already have your data in R and are wondering how to get the NULL converted to a given value, try this:
mydata <- matrix(rnorm(10),5,2) # Your data
mydata[2,1] <- NA # Some NA
mydata[5,2] <- NA
mydata[is.na(mydata)] <- 99999 # Replaces every NA in mydata with 99999
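If the goal is to keep the "NULL" entries as missing values rather than a sentinel number, one option is to map the string "NULL" to NA at import time. The sketch below uses read.csv with na.strings on made-up inline data (csv_text and benny are hypothetical names); read_excel has a similar na argument, so read_excel(path, na = "NULL") should behave comparably for the original spreadsheet. By default lm() drops rows that contain NA.

```r
# Made-up data standing in for the Days/Amount sheet
csv_text <- "Days,Amount
1,10
NULL,20
3,NULL
4,40
5,50"

# Treat the literal string "NULL" as a missing value so columns stay numeric
benny <- read.csv(text = csv_text, na.strings = "NULL")

# lm() uses na.action = na.omit by default, so incomplete rows are dropped
fit <- lm(Amount ~ Days, data = benny)

# Number of rows actually used in the fit
n_used <- nobs(fit)
```

For correlations, cor(benny$Days, benny$Amount, use = "complete.obs") applies the same complete-case rule.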

How to combine files and match them with their identifier from a separate file?

I have 500 txt files all under the same folder. Each text file represents a patient and has a list of genes (miRNA genes in this example) and their corresponding expression values. I am only interested in the reads_per_million_miRNA_mapped for each corresponding miRNA_ID. Below is an example of three:
File name: 0a4af8c8.mirnas.quantification.txt
miRNA_ID read_count reads_per_million_miRNA_mapped cross.mapped
1 hsa-let-7a-1 39039 5576.681 N
2 hsa-let-7a-2 38985 5568.967 Y
3 hsa-let-7a-3 38773 5538.684 N
File name: 0a867fd6.mirnas.quantification.txt
miRNA_ID read_count reads_per_million_miRNA_mapped cross.mapped
1 hsa-let-7a-1 36634 11413.6842 N
2 hsa-let-7a-2 36608 11405.5837 N
3 hsa-let-7a-3 36006 11218.0246 N
File name: 0ac65c4b.mirnas.quantification.txt
miRNA_ID read_count reads_per_million_miRNA_mapped cross.mapped
1 hsa-let-7a-1 68376 14254.3693 N
2 hsa-let-7a-2 67965 14168.6880 Y
3 hsa-let-7a-3 67881 14151.1765 N
While each file has a unique name, the name does not tell me the patient's ID, and there is nothing in the file which directly tells me the patient's ID. To determine the patient's ID, I use a separate master CSV file which lists every patient ID and the corresponding txt file name. This CSV file has far too many columns for me to post an example row, so I only list the two columns of interest below.
file_name patient_id
0a4af8c8.mirnas.quantification.txt TCGA-G9-6373-01A
0a867fd6.mirnas.quantification.txt TCGA-XJ-A9DX-01A
0ac65c4b.mirnas.quantification.txt TCGA-V1-A9OF-01A
My goal is to create a data frame of all combined txt files which has the gene expression data for all patients for all genes
miRNA_ID TCGA-G9-6373-01A TCGA-XJ-A9DX-01A TCGA-V1-A9OF-01A
hsa-let-7a-1 5576.681 11413.6842 14254.3693
hsa-let-7a-2 5568.967 11405.5837 14168.6880
hsa-let-7a-3 5538.684 11218.0246 14151.1765
I have figured out a way to do this by subsetting the file name and patient ID into a new data frame and then using a for loop to combine all the txt files and add on an additional column with the file name to get to each file. I then use the left_join function from the tidyverse package to combine the data frames.
While this works, it is not resource efficient, as I am creating extra data frames and columns which I do not need. I was wondering if anyone knows of a better approach that can do the same thing in one go. For example, a which call within the for loop could rename the Expression_value column to the patient ID by matching the file currently being read against the row for that file in the separate master CSV file. Thanks in advance.
Here is the link to the previous method I used.
How to create a data frame in R where I have to associate different txt files with a sample ID from a separate file?
Without your actual data it is very challenging to attempt to answer this, so hopefully this will be a useful design pattern. You will need two things:
1) An identifying pattern that you can construct based on the file name and merge with the master
2) All of the files in the working directory
Here is what I would recommend:
library(data.table)
library(magrittr)
library(stringr)
setwd("path/to/directory")
# Probably implement some kind of regex on the file name
# to extract the patient name
read_file <- function(file_name){
fread(file_name) %>%
.[,patient_name := str_replace_all(file_name,"regex_string","")]
}
all_files <- list.files(pattern = "file_pattern")
master <- fread("path/to/master")
combined_files <- lapply(all_files, read_file) %>%
rbindlist %>%
merge(master, by = "patient_name")
Essentially this sets the working directory to where your files are, implements a parser which grabs the patient name to match to the master, applies that parser to all the files, combines them to a single data frame with the identifying observation, and then merges them with the master. Hopefully it helps!
This should work. You'll need to customize the input_folder (or set your working directory there and delete the references to it in my code). I'm calling the data frame with the patient IDs and file names filekey.
library(data.table)
input_folder = "path/to/folder/"
cols_to_keep = c("miRNA_ID", "reads_per_million_miRNA_mapped")
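# filekey$file_name must not be quoted, or paste0() would build a literal string
files = lapply(paste0(input_folder, filekey$file_name), fread, select = cols_to_keep)
names(files) = filekey$patient_id
# idcol = TRUE stores each list name (the patient ID) in a column called .id
long = rbindlist(files, idcol = TRUE)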
result = dcast(long, miRNA_ID ~ .id, value.var = "reads_per_million_miRNA_mapped")
result
# miRNA_ID TCGA-G9-6373-01A TCGA-V1-A9OF-01A TCGA-XJ-A9DX-01A
# 1: hsa-let-7a-1 5576.681 14254.37 11413.68
# 2: hsa-let-7a-2 5568.967 14168.69 11405.58
# 3: hsa-let-7a-3 5538.684 14151.18 11218.02
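For readers without data.table, the same idea can be sketched in base R: read only the two columns of interest, rename the value column to the patient ID as each file is read, and merge on miRNA_ID. The temporary folder, the shortened file names, and the read_one helper are assumptions made so the example is self-contained; the values are taken from the question's sample rows.

```r
# Hypothetical stand-ins for the real files, written to a temporary folder
input_folder <- tempfile("mirna_")
dir.create(input_folder)
filekey <- data.frame(
  file_name  = c("a.txt", "b.txt"),
  patient_id = c("TCGA-G9-6373-01A", "TCGA-XJ-A9DX-01A"),
  stringsAsFactors = FALSE
)
values <- list(c(5576.681, 5568.967), c(11413.6842, 11405.5837))
for (i in seq_len(nrow(filekey))) {
  write.table(
    data.frame(miRNA_ID = c("hsa-let-7a-1", "hsa-let-7a-2"),
               read_count = c(1L, 2L),
               reads_per_million_miRNA_mapped = values[[i]]),
    file.path(input_folder, filekey$file_name[i]),
    sep = "\t", row.names = FALSE, quote = FALSE
  )
}

# Read one file, keep the two columns of interest, and name the value
# column after the patient the file belongs to
read_one <- function(file_name, patient_id) {
  d <- read.delim(file.path(input_folder, file_name), stringsAsFactors = FALSE)
  d <- d[, c("miRNA_ID", "reads_per_million_miRNA_mapped")]
  names(d)[2] <- patient_id
  d
}

# One small data frame per patient, then successive merges on miRNA_ID
per_patient <- Map(read_one, filekey$file_name, filekey$patient_id)
result <- Reduce(function(a, b) merge(a, b, by = "miRNA_ID"), per_patient)
```

With 500 files the repeated merge is likely slower than the rbindlist and dcast route above, so this is mainly useful when adding a package dependency is not an option.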

Import multiple .txt files into R and skip to actual data rows

I have 537 .txt files which I need to import into either a list or separate data frames in R. I do not want to append any data as it is crucial to keep everything separate.
I've renamed each file, so the file names are all uniform. In each file, there is a header section with a lot of miscellaneous information. This header section is 12-16 rows depending on the file. For the data, I have between 5 and 7 columns. The data is all tab delimited. The number of columns varies between 5 and 9 columns, and the columns are not always in the same order, so it is important that I can import the column names with the data (column names are uniform across files). The format of the file is as follows:
Header
Header
Header
Header...up to 16 rows
((number of spaces between header and column names varies))
Date(\t)Time(\t)dataCol1(\t)dataCol2(\t)dataCol3(\t)dataCol4
((no empty row between column names and units))
mm/dd/yyyy(\t)hh:mm:ss(\t)units(\t)units(\t)units(\t)units
((1 empty row between units and data))
01/31/2016(\t)14:32:02(\t)14.9(\t)25.3(\t)15.8(\t)25.6
((data repeats for up to 4000 rows))
To recap what I need:
Import all of the files into individual data frames or a list of data frames.
Skip past the header information to the row with "Date" (and possibly delete the two rows following with units and the empty row) leaving me with a row of column names and the data following.
Here's a crude copy of what I have been working on for code. The idea is, after importing all of the files into R, determine the max value for 1-2 columns in each file. Then, export a single file which will have 1 row for each file with 2 columns containing the 2 max values from each file.
##list files and create list for data.frames
path <- list.files("Path",pattern = NULL, all.files=FALSE,full.names=TRUE)
files <- list()
##Null list for final data to be extracted to
results <- NULL
##add names to results list (using file name - extension
results$name <- substr(basename(path),1,nchar(basename(path))-4)
##loop to read in data files and calculate max
for(i in 1:length(path)){
##read files
files[[i]] <- read.delim(path[[i]],header = FALSE, sep = "\t", skip = 18)
##will have to add code:
##"if columnx exists do this; if columny exists do this"
##convert 2 columns for calculation to numeric
x.x <- as.numeric(as.character(files$columnx))
x.y <- as.numeric(as.character(files$columny))
##will have to add code:
##"if column x exists, do this....if not, "NA"
##get max value for 2 specific columns
results$max.x <- max(files$columnx)
results$max.y <- max(files$columny)
}
##add results to data frame
max <- data.frame(results)
##export to .csv
write.csv(max,file="PATH")
I know right now, my code just skips past everything into the data ( max doesn't come until much later in file, so skipping 1 or 2 lines won't hurt me), and it assumes the columns are in the same order in each file. This is horrible practice and gives me some bad results on about 5% of my data points, but I want to do this correctly. My main concern is to get the data into R in a usable format. Then, I can add the other calculations and conversions. I am new to R, and after 2 days of searching, I have not found the help I need already posted to any forum.
Assuming that the structure of the header follows a Line \n Line \n Data pattern, we can use grep to find the number of the line where "mm/dd/yyyy" appears.
As such:
system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T)
# ran.txt is an arbitrary text file I created, we will substitute
# 'ran.txt' with path[[i]] later on.
#[1] "6:mm/dd/yyyy\thh:mm:ss\tunits\tunits\tunits\tunits"
From this we can then strsplit the output into the number before the : and use that argument as the necessary value for skip.
as.numeric(strsplit(system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T),":")[[1]][1])
# [[1]][1] will specify the first element of the output of strsplit as
# in the output the hh:mm:ss also is split.
# [1] 6
As there is an empty row between our called row and the actual data we can add 1 to this and then begin reading the data.
Thusly:
##list files and create list for data.frames
path <- list.files("Path",pattern = NULL, all.files=FALSE,full.names=TRUE)
files <- list()
##Null list for final data to be extracted to
results <- NULL
##add names to results list (using file name - extension
results$name <- substr(basename(path),1,nchar(basename(path))-4)
##loop to read in data files and calculate max
for(i in 1:length(path)){
##read files
# Calculate the number of rows to skip.
# Using Dave2e's suggestion:
header <- readLines(path[[i]], n=20)
skip <- grep("^mm/dd/yy", header)
#Add one due to missing line
skip <- skip + 1
files[[i]] <- read.delim(path[[i]],
header = FALSE,
sep = "\t",
skip = skip)
##will have to add code:
##"if columnx exists do this; if columny exists do this"
##convert 2 columns for calculation to numeric
x.x <- as.numeric(as.character(files[[i]]$columnx))
x.y <- as.numeric(as.character(files[[i]]$columny))
##will have to add code:
##"if column x exists, do this....if not, "NA"
##get max value for 2 specific columns (one entry per file)
results$max.x[i] <- max(x.x, na.rm = TRUE)
results$max.y[i] <- max(x.y, na.rm = TRUE)
}
##add results to data frame
max <- data.frame(results)
##export to .csv
write.csv(max,file="PATH")
I think that about covers everything.
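Since the question also asks to keep the column names, here is a hedged sketch of a variation on the same readLines and grep idea: locate the row that starts with "Date", take the names from it, and skip past the units row. The function name read_logger_file, the max_header_lines parameter, and the sample file are hypothetical; the layout follows the format described in the question.

```r
read_logger_file <- function(file_path, max_header_lines = 30) {
  # Only the top of the file is needed to find the column-name row
  top <- readLines(file_path, n = max_header_lines)
  header_row <- grep("^Date\t", top)[1]
  col_names <- strsplit(top[header_row], "\t", fixed = TRUE)[[1]]
  # Skip everything up to and including the units row; the empty row
  # before the data is dropped because blank.lines.skip is TRUE
  read.delim(file_path, header = FALSE, skip = header_row + 1,
             col.names = col_names, blank.lines.skip = TRUE,
             stringsAsFactors = FALSE)
}

# Small made-up file in the layout described in the question
sample_path <- tempfile(fileext = ".txt")
writeLines(c("Header A", "Header B", "",
             "Date\tTime\tdataCol1\tdataCol2",
             "mm/dd/yyyy\thh:mm:ss\tunits\tunits",
             "",
             "01/31/2016\t14:32:02\t14.9\t25.3",
             "01/31/2016\t14:33:02\t15.1\t24.8"),
           sample_path)

d <- read_logger_file(sample_path)
max_col1 <- max(d$dataCol1)
```

Because the columns arrive with their names, a check such as "dataCol1" %in% names(d) can replace assumptions about column order, and lapply(path, read_logger_file) would give one data frame per file.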
Thought I would add this here in case it helps someone else with a similar issue. @TJGorrie's solution helped solve my slightly different challenge. I have several .rad files that I need to read in, tag, and merge. The .rad files have headers that start at random rows so I needed a way to find the row with the header. I didn't need to do any additional calculations except create a tag column. Hope this helps someone in the future but thanks @TJGorrie for the awesome answer!
##list files and create list for data.frames
path <- list.files(pattern="*.rad")
files <- list()
##loop to read in data files
for(i in 1:length(path)){
# Using Dave2e's suggestion:
header <-readLines(path[[i]], n=20)
skip <- grep("Sample", header)
#Subtract one row to keep the row with "Sample" in it as the header
skip <- skip - 1
files[[i]] <- read.table(path[[i]],
header = TRUE,
fill = TRUE,
skip = skip,
stringsAsFactors = FALSE)
# Name the newly created file objects the same name as the original file.
names(files)[i] = sub("\\.rad$", "", path[i])
files[[i]] = na.omit(as.data.frame(files[[i]]))
# Create new column that includes the file name to act as a tag
# when the dfs get merged through rbind
files[[i]]$Tag = names(files)[i]
}
# bind all the dfs in the list into a single df (once, after the loop has finished)
df = do.call("rbind",
c(files, make.row.names = FALSE))
##export to .csv
write.csv(df,file="PATH.csv", row.names = FALSE)
