Using a loop to create a modified data frame - r

I have been struggling to find a way to create a new data frame using a loop, where the main goal is to keep only the values that are >= 0.5.
I'm using RStudio; however, Python is an option too.
Here is what my data frame (a CSV file) looks like, along with some lines of the (incomplete) script:
df <- read.table(choose.files(), header = T, sep = ",", comment.char = "")
Site,Partition,alpha,beta,omega,alpha=beta,LRT,p-value,Total branch length
1,1,"0.000","0.000","NaN","0.000","0.000","1.000","0.000"
2,1,"0.060","0.046","0.774","0.048","0.049","0.825","0.000"
Then I use the subset function to keep only the two columns that interest me:
sdf <- subset(df, select = c("ï..Site", "alpha.beta"))
ï..Site alpha.beta
1 1 0.000
2 2 0.048
...
Then I thought of using a loop to create a new CSV file: when the second column has a value >= 0.5, print this value; if the value does not satisfy this condition, print a 0 instead.
I tried different ways; obviously none of them worked for me. Here are the last lines that I tried.
for (i in names(sdf1)) {
f_sdf1 <- sdf1[sdf1[, i] >= 0.5]
write.csv(f_sdf1, paste0(i, ".csv"))
}
So in this post I'm looking for some ideas for writing this script. Maybe it's simple, but in this case I need to ask how.

You could use subset to filter your data, as in:
# first get some example data
expl <- data.frame(site = 1:10, alpha.beta = runif(10))
print(expl)
# now do the filtering
expl.filtered = subset(expl, alpha.beta >= .5)
print(expl.filtered)
# Now write.table or write.csv...
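The question also asks for a 0 wherever the value is below 0.5, and for the result to end up in a CSV file. Because the comparison is vectorized, no loop is needed for either variant. A minimal sketch, assuming made-up data and a temporary output file (the names kept, zeroed and out_file are illustrative, not from the question):

```r
# Illustrative data: column names mirror the question, values are made up
expl <- data.frame(site = 1:5, alpha.beta = c(0.00, 0.048, 0.70, 0.50, 0.93))

# Variant 1: keep only the rows at or above the threshold
kept <- subset(expl, alpha.beta >= 0.5)

# Variant 2: keep every row, but replace values below the threshold with 0
zeroed <- expl
zeroed$alpha.beta <- ifelse(zeroed$alpha.beta >= 0.5, zeroed$alpha.beta, 0)

# Write the result without the extra row-name column
out_file <- file.path(tempdir(), "alpha_beta_filtered.csv")
write.csv(zeroed, out_file, row.names = FALSE)
```

Which variant fits depends on whether the sites below the threshold should stay in the output at all.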

Related

Parsing colnames text string as expression in R

I am trying to create a large number of data frames in a for loop using the "assign" function in R. I want to use the colnames function to set the column names in the data frame. The code I am trying to emulate is the following:
county_tmax_min_df <- data.frame(array(NA,c(length(days),67)))
colnames(county_tmax_min_df) <- c('Date',sd_counties$NAME)
county_tmax_min_df$Date <- days
The code I have so far in the loop looks like this:
file_vars = c('file1','file2')
days <- seq(as.Date("1979-01-01"), as.Date("1979-01-02"), "days")
f = 1
for (f in 1:2){
assign(paste0('county_',file_vars[f]),data.frame(array(NA,c(length(days),67))))
}
I need to be able to set the column names similar to how I did in the above statement. How do I do this? I think it needs to be something like this, but I am unsure what goes in the text portion. The end result I need is just a bunch of data frames. Any help would be wonderful. Thank you.
expression(parse(text = ))
You can set the names within assign, like this:
file_vars = c('file1', 'file2')
days <- seq.Date(from = as.Date("1979-01-01"), to = as.Date("1979-01-02"), by = "days")
for (f in seq_along(file_vars)) {
assign(x = paste0('county_', file_vars[f]),
value = {
df <- data.frame(array(NA, c(length(days), 67)))
colnames(df) <- paste0("fancy_column_",
sample(LETTERS, size = ncol(df), replace = TRUE))
df
})
}
Inside the {} you can use colnames(df) or setNames to assign column names in any manner desired. Your first piece of code refers to the sd_counties object, which is not available here, but the general idea should work for you.
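If the data frames do not have to live as separate variables, a named list is often easier to work with than assign. The following is only a sketch of that alternative: county_names is a made-up placeholder for sd_counties$NAME, and make_frame and county_frames are illustrative names.

```r
file_vars <- c("file1", "file2")
days <- seq(as.Date("1979-01-01"), as.Date("1979-01-02"), by = "day")

# Placeholder for sd_counties$NAME, which is not available here (assumption)
county_names <- paste0("county_", 1:66)

# Build one empty frame: a Date column plus one column per county
make_frame <- function() {
  df <- data.frame(array(NA, c(length(days), 1 + length(county_names))))
  colnames(df) <- c("Date", county_names)
  df$Date <- days
  df
}

# One frame per file, stored in a list under the name assign() would have used
county_frames <- setNames(lapply(file_vars, function(f) make_frame()),
                          paste0("county_", file_vars))
```

Each frame is then available as county_frames$county_file1, county_frames[["county_file2"]], and so on, and the whole set can be processed with lapply.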

Print row from table in R

I'm new to R, and I'm trying to display several columns from a row in the console when a condition is fulfilled. I've searched the internet and couldn't find a proper solution. So far, I've tried an R where clause with little success.
Here's my script.
#Coordinates
northing <- 398380.16
easting <- 6873865.89
filePath = '/media/jgm/Toshiba HDD/SatelliteData/data/'
file = 'MOD09GQ_2006075.csv'
mydata <- read.table(paste(filePath,file, sep = ""),header=TRUE,sep=",")
mydata$'(x-northing)²' <- (mydata$x-northing)**2
mydata$'(y-easting)²' <- (mydata$y-easting)**2
mydata$'DISTANCE' <- sqrt(mydata$`(x-northing)²`+mydata$`(y-easting)²`)
minDistance <- min(mydata[,10], na.rm = T)
I want to display in the console the values of the columns sur_refl_b01, sur_refl_b02, NDVI and NDVI_SCALED for the row where the value of the column DISTANCE is minDistance.
Hope this table output helps.
Welcome to SO. Try something like:
print(mydata[which(mydata$'DISTANCE'==minDistance),4:7])
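The positions 4:7 above are a guess at where the four columns sit. Selecting them by name avoids depending on column order, and which.min finds the row directly. A small sketch with made-up values standing in for the CSV (only the column names come from the question):

```r
# Made-up stand-in for the CSV in the question
mydata <- data.frame(
  x = c(398000, 398380, 399000),
  y = c(6873000, 6873866, 6874000),
  sur_refl_b01 = c(0.11, 0.12, 0.13),
  sur_refl_b02 = c(0.31, 0.32, 0.33),
  NDVI = c(0.47, 0.45, 0.43),
  NDVI_SCALED = c(147, 145, 143)
)
northing <- 398380.16
easting <- 6873865.89

# Euclidean distance of every row to the reference point
mydata$DISTANCE <- sqrt((mydata$x - northing)^2 + (mydata$y - easting)^2)

# which.min() skips NA values and returns the index of the first minimum
wanted <- c("sur_refl_b01", "sur_refl_b02", "NDVI", "NDVI_SCALED")
nearest <- mydata[which.min(mydata$DISTANCE), wanted]
print(nearest)
```

Note that which.min returns a single row even when several rows tie for the minimum, whereas which(mydata$DISTANCE == minDistance) returns all of them.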

How to import large dataset in r splitting and filtering by 3 different criteria when found

I'm dealing with a couple of txt files of climatological data in which three parameters differentiate each chunk of data (the parameter measured, the measurement station, and the year). Each file has more than a million lines. In the past I manually selected each parameter one at a time, for a given station and year, and read it into R using read.fwf; but with files of this size that is absurd and inefficient. Is there any way to automate this process, given that the file has an "FF" indicator every time a new parameter for a station and a given year starts, and that I want to generate separate files or datasets named according to the station, year, and parameter so that I can use them afterwards?
File to read Format
Circled in red is the FF, I guess intended to mark the start of a new set of records.
Circled in Black is the name of the parameter measured (there are in total 8 different parameter classes)
Circled in blue is the year of measurement.
Circled in green is the number or identifier of the station of measurement.
In the past, I read just what I needed with read.fwf, given the fixed width of the data, but that separation does not apply to the header of each table.
PRUEBA3 <- read.fwf("SanIgnacio_Pmax24h.txt", header = FALSE, widths = c(5,4,4,6,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,10,2),skip=1)
Thanks; any help will be appreciated.
You will need to make a function that loops through the txt files. (The output that you linked to was produced by a database; I assume you don't have access to it).
Here is how the function could look, using the fast fread from data.table and a foreach loop (you can make the loop parallel by registering a parallel backend and changing %do% into %dopar%):
library(data.table)
library(foreach)
myfiles = dir(pattern = "\\.txt$")
res = foreach(i = seq_along(myfiles)) %do% {
x = fread(myfiles[i], na.strings = c("", " "))
# get row indices for start and end dates
# the "V" variables are column indices, I assume these don't change per file
start.dia = x[, grep("DIA", V2)] + 2
end.dia = x[, grep("MEDIA", V2)] - 2
# get name of station
estacion.detect = x[, grep("ESTACION", V9)]
estacion.name = x[estacion.detect, V10]
# keep only the data rows, then add the station name as a new column
mydf = x[start.dia:end.dia][, estacion := estacion.name]
# remove empty rows and columns
junkcol = which(colSums(is.na(mydf)) == nrow(mydf))
junkrow = which(rowSums(is.na(mydf)) == ncol(mydf))
if (length(junkcol) > 0) {
mydf = mydf[, !junkcol, with = FALSE]
}
if (length(junkrow) > 0) {
mydf = mydf[!junkrow, ]
}
# further data cleaning
# return the cleaned table as the result of this iteration
mydf
}
# bind all files
all = rbindlist(res)
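The "FF" marker mentioned in the question can also drive the split directly: read the file as plain lines, start a new chunk at every marker line, and name each chunk from its header. The sketch below only illustrates the technique. The sample lines and the header layout (the PARAMETRO, ANO and ESTACION labels followed by a colon) are assumptions made for the example, since the real format is only visible in the screenshot, and field is a hypothetical helper.

```r
# Hypothetical lines standing in for readLines("some_file.txt");
# the real header layout will differ and the patterns must be adapted
raw <- c(
  "FF PARAMETRO: PMAX24H  ANO: 1990  ESTACION: 1234",
  "  1  10.5  0.0",
  "  2   3.2  1.1",
  "FF PARAMETRO: TMAX  ANO: 1991  ESTACION: 5678",
  "  1  25.0 26.1"
)

# A new chunk starts at every line that begins with the "FF" marker
chunk_id <- cumsum(grepl("^FF", raw))
chunks <- split(raw, chunk_id)

# Hypothetical helper: pull the value that follows a label in the header line
field <- function(header, label) {
  sub(paste0(".*", label, ":\\s*(\\S+).*"), "\\1", header)
}

# Name each chunk as station_year_parameter, taken from its first line
headers <- vapply(chunks, `[`, character(1), 1)
names(chunks) <- paste(field(headers, "ESTACION"),
                       field(headers, "ANO"),
                       field(headers, "PARAMETRO"), sep = "_")

# Drop the header line so each element holds only the data rows
chunks <- lapply(chunks, `[`, -1)
```

Each element of chunks could then be parsed with read.fwf on a textConnection, using the widths from the question, or written to its own file with writeLines.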

R, creating variables on the fly in a list using assign statement

I want to create variable names on the fly inside a list and assign them values in R, but I am unable to get the desired result. Here is the logic of my code:
Upon the function call dat_in <- readf(1,2), an input file is read based on a product and a site. After reading, a particular column (the 13th here) is assigned to a variable aot500. I want this variable to be returned from the function for each combination of product and site. For example, I need the variable names in the list statement to be aot500.AF, aot500.CM, aot500.RB when returned from this function. I am having trouble with the return statement. There is no error, but there is nothing in dat_in; I expect it to have dat_in$aot500.AF etc. Please tell me what is wrong in the return statement. Furthermore, I want to read the files for all combinations in a single call to the function, say using a for loop, and I wonder how the return statement would handle a list of more variables.
prod <- c('inv','tot')
site <- c('AF','CM','RB')
readf <- function(pp, kk) {
fname.dsa <- paste("../data/site_data_",prod[pp],"/daily_",site[kk],".dat",sep="")
inp.aod <- read.csv(fname.dsa,skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
aot500 <- inp.aod[,13]
return(list(assign(paste("aot500",siteabbr[kk],sep="."),aot500)))
}
There is almost never a need to use assign(). We can solve the problem in two steps: read the files into a list, then give the elements names.
(Not tested, as we don't have your files.)
prod <- c('inv', 'tot')
site <- c('AF', 'CM', 'RB')
# get combo of site and prod
prod_site <- expand.grid(prod, site)
colnames(prod_site) <- c("prod", "site")
# Step 1: read the files into a list
res <- lapply(1:nrow(prod_site), function(i){
fname.dsa <- paste0("../data/site_data_",
prod_site[i, "prod"],
"/daily_",
prod_site[i, "site"],
".dat")
inp.aod <- read.csv(fname.dsa,
skip = 4,
stringsAsFactors = FALSE,
na.strings = "N/A")
inp.aod[, 13]
})
# Step 2: assign names to a list
names(res) <- paste("aot500", prod_site$prod, prod_site$site, sep = ".")
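As for why the original return statement leaves dat_in without a usable name: assign() creates a variable as a side effect and returns only the value, so list(assign(...)) is a list with one unnamed element. A short illustration with made-up values (nm, unnamed and named are illustrative names):

```r
aot500 <- c(0.12, 0.34, 0.56)            # made-up values
nm <- paste("aot500", "AF", sep = ".")   # the name built at run time

# assign() returns its value invisibly, so the list element gets no name
unnamed <- list(assign(nm, aot500))
names(unnamed)

# setNames() attaches the computed name to the list element itself
named <- setNames(list(aot500), nm)
named$aot500.AF
```

Inside readf, returning setNames(list(aot500), ...) with the computed name would therefore give the dat_in$aot500.AF access the question expects.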
I propose two answers, one based on dplyr and one based on base R.
You'll probably have to adapt the filename in the readAOT_500 function to your particular case.
Base R answer
#' Function that reads AOT_500 from the given product and site file
#' @param prodsite character vector containing 2 elements
#' name of a product and name of a site
readAOT_500 <- function(prodsite,
selectedcolumn = c("AOT_500"),
path = tempdir()){
cat(path, prodsite)
filename <- paste0(path, prodsite[1],
prodsite[2], ".csv")
dtf <- read.csv(filename, stringsAsFactors = FALSE)
dtf <- dtf[selectedcolumn]
dtf$prod <- prodsite[1]
dtf$site <- prodsite[2]
return(dtf)
}
# Load one file for example
readAOT_500(c("inv", "AF"))
listofsites <- list(c("inv","AF"),
c("tot","AF"),
c("inv", "CM"),
c( "tot", "CM"),
c("inv", "RB"),
c("tot", "RB"))
# Load all files in a list of data frames
prodsitedata <- lapply(listofsites, readAOT_500)
# Combine all data frames together
prodsitedata <- Reduce(rbind,prodsitedata)
dplyr answer
I use Hadley Wickham's packages to clean data.
library(dplyr)
library(tidyr)
daily_CM <- read.csv("~/downloads/daily_CM.dat",skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
# Generate all combinations of product and site.
prodsite <- expand.grid(prod = c('inv','tot'),
site = c('AF','CM','RB')) %>%
# Group variables to use do() later on
group_by(prod, site)
Create 6 fake files by sampling from the data you provided
You can skip this section when you have real data.
I used various sample length so that the number of observations
differs for each site.
prodsite$samplelength <- sample(1:495,nrow(prodsite))
prodsite %>%
do(stuff = write.csv(sample_n(daily_CM,.$samplelength),
paste0(tempdir(),.$prod,.$site,".csv")))
Read many files using dplyr::do()
prodsitedata <- prodsite %>%
do(read.csv(paste0(tempdir(),.$prod,.$site,".csv"),
stringsAsFactors = FALSE))
# Select only the columns you are interested in
prodsitedata2 <- prodsitedata %>%
select(prod, site, AOT_500)

How to efficiently create the same variables for each element of a list?

I am a long-time Stata user but am trying to familiarize myself with the syntax and logic of R. I am wondering if you could help me write more efficient code than what is shown below (see "The Not-so-efficient Codes").
The goal is to (A) read several files (each of which represents one year of data), (B) create the same variables for each file, and (C) combine the files into a single one for statistical analysis. I have finished revising part A, but am struggling with the rest, particularly part B. Could you give me some ideas as to how to proceed, e.g. use unlist to unlist data.l first, or lapply over each element of data.l? I appreciate your comments. Thanks.
More Efficient Codes: Part A
# Create an empty list
data.l = list()
# Create a list of file names
fileList=list.files(path="C:/My Data", pattern="\\.dat$")
# Read the ".dat" files into a single list
data.l = sapply(fileList, readLines)
The Not-so-efficient Codes: Part A, B and C
setwd("C:/My Data")
# Part A: Read the data. Each "dat" file is text file and each line in the file has 300 characters.
dx2004 <- readLines("2004.INJVERBT.dat")
dx2005 <- readLines("2005.INJVERBT.dat")
dx2006 <- readLines("2006.INJVERBT.dat")
# Part B-1: Create variables for each year of data
dt2004 <-data.frame(hhx = substr(dx2004,7,12),fmx = substr(dx2004,13,14),
iphow = substr(dx2004,19,318),stringsAsFactors = FALSE)
dt2005 <-data.frame(hhx = substr(dx2005,7,12),fmx = substr(dx2005,13,14),
iphow = substr(dx2005,19,318),stringsAsFactors = FALSE)
dt2006 <-data.frame(hhx = substr(dx2006,7,12),fmx = substr(dx2006,13,14),
iphow = substr(dx2006,19,318),stringsAsFactors = FALSE)
# Part B-2: Create the "iid" variable for each year of data
dt2004$iid<-paste0("2004",dt2004$hhx, dt2004$fmx, dt2004$fpx, dt2004$ipepno)
dt2005$iid<-paste0("2005",dt2005$hhx, dt2005$fmx, dt2005$fpx, dt2005$ipepno)
dt2006$iid<-paste0("2006",dt2006$hhx, dt2006$fmx, dt2006$fpx, dt2006$ipepno)
# Part C: Combine the three years of data into a single one
data = rbind(dt2004,dt2005, dt2006)
You are almost there. It's a combination of lapply and do.call/rbind to work with lapply's list output.
Consider this example:
test1 = "Thisistextinputnumber1"
test2 = "Thisistextinputnumber2"
test3 = "Thisistextinputnumber3"
data.l = list(test1, test2, test3)
makeDF <- function(inputText){
DF <- data.frame(hhx = substr(inputText, 7, 12), fmx = substr(inputText, 13, 14), iphow = substr(inputText, 19, 318), stringsAsFactors = FALSE)
DF <- within(DF, iid <- paste(hhx, fmx, iphow))
return(DF)
}
do.call(rbind, (lapply(data.l, makeDF)))
Here test1, test2, test3 represent your dx200X, and data.l should be the list you get from the efficient version of Part A.
In makeDF you create your desired data.frame. The do.call(rbind, ...) idiom is fairly standard when working with lapply return values.
You might also want to check out the data.table package, which features the function rbindlist. It replaces any do.call/rbind construction (and is much faster) and comes with other useful tools for large data sets.
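The iid in the question also needs the year, which the example above does not carry along. One way to pass it in, sketched here with made-up lines and a list named by year (in the real case the year would have to be taken from the file names, for example with substr on names such as 2004.INJVERBT.dat), is Map, which walks over the list and its names in parallel:

```r
# Made-up fixed-width lines, keyed by year (the real lines are much longer)
data.l <- list(
  "2004" = c("xxxxxxAAAAAA01", "xxxxxxBBBBBB02"),
  "2005" = c("xxxxxxCCCCCC01")
)

# Same idea as makeDF above, with the year as a second argument
makeDF <- function(lines, year) {
  DF <- data.frame(hhx = substr(lines, 7, 12),
                   fmx = substr(lines, 13, 14),
                   stringsAsFactors = FALSE)
  DF$iid <- paste0(year, DF$hhx, DF$fmx)
  DF
}

# Map() pairs each element with its name; do.call(rbind, ...) stacks the results
combined <- do.call(rbind, Map(makeDF, data.l, names(data.l)))
rownames(combined) <- NULL
```

Only hhx and fmx are built here; further columns can be added to makeDF with the same substr pattern.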

Resources