R, define a function then apply to a list

I am trying to write a function (I am new to R, and most of what I know about R I learned from this website, thanks),
and I want to apply my function to a list. The list contains some ".CSV" files.
All CSV files in my folder have the same structure but different numbers of columns.
I want to:
based on the "Frame" column, delete all the rows that contain "T",
which leaves "110*n1" rows of data.
delete all the columns whose names contain "Flag"; they are blank columns.
delete the 1st column, which leaves "2*n2" columns.
reshape the multi-column data into 2 columns, which gives "110*n3" rows of data.
repeat "1,2,3,4,...,110" as series numbers n times (n = n3) and bind them as a column.
from "1,2,3,...,n3", repeat each value 110 times and add the result as a column.
export the new table as a txt file.
Here is what I've done so far:
T_function <- function(x) {
  data.df <- read.csv(x, skip = 1, header = TRUE, na.strings = c("NA", "NaN", " ", "*"),
                      dec = ".", strip.white = TRUE)
  filename <- substr(x = x, start = 1, stop = (nchar(x) - 4))
  data.df[!grepl("T", data.df$Frame), ]
  data.df <- data.df[, -1]
  data.df <- data.df[, colSums(is.na(data.df)) < nrow(data.df)]
  splitter <- function(indf, ncols) {
    if (ncol(indf) %% ncols != 0) stop("Not the right number of columns to split")
    inds <- split(sequence(ncol(indf)), c(0, sequence(ncol(indf) - 1) %/% ncols))
    temp <- unlist(lapply(inds, function(x) c(t(indf[x]))), use.names = FALSE)
    as.data.frame(matrix(temp, ncol = ncols, byrow = TRUE))
  }
  out <- splitter(data.df, 2)
  list <- 1:110
  from <- which(out$V1 == 1)
  to <- c((from - 1)[-1], nrow(out))
  end <- c(to / 110)
  list2 <- rep(list, length(to / 110))
  out$Number <- unlist(list2)
  out$Number <- as.factor(out$Number)
  list3 <- rep(1:end, each = 110)
  out$slice <- unlist(list3)
  out$slice <- as.factor(out$slice)
  write.table(x = data.df,
              file = paste0(filename, "_analysis.txt"),
              sep = ",", quote = F)
}
It seems the function cannot add the correct "out$Number" and "out$slice" columns.
filenames <- list.files(path = "",pattern="csv",full.names = T)
sapply(filenames, FUN = T_function)
I am trying to apply my function to all the files in the list, but apart from the 1st file I can't get the other files to work.
Could anybody help me find and solve the problems?
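One possible way to restructure the steps above, offered as a minimal sketch rather than a definitive fix. It works on a data frame that has already been read in, assigns the result of the row filter (in the posted function that result is discarded), and builds the two index columns with rep(). The name reshape_pairs and the toy data frame are hypothetical; the column layout (a first "Frame" column, blank "Flag" columns, value columns that come in pairs) is assumed from the description.

```r
# Hypothetical helper: filter, drop columns, stack column pairs, add index columns
reshape_pairs <- function(df) {
  df <- df[!grepl("T", df$Frame), , drop = FALSE]       # keep rows without "T"
  df <- df[, !grepl("Flag", names(df)), drop = FALSE]   # drop the "Flag" columns
  df <- df[, -1, drop = FALSE]                          # drop the 1st column
  stopifnot(ncol(df) %% 2 == 0)                         # columns must come in pairs
  n_pairs <- ncol(df) / 2
  n_rows <- nrow(df)
  # Stack each pair of columns underneath the previous pair
  pieces <- lapply(seq_len(n_pairs), function(k) {
    pair <- df[, c(2 * k - 1, 2 * k)]
    names(pair) <- c("V1", "V2")
    pair
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out$Number <- rep(seq_len(n_rows), times = n_pairs)   # 1..n_rows, repeated per pair
  out$slice <- rep(seq_len(n_pairs), each = n_rows)     # pair index, repeated per row
  out
}

# Toy data (assumed layout): one "T" row, two value pairs, two blank Flag columns
toy <- data.frame(Frame = c("T1", "1", "2", "3"),
                  A = c(9, 1, 2, 3), B = c(9, 4, 5, 6), Flag = NA,
                  C = c(9, 7, 8, 9), D = c(9, 10, 11, 12), Flag.1 = NA)
out <- reshape_pairs(toy)
print(out)
```

For the real files, each one could be read with read.csv(f, skip = 1) inside a loop over filenames and the returned table passed to write.table(). Note that the posted function writes data.df rather than out.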

Related

What line(s) can I add to only pull unique values before counting all values and writing the CSV file?

I am trying to create a CSV file that lists all unique values in my dataset. My data is in a folder that contains 200+ CSV files, all with 9 columns and a varying number of rows. Some files have no duplicates but many have duplicate values. I found some code that lists how many rows are in each file, but I am wondering what I could add to it so that it removes the duplicate values and only counts the unique values in the final output CSV. I would like the final CSV file to list the row count of each of the 200+ files in one sheet.
The code I found is below:
library(tidyverse)
csv.file <- list.files("TestA") # Directory with your .csv files
data.frame.output <- data.frame(number_of_cols = NA,
                                number_of_rows = NA,
                                name_of_csv = NA) # The df to be written
MyF <- function(x){
  csv.read.file <- data.table::fread(
    paste("TestA", x, sep = "/")
  )
  number.of.cols <- ncol(csv.read.file)
  number.of.rows <- nrow(csv.read.file)
  data.frame.output <<- add_row(data.frame.output,
                                number_of_cols = number.of.cols,
                                number_of_rows = number.of.rows,
                                name_of_csv = str_remove_all(x, ".csv")) %>%
    filter(!is.na(name_of_csv))
}
map(csv.file, MyF)
data.table::fwrite(data.frame.output, file = "Output1.csv")
I appreciate any guidance as I am a total R/coding beginner.
The following function accepts a vector of file names, reads them one by one, removes duplicated rows (every copy of a row that occurs more than once, so only rows that occur exactly once are counted) and outputs a data.frame with the numbers of columns and rows and the CSV filename.
There is no need to create the results data.frame data.frame.output beforehand.
MyF <- function(x, path = "TestA"){
  f <- function(x, path) {
    # commented out to test the function
    # uncomment these 3 lines and comment out the next one
    #csv.read.file <- data.table::fread(
    #  file.path(path, x)
    #)
    csv.read.file <- data.table::fread(x)
    i_dups <- (duplicated(csv.read.file) | duplicated(csv.read.file, fromLast = TRUE))
    csv.read.file <- csv.read.file[!i_dups, ]
    #
    number.of.cols <- ncol(csv.read.file)
    number.of.rows <- nrow(csv.read.file)
    #
    name_of_csv <- if(is.na(x)) NA_character_ else basename(x)
    name_of_csv <- tools::file_path_sans_ext(name_of_csv)
    #
    data.frame(number_of_cols = number.of.cols,
               number_of_rows = number.of.rows,
               name_of_csv = name_of_csv) |>
      dplyr::filter(!is.na(name_of_csv))
  }
  #
  y <- purrr::map(x, f, path = path)
  data.table::rbindlist(y)
}
data.frame.output <- MyF(csv.file)
data.table::fwrite(data.frame.output, file = "Output1.csv")
I find this for loop version better. Though for loops are not considered very idiomatic in R, there is nothing wrong with them. Like the function above, it avoids assignment in the parent environment with the operator <<- and the code is simpler. The results data.frame data.frame.output is created beforehand with as many rows as the input filenames vector has elements, and assignment is done by replacing the NA values with each CSV file's values.
MyF <- function(x, path = "TestA"){
  data.frame.output <- data.frame(number_of_cols = rep(NA, length(x)),
                                  number_of_rows = rep(NA, length(x)),
                                  name_of_csv = rep(NA, length(x)))
  for(i in seq_along(x)) {
    # commented out to test the function
    # uncomment this line and comment out the next one
    #fl_name <- file.path(path, x[i])
    fl_name <- x[i]
    #
    csv.read.file <- data.table::fread(fl_name)
    i_dups <- (duplicated(csv.read.file) | duplicated(csv.read.file, fromLast = TRUE))
    csv.read.file <- csv.read.file[!i_dups, ]
    #
    data.frame.output$number_of_cols[i] <- ncol(csv.read.file)
    data.frame.output$number_of_rows[i] <- nrow(csv.read.file)
    #
    name_of_csv <- if(is.na(fl_name)) NA_character_ else basename(fl_name)
    name_of_csv <- tools::file_path_sans_ext(name_of_csv)
    data.frame.output$name_of_csv[i] <- name_of_csv
  }
  #
  data.frame.output |> dplyr::filter(!is.na(name_of_csv))
}
MyF(csv.file)
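If "unique values" is meant as keeping one copy of each repeated row, rather than dropping every copy as the duplicated(...) | duplicated(..., fromLast = TRUE) filter does, a base-R sketch could look like the following. The function name count_unique_rows and the temporary demo file are hypothetical, and read.csv() stands in for data.table::fread() only to keep the example dependency-free.

```r
# Hypothetical helper: count columns and distinct rows for each CSV file
count_unique_rows <- function(files) {
  rows <- lapply(files, function(f) {
    d <- unique(read.csv(f))   # unique() keeps the first copy of each repeated row
    data.frame(number_of_cols = ncol(d),
               number_of_rows = nrow(d),
               name_of_csv = tools::file_path_sans_ext(basename(f)))
  })
  do.call(rbind, rows)         # one row of results per file
}

# Demo on a temporary file with one repeated row (made-up data)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = c(1, 1, 2), b = c("x", "x", "y")), tmp, row.names = FALSE)
res <- count_unique_rows(tmp)
print(res)
```

With the real folder, the input would be something like list.files("TestA", full.names = TRUE), and the result could be written with write.csv() or data.table::fwrite() as in the question.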

How to loop over row values in a two column data frame in R?

I have a data frame that looks like:
mydata <- data.frame(name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
id = c (1:5))
I want to loop over the rows and pull out the name and id values for each one. These are used to export a GTiff file. The important part is looping over each value. I've shown below how I could do it one by one, using the paste functions to turn the names into strings where necessary. In this case I would end up with 5 GeoTIFF files, one for each name.
library(raster) # provides raster(), reclassify() and writeRaster()
head(mydata)
x <- paste(mydata[1, 1])
x
y <- paste0(x, ".asc")
y
z <- paste(mydata[1, 2])
z
species_raster <- raster(y)
m <- c(0, as.numeric(z), 0, as.numeric(z), 1, 1)
rclmat <- matrix(m, ncol = 3, byrow = TRUE)
rc <- reclassify(species_raster, rclmat)
plot(rc)
writeRaster(rc,
            filename = x,
            format = "GTiff",
            overwrite = TRUE)
You can get a list of your pasted filenames with this:
outputnames <- lapply(mydata[, 1], paste0, ".asc")
# OR
outputnames <- lapply(mydata$name, paste0, ".asc")
These can later be used in another apply function or be referenced in a loop like so:
for (i in outputnames) {
  # i is one file name such as "Rick.asc"; rc has to be recomputed for each file, as in the question
  writeRaster(rc,
              filename = tools::file_path_sans_ext(i),
              format = "GTiff",
              overwrite = TRUE)
}
Another way of referencing your data frame in a loop is the following:
for (i in seq_len(nrow(mydata))) {
  filename <- paste0(mydata$name[i], ".asc") # paste0() avoids the space that paste() would insert
  print(filename)
  Idascharacter <- as.character(mydata$id[i])
  print(Idascharacter)
}
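Putting the pieces together, a row-by-row loop might look like the sketch below. It reuses the reclassification values from the question unchanged; the helper name make_rclmat is hypothetical, and the raster calls only run when the raster package is installed and the input file "<name>.asc" exists in the working directory (both assumptions about the setup).

```r
mydata <- data.frame(name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
                     id = 1:5)

# Hypothetical helper: the 2 x 3 reclassification matrix from the question for one id
make_rclmat <- function(id) {
  matrix(c(0, id, 0, id, 1, 1), ncol = 3, byrow = TRUE)
}

for (i in seq_len(nrow(mydata))) {
  name <- as.character(mydata$name[i])
  in_file <- paste0(name, ".asc")
  rclmat <- make_rclmat(mydata$id[i])
  # Only call the raster package when it is available and the input file exists
  if (requireNamespace("raster", quietly = TRUE) && file.exists(in_file)) {
    rc <- raster::reclassify(raster::raster(in_file), rclmat)
    raster::writeRaster(rc, filename = name, format = "GTiff", overwrite = TRUE)
  } else {
    message("skipping ", in_file)
  }
}
```

Each iteration reads its own input and writes its own output, so nothing from a previous row is reused by accident.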

How to create a random sample from .csv in r

I need to pull a random sample of 100,000 - 200,000 rows from a csv dataset of 2.8mil rows. How do I effectively do this so that the random sample can be cleaned and processed?
Under the DMwR2 library, I have used the sampleCSV function, but the output data messes up the 22 variables that I need to use.
library(caret)
library(DMwR2)
dataset source: https://www.kaggle.com/pschale/mlb-pitch-data-20152018#pitches.csv
pitchData <- sampleCSV(file.choose(), 200000, 2867154 , header = TRUE , mxPerc = 0.5)
summary(pitchData)
I expect the output of summary(pitchData) to have the same variable names as the csv file, but it renames them using random numbers, and some of the variables are lost.
Maybe the following function can do what the question asks for. Note that it uses a function from the package R.utils.
The return value is a list with 2 members:
lines, the line numbers read in;
data, the data frame.
This can be changed to return just the data frame.
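sample_csv <- function(fname, n, sep = ",", header = TRUE, ...){
  N <- as.integer(R.utils::countLines(fname))
  first <- if (header) 2L else 1L          # first line that holds data
  n_data <- N - first + 1L
  stopifnot(n_data >= n)
  lns <- first - 1L + sample.int(n_data, n) # never samples the header line
  x <- vapply(lns, function(l){
    # sep = "\n" returns the whole line as one string instead of splitting on spaces
    scan(fname, what = character(), sep = "\n", skip = l - 1, nlines = 1, quiet = TRUE)
  }, character(1))
  if (header) x <- c(readLines(fname, n = 1), x) # put the header line back on top
  list(lines = lns,
       data = read.table(text = x,
                         sep = sep, header = header, ...)
  )
}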
set.seed(1234)
res <- sample_csv(filename, 100, header = FALSE)
str(res$data)
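If the whole file fits in memory as text (an assumption about the data and the machine), a simpler sketch reads the lines once with readLines(), samples line positions, and always keeps the header so the column names survive. The name sample_csv_lines and the demo file are hypothetical, and fields with embedded line breaks would not be handled correctly.

```r
# Hypothetical helper: sample n data lines from a CSV, keeping the header
sample_csv_lines <- function(fname, n, header = TRUE, ...) {
  all_lines <- readLines(fname)
  first <- if (header) 2L else 1L                  # first line that holds data
  n_data <- length(all_lines) - first + 1L
  stopifnot(n_data >= n)
  picked <- first - 1L + sample.int(n_data, n)     # sampled line numbers, no repeats
  txt <- c(if (header) all_lines[1], all_lines[picked])
  read.csv(text = txt, header = header, ...)
}

# Demo on a small temporary file (made-up data)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:50, value = letters[rep(1:5, 10)]), tmp, row.names = FALSE)
set.seed(1234)
smp <- sample_csv_lines(tmp, 10)
str(smp)
```

Reading the file once avoids re-scanning it for every sampled line, which matters when the sample is 100,000 rows or more.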

How to export data frame with vector entries into CSV file?

I have a data frame with some entries as lists. This was an import from a JSON file, where an entry might have multiple tags. I imported the JSON file using the jsonlite package with flatten=TRUE. An example entry from my tags column is:
list(tag = c("ethicaltheory", "gametheory"), raw_tag = c("ethical heory", "Game Theory"))
I filtered the table down and want to export it to a CSV. When I tried the write.csv command, I got an error when it reached the first entry containing a list:
"unimplemented type 'list' in 'EncodeElement'"
The question is: can I export this file as is, or did I make a mistake in importing it?
I'd be fine with converting entries to strings or something, but I'm not sure how to do that for the entire table.
Interesting problem. I did not know that a data.frame can store lists. This algorithm takes the variable-length lists stored in the rows of a data.frame, finds the maximum length, creates a matrix with the proper dimensions (shorter rows are padded with 0) and then saves the file.
x <- list(tag = letters[1:2], raw_tag = letters[1:3])
y <- list(tag = letters[1:2], raw_tag = letters[1:2])
z <- list(tag = letters[1:3], raw_tag = letters[1:4])
df <- data.frame(clmn = I(list(x, y, z)))
r <- apply(df, 1, unlist)
lm <- max(unlist(lapply(r, length)))
df <- data.frame(
  matrix(
    rep(0, (lm * nrow(df))),
    ncol = lm
  )
)
vals.v <- unlist(lapply(1:nrow(df), function(i) {
  v <- unlist(r[i])
  l <- length(v)
  c(v, rep(0, lm - l))
}))
fin.res <- t(matrix(vals.v, ncol = nrow(df)))
# write.csv(fin.res, "res2.csv") # uncomment to save CSV file
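Since the question mentions that converting the entries to strings would be acceptable, another option is to collapse every list cell into one delimited string before writing. This is a minimal sketch under that assumption; the helper name flatten_list_cols, the ";" separator and the example data frame are hypothetical.

```r
# Hypothetical helper: turn every list column into a character column
flatten_list_cols <- function(d, sep = ";") {
  is_list <- vapply(d, is.list, logical(1))        # which columns hold lists
  d[is_list] <- lapply(d[is_list], function(col) {
    # one string per cell: all values in the cell joined by sep
    vapply(col, function(cell) paste(unlist(cell), collapse = sep), character(1))
  })
  d
}

x <- list(tag = letters[1:2], raw_tag = letters[1:3])
y <- list(tag = letters[1:2], raw_tag = letters[1:2])
z <- list(tag = letters[1:3], raw_tag = letters[1:4])
df <- data.frame(id = 1:3, clmn = I(list(x, y, z)))

flat <- flatten_list_cols(df)
out_file <- tempfile(fileext = ".csv")
write.csv(flat, out_file, row.names = FALSE)       # no list columns left, so this works
print(flat)
```

The table keeps one row per original entry, at the cost of having to split the strings again if the individual tags are needed later.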

Reading series of values in R

I have read a series of 332 files as shown below, storing the data from each file as a data frame in a list.
files <- list.files()
data <- list()
for (i in 1:332){
  data[[i]] = read.csv(files[[i]])
}
The data has 3 columns with names id, city, town. Now I need to calculate the mean of all values under city corresponding to the id values 1:10 for which I wrote the below code
for(j in 1:10){
req.data <- data[[j]]$city
}
mean(na.omit(req.data))
But it gives me a wrong value, and when I call it in a function it returns NULL values. Any help is highly appreciated.
Each time you iterate through j = 1:10 you assign data[[j]]$city to the object req.data. In doing so, for steps j = 2:10 you overwrite the previous version of req.data with the contents of the jth data set. Hence req.data only ever contains a single city's worth of data at any one time, and you get the wrong answer because you are computing the mean for the last city only, not all 10.
Also note that you could use mean(req.data, na.rm = TRUE) to remove the NAs.
You can do this without an explicit loop at the user R level using lapply(); for example, with dummy data:
set.seed(42)
data <- list(data.frame(city = rnorm(100)),
data.frame(city = rnorm(100)),
data.frame(city = rnorm(100)))
mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
which gives
> mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
[1] -0.02177902
So in your case, you need:
mean(unlist(lapply(data[1:10], `[`, "city")), na.rm = TRUE)
If you want to write a loop, then perhaps
req.data <- vector("list", length = 3) ## allocate, adjust to length = 10
for (j in 1:3) { ## adjust to 1:10 for your data / Q
  req.data[[j]] <- data[[j]]$city ## fill in
}
mean(unlist(req.data), na.rm = TRUE)
> mean(unlist(req.data), na.rm = TRUE)
[1] -0.02177902
is one way. Or alternatively, compute the mean of the individual cities and then average those means
vec <- numeric(length = 3) ## allocate, adjust to length = 10
for (j in 1:3) { ## adjust to 1:10 for your question
  vec[j] <- mean(data[[j]]$city, na.rm = TRUE)
}
mean(vec)
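One caveat worth checking: averaging the per-file means only matches the pooled mean when every file contributes the same number of non-missing values. The sketch below uses made-up data of unequal sizes to show the difference, and how weighting each mean by its count recovers the pooled value; all object names here are illustrative.

```r
set.seed(42)
# Dummy data with unequal sizes and one missing value (assumed for illustration)
data <- list(data.frame(city = rnorm(100)),
             data.frame(city = rnorm(50)),
             data.frame(city = c(rnorm(10), NA)))

# Pooled mean over all values
pooled <- mean(unlist(lapply(data, `[[`, "city")), na.rm = TRUE)

# Per-file means and the number of non-missing values behind each one
means <- vapply(data, function(d) mean(d$city, na.rm = TRUE), numeric(1))
sizes <- vapply(data, function(d) sum(!is.na(d$city)), numeric(1))

mean_of_means <- mean(means)                  # unweighted, generally differs from pooled
weighted <- weighted.mean(means, sizes)       # weighted by count, matches pooled

print(c(pooled = pooled, mean_of_means = mean_of_means, weighted = weighted))
```

Which of the two is wanted depends on whether each file or each observation should carry equal weight.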

Resources