R: Transpose a results table and add column headers - r

Setting the scene:
So I have a directory with 50 .csv files in it.
All files have unique names e.g. 1.csv 2.csv ...
The contents of each may vary in the number of rows but always have 4 columns
The column headers are:
Date
Result 1
Result 2
ID
I want to merge them all into one data frame (mydf) and then ignore any rows that contain an NA value,
so that I can count how many complete instances of each "ID" there are, by calling, for example:
myfunc("my_files", 1)
myfunc("my_files", c(2,4,6))
My code so far:
myfunc <- function(directory, id = 1:50) {
files_list <- list.files(directory, full.names=T)
mydf <- data.frame()
for (i in 1:50) {
mydf <- rbind(mydf, read.csv(files_list[i]))
}
mydf_subset <- mydf[which(mydf[, "ID"] %in% id),]
mydf_subna <- na.omit(mydf_subset)
table(mydf_subna$ID)
}
My issues and where I need help:
My results come out this way
2 4 6
200 400 600
and I'd like to transpose them to be like this. I'm not sure if calling a table is right or should I call it as.matrix perhaps?
2 200
4 400
6 600
I'd also like to have either the headers from the original files or assign new ones
ID Count
2 200
4 400
6 600
Any and all advice is welcome
Matt
Additional update
I tried amending my code to incorporate some of the helpful comments below, so I also have a version that looks like this:
myfunc <- function(directory, id = 1:50) {
files_list <- list.files(directory, full.names=T)
mydf <- data.frame()
for (i in 1:50) {
mydf <- rbind(mydf, read.csv(files_list[i]))
}
mydf_subset <- mydf[which(mydf[, "ID"] %in% id),]
mydf_subna <- na.omit(mydf_subset)
result <- data.frame(mydf_subna$ID)
transposed_result <- t(result)
colnames(transposed_result) <- c("ID","Count")
}
which I try to call with this:
myfunc("myfiles", 1)
myfunc("myfiles", c(2, 4, 6))
but I get this error
> myfunc("myfiles", c(2, 4, 6))
Error in `colnames<-`(`*tmp*`, value = c("ID", "Count")) :
length of 'dimnames' [2] not equal to array extent
I wonder if perhaps I'm not creating this data.frame correctly and should be using a cbind or not summing the rows by ID maybe?

You need to change your function to create a data frame rather than a table and then transpose that data frame. Change the line
table(mydf_subna$ID)
to be instead
result <- data.frame(mydf_subna$ID)
then use the t() function which transposes your data frame
transposed_result <- t(result)
colnames(transposed_result) <- c("ID","Count")
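One caveat with the t() approach: a data frame built from mydf_subna$ID alone has a single column, so its transpose is a 1 x n matrix, and assigning two column names to it fails with the 'dimnames' error shown in the question's update. A minimal sketch of an alternative, assuming a made-up vector ids standing in for mydf_subna$ID, is to convert the table itself with as.data.frame(), which already gives one row per ID:

```r
# 'ids' is made-up sample data standing in for mydf_subna$ID.
ids <- c(2, 2, 4, 4, 4, 6)

# The transpose of a one-column data frame is a 1 x n matrix,
# so it cannot take two column names.
dim(t(data.frame(ids)))

# Converting the table directly gives one row per ID plus its frequency.
counts <- as.data.frame(table(ids), stringsAsFactors = FALSE)
names(counts) <- c("ID", "Count")
counts
```

With stringsAsFactors = FALSE the ID column comes back as character; wrap it in as.numeric() if numeric IDs are needed.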

Welcome to Stack Overflow.
I am assuming that the function you have written returns the table and that you save it in a variable ans.
You could try this code:
ans <- myfunc("my_files", c(2,4,6))
ans2 <- data.frame(ans)
colnames(ans2) <- c('ID' ,'Count')
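Putting the pieces together, here is a hedged sketch of the whole function. The name count_complete is hypothetical, and the sketch assumes every .csv file in the directory has an ID column. It reads whatever files are present instead of hard-coding 1:50, binds them in one call, and returns a data frame with the requested headers:

```r
# Hypothetical rewrite of myfunc(); count_complete is an illustrative name.
count_complete <- function(directory, id = 1:50) {
  files_list <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  # Read every file once and bind them all in a single call.
  mydf <- do.call(rbind, lapply(files_list, read.csv))
  # Keep the requested IDs, then drop rows containing any NA.
  complete <- na.omit(mydf[mydf$ID %in% id, ])
  # as.data.frame() on a table gives one row per ID with its frequency.
  result <- as.data.frame(table(complete$ID), stringsAsFactors = FALSE)
  names(result) <- c("ID", "Count")
  result
}
```

It would be called the same way as the original, for example count_complete("my_files", c(2, 4, 6)).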

Related

How to add many data frame columns efficiently in R

I need to add several thousand columns to a data frame. Currently, I have a list of 93 lists, where each of the embedded lists contains 4 data frames, each with 19 variables. I want to add each column of all those data frames to an outside file. My code looks like:
vars <- c('tmin_F','tavg_F','tmax_F','pp','etr_grass','etr_alfalfa','vpd','rhmin','rhmax','dtr_F','us','shum','pp_def_grass','pp_def_alfalfa','rw_tot','fdd28_F0','fdd32_F0','fdd35_F0',
'fdd356_F0','fdd36_F0','fdd38_F0','fdd39_F0','fdd392_F0','fdd40_F0','fdd41_F0','fdd44_F0','fdd45_F0','fdd464_F0','fdd48_F0','fdd50_F0','fdd52_F0','fdd536_F0','fdd55_F0',
'fdd57_F0','fdd59_F0','fdd60_F0','fdd65_F0','fdd70_F0','fdd72_F0','hdd40_F0','hdd45_F0','hdd50_F0','hdd55_F0','hdd57_F0','hdd60_F0','hdd65_F0','hdd45_F0',
'cdd45_F0','cdd50_F0','cdd55_F0','cdd57_F0','cdd60_F0','cdd65_F0','cdd70_F0','cdd72_F0',
'gdd32_F0','gdd35_F0','gdd356_F0','gdd38_F0','gdd39_F0','gdd392_F0','gdd40_F0','gdd41_F0','gdd44_F0','gdd45_F0',
'gdd464_F0','gdd48_F0','gdd50_F0','gdd52_F0','gdd536_F0','gdd55_F0','gdd57_F0','gdd59_F0','gdd60_F0','gdd65_F0','gdd70_F0','gdd72_F0',
'gddmod_32_59_F0','gddmod_32_788_F0','gddmod_356_788_F0','gddmod_392_86_F0','gddmod_41_86_F0','gddmod_464_86_F0','gddmod_48_86_F0','gddmod_50_86_F0','gddmod_536_95_F0',
'sdd77_F0','sdd86_F0','sdd95_F0','sdd97_F0','sdd99_F0','sdd104_F0','sdd113_F0')
windows <- c(15,15,15,29,29,29,15,15,15,15,29,29,29,29,15,rep(15,78))
perc_list <- c('obs','smoothed_obs','windowed_obs','smoothed_windowed_obs')
percs <- c('00','02','05','10','20','25','30','33','40','50','60','66','70','75','80','90','95','98','100')
vcols <- seq(1,19,1)
for (v in 1:93){
for (pl in 1:4){
for (p in 1:19){
normals_1981_2010 <- normals_1981_2010 %>% mutate(!!paste0(vars[v],'_daily',perc_list[pl],'_perc',percs[p]) := percents[[v]][[pl]][,vcols[p]])}}
print(v)}
The code starts fast, but very quickly slows to a crawl as the outside data frame grows in size. I didn't realize this would be a problem. How do I add all these extra columns efficiently? Is there a better way to do this than by using mutate? I've tried add_column, but that does not work. Maybe it doesn't like the loop or something.
Your example is not reproducible as is (the object normals_1981_2010 doesn't exist but is called within the loop), so I am unsure I understood your question.
If I did though, this should work:
First, I am reproducing your dataset structure, except that instead of 93 lists, I set it up to have 5; instead of 4 nested tables within each, I set it up to have 3; and instead of each table having 19 columns, I set them up to have 3.
df_list <- vector("list", 5) # Create an empty list vector, then fill it in.
for(i in 1:5) {
df_list[[i]] <- vector("list", 3)
for(j in 1:3) {
df_list[[i]][[j]] <- data.frame(a = 1:12,
b = letters[1:12],
c = month.abb[1:12])
colnames(df_list[[i]][[j]]) <- paste0(colnames(df_list[[i]][[j]]), "_nest_", i, "subnest_", j)
}
}
df_list # preview the structure.
Then, answering your question:
library(dplyr) # provides bind_cols()
# Now, how to bind everything together:
df_out <- vector("list", 5)
for(i in 1:5) {
df_out[[i]] <- bind_cols(df_list[[i]])
}
# Final step
df_out <- bind_cols(df_out)
ncol(df_out) # Here I have 5*3*3 = 45 columns, but you will have 93*4*19 = 7068 columns
# [1] 45
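If you would rather stay in base R, a rough sketch of the same idea follows. The nested list here is a made-up, smaller stand-in (2 lists of 2 data frames with 2 columns each). The point is that the nested list is flattened one level and bound in a single cbind call, rather than growing the data frame one column at a time:

```r
# Made-up stand-in for the nested structure: 2 lists of 2 data frames.
df_list <- lapply(1:2, function(i) {
  lapply(1:2, function(j) {
    d <- data.frame(a = 1:3, b = 4:6)
    names(d) <- paste0(names(d), "_v", i, "_p", j)
    d
  })
})

# Flatten one level: a plain list of 4 data frames.
flat <- unlist(df_list, recursive = FALSE)

# Bind all columns in one call.
wide <- do.call(cbind, flat)
ncol(wide)
```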

how do you add new columns to an empty data frame in R

I declare an empty data frame as this:
df <- data.frame()
then I go through processing some files and, as I process them, I need to build my df data frame. I need to keep adding columns to it:
For example, I process some file and build a data frame called new_df, I now need to add this new_df to my df:
I've tried this:
latest_df <- cbind(latest_df, new_df)
I get this error:
Error in data.frame(..., check.names = FALSE) : arguments imply
differing number of rows: 0, 1
Just put data into the index after the last column
new_df = data.frame()
new_df[,ncol(new_df)+1] = NA
So if you knew you had 3 columns then:
new_df[,4] = c('a','b','c')
Example:
new_df = data.frame('a'=NA)
for(i in 1:10){
new_df[,ncol(new_df)+1] = NA
}
new_df
EDIT:
ProcessExample <- function(){
return(c(5)) # just returns 5 as fake data every time
}
new_df = data.frame(matrix(nrow=1))
for(i in 1:10){
new_df[,ncol(new_df)+1] = ProcessExample()
}
latest_df <- new_df[,-1]
Or just add rows and transpose the data set
new_df = data.frame()
for(i in 1:10){
new_df[i,1] = ProcessExample()
}
latest_df <- t(new_df)
If you simply want an empty data frame of the proper size before you enter the loop, and assuming "df" and "new_df" have the same number of rows x, try
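x <- 3 # number of rows (example value)
n <- 5 # number of columns to fill (example value)
df <- data.frame(matrix(nrow=x))
for (i in 1:n){
df[i] <- rep(i, x) # replace with some vector of length x
}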
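Another way to sidestep the empty data frame entirely is to collect the new columns in a list and convert once at the end. This is only a sketch: process_example and the column names V1 to V10 are made up for illustration.

```r
# Hypothetical stand-in for whatever processing produces each column.
process_example <- function(i) i * 2

# Collect the columns in a list first...
cols <- lapply(1:10, process_example)
names(cols) <- paste0("V", 1:10)

# ...then build the data frame in one step.
latest_df <- as.data.frame(cols)
dim(latest_df)
```

This avoids the placeholder column that then has to be dropped, and it does not copy the data frame on every iteration.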

Access variable dataframe in R loop

If I am working with dataframes in a loop, how can I use a variable data frame name (and additionally, variable column names) to access data frame contents?
dfnames <- c("df1","df2")
df1 <- df2 <- data.frame(X = sample(1:10),Y = sample(c("yes", "no"), 10, replace = TRUE))
for (i in seq_along(dfnames)){
curr.dfname <- dfnames[i]
#how can I do this:
curr.dfname$X <- 42:52
#...this
dfnames[i]$X <- 42:52
#or even this doubly variable call
for (j in 1_seq_along(colnames(curr.dfname)){
curr.dfname$[colnames(temp[j])] <- 42:52
}
}
You can use get() to return a variable reference based on a string of its name:
> x <- 1:10
> get("x")
[1] 1 2 3 4 5 6 7 8 9 10
So, yes, you could iterate through dfnames like:
dfnames <- c("df1","df2")
df1 <- df2 <- data.frame(X = sample(1:10), Y = sample(c("yes", "no"), 10, replace = TRUE))
for (cur.dfname in dfnames)
{
cur.df <- get(cur.dfname)
# for a fixed column name (10 values, one per row)
cur.df$X <- 42:51
# iterating through column names as well
for (j in colnames(cur.df))
{
cur.df[, j] <- 42:51
}
# get() returns a copy, so write the modified data frame back under its name
assign(cur.dfname, cur.df)
}
I really think that this is gonna be a painful approach, though. As the commenters say, if you can get the data frames into a list and then iterate through that, it'll probably perform better and be more readable. Unfortunately, get() isn't vectorised as far as I'm aware, so if you only have a string list of data frame names, you'll have to iterate through that to get a data frame list:
# build data frame list
df.list <- list()
for (i in 1:length(dfnames))
{
df.list[[i]] <- get(dfnames[i])
}
# iterate through data frames
for (i in seq_along(df.list))
{
# assign through the list index; the loop variable itself would only be a copy
df.list[[i]]$X <- 42:51
}
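For building the list, base R also has mget(), which takes a character vector of names and returns a named list of the objects. A small sketch with made-up data frames:

```r
# Made-up data frames with 10 rows each.
df1 <- data.frame(X = 1:10, Y = rep(c("yes", "no"), 5))
df2 <- df1
dfnames <- c("df1", "df2")

# mget() looks up several names at once and returns a named list.
df.list <- mget(dfnames)

# Modify every data frame in the list; the originals are left untouched.
df.list <- lapply(df.list, function(d) {
  d$X <- 42:51
  d
})
names(df.list)
```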
Hope that helps!
2018 Update: I probably wouldn't do something like this anymore. Instead, I'd put the data frames in a list and then use purrr::map(), or, the base equivalent, lapply():
library(tidyverse)
stuff_to_do = function(mydata) {
mydata$somecol = 42:51 # 10 values, one per row
# … anything else I want to do to the current data frame
mydata # return it
}
df_list = list(df1, df2)
map(df_list, stuff_to_do)
This brings back a list of modified data frames (although you can use variants of map(), map_dfr() and map_dfc(), to automatically bind the list of processed data frames row-wise or column-wise respectively). The former uses column names to join, rather than column positions, and it can also add an ID column using the .id argument and the names of the input list. So it comes with some nice added functionality over lapply()!
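For comparison, a rough base R sketch of the same pattern: lapply() processes each data frame and do.call(rbind, ...) binds the results row-wise. The data frames and the ID column name source are made up for illustration.

```r
# Made-up input: a named list of data frames.
df_list <- list(df1 = data.frame(X = 1:3), df2 = data.frame(X = 4:6))

stuff_to_do <- function(mydata) {
  mydata$somecol <- mydata$X * 2
  mydata  # return the modified data frame
}

processed <- lapply(df_list, stuff_to_do)

# Add an ID column taken from the list names, then bind row-wise.
with_id <- Map(function(d, nm) data.frame(source = nm, d),
               processed, names(processed))
combined <- do.call(rbind, with_id)
combined
```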

Add Columns to an empty data frame in R

I have searched extensively but not found an answer to this question on Stack Overflow.
Let's say I have a data frame a.
I define:
a <- NULL
a <- as.data.frame(a)
If I wanted to add a column to this data frame as so:
a$col1 <- c(1,2,3)
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "a", value = c(1, 2, 3)) :
replacement has 3 rows, data has 0
Why is the row dimension fixed but the column is not?
How do I change the number of rows in a data frame?
If I do this (inputting the data into a list first and then converting to a df), it works fine:
a <- NULL
a$col1 <- c(1,2,3)
a <- as.data.frame(a)
The row dimension is not fixed, but data.frames are stored as a list of vectors that are constrained to have the same length. You cannot add col1 to a because col1 has three values (rows) and a has zero, thereby breaking the constraint. R does not by default auto-vivify values when you attempt to extend the dimension of a data.frame by adding a column that is longer than the data.frame. The reason that the second example works is that col1 is the only vector in the data.frame so the data.frame is initialized with three rows.
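The constraint can be seen in a short sketch; the variable names b and res are only for illustration. Assigning a three-value column to a zero-row data frame fails, while a data frame that starts with three rows accepts further three-value columns:

```r
a <- data.frame()
nrow(a)  # 0 rows, so any non-empty column is too long

# The assignment is wrapped in tryCatch() so the script keeps running.
res <- tryCatch({
  a$col1 <- c(1, 2, 3)
  "ok"
}, error = function(e) "error")
res

# Starting from a column of length 3 fixes the row count at 3.
b <- data.frame(col1 = c(1, 2, 3))
b$col2 <- c("x", "y", "z")
dim(b)
```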
If you want to automatically have the data.frame expand, you can use the following function:
cbind.all <- function (...)
{
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function(x) rbind(x, matrix(, n -
nrow(x), ncol(x)))))
}
This will fill missing values with NA. And you would use it like: cbind.all( df, a )
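A smaller sketch of the same padding idea for plain vectors, assuming made-up columns col1 and col2: setting length(v) <- n on a vector pads it with NA, so columns of unequal length can be brought to a common length before the data frame is built.

```r
# Made-up columns of unequal length.
cols <- list(col1 = c(1, 2, 3), col2 = c(10, 20))

# Pad every vector with NA up to the longest length.
n <- max(lengths(cols))
padded <- lapply(cols, function(v) {
  length(v) <- n
  v
})

a <- as.data.frame(padded)
a
```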
You could also do something like this where I read in data from multiple files, grab the column I want, and store it in the dataframe. I check whether the dataframe has anything in it, and if it doesn't, create a new one rather than getting the error about mismatched number of rows:
readCounts = data.frame()
for(f in names(files)){
d = read.table(files[f], header=T, as.is=T)
d2 = round(data.frame(d$NumReads))
colnames(d2) = f
if(ncol(readCounts) == 0){
readCounts = d2
rownames(readCounts) = d$Name
} else{
readCounts = cbind(readCounts, d2)
}
}
If you have an empty data frame, called for example df, another quite simple solution, in my opinion, is the following:
df[1,]=NA # add a temporary new row of NA values
df[,'new_column'] = NA # adding new column, called for example 'new_column'
df = df[0,] # delete row with NAs
I hope this may help.

R 3.1.0 How to make a function to select between 2 columns (subsetting or indexing) using 1 argument?

I need to get the mean (and remove all the NA values) of a particular "fruit" within a range of IDs or just one ID (The ID are from the farms the fruits are from).
This is my formatted data:
date mangos papayas id
2010-04-17 20 30 2
2012-02-17 40 22 3
I have a folder called: "fruits". Then I created this variable: "files_full"
files_full <- list.files("fruits", full.names = TRUE) # it contains: chr 1:32.
Now I have created a data frame: "dat" (dat <- data.frame())
What I need is to create a function with 3 arguments: directory, fruit, id. I have this function for that:
fruit <- function (directory, fruit, id) {
files_list <- list.files(directory, full.names = TRUE)
dat <- data.frame()
for (i in id){
dat <- rbind(dat, read.csv(files_list[i]))
dat_subset <- subset (dat, dat$ID == id && dat$papaya == fruit|dat$mango == fruit)
mean(data_subset)
}
}
So:
1) Users will need to enter: directory, fruit, and id (for the farm the fruits are from; I have a csv file for every one of the 32 farms, that's why I'm doing a loop to combine them into a data frame (dat)).
2) My question: How do I subset for a specific fruit? Let's say I have 2 columns: papayas, mangos. But only 1 argument: "fruit". As you see, I have tried something but I'm not sure if it is OK.
After subsetting or indexing the fruit and ID(s), I need to have the median of those values.
So the desired output would be something like:
fruit("fruits", "papayas", 2:3)
[1] 26
The simple idea is that you use the name of the column to choose the column. For example: df[[myFruit]] where myFruit <- "mango" (say).
I think this should work. Give it a spin:
my_fun <- function (dir, fruit, id) {
files_list <- list.files(dir, full.names = TRUE)
dat <- do.call(rbind, lapply(files_list, read.csv))
dat <- dat[complete.cases(dat), ] # Remove any row that has NAs.
return(mean(dat[dat[["id"]] %in% id, fruit]))
}
You should add the requirement for removing NAs directly into your question.
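To see the column-by-name idea in isolation, here is a small sketch on in-memory data instead of files. The function name fruit_mean and the third data row are made up; the first two rows are the ones from the question. Here NAs are dropped only in the selected fruit column via na.rm = TRUE, rather than dropping whole rows.

```r
# First two rows are from the question; the third is made up to include an NA.
dat <- data.frame(
  date    = c("2010-04-17", "2012-02-17", "2013-01-01"),
  mangos  = c(20, 40, NA),
  papayas = c(30, 22, 18),
  id      = c(2, 3, 4)
)

# 'fruit' is a string naming the column to average.
fruit_mean <- function(dat, fruit, id) {
  mean(dat[dat$id %in% id, fruit], na.rm = TRUE)
}

fruit_mean(dat, "papayas", 2:3)  # 26, the mean of 30 and 22
```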
