For Loop Across Specific Column Range in R - r

I have a wide data frame consisting of 1000 rows and over 300 columns. The first 2 columns are GroupID and Categorical fields. The remaining columns are all continuous numeric measurements. What I would like to do is loop through a specific range of these columns in R, beginning with the first numeric column (column #3). For example, loop through columns 3:10. I would also like to retain the column names in the loop. I've started with the following code:
for(i in 3:ncol(df)){
print(i)
}
But this includes all columns to the right of column #3 (not the range 3:10), and this does not identify column names. Can anyone help get me started on this loop so I can specify the column range and also retain column names? TIA!
Side Note: I've used tidyr to gather the data frame in long format. That works, but I've found it makes my data frame very large and therefore eats a lot of time and memory in my loop.

Since you did not include your data, I created similar dummy data (1000 rows and 302 columns, 2 id vars) to show you how to select columns and prepare them for plotting:
library(reshape2)
library(ggplot2)
set.seed(123)
#Dummy data
Numvars <- as.data.frame(matrix(rnorm(1000*300),nrow = 1000,ncol = 300))
vec1 <- 1:1000
vec2 <- rep(paste0('class',1:5),200)
IDs <- data.frame(vec1,vec2,stringsAsFactors = F)
#Bind data
Data <- cbind(IDs,Numvars)
#Select vars (in your case 10 initial vars)
df <- Data[,1:12]
#Prepare for plot
df.melted <- melt(data = df,id.vars = c('vec1','vec2'))
#Plot
ggplot(df.melted,aes(x=vec1,y=value,group=variable,color=variable))+
geom_line()+
facet_wrap(~vec2)
You will end up with a plot like this:
I hope this helps.

You can keep column names by feeding them into an lapply function, here's an example with the iris dataset:
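library(ggplot2)
lapply(names(iris)[2:4], function(columntoplot){
# build a one-column data frame (double brackets return the column as a vector)
df <- data.frame(datatoplot=iris[[columntoplot]])
graphname <- columntoplot
# store the plot so it can be passed to ggsave explicitly
p <- ggplot(df, aes(x = datatoplot)) +
geom_histogram() +
ggtitle(graphname)
ggsave(filename = paste0(graphname, ".png"), plot = p, width = 4, height = 4)
})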
In the lapply function, you create a new dataset comprising one column (note the double brackets). You can then plot and optionally save the output within the function (see ggsave line). You're then able to use the column name as the plot title as well as the file name.
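For the original question (a for loop over columns 3:10 that keeps the column names), a minimal sketch could look like the following. The data frame here is made-up stand-in data, and computing a mean is only a placeholder for whatever work the loop body should do.

```r
# Dummy wide data frame (hypothetical; stands in for the asker's data)
set.seed(1)
df <- data.frame(GroupID = 1:20,
                 Category = rep(c("a", "b"), 10),
                 matrix(rnorm(20 * 12), nrow = 20))

# Column range to loop over: columns 3 to 10, referenced by name
cols <- names(df)[3:10]

# Collect one result per column, keyed by column name
col_means <- numeric(0)
for (cname in cols) {
  col_means[cname] <- mean(df[[cname]])
  cat(cname, ":", round(col_means[cname], 3), "\n")
}
```

Looping over names(df)[3:10] rather than 3:10 means the name is available directly inside the loop, and df[[cname]] still returns the column as a plain vector.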

Related

Dividing one dataframe into many with names in R

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be a data type that the split function can convert to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
A = sample(letters, size = 70000000, replace = TRUE),
B = rpois(70000000, lambda = 1)
)
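The modulo-based grouping above spreads rows round-robin across the groups. If contiguous blocks of rows are preferred (the question mentions splitting sequentially), one possible sketch uses cut() on the row index instead. The small data frame below is a stand-in for the real 70-million-row one, and the block variable name is an assumption for illustration.

```r
# Small stand-in for the large data frame (hypothetical data)
set.seed(123)
dat <- data.frame(A = sample(letters, size = 103, replace = TRUE),
                  B = rpois(103, lambda = 1))

N <- 20  # number of parts

# Block id for each row: cut() divides the row index into N contiguous,
# nearly equal-sized intervals and returns the interval number
block <- cut(seq_len(nrow(dat)), breaks = N, labels = FALSE)

dat_list <- split(dat, block)
names(dat_list) <- paste0("newdf_", seq_along(dat_list))

# Recombine later; row order is preserved because the blocks are contiguous
recombined <- do.call(rbind, unname(dat_list))
```

Because the blocks are contiguous and split() returns them in block order, binding them back together restores the original row order.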
Here's a tidyverse based solution. Try using read_csv_chunked().
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
value = rnorm(1e6)) %>%
write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
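A rough sketch of such a wrapper is shown here. The function name read_matching_rows and the practice file are hypothetical, and the row filter uses base subsetting rather than dplyr's filter() so that only readr needs to be loaded.

```r
library(readr)

# Hypothetical helper: keep only rows whose `string` column equals `target`
read_matching_rows <- function(path, target, chunk_size = 1000) {
  read_csv_chunked(
    path,
    callback = DataFrameCallback$new(function(x, pos) x[x$string == target, ]),
    chunk_size = chunk_size
  )
}

# Small practice file (made-up data), written to a temporary location
csv_path <- tempfile(fileext = ".csv")
set.seed(42)
practice <- data.frame(string = sample(letters, 5000, replace = TRUE),
                       value = rnorm(5000))
write_csv(practice, csv_path)

# Read back only the rows where string is "b"
only_b <- read_matching_rows(csv_path, "b")
```

Only the rows that pass the filter are kept from each chunk, so the full file never has to sit in memory at once.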
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?

Print histograms including variable name for all variables in R

I'm trying to generate a simple histogram for every variable in my dataframe, which I can do using sapply below. But, how can I include the name of the variable in either the title or the x-axis so I know which one I'm looking at? (I have about 20 variables.)
Here is my current code:
x = # initialize dataframe
sapply(x, hist)
Here's a way to modify your existing approach to include column name as the title of each histogram, using the iris dataset as an example:
# loop over column *names* instead of actual columns
# (invisible() keeps the returned histogram objects from being printed to the console)
invisible(sapply(names(iris), function(cname){
# (make sure we only plot the numeric columns)
if (is.numeric(iris[[cname]]))
# use the `main` param to put column name as plot title
hist(iris[[cname]], main=cname)
}))
After you run that, you'll be able to flip through the plots with the arrows in the viewer pane (assuming you're using RStudio).
Here's an example output:
p.s. check out grid::grob(), gridExtra::grid.arrange(), and related functions if you want to arrange the histograms onto a single plot window and save it to a single file.
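As a simpler alternative to the grid functions mentioned in the p.s., base graphics can also place several histograms on one page with par(mfrow = ...). This is only a sketch: the 2 x 2 layout and the output file name are assumptions chosen to fit the four numeric iris columns.

```r
# Numeric columns only (iris is used purely as example data)
num_cols <- names(iris)[sapply(iris, is.numeric)]

# Write all panels to a single PDF in the temporary directory
out_file <- file.path(tempdir(), "iris_histograms.pdf")
pdf(out_file, width = 8, height = 8)

# 2 x 2 grid of panels; restore the previous layout afterwards
old_par <- par(mfrow = c(2, 2))
for (cname in num_cols) {
  # column name goes into both the title and the x-axis label
  hist(iris[[cname]], main = cname, xlab = cname)
}
par(old_par)
dev.off()
```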
How about this? Assuming you have wide data, you can transform it to long format with gather. Then use a ggplot solution with geom_histogram and facet_wrap:
library(tidyverse)
# make wide data (20 columns)
df <- matrix(rnorm(1000), ncol = 20)
df <- as.data.frame(df)
colnames(df) <- LETTERS[1:20]
# transform to long format (2 columns)
df <- gather(df, key = "name", value = "value")
# plot histograms per name
ggplot(df) +
geom_histogram(aes(value)) +
facet_wrap(~name, ncol = 5)

Loop through and plot columns of two identical dataframes

I have two dataframes I'd like to plot against each other:
> df1 <- data.frame(HV = c(3,3,3), NAtlantic850t = c(0.501, 1.373, 1.88), AO = c(-0.0512, 0.2892, 0.0664))
> df2 <- data.frame(HV = c(3,3,2), NAtlantic850t = c(1.2384, 1.3637, -0.0332), AO = c(-0.5915, -0.0596, -0.8842))
They're identical, I'd like to plot them column vs column (e.g. df1$HV, df2$HV) - loop through the dataframe columns and plot them against each other in a scatter graph.
I've looked through 20+ questions asking similar things and can't figure it out - would appreciate some help on where to start. Can I use lapply and plot or ggplot when they're two DFs? Should I merge them first?
As you suggest, I would indeed first rearrange into a list of plottable data frames before calling the plot command. I think that would especially be the way to go if you want to feed the data argument into ggplot. Something like:
plot_dfs <- lapply(names(df1),function(nm)data.frame(col1 = df1[,nm], col2 = df2[,nm]))
for (df in plot_dfs)plot(x = df[,"col1"], y = df[,"col2"])
or using ggplot:
for (df in plot_dfs){
print(
ggplot(data = df, aes(x=col1, y=col2)) +
geom_point())}
and if you want to add the column names as plot titles, you can do:
for (idx in seq_along(plot_dfs)){
print(
ggplot(data = plot_dfs[[idx]], aes(x=col1, y=col2)) +
ggtitle(names(df1)[idx]) +
geom_point())}
You can loop through the columns like this:
for(col in 1:ncol(df1)){
plot(df1[,col], df2[,col])
}
Make sure that both data frames have the same number of columns (and that the order of the columns is the same) before running this.
Here’s one way to do it — loop over the column indices and create the plots one by one, adding them to a list and writing each one to a file:
library(ggplot2)
# create some data to plot
df1 <- iris[, sapply(iris, is.numeric)]
df2 <- iris[sample(1:nrow(iris)), sapply(iris, is.numeric)]
# a list to catch each plot object
plot_list <- vector(mode="list", length=ncol(df1))
for (idx in seq_along(df1)){
plot_list[[idx]] <- ggplot2::qplot(df1[[idx]], df2[[idx]]) +
labs(title=names(df1)[idx])
ggsave(filename=paste0(names(df1)[idx], ".pdf"), plot=plot_list[[idx]])
}
As you suggest in the question, you can also use s/lapply() with an anonymous function, e.g. like this (though here we're not storing the plots, just writing each one to disk):
lapply(seq_along(df1), function(idx){
the_plot <- ggplot2::qplot(df1[[idx]], df2[[idx]]) + labs(title=names(df1)[idx])
ggsave(filename=paste0(names(df1)[idx], ".pdf"), plot=the_plot)
})
If you want to keep the list of plots (as in the for-loop example), just assign the result of lapply() to a variable (e.g. plot_list) and add a line like return(the_plot) before closing the function.
There's tons of ways you could modify/adapt this approach, depending on what your objectives are.
Hope this helps ~~
p.s. if it's possible the columns won't be in the same order, it is better to loop over column names instead of column indices (i.e. use for (colname in names(df1)){... instead of for (idx in seq_along(df1)){...). You can use the same [[ subsetting syntax with both names and indices.
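A small sketch of the name-based loop from the p.s., using the two data frames from the question with the columns of df2 deliberately reordered. Base plot() is used here instead of ggplot, and the output file name is an assumption.

```r
df1 <- data.frame(HV = c(3, 3, 3),
                  NAtlantic850t = c(0.501, 1.373, 1.88),
                  AO = c(-0.0512, 0.2892, 0.0664))
# Same columns as df1, but in a different order on purpose
df2 <- data.frame(AO = c(-0.5915, -0.0596, -0.8842),
                  HV = c(3, 3, 2),
                  NAtlantic850t = c(1.2384, 1.3637, -0.0332))

# Only loop over columns that exist in both data frames
shared <- intersect(names(df1), names(df2))

out_file <- file.path(tempdir(), "column_scatter.pdf")
pdf(out_file)
for (colname in shared) {
  # [[ ]] looks the column up by name, so column order does not matter
  plot(df1[[colname]], df2[[colname]],
       xlab = paste("df1", colname), ylab = paste("df2", colname),
       main = colname)
}
dev.off()
```

Using intersect() also guards against a column that is present in one data frame but missing from the other.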

create multiple graphs/plots from xlsx file with n columns

I have a .xlsx file with several columns (with some inter-dependencies). I'd like to plot multiple graphs on the same chart using a select number of the columns. The first column is Date (which will be my only X-variable) and the remaining columns of interest will be Y-values. There are 1000 rows of data in this file.
So ...
X-axis ... "Date" column only
Y-axis (multiple data) ... columns B, C, D, E, T, U, V only
Question:
How to:
1) Read the file
2) plot a line graph of the data, all on the same chart (X-axis = Date, Y-axis = columns B, C, D, E, T, U, V)
3) Color code each line with some type of a legend
I've read this post and many more (not allowed to post more than 2 links??) ... none has been helpful. Most are too arbitrary:
how to plot all the columns of a data frame in R
The problem you have is with this labels/sublabels combination. They mess up the import (variable classes are not recognized). Here is a two-step solution.
In the first step, we import the database just to extract clean column names. What I did for that was to concatenate the main label (row 2) with the sublabel (row 3) when there was one. There are two pairs of identical column labels, so we also rename them to have clean colnames (I suggest you take the time to review your variable names and give them proper labels). Then we save them as an object (n).
Then, we import the file again, skipping the first two rows. That way, read_excel knows what classes to expect. We assign the previously saved names to the new data.frame. Now the data is clean. The rest is trivial: melt with tidyr::gather and plot with ggplot.
Code
library(readxl)
library(tidyr)
library(zoo)
library(ggplot2)
df <- read_excel("./myfile.xlsx",skip = 1)
names(df)[!is.na(df[1,])] <- paste(na.locf(names(df)[!is.na(df[1,])]),df[1,][!is.na(df[1,])],sep="_")
names(df)[duplicated(names(df))] <- paste0(names(df)[duplicated(names(df))],"bis")
n <- names(df)
df <- read_excel("./myfile.xlsx",skip = 2)
names(df) <- n
# df <- dplyr::slice(df,1:3) # this line is for the censored datafile that has only three rows
melted <- gather(df,key,value,-Date)
ggplot(melted, aes(x=Date,y=value,color=key)) + geom_line()
Of course, with only three rows of data, the result is ugly:

dplyr mutate in R - adding a new column depending on sequence of another column

I am having an issue with mutate function in dplyr.
I am trying to
add a new column called state that depends on the change in one of the columns (the V column). The V column repeats itself in a sequence, so each sequence (rep(seq(100,2100,100),each=96)) corresponds to one dataset in my df.
Error: impossible to replicate vector of size 8064
Here is a reproducible example of my df:
df <- data.frame (
No=(No= rep(seq(0,95,1),times=84)),
AC= rep(rep(c(78,110),each=1),times=length(No)/2),
AR = rep(rep(c(256,320,384),each=2),times=length(No)/6),
AM = rep(1,times=length(No)),
DQ = rep(rep(seq(0,15,1),each=6),times=84),
V = rep(rep(seq(100,2100,100),each=96),times=4),
R = sort(replicate(6, sample(5000:6000,96))))
labels <- rep(c("CAP-CAP","CP-CAP","CAP-CP","CP-CP"),each=2016)
I added here 2016 value intentionally since I know the number of rows of each dataset.
But I want to assign these labels with an automated function whenever the dataset changes, because the total number of rows may differ for each df in my real files. For this question, assume there is only one txt file, but keep in mind that there are plenty of them with different numbers of rows. The format is the same.
I use dplyr to arrange my df
library("dplyr")
newdf<-df%>%mutate_each(funs(as.numeric))%>%
mutate(state = labels)
Is there an elegant way to do this?
Iff you know the number of data sets contained in df AND the column you're keying off --- here, V --- is ordered in df like it is in your toy data, then this works. It's pretty clunky, and there should be a way to make it even more efficient, but it produced what I take to be the desired result:
# You'll need dplyr for the lead() part
library(dplyr)
# Make a vector with the labels for your subsets of df
labels <- c("AP-AP","P-AP","AP-P","P-P")
# This line a) produces an index that marks the final row of each subset in df
# with a 1 and then b) produces a vector with the row numbers of the 1s
endrows <- which(grepl(1, with(df, ifelse(lead(V) - V < 0, 1, 0))))
# This line uses those row numbers or the differences between them to tell rep()
# how many times to repeat each label
newdf$state <- c(rep(labels[1], endrows[1]), rep(labels[2], endrows[2] - endrows[1]),
rep(labels[3], endrows[3] - endrows[2]), rep(labels[4], nrow(newdf) - endrows[3]))
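If the number of data sets is not known in advance, one possible variation is to number the data sets with cumsum() over the points where V drops, then index the label vector with that number. This is a sketch under the same assumption as above (V restarts at a lower value at the start of each data set). The toy V column below is made up, and the labels are taken from the question.

```r
# Toy V column: three data sets of different lengths, each restarting at 100
V <- c(seq(100, 500, 100), seq(100, 300, 100), seq(100, 700, 100))
df <- data.frame(V = V)

labels <- c("CAP-CAP", "CP-CAP", "CAP-CP", "CP-CP")

# A new data set starts wherever V drops; cumsum() turns those start
# markers into a running data set id (1, 2, 3, ...)
dataset_id <- cumsum(c(TRUE, diff(df$V) < 0))

# Look up the label for each row by its data set id
df$state <- labels[dataset_id]
```

No row counts are hard-coded, so the same lines work for files of different lengths. If there are more data sets than labels, the extra rows get NA.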
