I have a .xlsx file with several columns (with some inter-dependencies). I'd like to plot multiple lines on the same chart using a selected subset of the columns. The first column is Date (which will be my only X-variable) and the remaining columns of interest will be Y-values. There are 1000 rows of data in this file.
So ...
X-axis ... "Date" column only
Y-axis (multiple data) ... columns B, C, D, E, T, U, V only
Question:
How to:
1) Read the file
2) plot a line graph of the data, all on the same chart (X-axis = Date, Y-axis = columns B, C, D, E, T, U, V)
3) Color code each line with some type of a legend
I've read this post and many more (not allowed to post more than 2 links??) ... none has been helpful. Most are too arbitrary:
how to plot all the columns of a data frame in R
The problem you have is with this labels/sublabels combination. They mess up the import (variable classes are not recognized). Here is a two-step solution.
In the first step, we import the database just to extract clean column names. What I did for that was to concatenate the main label (row 2) with the sublabel (row 3) when there was one. There are two pairs of identical column labels, so we also rename them to have clean colnames (I suggest you take the time to review your variable names and give them proper labels). Then we save them as an object (n).
Then, we import the file again, skipping the first two rows. That way, read_excel sees only data rows and can guess the column classes correctly. We assign the previously saved names to the new data.frame. Now the data is clean. The rest is trivial: melt with tidyr::gather and plot with ggplot.
Code
library(readxl)
library(tidyr)
library(zoo)
library(ggplot2)
df <- read_excel("./myfile.xlsx",skip = 1)
names(df)[!is.na(df[1,])] <- paste(na.locf(names(df)[!is.na(df[1,])]),df[1,][!is.na(df[1,])],sep="_")
names(df)[duplicated(names(df))] <- paste0(names(df)[duplicated(names(df))],"bis")
n <- names(df)
df <- read_excel("./myfile.xlsx",skip = 2)
names(df) <- n
# df <- dplyr::slice(df,1:3) # this line is for the censored datafile that has only three rows
melted <- gather(df,key,value,-Date)
ggplot(melted, aes(x=Date,y=value,color=key)) + geom_line()
Of course, with only three rows of data, the result is ugly (plot not shown).
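The question asks for columns B, C, D, E, T, U and V only, whereas the code above melts every column. A minimal base R sketch of restricting the reshape to those columns follows. The data frame here is made-up stand-in data, and it assumes that the Excel letters map directly to column positions (Date in position 1, so B to E are positions 2 to 5 and T to V are positions 20 to 22).

```r
# Hypothetical stand-in for the imported sheet:
# Date in column 1, then 21 numeric columns named B..V
set.seed(1)
df <- data.frame(Date = seq(as.Date("2020-01-01"), by = "day", length.out = 1000))
for (nm in LETTERS[2:22]) df[[nm]] <- cumsum(rnorm(1000))

# Excel columns B, C, D, E, T, U, V are positions 2:5 and 20:22
keep <- c(2:5, 20:22)
wide <- df[, c(1, keep)]

# Long format: one row per (Date, series) pair
long <- data.frame(
  Date  = rep(wide$Date, times = length(keep)),
  key   = rep(names(wide)[-1], each = nrow(wide)),
  value = unlist(wide[-1], use.names = FALSE)
)
```

The resulting `long` data frame has the same three-column shape as `melted` above, so the same `ggplot(..., aes(x=Date, y=value, color=key)) + geom_line()` call should apply to it.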
Related
I have a wide data frame consisting of 1000 rows and over 300 columns. The first 2 columns are GroupID and Categorical fields. The remaining columns are all continuous numeric measurements. What I would like to do is loop through a specific range of these columns in R, beginning with the first numeric column (column #3). For example, loop through columns 3:10. I would also like to retain the column names in the loop. I've started with the following code:
for(i in 3:ncol(df)){
print(i)
}
But this includes all columns to the right of column #3 (not the range 3:10), and this does not identify column names. Can anyone help get me started on this loop so I can specify the column range and also retain column names? TIA!
Side Note: I've used tidyr to gather the data frame in long format. That works, but I've found it makes my data frame very large and therefore eats a lot of time and memory in my loop.
Since you did not include your data, I created similar dummy data (1000 rows and 302 columns, 2 of them ID variables) to show you how to select columns and prepare them for plotting:
library(reshape2)
library(ggplot2)
set.seed(123)
#Dummy data
Numvars <- as.data.frame(matrix(rnorm(1000*300),nrow = 1000,ncol = 300))
vec1 <- 1:1000
vec2 <- rep(paste0('class',1:5),200)
IDs <- data.frame(vec1,vec2,stringsAsFactors = F)
#Bind data
Data <- cbind(IDs,Numvars)
#Select vars (in your case 10 initial vars)
df <- Data[,1:12]
#Prepare for plot
df.melted <- melt(data = df,id.vars = c('vec1','vec2'))
#Plot
ggplot(df.melted,aes(x=vec1,y=value,group=variable,color=variable))+
  geom_line()+
  facet_wrap(~vec2)
You will end up with one panel per class (five in total), each showing the ten selected variables as colored lines.
I hope this helps.
You can keep column names by feeding them into an lapply function, here's an example with the iris dataset:
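library(ggplot2)
lapply(names(iris)[2:4], function(columntoplot){
  # one-column data frame; [[ ]] extracts the column as a plain vector
  df <- data.frame(datatoplot=iris[[columntoplot]])
  graphname <- columntoplot
  p <- ggplot(df, aes(x = datatoplot)) +
    geom_histogram() +
    ggtitle(graphname)
  # pass the plot explicitly instead of relying on the last plot created
  ggsave(filename = paste0(graphname, ".png"), plot = p, width = 4, height = 4)
})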
In the lapply function, you create a new dataset comprising one column (note the double brackets). You can then plot and optionally save the output within the function (see ggsave line). You're then able to use the column name as the plot title as well as the file name.
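For the original question (a plain loop over columns 3:10 that keeps the column names), one possible sketch is to iterate over the names rather than the positions. The data frame below is hypothetical stand-in data, and taking the mean is only a placeholder for whatever work the loop body really does.

```r
set.seed(42)
# Hypothetical stand-in: 2 ID columns followed by 10 numeric columns (X1..X10)
df <- data.frame(GroupID = 1:6,
                 Category = rep(c("a", "b"), 3),
                 matrix(rnorm(60), nrow = 6))

# Column *names* for positions 3 to 10 only
cols <- names(df)[3:10]

col_means <- numeric(0)
for (nm in cols) {
  # nm is the column name; df[[nm]] is that column's data
  col_means[nm] <- mean(df[[nm]])
}
print(col_means)
```

Because the loop variable is the name itself, it can be used directly for titles, file names, or the names of a result vector, and no reshaping to long format is needed.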
I'm trying to generate a simple histogram for every variable in my dataframe, which I can do using sapply below. But, how can I include the name of the variable in either the title or the x-axis so I know which one I'm looking at? (I have about 20 variables.)
Here is my current code:
x = # initialize dataframe
sapply(x, hist)
Here's a way to modify your existing approach to include column name as the title of each histogram, using the iris dataset as an example:
# loop over column *names* instead of actual columns
sapply(names(iris), function(cname){
  # (make sure we only plot the numeric columns)
  if (is.numeric(iris[[cname]]))
    # use the `main` param to put column name as plot title
    print(hist(iris[[cname]], main=cname))
})
After you run that, you'll be able to flip through the plots with the arrows in the viewer pane (assuming you're using RStudio).
Here's an example output:
p.s. check out grid::grob(), gridExtra::grid.arrange(), and related functions if you want to arrange the histograms onto a single plot window and save it to a single file.
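As a base-graphics alternative to the grid functions mentioned above (not the same technique, just a simpler one that needs no extra packages), `par(mfrow = ...)` can place several histograms in one plot window. This is a minimal sketch using iris; the 2 x 2 layout is an assumption that happens to fit its four numeric columns.

```r
# Numeric columns only (Species is a factor and is skipped)
num_cols <- names(iris)[sapply(iris, is.numeric)]

# 2 x 2 grid of panels on one device; keep the old settings to restore them
old_par <- par(mfrow = c(2, 2))
hists <- lapply(num_cols, function(cname) {
  # column name goes into both the title and the x-axis label
  hist(iris[[cname]], main = cname, xlab = cname)
})
par(old_par)
```

Wrapping the same code between `png("histograms.png")` and `dev.off()` (the file name is arbitrary) should write all four panels to a single file.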
How about this? Assuming you have wide data, you can transform it to long format with gather. Then use a ggplot solution with geom_histogram and facet_wrap:
library(tidyverse)
# make wide data (20 columns)
df <- matrix(rnorm(1000), ncol = 20)
df <- as.data.frame(df)
colnames(df) <- LETTERS[1:20]
# transform to long format (2 columns)
df <- gather(df, key = "name", value = "value")
# plot histograms per name
ggplot(df) +
  geom_histogram(aes(value)) +
  facet_wrap(~name, ncol = 5)
I have a dataframe (NBA_Data) that is 27,538 by 29. One of the columns is called "month", and there are eight months (October:June). I would like to write a function that automatically subsets this data frame by month and then creates 8 plot-objects in ggplot2. I'm getting out of my depth here, but I imagine that these would be stored in a single list ("My Plots").
I plan to use geom_point for each plot object to plot Points against Minutes.Played. I'm not familiar with grid.arrange, but I'm guessing that once I have "My Plots", I could use that (in some form) as an argument for grid.arrange.
I've tried:
empty_list <- list()
for (cat in unique(NBA_Data$month)){
  d <- subset(NBA_Data, month == cat)
  empty_list <- c(empty_list, d)
}
This gives an unbroken list that repeats all 29 columns for each month, with a length of 261. Not ideal, but workable maybe. Then I try using lapply to split the list, but I screw it up.
lapply(empty_list, split(empty_list, empty_list$month))
Error in match.fun(FUN) :
'split(x = empty_list, f = empty_list$month)' is not a function, character or symbol
In addition: Warning message:
In split.default(x = empty_list, f = empty_list$month) :
data length is not a multiple of split variable
Any suggestions?
Thank you.
You can use split to chunk the dataset into a list of data frames, one per month:
monthly <- split(data, data$month)
Also you can use facet_wrap to make multiple plots on one page with the same data if you're using ggplot.
library(ggplot2)
ggplot(data, aes(x = PlayerName, y = PPG)) + geom_point() + facet_wrap(~month)
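The loop in the question flattens because `c()` applied to a list and a data frame splices in the data frame's columns one by one, which would explain a length of 261 (29 columns times 9 subsets). A minimal sketch of the difference, using a made-up two-month miniature of NBA_Data with only three columns:

```r
# Hypothetical miniature of NBA_Data
NBA_Data <- data.frame(
  month          = rep(c("October", "November"), each = 3),
  Minutes.Played = c(30, 25, 12, 28, 31, 8),
  Points         = c(18, 11, 4, 15, 20, 2)
)

by_month <- list()
for (m in unique(NBA_Data$month)) {
  # [[<- stores the whole subset as ONE list element, named after the month
  by_month[[m]] <- NBA_Data[NBA_Data$month == m, ]
}

# c() instead splices the data frame's columns into the list
flattened <- c(list(), NBA_Data)

length(by_month)   # one element per month
length(flattened)  # one element per column
```

Each element of `by_month` is still a complete data frame, so it can be handed to `ggplot()` inside an `lapply()` call to build the list of plots, and that list can then be passed to `gridExtra::grid.arrange()` through its `grobs` argument.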
I am having an issue with the mutate function in dplyr.
I am trying to add a new column called state that changes whenever the V column restarts. (The V column repeats a sequence, rep(seq(100,2100,100),each=96), and each repetition corresponds to one dataset in my df.) Instead I get:
Error: impossible to replicate vector of size 8064
Here is a reproducible example of my df:
df <- data.frame (
No=(No= rep(seq(0,95,1),times=84)),
AC= rep(rep(c(78,110),each=1),times=length(No)/2),
AR = rep(rep(c(256,320,384),each=2),times=length(No)/6),
AM = rep(1,times=length(No)),
DQ = rep(rep(seq(0,15,1),each=6),times=84),
V = rep(rep(seq(100,2100,100),each=96),times=4),
R = sort(replicate(6, sample(5000:6000,96))))
labels <- rep(c("CAP-CAP","CP-CAP","CAP-CP","CP-CP"),each=2016)
I hard-coded the value 2016 here intentionally, since I know the number of rows in each dataset.
But I want to assign these labels automatically whenever the dataset changes, because the number of rows per dataset may differ in my real files. For this question, assume there is only one txt file, but keep in mind that there are many of them with different numbers of rows. The format is always the same.
I use dplyr to arrange my df
library("dplyr")
newdf<-df%>%mutate_each(funs(as.numeric))%>%
mutate(state = labels)
Is there an elegant way to do this?
Iff you know the number of data sets contained in df AND the column you're keying off --- here, V --- is ordered in df like it is in your toy data, then this works. It's pretty clunky, and there should be a way to make it even more efficient, but it produced what I take to be the desired result:
# You'll need dplyr for the lead() part
library(dplyr)
# Make a vector with the labels for your subsets of df
labels <- c("AP-AP","P-AP","AP-P","P-P")
# This line a) produces an index that marks the final row of each subset in df
# with a 1 and then b) produces a vector with the row numbers of the 1s
endrows <- which(grepl(1, with(df, ifelse(lead(V) - V < 0, 1, 0))))
# This line uses those row numbers or the differences between them to tell rep()
# how many times to repeat each label
newdf$state <- c(rep(labels[1], endrows[1]), rep(labels[2], endrows[2] - endrows[1]),
rep(labels[3], endrows[3] - endrows[2]), rep(labels[4], nrow(newdf) - endrows[3]))
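The same idea can be written without hard-coding four `rep()` calls, which may help when the number of subsets or their lengths vary. This is a sketch under the same assumption as above: V only ever decreases at the boundary between two data sets. The names `set_id` and `state` are introduced here for illustration, and only the V column of the toy data is rebuilt.

```r
# V column from the toy data: four data sets of 2016 rows each
V <- rep(rep(seq(100, 2100, 100), each = 96), times = 4)
labels <- c("AP-AP", "P-AP", "AP-P", "P-P")

# A new data set starts at row 1 and wherever V drops;
# the running count of those starts numbers the data sets 1, 2, 3, ...
set_id <- cumsum(c(TRUE, diff(V) < 0))

# Index the label vector by data-set number
state <- labels[set_id]
table(state)
```

With a data frame at hand, the result could be attached as `newdf$state <- state`. If there are more data sets than labels, the extra rows get NA, which at least makes the mismatch visible.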
I work with surveys and would like to export a large number of tables (drawn from data frames) into an .xlsx or .csv file. I use the xlsx package to do this. This package requires me to specify which column in the Excel file is the first column of the table. Because I want to place multiple tables side by side in the same file, the starting column for table n has to be the starting column of table (n-1), plus the width of table (n-1), plus some number of blank columns. To do this I planned on creating values like the following.
dt# is made by changing a table into a data frame.
table1 <- table(df$y, df$x)
dt1 <- as.data.frame.matrix(table1)
Here I make the values for the number of the starting column
startcol1 = 1
startcol2 = NCOL(dt1) + 3
startcol3 = NCOL(dt2) + startcol2 + 3
startcol4 = NCOL(dt3) + startcol3 + 3
And so on. I will probably need to produce somewhere between 50-100 tables. Is there a way in R to make this an iterative process so I can create the 50 values of starting columns without having to write 50+ lines of code with each one building on the previous?
I found material on Stack Overflow and other blogs about writing for loops or using apply-type functions in R, but this all seemed to deal with manipulating a vector as opposed to adding values to the workspace. Thanks
You can use a structure similar to this:
Your list of files to read:
file_list = list.files("~/test/",pattern="*csv",full.names=TRUE)
For each file, read and process the data frame and capture how many columns there are in the frame you are reading/processing:
columnsInEachFile = sapply(file_list,
  function(x)
  {
    df = read.csv(x,...) # with your appropriate arguments
    # do any necessary processing you require per file
    return(ncol(df))
  }
)
The cumulative sum of the number of columns plus 1 will indicate the start columns of a data frame that contains your processed data stuck next to each other:
columnsToStartDataFrames = cumsum(columnsInEachFile)+1
columnsToStartDataFrames = columnsToStartDataFrames[-length(columnsToStartDataFrames)] # last value is not the start of a data frame but the end
Assuming tab.lst is a list containing tables, then you can do:
cumsum(c(1, sapply(head(tab.lst, -1), ncol)))
Basically, what I'm doing here is I'm looping through all the tables but the last one (since that one's start col is determined by the second to last), and getting each table's width with ncol. Then I'm doing the cumulative sum over that vector to get all the start positions.
And here is how I created the tables (tables based on all possible combinations of columns in df):
df <- as.data.frame(replicate(5, sample(1:10))) # data frame with 5 columns
names(df) <- tail(letters, 5) # name the cols
name.combs <- combn(names(df), 2) # get all 2 col combinations
tab.lst <- lapply( # make tables for each 2 col combination
split(name.combs, col(name.combs)), # loop through every column in name.combs
function(x) table(df[[x[[1]]]], df[[x[[2]]]]) # ... and make a table
)
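The question also asks for blank columns between consecutive tables. A small extension of the cumulative-sum idea handles that. In this sketch, `widths` stands in for the result of calling `ncol` on each table, and `gap` is a hypothetical name for the number of blank columns wanted between tables.

```r
# Hypothetical table widths (what sapply(tab.lst, ncol) would return)
widths <- c(4, 2, 5)
gap <- 2   # blank columns between consecutive tables (assumed value)

# Table 1 starts in column 1; each later table starts after the previous
# table's width plus the gap. The last width is not needed.
starts <- cumsum(c(1, head(widths, -1) + gap))
starts
```

Here table 1 occupies columns 1 to 4, columns 5 and 6 stay blank, table 2 starts at column 7, and so on. Each entry of `starts` could then be used as the starting column when writing the corresponding table.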