String Conversion- For loop on categorical variables - r

Hi all, I am new to R and would appreciate any hints on this case.
I've been struggling to convert the variables (objects) in my dataframe to strings and plot them using a for loop, as detailed below.
COUNTRY: China Belgium ...
COMPANY: XXX Inc. YYY Inc. ...
Here, COUNTRY and COMPANY are categorical variables.
I've tried toString() as well as as.character() to convert each variable name to a string so I can build the plot name, but I can't get it to work. I need the 4 variables listed in the code below inside a for loop for 2 purposes:
as a string for naming the plot
as input to barplot()
But neither the string conversion nor the for loop works as I intended.
Could somebody assist me with the proper command for this purpose?
Your help is greatly appreciated...
Kind regards,
CODE
Frequency_COUNTRY <- table(COUNTRY) # Get Frequency for COUNTRY
Relative_Frequency_COUNTRY <- table(COUNTRY) / length(COUNTRY) # Get Relative Frequency (Percentage %) for Variable COUNTRY
Frequency_COMPANY <- table(COMPANY) # Get Frequency and Relative Frequency for COMPANY
Relative_Frequency_COMPANY <- table(COMPANY) / length(COMPANY)
Categorical_Variable_List = c(Frequency_COUNTRY,
                              Relative_Frequency_COUNTRY,
                              Frequency_COMPANY,
                              Relative_Frequency_COMPANY) # Get list of 4 variables above
for (Categorical_Variable in Categorical_Variable_List){ # Plot 4 variables using a for loop
  A = toString(Categorical_Variable) # Trying to convert non-string variable name to string
  plotName <- paste("BarChart_", A, sep = "_") # Specify plot name, e.g. BarChart_Frequency_COUNTRY
  png(file = plotName) # Create png file
  barplot(Categorical_Variable) # use barplot() to make graph
  dev.off() # Switch off dev
}

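Your code treats Categorical_Variable_List as if it were a named list of tables. It is neither: c() flattens the four tables into one plain vector of numbers, so the loop visits single values rather than whole tables, and toString() converts those values, not the variable name.
The following code corrects those errors and draws the four barplots in one 2 x 2 figure. To write one PNG file per plot instead, as in your code, remove the two calls to par (one before and one after the for loop) and uncomment the png() and dev.off() lines.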
I will make up a dataset, to test the code.
set.seed(1234)
n <- 20
COUNTRY <- sample(LETTERS[1:5], n, TRUE)
COMPANY <- sample(letters[1:4], n, TRUE)
Frequency_COUNTRY <- table(COUNTRY) # Get Frequency for COUNTRY
Relative_Frequency_COUNTRY <- table(COUNTRY) / length(COUNTRY)#Get Relative
# Frequency (Percentage %) for Variable COUNTRY
Frequency_COMPANY <- table(COMPANY) # Get Frequency and Relative Frequency for COMPANY
Relative_Frequency_COMPANY <- table(COMPANY) / length(COMPANY)
Variable_List <- list(Frequency_COUNTRY = Frequency_COUNTRY,
                      Relative_Frequency_COUNTRY = Relative_Frequency_COUNTRY,
                      Frequency_COMPANY = Frequency_COMPANY,
                      Relative_Frequency_COMPANY = Relative_Frequency_COMPANY) # Get list of 4 variables above
Variable_Name <- names(Variable_List)
old_par <- par(mfrow = c(2, 2))
for (i in seq_along(Variable_List)){ # Plot 4 variables using a for loop
  plotName <- paste("BarChart", Variable_Name[[i]], sep = "_") # Specify plot name
  print(plotName) # for debugging only
  #png(file = plotName) # Create png file
  barplot(Variable_List[[i]]) # use barplot() to make graph
  #dev.off() # Switch off dev
}
par(old_par)
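To see why the original loop misbehaved, and to save one PNG per table as the question intended, here is a minimal sketch. It uses made-up data, writes into tempdir() (chosen here only so the example leaves no files behind), and appends a .png extension, which the plot name in the code above does not have.

```r
set.seed(1234)
COUNTRY <- sample(LETTERS[1:5], 20, TRUE)
Frequency_COUNTRY <- table(COUNTRY)
Relative_Frequency_COUNTRY <- prop.table(Frequency_COUNTRY)

# c() flattens the tables into one plain named vector, so the variable
# names are lost and a for loop over it would visit single numbers.
flat <- c(Frequency_COUNTRY, Relative_Frequency_COUNTRY)
is.list(flat)

# A named list keeps each table intact and carries its name along.
Variable_List <- list(Frequency_COUNTRY = Frequency_COUNTRY,
                      Relative_Frequency_COUNTRY = Relative_Frequency_COUNTRY)

out_dir <- tempdir() # assumption: any writable directory would do
saved_files <- character(0)
for (nm in names(Variable_List)) {
  plot_file <- file.path(out_dir, paste0("BarChart_", nm, ".png"))
  png(filename = plot_file)               # open the file device
  barplot(Variable_List[[nm]], main = nm) # draw into it
  dev.off()                               # close the device so the file is written
  saved_files <- c(saved_files, plot_file)
}
basename(saved_files)
```

Looping over names(Variable_List) gives the name and the table in one step, so no string conversion of a variable name is needed.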

Related

Reading ZIP file of machine-written data won't "plot" in RStudio

Summary: Despite a complicated lead-up, the solution was very simple: In order to plot a row of a dataframe as a line instead of a lattice, I needed to transpose the data in order to invert from x obs of y variables to y obs of x variables.
I am using RStudio on a Windows 10 computer.
I am using scientific equipment to write measurements to a csv file. Then I ZIP several files and read to R using read.csv. However, the data frame behaves strangely. Commands "length" and "dim" disagree and the "plot" function throws errors. Because I can create simulated data that doesn't throw the errors, I think the problem is either in how the machine wrote the data or in my loading and processing of the data.
Two ZIP files are located in my stackoverflow repository (with "Monterey Jack" in the name):
https://github.com/baprisbrey/stackoverflow
Here is my code for reading and processing them:
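# Packages used below: str_remove() comes from stringr, %>% from magrittr
library(stringr)
library(magrittr)
# Unzip the folders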
unZIP <- function(folder){
orig.directory <- getwd()
setwd(folder)
zipped.folders <- list.files(pattern = ".*zip")
for (i in zipped.folders){
unzip(i)}
setwd(orig.directory)
}
folder <- "C:/Users/user/Documents/StackOverflow"
unZIP(folder)
# Load the data into a list of lists
pullData <- function(folder){
orig.directory <- getwd()
setwd(folder)
#zipped.folders <- list.files(pattern = ".*zip")
#unzipped.folders <- list.files(folder)[!(list.files(folder) %in% zipped.folders)]
unzipped.folders <- list.dirs(folder)[-1] # Removing itself as the first directory.
oData <- vector(mode = "list", length = length(unzipped.folders))
names(oData) <- str_remove(unzipped.folders, paste(folder,"/",sep=""))
for (i in unzipped.folders) {
filenames <- list.files(i, pattern = "*.csv")
#setwd(paste(folder, i, sep="/"))
setwd(i)
files <- lapply(filenames, read.csv, skip = 5, header = TRUE, fileEncoding = "UTF-16LE") #Note unusual encoding
oData[[str_remove(i, paste(folder,"/",sep=""))]] <- vector(mode="list", length = length(files))
oData[[str_remove(i, paste(folder,"/",sep=""))]] <- files
}
setwd(orig.directory)
return(oData)
}
theData <- pullData(folder) #Load the data into a list of lists
# Process the data into frames
bigFrame <- function(bigList) {
#where bigList is theData is the result of pullData
#initialize the holding list of frames per set
preList <- vector(mode="list", length = length(bigList))
names(preList) <- names(bigList)
# process the data
for (i in 1:length(bigList)){
step1 <- lapply(bigList[[i]], t) # transpose each data
step2 <- do.call(rbind, step1) # roll it up into its own matrix #original error that wasn't reproduced: It showed length(step2) = 24048 when i = 1 and dim(step2) = 48 501. Any comments on why?
firstRow <- step2[1,] #holding onto the first row to become the names
step3 <- as.data.frame(step2) # turn it into a frame
step4 <- step3[grepl("µA", rownames(step3)),] # Get rid of all those excess name rows
rownames(step4) <- 1:(nrow(step4)) # change the row names to rowID's
colnames(step4) <- firstRow # change the column names to the first row steps
step4$ID <- rep(names(bigList[i]),nrow(step4)) # Add an I.D. column
step4$Class[grepl("pos",tolower(step4$ID))] <- "Yes" # Add "Yes" class
step4$Class[grepl("neg",tolower(step4$ID))] <- "No" # Add "No" class
preList[[i]] <- step4
}
# bigFrame <- do.call(rbind, preList) #Failed due to different number of measurements (rows that become columns) across all the data sets
# return(bigFrame)
return(preList) # Works!
}
frameList <- bigFrame(theData)
monterey <- rbind(frameList[[1]],frameList[[2]])
# Odd behaviors
dim(monterey) #48 503
length(monterey) #503 #This is not reproducing my original error of length = 24048
rowOne <- monterey[1,1:(ncol(monterey)-2)]
plot(rowOne) #Error in plot.new() : figure margins too large
#describe the data
quantile(rowOne, seq(0, 1, length.out = 11) )
quantile(rowOne, seq(0, 1, length.out = 11) ) %>% plot #produces undesired lattice plot
# simulate the data
doppelganger <- sample(1:20461,501,replace = TRUE)
names(doppelganger) <- names(rowOne)
# describe the data
plot(doppelganger) #Successful scatterplot. (With my non-random data, I want a line where the numbers in colnames are along the x-axis)
quantile(doppelganger, seq(0, 1, length.out = 11) ) #the random distribution is mildly different
quantile(doppelganger, seq(0, 1, length.out = 11) ) %>% plot # a simple line of dots as desired
# investigating structure
str(rowOne) # results in a dataframe of 1 observation of 501 variables. This is a correct interpretation.
str(as.data.frame(doppelganger)) # results in 501 observations of 1 variable. This is not a correct interpretation but creates the plot that I want.
How do I convert the rowOne to plot like doppelganger?
It looks like one of my errors is not reproducing, where calls to "dim" and "length" apparently disagree.
However, I'm confused as to why the "plot" function is producing a lattice plot on my processed data and a line of dots on my simulated data.
What I would like is to plot each row of data as a line. (Next, and out of the scope of this question, is I would like to classify the data with adaboost. My concern is that if "plot" behaves strangely then the classifier won't work.)
Any tips or suggestions or explanations or advice would be greatly appreciated.
Edit: Investigating the structure with ("str") of the two examples explains the difference between plots. I guess my modified question is, how do I switch between the two structures to enable plotting a line (like doppelganger) instead of a lattice (like rowOne)?
I am answering my own question.
I am leaving behind the part about the discrepancy between "length" and "dim" since I can't provide a reproducible example. However, I'm happy to leave up for comment.
The answer is that in order to produce my plot, I simply have to transpose the row as follows:
rowOne %>% t() %>% as.data.frame() %>% plot
This inverts the structure from one observation of 501 variables to 501 obs of one variable as follows:
rowOne %>% t() %>% as.data.frame() %>% str()
#'data.frame': 501 obs. of 1 variable:
# $ 1: num 8712 8712 8712 8712 8712 ...
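For context, plot() on a data frame with more than two columns draws a scatterplot matrix (via pairs()), which is the "lattice" described in the question. As a sketch of an alternative that also puts the column names on the x-axis, the row can be flattened into a numeric vector. The four-column row_one below is invented, and the code assumes the column names parse as numbers, which may not hold for the real data.

```r
# Hypothetical stand-in for rowOne: one observation, columns named by step.
row_one <- data.frame("0" = 8712, "10" = 8690, "20" = 8705, "30" = 8731,
                      check.names = FALSE)

x <- as.numeric(names(row_one))         # column names become x values
y <- unlist(row_one, use.names = FALSE) # the single row becomes a plain vector

# One line per row, with the former column names along the x-axis.
plot(x, y, type = "l", xlab = "step", ylab = "measurement")

# The transpose performs the same reshaping: 4 rows, 1 column.
dim(t(row_one))
```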
Because of the unusual encoding I used, and the strange "length" result, I failed to see a simple solution to my "plot" problem.
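On the unreproduced "length" result: one possible explanation, offered as a guess since the original objects are not available, is that step2 was still a matrix at that point. length() counts every cell of a matrix but only the columns of a data frame, and 48 x 501 is 24048.

```r
m <- matrix(0, nrow = 48, ncol = 501) # same shape as the reported dim(step2)
dim(m)
length(m)  # number of cells: 48 * 501

df <- as.data.frame(m)
dim(df)
length(df) # number of columns only
```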

Using value-labels in R with sjlabelled package

Recently I have switched from STATA to R.
In STATA, you have something called a value label. Using the command encode, for example, allows you to turn a string variable into a numeric one, with a string label attached to each number. Since string variables contain names (which repeat themselves most of the time), using value labels allows you to save a lot of space when dealing with large datasets.
Unfortunately, I did not manage to find a similar command in R. The only package I have found that could attach labels to my vector of values is sjlabelled. It does the attachment, but when I try to merge the labelled numeric vector into another dataframe, the labels seem to "fall off".
Example: Start with a string variable.
paragraph <- "Melanija Knavs was born in Novo Mesto, and grew up in Sevnica, in the Yugoslav republic of Slovenia. She worked as a fashion model through agencies in Milan and Paris, later moving to New York City in 1996. Her modeling career was associated with Irene Marie Models and Trump Model Management"
install.packages("sjlabelled")
library(sjlabelled)
sentences <- strsplit(paragraph, " ")
sentences <- unlist(sentences, use.names = FALSE)
# Now we have a vector of string values.
sentrnces_df <- as.data.frame(sentences)
sentences <- unique(sentrnces_df$sentences)
group_sentences <- c(1:length(sentences))
sentences <- as.data.frame(sentences)
group_sentences <- as.data.frame(group_sentences)
z <- cbind(sentences,group_sentences)
z$group_sentences <- set_labels(z$group_sentences, labels = (z$sentences))
sentrnces_df <- merge(sentrnces_df, z, by = c('sentences'))
get_labels(z$group_sentences) # the labels I was attaching using set labels
get_labels(sentrnces_df$group_sentences) # the output is just “NULL”
Thanks!
P.S. Sorry about the inelegant code, as I said before, I'm pretty new in R.
source: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/
...
Around June of 2007, R introduced hashing of CHARSXP elements in the
underlying C code thanks to Seth Falcon. What this meant was that
effectively, character strings were hashed to an integer
representation and stored in a global table in R. Anytime a given
string was needed in R, it could be referenced by its underlying
integer. This effectively put in place, globally, the factor encoding
behavior of strings from before. Once this was implemented, there was
little to be gained from an efficiency standpoint by encoding
character variables as factor. Of course, you still needed to use
‘factors’ for the modeling functions.
...
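The closest base R analogue to Stata's encode is probably a factor, which stores integer codes plus one copy of each distinct string. The sketch below uses invented city names and only illustrates the idea behind the quote; it is not the sjlabelled approach from the question.

```r
words <- c("Paris", "Milan", "Paris", "New York", "Milan")

f <- factor(words)     # integer codes with the distinct strings as levels
codes <- as.integer(f) # the numeric representation
labels <- levels(f)    # one copy of each distinct string

# The original strings can be rebuilt from codes and labels.
restored <- labels[codes]
identical(restored, words)
```

As the quote notes, storing strings as factors no longer saves much memory in R, so the main reason to do this today is modeling or ordered categories.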
I adjusted your initial test data a little. I was confused by the many strings and am unsure whether they are necessary for this issue, so let me know if I missed a point. Here is my adjustment and the answer:
#####################################
# initial problem rephrased
#####################################
# create test data
id = seq(1:20)
variable1 = sample(30:35, 20, replace=TRUE)
variable2 = sample(36:40, 20, replace=TRUE)
df1 <- data.frame(id, variable1)
df2 <- data.frame(id, variable2)
# set arbitrary labels
df1$variable1 <- set_labels(df1$variable1, labels = c("few" = 1, "lots" = 5))
# show labels in this frame
get_labels(df1)
# include associated values
get_labels(df1, values = "as.prefix")
# merge df1 and df2
df_merge <- merge(df1, df2, by = c('id'))
# labels lost after merge
get_labels(df_merge, values = "as.prefix")
#####################################
# solution with dplyr
#####################################
library(dplyr)
df_merge2 <- left_join(x = df1, y = df2, by = "id")
get_labels(df_merge2, values = "as.prefix")
Solution attributed to:
Merging and keeping variable labels in R

Loop over number preceded by underscore symbol in R

I apologize in advance but I did not find what I need in previous topic-related posts.
Suppose that I have the following data. "bchain" is a dataframe of 2192 observations. The column "Date" contains dates from 2011/01/01 to 2016/12/31. The column "Value" contains daily exchange rates.
>bchain
Date Value
1 2011-01-01 0.299998
2 2011-01-02 0.299996
3 2011-01-03 0.299998
4 2011-01-04 0.299899
5 2011-01-05 0.298998
6 2011-01-06 0.299000
7 2011-01-07 0.322000
8 2011-01-08 0.322898
. ....... .......
What I want to do is to visualize the exchange rates year by year in separate plots and save the six graphs on my desktop by using a "for" loop. Consider the following simple pseudo-code, which I built around the content of this post:
https://www.r-bloggers.com/automatically-save-your-plots-to-a-folder/
PSEUDO-CODE:
Date_2011=bchain[1:365,1]
Date_2012=bchain[366:731,1]
Date_2013=bchain[732:1096,1]
Date_2014=bchain[1097:1461,1]
Date_2015=bchain[1462:1826,1]
Date_2016=bchain[1827:2192,1]
bchain_2011=bchain[1:365,2]
bchain_2012=bchain[366:731,2]
bchain_2013=bchain[732:1096,2]
bchain_2014=bchain[1097:1461,2]
bchain_2015=bchain[1462:1826,2]
bchain_2016=bchain[1827:2192,2]
years=2011:2016
for(i in years){
mypath = file.path("C:/Users/toshiba1/Desktop",paste("myplot_", years[i], ".jpg", sep = ""))
jpeg(file=mypath)
mytitle = paste("my title is", years[i])
plot(Date_[i],bchain_[i], main = mytitle)
dev.off()
}
Then I get the following error message: object "Date_" not found. I suspect that the problem is that the above loop does not recognize the numbers which come after the underscore sign. So, any suggestion?
Thank you in advance.
Here is another approach that avoids the need to make the year-specific data frames. I used the lubridate package to extract the year from the date values, generated a data.frame for that year, and plotted those data. As @Konrad also pointed out, the way in which you call some of the objects is giving you issues. I cleaned up some of those in your paste statements below.
library(lubridate)
# Create toy data to plot
bchain <- data.frame(Date = seq.Date(from = as.Date("2011-01-01"), to = as.Date("2016-12-31"),
by = 1),
Value = runif(2192, 0, 1))
years <- 2011:2016
for(i in years){
# Create dataset of just data to plot
bchain_plot <- bchain[year(bchain$Date) == i, ]
# Edited file name w/i jpeg call and fixed paste statement
jpeg(filename=paste0("C:/Users/toshiba1/Desktop/myplot_", i, ".jpg"))
# Plot data w/ title included in plot call
plot(bchain_plot$Date, bchain_plot$Value, main = paste("my title is", i))
dev.off()
}
You should refer to your objects properly. One approach is to use get, along these lines:
# Now plot data number i
x <- get(paste("Date", i, sep = "_"))
# Plot
plot(x)
or simply by nesting:
plot(get(paste("Date", i, sep = "_")))
To test it, type Date_[i] in the R console and see what happens. Do you get the object you want to pass to the plot function? Arrive at the desired object via get, or any other mechanism that suits you, and then pass it to the plotting function.
I reckon that you want to iterate through your objects, so you need i, not [i]. The loop variable already holds the year itself (2011, 2012, ...), so years[i] looks up position 2011 of a six-element vector and returns NA.
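As a sketch of an alternative to get(), the data can be split into a named list keyed by year, which avoids building object names from strings altogether. The values below are made up, and tempdir() stands in for the desktop path so the example leaves nothing behind.

```r
years <- 2011:2016
years[2011] # NA: i already holds the year, it is not a position in years

# Toy data covering the same date range as the question.
bchain <- data.frame(
  Date  = seq(as.Date("2011-01-01"), as.Date("2016-12-31"), by = "day"),
  Value = seq_len(2192) / 2192 # made-up values
)

# One data frame per year, named "2011" ... "2016".
by_year <- split(bchain, format(bchain$Date, "%Y"))

out_dir <- tempdir() # assumption: replace with the folder you want
for (yr in names(by_year)) {
  d <- by_year[[yr]]
  jpeg(filename = file.path(out_dir, paste0("myplot_", yr, ".jpg")))
  plot(d$Date, d$Value, type = "l", main = paste("my title is", yr))
  dev.off()
}
```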

Have trouble running googlevis with my dataset

I am new to R programming. I was trying to visualize a dataset with googleVis in R, but could not get it to work.
The error I got was:
Error: Length of logical index vector must be 1 or 8, got: 14835
Can someone help?
Dataset is here:
https://www.kaggle.com/c/predict-west-nile-virus/data
Code is below
# Read competition data files:
library(readr)
data_dir <- "C:/Users/Wesley/Desktop/input"
train <- read_csv(file.path(data_dir, "train.csv"))
spray <- read_csv(file.path(data_dir, "spray.csv"))
# Generate output files with write_csv(), plot() or ggplot()
# Any files you write to the current directory get shown as outputs
# Install and read packages
library(lubridate)
library(googleVis)
# Create useful date columns
spray$Date <- as.Date(as.character(spray$Date),format="%Y-%m-%d")
spray$Week <- isoweek(spray$Date)
spray$Year <- year(spray$Date)
# Create a total count of measurements
spray$Total <- 1
for(i in 1:nrow(spray)) {
spray$Total[i] = i
}
# Aggregate data by Year, Week, Trap and order by old-new
spray_agg <- aggregate(cbind(Total)~Year+Week+Latitude+Longitude,data=spray,sum)
spray_agg <- spray[order(spray$Year,spray$Week),]
# Create a misc format for Week for Google Vis Motion Chart
spray_agg$Week_Format <- paste(spray_agg$Year,"W",spray_agg$Week,sep="")
# Function to create a motion chart together with a overview table
# It takes the aggregated data as input as well as a year of choice (2007,2009,2011,2013)
# It filters out "no presence" weeks since they distort the graphical view
# Next to that it creates an overview table of that year
# With gvisMerge you can merge the 3 html outputs into 1
create_motion <- function(data=spray_agg,year=2011){
data_motion <- data[data$Year==year]
motion <- gvisMotionChart(data=data_motion,idvar="Total",timevar="Week_Format",xvar="Longitude",yvar="Latitude"
,sizevar=0.1,colorvar="Blue",options=list(width="600"))
return(motion)
}
# Get the per year motion charts
#motion1 <- create_motion(spray_agg,2007)
#motion2 <- create_motion(spray_agg,2009)
motion3 <- create_motion(spray_agg,2011) # Error: Length of logical index vector must be 1 or 8, got: 14835
motion4 <- create_motion(spray_agg,2013) # Error: Length of logical index vector must be 1 or 8, got: 14835
# Merge them together into 1 dashboard
output <- gvisMerge(gvisMerge(motion1,motion2,horizontal=TRUE),gvisMerge(motion3,motion4,horizontal=TRUE),horizontal=FALSE)
plot(output)
# Plot the output in your browser
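A likely cause of the error, stated tentatively since it has not been checked against the Kaggle files: data[data$Year==year] has no comma, so the logical vector (one entry per row) is used to pick columns rather than rows. The sketch below reproduces the pattern on a small invented base data frame. read_csv() returns a tibble, which words the error differently from base R.

```r
# Invented stand-in for the spray data: 4 rows, 2 columns.
spray_demo <- data.frame(Year = c(2007, 2011, 2011, 2013),
                         Week = c(30, 31, 32, 33))
keep <- spray_demo$Year == 2011 # one TRUE/FALSE per row

# Without a comma the logical vector indexes columns, which fails here
# because it is longer than the number of columns.
no_comma <- tryCatch(spray_demo[keep], error = function(e) e)
inherits(no_comma, "error")

# With a comma the same vector selects rows.
rows_2011 <- spray_demo[keep, ]
nrow(rows_2011)
```

If this is indeed the cause, changing the line inside create_motion to data[data$Year==year, ] would be the corresponding fix.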

R ncdf package - put.var.ncdf requiring incorrect number of dimensions

I am organizing weather data into netCDF files in R. Everything goes fine until I try to populate the netcdf variables with data, because it is asking me to specify only one dimension for two-dimensional variables.
library(ncdf)
These are the dimension tags for the variables. Each variable uses the Threshold dimension and one of the other two dimensions.
th <- dim.def.ncdf("Threshold", "level", c(5,6,7,8,9,10,50,75,100))
rt <- dim.def.ncdf("RainMinimum", "cm", c(5, 10, 25))
wt <- dim.def.ncdf("WindMinimum", "m/s", c(18, 30, 50))
The variables are created in a loop, and there are a lot of them, so for the sake of easy understanding, in my example I'll only populate the list of variables with one variable.
vars <- list()
v1 <- var.def.ncdf("ARMM_rain", "percent", list(th, rt), -1, prec="double")
vars[[length(vars)+1]] <- v1
ncdata <- create.ncdf("composite.nc", vars)
I use another loop to extract data from different data files into a 9x3 data frame named subframe while iterating through the variables of the netcdf file with varindex. For the sake of reproducing, I'll give a quick initialization for these values.
varindex <- 1
subframe <- data.frame(matrix(nrow=9, ncol=3, rep(.01, 27)))
The desired outcome from there is to populate each ncdf variable with the contents of subframe. The code to do so is:
for(x in 1:9) {
for(y in 1:3) {
value <- ifelse(is.na(subframe[x,y]), -1, subframe[x,y])
put.var.ncdf(ncdata, varindex, value, start=c(x,y), count=1)
}
}
The error message is:
Error in put.var.ncdf(ncdata, varindex, value, start = c(x, y), count = 1) :
'start' should specify 1 dims but actually specifies 2
tl;dr: I have defined two-dimensional variables using ncdf in R, I am trying to write data to them, but I am getting an error message because R believes they are single-dimensional variables instead.
Anyone know how to fix this error?

Resources