Failure to create reproducible example by means of replicate/dput function - r

I'm trying to use dput() to create a reproducible example with a large database. The database needs to be large because the reproducible example involves moving averages. The way I've found to do this involves the function reproduce, shared in "How to make a great R reproducible example?" by @Ricardo Saporta. reproduce is based on dput() (code here: https://github.com/rsaporta/pubR/blob/gitbranch/reproduce.R).
library(data.table)
library(devtools)
source_url("https://raw.github.com/rsaporta/pubR/gitbranch/reproduce.R")
data <- read.table("http://pastebin.com/raw/xP1Zd0sC")
setDF(data)
reproduce(data, rows = c(1:100))
That code creates the dataframe data, and then provides a dput() output for it. It uses the rows argument to output the full dataframe. Yet if I use that output to recreate the dataframe, it fails.
Trying to allocate the dput() output to a new dataframe results in incomplete code, requiring me to add three parentheses manually at the end. And after doing so, I get the following error message: "Error in View : arguments imply differing number of rows: 100, 61".
Please note that the dput() output from reproduce without the rows = c(1:100) argument works fine. It just does not output the full dataframe, but rather a sample of it.
#This works fine
reproduce(data)
Please also note that I used the pastebin method to create this reproducible example. That method does not replace the dput() method for my purposes because it fails whenever trying to import data where some columns have spaces between the words (e.g. dataframes with datetime stamps).
EDIT: after some further troubleshooting, I discovered that reproduce fails as described above when the rows argument is used together with a dataframe containing 4 or more columns. I will have to find an alternative.
If anyone is interested in testing this, run the code above with the following links, all containing different number of columns:
1) 100x5: http://pastebin.com/raw/xP1Zd0sC
2) 100x4: http://pastebin.com/raw/YZtetfne
3) 100x4: http://pastebin.com/raw/63Ap2bh5
4) 100x3: http://pastebin.com/raw/1vMMcMtx
5) 100x3: http://pastebin.com/raw/ziM1bYQt
6) 100x1: http://pastebin.com/raw/qxtQs5u4

If you are just trying to dput() the first 100 rows of a data set, then you can simply subset the data just prior to running dput(). There doesn't seem to be a need to use the linked function.
dput(droplevels(head(data, 100))) ## or dput(droplevels(data[1:100,]))
should do it.
It is, however, peculiar that your attempt with reproduce() did not work. I would file an issue on the GitHub page for that. You will likely get a more constructive answer there.
Thanks to @David Arenburg for reminding me about droplevels(). It is useful in this operation if we have factor columns. "Leftover" levels will be dropped.
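As a minimal sketch of the round trip described above, the following uses the built-in iris data as a stand-in for the asker's data (an assumption, chosen only because it has a factor column). It captures the dput() output as text and rebuilds the data frame from it.

```r
# iris is only a stand-in for the real data; it has a factor column (Species)
sub <- droplevels(head(iris, 100))   # first 100 rows, unused factor levels dropped

# Capture what dput() would print, as a character vector of code lines
txt <- capture.output(dput(sub))

# Evaluate that text to rebuild the data frame, as a reader of the post would
rebuilt <- eval(parse(text = txt))

nrow(rebuilt)             # number of rows recovered
levels(rebuilt$Species)   # only the levels that occur in the first 100 rows
```

Without droplevels(), the dput() text would still list the third species as a level even though no row uses it.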

Related

Column of original data is returning length as 0 in R

I am working in R studio and trying to create a table. The error I keep getting is "Error in table(players, fitmod1$classification) : all arguments must have the same length". When I check the length of my data, fitmod1$classification is returning a value; but players is returning 0. I have no idea how to fix this.
players is a qualitative column of the Hitters data in the R package ISLR. fitmod1 is an mclust model. I am attaching my code below so hopefully that helps! Thanks!
Your issue is that the players are the row names and not an actual column of the data. So when you subset the Hitters data frame with:
players <- Hitters[,0]
you end up with an empty dataframe (though the rows are still named, which is what you are seeing when you view it in RStudio).
Instead you want to get the row names and store them as a vector:
players <- row.names(Hitters)
You will now be able to generate a table.
Here is all of the code. (By the way, it is much easier for us as a community to answer your questions if you use the code feature in Stack Overflow rather than attaching a png. This way we can copy and paste your code rather than having to type it by hand.)
library(ISLR)
library(mclust)
data(Hitters)
Hitters=Hitters[,c(1:7)]
Hitters<-na.omit(Hitters)
players <- row.names(Hitters)
fitmod1<-Mclust(Hitters, G=3, modelNames=c("VEE"))
table(players, fitmod1$classification)
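The difference between the two subsetting forms can be seen without installing ISLR. This sketch uses the built-in mtcars data purely as a stand-in for Hitters (an assumption), since its car names are also stored as row names rather than as a column.

```r
# mtcars stands in for Hitters: the car names are row names, not a column
empty <- mtcars[, 0]             # zero columns selected: a data frame with no columns
length(empty)                    # 0, which is why table() complains about lengths

car_names <- row.names(mtcars)   # the names as a plain character vector
length(car_names)                # one entry per row

# Both arguments now have the same length, so table() works
tab <- table(car_names, mtcars$cyl)
dim(tab)
```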

How to block bootstrap in R?

I'm trying to run a block bootstrapping function on some time series data (monthly interest rates for ~15 years).
My data is in a csv file with no header, all comprising one column and going down by row.
I installed the package bootstrap because tsboot wouldn't work for me.
Here is my code:
testFile = read.csv("\\Users\\unori/sample_data.csv")
theta <- function(x){mean(x)}
results = bootstrap(testFile,100,theta)
It tells me there are at least 50 errors. All of them say "In mean.default(x) : argument is not numeric or logical: returning NA"
What to do? It runs when I use the example in the documentation. I think it must be how my data is stored/imported?
Thanks in advance.
Try to supply a working, minimal example that reproduces your problem! Check here to see how to make a minimal reproducible example.
The error message tells you that the thing you want to calculate the mean of is not a number, so R just returns NA.
Suggestions for debugging:
Does the object 'testFile' exist?
What is the output of
str(testFile)
This works for me:
library(bootstrap)
testFile <- cars[,1]
theta <- function(x){mean(x)}
results = bootstrap(testFile,100,theta)
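One likely cause, given that the file has no header and a single column, is that read.csv() returns a data frame and treats the first value as a column name. The sketch below uses only base R and a temporary file with made-up values (both are assumptions for illustration) to show the difference.

```r
# Write a small headerless, single-column CSV with made-up values
tmp <- tempfile(fileext = ".csv")
writeLines(c("1.5", "2.0", "2.5", "3.0"), tmp)

# Default read: header = TRUE, so the first value is used as the column name
wrong <- read.csv(tmp)
str(wrong)                        # a data frame with one row fewer than the file
suppressWarnings(mean(wrong))     # NA: mean() does not accept a data frame

# Read without a header and pull out the column as a numeric vector
right <- read.csv(tmp, header = FALSE)[[1]]
str(right)
mean(right)
```

A vector like right is the kind of object that would then be passed as the first argument to bootstrap().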

Looping in R to create transformed variables

I have a dataset of 80 variables, and I want to loop through a subset of 50 of them and construct returns. I have a list of the names of the variables for which I want to construct returns, and am attempting to use the dplyr command mutate to construct the variables in a loop. Specifically my code is:
for (i in returnvars) {
alldta <- mutate(alldta,paste("r",i,sep="") = (i - lag(i,1))/lag(i,1))}
where returnvars is my list, and alldta is my dataset. When I run this code outside the loop with just one of the `i' values, it works fine. The code for that looks like this:
alldta <- mutate(alldta,rVar = (Var- lag(Var,1))/lag(Var,1))
However, when I run it in the loop (e.g., attempting to do the previous line of code 50 times for 50 different variables), I get the following error:
Error: unexpected '=' in:
"for (i in returnvars) {
alldta <- mutate(alldta,paste("r",i,sep="") ="
I am unsure why this issue is coming up. I have looked into a number of ways to try and do this, and have attempted solutions that use lapply as well, without success.
Any help would be much appreciated! If there is an easy way to do this with one of the apply commands as well, that would be great. I did not provide a dataset because my question is not data specific, I'm simply trying to understand, as a relative R beginner, how to construct many transformed variables at once and add them to my data frame.
EDIT: As per Frank's comment, I updated the code to the following:
for (i in returnvars) {
varname <- paste("r",i,sep="")
alldta <- mutate(alldta,varname = (i - lag(i,1))/lag(i,1))}
This fixes the previous error, but I am still not referencing the variable correctly, so I get the error
Error in "Var" - lag("Var", 1) :
non-numeric argument to binary operator
Which I assume is because R sees my variable name Var as a string, rather than as a variable. How would I correctly reference the variable in my dataset alldta? I tried get(i) and alldta$get(i), both without success.
I'm also still open to (and actively curious about), more R-style ways to do this entire process, as opposed to using a loop.
Using mutate inside a loop might not be a good idea either. I am not sure if mutate makes a copy of the data frame, but it's generally not good practice to grow a data frame inside a loop. Instead, create a separate data frame with the output and then name the columns based on your logic.
result = as.data.frame(do.call(cbind, lapply(returnvars, function(i) {...})))
names(result) = paste("r",returnvars,sep="")
After playing around with this more, I discovered (thanks to Frank's suggestion), that the following works:
extended <- alldta # Make a copy of my dataset
for (i in returnvars) {
varname <- paste("r",i,sep="")
extended[[varname]] = (extended[[i]] - lag(extended[[i]],1))/lag(extended[[i]],1)}
This is still not very R-styled in that I am using a loop, but for a task that is only repeating about 50 times, this shouldn't be a large issue.
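For comparison, here is a loop-free sketch of the same calculation in base R. The small alldta and returnvars below are made-up stand-ins, and lag1 is a hypothetical helper defined here so the example does not depend on dplyr's lag().

```r
# Made-up stand-ins for the real alldta and returnvars
alldta <- data.frame(A = c(100, 110, 121), B = c(50, 55, 44), id = 1:3)
returnvars <- c("A", "B")

# Hypothetical helper: shift a vector down by one position (first element becomes NA)
lag1 <- function(x) c(NA, head(x, -1))

# Compute one return series per selected column; columns are picked by name
returns <- lapply(alldta[returnvars], function(x) (x - lag1(x)) / lag1(x))
names(returns) <- paste0("r", returnvars)

# Attach the new columns to a copy of the original data
extended <- cbind(alldta, as.data.frame(returns))
extended
```

Selecting columns with alldta[returnvars] is what lets the variable names stay as strings, which was the sticking point with mutate.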

Strangeness with filtering in R and showing summary of filtered data

I have a data frame loaded using the CSV Library in R, like
mySheet <- read.csv("Table.csv", sep=";")
I now can print a summary on that mySheet object
summary(mySheet)
and it will show me a summary for each column. For example, one column named Diagnose has the unique values RCM, UCM, HCM, and it shows the number of occurrences of each of these values.
I now filter by a diagnose, like
subSheet <- mySheet[mySheet$Diagnose=='UCM',]
which seems to be working, when I just type subSheet in the console it will print only the rows where the value has been matched with 'UCM'
However, if I do a summary on that subSheet, like
summary(subSheet)
it still 'knows' about the other two possibilities RCM and HCM and prints those having a value of 0. However, I expected that the new created object will NOT know about the possible values of the original mySheet I initially loaded.
Is there any way to get rid of those other possible values after filtering? I also tried subset but this one just seems to be some kind of shortcut to '[' for the interactive mode... I also tried DROP=TRUE as option, but this one didn't change the game.
Totally mind squeezing :D Any help is highly appreciated!
What you are dealing with here are factors, created when the csv file was read. You can get subSheet to forget the unused factor levels with
subSheet$Diagnose <- droplevels(subSheet$Diagnose)
or
subSheet$Diagnose <- subSheet$Diagnose[ , drop=TRUE]
just before you do summary(subSheet).
Personally I dislike factors, as they cause me too many problems, and I only convert strings to factors when I really need to. So I would have started with something like
mySheet <- read.csv("Table.csv", sep=";", stringsAsFactors=FALSE)
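The effect is easy to see on a small made-up data frame (the values and the Age column below are assumptions, standing in for Table.csv). This sketch filters on one level and then compares the levels before and after droplevels().

```r
# Made-up data standing in for Table.csv; Diagnose is stored as a factor
mySheet <- data.frame(
  Diagnose = factor(c("RCM", "UCM", "HCM", "UCM")),
  Age      = c(34, 51, 47, 62)
)

subSheet <- mySheet[mySheet$Diagnose == "UCM", ]
levels_before <- levels(subSheet$Diagnose)   # still lists all three diagnoses

# droplevels() on a data frame drops unused levels from every factor column
subSheet <- droplevels(subSheet)
levels_after <- levels(subSheet$Diagnose)    # only the level that is present

summary(subSheet)
```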

CSV file to Histogram in R

I'm a total newbie with R, and I'm trying to create a histogram (with value and frequency as the axes) from a csv file (just one row of values). Any idea how I can do this?
I'm also an R newbie, and I ran into the same thing. I made two separate mistakes, actually, so I'll describe them both here.
Mistake 1: Passing a frequency table to hist(). Originally I was trying to pass a frequency table to hist() instead of passing in the raw data. One way to fix this is to use the rep() ("replicate") function to explode your frequency table back into a raw dataset, as described here:
Creating a histogram using aggregated data
Simple R (histogram) from counted csv file
Instead of that, though, I just decided to read in my original dataset instead of the frequency table.
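For anyone who only has the frequency table, the rep() approach looks roughly like this. The value and count columns are made-up names and numbers, used only for illustration.

```r
# Made-up frequency table: each value and how many times it occurred
freq <- data.frame(value = c(1, 2, 3), count = c(2, 1, 3))

# Repeat each value 'count' times to rebuild the raw observations
raw <- rep(freq$value, times = freq$count)
raw

# hist() can now bin the raw data itself
hist(raw)
```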
Mistake 2: Wrong data type. My raw data CSV file contains two columns: hostname and bookings (idea is to count the number of bookings each host generated during some given time period). I read it into a table.
> tbl <- read.csv('bookingsdata.csv')
Then when I tried to generate a histogram off the second column, I did this:
> hist(tbl[2])
This gave me the "'x' must be numeric" error you mention in a comment. (tbl[2] returns a one-column data frame rather than a numeric vector, and hist() needs the vector.)
This fixed it:
> hist(tbl$bookings)
You should really start to read some basic R manual...
CRAN offers a lot of them (look into the Manuals and Contributed sections)
In any case:
setwd("path/to/csv/file")
myvalues <- read.csv("filename.csv", header = FALSE) # the file has no header row
hist(unlist(myvalues), 100) # hist() needs a numeric vector, not a data frame; 100 breaks here, but you can specify them at will
See the manual pages for those functions for more help (accessible through ?read.table, ?read.csv and ?hist).
To plot the histogram, the values must be of class numeric. Here, x seems to be of some other class.
Run the following command and see:
sapply(myvalues[1,],class)
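Putting these pieces together for a file with a single row of values, a minimal sketch might look like the following. The temporary file and its contents are made up for illustration.

```r
# Write a made-up CSV containing one row of values and no header
tmp <- tempfile(fileext = ".csv")
writeLines("3,5,5,7,8,8,8", tmp)

# header = FALSE keeps the only row as data instead of using it for column names
myvalues <- read.csv(tmp, header = FALSE)
sapply(myvalues[1, ], class)     # check that every column was read as a number

# Flatten the one-row data frame into a plain numeric vector for hist()
values <- unlist(myvalues, use.names = FALSE)
hist(values)
```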
