I'm running into problems converting a factor to a Date; the conversion produces NA values, which I don't want.
The data for my problem can be found here: (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip)
x <- read.csv("activity.csv")
head(x)
steps date interval
1 NA 2012-10-01 0
2 NA 2012-10-01 5
3 NA 2012-10-01 10
4 NA 2012-10-01 15
5 NA 2012-10-01 20
6 NA 2012-10-01 25
Goal: I'm trying to find the mean total number of steps taken per day. So first, I need to bin the values so that each data point corresponds to the sum for a given day:
x$Day <- as.Date(cut(x$date, breaks = "day"))
Error in cut.default(x$date, breaks = "day") : 'x' must be numeric
Just to confirm this with the class() function:
class(x[,2])
"factor"
This is weird, because from the head(x) output above the column looked like a Date. Anyway, in order to bin the values with cut() so that each data point corresponds to the sum for a given day, I need to change the dates to class "Date":
x[,2] <- as.Date(x[,2], format="%Y/%m/%d")
class(x[,2])
[1] "Date"
OK, so in theory I should be able to bin the values now:
x$Day <- as.Date(cut(x$date, breaks = "day"))
Error in seq.int(0, to0 - from, by) : 'to' cannot be NA, NaN or infinite
In addition: Warning messages:
1: In min.default(c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, :
no non-missing arguments to min; returning Inf
2: In max.default(c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, :
no non-missing arguments to max; returning -Inf
head(is.na(x))
steps date interval
[1,] TRUE TRUE FALSE
[2,] TRUE TRUE FALSE
[3,] TRUE TRUE FALSE
[4,] TRUE TRUE FALSE
[5,] TRUE TRUE FALSE
[6,] TRUE TRUE FALSE
If I compare this to what I saw before running x[,2] <- as.Date(x[,2], format="%Y/%m/%d"):
head(is.na(x))
steps date interval
[1,] TRUE FALSE FALSE
[2,] TRUE FALSE FALSE
[3,] TRUE FALSE FALSE
[4,] TRUE FALSE FALSE
[5,] TRUE FALSE FALSE
[6,] TRUE FALSE FALSE
I'm not sure what's going on here. I know this should work, because I got the idea from the following tutorial (http://blog.mollietaylor.com/2013/08/plot-weekly-or-monthly-totals-in-r.html?m=1)
sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_Canada.1252
[2] LC_CTYPE=English_Canada.1252
[3] LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_Canada.1252
attached base packages:
[1] stats graphics grDevices utils
[5] datasets methods base
other attached packages:
[1] scales_0.2.4 ggplot2_1.0.0
loaded via a namespace (and not attached):
[1] colorspace_1.2-4 digest_0.6.4
[3] grid_3.0.3 gtable_0.1.2
[5] MASS_7.3-29 munsell_0.4.2
[7] plyr_1.8.1 proto_0.3-10
[9] Rcpp_0.11.1 reshape2_1.4
[11] stringr_0.6.2 tools_3.0.3
Just to illustrate, these all result in the same output (except for the class of the date column of course):
x <- read.csv("~/Downloads/activity.csv")
# Date is a factor
r1 <- aggregate(steps~date,data = x,FUN = mean)
x1 <- read.csv("~/Downloads/activity.csv",stringsAsFactors = FALSE)
# Date is a character
r2 <- aggregate(steps~date,data = x1,FUN = mean)
x2 <- x
x2$date <- as.Date(as.character(x$date))
# Date is a date
r3 <- aggregate(steps~date,data = x2,FUN = mean)
my_data <- read.csv(your_file, stringsAsFactors = FALSE)
# Convert 'my_data$date' to Date format
my_data$date <- as.Date(my_data$date)
This should work...
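The NA values in the question most likely come from the format string: the dates in the file use dashes ("2012-10-01"), while the conversion asked for slashes ("%Y/%m/%d"). Here is a minimal sketch of that mismatch, using a few date strings copied from head(x) and some made-up step counts (the toy data frame and its values are assumptions for illustration, not taken from activity.csv):

```r
# Stand-in for the 'date' column (strings copied from head(x)), stored as a factor
dates <- factor(c("2012-10-01", "2012-10-01", "2012-10-02"))

# Slash-separated format does not match the dash-separated strings: every value is NA
bad <- as.Date(dates, format = "%Y/%m/%d")
print(bad)

# A format that matches the strings parses every value
good <- as.Date(dates, format = "%Y-%m-%d")
print(good)

# Hypothetical step counts, only to illustrate "total per day, then mean of totals"
toy <- data.frame(steps = c(10, 20, 5), date = good)
daily <- aggregate(steps ~ date, data = toy, FUN = sum)  # one row per day
print(daily)
print(mean(daily$steps))
```

Note that FUN = sum gives the total per day, which is what the stated goal needs before averaging; FUN = mean would give the mean per interval within each day instead.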
I get a numeric to integer64 type conversion after melting a data.table object in R.
Given the file stats.txt, tab separated:
id x y
A 283726709252 0.1
B 288604342155 0.2
C 329048184196 0.3
D 192107948937 0.4
I want to read it into a data.table and melt it. So:
library(data.table)
stats<- fread('stats.txt')
stats
id x y
1: A 283726709252 0.1
2: B 288604342155 0.2
3: C 329048184196 0.3
4: D 192107948937 0.4
str(stats)
Classes ‘data.table’ and 'data.frame': 4 obs. of 3 variables:
$ id: chr "A" "B" "C" "D"
$ x :integer64 283726709252 288604342155 329048184196 192107948937
$ y : num 0.1 0.2 0.3 0.4
- attr(*, ".internal.selfref")=<externalptr>
So far so good. Now if I melt it, I get the y variable converted from numeric to integer64:
xm<- melt.data.table(data= stats, id.vars= 'id')
xm
id variable value
1: A x 283726709252
2: B x 288604342155
3: C x 329048184196
4: D x 192107948937
5: A y 4591870180066957722
6: B y 4596373779694328218
7: C y 4599075939470750515
8: D y 4600877379321698714
str(xm)
Classes ‘data.table’ and 'data.frame': 8 obs. of 3 variables:
$ id : chr "A" "B" "C" "D" ...
$ variable: Factor w/ 2 levels "x","y": 1 1 1 1 2 2 2 2
$ value :integer64 283726709252 288604342155 329048184196 192107948937 4591870180066957722 4596373779694328218 4599075939470750515 4600877379321698714
- attr(*, ".internal.selfref")=<externalptr>
Is this a bug or am I doing something wrong?
sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_2.2.1 data.table_1.10.4-3
loaded via a namespace (and not attached):
[1] compiler_3.4.1 colorspace_1.3-2 scales_0.5.0 lazyeval_0.2.0 plyr_1.8.4 gtable_0.2.0 tibble_1.3.3 Rcpp_0.12.12 grid_3.4.1 rlang_0.1.1 munsell_0.4.3
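One possible workaround for the melt() question above, offered as a sketch rather than a diagnosis: if the x column is read as a double instead of integer64, both measure columns are plain numerics and melt() has nothing to reinterpret. The example assumes data.table is installed and uses fread()'s integer64 argument; the temporary file stands in for stats.txt with only two of its rows.

```r
library(data.table)

# Recreate a small tab-separated file like stats.txt (two rows only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tx\ty",
             "A\t283726709252\t0.1",
             "B\t288604342155\t0.2"), tmp)

# Ask fread to store large integers as doubles rather than integer64
stats <- fread(tmp, integer64 = "double")
str(stats)

# Both measure columns are now numeric, so 'value' stays numeric after melting
xm <- melt(stats, id.vars = "id")
print(xm)
```

Doubles represent integers exactly only up to 2^53, so this is suitable for values of the size shown here but not for arbitrary 64-bit identifiers.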
I am looking at the code from here which has this at the beginning:
## generate data for medical example
clinical.trial <-
    data.frame(patient = 1:100,
               age = rnorm(100, mean = 60, sd = 6),
               treatment = gl(2, 50,
                              labels = c("Treatment", "Control")),
               center = sample(paste("Center", LETTERS[1:5]), 100,
                               replace = TRUE))
## set some ages to NA (missing)
is.na(clinical.trial$age) <- sample(1:100, 20)
I cannot understand this last line.
The LHS is a vector of all FALSE values. The RHS is a vector of 20 numbers sampled from the vector 1:100.
I don't understand this kind of assignment. How does it result in clinical.trial$age getting some NA values? Does this kind of assignment have a name? At best I would say that the boolean vector on the LHS gets numbers assigned to it with recycling.
is.na(x) <- value is translated as x <- 'is.na<-'(x, value).
You can think of 'is.na<-'(x, value) as 'assign NA to x, at position value'.
A perhaps more intuitive phrasing would be assign_NA(to = x, pos = value) (a made-up name, not a real function).
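A small sketch of that translation, using a made-up vector: the replacement syntax and the explicit function call (with its result assigned back) leave the two copies in the same state.

```r
xx <- c(10, 20, 30, 40, 50)
yy <- xx

# Replacement syntax: set positions 2 and 4 to NA
is.na(xx) <- c(2, 4)

# The same operation as an ordinary call whose result is assigned back
yy <- `is.na<-`(yy, c(2, 4))

print(xx)
print(identical(xx, yy))
```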
As for other similar functions, we can list those in the base package:
x <- as.character(lsf.str("package:base"))
x[grep('<-', x)]
#> [1] "$<-" "$<-.data.frame"
#> [3] "@<-" "[[<-"
#> [5] "[[<-.data.frame" "[[<-.factor"
#> [7] "[[<-.numeric_version" "[<-"
#> [9] "[<-.data.frame" "[<-.Date"
#> [11] "[<-.factor" "[<-.numeric_version"
#> [13] "[<-.POSIXct" "[<-.POSIXlt"
#> [15] "<-" "<<-"
#> [17] "attr<-" "attributes<-"
#> [19] "body<-" "class<-"
#> [21] "colnames<-" "comment<-"
#> [23] "diag<-" "dim<-"
#> [25] "dimnames<-" "dimnames<-.data.frame"
#> [27] "Encoding<-" "environment<-"
#> [29] "formals<-" "is.na<-"
#> [31] "is.na<-.default" "is.na<-.factor"
#> [33] "is.na<-.numeric_version" "length<-"
#> [35] "length<-.factor" "levels<-"
#> [37] "levels<-.factor" "mode<-"
#> [39] "mostattributes<-" "names<-"
#> [41] "names<-.POSIXlt" "oldClass<-"
#> [43] "parent.env<-" "regmatches<-"
#> [45] "row.names<-" "row.names<-.data.frame"
#> [47] "row.names<-.default" "rownames<-"
#> [49] "split<-" "split<-.data.frame"
#> [51] "split<-.default" "storage.mode<-"
#> [53] "substr<-" "substring<-"
#> [55] "units<-" "units<-.difftime"
They all work the same way, in the sense that fun(x) <- val is equivalent to x <- 'fun<-'(x, val). Beyond that, they behave like any normal function.
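To illustrate that there is nothing special about these functions, here is a sketch of a user-defined one. The name `second<-` is hypothetical and chosen only for this example; what matters is that the function name ends in `<-`, its last argument is called value, and it returns the modified object.

```r
# A user-defined replacement function: overwrite the second element
`second<-` <- function(x, value) {
  x[2] <- value
  x  # the modified object must be returned
}

v <- c(1, 2, 3)
second(v) <- 99   # R evaluates this as v <- `second<-`(v, value = 99)
print(v)

# Calling it like any normal function gives the same result
w <- `second<-`(c(1, 2, 3), 99)
print(identical(v, w))
```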
R manuals: 3.4.4 Subset assignment
The help page (?is.na) gives this example:
(xx <- c(0:4))
is.na(xx) <- c(2, 4)
xx #> 0 NA 2 NA 4
So,
is.na(xx) <- 1
behaves more like
set NA at position 1 on variable xx
@matt, to respond to the question you asked in the comments above, here's an alternative way to do the same assignment that I think is easier to follow :-)
clinical.trial$age[sample(1:100, 20)] <- NA
I'm sorry to bother you with what is probably an encoding question. After spending a couple of hours without finding a solution, I decided to post it here.
I'm trying, so far unsuccessfully, to write a simple table using write.table, write.csv, or write.csv2 from Ubuntu 14.04. My data is kind of messy, as it results from a cron job:
ID <- c("",30,26,20,30,40,5,10,4)
b <- c("",2233,12,2,22,13,23,23,100)
c <- c("","","","","","","","","")
d <- c("","","","","","","","","")
e <- c("","","","","","800","","","")
f <- c("","","","","","","","","")
g <- c("","","","","","","","EA","")
h <- c("","","","","","","","","")
df <- data.frame(ID,b,c,d,e,f,g,h)
# change columns to chr
for(i in c(1,2:ncol(df))) {
df[,i] <- as.character(df[,i])
}
str(df)
# 'data.frame': 9 obs. of 8 variables:
# $ ID: chr "" "30" "26" "20" ...
# $ b : chr "" "2233" "12" "2" ...
# $ c : chr "" "" "" "" ...
# $ d : chr "" "" "" "" ...
# $ e : chr "" "" "" "" ...
# $ f : chr "" "" "" "" ...
# $ g : chr "" "" "" "" ...
# $ h : chr "" "" "" "" ...
head(df,n=9)
# ID b c d e f g h
# 1
# 2 30 2233
# 3 26 12
# 4 20 2
# 5 30 22
# 6 40 13 800
# 7 5 23
# 8 10 23 EA
# 9 4 100
I have tried different combinations and suggestions found on SO, but nothing worked. The result is always somehow displaced: instead of long, it is wide. In the current example it is just one long row.
I tried:
write.table(df,"df.csv",row.names = FALSE, dec=".",sep=";")
write.table(df,"df.csv",row.names = FALSE,dec=".",sep=";", col.names = T)
write.table(df,"df.csv",row.names = FALSE,sep=";",fileEncoding = "UTF-8")
write.table(df,"df.csv",row.names = FALSE,fileEncoding = "UTF-8")
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8
[4] LC_COLLATE=de_DE.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.4.3 DBI_0.4-1 RGA_0.4.2 RMySQL_0.11-3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.5 lubridate_1.5.6 digest_0.6.9 assertthat_0.1 R6_2.1.2
[6] plyr_1.8.3 jsonlite_1.0 magrittr_1.5 httr_1.1.0 stringi_1.1.1
[11] curl_0.9.7 tools_3.3.1 stringr_1.0.0 parallel_3.3.1
Wrong output as pic:
The correct output results from the same data on:
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
The problem isn't R or Ubuntu; it is Notepad. Specifically, Notepad expects "\r\n" for line breaks, whereas most other text readers are happy with "\n", which is the default line break used by the write.xxx functions.
If you add the parameter eol="\r\n", you should be able to open the file in Notepad and see the expected line breaks.
For instance:
write.table(df,"df.csv",row.names = FALSE, dec=".",sep=";",eol="\r\n")
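To check which line endings actually ended up in the file, one option is to read it back as raw bytes and count the carriage returns. This is a sketch on a cut-down, made-up data frame written to a temporary file, and it assumes a Unix-alike system such as the Ubuntu machine in the question (on Windows, text-mode output may add its own carriage returns).

```r
# Cut-down stand-in for df, written to a temporary file
small <- data.frame(ID = c("30", "26"), b = c("2233", "12"),
                    stringsAsFactors = FALSE)
tmp <- tempfile(fileext = ".csv")
write.table(small, tmp, row.names = FALSE, sep = ";", eol = "\r\n")

# Read the file back byte by byte
bytes <- readBin(tmp, what = "raw", n = file.info(tmp)$size)

n_cr <- sum(bytes == as.raw(0x0d))  # carriage returns (\r)
n_lf <- sum(bytes == as.raw(0x0a))  # line feeds (\n)
print(c(cr = n_cr, lf = n_lf))
```

With the default eol the carriage-return count would be zero, which is what Notepad trips over.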
I'm struggling to figure out why I can't use filter() on the results
of predict.gam() and then ggplot() the subset of predictions. I'm not
sure the prediction step is really part of the problem, but that's what
it takes to trigger the error. Just filter() %>% ggplot() with a
dataframe works fine.
library(dplyr)
library(ggplot2)
library(mgcv)
gam1 <- gam(Petal.Length~s(Petal.Width) + Species, data=iris)
nd <- expand.grid(Petal.Width = seq(0,5,0.05),
Species = levels(iris$Species),
stringsAsFactors = FALSE)
predicted <- predict(gam1,newdata=nd)
predicted <- cbind(predicted,nd)
filter(tbl_df(predicted), Species == "setosa") %>%
ggplot(aes(x=Petal.Width, y = predicted)) +
geom_point()
## Error: length(rows) == 1 is not TRUE
But:
filter(tbl_df(predicted), Species == "setosa")
## Source: local data frame [101 x 3]
##
## predicted Petal.Width Species
## (dbl[10]) (dbl) (chr)
## 1 1.294574 0.00 setosa
## 2 1.327482 0.05 setosa
## 3 1.360390 0.10 setosa
## 4 1.393365 0.15 setosa
## 5 1.426735 0.20 setosa
## 6 1.460927 0.25 setosa
## 7 1.496477 0.30 setosa
## 8 1.533949 0.35 setosa
## 9 1.573888 0.40 setosa
## 10 1.616810 0.45 setosa
## .. ... ... ...
And the problem seems to be filter(), because this works:
pick <- predicted$Species == "setosa"
ggplot(predicted[pick,],aes(x=Petal.Width, y = predicted)) +
geom_point()
I've also tried saving the result of filter to an object and using that directly in ggplot() but that has the same error.
Obviously not a crisis, because there's a workaround, but my mental
model of how to use filter() is obviously wrong! Any insights much
appreciated.
Edit: When I first posted this I was still using R 3.2.3 and was getting warnings from ggplot2 and dplyr. So I upgraded to 3.3.0 and it's still happening.
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mgcv_1.8-12 nlme_3.1-127 ggplot2_2.1.0 dplyr_0.4.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 knitr_1.11 magrittr_1.5 munsell_0.4.2
## [5] colorspace_1.2-6 lattice_0.20-33 R6_2.1.1 stringr_1.0.0
## [9] plyr_1.8.3 tools_3.3.0 parallel_3.3.0 grid_3.3.0
## [13] gtable_0.1.2 DBI_0.3.1 htmltools_0.2.6 lazyeval_0.1.10
## [17] yaml_2.1.13 assertthat_0.1 digest_0.6.8 Matrix_1.2-6
## [21] formatR_1.2 evaluate_0.7.2 rmarkdown_0.9.5 labeling_0.3
## [25] stringi_1.0-1 scales_0.3.0
The problem arises because your predict() call returns a named one-dimensional array instead of a plain numeric vector.
class(predicted$predicted)
# [1] "array"
The first filter() gives you the correct output on the surface; however, if you inspect the output, you will notice that the column predicted is still a one-dimensional array holding all 303 values, even though the data frame now has only 101 rows.
str(filter(tbl_df(predicted), Species == "setosa"))
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 101 obs. of 3 variables:
$ predicted : num [1:303(1d)] 1.29 1.33 1.36 1.39 1.43 ...
..- attr(*, "dimnames")=List of 1
.. ..$ : chr "1" "2" "3" "4" ...
$ Petal.Width: num 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 ...
$ Species: chr "setosa" "setosa" "setosa" "setosa" ...
In contrast, good old logical subsetting does the job on all dimensions:
str(predicted[pick,])
'data.frame': 101 obs. of 3 variables:
$ predicted : num [1:101(1d)] 1.29 1.33 1.36 1.39 1.43 ... # Now 101 obs here too
..- attr(*, "dimnames")=List of 1
.. ..$ : chr "1" "2" "3" "4" ...
$ Petal.Width: num 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 ...
$ Species : chr "setosa" "setosa" "setosa" "setosa" ...
So either you coerce the predicted column to numeric:
library(dplyr)
library(ggplot2)
predicted %>% mutate(predicted = as.numeric(predicted)) %>%
filter(Species == "setosa") %>%
ggplot(aes(x = Petal.Width, y = predicted)) +
geom_point()
Or replace filter() by subset():
predicted %>%
subset(Species == "setosa") %>%
ggplot(aes(x = Petal.Width, y = predicted)) +
geom_point()
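A variation on the first fix is to drop the array attributes before the predictions ever go into the data frame. The sketch below uses a small hand-made one-dimensional array as a stand-in for the predict() result (the values and the grouping column g are made up), so it runs without mgcv, dplyr, or ggplot2.

```r
# Hand-made 1-d array with dimnames, standing in for the predict() output
a <- array(c(1.5, 2.5, 3.5), dim = 3, dimnames = list(c("1", "2", "3")))
print(class(a))   # a 1-d array, not a plain vector
print(dim(a))

# as.numeric() strips dim and dimnames, leaving a plain numeric vector
v <- as.numeric(a)
print(is.null(dim(v)))

# A data frame built from the plain vector subsets consistently
d <- data.frame(predicted = v, g = c("a", "b", "a"), stringsAsFactors = FALSE)
sub <- d[d$g == "a", ]
print(sub)
```

In the original code this would correspond to wrapping the predict() call in as.numeric() before the cbind() step.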
I'm using the bigmemory and biganalytics packages and specifically trying to compute the mean of a big.matrix object. The documentation for biganalytics (e.g. ?biganalytics) suggests that mean() should be available for big.matrix objects, but this fails:
x <- big.matrix(5, 2, type="integer", init=0,
                dimnames=list(NULL, c("alpha", "beta")))
x
# An object of class "big.matrix"
# Slot "address":
# <pointer: 0x00000000069a5200>
x[,1] <- 1:5
x[,]
# alpha beta
# [1,] 1 0
# [2,] 2 0
# [3,] 3 0
# [4,] 4 0
# [5,] 5 0
mean(x)
# [1] NA
# Warning message:
# In mean.default(x) : argument is not numeric or logical: returning NA
Although some things work OK:
colmean(x)
# alpha beta
# 3 0
sum(x)
# [1] 15
mean(x[])
# [1] 1.5
mean(colmean(x))
# [1] 1.5
Without mean(), it seems mean(colmean(x)) is the next best thing:
# try it on something bigger
x = big.matrix(nrow=10000, ncol=10000, type="integer")
x[] <- c(1:(10000*10000))
mean(colmean(x))
# [1] 5e+07
mean(x[])
# [1] 5e+07
system.time(mean(colmean(x)))
# user system elapsed
# 0.19 0.00 0.19
system.time(mean(x[]))
# user system elapsed
# 0.28 0.11 0.39
Presumably mean() could be faster still, especially for rectangular matrices with a large number of columns.
Any ideas why mean() isn't working for me?
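The mean(colmean(x)) workaround above gives the overall mean because every column of a matrix has the same number of rows and there are no missing values. Here is a sketch of that reasoning with an ordinary base R matrix and colMeans() (not bigmemory), plus a counter-example with groups of unequal size, where the mean of the group means is no longer the overall mean.

```r
# Ordinary matrix: all columns have the same number of rows
m <- matrix(1:10, nrow = 5, ncol = 2)
overall <- mean(m)
via_cols <- mean(colMeans(m))
print(c(overall = overall, via_cols = via_cols))

# Groups of unequal size: the mean of group means differs from the overall mean
groups <- list(a = c(1, 2, 3), b = c(10))
overall_g <- mean(unlist(groups))
via_groups <- mean(sapply(groups, mean))
print(c(overall = overall_g, via_groups = via_groups))
```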
OK - re-installing biganalytics seems to have fixed this.
I now have:
library("biganalytics")
x = big.matrix(10000,10000, type="integer")
for(i in 1L:10000L) { j = c(1L:10000L) ; x[i,] <- i*10000L + j }
mean(x)
# [1] 50010001
mean(x[,])
# [1] 50010001
mean(colmean(x))
# [1] 50010001
system.time(replicate(100, mean(x)))
# user system elapsed
# 20.16 0.02 20.23
system.time(replicate(100, mean(colmean(x))))
# user system elapsed
# 20.08 0.00 20.24
system.time(replicate(100, mean(x[,])))
# user system elapsed
# 31.62 12.88 44.74
So all good. My sessionInfo() is now:
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biganalytics_1.1.12 biglm_0.9-1 DBI_0.3.1 foreach_1.4.2 bigmemory_4.5.8 bigmemory.sri_0.1.3
loaded via a namespace (and not attached):
[1] codetools_0.2-8 iterators_1.0.7 Rcpp_0.11.2