I'm sure this is a fairly elementary problem. I have a data frame, df, that holds 10 years of data for a number of countries. It looks like this:
X2003 X2004 X2005 X2006 X2007 X2008 X2009 X2010 X2011 X2012
Afghanistan 7.321 7.136 6.930 6.702 6.456 6.196 5.928 5.659 5.395 5.141
Albania 2.097 2.004 1.919 1.849 1.796 1.761 1.744 1.741 1.748 1.760
Algeria 2.412 2.448 2.507 2.580 2.656 2.725 2.781 2.817 2.829 2.820
Angola 6.743 6.704 6.657 6.598 6.523 6.434 6.331 6.218 6.099 5.979
Antigua and Barbuda 2.268 2.246 2.224 2.203 2.183 2.164 2.146 2.130 2.115 2.102
Argentina 2.340 2.310 2.286 2.268 2.254 2.241 2.228 2.215 2.201 2.188
The first column is metadata. It hasn't got a name. I'd like to use qplot to plot time series for each of the rows. Something like the following command:
library(ggplot2)
qplot(data = df, binwidth = 1, geom = "freqpoly")
but I get the following error:
Error: stat_bin requires the following missing aesthetics: x.
I would like to set x = first column but I don't have a name on that column. Do I have to create a first column of country names? If so, how do I do that?
Seems like there should be an easier way. Sorry if this is so elementary.
Not sure what you need, maybe something like this?
library(reshape2)
library(ggplot2)
df$metadata <- row.names(df)
df <- melt(df, "metadata")
ggplot(df, aes(variable, value, group = metadata, color = metadata)) +
geom_line()
Following your comments, I guess you want this kind of graphic?
# Create a "long" data frame rather than a "wide" data frame.
country <- rep(c("Afghanistan", "Albania", "Algeria","Angola",
"Antigua and Barbuda", "Argentina"),each = 10, times = 1)
year <- rep(c(2003:2012), each = 1, times = 6)
value <- runif(60, 0, 50)
foo <- data.frame(country, year, value, stringsAsFactors = FALSE)
foo$year <- as.factor(foo$year)
# Draw a ggplot figure
ggplot(foo, aes(x=year, y = value,group = country, color = country)) +
geom_line() +
geom_point()
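The reshaping step above can also be sketched in base R. This is only an illustration under stated assumptions: the country names are assumed to be stored as row names (as in the question), the toy data frame holds a few values copied from the question, and the object name long is made up here. Stripping the leading "X" gives numeric years, so the x-axis becomes continuous.

```r
# Toy wide data frame with country names as row names (values copied from the question).
df <- data.frame(
  X2003 = c(7.321, 2.097),
  X2004 = c(7.136, 2.004),
  X2005 = c(6.930, 1.919),
  row.names = c("Afghanistan", "Albania")
)

# Move the row names into a real column and stack the year columns.
# unlist() concatenates the columns one after another, so the countries
# repeat as a block and each year repeats once per country.
long <- data.frame(
  country = rep(rownames(df), times = ncol(df)),
  year    = rep(as.integer(sub("^X", "", names(df))), each = nrow(df)),
  value   = unlist(df, use.names = FALSE)
)

# Plot one line per country, but only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = year, y = value, colour = country)) +
    geom_line() +
    geom_point()
  print(p)
}
```

With the real df, only the first block changes; the reshaping and the plot stay the same.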
Hi. Here is a solution very similar to the one Charles correctly suggested using melt. I've used the ggvis package to produce the plot and made sure the y-axis scale starts at 0. The block of code below assumes that df has already been read into R.
R Code:
library(reshape2)
library(ggvis)
str(df) # just to demonstrate the initial structure of df; the result is in the comment block below
# 'data.frame': 6 obs. of 11 variables:
# $ Country: chr "Afghanistan" "Albania" "Algeria" "Angola" ...
# $ X2003 : num 7.32 2.1 2.41 6.74 2.27 ...
# $ X2004 : num 7.14 2 2.45 6.7 2.25 ...
# $ X2005 : num 6.93 1.92 2.51 6.66 2.22 ...
# $ X2006 : num 6.7 1.85 2.58 6.6 2.2 ...
# $ X2007 : num 6.46 1.8 2.66 6.52 2.18 ...
# $ X2008 : num 6.2 1.76 2.73 6.43 2.16 ...
# $ X2009 : num 5.93 1.74 2.78 6.33 2.15 ...
# $ X2010 : num 5.66 1.74 2.82 6.22 2.13 ...
# $ X2011 : num 5.39 1.75 2.83 6.1 2.12 ...
# $ X2012 : num 5.14 1.76 2.82 5.98 2.1 ...
df1 <- melt(df, "Country")
df1 %>% ggvis(~factor(variable),~value,stroke=~Country) %>% layer_lines(strokeWidth:=2.5) %>%
add_axis("x",title="Year") %>% scale_numeric("y",zero=TRUE)
I never really started using ggplot, but when I saw ggvis, and especially its use of the %>% pipe operator introduced in the magrittr package, I was hooked. Best....
I'm using the following code to try to transform my response variable for regression. Seems to need a log transformation.
bc = boxCox(auto.tf.lm)
lambda.mpg = bc$x[which.max(bc$y)]
auto.tf.bc <- with(auto_mpg, data.frame(log(mpg), as.character(cylinders), displacement**.2, log(as.numeric(horsepower)), log(weight), log(acceleration), model_year))
auto.tf.bc.lm <- lm(log(mpg) ~ ., data = auto.tf.bc)
view(auto.tf.bc)
I am receiving this error though.
Error in Math.data.frame(mpg) :
non-numeric variable(s) in data frame: manufacturer, model, trans, drv, fl, class
Not sure how to resolve this. The data is in a data frame, not csv.
Here's the output from str(auto.tf.bc). Sorry for such bad question formatting.
'data.frame': 392 obs. of 7 variables:
$ log.mpg. : num 2.89 2.71 2.89 2.77 2.83 ...
$ as.character.cylinders.: chr "8" "8" "8" "8" ...
$ displacement.0.2 : num 3.14 3.23 3.17 3.14 3.13 ...
$ log.horsepower. : num 4.87 5.11 5.01 5.01 4.94 ...
$ log.weight. : num 8.16 8.21 8.14 8.14 8.15 ...
$ log.acceleration. : num 2.48 2.44 2.4 2.48 2.35 ...
$ model_year : num 70 70 70 70 70 70 70 70 70 70 ...
Removing the cylinders column doesn't change anything.
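One possible reading of the error, offered as an assumption rather than a confirmed diagnosis: the columns it lists (manufacturer, model, trans, drv, fl, class) look like the non-numeric columns of the mpg data set that ships with ggplot2. That would happen if log(mpg) in the formula is not found in auto.tf.bc, whose response column is actually named log.mpg., so R falls back to another object called mpg. The sketch below uses the built-in mtcars data as a stand-in for auto_mpg; the column names log_mpg, cyl, log_hp and log_wt are made up for the illustration.

```r
# Stand-in data: the built-in mtcars set, not the asker's auto_mpg.
auto <- mtcars

# Give every transformed column an explicit, syntactic name.
auto.tf <- with(auto, data.frame(
  log_mpg = log(mpg),
  cyl     = factor(cyl),
  log_hp  = log(hp),
  log_wt  = log(wt)
))

# Refer to the response by its column name; '.' then means "all other columns".
fit <- lm(log_mpg ~ ., data = auto.tf)
print(coef(fit))
```

Because the response is already transformed in the data frame, the formula no longer needs a log() call, and nothing outside auto.tf is looked up.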
I'm trying to fit a PLSR model, but I'm doing something wrong. Below you can see how I created the data frame, and its structure.
reflektance <- read_excel("data/reflektance.xlsx", na = "NA")
reflektance <- dput(reflektance)
pH <- read_excel("data/rijen2016.xls", na = "NA")
pH <- na.omit(pH)
pH <- dput(pH)
reflektance<-aggregate(reflektance[, 2:753], list(reflektance$Vzorek), mean)
colnames(reflektance)[colnames(reflektance)=='Group.1']<-'Vzorek'
datapH <- merge(pH, reflektance, by="Vzorek")
datasetpH <- data.frame(pH=datapH[,2], ref=I(as.matrix(datapH[, 3:754], 22, 752)))
The problem is with "plsr": calling it results in this error:
ph1<-plsr(pH ~ ref, ncomp = 5, data=datasetpH)
Error in pls::mvr(ref ~ pH, ncomp = 5, data = datasetpH, method = "kernelpls") :
Invalid number of components, ncomp
dput(reflectance):
https://jpst.it/RyyS
Here you can see the structure of the table datapH:
'data.frame': 22 obs. of 754 variables:
$ Vzorek: chr "5 - P01" "5 - P02" "5 - P03" "5 - R1 - A1" ...
$ pH/H2O: num 6.96 6.62 7.02 5.62 5.97 6.12 5.64 5.81 5.61 5.47 ...
$ 325 : num 0.017 0.0266 0.0191 0.0241 0.016 ...
$ 326 : num 0.021 0.0263 0.0154 0.0264 0.0179 ...
$ 327 : num 0.0223 0.0238 0.0147 0.028 0.0198 ...
...
And here is the structure of the table datasetpH:
'data.frame': 22 obs. of 2 variables:
$ pH : num 6.96 6.62 7.02 5.62 5.97 6.12 5.64 5.81 5.61 5.47 ...
$ ref: AsIs [1:22, 1:752] 0.016983.... 0.026556.... 0.019059.... 0.024097.... 0.016000.... ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "325" "326" "327" "328" ...
Do you have any advice or a solution? Thank you.
The problem seems to come from one of your columns containing only NAs.
The last line of the output of names(df) gives:
[745] "1068" "1069" "1070" "1071" "1072" "1073" "1074" "1075" NA
Using your data + some randomly generated values for pH (which isn't in the reflektance dataframe, named df here):
test=data.frame(pH=rnorm(23,5,2), ref=I(as.matrix(df[, 2:752], 22, 751)))
pls::plsr(pH ~ ref, data=test)
Error in matrix(0, ncol = ncomp, nrow = npred) :
invalid 'ncol' value (< 0)
Note that the indexing is a bit different from yours. I didn't have the second column in df (the one that contains pH in yours).
If I remove the last column which contains NA's :
test=data.frame(pH=rnorm(23,5,2), ref=I(as.matrix(df[, 2:752], 22, 751)))
pls::plsr(pH ~ ref, data=test)
Partial least squares regression , fitted with the kernel algorithm.
Call:
plsr(formula = pH ~ ref, data = test)
Let me know if that fixes it.
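As a rough sketch of the "remove the column that contains only NAs" step, the all-NA columns can be detected instead of being dropped by position. The toy data frame and the names all_na and clean are assumptions made for the illustration; they are not the asker's data.

```r
# Toy stand-in for the spectra table: the last column is entirely NA.
df <- data.frame(
  a = c(0.1, 0.2, 0.3),
  b = c(0.4, 0.5, 0.6),
  c = NA_real_
)

# Flag the columns in which every value is missing.
all_na <- vapply(df, function(col) all(is.na(col)), logical(1))

# Keep only the columns that have at least one non-missing value.
clean <- df[!all_na]
print(names(clean))
```

The cleaned data frame can then be turned into the predictor matrix in the same way as before.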
I'm using the rela package to check whether I can use PCA in my data.
paf.neur2 <- paf(neur2)
summary(paf.neur2)
# [1] "Your dataset is not a numeric object."
I want to see the KMO (the Kaiser-Meyer-Olkin measure of sampling adequacy). How can I do that?
Output of str(neur2)
'data.frame': 1457 obs. of 66 variables:
$ userid : int 200 387 458 649 931 991 1044 1075 1347 1360 ...
$ funct : num 3.73 3.79 3.54 3.04 3.81 ...
$ pronoun: num 2.26 2.55 2.49 1.98 2.71 ...
.
.
.
$ time : num 1.68 1.87 1.51 1.03 1.74 ...
$ work : num 0.7419 0.2311 -0.1985 -1.6094 -0.0619 ...
$ achieve: num 0.174 0.2469 0.1823 -0.478 -0.0513 ...
$ leisure: num 0.2852 0.0296 0.0583 -0.3567 -0.0408 ...
$ home : num -0.844 -0.58 -0.844 -2.207 -1.079 ...
.
Variables are all numeric.
According to ?paf, object is a numeric dataset (usually a coerced matrix from a prior data frame)
So you need to turn your data.frame neur2 into a matrix: as.matrix(neur2).
Here is a reproduction of your problem using the Seatbelts dataset:
library(rela)
Belts <- Seatbelts[,1:7]
class(Belts)
# [1] "mts" "ts" "matrix"
Belts <- as.data.frame(Belts)
class(Belts)
# [1] "data.frame"
paf.belt <- paf(Belts)
# [1] "Your dataset is not a numeric object."
Belts <- as.matrix(Belts)
class(Belts)
# [1] "matrix"
paf.belt <- paf(Belts) # Works
Two options which can do it for you:
kmo_DIY <- function(df){
csq = cor(df)^2
csumsq = (sum(csq)-dim(csq)[1])/2
library(corpcor)
pcsq = cor2pcor(cor(df))^2
pcsumsq = (sum(pcsq)-dim(pcsq)[1])/2
kmo = csumsq/(csumsq+pcsumsq)
return(kmo)
}
or the function KMO() from the psych package.
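For comparison, here is a minimal base-R sketch of the same overall KMO ratio that avoids the corpcor dependency. It assumes the correlation matrix is invertible and gets the partial correlations from its inverse; the function name kmo_base and the choice of mtcars columns are made up for the example.

```r
kmo_base <- function(x) {
  r     <- cor(x)
  r_inv <- solve(r)  # fails if the correlation matrix is singular

  # Partial correlations from the scaled, sign-flipped inverse correlation matrix.
  p <- -r_inv / sqrt(outer(diag(r_inv), diag(r_inv)))

  # Use only the off-diagonal entries of both matrices.
  off <- row(r) != col(r)
  sum(r[off]^2) / (sum(r[off]^2) + sum(p[off]^2))
}

# Example on a few numeric columns of the built-in mtcars data.
kmo_value <- kmo_base(as.matrix(mtcars[, c("mpg", "disp", "hp", "wt")]))
print(kmo_value)
```

The result lies between 0 and 1; values closer to 1 indicate that the partial correlations are small relative to the ordinary correlations.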
I have daily temperature values for several years, 1949-2010. I would like to calculate monthly means. Here is an example of the data:
head(tmeasmax)
TIMESTEP MEAN.C. MINIMUM.C. MAXIMUM.C. VARIANCE.C.2. STD_DEV.C. SUM COUNT
1949-01-01 6.836547 6.65 7.33 0.02850574 0.1688364 1.426652 6
1949-01-02 10.533371 10.24 10.74 0.06140426 0.2477988 1.426652 6
1949-01-03 18.746729 18.02 19.78 0.18507860 0.4302076 1.426652 6
1949-01-04 21.244562 20.09 22.40 0.76106980 0.8723931 1.426652 6
1949-01-05 3.826716 3.11 5.37 0.52706647 0.7259935 1.426652 6
1949-01-06 9.127782 8.46 10.26 0.20236358 0.4498484 1.426652 6
str(tmeasmax)
'data.frame': 22645 obs. of 8 variables:
$ TIMESTEP : Date, format: "1949-01-01" "1949-01-02" ...
$ MEAN.C. : num 6.84 10.53 18.75 21.24 3.83 ...
$ MINIMUM.C. : num 6.65 10.24 18.02 20.09 3.11 ...
$ MAXIMUM.C. : num 7.33 10.74 19.78 22.4 5.37 ...
$ VARIANCE.C.2.: num 0.0285 0.0614 0.1851 0.7611 0.5271 ...
$ STD_DEV.C. : num 0.169 0.248 0.43 0.872 0.726 ...
$ SUM : num 1.43 1.43 1.43 1.43 1.43 ...
$ COUNT : int 6 6 6 6 6 6 6 6 6 6 ...
There is a previous question that I couldn't make heads or tails of. I imagine I can probably use aggregate, but I don't know how to break up the dates into the years and months and then approach the nesting of the months inside the years. I tried a loop inside of a loop, but I can never get nested loops to work.
EDIT to reply to comments/questions:
I was looking for the mean of "MEAN.C."
Here's a quick data.table solution. I'm assuming you want the means of MEAN.C. (?)
library(data.table)
setDT(tmeasmax)[, .(MonthlyMeans = mean(MEAN.C.)), by = .(year(TIMESTEP), month(TIMESTEP))]
# year month MonthlyMeans
# 1: 1949 1 11.71928
You can also do this for all the columns at once if you want
tmeasmax[, lapply(.SD, mean), by = .(year(TIMESTEP), month(TIMESTEP))]
# year month MEAN.C. MINIMUM.C. MAXIMUM.C. VARIANCE.C.2. STD_DEV.C. SUM COUNT
# 1: 1949 1 11.71928 11.095 12.64667 0.2942481 0.482513 1.426652 6
Here's a way to do it with the dplyr package:
library(dplyr)
library(lubridate)
tmeasmax$TIMESTEP = ymd(tmeasmax$TIMESTEP)
tmeasmax %>%
group_by(Year=year(TIMESTEP), Month=month(TIMESTEP)) %>%
summarise(meanDailyMin=mean(MINIMUM.C.),
meanDailyMean=mean(MEAN.C.))
Year Month meanDailyMin meanDailyMean
1 1949 1 11.095 11.71928
You can summarise any other column by month in a similar way.
You can use the lubridate package to create a new factor variable consisting of the year-month combinations, then use aggregate.
library('lubridate')
tmeasmax2 <- within(tmeasmax, {
monthlies <- paste(year(TIMESTEP),
month(TIMESTEP))
})
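# monthlies is a column of tmeasmax2, so refer to it explicitly; average only the numeric columns
aggregate(tmeasmax2[sapply(tmeasmax2, is.numeric)], list(monthlies = tmeasmax2$monthlies), mean, na.rm = TRUE)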
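The same idea can be sketched without any extra packages. In this illustration the daily data are synthetic (each day's value is simply its month number, so the expected means are easy to check), and the names tm, year_month and monthly are assumptions made for the example; only TIMESTEP and MEAN.C. follow the question.

```r
# Synthetic daily data; column names follow the question, values are made up.
tm <- data.frame(
  TIMESTEP = seq(as.Date("1949-01-01"), as.Date("1949-03-31"), by = "day")
)
tm$MEAN.C. <- as.numeric(format(tm$TIMESTEP, "%m"))  # each day's value = its month number

# Build a year-month key, then average MEAN.C. within each key.
tm$year_month <- format(tm$TIMESTEP, "%Y-%m")
monthly <- aggregate(MEAN.C. ~ year_month, data = tm, FUN = mean)
print(monthly)
```

Formatting the date as "%Y-%m" handles the nesting of months inside years in one step, so no loop is needed.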