Import stuff from an R file - r

I'm studying R, and I was asked to use the pi2000 data set, available at the simpleR library (http://www.math.csi.cuny.edu/Statistics/R/simpleR/index.html/simpleR.R). Then I downloaded this file. How do I import it to the command line?

Use the source function:
> source("http://www.math.csi.cuny.edu/Statistics/R/simpleR/index.html/simpleR.R")
You can then access the variables defined there:
> vacation
[1] 23 12 10 34 25 16 27 18 28 13 14 20 8 21 23 33 30 13 16 14 38 19 6 11 15 21 10
[28] 39 42 25 12 17 19 26 20
For more information, type ?source into an R terminal.

Do you mean load the function into R? If so, use source():
source("http://www.math.csi.cuny.edu/Statistics/R/simpleR/index.html/simpleR.R")
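Since the file has already been downloaded, source() can also be pointed at the local copy instead of the URL. A minimal sketch, assuming nothing about where the file was saved: it writes a tiny stand-in script to a temporary file (the name vacation_demo and its values are made up for illustration) and sources that.

```r
# Create a small stand-in for a downloaded .R file (hypothetical content)
script <- tempfile(fileext = ".R")
writeLines("vacation_demo <- c(23, 12, 10)", script)

# source() runs the file; objects it defines land in the workspace
source(script)
vacation_demo
```

For the real file, replace script with the path to the downloaded simpleR.R, or call setwd() to its folder first and use source("simpleR.R").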

Related

hcboxplot not building correct boxplots

While working on a project, I had a dataframe with a column for which I needed to build a boxplot using highcharter.
The column looked like this:
head(data,20)
Col1
1 30
2 30
3 30
4 30
5 28
6 27
7 29
8 27
9 30
10 30
11 28
12 29
13 29
14 30
15 30
16 30
17 30
18 29
19 30
20 NA
and so on. The data has 693 observations. The column is 'labelled', so I use as.numeric() when building the boxplot.
When I try building the boxplot with hcboxplot:
hcboxplot(x = as.numeric(data$Col1[!is.na(data$Col1)]))
the boxplot details I get are:
Maximum: 30
Upper Quartile: 30
Median: 29
Lower Quartile: 28
Minimum: 25
But in the data, there definitely are values less than 25 (even 0):
> min(data$Col1[!is.na(data$Col1)])
[1] 0
So the maximum value is correct, but the minimum definitely is not.
Why is the boxplot not coming out right? Any help is appreciated! Apologies for not being able to add a picture to the question (reputation restrictions).
Thank you.
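One possible explanation, offered as a sketch rather than a confirmed diagnosis of hcboxplot: many boxplot implementations draw the lower whisker at the smallest observation within 1.5 times the interquartile range of the lower quartile, and treat anything below that as an outlier. With the reported quartiles, 28 - 1.5 * (30 - 28) = 25, which matches the "Minimum" shown. Base R's boxplot.stats() follows this rule, so it can illustrate the effect on a small made-up vector (x below is hypothetical, not the asker's data).

```r
# Hypothetical sample: mostly 28-30, plus a few much lower values
x <- c(0, 10, 25, 28, 28, 28, 29, 29, 29, 30, 30, 30, 30, 30, 30, 30)

s <- boxplot.stats(x)  # default coef = 1.5

s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out    # observations lying beyond the whiskers
min(x)   # the true minimum, which is not the whisker end
```

Here the lower whisker comes out at 25 while 0 and 10 are reported in s$out, so it would be worth checking whether the low values in the chart are drawn as separate outlier points.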

Conditional vectorization

Given a data frame like below:
set.seed(123)
df1 <- data.frame(V1=sample(c(0,1,2),100,replace=TRUE),
V2=sample(c(2,3,4),100,replace=TRUE),
V3=sample(c(4,5,6),100,replace=TRUE),
V4=sample(c(6,7,8),100,replace=TRUE),
V5=sample(c(6,7,8),100,replace=TRUE))
I want to sum each row, starting from the first column with a value >=2 and ending with the first column with a value >6; if no value is >6, sum until the end of the row.
How would I do this in a vectorized fashion?
Update: This is not for any homework assignment. I just want more examples of vectorization code that I can study and learn from. I had to do something like the above before, but couldn't figure out the apply syntax for this particular task and resorted to for loops.
This appeared to be the most R-like approach, but I don't consider it "vectorized" in the R meaning of the term:
apply( df1, 1, function(x) sum( x[which(x>=2)[1]: min(which(x>6)[1], 5, na.rm=TRUE)] ) )
#---------
[1] 15 22 16 19 17 17 23 21 14 13 18 13 16 23 15 18 16 21 16 19 17 23 21 18
[25] 21 24 15 20 15 18 17 24 19 18 19 15 18 17 15 17 14 21 13 19 15 15 15 15
[49] 21 19 21 15 17 18 14 17 15 16 22 16 23 22 17 21 17 16 23 23 16 14 18 13
[73] 18 15 17 17 17 20 20 16 17 16 16 16 14 16 20 23 23 24 14 18 16 17 22 23
[97] 23 19 20 17
Due to your sampling structure, we can vectorize quite easily.
We know that only the first column can be less than 2, and thus excluded. Columns V2, V3 and V4 are always included: V2 and V3 never exceed 6, and V4 is either 6 or the first value above 6. Column V5 is excluded only if column V4 is above 6.
So:
(df1$V1 == 2) * df1$V1 + df1$V2 + df1$V3 + df1$V4 + df1$V5 * !(df1$V4 > 6)
[1] 15 22 16 19 17 17 23 21 14 13 18 13 16 23 15 18 16 21 16 19 17 23 21 18 21 24 15 20 15 18 17 24 19 18
[35] 19 15 18 17 15 17 14 21 13 19 15 15 15 15 21 19 21 15 17 18 14 17 15 16 22 16 23 22 17 21 17 16 23 23
[69] 16 14 18 13 18 15 17 17 17 20 20 16 17 16 16 16 14 16 20 23 23 24 14 18 16 17 22 23 23 19 20 17
is your vectorized calculation. This is obviously much less general than the other answers here, but fits your question.
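A quick way to check that this shortcut agrees with the row-wise version is to compute both on the question's data and compare. This is a minimal sketch; the names by_row and closed_form are illustrative only.

```r
set.seed(123)
df1 <- data.frame(V1 = sample(c(0, 1, 2), 100, replace = TRUE),
                  V2 = sample(c(2, 3, 4), 100, replace = TRUE),
                  V3 = sample(c(4, 5, 6), 100, replace = TRUE),
                  V4 = sample(c(6, 7, 8), 100, replace = TRUE),
                  V5 = sample(c(6, 7, 8), 100, replace = TRUE))

# Row-wise version: first value >= 2 up to the first value > 6 (or row end)
by_row <- apply(df1, 1, function(x) {
  start <- which(x >= 2)[1]
  end <- min(which(x > 6)[1], length(x), na.rm = TRUE)
  sum(x[start:end])
})

# Closed-form version, valid only for this particular sampling structure
closed_form <- with(df1, (V1 == 2) * V1 + V2 + V3 + V4 + V5 * !(V4 > 6))

all.equal(unname(by_row), closed_form)
```

The comparison only holds because of how the columns were sampled; with other value ranges the closed form would need to change.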
Using apply would be the most sensible solution. However, since we seem to be competing on who can answer this without using R-based loops, I humbly offer this
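m <- as.matrix(df1)
# First column in each row with a value >= 2
start <- max.col(m >= 2, ties.method = "first")
# First column with a value > 6; the last column is forced to TRUE so rows without one end there
end <- max.col(`[<-`(m > 6, , ncol(m), TRUE), ties.method = "first")
# Matrix of column indices with the same shape as m
i <- col(m)
# Zero out cells outside start..end, then sum each row
rowSums(m * (i >= start & i <= end))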
The output is the same as in the other answers.
I am sure that there is a more elegant way but for a brute force approach you can write a function and pass it to apply.
First, define your example data
df <- data.frame(V1=sample(c(0,1,2),100,replace=TRUE),
V2=sample(c(2,3,4),100,replace=TRUE),
V3=sample(c(4,5,6),100,replace=TRUE),
V4=sample(c(6,7,8),100,replace=TRUE),
V5=sample(c(6,7,8),100,replace=TRUE))
Write a function that implements the condition. which() returns the positions in the vector where a condition holds. For "start", the [1] picks the first position with a value >= 2. For "end" there are two possible cases, so I used an if statement: if no value is > 6, "end" is set to the last position of the vector; otherwise it is the first position with a value > 6. Then it is just a matter of subsetting the vector from start to end and passing the result to sum.
sum.col <- function(x) {
  start <- which(x >= 2)[1]   # first position with a value >= 2
  end <- which(x > 6)         # all positions with a value > 6
  if (length(end) == 0) {
    end <- length(x)          # no value > 6: sum to the end of the row
  } else {
    end <- end[1]             # stop at the first value > 6
  }
  return(sum(x[start:end]))
}
Now we can pass the function to apply, which deals with the vectorization of each row for us.
apply(df, FUN=sum.col, MARGIN = 1)

How to calculate the mean for a column subsetted from the data

This shouldn't be too hard, but I always have issues when trying to run calculations on a column in a data frame that rely on the value of another column in the data frame. Here is my data.frame:
stream reach length.km length.m total.sa pools.sa
1 Stream Reach_Code 109 109 1 1
2 Brooks BRK_001 17 14 108 13
3 Brooks BRK_002 15 12 99 9
4 Brooks BRK_003 24 21 94 95
5 Brooks BRK_004 32 29 97 33
6 Brooks BRK_005 27 24 92 79
7 Brooks BRK_006 26 23 95 6
8 Brooks BRK_007 16 13 77 15
9 Brooks BRK_008 29 26 84 26
10 Brooks BRK_009 18 15 87 46
11 Brooks BRK_010 23 20 88 47
12 Brooks BRK_011 22 19 91 40
13 Brooks BRK_012 30 27 98 37
14 Brooks BRK_013 25 22 93 29
19 Buncombe_Hollow BNH_0001 7 4 75 65
20 Buncombe_Hollow BNH_0002 8 5 66 21
21 Buncombe_Hollow BNH_0003 9 6 68 53
22 Buncombe_Hollow BNH_0004 19 16 81 11
23 Buncombe_Hollow BNH_0005 6 3 65 27
24 Buncombe_Hollow BNH_0006 13 10 63 23
25 Buncombe_Hollow BNH_0007 12 9 71 57
I would like to calculate the mean of a column (let's say length.m) where stream = Brooks and then do the same thing for stream = Buncombe_Hollow. I actually have 17 different stream names, and plan on calculating the mean of some column for each stream. I will then store these means as a vector and bind them to another vector of the stream names, so the end result is something like this:
stream truevalue
1 Brooks 0.9440620
2 Siouxon 0.5858527
3 Speelyai 0.5839844
Thanks!
Try using aggregate:
# Generate some data to use
someDf <- data.frame(stream = rep(c("Brooks", "Buncombe_Hollow"), each = 10),
length.m = rpois(20, 4))
# Calculate the means with aggregate
with(someDf, aggregate(list(truevalue = length.m), list(stream = stream), mean))
The reason for the "list" bits is to specifically name the columns in the (data frame) output
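If the goal is literally a vector of means bound to a vector of stream names, tapply() is another base R option. A minimal sketch, using a handful of length.m values copied from the question's table (the names d, means and result are illustrative):

```r
# A few rows taken from the question's data
d <- data.frame(
  stream   = c("Brooks", "Brooks", "Brooks",
               "Buncombe_Hollow", "Buncombe_Hollow", "Buncombe_Hollow"),
  length.m = c(14, 12, 21, 4, 5, 6)
)

# Named vector of means, one element per stream
means <- tapply(d$length.m, d$stream, mean)

# Bind the names and the means into the requested two-column layout
result <- data.frame(stream = names(means),
                     truevalue = as.vector(means))
result
```

tapply() orders the groups alphabetically by stream name, so the rows of result follow that order rather than the order in the original data.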
Start using the dplyr package. It makes such calculations quick and very easy to write:
library(dplyr)
result <- data %>% group_by(stream) %>% summarize(truevalue = mean(length.m))

Data frame header in R

I am trying to do some calculations in R with data from an Oracle database. I connected to the DB and extracted the data correctly.
> y=dbGetQuery(con, "select distinct(fk_parametro) from t_datos")
> y
FK_PARAMETRO
1 30
2 42
3 43
4 83
5 87
6 1
7 6
8 44
9 20
10 14
11 86
12 88
13 85
14 81
15 35
16 8
17 80
18 89
19 7
20 12
21 82
22 9
23 10
The following command works:
> sum(y)
[1] 1042
But this one fails:
> mean(y)
[1] NA
Warning message:
In mean.default(y) : argument is not numeric or logical: returning NA
I think it happens because R is treating the header "FK_PARAMETRO" as an element. Can someone help me figure this out?
As commented by @akrun, this works:
mean(y[,1])
Or, as suggested by @PierreLafortune, you could also do:
colMeans(y)
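The underlying reason is that dbGetQuery returns a data frame, and mean() expects a numeric vector, while sum() happens to accept a data frame whose columns are all numeric. The header is not being counted as a value. A minimal sketch with a small stand-in data frame (the three values are the first rows of the output above; no database connection is assumed):

```r
# Stand-in for the query result: a one-column data frame
y <- data.frame(FK_PARAMETRO = c(30, 42, 43))

is.data.frame(y)            # TRUE: y is not a plain vector
sum(y)                      # works on an all-numeric data frame
suppressWarnings(mean(y))   # NA: mean() has no data frame method

# Extract the column first, or use a function meant for data frames
mean(y[, 1])
mean(y$FK_PARAMETRO)
colMeans(y)
```

colMeans() returns a named vector with one mean per column, which is convenient if the query later returns more than one numeric column.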

What is the function that allows you to avoid explicitly naming the dataset when using a variable in R?

I know there is a function for this and have used it in the past but can't seem to remember or find it.
Basically, I'm trying to do the following:
data(mtcars)
missing_function(mtcars)
lm(mpg ~ cyl)
where "missing_function()" allows me to use the variables in the mtcars dataset without having to use the "$". For example: "mpg" instead of "mtcars$mpg".
Though generally frowned upon, attach will do that for you:
R> attach(mtcars)
R> mpg
[1] 21 21 23 21 19 18 14 24 23 19 18 16 17 15 10 10 15 32 30 34 22 16 15 13 19 27 26 30
[29] 16 20 15 21
R>
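Two common alternatives avoid attach altogether, which sidesteps the name-masking problems that make it frowned upon. A minimal sketch using the built-in mtcars data (the names fit and avg_mpg are illustrative):

```r
# Modelling functions usually take a data argument, so no "$" is needed
fit <- lm(mpg ~ cyl, data = mtcars)
coef(fit)

# with() evaluates an expression using the data frame's columns as variables
avg_mpg <- with(mtcars, mean(mpg))
avg_mpg
```

Both forms keep the lookup local to a single call, so nothing stays on the search path afterwards.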
