Multiple columns of data and getting averages in R

I asked a question like this before, but I decided to simplify my data format because I'm very new at R and didn't understand what was going on. Here's the link to that question: How to handle more than multiple sets of data in R programming?
I edited what my data should look like and decided to leave it in this format:
X1.0 X X2.0 X.1
0.9 0.9 0.2 1.2
1.3 1.4 0.8 1.4
As you can see, I have four columns of data; the real data I'm dealing with has up to 2000 data points. Columns "X1.0" and "X2.0" refer to "Time", so what I want is the average of "X" and "X.1" every 100 seconds, based on my two time columns "X1.0" and "X2.0". I can do it using this command:
cuts <- cut(data$X1.0, breaks=seq(0, max(data$X1.0)+400, 400))
 by(data$X, cuts, mean)
But this will only give me the averages for one set of data, namely "X1.0" and "X". How can I do it so that I get averages from more than one data set? I also want to stop getting this kind of output:
cuts: (0,400]
[1] 0.7
------------------------------------------------------------
cuts: (400,800]
[1] 0.805
Note that the output was done every 400 s. I really want a list of those cuts, i.e. the averages at the different intervals. Please help. I just used data = read.delim("clipboard") to get my data into the program.

It is a little bit confusing what output you want to get.
First I change the column names, but this is optional:
colnames(dat) <- c('t1','v1','t2','v2')
Then I will use ave, which is like by but with better output. I am using a matrix trick to index the columns:
matrix(1:ncol(dat), ncol = 2) ## column 1 indexes cols 1 and 2; column 2 indexes cols 3 and 4
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Then I use this matrix with apply. Here is the entire solution:
cbind(dat,
      apply(matrix(1:ncol(dat), ncol = 2), 2,
            function(x, by = 10) {      ## by 10 seconds! you can replace this
                                        ## with 100 or 400 in your real data
              t.col <- dat[, x][, 1]    ## txxx
              v.col <- dat[, x][, 2]    ## vxxx
              ave(v.col,
                  cut(t.col, breaks = seq(0, max(t.col), by)),
                  FUN = mean)
            }))
EDIT: correct the cut and simplify the code.
cbind(dat,
      apply(matrix(1:ncol(dat), ncol = 2), 2,
            function(x, by = 10) ave(dat[, x][, 1], dat[, x][, 1] %/% by)))
X1.0 X X2.0 X.1 1 2
1 0.9 0.9 0.2 1.2 3.3000 3.991667
2 1.3 1.4 0.8 1.4 3.3000 3.991667
3 2.0 1.7 1.6 1.1 3.3000 3.991667
4 2.6 1.9 2.2 1.6 3.3000 3.991667
5 9.7 1.0 2.8 1.3 3.3000 3.991667
6 10.7 0.8 3.5 1.1 12.8375 3.991667
7 11.6 1.5 4.1 1.8 12.8375 3.991667
8 12.1 1.4 4.7 1.2 12.8375 3.991667
9 12.6 1.8 5.4 1.2 12.8375 3.991667
10 13.2 2.1 6.3 1.3 12.8375 3.991667
11 13.7 1.6 6.9 1.1 12.8375 3.991667
12 14.2 2.2 9.4 1.3 12.8375 3.991667
13 14.6 1.8 10.0 1.5 12.8375 10.000000
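If you would rather get the interval means as a list (as the question asks) than as a column repeated row by row, here is a minimal sketch of the same column-pairing idea, assuming dat holds the four columns shown above and using the 400-second bins from the question:
by <- 400
col_pairs <- list(c(1, 2), c(3, 4))   # (time, value) column index pairs
lapply(col_pairs, function(idx) {
  t.col <- dat[[idx[1]]]              # time column of this pair
  v.col <- dat[[idx[2]]]              # value column of this pair
  tapply(v.col, cut(t.col, breaks = seq(0, max(t.col) + by, by = by)), mean)
})
Each list element is a named vector of means, one per (0,400], (400,800], ... interval for that pair.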

Related

Trying to use a variable as label in ggplots

I'm not sure what's going on here, but when I try to run ggplot, it tells me that u and u1 are not valid lists. Did I enter u and u1 incorrectly so that it thinks these are functions, did I forget something, or did I enter things wrong into ggplot?
u1 <- function(x,y){max(utilityf1(x))}
utilityc1 <- data.frame("utilityc1" =
u(c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20),
c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)))
utilityc1 <- data.frame("utilityc1" =
u1(c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20),
c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)))
hhcomp <- data.frame(
pqx, pqy, utility, hours, p1qx, p1qy, utilit, utilityc1,
utilityc, u,u1, o, o1, o2
)
library(ggplot2)
ggplot(hhcomp, aes(x=utility, y=consumption))+
coord_cartesian(xlim = c(0, 16) )+
ylim(0,20)+
labs(x = "leisure(hours)",y="counsumption(units)")+
geom_line(aes(x = u, y = consumption))+
geom_line(aes(x = u1, y = consumption))
I'm not sure what else to explain, so if someone could provide some help on posting code to Stack Overflow, that would be useful. I'm also not sure how much of a description to give: I should have enough code to be reproducible, but Stack Overflow only allows so much code, so it would be good to know the right amount to add.
I think you may need to read the documentation for ggplot2 and maybe R in general.
data.frame
For starters, a data.frame object is a collection of vectors appended together column-wise. Most of what you have defined as inputs for hhcomp are functions, which cannot be stored in a data.frame. A canonical example of a data frame in R is iris:
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa
str(iris) #print the structure of an r object
#'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
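As a hedged illustration of that point (the names below are invented), a data.frame is built from equal-length vectors, not from functions:
hours <- c(0, 8, 16)
wage  <- c(1.5, 1.5, 1.5)
ok    <- data.frame(hours, wage)   # equal-length vectors: a valid data.frame
# data.frame(hours, mean)          # a function cannot be a column; this would error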
functions
There is a lot going on with your functions. Nested functions are fine, but it seems as though you are failing to pass all values on. This probably means you are relying on R's scoping rules, but that makes it ambiguous where values are found.
With the currently defined functions, calling u(1:2, 3:4) passes 1:2 to utilityf, but utilityf's y argument is never assigned (with R's lazy evaluation we reach a different error before R realizes that this value is missing). The next function that gets evaluated in this nest is p1qyf, which is defined as follows:
p1qyf <- function(y){(w1*16)-(w1*x)}
With this definition, it does not matter what you pass to the argument y; it will never be used, and the function will always return the same thing.
#with only the function defined
p1qyf()
#Error in p1qyf() : object 'w1' not found
#defining w1
w1 <- 1.5
p1qyf()
#Error in p1qyf() : object 'x' not found
x <- 10:20
#All variables defined in the function
#can now be found in the global environment
#thus the function can be called with no errors because
#w1 and x are defined somewhere...
p1qyf() #nothing assigned to y
#[1]  9.0  7.5  6.0  4.5  3.0  1.5  0.0 -1.5 -3.0 -4.5 -6.0
p1qyf(y = iris) #a data.frame assigned to y
#[1]  9.0  7.5  6.0  4.5  3.0  1.5  0.0 -1.5 -3.0 -4.5 -6.0
p1qyf(y = foo_bar) #an object that hasn't even been assigned yet
#[1]  9.0  7.5  6.0  4.5  3.0  1.5  0.0 -1.5 -3.0 -4.5 -6.0
I imagine you actually intend to define it this way
p1qyf <- function(y){(w1*16)-(w1*y)}
#Now what we pass to it affects the output
p1qyf(1:10)
#[1] 22.5 21.0 19.5 18.0 16.5 15.0 13.5 12.0 10.5 9.0
head(p1qyf(iris))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 16.35 18.75 21.90 23.7 NA
#2 16.65 19.50 21.90 23.7 NA
#3 16.95 19.20 22.05 23.7 NA
#4 17.10 19.35 21.75 23.7 NA
#5 16.50 18.60 21.90 23.7 NA
#6 15.90 18.15 21.45 23.4 NA
You can improve this further by defining more arguments so that R doesn't need to search for missing values with its scoping rules:
p1qyf <- function(y, w1 = 1.5){(w1*16)-(w1*y)}
#w1 is defaulted to 1.5 and doesn't need to be searched for.
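For example (the numbers just follow the definitions above), the default can still be overridden per call:
p1qyf(1:3)           # uses the default w1 = 1.5: 22.5 21.0 19.5
p1qyf(1:3, w1 = 2)   # overrides the default:     30 28 26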
I would spend some time looking into your functions because they are unclear and some, such as your p1qyf, do not fully use the arguments they are passed.
ggplot
ggplot takes some type of structured data object, such as a data.frame or tbl_df, and allows plotting. The aes mappings can take the symbol names of the column headers you wish to map. Continuing with iris as an example:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species))+
geom_point() +
geom_line()
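For the question's case, that means building one data frame from the computed vectors first and then mapping its columns; a rough sketch, assuming hours is the vector from the question and p1qyf is the corrected function above:
plot_df <- data.frame(leisure = hours, consumption = p1qyf(hours))
ggplot(plot_df, aes(x = leisure, y = consumption)) +
  geom_line()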
I hope this helps clear up why you may be getting some errors. Honestly though, even if you were actually able to declare a data.frame, the problem here is that your post is still not that reproducible. Good luck.
For reference, the function definitions involved:
pqxf <- function(x){(1)*(y)} # replace 1 with py and assign a value to py
pqyf <- function(y){(w * 16)-(w * x)} #
utilityf <- function(x, y) { (pqyf(x)) * ((pqxf(y)))} # the utility function C,l
hours <- c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)
w1 <- 1.5
p1qxf <- function(x){(1)*(y)} # replace 1 with py and assign a value to p1y
p1qyf <- function(y){(w1 * 16)-(w1 * x)} #
utilityf1 <- function(x, y) { (p1qyf(x)) * ((p1qxf(y)))} # the utility function (C,l)
utilitycf <- function(x,y){max(utilityf(x))/((pqyf(y)))}
utilityc1f <- function(x,y){max(utilityf1(x))/((pqyf(y)))}
u <- function(x,y){max(utilityf(x))}
u1 <- function(x,y){max(utilityf1(x))}

Sum every n columns in a data frame in R

I have a data frame A with 10 columns and 300 rows. I need to sum every two adjacent columns, like this:
A[,1]+A[,2] = # first result
A[,3]+A[,4] = # second result
A[,5]+A[,6]= # third result
....
A[,9]+A[,10] # last result
The expected final result is a new data frame with 5 columns and 300 rows.
Is there any way to do this with tapply or a for loop?
I know that I can do it with the example above, but I'm looking for a fast method.
Thank you
We could use sapply:
df <- data.frame(replicate(expr=rnorm(100),n = 10))
sapply(seq(1,9,by=2),function(i) rowSums(df[,i:(i+1)]))
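The result is a 100 x 5 matrix, one column per pair; assuming a data frame is wanted back, it can simply be wrapped:
out <- data.frame(sapply(seq(1, 9, by = 2), function(i) rowSums(df[, i:(i + 1)])))
dim(out)   # 100 rows, 5 columns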
You can do it without *apply loops.
Sample data:
df <- head(iris[-5])
df
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 5.1 3.5 1.4 0.2
#2 4.9 3.0 1.4 0.2
#3 4.7 3.2 1.3 0.2
#4 4.6 3.1 1.5 0.2
#5 5.0 3.6 1.4 0.2
#6 5.4 3.9 1.7 0.4
Now you can use vector recycling of logical indices:
df[c(TRUE,FALSE)] + df[c(FALSE,TRUE)]
# Sepal.Length Petal.Length
#1 8.6 1.6
#2 7.9 1.6
#3 7.9 1.5
#4 7.7 1.7
#5 8.6 1.6
#6 9.3 2.1
It's a bit cryptic, but it should be fast. We add each column to the adjacent column, then drop the unnecessary results with c(TRUE, FALSE), which recycles to keep only the odd-positioned columns:
(A[1:(ncol(A)-1)] + A[2:ncol(A)])[c(T,F)]
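A hedged sketch of another base R way to express the same pairing, splitting the columns into consecutive pairs (using the df from the sapply answer above):
groups <- (seq_along(df) - 1) %/% 2          # 0 0 1 1 2 2 ... marks the column pairs
sapply(split.default(df, groups), rowSums)   # one column of row sums per pair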

Quickest way to read a subset of rows of a CSV

I have a 5 GB CSV with 2 million rows. The header is a row of comma-separated strings and each data row is comma-separated doubles, with no missing or corrupted data. It is rectangular.
My objective is to read a random 10% (with or without replacement, doesn't matter) of the rows into RAM as fast as possible. An example of a slow solution (but faster than read.csv) is to read in the whole matrix with fread and then keep a random 10% of the rows.
require(data.table)
X <- data.matrix(fread('/home/user/test.csv')) #reads the full data.matrix
X <- X[sample(1:nrow(X))[1:round(nrow(X)/10)],] #sample random 10%
However I'm looking for the fastest possible solution (this is slow because I need to read the whole thing first, then trim it after).
The solution deserving of a bounty will give system.time() estimates of different alternatives.
Other:
I am using Linux
I don't need exactly 10% of the rows. Just approximately 10%.
I think this should work pretty quickly, but let me know since I have not tried with big data yet.
write.csv(iris,"iris.csv")
fread("shuf -n 5 iris.csv")
V1 V2 V3 V4 V5 V6
1: 37 5.5 3.5 1.3 0.2 setosa
2: 88 6.3 2.3 4.4 1.3 versicolor
3: 84 6.0 2.7 5.1 1.6 versicolor
4: 125 6.7 3.3 5.7 2.1 virginica
5: 114 5.7 2.5 5.0 2.0 virginica
This takes a random sample of N=5 for the iris dataset.
To avoid the chance of using the header row again, this might be a useful modification:
fread("tail -n+2 iris.csv | shuf -n 5", header=FALSE)
Here's a file with 100000 lines in it like this:
"","a","b","c"
"1",0.825049088569358,0.556148858508095,0.591679535107687
"2",0.161556158447638,0.250450366642326,0.575034103123471
"3",0.676798462402076,0.0854280597995967,0.842135070590302
"4",0.650981109589338,0.204736212035641,0.456373531138524
"5",0.51552157686092,0.420454133534804,0.12279288447462
$ wc -l d.csv
100001 d.csv
So that's 100000 lines plus a header. We want to keep the header and sample each line if a random number from 0 to 1 is greater than 0.9.
$ awk 'NR==1 {print} ; rand()>.9 {print}' < d.csv >sample.csv
check:
$ head sample.csv
"","a","b","c"
"12",0.732729186303914,0.744814146542922,0.199768838472664
"35",0.00979996216483414,0.633388962829486,0.364802648313344
"36",0.927218825090677,0.730419414117932,0.522808947600424
"42",0.383301998255774,0.349473554175347,0.311060158303007
and it has 10027 lines:
$ wc -l sample.csv
10027 sample.csv
This took 0.033 s of real time on my 4-year-old box; the disk speed is probably the limiting factor here. It should scale linearly since the file is dealt with strictly line by line.
You then read in sample.csv using read.csv or fread as desired:
> s = fread("sample.csv")
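If you prefer to stay inside R, the same awk filter can be run through fread's shell-command interface; a small sketch (the || keeps the header without risking a duplicated first line):
library(data.table)
s <- fread(cmd = "awk 'NR==1 || rand() > 0.9' d.csv")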
You could use sqldf::read.csv.sql and an SQL command to pull the data in:
library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE) # write a csv file to test with
read.csv.sql("iris.csv","SELECT * FROM file ORDER BY RANDOM() LIMIT 10")
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1 6.3 2.8 5.1 1.5 virginica
2 4.6 3.1 1.5 0.2 setosa
3 5.4 3.9 1.7 0.4 setosa
4 4.9 3.0 1.4 0.2 setosa
5 5.9 3.0 4.2 1.5 versicolor
6 6.6 2.9 4.6 1.3 versicolor
7 4.3 3.0 1.1 0.1 setosa
8 4.8 3.4 1.9 0.2 setosa
9 6.7 3.3 5.7 2.5 virginica
10 5.9 3.2 4.8 1.8 versicolor
It doesn't calculate the 10% for you, but you can choose the absolute limit of rows to return.
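If an approximate 10% is still wanted, the LIMIT can be derived from a cheap line count first; a hedged sketch along the same lines (again assuming a Unix shell):
n <- as.integer(system("wc -l < iris.csv", intern = TRUE)) - 1L  # data rows, minus header
read.csv.sql("iris.csv",
             sprintf("SELECT * FROM file ORDER BY RANDOM() LIMIT %d", n %/% 10L))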

R programming: transferring data from Excel with missing values to R

So I have an Excel spreadsheet with NA values. What is the best way to copy the data and put it into R? I usually use data = read.delim("clipboard"), but because of those missing values I keep getting this error:
Error in if (del == 0 && to == 0) return(to) :
missing value where TRUE/FALSE needed
What are the possible ways I can get rid of this error? I tried putting zeros instead of the NA values, but that kind of screws up what the code is doing.
Here's the link to the code that I'm using: R programming fixing error. It was really helpful for my data problems.
I was going to post the whole data set, but there's a limit of 30000 characters.
You need to set the option fill to TRUE. In case the rows have unequal lengths, this will fill the missing fields with NA:
read.table(fileName,header=TRUE,fill=TRUE)
fileName here is the path to your exported file, for example fileName = 'c:/temp/myfile.csv' (or 'c:\\temp\\myfile.csv'; single backslashes won't work in an R string).
This should also work with read.delim, which is a wrapper around read.table. You can also give read.table a string, but then you set the text argument, not the file one. For example:
read.table(text = ' Time Speed Time Speed
0.8 2.9 0.3 2.7
1.3 2.8 0.9 2.7
1.7 2.3 2.5 3.1
2.0 0.6
2.3 1.7 13.6 3.3
3.0 1.4 15.1 3.5
3.5 1.3 17.5 3.3',head=T,fill=T)
Time Speed Time.1 Speed.1
1 0.8 2.9 0.3 2.7
2 1.3 2.8 0.9 2.7
3 1.7 2.3 2.5 3.1
4 2.0 0.6 NA NA
5 2.3 1.7 13.6 3.3
6 3.0 1.4 15.1 3.5
7 3.5 1.3 17.5 3.3
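Once the NAs are in place, downstream calculations just need to be told to skip them rather than having zeros substituted; a minimal sketch with a shortened version of the data above (here called d):
d <- read.table(text = 'Time Speed Time Speed
2.0 0.6
3.0 1.4 15.1 3.5', header = TRUE, fill = TRUE)
mean(d$Speed.1, na.rm = TRUE)   # the incomplete row is ignored instead of becoming 0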

How can I use functions returning vectors (like fivenum) with ddply or aggregate?

I would like to split my data frame using a couple of columns and call let's say fivenum on each group.
aggregate(Petal.Width ~ Species, iris, function(x) summary(fivenum(x)))
The returned value is a data.frame with only 2 columns, the second being a matrix. How can I turn it into normal columns of a data.frame?
Update
I want something like the following with less code using fivenum
ddply(iris, .(Species), summarise,
Min = min(Petal.Width),
Q1 = quantile(Petal.Width, .25),
Med = median(Petal.Width),
Q3 = quantile(Petal.Width, .75),
Max = max(Petal.Width)
)
Here is a solution using data.table (while not specifically requested, it is an obvious complement to or replacement for aggregate or ddply). As well as being slightly longer to code, repeatedly calling quantile will be inefficient, as each call sorts the data.
library(data.table)
Tukeys_five <- c("Min","Q1","Med","Q3","Max")
IRIS <- data.table(iris)
# this will create the wide data.table
lengthBySpecies <- IRIS[,as.list(fivenum(Sepal.Length)), by = Species]
# and you can rename the columns from V1, ..., V5 to something nicer
setnames(lengthBySpecies, paste0('V',1:5), Tukeys_five)
lengthBySpecies
Species Min Q1 Med Q3 Max
1: setosa 4.3 4.8 5.0 5.2 5.8
2: versicolor 4.9 5.6 5.9 6.3 7.0
3: virginica 4.9 6.2 6.5 6.9 7.9
Or, using a single call to quantile with the appropriate probs argument:
IRIS[,as.list(quantile(Sepal.Length, prob = seq(0,1, by = 0.25))), by = Species]
Species 0% 25% 50% 75% 100%
1: setosa 4.3 4.800 5.0 5.2 5.8
2: versicolor 4.9 5.600 5.9 6.3 7.0
3: virginica 4.9 6.225 6.5 6.9 7.9
Note that the names of the created columns are not syntactically valid, although you could go through a similar renaming using setnames.
EDIT
Interestingly, quantile will set the names of the resulting vector if names = TRUE (the default), and this copying will slow down the number crunching and consume memory; it even warns you about this in the help, fancy that!
Thus, you should probably use
IRIS[,as.list(quantile(Sepal.Length, prob = seq(0,1, by = 0.25), names = FALSE)), by = Species]
Or, if you wanted to return the named list, without R copying internally
IRIS[,{quant <- as.list(quantile(Sepal.Length, prob = seq(0,1, by = 0.25), names = FALSE))
setattr(quant, 'names', Tukeys_five)
quant}, by = Species]
You can use do.call to call data.frame on each of the matrix elements recursively to get a data.frame with vector columns (here dfr is the result of the aggregate call from the question):
dim(do.call("data.frame",dfr))
[1] 3 7
str(do.call("data.frame",dfr))
'data.frame': 3 obs. of 7 variables:
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
$ Petal.Width.Min. : num 0.1 1 1.4
$ Petal.Width.1st.Qu.: num 0.2 1.2 1.8
$ Petal.Width.Median : num 0.2 1.3 2
$ Petal.Width.Mean : num 0.28 1.36 2
$ Petal.Width.3rd.Qu.: num 0.3 1.5 2.3
$ Petal.Width.Max. : num 0.6 1.8 2.5
As far as I know, there isn't an exact way to do what you're asking, because the function you're using (fivenum) doesn't return data in a way that can be easily bound to columns from within the 'ddply' function. This is easy to clean up, though, in a programmatic way.
Step 1: Perform the fivenum function on each 'Species' value using the 'ddply' function.
data <- ddply(iris, .(Species), summarize, value=fivenum(Petal.Width))
# Species value
# 1 setosa 0.1
# 2 setosa 0.2
# 3 setosa 0.2
# 4 setosa 0.3
# 5 setosa 0.6
# 6 versicolor 1.0
# 7 versicolor 1.2
# 8 versicolor 1.3
# 9 versicolor 1.5
# 10 versicolor 1.8
# 11 virginica 1.4
# 12 virginica 1.8
# 13 virginica 2.0
# 14 virginica 2.3
# 15 virginica 2.5
Now, the 'fivenum' function returns a vector of five numbers, so we end up with 5 rows for each species. That's the part where the 'fivenum' function is fighting us.
Step 2: Add a label column. We know what Tukey's five numbers are, so we just call them out in the order that the 'fivenum' function returns them. The label vector will recycle until it hits the end of the data.
Tukeys_five <- c("Min","Q1","Med","Q3","Max")
data$label <- Tukeys_five
# Species value label
# 1 setosa 0.1 Min
# 2 setosa 0.2 Q1
# 3 setosa 0.2 Med
# 4 setosa 0.3 Q3
# 5 setosa 0.6 Max
# 6 versicolor 1.0 Min
# 7 versicolor 1.2 Q1
# 8 versicolor 1.3 Med
# 9 versicolor 1.5 Q3
# 10 versicolor 1.8 Max
# 11 virginica 1.4 Min
# 12 virginica 1.8 Q1
# 13 virginica 2.0 Med
# 14 virginica 2.3 Q3
# 15 virginica 2.5 Max
Step 3: With the labels in place, we can quickly cast this data into a new shape using the 'dcast' function from the 'reshape2' package.
library(reshape2)
dcast(data, Species ~ label)[,c("Species",Tukeys_five)]
# Species Min Q1 Med Q3 Max
# 1 setosa 0.1 0.2 0.2 0.3 0.6
# 2 versicolor 1.0 1.2 1.3 1.5 1.8
# 3 virginica 1.4 1.8 2.0 2.3 2.5
All that junk at the end just specifies the column order, since the 'dcast' function automatically puts things in alphabetical order.
Hope this helps.
Update: I decided to return, because I realized there is one other option available to you. You can always bind a matrix as part of a data frame definition, so you could resolve your 'aggregate' function like so:
data <- aggregate(Petal.Width ~ Species, iris, function(x) summary(fivenum(x)))
result <- data.frame(Species=data[,1],data[,2])
# Species Min. X1st.Qu. Median Mean X3rd.Qu. Max.
# 1 setosa 0.1 0.2 0.2 0.28 0.3 0.6
# 2 versicolor 1.0 1.2 1.3 1.36 1.5 1.8
# 3 virginica 1.4 1.8 2.0 2.00 2.3 2.5
This is my solution:
ddply(iris, .(Species), summarize, value=t(fivenum(Petal.Width)))
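A small hedged variation of that last one-liner that also attaches Tukey's labels, borrowing the Tukeys_five vector defined in the data.table answer above:
library(plyr)
Tukeys_five <- c("Min", "Q1", "Med", "Q3", "Max")
ddply(iris, .(Species), function(d)
  setNames(as.data.frame(t(fivenum(d$Petal.Width))), Tukeys_five))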
