c. What is the correlation coefficient between “beds” and “baths” for the houses with total rooms not exceeding (i.e., less than or equal to) 6 rooms (total rooms = “beds” + “baths”)?
Data:
myData$beds + myData$baths
[1] 5 5 6 5 5 5 5 7 6 5 5 5 6 6 5 6 5 6 5 6 6 5 7 5 5 8 5 5 7 5 7 7 6 7 8 6 7
[38] 7 7 6 5 7 6 7 6 7 5 7 6 5 5 7 5 5 6 7 7 5 5 7 5 7 8 8 5 6 6 5 5 6 6 8 6 7
[75] 7 6 6 6 7 6 6 7 7 6 6 6 8 7 6 8 6 9 10 7 7 7 6 6 6 7 8 7 6 7 7 7 7 7
How do I write code that computes cor(myData$beds, myData$baths), with the condition that the sum of both is less than or equal to 6?
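A minimal sketch, assuming myData has numeric columns beds and baths: build a logical mask for the condition, then correlate the subset.

```r
# Keep only houses whose total rooms (beds + baths) do not exceed 6
keep <- (myData$beds + myData$baths) <= 6
cor(myData$beds[keep], myData$baths[keep])
```

Equivalently, `with(subset(myData, beds + baths <= 6), cor(beds, baths))` does the same in one line.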
So I have a list of length 9 where each element is a dataframe. I wanted to extract a specific column from each dataframe as efficiently as possible, so I used the call below.
Down <- lapply(tables, "[", 2)
This successfully extracted the information I wanted, but why? What is "[" and how does it satisfy the semantics of lapply?
To augment Ritchie Sacramento's very nice explanation, you can also access the dataframe columns in a list using conventional notation:
cars1 <- mtcars
cars2 <- cars1
cars3 <- cars2
tables <- list(cars1, cars2, cars3)
lapply(tables, function(x) x$cyl)
[[1]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
[[2]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
[[3]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
lapply(tables, function(x) x[, 2])
[[1]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
[[2]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
[[3]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
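To make the original question concrete: "[" is simply the name of R's subsetting function, and lapply passes the extra argument 2 along to it, so each call becomes tables[[i]][2]. Note that single-bracket indexing with one argument returns a one-column data frame, whereas x[, 2] drops to a vector. A small sketch:

```r
tables <- list(mtcars, mtcars)
a <- lapply(tables, `[`, 2)            # same as lapply(tables, "[", 2)
b <- lapply(tables, function(x) x[2])  # each call is tables[[i]][2]
identical(a, b)
# [1] TRUE
is.data.frame(a[[1]])  # single-bracket indexing keeps the data frame class
# [1] TRUE
```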
I currently have multiple NetCDF files with 4 dimensions, (latitude, longitude, time, and depth). Each represents a single year of monthly data. The unit of time is "month", 1-12, and therefore quite useless if I want to merge these files across years to give me a single NetCDF file with a time dimension of size months*years.
The time dimension attributes for a single file:
time Size:12 *** is unlimited ***
long_name: time
units: month
I used NCO's ncrcat to merge:
ncrcat soda3.3.1*sst.nc -O soda3.3.1_1980_2015_sst.nc
This works except that when merged, time values read
#in R
soda.info$var$temp$dim[[3]]$vals
[1] 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1
[26] 2 3 4 5 6 7 8 9 10 11 12 ...
(the 1-12 cycle repeats through all 432 values, i.e. 36 years of monthly data)
...which obviously isn't much help if I want to keep track of time.
In the past I've only used NetCDF files with a "months since..." unit. Is there a way to change these rather groundless 'month' units to 'months since...'?
Would it suffice to enumerate the months sequentially?
ncap2 -s 'time=array(0,1,$time)' soda3.3.1_1980_2015_sst.nc out.nc
You can also add a "months since ..." unit to time as described in the comment by Chelmy and/or in the NCO manual. I leave that as an exercise for you, gentle reader.
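As a sketch of that exercise (the 1980-01-01 reference date is an assumption; set it to your data's actual first month), you can renumber time with ncap2 and then overwrite the units attribute with ncatted to get a CF-style axis:

```shell
# Renumber time as 0,1,2,... and attach a "months since" unit.
# File names follow the example above; the 1980-01-01 origin is assumed.
ncap2 -O -s 'time=array(0,1,$time)' soda3.3.1_1980_2015_sst.nc out.nc
ncatted -O -a units,time,o,c,"months since 1980-01-01" out.nc
```

In the ncatted flags, o means overwrite the attribute and c makes it a character attribute.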
How to create a vector sequence of:
2 3 4 5 6 7 8 3 4 5 6 7 8 4 5 6 7 8 5 6 7 8 6 7 8 7 8
I tried to use:
2:8+rep(0:6,each=6)
but the result is:
2 3 4 5 6 7 8 3 4 5 6 7 8 9 4 5 6 7 8 9 10 .... 12 13 14
Please help. Thanks.
This should accomplish what you're looking for:
x <- 2
VecSeq <- c(x:8)
while (x < 7) {
  x <- x + 1
  calc <- c(x:8)
  VecSeq <- c(VecSeq, calc)
}
VecSeq  # your desired vector
You could do this:
library(purrr)
unlist(map(2:7, ~.x:8))
# [1] 2 3 4 5 6 7 8 3 4 5 6 7 8 4 5 6 7 8 5 6 7 8 6 7 8 7 8
and a little function in base R:
funky_vec <- function(from, to) unlist(sapply(from:(to - 1), `:`, to))
funky_vec(2,8)
# [1] 2 3 4 5 6 7 8 3 4 5 6 7 8 4 5 6 7 8 5 6 7 8 6 7 8 7 8
This is made really easy with sequence (since R 4.0.0):
sequence(7:2, 2:7)
# [1] 2 3 4 5 6 7 8 3 4 5 6 7 8 4 5 6 7 8 5 6 7 8 6 7 8 7 8
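For reference, sequence(nvec, from) concatenates one run per element: run i has length nvec[i] and starts at from[i], counting by 1. Here that yields 2:8, 3:8, ..., 7:8, matching the lapply spelling:

```r
# Requires R >= 4.0.0 for the `from` argument of sequence()
a <- sequence(7:2, from = 2:7)
b <- unlist(lapply(2:7, function(i) i:8))
identical(a, b)
# [1] TRUE
```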
I have a numeric element z as below:
> sort(z)
[1] 1 5 5 5 6 6 7 7 7 7 7 9 9
I would like to sequentially reorganize this element so to have
> z
[1] 1 2 2 2 3 3 4 4 4 4 4 5 5
I guess converting z to a factor and using it as an index should be the way.
You answered it yourself really:
as.integer(factor(sort(z)))
I know this has been accepted already but I decided to look inside factor() to see how it's done there. It more or less comes down to this:
x <- sort(z)
match(x, unique(x))
That's an extra line, I suppose, but it should be faster if that matters.
This should do the trick:
z <- sort(sample(1:10, 100, replace = TRUE))
cumsum(c(1, diff(z) != 0))
Note that diff omits the first element of the series, so we prepend a 1 to stand in for it. Comparing diff(z) != 0, rather than summing the raw differences as in cumsum(diff(z)) + 1, keeps the ranks dense even when a value between min(z) and max(z) never occurs in the sample.
Alternative using rle. Note the seq_along(): rep(rle_result$values, rle_result$lengths) would simply reconstruct z itself (below the two happen to coincide because every value of 1:10 appears in the sample, so the run values are already 1, 2, ..., 10):
z <- sort(sample(1:10, 100, replace = TRUE))
rle_result <- rle(z)
rep(seq_along(rle_result$values), rle_result$lengths)
[1] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[26] 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 6 6 6
[51] 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8
[76] 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10
Or as a one-liner: rep(seq_along(rle(z)$lengths), rle(z)$lengths)
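Using the sorted values from the question, all of these approaches agree:

```r
x <- c(1, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 9, 9)  # sort(z) from the question
as.integer(factor(x))
# [1] 1 2 2 2 3 3 4 4 4 4 4 5 5
match(x, unique(x))
# [1] 1 2 2 2 3 3 4 4 4 4 4 5 5
rep(seq_along(rle(x)$lengths), rle(x)$lengths)
# [1] 1 2 2 2 3 3 4 4 4 4 4 5 5
```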
I have a dataframe running to about 500,000 rows. One of its columns, say column A, contains positive integer values; let there be another column B.
I now need to create a second dataframe with the number of rows equal to sum(dataframe$A). This is done.
A question of performance arises when I need to fill this new data frame with data. I am trying to create a column A2 for this second frame as follows:
A2 <- vector()
for (i in 1:nrow(dataframe)) {
  A2 <- c(A2, rep(dataframe$B[i], dataframe$A[i]))
}
The explicit loop is obviously very slow for the large number of rows being processed. Any suggestions on how to achieve this task with faster processing?
Thanks for any responses.
You simply do not need the loop at all. rep is already vectorized.
A2 <- rep(dataframe$B, dataframe$A)
Should work. As a reproducible example, here is your way using the built in mtcars dataset.
x <- vector()
for(i in 1:nrow(mtcars)) {x <- c(x, rep(mtcars$cyl[i], mtcars$gear[i]))}
> x
[1] 6 6 6 6 6 6 6 6 4 4 4 4 6 6 6 8 8 8 6 6 6 8 8 8 4 4 4 4 4 4 4 4 6 6 6 6 6
[38] 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8
[75] 8 8 8 8 8 8 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8 8 8 8 6 6 6 6 6 8 8
[112] 8 8 8 4 4 4 4
and vectorized, it is:
x2 <- rep(mtcars$cyl, mtcars$gear)
> x2
[1] 6 6 6 6 6 6 6 6 4 4 4 4 6 6 6 8 8 8 6 6 6 8 8 8 4 4 4 4 4 4 4 4 6 6 6 6 6
[38] 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8
[75] 8 8 8 8 8 8 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8 8 8 8 6 6 6 6 6 8 8
[112] 8 8 8 4 4 4 4
which will be orders of magnitude faster than using a loop.
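A rough sketch of the difference (timings vary by machine; the data here is made up for illustration):

```r
set.seed(1)
df <- data.frame(A = sample(1:5, 5000, replace = TRUE), B = rnorm(5000))

# Growing a vector with c() inside a loop copies it on every iteration
system.time({
  A2 <- vector()
  for (i in 1:nrow(df)) A2 <- c(A2, rep(df$B[i], df$A[i]))
})

# Vectorized rep does a single allocation
system.time(A3 <- rep(df$B, df$A))

identical(A2, A3)
# [1] TRUE
```

The loop is quadratic in the output length because each c() call reallocates and copies the accumulated vector, while rep builds the result once.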