Splitting a variable into equally sized groups - r

I have a continuous variable called Longitude (it corresponds to geographical longitude) that has 12465 unique values. I need to create a new variable called Longitude1024 that consists of the variable Longitude split into 1024 equally sized groups. I did that using the following function:
data$Longitude1024 <- as.factor( as.numeric( cut(data$Longitude,1024)))
However, the problem is that, when I use this function to create the new variable Longitude1024, the new variable consists of only 651 unique elements rather than 1024. Does anyone know what the problem is here, and how I could actually get the new variable with 1024 unique values?
Thanks a lot

Use rank, then scale it down. Here's an example with 10 groups:
x <- rnorm(124655)
g <- floor(rank(x) * 10 / (length(x) + 1))
table(g)
# g
# 0 1 2 3 4 5 6 7 8 9
# 12465 12466 12465 12466 12465 12466 12466 12465 12466 12465
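Applied to the question's setup, the same rank trick yields exactly 1024 groups. (The Longitude column below is simulated, since the original data isn't available; because the question says all 12465 values are unique, the ranks are unique too and every group is non-empty.)

```r
# simulated stand-in for the question's data frame and Longitude column
set.seed(42)
data <- data.frame(Longitude = runif(12465, min = -180, max = 180))
n <- nrow(data)

# rank-based binning into 1024 equal-sized groups, labelled 1..1024
data$Longitude1024 <- as.factor(floor(rank(data$Longitude) * 1024 / (n + 1)) + 1)

length(unique(data$Longitude1024))
# [1] 1024
range(table(data$Longitude1024))  # every group has 12 or 13 observations
# [1] 12 13
```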

Short answer: try cut2 from the Hmisc package
Long answer
Example: split dat, a vector of 1000 unique values, into 100 equal groups of 10 each.
Doesn't work:
# dummy data
set.seed(321)
dat <- rexp(1000)
# all unique values
length(unique(dat))
[1] 1000
cut generates 100 levels
init_res <- cut(dat, 100)
length(unique(levels(init_res)))
[1] 100
But it does not split the data into equally sized groups:
init_grps <- split(dat, cut(dat, 100))
table(unlist(lapply(init_grps, length)))
0 1 2 3 4 5 6 7 9 10 11 13 15 17 18 19 22 23 24 25 27 37 38 44 47 50 63 71 72 77
42 9 8 4 1 3 1 3 2 1 2 1 1 1 2 1 1 1 2 2 2 1 1 1 1 1 1 2 1 1
Works with Hmisc::cut2
cut2 divides the vector into groups of equal length, as desired
require(Hmisc)
final_grps <- split(dat, cut2(dat, g=100))
table(unlist(lapply(final_grps, length)))
10
100
If you want, you can store the results in a matrix (all groups have the same length, so rbind works), for example
foobar <- do.call(rbind, final_grps)
head(foobar)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[0.000611,0.00514) 0.004345915 0.002192086 0.004849693 0.002911516 0.003421753 0.003159641 0.004855366 0.0006111574
[0.005137,0.01392) 0.009178133 0.005137309 0.008347482 0.007072484 0.008732725 0.009379002 0.008818794 0.0110489833
[0.013924,0.02004) 0.014283326 0.014356782 0.013923721 0.014290554 0.014895342 0.017992638 0.015608931 0.0173707930
[0.020041,0.03945) 0.023047527 0.020437743 0.026353839 0.036159321 0.024371834 0.026629812 0.020793695 0.0214221779
[0.039450,0.05912) 0.043379064 0.039450453 0.050806316 0.054778805 0.040093806 0.047228050 0.055058519 0.0446634954
[0.059124,0.07362) 0.069671018 0.059124220 0.063242564 0.064505875 0.072344089 0.067196661 0.065575249 0.0634142853
[,9] [,10]
[0.000611,0.00514) 0.002524557 0.003155055
[0.005137,0.01392) 0.008287758 0.011683228
[0.013924,0.02004) 0.018537469 0.014847937
[0.020041,0.03945) 0.026233400 0.020040981
[0.039450,0.05912) 0.041310471 0.058449603
[0.059124,0.07362) 0.063608022 0.066316782
Hope this helps
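If you'd rather not add a dependency, a base-R sketch of the same idea is to cut at sample quantiles, which is roughly what cut2 does for its g argument:

```r
# same dummy data as above
set.seed(321)
dat <- rexp(1000)

# breaks at the 0%, 1%, ..., 100% sample quantiles
brks <- quantile(dat, probs = seq(0, 1, length.out = 101))
grps <- cut(dat, breaks = brks, include.lowest = TRUE)

table(table(grps))  # 100 groups of 10 each
# 10
# 100
```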

Related

Add new column with mutate() from a second dataframe with index given in the first dataframe

I have one dataframe that contains my results and another dataframe that contains, e.g., only values. Now I want to add a new column to the first dataframe containing the data from the second dataframe. However, the second dataframe does not have a tidy format or the same rows as the first one; instead, the position of the value I want to get from the second dataframe is given in two columns of the first dataframe.
library(tidyverse)
df1 <- data.frame(Row_no=c(1,2,3,4,1,2,3,4), Col_no=c(1,1,2,2,3,3,4,4), Size=c(sample(200:300, 8)))
>df1
Row_no Col_no Size
1 1 1 226
2 2 1 208
3 3 2 297
4 4 2 211
5 1 3 209
6 2 3 296
7 3 4 273
8 4 4 261
df2=cbind(rnorm(8), rnorm(8), rnorm(8), rnorm(8), rnorm(8), rnorm(8), rnorm(8), rnorm(8))
> df2
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] *1.4568994* -0.3324945 *-0.2885171* -0.79393545 -0.02439371 1.4216918 0.07288639 -0.2441228
[2,] *0.3648703* 0.7494033 *-0.9974556* -0.33820023 -0.30235757 1.5094486 -0.10982881 1.9349127
[3,] 0.5044991 *1.2208453* -0.8748034 *-0.86325341* 0.10462120 -0.3674390 -0.04107733 1.1815123
[4,] -1.2792906 *0.7408320* -0.2711479 *-0.07350530* -0.92132461 -0.7753123 0.99841815 1.5802167
[5,] -0.8801507 0.2580448 0.3099108 0.66716720 -0.01144132 -0.9353671 0.44608715 -0.6729589
[6,] 0.4809844 0.6349390 1.9900160 0.62358533 0.35075449 2.4124712 -1.45171943 0.4409148
[7,] -0.5146914 0.9115070 -0.3971806 -0.06477066 0.46028331 0.7067722 -0.44562194 1.9545829
[8,] -0.4299626 1.8211741 0.3272991 0.06177976 1.25383361 -0.7770162 -0.49841279 0.5098795
The desired result would be something like the following (I put asterisks around the values in df2 to show which ones I wanted):
Row_no Col_no Size Value
1 1 1 226 1.4568994
2 2 1 208 0.3648703
3 3 2 297 1.2208453
4 4 2 211 0.7408320
5 1 3 209 -0.2885171
6 2 3 296 -0.9974556
7 3 4 273 -0.86325341
8 4 4 261 -0.07350530
However, when I try to run the code
df1%>%
mutate(value=df2[Row_no, Col_no])
I get the error message
Error: Column `value` must be length 8 (the number of rows) or one, not 64
That would be expected. However, when I try to index individual elements, I get
df1%>%
mutate(value=df2[Row_no[1], Col_no[1]])
Row_no Col_no Size value
1 1 1 226 1.456899
2 2 1 208 1.456899
3 3 2 297 1.456899
4 4 2 211 1.456899
5 1 3 209 1.456899
6 2 3 296 1.456899
7 3 4 273 1.456899
8 4 4 261 1.456899
> df1%>%
+ mutate(value[1]=df2[Row_no[1], Col_no[1]])
Error: Unexpected '=' in:
"df1%>%
mutate(value[1]="
So how would I get my desired result? I would prefer a tidy solution. Also, the given example is just a minimal reproducible example; my real files are really large, so I need a clean solution...
Thanks!
Thanks to @Yuriy Barvinchenko, I was able to figure out a solution:
df1 %>%
  mutate(value = df2[cbind(Row_no, Col_no)])
Row_no Col_no Size value
1 1 1 226 1.4568994
2 2 1 208 0.3648703
3 3 2 297 1.2208453
4 4 2 211 0.7408320
5 1 3 209 -0.2885171
6 2 3 296 -0.9974556
7 3 4 273 -0.8632534
8 4 4 261 -0.0735053
The important part was the cbind() in the indexing brackets.
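For anyone puzzled by why that works: indexing a matrix with a two-column matrix picks one element per (row, column) pair, instead of taking all row/column combinations. A minimal illustration:

```r
m <- matrix(1:12, nrow = 3)  # 3 x 4, filled column-wise

# each row of idx is one (row, column) pair to pick out of m
idx <- cbind(c(1, 3, 2), c(2, 4, 1))
m[idx]
# [1] 4 12 2

# contrast with ordinary vector indexing, which takes all combinations
dim(m[c(1, 3, 2), c(2, 4, 1)])
# [1] 3 3
```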
Based on the answer here:
df1$value <- with( df1, df2[ cbind(Row_no, Col_no) ] )
Using purrr::pmap:
df1$Value <- unlist(pmap(list(df1$Row_no, df1$Col_no, list(df2)), ~ ..3[..1,..2]))
and with piping:
df1 %>%
  mutate(Value = unlist(pmap(list(Row_no, Col_no, list(df2)), ~ ..3[..1, ..2])))
(note the unlist(); without it, Value would be a list column)
The problem is that when you try mutate(value = df2[Row_no, Col_no]), you are actually generating a square matrix of length(Row_no) * length(Col_no) elements, equivalent to df2[df1$Col_no, df1$Row_no]. When you think about it, this is a stack of the 8 "correct" rows, where the correct columns are numbered 1 to 8. The correct elements can therefore be found at [1, 1], [2, 2], [3, 3] ... [n, n], i.e. the diagonal of the matrix. One way to get these into a single column is to multiply the matrix by the identity matrix and take the row sums.
I have replicated your random data here to give a complete solution that matches your example.
library(tidyverse)
df1 <- data.frame(Row_no = rep(1:4, 2),
Col_no = rep(1:4, each = 2),
Size = c(sample(200:300, 8)))
df2 <- cbind(c( 1.4568994, -0.3324945, -0.2885171, -0.79393545,
-0.02439371, 1.4216918, 0.07288639, -0.2441228),
c( 0.3648703, 0.7494033, -0.9974556, -0.33820023,
-0.30235757, 1.5094486, -0.10982881, 1.9349127),
c( 0.5044991, 1.2208453, -0.8748034, -0.86325341,
0.10462120, -0.3674390, -0.04107733, 1.1815123),
c(-1.2792906, 0.7408320, -0.2711479, -0.07350530,
-0.92132461, -0.7753123, 0.99841815, 1.5802167),
c(-0.8801507, 0.2580448, 0.3099108, 0.66716720,
-0.01144132, -0.9353671, 0.44608715, -0.6729589),
c( 0.4809844, 0.6349390, 1.9900160, 0.62358533,
0.35075449, 2.4124712, -1.45171943, 0.4409148),
c(-0.5146914, 0.9115070, -0.3971806, -0.06477066,
0.46028331, 0.7067722, -0.44562194, 1.9545829),
c(-0.4299626, 1.8211741, 0.3272991, 0.06177976,
1.25383361, -0.7770162, -0.49841279, 0.5098795))
df1 %>% mutate(value = rowSums(df2[Col_no, Row_no] * diag(8))) %>% print
# Row_no Col_no Size value
# 1 1 1 267 1.4568994
# 2 2 1 283 0.3648703
# 3 3 2 259 1.2208453
# 4 4 2 235 0.7408320
# 5 1 3 212 -0.2885171
# 6 2 3 263 -0.9974556
# 7 3 4 251 -0.8632534
# 8 4 4 200 -0.0735053

R ranges: 1:0 - illogical behavior

I have an array X of length N, and I'd like to compute sum(X[(i+1):N]) - sum(X[1:(i-1)]). This works fine if my index i is within 2..(N-1). If it's equal to 1, the second term returns the first element of the array rather than excluding it. If it's equal to N, the first term returns the last element of the array rather than excluding it. seq_len is the only function I'm aware of that does the job, but only for the 2nd term (it indexes 1:n). What I need is a range function that will return NULL (rather than throw an exception like seq) when its 2nd argument is below its first; the sum function will do the rest. Is anyone aware of one, or do I have to write one myself?
I suggest an alternate path for generating indexing sequences: seq_len, which reacts intuitively in the extremes.
Bottom Line Up Front: use sum(X[-seq_len(i)]) - sum(X[seq_len(i-1)]) instead.
First, some sample data:
X <- 1:10
N <- length(X)
Your approach, at the two extremes:
i <- 1
X[(i+1):N]
# [1] 2 3 4 5 6 7 8 9 10
X[1:(i-1)] # oops
# [1] 1
That should return "nothing", I believe. (More to the point, sum(...) should return 0. For the record, sum(integer(0)) is 0.)
i <- 10
X[(i+1):N] # oops
# [1] NA 10
X[1:(i-1)]
# [1] 1 2 3 4 5 6 7 8 9
There's your other error, where you'd expect "nothing" in the first subset.
Instead, I suggest you use seq_len:
i <- 1
X[-seq_len(i)]
# [1] 2 3 4 5 6 7 8 9 10
X[seq_len(i-1)]
# integer(0)
i <- 10
X[-seq_len(i)]
# integer(0)
X[seq_len(i-1)]
# [1] 1 2 3 4 5 6 7 8 9
Both seem fine, and something in the middle makes sense.
i <- 5
X[-seq_len(i)]
# [1] 6 7 8 9 10
X[seq_len(i-1)]
# [1] 1 2 3 4
In this contrived example, what we're looking for at any value of i:
1: sum(2:10) - 0 = 54 - 0 = 54
2: sum(3:10) - sum(1:1) = 52 - 1 = 51
3: sum(4:10) - sum(1:2) = 49 - 3 = 46
...
10: 0 - sum(1:9) = 0 - 45 = -45
And we now get that:
func <- function(i, x) sum(x[-seq_len(i)]) - sum(x[seq_len(i-1)])
sapply(c(1,2,3,10), func, X)
# [1] 54 51 46 -45
Edit:
李哲源's answer got me thinking that you don't need to re-sum the numbers before and after every time. Just do it once and re-use it. This method could easily be a bit faster if your vector is large.
Xb <- c(0, cumsum(X)[-N])
Xb
# [1] 0 1 3 6 10 15 21 28 36 45
Xa <- c(rev(cumsum(rev(X)))[-1], 0)
Xa
# [1] 54 52 49 45 40 34 27 19 10 0
sapply(c(1,2,3,10), function(i) Xa[i] - Xb[i])
# [1] 54 51 46 -45
So this suggests that your summed value at any value of i is
Xs <- Xa - Xb
Xs
# [1] 54 51 46 39 30 19 6 -9 -26 -45
where you can find the specific value with Xs[i]. No repeated summing required.
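As a quick sanity check (using the same X <- 1:10 as above), the cumulative-sum shortcut agrees with the direct subset-and-sum version at every index:

```r
X <- 1:10
N <- length(X)

# direct version: sum of the elements after i minus sum of those before i
direct <- function(i) sum(X[-seq_len(i)]) - sum(X[seq_len(i - 1)])

# cumsum version: suffix sums (Xa) and prefix sums (Xb) at each position
Xa <- c(rev(cumsum(rev(X)))[-1], 0)
Xb <- c(0, cumsum(X)[-N])
Xs <- Xa - Xb

all(sapply(seq_len(N), direct) == Xs)
# [1] TRUE
```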

How to extract the values from a raster in R

I want to use R to extract values from a raster. Basically, my raster has values from 0-6, and I want to extract the corresponding value for every single pixel, so that in the end I have a data table containing those two variables.
Thank you for your help; I hope my explanations are precise enough.
Example data
library(raster)
r <- raster(ncol=5, nrow=5, vals=1:25)
To get all values, you can do
values(r)
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#as.matrix(r)
# [,1] [,2] [,3] [,4] [,5]
#[1,] 1 2 3 4 5
#[2,] 6 7 8 9 10
#[3,] 11 12 13 14 15
#[4,] 16 17 18 19 20
#[5,] 21 22 23 24 25
Also see ?getValues
You can also use indexing
r[2,2]
#7
r[7:8]
#[1] 7 8
For more complex extractions using points, lines or polygons, see ?extract
x is the raster object you are trying to extract values from; y may be a SpatialPoints, SpatialPolygons, SpatialLines, or Extent object, or a vector representing cell numbers (take a look at ?extract). Your code values_raster <- extract(x = values, df=TRUE) will not work because you're not feeding the function any y object/vector.
You could try to build a vector with all the cell numbers of your raster. Imagine your raster has 200 cells. If you do values_raster <- extract(x = values, y = seq(1, 200, 1), df=TRUE) you'll get a dataframe with values for each cell.
How about simply doing
as.data.frame(s, xy=TRUE) # s is your raster file
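Putting that together on the example raster from the first answer (a sketch assuming the raster package is installed; the value column is named layer by default):

```r
library(raster)

r <- raster(ncol = 5, nrow = 5, vals = 1:25)

# one row per pixel: x/y coordinates plus the cell value
df <- as.data.frame(r, xy = TRUE)
head(df, 3)
nrow(df)  # 25, one row per pixel
```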

Row Differences in Dataframe by Group

My problem has to do with finding row differences in a data frame by group. I've tried to do this a few ways. Here's an example. The real data set is several million rows long.
set.seed(314)
df = data.frame("group_id"=rep(c(1,2,3),3),
"date"=sample(seq(as.Date("1970-01-01"),Sys.Date(),by=1),9,replace=F),
"logical_value"=sample(c(T,F),9,replace=T),
"integer"=sample(1:100,9,replace=T),
"float"=runif(9))
df = df[order(df$group_id,df$date),]
I ordered it by group_id and date so that the diff function finds the sequential differences, i.e. time-ordered differences of the logical, integer, and float variables. I could easily do some sort of apply(df,2,diff), but I need it by group_id; applied to the whole data frame, that would also produce unwanted differences across group boundaries.
df
group_id date logical_value integer float
1 1 1974-05-13 FALSE 4 0.03472876
4 1 1979-12-02 TRUE 45 0.24493995
7 1 1980-08-18 TRUE 2 0.46662253
5 2 1978-12-08 TRUE 56 0.60039164
2 2 1981-12-26 TRUE 34 0.20081799
8 2 1986-05-19 FALSE 60 0.43928929
6 3 1983-05-22 FALSE 25 0.01792820
9 3 1994-04-20 FALSE 34 0.10905326
3 3 2003-11-04 TRUE 63 0.58365922
So I thought I could break up my data frame into chunks by group_id, and pass each chunk into a user defined function:
create_differences = function(data_group){
apply(data_group, 2, diff)
}
But I get errors using the code:
diff_df = lapply(split(df,df$group_id),create_differences)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator
by(df,df$group_id,create_differences)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator
As a side note, the data is nice, no NAs, nulls, blanks, and every group_id has at least 2 rows associated with it.
Edit 1: User alexis_laz correctly pointed out that my function needs to be sapply(data_group, diff).
Using this edit, I get a list of data frames (one list entry per group).
Edit 2:
The expected output would be a combined data frame of differences. Ideally, I would like to keep the group_id, but if not, it's not a big deal. Here is what the sample output should be like:
diff_df
group_id date logical_value integer float
[1,] 1 2029 1 41 0.2102112
[2,] 1 260 0 -43 0.2216826
[1,] 2 1114 0 -22 -0.3995737
[2,] 2 1605 -1 26 0.2384713
[1,] 3 3986 0 9 0.09112507
[2,] 3 3485 1 29 0.47460596
Given that you have millions of rows, you can move to data.table, which is well suited to by-group operations.
library(data.table)
DT <- as.data.table(df)
## this will order per group and per day
setkeyv(DT,c('group_id','date'))
## for all column apply diff
DT[,lapply(.SD,diff),group_id]
# group_id date logical_value integer float
# 1: 1 2029 days 1 41 0.21021119
# 2: 1 260 days 0 -43 0.22168257
# 3: 2 1114 days 0 -22 -0.39957366
# 4: 2 1604 days -1 26 0.23847130
# 5: 3 3987 days 0 9 0.09112507
# 6: 3 3485 days 1 29 0.47460596
It certainly won't be as quick compared to data.table but below is an only slightly ugly base solution using aggregate:
result <- aggregate(. ~ group_id, data=df, FUN=diff)
result <- cbind(result[1],lapply(result[-1], as.vector))
result[order(result$group_id),]
# group_id date logical_value integer float
#1 1 2029 1 41 0.21021119
#4 1 260 0 -43 0.22168257
#2 2 1114 0 -22 -0.39957366
#5 2 1604 -1 26 0.23847130
#3 3 3987 0 9 0.09112507
#6 3 3485 1 29 0.47460596

How to bootstrap respecting within-subject information?

This is the first time I post to this forum, and I want to say from the start I am not a skilled programmer. So please let me know if the question or code were unclear!
I am trying to get the 95% confidence interval (CI) for an interaction (that is my test statistic) by doing bootstrapping. I am using the package "boot". My problem is that for every resample, I would like the randomization to be done within subjects, so that observations from different subjects are not mixed. Here is the code to generate a dataframe similar to mine. As you can see, I have two within-subjects factors ("Num" and "Gram"), and I am interested in the interaction between them:
Subject = rep(c("S1","S2","S3","S4"),4)
Num = rep(c("singular","plural"),8)
Gram = rep(c("gram","gram","ungram","ungram"),4)
RT = c(657,775,678,895,887,235,645,916,930,768,890,1016,590,978,450,920)
data = data.frame(Subject,Num,Gram,RT)
This is the code I used to get the empirical interaction value:
summary(lm(RT ~ Num*Gram, data=data))
As you can see, the interaction between my two factors is -348. I want to get a bootstrap confidence interval for this statistic, which I can generate using the "boot" package:
# You need the following packages
install.packages("car")
install.packages("MASS")
install.packages("boot")
library("car")
library("MASS")
library("boot")
#Function to create the statistic to be bootstrapped
boot.huber <- function(data, indices) {
data <- data[indices, ] #select obs. in bootstrap sample
mod <- lm(RT ~ Num*Gram, data=data)
coefficients(mod) #return coefficient vector
}
#Generate bootstrap estimate
data.boot <- boot(data, boot.huber, 1999)
#Get confidence interval
boot.ci(data.boot, index=4, type=c("norm", "perc", "bca"),conf=0.95) #4 gets the CI for the interaction
My problem is that I think the resamples should be generated without mixing the individual subjects' observations: that is, to generate the new resamples, the observations from subject 1 (S1) should be shuffled within subject 1, not mixed with the observations from subject 2, etc. I don't know how "boot" does the resampling (I read the documentation but don't understand how the function does it).
Does anyone know how I could make sure that the resampling procedure used by "boot" respects subject level information?
Thanks a lot for your help/advice!
Just modify your call to boot() like this:
data.boot <- boot(data, boot.huber, 1999, strata=data$Subject)
?boot provides this description of the strata= argument, which does exactly what you are asking for:
strata: An integer vector or factor specifying the strata for
multi-sample problems. This may be specified for any
simulation, but is ignored when ‘sim = "parametric"’. When
‘strata’ is supplied for a nonparametric bootstrap, the
simulations are done within the specified strata.
Additional note:
To confirm that it's working as you'd like, you can call debugonce(boot), run the call above, and step through the debugger until the object i (whose rows contain the indices used to resample rows of data to create each bootstrap resample) has been assigned, and then have a look at it.
debugonce(boot)
data.boot <- boot(data, boot.huber, 1999, strata=data$Subject)
# Browse[2]>
## [Press return 34 times]
# Browse[2]> head(i)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
# [1,] 9 10 11 16 9 14 15 16 9 2 15 16 1 10
# [2,] 9 14 7 12 5 6 15 4 13 6 11 16 13 6
# [3,] 5 10 15 16 9 6 3 4 1 2 15 12 5 6
# [4,] 5 10 11 4 9 6 15 16 9 14 11 16 5 2
# [5,] 5 10 3 4 1 10 15 16 9 6 3 8 13 14
# [6,] 13 10 3 12 5 10 3 4 5 14 7 16 5 14
# [,15] [,16]
# [1,] 7 8
# [2,] 11 16
# [3,] 3 16
# [4,] 3 8
# [5,] 7 8
# [6,] 7 12
(You can enter Q to leave the debugger at any time.)
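If you want to see the idea without digging through boot's internals, here is a small base-R sketch (variable names taken from the question) of what stratified resampling does: row indices are drawn with replacement within each subject, so every resample keeps exactly the original number of rows per subject. This is illustrative only, not boot's actual code.

```r
set.seed(1)
Subject <- rep(c("S1", "S2", "S3", "S4"), 4)

# row positions belonging to each subject (the strata)
idx_by_subject <- split(seq_along(Subject), Subject)

# one stratified bootstrap resample of the row indices
boot_idx <- unlist(lapply(idx_by_subject, function(ix)
  sample(ix, length(ix), replace = TRUE)))

# every resampled row still carries its original subject label
table(Subject[boot_idx])
# S1 S2 S3 S4
#  4  4  4  4
```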
