Mapping cut factor variable to numeric in R - r

I have a factor variable whose levels are histogram bins: '660-664', ..., '740-744', '745-749', ...
How can I map each level to the midpoint of its bin, e.g. mapping '660-664' to 662?
Basically, what I'm looking for is the inverse of the "cut" function.

You can make use of the plot = FALSE argument from hist to extract the breaks, then use that to get your midpoints:
set.seed(1)
x <- sample(300, 30)
x
# [1] 80 112 171 270 60 266 278 194 184 18 296 52 198 111 221 142
# [17] 204 281 108 219 262 290 182 35 74 107 4 105 237 93
temp <- hist(x, plot = FALSE)$breaks
temp
# [1] 0 50 100 150 200 250 300
rowMeans(cbind(head(temp, -1),
tail(temp, -1)))
# [1] 25 75 125 175 225 275
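As a side note, the object returned by hist also carries a mids component, so the midpoints may not need to be computed by hand. A minimal sketch, assuming the same x as above (the names h and mids_manual are only illustrative):

```r
set.seed(1)
x <- sample(300, 30)

# Keep the whole histogram object instead of only the breaks
h <- hist(x, plot = FALSE)

# Midpoints computed by hand from consecutive breaks, as in the answer above
mids_manual <- rowMeans(cbind(head(h$breaks, -1), tail(h$breaks, -1)))

# hist() already stores the bin midpoints in $mids
all.equal(h$mids, mids_manual)
```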
Update: Calculating the mean from a character string of ranges
Judging by your comments, you might be looking for something more like this:
myVec <- c("735-739", "715-719", "690-694", "695-699", "695-699",
"670-674", "720-724", "705-709", "685-689")
myVec
# [1] "735-739" "715-719" "690-694" "695-699" "695-699" "670-674"
# [7] "720-724" "705-709" "685-689"
sapply(strsplit(myVec, "-"), function(x) mean(as.numeric(x)))
# [1] 737 717 692 697 697 672 722 707 687
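If the ranges are stored as a factor rather than a character vector, as in the question, one option is to parse each level once and then index by the factor's integer codes. This is only a sketch; f, level_mids and mids are made-up names, and it assumes every level has the form "low-high":

```r
# A small factor of range labels (a subset of myVec from above)
f <- factor(c("735-739", "715-719", "690-694", "695-699", "695-699"))

# Midpoint of each distinct level, computed once per level
level_mids <- sapply(strsplit(levels(f), "-", fixed = TRUE),
                     function(p) mean(as.numeric(p)))

# Look up the midpoint for every observation via the integer codes
mids <- level_mids[as.integer(f)]
mids
# [1] 737 717 692 697 697
```

Parsing the levels instead of every element avoids repeating the string split for duplicated bins.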

Related

Populate new dataframe with rowMeans for every four columns minus other rowMeans value

I am trying to calculate my data's means by populating a new dataframe with data corrected by my experiment's blank.
So far, I have created my new data frame:
data_mean <- data.frame(matrix(ncol = 17, # As many columns as experimental conditions plus one for "Time(h)"
nrow = nrow(data)))
Copied the data corresponding to time:
data_mean[,1] <- data[,1]
And attempted to populate the dataframe by assigning the mean of every condition minus the mean of the blanks to each column:
data_mean[,2] <- rowMeans(data[,5:8])-rowMeans(data[,2:4])
data_mean[,3] <- rowMeans(data[,9:12])-rowMeans(data[,2:4])
data_mean[,4] <- rowMeans(data[,13:16])-rowMeans(data[,2:4])
data_mean[,5] <- rowMeans(data[,17:20])-rowMeans(data[,2:4])
and so on.
Is there an easier way to do this rather than typing the same code over and over?
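You can split the columns into groups (the three blank columns first, then every four columns) and apply a row-wise function to each group. Note that the code below uses rowSums; replace it with rowMeans to get the means asked for in the question.
res <- sapply(split.default(data[, -1], seq(ncol(data) - 1)%/%4), rowSums)
res[,-1] - res[,1] # Each condition's row sums minus the blank's row sums
Example: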
data <- data.frame(matrix(1:200, 10))
res <- sapply(split.default(data[, -1], seq(ncol(data) - 1)%/%4), rowSums)
res[,-1] - res[,1]
1 2 3 4
[1,] 161 321 481 641
[2,] 162 322 482 642
[3,] 163 323 483 643
[4,] 164 324 484 644
[5,] 165 325 485 645
[6,] 166 326 486 646
[7,] 167 327 487 647
[8,] 168 328 488 648
[9,] 169 329 489 649
[10,] 170 330 490 650
and you can check:
rowSums(data[, 5:8]) - rowSums(data[,2:4])
[1] 161 162 163 164 165 166 167 168 169 170 # first column
rowSums(data[, 9:12]) - rowSums(data[,2:4])
[1] 321 322 323 324 325 326 327 328 329 330 # second column
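Since the question asks for means rather than sums, here is a hedged variant of the same idea with rowMeans, using the toy data from the example above. The names groups, means, corrected and data_mean are illustrative, and the layout (column 1 is time, columns 2 to 4 are blanks, then four columns per condition) is assumed from the question:

```r
data <- data.frame(matrix(1:200, 10))

# Group index per column after dropping the time column:
# 0 for the three blank columns, then 1, 2, ... for each block of four
groups <- seq(ncol(data) - 1) %/% 4

# Row means of every group, one matrix column per group
means <- sapply(split.default(data[, -1], groups), rowMeans)

# Subtract the blank mean (first column) from every condition mean
corrected <- means[, -1] - means[, 1]

# Assemble the result next to the time column
data_mean <- data.frame(time = data[, 1], corrected)

# Check against the hand-written version from the question
all.equal(unname(corrected[, 1]),
          unname(rowMeans(data[, 5:8]) - rowMeans(data[, 2:4])))
```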

How to find out the sequence of value in R

Suppose that I have sampled 100 values between -100 and 100 and taken their cumulative sum.
set.seed(123)
x <- -100:100
z <- sample (x, size = 100,replace=T)
cumsum(z)
and I got
[1] 58 136 49 143 212 161 178 120 33 50 102 91 81 177 167 251 242 278 276 247 172 78 147
[24] 183 246 223 203 145 147 163 138 180 111 119 25 61 129 102 24 78 165 117 151 103 157 222
[47] 155 123 94 69 31 71 67 57 109 46 -34 -94 -20 -31 -72 -157 -142 -149 -244 -145 -160 -175 -237
[70] -179 -162 -213 -280 -377 -465 -497 -471 -419 -468 -547 -559 -500 -576 -642 -575 -564 -635 -596 -538 -518 -509 -452
[93] -489 -448 -350 -384 -334 -313 -335 -351
Now, I would like to stop at, or find, the first value that is greater than 200 or lower than -200.
If I do it by hand, I can see that the 5th value (212) is the first one greater than 200.
However, is there any command in R to find the first position where the cumulative sum of z is greater than 200 or lower than -200?
Thank you very much
A quick hack way to do this might be:
cs <- cumsum(z) # running total of the sampled values
lv <- abs(cs) > 200 # TRUE where the total is above 200 or below -200
min(which(lv)) # position of the first TRUE
The min(which(...)) solutions provided by others don't give a convenient answer in case none of the values meet the condition. For example,
set.seed(123)
x <- -100:100
z <- sample (x, size = 100,replace=T)
min(which(abs(cumsum(z)) > 200))
#> [1] 5
min(which(abs(cumsum(z)) > 1000)) # None meet this condition
#> Warning in min(which(abs(cumsum(z)) > 1000)): no non-missing arguments to min;
#> returning Inf
#> [1] Inf
A better way is given in the R help page for which.max:
match(TRUE, abs(cumsum(z)) > 200)
#> [1] 5
match(TRUE, abs(cumsum(z)) > 1000)
#> [1] NA
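The match approach could be wrapped in a small helper. This is a sketch only: first_crossing and its limit argument are hypothetical names, and steps is a hand-written vector whose cumulative sum reproduces the first six totals shown in the question, so the result does not depend on the random number generator:

```r
# Index of the first cumulative sum whose absolute value exceeds `limit`,
# or NA if the running total never leaves the band
first_crossing <- function(v, limit = 200) {
  match(TRUE, abs(cumsum(v)) > limit)
}

steps <- c(58, 78, -87, 94, 69, -51)
cumsum(steps)
# [1]  58 136  49 143 212 161

first_crossing(steps)        # 5, because 212 is the first total above 200
first_crossing(steps, 1000)  # NA, the total never exceeds 1000
```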

Turn extraction list to csv file

I have loaded a raster file and a polyline shapefile into R and used the extract function to extract the data from every pixel along the polyline. How do I turn the list output by extract into a CSV file?
Here is a simple self-contained reproducible example (this one is taken from ?raster::extract)
library(raster)
r <- raster(ncol=36, nrow=18, vals=1:(18*36))
cds1 <- rbind(c(-50,0), c(0,60), c(40,5), c(15,-45), c(-10,-25))
cds2 <- rbind(c(80,20), c(140,60), c(160,0), c(140,-55))
lines <- spLines(cds1, cds2)
e <- extract(r, lines)
e is a list
> e
[[1]]
[1] 126 127 161 162 163 164 196 197 200 201 231 232 237 266 267 273 274 302 310 311 338 346 381 382 414 417 450 451 452 453 487 488
[[2]]
[1] 139 140 141 174 175 177 208 209 210 213 243 244 249 250 279 286 322 358 359 394 429 430 465 501 537
and you cannot directly write this to a csv because the list elements (vectors) have different lengths.
So first make them all the same length
x <- max(sapply(e, length))
ee <- sapply(e, `length<-`, x)
Let's see
head(ee)
# [,1] [,2]
#[1,] 126 139
#[2,] 127 140
#[3,] 161 141
#[4,] 162 174
#[5,] 163 175
#[6,] 164 177
tail(ee)
# [,1] [,2]
#[27,] 450 NA
#[28,] 451 NA
#[29,] 452 NA
#[30,] 453 NA
#[31,] 487 NA
#[32,] 488 NA
And now you can write to a csv file
write.csv(ee, "test.csv", row.names=FALSE)
If I understand what you're asking, I think you could resolve your situation by using unlist(). Note that this flattens everything into a single vector, so you lose track of which list element each value came from.
d <- c(1:10) # creates a sample vector to use
d <- as.list(d) # converts the vector into a list
d <- unlist(d) # converts the list back into a vector
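Another possibility, instead of padding with NA or flattening with a bare unlist(), is a "long" table with one row per extracted value and a column recording which list element it came from. The sketch below uses a small hand-made list in place of the real extract output; the values and the names long, line_id, value and out_file are illustrative assumptions:

```r
# Stand-in for the list returned by extract(): elements of unequal length
e <- list(c(126, 127, 161), c(139, 140))

# One row per value, tagged with the index of the list element it came from
long <- data.frame(
  line_id = rep(seq_along(e), lengths(e)),
  value   = unlist(e)
)

# Write to a temporary file here; use your own path in practice
out_file <- tempfile(fileext = ".csv")
write.csv(long, out_file, row.names = FALSE)
```

This layout needs no NA padding and keeps working however many lines the list contains.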

Generate a sequence of numbers with repeated intervals

I am trying to create runs of 6 consecutive numbers, with each run starting at a multiple of 144.
Like this one for example
c(1:6, 144:149, 288:293)
1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
How could I generate such a sequence automatically with seq or with another function?
I find the sequence function to be helpful in this case. If you had your data in a structure like this:
(info <- data.frame(start=c(1, 144, 288), len=c(6, 6, 6)))
# start len
# 1 1 6
# 2 144 6
# 3 288 6
then you could do this in one line with:
sequence(info$len) + rep(info$start-1, info$len)
# [1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
Note that this solution works even if the sequences you're combining are different lengths.
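To illustrate the point about different lengths, here is a small sketch with the same starts but made-up lengths of 2, 3 and 4 (the lengths and the name res are assumptions for the example only):

```r
info <- data.frame(start = c(1, 144, 288), len = c(2, 3, 4))

# sequence() counts 1..len within each run; rep() shifts each run to its start
res <- sequence(info$len) + rep(info$start - 1, info$len)
res
# [1]   1   2 144 145 146 288 289 290 291
```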
Here's one approach:
unlist(lapply(c(0L,(1:2)*144L-1L),`+`,seq_len(6)))
# or...
unlist(lapply(c(1L,(1:2)*144L),function(x)seq(x,x+5)))
Here's a way I like a little better:
rep(c(0L,(1:2)*144L-1L),each=6) + seq_len(6)
Generalizing...
rlen <- 6L
rgap <- 144L
rnum <- 3L
starters <- c(0L,seq_len(rnum-1L)*rgap-1L)
rep(starters, each=rlen) + seq_len(rlen)
# or...
unlist(lapply(starters+1L,function(x)seq(x,x+rlen-1L)))
This can also be done using seq or seq.int
x = c(1, 144, 288)
c(sapply(x, function(y) seq.int(y, length.out = 6)))
#[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
As @Frank mentioned in the comments, here is another way to achieve this using @josilber's data structure (this is particularly useful when different intervals need different sequence lengths):
c(with(info, mapply(seq.int, start, length.out=len)))
#[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
From R >= 4.0.0, you can now do this in one line with sequence:
sequence(c(6,6,6), from = c(1,144,288))
[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
The first argument, nvec, is the length of each sequence; the second, from, is the starting point for each sequence.
As a function, with n being the number of intervals you want:
f <- function(n) sequence(rep(6,n), from = c(1,144*seq_len(n-1))) # seq_len() also handles n = 1
f(3)
[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
I am using R 3.3.2. OSX 10.9.4
I tried:
a<-c() # stores expected sequence
f<-288 # starting number of final sub-sequence
it<-144 # interval
for (d in seq(0,f,by=it))
{
if (d==0)
{
d=1
}
a<-c(a, seq(d,d+5))
print(d)
}
print(a)
and the expected sequence is stored in a:
[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
And another try:
a<-c() # stores expected sequence
it<-144 # interval
lo<-4 # number of sub-sequences
for (d in seq(0,by=it, length.out = lo))
{
if (d==0)
{
d=1
}
a<-c(a, seq(d,d+5))
print(d)
}
print(a)
The result:
[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293 432 433 434 435 436 437
I tackled this with the cumsum function:
seq_n <- 3 # number of sequences
rep(1:6, seq_n) + rep(c(0, cumsum(rep(144, seq_n-1))-1), each = 6)
# [1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
There is no need to calculate the starting values of the sequences as in @josilber's solution, but the length of each sequence has to be constant.

Summarizing a data frame

I am trying to take the following data and use it to create a table that has the information broken down by state.
Here's the data:
> head(mydf2, 10)
    lead_id buyer_account_id amount state
1  52055267               62    300    CA
2  52055267               64    264    CA
3  52055305               64    152    CA
4  52057682               62     75    NJ
5  52060519               62    750    OR
6  52060519               64    574    OR
15 52065951               64    152    TN
17 52066749               62    600    CO
18 52062751               64    167    OR
20 52071186               64    925    MN
I've already subset the data to just the states that I'm interested in:
mydf2 = subset(mydf, state %in% c("NV","AL","OR","CO","TN","SC","MN","NJ","KY","CA"))
Here's an idea of what I'm looking for:
State Amount Count
NV 1 50
NV 2 35
NV 3 20
NV 4 15
AL 1 10
AL 2 6
AL 3 4
AL 4 1
...
For each state, I'm trying to find a count for each amount "level." I don't necessarily need to group the amount variable, but keep in mind that the amounts are not just 1, 2, 3, etc.
> mydf$amount
[1] 300 264 152 75 750 574 113 152 750 152 675 489 188 263 152 152 600 167 34 925 375 156 675 152 488 204 152 152
[29] 600 489 488 75 152 152 489 222 563 215 452 152 152 75 100 113 152 150 152 150 152 452 150 152 152 225 600 620
[57] 113 152 150 152 152 152 152 152 152 152 640 236 152 480 152 152 200 152 560 152 240 222 152 152 120 257 152 400
Is there an elegant solution for this in R, or will I be stuck using Excel (yuck!)?
Here's my understanding of what you're trying to do:
Start with a simple data.frame with 26 states and amounts only ranging from 1 to 50 (which is much more restrictive than what you have in your example, where the range is much higher).
set.seed(1)
mydf <- data.frame(
state = sample(letters, 500, replace = TRUE),
amount = sample(1:50, 500, replace = TRUE)
)
head(mydf)
# state amount
# 1 g 28
# 2 j 35
# 3 o 33
# 4 x 34
# 5 f 24
# 6 x 49
Here's some straightforward tabulation. I've also removed any instances where frequency equals zero, and I've reordered the output by state.
temp1 <- data.frame(table(mydf$state, mydf$amount))
temp1 <- temp1[!temp1$Freq == 0, ]
head(temp1[order(temp1$Var1), ])
# Var1 Var2 Freq
# 79 a 4 1
# 157 a 7 2
# 391 a 16 1
# 417 a 17 1
# 521 a 21 1
# 1041 a 41 1
dim(temp1) # How many rows/cols
# [1] 410 3
Here's a little bit different tabulation. We are tabulating after grouping the "amount" values. Here, I've manually specified the breaks, but you could just as easily let R decide what it thinks is best.
temp2 <- data.frame(table(mydf$state,
cut(mydf$amount,
breaks = c(0, 12.5, 25, 37.5, 50),
include.lowest = TRUE)))
temp2 <- temp2[!temp2$Freq == 0, ]
head(temp2[order(temp2$Var1), ])
# Var1 Var2 Freq
# 1 a [0,12.5] 3
# 27 a (12.5,25] 3
# 79 a (37.5,50] 3
# 2 b [0,12.5] 2
# 28 b (12.5,25] 6
# 54 b (25,37.5] 5
dim(temp2)
# [1] 103 3
I am not sure if I understand correctly (you have two data.frames mydf and mydf2). I'll assume your data is in mydf. Using aggregate:
mydf$count <- 1:nrow(mydf)
aggregate(data = mydf, count ~ amount + state, length)
Is this what you are looking for?
Note: here count is a variable that is created just to get directly the output of the 3rd column as count.
Alternatives with ddply from plyr:
library(plyr)
# no need to create a variable called count
ddply(mydf, .(state, amount), summarise, count=length(lead_id))
Here one could use any column that exists in one's data instead of lead_id. Even state:
ddply(mydf, .(state, amount), summarise, count=length(state))
Or equivalently without using summarise:
ddply(mydf, .(state, amount), function(x) c(count=nrow(x)))
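For a base R version that needs no extra packages, the grouping with cut and the counting with aggregate could be combined roughly as follows. This is a sketch: the ten rows are retyped from the head(mydf2, 10) output in the question, while the break points and the names amount_bin and counts are arbitrary choices for illustration:

```r
# The ten rows shown in the question
mydf2 <- data.frame(
  lead_id = c(52055267, 52055267, 52055305, 52057682, 52060519,
              52060519, 52065951, 52066749, 52062751, 52071186),
  buyer_account_id = c(62, 64, 64, 62, 62, 64, 64, 62, 64, 64),
  amount = c(300, 264, 152, 75, 750, 574, 152, 600, 167, 925),
  state = c("CA", "CA", "CA", "NJ", "OR", "OR", "TN", "CO", "OR", "MN")
)

# Bin the amounts; these breaks are illustrative, pick ones that suit your data
mydf2$amount_bin <- cut(mydf2$amount, breaks = c(0, 250, 500, 750, 1000),
                        dig.lab = 4)

# Count rows per state and amount bin (any fully populated column can be counted)
counts <- aggregate(lead_id ~ state + amount_bin, data = mydf2, FUN = length)
names(counts)[3] <- "count"

# Order by state, then by bin
counts <- counts[order(counts$state, counts$amount_bin), ]
counts
```

Combinations that never occur are simply absent from the result, so no zero-frequency rows need to be filtered out afterwards.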
