How to replace column values in R between 2 numbers

I have imported a simple CSV file into a data frame. In a particular column, I would like to replace all values under 15.0 with 15.0, all values above 25.0 with 25.0, and all values between 15.0 and 25.0 with 20.0.
The snippet below works well for the first task, and flipping the comparison sign and replacing 15 with 25 handles the second.
ff$temp[ ff$temp<15 ] <- 15
How can I accomplish replacing the values between 15 and 25 with 20?

You could use the same premise, just with a second logical operator to catch the "between" values.
For example,
> ff <- data.frame(temp = c(14, 17, 19, 24, 30))
> within(ff, temp[temp > 15 & temp < 25] <- 20)
# temp
# 1 14
# 2 20
# 3 20
# 4 20
# 5 30

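Another answer handles all three replacements in one pass: cut() bins each value into one of three intervals (at or below 15, between 15 and 25, above 25), and the resulting bin index picks the replacement out of c(15, 20, 25). For example, with some random data: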
set.seed(25)
v1 <- sample(5:35, 25, replace=TRUE)
c(15, 20, 25)[cut(v1, breaks=c(-Inf, 15, 25, Inf), labels=FALSE)]
#[1] 20 25 15 25 15 25 20 15 15 15 15 20 25 20 25 15 20 25 20 25 15 20 15 15 15
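Applied to the original data frame, that would be ff$temp <- c(15, 20, 25)[cut(ff$temp, breaks=c(-Inf, 15, 25, Inf), labels=FALSE)]. Note that with cut()'s default right = TRUE, a value of exactly 15 maps to 15 and a value of exactly 25 maps to 20; adjust the breaks if the boundaries should be treated differently.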

Related

Distributing value in a column to values in other columns given certain rules

I am currently working on a programming puzzle that sounds straightforward but is apparently quite difficult to solve efficiently in R, i.e. without a for loop over a 100k+ row column of a data frame. I have tried dplyr (particularly group_by and mutate), data.table, and the apply family, but it's quite tough. Could anyone give some help?
The problem is as follows: given a data frame df with columns key ("character") and x, y, z (all "numeric"), values in the key column may repeat, and all rows sharing a key always have the same value in column x (for example, key = aa_bb_1 covers 6 rows). For each key, determine whether its x value is smaller than the sum of that group's values in column y (please see the Sample Output to see how the rule works). If it is, keep the value in column x while distributing it across the group's entries in column y, in decreasing order of the corresponding values in column z. How do we do this efficiently across all distinct values of key?
Sample Input
df <- data.frame(
  key = c('aa_bb_1', 'aa_bb_0', 'ab_ca_0', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1',
          'aa_bb_1', 'aa_bb_1', 'ab_ca_0', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1',
          'aa_bb_0', 'aa_bb_1', 'ab_ca_0', 'abc_bbb_0', 'abbbc_aa_0', 'aaa_ccc_1',
          'aa_bb_0', 'aa_bb_1', 'ab_ca_1', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1',
          'aa_bb_1', 'aa_bb_0', 'ab_ca_0', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1'),
  x = c(20, 19, 30, 25, 37, 13, 20, 20, 30, 25, 37, 13, 19, 20, 30, 43,
        71, 13, 19, 20, 10, 25, 37, 13, 20, 19, 30, 25, 37, 13),
  y = c(3, 10, 18, 15, 32, 4, 12, 29, 71, 92, 11, 7, 21, 19, 13,
        26, 28, 11, 8, 8, 5, 23, 3, 12, 19, 7, 9, 11, 7, 12),
  z = c(8, 13, 15, 16, 10, 10, 25, 21, 32, 15, 45, 8, 10, 50, 12, 10, 35,
        23, 10, 12, 2, 40, 45, 57, 66, 49, 100, 5, 11, 30))
key x y z
1 aa_bb_1 20 3 8
2 aa_bb_0 19 10 13
3 ab_ca_0 30 18 15
4 abc_bbb_1 25 15 16
5 abbbc_aa_1 37 32 10
6 aaa_ccc_1 13 4 10
7 aa_bb_1 20 12 25
8 aa_bb_1 20 29 21
9 ab_ca_0 30 71 32
10 abc_bbb_1 25 92 15
11 abbbc_aa_1 37 11 45
12 aaa_ccc_1 13 7 8
13 aa_bb_0 19 21 10
14 aa_bb_1 20 19 50
15 ab_ca_0 30 13 12
16 abc_bbb_0 43 26 10
17 abbbc_aa_0 71 28 35
18 aaa_ccc_1 13 11 23
19 aa_bb_0 19 8 10
20 aa_bb_1 20 8 12
21 ab_ca_1 10 5 2
22 abc_bbb_1 25 23 40
23 abbbc_aa_1 37 3 45
24 aaa_ccc_1 13 12 57
25 aa_bb_1 20 19 66
26 aa_bb_0 19 7 49
27 ab_ca_0 30 9 100
28 abc_bbb_1 25 11 5
29 abbbc_aa_1 37 7 11
30 aaa_ccc_1 13 12 30
Sample Output for aa_bb_1 and aa_bb_0
key x y z
1 aa_bb_1 20 0 8
2 aa_bb_0 19 10 13 -- Second largest value of z among rows with same key aa_bb_0. Get second distribution equal to min(10,19-7)=min(10,12)=10.
7 aa_bb_1 20 0 25
8 aa_bb_1 20 0 21 -- Nothing left to be distributed => 0 in column y.
13 aa_bb_0 19 0 10 --- Nothing left so distribute 0
14 aa_bb_1 20 1 50 --- Second largest value of z among rows with same key aa_bb_1. So distribute min(19,20-19)=1 to column y.
19 aa_bb_0 19 2 10 --- Tie as third largest value of z among rows with same key aa_bb_0. Pick *randomly* for now (in reality, I would have another column to decide on which row would get distributed first). Since min(8,19-7-10)=min(8,2)=2, only 2 is distributed.
20 aa_bb_1 20 0 12
25 aa_bb_1 20 19 66 --- Largest value of z among rows with same key aa_bb_1. Get first distribution = min(20, 19)=19.
26 aa_bb_0 19 7 49 --- Largest value of z among rows with same key aa_bb_0. Get first distribution equal to min(7,19)=7.
Caveat: only perform the above operations if the sum of all elements in column y sharing a key is greater than that key's value in column x. For example, aa_bb_1 qualifies because x = 20 < 3+19+8+19.
Pretty much anything can be done with a for loop. Here I apply a function to the data frame split by key, that function being a for loop. I then assign the output back to the arranged df, because the loop output over the split data frame is ordered by key.
df <- dplyr::arrange(df, key, desc(z))
df$y <- lapply(split(df, df$key), \(x) {
  ndf <- x
  base <- min(ndf$x)
  # output values for y
  yout <- as.list(x$y)
  if (sum(x$y) > min(x$x)) {  # only redistribute when the y total exceeds x
    for (i in seq(nrow(x))) {
      ## row with the largest remaining z
      maxz <- which.max(ndf$z)
      ## distribute the smaller of the remaining budget and that row's y
      minv <- min(base, ndf$y[maxz])
      yout[[i]] <- minv
      ## shrink the budget and drop the processed row
      base <- base - minv
      ndf <- ndf[-maxz, ]
    }
  }
  return(yout)
}) |> unlist()
key x y z
1 aa_bb_0 19 7 49
2 aa_bb_0 19 10 13
3 aa_bb_0 19 2 10
4 aa_bb_0 19 0 10
5 aa_bb_1 20 19 66
6 aa_bb_1 20 1 50
7 aa_bb_1 20 0 25
8 aa_bb_1 20 0 21
9 aa_bb_1 20 0 12
10 aa_bb_1 20 0 8
11 aaa_ccc_1 13 12 57
12 aaa_ccc_1 13 1 30
13 aaa_ccc_1 13 0 23
14 aaa_ccc_1 13 0 10
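For larger data, a vectorized alternative avoids the explicit loop: once each group is sorted by descending z, allocations equal y until the budget runs out, so the budget remaining before row i is x - (cumsum(y) - y), and each row receives pmin(y, pmax(remaining, 0)). A dplyr sketch of that idea (my own reformulation, not the answer above; it assumes the df defined in the question):
library(dplyr)
res <- df %>%
  group_by(key) %>%
  arrange(desc(z), .by_group = TRUE) %>%
  mutate(y = if (sum(y) > first(x)) {
    # budget left before each row is x minus the y already consumed above it;
    # each row takes min(its y, what is left), floored at zero
    pmin(y, pmax(first(x) - (cumsum(y) - y), 0))
  } else y) %>%
  ungroup()
For key aa_bb_1 this yields y = 19, 1, 0, 0, 0, 0 in descending-z order, matching the loop output above.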

Print levels of a factor present within select criteria rather than all levels of the factor in R?

I am working with R version i386 3.1.1 and RStudio 0.99.442.
I have large datasets of tree species that I've collected from 7 plots, each of which is divided into 5 subplots (i.e. 35 distinct subplots). I am trying to get R to run through my dataset and print the species which are present within each plot.
I thought I could use "aggregate" to apply the "levels" function to the Species data column and have it return the Species present for each Plot and Subplot; however, it returns the levels across the entire data frame (all 12 species) rather than the 3 or 4 species that are actually present in the Subplot.
To provide a reproducible example of what I'm trying to do, we can use the "warpbreaks" dataset that comes with R.
I convert the 'breaks' variable in warpbreaks to a factor variable to recreate the problem; it thus stands in for my 'species' variable, whereas 'warpbreaks$wool' would represent 'plot', and 'warpbreaks$tension' would represent 'subplot'.
require(stats)
warpbreaks$breaks = as.factor(warpbreaks$breaks)
aggregate(breaks ~ wool + tension, data = warpbreaks, FUN="levels")
If we look at the warpbreaks data, then for "Plot" A (wool) and "Subplot" L (tension) - the desired script would print the species "26, 30, 54, 25, etc."
breaks wool tension
1 26 A L
2 30 A L
3 54 A L
4 25 A L
5 70 A L
6 52 A L
7 51 A L
8 26 A L
9 67 A L
10 18 A M
11 21 A M
12 29 A M
...
Instead, R returns something of this sort, where it is printing ALL of the levels of the factor variable for ALL of the plots:
wool tension breaks.1 breaks.2 breaks.3 breaks.4 breaks.5 breaks...
1 A L 10 12 13 14 15 ...
2 B L 10 12 13 14 15 ...
3 A M 10 12 13 14 15 ...
4 B M 10 12 13 14 15 ...
5 A H 10 12 13 14 15 ...
6 B H 10 12 13 14 15 ...
How do I get it to print only the factor levels that are present within that Plot/Subplot combination? Am I totally off in my use of "aggregate"? I'd imagine this is a relatively easy task for an experienced R user...
First time stackoverflow post - would appreciate any help or nudges towards the right code!
Many kind thanks.
Try FUN=unique rather than FUN=levels. levels will return every level of the factor, as you have surmised already; unique(...) returns only the values actually present in each group.
y <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN=unique)
wool tension breaks
1 A L 26, 30, 54, 25, 70, 52, 51, 67
2 B L 27, 14, 29, 19, 31, 41, 20, 44
3 A M 18, 21, 29, 17, 12, 35, 30, 36
4 B M 42, 26, 19, 16, 39, 28, 21, 29
5 A H 36, 21, 24, 18, 10, 43, 28, 15, 26
6 B H 20, 21, 24, 17, 13, 15, 16, 28
NOTE the breaks column is a little unusual: instead of each row of that column holding one value (as a data frame column normally does), each cell holds a vector of values. That is, each cell of the breaks column is NOT a string; it's a vector!
> class(y$wool)
[1] "factor"
> class(y$breaks) # list !!
[1] "list"
> y$breaks[[1]] # first row in breaks
[1] 26 30 54 25 70 52 51 67
Levels: 10 12 13 14 15 16 17 18 19 20 21 24 25 26 27 28 29 30 31 35 36 39 41 42 43 44 51 52 54 67 70
Note that to access the first element of the breaks column, instead of doing y$breaks[1] (like you would with the wool or tension column) you need to do y$breaks[[1]] because of this.
Data frames are not really meant to work like this; a single cell in a dataframe is supposed to have a single value, and most functions will expect a dataframe to conform to this, so just keep this in mind when doing future processing.
If you wanted to convert to a string, use (e.g.) FUN=function (x) paste(unique(x), collapse=', '); then y$breaks will be a column of strings and behaves as normal.
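For example, a small sketch of that last suggestion:
y <- aggregate(breaks ~ wool + tension, data = warpbreaks,
               FUN = function(x) paste(unique(x), collapse = ", "))
# y$breaks is now an ordinary character column, one string per row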

Regroup rows of a data frame for which a column value is less than x

I have this data frame :
> df
Z freq proba
1 17 1 0.0033289263
2 18 4 0.0055569026
3 19 2 0.0087878028
4 20 3 0.0132023556
5 21 16 0.0188900561
6 22 12 0.0257995234
7 23 30 0.0337042731
8 24 41 0.0421963455
9 25 56 0.0507149437
10 26 65 0.0586089198
11 27 65 0.0652230449
12 28 93 0.0699913154
13 29 82 0.0725182432
14 30 94 0.0726318551
15 31 72 0.0703990113
16 32 74 0.0661024717
17 33 58 0.0601873020
18 34 66 0.0531896431
19 35 38 0.0456625487
20 36 45 0.0381117389
21 37 27 0.0309498221
22 38 17 0.0244723502
23 39 15 0.0188543771
24 40 13 0.0141629367
25 41 4 0.0103793600
26 42 1 0.0074254435
27 43 2 0.0051886582
28 45 1 0.0023658767
29 46 1 0.0015453804
30 49 2 0.0003792308
# Here is my data (dput output):
> dput(df)
structure(list(Z = c(17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
43, 45, 46, 49), freq = c(1, 4, 2, 3, 16, 12, 30, 41, 56, 65,
65, 93, 82, 94, 72, 74, 58, 66, 38, 45, 27, 17, 15, 13, 4, 1,
2, 1, 1, 2), proba = c(0.0033289262662263, 0.00555690264007235,
0.00878780282243439, 0.0132023555702843, 0.0188900560866825,
0.0257995234198431, 0.0337042730520012, 0.0421963455163949, 0.0507149437492447,
0.0586089198012906, 0.0652230449359029, 0.0699913153996099, 0.0725182432348992,
0.0726318551493006, 0.0703990113442269, 0.0661024716831246, 0.0601873020200862,
0.0531896430528685, 0.045662548708844, 0.0381117389181843, 0.030949822142559,
0.0244723501557229, 0.01885437705459, 0.0141629366839816, 0.0103793599644779,
0.00742544354411115, 0.00518865818999788, 0.00236587669133322,
0.00154538036835848, 0.000379230768851682)), .Names = c("Z",
"freq", "proba"), row.names = c(NA, -30L), class = "data.frame")
I want to merge each run of consecutive rows whose "freq" value is < 5 into a single row, continuing for as long as the next row is also < 5.
In case that isn't clear, this is the output I expect:
> df2
labels effectifs pi
1 17;20 10 0.03087599
2 21 16 0.01889006
3 22 12 0.02579952
4 23 30 0.03370427
5 24 41 0.04219635
6 25 56 0.05071494
7 26 65 0.05860892
8 27 65 0.06522304
9 28 93 0.06999132
10 29 82 0.07251824
11 30 94 0.07263186
12 31 72 0.07039901
13 32 74 0.06610247
14 33 58 0.06018730
15 34 66 0.05318964
16 35 38 0.04566255
17 36 45 0.03811174
18 37 27 0.03094982
19 38 17 0.02447235
20 39 15 0.01885438
21 40 13 0.01416294
22 41;49 11 0.02728395
I did it with nested while loops, but I find this solution very painful and inefficient.
i <- 1
freqs <- c()
labels <- c()
pi <- c()
while (i < nrow(df)) {
  if (df$freq[i] >= 5) {
    freqs <- c(freqs, df$freq[i])
    labels <- c(labels, df$Z[i])
    pi <- c(pi, df$proba[i])
    i <- i + 1
  } else {
    count <- df$freq[i]
    countPi <- df$proba[i]
    k <- i
    j <- i
    while (df$freq[i] < 5 & i < nrow(df)) {
      if (df$freq[i + 1] < 5) {
        count <- count + df$freq[i + 1]
        countPi <- countPi + df$proba[i + 1]
        j <- i + 1
      }
      i <- i + 1
    }
    labels <- c(labels, paste0(df$Z[k], ";", df$Z[j]))
    freqs <- c(freqs, count)
    pi <- c(pi, countPi)
  }
}
df2 <- data.frame(labels, freqs, pi)
I'm sure there is something far better, maybe with dplyr. If you have a better solution, thanks!
We could use the "devel" version of "data.table" as new functions are introduced (rleid). Here, we convert the "data.frame" to "data.table" (setDT(df)), create a grouping variable ("gr") based on the logical index (freq <5) using rleid. 'Z' column is 'numeric/integer' class. Create a character column ("Z1") from the "Z". Grouped by 'gr', if the "freq" is less than 5 for all the elements of that group, summarise the rows to a single row by taking the first observation of columns (.SD[1L]), remove the unwanted columns (as .SD includes "Z1" which will result in duplicate columns), append it with the "Z1" that we get from pasting the min and max value of "Z" for that group. Otherwise, leave it unchanged (else .SD). Remove the columns that we don't need by assigning it to "NULL".
library(data.table) #data.table_1.9.5
res <- setDT(df)[, gr := rleid(freq < 5)][, Z1 := as.character(Z)][,
         if (all(freq < 5)) c(.SD[1L][, -4, with = FALSE],
                              list(Z1 = toString(c(min(Z), max(Z)))))
         else .SD, gr][, 1:2 := NULL][]
head(res,3)
# freq proba Z1
#1: 1 0.003328926 17, 20
#2: 16 0.018890056 21
#3: 12 0.025799523 22
Since this is a dplyr question, here is a dplyr solution. First I define a grouping function to create the groups (similar to the rleid function in data.table); the summary step is then fairly simple.
# grouping function
grouping <- function(condition) {
  # calculate runs for grouping
  run <- rle((!condition) * 1:length(condition))
  # renumber the runs consecutively
  run$values <- seq_along(run$values)
  # invert to get the grouping vector
  inverse.rle(run)
}
# load dplyr
require(dplyr)
df %>%
  mutate(group = grouping(freq < 5)) %>%        # add groups
  group_by(group) %>%                           # group data
  summarize(freq = sum(freq),                   # sum freq
            proba = sum(proba),                 # sum proba
            Z = toString(unique(range(Z)))) %>% # collapse Z to its range
  mutate(group = NULL)                          # remove groups
## Source: local data table [22 x 3]
##
## freq proba Z
## 1 10 0.03087599 17, 20
## 2 16 0.01889006 21
## 3 12 0.02579952 22
## 4 30 0.03370427 23
## 5 41 0.04219635 24
## 6 56 0.05071494 25
## 7 65 0.05860892 26
## 8 65 0.06522304 27
## 9 93 0.06999132 28
## 10 82 0.07251824 29
## .. ... ... ...
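As an aside, the grouping step can also be written without rle: a new group starts whenever the current or the previous row has freq >= 5, which a cumulative sum expresses directly. A sketch of that variant (my own, not from the answers above; it assumes plain dplyr and the df from the question):
library(dplyr)
df %>%
  # new group when this row or the previous one is >= 5;
  # consecutive rows with freq < 5 therefore share a group
  mutate(group = cumsum(freq >= 5 | lag(freq, default = 5) >= 5)) %>%
  group_by(group) %>%
  summarize(labels = paste(unique(range(Z)), collapse = ";"),
            effectifs = sum(freq),
            pi = sum(proba)) %>%
  select(-group)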

create a new column with multiple categories using two columns when they satisfy certain conditions using R

I have a data set with "X" (values from 0 to 80) and "Y" (values from 0 to 80). I would like to create a new column "Table" holding one of 36 table numbers. The tables come in groups of 6 determined by the Y range:
Tables 1-6: Y 11-20, Tables 7-12: Y 21-30, Tables 13-18: Y 31-40, Tables 19-24: Y 41-50, Tables 25-30: Y 51-60, Tables 31-36: Y 61-70.
Within each group, the table is picked by the X range:
Table 1 (and 7, 13, 19, 25, 31): X 21-30
Table 2 (and 8, 14, 20, 26, 32): X 31-40
Table 3 (and 9, 15, 21, 27, 33): X 41-50
Table 4 (and 10, 16, 22, 28, 34): X 51-60
Table 5 (and 11, 17, 23, 29, 35): X 61-70
Table 6 (and 12, 18, 24, 30, 36): X 71-80
End Result:
X Y Table
45 13 3
66 59 29
21 70 31
17 66 NA (there is no table for X lower than 21)
Should I be using ifelse() to map "X" and "Y" into my new "Table" column (values 1 to 36), or something else? Any help will be appreciated! Thank you!
head(data)
value avg.temp X Y
1 0 6.69 45 13
2 0 6.01 48 14
3 0 7.35 39 15
4 0 5.86 45 15
5 0 6.43 42 16
6 0 5.68 48 16
I think you could use something like this. If your data frame is called df:
df$Table <- NA
df$Table[df$X>=21 & df$X<=30 & df$Y>=11 & df$Y<=20] <- 1
df$Table[df$X>=31 & df$X<=40 & df$Y>=11 & df$Y<=20] <- 2
...
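Writing out all 36 assignments gets tedious, but the same idea fits in a double loop over the six X bands and six Y bands (a sketch of that approach, assuming the data frame is called df):
df$Table <- NA
for (i in 1:6) {        # X bands: 21-30, 31-40, ..., 71-80
  for (j in 1:6) {      # Y bands: 11-20, 21-30, ..., 61-70
    xlo <- 11 + 10 * i  # lower X bound: 21, 31, ..., 71
    ylo <- 1 + 10 * j   # lower Y bound: 11, 21, ..., 61
    sel <- df$X >= xlo & df$X <= xlo + 9 & df$Y >= ylo & df$Y <= ylo + 9
    df$Table[sel] <- (j - 1) * 6 + i
  }
}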
Use math and indexes:
# demo data
x <- data.frame(X = c(45,66,21,17,0,1,21,80,45),Y = c(13,59,70,66,80,11,0,1,27))
# if each GROUP of Y tables was numbered 1-6, aka indexing
x$ytableindex <- ((x$Y-1) - (x$Y-1) %% 10) / 10
# NA if too low
x$ytableindex[x$ytableindex < 1] <- NA
# find lowest table based on Y index
x$ytable <- (0:5*6+1)[x$ytableindex]
# find difference from lowest Y table to arrive at correct table using X
x$xdiff <- floor((x$X - 1) / 10 - 2)
# NA if too low
x$xdiff[x$xdiff < 0] <- NA
# use difference to calculate the correct table, NA's stay NA
x$Table <- x$ytable + x$xdiff
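findInterval offers a similar arithmetic route with explicit breakpoints (again a sketch, reusing the same demo data frame x):
# bin X into its six bands and Y into its six bands (1-6 each)
xi <- findInterval(x$X, c(21, 31, 41, 51, 61, 71, 81))
yi <- findInterval(x$Y, c(11, 21, 31, 41, 51, 61, 71))
# values outside the covered ranges get NA
xi[x$X < 21 | x$X > 80] <- NA
yi[x$Y < 11 | x$Y > 70] <- NA
# six tables per Y band
x$Table <- (yi - 1) * 6 + xi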

How to generate a frequency table in R with cumulative frequency and relative frequency

I'm new with R. I need to generate a simple Frequency Table (as in books) with cumulative frequency and relative frequency.
So I want to generate from some simple data like
> x
[1] 17 17 17 17 17 17 17 17 16 16 16 16 16 18 18 18 10 12 17 17 17 17 17 17 17 17 16 16 16 16 16 18 18 18 10
[36] 12 15 19 20 22 20 19 19 19
a table like:
frequency cumulative relative
(9.99,11.7] 2 2 0.04545455
(11.7,13.4] 2 4 0.04545455
(13.4,15.1] 1 5 0.02272727
(15.1,16.9] 10 15 0.22727273
(16.9,18.6] 22 37 0.50000000
(18.6,20.3] 6 43 0.13636364
(20.3,22] 1 44 0.02272727
I know it should be simple, but I don't know how.
I got some results using this code:
factorx <- factor(cut(x, breaks=nclass.Sturges(x)))
as.matrix(table(factorx))
You're close! There are a few functions that will make this easy for you, namely cumsum() and prop.table(). Here's how I'd probably put this together. I make some random data, but the point is the same:
#Fake data
x <- sample(10:20, 44, TRUE)
#Your code
factorx <- factor(cut(x, breaks=nclass.Sturges(x)))
#Tabulate and turn into data.frame
xout <- as.data.frame(table(factorx))
#Add cumFreq and proportions
xout <- transform(xout, cumFreq = cumsum(Freq), relative = prop.table(Freq))
#-----
factorx Freq cumFreq relative
1 (9.99,11.4] 11 11 0.25000000
2 (11.4,12.9] 3 14 0.06818182
3 (12.9,14.3] 11 25 0.25000000
4 (14.3,15.7] 2 27 0.04545455
5 (15.7,17.1] 6 33 0.13636364
6 (17.1,18.6] 3 36 0.06818182
7 (18.6,20] 8 44 0.18181818
The base functions table, cumsum and prop.table should get you there:
cbind( Freq=table(x), Cumul=cumsum(table(x)), relative=prop.table(table(x)))
Freq Cumul relative
10 2 2 0.04545455
12 2 4 0.04545455
15 1 5 0.02272727
16 10 15 0.22727273
17 16 31 0.36363636
18 6 37 0.13636364
19 4 41 0.09090909
20 2 43 0.04545455
22 1 44 0.02272727
With cbind and naming the columns to your liking, this should be pretty easy for you in the future. The output from the table function is a matrix, so this result is also a matrix. If this were being done on something big, it would be more efficient to do this:
tbl <- table(x)
cbind( Freq=tbl, Cumul=cumsum(tbl), relative=prop.table(tbl))
If you are looking for something pre-packaged, consider the freq() function from the descr package.
library(descr)
x = c(sample(10:20, 44, TRUE))
freq(x, plot = FALSE)
Or to get cumulative percents, use the ordered() function
freq(ordered(x), plot = FALSE)
To add a "cumulative frequencies" column:
tab = as.data.frame(freq(ordered(x), plot = FALSE))
CumFreq = cumsum(tab[-dim(tab)[1],]$Frequency)
tab$CumFreq = c(CumFreq, NA)
tab
If your data has missing values, a valid percent column is added to the table.
x = c(sample(10:20, 44, TRUE), NA, NA)
freq(ordered(x), plot = FALSE)
Yet another possibility:
library(SciencesPo)
x = c(sample(10:20, 50, TRUE))
freq(x)
My suggestion is to check out the agricolae package:
library(agricolae)
weight <- c(68, 53, 69.5, 55, 71, 63, 76.5, 65.5, 69, 75, 76, 57, 70.5,
            71.5, 56, 81.5, 69, 59, 67.5, 61, 68, 59.5, 56.5, 73,
            61, 72.5, 71.5, 59.5, 74.5, 63)
h1<- graph.freq(weight,col="yellow",frequency=1,las=2,xlab="h1")
print(summary(h1),row.names=FALSE)
