I have a data frame that corresponds to the path taken by a river, describing elevation and distance. I need to evaluate each different ground path traveled by the river and extract this information.
Example:
df = data.frame(Soil = c("Forest", "Forest",
"Grass", "Grass","Grass",
"Scrub", "Scrub","Scrub","Scrub",
"Grass", "Grass","Grass","Grass",
"Forest","Forest","Forest","Forest","Forest","Forest"),
Distance = c(1, 5,
10, 15, 56,
59, 67, 89, 99,
102, 105, 130, 139,
143, 145, 167, 189, 190, 230),
Elevation = c(1500, 1499,
1470, 1467, 1456,
1450, 1445, 1440, 1435,
1430, 1420, 1412, 1400,
1390, 1387, 1384, 1380, 1376, 1370))
Soil Distance Elevation
1 Forest 1 1500
2 Forest 5 1499
3 Grass 10 1470
4 Grass 15 1467
5 Grass 56 1456
6 Scrub 59 1450
7 Scrub 67 1445
8 Scrub 89 1440
9 Scrub 99 1435
10 Grass 102 1430
11 Grass 105 1420
12 Grass 130 1412
13 Grass 139 1400
14 Forest 143 1390
15 Forest 145 1387
16 Forest 167 1384
17 Forest 189 1380
18 Forest 190 1376
19 Forest 230 1370
But I need something like this:
Soil Distance.Min Distance.Max Elevation.Min Elevation.Max
1 Forest 1 5 1499 1500
2 Grass 10 56 1456 1470
3 Scrub 59 99 1435 1450
4 Grass 102 139 1400 1430
5 Forest 143 230 1370 1390
I tried to use group_by() and which.min(Soil), but that takes into account the whole df, not each path.
We need a run-length encoding to track consecutive Soil.
Using this function (fashioned to mimic data.table::rleid):
myrleid <- function (x) {
r <- rle(x)
rep(seq_along(r$lengths), times = r$lengths)
}
We can do
library(dplyr)
df %>%
group_by(grp = myrleid(Soil)) %>%
summarize(Soil = Soil[1], across(c(Distance, Elevation), list(min = min, max = max))) %>%
select(-grp)
# # A tibble: 5 x 5
# Soil Distance_min Distance_max Elevation_min Elevation_max
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Forest 1 5 1499 1500
# 2 Grass 10 56 1456 1470
# 3 Scrub 59 99 1435 1450
# 4 Grass 102 139 1400 1430
# 5 Forest 143 230 1370 1390
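To see why the run id matters, here is a minimal base R sketch on a short, made-up vector (the name `soil` and its values are illustrative assumptions, not the question's data). The `myrleid` helper is repeated so the snippet runs on its own.

```r
# Helper repeated from the answer so the snippet is self-contained
myrleid <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), times = r$lengths)
}

# Hypothetical short vector: "Grass" appears in two separate runs
soil <- c("Forest", "Forest", "Grass", "Scrub", "Scrub", "Grass")

# One id per run of consecutive equal values
ids <- myrleid(soil)
ids
# [1] 1 1 2 3 3 4
```

The two Grass stretches receive ids 2 and 4, so grouping by the id keeps them apart, whereas grouping by Soil alone would pool them into one group.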
You can try this:
df = df %>% mutate(id=data.table::rleid(Soil))
inner_join(
distinct(df %>% select(Soil,id)),
df %>%
group_by(id) %>%
summarize(across(Distance:Elevation, .fns = list("min" = min,"max"=max))),
by="id"
) %>% select(!id)
Output:
Soil Distance_min Distance_max Elevation_min Elevation_max
1 Forest 1 5 1499 1500
2 Grass 10 56 1456 1470
3 Scrub 59 99 1435 1450
4 Grass 102 139 1400 1430
5 Forest 143 230 1370 1390
Or, even more concise, thanks to r2evans.
df %>%
group_by(id = data.table::rleid(Soil)) %>%
summarize(Soil=first(Soil),across(Distance:Elevation, .fns = list("min" = min,"max"=max))) %>%
select(!id)
I have the following dataframe:
> zCode <- sample(50:150, size = 10, replace = TRUE)
> x <- sample(50:150, size = 10, replace = TRUE)
> test <- data.frame(zCode, x)
> test
zCode x
1 110 114
2 108 150
3 57 100
4 53 98
5 114 67
6 143 126
7 110 95
8 106 101
9 103 70
10 149 73
I also have this vector:
> z <- c(53, 57, 110)
> z
[1] 53 57 110
I want to create a new dataframe based on vector z that pulls the maximum x value associated with each zCode, like so:
Z x
53 98
57 100
110 114
Here are some possibilities. They do not use any packages.
1) For each element of z compute the subset of rows in test with that zCode and then take the maximum of each x:
data.frame(z, x = sapply(z, function(z) max(subset(test, z == zCode)$x)))
giving:
z x
1 53 98
2 57 100
3 110 114
2) Another approach is to use aggregate to find all the maxima and then merge with z to keep just those:
merge(data.frame(z), aggregate(x ~ zCode, test, max), by = 1, all.x = TRUE)
giving:
z x
1 53 98
2 57 100
3 110 114
Note: The input used, in reproducible form, is:
Lines <- "
zCode x
1 110 114
2 108 150
3 57 100
4 53 98
5 114 67
6 143 126
7 110 95
8 106 101
9 103 70
10 149 73"
test <- read.table(text = Lines)
z <- c(53, 57, 110)
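A third base R variant, offered only as a sketch alongside the two approaches above: compute every per-code maximum once with tapply and then look up the codes in z by name. The intermediate names `m` and `res` are illustrative.

```r
# Input as in the question
test <- data.frame(
  zCode = c(110, 108, 57, 53, 114, 143, 110, 106, 103, 149),
  x     = c(114, 150, 100, 98, 67, 126, 95, 101, 70, 73)
)
z <- c(53, 57, 110)

# Maximum of x for every zCode; the result is named by zCode
m <- tapply(test$x, test$zCode, max)

# Look up only the codes in z, in the order given by z
res <- data.frame(z, x = as.vector(m[as.character(z)]))
res
#     z   x
# 1  53  98
# 2  57 100
# 3 110 114
```

A code in z that does not occur in test would come back as NA here rather than raising an error.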
Here is a data.table solution:
library(data.table)
# Original data
dt <- data.table(zCode = c(110, 108, 57, 53, 114, 143, 110, 106, 103, 149),
x = c(114, 150, 100, 98, 67, 126, 95, 101, 70, 73))
z <- c(53, 57, 110)
# a new dataframe based on vector z
dt[zCode %in% z, max(x), by = zCode]
zCode V1
1: 110 114
2: 57 100
3: 53 98
EDIT:
# Keeps the columns names unchanged
dt[zCode %in% z, .(x = max(x)), by = zCode]
zCode x
1: 110 114
2: 57 100
3: 53 98
Using the following data:
set.seed(1234)
df1 <- structure(
list(wavelength = c(400, 400, 400, 400, 400, 400, 400, 400, 500, 500, 500, 500, 500, 500, 500, 500),
depth = c(0, 30, 40, 60, 79, 89, 101, 110, 0, 30, 40, 60, 79, 89, 101, 110),
value = sample(16)),
class = "data.frame", row.names = c(NA, -16L), .Names = c("wavelength", "depth", "value"))
df1
#> wavelength depth value
#> 1 400 0 2
#> 2 400 30 10
#> 3 400 40 9
#> 4 400 60 14
#> 5 400 79 11
#> 6 400 89 8
#> 7 400 101 1
#> 8 400 110 3
#> 9 500 0 6
#> 10 500 30 4
#> 11 500 40 5
#> 12 500 60 13
#> 13 500 79 16
#> 14 500 89 12
#> 15 500 101 15
#> 16 500 110 7
How can I group the data by wavelength and then calculate res so that it represents an arithmetic operation between consecutive pairs of value? In this example, res is simply the sum of squares of each pair: res[1] is 2^2 + 10^2, res[2] is 10^2 + 9^2, and so on.
df2 <- structure(
list(wavelength = c(400, 400, 400, 400, 400, 400, 400, 500, 500, 500, 500, 500, 500, 500),
depth = rep(c("0-30", "30-40", "40-60", "60-79", "79-89", "89-101", "101-110"), 2),
res = c(104, 181, 277, 317, 185, 65, 10, 52, 41, 194, 425, 400, 369, 274)),
class = "data.frame", row.names = c(NA, -14L), .Names = c("wavelength", "depth", "res"))
df2
#> wavelength depth res
#> 1 400 0-30 104
#> 2 400 30-40 181
#> 3 400 40-60 277
#> 4 400 60-79 317
#> 5 400 79-89 185
#> 6 400 89-101 65
#> 7 400 101-110 10
#> 8 500 0-30 52
#> 9 500 30-40 41
#> 10 500 40-60 194
#> 11 500 60-79 425
#> 12 500 79-89 400
#> 13 500 89-101 369
#> 14 500 101-110 274
Ideally, the answer would use the dplyr syntax.
Update
Based on received answers I came up with this solution.
f1 <- function(x, y) {
return(x^2 + y^2)
}
df1 %>%
group_by(wavelength) %>%
mutate(depth = paste(depth, lead(depth), sep = "-")) %>%
mutate(res = f1(value, lead(value))) %>%
na.omit()
#> Source: local data frame [14 x 4]
#> Groups: wavelength [2]
#>
#> wavelength depth value res
#> <dbl> <chr> <int> <dbl>
#> 1 400 0-30 2 104
#> 2 400 30-40 10 181
#> 3 400 40-60 9 277
#> 4 400 60-79 14 317
#> 5 400 79-89 11 185
#> 6 400 89-101 8 65
#> 7 400 101-110 1 10
#> 8 500 0-30 6 52
#> 9 500 30-40 4 41
#> 10 500 40-60 5 194
#> 11 500 60-79 13 425
#> 12 500 79-89 16 400
#> 13 500 89-101 12 369
#> 14 500 101-110 15 274
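As a quick check of the pairwise arithmetic, here is a base R sketch for the wavelength 400 group only. It pairs each element with its successor by dropping the last and the first element respectively; the vector is copied from df1 and the name `res` is illustrative.

```r
# Values of the wavelength == 400 group from df1
value <- c(2, 10, 9, 14, 11, 8, 1, 3)

# head(value, -1) drops the last element, tail(value, -1) drops the first,
# so position i holds value[i]^2 + value[i + 1]^2
res <- head(value, -1)^2 + tail(value, -1)^2
res
# [1] 104 181 277 317 185  65  10
```

This yields one result fewer than there are values, which is why the dplyr version ends up with an NA in the last row of each group and removes it with na.omit().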
After grouping by 'wavelength', create the 'depth' column by pasting 'depth' with the 'lead' of 'depth', replace 'value' with the difference of adjacent elements (diff), and then remove the rows containing NA with na.omit. This example takes differences; substitute whatever pairwise operation you need.
library(dplyr)
df1 %>%
group_by(wavelength) %>%
mutate(depth = paste(depth, lead(depth), sep="-"),
value = c(diff(value), NA)) %>% na.omit()
# wavelength depth value
# <dbl> <chr> <int>
#1 400 0-30 8
#2 400 30-40 -1
#3 400 40-60 5
#4 400 60-79 -3
#5 400 79-89 -3
#6 400 89-101 -7
#7 400 101-110 2
#8 500 0-30 -2
#9 500 30-40 1
#10 500 40-60 8
#11 500 60-79 3
#12 500 79-89 -4
#13 500 89-101 3
#14 500 101-110 -8
I have a data frame like this:
df <- data.frame(x=c(7,5,4),y=c(100,100,100),w=c(170,170,170),z=c(132,720,1256))
I create a new column using mapply:
set.seed(123)
library(truncnorm)
df$res <- mapply(rtruncnorm,df$x,df$y,df$w,df$z,25)
So, I got:
> df
#x y w z res
#1 7 100 170 132 117.9881, 126.2456, 133.7627, 135.2322, 143.5229, 100.3735, 114.8287
#2 5 100 170 720 168.8581, 169.4955, 169.6461, 169.8998, 169.0343
#3 4 100 170 1256 169.7245, 167.6744, 169.7025, 169.4441
#dput(df)
df <- structure(list(x = c(7, 5, 4), y = c(100, 100, 100), w = c(170,
170, 170), z = c(132, 720, 1256), res = list(c(117.988108836195,
126.245562762918, 133.762709785614, 135.232193379024, 143.52290514973,
100.373469134837, 114.828678702662), c(168.858147661715, 169.495493758985,
169.646123183828, 169.899849943838, 169.034333943479), c(169.724470294466,
167.674371713068, 169.70250974042, 169.444134892323))), .Names = c("x",
"y", "w", "z", "res"), row.names = c(NA, -3L), class = "data.frame")
But what I really need is to repeat each row of the df dataframe according to the df$res result, as follows:
> df2
# x y w z res
#1 7 100 170 132 117.9881
#2 7 100 170 132 126.2456
#3 7 100 170 132 133.7627
#4 7 100 170 132 135.2322
#5 7 100 170 132 143.5229
#6 7 100 170 132 100.3735
#7 7 100 170 132 114.8287
#8 5 100 170 720 168.8581
#9 5 100 170 720 169.4955
#10 5 100 170 720 169.6461
#11 5 100 170 720 169.8998
#12 5 100 170 720 169.0343
#13 4 100 170 1256 169.7245
#14 4 100 170 1256 167.6744
#15 4 100 170 1256 169.7025
#16 4 100 170 1256 169.4441
How do I achieve this efficiently? I need to apply this to a big dataframe.
df <- data.frame(x=c(7,5,4),y=c(100,100,100),w=c(170,170,170),z=c(132,720,1256))
set.seed(123)
l <- mapply(rtruncnorm,df$x,df$y,df$w,df$z,25)
cbind.data.frame(df[rep(seq_along(l), lengths(l)),],
res = unlist(l))
# x y w z res
# 1 7 100 170 132 117.9881
# 1.1 7 100 170 132 126.2456
# 1.2 7 100 170 132 133.7627
# 1.3 7 100 170 132 135.2322
# 1.4 7 100 170 132 143.5229
# 1.5 7 100 170 132 100.3735
# 1.6 7 100 170 132 114.8287
# 2 5 100 170 720 168.8581
# 2.1 5 100 170 720 169.4955
# 2.2 5 100 170 720 169.6461
# 2.3 5 100 170 720 169.8998
# 2.4 5 100 170 720 169.0343
# 3 4 100 170 1256 169.7245
# 3.1 4 100 170 1256 167.6744
# 3.2 4 100 170 1256 169.7025
# 3.3 4 100 170 1256 169.4441
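The key step above is the row index rep(seq_along(l), lengths(l)). A small deterministic sketch, using made-up values instead of rtruncnorm draws (the data frame `small`, the list `l` and all numbers are illustrative assumptions), shows what it produces:

```r
# Hypothetical two-row data frame and a list with one vector per row
small <- data.frame(x = c(2, 1), y = c(100, 100))
l <- list(c(1.5, 2.5), 3.5)

# Repeat each row number as many times as its list element is long
idx <- rep(seq_along(l), lengths(l))
idx
# [1] 1 1 2

# Expand the rows and attach the flattened values
out <- cbind.data.frame(small[idx, ], res = unlist(l))
rownames(out) <- NULL  # optional: replace 1, 1.1, 2 with 1, 2, 3
out
```

Because this only indexes rows and flattens the list once, it avoids building the result row by row, which is generally what matters on a large data frame.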
Try this based on your given df:
df$res <- sapply(df$res, paste0, collapse=",")
do.call(rbind, apply(df, 1, function(x) do.call(expand.grid, strsplit(x, ","))))
# x y w z res
# 1 7 100 170 132 117.988108836195
# 2 7 100 170 132 126.245562762918
# 3 7 100 170 132 133.762709785614
# 4 7 100 170 132 135.232193379024
# 5 7 100 170 132 143.52290514973
# 6 7 100 170 132 100.373469134837
# 7 7 100 170 132 114.828678702662
# 8 5 100 170 720 168.858147661715
# 9 5 100 170 720 169.495493758985
# 10 5 100 170 720 169.646123183828
# 11 5 100 170 720 169.899849943838
# 12 5 100 170 720 169.034333943479
# 13 4 100 170 1256 169.724470294466
# 14 4 100 170 1256 167.674371713068
# 15 4 100 170 1256 169.70250974042
# 16 4 100 170 1256 169.444134892323
I am trying to duplicate "manually" the example in this Wikipedia article using R.
Here is the data:
after = c(125, 115, 130, 140, 140, 115, 140, 125, 140, 135)
before = c(110, 122, 125, 120, 140, 124, 123, 137, 135, 145)
sgn = sign(after-before)
abs = abs(after - before)
d = data.frame(after,before,sgn,abs)
after before sgn abs
1 125 110 1 15
2 115 122 -1 7
3 130 125 1 5
4 140 120 1 20
5 140 140 0 0
6 115 124 -1 9
7 140 123 1 17
8 125 137 -1 12
9 140 135 1 5
10 135 145 -1 10
If I try to rank the rows based on the abs column, the 0 entry is naturally ranked as 1:
rank = rank(abs)
(d = data.frame(after,before,sgn,abs,rank))
after before sgn abs rank
1 125 110 1 15 8.0
2 115 122 -1 7 4.0
3 130 125 1 5 2.5
4 140 120 1 20 10.0
5 140 140 0 0 1.0
6 115 124 -1 9 5.0
7 140 123 1 17 9.0
8 125 137 -1 12 7.0
9 140 135 1 5 2.5
10 135 145 -1 10 6.0
However, zeros are ignored in the Wilcoxon signed-rank test.
How can I get R to ignore that row, so as to end up with:
after before sgn abs rank
1 125 110 1 15 7.0
2 115 122 -1 7 3.0
3 130 125 1 5 1.5
4 140 120 1 20 9.0
5 140 140 0 0 0
6 115 124 -1 9 4.0
7 140 123 1 17 8.0
8 125 137 -1 12 6.0
9 140 135 1 5 1.5
10 135 145 -1 10 5.0
SOLUTION (accepted answer below):
after = c(125, 115, 130, 140, 140, 115, 140, 125, 140, 135)
before = c(110, 122, 125, 120, 140, 124, 123, 137, 135, 145)
sgn = sign(after-before)
abs = abs(after - before)
d = data.frame(after,before,sgn,abs)
d$rank = rank(replace(abs,abs==0,NA), na.last='keep')
d$multi = d$sgn * d$rank
(W=abs(sum(d$multi, na.rm = TRUE)))
[1] 9
From the Wikipedia article:
Exclude pairs with |x2,i − x1,i| = 0. Let Nr be the reduced sample size.
We need to exclude zeroes. By my thinking, you should replace zeroes with NA, and then tell rank() to leave NAs out of the ranking. Since you need to return a vector of the same length as the input, you can pass 'keep' as the na.last argument (abbreviated here to na, which R matches partially):
d$rank <- rank(replace(abs,abs==0,NA),na='keep');
d;
## after before sgn abs rank
## 1 125 110 1 15 7.0
## 2 115 122 -1 7 3.0
## 3 130 125 1 5 1.5
## 4 140 120 1 20 9.0
## 5 140 140 0 0 NA
## 6 115 124 -1 9 4.0
## 7 140 123 1 17 8.0
## 8 125 137 -1 12 6.0
## 9 140 135 1 5 1.5
## 10 135 145 -1 10 5.0
The subtraction-based solutions will not work if the input vector contains no zeroes or more than one zero.
You could create the new column and then just update the rank where the abs value isn't 0
d$rank <- 0 # default value for rows with abs=0
d$rank[d$abs!=0] <- rank(d$abs[d$abs!=0])
If you wanted to drop the row completely, you could just do
transform(subset(d, abs!=0), rank=rank(abs))
A quick way to do it would be to rank as normal and then do:
d$rank <- ifelse(d$rank == 1, 0, d$rank - 1)
This switches all ranks of 1 to 0, and reduces any other ranks by 1.
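A small sketch of where the two approaches diverge, using a made-up vector with two zeros (the name `abs_diff` and its values are illustrative assumptions). Tied zeros share rank 1.5, so the "rank 1 becomes 0" rule never fires, while the NA-based ranking still works.

```r
# Hypothetical absolute differences containing two zeros
abs_diff <- c(0, 5, 0, 3)

# Subtraction approach: the zeros tie at rank 1.5, so none is set to 0
r <- rank(abs_diff)
shifted <- ifelse(r == 1, 0, r - 1)
shifted
# [1] 0.5 3.0 0.5 2.0

# NA approach: zeros are excluded and the rest are ranked from 1
kept <- rank(replace(abs_diff, abs_diff == 0, NA), na.last = "keep")
kept
# [1] NA  2 NA  1
```

With exactly one zero and no other complications both approaches agree on the non-zero rows, which is why the shortcut works on the question's data.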