str_match based on vector with count issue - r

I havent got a reprex but my data are stored in a csv file
https://transcode.geo.data.gouv.fr/services/5e2a1fbefa4268bc25628f27/feature-types/drac:site?format=CSV&projection=WGS84
library(readr)
bzh_sites <- read_csv("site.csv")
I want to count row based on characters matching (column NATURE)
pattern<-c("allée|aqueduc|architecture|atelier|bas|carrière|caveau|chapelle|château|chemin|cimetière|coffre|dépôt|dolmen|eau|église|enceinte|enclos|éperon|espace|exploitation|fanum|ferme|funéraire|groupe|habitat|maison|manoir|menhir|monastère|motte|nécropole|occupation|organisation|parcellaire|pêcherie|prieuré|production|rue|sépulture|stèle|thermes|traitement|tumulus|villa")
test2 <- bzh_sites %>%
drop_na(NATURE) %>%
group_by(NATURE = str_match( NATURE, pattern )) %>%
summarise(n = n())
gives me :
NATURE n
1 allée 176
2 aqueduc 73
3 architecture 68
4 atelier 200
AND another test with the same data (NATURE)
pattern <- c("allée|aqueduc|architecture|atelier")
test2 <- bzh_sites %>%
drop_na(NATURE) %>%
group_by(NATURE = str_match( NATURE, pattern )) %>%
summarise(n = n())
gives me :
NATURE n
1 allée 178
2 aqueduc 74
3 architecture 79
4 atelier 248
I have no idea about the différences of count.

I tried to find out where the discrepancy is for first group i.e "allée". This is what I found :
library(stringr)
pattern1<-c("allée|aqueduc|architecture|atelier|bas|carrière|caveau|chapelle|château|chemin|cimetière|coffre|dépôt|dolmen|eau|église|enceinte|enclos|éperon|espace|exploitation|fanum|ferme|funéraire|groupe|habitat|maison|manoir|menhir|monastère|motte|nécropole|occupation|organisation|parcellaire|pêcherie|prieuré|production|rue|sépulture|stèle|thermes|traitement|tumulus|villa")
#Get indices where 'allée' is found using pattern1
ind1 <- which(str_match(bzh_sites$NATURE, pattern1 )[, 1] == 'allée')
pattern2 <- c("allée|aqueduc|architecture|atelier")
#Get indices where 'allée' is found using pattern1
ind2 <- which(str_match(bzh_sites$NATURE, pattern2)[, 1] == 'allée')
#Indices which are present in ind2 but absent in ind1
setdiff(ind2, ind1)
#[1] 3093 10400
#Get corresponding text
temp <- bzh_sites$NATURE[setdiff(ind2, ind1)]
temp
#[1] "dolmen allée couverte" "coffre funéraire allée couverte"
What happens when we use pattern1 and pattern2 on temp
str_match(temp, pattern1)
# [,1]
#[1,] "dolmen"
#[2,] "coffre"
str_match(temp, pattern2)
# [,1]
#[1,] "allée"
#[2,] "allée"
As we can see using pattern1 certain values are classified in another group since they occur first in the string hence we have a mismatch.
A similar explanation can be given for mismatches in other groups.
str_match only returns first match, to get all the matches in pattern we can use str_match_all
table(unlist(str_match_all(bzh_sites$NATURE, pattern1)))
# allée aqueduc architecture atelier bas
# 178 76 79 252 62
# carrière caveau chapelle château chemin
# 46 35 226 205 350
# cimetière coffre dépôt dolmen eau
# 275 155 450 542 114
# église enceinte enclos éperon space
# 360 655 338 114 102
#exploitation fanum ferme funéraire groups
# 1856 38 196 1256 295
# habitat maison manoir menhir monastère
# 1154 65 161 1036 31
# motte nécropole occupation organisation parcellaire
# 566 312 5152 50 492
# pêcherie prieuré production rue sépulture
# 69 66 334 44 152
# stèle thermes traitement tumulus villa
# 651 50 119 1232 225

Related

Summing columns in a data frame and adding those values to a new data frame in R [duplicate]

This question already has answers here:
How to sum data.frame column values?
(5 answers)
Closed 2 years ago.
I am trying to sum the columns of a data frame and add these sums to a new output data frame. When I run the following script, I get an error stating that the replacement has two rows and the data has 3.
a <-data.frame(replicate(3,sample(1:100,10,rep=TRUE)))
colnames(a) <- c("name1", "name2","name3")
for (i in 1:ncol(a)) {
b <-as.data.frame(names(a))
c <- sum(a[i])
b$d[i] <- c[i]
}
I am looking for the output as a data frame such as:
name1 sum1
name2 sum2
name3 sum3
Your solution was already pretty close. I made some slight modifications for you and it works:
a <-data.frame(replicate(3,sample(1:100,10,rep=TRUE)))
colnames(a) <- c("name1", "name2","name3")
b <-as.data.frame(names(a))
for (i in 1:ncol(a)) {
b$sum[i] <- sum(a[i])
}
Output:
names(a) sum
1 name1 470
2 name2 616
3 name3 495
I would suggest a dplyr approach:
library(dplyr)
#Data
a <-data.frame(replicate(3,sample(1:100,10,rep=TRUE)))
colnames(a) <- c("name1", "name2","name3")
#Code
a %>%
mutate(across(c(name1:name3),.fns = list(sum = ~ sum(.,na.rm=T)) ))
Output:
name1 name2 name3 name1_sum name2_sum name3_sum
1 98 31 79 599 489 506
2 8 71 4 599 489 506
3 59 23 48 599 489 506
4 65 76 64 599 489 506
5 47 53 57 599 489 506
6 80 84 55 599 489 506
7 40 19 28 599 489 506
8 39 2 47 599 489 506
9 65 36 40 599 489 506
10 98 94 84 599 489 506
If only one dataframe is desired you can use this:
a %>%
summarise(across(c(name1:name3),.fns = list(sum = ~ sum(.,na.rm=T)) ))
Output:
name1_sum name2_sum name3_sum
1 599 489 506
Initial code should be used when you want to add those variables to same dataframe.
And if you want a variable for the names and another for results you can use previous code with pivot_longer() from tidyverse to produce this:
library(tidyverse)
#Code
a %>%
summarise(across(c(name1:name3),.fns = list(sum = ~ sum(.,na.rm=T)) )) %>%
pivot_longer(cols = everything())
Output:
# A tibble: 3 x 2
name value
<chr> <int>
1 name1_sum 599
2 name2_sum 489
3 name3_sum 506
It can be vectorized with colSums in base R
as.data.frame.list(colSums(a))
Or for a two column summary
stack(colSums(a))
If we need to create new columns in 'a'
a[paste0(names(a), "_sum")] <- colSums(a)

Convert list of lists to single dataframe with first column filled by first value (for each list) in R

I have a list of lists, like so:
x <-list()
x[[1]] <- c('97', '342', '333')
x[[2]] <- c('97','555','556','742','888')
x[[3]] <- c ('100', '442', '443', '444', '445','446')
The first number in each list (97, 97, 100) refers to a node in a tree and the following numbers refer to traits associated with that node.
My goal is to create a dataframe that looks like this:
df= data.frame(node = c('97','97','97','97','97','97','100','100','100','100','100'),
trait = c('342','333','555','556','742','888','442','443','444','445','446'))
where each trait has its corresponding node.
I think the first thing I need to do is convert the list of lists into a single dataframe. I've tried doing so using:
do.call(rbind,x)
but that repeats the values in x[[1]] and x[[2]] to match the length of x[[3]]. I've also tried using:
dt_list <- map(x, as.data.table)
dt <- rbindlist(dt_list, fill = TRUE, idcol = T)
Which I think gets me closer, but I'm still unsure of how to assign the first node value to the corresponding trait values. I know this is probably a simple task but it's stumping me today!
Maybe you can try the code below
h <- sapply(x, `[`,1)
d <- lapply(x, `[`,-1)
df <- data.frame(node = rep(h,lengths(d)), trait = unlist(d))
such that
> df
node trait
1 97 342
2 97 333
3 97 555
4 97 556
5 97 742
6 97 888
7 100 442
8 100 443
9 100 444
10 100 445
11 100 446
You can create a data frame with the first value from the vector in column 'node' and the rest of the values in column 'trait'. This strategy can be applied to all entries in the list using the map_df() function from purrr package, giving the output you describe.
library(purrr)
library(dplyr)
x %>%
map_df(., function(vec) data.frame(node = vec[1],
trait = vec[-1],
stringsAsFactors = F))
An option with base R is
stack(setNames(lapply(x, `[`, -1), sapply(x, `[`, 1)))[2:1]
# ind values
#1 97 342
#2 97 333
#3 97 555
#4 97 556
#5 97 742
#6 97 888
#7 100 442
#8 100 443
#9 100 444
#10 100 445
#11 100 446
Another solution
library(tidyverse)
library(purrr)
node <- map(x, ~rep(.x[1], length(.x)-1)) %>% flatten_chr()
trait <- map(x, ~.x[2:length(.x)]) %>% flatten_chr()
out <- tibble(node, trait)
node trait
<chr> <chr>
1 97 342
2 97 333
3 97 555
4 97 556
5 97 742
6 97 888
7 100 442
8 100 443
9 100 444
10 100 445
11 100 446

R - Sum range over lookback period, divided sum of look back - excel to R

I am looking to workout a percentage total over a look back range in R.
I know how to do this in excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This is summing column B over a range of today looking back 3 lines. It then divides this sum buy the total sum of column B + C again looking back 3 lines.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code I could perhaps add the look back range too:
perct <- apply(data.matrix[,c('adv','dec')], 1, function(x) { (x[1] / x[1] + x[2]) } )
If i could get [1] to sum the previous 3 line range and
If i could get [2] to also sum the previous 3 line range.
Still learning how to apply forward and look back periods within R. So any additional learning on the answer would be appreciated!
Here are some approaches. The first 3 use rollsumr and/or rollapplyr in zoo and the last one uses only the base of R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rollling sums, convert that to row proportions and take the "adv" column. Finally assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
ifelse(i>=3, sum(DF$adv[(i-2):i])/(sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834

Custom sorting of a dataframe in R

I have a binomail dataset that looks like this:
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
df <-df[order(df$replicate.1..sample.0.1..1000..rep...TRUE..),]
The data is currently soreted in a way to show the instances belonging to 0 group then the ones belonging to the 1 group. Is there a way I can sort the data in a 0-1-0-1-0... fashion? I mean to show a row that belongs to the 0 group, the row after belonging to the 1 group then the zero group and so on...
All I can think about is complex functions. I hope there's a simple way around it.
Thank you,
Here's an attempt, which will add any extra 1's at the end:
First make some example data:
set.seed(2)
df <- data.frame(replicate(4,sample(1:200,10,rep=TRUE)),
addme=sample(0:1,10,rep=TRUE))
Then order:
with(df, df[unique(as.vector(rbind(which(addme==0),which(addme==1)))),])
# X1 X2 X3 X4 addme
#2 141 48 78 33 0
#1 37 111 133 3 1
#3 115 153 168 163 0
#5 189 82 70 103 1
#4 34 37 31 174 0
#6 189 171 98 126 1
#8 167 46 72 57 0
#7 26 196 30 169 1
#9 94 89 193 134 1
#10 110 15 27 31 1
#Warning message:
#In rbind(which(addme == 0), which(addme == 1)) :
# number of columns of result is not a multiple of vector length (arg 1)
Here's another way using dplyr, which would make it suitable for within-group ordering. It's also probably pretty quick. If there's unbalanced numbers of 0's and 1's, it will leave them at the end.
library(dplyr)
df %>%
arrange(addme) %>%
mutate(n0 = sum(addme == 0),
orderme = seq_along(addme) - (n0 * addme) + (0.5 * addme)) %>%
arrange(orderme) %>%
select(-n0, -orderme)

Find the non zero values and frequency of those values in R

I have a data which has two parameters, they are data/time and flow. The flow data is intermittent flow. Lets say at times there is zero flow and suddenly the flow starts and there will be non-zero values for sometime and then the flow will be zero again. I want to understand when the non-zero values occur and how long does each non-zero flow last. I have attached the sample dataset at this location https://www.dropbox.com/s/ef1411dq4gyg0cm/sampledataflow.csv
The data is 1 minute data.
I was able to import the data into R as follows:
flow <- read.csv("sampledataflow.csv")
summary(flow)
names(flow) <- c("Date","discharge")
flow$Date <- strptime(flow$Date, format="%m/%d/%Y %H:%M")
sapply(flow,class)
plot(flow$Date, flow$discharge,type="l")
I made plot to see the distribution but couldn't get a clue where to start to get the frequency of each non zero values. I would like to see a output table as follows:
Date Duration in Minutes
Please let me know if I am not clear here. Thanks.
Additional Info:
I think we need to check the non-zero value first and then find how many non zero values are there continuously before it reaches zero value again. What I want to understand is the flow release durations. For eg. in one day there might be multiple releases and I want to note at what time did the release start and how long did it continue before coming to value zero. I hope this explain the problem little better.
The first point is that you have too many NA in your data. In case you want to look into it.
If I understand correctly, you require the count of continuous 0's followed by continuous non-zeros, zeros, non-zeros etc.. for each date.
This can be achieved with rle of course, as also mentioned by #mnel under comments. But there are quite a few catches.
First, I'll set up the data with non-NA entries:
flow <- read.csv("~/Downloads/sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow <- flow[1:33119, ] # remove NA entries
# format Date to POSIXct to play nice with data.table
flow$Date <- as.POSIXct(flow$Date, format="%m/%d/%Y %H:%M")
Next, I'll create a Date column:
flow$g1 <- as.Date(flow$Date)
Finally, I prefer using data.table. So here's a solution using it.
# load package, get data as data.table and set key
require(data.table)
flow.dt <- data.table(flow)
# set key to both "Date" and "g1" (even though, just we'll use just g1)
# to make sure that the order of rows are not changed (during sort)
setkey(flow.dt, "Date", "g1")
# group by g1 and set data to TRUE/FALSE by equating to 0 and get rle lengths
out <- flow.dt[, list(duration = rle(discharge == 0)$lengths,
val = rle(discharge == 0)$values + 1), by=g1][val == 2, val := 0]
> out # just to show a few first and last entries
# g1 duration val
# 1: 2010-05-31 120 0
# 2: 2010-06-01 722 0
# 3: 2010-06-01 138 1
# 4: 2010-06-01 32 0
# 5: 2010-06-01 79 1
# ---
# 98: 2010-06-22 291 1
# 99: 2010-06-22 423 0
# 100: 2010-06-23 664 0
# 101: 2010-06-23 278 1
# 102: 2010-06-23 379 0
So, for example, for 2010-06-01, there are 722 0's followed by 138 non-zeros, followed by 32 0's followed by 79 non-zeros and so on...
I looked a a small sample of the first two days
> do.call( cbind, tapply(flow$discharge, as.Date(flow$Date), function(x) table(x > 0) ) )
2010-06-01 2010-06-02
FALSE 1223 911
TRUE 217 529 # these are the cumulative daily durations of positive flow.
You may want this transposed in which case the t() function should succeed. Or you could use rbind.
If you jsut wante the number of flow-postive minutes, this would also work:
tapply(flow$discharge, as.Date(flow$Date), function(x) sum(x > 0, na.rm=TRUE) )
#--------
2010-06-01 2010-06-02 2010-06-03 2010-06-04 2010-06-05 2010-06-06 2010-06-07 2010-06-08
217 529 417 463 0 0 263 220
2010-06-09 2010-06-10 2010-06-11 2010-06-12 2010-06-13 2010-06-14 2010-06-15 2010-06-16
244 219 287 234 31 245 311 324
2010-06-17 2010-06-18 2010-06-19 2010-06-20 2010-06-21 2010-06-22 2010-06-23 2010-06-24
299 305 124 129 295 296 278 0
To get the lengths of intervals with discharge values greater than zero:
tapply(flow$discharge, as.Date(flow$Date), function(x) rle(x>0)$lengths[rle(x>0)$values] )
#--------
$`2010-06-01`
[1] 138 79
$`2010-06-02`
[1] 95 195 239
$`2010-06-03`
[1] 57 360
$`2010-06-04`
[1] 6 457
$`2010-06-05`
integer(0)
$`2010-06-06`
integer(0)
... Snipped output
If you want to look at the distribution of these durations you will need to unlist that result. (And remember that the durations which were split at midnight may have influenced the counts and durations.) If you just wanted durations without dates, then use this:
flowrle <- rle(flow$discharge>0)
flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
#----------
[1] 138 79 95 195 296 360 6 457 263 17 203 79 80 85 30 189 17 270 127 107 31 1
[23] 2 1 241 311 229 13 82 299 305 3 121 129 295 3 2 291 278

Resources