For each cell in a particular column of a dataframe (which here we will simply name df), I want to find the maximum and minimum numbers that appear embedded in the cell's string. Any commas present in the cell have no special significance. Percentages should not be considered, so if for example 50% appears then 50 is to be excluded. The relevant column of the dataframe looks something like this:
| particular_col_name |
| ------------------- |
| First Row String10. This is also a string_5, and so is this 20, exclude70% |
| Second_Row_50%, number40. Number 4. number_15 |
So two new columns should be created, titled 'maximum_number' and 'minimum_number'; in the case of the first row these should contain 20 and 5 respectively. Note that 70 has been excluded because of the % sign next to it. Similarly, the second row should put 40 and 4 into the new columns.
I have tried a couple of methods (e.g. str_extract_all, regmatches, strsplit) within dplyr's mutate, but they either give error messages (particularly regarding the input column particular_col_name) or do not output the data in a format from which the maximum and minimum values can easily be identified.
Any help on this would be much appreciated.
library(tidyverse)

tibble(
  particular_col_name = c(
    "First Row String10. This is also a string_5, and so is this 20, exclude70%",
    "Second_Row_50%, number40. Number 4. number_15",
    "20% 30%"
  )
) %>%
  mutate(
    # drop the percentages first, then pull out the remaining digit runs
    numbers = particular_col_name %>% map(~ {
      .x %>% str_remove_all("[0-9]+%") %>% str_extract_all("[0-9]+") %>% simplify() %>% as.numeric()
    }),
    # min()/max() of an empty vector return +/-Inf (with a warning); turn those into NA
    min = numbers %>% map_dbl(~ .x %>% min() %>% na_if(Inf) %>% na_if(-Inf)),
    max = numbers %>% map_dbl(~ .x %>% max() %>% na_if(Inf) %>% na_if(-Inf))
  ) %>%
  select(-numbers)
#> Warning in min(.): no non-missing arguments to min; returning Inf
#> Warning in max(.): no non-missing arguments to max; returning -Inf
#> # A tibble: 3 x 3
#> particular_col_name min max
#> <chr> <dbl> <dbl>
#> 1 First Row String10. This is also a string_5, and so is this 20, e… 5 20
#> 2 Second_Row_50%, number40. Number 4. number_15 4 40
#> 3 20% 30% NA NA
Created on 2022-02-22 by the reprex package (v2.0.0)
We could use str_extract_all in combination with sapply, stripping the percentages first so that e.g. 70% is excluded:

library(stringr)

cleaned <- str_remove_all(df$particular_col_name, "[0-9]+%")
df$min <- sapply(str_extract_all(cleaned, "[0-9]+"), function(x) min(as.integer(x)))
df$max <- sapply(str_extract_all(cleaned, "[0-9]+"), function(x) max(as.integer(x)))

  particular_col_name                                                           min   max
  <chr>                                                                       <int> <int>
1 First Row String10. This is also a string_5, and so is this 20, exclude70%     5    20
2 Second_Row_50%, number40. Number 4. number_15                                  4    40
data:

df <- structure(list(particular_col_name = c(
  "First Row String10. This is also a string_5, and so is this 20, exclude70%",
  "Second_Row_50%, number40. Number 4. number_15"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))
Related
I have a data frame of, say, 20 columns. Column 1 is group, column 2 is weights (not normalized to 1 or 100), and columns 3 to 20 contain the data to be aggregated. There are some 250 rows but just 15 groups, so on average there are around 16-17 rows per group.
For each of columns 3 to 20, I need to get group-wise weighted mean, weights being column 2.
As such this is easy: multiply columns 3 to 20 by column 2 and then run

group_by(df, column1) %>%
  summarise_all(sum_na)
Here sum_na is the usual sum function with na.rm = TRUE.
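In code, that helper would presumably look something like this (a sketch based on that description):

# sum that ignores NAs, as described above
sum_na <- function(x) sum(x, na.rm = TRUE)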
And then divide columns 3 to 20 by column 2.
The problem is that there are NAs scattered across the data frame. Say, for example, the 150th row (belonging to group 5, say) has NA in column 12. While calculating the weighted mean for group 5 and column 12, the denominator should then exclude the weight in row 150 of column 2.
How to do this? Sorry for the long post. Unable to provide sample data as unfortunately stack overflow is inaccessible in office (posting from mobile).
Would something like this work?
library(dplyr)
df %>%
  group_by(group) %>%
  summarise_at(vars(col1:col18), ~ weighted.mean(., wt, na.rm = TRUE))
You can select a range of columns in vars. This removes NA values from the columns col1 to col18, using wt as the weight column.
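As a quick sanity check (a small sketch, not from the original post): weighted.mean() with na.rm = TRUE also drops the weights paired with NA values, which is exactly the denominator behaviour the question asks for:

weighted.mean(c(2, NA, 4), w = c(1, 2, 3), na.rm = TRUE)
# [1] 3.5   # = (2*1 + 4*3) / (1 + 3); the weight 2 is dropped along with the NA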
Tried it on this example:

df <- data.frame(group = rep(1:3, each = 3), wt = 1:9,
                 col1 = c(2:5, NA, 6:9), col2 = c(NA, 3:6, NA, 2:4))

df %>%
  group_by(group) %>%
  summarise_at(vars(col1:col2), ~ weighted.mean(., wt, na.rm = TRUE))
# group col1 col2
# <int> <dbl> <dbl>
#1 1 3.33 3.6
#2 2 5.6 5.56
#3 3 8.08 3.08
We can use data.table methods
library(data.table)
setDT(df)[, lapply(.SD, function(x) weighted.mean(x, wt, na.rm = TRUE)),
          by = group, .SDcols = col1:col18]
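Applied to the small example above (a quick sketch; .SDcols narrowed to the two example columns), this matches the dplyr output:

setDT(df)[, lapply(.SD, function(x) weighted.mean(x, wt, na.rm = TRUE)),
          by = group, .SDcols = col1:col2]
#    group     col1     col2
# 1:     1 3.333333 3.600000
# 2:     2 5.600000 5.555556
# 3:     3 8.083333 3.083333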
I have a problem where I'm trying to extract numbers from a string containing text and numbers and then create two new columns showing the Min and Max of the numbers.
For example, I have one column and a string of data like this:
Text
Section 12345.01 to section 12345.02
And I want to create two new columns from the data in the Text column, like this:
Min Max
12345.01 12345.02
I'm using dplyr and stringr with regex, but the regex only extracts the first occurrence of the pattern (the first number).
df %>% dplyr::mutate(SectionNum = stringr::str_extract(Text, "\\d+.\\d+"))
If I try to use the stringr::str_extract_all function, it seems to extract both occurrences of the pattern, but it creates a list column in the tibble, which I find is a real hassle. So I'm stuck on the first step, just trying to get the numbers out into their own columns.
Can anyone recommend the most efficient way to do this? Ideally I'd like to extract the numbers from the string, convert them to numeric with as.numeric, and then run the min() and max() functions.
With extract from tidyr: extract turns each regex capture group into its own column. convert = TRUE conveniently coerces the resulting columns to the most appropriate type. remove = FALSE could be used to keep the original column. The last mutate is optional; it makes sure that the first number extracted really is the minimum:
library(tidyr)
library(dplyr)
library(purrr)

df %>%
  extract(Text, c("Min", "Max"), "([\\d.]+)[^\\d.]+([\\d.]+)", convert = TRUE) %>%
  mutate(Min = pmap_dbl(., min),
         Max = pmap_dbl(., max))
Output:
Min Max
1 12345.02 12345.03
Data:
df <- structure(list(Text = structure(1L, .Label = "Section 12345.03 to section 12345.02", class = "factor")), class = "data.frame", row.names = c(NA,
-1L), .Names = "Text")
Using some other tidyverse tools, you can either unnest the list-column and use group_by and summarise semantics (the more dplyr way), or keep the list-column as-is and use map_dbl to extract the max and min from each row (a more purrr way). My benchmarks have map_dbl about 7 times faster than unnest plus dplyr, and about 15% faster than extract, though this was only measured on the one row.
library(tidyverse)

df <- tibble(
  Text = c("Section 12345.01 to section 12345.02")
)

df %>%
  mutate(SectionNum = str_extract_all(Text, "\\d+\\.\\d+")) %>%
  unnest() %>%
  group_by(Text) %>%
  summarise(min = min(as.numeric(SectionNum)), max = max(as.numeric(SectionNum)))
#> # A tibble: 1 x 3
#> Text min max
#> <chr> <dbl> <dbl>
#> 1 Section 12345.01 to section 12345.02 12345. 12345.
df %>%
  mutate(
    SectionNum = str_extract_all(Text, "\\d+\\.\\d+"),
    min = map_dbl(SectionNum, ~ min(as.numeric(.x))),
    max = map_dbl(SectionNum, ~ max(as.numeric(.x)))
  )
#> # A tibble: 1 x 4
#> Text SectionNum min max
#> <chr> <list> <dbl> <dbl>
#> 1 Section 12345.01 to section 12345.02 <chr [2]> 12345. 12345.
Created on 2018-09-24 by the reprex package (v0.2.0).
There have already been answers showing how to accomplish your final goal as asked in the question, but just to address how to find the first or second match using the stringr package: you can use the str_match function and pick out the specific match you are interested in by indexing into the matrix that str_match returns.
library(stringr)
Text <- "Section 12345.01 to section 12345.02"
str_match(Text, "^[^0-9.]*([0-9.]*)[^0-9.]*([0-9.]*)[^0-9.]*$")[2]
#> [1] "12345.01"
str_match(Text, "^[^0-9.]*([0-9.]*)[^0-9.]*([0-9.]*)[^0-9.]*$")[3]
#> [1] "12345.02"
Created on 2018-09-24 by the reprex package (v0.2.0).
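As an aside (a quick illustration): str_match() returns a matrix whose first column is the full match and whose remaining columns are the capture groups, which is why [2] and [3] pick out the two numbers:

str_match(Text, "^[^0-9.]*([0-9.]*)[^0-9.]*([0-9.]*)[^0-9.]*$")
#>      [,1]                                   [,2]       [,3]
#> [1,] "Section 12345.01 to section 12345.02" "12345.01" "12345.02"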
I have the feeling this was already asked several times, but I cannot make it work in my case. Don't know why.
I group_by my data frame and calculate a mean of the values per group. Additionally, I have marked a specific row in each group, and I want to calculate the ratio of that highlighted row's value to the freshly calculated group mean.
library(dplyr)

df <- data.frame(int = c(5:1, 4:1),
                 highlight = c(T, F, F, F, F, F, T, F, F),
                 exp = c('a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'))

df %>%
  group_by(exp) %>%
  summarise(mean = mean(int),
            l1 = nrow(.),
            ratio_mean = .[.$highlight, 'int'] / mean)
But for some reason, . is not the subset of group_by but the complete input. Am I missing something here?
My expected output would be
exp mean ratio_mean
<fct> <dbl> <dbl>
1 a 3 1.67
2 b 2.5 1.2
This works:

df %>%
  group_by(exp) %>%
  summarise(mean = mean(int),
            l1 = n(),
            ratio_mean = int[highlight] / mean)
But what's going wrong with your solution?
nrow(.) counts the number of rows of your whole input dataframe, whereas n() counts only the rows per group.
.[.$highlight, 'int']/mean again uses the whole input dataframe and subsets via the highlight column, but it gets divided by the correct group mean. You are actually returning two values here, as two rows of your original df have highlight = TRUE. This also causes a nasty NA column name.
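To see this (a quick illustration): `.` in the pipe refers to the whole input, so the subset returns both highlighted values regardless of the current group:

# both highlighted rows are returned, not just the current group's
df[df$highlight, 'int']
# [1] 5 3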
To make this work, we could use do() as suggested by @MikkoMarttila, but it gets a little bit clunky:
df %>%
  group_by(exp) %>%
  do(summarise(., mean = mean(.$int),
               l1 = nrow(.),
               ratio_mean = .$int[.$highlight] / mean))
Original output
df %>%
  group_by(exp) %>%
  summarise(mean = mean(int),
            l1 = nrow(.),
            ratio_mean = .[.$highlight, 'int'] / mean)
# A tibble: 2 x 4
# exp mean l1 ratio_mean$ NA
# <fct> <dbl> <int> <dbl> <dbl>
# 1 a 3 9 1.67 2
# 2 b 2.5 9 1 1.2
I want to collapse the following data frame, using both summation and weighted averages, according to groups.
I have the following data frame
group_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 = rnorm(10, .5, .4)
var_percent_2 = rnorm(10, .5, .4)
weighting = sample.int(50, 10)

df_to_collapse = data.frame(group_id, var_1, var_2, var_percent_1, var_percent_2, weighting)
I want to collapse my data according to the groups identified by group_id. However, in my data, I have variables in absolute levels (var_1, var_2) and in percentage terms (var_percent_1, var_percent_2).
I create two lists for each type of variable (my real data is much bigger, making this necessary). I also have a weighting variable (weighting).
to_be_weighted = df_to_collapse[, 4:5]
to_be_summed = df_to_collapse[, 2:3]

to_be_weighted_2 = colnames(to_be_weighted)
to_be_summed_2 = colnames(to_be_summed)
And my goal is to simultaneously collapse my data using either sum or weighted average, according to the type of variable (i.e. if it's in percentage terms, I use a weighted average).
Here is my best attempt:
df_to_collapse %>%
  group_by(group_id) %>%
  summarise_at(.vars = c(to_be_summed_2, to_be_weighted_2), .funs = c(sum, mean))
But, as you can see, it is not a weighted average.
I have tried many different ways of using the weighted.mean function, but have had no luck. Here is an example of one such attempt:
df_to_collapse %>%
  group_by(group_id) %>%
  summarise_at(.vars = c(to_be_weighted_2, to_be_summed_2),
               .funs = c(weighted.mean(to_be_weighted_2, weighting), sum))
And the corresponding error:
Error in weighted.mean.default(to_be_weighted_2, weighting) :
'x' and 'w' must have the same length
Here's a way to do it: reshape into long data, add a dummy variable called type indicating whether each variable is a percentage (optional, but handy), apply a function in summarise based on whether it's a percentage, then spread back to wide shape. If you can change the column names, you could come up with a more elegant way of creating the type column, but that's really just for convenience.
The trick for me was the type[1] == "percent"; I had to use [1] because everything in each group has the same type, but otherwise == operates over every value in the vector and gives multiple logical values when you really just need one.
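A tiny illustration of that difference (a quick sketch):

type <- c("percent", "percent", "percent")
type == "percent"    # one logical per element
#> [1] TRUE TRUE TRUE
type[1] == "percent" # a single logical, so summarise() gets one value per group
#> [1] TRUE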
library(tidyverse)
set.seed(1234)

group_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 = rnorm(10, .5, .4)
var_percent_2 = rnorm(10, .5, .4)
weighting = sample.int(50, 10)
df_to_collapse <- data.frame(group_id, var_1, var_2, var_percent_1, var_percent_2, weighting)

df_to_collapse %>%
  gather(key = var, value = value, -group_id, -weighting) %>%
  mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
  group_by(group_id, var) %>%
  summarise(sum_or_avg = ifelse(type[1] == "percent", weighted.mean(value, weighting), sum(value))) %>%
  ungroup() %>%
  spread(key = var, value = sum_or_avg)
#> # A tibble: 3 x 5
#> group_id var_1 var_2 var_percent_1 var_percent_2
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 26 31 0.269 0.483
#> 2 2 32 21 0.854 0.261
#> 3 3 29 49 0.461 0.262
Created on 2018-05-04 by the reprex package (v0.2.0).
In R, I have a large list of large dataframes, each consisting of two columns, value and count. The function I am using in the previous step returns the value of each observation in value; the corresponding count column shows how many times this specific value has been observed. The following code produces one example dataframe; however, the dataframes in the list each have different values and value ranges:
d <- as.data.frame(
  cbind(
    value = runif(n = 1856, min = 921, max = 4187),
    count = runif(n = 1856, min = 0, max = 20000)
  )
)
Now I would like to aggregate the data to be able to create viewable visualizations. This aggregation should be applied to all dataframes in the list, each of which has a different value range. I am looking for a function that cuts the data into new values and counts, a little like a histogram function: for example, for all data with a value from 0 to 100, the counts should be summed (and so on, over defined intervals, with a clean interval border starting point like 0).
My first try was to create a simple value vector in which each value is repeated the number of times given by the count field. The next step would have been to apply the hist() function without plotting, to obtain the aggregated values and counts as defined by hist()'s arguments. However, this produces vectors that are too large (several GB each) for R to handle. I appreciate any solutions or hints!
I am not entirely sure I understand your question correctly, but this might solve your problem or at least point you in a direction. I make a list of data frames and then generate a new column containing the result of applying binfunction to each dataframe, using map from the purrr package.
library(tidyverse)

d1 <- d2 <- tibble(
  value = runif(n = 1856, min = 921, max = 4187),
  count = runif(n = 1856, min = 0, max = 20000)
)

d <- tibble(name = c('d1', 'd2'), data = list(d1, d2))

binfunction <- function(data) {
  data %>%
    mutate(bin = value - (value %% 100)) %>%  # floor each value to its 100-wide bin
    group_by(bin) %>%
    mutate(sum = sum(count)) %>%              # total count per bin
    select(bin, sum)
}

d_binned <- d %>%
  mutate(binned = map(data, binfunction)) %>%
  select(-data) %>%
  unnest() %>%
  group_by(name, bin) %>%
  slice(1L)                                   # keep one row per bin
d_binned
#> Source: local data frame [66 x 3]
#> Groups: name, bin [66]
#>
#> # A tibble: 66 x 3
#> name bin sum
#> <chr> <dbl> <dbl>
#> 1 d1 900 495123.8
#> 2 d1 1000 683108.6
#> 3 d1 1100 546524.4
#> 4 d1 1200 447077.5
#> 5 d1 1300 604759.2
#> 6 d1 1400 506225.4
#> 7 d1 1500 499666.5
#> 8 d1 1600 541305.9
#> 9 d1 1700 514080.9
#> 10 d1 1800 586892.9
#> # ... with 56 more rows
d_binned %>%
  ggplot(aes(x = bin, y = sum, fill = name)) +
  geom_col() +
  facet_wrap(~name)
See this comment for my inspiration for the binning. It bins the data into groups of 100, so e.g. bin 1100 represents values from 1100 to just below 1200, etc. I imagine you can adapt binfunction to your needs.
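As a quick check of the binning arithmetic (a small sketch):

1234.56 - (1234.56 %% 100) # everything from 1200 up to (but not including) 1300 maps to bin 1200
#> [1] 1200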