I have a dataset that lists states with their respective cities. Some of these places have been aggregated (not by me) and are labelled "Other ([count of places])", e.g. "Other (99)". Each place has a numeric 'count' value. I'd like to 1.) find the average count per place and 2.) duplicate each 'Other...' row, together with that average, as many times as the number in the parentheses. Example below:
set.seed(5)
df <- data.frame(state = c('A','B'), city = c('Other (3)','Other (2)'), count = c('250','50'))
Output:
state city      count
A     Other (3) 83.333
A     Other (3) 83.333
A     Other (3) 83.333
B     Other (2) 25.000
B     Other (2) 25.000
So far I've only been able to figure out how to pull the numbers from the parenthesis and create an average:
# count is stored as character, so convert it before dividing
average = as.numeric(df$count)/as.numeric(gsub(".*\\((.*)\\).*", "\\1", df$city))
An option with uncount: extract the numeric part of 'city' with parse_number, divide 'count' by 'n', and replicate each row 'n' times with uncount.
library(dplyr)
library(tidyr)
df %>%
mutate(n = readr::parse_number(city), count = as.numeric(count)/n) %>%
uncount(n)
-output
state city count
1 A Other (3) 83.33333
2 A Other (3) 83.33333
3 A Other (3) 83.33333
4 B Other (2) 25.00000
5 B Other (2) 25.00000
You could extend your example with the following code:
set.seed(5)
df <- data.frame(state = c('A','B'), city = c('Other (3)','Other (2)'), count = c('250','50'))
times <- as.numeric(gsub(".*\\((.*)\\).*", "\\1", df$city))
df$count <- as.numeric(df$count)/times
output <- df[rep(seq_along(times),times),]
The key addition is the line creating output, which uses row indexing on the input dataframe to repeat each row as required.
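If the real data also contains ordinary (non-aggregated) places, the regex above would not return a number for them. Here is a minimal base R sketch under that assumption: the "Springfield" row and the names is_other and expanded are made up for illustration, and rows that are not "Other (n)" are simply kept once.

```r
# Sample data; "Springfield" is a hypothetical non-aggregated place
df <- data.frame(state = c("A", "B", "B"),
                 city = c("Other (3)", "Other (2)", "Springfield"),
                 count = c("250", "50", "40"),
                 stringsAsFactors = FALSE)

# Which rows are aggregated "Other (n)" entries?
is_other <- grepl("^Other \\([0-9]+\\)$", df$city)

# Repeat count per row: n for "Other (n)", 1 for everything else
times <- rep(1L, nrow(df))
times[is_other] <- as.integer(sub("^Other \\(([0-9]+)\\)$", "\\1", df$city[is_other]))

# Average per place, then repeat each row 'times' times
df$count <- as.numeric(df$count) / times
expanded <- df[rep(seq_len(nrow(df)), times), ]
rownames(expanded) <- NULL
```

Anchoring the pattern with ^ and $ keeps names that merely contain parentheses from being treated as aggregates.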
I have a data.frame "DF" of 2020 observations and 79066 variables.
The first column is the "Year" spanning continuously from 1 to 2020, the other variables are the values.
In the first instance, I did an average by row in order to have one mean value per year.
E.g.
Aver <- apply(DF[,2:79066], 1, mean, na.rm=TRUE)
However, I would like to do a weighted average and the weight values differ based on columns string values.
The header consists of "Year" (first column) followed by 79065 columns. Each column name is built from a number from 50 to 300, followed by ".R" with an index from 1 to 15, followed by a number from 10 to 30 and the suffix ".yr". This gives 251 (50-300) x 15 (R1-R15) x 21 (10-30) = 79065 columns.
E.g. : "Year", "50.R1.10.yr", "50.R1.11.yr", "50.R1.12.yr", ... "50.R1.30.yr", "51.R1.10.yr", "51.R1.11.yr", "51.R1.12.yr", ... "51.R1.30.yr", ..."300.R1.10.yr", "300.R1.11.yr", "300.R1.12.yr", ... "300.R1.30.yr", "50.R2.10.yr", "50.R2.11.yr", "50.R2.12.yr", ... "50.R2.30.yr", "51.R2.10.yr", "51.R2.11.yr", "51.R2.12.yr", ... "51.R2.30.yr", ..."300.R2.10.yr", "300.R2.11.yr", "300.R2.12.yr", ... "300.R2.30.yr", ... "50.R15.10.yr", "50.R15.11.yr", "50.R15.12.yr", ... "300.R15.30.yr".
The weight I would like to assign to each column is based on the string values 50 to 300. I would like to give more weight to values on the column "50." and following a power function, less weight to "300.".
The equation fitting my values is a power function: y = 2305.2*x^-1.019.
E.g.
av.classes <- data.frame(av=seq(50, 300, 1))
library(dplyr)
av.classes.weight <- av.classes %>% mutate(weight = 2305.2*av^-1.019)
Thank you for any help.
I guess you could get your weight vector like this:
library(tidyverse)
weights_precursor <- str_split(names(DF)[-1], pattern = "\\.", n = 2, simplify = TRUE)[, 1] %>%
as.numeric()
weights <- 2305.2 * weights_precursor ^ -1.019
Setting up some sample data:
DF <- data.frame(year=2020,`50.R1.10.yr`=1,`300.R15.30.yr`=10)
names(DF) <- stringr::str_remove(names(DF),"X")
Getting numerical vector:
weights <- stringr::str_split(names(DF),"\\.")
weights <- sapply(1:length(weights),function(x) weights[[x]][1])[-1]
as.numeric(weights)
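Once the weights are available, the row-wise weighted average can probably be computed by swapping mean for weighted.mean in the original apply call. A small sketch on made-up data (two years, two value columns; the names av, w and Aver_w are illustrative only):

```r
# Tiny stand-in for the real DF; check.names = FALSE keeps the numeric-leading names
DF <- data.frame(Year = 1:2,
                 `50.R1.10.yr` = c(1, 2),
                 `300.R1.10.yr` = c(10, NA),
                 check.names = FALSE)

# Leading number of each column name (everything before the first dot)
av <- as.numeric(sub("\\..*$", "", names(DF)[-1]))

# Power-function weights from the question
w <- 2305.2 * av^-1.019

# Weighted mean per row; na.rm = TRUE drops missing values and their weights
Aver_w <- apply(DF[, -1], 1, weighted.mean, w = w, na.rm = TRUE)
```

weighted.mean normalises the weights itself, so they do not need to sum to 1.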
I am calculating the dissimilarity index of several groups compared to the total population with the function "seg" from the identically named package.
The data consists of about 450 rows, each a different district, and around 20 columns (groups that may be segregated). The values are the number of people from the respective group living in the respective district. Here are the first few rows of my csv file:
Region,Germany,EU15 without Germany,Poland,Former Yugoslavia and successor countries,Former Soviet Union and successor countries,Turkey,Arabic states,West Afrika,Central Afrika,East Afrika,North America,Central America and the Carribean,South America,East and Central Asia,South and Southeast Asia - excluding Vietnam,Australia and Oceania,EU,Vietnam,Non EU Europe,Total Population
1011101,1370,372,108,35,345,91,256,18,6,3,73,36,68,272,98,3,1979,19,437,3445
1011102,117,21,6,0,0,0,6,0,0,0,7,0,6,0,7,0,156,0,3,188
1011103,2180,482,181,102,385,326,358,48,12,12,73,24,75,175,129,12,3152,34,795,5159
Since the seg function only works with two columns as input, my current code to create a table with the index for all groups looks like this:
DI_table <- as.data.frame(0)
DI_table[1,1] <- print (seg(data =dfplrcountrygroups2019[, c( "Germany", "Total.Population")]))
DI_table[1,2] <- print (seg(data =dfplrcountrygroups2019[, c( colnames(dfplrcountrygroups2019)[3], "Total.Population")]))
DI_table[1,3] <- print (seg(data =dfplrcountrygroups2019[, c( colnames(dfplrcountrygroups2019)[4], "Total.Population")]))
DI_table[1,4] <- print (seg(data =dfplrcountrygroups2019[, c( colnames(dfplrcountrygroups2019)[5], "Total.Population")]))
# and so on...
colnames(DI_table)<- (colnames(dfplrcountrygroups2019[2:20]))
Works well, but a hassle to recode every time I change something with my data and I would like to use this method for other datasets too.
I thought I might try something like below but the seg function did not consider it a selection of two columns.
for (i in colnames(dfplrcountrygroups2019)) {
di_matrix [i] <- seg(data =dfplrcountrygroups2019[, c( "i", "Total.Population")])
}
Error in [.data.frame(dfplrcountrygroups2019, , c("i",
"Total.Population")) : undefined columns selected
I also thought of the apply function but not sure how to make it work so it repeats itself while just changing the column where "Germany" is in the example. How do I make the selection of columns change for each time I repeat the seg function?
my_function <- seg(data =dfplrcountrygroups2019[, c("Germany", "Total.Population")])
apply(X = dfplrcountrygroups2019,
FUN = my_function,
MARGIN = 2
)
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'my_function' of mode 'function' was not found
The seg package's functions such as dissim (seg::seg is being deprecated in its favor) have a specific expected data format. From the docs:
data - a numeric matrix or data frame with two columns that represent mutually exclusive population groups (e.g., Asians and non-Asians). If more than two columns are given, only the first two will be used for computing the index.
To get a data frame of the d values seg::dissim returns, where each column is a region's dissimilarity index, you can iterate over the columns, making a temporary data frame and calculating the index. Because the data you're starting with isn't made up of mutually-exclusive categories, you'll have to subtract each population from the total population column to get a not-X counterpart for each group X.
A base R option with sapply will return a named list, which you can then convert into a data frame.
di_table <- sapply(names(dat)[2:20], function(col) {
tmp_df <- dat[col]
tmp_df$other <- dat$Total.Population - dat[[col]]  # [[ ]] extracts a plain vector
seg::dissim(data = tmp_df)$d
}, simplify = FALSE)
as.data.frame(di_table)
#> Germany EU15.without.Germany Poland
#> 1 0.03127565 0.03989693 0.02770549
#> Former.Yugoslavia.and.successor.countries
#> 1 0.160239
#> Former.Soviet.Union.and.successor.countries Turkey Arabic.states West.Afrika
#> 1 0.08808277 0.2047 0.02266828 0.1415519
#> Central.Afrika East.Afrika North.America Central.America.and.the.Carribean
#> 1 0.08004711 0.213581 0.1116014 0.2095969
#> South.America East.and.Central.Asia
#> 1 0.08486598 0.2282734
#> South.and.Southeast.Asia...excluding.Vietnam Australia.and.Oceania EU
#> 1 0.0364721 0.213581 0.04394527
#> Vietnam Non.EU.Europe
#> 1 0.05505789 0.06624686
A couple tidyverse options: you can use purrr functions to do something like above in one step.
dat[2:20] %>%
purrr::map(~data.frame(value = ., other = dat$Total.Population - .)) %>%
purrr::map_dfc(~seg::dissim(data = .)$d)
# same output
Or reshape the data to long format and split by group. This takes more steps, but might fit a larger workflow better.
library(dplyr)
dat %>%
tidyr::pivot_longer(c(-Region, -Total.Population)) %>%
mutate(other = Total.Population - value) %>%
split(.$name) %>%
purrr::map_dfc(~seg::dissim(data = .[c("value", "other")])$d)
# same output
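For reference, the loop in the question failed because "i" in quotes is the literal column name "i", not the loop variable. The sketch below shows the same iteration with the variable used unquoted. To keep it runnable without the seg package, it uses a hand-written dissim_index helper (a hypothetical name) that implements the textbook dissimilarity formula 0.5 * sum(|a_i/A - b_i/B|); I'd expect it to correspond to the d value from seg::dissim, but treat that as an assumption. The data is a subset of the sample rows from the question.

```r
# Three districts and three groups taken from the sample CSV rows
dat <- data.frame(
  Region = c(1011101, 1011102, 1011103),
  Germany = c(1370, 117, 2180),
  Poland = c(108, 6, 181),
  Turkey = c(91, 0, 326),
  Total.Population = c(3445, 188, 5159)
)

# Hypothetical stand-in for seg::dissim(...)$d: index of dissimilarity
dissim_index <- function(group, other) {
  0.5 * sum(abs(group / sum(group) - other / sum(other)))
}

# Every column except the id and the total is a group
groups <- setdiff(names(dat), c("Region", "Total.Population"))

# 'col' holds the column name, so it is used without quotes
di <- vapply(groups, function(col) {
  dissim_index(dat[[col]], dat$Total.Population - dat[[col]])
}, numeric(1))
```

Selecting the groups with setdiff instead of fixed positions such as 2:20 means the code should not need editing when columns are added or removed.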
Consider the following dataframe slice:
df = data.frame(locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
row.names = c("a091", "b231", "a234", "d154"))
df
locations score
a091 argentina 1
b231 brazil 2
a234 argentina 3
d154 denmark 4
sorted = c("a234","d154","a091") #in my real task these strings are provided from an exogenous function
df2 = df[sorted,] #quick and simple subset using rownames
EDIT: Here I'm trying to subset AND order the data according to sorted - sorry that was not clear before. So the output, importantly, is:
locations score
a234 argentina 3
d154 denmark 4
a091 argentina 1
And not as you would get from a simple subset operation:
locations score
a091 argentina 1
a234 argentina 3
d154 denmark 4
I'd like to do the exactly same thing in dplyr. Here is an inelegant hack:
require(dplyr)
dt = as_tibble(df)
rownames(dt) = rownames(df)
Warning message:
Setting row names on a tibble is deprecated.
dt2 = dt[sorted,]
I'd like to do it properly, where the rownames are an index in the data table:
dt_proper = as_tibble(x = df,rownames = "index")
dt_proper2 = dt_proper %>% ?some_function(index, sorted)? #what would this be?
dt_proper2
# A tibble: 3 x 3
index locations score
<chr> <fct> <int>
1 a234 argentina 3
2 d154 denmark 4
3 a091 argentina 1
But I can't for the life of me figure out how to do this using filter or some other dplyr function, and without some convoluted conversion to factor, re-order factor levels, etc.
Hi,
you can use mutate to copy the row names of your data frame into an index column, filter to the ids in the vector "sorted", and then order the rows to match "sorted":
df2 <- df %>% mutate(index=row.names(.)) %>% filter(index %in% sorted)
df2 <- df2[order(match(df2$index, sorted)), ]  # trailing comma: reorder rows, not columns
I think I've figured it out:
dt_proper2 = dt_proper[match(sorted,dt_proper$index),]
Seems to be the shortest implementation of what df[sorted,] will do.
Functions in the tidyverse (dplyr, tibble, etc.) are built around the concept (as far as I know) that rows only contain attributes (columns) and no row names / labels / indexes. So in order to sort rows, you have to introduce a new column containing the rank of each row.
The way I would do it is to create another tibble containing your "sorting information" (sorting attribute, rank) and inner join it to your original tibble. Then I could order the rows by rank.
library(tidyverse)
# note that I've changed the third column's name to avoid confusion
df = tibble(
locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
custom_id = c("a091", "b231", "a234", "d154")
)
sorted_ids = c("a234","d154","a091")
sorting_info = tibble(
custom_id = sorted_ids,
rank = 1:length(sorted_ids)
)
ordered_ids = df %>%
inner_join(sorting_info) %>%
arrange(rank) %>%
select(-rank)
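One thing to watch with the match approach: if the externally supplied vector contains an id that is not in the data, match returns NA for it and the subset gets an all-NA row. A small base R sketch of guarding against that (the id "zzz9" is made up for illustration; in dplyr the same positions could be passed to slice):

```r
# Data with the former row names stored in an ordinary 'index' column
df <- data.frame(index = c("a091", "b231", "a234", "d154"),
                 locations = c("argentina", "brazil", "argentina", "denmark"),
                 score = 1:4,
                 stringsAsFactors = FALSE)

# Desired order; "zzz9" is a hypothetical id that does not exist in df
sorted <- c("a234", "d154", "a091", "zzz9")

# Position of each requested id in df, NA where the id is unknown
pos <- match(sorted, df$index)

# Drop unknown ids, then subset and order in one step
pos <- pos[!is.na(pos)]
out <- df[pos, ]
```

The inner_join approach above drops unknown ids automatically, which is one practical difference between the two.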
I have a data similar to this.
B <- data.frame(State = c(rep("Arizona", 8), rep("California", 8), rep("Texas", 8)),
Account = rep(c("Balance", "Balance", "In the Bimester", "In the Bimester", "Expenses",
"Expenses", "In the Bimester", "In the Bimester"), 3), Value = runif(24))
You can see that Account has 4 occurrences of the element "In the Bimester", two "chunks" of two elements for each state, "Expenses" in between them.
The order here matters because the first chunk is not referring to the same thing as the second chunk.
My data is actually more complex. It has a 4th variable, indicating what each row of Account means. The number of its elements for each Account element (factor per se) can change. For example, in some state, the first "chunk" of "In the Bimester" can have 6 rows and the second, 7; but I cannot differentiate by this 4th variable.
Desired: I'd like to subset my data, splitting those two "In the Bimester" chunks by state, and keep only the rows of the first "chunk" (or only the second "chunk") for each state.
I have a solution using the data.table package, but I find it rather clumsy. Any thoughts?
library(data.table)
B <- as.data.table(B)
B <- B[, .(Account, Value, index = 1:.N), by = .(State)]
x <- B[Account == "Expenses", .(min_ind = min(index)), by = .(State)]
B <- merge(B, x, by = "State")
B <- B[index < min_ind & Account == "In the Bimester", .(Value), by = .(State)]
You can use dplyr package:
library(dplyr)
B %>% mutate(helper = data.table::rleid(Account)) %>%
filter(Account == "In the Bimester") %>%
group_by(State) %>% filter(helper == min(helper)) %>% select(-helper)
# # A tibble: 6 x 3
# # Groups: State [3]
# State Account Value
# <fctr> <fctr> <dbl>
# 1 Arizona In the Bimester 0.17730148
# 2 Arizona In the Bimester 0.05695585
# 3 California In the Bimester 0.29089678
# 4 California In the Bimester 0.86952723
# 5 Texas In the Bimester 0.54076144
# 6 Texas In the Bimester 0.59168138
If instead of min you use max you'll get the last occurrences of "In the Bimester" for each State. You can also exclude Account column by changing the last pipe to select(-helper,-Account).
p.s. If you don't want to use rleid from data.table and just use dplyr functions take a look at this thread.
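For what it's worth, a run id can also be built in base R by counting how often the value changes from one row to the next. This is only a sketch: run_id is a made-up helper name, and Value is replaced by 1:24 so the result is easy to check.

```r
# Same layout as the question, with Value = 1:24 instead of random numbers
B <- data.frame(State = rep(c("Arizona", "California", "Texas"), each = 8),
                Account = rep(c("Balance", "Balance", "In the Bimester", "In the Bimester",
                                "Expenses", "Expenses", "In the Bimester", "In the Bimester"), 3),
                Value = seq_len(24),
                stringsAsFactors = FALSE)

# Hypothetical helper: increases by 1 every time the value changes
run_id <- function(x) cumsum(c(TRUE, x[-1] != x[-length(x)]))

# Run id computed separately within each state
B$run <- ave(seq_len(nrow(B)), B$State, FUN = function(i) run_id(B$Account[i]))

# Keep "In the Bimester" rows that belong to the first such run per state
bim <- B[B$Account == "In the Bimester", ]
first_chunk <- bim[bim$run == ave(bim$run, bim$State, FUN = min), ]
```

Replacing min with max in the last line should give the last chunk per state instead.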
I'm at the last stage of cleaning/organizing data and would appreciate suggestions for this step. I'm new to R and don't understand fully how dataframes or other data types work. (I'm trying to learn but have a project due so need a quick solution). I've imported the data from a CSV file.
I want to group instances with the same (date, ID1, ID2, ID3). I want the average of all stats in the output and also a new column with the number of instances grouped.
Note: ID3 contains <NA> values. I'd like to rename these to "na" before grouping
I've tried solutions
tdata$ID3[is.na(tdata$ID3)] <- "NA"
tdata[["ID3"]][is.na(tdata[["ID3"]])] <- "NA"
But get Error:
In `[<-.factor`(`*tmp*`, is.na(tdata[["ID3"]]), value = c(3L, 3L, :
invalid factor level, NA generated
The data is:
date ID1 ID2 ID3 stat1 stat2 stat.3
1 12-03-07 abc123 wxy456 pqr123 10 20 30
2 12-03-07 abc123 wxy456 pqr123 20 40 60
3 10-04-07 bcd456 wxy456 hgf356 10 20 40
4 12-03-07 abc123 wxy456 pqr123 30 60 90
5 5-09-07 spa234 int345 <NA> 40 50 70
Desired Output
date ID1, ID2, ID3, n, stat1, stat2, stat 3
12-03-07 abc123, wxy456, pqr457, 3, 20, 40, 60
10-04-07 bcd456, wxy456, hgf356, 1, 10, 20, 40
05-09-07 spa234, int345, big234, 1 , 40, 50, 70
I tried this solution: How to merge multiple data.frames and sum and average columns at the same time in R
But I was not successful merging the columns which have to be grouped and tested for similarity.
DF <- merge(tdata$date, tdata$ID1, tdata$ID2, tdata$ID3, by = "Name", all = T)
Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns
Finally, to generate the n column: perhaps insert a column of 1s and sum that column while summarizing?
We can do this with dplyr. After grouping by the 'ID' columns, add 'date' and 'n' also in the grouping variables, and get the mean of 'stat' columns
library(dplyr)
df1 %>%
group_by(ID1, ID2, ID3) %>%
group_by(date = first(date), n = n(), .add = TRUE) %>%  # dplyr >= 1.0.0; older versions used add = TRUE
summarise_at(vars(matches("stat")), mean)
NOTE: Regarding changing the 'NA' to 'big234', we can convert 'ID3' to character class and replace the value before doing the above operation
df1$ID3 <- as.character(df1$ID3)
df1$ID3[is.na(df1$ID3)] <- "big234"
While I find the dplyr solution proposed by akrun very intuitive to use, there is also a nice data.table solution:
Like akrun, I assume that the NA value has been converted to "big234" to get the desired result.
library(data.table)
# convert data.frame to data.table
data <- data.table(df1)
# return the desired output
data[, c(.N, lapply(.SD, mean)),
by = list(date, ID1,ID2, ID3)]
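The "column of 1s" idea from the question also works in base R, if neither package is available. A rough sketch using aggregate on the sample rows (here the missing ID3 is replaced with "na" as the question first asked; the names keys, means, counts and result are illustrative):

```r
# Sample rows from the question; ID3 is a factor with one missing value
tdata <- data.frame(
  date = c("12-03-07", "12-03-07", "10-04-07", "12-03-07", "5-09-07"),
  ID1 = c("abc123", "abc123", "bcd456", "abc123", "spa234"),
  ID2 = c("wxy456", "wxy456", "wxy456", "wxy456", "int345"),
  ID3 = factor(c("pqr123", "pqr123", "hgf356", "pqr123", NA)),
  stat1 = c(10, 20, 10, 30, 40),
  stat2 = c(20, 40, 20, 60, 50),
  stat.3 = c(30, 60, 40, 90, 70)
)

# Convert the factor to character first; assigning a new label to a factor
# is what triggered the "invalid factor level" warning
tdata$ID3 <- as.character(tdata$ID3)
tdata$ID3[is.na(tdata$ID3)] <- "na"

# Column of 1s, summed per group to give n
tdata$n <- 1
keys <- c("date", "ID1", "ID2", "ID3")
means <- aggregate(tdata[c("stat1", "stat2", "stat.3")], by = tdata[keys], FUN = mean)
counts <- aggregate(tdata["n"], by = tdata[keys], FUN = sum)
result <- merge(counts, means, by = keys)
```

Replacing the NA before grouping matters here because aggregate drops groups whose key is NA.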