I am trying to create a new data frame with 2 columns, var1 and var2, each of which is the row sum of specific columns in the data frame sampData.
library(dplyr)
sampData <-
rnorm(260) %>%
matrix(ncol = 26) %>%
data.frame() %>%
setNames(LETTERS)
var1 <- c("A", "B", "C")
var2 <- c("D", "E", "F", "G")
I know that I can select columns using [] and c(), like this:
sampData[ ,c("A","B")]
but when I try to generate and use that format from my vectors like this:
d1_ <-paste(var1, collapse=",")
d2_ <-paste(var2, collapse=",")
sampData[ ,d1_]
I get this error:
Error in `[.data.frame`(sampData, , d1_) : undefined columns selected
Which I also get if I try to calculate the rowSums -- which is what I am interested in getting.
data.frame(var1 = rowSums(sampData[ , d1_])
, var2 = rowSums(sampData[ , d2_]))
I think I have managed to figure out what you are asking, but if I am wrong, let me know.
You are trying to select the columns of sampData whose names match the values in var1 and var2, and then sum across the rows, limited to the columns that matched each vector.
It is always better to provide reproducible data. Here is some for this case (using dplyr to build it):
sampData <-
rnorm(260) %>%
matrix(ncol = 26) %>%
data.frame() %>%
setNames(LETTERS)
var1 <- c("A", "B", "C")
var2 <- c("D", "E", "F", "G")
Then, you don't need to concatenate the column names at all; just pass the vector directly when indexing. Here, I have made the IDs letters, so they are matched against the column names. However, if your IDs are numeric, they will be matched by position (e.g., 3 will return the third column).
data.frame(
var1sums = rowSums(sampData[, var1])
, var2sums = rowSums(sampData[, var2])
)
Of note, cat returns NULL after printing to the screen. If you need to concatenate values, you will need to use paste (or similar), but that will not work for what you are trying to do here.
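To make the point about paste concrete, here is a minimal sketch (base R only, with a seed added so it is repeatable) showing that collapsing the names produces a single string, which `[` then looks up as one column name that does not exist:

```r
# Rebuild the sample data with base R; the seed is only for repeatability
set.seed(1)
sampData <- setNames(data.frame(matrix(rnorm(260), ncol = 26)), LETTERS)
var1 <- c("A", "B", "C")

# paste(..., collapse = ",") glues the names into ONE string
d1_ <- paste(var1, collapse = ",")
length(var1)                 # 3 separate column names
length(d1_)                  # 1 string: "A,B,C"

# No column is literally called "A,B,C", hence "undefined columns selected"
d1_ %in% names(sampData)     # FALSE
all(var1 %in% names(sampData))  # TRUE

# Indexing with the original vector works
var1sums <- rowSums(sampData[, var1])
```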
This question got me thinking about flexibility of such solutions, so here is an attempt using dplyr and tidyr, which yields effectively the same result. The difference is that this may provide more flexibility for variable selection or even downstream processing.
sampData %>%
# add column for individual
mutate(ind = 1:nrow(.)) %>%
# convert data to long format
gather("Variable", "Value", -ind) %>%
# Set to group by the individual we added above
group_by(ind) %>%
# Calculate sums as desired
summarise(
var1sums = sum(Value[Variable %in% var1])
, var2sums = sum(Value[Variable %in% var2])
)
However, the real advantage would come if you had an arbitrary number (or just a large number generally) of sets of variables that you wanted to get the individual sums from. Instead of manually constructing every column you might be interested in, you can use standard evaluation (as opposed to non-standard) to automatically generate the columns based on a named list of vectors:
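# named list of column sets; the names become the output column names
varList <- list(var1sums = var1, var2sums = var2)
sampData %>%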
mutate(ind = 1:nrow(.)) %>%
gather("Variable", "Value", -ind) %>%
group_by(ind) %>%
# Calculate one column for each vector in `varList`
summarise_(
.dots = lapply(varList, function(x){
paste0("sum(Value[Variable %in% c('"
, paste(x, collapse = "', '")
, "')])")
})
)
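As a side note, summarise_() has since been deprecated in dplyr. If the goal is only one row-sum column per set of variables, a base R sketch along the following lines may be enough. It assumes the same kind of named list as above (the name `varList` and its element names are illustrative) and avoids building expressions from strings:

```r
# Sample data as in the question, built with base R (seed added for repeatability)
set.seed(42)
sampData <- setNames(data.frame(matrix(rnorm(260), ncol = 26)), LETTERS)

# Hypothetical named list of column sets; names become output column names
varList <- list(
  var1sums = c("A", "B", "C"),
  var2sums = c("D", "E", "F", "G")
)

# One rowSums() call per set; drop = FALSE keeps single-column sets working
sums <- as.data.frame(
  lapply(varList, function(cols) rowSums(sampData[, cols, drop = FALSE]))
)
head(sums)
```

Adding another set of variables then only requires adding another element to the list.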
It's hard to describe what I mean, but I have the following data frame:
A 1013574 1014475
A 1014005 1014475
A 1014005 1014435
I want to merge these rows into A 1013574 1014475. Is there a function that can help me achieve this?
My desired output is to have 1 row for each ID (in my case the value "A"), where the second column contains the smallest value and the third column the highest value for each ID.
This is an updated answer. I think that this is what you want. I added additional rows, so you can see how it works with multiple data.
library(dplyr)
df <- tibble(a = c("A", "A", "A","B", "B", "B" ),
v1 = as.numeric(c(1013574,1014005,1014005, 1014005, 1014305, 1044005)),
v2 = as.numeric(c(1014475, 1014475,1014435, 1014435, 1014435, 1314435)))
df_new <-df %>% group_by(a) %>% mutate(v1 = min(v1),
v2 = max(v2)) %>%
distinct()
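A variation on the same idea, as a sketch: summarise() collapses each group to one row directly, so the mutate() plus distinct() pair is not needed. Unlike mutate(), it keeps only the grouping column and the summarised columns.

```r
library(dplyr)

df <- tibble(a  = c("A", "A", "A", "B", "B", "B"),
             v1 = c(1013574, 1014005, 1014005, 1014005, 1014305, 1044005),
             v2 = c(1014475, 1014475, 1014435, 1014435, 1014435, 1314435))

# One row per ID: smallest v1 and largest v2
df_new <- df %>%
  group_by(a) %>%
  summarise(v1 = min(v1), v2 = max(v2), .groups = "drop")

df_new
# A: v1 = 1013574, v2 = 1014475
# B: v1 = 1014005, v2 = 1314435
```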
Say we have a data frame,
library(tidyverse)
library(rlang)
df <- tibble(id = rep(c(1:2), 10),
grade = sample(c("A", "B", "C"), 20, replace = TRUE))
We would like to get the proportion of each grade, grouped by id:
df %>%
group_by(id) %>%
summarise(
n = n(),
mu_A = mean(grade == "A"),
mu_B = mean(grade == "B"),
mu_C = mean(grade == "C")
)
I am handling a case where there are multiple conditions (many grades in this case) and would like to make my code more robust. How can we simplify this using tidy evaluation in dplyr 1.0?
I am talking about the idea of generating multiple column names by passing all grades at once, without breaking the flow of piping in dplyr, something like
# how to get the mean of A, B, C all at once?
mu_{grade} := mean(grade == {grade})
I actually found the answer to my own question from a post that I wrote 2 years ago...
I am just going to post the code right below hoping to help anybody that comes across the same problem.
make_expr <- function(x) {
x %>%
map( ~ parse_expr(str_glue("mean(grade == '{.x}')")))
}
# generate multiple expressions
grades <- c("A", "B", "C")
exprs <- grades %>% make_expr() %>% set_names(paste0("mu_", grades))
# we can 'top up' something extra by adding named element
exprs <- c(n = parse_expr("n()"), exprs)
# using the big bang operator `!!!` to force expressions in data frame
df %>% group_by(id) %>% summarise(!!!exprs)
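A possible variation, sketched here rather than taken from the original post: the same list of expressions can be built with rlang's expr() and `!!` instead of parsing glued strings, which sidesteps quoting problems if a grade label ever contains a quote character. The seed is added only for repeatability.

```r
library(dplyr)
library(rlang)

set.seed(1)
df <- tibble(id = rep(1:2, 10),
             grade = sample(c("A", "B", "C"), 20, replace = TRUE))

grades <- c("A", "B", "C")

# Build one unevaluated call per grade, e.g. mean(grade == "A");
# `!!g` inlines the current value of g into the expression
exprs <- lapply(grades, function(g) expr(mean(grade == !!g)))
names(exprs) <- paste0("mu_", grades)

# Splice the named expressions into summarise() with `!!!`
out <- df %>%
  group_by(id) %>%
  summarise(n = n(), !!!exprs)

out
```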
I'm trying to count results from one dataset that I imported into R and display those counts into a separate dataset that gets created within R for each unique Player.
Here is what a simplified version of the dataset looks like with only the relevant columns:
Label <- c("Raul", "Raul", "Raul", "Eric", "Eric", "Eric", "Aaron", "Aaron", "Aaron")
Result <- c("s", "b", "fo", "s", "f", "b", "ss", "go", "s")
df2 <- data.frame(Label, Result)
My data was compiled in Excel and exported as a CSV with about 4000 more rows of similar results and about 45 unique "Labels", but this smaller example shows you what the df looks like. Here is an example of what I want to end up with (line breaks to keep the rows separate):
Raul, count(s), count(b), count(fo), etc
Eric, count(s), count(b), count(fo), etc
Aaron, count(s), count(b), count(fo), etc
So that each unique "Label" for the players is on the row and the columns are the count of each type of Result. It should give me 45 rows, one for each of the unique players in my dataset.
I've been able to get the unique Player Labels just fine by running this:
dfstat <- data.frame(unique(df2$Label))
The problem comes when I try to get the counts for each type of result. I've tried a variety of things, like:
dfstat <- dfstat %>%
mutate(Strikes = count(subset(df2, Label = unique.df2.label & Result == "s")))
But I get this error code: Error: Column `Strikes` is of unsupported class data.frame
And
df34$Strikes <- count(subset(df2, Label = unique.df2.label & Result == "s"))
Gives me this error code: Error in `$<-.data.frame`(`*tmp*`, Strikes, value = list(n = 9L)) : replacement has 1 row, data has 3
I'm doing something similar to be a part of a Shiny App and got that to work no problem, but that's because I was able to subset for my input value of a single Player. But I'm having trouble with getting this count data for ALL the unique players in my dataset into another dataset within R.
I appreciate any help with this issue because I'd really rather not manually type in all my different count formulas for every unique player. Thank you!
You can use table to count the frequencies for each Player.
table(df2)
# Result
#Label b f fo go s ss
# Aaron 0 0 0 1 1 1
# Eric 1 1 0 0 1 0
# Raul 1 0 1 0 1 0
If there are other columns in the data you can specify the columns whose frequency you want to count.
table(df2$Label, df2$Result)
A tidyverse approach would be :
library(dplyr)
library(tidyr)
df2 %>%
count(Label, Result) %>%
pivot_wider(names_from = Result, values_from = n, values_fill = 0)
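If the table() result is needed as a regular data frame with one row per player, as the question asks, one base R option is as.data.frame.matrix(). The sketch below uses the small example from the question; the column name `Label` in the result is my choice.

```r
Label  <- c("Raul", "Raul", "Raul", "Eric", "Eric", "Eric", "Aaron", "Aaron", "Aaron")
Result <- c("s", "b", "fo", "s", "f", "b", "ss", "go", "s")
df2 <- data.frame(Label, Result)

# Cross-tabulate, then convert the table to a wide data frame
counts <- as.data.frame.matrix(table(df2$Label, df2$Result))

# Move the player names from row names into a proper column
dfstat <- data.frame(Label = rownames(counts), counts, row.names = NULL)
dfstat
```

Combinations that never occur (for example Raul and "ss") come out as 0 without any extra step.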
We could group by 'Label' and get the number of 's' elements by taking the sum of a logical expression
library(dplyr)
df2 %>%
group_by(Label) %>%
summarise(n = sum(Result == 's'))
Or to get the frequency of both column elements
count(df2, Label, Result)
If we need all the combinations, then do a complete before getting the count
library(tidyr)
df2 %>%
mutate(n = 1) %>%
complete(Label, Result, fill = list(n = 0)) %>%
group_by(Label, Result) %>%
summarise(n = sum(n))
NOTE: count expects a data.frame/tibble as input, so it won't work within mutate where it receives a vector as input
You could do a tapply followed by an rbind making sure that stats that are missing are given a count of 0.
res <- tapply(df2$Result, df2$Label, function(x) {
x <- table(x)
x[setdiff(unique(df2$Result), names(x))] <- 0
return(x[order(names(x))])
})
Then we can take this list of counts and rbind it
res <- do.call(rbind, res)
Your players will now be rownames
dfstat <- data.frame(label = row.names(res), res)
I have a fairly large dataset, and I need to determine the maximum value of each row across several columns. So in the sample data below, for "II" I need to know what the highest value is and whether it is in "N" or "P". I know very similar questions have been posted previously; however, I need the output to keep the other metadata columns in my dataset. This also means I need to specify the range of columns that should be included in the "max" query.
Thanks in advance for any guidance with this.
data<-data_frame(Exp = c("I", "II", "III", "IV", "V", "VI", "VII", "VIII"),
N = c(8.77, 1.67, 7.47, 7.58, 1.1, 8.9, 7.5, 7.7),
P = c(1.848, 3.029, 1.925, 2.725, 1.900, 3.100,
2.000, 9.800))
I have tried several variations of the below code
test %>%
mutate(Max = pmax(!!! rlang::syms(names(.)[c("N", "P"),]))) %>%
group_by(data, Exp) %>%
summarise(Max = max(Max))
and receive the error:
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "function"
This is my first question asked on here, so apologies for any incorrect formatting etc, any advice on this (and my question) would be much appreciated.
I am considering this in two steps:
1. find the max value of the columns
2. find the label that matches the max value (assuming no equal values)
If you only have two columns N and P then this is straightforward to do using case_when.
data2 = data %>%
mutate(max_val = pmax(N,P)) %>% # find max
mutate(source = case_when(max_val == N ~ "N", # find label
max_val == P ~ "P"))
However, if the number of columns, or the column names, is dynamic then this becomes harder. I have the following working:
cols = c("N", "P") # list of column names to work with
data2 = data %>%
mutate(max_val = pmax(!!!syms(cols))) %>% # find max
mutate(source = NA) # initialize blank labels
# iterate to find labels
data3 = data2
for(c in cols)
data3 = mutate(data3, source = ifelse(is.na(source) & max_val == !!sym(c), c, source))
There is probably a way to combine sym with case_when so you do not have to iterate over the labels. If someone finds it, please post an update to this answer.
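One way to avoid the loop, offered as a sketch rather than the sym/case_when combination asked for above: base R's max.col() returns, for each row of a matrix, the position of the largest value, which can be used to index the vector of column names. The other columns are left untouched.

```r
data <- data.frame(
  Exp = c("I", "II", "III", "IV", "V", "VI", "VII", "VIII"),
  N   = c(8.77, 1.67, 7.47, 7.58, 1.1, 8.9, 7.5, 7.7),
  P   = c(1.848, 3.029, 1.925, 2.725, 1.900, 3.100, 2.000, 9.800)
)

cols <- c("N", "P")  # columns to include in the max query

# Row-wise maximum across the chosen columns
data$max_val <- do.call(pmax, unname(as.list(data[cols])))

# Name of the column holding that maximum; ties go to the first column listed
data$source <- cols[max.col(as.matrix(data[cols]), ties.method = "first")]

data
```

Note that max.col() does not handle missing values gracefully, so rows with NA in the selected columns would need separate treatment.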
Looking to solve the same problem I found a different solution, that is clearer to me.
cur_data returns the current working group.
rowwise can have columns specified which work like groups while using summarise.
ungroup is needed to revert to the default column-wise format.
The summarise method drops the non-grouping variables.
# using names
v = c('N', 'P')
data %>% rowwise %>% mutate(m=max(cur_data()[v])) %>% ungroup
# using ranges
start = 8
end = 25
data %>% rowwise %>% mutate(m=max(cur_data()[start:end])) %>% ungroup
# using summarize
data %>% rowwise(Exp) %>% summarize(m=max(cur_data()))
I can't figure out how to:
1. efficiently create, with rbind or another way, a data.frame compiling csv-derived data.frames, whose number varies for different projects; or, similarly,
2. efficiently create a data.frame of the differences between a csv-derived "baseline scenario"'s values and those of the rest of the csv-based alternative scenarios.
The csvs are timeseries of hydrologic model output, already in long, 'tidy' format and they're identical in format, size, and order -- there's just different numbers of them for different projects. There's always at least two, a baseline and an alternative, but there's usually quite a few. Eg, Project A might have four csvs/scenarios and Project B might have thirty csvs/scenarios.
I'm hoping to have one code template that will efficiently accommodate projects with any number of scenarios. Without an efficient way, I need to add or delete quite a few lines to match the number of scenarios I have on a sub-daily basis, so it's a time-consuming step I'd like to avoid. After df and df_diff are created, both are used for later summaries and plots.
I'll manually enter the names of the scenarios as they always differ, eg:
library(dplyr)
scenarios <- c("baseline", "alt1", "alt1b", "no dam")
length(scenarios) will always match the number of CSVs I have for a given project.
Read in the csvs (one csv for each scenario) and keep them unmodified for later, separate processing:
#In my case these csv#s are from a separate file's list of csvs,
#eg csv1 <- read.csv("baseline.csv")
# csv2 <- read.csv("alt1.csv"), etc - all tidy monthly timeseries of many variables
#For reproducibility, simplifying:
csv1 <- data.frame("variable" = "x", "value" = 13) #baseline scenario
csv2 <- data.frame("variable" = "x", "value" = 5) #"alternative 1"
csv3 <- data.frame("variable" = "x", "value" = 109) #"alternative 1b"
csv4 <- data.frame("variable" = "x", "value" = 11) #"dam removal"
#csv5 <- data.frame("variable" = "x", "value" = 2.5) #"100 extra flow for salmon sep-dec"
#...
#csv30 <- data.frame("variable" = "x", "value" = 41) #"alternative H3"
Copy the csvs and connect data to scenario:
baseline <- csv1 %>% mutate(scenario = as.factor(paste0(scenarios[1])))
scen2 <- csv2 %>% mutate(scenario = as.factor(paste0(scenarios[2])))
scen3 <- csv3 %>% mutate(scenario = as.factor(paste0(scenarios[3])))
scen4 <- csv4 %>% mutate(scenario = as.factor(paste0(scenarios[4])))
df <- rbind(baseline, scen2, scen3, scen4) #data.frame #1 I'm looking for.
#eg, if csv1-csv30 were included, how to compile in df efficiently, w/o needing the "scen" lines?
There are 4 scenarios in this case, so df$scenario has 4 levels.
Now for the second "difference" data.frame:
bslnevals <- baseline %>% select(value)
scen2vals <- scen2 %>% select(value)
scen3vals <- scen3 %>% select(value)
scen4vals <- scen4 %>% select(value)
scen2diff <- (scen2vals - bslnevals) %>% transmute(value_diff = value,
scenario_diff = as.factor(paste0(scenarios[2], " - baseline"))) %>%
data.frame(scen2) %>% select(-value, -scenario)
scen3diff <- (scen3vals - bslnevals) %>% transmute(value_diff = value,
scenario_diff = as.factor(paste0(scenarios[3], " - baseline"))) %>%
data.frame(scen3) %>% select(-value, -scenario)
scen4diff <- (scen4vals - bslnevals) %>% transmute(value_diff = value,
scenario_diff = as.factor(paste0(scenarios[4], " - baseline"))) %>%
data.frame(scen4) %>% select(-value, -scenario)
df_diff <- rbind(scen2diff, scen3diff, scen4diff) #data.frame #2 I'm looking for.
#same as above, if csv1 - csv30 were included, how to compile in df_diff efficiently, w/o
#needing the "scen#vals" and "scen#diff" lines?
rm(baseline, scen2, scen3, scen4) #declutter - now unneeded (but csv1, csv2, etc orig csv#s needed later)
rm(bslnevals, scen2vals, scen3vals, scen4vals) #unneeded
rm(scen2diff, scen3diff, scen4diff) #unneeded
With 4 scenarios, there are 3 differences from the baseline, so df_diff$scenario_diff has 3 levels.
So, if I had 4 csvs (1 baseline, 3 alternatives) or maybe 30 CSVs (1 baseline, 29 alternatives), I tried to write functions and for loops that would assign scen2 and scen3 ...scen28 , and scen2diff, scen3diff...scen28diff etc, variables dynamically, but I failed. So, I'm looking for a way that works and that doesn't need much modification when applied to a project with any number of scenarios. I'm looking just to create df and df_diff in a clean way for a user, for however many scenarios (ie csvs) happen to be given to me or them for a given project.
Any help is greatly appreciated.
I can't test with your case but this may be a good starting point for refactoring your code. I use case_when to generate rules to map the name of the CSV file to a scenario. I subtract the baseline value from the value in each scenario.
library(dplyr)
library(readr)
library(purrr)
library(tidyr)
baseline_df <- read_csv("baseline.csv") %>%
mutate(id = row_number())
# list all csv files (in current directory), then read them all, and row-bind them.
# use case_when to apply rules to change filenames to "scenarios" (grepl to check presence of string)
# join with baseline df (by scenario row number) for easy subtracting.
# calculate differences values.
# remove baseline-baseline rows (diff is 0)
diff_df <- list.files(path = getwd(), pattern = "\\.csv$", full.names = TRUE) %>%
tibble(filename = .) %>%
mutate(data = map(filename, read_csv)) %>%
unnest(cols = data) %>%
mutate(scenario = case_when(
grepl("baseline", filename) ~ "baseline",
grepl("alternative1", filename) ~ "alt1",
grepl("alternative2", filename) ~ "alt2",
grepl("dam_removal", filename) ~ "no dam",
TRUE ~ "other"
)) %>%
group_by(scenario) %>%
mutate(id = row_number()) %>%
left_join(baseline_df, by = "id", suffix = c("_new", "_baseline")) %>%
mutate(value_diff = value_new - value_baseline) %>%
filter(scenario != "baseline")
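For the in-memory example from the question, the same idea can be sketched in base R without reading files: keep the data frames in a list whose order matches `scenarios`, then loop over the list instead of creating scen2, scen3, and so on. This assumes, as the question states, that all data frames have identical format and row order and that the first one is the baseline; `csv_list` is a name introduced here for illustration.

```r
scenarios <- c("baseline", "alt1", "alt1b", "no dam")

# One data frame per scenario, in the same order as `scenarios`
csv_list <- list(
  data.frame(variable = "x", value = 13),   # baseline
  data.frame(variable = "x", value = 5),    # alternative 1
  data.frame(variable = "x", value = 109),  # alternative 1b
  data.frame(variable = "x", value = 11)    # dam removal
)

# df: stack all scenarios, tagging each row with its scenario name
df <- do.call(rbind, Map(function(d, s) cbind(d, scenario = s),
                         csv_list, scenarios))
df$scenario <- factor(df$scenario, levels = scenarios)
rownames(df) <- NULL

# df_diff: each alternative minus the baseline, row by row
baseline <- csv_list[[1]]
df_diff <- do.call(rbind, Map(function(d, s) {
  data.frame(variable      = d$variable,
             value_diff    = d$value - baseline$value,
             scenario_diff = paste(s, "- baseline"))
}, csv_list[-1], scenarios[-1]))
df_diff$scenario_diff <- factor(df_diff$scenario_diff)
rownames(df_diff) <- NULL

df
df_diff
```

With real files, `csv_list` could be filled with something like lapply(files, read.csv), where `files` is a hypothetical vector of paths in the same order as `scenarios`; nothing else would need to change as the number of scenarios grows.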