I'm trying to count results from a dataset that I imported into R and store those counts in a separate data frame, created within R, with one row for each unique Player.
Here is what a simplified version of the dataset looks like with only the relevant columns:
Label <- c("Raul", "Raul", "Raul", "Eric", "Eric", "Eric", "Aaron", "Aaron", "Aaron")
Result <- c("s", "b", "fo", "s", "f", "b", "ss", "go", "s")
df2 <- data.frame(Label, Result)
My data was compiled in Excel and exported as a CSV with about 4000 more rows of similar results and about 45 unique "Labels", but this smaller example shows you what the df looks like. Here is an example of what I want to end up with (line breaks to keep the rows separate):
Raul, count(s), count(b), count(fo), etc
Eric, count(s), count(b), count(fo), etc
Aaron, count(s), count(b), count(fo), etc
So that each unique "Label" for the players is on the row and the columns are the count of each type of Result. It should give me 45 rows, one for each of the unique players in my dataset.
I've been able to get the unique Player Labels just fine by running this:
dfstat <- data.frame(unique(df2$Label))
The problem comes when I try to get the counts for each type of result. I've tried a variety of things, like:
dfstat <- dfstat %>%
mutate(Strikes = count(subset(df2, Label = unique.df2.label & Result == "s")))
But I get this error code: Error: Column `Strikes` is of unsupported class data.frame
And
df34$Strikes <- count(subset(df2, Label = unique.df2.label & Result == "s"))
Gives me this error code: Error in `$<-.data.frame`(`*tmp*`, Strikes, value = list(n = 9L)) : replacement has 1 row, data has 3
I'm doing something similar to be a part of a Shiny App and got that to work no problem, but that's because I was able to subset for my input value of a single Player. But I'm having trouble with getting this count data for ALL the unique players in my dataset into another dataset within R.
I appreciate any help with this issue because I'd really rather not manually type in all my different count formulas for every unique player. Thank you!
You can use table to count the frequencies for each Player.
table(df2)
# Result
#Label b f fo go s ss
# Aaron 0 0 0 1 1 1
# Eric 1 1 0 0 1 0
# Raul 1 0 1 0 1 0
If there are other columns in the data you can specify the columns whose frequency you want to count.
table(df2$Label, df2$Result)
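If the goal is a regular data frame with one row per player, as described in the question, the table can be converted. A minimal sketch, assuming the column names from the example (Label, Result); the name counts is only for illustration:

```r
Label <- c("Raul", "Raul", "Raul", "Eric", "Eric", "Eric", "Aaron", "Aaron", "Aaron")
Result <- c("s", "b", "fo", "s", "f", "b", "ss", "go", "s")
df2 <- data.frame(Label, Result)

# Cross-tabulate, then turn the table into a plain data frame (one column per Result).
counts <- as.data.frame.matrix(table(df2$Label, df2$Result))

# Move the player names from the row names into a proper column.
dfstat <- data.frame(Label = rownames(counts), counts, row.names = NULL)
dfstat
```

Players are sorted alphabetically here because table() orders its levels that way.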
A tidyverse approach would be:
library(dplyr)
library(tidyr)
df2 %>%
count(Label, Result) %>%
pivot_wider(names_from = Result, values_from = n, values_fill = 0)
We could group by 'Label' and get the number of 's' elements by taking the sum of a logical expression
library(dplyr)
df2 %>%
group_by(Label) %>%
summarise(n = sum(Result == 's'))
Or to get the frequency of both column elements
count(df2, Label, Result)
If we need all the combinations, then do a complete before getting the count
library(tidyr)
df2 %>%
mutate(n = 1) %>%
complete(Label, Result, fill = list(n = 0)) %>%
group_by(Label, Result) %>%
summarise(n = sum(n))
NOTE: count expects a data.frame/tibble as input, so it won't work within mutate where it receives a vector as input
You could do a tapply followed by an rbind making sure that stats that are missing are given a count of 0.
res <- tapply(df2$Result, df2$Label, function(x) {
x <- table(x)
x[setdiff(unique(df2$Result), names(x))] <- 0
return(x[order(names(x))])
})
Then we can take this list of counts and rbind it
res <- do.call(rbind, res)
Your players will now be rownames
dfstat <- data.frame(label = row.names(res), res)
I am a beginner in R and I am trying to solve a problem in R, which is I guess quite easy for experienced users.
The problem is the following: customers (A, B, C) come in repeatedly and use different programs (Prg). I would like to identify "typical sequences" of programs. To do so, I identify the first program they consume, the second, and the third. In a next step, I would like to combine this information into sequences of programs by customer. For a customer first consuming Prg1, then Prg2, then Prg3, the final outcome should be "Prg1-Prg2-Prg3".
The code below produces a dataframe similar to the one I have. Prg is the program in the respective year, First is the first year the customer enters, Sec the second and Third the third.
The code produces columns that extract the program consumed in the first contract (Code_1_Prg), second contract (Code_2_Prg) and third contract (Code_3_Prg).
Unfortunately, I have not managed to combine these 3 columns into the required result. I tried to group by ID and save the first element of the sequence in a new column called "chain1". Here I get the error message "Error in df %>% group_by(ID) %>% df$chain1 = df[df$Code_1_Prg != "NA", :
could not find function "%>%<-", even though I am using the magrittr and dplyr packages.
detach(package:plyr)
library(dplyr)
library(magrittr)
df %>%
group_by(ID) %>%
df$chain1 = df[df$Code_1_Prg!="NA", "Code_1_Prg"]
Below, I share some code, which produces the dataframe and the starting point for extracting the character variable in Code_1_Prg by group.
I would be really grateful, if you could help me with this. Thank you very much in advance!
df <- data.frame("ID"=c("A","A","A","A","B", "B", "B","B","B","C","C", "C", "C","C","C","C"),
"Year_Contract" =c("2010", "2015", "2017","2017","2010","2010", "2015","2015","2020","2015","2015","2017","2017","2017","2018","2018"),
"Prg"=c("AIB","AIB","LLA","LLA","BBU","BBU", "KLU","KLU","DDI","CKN","CKN","BBU","BBU","BBU","KLU","KLU"),
"First"=c("2010","2010","2010","2010","2010","2010", "2010","2010","2010","2015","2015","2015","2015","2015","2015","2015"),
"Sec"=c("2015","2015","2015","2015","2015","2015", "2015","2015","2015","2017","2017","2017","2017","2017","2017","2017"),
"Third"=c("2017","2017","2017","2017","2020","2020", "2020","2020","2020","2018","2018","2018","2018","2018","2018","2018")
)
df$Code_1_Prg <- ifelse(df$Year_Contract == df$First, df$Prg, NA)
df$Code_2_Prg <- ifelse(df$Year_Contract == df$Sec, df$Prg, NA)
df$Code_3_Prg <- ifelse(df$Year_Contract == df$Third, df$Prg, NA)
detach(package:plyr)
library(dplyr)
library(magrittr)
df %>%
group_by(ID) %>%
df$chain1 = df[df$Code_1_Prg!="NA", "Code_1_Prg"]
#This is the final column, I am trying to create
df2 <- data.frame("ID"=c("A","B", "C"),
"Goal" =c("AIB-LLA", "BBU-KLU-DDI", "CKN-BBU-KLU")
)
df <- merge(df, df2, by="ID")
Are you looking for something like this?
library(dplyr)
df %>%
group_by(ID) %>%
arrange(Year_Contract, .by_group = TRUE) %>%
distinct() %>%
summarise(sequence = toString(Prg))
ID sequence
<chr> <chr>
1 A AIB, AIB, LLA
2 B BBU, KLU, DDI
3 C CKN, BBU, KLU
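To get the hyphen-separated form shown in the question's Goal column, one possibility is to collapse the distinct programs per customer. This is a sketch rather than the answerer's method, and it assumes a program never recurs later in a customer's history, because unique() keeps only the first occurrence:

```r
library(dplyr)

# Only the three columns needed for the sequence, copied from the question.
df <- data.frame(
  ID = c("A","A","A","A","B","B","B","B","B","C","C","C","C","C","C","C"),
  Year_Contract = c("2010","2015","2017","2017","2010","2010","2015","2015","2020",
                    "2015","2015","2017","2017","2017","2018","2018"),
  Prg = c("AIB","AIB","LLA","LLA","BBU","BBU","KLU","KLU","DDI",
          "CKN","CKN","BBU","BBU","BBU","KLU","KLU")
)

chains <- df %>%
  arrange(ID, Year_Contract) %>%   # chronological order within each customer
  group_by(ID) %>%
  summarise(Goal = paste(unique(Prg), collapse = "-"))  # drop repeats, join with "-"

chains
```

The result can be merged back onto the original rows with merge(df, chains, by = "ID") if a per-row column is needed.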
I would like to create a for loop to repeat the same function for 150 variables. I am new to R and I am a bit stuck.
To give you an example of some commands I need to repeat:
N <- table(df$ var1 ==0)["TRUE"]
n <- table(df$ var1 ==1)["TRUE"]
PREV95 <- (svyciprop(~ var1 ==1, level=0.95, design= design, deff= "replace")*100)
I need to run the same functions for 150 columns. I know that I need to put all my cols in one vector = x but then I don't know how to write the loop to repeat the same command for all my variables.
Can anyone help me to write a loop?
A word in advance: loops in R can in most cases be replaced with a faster, R-ish way (various flavours of apply, mapping, walking ...)
applying a function to the columns of dataframe df:
a)
with base R, example dataset cars
my_function <- function(xs) max(xs)
lapply(cars, my_function)
b)
tidyverse-style:
cars %>%
summarise_all(my_function)
An anecdotal example: I came across an R-script which took about half an hour to complete and made abundant use of for-loops. Replacing the loops with vectorized functions and members of the apply family cut the execution time down to about 3 minutes. So while for-loops and related constructs might be more familiar when coming from another language, they might soon get in your way with R.
This chapter of Hadley Wickham's R for data science gives an introduction into iterating "the R-way".
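Applied to the counts from the question, the same idea might look like the sketch below. The data frame is a hypothetical stand-in for the 150 columns, and the svyciprop step is omitted because it needs a survey design object that the question does not show:

```r
# Hypothetical 0/1 data standing in for the real columns.
df <- data.frame(
  var1 = c(0, 1, 1, 0, 1),
  var2 = c(1, 1, 1, 0, 0),
  var3 = c(0, 0, 0, 0, 1)
)

vars <- names(df)  # or a hand-picked character vector of column names

# For each column name, count the zeros (N) and the ones (n).
counts <- t(sapply(vars, function(v) {
  c(N = sum(df[[v]] == 0), n = sum(df[[v]] == 1))
}))

counts
```

Unlike table(...)["TRUE"], sum() on the logical vector returns 0 rather than NA when a value never occurs.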
Here is an approach that doesn't use loops. I've created a data set called df with three factor variables to represent your dataset as you described it, and a function eval() that does all the work. First, it keeps only the factor columns. Then it converts those factors to numeric variables so that the values sum as 0 and 1; summing the factors directly would instead be based on 1 and 2. Within the function I define a helper neg() that gives the number of negative values by subtracting the sum of the 1s from the total length of the vector. The function then builds the data frames "n" (count of the positives), "N" (count of the negatives), and PREV95, using pivot_longer to put each into long format so that every statistic ends up in its own column once they are merged. Note that I had to leave PREV95 out because I do not have a 'design' object to pass as a parameter; it is commented out, and you can uncomment it to add it back in. I then use left_join to combine these data frames and return "results"; the version that includes PREV95 is commented out as well. eval() takes your original data frame as input. I think the logic for PREV95 should work, but I cannot check it without a 'design' object. The function returns a data frame, not a list, which you'll likely find easier to work with.
library(dplyr)
library(tidyr)
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
eval <- function(df){
df1 <- df %>%
select_if(is.factor) %>%
mutate_all(function(x) as.numeric(as.character(x)))
neg <- function(x){
length(x) - sum(x)
}
n<- df1 %>%
summarize(across(where(is.numeric), sum)) %>%
pivot_longer(everything(), names_to = "Var", values_to = "n")
N <- df1 %>%
summarize(across(where(is.numeric), function(x) neg(x))) %>%
pivot_longer(everything(), names_to = "Var", values_to = "N")
#PREV95 <- df1 %>%
# summarize(across(where(is.numeric), function(x) survey::svyciprop(~x == 1, design = design, level = 0.95, deff = "replace")*100)) %>%
# pivot_longer(everything(), names_to = "Var", values_to = "PREV95")
results <- n %>%
left_join(N, by = "Var")
#results <- n %>%
# left_join(N, by = "Var") %>%
# left_join(PREV95, by = "Var")
return(results)
}
eval(df)
Var n N
<chr> <dbl> <dbl>
1 Var1 2 8
2 Var2 5 5
3 Var3 4 6
If you really wanted to use a for loop, here is how to make it work. Again, I've left out the survey function due to a lack of info on the parameters to make it work.
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
VarList <- names(df %>% select_if(is.factor))
results <- list()
for (var in VarList){
results[[var]][["n"]] <- sum(df[[var]] == 1)
results[[var]][["N"]] <- sum(df[[var]] == 0)
}
unlist(results)
Var1.n Var1.N Var2.n Var2.N Var3.n Var3.N
2 8 5 5 4 6
I have a data frame in which the first column indicates the work (manager, employee or worker), the second indicates whether the person works at night or not and the last is a household code (if two individuals share the same code then it means that they share the same house).
#Here is the reproducible data :
PCS <- c("worker", "manager","employee","employee","worker","worker","manager","employee","manager","employee")
work_night <- c("Yes","Yes","No", "No","No","Yes","No","Yes","No","Yes")
HHnum <- c(1,1,2,2,3,3,4,4,5,5)
df <- data.frame(PCS,work_night,HHnum)
My problem is that I would like to have a new data frame with households instead of individuals. I would like to group individuals based on HHnum and then merge their answers.
For the variable "PCS" I have new categories based on the combination of answers : manager+worker="I" ; manager+employee="II", employee+employee="VI", worker+worker="III" etc
For the variable "work_night", I would like to apply a score (if both answered Yes then score = 2, if one answered Yes then score = 1, and if both answered No then score = 0).
To be clear, I would like my data frame to look like this :
HHnum PCS work_night
1 "I" 2
2 "VI" 0
3 "III" 1
4 "II" 1
5 "II" 1
How can I do this in R using dplyr ? I know that I need group_by() but then I don't know what to use.
Best,
Victor
Here is one way to do it (though I admit it is pretty verbose). I created a reference dataframe (i.e., combos) in case you had more categories than 3, which is then joined with the main dataframe (i.e., df_new) to bring in the PCS roman numerals.
library(dplyr)
library(tidyr)
# Create a dataframe with all of the combinations of PCS.
combos <- expand.grid(unique(df$PCS), unique(df$PCS))
combos <- unique(t(apply(combos, 1, sort))) %>%
as.data.frame() %>%
dplyr::mutate(PCS = as.roman(row_number()))
# Create another dataframe with the columns reversed (will make it easier to join to the main dataframe).
combos2 <- data.frame(V1 = c(combos$V2), V2 = c(combos$V1), PCS = c(combos$PCS)) %>%
dplyr::mutate(PCS = as.roman(PCS))
combos <- rbind(combos, combos2)
# Get the count of "Yes" for each HHnum group.
# Then, put the PCS into 2 columns to join together with "combos" df.
df_new <- df %>%
dplyr::group_by(HHnum) %>%
dplyr::mutate(work_night = sum(work_night == "Yes")) %>%
dplyr::group_by(grp = rep(1:2, length.out = n())) %>%
dplyr::ungroup() %>%
tidyr::pivot_wider(names_from = grp, values_from = PCS) %>%
dplyr::rename("V1" = 3, "V2" = 4) %>%
dplyr::left_join(combos, by = c("V1", "V2")) %>%
unique() %>%
dplyr::select(HHnum, PCS, work_night)
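A shorter alternative is to look the category up from the sorted pair of jobs. This is a sketch under assumptions: pcs_lookup is a hypothetical name, it holds only the four pairs spelled out in the question (any other pair would come back as NA), and every household is assumed to have exactly two members.

```r
library(dplyr)

PCS <- c("worker", "manager", "employee", "employee", "worker",
         "worker", "manager", "employee", "manager", "employee")
work_night <- c("Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes")
HHnum <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
df <- data.frame(PCS, work_night, HHnum, stringsAsFactors = FALSE)

# Keys are the two jobs sorted alphabetically and joined with "+".
# Only the combinations named in the question are listed (assumption).
pcs_lookup <- c(
  "manager+worker"    = "I",
  "employee+manager"  = "II",
  "worker+worker"     = "III",
  "employee+employee" = "VI"
)

households <- df %>%
  group_by(HHnum) %>%
  summarise(
    PCS = unname(pcs_lookup[paste(sort(PCS), collapse = "+")]),
    work_night = sum(work_night == "Yes")   # 0, 1 or 2 night workers
  )

households
```

Sorting before pasting means the order of the two people within a household does not matter, so no reversed copy of the lookup is needed.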
I have a dataframe where each entry relates to a job posting in the NHS specifying the week the job was posted, and what NHS Trust (and region) the job is in.
At the moment my dataframe looks something like this:
set.seed(1)
df1 <- data.frame(
NHS_Trust = sample(1:30,20,T),
Week = sample(1:10,20,T),
Region = sample(1:15,20,T))
And I would like to count the number of jobs for each week across each NHS Trust and assign that value to a new column 'jobs' so my dataframe looks like this:
set.seed(1)
df2 <- data.frame(
NHS_Trust = rep(1:30, each=10),
Week = rep(seq(1,10),30),
Region = rep(as.integer(runif(30,1,15)),1,each = 10),
Jobs = rpois(10*30, lambda = 2))
The dataframe may then be used to create a Poisson longitudinal multilevel model where I may model the number of jobs.
Using the data.table package you can group by, count and assign to a new column in a single expression. The syntax for data.tables is dt[i, j, by]. Here i selects or orders the rows; it is empty in this case, so all data is used in its original order. The j tells what is to be done, here counting the number of occurrences using .N, which is then assigned to the new variable count using the assignment operator :=. The by takes a list of variables, and the j operation is performed on each group.
library(data.table)
setDT(df1)
df1[, count := .N, by = .(NHS_Trust, Week, Region)]
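Note that := keeps every original row and repeats the group count on each of them. If one row per group is wanted instead, the count can be returned as a new table. A small sketch on made-up rows (the values are illustrative only):

```r
library(data.table)

# Hypothetical postings: Trust 1 posted twice in week 1 and once in week 3, etc.
dt <- data.table(
  NHS_Trust = c(1L, 1L, 1L, 2L, 2L),
  Week      = c(1L, 1L, 3L, 2L, 2L),
  Region    = c(5L, 5L, 5L, 9L, 9L)
)

# .(Jobs = .N) in j returns one row per group instead of adding a column in place.
jobs <- dt[, .(Jobs = .N), by = .(NHS_Trust, Week, Region)]
jobs
```

Groups appear in order of first occurrence, and combinations with no postings are simply absent from the result.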
A tidyverse approach would be
library(tidyverse)
df1 <- df1 %>%
group_by(NHS_Trust, Week, Region) %>%
count()
You can use count to count the number of jobs across each Region, NHS_Trust and Week and use complete to fill in missing combinations.
library(dplyr)
df1 %>%
count(Region, NHS_Trust, Week, name = 'Jobs') %>%
tidyr::complete(Region, Week = 1:10, fill = list(Jobs = 0))
I guess I'm moving my comment to an answer:
df2 <- df1 %>% group_by(Region, NHS_Trust, Week) %>% count(); colnames(df2)[4] <- "Jobs"
df2$combo <- paste0(df2$Region, "_", df2$NHS_Trust, "_", df2$Week)
for (i in 1:length(unique(df2$Region))){
for (j in 1:length(unique(df2$NHS_Trust))){
for (k in 1:length(unique(df2$Week))){
curr_combo <- paste0(unique(df2$Region)[i], "_",
unique(df2$NHS_Trust)[j], "_",
unique(df2$Week)[k])
if(!curr_combo %in% df2$combo){
curdat <- data.frame(unique(df2$Region)[i],
unique(df2$NHS_Trust)[j],
unique(df2$Week)[k],
0,
curr_combo,
stringsAsFactors = FALSE)
#cat(curdat)
names(curdat) <- names(df2)
df2 <- rbind(as.data.frame(df2), curdat)
}
}
}
}
tail(df2)
# Region NHS_Trust Week Jobs combo
# 4495 15 1 4 0 15_1_4
# 4496 15 1 5 0 15_1_5
# 4497 15 1 8 0 15_1_8
# 4498 15 1 3 0 15_1_3
# 4499 15 1 6 0 15_1_6
# 4500 15 1 9 0 15_1_9
The for loop here checks which Region-NHS_Trust-Week combinations are missing from df2 and appends those to df2 with a corresponding Jobs value of 0. The checking is done with the help of the new variable combo, which is just a concatenation of the values in the fields mentioned earlier, separated by underscores.
Edit: I am plenty sure the people here can come up with something more elegant than this.
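One candidate for something shorter is tidyr's complete() with nesting(), which fills in missing weeks only for Region/Trust pairs that actually occur, rather than for every Region by Trust by Week combination as the loop does. A sketch on made-up rows, assuming each Trust belongs to a single Region and that the weeks of interest are 1 to 3:

```r
library(dplyr)
library(tidyr)

# Hypothetical postings; the real data would be df1 from the question.
postings <- data.frame(
  NHS_Trust = c(1L, 1L, 1L, 2L, 2L),
  Week      = c(1L, 1L, 3L, 2L, 2L),
  Region    = c(5L, 5L, 5L, 9L, 9L)
)

jobs <- postings %>%
  count(Region, NHS_Trust, Week, name = "Jobs") %>%
  # nesting() keeps only observed Region/Trust pairs; Week is expanded to 1:3.
  complete(nesting(Region, NHS_Trust), Week = 1:3, fill = list(Jobs = 0))

jobs
```

With the question's data, Week = 1:10 would give ten rows per Trust that appears in df1.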
I am trying to create a new data frame with 2 columns: var1 and var2, each one of them is the row sum of specific columns in data frame sampData.
library(dplyr)
sampData <-
rnorm(260) %>%
matrix(ncol = 26) %>%
data.frame() %>%
setNames(LETTERS)
var1 <- c("A", "B", "C")
var2 <- c("D", "E", "F", "G")
I know that I can select columns using [] and c(), like this:
sampData[ ,c("A","B")]
but when I try to generate and use that format from my vectors like this:
d1_ <-paste(var1, collapse=",")
d2_ <-paste(var2, collapse=",")
sampData[ ,d1_]
I get this error:
Error in `[.data.frame`(sampData, , d1_) : undefined columns selected
Which I also get if I try to calculate the rowSums -- which is what I am interested in getting.
data.frame(var1 = rowSums(sampData[ , d1_])
, var2 = rowSums(sampData[ , d2_]))
I think I have managed to figure out what you are asking, but if I am wrong, let me know.
You are trying to select columns from sampData that match the values in var1 and var2, and sum across the rows, limited to the columns that matched each.
It is always better to provide reproducible data, here is some for this case (using dplyr to build it):
sampData <-
rnorm(260) %>%
matrix(ncol = 26) %>%
data.frame() %>%
setNames(LETTERS)
var1 <- c("A", "B", "C")
var2 <- c("D", "E", "F", "G")
Then, you don't need to concatenate the column names at all -- just use the vector (var1 or var2, in your case) directly. Here the column names are letters, so the letters are matched by name. However, if your vector is numeric, it is matched by position (e.g., 3 will return the third column).
data.frame(
var1sums = rowSums(sampData[, var1])
, var2sums = rowSums(sampData[, var2])
)
Of note, cat returns NULL after printing to the screen. If you need to concatenate values, you will need to use paste (or similar), but that will not work for what you are trying to do here.
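To see why the pasted version fails, it may help to inspect what paste actually produces. A small sketch on a three-column stand-in for sampData (the values are made up):

```r
# Tiny stand-in for sampData with columns A, B, C.
sampData <- setNames(data.frame(matrix(1:12, ncol = 3)), c("A", "B", "C"))
var1 <- c("A", "B", "C")

d1_ <- paste(var1, collapse = ",")
length(d1_)                      # 1: a single string "A,B,C"
d1_ %in% names(sampData)         # FALSE: no column has that name
all(var1 %in% names(sampData))   # TRUE: each element is a column name

# Indexing with the character vector itself selects the three columns.
var1sums <- rowSums(sampData[, var1])
```

The data frame looks for one column literally called "A,B,C", which is why it reports undefined columns.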
This question got me thinking about flexibility of such solutions, so here is an attempt using dplyr and tidyr, which yields effectively the same result. The difference is that this may provide more flexibility for variable selection or even downstream processing.
sampData %>%
# add column for individual
mutate(ind = 1:nrow(.)) %>%
# convert data to long format
gather("Variable", "Value", -ind) %>%
# Set to group by the individual we added above
group_by(ind) %>%
# Calculate sums as desired
summarise(
var1sums = sum(Value[Variable %in% var1])
, var2sums = sum(Value[Variable %in% var2])
)
However, the real advantage would come if you had an arbitrary number (or just a large number generally) of sets of variables that you wanted to get the individual sums from. Instead of manually constructing every column you might be interested in, you can use standard evaluation (as opposed to non-standard) to automatically generate the columns based on a named list of vectors:
sampData %>%
mutate(ind = 1:nrow(.)) %>%
gather("Variable", "Value", -ind) %>%
group_by(ind) %>%
# Calculate one column for each vector in `varList`
summarise_(
.dots = lapply(varList, function(x){
paste0("sum(Value[Variable %in% c('"
, paste(x, collapse = "', '")
, "')])")
})
)
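The snippet above refers to varList without defining it, and summarise_() is deprecated in current dplyr. As a hedged alternative that stays in base R, a named list of column-name vectors can be mapped over directly; the list below is an assumed definition of varList built from var1 and var2:

```r
set.seed(1)
sampData <- setNames(data.frame(matrix(rnorm(260), ncol = 26)), LETTERS)

# Assumed shape of varList: one named entry per output column.
varList <- list(
  var1sums = c("A", "B", "C"),
  var2sums = c("D", "E", "F", "G")
)

# One row sum per entry; drop = FALSE keeps single-column sets working too.
sums <- as.data.frame(
  lapply(varList, function(cols) rowSums(sampData[, cols, drop = FALSE]))
)

head(sums)
```

Adding another set of variables then only requires another entry in varList.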