convert data frame of "missed" numbers into data frame of numbers "hit" - r

I have quite a specific question that should be easy to solve, but I just cannot think how...
I have a simple data frame like this:
mydf <- data.frame(Shooter=1:3, Targets.missed=c(paste(sample(1:10,4),collapse=";"), paste(sample(1:10,5),collapse=";"), paste(sample(1:10,8),collapse=";")))
mydf
Shooter Targets.missed
1 1 3;8;4;7
2 2 10;1;5;7;4
3 3 5;9;4;10;8;1;6;7
This data frame tells me which Targets (from 1 to 10) each Shooter missed.
I would like to obtain a different data frame that tells me, per Target, which Shooter(s) hit it.
The result would be:
Target hit.by.Shooters
1 1
2 1;2;3
3 2;3
4 NA
5 1
6 1;2
7 NA
8 2
9 1;2
10 1

We split 'Targets.missed' at the ; to expand the data into 'long' format, group by 'Shooter', and summarise into a list column holding the numbers from 1:10 that are not in 'Targets.missed' (setdiff). We then unnest the list column, group by 'Target', summarise by pasting the unique 'Shooter' values into a single string, and use complete to fill in the targets from 1:10 that nobody hit with NA:
library(tidyverse)
mydf %>%
separate_rows(Targets.missed) %>%
group_by(Shooter) %>%
summarise(Target = list(setdiff(1:10, Targets.missed))) %>%
unnest(Target) %>%
group_by(Target) %>%
summarise(hit.by.Shooters = paste(unique(Shooter), collapse=";")) %>%
complete(Target = 1:10)
# A tibble: 10 x 2
# Target hit.by.Shooters
# <int> <chr>
# 1 1 1
# 2 2 1;2;3
# 3 3 2;3
# 4 4 <NA>
# 5 5 1
# 6 6 1;2
# 7 7 <NA>
# 8 8 2
# 9 9 1;2
#10 10 1
Another option is base R: split 'Targets.missed' (assuming it is of class character) into a list of vectors, loop through the list and take the values from 1:10 that are not among the missed targets (with setdiff), set the names of the list from the 'Shooter' column, stack the key/value list into a two-column data.frame, take the unique rows, aggregate by pasting the 'ind' column grouped by 'values', and merge with a full 'values' dataset from 1:10:
out <- aggregate(ind ~ values,
unique(stack(setNames(lapply(strsplit(mydf$Targets.missed, ';'),
setdiff, x= 1:10), mydf$Shooter))), FUN = paste, collapse=";")
out1 <- merge(data.frame(values = 1:10), out, all.x = TRUE)
and change the column names if necessary
names(out1) <- c('Target', 'hit.by.Shooters')
data
mydf <- structure(list(Shooter = 1:3, Targets.missed = c("3;8;4;7", "10;1;5;7;4",
"5;9;4;10;8;1;6;7")), class = "data.frame", row.names = c("1",
"2", "3"))
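The nested base R one-liner above can be hard to follow, so here is a minimal sketch of the same steps written one at a time. It assumes the mydf from the data block (with 'Targets.missed' as character) and that the targets are always 1:10; the intermediate names (missed, hit, long, agg, res) are only illustrative.

```r
# Same data as in the "data" block above
mydf <- data.frame(Shooter = 1:3,
                   Targets.missed = c("3;8;4;7", "10;1;5;7;4", "5;9;4;10;8;1;6;7"),
                   stringsAsFactors = FALSE)

# 1. Split each string of missed targets into a vector
missed <- strsplit(mydf$Targets.missed, ";", fixed = TRUE)

# 2. Invert: keep the targets from 1:10 that were NOT missed
hit <- lapply(missed, function(m) setdiff(1:10, as.integer(m)))

# 3. Name the list by shooter and stack into long format
#    (stack() returns columns 'values' = target and 'ind' = shooter)
names(hit) <- mydf$Shooter
long <- stack(hit)

# 4. Paste the shooters together per target
agg <- aggregate(ind ~ values, data = long, FUN = paste, collapse = ";")

# 5. Left-join onto all targets so that unhit targets appear as NA
res <- merge(data.frame(values = 1:10), agg, all.x = TRUE)
names(res) <- c("Target", "hit.by.Shooters")
res
```

Converting the split strings with as.integer() before setdiff() is a small defensive choice; the original relies on implicit coercion.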

Another tidyverse possibility. We first create a data frame with all possible combinations of Shooter and Targets, then use anti_join to remove the rows that are present in mydf (the misses), add back any targets that nobody hit as NA with complete, and finally summarise by target to get the Shooters who actually hit it.
library(tidyverse)
crossing(Shooter = unique(mydf$Shooter), Targets.missed = 1:10) %>%
anti_join(mydf %>% separate_rows(Targets.missed) %>% mutate_all(as.numeric)) %>%
complete(Targets.missed = 1:10) %>%
group_by(Targets.missed) %>%
summarise(hit.by.Shooters = paste0(Shooter, collapse = ";"))
# Targets.missed hit.by.Shooters
# <int> <chr>
# 1 1 1;2
# 2 2 1;2
# 3 3 1
# 4 4 1
# 5 5 2
# 6 6 1;3
# 7 7 1;2
# 8 8 2
# 9 9 NA
#10 10 3
data
set.seed(987)
mydf <- data.frame(Shooter=1:3,
Targets.missed=c(paste(sample(1:10,4),collapse=";"),
paste(sample(1:10,5),collapse=";"), paste(sample(1:10,8),collapse=";")))

data.table approach
library( data.table )
#vector with all possible targets
targets.v <- 1:10
#split the missed targets to a list
missed.list <- strsplit( mydf$Targets.missed, ";")
#inverse, to get all hit targets
hit.list <- lapply( missed.list, function(x) as.data.table( targets.v[!targets.v %in% x] ) )
#bind hit targets to data.table
dt <- rbindlist( hit.list, idcol = "shooter" )
#summarise (paste with collapse), and join on all possible targets
dt[, .(hit.by.shooters = paste(shooter, collapse = ";")), by = .(target = V1)][data.table(target = targets.v), on = c("target")]
# target hit.by.shooters
# 1: 1 1
# 2: 2 1;2;3
# 3: 3 2;3
# 4: 4 <NA>
# 5: 5 1
# 6: 6 1;2
# 7: 7 <NA>
# 8: 8 2
# 9: 9 1;2
# 10: 10 1

Related

R Extract specific text from column into multiple columns [duplicate]

I have a dataframe exported from the web with this format
id vals
1 {7,12,58,1}
2 {1,2,5,7}
3 {15,12}
I would like to extract ONLY the numbers (ignore curlys and commas) into multiple columns like this
id val_1 val_2 val_3 val_4 val_5
1 7 12 58 1
2 1 2 5 7
3 15 12
Even though the maximum number of values in any row here is 4, I always want the columns to go up to val_5.
Thanks!
We could use str_extract_all for this:
library(dplyr)
library(stringr)
df %>%
mutate(vals = str_extract_all(vals, '\\d+', ''))
or, as @akrun suggests in the comments:
df %>%
mutate(vals = str_extract_all(vals, '\\d+', '')) %>%
do.call(data.frame, .)
id vals.1 vals.2 vals.3 vals.4
1 1 7 12 58 1
2 2 1 2 5 7
3 3 15 12 <NA> <NA>
data:
df <- structure(list(id = 1:3, vals = c("{7,12,58,1}", "{1,2,5,7}",
"{15,12}")), class = "data.frame", row.names = c(NA, -3L))
Another possible tidyverse option: remove the curly brackets, separate the rows on the commas, then pivot to wide form. We can then create the additional column with add_column from tibble, naming it from the highest number in the column names (4 in this case) plus one, which gives val_5.
library(tidyverse)
df %>%
mutate(vals = str_replace_all(vals, "\\{|\\}", "")) %>%
separate_rows(vals, sep=",") %>%
group_by(id) %>%
mutate(ind = row_number()) %>%
pivot_wider(names_from = ind, values_from = vals, names_prefix = "val_") %>%
add_column(!!(paste0("val_", parse_number(names(.)[ncol(.)])+1)) := NA)
Output
id val_1 val_2 val_3 val_4 val_5
1 1 7 12 58 1 NA
2 2 1 2 5 7 NA
3 3 15 12 <NA> <NA> NA
Data
df <- read.table(text = "id vals
1 {7,12,58,1}
2 {1,2,5,7}
3 {15,12} ", header = T)
Using data.table
library(data.table)
library(stringi)
result <- setDT(df)[, stri_match_all_regex(vals, '\\d+')[[1]], by=.(id)]
result[, item:=paste('val', 1:.N, sep='_'), by=.(id)] # defines column names
dcast(result, id~item, value.var = 'V1') # convert from long to wide
## id val_1 val_2 val_3 val_4
## 1: 1 7 12 58 1
## 2: 2 1 2 5 7
## 3: 3 15 12 <NA> <NA>
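The question asks for the columns to always run up to val_5, whatever the longest row is. A minimal base R sketch of that requirement is below; it assumes the df from the data block above, and n_cols, nums, padded and wide are illustrative names. Indexing a vector past its end returns NA, which is what pads the short rows.

```r
df <- data.frame(id = 1:3,
                 vals = c("{7,12,58,1}", "{1,2,5,7}", "{15,12}"),
                 stringsAsFactors = FALSE)

n_cols <- 5L  # fixed number of value columns requested in the question

# Extract every run of digits from each string
nums <- regmatches(df$vals, gregexpr("[0-9]+", df$vals))

# Pad (or truncate) each row to exactly n_cols values
padded <- t(vapply(nums,
                   function(v) as.integer(v)[seq_len(n_cols)],
                   integer(n_cols)))
colnames(padded) <- paste0("val_", seq_len(n_cols))

wide <- cbind(df["id"], padded)
wide
```

Note that this sketch silently drops values beyond the fifth; if rows can be longer than n_cols, that limit would need to be raised.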

Appending a column to each data frame within a list

I have a list of data frames and want to append a new column to each, but I keep getting various error messages. Can anybody explain why the code below doesn't work for me? I'd be happy if rowid_to_column works, as the data in my actual set is already ordered correctly; otherwise I'd like a new column with a sequence going from 1:length(data$data)
##dataset
data<- tibble(Location = c(rep("London",6),rep("Glasgow",6),rep("Dublin",6)),
Day= rep(seq(1,6,1),3),
Average = runif(18,0,20),
Amplitude = runif(18,0,15))%>%
nest_by(Location)
###map + rowid_to_column
attempt1<- data%>%
map(.,rowid_to_column(.,var = "hour"))
##mutate
attempt2<-data %>%
map(., mutate("Hours" = 1:6))
###add column
attempt3<- data%>%
map(.$data,add_column(.data,hours = 1:6))
newcolumn<- 1:6
###lapply
attempt4<- lapply(data,cbind(data$data,newcolumn))
Many thanks,
Stuart
You were nearly there with your base R attempt, but you want to iterate over data$data, which is a list of data frames.
data$data <- lapply(data$data, function(x) {
hour <- seq_len(nrow(x))
cbind(x, hour)
})
data$data
# [[1]]
# Day Average Amplitude hour
# 1 1 6.070539 1.123182 1
# 2 2 3.638313 8.218556 2
# 3 3 11.220683 2.049816 3
# 4 4 12.832782 14.858611 4
# 5 5 12.485757 7.806147 5
# 6 6 19.250489 6.181270 6
Edit: Updated after realising the earlier version was iterating over columns rather than rows. This approach also works if the data frames have different numbers of rows, which the methods that hard-code the vector as 1:6 do not.
A data.table approach
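To illustrate the point about differing row counts, here is a small sketch using a hypothetical list of two data frames (dfs, with 3 and 5 rows) rather than the nested tibble from the question. Because the index is built from nrow(x), each data frame gets a sequence of the right length.

```r
# Hypothetical list of data frames with different numbers of rows
dfs <- list(a = data.frame(Day = 1:3),
            b = data.frame(Day = 1:5))

# Add a row index whose length adapts to each data frame
dfs <- lapply(dfs, function(x) {
  x$hour <- seq_len(nrow(x))
  x
})

dfs$b
```

A hard-coded 1:6 would raise an error here, since neither data frame has six rows.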
library(data.table)
setDT(data)
data[, data := lapply(data, function(x) cbind(x, new_col = 1:6))]
data$data
# [[1]]
# Day Average Amplitude test new_col
# 1 1 11.139917 0.3690539 1 1
# 2 2 5.350847 7.0925508 2 2
# 3 3 9.602104 6.1782818 3 3
# 4 4 14.866074 13.7356913 4 4
# 5 5 1.114201 1.1007080 5 5
# 6 6 2.447236 5.9944926 6 6
#
# [[2]]
# Day Average Amplitude test new_col
# 1 1 17.230213 13.966576 1 1
# .....
A purrr approach:
data<- tibble(Location = c(rep("London",6),rep("Glasgow",6),rep("Dublin",6)),
Day= rep(seq(1,6,1),3),
Average = runif(18,0,20),
Amplitude = runif(18,0,15))%>%
group_split(Location) %>%
purrr::map_dfr(~.x %>% mutate(Hours = c(1:6)))
If you want to keep your approach and preserve the same data structure, this is one way, again using purrr (you need to ungroup first, otherwise it will not work because of the rowwise grouping from nest_by):
data %>% ungroup() %>%
mutate_at("data", .funs = ~map(.x, ~.x %>% mutate(Hours = c(1:6))) )
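As for why the attempts in the question fail: map() expects the list to iterate over as its first argument and a function (or formula) as its second, whereas the attempts pass the result of a call such as rowid_to_column(., var = "hour"). A minimal sketch of the intended call, on a hypothetical list of tibbles (nested) standing in for data$data:

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for data$data: a list of tibbles
nested <- list(London  = tibble(Day = 1:6),
               Glasgow = tibble(Day = 1:4))

# The second argument is a formula that is applied to each element (.x)
out <- map(nested, ~ rowid_to_column(.x, var = "hour"))

out$Glasgow
```

With the nested tibble from the question, the same call would be applied to data$data and the result assigned back to that column.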

mutate_at does not create variable suffixes in some cases?

I have been playing with dplyr::mutate_at to create new variables by applying the same function to some of the columns. When I name my function in the .funs argument, the mutate call creates new columns with a suffix instead of replacing the existing ones, which is a cool option that I discovered in this thread.
df = data.frame(var1=1:2, var2=4:5, other=9)
df %>% mutate_at(vars(contains("var")), .funs=funs('sqrt'=sqrt))
#### var1 var2 other var1_sqrt var2_sqrt
#### 1 1 4 9 1.000000 2.000000
#### 2 2 5 9 1.414214 2.236068
However, I noticed that when the vars argument used to point my columns returns only one column instead of several, the resulting new column drops the initial name: it gets named sqrt instead of other_sqrt here:
df %>% mutate_at(vars(contains("other")), .funs=funs('sqrt'=sqrt))
#### var1 var2 other sqrt
#### 1 1 4 9 3
#### 2 2 5 9 3
I would like to understand why this behaviour happens, and how to avoid it because I don't know in advance how many columns the contains() will return.
EDIT:
The newly created columns must inherit the original name of the original columns, plus the suffix 'sqrt' at the end.
Thanks
Here is another idea. We can add setNames(sub("^sqrt$", "other_sqrt", names(.))) after the mutate_at call to rename the column sqrt to other_sqrt. The pattern ^sqrt$ only matches the derived column sqrt, which exists only when a single column matches other, as demonstrated in Example 1. If more than one column matches other, as in Example 2, setNames does not change any column names.
library(dplyr)
# Example 1
df <- data.frame(var1 = 1:2, var2 = 4:5, other = 9)
df %>%
mutate_at(vars(contains("other")), funs("sqrt" = sqrt(.))) %>%
setNames(sub("^sqrt$", "other_sqrt", names(.)))
# var1 var2 other other_sqrt
# 1 1 4 9 3
# 2 2 5 9 3
# Example 2
df2 <- data.frame(var1 = 1:2, var2 = 4:5, other1 = 9, other2 = 16)
df2 %>%
mutate_at(vars(contains("other")), funs("sqrt" = sqrt(.))) %>%
setNames(sub("^sqrt$", "other_sqrt", names(.)))
# var1 var2 other1 other2 other1_sqrt other2_sqrt
# 1 1 4 9 16 3 4
# 2 2 5 9 16 3 4
Or we can design a function to check how many columns contain the string other before manipulating the data frame.
mutate_sqrt <- function(df, string){
string_col <- grep(string, names(df), value = TRUE)
df2 <- df %>% mutate_at(vars(contains(string)), funs("sqrt" = sqrt(.)))
if (length(string_col) == 1){
df2 <- df2 %>% setNames(sub("^sqrt$", paste(string_col, "sqrt", sep = "_"), names(.)))
}
return(df2)
}
mutate_sqrt(df, "other")
# var1 var2 other other_sqrt
# 1 1 4 9 3
# 2 2 5 9 3
mutate_sqrt(df2, "other")
# var1 var2 other1 other2 other1_sqrt other2_sqrt
# 1 1 4 9 16 3 4
# 2 2 5 9 16 3 4
I just figured out a (not so clean) way to do it:
I add an extra dummy variable to the dataset, with a name that ensures it will be selected so that we never fall into the one-variable case, and after the calculation I remove the dummy and its derived column, like this:
df %>% mutate(other_fake=NA) %>%
mutate_at(vars(contains("other")), .funs=funs('sqrt'=sqrt)) %>%
select(-contains("other_fake"))
#### var1 var2 other other_sqrt
#### 1 1 4 9 3
#### 2 2 5 9 3
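In current dplyr (1.0.0 or later, which is an assumption about your installed version), across() with an explicit .names pattern sidesteps the problem, because the new name is built from the column name no matter how many columns are selected. This is a different technique from the mutate_at/funs answers above, sketched here on the same example data:

```r
library(dplyr)

df  <- data.frame(var1 = 1:2, var2 = 4:5, other = 9)
df2 <- data.frame(var1 = 1:2, var2 = 4:5, other1 = 9, other2 = 16)

# "{.col}" is replaced by the name of each selected column
one_col <- df %>%
  mutate(across(contains("other"), sqrt, .names = "{.col}_sqrt"))

two_cols <- df2 %>%
  mutate(across(contains("other"), sqrt, .names = "{.col}_sqrt"))

names(one_col)
names(two_cols)
```

The same pattern applies unchanged whether contains() selects one column or several, which is the case the question could not know in advance.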

Getting a summary data frame for all the combinations of categories represented in two columns

I am working with a data frame corresponding to the example below:
set.seed(1)
dta <- data.frame("CatA" = rep(c("A","B","C"), 4), "CatNum" = rep(1:2,6),
"SomeVal" = runif(12))
I would like to quickly build a data frame that has sum values for all the combinations of the categories derived from CatA and CatNum, as well as for the categories derived from each column separately. In the simple example above, for the first few combinations, this can be achieved with simple code:
df_sums <- data.frame(
"Category" = c("Total for A",
"Total for A and 1",
"Total for A and 2"),
"Sum" = c(sum(dta$SomeVal[dta$CatA == 'A']),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 1]),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 2]))
)
This produces an informative data frame of sums:
Category Sum
1 Total for A 2.1801780
2 Total for A and 1 1.2101839
3 Total for A and 2 0.9699941
This solution would be grossly inefficient when applied to a data frame with multiple categories. I would like to achieve the following:
Cycle through all the categories, including categories derived from each column separately as well as from both columns at the same time.
Keep some flexibility in which function is applied; for instance, I may want to apply mean instead of sum.
Save the "Total for" string as a separate object that I can easily edit when applying a function other than sum.
I was initially thinking of using dplyr, along these lines:
require(dplyr)
df_sums_experiment <- dta %>%
group_by(CatA, CatNum) %>%
summarise(TotVal = sum(SomeVal))
But it's not clear to me how I could apply multiple groupings simultaneously. As stated, I'm interested in grouping by each column separately and by the combination of both columns. I would also like to create a string column that would indicate what is combined and in what order.
You could use tidyr to unite the columns and gather the data. Then use dplyr to summarise:
library(dplyr)
library(tidyr)
dta %>% unite(measurevar, CatA, CatNum, remove=FALSE) %>%
gather(key, val, -SomeVal) %>%
group_by(val) %>%
summarise(sum(SomeVal))
val sum(SomeVal)
(chr) (dbl)
1 1 2.8198078
2 2 3.0778622
3 A 2.1801780
4 A_1 1.2101839
5 A_2 0.9699941
6 B 1.4405782
7 B_1 0.4076565
8 B_2 1.0329217
9 C 2.2769138
10 C_1 1.2019674
11 C_2 1.0749464
Just loop over the column combinations, compute the quantities you want and then rbind them together:
library(data.table)
dt = as.data.table(dta) # or setDT to convert in place
cols = c('CatA', 'CatNum')
rbindlist(apply(combn(c(cols, ""), length(cols)), 2,
function(i) dt[, sum(SomeVal), by = c(i[i != ""])]), fill = T)
# CatA CatNum V1
# 1: A 1 1.2101839
# 2: B 2 1.0329217
# 3: C 1 1.2019674
# 4: A 2 0.9699941
# 5: B 1 0.4076565
# 6: C 2 1.0749464
# 7: A NA 2.1801780
# 8: B NA 1.4405782
# 9: C NA 2.2769138
#10: NA 1 2.8198078
#11: NA 2 3.0778622
Split, then use lapply
#result
res <- do.call(rbind,
lapply(
c(split(dta,dta$CatA),
split(dta,dta$CatNum),
split(dta,dta[,1:2])),
function(i)sum(i[,"SomeVal"])))
#prettify the result
res1 <- data.frame(Category=paste0("Total for ",rownames(res)),
Sum=res[,1])
res1$Category <- sub("."," and ",res1$Category,fixed=TRUE)
row.names(res1) <- seq_along(row.names(res1))
res1
# Category Sum
# 1 Total for A 2.1801780
# 2 Total for B 1.4405782
# 3 Total for C 2.2769138
# 4 Total for 1 2.8198078
# 5 Total for 2 3.0778622
# 6 Total for A and 1 1.2101839
# 7 Total for B and 1 0.4076565
# 8 Total for C and 1 1.2019674
# 9 Total for A and 2 0.9699941
# 10 Total for B and 2 1.0329217
# 11 Total for C and 2 1.0749464
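The question also asks for a swappable summary function and an editable label prefix. A minimal base R sketch of that idea follows; summarise_groups and its arguments (value, groupings, fun, label) are hypothetical names introduced for illustration, not part of any package.

```r
set.seed(1)
dta <- data.frame(CatA = rep(c("A", "B", "C"), 4),
                  CatNum = rep(1:2, 6),
                  SomeVal = runif(12))

# Hypothetical helper: apply 'fun' to 'value' within each grouping
# and prefix every category with 'label'
summarise_groups <- function(data, value, groupings,
                             fun = sum, label = "Total for ") {
  pieces <- lapply(groupings, function(g) {
    # One factor per grouping; combined levels look like "A and 1"
    key  <- interaction(data[g], sep = " and ", drop = TRUE)
    vals <- tapply(data[[value]], key, fun)
    data.frame(Category = paste0(label, names(vals)),
               Value = as.vector(vals),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

groupings <- list("CatA", "CatNum", c("CatA", "CatNum"))
res <- summarise_groups(dta, "SomeVal", groupings)
res_mean <- summarise_groups(dta, "SomeVal", groupings,
                             fun = mean, label = "Mean for ")
res
```

Passing the groupings as a list keeps the helper open to more than two columns, at the cost of having to spell out which combinations are wanted.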

How to get a frequency table of all columns of complete data frame in R?

I want to create a frequency table from a data frame and save it in Excel. Using the table() function I can only get the frequencies of one particular column. I want a frequency table for all the columns together, where the levels or type of each variable may differ. Something like a summary of the data frame, but with only frequencies and no mean or other measures.
I was trying something like this
for(i in 1:230){
rm(tb)
tb<-data.frame(table(mydata[i]))
tb2<-cbind(tb2,tb)
}
But it's showing the following Error
Error in data.frame(..., check.names = FALSE) : arguments imply
differing number of rows: 15, 12
I also tried data.frame() in place of cbind(), but the error did not change.
You are getting an error because you are trying to combine the data frames that have different dimensions. From what I understand, your problem is two-fold: (1) you want to get the frequency distribution of each column regardless of type; and, (2) you want to save all of the results in a single Excel sheet.
For the first problem, you can use the mapply() function.
set.seed(1)
dat <- data.frame(
x = sample(LETTERS[1:5], 15, replace = TRUE),
y = rbinom(5, 15, prob = 0.4)
)
mylist <- mapply(table, dat); mylist
# $x
#
# A B C D E
# 2 5 1 4 3
#
# $y
#
# 5 6 7 11
# 3 3 6 3
You can also use purrr::map().
library(purrr)
dat %>% map(table)
The second problem has several solutions in this question: Export a list into a CSV or TXT file in R. In particular, LyzandeR's answer will enable you to do just what you intended. If you prefer to save the outputs in separate files, you can do:
mapply(write.csv, mylist, file=paste0(names(mylist), '.csv'))
Maybe an rbind solution is better as it allows you to handle variables with different levels:
dt = data.frame(x = c("A","A","B","C"),
y = c(1,1,2,1))
dt
# x y
# 1 A 1
# 2 A 1
# 3 B 2
# 4 C 1
dt_res = data.frame()
for (i in 1:ncol(dt)){
dt_temp = data.frame(t(table(dt[,i])))
dt_temp$Var1 = names(dt)[i]
dt_res = rbind(dt_res, dt_temp)
}
names(dt_res) = c("Variable","Levels","Freq")
dt_res
# Variable Levels Freq
# 1 x A 2
# 2 x B 1
# 3 x C 1
# 4 y 1 3
# 5 y 2 1
And an alternative (probably faster) process using apply:
dt = data.frame(x = c("A","A","B","C"),
y = c(1,1,2,1))
dt
ff = function(x){
y = data.frame(t(table(x)))
y$Var1 = NULL
names(y) = c("Levels","Freq")
return(y)
}
dd = do.call(rbind, apply(dt, 2, ff))
dd
# Levels Freq
# x.1 A 2
# x.2 B 1
# x.3 C 1
# y.1 1 3
# y.2 2 1
# extract variable names from row names
dd$Variable = sapply(row.names(dd), function(x) unlist(strsplit(x,"[.]"))[1])
dd
# Levels Freq Variable
# x.1 A 2 x
# x.2 B 1 x
# x.3 C 1 x
# y.1 1 3 y
# y.2 2 1 y
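Since the original goal was a single file that Excel can open, the long-format idea above can be combined with write.csv. This is a sketch only: it builds the table with lapply instead of growing a data frame in a loop, and writes to a temporary file (out_file) that you would replace with your own path.

```r
dat <- data.frame(x = c("A", "A", "B", "C"),
                  y = c(1, 1, 2, 1))

# One small data frame of frequencies per column, stacked into long format
freq_long <- do.call(rbind, lapply(names(dat), function(v) {
  tab <- table(dat[[v]])
  data.frame(Variable = v,
             Level = names(tab),
             Freq = as.vector(tab),
             stringsAsFactors = FALSE)
}))

# Write everything to one CSV file
out_file <- tempfile(fileext = ".csv")
write.csv(freq_long, out_file, row.names = FALSE)

freq_long
```

Because every piece has the same three columns, rbind works even when the variables have different numbers of levels, which is what made cbind fail in the question.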
Edit (2021-03-29): tidyverse Principles
Here is some updated code that utilizes tidyverse, specifically functions from dplyr, tibble, and purrr. The code is a bit more readable and easier to carry out as well. Example data set is provided.
tibble(
a = rep(c(1:3), 2),
b = factor(rep(c("Jan", "Feb", "Mar"), 2)),
c = factor(rep(LETTERS[1:3], 2))
) ->
dat
dat #print df
# A tibble: 6 x 3
a b c
<int> <fct> <fct>
1 1 Jan A
2 2 Feb B
3 3 Mar C
4 1 Jan A
5 2 Feb B
6 3 Mar C
Get counts and proportions across columns.
library(purrr)
library(dplyr)
library(tibble)
#library(tidyverse) #to load assortment of pkgs
#output tables - I like to use parentheses & specifying my funs
purrr::map(
dat, function(.x) {
count(tibble(x = .x), x) %>%
mutate(pct = (n / sum(n) * 100))
})
#here is the same code, more concise (purrr's formula shorthand)
purrr::map(dat, ~ count(tibble(x = .x), x) %>%
mutate(pct = (n / sum(n) * 100)))
$a
# A tibble: 3 x 3
x n pct
<int> <int> <dbl>
1 1 2 33.3
2 2 2 33.3
3 3 2 33.3
$b
# A tibble: 3 x 3
x n pct
<fct> <int> <dbl>
1 Feb 2 33.3
2 Jan 2 33.3
3 Mar 2 33.3
$c
# A tibble: 3 x 3
x n pct
<fct> <int> <dbl>
1 A 2 33.3
2 B 2 33.3
3 C 2 33.3
Old code...
The table() function returns a "table" object, which in my experience is awkward to manipulate in R. I tend to just write my own function to circumvent this issue. The code below uses a data frame with some categorical variables/features (wide formatted data), such as dat above; select_if comes from dplyr.
We can use lapply() in conjunction with the table() function found in base R to create a list of frequency counts for each feature.
freqList = lapply(select_if(dat, is.factor),
function(x) {
df = data.frame(table(x))
names(df) = c("x", "y")
return(df)
}
)
This approach allows each list object to be easily indexed and further manipulated if necessary, which can be really handy with data frames containing a lot of features. Use print(freqList) to view all of the frequency tables.
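For reference, a self-contained base R sketch of the same idea is below, with no dplyr dependency. It assumes a small example data frame with two factor columns (b, c) and one integer column (a); Filter(is.factor, dat) takes the place of select_if, and the column names level and n are illustrative.

```r
dat <- data.frame(a = rep(1:3, 2),
                  b = factor(rep(c("Jan", "Feb", "Mar"), 2)),
                  c = factor(rep(LETTERS[1:3], 2)))

# Keep only the factor columns, then tabulate each one as a plain data frame
freqList <- lapply(Filter(is.factor, dat), function(x) {
  df <- as.data.frame(table(x), stringsAsFactors = FALSE)
  names(df) <- c("level", "n")
  df
})

# Each element can be indexed by column name
freqList$b
```

Returning ordinary data frames rather than "table" objects is what makes the list easy to filter, sort or bind afterwards.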

Resources