When I execute the following code:
data_ikea_wider <- data_ikea_longer %>%
  pivot_wider(id_cols = c(Record_no, Geography, City, Country,
                          City.Country, Year),
              names_from = Category, values_from = Value)
The columns just contain NAs, as shown in the attached screenshot.
What am I doing wrong? Thanks!
We could use dcast from data.table:
library(data.table)
dcast(setDT(dat), col1 ~ col2, value.var = 'val')
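Applied to the question's own data, that would look roughly like this (a sketch, assuming data_ikea_longer has the columns shown in the question):
library(data.table)
dcast(setDT(data_ikea_longer),
      Record_no + Geography + City + Country + City.Country + Year ~ Category,
      value.var = 'Value')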
Getting NAs from a pivot is not unexpected: it means that not every id has a value for every one of the new "columns".
For example,
dat <- data.frame(col1 = c(1,1,2), col2 = c('a', 'b', 'a'), val = 1:3)
dat
# col1 col2 val
# 1 1 a 1
# 2 1 b 2
# 3 2 a 3
If we want to pivot keeping col1 as an id, and turning col2 values into new columns, then it should be apparent that we'll end up with two rows (ids 1 and 2) and two new columns (a and b) to replace col2 and val. Unfortunately, since we only have three rows of data, the 2 rows × 2 columns = 4 cells cannot all be filled with 3 values, so one will be NA:
pivot_wider(dat, id_cols = col1, names_from = col2, values_from = val)
# # A tibble: 2 x 3
# col1 a b
# <dbl> <int> <int>
# 1 1 1 2
# 2 2 3 NA
If you see this and are surprised, thinking that you actually have the data ... then you should check your data importing and filtering to make sure you did not inadvertently remove it (or it was not provided initially).
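If the NA simply means "this combination never occurs" and you would rather see a default value there, pivot_wider() has a values_fill argument for exactly that purpose; this fills the empty cell (here with 0) instead of leaving NA:
pivot_wider(dat, id_cols = col1, names_from = col2,
            values_from = val, values_fill = 0L)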
I have the following table:
col1 col2 col3 col4
   1    2    1    4
   5    6    6    3
My goal is to find the max value per each row, and then find how many times it was repeated in the same row.
The resulting table should look like this:
col1 col2 col3 col4 max_val repetition
   1    2    1    4       4          1
   5    6    6    3       6          2
Now to achieve this, I am doing the following for Max:
df %>%
  rowwise() %>%
  mutate(max = max(col1:col4))
However, I am struggling to find the repetition. My idea is to use pseudocode like this inside mutate():
sum("select the current row, entirely or only some columns" == max). But I don't know how to select the entire row (or only some of its columns) and check its contents against the max. How can we do this in dplyr?
A dplyr approach:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(max_val = max(across(everything())),
         repetition = sum(across(col1:col4) == max_val))
# A tibble: 2 × 6
# Rowwise:
col1 col2 col3 col4 max_val repetition
<int> <int> <int> <int> <int> <int>
1 1 2 1 4 4 1
2 5 6 6 3 6 2
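dplyr also provides c_across(), which is designed specifically for this kind of rowwise computation; an equivalent sketch:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(max_val = max(c_across(col1:col4)),
         repetition = sum(c_across(col1:col4) == max_val)) %>%
  ungroup()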
A base R approach:
df$max_val <- apply(df, 1, max)
df$repetition <- rowSums(df[, 1:4] == df[, 5])
For other (non-tidyverse) readers, a base R approach could be:
df$max_val <- apply(df, 1, max)
df$repetition <- apply(df, 1, function(x) sum(x[1:4] == x[5]))
Output:
# col1 col2 col3 col4 max_val repetition
# 1 1 2 1 4 4 1
# 2 5 6 6 3 6 2
Although dplyr has added many tools for working across rows of data, it remains, in my mind at least, much easier to adhere to tidy principles and always convert the data to "long" format for these kinds of operations.
Thus, here is a tidy approach:
library(tidyr)
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(cols = -row) %>%
  group_by(row) %>%
  mutate(max_val = max(value), repetitions = sum(value == max(value))) %>%
  pivot_wider(id_cols = c(row, max_val, repetitions)) %>%
  select(col1:col4, max_val, repetitions)
The last select() is just to get the columns in the order you want.
I've got a dataframe such as this:
df = data.frame(col1=c(1,1,1,2,2,2,3,3,3),
col2=as.factor(c('a','b','b','a','a','a','b','a','b')))
Then I extract all the categories (levels) related to each column:
levels_df = expand.grid(unique(df$col1), unique(df$col2))
colnames(levels_df)=c('col1','col2')
My objective now is to apply a function to the rows belonging to each pair of levels. How can I do that?
sapply(levels_df, FUN, dataset = df)
Any other strategy that performs the same task is fine. The function could be anything you like, for example a counting function (how many rows belong to each pair of levels), in which case the output would contain one row per pair of levels with its count.
In conclusion, I want to subset rows from a data frame using each pair of levels, so that I can manipulate those rows and apply a function (such as nrow()).
You can skip the levels part, and just use dplyr to group by col1 and col2, then count the rows. Finally, we use complete to add in any combinations that don't appear in our dataset:
library(tidyverse)
df %>%
  group_by(col1, col2) %>%                  # group df by col1 and col2
  summarise(n = n()) %>%                    # make a new column, n, which is the count
  complete(col1, col2, fill = list(n = 0))  # fill in missing pairs with 0
The output matches what you expected:
# A tibble: 6 x 3
# Groups: col1 [3]
col1 col2 n
<dbl> <fct> <dbl>
1 1 a 1
2 1 b 2
3 2 a 3
4 2 b 0
5 3 a 1
6 3 b 2
I'm not sure if this specific count example will help you, but here's what you could do in the tidyverse:
library(tidyverse)
df %>%
  group_by(col1, col2) %>%
  count() %>%
  ungroup() %>%
  complete(col1, col2, fill = list(n = 0))
which gives:
# A tibble: 6 x 3
col1 col2 n
<dbl> <fct> <dbl>
1 1 a 1
2 1 b 2
3 2 a 3
4 2 b 0
5 3 a 1
6 3 b 2
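For completeness, base R can also produce these counts directly, since table() already includes the empty combinations (note that col1 and col2 come back as factors here):
# counts per pair of levels, zeros included
as.data.frame(table(col1 = df$col1, col2 = df$col2), responseName = "n")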
Is there a simple and elegant way to left join (with dplyr) a "b" table into an "a" table when both contain the same column, but the first has NAs and the second has the missing values? Here follows an example:
# Tables A and B
a <- tibble(
"ID" = c(1,2,3),
"x" = c(NA,5, NA)
)
b <- tibble(
"ID" = c(1,3),
"x" = c(7, 4)
)
# Table I want as result
c <- tibble(
"ID" = c(1,2,3),
"x" = c(7,5,4)
)
You could use the coalesce function from the dplyr package to stitch together a complete vector from the missing pieces. It is inspired by the SQL COALESCE function.
left_join(a, b, by = 'ID') %>%
  mutate(col = coalesce(x.x, x.y)) %>%
  select(ID, col)
# A tibble: 3 x 2
ID col
<dbl> <dbl>
1 1 7
2 2 5
3 3 4
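More recent versions of dplyr (1.0 and later) also offer rows_patch(), which fills NAs in the first table with matching values from the second, keyed by ID; a minimal sketch:
library(dplyr)
# replace the NAs in `a` with the corresponding values from `b`
rows_patch(a, b, by = "ID")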
Joining and then removing rows with an NA should do it. If an ID has non-NA values of x in both tables, then this code will return two rows for that ID, but that is probably the behavior you'd want.
library(dplyr)
full_join(a, b, by = c('ID', 'x')) %>%
  na.omit()
# A tibble: 3 x 2
ID x
<dbl> <dbl>
1 2 5
2 1 7
3 3 4
I recently had to compile a data frame of student scores (one row per student, id column and several integer-valued columns, one per score component). I had to combine a "master" data frame and several "correction" data frames (containing mostly NA and some updates to the master), so that the result contains the maximum values from the master, and all corrections.
I succeeded by copy-pasting a sequence of mutate() calls, which works (see the example below), but is not elegant in my opinion. What I would have wanted to do, instead of copying and pasting, was to use something along the lines of map2 and two lists of columns to compare the columns pair-wise. Something like this (which obviously does not work as such):
list_of_cols1 <- list(col1.x, col2.x, col3.x)
list_of_cols2 <- list(col1.y, col2.y, col3.y)
map2(list_of_cols1, list_of_cols2, ~ column = pmax(.x, .y, na.rm=T))
I can't seem to figure out how to do it. My question is: how do I specify such lists of columns and mutate them in one map2() call in a dplyr pipe, or is it even possible? Have I gotten it all wrong?
Minimum working example
library(tidyverse)
master <- tibble(
id=c(1,2,3),
col1=c(1,1,1),
col2=c(2,2,2),
col3=c(3,3,3)
)
correction1 <- tibble(
id=seq(1,3),
col1=c(NA, NA, 2 ),
col2=c( 1, NA, 3 ),
col3=c(NA, NA, NA)
)
result <- reduce(
  # Ultimately there would be several correction data frames
  list(master, correction1),
  function(x, y) {
    x <- x %>%
      left_join(y, by = c("id")) %>%
      # Wish I knew how to do this mutate call with map2
      mutate(
        col1 = pmax(col1.x, col1.y, na.rm = TRUE),
        col2 = pmax(col2.x, col2.y, na.rm = TRUE),
        col3 = pmax(col3.x, col3.y, na.rm = TRUE)
      ) %>%
      select(id, col1:col3)
  }
)
The result is
> result
# A tibble: 3 x 4
id col1 col2 col3
<int> <dbl> <dbl> <dbl>
1 1 1 2 3
2 2 1 2 3
3 3 2 3 3
Rather than do a left_join, just bind the rows and then summarize. For example:
result <- reduce(
  list(master, correction1),
  function(x, y) {
    bind_rows(x, y) %>%
      group_by(id) %>%
      summarize_all(max, na.rm = TRUE)
  }
)
result
# id col1 col2 col3
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2 3
# 2 2 1 2 3
# 3 3 2 3 3
Actually, you don't even need reduce, as bind_rows can take a list.
Adding another table:
correction2 <- tibble(id=2,col1=NA,col2=8,col3=NA)
bind_rows(master, correction1, correction2) %>%
  group_by(id) %>%
  summarize_all(max, na.rm = TRUE)
Sorry, this doesn't answer your question about map2, but I find it's easier to aggregate over rows than over columns in tidy R:
library(dplyr)
master <- tibble(
id=c(1,2,3),
col1=c(1,1,1),
col2=c(2,2,2),
col3=c(3,3,3)
)
correction1 <- tibble(
id=seq(1,3),
col1=c(NA, NA, 2 ),
col2=c( 1, NA, 3 ),
col3=c(NA, NA, NA)
)
result <- list(master, correction1) %>%
  bind_rows() %>%
  group_by(id) %>%
  summarise_all(max, na.rm = TRUE)
result
#> # A tibble: 3 x 4
#> id col1 col2 col3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 3
#> 2 2 1 2 3
#> 3 3 2 3 3
If correction tables will always have the same structure as master, you can do something like the following:
library(dplyr)
library(purrr)
update_master = function(...) {
  map(list(...), as.matrix) %>%
    reduce(pmax, na.rm = TRUE) %>%
    data.frame()
}
update_master(master, correction1)
To allow id to take character values, make the following modification:
update_master = function(x, ...) {
  map(list(x, ...), function(x) as.matrix(x[-1])) %>%
    reduce(pmax, na.rm = TRUE) %>%
    data.frame(id = x[[1]], .)
}
update_master(master, correction1)
Result:
id col1 col2 col3
1 1 1 2 3
2 2 1 2 3
3 3 2 3 3
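For completeness, here is one way the map2 idea from the original question could be made to work. This is only a sketch that assumes the value columns are known up front; patch_max is a hypothetical helper name:
library(dplyr)
library(purrr)

patch_max <- function(x, y, cols = c("col1", "col2", "col3")) {
  # join, then take the element-wise max of each pair of .x/.y columns
  joined <- left_join(x, y, by = "id", suffix = c(".x", ".y"))
  maxed  <- map2(joined[paste0(cols, ".x")], joined[paste0(cols, ".y")],
                 ~ pmax(.x, .y, na.rm = TRUE))
  bind_cols(joined["id"], as_tibble(set_names(maxed, cols)))
}

reduce(list(master, correction1), patch_max)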
I have the following data.frame (df):
ID1 ID2 Col1 Col2 Col3 Grp
A   B      1    3    6  G1
C   D      3    5    7  G1
E   F      4    5    7  G2
G   h      5    6    8  G2
What I would like to achieve is the following:
- group by Grp (easy)
- then summarize so that, for each group, I sum the numeric columns and build string columns containing all the ID1s and ID2s
It would be something like this:
df %>%
  group_by(Grp) %>%
  summarize(ID1s = toString(ID1), ID2s = toString(ID2),
            Col1 = sum(Col1), Col2 = sum(Col2), Col3 = sum(Col3))
Everything is fine when I know the number of columns (Col1, Col2, Col3); however, I would like to implement it so that it works for a data frame that always has columns named ID1, ID2, and Grp, plus any number of additional numeric columns with unknown names.
Is there a way to do this in dplyr?
You can overwrite the ID columns first and then group by them as well:
df %>%
  group_by(Grp) %>%
  mutate_each(funs(. %>% unique %>% sort %>% toString), ID1, ID2) %>%
  group_by(ID1, ID2, add = TRUE) %>%
  summarise_each(funs(sum))
# Source: local data frame [2 x 6]
# Groups: Grp, ID1 [?]
#
# Grp ID1 ID2 Col1 Col2 Col3
# (chr) (chr) (chr) (int) (int) (int)
# 1 G1 A, C B, D 4 8 13
# 2 G2 E, G F, h 9 11 15
I think you'll want to uniqify and sort before collapsing to a string, so I've added those steps.
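mutate_each()/summarise_each() have since been superseded; with current dplyr (1.0 or later) a rough equivalent using across() might look like this:
library(dplyr)
df %>%
  group_by(Grp) %>%
  summarise(across(c(ID1, ID2), ~ toString(sort(unique(.x)))),
            across(where(is.numeric), sum))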
Using data.table you could try the following:
setDT(df)
sd_cols = 3:(ncol(df) - 1)
merge(df[, .(toString(ID1), toString(ID2)), by = Grp],
      df[, c(-1, -2), with = FALSE][, lapply(.SD, sum), by = Grp],
      by = "Grp")