I'll illustrate my question with an example.
Sample data:
df <- data.frame(ID = c(1, 1, 2, 2, 3, 5), A = c("foo", "bar", "foo", "foo", "bar", "bar"), B = c(1, 5, 7, 23, 54, 202))
df
ID A B
1 1 foo 1
2 1 bar 5
3 2 foo 7
4 2 foo 23
5 3 bar 54
6 5 bar 202
What I want to do is to summarize, by ID, the sum of B and the sum of B when A is "foo". I can do this in a couple steps like:
require(magrittr)
require(dplyr)
df1 <- df %>%
group_by(ID) %>%
summarize(sumB = sum(B))
df2 <- df %>%
filter(A == "foo") %>%
group_by(ID) %>%
summarize(sumBfoo = sum(B))
left_join(df1, df2)
ID sumB sumBfoo
1 1 6 1
2 2 30 30
3 3 54 NA
4 5 202 NA
However, I'm looking for a more elegant/faster way, as I'm dealing with 10gb+ of out-of-memory data in sqlite.
require(sqldf)
my_db <- src_sqlite("my_db.sqlite3", create = T)
df_sqlite <- copy_to(my_db, df)
I thought of using mutate to define a new Bfoo column:
df_sqlite %>%
mutate(Bfoo = ifelse(A=="foo", B, 0))
Unfortunately, this doesn't work on the database end of things.
Error in sqliteExecStatement(conn, statement, ...) :
RS-DBI driver: (error in statement: no such function: IFELSE)
You can do both sums in a single dplyr statement:
df1 <- df %>%
group_by(ID) %>%
summarize(sumB = sum(B),
sumBfoo = sum(B[A=="foo"]))
And here is a data.table version:
library(data.table)
dt = setDT(df)
dt1 = dt[ , .(sumB = sum(B),
sumBfoo = sum(B[A=="foo"])),
by = ID]
dt1
ID sumB sumBfoo
1: 1 6 1
2: 2 30 30
3: 3 54 0
4: 5 202 0
Writing up #hadley's comment as an answer
df_sqlite %>%
group_by(ID) %>%
mutate(Bfoo = if(A=="foo") B else 0) %>%
summarize(sumB = sum(B),
sumBfoo = sum(Bfoo)) %>%
collect
If you want to do counting instead of summarizing, then the answer is somewhat different. The change in code is small, especially in the conditional counting part.
df1 <- df %>%
group_by(ID) %>%
summarize(countB = n(),
countBfoo = sum(A=="foo"))
df1
Source: local data frame [4 x 3]
ID countB countBfoo
1 1 2 1
2 2 2 2
3 3 1 0
4 5 1 0
If you wanted to count the rows, instead of summing them, can you pass a variable to the function:
df1 <- df %>%
group_by(ID) %>%
summarize(RowCountB = n(),
RowCountBfoo = n(A=="foo"))
I get an error both with n() and nrow().
Related
I have a dataframe with two columns per sample (n > 1000 samples):
df <- data.frame(
"sample1.a" = 1:5, "sample1.b" = 2,
"sample2.a" = 2:6, "sample2.b" = c(1, 3, 3, 3, 3),
"sample3.a" = 3:7, "sample3.b" = 2)
If there is a zero in column .b, the correspsonding value in column .a should be set to NA.
I thought to write a function over colnames (without suffix) to filter each pair of columns and conditional exchaning values. Is there a simpler approach based on tidyverse?
We can split the data.frame into a list of data.frames and do the replacement in base R
df1 <- do.call(cbind, lapply(split.default(df,
sub("\\..*", "", names(df))), function(x) {
x[,1][x[2] == 0] <- NA
x}))
Or another option is Map
acols <- endsWith(names(df), "a")
bcols <- endsWith(names(df), "b")
df[acols] <- Map(function(x, y) replace(x, y == 0, NA), df[acols], df[bcols])
Or if the columns are alternate with 'a', 'b' columns, use a logical index for recycling, create the logical matrix with 'b' columns and assign the corresponding values in 'a' columns to NA
df[c(TRUE, FALSE)][df[c(FALSE, TRUE)] == 0] <- NA
or an option with tidyverse by reshaping into 'long' format (pivot_longer), changing the 'a' column to NA if there is a correspoinding 0 in 'a', and reshape back to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_sep="\\.",
names_to = c('group', '.value')) %>%
mutate(a = na_if(b, a == 0)) %>%
pivot_wider(names_from = group, values_from = c(a, b)) %>%
select(-rn)
# A tibble: 5 x 6
# a_sample1 a_sample2 a_sample3 b_sample1 b_sample2 b_sample3
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 2 1 2 2 1 2
#2 2 3 2 2 3 2
#3 2 3 2 2 3 2
#4 2 3 2 2 3 2
#5 2 3 2 2 3 2
I am trying to extract some data from a number of excel spreadsheets that do not have a tidy format. I think I need to run lapply within lapply, but can't seem to make it work. Here is an example:
Here are two dataframes with formats equivalent to what i find in the excel sheets:
library('dplyr')
library('tidyr')
library('readxl')
df1 <- data.frame(instance = c('...', 'A', 'B'),
`1990.1` = c('est', 1, 2),
`1990.2` = c('val', 2, 3),
`1991.1` = c('est', 3, 4),
`1991.2` = c('val', 4, 5))
df2 <- data.frame(instance = c('...', 'A', 'B'),
`1990.1` = c('est', 5, 6),
`1990.2` = c('val', 6, 7),
`1991.1` = c('est', 7, 8),
`1991.2` = c('val', 8, 9))
> df1
instance X1990.1 X1990.2 X1991.1 X1991.2
1 ... est val est val
2 A 1 2 3 4
3 B 2 3 4 5
I create a function to clean the data based off:
df1 %>%
select(1, which(.[1,] == 'est')) %>%
.[-1,] %>%
gather(key = year, value = score, -instance) %>%
mutate(var = 'est')
instance year score var
1 A X1990.1 1 est
2 B X1990.1 2 est
3 A X1991.1 3 est
4 B X1991.1 4 est
Gives:
data_clean <- function(x) {
df1 %>%
select(1, which(.[1,] == x)) %>%
.[-1,] %>%
gather(key = year, value = score, -instance) %>%
mutate(var = x)
}
I can now generate a clean version of each df as follows:
do.call(rbind, lapply(c('est', 'val'), data_clean)) %>%
mutate(origin = 'df1')
instance year score var origin
1 A X1990.1 1 est df1
2 B X1990.1 2 est df1
3 A X1991.1 3 est df1
4 B X1991.1 4 est df1
5 A X1990.2 2 val df1
6 B X1990.2 3 val df1
7 A X1991.2 4 val df1
8 B X1991.2 5 val df1
What I now need to do is apply this to the list of dataframes:
list_data <- list(df1, df2)
In my case i would generate this from a function:
data_pull <- function(x) {
read_excel('path/to/file', sheet = x)
}
list_data <- lapply(2:20, data_pull)
But I can't think how to do this. I need to apply data_clean to each element of the list generated by data_pull. I obviously need to remove the first call to df in the data_clean function, but then what object am i passing to data_clean?
What I eventually want is a single data frame with all the data in one place, in a tidy format.
Sorry if i am missing something simple here. I feel there is lots of data that is structured like this and the solution for cleaning it should be fairly simple.I can't seem to think of it.
An option is to keep it in a list and loop over the list with map. We can rename the columns by pasteing the 1st row for all those columns except the 'instance', slice out the first row, use pivot_longer to reshape from 'wide' to 'long', separate the 'name' column into two, and convert the type if needed.
library(dplyr)
library(tidyr)
library(purrr)
library(readr)
library(stringr)
f1 <- function(dat) {
names(dat)[-1] <- str_c(names(dat)[-1], unlist(dat[1,-1]), sep="_")
dat %>%
slice(-1) %>%
pivot_longer(cols = -instance, values_to = "seq" ) %>%
mutate_all(as.character) %>%
separate(name, into = c('year', 'var'), sep="_", convert = TRUE) %>%
type_convert()
}
map_dfr(set_names(list_data, c('df1', 'df2')), f1, .id = 'origin')
# A tibble: 16 x 5
# origin instance year var seq
# <chr> <chr> <chr> <chr> <dbl>
# 1 df1 A X1990.1 est 1
# 2 df1 A X1990.2 val 2
# 3 df1 A X1991.1 est 3
# 4 df1 A X1991.2 val 4
# 5 df1 B X1990.1 est 2
# 6 df1 B X1990.2 val 3
# 7 df1 B X1991.1 est 4
# 8 df1 B X1991.2 val 5
# 9 df2 A X1990.1 est 5
#10 df2 A X1990.2 val 6
#11 df2 A X1991.1 est 7
#12 df2 A X1991.2 val 8
#13 df2 B X1990.1 est 6
#14 df2 B X1990.2 val 7
#15 df2 B X1991.1 est 8
#16 df2 B X1991.2 val 9
If we are using the function data_pull
map_dfr(2:20, ~ data_pull(.x) %>%
f1, .id = 'origin')
i have a key in tableA and in tableB i have key and numeric. How can i achieve formula excel sumifs(numeric,tableB.key,tableA.key,tableA.key,1)
with dplyr without join the two table
i already tried summarise_if within mutate
mutate(newColumn = summarise_if(tableB, .predicate = tableB$Key == .$Key, .funs = sum(tableB$numeric)))
but i get this error
In tableB$Key == .$Key:
longer object length is not a multiple of shorter object length
tableA tableB
key key numeric
1 1 10
2 1 30
3
4
Expected
key newColumn
1 40
2
3
4
you could try
library(tidyverse)
tableA <- tibble(key = c(1, 2, 3, 4))
tableB <- tibble(key = c(1, 1, 2, 2),
numeric = c(10, 30, 10, 15))
(function(){
tmpDF <- tableB %>%
filter(key %in% tableA$key) %>%
group_by(key) %>%
summarise(newColumn = sum(numeric))
tableA %>%
mutate(new = ifelse(key == tmpDF$key, tmpDF$newColumn, 0)
)
})()
which gives
# A tibble: 4 x 2
# key new
# <dbl> <dbl>
# 1 40
# 2 25
# 3 0
# 4 0
There is my problem that I can't solve it:
Data:
df <- data.frame(f1=c("a", "a", "b", "b", "c", "c", "c"),
v1=c(10, 11, 4, 5, 0, 1, 2))
data.frame:f1 is factor
f1 v1
a 10
a 11
b 4
b 5
c 0
c 1
c 2
# What I want is:(for example, fetch data with the number of element of some level == 2, then to data.frame)
a b
10 4
11 5
Thanks in advance!
I might be missing something simple here , but the below approach using dplyr works.
library(dplyr)
nlevels = 2
df1 <- df %>%
add_count(f1) %>%
filter(n == nlevels) %>%
select(-n) %>%
mutate(rn = row_number()) %>%
spread(f1, v1) %>%
select(-rn)
This gives
# a b
# <int> <int>
#1 10 NA
#2 11 NA
#3 NA 4
#4 NA 5
Now, if you want to remove NA's we can do
do.call("cbind.data.frame", lapply(df1, function(x) x[!is.na(x)]))
# a b
#1 10 4
#2 11 5
As we have filtered the dataframe which has only nlevels observations, we would have same number of rows for each column in the final dataframe.
split might be useful here to split df$v1 into parts corresponding to df$f1. Since you are always extracting equal length chunks, it can then simply be combined back to a data.frame:
spl <- split(df$v1, df$f1)
data.frame(spl[lengths(spl)==2])
# a b
#1 10 4
#2 11 5
Or do it all in one call by combining this with Filter:
data.frame(Filter(function(x) length(x)==2, split(df$v1, df$f1)))
# a b
#1 10 4
#2 11 5
Here is a solution using unstack :
unstack(
droplevels(df[ave(df$v1, df$f1, FUN = function(x) length(x) == 2)==1,]),
v1 ~ f1)
# a b
# 1 10 4
# 2 11 5
A variant, similar to #thelatemail's solution :
data.frame(Filter(function(x) length(x) == 2, unstack(df,v1 ~ f1)))
My tidyverse solution would be:
library(tidyverse)
df %>%
group_by(f1) %>%
filter(n() == 2) %>%
mutate(i = row_number()) %>%
spread(f1, v1) %>%
select(-i)
# # A tibble: 2 x 2
# a b
# * <dbl> <dbl>
# 1 10 4
# 2 11 5
or mixing approaches :
as_tibble(keep(unstack(df,v1 ~ f1), ~length(.x) == 2))
Using all base functions (but you should use tidyverse)
# Add count of instances
x$len <- ave(x$v1, x$f1, FUN = length)
# Filter, drop the count
x <- x[x$len==2, c('f1','v1')]
# Hacky pivot
result <- data.frame(
lapply(unique(x$f1), FUN = function(y) x$v1[x$f1==y])
)
colnames(result) <- unique(x$f1)
> result
a b
1 10 4
2 11 5
I'd like code this, may it helps for you
library(reshape2)
library(dplyr)
aa = data.frame(v1=c('a','a','b','b','c','c','c'),f1=c(10,11,4,5,0,1,2))
cc = aa %>% group_by(v1) %>% summarise(id = length((v1)))
dd= merge(aa,cc) #get the level
ee = dd[dd$aa==2,] #select number of level equal to 2
ee$id = rep(c(1,2),nrow(ee)/2) # reset index like (1,2,1,2)
dcast(ee, id~v1,value.var = 'f1')
all done!
Consider the following two data.frames:
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)])
I would like to remove the exact rows of a1 that are in a2 so that the result should be:
A B
4 d
5 e
4 d
2 b
Note that one row with 2 b in a1 is retained in the final result. Currently, I use a looping statement, which becomes extremely slow as I have many variables and thousands of rows in my data.frames. Is there any built-in function to get this result?
The idea is, add a counter for duplicates to each file, so you can get a unique match for each occurrence of a row. Data table is nice because it is easy to count the duplicates (with .N), and it also gives the necessary function (fsetdiff) for set operations.
library(data.table)
a1 <- data.table(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.table(A = c(1:3,2), B = letters[c(1:3,2)])
# add counter for duplicates
a1[, i := 1:.N, .(A,B)]
a2[, i := 1:.N, .(A,B)]
# setdiff gets the exception
# "all = T" allows duplicate rows to be returned
fsetdiff(a1, a2, all = T)
# A B i
# 1: 4 d 1
# 2: 5 e 1
# 3: 4 d 2
# 4: 2 b 3
You could use dplyr to do this. I set stringsAsFactors = FALSE to get rid of warnings about factor mismatches.
library(dplyr)
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)], stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)], stringsAsFactors = FALSE)
## Make temp variables to join on then delete later.
# Create a row number
a1_tmp <-
a1 %>%
group_by(A, B) %>%
mutate(tmp_id = row_number()) %>%
ungroup()
# Create a count
a2_tmp <-
a2 %>%
group_by(A, B) %>%
summarise(count = n()) %>%
ungroup()
## Keep all that have no entry int a2 or the id > the count (i.e. used up a2 entries).
left_join(a1_tmp, a2_tmp, by = c('A', 'B')) %>%
ungroup() %>% filter(is.na(count) | tmp_id > count) %>%
select(-tmp_id, -count)
## # A tibble: 4 x 2
## A B
## <dbl> <chr>
## 1 4 d
## 2 5 e
## 3 4 d
## 4 2 b
EDIT
Here is a similar solution that is a little shorter. This does the following: (1) add a column for row number to join both data.frame items (2) a temporary column in a2 (2nd data.frame) that will show up as null in the join to a1 (i.e. indicates it's unique to a1).
library(dplyr)
left_join(a1 %>% group_by(A,B) %>% mutate(rn = row_number()) %>% ungroup(),
a2 %>% group_by(A,B) %>% mutate(rn = row_number(), tmpcol = 0) %>% ungroup(),
by = c('A', 'B', 'rn')) %>%
filter(is.na(tmpcol)) %>%
select(-tmpcol, -rn)
## # A tibble: 4 x 2
## A B
## <dbl> <chr>
## 1 4 d
## 2 5 e
## 3 4 d
## 4 2 b
I think this solution is a little simpler (perhaps very little) than the first.
I guess this is similar to DWal's solution but in base R
a1_temp = Reduce(paste, a1)
a1_temp = paste(a1_temp, ave(seq_along(a1_temp), a1_temp, FUN = seq_along))
a2_temp = Reduce(paste, a2)
a2_temp = paste(a2_temp, ave(seq_along(a2_temp), a2_temp, FUN = seq_along))
a1[!a1_temp %in% a2_temp,]
# A B
#4 4 d
#5 5 e
#7 4 d
#8 2 b
Here's another solution with dplyr:
library(dplyr)
a1 %>%
arrange(A) %>%
group_by(A) %>%
filter(!(paste0(1:n(), A, B) %in% with(arrange(a2, A), paste0(1:n(), A, B))))
Result:
# A tibble: 4 x 2
# Groups: A [3]
A B
<dbl> <fctr>
1 2 b
2 4 d
3 4 d
4 5 e
This way of filtering avoids creating extra unwanted columns that you have to later remove in the final output. This method also sorts the output. Not sure if it's what you want.