I've already asked similar quistion, but I stated it incorrectly, so answers doesn't help me. Again, I have two data sets, The first one is like this:
df1 <- data.frame(id=c(111,111,111,222,222,333,333,333,333),
type=c("a","b","a","d","b","c","c","b","b"),
var=c(1,0,1,0,1,1,1,1,1))
df1
id type var
1 111 a 1
2 111 b 0
3 111 a 1
4 222 d 0
5 222 b 1
6 333 c 1
7 333 c 1
8 333 b 1
9 333 b 1
The second is like this:
df2
id A B
1 111
2 222
3 333
I need to fill the empty cells in such a way that A is the sum of var with type a or b, B is the sum of var with type c or d for each id. The result should be like this:
df2
id A B
1 111 2 0
2 222 1 0
3 333 2 2
It's important to fill this very data frame (df2), not create the new one
It's really just aggregation plus reshaping to wide form:
library(tidyverse)
# set grouping, edit var to A/B form
df1 %>% group_by(id, type = ifelse(type %in% c('a', 'b'), 'A', 'B')) %>%
summarise(var = sum(var)) %>%
spread(type, var, fill = 0) # reshape to wide
## Source: local data frame [3 x 3]
## Groups: id [3]
##
## id A B
## * <dbl> <dbl> <dbl>
## 1 111 2 0
## 2 222 1 0
## 3 333 2 2
You could create A and B in summarise if you subset var, but the code is more repetitive.
In base R,
df2 <- df1
df2$type <- ifelse(df2$type %in% c('a', 'b'), 'A', 'B')
df2 <- aggregate(var ~ id + type, df2, sum)
df2 <- reshape(df2, timevar = 'type', direction = 'wide')
df2[is.na(df2)] <- 0L
names(df2) <- sub('var\\.', '', names(df2))
df2
## id A B
## 1 111 2 0
## 2 222 1 0
## 3 333 2 2
We can do this in a single line in base R (without using any external packages)
transform(as.data.frame.matrix(xtabs(var~id+type, df1)), A= a+b, B = c+d)[-(1:4)]
# A B
#111 2 0
#222 1 0
#333 2 2
Related
I have a dataframe with groups, and a vector containing all possible values. I want to expand the dataframe that each group will have all of the values in vector. If the elements were not present before, it can be filled by zero.
Dummy example:
dd <- data.frame(trip_n = c(1,1,2),
sp = c('a', 'b', 'd'),
val = c(4,3,3))
sp = c('a', 'b', 'd', 'e')
Expected output, where the missing elements from vector are filled in by 0:
trip_n sp val
1 1 a 4
2 1 b 3
3 1 d 0
4 1 e 0
5 2 a 0
6 2 b 0
7 2 d 3
8 2 e 0
I wonder, how to accomplsh this? THis is waht I tried, but neither of them leads to expected outcome. Somehow I need to include grouping, but how? THanks you!
expand.grid(dd$sp, sp)
merge(dd, sp) %>%
arrange(trip_n)
left_join(data.frame(sp), dd)
We could use complete - as the object name sp is the same as the column name, either get the object from the global env (.env) or escape (!!) - (complete(dd, trip_n, sp = !!sp, fill = list(val = 0)))
library(tidyr)
complete(dd, trip_n, sp = .env$sp, fill = list(val = 0))
-output
# A tibble: 8 × 3
trip_n sp val
<dbl> <chr> <dbl>
1 1 a 4
2 1 b 3
3 1 d 0
4 1 e 0
5 2 a 0
6 2 b 0
7 2 d 3
8 2 e 0
This may have solutions/answers available here, but I am unable to find.
Let us assume a simple data like this
x <- data.frame(id = rep(1:3, each = 2),
v1 = c('A', 'B', 'A', 'B', 'A', 'C'))
> x
id v1
1 1 A
2 1 B
3 2 A
4 2 B
5 3 A
6 3 C
Now I want an output of relation of V1 column with itself, but across group on id something like this
v1 A B C
1 A 0 2 1
2 B 2 0 0
3 C 1 0 0
So, I proceeded like this..
library(tidyverse)
#merged the V1 column by itself with all = TRUE
x <- merge(x, x, by = "id", all = T)
# removed same group rows
x <- x[x$v1.x != x$v1.y, ]
# final code
x %>% select(-id) %>%
group_by(v1.x, v1.y) %>%
summarise(val = n()) %>%
pivot_wider(names_from = v1.y, values_from = val, values_fill = 0L, names.sort = T)
# A tibble: 3 x 4
# Groups: v1.x [3]
v1.x A B C
<chr> <int> <int> <int>
1 A 0 2 1
2 B 2 0 0
3 C 1 0 0
My question is that any better/direct method to obtain the cross-table?
How about creating a contingency table with xtabs (which can work with large data sets as well). Then, you can use crossprod on the table and set the diagonal to zero for the final result.
ct <- xtabs(~ id + v1, data = x)
cp <- crossprod(ct, ct)
diag(cp) <- 0
cp
Instead of xtabs you can create a cross-table with simply table as well. As noted by #A5C1D2H2I1M1N2O1R2T1, you can simplify to a nice one-liner equivalent:
"diag<-"(crossprod(table(x)), 0)
Output
v1
v1 A B C
A 0 2 1
B 2 0 0
C 1 0 0
I have a dataframe that I want to update with information from another dataframe, a lookup dataframe.
In particular, I'd like to update the cells of df1$value with the cells of df2$value based on the columns id and id2.
If the cell of df1$value is NA, I know how to do it using the package data.table
BUT
If the cell of df1$value is not empty, data.table will update it with the cell of df2$value anyway.
I don't want that. I'd like to have that:
IF the cell of df1$value is NOT empty (in this case the row in which df1$id is c), do not update the cell but create a duplicate row of df1 in which the cell of df1$value takes the value from the cell of df2$value
I already looked for solutions online but I couldn't find any. Is there a way to do it easily with tidyverse or data.table or an sql-like package?
Thank you for your help!
edit: I've just realized that I forgot to put the corner case in which in both dataframes the row is NA. With the replies I had so far (07/08/19 14:42) the row e is removed from the last dataframe. But I really need to keep it!
Outline:
> df1
id id2 value
1 a 1 100
2 b 2 101
3 c 3 50
4 d 4 NA
5 e 5 NA
> df2
id id2 value
1 c 3 200
2 d 4 201
3 e 5 NA
# I'd like:
> df5
id id2 value
1 a 1 100
2 b 2 101
3 c 3 50
4 c 3 200
5 d 4 201
6 e 5 NA
This is how I managed to solve my problem but it's quite cumbersome.
# I create the dataframes
df1 <- data.frame(id=c('a', 'b', 'c', 'd'), id2=c(1,2,3,4),value=c(100, 101, 50, NA))
df2 <- data.frame(id=c('c', 'd', 'e'),id2=c(3,4, 5), value=c(200, 201, 300))
# I first do a left_join so I'll have two value columnes: value.x and value.y
df3 <- dplyr::left_join(df1, df2, by = c("id","id2"))
# > df3
# id id2 value.x value.y
# 1 a 1 100 NA
# 2 b 2 101 NA
# 3 c 3 50 200
# 4 d 4 NA 201
# I keep only the rows in which value.x is NA, so the 4th row
df4 <- df3 %>%
filter(is.na(value.x)) %>%
dplyr::select(id, id2, value.y)
# > df4
# id id2 value.y
# 1 d 4 201
# I rename the column "value.y" to "value". (I don't do it with dplyr because the function dplyr::replace doesn't work in my R version)
colnames(df4)[colnames(df4) == "value.y"] <- "value"
# > df4
# id id2 value
# 1 d 4 201
# I update the df1 with the df4$value. This step is necessary to update only the rows of df1 in which df1$value is NA
setDT(df1)[setDT(df4), on = c("id","id2"), `:=`(value = i.value)]
# > df1
# id id2 value
# 1: a 1 100
# 2: b 2 101
# 3: c 3 50
# 4: d 4 201
# I filter only the rows in which both value.x and value.y are NAs
df3 <- as_tibble(df3) %>%
filter(!is.na(value.x), !is.na(value.y)) %>%
dplyr::select(id, id2, value.y)
# > df3
# # A tibble: 1 x 3
# id id2 value.y
# <chr> <dbl> <dbl>
# 1 c 3 200
# I rename column df3$value.y to value
colnames(df3)[colnames(df3) == "value.y"] <- "value"
# I bind by rows df1 and df3 and I order by the column id
df5 <- rbind(df1, df3) %>%
arrange(id)
# > df5
# id id2 value
# 1 a 1 100
# 2 b 2 101
# 3 c 3 50
# 4 c 3 200
# 5 d 4 201
A left join with data.table:
library(data.table)
setDT(df1); setDT(df2)
df2[df1, on=.(id, id2), .(value =
if (.N == 0) i.value
else na.omit(c(i.value, x.value))
), by=.EACHI]
id id2 value
1: a 1 100
2: b 2 101
3: c 3 50
4: c 3 200
5: d 4 201
How it works: The syntax is x[i, on=, j, by=.EACHI]: for each row of i = df1 do j.
In this case j = .(value = expr) where .() is a shortcut to list() since in general j should return a list of columns.
Regarding the expression, .N is the number of rows of x = df2 that are found for each row of i = df1, so if no matches are found we keep values from i; and otherwise we keep values from both tables, dropping missing values.
A dplyr way:
bind_rows(df1, semi_join(df2, df1, by=c("id", "id2"))) %>%
group_by(id, id2) %>%
do(if (nrow(.) == 1) . else na.omit(.))
# A tibble: 5 x 3
# Groups: id, id2 [4]
id id2 value
<chr> <dbl> <dbl>
1 a 1 100
2 b 2 101
3 c 3 50
4 c 3 200
5 d 4 201
Comment. The dplyr way is kind of awkward because do() is needed to get a dynamically determined number of rows, but do() is typically discouraged and does not support n() and other helper functions. The data.table way is kind of awkward because there is no simple semi join functionality.
Data:
df1 <- data.frame(id=c('a', 'b', 'c', 'd'), id2=c(1,2,3,4),value=c(100, 101, 50, NA))
df2 <- data.frame(id=c('c', 'd', 'e'),id2=c(3,4, 5), value=c(200, 201, 300))
> df1
id id2 value
1 a 1 100
2 b 2 101
3 c 3 50
4 d 4 NA
> df2
id id2 value
1 c 3 200
2 d 4 201
3 e 5 300
Another idea via base R is to remove the rows from df2 that do not match in df1, bind the two data frames rowwise (rbind) and omit the NAs, i.e.
na.omit(rbind(df1, df2[do.call(paste, df2[1:2]) %in% do.call(paste, df1[1:2]),]))
# id id2 value
#1 a 1 100
#2 b 2 101
#3 c 3 50
#5 c 3 200
#6 d 4 201
To answer your new requirements, we can keep the same rbind method and filter based on your conditions, i.e.
dd <- rbind(df1, df2[do.call(paste, df2[1:2]) %in% do.call(paste, df1[1:2]),])
dd[!!with(dd, ave(value, id, id2, FUN = function(i)(all(is.na(i)) & !duplicated(i)) | !is.na(i))),]
# id id2 value
#1 a 1 100
#2 b 2 101
#3 c 3 50
#5 e 5 NA
#6 c 3 200
#7 d 4 201
A possible approach with data.table using update join then full outer merge:
merge(df1[is.na(value), value := df2[.SD, on=.(id, id2), x.value]], df2, all=TRUE)
output:
id id2 value
1: a 1 100
2: b 2 101
3: c 3 50
4: c 3 200
5: d 4 201
6: e 5 NA
data:
library(data.table)
df1 <- data.table(id=c('a', 'b', 'c', 'd', 'e'), id2=c(1,2,3,4,5),value=c(100, 101, 50, NA, NA))
df2 <- data.table(id=c('c', 'd', 'e'), id2=c(3,4, 5), value=c(200, 201, NA))
Here is one way using full_join and gather
library(dplyr)
left_join(df1, df2, by = c("id","id2")) %>%
tidyr::gather(key, value, starts_with("value"), na.rm = TRUE) %>%
select(-key)
# id id2 value
#1 a 1 100
#2 b 2 101
#3 c 3 50
#7 c 3 200
#8 d 4 201
For the updated case, we can do
left_join(df1, df2, by = c("id","id2")) %>%
tidyr::gather(key, value, starts_with("value")) %>%
group_by(id, id2) %>%
filter((all(is.na(value)) & !duplicated(value)) | !is.na(value)) %>%
select(-key)
# id id2 value
# <chr> <int> <int>
#1 a 1 100
#2 b 2 101
#3 c 3 50
#4 e 5 NA
#5 c 3 200
#6 d 4 201
Slightly difficult to phrase, as far as I saw none of the similar questions answered my problem.
I have a data.frame such as:
df1 <- data.frame(id = rep(c("a", "b","c"), each = 4),
val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
df1
id val
1 a NA
2 a NA
3 a NA
4 a NA
5 b 1
6 b 2
7 b 2
8 b 3
9 c NA
10 c 2
11 c NA
12 c 3
and I want to get rid of all the NA values (easy enough using e.g. filter() ) but make sure that if this removes all of one id value (in this case it removes every instance of "a") that one extra row is inserted of (e.g.) a = 0
so that:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c 2
7 c 3
obviously easy enough to do this in a roundabout way but I was wondering if there's a tidy/elegant way to do this. I thought tidyr::complete() might help but not entirely sure how to apply it to a case like this
I don't care about the order of the rows
Cheers!
edit: updated with clearer desired output. might make desired answers submitted before that a bit less clear
Another idea using dplyr,
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(val = ifelse(row_number() == 1 & all(is.na(val)), 0, val)) %>%
na.omit()
which gives,
# A tibble: 5 x 2
# Groups: id [2]
id val
<fct> <dbl>
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
We may do
df1 %>% group_by(id) %>% do(if(all(is.na(.$val))) replace(.[1, ], 2, 0) else na.omit(.))
# A tibble: 5 x 2
# Groups: id [2]
# id val
# <fct> <dbl>
# 1 a 0
# 2 b 1
# 3 b 2
# 4 b 2
# 5 b 3
After grouping by id, if everything in val is NA, then we leave only the first row with the second element replaced by 0, otherwise the same data is returned after applying na.omit.
In a more readable format that would be
df1 %>% group_by(id) %>%
do(if(all(is.na(.$val))) data.frame(id = .$id[1], val = 0) else na.omit(.))
(Here I presume that you indeed want to get rid of all NA values; otherwise there is no need for na.omit.)
df1[is.na(df1)] <- 0
df1[!(duplicated(df1$id) & df1$val == 0), ]
id val
1 a 0
5 b 1
6 b 2
7 b 2
8 b 3
Base R option is to find groups with all NAs and transform them by changing their val to 0 and select only unique rows so that there is only one row per group. We rbind this dataframe with the groups which are !all_NA.
all_NA <- with(df1, ave(is.na(val), id, FUN = all))
rbind(unique(transform(df1[all_NA, ], val = 0)), df1[!all_NA, ])
# id val
#1 a 0
#5 b 1
#6 b 2
#7 b 2
#8 b 3
dplyr option looks ugly but one way is to make two groups of dataframes one with groups of all NA values and other with groups of all non-NA values. For groups with all NA values we add row with it's id and val as 0 and bind this to the other group.
library(dplyr)
bind_rows(df1 %>%
group_by(id) %>%
filter(all(!is.na(val))),
df1 %>%
group_by(id) %>%
filter(all(is.na(val))) %>%
ungroup() %>%
summarise(id = unique(id),
val = 0)) %>%
arrange(id)
# id val
# <fct> <dbl>
#1 a 0
#2 b 1
#3 b 2
#4 b 2
#5 b 3
Changed the df to make example more exhaustive -
df1 <- data.frame(id = rep(c("a", "b","c"), each = 4),
val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(case=sum(is.na(val))==n(), row_num=row_number() ) %>%
mutate(val=ifelse(is.na(val)&case,0,val)) %>%
filter( !(case&row_num!=1) ) %>%
select(id, val)
Output
id val
<fct> <dbl>
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c NA
7 c 2
8 c NA
9 c 3
Another base approach, one that doesn't maintain the order of the rows and takes advantage of factors remembering lost values:
df1 <- na.omit(df1)
df1 <- rbind(
df1,
data.frame(
id = levels(df1$id)[!levels(df1$id) %in% df1$id],
val = 0)
)
I do personally prefer the dplyr approach given by Sotos, as I don't like rbind-ing data.frames back together so it's a matter of taste, but this isn't unbearably complicated by my eye. It's easy enough to adapt to a character id column with a unique(df1$id) variable.
Here is an option too:
df1 %>%
mutate_if(is.factor,as.character) %>%
mutate_all(funs(replace(.,is.na(.),0))) %>%
slice(4:nrow(.))
This gives:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
Alternative:
df1 %>%
mutate_if(is.factor,as.character) %>%
mutate_all(funs(replace(.,is.na(.),0))) %>%
unique()
UPDATE based on other requirements:
Some users suggested to test on this dataframe. Of course this answer assumes you'll look at everything by hand. Might be less useful if you have to look at everything by "hand" but here goes:
df1 <- data.frame(id = rep(c("a", "b","c"), each = 4), val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
df1 %>%
mutate_if(is.factor,as.character) %>%
mutate(val=ifelse(id=="a",0,val)) %>%
slice(4:nrow(.))
This yields:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c NA
7 c 2
8 c NA
9 c 3
Here is a base R solution.
res <- lapply(split(df1, df1$id), function(DF){
if(anyNA(DF$val)) {
i <- is.na(DF$val)
DF$val[i] <- 0
DF <- rbind(DF[i & !duplicated(DF[i, ]), ], DF[!i, ])
}
DF
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
# id val
#1 a 0
#2 b 1
#3 b 2
#4 b 2
#5 b 3
Edit.
A dplyr solution could be the following.
It was tested with the original dataset posted by the OP, with the dataset in Vivek Kalyanarangan's answer and with the dataset in markus' comment, renamed df2 and df3, respectively.
library(dplyr)
na2zero <- function(DF){
DF %>%
group_by(id) %>%
mutate(val = ifelse(is.na(val), 0, val),
crit = val == 0 & duplicated(val)) %>%
filter(!crit) %>%
select(-crit)
}
na2zero(df1)
na2zero(df2)
na2zero(df3)
One may try this :
df1 = data.frame(id = rep(c("a", "b","c"), each = 4),
val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
df1
# id val
#1 a NA
#2 a NA
#3 a NA
#4 a NA
#5 b 1
#6 b 2
#7 b 2
#8 b 3
#9 c NA
#10 c 2
#11 c NA
#12 c 3
Task is to remove all rows corresponding to any id IFF val for the corresponding id is all NAs and add new row with this id and val = 0.
In this example, id = a.
Note : val for c also has NAs but all the val corresponding to c are not NA therefore we need to remove the corresponding row for c where val = NA.
So lets create another column say, val2 which indicates 0 means its all NAs and 1 otherwise.
library(dplyr)
df1 = df1 %>%
group_by(id) %>%
mutate(val2 = if_else(condition = all(is.na(val)),true = 0, false = 1))
df1
# A tibble: 12 x 3
# Groups: id [3]
# id val val2
# <fct> <dbl> <dbl>
#1 a NA 0
#2 a NA 0
#3 a NA 0
#4 a NA 0
#5 b 1 1
#6 b 2 1
#7 b 2 1
#8 b 3 1
#9 c NA 1
#10 c 2 1
#11 c NA 1
#12 c 3 1
Get the list of ids with corresponding val = NA for all.
all_na = unique(df1$id[df1$val2 == 0])
Then remove theids from the dataframe df1 with val = NA.
df1 = na.omit(df1)
df1
# A tibble: 6 x 3
# Groups: id [2]
# id val val2
# <fct> <dbl> <dbl>
# 1 b 1 1
# 2 b 2 1
# 3 b 2 1
# 4 b 3 1
# 5 c 2 1
# 6 c 3 1
And create a new dataframe with ids in all_na and val = 0
all_na_df = data.frame(id = all_na, val = 0)
all_na_df
# id val
# 1 a 0
then combine these two dataframes.
df1 = bind_rows(all_na_df, df1[,c('id', 'val')])
df1
# id val
# 1 a 0
# 2 b 1
# 3 b 2
# 4 b 2
# 5 b 3
# 6 c 2
# 7 c 3
Hope this helps and Edits are most welcomed :-)
I need to fill $Year with missing values of the sequence by the factor of $Country. The $Count column can just be padded out with 0's.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to #PoGibas' answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".
Another base R idea can be to split on Country, use setdiff to find the missing values from the seq(max(Year)), and rbind them to original data frame. Use do.call to rbind the list back to a data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
x <- rbind(i, data.frame(Country = i$Country[1],
Year = setdiff(seq(max(i$Year)), i$Year),
Count = 0));
x[with(x, order(Year)),]}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
> setkey(DT,Country,Year)
> DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(Country) %>%
do(data_frame(Country = unique(.$Country),
Year = full_seq(.$Year, 1))) %>%
full_join(dt, by = c("Country", "Year")) %>%
replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq, to calculate year sequences. Then constructs a data.frame from the named list that is returned, merges this onto the original which adds the desired rows, and finally fills in missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1