I want to add the factor that happened earlier as a new column.
This is my data
df <- data.frame (id =c(1,1,2,2,1), date= c(20161002,20151019, 20160913, 20161117, 20160822), factor = c("A" , "B" ,"C" ,"D" ,"H"))
and I want to have an additional column that shows the immediately preceding factor for the same id. So, my ideal output is:
id date factor col2
1 1 20161002 A H
2 1 20151019 B NA
3 2 20160913 C NA
4 2 20161117 D C
5 1 20160822 H B
For instance, for id 1 in the first row the previous factor happened on 20160822 and its value was H.
What I tried does not take the previous date into account:
library (dplyr)
library(zoo)
df %>% mutate(col2 = na.locf(factor))
You can do this with data.table:
library(data.table)
df$date = as.Date(as.character(df$date),"%Y%m%d")
setDT(df)
setorder(df,id,date)
df[, "col2" := shift(factor), by = .(id)]
id date factor col2
1: 1 2015-10-19 B NA
2: 1 2016-08-22 H B
3: 1 2016-10-02 A H
4: 2 2016-09-13 C NA
5: 2 2016-11-17 D C
We can use dplyr. Convert the numeric date to Date class. Then we group by id, sort by date using arrange, and select the previous factor within each group using lag.
df$date <- as.Date(as.character(df$date), "%Y%m%d")
library(dplyr)
df %>%
group_by(id) %>%
arrange(date) %>%
mutate(col2 = lag(factor))
# id date factor col2
# <dbl> <date> <fctr> <fctr>
#1 1 2015-10-19 B NA
#2 1 2016-08-22 H B
#3 2 2016-09-13 C NA
#4 1 2016-10-02 A H
#5 2 2016-11-17 D C
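Both answers above reorder the rows by date, while the desired output in the question keeps the original row order. A minimal sketch of a variant that leaves the rows where they are, assuming dplyr is installed: lag() accepts an order_by argument, so the sort can happen inside the lag instead of on the whole data frame. The name res is only illustrative.

```r
library(dplyr)

# Data from the question; stringsAsFactors = FALSE keeps `factor` as character
df <- data.frame(id = c(1, 1, 2, 2, 1),
                 date = c(20161002, 20151019, 20160913, 20161117, 20160822),
                 factor = c("A", "B", "C", "D", "H"),
                 stringsAsFactors = FALSE)

res <- df %>%
  mutate(date = as.Date(as.character(date), format = "%Y%m%d")) %>%
  group_by(id) %>%
  # previous value within each id, ordered by date, without moving any rows
  mutate(col2 = dplyr::lag(factor, order_by = date)) %>%
  ungroup()

res
```

Here col2 is matched to each row by date within its id, so the rows stay in the order they had in df.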
I have a huge dataset of a beetles counting experiment with the following exemplary structure:
species_name1 <- c("A", "A", "A", "A", "B") # two factors for name1
species_name2 <- c("a", "a", "b", "b", "c") # three factors for name2
date <- c("2021-06-02", "2021-08-20", "2021-06-15", "2021-08-20", "2021-08-20") # three date factors
number <- c("30", "30", "11", "15", "40") # number of encountered beetles for the "date"
df <- data.frame(species_name1, species_name2, date, number) # create dataframe
df$species_full_name <- gsub(" ", " ", paste(df$species_name1, df$species_name2)) # new column with merged data of the first two columns
df$date <- as.Date(df$date, format ="%Y-%m-%d")
df$number <- as.numeric(df$number)
df$species_name1 <- as.factor(df$species_name1)
df$species_name2 <- as.factor(df$species_name2)
df$species_full_name <- as.factor(df$species_full_name)
str(df)
Overall there are three dates (2021-06-02, 2021-06-15, 2021-08-20), but not every "species_full_name" has all of them.
I need to create a dataframe that contains each of the three dates for every level of the "species_full_name" column.
Where a "species_full_name" level has no entry for a date in the original dataframe, R should write a 0 in the column "number".
I found code that is nearly a solution for my target dataframe. The problem is that the other columns ("species_name1" and "species_name2") disappear:
as.data.frame(xtabs(number ~ species_full_name+date, df)) # create every factor "date" for every factor "species_full_name" and give counting data in column "Freq"
I need a dataframe which is similar to this output, but with every column from the original dataframe "df". It’s important to carry over the values of the columns “species_name1” and “species_name2” too.
Thanks for your help!
You can use complete() from tidyr:
library(dplyr)
library(tidyr)
complete(df, species_full_name, date) %>%
mutate(number = if_else(is.na(number), 0, number))
Output:
species_full_name date species_name1 species_name2 number
<fct> <date> <fct> <fct> <dbl>
1 A a 2021-06-02 A a 30
2 A a 2021-06-15 NA NA 0
3 A a 2021-08-20 A a 30
4 A b 2021-06-02 NA NA 0
5 A b 2021-06-15 A b 11
6 A b 2021-08-20 A b 15
7 B c 2021-06-02 NA NA 0
8 B c 2021-06-15 NA NA 0
9 B c 2021-08-20 B c 40
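In the output above, species_name1 and species_name2 are NA in the newly created rows, which the question wanted to avoid. A minimal sketch of one way around that, assuming tidyr is installed: wrapping the three name columns in nesting() makes complete() expand only the name combinations that actually occur, and the fill argument replaces the missing counts with 0. The name full is only illustrative.

```r
library(tidyr)

# Data from the question, built directly with the final column types
df <- data.frame(
  species_name1 = c("A", "A", "A", "A", "B"),
  species_name2 = c("a", "a", "b", "b", "c"),
  date = as.Date(c("2021-06-02", "2021-08-20", "2021-06-15",
                   "2021-08-20", "2021-08-20")),
  number = c(30, 30, 11, 15, 40)
)
df$species_full_name <- paste(df$species_name1, df$species_name2)

# nesting() keeps the three name columns together, so only existing
# species are crossed with every date; fill sets missing counts to 0
full <- complete(df,
                 nesting(species_name1, species_name2, species_full_name),
                 date,
                 fill = list(number = 0))
full
```

This gives the same nine species/date rows, with the name columns filled in for the added rows.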
However, a data.table approach is likely to be faster on a large dataset. You can use data.table and CJ() as follows:
# load libraries (magrittr provides the %>% pipe used below)
library(data.table)
library(magrittr)
# set df as data.table
setDT(df)
# get unique values of species_full_name and date
species_full_name = unique(df$species_full_name)
date = unique(df$date)
# merge (and update number to 0 if NA, and the name1 and name2 columns)
merge(CJ(date,species_full_name),df,by=c('date','species_full_name'),all.x = TRUE) %>%
.[, number:=fifelse(is.na(number),0,as.double(number))] %>%
.[, c("species_name1","species_name2"):=tstrsplit(species_full_name, " ")] %>%
.[]
Output:
date species_full_name species_name1 species_name2 number
<Date> <fctr> <char> <char> <num>
1: 2021-06-02 A a A a 30
2: 2021-06-02 A b A b 0
3: 2021-06-02 B c B c 0
4: 2021-06-15 A a A a 0
5: 2021-06-15 A b A b 11
6: 2021-06-15 B c B c 0
7: 2021-08-20 A a A a 30
8: 2021-08-20 A b A b 15
9: 2021-08-20 B c B c 40
I want to identify the two-way combinations of levels in one column grouped by the id and Date variables. Basically, I want the daily unique letter pairs for each person.
I have a dataframe that looks like this:
in_df <- data.frame(id = c(1,1,1,1,1,2,2,3),
Date = as.Date(c("2019-01-01", "2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02", "2019-01-01", "2019-01-01", "2019-01-01")),
letter = c("A", "B", "C", "A", "B", "A", "D", "B"))
in_df
id Date letter
1 1 2019-01-01 A
2 1 2019-01-01 B
3 1 2019-01-01 C
4 1 2019-01-02 A
5 1 2019-01-02 B
6 2 2019-01-01 A
7 2 2019-01-01 D
8 3 2019-01-01 B
And I want one that looks like this:
out_df
id Date letter_1 letter_2
1 1 2019-01-01 A B
2 1 2019-01-01 A C
3 1 2019-01-01 B C
4 1 2019-01-02 A B
5 2 2019-01-01 A D
6 3 2019-01-01 B NA
So the first id and the first Date have letters A, B, and C. I want every unique pair from the three. Order doesn't matter so switching what goes to letter_1 and letter_2 would be the same thing.
I have played around with expand.grid and combn, but neither seems quite appropriate for this task.
EDIT
I also have cases where there is only one row per id/Date so using combn gives me Error in combn(letter, m = 2) : n < m. How can I add an if case such that the letter_2 gets an NA? (I also updated the dfs above to address this)
Using data.table:
require(data.table); setDT(in_df)
dt = in_df[, data.table(t(combn(letter, m = 2))), .(id, Date)]
Output:
> dt
id Date V1 V2
1: 1 2019-01-01 A B
2: 1 2019-01-01 A C
3: 1 2019-01-01 B C
4: 1 2019-01-02 A B
5: 2 2019-01-01 A D
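With the updated data from the EDIT, the call above stops with the n < m error as soon as it reaches the single-row group for id 3. A minimal sketch of one way to handle that case, assuming data.table is installed: a small helper (pairs_or_na is a hypothetical name, not part of any package) returns the letter with NA as its partner when a group has fewer than two distinct letters, and the combn() pairs otherwise.

```r
library(data.table)

in_df <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3),
  Date = as.Date(c("2019-01-01", "2019-01-01", "2019-01-01", "2019-01-02",
                   "2019-01-02", "2019-01-01", "2019-01-01", "2019-01-01")),
  letter = c("A", "B", "C", "A", "B", "A", "D", "B")
)
setDT(in_df)

# Hypothetical helper: all unordered pairs, or (letter, NA) for a lone letter
pairs_or_na <- function(x) {
  x <- unique(as.character(x))
  if (length(x) < 2) {
    return(list(letter_1 = x, letter_2 = NA_character_))
  }
  m <- combn(x, 2)  # 2 x choose(n, 2) matrix, one pair per column
  list(letter_1 = m[1, ], letter_2 = m[2, ])
}

out_df <- in_df[, pairs_or_na(letter), by = .(id, Date)]
out_df
```

Because the helper returns a named list, the result already has the letter_1 and letter_2 column names from the question.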
We can use split and combn:
do.call('rbind',
lapply(split(in_df, list(in_df$id, in_df$Date), drop = TRUE),
FUN = function(d)
cbind.data.frame(unique(d[c('id', 'Date')]),
data.frame(t(
if(length(d$letter) > 1){
combn(d$letter, 2)
}else{
matrix(c(d$letter, NA), nrow = 2)
})))))
# id Date X1 X2
# 1.2019-01-01.1 1 2019-01-01 A B
# 1.2019-01-01.2 1 2019-01-01 A C
# 1.2019-01-01.3 1 2019-01-01 B C
# 2.2019-01-01 2 2019-01-01 A D
# 1.2019-01-02 1 2019-01-02 A B
It might be helpful to step through this. Investigate the output of:
(ss <- split(in_df, list(in_df$id, in_df$Date), drop = TRUE))
Then check out:
lapply(ss, FUN = function(d) data.frame(t(combn(d$letter, 2))))
The rest of the way, we're just combining the data. You might want to adjust the column names a bit.
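I think the following code works for the original example. Note that mutate() needs a result of length 1 or of the group size, so this only works when every group has two or three letters; it fails for single-letter groups and for groups of four or more: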
library("dplyr")
in_df %>%
group_by(id, Date) %>%
mutate(
letter_1 = combn(letter, 2)[1, ],
letter_2 = combn(letter, 2)[2, ]
) %>%
distinct(letter_1, letter_2)
# # A tibble: 5 x 4
# # Groups: id, Date [3]
# letter_1 letter_2 id Date
# <fct> <fct> <dbl> <date>
# 1 A B 1 2019-01-01
# 2 A C 1 2019-01-01
# 3 B C 1 2019-01-01
# 4 A B 1 2019-01-02
# 5 A D 2 2019-01-01
I have a data frame with a date column that is stored as text (character or factor) rather than as a Date:
df
date values
11/25/18 a
11/30/18 b
12/4/18 a
12/5/18 b
12/5/18 a
12/6/18 b
12/6/18 c
12/6/18 a
12/6/18 a
12/7/18 b
12/7/18 c
12/7/18 a
12/9/18 b
12/12/18 a
12/12/18 c
12/13/18 b
1/9/19 a
1/9/19 c
1/9/19 b
1/10/19 d
1/10/19 d
1/10/19 d
1/10/19 a
1/11/19 c
1/11/19 d
2/1/19 a
2/10/19 a
2/13/19 b
3/14/19 d
3/17/19 c
5/4/19 d
5/5/19 c
5/6/19 d
5/31/19 a
I tried this code, but it does not aggregate by month:
df %>% group_by(date) %>%
count(values)
This gives me the daily frequency.
df %>% group_by(month = month(date)) %>% count(values)
When I tried this code to aggregate the dates by month, I got the following error:
Error in as.POSIXlt.character(as.character(x), ...) :
character string is not in a standard unambiguous format
I want my output like this
date values freq
11/18 a 1
11/18 b 1
12/18 a 6
12/18 b 5
12/18 c 3
and the same for other months.
Extract the month from date and then use count
library(dplyr)
df %>%
mutate(month = format(as.Date(date, "%m/%d/%y"), "%m/%y")) %>%
count(month, values)
# month values n
# <chr> <fct> <int>
# 1 01/19 a 2
# 2 01/19 b 1
# 3 01/19 c 2
# 4 01/19 d 4
# 5 02/19 a 2
# 6 02/19 b 1
# 7 03/19 c 1
# 8 03/19 d 1
# 9 05/19 a 1
#10 05/19 c 1
#11 05/19 d 2
#12 11/18 a 1
#13 11/18 b 1
#14 12/18 a 6
#15 12/18 b 5
#16 12/18 c 3
Or keeping completely in base R, we can use aggregate
aggregate(date~month+values,
transform(df, month = format(as.Date(date, "%m/%d/%y"), "%m/%y")), length)
We can use base R with table
with(df, as.data.frame(table(format(as.Date(date, "%m/%d/%y"), "%m/%y"), values)))
The advantage is that it also reports the combinations that are absent, with 'Freq' as 0.
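The "%m/%y" labels used in these answers sort as text, which is why 01/19 appears before 11/18 in the dplyr output. A minimal sketch of a variant that sorts chronologically, assuming dplyr is installed and using only the first few rows of the question's data: formatting the month as "%Y-%m" puts the year first. The names monthly and freq are illustrative.

```r
library(dplyr)

# A small subset of the question's data, with dates stored as text
df <- data.frame(
  date = c("11/25/18", "11/30/18", "12/4/18", "12/5/18", "1/9/19"),
  values = c("a", "b", "a", "b", "a")
)

monthly <- df %>%
  # parse the text as month/day/two-digit year, then label by year-month
  mutate(month = format(as.Date(date, format = "%m/%d/%y"), "%Y-%m")) %>%
  count(month, values, name = "freq")

monthly
```

If the mm/yy label is still needed for display, it can be derived from the sorted result in a final step.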
I have a dataframe that I want to update with information from another dataframe, a lookup dataframe.
In particular, I'd like to update the cells of df1$value with the cells of df2$value based on the columns id and id2.
If the cell of df1$value is NA, I know how to do it using the package data.table
BUT
If the cell of df1$value is not empty, data.table will update it with the cell of df2$value anyway.
I don't want that. I'd like to have that:
IF the cell of df1$value is NOT empty (in this case the row in which df1$id is c), do not update the cell but create a duplicate row of df1 in which the cell of df1$value takes the value from the cell of df2$value
I already looked for solutions online but I couldn't find any. Is there a way to do it easily with tidyverse or data.table or an sql-like package?
Thank you for your help!
edit: I've just realized that I forgot to put the corner case in which in both dataframes the row is NA. With the replies I had so far (07/08/19 14:42) the row e is removed from the last dataframe. But I really need to keep it!
Outline:
> df1
id id2 value
1 a 1 100
2 b 2 101
3 c 3 50
4 d 4 NA
5 e 5 NA
> df2
id id2 value
1 c 3 200
2 d 4 201
3 e 5 NA
# I'd like:
> df5
id id2 value
1 a 1 100
2 b 2 101
3 c 3 50
4 c 3 200
5 d 4 201
6 e 5 NA
This is how I managed to solve my problem but it's quite cumbersome.
# I load the packages used below and create the dataframes
library(dplyr)
library(data.table)
df1 <- data.frame(id=c('a', 'b', 'c', 'd'), id2=c(1,2,3,4),value=c(100, 101, 50, NA))
df2 <- data.frame(id=c('c', 'd', 'e'),id2=c(3,4, 5), value=c(200, 201, 300))
# I first do a left_join so I'll have two value columnes: value.x and value.y
df3 <- dplyr::left_join(df1, df2, by = c("id","id2"))
# > df3
# id id2 value.x value.y
# 1 a 1 100 NA
# 2 b 2 101 NA
# 3 c 3 50 200
# 4 d 4 NA 201
# I keep only the rows in which value.x is NA, so the 4th row
df4 <- df3 %>%
filter(is.na(value.x)) %>%
dplyr::select(id, id2, value.y)
# > df4
# id id2 value.y
# 1 d 4 201
# I rename the column "value.y" to "value". (I don't do it with dplyr because the function dplyr::rename doesn't work in my R version)
colnames(df4)[colnames(df4) == "value.y"] <- "value"
# > df4
# id id2 value
# 1 d 4 201
# I update the df1 with the df4$value. This step is necessary to update only the rows of df1 in which df1$value is NA
setDT(df1)[setDT(df4), on = c("id","id2"), `:=`(value = i.value)]
# > df1
# id id2 value
# 1: a 1 100
# 2: b 2 101
# 3: c 3 50
# 4: d 4 201
# I keep only the rows in which neither value.x nor value.y is NA
df3 <- as_tibble(df3) %>%
filter(!is.na(value.x), !is.na(value.y)) %>%
dplyr::select(id, id2, value.y)
# > df3
# # A tibble: 1 x 3
# id id2 value.y
# <chr> <dbl> <dbl>
# 1 c 3 200
# I rename column df3$value.y to value
colnames(df3)[colnames(df3) == "value.y"] <- "value"
# I bind by rows df1 and df3 and I order by the column id
df5 <- rbind(df1, df3) %>%
arrange(id)
# > df5
# id id2 value
# 1 a 1 100
# 2 b 2 101
# 3 c 3 50
# 4 c 3 200
# 5 d 4 201
A left join with data.table:
library(data.table)
setDT(df1); setDT(df2)
df2[df1, on=.(id, id2), .(value =
if (.N == 0) i.value
else na.omit(c(i.value, x.value))
), by=.EACHI]
id id2 value
1: a 1 100
2: b 2 101
3: c 3 50
4: c 3 200
5: d 4 201
How it works: The syntax is x[i, on=, j, by=.EACHI]: for each row of i = df1 do j.
In this case j = .(value = expr) where .() is a shortcut to list() since in general j should return a list of columns.
Regarding the expression, .N is the number of rows of x = df2 that are found for each row of i = df1, so if no matches are found we keep values from i; and otherwise we keep values from both tables, dropping missing values.
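To see the role of .N in that explanation, here is a minimal sketch (assuming data.table is installed) that runs the same join but returns only the match count for each row of df1. The names matches and n_matches are illustrative.

```r
library(data.table)

df1 <- data.table(id = c("a", "b", "c", "d"), id2 = c(1, 2, 3, 4),
                  value = c(100, 101, 50, NA))
df2 <- data.table(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, 300))

# For each row of i = df1, count the rows of x = df2 that match on id and id2
matches <- df2[df1, on = .(id, id2), .(n_matches = .N), by = .EACHI]
matches
```

Rows a and b have no partner in df2, so their count is 0 and the answer's if branch keeps i.value; rows c and d each match one row, so the else branch combines both values and drops the NA.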
A dplyr way:
bind_rows(df1, semi_join(df2, df1, by=c("id", "id2"))) %>%
group_by(id, id2) %>%
do(if (nrow(.) == 1) . else na.omit(.))
# A tibble: 5 x 3
# Groups: id, id2 [4]
id id2 value
<chr> <dbl> <dbl>
1 a 1 100
2 b 2 101
3 c 3 50
4 c 3 200
5 d 4 201
Comment. The dplyr way is kind of awkward because do() is needed to get a dynamically determined number of rows, but do() is typically discouraged and does not support n() and other helper functions. The data.table way is kind of awkward because there is no simple semi join functionality.
Data:
df1 <- data.frame(id=c('a', 'b', 'c', 'd'), id2=c(1,2,3,4),value=c(100, 101, 50, NA))
df2 <- data.frame(id=c('c', 'd', 'e'),id2=c(3,4, 5), value=c(200, 201, 300))
> df1
id id2 value
1 a 1 100
2 b 2 101
3 c 3 50
4 d 4 NA
> df2
id id2 value
1 c 3 200
2 d 4 201
3 e 5 300
Another idea via base R is to remove the rows from df2 that do not match in df1, bind the two data frames rowwise (rbind) and omit the NAs, i.e.
na.omit(rbind(df1, df2[do.call(paste, df2[1:2]) %in% do.call(paste, df1[1:2]),]))
# id id2 value
#1 a 1 100
#2 b 2 101
#3 c 3 50
#5 c 3 200
#6 d 4 201
To answer your new requirements, we can keep the same rbind method and filter based on your conditions, i.e.
dd <- rbind(df1, df2[do.call(paste, df2[1:2]) %in% do.call(paste, df1[1:2]),])
dd[!!with(dd, ave(value, id, id2, FUN = function(i)(all(is.na(i)) & !duplicated(i)) | !is.na(i))),]
# id id2 value
#1 a 1 100
#2 b 2 101
#3 c 3 50
#5 e 5 NA
#6 c 3 200
#7 d 4 201
A possible approach with data.table using update join then full outer merge:
merge(df1[is.na(value), value := df2[.SD, on=.(id, id2), x.value]], df2, all=TRUE)
output:
id id2 value
1: a 1 100
2: b 2 101
3: c 3 50
4: c 3 200
5: d 4 201
6: e 5 NA
data:
library(data.table)
df1 <- data.table(id=c('a', 'b', 'c', 'd', 'e'), id2=c(1,2,3,4,5),value=c(100, 101, 50, NA, NA))
df2 <- data.table(id=c('c', 'd', 'e'), id2=c(3,4, 5), value=c(200, 201, NA))
Here is one way using left_join and gather
library(dplyr)
left_join(df1, df2, by = c("id","id2")) %>%
tidyr::gather(key, value, starts_with("value"), na.rm = TRUE) %>%
select(-key)
# id id2 value
#1 a 1 100
#2 b 2 101
#3 c 3 50
#7 c 3 200
#8 d 4 201
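gather() still works but is superseded in current tidyr. A minimal sketch of the same idea with pivot_longer(), assuming dplyr and tidyr (1.0.0 or later) are installed and using the data without row e: values_drop_na = TRUE plays the role of na.rm = TRUE. The name res is illustrative.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(id = c("a", "b", "c", "d"), id2 = c(1, 2, 3, 4),
                  value = c(100, 101, 50, NA))
df2 <- data.frame(id = c("c", "d", "e"), id2 = c(3, 4, 5),
                  value = c(200, 201, 300))

res <- left_join(df1, df2, by = c("id", "id2")) %>%
  # stack value.x and value.y into one column, dropping the NA entries
  pivot_longer(starts_with("value"), names_to = "key",
               values_to = "value", values_drop_na = TRUE) %>%
  select(-key)

res
```

Unlike gather(), pivot_longer() works row by row, so the two rows for id c end up next to each other without a separate sort.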
For the updated case, we can do
left_join(df1, df2, by = c("id","id2")) %>%
tidyr::gather(key, value, starts_with("value")) %>%
group_by(id, id2) %>%
filter((all(is.na(value)) & !duplicated(value)) | !is.na(value)) %>%
select(-key)
# id id2 value
# <chr> <int> <int>
#1 a 1 100
#2 b 2 101
#3 c 3 50
#4 e 5 NA
#5 c 3 200
#6 d 4 201
I am new to R. I struggle to find a suitable solution for the following problem:
My dataframe looks approximately like this:
ID Att
1 a
1 b
1 c
2 d
3 e
3 f
4 g
I would like to convert it into a new df of the following form:
ID Att_1 Att_2 ... Att_n
1 a b c
2 d N/A N/A
3 e f N/A
4 g N/A N/A
Where the number of columns depends on the maximum number of 'Att' values per 'ID' (here three). The number of columns in the new dataframe (i.e. 'n') should be determined automatically from the largest count per ID:
max_ID_count <- table(df$ID)
n <- max(max_ID_count)
Thanks a lot!
We can create a sequence column and then spread
library(tidyverse)
df1 %>%
group_by(ID) %>%
mutate(rn = paste0("Att_", row_number())) %>%
spread(rn, Att)
# A tibble: 4 x 4
# Groups: ID [4]
# ID Att_1 Att_2 Att_3
# <int> <chr> <chr> <chr>
#1 1 a b c
#2 2 d <NA> <NA>
#3 3 e f <NA>
#4 4 g <NA> <NA>
Or with dcast from data.table
library(data.table)
dcast(setDT(df1), ID ~ paste0("Att_", rowid(ID)), value.var = "Att")
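spread() is superseded in current tidyr. A minimal sketch of the first approach with pivot_wider(), assuming dplyr and tidyr (1.0.0 or later) are installed; df1 is rebuilt here from the table in the question, and the name wide is illustrative.

```r
library(dplyr)
library(tidyr)

# Data from the question
df1 <- data.frame(ID = c(1, 1, 1, 2, 3, 3, 4),
                  Att = c("a", "b", "c", "d", "e", "f", "g"))

wide <- df1 %>%
  group_by(ID) %>%
  # number the attributes within each ID: Att_1, Att_2, ...
  mutate(rn = paste0("Att_", row_number())) %>%
  ungroup() %>%
  # one column per position; IDs with fewer attributes get NA
  pivot_wider(names_from = rn, values_from = Att)

wide
```

The number of Att_ columns follows from the data, so n does not have to be computed beforehand.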