Transfer values in a column to the next Date (in R)

I have a data frame like:
df <- data.frame(id = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10'),
                 Date = c("01-Feb-17", "05-Feb-17", "01-May-17", "03-May-17", "24-May-17",
                          "05-Oct-17", "20-Oct-17", "25-Oct-17", "01-Dec-17", "12-Dec-17"),
                 Name = c("John", "Jack", "Jack", "John", "John", "Jack", "John", "Jack", "John", "Jack"),
                 Workout = c('150', '130', '140', '160', '150', '130', '140', '160', '150', '130'))
Now I want to shift the values in the Workout column, for every Name, to the next Date.
For example:
150 moves from 01-Feb-17 (John) to 03-May-17 (John),
etc.
The same applies to the values for "Jack".

df <- data.frame(id = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10'),
                 Date = c("01-Feb-17", "05-Feb-17", "01-May-17", "03-May-17", "24-May-17",
                          "05-Oct-2017", "20-Oct-17", "25-Oct-17", "01-Dec-2017", "12-Dec-2017"),
                 Name = c("John", "Jack", "Jack", "John", "John", "Jack", "John", "Jack", "John", "Jack"),
                 Workout = c('150', '130', '140', '160', '150', '130', '140', '160', '150', '130'))
library(dplyr)
df %>%
  group_by(Name) %>%                  # for every name
  mutate(Workout = lag(Workout)) %>%  # replace each value with the previous one
  ungroup()                           # forget the grouping
# # A tibble: 10 x 4
# id Date Name Workout
# <fct> <fct> <fct> <fct>
# 1 1 01-Feb-17 John NA
# 2 2 05-Feb-17 Jack NA
# 3 3 01-May-17 Jack 130
# 4 4 03-May-17 John 150
# 5 5 24-May-17 John 160
# 6 6 05-Oct-2017 Jack 140
# 7 7 20-Oct-17 John 150
# 8 8 25-Oct-17 Jack 130
# 9 9 01-Dec-2017 John 140
#10 10 12-Dec-2017 Jack 160
I assume your dataset is ordered by Date, as in your example. If not, you can order it first with arrange().
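For completeness, a sketch of that ordering step, assuming the two-digit-year date format from the question's data and an English locale for the month abbreviations:
library(dplyr)
df %>%
  mutate(Date = as.Date(Date, format = "%d-%b-%y")) %>%  # parse "01-Feb-17" style dates
  arrange(Name, Date) %>%                                # chronological order within each Name
  group_by(Name) %>%
  mutate(Workout = lag(Workout)) %>%
  ungroup()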

Related

How can I check whether a group contains the correct number of observations in R?

I have a data set with monthly results for each site. I need to delete any sites that don't have at least one sample from each season.
An example of the data is below:
df <- data.frame(site = c('D', 'D', 'D', 'D', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'),
                 result = c('1', '2', '1.5', '3', '1.8', '7', '3.2', '4', '1', '1.1', '3', '3.3', '2', '5', '4'),
                 season = c('w', 'sp', 'su', 'a', 'sp', 'sp', 'sp', 'su', 'a', 'a', 'w', 'w', 'sp', 'w', 's'))
In this case, all the data for site D and A would be retained as they have at least 1 sample per season, but all the data for site B would be deleted.
I am struggling with the logic of how to do this and would appreciate some pointers, please. I am doing this in R. I think I need to group_by site, but I don't know what to do next.
library(dplyr)
df %>%
  group_by(site) %>%
  filter(length(unique(season)) == 4) %>%  # keep only sites with all four seasons
  ungroup()
output:
# A tibble: 12 x 3
site result season
<chr> <chr> <chr>
1 D 1 w
2 D 2 sp
3 D 1.5 su
4 D 3 a
5 A 1.8 sp
6 A 7 sp
7 A 3.2 sp
8 A 4 su
9 A 1 a
10 A 1.1 a
11 A 3 w
12 A 3.3 w
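Equivalently, dplyr's n_distinct() helper can replace length(unique()) if that reads more clearly (a minimal sketch of the same filter):
library(dplyr)
df %>%
  group_by(site) %>%
  filter(n_distinct(season) == 4) %>%  # keep sites observed in all four seasons
  ungroup()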

Compare two dataframes based on first and last name separated by row

I have two dataframes organised like this.
df1 <- data.frame(lastname = c("Miller", "Smith", "Grey"),
                  firstname = c("John", "Jane", "Hans"))
df2 <- data.frame(lastname = c("Smith", "Grey"),
                  firstname = c("Jane", "Hans"))
df2 is not necessarily a subset of df1. Duplicated entries are also possible.
My goal is to keep a copy of df1 containing only the entries that also appear in df2. Alternatively, I would like to end up with a subset of df1 with a new variable indicating that the name is also present in df2.
Can someone suggest a way to do this? A {dplyr} attempt is totally fine.
Desired output for this particular simple case:
res <- data.frame(lastname = c("Smith", "Grey"),
                  firstname = c("Jane", "Hans"))
Including the "alternatively" part of the question, this is an approach with left_join, adding a grouping variable grp to distinguish the two sets.
library(dplyr)
left_join(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
          by = c("lastname", "firstname"), suffix = c("_A", "_B"))
lastname firstname grp_A grp_B
1 Miller John A <NA>
2 Smith Jane A B
3 Grey Hans A B
or with base R merge
merge(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
      by = c("lastname", "firstname"), suffixes = c("_A", "_B"), all = TRUE)
lastname firstname grp_A grp_B
1 Grey Hans A B
2 Miller John A <NA>
3 Smith Jane A B
To remove the NA rows and compact the grp columns:
na.omit(left_join(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
                  by = c("lastname", "firstname"), suffix = c("_A", "_B"))) %>%
  summarize(lastname, firstname,
            grp = list(across(starts_with("grp"), ~ unique(.x))))
lastname firstname grp
1 Smith Jane A, B
2 Grey Hans A, B
The other part is simply
merge(df1, df2)
lastname firstname
1 Grey Hans
2 Smith Jane
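If you would rather stay in dplyr for that part too, semi_join() should return the same subset (a sketch, assuming exact matches on both name columns). It keeps only the rows and columns of df1 and does not duplicate rows when df2 contains duplicates:
library(dplyr)
semi_join(df1, df2, by = c("lastname", "firstname"))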

Unnest non-consecutive tokens in R

Suppose I have a few sentences describing how John spends his days stored in a dataframe in R:
df <- data_frame(sentence = c("John went to work this morning", "John likes to jog", "John is hungry"))
Thus, I want to identify which words are repeated most often when a sentence contains "John". I can use unnest_tokens() to identify consecutive word pairs. How can I identify recurring pairings that are non-consecutive?
The goal is to obtain a result that counts how many times each other word appears close to John:
df2 <- data_frame(word1 = c("John", "John", "John", "John", "John", "John", "John", "John", "John"),
                  word2 = c("went", "to", "work", "this", "morning", "likes", "jog", "is", "hungry"),
                  n = c(1, 2, 1, 1, 1, 1, 1, 1, 1))
We can try
library(dplyr)
# split each sentence into words, pairing the first word with all the others
lst <- lapply(strsplit(df$sentence, " "), \(x) list(x[1], x[-1])) |>
  lapply(\(x) data.frame(x[1], x[2]))
# bind the pairs together and count each (word1, word2) combination
ans <- lapply(lst, \(x) {colnames(x) <- c("word1", "word2"); x}) |>
  do.call(rbind, args = _) |>
  group_by(word1, word2) |>
  summarise(n = n())
Output
# A tibble: 9 × 3
# Groups: word1 [1]
word1 word2 n
<chr> <chr> <int>
1 John hungry 1
2 John is 1
3 John jog 1
4 John likes 1
5 John morning 1
6 John this 1
7 John to 2
8 John went 1
9 John work 1
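A rough tidyverse sketch of the same counting, assuming (as in the example data) that the keyword is always "John" and should itself be dropped from the pairs:
library(dplyr)
library(tidyr)
df %>%
  mutate(word1 = "John") %>%              # pair every word with the keyword
  separate_rows(sentence, sep = " ") %>%  # one row per word
  rename(word2 = sentence) %>%
  filter(word2 != "John") %>%             # drop the keyword itself
  count(word1, word2, name = "n")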

Reshape dataframe - create rows as per data availability in R

I want to reshape the original dataframe into the target dataframe as follows.
But first, to recreate dataframes:
original <- data.frame(caseid = c("id101", 'id201', 'id202', 'id301', 'id302'),
                       age_child1 = c('3', '5', '8', NA, NA),
                       age_child2 = c('1', '7', NA, NA, NA),
                       age_child3 = c('2', '6', '8', '3', NA))
target <- data.frame(caseid = c('id101_1', 'id101_2', 'id101_3', 'id201_1', 'id201_2', 'id201_3',
                                'id202_1', 'id202_3', 'id301_3'),
                     age = c(3, 1, 2, 5, 7, 6, 8, 8, 3))
The caseid column represents mothers. I want to create a new caseid row for each of the children and put the respective 'age' value in the age column. If no 'age' value is available, it means there is no nth child, and no new row should be created.
Thanks for the help!
You can use pivot_longer() and its various helpful options:
library(dplyr)
library(tidyr)

pivot_longer(original, cols = starts_with("age"), names_prefix = "age_child",
             values_to = "age", values_transform = as.integer) %>%
  filter(!is.na(age)) %>%
  mutate(caseid = paste0(caseid, "_", name)) %>%
  select(-name)
Output:
# A tibble: 9 × 2
caseid age
<chr> <int>
1 id101_1 3
2 id101_2 1
3 id101_3 2
4 id201_1 5
5 id201_2 7
6 id201_3 6
7 id202_1 8
8 id202_3 8
9 id301_3 3
Using reshape from base R:
original <- data.frame(caseid = c("id101", 'id201', 'id202', 'id301', 'id302'),
                       age_child1 = c('3', '5', '8', NA, NA),
                       age_child2 = c('1', '7', NA, NA, NA),
                       age_child3 = c('2', '6', '8', '3', NA))
a <- reshape(original, varying = c("age_child1", "age_child2", "age_child3"),
             direction = "long",
             times = c("_1", "_2", "_3"),
             v.names = "age")
a$caseid <- paste0(a$caseid, a$time)
a <- a[order(a$caseid), ][c("caseid", "age")]
a <- na.omit(a)
row.names(a) <- NULL
a
#> caseid age
#> 1 id101_1 3
#> 2 id101_2 1
#> 3 id101_3 2
#> 4 id201_1 5
#> 5 id201_2 7
#> 6 id201_3 6
#> 7 id202_1 8
#> 8 id202_3 8
#> 9 id301_3 3
Created on 2022-06-01 by the reprex package (v2.0.1)
Or, another pivot_longer() option using names_pattern and values_drop_na:
library(tidyr)

original %>%
  pivot_longer(-caseid, names_to = 'child', names_pattern = '([0-9]+$)',
               values_to = 'age', values_drop_na = TRUE) %>%
  unite(caseid, caseid, child)
# A tibble: 9 x 2
caseid age
<chr> <chr>
1 id101_1 3
2 id101_2 1
3 id101_3 2
4 id201_1 5
5 id201_2 7
6 id201_3 6
7 id202_1 8
8 id202_3 8
9 id301_3 3
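If data.table is already in use, melt() with na.rm = TRUE is another sketch of the same reshape (same column names as above; setorder() can be added if the rows need to come out ordered by caseid):
library(data.table)
long <- melt(as.data.table(original), id.vars = "caseid",
             variable.name = "child", value.name = "age", na.rm = TRUE)
long[, .(caseid = paste0(caseid, "_", sub("age_child", "", child)), age)]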

How to merge two tables without adding and deleting columns

I have two tables with information that I would like to join, using the testcase as the key. I could first join them, then rename the columns and then re-order the dataframe, but is there a more elegant way?
df1 <- data.frame(
  testcase = c('testcase1', 'testcase2', 'testcase3', 'testcase4', 'testcase5'),
  passed = c('2', '0', '2', '0', '0'),
  failed = c('0', '2', '2', '0', '2'))
df2 <- data.frame(
  id = c(1:10),
  testid = c('testcase3', 'testcase1', 'testcase3', 'testcase2', 'testcase5',
             'testcase1', 'testcase3', 'testcase5', 'testcase2', 'testcase3'),
  total_passed = rep("", 10),
  total_failed = rep("", 10),
  testid = c(510:519),
  total_items = rep("", 10))
My solution would be the following, but could it be done in fewer steps?
df3 <- merge(df2, df1, by.x = 'testid', by.y = 'testcase')
df3$total_passed <- df3$total_failed <- NULL
df3$total_items <- 10
df3 <- select(df3, id, testid, total_passed = passed, total_failed = failed, testid, total_items)
Maybe you can use the dplyr library:
library(dplyr)
df2 %>%
  inner_join(df1, by = c('testid' = 'testcase')) %>%
  transmute(id, testid, total_passed = passed, total_failed = failed,
            total_items = 10)
# id testid total_passed total_failed total_items
#1 1 testcase3 2 2 10
#2 2 testcase1 2 0 10
#3 3 testcase3 2 2 10
#4 4 testcase2 0 2 10
#5 5 testcase5 0 2 10
#6 6 testcase1 2 0 10
#7 7 testcase3 2 2 10
#8 8 testcase5 0 2 10
#9 9 testcase2 0 2 10
#10 10 testcase3 2 2 10
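A small assumption here is that total_items really is a literal 10; if it is instead meant to be the number of rows in the joined result, n() could replace the constant (a sketch, depending on what total_items is supposed to mean):
df2 %>%
  inner_join(df1, by = c('testid' = 'testcase')) %>%
  transmute(id, testid, total_passed = passed, total_failed = failed,
            total_items = n())  # row count of the joined result instead of a hard-coded 10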
We can use an update join in data.table:
library(data.table)
setDT(df2)[df1, c('total_passed', 'total_failed', 'total_items') :=
             .(passed, failed, 10), on = .(testid = testcase)]
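Note that the := assignment updates df2 by reference, so df2 itself now carries the filled-in columns; if only the target columns are wanted afterwards, a final selection along these lines (same column names as above) should do it:
df2[, .(id, testid, total_passed, total_failed, total_items)]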
