I have a data frame with two columns: ID and product. The product contains a list of items like the ones below:
ID  Product
1   'desk','chair','clock'
2   NA
3   'pen'
I want to extract every single product in a separate row with the corresponding ID, as below:
ID  Product
1   'desk'
1   'chair'
1   'clock'
3   'pen'
Any suggestions would be appreciated.
You can do it with tidyr's separate_rows().
library(tidyverse)

df <- data.frame(
  id = c(1, 2, 3),
  product = c('desk, chair, clock', NA, 'pen')
)

df |>
  separate_rows(product) |>
  drop_na()
#> # A tibble: 4 × 2
#> id product
#> <dbl> <chr>
#> 1 1 desk
#> 2 1 chair
#> 3 1 clock
#> 4 3 pen
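On tidyr 1.3.0 or later the same result can also be had with separate_longer_delim(), which supersedes separate_rows(). A sketch, assuming the products are stored as comma-separated strings as above:

```r
library(tidyr)

df <- data.frame(
  id = c(1, 2, 3),
  product = c("desk, chair, clock", NA, "pen")
)

# split each product string on ", " into one row per item,
# then drop the ID that had no products
df |>
  separate_longer_delim(product, delim = ", ") |>
  drop_na(product)
```

Unlike separate_rows(), which splits on any non-alphanumeric character by default, separate_longer_delim() requires an explicit delimiter, which makes the intent clearer.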
You can also do it with separate_rows from the tidyr package:
library(tidyr)
df <- df %>%
  separate_rows(Product, sep = ",")
Besides the answers above, I tried this method and it works fine as well:
result_df <- unnest(df, Product)
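Note that unnest() only works when Product is an actual list-column; with a comma-separated string you still need separate_rows(). A minimal sketch of the list-column case (the data construction here is assumed for illustration):

```r
library(tidyr)

# Product stored as a list-column: one character vector per ID
df <- tibble::tibble(
  ID = c(1, 3),
  Product = list(c("desk", "chair", "clock"), "pen")
)

# unnest() expands each list element into its own row
result_df <- unnest(df, Product)
result_df
```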
I would like to join two data sets that look like the following data sets. The matching rule would be that the Item variable from mykey matches the first part of the Item entry in mydata to some degree.
mydata <- tibble(Item = c("ab_kssv", "ab_kd", "cde_kh", "cde_ksa", "cde"),
Answer = c(1,2,3,4,5),
Avg = rep(-100, length(Item)))
mykey <- tibble(Item = c("ab", "cde"),
Avg = c(0 ,10))
The result should be the following:
Item Answer Avg
1 ab_kssv 1 0
2 ab_kd 2 0
3 cde_kh 3 10
4 cde_ksa 4 10
5 cde 5 10
I looked at these three SO questions, but did not find a nice solution there. I also briefly tried the fuzzyjoin package, but that did not work. Finally, I have a for-loop-based solution:
for (currLine in 1:nrow(mydata)) {
mydata$Avg[currLine] <- mykey$Avg[str_starts(mydata$Item[currLine], mykey$Item)]
}
It does the job, but it is not nice to read or understand, and I wonder whether the "by" argument of full_join() from the dplyr package can be made a bit more tolerant in its matching. Any help will be appreciated!
Using a fuzzyjoin::regex_left_join you could do:
Note: I renamed the Item column in your mykey dataset to regex to make clear that this is the regex to match by and added a "^" to ensure that we match at the beginning of the Item column in the mydata dataset.
library(fuzzyjoin)
library(dplyr)
mykey <- mykey %>%
rename(regex = Item) %>%
mutate(regex = paste0("^", regex))
mydata %>%
select(-Avg) %>%
regex_left_join(mykey, by = c(Item = "regex")) %>%
select(-regex)
#> # A tibble: 5 × 3
#> Item Answer Avg
#> <chr> <dbl> <dbl>
#> 1 ab_kssv 1 0
#> 2 ab_kd 2 0
#> 3 cde_kh 3 10
#> 4 cde_ksa 4 10
#> 5 cde 5 10
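If you'd rather avoid the extra dependency, the same prefix match can be sketched in base R with startsWith(). This assumes every Item matches exactly one key; with zero or multiple matches the indexing below would break:

```r
mydata <- data.frame(Item = c("ab_kssv", "ab_kd", "cde_kh", "cde_ksa", "cde"),
                     Answer = c(1, 2, 3, 4, 5))
mykey <- data.frame(Item = c("ab", "cde"), Avg = c(0, 10))

# for each Item, find the index of the (single) key it starts with;
# startsWith() is vectorised over the prefixes
idx <- sapply(mydata$Item, function(x) which(startsWith(x, mykey$Item)))
mydata$Avg <- mykey$Avg[idx]
mydata
```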
I have a data frame that looks like this :
names            value
John123abc       1
George12894xyz   2
Mary789qwe       3
I want to rewrite the values of the column "names" so that only the name itself is kept (without the extra numbers and characters appended to it). The code after each name varies, and I have 100,000 rows. I was thinking of something like starts_with("John") = "John", but I am not sure that works.
Ideally i want the new data frame to look like this:
names    value
John     1
George   2
Mary     3
How can I do this in R using dplyr?
library(tidyverse)
names = c("John123abc","George12894xyz","Mary789qwe")
value = c(1,2,3)
dat = tibble(names,value)
Using stringr::str_remove you could do:
dat |>
  mutate(names = str_remove(names, "\\d+.*$"))
#> # A tibble: 3 × 2
#> names value
#> <chr> <dbl>
#> 1 John 1
#> 2 George 2
#> 3 Mary 3
Using base R
dat$names <- trimws(dat$names, whitespace = "\\d+.*")
Output:
> dat
# A tibble: 3 × 2
names value
<chr> <dbl>
1 John 1
2 George 2
3 Mary 3
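An equivalent base R approach uses sub() to delete everything from the first digit onward. This assumes every name starts with letters and the junk always begins with a digit:

```r
names <- c("John123abc", "George12894xyz", "Mary789qwe")

# remove the first digit and everything after it
sub("[0-9].*$", "", names)
#> [1] "John"   "George" "Mary"
```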
I have two data sets with one common variable, ID (there are duplicate ID numbers in both data sets). I need to link dates from one data set to the other, but I can't use a plain left join because the first (left) file needs to stay as it is: I don't want all combinations returned, which would add rows. I also don't want a vlookup-style match as in Excel, which finds only the first match and returns it, so duplicate ID numbers all get the first match. I need it to return the first match, then the second, then the third (the dates are sorted so that the newest date is always first for every ID number) and so on, but without any added rows. Is there any way to do this? Thank you in advance!
You can add a second column that creates sub-ids following the row numbers within each ID. Then you can use an inner_join to join everything together.
Since you don't have example data sets I created two to show the principle.
df1 <- df1 %>%
  group_by(ID) %>%
  mutate(follow_id = row_number())

df2 <- df2 %>%
  group_by(ID) %>%
  mutate(follow_id = row_number())
outcome <- df1 %>% inner_join(df2)
# A tibble: 7 x 3
# Groups: ID [?]
ID follow_id var1
<dbl> <int> <fct>
1 1 1 a
2 1 2 b
3 2 1 e
4 3 1 f
5 4 1 h
6 4 2 i
7 4 3 j
data:
df1 <- data.frame(ID = c(1, 1, 2,3,4,4,4))
df2 <- data.frame(ID = c(1,1,1,1,2,3,3,4,4,4,4),
var1 = letters[1:11])
You need a secondary id column. Since you need the first n matches, just group by the id, create an auto-incrementing id within each group, then join as usual:
df1<-data.frame(id=c(1,1,2,3,4,4,4))
d1=sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by="day"),11)
df2<-data.frame(id=c(1,1,1,1,2,3,3,4,4,4,4),d1,d2=d1+sample.int(50,11))
library(dplyr)
df11 <- df1 %>%
  group_by(id) %>%
  mutate(id2 = 1:n()) %>%
  ungroup()

df21 <- df2 %>%
  group_by(id) %>%
  mutate(id2 = 1:n()) %>%
  ungroup()
left_join(df11,df21,by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05
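The same secondary-id trick also works in base R, using ave() to number the rows within each id and merge() for the left join. A sketch on data like the above (re-declared here so the snippet is self-contained):

```r
df1 <- data.frame(id = c(1, 1, 2, 3, 4, 4, 4))
df2 <- data.frame(id = c(1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 4),
                  var1 = letters[1:11])

# number the rows within each id group (1, 2, 3, ...)
df1$id2 <- ave(df1$id, df1$id, FUN = seq_along)
df2$id2 <- ave(df2$id, df2$id, FUN = seq_along)

# left join on the (id, id2) pair keeps df1's row count
merge(df1, df2, by = c("id", "id2"), all.x = TRUE)
```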
This is an approximation of the original dataframe. In the original, there are many more columns than are shown here.
id init_cont family description value
1 K S impacteach 1
1 K S impactover 3
1 K S read 2
2 I S impacteach 2
2 I S impactover 4
2 I S read 1
3 K D impacteach 3
3 K D impactover 5
3 K D read 3
I want to combine the values for impacteach and impactover to generate an average value that is just called impact. I would like the final table to look like the following:
id init_cont family description value
1 K S impact 2
1 K S read 2
2 I S impact 3
2 I S read 1
3 K D impact 4
3 K D read 3
I have not been able to figure out how to generate this table. However, I have been able to create a dataframe that looks like this:
id description value
1 impact 2
1 read 2
2 impact 3
2 read 1
3 impact 4
3 read 3
What is the best way for me to take these new values and add them to the original dataframe? I also need to remove the original values (like impacteach and impactover) in the original dataframe. I would prefer to modify the original dataframe as opposed to creating an entirely new dataframe because the original dataframe has many columns.
In case it is useful, this is a summary of the code I used to create the shorter dataframe with impact as a combination of impacteach and impactover:
df %>%
  mutate(newdescription = case_when(description %in% c("impacteach", "impactover") ~ "impact",
                                    TRUE ~ description)) %>%
  group_by(id, newdescription) %>%
  summarise(value = mean(as.numeric(value)))
What if you changed the description column first so that it could be included in the grouping:
df %>%
mutate(description = substr(description, 1, 6)) %>%
group_by(id, init_cont, family, description) %>%
summarise(value = mean(value))
# A tibble: 6 x 5
# Groups: id, init_cont, family [?]
# id init_cont family description value
# <int> <chr> <chr> <chr> <dbl>
# 1 1 K S impact 2.
# 2 1 K S read 2.
# 3 2 I S impact 3.
# 4 2 I S read 1.
# 5 3 K D impact 4.
# 6 3 K D read 3.
You just need to modify your group_by statement. Try group_by(id, init_cont, family)
Because your id seems to be mapped to init_cont and family already, adding in these values won't change your summarization result. Then you have all the columns you want with no extra work.
If you have a lot of columns, you could try something like the code below. Essentially, left-join the summarised data back onto your original data, using the magrittr `.` so you don't need to store an intermediate data frame. After joining (by id and the description we modified in place) you'll have two value columns, suffixed with .x and .y; drop the original one, then use distinct() to get rid of the duplicate 'impact' rows.
df %>%
  mutate(description = case_when(description %in% c("impacteach", "impactover") ~ "impact",
                                 TRUE ~ description)) %>%
  left_join(. %>%
              group_by(id, description) %>%
              summarise(value = mean(as.numeric(value))),
            by = c('id', 'description')) %>%
  select(-value.x) %>%
  distinct()
gsub can be used to normalise every description starting with "impact" to just "impact"; then group_by from the dplyr package helps in summarising the value:
df %>% group_by(id, init_cont, family,
description = gsub("^(impact).*","\\1", description)) %>%
summarise(value = mean(value))
# # A tibble: 6 x 5
# # Groups: id, init_cont, family [?]
# id init_cont family description value
# <int> <chr> <chr> <chr> <dbl>
# 1 1 K S impact 2.00
# 2 1 K S read 2.00
# 3 2 I S impact 3.00
# 4 2 I S read 1.00
# 5 3 K D impact 4.00
# 6 3 K D read 3.00
I have data with a list of people's names and their ID numbers. Not all people with the same name will have the same ID number but everyone with different names should have a different ID number. Like this:
Name     ID
david    1
david    1
john     2
john     2
john     2
john     3
megan    4
bill     5
barbara  6
chris    7
chris    8
I need to make sure that these IDs are correct, so I want to write code that subsets only the rows where the ID number is the same but the names are different (in other words, only the ID errors). I don't even know where to start. I tried
df1<-df(subset(duplicated(df$Name) & duplicated(df$ID)))
Error in subset.default(duplicated(df$officer) & duplicated(df$ID)) :
argument "subset" is missing, with no default
but it didn't work, and I know it doesn't tell R to match and compare the names against the ID numbers.
Thank you so much in advance.
Updated with the information in the comments below
Here are some test data:
> DF <- data.frame(name = c("A", "A", "A", "B", "B", "C"), id=c(1,1,2,3,4,4))
> DF
name id
1 A 1
2 A 1
3 A 2
4 B 3
5 B 4
6 C 4
So ... if I understand your problem correctly you want to get the information that there are problems with id 4 since two different names (B and C) appear for that id.
library(dplyr)
DF %>% group_by(id) %>% distinct(name) %>% tally()
# A tibble: 4 x 2
id n
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 4 2
Here we get a summary and see that there are two different names (n) for id 4. You can combine that with filter() to see only the ids with more than one name:
> DF %>% group_by(id) %>% distinct(name) %>% tally() %>% filter(n > 1)
# A tibble: 1 x 2
id n
<dbl> <int>
1 4 2
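For completeness, the same check can be sketched in base R without dplyr, counting distinct names per id with tapply():

```r
DF <- data.frame(name = c("A", "A", "A", "B", "B", "C"),
                 id = c(1, 1, 2, 3, 4, 4))

# number of distinct names per id
n_names <- tapply(DF$name, DF$id, function(x) length(unique(x)))

# ids with more than one name are the suspect ones
n_names[n_names > 1]
```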
Did that help?