Combine two rows into one in R

I have run into another challenge: combining two rows into one based on an identifier column.
My dataset looks like this:
var <- c("round", "round", "round", "hhid", "hhid", "chid", "chid", "sex")
dfile <- c("df1", "df2", "df3", "df1", "df2", "df1", "df2", "df1")
uniquevar <- c("df1::round", "df2::round", "df3::round", "df1::hhid", "df2::hhid", "df1::chid", "df2::chid", "df1::sex")
flag <- c("dup", "dup", "dup", "dup", "dup", "dup", "dup", "NA")
df <- data.frame(var, dfile, flag)
What I am trying to do:
1. Find the observations that are marked as "dup".
2. If an observation is marked as "dup", combine the two/three/multiple rows into one, with the format:
df1::var | df2::var | df3::var
So the ideal outcome would look like this:
var    dfile             uniquevar                              flag
round  df1 | df2 | df3   df1::round | df2::round | df3::round   dup
hhid   df1 | df2         df1::hhid | df2::hhid                  dup
chid   df1 | df2         df1::chid | df2::chid                  dup
sex    df1                                                      NA
So far I can only do this manually in Excel, which is really time-consuming. I would appreciate being shown how to achieve it in R, which would be much faster given that the dataset contains over 600,000 observations.
Thanks a lot!

You can paste the cells together after using group_by(var). Use sep = "::" for the separator between different columns and collapse = " | " for the separator between rows. You can do this inside summarize() from the dplyr package.
library(dplyr)

df %>%
  group_by(var) %>%
  summarize(uniquevar = ifelse(all(flag == "dup"),
                               paste(dfile, var, sep = "::", collapse = " | "),
                               ""),
            dfile = paste(dfile, collapse = " | "),
            dup = flag[1]) %>%
  select(var, dfile, uniquevar, dup)
#> # A tibble: 4 x 4
#>   var   dfile           uniquevar                               dup
#>   <chr> <chr>           <chr>                                   <chr>
#> 1 chid  df1 | df2       "df1::chid | df2::chid"                 dup
#> 2 hhid  df1 | df2       "df1::hhid | df2::hhid"                 dup
#> 3 round df1 | df2 | df3 "df1::round | df2::round | df3::round"  dup
#> 4 sex   df1             ""                                      NA
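As a side note, the flag column in the example stores the literal string "NA" rather than a missing value. If you would rather not rely on flag at all, a variant that derives the duplicate status from the group size instead could look like this (a sketch, assuming dplyr >= 1.0 for the .groups argument):

library(dplyr)

df %>%
  group_by(var) %>%
  summarize(
    # build uniquevar first so dfile below still refers to the original column
    uniquevar = if (n() > 1) paste(dfile, var, sep = "::", collapse = " | ") else NA_character_,
    dfile     = paste(dfile, collapse = " | "),
    flag      = if (n() > 1) "dup" else NA_character_,
    .groups   = "drop"
  ) %>%
  select(var, dfile, uniquevar, flag)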

Related

Combining rows using fuzzy matching of the keys in R

I have a data set that might contain some very similar keys, something like one row of data for each of the email addresses john.doe#foo.com and john.m.doe#foo.com. How can I combine similarly named keys and aggregate their values in R?
Sample input
| Email              | Subscriptions |
|--------------------|---------------|
| john.doe#foo.com   | 10            |
| john.m.doe#foo.com | 11            |
| jane.doe#foo.com   | 20            |
Expected result
| Email            | Subscriptions |
|------------------|---------------|
| john.doe#foo.com | 21            |
| jane.doe#foo.com | 20            |
I know agrep and a few other functions can do fuzzy matching, but how do I use them to combine rows in a data set?
Here is one way to use agrep in combination with dplyr:
library(dplyr)

df <- data.frame(mail = c("john.doe#foo.com", "john.m.doe#foo.com", "jane.doe#foo.com"),
                 sub = c(10, 11, 20))

df %>%
  rowwise() %>%
  mutate(new = paste(agrep(mail, df$mail, max = 2, ignore.case = TRUE), collapse = ",")) %>%
  group_by(new) %>%
  mutate(sub = sum(sub)) %>%
  slice(1)
  mail               sub new
  <fct>            <dbl> <chr>
1 john.doe#foo.com    21 1,2
2 jane.doe#foo.com    20 3
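If you prefer to aggregate rather than mutate and slice, roughly the same logic can be expressed with summarise(). This is a sketch on the same df, assuming dplyr >= 1.0 for the .groups argument:

library(dplyr)

df %>%
  rowwise() %>%
  mutate(new = paste(agrep(mail, df$mail, max = 2, ignore.case = TRUE), collapse = ",")) %>%
  group_by(new) %>%                        # rows with the same fuzzy-match set fall in one group
  summarise(mail = first(mail),            # keep the first address of each fuzzy group
            sub  = sum(sub),
            .groups = "drop") %>%
  select(mail, sub)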

How to count unique occurrences of data saved in a multi-column table?

I have a table with 3 columns and roughly 14,000 rows. I want to count every occurrence of each distinct type of row.
I am a newbie in R, so I can't really come up with a solution to extract this from the table.
I managed to list all the different values in a single column with levels(), but I can't really make it work for whole rows.
The table has three columns (Type, Protocol, Protocol.1). My expected result:
IPV4|UDP|UDP: 120 times
IPV4|UDP|SSDP: 60 times
...
With some sample data that looks like this:
tst <- data.frame(Type = c("IPV4", " ", "IPV4", "IPV4"),
                  Protocol = c("UDP", " ", "UDP", "UDP"),
                  Protocol.1 = c("SSDP", " ", "UDP", "UDP"))
You could get tallies as follows using tools from the tidyverse (dplyr, magrittr).
library(dplyr)

tst_summary <- tst %>%
  mutate(class_var = paste(Type, Protocol, Protocol.1, sep = "|")) %>%
  group_by(class_var) %>%
  tally() %>%
  as.data.frame()
# # A tibble: 3 x 2
#   class_var         n
#   <chr>         <int>
# 1 " | | "           1
# 2 IPV4|UDP|SSDP     1
# 3 IPV4|UDP|UDP      2
What we're doing here is concatenating the strings from all the different columns (that you want to use to group/classify) together into the contents of a single column class_var using paste() (mutate() creates this new class_var column). Then we can group the data (group_by) with this newly created column and tally the occurrences with tally().
Getting a table with the original columns alongside the generated counts can be done with a for loop and the str_split() function from stringr, as shown below.
library(stringr)

tst_summary <- tst %>%
  mutate(class_var = paste(Type, Protocol, Protocol.1, sep = "|")) %>%
  group_by(class_var) %>%
  tally() %>%
  as.data.frame()

for (i in 1:nrow(tst_summary)) {
  # split "Type|Protocol|Protocol.1" back into its three components
  parts <- unlist(str_split(tst_summary$class_var[i], "\\|"))
  tst_summary$Type[i] <- parts[1]
  tst_summary$Protocol[i] <- parts[2]
  tst_summary$Protocol.1[i] <- parts[3]
}

tst_summary <- tst_summary[, c("Type", "Protocol", "Protocol.1", "n")]
tst_summary
#   Type Protocol Protocol.1 n
# 1                          1
# 2 IPV4      UDP       SSDP 1
# 3 IPV4      UDP        UDP 2
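If the split-back step feels heavy, note that dplyr can also tally by several columns at once with count(), which keeps the original columns directly. A shorter sketch on the same tst data:

library(dplyr)

# one row per distinct (Type, Protocol, Protocol.1) combination, with its count in n
tst %>%
  count(Type, Protocol, Protocol.1)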

Conditional Filtering using R

Consider the data frame given below.
Sample DataFrame
| Name | Age | Type |
|------|-----|------|
| EF   | 50  | A    |
| GH   | 60  | B    |
| VB   | 70  | C    |
Code to perform the filter:
df2 <- df1 %>% filter(Type == 'C') %>% select(Name)
The above code gives me a data frame with a single column and a single row.
I would like to perform a conditional filter where, if a certain Type is not present, the Name is returned as NULL/NA.
Example
df2 <- df1 %>% filter(Type == 'D') %>% select(Name)
should give the following output instead of throwing an error:
| Name |
|------|
| NA   |
Any input would be really helpful; either dplyr or any other method is welcome.
Here is a base R approach:
name <- df[df$Type == "D", "Name"]
ifelse(identical(name, character(0)), NA, name)
[1] NA
If no row has Type "D", the subset operation returns character(0). We compare the output against this and return NA as appropriate.
Data:
df <- data.frame(Name = c("EF", "GH", "VB"),
                 Age = c(50, 60, 70),
                 Type = c("A", "B", "C"),
                 stringsAsFactors = FALSE)
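If this lookup is needed for many different types, the same character(0) logic can be wrapped in a small helper; lookup_name below is a hypothetical name, just a sketch:

# hypothetical helper: return the Name(s) for a given Type, or NA if that Type is absent
lookup_name <- function(data, type) {
  name <- data[data$Type == type, "Name"]
  if (length(name) == 0) NA_character_ else name
}

lookup_name(df, "C")  # "VB"
lookup_name(df, "D")  # NA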
An approach with complete from tidyr would be:
library(dplyr)
library(tidyr)

df1 %>%
  complete(Type = LETTERS) %>% # specify which Types you expect; missing ones are filled with NA
  filter(Type == 'D') %>%
  select(Name)
# A tibble: 1 x 1
# Name
# <fct>
# 1 NA

Replace NA in one row of data frame with values from other

I want to replace NA values in one row with values from another row. Example data:
group <-c('A','A_old')
year1<- c(NA,'20')
year2<- c(NA,'40')
year3<- c('20','230')
datac <- data_frame(group, year1, year2, year3)
The result I want (finaldatac) is:
group <- c('A','A_old')
year1<- c('20','20')
year2<- c('40','40')
year3<- c('20','230')
finaldatac=data_frame(group,year1,year2,year3)
The original table is much larger, so referring to each element one by one and assigning values is not possible.
Thanks!
For the sake of argument below: I need to refer to the row values by name, because the original table is big and I cannot just work with two rows. For example, in the table below I would like to replace row 1 (group == A) with row 5 (group == E). Data are here:
group <-c('A','B','C','D','E','F','G')
year1<- c(NA,'100',NA,'200','300',NA,NA)
year2<- c(NA,'100',NA,'200','300','50','40')
year3<- c('20','100',10,'200','300','150','230')
data=data.frame(group,year1,year2,year3)
So I want to get:
group <-c('A','B','C','D','E','F','G')
year1<- c('300','100',NA,'200','300',NA,NA)
year2<- c('300','100',NA,'200','300','50','40')
year3<- c('20','100',10,'200','300','150','230')
data=data.frame(group,year1,year2,year3)
Other than using fill or na.locf, you could do:
datac %>%
  group_by(grp = gsub("_.*", "", group)) %>%
  mutate_at(vars(contains("year")),
            funs(.[!is.na(.)])) %>%
  ungroup() %>%
  select(-grp)
Output:
# A tibble: 2 x 4
  group year1 year2 year3
  <chr> <chr> <chr> <chr>
1 A     20    40    20
2 A_old 20    40    230
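For comparison, the fill() route mentioned above could look like this (a sketch, assuming a tidyr version that supports .direction = "up"):

library(dplyr)
library(tidyr)

datac %>%
  group_by(grp = gsub("_.*", "", group)) %>%        # "A" and "A_old" end up in the same group
  fill(year1, year2, year3, .direction = "up") %>%  # copy non-NA values upward within each group
  ungroup() %>%
  select(-grp)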
For your second example, you could do:
data %>%
  mutate_at(
    vars(contains("year")),
    funs(
      case_when(
        group == "A" & is.na(.) ~ .[group == "E"],
        TRUE ~ .
      )
    )
  )
Output:
  group year1 year2 year3
1     A   300   300    20
2     B   100   100   100
3     C  <NA>  <NA>    10
4     D   200   200   200
5     E   300   300   300
6     F  <NA>    50   150
7     G  <NA>    40   230
You can also add other conditions to case_when.
For instance, if you'd additionally like to replace group C's years with those of group D, you would add:
data %>%
  mutate_at(
    vars(contains("year")),
    funs(
      case_when(
        group == "A" & is.na(.) ~ .[group == "E"],
        group == "C" & is.na(.) ~ .[group == "D"],
        TRUE ~ .
      )
    )
  )
After a very long evening and an R-induced headache, I managed to get this:
rm(list = ls())
library(dplyr) # for data_frame()

group <- c('A', 'A old')
year1 <- c(NA, '20')
year2 <- c(NA, '40')
year3 <- c('20', '230')
datac <- data_frame(group, year1, year2, year3)

group <- c('A', 'A old')
year1 <- c('20', '20')
year2 <- c('40', '40')
year3 <- c('20', '230')
finaldatac <- data_frame(group, year1, year2, year3)

# replace the space in the group label so it survives as a column name, then transpose
datac$group <- gsub(' ', '--', datac$group)
datact <- t(datac)
colnames(datact) <- datact[1, ]
datact <- datact[-1, ]

# fill the NAs in column "A" with the values from "A--old"
datact[, "A"] <- ifelse(!is.na(datact[, "A"]), datact[, "A"], datact[, "A--old"])

# transpose back and restore the original structure
datactt <- t(datact)
group <- rownames(datactt)
datactt <- cbind(datactt, group)
rownames(datactt) <- c()
datactt <- as.data.frame(datactt)
sapply(datactt, class)
datactt <- data.frame(lapply(datactt, as.character), stringsAsFactors = FALSE)
datactt$group <- gsub('--', ' ', datactt$group)
Here datactt is (hopefully) the same as the finaldatac I wanted. I am sure this can't be the best solution, and it is obviously not the prettiest. If anybody has something similar but shorter or more efficient, please post it; I would appreciate the answer.
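A shorter route than transposing, if a recent dplyr (>= 1.0 for across()) is available, could be to coalesce each year column with the first non-NA value in its group. A sketch using the same datac with the 'A old' label:

library(dplyr)

datac %>%
  group_by(grp = sub(" old$", "", group)) %>%            # "A" and "A old" share one group
  mutate(across(starts_with("year"),
                ~ coalesce(.x, .x[!is.na(.x)][1]))) %>%  # fill NAs from the group's non-NA value
  ungroup() %>%
  select(-grp)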

Remove old date rows in R

I have this table:
Date  | Column1 | Column2
------+---------+--------
6/1/1 | A       | 3
5/1/1 | B       | 4
4/1/1 | C       | 5
1/1/1 | A       | 1
7/1/1 | B       | 2
1/1/1 | C       | 3
I need this table:
Date  | Column1 | Column2
------+---------+--------
6/1/1 | A       | 3
4/1/1 | C       | 5
7/1/1 | B       | 2
How do I remove the old rows based on two criteria (Column1, Column2)?
Group by Column1 and Column2, arrange by Date in descending order within each group, then keep the first row with slice, like this:
library(dplyr)

ans <- df %>%
  group_by(Column1, Column2) %>%
  arrange(desc(as.Date(Date))) %>% # sorts within each group
  slice(1) %>%                     # keep the first row of each group
  ungroup()
Your error occurs because your date format is a bit unusual. I recommend lubridate::parse_date_time, which is more robust than the base R date-time functions:
library(lubridate)
library(dplyr)

ans <- df %>%
  group_by(Column1, Column2) %>%
  arrange(desc(parse_date_time(Date, orders = "mdy"))) %>% # dates parsed as month-day-year; sorts within each group
  slice(1) %>%                                             # keep the first row of each group
  ungroup()
EDIT
Based on a helpful comment by @count, we can simplify the dplyr chain to:
library(lubridate)
library(dplyr)

ans <- df %>%
  group_by(Column1, Column2) %>%
  slice(which.max(parse_date_time(Date, orders = "mdy"))) %>% # keep the most recent row of each group
  ungroup()
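For anyone who wants to run the snippets above, sample data matching the table in the question can be set up like so (Date kept as character so it can be parsed afterwards):

df <- data.frame(
  Date    = c("6/1/1", "5/1/1", "4/1/1", "1/1/1", "7/1/1", "1/1/1"),
  Column1 = c("A", "B", "C", "A", "B", "C"),
  Column2 = c(3, 4, 5, 1, 2, 3),
  stringsAsFactors = FALSE
)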
