I have a data set that might contain some very similar keys - something like a row of data for each of the email addresses john.doe@foo.com and john.m.doe@foo.com. How can I combine similarly named keys and do an aggregate in R?
Sample input
| Email              | Subscriptions |
|--------------------|---------------|
| john.doe@foo.com   | 10            |
| john.m.doe@foo.com | 11            |
| jane.doe@foo.com   | 20            |
Expected result
| Email            | Subscriptions |
|------------------|---------------|
| john.doe@foo.com | 21            |
| jane.doe@foo.com | 20            |
I know agrep and a few other libraries can do fuzzy matching, but how do I employ it to combine rows in a data set?
Here is one way to use agrep in combination with dplyr:
library(dplyr)

df <- data.frame(mail = c("john.doe@foo.com", "john.m.doe@foo.com", "jane.doe@foo.com"),
                 sub = c(10, 11, 20))

df %>%
  rowwise() %>%
  mutate(new = paste(agrep(mail, df$mail, max = 2, ignore.case = TRUE), collapse = ",")) %>%
  group_by(new) %>%
  mutate(sub = sum(sub)) %>%
  slice(1)
  mail               sub new  
  <fct>            <dbl> <chr>
1 john.doe@foo.com    21 1,2  
2 jane.doe@foo.com    20 3    
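A variation on the same idea (just a sketch, using the df defined above) collapses each fuzzy-matched group with summarise() instead of mutate() plus slice():

library(dplyr)

df %>%
  rowwise() %>%
  mutate(new = paste(agrep(mail, df$mail, max = 2, ignore.case = TRUE), collapse = ",")) %>%
  group_by(new) %>%
  summarise(mail = first(mail),  # keep the first email of each fuzzy group
            sub  = sum(sub))     # aggregate the subscriptions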
I have a table with 3 columns and approximately 14,000 rows. I want to count every occurrence of each type of row.
I am a newbie in R, so I can't really come up with a solution to extract this from the table.
I managed to list all the distinct values of a single column with levels(), but I can't get any further than that.
The table has three columns: Type, Protocol and Protocol.1.
My expected result:
IPV4|UDP|UDP: 120 times
IPV4|UDP|SSDP: 60 times
...
With some sample data that looks like this:

tst <- data.frame(Type = c("IPV4", " ", "IPV4", "IPV4"),
                  Protocol = c("UDP", " ", "UDP", "UDP"),
                  Protocol.1 = c("SSDP", " ", "UDP", "UDP"))
You could get tallies as follows using tools from the tidyverse (dplyr, magrittr).
library(dplyr)

tst %>%
  mutate(class_var = paste(Type, Protocol, Protocol.1, sep = "|")) %>%
  group_by(class_var) %>%
  tally()
# A tibble: 3 x 2
# class_var n
# <chr> <int>
# 1 " | | " 1
# 2 IPV4|UDP|SSDP 1
# 3 IPV4|UDP|UDP 2
What we're doing here is pasting the values of all the columns you want to group/classify by into a single new column, class_var, using paste() (mutate() creates this new column). We then group the data with group_by() on this new column and count the occurrences with tally().
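As a side note, dplyr's count() can produce the same tally in a single step, grouping directly on the columns without the concatenation (a minimal sketch on the same tst data):

library(dplyr)

# count() groups by the listed columns and adds a column n with the tallies
tst %>% count(Type, Protocol, Protocol.1)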
To get a table with the original columns alongside the generated counts, one option is a for loop combined with the str_split() function from stringr, as shown below.
library(stringr)

tst_summary <- tst %>%
  mutate(class_var = paste(Type, Protocol, Protocol.1, sep = "|")) %>%
  group_by(class_var) %>%
  tally() %>%
  as.data.frame()
for (i in 1:nrow(tst_summary)) {
  parts <- unlist(str_split(tst_summary$class_var[i], "\\|"))
  tst_summary$Type[i]       <- parts[1]
  tst_summary$Protocol[i]   <- parts[2]
  tst_summary$Protocol.1[i] <- parts[3]
}

tst_summary <- tst_summary[, c(3, 4, 5, 2)]  # reorder: Type, Protocol, Protocol.1, n
tst_summary
#   Type Protocol Protocol.1 n
# 1                          1
# 2 IPV4      UDP       SSDP 1
# 3 IPV4      UDP        UDP 2
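If you prefer to avoid the loop, a shorter route (a sketch, applied to tst_summary as produced by the tally step above, before the loop) is tidyr's separate(), which splits class_var back into the original columns:

library(tidyr)

# split class_var on "|" back into the three original columns
tst_summary %>%
  separate(class_var, into = c("Type", "Protocol", "Protocol.1"), sep = "\\|")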
Consider the dataframe given below.
Sample DataFrame
| Name | Age | Type |
|------|-----|------|
| EF   | 50  | A    |
| GH   | 60  | B    |
| VB   | 70  | C    |
Code to perform Filter
df2 <- df1 %>% filter(Type == 'C') %>% select(Name)
The above code gives me a dataframe with a single column and row.
I would like to perform a conditional filter: if a certain Type is not present, the Name should come back as NULL/NA.
Example
df2 <- df1 %>% filter(Type == 'D') %>% select(Name)
This should give the output below, instead of returning an empty result or throwing an error:

| Name |
|------|
| NA   |
Any input would be really helpful; dplyr or any other method is fine.
Here is a base R approach:
name <- df[df$Type == "D", "Name"]
ifelse(identical(name, character(0)), NA, name)
[1] NA
If no row has Type "D", the subset operation returns character(0). We can compare the result against this and return NA as appropriate.
Data:
df <- data.frame(Name = c("EF", "GH", "VB"),
                 Age = c(50, 60, 70),
                 Type = c("A", "B", "C"),
                 stringsAsFactors = FALSE)
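A slightly more general variant of the same idea (just a sketch) checks the length of the subset instead of comparing against character(0), which also behaves sensibly if several rows happen to match:

name <- df[df$Type == "D", "Name"]
# length() is 0 when no row matches; otherwise return the match(es)
if (length(name) == 0) NA else name
# [1] NA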
An approach with complete from tidyr would be:
library(dplyr)
library(tidyr)
df1 %>%
  complete(Type = LETTERS) %>% # specify which Types you'd expect; missing ones are filled with NA
  filter(Type == 'D') %>%
  select(Name)
# A tibble: 1 x 1
# Name
# <fct>
# 1 NA
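Another option (not from the answers above, just a sketch assuming df1 is the question's data frame) is to right-join against the Type you are looking for, which yields an explicit NA row when that Type is absent:

library(dplyr)

# the join keeps one row per requested Type; unmatched Types get NA in the other columns
df1 %>%
  right_join(data.frame(Type = "D", stringsAsFactors = FALSE), by = "Type") %>%
  select(Name)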
I want to replace NAs in one row with values from another row. Example data:
group <-c('A','A_old')
year1<- c(NA,'20')
year2<- c(NA,'40')
year3<- c('20','230')
datac=data_frame(group,year1,year2,year3)
The final data (finaldatac) should look like this:

group <-c('A','A_old')
year1<- c('20','20')
year2<- c('40','40')
year3<- c('20','230')
finaldatac=data_frame(group,year1,year2,year3)
The original table is much larger, so referring to each element one by one and assigning values is not feasible.
Thanks!
For the sake of argument below, I need to refer to the row values by their names, as the original table is big and I cannot work with only two rows. For example, in the table below I would like to replace the NAs in row 1 (group == 'A') with the values from row 5 (group == 'E'). Data are here:
group <-c('A','B','C','D','E','F','G')
year1<- c(NA,'100',NA,'200','300',NA,NA)
year2<- c(NA,'100',NA,'200','300','50','40')
year3<- c('20','100',10,'200','300','150','230')
data=data.frame(group,year1,year2,year3)
So I want to get:
group <-c('A','B','C','D','E','F','G')
year1<- c('300','100',NA,'200','300',NA,NA)
year2<- c('300','100',NA,'200','300','50','40')
year3<- c('20','100',10,'200','300','150','230')
data=data.frame(group,year1,year2,year3)
Other than using fill or na.locf, you could do:
datac %>%
  group_by(grp = gsub("_.*", "", group)) %>%
  mutate_at(vars(contains("year")),
            funs(.[!is.na(.)])) %>%
  ungroup() %>%
  select(-grp)
Output:
# A tibble: 2 x 4
group year1 year2 year3
<chr> <chr> <chr> <chr>
1 A 20 40 20
2 A_old 20 40 230
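For completeness, here is a sketch of the fill() route mentioned above (the grouping trick is the same; the "up" direction is an assumption based on the NA row coming before the "_old" row):

library(dplyr)
library(tidyr)

datac %>%
  group_by(grp = gsub("_.*", "", group)) %>%    # A and A_old end up in one group
  fill(contains("year"), .direction = "up") %>% # copy values upward into the NA row
  ungroup() %>%
  select(-grp)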
For your second example, you could do:
data %>%
  mutate_at(
    vars(contains("year")),
    funs(
      case_when(
        group == "A" & is.na(.) ~ .[group == "E"],
        TRUE ~ .
      )
    )
  )
Output:
group year1 year2 year3
1 A 300 300 20
2 B 100 100 100
3 C <NA> <NA> 10
4 D 200 200 200
5 E 300 300 300
6 F <NA> 50 150
7 G <NA> 40 230
You can also add other conditions to case_when.
For instance, if you'd additionally like to replace group C's missing years with what is there for group D, you would add a second condition:
data %>%
  mutate_at(
    vars(contains("year")),
    funs(
      case_when(
        group == "A" & is.na(.) ~ .[group == "E"],
        group == "C" & is.na(.) ~ .[group == "D"],
        TRUE ~ .
      )
    )
  )
After a very long evening and a headache from R, I managed to get this:
rm(list = ls())
group <-c('A','A old')
year1<- c(NA,'20')
year2<- c(NA,'40')
year3<- c('20','230')
datac=data_frame(group,year1,year2,year3)
group <-c('A','A old')
year1<- c('20','20')
year2<- c('40','40')
year3<- c('20','230')
finaldatac=data_frame(group,year1,year2,year3)
datac$group <- gsub(' ', '--', datac$group)   # protect the space in "A old" before transposing
datact <- t(datac)                            # transpose: groups become columns
colnames(datact) <- datact[1, ]               # use the group row as column names
datact <- datact[-1, ]                        # drop the group row itself
datact[, "A"] <- ifelse(!is.na(datact[, "A"]), datact[, "A"], datact[, "A--old"])  # fill A's NAs from A--old
datactt <- t(datact)                          # transpose back
group <- rownames(datactt)
datactt <- cbind(datactt, group)              # re-attach the group names as a column
rownames(datactt) <- c()
datactt <- as.data.frame(datactt)
sapply(datactt, class)                        # everything is a factor at this point
datactt <- data.frame(lapply(datactt, as.character), stringsAsFactors = FALSE)  # back to character columns
datactt$group <- gsub('--', ' ', datactt$group)  # restore the original space in the group name
Here datactt is (hopefully) the same as the finaldatac I wanted. I am sure this can't be the best solution, and it is obviously not the prettiest. If anybody has something similar but shorter or more efficient, please post it; I would appreciate the answer.
I have a table:
Date  | Column1 | Column2
------+---------+--------
6/1/1 | A       | 3
5/1/1 | B       | 4
4/1/1 | C       | 5
1/1/1 | A       | 1
7/1/1 | B       | 2
1/1/1 | C       | 3
I need this table:
Date  | Column1 | Column2
------+---------+--------
6/1/1 | A       | 3
4/1/1 | C       | 5
7/1/1 | B       | 2
How do I remove the old rows based on two criteria (Column1, Column2)?
Group by Column1 and Column2, arrange by Date in descending order, then keep the first row of each group with slice, like this:
library(dplyr)

ans <- df %>%
  group_by(Column1, Column2) %>%
  arrange(desc(as.Date(Date))) %>% # sort by date, newest first
  slice(1) %>%                     # keep the first row of each group
  ungroup()
If the as.Date() call above errors, it is because your date format is a bit unusual. I recommend lubridate::parse_date_time, which is more robust than the base R datetime functions:
library(lubridate)
library(dplyr)

ans <- df %>%
  group_by(Column1, Column2) %>%
  arrange(desc(parse_date_time(Date, orders = "mdy"))) %>% # dates are month/day/year
  slice(1) %>%                                             # keep the first row of each group
  ungroup()
EDIT
Based on a helpful comment by @count, we can simplify the dplyr chain to:
library(lubridate)
library(dplyr)

ans <- df %>%
  group_by(Column1, Column2) %>%
  slice(which.max(parse_date_time(Date, orders = "mdy"))) %>% # keep the max-Date row of each group
  ungroup()
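For anyone who wants to run the snippets above, the sample table from the question could be written as code along these lines (column names and types are assumptions from the table):

df <- data.frame(
  Date    = c("6/1/1", "5/1/1", "4/1/1", "1/1/1", "7/1/1", "1/1/1"),
  Column1 = c("A", "B", "C", "A", "B", "C"),
  Column2 = c(3, 4, 5, 1, 2, 3),
  stringsAsFactors = FALSE
)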