I have a data set that might contain some very similar keys, for example one row of data for each of the email addresses john.doe#foo.com and john.m.doe#foo.com. How can I combine similarly named keys and aggregate over them in R?
Sample input
|Email | Subscriptions |
|john.doe#foo.com | 10 |
|john.m.doe#foo.com | 11 |
|jane.doe#foo.com | 20 |
Expected result
|Email | Subscriptions |
|john.doe#foo.com | 21 |
|jane.doe#foo.com | 20 |
I know agrep and a few other tools can do fuzzy matching, but how do I use them to combine rows in a data set?
Here is one way to use agrep in combination with dplyr:
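library(dplyr)

df <- data.frame(mail = c("john.doe#foo.com", "john.m.doe#foo.com", "jane.doe#foo.com"),
                 sub = c(10, 11, 20))

df %>%
  rowwise() %>%
  # for each address, list the indices of all addresses within edit distance 2
  mutate(new = paste(agrep(mail, df$mail, max.distance = 2, ignore.case = TRUE), collapse = ",")) %>%
  group_by(new) %>%
  mutate(sub = sum(sub)) %>%
  slice(1) # keep the first row of each group of similar addresses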
mail sub new
<fct> <dbl> <chr>
1 john.doe#foo.com 21 1,2
2 jane.doe#foo.com 20 3
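The same grouping idea can be sketched in base R, without dplyr. This is only an illustration under some assumptions: the helper vector first_match is a name made up here, each address is assigned to the first address it fuzzily matches, and an edit distance of 2 is taken from the answer above. Fuzzy matching is not transitive, so on larger data the groups may need a closer look.

```r
df <- data.frame(mail = c("john.doe#foo.com", "john.m.doe#foo.com", "jane.doe#foo.com"),
                 sub = c(10, 11, 20),
                 stringsAsFactors = FALSE)

# For each address, the index of the first address within edit distance 2
# (every address matches itself, so there is always at least one hit)
first_match <- vapply(
  df$mail,
  function(m) agrep(m, df$mail, max.distance = 2, ignore.case = TRUE)[1],
  integer(1),
  USE.NAMES = FALSE
)

# Sum the subscriptions per group and label each group by its first address
groups <- sort(unique(first_match))
result <- data.frame(
  mail = df$mail[groups],
  sub = as.vector(tapply(df$sub, first_match, sum)),
  stringsAsFactors = FALSE
)
result
```

Here john.doe#foo.com and john.m.doe#foo.com fall into one group and jane.doe#foo.com stays on its own.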
I have a table with 3 columns and approximately 14,000 rows. I want to count the occurrences of each distinct type of row.
I am new to R, so I can't really come up with a way to extract this from the table.
I managed to list all the different values in a single column with levels(), but I can't get further than that.
Table looks like this:
My expected result:
IPV4|UDP|UDP: 120 times
IPV4|UDP|SSDP: 60 times
With some sample data that looks like this
tst <- data.frame(Type = c("IPV4", " ", "IPV4", "IPV4"), Protocol = c("UDP", " ", "UDP", "UDP"), Protocol.1 = c("SSDP", " ", "UDP", "UDP"))
You could get tallies as follows using tools from the tidyverse (dplyr, magrittr).
library(dplyr)

tst_summary <- tst %>%
  mutate(class_var = paste(Type, Protocol, Protocol.1, sep = "|")) %>%
  group_by(class_var) %>%
  tally()
tst_summary
# # A tibble: 3 x 2
#   class_var         n
#   <chr>         <int>
# 1 " | | "           1
# 2 IPV4|UDP|SSDP     1
# 3 IPV4|UDP|UDP      2
What we're doing here is concatenating the strings from all the different columns (that you want to use to group/classify) together into the contents of a single column class_var using paste() (mutate() creates this new class_var column). Then we can group the data (group_by) with this newly created column and tally the occurrences with tally().
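The concatenate-then-count idea described above can also be sketched in base R with table(), which may help to see what the pipeline does step by step. This is a minimal illustration on the same sample data, not a replacement for the dplyr version.

```r
tst <- data.frame(Type = c("IPV4", " ", "IPV4", "IPV4"),
                  Protocol = c("UDP", " ", "UDP", "UDP"),
                  Protocol.1 = c("SSDP", " ", "UDP", "UDP"),
                  stringsAsFactors = FALSE)

# Build one key per row by pasting the three columns together
class_var <- paste(tst$Type, tst$Protocol, tst$Protocol.1, sep = "|")

# Count how often each key occurs
counts <- table(class_var)
counts[["IPV4|UDP|UDP"]]
```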
Getting a table with the original columns along with the generated counts would involve a for loop and the str_split() function from stringr, as shown below.
library(dplyr)
library(stringr)

tst_summary <- tst %>%
  mutate(class_var = paste(Type, Protocol, Protocol.1, sep = "|")) %>%
  group_by(class_var) %>%
  tally() %>% as.data.frame()

# Split each class_var back into its three original columns
for (i in seq_len(nrow(tst_summary))) {
  parts <- unlist(str_split(tst_summary$class_var[i], "\\|"))
  tst_summary$Type[i] <- parts[1]
  tst_summary$Protocol[i] <- parts[2]
  tst_summary$Protocol.1[i] <- parts[3]
}
# Reorder: original columns first, then the count
tst_summary <- tst_summary[, c(3, 4, 5, 2)]
tst_summary
#   Type Protocol Protocol.1 n
# 1                          1
# 2 IPV4      UDP       SSDP 1
# 3 IPV4      UDP        UDP 2
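If the goal is just the original columns plus a count, a shorter route that skips the paste-and-split round trip may be worth considering. The sketch below uses base R's aggregate() on the same sample data; the helper column n is added here only for counting and is not part of the original table.

```r
tst <- data.frame(Type = c("IPV4", " ", "IPV4", "IPV4"),
                  Protocol = c("UDP", " ", "UDP", "UDP"),
                  Protocol.1 = c("SSDP", " ", "UDP", "UDP"),
                  stringsAsFactors = FALSE)

# Give every row a weight of 1, then sum the weights per combination
tst$n <- 1
counts <- aggregate(n ~ Type + Protocol + Protocol.1, data = tst, FUN = sum)
counts
```

Each distinct combination of Type, Protocol and Protocol.1 becomes one row, with n holding the number of occurrences.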
Consider the data frame given below:
Sample DataFrame
| Name | Age | Type |
| EF | 50 | A |
| GH | 60 | B |
| VB | 70 | C |
Code to perform Filter
df2 <- df1 %>% filter(Type == 'C') %>% select(Name)
The above code gives me a data frame with a single column and a single row.
I would like to perform a conditional filter so that, if a certain type is not present, the name is returned as NULL/NA.
df2 <- df1 %>% filter(Type == 'D') %>% select(Name)
should give the output:
| Name |
| NA |
instead of an empty result or an error. Any input would be really helpful. A dplyr solution or any other method would be appreciated.
Here is a base R approach:
name <- df[df$Type == "D", "Name"]
ifelse(identical(name, character(0)), NA, name)
[1] NA
If no row has Type "D", the subset operation returns character(0). We compare the result against that and return NA when it matches.
Data:
df <- data.frame(Name=c("EF", "GH", "VB"),
                 Age=c(50, 60, 70),
                 Type=c("A", "B", "C"),
                 stringsAsFactors=FALSE)
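The check against an empty result can be wrapped in a small helper. The sketch below is one possible way to do it; first_or_na is a hypothetical name introduced here, and it tests the length of the subset rather than comparing against character(0), so it also works for non-character columns.

```r
df <- data.frame(Name = c("EF", "GH", "VB"),
                 Age = c(50, 60, 70),
                 Type = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: first element of x, or NA when x is empty
first_or_na <- function(x) if (length(x) == 0) NA else x[1]

missing_type <- first_or_na(df[df$Type == "D", "Name"])  # no Type "D" rows
present_type <- first_or_na(df[df$Type == "C", "Name"])  # one Type "C" row
```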
An approach with complete from tidyr would be:
library(dplyr)
library(tidyr)

df1 %>%
  complete(Type = LETTERS) %>% # Specify which Types you'd expect; the other columns are filled with NA
  filter(Type == 'D') %>%
  select(Name)
# A tibble: 1 x 1
# Name
# <fct>
# 1 NA
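The idea behind complete(), making sure a row exists for every Type you ask about, can also be sketched with a left join in base R. This is an illustration only: wanted is a name made up here for the lookup table, and merge() with all.x = TRUE stands in for the tidyr approach.

```r
df1 <- data.frame(Name = c("EF", "GH", "VB"),
                  Age = c(50, 60, 70),
                  Type = c("A", "B", "C"),
                  stringsAsFactors = FALSE)

# One row per Type we want to look up; "D" does not occur in df1
wanted <- data.frame(Type = "D", stringsAsFactors = FALSE)

# Left join: keep every row of `wanted`, fill unmatched columns with NA
res <- merge(wanted, df1, by = "Type", all.x = TRUE)
res["Name"]
```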
I want to replace the NAs in one row with the values from another row. Example data:
group <-c('A','A_old')
year1<- c(NA,'20')
year2<- c(NA,'40')
year3<- c('20','230')
datac <- data.frame(group, year1, year2, year3)
I want to get:
group <-c('A','A_old')
year1<- c('20','20')
year2<- c('40','40')
year3<- c('20','230')
finaldatac <- data.frame(group, year1, year2, year3)
The original table is much larger, so referring to each element one by one and assigning values is not feasible.
More generally, I need to refer to rows by their group name, because the original table is large and I cannot work with just two rows. For example, in the table below I would like to fill the NAs in row 1 (group == A) with the values from row 5 (group == E). The data are:
group <-c('A','B','C','D','E','F','G')
year1<- c(NA,'100',NA,'200','300',NA,NA)
year2<- c(NA,'100',NA,'200','300','50','40')
year3<- c('20','100',10,'200','300','150','230')
data <- data.frame(group, year1, year2, year3)
So I want to get:
group <-c('A','B','C','D','E','F','G')
year1<- c('300','100',NA,'200','300',NA,NA)
year2<- c('300','100',NA,'200','300','50','40')
year3<- c('20','100',10,'200','300','150','230')
Other than using fill or na.locf, you could do:
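library(dplyr)

datac %>%
  group_by(grp = gsub("_.*", "", group)) %>%
  # within each group, replace every year column with its non-NA values
  mutate_at(vars(starts_with("year")), ~ .[!is.na(.)]) %>%
  ungroup() %>% select(-grp)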
# A tibble: 2 x 4
group year1 year2 year3
<chr> <chr> <chr> <chr>
1 A 20 40 20
2 A_old 20 40 230
For your second example, you could do:
data %>%
  mutate_at(vars(-group), ~ case_when(
    group == "A" & is.na(.) ~ .[group == "E"],
    TRUE ~ .))
group year1 year2 year3
1 A 300 300 20
2 B 100 100 100
3 C <NA> <NA> 10
4 D 200 200 200
5 E 300 300 300
6 F <NA> 50 150
7 G <NA> 40 230
You can also add other conditions to case_when.
For instance, if you'd additionally like to replace C years with what is there for group D, you would add:
data %>%
  mutate_at(vars(-group), ~ case_when(
    group == "A" & is.na(.) ~ .[group == "E"],
    group == "C" & is.na(.) ~ .[group == "D"],
    TRUE ~ .))
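For comparison, the same "fill the NAs of one named row from another named row" logic can be sketched in base R. The function fill_from is a hypothetical helper introduced here, and it assumes each group name occurs in exactly one row.

```r
data <- data.frame(group = c("A", "B", "C", "D", "E", "F", "G"),
                   year1 = c(NA, "100", NA, "200", "300", NA, NA),
                   year2 = c(NA, "100", NA, "200", "300", "50", "40"),
                   year3 = c("20", "100", "10", "200", "300", "150", "230"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: copy values from the `source` row into the `target`
# row, but only in columns where the target is NA
fill_from <- function(d, target, source, cols = setdiff(names(d), "group")) {
  t <- which(d$group == target)
  s <- which(d$group == source)
  for (col in cols) {
    if (is.na(d[t, col])) d[t, col] <- d[s, col]
  }
  d
}

filled <- fill_from(data, "A", "E")
filled <- fill_from(filled, "C", "D")
filled
```

Calling the helper once per pair of groups plays the same role as adding another condition to case_when.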
After a very long evening and a headache from R, I managed to get this:
rm(list = ls())
group <-c('A','A old')
year1<- c(NA,'20')
year2<- c(NA,'40')
year3<- c('20','230')
datac <- data.frame(group, year1, year2, year3)
group <-c('A','A old')
year1<- c('20','20')
year2<- c('40','40')
year3<- c('20','230')
finaldatac <- data.frame(group, year1, year2, year3)
datac$group <- gsub(' ', '--', datac$group)
datact = t(datac) # transpose so that each group becomes a column
colnames(datact) = datact[1, ]
datact = datact[-1, ]
datact[,"A"] <- ifelse(!is.na(datact[,"A"]), datact[,"A"] , datact[,"A--old"])
datactt = t(datact) # transpose back
group = rownames(datactt)
datactt<-cbind(datactt, group)
rownames(datactt) <- c()
datactt <- as.data.frame(datactt)
sapply(datactt, class)
datactt <- data.frame(lapply(datactt, as.character), stringsAsFactors=FALSE)
datactt$group <- gsub('--', ' ', datactt$group)
where datactt (hopefully) is the same as the finaldatac I wanted. I am sure this can't be the best solution, and it is obviously not the prettiest. If anybody has something similar but shorter or more efficient, please post it; I would appreciate the answer.
I have table:
Date | Column1 | Column2
6/1/1 | A | 3
5/1/1 | B | 4
4/1/1 | C | 5
1/1/1 | A | 1
7/1/1 | B | 2
1/1/1 | C | 3
I need table:
Date | Column1 | Column2
6/1/1 | A | 3
4/1/1 | C | 5
7/1/1 | B | 2
How can I remove the older rows based on two criteria (Column1, Column2)?
Group by Column1 and Column2, arrange by Date in descending order, then keep the first row of each group with slice, like this:
ans <- df %>%
  group_by(Column1, Column2) %>%
  arrange(desc(as.Date(Date))) %>% # newest dates first
  slice(1) %>% # keep first row entry of each group
  ungroup()
Your error occurs because your date format is not one that as.Date recognizes by default. I recommend using lubridate::parse_date_time, which is more robust than the base R date-time functions:
library(dplyr)
library(lubridate)

ans <- df %>%
  group_by(Column1, Column2) %>%
  # the date order is specified as month-day-year
  arrange(desc(parse_date_time(Date, orders = "mdy"))) %>% # newest dates first
  slice(1) %>% # keep first row entry of each group
  ungroup()
Based on a helpful comment by @count, we can simplify the dplyr chain to
ans <- df %>%
  group_by(Column1, Column2) %>%
  slice(which.max(parse_date_time(Date, orders = "mdy"))) %>% # keep max-Date row entry of each group
  ungroup()
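A base R sketch of the "keep the newest row per group" step is given below, with two assumptions that are not confirmed by the question. First, the dates are read as month/day/year and turned into a sortable number by hand (the names parts and key are made up here). Second, the grouping uses Column1 only, because in the sample table every (Column1, Column2) pair is unique, so grouping by both columns would keep all six rows.

```r
df <- data.frame(Date = c("6/1/1", "5/1/1", "4/1/1", "1/1/1", "7/1/1", "1/1/1"),
                 Column1 = c("A", "B", "C", "A", "B", "C"),
                 Column2 = c(3, 4, 5, 1, 2, 3),
                 stringsAsFactors = FALSE)

# Split "m/d/y" into integer columns: month, day, year (assumed order)
parts <- do.call(rbind, lapply(strsplit(df$Date, "/"), as.integer))

# Sortable numeric key: year, then month, then day
key <- parts[, 3] * 10000 + parts[, 1] * 100 + parts[, 2]

# Keep the rows whose key equals the maximum key within their Column1 group
newest <- key == ave(key, df$Column1, FUN = max)
ans <- df[newest, ]
ans
```

If two rows in a group share the newest date, this sketch keeps both, whereas slice(1) and which.max keep only one.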