I have merged two data frames using bind_rows, and I now have pairs of rows like the two below:
Page Path Page Title Byline Pageviews
/facilities/when-lighting-strikes NA NA 668
/facilities/when-lighting-strikes When Lighting Strikes Tom Jones NA
When I have these types of duplicate page paths, I'd like to merge the rows with identical page paths: drop the two NAs in the first row, keeping the page title (When Lighting Strikes) and byline (Tom Jones), and keep the pageviews value of 668 from the first row. It seems that I need to:
identify the duplicate page paths
look to see if there are different titles and bylines; remove NAs
keep the row with the pageview result; remove the NA row
Is there a way I can do this in R dplyr? Or is there a better way?
A simple solution:
library(dplyr)
df %>% group_by(PagePath) %>% summarise_each(funs(na.omit))
# Source: local data frame [1 x 4]
#
# PagePath PageTitle Byline Pageviews
# (fctr) (fctr) (fctr) (int)
# 1 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
If your data is more complicated, you may need a more robust approach.
Data
df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
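In more recent dplyr releases, summarise_each() and funs() are deprecated. A minimal sketch of the same idea with across() follows. It assumes dplyr 1.0.0 or later, and that each column has at most one distinct non-missing value per PagePath (the first one is kept, and NA is returned when a group has none). The name `collapsed` is only illustrative, and the example data is rebuilt with character columns for brevity.

```r
library(dplyr)

# same two-row example as above, with character columns instead of factors
df <- data.frame(
  PagePath  = rep("/facilities/when-lighting-strikes", 2),
  PageTitle = c(NA, "When Lighting Strikes"),
  Byline    = c(NA, "Tom Jones"),
  Pageviews = c(668L, NA),
  stringsAsFactors = FALSE
)

collapsed <- df %>%
  group_by(PagePath) %>%
  # keep the first non-missing value of every non-grouping column
  summarise(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")

collapsed
```

Taking the first non-missing value, rather than calling na.omit directly, keeps the result at one row per page path even if a group happens to hold several non-missing values.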
Use the replace function in a for loop:
for (i in unique(df$Page_Path)) {
  in_group <- df$Page_Path == i
  views <- df$Pageviews[in_group]
  # fill missing pageviews with the group's first non-missing value
  df$Pageviews[in_group] <- replace(views, is.na(views), views[!is.na(views)][1])
}
df <- subset(df, !is.na(Page_Title))
print(df)
Page_Path Page_Title Byline Pageviews
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
Here is an option using data.table and complete.cases. We convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'PagePath', loop through the columns of the dataset (lapply(.SD, ..)) and remove the NA elements with complete.cases. complete.cases returns a logical vector, which can be used for subsetting. According to this, complete.cases is much faster than na.omit, and coupled with data.table it should increase efficiency.
library(data.table)
setDT(df)[, lapply(.SD, function(x) x[complete.cases(x)]), by = PagePath]
# PagePath PageTitle Byline Pageviews
#1: /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
data
df <- structure(list(PagePath = structure(c(1L, 1L),
.Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
Another way to do this (similar to a previous solution that uses dplyr) is shown below. Note that paste returns character vectors, so Pageviews ends up as text rather than a number:
df %>% group_by(PagePath) %>%
dplyr::summarize(PageTitle = paste(na.omit(PageTitle)),
Byline = paste(na.omit(Byline)),
Pageviews =paste(na.omit(Pageviews)))
An alternative approach uses fill from tidyr. With tidyverse 1.3.0+ (dplyr 0.8.5+), you can use fill to fill in missing values.
See this for more information https://tidyr.tidyverse.org/reference/fill.html
DATA Thanks Alistaire
df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
# A tibble: 2 x 4
# Groups: PagePath [1]
PagePath PageTitle Byline Pageviews
<fct> <fct> <fct> <int>
1 /facilities/when-lighting-strikes NA NA 668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones NA
CODE
I just did this for PageTitle but you can repeat fill to do it for other columns. (dplyr gurus might have a smarter way to do all 3 columns at once). If you have ordered data like dates, then you can set .direction to be just down for example (look at past data).
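library(dplyr)
library(tidyr)
# the pipe must end a line, not start one, or R stops parsing after group_by()
df.new <- df %>% group_by(PagePath) %>%
  fill(PageTitle, .direction = "updown")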
which gives you
# A tibble: 2 x 4
# Groups: PagePath [1]
PagePath PageTitle Byline Pageviews
<fct> <fct> <fct> <int>
1 /facilities/when-lighting-strikes When Lighting Strikes NA 668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones NA
Once you have all the NAs cleaned up then you can use distinct or rank to get your final summarised dataframe.
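As a sketch of doing all three columns at once and then collapsing the result: fill accepts several columns in one call, and once every row in a group carries the same values, distinct leaves a single row. This assumes tidyr 1.0.0 or later (for .direction = "updown") and that each column has only one non-missing value per page path. The name `df.filled` is only illustrative.

```r
library(dplyr)
library(tidyr)

# same two-row example, with character columns for brevity
df <- data.frame(
  PagePath  = rep("/facilities/when-lighting-strikes", 2),
  PageTitle = c(NA, "When Lighting Strikes"),
  Byline    = c(NA, "Tom Jones"),
  Pageviews = c(668L, NA),
  stringsAsFactors = FALSE
)

df.filled <- df %>%
  group_by(PagePath) %>%
  # fill every listed column in both directions within each page path
  fill(PageTitle, Byline, Pageviews, .direction = "updown") %>%
  ungroup() %>%
  # the rows of a group are now identical, so distinct() keeps one of them
  distinct()

df.filled
```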
Related
How can I extract "7-9", "2-5" and "2-8", then paste them into a new column named event_time?
event_details
2.9(S) 7-9 street【Train】#2097
2.1(S) 2-5 street【Train】#2012
2.2(S) 2-8A TBC【Train】#202
You haven't shared the logic for extracting the numbers, but based on the limited data you have shared, we can do:
df$new_col <- sub('.*(\\d+-\\d+).*', '\\1', df$event_details)
df
# event_details new_col
#1 2.9(S) 7-9 street【Train】 7-9
#2 2.1(S) 2-5 street【Train】 2-5
#3 2.2(S) 2-8A TBC【Train】 2-8
Or the same using str_extract:
df$new_col <- stringr::str_extract(df$event_details, "\\d+-\\d+")
data
df <- structure(list(event_details = structure(c(3L, 1L, 2L),
.Label = c("2.1(S) 2-5 street【Train】",
"2.2(S) 2-8A TBC【Train】", "2.9(S) 7-9 street【Train】"), class =
"factor")), class = "data.frame", row.names = c(NA, -3L))
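A base R sketch of the same extraction, for when stringr is not available. It uses regexpr and regmatches to pull out the first digits-dash-digits run and leaves NA where nothing matches. One reason to prefer a direct match over the sub() pattern: the leading .* there is greedy, so with a multi-digit value such as 10-12 it would swallow the leading digit. The fourth string below is a hypothetical row added to illustrate that case; it is not in the question's data.

```r
x <- c("2.9(S) 7-9 street【Train】#2097",
       "2.1(S) 2-5 street【Train】#2012",
       "2.2(S) 2-8A TBC【Train】#202",
       "12.3(S) 10-12 road")  # hypothetical multi-digit row, not from the question

# position of the first "digits-dash-digits" run in each string (-1 if none)
m <- regexpr("[0-9]+-[0-9]+", x)

# regmatches() drops unmatched elements, so fill a pre-sized NA vector instead
event_time <- rep(NA_character_, length(x))
event_time[m != -1] <- regmatches(x, m)

event_time
```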
Given a CSV with the following structure,
id, postCode, someThing, someOtherThing
1,E3 4AX, cats, dogs
2,E3 4AX, elephants, sheep
3,E8 KAK, mice, rats
4,VH3 2K2, humans, whales
I wish to create two tables, based on whether the value in the postCode column is unique or not. The values of the other columns do not matter to me, but they have to be copied to the new tables.
My end data should look like this, with one table based on unique postCodes:
id, postCode, someThing, someOtherThing
3,E8 KAK, mice, rats
4,VH3 2K2, humans, whales
And another where postCode values are duplicated
id, postCode, someThing, someOtherThing
1,E3 4AX, cats, dogs
2,E3 4AX, elephants, sheep
So far I can load the data but I'm not sure of the next step:
myData <- read.csv("path/to/my.csv",
header=TRUE,
sep=",",
stringsAsFactors=FALSE
)
New to R so help appreciated.
Data in dput format.
df <-
structure(list(id = 1:4, postCode = structure(c(1L, 1L, 2L, 3L
), .Label = c("E3 4AX", "E8 KAK", "VH3 2K2"), class = "factor"),
someThing = structure(c(1L, 2L, 4L, 3L), .Label = c(" cats",
" elephants", " humans", " mice"), class = "factor"),
someOtherThing = structure(c(1L, 3L, 2L, 4L),
.Label = c(" dogs", " rats", " sheep", " whales "
), class = "factor")), class = "data.frame",
row.names = c(NA, -4L))
If df is the name of your data.frame, which can be formed as:
df <- read.table(header = TRUE, sep = ",", strip.white = TRUE, text = "
id, postCode, someThing, someOtherThing
1, E3 4AX, cats, dogs
2, E3 4AX, elephants, sheep
3, E8 KAK, mice, rats
4, VH3 2K2, humans, whales
")
Then the unique and duplicated rows can be found with dplyr's n() function, which returns the number of rows in the current group:
library(dplyr)
uniques = df %>%
group_by(postCode) %>%
filter(n() == 1)
dupes = df %>%
group_by(postCode) %>%
filter(n() > 1)
If you can do with a list of the two data.frames, which seems to be better than to have many related objects in the .GlobalEnv, try split.
f <- rev(cumsum(rev(duplicated(df$postCode))))
split(df, f)
#$`0`
# id postCode someThing someOtherThing
#3 3 E8 KAK mice rats
#4 4 VH3 2K2 humans whales
#
#$`1`
# id postCode someThing someOtherThing
#1 1 E3 4AX cats dogs
#2 2 E3 4AX elephants sheep
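The cumsum trick above puts a row in group 0 only when no duplicated value occurs at or after it, so it appears to rely on the duplicated rows coming first, as they do in the example. A sketch that should not depend on row order flags a row as duplicated whenever its postCode occurs anywhere else, scanning from both ends. The data below is the question's four rows in a shuffled order, and the list names `unique` and `duplicated` are my own choice.

```r
# the question's rows, deliberately reordered
df <- data.frame(
  id = c(3L, 1L, 4L, 2L),
  postCode = c("E8 KAK", "E3 4AX", "VH3 2K2", "E3 4AX"),
  someThing = c("mice", "cats", "humans", "elephants"),
  someOtherThing = c("rats", "dogs", "whales", "sheep"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose postCode appears more than once, wherever it sits
is_dup <- duplicated(df$postCode) | duplicated(df$postCode, fromLast = TRUE)

# split into a named list of two data frames
parts <- split(df, factor(is_dup, levels = c(FALSE, TRUE),
                          labels = c("unique", "duplicated")))

parts$unique
parts$duplicated
```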
I have a data frame in R like the following:
Group.ID status
1 1 open
2 1 open
3 2 open
4 2 closed
5 2 closed
6 3 open
I want to count the number of group IDs for which every status is "open". For example, Group ID 1 has two observations, and both have status "open", so that counts as one. Group ID 2 does not count, because not all of its statuses are open.
I can count the rows or the group IDs under conditions. However I don't know how to apply "all status equal to one value for a group" logic.
DATA.
df1 <-
structure(list(Group.ID = c(1, 1, 2, 2, 2, 3), status = structure(c(2L,
2L, 2L, 1L, 1L, 2L), .Label = c("closed", "open"), class = "factor")), .Names = c("Group.ID",
"status"), row.names = c(NA, -6L), class = "data.frame")
Here are two solutions, both using base R: a more complicated one with aggregate and another with tapply. If you just want the total count of Group.ID values matching your request, I suggest that you use the second solution.
agg <- aggregate(status ~ Group.ID, df1, function(x) as.integer(all(x == "open")))
sum(agg$status)
#[1] 2
sum(tapply(df1$status, df1$Group.ID, FUN = function(x) all(x == "open")))
#[1] 2
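A dplyr solution:
library(dplyr)
df1 %>%
  group_by(Group.ID) %>%
  # keep only the groups in which every status is "open"
  filter(all(status == "open")) %>%
  # one row per remaining group, then count them
  distinct(Group.ID) %>%
  nrow()
#[1] 2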
I have two data tables as shown below:
bigrams
w1w2 freq w1 w2
common names 1 common names
department of 4 department of
family name 6 family name
bigrams = setDT(structure(list(w1w2 = c("common names", "department of", "family name"
), freq = c(1L, 4L, 6L), w1 = c("common", "department", "family"
), w2 = c("names", "of", "name")), .Names = c("w1w2", "freq",
"w1", "w2"), row.names = c(NA, -3L), class = "data.frame"))
unigrams
w1 freq
common 2
department 3
family 4
name 5
names 1
of 9
unigrams = setDT(structure(list(w1 = c("common", "department", "family", "name",
"names", "of"), freq = c(2L, 3L, 4L, 5L, 1L, 9L)), .Names = c("w1",
"freq"), row.names = c(NA, -6L), class = "data.frame"))
desired output
w1w2 freq w1 w2 w1freq w2freq
common names 1 common names 2 1
department of 4 department of 3 9
family name 6 family name 4 5
What I have done so far
setkey(bigrams, w1)
setkey(unigrams, w1)
result <- bigrams[unigrams]
This gives me the i.freq column for w1 but when I try to do the same for w2 the i.freq column is updated to reflect the freq of w2.
How can I get freq for both w1 and w2 in separate columns?
Note: I have already seen solutions to data.table Lookup value and translate and Modify column of a data.table based on another column and add the new column
You can do two joins; since v1.9.6 of data.table you can use the on= argument to join on columns with differing names.
library(data.table)
bigrams[unigrams, on=c("w1"), nomatch = 0][unigrams, on=c(w2 = "w1"), nomatch = 0]
w1w2 freq w1 w2 i.freq i.freq.1
1: family name 6 family name 4 5
2: common names 1 common names 2 1
3: department of 4 department of 3 9
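If you would rather get the column names from the desired output directly and keep the original row order, a sketch using update joins is shown below. Each join looks up one word in unigrams and adds its frequency to bigrams by reference. The column names w1freq and w2freq come from the question's desired output; the tables are rebuilt with data.table() so the block runs on its own.

```r
library(data.table)

bigrams <- data.table(
  w1w2 = c("common names", "department of", "family name"),
  freq = c(1L, 4L, 6L),
  w1   = c("common", "department", "family"),
  w2   = c("names", "of", "name")
)
unigrams <- data.table(
  w1   = c("common", "department", "family", "name", "names", "of"),
  freq = c(2L, 3L, 4L, 5L, 1L, 9L)
)

# i.freq refers to the freq column of the table in i (unigrams);
# := adds the new column to bigrams in place, without reordering rows
bigrams[unigrams, on = "w1", w1freq := i.freq]
bigrams[unigrams, on = c(w2 = "w1"), w2freq := i.freq]

bigrams[]
```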
You can do this with a bit of reshaping.
library(dplyr)
library(tidyr)
bigrams %>%
rename(w1w2_string = w1w2,
w1w2_freq = freq) %>%
gather(order, string,
w1, w2) %>%
left_join(unigrams %>%
rename(string = w1) ) %>%
gather(type, value,
string, freq) %>%
unite(order_type, order, type) %>%
spread(order_type, value)
Edit: Explanation
The first observation you can make is that bigrams contains in fact information about three different units of analysis: a bigram and two unigrams. Convert to long form so that the unit of analysis is a unigram. Then we can merge in the other unigram data. Now note that your unigram has two different pieces of information per row: the frequency for the unigram, and the text of the unigram. Convert to long form again so that the unit of analysis is a piece of information about a unigram. Now spread, so that each new column is a type of information about a unigram.
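If the reshaping feels heavy for just two lookups, a simpler sketch in dplyr is to join unigrams twice, renaming its columns each time so that the key and the new frequency column line up. This is an alternative to the reshaping approach, not a restatement of it. The names w1freq and w2freq are taken from the desired output in the question, and `result` is only illustrative.

```r
library(dplyr)

bigrams <- data.frame(
  w1w2 = c("common names", "department of", "family name"),
  freq = c(1L, 4L, 6L),
  w1   = c("common", "department", "family"),
  w2   = c("names", "of", "name"),
  stringsAsFactors = FALSE
)
unigrams <- data.frame(
  w1   = c("common", "department", "family", "name", "names", "of"),
  freq = c(2L, 3L, 4L, 5L, 1L, 9L),
  stringsAsFactors = FALSE
)

result <- bigrams %>%
  # frequency of the first word
  left_join(rename(unigrams, w1freq = freq), by = "w1") %>%
  # frequency of the second word: rename the key so it matches w2
  left_join(rename(unigrams, w2 = w1, w2freq = freq), by = "w2")

result
```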
I'm struggling to come up with a vectorised solution to the following problem. I have two dataframes:
> people <- data.frame(name = c('Fred', 'Bob'), profession = c('Builder', 'Baker'))
> people
name profession
1 Fred Builder
2 Bob Baker
> allowed <- data.frame(name = c('Fred', 'Fred', 'Bob', 'Bob'), profession = c('Builder', 'Baker', 'Barman', 'Biker'))
> allowed
name profession
1 Fred Builder
2 Fred Baker
3 Bob Barman
4 Bob Biker
That is to say, I want to check every person in people has a permitted profession, and return any names which do not.
For instance, Fred can be a Builder or a Baker, and so he is fine. However, Bob can be a Barman or a Biker, but not a Baker (note: there are only ever two permitted professions in my use case).
I would like to return a data frame of those names which do not have a permitted profession:
name profession permitted
1 Bob Baker Biker
2 Bob Baker Barman
Thanks for the help
Simple base-only solution. I'm sure someone can come up with something better.
out <- allowed[!allowed$name %in% merge(people, allowed)$name, ]
This gets you the desired people, along with their permitted professions. If you also want their actual professions:
names(out)[2] <- "permitted"
out <- merge(people, out, all.y=TRUE)
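For comparison, the same two steps can be sketched with dplyr joins: an anti_join finds the people whose (name, profession) pair is missing from allowed, and an inner_join then attaches each of their permitted professions. This assumes dplyr is available; the names `violators` and `out` are illustrative, and the data frames are built with stringsAsFactors = FALSE to avoid factor-level mismatches in the joins.

```r
library(dplyr)

people <- data.frame(name = c("Fred", "Bob"),
                     profession = c("Builder", "Baker"),
                     stringsAsFactors = FALSE)
allowed <- data.frame(name = c("Fred", "Fred", "Bob", "Bob"),
                      profession = c("Builder", "Baker", "Barman", "Biker"),
                      stringsAsFactors = FALSE)

# people whose (name, profession) pair has no matching row in allowed
violators <- anti_join(people, allowed, by = c("name", "profession"))

# attach every permitted profession for those names
out <- inner_join(violators,
                  rename(allowed, permitted = profession),
                  by = "name")

out
```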
Here's a slightly more readable data.table solution. You can do the last step on the same line as well to make it a one-liner, if you consider that readable.
# load library, convert people to a data.table and set a key
library(data.table)
people = data.table(people, key = "name,profession")
# compute
result = data.table(allowed, key = "name")[people[!allowed]]
setnames(result, "profession.1", "permitted")
result
# name profession permitted
#1: Bob Barman Baker
#2: Bob Biker Baker
Probably there's another way, but this should work. I added a third person with an unpermitted profession to show you how to apply the function to the entire dataset.
currentprof <-structure(list(name = structure(c(2L, 1L, 3L), .Label = c("Bob",
"Fred", "Jan"), class = "factor"), profession = structure(c(3L,
2L, 1L), .Label = c("Analyst", "Baker", "Builder"), class = "factor")), .Names = c("name",
"profession"), class = "data.frame", row.names = c(NA, -3L))
allowed <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("Bob",
"Fred", "Jan"), class = "factor"), profession = structure(c(4L,
1L, 2L, 3L, 6L, 5L), .Label = c("Baker", "Barman", "Biker", "Builder",
"Driver", "Teacher"), class = "factor")), .Names = c("name",
"profession"), class = "data.frame", row.names = c(NA, -6L))
checkprof <- function(name){
allowedn <- allowed[allowed$name == name,]
currentprofn <- currentprof[currentprof$name==name,]
if(!currentprofn$profession %in% allowedn$profession)
{result <- merge(currentprofn, allowedn, by = "name", all.x=TRUE)} else
{result <-data.frame(col1=character(),
col2=character(),
col3=character(),
stringsAsFactors=FALSE)}
colnames(result) <- c("name","profession","permitted")
return(result)
}
do.call(rbind,lapply(levels(allowed$name),checkprof))
This is my take on it. It may need some more testing, though, and I'd be open to suggestions myself. It works with your example, but I am not sure if it would generalize.
# a (name, profession) pair is permitted if it appears as a row of allowed
people$check <- paste(people$name, people$profession) %in% paste(allowed$name, allowed$profession)
# keep the people whose profession is not permitted
people_select <- people[!people$check, ]
EDIT: Just for clarification, in case this is holding you back from voting: %in% is vectorized and will run very fast.