Given a CSV with the following structure,
id, postCode, someThing, someOtherThing
1,E3 4AX, cats, dogs
2,E3 4AX, elephants, sheep
3,E8 KAK, mice, rats
4,VH3 2K2, humans, whales
I wish to create two tables, based on whether the value in the postCode column is unique or not. The values of the other columns do not matter to me, but they have to be copied to the new tables.
My end data should look like this, with one table based on unique postCodes:
id, postCode, someThing, someOtherThing
3,E8 KAK, mice, rats
4,VH3 2K2, humans, whales
And another where postCode values are duplicated
id, postCode, someThing, someOtherThing
1,E3 4AX, cats, dogs
2,E3 4AX, elephants, sheep
So far I can load the data but I'm not sure of the next step:
myData <- read.csv("path/to/my.csv",
header=TRUE,
sep=",",
stringsAsFactors=FALSE
)
New to R so help appreciated.
Data in dput format.
df <-
structure(list(id = 1:4, postCode = structure(c(1L, 1L, 2L, 3L
), .Label = c("E3 4AX", "E8 KAK", "VH3 2K2"), class = "factor"),
someThing = structure(c(1L, 2L, 4L, 3L), .Label = c(" cats",
" elephants", " humans", " mice"), class = "factor"),
someOtherThing = structure(c(1L, 3L, 2L, 4L),
.Label = c(" dogs", " rats", " sheep", " whales "
), class = "factor")), class = "data.frame",
row.names = c(NA, -4L))
If df is the name of your data.frame, which can be formed as:
df <- read.table(header = TRUE, sep = ",", strip.white = TRUE, text = "
id, postCode, someThing, someOtherThing
1, E3 4AX, cats, dogs
2, E3 4AX, elephants, sheep
3, E8 KAK, mice, rats
4, VH3 2K2, humans, whales
")
Then the unique and duplicated rows can be found with dplyr's n(), which returns the number of rows in the current group:
library(dplyr)
uniques = df %>%
group_by(postCode) %>%
filter(n() == 1)
dupes = df %>%
group_by(postCode) %>%
filter(n() > 1)
If a list of the two data.frames is acceptable, which is usually tidier than keeping many related objects in .GlobalEnv, try split.
f <- rev(cumsum(rev(duplicated(df$postCode))))
split(df, f)
#$`0`
# id postCode someThing someOtherThing
#3 3 E8 KAK mice rats
#4 4 VH3 2K2 humans whales
#
#$`1`
# id postCode someThing someOtherThing
#1 1 E3 4AX cats dogs
#2 2 E3 4AX elephants sheep
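As far as I can tell, the cumsum trick above relies on the duplicated rows sitting together at the top of the data, as they do in the sample. A minimal base R sketch that does not depend on row order is below. It flags every row whose postCode occurs more than once by combining duplicated() with its fromLast = TRUE counterpart. The list names "unique" and "duplicated" are my own choice, not part of the original answers.

```r
# Sample data from the question, with columns kept as character
df <- data.frame(
  id = 1:4,
  postCode = c("E3 4AX", "E3 4AX", "E8 KAK", "VH3 2K2"),
  someThing = c("cats", "elephants", "mice", "humans"),
  someOtherThing = c("dogs", "sheep", "rats", "whales"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose postCode appears more than once:
# duplicated() misses the first occurrence, fromLast = TRUE misses the last one
is_dup <- duplicated(df$postCode) | duplicated(df$postCode, fromLast = TRUE)

# Split into a named list of two data frames
tables <- split(df, ifelse(is_dup, "duplicated", "unique"))
```

tables$unique and tables$duplicated then hold the two requested tables with all other columns carried along.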
Hi, I have been trying for a while to match two large columns of names, several of which have different spellings. So far I have written some code to practice on a smaller dataset:
examples%>% mutate(new_ID = case_when(mapply (adist, example_1 , example_2) <= 3 ~ example_1, TRUE ~ example_2))
This creates a new column that takes the name from example_1 when it is within an edit distance of 3 of the name in example_2. However, it does not give the name from example_2 when this criterion is not met, which I need it to do.
This code also only compares names in the same row of each column, whereas I need it to work on a dataset with two columns of different lengths (one is larger, so they can't be put in the same order).
It also needs to skip the NAs in the smaller column of names (they are only there to pad it to the same length as the other one).
Does anyone know how to do something like this?
dput(head(examples))
structure(list(. = structure(c(4L, 3L, 2L, 1L, 5L), .Label = c("grarryfieldsred","harroldfrankknight", "sandramaymeres", "sheilaovensnew", "terrifrank"), class = "factor"), example_2 = structure(c(4L, 2L, 3L, 1L,
5L), .Label = c(" grarryfieldsred", "candramymars", "haroldfranrinight",
"sheilowansknew", "terryfrenk"), class = "factor")), row.names = c(NA,
5L), class = "data.frame")
The problem is that your columns have become factors rather than character vectors. When you try to combine two columns together with different factor levels, unexpected results can happen.
First convert your columns to character:
library(dplyr)
examples %>%
mutate(across(contains("example"),as.character)) %>%
mutate(new_ID = case_when(mapply (adist, example_1 , example_2) <= 3 ~ example_1,
TRUE ~ example_2))
# example_1 example_2 new_ID
#1 sheilaovensnew sheilowansknew sheilowansknew
#2 sandramaymeres candramymars candramymars
#3 harroldfrankknight haroldfranrinight harroldfrankknight
#4 grarryfieldsred grarryfieldsred grarryfieldsred
#5 terrifrank terryfrenk terrifrank
In your dput output, somehow the name of example_1 was changed. I ran this first:
names(examples)[1] <- "example_1"
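For the second part of the question (columns of different lengths, with NA padding), one possible approach is to compute the full distance matrix with adist() and pick the closest candidate for each name. This is only a sketch: the vectors long_names and short_names are made-up examples, and the rule "use the closest candidate if it is within 3 edits, otherwise keep the original name" is my reading of the requirement.

```r
# Hypothetical inputs: short_names is padded with NA, as described in the question
long_names  <- c("sheilaovensnew", "sandramaymeres", "harroldfrankknight",
                 "grarryfieldsred", "terrifrank")
short_names <- c("terryfrenk", "haroldfrankknight", NA)

# Drop the NA padding so it is never considered as a match
candidates <- short_names[!is.na(short_names)]

# Edit distance matrix: one row per long name, one column per candidate
d <- adist(long_names, candidates)

# Closest candidate (column index) and its distance for each long name
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(long_names), best)]

# Take the closest candidate when it is within 3 edits, else keep the original
new_ID <- ifelse(best_dist <= 3, candidates[best], long_names)
```

Note that the distance matrix has length(long_names) * length(candidates) entries, so for very large columns it may need to be built in chunks.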
I am new to text-mining in R. I want to remove stopwords (i.e. extract keywords) from my data frame's column and put those keywords into a new column.
I tried to make a corpus, but it didn't help me.
df$C3 is what I currently have. I would like to add column df$C4, but I can't get it to work.
df <- structure(list(C3 = structure(c(3L, 4L, 1L, 7L, 6L, 9L, 5L, 8L,
10L, 2L), .Label = c("Are doing good", "For the help", "hello everyone",
"hope you all", "I Hope", "I need help", "In life", "It would work",
"On Text-Mining", "Thanks"), class = "factor"), C4 = structure(c(2L,
4L, 1L, 6L, 3L, 7L, 5L, 9L, 8L, 3L), .Label = c("doing good",
"everyone", "help", "hope", "Hope", "life", "Text-Mining", "Thanks",
"work"), class = "factor")), .Names = c("C3", "C4"), row.names = c(NA,
-10L), class = "data.frame")
head(df)
# C3 C4
# 1 hello everyone everyone
# 2 hope you all hope
# 3 Are doing good doing good
# 4 In life life
# 5 I need help help
# 6 On Text-Mining Text-Mining
This solution uses packages dplyr and tidytext.
library(dplyr)
library(tidytext)
# subset of your dataset
dt = data.frame(C1 = c(108,20, 999, 52, 400),
C2 = c(1,3,7, 6, 9),
C3 = c("hello everyone","hope you all","Are doing good","in life","I need help"), stringsAsFactors = F)
# function to combine words (by pasting one next to the other)
f = function(x) { paste(x, collapse = " ") }
dt %>%
unnest_tokens(word, C3) %>% # split phrases into words
filter(!word %in% stop_words$word) %>% # keep appropriate words
group_by(C1, C2) %>% # for each combination of C1 and C2
summarise(word = f(word)) %>% # combine multiple words (if there are multiple)
ungroup() # forget the grouping
# # A tibble: 2 x 3
# C1 C2 word
# <dbl> <dbl> <chr>
# 1 20 3 hope
# 2 52 6 life
The problem here is that the "stop words" built into that package also filter out some of the words you want to keep (here "everyone", "doing" and "good"). Therefore, you have to add a manual step where you list the words you want to keep regardless. You can do something like this:
dt %>%
unnest_tokens(word, C3) %>% # split phrases into words
filter(!word %in% stop_words$word | word %in% c("everyone","doing","good")) %>% # keep appropriate words
group_by(C1, C2) %>% # for each combination of C1 and C2
summarise(word = f(word)) %>% # combine multiple words (if there are multiple)
ungroup() # forget the grouping
# # A tibble: 4 x 3
# C1 C2 word
# <dbl> <dbl> <chr>
# 1 20 3 hope
# 2 52 6 life
# 3 108 1 everyone
# 4 999 7 doing good
This is one of the first things I did in R, it may not be the best but something like:
library(stringi)
df2 <- do.call(rbind, lapply(stop$stop, function(x){
t <- data.frame(c1= df[,1], c2 = df[,2], words = stri_extract(df[,3], coll=x))
t<-na.omit(t)}))
Example data:
df = data.frame(c1 = c(108,20,99), c2 = c(1,3,7), c3 = c("hello everyone", "hope you all", "are doing well"))
stop = data.frame(stop = c("you", "all"))
Afterwards you can reshape df2 using:
df2 = data.frame(c1 = unique(df2$c1), c2 = unique(df2$c2), words = paste(df2$words, collapse= ','))
Then cbind df and df2.
I would use the tm package. It ships a small dictionary of English stopwords. You can replace these stopwords with a space using gsub():
library(tm)
prep <- tolower(paste(" ", df$C3, " "))
regex_pat <- paste(stopwords("en"), collapse = " | ")
df$C4 <- gsub(regex_pat, " ", prep)
df$C4 <- gsub(regex_pat, " ", df$C4)
# C3 C4
# 1 hello everyone hello everyone
# 2 hope you all hope
# 3 Are doing good good
# 4 In life life
# 5 I need help need help
You can easily add new words like c("hello", "othernewword", stopwords("en")).
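If you would rather not depend on a package's stop list, the same idea can be sketched in base R by splitting each phrase into words and dropping the ones found in your own list. The stop_list below is a small hand-picked assumption for this sample, not the list used by tm or tidytext, and remove_stops is a hypothetical helper name.

```r
phrases <- c("hello everyone", "hope you all", "Are doing good",
             "In life", "I need help")

# Assumed, hand-picked stop list (lower case); extend as needed
stop_list <- c("i", "you", "all", "are", "in", "on", "for", "the", "it", "would")

# Split on spaces, drop words found in the stop list (case-insensitive),
# and paste the remaining words back together
remove_stops <- function(x, stops) {
  words <- strsplit(x, " ", fixed = TRUE)
  vapply(words,
         function(w) paste(w[!tolower(w) %in% stops], collapse = " "),
         character(1))
}

keywords <- remove_stops(phrases, stop_list)
```

Because the list is under your control, words such as "everyone" or "doing" are kept unless you add them yourself.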
I have merged two data frames using bind_rows. I have a situation where I have two rows of data as for example below:
Page Path                          Page Title             Byline     Pageviews
/facilities/when-lighting-strikes  NA                     NA         668
/facilities/when-lighting-strikes  When Lighting Strikes  Tom Jones  NA
When I have this type of duplicate page path, I'd like to merge the rows with identical page paths: drop the two NAs in the first row, keep the page title (When Lighting Strikes) and byline (Tom Jones), and keep the pageviews result of 668 from the first row. It seems that I need to:
1) identify the duplicate page paths
2) check whether there are different titles and bylines, and remove the NAs
3) keep the row with the pageviews result and remove the NA row
Is there a way I can do this in R dplyr? Or is there a better way?
A simple solution:
library(dplyr)
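df %>% group_by(PagePath) %>% summarise_each(funs(na.omit))
# summarise_each() and funs() are deprecated in current dplyr; an equivalent with across():
# df %>% group_by(PagePath) %>% summarise(across(everything(), ~ .x[!is.na(.x)]))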
# Source: local data frame [1 x 4]
#
# PagePath PageTitle Byline Pageviews
# (fctr) (fctr) (fctr) (int)
# 1 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
If your data is more complicated, you may need a more robust approach.
Data
df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
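As one possible reading of "a more robust approach", here is a base R sketch that keeps the first non-missing value per column within each page path. It copes with groups that have several non-NA values or none at all. The second path "/facilities/other-page" and its values are invented for illustration, and first_non_na is a hypothetical helper.

```r
# Sample data: the two rows from the question plus one invented page
pages <- data.frame(
  PagePath  = c("/facilities/when-lighting-strikes",
                "/facilities/when-lighting-strikes",
                "/facilities/other-page"),
  PageTitle = c(NA, "When Lighting Strikes", "Other Page"),
  Byline    = c(NA, "Tom Jones", NA),
  Pageviews = c(668L, NA, 12L),
  stringsAsFactors = FALSE
)

# First non-missing value of a vector, or NA if every value is missing
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse each page path to a single row, column by column
collapsed <- do.call(rbind, lapply(split(pages, pages$PagePath), function(g) {
  as.data.frame(lapply(g, first_non_na), stringsAsFactors = FALSE)
}))
rownames(collapsed) <- NULL
```

Unlike na.omit inside summarise, this never returns more than one row per path; whether taking the first value is the right tie-break depends on your data.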
Use the replace function in a for loop:
for(i in unique(df$Page_Path)){
idx <- df$Page_Path == i # rows belonging to this page path
views <- df$Pageviews[idx]
# fill the NA pageviews with the first non-NA value found for the same path
df$Pageviews[idx] <- replace(views, is.na(views), views[!is.na(views)][1])
}
df <- subset(df, !is.na(Page_Title))
print(df)
Page_Path Page_Title Byline Pageviews
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
Here is an option using data.table and complete.cases. We convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'PagePath', loop through the columns of the dataset (lapply(.SD, ...)) and remove the NA elements with complete.cases. complete.cases returns a logical vector that can be used for subsetting. complete.cases is reported to be much faster than na.omit, and coupled with data.table it should improve efficiency further.
library(data.table)
setDT(df)[, lapply(.SD, function(x) x[complete.cases(x)]), by = PagePath]
# PagePath PageTitle Byline Pageviews
#1: /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
data
df <- structure(list(PagePath = structure(c(1L, 1L),
.Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
Another way to do this (similar to a previous solution that uses dplyr) would be the following. Note that paste() turns every column, including Pageviews, into character:
df %>% group_by(PagePath) %>%
dplyr::summarize(PageTitle = paste(na.omit(PageTitle)),
Byline = paste(na.omit(Byline)),
Pageviews =paste(na.omit(Pageviews)))
An alternative approach uses fill from tidyr. With tidyverse 1.3.0+ and dplyr 0.8.5+, you can use fill to replace missing values with the nearest non-missing value in the chosen direction.
See this for more information https://tidyr.tidyverse.org/reference/fill.html
DATA (thanks, Alistaire)
df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
# A tibble: 2 x 4
# Groups: PagePath [1]
PagePath PageTitle Byline Pageviews
<fct> <fct> <fct> <int>
1 /facilities/when-lighting-strikes NA NA 668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones NA
CODE
I just did this for PageTitle but you can repeat fill to do it for other columns. (dplyr gurus might have a smarter way to do all 3 columns at once). If you have ordered data like dates, then you can set .direction to be just down for example (look at past data).
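library(dplyr)
library(tidyr)
# the pipe must end a line, not start one, or R stops parsing after group_by()
df.new <- df %>%
group_by(PagePath) %>%
fill(PageTitle, .direction = "updown")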
which gives you
# A tibble: 2 x 4
# Groups: PagePath [1]
PagePath PageTitle Byline Pageviews
<fct> <fct> <fct> <int>
1 /facilities/when-lighting-strikes When Lighting Strikes NA 668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones NA
Once you have all the NAs cleaned up then you can use distinct or rank to get your final summarised dataframe.
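A minimal sketch of the whole sequence, assuming dplyr and tidyr are installed: fill all three columns in one call, then drop the now-identical rows with distinct(). The data frame pages simply re-creates the two sample rows with character columns; the name is mine.

```r
library(dplyr)
library(tidyr)

# The two sample rows, with character columns instead of factors
pages <- data.frame(
  PagePath  = rep("/facilities/when-lighting-strikes", 2),
  PageTitle = c(NA, "When Lighting Strikes"),
  Byline    = c(NA, "Tom Jones"),
  Pageviews = c(668L, NA),
  stringsAsFactors = FALSE
)

result <- pages %>%
  group_by(PagePath) %>%
  # fill() accepts several columns at once; "updown" fills in both directions
  fill(PageTitle, Byline, Pageviews, .direction = "updown") %>%
  ungroup() %>%
  # after filling, the rows within a group are identical, so keep one
  distinct()
```

This only collapses to one row per path when each column has a single distinct non-NA value in the group; otherwise distinct() will keep several rows.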
I have two data tables as shown below:
bigrams
w1w2 freq w1 w2
common names 1 common names
department of 4 department of
family name 6 family name
bigrams = setDT(structure(list(w1w2 = c("common names", "department of", "family name"
), freq = c(1L, 4L, 6L), w1 = c("common", "department", "family"
), w2 = c("names", "of", "name")), .Names = c("w1w2", "freq",
"w1", "w2"), row.names = c(NA, -3L), class = "data.frame"))
unigrams
w1 freq
common 2
department 3
family 4
name 5
names 1
of 9
unigrams = setDT(structure(list(w1 = c("common", "department", "family", "name",
"names", "of"), freq = c(2L, 3L, 4L, 5L, 1L, 9L)), .Names = c("w1",
"freq"), row.names = c(NA, -6L), class = "data.frame"))
desired output
w1w2 freq w1 w2 w1freq w2freq
common names 1 common names 2 1
department of 4 department of 3 9
family name 6 family name 4 5
What I have done so far
setkey(bigrams, w1)
setkey(unigrams, w1)
result <- bigrams[unigrams]
This gives me the i.freq column for w1 but when I try to do the same for w2 the i.freq column is updated to reflect the freq of w2.
How can I get freq for both w1 and w2 in separate columns?
Note: I have already seen the solutions to "data.table Lookup value and translate" and "Modify column of a data.table based on another column and add the new column".
You can do two joins, and from v1.9.6 of data.table you can use the on= argument to join on columns with differing names.
library(data.table)
bigrams[unigrams, on=c("w1"), nomatch = 0][unigrams, on=c(w2 = "w1"), nomatch = 0]
w1w2 freq w1 w2 i.freq i.freq.1
1: family name 6 family name 4 5
2: common names 1 common names 2 1
3: department of 4 department of 3 9
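If the goal is only to look up two frequencies, a join is not strictly necessary. A base R sketch using match() is below; it adds the columns under the names w1freq and w2freq from the desired output and keeps the original row order. It assumes each word appears at most once in unigrams.

```r
bigrams <- data.frame(
  w1w2 = c("common names", "department of", "family name"),
  freq = c(1L, 4L, 6L),
  w1 = c("common", "department", "family"),
  w2 = c("names", "of", "name"),
  stringsAsFactors = FALSE
)
unigrams <- data.frame(
  w1 = c("common", "department", "family", "name", "names", "of"),
  freq = c(2L, 3L, 4L, 5L, 1L, 9L),
  stringsAsFactors = FALSE
)

# match() returns, for each word, its row position in unigrams (NA if absent)
bigrams$w1freq <- unigrams$freq[match(bigrams$w1, unigrams$w1)]
bigrams$w2freq <- unigrams$freq[match(bigrams$w2, unigrams$w1)]
```

Words missing from unigrams get NA rather than being dropped, which differs from the nomatch = 0 behaviour of the join above.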
You can do this with a bit of reshaping.
library(dplyr)
library(tidyr)
bigrams %>%
rename(w1w2_string = w1w2,
w1w2_freq = freq) %>%
gather(order, string,
w1, w2) %>%
left_join(unigrams %>%
rename(string = w1) ) %>%
gather(type, value,
string, freq) %>%
unite(order_type, order, type) %>%
spread(order_type, value)
Edit: Explanation
The first observation you can make is that bigrams contains in fact information about three different units of analysis: a bigram and two unigrams. Convert to long form so that the unit of analysis is a unigram. Then we can merge in the other unigram data. Now note that your unigram has two different pieces of information per row: the frequency for the unigram, and the text of the unigram. Convert to long form again so that the unit of analysis is a piece of information about a unigram. Now spread, so that each new column is a type of information about a unigram.
I'm having an issue using apply functions (which I assume is the right way to do the following) across multiple data frames.
Some example data (3 different data frames, but the problem I'm working on has upwards of 50):
biz <- data.frame(
country = c("england","canada","australia","usa"),
businesses = sample(1000:2500,4))
pop <- data.frame(
country = c("england","canada","australia","usa"),
population = sample(10000:20000,4))
restaurants <- data.frame(
country = c("england","canada","australia","usa"),
restaurants = sample(500:1000,4))
Here's what I ultimately want to do:
1) Sort each data frame from largest to smallest, according to the variable that's included
dataframe <- dataframe[order(dataframe$VARIABLE, decreasing = TRUE), ]
2) then create a vector variable that gives me the rank for each
dataframe$rank <- 1:nrow(dataframe)
3) Then create another data frame that has one column of the countries and the rank for each of the variables of interest as other columns. Something that would look like (rankings aren't real here):
country.rankings <- structure(list(country = structure(c(5L, 1L, 6L, 2L, 3L, 4L), .Label = c("brazil",
"canada", "england", "france", "ghana", "usa"), class = "factor"),
restaurants = 1:6, businesses = c(4L, 5L, 6L, 3L, 2L, 1L),
population = c(4L, 6L, 3L, 2L, 5L, 1L)), .Names = c("country",
"restaurants", "businesses", "population"), class = "data.frame", row.names = c(NA,
-6L))
So I'm guessing there's a way to put each of these data frames together into a list, something like:
lib <- c(biz, pop, restaurants)
And then do an lapply across that to 1) sort, 2)create the rank variable and 3) create the matrix or data frame of rankings for each variable (# of businesses, population size, # of restaurants) for each country. Problem I'm running into is that writing the lapply function to sort each data frame runs into issues when I try to order by the variable:
sort <- lapply(lib,
function(x){
x <- x[order(x[,2]),]
})
returns the error message:
Error in `[.default`(x, , 2) : incorrect number of dimensions
because I'm trying to apply column headings to a list. But how else would I tackle this problem when the variable names are different for every data frame (keeping in mind that the country names are consistent)?
(would also love to know how to use this using plyr)
Ideally I would recommend data.table for this.
However, here is a quick solution using data.frame
Try this:
Step1: Create a list of all data.frames
varList <- list(biz,pop,restaurants)
Step2: Combine all of them in one data.frame
temp <- varList[[1]]
for(i in 2:length(varList)) temp <- merge(temp,varList[[i]],by = "country")
Step3: Get ranks:
cbind(temp,apply(temp[,-1],2,rank))
You can remove the undesired columns if you want!!
cbind(temp[,1:2],apply(temp[,-1],2,rank))[,-2]
Hope this helps!!
totaldatasets <- c('biz','pop','restaurants')
totaldatasetslist <- vector(mode = "list",length = length(totaldatasets))
for ( i in seq(length(totaldatasets)))
{
totaldatasetslist[[i]] <- get(totaldatasets[i])
}
totaldatasetslist2 <- lapply(
totaldatasetslist,
function(x)
{
# rank within x, the data frame currently being processed (not totaldatasetslist[[i]])
temp <- data.frame(
country = x[,1],
countryrank = rank(x[,2])
)
colnames(temp) <- c('country', colnames(x)[2])
return(temp)
}
)
Reduce(
merge,
totaldatasetslist2
)
Output (the exact ranks depend on the sample() draws) -
country businesses population restaurants
1 australia 3 3 3
2 canada 2 2 2
3 england 1 1 1
4 usa 4 4 4
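Putting the pieces together, here is a compact base R sketch of the three steps from the question. Two things in it are assumptions on my part: the fixed numbers replace the sample() calls so the result is reproducible, and rank 1 is given to the largest value, as the question asks for sorting from largest to smallest. The key point is list() rather than c(): c() flattens the data frames into a list of their columns, which is what triggers the "incorrect number of dimensions" error.

```r
# Fixed example values (invented) instead of sample(), for reproducibility
countries <- c("england", "canada", "australia", "usa")
biz <- data.frame(country = countries, businesses = c(1200, 2400, 1800, 1500))
pop <- data.frame(country = countries, population = c(15000, 12000, 19000, 11000))
restaurants <- data.frame(country = countries, restaurants = c(700, 650, 900, 520))

# list() keeps each data frame intact as one element
frames <- list(biz, pop, restaurants)

# Replace the value column (always column 2) with its rank; 1 = largest
ranked <- lapply(frames, function(x) {
  x[[2]] <- rank(-x[[2]], ties.method = "min")
  x
})

# Merge all ranked frames on country, one rank column per variable
country.rankings <- Reduce(function(a, b) merge(a, b, by = "country"), ranked)
```

Because the value column is addressed by position, the differing variable names do not matter, and the same code should scale to 50 data frames as long as each has country first and the variable second.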