How to remove specific duplicates in R

I have the following data:
> head(bigdata)
type text
1 neutral The week in 32 photos
2 neutral Look at me! 22 selfies of the week
3 neutral Inside rebel tunnels in Homs
4 neutral Voices from Ukraine
5 neutral Water dries up ahead of World Cup
6 positive Who's your hero? Nominate them
My duplicates will look like this (with empty $type):
7 Who's your hero? Nominate them
8 Water dries up ahead of World Cup
I remove duplicates like this:
bigdata <- bigdata[!duplicated(bigdata$text),]
The problem is, it removes the wrong duplicate. I want to remove the one where $type is empty, not the one that has a value for $type.
How can I remove a specific duplicate in R?

So here's a solution that does not use duplicated(...).
# creates an example - you have this already...
set.seed(1) # for reproducible example
bigdata <- data.frame(type=rep(c("positive","negative"),5),
text=sample(letters[1:10],10),
stringsAsFactors=F)
# add some duplicates
bigdata <- rbind(bigdata,data.frame(type="",text=bigdata$text[1:5]))
# you start here...
newdf <- with(bigdata,bigdata[order(text,type,decreasing=T),])
result <- aggregate(newdf,by=list(text=newdf$text),head,1)[2:3]
This sorts bigdata by text and type in decreasing order, so that for a given text the empty type appears after any non-empty type. Then we keep only the first row for each text.
If your data really is "big", then a data.table solution will probably be faster.
library(data.table)
DT <- as.data.table(bigdata)
setkey(DT, text, type)
DT.result <- DT[, list(type = type[.N]), by = text]
This does basically the same thing, but since setkey sorts only in increasing order, we use type[.N] to get the last occurrence of type for every text. .N is a special variable that holds the number of elements in the current group.
Note that the current development version implements a function setorder(), which orders a data.table by reference, and can order in both increasing and decreasing order. So, using the devel version, it'd be:
require(data.table) # 1.9.3
setorder(DT, text, -type)
DT[, list(type = type[1L]), by = text]

You should keep rows that are either not duplicated at all or that have a non-empty type. The duplicated function only flags the second and later occurrences of each value (check out duplicated(c(1, 1, 2))), so to flag every member of a duplicated group we combine it with a second call using fromLast=TRUE. Since in your data the unwanted rows have an empty string in $type rather than NA, test for that as well:
bigdata <- bigdata[!(duplicated(bigdata$text) |
duplicated(bigdata$text, fromLast=TRUE)) |
!(is.na(bigdata$type) | bigdata$type == ""),]
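For completeness, the same idea can be sketched with dplyr (an assumption here, matching the question's data: the unwanted duplicates carry an empty string in type, so sorting type in descending order puts the typed row first and distinct() keeps it):

```r
library(dplyr)

# Toy data shaped like the question's: typed rows first, untyped duplicates after
bigdata <- data.frame(
  type = c("neutral", "positive", "", ""),
  text = c("Water dries up ahead of World Cup", "Who's your hero? Nominate them",
           "Water dries up ahead of World Cup", "Who's your hero? Nominate them"),
  stringsAsFactors = FALSE
)

result <- bigdata %>%
  arrange(text, desc(type)) %>%      # non-empty type sorts before ""
  distinct(text, .keep_all = TRUE)   # keep the first (typed) row per text
```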

foo <- function(x) {
x == ""
}
# Keep a row unless its text is a repeat AND its type is empty.
# (Note: negative indexing with a logical vector is not valid; negate with ! instead of -.)
bigdata <- bigdata[!(duplicated(bigdata$text) & sapply(bigdata$type, foo)), ]


Extract part of a string from one column, paste into a new column

Disclaimer: Totally inexperienced with R so please bear with me!...
Context: I have a series of .csv files in a directory. These files contain 7 columns and approx 100 rows. I've compiled some scripts that will read in all of the files, loop over each one adding some new columns based on different factors (e.g. if a specific column makes reference to a "box set" then it creates a new column called "box_set" with "yes" or "no" for each row), and write out over the original files. The only thing that I can't quite figure out (and yes, I've Googled high and low) is how to split one of the columns into two, based on a particular string. The string always begins with ": Series" but can end with different numbers or ranges of numbers. E.g. "Poldark: Series 4", "The Musketeers: Series 1-3".
I want to be able to split that column (currently named Programme_Title) into two columns (one called Programme_Title and one called Series_Details). Programme_Title would just contain everything before the ":" whilst Series_Details would contain everything from the "S" onwards.
To further complicate matters, the Programme_Title column contains a number of different strings, not all of which follow the examples above. Some don't contain ": Series", some will include the ":" but will not be followed by "Series".
Because I'm terrible at explaining these things, here's a sample of what it currently looks like:
Programme_Title
Hidden
Train Surfing Wars: A Matter of Life and Death
Bollywood: The World's Biggest Film Industry
Cuckoo: Series 4
Mark Gatiss on John Minton: The Lost Man of British Art
Love and Drugs on the Street
Asian Provocateur: Series 1-2
Poldark: Series 4
The Musketeers: Series 1-3
War and Peace
And here's what I want it to look like:
Programme_Title Series_Details
Hidden
Train Surfing Wars: A Matter of Life and Death
Bollywood: The World's Biggest Film Industry
Cuckoo Series 4
Mark Gatiss on John Minton: The Lost Man of British Art
Love and Drugs on the Street
Asian Provocateur Series 1-2
Poldark Series 4
The Musketeers Series 1-3
War and Peace
As I said, I'm a total R novice so imagine that you're speaking to a 5 yr old. If you need more info to be able to answer this then please let me know.
Here's the code that I'm using to do everything else (I'm sure it's a bit messy but I cobbled it together from different sources, and it works!)
### Load libraries and read in files ###
library(stringr)     # needed for str_sub() below
library(data.table)  # needed for data.table() and .GRP below
filenames = dir(pattern="*.csv")
### Loop through all files, add various columns, then save ###
for (i in 1:length(filenames)) {
tmp <- read.csv(filenames[i], stringsAsFactors = FALSE)
### Add date part of filename to column labelled "date" ###
tmp$date <- str_sub(filenames[i], start = 13L, end = -5L)
### Create new column labelled "Series" ###
tmp$Series <- ifelse(grepl(": Series", tmp$Programme_Title), "yes", "no")
### Create "rank" for Programme_Category ###
tmp$rank <- sequence(rle(as.character(tmp$Programme_Category))$lengths)
### Create new column called "row" to assign numerical label to each group ###
DT = data.table(tmp)
tmp <- DT[, row := .GRP, by=.(Programme_Category)][]
### Identify box sets and create new column with "yes" / "no" ###
tmp$Box_Set <- ifelse(grepl("Box Set", tmp$Programme_Synopsis), "yes", "no")
### Remove the data.table which we no longer need ###
rm (DT)
### Write out the new file###
write.csv(tmp, filenames[[i]])
}
I don't have your exact data structure, but I created an example for you that should work:
library(tidyr)
movieName <- c("This is a test", "This is another test: Series 1-5", "This is yet another test")
df <- data.frame(movieName)
df
movieName
1 This is a test
2 This is another test: Series 1-5
3 This is yet another test
df <- df %>% separate(movieName, c("Title", "Series"), sep= ": Series")
df$Series <- ifelse(is.na(df$Series), "", paste0("Series", df$Series))
df
Title Series
1 This is a test
2 This is another test Series 1-5
3 This is yet another test
I tried to capture all the examples you might encounter, but you can easily add things to capture variants not covered in the examples I provided.
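If you'd rather avoid extra packages, a base-R sketch of the same split using sub(), assuming the marker is always ": Series" (titles without the marker pass through unchanged):

```r
pt <- c("Poldark: Series 4", "The Musketeers: Series 1-3", "War and Peace")

# Everything before ": Series" (no-marker titles are left as-is)
Programme_Title <- sub(":\\s*Series.*$", "", pt)

# Everything from "Series" onward, or "" when the marker is absent
Series_Details <- ifelse(grepl(": Series", pt),
                         sub("^.*:\\s*(Series.*)$", "\\1", pt),
                         "")
```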
Edit: I added a test case that did not include : or series. It will just produce a NA for the Series Details.
## Load libraries: the main ones used are stringr, dplyr, tidyr, and tibble from the tidyverse, but I would recommend just installing the tidyverse
library(tidyverse)
## example of your data, hard to know all the unique types of data, but this will get you in the right direction
data <- tibble(title = c("X:Series 1-6",
"Y: Series 1-2",
"Z : Series 1-10",
"The Z and Z: 1-3",
"XX Series 1-3",
"AA AA"))
## Example of the data we want to format, see the different cases covered
print(data)
title
<chr>
1 X:Series 1-6
2 Y: Series 1-2
3 Z : Series 1-10
4 The Z and Z: 1-3
5 XX Series 1-3
6 AA AA
## These %>% are called pipes, and used to feed data through a pipeline, very handy and useful.
data_formatted <- data %>%
## Need to fix cases where you have Series but no : or vice versa; this keeps everything the same.
## Sounds like you will always have either :, Series, or : Series. If this is different you can easily
## change/update this to capture other cases
mutate(title = case_when(
str_detect(title,'Series') & !(str_detect(title,':')) ~ str_replace(title,'Series',':Series'),
!(str_detect(title,'Series')) & (str_detect(title,':')) ~ str_replace(title,':',':Series'),
TRUE ~ title)) %>%
## first separate the columns based on :
separate(col = title,into = c("Programme_Title","Series_Details"), sep = ':') %>%
##This just removes all white space at the ends to clean it up
mutate(Programme_Title = str_trim(Programme_Title),
Series_Details = str_trim(Series_Details))
## Output of the data to see how it was formatted
print(data_formatted)
Programme_Title Series_Details
<chr> <chr>
1 X Series 1-6
2 Y Series 1-2
3 Z Series 1-10
4 The Z and Z Series 1-3
5 XX Series 1-3
6 AA AA NA

removing variables containing certain string in r [duplicate]

I have hundreds of observations and I'd like to remove the ones that contain the string "english basement". I can't seem to find the right syntax to do so. I can only figure out how to keep observations with that string. For instance, I used the code below to get only observations containing the string, and it worked perfectly:
eng_base <- zdata %>%
filter(str_detect(zdata$ListingDescription, "english basement"))
Now I want a data set, top_10mpEB, that excludes observations containing "english basement". Your help is greatly appreciated.
I do not know what your data looks like, but maybe this example helps you - I think you just need to negate the logical vector returned by str_detect:
library(dplyr)
library(stringr)
zdata <- data.frame(ListingDescription = c(rep("english basement, etc",3), letters[1:2] ))
zdata
# ListingDescription
#1 english basement, etc
#2 english basement, etc
#3 english basement, etc
#4 a
#5 b
zdata %>%
filter(!str_detect(ListingDescription, "english basement"))
# ListingDescription
#1: a
#2: b
Or using the data.table package (no need for stringr::str_detect):
library(data.table)
setDT(zdata)
zdata[! ListingDescription %like% "english basement"]
# ListingDescription
#1: a
#2: b
You can do this using grepl():
x <- data.frame(ListingDescription = c('english basement other words description continued',
'great fireplace and an english basement',
'no basement',
'a house with a sauna!',
'the pool is great... and wait till you see the english basement!',
'new listing...will go fast'),
rent = c(3444, 23444, 346, 9000, 1250, 599))
x_english_basement <- x[!grepl('english basement', x$ListingDescription), ]
You can use dplyr to easily filter your dataframe. Note that == only tests for an exact match; since you want to drop rows whose description merely contains the string, negate a pattern match instead:
library(dplyr)
new_data <- data %>%
filter(!grepl("english basement", ListingDescription))
The ! became my best friend once I realized it means "not"

Extracting a value based on multiple conditions in R

Quick question - I have a dataframe (severity) that looks like,
industryType relfreq relsev
1 Consumer Products 2.032520 0.419048
2 Biotech/Pharma 0.650407 3.771429
3 Industrial/Construction 1.327913 0.609524
4 Computer Hardware/Electronics 1.571816 2.019048
5 Medical Devices 1.463415 3.028571
6 Software 0.758808 1.314286
7 Business/Consumer Services 0.623306 0.723810
8 Telecommunications 0.650407 4.247619
if I wanted to pull the relfreq of Medical Devices (row 5) - how could I subset just that value?
I was thinking about just indexing and doing severity$relfreq[[5]], but I'd be using this line in a bigger function where the user would specify the industry i.e.
example <- function(industrytype) {
weight <- relfreq of industrytype parameter
thing2 <- thing1*weight
return(thing2)
}
So if I do subset by an index, is there a way R would know which index corresponds to the industry type specified in the function parameter? Or is it easier/a way to just subset the relfreq column by the industry name?
You first need to select the row of interest and then keep the two columns you requested (industryType and relfreq).
There is a great package collection, the tidyverse, that lets you do this intuitively: library(tidyverse)
data_want <- severity %>%
subset(industryType =="Medical Devices") %>%
select(industryType, relfreq)
Here you read from left to right, with each %>% passing the result of one step on to the next, as if nesting.
I think selecting the whole row first is better; then choose the column you would like to see.
frame <- severity[severity$industryType == 'Medical Devices',]
frame$relfreq
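Putting that lookup into the function shape sketched in the question (this is a hypothetical wrapper: thing1 is the question's placeholder, and passing severity in as an argument is my assumption):

```r
# Hypothetical wrapper: look up relfreq for the named industry, then weight thing1 by it
example <- function(industrytype, severity, thing1) {
  weight <- severity$relfreq[severity$industryType == industrytype]
  thing2 <- thing1 * weight
  thing2
}
```

For instance, example("Medical Devices", severity, thing1 = 10) would use the relfreq from the Medical Devices row, so the user never needs to know its index.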

Walk a CHAID tree R - need to sort by number of instances

I have a number of trees, when printing they are 7 pages long. I've had to rebalance the data and need to look at the branches with the highest frequency to see if they make sense - I need to identify a cancellation rate for different clusters.
Given the data is so long what I need is to have the biggest branches and then I can validate those rather than go through 210 branches manually. I will have lots of trees so need to automate this to look at the important results.
Example code to use:
library(CHAID)
updatecars<-mtcars
updatecars$cyl<-as.factor(updatecars$cyl)
updatecars$vs<-as.factor(updatecars$vs)
updatecars$am<-as.factor(updatecars$am)
updatecars$gear<-as.factor(updatecars$gear)
carsChaid <- chaid(am ~ cyl + vs + gear, data = updatecars)
carsChaid
plot(carsChaid)
When you print this data, you see n=15 for the first group. I need a table where I can sort on this value.
What I need is a decision tree table with the variable values and the number within each group from the tree. This is not exactly the same as this answer: Walk a tree, as that doesn't give the number within each group, but I think it's in the right direction.
Can someone help?
Thanks,
James
I'm sure there is a better way to do this, but this works. Obviously willing to have corrections and improvements suggested.
The particular trouble I had was creating the list of all combinations. When expand.grid goes over 3 factors, it stopped working for me, so I had to build a loop on top of it to create the complete list.
All_canx_rates<-function(Var1,Var2,Var3,Var4,Var5,nametree){
df1<-data.frame("CanxRate"=0,"Num_Canx"=0,"Num_Cust"=0)
pars<-as.list(match.call()[-1])
a<-eval(pars$nametree)[,as.character(pars$Var1)]
b<-eval(pars$nametree)[,as.character(pars$Var2)]
c<-eval(pars$nametree)[,as.character(pars$Var3)]
d<-eval(pars$nametree)[,as.character(pars$Var4)]
e<-eval(pars$nametree)[,as.character(pars$Var5)]
allcombos<-expand.grid(levels(a),levels(b),levels(c))
clean<- allcombos
allcombos$Var4<-d[1]
for (i in 2:length(levels(d))) {
clean$Var4<-levels(d)[i]
allcombos<-rbind(allcombos,clean)
}
#define a forloop
for (i in 1:nrow(allcombos)) {
#define values
f1<-allcombos[i,1]
f2<-allcombos[i,2]
f3<-allcombos[i,3]
f4<-allcombos[i,4]
y5<-nrow(nametree[(a %in% f1 & b %in% f2 & c %in% f3 & d %in% f4 &
e =='1'),])
y4<-nrow(nametree[(a %in% f1 & b %in% f2 & c %in% f3 & d %in% f4),])
df2<-data.frame("CanxRate"=y5/y4,"Num_Canx"=y5,"Num_Cust"=y4)
df1<-rbind(df1, df2)
}
#endforloop
#make the dataframe available for global viewing
df1<-df1[-1,]
output<<-cbind(allcombos,df1)
}
You can use data.tree to do further operations on a party object like sorting, walking the tree, custom plotting, etc. The latest release v0.3.7 from github has a conversion from party class objects:
devtools::install_github("gluc/data.tree#v0.3.7")
library(data.tree)
tree <- as.Node(carsChaid)
tree$fieldsAll
The last command shows the names of the converted fields of the party class:
[1] "data" "fitted" "nodeinfo" "partyinfo" "split" "splitlevels" "splitname" "terms" "splitLevel"
You can sort by a function, e.g. the number of data rows at each node:
tree$Sort(attribute = function(node) nrow(node$data), decreasing = TRUE)
print(tree,
"splitname",
count = function(node) nrow(node$data),
"splitLevel")
This prints, for instance, like so:
levelName splitname count splitLevel
1 1 gear 32
2 ¦--3 17 4, 5
3 °--2 15 3
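To get a sortable table rather than a printout, the same attributes can be pulled into a data.frame and ordered there. A sketch, assuming data.tree's ToDataFrameTree accepts the same attribute arguments as print (field names taken from the converted fields above):

```r
# One row per node, with the branch size as a sortable column
df <- ToDataFrameTree(tree,
                      "splitname",
                      count = function(node) nrow(node$data),
                      "splitLevel")

# Largest branches first
df[order(df$count, decreasing = TRUE), ]
```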

Select variables that contain value in R

I apologize if this question has been answered. I have searched this for way too long.
I have coded data that has a prefix of a letter and suffix of numbers.
ex:
A01, A02,...A99 ### (for each letter A-Z)
I need R code that mirrors this SAS code:
Proc SQL;
Create table NEW as
Select *
From DATA
Where VAR contains 'D';
Quit;
EDIT
Sorry y'all, I'm new! (also, mediocre in R at best.) I thought posting the SAS/SQL code would help make it easier.
Anyway, the data is manufacturing data. I have a variable whose values are the A01...A99, etc. values.
(rough) example of the dataframe:
OBS PRODUCT PRICE PLANT
1 phone 8.55 A87
2 paper 105.97 X67
3 cord .59 D24
4 monitor 98.65 D99
The scale of the data is massive, and I only want to focus on the observations that come from plant 'D', so I'm trying to subset the data based on the 'PLANT' variable containing (or starting with) 'D'. I know how to filter the data with a specific value (i.e. ==, >=, !=, etc.). I just can't figure out how to do it when only part of the value is known, and I have yet to find anything about a 'contains' operator in R. I hope that clarifies things.
Assuming DATA is your data.frame and VAR is your column value,
DATA <- data.frame(
VAR=apply(expand.grid(LETTERS[1:4], 1:3), 1, paste0, collapse=""),
VAL = runif(3*4)
)
then you can do
subset(DATA, grepl("D", VAR))
A slight alternative to MrFlick's solution: use a vector of row-indices:
DATA[grep('D', DATA$VAR), ]
VAR VAL
4 D1 0.31001091
8 D2 0.71562382
12 D3 0.00981055
where we defined:
DATA <- data.frame(
VAR=apply(expand.grid(LETTERS[1:4], 1:3), 1, paste0, collapse=""),
VAL = runif(3*4)
)
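Since the plant codes always begin with the letter, it may be safer to anchor the pattern so a stray "D" elsewhere in a value can't match (a sketch against the example DATA defined above):

```r
# Anchored regex: VAR must begin with "D"
subset(DATA, grepl("^D", VAR))

# Or, without a regex (available since R 3.3.0):
DATA[startsWith(DATA$VAR, "D"), ]
```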
