I would like to find rows that match in two columns but differ in a third and retain only one of these lines. So for example:
animal_couples <- data.frame(ID=c(1,2,3,4,5,6,7,8,9,10,11,12),species=c("Cat","Cat","Cat","Cat","Cat","Dog","Dog","Dog","Fish","Fish","Fish","Fish"),partner=c("Cat","Cat","Cat","Cat","Cat","Cat","Dog","Dog","Dog","Dog","Badger","Badger"),location=c("Germany","Germany","Iceland","France","France","Iceland","Greece","Greece","Germany","Germany","France","Spain"))
A row can match in 'species' and 'partner' so long as it also matches in 'location'. So the first two rows in this df are fine as Germany and Germany are the same. The next three rows are then removed. So the final df should be:
animal_couples_after <- data.frame(ID=c(1,2,6,7,8,9,10,11),species=c("Cat","Cat","Dog","Dog","Dog","Fish","Fish","Fish"),partner=c("Cat","Cat","Cat","Dog","Dog","Dog","Dog","Badger"),location=c("Germany","Germany","Iceland","Greece","Greece","Germany","Germany","France"))
The real dataset is quite large so I don't think looping through each row would be an option.
Thanks a lot for your help.
You could try:
library(data.table)
setDT(animal_couples)[, idx := rleid(location), by = .(species, partner)][idx == 1, ][, idx := NULL]
Output:
   ID species partner location
1:  1     Cat     Cat  Germany
2:  2     Cat     Cat  Germany
3:  6     Dog     Cat  Iceland
4:  7     Dog     Dog   Greece
5:  8     Dog     Dog   Greece
6:  9    Fish     Dog  Germany
7: 10    Fish     Dog  Germany
8: 11    Fish  Badger   France
Or also shortened:
setDT(animal_couples)[, .SD[rleid(location) == 1], by = .(species, partner)]
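To see what rleid(location) == 1 selects, here is a minimal base R sketch of the same idea. It assumes, as in the question, that only the first run of identical locations within each species/partner pair should be kept. The names key, run_id and kept are illustrative and not part of the answer above.

```r
animal_couples <- data.frame(
  ID = 1:12,
  species = c("Cat","Cat","Cat","Cat","Cat","Dog","Dog","Dog","Fish","Fish","Fish","Fish"),
  partner = c("Cat","Cat","Cat","Cat","Cat","Cat","Dog","Dog","Dog","Dog","Badger","Badger"),
  location = c("Germany","Germany","Iceland","France","France","Iceland",
               "Greece","Greece","Germany","Germany","France","Spain"),
  stringsAsFactors = FALSE
)

# One grouping key per species/partner pair (illustrative helper)
key <- paste(animal_couples$species, animal_couples$partner, sep = "|")

# Within each group, number the runs of consecutive identical locations;
# a new run starts whenever the location differs from the previous row
run_id <- ave(seq_len(nrow(animal_couples)), key, FUN = function(i) {
  loc <- animal_couples$location[i]
  cumsum(c(TRUE, loc[-1] != loc[-length(loc)]))
})

# Keep only the first run of each group
kept <- animal_couples[run_id == 1, ]
kept
```

Like the rleid version, this depends on row order: a location that reappears later in the same group, after a different one, starts a new run and is dropped.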
I want to remove an entire row if its values in two columns are duplicates of each other. Any quick help in doing so in R (for a very large dataset) would be highly appreciated. For example:
mydf <- data.frame(p1=c('a','c','a','b','d','b','c','c','e'),
p2=c('b','c','d','c','d','b','d','e','e'),
value=c(10,20,10,11,12,13,14,15,16))
This gives:
mydf
p1 p2 value
1 a b 10
2 c c 20
3 a d 10
4 b c 11
5 d d 12
6 b b 13
7 c d 14
8 c e 15
9 e e 16
I want to get:
p1 p2 value
1 a b 10
2 a d 10
3 b c 11
4 c d 14
5 c e 15
Your note in the comments suggests your actual problem is more complex. There's some preprocessing you could do to your strings before you compare p1 to p2. You will have the domain expertise to know which steps are appropriate, but here's a first start: I remove all spaces and punctuation from p1 and p2, then convert them to uppercase before testing for equality. You can modify the clean_str function to include more or different cleaning operations.
Additionally, you may consider approximate matching to address typos / colloquial naming conventions. Package stringdist is a good place to start.
mydf <- data.frame(p1=c('New York','New York','New York','TokYo','LosAngeles','MEMPHIS','memphis','ChIcAGo','Cleveland'),
p2=c('new York','New.York','MEMPHIS','Chicago','knoxville','tokyo','LosAngeles','Chicago','CLEVELAND'),
value=c(10,20,10,11,12,13,14,15,16),
stringsAsFactors = FALSE)
mydf[mydf$p1 != mydf$p2,]
#> p1 p2 value
#> 1 New York new York 10
#> 2 New York New.York 20
#> 3 New York MEMPHIS 10
#> 4 TokYo Chicago 11
#> 5 LosAngeles knoxville 12
#> 6 MEMPHIS tokyo 13
#> 7 memphis LosAngeles 14
#> 8 ChIcAGo Chicago 15
#> 9 Cleveland CLEVELAND 16
clean_str <- function(col){
# remove all punctuation and blanks (spaces, tabs)
d <- gsub("[[:punct:][:blank:]]+", "", col)
d <- toupper(d)
return(d)
}
mydf$p1 <- clean_str(mydf$p1)
mydf$p2 <- clean_str(mydf$p2)
mydf[mydf$p1 != mydf$p2,]
#> p1 p2 value
#> 3 NEWYORK MEMPHIS 10
#> 4 TOKYO CHICAGO 11
#> 5 LOSANGELES KNOXVILLE 12
#> 6 MEMPHIS TOKYO 13
#> 7 MEMPHIS LOSANGELES 14
Created on 2020-05-03 by the reprex package (v0.3.0)
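For the approximate matching mentioned above, here is a minimal sketch. It uses base R's adist() (generalized Levenshtein distance) as a stand-in for the stringdist package, so it is not the package's API. The example strings and the cutoff max_edits are assumptions chosen only for illustration.

```r
# Already-cleaned example strings (hypothetical; the first pair contains a typo)
p1 <- c("NEWYORK", "TOKYO", "MEMPHIS")
p2 <- c("NEWYROK", "CHICAGO", "MEMPHIS")

# Pairwise edit distance between p1[i] and p2[i]
d <- vapply(seq_along(p1),
            function(i) as.numeric(adist(p1[i], p2[i])),
            numeric(1))

# Treat pairs within max_edits edits as "the same" (assumed threshold)
max_edits <- 2
near_dup <- d <= max_edits

# Rows that survive the fuzzy duplicate filter
data.frame(p1, p2, d)[!near_dup, ]
```

A sensible threshold depends on how long the strings are and how noisy the data is, so it is worth checking a sample of borderline pairs by hand.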
There are several ways to do that, among them:
Base R
mydf[mydf$p1 != mydf$p2, ]
dplyr
library(dplyr)
mydf %>% filter(p1 != p2)
data.table
library(data.table)
setDT(mydf)
mydf[p1 != p2]
Here's a two-step solution based on @Chase's data:
First step (as suggested by @Chase) - preprocess your data in p1 and p2 to make them comparable:
# set to lower-case:
mydf[,c("p1", "p2")] <- lapply(mydf[,c("p1", "p2")], tolower)
# remove anything that's not alphanumeric between words:
mydf[,c("p1", "p2")] <- lapply(mydf[,c("p1", "p2")], function(x) gsub("(\\w+)\\W(\\w+)", "\\1\\2", x))
Second step - (i) using apply, paste the rows together, (ii) use grepl and the backreference \\1 to look for immediately adjacent duplicates in these rows, and (iii) drop (!) the rows that contain these duplicates. Negating the logical vector is safer than -which(...), which would return zero rows if nothing matched:
mydf[!grepl("\\b(\\w+)\\s+\\1\\b", apply(mydf, 1, paste0, collapse = " ")),]
p1 p2 value
3 newyork memphis 10
4 tokyo chicago 11
5 losangeles knoxville 12
6 memphis tokyo 13
7 memphis losangeles 14
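To illustrate how the backreference pattern above behaves, here is a small sketch on hand-written strings. The vector rows is hypothetical and stands in for the pasted-together data frame rows.

```r
# Each string mimics one pasted row: "p1 p2 value" (hypothetical examples)
rows <- c("newyork newyork 10", "tokyo chicago 11", "memphis memphis 13")

# (\\w+) captures a word; \\s+\\1 requires the same word again right after it
has_dup <- grepl("\\b(\\w+)\\s+\\1\\b", rows)

# Keep only the rows without an immediately repeated word
rows[!has_dup]
```

Note that the pattern only catches duplicates sitting next to each other, which is why p1 and p2 need to be adjacent columns in the pasted string.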
I have a data frame like this,
DF1= c(
"Name : John Miller, Math : 100, History : 80, Physics: 90",
"Name : Mary Smith, French : 99, History : 90, Physics: 89",
"Name : Eddy Abbot, Math : 90, French : 85, Chemistry : 90"
)
I would like to turn it into a table like this (preferably as a data.table):
Name Math French History Physics Chemistry
1: John Miller 100 NA 80 90 NA
2: Mary Smith NA 99 90 89 NA
3: Eddy Abbot 90 85 NA NA 90
I am wondering whether my idea is heading in the right direction:
1. Split the strings into fields at ",".
2. Get the keywords ("French", "Math", etc.) by splitting at " : ".
3. Fill in the right row and the right column with the respective value.
I would appreciate advice on step 3. Many thanks.
Replace each comma and end-of-line with a newline and each space-colon with just colon. Read that using readLines to break up the strings into separate lines and use trimws to remove any junk whitespace. At this point the file is in Debian Control Format (DCF) so we can use read.dcf to read it creating character matrix m. Now convert m to data.table and convert the types.
library(data.table)
dcf <- trimws(readLines(textConnection(gsub(" :", ":", gsub(",|$", "\n", DF1)))))
m <- read.dcf(textConnection(dcf))
DT <- as.data.table(m)[, lapply(.SD, type.convert, as.is = TRUE)]
giving:
> DT
Name Math History Physics French Chemistry
1: John Miller 100 80 90 NA NA
2: Mary Smith NA 90 89 99 NA
3: Eddy Abbot 90 NA NA 85 90
Note
We used the object name DF1 for consistency with the question but it is a character vector, not a data frame, so you might want to choose a different name for it.
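For comparison, the three steps outlined in the question (split at commas, split at colons, fill a table) can also be sketched directly in base R. This is only an illustration; the helper names fields, rows, cols and res are made up here, and the result can be passed to as.data.table() if a data.table is wanted.

```r
DF1 <- c(
  "Name : John Miller, Math : 100, History : 80, Physics: 90",
  "Name : Mary Smith, French : 99, History : 90, Physics: 89",
  "Name : Eddy Abbot, Math : 90, French : 85, Chemistry : 90"
)

# Step 1: split each string into "key : value" fields
fields <- strsplit(DF1, ",", fixed = TRUE)

# Step 2: split each field at the colon into a named value
rows <- lapply(fields, function(f) {
  kv <- strsplit(f, ":", fixed = TRUE)
  keys <- trimws(vapply(kv, `[`, character(1), 1))
  vals <- trimws(vapply(kv, `[`, character(1), 2))
  setNames(vals, keys)
})

# Step 3: fill an NA matrix by row index and column name
cols <- unique(unlist(lapply(rows, names)))
m <- matrix(NA_character_, nrow = length(rows), ncol = length(cols),
            dimnames = list(NULL, cols))
for (i in seq_along(rows)) m[i, names(rows[[i]])] <- rows[[i]]

# Convert to a data frame and let R pick the column types
res <- as.data.frame(m, stringsAsFactors = FALSE)
res[] <- lapply(res, type.convert, as.is = TRUE)
res
```

Indexing the matrix by column name is what handles step 3: subjects missing from a row are never assigned and stay NA.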
We convert it to a tibble, create a row names column ('rn'), expand the rows by splitting at "," (separate_rows), separate 'col' at ":" into 'col1' and 'col2', spread it to 'wide' format, and convert the column types:
library(tidyverse)
tibble(col = DF1) %>%
rownames_to_column('rn') %>%
separate_rows(col, sep = "\\s*,\\s*") %>%
separate(col, into = c('col1', 'col2'), sep="\\s*:\\s*") %>%
spread(col1, col2) %>%
select(-rn) %>%
mutate_all(type.convert, as.is = TRUE) %>%
select(Name, Math, French, History, Physics, Chemistry)
# A tibble: 3 x 6
# Name Math French History Physics Chemistry
# <chr> <int> <int> <int> <int> <int>
#1 John Miller 100 NA 80 90 NA
#2 Mary Smith NA 99 90 89 NA
#3 Eddy Abbot 90 85 NA NA 90
It is also possible to convert to JSON format and then use fromJSON
library(jsonlite)
out <- fromJSON(paste0("[", paste("{", gsub('"(\\d+)"', "\\1",
gsub('(\\w+)\\s*:\\s*([^,]+)', '"\\1":"\\2"', DF1)), "}", sep="", collapse=",\n"), "]"))
out
# Name Math History Physics French Chemistry
#1 John Miller 100 80 90 NA NA
#2 Mary Smith NA 90 89 99 NA
#3 Eddy Abbot 90 NA NA 85 90
My question is about the standardization of column b. I need these data to be in a format that makes it easier to construct graphics.
a<- c("Jackson Brice / The Shocker","Flash Thompson", "Mr. Harrington","Mac Gargan","Betty Brant", "Ann Marie Hoag","Steve Rogers / Captain America", "Pepper Potts", "Karen")
b<- c("2:30", "2:15", "2", "1:15", "1:15", "1", ":55",":45", "v")
ab <- cbind.data.frame(a,b)
a b
1 Jackson Brice / The Shocker 2:30
2 Flash Thompson 2:15
3 Mr. Harrington 2
4 Mac Gargan 1:15
5 Betty Brant 1:15
6 Ann Marie Hoag 1
7 Steve Rogers / Captain America :55
8 Pepper Potts :45
9 Karen v
Desired output:
a b
1 Jackson Brice / The Shocker 00:02:30
2 Flash Thompson 00:02:15
3 Mr. Harrington 00:02:00
4 Mac Gargan 00:01:15
5 Betty Brant 00:01:15
6 Ann Marie Hoag 00:01:00
7 Steve Rogers / Captain America 00:00:55
8 Pepper Potts 00:00:45
9 Karen 00:00:00
If possible, the values in column b should be stored in a proper time format that can be manipulated.
So I've had to make a few assumptions about what you are trying to do, e.g. units and what you want done with character values but hopefully this function will give you something to work with.
The big challenge with time is that you need some fairly clear rules when parsing it from text. As a result I have had to put a number of if statements in the function to make it work, but wherever possible, try to keep your time formats as consistent as possible.
library(lubridate)
library(data.table)  # for rbindlist()
formatTime <- function(x) {
  # Check for a ":" separator in the text
  if(grepl(":",x, fixed = TRUE)) {
    y <- unlist(strsplit(x,":", fixed = TRUE))
    # If there is no value before the ":", use "00" as the minutes
    if(y[1]=="") {
      z <- ms(paste("00",y[2],sep = ":"), quiet=TRUE)
    } else {
      z <- ms(paste(y,collapse = ":"), quiet=TRUE)
    }
  } else {
    # If there is no ":", append ":00" so the value is read as whole minutes
    z <- ms(paste(x,"00",sep = ":"), quiet=TRUE)
  }
  # If it did not parse with ms(), i.e. it was a character value, assign zero time "0:00"
  if(is.na(z)) z <- ms("0:00")
  # Converted to duration due to issues returning period with lapply.
  # Make a data frame to return units and name with lapply.
  return(data.frame(time = as.duration(z)))
}
# Convert factor variable to character
ab$b <- as.character(ab$b)
ab <- cbind(ab,rbindlist(lapply(ab$b,formatTime)))
I started by trying to work with a time period but it wouldn't return correctly with the apply statement so I converted to a duration. This may not display the same as your example but it should play nice with graphs.
Let me know if I've missed what you needed and I'll update the answer.
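If only the zero-padded HH:MM:SS strings from the question are needed, the same parsing rules can be sketched in vectorized base R. This assumes, as above, that the values are minutes:seconds and that anything unparseable counts as zero; the function name to_hms is made up for this illustration.

```r
b <- c("2:30", "2:15", "2", "1:15", "1:15", "1", ":55", ":45", "v")

to_hms <- function(x) {
  # Anything that is not digits with an optional ":" part counts as zero
  x[!grepl("^[0-9]*(:[0-9]*)?$", x)] <- "0"
  parts <- strsplit(x, ":", fixed = TRUE)
  # First piece is minutes, second piece (if present) is seconds
  mins <- vapply(parts, function(p) as.integer(p[1]), integer(1))
  secs <- vapply(parts, function(p) as.integer(p[2]), integer(1))
  # Missing or empty pieces become 0
  mins[is.na(mins)] <- 0L
  secs[is.na(secs)] <- 0L
  total <- mins * 60L + secs
  sprintf("%02d:%02d:%02d",
          total %/% 3600L, (total %% 3600L) %/% 60L, total %% 60L)
}

to_hms(b)
```

The intermediate total (in seconds) is plain integer data, so it can be used for plotting directly if the formatted strings are not needed.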
One solution uses tidyr::separate and tidyr::unite. The approach is to first replace any value containing alphabetic characters with 00:00:00, then separate the parts into three columns. dplyr::mutate_at then pads all three columns to the two-digit 00 format. Finally, unite joins the three columns back together.
library(tidyverse)
ab %>% mutate_if(is.factor, as.character) %>% #Change any factor in character
mutate(b = ifelse(grepl("[[:alpha:]]", b), "00:00:00", b)) %>%
mutate(b = ifelse(grepl(":", b), b, paste(b,"00",sep=":")) ) %>%
separate(b, into = c("b1", "b2", "b3"), sep = ":", fill="left", extra = "drop") %>%
mutate_at(vars(starts_with("b")),
~ sprintf("%02d", as.numeric(ifelse(is.na(.) | . == "",0,.)))) %>%
unite("b", starts_with("b"), sep=":")
# a b
# 1 Jackson Brice / The Shocker 00:02:30
# 2 Flash Thompson 00:02:15
# 3 Mr. Harrington 00:02:00
# 4 Mac Gargan 00:01:15
# 5 Betty Brant 00:01:15
# 6 Ann Marie Hoag 00:01:00
# 7 Steve Rogers / Captain America 00:00:55
# 8 Pepper Potts 00:00:45
# 9 Karen 00:00:00
Data:
a<- c("Jackson Brice / The Shocker","Flash Thompson", "Mr. Harrington","Mac Gargan","Betty Brant",
"Ann Marie Hoag","Steve Rogers / Captain America", "Pepper Potts", "Karen")
b<- c("2:30", "2:15", "2", "1:15", "1:15", "1", ":55",":45", "v")
ab <- cbind.data.frame(a,b)
Let's say I have a data frame as follows in R:
Data <- data.frame("SerialNum" = character(), "Year" = integer(), "Name" = character(), stringsAsFactors = F)
Data[1,] <- c("983\n837\n424\n ", 2015, "Michael\nLewis\nPaul\n ")
Data[2,] <- c("123\n456\n789\n136", 2014, "Elaine\nJerry\nGeorge\nKramer")
Data[3,] <- c("987\n654\n321\n975\n ", 2010, "John\nPaul\nGeorge\nRingo\nNA")
Data[4,] <- c("424\n983\n837", 2015, "Paul\nMichael\nLewis")
Data[5,] <- c("456\n789\n123\n136", 2014, "Jerry\nGeorge\nElaine\nKramer")
What I want to do is the following:
Split up each string of names and each string of serial numbers so that they are their own vectors (or a list of string vectors).
Eliminate any character "NA" in either set of vectors or any blank spaces denoted by "...\n ".
Reorder each list of names alphabetically and reorder the corresponding serial numbers according to the same permutation.
Concatenate each vector in the same fashion it was originally (I usually do this with paste(., collapse = "\n")).
My issue is how to do this without using a for loop. What is an object-oriented way to do this? As a first attempt in this direction I originally made a list by the command LIST <- strsplit(Data$Name, split = "\n") and from here I need a for loop in order to find the permutations of the names, which seems like a process that won't scale according to my actual data. Additionally, once I make the list LIST I'm not sure how I go about removing NA symbols or blank spaces. Any help is appreciated!
Using lapply I take each row of the data frame and turn it into a new data frame with one name per row. This creates a list of 5 data frames, one for each row of the original data frame.
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Year=Data[i,"Year"],
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
})
UPDATE: Based on your comment, let me know if this is the result you're trying to achieve:
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
# Collapse back into a single row with the new sort order
dat = data.frame(SerialNum=paste(dat[, "SerialNum"], collapse="\n"),
Year=Data[i, "Year"],
Name=paste(dat[, "Name"], collapse="\n"))
})
do.call(rbind, seinfeld)
SerialNum Year Name
1 837\n983\n424 2015 Lewis\nMichael\nPaul
2 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
3 321\n987\n654\n975 2010 George\nJohn\nPaul\nRingo
4 837\n983\n424 2015 Lewis\nMichael\nPaul
5 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
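The same split, clean, sort and re-paste steps can also be sketched with mapply working on plain character vectors instead of building a data frame per row. This is a minimal illustration on three of the question's rows. It assumes each SerialNum string splits into as many pieces as its Name string, and the helper name sort_pair is made up here.

```r
Data <- data.frame(
  SerialNum = c("983\n837\n424\n ", "987\n654\n321\n975\n ", "424\n983\n837"),
  Year = c(2015, 2010, 2015),
  Name = c("Michael\nLewis\nPaul\n ", "John\nPaul\nGeorge\nRingo\nNA", "Paul\nMichael\nLewis"),
  stringsAsFactors = FALSE
)

sort_pair <- function(serial, name) {
  s <- strsplit(serial, "\n", fixed = TRUE)[[1]]
  n <- strsplit(name, "\n", fixed = TRUE)[[1]]
  # Drop blank entries and the literal string "NA"
  keep <- !(trimws(n) %in% c("", "NA"))
  s <- s[keep]
  n <- n[keep]
  # Sort names alphabetically and apply the same permutation to the serials
  o <- order(n)
  c(SerialNum = paste(s[o], collapse = "\n"),
    Name = paste(n[o], collapse = "\n"))
}

# mapply returns a 2 x nrow matrix; transpose so each row matches a Data row
sorted <- t(mapply(sort_pair, Data$SerialNum, Data$Name, USE.NAMES = FALSE))
Data$SerialNum <- sorted[, "SerialNum"]
Data$Name <- sorted[, "Name"]
Data
```

mapply still loops internally, but it avoids constructing intermediate data frames, which tends to matter more for speed than the loop itself.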
eipi10 offered a great answer. In addition to that, I'd like to leave what I tried, mainly with data.table. First, I split the two columns (i.e., SerialNum and Name) with cSplit(), added an index with add_rownames(), and split the data by that index. In the first lapply(), I used Stacked() from the splitstackshape package to stack SerialNum and Name; the separated SerialNum and Name values become two columns, as you can see in the part of temp2 shown below. In the second lapply(), I used merge from the data.table package. Then I removed rows with NAs (lapply(na.omit)), combined all the data tables (rbindlist), and ordered the rows by rowname (the row number in the original data) and Name (setorder(rowname, Name)).
library(data.table)
library(splitstackshape)
library(dplyr)
cSplit(mydf, c("SerialNum", "Name"), direction = "wide",
type.convert = FALSE, sep = "\n") %>%
add_rownames %>%
split(f = .$rowname) -> temp
#a part of temp
#$`1`
#Source: local data frame [1 x 12]
#
#rowname Year SerialNum_1 SerialNum_2 SerialNum_3 SerialNum_4 SerialNum_5 Name_1 Name_2
#(chr) (dbl) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
#1 1 2015 983 837 424 NA NA Michael Lewis
#Variables not shown: Name_3 (chr), Name_4 (chr), Name_5 (chr)
lapply(temp, function(x){
Stacked(x, var.stubs = c("SerialNum", "Name"), sep = "_")
}) -> temp2
# A part of temp2
#$`1`
#$`1`$SerialNum
# rowname Year .time_1 SerialNum
#1: 1 2015 1 983
#2: 1 2015 2 837
#3: 1 2015 3 424
#4: 1 2015 4 NA
#5: 1 2015 5 NA
#
#$`1`$Name
# rowname Year .time_1 Name
#1: 1 2015 1 Michael
#2: 1 2015 2 Lewis
#3: 1 2015 3 Paul
#4: 1 2015 4 NA
#5: 1 2015 5 NA
lapply(1:nrow(mydf), function(x){
merge(temp2[[x]]$SerialNum, temp2[[x]]$Name, by = c("rowname", "Year", ".time_1"))
}) %>%
lapply(na.omit) %>%
rbindlist %>%
setorder(rowname, Name) -> out
print(out)
# rowname Year .time_1 SerialNum Name
# 1: 1 2015 2 837 Lewis
# 2: 1 2015 1 983 Michael
# 3: 1 2015 3 424 Paul
# 4: 2 2014 1 123 Elaine
# 5: 2 2014 3 789 George
# 6: 2 2014 2 456 Jerry
# 7: 2 2014 4 136 Kramer
# 8: 3 2010 3 321 George
# 9: 3 2010 1 987 John
#10: 3 2010 2 654 Paul
#11: 3 2010 4 975 Ringo
#12: 4 2015 3 837 Lewis
#13: 4 2015 2 983 Michael
#14: 4 2015 1 424 Paul
#15: 5 2014 3 123 Elaine
#16: 5 2014 2 789 George
#17: 5 2014 1 456 Jerry
#18: 5 2014 4 136 Kramer
DATA
mydf <- structure(list(SerialNum = c("983\n837\n424\n ", "123\n456\n789\n136",
"987\n654\n321\n975\n ", "424\n983\n837", "456\n789\n123\n136"
), Year = c(2015, 2014, 2010, 2015, 2014), Name = c("Michael\nLewis\nPaul\n ",
"Elaine\nJerry\nGeorge\nKramer", "John\nPaul\nGeorge\nRingo\nNA",
"Paul\nMichael\nLewis", "Jerry\nGeorge\nElaine\nKramer")), .Names = c("SerialNum",
"Year", "Name"), row.names = c(NA, -5L), class = "data.frame")