Select Data According to a Partial Match - r

Let's say I have the following data frames and want to merge them.
df1 = data.frame(zipcoide=c(90001,90002,90003,66062,90005))
df1
df2 = data.frame(sfc_code=c(900,660,800,400,500,100,300,350,310,450))
df2
SCF Codes are apparently zipcode prefixes and I want to match the sfc_code with the zipcode.
Basically, if I'm given a list of scf codes, I want to select all those zipcodes which have that scf code.
So in this example, I want to end up with
90001
90002
90003
90005
I figure I could use the sqldf package to write a query that selects based on LIKE '%900%', but I was looking for something a little more "elegant."
Thanks!

You want to return all the zipcodes whose first 3 digits appear in your sfc_code list:
df1[ as.numeric(substr( df1$zipcoide , 1 , 3 ) ) %in% df2$sfc_code , ]
# [1] 90001 90002 90003 66062 90005
Probably not the best example because all zip codes are in that sfc_code list!
But if we remove 660 then we get:
df2 = data.frame(sfc_code=c(900,800,400,500,100,300,350,310,450))
df1[ as.numeric(substr( df1$zipcoide , 1 , 3 ) ) %in% df2$sfc_code , ]
# [1] 90001 90002 90003 90005
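As a possible variation on the prefix match above, here is a minimal sketch that takes the prefix arithmetically instead of with substr. It assumes every zipcode is a five-digit number stored as numeric; the vector name scf is made up for the illustration. One possible advantage is that a ZIP with a leading zero (stored as 2134, say) still yields the right prefix (21), whereas taking the first three characters would give 213.

```r
# Toy data from the question (the misspelled column name is kept as-is).
df1 <- data.frame(zipcoide = c(90001, 90002, 90003, 66062, 90005))
scf <- c(900, 800, 400)   # hypothetical list of SCF codes; 660 is deliberately absent

# Integer division by 100 drops the last two digits of a five-digit ZIP.
prefix <- df1$zipcoide %/% 100

# drop = FALSE keeps the result a data frame even though df1 has a single column.
selected <- df1[prefix %in% scf, , drop = FALSE]
selected$zipcoide
```

Without drop = FALSE the one-column subset collapses to a plain vector, which is why the output above prints as a vector rather than as a data frame.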

When your sfc_code values are always the first three digits of your zipcode, you could just extract the first three digits of each zipcode and match these against the sfc_codes:
df1$sfc_code <- as.numeric(substr(as.character(df1$zipcoide), 1, 3))
match(df1$sfc_code, df2$sfc_code)
Update
If, as @joran commented, you want all zipcodes in df1 for each sfc_code in df2, you could use merge (with or without all=TRUE):
# add id so that we can see which records are matched
df1$id1 <- 1:nrow(df1)
df2$id2 <- 1:nrow(df2)
merge(df2, df1)
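To make the difference between the inner merge and the all-style merge concrete, here is a small sketch on a shortened code list. The object names matched and all_codes are assumptions for the illustration, and all.x = TRUE is used instead of all = TRUE so that only unmatched codes (not unmatched zipcodes) are kept.

```r
df1 <- data.frame(zipcoide = c(90001, 90002, 90003, 66062, 90005))
df2 <- data.frame(sfc_code = c(900, 800, 400))

# Derive the join key from the first three digits of each zipcode.
df1$sfc_code <- as.numeric(substr(as.character(df1$zipcoide), 1, 3))

# Inner join on the shared column sfc_code: only zipcodes with a listed code.
matched <- merge(df2, df1)

# Left join: every code from df2 is kept; codes without a zipcode get NA.
all_codes <- merge(df2, df1, all.x = TRUE)

nrow(matched)
nrow(all_codes)
```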

Related

line by line csv compare using if statements in R

I am comparing two csv files using R/Rstudio and I would like to compare them line by line, but in a specific order based on their columns. If my data looks like:
first <-read.csv(text="
name, number, description, version, manufacturer
A123, 12345, first piece, 1.0, fakemanufacturer
B107, 00001, second, 1.0, abcde parts
C203, 20000, third, NA, efgh parts
D123, 12000, another, 2.0, NA")
second csv:
second <- read.csv(text="
name, number, description, version, manufacturer
A123, 12345, first piece, 1.0, fakemanufacturer
B107, 00001, second, 1.0, abcde parts
C203, 20000, third, NA, efgh parts
E456, 45678, third, 2.0, ")
I'd like to have a for loop that looks something like:
for line in csv1:
    if number exists in csv2:
        if csv1$name == csv2$name:
            if csv1$description == csv2$description:
                if csv1$manufacturer == csv2$manufacturer:
                    break
                else:
                    add line to csv called changed, append a value for "changed" column to manufacturer
            else:
                add line to csv called changed, append a value for "changed" column to description
    and so on
so that the output then looks like:
name  number  description  version  manufacturer      changed
A123  12345   first piece  1.0      fakemanufacturer  number
B107  00001   second       1.0      abcde parts       no change
C204  20000   third                 newmanufacturer   number, manufacturer
D123  12000   another      2.0                        removed
E456  45678   third        2.0                        added
and if at any point in this loop something doesn't match, I'd like to know where the mismatch was. The lines can match by number OR description. For example, given the 2 lines above, I would be able to tell that number changed between the two csv files. Thanks in advance for any help!!
It should be something like this, but as you have provided no data to test it I cannot vouch for my code:
cmpDF <- function(DF1, DF2){
  # keep only the rows of DF2 that are also in DF1
  DF2 <- DF2[DF2$number %in% DF1$number, ]
  retChar <- character(nrow(DF1))
  # name the retChar vector by number to be able to update it later
  names(retChar) <- DF1$number
  # keep only the rows of DF1 that are also in DF2
  DF1 <- DF1[DF1$number %in% DF2$number, ]
  # sort rows to make sure that equal rows have the same row number:
  DF1 <- DF1[order(DF1$number), ]
  DF2 <- DF2[order(DF2$number), ]
  equals <- DF1 == DF2
  identical <- rowSums(equals) == ncol(DF1) # here all elements are the same
  retChar[as.character(DF1$number[identical])] <- "no change"
  for(i in 1:ncol(DF1)){
    if(colnames(DF1)[i] == "number") next
    different <- !equals[, i]
    key <- as.character(DF1$number[different])
    # append the column name; paste (not paste0) is needed for sep to act as a separator
    retChar[key] <- ifelse(nchar(retChar[key]),
                           paste(retChar[key], colnames(DF1)[i], sep = ", "),
                           colnames(DF1)[i])
  }
  retChar[nchar(retChar) == 0] <- "number not in DF2"
  return(retChar)
}
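The question also asks for "added" and "removed" rows, which a comparison restricted to shared numbers cannot report. One way to approach that, sketched here under the assumption that number is a unique key and using a made-up two-column data set, is a full outer merge followed by nested ifelse calls. The suffixes .old and .new and the toy values are assumptions for the illustration only.

```r
# Hypothetical miniature versions of the two files, keyed by number.
first <- data.frame(number = c("12345", "00001", "12000"),
                    name   = c("A123", "B107", "D123"),
                    stringsAsFactors = FALSE)
second <- data.frame(number = c("12345", "00001", "45678"),
                     name   = c("A124", "B107", "E456"),
                     stringsAsFactors = FALSE)

# Full outer join: rows present in only one file get NA on the other side.
both <- merge(first, second, by = "number", all = TRUE,
              suffixes = c(".old", ".new"))

# Classify each number: added, removed, unchanged, or changed in name.
both$changed <- ifelse(is.na(both$name.old), "added",
                ifelse(is.na(both$name.new), "removed",
                ifelse(both$name.old == both$name.new, "no change", "name")))
both
```

Note that this sketch treats a genuine NA in the name column the same as a missing row, so real data with NA values would need an explicit presence flag.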

R: How do you subset all data-frames within a list?

I have a list of data-frames called WaFramesCosts. I want to simply subset it to show specific columns so that I can then export them. I have tried:
for (i in names(WaFramesCosts)) {
  WaFramesCosts[[i]][,c("Cost_Center","Domestic_Anytime_Min_Used","Department",
                        "Domestic_Anytime_Min_Used")]
}
but it returns the error of
Error in `[.data.frame`(WaFramesCosts[[i]], , c("Cost_Center", "Department", :
undefined columns selected
I also tried:
for (i in seq_along(WaFramesCosts)){
  WaFramesCosts[[i]][ , -which(names(WaFramesCosts[[i]]) %in% c("Cost_Center","Domestic_Anytime_Min_Used","Department",
                                                                "Domestic_Anytime_Min_Used"))]
}
but I get the same error. Can anyone see what I am doing wrong?
Side Note: For reference, I used this:
for (i in seq_along(WaFramesCosts)) {
  t <- WaFramesCosts[[i]][ , grepl( "Domestic" , names( WaFramesCosts[[i]] ) )]
  q <- subset(WaFramesCosts[[i]], select = c("Cost_Center","Domestic_Anytime_Min_Used","Department","Domestic_Anytime_Min_Used"))
  WaFramesCosts[[i]] <- merge(q,t)
}
while attempting the same goal with a different approach and seemed to get closer.
Welcome back, Kootseeahknee. You are still incorrectly assuming that the last command of a for loop is implicitly returned at the end. If you want that behavior, perhaps you want lapply:
myoutput <- lapply(names(WaFramesCosts), function(i) {
  WaFramesCosts[[i]][,c("Cost_Center","Domestic_Anytime_Min_Used","Department","Domestic_Anytime_Min_Used")]
})
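The point about for loops not returning anything can be seen on a tiny made-up list. This is only a sketch; the list frames and its columns x and y are hypothetical stand-ins for WaFramesCosts.

```r
# Hypothetical stand-in for the list of data frames.
frames <- list(a = data.frame(x = 1:2, y = 3:4),
               b = data.frame(x = 5:6, y = 7:8))

# The subset is computed and then thrown away: nothing is assigned.
for (i in names(frames)) {
  frames[[i]][, "x", drop = FALSE]
}
cols_after_loop <- sapply(frames, ncol)      # both frames still have 2 columns

# lapply collects the value of each call into a new list.
subsets <- lapply(frames, function(d) d[, "x", drop = FALSE])
cols_after_lapply <- sapply(subsets, ncol)   # each element now has 1 column
```

Iterating over the list itself rather than over its names also keeps the element names on the result. Inside a for loop, the equivalent fix is to assign the subset back with frames[[i]] <- ...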
The undefined columns selected error tells me that your assumptions of the datasets are not correct: at least one is missing at least one of the columns. From your previous question (How to do a complex edit of columns of all data frames in a list?), I'm inferring that you want columns that match, not assuming that it is in everything. From that, you could/should be using grep or some variant:
myoutput <- lapply(names(WaFramesCosts), function(i) {
  WaFramesCosts[[i]][,grep("(Cost_Center|Domestic_Anytime_Min_Used|Department)",
                           colnames(WaFramesCosts[[i]])),drop=FALSE]
})
This will match column names that contain any of those strings. You can be a lot more precise by ensuring whole strings or start/end matches occur by using regular expressions. For instance, changing from (Cost|Dom) (anything that contains "Cost" or "Dom") to (^Cost|Dom) means anything that starts with "Cost" or contains "Dom"; similarly, (Cost|ment$) matches anything that contains "Cost" or ends with "ment". If, however, you always want exact matches and just need those that exist, then something like this will work:
myoutput <- lapply(names(WaFramesCosts), function(i) {
  WaFramesCosts[[i]][,intersect(c("Cost_Center","Domestic_Anytime_Min_Used","Department"),
                                colnames(WaFramesCosts[[i]])),drop=FALSE]
})
Note, in that last example, the difference between mtcars[,2] (returns a vector) and mtcars[,2,drop=FALSE] (returns a data.frame with 1 column). As a matter of defensive programming, if you think it at all possible that your filtering will return a single column, append ,drop=FALSE to your bracket-subsetting so that you do not inadvertently convert the result to a vector.
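The three selection styles discussed here (exact names via intersect, a loose regular expression, and an anchored one) can be compared side by side on a made-up list. The data frames and the column Department_Code below are hypothetical; they only serve to show where the approaches differ.

```r
# Hypothetical list: the second frame lacks "Department" but has a similar name.
frames <- list(
  a = data.frame(Cost_Center = 1:2, Department = c("x", "y"), Other = 3:4),
  b = data.frame(Cost_Center = 5:6, Department_Code = 7:8)
)
wanted <- c("Cost_Center", "Department")

# Exact names, keeping only those that exist in each frame.
exact <- lapply(frames, function(d)
  d[, intersect(wanted, colnames(d)), drop = FALSE])

# Loose pattern: also picks up "Department_Code" because it contains "Department".
loose <- lapply(frames, function(d)
  d[, grep("(Cost_Center|Department)", colnames(d)), drop = FALSE])

# Anchored pattern: ^ and $ force a whole-name match.
anchored <- lapply(frames, function(d)
  d[, grep("^(Cost_Center|Department)$", colnames(d)), drop = FALSE])

lapply(exact, colnames)
```

In this sketch exact$b ends up with a single column, and it stays a data frame only because of drop = FALSE.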
Based on your description, here is an example that uses the dplyr library to combine a list of data frames for a given set of columns. It doesn't require all data frames to have identical columns. (Providing your data as a reproducible example would be better.)
# test data
df1 = read.table(text = "
c1 c2 c3
a 1 101
b 2 102
", header = TRUE, stringsAsFactors = FALSE)
df2 = read.table(text = "
c1 c2 c3
w 11 201
x 12 202
", header = TRUE, stringsAsFactors = FALSE)
# dfs is a list of data frames
dfs <- list(df1, df2)
# use dplyr::bind_rows
library(dplyr)
cols <- c("c1", "c3")
result <- bind_rows(dfs)[cols]
result
#   c1  c3
# 1  a 101
# 2  b 102
# 3  w 201
# 4  x 202
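The test data above has identical columns in both frames, so it does not exercise the case where columns differ. A minimal sketch of that case follows, assuming dplyr 1.0.0 or later is installed (for any_of); the column name c9 is deliberately one that exists in no frame.

```r
library(dplyr)

# Two hypothetical frames whose columns only partly overlap.
dfs <- list(
  data.frame(c1 = c("a", "b"), c3 = c(101, 102), stringsAsFactors = FALSE),
  data.frame(c1 = c("w", "x"), c2 = c(11, 12), stringsAsFactors = FALSE)
)

# bind_rows fills columns missing from a frame with NA;
# any_of silently skips requested names that are absent everywhere.
cols <- c("c1", "c3", "c9")
result <- select(bind_rows(dfs), any_of(cols))
result
```

Plain bracket indexing with cols would stop with an "undefined columns selected" error here because c9 does not exist, which is the situation any_of is meant to tolerate.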

Keep doubled columns which differ in only 2 letters in a data.frame

I have a data frame in R which consists of around 100 columns. Most of the columns are doubled but differ in 2 letters. I want to keep these columns and delete those columns which are not doubled.
Here is an example:
234-rgz SK 234-rgz PV 556-gft SK 456-hjk SK 456-hjk PV
The Output should be:
234-rgz SK 234-rgz PV 456-hjk SK 456-hjk PV
All columns follow the same naming convention: a number from 2 to 150, then a "-", then 4 or 5 letters, then a space, and then "SK" or "PV". I thought of using a regular expression, but that doesn't solve the problem of how to get rid of the single columns. Thanks for your help!
You can use duplicated on the column names after removing the suffix part. The output will be a logical index which can be used to subset the original dataset.
v1 <- colnames(df1)
v2 <- sub('\\s+[^ ]+$', '', v1)
indx <- duplicated(v2)|duplicated(v2, fromLast=TRUE)
v1[indx]
#[1] "234-rgz SK" "234-rgz PV" "456-hjk SK" "456-hjk PV"
To subset the columns in the dataframe,
df1[indx]
Another option is to split the column names into substrings and use grep to match the substrings that have a frequency > 1:
tbl <- table(unlist(strsplit(v1, '\\s+.*')))
df1[grep(paste(names(tbl)[tbl>1], collapse="|"), v1)]
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:9, 5*10, replace=TRUE), ncol=5,
dimnames=list(NULL, c('234-rgz SK', '234-rgz PV' , '556-gft SK',
'456-hjk SK' , '456-hjk PV') )) )
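For comparison, here is a sketch that strips only the two known suffixes and then keeps names whose stem occurs more than once, using exact matching on the stem rather than a grep pattern. The variable names stem, counts and keep are made up for this illustration, and it assumes the suffix is always exactly " SK" or " PV".

```r
v1 <- c("234-rgz SK", "234-rgz PV", "556-gft SK", "456-hjk SK", "456-hjk PV")

# Remove the trailing " SK" or " PV" to get the shared part of each name.
stem <- sub(" (SK|PV)$", "", v1)

# Count how often each stem occurs and keep the ones seen more than once.
counts <- table(stem)
keep <- stem %in% names(counts)[counts > 1]

v1[keep]
```

Because %in% compares whole stems, a stem such as 34-rgz could not accidentally match 234-rgz, which a grep on the pasted pattern might do. The logical vector keep can be used as df1[keep] in the same way as indx.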

Mapping elements of a data frame by looping through another data frame

I have two R data frames with differing dimensions. However, both data frames have an id column.
df1:
nrow(df1)=22308
                             c1        c2        c3 pattern1.match
ENSMUSG00000000001_at 10.175115 10.175423 10.109524              0
ENSMUSG00000000003_at  2.133651  2.144733  2.106649              0
ENSMUSG00000000028_at  5.713781  5.714827  5.701983              0
df2:
                                   Genes Pattern.Count
ENSMUSG00000000276 ENSMUSG00000000276_at             1
ENSMUSG00000000876 ENSMUSG00000000876_at             1
ENSMUSG00000001065 ENSMUSG00000001065_at             1
ENSMUSG00000001098 ENSMUSG00000001098_at             1
nrow(df2)=425
I would like to loop through df2, and find all genes that have pattern.count=1 and check it in df1$pattern1.match column.
Basically I would like to overwrite the fields GENES AND pattern1.match with the df2$Genes and df2$Pattern.Count. All the elements from df2$Pattern.Count are equal to one.
I wrote this function, but R freezes while looping through all these rows.
idcol <- ncol(df1)
return.frame.matches <- function(df1, df2, idcol) {
  for (i in 1:nrow(df1)) {
    for (j in 1:nrow(df2))
      if(df1[i, 1] == df2[j, 1]) {
        df1[i, idcol] = 1
        break
      }
  }
  return (df1)
}
Is there another way of doing that without almost killing the computer?
I'm not sure I get exactly what you are doing, but the following should at least get you closer.
The first column of df1 doesn't seem to have a name, are they rownames?
If so,
df1$Genes <- rownames(df1)
Then you could do a merge to create a new dataframe with the genes you require:
merge(df1,subset(df2,Pattern.Count==1))
Note they are matching on the common column Genes. I'm not sure what you want to do with the pattern1.match column, but a subset on the df1 part of merge can incorporate conditions on that.
Edit
Going by the extra information in the comments,
df1$pattern1.match <- as.numeric(df1$Genes %in% df2$Genes)
should achieve what you are looking for.
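As a small illustration of how the vectorised %in% test replaces the nested loop, here is a sketch on three made-up rows. The data values are hypothetical, and it assumes the gene identifiers in df1 are row names, as guessed in this answer.

```r
# Hypothetical miniature df1 with gene ids as row names.
df1 <- data.frame(c1 = c(10.2, 2.1, 5.7),
                  pattern1.match = 0,
                  row.names = c("ENSMUSG00000000001_at",
                                "ENSMUSG00000000276_at",
                                "ENSMUSG00000000028_at"))
df2 <- data.frame(Genes = c("ENSMUSG00000000276_at", "ENSMUSG00000000876_at"),
                  Pattern.Count = 1,
                  stringsAsFactors = FALSE)

# Copy the row names into a real column so they can be compared.
df1$Genes <- rownames(df1)

# One vectorised membership test instead of nrow(df1) * nrow(df2) comparisons.
df1$pattern1.match <- as.numeric(df1$Genes %in% df2$Genes[df2$Pattern.Count == 1])
df1$pattern1.match
```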
Your sample data is not enough to play around with, but here is what I would start with:
dfm <- merge( df1, df2, by = idcol, all = TRUE )
dfm_pc <- subset( dfm, Pattern.Count == 1 )
I took the "idcol" from your code, don't see it in the data.

Comparing 2 datasets in R

I have 2 extracted data sets from a dataset called babies2009( 3 vectors count, name, gender )
One is girls2009 containing all the girls and the other boys2009.
I want to find out what similar names exist between boys and girls.
I tried this
common.names = (boys2009$name %in% girls2009$name)
When I try
babies2009[common.names, ] [1:10, ]
all I get is the girl names not the common names.
I have confirmed that both data sets indeed contain boys and girls respectively by taking a sample of 10 rows from each:
boys2009 [1:10,]
girls2009 [1:10,]
How else can I compare the 2 datasets and determine what values they both share?
Thanks,
common.names = (boys2009$name %in% girls2009$name) gives you a logical vector of length length(boys2009$name). So when you try selecting from a much longer data.frame babies2009[common.names, ] [1:10, ], you wind up with nonsense.
Solution: use that logical vector on the proper data.frame!
boys2009 <- data.frame( name=c("Billy","Bob"),data=runif(2), gender="M" , stringsAsFactors=FALSE)
girls2009 <- data.frame( name=c("Billy","Mae","Sue"),data=runif(3), gender="F" , stringsAsFactors=FALSE)
babies2009 <- rbind(boys2009,girls2009)
common.names <- (boys2009$name %in% girls2009$name)
> boys2009[common.names, ]$name
[1] "Billy"
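To see why indexing the combined data frame gives "nonsense", it may help to look at how R recycles a logical index that is shorter than the number of rows. The following sketch uses a made-up five-row example; the object names wrong and right are assumptions for the illustration.

```r
boys2009  <- data.frame(name = c("Billy", "Bob"), gender = "M",
                        stringsAsFactors = FALSE)
girls2009 <- data.frame(name = c("Billy", "Mae", "Sue"), gender = "F",
                        stringsAsFactors = FALSE)
babies2009 <- rbind(boys2009, girls2009)

# Length-2 logical vector: TRUE for "Billy", FALSE for "Bob".
common.names <- boys2009$name %in% girls2009$name

# Used on the 5-row frame, the index is recycled to TRUE FALSE TRUE FALSE TRUE,
# so rows 1, 3 and 5 are returned regardless of whether the names are shared.
wrong <- babies2009[common.names, ]

# Used on the frame it was computed from, it selects the shared names.
right <- boys2009[common.names, ]
```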
Since you want similarities but did not specify exact matches, you should consider agrep:
sapply(boys2009$name , agrep, girls2009$name, max.distance = 0.1)
You can adjust the max.distance argument to suit your needs.
How about using set functions:
list(
`only boys` = setdiff(boys2009$name, girls2009$name),
`common` = intersect(boys2009$name, girls2009$name),
`only girls` = setdiff(girls2009$name, boys2009$name)
)
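If the counts for the shared names are also of interest, one option is to combine intersect with merge. This is a sketch on made-up data; the count values and the suffixes .boys and .girls are assumptions for the illustration.

```r
boys2009  <- data.frame(name = c("Billy", "Bob", "Alex"),
                        count = c(10, 20, 30), stringsAsFactors = FALSE)
girls2009 <- data.frame(name = c("Billy", "Mae", "Alex"),
                        count = c(1, 2, 3), stringsAsFactors = FALSE)

# Names present in both sets, in the order they appear in the first argument.
common <- intersect(boys2009$name, girls2009$name)

# Inner join on name: one row per shared name, with both counts side by side.
shared <- merge(boys2009, girls2009, by = "name",
                suffixes = c(".boys", ".girls"))
shared
```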
