Say there is a data frame that has a structure like this:
df <- data.frame(x.1 = rnorm(n=100),
x.2 = rnorm(n=100),
x.3 = rnorm(n=100),
x.special = rnorm(n=100),
x.y.z = rnorm(n=100))
Inspecting the head, we get this output:
x.1 x.2 x.3 x.special x.y.z
1 1.01014580 -1.4047666 1.50374721 -0.8339784 -0.0831983
2 0.44307253 -0.4695634 -0.71951820 1.5758893 1.2163749
3 -0.87051845 0.1793721 -0.26838489 -1.0477929 -1.0813926
4 -0.28491936 0.4186763 -0.07494088 -0.2177471 0.3490200
5 -0.03769566 -0.3656822 0.12478667 -0.7975811 -0.4481193
6 -0.83808036 0.6842561 0.71231627 -0.3348798 1.7418141
Suppose I want to remove all the numbered variables but keep the x.special and x.y.z variables. I know that I can easily deselect with:
df %>%
select(-x.1,
-x.2,
-x.3)
However for something like 50 or 100 variables like this, it would become cumbersome. Similarly, I know I can pick patterns like so:
df %>%
select(-contains("x."))
But this of course removes everything because the special variables have the . name. Is there a more intelligent way of picking these variables? I feel like there is an option for finding the numeric variable in the name.
# use regex to remove these colums...
colsBool <- !grepl(x=names(df), pattern="\\d")
Result:
> head(df[, colsBool])
x.special x.y.z
1 1.1145156 -0.4911891
2 0.7059937 0.4500111
3 -0.6566422 1.6085353
4 -0.6322514 -0.8017260
5 0.4785106 0.6014765
6 -0.8508830 -0.5078307
Regular expressions are your best friend in this situation.
For instance, if you wanted to remove columns whose last value is a number, just do !grepl(pattern = "\\d$",...), the $ sign at the end of the expression will match only columns ending with a number. The ! sign in front of the grepl() expression negates the values in the match, that is, a TRUE becomes FALSE and vice-versa.
Related
All.
I've been trying to solve a problem on a large data set for some time and could use some of your wisdom.
I have a DF (1.3M obs) with a column called customer along with 30 other columns. Let's say it contains multiple instances of customers Customer1 thru Customer3000. I know that I have issues with 30 of those customers. I need to find all the customers that are NOT the customers I have issues and replace the value in the 'customer' column with the text 'Supported Customer'. That seems like it should be a simple thing...if it werent for the number of obs, I would have loaded it up in Excel, filtered all the bad customers out and copy/pasted the text 'Supported Customer' over what remained.
Ive tried replace and str_replace_all using grepl and paste/paste0 but to no avail. my current code looks like this:
#All the customers that have issues
out <- c("Customer123", "Customer124", "Customer125", "Customer126", "Customer127",
"Customer128", ..... , "Customer140")
#Look for everything that is NOT in the list above and replace with "Enabled"
orderData$customer <- str_replace_all(orderData$customer, paste0("[^", paste(out, collapse =
"|"), "]"), "Enabled Customers")
That code gets me this error:
Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
In a character range [x-y], x is greater than y. (U_REGEX_INVALID_RANGE)
I've tried the inverse of this approach and pulled a list of all obs that dont match the list of out customers. Something like this:
in <- orderData %>% filter(!customer %in% out) %>% select(customer) %>%
distinct(customer)
This gets me a much larger list of customers that ARE enabled (~3,100). Using the str_replace_all and paste approach seems to have issues though. At this large number of patterns, paste no longer collapses using the "|" operator. instead I get a string that looks like:
"c(\"Customer1\", \"Customer2345\", \"Customer54\", ......)
When passed into str_replace_all, this does not match any patterns.
Anyways, there's got to be an easier way to do this. Thanks for any/all help.
Here is a data.table approach.
First, some example data since you didn't provide any.
customer <- sample(paste0("Customer",1:300),5000,replace = TRUE)
orderData <- data.frame(customer = sample(paste0("Customer",1:300),5000,replace = TRUE),stringsAsFactors = FALSE)
orderData <- cbind(orderData,matrix(runif(0,100,n=5000*30),ncol=30))
out <- c("Customer123", "Customer124", "Customer125", "Customer126", "Customer127", "Customer128","Customer140")
library(data.table)
setDT(orderData)
result <- orderData[!(customer %in% out),customer := gsub("Customer","Supported Customer ",customer)]
result
customer 1 2 3 4 5 6 7 8 9
1: Supported Customer 134 65.35091 8.57117 79.594166 84.88867 97.225276 84.563997 17.15166 41.87160 3.717705
2: Supported Customer 225 72.95757 32.80893 27.318046 72.97045 28.698518 60.709381 92.51114 79.90031 7.311200
3: Supported Customer 222 39.55269 89.51003 1.626846 80.66629 9.983814 87.122153 85.80335 91.36377 14.667535
4: Supported Customer 184 24.44624 20.64762 9.555844 74.39480 49.189537 73.126275 94.05833 36.34749 3.091072
5: Supported Customer 194 42.34858 16.08034 34.182737 75.81006 35.167769 23.780069 36.08756 26.46816 31.994756
---
I have a dataset of 4000+ images. For the purpose of figuring out the code, I moved a small subset of them to another folder.
The files look like this:
folder
[1] "r01c01f01p01-ch3.tiff" "r01c01f01p01-ch4.tiff" "r01c01f02p01-ch1.tiff"
[4] "r01c01f03p01-ch2.tiff" "r01c01f03p01-ch3.tiff" "r01c01f04p01-ch2.tiff"
[7] "r01c01f04p01-ch4.tiff" "r01c01f05p01-ch1.tiff" "r01c01f05p01-ch2.tiff"
[10] "r01c01f06p01-ch2.tiff" "r01c01f06p01-ch4.tiff" "r01c01f09p01-ch3.tiff"
[13] "r01c01f09p01-ch4.tiff" "r01c01f10p01-ch1.tiff" "r01c01f10p01-ch4.tiff"
[16] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
[19] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
[22] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
I cannot remove the name prior to the -ch# as that information is important. What I want to do, however, is to filter this list of images, and return only sets (ie: r01c02f10p01) which have all four ch values (ch1-4).
I was originally thinking that we could approach the issue along the lines of this:
ch1 <- dir(path="/Desktop/cp/complete//", pattern="ch1")
ch2 <- dir(path="/Desktop/cp/complete//", pattern="ch2")
ch3 <- dir(path="/Desktop/cp/complete//", pattern="ch3")
ch4 <- dir(path="/Desktop/cp/complete//", pattern="ch4")
Applying this list with the file.remove function, similar to this:
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
file.remove(folder,final2)
However, creating new variables for each ch value fragments out each file. I am unsure how to use these to actually distinguish whether an individual image has all four ch values to meaningfully filter my images. I'm kind of at a loss, as the other sources I've seen have issues that don't quite match this problem.
Earlier, I was able to remove the all images with ch5 from my image set like this. I was thinking this may be helpful in trying to filter only images which have ch1-ch4, but I'm not sure how to proceed.
##Create folder variable which has all image files
folder <- list.files(getwd())
##Create final2 variable which has all image files ending in ch5
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
##Remove final2 from folder
file.remove(folder,final2)
To summarize: I expect to filter files from a random assortment without complete ch values (ie: maybe only ch1 and ch2, or ch3 and ch4, or ch1, ch2, ch3, and ch4), to an assortment which only contains files which have a complete set (four files with ch1, ch2, ch3, and ch4).
Starting with a vector of filenames like you would get from list.files or something similar, you can create a data frame of filenames, use regex to extract the alphanumeric part at the beginning and the number that follows "-ch". Then check that all elements of an expected set (I put this in ch_set, but there might be another way you need to do this) occur in each group's set of CH values.
# assume this is the vector of file names that comes from list.files
# or something comparable
files <- c("r01c01f01p01-ch3.tiff", "r01c01f01p01-ch4.tiff", "r01c01f02p01-ch1.tiff", "r01c01f03p01-ch2.tiff", "r01c01f03p01-ch3.tiff", "r01c01f04p01-ch2.tiff", "r01c01f04p01-ch4.tiff", "r01c01f05p01-ch1.tiff", "r01c01f05p01-ch2.tiff", "r01c01f06p01-ch2.tiff", "r01c01f06p01-ch4.tiff", "r01c01f09p01-ch3.tiff", "r01c01f09p01-ch4.tiff", "r01c01f10p01-ch1.tiff", "r01c01f10p01-ch4.tiff", "r01c01f11p01-ch1.tiff", "r01c01f11p01-ch2.tiff", "r01c01f11p01-ch3.tiff", "r01c01f11p01-ch4.tiff", "r01c02f10p01-ch1.tiff", "r01c02f10p01-ch2.tiff", "r01c02f10p01-ch3.tiff", "r01c02f10p01-ch4.tiff")
library(dplyr)
ch_set <- 1:4
files_to_keep <- data.frame(filename = files, stringsAsFactors = FALSE) %>%
tidyr::extract(filename, into = c("group", "ch"), regex = "(^[\\w\\d]+)\\-ch(\\d)", remove = FALSE) %>%
mutate(ch = as.numeric(ch)) %>%
group_by(group) %>%
filter(all(ch_set %in% ch))
files_to_keep
#> # A tibble: 8 x 3
#> # Groups: group [2]
#> filename group ch
#> <chr> <chr> <dbl>
#> 1 r01c01f11p01-ch1.tiff r01c01f11p01 1
#> 2 r01c01f11p01-ch2.tiff r01c01f11p01 2
#> 3 r01c01f11p01-ch3.tiff r01c01f11p01 3
#> 4 r01c01f11p01-ch4.tiff r01c01f11p01 4
#> 5 r01c02f10p01-ch1.tiff r01c02f10p01 1
#> 6 r01c02f10p01-ch2.tiff r01c02f10p01 2
#> 7 r01c02f10p01-ch3.tiff r01c02f10p01 3
#> 8 r01c02f10p01-ch4.tiff r01c02f10p01 4
Now that you have a dataframe of the complete groups, just pull the matching filenames back out:
files_to_keep$filename
#> [1] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
#> [4] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
#> [7] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
One thing to note is that this worked without the mutate line where I converted ch to numeric—i.e. comparing character versions of those numbers to regular numeric version of them—because under the hood, %in% converts to matching types. That didn't seem totally safe if you needed to scale this, so I converted to have them in matching types.
I have 197 levels relating to location, I want to simplify this by creating a new variable "INSIDE" which stores 1 when location is a building/home/etc and 0 when location is outside. I have tried grepl() but it gives an error
data$Inside<-ifelse(grepl(data$Premise.Description,pattern = c("BUILDING","ROOM","AUTO","BALCONY","BANK","BAR","STORE","CHURCH","COLLEGE","CONDOMINIUM","CENTER","DAY CARE","SCHOOL","HOSPITAL","LIBRARY","PARLOR","OFFICE","MOSQUE","CLUB","PORCH","MALL","WAREHOUSE")),1,0)
Warning message:
In grepl(crime_3yr$Premise.Description, pattern = c("BUILDING", :
argument 'pattern' has length > 1 and only the first element will be used
I have tried using lapply() but it did not work too.
I want the output to be like this:
BUILDING 1
SHOP 1
Street 0
grepl takes a regex instead of a list of options, try this:
data$Inside<-ifelse(grepl(data$Premise.Description,pattern = "BUILDING|ROOM|AUTO|BALCONY|BANK|BAR|STORE|CHURCH|COLLEGE|CONDOMINIUM|CENTER|DAY CARE|SCHOOL|HOSPITAL|LIBRARY|PARLOR|OFFICE|MOSQUE|CLUB|PORCH|MALL|WAREHOUSE"),1,0)
If you want to keep the code similar to what you listed you need to look into regular expressions which is what the pattern part of the grepl needs to be.
data$Inside<-ifelse(grepl(data$Premise.Description,pattern = "BUILDING|ROOM|AUTO|BALCONY|BANK|BAR|STORE|CHURCH|COLLEGE|CONDOMINIUM|CENTER|DAY CARE|SCHOOL|HOSPITAL|LIBRARY|PARLOR|OFFICE|MOSQUE|CLUB|PORCH|MALL|WAREHOUSE"),1,0)
Try this code:
Your data.frame:
data<-data.frame(Premise.Description= c("BUILDING 1","MY ROOM","AUTO","BALCONY","OTHER"))
The solution:
toMatch<-c("BUILDING","ROOM","AUTO","BALCONY","BANK","BAR","STORE","CHURCH","COLLEGE","CONDOMINIUM","CENTER","DAY CARE","SCHOOL","HOSPITAL","LIBRARY","PARLOR","OFFICE","MOSQUE","CLUB","PORCH","MALL","WAREHOUSE")
data$Inside<-grepl(paste(toMatch,collapse="|"), data$Premise.Description)
data
Premise.Description Inside
1 BUILDING 1 TRUE
2 MY ROOM TRUE
3 AUTO TRUE
4 BALCONY TRUE
5 OTHER FALSE
You might be better off using data.table:
library(data.table)
setDT(data)
data[
grepl(c("BUILDING","ROOM","AUTO","BALCONY","BANK","BAR","STORE","CHURCH","COLLEGE","CONDOMINIUM","CENTER","DAY CARE","SCHOOL","HOSPITAL","LIBRARY","PARLOR","OFFICE","MOSQUE","CLUB","PORCH","MALL","WAREHOUSE"), Premise),
Inside := TRUE
]
I use data from the NHANES periodontal dataset (https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/OHXPER_F.htm) and after cleaning it to only keep the "pc" variables, I have a df=setPD 168 columns that include 6 measurements (pcd, pcm, pcs, pcp, pcl, pca) around 28 teeth numbered from #02 to #31
#names(setPD)
[1] "ohx02pcd" "ohx02pcm" "ohx02pcs" "ohx02pcp" "ohx02pcl" "ohx02pca" "ohx03pcd" "ohx03pcm" "ohx03pcs" "ohx03pcp" "ohx03pcl" "ohx03pca"
[13] "ohx04pcd" "ohx04pcm" "ohx04pcs" "ohx04pcp" "ohx04pcl" "ohx04pca" "ohx05pcd" "ohx05pcm" "ohx05pcs" "ohx05pcp" "ohx05pcl" "ohx05pca"
[25] "ohx06pcd" "ohx06pcm" "ohx06pcs" "ohx06pcp" "ohx06pcl" "ohx06pca" "ohx07pcd" "ohx07pcm" "ohx07pcs" "ohx07pcp" "ohx07pcl" "ohx07pca"
[37] "ohx08pcd" "ohx08pcm" "ohx08pcs" "ohx08pcp" "ohx08pcl" "ohx08pca" "ohx09pcd" "ohx09pcm" "ohx09pcs" "ohx09pcp" "ohx09pcl" "ohx09pca"
[49] "ohx10pcd" "ohx10pcm" "ohx10pcs" "ohx10pcp" "ohx10pcl" "ohx10pca" "ohx11pcd" "ohx11pcm" "ohx11pcs" "ohx11pcp" "ohx11pcl" "ohx11pca"
[61] "ohx12pcd" "ohx12pcm" "ohx12pcs" "ohx12pcp" "ohx12pcl" "ohx12pca" "ohx13pcd" "ohx13pcm" "ohx13pcs" "ohx13pcp" "ohx13pcl" "ohx13pca"
[73] "ohx14pcd" "ohx14pcm" "ohx14pcs" "ohx14pcp" "ohx14pcl" "ohx14pca" "ohx15pcd" "ohx15pcm" "ohx15pcs" "ohx15pcp" "ohx15pcl" "ohx15pca"
[85] "ohx18pcd" "ohx18pcm" "ohx18pcs" "ohx18pcp" "ohx18pcl" "ohx18pca" "ohx19pcd" "ohx19pcm" "ohx19pcs" "ohx19pcp" "ohx19pcl" "ohx19pca"
[97] "ohx20pcd" "ohx20pcm" "ohx20pcs" "ohx20pcp" "ohx20pcl" "ohx20pca" "ohx21pcd" "ohx21pcm" "ohx21pcs" "ohx21pcp" "ohx21pcl" "ohx21pca"
[109] "ohx22pcd" "ohx22pcm" "ohx22pcs" "ohx22pcp" "ohx22pcl" "ohx22pca" "ohx23pcd" "ohx23pcm" "ohx23pcs" "ohx23pcp" "ohx23pcl" "ohx23pca"
[121] "ohx24pcd" "ohx24pcm" "ohx24pcs" "ohx24pcp" "ohx24pcl" "ohx24pca" "ohx25pcd" "ohx25pcm" "ohx25pcs" "ohx25pcp" "ohx25pcl" "ohx25pca"
[133] "ohx26pcd" "ohx26pcm" "ohx26pcs" "ohx26pcp" "ohx26pcl" "ohx26pca" "ohx27pcd" "ohx27pcm" "ohx27pcs" "ohx27pcp" "ohx27pcl" "ohx27pca"
[145] "ohx28pcd" "ohx28pcm" "ohx28pcs" "ohx28pcp" "ohx28pcl" "ohx28pca" "ohx29pcd" "ohx29pcm" "ohx29pcs" "ohx29pcp" "ohx29pcl" "ohx29pca"
[157] "ohx30pcd" "ohx30pcm" "ohx30pcs" "ohx30pcp" "ohx30pcl" "ohx30pca" "ohx31pcd" "ohx31pcm" "ohx31pcs" "ohx31pcp" "ohx31pcl" "ohx31pca"
I am trying to apply a conditional selection in each group of six columns. This is:
transmute(setPD,PD02 = ifelse(setPD$ohx02pcd >5 |
setPD$ohx02pcm>5 |setPD$ohx02pcs >5|
setPD$ohx02pcp >5 | setPD$ohx02pcl >5 |
setPD$ohx02pca >5, 1, 0))
Then for the next tooth (03) I have to write again:
transmute(setPD,PD03 = ifelse(setPD$ohx03pcd >5 |
setPD$ohx03pcm>5|setPD$ohx03pcs >5|
setPD$ohx03pcp >5|setPD$ohx03pcl >5|
setPD$ohx03pca >5, 1, 0))
I tried to firstly do that conditional selection in a more efficient way, something like:
transmute(setPD,PD02 = ifelse(list(setPD$ohx02pcd:setPD$ohx02pcp) >5, 1, 0))
but it does not work.
Then I am looking for a way to write a loop that does that over each tooth without needing to write this 28 times!!
I thought of applying the select function of dplyr in a for loop but I don't know how to do that.
At the end I want to get all the new columns I made with transmute and say that if at least 2 of the 28 columns are 1, then I have disease, if <2 are 1 then I have health. ANy help would be appreciated.
**Note: If you want to get the dataset, it is open access from CDC.org:
https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/OHXPER_F.htm **
First, it is useful to point out that the logical statements of the form is A true OR is B true OR is C true are equivalent to asking is ANY of A,B,C true? We can use this to simplify the statements setPD$ohx02pcd >5 | setPD$ohx02pcm>5 |setPD$ohx02pcs >5| ... to ask if for any of these columns it is true that their value is larger than 5.
For example, let us focus on tooth number 02 first. To get all columns that concern this tooth, we can use grep to get a vector of column names. This can be achieved with
current_tooth <- grep("02", names(setPD), value = T)
Note that if there are any other columns in the data that contain the string 02, these columns will also show up. This does not appear to be the case in your data, but it is worthwhile pointing out here in case someone else uses it and this applies in other datasets.
Now, we can use these names to subset the dataframe. For instance,
setPD[,current_tooth]
will give you the corresponding columns. In each row, we want to check if any of the above mentioned conditions are true. Given a vector of logical statements, we can check if any of them is true with the function any. To go through a dataframe by row and apply a function, we can use apply, such as in
setPD$PD02 <-
apply(setPD[,grep("02", names(setPD), value = T)], 1, function(x) any(x>5))
Now, the above applies to one tooth only, namely 02. One way of doing it for all teeth is to create a vector with all tooth indicators and use this to loop over the above lines, replacing the "02" in the above grep call in each iteration and using assign or something similar to get the variable name right. A more elegant and more efficient way is to use the same principle on long data. Consider the following:
library(reshape2)
library(dplyr)
m <- melt(setPD, id.vars="SEQN")
m$num <- substr(m$variable, 4,5) # be careful here and check output!
m <- m %>% group_by(num) %>% mutate(PS = any(value>5))
m$num <- paste0("PS", m$num)
md <- dcast(m, SEQN ~ num, value.var = "PS")
setPD <- merge(setPD, md, by="SEQN")
This melts your data first and creates a variable num that indicates your tooth. Again, make sure that this works. I have used the fact that in your data, the tooth number all appear in the 4th and 5th place in the character string. Make sure this is true, and adjust the code otherwise. Then I create a variable PS which indicates whether any of the columns that contain the tooth identifer has a value larger than 5. Last but not least I recast the data so that you have the values of PD02, PD03, etc in columns again, before I merge this to the old dataset. The line with paste0 merely creates the variable names that you want to have.
I am having trouble to subset from a list using a variable of my function.
rankhospital <- function(state,outcome,num = "best") {
#code here
e3<-dataframe(...,state.name,...)
if (num=="worst"){ return(worst(state,outcome))
}else if((num%in%b=="TRUE" & outcome=="heart attack")=="TRUE"){
sep<-split(e3,e3$state.name)
hosp.estado<-sep$state
hospital<-hosp.estado[num,1]
return(as.character(hospital))
I split my data frame by state (which is a variable of my function)
But hosp.estado<-sep$state doesn't work. I have also tried as.data.frame.
The function (rankhospital("NY"....) returns me a character(0).
When I feed the sep$state with sep$"NY" directly in code it works perfectly so I guess the problem is I can't use a function's variable to do this. Am I right? What could I use instead?
Thank you!!
If state is a variable in your function, you can refer to a column with the name given by state using: sep[state] or sep[[state]]. The first produces a data frame with one column named based on the value of state. The second produces an unnamed vector.
df=data.frame(NY=rnorm(10),CA=rnorm(10), IL=rnorm(10))
state="NY"
df[state]
# NY
# 1 -0.79533912
# 2 -0.05487747
# 3 0.25014132
# 4 0.61824329
# 5 -0.17262350
# 6 -2.22390027
# 7 -1.26361438
# 8 0.35872890
# 9 -0.01104548
# 10 -0.94064916
df[[state]]
# [1] -0.79533912 -0.05487747 0.25014132 0.61824329 -0.17262350 -2.22390027 -1.26361438 0.35872890 -0.01104548 -0.94064916
class(df[state])
# [1] "data.frame"
class(df[[state]])
# [1] "numeric"
It seems like you are trying to get the top hospital in a state. You don't want to split here (see the result of sep to see what I mean). Instead, use:
as.character(e3[e3$state.name==state, 1][num])
This hopefully does what you want.
You need sep[[state]] instead of sep$state to get the data frame out of your sep list, which matches the state parameter of your function. Like this:
e3 <- read.csv("https://raw.github.com/Hindol/data-analysis-coursera/master/HW3/hospital-data.csv")
state <- "WY"
num <- 1:5
sep<-split(e3,e3$State)
hosp.estado<-sep[[state]]
hospital<-hosp.estado[num,1]
as.character(hospital)
# [1] "530002" "530006" "530008" "530010" "530011"