Subsetting in R using OR condition with strings

I have a data frame with about 40 columns. The second column, data[2], contains the name of the company that the rest of the row describes. However, the company names differ depending on the year (a trailing 09 for 2009 data, nothing for 2010).
I would like to be able to subset the data such that I can pull in both years at once. Here is an example of what I'm trying to do...
subset(data, data[2] == "Company Name 09" | "Company Name", drop = T)
Essentially, I'm having difficulty using the OR operator within the subset function.
I have also tried an alternative:
subset(data, data[[2]] == grep("Company Name", data[[2]]))
Perhaps there's an easier way to do it using a string function?
Any thoughts would be appreciated.

First of all (as Jonathan noted in his comment), to reference the second column you should use either data[[2]] or data[,2]. But since you are using subset, you could use the column name instead: subset(data, CompanyName == ...).
As for your question, I would use one of:
subset(data, data[[2]] %in% c("Company Name 09", "Company Name"), drop = TRUE)
subset(data, grepl("^Company Name", data[[2]]), drop = TRUE)
The second uses grepl (introduced in R version 2.9), which returns a logical vector with TRUE for each match.
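A minimal sketch of both approaches on mock data. The data frame and the column name CompanyName are assumptions made for illustration, and the regular expression here is a slightly stricter variant that only accepts the bare name or the name followed by " 09":

```r
# Hypothetical mock data; the column name CompanyName is an assumption.
data <- data.frame(
  id = 1:5,
  CompanyName = c("Company Name 09", "Company Name", "Other Co 09",
                  "Other Co", "Company Name 09"),
  stringsAsFactors = FALSE
)

# Exact match against a set of allowed names.
by_set <- subset(data, CompanyName %in% c("Company Name 09", "Company Name"))

# Pattern match: the name, optionally followed by " 09", and nothing else.
by_regex <- subset(data, grepl("^Company Name( 09)?$", CompanyName))

by_set$id
identical(by_set, by_regex)
```

The %in% form is the safer choice when the set of names is known in advance; the pattern form is handier when there are many year suffixes.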

A couple of things:
1) Mock-up data is useful as we don't know exactly what you're faced with. Please supply data if possible. Maybe I misunderstood in what follows?
2) Don't use [[2]] to index your data.frame, I think [,"colname"] is much clearer
3) If the only difference is a trailing ' 09' in the name, then simply regexp that out:
R> x1 <- c("foo 09", "bar", "bar 09", "foo")
R> x2 <- gsub(" 09$", "", x1)
R> x2
[1] "foo" "bar" "bar" "foo"
R>
Now you should be able to do your subset on the on-the-fly transformed data:
R> data <- data.frame(value=1:4, name=x1)
R> subset(data, gsub(" 09$", "", name)=="foo")
  value   name
1     1 foo 09
4     4    foo
R>
You could also have replaced the name column with the regexp'ed values.
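A sketch of that last suggestion. The extra year column is an assumption added here so that the information carried by the " 09" suffix is not lost; the 2009/2010 mapping follows the question's description:

```r
data <- data.frame(value = 1:4,
                   name = c("foo 09", "bar", "bar 09", "foo"),
                   stringsAsFactors = FALSE)

# Record the year before stripping the suffix
# (2009 if " 09" is present, otherwise 2010, as described in the question).
data$year <- ifelse(grepl(" 09$", data$name), 2009L, 2010L)
data$name <- sub(" 09$", "", data$name)

# Both years for one company can now be selected with a plain equality test.
subset(data, name == "foo")
```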

Related

Parsing a multi space-separated data set and storing it in the right data structure

I have a large dataset with name, age and company.
file.txt :
name firstname1 lastname1
age 30
Company ABC Ltd
name firstname2 lastname2
age 28
Company XYZ Ltd
I need to write a function that returns a data structure from which, given a record and a key, I can look up the corresponding value.
E.g
content <- parseFile("file.txt")
content[1]["name"] # "firstname1 lastname1"
content[1]["age"] # 30
content[1]["Company"] # "ABC Ltd"
content[2]["name"] # "firstname2 lastname2"
content[2]["age"] # 28
content[2]["Company"] # "XYZ Ltd"
So far I have inferred that a list of named vectors or a list of objects could be used.
Or is there a better way to solve this?
An explanation with a code example would be helpful.

We can use readLines to get the data, create a delimiter with sub and create a two column data.frame
df1 <- read.csv(text =sub(" ", ",", dat), header = FALSE,
stringsAsFactors = FALSE)
If we need to split as a list
lst1 <- split(setNames(as.list(df1$V2), df1$V1), cumsum(df1$V1 == 'name'))
lst1[[1]][['name']]
#[1] "firstname1 lastname1"
lst1[[1]][['age']]
#[1] "30"
lst1[[2]][['age']]
#[1] "28"
data
dat <- readLines("file.txt")
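The same idea can be wrapped into the parseFile function the question asks for. This is only a sketch: it assumes every record starts with a "name" line and that the key is the text before the first space. Note that all values come back as character strings, so the age would need as.integer() if a number is required.

```r
# parseFile is the name used in the question; this body is one possible sketch.
parseFile <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(trimws(lines))]          # drop blank lines
  key   <- sub(" .*$", "", lines)                # text before the first space
  value <- trimws(sub("^[^ ]+ *", "", lines))    # text after the key
  # A new record starts at every "name" line.
  unname(split(setNames(as.list(value), key), cumsum(key == "name")))
}

# Write the sample from the question to a temporary file and parse it.
tmp <- tempfile(fileext = ".txt")
writeLines(c("name firstname1 lastname1", "age 30", "Company ABC Ltd",
             "name firstname2 lastname2", "age 28", "Company XYZ Ltd"), tmp)
content <- parseFile(tmp)

content[[1]][["name"]]
content[[2]][["Company"]]
```

Records are accessed with double brackets, content[[1]][["name"]], rather than the content[1]["name"] form shown in the question, because single brackets on a list return a sub-list instead of the element itself.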

R: How to replace only particular strings in a dataframe column

I have a dataframe column which has values like Americ0,Indi0,Data 2.0...
While doing the data cleaning I am supposed to replace "0" with "an"
df$column <- lapply(df$column, function(x){
str_replace(x,"0","an")
})
I am using the above code to replace 0 with "an", which works as expected. The problem is that certain values in df$column should not be changed, such as Data 2.0. I would appreciate it if someone could help me with this.

You can use str_replace from stringr. Assuming x is df$column:
library(stringr)
x <- c("Americ0","Indi0","Data 2.0")
str_replace(x,"([:alpha:]+)(0)","\\1an")
Or, using baseR
gsub("([[:alpha:]]+)(0)","\\1an",x)
Output:
> str_replace(x,"([:alpha:]+)(0)","\\1an")
[1] "American" "Indian" "Data 2.0"
> gsub("([[:alpha:]]+)(0)","\\1an",x)
[1] "American" "Indian" "Data 2.0"
The parts of the pattern inside parentheses are capture groups. Group 1 captures one or more alphabetic characters immediately before the 0, so the 0 in "2.0", which is preceded by a period rather than a letter, is not matched.
From documentation:
[:alpha:] Alphabetic characters: [:lower:] and [:upper:].
For more you can search ?regex on your console

I'm not sure how you would do this without having some sort of rule on which you want/do not want to replace like maybe don't replace if 0 is at the beginning, or if 0 occurs in this set of strings.
With your current setup you could probably do something like this (assuming only "Data 2.0" is something you want to skip)
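library(stringr)
df <- data.frame(column = c("Americ0","Indi0","Data 2.0"), stringsAsFactors = FALSE)
do_not_replace <- c("Data 2.0")
# vapply keeps the column a character vector (lapply would turn it into a list)
df$column <- vapply(df$column, function(x) {
  if(x %in% do_not_replace) {
    x
  } else str_replace(x, "0", "an")
}, character(1), USE.NAMES = FALSE)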
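The skip-list idea can also be written without an explicit loop. A base R sketch, using the same example values and the same assumption that "Data 2.0" is the only protected value:

```r
x <- c("Americ0", "Indi0", "Data 2.0")
do_not_replace <- c("Data 2.0")

# Replace the first "0" with "an" everywhere except in the protected values.
# fixed = TRUE treats "0" as a literal string rather than a regular expression.
cleaned <- ifelse(x %in% do_not_replace, x, sub("0", "an", x, fixed = TRUE))
cleaned
```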

need to flatten list to use intersect in R

I have fullname data that I have used strsplit() to get each element of the name.
# Dataframe with a `names` column (complete names)
df <- data.frame(
names =
c("Adam, R, Goldberg, MALS, MBA",
"Adam, R, Goldberg, MEd",
"Adam, S, Metsch, MBA",
"Alan, Haas, MSW",
"Alexandra, Dumas, Rhodes, MA",
"Alexandra, Ruttenberg, PhD, MBA"),
stringsAsFactors=FALSE)
# Add a column with the split names (it is actually a list)
df$splitnames <- strsplit(df$names, ', ')
I also have a list of degrees below
degrees<-c("EdS","DEd","MEd","JD","MS","MA","PhD","MSPH","MSW","MSSA","MBA",
"MALS","Esq","MSEd","MFA","MPA","EdM","BSEd")
I would like to get the intersection for each name and respective degrees.
I'm not sure how to flatten the name list so I can compare the two vectors using intersect. When I tried unlist(df$splitnames, recursive=FALSE) it returned each element separately. Any help is appreciated.

Try
df$intersect <- lapply(X=df$splitnames, FUN=intersect, y=degrees)
That will give you a list with the intersection of each element of df$splitnames and degrees (e.g. intersect(df$splitnames[[1]], degrees)). If you want it as a vector:
sapply(X=df$intersect, FUN=paste, collapse=', ')
I assume you need it as a vector, since possibly the complete names came from one (for instance, from a dataframe), but strsplit outputs a list.
Does that work? If not, please try to clarify your intention.
Good luck!
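For reference, a self-contained sketch of the per-row approach on a shortened version of the question's data. The trimmed degrees vector and the new column name degrees are choices made here for brevity, not part of the original answer:

```r
df <- data.frame(
  names = c("Adam, R, Goldberg, MALS, MBA", "Alan, Haas, MSW",
            "Alexandra, Ruttenberg, PhD, MBA"),
  stringsAsFactors = FALSE
)
degrees <- c("MEd", "MA", "PhD", "MSW", "MBA", "MALS")

# One character vector of name parts per row.
df$splitnames <- strsplit(df$names, ", ", fixed = TRUE)

# Intersect each row's parts with the degree list and collapse to one string.
df$degrees <- vapply(df$splitnames,
                     function(parts) paste(intersect(parts, degrees), collapse = ", "),
                     character(1))
df$degrees
```

No flattening is needed, because intersect is applied to each list element separately.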

To continue from your unlist attempt:
hh <- unlist(df$splitnames)
intersect(hh,degrees)
For example :
ll <- list(c("Adam" , "R" , "Goldberg" ,"MALS" , "MBA "),
c("Adam" , "R" , "Goldberg", "MEd" ))
hh <- unlist(ll)
intersect(hh,degrees)
[1] "MALS" "MEd"
or equivalently:
hh[hh %in% degrees]
[1] "MALS" "MEd"
Note that "MBA " (with a trailing space) does not match "MBA".
To get differences you can use
setdiff(hh,degrees)
[1] "Adam" "R" "Goldberg" "MBA "
...

Concatenating strings with

I have a data frame with several variables. I want to create a string by concatenating the variable names, with something else in between them...
Here is a simplified example (the number of variables is reduced to 3, whereas I actually have many).
Making up some data frame
df1 <- data.frame(1,2,3) # A one row data frame
names(df1) <- c('Location1','Location2','Location3')
Actual code...
len1 <- ncol(df1)
string1 <- 'The locations that we are considering are'
for(i in 1:(len1-1)) string1 <- c(string1,paste(names(df1[i]),sep=','))
string1 <- c(string1,'and',paste(names(df1[len1]),'.'))
string1
This gives...
[1] "The locations that we are considering are"
[2] "Location1"
[3] "Location2"
[4] "Location3 ."
But I want
The locations that we are considering are Location1, Location2 and Location3.
I am sure there is a much simpler method that some of you know...
Thank you for your time...

Are you looking for the collapse argument of paste?
> paste (letters [1:3], collapse = " and ")
[1] "a and b and c"

The fact that these are names of a data.frame does not really matter, so I've pulled that part out and assigned them to a variable strs.
strs <- names(df1)
len1 <- length(strs)
string1 <- paste("The locations that we are considering are ",
paste(strs[-len1], collapse=", ", sep=""),
" and ",
strs[len1],
".\n",
sep="")
This gives
> cat(string1)
The locations that we are considering are Location1, Location2 and Location3.
Note that this will not give sensible English if there is only 1 element in strs.
The idea is to collapse all but the last string with comma-space between them, and then paste that together with the boilerplate text and the last string.
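One way to cover the single-element case is a small helper. This is a sketch, and join_with_and is a hypothetical name introduced here:

```r
# Hypothetical helper: "A", "A and B", "A, B and C", ...
join_with_and <- function(strs) {
  n <- length(strs)
  if (n == 0) return("")
  if (n == 1) return(strs)
  paste0(paste(strs[-n], collapse = ", "), " and ", strs[n])
}

sentence <- paste0("The locations that we are considering are ",
                   join_with_and(c("Location1", "Location2", "Location3")), ".")
sentence
```

The surrounding boilerplate ("locations ... are") would still need its own singular form if only one name is possible.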

If your main goal is to print the results to the screen (or other output) then use the cat function (whose name derives from concatenate):
> cat(names(iris), sep=' and '); cat('\n')
Sepal.Length and Sepal.Width and Petal.Length and Petal.Width and Species
If you need a variable with the string, then you can use paste with the collapse argument. The sprintf function can also be useful for inserting strings into other strings (or numbers into strings).
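A short sketch of the sprintf route, combined with paste and collapse. The variable names locs and msg, and the count inserted with %d, are illustrative additions:

```r
locs <- c("Location1", "Location2", "Location3")
n <- length(locs)

# %d takes the integer count, each %s takes a string.
msg <- sprintf("The %d locations that we are considering are %s and %s.",
               n, paste(locs[-n], collapse = ", "), locs[n])
msg
```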

Another option would be:
library(stringr)
str_c("The locations that we are considering are ", str_c(str_c(head(names(df1), -1), collapse=", "), tail(names(df1), 1), sep=" and "), ".")

R: Replacing rownames of data frame by a substring[2]

I have a question about the use of gsub. The rownames of my data, have the same partial names. See below:
> rownames(test)
[1] "U2OS.EV.2.7.9" "U2OS.PIM.2.7.9" "U2OS.WDR.2.7.9" "U2OS.MYC.2.7.9"
[5] "U2OS.OBX.2.7.9" "U2OS.EV.18.6.9" "U2O2.PIM.18.6.9" "U2OS.WDR.18.6.9"
[9] "U2OS.MYC.18.6.9" "U2OS.OBX.18.6.9" "X1.U2OS...OBX" "X2.U2OS...MYC"
[13] "X3.U2OS...WDR82" "X4.U2OS...PIM" "X5.U2OS...EV" "exp1.U2OS.EV"
[17] "exp1.U2OS.MYC" "EXP1.U20S..PIM1" "EXP1.U2OS.WDR82" "EXP1.U20S.OBX"
[21] "EXP2.U2OS.EV" "EXP2.U2OS.MYC" "EXP2.U2OS.PIM1" "EXP2.U2OS.WDR82"
[25] "EXP2.U2OS.OBX"
In my previous question, I asked if there is a way to get the same names for the same partial names. See this question: Replacing rownames of data frame by a sub-string
The answer is a very nice solution. The function gsub is used in this way:
transfecties = gsub(".*(MYC|EV|PIM|WDR|OBX).*", "\\1", rownames(test))
Now, I have another problem, the program I run with R (Galaxy) doesn't recognize the | characters. My question is, is there another way to get to the same solution without using this |?
Thanks!

If you don't want to use the "|" character, you can try something like :
Rnames <-
c( "U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9" ,
"U2OS.OBX.2.7.9" , "U2OS.EV.18.6.9" ,"U2O2.PIM.18.6.9" ,"U2OS.WDR.18.6.9" )
Rlevels <- c("MYC","EV","PIM","WDR","OBX")
tmp <- sapply(Rlevels,grepl,Rnames)
apply(tmp,1,function(i)colnames(tmp)[i])
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR"
But I would seriously consider mentioning this to the team of galaxy, as it seems to be rather awkward not to be able to use the symbol for OR...
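The apply() call above returns a plain character vector only when every name matches exactly one label. A sketch of a more defensive variant, still without the "|" character, where extract_label is a hypothetical helper name and names with zero or several matching labels become NA:

```r
# Hypothetical helper: returns the single label contained in each name, else NA.
extract_label <- function(x, labels) {
  vapply(x, function(nm) {
    # fixed = TRUE does a literal substring search, so no regex syntax is needed.
    hits <- labels[vapply(labels, grepl, logical(1), x = nm, fixed = TRUE)]
    if (length(hits) == 1) unname(hits) else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

Rlevels <- c("MYC", "EV", "PIM", "WDR", "OBX")
res <- extract_label(c("U2OS.EV.2.7.9", "U2O2.PIM.18.6.9",
                       "X3.U2OS...WDR82", "no.label.here"), Rlevels)
res
```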

I wouldn't recommend doing this in general in R as it is far less efficient than the solution @csgillespie provided, but an alternative is to loop over the various strings you want to match and do the replacements on each string separately, i.e. search for "MYC" and replace only in those rownames that match "MYC".
Here is an example using the x data from @csgillespie's answer:
x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9",
"U2OS.OBX.2.7.9", "U2OS.EV.18.6.9", "U2O2.PIM.18.6.9","U2OS.WDR.18.6.9",
"U2OS.MYC.18.6.9","U2OS.OBX.18.6.9", "X1.U2OS...OBX","X2.U2OS...MYC")
Copy the data so we have something to compare with later (this just for the example):
x2 <- x
Then create a list of strings you want to match on:
matches <- c("MYC","EV","PIM","WDR","OBX")
Then we loop over the values in matches and do three things (numbered ## 1 to ## 3 in the code):
1) Create the regular expression by pasting together the current match string i with the other bits of the regular expression we want to use,
2) Use grepl() to return a logical indicator for those elements of x that contain the string i,
3) Use the same style of gsub() call as you were already shown, but only on the elements of x2 that matched the string, replacing only those elements.
The loop is:
for(i in matches) {
    rgexp <- paste(".*(", i, ").*", sep = "")  ## 1
    ind <- grepl(rgexp, x)                     ## 2
    x2[ind] <- gsub(rgexp, "\\1", x2[ind])     ## 3
}
x2
Which gives:
> x2
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR" "MYC" "OBX" "OBX" "MYC"
