R: grep drops all columns even when there is no match

Trying to remove columns from a large data frame using grep. It works fine when there actually are matching columns, but when there are zero matching columns it drops all the columns.
s <- s[, -grep("^Test", colnames(s))]
To confirm that there are no columns matching Test:
> y <- grep("^Test", colnames(s))
> y
integer(0)
What is exactly going on here?

You need to use grepl and ! instead.
df2 <- data.frame(ID =c(1,2,3), T = c("words", "stuff","things"))
df2[,!grepl("^Test", colnames(df2))]
ID T
1 1 words
2 2 stuff
3 3 things
grep() returns integer(0) when there isn't a match, and negating it still gives an empty integer vector, so s[, integer(0)] selects zero columns and everything is dropped.
-TRUE == -1, whereas !TRUE == FALSE.
Using !grepl() returns the full logical vector (TRUE TRUE, one value per column header), allowing you to correctly subset even when no columns meet the condition. In other words, for colnames(df)[i], grepl(..., colnames(df))[i] returns TRUE where your pattern is matched; applying ! inverts that, so you keep the columns that don't match and remove the ones that do.
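To make the fix robust in scripts, here is a small sketch (the data frame is made up for illustration): either subset with the logical vector from grepl(), or guard the negative index with a length() check before using it.

```r
# Toy data frame; none of its columns start with "Test"
s <- data.frame(ID = 1:3, Value = c(10, 20, 30))

# Option 1: logical subsetting never drops everything on zero matches
s1 <- s[, !grepl("^Test", colnames(s)), drop = FALSE]

# Option 2: only apply negative indexing when grep() actually found something
idx <- grep("^Test", colnames(s))
s2 <- if (length(idx) > 0) s[, -idx, drop = FALSE] else s

colnames(s1)  # "ID" "Value" -- all columns survive
```

drop = FALSE keeps the result a data frame even when only one column remains.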


For each row in DF, check if there is a match in a vector

I have a dataframe in R, and I want to check for any record in a vector that finds matches for the string in the DF. I can't seem to get it to work exactly right.
exampledf=as.data.frame(c("PIT","SLC"))
colnames(exampledf)="Column1"
examplevector=c("PITTPA","LAXLAS","JFKIAH")
This gets me close, but the result is a vector of (1,0,0) instead of a 0 or 1 for each row
exampledf$match=by(exampledf,1:nrow(exampledf),function(row) ifelse(grepl(exampledf$Column1,examplevector),1,0))
Expected result:
exampledf$match=c("1","0")
grepl returns a logical vector the same length as your examplevector. You can wrap it with the any() function to collapse that to a single TRUE/FALSE per row (equivalent to testing sum(...) > 0).
Here's a slightly modified form of your code:
exampledf$match = vapply(exampledf$Column1, function(x) as.integer(any(grepl(x, examplevector))), integer(1))
Note that vapply's FUN.VALUE must match the result type exactly, so the logical from any() is converted with as.integer.
So here is my solution:
library(dplyr)
exampledf=as.data.frame(c("PIT","SLC"))
colnames(exampledf)="Column1"
examplevector=c("PITTPA","LAXLAS","JFKIAH")
pmatch does what you want and tells you which element of the example vector each row matches. Use duplicates.ok = TRUE because you want multiple matches to show up; if you don't want that, set the argument to FALSE. I just used dplyr to create the new column, but you can do this however you would like.
exampledf %>%
  mutate(match_flag = ifelse(is.na(pmatch(Column1, examplevector, duplicates.ok = TRUE)),
                             0,
                             pmatch(Column1, examplevector, duplicates.ok = TRUE)))
Column1 match_flag
1 PIT 1
2 SLC 0
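For completeness, here is a base-R sketch of mine for the same row-wise 0/1 flag, without dplyr (fixed = TRUE treats each string literally rather than as a regex):

```r
exampledf <- data.frame(Column1 = c("PIT", "SLC"), stringsAsFactors = FALSE)
examplevector <- c("PITTPA", "LAXLAS", "JFKIAH")

# For each row, flag 1 if its string occurs anywhere in the vector
exampledf$match <- vapply(
  exampledf$Column1,
  function(x) as.integer(any(grepl(x, examplevector, fixed = TRUE))),
  integer(1),
  USE.NAMES = FALSE
)
exampledf$match  # 1 0
```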

R selecting rows from dataframe using logical indexing: accessing columns by `$` vs `[]`

I have a simple R data.frame object df. I am trying to select rows from this dataframe based on logical indexing from a column col in df.
I am coming from the Python world, where during similar operations I can choose either df[df[col] == 1] or df[df.col == 1] with the same end result.
However, in the R data frame df[df$col == 1] gives an incorrect result compared to df[df[,col] == 1] (confirmed by the summary command). I am not able to understand this difference, since links like http://adv-r.had.co.nz/Subsetting.html suggest either way is OK. Also, the str command on df$col and df[, col] shows the same output.
Are there any guidelines about when to use the $ vs [] operator?
Edit:
digging a little deeper and using this question as reference, it seems like the following code works correctly
df[which(df$col == 1), ]
however, it is not clear how this guards against NA, or when to use which
You are confusing a few things.
In
df[,col]
col should be the column number. For example,
col = 2
x = df[,col]
would select the second column and store it to x.
In
df$col
col should be the column name. For example,
df=data.frame(aa=1:5,bb=10:14)
x = df$bb
would select the second column and store it to x. But you cannot write df$2.
Finally,
df[[col]]
is the same as df[,col] if col is a number. If col is a character (a "character" in R is what other languages call a string), then it selects the column with that name. Example:
df=data.frame(aa=1:5,bb=10:14)
foo = "bb"
x = df[[foo]]
y = df[[2]]
z = df[["bb"]]
Now x, y, and z all contain a copy of the second column of df.
The notation foo[[bar]] is from lists. The notation foo[,bar] is from matrices. Since dataframe has features of both matrix and list, it can use both.
Use $ when you want to select one specific column by name df$col_name.
Use [] when you want to select one or more columns by number:
df[, 1]             # select column with index 1
df[, 1:3]           # select columns with indexes 1 to 3
df[, c(1, 3:5, 7)]  # select columns with indexes 1, 3 to 5, and 7
[[]] is mostly for lists.
EDIT: df[which(df$col == 1), ] works because which() converts the logical vector df$col == 1 into the integer positions of its TRUE values, silently dropping any NA. With plain logical indexing, each NA in df$col == 1 produces a junk row of NAs in the result, which is why guarding against NA matters.
See Remove rows with NAs (missing values) in data.frame to find out more about how to deal with missing values; it is good practice to handle missing values explicitly before subsetting.
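To make the NA point concrete, a small sketch (df and col here are made-up illustrations):

```r
df <- data.frame(col = c(1, 2, NA, 1))

# Logical indexing: the NA position becomes a row full of NAs
nrow(df[df$col == 1, ])         # 3 (two real matches plus one NA row)

# which() drops NA positions, so only true matches survive
nrow(df[which(df$col == 1), ])  # 2

# An explicit guard that avoids which() entirely:
nrow(df[!is.na(df$col) & df$col == 1, ])  # 2
```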

"for" loop not working

I am trying to isolate some values from a data frame
example:
test_df0 <- data.frame('col1' = c('string1', 'string2', 'string1'),
                       'col2' = c('value1', 'value2', 'value3'),
                       'col3' = c('string3', 'string4', 'string3'))
I want to obtain a new dataframe with only the unique strings from col1, and the corresponding strings from col3 (which will be identical for rows with identical col1).
This is the loop I wrote, but I must be doing some blunt mistake:
test_df1 <- as.data.frame(matrix(ncol = 2, nrow = 0))
colnames(test_df1) <- c('col1', 'col3')
for (i in unique(test_df0$col1)) {
  first_matching_row <- match(x = i, table = test_df0$col1)
  temp_df <- data.frame('col1' = i,
                        'col3' = test_df0[first_matching_row, 'col3'])
  rbind(test_df1, temp_df)
}
The resulting test_df1 though is empty. Cannot spot the mistake with the loop, I would be grateful for any suggestion.
Edit: the for loop itself is working; if I replace its last line with print(temp_df) instead of the rbind command, I get the correct results. I am not sure why the rbind is not working.
An easier and faster way to do this is with the duplicated() function. duplicated() looks through an input vector and returns TRUE if that value has been seen at an earlier index in the vector. For example:
> duplicated(c(0,0,0,1,2,3,0,3))
[1] FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
For the first value of 0 it hadn't seen one before, but for the next two it had. Likewise, 1, 2, and the first 3 were new, while the final 0 and 3 had appeared previously. This means that !duplicated() returns TRUE for the unique values of the data.
We can use this to index into the data frame to get the rows of test_df0 with unique values of col1 as follows:
test_df0[!duplicated(test_df0[["col1"]]), ]
But this returns all columns of the data frame. If we just want col1 and col3 we can index into the columns as well using:
test_df0[!duplicated(test_df0[["col1"]]), c("col1", "col3")]
As for why the loop isn't working, as #Jacob mentions, you aren't assigning the value you are creating with rbind to a value, so the value you create disappears after the function call.
You aren't actually assigning the rbind to anything! Presumably you need something like:
test_df1 <- rbind(test_df1, temp_df)
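Putting it together, the corrected loop from the question, where only the rbind line changes:

```r
test_df0 <- data.frame(col1 = c('string1', 'string2', 'string1'),
                       col2 = c('value1', 'value2', 'value3'),
                       col3 = c('string3', 'string4', 'string3'))

test_df1 <- data.frame(col1 = character(0), col3 = character(0))
for (i in unique(test_df0$col1)) {
  first_matching_row <- match(i, test_df0$col1)
  temp_df <- data.frame(col1 = i,
                        col3 = test_df0[first_matching_row, 'col3'])
  test_df1 <- rbind(test_df1, temp_df)  # assign the result back
}
test_df1
#      col1    col3
# 1 string1 string3
# 2 string2 string4
```

(Growing a data frame with rbind inside a loop is slow for large data; the duplicated() approach above scales better.)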

Getting a subset in R

I have a dataframe with 14 columns, and I want to subset it, keeping the same columns but only the rows whose ID repeats (for example, if ID = 2 appears more than once, I keep those rows).
To begin, I applied a table to my dataframe to see the frequencies of ID
head(sort(table(call.dat$IMSI), decreasing = TRUE), 100)
In my case, 20801170106338 repeats two times, so I want to see the two observations for this ID.
Afterwards, I tried x <- subset(call.dat, IMSI == "20801170106338") and hsb6 <- call.dat[call.dat$IMSI == "20801170106338", ], but the results are wrong (x returns 0 observations of 14 variables, and hsb6 contains only NA in my dataframe).
Can you help me, thanks.
PS: IMSI is a numeric value.
And x <- subset(call.dat, Handset.Manufacturer == "LG") is another example which works perfectly...
You can use duplicated, a function that gives you a vector that is TRUE wherever a value is a repeat of an earlier one.
isDuplicated <- duplicated(call.dat$IMSI)
Then, you can extract all the rows containing a duplicated value.
call.dat.duplicated <- call.dat[isDuplicated, ]
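One caveat: duplicated() alone marks only the second and later occurrences, so the first row of each repeated ID is missed. Combining it with fromLast = TRUE keeps every row of a repeated ID. A sketch with made-up data (call.dat and IMSI as in the question):

```r
call.dat <- data.frame(IMSI = c(20801170106338, 11111111111111, 20801170106338),
                       Handset.Manufacturer = c("LG", "Samsung", "LG"))

# TRUE for every row whose IMSI appears more than once, including the first
dup <- duplicated(call.dat$IMSI) | duplicated(call.dat$IMSI, fromLast = TRUE)
call.dat.duplicated <- call.dat[dup, ]
nrow(call.dat.duplicated)  # 2: both rows of the repeated IMSI
```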

Concatenating strings from different rows in R

I have a R data frame which looks like
data.1 data.character
a **str1**,str2,str2,str3,str4,str5,str6
b str3,str4,str5
c **str1**,str6
I am currently using grepl to identify if the column data.character has my search string "<str>" and if so I want all the row values in data.1 to be concatenated into one string with a separator
e.g. if I use grepl("str1", data.character) it will return two rows of df$data.1, and I want an output like
a,c ( rows which contain str1 in data.character)
I am currently using two for loops, but I know this is not an efficient method. I was wondering if someone could suggest a more elegant and less time-consuming method.
You were almost there (now for my long-winded answer).
# Data
df <- read.table(text="data.1 data.character
a **str1**,str2,str2,str3,str4,str5,str6
b str3,str4,str5
c **str1**,str6",header=T,stringsAsFactors=F)
Match string
# In your question you used grepl, which produces a logical vector (TRUE if
# the string is present)
grepl("str1" , df$data.character)
#[1] TRUE FALSE TRUE
# In my comment I used grep, which produces a positional index of the vector
# where the string is present (this was due to me not reading your grepl
# properly rather than because of any special property)
grep("str1" , df$data.character)
# [1] 1 3
Then subset the vector that you want at these positions resulting from grep (or grepl)
(s <- df$data.1[grepl("str1" , df$data.character)])
# [1] "a" "c"   (the first and third elements are selected)
Paste these together into the required format (collapse argument is used to define the separator between the elements)
paste(s,collapse=",")
# [1] "a,c"
So more succinctly
paste(df$data.1[grep("str1" , df$data.character)],collapse=",")
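The one-liner generalizes naturally to a small helper, so several search strings can be processed at once without loops (the function name is my own, and the data frame is rebuilt inline to keep the sketch self-contained):

```r
collapse_matches <- function(pattern, df) {
  # Collect the data.1 values of rows whose data.character contains the pattern
  paste(df$data.1[grepl(pattern, df$data.character)], collapse = ",")
}

df <- data.frame(data.1 = c("a", "b", "c"),
                 data.character = c("str1,str2,str3", "str3,str4,str5", "str1,str6"),
                 stringsAsFactors = FALSE)

collapse_matches("str1", df)  # "a,c"

# One collapsed string per search pattern, named by pattern
vapply(c("str1", "str3"), collapse_matches, character(1), df = df)
```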
