Column1A<-Dat$col1
Column1B<-Dat2$col2
Both columns, Column1A and Column1B, consist of mixed values such as ABC1, 234, etc. In addition, each column can have duplicate entries. For instance:
Column1A
ABC1 ABC2 1234 ABC1
Column1B
ABC2 ABC3 1234
Is there a way to get the list of unique entries for each column? And if the unique entry lists differ between the two columns, how do I find the differences?
unique should work for getting the unique values of a vector or list.
To find the terms that differ between the two unique lists, you can simply go through one of the lists and, for each element, check whether it appears in the other list.
A simple while loop should work: you stop at the point where an element from list 1 cannot be found in list 2.
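A minimal sketch of that membership check, assuming Column1A and Column1B are plain vectors as defined above (setdiff gives the same result in a single call):
u1 <- unique(Column1A)
u2 <- unique(Column1B)
u1[!(u1 %in% u2)]   # entries unique to Column1A; equivalent to setdiff(u1, u2)
u2[!(u2 %in% u1)]   # entries unique to Column1B; equivalent to setdiff(u2, u1)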
You can use unique to find the unique values:
unique(Column1A)
unique(Column1B)
Since your data are already in a data.frame (or a list), there is no need to copy the columns out first. Just use apply:
apply(Dat, 2, unique)
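One caveat (my note, not part of the original answer): apply first coerces the data frame to a matrix, and it only returns a matrix when every column happens to have the same number of unique values; lapply avoids both issues by always returning a named list:
# Column-wise unique values as a named list, with no matrix coercion
lapply(Dat, unique)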
How do we identify all those row entries in a particular column that contain a specific set of keywords?
For example, I have the following dataframe:
test <- data.frame(nom = 1:5, name = c("ser bla", "onlybla", "inspectiongfa serdafds", "inspection", "serbla blainspection"))
My keywords of interest are "ser" & "inspection"
What I'm looking for is to list all the values of the second column (i.e. name) in which both keywords are present together.
So basically, my output should list the name values of rows 3 and 5, viz. "inspectiongfa serdafds" & "serbla blainspection".
What I have tried is the following:
I first generate a truth table to enlist the presence of each of the keywords for each row in the dataframe as follows:
as.data.frame(sapply(c("ser", "inspection"), grepl, test$name))
Once I get this, all I have to do is identify the rows where the values are the pair TRUE TRUE; those are the cases where both keywords of interest are present. Here that is rows 3 & 5.
But I'm not able to figure out how to identify the rows with the TRUE TRUE pair, and I wonder whether this whole process is overkill and could be done more efficiently.
Any help would be appreciated. Thanks!
You're almost there :)
Here's a solution extending what you have done:
# store your logic test outcomes
conditions_df <- as.data.frame(sapply(c("ser", "inspection"), grepl, test$name))
# False=0 & True=1. Can use rowSums to get the total and find ones that =2 ie True+True
# which gives you the indices of the TRUE outcomes ie the rows we need to filter test
locate_rows <- which(rowSums(conditions_df) == 2)
test$name[locate_rows]
[1] "inspectiongfa serdafds"
[2] "serbla blainspection"
I'm trying to find a clean way to get the first column of my DT, for each row, to be equal to the user_id found in other columns. That is, I must perform a search of "user_id" across each row, and return the entirety of the cell where the instance is found.
I first tried to get the index of the column where the partial match is found, and then use this to set the first column's values, but it did not work. Example:
user_id 1 2
1: N/A 300 user_id154
2: N/A user_id301 user_id125040
3: N/A 302 user_id2
For instance, I want to obtain the following
user_id
user_id154
user_id301
user_id2
Please bear in mind that I am new to this kind of data formatting in R (most of the work I do does not involve cleaning JSON files), and that my data.table has over 1M rows. The answer does not need to be super efficient, but it definitely shouldn't take more than 5 minutes, or it will be considered too slow by my boss.
Hopefully this is understandable.
I'm sure someone will provide a more elegant solution, but this does the trick:
dt[, user_id := str_extract(str_c(`1`, `2`), "user_id[0-9]*")]
This first combines the two columns row by row, then, for each row, extracts the first user_id found in the combined value. (Note the backticks around the column names, since they are the non-syntactic names 1 and 2.)
(Requires the stringr package)
For every row in your table, grep the first value that contains "user_id" and put the result into the user_id column.
df$user_id <- apply(df, 1, function(x) grep("user_id", x, value = TRUE)[1])
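A quick sanity check on a toy reconstruction of the table (the column names "1" and "2" and the values are taken from the example above; the real data.table will differ):
df <- data.frame("1" = c("300", "user_id301", "302"),
                 "2" = c("user_id154", "user_id125040", "user_id2"),
                 check.names = FALSE)
df$user_id <- apply(df, 1, function(x) grep("user_id", x, value = TRUE)[1])
df$user_id
# [1] "user_id154" "user_id301" "user_id2"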
Suppose we have two categorical variables, A and B, that can each take 6 values, so there are 36 possible combinations. I want to create a new variable, category, that enumerates these possibilities based on the values of A and B. Is there a way of doing this without hard-coding?
apply(expand.grid(unique(A), unique(B)), 1, paste, collapse="")
From the innermost function to the outermost (a small toy example follows this list):
unique returns the unique values of its argument
expand.grid returns a data frame containing the Cartesian product of its components
apply, applies a given function to the specified matrix/data-frame/... along the given dimension (1 = rows, 2 = columns)
paste concatenates strings or vector elements
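A small toy illustration (the values of A and B are invented just to show the shape of the result):
A <- c("a", "b", "c", "a")
B <- c("x", "y", "x", "y")
combos <- apply(expand.grid(unique(A), unique(B)), 1, paste, collapse = "")
combos
# [1] "ax" "bx" "cx" "ay" "by" "cy"
# one way to label each observation with its combination is to match against the enumeration
match(paste(A, B, sep = ""), combos)
# [1] 1 5 3 4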
I have found similar problems to this here:
Count the number of words in a string in R?
and here
Faster way to split a string and count characters using R?
but I can't get either to work in my example.
I have quite a large dataframe. One of the columns has genomic locations for features and the entries are formatted as follows:
[hg19:2:224840068-224840089:-]
[hg19:17:37092945-37092969:-]
[hg19:20:3904018-3904040:+]
[hg19:16:67000244-67000248,67000628-67000647:+]
I am splitting these entries into their individual elements to get the following (i.e., for the first entry):
hg19 2 224840068 224840089 -
But in the case of the fourth entry, I would like to parse this into two separate locations, i.e.
[hg19:16:67000244-67000248,67000628-67000647:+]
becomes
hg19 16 67000244 67000248 +
hg19 16 67000628 67000647 +
(with all the associated data in the adjacent columns filled in from the original)
An easy way to identify which rows need this action is simply to count the commas ',', as they don't appear in any other text in any other column, except where there are multiple genomic locations for the feature.
However, I am failing at the first hurdle because the sapply command incorrectly returns '1' for every entry.
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), length)
(or)
testdat$multiple <- sapply(gregexpr("\\,", testdat$genome_coordinates), length)
table(testdat$multiple)
1
4
Using the example I have posted above, I would expect the output to be
testdat$multiple
0
0
0
1
Actually, running grep -c on the same data at the command line shows I have 10 entries containing ','.
So initially I would like to get this working, but I am also a bit stumped for ideas on how to then extract the two (or more) locations, put them on their own rows, and fill in the adjacent data.
What I actually intended to do was stick to something I know (the command line): grep out the rows containing ',', duplicate the file, then split/awk the selected columns (first and second location into respective files), and finally cat and sort them. If there is a niftier way for me to do this in R then I would love a pointer (a rough sketch of that idea follows).
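For what it's worth, a rough base-R sketch of that split-and-stack idea (it assumes the coordinate column is named genome_coordinates, as in the sapply call above, and that every entry follows the bracketed build:chromosome:ranges:strand pattern shown earlier):
# One output row per comma-separated range; all other columns are repeated
expanded <- do.call(rbind, lapply(seq_len(nrow(testdat)), function(i) {
  parts  <- strsplit(gsub("\\[|\\]", "", testdat$genome_coordinates[i]), ":")[[1]]
  ranges <- strsplit(parts[3], ",")[[1]]            # one element per genomic location
  cbind(testdat[rep(i, length(ranges)), , drop = FALSE],
        build  = parts[1],
        chr    = parts[2],
        start  = sub("-.*", "", ranges),
        end    = sub(".*-", "", ranges),
        strand = parts[4])
}))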
gregexpr does in fact return an element of length 1 for every entry here, because a failed match is reported as -1, which still has length 1. If you want to find the rows which have a match versus the ones which don't, you need to look at the returned value, not the length.
Try foo <- sapply(testdat$genome_coordinates, function(x) any(gregexpr(",", x)[[1]] > 0)) to get the rows with a comma.
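Building on that, a small sketch that counts only the positive match positions, so rows without a comma give 0 and the result matches the expected 0 0 0 1 output:
# A failed match is -1, so only positive positions are counted as commas
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates),
                           function(m) sum(m > 0))
table(testdat$multiple)
# for the four example entries above:
# 0 1
# 3 1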
Given data like this
C1<-c(3,-999.000,4,4,5)
C2<-c(3,7,3,4,5)
C3<-c(5,4,3,6,-999.000)
DF<-data.frame(ID=c("A","B","C","D","E"),C1=C1,C2=C2,C3=C3)
How do I go about removing the -999.000 data in all of the columns?
I know this works per column
DF2<-DF[!(DF$C1==-999.000 | DF$C2==-999.000 | DF$C3==-999.000),]
But I'd like to avoid referencing each column explicitly. I am thinking there is an easy way to reference all of the columns in a particular data frame, e.g.:
DF3<-DF[!(DF[,]==-999.000),]
or
DF3<-DF[!(DF[,(2:4)]==-999.000),]
but obviously these do not work
And out of curiosity, bonus points if you can tell me why I need that last comma before the closing square bracket, as in:
==-999.000),]
The following may work
DF[!apply(DF==-999,1,sum),]
or if you can have multiple -999 on a row
DF[!(apply(DF==-999,1,sum)>0),]
or
DF[!apply(DF==-999,1,any),]
Based on your code, I'll assume that you want to remove all rows that contain -999.
DF2 <- DF[rowSums(DF == -999) == 0, ]
As for your bonus question: A data frame is a list of vectors, all of which have the same length. If we think of the vectors as columns, then a data frame can be thought of as a matrix where the columns might have different types (numeric, character, etc). R allows you to refer to elements of a data frame much the same way you refer to elements of a matrix; by using row and column indices. So DF[i, j] refers to the ith element in the jth vector of DF, which you can think of as the ith row and jth column. So if you want to retain only some of the rows of the data frame and all columns, you can use a matrix-like notation: DF[row.indices, ].
To address your "bonus" question, if we go to the documentation for ?Extract.data.frame we will find:
Data frames can be indexed in several modes. When [ and [[ are used
with a single index (x[i] or x[[i]]), they index the data frame as if
it were a list. In this usage a drop argument is ignored, with a
warning.
and also:
When [ and [[ are used with two indices (x[i, j] and x[[i, j]]) they
act like indexing a matrix: [[ can only be used to select one element.
Note that for each selected column, xj say, typically (if it is not
matrix-like), the resulting column will be xj[i], and hence rely on
the corresponding [ method, see the examples section.
So you need the comma to ensure that R knows you are referring to a row, not a column.
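A quick illustration using the DF defined above:
DF[2, ]   # two indices: row 2, all columns (the "B" row)
DF[2]     # one index: list-style indexing, so this is the second column, C1, as a one-column data frame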
If your goal is to remove all rows that contain at least one -999 value, then this could be a possible answer: recode those values to NA and drop the incomplete rows:
DF[DF==-999] <- NA
na.omit(DF)
ID C1 C2 C3
1 A 3 3 5
3 C 4 3 3
4 D 4 4 6