I have read in a CSV file in matrix form (m rows and n columns). I want to filter the matrix with a condition that, stated in words, reads:
Select all values from column x where the value of another column in the same row equals "blabla".
It is like a SELECT statement in a database, where I ask for the subset of the matrix that satisfies these constraints.
How can I do this in R? I have the data as a data frame and can access it by the column headers. data["column_values" = "15"] does not give me back only the rows where the column named column_values has the value 15.
Thanks
You said you just wanted the column x values where column_values was 15, right?
subset(dat, column_values==15, select=x)
This may come back as a data frame, so it's possible you may need to unlist() it and maybe even "unfactor" it.
> dat
Subject Product
1 1 ProdA
2 1 ProdB
3 1 ProdC
4 2 ProdB
5 2 ProdC
6 2 ProdD
7 3 ProdA
8 3 ProdB
> subset(dat, Subject==2, Product)
Product
4 ProdB
5 ProdC
6 ProdD
> unlist( subset(dat, Subject==2, Product) )
Product1 Product2 Product3
ProdB ProdC ProdD
Levels: ProdA ProdB ProdC ProdD
> as.character( unlist( subset(dat, Subject==2, Product) ) )
[1] "ProdB" "ProdC" "ProdD"
If you want all of the columns you can drop the third argument (the select= argument):
subset(dat, Subject==2 )
Subject Product
4 2 ProdB
5 2 ProdC
6 2 ProdD
Assuming that dat is the data frame in question, col is the name of the column and "value" is the value that you want, you can do
dat[dat$col=="value",]
That fetches all of the rows of dat for which dat$col=="value", and all of the columns.
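A minimal self-contained sketch (dat, col, and "value" are made-up names for illustration):
dat <- data.frame(col = c("value", "other", "value"), x = 1:3)
dat[dat$col == "value", ]  # returns rows 1 and 3, with all columns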
First, note that a matrix and a data.frame are different things in R. I imagine you have a data.frame (as that is what read.csv() returns). data.frames have named columns (if you don't supply names, generic ones are created for you).
You can subset a data.frame by indicating both what rows you want and/or what columns you want. The easiest way to specify which rows is with a logical vector, often built out of comparisons using specific columns of the data.frame. For example data[["column values"]] == "15" would make a logical vector which is TRUE if the corresponding entry in the column column values is the string "15" (since it is in quotes, it is a string, not a number). You can make as complicated a selection criteria as you like (combining logical vectors with & and |) to specify the rows you want. This vector becomes the first argument in the indexing.
A list of column names or numbers can be the second argument. If either argument is missing, all rows (or columns) are assumed.
Putting this all together, you get examples like
data[data[["column values"]] == "15", ]
or using an actual data set (mtcars)
mtcars[mtcars$am == 1, ]
mtcars[mtcars$am == 1 & mtcars$hp > 100, "mpg"]
mtcars[mtcars$am == 1 & mtcars$hp > 100, "mpg", drop=FALSE]
mtcars[mtcars$hp > 100, c("mpg", "carb")]
Take a look at what each of the conditionals (first arguments, e.g. mtcars$am == 1 & mtcars$hp > 100) return to get a better sense of how indexing works.
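For instance, run the conditional on its own before using it as an index; a quick sketch with the built-in mtcars data:
idx <- mtcars$am == 1 & mtcars$hp > 100  # a logical vector, one TRUE/FALSE per row
head(idx)
mtcars[idx, "mpg"]  # the mpg values of the rows marked TRUE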
I have a data frame named "mydata" that looks like this:
A B C D
1. 5 4 4 4
2. 5 4 4 4
3. 5 4 4 4
4. 5 4 4 4
5. 5 4 4 4
6. 5 4 4 4
7. 5 4 4 4
I'd like to delete rows 2, 4, and 6, so the result looks like this:
A B C D
1. 5 4 4 4
3. 5 4 4 4
5. 5 4 4 4
7. 5 4 4 4
The key idea is you form a set of the rows you want to remove, and keep the complement of that set.
In R, the complement of a set is given by the '-' operator.
So, assuming the data.frame is called myData:
myData[-c(2, 4, 6), ] # notice the -
Of course, don't forget to reassign myData if you want to drop those rows permanently; otherwise, R just prints the result.
myData <- myData[-c(2, 4, 6), ]
You can also work with a so-called boolean vector, a.k.a. a logical vector:
row_to_keep = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
myData = myData[row_to_keep,]
Note that the ! operator acts as a NOT, i.e. !TRUE == FALSE:
myData = myData[!row_to_keep,]
This seems a bit cumbersome in comparison to @mrwab's answer (+1 btw :)), but a logical vector can be generated on the fly, e.g. where a column value exceeds a certain value:
myData = myData[myData$A > 4,]
myData = myData[!myData$A > 4,] # equal to myData[myData$A <= 4,]
You can transform a boolean vector to a vector of indices:
row_to_keep = which(myData$A > 4)
Finally, a very neat trick is that you can use this kind of subsetting not only for extraction, but also for assignment:
myData$A[myData$A > 4] <- NA  # no comma here: myData$A is a vector, not a data frame
where column A is assigned NA (R's missing-value marker) wherever A exceeds 4.
Problems with deleting by row number
For quick and dirty analyses, you can delete rows of a data.frame by number as per the top answer. I.e.,
newdata <- myData[-c(2, 4, 6), ]
However, if you are trying to write a robust data analysis script, you should generally avoid deleting rows by numeric position, because the order of the rows in your data may change in the future. A general principle of data.frames, as of database tables, is that the order of the rows should not matter. If the order does matter, it should be encoded in an actual variable in the data.frame.
For example, imagine you imported a dataset and deleted rows by numeric position after inspecting the data and identifying the row numbers of the rows that you wanted to delete. However, at some later point, you go into the raw data and have a look around and reorder the data. Your row deletion code will now delete the wrong rows, and worse, you are unlikely to get any errors warning you that this has occurred.
Better strategy
A better strategy is to delete rows based on substantive and stable properties of the row. For example, if you had an id column variable that uniquely identifies each case, you could use that.
newdata <- myData[ !(myData$id %in% c(2,4,6)), ]
Other times, you will have formal exclusion criteria that can be specified, and you can use one of the many subsetting tools in R to exclude cases based on that rule.
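For instance, a sketch of rule-based exclusion (age and consent are hypothetical columns used purely for illustration):
newdata <- subset(myData, age >= 18 & consent == "yes")  # keep only cases meeting the inclusion rule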
Create an id column in your data frame, or use any existing column that identifies each row. Deleting by numeric index is unreliable.
Use the subset function to create a new frame.
updated_myData <- subset(myData, id != 6)
print(updated_myData)
updated_myData <- subset(myData, id %in% c(1, 3, 5, 7))
print(updated_myData)
By simplified sequence:
mydata[-(1:3 * 2), ]
By sequence:
mydata[seq(1, nrow(mydata), by = 2), ]
By negative sequence:
mydata[-seq(2, nrow(mydata), by = 2), ]
Or if you want to subset by selecting odd numbers:
mydata[which(1:nrow(mydata) %% 2 == 1), ]
Or if you want to subset by selecting odd numbers, version 2:
mydata[which(1:nrow(mydata) %% 2 != 0), ]
Or if you want to subset by filtering even numbers out:
mydata[-which(1:nrow(mydata) %% 2 == 0), ]  # note -which(), not !which(), to drop those rows
Or if you want to subset by filtering even numbers out, version 2:
mydata[-which(1:nrow(mydata) %% 2 != 1), ]
For completeness, I'll add that this can be done with dplyr as well using slice. The advantage of using this is that it can be part of a piped workflow.
df <- df %>%
  ...                        # earlier steps in the pipeline
  slice(-c(2, 4, 6)) %>%
  ...                        # later steps in the pipeline
Of course, you can also use it without pipes.
df <- slice(df, -c(2, 4, 6))
The "not vector" format, -c(2, 4, 6) means to get everything that is not at rows 2, 4 and 6. For an example using a range, let's say you wanted to remove the first 5 rows, you could do slice(df, 6:n()). For more examples, see the docs.
Delete Dan from employee.data, with no need to create a new data.frame:
employee.data <- subset(employee.data, name!="Dan")
Here's a quick and dirty function to remove a row by index.
removeRowByIndex <- function(x, row_index) {
  nr <- nrow(x)
  if (nr < row_index) {
    print('row_index exceeds number of rows')
  } else if (row_index == 1) {
    return(x[2:nr, ])
  } else if (row_index == nr) {
    return(x[1:(nr - 1), ])
  } else {
    return(x[c(1:(row_index - 1), (row_index + 1):nr), ])
  }
}
Its main flaw is that the row_index argument doesn't follow the R pattern of accepting a vector of values. There may be other problems, as I only spent a couple of minutes writing and testing it, and have only started using R in the last few weeks. Any comments and improvements would be very welcome!
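In that spirit, a minimal vectorized sketch (assuming the caller passes in-range indices; the function name is my own):
removeRowsByIndex <- function(x, row_indices) {
  stopifnot(all(row_indices >= 1), all(row_indices <= nrow(x)))
  x[-row_indices, , drop = FALSE]  # negative indexing handles first, last and middle rows uniformly
}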
To identify by a name:
Find the rows whose unique ID matches, then drop them. If the unique ID applies to multiple rows, all of those rows will be removed.
Code:
Rows <- which(grepl("unique ID", DF$Column))
DF2 <- DF[-Rows, ]
DF2
Another approach when working with Unique IDs is to subset data:
(This came from an actual report where I wanted to remove the chemical standard.)
Chem.Report<-subset(Chem.Report, Chem_ID!="Standard")
Chem_ID is the column name. The ! (in !=) is what does the excluding.
I have a dataset loaded in R, and one of its columns contains text. This text is not unique (several rows can share the same value), but it represents a specific condition of a row, so the first 3-5 letters of this field indicate the group the row belongs to. Let me explain with an example.
Here are 3 rows, showing only the id and the column I need to group by:
ID........... TEXTFIELD
1............ VGH2130
2............ BFGF2345
3............ VGH3321
Given the previous example, I would like to create a new column in the dataframe holding the group, like this:
ID........... TEXTFIELD........... NEWCOL
1............ VGH2130............. VGH
2............ BFGF2345............ BFGF
3............ VGH3321............. VGH
To determine the groups that go in this new column, I would like to make a vector of the possible groups (since every row will fall into one of them), for example groups <- c("VGH", "BFGF", ...).
Can anyone shed some light on how to do this efficiently (without a for loop, since I have millions of rows and that would take ages)?
You can also try
> library(stringr)
> data$group <- str_extract(data$TEXTFIELD, "[A-Za-z]+")
> data
ID TEXTFIELD group
1 1 VGH2130 VGH
2 2 BFGF2345 BFGF
3 3 VGH3321 VGH
You can try, if df is your data.frame:
df$NEWCOL <- gsub("([A-Za-z]+)\\d+.*", "\\1", df$TEXTFIELD)
> df
# ID TEXTFIELD NEWCOL
#1 1 VGH2130 VGH
#2 2 BFGF2345 BFGF
#3 3 VGH3321 VGH
Does the text field always have 3 or 4 letters preceding the numbers?
You can check that by doing:
nrow(data[grepl("^[A-Za-z]{3,4}\\d+$", data$TEXTFIELD), ])  # number of rows where TEXTFIELD is 3-4 letters followed by digits
If yes, then:
require(stringr)
data$NEWCOL <- str_extract(data$TEXTFIELD, "[A-Za-z]{1,4}")
Final Step:
data$group <- ifelse(data$NEWCOL == "VGH", "Group Name",
               ifelse(data$NEWCOL == "BFGF", "Group Name", ...))
# Complete the nested ifelse chain to classify all groups
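As the number of groups grows, a named lookup vector may be cleaner than a nested ifelse chain; a sketch with placeholder labels (the group names are hypothetical):
group_map <- c(VGH = "Group 1", BFGF = "Group 2")  # hypothetical group labels
data$group <- unname(group_map[data$NEWCOL])       # gives NA for prefixes not in the map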
I have a data frame which I populate from a csv file as follows (data for sample only) :
> csv_data <- read.csv('test.csv')
> csv_data
gender country income
1 1 20 10000
2 2 20 12000
3 2 23 3000
I want to convert country to a factor. However, when I do the following, it fails:
> csv_data[,2] <- lapply(csv_data[,2], factor)
Warning message:
In `[<-.data.frame`(`*tmp*`, , 2, value = list(1L, 1L, 1L)) :
provided 3 variables to replace 1 variables
However, if I convert both gender and country to factors, it succeeds:
> csv_data[,1:2] <- lapply(csv_data[,1:2], factor)
> is.factor(csv_data[,1])
[1] TRUE
> is.factor(csv_data[,2])
[1] TRUE
Is there something I am doing wrong? I want to use lapply since I want to programmatically convert the columns into factors, and it is possible that only one column needs to be converted (it could be more; the number is driven by the arguments to a function). Is there any way I can do it using lapply only?
When subsetting for one single column, you'll need to change it slightly.
There's a big difference between
lapply(df[,2], factor)
and
lapply(df[2], factor)
## and/or
lapply(df[, 2, drop=FALSE], factor)
Have a look at the output of each. If you remove the comma, everything should work fine. Using the comma in [,] turns a single column into an atomic vector, so each element in the vector would be factored individually, whereas leaving the comma out keeps the column as a list (a one-column data.frame), which is what you want to give to lapply in this situation. However, if you use drop=FALSE, you can leave the comma in, and the column will remain a data.frame.
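A quick way to see the difference, using the csv_data frame from the question:
str(csv_data[, 2])                # an atomic vector
str(csv_data[2])                  # a one-column data.frame (a list), which is what lapply expects
str(csv_data[, 2, drop = FALSE])  # same as csv_data[2]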
No good:
df[,2] <- lapply(df[,2], factor)
# Warning message:
# In `[<-.data.frame`(`*tmp*`, , 2, value = list(1L, 1L, 1L)) :
# provided 3 variables to replace 1 variables
Succeeds on a single column:
df[,2] <- lapply(df[,2,drop=FALSE], factor)
df[,2]
# [1] 20 20 23
# Levels: 20 23
In my opinion, the best way to subset data frame columns is without the comma. This also succeeds:
df[2] <- lapply(df[2], factor)
df[[2]]
# [1] 20 20 23
# Levels: 20 23
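Since the question asks for a programmatic version, a sketch that works whether one or several columns are selected (cols is an assumed vector of column names or indices):
cols <- c("gender", "country")  # could equally well hold a single name
csv_data[cols] <- lapply(csv_data[cols], factor)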
I have two dataframes and I wish to insert the values of one dataframe into another (let's call them DF1 and DF2).
DF1 consists of 2 columns. Column 1 (col1) contains the characters a to z, and col2 has a value associated with each character.
DF2 is a dataframe with 3 columns. The first two hold every combination of the entries in DF1$col1, so: aa, ab, ac, ad, etc., where the first letter is in col1 and the second letter is in col2.
I want to create a simple mathematical model using the values in DF1$col2 to see the outcomes of every possible combination of the objects in DF1$col1.
The first step is to transfer values from DF1$col2 to DF2$col3 (each value in DF2$col3 should correspond to the character in DF2$col1), but that's where I'm stuck. I currently have:
for(j in 1:length(DF2$col1))
{
## this part is to use the characters in DF2$col1 as an input
## to yield the output for DF2$col3--
input=c(DF2$col1)[j]
## This is supposed to use the values found in DF1$col2 to fill in DF2$col3
g=DF1[(DF1$col2==input),"pred"]
## This is so that the values will fill in DF2$col3--
DF2$col3=g
}
When I run this, DF2$col3 will be filled up with the same value for a specific character from DF1 (e.g. DF2$col3 will have all the rows filled with the value associated with character "a" from DF1)
What exactly am I doing wrong?
Thanks a bunch for your time
You should really use merge for this, as @Aaron suggested in his comment above, but if you insist on writing your own loop, the problem is in your last line: you assign g to the whole col3 column. You should use the j index there too, like:
for(j in 1:length(DF2$col1))
{
DF2$col3[j] = DF1[which(DF1$col2 == DF2$col1[j]), "pred"]
}
If this does not work out, then please also post some sample data so we can help in more detail (as I do not know, but can only guess, what "pred" might be).
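For reference, a sketch of the merge-based approach mentioned above (assuming DF2 starts with only col1 and col2, and that DF1's second column is the one wanted; note that merge may reorder rows):
lookup <- setNames(DF1, c("col1", "col3"))  # rename DF1's columns so the key lines up
DF2 <- merge(DF2, lookup, by = "col1")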
It sounds like what you are trying to do is a simple join, that is, match DF1$col1 to DF2$col1 and copy the corresponding value from DF1$col2 into DF2$col3. Try this:
DF1 <- data.frame(col1=letters, col2=1:26, stringsAsFactors=FALSE)
DF2 <- expand.grid(col1=letters, col2=letters, stringsAsFactors=FALSE)
DF2$col3 <- DF1$col2[match(DF2$col1, DF1$col1)]
This uses the function match(), which, as the documentation states, "returns a vector of the positions of (first) matches of its first argument in its second." The values you have in DF1$col1 are unique, so there will not be any problem with this method.
As a side note, in R it is usually better to vectorize your work rather than using explicit loops.
Not sure I fully understood your question, but you can try this:
df1 <- data.frame(col1=letters[1:26], col2=sample(1:100, 26))
df2 <- with(df1, expand.grid(col1=col1, col2=col1))
df2$col3 <- df1$col2
The last command uses recycling (it could be written as rep(df1$col2, 26) as well).
The results are shown below:
> head(df1, n=3)
col1 col2
1 a 68
2 b 73
3 c 45
> tail(df1, n=3)
col1 col2
24 x 22
25 y 4
26 z 17
> head(df2, n=3)
col1 col2 col3
1 a a 68
2 b a 73
3 c a 45
> tail(df2, n=3)
col1 col2 col3
674 x z 22
675 y z 4
676 z z 17