Loop through an unknown hierarchy in R

I am looking for a way to loop through a hierarchy of unknown depth in R (I only learn the data for a level when I request it). For example,
I request the highest level of the hierarchy and put the result in a data frame:
id name
1 Books
2 DVDs
3 Computer
For the next step I want to look into the Books category, so I make a new request with id 1 and get:
id name
11 Child books
12 Fantasy
Next I want to look into the Child books category, so I make a new request for id 11:
id name
111 Baby
112 Education
113 History
And so on:
id name
1111 Sound
1112 Touch
At this moment I don't know how deep each hierarchy is, but I can tell that the depth differs between categories. In the end I would like the data frame to look like this:
id  name      id  name         id   name       id    name   id  name
1   Books     11  Child books  111  Baby       1111  Sound  ...
1   Books     11  Child books  111  Baby       1112  Touch  ...
1   Books     11  Child books  112  Education  etc.
1   Books     11  Child books  113  History    etc.
1   Books     12  Fantasy      etc.
.................
2   DVDs      etc.
.................
3   Computer  etc.
.................
So I can extract the number of rows of the next hierarchy level and repeat the row that number of times:
df[rep(x,each=nrow(df_next)),]
But I have no idea how to loop over an unknown (and changing) number of levels.

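Here's a not-so-elegant solution:
(i) subFn is a custom function that splits an id into one element per digit, so ids of different lengths yield different numbers of levels:
subFn <- function(id){
  id <- as.character(id)
  # use the longest id so that a vector of ids is handled without a warning
  len <- max(nchar(id))
  tmp <- lapply(seq_len(len), function(x) substring(id, x, x))
  names(tmp) <- paste0("level_", seq_len(len))
  return(tmp)
}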
## example
subFn("1111")
$level_1
[1] "1"
$level_2
[1] "1"
$level_3
[1] "1"
$level_4
[1] "1"
(ii) Create a list of data frames in which the id is split into a number of columns that depends on its length:
dat_list <- lapply(list(df1, df2, df3), function(x)
  do.call(data.frame, c(list(name = x[, "name"], stringsAsFactors = FALSE), subFn(x[, "id"]))))
(iii) Use dplyr's left_join to join two frames at a time:
library(dplyr)
dat_list[[1]] %>%
  left_join(dat_list[[2]], by = "level_1") %>%
  left_join(dat_list[[3]], by = c("level_1", "level_2"))
    name.x level_1      name.y level_2      name level_3
1    Books       1 Child books       1      Baby       1
2    Books       1 Child books       1 Education       2
3    Books       1 Child books       1   History       3
4    Books       1     Fantasy       2      <NA>    <NA>
5     DVDs       2        <NA>    <NA>      <NA>    <NA>
6 Computer       3        <NA>    <NA>      <NA>    <NA>
To avoid the lengthy chain of left_join calls across multiple data frames, here's a solution inspired by "How to join multiple data frames using dplyr?":
func <- function(df1, df2){
  # join on every level_* column already present in the left-hand frame
  col <- grep("level", names(df1), value = TRUE)
  left_join(df1, df2, by = col)
}
Reduce(func, dat_list)
Input data:
df1 <- data.frame(id = 1:3, name = c("Books", "DVDs", "Computer"))
df2 <- data.frame(id = 11:12, name = c("Child books", "Fantasy"))
df3 <- data.frame(id = 111:113, name=c("Baby", "Education", "History"))
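The joins above assume that every level has already been fetched into its own data frame. When the depth is only discovered while requesting, a recursive walk is one possible alternative. The following is a minimal sketch, not the answer's method: get_children() and the toy list are hypothetical stand-ins for the real request, and the top-level id 0 is an assumption.

```r
# Hypothetical stand-in for the real request: returns the children of an id,
# or a zero-row data frame when the id has no children.
toy <- list(
  "0"   = data.frame(id = c(1, 2, 3), name = c("Books", "DVDs", "Computer")),
  "1"   = data.frame(id = c(11, 12), name = c("Child books", "Fantasy")),
  "11"  = data.frame(id = c(111, 112, 113), name = c("Baby", "Education", "History")),
  "111" = data.frame(id = c(1111, 1112), name = c("Sound", "Touch"))
)
get_children <- function(id) {
  res <- toy[[as.character(id)]]
  if (is.null(res)) data.frame(id = numeric(0), name = character(0)) else res
}

# Depth-first walk: returns one character vector per root-to-leaf path,
# alternating id and name.
walk <- function(id, path = character(0)) {
  kids <- get_children(id)
  if (nrow(kids) == 0) return(list(path))
  out <- list()
  for (i in seq_len(nrow(kids))) {
    step <- c(path, as.character(kids$id[i]), as.character(kids$name[i]))
    out <- c(out, walk(kids$id[i], step))
  }
  out
}

paths <- walk(0)  # 0 is an assumed id for the top level

# Pad shorter paths with NA so that every row has the same number of columns.
width <- max(lengths(paths))
wide <- as.data.frame(
  do.call(rbind, lapply(paths, function(p) c(p, rep(NA, width - length(p))))),
  stringsAsFactors = FALSE
)
names(wide) <- paste0(c("id_", "name_"), rep(seq_len(width / 2), each = 2))
wide
```

Because each path is built on the way down, no row ever has to be repeated by hand; branches that end early are simply padded with NA.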

Related

Filter out multiple rows [duplicate]

I have a data.frame with character data in one of the columns.
I would like to filter multiple options in the data.frame from the same column. Is there an easy way to do this that I'm missing?
Example:
data.frame name = dat
days name
88 Lynn
11 Tom
2 Chris
5 Lisa
22 Kyla
1 Tom
222 Lynn
2 Lynn
I'd like to filter out Tom and Lynn for example.
When I do:
target <- c("Tom", "Lynn")
filt <- filter(dat, name == target)
I get this error:
longer object length is not a multiple of shorter object length
You need %in% instead of ==:
library(dplyr)
target <- c("Tom", "Lynn")
filter(dat, name %in% target) # equivalently, dat %>% filter(name %in% target)
Produces
days name
1 88 Lynn
2 11 Tom
3 1 Tom
4 222 Lynn
5 2 Lynn
To understand why, consider what happens here:
dat$name == target
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Basically, we're recycling the length-two target vector four times to match the length of dat$name. In other words, we are doing:
Lynn == Tom
Tom == Lynn
Chris == Tom
Lisa == Lynn
... continue repeating Tom and Lynn until end of data frame
In this case we don't get an error because the sample you provide has 8 rows, a multiple of the length of target, so recycling works; I suspect your actual data frame has a number of rows that doesn't allow clean recycling. If the sample had had an odd number of rows I would have gotten the same message as you. But even when recycling works, this is clearly not what you want. Basically, the statement dat$name == target is equivalent to saying:
return TRUE for every value in an odd position that is equal to "Tom" or every value in an even position that is equal to "Lynn".
It so happens that the last value in your sample data frame is in an even position and equal to "Lynn", hence the one TRUE above.
To contrast, dat$name %in% target says:
for each value in dat$name, check that it exists in target.
Very different. Here is the result:
[1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
Note your problem has nothing to do with dplyr, just the mis-use of ==.
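The recycling described above can be made explicit with rep_len(). This is a small sketch on the question's sample data; the names recycled, by_position and by_membership are only illustrative.

```r
dat <- data.frame(
  days = c(88, 11, 2, 5, 22, 1, 222, 2),
  name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn"),
  stringsAsFactors = FALSE
)
target <- c("Tom", "Lynn")

# What == actually compares against: target stretched to the length of name.
recycled <- rep_len(target, nrow(dat))
by_position <- dat$name == recycled

# The explicit version gives the same result as the implicit recycling.
identical(by_position, dat$name == target)

# What %in% does: a membership test for each element, independent of position.
by_membership <- dat$name %in% target

data.frame(name = dat$name, compared_with = recycled, by_position, by_membership)
```

Printing the columns side by side shows which value each row was compared with under ==, and why only the last row matched.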
This can be achieved using the dplyr package, which is available on CRAN. A simple way to achieve this:
1. Install the dplyr package.
2. Run the code below:
library(dplyr)
df <- select(filter(dat, name == 'Tom' | name == 'Lynn'), c('days', 'name'))
Explanation:
So, once we’ve downloaded dplyr, we create a new data frame by using two different functions from this package:
filter: the first argument is the data frame; the second argument is the condition by which we want it subsetted. The result is the entire data frame with only the rows we wanted.
select: the first argument is the data frame; the second argument is the names of the columns we want selected from it. We don’t have to use the names() function, and we don’t even have to use quotation marks. We simply list the column names as objects.
Using the base package:
df <- data.frame(days = c(88, 11, 2, 5, 22, 1, 222, 2), name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn"))
# Three lines
target <- c("Tom", "Lynn")
index <- df$name %in% target
df[index, ]
# One line
df[df$name %in% c("Tom", "Lynn"), ]
Output:
days name
1 88 Lynn
2 11 Tom
6 1 Tom
7 222 Lynn
8 2 Lynn
Using sqldf:
library(sqldf)
# Two alternatives:
sqldf('SELECT *
FROM df
WHERE name = "Tom" OR name = "Lynn"')
sqldf('SELECT *
FROM df
WHERE name IN ("Tom", "Lynn")')
by_type_year_tag_filtered <- by_type_year_tag %>%
  dplyr::filter(tag_name %in% c("dplyr", "ggplot2"))
Write it like this. Example:
library(dplyr)
target <- YourData %>% filter(YourColumn %in% c("variable1", "variable2"))
Example with your data:
target <- df %>% filter(name %in% c("Tom", "Lynn"))
If the values in your string column are long strings,
you can use this method with the stringr package.
It finds words inside longer strings, which an exact match with filter(%in%) or base R's %in% does not do.
library(dplyr)
library(stringr)
sentences_tb = as_tibble(sentences) %>%
mutate(row_number())
sentences_tb
# A tibble: 720 x 2
value `row_number()`
<chr> <int>
1 The birch canoe slid on the smooth planks. 1
2 Glue the sheet to the dark blue background. 2
3 Its easy to tell the depth of a well. 3
4 These days a chicken leg is a rare dish. 4
5 Rice is often served in round bowls. 5
6 The juice of lemons makes fine punch. 6
7 The box was thrown beside the parked truck. 7
8 The hogs were fed chopped corn and garbage. 8
9 Four hours of steady work faced us. 9
10 Large size in stockings is hard to sell. 10
# ... with 710 more rows
matching_letters <- c(
"canoe","dark","often","juice","hogs","hours","size"
)
matching_letters <- str_c(matching_letters, collapse = "|")
matching_letters
[1] "canoe|dark|often|juice|hogs|hours|size"
letters_found <- str_subset(sentences_tb$value,matching_letters)
letters_found_tb = as_tibble(letters_found)
inner_join(sentences_tb,letters_found_tb)
# A tibble: 16 x 2
value `row_number()`
<chr> <int>
1 The birch canoe slid on the smooth planks. 1
2 Glue the sheet to the dark blue background. 2
3 Rice is often served in round bowls. 5
4 The juice of lemons makes fine punch. 6
5 The hogs were fed chopped corn and garbage. 8
6 Four hours of steady work faced us. 9
7 Large size in stockings is hard to sell. 10
8 Note closely the size of the gas tank. 33
9 The bark of the pine tree was shiny and dark. 111
10 Both brothers wear the same size. 253
11 The dark pot hung in the front closet. 261
12 Grape juice and water mix well. 383
13 The wall phone rang loud and often. 454
14 The bright lanterns were gay on the dark lawn. 476
15 The pleasant hours fly by much too soon. 516
16 A six comes up more often than a ten. 609
It's a bit verbose, but it's very handy if you have long strings and want to find the rows in which a specific word occurs.
Comparing with the accepted answers:
> target <- c("canoe","dark","often","juice","hogs","hours","size")
> filter(sentences_tb, value %in% target)
# A tibble: 0 x 2
# ... with 2 variables: value <chr>, row_number() <int>
> df<- select(filter(sentences_tb,value=='canoe'| value=='dark'), c('value','row_number()'))
> df
# A tibble: 0 x 2
# ... with 2 variables: value <chr>, row_number() <int>
> target <- c("canoe","dark","often","juice","hogs","hours","size")
> index <- sentences_tb$value %in% target
> sentences_tb[index, ]
# A tibble: 0 x 2
# ... with 2 variables: value <chr>, row_number() <int>
With those approaches you would need to supply the complete sentences to get the desired result.
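For comparison, the same substring search can be sketched in base R with grepl() and an alternation pattern. The short txt vector below is an assumption that stands in for the full sentences data, and the whole-word variant is an optional extra rather than part of the answer above.

```r
# Toy sentences (an assumption; the answer above uses the full sentences data).
txt <- c(
  "The birch canoe slid on the smooth planks.",
  "Its easy to tell the depth of a well.",
  "The juice of lemons makes fine punch.",
  "Large size in stockings is hard to sell."
)
words <- c("canoe", "juice", "size")

# %in% compares whole strings, so no sentence matches a bare word.
exact <- txt %in% words

# Collapse the words into one alternation pattern and search inside each string.
pattern <- paste(words, collapse = "|")
hits <- grepl(pattern, txt)
txt[hits]

# Optional: \\b anchors restrict matches to whole words, so "size" does not
# match inside "Resize".
whole_word <- paste0("\\b(", pattern, ")\\b")
resize_hit <- grepl(whole_word, "Resize the image", perl = TRUE)
```

The logical vector hits can be used directly as a row index, for example some_df[hits, ], in the same way as the %in% index in the base-R answer.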

R drop rows with specific id [duplicate]

Subset rows using two column values [duplicate]

How do I infill non-adjacent rows with sample data from previous rows in R?

I have data containing a unique identifier, a category, and a description.
Below is a toy dataset.
prjnumber <- c(1,2,3,4,5,6,7,8,9,10)
category <- c("based","trill","lit","cold",NA,"epic", NA,NA,NA,NA)
description <- c("skip class",
"dunk on brayden",
"record deal",
"fame and fortune",
NA,
"female attention",
NA,NA,NA,NA)
toy.df <- data.frame(prjnumber, category, description)
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 <NA> <NA>
6 6 epic female attention
7 7 <NA> <NA>
8 8 <NA> <NA>
9 9 <NA> <NA>
10 10 <NA> <NA>
I want to randomly sample the 'category' and 'description' columns from rows that have been filled in to use as infill for rows with missing data.
The final data frame would be complete and would only rely on the initial 5 rows which contain data. The solution would preserve between-column correlation.
An expected output would be:
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 lit record deal
6 6 epic female attention
7 7 based skip class
8 8 based skip class
9 9 lit record deal
10 10 trill dunk on brayden
complete = na.omit(toy.df)
toy.df[is.na(toy.df$category), c("category", "description")] =
complete[sample(1:nrow(complete), size = sum(is.na(toy.df$category)), replace = TRUE),
c("category", "description")]
toy.df
# prjnumber category description
# 1 1 based skip class
# 2 2 trill dunk on brayden
# 3 3 lit record deal
# 4 4 cold fame and fortune
# 5 5 lit record deal
# 6 6 epic female attention
# 7 7 cold fame and fortune
# 8 8 based skip class
# 9 9 epic female attention
# 10 10 epic female attention
Though it would seem a little more straightforward if you didn't start with the unique identifiers filled out for the NA rows...
You could try
library(dplyr)
toy.df %>%
mutate_each(funs(replace(., is.na(.), sample(.[!is.na(.)]))), 2:3)
Based on new information, we may need a numeric index to use in the funs.
toy.df %>%
mutate(indx= replace(row_number(), is.na(category),
sample(row_number()[!is.na(category)], replace=TRUE))) %>%
mutate_each(funs(.[indx]), 2:3) %>%
select(-indx)
Using base R, to fill in a single field at a time (not preserving the correlation between the fields), use something like:
fields <- c('category','description')
for(field in fields){
  missings <- is.na(toy.df[[field]])
  toy.df[[field]][missings] <- sample(toy.df[[field]][!missings], sum(missings), TRUE)
}
and to fill them in simultaneously (preserving the correlation between the fields) use something like:
missings <- apply(toy.df[,fields],
1,
function(x)any(is.na(x)))
toy.df[missings,fields] <- toy.df[!missings,fields][sample(sum(!missings),
sum(missings),
T),]
and of course, to avoid the implicit for loop in apply(x, 1, fun), you could use:
rowAny <- function(x) rowSums(x) > 0
missings <- rowAny(is.na(toy.df[, fields]))
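The row-wise variant, which keeps category and description together, can be wrapped in a small helper. This is a sketch under the assumption that at least one complete row exists; fill_rows is a hypothetical name, and set.seed() is there only to make the example reproducible.

```r
fill_rows <- function(df, fields) {
  # Rows with at least one NA in the chosen fields are treated as missing.
  missing <- rowSums(is.na(df[, fields, drop = FALSE])) > 0
  donors <- which(!missing)
  # Index into donors explicitly: sample(x, ...) on a length-one x would
  # sample from 1:x instead of returning x.
  picks <- donors[sample.int(length(donors), sum(missing), replace = TRUE)]
  # Copy whole donor rows so that the fields stay paired.
  df[missing, fields] <- df[picks, fields]
  df
}

toy.df <- data.frame(
  prjnumber = 1:10,
  category = c("based", "trill", "lit", "cold", NA, "epic", NA, NA, NA, NA),
  description = c("skip class", "dunk on brayden", "record deal", "fame and fortune",
                  NA, "female attention", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

set.seed(1)  # only so that the sketch is reproducible
filled <- fill_rows(toy.df, c("category", "description"))
filled
```

Sampling row indices rather than values is what preserves the between-column correlation: each filled row is a copy of one complete donor row.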

Filter multiple values on a string column in dplyr
