I have a tibble with 20 variables. So far I've been using this pipe to find out which values appear more than once in a single column
as_tibble(iris) %>% group_by(Petal.Length) %>% summarise(n=sum(n())) %>% filter(n>1)
I was wondering if I could write a line that loops this over all the columns and returns 20 different tibbles (or as many as I need in the future), in the same way the pipe above returns one tibble. I have tried writing my own loops but have had no success; I am quite new.
The iris example dataset has 5 columns so feel free to give an answer with 5 columns.
Thank you!
library(dplyr)
col_names <- colnames(iris)
lapply(
  col_names,
  function(col) {
    iris %>%
      group_by_at(col) %>%
      summarise(n = n()) %>%
      filter(n > 1)
  }
)
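group_by_at() has since been superseded in dplyr. A minimal sketch of the same idea, assuming dplyr 1.0 or later, groups with across(all_of(col)) and names the result list by column so each tibble can be pulled out by name. The object name dup_counts is only for illustration.

```r
library(dplyr)

col_names <- colnames(iris)

# Naming the input vector makes lapply() return a list named by column
dup_counts <- lapply(
  setNames(col_names, col_names),
  function(col) {
    iris %>%
      group_by(across(all_of(col))) %>%  # group by the column whose name is stored in `col`
      summarise(n = n()) %>%             # count rows per distinct value
      filter(n > 1)                      # keep values that appear more than once
  }
)

dup_counts$Species
```

Each element is a separate tibble, so dup_counts[["Petal.Length"]] should match what the single-column pipe in the question returns.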
In base R 4.1+ we have this one-liner. For each column it applies table and then keeps only those elements whose count exceeds 1. Finally it converts what remains of the table to a data frame. Omit stack if it is ok to return a list of table objects instead of a list of data frames.
lapply(iris, \(x) stack(Filter(function(x) x > 1, table(x))))
A variation of that is to keep only duplicated items and then add 1 giving slightly fewer keystrokes. Again we can omit stack if returning a list of table objects is ok.
lapply(iris, \(x) stack(table(x[duplicated(x)]) + 1))
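To see what these one-liners do to a single column, here is a sketch that runs the same steps one at a time on a small made-up vector. The names counts, kept, res1 and res2 are only for illustration.

```r
x <- c(1, 2, 2, 3, 3, 3)

counts <- table(x)                          # how often each value occurs
kept <- Filter(function(n) n > 1, counts)   # drop values that occur only once
res1 <- stack(kept)                         # data frame with columns `values` and `ind`

# Variation: count only the repeats, then add 1 for the first occurrence
res2 <- stack(table(x[duplicated(x)]) + 1)

res1
res2
```

In both results the ind column holds the repeated value and the values column holds how many times it occurs.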
How do I set list names? The code is below.
Currently, split_data contains two sub-lists, [[1]] and [[2]]. How do I set a name for each of them?
I want to name [[1]] 'A' and [[2]] 'B', so that I can retrieve data with split_data['A'].
Can anyone help with this? Thanks.
For instance, with ma <- list(a=c('a1','a2'),b=c('b1','b2')) I can use ma["a"] to get a sub-list.
library(tidyverse)
test_data <- data.frame(category=c('A','B','A','B','A','B','A','B'),
sales=c(1,2,4,5,8,1,4,6))
split_data <- test_data %>% group_split(category)
Others have shown you in the comments how to get what you want using split() instead of group_split(). That seems like the easiest solution.
However, if you're stuck with the existing code, here's an alternative that keeps your current code, and adds the names.
library(tidyverse)
test_data <- data.frame(category=c('A','B','A','B','A','B','A','B'),
sales=c(1,2,4,5,8,1,4,6))
split_data <- test_data %>% group_split(category)
names(split_data) <- test_data %>% group_by(category) %>% group_keys() %>% apply(1, paste, collapse = ".")
The idea is to use group_by to split in the same way group_split does, then extract the keys as a tibble. This will have one row per group, but will have the different variables in separate columns, so I put them together by pasting the columns with a dot as separator. The last expression in the pipe is equivalent to apply(keys, 1, f), where f is function(row) paste(row, collapse = "."). It applies f to each row of the tibble, producing a single name.
This should work even if the split happens on multiple variables, and produces names similar to those produced by split().
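For comparison, the split() route mentioned at the top of this answer can be sketched as follows. It uses only base R and the test_data from the question; split() names the list elements after the values of the splitting variable.

```r
test_data <- data.frame(category = c('A','B','A','B','A','B','A','B'),
                        sales = c(1,2,4,5,8,1,4,6))

# split() returns a list named by the levels of the splitting variable
split_data <- split(test_data, test_data$category)

names(split_data)        # the group names
split_data[["A"]]        # the rows whose category is 'A'
```

Note that split_data['A'] returns a one-element list, while split_data[["A"]] returns the data frame itself.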
I need to delete the last row of every group after applying group_by.
I have tried something like this, but it does not work.
data=data %>%
group_by(isin) %>%
summarise(data=data[-length(isin),])
Thanks for your help!
We use the built in iris data set as an example. It has three groups of 50 rows each defined by the Species column. Next time please provide sample data in the question. See the top of the r tag page for info.
1) group_modify We can use group_modify from dplyr.
library(dplyr)
iris %>%
group_by(Species) %>%
group_modify(~ head(., -1)) %>%
ungroup
2) slice Another dplyr solution is to use slice
library(dplyr)
iris %>%
group_by(Species) %>%
slice(-n()) %>%
ungroup
3) by A base solution is to use by. It produces a list of data frames which we rbind back together.
do.call("rbind", by(iris, iris$Species, head, -1))
4) subset/ave Another base solution is to create a vector of numbers which count down to 1 for each group and then only keep those rows corresponding to a number greater than 1.
subset(iris, ave(1:nrow(iris), Species, FUN = function(x) length(x):1) > 1)
4a) or keep all rows except the one having the maximum row number in each group:
n <- nrow(iris)
subset(iris, ave(1:n, Species, FUN = max) != 1:n)
5) duplicated Yet another base solution uses duplicated. It only keeps rows whose Species column is duplicated counting back from the end.
subset(iris, duplicated(Species, fromLast = TRUE))
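The duplicated approach is the least obvious of these, so here is a small sketch of why it works, using a made-up grouping vector g. With fromLast = TRUE, an element is flagged as duplicated only if the same value occurs again later, so the last occurrence in each group is the one left out.

```r
g <- c("a", "a", "b", "b", "b")

# TRUE where the same value appears again further down
keep <- duplicated(g, fromLast = TRUE)
keep

# Applied to iris: every row except the last one of each Species
trimmed <- subset(iris, duplicated(Species, fromLast = TRUE))
table(trimmed$Species)
```

A group with a single row has no earlier occurrence to keep, so it disappears entirely, which is consistent with deleting its last row.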
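Try using the base function by
new_data <- do.call(rbind, by(data, data[, 'isin'], function(x) x[-nrow(x), ]))
by() returns the groups as a list, and do.call(rbind, ...) converts the list to a data.frame. Here nrow(x) is the number of rows in each group; length(x) on a data frame would give the number of columns.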
I have created a list of dataframes with split like so:
dataframes_list <- split(df, f = df$variable3)
Each dataframe (131 in total) is in long format and has the same variables and structure. I want to apply pivot_wider to all of them at once.
I have been struggling with some functions of the apply family, but could not get it done:
First I reduced the number of variables within each dataframe selecting only those that should be used for pivoting
dataframes_list_2 <- lapply(dataframes_list, function (x) select(x, variable1, variable2))
Then I tried pivot_wider
dataframes_list_3 <- lapply(dataframes_list_2, function(x) pivot_wider(x, names_from = variable1, values_from = variable2))
What I obtain in this way is the list with dataframes that contain 1 observation per variable, each of them being a vector of (in this case) 12 values. What I want instead is this:
Because there was a warning telling me that my observations were not uniquely identified, I varied the code above including such variable. But what I got was this:
Can someone give me some answer to this issue?
Thank you
Each dataframe in the list has this aspect:
I had the same problem and I solved it this way:
df_list <- lapply(1:length(my_list),
function(x) (pivot_wider(my_list[[x]], names_from = names, values_from = values)))
bind_rows(df_list)
You will get what you needed! Hope it helps!
You could try:
map(my_list, ~ (pivot_wider(.x, names_from=1,values_from= 2)))
The numbers 1 and 2 are the column positions in my tibbles. You can also use map_dfr. To combine the data sets you can use unnest or bind_rows.
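The "not uniquely identified" warning in the question usually appears when several rows share the same names_from value and nothing tells them apart, so pivot_wider() packs them into list-columns. One common workaround, sketched here on made-up data, is to number the rows within each value first. The list my_list, the columns names and values, and the helper column row are all hypothetical.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Two small long-format tibbles standing in for the real list
my_list <- list(
  g1 = tibble(names = c("a", "a", "b", "b"), values = 1:4),
  g2 = tibble(names = c("a", "a", "b", "b"), values = 5:8)
)

wide_list <- map(my_list, function(df) {
  df %>%
    group_by(names) %>%
    mutate(row = row_number()) %>%   # running index within each `names` value
    ungroup() %>%
    pivot_wider(names_from = names, values_from = values)
})

wide_list$g1
```

The row column acts as the identifier, so each wide tibble has one row per observation instead of one row of list-columns. bind_rows(wide_list, .id = "source") would stack them if a single data frame is wanted.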
I have a dataframe with 500 observations for each of 3106 US counties. I would like to merge that dataframe with a SpatialPolygonsDataFrame.
I have tried a few approaches. I have found that if I filter the data by a variable iter_id I can use sp::merge() to merge the datasets. I presume that I can then rbind them back together. sp::merge() does not allow a right or full join and the spatial data needs to be in the left position. So a many to one will not work. The really nasty way I have tried is:
(I am not sure how to represent the dataframe with the variables of interest here)
library(choroplethr)
data(continental_us_states)
us <- tigris::counties(continental_us_states)
gm_y_corr <- tribble(~GEOID,~iter_id,~neat_variable,
01001,1,"value_1",
01003,1,"value_2",
...
01001,2,"value_3",
01003,2,"value_4",
...
01001,500,"value_5",
01003,500,"value_6")
filtered <- gm_y_corr %>%
filter(iter_id ==1)
us.gm <- sp::merge(us, filtered ,by='GEOID')
for (j in 2:500) {
tmp2 <- gm_y_corr %>%
filter(iter_id == j)
tmp3 <- sp::merge(us, tmp2,by='GEOID')
us.gm <- rbind(us.gm,tmp3)
}
I know there must be a better way. I have tried group_by, but multiple matches are found, so I must not be understanding group_by.
> geo_dat <- gm_y_corr %>%
+ group_by(iter_id)%>%
+ sp::merge(us, .,by='GEOID')
Error in .local(x, y, ...) : non-unique matches detected
I would like to merge the spatial data with the interesting data.
Here you can use the splitting functionality of base R in split or the more recent dplyr::group_split. This will separate your data frame according to your splitting variable and you can lapply or purrr::map a function such as merge to it and then dplyr::bind_rows to collapse the returned list back to a dataframe. Since I can't manage to get the us data I have just written what I suspect would work.
gm_y_corr %>%
group_by(iter_id) %>% # group
group_split() %>% # split
lapply(., function(x){ # apply function(x) merge(us, x, by = "GEOID") to each list element
merge(us, x, by = "GEOID")
}) %>%
bind_rows() # collapse to data frame
Equivalently, the same thing can be done with base R's split(). The newer group_by %>% group_split is a little more intuitive in my opinion.
gm_y_corr %>%
split(.$iter_id) %>%
lapply(., function(x){
merge(us, x, by = "GEOID")
}) %>%
bind_rows()
If you wanted to use just group_by, you would have to follow up with the dplyr::do function, which I believe does something similar to the above without you having to split the data yourself.
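Since the us object is not available here, the following sketch shows only the split, merge, and recombine pattern on plain data frames. The lookup table stands in for the spatial object, and its county column and the values in obs are made up. Whether recombining real spatial objects behaves the same way is an assumption this sketch does not test.

```r
# Stand-in for the spatial data: one row per GEOID
lookup <- data.frame(GEOID = c("01001", "01003"),
                     county = c("Autauga", "Baldwin"))

# Stand-in for gm_y_corr: several iterations per GEOID
obs <- data.frame(GEOID = rep(c("01001", "01003"), times = 2),
                  iter_id = rep(1:2, each = 2),
                  value = c(10, 20, 30, 40))

# Split by iteration so each piece has at most one row per GEOID,
# merge each piece, then stack the pieces again
pieces <- lapply(split(obs, obs$iter_id),
                 function(x) merge(lookup, x, by = "GEOID"))
merged <- do.call(rbind, pieces)

merged
```

GEOID is kept as character here so that leading zeros survive the merge.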
cols <- data %>% names()
data %>% dplyr::filter_(is.na(cols[1]))
returns zero rows although it should return some. Alternatively, calling
data %>% dplyr::filter(is.na("colName"))
outputs rows.
So dynamic filtering is not working. Any idea what the alternative is?
dplyr::filter(data, is.na(data[[cols[1]]]))
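Another way to filter on a column whose name is held in a string is the .data pronoun, which dplyr provides inside its verbs. The sketch below uses a made-up tibble, and the name na_rows is only for illustration. It also shows why is.na(cols[1]) matches nothing: it tests the column name itself, which is a non-missing string.

```r
library(dplyr)

data <- tibble(x = c(1, NA, 3), y = c("a", "b", NA))
cols <- names(data)

# This tests the string "x", not the column, so it is FALSE
is.na(cols[1])

# .data[[...]] looks the column up by name inside filter()
na_rows <- data %>% filter(is.na(.data[[cols[1]]]))

na_rows
```

The same pattern works inside a function or a loop over cols, because the column name is an ordinary character value.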