Can I combine 2 rows in R

A section of my data frame looks like this:
Streets <- c("Muscow tweede","Muscow NDSM", "kazan Bo", "Kazan Ca")
Hotels<- c(5,9,4,3)
Is there a way to merge Muscow tweede and Muscow NDSM, as well as the two Kazan streets, so that I get the total number of hotels per city rather than per street?

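With dplyr:
library(dplyr)
# build the data frame from the question's two vectors
df <- data.frame(Streets, Hotels)
# group on the lower-cased first word of each street name
df %>% group_by(col = tolower(sub(' .*', '', Streets))) %>%
summarize(Hotels = sum(Hotels))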
Output:
col Hotels
<chr> <dbl>
1 kazan 7
2 muscow 14

Another way:
library(dplyr)
library(stringr)
tibble(Streets, Hotels) %>%
mutate(Streets = str_to_title(str_extract(Streets, '\\w+'))) %>%
group_by(Streets) %>% summarise(Hotels = sum(Hotels))
# A tibble: 2 x 2
Streets Hotels
<chr> <dbl>
1 Kazan 7
2 Muscow 14

Another way, with tapply:
with(df, tapply(Hotels, tools::toTitleCase(sub('\\s.*', '', Streets)), sum))
# Kazan Muscow
# 7 14

Using stringr::word with aggregate.
Sample data
df1 <- data.frame(
Streets = c("Muscow tweede","Muscow NDSM", "kazan Bo", "Kazan Ca"),
Hotels = c(5,9,4,3))
# take the first word of each street from the data frame column, in title case
df1$City <- stringr::str_to_title(stringr::word(df1$Streets, end = 1))
aggregate(Hotels ~ City, data = df1, sum)
City Hotels
1 Kazan 7
2 Muscow 14

We can use rowsum from base R
rowsum(Hotels, tools::toTitleCase(trimws(Streets, whitespace = "\\s+.*")))
[,1]
Kazan 7
Muscow 14
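All of the answers above share the same two steps: derive a city key from the first word of each street name, normalised for case, then sum Hotels per key. A minimal base R sketch of just those steps follows. The variable names city and totals are illustrative, and it assumes every city name is a single word (a two-word city name would be cut at its first space).

```r
Streets <- c("Muscow tweede", "Muscow NDSM", "kazan Bo", "Kazan Ca")
Hotels  <- c(5, 9, 4, 3)

# Step 1: keep the first word only and lower-case it, so that
# "kazan Bo" and "Kazan Ca" end up with the same key.
city <- tolower(sub("\\s.*$", "", Streets))

# Step 2: sum the hotel counts per key.
totals <- tapply(Hotels, city, sum)
totals
```

Inspecting city before aggregating is a quick way to spot keys that did not collapse as intended.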

Related

replace a value in one column with a value from a second column on condition of a value from a third column from different rows

I have a data frame:
df1 <- data.frame(Object = c("Klaus","Klaus","Peter","Peter","Daniel","Daniel"),
PointA = as.numeric(c("7",NA,"17",NA,NA,NA)),
PointB = as.numeric(c("18","22",NA,NA,"17",NA)),
measure = c("1","2","1","2","1","2")
)
And I want this:
df2 <- data.frame(Object = c("Klaus","Klaus","Peter","Peter","Daniel","Daniel"),
PointA = as.numeric(c("7","18","17",NA,NA,"17")),
PointB = as.numeric(c("18","22",NA,NA,"17",NA)),
measure = c("1","2","1","2","1","2")
)
That is, if an Object has no value for PointA where measure == 2, I want it replaced with PointB from measure == 1 of the same Object.
First thing that comes to mind is:
library(dplyr)
df$PointA <- coalesce(df$PointA, df$PointB)
But as far as I know there is no way to make this conditional.
Then I thought maybe something like:
df$PointA[is.na(df$PointA)] <- df$PointB
But this does not differentiate for the measure.
So I thought about:
df$PointA <- ifelse(df$measure == 2 & is.na(df$PointA), df$PointB, df$PointA)
But that does not take into account that I need the corresponding value from measure == 1.
Now I am at a loss and out of ideas on how to approach this. Help?
Edit: I got two very good solutions already, but both rely on the row order of the data frame. Evidently my example was too simple. I am looking for something that also works under the following condition:
df1 <- df1[sample(nrow(df1)), ]
One possible option is using row_number() from dplyr. If the data frame is not already sorted, insert an arrange statement first.
library(dplyr)
df1 %>%
arrange(Object, measure) %>%
group_by(Object) %>%
# PointB[row_number()-1] has one element fewer than the group, so this form
# only works for two-row groups; lag(PointB) is the general equivalent
mutate(PointA = if_else(measure == 2 & is.na(PointA), PointB[row_number()-1], PointA))
# A tibble: 6 x 4
# Groups: Object [3]
# Object PointA PointB measure
# <chr> <dbl> <dbl> <chr>
# 1 Daniel NA 17 1
# 2 Daniel 17 NA 2
# 3 Klaus 7 18 1
# 4 Klaus 18 22 2
# 5 Peter 17 NA 1
# 6 Peter NA NA 2
You could use coalesce + lag as shown below:
library(tidyverse)
df1 %>%
arrange(Object, measure) %>%
group_by(Object) %>%
mutate(PointA = coalesce(PointA, lag(PointB)))
# A tibble: 6 x 4
# Groups: Object [3]
Object PointA PointB measure
<chr> <dbl> <dbl> <chr>
1 Daniel NA 17 1
2 Daniel 17 NA 2
3 Klaus 7 18 1
4 Klaus 18 22 2
5 Peter 17 NA 1
6 Peter NA NA 2
This could be condensed, but it should be relatively clear and doesn't rely on the row order at all. Beware if you have multiple rows for the same Object/Measure pair - the self-join will have multiple matches and you'll end up with a lot more rows than you started with.
library(dplyr)
df_fill = df1 %>%
filter(measure == 1) %>%
select(Object, fill_in = PointB) %>%
mutate(needs_fill = 1L)
result = df1 %>%
mutate(needs_fill = if_else(measure == 2 & is.na(PointA), 1L, NA_integer_)) %>%
left_join(df_fill) %>%
mutate(PointA = coalesce(PointA, fill_in)) %>%
select(-fill_in, -needs_fill)
result
# Object PointA PointB measure
# 1 Klaus 7 18 1
# 2 Klaus 18 22 2
# 3 Peter 17 NA 1
# 4 Peter NA NA 2
# 5 Daniel NA 17 1
# 6 Daniel 17 NA 2
Same as above but without saving the intermediate object:
result = df1 %>%
mutate(needs_fill = if_else(measure == 2 & is.na(PointA), 1L, NA_integer_)) %>%
left_join(
df1 %>%
filter(measure == 1) %>%
select(Object, fill_in = PointB) %>%
mutate(needs_fill = 1L)
) %>%
mutate(PointA = coalesce(PointA, fill_in)) %>%
select(-fill_in, -needs_fill)
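For the shuffled case raised in the question's edit, a base R alternative to the join is a lookup with match(). This is a minimal sketch rather than one of the posted answers. The helper name fill_point_a is hypothetical, and it assumes measure is stored as the character values "1" and "2", as in the example data.

```r
df1 <- data.frame(
  Object  = c("Klaus", "Klaus", "Peter", "Peter", "Daniel", "Daniel"),
  PointA  = c(7, NA, 17, NA, NA, NA),
  PointB  = c(18, 22, NA, NA, 17, NA),
  measure = c("1", "2", "1", "2", "1", "2"),
  stringsAsFactors = FALSE
)

# Shuffle the rows to show that the result does not depend on row order.
set.seed(1)
shuffled <- df1[sample(nrow(df1)), ]

fill_point_a <- function(d) {
  first <- d[d$measure == "1", ]
  # match() returns only the first hit per Object, so duplicated
  # Object/measure pairs cannot multiply rows the way a join can.
  donor <- first$PointB[match(d$Object, first$Object)]
  needs_fill <- d$measure == "2" & is.na(d$PointA)
  d$PointA[needs_fill] <- donor[needs_fill]
  d
}

result <- fill_point_a(shuffled)
result[order(result$Object, result$measure), ]
```

The trade-off is that if an Object has several measure == 1 rows, only the first one found is used as the donor, silently.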

Can You Iterate Through Columns AND Unique Variables of Each Column to create a summary in R?

Considering the example dataframe below, is it possible to iterate over each column, and the unique variable in each column to obtain a summary of the unique variables for each column?
sex <- c("M","F","M","M","F","F","F","M","M","F")
school <- c("north","north","central","south","south","south","central","north","north","south")
days_missed <- c(5,1,2,0,7,1,3,2,4,15)
df <- data.frame(sex, school, days_missed, stringsAsFactors = F)
In this example, I want to be able to create a summary of missed days by sex and school
My expected output would be one data frame for sex and one for school, similar to the output below:
sex missed_days
M 13
F 27
school missed_days
north 12
central 5
south 23
I tried (without success):
for(i in seq_along(select(df,1:2)) {
output[[i]] <- sum(df$days_missed[[i]] )
}
Is there a way to accomplish what I am looking to do?
In base R you could do:
lapply(1:2,function(x)xtabs(days_missed~.,df[c(x,3)]))
[[1]]
sex
F M
27 13
[[2]]
school
central north south
5 12 23
using tidyverse:
library(tidyverse)
map(df[-3],~xtabs(days_missed~.x,df))
$sex
.x
F M
27 13
$school
.x
central north south
5 12 23
if you must use summarize then:
df %>%
summarise_at(vars(-days_missed), ~list(xtabs(days_missed~.x))) %>%
{t(.)[,1]}
$sex
.x
F M
27 13
$school
.x
central north south
5 12 23
In base R, you can use lapply along with tapply to get sum of days_missed by group.
lapply(df[-ncol(df)], function(x) tapply(df$days_missed, x, sum))
Or using tidyverse :
library(dplyr)
cols <- c('sex', 'school')
purrr::map(cols, ~df %>% group_by_at(.x) %>% summarise(sum = sum(days_missed)))
#[[1]]
# A tibble: 2 x 2
# sex sum
# <chr> <dbl>
#1 F 27
#2 M 13
#[[2]]
# A tibble: 3 x 2
# school sum
# <chr> <dbl>
#1 central 5
#2 north 12
#3 south 23
This returns a list of dataframes.
Here is a tidyverse approach
library(tidyverse)
sex <- c("M","F","M","M","F","F","F","M","M","F")
school <- c("north","north","central","south","south","south","central","north","north","south")
days_missed <- c(5,1,2,0,7,1,3,2,4,15)
df <- data.frame(sex, school, days_missed, stringsAsFactors = F)
df %>%
group_by(sex) %>%
summarise(missed_day = sum(days_missed))
df %>%
group_by(school) %>%
summarise(missed_day = sum(days_missed))
If you want to map all other features
simple_operation <- function(x,group) {
x %>%
group_by_at({{group}}) %>%
summarise(missed_day = sum(days_missed))
}
variable_names <-
df %>%
select(-days_missed) %>%
names()
map(.x = variable_names,.f = ~ simple_operation(x = df,group = .))
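To get output shaped like the question's expected result (one data frame per grouping column, with a missed_days column), a base R sketch using lapply with aggregate could look like the following. The names group_cols and summaries are illustrative, and note that groups come back in sorted order rather than in the order shown in the question.

```r
sex <- c("M","F","M","M","F","F","F","M","M","F")
school <- c("north","north","central","south","south","south","central","north","north","south")
days_missed <- c(5,1,2,0,7,1,3,2,4,15)
df <- data.frame(sex, school, days_missed, stringsAsFactors = FALSE)

# Every column except the measured one is treated as a grouping column.
group_cols <- setdiff(names(df), "days_missed")

# One summary data frame per grouping column, named after that column.
summaries <- lapply(setNames(group_cols, group_cols), function(col) {
  out <- aggregate(df$days_missed, by = df[col], FUN = sum)
  names(out)[2] <- "missed_days"  # aggregate() names the value column "x"
  out
})

summaries$sex
summaries$school
```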

Colsum new dataframe

With this command it is possible to have a dataframe with the sum of every column
df <- data.frame(id = c(1,2,3), stock = c(3,1,4), bill = c(1,0,1), bear = c(3,2,5))
dfsum <- data.frame(colSums(df[-1]))
However this dataframe has only one column.
How can I produce a data frame with two columns, one with the column names and the other with the sums?
You can do:
stack(colSums(df[-1]))
values ind
1 8 stock
2 2 bill
3 10 bear
Or using tibble:
tibble::enframe(colSums(df[-1]))
name value
<chr> <dbl>
1 stock 8
2 bill 2
3 bear 10
We can use tidyverse approaches with summarise_at and pivot_longer
library(dplyr)
library(tidyr)
df %>%
summarise_at(vars(-id), sum) %>%
pivot_longer(everything())
# name value
#1 stock 8
#2 bill 2
#3 bear 10
You can simply try apply.
apply(df[-1], 2, sum)
Result
stock bill bear
8 2 10
For a data.frame
df2 <- data.frame(freq = apply(df[-1], 2, sum))
df2$var <- rownames(df2)
# put the name column first when printing
df2[c("var", "freq")]
Result
var freq
stock stock 8
bill bill 2
bear bear 10
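If plain character names and no row names are wanted, the two columns can also be built directly from the named vector that colSums returns. A small sketch, with sums and the column names var and freq chosen only for illustration:

```r
df <- data.frame(id = c(1,2,3), stock = c(3,1,4), bill = c(1,0,1), bear = c(3,2,5))

# colSums() returns a named numeric vector; split it into names and values.
sums <- colSums(df[-1])
dfsum <- data.frame(var = names(sums), freq = unname(sums),
                    stringsAsFactors = FALSE)
dfsum
```

Unlike stack(), which stores the names as a factor in ind, this keeps var as a character column.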

Grouping data starting with specific number in R

I am sorry if the title is incomprehensible. I have data as shown below; the prefixes 1, 2, 3, ... in the row names are months of various years. I want to gather the months separately for a and l.
a l
1-2006 3.498939 0.8523857
1-2007 14.801777 0.2457656
1-2008 6.893728 0.5381691
2-2006 2.090962 0.6764694
2-2007 9.192913 0.8740950
2-2016 5.059505 1.1761113
Structure of data is;
data<-structure(list(a = c(3.49893890760882, 14.8017770056402, 6.89372828391484,
2.0909624091048, 9.19291324208917, 5.05950526612261, 13.1570625271881,
14.9570662205959, 7.72453112976811, 12.9331892673657
), l = c(0.852385662732809,
0.245765570168399, 0.538169092055646, 0.676469362818052, 0.874095005203713,
1.17611132212132, 0.76857056091243, 0.622533767341579, 0.9562200838363,
1.10064589903771, 0.85863722854391
)), class = "data.frame", row.names = c("1-2006",
"1-2007", "1-2008",
"2-2006", "2-2007",
"2-2016",
"3-2015", "3-2016", "3-2017", "3-2018"
))
For example; I want to gather all january (1-2005, 1-2006..) and march data(3-2012, 3-2015..) data for a and also for l. Like this one:
january_a
1-2006 3.498939
1-2007 14.801777
1-2008 6.893728
january_l
1-2006 0.8523857
1-2007 0.2457656
1-2008 0.5381691
march_a
3-2012 9.192913
3-2015 5.059505
march_l
3-2012 0.8740950
3-2015 1.1761113
You could add a column which contains only the numerical prefix, and then split on that:
data$prefix <- sub("^(\\d+).*$", "\\1", row.names(data))
data_a <- split(data[,"a"], data$prefix)
data_a
$`1`
[1] 3.498939 14.801777 6.893728
$`2`
[1] 2.090962 9.192913 5.059505
Data:
data <- data.frame(a=c(3.498939, 14.801777, 6.893728, 2.090962, 9.192913, 5.059505),
l=c(0.8523857, 0.2457656, 0.5381691, 0.6764694, 0.8740950, 1.1761113))
row.names(data) <- c("1-2006", "1-2007", "1-2008", "2-2006", "2-2007", "2-2016")
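The same prefix idea can keep the row names and give month-named groups, which is closer to the january_a / january_l layout asked for. A sketch, assuming the prefix is always a month number from 1 to 12; month_num and by_month are illustrative names, and month.name is R's built-in vector of English month names.

```r
data <- data.frame(a = c(3.498939, 14.801777, 6.893728, 2.090962, 9.192913, 5.059505),
                   l = c(0.8523857, 0.2457656, 0.5381691, 0.6764694, 0.8740950, 1.1761113))
row.names(data) <- c("1-2006", "1-2007", "1-2008", "2-2006", "2-2007", "2-2016")

# Month number is everything before the first "-" in the row name.
month_num <- as.integer(sub("-.*$", "", row.names(data)))

# Split the whole data frame by month name; row names are preserved.
by_month <- split(data, month.name[month_num])

by_month$January["a"]  # the "january_a" piece
by_month$January["l"]  # the "january_l" piece
```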
This is another variation that you can try using tidyverse which returns a list of dataframes, where every element has a combination of month and "a" or "l".
library(tidyverse)
data %>%
rownames_to_column('date') %>%
pivot_longer(cols = -date) %>%
separate(date, c('month', 'year'), sep = "-", remove = FALSE) %>%
group_split(month, name)
#[[1]]
# A tibble: 3 x 5
# date month year name value
# <chr> <chr> <chr> <chr> <dbl>
#1 1-2006 1 2006 a 3.50
#2 1-2007 1 2007 a 14.8
#3 1-2008 1 2008 a 6.89
#[[2]]
# A tibble: 3 x 5
# date month year name value
# <chr> <chr> <chr> <chr> <dbl>
#1 1-2006 1 2006 l 0.852
#2 1-2007 1 2007 l 0.246
#3 1-2008 1 2008 l 0.538
#...
#...
This has some additional columns to uniquely identify values which you can remove if not needed.
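Another option is group_split
library(purrr)
library(dplyr)
library(stringr)
library(tibble)  # for rownames_to_column
data %>%
rownames_to_column('rn') %>%
select(rn, a) %>%
# the argument is .keep in current dplyr (it was keep in dplyr 0.8)
group_split(rn = str_remove(rn, '-.*'), .keep = FALSE) %>%
map(flatten_dbl)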
#[[1]]
#[1] 3.498939 14.801777 6.893728
#[[2]]
#[1] 2.090962 9.192913 5.059505
data
data <- data.frame(a=c(3.498939, 14.801777, 6.893728, 2.090962, 9.192913, 5.059505),
l=c(0.8523857, 0.2457656, 0.5381691, 0.6764694, 0.8740950, 1.1761113))
row.names(data) <- c("1-2006", "1-2007", "1-2008", "2-2006", "2-2007", "2-2016")

How can I match two sets of factor levels in a new data frame?

I have a large data frame and I want to export a new data frame that contains summary statistics of the first based on the id column.
library(tidyverse)
set.seed(123)
id = rep(c(letters[1:5]), 2)
species = c("dog","dog","cat","cat","bird","bird","cat","cat","bee","bee")
study = rep("UK",10)
freq = rpois(10, lambda=12)
df1 <- data.frame(id,species, freq,study)
df1$id<-sort(df1$id)
df1
df2 <- df1 %>% group_by(id) %>%
summarise(meanFreq= mean(freq),minFreq=min(freq))
df2
I want to keep the species name in the new data frame with the summary statistics. But if I merge by id I get redundant rows. I should only have one row per id but with the species name appended.
df3<-merge(df2,df1,by = "id")
This is what it should look like but my real data is messier than this neat set up here:
df4 = df3[seq(1, nrow(df3), 2), ]
df4
From the summarised output ('df2') we can join with the distinct rows of the selected columns of original data
library(dplyr)
df2 %>%
left_join(df1 %>%
distinct(id, species, study), by = 'id')
# A tibble: 5 x 5
# id meanFreq minFreq species study
# <fct> <dbl> <dbl> <fct> <fct>
#1 a 10.5 10 dog UK
#2 b 14.5 12 cat UK
#3 c 14.5 12 bird UK
#4 d 10 7 cat UK
#5 e 11 6 bee UK
Or use the same logic with the base R
merge(df2,unique(df1[c(1:2, 4)]),by = "id", all.x = TRUE)
Time for mutate followed by distinct:
df1 %>% group_by(id) %>%
mutate(meanFreq = mean(freq), minFreq = min(freq)) %>%
distinct(id, .keep_all = T)
Now actually there are two possibilities: either id and species are essentially the same in your df, one is just a label for the other, or the same id can have several species.
If the latter is the case, you will need to replace the last line with distinct(id, species, .keep_all = T).
This would get you:
# A tibble: 5 x 6
# Groups: id [5]
id species freq study meanFreq minFreq
<fct> <fct> <int> <fct> <dbl> <dbl>
1 a dog 10 UK 10.5 10
2 b cat 17 UK 14.5 12
3 c bird 12 UK 14.5 12
4 d cat 13 UK 10 7
5 e bee 6 UK 11 6
If your only goal is to keep the species & they are indeed the same as id, you could also just include it in the group_by:
df1 %>% group_by(id, species) %>%
summarise(meanFreq = mean(freq), minFreq = min(freq))
This would then remove study and freq - if you have the need to keep them, you can again replace summarise with mutate and then distinct with .keep_all = T argument.
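Which of the two cases applies (one species per id, or several) can be checked before choosing between the approaches above. A base R sketch on the id and species columns of the example; species_per_id and ambiguous_ids are illustrative names, and the freq and study columns are left out because they do not affect the check.

```r
id <- sort(rep(letters[1:5], 2))
species <- c("dog","dog","cat","cat","bird","bird","cat","cat","bee","bee")
df1 <- data.frame(id, species, stringsAsFactors = FALSE)

# Count the distinct species recorded for each id.
species_per_id <- tapply(df1$species, df1$id, function(s) length(unique(s)))

# ids that map to more than one species would yield several rows after a join.
ambiguous_ids <- names(species_per_id)[species_per_id > 1]
ambiguous_ids
```

An empty result means grouping by id and species together gives the same rows as grouping by id alone.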
