I have an R data frame:
  city hour value
0   NY    0    12
1   NY   12    24
2   LA    0     3
3   LA   12     9
I want, for each city, to divide each value by the previous one and write the result into a new data frame. The desired output is:
city ratio
  NY     2
  LA     3
You can try aggregate like below:
aggregate(value ~ city, df, function(x) x[-1] / x[1])
which gives
  city value
1   LA    3
2   NY    2
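Note that x[-1]/x[1] collapses to a single ratio here only because each city has exactly two rows; with more observations per group it would return several values per city.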
Data
> dput(df)
structure(list(city = c("NY", "NY", "LA", "LA"), hour = c(0L,
12L, 0L, 12L), value = c(12L, 24L, 3L, 9L)), class = "data.frame", row.names = c("0",
"1", "2", "3"))
You can use lag to get the previous value, divide each value by its previous value within each city, and drop the NA rows.
library(dplyr)
df %>%
  arrange(city, hour) %>%
  group_by(city) %>%
  summarise(value = value / lag(value)) %>%
  na.omit()
#  city  value
#  <chr> <dbl>
#1 LA        3
#2 NY        2
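(A side note: in dplyr >= 1.1.0, summarise() returning more than one row per group is deprecated; reframe() is the drop-in replacement for this pattern.)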
In data.table we can do this via shift:
library(data.table)
setDT(df)[order(city, hour), value := value/shift(value), city]
na.omit(df)
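For completeness, a plain base R sketch of the same idea using ave() (my addition, not from the original answers):
# order within city by hour, then take the ratio of each value to the previous one
df <- df[order(df$city, df$hour), ]
df$ratio <- ave(df$value, df$city,
                FUN = function(x) c(NA, x[-1] / x[-length(x)]))
na.omit(df[, c("city", "ratio")])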
I have two datasets; the first dataset is like this:
ID Weight State
 1  12.34 NA
 2  11.23 IA
 2  13.12 IN
 3  12.67 MA
 4  10.89 NA
 5  14.12 NA
The second dataset is a lookup table for state values by ID:
ID State
 1 WY
 2 IA
 3 MA
 4 OR
 4 CA
 5 FL
As you can see there are two different state values for ID 4, which is normal.
What I want to do is replace the NAs in the dataset1 State column with State values from dataset2. Expected dataset:
ID Weight State
 1  12.34 WY
 2  11.23 IA
 2  13.12 IN
 3  12.67 MA
 4  10.89 OR,CA
 5  14.12 FL
Since ID 4 has two state values in dataset2, these two values are collapsed, separated by a comma, and used to replace the NA in dataset1. Any suggestion on accomplishing this is much appreciated. Thanks in advance.
Collapse the df2 State values by 'ID', join the result with df1 by 'ID', and use coalesce to take the non-NA value from the two State columns.
library(dplyr)
df1 %>%
  left_join(df2 %>%
              group_by(ID) %>%
              summarise(State = toString(State)), by = 'ID') %>%
  mutate(State = coalesce(State.x, State.y)) %>%
  select(-State.x, -State.y)
#  ID Weight State
#1  1   12.3 WY
#2  2   11.2 IA
#3  2   13.1 IN
#4  3   12.7 MA
#5  4   10.9 OR, CA
#6  5   14.1 FL
In base R with merge and transform:
merge(df1, aggregate(State ~ ID, df2, toString), by = 'ID') |>
  transform(State = ifelse(is.na(State.x), State.y, State.x))
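Note that merge() performs an inner join by default; adding all.x = TRUE would also keep df1 rows whose ID has no match in df2 (here every ID matches, so it makes no difference).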
Tidyverse way:
library(tidyverse)
df1 %>%
  left_join(df2 %>%
              group_by(ID) %>%
              summarise(State = toString(State)) %>%
              ungroup(), by = 'ID') %>%
  transmute(ID, Weight, State = coalesce(State.x, State.y))
Base R alternative:
na_idx <- which(is.na(df1$State))
df1$State[na_idx] <- with(
  aggregate(State ~ ID, df2, toString),
  State[match(df1$ID, ID)]
)[na_idx]
Data:
df1 <- structure(list(ID = c(1L, 2L, 2L, 3L, 4L, 5L), Weight = c(12.34,
11.23, 13.12, 12.67, 10.89, 14.12), State = c(NA, "IA", "IN",
"MA", NA, NA)), row.names = c(NA, -6L), class = "data.frame")
df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 4L, 5L), State = c("WY",
"IA", "MA", "OR", "CA", "FL")), class = "data.frame", row.names = c(NA,
-6L))
I need some help counting observations meeting certain criteria by group. I first want the number of employees by location as a column. Then I would like to retrieve the number of employees that have worked more than 40 hours (by location) and summarize that into a column. I assume there is an easy way to do it with dplyr or base R but I'm stumped. My data is below.
name     hours_worked location
Bob                55 IL
Nick               25 IL
Sally              30 IL
Patricia           50 WI
Tim                35 WI
Liz                42 OH
Brad               60 OH
Sam                48 OH
Ideal output would be something like:
location headcount over_40
IL               3       1
WI               2       1
OH               3       3
We can do a group-by operation: grouped by 'location', use n() to get the number of rows for the headcount and the sum of a logical vector to get the 'over_40' count.
library(dplyr)
df1 %>%
  group_by(location) %>%
  summarise(headcount = n(), over_40 = sum(hours_worked > 40))
-output
# A tibble: 3 x 3
  location headcount over_40
  <chr>        <int>   <int>
1 IL               3       1
2 OH               3       3
3 WI               2       1
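Since the question mentions base R as an option, here is a minimal base R sketch of the same counts (my addition, not part of the original answer):
# cbind() builds both summary columns at once; sum() is applied per location
aggregate(cbind(headcount = rep(1, nrow(df1)),
                over_40 = hours_worked > 40) ~ location, df1, sum)
#  location headcount over_40
#1       IL         3       1
#2       OH         3       3
#3       WI         2       1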
data
df1 <- structure(list(name = c("Bob", "Nick", "Sally", "Patricia", "Tim",
"Liz", "Brad", "Sam"), hours_worked = c(55L, 25L, 30L, 50L, 35L,
42L, 60L, 48L), location = c("IL", "IL", "IL", "WI", "WI", "OH",
"OH", "OH")), class = "data.frame", row.names = c(NA, -8L))
I have a dataset, z, for which I wish to calculate the growth increase by type:
location size type  date
ny          5 hello 10/01/2020
ny          7 ai    10/02/2020
ny          8 ai    10/03/2020
ny          6 hello 10/04/2020
ca         15 cool  10/05/2020
ca         10 name  10/06/2020
ca          5 name  10/07/2020
ca         16 cool  10/08/2020
Desired output
location type  increase percent_increase start_date end_date
ca       cool         1            6.67% 10/05/2020 10/08/2020
ca       name        -5             -50% 10/06/2020 10/07/2020
ny       hello        1              20% 10/01/2020 10/04/2020
ny       ai           1           14.28% 10/02/2020 10/03/2020
This is what I am doing:
library(tidyverse)
z %>%
  group_by(type, location) %>%
  mutate(percent_increase = (size / lead(size) - 1) * 100)
I am not getting my desired output. Any assistance is appreciated.
To get the results you want, you need a different calculation in your mutate() line. I also added a filter() to remove rows with NA in the percent_increase variable, and finally an arrange() to sort alphabetically by location, matching the order of your requested output.
CODE
z %>%
  group_by(type, location) %>%
  mutate(
    increase = lead(size) - size,
    percent_increase = (increase / size) * 100,
    start_date = date,
    end_date = lead(date)) %>%
  filter(!is.na(percent_increase)) %>%
  arrange(location)
OUTPUT
# A tibble: 4 x 8
# Groups:   type, location [4]
  location  size type  date       increase percent_increase start_date end_date
  <chr>    <int> <chr> <chr>         <int>            <dbl> <chr>      <chr>
1 ca          15 cool  10/05/2020        1             6.67 10/05/2020 10/08/2020
2 ca          10 name  10/06/2020       -5           -50    10/06/2020 10/07/2020
3 ny           5 hello 10/01/2020        1            20    10/01/2020 10/04/2020
4 ny           7 ai    10/02/2020        1            14.3  10/02/2020 10/03/2020
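To end up with exactly the columns of the desired output, the same pipeline can finish with ungroup() and select() (a small extension of the answer above):
z %>%
  group_by(type, location) %>%
  mutate(
    increase = lead(size) - size,
    percent_increase = (increase / size) * 100,
    start_date = date,
    end_date = lead(date)) %>%
  filter(!is.na(percent_increase)) %>%
  arrange(location) %>%
  ungroup() %>%
  select(location, type, increase, percent_increase, start_date, end_date)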
INPUT
z <- structure(list(location = c("ny", "ny", "ny", "ny", "ca", "ca",
"ca", "ca"), size = c(5L, 7L, 8L, 6L, 15L, 10L, 5L, 16L), type = c("hello",
"ai", "ai", "hello", "cool", "name", "name", "cool"), date = c("10/01/2020",
"10/02/2020", "10/03/2020", "10/04/2020", "10/05/2020", "10/06/2020",
"10/07/2020", "10/08/2020")), class = "data.frame", row.names = c(NA,
-8L))
You're missing an arrange() call to order the rows by date before lead() is taken. Note also that, to match the desired output, the percent increase should be (lead(size)/size - 1) * 100 rather than (size/lead(size) - 1) * 100. Like this:
z %>%
  group_by(type, location) %>%
  arrange(date) %>%
  mutate(percent_increase = (lead(size) / size - 1) * 100)
So I've seen many pages on the generalized version of this issue, but here specifically I would like to sum all values in a row after a specific column.
Let's say we have this df:
id   city     identity q1 q2 q3
0110 detroit  ella      2  4  3
0111 boston   fitz      0  0  0
0112 philly   gerald    3  1  0
0113 new_york doowop    8 11  2
0114 ontario  wazaaa   NA 11 NA
Now, the dfs I work with don't usually have exactly 3 "q" variables; the number varies. Hence, I would like to row-sum every row, but only over the columns that come after the identity column.
Rows with NA are to be ignored.
Eventually I would like to remove the rows which sum to 0 and end with a df that looks like this:
id   city     identity q1 q2 q3
0110 detroit  ella      2  4  3
0112 philly   gerald    3  1  0
0113 new_york doowop    8 11  2
Doing this in dplyr is the preference but not required.
EDIT:
I have added below the data for which this solution is not working; apologies for the confusion.
df <- structure(list(Program = c("3002", "111", "2455", "2929", "NA",
"NA", NA), Project_ID = c("299", "11", "271", "780", "207", "222",
NA), Advance_Identifier = c(14, 24, 12, 15, NA, 11, NA), Sequence = c(6,
4, 4, 5, 2, 3, 79), Item = c("payment", "hero", "prepayment_2",
"UPS", "period", "prepayment", "yeet"), q1 = c("500", "12", "-1",
"0", NA, "0", "0"), q2 = c("500", "12", "-1", "0", NA, "0", "1"
), q3 = c("500", "12", "2", "0", NA, "0", "2"), q4 = c("500",
"13", "0", "0", NA, "0", "3")), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
Base R version with zero extra dependencies:
[Edit: I always forget rowSums exists]
> df1$new = rowSums(
    df1[, (1 + which(names(df1) == "identity")):ncol(df1), drop = FALSE]
  )
> df1
   id     city identity q1 q2 q3 new
1 110  detroit     ella  2  4  3   9
2 111   boston     fitz  0  0  0   0
3 112   philly   gerald  3  1  0   4
4 113 new_york   doowop  8 11  2  21
If you need to convert chars to numbers, use apply with as.numeric:
df$new = apply(df[, (1 + which(names(df) == "Item")):ncol(df), drop = FALSE], 1,
               function(row) sum(as.numeric(row)))
BUT look out if they are really factors, because this will fail; that is why converting things that look like numbers to numbers before you do anything else is a Good Thing.
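One hedged way to do that up-front conversion with base R's type.convert() (my addition; as.is = TRUE keeps character columns from turning into factors):
# convert every column that looks numeric to numeric, leave the rest as character
df[] <- lapply(df, function(x) type.convert(as.character(x), as.is = TRUE))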
Benchmark
In case you are worried about speed, here's a benchmark of my function against the currently accepted solution:
library(dplyr)          # needed for akrun()
library(microbenchmark)

akrun = function(df1){
  df1 %>%
    mutate(new = rowSums(select(., (match('identity', names(.)) +
      1):ncol(.)), na.rm = TRUE))
}
baz = function(df1){
  rowSums(
    df1[, (1 + which(names(df1) == "identity")):ncol(df1), drop = FALSE]
  )
}
sample data
df = data.frame(id = sample(100, 100), city = sample(LETTERS, 100, TRUE),
                identity = sample(letters, 100, TRUE),
                q1 = runif(100), q2 = runif(100), q3 = runif(100))
Test. Note that I remove the new column from the source data frame each time, otherwise the code keeps adding it back in (akrun doesn't modify df in place, but it can run after baz has already assigned the new column in the benchmark code).
> microbenchmark({df$new=NULL; df2 = akrun(df)}, {df$new=NULL; df$new=baz(df)})
Unit: microseconds
                                expr      min       lq       mean    median        uq      max neval
  { df$new = NULL df2 = akrun(df) } 1300.682 1328.941 1396.63477 1376.9425 1398.5880 2075.894   100
 { df$new = NULL df$new = baz(df) }   63.102   72.721   87.78668   84.3655   86.7005  685.594   100
The tidyverse version takes about 16 times as long as the base R version.
We can use
out <- df1 %>%
  mutate(new = rowSums(select(., (match('identity', names(.)) +
    1):ncol(.)), na.rm = TRUE))
out
#   id     city identity q1 q2 q3 new
#1 110  detroit     ella  2  4  3   9
#2 111   boston     fitz  0  0  0   0
#3 112   philly   gerald  3  1  0   4
#4 113 new_york   doowop  8 11  2  21
and then filter out the rows that have 0 in 'new'
out %>%
  filter(new > 0)
In the OP's updated dataset, the columns are of character type. We can automatically convert them to their respective types with
df %>%
  # type.convert(as.is = TRUE) %>% # base R
  # or with readr::type_convert
  type_convert %>%
  ...
NOTE: The OP asked, in the title and in the description, for a tidyverse option; it is not a question about efficiency. Also, rowSums is a base R function; here we showed how to use it in a tidyverse chain. I could have written the answer the base R way earlier with the same function.
If we remove the select, it becomes just base R, i.e.
df1$new <- rowSums(df1[(match('identity', names(df1)) + 1):ncol(df1)], na.rm = TRUE)
Benchmarks
df = data.frame(id = sample(100, 100), city = sample(LETTERS, 100, TRUE),
                identity = sample(letters, 100, TRUE),
                q1 = runif(100), q2 = runif(100), q3 = runif(100))
akrun = function(df1){
  rowSums(df1[(match('identity', names(df1)) + 1):ncol(df1)], na.rm = TRUE)
}
baz = function(df1){
  rowSums(
    df1[, (1 + which(names(df1) == "identity")):ncol(df1), drop = FALSE]
  )
}
microbenchmark({df$new=NULL; df2 = akrun(df)}, {df$new=NULL; df$new=baz(df)})
# Unit: microseconds
#                                expr    min     lq     mean  median      uq      max neval
#  { df$new = NULL df2 = akrun(df) } 69.926 73.244 112.2078 75.4335 78.7625 3539.921   100
# { df$new = NULL df$new = baz(df) } 73.670 77.945 118.3875 80.5045 83.5100 3767.812   100
data
df1 <- structure(list(id = 110:113, city = c("detroit", "boston", "philly",
"new_york"), identity = c("ella", "fitz", "gerald", "doowop"),
q1 = c(2L, 0L, 3L, 8L), q2 = c(4L, 0L, 1L, 11L), q3 = c(3L,
0L, 0L, 2L)), class = "data.frame", row.names = c(NA, -4L
))
Similar to akrun, you can try
df %>%
  mutate_at(vars(starts_with("q")), as.numeric) %>%
  mutate(sum_new = rowSums(select(., starts_with("q")), na.rm = TRUE)) %>%
  filter(sum_new > 0)
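(In dplyr >= 1.0.0, mutate_at() is superseded; the current idiom is mutate(across(starts_with("q"), as.numeric)).)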
Here I use reduce() from purrr to sum the rows, which is fast:
library(tidyverse)
data %>%
  filter_at(vars(starts_with('q')), ~ !is.na(.)) %>%
  mutate(Sum = reduce(select(., starts_with("q")), `+`)) %>%
  filter(Sum > 0)
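Note the design choice here: reduce(..., `+`) propagates NA, unlike rowSums(..., na.rm = TRUE), which is why the NA rows are filtered out before summing.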
This is the library I am using for creating dummies:
install.packages("fastDummies")
library(fastDummies)
This is the dataset
winners <- data.frame(
city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"),
year = c(1990, 2000, 1990),
crime = 1:3)
Let's then create dummies out of these cities:
dummy_cols(winners, select_columns = c("city"))
The results are:
          city year crime city_SaoPaulito city_NewAmsterdam city_BeatifulCow
1   SaoPaulito 1990     1               1                 0                0
2 NewAmsterdam 2000     2               0                 1                0
3  BeatifulCow 1990     3               0                 0                1
So the question is that I want to get back to the previous dataset. Any ideas?
Thanks in advance!
We can use dcast
library(data.table)
dcast(setDT(winners), crime ~ city, length)
If we need to get the input, it would be
subset(df1, select = 1:3)
#          city year crime
#1   SaoPaulito 1990     1
#2 NewAmsterdam 2000     2
#3  BeatifulCow 1990     3
Or with melt
melt(setDT(df1), measure = patterns("_"))[value == 1, .(city, year, crime)]
#           city year crime
#1:   SaoPaulito 1990     1
#2: NewAmsterdam 2000     2
#3:  BeatifulCow 1990     3
data
df1 <- structure(list(city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"
), year = c(1990L, 2000L, 1990L), crime = 1:3, city_SaoPaulito = c(1L,
0L, 0L), city_NewAmsterdam = c(0L, 1L, 0L), city_BeatifulCow = c(0L,
0L, 1L)), class = "data.frame", row.names = c("1", "2", "3"))
If you are going to have only one city as 1 in each row, you can just drop the dummy columns:
df[, 1:3]
#          city year crime
#1   SaoPaulito 1990     1
#2 NewAmsterdam 2000     2
#3  BeatifulCow 1990     3
If you can have multiple cities flagged as 1, one way using dplyr and tidyr::gather is:
library(dplyr)
df %>%
  tidyr::gather(key, value, starts_with("city_")) %>%
  filter(value == 1) %>%
  select(-value, -key)
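For newer tidyr (>= 1.0.0), where gather() is superseded, the same idea with pivot_longer() (my addition, a sketch rather than part of the original answer):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(starts_with("city_")) %>% # stack the dummy columns into name/value
  filter(value == 1) %>%                 # keep only the active dummy per row
  select(-name, -value)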