How to combine columns based on contingencies in R?

I have the following df:
SUMLEV STATE COUNTY AGEGRP TOT_POP TOT_MALE
50 1 1 0 55601 26995
50 7 33 0 218022 105657
50 14 500 0 24881 13133
50 4 70 0 22400 11921
50 3 900 0 57840 28500
50 22 11 0 10138 5527
I would like to make a new column named CODE based on the columns STATE and COUNTY: paste the number from STATE onto the number from COUNTY. However, if COUNTY is a single- or double-digit number, I would like it zero-padded to three digits, like 001 and 033.
Ideally the final df would look like:
SUMLEV STATE COUNTY AGEGRP TOT_POP TOT_MALE CODE
50 1 1 0 55601 26995 1001
50 7 33 0 218022 105657 7033
50 14 500 0 24881 13133 14500
50 4 70 0 22400 11921 4070
50 3 900 0 57840 28500 3900
50 22 11 0 10138 5527 22011
Is there a short, elegant way of doing this?

We can use sprintf, where the format '%d%03d' prints STATE as-is and zero-pads COUNTY to three digits:
library(dplyr)
df %>%
  mutate(CODE = sprintf('%d%03d', STATE, COUNTY))
# SUMLEV STATE COUNTY AGEGRP TOT_POP TOT_MALE CODE
#1 50 1 1 0 55601 26995 1001
#2 50 7 33 0 218022 105657 7033
#3 50 14 500 0 24881 13133 14500
#4 50 4 70 0 22400 11921 4070
#5 50 3 900 0 57840 28500 3900
#6 50 22 11 0 10138 5527 22011
If we need to split the column 'CODE' into two, we can use separate
library(tidyr)
df %>%
  mutate(CODE = sprintf('%d%03d', STATE, COUNTY)) %>%
  separate(CODE, into = c("CODE1", "CODE2"), sep = "(?=...$)")
Or use extract to capture the substrings as groups
df %>%
  mutate(CODE = sprintf('%d%03d', STATE, COUNTY)) %>%
  extract(CODE, into = c("CODE1", "CODE2"), "^(.*)(...)$")
Or with str_pad
library(stringr)
df %>%
  mutate(CODE = str_c(STATE, str_pad(COUNTY, width = 3, pad = '0')))
Or in base R
df$CODE <- sprintf('%d%03d', df$STATE, df$COUNTY)
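For completeness, the same zero-padding can also be done in base R with formatC instead of sprintf (a sketch, equivalent to the str_pad approach above):
# zero-pad COUNTY to three digits, then paste STATE in front
df$CODE <- paste0(df$STATE, formatC(df$COUNTY, width = 3, flag = '0'))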
data
df <- structure(list(SUMLEV = c(50L, 50L, 50L, 50L, 50L, 50L), STATE = c(1L,
7L, 14L, 4L, 3L, 22L), COUNTY = c(1L, 33L, 500L, 70L, 900L, 11L
), AGEGRP = c(0L, 0L, 0L, 0L, 0L, 0L), TOT_POP = c(55601L, 218022L,
24881L, 22400L, 57840L, 10138L), TOT_MALE = c(26995L, 105657L,
13133L, 11921L, 28500L, 5527L)), class = "data.frame", row.names = c(NA,
-6L))

Related

Merge IDs of three data frames to obtain multiple rows

I have 3 data frames, and across these three data frames my sample IDs vary. I want to merge them to obtain one data frame with multiple rows for the same ID but not the same values.
This is what I have:
df1
ID tstart
1  12
2  4
df2
ID tstart
2  40
3  15
df3
ID tstart
2  80
3  80
This is what I want:
ID tstart
1  12
2  4
2  40
3  15
3  80
Now I want to create a new variable. I have this:
ID tstart tstop result1 result2
1  12     20    5       NA
2  4      40    10      NA
2  40     80    NA      52
3  15     80    68      NA
3  80     100   NA      56
And I want a new variable so that I have this df:
ID tstart tstop result
1  12     20    5
2  4      40    10
2  40     80    52
3  15     80    68
3  80     100   56
We can use bind_rows to bind the datasets and then get the distinct rows
library(dplyr)
bind_rows(df1, df2, df3) %>%
  distinct()
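The data block below only defines dfnew, so as a sketch based on the tables in the question, the three inputs for the bind_rows step could be rebuilt like this:
df1 <- data.frame(ID = c(1L, 2L), tstart = c(12L, 4L))
df2 <- data.frame(ID = c(2L, 3L), tstart = c(40L, 15L))
df3 <- data.frame(ID = c(2L, 3L), tstart = c(80L, 80L))
In base R, the same stacking could be done with unique(rbind(df1, df2, df3)).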
For the second case, if the input already has the 'tstart' and 'tstop' columns, we can coalesce the 'result' columns:
dfnew %>%
  mutate(result = coalesce(result1, result2), .keep = 'unused')
Output:
ID tstart tstop result
1 1 12 20 5
2 2 4 40 10
3 2 40 80 52
4 3 15 80 68
5 3 80 100 56
data
dfnew <- structure(list(ID = c(1L, 2L, 2L, 3L, 3L), tstart = c(12L, 4L,
40L, 15L, 80L), tstop = c(20L, 40L, 80L, 80L, 100L), result1 = c(5L,
10L, NA, 68L, NA), result2 = c(NA, NA, 52L, NA, 56L)),
class = "data.frame", row.names = c(NA,
-5L))

How to sum across rows based on a greater-than condition

I'm trying to move from Excel to R and am looking to do something similar to SUMIFS in Excel. I want to create a new column that is the sum of the rows from multiple columns, but only if the value is greater than 25.
My data, shown below, is the area of different crops on farms. I want to add a new column of total agricultural area, but only include a crop if there are more than 25 acres of it:
Prop_ID State Pasture Soy Corn
1       WI    20      45  75
2       MN    10      80  122
3       MN    152     0   15
4       IL    0       10  99
5       IL    75      38  0
6       WI    30      45  0
7       WI    68      55  0
I'm looking to produce a new table like this:
Prop_ID State Pasture Soy Corn Total_ag
1       WI    20      45  75   120
2       MN    10      80  122  202
3       MN    152     0   15   152
4       IL    0       10  20   0
5       IL    15      15  20   0
6       WI    30      45  0    75
7       WI    50      55  0    105
I want to reference the columns to sum using their index [3:5] and not their names, since I have different crops in different datasets.
I'm assuming mutate or summarize is what I need to use, but I can't figure it out.
We can replace the values less than 25 with NA (or 0) and then use rowSums:
library(dplyr)
df1 <- df1 %>%
  mutate(Total_ag = rowSums(across(where(is.numeric),
                                   ~ replace(.x, .x < 25, NA)), na.rm = TRUE))
Similar option in base R
df1$Total_ag <- rowSums(replace(df1[3:5], df1[3:5] < 25, NA), na.rm = TRUE)
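Since the question asks to reference the crop columns by index rather than by name, across() also accepts column positions; a sketch using positions 3:5:
df1 %>%
  # positions 3:5 are the crop columns; values below 25 become NA and are skipped
  mutate(Total_ag = rowSums(across(3:5, ~ replace(.x, .x < 25, NA)), na.rm = TRUE))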
Another option is to multiply the value matrix by a boolean matrix:
rowSums(dat[3:5]*(dat[3:5] >= 25))
# [1] 120 202 152 99 113 75 123
Data:
dat <- structure(list(Prop_ID = 1:7, State = c("WI", "MN", "MN", "IL",
"IL", "WI", "WI"), Pasture = c(20L, 10L, 152L, 0L, 75L, 30L,
68L), Soy = c(45L, 80L, 0L, 10L, 38L, 45L, 55L), Corn = c(75L,
122L, 15L, 99L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-7L))

How to group contiguous values into a range in R

I have an example dataset:
Road Start End Cat
1 0 50 a
1 50 60 b
1 60 90 b
1 70 75 a
2 0 20 a
2 20 25 a
2 25 40 b
Trying to output following:
Road Start End Cat
1 0 50 a
1 50 90 b
1 70 75 a
2 0 25 a
2 25 40 b
My code doesn't work:
df %>% group_by(Road, cat)
%>% summarise(
min(Start),
max(End)
)
How can I achieve the results I wanted?
We can use rleid from data.table to get a run-length id for grouping and then do the summarise:
library(dplyr)
library(data.table)
df %>%
  group_by(Road, grp = rleid(Cat)) %>%
  summarise(Cat = first(Cat), Start = min(Start), End = max(End)) %>%
  select(-grp)
# A tibble: 5 x 4
# Groups: Road [2]
# Road Cat Start End
# <int> <chr> <int> <int>
#1 1 a 0 50
#2 1 b 50 90
#3 1 a 70 75
#4 2 a 0 25
#5 2 b 25 40
Or using data.table methods
library(data.table)
setDT(df)[, .(Start = min(Start), End = max(End)), .(Road, Cat, grp = rleid(Cat))]
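If you would rather stay entirely in dplyr, the run-length grouping can also be built with cumsum() and lag(); a sketch:
df %>%
  group_by(Road) %>%
  # a new run starts whenever Cat changes within a Road
  mutate(grp = cumsum(Cat != lag(Cat, default = first(Cat)))) %>%
  group_by(Road, grp) %>%
  summarise(Cat = first(Cat), Start = min(Start), End = max(End), .groups = 'drop') %>%
  select(-grp)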
data
df <- structure(list(Road = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), Start = c(0L,
50L, 60L, 70L, 0L, 20L, 25L), End = c(50L, 60L, 90L, 75L, 20L,
25L, 40L), Cat = c("a", "b", "b", "a", "a", "a", "b")),
class = "data.frame", row.names = c(NA,
-7L))

Calculating group summaries with specific columns in R

The pattern of my data is like this:
df1<-read.table(text="Car1 Car2 Car3 Time1 Time2 Time3
22 33 90 20 90 20
11 45 88 10 80 30
22 33 40 40 10 10
11 45 40 10 10 40
11 45 88 10 12 60
22 45 90 60 20 100",header=TRUE)
I want to calculate the mean and SD based on Car and Time. The point is that Car1 corresponds to Time1, Car2 corresponds to Time2, Car3 corresponds to Time3, and so on.
I want to get the following table :
Car1 Mean SD
11 10 0
22 40 20
Car2
33 xx xx
45 xx xx
Car3
40 xx xx
88 xx xx
90 xx xx
I have tried:
df1 %>% group_by(Car1,Car2,Car3) %>%
summarise(mean=mean(Time,SD=sd(Time))
Unfortunately, it does not work. Any help?
You can also use the package data.table:
library(data.table)
melt(setDT(df1),
     measure = patterns("Car", "Time"),
     value.name = c("Car", "Time"),
     variable.name = "group"
)[, .(Mean = mean(Time), Sd = sd(Time)), .(group, Car)]
# group Car Mean Sd
# 1: 1 22 40.0 20.00000
# 2: 1 11 10.0 0.00000
# 3: 2 33 50.0 56.56854
# 4: 2 45 30.5 33.28163
# 5: 3 90 60.0 56.56854
# 6: 3 88 45.0 21.21320
# 7: 3 40 25.0 21.21320
Here is one option with pivot_longer: we reshape from 'wide' to 'long' format, group by the 'group' index and 'Car', and get the mean and sd of 'Time' with summarise.
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = everything(), names_to = c(".value", "group"),
               names_sep = "(?<=[a-z])(?=\\d+)") %>%
  group_by(group, Car) %>%
  summarise(Mean = mean(Time), SD = sd(Time))
# A tibble: 7 x 4
# Groups: group [3]
# group Car Mean SD
# <chr> <int> <dbl> <dbl>
#1 1 11 10 0
#2 1 22 40 20
#3 2 33 50 56.6
#4 2 45 30.5 33.3
#5 3 40 25 21.2
#6 3 88 45 21.2
#7 3 90 60 56.6
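The same reshape can also be written with names_pattern instead of the lookaround names_sep; a sketch:
df1 %>%
  pivot_longer(cols = everything(), names_to = c(".value", "group"),
               # first capture -> column names (Car, Time), second -> group index
               names_pattern = "([A-Za-z]+)(\\d+)") %>%
  group_by(group, Car) %>%
  summarise(Mean = mean(Time), SD = sd(Time), .groups = "drop")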
Assuming you can easily segregate your data into Time and Car columns, you can do this with a loop, given the structure you provided:
cars <- df1[1:3]
Time <- df1[4:6]
ls <- list()
for(i in 1:ncol(cars)) {
  ls[[i]] <- aggregate(Time[i], by = cars[i], FUN = function(x) c(mean(x), sd(x)))
}
ls
Data for the results is:
df1 <- structure(list(Car1 = c(22L, 11L, 22L, 11L, 11L, 22L), Car2 = c(33L,
45L, 33L, 45L, 45L, 45L), Car3 = c(90L, 88L, 40L, 40L, 88L, 90L
), Time1 = c(20L, 10L, 40L, 10L, 10L, 60L), Time2 = c(90L, 80L,
10L, 10L, 12L, 20L), Time3 = c(20L, 30L, 10L, 40L, 60L, 100L)), class = "data.frame", row.names = c(NA,
-6L))
Another option splits the columns by their numeric suffix and summarises each Car/Time pair:
lapply(split.default(df1, gsub("\\D+", "", names(df1))), function(x){
  d = gsub("\\D+", "", names(x)[1])
  x %>%
    group_by(!!sym(paste0("Car", d))) %>%
    summarise(mean = mean(!!sym(paste0("Time", d))),
              sd = sd(!!sym(paste0("Time", d)))) %>%
    ungroup()
})

What tidyverse command can count the number of non-zero entries column-wise over a time series?

I have a data table that looks like the following:
Item 2018 2019 2020 2021 2022 2023
Apples 10 12 17 18 0 0
Bears 40 50 60 70 80 90
Cats 5 2 1 0 0 0
Dogs 15 17 18 15 11 0
I want a column that showing a count of the number of years with non-zero sales. That is:
Item 2018 2019 2020 2021 2022 2023 Count
Apples 10 12 17 18 0 0 4
Bears 40 50 60 70 80 90 6
Cats 5 2 1 0 0 0 3
Dogs 15 17 18 15 11 0 5
NB I'll want to do some analysis on this in the next pass, so I'm looking to just add the count column and not aggregate at this stage. The next step will be something like filtering the rows where the count is greater than a threshold.
I looked at the tally() command from tidyverse, but this doesn't seem to do what I want (I think).
NB I haven't tagged this question as tidyverse due to the guidance on that tag. Shout if I need to edit this point.
As it is a row-wise operation, we can use rowSums after converting the subset of the dataset to logical:
library(tidyverse)
df1 %>%
  mutate(Count = rowSums(.[-1] > 0))
Or using reduce
df1 %>%
  mutate(Count = select(., -1) %>%
           mutate_all(funs(. > 0)) %>%
           reduce(`+`))
Or with pmap
df1 %>%
  mutate(Count = pmap_dbl(.[-1], ~ sum(c(...) > 0)))
# Item 2018 2019 2020 2021 2022 2023 Count
#1 Apples 10 12 17 18 0 0 4
#2 Bears 40 50 60 70 80 90 6
#3 Cats 5 2 1 0 0 0 3
#4 Dogs 15 17 18 15 11 0 5
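On current dplyr versions, where funs() is deprecated, an equivalent row-wise count can be written with rowwise() and c_across(); a sketch:
df1 %>%
  rowwise() %>%
  # count how many year columns are non-zero for each Item
  mutate(Count = sum(c_across(-Item) > 0)) %>%
  ungroup()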
data
df1 <- structure(list(Item = c("Apples", "Bears", "Cats", "Dogs"), `2018` = c(10L,
40L, 5L, 15L), `2019` = c(12L, 50L, 2L, 17L), `2020` = c(17L,
60L, 1L, 18L), `2021` = c(18L, 70L, 0L, 15L), `2022` = c(0L,
80L, 0L, 11L), `2023` = c(0L, 90L, 0L, 0L)), class = "data.frame",
row.names = c(NA, -4L))
