How to sum across rows based on condition greater than - r

I'm trying to move from Excel to R and am looking to do something similar to SUMIFS in Excel. I want to create a new column that sums several columns row-wise, but only includes a value if it is greater than 25.
My data, shown below, holds the area of different crops on farms. I want to add a new column with the total agricultural area, counting a crop only if it covers more than 25 acres:
Prop_ID State Pasture Soy Corn
1       WI    20      45  75
2       MN    10      80  122
3       MN    152     0   15
4       IL    0       10  99
5       IL    75      38  0
6       WI    30      45  0
7       WI    68      55  0
I'm looking to produce a new table like this:
Prop_ID State Pasture Soy Corn Total_ag
1       WI    20      45  75   120
2       MN    10      80  122  202
3       MN    152     0   15   152
4       IL    0       10  99   99
5       IL    75      38  0    113
6       WI    30      45  0    75
7       WI    68      55  0    123
I want to reference the columns to sum by their index [3:5] and not by their names, since I have different crops in different datasets.
I assume I need mutate or summarize for this, but I can't figure it out.

We can replace the values less than 25 with NA (or 0) and then use rowSums; selecting columns 3:5 by position, as requested, also keeps Prop_ID out of the sum:
library(dplyr)
df1 <- df1 %>%
  mutate(Total_ag = rowSums(across(3:5, ~ replace(.x, .x < 25, NA)),
                            na.rm = TRUE))
A similar option in base R:
df1$Total_ag <- rowSums(replace(df1[3:5], df1[3:5] < 25, NA), na.rm = TRUE)
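Note that both versions keep values of exactly 25 (use .x <= 25 to drop them; no 25s occur in this data, so it makes no difference here). As a quick check against the data above:
df1$Total_ag
# [1] 120 202 152  99 113  75 123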

Multiply the value matrix by a boolean matrix; values below the cutoff are multiplied by FALSE (zero) and drop out of the sum:
rowSums(dat[3:5] * (dat[3:5] >= 25))
# [1] 120 202 152 99 113 75 123
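To attach the result as the new column, assign it in the same spirit; a one-line sketch:
dat$Total_ag <- rowSums(dat[3:5] * (dat[3:5] >= 25))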
Data:
dat <- structure(list(Prop_ID = 1:7, State = c("WI", "MN", "MN", "IL",
"IL", "WI", "WI"), Pasture = c(20L, 10L, 152L, 0L, 75L, 30L,
68L), Soy = c(45L, 80L, 0L, 10L, 38L, 45L, 55L), Corn = c(75L,
122L, 15L, 99L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-7L))

Assign value in vector based on presence in another vector in R?

I have tried to look for a similar question and I'm sure other people have encountered this problem, but I still couldn't find anything that helped me. I have a dataset1 with 37,000 observations like this:
id hours
130 12
165 56
250 13
11 15
17 42
and another dataset2 with 38,000 observations like this:
id hours
130 6
165 23
250 9
11 14
17 11
I want to do the following: if an id from dataset1 is in dataset2, the hours from dataset1 should override the hours in dataset2. For the ids that are in dataset1 but not in dataset2, the value for dataset2$hours should be NA.
I tried the %in% operator, ifelse(), a loop, and some base R commands, but I can't figure it out. I always get the error that the vectors don't have the same length.
Thanks for any help!
You can replace hours with NA for the ids that don't match between df1 and df2. Since both of your data sets had the same set of ids, I added one row to df1 with id = 123 and hours = 12.
df1$hours <- replace(df1$hours, is.na(match(df1$id,df2$id)), NA)
df1
id hours
1 130 12
2 165 56
3 250 13
4 11 15
5 17 42
6 123 NA
data
df1 <- structure(list(id = c(130L, 165L, 250L, 11L, 17L, 123L), hours = c(12L,
56L, 13L, 15L, 42L, 12L)), row.names = c(NA, -6L), class = "data.frame")
df1
#   id hours
# 1 130    12
# 2 165    56
# 3 250    13
# 4  11    15
# 5  17    42
# 6 123    12
df2 <- structure(list(id = c(130L, 165L, 250L, 11L, 17L), hours = c(6L,
23L, 9L, 14L, 11L)), class = "data.frame", row.names = c(NA,
-5L))
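If you would rather keep dataset2 and pull the overriding hours from dataset1 (the literal reading of the question), match() alone is enough; a minimal sketch using the df1/df2 above:
# for each id in df2, look up the hours in df1; unmatched ids become NA
df2$hours <- df1$hours[match(df2$id, df1$id)]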
First, match the IDs of the replacement data against the IDs of the original data, using na.omit() to handle replacement IDs that are not contained in the original data. Then replace with the rows of the replacement data whose IDs are in the original IDs.
I expanded both data sets to fabricate cases with no matches.
dat1
# id hours
# 1 130 12
# 2 165 56
# 3 250 13
# 4 11 15
# 5 17 42
# 6 12 232
# 7 35 456
dat2
# id hours
# 1 11 14
# 2 17 11
# 3 165 23
# 4 999 99
# 5 130 6
# 6 250 9
Replacement
dat1[na.omit(match(dat2$id, dat1$id)), ]$hours <-
  dat2[dat2$id %in% dat1$id, ]$hours
dat1
# id hours
# 1 130 6
# 2 165 23
# 3 250 9
# 4 11 14
# 5 17 11
# 6 12 232
# 7 35 456
Data:
dat1 <- structure(list(id = c(130L, 165L, 250L, 11L, 17L, 12L, 35L),
hours = c(12L, 56L, 13L, 15L, 42L, 232L, 456L)), class = "data.frame", row.names = c(NA,
-7L))
dat2 <- structure(list(id = c(11L, 17L, 165L, 999L, 130L, 250L), hours = c(14L,
11L, 23L, 99L, 6L, 9L)), class = "data.frame", row.names = c(NA,
-6L))
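The same update can also be written as a dplyr join that keeps dat1's rows and prefers dat2's hours where an id matches; a sketch using the dat1/dat2 above (the column suffixes are arbitrary names):
library(dplyr)
dat1 %>%
  left_join(dat2, by = "id", suffix = c("_old", "_new")) %>%
  mutate(hours = coalesce(hours_new, hours_old)) %>%
  select(id, hours)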

subset a datatable using multiple criteria from second table

I have a very large data table of ocean sites, with multiple depths at each site (Table 1).
I need to extract the rows matching the site location and depth given in another data table (Table 2).
Table 1 - table to be subset:
Lat Long Depth Nitrate
165 -77  0     29.5420
165 -77  50    30.2213
165 -77  100   29.2275
124 -46  0     27.8544
124 -46  50    28.6458
124 -46  100   24.9543
76  -24  0     31.9784
76  -24  50    28.6408
76  -24  100   24.9746
25  -62  0     31.9784
25  -62  50    28.6408
25  -62  100   24.9746
Table 2 - coordinates and depth needed for subsetting:
Lat Long Depth
165 -77  100
76  -24  50
25  -62  0
I have tried to get all sites in a table that would include all available depth data for those sites:
subset <- filter(table1, Lat == table2$Lat | Long == table2$Long)
but it returns zero obs.
Any suggestions?
It seems you are looking for an inner join:
merge(dat1, dat2, by = c("Lat", "Long"))
# Lat Long Depth.x Nitrate Depth.y
# 1 165 -77 100 29.2275 100
# 2 165 -77 0 29.5420 100
# 3 165 -77 50 30.2213 100
# 4 25 -62 0 31.9784 0
# 5 25 -62 50 28.6408 0
# 6 25 -62 100 24.9746 0
# 7 76 -24 0 31.9784 50
# 8 76 -24 50 28.6408 50
# 9 76 -24 100 24.9746 50
There is some risk in this: joins rely on strict equality when comparing columns, but floating-point values (with many digits of precision) can be too "fine" for most programming languages to detect differences (cf. Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754). The trouble is that you will get no errors; the join simply produces no matches.
To work around that, you will need to think about "tolerance": how close a point in dat1 must be to a point in dat2 to count as a match. That can be done in one of two ways: (1) calculate the distance between all points in dat1 and all points in dat2, and take the minimum for each point in dat1 (within a tolerance); or (2) do a "fuzzy join" (using the aptly named fuzzyjoin package) to find points that are within a range (effectively dat1$Lat between dat2$Lat +/- 0.01, and similarly for Long).
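A minimal sketch of option (2), assuming the fuzzyjoin package is installed and using the dat1/dat2 below; the 0.01 tolerance is an arbitrary choice:
library(fuzzyjoin)
# keep pairs whose Lat and Long each differ by at most 0.01
difference_inner_join(dat1, dat2, by = c("Lat", "Long"), max_dist = 0.01)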
Data
dat1 <- structure(list(Lat = c(165L, 165L, 165L, 124L, 124L, 124L, 76L, 76L, 76L, 25L, 25L, 25L), Long = c(-77L, -77L, -77L, -46L, -46L, -46L, -24L, -24L, -24L, -62L, -62L, -62L), Depth = c(0L, 50L, 100L, 0L, 50L, 100L, 0L, 50L, 100L, 0L, 50L, 100L), Nitrate = c(29.542, 30.2213, 29.2275, 27.8544, 28.6458, 24.9543, 31.9784, 28.6408, 24.9746, 31.9784, 28.6408, 24.9746)), class = "data.frame", row.names = c(NA, -12L))
dat2 <- structure(list(Lat = c(165L, 76L, 25L), Long = c(-77L, -24L, -62L), Depth = c(100L, 50L, 0L)), class = "data.frame", row.names = c(NA, -3L))
Paste the Lat and Long columns of both tables together and select the rows that match:
result <- subset(table1, paste(Lat, Long) %in% paste(table2$Lat, table2$Long))
result
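If depth has to match as well, the same idea extends to a three-column key; a small sketch:
result <- subset(table1,
  paste(Lat, Long, Depth) %in% paste(table2$Lat, table2$Long, table2$Depth))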

do a column-1 for specific ending rows in a column in R

Hello, I have a df such as:
scaffolds start end
scaf1.1_0 1 40
scaf1.1_2 41 78
scaf1.1_3 79 300
seq_1f2.1 1 30
seq_1f3.1 1 90
seq_1f2.3 91 200
and I would like to do df$end - 1, but only for the last duplicated element of df$scaffolds (those that share a prefix and end in a number).
Here I should get this output:
scaffolds start end
scaf1.1_0 1 40
scaf1.1_2 41 78
scaf1.1_3 79 299
seq_1f2.1 1 30
seq_1f3.1 1 90
seq_1f2.3 91 199
where 300 - 1 = 299 because scaf1.1_3 was the last one,
and 200 - 1 = 199 because seq_1f2.3 was the last one.
As you can see, seq_1f2.1 did not change because it is not the last element of its group.
Maybe this is what you are looking for? Create grouping variables based on the strings, and subtract one from end for the last value per group:
library(tidyverse)
#Code
df %>%
  mutate(Var1 = substr(scaffolds, 1, 3),
         Var2 = as.numeric(substr(scaffolds, nchar(scaffolds), nchar(scaffolds)))) %>%
  group_by(Var1) %>%
  mutate(end = ifelse(Var2 == max(Var2), end - 1, end)) %>%
  ungroup() %>%
  select(-c(Var1, Var2))
Output:
# A tibble: 6 x 3
scaffolds start end
<chr> <int> <dbl>
1 scaf1.1_0 1 40
2 scaf1.1_2 41 78
3 scaf1.1_3 79 299
4 seq_1f2.1 1 30
5 seq_1f3.1 1 90
6 seq_1f2.3 91 199
Some data used:
#Data
df <- structure(list(scaffolds = c("scaf1.1_0", "scaf1.1_2", "scaf1.1_3",
"seq_1f2.1", "seq_1f3.1", "seq_1f2.3"), start = c(1L, 41L, 79L,
1L, 1L, 91L), end = c(40L, 78L, 300L, 30L, 90L, 200L)), class = "data.frame", row.names = c(NA,
-6L))
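The single-character substr() calls above assume one-digit suffixes and three-character prefixes. A sketch of a more general variant, assuming the group is everything before the trailing number and that singleton groups should stay unchanged:
library(dplyr)
df %>%
  mutate(grp = sub("[._][0-9]+$", "", scaffolds),            # prefix before the trailing number
         num = as.numeric(sub(".*[._]", "", scaffolds))) %>% # the trailing number itself
  group_by(grp) %>%
  mutate(end = ifelse(n() > 1 & num == max(num), end - 1, end)) %>%
  ungroup() %>%
  select(-grp, -num)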

How to combine columns based on contingencies?

I have the following df:
SUMLEV STATE COUNTY AGEGRP TOT_POP TOT_MALE
50 1 1 0 55601 26995
50 7 33 0 218022 105657
50 14 500 0 24881 13133
50 4 70 0 22400 11921
50 3 900 0 57840 28500
50 22 11 0 10138 5527
I would like to make a new column named CODE based on the columns STATE and COUNTY: paste the number from STATE onto the number from COUNTY. However, if COUNTY is a single- or double-digit number, I would like zeroes in front of it, like 001 and 033.
Ideally the final df would look like:
SUMLEV STATE COUNTY AGEGRP TOT_POP TOT_MALE CODE
50 1 1 0 55601 26995 1001
50 7 33 0 218022 105657 7033
50 14 500 0 24881 13133 14500
50 4 70 0 22400 11921 4070
50 3 900 0 57840 28500 3900
50 22 11 0 10138 5527 22011
Is there a short, elegant way of doing this?
We can use sprintf
library(dplyr)
df %>%
  mutate(CODE = sprintf('%d%03d', STATE, COUNTY))
# SUMLEV STATE COUNTY AGEGRP TOT_POP TOT_MALE CODE
#1 50 1 1 0 55601 26995 1001
#2 50 7 33 0 218022 105657 7033
#3 50 14 500 0 24881 13133 14500
#4 50 4 70 0 22400 11921 4070
#5 50 3 900 0 57840 28500 3900
#6 50 22 11 0 10138 5527 22011
If we need to split the column 'CODE' into two, we can use separate
library(tidyr)
df %>%
  mutate(CODE = sprintf('%d%03d', STATE, COUNTY)) %>%
  separate(CODE, into = c("CODE1", "CODE2"), sep = "(?=...$)")
Or use extract() to capture the substrings as groups:
df %>%
  mutate(CODE = sprintf('%d%03d', STATE, COUNTY)) %>%
  extract(CODE, into = c("CODE1", "CODE2"), "^(.*)(...)$")
Or with str_pad
library(stringr)
df %>%
  mutate(CODE = str_c(STATE, str_pad(COUNTY, width = 3, pad = '0')))
Or in base R
df$CODE <- sprintf('%d%03d', df$STATE, df$COUNTY)
data
df <- structure(list(SUMLEV = c(50L, 50L, 50L, 50L, 50L, 50L), STATE = c(1L,
7L, 14L, 4L, 3L, 22L), COUNTY = c(1L, 33L, 500L, 70L, 900L, 11L
), AGEGRP = c(0L, 0L, 0L, 0L, 0L, 0L), TOT_POP = c(55601L, 218022L,
24881L, 22400L, 57840L, 10138L), TOT_MALE = c(26995L, 105657L,
13133L, 11921L, 28500L, 5527L)), class = "data.frame", row.names = c(NA,
-6L))
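Since COUNTY never exceeds three digits here, plain arithmetic yields the same codes and gives a numeric column directly; a sketch, not part of the answers above:
# shift STATE three decimal places and add the county number
df$CODE <- df$STATE * 1000L + df$COUNTY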

What tidyverse command can count the number of non-zero entries column-wise over a time series?

I have a data table that looks like the following:
Item 2018 2019 2020 2021 2022 2023
Apples 10 12 17 18 0 0
Bears 40 50 60 70 80 90
Cats 5 2 1 0 0 0
Dogs 15 17 18 15 11 0
I want a column showing a count of the number of years with non-zero sales. That is:
Item 2018 2019 2020 2021 2022 2023 Count
Apples 10 12 17 18 0 0 4
Bears 40 50 60 70 80 90 6
Cats 5 2 1 0 0 0 3
Dogs 15 17 18 15 11 0 5
NB I'll want to do some analysis on this in the next pass, so I'm looking to just add the count column, not aggregate at this stage. The next step will be something like filtering the rows where the count is greater than a threshold.
I looked at the tally() command from the tidyverse, but it doesn't seem to do what I want (I think).
NB I haven't tagged this question as tidyverse due to the guidance on that tag. Shout if I need to edit this point.
As this is a row-wise operation, we can use rowSums after converting a subset of the dataset to logical:
library(tidyverse)
df1 %>%
  mutate(Count = rowSums(.[-1] > 0))
Or using reduce
df1 %>%
  mutate(Count = select(., -1) %>%
           mutate_all(~ . > 0) %>%
           reduce(`+`))
Or with pmap
df1 %>%
  mutate(Count = pmap_dbl(.[-1], ~ sum(c(...) > 0)))
# Item 2018 2019 2020 2021 2022 2023 Count
#1 Apples 10 12 17 18 0 0 4
#2 Bears 40 50 60 70 80 90 6
#3 Cats 5 2 1 0 0 0 3
#4 Dogs 15 17 18 15 11 0 5
data
df1 <- structure(list(Item = c("Apples", "Bears", "Cats", "Dogs"), `2018` = c(10L,
40L, 5L, 15L), `2019` = c(12L, 50L, 2L, 17L), `2020` = c(17L,
60L, 1L, 18L), `2021` = c(18L, 70L, 0L, 15L), `2022` = c(0L,
80L, 0L, 11L), `2023` = c(0L, 90L, 0L, 0L)), class = "data.frame",
row.names = c(NA, -4L))
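On current dplyr, the same count can be written with across(); a small sketch that also shows the threshold filter mentioned in the question (the cutoff of 3 is an arbitrary example):
library(dplyr)
df1 %>%
  mutate(Count = rowSums(across(-Item) > 0)) %>% # non-zero years per row
  filter(Count > 3)                              # e.g. keep items sold in 4+ years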
