Eliminate cases based on multiple rows values - r

I have a data frame with the following information:
Edit: each row is an individual who lives in a house. Several individuals, each with a unique P_ID and an AGE, can live in the same house and share its H_ID. I'm looking for every house, together with all of its individuals, where at least one person in the house is over 60. I hope that explains it better.
show(base)
H_ID P_ID AGE CONACT
1 10010000001 1001000000102 35 33
2 10010000001 1001000000103 12 31
3 10010000001 1001000000104 5 NA
4 10010000001 1001000000101 37 10
5 10010000002 1001000000206 5 NA
6 10010000002 1001000000205 10 NA
7 10010000002 1001000000204 18 31
8 10010000002 1001000000207 3 NA
9 10010000002 1001000000203 24 35
10 10010000002 1001000000202 43 33
11 10010000002 1001000000201 47 10
12 10010000003 1001000000302 26 33
13 10010000003 1001000000301 29 10
14 10010000004 1001000000401 56 32
15 10010000004 1001000000403 22 31
16 10010000004 1001000000402 49 10
17 10010000005 1001000000503 1 NA
18 10010000005 1001000000501 24 10
19 10010000005 1001000000502 23 10
20 10010000006 1001000000601 44 10
21 10010000007 1001000000701 69 32
I want a list of all the houses, and all the individuals living there, that satisfy the condition that at least one person in the house is 60+. Here's a link to the data: https://drive.google.com/drive/folders/1Od8zlOE3U3DO0YRGnBadFz804OUDnuQZ?usp=sharing
And here's how I made the base:
hogares<-read.csv("/home/servicio/Escritorio/TR_VIVIENDA01.CSV")
personas<-read.csv("/home/servicio/Escritorio/TR_PERSONA01.CSV")
datos<-merge(hogares,personas)
base<-data.frame(datos$ID_VIV, datos$ID_PERSONA, datos$EDAD, datos$CONACT)
base
Any help is much much appreciated, Thanks!

This can be done by:
Adding a variable with the maximum age per household
base$maxage <- ave(base$AGE, base$H_ID, FUN=max)
Then keep only the households whose maximum age is at least 60:
base <- subset(base, maxage >= 60)
Or you could combine the two lines into one. With the column names in your linked data:
> base <- subset(base, ave(base$datos.EDAD, base$datos.ID_VIV, FUN=max) >= 60)
> head(base)
datos.ID_VIV datos.ID_PERSONA datos.EDAD datos.CONACT
21 10010000007 1001000000701 69 32
22 10010000008 1001000000803 83 33
23 10010000008 1001000000802 47 33
24 10010000008 1001000000801 47 10
36 10010000012 1001000001204 4 NA
37 10010000012 1001000001203 2 NA
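As a minimal, self-contained sketch of the same idea, here is the ave() approach on a small made-up data frame (the values below are hypothetical, not taken from the linked data):

```r
# Hypothetical toy data: three households, only household 2 has someone aged 60+
base <- data.frame(
  H_ID = c(1, 1, 2, 2, 2, 3),
  P_ID = c(101, 102, 201, 202, 203, 301),
  AGE  = c(35, 12, 69, 30, 5, 44)
)

# ave() returns, for every row, the maximum AGE within that row's household
max_age <- ave(base$AGE, base$H_ID, FUN = max)

# keep every member of the households whose oldest member is at least 60
kept <- base[max_age >= 60, ]
kept
```

Because ave() returns one value per row, the result can be used directly as a row filter, so all members of a qualifying household are kept, not only the older person.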

Using dplyr, we can group_by H_ID and select houses where any AGE is greater than 60.
library(dplyr)
df %>% group_by(H_ID) %>% filter(any(AGE > 60))
Similarly with data.table
library(data.table)
setDT(df)[, .SD[any(AGE > 60)], H_ID]

To get a list of the houses with a tenant Age > 60 we can filter and create a list of distinct H_IDs
library(dplyr)
house_list <- base %>%
  filter(AGE > 60) %>%
  distinct(H_ID) %>%
  pull(H_ID)
Then we can filter the original dataframe based on that house_list to remove any households that do not have someone over the age of 60.
house_df <- base %>%
  filter(H_ID %in% house_list)
To then calculate the CON values we can filter out NA values in CONACT, group_by(H_ID) and summarize to find the number of individuals within each house that have a non-NA CONACT value.
CON_calcs <- house_df %>%
  filter(!is.na(CONACT)) %>%
  group_by(H_ID) %>%
  summarize(Count = n())
And join that back into the house_df based on H_ID to include the newly calculated CON values, and I believe that should end with your desired result.
final_df <- left_join(house_df, CON_calcs, by = 'H_ID')
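For comparison, the per-house count of non-missing CONACT values could also be sketched in base R with ave(). The data frame below is a hypothetical stand-in for house_df, and the Count column name simply mirrors the one used above:

```r
# Hypothetical stand-in for house_df (households already filtered)
house_df <- data.frame(
  H_ID   = c(7, 8, 8, 8, 12, 12),
  CONACT = c(32, 33, 33, 10, NA, NA)
)

# for every row, count the non-NA CONACT values within that row's household
house_df$Count <- ave(as.integer(!is.na(house_df$CONACT)),
                      house_df$H_ID, FUN = sum)
house_df
```

One difference to keep in mind: a household in which every CONACT is NA gets a Count of 0 here, whereas the filter-and-join approach leaves it as NA.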

Related

How can I create a matrix of random numbers with no repeats within a row, but repeats allowed within a column, in R?

How can I create a matrix of random numbers in which no value is repeated within a row?
Like this:
5 29 24 20 31 33
2 18 35 4 11 21
30 40 22 14 2 28
33 14 4 18 5 10
10 33 15 2 28 18
7 22 9 25 31 20
12 29 31 22 37 26
7 31 34 28 19 23
7 34 11 6 31 28
my code :
matrix(sample(1:42, 60, replace = FALSE), ncol = 6)
But I receive this error message:
Error in sample.int(length(x), size, replace, prob) : cannot take a
sample larger than the population when 'replace = FALSE'
I understand why this fails: there are only 42 values (1 to 42), so 60 distinct numbers cannot be drawn for the matrix.
You cannot generate all 60 numbers with a single sample() call, because you want to allow a number to reappear in a different row. You therefore need one sample per row. @Jav provided very neat code to accomplish this in a comment on the question:
t(sapply(1:10, function(x) sample(1:42, 6, replace = FALSE)))
If you want a different sample in each row, replicate can help you. However, replicate (like pretty much everything else in R) naturally works column-wise, so you have to transpose the result:
t(replicate(10, sample(1:42, 6)))
replace = FALSE is the default, so I didn't include it.
After transposing, 10 becomes the number of rows and 6 becomes the number of columns.
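As a quick sanity check, the per-row sampling can be sketched and verified like this (set.seed is there only to make the sketch reproducible, and the variable names are illustrative):

```r
set.seed(1)  # only for reproducibility of this sketch

# one sample of 6 distinct values per row, 10 rows in total
m <- t(replicate(10, sample(1:42, 6)))
dim(m)

# TRUE for every row that contains no repeated value
no_dup_in_row <- apply(m, 1, function(r) anyDuplicated(r) == 0)
all(no_dup_in_row)
```

Values may still repeat down a column, since each row is drawn independently of the others.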

multiplying columns in R

I have a data frame like this.
> abc
ID 1.x 2.x 1.y 2.y
1 4 10 20 30 40
2 16 5 10 5 10
3 42 16 17 18 19
4 91 20 20 20 20
5 103 103 42 56 84
How do I create two additional columns '1' and '2' by multiplying 1.x * 1.y and 2.x * 2.y in a generalized way?
I am looking for a generalized solution where the number of columns can be large, so I want to multiply every x column by its matching y column. The suffixes x and y are fixed, but n has to be determined from the data frame.
For simplicity, let's assume n is also fixed, but large.
One thing I can try is:
abc[,c(6,7)]=abc[,c(2,3)]*abc[,c(4,5)]
This works only if the column positions are contiguous, which is good enough for me. If anyone has a more general solution, it will benefit us all.
If there are only a couple of variables to multiply, we can do this with transform by multiplying the columns of interest
transform(abc, new1 = `1.x`*`1.y`, new2 = `2.x`*`2.y`, check.names = FALSE)
# ID 1.x 2.x 1.y 2.y new1 new2
#1 4 10 20 30 40 300 800
#2 16 5 10 5 10 25 100
#3 42 16 17 18 19 288 323
#4 91 20 20 20 20 400 400
#5 103 103 42 56 84 5768 3528
If we have lots of columns, then one approach is to split the dataset into a list of data.frames by taking the substring of the names, then loop through the list and multiply the two columns of each piece with do.call
abc[paste0("new", 1:2)] <- lapply(split.default(abc[-1],
sub("\\.[a-z]+$", "", names(abc)[-1])), function(x) do.call(`*`, x))
Or another option is (based on the pairwise column multiplication)
apply(aperm(array(unlist(abc[-1]), c(5, 2, 2)),
c(3, 1, 2)), 3, matrixStats::colProds)
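mutate() preserves the original variables and appends the new ones, so the two products can be added directly. mutate_all() instead applies a function to every column in the data frame; note that funs() is deprecated and mutate_all() has been superseded by across() in current dplyr.
library(dplyr)
abc %>%
  mutate(new_vary1 = `1.x` * `1.y`,
         new_vary2 = `2.x` * `2.y`)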
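When n is not known in advance, one possible sketch is to pair the columns by name rather than by position. It assumes every "n.x" column has a matching "n.y" column; the helper names x_cols, y_cols and prefixes are illustrative only:

```r
abc <- data.frame(ID = c(4, 16, 42),
                  `1.x` = c(10, 5, 16), `2.x` = c(20, 10, 17),
                  `1.y` = c(30, 5, 18), `2.y` = c(40, 10, 19),
                  check.names = FALSE)

# find the x columns, derive the prefix n, and build the matching y names
x_cols   <- grep("\\.x$", names(abc), value = TRUE)
prefixes <- sub("\\.x$", "", x_cols)
y_cols   <- paste0(prefixes, ".y")
stopifnot(all(y_cols %in% names(abc)))  # assumption: every x has a y partner

# element-wise product of the paired columns, stored as columns "1", "2", ...
abc[prefixes] <- abc[x_cols] * abc[y_cols]
abc
```

Because the pairing is done through the names, the columns do not need to be contiguous or in any particular order.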

Ceil and floor values in R

I have a data.table of integers with values between 1 and 60.
My question is about flooring or ceiling any number to the following values: 12 18 24 30 36 ... 60.
For example, let's say my data.table contains the number 13. I want R to "transform" this number into 12 and 18 as 13 lies in between those numbers. Moreover, if I have 18 I want R to keep it at 18.
If my data.table contains the value 50, I want R to convert that number into 48 and 54 and so on.
My goal is to get two different data.tables. One where the floored values are saved and one where the ceiled values are saved.
Any idea how one could do this in R?
EDIT: Numbers smaller than 12 should always be transformed to 12.
Example output:
If I have the following data.table: data.table(c(1,28,29,41,53,53,17,41,41,53))
I want the following two output data.tables:
floored values: data.table(c(12,24,24,36,48,48,12,36,36,48))
ceiled values: data.table(c(12,30,30,42,54,54,18,42,42,54))
Here is a fairly direct way (edited to round up to 12 if any values are below):
df <- data.frame(nums = 10:20)
df$floors <- with(df,pmax(12,6*floor(nums/6)))
df$ceils <- with(df,pmax(12,6*ceiling(nums/6)))
Leading to:
> df
nums floors ceils
1 10 12 12
2 11 12 12
3 12 12 12
4 13 12 18
5 14 12 18
6 15 12 18
7 16 12 18
8 17 12 18
9 18 18 18
10 19 18 24
11 20 18 24
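As a check, the same two expressions can be applied to the example values from the question (plain vectors are used here instead of a data.table to keep the sketch dependency-free):

```r
x <- c(1, 28, 29, 41, 53, 53, 17, 41, 41, 53)

# round down / up to a multiple of 6, never going below 12
floors <- pmax(12, 6 * floor(x / 6))
ceils  <- pmax(12, 6 * ceiling(x / 6))

floors  # 12 24 24 36 48 48 12 36 36 48
ceils   # 12 30 30 42 54 54 18 42 42 54
```

Both results match the expected output listed in the question.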
Here's a way we could do this, using sapply and the which.min functions. From your question, it's not immediately clear how values < 12 should be handled.
x <- 1:60
num_list <- seq(12, 60, 6)
floorr <- sapply(x, function(x){
diff_vec <- x - num_list
diff_vec <- ifelse(diff_vec < 0, Inf, diff_vec)
num_list[which.min(diff_vec)]
})
ceill <- sapply(x, function(x){
diff_vec <- num_list - x
diff_vec <- ifelse(diff_vec < 0, Inf, diff_vec)
num_list[which.min(diff_vec)]
})
tail(cbind(x, floorr, ceill))
x floorr ceill
[55,] 55 54 60
[56,] 56 54 60
[57,] 57 54 60
[58,] 58 54 60
[59,] 59 54 60
[60,] 60 60 60
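A vectorized variant of the same lookup can be sketched with base R's findInterval(), which avoids the explicit sapply loop. This is an alternative technique, not the method used above, and the names floor_idx and ceil_idx are illustrative:

```r
x <- c(1, 28, 29, 41, 53, 53, 17, 41, 41, 53)
num_list <- seq(12, 60, 6)

# index of the largest grid value <= x (0 when x < 12, so clamp to 1)
floor_idx <- pmax(1, findInterval(x, num_list))
floorr <- num_list[floor_idx]

# with left.open = TRUE the intervals are (a, b], so adding 1 gives the
# index of the smallest grid value >= x
ceil_idx <- findInterval(x, num_list, left.open = TRUE) + 1
ceill <- num_list[ceil_idx]

cbind(x, floorr, ceill)
```

As in the sapply version, values below 12 map to 12 in both results.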

Looping through rows, creating and reusing multiple variables

I am building a streambed hydrology calculator in R using multiple tables from an Access database. I am having trouble automating and calculating the same set of indices for multiple sites. The following sample dataset describes my data structure:
> Thalweg
StationID AB0 AB1 AB2 AB3 AB4 AB5 BC1 BC2 BC3 BC4 Xdep_Vdep
1 1AAUA017.60 47 45 44 55 54 6 15 39 15 11 18.29
2 1AXKR000.77 30 27 24 19 20 18 9 12 21 13 6.46
3 2-BGU005.95 52 67 62 42 28 25 23 26 11 19 20.18
4 2-BLG011.41 66 85 77 83 63 35 10 70 95 90 67.64
5 2-CSR003.94 29 35 46 14 19 14 13 13 21 48 6.74
where each column represents certain field-measured parameters (i.e. depth of a reach section) and each row represents a different site.
I have successfully used the apply functions to simultaneously calculate simple functions on multiple rows:
> Xdepth <- apply(Thalweg[, 2:11], 1, mean) # Mean Depth
> Xdepth
1 2 3 4 5
33.1 19.3 35.5 67.4 25.2
and appending the results back to the proper station in a dataframe.
However, I am struggling when I want to calculate and save variables that are subsequently used for further calculations. I cannot seem to loop or apply the same function to multiple columns on a single row and complete the same calculations over the next row without mixing variables and data.
I want to do:
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + other_variables), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + other_variables), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + other_variables), Thalweg$AB3)
# etc.
Depth_AB0 <- (Thalweg$AB0 - Residual_AB0)
Depth_AB1 <- (Thalweg$AB1 - Residual_AB1)
Depth_AB2 <- (Thalweg$AB2 - Residual_AB2)
# etc.
I have tried and subsequently failed at for loops such as:
for (i in nrow(Thalweg)){
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + Stacks_Equation), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + Stacks_Equation), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + Stacks_Equation), Thalweg$AB3)
Residuals <- data.frame(Thalweg$StationID, Residual_AB0, Residual_AB1, Residual_AB2, Residual_AB3)
}
Is there a better way to approach looping through multiple lines of data when I need unique variables saved for each specific row that I am currently calculating? Thank you for any suggestions.
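Your exact problem is still a mystery to me, but it looks like you want a double for loop: the outer loop runs over the sites (rows), the inner loop over the depth columns, carrying the residual forward.
residuals <- numeric(nrow(Thalweg))  # one result per site
for (i in seq_len(nrow(Thalweg))) {
  residual <- Thalweg[i, "Xdep_Vdep"]
  for (j in 2:11) {  # columns AB0 ... BC4
    residual <- min(residual, Thalweg[i, j])
  }
  residuals[i] <- residual  # keep the final value for this row
}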
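If every intermediate residual needs to be kept for each site, one possible sketch uses Reduce() with accumulate = TRUE inside a per-row function. Here increment is a hypothetical placeholder for the asker's "other_variables", and only two sites and four depth columns from the sample table are used:

```r
# Two sites and four depth columns taken from the question's sample table
Thalweg <- data.frame(
  StationID = c("1AAUA017.60", "1AXKR000.77"),
  AB0 = c(47, 30), AB1 = c(45, 27), AB2 = c(44, 24), AB3 = c(55, 19),
  Xdep_Vdep = c(18.29, 6.46)
)
depth_cols <- c("AB0", "AB1", "AB2", "AB3")
increment <- 5  # hypothetical placeholder for "other_variables"

residual_for_row <- function(i) {
  depths <- unlist(Thalweg[i, depth_cols])
  first <- min(Thalweg$Xdep_Vdep[i], depths[1])  # Residual_AB0
  # each later residual is min(previous residual + increment, current depth)
  Reduce(function(prev, d) min(prev + increment, d),
         depths[-1], first, accumulate = TRUE)
}

# one row of residuals per site, then the matching depths
Residuals <- t(sapply(seq_len(nrow(Thalweg)), residual_for_row))
colnames(Residuals) <- paste0("Residual_", depth_cols)
Depths <- as.matrix(Thalweg[depth_cols]) - Residuals
colnames(Depths) <- paste0("Depth_", depth_cols)
result <- data.frame(StationID = Thalweg$StationID, Residuals, Depths)
result
```

Each row is processed independently, so the variables of one site cannot leak into the calculation for the next one.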

Cumulative count of values in R

I hope you are doing very well. I would like to know how to calculate the cumulative sum of a data set with certain conditions. A simplified version of my data set would look like:
t id
A 22
A 22
R 22
A 41
A 98
A 98
A 98
R 98
A 46
A 46
R 46
A 46
A 46
A 46
R 46
A 46
A 12
R 54
A 66
R 13
A 13
A 13
A 13
A 13
R 13
A 13
I would like to make a new data set where, for each value of "id", I have the cumulative number of times that id has appeared, but where the count restarts whenever t = R, e.g.
t id count
A 22 1
A 22 2
R 22 0
A 41 1
A 98 1
A 98 2
A 98 3
R 98 0
A 46 1
A 46 2
R 46 0
A 46 1
A 46 2
A 46 3
R 46 0
A 46 1
A 12 1
R 54 0
A 66 1
R 13 0
A 13 1
A 13 2
A 13 3
A 13 4
R 13 0
A 13 1
Any ideas as to how to do this? Thanks in advance.
Using rle:
out <- transform(df, count = sequence(rle(do.call(paste, df))$lengths))
out$count[out$t == "R"] <- 0
If your data.frame has more than these two columns, and you want to check only these two columns, then, just replace df with df[, 1:2] (or) df[, c("t", "id")].
If you find do.call(paste, df) dangerous (as @flodel comments), then you can replace that with:
as.character(interaction(df))
I personally don't find anything dangerous or clumsy with this setup (as long as you have the right separator, meaning you know your data well). However, if you do find it as such, the second solution may help you.
Update:
For those who don't like using do.call(paste, df) or as.character(interaction(df)) (please see the comment exchanges between me, @flodel and @HongOoi), here's another base solution:
idx <- which(df$t == "R")
ww <- NULL
if (length(idx) > 0) {
ww <- c(min(idx), diff(idx), nrow(df)-max(idx))
df <- transform(df, count = ave(id, rep(seq_along(ww), ww),
FUN=function(y) sequence(rle(y)$lengths)))
df$count[idx] <- 0
} else {
df$count <- sequence(rle(df$id)$lengths)  # no "R" rows: count within each run of id
}
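Another way to express the same logic, sketched here on the first rows of the example data, is to group by the run of identical ids and by the number of "R" rows seen so far, and then take a cumulative sum within each group. The helper names run_id and block are illustrative:

```r
df <- data.frame(
  t  = c("A", "A", "R", "A", "A", "A", "A", "R", "A", "A", "R", "A"),
  id = c(22, 22, 22, 41, 98, 98, 98, 98, 46, 46, 46, 46)
)

# run_id increases whenever id changes from one row to the next
run_id <- cumsum(c(TRUE, df$id[-1] != df$id[-nrow(df)]))
# block increases at every "R" row, so each "R" starts a new group
block <- cumsum(df$t == "R")

# within each group the "R" row contributes 0 and each "A" row adds 1
df$count <- ave(as.integer(df$t != "R"), run_id, block, FUN = cumsum)
df
```

This needs no special case for data without any "R" rows, because block is then constant and the count simply runs within each run of id.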
