Conditionally remove items from a list in R

I know there are many similar questions about removing items from a list, but I've been unable to solve my problem in particular - and I appreciate the help.
Simply put, I'd like to remove any entry (row) whose lon value is greater than -74.
list(structure(c(40.7571907043457, 40.7601699829102, 40.761848449707,
40.7660789489746, -73.9972381591797, -74.0038146972656, -74.0072479248047,
-74.0172576904297), .Dim = c(4L, 2L), .Dimnames = list(c("1",
"2", "3", "4"), c("lat", "lon"))), structure(c(40.7582893371582,
40.760498046875, 40.7620582580566, 40.7662887573242, -73.9975280761719,
-74.0031967163086, -74.0070190429688, -74.0170593261719), .Dim = c(4L,
2L), .Dimnames = list(c("1", "2", "3", "4"), c("lat", "lon"))))
Thanks so much.

If you only need to look at the lon column (the one with the negative values), then simply:
lapply(your_list, function(i) i[i[, 2] <= -74, , drop = FALSE])
In case you want to check both columns:
lapply(your_list, function(i) i[rowSums(i <= -74) > 0, , drop = FALSE])
Both give the same result here:
[[1]]
lat lon
2 40.76017 -74.00381
3 40.76185 -74.00725
4 40.76608 -74.01726
[[2]]
lat lon
2 40.76050 -74.00320
3 40.76206 -74.00702
4 40.76629 -74.01706
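For a self-contained check, the list from the question can be rebuilt (values truncated to 7 decimals here) and filtered; `drop = FALSE` preserves the matrix shape even if only one row were to survive:

```r
# Rebuild the two 4x2 lat/lon matrices from the question (values rounded)
your_list <- list(
  matrix(c(40.7571907, 40.7601700, 40.7618484, 40.7660789,
           -73.9972382, -74.0038147, -74.0072479, -74.0172577),
         ncol = 2, dimnames = list(as.character(1:4), c("lat", "lon"))),
  matrix(c(40.7582893, 40.7604980, 40.7620583, 40.7662888,
           -73.9975281, -74.0031967, -74.0070190, -74.0170593),
         ncol = 2, dimnames = list(as.character(1:4), c("lat", "lon"))))

# Keep only rows whose lon is at most -74
filtered <- lapply(your_list, function(i) i[i[, "lon"] <= -74, , drop = FALSE])
sapply(filtered, nrow)  # 3 3
```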


Hierarchy across rows for the same id

So, I have a data set with a lot of observations for X individuals, with multiple rows for some individuals. Each row has been assigned a classification (the variable clinical_significance) that takes three values, in prioritized order: definite disease, possible, colonization. Now I would like to have only one row per individual, carrying the "highest" classification across that individual's rows, i.e. definite disease if present, otherwise possible, then colonization. Any good suggestions on how to solve this?
For instance, as seen in the example, I would like all ID #23 clinical_significance values to be 'definite disease', as this outranks 'possible'.
id id_row number_of_samples species_ny clinical_significa…
18 1 2 MAC possible
18 2 2 MAC possible
20 1 2 scrofulaceum possible
20 2 2 scrofulaceum possible
23 1 2 MAC possible
23 2 2 MAC definite disease
Making a reproducible example:
df <- structure(
list(
id = c("18", "18", "20", "20", "23", "23"),
id_row = c("1","2", "1", "2", "1", "2"),
number_of_samples = c("2", "2", "2","2", "2", "2"),
species_ny = c("MAC", "MAC", "scrofulaceum", "scrofulaceum", "MAC", "MAC"),
clinical_significance = c("possible", "possible", "possible", "possible", "possible", "definite disease")
),
row.names = c(NA, -6L), class = c("data.frame")
)
The idea is to turn clinical_significance into a factor, which is stored as an integer rather than a character (i.e. 1 = definite disease, 2 = possible, 3 = colonization). Then, for each ID, take the row with the lowest level number.
library(dplyr)

df_prio <- df |>
  mutate(
    fct_clin_sig = factor(
      clinical_significance,
      levels = c("definite disease", "possible", "colonization")
    )
  ) |>
  group_by(id) |>
  slice_min(fct_clin_sig, with_ties = FALSE)  # with_ties = FALSE keeps exactly one row per id
I fixed it using
df <- df %>%
  group_by(id) %>%
  mutate(
    clinical_significance_new = ifelse(
      any(clinical_significance == "definite disease"),
      "definite disease",
      as.character(clinical_significance)
    )
  )
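The same prioritization can also be done in base R without dplyr; a sketch (the `prio` lookup vector and the column subset are introduced here for illustration):

```r
# Base-R sketch: rank each classification, then keep the best-ranked
# (lowest number) row within each id
df <- data.frame(
  id = c("18", "18", "20", "20", "23", "23"),
  clinical_significance = c("possible", "possible", "possible",
                            "possible", "possible", "definite disease"))

prio <- c("definite disease" = 1, "possible" = 2, "colonization" = 3)
df$rank <- prio[df$clinical_significance]

best <- df[order(df$id, df$rank), ]
best <- best[!duplicated(best$id), c("id", "clinical_significance")]
best$clinical_significance  # "possible" "possible" "definite disease"
```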

Replicate and append to dataframe in R

I believe this is fairly simple, although I am new to using R and code. I have a dataset which has a single row for each rodent trap site. There were, however, 8 occasions of trapping over 4 years. What I wish to do is expand the trap site data and append a number from 1 to 8 to each row.
Then I can label each row with the trap visit for a subsequent join with the obtained trap data.
I have managed to replicate the rows with the following code. And while the rows are expanded in the data frame to 1, 1.1...1.7, 2, 2.1...2.7, etc., I cannot figure out how to convert this to a usable column-based ID.
structure(list(TrapCode = c("IA1sA", "IA2sA", "IA3sA", "IA4sA",
"IA5sA"), Y = c(-12.1355987315, -12.1356879776, -12.1357664998,
-12.1358823313, -12.1359720852), X = c(-69.1335789865, -69.1335225279,
-69.1334668485, -69.1333847769, -69.1333226532)), row.names = c(NA,
5L), class = "data.frame")
gps_1 <- gps_1[rep(seq_len(nrow(gps_1)), 3), ]
gives
"IA5sA", "IA1sA", "IA2sA", "IA3sA", "IA4sA", "IA5sA", "IA1sA",
"IA2sA", "IA3sA", "IA4sA", "IA5sA"), Y = c(-12.1355987315, -12.1356879776,
-12.1357664998, -12.1358823313, -12.1359720852, -12.1355987315,
-12.1356879776, -12.1357664998, -12.1358823313, -12.1359720852,
-12.1355987315, -12.1356879776, -12.1357664998, -12.1358823313,
-12.1359720852), X = c(-69.1335789865, -69.1335225279, -69.1334668485,
-69.1333847769, -69.1333226532, -69.1335789865, -69.1335225279,
-69.1334668485, -69.1333847769, -69.1333226532, -69.1335789865,
-69.1335225279, -69.1334668485, -69.1333847769, -69.1333226532
)), row.names = c("1", "2", "3", "4", "5", "1.1", "2.1", "3.1",
"4.1", "5.1", "1.2", "2.2", "3.2", "4.2", "5.2"), class = "data.frame")
I have a column, Trap_ID, which is currently a unique identifier. I hope that after the replication I can append an iteration number to it to keep it a unique ID.
For example:
Trap_ID
IA1sA.1
IA1sA.2
IA1sA.3
IA2sA.1
IA2sA.2
IA2sA.3
Simply use a cross join (i.e., a join with no by columns, which returns the Cartesian product of both sets):
mdf <- merge(data.frame(Trap_ID = 1:8), trap_side_df, by=NULL)
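Alternatively, a sketch with `rep()` plus `paste()` builds the combined ID in the requested `IA1sA.1` form directly (the names `n_visits`, `expanded`, and `Visit` are introduced here; the site data frame is assumed to be `gps_1` as in the question):

```r
gps_1 <- data.frame(TrapCode = c("IA1sA", "IA2sA", "IA3sA", "IA4sA", "IA5sA"))
n_visits <- 8

# Repeat each site row once per visit, then tag every row with its visit number
expanded <- gps_1[rep(seq_len(nrow(gps_1)), each = n_visits), , drop = FALSE]
expanded$Visit <- rep(seq_len(n_visits), times = nrow(gps_1))
expanded$Trap_ID <- paste(expanded$TrapCode, expanded$Visit, sep = ".")

head(expanded$Trap_ID, 3)  # "IA1sA.1" "IA1sA.2" "IA1sA.3"
```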

Apply function on a list of tables by column

I have a list of table objects, created as follows:
lst1 <- list(X1A.1145442 = structure(c(0.3204, 0.6796, 0.3645, 0.6355,
0.1615, 0.8385, 0.3266, 0.6734, 0.2884, 0.7116, 0.3042, 0.6958),
.Dim = c(2L, 6L), class = "table", .Dimnames = list(x = c("1", "2"),
c("ES1-5", "ES14-26", "ES27-38", "ES6-13", "SA1-13", "SA14-25"))),
X1A.1158042 = structure(c(0.4437, 0.5563, 0.4264, 0.5736, 0.2308, 0.7692,
0.3896, 0.6104, 0.2997, 0.7003, 0.3148, 0.6852), .Dim = c(2L, 6L),
class = "table", .Dimnames = list(x = c("1", "2"),
c("ES1-5", "ES14-26", "ES27-38", "ES6-13", "SA1-13", "SA14-25"))))
The list looks this way :
$`X1A.1145442`
x ES1-5 ES14-26 ES27-38 ES6-13 SA1-13 SA14-25
1 0.3204 0.3645 0.1615 0.3266 0.2884 0.3042
2 0.6796 0.6355 0.8385 0.6734 0.7116 0.6958
$X1A.1158042
x ES1-5 ES14-26 ES27-38 ES6-13 SA1-13 SA14-25
1 0.4437 0.4264 0.2308 0.3896 0.2997 0.3148
2 0.5563 0.5736 0.7692 0.6104 0.7003 0.6852
I would like to obtain the minimum value of each column, for every table in the list.
I tried something with lapply but without success. Could someone help me with that, please?
Regards,
Alex
It is a list of matrices, so the unit of iteration is each list element: lapply will loop over each matrix (unless it is first converted to a data.frame). Within each matrix we can then make use of apply with MARGIN = 2 to loop over the columns:
lapply(lst1, function(x) apply(x, 2, min))
Or another option is colMins from matrixStats
library(matrixStats)
lapply(lst1, colMins)
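As a quick check on the first table from the question (rebuilt here as a plain matrix), the apply approach returns one named minimum per column:

```r
# First table from the question, as a 2x6 matrix
lst1 <- list(X1A.1145442 = matrix(
  c(0.3204, 0.6796, 0.3645, 0.6355, 0.1615, 0.8385,
    0.3266, 0.6734, 0.2884, 0.7116, 0.3042, 0.6958),
  nrow = 2,
  dimnames = list(x = c("1", "2"),
                  c("ES1-5", "ES14-26", "ES27-38", "ES6-13", "SA1-13", "SA14-25"))))

# Column-wise minimum within each list element
mins <- lapply(lst1, function(x) apply(x, 2, min))
mins$X1A.1145442
#  ES1-5 ES14-26 ES27-38  ES6-13  SA1-13 SA14-25
# 0.3204  0.3645  0.1615  0.3266  0.2884  0.3042
```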

Counting the rows in a data frame based on integer ranges

I have a data frame that lists a bunch of objects and their values.
Name NumCpu MemoryMB
1 BEAVERTN-SVR-C5 1 3072
2 BEAVERTN-SVR-UK 4 4096
3 BEAVERTN-SVR-JV 1 1024
I want to take my data frame and create a new column that groups these numbers by ranges.
Ranges: 0-1024, 1025-2048, 2049-4096
And then output the counts of those ranges into a new data frame:
Range Count
0-1024 1
1025-2048 0
2049-4096 2
I learn by doing, so this is a real work problem I'm trying to use R to solve. Any help greatly appreciated. Thank you!
Data
DF <- structure(list(Name = c("BEAVERTN-SVR-C5", "BEAVERTN-SVR-UK",
"BEAVERTN-SVR-JV"), NumCpu = c(1L, 4L, 1L), MemoryMB = c(3072L,
4096L, 1024L), Range = structure(c(3L, 3L, 1L), .Label = c("(0,1.02e+03]",
"(1.02e+03,2.05e+03]", "(2.05e+03,4.1e+03]"), class = "factor")), .Names = c("Name",
"NumCpu", "MemoryMB", "Range"), row.names = c("1", "2", "3"), class = "data.frame")

Fast computation of average proximity in proximity matrix

I've got a similarity matrix between all cases and, in a separate data frame, the classes of these cases. I want to compute the average similarity between cases from the same class: for a case n from class j, we have to compute the sum of all squared proximities between n and every case k that comes from the same class as n (see the outlier-measure description at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#outliers).
I implemented this with two for loops, but it is really slow. Is there a faster way to do this in R?
Thanks.
//DATA (dput)
Data frame with classes:
structure(list(class = structure(c(1L, 2L, 2L, 1L, 3L, 3L, 1L,
1L, 2L, 3L), .Label = c("1", "2", "3", "5", "6", "7"), class = "factor")), .Names = "class", row.names = c(NA,
-10L), class = "data.frame")
Proximity matrix (row m and column m correspond to class in row m of data frame above):
structure(c(1, 0.60996875, 0.51775, 0.70571875, 0.581375, 0.42578125,
0.6595, 0.7134375, 0.645375, 0.468875, 0.60996875, 1, 0.77021875,
0.55171875, 0.540375, 0.53084375, 0.4943125, 0.462625, 0.7910625,
0.56321875, 0.51775, 0.77021875, 1, 0.451375, 0.60353125, 0.62353125,
0.5203125, 0.43934375, 0.6909375, 0.57159375, 0.70571875, 0.55171875,
0.451375, 1, 0.69196875, 0.59390625, 0.660375, 0.76834375, 0.606875,
0.65834375, 0.581375, 0.540375, 0.60353125, 0.69196875, 1, 0.7194375,
0.684, 0.68090625, 0.50553125, 0.60234375, 0.42578125, 0.53084375,
0.62353125, 0.59390625, 0.7194375, 1, 0.53665625, 0.553125, 0.513,
0.801625, 0.6595, 0.4943125, 0.5203125, 0.660375, 0.684, 0.53665625,
1, 0.8456875, 0.52878125, 0.65303125, 0.7134375, 0.462625, 0.43934375,
0.76834375, 0.68090625, 0.553125, 0.8456875, 1, 0.503, 0.6215,
0.645375, 0.7910625, 0.6909375, 0.606875, 0.50553125, 0.513,
0.52878125, 0.503, 1, 0.60653125, 0.468875, 0.56321875, 0.57159375,
0.65834375, 0.60234375, 0.801625, 0.65303125, 0.6215, 0.60653125,
1), .Dim = c(10L, 10L))
Correct result:
c(2.44197227050781, 2.21901680175781, 2.07063155175781, 2.52448621289062,
1.88040830957031, 2.16019295703125, 2.58622273828125, 2.81453253222656,
2.1031745078125, 2.00542063378906)
Should be possible. Your notation does not make clear whether members of like classes are found in the rows or the columns, so this answer presumes the columns, but the obvious modifications would work just as well for rows. The starting point is
colSums(mat^2) # in R, ^2 is element-wise squaring rather than matrix multiplication
Since both operations are vectorized, this should be much faster than for loops.
With the modification and assuming the matrix is named 'mat' and the class-dataframe named 'cldf':
sapply( 1:nrow(mat) ,
function(r) sum(mat[r, cldf[['class']][r] == cldf[['class']] ]^2) )
[1] 2.441972 2.219017 2.070632 2.524486 1.880408 2.160193 2.586223 2.814533 2.103175 2.005421
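The row loop can also be removed entirely with a same-class indicator matrix; a sketch on a small synthetic proximity matrix (the names `S`, `res_fast`, and `res_loop` are mine), checked against the sapply formulation above:

```r
set.seed(42)
n <- 6
mat <- matrix(runif(n * n), n, n)
mat <- (mat + t(mat)) / 2   # make it symmetric, like a proximity matrix
diag(mat) <- 1
cl <- c("1", "2", "2", "1", "3", "3")

# S[i, k] is TRUE exactly when cases i and k share a class
S <- outer(cl, cl, "==")

# Zero out cross-class entries of the squared proximities, then sum rows
res_fast <- rowSums((mat^2) * S)

# Reference: the sapply formulation from the answer above
res_loop <- sapply(seq_along(cl), function(r) sum(mat[r, cl[r] == cl]^2))
all.equal(res_fast, res_loop)  # TRUE
```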
