R - Delete Observations if More Than 25% of a Group

This is my first post! I started using R about a year ago and I have learned a lot from this sub over the last few months! Thanks for all of your help so far.
Here is what I am trying to do:
• Group Data by POS
• Within each POS group, no ORG should represent more than 25% of the dataset
• If an ORG represents more than 25% of an observation (column), the value furthest from the mean should be deleted. I think this would loop until the data from that ORG no longer exceed 25% of the observation.
I am not sure how to approach this problem, as I am not too familiar with R functions. I am assuming this would require a function.
Here is the sample dataset:
print(Example)
# A tibble: 18 x 13
Org Pos obv1 obv2 obv3 obv4 obv5 obv6 obv7 obv8 obv9 obv10 obv11
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 34.6 26.2 43.1 NA NA NA NA NA NA NA NA
2 2 1 18.7 15.5 23.4 NA NA NA NA NA NA NA NA
3 3 1 16.2 14.4 21.7 NA NA NA NA NA NA NA 1.32
4 3 1 20.0 15.5 23.4 NA NA 1.32 2.78 1.44 NA NA 1.89
5 3 1 2.39 16.9 24.1 NA NA 1.13 1.52 1.12 NA NA 2.78
6 3 1 24.3 15.4 24.6 NA NA 1.13 1.89 1.13 NA NA 1.51
7 6 1 16.7 16.0 23.4 0.19 NA 0.83 1.3 0.94 1.78 2.15 1.51
8 6 1 18.7 16.4 25.8 0.19 NA 1.22 1.4 0.97 1.93 2.35 1.51
9 6 1 19.3 16.4 25.8 0.19 NA 1.22 1.4 0.97 1.93 2.35 1.51
10 7 1 23.8 18.6 28.6 NA NA NA NA NA NA NA NA
11 12 2 28.8 24.4 39.7 NA NA 1.13 1.89 1.32 2.46 3.21 NA
12 13 2 24.6 19.6 29.4 0.16 NA 3.23 3.23 2.27 NA NA NA
13 14 2 18.4 15.5 24.8 NA NA 2.27 3.78 1.13 3.46 4.91 2.78
14 15 2 23.8 24.4 39.7 NA NA NA NA NA NA NA NA
15 15 2 25.8 24.4 39.7 NA NA NA NA NA NA NA NA
16 16 2 18.9 17.4 26.9 0.15 NA NA 1.89 2.99 NA NA 1.51
17 16 2 22.1 17.3 26.9 NA NA NA 2.57 0.94 NA NA 1.51
18 16 2 24.3 19.6 28.5 0.15 NA NA 1.51 1.32 NA NA 2.27
The result would look something like this:
print(Result)
# A tibble: 18 x 13
Org Pos obv1 obv2 obv3 obv4 obv5 obv6 obv7 obv8 obv9 obv10 obv11
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 34.6 26.2 43.1 NA NA NA NA NA NA NA NA
2 2 1 18.7 15.5 23.4 NA NA NA NA NA NA NA NA
3 3 1 NA NA NA NA NA NA NA NA NA NA NA
4 3 1 20.0 15.5 23.4 NA NA 1.32 2.78 1.44 NA NA NA
5 3 1 NA NA NA NA NA NA NA NA NA NA NA
6 3 1 NA NA NA NA NA NA NA NA NA NA 1.51
7 6 1 16.7 16.0 23.4 0.19 NA NA NA NA NA NA NA
8 6 1 NA NA NA NA NA 1.22 1.4 0.97 1.93 2.35 1.51
9 6 1 19.3 16.4 25.8 NA NA NA NA NA NA NA NA
10 7 1 23.8 18.6 28.6 NA NA NA NA NA NA NA NA
11 12 2 28.8 24.4 39.7 NA NA 1.13 1.89 1.32 2.46 3.21 NA
12 13 2 24.6 19.6 29.4 0.16 NA 3.23 3.23 2.27 NA NA NA
13 14 2 18.4 15.5 24.8 NA NA 2.27 3.78 1.13 3.46 4.91 2.78
14 15 2 NA NA NA NA NA NA NA NA NA NA NA
15 15 2 25.8 24.4 39.7 NA NA NA NA NA NA NA NA
16 16 2 NA NA NA NA NA NA 1.89 2.99 NA NA NA
17 16 2 22.1 17.3 26.9 NA NA NA 2.57 0.94 NA NA 1.51
18 16 2 NA NA NA NA NA NA NA NA NA NA NA
Any advice would be appreciated. Thanks!
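One possible approach, as a rough sketch rather than a definitive answer. The rule is ambiguous in places, so the code assumes that shares are recomputed over the remaining non-NA values after every deletion, that the Org with the largest share is trimmed first, and that "the mean" is the mean of the values still present in the Pos group. `trim_group()` and the `toy` data frame are hypothetical names used only for illustration.

```r
# Assumption: trim one value at a time until no Org exceeds max_share
trim_group <- function(values, org, max_share = 0.25) {
  org <- as.character(org)
  repeat {
    keep <- !is.na(values)
    if (!any(keep)) break                      # nothing left to trim
    share <- table(org[keep]) / sum(keep)      # share of each Org among non-NA values
    if (max(share) <= max_share) break         # rule satisfied
    worst <- names(share)[which.max(share)]    # Org with the largest share
    idx <- which(keep & org == worst)
    dist <- abs(values[idx] - mean(values[keep]))
    values[idx[which.max(dist)]] <- NA         # drop the value furthest from the mean
  }
  values
}

# Toy data (made up): Org 5 holds 2 of 6 values in Pos 1
toy <- data.frame(
  Org  = c(1, 2, 3, 4, 5, 5, 7, 8),
  Pos  = c(1, 1, 1, 1, 1, 1, 2, 2),
  obv1 = c(10, 10, 10, 10, 10, 40, 5, 6)
)

# Apply the rule to each observation column within each Pos group
trimmed <- toy
for (col in "obv1") {
  pieces <- Map(trim_group, split(toy[[col]], toy$Pos), split(toy$Org, toy$Pos))
  trimmed[[col]] <- unsplit(pieces, toy$Pos)
}
trimmed$obv1
```

Under these assumptions, a Pos group with fewer than four Orgs can never satisfy the 25% rule, so the loop empties it (Pos 2 in the toy data). If that is not what you want, the stopping condition would need a guard for small groups.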

Related

How to use rollapplyr while ignoring NA values?

I have weather data with NAs sporadically throughout and I want to calculate rolling means. I have been using the rollapplyr function within zoo but even though I include partial = TRUE, it still puts a NA whenever, for example, there is a NA in 1 of the 30 values to be averaged.
Here is the formula:
weather_rolled <- weather %>%
mutate(maxt30 = rollapplyr(max_temp, 30, mean, partial = TRUE))
Here's my data:
A tibble: 7,160 x 11
station_name date max_temp avg_temp min_temp rainfall rh avg_wind_speed dew_point avg_bare_soil_temp total_solar_rad
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 VEGREVILLE 2019-01-01 0.9 -7.9 -16.6 1 81.7 20.2 -7.67 NA NA
2 VEGREVILLE 2019-01-02 5.5 1.5 -2.5 0 74.9 13.5 -1.57 NA NA
3 VEGREVILLE 2019-01-03 3.3 -0.9 -5 0.5 80.6 10.1 -3.18 NA NA
4 VEGREVILLE 2019-01-04 -1.1 -4.7 -8.2 5.2 92.1 8.67 -4.76 NA NA
5 VEGREVILLE 2019-01-05 -3.8 -6.5 -9.2 0.2 92.6 14.3 -6.81 NA NA
6 VEGREVILLE 2019-01-06 -3 -4.4 -5.9 0 91.1 16.2 -5.72 NA NA
7 VEGREVILLE 2019-01-07 -5.8 -12.2 -18.5 0 75.5 30.6 -16.9 NA NA
8 VEGREVILLE 2019-01-08 -17.4 -21.6 -25.7 1.2 67.8 16.1 -26.1 NA NA
9 VEGREVILLE 2019-01-09 -12.9 -15.1 -17.4 0.2 71.5 14.3 -17.7 NA NA
10 VEGREVILLE 2019-01-10 -13.2 -17.9 -22.5 0.4 80.2 3.38 -21.8 NA NA
# ... with 7,150 more rows
Essentially, whenever a NA appears midway through, it results in a lot of NAs for the rolling mean. I want to still calculate the rolling mean within that time frame, ignoring the NAs. Does anyone know a way to get around this? I have been searching online for hours to no avail.
Thanks!
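A likely fix, sketched on a made-up vector: `partial = TRUE` only controls how incomplete windows at the edges are handled, while NA handling is up to the function being applied. `rollapplyr()` forwards extra arguments to that function, so `na.rm = TRUE` can be passed through to `mean`. The short vector and the window width of 3 below are assumptions for illustration.

```r
library(zoo)

# Made-up stand-in for the max_temp column
max_temp <- c(1, NA, 3, 4, NA, 6)

# na.rm = TRUE is forwarded to mean(); partial = TRUE keeps the short leading windows
rolled <- rollapplyr(max_temp, 3, mean, na.rm = TRUE, partial = TRUE)
rolled
```

In the pipeline from the question, this would correspond to `mutate(maxt30 = rollapplyr(max_temp, 30, mean, na.rm = TRUE, partial = TRUE))`. A window that contains only NAs would still give NaN, since there is nothing to average.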

Adding multiple NA rows to a data frame

I want to calculate something using a loop. The loop needs to run for more iterations than I have observations, so I first want to add NA rows to my data frame. Let's say my data frame is the mtcars data. How can I add 10 NA rows to mtcars? The add_row() function from dplyr only lets me add one NA row, and rbind() also didn't help me add multiple rows. I probably need to loop this one too, but maybe there is an easy solution.
To add several rows of NA to the dataframe, you can try this:
n <- 2 # Number of NA rows
rbind(df, matrix(NA, nrow = n, ncol = NCOL(df),
dimnames = list(NULL, colnames(df))))
There are many ways to do that. One way with base-R:
#make a data.frame with 10 rows of NA
na_frame <- as.data.frame(matrix(NA, nrow = 10, ncol = 11))
#add the names of mtcars
names(na_frame) <- names(mtcars)
#bind together
rbind(mtcars, na_frame)
output:
#truncated
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
1 NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA NA
7 NA NA NA NA NA NA NA NA NA NA NA
8 NA NA NA NA NA NA NA NA NA NA NA
9 NA NA NA NA NA NA NA NA NA NA NA
10 NA NA NA NA NA NA NA NA NA NA NA
Alternatives that start with the base frame itself:
base R
mt <- mtcars[1:3,]
rbind(mt, mt[0,][rep(NA, 11),])
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# NA NA NA NA NA NA NA NA NA NA NA NA
# NA.1 NA NA NA NA NA NA NA NA NA NA NA
# NA.2 NA NA NA NA NA NA NA NA NA NA NA
# NA.3 NA NA NA NA NA NA NA NA NA NA NA
# NA.4 NA NA NA NA NA NA NA NA NA NA NA
# NA.5 NA NA NA NA NA NA NA NA NA NA NA
# NA.6 NA NA NA NA NA NA NA NA NA NA NA
# NA.7 NA NA NA NA NA NA NA NA NA NA NA
# NA.8 NA NA NA NA NA NA NA NA NA NA NA
# NA.9 NA NA NA NA NA NA NA NA NA NA NA
# NA.10 NA NA NA NA NA NA NA NA NA NA NA
dplyr
library(dplyr)
mt %>%
slice(rep(1, 11)) %>%
mutate(across(everything(), ~ NA)) %>%
bind_rows(mt, .)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# 4 NA NA NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA NA NA NA NA NA
# 6 NA NA NA NA NA NA NA NA NA NA NA
# 7 NA NA NA NA NA NA NA NA NA NA NA
# 8 NA NA NA NA NA NA NA NA NA NA NA
# 9 NA NA NA NA NA NA NA NA NA NA NA
# 10 NA NA NA NA NA NA NA NA NA NA NA
# 11 NA NA NA NA NA NA NA NA NA NA NA
# 12 NA NA NA NA NA NA NA NA NA NA NA
# 13 NA NA NA NA NA NA NA NA NA NA NA
# 14 NA NA NA NA NA NA NA NA NA NA NA

Keep top 3 values in a row, change everything else to NA

Using mtcars for reproducibility.
(This is a row operation.) I want to keep the 3 largest values in each row and change everything else to NA.
I tried converting to long format with pivot_longer and then filtering, but the problem is that I want to convert back to wide afterwards, because I want to retain the structure of the data.
mtcars %>%
pivot_longer(cols = everything()) %>%
group_by(name) %>% top_n(3)
Sample Output on 3 rows of mtcars
Note: In mtcars, the same columns happen to be non-NA in all 3 rows, but in my original dataset this would differ from row to row. (Preferably a tidyverse solution.)
I know you would like a tidyverse solution, but this is a one-liner in base R:
t(apply(mtcars, 1, function(x) {x[order(x)[1:(length(x) - 3)]] <- NA; x}))
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 NA 160.0 110 NA NA NA NA NA NA NA
#> Mazda RX4 Wag 21.0 NA 160.0 110 NA NA NA NA NA NA NA
#> Datsun 710 22.8 NA 108.0 93 NA NA NA NA NA NA NA
#> Hornet 4 Drive 21.4 NA 258.0 110 NA NA NA NA NA NA NA
#> Hornet Sportabout 18.7 NA 360.0 175 NA NA NA NA NA NA NA
#> Valiant NA NA 225.0 105 NA NA 20.22 NA NA NA NA
#> Duster 360 NA NA 360.0 245 NA NA 15.84 NA NA NA NA
#> Merc 240D 24.4 NA 146.7 62 NA NA NA NA NA NA NA
#> Merc 230 NA NA 140.8 95 NA NA 22.90 NA NA NA NA
#> Merc 280 19.2 NA 167.6 123 NA NA NA NA NA NA NA
#> Merc 280C NA NA 167.6 123 NA NA 18.90 NA NA NA NA
#> Merc 450SE NA NA 275.8 180 NA NA 17.40 NA NA NA NA
#> Merc 450SL NA NA 275.8 180 NA NA 17.60 NA NA NA NA
#> Merc 450SLC NA NA 275.8 180 NA NA 18.00 NA NA NA NA
#> Cadillac Fleetwood NA NA 472.0 205 NA NA 17.98 NA NA NA NA
#> Lincoln Continental NA NA 460.0 215 NA NA 17.82 NA NA NA NA
#> Chrysler Imperial NA NA 440.0 230 NA NA 17.42 NA NA NA NA
#> Fiat 128 32.4 NA 78.7 66 NA NA NA NA NA NA NA
#> Honda Civic 30.4 NA 75.7 52 NA NA NA NA NA NA NA
#> Toyota Corolla 33.9 NA 71.1 65 NA NA NA NA NA NA NA
#> Toyota Corona 21.5 NA 120.1 97 NA NA NA NA NA NA NA
#> Dodge Challenger NA NA 318.0 150 NA NA 16.87 NA NA NA NA
#> AMC Javelin NA NA 304.0 150 NA NA 17.30 NA NA NA NA
#> Camaro Z28 NA NA 350.0 245 NA NA 15.41 NA NA NA NA
#> Pontiac Firebird 19.2 NA 400.0 175 NA NA NA NA NA NA NA
#> Fiat X1-9 27.3 NA 79.0 66 NA NA NA NA NA NA NA
#> Porsche 914-2 26.0 NA 120.3 91 NA NA NA NA NA NA NA
#> Lotus Europa 30.4 NA 95.1 113 NA NA NA NA NA NA NA
#> Ford Pantera L 15.8 NA 351.0 264 NA NA NA NA NA NA NA
#> Ferrari Dino 19.7 NA 145.0 175 NA NA NA NA NA NA NA
#> Maserati Bora 15.0 NA 301.0 335 NA NA NA NA NA NA NA
#> Volvo 142E 21.4 NA 121.0 109 NA NA NA NA NA NA NA
Your general idea was in the right direction. You can pivot to long data, group by the row number, replace everything outside the three largest values with NA, and then reshape back to wide:
library(dplyr)
library(tidyr)
library(tibble)
mtcars %>%
rowid_to_column() %>%
pivot_longer(-rowid) %>%
group_by(rowid) %>%
mutate(value = replace(value, !value %in% tail(value[order(value)], 3), NA)) %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <lgl> <dbl> <dbl> <lgl> <lgl> <dbl> <lgl> <lgl> <lgl> <lgl>
1 21 NA 160 110 NA NA NA NA NA NA NA
2 21 NA 160 110 NA NA NA NA NA NA NA
3 22.8 NA 108 93 NA NA NA NA NA NA NA
4 21.4 NA 258 110 NA NA NA NA NA NA NA
5 18.7 NA 360 175 NA NA NA NA NA NA NA
6 NA NA 225 105 NA NA 20.2 NA NA NA NA
7 NA NA 360 245 NA NA 15.8 NA NA NA NA
8 24.4 NA 147. 62 NA NA NA NA NA NA NA
9 NA NA 141. 95 NA NA 22.9 NA NA NA NA
10 19.2 NA 168. 123 NA NA NA NA NA NA NA
# ... with 22 more rows
Since you were curious about other solutions, here is a more tidyverse-oriented one.
library(purrr)
library(dplyr)
mtcars %>% pmap_dfr(~c(...) %>% replace(rank(desc(.)) > 3, NA))
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 NA 160 110 NA NA NA NA NA NA NA
#> 2 21 NA 160 110 NA NA NA NA NA NA NA
#> 3 22.8 NA 108 93 NA NA NA NA NA NA NA
#> 4 21.4 NA 258 110 NA NA NA NA NA NA NA
#> 5 18.7 NA 360 175 NA NA NA NA NA NA NA
#> 6 NA NA 225 105 NA NA 20.2 NA NA NA NA
#> 7 NA NA 360 245 NA NA 15.8 NA NA NA NA
#> 8 24.4 NA 147. 62 NA NA NA NA NA NA NA
#> 9 NA NA 141. 95 NA NA 22.9 NA NA NA NA
#> 10 19.2 NA 168. 123 NA NA NA NA NA NA NA
#> # ... with 22 more rows
Conceptually, it's similar to the base R solution, but it tries to be more "functional" and hopefully more readable, even though the accepted solution already looks very good.
EDIT.
To answer your comment asking for more info:
Note that ~ lets you write more compact anonymous functions.
Instead of:
mtcars %>% pmap_dfr(~c(...) %>% replace(rank(desc(.)) > 3, NA))
you could also write:
mtcars %>% pmap_dfr(function(...) c(...) %>% replace(rank(desc(.)) > 3, NA))
The three dots gather up all of the inputs passed to your function. Instead of naming a variable for each input, I use ... to capture them all.
pmap takes a list of lists or a list of vectors as its first argument.
In this case, it takes a data.frame, which is actually a list of vectors of the same length.
Then, pmap calls the function with the i-th element of each vector in the list.
... captures all of those i-th elements, and c() combines them into a single vector.
The function itself just replaces the unwanted values in that vector with NA, in a very similar way to the accepted solution. I used rank because it seemed a bit easier to read, but I guess it's a matter of style.
pmap always returns a list. That's why you can use pmap_dfr to return a data frame instead. Specifically, you want to create a data frame by binding each vector of the final result as a row (that explains the r at the end).
Check out ?pmap for more info.
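To make the explanation above concrete, here is a small sketch on a made-up two-column data frame. It shows what the function receives for each row when `pmap` is given a data frame, and what the `rank`-based replacement does to a single vector. The names `df`, `rows`, `v`, and `top3` are illustrative only.

```r
library(purrr)

# A data frame is a list of equal-length vectors
df <- data.frame(a = 1:2, b = 3:4)

# For each i, pmap passes the i-th element of every column; c(...) collects them
rows <- pmap(df, function(...) c(...))
rows[[1]]  # named vector: a = 1, b = 3

# The replacement step on one vector: keep the three largest values, NA the rest
v <- c(5, 1, 9, 7, 3)
top3 <- replace(v, rank(-v) > 3, NA)
top3
```

Here `rank(-v)` is used in place of `rank(desc(.))` so the sketch does not need dplyr; for numeric input the two should order values the same way.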
A data.table solution for completeness:
library(data.table)
DT <- as.data.table(mtcars)
DT[,
{
t3 <- sort(unlist(.SD), decreasing = TRUE)[1:3]
lapply(.SD, function(x) if (x %in% t3) x else NA_real_)
},
by = seq_len(nrow(DT))]
# seq_len mpg cyl disp hp drat wt qsec vs am gear carb
# 1: 1 21.0 NA 160.0 110 NA NA NA NA NA NA NA
# 2: 2 21.0 NA 160.0 110 NA NA NA NA NA NA NA
# 3: 3 22.8 NA 108.0 93 NA NA NA NA NA NA NA
# 4: 4 21.4 NA 258.0 110 NA NA NA NA NA NA NA
# 5: 5 18.7 NA 360.0 175 NA NA NA NA NA NA NA
# 6: 6 NA NA 225.0 105 NA NA 20.22 NA NA NA NA
# ...
One dplyr option could be:
mtcars %>%
rowwise() %>%
mutate(temp = list(tail(sort(c_across(everything())), 3))) %>%
# stay rowwise here so each row is compared only with its own top three values
mutate(across(-temp, ~ replace(.x, !.x %in% unlist(temp), NA))) %>%
ungroup() %>%
select(-temp)
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 NA 160 110 NA NA NA NA NA NA NA
2 21 NA 160 110 NA NA NA NA NA NA NA
3 22.8 NA 108 93 NA NA NA NA NA NA NA
4 21.4 NA 258 110 NA NA NA NA NA NA NA
5 18.7 NA 360 175 NA NA NA NA NA NA NA
6 NA NA 225 105 NA NA 20.2 NA NA NA NA
7 NA NA 360 245 NA NA 15.8 NA NA NA NA
8 24.4 NA 147. 62 NA NA NA NA NA NA NA
9 NA NA 141. 95 NA NA 22.9 NA NA NA NA
10 19.2 NA 168. 123 NA NA NA NA NA NA NA
The same logic using purrr:
mtcars %>%
pmap_dfr(~ replace(c(...), !c(...) %in% tail(sort(c(...)), 3), NA))

taking average by groups, excluding NA value

I'm struggling to find a way to aggregate my data frame by taking the mean while ignoring NA values, but still showing a missing value in the result where all of the inputs are missing.
The data table looks, for instance, like this:
Guar1 Bucket2 1 2 3 4 Total Month
10 -10 NA NA NA NA 0 201110
10 -0.2 0 9.87 8.42 0 18.29 201110
10 0 0.81 7.49 3.32 5.92 17.54 201110
10 0.4 0 0 NA 0 0 201110
10 999 0.73 7.57 4.61 0.77 13.68 201110
20 -10 NA NA NA NA 0 201110
20 -0.2 NA NA 100 NA 100 201110
20 0 NA 0 0 0 0 201110
20 0.4 1.39 3.13 14.04 2.98 21.54 201110
20 999 1.38 3.11 17.08 2.97 24.54 201110
999 999 1.06 5.44 8.61 1.52 16.63 201110
10 -10 NA NA NA NA 0 201111
10 -0.2 0 0 8.54 0 8.54 201111
10 0 1.87 6.12 16.6 0 24.59 201111
10 0.4 0 0 0 1.47 1.47 201111
10 999 1.68 5.82 13.15 1.67 22.32 201111
20 -10 NA NA NA NA 0 201111
20 -0.2 NA 0 NA NA 0 201111
20 0 NA NA 0 0 0 201111
20 0.4 2.29 5.38 14.91 14.18 36.76 201111
20 999 2.29 5.35 13.09 14.1 34.83 201111
And the final table
Guar1 Bucket2 1 2 3 4 Total
10 -10 NA NA NA NA 0
10 -0.2 0 4.935 8.48 0 13.415
10 0 1.34 6.805 9.96 2.96 21.065
10 0.4 0 0 0 0.735 0.735
10 999 1.205 6.695 8.88 1.22 18
20 -10 NA NA NA NA 0
20 -0.2 NA 0 100 NA 50
20 0 NA 0 0 0 0
20 0.4 1.84 4.255 14.475 8.58 29.15
20 999 1.835 4.23 15.085 8.535 29.685
999 999 1.06 5.44 8.61 1.52 16.63
I've tried
aggregate(.~ Guar1+Bucket2, df, mean, na.rm = FALSE)
but it then excludes all rows containing NA from the final table.
And if I set all the NA values in df to 0, then I would not get the desired average.
I hope that someone can help me with this. Thanks!
Check this example with the dplyr package.
You can group by more than one variable. dplyr is great for editing and summarising data.
dataFrame <- data.frame(group = c("a","a","a", "b","b","b"), value = c(1,2,NA,NA,NA,3))
library("dplyr")
df <- dataFrame %>%
group_by(group) %>%
summarise(Mean = mean(value, na.rm = T))
Output
# A tibble: 2 × 2
group Mean
<fctr> <dbl>
1 a 1.5
2 b 3.0
To keep the rows containing NA from being removed, use na.action = na.pass. Passing na.rm = TRUE on to mean then ensures that only the non-NA elements are used to compute each mean:
aggregate(.~ Guar1+Bucket2, df, mean, na.rm =TRUE, na.action = na.pass)
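A small sketch of the difference, on made-up data. With the formula interface, `aggregate()` drops every row that has an NA in any column by default, so whole groups can disappear; `na.action = na.pass` keeps them. Note that `mean(x, na.rm = TRUE)` returns NaN when every value is NA, so the hypothetical helper `mean_or_na()` below is one way to get NA in that case, which is what the desired table in the question shows.

```r
# Made-up data: group "b" has no non-NA values in x
df <- data.frame(
  g = c("a", "a", "b", "b"),
  x = c(1, NA, NA, NA),
  y = c(2, 4, 6, 8)
)

# Default na.action (na.omit) drops rows 2-4 before aggregating, so group "b" is lost
dropped <- aggregate(. ~ g, df, mean)

# Hypothetical helper: NA when everything is missing, otherwise the mean of the rest
mean_or_na <- function(v) if (all(is.na(v))) NA_real_ else mean(v, na.rm = TRUE)

# na.pass keeps the NA rows so the helper sees them
kept <- aggregate(. ~ g, df, mean_or_na, na.action = na.pass)
kept
```

In `kept`, group "a" averages only the non-missing `x` value, and group "b" keeps an NA in `x` while `y` is averaged normally.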

Assigning values from submatrices to larger matrix

I have a bunch of small matrices, which are basically subsets of a larger matrix, but have different values. I want to take the values from these submatrices and overwrite the corresponding values in the larger matrix. For instance, say this is my larger matrix:
AB-2000 AB-2600 AB-3500 AC-0100 AD-0100 AF-0200
AB-2000 6.5 NA -1.8 3.65 -17.96 -26.5
AB-2600 NA 7.18 NA NA NA NA
AB-3500 -1.79 NA 5.4 NA -4.63 NA
AC-0100 3.65 NA NA 4.22 9.8 NA
AD-0100 -17.96 NA -4.63 9.8 5.9 NA
AF-0200 -26.5 NA NA NA NA 4.28
A smaller matrix might just be:
AB-2000 AB-3500
AB-2000 5.5 2.5
AB-3500 2.5 6.5
So, for instance, I want to take the value from the intersection of the AB-2000 row and AB-3500 column in the smaller matrix (2.5) and set it as the new value in the larger matrix, and do the same thing for the other values in the submatrix so we get a new larger matrix that looks like:
AB-2000 AB-2600 AB-3500 AC-0100 AD-0100 AF-0200
AB-2000 5.5 NA 2.5 3.65 -17.96 -26.5
AB-2600 NA 7.18 NA NA NA NA
AB-3500 2.5 NA 6.5 NA -4.63 NA
AC-0100 3.65 NA NA 4.22 9.8 NA
AD-0100 -17.96 NA -4.63 9.8 5.9 NA
AF-0200 -26.5 NA NA NA NA 4.28
I have a lot of submatrices whose values I am using to override the values in the larger matrix so want a way to do this efficiently. Any thoughts?
You can take advantage of having equal rownames and colnames in all matrices and just subset the big matrix according to the submatrix, and then replace the values:
X <- read.table(text=" AB-2000 AB-2600 AB-3500 AC-0100 AD-0100 AF-0200
AB-2000 6.5 NA -1.8 3.65 -17.96 -26.5
AB-2600 NA 7.18 NA NA NA NA
AB-3500 -1.79 NA 5.4 NA -4.63 NA
AC-0100 3.65 NA NA 4.22 9.8 NA
AD-0100 -17.96 NA -4.63 9.8 5.9 NA
AF-0200 -26.5 NA NA NA NA 4.28")
X
x1 <- read.table(text=" AB-2000 AB-3500
AB-2000 5.5 2.5
AB-3500 2.5 6.5")
X[rownames(x1),colnames(x1)] <- x1
Result:
> X
AB.2000 AB.2600 AB.3500 AC.0100 AD.0100 AF.0200
AB-2000 5.50 NA 2.50 3.65 -17.96 -26.50
AB-2600 NA 7.18 NA NA NA NA
AB-3500 2.50 NA 6.50 NA -4.63 NA
AC-0100 3.65 NA NA 4.22 9.80 NA
AD-0100 -17.96 NA -4.63 9.80 5.90 NA
AF-0200 -26.50 NA NA NA NA 4.28
For more than one submatrix, you can do something like this:
x2 <- read.table(text=" AB-2600 AC-0100
AB-2600 42.1 42.2
AC-0100 42.3 42.4") #Fake data
all.sub <- list(x1, x2)
for(x in all.sub) X[rownames(x),colnames(x)] <- x
> X
AB.2000 AB.2600 AB.3500 AC.0100 AD.0100 AF.0200
AB-2000 5.50 NA 2.50 3.65 -17.96 -26.50
AB-2600 NA 42.1 NA 42.20 NA NA
AB-3500 2.50 NA 6.50 NA -4.63 NA
AC-0100 3.65 42.3 NA 42.40 9.80 NA
AD-0100 -17.96 NA -4.63 9.80 5.90 NA
AF-0200 -26.50 NA NA NA NA 4.28
Just keep in mind that if you have repeated occurrences of [row,col] the last submatrix in all.sub will be the final value in X.
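The "last one wins" behaviour can be seen in a minimal sketch with plain matrices (rather than data frames from read.table). The matrices `big`, `sub1`, and `sub2` and their values are made up for illustration; the only requirement is that every submatrix uses row and column names that exist in the larger matrix.

```r
# Made-up larger matrix, all zeros, with named rows and columns
big <- matrix(0, nrow = 3, ncol = 3,
              dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Two submatrices that overlap at ["c", "c"]
sub1 <- matrix(c(1, 2, 3, 4), nrow = 2,
               dimnames = list(c("a", "c"), c("a", "c")))
sub2 <- matrix(9, nrow = 1, ncol = 1, dimnames = list("c", "c"))

# Index the big matrix by each submatrix's dimnames and overwrite in order
for (s in list(sub1, sub2)) {
  big[rownames(s), colnames(s)] <- s
}
big
```

After the loop, `big["c", "c"]` holds the value from `sub2` because it was applied last, while cells that no submatrix touched keep their original values.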

Resources