Assigning values from submatrices to larger matrix - r

I have a bunch of small matrices, which are basically subsets of a larger matrix, but have different values. I want to take the values from these submatrices and overwrite the corresponding values in the larger matrix. For instance, say this is my larger matrix:
AB-2000 AB-2600 AB-3500 AC-0100 AD-0100 AF-0200
AB-2000 6.5 NA -1.8 3.65 -17.96 -26.5
AB-2600 NA 7.18 NA NA NA NA
AB-3500 -1.79 NA 5.4 NA -4.63 NA
AC-0100 3.65 NA NA 4.22 9.8 NA
AD-0100 -17.96 NA -4.63 9.8 5.9 NA
AF-0200 -26.5 NA NA NA NA 4.28
A smaller matrix might just be:
AB-2000 AB-3500
AB-2000 5.5 2.5
AB-3500 2.5 6.5
So, for instance, I want to take the value at the intersection of the AB-2000 row and AB-3500 column in the smaller matrix (2.5) and set it as the new value in the larger matrix, and do the same thing for the other values in the submatrix, so we get a new larger matrix that looks like:
AB-2000 AB-2600 AB-3500 AC-0100 AD-0100 AF-0200
AB-2000 5.5 NA 2.5 3.65 -17.96 -26.5
AB-2600 NA 7.18 NA NA NA NA
AB-3500 2.5 NA 6.5 NA -4.63 NA
AC-0100 3.65 NA NA 4.22 9.8 NA
AD-0100 -17.96 NA -4.63 9.8 5.9 NA
AF-0200 -26.5 NA NA NA NA 4.28
I have a lot of submatrices whose values I am using to override the values in the larger matrix, so I want a way to do this efficiently. Any thoughts?

You can take advantage of having equal rownames and colnames in all matrices and just subset the big matrix according to the submatrix, and then replace the values:
X <- read.table(text=" AB-2000 AB-2600 AB-3500 AC-0100 AD-0100 AF-0200
AB-2000 6.5 NA -1.8 3.65 -17.96 -26.5
AB-2600 NA 7.18 NA NA NA NA
AB-3500 -1.79 NA 5.4 NA -4.63 NA
AC-0100 3.65 NA NA 4.22 9.8 NA
AD-0100 -17.96 NA -4.63 9.8 5.9 NA
AF-0200 -26.5 NA NA NA NA 4.28")
X
x1 <- read.table(text=" AB-2000 AB-3500
AB-2000 5.5 2.5
AB-3500 2.5 6.5")
X[rownames(x1),colnames(x1)] <- x1
Result:
> X
AB.2000 AB.2600 AB.3500 AC.0100 AD.0100 AF.0200
AB-2000 5.50 NA 2.50 3.65 -17.96 -26.50
AB-2600 NA 7.18 NA NA NA NA
AB-3500 2.50 NA 6.50 NA -4.63 NA
AC-0100 3.65 NA NA 4.22 9.80 NA
AD-0100 -17.96 NA -4.63 9.80 5.90 NA
AF-0200 -26.50 NA NA NA NA 4.28
For more than one submatrix, you can do something like this:
x2 <- read.table(text=" AB-2600 AC-0100
AB-2600 42.1 42.2
AC-0100 42.3 42.4") #Fake data
all.sub <- list(x1, x2)
for(x in all.sub) X[rownames(x),colnames(x)] <- x
> X
AB.2000 AB.2600 AB.3500 AC.0100 AD.0100 AF.0200
AB-2000 5.50 NA 2.50 3.65 -17.96 -26.50
AB-2600 NA 42.10 NA 42.20 NA NA
AB-3500 2.50 NA 6.50 NA -4.63 NA
AC-0100 3.65 42.30 NA 42.40 9.80 NA
AD-0100 -17.96 NA -4.63 9.80 5.90 NA
AF-0200 -26.50 NA NA NA NA 4.28
Just keep in mind that if the same [row, col] cell appears in more than one submatrix, the last submatrix in all.sub determines the final value in X.
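If you prefer to avoid the explicit loop, the same replacement can be folded over the list with Reduce() (a small sketch, equivalent to the loop above; later submatrices still win on overlapping cells):
X <- Reduce(function(M, x) { M[rownames(x), colnames(x)] <- x; M }, all.sub, init = X)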

Related

How to use rollapplyr while ignoring NA values?

I have weather data with NAs scattered sporadically throughout and I want to calculate rolling means. I have been using the rollapplyr function from the zoo package, but even though I include partial = TRUE, it still returns NA whenever, for example, there is an NA in one of the 30 values being averaged.
Here is the formula:
weather_rolled <- weather %>%
mutate(maxt30 = rollapplyr(max_temp, 30, mean, partial = TRUE))
Here's my data:
A tibble: 7,160 x 11
station_name date max_temp avg_temp min_temp rainfall rh avg_wind_speed dew_point avg_bare_soil_temp total_solar_rad
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 VEGREVILLE 2019-01-01 0.9 -7.9 -16.6 1 81.7 20.2 -7.67 NA NA
2 VEGREVILLE 2019-01-02 5.5 1.5 -2.5 0 74.9 13.5 -1.57 NA NA
3 VEGREVILLE 2019-01-03 3.3 -0.9 -5 0.5 80.6 10.1 -3.18 NA NA
4 VEGREVILLE 2019-01-04 -1.1 -4.7 -8.2 5.2 92.1 8.67 -4.76 NA NA
5 VEGREVILLE 2019-01-05 -3.8 -6.5 -9.2 0.2 92.6 14.3 -6.81 NA NA
6 VEGREVILLE 2019-01-06 -3 -4.4 -5.9 0 91.1 16.2 -5.72 NA NA
7 VEGREVILLE 2019-01-07 -5.8 -12.2 -18.5 0 75.5 30.6 -16.9 NA NA
8 VEGREVILLE 2019-01-08 -17.4 -21.6 -25.7 1.2 67.8 16.1 -26.1 NA NA
9 VEGREVILLE 2019-01-09 -12.9 -15.1 -17.4 0.2 71.5 14.3 -17.7 NA NA
10 VEGREVILLE 2019-01-10 -13.2 -17.9 -22.5 0.4 80.2 3.38 -21.8 NA NA
# ... with 7,150 more rows
Essentially, whenever an NA appears partway through, it produces a long run of NAs in the rolling mean. I still want to calculate the rolling mean within that time frame, ignoring the NAs. Does anyone know a way around this? I have been searching online for hours to no avail.
Thanks!
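A minimal sketch of one way to do this: rollapplyr() forwards any extra arguments on to the function it applies, so na.rm = TRUE can be passed straight through to mean() and the NAs inside each window are skipped.
library(dplyr)
library(zoo)

weather_rolled <- weather %>%
  mutate(maxt30 = rollapplyr(max_temp, 30, mean, na.rm = TRUE, partial = TRUE))

# equivalent, with an anonymous function:
# mutate(maxt30 = rollapplyr(max_temp, 30, function(x) mean(x, na.rm = TRUE), partial = TRUE))
Note that a window consisting entirely of NAs will still come back as NaN.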

R - Delete Observations if More Than 25% of a Group

This is my first post! I started using R about a year ago and I have learned a lot from this sub over the last few months! Thanks for all of your help so far.
Here is what I am trying to do:
• Group Data by POS
• Within each POS group, no ORG should represent more than 25% of the dataset
• If an ORG represents more than 25% of the observation (column), its value furthest from the mean should be deleted; I think this would loop until that ORG's data make up less than 25% of the observation.
I am not sure how to approach this problem, as I am not too familiar with R functions. Well, I am assuming this would require a function.
Here is the sample dataset:
print(Example)
# A tibble: 18 x 13
Org Pos obv1 obv2 obv3 obv4 obv5 obv6 obv7 obv8 obv9 obv10 obv11
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 34.6 26.2 43.1 NA NA NA NA NA NA NA NA
2 2 1 18.7 15.5 23.4 NA NA NA NA NA NA NA NA
3 3 1 16.2 14.4 21.7 NA NA NA NA NA NA NA 1.32
4 3 1 20.0 15.5 23.4 NA NA 1.32 2.78 1.44 NA NA 1.89
5 3 1 2.39 16.9 24.1 NA NA 1.13 1.52 1.12 NA NA 2.78
6 3 1 24.3 15.4 24.6 NA NA 1.13 1.89 1.13 NA NA 1.51
7 6 1 16.7 16.0 23.4 0.19 NA 0.83 1.3 0.94 1.78 2.15 1.51
8 6 1 18.7 16.4 25.8 0.19 NA 1.22 1.4 0.97 1.93 2.35 1.51
9 6 1 19.3 16.4 25.8 0.19 NA 1.22 1.4 0.97 1.93 2.35 1.51
10 7 1 23.8 18.6 28.6 NA NA NA NA NA NA NA NA
11 12 2 28.8 24.4 39.7 NA NA 1.13 1.89 1.32 2.46 3.21 NA
12 13 2 24.6 19.6 29.4 0.16 NA 3.23 3.23 2.27 NA NA NA
13 14 2 18.4 15.5 24.8 NA NA 2.27 3.78 1.13 3.46 4.91 2.78
14 15 2 23.8 24.4 39.7 NA NA NA NA NA NA NA NA
15 15 2 25.8 24.4 39.7 NA NA NA NA NA NA NA NA
16 16 2 18.9 17.4 26.9 0.15 NA NA 1.89 2.99 NA NA 1.51
17 16 2 22.1 17.3 26.9 NA NA NA 2.57 0.94 NA NA 1.51
18 16 2 24.3 19.6 28.5 0.15 NA NA 1.51 1.32 NA NA 2.27
The result would look something like this:
print(Result)
# A tibble: 18 x 13
Org Pos obv1 obv2 obv3 obv4 obv5 obv6 obv7 obv8 obv9 obv10 obv11
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 34.6 26.2 43.1 NA NA NA NA NA NA NA NA
2 2 1 18.7 15.5 23.4 NA NA NA NA NA NA NA NA
3 3 1 NA NA NA NA NA NA NA NA NA NA NA
4 3 1 20.0 15.5 23.4 NA NA 1.32 2.78 1.44 NA NA NA
5 3 1 NA NA NA NA NA NA NA NA NA NA NA
6 3 1 NA NA NA NA NA NA NA NA NA NA 1.51
7 6 1 16.7 16.0 23.4 0.19 NA NA NA NA NA NA NA
8 6 1 NA NA NA NA NA 1.22 1.4 0.97 1.93 2.35 1.51
9 6 1 19.3 16.4 25.8 NA NA NA NA NA NA NA NA
10 7 1 23.8 18.6 28.6 NA NA NA NA NA NA NA NA
11 12 2 28.8 24.4 39.7 NA NA 1.13 1.89 1.32 2.46 3.21 NA
12 13 2 24.6 19.6 29.4 0.16 NA 3.23 3.23 2.27 NA NA NA
13 14 2 18.4 15.5 24.8 NA NA 2.27 3.78 1.13 3.46 4.91 2.78
14 15 2 NA NA NA NA NA NA NA NA NA NA NA
15 15 2 25.8 24.4 39.7 NA NA NA NA NA NA NA NA
16 16 2 NA NA NA NA NA NA 1.89 2.99 NA NA NA
17 16 2 22.1 17.3 26.9 NA NA NA 2.57 0.94 NA NA 1.51
18 16 2 NA NA NA NA NA NA NA NA NA NA NA
Any advice would be appreciated. Thanks!
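A rough sketch of one reading of the rule: within each Pos group, and separately for each obv column, while a single Org supplies more than 25% of the non-missing values in that column, blank out that Org's value furthest from the column mean. trim_org is a made-up helper name, the sketch assumes dplyr >= 1.0 for across(), and it will not reproduce the hand-made Result above exactly, since the 25% rule there is a little ambiguous.
library(dplyr)

# helper implementing one interpretation of the 25% rule for a single column
trim_org <- function(x, org) {
  repeat {
    keep <- which(!is.na(x))                      # positions with data
    if (length(keep) == 0) break
    counts <- table(org[keep])                    # non-NA values per Org
    top <- names(counts)[which.max(counts)]
    # stop once the biggest contributor is at or under 25%, or has only one value left
    if (max(counts) / length(keep) <= 0.25 || max(counts) <= 1) break
    idx <- keep[as.character(org[keep]) == top]
    x[idx[which.max(abs(x[idx] - mean(x[keep])))]] <- NA   # drop the value furthest from the mean
  }
  x
}

Result <- Example %>%
  group_by(Pos) %>%
  mutate(across(starts_with("obv"), ~ trim_org(.x, Org))) %>%
  ungroup()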

R data.table Creating simple custom function [duplicate]

This question already has answers here:
Apply a function to every specified column in a data.table and update by reference
(7 answers)
Closed 2 years ago.
I am currently working in R on a data set that looks somewhat like the following (except it holds millions of rows and more variables):
pid agedays wtkg htcm bmi haz waz whz
1 2 1.92 44.2 9.74 -2.72 -3.23 NA
1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00
1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48
1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14
2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54
2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79
2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18
2 104 5.82 61.3 15.49 0.23 -0.38 -0.70
I am working to create a function in which the following variables are added:
haz_1.5, waz_1.5, whz_1.5, htcm_1.5, wtkg_1.5, and bmi_1.5
Each variable will follow the same pattern of criteria as below:
when !is.na(haz) and agedays > 61-45 and agedays <= 61-15, haz_1.5 will hold the value of haz (and NA otherwise)
The new data set should look like the following (except bmi_1.5, wtkg_1.5, and htcm_1.5 are omitted from the output below, so table sample can fit in box):
pid agedays wtkg htcm bmi haz waz whz haz_1.5 waz_1.5 whz_1.5
1 2 1.92 44.2 9.74 -2.72 -3.23 NA NA NA NA
1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00 -2.21 -3.03 -2.00
1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48 NA NA NA
1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14 NA NA NA
2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54 NA NA NA
2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79 -0.14 -0.58 -0.79
2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18 NA NA NA
2 104 5.82 61.3 15.49 0.23 -0.38 -0.70 NA NA NA
Here's the code that I've tried so far:
measure<-list("haz", "waz", "whz", "htcm", "wtkg", "bmi")
set_1.5_months <- function(x, y, z){
maled_anthro[!is.na(z) & agedays > (x-45) & agedays <= (x-15), y:=z]
}
for(i in 1:length(measure)){
z <- measure[i]
y <- paste(measure[i], "1.5", sep="_")
x <- 61
maled_anthro_1<-set_1.5_months(x, y, z)
}
The code above has not been successful. I just end up with a new column literally named "y" added to the original data table, holding the values "bmi" or NA. Can someone help me figure out where I went wrong with this code?
I'd like to keep the function formatted much as it is above (easy to change), since I have other similar functions to create in which the values 1.5 and x = 61 will be swapped out for other numbers, and I like that these are relatively easy to change in the current format.
I believe the following is an idiomatic way to create new columns by applying a function to many existing columns.
Note that I've left the condition as it was, negating it all to make the code as close to the question's as possible.
library(data.table)
setDT(maled_anthro)
set_1.5_months <- function(y, agedays, x = 61){
z <- y
is.na(z) <- !(!is.na(y) & agedays > (x - 45) & agedays <= (x - 15))
z
}
measure <- c("haz", "waz", "whz", "htcm", "wtkg", "bmi")
new_measure <- paste(measure, "1.5", sep = "_")
maled_anthro[, (new_measure) := lapply(.SD, function(y) set_1.5_months(y, agedays, x=61)), .SDcols = measure ]
# pid agedays wtkg htcm bmi haz waz whz haz_1.5 waz_1.5 whz_1.5 htcm_1.5 wtkg_1.5 bmi_1.5
#1: 1 2 1.92 44.2 9.74 -2.72 -3.23 NA NA NA NA NA NA NA
#2: 1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00 -2.21 -3.03 -2.00 49.2 2.68 11.07
#3: 1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48 NA NA NA NA NA NA
#4: 1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14 NA NA NA NA NA NA
#5: 2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54 NA NA NA NA NA NA
#6: 2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79 -0.14 -0.58 -0.79 53.1 3.78 13.41
#7: 2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18 NA NA NA NA NA NA
#8: 2 104 5.82 61.3 15.49 0.23 -0.38 -0.70 NA NA NA NA NA NA
Data
maled_anthro <- read.table(text = "
pid agedays wtkg htcm bmi haz waz whz
1 2 1.92 44.2 9.74 -2.72 -3.23 NA
1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00
1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48
1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14
2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54
2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79
2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18
2 104 5.82 61.3 15.49 0.23 -0.38 -0.70
", header = TRUE)

as.numeric function does not change vector type

I have a column in my dataset with numbers and NAs. When I import it, RStudio categorizes it as "character" instead of "numeric".
I tried using the as.numeric() function to convert it. I get the warning "NAs introduced by coercion" and then nothing happens.
str(my_data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 131 obs. of 6 variables:
$ labs_TotalChol : num NA 149 149 188 171 147 207 NA 131 136 ...
$ labs_Creatinine : num NA NA NA NA 0.8 1 NA NA 0.7 0.7 ...
$ PET_globalCFR : chr NA "2.87" "2.65" "2.65" ...
$ RHI : num NA 1.49 1.91 1.5 3.03 1.72 1.93 2.67 1.28 2.06 ...
$ PET_avghr_stress : num NA 90 99.7 76 73 ...
$ PET_RPPCorrected_rest: num NA 2.09 2.1 2.24 2.11 2 2.75 1.07 2.72 2.24 ...
I use the code:
as.numeric(my_data$PET_globalCFR)
and get:
[1] NA 2.87 2.65 2.65 2.46 2.80 2.93 2.02 3.77 2.62 2.06 NA 2.73 2.40 2.95 2.97 2.69 2.61 2.17 2.80 2.59 NA 1.87 2.23 NA
[26] 1.34 2.06 2.24 1.94 1.73 1.63 NA 1.72 NA 1.94 1.25 3.38 NA 2.09 2.68 2.91 1.94 2.41 2.50 NA NA 2.79 2.14 3.77 2.10
[51] 2.88 2.07 2.78 NA NA NA 1.54 2.38 2.29 1.40 2.21 2.36 NA 2.30 2.54 2.29 2.28 2.57 3.53 NA 2.34 3.84 1.50 2.19 2.16
[76] 1.20 2.73 1.35 3.48 2.51 1.42 1.74 1.68 NA NA 1.98 NA NA 2.44 1.62 2.99 1.34 1.39 2.16 4.58 1.74 NA 2.21 NA 1.41
[101] 0.95 2.60 2.30 1.67 1.81 1.79 NA 1.60 3.24 3.20 NA 1.46 NA NA NA 2.65 NA NA 2.80 1.67 3.49 NA NA NA NA
[126] NA NA NA 1.54 NA NA
Warning message:
NAs introduced by coercion
Maybe you can import it as character first, then filter out the NAs (!is.na()) and then convert to numeric.
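A minimal sketch along those lines. The key point is that as.numeric() returns a new vector rather than changing the column in place, so "nothing happens" until the result is assigned back; the coercion warning just marks the entries that could not be parsed as numbers.
my_data$PET_globalCFR <- as.numeric(my_data$PET_globalCFR)
str(my_data$PET_globalCFR)   # now num, with NA where parsing failed

# optionally keep only rows where the converted value is present:
# my_data_clean <- my_data[!is.na(my_data$PET_globalCFR), ]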

R: apply simple function to specific columns by grouped variable

I have a data set with 2 observations for each person.
There are more than 100 variables in the data set.
I would like to fill in the missing data for each person with the available data for the same variable. I can do this manually with the dplyr mutate() function, but it would be cumbersome to do that for all the variables that need to be filled in.
Here is what I tried, but it failed:
> # Here's data example
> # https://www.dropbox.com/s/a0bc69xgxhaeguc/data_xlsc.xlsx?dl=0
> # I have already attached it to my working space
>
> names(data)
[1] "ID" "Age" "var1" "var2" "var3" "var4" "var5" "var6" "var7" "var8" "var9"
> head(data)
Source: local data frame [6 x 11]
ID Age var1 var2 var3 var4 var5 var6 var7 var8 var9
1 1 50 27.5 1.83 92.0 NA NA NA NA NA 5.1
2 1 NA NA NA NA 3.54 30.2 27.9 64.34 60.8 NA
3 2 51 33.7 1.77 105.6 NA NA NA NA NA 5.2
4 2 NA NA NA NA 4.05 36.4 38.7 67.75 63.7 NA
5 3 43 26.3 1.84 89.1 NA NA NA NA NA 4.8
6 3 NA NA NA NA 3.77 24.4 21.9 67.97 64.2 NA
> # As you can see above, for each person (ID) there are missing values for age and other variables.
> # I'd like to fill in missing data with the available data for each variable, for each ID
>
> #These are the variables that I need to fill in
> desired_variables <- names(data[,2:11])
>
> # this is my attempt that failed
>
> data2 <- data %>% group_by(ID) %>%
+ do(
+ for (i in seq_along(desired_variables)) {
+ i=max(i, na.rm=T)
+ }
+ )
Error: Results are not data frames at positions: 1, 2, 3
Desired output for the first person:
ID Age var1 var2 var3 var4 var5 var6 var7 var8 var9
1 1 50 27.5 1.83 92.0 3.54 30.2 27.9 64.34 60.8 5.1
2 1 50 27.5 1.83 92.0 3.54 30.2 27.9 64.34 60.8 5.1
Here's a possible data.table solution:
library(data.table)
setattr(data, "class", "data.frame") ## If your data is of `tbl_df` class
setDT(data)[, (desired_variables) := lapply(.SD, max, na.rm = TRUE), by = ID] ## you can also use `.SDcols` if you want to specify specific columns
data
# ID Age var1 var2 var3 var4 var5 var6 var7 var8 var9
# 1: 1 50 27.5 1.83 92.0 3.54 30.2 27.9 64.34 60.8 5.1
# 2: 1 50 27.5 1.83 92.0 3.54 30.2 27.9 64.34 60.8 5.1
# 3: 2 51 33.7 1.77 105.6 4.05 36.4 38.7 67.75 63.7 5.2
# 4: 2 51 33.7 1.77 105.6 4.05 36.4 38.7 67.75 63.7 5.2
# 5: 3 43 26.3 1.84 89.1 3.77 24.4 21.9 67.97 64.2 4.8
# 6: 3 43 26.3 1.84 89.1 3.77 24.4 21.9 67.97 64.2 4.8
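If you would rather stay in dplyr, a roughly equivalent sketch (assuming dplyr >= 1.0 for across(); like the data.table answer it relies on max(), which with na.rm = TRUE warns and returns -Inf if a column is entirely NA within a group):
library(dplyr)

data2 <- data %>%
  group_by(ID) %>%
  mutate(across(all_of(desired_variables), ~ max(.x, na.rm = TRUE))) %>%
  ungroup()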
