R data.table: creating a simple custom function [duplicate]

This question already has answers here:
Apply a function to every specified column in a data.table and update by reference
(7 answers)
Closed 2 years ago.
I am currently working in R on a data set that looks somewhat like the following (except it holds millions of rows and more variables):
pid agedays wtkg htcm bmi haz waz whz
1 2 1.92 44.2 9.74 -2.72 -3.23 NA
1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00
1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48
1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14
2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54
2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79
2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18
2 104 5.82 61.3 15.49 0.23 -0.38 -0.70
I am working to create a function that adds the following variables:
haz_1.5, waz_1.5, whz_1.5, htcm_1.5, wtkg_1.5, and bmi_1.5
Each new variable follows the same pattern of criteria as below:
when !is.na(haz) and agedays > 61 - 45 and agedays <= 61 - 15, haz_1.5 holds the value of haz (and NA otherwise)
The new data set should look like the following (except bmi_1.5, wtkg_1.5, and htcm_1.5 are omitted from the output below, so table sample can fit in box):
pid agedays wtkg htcm bmi haz waz whz haz_1.5 waz_1.5 whz_1.5
1 2 1.92 44.2 9.74 -2.72 -3.23 NA NA NA NA
1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00 -2.21 -3.03 -2.00
1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48 NA NA NA
1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14 NA NA NA
2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54 NA NA NA
2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79 -0.14 -0.58 -0.79
2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18 NA NA NA
2 104 5.82 61.3 15.49 0.23 -0.38 -0.70 NA NA NA
Here's the code I've tried so far:
measure <- list("haz", "waz", "whz", "htcm", "wtkg", "bmi")
set_1.5_months <- function(x, y, z){
  maled_anthro[!is.na(z) & agedays > (x - 45) & agedays <= (x - 15), y := z]
}
for(i in 1:length(measure)){
  z <- measure[i]
  y <- paste(measure[i], "1.5", sep = "_")
  x <- 61
  maled_anthro_1 <- set_1.5_months(x, y, z)
}
The code above has not been successful: I just end up with a new variable "y" added to the original data table that holds the values "bmi" or NA. Can someone help me figure out where I went wrong?
I'd like to keep the function formatted much as above (easy to change), since I have other, similar functions to create in which the values "1.5" and x = 61 will be swapped out for other numbers, and this format makes those swaps easy.

I believe the following is an idiomatic way to create new columns by applying a function to many existing columns.
Note that I've left the condition as it was, negating it all to make the code as close to the question's as possible.
library(data.table)
setDT(maled_anthro)
set_1.5_months <- function(y, agedays, x = 61){
  z <- y
  is.na(z) <- !(!is.na(y) & agedays > (x - 45) & agedays <= (x - 15))
  z
}
measure <- c("haz", "waz", "whz", "htcm", "wtkg", "bmi")
new_measure <- paste(measure, "1.5", sep = "_")
maled_anthro[, (new_measure) := lapply(.SD, function(y) set_1.5_months(y, agedays, x = 61)), .SDcols = measure]
# pid agedays wtkg htcm bmi haz waz whz haz_1.5 waz_1.5 whz_1.5 htcm_1.5 wtkg_1.5 bmi_1.5
#1: 1 2 1.92 44.2 9.74 -2.72 -3.23 NA NA NA NA NA NA NA
#2: 1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00 -2.21 -3.03 -2.00 49.2 2.68 11.07
#3: 1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48 NA NA NA NA NA NA
#4: 1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14 NA NA NA NA NA NA
#5: 2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54 NA NA NA NA NA NA
#6: 2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79 -0.14 -0.58 -0.79 53.1 3.78 13.41
#7: 2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18 NA NA NA NA NA NA
#8: 2 104 5.82 61.3 15.49 0.23 -0.38 -0.70 NA NA NA NA NA NA
Data
maled_anthro <- read.table(text = "
pid agedays wtkg htcm bmi haz waz whz
1 2 1.92 44.2 9.74 -2.72 -3.23 NA
1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00
1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48
1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14
2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54
2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79
2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18
2 104 5.82 61.3 15.49 0.23 -0.38 -0.70
", header = TRUE)

Related

Using notation to sample rows from a data file and run individual t-tests on them in R

I am trying to do three steps with my data file.
Resample from the data file 100 times with replacement. My code is below:
Q3 <- question.3
Q.3 <- data.frame(question.3)
resample <- Q.3[sample(100, replace = TRUE), ]
x1 x2
61 9.49 10.17
18 10.96 9.42
30 10.12 8.08
51 10.72 10.23
70 10.52 9.14
87 10.32 8.32
42 10.21 9.77
72 8.66 10.80
87.1 10.32 8.32
76 8.78 10.15
25 11.11 9.42
29 11.68 12.37
87.2 10.32 8.32
66 7.91 7.96
51.1 10.72 10.23
22 9.24 11.57
90 9.99 10.19
92 10.02 9.30
75 10.04 10.23
8 9.83 8.96
41 9.60 9.86
79 8.59 8.93
5 9.51 9.19
10 9.27 7.50
36 10.15 10.85
68 11.26 8.47
88 9.03 8.71
79.1 8.59 8.93
6 9.75 9.83
58 11.05 8.58
6.1 9.75 9.83
21 9.94 13.32
34 8.56 10.21
10.1 9.27 7.50
67 9.10 9.90
35 9.59 9.65
82 10.61 9.46
62 11.18 8.71
58.1 11.05 8.58
77 11.50 10.94
4 11.01 11.77
71 10.92 9.51
24 10.35 9.23
26 9.91 9.49
29.1 11.68 12.37
96 9.33 8.62
76.1 8.78 10.15
81 13.52 8.69
66.1 7.91 7.96
34.1 8.56 10.21
91 9.93 11.53
100 10.52 9.94
76.2 8.78 10.15
86 11.55 10.26
30.1 10.12 8.08
59 9.22 11.00
44 8.76 10.40
83 9.93 10.49
77.1 11.50 10.94
73 8.34 8.90
7 8.55 7.09
58.2 11.05 8.58
4.1 11.01 11.77
37 10.29 10.80
84 10.20 9.78
87.3 10.32 8.32
50 10.23 9.72
8.1 9.83 8.96
44.1 8.76 10.40
76.3 8.78 10.15
74 11.55 11.30
51.2 10.72 10.23
28 11.25 10.13
53 10.54 9.43
51.3 10.72 10.23
47 7.65 8.56
99 10.26 11.21
96.1 9.33 8.62
23 7.58 11.08
43 9.20 8.84
40 8.68 9.94
31 10.16 10.32
36.1 10.15 10.85
62.1 11.18 8.71
55 10.40 10.14
20 9.95 9.71
100.1 10.52 9.94
80 10.34 7.09
86.1 11.55 10.26
6.2 9.75 9.83
56 9.01 9.77
51.4 10.72 10.23
45 7.86 7.09
77.2 11.50 10.94
9 8.43 9.79
48 10.28 9.52
36.2 10.15 10.85
69 10.92 10.14
3 8.07 11.98
38 9.41 9.60
Conduct an independent sample t-test for each of the re-sampled data to obtain 100 p-values.
I am stuck here. I can conduct a t-test for the two columns, but I am stuck on how to do this for each resample.
I tried unlisting my re-sample and using notation to work within the R object resample to collect T.test but this doesn't work.
resample1<-unlist(resample)
resample1[t.test(resample1, mu = .05), ]  # didn't work
resample1[t.test(resample, mu = .05, alternative = "greater"), ]  # didn't work
I then want to extract the 100 p-values from each T-test. I want to code p-values <0.05 as 1 and p-values >0.05 as 0.
If I had the 100 values, I would use an if-else statement and $ to run the if statement on each row. I would use a for loop and an if-else statement to extract the p-values and code them as 1's and 0's.
I want to compute the mean to see if it is close to 0.05.
thanks
P.S. I am using CTRL-K to put the code in the right format for asking a question, but it's not working for some reason.
You can try a tidyverse approach:
library(tidyverse) # load required packages
library(broom)
set.seed(1234) # set seed for reproducibility
df <- data.frame(x1 = runif(1000), x2=runif(1000)) # some data with 1000 rows
# calculate the pvalues using t.test.
pvals <- map_df(1:100, ~ slice_sample(df, n = 100, replace = TRUE) %>%
  with(., t.test(x1, x2)) %>%
  broom::tidy() %>%
  select(p.value))
pvals
# A tibble: 100 x 1
p.value
<dbl>
1 0.0360
2 0.405
3 0.774
4 0.265
5 0.614
6 0.482
7 0.808
8 0.956
9 0.459
10 0.184
# ... with 90 more rows
purrr's map_df function serves as a loop with 100 iterations.
dplyr's slice_sample samples 100 rows with replacement on each iteration.
broom::tidy is used to extract the p-values easily.
# count p-values above / below 0.05
pvals %>%
  count(p.value > 0.05)
# A tibble: 2 x 2
`p.value > 0.05` n
<lgl> <int>
1 FALSE 8
2 TRUE 92
# mean pvalue
pvals %>%
summarise(mean = mean(p.value))
# A tibble: 1 x 1
mean
<dbl>
1 0.436
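For comparison, the same resample-and-test loop can be written in base R with replicate; a sketch under the same simulated-data assumptions as above (the question's real data file is not available here):

```r
set.seed(1234)
df <- data.frame(x1 = runif(1000), x2 = runif(1000))  # simulated data

# Draw 100 bootstrap resamples of 100 rows each, run an independent
# two-sample t-test on each resample, and keep the p-values.
pvals <- replicate(100, {
  idx <- sample(nrow(df), 100, replace = TRUE)
  t.test(df$x1[idx], df$x2[idx])$p.value
})

# Code p < 0.05 as 1 and p >= 0.05 as 0, as the question asks,
# then take the mean (the empirical rejection rate).
mean(pvals < 0.05)
```

Since x1 and x2 come from the same distribution, that mean should land near the nominal 0.05.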

R data.table, select columns with no NA

I have a table of stock prices here:
https://drive.google.com/file/d/1S666wiCzf-8MfgugN3IZOqCiM7tNPFh9/view?usp=sharing
Some columns have NA's because the company does not exist (until later dates), or the company folded.
What I want to do is select the columns that have no NA's. I use data.table because it is faster. Here is my working code:
example <- fread(file = "example.csv", key = "date")
example_select <- example[,
  lapply(.SD, function(x) not(sum(is.na(x)) > 0))
] %>%
  as.logical(.)
example[, ..example_select]
Is there better (less lines) code to do the same? Thank you!
Try:
example[,lapply(.SD, function(x) {if(anyNA(x)) {NULL} else {x}} )]
There are lots of ways you could do this. Here's how I usually do it - a data.table approach without lapply:
example[, .SD, .SDcols = colSums(is.na(example)) == 0]
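In recent data.table versions (1.12.0 onward, if I recall correctly), .SDcols also accepts a predicate function, which shortens this further; a sketch with a made-up table standing in for example.csv:

```r
library(data.table)

# Made-up table: XYZ is a hypothetical column containing an NA
example <- data.table(date = 1:3,
                      ACU  = c(2.75, 2.75, 2.75),
                      XYZ  = c(1.1, NA, 2.2))

# .SDcols can take a function; only columns where it returns TRUE are kept
res <- example[, .SD, .SDcols = function(x) !anyNA(x)]
res
```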
An answer using tidyverse packages
library(readr)
library(dplyr)
library(purrr)
data <- read_csv("~/Downloads/example.csv")
map2_dfc(data, names(data), .f = function(x, y) {
  column <- tibble("{y}" := x)
  if (any(is.na(column))) {
    return(NULL)
  } else {
    return(column)
  }
})
Output
# A tibble: 5,076 x 11
date ACU ACY AE AEF AIM AIRI AMS APT ARMP ASXC
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001-01-02 2.75 4.75 14.4 8.44 2376 250 2.5 1.06 490000 179.
2 2001-01-03 2.75 4.5 14.5 9 2409 250 2.5 1.12 472500 193.
3 2001-01-04 2.75 4.5 14.1 8.88 2508 250 2.5 1.06 542500 301.
4 2001-01-05 2.38 4.5 14.1 8.88 2475 250 2.25 1.12 586250 301.
5 2001-01-08 2.56 4.75 14.3 8.75 2376 250 2.38 1.06 638750 276.
6 2001-01-09 2.56 4.75 14.3 8.88 2409 250 2.38 1.06 568750 264.
7 2001-01-10 2.56 5.5 14.5 8.69 2310 300 2.12 1.12 586250 274.
8 2001-01-11 2.69 5.25 14.4 8.69 2310 300 2.25 1.19 564375 333.
9 2001-01-12 2.75 4.81 14.6 8.75 2541 275 2 1.38 564375 370.
10 2001-01-16 2.75 4.88 14.9 8.94 2772 300 2.12 1.62 595000 358.
# … with 5,066 more rows
Using Filter :
library(data.table)
Filter(function(x) all(!is.na(x)), fread('example.csv'))
# date ACU ACY AE AEF AIM AIRI AMS APT
# 1: 2001-01-02 2.75 4.75 14.4 8.44 2376.00 250.00 2.50 1.06
# 2: 2001-01-03 2.75 4.50 14.5 9.00 2409.00 250.00 2.50 1.12
# 3: 2001-01-04 2.75 4.50 14.1 8.88 2508.00 250.00 2.50 1.06
# 4: 2001-01-05 2.38 4.50 14.1 8.88 2475.00 250.00 2.25 1.12
# 5: 2001-01-08 2.56 4.75 14.3 8.75 2376.00 250.00 2.38 1.06
# ---
#5072: 2021-03-02 36.95 10.59 28.1 8.77 2.34 1.61 2.48 14.33
#5073: 2021-03-03 38.40 10.00 30.1 8.78 2.26 1.57 2.47 12.92
#5074: 2021-03-04 37.90 8.03 30.8 8.63 2.09 1.44 2.27 12.44
#5075: 2021-03-05 35.68 8.13 31.5 8.70 2.05 1.48 2.35 12.45
#5076: 2021-03-08 37.87 8.22 31.9 8.59 2.01 1.52 2.47 12.15
# ARMP ASXC
# 1: 4.90e+05 178.75
# 2: 4.72e+05 192.97
# 3: 5.42e+05 300.62
# 4: 5.86e+05 300.62
# 5: 6.39e+05 276.25
# ---
#5072: 5.67e+00 3.92
#5073: 5.58e+00 4.54
#5074: 5.15e+00 4.08
#5075: 4.49e+00 3.81
#5076: 4.73e+00 4.15

as.numeric function does not change vector type

I have a column in my dataset with numbers and NAs. When I try to import it, RStudio categorizes it as "character" instead of "numeric".
I tried using the as.numeric function to convert it. I get the warning "NAs introduced by coercion" and then nothing happens.
str(my_data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 131 obs. of 6 variables:
$ labs_TotalChol : num NA 149 149 188 171 147 207 NA 131 136 ...
$ labs_Creatinine : num NA NA NA NA 0.8 1 NA NA 0.7 0.7 ...
$ PET_globalCFR : chr NA "2.87" "2.65" "2.65" ...
$ RHI : num NA 1.49 1.91 1.5 3.03 1.72 1.93 2.67 1.28 2.06 ...
$ PET_avghr_stress : num NA 90 99.7 76 73 ...
$ PET_RPPCorrected_rest: num NA 2.09 2.1 2.24 2.11 2 2.75 1.07 2.72 2.24 ...
I use the code:
as.numeric(my_data$PET_globalCFR)
and get:
[1] NA 2.87 2.65 2.65 2.46 2.80 2.93 2.02 3.77 2.62 2.06 NA 2.73 2.40 2.95 2.97 2.69 2.61 2.17 2.80 2.59 NA 1.87 2.23 NA
[26] 1.34 2.06 2.24 1.94 1.73 1.63 NA 1.72 NA 1.94 1.25 3.38 NA 2.09 2.68 2.91 1.94 2.41 2.50 NA NA 2.79 2.14 3.77 2.10
[51] 2.88 2.07 2.78 NA NA NA 1.54 2.38 2.29 1.40 2.21 2.36 NA 2.30 2.54 2.29 2.28 2.57 3.53 NA 2.34 3.84 1.50 2.19 2.16
[76] 1.20 2.73 1.35 3.48 2.51 1.42 1.74 1.68 NA NA 1.98 NA NA 2.44 1.62 2.99 1.34 1.39 2.16 4.58 1.74 NA 2.21 NA 1.41
[101] 0.95 2.60 2.30 1.67 1.81 1.79 NA 1.60 3.24 3.20 NA 1.46 NA NA NA 2.65 NA NA 2.80 1.67 3.49 NA NA NA NA
[126] NA NA NA 1.54 NA NA
Warning message:
NAs introduced by coercion
Maybe you can import it as character first, then filter to remove NAs (with !is.na()), then convert to numeric.
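One common reason "nothing happens" is that as.numeric() returns a new vector rather than converting the column in place, so the result must be assigned back; a minimal sketch with made-up values in place of the real PET_globalCFR column:

```r
# as.numeric() returns a new vector; it does not modify the column in
# place, so assign the result back to the column:
my_data <- data.frame(PET_globalCFR = c(NA, "2.87", "2.65"),
                      stringsAsFactors = FALSE)
my_data$PET_globalCFR <- as.numeric(my_data$PET_globalCFR)
is.numeric(my_data$PET_globalCFR)  # TRUE
```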

Substituting the results of a calculation

I'm munging data. Specifically, I've opened this PDF http://pubs.acs.org/doi/suppl/10.1021/ja105035r/suppl_file/ja105035r_si_001.pdf and scraped the data from table S4,
1a 1b 1a 1b
1 5.27 4.76 5.09 4.75
2 2.47 2.74 2.77 2.80
4 1.14 1.38 1.12 1.02
6 7.43 7.35 7.22-7.35a 7.25-7.36a
7 7.38 7.34 7.22-7.35a 7.25-7.36a
8 7.23 7.20 7.22-7.35a 7.25-7.36a
9(R) 4.16 3.89 4.12b 4.18b
9(S) 4.16 3.92 4.12b 4.18b
10 1.19 0.91 1.21 1.25
pasted it into notepad and saved it as a txt file.
s4 <- read.table("s4.txt", header=TRUE, stringsAsFactors=FALSE)
gives,
X1a X1b X1a.1 X1b.1
1 5.27 4.76 5.09 4.75
2 2.47 2.74 2.77 2.80
4 1.14 1.38 1.12 1.02
6 7.43 7.35 7.22-7.35a 7.25-7.36a
7 7.38 7.34 7.22-7.35a 7.25-7.36a
8 7.23 7.20 7.22-7.35a 7.25-7.36a
In order to use the data I need to change it all to numeric and remove the letters. Thanks to this link (R regex gsub separate letters and numbers), I can use the following code to get rid of the extraneous letters:
gsub("([[:alpha:]])", "", s4[, 3])
What I want to do now, and the point of the question, is to change the ranges,
"7.22-7.35" "7.22-7.35" "7.22-7.35"
with their means,
"7.29"
Could I use gsub for this? (or would I need to strsplit across the hyphen, combine into a vector and return the mean?).
You need a single regex in strsplit for this task (removing letters and splitting):
s4[] <- lapply(s4, function(x) {
  if (is.numeric(x)) x
  else sapply(strsplit(as.character(x), "-|[[:alpha:]]"),
              function(y) mean(as.numeric(y)))
})
The result:
> s4
X1a X1b X1a.1 X1b.1
1 5.27 4.76 5.090 4.750
2 2.47 2.74 2.770 2.800
4 1.14 1.38 1.120 1.020
6 7.43 7.35 7.285 7.305
7 7.38 7.34 7.285 7.305
8 7.23 7.20 7.285 7.305
Here's an approach that seems to work right on the sample data:
df[] <- lapply(df, function(col) {
  col <- gsub("([[:alpha:]])", "", col)
  col <- ifelse(grepl("-", col),
                mean(as.numeric(unlist(strsplit(col[grepl("-", col)], "-")))),
                col)
  as.numeric(col)
})
> df
# X1a X1b X1a.1 X1b.1
#1 5.27 4.76 5.090 4.750
#2 2.47 2.74 2.770 2.800
#4 1.14 1.38 1.120 1.020
#6 7.43 7.35 7.285 7.305
#7 7.38 7.34 7.285 7.305
#8 7.23 7.20 7.285 7.305
Disclaimer: It only works right if the ranges in each column are all the same (as in the sample data)
Something like this:
mean(as.numeric(unlist(strsplit("7.22-7.35","-"))))
should work (and corresponds to what you had in mind, I guess)
or you can do :
eval(parse(text=paste0("mean(c(",gsub("-",",","7.22-7.35"),"))")))
but I'm not sure this is simpler...
To apply it to a vector :
vec<-c("7.22-7.35","7.22-7.35")
1st solution : sapply(vec, function(x) mean(as.numeric(unlist(strsplit(x,"-")))))
2nd solution : sapply(vec, function(x) eval(parse(text=paste0("mean(c(",gsub("-",",",x),"))"))))
In both cases, you'll get :
7.22-7.35 7.22-7.35
7.285 7.285
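The per-element sapply above generalizes into a small helper (my own naming, not from the original answers) that strips letters and averages any "a-b" range element by element, so ranges may differ within a column:

```r
# Helper: strip letters, then replace any "a-b" range by its mean.
# Plain numbers pass through unchanged (the mean of one value).
range_mean <- function(x) {
  x <- gsub("[[:alpha:]]", "", x)
  sapply(strsplit(x, "-"), function(p) mean(as.numeric(p)))
}

range_mean(c("7.22-7.35a", "4.12b", "1.19"))
# [1] 7.285 4.120 1.190
```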
Also,
library(gsubfn)
indx <- !sapply(s4, is.numeric)
s4[indx] <- lapply(s4[indx], function(x)
sapply(strapply(x, '([0-9.]+)', ~as.numeric(x)), mean))
s4
# X1a X1b X1a.1 X1b.1
#1 5.27 4.76 5.090 4.750
#2 2.47 2.74 2.770 2.800
#4 1.14 1.38 1.120 1.020
#6 7.43 7.35 7.285 7.305
#7 7.38 7.34 7.285 7.305
#8 7.23 7.20 7.285 7.305

R: returning row value when certain number of columns reach certain value

Return the row value when a certain number of columns reach a certain value in the following table:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3.93 3.92 3.74 4.84 4.55 4.67 3.99 4.10 4.86 4.06
2 4.00 3.99 3.81 4.90 4.61 4.74 4.04 4.15 4.92 4.11
3 4.67 4.06 3.88 5.01 4.66 4.80 4.09 4.20 4.98 4.16
4 4.73 4.12 3.96 5.03 4.72 4.85 4.14 4.25 5.04 4.21
5 4.79 4.21 4.04 5.09 4.77 4.91 4.18 4.30 5.10 4.26
6 4.86 4.29 4.12 5.15 4.82 4.96 4.23 4.35 5.15 4.30
7 4.92 4.37 4.19 5.21 4.87 5.01 4.27 4.39 5.20 4.35
8 4.98 4.43 4.25 5.26 4.91 5.12 4.31 4.43 5.25 4.38
9 5.04 4.49 4.31 5.30 4.95 5.15 4.34 4.46 5.29 4.41
10 5.04 4.50 4.49 5.31 5.01 5.17 4.50 4.60 5.30 4.45
11 ...
12 ...
As output, I need a data frame containing the percentage of columns V1-V10 that reach the value of interest (5 in this example):
Rownum Percent
1 0
2 0
3 10
4 20
5 20
6 20
7 33
8 33
9 40
10 50
Many thanks!
If your matrix is mat:
cbind(1:dim(mat)[1],rowSums(mat>5)/dim(mat)[2]*100)
If it's always 0s and 1s across ten columns, I would simply multiply the whole data set by 10 (which gives percentage values in this case). Just use the following code:
# Sample data
set.seed(10)
data <- as.data.frame(do.call("rbind", lapply(seq(9), function(...) {
sample(c(0, 1), 10, replace = TRUE)
})))
rownames(data) <- c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yza")
# Percentages
rowSums(data * 10)
# abc def ghi jkl mno pqr stu vwx yza
# 80 40 80 60 60 10 30 50 50
Ok, so now I believe you want to get the percentage of values in each row that meet some threshold criteria. You give the example > 5. One solution of many is using apply:
apply( df , 1 , function(x) sum( x > 5 )/length(x)*100 )
# 1 2 3 4 5 6 7 8 9 10
# 0 0 10 20 20 20 30 30 40 50
#Thomas' solution will be faster for large data.frames because it converts to a matrix first, and these are faster to operate on.
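Putting it together, the exact two-column data frame the question asks for can be sketched with rowMeans, using simulated data in place of the original table:

```r
set.seed(1)
# Simulated stand-in for the original 10 x 10 table
df <- as.data.frame(matrix(runif(100, 4, 5.5), nrow = 10))

# Percentage of each row's values exceeding the threshold 5
res <- data.frame(Rownum  = seq_len(nrow(df)),
                  Percent = rowMeans(df > 5) * 100)
res
```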
