This question already has answers here:
Apply a function to every specified column in a data.table and update by reference
(7 answers)
Closed 2 years ago.
I am currently working in R on a data set that looks somewhat like the following (except it holds millions of rows and more variables):
pid agedays wtkg htcm bmi haz waz whz
1 2 1.92 44.2 9.74 -2.72 -3.23 NA
1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00
1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48
1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14
2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54
2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79
2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18
2 104 5.82 61.3 15.49 0.23 -0.38 -0.70
I am working to create a function that adds the following variables:
haz_1.5, waz_1.5, whz_1.5, htcm_1.5, wtkg_1.5, and bmi_1.5
Each new variable follows the same pattern of criteria; for example:
when !is.na(haz) & agedays > (61-45) & agedays <= (61-15), haz_1.5 will hold the value of haz (and NA otherwise)
The new data set should look like the following (except bmi_1.5, wtkg_1.5, and htcm_1.5 are omitted from the output below, so table sample can fit in box):
pid agedays wtkg htcm bmi haz waz whz haz_1.5 waz_1.5 whz_1.5
1 2 1.92 44.2 9.74 -2.72 -3.23 NA NA NA NA
1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00 -2.21 -3.03 -2.00
1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48 NA NA NA
1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14 NA NA NA
2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54 NA NA NA
2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79 -0.14 -0.58 -0.79
2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18 NA NA NA
2 104 5.82 61.3 15.49 0.23 -0.38 -0.70 NA NA NA
Here's the code that I've tried so far:
measure <- list("haz", "waz", "whz", "htcm", "wtkg", "bmi")
set_1.5_months <- function(x, y, z){
  maled_anthro[!is.na(z) & agedays > (x-45) & agedays <= (x-15), y := z]
}
for(i in 1:length(measure)){
  z <- measure[i]
  y <- paste(measure[i], "1.5", sep="_")
  x <- 61
  maled_anthro_1 <- set_1.5_months(x, y, z)
}
The code above has not been successful. I just end up with a new variable "y" added into the original data table that holds the values "bmi" or "NA". Can someone help me with figuring out where I went wrong with this code?
I'd like to keep the function as similar to the formatting above (easy to change) as I have other similar functions that will need to be created in which the values "1.5" and x==61 will need to be swapped out with other numbers and I like that these are relatively easy to change in the current format.
I believe the following is an idiomatic way to create new columns by applying a function to many existing columns.
Note that I've left the condition as it was, negating it all to make the code as close to the question's as possible.
library(data.table)
setDT(maled_anthro)
set_1.5_months <- function(y, agedays, x = 61){
  z <- y
  is.na(z) <- !(!is.na(y) & agedays > (x - 45) & agedays <= (x - 15))
  z
}
measure <- c("haz", "waz", "whz", "htcm", "wtkg", "bmi")
new_measure <- paste(measure, "1.5", sep = "_")
maled_anthro[, (new_measure) := lapply(.SD, function(y) set_1.5_months(y, agedays, x=61)), .SDcols = measure ]
# pid agedays wtkg htcm bmi haz waz whz haz_1.5 waz_1.5 whz_1.5 htcm_1.5 wtkg_1.5 bmi_1.5
#1: 1 2 1.92 44.2 9.74 -2.72 -3.23 NA NA NA NA NA NA NA
#2: 1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00 -2.21 -3.03 -2.00 49.2 2.68 11.07
#3: 1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48 NA NA NA NA NA NA
#4: 1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14 NA NA NA NA NA NA
#5: 2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54 NA NA NA NA NA NA
#6: 2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79 -0.14 -0.58 -0.79 53.1 3.78 13.41
#7: 2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18 NA NA NA NA NA NA
#8: 2 104 5.82 61.3 15.49 0.23 -0.38 -0.70 NA NA NA NA NA NA
Data
maled_anthro <- read.table(text = "
pid agedays wtkg htcm bmi haz waz whz
1 2 1.92 44.2 9.74 -2.72 -3.23 NA
1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00
1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48
1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14
2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54
2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79
2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18
2 104 5.82 61.3 15.49 0.23 -0.38 -0.70
", header = TRUE)
Related
I am trying to do three steps with my data file.
Resample from the data file 100 times with replacement. My code is below:
Q3<-question.3
Q.3<-data.frame(question.3)
resample <- Q.3[sample(100, replace = TRUE), ]
x1 x2
61 9.49 10.17
18 10.96 9.42
30 10.12 8.08
51 10.72 10.23
70 10.52 9.14
87 10.32 8.32
42 10.21 9.77
72 8.66 10.80
87.1 10.32 8.32
76 8.78 10.15
25 11.11 9.42
29 11.68 12.37
87.2 10.32 8.32
66 7.91 7.96
51.1 10.72 10.23
22 9.24 11.57
90 9.99 10.19
92 10.02 9.30
75 10.04 10.23
8 9.83 8.96
41 9.60 9.86
79 8.59 8.93
5 9.51 9.19
10 9.27 7.50
36 10.15 10.85
68 11.26 8.47
88 9.03 8.71
79.1 8.59 8.93
6 9.75 9.83
58 11.05 8.58
6.1 9.75 9.83
21 9.94 13.32
34 8.56 10.21
10.1 9.27 7.50
67 9.10 9.90
35 9.59 9.65
82 10.61 9.46
62 11.18 8.71
58.1 11.05 8.58
77 11.50 10.94
4 11.01 11.77
71 10.92 9.51
24 10.35 9.23
26 9.91 9.49
29.1 11.68 12.37
96 9.33 8.62
76.1 8.78 10.15
81 13.52 8.69
66.1 7.91 7.96
34.1 8.56 10.21
91 9.93 11.53
100 10.52 9.94
76.2 8.78 10.15
86 11.55 10.26
30.1 10.12 8.08
59 9.22 11.00
44 8.76 10.40
83 9.93 10.49
77.1 11.50 10.94
73 8.34 8.90
7 8.55 7.09
58.2 11.05 8.58
4.1 11.01 11.77
37 10.29 10.80
84 10.20 9.78
87.3 10.32 8.32
50 10.23 9.72
8.1 9.83 8.96
44.1 8.76 10.40
76.3 8.78 10.15
74 11.55 11.30
51.2 10.72 10.23
28 11.25 10.13
53 10.54 9.43
51.3 10.72 10.23
47 7.65 8.56
99 10.26 11.21
96.1 9.33 8.62
23 7.58 11.08
43 9.20 8.84
40 8.68 9.94
31 10.16 10.32
36.1 10.15 10.85
62.1 11.18 8.71
55 10.40 10.14
20 9.95 9.71
100.1 10.52 9.94
80 10.34 7.09
86.1 11.55 10.26
6.2 9.75 9.83
56 9.01 9.77
51.4 10.72 10.23
45 7.86 7.09
77.2 11.50 10.94
9 8.43 9.79
48 10.28 9.52
36.2 10.15 10.85
69 10.92 10.14
3 8.07 11.98
38 9.41 9.60
Conduct an independent sample t-test for each of the re-sampled data to obtain 100 p-values.
I am stuck here. I can conduct a t-test on the two columns, but I am stuck on how to do this for each resample.
I tried unlisting my resample and using indexing notation on the R object resample to run t.test, but this doesn't work:
resample1 <- unlist(resample)
resample1[t.test(resample1, mu = .05), ]   # didn't work
resample1[t.test(resample, mu = .05, alternative = "greater"), ]   # didn't work
I then want to extract the 100 p-values from each t-test. I want to code p-values < 0.05 as 1 and p-values > 0.05 as 0.
If I had the 100 values, I would use a for loop with an if/else statement to extract the p-values and code them as 1s and 0s.
I want to compute the mean to see if it is close to 0.05.
thanks
P.S. I am using Ctrl-K to format the code correctly for asking a question, but it's not working for some reason.
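For completeness, the three steps described above can be sketched in base R with replicate. This is a minimal sketch under assumptions: the data frame here is simulated in place of question.3, and the names Q.3, x1 and x2 follow the question.

```r
# Simulated stand-in for question.3 (the real data would be read in instead)
set.seed(1)
Q.3 <- data.frame(x1 = rnorm(100, mean = 10), x2 = rnorm(100, mean = 10))

# Steps 1-2: resample the rows 100 times with replacement and run an
# independent two-sample t-test on each resample
pvals <- replicate(100, {
  resample <- Q.3[sample(nrow(Q.3), replace = TRUE), ]
  t.test(resample$x1, resample$x2)$p.value
})

# Step 3: code p < 0.05 as 1 and p >= 0.05 as 0, then take the mean
flags <- as.integer(pvals < 0.05)
mean(flags)
```

replicate() evaluates its expression repeatedly and collects the results into a vector, which avoids the explicit for loop and the unlist() step the question attempts.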
You can try a tidyverse approach:
library(tidyverse) # load required packages
library(broom)
set.seed(1234) # set seed for reproducibility
df <- data.frame(x1 = runif(1000), x2=runif(1000)) # some data with 1000 rows
# calculate the pvalues using t.test.
pvals <- map_df(1:100, ~slice_sample(df, n = 100, replace = T) %>%
with(., t.test(x1, x2)) %>%
broom::tidy() %>% select(p.value))
pvals
# A tibble: 100 x 1
p.value
<dbl>
1 0.0360
2 0.405
3 0.774
4 0.265
5 0.614
6 0.482
7 0.808
8 0.956
9 0.459
10 0.184
# ... with 90 more rows
purrr's map_df function serves as a loop with 100 iterations.
dplyr's slice_sample draws 100 rows at random with replacement.
The broom::tidy function is used to easily extract the p-values.
# how many values below 0.05
pvals %>%
count(p.value > 0.05)
# A tibble: 2 x 2
`p.value > 0.05` n
<lgl> <int>
1 FALSE 8
2 TRUE 92
# mean pvalue
pvals %>%
summarise(mean = mean(p.value))
# A tibble: 1 x 1
mean
<dbl>
1 0.436
I have a table of stock prices here:
https://drive.google.com/file/d/1S666wiCzf-8MfgugN3IZOqCiM7tNPFh9/view?usp=sharing
Some columns have NA's because the company does not exist (until later dates), or the company folded.
What I want to do is select the columns that have no NAs. I use data.table because it is faster. Here is my working code:
example <- fread(file = "example.csv", key = "date")
example_select <- example[,
  lapply(.SD, function(x) not(sum(is.na(x)) > 0))
] %>%
  as.logical(.)
example[, ..example_select]
Is there better (fewer lines) code to do the same? Thank you!
Try:
example[,lapply(.SD, function(x) {if(anyNA(x)) {NULL} else {x}} )]
There are lots of ways you could do this. Here's how I usually do it - a data.table approach without lapply:
example[, .SD, .SDcols = colSums(is.na(example)) == 0]
An answer using tidyverse packages
library(readr)
library(dplyr)
library(purrr)
data <- read_csv("~/Downloads/example.csv")
map2_dfc(data, names(data), .f = function(x, y) {
column <- tibble("{y}" := x)
if(any(is.na(column)))
return(NULL)
else
return(column)
})
Output
# A tibble: 5,076 x 11
date ACU ACY AE AEF AIM AIRI AMS APT ARMP ASXC
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001-01-02 2.75 4.75 14.4 8.44 2376 250 2.5 1.06 490000 179.
2 2001-01-03 2.75 4.5 14.5 9 2409 250 2.5 1.12 472500 193.
3 2001-01-04 2.75 4.5 14.1 8.88 2508 250 2.5 1.06 542500 301.
4 2001-01-05 2.38 4.5 14.1 8.88 2475 250 2.25 1.12 586250 301.
5 2001-01-08 2.56 4.75 14.3 8.75 2376 250 2.38 1.06 638750 276.
6 2001-01-09 2.56 4.75 14.3 8.88 2409 250 2.38 1.06 568750 264.
7 2001-01-10 2.56 5.5 14.5 8.69 2310 300 2.12 1.12 586250 274.
8 2001-01-11 2.69 5.25 14.4 8.69 2310 300 2.25 1.19 564375 333.
9 2001-01-12 2.75 4.81 14.6 8.75 2541 275 2 1.38 564375 370.
10 2001-01-16 2.75 4.88 14.9 8.94 2772 300 2.12 1.62 595000 358.
# … with 5,066 more rows
Using Filter :
library(data.table)
Filter(function(x) all(!is.na(x)), fread('example.csv'))
# date ACU ACY AE AEF AIM AIRI AMS APT
# 1: 2001-01-02 2.75 4.75 14.4 8.44 2376.00 250.00 2.50 1.06
# 2: 2001-01-03 2.75 4.50 14.5 9.00 2409.00 250.00 2.50 1.12
# 3: 2001-01-04 2.75 4.50 14.1 8.88 2508.00 250.00 2.50 1.06
# 4: 2001-01-05 2.38 4.50 14.1 8.88 2475.00 250.00 2.25 1.12
# 5: 2001-01-08 2.56 4.75 14.3 8.75 2376.00 250.00 2.38 1.06
# ---
#5072: 2021-03-02 36.95 10.59 28.1 8.77 2.34 1.61 2.48 14.33
#5073: 2021-03-03 38.40 10.00 30.1 8.78 2.26 1.57 2.47 12.92
#5074: 2021-03-04 37.90 8.03 30.8 8.63 2.09 1.44 2.27 12.44
#5075: 2021-03-05 35.68 8.13 31.5 8.70 2.05 1.48 2.35 12.45
#5076: 2021-03-08 37.87 8.22 31.9 8.59 2.01 1.52 2.47 12.15
# ARMP ASXC
# 1: 4.90e+05 178.75
# 2: 4.72e+05 192.97
# 3: 5.42e+05 300.62
# 4: 5.86e+05 300.62
# 5: 6.39e+05 276.25
# ---
#5072: 5.67e+00 3.92
#5073: 5.58e+00 4.54
#5074: 5.15e+00 4.08
#5075: 4.49e+00 3.81
#5076: 4.73e+00 4.15
I have a column in my dataset with numbers and NAs. When I try to import it, RStudio categorizes it as "character" instead of "numeric".
I tried using the as.numeric() function to convert it. I get the warning "NAs introduced by coercion" and then nothing happens.
str(my_data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 131 obs. of 6 variables:
$ labs_TotalChol : num NA 149 149 188 171 147 207 NA 131 136 ...
$ labs_Creatinine : num NA NA NA NA 0.8 1 NA NA 0.7 0.7 ...
$ PET_globalCFR : chr NA "2.87" "2.65" "2.65" ...
$ RHI : num NA 1.49 1.91 1.5 3.03 1.72 1.93 2.67 1.28 2.06 ...
$ PET_avghr_stress : num NA 90 99.7 76 73 ...
$ PET_RPPCorrected_rest: num NA 2.09 2.1 2.24 2.11 2 2.75 1.07 2.72 2.24 ...
I use the code:
as.numeric(my_data$PET_globalCFR)
and get:
[1] NA 2.87 2.65 2.65 2.46 2.80 2.93 2.02 3.77 2.62 2.06 NA 2.73 2.40 2.95 2.97 2.69 2.61 2.17 2.80 2.59 NA 1.87 2.23 NA
[26] 1.34 2.06 2.24 1.94 1.73 1.63 NA 1.72 NA 1.94 1.25 3.38 NA 2.09 2.68 2.91 1.94 2.41 2.50 NA NA 2.79 2.14 3.77 2.10
[51] 2.88 2.07 2.78 NA NA NA 1.54 2.38 2.29 1.40 2.21 2.36 NA 2.30 2.54 2.29 2.28 2.57 3.53 NA 2.34 3.84 1.50 2.19 2.16
[76] 1.20 2.73 1.35 3.48 2.51 1.42 1.74 1.68 NA NA 1.98 NA NA 2.44 1.62 2.99 1.34 1.39 2.16 4.58 1.74 NA 2.21 NA 1.41
[101] 0.95 2.60 2.30 1.67 1.81 1.79 NA 1.60 3.24 3.20 NA 1.46 NA NA NA 2.65 NA NA 2.80 1.67 3.49 NA NA NA NA
[126] NA NA NA 1.54 NA NA
Warning message:
NAs introduced by coercion
Maybe you can import it as character first, then filter out the NAs (!is.na()), then convert to numeric.
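A minimal sketch of that idea, with a toy column standing in for the question's data. Note that as.numeric() returns a new vector, so the result must be assigned back; the original "nothing happens" is typically a missing assignment.

```r
# Toy stand-in for my_data; PET_globalCFR is imported as character
my_data <- data.frame(PET_globalCFR = c(NA, "2.87", "2.65", "2.65"),
                      stringsAsFactors = FALSE)

# Convert and assign back; suppressWarnings() hides the expected
# "NAs introduced by coercion" warning if any stray text remains
my_data$PET_globalCFR <- suppressWarnings(as.numeric(my_data$PET_globalCFR))

str(my_data$PET_globalCFR)  # num [1:4] NA 2.87 2.65 2.65
```

Values that cannot be parsed as numbers become NA, which matches the behavior shown in the question's output.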
I'm munging data. Specifically, I've opened this PDF http://pubs.acs.org/doi/suppl/10.1021/ja105035r/suppl_file/ja105035r_si_001.pdf and scraped the data from table S4,
1a 1b 1a 1b
1 5.27 4.76 5.09 4.75
2 2.47 2.74 2.77 2.80
4 1.14 1.38 1.12 1.02
6 7.43 7.35 7.22-7.35a 7.25-7.36a
7 7.38 7.34 7.22-7.35a 7.25-7.36a
8 7.23 7.20 7.22-7.35a 7.25-7.36a
9(R) 4.16 3.89 4.12b 4.18b
9(S) 4.16 3.92 4.12b 4.18b
10 1.19 0.91 1.21 1.25
pasted it into notepad and saved it as a txt file.
s4 <- read.table("s4.txt", header=TRUE, stringsAsFactors=FALSE)
gives,
X1a X1b X1a.1 X1b.1
1 5.27 4.76 5.09 4.75
2 2.47 2.74 2.77 2.80
4 1.14 1.38 1.12 1.02
6 7.43 7.35 7.22-7.35a 7.25-7.36a
7 7.38 7.34 7.22-7.35a 7.25-7.36a
8 7.23 7.20 7.22-7.35a 7.25-7.36a
In order to use the data I need to change it all to numeric and remove the letters. Thanks to this link, R regex gsub separate letters and numbers, I can use the following code to get rid of the extraneous letters:
gsub("([[:alpha:]])", "", s4[,3])
What I want to do now, and the point of the question, is to change the ranges,
"7.22-7.35" "7.22-7.35" "7.22-7.35"
with their means,
"7.29"
Could I use gsub for this? (or would I need to strsplit across the hyphen, combine into a vector and return the mean?).
You need a single regex in strsplit for this task (removing letters and splitting):
s4[] <- lapply(s4, function(x) {
  if (is.numeric(x)) x
  else sapply(strsplit(as.character(x), "-|[[:alpha:]]"),
              function(y) mean(as.numeric(y)))
})
The result:
> s4
X1a X1b X1a.1 X1b.1
1 5.27 4.76 5.090 4.750
2 2.47 2.74 2.770 2.800
4 1.14 1.38 1.120 1.020
6 7.43 7.35 7.285 7.305
7 7.38 7.34 7.285 7.305
8 7.23 7.20 7.285 7.305
Here's an approach that seems to work right on the sample data:
df[] <- lapply(df, function(col){
  col <- gsub("([[:alpha:]])", "", col)
  col <- ifelse(grepl("-", col),
                mean(as.numeric(unlist(strsplit(col[grepl("-", col)], "-")))),
                col)
  as.numeric(col)
})
> df
# X1a X1b X1a.1 X1b.1
#1 5.27 4.76 5.090 4.750
#2 2.47 2.74 2.770 2.800
#4 1.14 1.38 1.120 1.020
#6 7.43 7.35 7.285 7.305
#7 7.38 7.34 7.285 7.305
#8 7.23 7.20 7.285 7.305
Disclaimer: It only works right if the ranges in each column are all the same (as in the sample data)
Something like this:
mean(as.numeric(unlist(strsplit("7.22-7.35","-"))))
should work (and corresponds to what you had in mind, I guess).
Or you can do:
eval(parse(text=paste0("mean(c(",gsub("-",",","7.22-7.35"),"))")))
but I'm not sure this is simpler...
To apply it to a vector:
vec <- c("7.22-7.35", "7.22-7.35")
1st solution: sapply(vec, function(x) mean(as.numeric(unlist(strsplit(x, "-")))))
2nd solution: sapply(vec, function(x) eval(parse(text = paste0("mean(c(", gsub("-", ",", x), "))"))))
In both cases, you'll get :
7.22-7.35 7.22-7.35
7.285 7.285
Also,
library(gsubfn)
indx <- !sapply(s4, is.numeric)
s4[indx] <- lapply(s4[indx], function(x)
sapply(strapply(x, '([0-9.]+)', ~as.numeric(x)), mean))
s4
# X1a X1b X1a.1 X1b.1
#1 5.27 4.76 5.090 4.750
#2 2.47 2.74 2.770 2.800
#4 1.14 1.38 1.120 1.020
#6 7.43 7.35 7.285 7.305
#7 7.38 7.34 7.285 7.305
#8 7.23 7.20 7.285 7.305
Return the row value when a certain number of columns reach a certain value, from the following table:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3.93 3.92 3.74 4.84 4.55 4.67 3.99 4.10 4.86 4.06
2 4.00 3.99 3.81 4.90 4.61 4.74 4.04 4.15 4.92 4.11
3 4.67 4.06 3.88 5.01 4.66 4.80 4.09 4.20 4.98 4.16
4 4.73 4.12 3.96 5.03 4.72 4.85 4.14 4.25 5.04 4.21
5 4.79 4.21 4.04 5.09 4.77 4.91 4.18 4.30 5.10 4.26
6 4.86 4.29 4.12 5.15 4.82 4.96 4.23 4.35 5.15 4.30
7 4.92 4.37 4.19 5.21 4.87 5.01 4.27 4.39 5.20 4.35
8 4.98 4.43 4.25 5.26 4.91 5.12 4.31 4.43 5.25 4.38
9 5.04 4.49 4.31 5.30 4.95 5.15 4.34 4.46 5.29 4.41
10 5.04 4.50 4.49 5.31 5.01 5.17 4.50 4.60 5.30 4.45
11 ...
12 ...
As output, I need a data frame containing the % of columns V1-V10 that reach the value of interest ('5' in this example):
Rownum Percent
1 0
2 0
3 10
4 20
5 20
6 20
7 33
8 33
9 40
10 50
Many thanks!
If your matrix is mat:
cbind(1:dim(mat)[1],rowSums(mat>5)/dim(mat)[2]*100)
As long as it's always about 0s and 1s with ten columns, I would multiply the whole dataset by 10 (which equals percentage values in this case). Just use the following code:
# Sample data
set.seed(10)
data <- as.data.frame(do.call("rbind", lapply(seq(9), function(...) {
sample(c(0, 1), 10, replace = TRUE)
})))
rownames(data) <- c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yza")
# Percentages
rowSums(data * 10)
# abc def ghi jkl mno pqr stu vwx yza
# 80 40 80 60 60 10 30 50 50
OK, so now I believe you want to get the percentage of values in each row that meet some threshold criterion. You give the example > 5. One solution of many is using apply:
apply( df , 1 , function(x) sum( x > 5 )/length(x)*100 )
# 1 2 3 4 5 6 7 8 9 10
# 0 0 10 20 20 20 30 30 40 50
#Thomas' solution will be faster for large data.frames because it converts to a matrix first, and matrices are faster to operate on.
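As a rough sketch of that point: rowSums compares the whole matrix in one vectorised pass, while apply loops over the rows in R. On simulated data (sizes and threshold here are illustrative) the two approaches agree:

```r
# 10 rows x 10 columns of values around the threshold of 5
set.seed(42)
df <- as.data.frame(matrix(runif(100, min = 4, max = 6), nrow = 10))
mat <- as.matrix(df)

# Vectorised: one comparison over the whole matrix, then row sums
v1 <- rowSums(mat > 5) / ncol(mat) * 100

# Row-wise apply, as in the answer above
v2 <- apply(df, 1, function(x) sum(x > 5) / length(x) * 100)

all.equal(unname(v1), unname(v2))  # TRUE
```

For a large data.frame, doing the as.matrix() conversion once up front avoids apply coercing the data on every call.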