How can you loop this higher-order function in R?

This question relates to the reply I received here with a nice little function from thelatemail.
The dataframe I'm using is not optimal, but it's what I've got and I'm simply trying to loop this function across all rows.
This is my df
dput(SO_Example_v1)
structure(list(Type = structure(c(3L, 1L, 2L), .Label = c("Community",
"Contaminant", "Healthcare"), class = "factor"), hosp1_WoundAssocType = c(464L,
285L, 24L), hosp1_BloodAssocType = c(73L, 40L, 26L), hosp1_UrineAssocType = c(75L,
37L, 18L), hosp1_RespAssocType = c(137L, 77L, 2L), hosp1_CathAssocType = c(80L,
34L, 24L), hosp2_WoundAssocType = c(171L, 115L, 17L), hosp2_BloodAssocType = c(127L,
62L, 12L), hosp2_UrineAssocType = c(50L, 29L, 6L), hosp2_RespAssocType = c(135L,
142L, 6L), hosp2_CathAssocType = c(95L, 24L, 12L)), .Names = c("Type",
"hosp1_WoundAssocType", "hosp1_BloodAssocType", "hosp1_UrineAssocType",
"hosp1_RespAssocType", "hosp1_CathAssocType", "hosp2_WoundAssocType",
"hosp2_BloodAssocType", "hosp2_UrineAssocType", "hosp2_RespAssocType",
"hosp2_CathAssocType"), class = "data.frame", row.names = c(NA,
-3L))
####################
#what it looks like#
####################
require(dplyr)
df <- tbl_df(SO_Example_v1)
head(df)
Type hosp1_WoundAssocType hosp1_BloodAssocType hosp1_UrineAssocType
1 Healthcare 464 73 75
2 Community 285 40 37
3 Contaminant 24 26 18
Variables not shown: hosp1_RespAssocType (int), hosp1_CathAssocType (int), hosp2_WoundAssocType
(int), hosp2_BloodAssocType (int), hosp2_UrineAssocType (int), hosp2_RespAssocType (int),
hosp2_CathAssocType (int)
The function I have performs a chisq.test() across all categories in df$Type. Ideally the function should switch to fisher.test() if a cell count is <5, but that's a separate issue (extra brownie points for whoever comes up with how to do that, though).
This is the function I'm using to go row by row
func <- Map(
  function(x, y) {
    out <- cbind(x, y)
    final <- rbind(out[1, ], colSums(out[2:3, ]))
    chisq <- chisq.test(final, correct = FALSE)
    chisq$p.value
  },
  SO_Example_v1[grepl("^hosp1", names(SO_Example_v1))],
  SO_Example_v1[grepl("^hosp2", names(SO_Example_v1))]
)
func
But ideally, I'd want it to be something like this:
for(i in 1:nrow(df)){func}
But that doesn't work. A further hitch is that when, for example, row two is taken, the final call looks like this:
func <- Map(
  function(x, y) {
    out <- cbind(x, y)
    final <- rbind(out[2, ], colSums(out[c(1, 3), ]))
    chisq <- chisq.test(final, correct = FALSE)
    chisq$p.value
  },
  SO_Example_v1[grepl("^hosp1", names(SO_Example_v1))],
  SO_Example_v1[grepl("^hosp2", names(SO_Example_v1))]
)
func
So the function should understand that the row it's taking for out[x,] has to be excluded from colSums(). This data.frame only has 3 rows, so it's easy, but I've tried applying this function to a separate data.frame I have that consists of >200 rows, so it would be nice to be able to loop this somehow.
Any help appreciated.
Cheers

You were missing two things:
To select row i, and to select everything except that row, you want to use
u[i] and u[-i]
If an argument passed to Map is shorter than the others, it is recycled, a very general property of the language. You then just have to add an argument to the function for the row you want to compare against the others; it will be recycled across all the items of the vectors passed.
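For instance, a tiny toy illustration of the recycling (made-up data, not from the question):
Map(function(x, i) x[i], list(a = 1:3, b = 4:6), 2)
# the scalar 2 is recycled, so element 2 is picked from both vectors:
# $a
# [1] 2
#
# $b
# [1] 5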
The following does what you asked for
# the function doing the stats
FisherOrChisq <- function(x, y, lineComp) {
  out <- cbind(x, y)
  final <- rbind(out[lineComp, ], colSums(out[-lineComp, ]))
  test <- chisq.test(final, correct = FALSE)
  return(test$p.value)
}

# test of the stat function
FisherOrChisq(SO_Example_v1[grep("^hosp1", names(SO_Example_v1))[1]],
              SO_Example_v1[grep("^hosp2", names(SO_Example_v1))[1]], 2)

# making the loop
result <- c()
for (type in SO_Example_v1$Type) {
  line <- which(SO_Example_v1$Type == type)
  res <- Map(FisherOrChisq,
             SO_Example_v1[grepl("^hosp1", names(SO_Example_v1))],
             SO_Example_v1[grepl("^hosp2", names(SO_Example_v1))],
             line)
  result <- rbind(result, res)
}
colnames(result) <- gsub("^hosp[0-9]+","",colnames(result))
rownames(result) <- SO_Example_v1$Type
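For the brownie points mentioned in the question, here is a hedged sketch (not part of the original answer) of how FisherOrChisq could live up to its name and switch to fisher.test() when the usual rule of thumb is violated, i.e. when any expected cell count is below 5:
FisherOrChisq <- function(x, y, lineComp) {
  out <- cbind(x, y)
  final <- rbind(out[lineComp, ], colSums(out[-lineComp, ]))
  chisq <- suppressWarnings(chisq.test(final, correct = FALSE))
  if (any(chisq$expected < 5)) {
    return(fisher.test(final)$p.value)  # small expected counts: use the exact test
  }
  return(chisq$p.value)
}
Re-running the loop above with this version then mixes chi-squared and Fisher p-values, depending on the cell counts of each contingency table.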
That said, what you are doing is very heavy multiple testing. I would be extremely cautious with the corresponding p-values; at the very least you need to apply a multiple-testing correction such as what is suggested here.
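For example, a minimal sketch of a Benjamini-Hochberg correction applied to the whole result matrix (rbind() on the Map output stores the p-values as a list-matrix, hence the unlist()):
adj <- p.adjust(unlist(result), method = "BH")
result_adj <- matrix(adj, nrow = nrow(result), dimnames = dimnames(result))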

Related

R function to count the number of times a value changes

I am new to R.
I have 3 columns named A1, A2, and ChangeInA that look like this in a dataset:
A1  A2  ChangeInA
10  20  10
24  30  24
22  35  35
54  65  65
15  29  15
The column 'ChangeInA' is always equal to either A1 or A2.
I want to determine the number of times the 3rd column ('ChangeInA') changes.
Is there any function in R to do that?
Let me explain:
From the table, we can see that the 'ChangeInA' column switched twice,
first at row 3 and again at row 5 (note that 'ChangeInA' can only take the values of A1 or A2), so I want an R function to print how many times the switch happened. I can see the changes in the dataset, but I need to show this in R.
Below is code I tried from previous answers:
change <- rleid(rawData$ChangeInA == rawData$A1)
This showed me a value for every row of ChangeInA.
change <- max(rleid(rawData$ChangeInA == rawData$A1))
This showed me the maximum of those values, not the number of switches.
One option is to use rleid from data.table to keep track of when a change occurs in ChangeInA, applied to the condition of whether ChangeInA is equal to A1. Then we can just take the max (minus 1) to get the total number of changes.
library(data.table)
max(rleid(df$ChangeInA == df$A1) - 1)
# 2
Or we could use dplyr with rleid:
library(dplyr)
df %>%
  mutate(rlid = rleid(A1 == ChangeInA) - 1) %>%
  pull(rlid) %>%
  last()
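If you prefer to stay in base R, a hedged alternative that counts the switches directly with diff():
flag <- df$ChangeInA == df$A1   # TRUE when ChangeInA equals A1, FALSE when it equals A2
sum(diff(flag) != 0)            # number of times the column switches
# [1] 2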
Data
df <- structure(list(A1 = c(10L, 24L, 22L, 54L, 15L), A2 = c(20L, 30L,
35L, 65L, 29L), ChangeInA = c(10L, 24L, 35L, 65L, 15L)), class = "data.frame", row.names = c(NA,
-5L))

kmeans complains "NA/NaN/Inf in foreign function call (arg 1)", when there are none?

I'm trying to run kmeans clustering analysis on a relatively simple data frame. However, 
kmeans(sample_data, centers = 4) 
doesn't work, as R states there are "NA/NaN/Inf in foreign function call (arg 1)" (not true). Anyway, I tried
kmeans(na.omit(sample_data), centers = 4)
based on the answers here (and other posts), and that didn't work. The only workaround I found was to exclude the non-numeric column (i.e., the observation names) using
kmeans(sample_data[, 2:5], centers = 4)
Unfortunately, this makes the clusters much less informative, since the points now have numbers instead of names. What's going on? Or how could I get the clustering with the right labels?
Edit: I'm trying to reproduce this procedure / result, but with a different data set. Notice that when the author visualizes the clusters, the points are labelled according to the observations (the states, in that case; or "obs1, obs2, etc." in mine.)
Because of the workaround above (which drops the column with observation names), I get a sequence of numeric labels instead.
Code and dput below:
library(factoextra)
cluster <- kmeans(sample_data, centers = 4) #this doesn't work
cluster <- kmeans(sample_data[, 2:5], centers = 4) #this works
fviz_cluster(cluster, sample_data)
sample_data:
structure(list(name = structure(c(1L, 12L, 19L, 20L, 21L, 22L,
23L, 24L, 25L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 13L,
14L, 15L, 16L, 17L, 18L), .Label = c("obs1", "obs10", "obs11",
"obs12", "obs13", "obs14", "obs15", "obs16", "obs17", "obs18",
"obs19", "obs2", "obs20", "obs21", "obs22", "obs23", "obs24",
"obs25", "obs3", "obs4", "obs5", "obs6", "obs7", "obs8", "obs9"
), class = "factor"), variable1 = c(0, 0.383966783938484, 0.541654398529028,
0.469060314591266, 0.397636449124337, 0.3944696359856, 0.368740430902284,
0.998695171590958, 0.60013559365688, 0.543416096609665, 1, 0.287523586757021,
0.57818096701751, 0.504722587360754, 0.284825226469556, 0.295250085072615,
0.509782836343032, 0.392942062325636, 0.602608457169149, 0.474668174468815,
0.219951650206242, 0.263837738487209, 0.530976492805559, 0.312401708505963,
0.828799458392802), variable2 = c(0, 0.21094954480341, 0.374890541082605,
0.502470003202637, 0.385212751959443, 0.499052863381439, 0.172887314327707,
0.319869014605517, 0.484308813708282, 0.348608342250238, 0.474464311565186,
0.380406312920036, 1, 0.618253544624658, 0.560290273167607, 0.676315913606924,
0.339157532529115, 0.479005841710258, 0.576094917240369, 0.819742646967549,
0.472559283375261, 0.45594685111211, 0.160720270709769, 0.494360626922513,
0.658705091697224), variable3 = c(0, 0.0391726961740698, 0.157000498692027,
0.194883594782107, 0.133290754949737, 0.199085094994071, 0.000551185924636259,
0.418045152251051, 0.434858475480003, 0.443442199844268, 0.257231662911141,
0.195570389942169, 0.46503468971732, 0.358104620337886, 0.391852363829371,
0.39834809992812, 0.258870156344325, 0.38555892877453, 0.480559759927908,
1, 0.15662554228071, 0.279363773961277, 0.11211821625736, 0.180885222092932,
0.339650099009323), variable4 = c(0, 0.0464395032429444, 0.323768557597659,
0.201813172242373, 0.302710768912681, 0.446027132614423, 0.542018940773003,
1, 0.738123811706962, 0.550819613183929, 0.679555989322392, 0.563126171437818,
0.470328070009844, 0.316069092919459, 0.344421820993065, 0.222931758003036,
0.250406547916021, 0.381098780580988, 0.9526031202384, 0.174161621337361,
0.260548409706516, 0.288399563112687, 0.617089845066814, 0.265314653254406,
0.330637996311329)), class = "data.frame", row.names = c(NA,
-25L))
K-means only works on continuous variables.
It probably tried to convert your labels into numbers, and that did not work.
Never include identifier columns in analysis!
Proper data preprocessing is crucial and 90% of the work; you need to understand the requirements precisely. It is not sufficient to just make it run somehow - it is easy to make it run, but return useless results...
The key is to convert the column with the desired labels to row names with
df <- tibble::column_to_rownames(df, var = "labels")
That way the clustering algorithm won't even consider the labels, but they will still be attached to the points in the cluster plot.
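Putting it together for the data in the question, a minimal sketch (assuming the label column is called name, as in sample_data above):
library(factoextra)
# move the identifier column into the row names so kmeans() only sees numbers
sample_data <- tibble::column_to_rownames(sample_data, var = "name")
set.seed(123)  # kmeans() starts from random centers; seed only for reproducibility
cluster <- kmeans(sample_data, centers = 4)
fviz_cluster(cluster, sample_data)  # points are now labelled obs1, obs2, ...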

R data.table get maximum value per row for multiple columns

I've got a data.table in R which looks like that one:
dat <- structure(list(de = c(1470L, 8511L, 3527L, 2846L, 2652L, 831L
), fr = c(14L, 81L, 36L, 16L, 30L, 6L), it = c(9L, 514L, 73L,
37L, 91L, 2L), ro = c(1L, 14L, 11L, 1L, 9L, 0L)), .Names = c("de",
"fr", "it", "ro"), class = c("data.table", "data.frame"), row.names = c(NA,
-6L))
I now wanna create a new data.table (having exactly the same columns) but holding only the maximum value per row. The values in the other columns should simply be NA.
The data.table could have any number of columns (the data.table above is just an example).
The desired output table would look like this:
de fr it ro
1: 1470 NA NA NA
2: 8511 NA NA NA
3: 3527 NA NA NA
4: 2846 NA NA NA
5: 2652 NA NA NA
6: 831 NA NA NA
There are several issues with what the OP is attempting here: (1) this really looks like a case where data should be kept in a matrix rather than a data.frame or data.table; (2) there's no reason to want this sort of output that I can think of; and (3) doing any standard operations with the output will be a hassle.
With that said...
dat2 = dat
is.na(dat2)[-( 1:nrow(dat) + (max.col(dat)-1)*nrow(dat) )] <- TRUE
# or, as #PierreLafortune suggested
is.na(dat2)[col(dat) != max.col(dat)] <- TRUE
# or using the data.table package
dat2 = dat[rep(NA_integer_, nrow(dat)), ]  # all-NA copy with the same columns
mc = max.col(dat)                          # column index of each row's maximum
# note dat[[mc[i]]][i]: with a data.table, dat[i, mc[i]] would just return the number mc[i]
for (i in seq_along(mc)) set(dat2, i = i, j = mc[i], v = dat[[mc[i]]][i])
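For reference, max.col() is doing the per-row work in all of the variants above; a quick check on this data:
max.col(dat)
# [1] 1 1 1 1 1 1   # "de" holds the maximum in every row of the example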
It's not clear to me whether you mean that you want to use the data.table package, or if you are satisfied with making a data.frame using only base functions. It is certainly possible to do the latter.
Here is one solution, which loops over the rows with which.max() and keeps the rectangular structure by starting from an all-NA copy of the data (converted to a plain data.frame first so that standard indexing applies):
dat_df <- as.data.frame(dat)
maxdat <- dat_df
maxdat[] <- NA                          # same shape, every cell set to NA
for (i in seq_len(nrow(dat_df))) {
  j <- which.max(unlist(dat_df[i, ]))   # column holding this row's maximum
  maxdat[i, j] <- dat_df[i, j]
}

How do summarize this data table with dplyr, then run a chisq.test (or similar) on the results and loop it all into one neat function?

This question was embedded in another question I asked here, but as it goes beyond the scope of what I wanted to know in the initial inquiry, I thought it might deserve a separate thread.
I've been trying to come up with a solution for this problem based on the answers I have received here and here using dplyr and the functions written by Khashaa and Jaap.
Using the solutions provided to me (especially from Jaap), I have been able to summarize the raw data I received into a matrix-looking data table
dput(SO_Example_v1)
structure(list(Type = structure(c(3L, 1L, 2L), .Label = c("Community",
"Contaminant", "Healthcare"), class = "factor"), hosp1_WoundAssocType = c(464L,
285L, 24L), hosp1_BloodAssocType = c(73L, 40L, 26L), hosp1_UrineAssocType = c(75L,
37L, 18L), hosp1_RespAssocType = c(137L, 77L, 2L), hosp1_CathAssocType = c(80L,
34L, 24L), hosp2_WoundAssocType = c(171L, 115L, 17L), hosp2_BloodAssocType = c(127L,
62L, 12L), hosp2_UrineAssocType = c(50L, 29L, 6L), hosp2_RespAssocType = c(135L,
142L, 6L), hosp2_CathAssocType = c(95L, 24L, 12L)), .Names = c("Type",
"hosp1_WoundAssocType", "hosp1_BloodAssocType", "hosp1_UrineAssocType",
"hosp1_RespAssocType", "hosp1_CathAssocType", "hosp2_WoundAssocType",
"hosp2_BloodAssocType", "hosp2_UrineAssocType", "hosp2_RespAssocType",
"hosp2_CathAssocType"), class = "data.frame", row.names = c(NA,
-3L))
Which looks as follows
require(dplyr)
df <- tbl_df(SO_Example_v1)
head(df)
Type hosp1_WoundAssocType hosp1_BloodAssocType hosp1_UrineAssocType
1 Healthcare 464 73 75
2 Community 285 40 37
3 Contaminant 24 26 18
Variables not shown: hosp1_RespAssocType (int), hosp1_CathAssocType (int), hosp2_WoundAssocType
(int), hosp2_BloodAssocType (int), hosp2_UrineAssocType (int), hosp2_RespAssocType (int),
hosp2_CathAssocType (int)
The column Type is the type of bacteria; the following columns represent where it was cultured. The numbers are how many times the respective type of bacteria was detected.
I know what my final table should look like, but until now I have been doing it step by step for each comparison and variable, and there must undoubtedly be a way to do this by piping multiple functions in dplyr - but alas, I have not found the answer to this on SO.
Example of what final table should look like
                                Wound
Type                            n Hospital 1 (%)   n Hospital 2 (%)   p-val
Healthcare associated bacteria  464 (60.0)         171 (56.4)         0.28
Community associated bacteria   285 (36.9)         115 (38.0)         0.74
Contaminants                    24 (3.1)           17 (5.6)           0.05
Here the first grouping variable "Wound" is subsequently replaced by "Urine", "Respiratory", ..., and there is a final column termed "All/Total", which is the total number of times each variable in the rows of "Type" was found, summarized across Hospital 1 and 2 and then compared.
What I have done until now is the following, which is very tedious: everything is calculated "by hand" and then I populate the table with all of the results manually.
### Wound cultures & healthcare associated (extracted manually)
# hosp1 464 (yes), 309 (no), 773 wound isolates in total; (% = 464 / 773 * 100)
# hosp2 171 (yes), 132 (no), 303 wound isolates in total; (% = 171 / 303 * 100)
### Then the chisq.test of my contingency table
chisq.test(cbind(c(464,309),c(171,132)),correct=FALSE)
I appreciate that if I run a piped dplyr on the raw data.frame I won't be able to get the exact formatting of my desired table, but there must be a way to at least automate all the steps here and put the results together in a final table that I can export as a .csv file and then just do some final column editing etc.?
Any help is greatly appreciated.
It's ugly, but it works (Sam in the comments is right that this whole issue should probably be addressed by adjusting your data to a clean format before analysing, but anyway):
Map(
  function(x, y) {
    out <- cbind(x, y)
    final <- rbind(out[1, ], colSums(out[2:3, ]))
    chisq.test(final, correct = FALSE)
  },
  SO_Example_v1[grepl("^hosp1", names(SO_Example_v1))],
  SO_Example_v1[grepl("^hosp2", names(SO_Example_v1))]
)
#$hosp1_WoundAssocType
#
# Pearson's Chi-squared test
#
#data: final
#X-squared = 1.16, df = 1, p-value = 0.2815
# etc etc...
Matches your intended result:
chisq.test(cbind(c(464,309),c(171,132)),correct=FALSE)
#
# Pearson's Chi-squared test
#
#data: cbind(c(464, 309), c(171, 132))
#X-squared = 1.16, df = 1, p-value = 0.2815
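If you want the p-values in a flat table that you can write out as .csv (as mentioned in the question), a hedged sketch along the same lines (the file name is just a placeholder):
pvals <- Map(
  function(x, y) {
    out <- cbind(x, y)
    final <- rbind(out[1, ], colSums(out[2:3, ]))
    chisq.test(final, correct = FALSE)$p.value
  },
  SO_Example_v1[grepl("^hosp1", names(SO_Example_v1))],
  SO_Example_v1[grepl("^hosp2", names(SO_Example_v1))]
)
res <- data.frame(
  culture = sub("^hosp1_", "", names(pvals)),  # WoundAssocType, BloodAssocType, ...
  p.value = unname(unlist(pvals))
)
write.csv(res, "chisq_results.csv", row.names = FALSE)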

Combining two data frames if values in one column fall between values in another

I imagine that there's some way to do this with sqldf, though I'm not familiar enough with the syntax of that package to get this to work. Here's the issue:
I have two data frames, each of which describes genomic regions and contains some other data. I have to combine the two if the region described in one df falls within the region of the other df.
One df, g, looks like this (though my real data has other columns)
start_position end_position
1 22926178 22928035
2 22887317 22889471
3 22876403 22884442
4 22862447 22866319
5 22822490 22827551
And another, l, looks like this (this sample has a name column)
name start end
101 GRMZM2G001024 11149187 11511198
589 GRMZM2G575546 24382534 24860958
7859 GRMZM2G441511 22762447 23762447
658 AC184765.4_FG005 26282236 26682919
14 GRMZM2G396835 10009264 10402790
I need to merge the two data frames if the values from the start_position OR end_position columns in g fall within the start-end range in l, returning only the rows in l that have a match. I've been trying to get findInterval() to do the job, but haven't been able to return a merged DF. Any ideas?
My data:
g <- structure(list(start_position = c(22926178L, 22887317L, 22876403L,
22862447L, 22822490L), end_position = c(22928035L, 22889471L,
22884442L, 22866319L, 22827551L)), .Names = c("start_position",
"end_position"), row.names = c(NA, 5L), class = "data.frame")
l <- structure(list(name = structure(c(2L, 12L, 9L, 1L, 8L), .Label = c("AC184765.4_FG005",
"GRMZM2G001024", "GRMZM2G058655", "GRMZM2G072028", "GRMZM2G157132",
"GRMZM2G160834", "GRMZM2G166507", "GRMZM2G396835", "GRMZM2G441511",
"GRMZM2G442645", "GRMZM2G572807", "GRMZM2G575546", "GRMZM2G702094"
), class = "factor"), start = c(11149187L, 24382534L, 22762447L,
26282236L, 10009264L), end = c(11511198L, 24860958L, 23762447L,
26682919L, 10402790L)), .Names = c("name", "start", "end"), row.names = c(101L,
589L, 7859L, 658L, 14L), class = "data.frame")
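Since the question mentions sqldf, here is a hedged sketch of one way to phrase the overlap join (the end column is quoted because END is an SQL keyword):
library(sqldf)
# keep each row of l together with any region of g whose start or end position
# falls inside [l.start, l.end]
sqldf('SELECT l.*, g.start_position, g.end_position
       FROM l JOIN g
         ON g.start_position BETWEEN l.start AND l."end"
         OR g.end_position   BETWEEN l.start AND l."end"')
For the sample data above, only GRMZM2G441511 (22762447-23762447) contains any of the g regions, so each of the five g rows pairs with that single l row; data.table::foverlaps() would be another common route.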
