Counting observations greater than 20 in R

I have a dataset df in R and am trying to count the observations greater than 20.
sample input df:
df <- data.frame(Ensembl_ID = c("ENSG00000284662", "ENSG00000186827", "ENSG00000186891", "ENSG00000160072", "ENSG00000041988"), FS_glm_1_L_Ad_N1_233_233 = c(NA, "11.0704011098281", "18.5580644869131", NA, NA), FS_glm_1_L_Ad_N10_36_36 = c("25.5660669439994", NA, "17.7371918093936", "17.15620204154", NA), FS_glm_1_L_Ad_N2_115_115 = c("26.5660644083686", NA, "11.4006170885388", "17.9862691299736", "9.83546459757003" ), FS_glm_1_L_Ad_N3_84_84 = c("26.5660644053515", NA, "10.9591563938286", NA, NA), FS_glm_1_L_Ad_N4_65_65 = c("26.5660642078305", NA, "11.1498422647029", "10.5876449860129", "9.84781577969005"), FS_glm_1_L_Ad_N5_64_64 = c("26.5660688201853", NA, "18.613395947125", "10.5753792680759", "11.059101026016"), FS_glm_1_L_Ad_N6_55_55 = c("26.5660644039101", NA, "18.478237966938", "10.543187719545", NA), FS_glm_1_L_Ad_N7_32_32 = c("25.5660669436648", NA, "17.9467280294446", "10.0328888122706", NA), FS_glm_1_L_Ad_N8_31_31 = c("25.566069252448", NA, "17.6805603365895", "17.3419854603055", "9.81610669984747"))
class(df)
[1] "data.frame"
I tried
length(which(as.vector(df[,-1]) > 20))
[1] 11
and
sum(df[,-1] > 20, na.rm=TRUE)
[1] 11
However, the true count is 8, not 11. Why?
The same script works correctly on another data frame, but not on this df.

The columns in this data frame are character, not numeric, so > compares them as strings (character by character) rather than by value. For example:
"2" > "13"
#[1] TRUE
Change the data to numeric before using sum.
df[-1] <- lapply(df[-1], as.numeric)
sum(df[,-1] > 20, na.rm=TRUE)
#[1] 8
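To see the effect in isolation, here is a minimal sketch on a made-up vector (the values are illustrative, not taken from df). When a character vector is compared with a number, R coerces the number to a string, so "9.8" and "3" both count as greater than "20" because their first character sorts after "2".

```r
# Made-up values stored as text, as in the problem data frame
as_text <- c("25.5", "9.8", "11.0", NA, "3")

# String comparison: 20 is coerced to "20" and compared character by character
count_text <- sum(as_text > 20, na.rm = TRUE)

# Numeric comparison after conversion gives the intended count
count_num <- sum(as.numeric(as_text) > 20, na.rm = TRUE)

print(c(text = count_text, numeric = count_num))
```

Checking sapply(df, class) before counting is a quick way to spot columns that were read in as character.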

Related

Joining 'n' number of lists and perform a function in R

I have a dataframe that contains many triplicates (sets of 3 columns), and I have grouped the dataframe so that each triplicate is a separate element of a list.
The example dataset is,
example_data <- structure(list(`1_3ng` = c(69648445400, 73518145600, NA, NA,
73529102400, 75481088000, NA, 73545910600, 74473949200, 77396199900
), `2_3ng` = c(71187990600, 70677690400, NA, 73675407400, 73215342700,
NA, NA, 69996254800, 69795686400, 76951318300), `3_3ng` = c(65032022000,
71248214000, NA, 72393058300, 72025550900, 71041067000, 73604692000,
NA, 73324202000, 75969608700), `4_7-5ng` = c(NA, 65845061600,
75009245100, 64021237700, 66960666600, 69055643600, NA, 64899540900,
NA, NA), `5_7-5ng` = c(65097201700, NA, NA, 69032126500, NA,
70189899800, NA, 74143529100, 69299087400, NA), `6_7-5ng` = c(71964413900,
69048485800, NA, 71281569700, 71167596500, NA, NA, 68389822800,
69322289200, NA), `7_10ng` = c(71420403700, 67552276500, 72888076300,
66491357100, NA, 68165019600, 70876631000, NA, 69174190100, 63782945300
), `8_10ng` = c(NA, 71179401200, 68959365100, 70570182700, 73032738800,
NA, 74807496700, NA, 71812102100, 73855098500), `9_10ng` = c(NA,
70403756100, NA, 70277421000, 69887731700, 69818871800, NA, 71353886700,
NA, 74115466700), `10_15ng` = c(NA, NA, 68487581700, NA, NA,
69056997400, NA, 67780479400, 66804467800, 72291939500), `11_15ng` = c(NA,
63599643700, NA, NA, 60752029700, NA, NA, 63403655600, NA, 64548492900
), `12_15ng` = c(NA, 67344750600, 61610182700, 67414425600, 65946654700,
66166118400, NA, 70830837700, 67288305700, 69911451300)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
After grouping I get four lists, since the example dataset above contains 4 groups. I used the following R code to group the data:
grouping_data<-function(df){ #df= dataframe
df_col<-ncol(df) #calculates no. of columns in dataframe
groups<-sort(rep(0:((df_col/3)-1),3)) #creates user determined groups
id<-list() #creates empty list
for (i in 1:length(unique(groups))){
id[[i]]<-which(groups == unique(groups)[i])} #creates list of groups
names(id)<-paste0("id",unique(groups)) #assigns group based names to the list "id"
data<-list() #creates empty list
for (i in 1:length(id)){
data[[i]]<-df[,id[[i]]]} #creates list of dataframe columns sorted by groups
names(data)<-paste0("data",unique(groups)) #assigns group based names to the list "data"
return(data)}
group_data <-grouping_data(example_data)
Please suggest R code that applies a given function to all the lists at once.
For example, I applied the function below in the following way:
#VSN Normalization
vsnNorm <- function(dat) {
dat<-as.data.frame(dat)
vsnNormed <- suppressMessages(vsn::justvsn(as.matrix(dat)))
colnames(vsnNormed) <- colnames(dat)
row.names(vsnNormed) <- rownames(dat)
return(as.matrix(vsnNormed))
}
And I called it like this:
vsn.dat0 <- vsnNorm(group_data$data0)
vsn.dat1 <- vsnNorm(group_data$data1)
vsn.dat2 <- vsnNorm(group_data$data2)
vsn.dat3 <- vsnNorm(group_data$data3)
vsn.dat <- cbind (vsn.dat0,vsn.dat1,vsn.dat2,vsn.dat3)
It works well.
But the number of triplicates (3-column sets) may change from dataset to dataset, and calling every list by hand each time becomes tedious.
So kindly share code that applies a function to all the resulting lists and combines the results into a single object.
Thank you in advance.
The shortcut you are looking for is lapply combined with do.call. Use cbind so that the result matches your manual cbind call:
vsn.dat <- do.call("cbind", lapply(group_data, vsnNorm))
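As a self-contained sketch of the same pattern, the example below splits a toy data frame into groups of three columns, applies a function to each group, and binds the results column-wise. The toy data and center_cols are assumptions for illustration; center_cols is only a stand-in for vsnNorm so the example runs without the vsn package.

```r
# Toy data: two triplicates (made-up values)
toy <- data.frame(a1 = 1:3, a2 = 4:6, a3 = 7:9,
                  b1 = 11:13, b2 = 14:16, b3 = 17:19)

# Assign every run of `group_size` columns to one group
group_size <- 3
group_id <- (seq_along(toy) - 1) %/% group_size

# split.default() splits a data frame by columns rather than by rows
groups <- split.default(toy, group_id)

# Hypothetical stand-in for vsnNorm: centre each column, return a matrix
center_cols <- function(dat) {
  scale(as.matrix(dat), center = TRUE, scale = FALSE)
}

# Apply the function to every group and bind the results column-wise
combined <- do.call(cbind, lapply(groups, center_cols))
print(dim(combined))
```

Because the number of groups is derived from ncol(toy), the same code works unchanged when a dataset has more or fewer triplicates.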

linear regression model with dplyr on specified columns by name

I have the following data frame, each row containing four dates ("y") and four measurements ("x"):
df = structure(list(x1 = c(69.772808673525, NA, 53.13125414839,
17.3033274666411,
NA, 38.6120670385487, 57.7229000792707, 40.7654208618078, 38.9010405201831,
65.7108936694177), y1 = c(0.765671296296296, NA, 1.37539351851852,
0.550277777777778, NA, 0.83037037037037, 0.0254398148148148,
0.380671296296296, 1.368125, 2.5250462962963), x2 = c(81.3285388496182,
NA, NA, 44.369872853302, NA, 61.0746827226573, 66.3965114460601,
41.4256874481852, 49.5461413070349, 47.0936997726146), y2 =
c(6.58287037037037,
NA, NA, 9.09377314814815, NA, 7.00127314814815, 6.46597222222222,
6.2462962962963, 6.76976851851852, 8.12449074074074), x3 = c(NA,
60.4976916064608, NA, 45.3575294731303, 45.159758146854, 71.8459173097114,
NA, 37.9485456227131, 44.6307631013742, 52.4523342186143), y3 = c(NA,
12.0026157407407, NA, 13.5601157407407, 16.1213657407407, 15.6431018518519,
NA, 15.8986805555556, 13.1395138888889, 17.9432638888889), x4 = c(NA,
NA, NA, 57.3383407228293, NA, 59.3921356160536, 67.4231673171527,
31.853845252547, NA, NA), y4 = c(NA, NA, NA, 18.258125, NA,
19.6074768518519,
20.9696527777778, 23.7176851851852, NA, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
I would like to create an additional column containing the slope of all the y's versus all the x's, for each row (each row is a patient with these 4 measurements).
Here is what I have so far:
df <- df %>% mutate(Slope = lm(vars(starts_with("y") ~
vars(starts_with("x"), data = .)
I am getting an error:
invalid type (list) for variable 'vars(starts_with("y"))'...
What am I doing wrong, and how can I calculate the rowwise slope?
You are using a tidyverse syntax but your data is not tidy...
Maybe you should rearrange your data.frame and rethink the way you store your data.
Here is how to do it in a quick and dirty way (at least if I understood your explanations correctly):
df <- merge(reshape(df[,(1:4)*2-1], dir="long", varying = list(1:4), v.names = "x", idvar = "patient"),
reshape(df[,(1:4)*2], dir="long", varying = list(1:4), v.names = "y", idvar = "patient"))
df$patient <- factor(df$patient)
Then you could loop over the patients, perform a linear regression and get the slopes as a vector:
sapply(levels(df$patient), function(pat) {
coef(lm(y~x,df[df$patient==pat,],na.action = "na.omit"))[2]
})
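If reshaping is not convenient, a row-by-row fit in base R is another option. The sketch below is one possible approach, not the method above: it uses a made-up data frame in the same wide layout (columns starting with "x" and "y"), and row_slope is a hypothetical helper that returns NA when a row has fewer than two complete (x, y) pairs.

```r
# Made-up data in the same wide layout as the question
toy <- data.frame(x1 = c(1, 2), x2 = c(2, 4), x3 = c(3, NA),
                  y1 = c(2, 1), y2 = c(4, 2), y3 = c(6, 5))

# Locate the measurement and date columns by name
x_cols <- grep("^x", names(toy))
y_cols <- grep("^y", names(toy))

# Hypothetical helper: slope of y ~ x for one row, using complete pairs only
row_slope <- function(x, y) {
  ok <- complete.cases(x, y)
  if (sum(ok) < 2) return(NA_real_)
  unname(coef(lm(y[ok] ~ x[ok]))[2])
}

# Fit one regression per row and store the slope
toy$Slope <- vapply(seq_len(nrow(toy)), function(i) {
  row_slope(unlist(toy[i, x_cols]), unlist(toy[i, y_cols]))
}, numeric(1))

print(toy$Slope)
```

This assumes the x and y columns appear in matching order (x1 with y1, x2 with y2, and so on).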

find strings in data.frame to fill in new column

I used dplyr on my data to create a subset of data like this:
dd <- data.frame(ID = c(700689L, 712607L, 712946L, 735907L, 735908L, 735910L, 735911L, 735912L, 735913L, 746929L, 747540L),
`1` = c("eg", NA, NA, "eg", "eg", NA, NA, NA, NA, "eg", NA),
`2` = c(NA, NA, NA, "sk", "lk", NA, NA, NA, NA, "eg", NA),
`3` = c(NA, NA, NA, "sk", "lk", NA, NA, NA, NA, NA, NA),
`4` = c(NA, NA, NA, "lk", "lk", NA, NA, NA, NA, NA, NA),
`5` = c(NA, NA, NA, "lk", "lk", NA, NA, NA, NA, NA, NA),
`6` = c(NA, NA, NA, "lk", "lk", NA, NA, NA, NA, NA, NA))
I now want to check whether any column except ID contains certain strings. In this example I want to create one column with "1" for every ID whose row contains "eg" in any column and "0" for the rest, and likewise a second column that tells me whether there is an "sk" or "lk" in any of the columns. After that, the old columns except ID can be removed from the data.frame.
The difficult part for me is doing this with a dynamic number of columns, as my dplyr-subset will return different amounts of columns based on the specific case, but I need to check every one that is created in every case. I wanted to use unite first to put all strings together but I will have the same problem then: How can I unite all columns except the first ID one.
If this can be solved within dplyr it would be perfect but any working solution is appreciated.
The result should look like this:
result <- data.frame(ID = c(700689L, 712607L, 712946L, 735907L, 735908L, 735910L, 735911L, 735912L, 735913L, 746929L, 747540L),
with_eg = c(1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0),
with_sk_or_lk = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0))
From your description, you want one column to check for "eg" and another column to check for either "lk" or "sk". If this is the case, then the following base R method will work.
dfNew <- cbind(id=dd[1],
eg=pmin(rowSums(dd[-1] == "eg", na.rm=TRUE), 1),
other=pmin(rowSums(dd[-1] == "sk" | dd[-1] == "lk", na.rm=TRUE), 1))
Here, the presence of "eg" is checked across the entire data.frame (except the ID column) and a logical matrix is returned. rowSums adds the TRUE values across each row, with na.rm removing the NAs, and pmin then takes the minimum of the rowSums output and 1, so that any count greater than 1 is replaced by 1 and any 0 is preserved.
The same logic is applied to the construction of the "other" variable, except that the presence of either "lk" or "sk" is checked in the initial logical matrix. Finally, cbind returns a 3 column data.frame with the desired values.
This returns
dfNew
ID eg other
1 700689 1 0
2 712607 0 0
3 712946 0 0
4 735907 1 1
5 735908 1 1
6 735910 0 0
7 735911 0 0
8 735912 0 0
9 735913 0 0
10 746929 1 0
11 747540 0 0
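The same idea can be wrapped in a small helper so that it works for any number of columns and any set of target strings. This is a sketch on made-up data; has_any is a hypothetical name, and it uses %in% (which treats NA as a non-match) instead of == with na.rm.

```r
# Made-up data with a variable number of string columns after ID
toy <- data.frame(ID = 1:3,
                  a = c("eg", NA, "sk"),
                  b = c("eg", NA, "lk"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: 1 if any cell in the row matches a target, else 0
has_any <- function(df, targets) {
  hits <- matrix(as.matrix(df) %in% targets, nrow = nrow(df))
  as.integer(rowSums(hits) > 0)
}

# Check every column except ID, however many there are
result <- data.frame(ID = toy$ID,
                     with_eg = has_any(toy[-1], "eg"),
                     with_sk_or_lk = has_any(toy[-1], c("sk", "lk")))
print(result)
```

Indexing with toy[-1] selects every column except the first, so the number of string columns does not need to be known in advance.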
Here is an admittedly hacky dplyr/purrr solution. Given that your IDs don't seem like they'll ever equal 'eg', 'sk', or 'lk', I haven't included anything to not search the ID column.
library(dplyr)
library(purrr)
dd %>%
split(.$ID) %>%
map_df(~ tibble(
ID = .x$ID,
eg = ifelse(any(.x == 'eg', na.rm = TRUE), 1, 0),
other = ifelse(any(.x == 'lk' | .x == 'sk', na.rm = TRUE), 1, 0)
))

How to update values in a for-loop?

I have a for-loop that initializes 3 vectors (launch_2012, amount, and one_week_bf) and creates a data frame. Then, it predicts a single week of data, inserts it into the vectors (amount and one_week_bf), and recreates the data.frame again; this process is looped 8 times. However, I can't seem to get the data.frame to update with the new amounts. Would anyone be able to assist please?
for (i in 1:8) {
launch_2012 <- c(rep('bf', 5), 'launch', rep('af', 7))
amount <- c(7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA, NA)
one_week_bf <- c(NA, 7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA)
newdata <- data.frame(amount = amount, one_week_bf = one_week_bf, launch = launch_2012, week = week)
predicted <- predict(model0a, newdata)
amount[i+5] <- predicted[i+5]
one_week_bf[i+6] <- predicted[i+5]
View(newdata)
}
It's difficult to be sure since your example is not reproducible, but note two things. First, amount and one_week_bf are re-initialized at the top of every iteration, so whatever was written into them at the end of one pass is overwritten at the start of the next. Second, predict.lm(...) by default has na.action=na.pass, which means that any row in newdata that has an NA value generates NA for the prediction. Since your first pass of newdata has NA in rows 6-13, predicted will have NA in those same elements. This means that amount and one_week_bf will have NA in those elements, which in turn will generate the same newdata each time.
None of this should be in a for loop.
x <- data.frame("launch_2012" = c(rep('bf', 5), 'launch', rep('af', 7)),
"amount"=c(7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA, NA),
"one_week_bf"=c(NA, 7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA))
x$new_amount <- #the replacement from your predict vector
x$new_one_week_bf <- #the replacement from your predict vector
Note that I have no idea what model0a does, so I have only indicated that the new columns should be whatever vector your predict call returns. This will add the new data as new columns.
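If each prediction really does depend on the previous week's predicted amount, a loop is still a reasonable tool, provided the vectors are created once before the loop and only updated inside it. The sketch below assumes a made-up model (amount regressed on one_week_bf, fitted to the five known weeks) as a stand-in for model0a, whose actual formula is not shown in the question.

```r
# Known history (from the question) and a stand-in model for model0a
hist_amount <- c(7946, 6641, 5975, 5378, 5217)
train <- data.frame(amount = hist_amount[-1],
                    one_week_bf = hist_amount[-length(hist_amount)])
model <- lm(amount ~ one_week_bf, data = train)  # assumed formula

# Initialise the vectors ONCE, outside the loop
amount <- c(hist_amount, rep(NA, 8))
one_week_bf <- c(NA, hist_amount, rep(NA, 7))

for (i in 1:8) {
  row <- i + 5
  # Predict one week from the most recent known or predicted value
  newdata <- data.frame(one_week_bf = one_week_bf[row])
  amount[row] <- predict(model, newdata)
  # Feed the prediction forward as next week's lagged value
  if (row < length(amount)) one_week_bf[row + 1] <- amount[row]
}

newdata_all <- data.frame(amount = amount, one_week_bf = one_week_bf)
print(newdata_all)
```

Each pass predicts only the row whose lagged value is already available, so no prediction is requested for a row that still contains NA in a predictor.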

Matrix to data frame with row/columns numbers

I have a 10x10 matrix in R, called run_off. I would like to convert this matrix to a data frame that contains the entries of the matrix (the order doesn't really matter, although I'd prefer it to be filled by row) as well as the row and columns numbers of the entries as separate columns in the data frame, so that for instance element run_off[2,3] has a row in the data frame with 3 columns, the first containing the element itself, the second containing 2 and the third containing 3.
This is what I have so far:
run_off <- matrix(data = c(45630, 23350, 2924, 1798, 2007, 1204, 1298, 563, 777, 621,
53025, 26466, 2829, 1748, 732, 1424, 399, 537, 340, NA,
67318, 42333, -1854, 3178, 3045, 3281, 2909, 2613, NA, NA,
93489, 37473, 7431, 6648, 4207, 5762, 1890, NA, NA, NA,
80517, 33061, 6863, 4328, 4003, 2350, NA, NA, NA, NA,
68690, 33931, 5645, 6178, 3479, NA, NA, NA, NA, NA,
63091, 32198, 8938, 6879, NA, NA, NA, NA, NA, NA,
64430, 32491, 8414, NA, NA, NA, NA, NA, NA, NA,
68548, 35366, NA, NA, NA, NA, NA, NA, NA, NA,
76013, NA, NA, NA, NA, NA, NA, NA, NA, NA)
, nrow = 10, ncol = 10, byrow = TRUE)
df <- data.frame()
for (i in 1:nrow(run_off)) {
for (k in 1:ncol(run_off)) {
claim <- run_off[i,k]
acc_year <- i
dev_year <- k
df[???, "claims"] <- claim # Problem here
df[???, "acc_year"] <- acc_year # and here
df[???, "dev_year"] <- dev_year # and here
}
}
dev_year refers to the column number of the matrix entry and acc_year to the row number. My problem is that I don't know the proper index to use for the data frame.
I am assuming you are not interested in the NA elements? You can use which and the arr.ind = TRUE argument to return a two column matrix of array indices for each value and cbind this to the values, excluding the NA values:
# Get array indices
ind <- which( ! is.na(run_off) , arr.ind = TRUE )
# cbind indices to values
out <- cbind( run_off[ ! is.na( run_off ) ] , ind )
head( as.data.frame( out ) )
# V1 row col
#1 45630 1 1
#2 53025 2 1
#3 67318 3 1
#4 93489 4 1
#5 80517 5 1
#6 68690 6 1
Use t() on the matrix first if you want to fill by row, e.g. which( ! is.na( t( run_off ) ) , arr.ind = TRUE ) (and when you cbind it).
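A minimal sketch of the fill-by-row variant on a small made-up matrix is shown below. After transposing, the "row" index returned by which refers to the original column and the "col" index to the original row, so the two are swapped when the data frame is built. The column names claims, acc_year and dev_year follow the question.

```r
# Small made-up matrix with trailing NAs, like a run-off triangle
m <- matrix(c(10, 20, NA,
              30, NA, NA), nrow = 2, byrow = TRUE)

# Transpose so that column-major traversal walks the original rows
tm <- t(m)
ind <- which(!is.na(tm), arr.ind = TRUE)

# In the transposed matrix, "col" is the original row and "row" the original column
out <- data.frame(claims = tm[!is.na(tm)],
                  acc_year = ind[, "col"],
                  dev_year = ind[, "row"])
print(out)
```

The same code applies to run_off by replacing m with run_off.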
