find strings in data.frame to fill in new column - r

I used dplyr on my data to create a subset of data like this:
dd <- data.frame(ID = c(700689L, 712607L, 712946L, 735907L, 735908L, 735910L, 735911L, 735912L, 735913L, 746929L, 747540L),
`1` = c("eg", NA, NA, "eg", "eg", NA, NA, NA, NA, "eg", NA),
`2` = c(NA, NA, NA, "sk", "lk", NA, NA, NA, NA, "eg", NA),
`3` = c(NA, NA, NA, "sk", "lk", NA, NA, NA, NA, NA, NA),
`4` = c(NA, NA, NA, "lk", "lk", NA, NA, NA, NA, NA, NA),
`5` = c(NA, NA, NA, "lk", "lk", NA, NA, NA, NA, NA, NA),
`6` = c(NA, NA, NA, "lk", "lk", NA, NA, NA, NA, NA, NA))
I now want to check whether any column except ID contains certain strings. In this example I want to create one column holding 1 for every ID whose row contains "eg" in any column and 0 for the rest. Likewise, I want one more column that tells me whether any of the columns contains either "sk" or "lk". After that, the old columns (everything except ID) can be removed from the data.frame.
The difficult part for me is doing this with a dynamic number of columns: my dplyr subset returns a different number of columns depending on the specific case, and I need to check every column that is created in every case. I wanted to use unite first to put all strings together, but then I have the same problem: how can I unite all columns except the first (ID) one?
If this can be solved within dplyr it would be perfect but any working solution is appreciated.
The result should look like this:
result <- data.frame(ID = c(700689L, 712607L, 712946L, 735907L, 735908L, 735910L, 735911L, 735912L, 735913L, 746929L, 747540L),
with_eg = c(1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0),
with_sk_or_lk = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0))

From your description, you want one column that checks for "eg" and another column that checks for either "lk" or "sk". If this is the case, then the following base R method will work.
dfNew <- cbind(id=dd[1],
eg=pmin(rowSums(dd[-1] == "eg", na.rm=TRUE), 1),
other=pmin(rowSums(dd[-1] == "sk" | dd[-1] == "lk", na.rm=TRUE), 1))
Here, every column except ID is compared with "eg", which returns a logical matrix. rowSums adds up the TRUE values in each row, with na.rm = TRUE dropping the NAs, and pmin then takes the minimum of each row count and 1, so that any count greater than 1 is capped at 1 and zeros are preserved.
The same logic is applied to the construction of the "other" variable, except that the initial logical matrix checks for the presence of either "lk" or "sk". Finally, cbind returns a three-column data.frame with the desired values.
This returns
dfNew
       ID eg other
1  700689  1     0
2  712607  0     0
3  712946  0     0
4  735907  1     1
5  735908  1     1
6  735910  0     0
7  735911  0     0
8  735912  0     0
9  735913  0     0
10 746929  1     0
11 747540  0     0
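The rowSums/pmin steps described above can be traced on a smaller table. This is only a minimal sketch: the three-row toy data frame and the names hits, counts, flag and flag2 are made up for illustration and are not part of the original answer.

```r
# Toy data (hypothetical): same layout as dd, but only three rows
toy <- data.frame(ID = 1:3,
                  X1 = c("eg", NA, "eg"),
                  X2 = c("eg", NA, "sk"),
                  stringsAsFactors = FALSE)

hits   <- toy[-1] == "eg"              # logical matrix, NA where the value is missing
counts <- rowSums(hits, na.rm = TRUE)  # number of "eg" cells per row: 2 0 1
flag   <- pmin(counts, 1)              # cap each count at 1: 1 0 1

# Equivalent: test the count directly instead of capping it
flag2 <- as.integer(counts > 0)
```

Testing counts > 0 gives the same 0/1 result and may read more directly than capping with pmin.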

Here is an admittedly hacky dplyr/purrr solution. Given that your IDs don't look like they'll ever equal 'eg', 'sk', or 'lk', I haven't excluded the ID column from the search.
library(dplyr)
library(purrr)
dd %>%
split(.$ID) %>%
map_df(~ tibble(  # tibble() replaces the deprecated data_frame()
ID = .x$ID,
eg = ifelse(any(.x == 'eg', na.rm = TRUE), 1, 0),
other = ifelse(any(.x == 'lk' | .x == 'sk', na.rm = TRUE), 1, 0)
))
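If you would rather stay within dplyr and leave the ID column out of the search explicitly, something along these lines should work. It is only a sketch: it assumes dplyr 1.0.4 or later (for if_any()), uses a shortened version of dd, and takes the output column names from the desired result in the question.

```r
library(dplyr)

# Shortened stand-in for dd (data.frame() renames `1`, `2`, ... to X1, X2, ...)
dd <- data.frame(ID = c(700689L, 712607L, 735907L, 746929L),
                 X1 = c("eg", NA, "eg", "eg"),
                 X2 = c(NA, NA, "sk", "eg"),
                 X3 = c(NA, NA, "lk", NA),
                 stringsAsFactors = FALSE)

res <- dd %>%
  mutate(
    # -ID selects every column except ID, however many there are;
    # %in% treats NA as "no match", so no na.rm handling is needed
    with_eg       = as.integer(if_any(-ID, ~ .x %in% "eg")),
    with_sk_or_lk = as.integer(if_any(-ID, ~ .x %in% c("sk", "lk")))
  ) %>%
  select(ID, with_eg, with_sk_or_lk)
```

Because the selection is written as -ID, the same code applies whether the subset has three string columns or thirty.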

Related

counting observations greater than 20 in R

I have a dataset df in R and am trying to count the observations greater than 20.
sample input df:
df <- data.frame(Ensembl_ID = c("ENSG00000284662", "ENSG00000186827", "ENSG00000186891", "ENSG00000160072", "ENSG00000041988"), FS_glm_1_L_Ad_N1_233_233 = c(NA, "11.0704011098281", "18.5580644869131", NA, NA), FS_glm_1_L_Ad_N10_36_36 = c("25.5660669439994", NA, "17.7371918093936", "17.15620204154", NA), FS_glm_1_L_Ad_N2_115_115 = c("26.5660644083686", NA, "11.4006170885388", "17.9862691299736", "9.83546459757003" ), FS_glm_1_L_Ad_N3_84_84 = c("26.5660644053515", NA, "10.9591563938286", NA, NA), FS_glm_1_L_Ad_N4_65_65 = c("26.5660642078305", NA, "11.1498422647029", "10.5876449860129", "9.84781577969005"), FS_glm_1_L_Ad_N5_64_64 = c("26.5660688201853", NA, "18.613395947125", "10.5753792680759", "11.059101026016"), FS_glm_1_L_Ad_N6_55_55 = c("26.5660644039101", NA, "18.478237966938", "10.543187719545", NA), FS_glm_1_L_Ad_N7_32_32 = c("25.5660669436648", NA, "17.9467280294446", "10.0328888122706", NA), FS_glm_1_L_Ad_N8_31_31 = c("25.566069252448", NA, "17.6805603365895", "17.3419854603055", "9.81610669984747"))
class(df)
[1] "data.frame"
I tried
length(which(as.vector(df[,-1]) > 20))
[1] 11
and
sum(df[,-1] > 20, na.rm=TRUE)
[1] 11
However, the true count is 8, not 11. Why?
The same script works correctly in another data frame but not in this df.
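The data in this data frame is character, not numeric, so the comparison is done on strings (the 20 is coerced to "20") rather than on numbers. Strings are compared character by character, which is why the three values starting with "9.8" are counted as greater than 20 and you get 11 instead of 8.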
"2" > "13"
#[1] TRUE
Change the data to numeric before using sum.
df[-1] <- lapply(df[-1], as.numeric)
sum(df[,-1] > 20, na.rm=TRUE)
#[1] 8
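To see the effect in isolation, here is a small sketch using a few of the values from df. The vector name vals and the result names are made up for illustration.

```r
vals <- c("25.5660669439994", "9.83546459757003", "17.15620204154", NA)

# Character comparison: 20 is coerced to "20" and strings are compared
# character by character, so "9.8..." counts as greater than "20"
as_char <- vals > 20               # TRUE  TRUE FALSE NA

# Numeric comparison after conversion
as_num <- as.numeric(vals) > 20    # TRUE FALSE FALSE NA

sum(as_char, na.rm = TRUE)         # 2
sum(as_num, na.rm = TRUE)          # 1
```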

if and for statement for the National inpatient sample

I have a dataset, attached. It has 16 columns.
The first column shows whether patients got surgery or not, and the other 15 are Day1-15 of surgery (coded as 1, 0).
I want to create a new column that satisfies a few conditions. I want that column to show the exact day of the procedure.
If column Day3 for example has a value of 1, I want the new value in the new column to be 3 (if and only if the first column crani and PRDAY3 column have a value of 1), and so on to be applied on all of the days' columns (day1-15).
Would really appreciate your help. Please let me know if you have any questions regarding the dataset or the problem I'm trying to solve.
tts <- function(timedc){
for (i in 15) {
if (TBI$PRDAYi == "1"){
timedc = c(timedc, TBI$PRDAYi)
}
return(timedc)
}
for (i in TBI$crani){
if (TBI$crani == "1"){
tts
}
}
}
*Here tts stands for time to surgery.
I'm getting this error message:
Warning in if (TBI$crani == "1") { :
the condition has length > 1 and only the first element will be used
I want to create a column that has the exact day of the procedure from this database, as above.
dataset below.
> dput(TBI[1:10, 1:6])
structure(list(TBI.crani = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), TBI.PRDAY1 = c(NA,
0, NA, NA, 1, NA, 0, 0, 0, 1), TBI.PRDAY2 = c(NA, 2, NA, NA,
11, NA, NA, 0, 16, 2), TBI.PRDAY3 = c(NA, 2, NA, NA, 0, NA, NA,
0, 0, 2), TBI.PRDAY4 = c(NA, NA, NA, NA, 9, NA, NA, NA, 0, 5),
TBI.PRDAY5 = c(NA, NA, NA, NA, 11, NA, NA, NA, 0, 1)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
I believe the following function does what the question asks for. With the posted data it returns a vector of zeros, since the first column TBI.crani is always zero.
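tts <- function(data){
  # f: position of the first day column equal to 1
  # (NA if all days are NA, 0 if no day equals 1)
  f <- function(x){
    if(all(is.na(x)))
      NA_integer_
    else {
      w <- which(x == 1)
      if(length(w)) w[1] else 0L
    }
  }
  crani <- grep("crani", names(data))    # index of the surgery indicator column
  day_cols <- grep("DAY", names(data))   # indices of the PRDAY* columns
  # row by row: report the day only when crani is 1
  apply(data, 1, function(x){
    if(x[crani] == 1) f(x[day_cols]) else 0L
  })
}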
tts(TBI)
TBI$NEW <- tts(TBI)
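Since the posted rows only produce zeros, a small made-up example may make the behaviour easier to check. The data frame toy below is hypothetical (it is not taken from the National Inpatient Sample), and tts is restated only so that the snippet runs on its own.

```r
# Same logic as the tts() function above, restated for a self-contained example
tts <- function(data) {
  f <- function(x) {
    if (all(is.na(x))) return(NA_integer_)
    w <- which(x == 1)
    if (length(w)) w[1] else 0L
  }
  crani    <- grep("crani", names(data))
  day_cols <- grep("DAY", names(data))
  apply(data, 1, function(x) if (x[crani] == 1) f(x[day_cols]) else 0L)
}

# Hypothetical data: four patients, three day columns
toy <- data.frame(crani  = c(1, 1, 0, 1),
                  PRDAY1 = c(0, NA, 1, NA),
                  PRDAY2 = c(1, NA, 0, 0),
                  PRDAY3 = c(0, NA, 0, 1))

res <- tts(toy)   # 2 NA 0 3
```

Patient 1 had surgery on day 2, patient 2 has no day information at all (NA), patient 3 has crani equal to 0 and so gets 0, and patient 4 had surgery on day 3.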

how to paste an array to rows which contain a certain value in a certain column in R

I would like to copy some values (not the whole row) from one data.frame row into other rows that have a particular value in a particular column. Concretely, it looks like this:
z <- c(NA, NA, 3,4,2,3,5)
x <- c(NA, NA, 2,5,5,3,3)
a <- c("Hank", NA, NA, NA, NA, NA, NA)
b <- c("Hank", NA, NA, NA, NA, NA, NA)
c <- c(NA, NA, NA, NA, NA, NA, NA)
d <- c("Bobby", NA, NA, NA, NA, NA, NA)
df <- as.data.frame(rbind( a, b, c, d, z, x))
Now, I would like to copy df["z", 3:7] into columns 3:7 of the rows that have V1 == "Hank", and df["x", 3:7] into columns 3:7 of the rows that have V1 == "Bobby".
Does anybody have a hint for me? I guess it should be a function with sapply or something like that. Maybe dplyr could give a solution? Thanks for any advice!
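One possible approach, sketched in base R under a couple of assumptions: stringsAsFactors = FALSE is added so the columns stay character, and the helper names cols, hank and bobby are made up here. Note that rbind() on mixed character and numeric vectors produces a character matrix, so the copied values are strings such as "3" and need as.numeric() if numbers are wanted.

```r
z <- c(NA, NA, 3, 4, 2, 3, 5)
x <- c(NA, NA, 2, 5, 5, 3, 3)
a <- c("Hank", NA, NA, NA, NA, NA, NA)
b <- c("Hank", NA, NA, NA, NA, NA, NA)
c <- c(NA, NA, NA, NA, NA, NA, NA)
d <- c("Bobby", NA, NA, NA, NA, NA, NA)
# stringsAsFactors = FALSE is an addition to the question's code
df <- as.data.frame(rbind(a, b, c, d, z, x), stringsAsFactors = FALSE)

cols  <- 3:7
hank  <- which(df$V1 == "Hank")    # which() drops the NA comparisons
bobby <- which(df$V1 == "Bobby")

# Copy column by column; the single source value is recycled over the target rows
for (j in cols) {
  df[hank,  j] <- df["z", j]
  df[bobby, j] <- df["x", j]
}
```

After the loop, rows a and b carry the values of z in columns 3 to 7, row d carries the values of x, and row c is unchanged.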

How to update values in a for-loop?

I have a for-loop that initializes 3 vectors (launch_2012, amount, and one_week_bf) and creates a data frame. It then predicts a single week of data, inserts it into the vectors amount and one_week_bf, and recreates the data.frame; this process is looped 8 times. However, I can't get the data.frame to pick up the new amounts. Would anyone be able to assist, please?
for (i in 1:8) {
launch_2012 <- c(rep('bf', 5), 'launch', rep('af', 7))
amount <- c(7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA, NA)
one_week_bf <- c(NA, 7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA)
newdata <- data.frame(amount = amount, one_week_bf = one_week_bf, launch = launch_2012, week = week)
predicted <- predict(model0a, newdata)
amount[i+5] <- predicted[i+5]
one_week_bf[i+6] <- predicted[i+5]
View(newdata)
}
It's difficult to be sure since your example is not reproducible, but note that predict.lm(...) has na.action = na.pass by default, which means that any row in newdata containing an NA generates NA for its prediction. Since your first pass of newdata has NA in rows 6-13, predicted will have NA in those same elements. This means that amount and one_week_bf will still have NA in those elements, which in turn will generate the same newdata each time. Note also that amount and one_week_bf are re-initialized at the top of every iteration, so anything written during the previous pass is overwritten.
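One way to keep the updates is to initialize the vectors once, before the loop, and to predict only the row that is being filled. The following is a minimal sketch under loud assumptions: model0a is not shown in the question, so a hypothetical stand-in (a linear model of amount on the previous week's amount) is fitted here, and the launch and week variables are left out.

```r
# Initialize once, outside the loop
amount      <- c(7946, 6641, 5975, 5378, 5217, rep(NA, 8))
one_week_bf <- c(NA, amount[-length(amount)])   # previous week's amount

# Hypothetical stand-in for model0a; lm() drops the incomplete rows when fitting
train <- data.frame(amount = amount, one_week_bf = one_week_bf)
model <- lm(amount ~ one_week_bf, data = train)

for (i in 1:8) {
  row  <- i + 5
  # Predict a single week from the value carried over from the week before
  pred <- predict(model, data.frame(one_week_bf = one_week_bf[row]))
  amount[row] <- pred
  # Feed the prediction forward as next week's lagged value
  if (row < length(amount)) one_week_bf[row + 1] <- pred
}

newdata <- data.frame(amount = amount, one_week_bf = one_week_bf)
```

Each pass then sees the value produced by the previous pass, because nothing resets the vectors in between.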
None of this should be in a for loop.
x <- data.frame("launch_2012" = c(rep('bf', 5), 'launch', rep('af', 7)),
"amount"=c(7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA, NA),
"one_week_bf"=c(NA, 7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA))
x$new_amount <- #the replacement from your predict vector
x$new_one_week_bf <- #the replacement from your predict vector
Note that I have no idea what model0a does, so I have just indicated that the new columns should be whatever vector your predict call returns. This adds the new data as new columns.

Matrix to data frame with row/columns numbers

I have a 10x10 matrix in R, called run_off. I would like to convert it to a data frame that contains the entries of the matrix (the order doesn't really matter, although I'd prefer it to be filled by row) together with the row and column numbers of the entries as separate columns. For instance, element run_off[2,3] should get a row in the data frame with 3 columns: the first containing the element itself, the second containing 2 and the third containing 3.
This is what I have so far:
run_off <- matrix(data = c(45630, 23350, 2924, 1798, 2007, 1204, 1298, 563, 777, 621,
53025, 26466, 2829, 1748, 732, 1424, 399, 537, 340, NA,
67318, 42333, -1854, 3178, 3045, 3281, 2909, 2613, NA, NA,
93489, 37473, 7431, 6648, 4207, 5762, 1890, NA, NA, NA,
80517, 33061, 6863, 4328, 4003, 2350, NA, NA, NA, NA,
68690, 33931, 5645, 6178, 3479, NA, NA, NA, NA, NA,
63091, 32198, 8938, 6879, NA, NA, NA, NA, NA, NA,
64430, 32491, 8414, NA, NA, NA, NA, NA, NA, NA,
68548, 35366, NA, NA, NA, NA, NA, NA, NA, NA,
76013, NA, NA, NA, NA, NA, NA, NA, NA, NA)
, nrow = 10, ncol = 10, byrow = TRUE)
df <- data.frame()
for (i in 1:nrow(run_off)) {
for (k in 1:ncol(run_off)) {
claim <- run_off[i,k]
acc_year <- i
dev_year <- k
df[???, "claims"] <- claim # Problem here
df[???, "acc_year"] <- acc_year # and here
df[???, "dev_year"] <- dev_year # and here
}
}
dev_year refers to the column number of the matrix entry and acc_year to the row number. My problem is that I don't know the proper index to use for the data frame.
I am assuming you are not interested in the NA elements? You can use which and the arr.ind = TRUE argument to return a two column matrix of array indices for each value and cbind this to the values, excluding the NA values:
# Get array indices
ind <- which( ! is.na(run_off) , arr.ind = TRUE )
# cbind indices to values
out <- cbind( run_off[ ! is.na( run_off ) ] , ind )
head( as.data.frame( out ) )
#     V1 row col
#1 45630   1   1
#2 53025   2   1
#3 67318   3   1
#4 93489   4   1
#5 80517   5   1
#6 68690   6   1
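Use t() on the matrix first if you want to fill by row, e.g. which( ! is.na( t( run_off ) ) , arr.ind = TRUE ), and use the transposed matrix when you cbind the values as well. Note that the row and col indices then refer to the transposed matrix, so they are swapped relative to run_off.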
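A row-wise version might look like the sketch below. It uses a small 2x3 stand-in matrix rather than the full run_off, and the column names claims, acc_year and dev_year are borrowed from the question.

```r
# Small stand-in for run_off (2 accident years, 3 development years)
run_off <- matrix(c(1, 2, NA,
                    4, 5, 6), nrow = 2, byrow = TRUE)

tr  <- t(run_off)                            # transpose so that which() walks row by row
ind <- which(!is.na(tr), arr.ind = TRUE)     # indices refer to the transposed matrix

out <- data.frame(claims   = tr[!is.na(tr)],
                  acc_year = ind[, "col"],   # column of t(run_off) = row of run_off
                  dev_year = ind[, "row"])   # row of t(run_off) = column of run_off
```

Here out lists the claims 1, 2, 4, 5, 6 in row order, with acc_year 1, 1, 2, 2, 2 and dev_year 1, 2, 1, 2, 3.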

Resources