How to recode in a tidy manner with better looking code - r

I am a medical researcher. I have a very large administrative database where the diagnoses are included in columns with headers dx1 - dx15 (dx = diagnosis). These columns contain numbers/letter codes which are in character form in R. I have written code to run through these dx columns, but would like to rewrite the code in the form of an array. I can do that easily in SAS, but am finding it difficult to do the same in R.
I am attaching the code that I use here:
a <- as.character(c("4578","4551")) # here I initially identify the diagnosis codes that I am interested in.
Then I create a new cancer indicator in my dataframe df and use the code below to identify patients with cancer. The new variable df$cm_cancer will be either 0 or 1 depending on the diagnosis.
The code works but, as you can see, it is not tidy or elegant at all.
df$cm_cancer <- with(df, ifelse((dx3 %in% a | dx4 %in% a | dx5 %in% a |
dx6 %in% a | dx7 %in% a | dx8 %in% a | dx9 %in% a |
dx10 %in% a | dx11 %in% a | dx12 %in% a | dx13 %in% a |
dx14 %in% a | dx15 %in% a), 1, 0))
With SAS, I can do the same with this elegant piece of code:
data df2;
set df;
cancer = 0;
array dgn[15] dx1 - dx15;
do i = 1 to 15;
if dgn[i] in ("4578","4551") then
cancer = 1;
end;
drop i;
run;
I refuse to believe that SAS has better answers for this than R; I'll just admit that I am still a novice with R.
Any help is welcome; believe me, I have tried googling arrays in R, loops in R, anything that would help me rewrite this code better.
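For what it's worth, here is one vectorised way to express the same logic in R; this is only a sketch that assumes the diagnosis columns are literally named dx1 through dx15 (adjust the range if you only want dx3 onwards, as in the snippet above):

a <- c("4578", "4551")
dx_cols <- paste0("dx", 1:15)                 # "dx1" ... "dx15"
# sapply() tests every dx column against the code list, giving a logical
# matrix with one column per dx variable; rowSums() > 0 flags any match
df$cm_cancer <- as.integer(rowSums(sapply(df[dx_cols], `%in%`, a)) > 0)

A row-wise alternative is as.integer(apply(df[dx_cols], 1, function(x) any(x %in% a))), which mirrors the SAS do-loop more literally but is usually slower on large data.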

Related

Problem iterating through list of dataframes

I am working with a database of three-dimensional vectors and am trying to calculate the surface area of the triangles between all possible combinations of three vectors. The goal is to get a list or dataframe containing the area for all possible combinations, each named based on the column names of the respective coordinates (e.g. c1:c2:c3).
For the moment, I get "invalid subscript type 'list'" as an error when running my function for the triangle calculation, but I don't know how else to iterate through my list.
I am generating a list of all possible combinations of coordinates using combn
tridf <- combn(newdata, 3, simplify=FALSE) #newdata contains the coordinates, each column consists of a three-dimensional vector with x, y and z
Example for structure of newdata:
|   | c1      | c2      | c3      | c4      | c5      |
| x | -8.99   | -8.71   | -10.52  | -8.38   | -55.76  |
| y | -267.54 | -266.50 | -266.26 | -279.47 | -243.53 |
| z | -117.85 | -122.87 | -200.95 | -146.96 | -130.40 |
dput(newdata):
structure(list(g = c("-8.993426322937012", "-267.54718017578125",
"-117.85099792480469"), n = c("-8.717547416687012", "-266.50799560546875",
"-122.87059020996094"), ale = c("-10.52885627746582", "-266.2621154785156",
"-200.95721435546875"), rhi = c("-8.382125854492188", "-279.47918701171875",
"-146.96658325195312"), fmo.r = c("-55.76047897338867", "-243.5348663330078",
"-130.4052734375")), row.names = c("V2", "V3", "V4"), class = "data.frame")
which gives me a list of n dataframes through which I now would like to iterate using the following function:
triarea <- function(i){
  newtridf <- as.data.frame(tridf[[i]])
  ab <- as.numeric(newtridf[, 2]) - as.numeric(newtridf[, 1])
  ac <- as.numeric(newtridf[, 3]) - as.numeric(newtridf[, 1])
  c <- as.data.frame(cross(ab, ac)) # cross is a function from library(pracma)
  area <- 0.5 * sqrt(c[1, ]^2 + c[2, ]^2 + c[3, ]^2)
}
When I run this code manually outside the function there is no problem and I always end up with the correct result for area, but when I run it as a function, called using combn
newcombn <- combn(tridf, 1, triarea, simplify=FALSE)
it throws the following error:
Error in tridf[[i]] : invalid subscript type 'list'
I've been searching the web and experimenting for hours now, but I am completely lost, especially as I am relatively new to R and programming in general. I understand that there seems to be a problem with the data being stored in a list, but I do not know how to approach solving this, or how to refer iteratively to the respective column of the dataframe inside the list of dataframes without the need for auxiliary objects like newtridf ...
Thank you very much in advance for your time and help!
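In case it is useful: since tridf already is the list of all three-column combinations, one way around the error is to iterate over that list directly and let the helper take a combination instead of an index. This is a rough sketch based on the dput above, not a tested answer:

library(pracma)  # for cross()

triarea <- function(combo) {
  newtridf <- as.data.frame(combo)             # each combination is three columns
  ab <- as.numeric(newtridf[, 2]) - as.numeric(newtridf[, 1])
  ac <- as.numeric(newtridf[, 3]) - as.numeric(newtridf[, 1])
  cr <- cross(ab, ac)
  0.5 * sqrt(sum(cr^2))                        # area = half the norm of the cross product
}

areas <- sapply(tridf, triarea)
# label each area with the three column names it came from, e.g. "g:n:ale"
names(areas) <- sapply(tridf, function(x) paste(names(x), collapse = ":"))

The second combn() call is what passes a list (rather than an index) into triarea and triggers the "invalid subscript type" error; sapply/lapply over tridf avoids that entirely.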

Updating dataframe based on conditions (over loop) R

I'm having difficulty developing a function/algorithm that updates a dataframe based on certain conditions. I've looked at some answers related to "updating" a dataframe via for loops, but I'm still stuck.
Say I have a dataframe:
df <- data.frame("data_low" = .2143, "data_high" = .7149)
where data_low and data_high are the min and max of some column in a dataframe
I also have two functions:
checker(b[1,])
Takes the values of data_low and data_high, and returns a scalar. If the scalar is less than 1, I'd like to store this row in another dataframe, say "d". Else, I want to split "b" with the following function:
splitter()
splits "b" by the median of data_high and data_low.
I've considered trying to develop this with a loop:
storage <- data.frame(data_low = double(), data_high = double())
for (i in 1:nrow(b)) {
  if (checker(b[i, ]) < 1) {
    storage <- splitter(b[i, ])
  } else {
    temp <- splitter(b[i, ])
    b <- rbind(b, temp)
  }
}
My desired output after two iterations (where checker > 1 for each row):
** Obviously these numbers are picked at random; I'm just hoping to gain some intuition related to looping/updating dataframes based on cases.
starting at i = 0:
| .2143 | .7149 |
i = 2
| .2143 | .4442 | ** Note that splitter() should break this into 2 rows after i = 2 is complete.
| .4442 | .7149 | ** And again here
i = 3
| .2143 | .3002 |
| .3002 | .4442 |
| .4442 | .5630 |
| .5630 | .7149 |
Can anyone give me some tips on how to organize this loop? I'm thinking my issue here is related to rbind and/or the actual updating of b.
I recognize that much of this code isn't reproducible, but I am more interested in the thought process here.
Any help would be greatly appreciated!
You can do this with a nested loop (one for the number of iterations and one for the number of rows in b), or using nested Reduce calls, as shown here.
Reduce(function(x, y) {
  List <- apply(x, 1, function(z) {
    med <- median(c(z[1], z[2]))
    dat <- data.frame(data_low = c(z[1], med), data_high = c(med, z[2]))
    rownames(dat) <- NULL
    return(dat)
  })
  Reduce(function(w, z) rbind(w, z), List)
}, rep(NA, 2), init = df)
One rep:
data_low data_high
1 0.2143 0.4646
2 0.4646 0.7149
Two reps:
data_low data_high
1 0.21430 0.33945
2 0.33945 0.46460
3 0.46460 0.58975
4 0.58975 0.71490
Three reps:
data_low data_high
1 0.214300 0.276875
2 0.276875 0.339450
3 0.339450 0.402025
4 0.402025 0.464600
5 0.464600 0.527175
6 0.527175 0.589750
7 0.589750 0.652325
8 0.652325 0.714900
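For comparison, the nested-loop version mentioned at the top of this answer might look roughly like the sketch below; it inlines the median split, since checker() and splitter() aren't shown in the question:

b <- df                       # start from the original one-row data frame
n_iter <- 2                   # outer loop: number of splitting passes

for (iter in seq_len(n_iter)) {
  pieces <- vector("list", nrow(b))
  for (i in seq_len(nrow(b))) {             # inner loop: rows of b
    lo <- b$data_low[i]
    hi <- b$data_high[i]
    med <- median(c(lo, hi))
    pieces[[i]] <- data.frame(data_low = c(lo, med), data_high = c(med, hi))
  }
  b <- do.call(rbind, pieces)               # rebuild b once per pass
}

b   # after two passes: four rows, matching the "Two reps" output above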

How to pull out columns in r based on various criteria

I have a huge data set in R (1mil+ rows) and 51 columns. One of my columns is "StateFIPS", another is "CountyFIPS", and another is "event type". The rest I do not care about.
Is there an easy way to take that dataframe, pull out all the rows that have "StateFIPS" = 3 AND "CountyFIPS" = 4 AND "event type" = Tornado, and put all those rows into a new dataframe?
Thanks!
We can use subset
df2 <- subset(df1, StateFIPS == 3 & CountyFIPS == 4 & `event type` == "Tornado")
It is quite easy. This should do it (supposing your data.frame is named "data_set")
new_data <- data_set[(data_set$CountyFIPS == 4) &
                       (data_set$`event type` == 'Tornado') &
                       (data_set$StateFIPS == 3), ]
Sure,
You can use the which() command, see https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/which
You can then use any logical conditions (and combine them with & (and) and | (or)).
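A sketch of that approach, assuming the data frame is called data_set and the column really is named "event type" (with a space):

# which() returns the row indices that satisfy all three conditions
keep <- which(data_set$StateFIPS == 3 &
                data_set$CountyFIPS == 4 &
                data_set$`event type` == "Tornado")
new_data <- data_set[keep, ]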

I am trying to create a new column that is conditional upon the contents of existing columns

I am trying to make several new columns in my data frame that are conditional upon the contents of a few existing columns.
In pseudo-code, the arguments basically go "If VariableA is 1 and either VariableB is 2 or VariableC is 3, then make VariableD = 1, and if it does not meet these conditions make it a zero."
I have tried using for loops and ifelse statements, but have had no luck. I know that the logic of my commands is correct, but I am making some error translating it into R syntax, which isn't surprising because I just started using R about a week ago.
Below is a simplified version of what I have tried doing...
Data$VariableD <- ifelse(Data$VariableA == 'Jim' && (Data$VariableB == 2 || Data$VariableC == 3), 1, 0)
It runs without error, but upon examining the contents of VariableD, all cells are filled with NA.
Here is an example using a similar dataset; notice that row 1 meets the criteria. (I can't make a proper table to save my life, but I think it's interpretable.)
|   | VariableA | VariableB | VariableC | VariableD |
| 1 | Jim       | 2         | 4         | NA        |
| 2 | Tom       | 2         | 3         | NA        |
| 3 | Tom       | 3         | 4         | NA        |
Could you please provide the class of your variables? It could be a problem of class (for example, you put == 1 for VariableA, but if VariableA is of class chr it should be == "1").
Could you please also provide your full loop?
Otherwise, please try this:
Data$VariableD <- ifelse(Data$VariableA == 1 & (Data$VariableB == 2 | Data$VariableC == 3), 1, 0)
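The other point worth flagging is that & and | are vectorised, while && and || only look at single values; that mismatch is the usual culprit when ifelse() over whole columns misbehaves. A small sketch with a hypothetical reconstruction of the sample table above (comparing VariableA against "Jim", as in the question's own attempt):

# hypothetical reconstruction of the example data shown in the question
Data <- data.frame(VariableA = c("Jim", "Tom", "Tom"),
                   VariableB = c(2, 2, 3),
                   VariableC = c(4, 3, 4),
                   stringsAsFactors = FALSE)

# single & and | work element-wise over whole columns,
# unlike && and ||, which are meant for single TRUE/FALSE values
Data$VariableD <- ifelse(Data$VariableA == "Jim" &
                           (Data$VariableB == 2 | Data$VariableC == 3), 1, 0)

Data$VariableD   # 1 0 0 -- only row 1 meets the criteria, as expected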

Cross-referencing data frames without using for loops

I'm having an issue with the speed of using for loops to cross-reference 2 data frames. The overall aim is to identify rows in data frame 2 that lie between coordinates specified in data frame 1 (and meet other criteria). e.g. df1:
chr start stop strand
1 chr1 179324331 179327814 +
2 chr21 45176033 45182188 +
3 chr5 126887642 126890780 +
4 chr5 148730689 148734146 +
df2:
chr start strand
1 chr1 179326331 +
2 chr21 45175033 +
3 chr5 126886642 +
4 chr5 148729689 +
My current code for this is:
for (index in 1:nrow(df1)) {
  found_miRNAs <- ""
  curr_row <- df1[index, ]
  for (index2 in 1:nrow(df2)) {
    curr_target <- df2[index2, ]
    if (curr_row$chrm == curr_target$chrm &
        curr_row$start < curr_target$start &
        curr_row$stop > curr_target$start &
        curr_row$strand == curr_target$strand) {
      found_miRNAs <- paste(found_miRNAs, curr_target$start, sep = ":")
    }
  }
  curr_row$miRNAs <- found_miRNAs
  found_log <- rbind(Mcf7_short_aUTRs2, curr_row)
}
My actual data frames are 400 lines for df1 and >100,000 lines for df2, and I am hoping to do 500 iterations, so, as you can imagine, this is unworkably slow. I'm relatively new to R, so any hints for functions that may increase the efficiency of this would be great.
Maybe not fast enough, but probably faster and a lot easier to read:
df1 <- data.frame(foo=letters[1:5], start=c(1,3,4,6,2), end=c(4,5,5,9,4))
df2 <- data.frame(foo=letters[1:5], start=c(3,2,5,4,1))
where <- sapply(df2$start, function (x) which(x >= df1$start & x <= df1$end))
This will give you a list of the relevant rows in df1 for each row in df2. I just tried it with 500 rows in df1 and 50000 in df2. It finished in a second or two.
To add criteria, change the inner function within sapply. If you then want to put where into your second data frame, you could do e.g.
df2$matching_rows <- sapply(where, paste, collapse=":")
But you probably want to keep it as a list, which is a natural data structure for it.
Actually, you can even have a list column in the data frame:
df2$matching_rows <- where
though this is quite unusual.
You've run into two of the most common mistakes people make when coming to R from another programming language: using for loops instead of vector-based operations, and dynamically appending to a data object. I'd suggest that as you get more fluent you take some time to read Patrick Burns' R Inferno; it provides some interesting insight into these and other problems.
As @David Arenburg and @zx8754 have pointed out in the comments above, there are specialized packages that can solve the problem, and the data.table package and @David's approach can be very efficient for larger datasets. But for your case, base R can do what you need very efficiently as well. I'll document one approach here, with a few more steps than necessary for clarity, just in case you're interested:
set.seed(1001)
ranges <- data.frame(beg=rnorm(400))
ranges$end <- ranges$beg + 0.005
test <- data.frame(value=rnorm(100000))
## Add an ID field for duplicate removal:
test$ID <- 1:nrow(test)
## This is where you'd set your criteria. The apply() function is just
## a wrapper for a for() loop over the rows in the ranges data.frame:
out <- apply(ranges, MAR=1, function(x) test[ (x[1] < test$value & x[2] > test$value), "ID"])
selected <- unlist(out)
selected <- unique( selected )
selection <- test[ selected, ]
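If the base R version ever becomes too slow at full scale (400 ranges x 100,000 positions x 500 iterations), the data.table route mentioned above is worth a look. A sketch with foverlaps(), using the column names from the example data frames and treating each df2 position as a zero-width interval (note that foverlaps() matches inclusively, unlike the strict < and > in the original loop):

library(data.table)

dt1 <- as.data.table(df1)               # chr, start, stop, strand
dt2 <- as.data.table(df2)               # chr, start, strand
dt2[, stop := start]                    # foverlaps() needs an interval in both tables

setkey(dt1, chr, strand, start, stop)   # last two key columns define the interval
hits <- foverlaps(dt2, dt1, type = "within", nomatch = 0L)

Each row of hits pairs a df2 position with a df1 range on the same chr and strand that contains it, so the per-range lists of positions can then be built with a grouped paste() instead of the nested loops.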
