I'm trying to iteratively loop through subsets of an R df but am having some trouble. df$A contains values from 0 to 1000. I'd like to subset the df based on each unique value of df$A, manipulate that data, save it as a newdf, and then eventually concatenate (rbind) the 1000 generated newdfs into one single df.
My current code for a single iteration (no loops) is like this:
dfA <- 1
dfA_1 <- subset(df, A == dfA)
# ... some ddply commands on dfA_1 altering its length and content ...
EDIT: To clarify, in the single-iteration version, once I have the subset I use ddply to count the rows that contain certain values. Not every subset contains all of those values, so the result can vary in length. To handle subsets of df that have no rows containing the values I expect (i.e., nrow = 0), I have been appending each result to a skeleton df, so that the output is a fixed length for every value of A. How can I incorporate this into a single (or multiple) plyr or dplyr set of code?
My issue with a for loop here is that I need to iterate over the unique values of df$A, not over a simple length or index.
My questions are as follows:
1. How would I use a for loop (or some form of apply) to perform this operation?
2. Can these operations also generate iterative df names in addition to manipulating the data (e.g., the df named dfA_1 would be dfA_x, where x is one of the values of df$A from 1 to 1000)? My current thinking is that I'd then rbind the 1000 dfA_x's, though this seems cumbersome.
Many thanks for any assistance.
You should really use the dplyr package for this. What you want to do would probably take this form:
library(dplyr)
df %>%
  group_by(A) %>%
  summarize(...)
It will be easier to do, easier to read, less prone to error, and faster.
http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
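For the variable-length case described in the edit, counting within groups and then padding with tidyr::complete keeps every subset a fixed length without a skeleton df. A minimal sketch, assuming the counted column is named B and the expected values are 1:5 (both are placeholders):
library(dplyr)
library(tidyr)
df %>%
  group_by(A, B) %>%
  summarize(n = n(), .groups = "drop") %>%
  # pad each A with every expected value of B, filling missing counts with 0
  complete(A, B = 1:5, fill = list(n = 0))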
Related
Hi, sorry if this seems super straightforward, but I'm having trouble figuring out how to get R to look across multiple columns, using something like grepl() with patterns ^x, ^y, ^z or %in% c(x, y, z), and, when there is a match, create a new 1/0 column at the end of the dataframe. I've tried lots of variations, but I only seem to be able to filter down to a smaller dataframe and then add 1s, at which point I lose the original dataframe; or to use many nested ifelse() calls to mutate a 1/0 column in the original dataframe, which feels sloppy and likely to lead to mistakes.
Any thoughts would be appreciated!
I think what you are looking for is the across function; e.g., to look through the first 3 columns of a data.frame using a certain searchPattern, you can do:
library(dplyr)
data %>%
  mutate(Opioid_Specific = rowSums(across(1:3, ~ as.numeric(grepl(searchPattern, .x))))) %>%
  mutate(Opioid_Specific = ifelse(Opioid_Specific >= 1, 1, 0))
(Note that rowSums() is needed here; a bare sum() would collapse across all rows and give every row the same value.)
Another option would be to use the output of a normal (combined) condition as numeric, e.g.:
data %>%
  mutate(Opioid_Specific = as.numeric(grepl(searchPattern, R1) | grepl(searchPattern, R2) | grepl(searchPattern, R3)))
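A quick check on made-up data (the column names R1 to R3 and the pattern come from the question; the values are invented for illustration):
library(dplyr)
data <- tibble(R1 = c("opioid", "x"),
               R2 = c("y", "z"),
               R3 = c("w", "none"))
searchPattern <- "opioid"
data %>%
  mutate(Opioid_Specific = as.integer(rowSums(across(1:3, ~ grepl(searchPattern, .x))) >= 1))
# Opioid_Specific is 1 for the first row and 0 for the second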
I updated the question with pseudocode to better explain what I would like to do.
I have a data.frame named df_sel, with 5064 rows and 215 columns.
Some of the columns (~80) contains integers with a unique identifier for a specific trait (medications). These columns are named "meds_0_1", "meds_0_2", "meds_0_3" etc. as well as "meds_1_1", "meds_1_2", "meds_1_3". Each column may or may not contain any of the integer values I am looking for.
For the specific integer values to look for, some could be grouped under different types of medication, but coded for specific brand names.
metformin = 1140884600 # not grouped
sulfonylurea = c(1140874718, 1140874724, 1140874726) # grouped
If it would be possible to look-up a group of medications, like in a vector format as above, that would be helpful.
I would like to do this:
IF [a specific row]
CONTAINS [the single integer value of interest]
IN [any of the columns within the df starting with "meds_0"]
A_NEW_VARIABLE_METFORMIN = 1 ELSE A_NEW_VARIABLE_METFORMIN = 0
and correspondingly
IF [a specific row]
CONTAINS [any of multiple integer values of interest]
IN [any of the columns within the df starting with "meds_0"]
A_NEW_VARIABLE_SULFONYLUREA = 1 ELSE A_NEW_VARIABLE_SULFONYLUREA = 0
I have managed to create a vector based on column names:
column_names <- names(df_sel) %>% str_subset('^meds_0')
But I haven't gotten any further despite some suggestions below.
I hope you understand better what I am trying to do.
As for the selection of the columns, you could do this by first extracting the names in the way you are doing with a regex, and then using select:
library(dplyr)
library(stringr)
column_names <- names(df_sel) %>%
  str_subset('^meds_0')
relevant_df <- df_sel %>%
  select(all_of(column_names))
I didn't quite get the structure of your variables (whether they are integers, logicals, etc.), so I'm not sure how to continue, but it would probably involve something like summing row-wise across all the relevant columns and keeping the rows whose sum is not 0, like:
library(tibble)  # for add_column()
meds_taken <- rowSums(relevant_df)
df_sel_med_count <- df_sel %>%
  add_column(meds_taken)
At this point you should have your initial df with the relevant data in one column, and you can summarize by subject, medication or whatever in any way you want.
If this is not enough, please edit your question providing a relevant sample of your data (you can do this with the dput function) and I'll edit this answer to add more detail.
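Putting the pieces together, here is a sketch of the indicator columns from the question's pseudocode (the meds_0 prefix and the code vectors come from the question; across(starts_with()) is one possible way to do the lookup, not necessarily the only one):
library(dplyr)
metformin <- 1140884600
sulfonylurea <- c(1140874718, 1140874724, 1140874726)
df_sel <- df_sel %>%
  mutate(
    # 1 if any meds_0 column in the row matches the code(s), else 0
    A_NEW_VARIABLE_METFORMIN = as.integer(rowSums(across(starts_with("meds_0"), ~ .x %in% metformin)) > 0),
    A_NEW_VARIABLE_SULFONYLUREA = as.integer(rowSums(across(starts_with("meds_0"), ~ .x %in% sulfonylurea)) > 0)
  )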
First, I would like to start off by recommending Bioconductor for R libraries, as it sounds like you may be studying biological data. Now to your question.
Although tidyverse is the most widely accepted and 'easy' method, I would recommend in this instance using lapply, as it is extremely fast. From a programming standpoint your code becomes a simple boolean test, as you stated, but I think we can go a little further. Using the built-in data from mtcars:
data(mtcars)
head(mtcars, 6)
target <- 6
# TRUE/FALSE for each row, one logical vector per column
rows <- lapply(mtcars, function(x) x %in% target)
# number of TRUEs per column, and which columns have more than 0 TRUEs
column_sums <- sapply(rows, function(x) sum(x, na.rm = TRUE))
which(column_sums > 0)
This will work with other data types with a few tweaks here and there.
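To get the per-row 1/0 flag the original question asks about, the same booleans can be combined across rows instead of columns (still base R; a sketch reusing the mtcars example above):
# 1 if any column in the row contains the target value, else 0
row_flag <- as.integer(rowSums(sapply(mtcars, function(x) x %in% target)) > 0)
head(row_flag)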
This code normalizes each value in each row (all values end up between -1 and 1).
library(data.table)
dt <- setDT(knime.in)
df <- as.data.frame(t(apply(dt[, -1], 1, function(x) x / sum(x))))
df1 <- cbind(knime.in$Majors_Final, df)
BUT
It is not dynamic. The code "knows" that the String categorical variable is in column one and removes it before running the calculations.
It seems old school, and I suspect it does not make full use of data.table's memory-efficient reference semantics.
QUESTIONS
How do I use the most memory efficient data.table code to achieve the row wise normalization?
How do I exclude all is.character() columns (or include only is.numeric), if I do not know the position or name of these columns?
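For reference, one possible data.table approach (a sketch, assuming knime.in is the input table; selecting columns by is.numeric() addresses the second question):
library(data.table)
dt <- as.data.table(knime.in)
# pick numeric columns by type rather than by position
num_cols <- names(dt)[sapply(dt, is.numeric)]
# element-wise row totals, then divide each numeric column by them in place (by reference)
row_tot <- dt[, Reduce(`+`, .SD), .SDcols = num_cols]
dt[, (num_cols) := lapply(.SD, `/`, row_tot), .SDcols = num_cols]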
I have a large data set (>100,000 rows) and would like to create a new column that sums all previous values of another column.
For a simulated data set test.data with 100,000 rows and 2 columns, I create the new vector that sums the contents of column 2 with:
sapply(1:100000, function(x) sum(test.data[1:x[1],2]))
I append this vector to test.data later with cbind(). This is too slow, however. Is there a faster way to accomplish this, or a way to reference the vector sapply is building from within sapply, so I can just update a running sum instead of performing the whole calculation again?
Per my comment above, it'll be faster if you do a direct assignment and use cumsum instead of sapply (cumsum was built for exactly what you want to do).
This should work:
test.data$sum <- cumsum(test.data[, 2])
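A quick sanity check on toy data (made up here for illustration):
test.data <- data.frame(id = 1:5, value = c(2, 4, 1, 3, 5))
test.data$sum <- cumsum(test.data[, 2])
test.data$sum
# [1]  2  6  7 10 15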
I am a naive user of R and am attempting to come to terms with the 'apply' series of functions which I now need to use due to the complexity of the data sets.
I have a large, ragged data frame that I wish to reshape before conducting a sequence of regression analyses. It is further complicated by having interlaced rows of descriptive data (characters).
My approach to date has been to use a factor to split the data frame into sets with equal row lengths (i.e., a list), then to remove the trailing empty columns, make two new, matching lists (one of data and one of characters), use reshape to produce a common column number, and finally recombine the sets in each list. A simplified example:
myDF <- as.data.frame(rbind(c("v1", as.character(1:10)),
                            c("v1", letters[1:10]),
                            c("v2", c(as.character(1:6), rep("", 4))),
                            c("v2", c(letters[1:6], rep("", 4)))))
myDF[, 1] <- as.factor(myDF[, 1])
myList <- split(myDF, myDF[, 1])
myList[[1]]
I can remove the empty columns for an individual set, and I can split the data frame into two sets from the interlacing rows, but I have been stumped by the syntax for writing a function that applies the following to the whole list, though 'lapply' with 'seq_along' should do it?
Thus for the individual set:
DF <- myList[[2]]
DF <- DF[, !sapply(DF, function(x) all(x == ""))]
DF
(from an earlier answer to a similar but simpler example on this site). I have a large data set and would like an elegant solution (I could use a loop, but that would not use the capabilities of R effectively). Once I have done that, I ought to be able to use the same rationale to reshape the frames and then recombine them.
regards
jac
Try
lapply(split(myDF, myDF$V1), function(x) x[!colSums(x == '')])
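Applied to the example above, each list element keeps only its non-empty columns. Extending it to also separate the interlaced character rows from the data rows might look like this (a sketch; the odd/even row split is an assumption based on the example's alternating data/character structure):
cleaned <- lapply(split(myDF, myDF$V1), function(x) x[!colSums(x == '')])
# odd rows hold the data, even rows the descriptive characters (assumed)
dataList <- lapply(cleaned, function(x) x[seq(1, nrow(x), by = 2), ])
charList <- lapply(cleaned, function(x) x[seq(2, nrow(x), by = 2), ])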