How to dynamically add pairs of columns - R

I have a dataframe with columns like this:
Income | Wt | Ht | Growth_Income | Growth_Wt | Growth_Ht
Each column has 300 rows of numeric values. I would like to find a way to add together the columns that correspond to each other (e.g. Income and Growth_Income). I would also like to populate the dataframe so that the summation is done five times, with each iteration based on the previous output.
Sorry, I'm quite new to R and I haven't figured out how to write the code yet. In Excel it would be easy to drag the formula down, but I need to code it in R because otherwise my program won't work. I hope someone can help me out here.

Assuming that the patterns are as shown in the example, we remove the prefix (everything up to and including the underscore) from the column names, use that to split the dataset into a list of data.frames, and loop through the list to get the rowSums:
# strip everything up to and including "_" (e.g. "Growth_Income" becomes "Income")
nm1 <- sub(".*_", "", names(df1))
# split the columns into groups sharing a base name and sum each group row-wise
sapply(split.default(df1, nm1), rowSums)
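If the five repeated summations mean that the Growth_* columns should be added to the running totals on every pass (an assumption; the question does not spell out the update rule), a minimal sketch of the iteration could look like this, reusing df1 and the column naming from the example:
# assumption: each iteration adds the Growth_* columns to the current totals
growth_cols <- grep("^Growth_", names(df1), value = TRUE)
base_cols   <- sub("^Growth_", "", growth_cols)
totals <- df1[base_cols]
for (i in 1:5) {
  totals <- totals + df1[growth_cols]   # previous output feeds the next iteration
}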

Related

Appending to empty df in for loop R

I have looked at similar answers, but none of them quite addresses my task.
I have found a very messy answer to my question, but would like advice as to whether there is a simpler way.
I have a list of files containing many tables that I want to import into R, appending their columns to an empty df.
The rownames or column 1 will be the same for each imported table/df, but the number of columns (sample_ids) will change.
At the moment I create a vector outside the loop and name it with the row names that I know won't change. Then I loop through the dfs and do a left_join using the same column name.
Something like this (placeholders in words where I haven't written real code yet):
# start from a one-column data frame holding the row names that won't change
final_df <- data.frame(Sample_ID = c(<the row names that I want to extract>))
for (i in seq_along(files)) {
  my_df <- read_tsv(files[i])
  # get the table-specific sample names
  my_sn <- my_df[15, -c(1:3)]
  # keep only the rows I want to extract
  my_df <- filter(my_df, <rows I want to extract>)
  names(my_df) <- c("Sample_ID", my_sn)
  final_df <- left_join(final_df, my_df, by = "Sample_ID")
}
I'm thinking there must be a more elegant way.
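One way to tighten this up, sketched under the assumption that each file can be handled by a single helper function (read_one, genes_of_interest and the Gene column below are hypothetical stand-ins, not from the question), is to build the per-file data frames with lapply() and collapse them with one Reduce(left_join, ...):
library(readr)
library(dplyr)
# hypothetical helper: read one file and return Sample_ID plus its sample columns
read_one <- function(path) {
  my_df <- read_tsv(path)
  my_sn <- my_df[15, -c(1:3)]                           # table-specific sample names, as in the question
  my_df <- filter(my_df, Gene %in% genes_of_interest)   # stand-in for "rows I want to extract"
  names(my_df) <- c("Sample_ID", my_sn)
  my_df
}
final_df <- Reduce(function(a, b) left_join(a, b, by = "Sample_ID"),
                   lapply(files, read_one))
If the Sample_ID skeleton from the original loop is still needed as a starting point, it can be passed to Reduce() via its init argument.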

Conditional search across multiple columns in a very large dataframe, to create a 1/0 column for other analysis using R

Hi, sorry if this seems super straightforward, but I'm having trouble figuring out how to get R to look across multiple columns, using something like grepl(^x, ^y, ^z) or %in% c(x, y, z), and, when there is a match, create a new 1/0 column at the end of the dataframe. I've tried lots of variations, but I only seem to be able to filter down to a smaller dataframe and then add the 1s, at which point I lose the original dataframe; or I use many nested ifelse calls to mutate a 1/0 column in the original dataframe, which feels sloppy and likely to lead to mistakes.
Any thoughts would be appreciated!
I think what you are looking for is the across function; e.g. to look through the first 3 columns of a data.frame for a certain searchPattern you can do:
library(dplyr)
data %>%
  mutate(Opioid_Specific = rowSums(across(1:3, ~ grepl(searchPattern, .x)))) %>%
  mutate(Opioid_Specific = ifelse(Opioid_Specific >= 1, 1, 0))
Another option would be to use the output of a normal (combined) condition as numeric, e.g.:
data %>%
  mutate(Opioid_Specific = as.numeric(grepl(searchPattern, R1) | grepl(searchPattern, R2) | grepl(searchPattern, R3)))
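As a minimal, self-contained illustration of the across() approach (the column names R1-R3, the toy values, and the pattern "opioid" are invented for this sketch):
library(dplyr)
data <- tibble(R1 = c("opioid misuse", "none"),
               R2 = c("none", "fracture"),
               R3 = c("headache", "sprain"))
searchPattern <- "opioid"
data %>%
  mutate(Opioid_Specific = rowSums(across(1:3, ~ grepl(searchPattern, .x)))) %>%
  mutate(Opioid_Specific = ifelse(Opioid_Specific >= 1, 1, 0))
# row 1 gets Opioid_Specific = 1 (match in R1), row 2 gets 0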

Find whether a row in a data table contains at least one word from a list

I am quite new to R and data tables, so my question will probably sound obvious, but I searched through questions here for similar issues and couldn't find a solution.
Initially, I have a data table in which one of the columns contains fields with many values (in fact these values are all separate words) joined together by &&&&. I also have a list of words. This list is big and has 38,000 different words, but for the purpose of the example let's say that it is small:
list <- c('word1', 'word2', 'word3')
What I need is to filter the data table so that I only keep rows that contain at least one word from the list of words.
I split the data on &&&& and created a list:
fields_with_words <- strsplit(data_final$fields_with_words, "&&&&")
But I don't know which function I should use to check whether a row from my data table has at least one word from the list. Can you give me some clues?
Try:
data_final[sapply(strsplit(data_final$fields_with_words, "&&&&"),
                  function(x) any(x %in% word_list)), ]
I have used word_list instead of list here since list is a built-in function in R.
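For concreteness, a toy example of this filter (data_final and word_list below are invented for illustration):
library(data.table)
data_final <- data.table(
  id = 1:3,
  fields_with_words = c("word1&&&&foo", "bar&&&&baz", "qux&&&&word3")
)
word_list <- c("word1", "word2", "word3")
data_final[sapply(strsplit(data_final$fields_with_words, "&&&&"),
                  function(x) any(x %in% word_list)), ]
# keeps rows 1 and 3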
Assuming you want to scan the x variable in df against the list of words lw <- c("word1", "word2", "word3") (a character vector of words), you can use
df[grepl(paste0("(", paste(lw, collapse = "|"), ")"), x)]
if you want regular-expression matching. In particular, you will also get a match when your word sits inside a sentence. However, with 38k words, I don't know whether this solution is scalable.
If your x column contains only words and you want exact matching, the problem is simpler. You can do:
df[x %chin% lw]
%chin% is data.table's %in% operator specialised for character vectors (%in% can also be used, but it will not be as performant). You can get even better performance with merge by transforming lw into a data.table:
merge(df, data.table(x = lw), by = "x")
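A small self-contained illustration of the two exact-matching options (the toy df, its x values, and the id column are made up for this sketch):
library(data.table)
df <- data.table(x = c("word1", "other", "word3"), id = 1:3)
lw <- c("word1", "word2", "word3")
df[x %chin% lw]                           # rows whose x is exactly in lw
merge(df, data.table(x = lw), by = "x")   # the same rows, obtained via a join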

How to fill dataframe rows for progressive files in a for loop in R

I'm trying to analyze some data acquired from experimental tests with several variables being recorded. I've imported a dataframe into R and I want to obtain some statistical information by processing these data.
In particular, I want to fill in an empty dataframe with the same variable names of the imported dataframe but with statistical features like mean, median, mode, max, min and quantiles as rows for each variable.
The input dataframes are something like 60 columns x 250k rows each.
I've already managed to do this using apply as in the following lines of code for a single input file.
df[1,] <- apply(mydata,2,mean,na.rm=T)
df[2,] <- apply(mydata,2,sd,na.rm=T)
...
Now I need to do this in a for loop for a number of input files mydata_1, mydata_2, mydata_3, ... in order to build several summary statistics dataframes, one for each input file.
I tried several different approaches with apply and assign, but I can't really manage to access each row of interest in the output dataframes while cycling over the several input files.
I would like to do something like the code below (I know that this code does not work; it's just to give an idea of what I want to do).
The output df dataframes are already defined and empty.
for (xx in 1:number_of_mydata_files) {
  df_xx[1,] <- apply(mydata_xx, 2, mean, na.rm = T)
  df_xx[2,] <- apply(mydata_xx, 2, sd, na.rm = T)
  ...
}
I can't remember the exact error message, but the code above does not run at all as written.
I'm quite a beginner with R, so I don't have much experience with the language. Is there a way to do this? Are there other functions that could be used instead of apply and assign?
EDIT:
I add here a simple description of the input dataframes I'm using; sorry for the poor data visualization. Basically the input dataframes are imported .csv files that look like tables, with the first row being the column description (i.e. the name of the measured variable) and the following rows being the acquired data. I have 250,000 acquisitions for each variable in each file, and something like 5-8 files like this as my input.
Current [A] | Force [N] | Elongation [%] | ...
—————————————————————————————————————
Value_a_1 | Value_b_1 | Value_c_1 | ...
I just want to obtain a data frame like this as an output, with the same variable names, but with statistical values as rows. For example, the first row, instead of holding the first acquired value of each variable, would hold the mean of the 250k acquisitions for that variable. The second row would be the standard deviation, the third the variance, and so on.
I've managed to build empty dataframes for the output summary statistics, with just the columns and no rows yet. I just want to fill them, and do this iteratively in a for loop.
Not sure what your data looks like, but you can do the following, where lst represents your list of data frames:
lst <- list(iris[, -5], mtcars, airquality)
lapply(seq_along(lst),
       function(i) sapply(lst[[i]], function(col)
         data.frame(Mean = mean(col, na.rm = TRUE),
                    sd = sd(col, na.rm = TRUE))))
Or, as suggested by @G. Grothendieck, simply:
lapply(lst, sapply, function(x)
  data.frame(Mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE)))
If all your files are in the same directory, set the working directory to it and use list.files() (or ls() if the data frames are already loaded as objects) to walk along your input files.
If they all share the same column names, you can rbind the results into a single data set.
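A sketch of that workflow, under the assumption that the inputs are .csv files sitting in the working directory and that all their columns are numeric (the file pattern and the read.csv defaults are assumptions; the question's files may need read_tsv or extra arguments). If the data frames already exist as objects named mydata_1, mydata_2, ..., mget(ls(pattern = "^mydata_")) would gather them into a list instead of the first two lines:
# assumption: one .csv per input file, all columns numeric
files <- list.files(pattern = "\\.csv$")
lst <- lapply(files, read.csv)
# one summary table per file: rows = Mean/sd, columns = that file's variables
stats_list <- lapply(lst, sapply, function(col)
  c(Mean = mean(col, na.rm = TRUE), sd = sd(col, na.rm = TRUE)))
names(stats_list) <- files
# if every file shares the same column names, stack the summaries into one table
combined <- do.call(rbind, stats_list)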

R sapply and mean for more than 1 column in split dataframe

I have a problem concerning sapply in R.
I have a dataframe Test_ALL that I split by (at the moment) one column named activity. The dataframe has about 20 columns with extra-long names (e.g. fBodyBodyGyroJerkMag-std()) that I don't want to write out explicitly. From this dataframe I want to get a mean for each column. I tried this and it worked for one named column:
aa<-split(Test_ALL,Test_ALL$activity)
y<-sapply(aa,function(x) colMeans(x [c("fBodyBodyGyroJerkMag-std()")]))
but when I tried to get a mean for more than one column it didn't work:
aa<-split(Test_ALL,Test_ALL$activity)
y<-sapply(aa,function(x) colMeans(x [c("fBodyBodyGyroJerkMag-std()","fBodyAccMag-std()")]))
I tried this too, but also without success:
namesERG<-names(Test_ALL)
aa<-split(Test_ALL,Test_ALL$activity)
y<-sapply(aa,function(x) colMeans(x[c(namesERG)]))
What am I doing wrong?
Thank you!
Without a reproducible example it is difficult to fully understand your problem. Anyway, I think that part of the issue is that you have some non-numeric columns. Something like this could be a solution:
library(dplyr)
aa <- split(Test_ALL, Test_ALL$activity)
y <- sapply(aa, function(x) colMeans(select_if(x, is.numeric)))
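As a self-contained illustration with a stand-in dataset (iris, with Species playing the role of the activity column; the real Test_ALL columns are of course different):
library(dplyr)
Test_ALL <- iris   # stand-in: Species acts as the "activity" grouping column
aa <- split(Test_ALL, Test_ALL$Species)
y <- sapply(aa, function(x) colMeans(select_if(x, is.numeric)))
# y is a numeric matrix: one row per numeric column, one column per group level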
