This may be a very basic question, but I have not been able to figure it out for the last hour. I want to merge the cells in each row using a comma or semicolon. The data looks like
OTU_1 23 15 273 51 127 190 220 83 k__Bacteria p__Chloroflexi c__SJA-15 o__C10_SB1A f__C10_SB1A g__Candidatus Amarilinum s__
The output would be like this
OTU_1;23;15;273;51;127;190;220;83;k__Bacteria;p__Chloroflexi;c__SJA-15;o__C10_SB1A;f__C10_SB1A;g__Candidatus Amarilinum;s__
Can you please guide me on how this can be done in R? I know how to use a concatenate function, but I am wondering whether the same can be done in R.
Thanks
It's not clear what you mean by cells. Is this a vector? Columns of a data.frame?
The tool to use here is probably paste, but how you use it will depend on the underlying structure of the data.
> paste(letters, collapse = ";")
[1] "a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z"
To merge cells by row you can use apply() with the second argument (MARGIN) equal to 1, i.e.
apply(your_dataframe, 1, function(x) paste(x, collapse = ","))
You'll get a character vector whose length equals the number of rows, where each element is that row's cells merged together.
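For example, a minimal sketch on made-up data shaped like the question, collapsing with a semicolon (the otu data frame and its column names are illustrative, not from the original post):
otu <- data.frame(id = "OTU_1", a = 23, b = 15, taxon = "k__Bacteria",
                  stringsAsFactors = FALSE)
merged <- unname(apply(otu, 1, function(x) paste(x, collapse = ";")))
merged
# [1] "OTU_1;23;15;k__Bacteria"
The result can then be written out with writeLines(merged, "merged.txt") if a file is needed.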
I've got a dataframe that I need to split based on the values in one of the columns - most of them are either 0 or 1, but a couple are NA, which I can't get to form a subset. This is what I've done:
all <- read.csv("XXX.csv")
splitted <- split(all, all$case_con)
dim(splitted[[1]]) #--> gives me 185
dim(splitted[[2]]) #--> gives me 180
but all contained 403 rows, which means that 38 NA values were left out and I don't know how to form a similar subset to the ones above with them. Any suggestions?
Try this:
splitted <- c(split(all, all$case_con), list(subset(all, is.na(case_con))))
This should tack the data frame subset with the NAs onto the list as its last element...
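A quick sketch with toy data (the all data frame here is made up; case_con stands in for the real column):
all <- data.frame(id = 1:6, case_con = c(0, 1, NA, 0, NA, 1))
splitted <- c(split(all, all$case_con), list(NAs = subset(all, is.na(case_con))))
sapply(splitted, nrow)
#   0   1 NAs
#   2   2   2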
list(split(all, all$case_con), split(all, is.na(all$case_con)))
I think that would work. Thank you.
I use R, and I'm looking to use regular expressions to calculate row sums for the number of occurrences of a string pattern across all columns in a data frame containing epigenetic information. There are 40 columns, 15 of which may or may not contain the pattern of interest. The code that has got me closest to what I'm looking for is:
# Looking to match following exact pattern ',.,' which will always be
# preceded and followed by a sequence of characters or numbers.
# Note: the full stop in the pattern above signifies any character
df$rowsum <- rowSums(apply(df, 2, grep, pattern = ".*,.,.*"))
For each row, this provides a count of the columns that contain the pattern; however, the issue I have is that any individual cell can contain the pattern more than once. I've tried several different function combinations to get to the answer, and I realise that grep is probably not the solution, as it only reports whether the pattern is found in a cell at all, meaning it can register at most one match for any particular cell. I need a solution that counts every occurrence of the pattern within each individual cell in a row and adds these values to give a row total, which is then stored in the rowsum column of that row.
For context, a typical single occurrence in a cell could be:
2212(AATTGCCCCACA,-,0.00)
Whereas if there were multiple occurrences, they would exist in the cell as one continuous string with each entry separated by a comma; for example, for two entries:
144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)
I'm using the ,., as the unique identifier of each entry, as everything else for each entry is variable.
Here is some toy data:
df <-data.frame(NAMES = c('A', 'B', 'C', 'D'),
GENE1 = c("144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)", "2(TGTGAGTCAC,+,0.00)", "NA", "NA"),
GENE2 = c("632(TAAAGAGTCAC,-,0.00),60(GTCCCTCACT,-,0.00),", "7(TGTGAGTCAC,+,0.00)", "7(TGTGAGTCAC,+,0.00)", "NA"),
stringsAsFactors = F)
The optimum code would provide a data frame with a row sums column attached with totals:
# Omitted GENE column contents to save space
NAMES GENE1 GENE2 rowsum
A ... ... 4
B ... ... 2
C ... ... 1
D ... ... 0
I've been stumped on this for 48 hours. Any help would be greatly appreciated.
We can use str_extract_all from stringr:
library(stringr)
df$rowsum <- Reduce(`+`, lapply(df[-1],
                    function(x) lengths(str_extract_all(x, "\\d+\\("))))
df$rowsum
#[1] 4 2 1 0
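If you prefer base R, a rough equivalent counts the ",.," marker described in the question using gregexpr (count_hits is just an illustrative helper name):
count_hits <- function(x) lengths(regmatches(x, gregexpr(",.,", x)))
df$rowsum <- Reduce(`+`, lapply(df[-1], count_hits))
df$rowsum
# [1] 4 2 1 0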
I am trying to merge a data.frame and a column from another data.frame, but have so far been unsuccessful.
My first data.frame [Frequencies] consists of 2 columns, containing 47 upper/lower case alpha characters and their frequency in a bigger data set. For example purposes:
Character<-c("A","a","B","b")
Frequency<-(100,230,500,420)
The second data.frame [Sequences] is 93,000 rows in length and contains 2 columns, with the same 47 upper/lower case alpha characters and a corresponding qualitative description. For example:
Character<-c("a","a","b","A")
Descriptor<-c("Fast","Fast","Slow","Stop")
I wish to add the descriptor column to the [Frequencies] data.frame, but not the 93,000 rows! Rather, what each "Character" represents. For example:
Character<-c("a")
Frequency<-c("230")
Descriptor<-c("Fast")
The following can also be done (here adf would be the Frequencies data frame and bdf the Sequences one):
> merge(adf, bdf[!duplicated(bdf$Character),])
Character Frequency Descriptor
1 a 230 Fast
2 A 100 Stop
3 b 420 Slow
Why not:
df1$Descriptor <- df2$Descriptor[ match(df1$Character, df2$Character) ]
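A minimal sketch of that with the example data (assuming df1 is the Frequencies data frame and df2 the Sequences one):
df1 <- data.frame(Character = c("A", "a", "B", "b"),
                  Frequency = c(100, 230, 500, 420),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Character = c("a", "a", "b", "A"),
                  Descriptor = c("Fast", "Fast", "Slow", "Stop"),
                  stringsAsFactors = FALSE)
# match() picks the first matching row of df2 for each Character in df1, so
# duplicates in df2 are collapsed; characters missing from df2 (here "B") get NA.
df1$Descriptor <- df2$Descriptor[match(df1$Character, df2$Character)]
df1
#   Character Frequency Descriptor
# 1         A       100       Stop
# 2         a       230       Fast
# 3         B       500       <NA>
# 4         b       420       Slow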
I have asked this question a couple times without any help. I have since improved the code so I am hoping somebody has some ideas! I have a dataset full of 0's and 1's. I simply want to add the 10 columns together resulting in 1 column with 3835 rows. This is my code thus far:
# select for valid IDs
data = history[history$studyid %in% valid$studyid,]
sibling = data[,c('b16aa','b16ba','b16ca','b16da','b16ea','b16fa','b16ga','b16ha','b16ia','b16ja')]
# replace all NA values by 0
sibling[is.na(sibling)] <- 0
# loop over all columns and count the number of 174
apply(sibling, 2, function(x) sum(x==174))
The problem is this code adds together all the rows, I want to add together all the columns so I would result with 1 column. This is the answer I am now getting which is wrong:
b16aa b16ba b16ca b16da b16ea b16fa b16ga b16ha b16ia b16ja
68 36 22 18 9 5 6 5 4 1
In apply() you have the MARGIN set to 2, which is columns. Set the MARGIN argument to 1, so that your function, sum, will be applied across rows. This was mentioned by @sgibb.
If that doesn't work (I can't reproduce your example), you could first convert each element of the matrix to an indicator of whether it equals 174, X2 <- apply(sibling, c(1,2), function(x) x==174), and then use rowSums to add up the values in each row: Xsum <- rowSums(X2, na.rm=TRUE). With this setup you do not need to change the NAs to 0s first, as you can handle them with the na.rm argument of rowSums().
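That second suggestion written out as a short sketch (assuming sibling is the 10-column data frame from the question and 174 is the value being counted):
X2 <- apply(sibling, c(1, 2), function(x) x == 174)  # TRUE/FALSE per cell
Xsum <- rowSums(X2, na.rm = TRUE)                    # one count per row
length(Xsum)  # one value per row of the original data, e.g. 3835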
I have a data frame with several columns, some numeric and some character. How do I compute the sum of a specific column? I've googled for this and I see numerous functions (sum, cumsum, rowsum, rowSums, colSums, aggregate, apply), but I can't make sense of it all.
For example suppose I have a data frame people with the following columns
people <- read.table(
  text =
    "Name Height Weight
Mary 65 110
John 70 200
Jane 64 115",
  header = TRUE
)
…
How do I get the sum of all the weights?
You can just use sum(people$Weight).
sum sums up a vector, and people$Weight retrieves the weight column from your data frame.
Note - you can get built-in help by using ?sum, ?colSums, etc. (by the way, colSums will give you the sum for each column).
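For the example data above, that gives:
sum(people$Weight)
# [1] 425
colSums(people[, c("Height", "Weight")])
# Height Weight
#    199    425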
To sum the values in a data.frame you first need to extract them as a vector.
There are several ways to do it:
# $ operator
x <- people$Weight
x
# [1] 110 200 115
Or using [ , ], as with a matrix:
x <- people[, 'Weight']
x
# [1] 110 200 115
Once you have the vector you can use any vector-to-scalar function to aggregate the result:
sum(people[, 'Weight'])
# [1] 425
If you have NA values in your data, you should specify na.rm parameter:
sum(people[, 'Weight'], na.rm = TRUE)
You can use the tidyverse package to solve it, and it would look like the following (which is more readable, at least for me):
library(tidyverse)
people %>%
  summarise(sum(Weight, na.rm = TRUE))
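With the example data this should return a one-row data frame, roughly:
#   sum(Weight, na.rm = TRUE)
# 1                       425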
When you have NA values in the column, then:
sum(as.numeric(JuneData1$Account.Balance), na.rm = TRUE)
To order columns by their column sums:
order(colSums(people[, -1]), decreasing = TRUE)  # numeric columns only
If there are more than 20 columns, e.g.:
order(colSums(people[, 5:25]), decreasing = TRUE)  # leaves the first 4 columns out