Exclude one single column from sapply - r

I have a dataframe with multiple columns that I want to group according to their names. When several column names match the same pattern, I want them combined into a single column that holds the sum of the group.
colnames(dataframe)
[1] "Départements" "01...3" "01...4" "01...5" "02...6" "02...7" "02...8" "02...9" "02...10" "03...11"
[11] "03...12" "03...13" "04...14" "04...15" "05...16" "05...17" "05...18" "06...19" "06...20" "06...21"
So I use this bit of code, which works just fine when every column is numeric. However, the first column is character, so I hit an error. How can I exclude the first column from the code?
#Group columns by pattern: collect the patterns, then loop through them
patterns <- unique(substr(names(dataframe), 1, 3)) #store patterns in a vector
dataframe <- sapply(patterns, function(xx) rowSums(dataframe[,grep(xx, names(dataframe)), drop=FALSE]))
#loop through
This is the error I get:
Error in rowSums(DEPTpolicedata_2012[, grep(xx, names(DEPTpolicedata_2012)), :
'x' must be numeric

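You can simply remove the first column from the data frame (patterns is only a character vector of name prefixes) before computing the patterns and the sums:
dataframe$Départements <- NULL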
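
A minimal sketch of the same idea on toy data, for illustration only. The column names and values below are made up, and the two-character prefix is an assumption about how the department codes are laid out. The character column is dropped first, the patterns are built from the remaining names only, and each group is then summed.

```r
# Hypothetical toy data standing in for the asker's data frame
dataframe <- data.frame(
  Dept     = c("Ain", "Aisne"),
  `01...3` = c(1, 2),
  `01...4` = c(3, 4),
  `02...6` = c(5, 6),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Work on the numeric columns only (everything except the first column)
num <- dataframe[-1]

# One pattern per group; assumes the group code is the first two characters
patterns <- unique(substr(names(num), 1, 2))

# For each pattern, sum the matching columns row by row
sums <- sapply(patterns, function(p) {
  rowSums(num[, startsWith(names(num), p), drop = FALSE])
})

# Put the character column back next to the grouped sums
result <- data.frame(Dept = dataframe$Dept, sums, check.names = FALSE)
print(result)
```

Using startsWith() rather than grep() avoids treating the prefix as a regular expression, which matters when the prefix contains a dot.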

Related

What is happening during assignment to a dataframe by lapply

Given a dataframe df and a function f which is applied to df:
df[] <- lapply(df, f)
What is the magic R is performing to replace columns in df with collection of vectors in the list from lapply? I see that the result from lapply is a list of vectors having the same names as the dataframe df. I assume some magic mapping is being done to map the vectors to df[], which is the collection of columns in df (methinks). Just works? Trying to better understand so that I remember what to use the next time.
A data.frame is simply a list of vectors that all have the same length. You can check this with is.list(a_data_frame), which returns TRUE.
[] can mean different things depending on the object it is applied to. It can even be redefined, since it is in fact a function.
For a data.frame, [] lets you extract or replace columns:
df[1] returns the first column (as a one-column data.frame)
df[1] <- 2 replaces the first column with 2 (recycled to the same length as the other columns)
df[] returns the whole data.frame
df[] <- list(c1,c2,c3) replaces the contents of the data.frame with those vectors while keeping it a data.frame
There are also many other ways to access or set data in a data.frame (by column name, by subset of rows or columns, ...)
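
A short sketch of the difference between assigning to df[] and assigning to df itself. The function f and the data are placeholders chosen for illustration.

```r
df <- data.frame(a = 1:3, b = c(10, 20, 30))
f <- function(x) x * 2   # hypothetical f; any column-wise function works

# lapply() on a data.frame returns a plain list, one element per column
as_list <- lapply(df, f)
class(as_list)

# Assigning into df[] replaces the columns but keeps the data.frame itself
df[] <- as_list
class(df)

# Assigning to the bare name instead would replace the object with the list
df2 <- data.frame(a = 1:3, b = c(10, 20, 30))
df2 <- lapply(df2, f)
class(df2)
```

So the empty brackets are what preserve the data.frame: they ask R to fill the existing columns rather than rebind the name to a new object.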

Getting only the rownames containing a specific character - R

I have a Seurat R object. I would like to select only the data corresponding to a specific sample. Therefore, I want to get only the row names that contain a specific character. Example of the differences in my row names: CTAAGCTT-1 and CGTAAAT-2. I want to differentiate based on 1 and 2. The code below shows what I already tried, but it just returns the total number of rows, not how many rows match the character.
length <- length(rownames(seuratObject@meta.data) %in% "1")
OR
length <- length(grepl("-1",rownames(seuratObject@meta.data)))
Idents(seuratObject, cells = 1:length)
Thanks for any input.
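You are just missing which(). grepl() returns a logical vector with one element per row name, so length() on it always gives the total number of rows; which() keeps only the positions that are TRUE:
length(which(grepl("-1", rownames(seuratObject@meta.data))))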
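
The following sketch shows the same counting logic on a plain character vector, so it runs without Seurat. The barcodes are invented, and anchoring the pattern with $ is an added assumption so that a suffix such as -10 is not counted as -1.

```r
# Hypothetical barcodes standing in for rownames(seuratObject@meta.data)
barcodes <- c("CTAAGCTT-1", "CGTAAAT-2", "GGTACCA-1", "TTTGACC-10")

# One TRUE/FALSE per barcode; "$" anchors the match at the end of the string
is_sample1 <- grepl("-1$", barcodes)

n_total   <- length(is_sample1)        # counts every element, matched or not
n_sample1 <- sum(is_sample1)           # counts only the TRUE values
same_as   <- length(which(is_sample1)) # equivalent to sum() here

# The logical vector can also be used directly to subset
sample1_cells <- barcodes[is_sample1]
print(sample1_cells)
```

sum() on a logical vector is a common shorthand for length(which(...)).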

The use of the minus sign inside square brackets

Below is an exercise from Datacamp.
Using the cbind() call to include all three sheets. Make sure the first column of urban_sheet2 and urban_sheet3 are removed, so you don't have duplicate columns. Store the result in urban.
Code:
# Add code to import data from all three sheets in urbanpop.xls
path <- "urbanpop.xls"
urban_sheet1 <- read.xls(path, sheet = 1, stringsAsFactors = FALSE)
urban_sheet2 <- read.xls(path, sheet = 2, stringsAsFactors = FALSE)
urban_sheet3 <- read.xls(path, sheet = 3, stringsAsFactors = FALSE)
# Extend the cbind() call to include urban_sheet3: urban
urban <- cbind(urban_sheet1, urban_sheet2[-1],urban_sheet3[-1])
# Remove all rows with NAs from urban: urban_clean
urban_clean<-na.omit(urban)
My question is why [-1] is used to remove the first column in cbind(). Is it a special use of square brackets inside cbind()? Does that mean that if I want to remove the first two columns the code should be urban_sheet2[-2]? I only know that square brackets are used for selecting certain columns or rows. This confuses me.
This is not specific to cbind(). You can use - inside square brackets to remove any particular row or column you want. If your data frame is df, df[,-1] will have its first column removed. df[,-2] will have its second (and only second) column removed. df[,-c(1,2)] will have both its first and second columns removed. Likewise, df[-1,] will have its first row removed, etc.
This cannot be done directly with column names, e.g., df[,-"var1"] will not work. To use column names, you can use which(), as in df[,-which(names(df) %in% "var1")], but simply df[,!names(df) %in% "var1"] is easier and yields the same result whenever the name actually occurs in df. You can also use subset(): subset(df, select = -c(var1, var2)); this will remove the columns named "var1" and "var2".
Note that removing rows and columns only affects the output of the call, and will not affect the original object unless the output is assigned to the original object.
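
A small sketch of these forms on a made-up data frame (the names var1 to var3 are placeholders). It also covers the single-index form df[-1] used in the exercise, which drops the first column because a data frame is indexed like a list of columns.

```r
df <- data.frame(var1 = 1:3, var2 = 4:6, var3 = 7:9)

# Single index, as in the exercise: drops the first column
drop_first <- names(df[-1])

# Negative index in the column position; drop = FALSE keeps a data frame
# even when only one column is left
drop_first_two <- names(df[, -c(1, 2), drop = FALSE])

# Dropping by name with a logical mask
drop_by_name <- names(df[, !names(df) %in% "var1"])

# Dropping by name with subset()
drop_subset <- names(subset(df, select = -c(var1, var2)))

# None of the calls above modified df itself
unchanged <- names(df)
print(unchanged)
```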

Count of Comma separated values in r

I have a column named subcat_id in which the values are stored as comma separated lists. I need to count the number of values and store the counts in a new column. The lists also have Null values that I want to get rid of.
I would like to store the counts in the n column.
We can try
nchar(gsub('[^,]+', '', gsub(',(?=,)|(^,|,$)', '',
gsub('(Null){1,}', '', df1$subcat_id), perl=TRUE)))+1L
#[1] 6 4
Or
library(stringr)
str_count(df1$subcat_id, '[0-9.]+')
#[1] 6 4
data
df1 <- data.frame(subcat_id = c('1,2,3,15,16,78',
'1,2,3,15,Null,Null'), stringsAsFactors=FALSE)
You can do
sapply(strsplit(subcat_id,","),FUN=function(x){length(x[x!="Null"])})
strsplit(subcat_id,",") will return a list of each item in subcat_id split on commas. sapply will apply the specified function to each item in this list and return us a vector of the results.
Finally, the function that we apply will take just the non-null entries in each list item and count the resulting sublist.
For example, if we have
subcat_id <- c("1,2,3","23,Null,4")
Then running the above code returns c(3,2), because the "Null" entry in the second string is not counted. You can assign this result to your column.
If you run this on a dataframe column, the character column may have been read in as a factor, in which case strsplit() throws the error "non-character argument". To fix this, force interpretation as a character vector with as.character(), changing the command to
sapply(strsplit(as.character(frame$subcat_id),","),FUN=function(x){length(x[x!="Null"])})
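
Putting the pieces together, here is a sketch that stores the counts in the n column, using the df1 sample data from the first answer. vapply() is swapped in for sapply() only so that the result type is fixed; the empty-string check is an extra assumption to guard against doubled commas.

```r
df1 <- data.frame(
  subcat_id = c("1,2,3,15,16,78", "1,2,3,15,Null,Null"),
  stringsAsFactors = FALSE
)

# Split each string on literal commas; as.character() guards against factors
parts <- strsplit(as.character(df1$subcat_id), ",", fixed = TRUE)

# Count the entries that are neither "Null" nor empty
df1$n <- vapply(parts, function(x) sum(x != "Null" & nzchar(x)), integer(1))

print(df1)
```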

Put column sums in a new row in a matrix

I have a data frame that consists of municipality names (factors) in the first column and number of projects (integers) in columns two and three.
Var.1<-c("Andover", "Avon", "Bethany")
Freq.x<-c(2,NA,10)
Freq.y<-c(4,2,9)
Projects<-data.frame(Var.1,as.integer(as.numeric(Freq.y)),as.integer(as.numeric(Freq.x)))
[Note: I am making the second and third columns as integers here because that's how they are categorized in my actual data set.]
I was able to take the row sums of the rows using:
Projects$Sum<-rowSums(Projects[,2:3])
However, I'm unable to figure out how to take the column sums. I tried using the following formula:
Projects[Total,]<-colSums(Projects[2:3,])
I get the error:
Error in colSums(Projects[2:3, ]) : 'x' must be numeric
Even when I convert the second and third columns to as.numeric, I get the same response.
Can someone advise how to obtain the column sums and create a new row at the bottom to hold the results?
You can do something like this:
Var.1<-c("Andover", "Avon", "Bethany")
Freq.x<-c(2,NA,10)
Freq.y<-c(4,2,9)
freq <- cbind(Freq.x, Freq.y)
freq <- rbind(freq, colSums(freq, na.rm=TRUE))
Projects <- data.frame(name=c(Var.1, "Total"), freq)
In particular: keep the numeric part separate and compute its sums; add "Total" to the character vector before it is converted to a factor, and only then build the data.frame.
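
As an alternative sketch, the total row can also be appended to an existing data frame. Note that Projects[2:3,] in the question selects rows 2 and 3 (including the name column), which is why colSums() complained; the columns are selected with Projects[, 2:3]. The column names below are assumed for illustration, and the name column is kept as character rather than factor so that "Total" can be added.

```r
Projects <- data.frame(
  Var.1  = c("Andover", "Avon", "Bethany"),
  Freq.y = c(4L, 2L, 9L),
  Freq.x = c(2L, NA, 10L),
  stringsAsFactors = FALSE
)

# Sum columns 2 and 3 (note the leading comma), skipping missing values
totals <- colSums(Projects[, 2:3], na.rm = TRUE)

# Build a one-row data frame with matching column names and append it
total_row <- data.frame(Var.1 = "Total", as.list(totals),
                        stringsAsFactors = FALSE)
Projects <- rbind(Projects, total_row)

print(Projects)
```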
