Sum of the last row of a data frame in R

I have the code below to calculate a summary table. I want to calculate the sum of the last row (the last row can have many columns and can also contain NA). Is there a solution for this?
dataa<-data.frame(
aa = c("q","r","y","v","g","y","d","s","n","k","y","d","s","t","n","u","l","h","x","c","q","r","y","v","g","y","d","s","n","k","y","d","s","t","n","u","l","h","x","c"),
col1=c(1,2,3,2,1,2,3,4,4,4,5,3,4,2,1,2,5,3,2,1,2,4,2,1,3,2,1,2,3,1,2,2,4,4,4,1,2,5,3,5),
col2=c(2,1,1,7,4,1,2,7,5,7,2,6,2,2,6,3,4,3,2,5,7,5,6,4,4,6,5,6,4,1,7,3,2,7,7,2,3,7,2,4)
)
df <- database %>% select(!!var1,!!var2)
tab1 <- expss::cro_cpct(df[[1]],df[[2]])
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables

Since your first column contains the string #Total cases, sum() throws an error: it is only defined on a data frame whose columns are all numeric. Excluding the first column fixes this, and adding na.rm = TRUE ignores NAs:
sum(tab1[nrow(tab1), -1], na.rm = TRUE)
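As a minimal sketch of the same idea on a self-contained example: the data frame `tab` below is hypothetical and only mimics the shape of the table in the question (a label column followed by numeric columns, some of them NA). Selecting the numeric columns by type is an alternative to dropping the first column by position.

```r
# Hypothetical stand-in for the cross-tabulation: one label column, two numeric columns.
tab <- data.frame(
  row_labels = c("a", "b", "#Total cases"),
  x = c(10, 20, 30),
  y = c(NA, 5, 7),
  stringsAsFactors = FALSE
)

# Drop the label column by position, then sum the last row while ignoring NAs.
last_row_total <- sum(tab[nrow(tab), -1], na.rm = TRUE)

# Alternative: keep only the numeric columns, wherever the label column sits.
is_num <- sapply(tab, is.numeric)
last_row_total_by_type <- sum(tab[nrow(tab), is_num], na.rm = TRUE)

print(last_row_total)
print(last_row_total_by_type)
```

Selecting by type is a little more robust if the table ever gains a second non-numeric column.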

Related

Trouble with NA's in large dataframe

I'm having trouble trying to standardize my data.
First, I create the data frame from my data with the desired row names, and I remove the first column, as it is not needed.
EXPGli <-read.delim("C:/Users/i5/Dropbox/Guilherme Vergara/Doutorado/Data/Datasets/MergedEXP3.txt", row.names=2)
EXPGli <- EXPGli[,-1]
EXPGli <- as.data.frame(EXPGli)
Then I need to convert every column to Z-scores (each column = gene expression values; each row = sample). The idea is to replace every gene expression value with its Z-score within its column.
Z_score <- function(x) {(x-mean(x))/ sd(x)}
apply(EXPGli, 2, Z_score)
This returns [ reached 'max' / getOption("max.print") -- omitted 1143 rows ]
And now every cell of my data frame is NA.
There are indeed several NAs in the dataset, including some full rows and even some full columns.
I tried several approaches to remove NAs
EXPGli <- na.omit(EXPGli)
EXPGli %>% drop_na()
print(EXPGli[rowSums(is.na(EXPGli)) == 0, ])
na.exclude(EXPGli)
Yet apparently none of these work. Additionally, calling is.na(EXPGli)
returns FALSE for all fields.
I would like to understand what I am doing wrong here. It seems that the NAs might not be recognized by R as NA, but I couldn't find a solution for this. Any input is very much appreciated, thanks in advance!
You may want to set the argument na.rm = TRUE in your calls to mean(x) and sd(x) inside the Z_score function, otherwise these calls would return NAs for any vector with NAs in it.
Z_score <- function(x) {(x-mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)}
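To illustrate the difference, here is a small sketch on a hypothetical two-gene data frame (`expr`, `gene_a`, and `gene_b` are made-up names, not the asker's data). It compares the original function, the `na.rm = TRUE` version, and base R's `scale()`, which also ignores NAs when centering and scaling columns.

```r
# Hypothetical expression table: rows are samples, columns are genes, with some NAs.
expr <- data.frame(
  gene_a = c(1, 2, 3, NA),
  gene_b = c(10, NA, 30, 50)
)

# Original version: a single NA in a column makes mean() and sd() return NA,
# so every value in that column becomes NA.
z_naive <- apply(expr, 2, function(x) (x - mean(x)) / sd(x))

# Fixed version: NAs are ignored when computing the mean and the standard deviation.
z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
z_fixed <- apply(expr, 2, z_score)

# Base R alternative: scale() centers and scales each column, skipping NAs.
z_scale <- scale(expr)

print(z_naive)
print(z_fixed)
```

Cells that were NA in the input stay NA in the output; only the column statistics are computed without them.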

Dynamic mean of consecutive columns in dplyr

I have a data frame with a large number of columns containing numeric values.
I'd like to dynamically calculate the mean value of the two consecutive columns (so mean of column 1 and column 2, mean of column 3 and 4, mean of 5 and 6 etc...) and either store it into new column names or replace one of the two columns I used in the calculation.
I tried creating a function that calculate the mean of two columns and storing it into the first column then applying a loop to that function so it applies to my whole datatable.
However I'm struggling with mutate: since I dynamically generate the name of the column I use (they all start with "PUISSANCE" then a number) through a glue, it displays the value as a string into the mutate and doesn't evaluate it.
mean_col <- function(data, k)
{
  n <- 2*k+1
  m <- 2*k+2
  varname_even <- paste("PUISSANCE", m, sep="")
  varname_odd <- paste("PUISSANCE", n, sep="")
  # Here is the issue: the argument on the right is considered non-numeric, since it is the sum of two strings.
  mutate(data, "{{varname_odd}}" := ({{varname_odd}}+{{varname_even}})/2)
  data
}
for (k in 0:24) {
  my_data_set <- mean_col(my_data_set, k)
}
OK, just to let you know that I managed to solve it myself.
I used pivot_longer to put all the "PUISSANCEXX" names in one column and the associated values in another.
Then I used str_extract to get only the number XX from the string "PUISSANCEXX", which I converted to numeric.
Dividing that number by 2 and subtracting 0.5 gives X for the first column of each pair and X.5 for the second, so floor maps both to X. Then I just did a group_by/summarise to get the mean, and that's it!
my_data_set %>%
pivot_longer(starts_with("PUISSANCE"),names_to = "heure", values_to = "puissance") %>%
mutate("time" = floor(as.numeric(str_extract(heure, "\\d+"))/2-0.5)) %>%
select(-heure) %>%
group_by(time) %>%
summarise("power" = mean(puissance))
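For reference, the loop-based approach from the question can also be made to work. The sketch below assumes dplyr 1.0 or later and uses the `.data` pronoun to look up a column from a string, with a glue-style name (single braces) on the left of `:=`. The four-column `my_data_set` is a hypothetical stand-in for the real data.

```r
library(dplyr)

# Replace the odd-numbered column of pair k with the mean of the pair.
mean_col <- function(data, k) {
  odd  <- paste0("PUISSANCE", 2 * k + 1)
  even <- paste0("PUISSANCE", 2 * k + 2)
  # .data[[name]] fetches the column whose name is stored in a string;
  # "{odd}" := ... names the result after the string held in `odd`.
  mutate(data, "{odd}" := (.data[[odd]] + .data[[even]]) / 2)
}

# Hypothetical data with two pairs of columns.
my_data_set <- data.frame(
  PUISSANCE1 = c(1, 3), PUISSANCE2 = c(3, 5),
  PUISSANCE3 = c(10, 20), PUISSANCE4 = c(30, 40)
)

# The result of mutate() must be returned and reassigned, unlike in the question.
for (k in 0:1) {
  my_data_set <- mean_col(my_data_set, k)
}

print(my_data_set)
```

Unlike the pivot_longer solution, this keeps one output row per input row, which may or may not be what is wanted.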

Why am I getting a "number of items to replace is not a multiple of replacement length" error?

I'm currently working with a dataframe "dat." I'm trying to calculate a score using columns 69-88 (if there are values in any of those columns, then add them together and put the result in a new column called "score").
This is the code I have now:
dat$score <- 0
for (num in 69:88) {
  dat$score[!is.na(dat[,num])] <- dat$score+dat[,num]
}
This gives me a column where some rows show the correct score, but other rows are returning "NA". I also have 20 warnings messages that look like so:
1: In dat$score[!is.na(dat[, num])] <- dat$score + ... :
number of items to replace is not a multiple of replacement length
Why is my code working for some rows and not for others, and why am I getting this error?
Are you looking for the rowSums() function? You just have to add the argument na.rm = TRUE.
A solution with dplyr:
library(dplyr)
dat %>% mutate(score=rowSums(across(69:88), na.rm=TRUE))
Or with base R
dat$score<-rowSums(dat[, 69:88], na.rm=TRUE)
You can also use apply. It avoids the explicit for loop, although rowSums() is usually faster still.
dat$score <- apply(dat[,69:88], 1, sum, na.rm = TRUE)
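As for why the original loop warns: the left-hand side selects only the rows where the column is not NA, while the right-hand side still has one value per row, so the lengths differ and the values are misaligned (and the right-hand side still contains NAs). A minimal sketch of a corrected loop, on a hypothetical three-column `dat`, applies the same logical index to both sides and gives the same result as rowSums():

```r
# Hypothetical data; columns 1:3 play the role of columns 69:88 in the question.
dat <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 4),
  c = c(2, 5, NA)
)
cols <- 1:3

dat$score <- 0
for (num in cols) {
  keep <- !is.na(dat[, num])
  # Same subset on both sides, so the lengths match and rows stay aligned.
  dat$score[keep] <- dat$score[keep] + dat[keep, num]
}

# The vectorised equivalent.
score_rowsums <- rowSums(dat[, cols], na.rm = TRUE)

print(dat$score)
print(score_rowsums)
```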

Drop a column if the sum of that column equals 0 in R

I've created a for loop to iterate over each column in my train data set. It checks whether the sum of the absolute values in the column equals 0; if so, it stores the column's number in a vector called "aux". At the end of the loop, I reassign train with the columns in "aux" removed.
Problem: I keep getting the error message "Error in -aux : invalid argument to unary operator"
Notes about the dataset: There are no NAs or NaNs, and all values are numeric. Currently it is a matrix, but I can convert it to a data frame if required.
aux = NULL #auxiliary vector
for(i in 1:ncol(train)){ #checking all columns of the df
  if(sum(abs(train[,i]))==0){ #if the sum of the column is zero (using absolute value to avoid problems where the positive and negative numbers sum to zero)
    aux = c(aux,i) #then store the number of that column
  }
}
train = train[,-aux] #and remove the columns
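Instead of a loop, we can use Filter. It iterates over the columns of a data frame, so convert the matrix first:
Filter(function(x) sum(abs(x), na.rm = TRUE) > 0, as.data.frame(train))
Or with colSums, which works directly on the matrix:
train[, colSums(abs(train), na.rm = TRUE) > 0, drop = FALSE]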
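A small sketch of the colSums approach on a hypothetical 2 x 3 matrix follows. One likely cause of the original error is that `aux` is still NULL when no column sums to zero (or holds column names rather than numbers), so negating it fails. Logical indexing avoids that case, because it also works when nothing needs to be dropped.

```r
# Hypothetical matrix: column "b" is all zeros, column "a" sums to zero
# but has non-zero absolute values.
train <- matrix(
  c(1, -1, 0, 0, 2, 3),
  nrow = 2,
  dimnames = list(NULL, c("a", "b", "c"))
)

# TRUE for every column whose absolute values do not all sum to zero.
keep <- colSums(abs(train), na.rm = TRUE) > 0

# drop = FALSE keeps the result a matrix even if only one column survives.
train_kept <- train[, keep, drop = FALSE]

print(colnames(train_kept))
```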

Error sorting dataframe column by decreasing

I have a data frame with two columns: the first holds words (Word) and the second their frequency (Freq). I'm using the sort() function, but I keep running into the same error. My code, followed by the error, is:
sorted <- sort(df, order(df$Freq, decreasing= TRUE))
Error in sort(freqmat, order(freqmat$Freq, decreasing = TRUE)) :
'decreasing' must be a length-1 logical vector.
Did you intend to set 'partial'?
I don't think you mean to call sort() on top of order(): order() already returns the row positions in sorted order. If you want to rearrange the rows by frequency, index the data frame with that result:
sorted <- df[order(df$Freq, decreasing= TRUE),]
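A minimal, self-contained sketch of this, using a hypothetical three-word frequency table (`freq` is a made-up name):

```r
# Hypothetical word-frequency table.
freq <- data.frame(
  Word = c("apple", "pear", "fig"),
  Freq = c(3, 10, 1),
  stringsAsFactors = FALSE
)

# order() returns the row positions from highest to lowest frequency;
# indexing the rows with it rearranges the whole data frame.
sorted <- freq[order(freq$Freq, decreasing = TRUE), ]

print(sorted)
```

The trailing comma in `freq[..., ]` matters: it selects rows and keeps all columns.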
