Error sorting dataframe column by decreasing - r

I have a dataframe with two columns: the first holds words (Word) and the second their frequency (Freq). I'm using the sort() function, but I keep falling into the same trap. My code, followed by the error, is:
sorted <- sort(df, order(df$Freq, decreasing= TRUE))
Error in sort(freqmat, order(freqmat$Freq, decreasing = TRUE)) :
'decreasing' must be a length-1 logical vector.
Did you intend to set 'partial'?

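I don't think you mean to call sort() after ordering by frequency: the second argument of sort() is decreasing, so the vector returned by order() ends up there, hence the error. If you want to rearrange the rows by frequency: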
sorted <- df[order(df$Freq, decreasing= TRUE),]
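As a minimal sketch of the difference between the two functions, assuming a made-up three-row table in place of the asker's data: order() returns row positions that can index the whole data frame, while sort() works on a single vector.

```r
# Toy data standing in for the word-frequency table (values are invented).
df <- data.frame(
  Word = c("apple", "pear", "fig"),
  Freq = c(3, 7, 1),
  stringsAsFactors = FALSE
)

# order() gives the permutation of row indices; indexing with it moves whole rows.
sorted <- df[order(df$Freq, decreasing = TRUE), ]

# sort() is meant for one vector at a time, e.g. only the frequencies.
freq_sorted <- sort(df$Freq, decreasing = TRUE)
```

Here sorted keeps each word next to its frequency, whereas freq_sorted contains only the numbers.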

Related

Trouble with NA's in large dataframe

I'm having trouble trying to standardize my data.
So, first things first: I create the dataframe object from my data with my desired row names, and I remove the 1st column, as it is not needed.
EXPGli <-read.delim("C:/Users/i5/Dropbox/Guilherme Vergara/Doutorado/Data/Datasets/MergedEXP3.txt", row.names=2)
EXPGli <- EXPGli[,-1]
EXPGli <- as.data.frame(EXPGli)
Then I need to convert all the columns to Z-scores (each column = gene expression values; each row = sample). The idea is to replace every gene expression value with its Z-score for each cell.
Z_score <- function(x) {(x-mean(x))/ sd(x)}
apply(EXPGli, 2, Z_score)
Which returns [ reached 'max' / getOption("max.print") -- omitted 1143 rows ]
And now every cell in my df is NA.
Indeed, there are several NAs in the dataset, some full rows and even some columns.
I tried several approaches to remove the NAs:
EXPGli <- na.omit(EXPGli)
EXPGli %>% drop_na()
print(EXPGli[rowSums(is.na(EXPGli)) == 0, ])
na.exclude(EXPGli)
Yet apparently none of them works. Additionally, is.na(EXPGli)
returns FALSE for all fields.
I would like to understand what I am doing wrong here. It seems that the NAs might not be recognized by R as NA, but I couldn't find a solution for this. Any input is very much appreciated, thanks in advance!
You may want to set the argument na.rm = TRUE in your calls to mean(x) and sd(x) inside the Z_score function, otherwise these calls would return NAs for any vector with NAs in it.
Z_score <- function(x) {(x-mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)}
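A small sketch of the effect, using an invented two-gene data frame (expr, geneA and geneB are hypothetical names, not the asker's data). Note that apply() returns a new matrix, so the result has to be assigned to be kept.

```r
# Invented expression values with a few missing entries.
expr <- data.frame(
  geneA = c(1, 2, NA, 3),
  geneB = c(10, NA, 30, 50)
)

# NA-aware Z-score: mean and sd are computed from the non-missing values only.
Z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# apply() does not modify expr in place; store the scaled matrix explicitly.
z <- apply(expr, 2, Z_score)
```

Cells that were NA in the input stay NA in z, but they no longer turn the rest of their column into NA.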

rowsum of last column of dataframe

I have the function below to calculate a summary. I want to calculate the sum of the last row (the last row can have many columns and can also contain "NA"). Is there any solution for this?
dataa<-data.frame(
aa = c("q","r","y","v","g","y","d","s","n","k","y","d","s","t","n","u","l","h","x","c","q","r","y","v","g","y","d","s","n","k","y","d","s","t","n","u","l","h","x","c"),
col1=c(1,2,3,2,1,2,3,4,4,4,5,3,4,2,1,2,5,3,2,1,2,4,2,1,3,2,1,2,3,1,2,2,4,4,4,1,2,5,3,5),
col2=c(2,1,1,7,4,1,2,7,5,7,2,6,2,2,6,3,4,3,2,5,7,5,6,4,4,6,5,6,4,1,7,3,2,7,7,2,3,7,2,4)
)
df <- database %>% select(!!var1,!!var2)
tab1 <- expss::cro_cpct(df[[1]],df[[2]])
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
Since your first column contains the string #Total cases, sum will throw an error. Excluding the first column will work. Also, adding na.rm=TRUE will ignore NAs:
sum(tab1[nrow(tab1),-1], na.rm = TRUE)
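To illustrate without depending on expss, here is a sketch with a hand-built data frame that only mimics the assumed shape of the cross-tab (a label column followed by numeric columns); tab and its column names are hypothetical.

```r
# Hypothetical stand-in for the cross-tab: labels first, numbers after.
tab <- data.frame(
  row_labels = c("a", "b", "#Total cases"),
  col1 = c(40, 60, 20),
  col2 = c(50, 50, 20),
  col3 = c(1, 2, NA),
  stringsAsFactors = FALSE
)

# Drop the label column, take the last row, and ignore missing values.
last_row <- tab[nrow(tab), -1]
total <- sum(last_row, na.rm = TRUE)
```

With the label column included, sum() would fail because the row is no longer all numeric.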

Compare multiple columns in 2 different dataframes in R

I am trying to compare multiple columns in two different dataframes in R. This has been addressed previously on the forum (Compare group of two columns and return index matches R), but this is a different scenario: I am trying to check whether a column in dataframe 1 falls within the range given by 2 columns in dataframe 2. Functions like match, merge, join, intersect won't work here. I have been trying to use purrr::pluck but didn't get far. The dataframes are of different sizes.
Below is an example:
temp1.df <- mtcars
temp2.df <- data.frame(
Cyl = sample (4:8, 100, replace = TRUE),
Start = sample (1:22, 100, replace = TRUE),
End = sample (1:22, 100, replace = TRUE)
)
temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)
My attempt:
temp1.df <- temp1.df %>% mutate (new_mpg = case_when (
temp1.df$cyl %in% temp2.df$Cyl & temp2.df$Start <= temp1.df$mpg & temp2.df$End >= temp1.df$mpg ~ 1
))
Error:
Error in mutate_impl(.data, dots) :
Column `new_mpg` must be length 32 (the number of rows) or one, not 100
Expected Result:
Compare temp1.df$cyl and temp2.df$Cyl. If they match, then -->
Check if temp1.df$mpg is between temp2.df$Start and temp2.df$End -->
if it is, then create a new variable new_mpg with value of 1.
It's hard to show the exact expected output here.
I realize I could loop this so for each row of temp1.df but the original temp2.df has over 250,000 rows. An efficient solution would be much appreciated.
Thanks
temp1.df$new_mpg<-apply(temp1.df, 1, function(x) {
temp<-temp2.df[temp2.df$Cyl==x[2],]
ifelse(any(apply(temp, 1, function(y) {
dplyr::between(as.numeric(x[1]),as.numeric(y[2]),as.numeric(y[3]))
})),1,0)
})
Note that this makes some assumptions about the organization of your actual data. In particular, I can't refer to the column names within apply, so I'm using indexes, which may very well change; you might want to rearrange your data before calling apply, or select the columns in the call itself, e.g., apply(temp1.df[,c("mpg","cyl")]....
At any rate, this breaks your data set into rows, and each row is compared to the subset of the second dataset with the same Cyl count. Within this subset, it checks whether the mpg for this row falls between Start and End in any row (using between from dplyr), and returns 1 if yes (or 0 if no). All these ones and zeros are then returned as a (named) vector, which can be placed into temp1.df$new_mpg.
I'm guessing there's a way to do this with rowwise, but I could never get it to work properly...
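One possible variation, offered only as a sketch and not benchmarked against the 250,000-row case: split temp2.df by Cyl once, then refer to columns by name so the result does not depend on column positions. The set.seed call and the helper name ranges_by_cyl are additions for illustration.

```r
set.seed(1)  # only to make the random example reproducible
temp1.df <- mtcars
temp2.df <- data.frame(
  Cyl   = sample(4:8, 100, replace = TRUE),
  Start = sample(1:22, 100, replace = TRUE),
  End   = sample(1:22, 100, replace = TRUE)
)
temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)

# Split the ranges by cylinder count once, instead of subsetting per row.
ranges_by_cyl <- split(temp2.df[, c("Start", "End")], temp2.df$Cyl)

temp1.df$new_mpg <- vapply(seq_len(nrow(temp1.df)), function(i) {
  r <- ranges_by_cyl[[temp1.df$cyl[i]]]
  if (is.null(r)) return(0)  # no ranges at all for this cylinder count
  mpg <- temp1.df$mpg[i]
  as.numeric(any(r$Start <= mpg & mpg <= r$End))
}, numeric(1))
```

Like the answer above, this yields 1 when at least one matching range contains mpg and 0 otherwise.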

if function for rowSums_Modify the code

I want to get summation over several columns and make a new column based on them. So I use
df$Sum <-rowSums(df[,grep("y", names(df))])
But sometimes the selection contains just one column, and in that case I get an error. Since this line is part of a long programming procedure, I was wondering how to write an if statement so that, if df[,grep("y", names(df))] has just one column, the sum equals that column, and otherwise, if it has at least two columns, the sum is taken over them.
suppose:
require(stats); require(graphics)
attach(cars)
cars$y1<-seq(20:69)
#cars$y2<-seq(30:79)
df<-cars
df$Sum <-rowSums(df[,grep("y", names(df))])
You can use drop = FALSE when subsetting:
df$Sum <-rowSums(df[,grep("y", names(df)), drop = FALSE])
This keeps the selection as a data frame even if you are selecting only one column.
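A short sketch of what changes, using an invented three-row data frame with a single column matching "y":

```r
# Toy data: only one column name contains "y".
df <- data.frame(speed = c(4, 7, 8), y1 = c(1, 2, 3))
cols <- grep("y", names(df))

# Default subsetting drops a single column down to a plain vector,
# which rowSums() rejects because it needs at least two dimensions.
one_col <- df[, cols]

# drop = FALSE keeps the data frame structure, so rowSums() works.
kept <- df[, cols, drop = FALSE]
df$Sum <- rowSums(kept)
```

The same line also works unchanged when several columns match, so no if statement is needed.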

Plyr based on which.min for whole data.frame with colwise not working

Hello, I have a list re whose elements are dataframes with the colnames c(values, diff, Sample1, Sample2, Sample3,...), up to Sample 100-1000.
The column "values" does not have unique values, and the column diff represents the difference from another vector (which is not included in the data.frames).
So, as an example, the first two (important) columns look like:
values<- c(1,1,2,2,3,4,4,4)
diff <- c(1,2,1,2,1,2,2,1)
Now I want (for every dataframe in the list) to reduce the dataset so that only one row is left per unique value in values: the one with the smallest value in diff. So for the case above:
values=c(1,2,3,4)
diff<-c(1,1,1,1)
I tried plyr:
for (k in 1:length(re)) {
ret[[k]] <- ddply(re[[k]], .(valueData), summarise, re[[k]][which.min(diff),]) }
giving the Error:
Error in vector(type, length) :
vector: cannot make a vector of mode 'closure'.
Because the data.frames contain not only the columns "values" and "diff" but many more, in differing numbers, I can't just name every column:
ret[[k]] <- ddply(re[[k]], .(valueData), summarise, diff=min(diff),
Sample1=Sample1[which.min(diff)],Sample2=Sample2[which.min(diff)],Samplex...)
So how could I fix this, or is there another option besides plyr?
Any ideas?
And many thanks!!!
Try this:
lapply(re,function(df){
df <- df[order(df$values,df$diff),]
df[!duplicated(df$values),]
})
Just sort your dataframe in ascending order by values and then diff, and keep the first row for each unique value in the values column.
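Applied to the example vectors from the question, with an invented Sample1 column added to show that the other columns travel with the selected row, this could look like:

```r
# One-element list built from the question's example; Sample1 is made up.
re <- list(data.frame(
  values  = c(1, 1, 2, 2, 3, 4, 4, 4),
  diff    = c(1, 2, 1, 2, 1, 2, 2, 1),
  Sample1 = letters[1:8],
  stringsAsFactors = FALSE
))

ret <- lapply(re, function(df) {
  # Sort by values, breaking ties by diff, so the smallest diff comes first.
  df <- df[order(df$values, df$diff), ]
  # Keep only the first row of each group of equal values.
  df[!duplicated(df$values), ]
})
```

Because whole rows are kept, there is no need to name every Sample column.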
