I tried every solution but my problem is still there. I have a big df (20rows*400cols) - for each row I want to count how many columns have a value of more than 16.
The first col is factor and the rest of the columns are integers.
my df:
col1 col2 col3 col4
abc 2 16 17
def 4 2 4
geh 50 60 73
desired output should be:
col1 col2 col3 col4 count
abc 2 16 17 1
def 4 2 4 0
geh 50 60 73 3
I tried df$morethan16 <- rowSums(df[,-1] > 16) but then I get NA in the count column.
We may need na.rm = TRUE to handle NA elements, because comparison operators (>, <, ==) return NA wherever an input is NA
df$morethan16 <- rowSums(df[,-1] > 16, na.rm = TRUE)
If we still get NA, check the class of the columns. The above code works only if the columns are numeric. Convert to numeric class automatically with type.convert (based on the values of the column)
df <- type.convert(df, as.is = TRUE)
check the structure
str(df)
If it is still not numeric, some values in the column may be character elements that prevent conversion to numeric. Force the columns to numeric with as.numeric (values that cannot be parsed become NA, with a warning). If those are factor columns, apply as.character first
df[-1] <- lapply(df[-1], function(x) as.numeric(as.character(x)))
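The steps above can be put together in one small sketch. The toy data below is an assumption made for illustration: col3 is stored as a factor and col4 contains an NA, so that both the conversion and na.rm matter.

```r
# Toy data mirroring the question. Storing col3 as a factor and putting an NA
# in col4 are assumptions made here so that both fixes are exercised.
df <- data.frame(
  col1 = factor(c("abc", "def", "geh")),
  col2 = c(2, 4, 50),
  col3 = factor(c("16", "2", "60")),
  col4 = c(17, NA, 73)
)

# Convert every column except the first: factor -> character -> numeric
df[-1] <- lapply(df[-1], function(x) as.numeric(as.character(x)))

# Count the values above 16 in each row; NA comparisons are ignored
df$morethan16 <- rowSums(df[-1] > 16, na.rm = TRUE)
df$morethan16
```

Going through as.character first matters for factors, because as.numeric on a factor returns the internal level codes rather than the printed values.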
Here is another option using crossprod
df$count <- c(crossprod(rep(1, ncol(df[-1])), t(df[-1] > 16)))
which gives
col1 col2 col3 col4 count
1 abc 2 16 17 1
2 def 4 2 4 0
3 geh 50 60 73 3
I have a data that looks as follows:
Patent_number<-c(2323,4449,4939,4939,12245)
IPC_class_1<-c("C12N",4,"C29N00185",2,"C12F")
IPC_class_2<-c(3,"K12N","C12F","A01N",8)
IPC_class_3<-c("S12F",1,"CQ010029393049",5,"CQ1N")
df<-data.frame(Patent_number, IPC_class_1, IPC_class_2, IPC_class_3)
View(df)
I want to count only the number of string values (such as C12N, A01N, etc.) per row, adding the result as another column "counts" at the end of the data frame. In other words, I want to exclude the numeric values from the row count.
Any suggestions?
You can't have mixed types in a dataframe column, so all of the numeric values will also be stored as type character. One approach would be to convert everything using as.numeric, and then use is.na to count those that are not coercible to numeric...
df$counts <- apply(sapply(df, as.numeric), 1, function(x) sum(is.na(x)))
df
Patent_number IPC_class_1 IPC_class_2 IPC_class_3 counts
1 2323 C12N 3 S12F 2
2 4449 4 K12N 1 1
3 4939 C29N C12F CQ01 3
4 4939 2 A01N 5 1
5 12245 C12F 8 CQ1N 2
We may also count by checking if all the characters are digits
df$counts <- ncol(df) - Reduce(`+`, lapply(df, grepl, pattern = '^[0-9.]+$'))
df$counts
[1] 2 1 3 1 2
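For reference, a self-contained sketch of the as.numeric idea on the question's data. The helper name is_text is hypothetical, and the warnings from the failed conversions are suppressed deliberately. Note that a genuinely missing cell (NA) would also be counted as text by this approach.

```r
Patent_number <- c(2323, 4449, 4939, 4939, 12245)
IPC_class_1 <- c("C12N", 4, "C29N00185", 2, "C12F")
IPC_class_2 <- c(3, "K12N", "C12F", "A01N", 8)
IPC_class_3 <- c("S12F", 1, "CQ010029393049", 5, "CQ1N")
df <- data.frame(Patent_number, IPC_class_1, IPC_class_2, IPC_class_3)

# Hypothetical helper: TRUE where a cell cannot be read as a number.
# as.character() guards against columns that were stored as factors.
is_text <- function(x) is.na(suppressWarnings(as.numeric(as.character(x))))

# sapply() builds a logical matrix (rows x columns); rowSums() counts the TRUEs
df$counts <- rowSums(sapply(df, is_text))
df$counts
```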
I have data with 80k rows and 874 columns. Some of these columns are empty. I use sum(is.na) in a for loop to find the indices of the empty columns. Since the first column is not empty, if sum(is.na) for a column equals the number of rows of the first column, that column is empty.
for (i in 1:ncol(loans)){
if (sum(is.na(loans[i])) == nrow(loans[1])){
print(i)
}
}
Now that I know the indices of empty columns, I want to drop them from the data. I thought about storing those indices in an array and dropping them in a loop but I don't think it will work since columns with data will replace the empty columns. How can I drop them?
You should try to provide a toy dataset for your question.
loans <- data.frame(
a = c(NA, NA, NA),
b = c(1,2,3),
c = c(1,2,3),
d = c(1,2,3),
e = c(NA, NA, NA)
)
loans[!sapply(loans, function(col) all(is.na(col)))]
sapply loops over columns of loans and applies the anonymous function checking if all elements are NA. It then coerces the output to a vector, in this case logical.
The tidyverse option:
loans[!purrr::map_lgl(loans, ~all(is.na(.x)))]
Does this work:
df <- data.frame(col1 = rep(NA, 5),
col2 = 1:5,
col3 = rep(NA,5),
col4 = 6:10)
df
col1 col2 col3 col4
1 NA 1 NA 6
2 NA 2 NA 7
3 NA 3 NA 8
4 NA 4 NA 9
5 NA 5 NA 10
# test for "no non-NA values" rather than "sum is 0", so a column of zeros is kept
df[, which(colSums(!is.na(df)) == 0)] <- NULL
df
col2 col4
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
Another approach:
df[!apply(df, 2, function(x) all(is.na(x)))]
col2 col4
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
A dplyr solution (keeps every column that has at least one non-NA value):
library(dplyr)
df %>%
  select_if(~ !all(is.na(.x)))
You can also solve this with basic tools such as if/else and for loops, although it will be slower. Loop over the columns in reverse order, so that dropping a column does not shift the indices of the columns that are still to be checked.
# evaluate each column from last to first; drop it if every value is NA
for (i in rev(seq_along(loans))) {
  if (sum(is.na(loans[, i])) == nrow(loans)) {
    loans[, i] <- NULL
  }
}
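The index-shifting worry raised in the question can also be sidestepped by collecting all the indices first and dropping the columns in a single step. A minimal sketch on the toy loans data (empty_idx is a name chosen here for illustration):

```r
loans <- data.frame(
  a = c(NA, NA, NA),
  b = c(1, 2, 3),
  c = c(1, 2, 3),
  d = c(1, 2, 3),
  e = c(NA, NA, NA)
)

# Indices of the columns in which every value is NA, computed before any drop
empty_idx <- which(colSums(is.na(loans)) == nrow(loans))

# Drop them all at once. The guard matters: with no empty columns,
# loans[-integer(0)] would select zero columns instead of all of them.
if (length(empty_idx) > 0) {
  loans <- loans[-empty_idx]
}
names(loans)
```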
I got a data frame in R like the following:
V1 V2 V3
1 2 3
1 43 54
2 34 53
3 34 51
3 43 42
...
And I want to delete all rows whose value of V1 has a frequency lower than 2. So in my example the row with V1 = 2 should be deleted, because the value "2" appears only once in the column ("1" and "3" appear twice each).
I tried to add an extra column with the frequency of V1, so that I could delete all rows where the frequency is below 2, but with the following I only get NAs in the extra column.
data$Frequency <- table(data$V1)[data$V1]
Thanks
You can try this:
library(dplyr)
df %>% group_by(V1) %>% filter(n() > 1)
You can also consider using data.table. We first count the occurrences of each value in V1, then keep the rows where that count is greater than 1. Finally, we remove the count column, as we no longer need it.
library(data.table)
setDT(dat)
dat2 <- dat[,n:=.N,V1][n>1,,][,n:=NULL]
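Or even quicker, thanks to RichardScriven. Here .I collects the row numbers of every group with at least 2 rows, and those row numbers are then used to subset dat:
dat2 <- dat[dat[, .I[.N >= 2], by = V1]$V1]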
> dat2
V1 V2 V3
1: 1 2 3
2: 1 43 54
3: 3 34 51
4: 3 43 42
With this you do not need to load a library
res<-data.frame(V1=c(1,1,2,3,3,3),V2=rnorm(6),V3=rnorm(6))
res[res$V1 %in% names(table(res$V1))[table(res$V1) >= 2], ]
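Another base R sketch, using ave() to attach the group size to every row. It also shows one likely reason the attempt in the question can return NA: table(data$V1)[data$V1] indexes the table by position when V1 is numeric, whereas indexing with as.character() looks the counts up by name. The data frame name dat and its values are taken from the question's example.

```r
dat <- data.frame(
  V1 = c(1, 1, 2, 3, 3),
  V2 = c(2, 43, 34, 34, 43),
  V3 = c(3, 54, 53, 51, 42)
)

# ave() returns, for each row, the size of the group its V1 value belongs to
freq <- ave(seq_along(dat$V1), dat$V1, FUN = length)

# The same counts via table(), looked up by name instead of by position
freq_tab <- as.vector(table(dat$V1)[as.character(dat$V1)])

# Keep only the rows whose V1 value occurs at least twice
kept <- dat[freq >= 2, ]
kept
```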
I have a data.table like this:
col1 col2 col3 new
1 4 55 col1
2 3 44 col2
3 34 35 col2
4 44 87 col3
I want to populate another column matched_value that contains the values from the respective column names given in the new column:
col1 col2 col3 new matched_value
1 4 55 col1 1
2 3 44 col2 3
3 34 35 col2 34
4 44 87 col3 87
E.g., in the first row, the value of new is "col1" so matched_value takes the value from col1, which is 1.
How can I do this efficiently in R on a very large data.table?
An excuse to use the obscure .BY:
DT[, newval := .SD[[.BY[[1]]]], by=new]
col1 col2 col3 new newval
1: 1 4 55 col1 1
2: 2 3 44 col2 3
3: 3 34 35 col2 34
4: 4 44 87 col3 87
How it works. This splits the data into groups based on the strings in new. The value of the string for each group is stored in newname = .BY[[1]]. We use this string to select the corresponding column of .SD via .SD[[newname]]. .SD stands for Subset of Data.
Alternatives. get(.BY[[1]]) should work just as well in place of .SD[[.BY[[1]]]]. According to a benchmark run by @David, the two ways are equally fast.
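To make the grouping logic concrete, here is a base R sketch of the same idea. It is not data.table code and says nothing about data.table's performance; it only mirrors the steps described above: take each distinct name in new, select the rows in that group, and copy the values from the column with that name.

```r
df1 <- data.frame(
  col1 = c(1, 2, 3, 4),
  col2 = c(4, 3, 34, 44),
  col3 = c(55, 44, 35, 87),
  new  = c("col1", "col2", "col2", "col3"),
  stringsAsFactors = FALSE
)

df1$matched_value <- NA_real_
for (nm in unique(df1$new)) {
  rows <- df1$new == nm                     # rows belonging to this group
  df1$matched_value[rows] <- df1[[nm]][rows]  # values from the named column
}
df1$matched_value
```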
We can match the 'new' column with the column names of the dataset to get the column index, cbind with the row index (1:nrow(df1)) and extract the corresponding elements of the dataset based on row/column index. It can be assigned to a new column.
df1$matched_value <- df1[-4][cbind(1:nrow(df1),match(df1$new, colnames(df1) ))]
df1
# col1 col2 col3 new matched_value
#1 1 4 55 col1 1
#2 2 3 44 col2 3
#3 3 34 35 col2 34
#4 4 44 87 col3 87
NOTE: If the OP has a data.table, one option is to convert it to a data.frame, or to use with=FALSE while subsetting.
setDF(df1) #to convert to 'data.frame'.
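The row/column matrix indexing used in this answer is easier to see on a plain matrix. In the sketch below (variable names chosen for illustration), each row of idx is one (row, column) pair, and indexing with that two-column matrix extracts one element per pair.

```r
m <- cbind(
  col1 = c(1, 2, 3, 4),
  col2 = c(4, 3, 34, 44),
  col3 = c(55, 44, 35, 87)
)
new <- c("col1", "col2", "col2", "col3")

# One (row, column) pair per row: row i paired with the column named in new[i]
idx <- cbind(seq_len(nrow(m)), match(new, colnames(m)))

picked <- m[idx]
picked
```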
Benchmarks
set.seed(45)
df2 <- data.frame(col1= sample(1:9, 20e6, replace=TRUE),
col2= sample(1:20, 20e6, replace=TRUE),
col3= sample(1:40, 20e6, replace=TRUE),
col4=sample(1:30, 20e6, replace=TRUE),
new= sample(paste0('col', 1:4), 20e6, replace=TRUE), stringsAsFactors=FALSE)
system.time(df2$matched_value <- df2[-5][cbind(1:nrow(df2),match(df2$new, colnames(df2) ))])
# user system elapsed
# 2.54 0.37 2.92
I can't imagine I'm the first person with this question, but I haven't found a solution yet (here or elsewhere).
I have a few columns, which I want to average in R. The only minimally tricky aspect is that some columns contain NAs.
For example:
Trait Col1 Col2 Col3
DF 23 NA 23
DG 2 2 2
DH NA 9 9
I want to create a Col4 that averages the entries in the first 3 columns, ignoring the NAs.
So:
Trait Col1 Col2 Col3 Col4
DF 23 NA 23 23
DG 2 2 2 2
DH NA 9 9 9
Ideally something like this would work:
data$Col4 <- mean(data$Col1, data$Col2, data$Col3, na.rm=TRUE)
but it doesn't.
You want rowMeans() but importantly note it has a na.rm argument that you want to set to TRUE. E.g.:
> mat <- matrix(c(23,2,NA,NA,2,9,23,2,9), ncol = 3)
> mat
[,1] [,2] [,3]
[1,] 23 NA 23
[2,] 2 2 2
[3,] NA 9 9
> rowMeans(mat)
[1] NA 2 NA
> rowMeans(mat, na.rm = TRUE)
[1] 23 2 9
To match your example:
> dat <- data.frame(Trait = c("DF","DG","DH"), mat)
> names(dat) <- c("Trait", paste0("Col", 1:3))
> dat
Trait Col1 Col2 Col3
1 DF 23 NA 23
2 DG 2 2 2
3 DH NA 9 9
> dat <- transform(dat, Col4 = rowMeans(dat[,-1], na.rm = TRUE))
> dat
Trait Col1 Col2 Col3 Col4
1 DF 23 NA 23 23
2 DG 2 2 2 2
3 DH NA 9 9 9
Why NOT the accepted answer?
The accepted answer is correct, but it is specific to this particular task and does not generalize. What if, instead of the mean, we need other statistics such as var or skewness, or even a custom function?
A more flexible solution:
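# drop the non-numeric Trait column first, otherwise apply() works on a character matrix
row_means <- apply(X = data[, -1], MARGIN = 1, FUN = mean, na.rm = TRUE)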
More details on apply:
Generally, to apply any function (custom or built-in) across a dataset, column-wise or row-wise, use apply or one of its variants (sapply, lapply, ...). Its signature is:
apply(X, MARGIN, FUN, ...)
where:
X: The data, as a data frame or matrix.
MARGIN: The dimension on which the aggregation takes place. Use 1 for row-wise operation and 2 for column-wise operation.
FUN: The operation to be called on the data. Any predefined R function or any user-defined function can be used here.
...: Additional arguments passed on to FUN. For example, na.rm = TRUE is forwarded to mean, which then removes the NA values before averaging.
Why should I use apply?
For many reasons, including but not limited to:
Any function can be easily plugged in to apply.
For different preferences such as the input or output data types, other variations can be used (e.g., lapply for operations on lists).
(Most importantly) It facilitates scalability, since there are variants of these functions that allow parallel execution (e.g. mclapply from the {parallel} package).
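A short sketch of plugging different functions into apply, using the data from the question. The custom function n_obs is hypothetical and only illustrates that any function can take the place of mean.

```r
dat <- data.frame(
  Trait = c("DF", "DG", "DH"),
  Col1 = c(23, 2, NA),
  Col2 = c(NA, 2, 9),
  Col3 = c(23, 2, 9)
)

# Work on the numeric columns only
num <- dat[, -1]

# Built-in function; na.rm = TRUE is passed through apply()'s ... to mean()
row_means <- apply(num, 1, mean, na.rm = TRUE)

# Hypothetical custom function: number of non-missing values in a row
n_obs <- function(x) sum(!is.na(x))
row_n <- apply(num, 1, n_obs)

dat$Col4 <- row_means
dat
```

For the mean specifically, rowMeans(num, na.rm = TRUE) gives the same result and is the more direct choice; apply is useful when no dedicated row-wise function exists.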