Count only string values in a row - r

I have data that looks as follows:
Patent_number<-c(2323,4449,4939,4939,12245)
IPC_class_1<-c("C12N",4,"C29N00185",2,"C12F")
IPC_class_2<-c(3,"K12N","C12F","A01N",8)
IPC_class_3<-c("S12F",1,"CQ010029393049",5,"CQ1N")
df<-data.frame(Patent_number, IPC_class_1, IPC_class_2, IPC_class_3)
View(df)
I want to count only the number of string values (such as C12N, A01N, etc.) per row and store the result in a new column "counts" at the end of the data frame. In other words, I want to exclude the numeric values from the row count.
Any suggestions?

You can't have mixed types in a dataframe column, so all of the numeric values will also be stored as type character. One approach would be to convert everything using as.numeric, and then use is.na to count those that are not coercible to numeric...
df$counts <- apply(sapply(df, as.numeric), 1, function(x) sum(is.na(x)))
df
Patent_number IPC_class_1 IPC_class_2 IPC_class_3 counts
1 2323 C12N 3 S12F 2
2 4449 4 K12N 1 1
3 4939 C29N C12F CQ01 3
4 4939 2 A01N 5 1
5 12245 C12F 8 CQ1N 2

We may also count by checking whether a value consists only of digits (and decimal points), then subtracting those matches from the number of columns:
df$counts <- ncol(df) - Reduce(`+`, lapply(df, grepl, pattern = '^[0-9.]+$'))
df$counts
[1] 2 1 3 1 2
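Both answers above scan every column, including Patent_number. As a minimal sketch, assuming the columns of interest all start with the prefix "IPC_class_" (as in the example data), the check can be limited to those columns and the coercion warnings silenced. The names ipc_cols and is_text are made up for illustration.

```r
# Rebuild the example data from the question
df <- data.frame(
  Patent_number = c(2323, 4449, 4939, 4939, 12245),
  IPC_class_1 = c("C12N", 4, "C29N00185", 2, "C12F"),
  IPC_class_2 = c(3, "K12N", "C12F", "A01N", 8),
  IPC_class_3 = c("S12F", 1, "CQ010029393049", 5, "CQ1N")
)

# Assumption: the columns to inspect share the prefix "IPC_class_"
ipc_cols <- grep("^IPC_class_", names(df), value = TRUE)

# as.numeric() gives NA for text that is not a number;
# suppressWarnings() hides the "NAs introduced by coercion" warnings
is_text <- sapply(df[ipc_cols], function(col) {
  is.na(suppressWarnings(as.numeric(col)))
})

# One count per row
df$counts <- rowSums(is_text)
df$counts
```

Note that a genuinely missing value (NA) would also be counted as text here, and a string such as "1e5" would be treated as numeric, so this is only appropriate when the data looks like the example.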

Related

changing column names of a data frame by changing values - R

Suppose I have the data frame below.
df.open<-c(1,4,5)
df.close<-c(2,8,3)
df<-data.frame(df.open, df.close)
> df
df.open df.close
1 1 2
2 4 8
3 5 3
I want to rename the columns whose names include "open" to "a" and the columns whose names include "close" to "b".
That is, I want to obtain the data frame below:
a b
1 1 2
2 4 8
3 5 3
I have a lot of such data frames. The prefix (here "df.") changes from one data frame to the next, but "open" and "close" are fixed.
Thanks a lot.
We can create a function for reuse
f1 <- function(dat) {
names(dat)[grep('open$', names(dat))] <- 'a'
names(dat)[grep('close$', names(dat))] <- 'b'
dat
}
and apply on the data
df <- f1(df)
-output
df
a b
1 1 2
2 4 8
3 5 3
if these datasets are in a list
lst1 <- list(df, df)
lst1 <- lapply(lst1, f1)
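Since the question mentions that the prefix changes between data frames, here is a small sketch of f1 applied to a list whose elements use different prefixes and column orders. The second data frame and its "xy." prefix are made up for illustration.

```r
# Same helper as above: rename by suffix, whatever the prefix is
f1 <- function(dat) {
  names(dat)[grep("open$", names(dat))] <- "a"
  names(dat)[grep("close$", names(dat))] <- "b"
  dat
}

# Hypothetical inputs with different prefixes and column order
frames <- list(
  first  = data.frame(df.open = c(1, 4, 5), df.close = c(2, 8, 3)),
  second = data.frame(xy.close = c(2, 8, 3), xy.open = c(1, 4, 5))
)

renamed <- lapply(frames, f1)

# Inspect the new names of each data frame
lapply(renamed, names)
```

Because the match is made on the suffix of each name, the result does not depend on the order in which the columns appear.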
Thanks to @akrun's suggestion, we can do it in one go. str_replace is vectorised over its pattern and replacement arguments, so we can pass a character vector to each and carry out both operations at once. Each vector may have length one or more; if longer than one, the lengths must correspond. The patterns are applied element by element to the column names, so this relies on the columns appearing in the order open, close. More to the point, as the documentation says:
References of the form \1, \2, etc will be replaced with the contents
of the respective matched group (created by ())
library(dplyr)
library(stringr)
df %>%
rename_with(~ str_replace(., c(".*\\.open", ".*\\.close"), c("a", "b")))
a b
1 1 2
2 4 8
3 5 3
Another base R option using sub + match + setNames
setNames(
  df,
  c("a", "b")[match(
    # keep only the trailing "open" or "close", whatever the prefix is
    sub(".*(open|close)$", "\\1", names(df)),
    c("open", "close")
  )]
)
gives
a b
1 1 2
2 4 8
3 5 3

Duplicating R dataframe vector values using another vector as a guide

I have the following R dataframe: df = data.frame(value=c(5,4,3,2,1), a=c(2,0,1,6,9), b=c(7,0,0,3,4)). I would like to duplicate the values of a and b by the number of times given by the corresponding position in value. For example, expanding b would look like b_ex = c(7,7,7,7,7,2,2,2,4). No values of three or four would be in b_ex because values of zero are in b[2] and b[3]. The expanded vectors should be assigned names and be stand-alone.
Thanks!
Maybe you are looking for :
result <- lapply(df[-1], function(x) rep(x[x != 0], df$value[x != 0]))
#$a
#[1] 2 2 2 2 2 1 1 1 6 6 9
#$b
#[1] 7 7 7 7 7 3 3 4
To have them as separate vectors in global environment use list2env :
list2env(result, .GlobalEnv)
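To make the mechanics visible, here is a sketch that expands column b step by step and then stores all expanded vectors in a separate environment instead of the global one. The names keep and expanded are made up for illustration.

```r
df <- data.frame(value = c(5, 4, 3, 2, 1),
                 a = c(2, 0, 1, 6, 9),
                 b = c(7, 0, 0, 3, 4))

# Step by step for column b: keep the non-zero entries, then repeat
# each one by the count found at the same position in `value`
keep <- df$b != 0
b_ex <- rep(df$b[keep], times = df$value[keep])
b_ex

# Same idea for every column except `value`
result <- lapply(df[-1], function(x) rep(x[x != 0], df$value[x != 0]))

# list2env() returns the environment it filled; using a fresh one
# avoids overwriting objects called `a` or `b` in the workspace
expanded <- list2env(result, envir = new.env())
get("a", envir = expanded)
```

Writing into a dedicated environment is a design choice rather than a requirement; list2env(result, .GlobalEnv) as shown above works the same way.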

R First Row By Group When Condition Is Met

dataHAVE=data.frame(STUDENT=c(1,1,1,2,2,2,3,3,3),
SCORE=c(0,1,1,5,1,2,1,1,1),
CAT=c(3,10,7,4,5,0,4,5,1),
FOX=c(5,0,10,8,9,1,8,9,0))
dataWANT=data.frame(STUDENT=c(1,2,3),
SCORE=c(1,1,1),
CAT=c(10,5,4),
FOX=c(0,9,8))
I have 'dataHAVE' and want 'dataWANT', which takes the first row for every 'STUDENT' where 'SCORE' equals 1. I am seeking a data.table solution because the data is large. I tried this but do not know how to set the criterion for 'SCORE':
dataWANT[,.SD[1],by = key(STUDENT)]
Convert the 'data.frame' to 'data.table' (setDT), grouped by 'STUDENT', specify the logical condition in i, get the index of the first row (.I[1]), extract that column ($V1) and subset the rows
library(data.table)
setDT(dataHAVE)[dataHAVE[SCORE == 1, .I[1], STUDENT]$V1]
.I returns row index. If we don't have a grouping column, it would return a vector i.e.
setDT(dataHAVE)[SCORE == 1, .I]
#[1] 1 2 3 4 5 6
when we provide the grouping column, by default, the .I returns with a named column V1 (we could override it by changing the name)
setDT(dataHAVE)[SCORE == 1, .(colindex = .I[1]), STUDENT]
# STUDENT colindex
#1: 1 2
#2: 2 5
#3: 3 7
Now we have two columns, 'STUDENT' and 'colindex'. We are specifically interested in 'colindex', so extract it with the standard procedures ($ or [[) and then use it as the row index in i
i1 <- setDT(dataHAVE)[SCORE == 1, .(colindex = .I[1]), STUDENT]$colindex
i1
#[1] 2 5 7
This we use for subsetting
dataHAVE[i1]
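For comparison, a more direct variant of the attempt in the question is to put the condition in i and take the first row of each group in j. This is a sketch that assumes the data.table package is installed; first_hits is a made-up name.

```r
library(data.table)

dataHAVE <- data.table(
  STUDENT = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  SCORE   = c(0, 1, 1, 5, 1, 2, 1, 1, 1),
  CAT     = c(3, 10, 7, 4, 5, 0, 4, 5, 1),
  FOX     = c(5, 0, 10, 8, 9, 1, 8, 9, 0)
)

# i filters to SCORE == 1, by groups the remaining rows by STUDENT,
# and .SD[1] keeps the first row of each group
first_hits <- dataHAVE[SCORE == 1, .SD[1], by = STUDENT]
first_hits
```

Students with no row where SCORE is 1 simply do not appear in the result.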
Here is a base R option using subset + ave
subset(
dataHAVE,
ave(SCORE==1, STUDENT, FUN = function(x) seq_along(x) == min(which(x)))
)
which gives
STUDENT SCORE CAT FOX
2 1 1 10 0
5 2 1 5 9
7 3 1 4 8
Solution 1. There is a straightforward and comprehensive solution in two lines:
dataWANT <- dataHAVE[dataHAVE$SCORE == 1,] #Filter score equals to 1
dataWANT <- dataWANT[!duplicated(dataWANT$STUDENT), ] #Remove duplicated students
Solution 2. However, if you prefer to solve it in one line:
dataWANT <- dataHAVE[!duplicated(paste(dataHAVE$STUDENT, dataHAVE$SCORE, sep = "_")) & dataHAVE$SCORE == 1, ]
This builds a logical vector that is TRUE for the first occurrence of each STUDENT/SCORE combination and combines it with a test that 'SCORE' is 1. The separator keeps combinations such as STUDENT 1, SCORE 11 and STUDENT 11, SCORE 1 from collapsing into the same key.
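As a quick check, the two-step filter from Solution 1 can be compared against the expected 'dataWANT' from the question. This is a sketch using base R only; hits and first_hits are made-up names.

```r
dataHAVE <- data.frame(STUDENT = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                       SCORE = c(0, 1, 1, 5, 1, 2, 1, 1, 1),
                       CAT = c(3, 10, 7, 4, 5, 0, 4, 5, 1),
                       FOX = c(5, 0, 10, 8, 9, 1, 8, 9, 0))
dataWANT <- data.frame(STUDENT = c(1, 2, 3),
                       SCORE = c(1, 1, 1),
                       CAT = c(10, 5, 4),
                       FOX = c(0, 9, 8))

# Keep rows with SCORE == 1, then the first such row per STUDENT
hits <- dataHAVE[dataHAVE$SCORE == 1, ]
first_hits <- hits[!duplicated(hits$STUDENT), ]

# Reset row names (2, 5, 7) so the comparison looks only at the values
rownames(first_hits) <- NULL
isTRUE(all.equal(first_hits, dataWANT))
```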
You could use match to get 1st row where SCORE = 1 for each STUDENT.
library(data.table)
setDT(dataHAVE)
dataHAVE[, .SD[match(1, SCORE)], STUDENT]
# STUDENT SCORE CAT FOX
#1: 1 1 10 0
#2: 2 1 5 9
#3: 3 1 4 8

how to name data frame columns to column index

This is a very basic question. How can you set the column names of a data frame to the column index? So if you have 4 columns, the column names will be 1 2 3 4. The data frame I am using can have up to 100 columns.
It is not a good idea to give columns names that start with a number. Suppose we set the names to seq_along(D). Extracting a column then becomes unnecessarily complicated. For example,
names(D) <- seq_along(D)
D$1
#Error: unexpected numeric constant in "D$1"
In that case, we may need backticks or ""
D$"1"
#[1] 1 2 3
D$`1`
#[1] 1 2 3
However, [[ with the name as a string should work
D[["1"]]
#[1] 1 2 3
I would use
names(D) <- paste0("Col", seq_along(D))
D$Col1
#[1] 1 2 3
Or
D[["Col1"]]
#[1] 1 2 3
data
D <- data.frame(a=c(1,2,3),b=c(4,5,6),c=c(7,8,9),d=c(10,11,12))
Just use names:
D <- data.frame(a=c(1,2,3),b=c(4,5,6),c=c(7,8,9),d=c(10,11,12))
names(D) <- 1:ncol(D) # sequence from 1 through the number of columns
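A short sketch of what happens after such an assignment: the index names are stored as character strings, so a column can be reached either by name (as a string) or by position. make.names() is shown as one way to turn them back into syntactically valid names.

```r
D <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9), d = c(10, 11, 12))
names(D) <- seq_along(D)

# The names are character strings, not numbers
names(D)

# Access by name (string) and by position return the same column here
by_name <- D[["2"]]
by_position <- D[[2]]

# make.names() prefixes an "X" so the names become valid R identifiers
valid <- make.names(names(D))
valid
```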

Select last non-NA value in a row, by row

I have a data frame where each row is a vector of values of varying lengths. I would like to create a vector of the last true value in each row.
Here is an example data frame:
df <- read.table(tc <- textConnection("
var1 var2 var3 var4
1 2 NA NA
4 4 NA 6
2 NA 3 NA
4 4 4 4
1 NA NA NA"), header = TRUE); close(tc)
The vector of values I want would therefore be c(2,6,3,4,1).
I just can't figure out how to get R to identify the last value.
Any help is appreciated!
Do this by combining three things:
Identify NA values with is.na
Find the last value in a vector with tail
Use apply to apply this function to each row in the data.frame
The code:
lastValue <- function(x) tail(x[!is.na(x)], 1)
apply(df, 1, lastValue)
[1] 2 6 3 4 1
Here's an answer using matrix subsetting:
df[cbind( 1:nrow(df), max.col(!is.na(df),"last") )]
This max.col call selects the position of the last non-NA value in each row. If a row is entirely NA, all positions tie, so the position picked still holds NA and the result for that row is NA.
Here's another version that removes all infinities, NA, and NaN's before taking the first element of the reversed input:
apply(df, 1, function(x) rev(x[is.finite(x)])[1] )
# [1] 2 6 3 4 1
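The tail-based helper returns a zero-length vector for a row that is entirely NA, which makes apply return a list instead of a vector. Here is a sketch that returns NA for such rows and uses vapply so the result is always a numeric vector. The extra all-NA row and the names last_value and m are added for illustration.

```r
df <- read.table(text = "var1 var2 var3 var4
1 2 NA NA
4 4 NA 6
2 NA 3 NA
4 4 4 4
1 NA NA NA
NA NA NA NA", header = TRUE)

# Last non-NA element of a vector, or NA if there is none
last_value <- function(x) {
  ok <- x[!is.na(x)]
  if (length(ok) == 0) NA_real_ else as.numeric(ok[length(ok)])
}

# Work on a matrix so each row is a plain vector;
# vapply() guarantees one numeric value per row
m <- as.matrix(df)
result <- vapply(seq_len(nrow(m)), function(i) last_value(m[i, ]), numeric(1))
result
```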
