Separating multiple values within one cell into multiple cells - r

I am working with patent data and I have multiple values (International Patent Classifications) separated by commas in one cell. For instance: C12N, C12P, A01 (within one cell). I would like to analyze how many different 4-digit classifications each patent has. To do so, I first need to separate the individual values and put each of them into its own cell. Second, I need to count the unique values (each then placed in a separate column) for each row.
How can I separate the individual values within one cell into multiple cells in Excel or R? Is there an Excel or R function you can suggest?
Here is a reproducible example of what the data looks like in R or in Excel.
#Example of how the data looks like
Publication_number<-c(12345,10012,23323,44556,77999)
IPC_class_4_digits<-c("C12N,CF01,C345","C12P,F12N,F039","A014562,F23N", "A01C, A01B, A01F, A01K, A01G", "C10N, C10R, C10Q, C12F")
data_example<-cbind(Publication_number, IPC_class_4_digits)
View(data_example)
The expected output should be a column "counts" containing the number of different 4-digit classifications. In this case => c(3, 3, 2, 5, 4)

Assuming you have a dataframe with two columns, Publication_number and IPC_class_4_digits, you could use cSplit from the splitstackshape package:
library(splitstackshape)
# assuming your data
df <- data.frame(Publication_number, IPC_class_4_digits)
cSplit(df, 'IPC_class_4_digits', ',')
Output:
Publication_number IPC_class_4_digits_1 IPC_class_4_digits_2 IPC_class_4_digits_3
1: 12345 C12N CF01 C345
2: 1001 C12P F12N F039
3: 2332 A014562 F23N <NA>

You can split the string on commas and count its length.
data_example$count <- lengths(strsplit(data_example$IPC_class_4_digits, ','))
data_example
# Publication_number IPC_class_4_digits count
#1 12345 C12N,CF01,C345 3
#2 10012 C12P,F12N,F039 3
#3 23323 A014562,F23N 2
#4 44556 A01C, A01B, A01F, A01K, A01G 5
#5 77999 C10N, C10R, C10Q, C12F 4
Or another option is to use str_count -
data_example$count <- stringr::str_count(data_example$IPC_class_4_digits, ',') + 1
data
data_example<-data.frame(Publication_number, IPC_class_4_digits)
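Both approaches above count every comma-separated entry. If a classification could be repeated within a cell, counting only the distinct values needs an extra step. The following is a minimal sketch using the question's example data; it assumes that stray spaces after the commas should be ignored, which the question does not state explicitly.

```r
# Example data from the question, built as a data frame
Publication_number <- c(12345, 10012, 23323, 44556, 77999)
IPC_class_4_digits <- c("C12N,CF01,C345", "C12P,F12N,F039", "A014562,F23N",
                        "A01C, A01B, A01F, A01K, A01G", "C10N, C10R, C10Q, C12F")
data_example <- data.frame(Publication_number, IPC_class_4_digits,
                           stringsAsFactors = FALSE)

# Split each cell on commas, trim surrounding whitespace (assumption),
# then count the distinct values per row
parts <- strsplit(data_example$IPC_class_4_digits, ",")
data_example$count <- vapply(parts,
                             function(p) length(unique(trimws(p))),
                             integer(1))
data_example$count
```

For this example the distinct count equals the plain count, because no row repeats a classification.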


how to check the same ID in a different dataframe and make the merge file?

I want to modify the longitudinal data based on ID.
I want to check whether the IDs in data wave 1(A) and data in wave 2(B) match properly. Also, I want to combine the data of A and B into one file based on ID.
I tried to merge the file using merge() code and tried to check whether the ID matched through the sex variable. However, it is difficult to check ID if there is no same variable in both waves, and it does not directly check each ID.
ID <- c(1012,1102,1033,1204,1555)
sex <- c(1,0,1,0,1)
A <- cbind(ID,sex)
A <- as.data.frame(A)
ID <- c(1006,1102,1001,1033,1010,1234,1506,1999)
sex <- c(1,0,1,1,1,0,0,0)
B <- cbind(ID,sex)
B <- as.data.frame(B)
merge.AB<-merge(A,B,by="ID")
all(merge.AB$sex.x == merge.AB$sex.y)
1. Is there any way to merge the A (wave 1) and B (wave 2) files by ID other than merge()?
Since there are 2 or 3 wave1 files other than A, it would be nice to be able to combine several files at once.
2. Is there a way to check if two frames have the same ID directly?
I tried to combine the files and check matching IDs through cbind() code for combining the A and B. But I couldn't check them together because the number of rows between the A and B dataframe is different.
It would be helpful to use a loop(e.g. if, for, etc.), but it would be nice if there was a way to do it with a package or simple code.
3. How do I locate a row number with a mismatched ID?
I want to know the row numbers of all mismatched IDs in the example.
e.g.
mismatched ID in A: 1012,1204,1555
mismatched ID in B: 1006,1001,1010,1234,1506,1999
Question 1 : you can merge multiple dataframes with merge. You first need to create a list of the df you want to merge and then you could use Reduce.
df_list <- list(df1,df2,...dfn)
data <- Reduce(function(x, y) merge(x = x, y = y, by = "ID", all = TRUE), df_list)
Alternatively using tidyverse:
library(tidyverse)
df_list %>% reduce(full_join, by='ID')
In your example, note that both data frames contain a sex column holding the same information, so merging by ID alone produces duplicated columns (sex.x and sex.y). If you omit by, merge() joins on all column names the data frames share:
data <- Reduce(function(x, y) merge(x = x, y = y, all = TRUE), df_list)
which keeps a single sex column in the merged data frame.
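As a concrete illustration of the Reduce approach, here is a minimal sketch that merges the question's A and B with a third data frame. The data frame C and its age column are hypothetical and only there to show more than two inputs.

```r
# Wave data from the question
A <- data.frame(ID = c(1012, 1102, 1033, 1204, 1555),
                sex = c(1, 0, 1, 0, 1))
B <- data.frame(ID = c(1006, 1102, 1001, 1033, 1010, 1234, 1506, 1999),
                sex = c(1, 0, 1, 1, 1, 0, 0, 0))
# Hypothetical third data frame, not part of the question
C <- data.frame(ID = c(1102, 1999), age = c(34, 51))

df_list <- list(A, B, C)

# Full outer join on whatever columns each pair shares (ID and sex, then ID)
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), df_list)
merged
```

One caveat of joining on all shared columns: if an ID had a different sex value in two waves, the rows would not match and that ID would appear twice in the result, which can itself serve as a consistency check.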
Question 2 : check IDs with setdiff() and intersect()
intersect() gives you the common values between two vectors
setdiff(x,y) gives you the values in x that are not present in y
intersect(A$ID,B$ID)
[1] 1102 1033
setdiff(A$ID,B$ID)
[1] 1012 1204 1555
setdiff(B$ID,A$ID)
[1] 1006 1001 1010 1234 1506 1999
Question 3 : a simple which() including %in% test will give you the position in the dataframe
which(!(A$ID %in% B$ID))
[1] 1 4 5
which(!(B$ID %in% A$ID))
[1] 1 3 5 6 7 8

match/merge dataframes with a number columns with different column names in r

I have two data frames with different columns and a large number of rows (about 2 million).
The first one is df1
The second one is df2
I need to match the values in the y column of the first table to the R column of the second table.
Example:
see the two rows in df1 in red box have matched the two rows in df2 in red box
Then I need to get the score of the matched values
so the result should look like this and it should be stores in a dataframe:
My attempt: I am a beginner in R. When I searched, I found that I could use the match function or the merge function, but I did not get the result I wanted, probably because I did not know how to use them correctly. Therefore, I need a very simple, step-by-step solution.
We can use match from base R
df2[match(df1$y, df2$R, nomatch = 0), c("R", "score")]
# R score
#3 2 3
#4 111 4
Or another option is semi_join from dplyr
library(dplyr)
semi_join(df2[-1], df1, by = c(R = "y"))
# R score
#1 2 3
#2 111 4
Or with merge from base R:
merge(df1, df2, by.x = "y", by.y = "R")[c("y", "score")]
y score
1 2 3
2 111 4
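Since the original tables are not reproduced here, the following sketch uses small made-up data frames to show the idea end to end. All values, and the x and id columns, are hypothetical; only the column names y, R and score come from the question.

```r
# Hypothetical stand-ins for the two tables in the question
df1 <- data.frame(x = c("a", "b", "c", "d"),
                  y = c(5, 9, 2, 111))
df2 <- data.frame(id = 1:4,
                  R = c(7, 8, 2, 111),
                  score = c(1, 2, 3, 4))

# Keep the rows of df2 whose R value occurs somewhere in df1$y
matched <- df2[df2$R %in% df1$y, c("R", "score")]

# Same rows via merge, joining df1$y against df2$R
merged <- merge(df1, df2, by.x = "y", by.y = "R")[c("y", "score")]
```

The %in% filter keeps every matching row of df2 in its original order, whereas merge sorts the result by the key and repeats rows when a key occurs several times on either side.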

R require cell counts for number of occurrences of regex pattern over entire data frame

I'm working in R and I have a data frame containing epigenetic information. I have 300,000 rows containing genomic locations and 15 columns each of which identifies a transcription factor motif that may or may not occur at each locus.
I'm trying to use regular expressions to count how many times each transcription factor occurs at each genomic locus. Individual motifs can occur > 15 times at any one locus, so I'd like the output to be a matrix/data frame containing motif counts for each individual cell of the data frame.
A typical single occurrence of a motif in a cell could be:
2212(AATTGCCCCACA,-,0.00)
Whereas if there were multiple occurrences of a motif, these would exist in the cell as a continuous string each entry separated by a comma, for example for two entries:
144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)
Here is some toy data:
df <-data.frame(NAMES = c('LOC_A', 'LOC_B', 'LOC_C', 'LOC_D'),
TFM1 = c("144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)", "2(TGTGAGTCAC,+,0.00)", "0", "0"),
TFM2 = c("632(TAAAGAGTCAC,-,0.00),60(GTCCCTCACT,-,0.00),", "7(TGTGAGTCAC,+,0.00)", "7(TGTGAGTCAC,+,0.00)", "0"),
stringsAsFactors = F)
I'd be looking for the output in the following format:
NAMES TFM1 TFM2
LOC_A 2 2
LOC_B 1 1
LOC_C 0 1
LOC_D 0 0
If possible, I'd like to avoid for loops, but if loops are required so be it. To get row counts for this data frame I used the following code, kindly recommended by @akrun:
df$MotifCount <- Reduce(`+`, lapply(df[-1],
function(x) lengths(str_extract_all(x, "\\d+\\("))))
Notice that the unique identifier for the motifs used here is "\\d+\\(" to pick up the number and opening bracket at the start of each motif identification string. This would have to be included in any solution code. Something similar which worked across the whole data frame to provide individual cell counts would be ideal.
Many Thanks
We don't need the Reduce part
library(stringr) # provides str_extract_all
data.frame(c(df[1], lapply(df[-1], function(x) lengths(str_extract_all(x, "\\d+\\(")))))
# NAMES TFM1 TFM2
#1 LOC_A 2 2
#2 LOC_B 1 1
#3 LOC_C 0 1
#4 LOC_D 0 0
This will also work:
cbind.data.frame(df[1],sapply(lapply(df[-1], function(x) str_extract_all(x, "\\d+\\(")), function(x) lapply(x, length)))
# NAMES TFM1 TFM2
#1 LOC_A 2 2
#2 LOC_B 1 1
#3 LOC_C 0 1
#4 LOC_D 0 0
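If you would rather not depend on stringr, the same per-cell counts can be obtained with base R. This is a minimal sketch on the toy data from the question; the helper name count_motifs is made up for illustration, and the pattern "[0-9]+\\(" is used as an equivalent of "\\d+\\(".

```r
# Toy data from the question
df <- data.frame(
  NAMES = c("LOC_A", "LOC_B", "LOC_C", "LOC_D"),
  TFM1 = c("144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)",
           "2(TGTGAGTCAC,+,0.00)", "0", "0"),
  TFM2 = c("632(TAAAGAGTCAC,-,0.00),60(GTCCCTCACT,-,0.00),",
           "7(TGTGAGTCAC,+,0.00)", "7(TGTGAGTCAC,+,0.00)", "0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count "digits followed by (" in each element of x
count_motifs <- function(x) {
  lengths(regmatches(x, gregexpr("[0-9]+\\(", x)))
}

# Apply the helper to every column except NAMES
counts <- data.frame(df[1], lapply(df[-1], count_motifs))
counts
```

Cells containing only "0" produce no match, so they are counted as zero occurrences.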

List all possible occurrences within a column?

I am trying to merge a data.frame and a column from another data.frame, but have so far been unsuccessful.
My first data.frame [Frequencies] consists of 2 columns, containing 47 upper/ lower case alpha characters and their frequency in a bigger data set. For example purposes:
Character<-c("A","a","B","b")
Frequency<-c(100,230,500,420)
The second data.frame [Sequences] is 93,000 rows in length and contains 2 columns, with the 47 same upper/ lower case alpha characters and a corresponding qualitative description. For example:
Character<-c("a","a","b","A")
Descriptor<-c("Fast","Fast","Slow","Stop")
I wish to add the descriptor column to the [Frequencies] data.frame, but not the 93,000 rows! Rather, what each "Character" represents. For example:
Character<-c("a")
Frequency<-c("230")
Descriptor<-c("Fast")
The following can also be done:
> merge(adf, bdf[!duplicated(bdf$Character),])
Character Frequency Descriptor
1 a 230 Fast
2 A 100 Fast
3 b 420 Stop
4 B 500 Slow
Why not:
df1$Descriptor <- df2$Descriptor[ match(df1$Character, df2$Character) ]
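To make the match() idea concrete, here is a small sketch built from the example vectors in the question. The names Frequencies and Sequences follow the question's description; taking the first descriptor found for each character is an assumption.

```r
# Example data from the question
Frequencies <- data.frame(Character = c("A", "a", "B", "b"),
                          Frequency = c(100, 230, 500, 420),
                          stringsAsFactors = FALSE)
Sequences <- data.frame(Character = c("a", "a", "b", "A"),
                        Descriptor = c("Fast", "Fast", "Slow", "Stop"),
                        stringsAsFactors = FALSE)

# match() returns the position of the FIRST occurrence of each character
# in Sequences, so duplicates in the long table are ignored
idx <- match(Frequencies$Character, Sequences$Character)
Frequencies$Descriptor <- Sequences$Descriptor[idx]
Frequencies
```

Characters that never occur in Sequences (here "B") get NA, which makes gaps in the lookup table easy to spot.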

Subset data frame based on first letters of column name

I have a large dataframe with multiple columns representing different variables that were measured for different individuals. The column names always start with a number (e.g. 1:18). I would like to subset the df and create separate dfs for each individual. Here is an example:
x <- as.data.frame(matrix(nrow=10,ncol=18))
colnames(x) <- paste(1:18, 'col', sep="")
The column names of my real df are composed of the individual ID, the variable name, and the number of the measure (I took 3 measures of each variable). So, for instance, for the measure b (body) of individual 1, the df has 3 columns named 1b1, 1b2, 1b3. In the end I have 10 different regions (body, head, tail, tail base, dorsum, flank, venter, throat, forearm, leg), so for each individual I have 30 columns (10 regions x 3 measures per region). The variables therefore start with different numbers, and I would like to subset them based on their unique numbers. I tried using grep:
partialName <- 1
df2<- x[,grep(partialName, colnames(x))]
colnames(x)
[1] "1col" "2col" "3col" "4col" "5col" "6col" "7col" "8col" "9col" "10col"
"11col" "12col" "13col" "14col" "15col" "16col" "17col" "18col"
My problem here, as you can see, is that it doesn't separate the individuals, because both 1 and 10 are in the subset. In other words, this selects every column whose name contains a 1.
Ultimately what I would like to do is to loop over all my individuals (1:18), creating new dfs for each individual.
I think keeping the data in one data.frame is the best option here. Either that, or put it into a list of data.frames. This makes it much easier to extract summary statistics per individual.
First create some example data:
df = as.data.frame(matrix(runif(50 * 100), 100, 50), stringsAsFactors = FALSE)
names_variables = c('spam', 'ham', 'shrub')
individuals = 1:100
column_names = paste(sample(individuals, 50),
sample(names_variables, 50, TRUE),
sep = '')
colnames(df) = column_names
What I would do first is use melt to cast the data from wide format to long format. This essentially stacks all the columns in one big vector, and adds an extra column telling which column it came from:
library(reshape2)
df_melt = melt(df)
head(df_melt)
variable value
1 85ham 0.83619111
2 85ham 0.08503596
3 85ham 0.54599402
4 85ham 0.42579376
5 85ham 0.68702319
6 85ham 0.88642715
Then we need to separate the ID number from the variable. The assumption here is that the numeric part of the variable is the individual ID, and the text is the variable name:
library(dplyr)
df_melt = mutate(df_melt, individual_ID = gsub('[A-Za-z]', '', variable),
var_name = gsub('[0-9]', '', variable))
essentially removing the part of the string not needed. Now we can do nice things like:
mean_per_individual_per_var = summarise(group_by(df_melt, individual_ID, var_name),
mean(value))
head(mean_per_individual_per_var)
individual_ID var_name mean(value)
1 63 spam 0.4840511
2 46 ham 0.4979884
3 20 shrub 0.5094550
4 90 ham 0.5550148
5 30 shrub 0.4233039
6 21 ham 0.4764298
It seems that your colnames are the standard ones of a data.frame, so to get just the column 1 you can do this:
df2 <- df[,1] #Where 1 can be changed to the number of column you wish.
There is no need to subset by a partial name.
Although it is not recommended, you could create a loop to do so:
for (i in seq_len(ncol(x))) {
assign(paste0("df", i), x[, i]) # paste0 builds a different name for each column
}
Although the @paulhiemstra solution avoids the loop.
So with the new information you can do as you wanted with grep, but anchor the pattern so that the ID sits at the start of the name and is followed by a non-digit character:
df2 <- x[, grep("^1[^0-9]", colnames(x)), drop = FALSE]
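To get one data frame per individual without writing a separate grep call for each ID, one option is to extract the leading digits of every column name and split the columns on that. The sketch below uses a small made-up table with individuals 1 and 10 and column names in the 1b1 style described in the question; the names x_demo and per_individual are hypothetical.

```r
# Hypothetical wide data: individuals 1 and 10, region b, 3 measures each
cols <- c(paste0(1, c("b1", "b2", "b3")),
          paste0(10, c("b1", "b2", "b3")))
x_demo <- as.data.frame(matrix(seq_len(12), nrow = 2, ncol = 6))
colnames(x_demo) <- cols

# Leading run of digits = individual ID ("1b1" -> "1", "10b2" -> "10")
id <- sub("^([0-9]+).*$", "\\1", colnames(x_demo))

# split.default() treats the data frame as a list of columns, so this
# yields a named list with one data frame per individual
per_individual <- split.default(x_demo, factor(id, levels = unique(id)))

names(per_individual)
colnames(per_individual[["10"]])
```

Keeping the pieces in a named list rather than in separate variables means they can be processed with lapply, and a single individual is still available as per_individual[["1"]].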
