I am having trouble figuring out how to trim the end off of a string in a data frame.
I want to trim everything to a "base" name, after #s and letters, a period, then a number. My goal is trim everything in my dataframe to this "base" name, then sum the values with the same "base." I was thinking it would be possible to trim, then merge and sum the values.
ie/
Gene_name Values
B0222.5 4
B0222.6 16
B0228.7.1 2
B0228.7.2 12
B0350.2h.1 30
B0350.2h.2 2
B0350.2i 15
2RSSE.1a 3
2RSSE.1b 10
R02F11.11 4
to
Gene_name Values
B0222.5 4
B0222.6 16
B0228.7 14
B0350.2 47
2RSSE.1 13
R02F11.11 4
Thank you for any help!
Here is a solution using the dplyr and stringr packages. You first create a column with your extracted base pattern, and then use the group_by and summarise functions from dplyr to get the sum of values for each name:
library(dplyr)
library(stringr)
df2 = df %>% mutate(Gene_name = str_extract(Gene_name,"[[:alnum:]]+\\.\\d+")) %>%
group_by(Gene_name) %>% summarise(Values = sum(Values))
Gene_name Values
<chr> <int>
1 2RSSE.1 13
2 B0222.5 4
3 B0222.6 16
4 B0228.7 14
5 B0350.2 47
6 R02F11.11 4
As someone has also suggested, I would get gene names first, and then search for them in the original data.frame
df <- data.frame(Gene_name = c("B0222.5", "B0222.6", "B0228.7.1", "B0228.7.2", "B0350.2h.1", "B0350.2h.2", "B0350.2i", "2RSSE.1a", "2RSSE.1b", "R02F11.11"),
Values = c(4, 16, 2, 12, 30, 2, 15, 3, 10, 4),
stringsAsFactors = F)
pat <- "(^[[:alnum:]]+\\.[[:digit:]]*)"
cap.pos <- regexpr(pat, df$Gene_name)
cap.gene <- unique(substr(df$Gene_name, cap.pos, (cap.pos + attributes(cap.pos)$match.length - 1)))
do.call(rbind, lapply(cap.gene, (function(nm){
sumval <- sum(df[grepl(nm, df$Gene_name, fixed = T),]$Values, na.rm = T)
data.frame(Gene_name = nm, Value = sumval)
})))
The result tracks with your request
Gene_name Value
1 B0222.5 4
2 B0222.6 16
3 B0228.7 14
4 B0350.2 47
5 2RSSE.1 13
6 R02F11.11 4
You can also create the Gene_name as a factor and change the levels.
# coerce the vector as a factor
Gene_name <- as.factor(Gene_name)
# view the levels
levels(Gene_name)
# to make B0228.7.1 into B0228.7
levels(Gene_name)[ *index for B0228.7.1* ] <- B0228.7
You can repeat this for the levels that need to change and then the values will automatically sum together and rows with similar levels will be treated as the same category.
Related
I have a list of column names that look like this...
colnames(dat)
1 subject
2 e.type
3 group
4 boxnum
5 edate
6 file.name
7 fr
8 active
9 inactive
10 reward
11 latency.to.first.active
12 latency.to.first.inactive
13 act0.600
14 act600.1200
15 act1200.1800
16 act1800.2400
17 act2400.3000
18 act3000.3600
19 inact0.600
20 inact600.1200
21 inact1200.1800
22 inact1800.2400
23 inact2400.3000
24 inact3000.3600
25 rew0.600
26 rew600.1200
27 rew1200.1800
28 rew1800.2400
29 rew2400.3000
30 rew3000.3600
I want to get the row sum for the columns that list act#, inact#, and reward#
This works...
for (row in 1:nrow(dat)) {
dat[row, "active"] = rowSums(dat[row,c(13:18)])
dat[row, "inactive"] = rowSums(dat[row,c(19:24)])
dat[row, "reward"] = rowSums(dat[row,c(25:30)])
}
But I don't want to hard coded it since the number of columns for the 3 sections may change. How can I do this without hard coding the column indexes?
Also, for example, I tried searching for the "act" named columns but it was also including the "active" column.
sub_dat <- dat[, 13:30]
result <- sapply(split.default(sub_dat, substr(names(sub_dat), 1, 3)), rowSums)
dat[, c('active', 'inactive', 'reward')] <- result
Easy-cheesy with witch select & matches from the tidyverse.
library(tidyverse)
data %>%
mutate(
sum_act = rowSums(select(., matches("act[0-9]"))),
sum_inact = rowSums(select(., matches("inact[0-9]"))),
sum_rew = rowSums(select(., matches("rew[0-9]")))
)
I made an example how it could be done:
t <- data.frame(c(1,2,3),c("a","b","c"))
colnames(t) <- c("num","char")
#with function append() you make a list of rows that fulfill your logical argument
whichRows <- append(which(t$char == "a"),which(t$char == "b"))
sum(t$num[whichRows])
or if I misunderstood you and you want to sum for every column separately then:
sum(t$num[which(t$char == "a")])
sum(t$num[which(t$char == "b")])
I am working with a data frame that has two columns, name and spouse. I am trying to calculate the interracial marriage frequency, but I need to remove repeated registers.
When I have the name of a creature I need to keep this register in the data frame but remove the register where that creature name is the spouse name. I have this following data sample:
name spouse
15 Finarfin Eärwen
6 Tar-Vanimeldë Herucalmo
17 Faramir owyn
8 Tar-Meneldur Almarian
14 Finduilas of Dol Amroth Denethor II
12 Finwë MÃriel Serindë then ,Indis
9 Tar-Ancalimë Hallacar
7 Tar-MÃriel Ar-Pharazôn
5 Tarannon Falastur Berúthiel
21 Rufus Burrows Asphodel Brandybuck
2 Angrod Eldalótë
4 Ar-Gimilzôr Inzilbêth
19 Lobelia Sackville-Baggins Otho Sackville-Baggins
25 Mrs. Proudfoot Odo Proudfoot
22 Rudigar Bolger Belba Baggins
24 Odo Proudfoot Mrs. Proudfoot
3 Ar-Pharazôn Tar-MÃriel
13 Fingolfin Anairë
18 Silmariën Elatan
23 Rowan Greenhand Belba Baggins
20 RÃan Huor
1 Adanel Belemir
16 Fastolph Bolger Pansy Baggins
10 Morwen Steelsheen Thengel
11 Tar-Aldarion Erendis
25 Belemir Adanel
For example, I ran the code and in line 1 it caught name Adanel and got Belemir as its spouse, so I need to keep line 1, but remove line 25, because with that I will avoid duplicated data.
I have tried this following code:
interacialMariage <-data %>% filter(spouse != name) %>% select(name, spouse)
How can I get the same spouse name register out of the data frame registers?
P.S.: I would need it to avoid case sensitive (Belemir == belemir) so that I don't have problems in the future.
Thanks!
You could set up another vector with the row-wise alphabetically sorted names, and deduplicate using that...
sorted <- sapply(1:nrow(data),
function(i) paste(sort(c(trimws(tolower(data$name[i])),
trimws(tolower(data$spouse[i])))),
collapse=" "))
irM <- data[!duplicated(sorted),]
The trimws strips off any leading or trailing spaces before sorting and pasting, and tolower converts everything to lower case.
My attempt with tidyverse:
library(tidyverse)
dat %>%
mutate(id = 1:n()) %>% # add id to label the pairs
gather('key', 'name', -id) %>% # transform: key (name | spouse), name, id
group_by(name) %>% # group by unique name to find duplicated
top_n(-1, wt = id) %>% # if name > 1, take row with the lower id
spread(key, name) %>% # spread data to original format
select(-id) # remove id's
# # A tibble: 3 x 2
# name spouse
# <chr> <chr>
# 1 Adanel Belemir
# 2 Fastolph Bolger Pansy Baggins
# 3 Morwen Steelsheen Thengel
Data:
dat <- data.frame(
name = c("Adanel", "Fastolph Bolger", "Morwen Steelsheen", "Belemir"),
spouse = c("Belemir", "Pansy Baggins", "Thengel", "Adanel" ),
stringsAsFactors = F
)
In "Zero frequent items" when using the eclat to mine frequent itemsets, the OP is interested in the groupings/clusterings based on how frequent they are ordered together. This grouping can be inspected by the arules::inspect function.
library(arules)
dataset <- read.transactions("8GbjnHK2.txt", sep = ";", rm.duplicates = TRUE)
f <- eclat(dataset,
parameter = list(
supp = 0.001,
maxlen = 17,
tidLists = TRUE))
inspect(head(sort(f, by = "support"), 10))
The data set can be downloaded from https://pastebin.com/8GbjnHK2.
However, the output cannot be easily saved to another object as a data frame.
out <- inspect(f)
So how do we capture the output of inspect(f) for use as data frame?
We can use the methods labels to extract the associations/groupings and quality to extract the quality measures (support and count). We can then use cbind to store these into a data frame.
out <- cbind(labels = labels(f), quality(f))
head(out)
# labels support count
# 1 {3031093,3059242} 0.001010 16
# 2 {3031096,3059242} 0.001073 17
# 3 {3060614,3060615} 0.001010 16
# 4 {3022540,3072091} 0.001010 16
# 5 {3061698,3061700} 0.001073 17
# 6 {3031087,3059242} 0.002778 44
Coercing the itemsets to a data.frame also creates the required output.
> head(as(f, "data.frame"))
items support count
1 {3031093,3059242} 0.001010101 16
2 {3031096,3059242} 0.001073232 17
3 {3060614,3060615} 0.001010101 16
4 {3022540,3072091} 0.001010101 16
5 {3061698,3061700} 0.001073232 17
6 {3031087,3059242} 0.002777778 44
I have a df like this:
> df<-data.frame(Client.code =
c(100451,100451,100523,100523,100523,100525),dayref = c(24,30,15,13,17,5))
> df
Client.code dayref
1 100451 24
2 100451 30
3 100523 15
4 100523 13
5 100523 17
6 100525 5
It is a one-year distribution of payments period from issue.
Usign this data above and given a df2 like this:
Client.Code Days
1 100451 16
1 100523 16
1 100460 35
As i have enough data for a reasonable quantile prob. calculations.I will like to know how to build a loop for assing to every row in this df2 of days a quantile according with the first df.
We can use data.table
library(data.table)
setDT(df)[, .(Quantile = quantile(dayref)), Client.code]
Or with tidyverse
library(dplyr)
library(tidyr)
df %>%
group_by(Client.code) %>%
summarise(Quantile = list(quantile(dayref))) %>%
unnest
tapply(df$dayref, df$Client.code, quantile)
You can specify specific percentiles by adding a vector of them
tapply(df$dayref, df$Client.code, quantile, 1:19/20)
You may need to formulate like this
tapply(df$dayref, df$Client.code, quantile, probs = 1:19/20)
And you can add na.rm = TRUE as another argument if you might have NAs
I would like to select n rows from y that matches the strings in x where n= length of x, but the same row in y should not be selected more than one time. The rows should be selected randomly from y.
> head(x$Age_Yrs_Sex)
[1] "65_0" "72_1" "82_0" "52_0" "81_0" "58_0"
> head(y,20)
ID Age_Yrs_Sex
1 10678800017 30_0
2 106788000024 63_0
4 10678800048 59_0
5 1067880000055 68_1
7 1067800079 59_0
8 10678800086 36_1
10 10678000109 39_0
12 1067880123 42_0
13 10678800130 45_1
14 106788000147 49_1
15 1067880000154 24_0
16 106780000161 44_0
17 1067880000178 43_1
19 106780000192 79_0
20 106880000208 22_0
22 107880000222 89_0
23 167880000239 28_0
24 106788000246 44_1
25 106780000253 76_0
26 106780000260 45_1
Assuming that the entries in x are always less than those in y for a given match, this should work (using dplyr). Generating usable example data here:
y <-
data.frame(
ID = 1:1000
, Age_Yrs_Sex = paste(sample(1:10, 1000, TRUE)
, 0:1
, sep = "_")
)
x <-
data.frame(
Age_Yrs_Sex = paste(c(1,1:4), 0, sep = "_")
)
Count the number of matches for each thing (can be skipped if it is always 1)
matches <-
table(x$Age_Yrs_Sex)
Filter the table to just the matches, then select from each group the number of matches found in the table above (using slice, randomly sample row numbers from 1 to the number of rows, returning the number of results of that match from the table).
y %>%
filter(Age_Yrs_Sex %in% names(matches)) %>%
group_by(Age_Yrs_Sex) %>%
slice(sample(1:n(), matches[as.character(Age_Yrs_Sex[1])]))
Gives (for example):
ID Age_Yrs_Sex
<int> <fctr>
1 95 1_0
2 777 1_0
3 151 2_0
4 951 3_0
5 403 4_0