I have a huge dataset with over 3 million observations and 108 columns. There are 14 variables I'm interested in: DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF (they're in different positions). These variables contain ICD-10 codes.
I'm interested in counting how many times certain ICD-10 codes appear and then sorting them from highest to lowest in a dataframe. Here's some reproducible data:
data <- data.frame(DIAG_PRINC = c("O200", "O200", "O230"),
                   DIAG_SECUN = c("O555", "O530", "O890"),
                   DIAGSEC1 = c("O766", "O876", "O899"),
                   DIAGSEC2 = c("O200", "I520", "O200"),
                   DIAGSEC3 = c("O233", "O200", "O620"),
                   DIAGSEC4 = c("O060", "O061", "O622"),
                   DIAGSEC5 = c("O540", "O123", "O344"),
                   DIAGSEC6 = c("O876", "Y321", "S333"),
                   DIAGSEC7 = c("O450", "X900", "O541"),
                   DIAGSEC8 = c("O222", "O111", "O123"),
                   DIAGSEC9 = c("O987", "O123", "O622"),
                   CID_MORTE = c("O066", "O699", "O555"),
                   CID_ASSO = c("O600", "O060", "O068"),
                   CID_NOTIF = c("O069", "O066", "O065"))
I also have a list of ICD-10 codes that I'm interested in counting.
GRUPO1 <- c("O00", "O000", "O001", "O002", "O008", "O009",
            "O01", "O010", "O011", "O019",
            "O02", "O020", "O021", "O028", "O029",
            "O03", "O030", "O031", "O032", "O033", "O034", "O035", "O036", "O037", "O038", "O039",
            "O04", "O040", "O041", "O042", "O043", "O044", "O045", "O046", "O047", "O048", "O049",
            "O05", "O050", "O051", "O052", "O053", "O054", "O055", "O056", "O057", "O058", "O059",
            "O06", "O060", "O061", "O062", "O063", "O064", "O065", "O066", "O067", "O068", "O069",
            "O07", "O070", "O071", "O072", "O073", "O074", "O075", "O076", "O077", "O078", "O079",
            "O08", "O080", "O081", "O082", "O083", "O084", "O085", "O086", "O087", "O088", "O089")
What I need is a dataframe counting how many times the ICD-10 codes from "GRUPO1" appear in any row/column of the DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF variables. For example, in my reproducible data the ICD-10 code "O066" appears twice.
Thank you in advance!
We can unlist the data into a vector, use %in% to subset the values that are in 'GRUPO1', and get the frequency count with table() in base R:
v1 <- unlist(data)
out <- table(v1[v1 %in% GRUPO1])
out[order(-out)]
O060 O066 O061 O065 O068 O069
2 2 1 1 1 1
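If the result should be a data frame sorted from highest to lowest (as the question asks) rather than a named table, the same counts can be converted; a minimal sketch, assuming the `data` and `GRUPO1` objects defined in the question:

```r
# count matches across every column, then convert the table to a sorted data frame
v1 <- unlist(data)
out <- table(v1[v1 %in% GRUPO1])
res <- data.frame(code = names(out), n = as.vector(out))
res <- res[order(-res$n), ]
res
```

With the reproducible data above, the first rows are O060 and O066, each with n = 2.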
Here is a tidyverse solution using tidyr and dplyr:
library(tidyverse)
pivot_longer(data, everything()) %>%
  filter(value %in% GRUPO1) %>%
  count(value)
Output
value n
<chr> <int>
1 O060 2
2 O061 1
3 O065 1
4 O066 2
5 O068 1
6 O069 1
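Since the question also asks for the counts ordered from highest to lowest, `count()`'s `sort` argument can do the ordering in the same step; same pipeline, assuming the question's `data` and `GRUPO1`:

```r
library(tidyverse)

# sort = TRUE returns the counts in descending order
pivot_longer(data, everything()) %>%
  filter(value %in% GRUPO1) %>%
  count(value, sort = TRUE)
```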
I have a data frame with mainly categorical variables. I want to count the combinations of values found in three of these categorical columns.
The data in the columns looks like this:
number_arms <- c("6","8","12")
arrangements <- c("single", "paired", "ornament")
approx_position <- c("top", "middle", "bottom")
rg2 <- data.frame(number_arms, arrangements, approx_position)
I read in another post that the following code can be used when comparing two columns:
library(dplyr)
library(stringr)
rg2 %>%
  count(combination = str_c(pmin(number_arms, arrangements), ' - ',
                            pmax(number_arms, arrangements)), name = "count")
This is the result:
combination count
12 - single 1
16 - single 1
4 - paired 3
4 - single 4
5 - paired 4
5 - single 2
6 - ornament 1
6 - paired 81
However, the code does not give me the wanted results if I add the third column, like this:
rg2 %>%
  count(combination = str_c(pmin(number_arms, arrangements, approx_position), ' - ',
                            pmax(number_arms, arrangements, approx_position)), name = "count")
It still runs without error, but I get wrong results.
Do I need different code to count the combinations of three variables?
pmin() and pmax() only return the elementwise smallest and largest of the three columns, so the middle value is dropped, which is why the three-column version miscounts. If you're looking for the count of each combination of the variables, excluding 0, you can do:
subset(data.frame(table(rg2)), Freq > 0)
number_arms arrangements approx_position Freq
1 12 ornament bottom 1
15 8 paired middle 1
26 6 single top 1
or combined:
subset(data.frame(table(rg2)), Freq > 0) |>
tidyr::unite("combn", -Freq, sep = " - ")
combn Freq
1 12 - ornament - bottom 1
15 8 - paired - middle 1
26 6 - single - top 1
data
number_arms <- c("6","8","12")
arrangements <- c("single", "paired", "ornament")
approx_position <- c("top", "middle", "bottom")
rg2 <- data.frame(number_arms, arrangements, approx_position)
Tidyverse option (updated to remove group_by):
library(dplyr)
rg2 %>%
count(number_arms, arrangements, approx_position)
Result:
number_arms arrangements approx_position n
<chr> <chr> <chr> <int>
1 12 ornament bottom 1
2 6 single top 1
3 8 paired middle 1
You can try dplyr::count() + paste():
library(dplyr)
rg2 %>%
  count(combination = paste(number_arms, arrangements, approx_position, sep = " - "),
        name = "count")
# combination count
# 1 12 - ornament - bottom 1
# 2 6 - single - top 1
# 3 8 - paired - middle 1
My data frame looks like this:
person.  id98  id100  id102  educ98  educ100  educ102  pid98  pid100  pid102
1        3     0      0      2       4        5        T      F       F
2        ...
I hope to transform it like this:
person.  year  id  educ  pid
1        98
1        100
1        102
In Stata, I know that the "reshape" command can automatically identify the year from those variables' names. In R, I don't know how to deal with that: I want to take the number trailing each column name and bundle the columns based on that number.
If you would like to use reshape, maybe the code below could help
reshape(
  setNames(df, gsub("(\\d+)", "\\.\\1", names(df))),
  # the gsub is needed because `reshape` expects a period as a separator
  direction = "long",
  varying = -1
)
which gives
person. time id educ pid
1.98 1 98 1 2 TRUE
1.100 1 100 1 4 FALSE
1.102 1 102 1 5 FALSE
Data
> dput(df)
structure(list(person. = 1, id98 = 3, id100 = 0, id102 = 0, educ98 = 2,
educ100 = 4, educ102 = 5, pid98 = TRUE, pid100 = FALSE, pid102 = FALSE), class = "data.frame", row.names = c(NA,
-1L))
You can use pivot_longer from tidyr, using the data from @ThomasIsCoding:
tidyr::pivot_longer(df,
                    cols = -person.,
                    names_to = c('.value', 'year'),
                    names_pattern = '([a-z]+)(\\d+)')
# person. year id educ pid
# <dbl> <chr> <dbl> <dbl> <lgl>
#1 1 98 3 2 TRUE
#2 1 100 0 4 FALSE
#3 1 102 0 5 FALSE
Using the data.table package this is fairly easy.
As a note, I think this will only work if the columns are ordered consistently, i.e. id90 id100 id102 pid90 pid100 pid102, etc.
Edit
The aforementioned issue has been solved in this new code.
# load data.table, installing it first if needed
if (!require(data.table)) {
  install.packages("data.table")
  library(data.table)
}
df <- data.frame(person = 1:5,
                 id90 = rnorm(5), id91 = rnorm(5), id92 = rnorm(5),
                 pid90 = rnorm(5), pid91 = rnorm(5), pid92 = rnorm(5),
                 educ90 = rnorm(5), educ91 = rnorm(5), educ92 = rnorm(5))
# turn the data.frame into a data.table by reference
setDT(df)
cols <- colnames(df)[order(colnames(df))]
# df[, ..cols] reorders the columns alphabetically
# to evade the problem stated above.
# id.vars is the id variable;
# using patterns() with measure.vars bundles all the columns
# that match each regex pattern into the same output column
dt <- melt(df[, ..cols], id.vars = "person",
           measure.vars = patterns(id = "^id", educ = "^educ", pid = "^pid"))
# getting the years
years = gsub('^id', '', colnames(df)[grepl('^id', colnames(df))])
# changing the years
dt[, c("year","variable"):=list(years[variable], NULL)]
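Newer data.table versions (1.14.2 or later, if I recall correctly) also ship `measure()`, which parses the year straight out of the column names, so the alphabetical-reordering workaround above is not needed; a sketch, assuming the same `df` as in the code above:

```r
library(data.table)
setDT(df)

# measure() splits each column name by the regex groups:
# group 1 (id/pid/educ) names the value column via the value.name keyword,
# group 2 becomes a "year" column holding "90", "91", "92"
long <- melt(df, id.vars = "person",
             measure.vars = measure(value.name, year, pattern = "^(id|pid|educ)(\\d+)$"))
```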
I am looking to create a dataframe that lists a unique ID with the movement of n different amounts across a period of m timesteps. I currently generate subsets of each timestep and then merge all these subsets with a separate dataframe that contains just the unique IDs. See below:
set.seed(129)
df1 <- data.frame(
  id = c(rep(seq(1:7), 3)),
  time = c(1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 3,3,3,3,3,3,3),
  amount1 = runif(21, 0, 50),
  amount2 = runif(21, -20, 600),
  amount3 = runif(21, -15, 200),
  amount4 = runif(21, -3, 300)
)
df2 <- data.frame(
  id = unique(df1$id)
)
sub_1 <- subset(df1, time == 1)
sub_2 <- subset(df1, time == 2)
sub_3 <- subset(df1, time == 3)
df2<-merge(df2,sub_1,by.x = "id",by.y = "id", all=TRUE)
df2<-merge(df2,sub_2,by.x = "id",by.y = "id", all=TRUE)
df2<-merge(df2,sub_3,by.x = "id",by.y = "id", all=TRUE)
#df2
id time.x amount1.x amount2.x amount3.x amount4.x time.y amount1.y amount2.y amount3.y amount4.y time amount1 amount2 amount3 amount4
1 1 1 6.558261 -17.713007 46.477430 195.061597 2 18.5453843 269.7406808 132.588713 80.40133 3 24.943217 488.1025 103.473479 198.51302
2 2 1 15.736044 230.018563 72.604346 -2.513162 2 48.8537058 356.5593748 161.239261 246.25985 3 35.559262 406.4749 66.278064 30.11592
3 3 1 8.057720 386.814867 101.997370 152.269564 2 0.7334493 0.7842648 66.603965 156.12478 3 42.170220 450.0306 195.872986 109.73098
4 4 1 15.575282 527.033563 37.403278 197.529341 2 37.8372445 370.0410836 6.074847 273.46715 3 20.302206 290.0026 -2.101649 112.88488
5 5 1 4.230635 427.294382 112.771237 199.401096 2 15.3735066 376.8945806 104.382371 224.09730 3 8.050933 291.6123 53.660734 270.37200
6 6 1 29.087870 9.330858 129.400932 70.801129 2 38.9966662 421.9258798 -3.891286 290.59259 3 17.919554 581.1735 137.100314 129.78561
7 7 1 4.380303 463.658580 4.120219 56.527016 2 6.0582455 484.4981686 67.820164 72.05615 3 43.556746 170.0745 41.134708 247.99512
I have a major issue with this: as the values of m and n increase, this method becomes ugly and long. Is there a cleaner way to do it? Maybe a one-liner, so I don't have to make, say, 15 subsets when m = 15.
Thanks
You just need your original df1 dataset and do this:
library(tidyverse)
df1 %>%
  group_split(time) %>%          # create your subsets and store them as a list of data frames
  reduce(left_join, by = "id")   # sequentially join those subsets
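Alternatively, since the end result is just one row per `id` with each time step's amounts spread across columns, `tidyr::pivot_wider()` reaches it in a single call and avoids the `.x`/`.y` suffixes that repeated joins produce; a sketch using the question's `df1`:

```r
library(tidyr)

# one row per id; each amount column is spread out by time step,
# yielding columns named amount1_1, amount1_2, ..., amount4_3
pivot_wider(df1, id_cols = id,
            names_from = time,
            values_from = starts_with("amount"))
```

This also scales automatically as m and n grow, since nothing in the call depends on how many time steps or amount columns there are.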
I am having trouble figuring out how to trim the end off of strings in a data frame.
I want to trim each name down to a "base": numbers and letters, then a period, then a number. My goal is to trim everything in my dataframe to this base name and then sum the values that share the same base. I was thinking it would be possible to trim, then merge and sum the values.
i.e.
Gene_name Values
B0222.5 4
B0222.6 16
B0228.7.1 2
B0228.7.2 12
B0350.2h.1 30
B0350.2h.2 2
B0350.2i 15
2RSSE.1a 3
2RSSE.1b 10
R02F11.11 4
to
Gene_name Values
B0222.5 4
B0222.6 16
B0228.7 14
B0350.2 47
2RSSE.1 13
R02F11.11 4
Thank you for any help!
Here is a solution using the dplyr and stringr packages. You first create a column with your extracted base pattern, and then use the group_by and summarise functions from dplyr to get the sum of values for each name:
library(dplyr)
library(stringr)
df2 <- df %>%
  mutate(Gene_name = str_extract(Gene_name, "[[:alnum:]]+\\.\\d+")) %>%
  group_by(Gene_name) %>%
  summarise(Values = sum(Values))
Gene_name Values
<chr> <int>
1 2RSSE.1 13
2 B0222.5 4
3 B0222.6 16
4 B0228.7 14
5 B0350.2 47
6 R02F11.11 4
As someone else has also suggested, I would get the gene names first, and then search for them in the original data.frame:
df <- data.frame(Gene_name = c("B0222.5", "B0222.6", "B0228.7.1", "B0228.7.2",
                               "B0350.2h.1", "B0350.2h.2", "B0350.2i",
                               "2RSSE.1a", "2RSSE.1b", "R02F11.11"),
                 Values = c(4, 16, 2, 12, 30, 2, 15, 3, 10, 4),
                 stringsAsFactors = FALSE)
pat <- "(^[[:alnum:]]+\\.[[:digit:]]*)"
cap.pos <- regexpr(pat, df$Gene_name)
cap.gene <- unique(substr(df$Gene_name, cap.pos,
                          cap.pos + attr(cap.pos, "match.length") - 1))
do.call(rbind, lapply(cap.gene, function(nm) {
  sumval <- sum(df[grepl(nm, df$Gene_name, fixed = TRUE), ]$Values, na.rm = TRUE)
  data.frame(Gene_name = nm, Value = sumval)
}))
The result tracks with your request
Gene_name Value
1 B0222.5 4
2 B0222.6 16
3 B0228.7 14
4 B0350.2 47
5 2RSSE.1 13
6 R02F11.11 4
You can also turn Gene_name into a factor and collapse its levels.
# coerce the vector to a factor
df$Gene_name <- as.factor(df$Gene_name)
# view the levels
levels(df$Gene_name)
# to turn B0228.7.1 into B0228.7, rename that level
levels(df$Gene_name)[levels(df$Gene_name) == "B0228.7.1"] <- "B0228.7"
Repeat this for each level that needs to change; rows sharing a level are then treated as the same category, so the values can be summed together (for example with aggregate(Values ~ Gene_name, df, sum)).
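Putting the renaming and the summing steps together, a minimal end-to-end sketch using the question's data (base R only; the regex is one plausible way to capture the "letters/digits, period, digits" base):

```r
df <- data.frame(Gene_name = c("B0222.5", "B0222.6", "B0228.7.1", "B0228.7.2",
                               "B0350.2h.1", "B0350.2h.2", "B0350.2i",
                               "2RSSE.1a", "2RSSE.1b", "R02F11.11"),
                 Values = c(4, 16, 2, 12, 30, 2, 15, 3, 10, 4))

# collapse each name to its base: alphanumerics, a period, then digits
base <- regmatches(df$Gene_name, regexpr("^[[:alnum:]]+\\.[0-9]+", df$Gene_name))
# sum the values that share the same base
aggregate(Values ~ Gene_name, transform(df, Gene_name = base), sum)
```

This reproduces the requested output, e.g. B0350.2 sums to 47 and 2RSSE.1 to 13.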