I have a huge dataset with over 3 million observations and 108 columns. There are 14 variables I'm interested in: DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF (they're in different positions). These variables contain ICD-10 codes.
I'm interested in counting how many times certain ICD-10 codes appear and then sorting them from highest to lowest in a dataframe. Here's some reproducible data:
data <- data.frame(DIAG_PRINC = c("O200", "O200", "O230"),
                   DIAG_SECUN = c("O555", "O530", "O890"),
                   DIAGSEC1 = c("O766", "O876", "O899"),
                   DIAGSEC2 = c("O200", "I520", "O200"),
                   DIAGSEC3 = c("O233", "O200", "O620"),
                   DIAGSEC4 = c("O060", "O061", "O622"),
                   DIAGSEC5 = c("O540", "O123", "O344"),
                   DIAGSEC6 = c("O876", "Y321", "S333"),
                   DIAGSEC7 = c("O450", "X900", "O541"),
                   DIAGSEC8 = c("O222", "O111", "O123"),
                   DIAGSEC9 = c("O987", "O123", "O622"),
                   CID_MORTE = c("O066", "O699", "O555"),
                   CID_ASSO = c("O600", "O060", "O068"),
                   CID_NOTIF = c("O069", "O066", "O065"))
I also have a list of ICD-10 codes that I'm interested in counting.
GRUPO1 <- c("O00", "O000", "O001", "O002", "O008", "O009",
            "O01", "O010", "O011", "O019",
            "O02", "O020", "O021", "O028", "O029",
            "O03", "O030", "O031", "O032", "O033", "O034", "O035", "O036", "O037", "O038", "O039",
            "O04", "O040", "O041", "O042", "O043", "O044", "O045", "O046", "O047", "O048", "O049",
            "O05", "O050", "O051", "O052", "O053", "O054", "O055", "O056", "O057", "O058", "O059",
            "O06", "O060", "O061", "O062", "O063", "O064", "O065", "O066", "O067", "O068", "O069",
            "O07", "O070", "O071", "O072", "O073", "O074", "O075", "O076", "O077", "O078", "O079",
            "O08", "O080", "O081", "O082", "O083", "O084", "O085", "O086", "O087", "O088", "O089")
What I need is a dataframe counting how many times the ICD-10 codes from "GRUPO1" appear in any row/column of the DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF variables. For example, in my reproducible data the ICD-10 code "O066" appears twice.
Thank you in advance!
We can unlist the data into a vector, use %in% to subset the values that are in 'GRUPO1', and get the frequency count with table() in base R:
v1 <- unlist(data)
out <- table(v1[v1 %in% GRUPO1])
out[order(-out)]
O060 O066 O061 O065 O068 O069
2 2 1 1 1 1
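If the result should be a data frame sorted from highest to lowest (as the question asks) rather than a named table, the same counts can be converted; a minimal sketch, assuming the `data` and `GRUPO1` objects defined in the question:

```r
# count matches across every column, then convert the table to a sorted data frame
v1 <- unlist(data)
out <- table(v1[v1 %in% GRUPO1])
res <- data.frame(code = names(out), n = as.vector(out))
res <- res[order(-res$n), ]
res
```

With the reproducible data above, the first rows are O060 and O066, each with n = 2.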
Here is a tidyverse solution using tidyr and dplyr:
library(tidyverse)
pivot_longer(data, everything()) %>%
  filter(value %in% GRUPO1) %>%
  count(value)
Output
value n
<chr> <int>
1 O060 2
2 O061 1
3 O065 1
4 O066 2
5 O068 1
6 O069 1
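Since the question also asks for the counts ordered from highest to lowest, `count()`'s `sort` argument can do the ordering in the same step; same pipeline, assuming the question's `data` and `GRUPO1`:

```r
library(tidyverse)

# sort = TRUE returns the counts in descending order
pivot_longer(data, everything()) %>%
  filter(value %in% GRUPO1) %>%
  count(value, sort = TRUE)
```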
I have a data frame with mainly categorical variables. I want to count the combinations of values found in three of these categorical columns.
The data in the columns looks like this:
number_arms <- c("6","8","12")
arrangements <- c("single", "paired", "ornament")
approx_position <- c("top", "middle", "bottom")
rg2 <- data.frame(number_arms, arrangements, approx_position)
I read in another post that the following code can be used when comparing two columns:
library(dplyr)
library(stringr)
rg2 %>%
  count(combination = str_c(pmin(number_arms, arrangements), ' - ',
                            pmax(number_arms, arrangements)), name = "count")
This is the result:
combination count
12 - single 1
16 - single 1
4 - paired 3
4 - single 4
5 - paired 4
5 - single 2
6 - ornament 1
6 - paired 81
However, the code does not give me the wanted results if I add the third column, like this:
rg2 %>%
  count(combination = str_c(pmin(number_arms, arrangements, approx_position), ' - ',
                            pmax(number_arms, arrangements, approx_position)), name = "count")
It still runs without error, but I get wrong results.
Do I need different code to count the combinations of three variables?
pmin() and pmax() only return the elementwise smallest and largest of the three columns, so the middle value is dropped, which is why the three-column version miscounts. If you're looking for the count of each combination of the variables, excluding 0, you can do:
subset(data.frame(table(rg2)), Freq > 0)
number_arms arrangements approx_position Freq
1 12 ornament bottom 1
15 8 paired middle 1
26 6 single top 1
or combined:
subset(data.frame(table(rg2)), Freq > 0) |>
tidyr::unite("combn", -Freq, sep = " - ")
combn Freq
1 12 - ornament - bottom 1
15 8 - paired - middle 1
26 6 - single - top 1
data
number_arms <- c("6","8","12")
arrangements <- c("single", "paired", "ornament")
approx_position <- c("top", "middle", "bottom")
rg2 <- data.frame(number_arms, arrangements, approx_position)
Tidyverse option (updated to remove group_by):
library(dplyr)
rg2 %>%
count(number_arms, arrangements, approx_position)
Result:
number_arms arrangements approx_position n
<chr> <chr> <chr> <int>
1 12 ornament bottom 1
2 6 single top 1
3 8 paired middle 1
You can try dplyr::count() + paste():
library(dplyr)
rg2 %>%
  count(combination = paste(number_arms, arrangements, approx_position, sep = " - "),
        name = "count")
# combination count
# 1 12 - ornament - bottom 1
# 2 6 - single - top 1
# 3 8 - paired - middle 1
My data frame looks like this:
person.  id98  id100  id102  educ98  educ100  educ102  pid98  pid100  pid102
1        3     0      0      2       4        5        T      F       F
2        ...
I hope to transform it like this:
person.  year  id  educ  pid
1        98
1        100
1        102
In Stata, I know that the "reshape" command can automatically identify the year from those variables' names. In R, I don't know how to deal with that: I want to take the number trailing each column name and bundle the columns based on that number.
If you would like to use reshape, maybe the code below could help
reshape(
  setNames(df, gsub("(\\d+)", "\\.\\1", names(df))),
  # the gsub is needed because `reshape` expects a period as a separator
  direction = "long",
  varying = -1
)
which gives
person. time id educ pid
1.98 1 98 1 2 TRUE
1.100 1 100 1 4 FALSE
1.102 1 102 1 5 FALSE
Data
> dput(df)
structure(list(person. = 1, id98 = 3, id100 = 0, id102 = 0, educ98 = 2,
educ100 = 4, educ102 = 5, pid98 = TRUE, pid100 = FALSE, pid102 = FALSE), class = "data.frame", row.names = c(NA,
-1L))
You can use pivot_longer from tidyr, using the data from @ThomasIsCoding:
tidyr::pivot_longer(df,
                    cols = -person.,
                    names_to = c('.value', 'year'),
                    names_pattern = '([a-z]+)(\\d+)')
# person. year id educ pid
# <dbl> <chr> <dbl> <dbl> <lgl>
#1 1 98 3 2 TRUE
#2 1 100 0 4 FALSE
#3 1 102 0 5 FALSE
Using the data.table package this is fairly easy.
As a note, I think this will only work if the columns are ordered consistently, i.e. id90 id100 id102 pid90 pid100 pid102, etc.
Edit
The aforementioned issue has been solved in this new code.
# load data.table, installing it first if needed
if (!require(data.table)) {
  install.packages("data.table")
  library(data.table)
}
df <- data.frame(person = 1:5,
                 id90 = rnorm(5), id91 = rnorm(5), id92 = rnorm(5),
                 pid90 = rnorm(5), pid91 = rnorm(5), pid92 = rnorm(5),
                 educ90 = rnorm(5), educ91 = rnorm(5), educ92 = rnorm(5))
# turn the data.frame into a data.table by reference
setDT(df)
cols <- colnames(df)[order(colnames(df))]
# df[, ..cols] reorders the columns alphabetically
# to evade the problem stated above.
# id.vars is the id variable;
# using patterns() with measure.vars bundles all the columns
# that match each regex pattern into the same output column
dt <- melt(df[, ..cols], id.vars = "person",
           measure.vars = patterns(id = "^id", educ = "^educ", pid = "^pid"))
# getting the years
years = gsub('^id', '', colnames(df)[grepl('^id', colnames(df))])
# changing the years
dt[, c("year","variable"):=list(years[variable], NULL)]
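Newer data.table versions (1.14.2 or later, if I recall correctly) also ship `measure()`, which parses the year straight out of the column names, so the alphabetical-reordering workaround above is not needed; a sketch, assuming the same `df` as in the code above:

```r
library(data.table)
setDT(df)

# measure() splits each column name by the regex groups:
# group 1 (id/pid/educ) names the value column via the value.name keyword,
# group 2 becomes a "year" column holding "90", "91", "92"
long <- melt(df, id.vars = "person",
             measure.vars = measure(value.name, year, pattern = "^(id|pid|educ)(\\d+)$"))
```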
I am looking to create a dataframe that lists a unique ID with the movement of n different amounts across a period of m timesteps. I currently generate subsets of each timestep and then merge all these subsets with a separate dataframe that contains just the unique IDs. See below:
set.seed(129)
df1 <- data.frame(
  id = c(rep(seq(1:7), 3)),
  time = c(1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 3,3,3,3,3,3,3),
  amount1 = runif(21, 0, 50),
  amount2 = runif(21, -20, 600),
  amount3 = runif(21, -15, 200),
  amount4 = runif(21, -3, 300)
)
df2 <- data.frame(
  id = unique(df1$id)
)
sub_1 <- subset(df1, time == 1)
sub_2 <- subset(df1, time == 2)
sub_3 <- subset(df1, time == 3)
df2<-merge(df2,sub_1,by.x = "id",by.y = "id", all=TRUE)
df2<-merge(df2,sub_2,by.x = "id",by.y = "id", all=TRUE)
df2<-merge(df2,sub_3,by.x = "id",by.y = "id", all=TRUE)
#df2
id time.x amount1.x amount2.x amount3.x amount4.x time.y amount1.y amount2.y amount3.y amount4.y time amount1 amount2 amount3 amount4
1 1 1 6.558261 -17.713007 46.477430 195.061597 2 18.5453843 269.7406808 132.588713 80.40133 3 24.943217 488.1025 103.473479 198.51302
2 2 1 15.736044 230.018563 72.604346 -2.513162 2 48.8537058 356.5593748 161.239261 246.25985 3 35.559262 406.4749 66.278064 30.11592
3 3 1 8.057720 386.814867 101.997370 152.269564 2 0.7334493 0.7842648 66.603965 156.12478 3 42.170220 450.0306 195.872986 109.73098
4 4 1 15.575282 527.033563 37.403278 197.529341 2 37.8372445 370.0410836 6.074847 273.46715 3 20.302206 290.0026 -2.101649 112.88488
5 5 1 4.230635 427.294382 112.771237 199.401096 2 15.3735066 376.8945806 104.382371 224.09730 3 8.050933 291.6123 53.660734 270.37200
6 6 1 29.087870 9.330858 129.400932 70.801129 2 38.9966662 421.9258798 -3.891286 290.59259 3 17.919554 581.1735 137.100314 129.78561
7 7 1 4.380303 463.658580 4.120219 56.527016 2 6.0582455 484.4981686 67.820164 72.05615 3 43.556746 170.0745 41.134708 247.99512
I have a major issue with this: as the values of m and n increase, this method becomes ugly and long. Is there a cleaner way to do it? Maybe a one-liner, so I don't have to make, say, 15 subsets when m = 15.
Thanks
You just need your original df1 dataset and do this:
library(tidyverse)
df1 %>%
  group_split(time) %>%          # create your subsets and store them as a list of data frames
  reduce(left_join, by = "id")   # sequentially join those subsets
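Alternatively, since the end result is just one row per `id` with each time step's amounts spread across columns, `tidyr::pivot_wider()` reaches it in a single call and avoids the `.x`/`.y` suffixes that repeated joins produce; a sketch using the question's `df1`:

```r
library(tidyr)

# one row per id; each amount column is spread out by time step,
# yielding columns named amount1_1, amount1_2, ..., amount4_3
pivot_wider(df1, id_cols = id,
            names_from = time,
            values_from = starts_with("amount"))
```

This also scales automatically as m and n grow, since nothing in the call depends on how many time steps or amount columns there are.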
I am having trouble figuring out how to trim the end off of strings in a data frame.
I want to trim each name down to a "base": numbers and letters, then a period, then a number. My goal is to trim everything in my dataframe to this base name and then sum the values that share the same base. I was thinking it would be possible to trim, then merge and sum the values.
i.e.
Gene_name Values
B0222.5 4
B0222.6 16
B0228.7.1 2
B0228.7.2 12
B0350.2h.1 30
B0350.2h.2 2
B0350.2i 15
2RSSE.1a 3
2RSSE.1b 10
R02F11.11 4
to
Gene_name Values
B0222.5 4
B0222.6 16
B0228.7 14
B0350.2 47
2RSSE.1 13
R02F11.11 4
Thank you for any help!
Here is a solution using the dplyr and stringr packages. You first create a column with your extracted base pattern, and then use the group_by and summarise functions from dplyr to get the sum of values for each name:
library(dplyr)
library(stringr)
df2 <- df %>%
  mutate(Gene_name = str_extract(Gene_name, "[[:alnum:]]+\\.\\d+")) %>%
  group_by(Gene_name) %>%
  summarise(Values = sum(Values))
Gene_name Values
<chr> <int>
1 2RSSE.1 13
2 B0222.5 4
3 B0222.6 16
4 B0228.7 14
5 B0350.2 47
6 R02F11.11 4
As someone else has also suggested, I would get the gene names first, and then search for them in the original data.frame:
df <- data.frame(Gene_name = c("B0222.5", "B0222.6", "B0228.7.1", "B0228.7.2",
                               "B0350.2h.1", "B0350.2h.2", "B0350.2i",
                               "2RSSE.1a", "2RSSE.1b", "R02F11.11"),
                 Values = c(4, 16, 2, 12, 30, 2, 15, 3, 10, 4),
                 stringsAsFactors = FALSE)
pat <- "(^[[:alnum:]]+\\.[[:digit:]]*)"
cap.pos <- regexpr(pat, df$Gene_name)
cap.gene <- unique(substr(df$Gene_name, cap.pos,
                          cap.pos + attr(cap.pos, "match.length") - 1))
do.call(rbind, lapply(cap.gene, function(nm) {
  sumval <- sum(df[grepl(nm, df$Gene_name, fixed = TRUE), ]$Values, na.rm = TRUE)
  data.frame(Gene_name = nm, Value = sumval)
}))
The result tracks with your request
Gene_name Value
1 B0222.5 4
2 B0222.6 16
3 B0228.7 14
4 B0350.2 47
5 2RSSE.1 13
6 R02F11.11 4
You can also turn Gene_name into a factor and collapse its levels.
# coerce the vector to a factor
df$Gene_name <- as.factor(df$Gene_name)
# view the levels
levels(df$Gene_name)
# to turn B0228.7.1 into B0228.7, rename that level
levels(df$Gene_name)[levels(df$Gene_name) == "B0228.7.1"] <- "B0228.7"
Repeat this for each level that needs to change; rows sharing a level are then treated as the same category, so the values can be summed together (for example with aggregate(Values ~ Gene_name, df, sum)).
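Putting the renaming and the summing steps together, a minimal end-to-end sketch using the question's data (base R only; the regex is one plausible way to capture the "letters/digits, period, digits" base):

```r
df <- data.frame(Gene_name = c("B0222.5", "B0222.6", "B0228.7.1", "B0228.7.2",
                               "B0350.2h.1", "B0350.2h.2", "B0350.2i",
                               "2RSSE.1a", "2RSSE.1b", "R02F11.11"),
                 Values = c(4, 16, 2, 12, 30, 2, 15, 3, 10, 4))

# collapse each name to its base: alphanumerics, a period, then digits
base <- regmatches(df$Gene_name, regexpr("^[[:alnum:]]+\\.[0-9]+", df$Gene_name))
# sum the values that share the same base
aggregate(Values ~ Gene_name, transform(df, Gene_name = base), sum)
```

This reproduces the requested output, e.g. B0350.2 sums to 47 and 2RSSE.1 to 13.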