I have a huge dataset with over 3 million obs and 108 columns. There are 14 variables I'm interested in: DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF (they're in different positions). These variables contain ICD-10 codes.
I'm interested in counting how many times certain ICD-10 codes appear and then sorting the counts from highest to lowest in a data frame. Here's some reproducible data:
data <- data.frame(DIAG_PRINC = c("O200", "O200", "O230"),
DIAG_SECUN = c("O555", "O530", "O890"),
DIAGSEC1 = c("O766", "O876", "O899"),
DIAGSEC2 = c("O200", "I520", "O200"),
DIAGSEC3 = c("O233", "O200", "O620"),
DIAGSEC4 = c("O060", "O061", "O622"),
DIAGSEC5 = c("O540", "O123", "O344"),
DIAGSEC6 = c("O876", "Y321", "S333"),
DIAGSEC7 = c("O450", "X900", "O541"),
DIAGSEC8 = c("O222", "O111", "O123"),
DIAGSEC9 = c("O987", "O123", "O622"),
CID_MORTE = c("O066", "O699", "O555"),
CID_ASSO = c("O600", "O060", "O068"),
CID_NOTIF = c("O069", "O066", "O065"))
I also have a list of ICD-10 codes that I'm interested in counting.
GRUPO1 <- c("O00", "O000", "O001", "O002", "O008", "O009",
"O01", "O010", "O011", "O019",
"O02", "O020", "O021", "O028", "O029",
"O03", "O030", "O031", "O032", "O033", "O034", "O035", "O036", "O037",
"O038", "O039",
"O04", "O040", "O041", "O042", "O043", "O044", "O045", "O046", "O047",
"O048", "O049",
"O05", "O050", "O051", "O052", "O053", "O054", "O055", "O056", "O057",
"O058", "O059",
"O06", "O060", "O061", "O062", "O063", "O064", "O065", "O066", "O067",
"O068", "O069",
"O07", "O070", "O071", "O072", "O073", "O074", "O075", "O076", "O077",
"O078", "O079",
"O08", "O080", "O081", "O082", "O083", "O084", "O085", "O086", "O087",
"O088", "O089")
What I need is a data frame counting how many times the ICD-10 codes from "GRUPO1" appear in any row/column of the DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF variables. For example, in my reproducible data the ICD-10 code "O066" appears twice.
Thank you in advance!
We can unlist the data into a vector, use %in% to keep only the values that are in 'GRUPO1', and get the frequency counts with table in base R:
v1 <- unlist(data)
out <- table(v1[v1 %in% GRUPO1])
out[order(-out)]
O060 O066 O061 O065 O068 O069
2 2 1 1 1 1
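Since the real data has 108 columns and the question asks for a data frame, here is a minimal sketch of the same idea restricted to the columns of interest. The toy data, the OTHER column and the shortened GRUPO1 are made up for illustration; in the real data, cols would list all 14 variables.

```r
# Toy data: two code columns of interest plus one unrelated column.
# OTHER is a hypothetical stand-in for the ~94 columns we want to ignore.
data <- data.frame(DIAG_PRINC = c("O200", "O066", "O230"),
                   CID_MORTE  = c("O066", "O699", "O060"),
                   OTHER      = c("O066", "x", "y"),
                   stringsAsFactors = FALSE)
GRUPO1 <- c("O06", "O060", "O061", "O066")   # shortened list, illustration only
cols   <- c("DIAG_PRINC", "CID_MORTE")       # only these columns are counted

# Flatten the selected columns into one character vector
v <- unlist(data[cols], use.names = FALSE)

# Count the codes that belong to GRUPO1 and turn the table into a data frame
counts <- as.data.frame(table(code = v[v %in% GRUPO1]),
                        responseName = "n", stringsAsFactors = FALSE)

# Sort from highest to lowest count
counts <- counts[order(-counts$n), ]
counts
```

Selecting by name with data[cols] means the position of the columns in the full dataset does not matter.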
Here is a tidyverse solution using tidyr and dplyr (use count(value, sort = TRUE) if you want the result ordered from highest to lowest count):
library(tidyverse)
pivot_longer(data, everything()) %>%
filter(value %in% GRUPO1) %>%
count(value)
Output
value n
<chr> <int>
1 O060 2
2 O061 1
3 O065 1
4 O066 2
5 O068 1
6 O069 1
I have a dataset, df1, where the columns are various chemicals and the rows are samples identified by their id, with the concentration of each chemical.
I need to correct the chemical concentrations using a unique value for each chemical, which is found in another dataset, df2.
Here's a minimal df1 dataset:
df1 <- read.table(text="id,chem1,chem2,chem3,chemA,chemB
1,0.5,1,5,4,3
2,1.5,0.5,2,3,4
3,1,1,2.5,7,1
4,2,5,3,1,7
5,3,4,2.3,0.7,2.3",
header = TRUE,
sep=",")
and here is a df2 example:
df2 <- read.table(text="chem,value
chem1,1.7
chem2,2.3
chem3,4.1
chemA,5.2
chemB,2.7",
header = TRUE,
sep = ",")
What I need to do is divide all observations of chem1 in df1 by the value provided for chem1 in df2, and repeat this for each chemical. In reality, the chemical names are not sequential, and there are roughly 30 chemicals.
Previously I would have done this using Excel and index/match but I'm looking to make my methods more reproducible, hence fighting my way through with R. I mostly do data manipulation with dplyr, so if there's a tidyverse solution out there, that would be great!
Thankful for any help
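We can use the 'chem' column from 'df2' to select the matching columns of 'df1', divide them by the 'value' column of 'df2' (replicated so the lengths match), and assign the results back to 'df1'. Note that col(df1[-1]) indexes the values by column position, so this relies on the chemical columns of 'df1' being in the same order as the rows of 'df2':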
df1[as.character(df2$chem)] <- df1[as.character(df2$chem)]/df2$value[col(df1[-1])]
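If the order of the chemicals in the two tables is not guaranteed to agree, a variant that looks up each divisor by name may be safer. This is only a sketch on made-up toy data (two chemicals, with df2 deliberately in a different order); the helper names chems and divisors are mine.

```r
# Toy data: df2 lists the chemicals in a different order than df1's columns
df1 <- data.frame(id = 1:2, chem1 = c(0.5, 1.5), chemA = c(4, 3))
df2 <- data.frame(chem = c("chemA", "chem1"), value = c(5.2, 1.7),
                  stringsAsFactors = FALSE)

# Columns of df1 that have a correction value in df2
chems <- intersect(names(df1), df2$chem)

# Look up the divisor for each of those columns by name, not by position
divisors <- df2$value[match(chems, df2$chem)]

# Divide each column by its own divisor and write the result back
df1[chems] <- Map("/", df1[chems], divisors)
df1
```

Columns without an entry in df2 (here only id) are left untouched.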
Using the reshape2 package, the data frame can be changed to long format and merged with df2 as follows. (Note that the example data introduces some whitespace in the names, which is stripped in this solution.)
library(reshape2)
df1 <- read.table(text="id,chem1,chem2,chem3,chemA,chemB
1,0.5,1,5,4,3
2,1.5,0.5,2,3,4
3,1,1,2.5,7,1
4,2,5,3,1,7
5,3,4,2.3,0.7,2.3",
header = TRUE,
sep=",",stringsAsFactors = F)
df2 <- read.table(text="chem,value
chem1,1.7
chem2,2.3
chem3,4.1
chemA,5.2
chemB,2.7",
header = TRUE,
sep = ",",stringsAsFactors = F)
df2$chem <- gsub("\\s+","",df2$chem) #example introduces whitespaces in the names
df1A <- melt(df1,id.vars=c("id"),variable.name="chem")
combined <- merge(x=df1A,y=df2,by="chem",all.x=T)
combined$div <- combined$value.x/combined$value.y
head(combined)
chem id value.x value.y div
1 chem1 1 0.5 1.7 0.2941176
2 chem1 2 1.5 1.7 0.8823529
3 chem1 3 1.0 1.7 0.5882353
4 chem1 4 2.0 1.7 1.1764706
5 chem1 5 3.0 1.7 1.7647059
6 chem2 1 1.0 2.3 0.4347826
or in wide format:
> dcast(combined[,c("id","chem","div")],id ~ chem,value.var="div")
id chem1 chem2 chem3 chemA chemB
1 1 0.2941176 0.4347826 1.2195122 0.7692308 1.1111111
2 2 0.8823529 0.2173913 0.4878049 0.5769231 1.4814815
3 3 0.5882353 0.4347826 0.6097561 1.3461538 0.3703704
4 4 1.1764706 2.1739130 0.7317073 0.1923077 2.5925926
5 5 1.7647059 1.7391304 0.5609756 0.1346154 0.8518519
Here's a tidyverse solution.
library(dplyr)
library(tidyr)
df3 <- df1 %>%
# convert the data from wide to long to make the next step easier
gather(key = chem, value = value, -id) %>%
# do the math, using 'match' to look up each row's divisor in df2
mutate(value = value / df2$value[match(chem, df2$chem)]) %>%
# return the data to wide format if that's how you prefer to store it
spread(chem, value)
Consider the following dataframe slice:
df = data.frame(locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
row.names = c("a091", "b231", "a234", "d154"))
df
locations score
a091 argentina 1
b231 brazil 2
a234 argentina 3
d154 denmark 4
sorted = c("a234","d154","a091") #in my real task these strings are provided from an exogenous function
df2 = df[sorted,] #quick and simple subset using rownames
EDIT: Here I'm trying to subset AND order the data according to sorted - sorry that was not clear before. So the output, importantly, is:
locations score
a234 argentina 3
d154 denmark 4
a091 argentina 1
And not as you would get from a simple subset operation:
locations score
a091 argentina 1
a234 argentina 3
d154 denmark 4
I'd like to do exactly the same thing in dplyr. Here is an inelegant hack:
require(dplyr)
dt = as_tibble(df)
rownames(dt) = rownames(df)
Warning message:
Setting row names on a tibble is deprecated.
dt2 = dt[sorted,]
I'd like to do it properly, where the rownames are an index in the data table:
dt_proper = as_tibble(x = df,rownames = "index")
dt_proper2 = dt_proper %>% ?some_function(index, sorted)? #what would this be?
dt_proper2
# A tibble: 3 x 3
index locations score
<chr> <fct> <int>
1 a234 argentina 3
2 d154 denmark 4
3 a091 argentina 1
But I can't for the life of me figure out how to do this using filter or some other dplyr function, and without some convoluted conversion to factor, re-order factor levels, etc.
Hi,
you can use mutate to copy the row names of your data frame into an index column, filter to keep only the rows listed in the vector "sorted", and then order the rows according to "sorted":
library(dplyr)
df2 <- df %>% mutate(index = row.names(.)) %>% filter(index %in% sorted)
df2 <- df2[order(match(df2$index, sorted)), ]
I think I've figured it out:
dt_proper2 = dt_proper[match(sorted,dt_proper$index),]
This seems to be the shortest implementation of what df[sorted,] does.
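One thing to watch with the match approach: if "sorted" contains an id that is not in the data, match returns NA for it and you get a row of NAs. A minimal sketch of guarding against that, using a plain data frame with an index column and a made-up missing id ("zzz9"):

```r
df <- data.frame(index     = c("a091", "b231", "a234", "d154"),
                 locations = c("argentina", "brazil", "argentina", "denmark"),
                 score     = 1:4,
                 stringsAsFactors = FALSE)

# "zzz9" is a hypothetical id that does not exist in df
sorted <- c("a234", "d154", "a091", "zzz9")

# Position of each requested id in df, in the requested order (NA if absent)
pos <- match(sorted, df$index)

# Drop the ids that were not found, keep the requested order for the rest
out <- df[pos[!is.na(pos)], ]
out
```

The same indexing works on a tibble built with as_tibble(df, rownames = "index").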
As far as I know, functions in the tidyverse (dplyr, tibble, etc.) are built around the concept that rows only contain attributes (columns) and have no row names / labels / indexes. So in order to sort rows in a custom order, you have to introduce a new column containing the rank of each row.
The way I would do it is to create another tibble containing your "sorting information" (sorting attribute, rank) and inner join it to your original tibble. Then I could order the rows by rank.
library(tidyverse)
# note that I've changed the third column's name to avoid confusion
df = tibble(
locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
custom_id = c("a091", "b231", "a234", "d154")
)
sorted_ids = c("a234","d154","a091")
sorting_info = tibble(
custom_id = sorted_ids,
rank = seq_along(sorted_ids)
)
ordered_ids = df %>%
inner_join(sorting_info, by = "custom_id") %>%
arrange(rank) %>%
select(-rank)
I have a numeric vector:
> dput(vec_exp)
structure(c(12.344902729712, 6.54357482855349, 17.1939193108764,
23.1029632631654, 8.91495023159554, 14.3259091357051, 18.0494234749187,
2.92524638658168, 5.10306474037357, 2.66645609602021), .Names = c("Arthur_1",
"Mark_1", "Mark_2", "Mark_3", "Stephen_1", "Stephen_2",
"Stephen_3", "Rafael_1", "Marcus_1", "Georg_1"))
and then I have a data frame like the one below:
Name Nr Numb
1 Rafael 20.8337 20833.7
2 Joseph 25.1682 25168.2
3 Stephen 40.5880 40588.0
4 Leon 198.7730 198773.0
5 Thierry 16.5430 16543.0
6 Marcus 31.6600 31660.0
7 Lucas 39.6700 39670.0
8 Georg 194.9410 194941.0
9 Mark 60.1020 60102.0
10 Chris 56.0578 56057.8
I would like to multiply the numbers in the numeric vector by the numbers from the column Nr in this data frame. The values must be matched by name: Mark_1 from the numeric vector should be multiplied by Nr = 60.1020, the same for Mark_2 and Mark_3, Stephen_3 by 40.5880, etc.
Can someone recommend an easy solution?
You could use match to match the names after extracting only the first part of the names of vec_exp, i.e. extract Mark from Mark_1 etc.
vec_exp * df$Nr[match(sub("^([^_]+).*", "\\1", names(vec_exp)), df$Name)]
# Arthur_1 Mark_1 Mark_2 Mark_3 Stephen_1 Stephen_2 Stephen_3 Rafael_1 Marcus_1 Georg_1
# NA 393.28193 1033.38894 1388.53430 361.84000 581.46000 732.59000 60.94371 161.56303 519.80162
Arthur is NA because there's no match in the data.frame.
If you want to keep those entries without a match in the data as they were before, you could do it like this:
i <- match(sub("^([^_]+).*", "\\1", names(vec_exp)), df$Name)
vec_exp[!is.na(i)] <- vec_exp[!is.na(i)] * df$Nr[na.omit(i)]
This first computes the matches and then only multiplies those if they are not NA.
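The same lookup can also be written with a named vector instead of match, which some people find easier to read. A small sketch on made-up toy values (not the full data from the question); lookup and key are names I introduced:

```r
# Toy versions of the inputs
vec_exp <- c(Arthur_1 = 12, Mark_1 = 6.5, Mark_2 = 17, Rafael_1 = 3)
df <- data.frame(Name = c("Rafael", "Mark"),
                 Nr   = c(20.8337, 60.1020),
                 stringsAsFactors = FALSE)

# Named lookup vector: Nr values indexed by Name
lookup <- setNames(df$Nr, df$Name)

# Strip the "_<number>" suffix to get the key for each element
key <- sub("_.*$", "", names(vec_exp))

# Names without an entry in the lookup (Arthur) give NA
res <- vec_exp * unname(lookup[key])
res
```

The result keeps the names of vec_exp.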
We can use base R methods. Convert the vector to a data.frame with stack, create a 'Name' column by removing the substring from 'ind' and merge with the data.frame ('df1'). Then, we can multiply the 'Nr' and the 'values' column.
d1 <- merge(df1, transform(stack(vec_exp), Name = sub("_.*", "", ind)), all.y=TRUE)
d1$Nr*d1$values
Or with dplyr and tidyr, which is much easier to read:
library(dplyr)
library(tidyr)
stack(vec_exp) %>%
separate(ind, into = c("Name", "ind")) %>%
left_join(., df1, by = "Name") %>%
mutate(res = values*Nr) %>%
.$res
#[1] NA 393.28193 1033.38894 1388.53430 361.84000
#[6] 581.46000 732.59000 60.94371 161.56303 519.80162
I'm still learning R and have been given the task of grouping a long list of students into groups of four based on another variable. I have loaded the data into R as a data frame. How do I sample entire rows without replacement, one from each of the 4 levels of a variable, and have R output the data into a spreadsheet?
So far I have been tinkering with a for loop and the sample function, but I'm quickly getting in over my head. Any suggestions? Here is a sample of what I'm attempting to do. Given:
Last.Name <- c("Picard","Troi","Riker","La Forge", "Yar", "Crusher", "Crusher", "Data")
First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data")
Email <- c("a#a.com","b#b.com", "c#c.com", "d#d.com", "e#e.com", "f#f.com", "g#g.com", "h#h.com")
Section <- c(1,1,2,2,3,3,4,4)
df <- data.frame(Last.Name,First.Name,Email,Section)
I want to randomly select a Star Trek character from each section and end up with 2 groups of 4. I would want the entire row's worth of information to make it over to a new data frame containing all groups with their corresponding group number.
I'd use the wonderful package 'dplyr'
require(dplyr)
random_4 <- df %>% group_by(Section) %>% slice(sample(c(1,2),1))
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Troi Deanna b#b.com 1
2 La Forge Geordi d#d.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
random_4 # after re-running the sampling line above, the draw changes
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Picard Jean-Luc a#a.com 1
2 Riker William c#c.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
%>% means 'and then'.
The code reads as:
Take df AND THEN, within each 'Section', keep one row chosen at random by position (slice), either row 1 or row 2. Voila.
I suppose you have 8 students: First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data").
If you wish to randomly assign a section number to the 8 students, and assuming you would like each section to have 2 students, then you can either permute Section <- c(1, 1, 2, 2, 3, 3, 4, 4) or permute the list of the students.
First approach, permute the sections:
> assigned_section <- print(sample(Section))
[1] 1 4 3 2 2 3 4 1
Then the following data frame gives the assignments:
assigned_students <- data.frame(First.Name, assigned_section)
Second approach, permute the students:
> assigned_students <- print(sample(First.Name))
[1] "Data" "Geordi" "Tasha" "William" "Deanna" "Beverly" "Jean-Luc" "Wesley"
Then, the following data frame gives the assignments:
assigned_students <- data.frame(assigned_students, Section)
Alex, Thank You. Your answer wasn't exactly what I was looking for, but it inspired the correct one for me. I had been thinking about the process from a far too complicated point of view. Instead of having R select rows and put them into a new data frame, I decided to have R assign a random number to each of the students and then sort the data frame by the number:
First, I broke up the data frame into sections:
df1<- subset(df, Section ==1)
df2<- subset(df, Section ==2)
df3<- subset(df, Section ==3)
df4<- subset(df, Section ==4)
Then I randomly generated a group number 1 through 4.
Groupnumber <-sample(1:4,4, replace=F)
Next, I told R to bind the columns:
Assigned1 <- cbind(df1,Groupnumber)
I then repeated the group number generator and cbind for each section in turn (Assigned2 from df2, and so on), so that each section got its own random order of numbers.
Finally row binding the data set back together:
Final_List<-rbind(Assigned1,Assigned2,Assigned3,Assigned4)
Thank you everyone who looked this over. I am new to data science, R, and stackoverflow, but as I learn more I hope to return the favor.
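For anyone finding this later, the split / number / bind steps above can probably be collapsed into one pass with ave, which applies a function within each section. This is only a sketch on a cut-down version of the sample data; the Group column name and the CSV file name are my own choices, and set.seed is there only so the sketch is repeatable.

```r
set.seed(1)  # only to make this sketch repeatable

df <- data.frame(First.Name = c("Jean-Luc", "Deanna", "William", "Geordi",
                                "Tasha", "Beverly", "Wesley", "Data"),
                 Section    = c(1, 1, 2, 2, 3, 3, 4, 4),
                 stringsAsFactors = FALSE)

# Within each section, hand out the numbers 1..(section size) in random order,
# so every group ends up with exactly one student from each section
df$Group <- ave(seq_len(nrow(df)), df$Section,
                FUN = function(i) sample(seq_along(i)))

# Sort so that the members of each group sit together
Final_List <- df[order(df$Group, df$Section), ]
Final_List

# To get a spreadsheet (hypothetical file name):
# write.csv(Final_List, "groups.csv", row.names = FALSE)
```

Because the draw is sized by the number of rows in each section, this also works when the sections have more than two students each, as long as they are all the same size.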
I'd suggest the randomizr package to "block assign" according to section. The block_ra function lets you do this in an easy-to-read one-liner.
install.packages("randomizr")
library(randomizr)
df$group <- block_ra(block_var = df$Section,
condition_names = c("group_1", "group_2"))
You can inspect the resulting sets in a variety of ways. Here it is with base R subsetting:
df[df$group == "group_1",]
Last.Name First.Name Email Section group
2 Troi Deanna b#b.com 1 group_1
3 Riker William c#c.com 2 group_1
6 Crusher Beverly f#f.com 3 group_1
7 Crusher Wesley g#g.com 4 group_1
df[df$group == "group_2",]
Last.Name First.Name Email Section group
1 Picard Jean-Luc a#a.com 1 group_2
4 La Forge Geordi d#d.com 2 group_2
5 Yar Tasha e#e.com 3 group_2
8 Data Data h#h.com 4 group_2
If you want to roll your own:
set <- tapply(1:nrow(df), df$Section, FUN = sample, size = 1)
df[set,] # show the sampled set
df[-set,] # show the complementary set