Suppose I have a data frame that I have ordered according to rate such that it now looks something like this:
Name Rate
A 10
D 11
C 11
E 12
B 13
F 14
I am trying to write a function that takes a rank value as an argument (e.g. rank = 2) and outputs the corresponding names, such that if there are ties in ranks, it would output the name that comes first alphabetically.
In this case, the data should look something like this:
Name Rate Rank
A 10 1
C 11 2
D 11 3
E 12 4
B 13 5
F NA 6
so that rank=2 would output "C" (not D)
and rank = 5 would output "B"
Suppose that the function's rank input is called "num", this is what I've tried to do:
rankName <- df[!is.na(df[,2]),]
rankName <- sort(rankName[,2],) #sorting according to Rate
rank<-seq(1,length(rankName),by=1) #creating a sequence for rank
rankName <- cbind(rankHosp,rank) #combining rankName & rank seq.
comp <- rankName[rankName[,3]==num,] #finding rate value where rank = num
rankName <- rankName[rankName[,2]==comp,] #finding rows where rates are
#equal at that rank
rankName<-rankName$Name #extracting by Name
if (length(rankName)>1){
rankName <- sort(rankName)
rankName <- rankName[1]
}
I'm getting the following error:
Error in `[.data.frame`(rankName, , 3) : undefined columns selected
I'm assuming that, regardless of my error, there's a significantly simpler way to accomplish this, but I haven't been able to figure it out.
Any advice is appreciated. Thank you!
One way to do this is to compute ranks with base::rank() and then use the grouping functionality provided by packages such as dplyr:
df<- read.table(header = T, text = "Name Rate
A 10
D 11
C 11
E 12
B 13
F 14")
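df$rnk <- rank(df$Rate, na.last = TRUE, ties.method = "min")
df
library(dplyr)
finaldf <- df %>% group_by(rnk) %>% mutate(Rank = rnk + rank(Name) - 1) %>%
  as.data.frame() %>% select(Name, Rate, Rank)
finaldf
rnk is first created with ties.method = "min", so tied rates share the lowest rank of their group (2 for names D and C). We then group_by these ranks and add the alphabetical position of each Name within its group, which breaks the tie in favour of the name that comes first alphabetically.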
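For the lookup the question asks for (a rank in, a name out), a minimal base R sketch is to sort by Rate and then by Name and index the result. The helper name name_at_rank and the data frame name rates are hypothetical and not part of the original posts; returning NA for an out-of-range rank is also an assumption.

```r
# Hypothetical helper: returns the Name at a given rank, with ties in Rate
# broken alphabetically by Name.
name_at_rank <- function(df, num) {
  df <- df[!is.na(df$Rate), ]                 # drop rows with a missing rate
  if (num < 1 || num > nrow(df)) {
    return(NA_character_)                     # assumed behaviour for bad ranks
  }
  ord <- order(df$Rate, df$Name)              # sort by Rate, then by Name
  as.character(df$Name[ord][num])
}

# Same data as in the question
rates <- data.frame(
  Name = c("A", "D", "C", "E", "B", "F"),
  Rate = c(10, 11, 11, 12, 13, 14),
  stringsAsFactors = FALSE
)

name_at_rank(rates, 2)  # "C"
name_at_rank(rates, 5)  # "B"
```

Because order() accepts several keys, no separate tie-handling branch is needed.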
I have 569 rows of data related to breast cancer. In column A, each row either has a value of 'M' or 'B' in the cell (malignant or benign). In column B, the concavity of the nucleus of each tumour is given. I want to find the mean concavity for all malignant tumours, and for all benign tumours, separately.
Edit: first 25 rows of columns A and B given below as an example
> df2
data2.diagnosis data2.concavity_mean
1 M 0.3001000
2 M 0.0869000
3 M 0.1974000
4 M 0.2414000
5 M 0.1980000
6 M 0.1578000
7 M 0.1127000
8 M 0.0936600
9 M 0.1859000
10 M 0.2273000
11 M 0.0329900
12 M 0.0995400
13 M 0.2065000
14 M 0.0993800
15 M 0.2128000
16 M 0.1639000
17 M 0.0739500
18 M 0.1722000
19 M 0.1479000
20 B 0.0666400
21 B 0.0456800
22 B 0.0295600
23 M 0.2077000
24 M 0.1097000
25 M 0.1525000
How do I ask R to give me "the mean of rows in column B, given their value in column A is M" and then "given their value in column A is B"?
Assuming your variable A is a factor, a base R approach, shown here on a small example data frame called example, would be:
example <- data.frame(A = as.factor(c('M','B','M', 'B')), B=c(1,2,3,4))
mean(example$B[example$A == 'M'])
#> [1] 2
# for both factor levels simultaneously you can use
by(example$B, example$A, mean)
#> example$A: B
#> [1] 3
# ---- #
#> example$A: M
#> [1] 2
Note. Created on 2022-01-16 by the reprex package (v2.0.1)
Borrowing the example from one of the answers above (which are valid solutions), here is an alternative using the tidyverse:
library(dplyr)
example <- data.frame(A = as.factor(c('M','B','M', 'B')), B=c(1,2,3,4))
#first example creates a new table with summarized values
example %>% #takes your data table
  group_by(A) %>% #groups it by the factors listed in column A
  summarize(mean_A=mean(B)) #finds the mean of B within each subgroup (from previous step)
If you found this or any of these answers helpful, please accept it as the final answer.
As pointed out in the comments, it would be nice to have a reproducible example and your data (or at least a subset of it) to see what you are dealing with.
Anyway, the solution to your problem should resemble the following (I am using simulated data):
set.seed(1986)
dta = data.frame("type" = c(rep("B", times = 5), rep("M", times = 5)), "nucleus" = rnorm(10))
mean(dta$nucleus[dta$type == "B"]) # Mean concavity for benign.
mean(dta$nucleus[dta$type == "M"]) # Mean concavity for malignant.
Basically, I am just applying the mean() function to two subsets of the data, by selecting rows with the [] operator.
EDIT
Now that we have an idea of your actual data (the data frame df2 shown in the question), I can provide a complete solution:
mean(df2$data2.concavity_mean[df2$data2.diagnosis == "B"]) # Mean concavity for benign.
mean(df2$data2.concavity_mean[df2$data2.diagnosis == "M"]) # Mean concavity for malignant.
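Both group means can also be computed in one call with tapply(). The sketch below rebuilds only rows 19 to 22 of the df2 printout from the question, so the numbers it produces describe that four-row subset and not the full 569-row data set; the name group_means is introduced here for illustration.

```r
# Rows 19-22 of the df2 printout shown in the question
df2 <- data.frame(
  data2.diagnosis = c("M", "B", "B", "B"),
  data2.concavity_mean = c(0.1479, 0.06664, 0.04568, 0.02956),
  stringsAsFactors = FALSE
)

# One mean per diagnosis level, returned as a named vector
group_means <- tapply(df2$data2.concavity_mean, df2$data2.diagnosis, mean)
group_means

# Pick out a single group by name
group_means["M"]
group_means["B"]
```

tapply() splits the first argument by the second and applies the function to each piece, so it scales to any number of diagnosis levels without repeating the subsetting expression.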
I am looking to add a column to my data that will list the individual count of the observation in the dataset. I have data on NBA teams and each of their games. They are listed by date, and I want to create a column that lists what # in each season each game is for each team.
My data looks like this:
# gmDate teamAbbr opptAbbr id
# 2012-10-30 WAS CLE 2012-10-30WAS
# 2012-10-30 CLE WAS 2012-10-30CLE
# 2012-10-30 BOS MIA 2012-10-30BOS
Commas separate each column
I've tried to use "add_count" but this has provided me with the total # of games each team has played in the dataset.
Prior attempts:
nba_box %>% add_count()
I expect the added column to display the # game for each team (1-82), but instead it now shows the total number of games in the dataset (82).
Here is a base R example that approaches the problem with a for loop. Given that a team can appear in either column, we keep track of each team's game count by unlisting the current row and using the table function to add it to the counts from the previous rows.
# initialize some fake data
test <- as.data.frame(t(replicate(6, sample(LETTERS[1:3], 2))),
                      stringsAsFactors = FALSE)
colnames(test) <- c("team1", "team2")
# initialize two new columns
test$team2_gamenum <- test$team1_gamenum <- NA
count <- NULL
for (i in seq_len(nrow(test))) {
  # append this row's two teams to the running counts
  out <- c(count, table(unlist(test[i, c("team1", "team2")])))
  count <- table(rep(names(out), out)) # prob not optimum way of combining two table results
  test$team1_gamenum[i] <- count[which(names(count) == test[i, 1])]
  test$team2_gamenum[i] <- count[which(names(count) == test[i, 2])]
}
test
# team1 team2 team1_gamenum team2_gamenum
#1 B A 1 1
#2 A C 2 1
#3 C B 2 2
#4 C B 3 3
#5 A C 3 4
#6 A C 4 5
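If the data has one row per team per game, as in the question's nba_box layout, a running game number per team can be sketched without a loop using ave(). The first three rows below come from the question; the last two rows (and their opponents) are invented purely to show the counter advancing, and the column name gameNum is an assumption. A season identifier, if one exists, could be passed to ave() as a second grouping vector.

```r
nba_box <- data.frame(
  gmDate   = as.Date(c("2012-10-30", "2012-10-30", "2012-10-30",
                       "2012-11-02", "2012-11-02")),   # last two dates invented
  teamAbbr = c("WAS", "CLE", "BOS", "CLE", "BOS"),
  opptAbbr = c("CLE", "WAS", "MIA", "CHI", "MIL"),     # last two invented
  stringsAsFactors = FALSE
)

# Make sure games are in chronological order before numbering them
nba_box <- nba_box[order(nba_box$gmDate), ]

# Within each team, number the rows 1, 2, 3, ...
nba_box$gameNum <- ave(seq_len(nrow(nba_box)), nba_box$teamAbbr, FUN = seq_along)
nba_box
```

add_count() reports the size of each group, whereas seq_along() applied within each group gives the position of each row inside it, which is what a game number is.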
I have a list of identical dataframes. Each data frame contains columns with unique variables (temp/DO) and with repeated variables (eg-t1).
[[1]]
temp DO t1
1 4 1
3 9 1
5 7 1
I want to find the mean of DO when the temperature is equal to t1.
t1 represents a specific temperature, but the value varies for each data frame in the list so I can't specify an actual value.
So far I've tried writing a function
hvod<-function(DO, temp, depth){
hDO<-DO[which(temp==t1[1])]
mHDO<-mean(hDO)
htemp<-temp[which(temp=t1[1])]
mhtemp<-mean(htemp)
}
hfit<-hvod(data$DO, data$temp, data$depth)
But for whatever reason t1 is not recognized. Any ideas on the function OR
a way to combine select (dplyr function) and lapply to solve this?
I've seen similar posts put none that apply to the issue of a specific value (t1) that changes for each data frame.
I would pass the data frame itself as the argument and do the rest of the logic inside the function, as that gives the function more control (including access to t1). Something like this would work:
hvod <- function(data) {
  temp <- data$temp
  t1 <- data$t1
  DO <- data$DO
  hDO <- DO[which(temp == t1[1])]
  mHDO <- mean(hDO)
  htemp <- temp[which(temp == t1[1])]  # == (comparison), not = (assignment)
  mhtemp <- mean(htemp)
  c(mHDO = mHDO, mhtemp = mhtemp)  # return both means
}
You can try using dplyr::bind_rows function to combine all data.frames from list in one data.frame.
Then group on data.frame number to find the mean of DO for rows having temp==t1 as:
library(dplyr)
bind_rows(ll, .id = "DF_Name") %>%
group_by(DF_Name) %>%
filter(temp==t1) %>%
summarise(MeanDO = mean(DO)) %>%
as.data.frame()
# DF_Name MeanDO
# 1 1 4.0
# 2 2 6.5
# 3 3 6.7
Data:
df1 <- read.table(text =
"temp DO t1
1 4 1
3 9 1
5 7 1",
header = TRUE)
df2 <- read.table(text =
"temp DO t1
3 4 3
3 9 3
5 7 1",
header = TRUE)
df3 <- read.table(text =
"temp DO t1
2 4 2
2 9 2
2 7 2",
header = TRUE)
ll <- list(df1, df2, df3)
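Since the question also asks about lapply, here is a minimal base R sketch that applies the same filter to each data frame in the list with sapply(). It assumes, as the question's own attempt does, that the target temperature is the first value of t1 in each data frame; the name mean_do is introduced here for illustration.

```r
# Same three data frames as in the Data section above
df1 <- data.frame(temp = c(1, 3, 5), DO = c(4, 9, 7), t1 = c(1, 1, 1))
df2 <- data.frame(temp = c(3, 3, 5), DO = c(4, 9, 7), t1 = c(3, 3, 1))
df3 <- data.frame(temp = c(2, 2, 2), DO = c(4, 9, 7), t1 = c(2, 2, 2))
ll <- list(df1, df2, df3)

# For each data frame, average DO over the rows where temp equals t1[1]
mean_do <- sapply(ll, function(d) mean(d$DO[d$temp == d$t1[1]]))
mean_do
```

Because each data frame is passed whole to the anonymous function, t1 is looked up inside that data frame, which avoids the "t1 is not recognized" problem from the question.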
Thank you Thiloshon and MKR for the help! I had initially combined the data I needed into one list of data frames, but to answer this I actually had my data in separate data frames (fitObs and df1).
The variables I was working with in the code were 1 to 1, so by finding the range where depth and d2 were the same (I used temp and t1 in the example), I could find the mean over that range .
for(i in 1:1044){
df1 <- GLNPOsurveyCTD$data[[i]]
fitObs <- fitTp2(-df1$depth, df1$temp)
deptho <- -abs(df1$depth) #defining temp and depth in the loop
to <- df1$temp
do <- df1$DO
xx <- which(deptho <= fitObs$d2) #mean over range xx
mhtemp <- mean(to[xx], na.rm=TRUE)
mHDO <- mean(do[xx], na.rm=TRUE)
}
In the following example, how do I ask R to identify a tie as "tie" when I want to determine the most frequent value within a group?
I am basically following on from a previous question, that used which.max or which.is.max and a custom function (Create a variable capturing the most frequent occurence by group), but I want to acknowledge the ties as a tie. Any ideas?
df1 <-data.frame(
id=c(rep(1,3),rep(2,3)),
v1=as.character(c("a","b","b",rep("c",3)))
)
I want to create a third variable freq that contains the most frequent observation in v1 by id, but also identifies ties as "tie".
From previous answers, this code works to create the freq variable, but just doesn't deal with the ties:
myFun <- function(x){
tbl <- table(x$v1)
x$freq <- rep(names(tbl)[which.max(tbl)],nrow(x))
x
}
ddply(df1,.(id),.fun=myFun)
You could slightly modify your function by testing if the maximum count occurs more than once. This happens in sum(tbl == max(tbl)). Then proceed accordingly.
library(plyr)  # provides ddply() and .()
df1 <- data.frame(
  id = rep(1:2, each = 4),
  v1 = rep(letters[1:4], c(2, 2, 3, 1))
)
myFun <- function(x) {
  tbl <- table(x$v1)
  nmax <- sum(tbl == max(tbl))  # how many values share the top count
  if (nmax == 1) {
    x$freq <- rep(names(tbl)[which.max(tbl)], nrow(x))
  } else {
    x$freq <- "tie"
  }
  x
}
ddply(df1, .(id), .fun = myFun)
id v1 freq
1 1 a tie
2 1 a tie
3 1 b tie
4 1 b tie
5 2 c c
6 2 c c
7 2 c c
8 2 d c
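The same tie check can be sketched in base R with ave(), for readers who would rather not load plyr. The helper name most_frequent_or_tie is hypothetical; the data is the same df1 as in the answer above.

```r
df1 <- data.frame(
  id = rep(1:2, each = 4),
  v1 = rep(letters[1:4], c(2, 2, 3, 1)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the single most frequent value, or "tie" if several
# values share the highest count
most_frequent_or_tie <- function(v) {
  tbl <- table(v)
  top <- names(tbl)[tbl == max(tbl)]
  if (length(top) == 1) top else "tie"
}

# ave() applies the helper within each id and recycles the single result
# across all rows of that id
df1$freq <- ave(as.character(df1$v1), df1$id, FUN = most_frequent_or_tie)
df1
```

Comparing every count against max(tbl), instead of calling which.max(), is what exposes ties, because which.max() only ever reports the first maximum.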
I have a dataset with a patient identifier and a text field with a summary of medical findings (1 row per patient). I would like to create a dataset with multiple rows per patient by splitting the text field so that each sentence of the summary falls on a different line. Subsequently, I would like to text parse each line looking for certain keywords and negation terms. An example of the structure of the data frame is (the letters represent the sentences):
ID Summary
1 aaaaa. bb. c
2 d. eee. ff. g. h
3 i. j
4 k
I would like to split the text field at the “.” to convert it to:
ID Summary
1 aaaaa
1 bb
1 c
2 d
2 eee
2 ff
2 g
2 h
3 i
3 j
4 k
R code to create the initial data frame:
ID <- c(1, 2, 3, 4)
Summary <- c("aaaaa. bb. c", "d. eee. ff. g. h", "i. j", "k")
df <- data.frame(cbind(ID, Summary))
df$ID <- as.numeric(df$ID)
df$Summary <- as.character(df$Summary)
The following previous posting provides a nice solution:
Breaking up (melting) text data in a column in R?
I used the following code from that posting which works for this sample dataset:
dflong <- by(df, df$ID, FUN = function(x) {
sentence = unlist(strsplit(x$Summary, "[.]"))
data.frame(ID = x$ID, Summary = sentence)
})
dflong2<- do.call(rbind,dflong)
However, when I try to apply to my larger dataset (>200,000 rows), I get the error message:
Error in data.frame(ID = x$ID, Summary = sentence) : arguments imply differing number of rows: 1, 0
I reduced the data frame down to test it on a smaller dataset and I still get this error message any time the number of rows is >57.
Is there another approach to take that can handle a larger number of rows? Any advice is appreciated. Thank you.
Use data.table:
library(data.table)
dt = data.table(df)
dt[, strsplit(Summary, ". ", fixed = T), by = ID]
# ID V1
# 1: 1 aaaaa
# 2: 1 bb
# 3: 1 c
# 4: 2 d
# 5: 2 eee
# 6: 2 ff
# 7: 2 g
# 8: 2 h
# 9: 3 i
#10: 3 j
#11: 4 k
There are many ways to address @agstudy's comment about empty Summary, but here's a fun one:
dt[, c(tmp = "", # doesn't matter what you put here, will delete in a sec
# the point of having this is to force the size of the output table
# which data.table will kindly fill with NA's for us
Summary = strsplit(Summary, ". ", fixed = T)), by = ID][,
tmp := NULL]
You get the error because some rows have no data in the Summary column, so strsplit returns a zero-length result for them. Try this; it should work for you:
dflong <- by(df, df$ID, FUN = function(x) {
sentence = unlist(strsplit(x$Summary, "[.]"))
## I just added this line to your solution
if(length(sentence )==0)
sentence <- NA
data.frame(ID = x$ID, Summary = sentence)
})
dflong2<- do.call(rbind,dflong)
PS: This is slightly different from the data.table solution, which will remove rows where Summary is '' (0 characters). That said, I would use a data.table solution here since you have more than 200,000 rows.
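For completeness, the split can also be sketched in vectorized base R without by(), which avoids building one small data frame per patient. The fifth row, with an empty Summary, is added here only to reproduce the failing case and is not part of the question's sample data; the names parts and dflong are illustrative.

```r
df <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Summary = c("aaaaa. bb. c", "d. eee. ff. g. h", "i. j", "k", ""),  # ID 5 added
  stringsAsFactors = FALSE
)

# Split every summary at ". " in one call; the result is a list of vectors
parts <- strsplit(df$Summary, ". ", fixed = TRUE)

# An empty summary yields a zero-length vector; replace it with NA so the
# patient keeps one row in the output
parts[lengths(parts) == 0] <- NA_character_

# Repeat each ID once per sentence and flatten the sentences
dflong <- data.frame(
  ID = rep(df$ID, lengths(parts)),
  Summary = unlist(parts),
  stringsAsFactors = FALSE
)
dflong
```

Dropping the NA replacement line would instead omit patients with empty summaries, matching the behaviour described for the first data.table call.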