Let me preface this by saying that I'm new to R and still learning the fundamentals.
Currently I'm working on a large data frame (called "ppl") which I have to edit in order to filter some rows. Each row belongs to a group and is characterized by an intensity (into) value and a sample value.
mz rt into sample tracker sn grp
100.0153 126 2.762664 3 11908 7.522655 0
100.0171 127 2.972048 2 5308 7.718521 0
100.0788 272 30.217969 2 5309 19.024807 1
100.0796 272 17.277916 3 11910 7.297716 1
101.0042 128 37.557324 3 11916 27.991320 2
101.0043 128 39.676014 2 5316 28.234918 2
Well, the first question is: "How can I select from each group the sample with the highest intensity?"
I tried a for loop:
for (i in ppl$grp) {
temp<-ppl[ppl$grp == i,]
sel<-rbind(sel,temp[max(temp$into),])
}
The fact is that it works for ppl$grp == 0, but the subsequent iterations return rows of NAs.
The filtered data frame (called "sel") should also store the sample values of the removed rows. It should look as follows:
mz rt into sample tracker sn grp
100.0171 127 2.972048 c(2,3) 5308 7.718521 0
100.0788 272 30.217969 c(2,3) 5309 19.024807 1
101.0043 128 39.676014 c(2,3) 5316 28.234918 2
In order to get this I would use this approach:
lev<-factor(ppl$grp)
samp<-ppl$sample
samp2<-split(samp,lev)
sel$sample<-samp2
Any hint? I cannot test this approach since I haven't solved the previous problem yet.
Thanks a lot.
Not sure if I follow your question. But maybe this will get you started.
library(dplyr)
ppl %>% group_by(grp) %>% filter(into == max(into))
A base R option using ave is
ppl[with(ppl, ave(into, grp, FUN = max)==into),]
If the 'sample' column in the expected output should contain the unique elements in each 'grp', then after grouping by 'grp', update 'sample' to the pasted unique elements of 'sample', then arrange 'into' in descending order and slice the first row.
library(dplyr)
ppl %>%
group_by(grp) %>%
mutate(sample = toString(sort(unique(sample)))) %>%
arrange(desc(into)) %>%
slice(1L)
# mz rt into sample tracker sn grp
# <dbl> <int> <dbl> <chr> <int> <dbl> <int>
#1 100.0171 127 2.972048 2, 3 5308 7.718521 0
#2 100.0788 272 30.217969 2, 3 5309 19.024807 1
#3 101.0043 128 39.676014 2, 3 5316 28.234918 2
A data.table alternative:
library(data.table)
setkey(setDT(ppl),grp)
ppl <- ppl[ppl[,into==max(into),by=grp]$V1,]
## mz rt into sample tracker sn grp
##1: 100.0171 127 2.972048 2 5308 7.718521 0
##2: 100.0788 272 30.217969 2 5309 19.024807 1
##3: 101.0043 128 39.676014 2 5316 28.234918 2
I have no idea why this code would work
for (i in ppl$grp) {
temp<-ppl[ppl$grp == i,]
sel<-rbind(sel,temp[max(temp$into),])
}
max(temp$into) returns the maximum value itself, not a row index, and that value is not an integer in most cases, so temp[max(temp$into),] indexes a row that usually doesn't exist. You probably wanted which.max(temp$into).
Also, growing a data.frame with rbind on every loop iteration is not good practice (in any language). It requires quite a bit of type checking and array growing that can get very expensive.
Also, max will return NA when there are any NAs for that group.
There is also the question of what you want to do about ties. Do you want just one result or all of them? The code akrun gives will return all of them.
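For reference, here is a corrected sketch of the original loop (hypothetical, just to illustrate the fixes): it loops over the unique group values, uses which.max to get a proper row index, and collects the pieces in a list so rbind is called only once at the end.
# Corrected sketch: which.max() returns the row index of the maximum,
# and binding once at the end avoids growing 'sel' inside the loop.
pieces <- list()
for (i in unique(ppl$grp)) {
  temp <- ppl[ppl$grp == i, ]
  pieces[[as.character(i)]] <- temp[which.max(temp$into), ]
}
sel <- do.call(rbind, pieces)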
This code will write a new column that has the group max
ppl$grpmax <- ave(ppl$into, ppl$grp, FUN = function(x) max(x, na.rm = TRUE))
You can then select all values in a group that are equal to the max with
pplmax <- subset(ppl, into == grpmax)
If you want just one per group then you can remove duplicates
pplmax[!duplicated(pplmax$grp),]
I would like to create a column in my data frame that gives the percentage of each category. The total (100%) would be the sum of the column Score.
My data looks like
Client Score
<chr> <int>
1 RP 125
2 DM 30
Expected
Client Score %
<chr> <int>
1 RP 125 80.6
2 DM 30 19.3
Thanks!
Note that special characters in column names are not good practice.
library(dplyr)
df %>%
mutate(`%` = round(Score/sum(Score, na.rm = TRUE)*100, 1))
Client Score %
1 RP 125 80.6
2 DM 30 19.4
Probably the best way is to use dplyr. I recreated your data below and used the mutate function to create a new column on the dataframe.
#Creation of data
Client <- c("RP","DM")
Score <- c(125,30)
DF <- data.frame(Client,Score)
DF
#install.packages("dplyr") #Remove first # and install if library doesn't load
library(dplyr) #If this doesn't run, install library using code above.
#Shows new column
DF %>%
mutate("%" = round((Score/sum(Score))*100,1))
#Overwrites dataframe with new column added
DF %>%
mutate("%" = round((Score/sum(Score))*100,1)) -> DF
Using base R functions the same goal can be achieved.
X <- round((DF$Score/sum(DF$Score))*100,1) #Creation of percentage
DF$"%" <- X #Storage of X as % to dataframe
DF #Check to see it exists
In base R, we may use proportions:
df[["%"]] <- round(proportions(df$Score) * 100, 1)
-output
> df
Client Score %
1 RP 125 80.6
2 DM 30 19.4
I have two data frames, df1 has information about a publication's year, outlet name, total articles in this publication in a year, and a cumulative sum of articles over the period of time I'm studying. df2 has a random sample of article IDs, with potential values ranging from 1 to the total number of articles given by df1$cumsum.
What I need to do is to grab each article ID in df2 and identify in which publication and year it falls under, using the information contained in df1.
Here's a minimally reproducible example:
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2009, 2000:2009)
df1$outlet <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,2,2,2,2,2,2,2,2,2)
df1$article_total <- sample(1:200, 20, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_num <- sample(1:2102, 100, replace = T) # get random sample of article IDs for the total number of articles I have in this db
df2 <- as.data.frame(df2)
Ideally, I would also like to calculate an article's number within each year. For example, in the data above, outlet 1 has 14 articles in the year 2000 and 168 in 2001 (cumsum = 183). If I have an article ID of 156, I would like to know that it is the 142nd article in the year 2001 of publication 1. And so on and so forth for every article ID in this database.
I was thinking I should do this with a for loop, but I'm 100% lost in writing it. Here's what I began writing, but I have a feeling I'm not on the right track with it:
for i in 1:nrow(df2$art_num){
article_number <- df2$art_num[i]
if (article_number %in% df1$cumsum){ # note: cumsum should be an interval before doing this?
# get article number, year, publication in new df
# also calculate article ID in each year/publication
}
}
Thanks in advance for any help! I'm still lost with writing loops in R...
#######################
EDITED EXAMPLE as per Frank's suggestion
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2002, 2000:2002)
df1$outlet <- c(1, 1, 1, 2,2,2)
df1$article_total <- sample(1:50, 6, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_id <- c(66, 120, 77, 156, 24)
df2 <- as.data.frame(df2)
Here's the output I'm looking for:
art_id outlet year article_number
1 66 1 2002 19
2 120 2 2000 35
3 77 1 2002 30
4 156 2 2001 35
5 24 1 2000 20
This example shows my ideal output in df3, which I calculated/built by hand. It has one column with the article's ID, the appropriate outlet, the year, and a new variable art_number. This is different from the article ID in that I calculated it from df1$cumsum and df3$art_id. In this example, the first row shows that the first article in my database has an ID of 66. I obtain an art_number value of 19 because this article (id = 66) is the 19th article published in the year 2002 by outlet 1. I calculated this value by looking at the article ID, locating the year and outlet based on df1$cumsum, and then subtracting the df1$cumsum value for the previous year from the art_id value. So for this specific article, I calculated df3$art_number = df3$art_id[1,1] - df1$cumsum[2,4]
I need to do this calculation for every article in my database, so that I don't have to do this process by hand forever.
I think your data structure makes sense, though it would be easier with one additional column, for the first article in a year and outlet:
library(data.table)
setDT(df1); setDT(df2)
df1[, art_cstart := shift(cumsum(article_total), fill=0L) + 1L]
year outlet article_total cumsum art_cstart
1: 2000 1 4 4 1
2: 2001 1 43 47 5
3: 2002 1 38 85 48
4: 2000 2 36 121 86
5: 2001 2 39 160 122
6: 2002 2 8 168 161
Now, we can do a rolling update join, "rolling" each art_id forward to the next cumsum (roll=-Inf) and computing each desired column:
df2[, c("outlet", "year", "art_num") := df1[df2, on=.(cumsum = art_id), roll=-Inf, .(
x.year,
x.outlet,
i.art_id - x.art_cstart + 1L
)]]
art_id outlet year art_num
1: 66 2002 1 19
2: 120 2000 2 35
3: 77 2002 1 30
4: 156 2001 2 35
5: 24 2001 1 20
How it works
x[i, on=, roll=, j] is the syntax for a join, looking up each row of i in x.
In this join j evaluates to a list of columns, .(...) shorthand for list(...).
Column assignment is done with (colnames) := .(...).
The assignment is to the existing table df2 instead of unnecessarily creating a new table.
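As a cross-check (a base R sketch, not part of the original answer), the same interval lookup can be done with findInterval, assuming df1 is sorted by cumsum and already has the art_cstart column from above:
# findInterval() counts the cumsum values strictly below each art_id;
# adding 1 gives the row index of the interval the article falls in.
idx <- findInterval(df2$art_id - 1L, df1$cumsum) + 1L
df2$outlet  <- df1$outlet[idx]
df2$year    <- df1$year[idx]
df2$art_num <- df2$art_id - df1$art_cstart[idx] + 1L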
For details on how data.table syntax works, see the startup messages...
> library(data.table)
data.table 1.10.4
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
This is the code you need I think:
df3 <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(df3) <- c("articleNumber", "year", "publication")
for (i in 1:nrow(df2)) {
  for (j in 1:nrow(df1)) {
    # treat each cumsum as the upper bound of an interval:
    # row j covers the article IDs in (cumsum[j-1], cumsum[j]]
    lower <- if (j == 1) 0 else df1$cumsum[j - 1]
    if (df2$art_id[i] > lower && df2$art_id[i] <= df1$cumsum[j]) {
      # store article number, year, publication in the new df
      df3[i, 1] <- df2$art_id[i]
      df3[i, 2] <- df1$year[j]
      df3[i, 3] <- df1$outlet[j]
      # the article's number within that year/publication is then
      # df2$art_id[i] - lower
    }
  }
}
I am trying to calculate family sizes from a data frame which also contains two types of events: family members who died, and those who left the family. I would like to take these two parameters into account in order to compute the actual family size.
Here is a reproducible example of my problem, with 3 families only:
family <- factor(rep(c("001","002","003"), c(10,8,15)), levels=c("001","002","003"), labels=c("001","002","003"), ordered=TRUE)
dead <- c(0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0)
left <- c(0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0)
DF <- data.frame(family, dead, left) ; DF
I could count N = total family members (in each family) in a second dataframe DF2, by simply using table()
DF2 <- with(DF, data.frame(table(family)))
colnames(DF2)[2] <- "N" ; DF2
family N
1 001 10
2 002 8
3 003 15
But I cannot find a proper way to get the actual number of people (for example, by creating a new variable N2 in DF2), calculated by subtracting from N the number of members who died or left the family. I suppose I have to relate the two data frames DF and DF2 in some way. I have looked for other related questions on this site but could not find the right answer...
If anyone has a good idea, it would be great!
Thank you in advance.
Deni
Logic: first we want to group_by(family) and then calculate two numbers: i) the total number of observations in each group, and ii) subtract sum(dead) + sum(left) from this total.
In the dplyr package, n() gives us the total number of observations in each group.
In data.table, .N does the same job.
library(dplyr)
DF %>% group_by(family) %>% summarise( total = n(), current = n()-sum(dead,left, na.rm = TRUE))
# family total current
# (fctr) (int) (dbl)
#1 001 10 6
#2 002 8 4
#3 003 15 7
library(data.table)
# setDT() converts a data.frame to a data.table by reference; if DF is already a data.table, just use DF.
setDT(DF)[, .(total = .N, current = .N - sum(dead, left, na.rm = TRUE)), by = family]
# family total current
#1: 001 10 6
#2: 002 8 4
#3: 003 15 7
Here is a base R option. aggregate() stores a multi-element result as a matrix column, so do.call(data.frame, ...) is used to flatten it into separate columns:
do.call(data.frame, aggregate(dl~family, transform(DF, dl = dead + left),
FUN = function(x) c(total=length(x), current=length(x) - sum(x))))
Or a modified version is
transform(aggregate(. ~ family, transform(DF, total = 1,
current = dead + left)[c(1,4:5)], FUN = sum), current = total - current)
# family total current
#1 001 10 6
#2 002 8 4
#3 003 15 7
I finally found another solution which works fine (from another post), allowing me to compute everything from the original DF table. It uses the ddply function from the plyr package:
library(plyr)
DF <- ddply(DF, .(family), transform, total = length(family))
DF <- ddply(DF, .(family), transform, actual = length(family) - sum(dead == 1) - sum(left == 1))
DF
Thanks a lot to everyone who helped ! Deni
I have the following table (this is just a sample):
custNbr channel custBranchNbr totalTransactions
1 Web 901 7
2 store 903 5
3 Cel 901 10
etc...
and I'd like to create a "sub_table" which summarizes the number of transactions in each custBranchNbr, conditioned on specific channels (Web + Cel only); something like this:
custBranchNbr sum(totalTransaction)
901 17
I know how to do a conditional sum (like this: sum(DF[which(DF[,1]>30 & DF[,4]>90),2])), but I don't know how I can implement it to get the "sub-table" I described above.
Your help will be appreciated.
Use the aggregate function:
sub_table <- aggregate(totalTransactions ~ custBranchNbr, data = df[df$channel %in% c('Web', 'Cel'), ], FUN = sum)
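With the sample data above, this should return something like:
  custBranchNbr totalTransactions
1           901                17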
We can also do this with library(dplyr):
df %>% filter(channel %in% c("Web", "Cel")) %>%
group_by(custBranchNbr) %>%
summarise(sum_totalTransactions = sum(totalTransactions))
# A tibble: 1 × 2
custBranchNbr sum_totalTransactions
<int> <int>
1 901 17
An option using data.table
library(data.table)
setDT(df)[channel %chin% c('Web', 'Cel'), .(Sum = sum(totalTransactions)), by = custBranchNbr]
I would like to know if there is a simple way to achieve what I describe below using ddply. My data frame describes an experiment with two conditions. Participants had to select between options A and B, and we recorded how long they took to decide, and whether their responses were accurate or not.
I use ddply to create averages by condition. The column nAccurate summarizes the number of accurate responses in each condition. I also want to know how much time they took to decide and express it in the column RT. However, I want to calculate average response times only when participants got the response right (i.e. Accuracy==1). Currently, the code below can only calculate average reaction times for all responses (accurate and inaccurate ones). Is there a simple way to modify it to get average response times computed only in accurate trials?
See sample code below and thanks!
library(plyr)
# Create sample data frame.
Condition = c(rep(1,6), rep(2,6)) #two conditions
Response = c("A","A","A","A","B","A","B","B","B","B","A","A") #whether option "A" or "B" was selected
Accuracy = rep(c(1,1,0),4) #whether the response was accurate or not
RT = c(110,133,121,122,145,166,178,433,300,340,250,674) #response times
df = data.frame(Condition,Response, Accuracy,RT)
head(df)
Condition Response Accuracy RT
1 1 A 1 110
2 1 A 1 133
3 1 A 0 121
4 1 A 1 122
5 1 B 1 145
6 1 A 0 166
# Calculate averages.
avg <- ddply(df, .(Condition), summarise,
N = length(Response),
nAccurate = sum(Accuracy),
RT = mean(RT))
# The problem: response times are calculated over all trials. I would like
# to calculate mean response times *for accurate responses only*.
avg
Condition N nAccurate RT
1 6 4 132.8333
2 6 4 362.5000
With plyr, you can do it as follows:
ddply(df,
.(Condition), summarise,
N = length(Response),
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy==1]))
this gives:
Condition N nAccurate RT
1 1 6 4 127.50
2 2 6 4 300.25
If you use data.table, then this is an alternative way:
library(data.table)
setDT(df)[, .(N = .N,
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy==1])),
by = Condition]
Using the dplyr package:
library(dplyr)
df %>%
group_by(Condition) %>%
summarise(N = n(),
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy == 1]))
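For completeness, a base R sketch of the same idea (not from the original answers): subset both RT and the grouping variable to accurate trials inside tapply.
# table() gives the per-condition counts; tapply() computes the
# per-condition accuracy sums and the accurate-only mean RTs.
acc <- df$Accuracy == 1
data.frame(
  Condition = sort(unique(df$Condition)),
  N         = as.vector(table(df$Condition)),
  nAccurate = as.vector(tapply(df$Accuracy, df$Condition, sum)),
  RT        = as.vector(tapply(df$RT[acc], df$Condition[acc], mean))
)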