How to find the first occurrence of a vector of numeric elements within a data frame column? - r

I have a data frame (min_set_obs) with two columns: the first, called Treatment, contains numeric values, and the second is an id column called seq:
min_set_obs
Treatment seq
1 29
1 23
3 60
1 6
2 41
1 5
2 44
Let's say I have a vector of numeric values, called key:
key
[1] 1 1 1 2 2 3
I.e. a vector of three 1s, two 2s, and one 3.
How would I go about identifying which rows from my min_set_obs data frame contain the first occurrence of values from the key vector?
I'd like my output to look like this:
Treatment seq
1 29
1 23
3 60
1 6
2 41
2 44
I.e. the sixth row from min_set_obs was 'extra' (it was the fourth 1 when there should only be three 1s), so it would be removed.
I'm familiar with the %in% operator, but I don't think it can tell me the position of the first occurrence of the key vector in the first column of the min_set_obs data frame.
Thanks

Here is an option with base R: split 'min_set_obs' by 'Treatment' into a list, take the head of each list element using the corresponding frequency in 'key', and rbind the list elements back into a single data.frame.
res <- do.call(rbind, Map(head, split(min_set_obs, min_set_obs$Treatment), n = table(key)))
row.names(res) <- NULL
res
# Treatment seq
#1 1 29
#2 1 23
#3 1 6
#4 2 41
#5 2 44
#6 3 60
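If you want to keep the rows in their original order instead of regrouped by Treatment, here is a hedged base R sketch using ave(): keep each row only while its running count within its Treatment value is still within that value's frequency in key. It assumes every Treatment value in min_set_obs also appears in key; values missing from key would yield NA and need extra handling.
counts <- table(key)
# running occurrence number of each Treatment value, in original row order
nth <- ave(seq_along(min_set_obs$Treatment), min_set_obs$Treatment, FUN = seq_along)
# keep a row while its occurrence number is within the frequency allowed by key
min_set_obs[nth <= counts[as.character(min_set_obs$Treatment)], ]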

Using dplyr, you can first count the keys with table() and then take the corresponding top n rows from each group:
library(dplyr)
m <- table(key)
min_set_obs %>%
  group_by(Treatment) %>%
  do({
    # as.character(.$Treatment[1]) returns the treatment for the current group
    # use coalesce to default to 0 rows if the treatment doesn't exist in key
    head(., coalesce(m[as.character(.$Treatment[1])], 0L))
  })
# A tibble: 6 x 2
# Groups: Treatment [3]
# Treatment seq
# <int> <int>
#1 1 29
#2 1 23
#3 1 6
#4 2 41
#5 2 44
#6 3 60
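Note that do() is superseded in recent dplyr releases; here is a hedged sketch of the same idea with group_modify(), assuming the same key as above:
library(dplyr)
m2 <- c(table(key)) # plain named integer vector, so indexing returns an integer
min_set_obs %>%
  group_by(Treatment) %>%
  # .y is a one-row tibble holding the current group's Treatment value
  group_modify(~ head(.x, coalesce(m2[as.character(.y$Treatment)], 0L))) %>%
  ungroup()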

Related

add square of all items in a column as a row in R data frame

I have a dataframe as shown below containing 3 rows, or n rows more generally. I want to add a 4th (or (n+1)th) row containing the sum of squares of all items in each column.
x <- data.frame("a" = c(2,3,4), "b" = c(3,4,5))
> x
a b
1 2 3
2 3 4
3 4 5
In the above example, the 4th row should contain the values 29 and 50, respectively.
An option is
library(dplyr)
x %>%
  summarise_all(~ sum(.^2)) %>%
  bind_rows(x, .)
# a b
#1 2 3
#2 3 4
#3 4 5
#4 29 50
Or in base R
rbind(x, colSums(x^2))
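As a hedged side note, summarise_all() is superseded in current dplyr; an equivalent of the pipeline above with across() would be:
library(dplyr)
x %>%
  summarise(across(everything(), ~ sum(.x^2))) %>%
  bind_rows(x, .)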

How can I get sum of value from a column based on condition mixed with second list using R?

When I try to do an operation on the 2 lists, I get error messages and the calculation does not work properly (see the end of the question).
list2 <- list2 %>%
  mutate(sum_of_part = sum(list1$part[(list1$id < list2$id) & (list1$id >= lag(list2$id))]))
So what I want to do is:
Get the sum of "part" of all rows in list1 where the "id" is between the "id" of the current row in list2 and the "id" of the row before.
I also want to count the number of rows which are used to calculate the column sum_of_parts.
list1
id Part ...
1 2
2 3
3 4
4 6
99 11
100 11
191 11
222 11
333 11
list2
id ...
1
3
4
88
99
solution
id ... sum_of_parts count
1 ... 2 1
3 ... 9 3
4 ... 10 2
88 ... 6 1
99 ... 11 1
But because my list2 is a lot smaller than my list1, I get this error (there are some more, but they look almost the same):
In list1$id < list2$id : longer object length is not a multiple of shorter object length
You were really close, this one gets me all the time!
mutate operates by group, so if you haven't specified a group it will try to use the whole column in a vectorised operation (which is usually more efficient), hence the error about different lengths.
If you want to operate on each row, you can use rowwise(), to make the following calculations treat each row as a group. So id will be a length one vector in the mutate call.
Note that we need to compute the lag before grouping; otherwise, by the logic above, there would be no previous id within a length-one group.
library(dplyr)
list1 <- readr::read_csv(
'id,part
1,2
2,3
3,4
4,6
99,11
100,11
191,11
222,11
333,11')
list2 <- readr::read_csv(
'id
1
3
4
88
99'
)
list2 %>%
  mutate(lag_id = lag(id, default = 0)) %>%
  rowwise() %>%
  mutate(sum_of_part = sum(list1$part[(list1$id <= id) & (list1$id > lag_id)]),
         count = length(list1$part[(list1$id <= id) & (list1$id > lag_id)])) %>%
  select(-lag_id)
#> Source: local data frame [5 x 3]
#> Groups: <by row>
#>
#> # A tibble: 5 x 3
#> id sum_of_part count
#> <int> <int> <int>
#> 1 1 2 1
#> 2 3 7 2
#> 3 4 6 1
#> 4 88 0 0
#> 5 99 11 1
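For comparison, here is a hedged base R sketch of the same per-row logic with mapply() instead of rowwise(), assuming list1 and list2 as read in above:
# previous id for each row of list2 (0 before the first row)
lag_id <- c(0, head(list2$id, -1))
in_range <- function(hi, lo) list1$id <= hi & list1$id > lo
list2$sum_of_part <- mapply(function(hi, lo) sum(list1$part[in_range(hi, lo)]), list2$id, lag_id)
list2$count <- mapply(function(hi, lo) sum(in_range(hi, lo)), list2$id, lag_id)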

Subset specific row and last row from data frame

I have a data frame which contains data relating to the score of different events. There can be a number of scoring events for one game. What I would like to do is subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows, depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this, then I'll be able to apply it to anything else I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset the rows where the value goes above 5 or below -5; I just don't know how to write the code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need anymore information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
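For example, assuming rows within each ID are already ordered by Time (as in the sample data), a hedged way to restore the original ordering after the rbind() is:
Data2 <- Data2[order(Data2$ID, Data2$Time), ]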
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
                  FUN = function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11
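As a hedged side note, top_n() is superseded in current dplyr; the last-row step above can also be written with slice_max():
lastrows <- Data %>% group_by(ID) %>% slice_max(Time, n = 1)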

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this, without subsetting and sampling the subsets.
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
If your data is truly a matrix and not a data.frame, you can work around this too, with:
dat[tapply(as.character(seq(nrow(dat))),dat$ID,FUN=sample,1),]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
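In recent dplyr, sample_n() is superseded; a hedged modern spelling of the same operation is slice_sample():
df %>% group_by(ID) %>% slice_sample(n = 1)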
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]
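As a usage note, the result changes from run to run; if you need a reproducible draw, set the RNG seed first (the seed value here is arbitrary):
set.seed(42) # any fixed seed makes the sampled rows reproducible
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]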

R: Add value in new column of data frame depending on value in another column

I have 2 data frames in R, df1 and df2.
df1 represents in each row one subject in an experiment. It has 3 columns. The first two columns specify a combination of groups the subject is in. The third column contains the experimental result.
df2 contains values for each combination of groups that can be used for normalization. Thus, it has three columns: two for the groups and a third for the normalization constant.
Now I want to create a fourth column in df1 with the experimental results from the third column divided by the normalization constant in df2. How can I accomplish this?
Here's an example:
df1 <- data.frame(c(1,1,1,1),c(1,2,1,2),c(10,11,12,13))
df2 <- data.frame(c(1,1,2,2),c(1,2,1,2),c(30,40,50,60))
names(df1)<-c("Group1","Group2","Result")
names(df2)<-c("Group1","Group2","NormalizationConstant")
As a result, I need a new column in df1 with c(10/30, 11/40, 12/30, 13/40).
My first attempt is the following code, which fails for my real data with the error message "In is.na(e1) | is.na(e2) : longer object length is not a multiple of shorter object length". Nevertheless, when I replace the comparisons ==df1[,1] and ==df1[,2] with fixed values, it works. Is the comparison really returning only the value of the column for this particular row?
df1$NormalizedResult<- df1$Result / df2[df2[,1]==df1[,1] & df2[,2]==df1[,2],]$NormalizationConstant
Thanks for your help!
If the groups in df1 and df2 were aligned row for row, it would be as simple as:
> df1$expnormed <- df1$Result/df2$NormalizationConstant
> df1
Group1 Group2 Result expnormed
1 1 1 10 0.3333333
2 1 2 11 0.2750000
3 1 1 12 0.2400000
4 1 2 13 0.2166667
If they were not exactly aligned you would use merge:
> dfm <-merge(df1,df2)
> dfm
Group1 Group2 Result NormalizationConstant
1 1 1 10 30
2 1 1 12 30
3 1 2 11 40
4 1 2 13 40
> dfm$expnormed <- with(dfm, Result/NormalizationConstant)
A possibility:
df1$res <- df1$Result/df2$NormalizationConstant[match(do.call("paste", df1[1:2]), do.call("paste", df2[1:2]))]
Group1 Group2 Result res
1 1 1 10 0.3333333
2 1 2 11 0.2750000
3 1 1 12 0.4000000
4 1 2 13 0.3250000
Hth
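The same lookup can also be expressed as a join; here is a hedged dplyr sketch, assuming the df1/df2 definitions above (left_join() preserves df1's row order):
library(dplyr)
df1 %>%
  left_join(df2, by = c("Group1", "Group2")) %>%
  mutate(NormalizedResult = Result / NormalizationConstant)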
