How can I get the sum of values from a column based on a condition combined with a second list using R?

When I try to do an operation on 2 lists, I get error messages and the calculation does not work properly (see the end of the question).
list2 <- list2 %>%
mutate(sum_of_part = sum(list1$part[(list1$id < list2$id) & (list1$id >= lag(list2$id))]))
So what I want to do is:
Get the sum of "part" of all rows in list1 where the "id" is between the "id" of the current row in list2 and the "id" of the row before.
I also want to count the number of rows which are used to calculate the column sum_of_parts.
list1
id Part ...
1 2
2 3
3 4
4 6
99 11
100 11
191 11
222 11
333 11
list2
id ...
1
3
4
88
99
solution
id ... sum_of_parts count
1 ... 2 1
3 ... 9 3
4 ... 10 2
88 ... 6 1
99 ... 11 1
But because my list2 is a lot smaller than my list1, I get these errors (there are some more, but they look almost the same):
In list1$id < list2$id : longer object length is not a multiple of shorter object length

You were really close, this one gets me all the time!
mutate operates by group, so if you haven't specified a group it will try to use the whole column in a vectorised operation (which is usually more efficient), hence the warning about different lengths.
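For instance, comparing two vectors whose lengths aren't multiples of each other reproduces that warning (recycling demo, output abbreviated):
1:9 > 1:5
#> Warning in 1:9 > 1:5: longer object length is not a multiple of shorter
#> object length
#> [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE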
If you want to operate on each row, you can use rowwise() to make the following calculations treat each row as a group, so id will be a length-one vector in the mutate call.
Note that we need to compute the lag before grouping; otherwise, by the logic above, there would be no previous id within a length-one group.
library(dplyr)
list1 <- readr::read_csv(
'id,part
1,2
2,3
3,4
4,6
99,11
100,11
191,11
222,11
333,11')
list2 <- readr::read_csv(
'id
1
3
4
88
99'
)
list2 %>%
  mutate(lag_id = lag(id, default = 0)) %>%
  rowwise() %>%
  mutate(sum_of_part = sum(list1$part[(list1$id <= id) & (list1$id > lag_id)]),
         count = length(list1$part[(list1$id <= id) & (list1$id > lag_id)])) %>%
  select(-lag_id)
#> Source: local data frame [5 x 3]
#> Groups: <by row>
#>
#> # A tibble: 5 x 3
#> id sum_of_part count
#> <int> <int> <int>
#> 1 1 2 1
#> 2 3 7 2
#> 3 4 6 1
#> 4 88 0 0
#> 5 99 11 1
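If rowwise() proves slow on larger data, here is a sketch of a vectorised alternative (same list1/list2 as above) that maps over each id and its lag with mapply():
library(dplyr)
list2 %>%
  mutate(lag_id = lag(id, default = 0),
         # for each (id, lag_id) pair, sum list1$part over the half-open interval (lag_id, id]
         sum_of_part = mapply(function(hi, lo) sum(list1$part[list1$id <= hi & list1$id > lo]),
                              id, lag_id),
         # counting the TRUEs gives the number of list1 rows used in each sum
         count = mapply(function(hi, lo) sum(list1$id <= hi & list1$id > lo),
                        id, lag_id)) %>%
  select(-lag_id)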

Related

How many times does the value for column B appear for a value in column A?

I am having the hardest time coming up with code that lets me match a topic (Column B) to a name (Column A) and create a frequency column for the times B has matched with A (or how many times both have appeared together). Col A and B are codes for longer names.
I thought of using the count function from plyr but can't make it work. Maybe you can give me an idea of what code I could use?
For example I have a table:
Col A  Col B
1      38
1      6
1      38
2      38
2      7
2      7
2      8
2      7
The result that I am looking for is
Col A  Col B  freq
1      38     2
1      6      1
2      38     1
2      7      3
2      8      1
So the number 38 has appeared with "1" two times, 6 has appeared one time, and so on.
I have 600 rows of data and can't come up with useful or even close code.
Thank you so much for your help!
Summarise and count using dplyr:
library(dplyr)
df2 <- df %>%
  group_by(col1, col2) %>%
  summarise(count = n()) %>%
  ungroup()
returns:
col1 col2 count
<dbl> <dbl> <int>
1 1 6 1
2 1 38 2
3 2 7 3
4 2 8 1
5 2 38 1
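As a side note, newer dplyr versions collapse this group_by/summarise/ungroup pattern into a single count() call (a sketch; the name argument, which labels the output column, is assumed to be available in your dplyr version):
library(dplyr)
df %>% count(col1, col2, name = "count")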

how to subset a data frame up until a point R

I want to subset a data frame and take all observations for each id until the first observation that didn't meet my condition. Something like this:
goodDaysAfterTreatMent <- subset(Patientdays, treatmentDate < date & goodThings > badThings)
Except that this returns all observations that meet the condition. I want something that stops at the first observation that didn't meet the condition, moves on to the next id, returns all observations for that id that meet the condition, and so on.
The only way I can see is to use a lot of loops, but loops are usually not a good thing in R.
Hope you guys have an idea
Assume that your condition is to return rows where v < 5 :
# example dataset
df = data.frame(id = c(1,1,1,1,2,2,2,2,3,3,3),
v = c(2,4,3,5,4,5,6,7,5,4,1))
df
# id v
# 1 1 2
# 2 1 4
# 3 1 3
# 4 1 5
# 5 2 4
# 6 2 5
# 7 2 6
# 8 2 7
# 9 3 5
# 10 3 4
# 11 3 1
library(tidyverse)
df %>%
  group_by(id) %>%                                  # for each id
  mutate(flag = cumsum(ifelse(v < 5, 1, NA))) %>%   # flag turns NA at the first row where v >= 5, and cumsum propagates the NA to every row after it
  filter(!is.na(flag)) %>%                          # keep only the rows before the first failure
  ungroup() %>%                                     # forget the grouping
  select(-flag)                                     # remove the helper column
# # A tibble: 4 x 2
# id v
# <dbl> <dbl>
# 1 1 2
# 2 1 4
# 3 1 3
# 4 2 4
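A sketch of an even shorter variant uses dplyr's cumall() window function, which is TRUE only while the condition has held on every row so far within the group:
library(dplyr)
df %>%
  group_by(id) %>%
  filter(cumall(v < 5)) %>%  # drops the first failing row and everything after it, per id
  ungroup()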
Easy way:
Find the first FALSE with min(which(condition == FALSE)):
Patientdays <- cbind.data.frame(treatmentDate = c(1:5, 4, 6:10), date = c(2:5, 3, 6:10, 10), goodThings = c(1:11), badThings = c(0:10))
attach(Patientdays) # just for ease of use (optional)
condition <- treatmentDate < date & goodThings > badThings
Patientdays[1:(min(which(condition == FALSE)) - 1), ]
Edit: Adding result.
treatmentDate date goodThings badThings
1 1 2 1 0
2 2 3 2 1
3 3 4 3 2
4 4 5 4 3
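Note that this subsets only up to the first FALSE overall. For a per-id version in base R, a sketch (assuming Patientdays also carries an id column, which the toy data above does not) could combine ave() with cumprod():
# Hypothetical per-id version: cumprod() turns the first FALSE (0) and
# everything after it into 0 within each id
keep <- as.logical(ave(as.numeric(condition), Patientdays$id, FUN = cumprod))
Patientdays[keep, ]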

Finding unique values between rows of data frame and replacing them (R)

I have nested data, with ID numbers for within- and cluster level observations. Let's call them L1IDs and L2IDs.
L1ID <- c(1,2,3,4,5,6)
L2ID <- c(11,11,22,22,33,33)
And for both I have a number of variables. We'll call them L1X's and L2X's
L1X1 <- rnorm(6,3,1.1)
L1X2 <- rnorm(6,0,.7)
L2X1 <- c(0,1,1,1,0,0)
L2X2 <- c('Blue','Blue','Red','Red','Green','Red')
Combining the vectors into a dataframe:
df <- data.frame(L1ID,L2ID,L1X1,L1X2,L2X1,L2X2)
df
I have a problem. The values for L2IDs 11 and 33 are not identical within cluster. ID 11 has a 1 for the 2nd entry under L2X1 when it should be 0, and ID 33 has Red in the last entry for L2X2 when it should be Green.
L1X values should differ within a cluster, but the L2Xs should not. I need a way to search a large database by L2ID and find column values that are not identical, then replace them with a chosen value. Ideally, this would be a dataframe where each L2ID is a single row and each column is a logical vector that says TRUE or FALSE depending on whether all values in that column, for that L2ID, match, so I can then replace them all with the same value. So, for ID 11, I need to be able to see that L2X1 does not match for all subjects clustered within it, and that I can replace the 1 with a 0, but that the L2X2 values all match.
Does that make sense?
My actual dataset (licensed access so I cannot share) is rather large and manually searching this thing for where values do not match is a pain.
So far, my approach has been to eliminate all L1X variables, use dplyr's distinct() function to reduce each row to unique combinations of the L2X variables (each L2ID typically has 2 unique combinations), and then manually searching for discrepancies. Often it's a decimal point in the wrong place.
Update:
To make these sample data more representative of what I am working with, I changed L2X2 to a character vector and added in a 3rd L2ID. Also, I have nearly 200 columns and 9,000 L2IDs (and since most are doubled, it gets to be about 18,000 obs). I'm trying to find a way to avoid manually specifying each column when checking whether its values match. I tried something like the following:
df %>% group_by(L2ID) %>% sapply(identical())
But I have never used the identical() function in base R, so this didn't work, and I'm still working through what to do next. I appreciate the responses so far; I'm going to keep working through this as we go.
Here we check if L2X1 is consistent within each L2ID. You can easily add another column using this logic to check L2X2 as well (see the sketch after the output). We simply check whether the min and max value of each L2ID are equal; if they are not, we replace with the min value in L2X1_Fixed.
df %>% group_by(L2ID) %>% mutate(Test = ifelse(min(L2X1) == max(L2X1), TRUE, FALSE)) %>%
  mutate(L2X1_Fixed = ifelse(Test == FALSE, min(L2X1), L2X1))
# A tibble: 6 x 8
# Groups: L2ID [2]
L1ID L2ID L1X1 L1X2 L2X1 L2X2 Test L2X1_Fixed
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
1 1 11 2.355470 -1.53195614 0 13 FALSE 0
2 2 11 3.784859 0.20900278 0 13 FALSE 0
3 3 11 3.339077 -0.19772481 1 13 FALSE 0
4 4 22 2.512764 0.18222493 1 8 TRUE 1
5 5 22 1.176079 0.04175856 1 8 TRUE 1
6 6 22 3.688449 -0.42174624 1 9 TRUE 1
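As mentioned, the same logic extends to L2X2; a sketch (min() and max() also work on character vectors, comparing alphabetically):
df %>%
  group_by(L2ID) %>%
  mutate(Test_X2 = min(L2X2) == max(L2X2),                # TRUE when the column is consistent within the L2ID
         L2X2_Fixed = ifelse(!Test_X2, min(L2X2), L2X2))  # replace inconsistent values with the minimum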
I make no promises on performance, but this is one solution, which takes advantage of the rle (run length encoding) function in R. This, of course, assumes that the example data you provided properly implies that the value should be replaced with the most common value in that group.
> L1ID <- c(1,2,3,4,5,6)
> L2ID <- c(11,11,11,22,22,22)
> L1X1 <- rnorm(6,3,1.1)
> L1X2 <- rnorm(6,0,.7)
> L2X1 <- c(0,0,1,1,1,1)
> L2X2 <- c(13,13,13,8,8,9)
> df <- data.frame(L1ID,L2ID,L1X1,L1X2,L2X1,L2X2)
> df
L1ID L2ID L1X1 L1X2 L2X1 L2X2
1 1 11 1.9155828 0.287683782 0 13
2 2 11 2.8383669 -0.693942886 0 13
3 3 11 4.7517203 0.419193550 1 13
4 4 22 2.0092141 0.002223136 1 8
5 5 22 1.2546399 -0.457323727 1 8
6 6 22 0.8622906 0.255975868 1 9
> df %>%
group_by(L2ID) %>%
mutate(L2X1_r = rle(L2X1)$values[rle(L2X1)$lengths == max(rle(L2X1)$lengths)],
L2X2_r = rle(L2X2)$values[rle(L2X2)$lengths == max(rle(L2X2)$lengths)]) %>%
ungroup()
# A tibble: 6 x 8
L1ID L2ID L1X1 L1X2 L2X1 L2X2 L2X1_r L2X2_r
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11 1.9155828 0.287683782 0 13 0 13
2 2 11 2.8383669 -0.693942886 0 13 0 13
3 3 11 4.7517203 0.419193550 1 13 0 13
4 4 22 2.0092141 0.002223136 1 8 1 8
5 5 22 1.2546399 -0.457323727 1 8 1 8
6 6 22 0.8622906 0.255975868 1 9 1 8
Update
Based on the comments and the updated question, I've realized that rle won't work, because it assumes the "majority" value forms the longest run. This approach fixes that issue, and also introduces a way to avoid specifying every mutated column manually.
> L1ID <- c(1,2,3,4,5,6)
> L2ID <- c(11,11,22,22,33,33)
> L1X1 <- rnorm(6,3,1.1)
> L1X2 <- rnorm(6,0,.7)
> L2X1 <- c(0,1,1,1,0,0)
> L2X2 <- c('Blue','Blue','Red','Red','Green','Red')
> df <- data.frame(L1ID,L2ID,L1X1,L1X2,L2X1,L2X2, stringsAsFactors=F)
> df
L1ID L2ID L1X1 L1X2 L2X1 L2X2
1 1 11 4.058659 0.12423215 0 Blue
2 2 11 2.922632 0.30954205 1 Blue
3 3 22 2.719407 -0.33382402 1 Red
4 4 22 1.981046 -0.63617811 1 Red
5 5 33 2.570058 -1.39886373 0 Green
6 6 33 4.471551 -0.05489082 0 Red
> replace_with_right_value = function(col) {
+ tbl = table(col)
+ names(tbl)[tbl == max(tbl)]
+ }
> df %>%
group_by(L2ID) %>%
mutate_at(vars(matches('L2X')), replace_with_right_value) %>%
ungroup()
# A tibble: 6 x 6
L1ID L2ID L1X1 L1X2 L2X1 L2X2
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1 11 4.058659 0.12423215 0 Blue
2 2 11 2.922632 0.30954205 1 Blue
3 3 22 2.719407 -0.33382402 1 Red
4 4 22 1.981046 -0.63617811 1 Red
5 5 33 2.570058 -1.39886373 0 Green
6 6 33 4.471551 -0.05489082 0 Red
The replace_with_right_value function takes in a column and returns the most common element in that vector. mutate_at lets you specify which columns to operate on, here via vars(matches('L2X')); if your columns do not follow this pattern, you'll need to modify that string a bit. matches() accepts a regular expression, which should prove very helpful in this case. Note that for L2X1 when L2ID == 11, and for L2X2 when L2ID == 33, there is not enough information in the question or the data to determine which value to choose, so the function returns both. To force it to choose a value, such as the first one, change the function to return names(tbl)[tbl == max(tbl)][1].
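For example, a hypothetical tie-safe variant that always returns a single value (the first mode):
replace_with_right_value <- function(col) {
  tbl <- table(col)
  names(tbl)[which.max(tbl)]  # which.max() returns the first maximum, so ties resolve to the first value
}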

How to find first occurrence of a vector of numeric elements within a data frame column?

I have a data frame (min_set_obs) which contains two columns: the first containing numeric values, called treatment, and the second an id column called seq:
min_set_obs
Treatment seq
1 29
1 23
3 60
1 6
2 41
1 5
2 44
Let's say I have a vector of numeric values, called key:
key
[1] 1 1 1 2 2 3
I.e. a vector of three 1s, two 2s, and one 3.
How would I go about identifying which rows from my min_set_obs data frame contain the first occurrence of values from the key vector?
I'd like my output to look like this:
Treatment seq
1 29
1 23
3 60
1 6
2 41
2 44
I.e. the sixth row from min_set_obs was 'extra' (it was the fourth 1 when there should only be three 1s), so it would be removed.
I'm familiar with the %in% operator, but I don't think it can tell me the position of the first occurrence of the key vector in the first column of the min_set_obs data frame.
Thanks
Here is an option with base R, where we split 'min_set_obs' by 'Treatment' into a list, take the head of each list element using the corresponding frequency from 'key', and rbind the list elements into a single data.frame.
res <- do.call(rbind, Map(head, split(min_set_obs, min_set_obs$Treatment), n = table(key)))
row.names(res) <- NULL
res
# Treatment seq
#1 1 29
#2 1 23
#3 1 6
#4 2 41
#5 2 44
#6 3 60
Using dplyr, you can first count the keys using table and then take the corresponding top n rows from each group:
library(dplyr)
m <- table(key)
min_set_obs %>% group_by(Treatment) %>% do({
# as.character(.$Treatment[1]) returns the treatment for the current group
# use coalesce to get the default number of rows (0) if the treatment doesn't exist in key
head(., coalesce(m[as.character(.$Treatment[1])], 0L))
})
# A tibble: 6 x 2
# Groups: Treatment [3]
# Treatment seq
# <int> <int>
#1 1 29
#2 1 23
#3 1 6
#4 2 41
#5 2 44
#6 3 60
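A vectorised base R sketch along the same lines, which also preserves the original row order (it assumes, as here, that every Treatment value appears in key):
n_allowed <- table(key)                           # how many rows of each value key permits
occ <- ave(min_set_obs$Treatment,                 # running occurrence number within each Treatment
           min_set_obs$Treatment, FUN = seq_along)
min_set_obs[occ <= n_allowed[as.character(min_set_obs$Treatment)], ]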

Subset specific row and last row from data frame

I have a data frame which contains data relating to scores of different events. There can be a number of scoring events for one game. What I would like to do is to subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else that I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5; I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need anymore information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
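For example, to restore ID/Time order after the rbind:
Data2 <- Data2[order(Data2$ID, Data2$Time), ]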
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
FUN=function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11
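A sketch of a more compact single pipeline for the same idea, using row_number() and n() inside filter() (this assumes rows are already ordered by Time within each ID, as in the example):
library(dplyr)
Data %>%
  group_by(ID) %>%
  filter(abs(Score) > 5 | row_number() == n()) %>%  # score outside [-5, 5], or the last row of the group
  ungroup()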
