Finding unique values between rows of a data frame and replacing them (R)

I have nested data, with ID numbers for within- and cluster level observations. Let's call them L1IDs and L2IDs.
L1ID <- c(1,2,3,4,5,6)
L2ID <- c(11,11,22,22,33,33)
And for both I have a number of variables. We'll call them L1X's and L2X's
L1X1 <- rnorm(6,3,1.1)
L1X2 <- rnorm(6,0,.7)
L2X1 <- c(0,1,1,1,0,0)
L2X2 <- c('Blue','Blue','Red','Red','Green','Red')
Combining the vectors into a dataframe:
df <- data.frame(L1ID,L2ID,L1X1,L1X2,L2X1,L2X2)
df
I have a problem: the values for L2IDs 11 and 33 are not identical within cluster. ID 11 has a 1 in the 2nd entry of L2X1 when it should be 0, and ID 33 has Red in the last entry of L2X2 when it should be Green.
L1X values should differ within a cluster, but the L2Xs should not. I need a way to search a large database by L2ID, find columns whose values are not identical within a cluster, and then replace them with a chosen value. Ideally, the result would be a data frame with one row per L2ID and, for each column, a logical value that is TRUE or FALSE depending on whether all values in that column match for that L2ID; I could then replace the mismatches with a single value. So for ID 11, I need to be able to see that L2X1 does not match for all subjects clustered within it (and replace the 1 with a 0), but that the L2X2 values all match.
Does that make sense?
My actual dataset (licensed access so I cannot share) is rather large and manually searching this thing for where values do not match is a pain.
So far, my approach has been to eliminate all L1X variables, use dplyr's distinct() function to reduce each row to unique combinations of the L2X variables (each L2ID typically has 2 unique combinations), and then manually searching for discrepancies. Often it's a decimal point in the wrong place.
Update:
To make these sample data more representative of what I'm working with, I changed L2X2 to a character vector and added a 3rd L2ID. Also, I have nearly 200 columns and 9,000 L2IDs (and since most are doubled, that comes to about 18,000 observations). I'm trying to find a way to avoid manually specifying each column when checking whether its values match. I tried something like the following:
df %>% group_by(L2ID) %>% sapply(identical())
But I have never used the identical() function in base R, so this didn't work, and I'm still working through what to do next. I appreciate the responses so far; I'm going to keep working at this as we go.
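For reference, a minimal sketch of the per-L2ID TRUE/FALSE table described above (assuming dplyr 1.0+ for across(), and that all cluster-level columns share the L2X prefix):
library(dplyr)

# One row per L2ID; TRUE means the column is constant within that cluster
df %>%
  group_by(L2ID) %>%
  summarise(across(starts_with("L2X"), ~ n_distinct(.x) == 1))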

Here we check whether L2X1 is consistent within each L2ID. You can easily add another column using the same logic to check L2X2 as well. We simply check whether the min and max value within each L2ID are equal; if they are not, we replace the value with the group minimum in L2X1_Fixed.
df %>%
  group_by(L2ID) %>%
  mutate(Test = min(L2X1) == max(L2X1)) %>%
  mutate(L2X1_Fixed = ifelse(!Test, min(L2X1), L2X1))
# A tibble: 6 x 8
# Groups: L2ID [2]
L1ID L2ID L1X1 L1X2 L2X1 L2X2 Test L2X1_Fixed
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
1 1 11 2.355470 -1.53195614 0 13 FALSE 0
2 2 11 3.784859 0.20900278 0 13 FALSE 0
3 3 11 3.339077 -0.19772481 1 13 FALSE 0
4 4 22 2.512764 0.18222493 1 8 TRUE 1
5 5 22 1.176079 0.04175856 1 8 TRUE 1
6 6 22 3.688449 -0.42174624 1 9 TRUE 1
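To avoid writing a Test/Fixed pair by hand for every column, the same min/max check can be generalized across columns (a sketch, assuming dplyr 1.0+ for across(); min() and max() work on character columns too):
df %>%
  group_by(L2ID) %>%
  mutate(across(starts_with("L2X"), ~ min(.x) == max(.x), .names = "{.col}_match")) %>%
  ungroup()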

I make no promises about performance, but here is one solution, which takes advantage of R's rle (run length encoding) function. This, of course, assumes the example data properly imply that the value should be replaced with the most common value in that group.
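As a quick illustration of what rle() returns, take the run 0, 0, 1:
rle(c(0, 0, 1))
# Run Length Encoding
#   lengths: int [1:2] 2 1
#   values : num [1:2] 0 1
The code below keeps the value whose run is longest, here 0 with a run length of 2.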
> L1ID <- c(1,2,3,4,5,6)
> L2ID <- c(11,11,11,22,22,22)
> L1X1 <- rnorm(6,3,1.1)
> L1X2 <- rnorm(6,0,.7)
> L2X1 <- c(0,0,1,1,1,1)
> L2X2 <- c(13,13,13,8,8,9)
> df <- data.frame(L1ID,L2ID,L1X1,L1X2,L2X1,L2X2)
> df
L1ID L2ID L1X1 L1X2 L2X1 L2X2
1 1 11 1.9155828 0.287683782 0 13
2 2 11 2.8383669 -0.693942886 0 13
3 3 11 4.7517203 0.419193550 1 13
4 4 22 2.0092141 0.002223136 1 8
5 5 22 1.2546399 -0.457323727 1 8
6 6 22 0.8622906 0.255975868 1 9
> df %>%
group_by(L2ID) %>%
mutate(L2X1_r = rle(L2X1)$values[rle(L2X1)$lengths == max(rle(L2X1)$lengths)],
L2X2_r = rle(L2X2)$values[rle(L2X2)$lengths == max(rle(L2X2)$lengths)]) %>%
ungroup()
# A tibble: 6 x 8
L1ID L2ID L1X1 L1X2 L2X1 L2X2 L2X1_r L2X2_r
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11 1.9155828 0.287683782 0 13 0 13
2 2 11 2.8383669 -0.693942886 0 13 0 13
3 3 11 4.7517203 0.419193550 1 13 0 13
4 4 22 2.0092141 0.002223136 1 8 1 8
5 5 22 1.2546399 -0.457323727 1 8 1 8
6 6 22 0.8622906 0.255975868 1 9 1 8
Update
Based on the comments and the updated question, I've realized that rle won't work, because it assumes the "majority" value occurs in a single long run. This approach fixes that issue, and also introduces a way to avoid specifying every mutated column by hand.
> L1ID <- c(1,2,3,4,5,6)
> L2ID <- c(11,11,22,22,33,33)
> L1X1 <- rnorm(6,3,1.1)
> L1X2 <- rnorm(6,0,.7)
> L2X1 <- c(0,1,1,1,0,0)
> L2X2 <- c('Blue','Blue','Red','Red','Green','Red')
> df <- data.frame(L1ID,L2ID,L1X1,L1X2,L2X1,L2X2, stringsAsFactors=F)
> df
L1ID L2ID L1X1 L1X2 L2X1 L2X2
1 1 11 4.058659 0.12423215 0 Blue
2 2 11 2.922632 0.30954205 1 Blue
3 3 22 2.719407 -0.33382402 1 Red
4 4 22 1.981046 -0.63617811 1 Red
5 5 33 2.570058 -1.39886373 0 Green
6 6 33 4.471551 -0.05489082 0 Red
> replace_with_right_value = function(col) {
+ tbl = table(col)
+ names(tbl)[tbl == max(tbl)]
+ }
> df %>%
  group_by(L2ID) %>%
  mutate_at(vars(matches('L2X')), replace_with_right_value) %>%
  ungroup()
# A tibble: 6 x 6
L1ID L2ID L1X1 L1X2 L2X1 L2X2
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1 11 4.058659 0.12423215 0 Blue
2 2 11 2.922632 0.30954205 1 Blue
3 3 22 2.719407 -0.33382402 1 Red
4 4 22 1.981046 -0.63617811 1 Red
5 5 33 2.570058 -1.39886373 0 Green
6 6 33 4.471551 -0.05489082 0 Red
The replace_with_right_value function takes a column and returns the most common element of that vector. mutate_at lets you specify which columns to transform, here via vars(matches('L2X')); if your columns do not follow this pattern, you'll need to modify that string. matches() accepts a regular expression, which should prove very helpful in this case. For L2ID == 11 there is not enough information in the question or the data to determine which value to choose for L2X1 (likewise for L2X2 when L2ID == 33), so the function returns both candidates. To force it to choose a single value, such as the first one, change the function to return names(tbl)[tbl == max(tbl)][1]
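As a side note, in dplyr 1.0+ across() supersedes mutate_at(); the same call can be sketched as:
df %>%
  group_by(L2ID) %>%
  mutate(across(matches('L2X'), replace_with_right_value)) %>%
  ungroup()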

Related

How to split a dataframe into multiple at the same time based on the value of one column

crop.genos <- data.frame(crop=rep(1:6, each=4),genos=rep(1:4, 6))
crop.genos$crop.genotype <- paste(crop.genos$crop, crop.genos$genos, sep="")
Here I have a data frame with three columns: crop, genos, crop.genotype. I want to get six different data frames based on the crop category (such as the example below), with all the remaining columns retained:
crop genos crop.genotype
1 1 1 11
2 1 2 12
3 1 3 13
Use split:
l <- split(crop.genos, crop.genos$crop)
names(l) <- paste0('df', names(l))
list2env(l, envir = .GlobalEnv)
output
> df1
# crop genos crop.genotype
#1 1 1 11
#2 1 2 12
#3 1 3 13
#4 1 4 14
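As an aside, it is often easier to keep the list rather than create objects in the global environment; each piece can be reached by name:
l <- split(crop.genos, crop.genos$crop)
l[["1"]]  # the rows where crop == 1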

rolling function with variable width R

I need to summarize some data using a rolling window of varying width and shift. In particular, I need to apply a function (e.g. sum) over values recorded on different intervals.
Here is an example data frame:
df <- tibble(days = c(0,1,2,3,1),
value = c(5,7,3,4,2))
df
# A tibble: 5 x 2
days value
<dbl> <dbl>
1 0 5
2 1 7
3 2 3
4 3 4
5 1 2
The columns indicate:
days: how many days elapsed since the previous observation (the first value is 0 because there is no previous observation).
value: the value I need to aggregate.
Now, let's assume that I need to sum the field value every 4 days, shifting 1 day at a time.
I need something along these lines:
days value roll_sum rows_to_sum
0 5 15 1,2,3
1 7 10 2,3
2 3 3 3
3 4 6 4,5
1 2 NA NA
The column rows_to_sum has been added to make the logic clear.
Here are more details:
The first value (15) is the sum of the first 3 rows, because 0+1+2 = 3, which is within the reference value of 4, while adding the next row (days = 3) would bring the total day count to 6, which is more than 4.
The second value (10) is the sum of rows 2 and 3. Excluding the first row (since we are shifting one day), we sum only rows 2 and 3, because including row 4 would bring the total day count to 1+2+3 = 6, which is more than 4.
...
How can I achieve this?
Thank you
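To make the window arithmetic concrete, here is the check for row 1 spelled out in base R (a quick sketch):
days <- c(0, 1, 2, 3, 1)
cumsum(days)              # 0 1 3 6 7
which(cumsum(days) <= 4)  # 1 2 3 -> rows 1 to 3 fall inside the 4-day window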
Here is one way:
library(dplyr)
library(purrr)
df %>%
  mutate(roll_sum = map_dbl(row_number(), ~ {
    # which() returns integer(0) when no rows fit the 4-day window,
    # so guard on length rather than is.na()
    idx <- which(cumsum(days[.x:n()]) <= 4)
    if (length(idx) == 0) NA_real_ else sum(value[.x:(.x + max(idx) - 1)])
  }))
# days value roll_sum
# <dbl> <dbl> <dbl>
#1 0 5 15
#2 1 7 10
#3 2 3 3
#4 3 4 6
#5 1 2 2
Performing this calculation in base R:
sapply(seq_len(nrow(df)), function(x) {
  idx <- which(cumsum(df$days[x:nrow(df)]) <= 4)
  # guard on length: max(integer(0)) would return -Inf, not NA
  if (length(idx) == 0) NA_real_ else sum(df$value[x:(x + max(idx) - 1)])
})

How can I get sum of value from a column based on condition mixed with second list using R?

When I try to do an operation on the two lists I get error messages and the calculation does not work properly (see the end of the question).
list2 <- list2 %>%
mutate(sum_of_part = sum(list1$part[(list1$id < list2$id) & (list1$id >= lag(list2$id))]))
So what I want to do is:
Get the sum of "part" of all rows in list1 where the "id" is between the "id" of the current row in list2 and the "id" of the row before.
I also want to count the number of rows which are used to calculate the column sum_of_parts.
list1
id Part ...
1 2
2 3
3 4
4 6
99 11
100 11
191 11
222 11
333 11
list2
id ...
1
3
4
88
99
solution
id ... sum_of_parts count
1 ... 2 1
3 ... 9 3
4 ... 10 2
88 ... 6 1
99 ... 11 1
But because my list2 is a lot smaller than my list1, I get errors like this (there are some more, but they look almost the same):
In list1$id < list2$id : longer object length is not a multiple of shorter object length
You were really close, this one gets me all the time!
mutate operates by group, I believe, so if you haven't specified a group it will try to use the whole column in a vectorised operation (which is usually more efficient), hence the error about different lengths.
If you want to operate on each row, you can use rowwise() to make the following calculations treat each row as a group. Then id will be a length-one vector in the mutate call.
Note that we need to compute the lag before grouping; otherwise, by the logic above, there would be no previous id within a length-one group.
library(dplyr)
list1 <- readr::read_csv(
'id,part
1,2
2,3
3,4
4,6
99,11
100,11
191,11
222,11
333,11')
list2 <- readr::read_csv(
'id
1
3
4
88
99'
)
list2 %>%
mutate(lag_id = lag(id, default = 0)) %>%
rowwise() %>%
mutate(sum_of_part = sum(list1$part[(list1$id <= id) & (list1$id > lag_id)]),
count = length(list1$part[(list1$id <= id) & (list1$id > lag_id)])) %>%
select(-lag_id)
#> Source: local data frame [5 x 3]
#> Groups: <by row>
#>
#> # A tibble: 5 x 3
#> id sum_of_part count
#> <int> <int> <int>
#> 1 1 2 1
#> 2 3 7 2
#> 3 4 6 1
#> 4 88 0 0
#> 5 99 11 1
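A vectorised alternative (a sketch using base R's cut(); note that empty intervals come back as NA rather than 0, unlike the dplyr result above):
# bin list1$id into the right-closed intervals defined by list2$id
bins <- cut(list1$id, breaks = c(0, list2$id), labels = list2$id)
tapply(list1$part, bins, sum)     # sum_of_part for each list2 id
tapply(list1$part, bins, length)  # count for each list2 id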

How to find first occurrence of a vector of numeric elements within a data frame column?

I have a data frame (min_set_obs) which contains two columns: the first containing numeric values, called Treatment, and the second an id column called seq:
min_set_obs
Treatment seq
1 29
1 23
3 60
1 6
2 41
1 5
2 44
Let's say I have a vector of numeric values, called key:
key
[1] 1 1 1 2 2 3
I.e. a vector of three 1s, two 2s, and one 3.
How would I go about identifying which rows from my min_set_obs data frame contain the first occurrence of values from the key vector?
I'd like my output to look like this:
Treatment seq
1 29
1 23
3 60
1 6
2 41
2 44
I.e. the sixth row from min_set_obs was 'extra' (it was the fourth 1 when there should only be three 1s), so it would be removed.
I'm familiar with the %in% operator, but I don't think it can tell me the position of the first occurrence of the key vector in the first column of the min_set_obs data frame.
Thanks
Here is an option with base R: we split 'min_set_obs' by 'Treatment' into a list, take the head of each list element using the corresponding frequency in 'key', and rbind the list elements back into a single data.frame.
res <- do.call(rbind, Map(head, split(min_set_obs, min_set_obs$Treatment), n = table(key)))
row.names(res) <- NULL
res
# Treatment seq
#1 1 29
#2 1 23
#3 1 6
#4 2 41
#5 2 44
#6 3 60
Using dplyr, you can first count the keys using table and then take the corresponding top n rows from each group:
library(dplyr)
m <- table(key)
min_set_obs %>% group_by(Treatment) %>% do({
# as.character(.$Treatment[1]) returns the treatment for the current group
# use coalesce to get the default number of rows (0) if the treatment doesn't exist in key
head(., coalesce(m[as.character(.$Treatment[1])], 0L))
})
# A tibble: 6 x 2
# Groups: Treatment [3]
# Treatment seq
# <int> <int>
#1 1 29
#2 1 23
#3 1 6
#4 2 41
#5 2 44
#6 3 60
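With current dplyr, do() is superseded; the same idea can be written with slice() (a sketch that assumes every Treatment level appears in key, so the table lookup never returns NA):
library(dplyr)
m <- table(key)
min_set_obs %>%
  group_by(Treatment) %>%
  slice(seq_len(min(n(), m[as.character(Treatment[1])]))) %>%
  ungroup()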

Subset specific row and last row from data frame

I have a data frame which contains data relating to the score of different events. There can be a number of scoring events for one game. What I would like to do is subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else that I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset the rows where the value goes above 5 or below -5; I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need anymore information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
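For example, to restore the original ordering:
Data2 <- Data2[order(Data2$ID, Data2$Time), ]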
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
FUN=function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
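If ave() is unfamiliar: it applies FUN within groups and returns a vector aligned with the input (a quick illustration):
ave(c(1, 2, 3, 4), c("a", "a", "b", "b"), FUN = max)
# [1] 2 2 4 4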
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11
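As a side note, in dplyr 1.0+ slice_max() supersedes top_n(); the last-row step could be sketched as:
lastrows <- Data %>% group_by(ID) %>% slice_max(Time, n = 1)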
