Filtering data in a data frame - r

I have a data frame that looks like this:
S1State S1Value S2State S2Value
NSW 20 VIC 30
WA 30 NSW 20
For each row, I would like to select the state (from S1State and S2State) that has the maximum value (from S1Value and S2Value). The result should look like this:
SState SValue
VIC 30
WA 30
I am new to R and have been experimenting with dplyr.

The answer I was hinting at is as follows:
library(dplyr)
dt <- read.table(text = "S1State S1Value S2State S2Value
NSW 20 VIC 30
WA 30 NSW 20",
header = TRUE, stringsAsFactors = FALSE)
answer = dt %>%
mutate(SState = ifelse(S1Value > S2Value, S1State, S2State),
SValue = ifelse(S1Value > S2Value, S1Value, S2Value)) %>%
select(SState, SValue)

Just to show that this is far from impossible with standard R tools:
nams <- c("State","Value")
tmp <- reshape(dt, direction="long", varying=lapply(nams, grep, x=names(dt)),
v.names=nams, timevar=NULL)
tmp[with(tmp, Value == ave(Value, id, FUN=max)),]
# State Value id
#2.1 WA 30 2
#1.2 VIC 30 1

I assume that the OP may have more states in the data frame, such as S3State, S4State, ...
The following solutions are based on this assumption and are meant to handle more than two states. If there are only two states, the approach proposed by @lebelinoz is simple and straightforward.
Solution 1
A solution using functions from dplyr and tidyr. dt2 is the final output.
# Load packages
library(dplyr)
library(tidyr)
# Process the data
dt2 <- dt %>%
gather(Num, Value, contains("Value")) %>%
gather(State, Name, contains("State")) %>%
# Only keep records with the same state number
filter(substring(Num, 1, 2) == substring(State, 1, 2)) %>%
mutate(Group = substring(Num, 1, 2)) %>%
group_by(Group) %>%
filter(Value == max(Value)) %>%
ungroup() %>%
select(SState = Name, SValue = Value)
Solution 2
A solution using functions from dplyr, purrr, and stringr. I loaded the tidyverse package to attach dplyr and purrr. Again, dt2 is the final output.
# Load packages
library(tidyverse)
library(stringr)
# Extract the column names
Col <- colnames(dt)
# Extract state numbers
ColNum <- Col %>%
str_extract(pattern = "[0-9]") %>%
unique()
# Design a function to process the data
dt_process <- function(pattern, dt){
dt2 <- dt %>%
# Extract columns based on a pattern (numbers)
select(dplyr::contains(pattern)) %>%
# Rename the columns
rename_all(~sub(pattern, "", .)) %>%
# Filter the maximum row
filter(SValue == max(SValue))
return(dt2)
}
# Apply the dt_process function
dt_list <- map(.x = ColNum, .f = dt_process, dt = dt)
# Bind all data frames
dt2 <- bind_rows(dt_list) %>% arrange(SState)
Data Preparation
# Create example data frame
dt <- read.table(text = "S1State S1Value S2State S2Value
NSW 20 VIC 30
WA 30 NSW 20",
header = TRUE, stringsAsFactors = FALSE)
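If the number of State/Value pairs is not known in advance, a per-row lookup in base R is another option. The following is a minimal sketch rather than part of the answers above. It assumes that every SnValue column has a matching SnState column, and the helper names (value_cols, state_cols, best, idx) are made up for illustration.

```r
dt <- read.table(text = "S1State S1Value S2State S2Value
NSW 20 VIC 30
WA 30 NSW 20",
                 header = TRUE, stringsAsFactors = FALSE)

# Pair every *Value column with the *State column that shares its prefix
value_cols <- grep("Value$", names(dt), value = TRUE)
state_cols <- sub("Value$", "State", value_cols)

vals   <- as.matrix(dt[value_cols])
states <- as.matrix(dt[state_cols])

# Column position of the largest value in each row (first one wins on ties)
best <- max.col(vals, ties.method = "first")
idx  <- cbind(seq_len(nrow(dt)), best)

# Matrix indexing picks one (row, column) cell per row
result <- data.frame(SState = states[idx], SValue = vals[idx],
                     stringsAsFactors = FALSE)
result
```

Because the maximum is taken within each row, the result keeps one line per input row, whichever column group the winning value comes from.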

Related

Finding the location of half the max value in a column

I have a list of data frames, and I'd like to apply a function to that list to find the location in the "julian" column that corresponds to half the max value in the "total_cover" column. Here's some data that represents the data I have:
df1 <- data.frame(julian = c(81,85,88,97,101,104,126,167),
total_cover = c(43,52,75,92,94,97,188,172))
df2 <- data.frame(julian = c(81,85,88,97,101,104,126,167),
total_cover = c(30,55,73,80,75,85,138,154))
df3 <- data.frame(julian = c(107,111,115,119,123,129,131,133,135,137),
total_cover = c(36,41,43,47,55,55,55,65,75,80))
data.list <- list(df1=df1,df2=df2,df3=df3)
The code below is what I've tried, but I'm not getting the correct output. It doesn't seem to find the julian day that corresponds to half the max value:
unlist(lapply(X = data.list, FUN = function(x){
x[which.max(x[["total_cover"]] >= which.max(x[["total_cover"]])/2), "julian"]
}))
output:
df1 df2 df3
81 81 107
My ideal output would be what's shown below, with the julian dates that correspond to >= max(total_cover)/2
df1 df2 df3
101 97 111
Using R 4.2.2
I believe the following answers the question.
sapply(data.list, \(x) {
half_max <- max(x$total_cover)/2
d <- abs(x$total_cover - half_max)
is.na(d) <- x$total_cover < half_max
x$julian[which.min(d)]
})
#> df1 df2 df3
#> 101 97 111
Created on 2022-12-13 with reprex v2.0.2
find_julian <- function(df){
#calculate the distance from half of the maximum
distance <- df[["total_cover"]]- max(df[["total_cover"]])/2
#find smallest value greater than half of the maximum and select corresponding julian
df[distance==min(distance[distance>=0]),"julian"]
}
unlist(lapply(X = data.list, FUN = find_julian))
df1 df2 df3
101 97 111
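Here is a step-by-step dplyr solution. The main point is that the difference between total_cover and half the maximum is sometimes negative, and those rows have to be removed first.
The result of
df1 df2 df3
81 81 107
occurs because the original code compares total_cover with which.max(total_cover)/2, which is half the position of the maximum rather than half its value. Every row passes that test, so the first julian value is returned.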
Long version:
library(dplyr)
bind_rows(data.list, .id = 'id') %>%
group_by(id) %>%
mutate(x = (max(total_cover)/2)) %>%
mutate(y = total_cover-x) %>%
filter(y >=0) %>%
filter(y == min(y)) %>%
select(1:2) %>%
pull(julian, name = id)
Or a little shorter:
bind_rows(data.list, .id = 'id') %>%
group_by(id) %>%
filter(total_cover-(max(total_cover)/2) >=0) %>%
filter(total_cover == min(total_cover)) %>%
select(1:2) %>%
pull(julian, name = id)
result:
df1 df2 df3
101 97 111
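The smallest change to the code in the question is to compare against half of the maximum value instead of half of its position. The sketch below does that, with half_max_julian as an illustrative name. It returns the first row at or above half the maximum, which matches the answers above for this data but could differ from the "closest value" approach if total_cover is not increasing.

```r
df1 <- data.frame(julian = c(81,85,88,97,101,104,126,167),
                  total_cover = c(43,52,75,92,94,97,188,172))
df2 <- data.frame(julian = c(81,85,88,97,101,104,126,167),
                  total_cover = c(30,55,73,80,75,85,138,154))
df3 <- data.frame(julian = c(107,111,115,119,123,129,131,133,135,137),
                  total_cover = c(36,41,43,47,55,55,55,65,75,80))
data.list <- list(df1 = df1, df2 = df2, df3 = df3)

half_max_julian <- sapply(data.list, function(x) {
  # max() gives the largest value; which.max() on a logical vector
  # gives the position of the first TRUE
  x$julian[which.max(x$total_cover >= max(x$total_cover) / 2)]
})
half_max_julian
```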

How can I search in a data frame if all possible combinations exist in another data frame in R using dplyr?

I have two data frames.
The first one contains all the possible combinations with their corresponding values and looks like this:
first second val
A B 10
A C 20
A D 30
B C 40
B D 50
C D 60
H I 70
The second one comes from the production line and has two columns: a date column, and a var column in which all the variables recorded on that date are concatenated:
date var
2022-01-01 A
2022-02-01 B,C,F,E,G,H,I
I want to find all the combinations in the second data frame and see whether they match any combination in the first data frame. If a variable stands alone in the second data frame, as A does on 2022-01-01, the result should be 0; otherwise it should be the value of the matching combination.
Ideally I want the resulting data frame to look like this:
date comb val
2022-01-01 A 0
2022-02-01 B,C 40
2022-02-01 H,I 70
How can I do this in R using dplyr?
library(tidyverse)
first = c("A","A","A","B","B","C","H")
second = c("B","C","D","C","D","D","I")
val = c(10,20,30,40,50,60,70)
df1 = tibble(first,second,val);df1
date = c(as.Date("2022-01-01"),as.Date("2022-02-01"))
var = c("A","B,C,F,E,G,H,I")
df2 = tibble(date,var);df2
Using tidyverse:
library(tidyverse)
first = c("A","A","A","B","B","C","H")
second = c("B","C","D","C","D","D","I")
val = c(10,20,30,40,50,60,70)
df1 = tibble(first,second,val);df1
date = c(as.Date("2022-01-01"),as.Date("2022-02-01"))
var = c("A","B,C,F,E,G,H,I")
df2 = tibble(date,var);df2
df2_tidy <- df2 %>%
mutate(first = str_split(var, ","),
second = first) %>%
unnest(first) %>%
unnest(second) %>%
select(-var)
singles <- df2 %>%
filter(!str_detect(var, ",")) %>%
mutate(val = 0) %>%
select(date, comb = var, val)
combs <- df1 %>%
inner_join(df2_tidy, by = c("first", "second")) %>%
mutate(comb = paste(first, second, sep = ",")) %>%
select(date, comb, val)
bind_rows(singles, combs)
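The same matching can also be done without unnesting every pair. The following base R sketch is an alternative, not the answer's method. It assumes that a pair counts as present when both of its members occur in var, that a date with a single variable gets val 0, and that a date with several variables but no matching pair contributes no rows. The helper name match_pairs is made up for illustration.

```r
df1 <- data.frame(first  = c("A","A","A","B","B","C","H"),
                  second = c("B","C","D","C","D","D","I"),
                  val    = c(10,20,30,40,50,60,70),
                  stringsAsFactors = FALSE)
df2 <- data.frame(date = as.Date(c("2022-01-01","2022-02-01")),
                  var  = c("A","B,C,F,E,G,H,I"),
                  stringsAsFactors = FALSE)

match_pairs <- function(i) {
  items <- strsplit(df2$var[i], ",", fixed = TRUE)[[1]]
  # A single variable cannot form a pair, so report it with value 0
  if (length(items) < 2) {
    return(data.frame(date = df2$date[i], comb = df2$var[i], val = 0,
                      stringsAsFactors = FALSE))
  }
  # Keep the rows of df1 whose two members both occur on this date
  hit <- df1$first %in% items & df1$second %in% items
  data.frame(date = rep(df2$date[i], sum(hit)),
             comb = paste(df1$first[hit], df1$second[hit], sep = ","),
             val  = df1$val[hit],
             stringsAsFactors = FALSE)
}

result <- do.call(rbind, lapply(seq_len(nrow(df2)), match_pairs))
result
```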

Shifting Values in R in rows

I have a problem that sounds easy, but I could not find a solution in R. I would like to shift the values in each row to the left so that they start at the year of release. The first column holds the release (delivery) year, and the Year.* columns hold the number of devices that broke in each year.
This is a solution in Python:
def f(x):
    shifted = np.argmin((x.index.astype(int) < x.name[0]))
    return x.shift(-shifted)
df = df.set_index(['Delivery Year', 'Freq']).apply(f, axis=1)
df.columns = [f'Year.{i + 1}' for i in range(len(df.columns))]
df = df.reset_index()
df
I would like to have it in R too.
# TEST
data <- data.frame(
`Delivery Year` = c('1976','1977','1978','1979'),
`Freq` = c(120,100,80,60),
`Year.1976` = c(10,NA,NA,NA),
`Year.1977` = c(5,3,NA,NA),
`Year.1978` = c(10,NA,8,NA),
`Year.1979` = c(13,10,5,14)
)
data
# DESIRED
data <- data.frame(
`Delivery Year` = c('1976','1977','1978','1979'),
`Freq` = c(120,100,80,60),
`Year.1` = c(10,3,8,14),
`Year.2` = c(5,NA,5,NA),
`Year.3` = c(10,10,NA,NA),
`Year.4` = c(13,NA,NA,NA)
)
data
In addition, would it be also possible to transform the number of broken devices into the percentage of Freq column?
Thank you
Using tidyverse
library(tidyverse)
data %>%
  pivot_longer(!c(Delivery.Year, Freq)) %>%
  # Split "Year.1976" into a label and the year itself
  separate(name, c("Lab", "Year")) %>%
  select(-Lab) %>%
  mutate_all(as.numeric) %>%
  # Drop the years before the delivery year
  filter(Year >= Delivery.Year) %>%
  group_by(Delivery.Year, Freq) %>%
  mutate(ind = paste0("Year.", row_number()),
         per = value/Freq) %>%
  ungroup() %>%
  pivot_wider(id_cols = c(Delivery.Year, Freq), names_from = ind, values_from = c(value, per))
I first pivoted the data into long form and separated the original column names (Year.1976, Year.1977, etc.) to get just the year, dropping the "Year" label. Then I converted all columns to numeric to allow operations such as filtering for Year >= Delivery.Year. Next I created a column holding the names you requested (Year.1, Year.2, etc.) and calculated the percentage. Finally I used pivot_wider to get the format you requested. I was unsure whether you wanted both the original values and the percentages or just the percentages. If you only want the percentages, values_from = per should do it.
library(dplyr)
f <- function(df) {
years <- paste0("Year.",sort(as.vector(na.omit(as.integer(stringr::str_extract(colnames(df), "\\d+"))))))
df1 <- df %>% select(years)
df2 <- df %>% select(-years)
val <- c()
firstyear <- years[1]
for (k in 1:nrow(df1) ) {
vec <- as.numeric(as.vector(df1[k,]))
val[k] <- (as.numeric(suppressWarnings(na.omit(vec))))[1]
}
df1[firstyear] <- val
colnames(df1) <- c(paste0("Year.",seq(1:ncol(df1))))
df <- cbind(df2,df1)
print(df)
}
> f(data)
Delivery.Year Freq Year.1 Year.2 Year.3 Year.4
1 1976 120 10 5 10 13
2 1977 100 3 3 NA 10
3 1978 80 8 NA 8 5
4 1979 60 14 NA NA 14
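Note that the output printed above does not match the desired layout: in the second row, 3 appears in both Year.1 and Year.2. A base R sketch that actually shifts each row to the left is shown below. It assumes the year columns are named Year.<yyyy> and are already in ascending order; the names year_cols, col_years, shifted and pct are made up for illustration.

```r
data <- data.frame(
  Delivery.Year = c("1976", "1977", "1978", "1979"),
  Freq = c(120, 100, 80, 60),
  Year.1976 = c(10, NA, NA, NA),
  Year.1977 = c(5, 3, NA, NA),
  Year.1978 = c(10, NA, 8, NA),
  Year.1979 = c(13, 10, 5, 14),
  stringsAsFactors = FALSE
)

year_cols <- grep("^Year\\.", names(data), value = TRUE)
col_years <- as.integer(sub("^Year\\.", "", year_cols))

shifted <- t(vapply(seq_len(nrow(data)), function(i) {
  row  <- unlist(data[i, year_cols], use.names = FALSE)
  # Keep the cells from the delivery year onwards, then pad with NA
  keep <- row[col_years >= as.integer(data$Delivery.Year[i])]
  c(keep, rep(NA_real_, length(row) - length(keep)))
}, numeric(length(year_cols))))
colnames(shifted) <- paste0("Year.", seq_along(year_cols))

result <- cbind(data[c("Delivery.Year", "Freq")], shifted)

# Same layout expressed as a percentage of Freq (each row divided by its Freq)
pct <- cbind(data[c("Delivery.Year", "Freq")],
             round(100 * shifted / data$Freq, 1))
result
```

Gaps inside a row (such as the NA for 1978 in the 1977 row) are kept in place; only the leading years before delivery are dropped.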

Addressing another dataframe with dynamic variables in R (ideally dplyr, using mutate)

I have already read a variety of threads on dynamically named variables, but I couldn't quite find an answer.
I have two dataframes.
df <- data.frame(qno=c(1,2,3,4))
ref <- data.frame(Q1 = c(1:20),Q2 = c(21:40),Q3=c(41:60),Q4 = c(61:80))
Now I want to create another column 'average' in the df dataframe that holds, for each qno, the mean of the corresponding column in ref (Q1 for qno 1, and so on).
Intended output:
df <- data.frame(qno=c(1,2,3,4), average = c(10.5,30.5,50.5,70.5))
Here is what I have tried:
df <- df %>%
mutate(average := mean(!!as.name(paste0("ref$Q",qno)))
I have also tried a version with a for loop, but that didn't work either.
for (i in 1:length(df$qno)){
df$average[i] <- mean(as.name(paste0("ref$Q",df$qno[i])))
}
df <- df %>%
mutate(average = mean(as.name(paste0("ref$Q",qno))))
Here it is with mutate:
df %>% mutate(average = t(ref %>% summarise(across(everything(), ~mean(.x, na.rm = TRUE)))))
qno average
1 1 10.5
2 2 30.5
3 3 50.5
4 4 70.5
But you can use it without mutate entirely if you want the names from ref:
t(ref %>% summarise(across(everything(), list(mean), .names = "{.col}"))) %>%
data.frame() %>%
rename(average = 1)
average
Q1 10.5
Q2 30.5
Q3 50.5
Q4 70.5
Does this solve your problem?
ref <- data.frame(Q1 = c(1:20),Q2 = c(21:40),Q3=c(41:60),Q4 = c(61:80))
out <- data.frame(qno=c(1,2,3,4), average = c(10.5,30.5,50.5,70.5))
df <- data.frame(qno=c(1:length(ref)))
for (i in seq_along(ref)) {
df$average[i] <- mean(ref[[i]], na.rm = TRUE)
}
I was not sure whether you want the rows labelled like the variables in ref. If so, you could use this when you create the df object:
df <- data.frame(qno = paste0("Q", c(1:length(ref))))
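The attempts in the question fail because as.name("ref$Q1") creates a symbol whose name is literally the string ref$Q1; it does not look up column Q1 inside ref. Looking the column up by name with [[ avoids that. A minimal sketch, assuming every qno has a matching Q<qno> column in ref:

```r
df  <- data.frame(qno = c(1, 2, 3, 4))
ref <- data.frame(Q1 = 1:20, Q2 = 21:40, Q3 = 41:60, Q4 = 61:80)

# For each qno, build the column name ("Q1", "Q2", ...) and average that column
df$average <- vapply(df$qno,
                     function(q) mean(ref[[paste0("Q", q)]]),
                     numeric(1))
df
```

Unlike the positional answers above, this version uses the value of qno, so it still gives the right result if the rows of df are in a different order from the columns of ref.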

Filtering a dataframe with respect to the list of data

I am having trouble filtering a dataframe on several conditions that are stored in lists. My real data set is huge, so I created a fake one below to make the question reproducible.
set.seed(1)
df <- data.frame(Cluster=round(runif(2000,1,50)),
Grup = paste0("Group",round(runif(2000,1,10))),
ID = paste0("id",1:2000),
Point1 = round(runif(2000,1,100)),
Point2 = round(runif(2000,1,100)))
Cluster_grup <- list(List1 = data.frame( V1=c(47,35),V2=c(20,35)),
List2 = data.frame(V1=c(10,5,6),V2=c(49,2,46),V3=c(11,12,13)),
List3 = data.frame(V1=c(22,3),V2=c(18,18),V3=c(50,25),V4=c(6,7)))
Grup_info <- list(First = c("Group1","Group7"),
Second = c("Group4","Group5","Group3"),
Third = c("Group10","Group8","Group1","Group6"))
I basically want to filter using the data inside Grup_info and Cluster_grup. For instance, if we take the first elements of those two lists,
Grup_info[[1]]
"Group1" "Group7"
Cluster_grup [[1]]
V1 V2
1 47 20
2 35 35
then I need to filter and apply expand.grid like this:
df_sorted1 <- df %>% filter(.,Cluster == 47 & Grup=="Group1") %>%
select(.,ID,Point1,Point2)
df_sorted2 <-df %>% filter(.,Cluster == 20 & Grup=="Group7") %>%
select(.,ID,Point1,Point2)
ep1 <- expand.grid(df_sorted1$ID,df_sorted2$ID)
ep2 <- expand.grid(df_sorted1$Point1,df_sorted2$Point1)
ep3 <- expand.grid(df_sorted1$Point2,df_sorted2$Point2)
data.frame(ep1, SumPoint1 = rowSums(ep2),SumPoint2 = rowSums(ep3))
The same steps are then repeated for the second row of Cluster_grup[[1]], that is, with Cluster == 35 in both filters, and the two resulting data frames are bound together.
But as you can see, the groups do not all have the same length. For example, the third Grup_info has four elements, as does the third Cluster_grup.
In the end, I want a list of three data frames, each holding the row-bound expand.grid outputs for one list element.
I can probably achieve this with for loops or the sapply family, but I wonder whether there is a faster solution, such as a tidyverse approach.
Nice to see you, maydin
I think the following code does what you want.
Data Input
set.seed(1)
library(dplyr)
library(tidyverse)
library(rlang)
library(data.table)
df <- data.frame(Cluster=round(runif(2000,1,50)),
Grup = paste0("Group",round(runif(2000,1,10))),
ID = paste0("id",1:2000),
Point1 = round(runif(2000,1,100)),
Point2 = round(runif(2000,1,100)))
Cluster_grup <- list(List1 = data.frame( V1=c(47,35),V2=c(20,35)),
List2 = data.frame(V1=c(10,5,6),V2=c(49,2,46),V3=c(11,12,13)),
List3 = data.frame(V1=c(22,3),V2=c(18,18),V3=c(50,25),V4=c(6,7)))
Grup_info <- list(List1 = c("Group1","Group7"),
List2 = c("Group4","Group5","Group3"),
List3 = c("Group10","Group8","Group1","Group6"))
Data merge
I merged Cluster_grup and Grup_info.
mergeGrp <-
sapply(names(Grup_info), function(x){
material <- Cluster_grup[[ x ]]
colnames(material)<- Grup_info[[x]]
return(material)
})
> mergeGrp
$List1
Group1 Group7
1 47 20
2 35 35
$List2
Group4 Group5 Group3
1 10 49 11
2 5 2 12
3 6 46 13
$List3
Group10 Group8 Group1 Group6
1 22 18 50 6
2 3 18 25 7
Data handling
I used rbindlist (from data.table) to merge all the results; that call is commented out at the end.
If you don't want a single table, keep FinalResult as a list and handle its elements yourself.
FinalResult = lapply(mergeGrp,function(x){
tidyTest = x %>% tidyr::gather() %>% dplyr::group_by(key)
result = NULL
for (i in 1: NROW(x)){
mate = tidyTest %>% filter(row_number() == i )
condList = apply(mate,1,function(x){
sprintf("( Cluster == %s & Grup == '%s' )",x[2],x[1])
})
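filtered = lapply(condList, function(x){
# filter_() is deprecated; parse the condition string into an expression instead
df %>% filter(!!rlang::parse_expr(x)) %>% select(ID,Point1,Point2)}
)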
ep1 = filtered %>% purrr::map(.,~.$ID) %>%
as.vector() %>% expand.grid()
ep2 = filtered %>% purrr::map(.,~.$Point1) %>% as.vector() %>%
expand.grid() %>% rowSums()
ep3 = filtered %>% purrr::map(.,~.$Point2) %>% as.vector() %>%
expand.grid() %>% rowSums()
result = rbind(result,data.frame(ep1, SumPoint1 = ep2,SumPoint2 = ep3))
}
return(result)
}
)
#rbindlist(FinalResult)
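The per-row step can also be written without building condition strings. The sketch below handles a single row of Cluster_grup together with its group names, using base R only. It is an illustration under the same assumptions as the question's manual example, and the names combine_one and res are made up.

```r
set.seed(1)
df <- data.frame(Cluster = round(runif(2000, 1, 50)),
                 Grup = paste0("Group", round(runif(2000, 1, 10))),
                 ID = paste0("id", 1:2000),
                 Point1 = round(runif(2000, 1, 100)),
                 Point2 = round(runif(2000, 1, 100)),
                 stringsAsFactors = FALSE)

combine_one <- function(clusters, groups) {
  # One filtered subset per (cluster, group) pair
  parts <- Map(function(cl, g) {
    df[df$Cluster == cl & df$Grup == g, c("ID", "Point1", "Point2")]
  }, clusters, groups)
  # All combinations of IDs, with the matching sums of points
  ids <- expand.grid(lapply(parts, `[[`, "ID"), stringsAsFactors = FALSE)
  p1  <- rowSums(expand.grid(lapply(parts, `[[`, "Point1")))
  p2  <- rowSums(expand.grid(lapply(parts, `[[`, "Point2")))
  data.frame(ids, SumPoint1 = p1, SumPoint2 = p2)
}

# First row of Cluster_grup[[1]] with the groups in Grup_info[[1]]
res <- combine_one(c(47, 20), c("Group1", "Group7"))
head(res)
```

Applying combine_one to every row of each Cluster_grup element and row-binding the results would give the list of three data frames described in the question.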
