Hi, and happy new year to all.
I have a tricky task (in my opinion) and I cannot find a way to solve it.
Please see the following toy data. The original dataset has hundreds of columns and rows.
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan"),
US=c(8,2,NA,7),
UK=c(5,4,1,7))
I want to create a new column called "origin" that pastes together the column names of the non-NA cells in each row, separated by "|" and ordered by the corresponding values, with higher values first. For ties (like Zlatan), the order isn't relevant: the output for Zlatan could be US|UK or UK|US.
This is the desired output:
I spent some hours trying to solve it, but no approach worked. Maybe it makes sense to convert the values with as.factor...
Help is much appreciated. Thank you in advance!
Here's a dplyr approach. First, we can use rowwise to work on individual rows independently. Next, we can use c_across, which allows us to select values from that row only. We can then index a vector c("US","UK") by the order of those values: na.last = NA drops the NA entries and rev puts the highest value first.
paste with collapse = "|" allows us to put the values together with the separator. I added a row (Bob) to see what would happen if both values are NA.
library(dplyr)
test %>%
rowwise() %>%
mutate(origin = paste(c("US","UK")[rev(order(c_across(US:UK), na.last = NA))], collapse = "|"))
# A tibble: 5 x 4
# Rowwise:
name US UK origin
<chr> <dbl> <dbl> <chr>
1 Amber 8 5 "US|UK"
2 Thomas 2 4 "UK|US"
3 Stefan NA 1 "UK"
4 Zlatan 7 7 "UK|US"
5 Bob NA NA ""
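To see what the indexing does on a single row, here is a minimal sketch on a made-up named vector (v is only for illustration and is not part of the data above). order() with na.last = NA drops the NA positions, and rev() puts the largest value first:

```r
# Hypothetical single "row" of values; AUS is missing
v <- c(US = 2, UK = 4, AUS = NA)

# order() returns positions sorted by increasing value;
# na.last = NA removes the NA positions from the result
idx <- order(v, na.last = NA)

# rev() flips to decreasing order, then we look up the names
origin <- paste(names(v)[rev(idx)], collapse = "|")
origin
```

If every value in a row is NA, idx is empty and paste() returns an empty string, which matches the Bob row above.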
This is also trivially expanded to more columns:
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan","Bob"),
US=c(8,2,NA,7,NA),
UK=c(5,4,1,7,NA),
AUS=c(1,2,NA,NA,1))
test %>%
rowwise() %>%
mutate(origin = paste(c("US","UK","AUS")[rev(order(c_across(US:AUS), na.last = NA))], collapse = "|"))
# A tibble: 5 x 5
# Rowwise:
name US UK AUS origin
<chr> <dbl> <dbl> <dbl> <chr>
1 Amber 8 5 1 US|UK|AUS
2 Thomas 2 4 2 UK|AUS|US
3 Stefan NA 1 NA UK
4 Zlatan 7 7 NA UK|US
5 Bob NA NA 1 AUS
Or with tidyselect helpers to use all columns except name:
test %>%
rowwise() %>%
mutate(origin = paste(names(across(-name))[rev(order(c_across(-name), na.last = NA))], collapse = "|"))
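For comparison, the same idea can be sketched in base R with apply, which may be useful when dplyr is not available. This is only a sketch under the assumption that every column except name is numeric; value_cols is a name introduced here for illustration.

```r
test <- data.frame(name = c("Amber", "Thomas", "Stefan", "Zlatan", "Bob"),
                   US   = c(8, 2, NA, 7, NA),
                   UK   = c(5, 4, 1, 7, NA),
                   AUS  = c(1, 2, NA, NA, 1),
                   stringsAsFactors = FALSE)

# All columns except the identifier (assumed to be numeric)
value_cols <- setdiff(names(test), "name")

# For each row: order the values from high to low, drop NAs,
# and paste the matching column names together
test$origin <- apply(test[value_cols], 1, function(v) {
  ord <- order(v, decreasing = TRUE, na.last = NA)
  paste(value_cols[ord], collapse = "|")
})

test
```

The order of tied columns may differ from the rowwise version, which the question allows.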
Another possibility with tidyverse. It is longer than the other two solutions, but it should work directly with a dataframe with as many columns as you need.
I changed the dataframe to long format, filtered out NAs, grouped by name, summarized using paste, and joined with the original dataframe to get the original columns (and rows with all NAs).
library(tidyverse)
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan","Bob"),
US=c(8,2,NA,7,NA),
UK=c(5,4,1,7,NA),
AUS=c(1,2,NA,NA,1))
test %>%
# change to long format
tidyr::pivot_longer(cols=-name, names_to = "country", values_to = "value") %>%
# remove rows with NA
dplyr::filter(!is.na(value)) %>%
# group by name and sort
dplyr::group_by(name) %>% dplyr::arrange(-value) %>%
# create summary of countries for each name in column 'origin'
dplyr::summarise(origin=paste(country, collapse = "|")) %>%
# join with original data frame to include original columns (and names with only NA) and change NA to '' in origin
dplyr::right_join(test, by='name') %>% dplyr::mutate(origin=ifelse(is.na(origin), '', origin)) %>%
# move origin column to end
dplyr::relocate(origin, .after = last_col())
Result
name US UK AUS origin
<chr> <dbl> <dbl> <dbl> <chr>
1 Amber 8 5 1 US|UK|AUS
2 Bob NA NA 1 AUS
3 Stefan NA 1 NA UK
4 Thomas 2 4 2 UK|US|AUS
5 Zlatan 7 7 NA US|UK
Here's a different tidyverse solution using case_when:
library(tidyverse)
data <- data.frame(
"name" =c("Amber","Thomas","Stefan","Zlatan"),
"US" =c(8,2,NA,7),
"UK" =c(5,4,1,7))
data <- data %>% mutate(origin = case_when( US > UK ~ "US|UK",
UK >= US ~ "UK|US",
is.na(UK) & !is.na(US) ~ "US",
is.na(US) & !is.na(UK) ~ "UK"))
data
#> name US UK origin
#> 1 Amber 8 5 US|UK
#> 2 Thomas 2 4 UK|US
#> 3 Stefan NA 1 UK
#> 4 Zlatan 7 7 UK|US
Created on 2021-01-06 by the reprex package (v0.3.0)
I have a data frame:
df1 <- data.frame(Object = c("Klaus","Klaus","Peter","Peter","Daniel","Daniel"),
PointA = as.numeric(c("7",NA,"17",NA,NA,NA)),
PointB = as.numeric(c("18","22",NA,NA,"17",NA)),
measure = c("1","2","1","2","1","2")
)
And I want this:
df2 <- data.frame(Object = c("Klaus","Klaus","Peter","Peter","Daniel","Daniel"),
PointA = as.numeric(c("7","18","17",NA,NA,"17")),
PointB = as.numeric(c("18","22",NA,NA,"17",NA)),
measure = c("1","2","1","2","1","2")
)
That is, if an Object has no value for PointA where measure == 2, I want it replaced with PointB from measure == 1 of the same Object.
First thing that comes to mind is:
library(dplyr)
df1$PointA <- coalesce(df1$PointA, df1$PointB)
But as far as I know there is no way to make this conditional.
Then I thought maybe something like:
df1$PointA[is.na(df1$PointA)] <- df1$PointB
But this does not differentiate by measure.
So I thought about:
df1$PointA <- ifelse(df1$measure == 2 & is.na(df1$PointA), df1$PointB, df1$PointA)
But that does not take into account that I need the corresponding value from measure == 1.
Now I am at a loss. I am out of ideas on how to approach this. Help?
Edit: I got two very good solutions already, but both rely on the row order in the data frame. I tried, but obviously my example was too simple. I am looking for something that also works under the following condition:
df1 <- df1[sample(nrow(df1)), ]
One possible option is using row_number() from dplyr. In case you need to sort your dataframe first, you can insert an arrange statement.
library(dplyr)
df1 %>%
arrange(Object, measure) %>%
group_by(Object) %>%
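# note: PointB[row_number()-1] assumes exactly two rows per Object; lag(PointB) is the general form
mutate(PointA = if_else(measure == 2 & is.na(PointA), PointB[row_number()-1], PointA))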
# A tibble: 6 x 4
# Groups: Object [3]
# Object PointA PointB measure
# <chr> <dbl> <dbl> <chr>
# 1 Daniel NA 17 1
# 2 Daniel 17 NA 2
# 3 Klaus 7 18 1
# 4 Klaus 18 22 2
# 5 Peter 17 NA 1
# 6 Peter NA NA 2
You could use coalesce + lag as shown below:
library(tidyverse)
df1 %>%
arrange(Object, measure) %>%
group_by(Object) %>%
mutate(PointA = coalesce(PointA, lag(PointB)))
# A tibble: 6 x 4
# Groups: Object [3]
Object PointA PointB measure
<chr> <dbl> <dbl> <chr>
1 Daniel NA 17 1
2 Daniel 17 NA 2
3 Klaus 7 18 1
4 Klaus 18 22 2
5 Peter 17 NA 1
6 Peter NA NA 2
This could be condensed, but it should be relatively clear and doesn't rely on the row order at all. Beware if you have multiple rows for the same Object/Measure pair - the self-join will have multiple matches and you'll end up with a lot more rows than you started with.
library(dplyr)
df_fill = df1 %>%
filter(measure == 1) %>%
select(Object, fill_in = PointB) %>%
mutate(needs_fill = 1L)
result = df1 %>%
mutate(needs_fill = if_else(measure == 2 & is.na(PointA), 1L, NA_integer_)) %>%
left_join(df_fill, by = c("Object", "needs_fill")) %>%
mutate(PointA = coalesce(PointA, fill_in)) %>%
select(-fill_in, -needs_fill)
result
# Object PointA PointB measure
# 1 Klaus 7 18 1
# 2 Klaus 18 22 2
# 3 Peter 17 NA 1
# 4 Peter NA NA 2
# 5 Daniel NA 17 1
# 6 Daniel 17 NA 2
Same as above but without saving the intermediate object:
result = df1 %>%
mutate(needs_fill = if_else(measure == 2 & is.na(PointA), 1L, NA_integer_)) %>%
left_join(
df1 %>%
filter(measure == 1) %>%
select(Object, fill_in = PointB) %>%
mutate(needs_fill = 1L),
by = c("Object", "needs_fill")
) %>%
mutate(PointA = coalesce(PointA, fill_in)) %>%
select(-fill_in, -needs_fill)
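An order-independent lookup can also be sketched in base R with match, without a join. This is a sketch, assuming each Object has at most one row with measure == 1; src and needs_fill are names introduced here for illustration.

```r
df1 <- data.frame(Object  = c("Klaus", "Klaus", "Peter", "Peter", "Daniel", "Daniel"),
                  PointA  = c(7, NA, 17, NA, NA, NA),
                  PointB  = c(18, 22, NA, NA, 17, NA),
                  measure = c("1", "2", "1", "2", "1", "2"),
                  stringsAsFactors = FALSE)

# Shuffle the rows to show that row order does not matter
set.seed(42)
df1 <- df1[sample(nrow(df1)), ]

# Lookup table: PointB of the measure == 1 row for each Object
src <- df1[df1$measure == "1", c("Object", "PointB")]

# Rows that need a value: measure == 2 with a missing PointA
needs_fill <- df1$measure == "2" & is.na(df1$PointA)

# match() finds the lookup row by Object, regardless of position
df1$PointA[needs_fill] <- src$PointB[match(df1$Object[needs_fill], src$Object)]

df1[order(df1$Object, df1$measure), ]
```

An Object without a measure == 1 row gets NA from match, so its PointA stays missing.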
I have the following data frame
   Name   Product Unit Class
2  sushil seeds
4  sanju  Soap    46   C
5  rahul          5
7  sanju          4    E
9  sushil         20   B
10 rahul  Soap         A
What I need is a data frame without duplicate rows, under the following conditions.
If a row has all column values filled, eliminate its duplicate row.
If a row has some empty column values, replace each empty cell with the value from the same column of its duplicate row.
The desired result should look like this.
Name Product Unit Class
1 sushil seeds 20 B
2 sanju Soap 46 C
3 rahul Soap 5 A
Thanks in advance for the help!
Here is the code for the data frame.
Name <- c("abbas","sushil","abbas","sanju","rahul","shweta","sanju","rajiv","sushil","rahul")
Unit <- c(18," ",18,46,5,67,4,3,20," ")
Product <- c("Rice","seeds","Rice","Soap"," ","Towel"," "," "," ","Soap")
Class <- c("A"," ","A","C"," ","D","E","A","B","A")
Data <- data.frame(Name,Product,Unit,Class)
duplicate <- which(duplicated(Data))
unique <- Data[!duplicated(Data),]
NewData <- unique[unique$Name %in% unique$Name[duplicated(unique$Name)],]
In the following I am assuming that the primary ID is the Name column.
First part (harder):
library(tidyverse)
df[ df == "" ] <- NA
df2 <- df %>%
mutate(complete=complete.cases(df)) %>%
group_by(Name) %>%
mutate(any_complete=any(complete)) %>%
filter( complete | (!complete & !any_complete)) %>%
select(-complete, -any_complete)
Result:
# A tibble: 5 x 4
# Groups: Name [3]
Name Product Unit Class
<chr> <chr> <int> <chr>
1 sushil seeds NA NA
2 sanju Soap 46 C
3 rahul NA 5 NA
4 sushil NA 20 B
5 rahul Soap NA A
Explanation: first we replace all missing strings by actual NA's. Then, we create a column, complete, which checks whether all of the columns are complete for a given row. Next we create another column that tells us whether, for any given Name there is a complete observation. Finally, we keep only the rows which are either (i) complete or (ii) not complete, but a complete observation for that Name is missing.
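The filtering logic in the explanation can be checked on a tiny made-up data frame in base R. This is only a sketch: the data and the names kept and any_complete are hypothetical, and ave stands in for the grouped mutate.

```r
# Hypothetical data: "a" has one complete and one incomplete row,
# "b" has only an incomplete row
df <- data.frame(Name = c("a", "a", "b"),
                 x    = c(1, NA, NA),
                 stringsAsFactors = FALSE)

# TRUE where every column of the row is non-missing
complete <- complete.cases(df)

# TRUE for every row whose Name has at least one complete row
any_complete <- as.logical(ave(complete, df$Name, FUN = any))

# Keep complete rows, plus incomplete rows of names with no complete row
kept <- df[complete | !any_complete, ]
kept
```

Here the incomplete row of "a" is dropped, while the incomplete row of "b" is kept because nothing better exists for that name.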
Second task is simpler, but boring:
df2 %>% arrange(Name, Product) %>% fill(Product) %>%
arrange(Name, Unit) %>% fill(Unit) %>%
arrange(Name, Class) %>% fill(Class) %>%
filter(!duplicated(Name))
Result:
# A tibble: 3 x 4
# Groups: Name [3]
Name Product Unit Class
<chr> <chr> <int> <chr>
1 rahul Soap 5 A
2 sanju Soap 46 C
3 sushil seeds 20 B
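Both steps can also be sketched in one pass with base R's aggregate, taking the first non-missing value per column within each Name. This is a sketch under the assumption that blanks are stored as a single space, as in the question's code; first_non_na is a helper introduced here for illustration.

```r
Name    <- c("abbas", "sushil", "abbas", "sanju", "rahul", "shweta", "sanju", "rajiv", "sushil", "rahul")
Unit    <- c(18, " ", 18, 46, 5, 67, 4, 3, 20, " ")
Product <- c("Rice", "seeds", "Rice", "Soap", " ", "Towel", " ", " ", " ", "Soap")
Class   <- c("A", " ", "A", "C", " ", "D", "E", "A", "B", "A")
Data <- data.frame(Name, Product, Unit, Class, stringsAsFactors = FALSE)

# Treat single-space cells as missing
Data[Data == " "] <- NA

# Hypothetical helper: first non-missing value of a vector (NA if none)
first_non_na <- function(x) x[!is.na(x)][1]

# One row per Name, each column filled from the first row that has a value
merged <- aggregate(Data[c("Product", "Unit", "Class")],
                    by = list(Name = Data$Name),
                    FUN = first_non_na)
merged
```

Note that Unit is a character column here because it was built from a vector mixing numbers and " "; it can be converted with as.numeric afterwards. When two rows for a name disagree (like sanju's Unit), the value from the earlier row wins.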
I have a longitudinal data set and would like to extract the latest, non-missing complete set of observations for each variable in the data set where id is a unique identifier, yr is year, and x1 and x2 are variables with missing values. The actual data set has 100s of variables over the course of 60 years.
data <- data.frame(id=rep(1:3,3),
yr=rep(1:3,times=1, each=3),
x1=c(1,3,7,NA,NA,NA,9,4,10),
x2=c(NA,NA,NA,3,9,6,NA,NA,NA))
Below are my expected results. For x1, the latest complete set of observations is year 3. For x2, the latest complete set of observations is year 2.
Using base R
subset(data, yr %in% names(tail(which(sapply(split(data[c('x1', 'x2')],
data$yr), function(x) any(colSums(!is.na(x)) == nrow(x)))), 2)))
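A more step-by-step base R sketch computes, for each variable separately, the latest year in which all ids have a value. The name latest_complete_year is introduced here for illustration, and the variable names are listed by hand; with hundreds of variables one could use setdiff(names(data), c("id", "yr")) instead.

```r
data <- data.frame(id = rep(1:3, 3),
                   yr = rep(1:3, times = 1, each = 3),
                   x1 = c(1, 3, 7, NA, NA, NA, 9, 4, 10),
                   x2 = c(NA, NA, NA, 3, 9, 6, NA, NA, NA))

vars <- c("x1", "x2")

latest_complete_year <- sapply(vars, function(v) {
  # For each year: TRUE if no value of this variable is missing
  complete_by_year <- tapply(data[[v]], data$yr, function(x) all(!is.na(x)))
  yrs <- as.numeric(names(complete_by_year)[complete_by_year])
  # Latest complete year, or NA if no year is complete
  if (length(yrs) > 0) max(yrs) else NA_real_
})

latest_complete_year
```

For the toy data this gives year 3 for x1 and year 2 for x2, matching the expectation stated in the question.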
Here's a tidyverse solution. First, I create the data frame.
# Create data frame
df <- data.frame(id=rep(1:3,3),
yr=rep(1:3,times=1, each=3),
x1=c(1,3,7,NA,NA,NA,9,4,10),
x2=c(NA,NA,NA,3,9,6,NA,NA,NA))
Next, I load the required libraries.
# Load library
library(dplyr)
library(tidyr)
I then go from wide to long format, group by yr and key (i.e., variable name), remove those that have NA values (i.e., keep those that are all not NA), group by key, keep those data that are in the maximum year, switch back to wide format, and arrange to make the printed result look pretty.
df %>%
gather("key", "val", x1, x2) %>%
group_by(yr, key) %>%
filter(all(!is.na(val))) %>%
group_by(key) %>%
filter(yr == max(yr)) %>%
spread(key, val) %>%
arrange(yr)
#> # A tibble: 6 x 4
#> id yr x1 x2
#> <int> <int> <dbl> <dbl>
#> 1 1 2 NA 3
#> 2 2 2 NA 9
#> 3 3 2 NA 6
#> 4 1 3 9 NA
#> 5 2 3 4 NA
#> 6 3 3 10 NA
Created on 2019-05-29 by the reprex package (v0.3.0)
I have a dataset with three columns as below:
data <- data.frame(
grpA = c(1,1,1,1,1,2,2,2),
idB = c(1,1,2,2,3,4,5,6),
valueC = c(10,10,20,20,10,30,40,50),
otherD = c(1,2,3,4,5,6,7,8)
)
Each value of idB maps to exactly one valueC.
I want to use a dplyr pipe (as the rest of my code is in dplyr) and use group_by on grpA to get a new column with the sum of the valueC values for each group, counting each idB only once.
The answer should be like:
newCol <- c(40,40,40,40,40,120,120,120)
but with data %>% group_by(grpA) %>% mutate(newCol = sum(valueC)),
I get newCol <- c(70,70,70,70,70,120,120,120)
How do I include only the unique values of idB? Is there anything else I can use instead of group_by in a dplyr %>% pipe?
I can't use summarise as I need to keep the values in otherD intact for later use.
The other option I have is to create newCol separately through SQL and then merge it with a left join. But I am looking for a better inline solution.
If it has been answered before, please refer me to the link as I could not find any relevant answer to this issue.
We need unique with match:
data %>%
group_by(grpA) %>%
mutate(ind = sum(valueC[match(unique(idB), idB)]))
# A tibble: 8 x 5
# Groups: grpA [2]
# grpA idB valueC otherD ind
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 1 40
#2 1 1 10 2 40
#3 1 2 20 3 40
#4 1 2 20 4 40
#5 1 3 10 5 40
#6 2 4 30 6 120
#7 2 5 40 7 120
#8 2 6 50 8 120
Another option is to take the distinct rows by 'grpA' and 'idB', group by 'grpA', sum 'valueC', and left_join the result with the original data:
data %>%
distinct(grpA, idB, .keep_all = TRUE) %>%
group_by(grpA) %>%
summarise(newCol = sum(valueC)) %>%
left_join(data, ., by = 'grpA')
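The same "count each idB once" idea can be sketched in base R with duplicated and ave: repeated (grpA, idB) rows contribute 0 to the group sum. This is only a sketch; the intermediate name counted is introduced here for illustration.

```r
data <- data.frame(
  grpA   = c(1, 1, 1, 1, 1, 2, 2, 2),
  idB    = c(1, 1, 2, 2, 3, 4, 5, 6),
  valueC = c(10, 10, 20, 20, 10, 30, 40, 50),
  otherD = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Zero out valueC on repeated (grpA, idB) rows so each idB counts once
counted <- ifelse(duplicated(data[c("grpA", "idB")]), 0, data$valueC)

# Sum within grpA and broadcast the result back to every row
data$newCol <- ave(counted, data$grpA, FUN = sum)

data
```

Inside a grouped dplyr mutate, the equivalent expression would be sum(valueC[!duplicated(idB)]), which keeps otherD intact in the same way.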
I would like to average columns in a data set based on a unique identifier. I do not know ahead of time how many columns I will have for each unique identifier or what order they will come in. The unique IDs are all known beforehand and are lists of weeks. I have found solutions for regular patterns, but not solutions that use the actual column headers to work out the average. Thanks for any and all help.
I present the original data and the desired result. In the example there are only 2 unique IDs:
x = read.table(text = "
site wk1 wk2 wk1 wk1
1 2 4 6 8
2 10 20 30 40
3 5 NA 2 3
4 100 100 NA NA",
sep = "", header = TRUE)
x
desired.outcome = read.table(text = "
site wk1avg wk2avg
1 5.3 4
2 26.6 20
3 3.3 NA
4 NA 100",
sep = "", header = TRUE)
If your original data file has duplicated column names, read.table will change them so all the columns have unique values (as you can see by checking x in your example after it's loaded). In fact, the code below depends on that happening, because melt will drop columns with duplicated names. Then we use mutate to remove the extra text added by read.table to de-duplicate the column names so that we can group properly by week.
library(reshape2)
library(dplyr)
x %>% melt(id.var="site") %>% # Convert to long format
mutate(variable = gsub("\\..*", "", variable)) %>% # "re-duplicate" original column names
group_by(site, variable) %>%
summarise(mn = mean(value)) %>%
dcast(site ~ variable, value.var = "mn")
site wk1 wk2
1 1 5.333333 4
2 2 26.666667 20
3 3 3.333333 NA
4 4 NA 100
Here's a tidyr and dplyr approach:
library(dplyr)
library(tidyr)
x %>% gather(wk, val, -site) %>% # gather wk* columns into key-value pairs
extract(wk, 'wk', '(wk\\d+).*?') %>% # trim suffixes added by read.table
group_by(site, wk) %>%
summarise(mean_val = mean(val)) %>% # calculate grouped means
spread(wk, mean_val) # spread back into wk* columns
# Source: local data frame [4 x 3]
# Groups: site [4]
#
# site wk1 wk2
# (int) (dbl) (dbl)
# 1 1 5.333333 4
# 2 2 26.666667 20
# 3 3 3.333333 NA
# 4 4 NA 100
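Without reshaping, the same result can be sketched in base R by splitting the columns by their week name and taking row means. Like the answers above, this sketch relies on read.table having renamed the duplicates to wk1.1, wk1.2 and so on; wk, avg and result are names introduced here for illustration.

```r
x <- read.table(text = "
site wk1 wk2 wk1 wk1
1 2 4 6 8
2 10 20 30 40
3 5 NA 2 3
4 100 100 NA NA",
  sep = "", header = TRUE)

# Strip the ".1", ".2" suffixes added by read.table to recover the week label
wk <- sub("\\..*$", "", names(x)[-1])

# Split the value columns by week label, then average each block row by row
# (rowMeans keeps NA when any value in the block is missing)
avg <- sapply(split.default(x[-1], wk), rowMeans)

result <- data.frame(site = x$site, avg)
result
```

Pass na.rm = TRUE to rowMeans inside sapply if missing values should be ignored rather than propagated.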