I'm trying to assign values to different columns, separately for each row, based on lookup values. I'm working in R. Here's a minimal working example:
#Item scores
item1 <- c(NA, 1, NA, 4)
item2 <- c(NA, 2, NA, 3)
item3 <- c(NA, 3, NA, NA)
item57 <- c(NA, 4, 4, 1)
mydata <- data.frame(item1, item2, item3, item57)
#Lookup values based on item score
lookup <- data.frame(score = 1:4, value=c(6, 7, 8, 10))
I have many participants (i.e., rows) assessed with a score on each of many items (i.e., columns). I'd like to create variables in my data frame for the values that are tied to the item scores (based on the lookup table). Here's my desired output:
#Desired output (adding value that is tied to item score to the original data)
desiredOutput <- cbind(mydata,
value1 = c(NA, 6, NA, 10),
value2 = c(NA, 7, NA, 8),
value3 = c(NA, 8, NA, NA),
value57 = c(NA, 10, 10, 6))
I have a fairly large dataset and would like to stay away from loops, if possible. Also, we can skip rows with all NAs, if it's faster to process.
Here's a tidyverse method. The idea is to first gather the score columns and left_join the lookup table so that each score is matched to its value. The rest is just manipulation to get back to the desired output format: we build the column names we want with gather and unite, and finally spread back out. Note that you need rowid_to_column at the beginning so that spread knows which observations go on which rows. If you want to match your output column names exactly, you can mix in some stringr.
item1 <- c(NA, 1, NA, 4)
item2 <- c(NA, 2, NA, 3)
item3 <- c(NA, 3, NA, NA)
item57 <- c(NA, 4, 4, 1)
mydata <- data.frame(item1, item2, item3, item57)
#Lookup values based on item score
lookup <- data.frame(score = 1:4, value=c(6, 7, 8, 10))
library(tidyverse)
mydata %>%
  rowid_to_column(var = "participant") %>%
  gather(items, score, starts_with("item")) %>%
  left_join(lookup) %>%
  gather(coltype, val, score:value) %>%
  unite(colname, coltype, items) %>%
  spread(colname, val)
#> Joining, by = "score"
#> participant score_item1 score_item2 score_item3 score_item57 value_item1
#> 1 1 NA NA NA NA NA
#> 2 2 1 2 3 4 6
#> 3 3 NA NA NA 4 NA
#> 4 4 4 3 NA 1 10
#> value_item2 value_item3 value_item57
#> 1 NA NA NA
#> 2 7 8 10
#> 3 NA NA 10
#> 4 8 NA 6
Created on 2018-06-19 by the reprex package (v0.2.0).
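For comparison, here is a minimal base R sketch of the same lookup that avoids reshaping altogether. It assumes the item columns can be identified by the "item" prefix and that the new columns should be named value1, value2, and so on, as in the desired output; the names item_cols, values and result are only illustrative.

```r
# Item scores and lookup table from the question
item1  <- c(NA, 1, NA, 4)
item2  <- c(NA, 2, NA, 3)
item3  <- c(NA, 3, NA, NA)
item57 <- c(NA, 4, 4, 1)
mydata <- data.frame(item1, item2, item3, item57)
lookup <- data.frame(score = 1:4, value = c(6, 7, 8, 10))

# Columns to translate (assumption: they all start with "item")
item_cols <- grep("^item", names(mydata), value = TRUE)

# match() gives the lookup row for each score; NA scores stay NA
values <- lapply(mydata[item_cols],
                 function(x) lookup$value[match(x, lookup$score)])
names(values) <- sub("^item", "value", item_cols)

result <- cbind(mydata, as.data.frame(values))
result
```

Because match is vectorised, each column is translated in one step with no explicit loop over rows.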
I have a data set with many columns (DATA_OLD) in which I want to exchange all values based on an allocation list with many entries (KEY).
Every value in DATA_OLD should be replaced by its counterpart (can be seen in KEY) to create DATA_NEW.
For simplicity, the example here contains a short KEY and DATA_OLD set. In reality, there are >2500 rows in KEY and >100 columns in DATA_OLD. Therefore, an approach that can be applied to the whole data set simultaneously without calling each colname of DATA_OLD is important.
KEY:
old  new
1    1
3    2
7    3
12   4
55   5
Following this example, every value "1" should be replaced with another value "1". Every value "3" should be replaced with value "2". Every value "7" should be replaced with value "3".
DATA_OLD (START):
var1  var2  var3
NA    3     NA
NA    55    NA
1     NA    NA
NA    NA    NA
3     NA    NA
55    NA    12
DATA_NEW (RESULT):
var1  var2  var3
NA    2     NA
NA    5     NA
1     NA    NA
NA    NA    NA
2     NA    NA
5     NA    4
Here is the reproducible data:
KEY<-structure(list(old = c(1, 3, 7, 12, 55), new = c(1, 2, 3, 4,
5)), class = "data.frame", row.names = c(NA, -5L))
DATA_OLD<-structure(list(var1 = c(NA, NA, 1, NA, 3, 55), var2 = c(3,
55, NA, NA, NA, NA), var3 = c(1, NA, NA, NA, NA, 12)), class = "data.frame", row.names = c(NA, -6L))
DATA_NEW<-structure(list(var1 = c(NA, NA, 1, NA, 2, 5), var2 = c(2,
5, NA, NA, NA, NA), var3 = c(1, NA, NA, NA, NA, 4)), class = "data.frame", row.names = c(NA, -6L))
I have tried back and forth, and it appears that I am completely clueless. Help would be greatly appreciated! The real data set is quite large...
1) Base R Be careful here since some solutions have the side effect of converting the numeric columns to character or factor or the data frame to something else. A solution using match will generally work. The result of lapply is a list so convert back to data frame.
DATA_OLD |>
lapply(function(x) with(KEY, new[match(x, old)])) |>
as.data.frame()
or
DATA_NEW <- DATA_OLD
DATA_NEW[] <- lapply(DATA_OLD, function(x) with(KEY, new[match(x, old)]))
This last one is easy to adapt so that it acts only on some columns:
DATA_NEW <- DATA_OLD
ix <- 1:2 # only convert these columns
DATA_NEW[ix] <- lapply(DATA_OLD[ix], function(x) with(KEY, new[match(x, old)]))
2) purrr Alternately use map_dfr which returns a data frame directly:
library(purrr)
map_dfr(DATA_OLD, ~ with(KEY, new[match(.x, old)]))
3) dplyr A dplyr solution using across is the following. If there were some non-numeric columns that should not be converted, replace everything() with where(is.numeric).
library(dplyr)
DATA_OLD %>%
mutate(across(everything(), ~ with(KEY, new[match(.x, old)])))
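One thing worth knowing about the match-based variants above: any value that does not appear in KEY$old comes back as NA. If you would rather leave such values untouched, here is a minimal sketch. Note that recode_keep is a hypothetical helper name used for illustration, not a function from any package.

```r
KEY <- data.frame(old = c(1, 3, 7, 12, 55), new = c(1, 2, 3, 4, 5))

# Hypothetical helper: translate x via old -> new, keeping unmatched values
recode_keep <- function(x, old, new) {
  idx <- match(x, old)                 # position in the key, NA if absent
  ifelse(is.na(idx), x, new[idx])      # fall back to the original value
}

recoded <- recode_keep(c(1, 3, 99, NA), KEY$old, KEY$new)
recoded
```

It plugs into any of the solutions above in place of the anonymous function, for example lapply(DATA_OLD, recode_keep, old = KEY$old, new = KEY$new).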
The simplest way to implement a dictionary in R is a named vector, where you can use the names as indices:
key <- setNames(KEY$new, KEY$old)
> key
1 3 7 12 55
1 2 3 4 5
The only thing to be mindful of is that the indexing must be done by character, rather than by integer position:
> key[3]
7
3 # WRONG! This is the 3rd item!
> key["3"]
3
2 # RIGHT! This is the item named "3"
Then you can apply the transformation column-wise. This turns the data into a matrix, but you can simply turn it back.
as.data.frame(apply(DATA_OLD, 2, \(col) key[as.character(col)]))
var1 var2 var3
1 NA 2 1
2 NA 5 NA
3 1 NA NA
4 NA NA NA
5 2 NA NA
6 5 NA 4
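If you want to avoid the round trip through a matrix (apply would coerce every column to a common type if the data frame contained, say, a character column), a possible variant is to use lapply and assign back into the data frame. This is only a sketch using the KEY and DATA_OLD from the question; unname is there so the result does not carry the key names along.

```r
KEY <- data.frame(old = c(1, 3, 7, 12, 55), new = c(1, 2, 3, 4, 5))
DATA_OLD <- data.frame(var1 = c(NA, NA, 1, NA, 3, 55),
                       var2 = c(3, 55, NA, NA, NA, NA),
                       var3 = c(1, NA, NA, NA, NA, 12))

# Named vector used as a dictionary: names are the old values
key <- setNames(KEY$new, KEY$old)

DATA_NEW <- DATA_OLD
# Index by character so "3" means the entry named 3, not the third entry
DATA_NEW[] <- lapply(DATA_OLD, function(col) unname(key[as.character(col)]))
DATA_NEW
```

The columns stay numeric and the object stays a data frame throughout.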
Sorry in advance if this question isn't formatted well; I'm quite new to R and the SO community, and I'd welcome constructive criticism. I have a data frame that looks like this, and I'm trying to filter it so that it contains only the minimum 'Cars' and 'Houses' for each person.
my_data = data.frame("Name" = c("Dora", "Dora", "John", "John", "Marie", "Marie"),
"Cars" = c(2, 3, NA, NA, 4, 1),
"Houses" = c(NA, NA, 4, 3, 2, NA))
#Name Cars Houses
#1 Dora 2 NA
#2 Dora 3 NA
#3 John NA 4
#4 John NA 3
#5 Marie 4 2
#6 Marie 1 NA
I want to end up with something like this (especially note the Marie row has changed, but it's ok if that's split on 2 separate rows as well):
#Name Cars Houses
#Dora 2 NA
#John NA 3
#Marie 1 2
OR like this:
#Name Cars Houses
#Dora 2 NA
#John NA 3
#Marie NA 2
#Marie 1 NA
Based on other answers, I've tried
my_data %>%
group_by(Name) %>%
filter(Cars == min(Cars))
#Name Cars Houses
#Dora 2 NA
#Marie 1 NA
but this results in the John rows being dropped before I can filter the minimum Houses. Does anyone have any suggestions for how to approach this? Thanks in advance.
We can use summarise to get the minimum of each column for each name:
my_data = data.frame("Name" = c("Dora", "Dora", "John", "John", "Marie", "Marie"),
"Cars" = c(2, 3, NA, NA, 4, 1),
"Houses" = c(NA, NA, 4, 3, 2, NA))
library(dplyr)
my_data %>%
group_by(Name) %>%
summarise(Cars = min(Cars, na.rm = TRUE),
Houses = min(Houses, na.rm = TRUE))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 3
Name Cars Houses
<chr> <dbl> <dbl>
1 Dora 2 Inf
2 John Inf 3
3 Marie 1 2
Here is what you can do in base R:
df <- data.frame("Name" = c("Dora", "Dora", "John", "John", "Marie", "Marie"),
"Cars" = c(2, 3, NA, NA, 4, 1),
"Houses" = c(NA, NA, 4, 3, 2, NA), stringsAsFactors = FALSE)
aggregate(df, list(df$Name), FUN = function(x) min(x, na.rm = TRUE))[,-1]
Output
Name Cars Houses
1 Dora 2 Inf
2 John Inf 3
3 Marie 1 2
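Both outputs above contain Inf where a person has no non-missing values, because min(x, na.rm = TRUE) on an empty vector returns Inf (with a warning). If you want NA there, as in the desired output, one option is a small wrapper. The sketch below uses base R; min_or_na is a hypothetical helper name, and the same function could be used inside summarise.

```r
my_data <- data.frame(Name   = c("Dora", "Dora", "John", "John", "Marie", "Marie"),
                      Cars   = c(2, 3, NA, NA, 4, 1),
                      Houses = c(NA, NA, 4, 3, 2, NA))

# Hypothetical helper: NA when everything is missing, otherwise the minimum
min_or_na <- function(x) if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)

result <- aggregate(my_data[c("Cars", "Houses")],
                    by = list(Name = my_data$Name),
                    FUN = min_or_na)
result
```

This gives one row per name, with NA instead of Inf for Dora's Houses and John's Cars.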
I have a data frame like this:
I want to create a new column that is the sum of the other columns, ignoring NAs whenever a row contains at least one numeric value. But if all values in a row are NA (like the second row), the sum column should be NA.
As this is your first activity here on SO, you should have a look at this, which describes how to make a minimal, reproducible example. You will certainly need it if you have more questions in the future. An image is usually not accepted as a starting point.
Fortunately your table was a small one. I turned it into a tribble and then used rowSums to calculate the numbers you seem to want.
df <- tibble::tribble(
~x, ~y, ~z,
6000, NA, NA,
NA, NA, NA,
100, 7000, 1000,
0, 0, NA
)
df$sum <- rowSums(df, na.rm = TRUE)
df
#> # A tibble: 4 x 4
#> x y z sum
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6000 NA NA 6000
#> 2 NA NA NA 0
#> 3 100 7000 1000 8100
#> 4 0 0 NA 0
Created on 2020-06-15 by the reprex package (v0.3.0)
Let's say that your data frame is called df
cbind(df, sum = apply(df, 1, function(x) {if (all(is.na(x))) NA else sum(x, na.rm = TRUE)}))
Note that if your data frame has other columns, you will need to restrict the df call within apply to only be the columns you're after.
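As a sketch of that restriction, suppose the data frame also has an id column (an assumption made up for this example) and only x and y should be summed. The column names are collected in num_cols and used for both the NA check and the sum:

```r
df <- data.frame(id = c("a", "b", "c"),
                 x  = c(6000, NA, 0),
                 y  = c(NA, NA, 0))

num_cols <- c("x", "y")   # only these columns take part in the sum

# TRUE for rows where every selected column is NA
all_na <- rowSums(!is.na(df[num_cols])) == 0

df$sum <- ifelse(all_na, NA, rowSums(df[num_cols], na.rm = TRUE))
df
```

Using rowSums instead of apply keeps the computation vectorised, which should matter on larger data.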
You can count the non-NA values in each row of df. If a row has no non-NA value, assign NA as the output; otherwise calculate the row-wise sum using rowSums.
ifelse(rowSums(!is.na(df)) == 0, NA, rowSums(df, na.rm = TRUE))
#[1] 6000 NA 10000 8100 0
data
df <- structure(list(x = c(6000, NA, 10000, 100, 0), y = c(NA, NA,
NA, 7000, 0), z = c(NA, NA, NA, 1000, NA)), class = "data.frame",
row.names = c(NA, -5L))
I have a data frame with data that looks like this:
Part Number Vendor Name Position Repair
123 ABC 1 2
NA <NA> 2 4
NA <NA> 3 1
NA <NA> 4 5
NA <NA> 5 6
NA <NA> 6 3
123 XYZ 1 4
NA <NA> 2 5
NA <NA> 3 7
NA <NA> 4 1
NA <NA> 5 2
NA <NA> 6 3
NA <NA> 7 6
The data are grouped by part number and vendor name. Within each group, whenever Position > 3 and Repair == 1, I want to retrieve that row and all subsequent rows.
In the given example, for Part Number = 123 and Vendor Name = ABC, Repair == 1 occurs at the third position (Position = 3), so all the rows that belong to Part Number = 123 and Vendor Name = ABC should be excluded.
For Part Number = 123 and Vendor Name = XYZ, Repair == 1 occurs at the fourth position, so the 4th, 5th, 6th and 7th rows should be retrieved.
In short: find the row where Position > 3 and Repair == 1, and retrieve it together with all subsequent rows in its group.
Sample data:
Input <- structure(list(`Part Number` = c(123, NA, NA, NA, NA, NA, 123,
NA, NA, NA, NA, NA, NA), `Vendor Name` = c("ABC", NA, NA, NA,
NA, NA, "XYZ", NA, NA, NA, NA, NA, NA), Position = c(1, 2, 3,
4, 5, 6, 1, 2, 3, 4, 5, 6, 7), Repair = c(2, 4, 1, 5, 6, 3, 4,
5, 7, 1, 2, 3, 6)), .Names = c("Part Number", "Vendor Name", "Position",
"Repair"), row.names = c(NA, -13L), class = c("tbl_df", "tbl",
"data.frame"))
I've tried the following but it hasn't resulted in what I wanted:
output_table <- Input %>% group_by(`Part Number`,`Vendor Name`) %>%
mutate(rn=row_number()) %>% filter(rn>=which(pivot$Repair==1)) #Here I'm able to filter subsequent rows where repair==1 but how to exclude the rows which doesn't fall under the mentioned conditions.
output_table <- Input[Input$Position >3 & Input$Repair==1,] # gives me rows matching the condition but I need subsequent rows once the condition is met
Your format seems to be geared towards presentation (reports) rather than data processing. Any processing like this should really be done before you do things like removing repeated values for visual grouping.
Ultimately, the only part you need here within group_by is the use of cumany. The rest of the mutating code is to accommodate the NA fields.
Input %>%
  # assuming order is "safe to assume"
  mutate_at(vars(`Part Number`, `Vendor Name`), zoo::na.locf) %>%
  group_by(`Part Number`,`Vendor Name`) %>%
  filter(cumany(Position > 3 & Repair == 1)) %>%
  # return the first two columns to NA
  mutate(toprow = row_number() == 1L) %>%
  ungroup() %>%
  mutate_at(vars(`Part Number`, `Vendor Name`), ~ if_else(toprow, ., .[NA])) %>%
  select(-toprow)
# # A tibble: 4 x 4
# `Part Number` `Vendor Name` Position Repair
# <dbl> <chr> <dbl> <dbl>
# 1 123 XYZ 4 1
# 2 NA <NA> 5 2
# 3 NA <NA> 6 3
# 4 NA <NA> 7 6
If you are doing more processing on the data, I'd suggest you don't undo "dragging the labels down", instead just doing:
Input %>%
# assuming order is "safe to assume"
mutate_at(vars(`Part Number`, `Vendor Name`), zoo::na.locf) %>%
group_by(`Part Number`,`Vendor Name`) %>%
filter(cumany(Position > 3 & Repair == 1)) %>%
ungroup()
# # A tibble: 4 x 4
# `Part Number` `Vendor Name` Position Repair
# <dbl> <chr> <dbl> <dbl>
# 1 123 XYZ 4 1
# 2 123 XYZ 5 2
# 3 123 XYZ 6 3
# 4 123 XYZ 7 6
With dplyr and tidyr you can do this as follows:
library(dplyr)
library(tidyr)
Input %>%
  fill(`Part Number`, `Vendor Name`) %>%         # fill down missing values
  group_by(`Part Number`, `Vendor Name`) %>%     # group by `Part Number` & `Vendor Name`
  filter(cumsum(Position > 3 & Repair == 1) >= 1) # keep rows once the cumulative count of matches is >= 1
Output for that should be what you are looking for:
# A tibble: 4 x 4
`Part Number` `Vendor Name` Position Repair
<dbl> <chr> <dbl> <dbl>
1 123 XYZ 4 1
2 123 XYZ 5 2
3 123 XYZ 6 3
4 123 XYZ 7 6
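To see why the cumulative trick works, here is a base R sketch of the same idea without dplyr. It assumes the labels have already been filled down and, to keep it short, uses a single grouping column named Vendor (a simplification of the question's two columns). Within each group, cumsum(hit) is 0 before the first matching row and at least 1 from that row onward.

```r
df <- data.frame(
  Vendor   = rep(c("ABC", "XYZ"), times = c(6, 7)),
  Position = c(1:6, 1:7),
  Repair   = c(2, 4, 1, 5, 6, 3, 4, 5, 7, 1, 2, 3, 6)
)

# Rows that satisfy the trigger condition
hit <- df$Position > 3 & df$Repair == 1

# Per group: TRUE from the first hit onward, FALSE before it
keep <- ave(hit, df$Vendor, FUN = function(x) cumsum(x) >= 1)

result <- df[keep, ]
result
```

ABC has Repair == 1 only at Position 3, so it never triggers and is dropped entirely, while XYZ is kept from Position 4 onward.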
I want to remove all rows from my dataset that are NA in two columns. If a row has a non-NA value in either column, I want to keep it. How do I do this?
You can do this:
library(tidyverse)
df <- data.frame(a = c(2, 4, 6, NA, 3, NA),
b = c(5, 4, 8, NA, 6, 7))
df1 <- df %>%
  filter(!is.na(a) | !is.na(b))
and you get:
> df1
a b
1 2 5
2 4 4
3 6 8
4 3 6
5 NA 7
Here are a couple of base R suggestions. Loop over the columns of the dataset, convert each one to a logical vector with is.na, and combine those vectors element-wise with Reduce and `&`, which marks the rows where every column is NA. Negate the result and use it to subset the dataset:
df[!Reduce(`&`, lapply(df, is.na)),]
Or convert the logical matrix (!is.na(df)) to a logical vector with rowSums and use that to subset the dataset:
df[rowSums(!is.na(df))>0,]
data
df <- data.frame(a = c(2, 4, 6, NA, 3, NA),
b = c(5, 4, 8, NA, 6, 7))
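Both base R versions check every column of df. If the data frame has other columns and only two of them should decide whether a row is dropped, a possible adaptation is to subset the columns first. In this sketch the id column and the name cols are assumptions added for illustration.

```r
df <- data.frame(id = 1:6,
                 a  = c(2, 4, 6, NA, 3, NA),
                 b  = c(5, 4, 8, NA, 6, 7))

cols <- c("a", "b")   # only these columns are checked for NA

# Keep rows that have at least one non-NA value among the chosen columns
kept <- df[rowSums(!is.na(df[cols])) > 0, ]
kept
```

Row 4 is removed because both a and b are NA there, even though id is not.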