Removing duplicate rows from one column without losing data from another column - r

I apologize if this has already been asked, but the solutions I came across didn't seem to work for me.
I have a data set that was initially multiple Excel sheets containing different variables for the same subjects. I was able to import the data into R and combine it into a single data frame using:
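library(readxl)  # excel_sheets(), read_excel()
library(plyr)    # rbind.fill()
library(tibble)  # as_tibble()
x1_data <- "/data.xlsx"
excel_sheets(path = x1_data)
tab_names <- excel_sheets(path = x1_data)
# read every sheet into a list of data frames
list_all <- lapply(tab_names, function(x)
  read_excel(path = x1_data, sheet = x))
str(list_all)
# stack the sheets; columns missing from a sheet are filled with NA
df <- rbind.fill(list_all)
df <- as_tibble(df)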
However, I now have many duplicate rows for each subject, as each sheet was essentially added beneath the preceding sheet. Something like this:
Sheet 1
ID: 1,2
Age: 32, 29
Sex: M, F
Sheet 2
ID: 1, 2
Weight: 75, 89
Height: 157, 146
Combined
ID: 1, 2, 1, 2
Age: 32, 29, NA, NA
Sex: M, F, NA, NA
Weight: NA, NA, 75, 89
Height: NA, NA, 157, 146
I can't seem to figure out how to delete the duplicate ID rows without losing the data in the columns that belong to those rows. I tried aggregate and group_by without success. What I am after is this:
Combined
ID: 1, 2
Age: 32, 29
Sex: M, F
Weight: 75, 89
Height: 157, 146
Any help would be appreciated. Thanks.

Here's a possible solution:
library(tidyverse)
df <- tibble(ID = c(1, 2, 1, 2),
             Age = c(32, 29, NA, NA),
             Sex = c("M", "F", NA, NA),
             Weight = c(NA, NA, 75, 89),
             Height = c(NA, NA, 157, 146))
df1 <- df %>% filter(is.na(Age)) %>% select(ID, Weight, Height)
df2 <- df %>% filter(!is.na(Age)) %>% select(ID, Age, Sex)
df.merged <- df2 %>% left_join(df1, by = "ID")
For future questions, please provide an already formatted sample of your data, which makes it much easier to work with.
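Another option is to avoid the stacking step and join the sheets on ID instead. This is only a sketch in base R: sheet1 and sheet2 are hypothetical stand-ins for the elements of list_all, and it assumes every sheet has an ID column with one row per subject.

```r
# Hypothetical stand-ins for the per-sheet data frames held in list_all
sheet1 <- data.frame(ID = c(1, 2), Age = c(32, 29), Sex = c("M", "F"))
sheet2 <- data.frame(ID = c(1, 2), Weight = c(75, 89), Height = c(157, 146))
list_all <- list(sheet1, sheet2)

# Full outer join on ID, applied pairwise across all sheets
combined <- Reduce(function(a, b) merge(a, b, by = "ID", all = TRUE), list_all)
combined
```

Because all = TRUE keeps IDs that appear in only some sheets, a subject missing from one sheet would still get a row, with NA in that sheet's columns.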

Related

Conditionally replace values across multiple columns based on string match in a separate column

I'm trying to conditionally replace values in multiple columns based on a string match in a different column. I'd like to do so in a single line of code using the across() function, but I keep getting errors that don't quite make sense to me. I feel like there is probably a simple solution, so if anyone could point me in the right direction, that would be fantastic!
df <- data.frame("type" = c("Park", "Neighborhood", "Airport", "Park", "Neighborhood", "Neighborhood"),
"total" = c(34, 56, 75, 89, 21, 56),
"group_a" = c(30, 26, 45, 60, 3, 46),
"group_b" = c(4, 30, 30, 29, 18, 10))
# working but not concise
df %>%
  mutate(total = ifelse(str_detect(type, "Park"), NA, total),
         group_a = ifelse(str_detect(type, "Park"), NA, group_a),
         group_b = ifelse(str_detect(type, "Park"), NA, group_b))
# concise but not working
df %>% mutate(across(total, group_a, group_b), ifelse(str_detect(type, "Park"), NA, .))
Update
We got a solution that works with my dummy dataset but is not working with my real data, so I am going to share a small snippet of my real data frame with the numbers changed and organization names hidden. When I run this line of code (df %>% mutate(across(c(Attempts, Canvasses, Completes)), ~ifelse(str_detect(long_name, "park-cemetery"), NA, .))) on these data, I get the following error message:
Error: Problem with mutate() input ..2. x Input ..2 must be a
vector, not a formula object. i Input ..2 is
~ifelse(str_detect(long_name, "park-cemetery"), NA, .).
This is a small sample of the data that produces this error:
df <- structure(list(Org = c("OrgName", "OrgName", "OrgName", "OrgName",
"OrgName", "OrgName", "OrgName", "OrgName", "OrgName", "OrgName"
), nCode = c("M34", "R36", "R46", "X29", "M31", "K39", "Q12",
"Q39", "X41", "K27"), Attempts = c(100, 100, 100, 100, 100, 100,
100, 100, 100, 100), Canvasses = c(80, 80, 80, 80, 80, 80, 80,
80, 80, 80), Completes = c(50, 50, 50, 50, 50, 50, 50, 50, 50,
50), van_nocc_id = c(999, 999, 999, 999, 999, 999, 999, 999,
999, 999), van_name = c("M-Upper West Side", "SI-Rosebank", "SI-Tottenville",
"BX-park-cemetery-etc-Bronx", "M-Stuyvesant Town-Cooper Village",
"BK-Kensington", "Q-Broad Channel", "Q-Lindenwood", "BX-Wakefield",
"BK-East New York"), boro_short = c("M", "SI", "SI", "BX", "M",
"BK", "Q", "Q", "BX", "BK"), long_name = c("Upper West Side",
"Rosebank", "Tottenville", "park-cemetery-etc-Bronx", "Stuyvesant Town-Cooper Village",
"Kensington", "Broad Channel", "Lindenwood", "Wakefield", "East New York"
)), row.names = c(NA, -10L), class = "data.frame")
Final update
The curse of the misplaced closing bracket! Thanks to everyone for your help... the correct solution was df %>% mutate(across(c(Attempts, Canvasses, Completes), ~ifelse(str_detect(long_name, "park-cemetery"), NA, .)))
If you use the newly introduced function across (which is the correct way to approach this task), you have to specify inside across itself the function you want to apply. In this case the function ifelse(...) has to be a purrr-style lambda (so starting with ~). Check out across documentation and look for the arguments .cols and .fns.
df %>%
  mutate(across(c(total, group_a, group_b), ~ifelse(str_detect(type, "Park"), NA, .)))
Output
#           type total group_a group_b
# 1         Park    NA      NA      NA
# 2 Neighborhood    56      26      30
# 3      Airport    75      45      30
# 4         Park    NA      NA      NA
# 5 Neighborhood    21       3      18
# 6 Neighborhood    56      46      10
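For comparison, the same replacement can be written without dplyr. This is a base R sketch, not the across() approach itself; the helper names cols and is_park are made up for the illustration.

```r
df <- data.frame(type = c("Park", "Neighborhood", "Airport", "Park", "Neighborhood", "Neighborhood"),
                 total = c(34, 56, 75, 89, 21, 56),
                 group_a = c(30, 26, 45, 60, 3, 46),
                 group_b = c(4, 30, 30, 29, 18, 10))

cols <- c("total", "group_a", "group_b")           # columns to blank out
is_park <- grepl("Park", df$type, fixed = TRUE)    # rows whose type contains "Park"
df[is_park, cols] <- NA                            # assign NA to those rows and columns
df
```

grepl() with fixed = TRUE does a plain substring match, which plays the role of str_detect() here.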
Here is a data.table solution.
require(data.table)
df <- data.frame("type" = c("Park", "Neighborhood", "Airport", "Park", "Neighborhood", "Neighborhood"),
"total" = c(34, 56, 75, 89, 21, 56),
"group_a" = c(30, 26, 45, 60, 3, 46),
"group_b" = c(4, 30, 30, 29, 18, 10))
setDT(df)
df[type == "Park", c("total", "group_a", "group_b") := NA]
Update: that didn't take long to figure out! Just needed to place the columns in a vector:
# concise AND working! (the ~ lambda has to sit inside across())
df %>% mutate(across(c(total, group_a, group_b), ~ifelse(str_detect(type, "Park"), NA, .)))
I had tried this initially but placed the columns in quotes... don't do that :)

Returning values from a column based on the last value of another column

I have a dataset like this:
data <- data.frame(Time = c(1,4,6,9,11,13,16, 25, 32, 65),
A = c(10, NA, 13, 2, 32, 19, 32, 34, 93, 12),
B = c(1, 99, 32, 31, 12, 13, NA, 13, NA, NA),
C = c(2, 32, NA, NA, NA, NA, NA, NA, NA, NA))
What I want to retrieve are the values in Time that correspond to the last numerical value in A, B, and C.
For example, the last numerical values for A, B, and C are 12, 13, and 32 respectively.
So, the Time values that correspond are 65, 25, and 4.
I've tried something like data[which(data$Time== max(data$A)), ], but this doesn't work.
We can multiply the row index with the logical matrix, and get the colMaxs (from matrixStats) to subset the 'Time' column
library(matrixStats)
data$Time[colMaxs((!is.na(data[-1])) * row(data[-1]))]
#[1] 65 25 4
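To see why the multiplication works, the one-liner can be unpacked step by step. This sketch uses base R's apply() in place of colMaxs so it runs without matrixStats; the intermediate names (vals, present, row_idx, last_row) are introduced only for illustration.

```r
data <- data.frame(Time = c(1, 4, 6, 9, 11, 13, 16, 25, 32, 65),
                   A = c(10, NA, 13, 2, 32, 19, 32, 34, 93, 12),
                   B = c(1, 99, 32, 31, 12, 13, NA, 13, NA, NA),
                   C = c(2, 32, NA, NA, NA, NA, NA, NA, NA, NA))

vals <- data[-1]                     # columns A, B, C
present <- !is.na(vals)              # logical matrix: TRUE where a value exists
row_idx <- present * row(vals)       # row number where a value exists, 0 where NA
last_row <- apply(row_idx, 2, max)   # largest row number per column
result <- data$Time[last_row]        # Time at the last non-NA row of each column
result
```

Note that a column made up entirely of NA would give an index of 0, which silently drops out of the result, so this assumes every column has at least one value.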
Or using base R, we get the index with which/arr.ind, get the max index using a group by operation (tapply) and use that to extract the 'Time' value
m1 <- which(!is.na(data[-1]), arr.ind = TRUE)
data$Time[tapply(m1[,1], m1[,2], FUN = max)]
#[1] 65 25 4
Or with summarise/across (in the devel version of dplyr when this was written; available in dplyr 1.0.0 and later)
library(dplyr)
data %>%
  summarise(across(A:C, ~ tail(Time[!is.na(.)], 1)))
# A B C
#1 65 25 4
Or using summarise_at, which works in older versions of dplyr (it is superseded by across in dplyr 1.0.0 and later)
data %>%
  summarise_at(vars(A:C), ~ tail(Time[!is.na(.)], 1))

Weighted mean calculation in R with missing values

Does anyone know if it is possible to calculate a weighted mean in R when values are missing, and when values are missing, the weights for the existing values are scaled upward proportionately?
To convey this clearly, I created a hypothetical scenario. This describes the root of the question, where the scalar needs to be adjusted for each row, depending on which values are missing.
Image: Weighted Mean Calculation
File: Weighted Mean Calculation in Excel
Using weighted.mean from the base stats package with the argument na.rm = TRUE should get you the result you need. Here is a tidyverse way this could be done:
library(tidyverse)
scores <- tribble(
~student, ~test1, ~test2, ~test3,
"Mark", 90, 91, 92,
"Mike", NA, 79, 98,
"Nick", 81, NA, 83)
weights <- tribble(
~test, ~weight,
"test1", 0.2,
"test2", 0.4,
"test3", 0.4)
scores %>%
  gather(test, score, -student) %>%
  left_join(weights, by = "test") %>%
  group_by(student) %>%
  summarise(result = weighted.mean(score, weight, na.rm = TRUE))
#> # A tibble: 3 x 2
#>   student   result
#>     <chr>    <dbl>
#> 1    Mark 91.20000
#> 2    Mike 88.50000
#> 3    Nick 82.33333
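To make the rescaling explicit, here is the arithmetic for Mike's row done by hand and compared with weighted.mean. It is only an illustration; the variable names (keep, manual, builtin) are made up for it.

```r
scores  <- c(test1 = NA, test2 = 79, test3 = 98)      # Mike's scores
weights <- c(test1 = 0.2, test2 = 0.4, test3 = 0.4)

keep <- !is.na(scores)                                 # drop the missing test
# Dividing by the sum of the remaining weights (0.8) scales them up proportionately
manual <- sum(scores[keep] * weights[keep]) / sum(weights[keep])
builtin <- weighted.mean(scores, weights, na.rm = TRUE)

manual   # (79 * 0.4 + 98 * 0.4) / 0.8 = 88.5
builtin
```

With na.rm = TRUE, weighted.mean drops the missing scores together with their weights before normalising, which is exactly the proportional upscaling asked for.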
The best way to post an example dataset is to use dput(head(dat, 20)), where dat is the name of a dataset. Graphic images are a really bad choice for that.
DATA.
dat <-
structure(list(Test1 = c(90, NA, 81), Test2 = c(91, 79, NA),
Test3 = c(92, 98, 83)), .Names = c("Test1", "Test2", "Test3"
), row.names = c("Mark", "Mike", "Nick"), class = "data.frame")
w <-
structure(list(Test1 = c(18, NA, 27), Test2 = c(36.4, 39.5, NA
), Test3 = c(36.8, 49, 55.3)), .Names = c("Test1", "Test2", "Test3"
), row.names = c("Mark", "Mike", "Nick"), class = "data.frame")
CODE.
You can use the function weighted.mean from the base package stats together with sapply for this. Note that if your datasets of scores and weights are R objects of class matrix, you will not need unlist.
sapply(seq_len(nrow(dat)), function(i){
  weighted.mean(unlist(dat[i, ]), unlist(w[i, ]), na.rm = TRUE)
})
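If the weights are the same for every row, a single weight vector can be passed through apply() instead of a second data frame. This is a sketch under that assumption; the vector wts (0.2, 0.4, 0.4) is borrowed from the other answer and is not part of the data above.

```r
dat <- data.frame(Test1 = c(90, NA, 81),
                  Test2 = c(91, 79, NA),
                  Test3 = c(92, 98, 83),
                  row.names = c("Mark", "Mike", "Nick"))
wts <- c(0.2, 0.4, 0.4)   # assumed fixed weights, one per test

# Apply weighted.mean to each row; w and na.rm are passed on to it
res <- apply(as.matrix(dat), 1, weighted.mean, w = wts, na.rm = TRUE)
res
```

The result is a numeric vector named after the rows of dat.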

Multiple columns processing and dynamically naming new columns

Variables are mistakenly being entered into multiple columns (e.g. "aaa_1", "aaa_2" and "aaa_3", or "ccc_1", "ccc_2" and "ccc_3"). I need to create single new columns (e.g. "aaa" or "ccc"). Some variables are currently in a single column ("hhh_1"), but more columns may be added (hhh_2 etc.).
This is what I got:
aaa_1 <- c(43, 23, 65, NA, 45)
aaa_2 <- c(NA, NA, NA, NA, NA)
aaa_3 <- c(NA, NA, 92, NA, 82)
ccc_1 <- c("fra", NA, "spa", NA, NA)
ccc_2 <- c(NA, NA, NA, "wez", NA)
ccc_3 <- c(NA, "ija", NA, "fda", NA)
ccc_4 <- c(NA, NA, NA, NA, NA)
hhh_1 <- c(183, NA, 198, NA, 182)
dataf1 <- data.frame(aaa_1,aaa_2,aaa_3,ccc_1,ccc_2, ccc_3,ccc_4,hhh_1)
This is what I want:
aaa <- c(43, 23, NA, NA, NA)
ccc <- c("fra", "ija", "spa", NA, NA)
hhh <- c(183, NA, 198, NA, 182)
dataf2 <- data.frame(aaa,ccc,hhh)
A general solution is needed, as there are ~100 variables (e.g. "aaa", "hhh", "ccc", "ttt", "eee" etc.).
Thanks!
This is a base solution, i.e. no packages.
First define get_only, which, when given a list, converts it to a data.frame and applies get_only to each row. When given a vector, it returns the single non-NA value in it, or NA if there is not exactly one.
Define root to be the column names without the suffixes.
Convert the data frame to a list of columns, group them by root and apply get_only to each such group.
Finally, convert the resulting list to a data frame.
get_only <- function(x) UseMethod("get_only")
get_only.list <- function(x) apply(data.frame(x), 1, get_only)
get_only.default <- function(x) if (sum(!is.na(x)) == 1) na.omit(x) else NA
root <- sub("_.*", "", names(dataf1))
as.data.frame(lapply(split(as.list(dataf1), root), FUN = get_only))
giving:
  age country hight
1  43     fra   183
2  23     ija    NA
3  NA     spa   198
4  NA    <NA>    NA
5  NA    <NA>   182
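To illustrate what the vector method does on a single row's values, here is a small sketch. The two function definitions are copied from the answer above; the example vectors and the names one_value, two_values, no_values are made up.

```r
get_only <- function(x) UseMethod("get_only")
get_only.default <- function(x) if (sum(!is.na(x)) == 1) na.omit(x) else NA

one_value  <- as.vector(get_only(c(NA, 23, NA)))  # exactly one non-NA value: it is returned
two_values <- get_only(c(65, NA, 92))             # two non-NA values: NA
no_values  <- get_only(c(NA, NA, NA))             # no value at all: NA

# Column roots: everything from the first underscore onwards is dropped
root <- sub("_.*", "", c("aaa_1", "aaa_3", "ccc_2", "hhh_1"))
```

This is why row 3 comes out as NA in the first column: both the _1 and _3 columns hold a value there, so there is no single value to keep.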
We may try with splitstackshape
library(splitstackshape)
nm1 <- sub("_\\d+", "", names(dataf1))
tbl <- table(nm1) > 1
merged.stack(dataf1, var.stubs = names(tbl)[tbl], sep="_")
I'm not sure your example is right. For example in the third row you've got values for both age_1 and age_3, then in the desired output NA for that row.
If I've understood what you're trying to do though, it will be much easier if you transpose columns to rows, fix them and then transpose back again. Try this as a starting point, using dplyr and tidyr from the 'tidyverse'.
library(tidyverse)
library(stringr)
age_1 <- c(43, 23, 65, NA, 45)
age_2 <- c(NA, NA, NA, NA, NA)
age_3 <- c(NA, NA, 92, NA, 82)
country_1 <- c("fra", NA, "spa", NA, NA)
country_2 <- c(NA, NA, NA, "wez", NA)
country_3 <- c(NA, "ija", NA, "fda", NA)
country_4 <- c(NA, NA, NA, NA, NA)
hight_1 <- c(183, NA, 198, NA, 182)
dataf1 <- data.frame(age_1,age_2,age_3,country_1,country_2, country_3,country_4,hight_1)
data <- dataf1 %>%
  mutate(row_num = row_number()) %>%            #create a row number to track values
  gather(key, value, -row_num) %>%              #flatten your data
  drop_na() %>%                                 #drop na rows
  mutate(key = str_replace(key, "_.", "")) %>%  #remove the '_x' part of names
  group_by(row_num) %>%
  top_n(1) %>%
  spread(key, value)                            #pivot back to columns
For your example you need the group_by() and top_n() lines to make it run because you've got multiple values in the same row. If you only have one value (as I think you should?) then you can remove these two lines. It will be better without them because then it won't run if your data is wrong.
Edit following comment below. This will make any duplicated entries NA.
data <- dataf1 %>%
  mutate(row_num = row_number()) %>%                #create a row number to track values
  gather(key, value, -row_num) %>%                  #flatten your data
  drop_na() %>%                                     #drop na rows
  mutate(key = str_replace(key, "_.", "")) %>%      #remove the '_x' part of names
  group_by(row_num, key) %>%
  mutate(count = n()) %>%                           #count how many entries for each row/key combo
  mutate(value = ifelse(count > 1, NA, value)) %>%  #set NA for rows with duplicates
  drop_na() %>%
  spread(key, value) %>%                            #pivot back to columns
  select(-count)                                    #drop the `count` variable

Data frame reshape based on conditions

I have the following problem that I am struggling to solve.
I have a data frame that looks like this:
row1 = c(55.7, NA, NA, "inf", 4.19, 99, 4, 15, 16, NA, 13, 0.1, 0.8, 51, NA, 44)
row2 = c(13, 1, 81, 6, NA, 0.3, NA, NA, 1.4, 89, NA, NA, 2.1, 99, 0.5, NA)
df = data.frame(row1, row2)
df = as.data.frame(t(df))
The first problem is that I need to change the "inf" values to the number 100.
Nothing I have tried helps.
This creates additional NAs:
data[data =="inf"] = 100
This just doesn't work:
data[is.na(data)] = "Skip"
I expect it is because of the data types, but I cannot figure out how to fix it.
The second problem is more complex. I need to transform the data frame to match the columns with the highest values to the columns with the lowest values, to get something like this:
row3 = c("row1","V4", "V12")
row4 = c("row1", "V6", "V13")
df2 = data.frame(row3, row4)
df2 = t(df2)
And so on for all rows and columns.
The problem is that I cannot even find an approach to this task. If you can point me in a direction, that would be extremely valuable.
Thanks a lot
For your first problem try to convert your values to character:
df[]<-lapply(df, as.character)
df[df =="inf"] = "100"
Then convert it back to factor:
df[]<-lapply(df, as.factor)
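If numeric columns are wanted in the end, as the question asks, one possible variation is to convert to numeric instead of factor after the replacement. This is a sketch only; it assumes "inf" is the only non-numeric string in the data, and trimws() is there because transposing a mixed data frame can pad the numbers with spaces.

```r
row1 <- c(55.7, NA, NA, "inf", 4.19, 99, 4, 15, 16, NA, 13, 0.1, 0.8, 51, NA, 44)
row2 <- c(13, 1, 81, 6, NA, 0.3, NA, NA, 1.4, 89, NA, NA, 2.1, 99, 0.5, NA)
df <- as.data.frame(t(data.frame(row1, row2)), stringsAsFactors = FALSE)

df[] <- lapply(df, function(col) {
  col <- trimws(as.character(col))          # plain strings, no padding
  col[!is.na(col) & col == "inf"] <- "100"  # replace "inf" while still character
  as.numeric(col)                           # then convert the whole column
})
str(df)
```

Existing NA values stay NA, so no additional missing values are introduced by the replacement.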
