Adding values to columns based on multiple conditions - r

I have a data frame df as below:
df <- data.frame(n1 = c(1, 2, 1, 2, 5, 6, 8, 9, 8, 8),
                 n2 = c(100, 1000, 500, 1, NA, NA, 2, 8, 10, 15),
                 n3 = c("a", "a", "a", NA, "b", "c", NA, NA, NA, NA),
                 n4 = c("red", "red", NA, NA, NA, NA, NA, NA, NA, NA))
df
n1 n2 n3 n4
1 1 100 a red
2 2 1000 a red
3 1 500 a <NA>
4 2 1 <NA> <NA>
5 5 NA b <NA>
6 6 NA c <NA>
7 8 2 <NA> <NA>
8 9 8 <NA> <NA>
9 8 10 <NA> <NA>
10 8 15 <NA> <NA>
First, please see my desired output
df
n1 n2 n3 n4
1 1 100 a red
2 2 1000 a red
3 1 500 a red
4 2 1 <NA> red
5 5 NA b <NA>
6 6 NA c <NA>
7 8 2 <NA> red
8 9 8 <NA> red
9 8 10 <NA> red
10 8 15 <NA> red
I made this post before (Adding values to one columns based on conditions). However, I realized that I need to take one more column into account to solve my problem.
So, I would like to update/add "red" in n4 based on conditions coming from n1, n2 and n3. If n3 == "a", then every row whose n1 value is associated with "a" should also get "red" in n4 (i.e. rows 3 and 4). At the same time, if a value of n1 also matches a value of n2 (i.e. 2), then that row of n4 should get "red" as well. Further, 8 in column n1 is connected to the same chain in this way, so whenever further values of n1 or n2 equal 8, the same step is repeated. I hope this is clear; if not, I am happy to explain more. (It works like a zig-zag through the connected values.)
Note: tidyverse and base R solutions are both welcome here.
Any suggestions for me please?

You can try the code below if you are using igraph. It treats each row's n1 and n2 as the two endpoints of an edge, decomposes the resulting graph into connected components, and fills n4 within each component with the non-NA value found there:
library(igraph)

res <- do.call(
  rbind,
  lapply(
    decompose(
      graph_from_data_frame(replace(df, is.na(df), "NA"))
    ),
    function(x) {
      n4 <- E(x)$n4
      if (!all(n4 == "NA")) {
        E(x)$n4 <- unique(n4[n4 != "NA"])
      }
      get.data.frame(x)
    }
  )
)

dfout <- type.convert(
  res[match(do.call(paste, df[1:2]), do.call(paste, res[1:2])), ],
  as.is = TRUE
)
which gives
> dfout
from to n3 n4
1 1 100 a red
2 2 1000 a red
3 1 500 a red
4 2 1 <NA> red
9 5 NA b <NA>
10 6 NA c <NA>
5 8 2 <NA> red
6 9 8 <NA> red
7 8 10 <NA> red
8 8 15 <NA> red
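If you want dfout to carry the original column names, a small optional follow-up (not part of the answer above, just a convenience) is to rename from/to back to n1/n2; the match() call has already restored the original row order:
names(dfout)[1:2] <- c("n1", "n2")  # graph_from_data_frame renamed the first two columns to from/to
rownames(dfout) <- NULL             # drop the row names carried over from rbind
dfout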


remove rows with >50% missing across certain columns in R

here is my data:
data <- data.frame(id = c(1, 2, 3, 4, 5),
                   ethnicity = c("asian", NA, NA, NA, "asian"),
                   age = c(34, NA, NA, NA, 65),
                   a1 = c(3, 4, 5, 2, 7),
                   a2 = c("y", "y", "y", NA, NA),
                   a3 = c("low", NA, "high", "med", NA),
                   a4 = c("green", NA, "blue", "orange", NA))
id ethnicity age a1 a2 a3 a4
1 asian 34 3 y low green
2 <NA> NA 4 y <NA> <NA>
3 <NA> NA 5 y high blue
4 <NA> NA 2 <NA> med orange
5 asian 65 7 <NA> <NA> <NA>
I would like to remove rows that have >50% missing in columns a1 to a4.
I have tried the code below, but I am having trouble specifying the columns that I want this to apply to:
data[which(rowMeans(!is.na(data)) > 0.5), ]  # This doesn't specify the columns

miss2 <- c()
for (i in 1:nrow(data)) {
  if (length(which(is.na(data[4:7, ]))) >= 0.5 * ncol(data)) miss2 <- append(miss2, 4:7)
}
data1 <- data[-miss2, ]
# I thought I specified the columns here, but I'm not getting the output I was hoping for (i.e. id 4 doesn't show up)
The code above looks at missingness across all columns. I would like it to only consider the % missing in columns a1, a2, a3, a4. What I'm hoping to get is below:
id ethnicity age a1 a2 a3 a4
1 asian 34 3 y low green
2 <NA> NA 4 y <NA> <NA>
3 <NA> NA 5 y high blue
4 <NA> NA 2 <NA> med orange
Any help is appreciated, thank you!
You're really close; the main issue is using which (a vector of indices) instead of simply a logical vector:
keep <- rowMeans(is.na(data[,4:7])) <= 0.5
keep
[1] TRUE TRUE TRUE TRUE FALSE
data[keep,]
id ethnicity age a1 a2 a3 a4
1 1 asian 34 3 y low green
2 2 <NA> NA 4 y <NA> <NA>
3 3 <NA> NA 5 y high blue
4 4 <NA> NA 2 <NA> med orange
Just for fun, a dplyr approach: here we combine rowwise() with a condition directly inside filter(). We take the number of NAs across a1:a4, divide by the number of columns, and keep rows where that proportion is <= 0.5. For c_across() to combine the columns, all of a1:a4 have to be the same class, so a1 is converted to character first:
library(dplyr)

data %>%
  rowwise() %>%
  mutate(a1 = as.character(a1)) %>%
  filter(sum(is.na(c_across(a1:a4))) / length(c_across(a1:a4)) <= 0.5)
id ethnicity age a1 a2 a3 a4
<dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
1 1 asian 34 3 y low green
2 2 NA NA 4 y NA NA
3 3 NA NA 5 y high blue
4 4 NA NA 2 NA med orange
data[rowSums(is.na(data[, -c(1:3)])) / 4 <= .5, ]
#> id ethnicity age a1 a2 a3 a4
#> 1 1 asian 34 3 y low green
#> 2 2 <NA> NA 4 y <NA> <NA>
#> 3 3 <NA> NA 5 y high blue
#> 4 4 <NA> NA 2 <NA> med orange
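If you prefer to refer to the a1:a4 block by name rather than by position (a minor variation on the same idea, assuming the columns really are named a1 to a4), the same base R approach works:
cols <- paste0("a", 1:4)
data[rowMeans(is.na(data[cols])) <= 0.5, ]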

R strsplit for uneven number of columns in a huge data set

I have a huge data set with about 200 columns and 25k+ rows, with ';' as the separator. The number of columns per row is uneven.
I read it in as a delimited txt file with df <- read.delim("~path/data.txt", sep = ";", header = FALSE),
which reads nicely as a table.
My issue is that many of the rows are so long that in the txt file they spill onto new lines, and the import does not recognise that they should continue on the same row. Therefore the resulting columns contain information that belongs elsewhere.
Each observation of data is a dbl.
I have created a new example below for ease of reading, therefore it is not possible to simply sort conditions into columns.
***EDIT: x, y and z contain spatial coordinates, but I have substituted them for their corresponding letters for ease of reading.
The data is X-profile data giving me coordinates of the centre point along a line, followed by offsets of 1m (up to 100m either side of 0, the centre line) in each column with its corresponding height ***
My data ends up looking something like this:
[c1] [c2] [c3] [c4] [c5] [c6] [c7] [c8] [c9]
[1] x y z 1 2 3 N/A N/A N/A
[2] x y z 1 2 3 4 5 6
[3] 7 8 9 10 N/A N/A N/A N/A N/A
[4] x y z 1 2 3 4 5 7
[5] 7 8 9 N/A N/A N/A N/A N/A N/A
[6] x y z 1 2 3 N/A N/A N/A
[7] x y z 1 2 3 4 5 N/A
And I'd like it to look like this:
[c1] [c2] [c3] [c4] [c5] [c6] [c7] [c8] [c9] [c10] [c11] [c12] [c13]
[1] x y z 1 2 3 N/A N/A N/A N/A N/A N/A N/A
[2] x y z 1 2 3 4 5 6 7 8 9 10
[3] x y z 1 2 3 4 5 6 7 8 9 N/A
[4] x y z 1 2 3 N/A N/A N/A N/A N/A N/A N/A
[5] x y z 1 2 3 4 5 N/A N/A N/A N/A N/A
I have tried strsplit(as.character(df), split = "\n", fixed = TRUE) and it returns an error that it is not a character string. I have tried the same function with split = "\t" and split = "\r" and it returns the same error. Each attempt takes around half an hour to process so I was also wondering if there is a more efficient way to do this.
I hope I have explained clearly my aim.
EDIT
The text file is similar to the following example:
x;y;z;1;2;3
x;y;z;1;2;3;4;5;6;
7;8;9;10
x;y;z;1;2;3;4;5;6;
7;8;9
x;y;z;1;2;3;4
x;y;z;1;2;3;4;5;6;
7;8;9;10;11;12;13;
14;15
In some cases a number is split between the previous line and that below:
E.G.
101;102;103;10
4;105;106
This layout is exactly how it is being read into R.
Use scan, which omits empty lines by default. Next, find the positions of the lines that begin with "x" using grep, assign every line to its nearest preceding "x" line with findInterval, and paste each group together. Then it is basically the usual strsplit, some length adjustments etc., and you've got it.
r <- scan('foo.txt', what = 'A', quiet = TRUE)

r <- split(r, findInterval(seq_len(length(r)), grep('^x', r))) |>
  lapply(paste, collapse = '') |>
  lapply(strsplit, ';') |>
  lapply(el) |>
  {\(.) lapply(., `length<-`, max(lengths(.)))}() |>
  do.call(what = rbind) |>
  as.data.frame()
r
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
# 1 x y z 1 2 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 2 x y z 1 2 3 4 5 6 7 8 9 10 <NA> <NA> <NA> <NA> <NA>
# 3 x y z 1 2 3 4 5 6 7 8 9 <NA> <NA> <NA> <NA> <NA> <NA>
# 4 x y z 1 2 3 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 5 x y z 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Data:
writeLines(text='x;y;z;1;2;3
x;y;z;1;2;3;4;5;6;
7;8;9;10
x;y;z;1;2;3;4;5;6;
7;8;9
x;y;z;1;2;3;4
x;y;z;1;2;3;4;5;6;
7;8;9;10;11;12;13;
14;15', 'foo.txt')
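For the wrapped-number case mentioned in the question's edit (a value such as 104 split across "...;10" and "4;..."), the paste(collapse = '') step above is what repairs it: the fragments of a record are joined with no separator before strsplit runs. A quick check of just that step:
paste(c("101;102;103;10", "4;105;106"), collapse = "")
#> [1] "101;102;103;104;105;106"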
Using data.table (assuming the columns of df are named c1, c2, ... as in the example above):
library(data.table)

dt <- data.table(df)
dt[, grp := cumsum(c1 == "x")]
dt <- merge(dt[c1 == "x"], dt[c1 == "7"], by = "grp", all = TRUE)[, grp := NULL]
names(dt) <- paste0("c", 1:ncol(dt))
Resulting to:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18
1: x y z 1 2 3 NA NA NA NA NA NA NA NA NA NA NA NA
2: x y z 1 2 3 4 5 6 7 8 9 10 NA NA NA NA NA
3: x y z 1 2 3 4 5 7 7 8 9 NA NA NA NA NA NA
4: x y z 1 2 3 NA NA NA NA NA NA NA NA NA NA NA NA
5: x y z 1 2 3 4 5 NA NA NA NA NA NA NA NA NA NA

Find row of the next instance of the value in R

I have two columns Time and Event. There are two events A and B. Once an event A takes place, I want to find when the next event B occurs. Column Time_EventB is the desired output.
This is the data frame:
df <- data.frame(Event = sample(c("A", "B", ""), 20, replace = TRUE), Time = paste("t", seq(1,20)))
What is the code in R for finding the next instance of a value (B in this case)?
What is the code for once the instance of B is found, return the value of the corresponding Time Column?
The code should be something like this:
data$Time_EventB <- ifelse(data$Event == "A", <Code for returning time of next instance of B>, "")
In Excel this can be done using VLOOKUP.
Here's a simple solution:
set.seed(1)
df <- data.frame(Event = sample(c("A", "B", ""), size = 20, replace = TRUE), time = 1:20)

as <- which(df$Event == "A")
bs <- which(df$Event == "B")

next_b <- sapply(as, function(a) {
  diff <- bs - a
  if (all(diff < 0)) return(NA)
  bs[min(diff[diff > 0]) == diff]
})

df$next_b <- NA
df$next_b[as] <- df$time[next_b]
> df
Event time next_b
1 A 1 2
2 B 2 NA
3 B 3 NA
4 4 NA
5 A 5 8
6 6 NA
7 7 NA
8 B 8 NA
9 B 9 NA
10 A 10 14
11 A 11 14
12 A 12 14
13 13 NA
14 B 14 NA
15 15 NA
16 B 16 NA
17 17 NA
18 18 NA
19 B 19 NA
20 20 NA
Here's an attempt using a "rolling join" from the data.table package:
library(data.table)
setDT(df)
df[Event=="B", .(time, nextb=time)][df, on="time", roll=-Inf][Event != "A", nextb := NA][]
# time nextb Event
# 1: 1 2 A
# 2: 2 NA B
# 3: 3 NA B
# 4: 4 NA
# 5: 5 8 A
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA B
# 9: 9 NA B
#10: 10 14 A
#11: 11 14 A
#12: 12 14 A
#13: 13 NA
#14: 14 NA B
#15: 15 NA
#16: 16 NA B
#17: 17 NA
#18: 18 NA
#19: 19 NA B
#20: 20 NA
Using the data borrowed from @thc above.
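For completeness, here is a vectorized base R sketch (not from either answer above) that uses findInterval on the B times to find, for each A, the first B occurring after it; it assumes time is sorted increasingly, as in the example data:
b_times <- df$time[df$Event == "B"]
idx <- findInterval(df$time[df$Event == "A"], b_times) + 1   # index of the first B strictly after each A
df$next_b <- NA
df$next_b[df$Event == "A"] <- b_times[idx]                   # gives NA automatically when no later B exists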

r - data frame manipulation [duplicate]

This question already has answers here:
Reshape multiple value columns to wide format
(5 answers)
Closed 5 years ago.
Suppose I have this data frame:
df <- data.frame(ID = c("id1", "id1", "id1", "id2", "id2", "id3", "id3", "id3"),
                 Code = c("A", "B", "C", "A", "B", "A", "C", "D"),
                 Count = c(34, 65, 21, 3, 8, 12, 15, 16),
                 Value = c(3, 1, 8, 2, 3, 3, 5, 8))
that looks like this:
df
ID Code Count Value
1 id1 A 34 3
2 id1 B 65 1
3 id1 C 21 8
4 id2 A 3 2
5 id2 B 8 3
6 id3 A 12 3
7 id3 C 15 5
8 id3 D 16 8
I would like to obtain this result data frame:
result <- data.frame(Code = c("A", "B", "C", "D"),
                     id1_count = c(34, 65, 21, NA), id1_value = c(3, 1, 8, NA),
                     id2_count = c(3, 8, NA, NA), id2_value = c(2, 3, NA, NA),
                     id3_count = c(12, NA, 15, 16), id3_value = c(3, NA, 5, 8))
that looks like this:
> result
Code id1_count id1_value id2_count id2_value id3_count id3_value
1 A 34 3 3 2 12 3
2 B 65 1 8 3 NA NA
3 C 21 8 NA NA 15 5
4 D NA NA NA NA 16 8
Is there a one-liner in base R that can do that? I am able to achieve the result I need, but not in an idiomatic R way (i.e., I used loops and so on). Any help is appreciated. Thank you.
You can try dcast from the data.table package, which can take multiple value.var columns (when this was written that required the development version, v1.9.5; it is available in the CRAN release since v1.9.6):
library(data.table)
dcast(setDT(df), Code~ID, value.var=c('Count', 'Value'))
# Code Count_id1 Count_id2 Count_id3 Value_id1 Value_id2 Value_id3
#1: A 34 3 12 3 2 3
#2: B 65 8 NA 1 3 NA
#3: C 21 NA 15 8 NA 5
#4: D NA NA 16 NA NA 8
Or using reshape from base R
reshape(df, idvar='Code', timevar='ID', direction='wide')
# Code Count.id1 Value.id1 Count.id2 Value.id2 Count.id3 Value.id3
#1 A 34 3 3 2 12 3
#2 B 65 1 8 3 NA NA
#3 C 21 8 NA NA 15 5
#8 D NA NA NA NA 16 8
You could also try:
library(tidyr)
library(dplyr)
df %>%
  gather(key, value, -(ID:Code)) %>%
  unite(id_key, ID, key) %>%
  spread(id_key, value)
Which gives:
# Code id1_Count id1_Value id2_Count id2_Value id3_Count id3_Value
#1 A 34 3 3 2 12 3
#2 B 65 1 8 3 NA NA
#3 C 21 8 NA NA 15 5
#4 D NA NA NA NA 16 8
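On current versions of tidyr (>= 1.0), pivot_wider() supersedes gather()/spread(); here is a sketch of the same reshape, where names_glue is just one way to get column names in the id1_Count style:
library(tidyr)
pivot_wider(df, id_cols = Code, names_from = ID,
            values_from = c(Count, Value), names_glue = "{ID}_{.value}")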

Combining columns in a dataframe each with partial information

I have a large data set which used different coding schemes for the same variables over different time periods. The coding in each time period is represented as a column with values during the year it was active and NA everywhere else.
I was able to "combine" them by using nested ifelse commands together with dplyr's mutate [see edit below], but I am running into a problem using ifelse to do something slightly different. I want to code a new variable based on whether ANY of the previous variables meets a condition. But for some reason, the ifelse construct below does not work.
MWE:
library("dplyr")
library("magrittr")
df <- data.frame(id = 1:12, year = c(rep(1995, 5), rep(1996, 5), rep(1997, 2)), varA = c("A","C","A","C","B",rep(NA,7)), varB = c(rep(NA,5),"B","A","C","A","B",rep(NA,2)))
df %>% mutate(varC = ifelse(varA == "C" | varB == "C", "C", "D"))
Output:
> df
id year varA varB varC
1 1 1995 A <NA> <NA>
2 2 1995 C <NA> C
3 3 1995 A <NA> <NA>
4 4 1995 C <NA> C
5 5 1995 B <NA> <NA>
6 6 1996 <NA> B <NA>
7 7 1996 <NA> A <NA>
8 8 1996 <NA> C C
9 9 1996 <NA> A <NA>
10 10 1996 <NA> B <NA>
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
If I don't use the | operator, and test against only varA, it will come out with the results as expected, but it will only apply to those years that varA is not NA.
Output:
> df %<>% mutate(varC = ifelse(varA == "C", "C", "D"))
> df
id year varA varB varC
1 1 1995 A <NA> D
2 2 1995 C <NA> C
3 3 1995 A <NA> D
4 4 1995 C <NA> C
5 5 1995 B <NA> D
6 6 1996 <NA> B <NA>
7 7 1996 <NA> A <NA>
8 8 1996 <NA> C <NA>
9 9 1996 <NA> A <NA>
10 10 1996 <NA> B <NA>
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
Desired output:
> df
id year varA varB varC
1 1 1995 A <NA> D
2 2 1995 C <NA> C
3 3 1995 A <NA> D
4 4 1995 C <NA> C
5 5 1995 B <NA> D
6 6 1996 <NA> B D
7 7 1996 <NA> A D
8 8 1996 <NA> C C
9 9 1996 <NA> A D
10 10 1996 <NA> B D
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
How do I get what I'm looking for?
To make this question more applicable to a wider audience, and to learn from this situation, it would be great to have an explanation as to what is happening with the comparison using | that causes it not to work as expected. Thanks in advance!
EDIT: This is what I meant by successfully combining them with nested ifelses
> df %>% mutate(varC = ifelse(year == 1995, as.character(varA),
+ ifelse(year == 1996, as.character(varB), NA)))
id year varA varB varC
1 1 1995 A <NA> A
2 2 1995 C <NA> C
3 3 1995 A <NA> A
4 4 1995 C <NA> C
5 5 1995 B <NA> B
6 6 1996 <NA> B B
7 7 1996 <NA> A A
8 8 1996 <NA> C C
9 9 1996 <NA> A A
10 10 1996 <NA> B B
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
R has this annoying tendency where the logical value of a condition that involves NA is just NA, rather than TRUE or FALSE.
i.e. NA > 0 is NA rather than FALSE.
The way to think about it is that NA is an unknown truth value sitting between TRUE and FALSE: an expression involving NA only resolves when the answer would be the same whatever the unknown turns out to be. So TRUE | NA is TRUE and FALSE & NA is FALSE, but TRUE & NA and FALSE | NA are both NA.
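For illustration, these expressions evaluate as follows at the console:
NA > 0      # NA
TRUE | NA   # TRUE
TRUE & NA   # NA
FALSE | NA  # NA
FALSE & NA  # FALSE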
So here's a way to hack this:
ifelse((varA == 'C' & !is.na(varA)) | (varB == 'C' & !is.na(varB)), 'C', 'D')
How do we interpret this? Take the left side of the OR. If varA is NA, then varA == 'C' is NA but !is.na(varA) is FALSE, and NA & FALSE is FALSE, so the whole left side is FALSE. If varA is not NA but is not 'C', you have FALSE & TRUE, which gives FALSE as you want. And if it is 'C', both parts are TRUE. The same goes for the term on the right of the OR.
When a condition involves x and x can be NA, I like to use
((condition on x) & !is.na(x)) to rule out an NA result and force a TRUE or FALSE value in the situations I want.
EDIT: I just remembered that you want an NA output when both are NA. The expression above doesn't do that, so that's my bad. Unless you're okay with a 'D' output when both are NA.
EDIT2: This should output the NAs as you want:
ifelse(is.na(varA) & is.na(varB), NA,
       ifelse((varA == 'C' & !is.na(varA)) | (varB == 'C' & !is.na(varB)), 'C', 'D'))
Per @Khashaa's comment, this should do the trick and get you to the desired output:
df %>%
  mutate(varC = ifelse(is.na(varA) & is.na(varB), NA,
                       ifelse(varA %in% "C" | varB %in% "C", "C", "D")))
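As an alternative sketch (not from the answers above), dplyr::case_when() lets you state the both-NA case explicitly, which can read a little more directly:
df %>%
  mutate(varC = case_when(
    is.na(varA) & is.na(varB) ~ NA_character_,       # keep NA when both inputs are missing
    varA %in% "C" | varB %in% "C" ~ "C",             # %in% returns FALSE (not NA) for missing values
    TRUE ~ "D"
  ))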
