Combining columns in a dataframe each with partial information - r

I have a large data set which used different coding schemes for the same variables over different time periods. The coding in each time period is represented as a column with values during the year it was active and NA everywhere else.
I was able to "combine" them by using nested ifelse commands together with dplyr's mutate [see edit below], but I am running into a problem using ifelse to do something slightly different. I want to code a new variable based on whether ANY of the previous variables meets a condition. But for some reason, the ifelse construct below does not work.
MWE:
library("dplyr")
library("magrittr")
df <- data.frame(id = 1:12, year = c(rep(1995, 5), rep(1996, 5), rep(1997, 2)), varA = c("A","C","A","C","B",rep(NA,7)), varB = c(rep(NA,5),"B","A","C","A","B",rep(NA,2)))
df %>% mutate(varC = ifelse(varA == "C" | varB == "C", "C", "D"))
Output:
> df
id year varA varB varC
1 1 1995 A <NA> <NA>
2 2 1995 C <NA> C
3 3 1995 A <NA> <NA>
4 4 1995 C <NA> C
5 5 1995 B <NA> <NA>
6 6 1996 <NA> B <NA>
7 7 1996 <NA> A <NA>
8 8 1996 <NA> C C
9 9 1996 <NA> A <NA>
10 10 1996 <NA> B <NA>
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
If I don't use the | operator, and test against only varA, it will come out with the results as expected, but it will only apply to those years that varA is not NA.
Output:
> df %<>% mutate(varC = ifelse(varA == "C", "C", "D"))
> df
id year varA varB varC
1 1 1995 A <NA> D
2 2 1995 C <NA> C
3 3 1995 A <NA> D
4 4 1995 C <NA> C
5 5 1995 B <NA> D
6 6 1996 <NA> B <NA>
7 7 1996 <NA> A <NA>
8 8 1996 <NA> C <NA>
9 9 1996 <NA> A <NA>
10 10 1996 <NA> B <NA>
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
Desired output:
> df
id year varA varB varC
1 1 1995 A <NA> D
2 2 1995 C <NA> C
3 3 1995 A <NA> D
4 4 1995 C <NA> C
5 5 1995 B <NA> D
6 6 1996 <NA> B D
7 7 1996 <NA> A D
8 8 1996 <NA> C C
9 9 1996 <NA> A D
10 10 1996 <NA> B D
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
How do I get what I'm looking for?
To make this question more applicable to a wider audience, and to learn from this situation, it would be great to have an explanation of what is happening with the comparison using | that causes it not to work as expected. Thanks in advance!
EDIT: This is what I meant by successfully combining them with nested ifelses
> df %>% mutate(varC = ifelse(year == 1995, as.character(varA),
+ ifelse(year == 1996, as.character(varB), NA)))
id year varA varB varC
1 1 1995 A <NA> A
2 2 1995 C <NA> C
3 3 1995 A <NA> A
4 4 1995 C <NA> C
5 5 1995 B <NA> B
6 6 1996 <NA> B B
7 7 1996 <NA> A A
8 8 1996 <NA> C C
9 9 1996 <NA> A A
10 10 1996 <NA> B B
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>

R has this annoying tendency where NA means "unknown", so the logical value of a condition that involves NA is usually NA, rather than TRUE or FALSE.
i.e. NA > 0 is NA rather than FALSE.
You only get a non-NA result when the known operand settles the answer by itself, i.e. TRUE | NA is TRUE, but TRUE & NA is NA.
The same goes for FALSE, i.e. FALSE & NA is FALSE, but FALSE | NA is NA.
In effect, NA is like a logical value between TRUE and FALSE, e.g. NA | TRUE | FALSE is TRUE. In your data, varA == "C" | varB == "C" is NA whenever one column is NA and the other is not "C", and ifelse returns NA when its test is NA.
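These rules can be checked directly at the console. The following is a minimal sketch in base R; the last lines reuse the first row of the question's data (varA is "A", varB is NA) to show why the original ifelse call returns NA there.

```r
# Comparisons with NA are NA ("unknown"), not FALSE
is.na(NA > 0)   # TRUE

# | and & give a non-NA value only when the known operand decides the result
TRUE | NA       # TRUE
TRUE & NA       # NA
FALSE | NA      # NA
FALSE & NA      # FALSE

# First row of the question's data
varA <- "A"
varB <- NA_character_
test <- varA == "C" | varB == "C"   # FALSE | NA, which is NA
ifelse(test, "C", "D")              # NA, because the test itself is NA
```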
So here's a way to hack this:
ifelse((varA=='C'&!is.na(varA))|(varB=='C'&!is.na(varB)), 'C', 'D')
How do we interpret this? On the left side of the OR, we have the following: If varA is NA, then we have NA&FALSE. Since NA is one step above FALSE in the hierarchy of logicals, the & is going to force the whole thing to be FALSE. Otherwise, if varA is not NA but it's not 'C', you'll have FALSE&TRUE which gives FALSE as you want. Otherwise, if it's 'C', they're both true. Same goes for the thing on the right of the OR.
When using a condition that involves x, but x can be NA, I like to use
((condition for x)&!is.na(x)) to completely rule out the NA output and force the TRUE or FALSE values in the situations I want.
EDIT: I just remembered that you want an NA output if they're both NA. This doesn't end up doing it, so that's my bad. Unless you're okay with a 'D' output when they're both NA.
EDIT2: This should output the NAs as you want:
ifelse(is.na(varA)&is.na(varB), NA, ifelse((varA=='C'&!is.na(varA))|(varB=='C'&!is.na(varB)), 'C','D'))

Per @Khashaa's comment, this should do the trick and get you to the desired output.
df %>%
mutate(varC = ifelse(is.na(varA) & is.na(varB), NA,
ifelse(varA %in% "C" | varB %in% "C", "C", "D")))
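For anyone who wants to check the result without dplyr, here is a base R sketch of the same logic on the question's data. The helper names both_missing and any_c are made up for illustration, and stringsAsFactors = FALSE is an assumption so that the columns are plain character vectors.

```r
df <- data.frame(
  id   = 1:12,
  year = c(rep(1995, 5), rep(1996, 5), rep(1997, 2)),
  varA = c("A", "C", "A", "C", "B", rep(NA, 7)),
  varB = c(rep(NA, 5), "B", "A", "C", "A", "B", rep(NA, 2)),
  stringsAsFactors = FALSE
)

# TRUE only where neither coding scheme has a value
both_missing <- is.na(df$varA) & is.na(df$varB)

# %in% never returns NA, so a missing value simply counts as "not C"
any_c <- df$varA %in% "C" | df$varB %in% "C"

df$varC <- ifelse(both_missing, NA, ifelse(any_c, "C", "D"))
df$varC
```

The reason this works where == does not is that NA %in% "C" is FALSE, so the | never sees an NA operand.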

Related

Adding values to columns based on multiple conditions

I have 1 df as below
df <- data.frame(n1 = c(1,2,1,2,5,6,8,9,8,8),
n2 = c(100,1000,500,1,NA,NA,2,8,10,15),
n3 = c("a", "a", "a", NA, "b", "c",NA,NA,NA,NA),
n4 = c("red", "red", NA, NA, NA, NA,NA,NA,NA,NA))
df
n1 n2 n3 n4
1 1 100 a red
2 2 1000 a red
3 1 500 a <NA>
4 2 1 <NA> <NA>
5 5 NA b <NA>
6 6 NA c <NA>
7 8 2 <NA> <NA>
8 9 8 <NA> <NA>
9 8 10 <NA> <NA>
10 8 15 <NA> <NA>
First, please see my desired output
df
n1 n2 n3 n4
1 1 100 a red
2 2 1000 a red
3 1 500 a red
4 2 1 <NA> red
5 5 NA b <NA>
6 6 NA c <NA>
7 8 2 <NA> red
8 9 8 <NA> red
9 8 10 <NA> red
10 8 15 <NA> red
I made this post before (Adding values to one columns based on conditions). However, I realized that I need to take one more column to solve my problem.
So, I would like to fill in red in n4 based on conditions coming from n1, n2 and n3. If n3 == "a", then the rows whose n1 values are associated with "a" should get red in n4 (i.e. rows 3 and 4). At the same time, if a value of n1 also matches a value of n2 (e.g. 2), then n4 in that row should also get red. Further, 8 in column n1 is connected to the whole chain in the same way. Then, if further values of n2 or n1 are equal to 8, the step is repeated as before. I hope this is clear; if not, I am happy to explain more. (It sounds like a zigzag thing.)
- Note: tidyverse and base R solutions are both welcome.
Any suggestions for me please?
You can try the code below if you are using igraph
library(igraph)
res <- do.call(
rbind,
lapply(
decompose(
graph_from_data_frame(replace(df, is.na(df), "NA"))
),
function(x) {
n4 <- E(x)$n4
if (!all(n4 == "NA")) {
E(x)$n4 <- unique(n4[n4 != "NA"])
}
get.data.frame(x)
}
)
)
dfout <- type.convert(
res[match(do.call(paste, df[1:2]), do.call(paste, res[1:2])), ],
as.is = TRUE
)
which gives
> dfout
from to n3 n4
1 1 100 a red
2 2 1000 a red
3 1 500 a red
4 2 1 <NA> red
9 5 NA b <NA>
10 6 NA c <NA>
5 8 2 <NA> red
6 9 8 <NA> red
7 8 10 <NA> red
8 8 15 <NA> red
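As I read the igraph answer, it treats every row as an edge between n1 and n2, splits the graph into connected components, and copies the one known n4 value to all rows in a component. The same idea can be sketched in base R with simple label propagation. The names from, to and comp are made up for illustration, and unlike the igraph code this sketch does not treat a missing n2 as a node of its own.

```r
df <- data.frame(
  n1 = c(1, 2, 1, 2, 5, 6, 8, 9, 8, 8),
  n2 = c(100, 1000, 500, 1, NA, NA, 2, 8, 10, 15),
  n3 = c("a", "a", "a", NA, "b", "c", NA, NA, NA, NA),
  n4 = c("red", "red", NA, NA, NA, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

from <- as.character(df$n1)
to   <- as.character(df$n2)   # NA stays NA

# Start with one component per row, then merge rows that share a value
comp <- seq_len(nrow(df))
repeat {
  old <- comp
  for (v in unique(c(from, to[!is.na(to)]))) {
    rows <- which(from == v | (!is.na(to) & to == v))
    comp[comp %in% comp[rows]] <- min(comp[rows])
  }
  if (identical(comp, old)) break   # no more merges, so we are done
}

# Within each component, copy the single known n4 value to every row
for (k in unique(comp)) {
  idx  <- comp == k
  vals <- unique(df$n4[idx & !is.na(df$n4)])
  if (length(vals) == 1) df$n4[idx] <- vals
}
df
```

Components with no known n4 value, or with more than one distinct value, are left untouched; what should happen in the second case is not stated in the question.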

Spreading data that is grouped by ID but having different observations

I have this data:
drugData <- data.frame(caseID=c(9, 9, 10, 11, 12, 12, 12, 12, 13, 45, 45, 225),
Drug=c("Cocaine", "Cocaine", "DPT", "LSD", "Cocaine", "LSD", "Heroin","Heroin", "LSD", "DPT", "DPT", "Heroin"),
County=c("A", "A", "B", "C", "D", "D", "D","D", "E", "F", "F", "G"),
Date=c(2009, 2009, 2009, 2009, 2011, 2011, 2011, 2011, 2010, 2010, 2010, 2005))
"CaseID" rows make up a single case, which may have observations of all the same drug, or different types of drugs. I want this data to look like the following:
CaseID Drug.1 Drug.2 Drug.3 Drug.4 County Date
9 Cocaine Cocaine NA NA A 2009
10 DPT NA NA NA B 2009
11 LSD NA NA NA C 2009
12 Cocaine LSD Heroin Heroin D 2011
13 LSD NA NA NA E 2010
45 DPT DPT NA NA F 2010
225 Heroin NA NA NA G 2005
I've tried using tidyr's spread function but can't seem to quite get this to work.
We can pivot to wide format after creating a sequence column based on 'caseID'
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
drugData %>%
mutate(nm = str_c('Drug', rowid(caseID))) %>%
pivot_wider(names_from = nm, values_from = Drug)
#A tibble: 7 x 7
# caseID County Date Drug1 Drug2 Drug3 Drug4
# <dbl> <fct> <dbl> <fct> <fct> <fct> <fct>
#1 9 A 2009 Cocaine Cocaine <NA> <NA>
#2 10 B 2009 DPT <NA> <NA> <NA>
#3 11 C 2009 LSD <NA> <NA> <NA>
#4 12 D 2011 Cocaine LSD Heroin Heroin
#5 13 E 2010 LSD <NA> <NA> <NA>
#6 45 F 2010 DPT DPT <NA> <NA>
#7 225 G 2005 Heroin <NA> <NA> <NA>
Or with spread (spread is superseded by pivot_wider)
drugData %>%
mutate(nm = str_c('Drug', rowid(caseID))) %>%
spread(nm, Drug)
Or using data.table
dcast(setDT(drugData), caseID + County + Date ~
paste0('Drug', rowid(caseID)), value.var = 'Drug')
# caseID County Date Drug1 Drug2 Drug3 Drug4
#1: 9 A 2009 Cocaine Cocaine <NA> <NA>
#2: 10 B 2009 DPT <NA> <NA> <NA>
#3: 11 C 2009 LSD <NA> <NA> <NA>
#4: 12 D 2011 Cocaine LSD Heroin Heroin
#5: 13 E 2010 LSD <NA> <NA> <NA>
#6: 45 F 2010 DPT DPT <NA> <NA>
#7: 225 G 2005 Heroin <NA> <NA> <NA>
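The same two steps (number the rows within each caseID, then go wide on that number) can also be sketched in base R with ave and reshape. This is only an illustration of the idea, not a replacement for the answers above; the column name nm is borrowed from them, and stringsAsFactors = FALSE is an assumption.

```r
drugData <- data.frame(
  caseID = c(9, 9, 10, 11, 12, 12, 12, 12, 13, 45, 45, 225),
  Drug   = c("Cocaine", "Cocaine", "DPT", "LSD", "Cocaine", "LSD",
             "Heroin", "Heroin", "LSD", "DPT", "DPT", "Heroin"),
  County = c("A", "A", "B", "C", "D", "D", "D", "D", "E", "F", "F", "G"),
  Date   = c(2009, 2009, 2009, 2009, 2011, 2011, 2011, 2011, 2010, 2010, 2010, 2005),
  stringsAsFactors = FALSE
)

# Running counter within each case: 1, 2, ... in original row order
drugData$nm <- ave(seq_along(drugData$caseID), drugData$caseID, FUN = seq_along)

# One row per case, one Drug column per counter value
wide <- reshape(drugData,
                idvar = c("caseID", "County", "Date"),
                timevar = "nm",
                direction = "wide",
                sep = "")
wide
```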

I need to add several rows together based on the fact that they have something in common with another row

Using the information on hand I need to predict how much of a particular product we need next month. I have several months' worth of data going back, however the data is separated both by VPN and by a separate warehouse number. I just need to know how much to order in general and ignore the warehouse separation. We'll be adding that back in later.
There are multiple duplicates of many of the VPNs and I would like to consolidate all the duplicates and also sum the numbers that have been separated.
VPN Month To Date December November October September August July June May April March
0A36227-AA 15 6 4 2 NA 4 6 4 2 <NA> 4
0A36227-AA NA 1 NA NA NA NA 1 <NA> <NA> <NA> <NA>
0A36227-AA 2 3 1 NA 2 3 3 1 <NA> 2 3
0A36258-AA NA NA NA 1 NA NA <NA> <NA> 1 <NA> <NA>
0A36258-AA 1 NA 1 NA NA NA <NA> 1 <NA> <NA> <NA>
0A36258-AA NA NA NA 1 NA NA <NA> <NA> 1 <NA> <NA>
0A36258-AA 1 NA NA NA NA NA <NA> <NA> <NA> <NA> <NA>
So I want to combine all the duplicates and add all the numbers from the rows into just one row per VPN.
I've tried using the aggregate function and it didn't work for me. I may have used it wrong though.
Any help would be appreciated!
Also, there are some cases where it may cause an infinite number to show up. If anyone has any further advice on how to handle that, it would be welcome.
You basically want to know how to perform a sum while grouping in your data frame.
You will find plenty of answers.
I have a data.table solution for your case:
plouf <- read.table(text = " VPN Month.To.Date December November October September August July June May April March
0A36227-AA 15 6 4 2 NA 4 6 4 2 <NA> 4
0A36227-AA NA 1 NA NA NA NA 1 <NA> <NA> <NA> <NA>
0A36227-AA 2 3 1 NA 2 3 3 1 <NA> 2 3
0A36258-AA NA NA NA 1 NA NA <NA> <NA> 1 <NA> <NA>
0A36258-AA 1 NA 1 NA NA NA <NA> 1 <NA> <NA> <NA>
0A36258-AA NA NA NA 1 NA NA <NA> <NA> 1 <NA> <NA>
0A36258-AA 1 NA NA NA NA NA <NA> <NA> <NA> <NA> <NA>",
stringsAsFactors = FALSE, header = TRUE)
here is the code
library(data.table)
DT <- setDT(plouf)
tochange <- names(DT)[!names(DT) %in% "VPN"]
here the tochange vector is the list of the columns you want to sum
DT[,c(tochange) := lapply(.SD,function(x){as.numeric(x)}),.SDcols = tochange]
DT[,lapply(.SD,function(x){sum(x,na.rm = TRUE)}),.SDcols = tochange,by = VPN]
The first line sets everything to numeric.
The second line performs the sum, ignoring the NAs and grouping by VPN. I am not 100% sure that is what you wanted.
VPN Month.To.Date December November October September August July June May April March
1: 0A36227-AA 17 10 5 2 2 7 10 5 2 2 7
2: 0A36258-AA 2 0 1 2 0 0 0 1 2 0 0
I hope it helps
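here is the dplyr equivalent (across() needs dplyr 1.0.0 or later; it replaces the older mutate_at/summarise_at with funs())
library(dplyr)
plouf %>%
mutate(across(all_of(tochange), as.numeric)) %>%
group_by(VPN) %>%
summarise(across(all_of(tochange), ~ sum(.x, na.rm = TRUE)))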
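Since the question mentions trying aggregate, here is a small base R sketch of the same grouped sum. It uses only the December and November columns from the question to keep it short, and the data frame name sales is made up for illustration.

```r
sales <- data.frame(
  VPN = c(rep("0A36227-AA", 3), rep("0A36258-AA", 4)),
  December = c(6, 1, 3, NA, NA, NA, NA),
  November = c(4, NA, 1, NA, 1, NA, NA),
  stringsAsFactors = FALSE
)

# One row per VPN; na.rm = TRUE is passed through to sum()
totals <- aggregate(sales[c("December", "November")],
                    by = list(VPN = sales$VPN),
                    FUN = sum, na.rm = TRUE)
totals
```

Note that sum() over a group containing nothing but NA returns 0 here. Functions such as max() or min() return -Inf or Inf in that situation, which may be where infinite values come from.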

How can I transform a long-formated data frame to a wide-formated one with multiple values within a cell in R?

Here is the original df:
area sector item
1 East A <NA>
2 South A Baidu
3 South A Tencent
4 West A <NA>
5 North A <NA>
6 East B Microsoft
7 East B Google
8 East B Facebook
9 South B <NA>
10 West B <NA>
11 North B <NA>
12 East C <NA>
13 South C <NA>
14 West C <NA>
15 North C Alibaba
16 East D <NA>
17 South D <NA>
18 West D Amazon
19 North D <NA>
20 East E <NA>
21 South E <NA>
22 West E <NA>
23 North E <NA>
How can I transform the above df to the following one? Some cells in the transformed df have multiple items from the original df.
Sector East South West North
1 A <NA> "Baidu, Tencent" <NA> <NA>
2 B "Microsoft, Google, Facebook" <NA> <NA> <NA>
3 C <NA> <NA> <NA> "Alibaba"
4 D <NA> <NA> "Amazon" <NA>
5 E <NA> <NA> <NA> <NA>
A quick solution could be to use the toString function while transforming from long to wide using the reshape2 package
reshape2::dcast(df, sector ~ area, toString)
#Using item as value column: use value.var to override.
# sector East North South West
# 1 A <NA> <NA> Baidu, Tencent <NA>
# 2 B Microsoft, Google, Facebook <NA> <NA> <NA>
# 3 C <NA> Alibaba <NA> <NA>
# 4 D <NA> <NA> <NA> Amazon
# 5 E <NA> <NA> <NA> <NA>
This is almost a dupe of another question, but most of the solutions there won't work for this case; it can still give you some ideas.
And just for fun, here is a base solution:
reshape(aggregate(item ~ area + sector, data = df, paste, collapse = ","),
idvar = "sector", timevar = "area", direction = "wide")
sector item.East item.North item.South item.West
1 A <NA> <NA> Baidu,Tencent <NA>
5 B Microsoft,Google,Facebook <NA> <NA> <NA>
9 C <NA> Alibaba <NA> <NA>
13 D <NA> <NA> <NA> Amazon
17 E <NA> <NA> <NA> <NA>
Here is an option with dplyr/tidyr
library(dplyr)
library(tidyr)
df %>%
group_by(area, sector) %>%
summarise(item = toString(item)) %>%
spread(area, item)
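If an actual NA (rather than the string "NA") is wanted in the empty cells, the collapsing function has to drop missing items first. A minimal base R sketch using tapply, on sectors A and B of the question's data only; the helper name collapse_items is made up for illustration.

```r
df <- data.frame(
  area   = c("East", "South", "South", "West", "North",
             "East", "East", "East", "South", "West", "North"),
  sector = c(rep("A", 5), rep("B", 6)),
  item   = c(NA, "Baidu", "Tencent", NA, NA,
             "Microsoft", "Google", "Facebook", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Join the non-missing items of a cell; keep NA when there are none
collapse_items <- function(x) {
  if (all(is.na(x))) NA_character_ else toString(x[!is.na(x)])
}

# Character matrix with sectors as rows and areas as columns
wide <- tapply(df$item, list(sector = df$sector, area = df$area), collapse_items)
wide
```

The result is a matrix rather than a data frame; as.data.frame(wide) converts it if needed.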

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data): the Ctry column indicates the names of the countries in my data frame. If the number of NAs in any column (for example, Carx) is larger than 3 for a country, I want to drop that country from my data frame. For example,
Country A has 2 NA
Country B has 4 NA
Country C has 3 NA
I want to drop country B in my data frame. I have a data frame like this (This is for illustration, my data frame is actually very huge):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave, to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use by() function to group by Ctry and count NA's of each group :
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data=DF$Carx,INDICES=DF$Ctry,FUN=function(x)sum(is.na(x)))
validCtry <-names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT :
if you have more columns to check, you could adapt the previous code as follows:
res <- by(data=DF,INDICES=DF$Ctry,
FUN=function(x){
return(sum(is.na(x$Carx)) <= 3 &&
sum(is.na(x$Barx)) <= 3 &&
sum(is.na(x$Tarx)) <= 3)
})
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see if it's perhaps faster than the solutions in base R. If the other solutions are fast enough, just ignore this one.
library(dplyr)
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)

Resources