R - Creating conditional variables with loop

I have a dataset like this (but this is just a subset; the real dataset has hundreds of ID_Desc variables), where each data point has a person's gender, and whether they checked off a number of descriptors (1) or not (NA):
Gender ID1_Desc_1 ID1_Desc_2 ID1_Desc_3 ID2_Desc_1 ID2_Desc_2 ID2_Desc_3 ID3_Desc_1 ID3_Desc_2 ID3_Desc_3
1 NA NA 1 NA NA 1 NA NA NA
2 NA 1 1 NA NA NA 1 1 NA
1 1 1 1 NA 1 NA NA NA NA
I'm trying to write a loop that will (1) check their gender, (2) based on their gender, check whether they checked off the same descriptor in the first list they saw (lists ID1 and ID2 for Gender=1 and lists ID1 and ID3 for Gender=2), and (3) create a new variable (Same#) that indicates whether they checked off the same descriptor in both lists (by writing a 1) or not (by writing a 0).
I've been working with this code, which seems to be checking their gender ok and creating the new variables (Same#), but it's writing 0's for everything, which is not correct:
for (i in 1:3){
assign(paste("Same",i,sep=""),
ifelse(Gender=="1",
ifelse(paste("ID1_Desc_",i,sep="")==paste("ID2_Desc_",i,sep=""),1,0),
ifelse(paste("ID1_Desc_",i,sep="")==paste("ID3_Desc_",i,sep=""),1,0)
)
)
}
Based on the data I provided, the Same1, Same2 and Same3 values for the first person should be 0 0 1 (since Gender=1 and they chose Desc_3 in both the ID1 and ID2 lists), for the second person 0 1 0 (since Gender=2 and they chose Desc_2 in both the ID1 and ID3 lists), and for the third person 0 1 0 (since Gender=1 and they chose Desc_2 in both the ID1 and ID2 lists), but right now all 3 variables come out as 0 0 0.
I know using loops may not be the best way to do this, but I'd really like to know how to do it with loop if it's possible. If not, anything that works would be incredibly appreciated. Thanks.

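You may try this. Note that it counts a match between ID1 and either ID2 or ID3 and does not check Gender, which still gives the expected result for the sample data: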
ind1 <- grep("^ID1", colnames(df))
ind2 <- grep("^ID2", colnames(df))
ind3 <- grep("^ID3", colnames(df))
cond1 <- do.call(cbind,Map(`==` , df[ind1], df[ind2]))
cond2 <- do.call(cbind,Map(`==` , df[ind1], df[ind3]))
Finalind <- do.call(cbind, Map(`|`, as.data.frame(t(cond1)),
as.data.frame(t(cond2))))
res <- (!is.na(Finalind))+0
rownames(res) <- paste0("Same", 1:3)
t(res)
# Same1 Same2 Same3
#V1 0 0 1
#V2 0 1 0
#V3 0 1 0
cbind(df, t(res))
data
df <- structure(list(Gender = c(1L, 2L, 1L), ID1_Desc_1 = c(NA, NA,
1L), ID1_Desc_2 = c(NA, 1L, 1L), ID1_Desc_3 = c(1L, 1L, 1L),
ID2_Desc_1 = c(NA, NA, NA), ID2_Desc_2 = c(NA, NA, 1L), ID2_Desc_3 = c(1L,
NA, NA), ID3_Desc_1 = c(NA, 1L, NA), ID3_Desc_2 = c(NA, 1L,
NA), ID3_Desc_3 = c(NA, NA, NA)), .Names = c("Gender", "ID1_Desc_1",
"ID1_Desc_2", "ID1_Desc_3", "ID2_Desc_1", "ID2_Desc_2", "ID2_Desc_3",
"ID3_Desc_1", "ID3_Desc_2", "ID3_Desc_3"), class = "data.frame",
row.names = c(NA, -3L))
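For the loop the question asked about, here is a minimal sketch. The original attempt compares the strings built by paste() (for example "ID1_Desc_1" against "ID2_Desc_1"), which are never equal, so every result is 0. The sketch instead looks each column up by name with df[[...]] and picks the second list according to Gender. The names first and second are illustrative, and the data frame is rebuilt here only so the block runs on its own.

```r
# Sample data from the question (NA = not checked, 1 = checked)
df <- data.frame(
  Gender     = c(1L, 2L, 1L),
  ID1_Desc_1 = c(NA, NA, 1L), ID1_Desc_2 = c(NA, 1L, 1L), ID1_Desc_3 = c(1L, 1L, 1L),
  ID2_Desc_1 = c(NA, NA, NA), ID2_Desc_2 = c(NA, NA, 1L), ID2_Desc_3 = c(1L, NA, NA),
  ID3_Desc_1 = c(NA, 1L, NA), ID3_Desc_2 = c(NA, 1L, NA), ID3_Desc_3 = c(NA, NA, NA)
)

for (i in 1:3) {
  # Values of the i-th descriptor in the first list
  first <- df[[paste0("ID1_Desc_", i)]]
  # Second list depends on gender: ID2 for Gender 1, ID3 otherwise
  second <- ifelse(df$Gender == 1,
                   df[[paste0("ID2_Desc_", i)]],
                   df[[paste0("ID3_Desc_", i)]])
  # 1 only when both are checked; the is.na() guards turn NA comparisons into 0
  df[[paste0("Same", i)]] <- as.integer(!is.na(first) & !is.na(second) & first == second)
}

df[paste0("Same", 1:3)]
#   Same1 Same2 Same3
# 1     0     0     1
# 2     0     1     0
# 3     0     1     0
```

Writing the results into df rather than using assign() keeps the new variables next to the data they were derived from.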

Related

How to count nonblank values in each dataframe row

Given a data frame in R, how do I determine the number of non-blank values per row?
col1 col2 col3 rowCounts
   1    3              2
        1    6         2
             1         1
                       0
This is how I did it in python:
df['rowCounts'] = df.apply(lambda x: x.count(), axis=1)
What is the R Code for this?
In base R, assuming blanks are NA, rowSums is a vectorized option: !is.na(df) is a logical matrix in which TRUE (counted as 1) marks a non-NA value, and rowSums adds these up for each row.
df$rowCounts <- rowSums(!is.na(df))
Output:
df
# col1 col2 col3 rowCounts
#1 1 3 NA 2
#2 NA 1 6 2
#3 NA NA 1 1
#4 NA NA NA 0
If the blank is ""
df$rowCounts <- rowSums(df != "", na.rm = TRUE)
Or with apply and MARGIN = 1 as a similar syntax to Python (though it will be slower compared to rowSums)
df$rowCounts <- apply(df, 1, function(x) sum(!is.na(x)))
data
df <- structure(list(col1 = c(1L, NA, NA, NA), col2 = c(3L, 1L, NA,
NA), col3 = c(NA, 6L, 1L, NA)), class = "data.frame", row.names = c(NA,
-4L))
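If a column can hold both kinds of blank (NA and the empty string), the two tests above can be combined. This is a small sketch on made-up character data, not the data from the question; is_blank is an illustrative name.

```r
# Hypothetical data mixing NA and "" as blanks
df <- data.frame(col1 = c("a", NA, ""),
                 col2 = c("b", "c", ""),
                 col3 = c(NA, "d", "e"),
                 stringsAsFactors = FALSE)

# TRUE where a cell is NA or ""; TRUE | NA is TRUE, so NA cells are handled
is_blank <- is.na(df) | df == ""

# Count the non-blank cells in each row
df$rowCounts <- rowSums(!is_blank)
```

The mask is computed before rowCounts is added, so the new column is not counted itself.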

Looping through a column and returning value

I want to loop through one of the columns in my data frame, check a condition, and write 0 or 1 into another column. The code is:
for (i in v$R){
if( is.na(v$R) ==TRUE ){v$V5 = 0}else{v$V5=1}
}
But I get an error. The data frame, named 'v', is as follows. V5 has NA values; I want to replace them with 0 where the value in column R is NA, and with 1 otherwise. How can I do that?
A B R V5
1 2 3 NA
4 5 NA NA
You can try ifelse like below
df <- within(df,V5 <- ifelse(is.na(R),0,1))
or unary + (which converts logical values to numeric ones)
df <- within(df,V5 <- +!is.na(R))
such that
> df
A B R V5
1 1 2 3 1
2 4 5 NA 0
If you would like to use loops, you can try
for (i in seq_along(df$R)) {
  if (is.na(df$R[i])) df$V5[i] <- 0 else df$V5[i] <- 1
}
DATA
df <- structure(list(A = c(1L, 4L), B = c(2L, 5L), R = c(3L, NA), V5 = c(NA,
NA)), class = "data.frame", row.names = c(NA, -2L))
Try this:
v$V5 <- ifelse(is.na(v$R), 0, 1)
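As for why the loop in the question fails: is.na(v$R) tests the whole column and returns one logical per row, while if() needs a single TRUE or FALSE. R 4.2.0 and later raise an error for a condition longer than one; older versions only warned and used the first element. The sketch below rebuilds the two-row data frame from the question and shows the vectorized test.

```r
# Data frame from the question
v <- data.frame(A = c(1L, 4L), B = c(2L, 5L), R = c(3L, NA), V5 = NA)

# One result per row, so this cannot be used directly as an if() condition
na_flags <- is.na(v$R)

# Vectorized replacement: 0 where R is NA, 1 otherwise
v$V5 <- as.integer(!na_flags)
```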

Determine if sub string appears in a string by row of dataframe

I have a dataframe that is revised every day. When an error occurs, it's checked, and if it can be solved, the keyword "REVISED" is added to the beginning of the error message, like so:
ID M1 M2 M3
1 NA "REVISED-error" "error"
2 "REVISED-error" "REVISED-error" NA
3 "REVISED-error" "REVISED-error" "error"
4 NA "error" NA
5 NA NA NA
I want to find a way to add two columns that tell me whether there are any errors and how many of them have been revised, like this:
ID M1 M2 M3 i1 ix
1 NA "REVISED-error" "error" 2 1 <- 2 errors, 1 revised
2 "REVISED-error" "REVISED-error" NA 2 2
3 "REVISED-error" "REVISED-error" "error" 3 2
4 NA "error" NA 1 0
5 NA NA NA 0 0
I found this code:
df <- df%>%mutate(i1 = rowSums(!is.na(.[2:4])))
That helps me to know how many errors are in those specific columns. How can I know if any of said errors contains the keyword REVISED? I've tried a few things but none have worked so far:
df <- df%>%
mutate(i1 = rowSums(!is.na(.[2:4])))%>%
mutate(ie = rowSums(.[2:4] %in% "REVISED"))
This returns the error "x must be an array of at least two dimensions".
You could use apply to find number of times "error" and "REVISED" appears in each row.
df[c("i1", "ix")] <- t(apply(df[-1], 1, function(x)
c(sum(grepl("error", x)), sum(grepl("REVISED", x)))))
df
# ID M1 M2 M3 i1 ix
#1 1 <NA> REVISED-error error 2 1
#2 2 REVISED-error REVISED-error <NA> 2 2
#3 3 REVISED-error REVISED-error error 3 2
#4 4 <NA> error <NA> 1 0
#5 5 <NA> <NA> <NA> 0 0
Alternative approach using is.na and rowSums to calculate i1:
df$i1 <- rowSums(!is.na(df[-1]))
df$ix <- apply(df[-1], 1, function(x) sum(grepl("REVISED", x)))
data
df <- structure(list(ID = 1:5, M1 = structure(c(NA, 1L, 1L, NA, NA),
.Label = "REVISED-error", class = "factor"),
M2 = structure(c(2L, 2L, 2L, 1L, NA), .Label = c("error",
"REVISED-error"), class = "factor"), M3 = structure(c(1L,
NA, 1L, NA, NA), .Label = "error", class = "factor")), row.names = c(NA,
-5L), class = "data.frame")
You can use str_count() from the stringr library to count the number of times REVISED appears, like so
df <- data.frame(M1=as.character(c(NA, "REVISED-x", "REVISED-x")),
M2=as.character(c("REVISED-x", "REVISED-x", "REVISED-x")),
stringsAsFactors = FALSE)
library(stringr)
df$ix <- str_count(paste0(df$M1, df$M2), "REVISED")
df
# M1 M2 ix
# 1 <NA> REVISED-x 1
# 2 REVISED-x REVISED-x 2
# 3 REVISED-x REVISED-x 2
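A base R variant that avoids the row-wise apply is sketched below. It assumes the message columns are plain character vectors (the question's data rebuilt with stringsAsFactors = FALSE) and that "REVISED" always sits at the start of a message, as the question describes. The names msgs and revised are illustrative.

```r
df <- data.frame(
  ID = 1:5,
  M1 = c(NA, "REVISED-error", "REVISED-error", NA, NA),
  M2 = c("REVISED-error", "REVISED-error", "REVISED-error", "error", NA),
  M3 = c("error", NA, "error", NA, NA),
  stringsAsFactors = FALSE
)

msgs <- df[-1]  # message columns only

# Logical matrix, one column per message column; grepl() gives FALSE for NA
revised <- sapply(msgs, function(col) grepl("^REVISED", col))

df$i1 <- rowSums(!is.na(msgs))  # number of errors per row
df$ix <- rowSums(revised)       # number of revised errors per row
```

With a single-row data frame sapply() would return a vector instead of a matrix, so this sketch assumes at least two rows.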

R compare two columns return third column if any column match conditions

I have a dataset, data1, which has ME and PDR columns.
I want to create this third column: case which would look like this:
ME PDR case
1 2 2
NA 1 1
NA 1 1
1 2 2
NA NA NA
I tried this command, but it doesn't return 1 when there is a 1 in either column and no 2 in any of them.
data1$case=ifelse(data1$ME==2 | data1$PDR==2 ,2,ifelse(data1$ME==NA & data1$PDR==NA,NA,1))
We can use pmax
data1$case <- do.call(pmax, c(data1, na.rm = TRUE))
data1$case
#[1] 2 1 1 2 NA
In the OP's ifelse attempt, == returns NA for any element that is NA (so data1$ME==NA is always NA). We therefore need to guard each comparison with an extra condition (& !is.na(ME), and likewise for PDR) and test for missing values with is.na:
with(data1, ifelse((ME == 2 & !is.na(ME)) | (PDR == 2 & !is.na(PDR)),
2, ifelse(is.na(ME) &is.na(PDR), NA, 1)))
#[1] 2 1 1 2 NA
NOTE: The == for checking NA is not recommended as there are functions to get a logical vector when there are missing values (is.na, complete.cases)
data
data1 <- structure(list(ME = c(1L, NA, NA, 1L, NA), PDR = c(2L, 1L, 1L,
2L, NA)), class = "data.frame", row.names = c(NA, -5L))
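The NA behaviour described above can be checked directly. The short sketch below uses the same data1 and passes the two columns to pmax explicitly rather than through do.call; eq_na and isna_na are illustrative names.

```r
eq_na   <- NA == NA    # NA, not TRUE: comparing with NA never gives TRUE
isna_na <- is.na(NA)   # TRUE: is.na is the reliable test for missing values

data1 <- data.frame(ME  = c(1L, NA, NA, 1L, NA),
                    PDR = c(2L, 1L, 1L, 2L, NA))

# Row-wise maximum; na.rm = TRUE ignores a single NA, and a row with
# only NA values stays NA
data1$case <- pmax(data1$ME, data1$PDR, na.rm = TRUE)
data1$case
# [1]  2  1  1  2 NA
```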

Summarize data already grouped in r

With the following dataset in R
ID=Custid
ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1 NA On-line 1 New 5 0 1
1 NA On-line 1 Stream 5 0 1
3 EU Tele 2 Stream 5 1 0
I would like to convert the dataset to this format of columns
ID Geo Brand Neworstream OnlineRevQ112 TeleRevQ112 OnlineRevQ212 TeleRevQ212
What is the best way to go about doing this? Can't figure out the best command in R.
Thanks in advance
You can use the reshape2 package and its melt and dcast functions to restructure your data.
data <- structure(list(ID = c(1L, 1L, 3L), Geo = structure(c(NA, NA,
1L), .Label = "EU", class = "factor"), Channel = structure(c(1L,
1L, 2L), .Label = c("On-line", "Tele"), class = "factor"), Brand = c(1L,
1L, 2L), Neworstream = structure(c(1L, 2L, 2L), .Label = c("New",
"Stream"), class = "factor"), RevQ112 = c(5L, 5L, 5L), RevQ212 = c(0L,
0L, 1L), RevQ312 = c(1L, 1L, 0L)), .Names = c("ID", "Geo", "Channel",
"Brand", "Neworstream", "RevQ112", "RevQ212", "RevQ312"), class = "data.frame", row.names = c(NA,
-3L))
library(reshape2)
## melt data
df_long<-melt(data,id.vars=c("ID","Geo","Channel","Brand","Neworstream"))
## recast in combinations of channel and time frame
dcast(df_long,... ~Channel+variable,sum)
Update/facepalm
The "NA" values in your dataset probably aren't missing values but rather the abbreviation "NA" for North America or something like that.
If you had used na.strings when reading your data in, you should have no problems using reshape as I originally indicated:
mydf <- read.table(header = TRUE, na.strings = "",
text = 'ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1 NA On-line 1 New 5 0 1
1 NA On-line 1 Stream 5 0 1
3 EU Tele 2 Stream 5 1 0')
reshape(mydf, direction = "wide",
idvar = c("ID", "Geo", "Brand", "Neworstream"),
timevar = "Channel")
(I might, however, recommend changing your abbreviation for legibility and to reduce confusion!)
Original Answer (since there's still something interesting about reshape there)
This should do it:
reshape(mydf, direction = "wide",
idvar = c("ID", "Geo", "Brand", "Neworstream"),
timevar = "Channel")
# ID Geo Brand Neworstream RevQ112.On-line RevQ212.On-line RevQ312.On-line
# 1 1 <NA> 1 New 5 0 1
# 3 3 EU 2 Stream NA NA NA
# RevQ112.Tele RevQ212.Tele RevQ312.Tele
# 1 NA NA NA
# 3 5 1 0
Update (To try to salvage the answer a little bit)
As @Arun points out, the above isn't quite right. The culprit here is interaction(), which is used by reshape() to create a new temporary ID variable when more than one ID variable is specified.
Here's the line from reshape() and what it looks like when applied to our "mydf" object:
data[, tempidname] <- interaction(data[, idvar], drop = TRUE)
interaction(mydf[c(1, 2, 4, 5)], drop = TRUE)
# [1] <NA> <NA> 3.EU.2.Stream
# Levels: 3.EU.2.Stream
Hmmm. This seems to simplify to two IDs, NA and 3.EU.2.Stream.
What happens if we replace NA with ""?
mydf$Geo <- as.character(mydf$Geo)
mydf$Geo[is.na(mydf$Geo)] <- ""
interaction(mydf[c(1, 2, 4, 5)], drop = TRUE)
# [1] 1..1.New 1..1.Stream 3.EU.2.Stream
# Levels: 1..1.New 1..1.Stream 3.EU.2.Stream
Aaahh. That's a little bit better. We now have three unique IDs... and reshape() seems to work.
reshape(mydf, direction = "wide",
idvar=names(mydf)[c(1, 2, 4, 5)],
timevar="Channel")
# ID Geo Brand Neworstream RevQ112.On-line RevQ212.On-line
# 1 1 1 New 5 0
# 2 1 1 Stream 5 0
# 3 3 EU 2 Stream NA NA
# RevQ312.On-line RevQ112.Tele RevQ212.Tele RevQ312.Tele
# 1 1 NA NA NA
# 2 1 NA NA NA
# 3 NA 5 1 0
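The interaction() behaviour can be reproduced in isolation. In this small sketch the vectors geo and kind are made up for illustration: any NA component makes the combined level NA, whereas paste() keeps the row distinguishable by turning NA into the text "NA".

```r
geo  <- c(NA, NA, "EU")
kind <- c("New", "Stream", "Stream")

# Combined factor: the first two elements are NA because geo is NA there
key <- interaction(geo, kind, drop = TRUE)

# paste() keeps all three rows distinct
key_txt <- paste(geo, kind, sep = ".")
key_txt
# [1] "NA.New"    "NA.Stream" "EU.Stream"
```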
