ifelse: Assigning a condition from 1 column to another and multiple statements - r

I've got a data.frame. I am trying to use the values in columns 2, 3, and 4 to assign a value in col1. Is this possible?
dat<-data.frame(col1=c(1,2,3,4,5), col2=c(1,2,3,4,"U"), col3=c(1,2,3,"U",5), col4=c("U",2,3,4,5))
dat1=data.frame(col1=ifelse(dat$col2=="U"|dat$col3=="U"|dat$col4=="U", dat$col1=="U", dat$col1))
col1
0
2
3
0
0
Why am I getting a 0 where a U should be?

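Don't put a comparison inside the ifelse call: dat$col1=="U" tests col1 rather than assigning "U" to it. Pass the value "U" itself as the second argument.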
dat1=data.frame(col1=ifelse(dat$col2=="U"|dat$col3=="U"|dat$col4=="U",
"U",
dat$col1))
dat1
col1
1 U
2 2
3 3
4 U
5 U

You probably want to be using this:
dat1 <- data.frame(col1=ifelse(dat$col2=="U"|dat$col3=="U"|dat$col4=="U", "U", dat$col1))
# I changed the dat$col1=="U" to just "U"
If the question is "Why am I getting a 0 where a U should be?" the answer lies in what you have assigned for the if-TRUE portion of your ifelse(.) statement.
Your ifelse statement essentially says
if any of columns 2 through 4 are U
then assign the value of `does column 1 == "U"` <-- Not sure if this is what you want
else assign the value of column 1
So when your ifelse test evaluates to TRUE, what you get back is the value of col1=="U", coerced to a number, i.e. 0 for FALSE and 1 for TRUE.
You can also take advantage of T/F getting evaluated to 1/0 to clean up your code:
# using the fact that rowSums(dat[2:4]=="U") will be 0 when "U" is not in any column:
ifelse(rowSums(dat[2:4]=="U")>0, "U", dat$col1)
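The coercion described above can be checked directly. This is a minimal sketch using the question's data; the names has_u, buggy and fixed are introduced here for illustration only, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
dat <- data.frame(col1 = c(1, 2, 3, 4, 5),
                  col2 = c(1, 2, 3, 4, "U"),
                  col3 = c(1, 2, 3, "U", 5),
                  col4 = c("U", 2, 3, 4, 5),
                  stringsAsFactors = FALSE)

# TRUE for every row that has a "U" in columns 2 to 4
has_u <- rowSums(dat[2:4] == "U") > 0

# The question's version: the "yes" argument is a logical vector (all FALSE),
# which is coerced to 0 once it is combined with the numeric col1 values
buggy <- ifelse(has_u, dat$col1 == "U", dat$col1)

# The corrected version: the "yes" argument is the value "U" itself,
# so the numeric col1 values are coerced to character instead
fixed <- ifelse(has_u, "U", dat$col1)

print(buggy)
print(fixed)
```

Note that the corrected result is a character vector, because a single vector cannot hold both "U" and numbers.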

any() makes things like this a lot neater
head(dat)
col1 col2 col3 col4
1 1 1 1 U
2 2 2 2 2
3 3 3 3 3
4 4 4 U 4
5 5 U 5 5
apply(dat,1, function(x)any(x=='U'))
[1] TRUE FALSE FALSE TRUE TRUE
dat[apply(dat,1, function(x)any(x=='U')), 1] <-'U'
dat
col1 col2 col3 col4
1 U 1 1 U
2 2 2 2 2
3 3 3 3 3
4 U 4 U 4
5 U U 5 5

An easy way would be:
dat$col1[as.logical(rowSums(dat[-1]=="U"))] <- "U"
col1 col2 col3 col4
1 U 1 1 U
2 2 2 2 2
3 3 3 3 3
4 U 4 U 4
5 U U 5 5

Related

Filtering of dataframe columns displaying a counter intuitive behavior (R)

Take as an example the dataframe below. I need to change the dataframe by keeping only the columns that are in the filter objects.
test <- data.frame(A = c(1,6,1,2,3) , B = c(1,2,1,1,2), C = c(1,7,6,4,1), D = c(1,1,1,1,1))
filter <- c("A", "B", "C", "D")
filter2 <- c("A","B","D")
To do that I'm using this piece of code:
`%ni%` <- Negate(`%in%`)
test <- test[,-which(names(test) %ni% filter2)]
If I use the filter2 object I get what is expected:
A B D
1 1 1 1
2 6 2 1
3 1 1 1
4 2 1 1
5 3 2 1
However, if I use the filter object, I get a dataframe with zero columns:
data frame with 0 columns and 5 rows
I expected to get an untouched dataframe, since filter contains all of the test columns. Why does this happen, and how can I write more reliable code that does not return empty dataframes in these situations?
Use ! instead of -
test[,!(names(test) %ni% filter2)]
test[,!(names(test) %ni% filter)]
Wrapping the condition in which() and negating with - works only when which() returns at least one index:
> which(names(test) %ni% filter2)
[1] 3
> which(names(test) %ni% filter)
integer(0)
Applying - to integer(0) changes nothing, so instead of "drop no columns" the index means "select no columns":
> -which(names(test) %ni% filter)
integer(0)
> -which(names(test) %ni% filter2)
[1] -3
Thus,
> test[integer(0)]
data frame with 0 columns and 5 rows
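As a minimal sketch of the safer pattern, the logical index can be used directly, without which() or a negative index. The helper select_cols and the names keep_all and keep_some are hypothetical and only serve this illustration; drop = FALSE is added so that a single matching column still comes back as a data frame.

```r
test <- data.frame(A = c(1, 6, 1, 2, 3), B = c(1, 2, 1, 1, 2),
                   C = c(1, 7, 6, 4, 1), D = c(1, 1, 1, 1, 1))
keep_all  <- c("A", "B", "C", "D")
keep_some <- c("A", "B", "D")

# Hypothetical helper: keep the columns whose names appear in `keep`
select_cols <- function(df, keep) {
  df[, names(df) %in% keep, drop = FALSE]
}

# The negative-index version selects nothing when there is nothing to drop
empty_result <- test[, -which(!(names(test) %in% keep_all)), drop = FALSE]

print(ncol(empty_result))
print(names(select_cols(test, keep_all)))
print(names(select_cols(test, keep_some)))
```

A logical index always has one entry per column, so "nothing to remove" is expressed as all TRUE rather than as an empty index.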
I think you can simplify the column selection by subsetting the dataframe with a character vector of column names.
test[filter]
# A B C D
#1 1 1 1 1
#2 6 2 7 1
#3 1 1 6 1
#4 2 1 4 1
#5 3 2 1 1
test[filter2]
# A B D
#1 1 1 1
#2 6 2 1
#3 1 1 1
#4 2 1 1
#5 3 2 1

R create group variable based on row order and condition

I have a dataframe containing multiple groups that are not explicitly stated. Instead, a new group always starts when type == 1 and continues through the following rows, where type == 2. The number of rows per group can vary.
How can I explicitly create a new variable based on the order of another column? The groups, of course, should be mutually exclusive.
My data:
df <- data.frame(type = c(1,2,2,1,2,1,2,2,2,1),
stand = 1:10)
Expected output with new group myGroup:
type stand myGroup
1 1 1 a
2 2 2 a
3 2 3 a
4 1 4 b
5 2 5 b
6 1 6 c
7 2 7 c
8 2 8 c
9 2 9 c
10 1 10 d
One option could be:
with(df, letters[cumsum(type == 1)])
[1] "a" "a" "a" "b" "b" "c" "c" "c" "c" "d"
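To see why this works, here is a short sketch that stores the intermediate counter before mapping it to letters. The name group_id is introduced here for illustration. One assumption worth stating: letters has only 26 entries, so with more groups than that the numeric group_id itself would be the safer label.

```r
df <- data.frame(type = c(1, 2, 2, 1, 2, 1, 2, 2, 2, 1),
                 stand = 1:10)

# type == 1 is TRUE at the start of each group; the running sum of those
# TRUE values increases by one exactly where a new group begins
group_id <- cumsum(df$type == 1)

# Map the running group number to a letter label
df$myGroup <- letters[group_id]

print(group_id)
print(df)
```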
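Here is another option using rep() + diff(), though it is not as simple as the approach by @tmfmnk:
idx <- which(df$type==1) # rows where a new group starts
v <- diff(idx) # sizes of all groups except the last
df$myGroup <- rep(letters[seq_along(idx)], c(v, nrow(df)-sum(v))) # the last group gets the remaining rows (assumes row 1 starts a group)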
such that
> df
type stand myGroup
1 1 1 a
2 2 2 a
3 2 3 a
4 1 4 b
5 2 5 b
6 1 6 c
7 2 7 c
8 2 8 c
9 2 9 c
10 1 10 d

R write.table takes longer for 2 columns than for the whole dataframe

A dataframe with 40 columns:
This is executed after a few seconds
write.table(data_2[1:10000,], file = "/Volumes/2018/06_abteilungen/bi/analytics/tools/adobe/adobe_analytics/adobe_analytics_api_rohdaten/api_via_data_feed_auf_ftp/beispiel_datenexporte_data_feed/r_exporte/channel_va_closer.csv", sep = ";", col.names = NA)
This never ends:
write.table(data_2[1:1000,c(data_2$va_closer_detail,data_2$va_closer_id)], file = "/Volumes/2018/06_abteilungen/bi/analytics/tools/adobe/adobe_analytics/adobe_analytics_api_rohdaten/api_via_data_feed_auf_ftp/beispiel_datenexporte_data_feed/r_exporte/channel_va_closer.csv", sep = ";", col.names = NA)
How can I extract only 2 columns without this performance delay?
You can use [ to subset a data frame either by giving it row/column indices or row/column names. For example:
dd = data.frame(col1 = rep(1:2, 5), col2 = c(rep(1:3, 3), 1), col3 = 'a')
dd
# col1 col2 col3
# 1 1 1 a
# 2 2 2 a
# 3 1 3 a
# 4 2 1 a
# 5 1 2 a
# 6 2 3 a
# 7 1 1 a
# 8 2 2 a
# 9 1 3 a
# 10 2 1 a
If you wanted the first 5 rows and the first 2 columns, you could do either of these:
# good
dd[1:5, 1:2] # using column indices
dd[1:5, c("col1", "col2")] # using column names
But what you have in your question is
# bad
dd[1:5, c(dd$col1, dd$col2)] # using actual values :(
What columns are you asking for? Well, dd$col1 is the first column values: 1,2,1,2,... and dd$col2 is the second column values 1,2,3,1,2,3... Using c() you are sticking them together, so we can expand this out to
c(dd$col1, dd$col2) # these are the columns you are asking for
# [1] 1 2 1 2 1 2 1 2 1 2 1 2 3 1 2 3 1 2 3 1
# these are equivalent for this data
dd[1:5, c(dd$col1, dd$col2)]
dd[1:5, c(1,2,1,2,1,2,1,2,1,2,1,2,3,1,2,3,1,2,3,1)]
# col1 col2 col1.1 col2.1 col1.2 col2.2 col1.3 col2.3 col1.4 col2.4 col1.5 col2.5 col3 col1.6 col2.6 col3.1 col1.7 col2.7
# 1 1 1 1 1 1 1 1 1 1 1 1 1 a 1 1 a 1 1
# 2 2 2 2 2 2 2 2 2 2 2 2 2 a 2 2 a 2 2
# 3 1 3 1 3 1 3 1 3 1 3 1 3 a 1 3 a 1 3
# 4 2 1 2 1 2 1 2 1 2 1 2 1 a 2 1 a 2 1
# 5 1 2 1 2 1 2 1 2 1 2 1 2 a 1 2 a 1 2
# col3.2 col1.8
# 1 a 1
# 2 a 2
# 3 a 1
# 4 a 2
# 5 a 1
We are asking to repeat the columns again and again, with twice as many columns as there are rows in the original data! I don't know how many rows you have, it looks like more than 1000, so you are asking not for 2 columns, but for more than 2000 columns - maybe a lot more.
Two footnotes:
I second the comment recommending data.table::fwrite; it will be much faster.
As a debugging technique, don't forget you can run small pieces of code to isolate the problem. When you try
write.table(data_2[1:1000,c(data_2$va_closer_detail,data_2$va_closer_id)],
file = "/Volumes/2018/06_abteilungen/bi/analytics/tools/adobe/adobe_analytics/adobe_analytics_api_rohdaten/api_via_data_feed_auf_ftp/beispiel_datenexporte_data_feed/r_exporte/channel_va_closer.csv",
sep = ";", col.names = NA)
and it doesn't seem to work, there are two things worth checking: (a) is the file path valid, and (b) is the data valid. If you had tried running just the data_2[...] part of the line, you would have identified the problem without needing help.
data_2[1:1000,c(data_2$va_closer_detail,data_2$va_closer_id)]
And when you ran that and saw output different from what you expected, you would again run a smaller piece of the line,
c(data_2$va_closer_detail,data_2$va_closer_id)
And hopefully the issue is clear.
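Putting the advice together, a minimal sketch of the fix might look like the following. The data frame data_demo and its column names are hypothetical stand-ins for data_2, and a temporary file replaces the real export path.

```r
# Hypothetical stand-in for data_2
data_demo <- data.frame(id = 1:6, detail = letters[1:6], other = 6:1,
                        stringsAsFactors = FALSE)

# Indexing by values: this asks for 2 * nrow(data_demo) columns
bad_index <- c(data_demo$id, data_demo$other)
print(length(bad_index))

# Indexing by name: this asks for exactly two columns
wanted <- c("detail", "id")
subset_demo <- data_demo[1:4, wanted]
print(dim(subset_demo))

# Write the small subset, using the same arguments as in the question
out_file <- tempfile(fileext = ".csv")
write.table(subset_demo, file = out_file, sep = ";", col.names = NA)
print(file.exists(out_file))
```

Checking dim() on the subset before writing is a cheap way to catch an index that selects far more columns than intended.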

different print.gap value for specific column

Is there any way to have a different print.gap for a particular column?
Example data:
dd <- data.frame(col1 = 1:5, col2 = 1:5, col3 = I(letters[1:5]))
print (dd, quote=F, right=T, print.gap=5)
Output with print.gap=5:
col1 col2 col3
1 1 1 a
2 2 2 b
3 3 3 c
4 4 4 d
5 5 5 e
Desired output (print.gap mix, first two with print.gap=5, third with print.gap=12)
col1 col2 col3
1 1 1 a
2 2 2 b
3 3 3 c
4 4 4 d
5 5 5 e
I realise this may not be achievable with any change of the print statement, but perhaps some have an alternative method or suggestion. The output is to be saved in a text file. Also please note, the solution should be flexible enough to not just increase the gap for the last column, it could be any column, or multiple columns with different print.gaps in a data frame.
There's probably a way to do this by defining a "proper" alternative print method, but here's a hackish solution that adjusts each column width independently: append a row of blank strings whose lengths set the minimum width of each column.
rbind(
data.frame(lapply(dd, as.character), stringsAsFactors=FALSE),
substring(strrep(" ", 12), 1, c(1,7,12)) # blank strings of 1, 7 and 12 characters
)
# col1 col2 col3
#1 1 1 a
#2 2 1 b
#3 3 1 c
#4 4 1 d
#5 5 2 e
#6 6 2 f
#7 7 2 g
#8 8 2 h
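An alternative sketch, which is not part of the answer above, pads each column to its own width with formatC() before printing, so no extra blank row is needed. The widths vector is a hypothetical choice mirroring the 1, 7 and 12 used above, and a plain character col3 is assumed instead of the I() column from the question.

```r
dd <- data.frame(col1 = 1:5, col2 = 1:5, col3 = letters[1:5],
                 stringsAsFactors = FALSE)

# Hypothetical minimum width for each column, matched by name
widths <- c(col1 = 1, col2 = 7, col3 = 12)

# Convert every column to character and right-justify it to its own width
padded <- as.data.frame(
  Map(function(column, width) formatC(as.character(column), width = width),
      dd, widths[names(dd)]),
  stringsAsFactors = FALSE
)

print(padded, quote = FALSE, right = TRUE, print.gap = 1)
```

Because the padding lives in the data rather than in the print call, the same padded data frame could also be passed to write.table with quote = FALSE to save the layout in a text file.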

Remove observations for which there is not a duplicate

I would like to break a dataset into two frames: one containing the observations that have duplicates based on a condition, and one containing the observations that do not. In the following example, I would like to split the frame into one where there is only one coder for an observation and one where there are two coders:
frame <- data.frame(id = c(1,1,1,2,2,3), coder = c("A", "A", "B", "A", "B", "A"), y = c(4,5,4,1,1,2))
frame
From this, I would like to produce the following:
frame1:
id coder y
1 1 A 4
2 1 A 5
3 1 B 4
4 2 A 1
5 2 B 1
frame2:
6 3 A 2
You can use aggregate to determine the ids you want in each data frame:
cts <- aggregate(coder~id, frame, function(x) length(unique(x)))
cts
# id coder
# 1 1 2
# 2 2 2
# 3 3 1
Then you can subset as appropriate based on this:
subset(frame, id %in% cts$id[cts$coder >= 2])
# id coder y
# 1 1 A 4
# 2 1 A 5
# 3 1 B 4
# 4 2 A 1
# 5 2 B 1
subset(frame, id %in% cts$id[cts$coder < 2])
# id coder y
# 6 3 A 2
You may also try:
indx <- !colSums(!table(frame$coder, frame$id))
frame[frame$id %in% names(indx)[indx],]
# id coder y
#1 1 A 4
#2 1 A 5
#3 1 B 4
#4 2 A 1
#5 2 B 1
frame[frame$id %in% names(indx)[!indx],]
# id coder y
#6 3 A 2
Explanation
table(frame$coder, frame$id)
# 1 2 3
# A 2 1 1
# B 1 1 0 #Here for id 3, B==0
Negating that gives a logical matrix that is TRUE wherever a count is 0:
!table(frame$coder, frame$id)
Taking the colSums of the above counts the coders missing for each id, which results in
# 1 2 3
# 0 0 1
Negating again gives a logical index that is TRUE for the ids where no coder is missing.
From this you can subset by matching frame$id against the names of the ids.
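A third variant, offered only as a sketch and not taken from either answer, counts the distinct coders per id with ave() and then uses split() to produce both frames in one step. The names n_coders and parts are introduced here for illustration.

```r
frame <- data.frame(id = c(1, 1, 1, 2, 2, 3),
                    coder = c("A", "A", "B", "A", "B", "A"),
                    y = c(4, 5, 4, 1, 1, 2),
                    stringsAsFactors = FALSE)

# For each row, the number of distinct coders that share its id.
# ave() passes the row positions of each id group to the function
# and recycles the single count over that group.
n_coders <- ave(seq_along(frame$id), frame$id,
                FUN = function(i) length(unique(frame$coder[i])))

# Split into rows with at least two coders ("TRUE") and the rest ("FALSE")
parts <- split(frame, n_coders >= 2)
frame1 <- parts[["TRUE"]]
frame2 <- parts[["FALSE"]]

print(frame1)
print(frame2)
```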

Resources