Use if-else function on data frame with multiple values - r

I have a data frame that contains multiple values in each spot, like this:
ID<-c(1,1,1,2,2,2,2,3,3,4,4,4,5,6,6)
W<-c(29,72,32,33,34,44,42,78,32,42,18,26,10,34,39)
df1<-data.frame(ID, W)
df<-ddply(df1, .(ID), summarize,
X=paste(unique(W),collapse=","))
ID X
1 1 29,72,32
2 2 33,34,44,42
3 3 78,32
4 4 42,18,26
5 5 10
6 6 34,39
I am trying to generate another column using an if-else function so that every ID that has an X value greater than 70 will show a 1, and all others will show a 0, like this:
ID X Y
1 1 29,72,32 1
2 2 33,34,44,42 0
3 3 78,32 1
4 4 42,18,26 0
5 5 10 0
6 6 34,39 0
This is the code that I tried:
df$Y <- ifelse(df$X>=70, 1, 0)
But it doesn't work; it only seems to put the first value of each spot through the function:
ID X Y
1 1 29,72,32 0
2 2 33,34,44,42 0
3 3 78,32 1
4 4 42,18,26 0
5 5 10 0
6 6 34,39 0
It worked fine on my one column that has only one value per spot. Is there a way to get to the if-else function to evaluate every value in each spot and assign a 1 if any of them fit the statement?
Thank you, I'm sorry that I do not know a lot of R vocabulary yet.

As 'X' is a string, we can split the 'X' at the , to create a list of vectors, loop over the list with map check if there are any numeric converted values are greater than 70
library(dplyr)
library(purrr)
df %>%
mutate(Y = map_int(strsplit(X, ","), ~ +(any(as.numeric(.x) > 70))))

Related

How do I identifying the first zero in a group of ordered columns?

I'm trying to format a dataset for use in some survival analysis models. Each row is a school, and the time-varying columns are the total number of students enrolled in the school that year. Say the data frame looks like this (there are time invariate columns as well).
Name total.89 total.90 total.91 total.92
a 8 6 4 0
b 1 2 4 9
c 7 9 0 0
d 2 0 0 0
I'd like to create a new column indicating when the school "died," i.e., the first column in which a zero appears. Ultimately I'd like to have this column be "years since 1989" and can re-name columns accordingly.
A more general version of the question, for a series of time ordered columns, how do I identify the first column in which a given value occurs?
Here's a base R approach to get a column with the first zero (x = 0) or NA if there isn't one:
data$died <- apply(data[, -1], 1, match, x = 0)
data
# Name total.89 total.90 total.91 total.92 died
# 1 a 8 6 4 0 4
# 2 b 1 2 4 9 NA
# 3 c 7 9 0 0 3
# 4 d 2 0 0 0 2
Here is an option using max.col with rowSums
df1$died <- max.col(!df1[-1], "first") * NA^!rowSums(!df1[-1])
df1$died
#[1] 4 NA 3 2

How to tidy up a character column?

What I have:
test_df <- data.frame(isolate=c(1,2,3,4,1,2,3,4,5),label=c(1,1,1,1,2,2,2,2,2),alignment=c("--at","at--","--at","--at","a--","acg","a--","a--", "agg"))
> test_df
isolate label alignment
1 1 1 --at
2 2 1 at--
3 3 1 --at
4 4 1 --at
5 1 2 a--
6 2 2 acg
7 3 2 a--
8 4 2 a--
9 5 2 agg
What I want:
I'd like to explode the alignment field into two columns, position and character:
> test_df
isolate label aln_pos aln_char
1 1 1 1 -
2 1 1 2 -
3 1 1 3 a
4 1 1 4 t
...
Not all alignments are the same length, but all alignments with the same label have the same length.
What I've tried:
I was thinking I could use separate to first make each position have its own column, then use gather turn those columns into key value pairs. However, I haven't been able to get the separate part right.
Since you mentioned tidyr::gather, you could try this:
test_df <- data.frame(isolate=c(1,2,3,4,1,2,3,4,5),
label=c(1,1,1,1,2,2,2,2,2),
alignment=c("--at","at--","--at","--at","a--","acg","a--","a--", "agg"),
stringsAsFactors = FALSE)
library(tidyverse)
test_df %>%
mutate(alignment = strsplit(alignment,"")) %>%
unnest(alignment)
In base R, you can use indexing along with creation of a list with strsplit like this.
# make variable a character vector
test_df$alignment <- as.character(test_df$alignment)
# get list of individual characters
myList <- strsplit(test_df$alignment, split="")
then build the data.frame
# construct data.frame
final_df <- cbind(test_df[rep(seq_len(nrow(test_df)), lengths(myList)),
c("isolate", "label")],
aln_pos=sequence(lengths(myList)),
aln_char=unlist(myList))
Here, we take the first two columns of the original data.frame and repeat the rows using rep with a vector input in its second argument telling it how many times to repeat the corresponding value in its first argument. The number of times is calculated with lengths. The second argument of cbind is a call to sequence taking the same lengths output. this produces counts from 1 to the corresponding length. The third argument is the unlisted character values.
this returns
head(final_df, 10)
isolate label aln_pos aln_char
1 1 1 1 -
1.1 1 1 2 -
1.2 1 1 3 a
1.3 1 1 4 t
2 2 1 1 a
2.1 2 1 2 t
2.2 2 1 3 -
2.3 2 1 4 -
3 3 1 1 -
3.1 3 1 2 -

replace values in one dataset with values in another dataset R

I have a somewhat seemingly simple problem that I am stumped with. I have a df, say:
x y z
0 1 2
3 5 4
1 0 5
0 5 0
and another:
x y z
1 5 6
2 4 5
4 5 7
5 8 5
I want to replace the zero values in df1 with the value in df2. E.g., cell 1 of df1 would be 1 instead of zero. I want this for all columns in a dataframe. Can you help me code? I cant seem to figure it out. Thanks!
First, you can locate the indices of 0's using which
zero_locations <- which(df1 == 0, arr.ind=TRUE)
Then, you can use the locations to make the replacements:
df1[zero_locations] <- df2[zero_locations]
As David Arenburg pointed out in the comments, which isn't strictly necessary:
zero_locations <- df1 == 0
Will work as well.

Loop throug the data frame applying some function on each value in R

I am new to R . I have a data frame(usr.query) with structure as shown below
[
Now I want to take text of each id and compare it to text of all the other id and and if there is a match, i want to append it to a new column say count of match.
A0008 with A0043,A0065,A0082,B0018,B0026
A0043 with A0008,A0065,A0082,B0018,B0026
Function to apply
count_match = length(intersect(unlist(strsplit(query1," ")),unlist(strsplit(query2," "))))
The query 1 here is text of A0008 and query 2 is text of A0043,A0065,A0082,B0018,B0026
I tried the suggested solution and here is the result.
No loops are necessary; you'll usually find that's the case in R, because it's really good at utilizing vectorized operations. In this case, you can get the necessary combinations with combn, and then make the match_count column by subsetting the original data.frame with the combinations of the new one, and testing for equality. Adding zero changes the values from Boolean to numeric (use as.integer, if you prefer).
# assemble sample data
df <- data.frame(id = 1:5, text = c('apple', 'mango', 'apple', 'apple', 'mango'))
# make combinations
df2 <- as.data.frame(t(combn(df$id, 2)))
# add names
names(df2) <- c('main_id', 'compared_to_id')
# test for match
df2$match_count <- (df[df2$main_id, 'text'] == df[df2$compared_to_id, 'text']) + 0
The result:
> df2
main_id compared_to_id match_count
1 1 2 0
2 1 3 1
3 1 4 1
4 1 5 0
5 2 3 0
6 2 4 0
7 2 5 1
8 3 4 1
9 3 5 0
10 4 5 0

Select max or equal value from several columns in a data frame

I'm trying to select the column with the highest value for each row in a data.frame. So for instance, the data is set up as such.
> df <- data.frame(one = c(0:6), two = c(6:0))
> df
one two
1 0 6
2 1 5
3 2 4
4 3 3
5 4 2
6 5 1
7 6 0
Then I'd like to set another column based on those rows. The data frame would look like this.
> df
one two rank
1 0 6 2
2 1 5 2
3 2 4 2
4 3 3 3
5 4 2 1
6 5 1 1
7 6 0 1
I imagine there is some sort of way that I can use plyr or sapply here but it's eluding me at the moment.
There might be a more efficient solution, but
ranks <- apply(df, 1, which.max)
ranks[which(df[, 1] == df[, 2])] <- 3
edit: properly spaced!

Resources