If condition statement within a function in R

I have a question which is probably easy for a lot of you. I would like to write a function that does a calculation based on a condition in a selected column. It is easier to show with an example:
con <- c("A", "B", "B", "C", "C", "A", "D", "A", "B", "D", "D", "D")
value <- c(1, 3, 2, 1, 1, 1, 2, 1, 2, 3, 3, 2)
dat <- data.frame(con, value)
head(dat)
So one possibility would be to do this in this simple way:
dat$new <- ifelse(dat$con == "A", dat$value*10,
                  ifelse(dat$con == "B", dat$value*100, dat$value*1000))
head(dat)
But my question is: what would the function look like? I tried something like this, but it is not working. Can someone explain what is missing or wrong?
calc <- function(dat) {
  if(dat[, con] == "A") {
    new <- dat$value*10
  }
  if(dat[, con] == "B") {
    new <- dat$value*100
  } else {
    new <- dat$value*1000
  }
}
calc(dat)

You can also create a function without if and ifelse:
calc <- function(data)
  transform(data, new = value * 1000 / 100 ^ (con == "A") / 10 ^ (con == "B"))
The function relies on arithmetic with logical values: con == "A" is coerced to 1 (TRUE) or 0 (FALSE) when used as an exponent, so value * 1000 is divided by 100 exactly when con is "A" (giving value * 10) and by 10 exactly when con is "B" (giving value * 100), leaving value * 1000 for all other levels.
calc(dat)
# con value new
# 1 A 1 10
# 2 B 3 300
# 3 B 2 200
# 4 C 1 1000
# 5 C 1 1000
# 6 A 1 10
# 7 D 2 2000
# 8 A 1 10
# 9 B 2 200
# 10 D 3 3000
# 11 D 3 3000
# 12 D 2 2000
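To see the coercion at work, note that TRUE and FALSE become 1 and 0 when used as exponents:
100 ^ (c("A", "B", "C") == "A")
# [1] 100   1   1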

calc <- function(dat) {
  dat$new <- ifelse(dat[, 'con'] == 'A', dat[, 'value'] * 10,
                    ifelse(dat[, 'con'] == 'B', dat[, 'value'] * 100,
                           dat[, 'value'] * 1000))
  dat
}
The subsetting operator $ can be problematic inside functions; using the DF[,'<variable>'] form instead is safer. Also, note the quotation marks around the variable names (column names): in your original function, con is unquoted inside dat[, con], so R looks for an object named con rather than the column. A further problem is that if() expects a single TRUE/FALSE value, so it cannot be applied to a whole column the way the vectorized ifelse() can. Finally, your original function never returns its result: the value of the last expression is what is returned when the function is called, which is why the corrected version ends with dat.
calc(dat)
con value new
1 A 1 10
2 B 3 300
3 B 2 200
4 C 1 1000
5 C 1 1000
6 A 1 10
7 D 2 2000
8 A 1 10
9 B 2 200
10 D 3 3000
11 D 3 3000
12 D 2 2000
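A further option (my own sketch, not part of either answer): keep the multipliers in a named lookup vector, so adding a new condition is a one-line change. This assumes every level other than "A" and "B" should get 1000:
calc2 <- function(dat) {
  mult <- c(A = 10, B = 100)              # named lookup: level -> multiplier
  m <- mult[as.character(dat[, "con"])]   # NA for levels not in the table
  m[is.na(m)] <- 1000                     # default multiplier
  dat$new <- dat[, "value"] * unname(m)
  dat
}
calc2(dat)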

R: reorder a data frame with groups while preserving order within groups

R coders! I have a data frame, plan, with two columns. One column has group labels, lab, and the other, tr, has only two distinct values in it.
lab <- rep(letters[1:2], each = 4)
tr <- c(1, 2, 2, 1, 1, 2, 1, 2)
plan <- data.frame(lab = lab, tr = tr)
> plan
lab tr
1 a 1
2 a 2
3 a 2
4 a 1
5 b 1
6 b 2
7 b 1
8 b 2
I have another vector, order_new, which is a reordered version of lab.
order_new <- lab[sample(1:8)]
> order_new
[1] "b" "b" "a" "a" "b" "a" "b" "a"
I want to reorder the data frame above so the tr values are sorted in the order given by order_new but with the order within the original lab groups preserved. The result I want is:
plan_new <- data.frame(order_new = order_new, tr = c(1, 2, 1, 2, 1, 2, 2, 1))
> plan_new
order_new tr
1 b 1
2 b 2
3 a 1
4 a 2
5 b 1
6 a 2
7 b 2
8 a 1
The first row in the new data frame is a "b" value and so takes the first "b" value in the original data frame. Row 2, also a "b", takes the second "b" value in the original. The third row, an "a", takes the first "a" value in the original etc.
I can't find anything close enough in past answers to work this out and am really looking forward to someone helping me out with this!
If you don't mind a loop:
order_new <- c("b", "b", "a", "a", "b", "a", "b", "a")
# split tr into one queue of values per group
tmp <- split(plan$tr, plan$lab)
res <- list()
for (x in seq_along(order_new)) {
  # take the first remaining value for this group...
  res[[x]] <- tmp[[order_new[x]]][1]
  # ...and drop it from that group's queue
  tmp[[order_new[x]]] <- tail(tmp[[order_new[x]]], -1)
}
data.frame(lab = order_new, tr = unlist(res))
lab tr
1 b 1
2 b 2
3 a 1
4 a 2
5 b 1
6 a 2
7 b 2
8 a 1
Here is a data.table approach; it can easily be adapted into a dplyr or base R solution following the same logic.
I included all intermediate results to show you the result of each line.
lab <- rep(letters[1:2], each = 4)
tr <- c(1, 2, 2, 1, 1, 2, 1, 2)
plan <- data.frame(lab = lab, tr = tr)
#hard coded, since sample is not reproducible without set.seed()
order_new <- c("b", "b", "a", "a", "b", "a", "b", "a")
library( data.table )
#make plan a data.table
setDT(plan)
#set row_id's by group (lab)
plan[, row_id := rowid( lab ) ]
# lab tr row_id
# 1: a 1 1
# 2: a 2 2
# 3: a 2 3
# 4: a 1 4
# 5: b 1 1
# 6: b 2 2
# 7: b 1 3
# 8: b 2 4
#make a new data.table for the new ordering
plan_new <- data.table( order_new = order_new )
#also add rownumbers by group
plan_new[, row_id := rowid( order_new ) ][]
# order_new row_id
# 1: b 1
# 2: b 2
# 3: a 1
# 4: a 2
# 5: b 3
# 6: a 3
# 7: b 4
# 8: a 4
#now join the tr-value from data.table 'plan' to 'plan_new', based on the row_id
plan_new[ plan, tr := i.tr, on = .(order_new = lab, row_id) ]
# order_new row_id tr
# 1: b 1 1
# 2: b 2 2
# 3: a 1 1
# 4: a 2 2
# 5: b 3 1
# 6: a 3 2
# 7: b 4 2
# 8: a 4 1
#drop the row_id column if needed
plan_new[, row_id := NULL ][]
# order_new tr
# 1: b 1
# 2: b 2
# 3: a 1
# 4: a 2
# 5: b 1
# 6: a 2
# 7: b 2
# 8: a 1
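Following the same row-number join logic, a dplyr version might look like this (a sketch I'm adding, starting from the original plan data frame and assuming dplyr >= 1.0 is installed; not part of the answer above):
library(dplyr)
plan_new2 <- data.frame(order_new = order_new) %>%
  group_by(order_new) %>%
  mutate(row_id = row_number()) %>%   # row number within each group of the new order
  ungroup() %>%
  left_join(
    plan %>% group_by(lab) %>% mutate(row_id = row_number()) %>% ungroup(),
    by = c("order_new" = "lab", "row_id")
  ) %>%
  select(order_new, tr)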

Randomly sample values from a pool so that the sum is less than a threshold in R

Let's say we have a pool of values and I want to sample a random number of values from this pool, so that the sum of these values is between two thresholds. I want to design a function in R to implement that.
pool = data.frame(ID = letters, value = sample(1:5, size = 26, replace = T))
> print(pool)
ID value
1 a 1
2 b 4
3 c 4
4 d 2
5 e 2
6 f 4
7 g 5
8 h 5
9 i 4
10 j 3
11 k 3
12 l 5
13 m 3
14 n 2
15 o 3
16 p 4
17 q 1
18 r 1
19 s 5
20 t 1
21 u 2
22 v 4
23 w 5
24 x 2
25 y 4
26 z 1
I want to randomly sample any number of IDs such that the sum of the values for these IDs is between two thresholds, let's say between 8 and 10 (including the two boundaries). The expected outcome should look like these:
c("a", "b", "c")
c("f", "g")
c("a", "d", "e", "j", "k")
I think this question has not been asked previously. Does anyone have clues?
Here's an approach where I shuffle the input and check the cumulative sum of the shuffled output, looking for an acceptable total.
If a prefix of that shuffled sequence happens to work, it outputs that prefix (in this implementation, the longest prefix whose sum stays at or under the max threshold). If it doesn't work, it reshuffles and looks again, up to the maximum number of iterations.
set.seed(42)
library(dplyr)
sample_in_range <- function(src_tbl, min_sum = 8, max_sum = 10, max_iter = 100) {
  for (i in 1:max_iter) {
    output <- src_tbl %>%
      sample_n(nrow(src_tbl)) %>%          # shuffle all rows
      mutate(ID = as.character(ID),
             cuml = cumsum(value)) %>%
      filter(cuml <= max_sum)              # longest prefix under the max
    if (max(output$cuml) >= min_sum) return(output)
  }
}
output <- sample_in_range(pool)
output
ID value cuml
1 k 3 3
2 w 2 5
3 z 4 9
4 t 1 10
output %>% pull(ID)
[1] "k" "w" "z" "t"

R - Converting an incidence matrix (csv file) to edge list format

I am studying social network analysis and will be using Ucinet to draw network graphs. For this, I have to convert the csv file to an edge list format. Converting an adjacency matrix to an edge list was successful. However, it is proving difficult to convert an incidence matrix to the edge list format.
The csv file ('some.csv') I have contains an incidence matrix like this:
A B C D
a 1 0 3 1
b 0 0 0 2
c 3 2 0 1
The code that converted the adjacency matrix to the edge list was as follows:
x<-read.csv("C:/.../something.csv", header=T, row.names=1)
net<-as.network(x, matrix.type='adjacency', ignore.eval=FALSE, names.eval='dd', loops=FALSE)
el<-edgelist(net, attrname='dd')
write.csv(el, file='C:/.../result.csv')
So far I have only succeeded in loading the file. I tried to follow the above method, but I get an error.
y<-read.csv("C:/.../some.csv", header=T, row.names=1)
net2<-network(y, matrix.type='incidence', ignore.eval=FALSE, names.eval='co', loops=FALSE)
Error in network.incidence(x, g, ignore.eval, names.eval, na.rm, edge.check) :
Supplied incidence matrix has empty head/tail lists. (Did you get the directedness right?)
I want to see the result in this way:
a A 1
a C 3
a D 1
b D 2
c A 3
c B 2
c D 1
I tried adjusting the values as the error suggested, but I could not get the result I wanted.
Thank you for any assistance with this.
Here's your data:
inc_mat <- matrix(
c(1, 0, 3, 1,
0, 0, 0, 2,
3, 2, 0, 1),
nrow = 3, ncol = 4, byrow = TRUE
)
rownames(inc_mat) <- letters[1:3]
colnames(inc_mat) <- LETTERS[1:4]
inc_mat
#> A B C D
#> a 1 0 3 1
#> b 0 0 0 2
#> c 3 2 0 1
Here's a generalized function that does the trick:
as_edgelist.weighted_incidence_matrix <- function(x, drop_rownames = TRUE) {
  # 3-column matrix of row index, column index, and `x`'s values
  melted <- do.call(cbind, lapply(list(row(x), col(x), x), as.vector))
  filtered <- melted[melted[, 3] != 0, ] # drop rows where the value is 0
  # data frame where the first 2 columns are...
  df <- data.frame(mode1 = rownames(x)[filtered[, 1]], # `x`'s rownames, indexed by the first column in `filtered`
                   mode2 = colnames(x)[filtered[, 2]], # `x`'s colnames, indexed by the second column in `filtered`
                   weight = filtered[, 3],             # the third column in `filtered`
                   stringsAsFactors = FALSE)
  out <- df[order(df$mode1), ] # sort by the first column
  if (!drop_rownames) {
    return(out)
  }
  `rownames<-`(out, NULL)
}
Take it for a spin:
el <- as_edgelist.weighted_incidence_matrix(inc_mat)
el
#> mode1 mode2 weight
#> 1 a A 1
#> 2 a C 3
#> 3 a D 1
#> 4 b D 2
#> 5 c A 3
#> 6 c B 2
#> 7 c D 1
Here are the results you wanted:
control_df <- data.frame(
mode1 = c("a", "a", "a", "b", "c", "c", "c"),
mode2 = c("A", "C", "D", "D", "A", "B", "D"),
weight = c(1, 3, 1, 2, 3, 2, 1),
stringsAsFactors = FALSE
)
control_df
#> mode1 mode2 weight
#> 1 a A 1
#> 2 a C 3
#> 3 a D 1
#> 4 b D 2
#> 5 c A 3
#> 6 c B 2
#> 7 c D 1
Do they match?
identical(control_df, el)
#> [1] TRUE
This might not be the most efficient way, but it produces the expected result:
y <- matrix(c(1, 0, 3, 0, 0, 2, 3, 0, 0, 1, 2, 1), nrow = 3)
colnames(y) <- c("e.A", "e.B", "e.C", "e.D")
dt <- data.frame(rnames = c("a", "b", "c"))
dt <- cbind(dt, y)
# rnames e.A e.B e.C e.D
#1 a 1 0 3 1
#2 b 0 0 0 2
#3 c 3 2 0 1
# use the reshape() function to convert the data frame into long format
M <- reshape(dt, direction = "long", idvar = "rnames", varying = c("e.A", "e.B", "e.C", "e.D"))
# keep only the nonzero entries
M <- M[M$e > 0, ]
M
# rnames time e
# a.A a A 1
# c.A c A 3
# c.B c B 2
# a.C a C 3
# a.D a D 1
# b.D b D 2
# c.D c D 1
# If M needs to be sorted by the column rnames:
M[order(M$rnames), ]
# rnames time e
# a.A a A 1
# a.C a C 3
# a.D a D 1
# b.D b D 2
# c.A c A 3
# c.B c B 2
# c.D c D 1
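For comparison, a compact base R sketch using which(..., arr.ind = TRUE) on the inc_mat object from the first answer (an addition of mine, not from either answer):
idx <- which(inc_mat != 0, arr.ind = TRUE)  # row/col positions of nonzero cells
el2 <- data.frame(
  mode1  = rownames(inc_mat)[idx[, "row"]],
  mode2  = colnames(inc_mat)[idx[, "col"]],
  weight = inc_mat[idx]                     # matrix indexing pulls the weights
)
el2[order(el2$mode1, el2$mode2), ]          # sort to match the desired output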

Removing Only Adjacent Duplicates in Data Frame in R

I have a data frame in R that is supposed to have duplicates. However, there are some duplicates that I need to remove. In particular, I only want to remove row-adjacent duplicates, but keep the rest. For example, suppose I had the data frame:
df = data.frame(x = c("A", "B", "C", "A", "B", "C", "A", "B", "B", "C"),
                y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
This results in the following data frame
x y
A 1
B 2
C 3
A 4
B 5
C 6
A 7
B 8
B 9
C 10
In this case, I expect the repeating pattern "A, B, C, A, B, C, etc.". However, it is only a problem if I see adjacent row duplicates. In my example above, that would be rows 8 and 9, where the duplicate "B" values are adjacent to each other.
In my data set, whenever this occurs, the first instance is always a user error, and the second is always the correct version. In very rare cases, the duplicates might occur 3 (or more) times. However, in every case, I would always want to keep the last occurrence. Thus, following the example above, I would like the final data set to look like
A 1
B 2
C 3
A 4
B 5
C 6
A 7
B 9
C 10
Is there an easy way to do this in R? Thank you in advance for your help!
Edit: 11/19/2014 12:14 PM EST
There was a solution posted by user Akron (spelling?) that has since been deleted. I am not sure why, because it seems to work for me.
The solution was
df = df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
It seems to work for me, so why was it deleted? For example, in cases with more than 2 consecutive duplicates:
df = data.frame(x = c("A", "B", "B", "B", "C", "C", "C", "A", "B", "C", "A", "B", "B", "C"), y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
x y
1 A 1
2 B 2
3 B 3
4 B 4
5 C 5
6 C 6
7 C 7
8 A 8
9 B 9
10 C 10
11 A 11
12 B 12
13 B 13
14 C 14
> df = df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
> df
x y
1 A 1
4 B 4
7 C 7
8 A 8
9 B 9
10 C 10
11 A 11
13 B 13
14 C 14
This seems to work?
Try
df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
# x y
#1 A 1
#2 B 2
#3 C 3
#4 A 4
#5 B 5
#6 C 6
#7 A 7
#9 B 9
#10 C 10
Explanation
Here, each element is compared with the element that follows it. This is done by removing the first element from the column (shifting it up by one) and comparing the result with the column from which the last element is removed (so that the lengths become equal).
df$x[-1] #first element removed
#[1] B C A B C A B B C
df$x[-nrow(df)]
#[1] A B C A B C A B B #last element `C` removed
df$x[-1]!=df$x[-nrow(df)]
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
In the above, the length is 1 less than the number of rows of df, as we removed one element. To compensate, we concatenate a TRUE at the end (the last row is always kept) and use this logical vector to subset the dataset; a row survives only if its value differs from the value in the row below it, so only the last occurrence of each run remains.
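Putting the pieces together on the original 10-row df:
keep <- c(df$x[-1] != df$x[-nrow(df)], TRUE)
keep
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
df[keep, ]  # drops row 8, the first of the adjacent "B"s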
Here's an rle solution:
df[cumsum(rle(as.character(df$x))$lengths), ]
# x y
# 1 A 1
# 2 B 2
# 3 C 3
# 4 A 4
# 5 B 5
# 6 C 6
# 7 A 7
# 9 B 9
# 10 C 10
Explanation:
RLE stands for Run Length Encoding. rle() returns a list of two vectors: values, the value of each run, and lengths, the number of consecutive repeats of each value. For example, x <- c(3, 2, 2, 3) has values c(3, 2, 3) and lengths c(1, 2, 1). The cumulative sum of the lengths, c(1, 3, 4), gives the index of the last element of each run; subset x with this vector and you get c(3, 2, 3). Note that the second entry, index 3, points at the last occurrence of 2 in that particular run.
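A quick check of that bookkeeping:
x <- c(3, 2, 2, 3)
r <- rle(x)
r$values              # [1] 3 2 3
r$lengths             # [1] 1 2 1
cumsum(r$lengths)     # [1] 1 3 4  (index of the last element of each run)
x[cumsum(r$lengths)]  # [1] 3 2 3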
You could also try
df[c(diff(as.numeric(df$x)), 1) != 0, ]
In case x is of character class (rather than factor), try
df[c(diff(as.numeric(factor(df$x))), 1) != 0, ]
# x y
# 1 A 1
# 2 B 2
# 3 C 3
# 4 A 4
# 5 B 5
# 6 C 6
# 7 A 7
# 9 B 9
# 10 C 10
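To see why this works, here are the intermediate values on the original df (my own illustration): diff() is 0 exactly where a value equals the one after it, and the appended 1 keeps the last row.
codes <- as.numeric(factor(df$x))  # A/B/C -> 1/2/3
codes
# [1] 1 2 3 1 2 3 1 2 2 3
c(diff(codes), 1)
# [1]  1  1 -2  1  1 -2  1  0  1  1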

Finding the Column Index for a Specific Value

I am having a brain cramp. Below is a toy dataset:
df <- data.frame(
id = 1:6,
v1 = c("a", "a", "c", NA, "g", "h"),
v2 = c("z", "y", "a", NA, "a", "g"),
stringsAsFactors=F)
I have a specific value that I want to find across a set of defined columns, and I want to identify the position where it is located. The fields I am searching are characters, and the trick is that the value I am looking for might not exist. In addition, null strings are also present in the dataset.
Assuming I knew how to do this, the variable position indicates the values I would like returned.
> df
id v1 v2 position
1 1 a z 1
2 2 a y 1
3 3 c a 2
4 4 <NA> <NA> 99
5 5 g a 2
6 6 h g 99
The general rule is that I want to find the position of value "a", and if it is not located or if v1 is missing, then I want 99 returned.
In this instance, I am searching across v1 and v2, but in reality, I have 10 different variables. It is also worth noting that the value I am searching for can only exist once across the 10 variables.
What is the best way to generate this recode?
Many thanks in advance.
Use match:
> df$position <- apply(df, 1, function(x) match('a', x[-1], nomatch = 99))
> df
id v1 v2 position
1 1 a z 1
2 2 a y 1
3 3 c a 2
4 4 <NA> <NA> 99
5 5 g a 2
6 6 h g 99
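If the ten real variables are not simply everything except the first column, the same idea can be pointed at an explicit set of columns (the names below are illustrative; swap in your own):
search_cols <- c("v1", "v2")  # replace with your 10 variable names
df$position <- apply(df[search_cols], 1, match, x = "a", nomatch = 99)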
Firstly, drop the first column:
df <- df[, -1]
Then, do something like this (disclaimer: I'm feeling terribly sleepy*):
( df$result <- unlist(lapply(
    apply(df, 1, grep, pattern = "a"),
    function(x) ifelse(length(x) == 0, 99, x)
)) )
v1 v2 result
1 a z 1
2 a y 1
3 c a 2
4 <NA> <NA> 99
5 g a 2
6 h g 99
* sleepy = code is not vectorised
EDIT (slightly different solution, I still feel sleepy):
df$result <- rapply(apply(df, 1, grep, pattern = "a"), function(x) ifelse(length(x) == 0, 99, x))
