as.factor not working with INT values in R

Hey guys if you could please help me. I got this dataset:
q1 q2 q3 m1 m2 b1 b2
A 78 150 2887 4 4 0 1
B 74 142 2904 4 4 1 1
C 79 137 1564 4 4 1 0
D 80 164 4522 2 2 0 0
E 74 173 5025 2 3 0 1
F 73 140 1971 3 3 0 1
I want to transform m1:b2 into factors. If I do
data[,4:7] <- as.factor(data[,4:7])
it doesn't work, the values change to char vectors. It gets messed up like this:
q1 q2 q3 m1 m2 b1
A 78 150 2887 c(4, 4, 4, 2, 2, 3) c(0, 1, 1, 0, 0, 0) c(4, 4, 4, 2, 2, 3)
B 74 142 2904 c(4, 4, 4, 2, 3, 3) c(1, 1, 0, 0, 1, 1) c(4, 4, 4, 2, 3, 3)
C 79 137 1564 c(0, 1, 1, 0, 0, 0) c(4, 4, 4, 2, 2, 3) c(0, 1, 1, 0, 0, 0)
D 80 164 4522 c(1, 1, 0, 0, 1, 1) c(4, 4, 4, 2, 3, 3) c(1, 1, 0, 0, 1, 1)
E 74 173 5025 c(4, 4, 4, 2, 2, 3) c(0, 1, 1, 0, 0, 0) c(4, 4, 4, 2, 2, 3)
F 73 140 1971 c(4, 4, 4, 2, 3, 3) c(1, 1, 0, 0, 1, 1) c(4, 4, 4, 2, 3, 3)
b2
A c(0, 1, 1, 0, 0, 0)
B c(1, 1, 0, 0, 1, 1)
C c(4, 4, 4, 2, 2, 3)
D c(4, 4, 4, 2, 3, 3)
E c(0, 1, 1, 0, 0, 0)
F c(1, 1, 0, 0, 1, 1)
But if I use lapply it works fine. Can you explain why? I've used as.factor(d[]) on other occasions and it worked just fine with other data.frame objects. Thank you.

The documentation for as.factor (type ?as.factor) describes the first argument x as "a vector of data, usually taking a small number of distinct values". A data frame is not such a vector: it is a list of columns. When you pass several columns at once, as.factor does not convert them one by one. Instead, each whole column is coerced to a single string such as "c(4, 4, 4, 2, 2, 3)", so the result has only four values (one per column). Assigning it back to data[,4:7] recycles those four strings down the cells, which is exactly the output you show. (Depending on the R version, the same call may raise an error instead.)
You should use:
data[4:7] <- lapply(data[4:7], as.factor)
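or, with dplyr from the tidyverse loaded:
data <- data %>% mutate_at(4:7, as.factor)
(mutate_at() is superseded as of dplyr 1.0.0; the current equivalent is mutate(across(4:7, as.factor)).)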
Both solutions treat each supplied column (here columns 4 through 7) as its own vector: each one is converted to a factor separately and assigned back to its place.
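As a minimal sketch of the column-by-column conversion, the snippet below uses a small made-up data frame (the values are illustrative, not the asker's full data). It also shows how a whole vector stored as one list element is rendered as a single string, which is where the "c(4, 4, ...)" cells come from.

```r
# Hypothetical three-row stand-in for the question's data
data <- data.frame(
  q1 = c(78, 74, 79),
  m1 = c(4, 4, 2),
  b1 = c(0, 1, 1)
)

# A data frame is a list of columns, so lapply() visits one column at a time
data[2:3] <- lapply(data[2:3], as.factor)
sapply(data, class)

# Coercing a list element to character deparses the whole vector into one string
as.character(list(c(4, 4, 2)))
```

Only the selected columns become factors; q1 stays numeric.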

Related

R function to find rank of a value in a sorted vector

I'd like to find the rank of a value in a sorted (increasing) vector. Given the vector and a value, I want: the index of the value if it is present (or the mean of its indices if it occurs more than once); the index of the greatest element less than the value if it is not present but lies within the range of the vector; or something reasonable if the value is outside the range of the vector altogether.
Let's say xx is the vector and x is the value. mean(which(xx == x)) covers the value-present case, and max(which(xx < x)) covers the value-not-present-and-in-range case. 1 and length(xx) are probably reasonable outputs for the not-in-range case.
So I could do that, but I'd like to avoid creating a Boolean vector the size of xx, and also there are just enough wrinkles that I'd prefer to call a built-in or library function instead of rolling my own. Perhaps there is something simple which I've overlooked.
Here's an example. The first value, 7, is present in the vector. The second, 7.3, is not present. I'd like to get the outputs 82.5 and 86, respectively.
> sort (floor (runif (100) * 10)) -> xx
> xx
[1] 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
[38] 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6
[75] 6 6 6 6 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 9 9 9
> mean (which (xx == 7))
[1] 82.5
> max (which (xx <= 7.3))
[1] 86
EDIT: with hints from akrun, I've come up with the following. Note that when there are duplicates, it makes use of the fact that match returns the least index and findInterval returns the greatest.
# assume xx is sorted already
mean.rank.in <- function (xx, x) {
  i <- findInterval (x, xx)
  if (i == 0) 0
  else if (xx[[i]] == x)
    # account for duplicates here:
    # findInterval returned greatest index, call match to find least
    (match(x, xx) + i)/2
  else i
}
Here are some checks:
xx <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7,
7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9)
mean.rank.in (xx, 7) == 82.5 # expect TRUE
mean.rank.in (xx, 7.3) == 86 # expect TRUE
sapply (xx, function (x) mean.rank.in (xx, x)) # looks right
sum (sapply (xx, function (x) mean.rank.in (xx, x))) == 5050 # expect TRUE
yy <- sort (runif (100))
all (sapply (yy, function (y) mean.rank.in (yy, y)) == 1:100) # expect TRUE
dyy <- min (yy[2:100] - yy[1:99])
yy1 <- yy + dyy/2
all (sapply (yy1, function (y) mean.rank.in (yy1, y)) == 1:100) # expect TRUE
mean.rank.in (yy, yy[[1]] - 1) == 0 # expect TRUE
mean.rank.in (yy, yy[[100]] + 1) == 100 # expect TRUE
Here is one option with rank
rank(xx)[match(7, xx)]
#[1] 82.5
and with findInterval
findInterval(7.3, xx)
#[1] 86
data
xx <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7,
7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9)
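The two pieces above can be combined into one vectorised helper. This is only a sketch: the name mean_rank is made up here, and it relies on the left.open argument of findInterval (available since R 3.3.0) to count the elements strictly below x, so no Boolean vector the size of xx is built.

```r
# Same 100 values as xx above, written compactly
xx <- rep(0:9, times = c(12, 14, 12, 9, 11, 12, 8, 8, 11, 3))

# mean_rank is a hypothetical helper name; xx must be sorted increasingly
mean_rank <- function(xx, x) {
  hi <- findInterval(x, xx)                    # number of elements <= x
  lo <- findInterval(x, xx, left.open = TRUE)  # number of elements <  x
  # if x occurs in xx, average its first (lo + 1) and last (hi) index
  ifelse(hi > lo, (lo + 1 + hi) / 2, hi)
}

mean_rank(xx, c(7, 7.3))
```

Values below the range give 0 and values above it give length(xx), matching the behaviour of mean.rank.in in the question.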

R reshape wide to long: multiple variables, observations with multiple indices

I have some data containing observations with multiple indices $y_{ibc}$ stored in a messy wide format. I have been fiddling around with tidyr and reshape2 but could not figure it out (reshaping really is my nemesis).
Here is an example:
df <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9), a1b1c1 = c(5,
2, 1, 4, 3, 1, 0, 1, 3), a2b1c1 = c(3, 4, 1, 1, 3, 2, 1, 4, 4
), a3b1c1 = c(4, 0, 0, 1, 1, 1, 0, 0, 1), a1b2c1 = c(1, 0, 4,
2, 4, 1, 0, 4, 2), a2b2c1 = c(2, 0, 1, 0, 1, 0, 3, 2, 0), a3b2c1 = c(2,
4, 3, 0, 2, 3, 3, 3, 4), yc1 = c(1, 2, 2, 1, 2, 2, 2, 1, 1), a1b1c2 = c(4,
2, 3, 0, 4, 4, 2, 1, 4), a2b1c2 = c(3, 0, 3, 3, 4, 4, 3, 2, 2
), a3b1c2 = c(3, 1, 0, 1, 4, 0, 2, 2, 3), a1b2c2 = c(2, 2, 0,
3, 2, 1, 4, 1, 0), a2b2c2 = c(3, 0, 2, 3, 4, 4, 4, 0, 4), a3b2c2 = c(0,
0, 0, 2, 0, 0, 1, 4, 3), yc2 = c(2, 2, 2, 1, 2, 2, 2, 1, 1), X = c(5,
6, 3, 7, 4, 3, 2, 3, 2)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
This is what I want (excerpt):
id b c y a1 a2 a3 X
1 1 b1 c1 1 5 3 4 5
2 1 b2 c1 1 1 2 2 5
3 1 b1 c2 2 4 3 3 5
4 1 b2 c2 2 2 3 0 5
Using tidyr & dplyr:
library(tidyverse)
df %>%
  pivot_longer(cols = matches("a.b.c."), names_to = "name", values_to = "value") %>%
  separate(name, into = c("a", "b", "c"), sep = c(2, 4)) %>%
  mutate(y = case_when(c == "c1" ~ yc1,
                       c == "c2" ~ yc2)) %>%
  pivot_wider(names_from = a, values_from = value) %>%
  select(id, b, c, y, a1, a2, a3, X)
First, convert all your a/b/c columns to long format and split the three parts of each name into separate columns. Then combine your y columns into one, depending on the value of c, using mutate and case_when (you could also use if_else for two options, but case_when extends more easily to more values). Then pivot your a columns back to wide format and use select to put the columns in the right order and drop yc1 and yc2.
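The positional split done by separate(..., sep = c(2, 4)) cuts each name after its second and fourth character. A rough base R equivalent, shown only to illustrate what that step produces (the vector nm and the data frame parts are illustrative names), could look like this:

```r
# Two of the wide column names from the example
nm <- c("a1b1c1", "a3b2c2")

# Cut after character 2 and after character 4, as sep = c(2, 4) does
parts <- data.frame(
  a = substr(nm, 1, 2),
  b = substr(nm, 3, 4),
  c = substr(nm, 5, 6),
  stringsAsFactors = FALSE
)
parts
```

This only works because every name has the same fixed-width layout; names of varying length would need a regular expression instead.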

Create a time series using diagonal matrices

Let's say that my data has the following structure:
structure(list(Year = c(2000, 2000, 2000, 2000, 2000, 2000, 2000,
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,
2000, 2000, 2001, 2001, 2001, 2001), Month = c(1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), Day = c(1,
1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 1, 1,
1, 1), FivMin = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3,
4, 1, 2, 3, 4, 1, 2, 3, 4), A = c(1, 2, 3, 0, 1, 5, 3, 4, 1,
0, 3, 1, 0, 2, 3, 0, 1, 2, 0, 9, 1, 2, 3, 0), B = c(2, 3, 4,
1, 2, 3, 0, 1, 2, 1, 4, -2, 2, 1, 0, 2, 2, 3, -1, 1, 2, 3, 4,
1), C = c(3, 0, 1, 2, 3, 4, 1, 9, 3, 7, 1, 2, 3, 4, 1, 2, 3,
4, 1, 2, 3, 0, 1, 2), D = c(4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2,
3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3)), row.names = c(NA, -24L
), class = c("tbl_df", "tbl", "data.frame"))
My idea is to apply the cross product to each day's data. To do that, I wrote the following code:
res <- lapply(split(data, data[c("Year","Month","Day")]),
function(x) tcrossprod(t(x[c("A","B","C","D")])))
Final<-do.call(rbind, lapply(res, diag))
The output of Final is:
A B C D
2000.1.1 14 30 14 30
2001.1.1 14 30 14 30
2000.1.2 51 14 107 30
2001.1.2 0 0 0 0
2000.1.3 11 25 63 30
2001.1.3 0 0 0 0
2000.1.4 13 9 30 30
2001.1.4 0 0 0 0
2000.1.5 86 15 30 30
2001.1.5 0 0 0 0
What I need is a time series (a matrix or data frame) formed by the diagonals computed with the cross product. That is, my desired time series would be
A B C D
2000.1.1 14 30 14 30
2000.1.2 51 14 107 30
2000.1.3 11 25 63 30
2000.1.4 13 9 30 30
2000.1.5 86 15 30 30
2001.1.1 14 30 14 30
What changes would my original code need? I thought I could replace the split command with group_by, but it did not work.
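Because split is given three grouping columns, it builds a list entry for every combination of Year, Month, and Day, including combinations that have no rows (such as 2001.1.2). Those empty data frames produce the rows of zeros. Remove the empty entries and then sort the result: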
ls<- split(data, data[c("Year","Month","Day")])
ls<- ls[sapply(ls, nrow)>0]
res <- lapply(ls, function(x) tcrossprod(t(x[c("A","B","C","D")])))
Final<-do.call(rbind, lapply(res, diag))
Final <- Final[ order(row.names(Final)), ]
Final
Output:
A B C D
2000.1.1 14 30 14 30
2000.1.2 51 14 107 30
2000.1.3 11 25 63 30
2000.1.4 13 9 30 30
2000.1.5 86 15 30 30
2001.1.1 14 30 14 30
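A possible shortcut, sketched under two assumptions: split() is asked to drop unused combinations with drop = TRUE, and the diagonal of tcrossprod(t(x)) is computed directly as the column-wise sum of squares. The data frame below is a shortened plain data.frame copy of the question's data (three days only), not the original tibble.

```r
# Shortened stand-in for the question's data
data <- data.frame(
  Year  = c(rep(2000, 8), rep(2001, 4)),
  Month = 1,
  Day   = c(rep(1, 4), rep(2, 4), rep(1, 4)),
  A = c(1, 2, 3, 0, 1, 5, 3, 4, 1, 2, 3, 0),
  B = c(2, 3, 4, 1, 2, 3, 0, 1, 2, 3, 4, 1),
  C = c(3, 0, 1, 2, 3, 4, 1, 9, 3, 0, 1, 2),
  D = c(4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3)
)

# drop = TRUE discards Year/Month/Day combinations that never occur
groups <- split(data[c("A", "B", "C", "D")],
                data[c("Year", "Month", "Day")], drop = TRUE)

# diag(tcrossprod(t(x))) equals the sum of squares of each column
Final <- t(sapply(groups, function(x) colSums(x^2)))
Final <- Final[order(rownames(Final)), ]
Final
```

Sorting by row name is a plain string sort, so it gives chronological order here but would not for two-digit days or months without zero padding.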

How to sum remaining values after using gsub?

This problem is unsolved by my brain, so I'm asking all of you for a little help.
This is part of my data:
rfam[1:20,]
id name
1 RF00001 LL_skoljka_r41782307_x1
2 RF00001 LL_skoljka_r9950955_x1
3 RF00001 LL_skoljka_r49323482_x1
4 RF00001 LL_skoljka_r14141437_x1
5 RF00001 LL_skoljka_r16457227_x3
6 RF00002 LL_skoljka_r40347558_x1
7 RF00002 LL_skoljka_r44415149_x1
8 RF00002 LL_skoljka_r13145032_x1
9 RF00002 LL_skoljka_r29248915_x42
10 RF00003 LL_skoljka_r15936986_x1
11 RF00003 LL_skoljka_r28953530_x1
12 RF00003 LL_skoljka_r32665758_x1
13 RF00003 LL_skoljka_r32835489_x1
14 RF00003 LL_skoljka_r32835498_x1
15 RF04051 LL_skoljka_r33254611_x1
16 RF04051 LL_skoljka_r29761867_x12
17 RF04051 LL_skoljka_r45123665_x2
18 RF04051 LL_skoljka_r34837827_x15
19 RF08595 LL_skoljka_r38900754_x1
20 RF08595 LL_skoljka_r22016530_x1
In the first step I want to remove everything before the x in the variable name, so I use:
rfam$name<- as.data.frame(sapply(rfam$name, gsub, pattern='^.*?x', replacement=""))
Result:
rfam[1:20,]
id name
1 RF00001 1
2 RF00001 1
3 RF00001 1
4 RF00001 1
5 RF00001 3
6 RF00002 1
7 RF00002 1
8 RF00002 1
9 RF00002 42
10 RF00003 1
11 RF00003 1
12 RF00003 1
13 RF00003 1
14 RF00003 1
15 RF04051 1
16 RF04051 12
17 RF04051 2
18 RF04051 15
19 RF08595 1
20 RF08595 1
In the second step I would like to sum the values left in name for each id.
The result should look like this:
view(rfam)
id name
1 RF00001 7
2 RF00002 45
3 RF00003 5
4 RF04051 30
5 RF08595 2
To sum the values, the variable has to be numeric. Both of my variables are factors. So I transformed id to character using rfam[,1]=as.character(rfam[,1]) and tried to convert name to numeric with rfam[,2]=as.numeric(levels(rfam[,2])[rfam[,2]]). The transformation of id was successful, while name returns NAs.
I've also tried rfam[,2]=as.numeric(as.character(rfam[,2])), but the result was the same.
I've tried to export the data to a txt file and do the rest of the analysis in Excel, but when I export the data, it looks like this:
"id" "name"
"1" "RF00001" c(1, 1, 1, 1, 9, 1, 1, 1, 11, 1, 1, 1, 1, 1, 1, 3, 7, 5, 1, 1, 1, 9, 1, 14, 10, 7, 1, 5, 1, 1, 1, 1, 1, 7, 1, 2, 1, 1, 1, 9, 1, 7, 1, 1, 1, 1, 1, 1, 10, 7, 1, 10, 7, 1, 1, 1, 1, 1, 7, 1, 10, 1, 1, 1, 1, 1, 1, 1, 7, 1,...)
"2" "RF00001" c(1, 1, 1, 1, 9, 1, 1, 1, 11, 1, 1, 1, 1, 1, 1, 3, 7, 5, 1, 1, 1, 9, 1, 14, 10, 7, 1, 5, 1, 1, 1, 1, 1, 7, 1, 2, 1, 1, 1, 9, 1, 7, 1, 1, 1, 1, 1, 1, 10, 7, 1, 10, 7, 1, 1, 1, 1, 1, 7, 1, 10, 1, 1, 1, 1, 1, 1, 1, 7, 1,...)
"3" "RF00001" c(1, 1, 1, 1, 9, 1, 1, 1, 11, 1, 1, 1, 1, 1, 1, 3, 7, 5, 1, 1, 1, 9, 1, 14, 10, 7, 1, 5, 1, 1, 1, 1, 1, 7, 1, 2, 1, 1, 1, 9, 1, 7, 1, 1, 1, 1, 1, 1, 10, 7, 1, 10, 7, 1, 1, 1, 1, 1, 7, 1, 10, 1, 1, 1, 1, 1, 1, 1, 7, 1,...)
Now I'm at a dead end. I don't understand what is happening and would appreciate your help.
Update
The problem is not the grouping step: wrapping your sapply() call in as.data.frame() stores a data.frame inside rfam$name instead of a plain vector.
Starting from the original rfam (before the sapply step), the following data.table solution converts the rfam$name column to a numeric vector so that it can be summed by group:
library(data.table)
setDT(rfam)[, name := as.numeric(gsub('^.*?x', replacement = "", name))]
Now we can use dplyr to get the desired output:
library(dplyr)
as.data.frame(rfam) %>% group_by(id) %>% summarise(name=sum(name))
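If you would rather avoid extra packages, the same two steps can be sketched in base R. The four rows below are copied from the question as a small stand-in, and the pattern "^.*_x" assumes every name ends in _x followed by the count.

```r
# Four rows from the question, with factor columns as in the original data
rfam <- data.frame(
  id   = c("RF00001", "RF00001", "RF00002", "RF00002"),
  name = c("LL_skoljka_r41782307_x1", "LL_skoljka_r16457227_x3",
           "LL_skoljka_r40347558_x1", "LL_skoljka_r29248915_x42"),
  stringsAsFactors = TRUE
)

# sub() is vectorised, so no sapply()/as.data.frame() wrapper is needed;
# the result is a plain character vector that converts cleanly to numeric
rfam$name <- as.numeric(sub("^.*_x", "", rfam$name))

# Sum the counts within each id
totals <- aggregate(name ~ id, data = rfam, FUN = sum)
totals
```

Because name stays an ordinary numeric column, writing rfam to a text file afterwards should no longer produce the c(1, 1, ...) cells.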

Subset dataset by selecting a span of rows beginning and ending with certain value or deleting rows before and after certain values

I haven't found anything similar to my question, and I'm stuck on the following problem.
I have a large amount of data, so I created a simpler data set that you can use:
structure(list(id = 123:182, tag = c(1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3), a = c(3, 3, 5, 9, 1, 9, 9, 5,
5, 1, 1, 1, 5, 3, 9, 3, 5, 9, 3, 9, 9, 1, 5, 1, 3, 3, 1, 3, 9,
3, 3, 5, 3, 1, 9, 5, 9, 1, 5, 3, 9, 5, 9, 5, 5, 9, 1, 3, 5, 5,
3, 9, 3, 1, 1, 1, 3, 5, 5, 3), b = c(0, 0, 0, 0, 1, 1, 0, 0,
1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
1, 0, 1, 1, 1, 1, 0, 0, 0, 0)), .Names = c("id", "tag", "a",
"b"), row.names = c(NA, -60L), class = "data.frame")
I want to subset the data by tag, deleting the leading and trailing rows in which column b is 0, so that each tag's block starts and ends with b equal to 1. I tried something with the ddplr function, but it doesn't work and isn't worth showing...
The result should look like this:
id tag a b
5 127 1 1 1
6 128 1 9 1
7 129 1 9 0
8 130 1 5 0
9 131 1 5 1
10 132 1 1 0
11 133 1 1 1
12 134 1 1 0
13 135 1 5 1
14 136 1 3 0
15 137 1 9 0
16 138 1 3 1
24 146 2 1 1
25 147 2 3 0
26 148 2 3 1
27 149 2 1 0
28 150 2 3 1
29 151 2 9 1
30 152 2 3 0
31 153 2 3 1
32 154 2 5 0
33 155 2 3 1
34 156 2 1 1
35 157 2 9 1
36 158 2 5 1
45 167 3 5 1
46 168 3 9 1
47 169 3 1 0
48 170 3 3 0
49 171 3 5 1
50 172 3 5 0
51 173 3 3 1
52 174 3 9 0
53 175 3 3 1
54 176 3 1 1
55 177 3 1 1
56 178 3 1 1
What can I do?
If dd is your data frame try this:
w <- which(dd$b == 1)
dd[min(w):max(w), ]
To do it by tag try this:
is.ok <- function(b.ok) {
  if (any(b.ok)) {
    w <- which(b.ok)
    seq_along(b.ok) %in% min(w):max(w)
  } else FALSE
}
ok <- ave(dd$b == 1, dd$tag, FUN = is.ok)
dd[ok, ]
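A variation on the same idea, offered only as a sketch: instead of locating the first and last 1 with which(), a row can be kept when at least one 1 has been seen at or before it and at least one 1 occurs at or after it, which two cumulative sums express directly. The small dd below is a made-up example, not the question's data.

```r
# Hypothetical toy data: two tags with leading/trailing zeros in b
dd <- data.frame(
  tag = c(1, 1, 1, 1, 2, 2, 2, 2),
  b   = c(0, 1, 0, 1, 1, 0, 0, 0)
)

# TRUE where a 1 has occurred at or before the row AND occurs at or after it
between.ones <- function(v) cumsum(v) > 0 & rev(cumsum(rev(v))) > 0

# ave() applies the function within each tag and keeps the original row order
ok <- ave(dd$b == 1, dd$tag, FUN = between.ones)
kept <- dd[ok, ]
kept
```

A tag with no 1 at all yields all FALSE here, so its rows are dropped, the same as in the is.ok version.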
UPDATE: by tag
