I am trying to add a column to my data that holds the number embedded in each row's data label.
My data looks like this:
df <- structure(list(data = structure(c(1L, 1L, 1L, 1L, 1L, 3L, 3L,
4L, 4L, 5L, 5L, 6L, 5L, 7L, 7L, 8L, 8L, 2L, 2L, 2L), .Label = c("data1",
"data10", "data2", "data3", "data4", "data5", "data6", "data7"
), class = "factor"), values = structure(c(3L, 8L, 18L, 1L, 15L,
17L, 19L, 7L, 2L, 2L, 11L, 10L, 6L, 4L, 9L, 12L, 14L, 5L, 13L,
16L), .Label = c("112864.443", "11319531", "12874.443", "142983324",
"1612410048", "16349475.63", "184901841", "2223793.8", "30553282.01",
"312004.547", "3135868.44", "317403612.9", "3686081.063", "43701608",
"623793.8", "64959501.42", "67666215", "767666215", "775987137.8"
), class = "factor")), .Names = c("data", "values"), class = "data.frame", row.names = c(NA,
-20L))
I want to extract the number at the end of each entry in my first column. Since the entries are not consecutive, I don't know how to add the numbers as a separate column. The desired output should look like this:
data values
data1 12874.443 1
data1 2223793.8 1
data1 767666215 1
data1 112864.443 1
data1 623793.8 1
data2 67666215 2
data2 775987137.8 2
data3 184901841 3
data3 11319531 3
data4 11319531 4
data4 3135868.44 4
data5 312004.547 5
data4 16349475.63 4
data6 142983324 6
data6 30553282.01 6
data7 317403612.9 7
data7 43701608 7
data10 1612410048 10
data10 3686081.063 10
data10 64959501.42 10
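One way is to use gsub to strip the non-digit characters and add the result as another column. Note that the result is a character vector; wrap it in as.numeric() if you need numbers.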
df$label <- gsub("[^[:digit:]]", "", df$data)
Another way is to use str_extract, thanks to this question: "R: split character data into numbers and letters"
library(stringr)
df$label <- as.numeric(str_extract(df$data, "[0-9]+"))
> df
# data values label
# 1 data1 12874.443 1
# 2 data1 2223793.8 1
# 3 data1 767666215 1
# 4 data1 112864.443 1
# 5 data1 623793.8 1
# 6 data2 67666215 2
# 7 data2 775987137.8 2
# 8 data3 184901841 3
# 9 data3 11319531 3
# 10 data4 11319531 4
# 11 data4 3135868.44 4
# 12 data5 312004.547 5
# 13 data4 16349475.63 4
# 14 data6 142983324 6
# 15 data6 30553282.01 6
# 16 data7 317403612.9 7
# 17 data7 43701608 7
# 18 data10 1612410048 10
# 19 data10 3686081.063 10
# 20 data10 64959501.42 10
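Both approaches can be checked on a small vector. The sketch below is base R only and uses a made-up stand-in for the data column (the name data_col is just for illustration). It shows why converting the extracted text to a number matters if the labels are sorted later.

```r
# Hypothetical stand-in for df$data from the question (a factor column).
data_col <- factor(c("data1", "data2", "data10", "data4"))

# Drop every non-digit character, leaving only the number as text.
label_chr <- gsub("[^[:digit:]]", "", as.character(data_col))

# Convert to integer so that 10 sorts after 4 instead of after 1.
label <- as.integer(label_chr)

sort(label_chr)  # character sort puts "10" before "2"
sort(label)      # numeric sort: 1 2 4 10
```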
I have this file:
ID P
1 10
1 12
1 11
2 9
2 8
2 10
3 11
3 12
3 14
4 15
4 16
4 8
5 11
5 13
5 10
6 14
6 16
6 11
And I would like to randomly assign these values (a, b, c) to the rows of the file,
like this:
ID P Group
1 10 a
1 12 b
1 11 c
2 9 c
2 8 a
2 10 b
3 11 a
3 12 c
3 14 b
4 15 c
4 16 a
4 8 b
5 11 b
5 13 c
5 10 a
6 14 b
6 16 c
6 11 a
I need to do this several times, with a new random assignment each time. I tried this:
df %>% group_by(ID) %>% replicate(1,sample(df$group))
but, of course, it didn't work. Any suggestions?
Here is an option with sample
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Group = sample(c('a', 'b', 'c'), n(), replace = TRUE))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), P = c(10L, 12L, 11L, 9L, 8L,
10L, 11L, 12L, 14L, 15L, 16L, 8L, 11L, 13L, 10L, 14L, 16L, 11L
)), class = "data.frame", row.names = c(NA, -18L))
Two solutions, one with grouping, the other without
library(tidyverse)
df <- dplyr::tribble(
~ID, ~P,
1,10,
1,12,
1,11,
2,9,
2,8,
2,10,
3,11,
3,12,
3,14,
4,15,
4,16,
4,8,
5,11,
5,13,
5,10,
6,14,
6,16,
6,11
)
sample_vector <- c("a","b","c")
##Without grouping id
df_2 <- df %>%
mutate(Group = sample(sample_vector, nrow(df), replace = TRUE))
##With grouping by ID
df_2 <- df %>% group_by(ID) %>%
mutate(Group = sample(sample_vector, n(), replace = TRUE))
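Note that both answers above sample with replace = TRUE, so a label can repeat within an ID, whereas the desired output in the question shows each of a, b, c exactly once per ID. If that is the requirement, sampling without replacement within each ID gives a random permutation instead. A minimal base R sketch, assuming no ID has more rows than there are labels (assign_groups is a hypothetical helper name):

```r
df1 <- data.frame(
  ID = rep(1:6, each = 3),
  P  = c(10, 12, 11, 9, 8, 10, 11, 12, 14, 15, 16, 8, 11, 13, 10, 14, 16, 11)
)

# Hypothetical helper: for each ID, draw a random permutation of the labels.
# sample() without replacement fails if an ID has more rows than labels.
assign_groups <- function(id, labels = c("a", "b", "c")) {
  ave(character(length(id)), id,
      FUN = function(x) sample(labels, length(x)))
}

df1$Group <- assign_groups(df1$ID)

# Repeat the random assignment several times: one column per draw.
draws <- replicate(3, assign_groups(df1$ID))
```

Each call to assign_groups() produces a fresh assignment, so calling it again (or using replicate as above) covers the "several times" part of the question.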
Let's say I have this dataframe:
ID X1 X2
1 1 2
2 2 1
3 3 1
4 4 1
5 5 5
6 6 20
7 7 20
8 9 20
9 10 20
dataset <- structure(list(ID = 1:9, X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L,
10L), X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-9L))
And I want to select the rows in which the absolute difference between columns X1 and X2 is greater than or equal to 2.
For example, row 4 value is 4-1, which is 3 and should be selected.
Row 9 value is 10-20, which is -10. Absolute value is 10 and should be selected.
In this case it would be rows 3, 4, 6, 7, 8 and 9
I tried:
dataset2 = dataset[,abs(dataset- c(dataset[,2])) > 2]
But I get an error.
The operation:
abs(dataset- c(dataset[,2])) > 2
does return TRUE where the difference is more than 2, but the comparison only uses my second column and does not select the rows properly.
We can take the difference between the 'X1' and 'X2' columns and use it as a logical expression in subset to select the rows
subset(dataset, abs(X1 - X2) >= 2)
# ID X1 X2
#3 3 3 1
#4 4 4 1
#6 6 6 20
#7 7 7 20
#8 8 9 20
#9 9 10 20
Or using index
subset(dataset, abs(dataset[[2]] - dataset[[3]]) >= 2)
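The attempt in the question appears to fail for two reasons: the condition was placed after the comma, where it selects columns rather than rows, and it subtracted one column from the whole data frame. A base R sketch of the same selection as subset(), with the row condition placed before the comma (keep and dataset2 are just illustrative names):

```r
dataset <- data.frame(
  ID = 1:9,
  X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L),
  X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)
)

# One TRUE/FALSE per row: is |X1 - X2| at least 2?
keep <- abs(dataset$X1 - dataset$X2) >= 2

# Rows are indexed before the comma; an empty column slot keeps all columns.
dataset2 <- dataset[keep, ]
dataset2$ID  # 3 4 6 7 8 9
```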
Consider the sample data
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 8L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 0L, 1L, 0L, 0L)
),
.Names = c("id", "A", "B"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id (stored in column 1) has a varying number of entries for columns A and B. In the example data, there are four observations with id = 1. I am looking for a way to subset this data in R so that there are at most 3 entries for each id, and then create another column (labelled C) that holds the running order of the entries within each id. The expected output would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 1L, 0L, 0L),
C = c(1L, 2L, 3L, 1L, 2L, 1L)
),
.Names = c("id", "A", "B","C"),
class = "data.frame",
row.names = c(NA,-6L)
)
Your help is much appreciated.
Like this?
library(data.table)
dt <- as.data.table(df)
dt[, C := seq(.N), by = id]
dt <- dt[C <= 3,]
dt
# id A B C
# 1: 1 20 1 1
# 2: 1 12 1 2
# 3: 1 13 0 3
# 4: 2 11 1 1
# 5: 2 21 0 2
# 6: 3 17 0 1
Here is one option with dplyr, considering the top 3 values based on A (based on the comments of @Ronak Shah).
library(dplyr)
df %>%
group_by(id) %>%
top_n(n = 3, wt = A) %>% # top 3 values based on A
mutate(C = rank(id, ties.method = "first")) # C consists of the order of each id
# A tibble: 6 x 4
# Groups: id [3]
# id A B C
# <int> <int> <int> <int>
# 1 1 20 1 1
# 2 1 12 1 2
# 3 1 13 0 3
# 4 2 11 1 1
# 5 2 21 0 2
# 6 3 17 0 1
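The two answers are not interchangeable: the data.table version keeps the first three rows of each id in their original order, while top_n keeps the three largest values of A. They agree on the sample data only because the dropped row (A = 8) is both the fourth row and the smallest A for id 1. For the "first three rows" reading, a base R sketch without extra packages could look like this (res is an illustrative name):

```r
df <- data.frame(
  id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
  A  = c(20L, 12L, 13L, 8L, 11L, 21L, 17L),
  B  = c(1L, 1L, 0L, 0L, 1L, 0L, 0L)
)

# Running counter within each id: 1, 2, 3, ... in row order.
df$C <- ave(seq_along(df$id), df$id, FUN = seq_along)

# Keep at most the first three rows per id.
res <- df[df$C <= 3, ]
```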
I have data like below
df<- structure(list(data1 = c(0.013818378, 0.014362551, 0.014647562,
0.0136627, 0.015510173, 0.006818502, 0.006683564, 0.006655434,
0.006691479, 0.00666666, 0.014507653, 0.017446481, 0.014021427,
0.013963069, 0.020706391, 0.007104358, 0.006809539, 0.006680631,
0.009059533, 0.006681197, 0.015691738, 0.016709763, 0.015761994,
0.016062111, 0.015917196, 0.006816436, 0.006809539, 0.006680631,
0.009059533, 0.006681197), data2 = c(0.045378058, 0.041371486,
0.046058451, 0.040479177, 0.051143336, 0.016131932, 0.014399847,
0.014950329, 0.016408355, 0.015886182, 0.046151342, 0.05265521,
0.046046663, 0.040515428, 0.086865434, 0.019222881, 0.016926183,
0.016703444, 0.081352865, 0.132841645, 0.051641343, 0.059851738,
0.04830957, 0.047550067, 0.049228835, 0.015154055, 0.016926183,
0.016703444, 0.081352865, 0.132841645), time = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L
), place = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L), .Label = c("B02", "B03", "B04", "B05",
"B06", "C02", "C03", "C04", "C05", "C06"), class = "factor")), .Names = c("data1",
"data2", "time", "place"), class = "data.frame", row.names = c(NA,
-30L))
It contains several datasets, distinguishable by time.
I am trying to separate them and reorganise them into separate data frames.
Each column except time and place is one dataset that needs to be reorganised,
for example, for data1 at time 1:
B 0.013818378 0.014362551 0.014647562 0.0136627 0.015510173
C 0.006818502 0.006683564 0.006655434 0.006691479 0.00666666
data1 at time 10:
B 0.014507653 0.017446481 0.014021427 0.013963069 0.020706391
C 0.007104358 0.006809539 0.006680631 0.009059533 0.006681197
and so on.
We separate the 'place' column into two columns by splitting between the letter and digits, and spread into 'wide' format
library(dplyr)
library(tidyr)
df %>%
separate(place, into = c("grp", "number"), "(?<=[A-Z])(?=[0-9])") %>%
select(-data2) %>%
spread(number, data1)
# time grp 02 03 04 05 06
#1 1 B 0.013818378 0.014362551 0.014647562 0.013662700 0.015510173
#2 1 C 0.006818502 0.006683564 0.006655434 0.006691479 0.006666660
#3 10 B 0.014507653 0.017446481 0.014021427 0.013963069 0.020706391
#4 10 C 0.007104358 0.006809539 0.006680631 0.009059533 0.006681197
#5 17 B 0.015691738 0.016709763 0.015761994 0.016062111 0.015917196
#6 17 C 0.006816436 0.006809539 0.006680631 0.009059533 0.006681197
If we want a list of datasets for both 'data1' and 'data2'
nm1 <- grep("data", names(df), value = TRUE)
nm1 %>%
purrr::map(~ df %>%
select(-one_of(nm1), .x) %>%
separate(place, into = c("grp", "number"), "(?<=[A-Z])(?=[0-9])") %>%
spread(number, .x) )
#[[1]]
# time grp 02 03 04 05 06
#1 1 B 0.013818378 0.014362551 0.014647562 0.013662700 0.015510173
#2 1 C 0.006818502 0.006683564 0.006655434 0.006691479 0.006666660
#3 10 B 0.014507653 0.017446481 0.014021427 0.013963069 0.020706391
#4 10 C 0.007104358 0.006809539 0.006680631 0.009059533 0.006681197
#5 17 B 0.015691738 0.016709763 0.015761994 0.016062111 0.015917196
#6 17 C 0.006816436 0.006809539 0.006680631 0.009059533 0.006681197
#[[2]]
# time grp 02 03 04 05 06
#1 1 B 0.04537806 0.04137149 0.04605845 0.04047918 0.05114334
#2 1 C 0.01613193 0.01439985 0.01495033 0.01640835 0.01588618
#3 10 B 0.04615134 0.05265521 0.04604666 0.04051543 0.08686543
#4 10 C 0.01922288 0.01692618 0.01670344 0.08135286 0.13284165
#5 17 B 0.05164134 0.05985174 0.04830957 0.04755007 0.04922883
#6 17 C 0.01515405 0.01692618 0.01670344 0.08135286 0.13284165
It is not clear what the output should look like when we have multiple value columns. The dcast from data.table can deal with multiple value.var columns
library(data.table)
setDT(df)[, c("grp", "number") := tstrsplit(place, "(?<=[A-Z])(?=[0-9])", perl = TRUE)]
dcast(df, grp + time ~ number, value.var = c("data1", "data2"))
It is somewhat unclear from your question, but I think that this is what you want:
library(tidyverse)
df %>%
mutate(
column = str_extract(place, "[0-9]+"),
place = str_extract(place, "[A-Z]")
) %>%
gather(data1, data2, key = "data", value = "val") %>%
spread(column, val) %>%
split(f = .$data)
Which produces the following format:
$data1
time place data 02 03 04 05 06
1 1 B data1 0.013818378 0.014362551 0.014647562 0.013662700 0.015510173
3 1 C data1 0.006818502 0.006683564 0.006655434 0.006691479 0.006666660
5 10 B data1 0.014507653 0.017446481 0.014021427 0.013963069 0.020706391
7 10 C data1 0.007104358 0.006809539 0.006680631 0.009059533 0.006681197
9 17 B data1 0.015691738 0.016709763 0.015761994 0.016062111 0.015917196
11 17 C data1 0.006816436 0.006809539 0.006680631 0.009059533 0.006681197
$data2
time place data 02 03 04 05 06
2 1 B data2 0.04537806 0.04137149 0.04605845 0.04047918 0.05114334
4 1 C data2 0.01613193 0.01439985 0.01495033 0.01640835 0.01588618
6 10 B data2 0.04615134 0.05265521 0.04604666 0.04051543 0.08686543
8 10 C data2 0.01922288 0.01692618 0.01670344 0.08135286 0.13284165
10 17 B data2 0.05164134 0.05985174 0.04830957 0.04755007 0.04922883
12 17 C data2 0.01515405 0.01692618 0.01670344 0.08135286 0.13284165
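If the goal is literally one small table per time point, as in the example in the question, the same idea can be sketched in base R: split the place code into its letter and digit parts, then build a letter-by-number table for each time. The values below are invented placeholders, not the data from the question, and toy and by_time are illustrative names.

```r
# Placeholder data with the same shape as the question (values are made up).
toy <- data.frame(
  data1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  time  = rep(c(1L, 10L), each = 4),
  place = rep(c("B02", "B03", "C02", "C03"), times = 2)
)

toy$grp    <- sub("[0-9]+$", "", toy$place)  # leading letter(s): "B", "C"
toy$number <- sub("^[A-Z]+", "", toy$place)  # trailing digits: "02", "03"

# One grp-by-number matrix per time point. Each cell holds a single value,
# so mean() simply returns that value.
by_time <- lapply(split(toy, toy$time), function(d) {
  tapply(d$data1, list(grp = d$grp, number = d$number), mean)
})

by_time[["1"]]  # rows B and C, columns 02 and 03, for time 1
```

The same lapply can be repeated for data2, or wrapped in an outer loop over the value columns, to get one list of tables per dataset.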
My data looks like this:
Group Feature_A Feature_B Feature_C Feature_D
1 1 0 3 2 4
2 1 5 2 2 8
3 1 9 8 6 5
4 2 5 7 8 8
5 2 2 6 8 1
6 2 3 8 6 4
7 3 1 5 3 5
8 3 1 4 3 4
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), Feature_A = c(0L,
5L, 9L, 5L, 2L, 3L, 1L, 1L), Feature_B = c(3L, 2L, 8L, 7L, 6L,
8L, 5L, 4L), Feature_C = c(2L, 2L, 6L, 8L, 8L, 6L, 3L, 3L), Feature_D = c(4L,
8L, 5L, 8L, 1L, 4L, 5L, 4L)), .Names = c("Group", "Feature_A",
"Feature_B", "Feature_C", "Feature_D"), class = "data.frame", row.names = c(NA,
-8L))
For every Feature I want to generate a plot (e.g., a boxplot) that would highlight the differences between Groups.
# Get unique Feature and Group
Features<-unique(colnames(df[,-1]))
Group<-unique(colnames(df$Group))
But how can I do the rest?
Pseudo-code might look like this:
Select Feature from Data
Split Data according Group
Boxplot
for (i in 1:levels(df$Features)){
for (o in 1:length(Group)){
}}
How can I achieve this? Hope someone can help me.
I would put my data in the long format. Then, using ggplot2, you can do some nice things.
library(reshape2)
library(ggplot2)
library(gridExtra)
## long format using Group as id
dat.m <- melt(df, id.vars = 'Group')  # df is the data frame from the question
## bar plot
p1 <- ggplot(dat.m) +
geom_bar(aes(x=Group,y=value,fill=variable),stat='identity')
## box plot
p2 <- ggplot(dat.m) +
geom_boxplot(aes(x=factor(Group),y=value,fill=variable))
## aggregate the 2 plots
grid.arrange(p1,p2)
This is easy to do. I do this all the time.
The code below will generate the charts using ggplot and save them as ch_Feature_A ....
You can wrap the answer in a pdf statement to send them to a PDF as well.
library(ggplot2)
df$Group <- as.factor(df$Group)
for (i in 2:dim(df)[2]) {
ch <- ggplot(df,aes_string(x="Group",y=names(df)[i],fill="Group"))+geom_boxplot()
assign(paste0("ch_",names(df)[i]),ch)
}
Or, even simpler, if you do not want separate charts:
library(reshape2)
df1 <- melt(df)
ggplot(df1,aes(x=Group,y=value,fill=Group))+geom_boxplot()+facet_grid(.~variable)
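As a sketch of the "wrap it in a pdf statement" suggestion, the loop below writes one boxplot per feature to a single PDF, one page each. It swaps ggplot for base graphics to stay dependency-free; with ggplot objects inside a loop you would need to print() each plot explicitly. The output path is a temporary file chosen for illustration.

```r
df <- data.frame(
  Group     = factor(c(1, 1, 1, 2, 2, 2, 3, 3)),
  Feature_A = c(0, 5, 9, 5, 2, 3, 1, 1),
  Feature_B = c(3, 2, 8, 7, 6, 8, 5, 4),
  Feature_C = c(2, 2, 6, 8, 8, 6, 3, 3),
  Feature_D = c(4, 8, 5, 8, 1, 4, 5, 4)
)

features <- setdiff(names(df), "Group")
out_file <- tempfile(fileext = ".pdf")  # illustrative output location

pdf(out_file)  # open the PDF device; each plot becomes one page
for (f in features) {
  # Build the formula Feature_X ~ Group from the column name.
  boxplot(reformulate("Group", response = f), data = df,
          xlab = "Group", ylab = f, main = f)
}
dev.off()  # close the device so the file is written
```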