Assign established values randomly by ID in R

I have this file:
ID P
1 10
1 12
1 11
2 9
2 8
2 10
3 11
3 12
3 14
4 15
4 16
4 8
5 11
5 13
5 10
6 14
6 16
6 11
I would like to randomly assign the values a, b and c to the rows of this file, like this:
ID P Group
1 10 a
1 12 b
1 11 c
2 9 c
2 8 a
2 10 b
3 11 a
3 12 c
3 14 b
4 15 c
4 16 a
4 8 b
5 11 b
5 13 c
5 10 a
6 14 b
6 16 c
6 11 a
I need to do this several times, with a new random assignment each time. I tried this:
df %>% group_by(ID) %>% replicate(1,sample(df$group))
but, of course, it didn't work. Any suggestions?

Here is an option with sample
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Group = sample(c('a', 'b', 'c'), n(), replace = TRUE))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), P = c(10L, 12L, 11L, 9L, 8L,
10L, 11L, 12L, 14L, 15L, 16L, 8L, 11L, 13L, 10L, 14L, 16L, 11L
)), class = "data.frame", row.names = c(NA, -18L))
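The desired output in the question appears to give every ID each label exactly once, whereas `replace = TRUE` can repeat a label within an ID. Below is a minimal base R sketch of the without-replacement variant, assuming no ID has more rows than there are labels. The helper name `assign_groups` and the variable `group_labels` are hypothetical, introduced only for illustration.

```r
# Data from the question
df1 <- data.frame(
  ID = rep(1:6, each = 3),
  P  = c(10, 12, 11, 9, 8, 10, 11, 12, 14, 15, 16, 8, 11, 13, 10, 14, 16, 11)
)
group_labels <- c("a", "b", "c")

# Hypothetical helper: shuffle the labels independently within each ID.
# sample() without replacement fails if an ID has more rows than labels.
assign_groups <- function(d, group_labels) {
  d$Group <- ave(as.character(d$ID), d$ID,
                 FUN = function(x) sample(group_labels, length(x)))
  d
}

set.seed(1)  # only for reproducibility; drop it to get a fresh draw each run
one_draw   <- assign_groups(df1, group_labels)

# "Several times, every time randomly": a list of independent draws
many_draws <- replicate(5, assign_groups(df1, group_labels), simplify = FALSE)
```

If repeated labels within an ID are acceptable, the `replace = TRUE` version above is enough.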

Two solutions, one without grouping and one with grouping by ID:
library(tidyverse)
df <- dplyr::tribble(
~ID, ~P,
1,10,
1,12,
1,11,
2,9,
2,8,
2,10,
3,11,
3,12,
3,14,
4,15,
4,16,
4,8,
5,11,
5,13,
5,10,
6,14,
6,16,
6,11
)
sample_vector <- c("a","b","c")
##Without grouping id
df_2 <- df %>%
mutate(Group = sample(sample_vector, nrow(df), replace = TRUE))
##With grouping by ID
df_2 <- df %>% group_by(ID) %>%
mutate(Group = sample(sample_vector, n(), replace = TRUE))

Related

Choose rows in which the absolute value of a difference is at least a specified value

Let's say I have this dataframe:
ID X1 X2
1 1 2
2 2 1
3 3 1
4 4 1
5 5 5
6 6 20
7 7 20
8 9 20
9 10 20
dataset <- structure(list(ID = 1:9, X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L,
10L), X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-9L))
I want to select the rows in which the absolute value of the difference between columns X1 and X2 is greater than or equal to 2.
For example, for row 4 the difference is 4 - 1 = 3, so it should be selected.
For row 9 the difference is 10 - 20 = -10, whose absolute value is 10, so it should be selected too.
In this case, that would be rows 3, 4, 6, 7, 8 and 9.
I tried:
dataset2 = dataset[,abs(dataset- c(dataset[,2])) > 2]
But I get an error.
The operation:
abs(dataset- c(dataset[,2])) > 2
does flag values greater than 2, but the result only works for my second column and does not select the rows properly.
We can take the difference between the 'X1' and 'X2' columns and use it as a logical condition in subset to select the rows:
subset(dataset, abs(X1 - X2) >= 2)
# ID X1 X2
#3 3 3 1
#4 4 4 1
#6 6 6 20
#7 7 7 20
#8 8 9 20
#9 9 10 20
Or using index
subset(dataset, abs(dataset[[2]] - dataset[[3]]) >= 2)
data
dataset <- structure(list(ID = 1:9, X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L,
10L), X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-9L))
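For comparison, here is a sketch of the same filter with plain logical indexing. The attempt in the question seems to fail for two reasons: the condition sits in the column position (after the comma), and a single column is subtracted from the whole data frame. The variable name `keep` is only illustrative.

```r
# Data from the question
dataset <- data.frame(
  ID = 1:9,
  X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L),
  X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)
)

# One TRUE/FALSE per row: is |X1 - X2| at least 2?
keep <- abs(dataset$X1 - dataset$X2) >= 2

# The logical vector goes in the row position (before the comma)
dataset2 <- dataset[keep, ]
```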

How do I replace the fourth row of values in a dataframe with a corresponding vector in R?

I have a set of values
col1|col2|col3|col4
5 10 15 20
2 4 6 8
3 6 9 12
4 3 7 15
I would like to replace row 4 with a vector
c(4,8,12,16)
I would like to insert the vector in row 4 and replace the original values. I tried this script:
df[[4]]<- vector_name
I expect the result
col1|col2|col3|col4
5 10 15 20
2 4 6 8
3 6 9 12
4 8 12 16
We can use replace with a two-column matrix of (row, column) indices that targets every cell in the last row:
replace(df1, cbind(nrow(df1), seq_along(df1)), v1)
data
df1 <- structure(list(col1 = c(5L, 2L, 3L, 4L), col2 = c(10L, 4L, 6L,
3L), col3 = c(15L, 6L, 9L, 7L), col4 = c(20L, 8L, 12L, 15L)),
class = "data.frame", row.names = c(NA,
-4L))
v1 <- c(4, 8, 12, 16)
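A simpler alternative, sketched under the assumption that the vector has one value per column, is direct row assignment. Note that `df[[4]] <- ...` from the question replaces column 4, not row 4. The copy `df2` is only there to leave `df1` untouched.

```r
df1 <- data.frame(
  col1 = c(5L, 2L, 3L, 4L),
  col2 = c(10L, 4L, 6L, 3L),
  col3 = c(15L, 6L, 9L, 7L),
  col4 = c(20L, 8L, 12L, 15L)
)
v1 <- c(4, 8, 12, 16)

df2 <- df1
# Row index before the comma, empty column index: assign across all columns
df2[4, ] <- v1
```

Because `v1` is a double vector, the integer columns become numeric after the assignment.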

How can I reorganize data based on two columns

I have data like this:
df<- structure(list(data1 = c(0.013818378, 0.014362551, 0.014647562,
0.0136627, 0.015510173, 0.006818502, 0.006683564, 0.006655434,
0.006691479, 0.00666666, 0.014507653, 0.017446481, 0.014021427,
0.013963069, 0.020706391, 0.007104358, 0.006809539, 0.006680631,
0.009059533, 0.006681197, 0.015691738, 0.016709763, 0.015761994,
0.016062111, 0.015917196, 0.006816436, 0.006809539, 0.006680631,
0.009059533, 0.006681197), data2 = c(0.045378058, 0.041371486,
0.046058451, 0.040479177, 0.051143336, 0.016131932, 0.014399847,
0.014950329, 0.016408355, 0.015886182, 0.046151342, 0.05265521,
0.046046663, 0.040515428, 0.086865434, 0.019222881, 0.016926183,
0.016703444, 0.081352865, 0.132841645, 0.051641343, 0.059851738,
0.04830957, 0.047550067, 0.049228835, 0.015154055, 0.016926183,
0.016703444, 0.081352865, 0.132841645), time = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L
), place = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L), .Label = c("B02", "B03", "B04", "B05",
"B06", "C02", "C03", "C04", "C05", "C06"), class = "factor")), .Names = c("data1",
"data2", "time", "place"), class = "data.frame", row.names = c(NA,
-30L))
It contains several datasets, which are distinguished by time.
I am trying to separate them and reorganize them into several data frames.
Each column except time and place is one dataset that needs to be organized.
For example, for data1 at time 1:
B 0.013818378 0.014362551 0.014647562 0.0136627 0.015510173
C 0.006818502 0.006683564 0.006655434 0.006691479 0.00666666
and for data1 at time 10:
B 0.014507653 0.017446481 0.014021427 0.013963069 0.020706391
C 0.007104358 0.006809539 0.006680631 0.009059533 0.006681197
and so on.
We separate the 'place' column into two columns by splitting between the letter and digits, and spread into 'wide' format
library(dplyr)
library(tidyr)
df %>%
separate(place, into = c("grp", "number"), "(?<=[A-Z])(?=[0-9])") %>%
select(-data2) %>%
spread(number, data1)
# time grp 02 03 04 05 06
#1 1 B 0.013818378 0.014362551 0.014647562 0.013662700 0.015510173
#2 1 C 0.006818502 0.006683564 0.006655434 0.006691479 0.006666660
#3 10 B 0.014507653 0.017446481 0.014021427 0.013963069 0.020706391
#4 10 C 0.007104358 0.006809539 0.006680631 0.009059533 0.006681197
#5 17 B 0.015691738 0.016709763 0.015761994 0.016062111 0.015917196
#6 17 C 0.006816436 0.006809539 0.006680631 0.009059533 0.006681197
If we want a list of datasets, one each for 'data1' and 'data2':
nm1 <- grep("data", names(df), value = TRUE)
nm1 %>%
purrr::map(~ df %>%
select(-one_of(nm1), .x) %>%
separate(place, into = c("grp", "number"), "(?<=[A-Z])(?=[0-9])") %>%
spread(number, .x) )
#[[1]]
# time grp 02 03 04 05 06
#1 1 B 0.013818378 0.014362551 0.014647562 0.013662700 0.015510173
#2 1 C 0.006818502 0.006683564 0.006655434 0.006691479 0.006666660
#3 10 B 0.014507653 0.017446481 0.014021427 0.013963069 0.020706391
#4 10 C 0.007104358 0.006809539 0.006680631 0.009059533 0.006681197
#5 17 B 0.015691738 0.016709763 0.015761994 0.016062111 0.015917196
#6 17 C 0.006816436 0.006809539 0.006680631 0.009059533 0.006681197
#[[2]]
# time grp 02 03 04 05 06
#1 1 B 0.04537806 0.04137149 0.04605845 0.04047918 0.05114334
#2 1 C 0.01613193 0.01439985 0.01495033 0.01640835 0.01588618
#3 10 B 0.04615134 0.05265521 0.04604666 0.04051543 0.08686543
#4 10 C 0.01922288 0.01692618 0.01670344 0.08135286 0.13284165
#5 17 B 0.05164134 0.05985174 0.04830957 0.04755007 0.04922883
#6 17 C 0.01515405 0.01692618 0.01670344 0.08135286 0.13284165
It is not clear how the output should look when there are multiple value columns. The dcast function from data.table can handle multiple value.var columns:
library(data.table)
setDT(df)[, c("grp", "number") := tstrsplit(place, "(?<=[A-Z])(?=[0-9])", perl = TRUE)]
dcast(df, grp + time ~ number, value.var = c("data1", "data2"))
It is somewhat unclear from your question, but I think that this is what you want:
library(tidyverse)
df %>%
mutate(
column = str_extract(place, "[0-9]+"),
place = str_extract(place, "[A-Z]")
) %>%
gather(data1, data2, key = "data", value = "val") %>%
spread(column, val) %>%
split(f = .$data)
Which produces the following format:
$data1
time place data 02 03 04 05 06
1 1 B data1 0.013818378 0.014362551 0.014647562 0.013662700 0.015510173
3 1 C data1 0.006818502 0.006683564 0.006655434 0.006691479 0.006666660
5 10 B data1 0.014507653 0.017446481 0.014021427 0.013963069 0.020706391
7 10 C data1 0.007104358 0.006809539 0.006680631 0.009059533 0.006681197
9 17 B data1 0.015691738 0.016709763 0.015761994 0.016062111 0.015917196
11 17 C data1 0.006816436 0.006809539 0.006680631 0.009059533 0.006681197
$data2
time place data 02 03 04 05 06
2 1 B data2 0.04537806 0.04137149 0.04605845 0.04047918 0.05114334
4 1 C data2 0.01613193 0.01439985 0.01495033 0.01640835 0.01588618
6 10 B data2 0.04615134 0.05265521 0.04604666 0.04051543 0.08686543
8 10 C data2 0.01922288 0.01692618 0.01670344 0.08135286 0.13284165
10 17 B data2 0.05164134 0.05985174 0.04830957 0.04755007 0.04922883
12 17 C data2 0.01515405 0.01692618 0.01670344 0.08135286 0.13284165
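If one separate table per time point is wanted, as the question's examples suggest, a base R sketch could look like the following. It uses only the data1 values for times 1 and 10 from the question, and it assumes place codes always consist of letters followed by digits. The names `grp`, `number` and `by_time` are illustrative.

```r
# data1 values for time 1 and time 10, copied from the question
df <- data.frame(
  data1 = c(0.013818378, 0.014362551, 0.014647562, 0.0136627, 0.015510173,
            0.006818502, 0.006683564, 0.006655434, 0.006691479, 0.00666666,
            0.014507653, 0.017446481, 0.014021427, 0.013963069, 0.020706391,
            0.007104358, 0.006809539, 0.006680631, 0.009059533, 0.006681197),
  time  = rep(c(1L, 10L), each = 10),
  place = rep(c("B02", "B03", "B04", "B05", "B06",
                "C02", "C03", "C04", "C05", "C06"), times = 2)
)

# Split "B02" into the letter part ("B") and the digit part ("02")
df$grp    <- sub("[0-9]+$", "", df$place)
df$number <- sub("^[A-Z]+", "", df$place)

# One grp-by-number matrix per time point; each cell holds a single value
by_time <- lapply(split(df, df$time), function(d) {
  tapply(d$data1, list(d$grp, d$number), function(v) v[1])
})

by_time[["1"]]   # rows B and C, columns 02 to 06, for time 1
```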

How can I add a partially related number to my data

I am trying to add to each row the number that belongs to its data label.
My data looks like this:
df <- structure(list(data = structure(c(1L, 1L, 1L, 1L, 1L, 3L, 3L,
4L, 4L, 5L, 5L, 6L, 5L, 7L, 7L, 8L, 8L, 2L, 2L, 2L), .Label = c("data1",
"data10", "data2", "data3", "data4", "data5", "data6", "data7"
), class = "factor"), values = structure(c(3L, 8L, 18L, 1L, 15L,
17L, 19L, 7L, 2L, 2L, 11L, 10L, 6L, 4L, 9L, 12L, 14L, 5L, 13L,
16L), .Label = c("112864.443", "11319531", "12874.443", "142983324",
"1612410048", "16349475.63", "184901841", "2223793.8", "30553282.01",
"312004.547", "3135868.44", "317403612.9", "3686081.063", "43701608",
"623793.8", "64959501.42", "67666215", "767666215", "775987137.8"
), class = "factor")), .Names = c("data", "values"), class = "data.frame", row.names = c(NA,
-20L))
I want to have the exact number from the first column next to each row. Since the numbers are not consecutive, I don't know how to add them as a separate column. The desired output should look like this:
data values
data1 12874.443 1
data1 2223793.8 1
data1 767666215 1
data1 112864.443 1
data1 623793.8 1
data2 67666215 2
data2 775987137.8 2
data3 184901841 3
data3 11319531 3
data4 11319531 4
data4 3135868.44 4
data5 312004.547 5
data4 16349475.63 4
data6 142983324 6
data6 30553282.01 6
data7 317403612.9 7
data7 43701608 7
data10 1612410048 10
data10 3686081.063 10
data10 64959501.42 10
One way is to use gsub to strip everything that is not a digit and add the result as another column:
df$label <- gsub("[^[:digit:]]", "", df$data)
Another way is to use str_extract, thanks to this question: "R: split character data into numbers and letters".
library(stringr)
df$label <- as.numeric(str_extract(df$data, "[0-9]+"))
> df
# data values label
# 1 data1 12874.443 1
# 2 data1 2223793.8 1
# 3 data1 767666215 1
# 4 data1 112864.443 1
# 5 data1 623793.8 1
# 6 data2 67666215 2
# 7 data2 775987137.8 2
# 8 data3 184901841 3
# 9 data3 11319531 3
# 10 data4 11319531 4
# 11 data4 3135868.44 4
# 12 data5 312004.547 5
# 13 data4 16349475.63 4
# 14 data6 142983324 6
# 15 data6 30553282.01 6
# 16 data7 317403612.9 7
# 17 data7 43701608 7
# 18 data10 1612410048 10
# 19 data10 3686081.063 10
# 20 data10 64959501.42 10
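One detail worth noting: `gsub` returns character strings, so the gsub version gives a character column unless it is converted. A small sketch with a few of the labels from the question, where `data_names` and `label` are illustrative names:

```r
# A factor, as in the question's data frame
data_names <- factor(c("data1", "data2", "data10"))

# Drop every non-digit character, then convert the remainder to integer
label <- as.integer(gsub("[^[:digit:]]", "", data_names))
```

With an integer column, sorting puts 10 after 2, whereas character labels would sort "10" before "2".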

R : ddply and return string

I have a dataframe like this:
id col1
1 1 1
2 2 2
3 3 3
4 4 4
5 5 1
6 1 2
7 2 3
8 3 4
I would like to group by id's then create a string that contains the values in col1 separated by a space and in descending value.
I first order the data frame by id and col1 but am unable to get the output from ddply as a string with no quotes.
df111 <- df111[order(df111$id, -df111$col1),]
df222 <- ddply(df111, .(id), function(col1) as.character(paste0(col1,sep = ' ')))
id V1 V2
1 1 c(1, 1, 1, 1) c(0.793507214868441, 0.539258575299755, 0.165128685068339, 0.153290810529143)
2 2 c(2, 2, 2, 2) c(0.872032727580518, 0.827515688957646, 0.236087603960186, 0.165240615839139)
3 3 c(3, 3, 3, 3) c(0.759382889838889, 0.484359077410772, 0.182580581633374, 0.0723447729833424)
4 4 c(4, 4, 4, 4) c(0.874859027564526, 0.642130059422925, 0.0569298807531595, 0.0227038362063468)
5 5 c(5, 5, 5, 5) c(0.392553070792928, 0.386064056074247, 0.299609177513048, 0.222290486795828)
I'd like something like this:
id V1
1 1 0.793507214868441 0.539258575299755 0.165128685068339 0.153290810529143
Any suggestions?
EDIT:
> dput(df111)
structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L), col1 = c(0.793507214868441,
0.539258575299755, 0.165128685068339, 0.153290810529143, 0.872032727580518,
0.827515688957646, 0.236087603960186, 0.165240615839139, 0.759382889838889,
0.484359077410772, 0.182580581633374, 0.0723447729833424, 0.874859027564526,
0.642130059422925, 0.0569298807531595, 0.0227038362063468, 0.392553070792928,
0.386064056074247, 0.299609177513048, 0.222290486795828)), .Names = c("id",
"col1"), row.names = c(1L, 11L, 16L, 6L, 7L, 12L, 17L, 2L, 18L,
13L, 8L, 3L, 14L, 9L, 19L, 4L, 20L, 10L, 5L, 15L), class = "data.frame")
I think maybe you just need to use summarise rather than a custom anonymous function...?
library(plyr)
dat <- read.table(text = "id col1
1 1 1
2 2 2
3 3 3
4 4 4
5 5 1
6 1 2
7 2 3
8 3 4",header = TRUE,sep = "")
> ddply(dat,.(id),summarise,val = paste(sort(col1,decreasing = TRUE),collapse = " "))
id val
1 1 2 1
2 2 3 2
3 3 4 3
4 4 4
5 5 1
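The same result can be sketched without plyr by using base R's `aggregate`. This swaps in a different function from the one the answer uses, and the result column keeps the name `col1` rather than `val`.

```r
# Same example data as in the answer above
dat <- data.frame(
  id   = c(1, 2, 3, 4, 5, 1, 2, 3),
  col1 = c(1, 2, 3, 4, 1, 2, 3, 4)
)

# For each id: sort col1 in descending order and join with single spaces
res <- aggregate(
  col1 ~ id,
  data = dat,
  FUN  = function(x) paste(sort(x, decreasing = TRUE), collapse = " ")
)
```

Here `res$col1` is a character vector with one space-separated string per id.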
