Random sampling only a subset of data in R - r

I have a dataset (N of 2794) of which I want to extract a subset, randomly reallocate the class and put it back into the dataframe.
Example
| Index | B | C | Class|
| 1 | 3 | 4 | Dog |
| 2 | 1 | 9 | Cat |
| 3 | 9 | 1 | Dog |
| 4 | 1 | 1 | Cat |
From the above example, I want to random take N number of observations from column 'Class' and mix them up so you get something like this..
| Index | B | C | Class|
| 1 | 3 | 4 | Cat | Re-sampled
| 2 | 1 | 9 | Dog | Re-sampled
| 3 | 9 | 1 | Dog |
| 4 | 1 | 1 | Dog | Re-sampled
This code randomly extracts rows and re samples them, but I don't want to extract the rows. I want to keep them in the dataframe.
sample(Class[sample(nrow(Class),N),])

Suppose df is your data frame:
df <- data.frame(index=1:4, B=c(3,1,9,1), C=c(4,9,1,1), Class=c("Dog", "Cat", "Dog", "Cat"))
Would this do what you want?
dfSamp <- sample(1:nrow(df), N)
df$Class[dfSamp] <- sample(df$Class[dfSamp])

I simulated the data frame and did an example:
df <- data.frame(
ID=1:4,
Class=c('Dog', 'Cat', 'Dog', 'Cat')
)
N <- 2
sample_ids <- sample(nrow(df), N)
df$Class[sample_ids] <- sample(df$Class, length(sample_ids))

Assuming Class is how you named your datafame, you could do this:
library(dplyr)
bind_rows(
Class %>%
mutate(origin = 'not_sampled'),
Class %>%
sample(100, replace = TRUE) %>%
mutate(origin = 'sampled'))
Sample 100 observations of the original dataframe and stack them to the bottom of it. I am also adding a column so that you know if the observation was sampled or present in the dataframe from the beginning.

What you're wanting to do is replace in-line some classes, but not others.
So, if we start with a data frame, df
set.seed(100)
df = data.frame(index = 1:100,
B = sample(1:10,100,replace = T),
C = sample(1:10,100,replace = T),
Class = sample(c('Cat','Dog','Bunny'),100,replace = T))
And you want to update 5 random rows, then we need to pick which rows to update and what new classes to put in those rows. By referencing unique(df$class) you don't weight the classes by their current occurrence. You could adjust this with the weight argument or remove unique to use occurrence as weight.
n_rows = 5
rows_to_update = sample(1:100,n_rows,replace = F)
new_classes = sample(unique(df$Class),n_rows,replace = T)
rows_to_update
#> [1] 85 65 94 60 48
new_classes
#> [1] "Bunny" "Dog" "Dog" "Dog" "Bunny"
We can inspect what the original data looked like
df[rows_to_update,]
#> index B C Class
#> 85 85 1 2 Dog
#> 65 65 5 1 Bunny
#> 94 94 5 10 Dog
#> 60 60 3 7 Bunny
#> 48 48 9 1 Cat
We can update this in place with a reference to the column and the rows to update.
df$Class[rows_to_update] = new_classes
df[rows_to_update,]
#> index B C Class
#> 85 85 1 2 Bunny
#> 65 65 5 1 Dog
#> 94 94 5 10 Dog
#> 60 60 3 7 Dog
#> 48 48 9 1 Bunny

Related

How can I apply a calculation to a dataframe when a value is not equal to a specified string?

Say I have the following table and only want to sum when the column does not contain "Fred". If the value does contain "Fred" I want to sum and multiply by 2. How can I achieve this preferably using dplyr?
Name | x | Expected Value |
Jack | 15 | 17 |
Sally | 3 | 3 |
Fred | 10 | 20 |
Jack | 2 | 17 |
Obviously this is not a serious table, but in essence, I just want to know how to apply a different basic calculation for a specified unique value in a large dataset.
in Base R:
transform(df, expect = ave((1 + (Name == 'Fred')) * x, Name, FUN = sum))
Name x expect
1 Jack 15 17
2 Sally 3 3
3 Fred 10 20
4 Jack 2 17
Consider using tidyverse:
library(tidyverse)
df %>%
group_by(Name) %>%
mutate(expect = sum((1 + (Name == 'Fred')) * x))
Name x expect
<chr> <int> <dbl>
1 Jack 15 17
2 Sally 3 3
3 Fred 10 20
4 Jack 2 17
this is a a very simple way to solve your situation:
sample_table<- data.table(
Name = c('Jack','Sally','Fred','Jack'),
x = c(15,3,10,2),
exp = c(17,3,20,17))
sample_table[Name%in%"Fred",end_value:=(x+exp)*2]
sample_table[!(Name%in%"Fred"),end_value:=(x+exp)]

Aggregate rows into new column based on common value in another column in R

I have two data frames
df1 is like this
| NOC | 2007 | 2008 |
|:---- |:------:| -----:|
| A | 100 | 5 |
| B | 100 | 5 |
| C | 100 | 5|
| D | 20 | 2 |
| E | 10 | 12 |
| F | 2 | 1 |
df2
| NOC | GROUP |
|:---- |:------:|
| A | aa|
| B | aa |
| C | aa |
| D | bb |
| E | bb |
| F | cc |
I would like to create a new df3 which will aggregate the columns 2007 and 2008 based on Group identity by assigning the sum of rows with the same group identity, so my df3 would look like this
NOC
2007
2008
GROUP
S2007
s2008
A
100
5
aa
300
15
B
100
5
aa
300
15
C
100
5
aa
300
15
D
20
2
bb
30
14
E
10
12
bb
30
14
F
2
1
cc
2
1
my codes are not very efficient, I first merged df1 with df2 by NOC, into df3
df3<-merge(df1, df2, by="NOC",all.x=TRUE)
then used dprl summarised into df4 and created s2007 and s2008
df3 %>%
group_by(GROUP) %>%
summarise(num = n(),
s2017 = sum(2007),s2018 = sum(2008))->df3
then I merged df1 with df3 again to create my final database
I am wondering two problems:
is there a more efficient way?
since my dataframe contains annual data 2007-2030, currently I am writing out the summarize function for each year, is there a faster way of summarize all the columns except NOC?
Thank you!
Before this, a small piece of advice, never name your columns in numeric, it may create you many glitches.
library(tidyverse)
df1 %>% left_join(df2, by = 'NOC') %>%
group_by(GROUP) %>%
mutate(across(c(`2007`, `2008`), ~sum(.), .names = 's.{.col}' ))
# A tibble: 6 x 6
# Groups: GROUP [3]
NOC `2007` `2008` GROUP s.2007 s.2008
<chr> <int> <int> <chr> <int> <int>
1 A 100 5 aa 300 15
2 B 100 5 aa 300 15
3 C 100 5 aa 300 15
4 D 20 2 bb 30 14
5 E 10 12 bb 30 14
6 F 2 1 cc 2 1

Generate Data Frame from Count Data

I am trying to create an unsummarized data frame from a data frame of count data.
I have had some experience creating sample datasets but I am having some trouble trying to get a specific number of rows and proportion for each state/person without coding each of them separately and then combining them. I was able to do it using the following code but I feel like there is a better way.
set.seed(2312)
dragon <- sample(c(1),3,replace=TRUE)
Maine <- sample(c("Maine"),3,replace=TRUE)
Maine1 <- data.frame(dragon, Maine)
dragon <- sample(c(0),20,replace=TRUE)
Maine <- sample(c("Maine"),20,replace=TRUE)
Maine2 <- data.frame(dragon, Maine)
Maine2
library(dplyr)
maine3 <- bind_rows(Maine1, Maine2)
Is there a better way to generate this dataset then the code above?
I am trying to create a data frame from the following count data:
+-------------+--------------+--------------+
| | # of dragons | # no dragons |
+-------------+--------------+--------------+
| Maine | 3 | 20|
| California | 1 | 10|
| Jocko | 28 | 110515 |
| Jessica Day | 17 | 26122 |
| | 14 | 19655 |
+-------------+--------------+--------------+
And I would like it to look like this:
+-----------------------+---------------+
| | Dragons (1/0) |
+-----------------------+---------------+
| Maine | 1 |
| Maine | 1 |
| Maine | 1 |
| Maine | 0 |
| Maine….(2:20) | 0…. |
| California | 1 |
| California….(2:10) | 0… |
| Ect.. | |
+-----------------------+---------------+
I do not want the code written for me but would love with ideas on function or examples that you think might be helpful.
I am not completely sure what does sampling have to do with this problem?
It looks to me like you are looking for untable.
Here is an example
data:
set.seed(1)
no_drag = sample(1:5, 5)
drag = sample(15:25, 5)
df <- data.frame(names = LETTERS[1:5],
drag,
no_drag)
names drag no_drag
1 A 24 2
2 B 25 5
3 C 20 4
4 D 23 3
5 E 15 1
library(reshape)
library(tidyverse)
df %>%
gather(key, value, 2:3) %>% #convert to long format
{untable(.,num = .$value)} %>% #untable by value column
mutate(value = ifelse(key == "drag", 0, 1)) %>% #convert values to 0/1
select(-key) %>% #remove unwanted column
arrange(names) #optional
#part of output
names value
1 A 0
2 A 0
3 A 0
4 A 0
5 A 0
6 A 0
7 A 0
8 A 0
9 A 0
10 A 0
11 A 0
12 A 0
13 A 0
14 A 0
15 A 0
16 A 0
17 A 0
18 A 0
19 A 0
20 A 0
21 A 0
22 A 0
23 A 0
24 A 0
25 A 1
26 A 1
27 B 0
28 B 0
29 B 0
30 B 0
there are other ways to tackle the problem here is one:
One is like #Frank mentioned in the comment:
df %>%
gather(key, val, 2:3) %>%
mutate(v = Map(rep, key == "drag", val)) %>%
unnest %>%
select(-key, -val)
Another:
df <- gather(df, key, value, 2:3)
df <- df[rep(seq_len(nrow(df)), df$value), 1:2]
df$key[df$key == "drag"] <- FALSE
df$key[df$key != "drag"] <- TRUE
One can use tidyr::expand to expand rows in desired format.
The solution using df used by #missuse can be shown as:
library(tidyverse)
df %>% gather(key,value,-names) %>%
mutate(key = ifelse(key=="drag", 1, 0)) %>%
group_by(names,key) %>%
expand(value = 1:value) %>%
select(names, value = key) %>%
as.data.frame()
# names value
# 1 A 0
# 2 A 0
# 3 A 1
# 4 A 1
# 5 A 1
# 6 A 1
# 7 A 1
# 8 A 1
# 9 A 1
# 10 A 1
# ...so on
# 117 E 1
# 118 E 1
# 119 E 1
# 120 E 1
# 121 E 1
# 122 E 1

Quick way of matching data between two dataframes [R]

I have two dataframes: df_workingFile and df_groupIDs
df_workingFile:
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006
df_groupIDs:
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2
For df_groupIDs, I want to get the ID and Date of the event with the max sales in that group. So group "a1" has 2 events in df_workingFile, "v" and "w". I want to identify that event "w" has the Max sales value and bring it's information into df_groupIDs. The final output should look like this:
GroupID | numIDs | MaxSales | ID | Date
a1 | 2 | 3 | w | 2010
b1 | 2 | 8 | x | 2007
c3 | 1 | 2 | z | 2006
Now here's the problem. I wrote code that does this, but it's very inefficient and takes forever to process when I deal with datasets of 50-100K rows. I need help figuring out how to rewrite my code to be more efficient. Here's what I currently have:
i = 1
for (groupID in df_groupIDs$groupID) {
groupEvents <- subset(df_workingFile, df_workingFile$groupID == groupID)
index <- match(df_groupIDs$maxSales[i], groupEvents$Sales)
df_groupIDs$ID[i] = groupEvents$ID[index]
df_groupIDs$Date[i] = groupEvents$Date[index]
i = i+1
}
Using dplyr:
library(dplyr)
df_workingFile %>%
group_by(GroupID) %>% # for each group id
arrange(desc(Sales)) %>% # sort by Sales (descending)
slice(1) %>% # keep the top row
inner_join(df_groupIDs) # join to df_groupIDs
select(GroupID, numIDs, MaxSales, ID, Date)
# keep the columns you want in the order you want
Another simpler method, if the Sales are integers (and can thus be relied on for equality testing with the MaxSales column):
inner_join(df_groupIDs, df_workingFile,
by = c("GroupID" = "GroupID", "MaxSales" = "Sales"))
This makes use of a feature that SQLite has that if max is used on a line then it automatically brings along the row that the maximum came from.
library(sqldf)
sqldf("select g.GroupID, g.numIDs, max(w.Sales) MaxSales, w.ID, w.Date
from df_groupIDs g left join df_workingFile w using(GroupID)
group by GroupID")
giving:
GroupID numIDs MaxSales ID Date
1 a1 2 3 w 2010
2 b1 2 8 x 2007
3 c3 1 2 z 2006
Note: The two input data frames shown reproducibly are:
Lines1 <- "
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006"
df_workingFile <- read.table(text = Lines1, header = TRUE, sep = "|", strip.white = TRUE)
Lines2 <- "
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2"
df_groupIDs <- read.table(text = Lines2, header = TRUE, sep = "|", strip.white = TRUE)

How to reshape a dataframe in R, when values and variables are in the column names? [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 6 years ago.
EDIT:
Upon further examination, this dataset is way more insane than I previously believed.
Values have been encapsulated in the column names!
My dataframe looks like this:
| ID | Year1_A | Year1_B | Year2_A | Year2_B |
|----|---------|---------|---------|---------|
| 1 | a | b | 2a | 2b |
| 2 | c | d | 2c | 2d |
I am searching for a way to reformat it as such:
| ID | Year | _A | _B |
|----|------|-----|-----|
| 1 | 1 | a | b |
| 1 | 2 | 2a | 2b |
| 2 | 1 | c | d |
| 2 | 2 | 2c | 2d |
The answer below is great, and works perfectly, but the issue is that the dataframe needs more work -- somehow possibly be spread back out, so that each row has 3 columns.
My best idea was to do merge(df, df, by="ID") and then filter out the unwanted rows but this is quickly becoming unwieldy.
df <- data.frame(ID = 1:2, Year1_A = c('a', 'c'), Year1_B = c('b','d' ), Year2_A = c('2a', '2c'), Year2_B = c('2b', '2d'))
library(tidyr)
# your example data
df <- data.frame(ID = 1:2, Year1_A = c('a', 'c'), Year1_B = c('b','d' ), Year2_A = c('2a', '2c'), Year2_B = c('2b', '2d'))
# the solution
df <- gather(df, Year, value, -ID)
# cleaning up
df$Year <- gsub("Year", "", df$Year)
Result:
> df
ID Year value
1 1 1_A a
2 2 1_A c
3 1 1_B b
4 2 1_B d
5 1 2_A 2a
6 2 2_A 2c
7 1 2_B 2b
8 2 2_B 2d

Resources