Find sample size based on number of rows in data - r

I have a dataset that looks like this:
Region    Name
Region 1  Name 14
Region 2  Name 18
Region 2  Name 2
Region 2  Name 21
Region 2  Name 44
Region 3  Name 64
Region 3  Name 24
Region 4  Name 1
Region 4  Name 1
Region 4  Name 98
Region 5  Name 98
Region 5  Name 8
Region 5  Name 8
Region 5  Name 8
Region 5  Name 98
I need to break up the data by Region and then select a random sample of only 5% of the Names per Region, based on the number of rows in that Region.
So let's say there are 30 Names in Region 2; then I need a random sample of size 30 * .05. If there are 50 Names in Region 6, then I need a random sample of size 50 * .05.
So far, I've been able to split() the data using
d = split(data, f = data$Region)
but when I try to run an lapply() call, I get an error that there are different numbers of rows in the list that split() provided:
lapply(data, function(x) {
  sample_n(data, nrow(d) * .05)
})
Any thoughts?
Thank you

Here's a base R solution.
lapply(split(data, data$Region),
       \(x) x[sample(nrow(x), nrow(x) * 0.05), ])
You can then combine the pieces back into a single data frame with do.call(rbind, ...).
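To show the full pipeline including the rbind step, here is a sketch on made-up data (the two 40-row regions and the ceiling() rounding are assumptions for illustration, not part of the question):

```r
set.seed(1)

# hypothetical stand-in data: 40 names in each of two regions
data <- data.frame(
  Region = rep(c("Region 1", "Region 2"), each = 40),
  Name   = paste("Name", 1:80)
)

# take 5% of the rows within each region; ceiling() is one way to
# guarantee at least one row even from very small regions
samples <- lapply(split(data, data$Region),
                  function(x) x[sample(nrow(x), ceiling(nrow(x) * 0.05)), ])

# stitch the per-region samples back into a single data frame
result <- do.call(rbind, samples)
```

With 40 rows per region, ceiling(40 * 0.05) = 2, so result holds 2 sampled rows per region.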

Related

Multiply various subsets of a data frame by different elements of a vector R

I have a data frame:
df <- data.frame(id = rep(1:10, each = 10),
                 Room1 = rnorm(100, 0.4, 0.5),
                 Room2 = rnorm(100, 0.3, 0.5),
                 Room3 = rnorm(100, 0.7, 0.5))
And a vector:
vals <- sample(7:100, 10)
I want to multiply cols Room1, Room2 and Room3 by a different element of the vector for every unique ID number and output a new data frame (df2).
I managed to multiply each column per id by EVERY element of the vector using the following:
samp_func <- function(x) {
  x * vals[i]
}
for (i in vals) {
  df2 <- df %>% mutate_at(c("Room1", "Room2", "Room3"), samp_func)
}
But in the resulting data frame (df2), each Room column is multiplied by the same element of the vector (vals) for all of the different ids, when what I want is each Room column (per id) multiplied by a different element of vals. Sorry in advance if this is not clear; I am a beginner and still getting to grips with the terminology.
Thanks!
EDIT: The desired output should look like the below, where the columns for each ID have been multiplied by a different element of the vector vals.
id Room1 Room2 Room3
1 1 24.674826880 60.1942571 46.81276141
2 1 21.970270107 46.0461779 35.09928150
3 1 26.282357614 -3.5098880 38.68400541
4 1 29.614182061 -39.3025587 25.09146592
5 1 33.030886472 46.0354881 42.68209027
6 1 41.362699668 -23.6624632 26.93845129
7 1 5.429031042 26.7657577 37.49086963
8 1 18.733422977 -42.0620572 23.48992138
9 1 -17.144070723 9.9627315 55.43999326
10 1 45.392182468 20.3959968 -16.52166621
11 2 30.687978299 -11.7194020 27.67351631
12 2 -4.559185345 94.9256561 9.26738357
13 2 86.165076849 -1.2821515 29.36949423
14 2 -12.546711562 47.1763755 152.67588456
15 2 18.285856423 60.5679496 113.85971720
16 2 72.074929648 47.6509398 139.69051486
17 2 -12.332519694 67.8890324 20.73189965
18 2 80.889634991 69.5703581 98.84404415
19 2 87.991093995 -20.7918559 106.13610773
20 2 -2.685594148 71.0611693 47.40278949
21 3 4.764445589 -7.6155681 12.56546664
22 3 -1.293867841 -1.1092243 13.30775785
23 3 16.114831628 -5.4750642 8.58762550
24 3 -0.309470950 7.0656088 10.07624289
25 3 11.225609780 4.2121241 16.59168866
26 3 -3.762529113 6.4369973 15.82362705
27 3 -5.103277731 0.9215625 18.20823042
28 3 -10.623165177 -5.2896293 33.13656839
29 3 -0.002517872 5.0861361 -0.01966699
30 3 -2.183752881 24.4644310 13.55572730
This should solve your problem. You can use a small lookup dataset of (id, val) pairs to make sure each id gets its own multiplier, merge it onto the Room values, and then use mutate to make the new Room columns.
Also, in the future I'd recommend setting a seed when asking questions with random data as it's easier for someone to replicate your output.
library(dplyr)
set.seed(0)
df <- data.frame(id = rep(1:10, each = 10),
                 Room1 = rnorm(100, 0.4, 0.5),
                 Room2 = rnorm(100, 0.3, 0.5),
                 Room3 = rnorm(100, 0.7, 0.5))
vals <- sample(7:100, 10)
other_df <- data.frame(id = 1:10,
                       val = vals)
df2 <- inner_join(other_df, df, by = "id")
df2 <- df2 %>%
  mutate(Room1 = Room1 * val,
         Room2 = Room2 * val,
         Room3 = Room3 * val)
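For comparison, a base R sketch of the same idea without a join: since vals holds one multiplier per id, indexing it with df$id expands it to one multiplier per row (a vectorization shortcut, not part of the original answer):

```r
set.seed(0)
df <- data.frame(id = rep(1:10, each = 10),
                 Room1 = rnorm(100, 0.4, 0.5),
                 Room2 = rnorm(100, 0.3, 0.5),
                 Room3 = rnorm(100, 0.7, 0.5))
vals <- sample(7:100, 10)

# vals[df$id] recycles each id's multiplier across that id's 10 rows,
# so the three Room columns can be scaled in one vectorized step
df2 <- df
df2[c("Room1", "Room2", "Room3")] <-
  df[c("Room1", "Room2", "Room3")] * vals[df$id]
```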

Apply a function to multiple dataframes and store the results in a unique dataframe

I have 360 dataframes (each containing 1 column with species names) named "sp_i_j", where i goes from 1 to 30 (corresponding to 30 samples) and j from 1 to 12 (corresponding to 12 replicates per sample) -> sp_1_1, sp_1_2,..., sp_30_11, sp_30_12.
For example, sp_1_1 is like that:
>sp_1_1
species
1 Cynoglossus bilineatus
2 Denticeps clupeoides
3 Gnathopogon imberbis
4 Grasseichthys gabonensis
5 Howella brodiei
sp_2_1 looks like that:
> sp_2_1
species
1 Acipenser fulvescens
2 Acrossocheilus stenotaeniatus
3 Allocyttus niger
4 Anguilla celebesensis
5 Aulopyge huegelii
And I have 30 other dataframes named "species_s1" to "species_s30" also containing only 1 column with species names.
species_s1 is like that:
> species_s1
species
1 Pseudaspius leptocephalus
2 Denticeps clupeoides
3 Howella brodiei
4 Microphysogobio tafangensis
5 Semotilus atromaculatus
6 Grasseichthys gabonensis
and species_s2 is like that:
> species_s2
species
1 Geotria australis
2 Odontamblyopus rebecca
3 Neocyttus rhomboidalis
4 Tinca tinca
5 Aulopyge huegelii
6 Rastrelliger kanagurta
I want to apply the following function to all of the 360 dataframes:
TP <- nrow(inner_join(sp_1_1, species_s1))
So that all the dataframes starting with "sp_1_" are compared to "species_s1", dataframes starting with "sp_2_" are compared to "species_s2", and so on.
And I would like to store the results in a single data frame of 30 columns (corresponding to the samples) and 12 rows (corresponding to the replicates), so that the results of comparing "sp_1_1" through "sp_1_12" with "species_s1" are stored in the 12 rows of the first column; the results of comparing "sp_2_1" through "sp_2_12" with "species_s2" in the 12 rows of the second column; and so on.
I tried something like that:
for (i in 1:30) {
  for (j in 1:12) {
    TP[i, j] <- nrow(inner_join(sp_[i]_[j], species_s[i]))
  }
}
But obviously that doesn't work.
Any suggestions?
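A sketch of one way to make this loop idea work, using paste0() to build each object's name and get() to fetch it from the environment; shown on toy data with only 2 samples and 3 replicates, and with base merge() standing in for inner_join:

```r
set.seed(42)

# toy stand-ins for the real sp_i_j and species_si data frames
for (i in 1:2) {
  assign(paste0("species_s", i),
         data.frame(species = paste("Species", i, 1:5)))
  for (j in 1:3) {
    assign(paste0("sp_", i, "_", j),
           data.frame(species = paste("Species", i, sample(1:5, 3))))
  }
}

# replicates in rows, samples in columns; get() retrieves each data
# frame by its constructed name, merge() is a base R inner join
TP <- sapply(1:2, function(i) {
  ref <- get(paste0("species_s", i))
  sapply(1:3, function(j)
    nrow(merge(get(paste0("sp_", i, "_", j)), ref)))
})
```

On the real data the outer sapply would run over 1:30 and the inner over 1:12, giving the 12 x 30 result described above.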

Mapping a dataframe (with NA) to an n by n adjacency matrix (as a data.frame object)

I have a three-column dataframe object recording the bilateral trade data between 161 countries, the data are of dyadic format containing 19687 rows, three columns (reporter (rid), partner (pid), and their bilateral trade flow (TradeValue) in a given year). rid or pid takes a value from 1 to 161, and a country is assigned the same rid and pid. For any given pair of (rid, pid) in which rid =/= pid, TradeValue(rid, pid) = TradeValue(pid, rid).
The data (run in R) look like this:
#load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
The data were sourced from the UN Comtrade database. Each rid is paired with multiple pids to get their bilateral trade data, but as can be seen, not every pid has a numeric id value, because I only assigned a rid or pid to a country if a list of relevant economic indicators for that country is available; this is why there are NAs in the data even though a TradeValue exists between that country and the reporting country (rid). The same applies on the reporter side: if a country did not report any TradeValue with partners, its id number is absent from the rid column. (Hence the rid column begins with 2, because country 1, i.e. Afghanistan, did not report any bilateral trade data with partners.) A quick check with summary statistics helps confirm this:
length(unique(example_data$rid))
[1] 139
# only 139 countries reported bilateral trade statistics with partners
length(unique(example_data$pid))
[1] 162
# that extra pid is NA (161 + NA = 162)
Most countries report bilateral trade data with partners, and those that don't tend to be small economies. Hence, I want to preserve the complete list of 161 countries and transform this example_data dataframe into a 161 x 161 adjacency matrix in which:
for those countries that are absent from the rid column (e.g., rid == 1), create each of them a row and set the entire row (in the 161 x 161 matrix) to 0.
for those countries (pid) that do not share TradeValue entries with a particular rid, set those cells to 0.
For example, suppose in a 5 x 5 adjacency matrix, country 1 did not report any trade statistics with partners, while the other four reported their bilateral trade statistics with each other (except country 1). The original dataframe is like
rid pid TradeValue
2 3 223
2 4 13
2 5 9
3 2 223
3 4 57
3 5 28
4 2 13
4 3 57
4 5 82
5 2 9
5 3 28
5 4 82
from which I want to convert it to a 5 x 5 adjacency matrix (of data.frame format), the desired output should look like this
V1 V2 V3 V4 V5
1 0 0 0 0 0
2 0 0 223 13 9
3 0 223 0 57 28
4 0 13 57 0 82
5 0 9 28 82 0
And I want to use the same method on example_data to create a 161 x 161 adjacency matrix. However, after some trial and error with reshape and other methods, I still could not work out such a conversion, not even the first step.
It would be really appreciated if anyone could enlighten me on this.
I cannot read the dropbox file but have tried to work off of your 5-country example dataframe -
library(reshape2)  # for dcast

country_num = 5
# check countries missing in rid and pid
rid_miss = setdiff(1:country_num, example_data$rid)
pid_miss = if (length(setdiff(1:country_num, example_data$pid)) == 0) 1 else
  setdiff(1:country_num, example_data$pid)
# create dummy dataframe with missing rid and pid
add_data = as.data.frame(do.call(cbind, list(rid_miss, pid_miss, NA)))
colnames(add_data) = colnames(example_data)
# add dummy dataframe to original
example_data = rbind(example_data, add_data)
# the dcast now takes missing rid and pid into account
mat = dcast(example_data, rid ~ pid, value.var = "TradeValue")
# can remove first column without setting colnames but this is more failproof
rownames(mat) = mat[, 1]
mat = as.matrix(mat[, -1])
# fill in upper triangular matrix with missing values of lower triangular
# matrix and vice-versa, since TradeValue(rid, pid) = TradeValue(pid, rid)
mat[is.na(mat)] = t(mat)[is.na(mat)]
# change NAs to 0 according to preference - would keep as NA to differentiate
# from actual zeros
mat[is.na(mat)] = 0
Does this help?
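As an alternative sketch that avoids the dummy rows entirely, xtabs() can build the zero-filled matrix in one step; shown here on the 5-country example from the question (the explicit factor levels are what force the never-reporting country 1 into the table):

```r
# the 5 x 5 example from the question
trade <- data.frame(
  rid = rep(2:5, each = 3),
  pid = c(3, 4, 5,  2, 4, 5,  2, 3, 5,  2, 3, 4),
  TradeValue = c(223, 13, 9,  223, 57, 28,  13, 57, 82,  9, 28, 82)
)
n <- 5

# xtabs sums TradeValue per (rid, pid) cell and fills absent
# combinations with 0; fixing the levels adds the all-zero row/column
adj <- xtabs(TradeValue ~ factor(rid, levels = 1:n) + factor(pid, levels = 1:n),
             data = trade)
adj <- as.data.frame.matrix(adj)
```

Note this fills absences with 0 directly, so the NA-vs-zero distinction mentioned above is lost.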

R "melt-cast" like operation

I have a file that contains content like this:
name: erik
age: 7
score: 10
name: stan
age: 8
score: 11
name: kyle
age: 9
score: 20
...
As you can see, each record actually spans 3 rows in the file. I am wondering how I can read in the file and transform it into a data frame that looks like the below:
name age score
erik 7 10
stan 8 11
kyle 9 20
...
What I have done so far(thanks tcash21):
> data <- read.table(file.choose(), header=FALSE, sep=":", col.names=c("variable", "value"))
> data
variable value
1 name erik
2 age 7
3 score 10
4 name stan
5 age 8
6 score 11
7 name kyle
8 age 9
9 score 20
I am thinking I could split the column into two by ":" and then maybe use something similar to cast in the reshape package to do what I want. Or, how can I get only the rows at index 1, 4, 7, ..., i.e. with a constant step?
Thanks!
Another possibility:
library(reshape2)
df$id <- rep(1:(nrow(df)/3), each = 3)
dcast(df, id ~ variable, value.var = "value")
# id age name score
# 1 1 7 erik 10
# 2 2 8 stan 11
# 3 3 9 kyle 20
If the format is predictable you might want to do something really simple like
# recreate data
data <- matrix(c("erik", 7, 10, "stan", 8, 11, "kyle", 9, 20), ncol = 1)
# get individual variables
names <- data[seq(1, length(data) - 2, 3)]
age   <- data[seq(2, length(data) - 1, 3)]
score <- data[seq(3, length(data), 3)]
# combine variables
reformatted.data <- as.data.frame(cbind(names, age, score))
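Since every record spans exactly three lines, the seq() indexing above can also be collapsed into a single matrix() call with byrow = TRUE; a minimal sketch:

```r
# raw values in file order: name, age, score, name, age, score, ...
vals <- c("erik", 7, 10, "stan", 8, 11, "kyle", 9, 20)

# pour the vector row-wise into a 3-column matrix, one record per row
wide <- as.data.frame(matrix(vals, ncol = 3, byrow = TRUE),
                      stringsAsFactors = FALSE)
names(wide) <- c("name", "age", "score")
# the character vector forced everything to character; restore numerics
wide$age   <- as.numeric(wide$age)
wide$score <- as.numeric(wide$score)
```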

create dataframe in for loop using dataframe array

I have a data frame like the one below. I need to extract sub-data-frames based on the regions available in RL.
>avg_data
region SN value
beta 1 32
alpha 2 44
beta 3 55
beta 4 60
atp 5 22
> RL
V1
1 beta
2 alpha
That data frame should be stored in an array, something like REGR[beta], which should contain the beta-related information as below:
region SN value
beta 1 32
beta 3 55
beta 4 60
Similarly for REGR[alpha]
region SN value
alpha 2 44
So that I can pass REGR as an argument for plotting graphs.
REGR <- data.frame()
for (i in levels(RL$V1)) {
  REGR[i, ] <- avg_data[avg_data$region == i, ]
}
I made some mistake in the above code. Please correct me. Thank you!
The split function may be of interest to you. From the help page, split divides the data in the vector x into the groups defined by f.
So for your data, it may look something like:
> split(avg_data, avg_data$region)
$alpha
region SN value
2 alpha 2 44
$atp
region SN value
5 atp 5 22
$beta
region SN value
1 beta 1 32
3 beta 3 55
4 beta 4 60
If you want to filter out the records that do not occur in RL, I'd probably do that in a preprocessing step using the %in% function and [ for extraction:
x <- avg_data[avg_data$region %in% RL$V1,]
#-----
region SN value
1 beta 1 32
2 alpha 2 44
3 beta 3 55
That's what I'd feed to split if you want to drop atp.
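Putting the filter and split() together gives a named list you can index by region, which seems to be the REGR the question is after; a small sketch on the example data:

```r
avg_data <- data.frame(region = c("beta", "alpha", "beta", "beta", "atp"),
                       SN = 1:5,
                       value = c(32, 44, 55, 60, 22),
                       stringsAsFactors = FALSE)
RL <- data.frame(V1 = c("beta", "alpha"), stringsAsFactors = FALSE)

# drop regions not listed in RL, then split into a named list of
# data frames, one per remaining region
x <- avg_data[avg_data$region %in% RL$V1, ]
REGR <- split(x, x$region)

REGR[["beta"]]   # the beta-only rows
```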
The approach above may be overkill if you are just wanting to plot. Here's an example using sapply to iterate through each level of region and make a plot:
sapply(unique(x$region), function(z)
  plot(x[x$region == z, "value"], main = z[1]))
