Iterative grep with r - r

Ok so thank you guy, I will start again the question:
this is my df
df = read.table(text = ' replicate size fh
ms03a_T0_r1 397.51 1099
ms03a_T0_r1 695.46 8
ms03a_T0_r1 708.76 1409
ms03a_T0_r1 1203.98 102
ms03a_T0_r2 397.52 749
ms03a_T0_r2 493.97 23
ms03a_T0_r2 538.43 12
ms03a_T0_r3 397.49 638
ms03a_T0_r3 399.84 9
ms03a_T0_r3 404.95 33
ms03a_T0_r3 406.85 40 ', header = T)
Rn <- as.numeric(length(levels(ol$replicate)))
# just to calculate the number of samples
From this I would like to have 3 new dataset each one that will contain only rows with *_r1 value of "replicate" variable, rows with *_r2 and rows with *_r3.
I thought to did this whit these commands:
for (i in 1:Rn){
x <- df[as.character(sub('.*_r', '', as.character(replicate))) %in% i];
outfile <- paste("rep_",i,"_edited.txt",sep="")
write.table(x,quote=FALSE,sep=", ",outfile)
}
but I am able to get .txt outputs and not df objects in r. In this way then I will have to import them again in r to move on the next step of my "script", and I have no idea how set r for import them automatically

My guess is you want this:
df = read.table(text = ' replicate size fh
ms03a_T0_r1 397.51 1099
ms03a_T0_r1 695.46 8
ms03a_T0_r1 708.76 1409
ms03a_T0_r1 1203.98 102
ms03a_T0_r2 397.52 749
ms03a_T0_r2 493.97 23
ms03a_T0_r2 538.43 12
ms03a_T0_r3 397.49 638
ms03a_T0_r3 399.84 9
ms03a_T0_r3 404.95 33
ms03a_T0_r3 406.85 40 ', header = T)
library(data.table)
dt = data.table(df)
special.ids = c(1,3)
dt[as.numeric(sub('.*_r', '', as.character(replicate))) %in% special.ids]
# replicate size fh
#1: ms03a_T0_r1 397.51 1099
#2: ms03a_T0_r1 695.46 8
#3: ms03a_T0_r1 708.76 1409
#4: ms03a_T0_r1 1203.98 102
#5: ms03a_T0_r3 397.49 638
#6: ms03a_T0_r3 399.84 9
#7: ms03a_T0_r3 404.95 33
#8: ms03a_T0_r3 406.85 40
Note, as.character is needed because read.table converts strings to factors by default. You may not need that for you actual data if it's already strings.
Upon second reading, maybe you want this?
split(df, sub('.*_r', '', as.character(df$replicate)))
OP - you need to fix your question.

Related

Apply loop for rollapply windows

I currently have a dataset with 50,000+ rows of data for which I need to find rolling sums. I have completed this using rollaply which has worked perfectly. I need to apply these rolling sums across a range of widths (600, 1200, 1800...6000) which I have done by cut and pasting each line of script and changing the width. While it works, I'd like to tidy my script but applying a loop, or similar, if possible so that once the rollapply function has completed it's first 'pass' at 600 width, it then completes the same with 1200 and so on. Example:
Var1 Var2 Var3
1 11 19
43 12 1
4 13 47
21 14 29
41 15 42
16 16 5
17 17 16
10 18 15
20 19 41
44 20 27
width_2 <- rollapply(x$Var1, FUN = sum, width = 2)
width_3 <- rollapply(x$Var1, FUN = sum, width = 3)
width_4 <- rollapply(x$Var1, FUN = sum, width = 4)
Is there a way to run widths 2, 3, then 4 in a simpler way rather than cut and paste, particularly when I have up to 10 widths, and then need to run this across other cols. Any help would be appreciated.
We can use lapply in base R
lst1 <- lapply(2:4, function(i) rollapply(x$Var1, FUN = sum, width = i))
names(lst1) <- paste0('width_', 2:4)
list2env(lst1, .GlobalEnv)
NOTE: It is not recommended to create multiple objects in the global environment. Instead, the list would be better
Or with a for loop
for(v in 2:4) {
assign(paste0('width_', v), rollapply(x$Var1, FUN = sum, width = v))
}
Create a function to do this for multiple dataset
f1 <- function(col1, i) {
rollapply(col1, FUN = sum, width = i)
}
lapply(x[c('Var1', 'Var2')], function(x) lapply(2:4, function(i)
f1(x, i)))
Instead of creating separate vectors in global environment probably you can add these as new columns in the already existing dataframe.
Note that rollaplly(..., FUN = sum) is same as rollsum.
library(dplyr)
library(zoo)
bind_cols(x, purrr::map_dfc(2:4,
~x %>% transmute(!!paste0('Var1_roll_', .x) := rollsumr(Var1, .x, fill = NA))))
# Var1 Var2 Var3 Var1_roll_2 Var1_roll_3 Var1_roll_4
#1 1 11 19 NA NA NA
#2 43 12 1 44 NA NA
#3 4 13 47 47 48 NA
#4 21 14 29 25 68 69
#5 41 15 42 62 66 109
#6 16 16 5 57 78 82
#7 17 17 16 33 74 95
#8 10 18 15 27 43 84
#9 20 19 41 30 47 63
#10 44 20 27 64 74 91
You can use seq to generate the variable window size.
seq(600, 6000, 600)
#[1] 600 1200 1800 2400 3000 3600 4200 4800 5400 6000

writing out .dat file in r

I have a dataset looks like this:
ids <- c(111,12,134,14,155,16,17,18,19,20)
scores.1 <- c(0,1,0,1,1,2,0,1,1,1)
scores.2 <- c(0,0,0,1,1,1,1,1,1,0)
data <- data.frame(ids, scores.1, scores.1)
> data
ids scores.1 scores.1.1
1 111 0 0
2 12 1 1
3 134 0 0
4 14 1 1
5 155 1 1
6 16 2 2
7 17 0 0
8 18 1 1
9 19 1 1
10 20 1 1
ids stands for student ids, scores.1 is the response/score for the first question, and scores.2 is the response/score for the second question. Student ids vary in terms of the number of digits but scores always have 1 digit. I am trying to write out as .dat file by generating some object and use those in write.fwf function in gdata library.
item.count <- dim(data)[2] - 1 # counts the number of questions in the dataset
write.fwf(data, file = "data.dat", width = c(5,rep(1, item.count)),
colnames = FALSE, sep = "")
I would like to separate the student ids and question response with some spaces,so I would like to use 5 spaces for students ids and to specify that I used width = c(5, rep(1, item.count)) in write.fwf() function. However, the output file looks like this having the spaces at the left side of the student ids
11100
1211
13400
1411
15511
1622
1700
1811
1911
2011
rather than at the right side of the ids.
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
Any recommendations?
Thanks!
We can use unite to unite the 'score' columns into a single one and then use write.csv
library(dplyr)
library(tidyr)
data %>%
unite(scores, starts_with('scores'), sep='')
with #akrun's help, this gives what I wanted:
library(dplyr)
library(tidyr)
data %>%
unite(scores, starts_with('scores'), sep='')
write.fwf(data, file = "data.dat",
width = c(5,item.count),
colnames = FALSE, sep = " ")
in the .dat file, the dataset looks like this below:
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11

Subset Columns based on partial matching of column names in the same data frame

I would like to understand how to subset multiple columns from same data frame by matching the first 5 letters of the column names with each other and if they are equal then subset it and store it in a new variable.
Here is a small explanation of my required output. It is described below,
Lets say the data frame is eatable
fruits_area fruits_production vegetable_area vegetable_production
12 100 26 324
33 250 40 580
66 510 43 581
eatable <- data.frame(c(12,33,660),c(100,250,510),c(26,40,43),c(324,580,581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
"vegetable_production")
I was trying to write a function which will match the strings in a loop and will store the subset columns after matching first 5 letters from the column names.
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
checkExpression(eatable,"your_string")
The above function checks the string correctly but I am confused how to do matching among the column names in the dataset.
Edit:- I think regular expressions would work here.
You could try:
v <- unique(substr(names(eatable), 0, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])
Or using map() + select_()
library(tidyverse)
map(v, ~select_(eatable, ~matches(.)))
Which gives:
#[[1]]
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
#
#[[2]]
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
Should you want to make it into a function:
checkExpression <- function(df, l = 5) {
v <- unique(substr(names(df), 0, l))
lapply(v, function(x) df[grepl(x, names(df))])
}
Then simply use:
checkExpression(eatable, 5)
I believe this may address your needs:
checkExpression <- function(dataset,str){
cols <- grepl(paste0("^",str),colnames(dataset),ignore.case = TRUE)
subset(dataset,select=colnames(dataset)[cols])
}
Note the addition of "^" to the pattern used in grepl.
Using your data:
checkExpression(eatable,"fruit")
## fruits_area fruits_production
##1 12 100
##2 33 250
##3 660 510
checkExpression(eatable,"veget")
## vegetables_area vegetable_production
##1 26 324
##2 40 580
##3 43 581
Your function does exactly what you want but there was a small error:
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
Change the name of the object from which your subsetting from obje to dataset.
checkExpression(eatable,"fr")
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
checkExpression(eatable,"veg")
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581

CSV conversion in R for standard calculations

I have a problem calculating the mean of columns for a dataset imported from this CSV file
I import the file using the following command:
dataGSR = read.csv("ShimmerData.csv", header = TRUE, sep = ",",stringsAsFactors=T)
dataGSR$X=NULL #don't need this column
Then I take a subset of this
dati=dataGSR[4:1000,]
i check they are correct
head(dati)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
4 31329 0 713 623.674691281028 2545 3706.5641025641 2409 3529.67032967033
5 31649 9.765625 713 623.674691281028 2526 3678.89230769231 2501 3664.46886446886
6 31969 19.53125 712 638.528829576655 2528 3681.80512820513 2501 3664.46886446886
7 32289 29.296875 713 623.674691281028 2516 3664.3282051282 2498 3660.07326007326
8 32609 39.0625 711 654.10779696494 2503 3645.39487179487 2496 3657.14285714286
9 32929 48.828125 713 623.674691281028 2505 3648.30769230769 2496 3657.14285714286
When I type
means=colMeans(dati)
Error in colMeans(dati) : 'x' must be numeric
In order to solve this problem I convert everything into a matrix
datiM=data.matrix(dati)
But when I check the new variable, data values are different
head(datiM)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
4 370 1 10 1 65 65 1 1
5 375 3707 10 1 46 46 24 24
6 381 1025 9 2 48 48 24 24
7 386 2162 10 1 36 36 21 21
8 392 3126 8 3 23 23 19 19
9 397 3229 10 1 25 25 19 19
My questions here is:
How to convert correctly the "dati" variable in order to perform the colMeans()?
In addition to #akrun's advice, another option is to convert the columns to numeric yourself (rather than having read.csv do it):
dati <- data.frame(
lapply(dataGSR[-c(1:3),-9],as.numeric))
##
R> colMeans(dati)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
33004.2924 18647.4609 707.4335 718.3989 2521.3626 3672.1383 2497.9013 3659.9287
Where dataGSR was read in with stringsAsFactors=F,
dataGSR <- read.csv(
file="F:/temp/ShimmerData.csv",
header=TRUE,
stringsAsFactors=F)
Unless you know for sure that you need character columns to be factors, you are better off setting this option to FALSE.
The header lines ("character") in the dataset span first 4 lines. We could skip the 4 lines, use header=FALSE and then change the column names based on the info from the first 4 lines.
dataGSR <- read.csv('ShimmerData.csv', header=FALSE,
stringsAsFactors=FALSE, skip=4)
lines <- readLines('ShimmerData.csv', n=4)
colnames(dataGSR) <- do.call(paste, c(strsplit(lines, ','),
list(sep="_")))
dataGSR <- dataGSR[,-9]
unname(colMeans(dataGSR))
# [1] 33004.2924 18647.4609 707.4335 718.3989 2521.3626
# 3672.1383 2497.9013
# [8] 3659.9287

R - setting equiprobability over a specific variable when sampling

I have a data set with more than 2 millions entries which I load into a data frame.
I'm trying to grab a subset of the data. I need around 10000 entries but I need the entries to be picked with equal probability on one variable.
This is what my data looks like with str(data):
'data.frame': 2685628 obs. of 3 variables:
$ category : num 3289 3289 3289 3289 3289 ...
$ id: num 8064180 8990447 747922 9725245 9833082 ...
$ text : chr "text1" "text2" "text3" "text4" ...
You've noticed that I have 3 variables : category,id and text.
I have tried the following :
> sample_data <- data[sample(nrow(data),10000,replace=FALSE),]
Of course this works, but the probability of sample if not equal. Here is the output of count(sample_data$category) :
x freq
1 3289 707
2 3401 341
3 3482 160
4 3502 243
5 3601 1513
6 3783 716
7 4029 423
8 4166 21
9 4178 894
10 4785 31
11 5108 121
12 5245 2178
13 5637 387
14 5946 1484
15 5977 117
16 6139 664
Update: Here is the output of count(data$category) :
x freq
1 3289 198142
2 3401 97864
3 3482 38172
4 3502 59386
5 3601 391800
6 3783 201409
7 4029 111075
8 4166 6749
9 4178 239978
10 4785 6473
11 5108 32083
12 5245 590060
13 5637 98785
14 5946 401625
15 5977 28769
16 6139 183258
But when I try setting the probability I get the following error :
> catCount <- length(unique(data$category))
> probabilities <- rep(c(1/catCount),catCount)
> train_set <- data[sample(nrow(data),10000,prob=probabilities),]
Error in sample.int(x, size, replace, prob) :
incorrect number of probabilities
I understand that the sample function is randomly picking between the row number but I can't figure out how to associate that with the probability over the categories.
Question : How can I sample my data over an equal probability for the category variable?
Thanks in advance.
I guess you could do this with some simple base R operation, though you should remember that you are using probabilities here within sample, thus getting the exact amount per each combination won't work using this method, though you can get close enough for large enough sample.
Here's an example data
set.seed(123)
data <- data.frame(category = sample(rep(letters[1:10], seq(1000, 10000, by = 1000)), 55000))
Then
probs <- 1/prop.table(table(data$category)) # Calculating relative probabilities
data$probs <- probs[match(data$category, names(probs))] # Matching them to the correct rows
set.seed(123)
train_set <- data[sample(nrow(data), 1000, prob = data$probs), ] # Sampling
table(train_set$category) # Checking frequencies
# a b c d e f g h i j
# 94 103 96 107 105 99 100 96 107 93
Edit: So here's a possible data.table equivalent
library(data.table)
setDT(data)[, probs := .N, category][, probs := .N/probs]
train_set <- data[sample(.N, 1000, prob = probs)]
Edit #2: Here's a very nice solution using the dplyr package contributed by #Khashaa and #docendodiscimus
The nice thing about this solution is that it returns the exact sample size within each group
library(dplyr)
train_set <- data %>%
group_by(category) %>%
sample_n(1000)
Edit #3:
It seems that data.table equivalent to dplyr::sample_n would be
library(data.table)
train_set <- setDT(data)[data[, sample(.I, 1000), category]$V1]
Which will also return the exact sample size within each group

Resources