Separating data frame randomly but keeping identical values together - r

I have a large data set that I am trying to work with. I am currently trying to separate my data set into three different data frames, that will be used for different points of testing.
ind<-sample(3, nrow(df1), replace =TRUE, prob=c(0.40, 0.50, 0.10))
df2<-as.data.frame(df1[ind==1,1:27])
df3<-as.data.frame(df1[ind==2, 1:27])
df4<-as.data.frame(df1[ind==3,1:27])
However, the first column in df1 is an invoice number, and multiple rows can have the same invoice number, as returns and mistakes are included. I am trying to find a way that will split the data up randomly, but keep all rows with the same invoice number together.
Any suggestions on how I may manage to accomplish this?

Instead of sampling the rows, you could sample the unique invoice numbers and then select the rows with those invoice numbers.
## Some sample data
df1 = data.frame(invoice=sample(10,20, replace=T), V = rnorm(20))
## sample the unique values
ind = sample(3, length(unique(df1$invoice)), replace=T)
## Select rows by sampled invoice number
df1[df1$invoice %in% unique(df1$invoice)[ind==1], 1:2]
invoice V
2 8 -0.67717939
6 9 -0.89222154
9 8 -0.71756069
14 8 -0.03539096
15 2 0.38453752
16 9 -0.16298835
17 9 -0.30823521
20 2 -0.60198259
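To carry this through to all three data frames with the 40/50/10 proportions from the question, here is a minimal sketch. The names `invoices`, `grp`, `row_grp`, and `parts` are my own, and the sample data only stands in for the real `df1`. Note that the proportions apply to invoices, not rows, so the row counts will only roughly follow 40/50/10.

```r
set.seed(1)

## Stand-in data: 20 rows, invoice numbers drawn from 1..10
df1 <- data.frame(invoice = sample(10, 20, replace = TRUE), V = rnorm(20))

## Assign each unique invoice number to group 1, 2 or 3
invoices <- unique(df1$invoice)
grp <- sample(3, length(invoices), replace = TRUE, prob = c(0.40, 0.50, 0.10))

## Look up the group of every row through its invoice number
row_grp <- grp[match(df1$invoice, invoices)]

## Split into a list of three data frames (a group may be empty)
parts <- split(df1, factor(row_grp, levels = 1:3))
df2 <- parts[["1"]]
df3 <- parts[["2"]]
df4 <- parts[["3"]]
```

Because the group is drawn once per invoice number, every row with the same invoice number ends up in the same data frame.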

ind1 <- which(df1[,1] == 1)
ind2 <- which(df1[,1] == 2)
ind3 <- which(df1[,1] == 3)
df2 <- as.data.frame(df1[sample(ind1, length(ind1), replace = TRUE), 1:27])
df3 <- as.data.frame(df1[sample(ind2, length(ind2), replace = TRUE), 1:27])
df4 <- as.data.frame(df1[sample(ind3, length(ind3), replace = TRUE), 1:27])
ind1, ind2, and ind3 hold the row indices whose invoice number is 1, 2, or 3. Each data frame is then built by taking a random sample from only those rows. Hope this helps.

Related

How to count missing values from two columns in R

I have a data frame which looks like this
**Contig_A** **Contig_B**
Contig_0 Contig_1
Contig_3 Contig_5
Contig_4 Contig_1
Contig_9 Contig_0
I want to count how many contig ids (from Contig_0 to Contig_1193) are not present in either the Contig_A or the Contig_B column.
For example: if we consider there are total 10 contigs here for this data frame (Contig_0 to Contig_9), then the answer would be 4 (Contig_2, Contig_6, Contig_7, Contig_8)
Create a vector of all the values that you want to check (all_contig) which is Contig_0 to Contig_10 here. Use setdiff to find the absent values and length to get the count of missing values.
cols <- c('Contig_A', 'Contig_B')
#If there are lot of 'Contig' columns that you want to consider
#cols <- grep('Contig', names(df), value = TRUE)
all_contig <- paste0('Contig_', 0:10)
missing_contig <- setdiff(all_contig, unlist(df[cols]))
#[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8" "Contig_10"
count_missing <- length(missing_contig)
#[1] 5
Using match:
x <- c(0:9)
contigs <- sapply(x, function(t) paste0("Contig_",t))
df1 <- data.frame(
Contig_A = c("Contig_0", "Contig_3", "Contig_4", "Contig_9"),
Contig_B = c("Contig_1", "Contig_5", "Contig_1", "Contig_0")
)
xx <- c(df1$Contig_A,df1$Contig_B)
contigs[is.na(match(contigs, xx))]
[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8"
In your case, just change x to x <- 0:1193
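For the full range in the question, a self-contained sketch could look like the following. It uses only the four example rows, so the count applies to that toy data and not to the real data frame; `all_contigs`, `present`, and `missing_contigs` are names I made up.

```r
## Example rows from the question
df1 <- data.frame(
  Contig_A = c("Contig_0", "Contig_3", "Contig_4", "Contig_9"),
  Contig_B = c("Contig_1", "Contig_5", "Contig_1", "Contig_0"),
  stringsAsFactors = FALSE
)

## Every id that could occur
all_contigs <- paste0("Contig_", 0:1193)

## Ids that occur in either column
present <- unique(c(df1$Contig_A, df1$Contig_B))

## Ids that never occur, and how many there are
missing_contigs <- setdiff(all_contigs, present)
length(missing_contigs)
## [1] 1188
```

Six distinct ids appear in the example rows, so 1194 - 6 = 1188 are missing.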

How to build a data frame where one column adds onto the other in R

I'd like to create a randomized data frame in R where the values of the 2nd column = 1st column + number and the next value of the 1st column = 2nd column + number
It would look something like this:
I've tried doing this:
Sal = rnorm(150,mean=18,sd=1.7)
H2O = Sal + 0.2358
d = data.frame(Sal = rep(Sal,1), H2O = rep(H2O,1))
df = d[order(d$Sal,d$H2O),]
df
But it doesn't really work since the next number doesn't "build" upon the previous number.
How could I do this? Should I use a loop instead? My R experience is fairly limited (as you can probably tell)
Thank you in advance!
I think cumsum is what you are looking for (and then filling the rows of your data frame):
set.seed(42)
n_rows <- 5
rnd_numbers <- rnorm(n_rows * 2, mean = 18, sd = 17)
entries <- cumsum(rnd_numbers)
df <- data.frame(matrix(entries, nrow = n_rows, byrow = T))
colnames(df) <- c('Sal', 'H2O')
df
Sal H2O
1 41.30629 49.70642
2 73.87961 102.63827
3 127.51083 143.70672
4 187.40259 203.79339
5 256.10659 273.04045
You can check that the result follows the structure you described above by having a look at the underlying random numbers:
rnd_numbers
[1] 41.306294 8.400131 24.173183 28.758664 24.872561 16.195883
[7] 43.695874 16.390796 52.313203 16.933860
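If you prefer to check this programmatically, here is a small sketch: reading the cumulative sums in row order, each entry should equal the previous one plus the next random number. The name `steps` is my own.

```r
set.seed(42)
n_rows <- 5
rnd_numbers <- rnorm(n_rows * 2, mean = 18, sd = 17)
entries <- cumsum(rnd_numbers)

## Differences between consecutive entries are the increments themselves
steps <- diff(entries)
all.equal(steps, rnd_numbers[-1])

## The first entry is just the first random number
all.equal(entries[1], rnd_numbers[1])
```

Both comparisons should return TRUE, which confirms that every value builds on the one before it.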
For your question, looking at the desired output data.frame, if you read the values from left to right and top to bottom, you will see that the increment relative to the first value can be written as
c(0,1,2,3,....,2*nr-1)*number
where nr is the number of rows of your desired output. In this sense, the only thing you need to do is create the sequence of increments and add it to the initial value.
You can try the code below
set.seed(1)
Sal <- rnorm(1, mean = 18, sd = 1.7)
number <- 0.2358
nr <- 10
df <- setNames(
data.frame(
matrix(Sal + (seq(2 * nr) - 1) * number,
ncol = 2,
byrow = TRUE
)
),
c("Sal", "H2O")
)
which gives
> df
Sal H2O
1 16.93503 17.17083
2 17.40663 17.64243
3 17.87823 18.11403
4 18.34983 18.58563
5 18.82143 19.05723
6 19.29303 19.52883
7 19.76463 20.00043
8 20.23623 20.47203
9 20.70783 20.94363
10 21.17943 21.41523
Here's an approach using Reduce:
set.seed(123)
Start <- rnorm(1,14,1)
Values <- Reduce(`+`,rnorm(19, 0.25, 0.125),init = Start, accumulate = TRUE)
Matrix <- matrix(Values, ncol = 2, byrow = TRUE)
Result <- setNames(as.data.frame(Matrix),c("Salt","H2O"))
Result
Salt H2O
1 13.43952 13.66075
2 14.10559 14.36440
3 14.63057 15.09495
4 15.40256 15.49443
5 15.65857 15.85287
6 16.25588 16.55085
7 16.85095 17.11478
8 17.29530 17.76867
9 18.08090 18.08507
10 18.42274 18.61364
Reduce's first argument is a function with two arguments, in our case +. We set init = to be the starting value, and then a random value generated by rnorm is added. That new value is used as the starting value for the next number. We use accumulate = TRUE to keep all the values.
From here, we can use matrix() to change the vector of values into an n x 2 matrix, filled row by row. Then we can convert it to a data.frame and add the column names.
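As a side note, when the function is `+`, the accumulated Reduce is the same thing as a cumulative sum over the start value followed by the increments. A quick sketch to convince yourself (the variable names here are mine):

```r
set.seed(123)
start <- rnorm(1, 14, 1)
increments <- rnorm(19, 0.25, 0.125)

## Running total built step by step with Reduce
via_reduce <- Reduce(`+`, increments, init = start, accumulate = TRUE)

## Same running total with cumsum
via_cumsum <- cumsum(c(start, increments))

all.equal(via_reduce, via_cumsum)
length(via_reduce)  # 20 values: the start plus 19 increments
```

Reduce becomes the more useful tool once the step is something other than a plain sum.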

(Pearson's) Correlation loop through the data frame

I have a data frame with 159 observations and 27 variables, and I want to correlate column 4 (variable 4) with each of the following columns in turn, that is, column 4 with 5, then column 4 with 6, and so on. I've been trying unsuccessfully to write a loop, and since I'm a beginner in R, it turned out harder than I thought. I want to simplify this because I need to do the same thing for a couple more data frames, and a function that could do it would be much easier and less time-consuming. It would be wonderful if anyone could help me.
df <- ZEB1_23genes # CHANGE ZEB1_23genes for df (dataframe)
for (i in colnames(df)){ # Check the class of the variables
print(class(df[[i]]))
}
print(df)
# Correlate ZEB1 with each of the 23 genes accordingly to Pearson's method
cor.test(df$ZEB1, df$PITPNC1, method = "pearson")
### OR ###
cor.test(df[,4], df[,5])
So I can correlate individually but I cannot create a loop to go back to column 4 and correlate it to the next column (5, 6, ..., 27).
Thank you!
If I've understood your question correctly, the solution below should work well.
#Sample data
df <- data.frame(matrix(data = sample(runif(100000), 4293), nrow = 159, ncol = 27))
#Correlation function
#Takes a data.frame containing the columns to be correlated as input
#The column against which the other columns are correlated can be specified (start_col; default is 4)
#The last column to correlate against start_col can also be specified (end_col; default is the last column of the data.frame)
#Function returns a data.frame with one row per pair, holding start_col, end_col, and the correlation value.
my_correlator <- function(mydf, start_col = 4, end_col = 0){
if(end_col == 0){
end_col <- ncol(mydf)
}
#out_corr_df <- data.frame(start_col = c(), end_col = c(), corr_val = c())
out_corr <- list()
for(i in (start_col+1):end_col){
out_corr[[i]] <- data.frame(start_col = start_col, end_col = i, corr_val = as.numeric(cor.test(mydf[, start_col], mydf[, i])$estimate))
}
return(do.call("rbind", out_corr))
}
test_run <- my_correlator(df, 4)
head(test_run)
# start_col end_col corr_val
# 1 4 5 -0.027508521
# 2 4 6 0.100414199
# 3 4 7 0.036648608
# 4 4 8 -0.050845418
# 5 4 9 -0.003625019
# 6 4 10 -0.058172227
The function takes a data.frame as input and returns another data.frame containing the correlations between a given column of the original data.frame and all subsequent columns. I do not know the structure of your data, and this function will fail if it runs into unexpected conditions (for instance, a character column among the columns to be correlated).
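If you also want the p-values and the column names, a variant along these lines might help. This is only a sketch on random stand-in data; `target`, `others`, and `res` are names I chose, and it assumes all columns involved are numeric.

```r
set.seed(1)

## Stand-in data with the same shape as in the question
df <- as.data.frame(matrix(rnorm(159 * 27), nrow = 159, ncol = 27))

target <- 4                       # column to correlate against
others <- (target + 1):ncol(df)   # columns 5 to 27

## One row per comparison: column name, Pearson r and p-value
res <- do.call(rbind, lapply(others, function(j) {
  ct <- cor.test(df[[target]], df[[j]], method = "pearson")
  data.frame(variable = names(df)[j],
             r = unname(ct$estimate),
             p_value = ct$p.value)
}))

head(res)
```

If only the coefficients are needed, `cor(df[[target]], df[others])` gives them in a single call.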

Concatenate columns in data frame

We have brand data in a column/variable, delimited by semicolons (;). Our task is to split this column into multiple columns, which we were able to do with the following syntax.
Attached the data as Screen shot.
Here is the R code:
x<-dataset$Pref_All
point<-df %>% separate(x, c("Pref_01","Pref_02","Pref_03","Pref_04","Pref_05"), ";")
point[is.na(point)] <- ""
However, our question is this: we have this type of brand data in more than 10 to 15 columns, and with the above syntax the number of columns to split into has to be decided from the number of brands each column holds (which we counted manually and set to 5 columns).
We would like to know whether there is a way to write the code dynamically, so that it calculates the maximum number of brands each column holds and creates that many new columns in the data frame, for example:
Pref_01,Pref_02,Pref_03,Pref_04,Pref_05.
the preferred output is given as a screen shot.
Thanks for the help in advance.
x <- c("Swift;Baleno;Ciaz;Scross;Brezza", "Baleno;swift;celerio;ignis", "Scross;Baleno;celerio;brezza", "", "Ciaz;Scross;Brezza")
strsplit(x,";")
library(dplyr)
library(tidyr)
x <- data.frame(ID = c(1,2,3,4,5),
Pref_All = c("S;B;C;S;B",
"B;S;C;I",
"S;B;C;B",
" ",
"C;S;B"))
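# as.character() works whether Pref_All is stored as a factor or as character
x$Pref_All <- as.character(x$Pref_All)
# Number of brands in each row; the maximum decides how many columns to create
b <- lengths(strsplit(x$Pref_All, ";"))
final_df <- x %>%
tidyr::separate(Pref_All, paste0("Pref_0", 1:max(b)), sep = ";", fill = "right")
final_df$ID <- x$Pref_All
final_df <- rename(final_df, Pref_All = ID)
final_df[is.na(final_df)] <- ""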
Pref_All Pref_01 Pref_02 Pref_03 Pref_04 Pref_05
1 S;B;C;S;B S B C S B
2 B;S;C;I B S C I
3 S;B;C;B S B C B
4
5 C;S;B C S B
The trick for the column names is given by paste0 going from 1 to the maximum number of brands in your data!
I would use str_split() which returns a list of character vectors. From that, we can work out the max number of preferences in the dataframe and then apply over it a function to add the missing elements.
library(stringr) # provides str_split()
df=data.frame("id"=1:5,
"Pref_All"=c("brand1", "brand1;brand2;brand3", "", "brand2;brand4", "brand5"))
spl = str_split(df$Pref_All, ";")
# Find the max number of preferences
maxl = max(unlist(lapply(spl, length)))
# Add missing values to each element of the list
spl = lapply(spl, function(x){c(x, rep("", maxl-length(x)))})
# Bind each element of the list in a data.frame
dfr = data.frame(do.call(rbind, spl))
# Rename the columns
names(dfr) = paste0("Pref_", 1:maxl)
print(dfr)
# Pref_1 Pref_2 Pref_3
#1 brand1
#2 brand1 brand2 brand3
#3
#4 brand2 brand4
#5 brand5
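Since the question mentions 10 to 15 such columns, the same idea can be wrapped in a small helper and applied to each column. This is a base R sketch, not a tested solution for your data; `split_prefs` and the example column names `Pref_All` and `Used_All` are hypothetical.

```r
## Split one delimited column into as many columns as its longest entry needs
split_prefs <- function(values, prefix, sep = ";") {
  parts <- strsplit(as.character(values), sep, fixed = TRUE)
  n_max <- max(lengths(parts), 1)
  ## Pad shorter entries with "" so that every row has n_max elements
  padded <- lapply(parts, function(p) c(p, rep("", n_max - length(p))))
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- sprintf("%s_%02d", prefix, seq_len(n_max))
  out
}

dat <- data.frame(
  id = 1:3,
  Pref_All = c("brand1;brand2;brand3", "", "brand2"),
  Used_All = c("brand4", "brand1;brand5", ""),
  stringsAsFactors = FALSE
)

## Apply the helper to every delimited column and bind the pieces together
cols <- c("Pref_All", "Used_All")
pieces <- lapply(cols, function(cl) split_prefs(dat[[cl]], sub("_All$", "", cl)))
result <- cbind(dat["id"], do.call(cbind, pieces))
result
```

Each source column gets its own number of new columns, so nothing has to be counted by hand.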

Importing one long line of data with spaces into R

This question is a followup to my previous question, Importing one long line of data into R.
I have a large data file consisting of a single line of text. The format resembles
Cat    14         15  Horse  16  
I'd eventually like to get it into a data.frame. In the above example I would end up with two variables, Animal and Number. The number of characters in each "line" is fixed, so in the example above each line contains 11 characters, animals being the first 7 and numbers being the next 4.
So what I'd like is a data frame that looks like:
Animal Number
Cat 14
NA 15
Horse 16
You can read the file with read.fwf, specifying the column widths and the number of columns:
inp.fwf <- read.fwf("tmp.txt", widths = rep(c(7, 4), times = 3), as.is = TRUE)
Here the argument times = 3 works for your sample data; for your real file, you'll have to indicate how many pairs there are and change times accordingly. If you don't know how many entries you have, this might work:
inp.rl <- readLines("tmp.txt")
nchar(inp.rl)/11
This will give you a data.frame with one row and many columns. You need to break that into many rows and two columns:
inp.mat <- matrix(inp.fwf, byrow = TRUE, ncol = 2)
This will get you the correct shape for your data. The animal names are stored as character vectors, which you'll probably want to change into factors, but at this point all the data is in R, so you can easily tweak it.
Solution with vectorized substring function.
x <- readLines(textConnection("Cat    14         15  Horse  16  "))
idx <- seq.int(1,nchar(x),by=11)
vsubstr <- Vectorize(substr,vectorize.args=c("start","stop"))
dat <- data.frame(Animal= vsubstr(x,idx,idx+6),
Number= as.numeric(vsubstr(x,idx+7,idx+10)))
Not sure what the 15 is all about; from the way you described the data, it should be animal-space-count-space-animal...
Anyway if the 15 should not be there here is one approach.
list1<-"Cat 14 Horse 16"
x <- unlist(strsplit(list1, " "))
x <- as.data.frame(matrix(x, length(x)/2, 2, byrow = TRUE))
x[, 2] <- as.numeric(as.character(x[, 2]))
x[, 1] <- as.character(x[, 1])
names(x) <-c('animal', 'count')
x
Assume you have a text file, test.dat, with repeated Animal Number pairs.
x <- scan("test.dat", what=list("", 0))
my.df <- data.frame(Animal = x[[1]], Number = x[[2]])
Tyler's use of read.fwf is perhaps cleaner, but here's another possible method.
x <- readLines(textConnection("Cat    14         15  Horse  16  "))
x <- matrix(strsplit(x, "")[[1]], nrow=11)
d <- data.frame(Animal = apply(x[1:7,], 2, paste, collapse=""),
Number = as.numeric(apply(x[8:11,], 2, paste, collapse="")))
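The fixed-width answers above all depend on the exact padding of the line, which is easy to lose when copying. Here is a sketch that builds the test line programmatically and cuts it with the vectorized `substring()`. It assumes, as in the question, 7 characters for the animal and 4 for the number, and it assumes an empty animal field should become NA; the variable names are my own.

```r
## Build three 11-character records: 7 for the animal, 4 for the number
records <- sprintf("%-7s%-4s", c("Cat", "", "Horse"), c("14", "15", "16"))
line <- paste(records, collapse = "")

## Start position of every record
starts <- seq(1, nchar(line), by = 11)

## Cut out both fields; substring() is vectorized over the positions
animal <- trimws(substring(line, starts, starts + 6))
number <- as.numeric(substring(line, starts + 7, starts + 10))

## Blank animal fields become NA
animals <- data.frame(Animal = ifelse(animal == "", NA, animal),
                      Number = number)
animals
```

For a real file, `line` would come from `readLines()` instead of `sprintf()`.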
