I want to apply a for loop on a non-specified range of rows in a tibble.
I have to modify the following code that applies a for loop on a specific number of rows in a tibble:
for(times in unique(dat$wname)[1:111]){...}
In this tibble the range 1:111 corresponds to a specific file; that is, the value in the column "File" repeats for 111 rows. However, in my data I do not know in advance how many times the same file repeats. For example, one file may repeat for 80 rows and another for 85. How do I tell the loop to look only at the range of rows in which the column File has the same name?
I need something to say:
for(times in unique(dat$wname)["for each row in the column File with the same name"]){...}
How can I do it?
You can count the number of rows with nrow (or columns with ncol) on your dat variable, or use length on unique(dat$wname), and do something like this:
rows <- nrow(dat) # or
rows <- length(unique(dat$wname))
for (times in unique(dat$wname)[1:rows]) {...}
But a reproducible example would make things a lot easier to understand/answer
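If the intent is to treat each distinct value of the File column separately, one sketch (untested, since we don't have your data; it only assumes the File and wname columns you describe) is to split on File first and loop within each piece:
for (f in unique(dat$File)) {
  dat_f <- dat[dat$File == f, ]          # rows belonging to this file only
  for (times in unique(dat_f$wname)) {
    # ... existing loop body ...
  }
}
This way no row counts need to be known in advance.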
I would like to check whether values in a column of a data frame fall within the range of values in another data frame in R. I can do that with sapply and fill in Yes or NA with ifelse. However, I want to find out exactly which row numbers (and ideally what content) of the first file produced those Yes rows. In other words, I want to know which rows of the first data frame each value of the second matched, and then get the contents of a specific column from those rows of the first file.
This is what I am using:
cigar2_count$Visit1_counts <- ifelse(sapply(cigar2_count$HPV_Position, function(p)
any(cigar1_count$minV <= p & cigar1_count$maxV >= p)),"YES", NA)
This is what I want to be able to do, but it gives me the content of the first file based on the row numbers of the second one, not the row in the first file that actually corresponded to each row of the second file:
cigar2_count$Visit1_counts <- ifelse(sapply(cigar2_count$HPV_Position, function(p)
any(cigar1_count$minV <= p & cigar1_count$maxV >= p)),cigar1_count$Unique_Read_Count, NA)
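For illustration, something along these lines is what I am after (just a sketch using the column names above, plus an illustrative new Visit1_row column; it keeps the row index of the first match in the first file):
match_row <- sapply(cigar2_count$HPV_Position, function(p) {
  idx <- which(cigar1_count$minV <= p & cigar1_count$maxV >= p)  # matching rows of the first file
  if (length(idx)) idx[1] else NA
})
cigar2_count$Visit1_row    <- match_row                                 # which row matched
cigar2_count$Visit1_counts <- cigar1_count$Unique_Read_Count[match_row] # its content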
Here is a sample data:
First file: I made columns for the 500 range of HPV_Position and named those min and max
Second file:
These are just samples though. The actual files are much larger.
Thanks!
I'm trying to analyze some data acquired from experimental tests with several variables being recorded. I've imported a dataframe into R and I want to obtain some statistical information by processing these data.
In particular, I want to fill in an empty dataframe with the same variable names of the imported dataframe but with statistical features like mean, median, mode, max, min and quantiles as rows for each variable.
The input dataframes are something like 60 columns x 250k rows each.
I've already managed to do this using apply as in the following lines of code for a single input file.
df[1,] <- apply(mydata,2,mean,na.rm=T)
df[2,] <- apply(mydata,2,sd,na.rm=T)
...
Now I need to do this in a for loop for a number of input files mydata_1, mydata_2, mydata_3, ... in order to build several summary statistics dataframes, one for each input file.
I have tried several different approaches with apply and assign, but I can't manage to access each row of interest in the output dataframes while cycling over the input files.
I would like to do something like the code below (I know this code does not work; it's just to give an idea of what I want to do).
The output df dataframes are already defined and empty.
for (xx in 1:number_of_mydata_files) {
df_xx[1,]<-apply(mydata_xx,2,mean,na.rm=T)
df_xx[2,]<-apply(mydata_xx,2,sd,na.rm=T)
...
}
I can't remember the exact error message, but the code above does not run at all.
I'm quite a beginner in R, so I don't have much experience with the language. Is there a way to do this? Are there other functions that could be used instead of apply and assign?
EDIT:
I am adding here a simple description of the input dataframes I'm using; sorry for the poor data visualization. Basically, the input dataframes are imported .csv files: tables whose first row holds the column names (the names of the measured variables) and whose following rows hold the acquired data. I have 250,000 acquisitions for each variable in each file, and something like 5-8 such files as input.
Current [A] | Force [N] | Elongation [%] | ...
------------|-----------|----------------|----
Value_a_1   | Value_b_1 | Value_c_1      | ...
I just want to obtain a data frame like this as an output, with the same variable names, but with statistical values as rows. For example, the first row, instead of being the first values acquired for each variable, would be the mean of the 250k acquisitions for each variable; the second row would be the standard deviation, the third the variance, and so on.
I’ve managed to build empty dataframes for the output summary statistics, with just the columns and no rows yet. I just want to fill them and do this iteratively in a for loop.
Not sure what your data looks like, but you can do the following, where lst represents your list of data frames:
lst <- list(iris[, -5], mtcars, airquality)
lapply(seq_along(lst),
       function(i) sapply(lst[[i]], function(x)
         data.frame(Mean = mean(x, na.rm = TRUE),
                    sd = sd(x, na.rm = TRUE))))
Or as suggested by #G. Grothendieck simply:
lapply(lst, sapply, function(x)
data.frame(Mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE)))
If all your files are in the same directory, set the working directory to it and use list.files() to walk along your input files.
If they share the same column names, you can rbind the result into a single data set.
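For example, a sketch of that workflow, assuming the inputs are .csv files sitting in the working directory (the file pattern and read.csv defaults are assumptions about your files):
files <- list.files(pattern = "\\.csv$")   # one entry per input file
lst   <- lapply(files, read.csv)           # list of data frames
stats <- lapply(lst, sapply, function(x)
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE)))
names(stats) <- files                      # one summary matrix per input file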
Given a table with a variety of values and lengths, what's the best way to create a dataframe for columnar analysis?
Example, given an unlabeled CSV that looks like this:
A,B,A,C
A,B,C,D,E,F
B,C,A,B,F,F,F
A,B
B,C,D
A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,Y,X,Z,AA,AB,AC
The goal will be to eventually assign a value to each letter based on what position it appears in.
Given the variable and unknown lengths of the rows, how should I approach this problem? Should I set up a dataframe with an absurdly large number of columns as a placeholder?
One option is to read each row as an element in a vector using readLines() -
x <- readLines("test.csv") # add appropriate path to the file
x
[1] "A,B,A,C" "A,B,C,D,E,F"
[3] "B,C,A,B,F,F,F" "A,B"
[5] "B,C,D" "A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,Y,X,Z,AA,AB,AC"
Now you can manipulate each element of this vector as you wish and then assemble the results in your desired structure. This way you don't have to "Set up a dataframe with an absurdly large number of columns as a placeholder".
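For example, one sketch of that continuation (splitting each line on commas and padding with NA so everything fits a single rectangular data frame):
parts  <- strsplit(x, ",")                 # one character vector per line
width  <- max(lengths(parts))              # the longest row sets the number of columns
padded <- lapply(parts, function(p) c(p, rep(NA, width - length(p))))
df     <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)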
I'm trying to update a bunch of columns by adding and subtracting SD to each value of the column. The SD is for the given column.
The code below is a reproducible example I came up with, but I feel it is not the most efficient way to do this. Could someone suggest a better way?
Essentially, there are 20 rows and 9 columns. I just need two separate dataframes: one with each column's values adjusted by adding that column's SD, and the other with that column's SD subtracted from each value.
##Example
##data frame containing 9 columns and 20 rows
Hi<-data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
## Standard deviation calculated for each column and stored in an object - I am not sure what this object is: a vector, a list, or a dataframe?
Hi_SD<-apply(Hi,2,sd)
#data frame converted to matrix to allow addition of SD to each value
Hi_Matrix<-as.matrix(Hi,rownames.force=FALSE)
#a new object created that will store values(original+1SD) for each variable
Hi_SDValues<-NULL
# variables re-created - each contains the sum of a column of the matrix and the corresponding element of Hi_SD. I have only done this for 2 columns for the purposes of this example; however, all columns would need to be recreated
Hi_SDValues$X1<-Hi_Matrix[,1]+Hi_SD[1]
Hi_SDValues$X2<-Hi_Matrix[,2]+Hi_SD[2]
#convert the object back to a dataframe
Hi_SDValues<-as.data.frame(Hi_SDValues)
##Repeat for one SD less
Hi_SDValues_Less<-NULL
Hi_SDValues_Less$X1<-Hi_Matrix[,1]-Hi_SD[1]
Hi_SDValues_Less$X2<-Hi_Matrix[,2]-Hi_SD[2]
Hi_SDValues_Less<-as.data.frame(Hi_SDValues_Less)
This is a job for sweep (type ?sweep in R for the documentation)
Hi <- data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
Hi_SD <- apply(Hi,2,sd)
Hi_SD_subtracted <- sweep(Hi, 2, Hi_SD)
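sweep subtracts by default; passing FUN explicitly gives both versions you asked for (this uses sweep's documented FUN argument, nothing new):
Hi_SD_added      <- sweep(Hi, 2, Hi_SD, FUN = "+")  # each column plus its SD
Hi_SD_subtracted <- sweep(Hi, 2, Hi_SD, FUN = "-")  # same as the default above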
You don't need to convert the dataframe to a matrix in order to add the SD
Hi<-data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
Hi_SD<-apply(Hi,2,sd) # Hi_SD is a named numeric vector
Hi_SDValues<-Hi # Creating a new dataframe that we will add the SDs to
# Loop through all columns (there are many ways to do this)
for (i in 1:9){
Hi_SDValues[,i]<-Hi_SDValues[,i]+Hi_SD[i]
}
# Do pretty much the same thing for the next dataframe
Hi_SDValues_Less <- Hi
for (i in 1:9){
Hi_SDValues_Less[,i]<-Hi_SDValues_Less[,i]-Hi_SD[i]
}
I have a huge dataframe of around 1M rows and want to split the dataframe based on one column & different ranges.
Example dataframe:
length<-sample(rep(1:400),100)
var1<-rnorm(1:100)
var2<-sample(rep(letters[1:25],4))
test<-data.frame(length,var1,var2)
I want to split the dataframe based on length at different ranges (ex: all rows for length between 1 and 50).
range_length<-list(1:50,51:100,101:150,151:200,201:250,251:300,301:350,351:400)
I can do this by subsetting from the dataframe, ex: test1 <- test[test$length > 1 & test$length < 50, ]
But I am looking for a more efficient way using "split" (just one line).
range = seq(0,400,50)
split(test, cut(test$length, range))
But do heed Justin's suggestion and look into using data.table instead of data.frame. I'll also add that it's very unlikely that you actually need to split the data.frame/table at all.
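For example, if the end goal is a per-range summary rather than eight separate objects, the split result can be used directly (a sketch using the example columns above; the mean of var1 is just an illustrative statistic):
range  <- seq(0, 400, 50)
pieces <- split(test, cut(test$length, range))
sapply(pieces, function(d) mean(d$var1))   # one value per length range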