Conditionally remove row from data frame using dates and means - r

I'd like to conditionally remove rows from a data frame using dates and means. In my example:
# Package
library(tidyverse)
# Open dataset
RES_all_files_better <- read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/RES_all_files_better_df.csv")
str(RES_all_files_better)
# 'data.frame': 507 obs. of 11 variables:
# $ STAND : chr "ARROIOXAVIER024B" "ARROIOXAVIER024B" "ARROIOXAVIER024B" "ARROIOXAVIER024B" ...
# $ ESPACAMENT: int 6 6 6 6 6 6 6 6 6 6 ...
# $ ESPECIE : chr "benthamii" "benthamii" "benthamii" "benthamii" ...
# $ IDADE : int 6 6 6 6 6 6 6 7 7 7 ...
# $ DATE_S2 : chr "2019-01-28" "2019-02-22" "2019-03-24" "2019-05-18" ...
# $ NDVI_avg : num 0.877 0.895 0.879 0.912 0.908 ...
# $ NDVI_sd : num 0.0916 0.0808 0.0758 0.1175 0.1132 ...
# $ NDVI_min : num -0.235 -0.1783 0.0844 -0.5666 -0.6093 ...
# $ NDVI_max : num 0.985 0.998 0.993 0.999 0.999 ...
# $ MONTH : int 1 2 3 5 7 8 9 11 12 12 ...
# $ NDVI_ref : num 0.823 0.823 0.823 0.823 0.823 ...
In my case, I am looking for an operation that removes rows from the data set if (NDVI_max+NDVI_min)/2 is lower than the NDVI_avg, grouped by (ESPACAMENT, ESPECIE, IDADE), at the date (DATE_S2) before the current date. An example for RES_all_files_better$STAND=="QUEBRACANGA012F":
# Original dataset:
STAND DATE_S2 NDVI_avg NDVI_min NDVI_max
...
208 QUEBRACANGA012F 2021-08-30 0.8748818 0.8238573 0.9072955
209 QUEBRACANGA012F 2021-11-08 0.5707210 0.2847520 0.8908801
210 QUEBRACANGA012F 2021-11-13 0.5515253 0.2275358 0.8940712
211 QUEBRACANGA012F 2021-12-28 0.5956103 0.2469136 0.9122636
212 QUEBRACANGA012F 2022-01-12 0.5952482 0.2084076 0.9031508
213 QUEBRACANGA012F 2022-01-22 0.5773518 0.2088580 0.8783236
214 QUEBRACANGA012F 2022-02-16 0.4246735 0.1674446 0.6224726
215 QUEBRACANGA012F 2022-02-26 0.4064463 0.1378491 0.6111995
#Final dataset:
STAND DATE_S2 NDVI_avg NDVI_min NDVI_max
...
208 QUEBRACANGA012F 2021-08-30 0.8748818 0.8238573 0.9072955
Rows 209 to 215 were removed because (NDVI_max+NDVI_min)/2 = 0.5878161 (the value for row 209) is lower than NDVI_avg = 0.8748818 at the last date, 2021-08-30.
Any help would be appreciated.

We may need to filter on the minimum of the computed value ('New') within each group:
library(dplyr)
RES_all_files_better %>%
# convert to `Date` class and create a sequence column for checking
mutate(rn = row_number(), DATE_S2 = as.Date(DATE_S2)) %>%
# grouped by columns
group_by(ESPACAMENT,ESPECIE,IDADE) %>%
# create computed column (midrange); the parentheses make sure the sum, not only NDVI_min, is halved
mutate(New = (NDVI_max + NDVI_min) / 2) %>%
# filter the rows where the NDVI_avg is greater than the minimum value
filter(NDVI_avg > min(New)) %>%
ungroup #%>%
# select(-rn, -New)
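The worked example in the question can also be read as a sequential rule: within a group, a row is dropped when its midrange (NDVI_max + NDVI_min) / 2 is below the NDVI_avg of the last row that was kept. A minimal base R sketch of that reading follows. It assumes grouping by STAND (as in the example) rather than by ESPACAMENT, ESPECIE and IDADE, and the helper name keep_by_last_avg is hypothetical. The toy rows are copied from the question.

```r
# Rows 208 to 215 of one stand, copied from the question.
toy <- data.frame(
  STAND    = "QUEBRACANGA012F",
  DATE_S2  = as.Date(c("2021-08-30", "2021-11-08", "2021-11-13", "2021-12-28",
                       "2022-01-12", "2022-01-22", "2022-02-16", "2022-02-26")),
  NDVI_avg = c(0.8748818, 0.5707210, 0.5515253, 0.5956103,
               0.5952482, 0.5773518, 0.4246735, 0.4064463),
  NDVI_min = c(0.8238573, 0.2847520, 0.2275358, 0.2469136,
               0.2084076, 0.2088580, 0.1674446, 0.1378491),
  NDVI_max = c(0.9072955, 0.8908801, 0.8940712, 0.9122636,
               0.9031508, 0.8783236, 0.6224726, 0.6111995)
)

# Hypothetical helper: walk through one group in date order and keep a row
# only if its midrange is not below the NDVI_avg of the last kept row.
keep_by_last_avg <- function(d) {
  if (nrow(d) == 0) return(d)
  d <- d[order(d$DATE_S2), ]
  mid <- (d$NDVI_max + d$NDVI_min) / 2
  keep <- logical(nrow(d))
  keep[1] <- TRUE                # the first date has nothing to compare with
  ref <- d$NDVI_avg[1]           # NDVI_avg of the last kept row
  for (i in seq_len(nrow(d))[-1]) {
    if (mid[i] >= ref) {
      keep[i] <- TRUE
      ref <- d$NDVI_avg[i]
    }
  }
  d[keep, ]
}

# Apply the helper per group (assumption: one group per STAND).
result <- do.call(rbind, lapply(split(toy, toy$STAND), keep_by_last_avg))
print(result)
```

On these rows only the 2021-08-30 observation survives, which matches the final dataset shown in the question. Whether this is the intended rule for the full data set is an assumption.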

Related

maximum value by groups using dplyr does not work in tsibble dataframe

I am working on the gafa_stock dataframe in the tsibbledata package. I want to find the maximum closing stock price for each of the four stocks in the dataframe. Since the dataframe has four stocks, I want a table with four rows, each row giving the maximum value of one stock. I followed the instructions in "Extract the maximum value within each group in a dataframe" and wrote this code:
gafa_stock %>%
group_by(Symbol) %>%
summarise(maximum = max(Close))
The structure of the gafa_stock dataframe, as shown by str(gafa_stock), is:
str(gafa_stock)
tsibble [5,032 x 8] (S3: tbl_ts/tbl_df/tbl/data.frame)
$ Symbol : chr [1:5032] "AAPL" "AAPL" "AAPL" "AAPL" ...
$ Date : Date[1:5032], format: "2014-01-02" "2014-01-03" "2014-01-06" ...
$ Open : num [1:5032] 79.4 79 76.8 77.8 77 ...
$ High : num [1:5032] 79.6 79.1 78.1 78 77.9 ...
$ Low : num [1:5032] 78.9 77.2 76.2 76.8 77 ...
$ Close : num [1:5032] 79 77.3 77.7 77.1 77.6 ...
$ Adj_Close: num [1:5032] 67 65.5 65.9 65.4 65.8 ...
$ Volume : num [1:5032] 5.87e+07 9.81e+07 1.03e+08 7.93e+07 6.46e+07 ...
- attr(*, "key")= tibble [4 x 2] (S3: tbl_df/tbl/data.frame)
..$ Symbol: chr [1:4] "AAPL" "AMZN" "FB" "GOOG"
..$ .rows : list<int> [1:4]
.. ..$ : int [1:1258] 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ : int [1:1258] 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 ...
.. ..$ : int [1:1258] 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 ...
.. ..$ : int [1:1258] 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 ...
.. ..# ptype: int(0)
..- attr(*, ".drop")= logi TRUE
- attr(*, "index")= chr "Date"
..- attr(*, "ordered")= logi TRUE
- attr(*, "index2")= chr "Date"
- attr(*, "interval")= interval [1:1] 1D
..# .regular: logi TRUE
My final result is not what I expected: this command creates a table that has all 5032 rows and three columns: Symbol, Date and the closing price labeled as maximum. What am I doing wrong? Is this because of some special characteristic of a ts or tsibble dataframe?
If the version of tsibble is < 0.9.3, we can convert to a tibble first, because the object carries other class attributes (tbl_ts) as well:
gafa_stock %>%
as_tibble %>%
group_by(Symbol) %>%
summarise(maximum = max(Close), .groups = 'drop')
-output
# A tibble: 4 x 2
# Symbol maximum
# <chr> <dbl>
#1 AAPL 232.
#2 AMZN 2040.
#3 FB 218.
#4 GOOG 1268.
In the newer version (0.9.3), it works without the conversion
gafa_stock %>%
group_by(Symbol) %>%
summarise(maximum = max(Close), .groups = 'drop')
# A tibble: 4 x 2
# Symbol maximum
# <chr> <dbl>
#1 AAPL 232.
#2 AMZN 2040.
#3 FB 218.
#4 GOOG 1268.
According to tsibble (0.9.2)
Each observation should be uniquely identified by index and key in a valid tsibble.
Here, the attribute for index is "Date"
attr(gafa_stock, "index")[1]
#[1] "Date"
I think this is what you want:
gafa_stock %>%
group_by(Symbol) %>%
filter(Close == max(Close))
Result:
# A tsibble: 4 x 8 [!]
# Key: Symbol [4]
# Groups: Symbol [4]
Symbol Date Open High Low Close Adj_Close Volume
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AAPL 2018-10-03 230. 233. 230. 232. 230. 28654800
2 AMZN 2018-09-04 2026. 2050. 2013 2040. 2040. 5721100
3 FB 2018-07-25 216. 219. 214. 218. 218. 58954200
4 GOOG 2018-07-26 1251 1270. 1249. 1268. 1268. 2405600
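For comparison, the same two results (one maximum per group, and the full row where the maximum occurs) can be sketched in base R on a plain data frame. The toy values below are invented for illustration and are not taken from gafa_stock. A real tsibble would first need to be converted to a plain table, for example with as_tibble as shown above.

```r
# Toy stand-in for a stock table (values invented for illustration).
stocks <- data.frame(
  Symbol = c("AAPL", "AAPL", "AMZN", "AMZN"),
  Date   = as.Date(c("2018-10-02", "2018-10-03", "2018-09-04", "2018-09-05")),
  Close  = c(229, 232, 2040, 1995)
)

# One row per Symbol holding the maximum closing price.
max_by_symbol <- aggregate(Close ~ Symbol, data = stocks, FUN = max)

# Full rows where the maximum is attained: ave() repeats each group's
# maximum along the rows of that group, so the comparison is row-wise.
is_max <- stocks$Close == ave(stocks$Close, stocks$Symbol, FUN = max)
max_rows <- stocks[is_max, ]

print(max_by_symbol)
print(max_rows)
```

The first result corresponds to the summarise() answers, the second to the filter(Close == max(Close)) answer.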

Calculation of Geometric Mean of Data that includes NAs

EDIT: The problem was not within the geoMean function, but with a wrong use of aggregate(), as explained in the comments
I am trying to calculate the geometric mean of multiple measurements for several different species, which includes NAs. An example of my data looks like this:
species <- c("Ae", "Ae", "Ae", "Be", "Be")
phen <- c(2, NA, 3, 1, 2)
hveg <- c(NA, 15, 12, 60, 59)
df <- data.frame(species, phen, hveg)
When I try to calculate the geometric mean per species with the geoMean function from the EnvStats package like this
library("EnvStats")
# columns 2:3 hold the measurements (phen, hveg); group by the species column of the same data frame
aggregate(df[, 2:3], list(df$species), geoMean, na.rm=TRUE)
it works well and skips the NAs to give me the geometric means per species.
Group.1 phen hveg
1 Ae 4.238536 50.555696
2 Be 1.414214 1.414214
When I do this with my large dataset, however, the function stumbles over NAs and returns NA as the result, even though there are, e.g., 10 numerical values and only one NA. This happens for example with the column SLA_mm2/mg.
My large data set looks like this:
> str(cut2trait1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 22 obs. of 19 variables:
$ Cut : chr "15_08" "15_08" "15_08" "15_08" ...
$ Block : num 1 1 1 1 1 1 1 1 1 1 ...
$ ID : num 451 512 431 531 591 432 551 393 511 452 ...
$ Plot : chr "1_1" "1_1" "1_1" "1_1" ...
$ Grazing : chr "n" "n" "n" "n" ...
$ Acro : chr "Leuc.vulg" "Dact.glom" "Cirs.arve" "Trif.prat" ...
$ Sp : chr "Lv" "Dg" "Ca" "Tp" ...
$ Label_neu : chr "Lv021" "Dg022" "Ca021" "Tp021" ...
$ PlantFunctionalType: chr "forb" "grass" "forb" "forb" ...
$ PlotClimate : chr "AC" "AC" "AC" "AC" ...
$ Season : chr "Aug" "Aug" "Aug" "Aug" ...
$ Year : num 2015 2015 2015 2015 2015 ...
$ Tiller : num 6 3 3 5 6 8 5 2 1 7 ...
$ Hveg : num 25 38 70 36 68 65 23 58 71 27 ...
$ Hrep : num 39 54 77 38 76 70 65 88 98 38 ...
$ Phen : num 8 8 7 8 8 7 6.5 8 8 8 ...
$ SPAD : num 40.7 42.4 48.7 43 31.3 ...
$ TDW_in_g : num 4.62 4.85 11.86 5.82 8.99 ...
$ SLA_mm2/mg : num 19.6 19.8 20.3 21.2 21.7 ...
and the result of my code
gm_cut2trait1 <- aggregate(cut2trait1[, 13:19], list(cut2trait1$Sp), geoMean, na.rm=TRUE)
is (only the first two rows):
Group.1 Tiller Hveg Hrep Phen SPAD TDW_in_g SLA_mm2/mg
1 Ae 13.521721 73.43485 106.67933 NA 28.17698 1.2602475 NA
2 Be 8.944272 43.95452 72.31182 5.477226 20.08880 0.7266361 9.309672
Here, the geometric mean of SLA for Ae is NA, even though there are 9 numeric measurements and only one NA in the column used to calculate the geometric mean.
I tried to use the geometric mean function suggested here:
Geometric Mean: is there a built-in?
But instead of NAs, this returned the value 1.000 when used with my big dataset, which doesn't solve my problem.
So my question is: What is the difference between my example df and the big dataset that throws the geoMean function off the rails?
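To make the aggregate() call from the example easy to check by hand, here is a minimal sketch that does not depend on EnvStats. The helper geo_mean is hypothetical and only stands in for geoMean. It computes exp(mean(log(x))) and returns NA when a non-positive value is present, which is a design choice of this sketch. The data-frame method of aggregate() applies the function to each column separately, so an NA in one column does not discard the other column's values for that row. The formula interface, by contrast, drops incomplete rows by default (na.action = na.omit).

```r
species <- c("Ae", "Ae", "Ae", "Be", "Be")
phen <- c(2, NA, 3, 1, 2)
hveg <- c(NA, 15, 12, 60, 59)
df <- data.frame(species, phen, hveg)

# Hypothetical stand-in for EnvStats::geoMean.
geo_mean <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  # the geometric mean is undefined for missing or non-positive input
  if (any(is.na(x)) || any(x <= 0)) return(NA_real_)
  exp(mean(log(x)))
}

# Group by species; na.rm = TRUE is passed on to geo_mean for every column.
gm <- aggregate(df[, c("phen", "hveg")],
                by = list(species = df$species),
                FUN = geo_mean, na.rm = TRUE)
print(gm)
```

For Ae, phen is the geometric mean of 2 and 3 and hveg is that of 15 and 12, so each column uses all of its own non-missing values.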

R: Extracting lines from dataframes in list and splitting into new dataframes

I have a list with 3 data frames (DvE, DvS, EvS) in it:
str(Table.list2)
List of 3
$ DvE:'data.frame': 18482 obs. of 4 variables:
..$ gene : Factor w/ 18482 levels "c10000_g1_i3|m.32237",..: 1 2 3 4 5 6 7 8 9 10 ...
..$ FDR : num [1:18482] 0.502 0.982 0.936 0.411 0.461 ...
..$ log2FC : num [1:18482] 0.415 -0.245 0.728 -0.384 0.474 ...
..$ annotation: Factor w/ 4939 levels "","[Genbank](myosin heavy-chain) kinase [Calothrix sp. PCC 6303] ",..: 1 2204 2980 2204 1 2204 4622 2980 1 241 ...
$ DvS:'data.frame': 18482 obs. of 4 variables:
..$ gene : Factor w/ 18482 levels "c10000_g1_i3|m.32237",..: 1 2 3 4 5 6 7 8 9 10 ...
..$ FDR : num [1:18482] 1.25e-01 7.18e-01 2.02e-01 2.72e-13 6.02e-01 ...
..$ log2FC : num [1:18482] -0.417 0.583 2.148 1.689 -0.167 ...
..$ annotation: Factor w/ 4939 levels "","[Genbank](myosin heavy-chain) kinase [Calothrix sp. PCC 6303] ",..: 1 2204 2980 2204 1 2204 4622 2980 1 241 ...
$ EvS:'data.frame': 18482 obs. of 4 variables:
..$ gene : Factor w/ 18482 levels "c10000_g1_i3|m.32237",..: 1 2 3 4 5 6 7 8 9 10 ...
..$ FDR : num [1:18482] 1.78e-03 6.04e-01 4.09e-01 3.42e-19 3.20e-02 ...
..$ log2FC : num [1:18482] -0.832 0.828 1.42 2.073 -0.641 ...
..$ annotation: Factor w/ 4939 levels "","[Genbank](myosin heavy-chain) kinase [Calothrix sp. PCC 6303] ",..: 1 2204 2980 2204 1 2204 4622 2980 1 241 ...
all 3 dataframes have similar structure, e.g.:
> head(Table.list2$DvE)
gene FDR log2FC annotation
1 c10000_g1_i3|m.32237 0.5024600 0.4149066
2 c10000_g1_i4|m.32240 0.9818297 -0.2449509 [Pfam]Calcium-activated chloride channel
3 c10000_g1_i4|m.32242 0.9361868 0.7277203 [Pfam]LSM domain
4 c10000_g1_i5|m.32244 0.4114795 -0.3835745 [Pfam]Calcium-activated chloride channel
5 c10000_g1_i6|m.32245 0.4605157 0.4739777
6 c10000_g1_i6|m.32246 0.4965353 -0.4607749 [Pfam]Calcium-activated chloride channel
What I'd like to do is in each data frame, take out data that has FDR < 0.05 and log2FC > 0 and put in a new data frame, and then take out data that has FDR < 0.05 and log2FC < 0 and put in another data frame.
So that from a list of 3 data frames, I'd get 6 new data frames that are named:
DvE.+
DvE.-
DvS.+
DvS.-
EvS.+
EvS.-
Example output of DvE.+:
gene FDR log2FC annotation
47 c10010_g1_i4|m.32346 8.609296e-15 1.9188013 [Genbank]conserved unknown protein [Ectocarpus siliculosus]
48 c10010_g1_i4|m.32348 5.625766e-09 1.8240089 [Genbank]hypothetical protein THAOC_07134 [Thalassiosira oceanica]
155 c10037_g1_i4|m.32582 2.666894e-02 0.6669399 [Pfam]LETM1-like protein
211 c10050_g2_i2|m.32706 8.154555e-03 1.6900611 [Genbank]hypothetical protein SELMODRAFT_84252 [Selaginella moellendorffii]
243 c10057_g1_i1|m.32812 1.936893e-02 0.8141790 [Pfam]Fibrinogen alpha/beta chain family
265 c10061_g4_i2|m.32861 3.614401e-02 1.7059034 [Pfam]Maf1 regulator
I was wondering if there's a more elegant way/loop that I can do all this in rather than repeatedly writing out similar command lines?
Update:
I tried doing this:
DEG.list <- lapply(Table.list2, function(i){
pos <- i[(i$FDR < 0.05 & i$log2FC > 0),]
neg <- i[(i$FDR < 0.05 & i$log2FC < 0),]
assign(paste(i, ".+", sep=""), value=pos)
assign(paste(i, ".-", sep=""), value=neg)
})
But I got these warnings:
Warning messages:
1: In assign(paste(i, ".+", sep = ""), value = pos) :
only the first element is used as variable name
2: In assign(paste(i, ".-", sep = ""), value = neg) :
only the first element is used as variable name
3: In assign(paste(i, ".+", sep = ""), value = pos) :
only the first element is used as variable name
4: In assign(paste(i, ".-", sep = ""), value = neg) :
only the first element is used as variable name
5: In assign(paste(i, ".+", sep = ""), value = pos) :
only the first element is used as variable name
6: In assign(paste(i, ".-", sep = ""), value = neg) :
only the first element is used as variable name
Not tested:
library(dplyr) # filtering the data
# iterate over the existing list so the names DvE, DvS, EvS are kept
alldf <- lapply(Table.list2, function(i) { # each element contains two filtered dataframes
pos <- filter(i, FDR < 0.05 & log2FC > 0) # significant, positive log2FC
neg <- filter(i, FDR < 0.05 & log2FC < 0) # significant, negative log2FC
list(`+` = pos, `-` = neg)
})
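A base R sketch of the same idea, which also produces the six names asked for (DvE.+, DvE.-, and so on) as elements of one flat list instead of separate variables created with assign(). The toy Table.list2 below is invented for illustration and has only two elements. The helper name split_by_sign is hypothetical.

```r
# Toy stand-in for Table.list2 (values invented for illustration).
Table.list2 <- list(
  DvE = data.frame(gene = c("g1", "g2", "g3"),
                   FDR = c(0.01, 0.20, 0.03),
                   log2FC = c(1.5, 2.0, -0.7)),
  DvS = data.frame(gene = c("g1", "g2", "g3"),
                   FDR = c(0.04, 0.01, 0.90),
                   log2FC = c(-1.1, 0.4, 0.3))
)

# Hypothetical helper: keep significant rows, then split them by sign.
split_by_sign <- function(d) {
  sig <- d[d$FDR < 0.05, ]
  list(`+` = sig[sig$log2FC > 0, ], `-` = sig[sig$log2FC < 0, ])
}

# unlist(recursive = FALSE) flattens one level only and joins the outer
# and inner names with a dot, giving "DvE.+", "DvE.-", ...
DEG.list <- unlist(lapply(Table.list2, split_by_sign), recursive = FALSE)
print(names(DEG.list))
```

Individual tables are then available as DEG.list[["DvE.+"]]. Keeping them in a list avoids the assign() call, whose first argument must be a single name rather than a whole data frame.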

Build a proper dataframe from a matrix list after importing .xlsx file

Implemented:
I am importing a .xlsx file into R.
This file consists of three sheets.
I am binding all the sheets into a list.
Need to implement:
Now I want to combine this list of matrices into a single data.frame, with names(dataset) as the header.
I tried using as.data.frame with read.xlsx as described in the help, but it did not work.
I also tried as.data.frame(as.table(dataset)) explicitly, but it still generates a long list of data frames and nothing like what I want.
I want a structure where the header holds the names and the values sit below it, just like how read.table imports the data.
This is the code I am using:
This is the code I am using:
xlfile <- list.files(pattern = "*.xlsx")
wb <- loadWorkbook(xlfile)
sheet_ct <- wb$getNumberOfSheets()
b <- rbind(list(lapply(1:sheet_ct, function(x) {
res <- read.xlsx(xlfile, x, as.data.frame = TRUE, header = TRUE)
})))
b <- b [-c(1),] # Just want to remove the second header
I want to have the data arrangement something like below.
Ei Mi hours Nphy Cphy CHLphy Nhet Chet Ndet Cdet DON DOC DIN DIC AT dCCHO TEPC Ncocco Ccocco CHLcocco PICcocco par Temp Sal co2atm u10 dicfl co2ppm co2mol pH
1 1 1 1 0.1023488 0.6534707 0.1053458 0.04994161 0.3308593 0.04991916 0.3307085 0.05042275 49.76304 14.99330000 2050.132 2150.007 0.9642220 0.1339044 0.1040715 0.6500288 0.1087667 0.1000664 0.0000000 9.900000 31.31000 370 0.01 -2.963256000 565.1855 0.02562326 7.879427
2 1 1 2 0.1045240 0.6448216 0.1103250 0.04988347 0.3304699 0.04984045 0.3301691 0.05085697 49.52745 14.98729000 2050.264 2150.007 0.9308690 0.1652179 0.1076058 0.6386706 0.1164099 0.1001396 0.0000000 9.900000 31.31000 370 0.01 -2.971632000 565.7373 0.02564828 7.879042
3 1 1 3 0.1064772 0.6369597 0.1148174 0.04982555 0.3300819 0.04976363 0.3296314 0.05130091 49.29323 14.98221000 2050.396 2150.007 0.8997098 0.1941872 0.1104229 0.6291149 0.1225822 0.1007908 0.8695131 9.900000 31.31000 370 0.01 -2.980446000 566.3179 0.02567460 7.878636
4 1 1 4 0.1081702 0.6299084 0.1187672 0.04976784 0.3296952 0.04968840 0.3290949 0.05175249 49.06034 14.97810000 2050.524 2150.007 0.8705440 0.2210289 0.1125141 0.6213265 0.1273103 0.1018360 1.5513170 9.900000 31.31000 370 0.01 -2.989259000 566.8983 0.02570091 7.878231
5 1 1 5 0.1095905 0.6239005 0.1221460 0.04971029 0.3293089 0.04961446 0.3285598 0.05220978 48.82878 14.97485000 2050.641 2150.007 0.8431960 0.2459341 0.1140222 0.6152447 0.1308843 0.1034179 2.7777070 9.900000
Please don't suggest putting all the data on a single sheet or converting the .xlsx to .csv or a simple text format. I am trying really hard to get a proper dataframe from a .xlsx file.
Following is the file
And this is the post following : Followup
This is what resulted:
str(full_data)
'data.frame': 0 obs. of 19 variables:
$ Experiment : Factor w/ 2 levels "#","1":
$ Mesocosm : Factor w/ 10 levels "#","1","2","3",..:
$ Exp.day : Factor w/ 24 levels "1","10","11",..:
$ Hour : Factor w/ 24 levels "108","12","132",..:
$ Temperature: Factor w/ 125 levels "10","10.01","10.02",..:
$ Salinity : num
$ pH : num
$ DIC : Factor w/ 205 levels "1582.2925","1588.6475",..:
$ TA : Factor w/ 117 levels "1813","1826",..:
$ DIN : Factor w/ 66 levels "0.2","0.3","0.4",..:
$ Chl.a : Factor w/ 156 levels "0.171","0.22",..:
$ PIC : Factor w/ 194 levels "-0.47","-0.96",..:
$ POC : Factor w/ 199 levels "-0.046","1.733",..:
$ PON : Factor w/ 151 levels "1.675","1.723",..:
$ POP : Factor w/ 110 levels "0.032","0.034",..:
$ DOC : Factor w/ 93 levels "100.1","100.4",..:
$ DON : Factor w/ 1 level "µmol/L":
$ DOP : Factor w/ 1 level "µmol/L":
$ TEP : Factor w/ 100 levels "10.4934","11.0053",..:
[Note: Above is the structure after reading from the .xlsx file. The factor levels make calculation and manipulation tedious and messy.]
This is what I want to achieve:
str(a)
'data.frame': 9936 obs. of 29 variables:
$ Ei : int 1 1 1 1 1 1 1 1 1 1 ...
$ Mi : int 1 1 1 1 1 1 1 1 1 1 ...
$ hours : int 1 2 3 4 5 6 7 8 9 10 ...
$ Cphy : num 0.653 0.645 0.637 0.63 0.624 ...
$ CHLphy : num 0.105 0.11 0.115 0.119 0.122 ...
$ Nhet : num 0.0499 0.0499 0.0498 0.0498 0.0497 ...
$ Chet : num 0.331 0.33 0.33 0.33 0.329 ...
$ Ndet : num 0.0499 0.0498 0.0498 0.0497 0.0496 ...
$ Cdet : num 0.331 0.33 0.33 0.329 0.329 ...
$ DON : num 0.0504 0.0509 0.0513 0.0518 0.0522 ...
$ DOC : num 49.8 49.5 49.3 49.1 48.8 ...
$ DIN : num 15 15 15 15 15 ...
$ DIC : num 2050 2050 2050 2051 2051 ...
$ AT : num 2150 2150 2150 2150 2150 ...
$ dCCHO : num 0.964 0.931 0.9 0.871 0.843 ...
$ TEPC : num 0.134 0.165 0.194 0.221 0.246 ...
$ Ncocco : num 0.104 0.108 0.11 0.113 0.114 ...
$ Ccocco : num 0.65 0.639 0.629 0.621 0.615 ...
$ CHLcocco: num 0.109 0.116 0.123 0.127 0.131 ...
$ PICcocco: num 0.1 0.1 0.101 0.102 0.103 ...
$ par : num 0 0 0.87 1.55 2.78 ...
$ Temp : num 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 9.9 ...
$ Sal : num 31.3 31.3 31.3 31.3 31.3 ...
$ co2atm : num 370 370 370 370 370 370 370 370 370 370 ...
$ u10 : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
$ dicfl : num -2.96 -2.97 -2.98 -2.99 -3 ...
$ co2ppm : num 565 566 566 567 567 ...
$ co2mol : num 0.0256 0.0256 0.0257 0.0257 0.0257 ...
$ pH : num 7.88 7.88 7.88 7.88 7.88 ...
[Note: sorry for the extra columns, this is another dataset (simple text), which I am reading from read.table]
With NA's handled:
> unique(mydf_1$Exp.num)
[1] # 1
Levels: # 1
> unique(mydf_2$Exp.num)
[1] # 2
Levels: # 2
> unique(mydf_3$Exp.num)
[1] # 3
Levels: # 3
> unique(full_data$Exp.num)
[1] 2 3 4
Without handling NA's:
> unique(full_data$Exp.num)
[1] 1 NA 2 3
> unique(full_data$Mesocosm)
[1] 1 2 3 4 5 6 7 8 9 NA
I think this is what you need. I added a few comments on what I am doing:
xlfile <- list.files(pattern = "*.xlsx")
wb <- loadWorkbook(xlfile)
sheet_ct <- wb$getNumberOfSheets()
for( i in 1:sheet_ct) { #read the sheets into 3 separate dataframes (mydf_1, mydf_2, mydf_3)
print(i)
variable_name <- sprintf('mydf_%s',i)
assign(variable_name, read.xlsx(xlfile, sheetIndex=i,startRow=1, endRow=209)) #using this you don't need to use my formula to eliminate NAs. but you need to specify the first and last rows.
}
colnames(mydf_1) <- names(mydf_2) #this was unclear. I chose the second sheet's
# names as column names but you can choose whichever you want in the same way (second and third column had the same names).
#some of the sheets were loaded with a few blank rows (full of NAs) which I remove
#with the following function according to the first column which is always populated
#according to what I see
remove_na_rows <- function(x) {
# count of non-NA entries; equals the last populated row when the blank rows are all at the end
sum(!is.na(x))
}
mydf_1 <- mydf_1[1:remove_na_rows(mydf_1$Exp.num),]
mydf_2 <- mydf_2[1:remove_na_rows(mydf_2$Exp.num),]
mydf_3 <- mydf_3[1:remove_na_rows(mydf_3$Exp.num),]
full_data <- rbind(mydf_1[-1,],mydf_2[-1,],mydf_3[-1,]) #making one dataframe here
# convert fields to numeric; going through as.character() makes factor columns yield their labels, not their level codes
full_data2 <- as.data.frame(lapply(full_data,function(x) as.numeric(as.character(x))))
full_data2$Ei <- as.integer(full_data2[['Ei']]) #use this to convert any column to integer
full_data2$Mi <- as.integer(full_data2[['Mi']])
full_data2$hours <- as.integer(full_data2[['hours']])
#*********code to use for removing NA rows *****************
#so if you rbind not caring about the NA rows you can use the below to get rid of them
#I just tested it and it seems to be working
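n_row <- NULL
for ( i in seq_len(nrow(full_data))) {
x <- full_data[i,]
if ( all(is.na(x)) ) {
n_row <- append(n_row,i)
}
}
# guard: with an empty n_row, full_data[-n_row,] would drop every row
if (length(n_row) > 0) full_data <- full_data[-n_row,]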
I think now this is what you need
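The two clean-up steps above (dropping rows that are entirely NA and turning factor columns into numbers) can also be sketched without a loop. The toy data frame below is invented for illustration and only borrows a few column names from the question. The helper to_num is hypothetical.

```r
# Toy stand-in for a sheet read from .xlsx, with one blank row (values invented).
raw <- data.frame(
  Exp.num     = factor(c("1", "1", NA, "2")),
  Temperature = factor(c("10.01", "10.2", NA, "9.9")),
  Salinity    = c(31.3, 31.3, NA, 31.1)
)

# A row is blank when it has no non-NA cell at all.
all_na <- rowSums(!is.na(raw)) == 0
clean <- raw[!all_na, , drop = FALSE]

# Hypothetical helper: factors go through as.character() first, because
# as.numeric() on a factor returns the internal level codes.
to_num <- function(x) {
  if (is.factor(x)) as.numeric(as.character(x)) else as.numeric(x)
}

# clean[] <- keeps the data.frame structure while replacing every column.
clean[] <- lapply(clean, to_num)
str(clean)
```

Logical indexing with !all_na is safe even when no blank row exists, which is not true of negative indexing with an empty vector.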

Splitting a large data frame into smaller segments

I have the following data frame and I want to break it up into 10 different data frames. I want to break the initial 100 row data frame into 10 data frames of 10 rows. I could do the following and get the desired results.
df = data.frame(one=c(rnorm(100)), two=c(rnorm(100)), three=c(rnorm(100)))
df1 = df[1:10,]
df2 = df[11:20,]
df3 = df[21:30,]
df4 = df[31:40,]
df5 = df[41:50,]
...
Of course, this isn't an elegant way to perform this task when the initial data frame is larger or when it doesn't divide evenly into segments.
So given the above, let's say we have the following data frame.
df = data.frame(one=c(rnorm(1123)), two=c(rnorm(1123)), three=c(rnorm(1123)))
Now I want to split it into new data frames of 200 rows each, with a final data frame holding the remaining rows. What would be a more elegant (aka 'quick') way to perform this task?
> str(split(df, (as.numeric(rownames(df))-1) %/% 200))
List of 6
$ 0:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -1.592 1.664 -1.231 0.269 0.912 ...
..$ two : num [1:200] 0.639 -0.525 0.642 1.347 1.142 ...
..$ three: num [1:200] -0.45 -0.877 0.588 1.188 -1.977 ...
$ 1:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -0.0017 1.9534 0.0155 -0.7732 -1.1752 ...
..$ two : num [1:200] -0.422 0.869 0.45 -0.111 0.073 ...
..$ three: num [1:200] -0.2809 1.31908 0.26695 0.00594 -0.25583 ...
$ 2:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -1.578 0.433 0.277 1.297 0.838 ...
..$ two : num [1:200] 0.913 0.378 0.35 -0.241 0.783 ...
..$ three: num [1:200] -0.8402 -0.2708 -0.0124 -0.4537 0.4651 ...
$ 3:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] 1.432 1.657 -0.72 -1.691 0.596 ...
..$ two : num [1:200] 0.243 -0.159 -2.163 -1.183 0.632 ...
..$ three: num [1:200] 0.359 0.476 1.485 0.39 -1.412 ...
$ 4:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -1.43 -0.345 -1.206 -0.925 -0.551 ...
..$ two : num [1:200] -1.343 1.322 0.208 0.444 -0.861 ...
..$ three: num [1:200] 0.00807 -0.20209 -0.56865 1.06983 -0.29673 ...
$ 5:'data.frame': 123 obs. of 3 variables:
..$ one : num [1:123] -1.269 1.555 -0.19 1.434 -0.889 ...
..$ two : num [1:123] 0.558 0.0445 -0.0639 -1.934 -0.8152 ...
..$ three: num [1:123] -0.0821 0.6745 0.6095 1.387 -0.382 ...
If some code might have changed the rownames it would be safer to use:
split(df, (seq(nrow(df))-1) %/% 200)
require(ff)
df <- data.frame(one=c(rnorm(1123)), two=c(rnorm(1123)), three=c(rnorm(1123)))
for(i in chunk(from = 1, to = nrow(df), by = 200)){
print(df[min(i):max(i), ])
}
If you can generate a vector that defines the groups, you can split anything:
f <- rep(seq_len(ceiling(1123 / 200)),each = 200,length.out = 1123)
> df1 <- split(df,f = f)
> lapply(df1,dim)
$`1`
[1] 200 3
$`2`
[1] 200 3
$`3`
[1] 200 3
$`4`
[1] 200 3
$`5`
[1] 200 3
$`6`
[1] 123 3
This chops df into batches of 1 million rows and appends one batch at a time to the table df in SQL (dbWriteTable comes from the DBI package; con is an open database connection):
batchsize = 1000000 # vary to your liking
# cycles through data by batchsize
for (i in 1:ceiling(nrow(df)/batchsize))
{
print(i) # just to show the progress
# below shows how to cycle through data
batch <- df[(((i-1)*batchsize)+1):(batchsize*i),,drop=FALSE] # drop = FALSE keeps it from being converted to a vector
# if below not done then the last batch has Nulls above the number of rows of actual data
batch <- batch[!is.na(batch$ID),] # ID is a variable I presume is in every row
#in this case the table already existed, if new table overwrite = TRUE
(dbWriteTable(con, "df", batch, append = TRUE,row.names = FALSE))
}
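A variant of the batching loop, sketched without a database so it can be run as is. It clamps the upper index to nrow(df), so the last batch never contains padded NA rows and no clean-up by ID is needed. The toy data and the batch size of 200 are assumptions chosen to match the 1123-row example from the question.

```r
# Toy data with the same number of rows as the example in the question.
df <- data.frame(ID = seq_len(1123), one = rnorm(1123))

batchsize <- 200
n_batches <- ceiling(nrow(df) / batchsize)
sizes <- integer(n_batches)

for (i in seq_len(n_batches)) {
  first <- (i - 1) * batchsize + 1
  last  <- min(i * batchsize, nrow(df))    # never index past the last row
  batch <- df[first:last, , drop = FALSE]  # drop = FALSE keeps a data frame
  sizes[i] <- nrow(batch)
  # a real loop would write `batch` to the database at this point
}

print(sizes)
```

The recorded sizes show five full batches and one final batch with the remaining 123 rows.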
Something like this...?
b <- seq(10, 100, 10)
lapply(seq_along(b), function(i) df[(b-9)[i]:b[i], ])
[[1]]
one two three
1 -2.4157992 -0.6232517 1.0531358
2 0.6769020 0.3908089 -1.9543895
3 0.9804026 -2.5167334 0.7120919
4 -1.2200089 0.5108479 0.5599177
5 0.4448290 -1.2885275 -0.7665413
6 0.8431848 -0.9359947 0.1068137
7 -1.8168134 -0.2418887 1.1176077
8 1.4475904 -0.8010347 2.3716663
9 0.7264027 -0.3573623 -1.1956806
10 0.2736119 -1.5553148 0.2691115
[[2]]
one two three
11 -0.3273536 -1.92475496 -0.08031696
12 1.5558892 -1.20158371 0.09104958
13 1.9202047 -0.13418754 0.32571632
14 -0.0515136 -2.15669216 0.23099397
15 0.1909732 -0.30802742 -1.28651457
16 0.8545580 -0.18238266 1.57093844
17 0.4903039 0.02895376 -0.47678196
18 0.5125400 0.97052082 -0.70541908
19 -1.9324370 0.22093545 -0.34436105
20 -0.5763433 0.10442551 -2.05597985
[[3]]
one two three
21 0.7168771 -1.22902943 -0.18728871
22 1.2785641 0.14686576 -1.74738091
23 -1.1856173 0.43829361 0.41269975
24 0.0220843 1.57428924 -0.80163986
25 -1.0012255 0.05520813 0.50871603
26 -0.1842323 -1.61195239 0.04843504
27 0.2328831 -0.38432225 0.95650710
28 0.8821687 -1.32456215 -1.33367967
29 -0.8902177 0.86414661 -1.39629358
30 -0.6586293 -2.27325919 0.27367902
[[4]]
one two three
31 1.3810437 -1.0178835 0.07779591
32 0.6102753 0.3538498 1.92316801
33 -1.5034439 0.7926925 2.21706284
34 0.8251638 0.3992922 0.56781321
35 -1.0832114 0.9878058 -0.16820827
36 -0.4132375 -0.9214491 1.06681472
37 -0.6787631 1.3497766 2.18327887
38 -3.0082585 -1.3047024 -0.04913214
39 -0.3433300 1.1008951 -2.02065141
40 0.6009334 1.2334421 0.15623298
[[5]]
one two three
41 -1.8608051 -0.08589437 0.02370983
42 -0.1829953 0.91139017 -0.01356590
43 1.1146731 0.42384993 -0.68717391
44 1.9039900 -1.70218225 0.06100297
45 -0.4851939 1.38712015 -1.30613414
46 -0.4661664 0.23504099 -0.29335162
47 0.5807227 -0.87821946 -0.14816121
48 -2.0168910 -0.47657382 0.90503226
49 2.5056404 0.27574224 0.10326333
50 0.2238735 0.34441325 -0.17186115
[[6]]
one two three
51 1.51613140 -2.5630782 -0.6720399
52 0.03859537 -2.6688365 0.3395574
53 -0.08695292 -0.5114117 -0.1378789
54 -0.51878363 -0.5401962 0.3946324
55 -2.20482710 0.1716744 0.1786546
56 -0.28133749 -0.4497112 0.5936497
57 -2.38269088 -0.4625695 1.0048914
58 0.37865952 0.5055141 0.3337986
59 0.09329172 0.1560469 0.2835735
60 -1.10818863 -0.2618910 0.3650042
[[7]]
one two three
61 -1.2507208 -1.5050083 -0.63871084
62 0.1379394 0.7996674 -1.80196762
63 0.1582008 -0.3208973 0.40863693
64 -0.6224605 0.1416938 -0.47174711
65 1.1556149 -1.4083576 -1.12619693
66 -0.6956604 0.7994991 1.16073748
67 0.6576676 1.4391007 0.04134445
68 1.4610598 -1.0066840 -1.82981058
69 1.1951788 -0.4005535 1.57256648
70 -0.1994519 0.2711574 -1.04364396
[[8]]
one two three
71 1.23897065 0.4473611 -0.35452535
72 0.89015916 2.3747385 0.87840852
73 -1.17339703 0.7433220 0.40232381
74 -0.24568490 -0.4776862 1.24082294
75 -0.47187443 -0.3271824 0.38542703
76 -2.20899136 -1.1131712 -0.33663075
77 -0.05968035 -0.6023045 -0.23747388
78 1.19687199 -1.3390960 -1.37884241
79 -1.29310506 0.3554548 -0.05936756
80 -0.17470891 1.6198307 0.69170207
[[9]]
one two three
81 -1.06792315 0.04801998 0.08166394
82 0.84152560 -0.45793907 0.27867619
83 0.07619456 -1.21633682 -2.51290495
84 0.55895466 -1.01844178 -0.41887672
85 0.33825508 -1.15061381 0.66206732
86 -0.36041720 0.32808609 -1.83390913
87 -0.31595401 -0.87081019 0.45369366
88 0.92331087 1.22055348 -1.91048757
89 1.30491142 1.22582353 -1.32244004
90 -0.32906839 1.76467263 1.84479228
[[10]]
one two three
91 2.80656707 -0.9708417 0.25467304
92 0.35770119 -0.6132523 -1.11467041
93 0.09598908 -0.5710063 -0.96412216
94 -1.08728715 0.3019572 -0.04422049
95 0.14317455 0.1452287 -0.46133199
96 -1.00218917 -0.1360570 0.88864256
97 -0.25316855 0.6341925 -1.37571664
98 0.36375921 1.2244921 0.12718650
99 0.13345555 0.5330221 -0.29444683
100 2.28548261 -2.0413222 -0.53209956
