A VERY simplified example of my dataset:
1: 10010001 1961 78.2
2: 10010001 1962 84.0
3: 10010001 1963 70.2
4: 10010001 1964 130.5
5: 10010001 1965 54.3
I found this code online which sort of, but not quite, does what I want:
#create a list of the files from your target directory
file_list <- list.files(path="~/Desktop/Rprojects")
#initiate a blank data frame, each iteration of the loop will append the data from the given file to this variable
allHUCS <- data.frame()
#I want to read each .csv from a folder named "Rprojects" on my desktop into one huge dataframe for further use.
for (i in 1:length(file_list)){
  temp_data <- fread(file_list[i], stringsAsFactors = F)
  allHUCS <- rbindlist(list(allHUCS, temp_data), use.names = T)
}
Question: I have read that one should not use rbindlist for a large dataset:
"You should never ever ever iteratively rbind within a loop: performance might be okay in the beginning, but with each call to rbind it makes a complete copy of the data, so with each pass the total data to copy increases. It scales horribly. Consider do.call(rbind, file_list)." – @r2evans
I know this may seem simple but I'm unclear about how to use his directive. Would I write this for the last line?
allHUCS <- do.call(rbind, temp_data), use.names = T)
Or something else? In my actual data, each .csv has 2099 objects with 3 variables (but I only care about the last two.) The total dataframe should contain 47,000,000+ objects of 2 variables. When I ran the original code I got these errors:
Error in rbindlist(list(allHUCS, temp_data), use.names = T) : Item 2
has 2 columns, inconsistent with item 1 which has 3 columns. To fill
missing columns use fill=TRUE.
In addition: Warning messages: 1: In fread(file_list[i],
stringsAsFactors = F) : Detected 1 column names but the data has 2
columns (i.e. invalid file). Added 1 extra default column name for the
first column which is guessed to be row names or an index. Use
setnames() afterwards if this guess is not correct, or fix the file
write command that created the file to create a valid file.
2: In fread(file_list[i], stringsAsFactors = F) : Stopped early on
line 20. Expected 2 fields but found 3. Consider fill=TRUE and
comment.char=. First discarded non-empty line: <<# mv *.csv .. ; >>
Except for the setnames() suggestion, I don't understand what I'm being told. I know it says it stopped early, but I don't even know how to see the entire dataset or to tell where it stopped.
I'm now reading that rbindlist and rbind are two different things and rbindlist is faster than do.call(rbind, data). But the suggestion is do.call(rbind, temp_data). Which is going to be fastest?

Since the original post does not include a reproducible example, here is one that uses the Pokémon Stats data that I maintain on GitHub.
First, we download a zip file containing one CSV file for each generation of Pokémon, and unzip it to the ./pokemonData subdirectory of the R working directory.
Next, we obtain a list of files in the directory to which we unzipped the CSV files.
thePokemonFiles <- list.files("./pokemonData",
                              full.names = TRUE)
Finally, we load the data.table package, use lapply() with data.table::fread() to read the files, combine the resulting list of data tables with do.call() and rbind(), and print the head() and tail() of the resulting data table with all 8 generations of Pokémon stats.
library(data.table)
data <- do.call(rbind, lapply(thePokemonFiles, fread))
...and the output:
> head(data)
ID Name Form Type1 Type2 Total HP Attack Defense Sp. Atk Sp. Def Speed
1: 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
2: 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
3: 3 Venusaur Grass Poison 525 80 82 83 100 100 80
4: 4 Charmander Fire 309 39 52 43 60 50 65
5: 5 Charmeleon Fire 405 58 64 58 80 65 80
6: 6 Charizard Fire Flying 534 78 84 78 109 85 100
Generation
1: 1
2: 1
3: 1
4: 1
5: 1
6: 1
> tail(data)
ID Name Form Type1 Type2 Total HP Attack Defense Sp. Atk
1: 895 Regidrago Dragon 580 200 100 50 100
2: 896 Glastrier Ice 580 100 145 130 65
3: 897 Spectrier Ghost 580 100 65 60 145
4: 898 Calyrex Psychic Grass 500 100 80 80 80
5: 898 Calyrex Ice Rider Psychic Ice 680 100 165 150 85
6: 898 Calyrex Shadow Rider Psychic Ghost 680 100 85 80 165
Sp. Def Speed Generation
1: 50 80 8
2: 110 30 8
3: 80 130 8
4: 80 80 8
5: 130 50 8
6: 100 150 8
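Applied to the folder in the question, the same read-then-bind pattern could look roughly like the sketch below. It assumes the data.table package is installed; the temporary folder, file names and values are invented for illustration and are not the asker's data. rbindlist() is swapped in for do.call() with rbind() and is called once on the whole list rather than inside a loop. The pattern argument restricts list.files() to .csv files, since the fread() warning in the question suggests a non-CSV file may have been picked up, and full.names = TRUE keeps the folder in each path.

```r
library(data.table)

# Hypothetical demo folder with three small CSV files (not the asker's data)
demo_dir <- file.path(tempdir(), "huc_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (k in 1:3) {
  fwrite(
    data.table(huc = 10010001L, year = 1961:1963, value = c(78.2, 84.0, 70.2) + k),
    file.path(demo_dir, sprintf("huc_%d.csv", k))
  )
}

# Only pick up .csv files, and keep the full path so fread() can find them
file_list <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file once, then bind the whole list in a single call
allHUCS <- rbindlist(lapply(file_list, fread), use.names = TRUE, fill = TRUE)

# Keep only the last two columns, as the question asks
allHUCS <- allHUCS[, .(year, value)]
```

Because the list is bound once, the accumulated data is not copied again on every iteration, which is the problem the quoted comment warns about.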


Indexing through 'names' in a list and performing actions on contained values in R

I have a data set of counts from standard solutions passed through an instrument that analyses chemical concentrations (an ICPMS for those familiar). The data is over a range of different standards and for each standard I have four repeat measurements that I want to calculate the mean and variance of.
I'm importing the data from an Excel spreadsheet and then, following some housekeeping such as getting dates and times in the right format, I split the dataset up into a list identified by the name of the standard solution using Count11.sp<-split(Count11.raw, Count11.raw$Type). Count11.raw$Type then becomes the list element name and I have the four count results for each chemical element in that list element.
So far so good.
I find I can yield an average (mean, median etc) easily enough by identifying the list element specifically i.e. mean(Count11.sp$'Ca40') , or sapply(Count11$'Ca40', median), but what I'm not able to do is automate that in a loop so that I can calculate the means for each standard and drop that into a numerical matrix for further manipulation. I can extract the list element names with names() and I can even use a loop to make a vector of all the names and reference the specific list element using these in a for loop.
For instance Count11.sp[names(Count11.sp[i])] will extract the full list element no problem:
$`Post Ca45t`
Type Run Date 7Li 9Be 24Mg 43Ca 52Cr 55Mn 59Co 60Ni
77 Post Ca45t 1 2011-02-08 00:13:08 114 26101 4191 453525 2632 520 714 2270
78 Post Ca45t 2 2011-02-08 00:13:24 114 26045 4179 454299 2822 524 704 2444
79 Post Ca45t 3 2011-02-08 00:13:41 96 26372 3961 456293 2898 520 762 2244
80 Post Ca45t 4 2011-02-08 00:13:58 112 26244 3799 454702 2630 510 792 2356
65Cu 66Zn 85Rb 86Sr 111Cd 115In 118Sn 137Ba 140Ce 141Pr 157Gd 185Re 208Pb
77 244 1036 56 3081 44 520625 78 166 724 10 0 388998 613
78 250 982 70 3103 46 526154 76 174 744 16 4 396496 644
79 246 1014 36 3183 56 524195 60 198 744 2 0 396024 612
80 270 932 60 3137 44 523366 70 180 824 2 4 390436 632
77 24
78 20
79 14
80 6
but sapply(Count11.sp[names(Count11.sp[i])], median) produces an error message: Error in median.default(X[[i]], ...) : need numeric data
while sapply(Input$`Post Ca45t`, median) ('Post Ca45t' being the name of Count11.sp[i] for i=4) does exactly what I want and produces the median value (I can clean that vector up later for medians that don't make sense) e.g.
Type Run Date 7Li 9Be 24Mg
NA 2.5 1297109612.5 113.0 26172.5 4070.0
43Ca 52Cr 55Mn 59Co 60Ni 65Cu
454500.5 2727.0 520.0 738.0 2313.0 248.0
66Zn 85Rb 86Sr 111Cd 115In 118Sn
998.0 58.0 3120.0 45.0 523780.5 73.0
137Ba 140Ce 141Pr 157Gd 185Re 208Pb
177.0 744.0 6.0 2.0 393230.0 622.5
Can anyone give me any insight into how I can automate (i.e. loop through) these names to produce one median vector per list element? I'm sure there's just some simple disconnect in my logic here that may be easily solved.
Update: I've solved the problem. The way to do so is to use tapply on the original dataset without the need to split it. tapply allows functions to be applied to data based on a user-defined grouping criterion. In my case I could group according to Count11.raw$Type and then take the mean of each data subset. tapply(Count11.raw$Type, Count11.raw[,3:ncol(Count11.raw)], mean), job done.
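For anyone landing here with the same problem, a small sketch of two ways to get one summary row per standard. Single brackets, as in Count11.sp[names(Count11.sp[i])], return a list that still contains a whole data frame, so median() is handed a data frame rather than a numeric column, which is a likely source of the "need numeric data" error. Note also that tapply() expects the values first and the grouping factor second and works on one vector at a time, so aggregate() is used below for several columns at once. The data frame is a hypothetical stand-in: the 'Post Ca45t' values are the 7Li and 9Be counts shown above, while the 'Ca40' values and the column names Li7 and Be9 are invented.

```r
# Hypothetical stand-in for Count11.raw: Type, Run, then numeric count columns
Count11.raw <- data.frame(
  Type = rep(c("Ca40", "Post Ca45t"), each = 4),
  Run  = rep(1:4, times = 2),
  Li7  = c(100, 104, 98, 102, 114, 114, 96, 112),
  Be9  = c(25000, 25100, 24900, 25000, 26101, 26045, 26372, 26244)
)

Count11.sp <- split(Count11.raw, Count11.raw$Type)

# Restrict to the numeric count columns before summarising
count_cols <- c("Li7", "Be9")

# Loop over the list elements: one row per standard, one column per isotope
medians <- t(sapply(Count11.sp, function(d) sapply(d[count_cols], median)))

# Same idea on the unsplit data, grouping by Type
means <- aggregate(Count11.raw[count_cols],
                   by = list(Type = Count11.raw$Type),
                   FUN = mean)
```

medians is a numeric matrix with the standard names as row names, so it can be indexed as medians["Post Ca45t", "Li7"] for further manipulation.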

Change one specific value in a data table in R [duplicate]

Here is my code
nutrients<- read.csv("nutrients.csv", head = TRUE, sep = ",")
> plot(nutrients)
> head(nutrients)
crop Nutrient.dens N..tons.acre. P2O5 K2O sum.nut
1 broccoli 340.0 210 245 100 555
2 carrot 458.0 70 250 50 370
3 cauliflower 315.0 25 35 80 140
4 letuce 318.5 165 150 90 405
5 onion 109.0 120 30 150 300
6 tomato 186.0 175 85 275 535
> df_nutrients<-
> df_nutrients<- df_nutrients[1,1=="broc"]
I am sure this is easy, and I've tried searching everywhere I can think of, but I cannot find the answer. I just need to change that one value to "broc". Is there a specific function I need or something?
If crop is a character type, then a simple subset should work
nutrients$crop[nutrients$crop == "broccoli"] <- "broc"
If crop is a factor, then use this:
levels(nutrients$crop)[levels(nutrients$crop) == "broccoli"] <- "broc"
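A minimal sketch showing both cases side by side. The two small data frames are cut-down, hypothetical versions of nutrients (three crops and the sum.nut column only); the name nutrients_fct is invented for the factor variant.

```r
# Character column: plain logical subsetting is enough
nutrients_chr <- data.frame(
  crop = c("broccoli", "carrot", "cauliflower"),
  sum.nut = c(555, 370, 140),
  stringsAsFactors = FALSE
)
nutrients_chr$crop[nutrients_chr$crop == "broccoli"] <- "broc"

# Factor column: rename the level instead of assigning a new value
nutrients_fct <- data.frame(
  crop = factor(c("broccoli", "carrot", "cauliflower")),
  sum.nut = c(555, 370, 140)
)
levels(nutrients_fct$crop)[levels(nutrients_fct$crop) == "broccoli"] <- "broc"
```

Checking is.factor(nutrients$crop) first tells you which of the two forms applies; assigning a value that is not an existing level directly into a factor would produce NA with a warning.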

groups of different size randomly selected within different classes

I have such a difficult question (at least to me) that I spent 2 hours just writing it. It is completely impossible for me to program it by myself. I'll try to be very clear, and I'm sorry if I'm not. I'm doing this in a very crude way in Excel, but I really need to program it.
I have a data.frame like this
id_pix id_lote clase f1 f2
45 4 Sg 2460 2401
46 4 Sg 2620 2422
47 4 Sg 2904 2627
48 5 M 2134 2044
49 5 M 2180 2104
50 5 M 2127 2069
83 11 S 2124 2062
84 11 S 2189 2336
85 11 S 2235 2162
86 11 S 2162 2153
87 11 S 2108 2124
with 17451 "id_pixel"(rows), 2080 "id_lote" and 9 "clase"
this is the "id_lote" count per "clase" (v1 is the id_lote count)
clase v1
1: S 1099
2: P 213
3: Sg 114
4: M 302
5: Alg 27
6: Az 77
7: Po 228
8: Cit 13
9: Ma 7
I need to split the "id_lote" randomly within each "clase". I mean, I have 1099 "id_lote" for the "S" "clase", which amount to 9339 "id_pixel" (rows), and I want to randomly select 50 % of those "id_lote", however many "id_pixel" (rows) that turns out to be. I want to do this for every "clase", considering that the size (number of "id_lote") of every "clase" is different. I would also like to be able to change the size of the selection (50 %, 30 %, etc.), and I also want to keep the set of "id_lote" that were not selected. I hope someone can help me with this!
here is the reproducible example
this is the data with 2 clase (S and Az), with 6 id_lote and 13 id_pixel
id_pix id_lote clase f1 f2
1 1 S 2909 2381
2 1 S 2515 2663
3 1 S 2628 3249
30 2 S 3021 2985
31 2 S 3020 2596
71 9 S 4725 4404
72 9 S 4759 4943
75 11 S 2728 2225
218 21 Az 4830 3007
219 21 Az 4574 2761
220 21 Az 5441 3092
1155 126 Az 7209 2449
1156 126 Az 7035 2932
and one result could be:
id_pix id_lote clase f1 f2
1 1 S 2909 2381
2 1 S 2515 2663
3 1 S 2628 3249
75 11 S 2728 2225
1155 126 Az 7209 2449
1156 126 Az 7035 2932
where 50% of the id_lote were randomly selected in clase "S" (2 of 4 id_lote) but all the id_pixel in the selected id_lote were kept. The same goes for clase "Az": one id_lote was randomly selected (1 of 2 in this case) and all the id_pixel in the selected id_lote were kept.
What colemand77 proposed helped a lot. I think the dplyr package is useful for this, but I think that if I do
df %>%
group_by(clase, id_lote) %>%
sample_frac(.3, replace = FALSE)
I get 30 % of the data of each clase, but not grouped by id_lote like I need! I mean, 30 % of the rows (id_pixel) were selected instead of 30 % of the id_lote.
I hope this example helps to explain what I want to do and makes it useful for everybody. I'm sorry if I wasn't clear enough the first time.
Thanks a lot!
At first glance I'd say the dplyr package is your friend here.
df %>%
group_by(clase, id_lote) %>%
sample_frac(.3, replace = FALSE)
so you first use group_by() and include the grouping levels you want to sample from, then you use sample_frac to sample the fraction of the results you want for each group.
As near as I can tell this is what you are asking for. If not, please consider re-stating your question to include either a reproducible example or clarify. Cheers.
To "keep" the not-selected members, I would add a column of unique ids and use anti_join() (also from the dplyr package) to find the ids that are not in common between the two data.frames (the result of the sampling and the original).
## Update ##
I'm understanding better now, I believe. Think about this as a two step process...
1) you want to select x% (50 in example) of the id_lote from each clase and return those id_lote #s (i'm assuming that a given id_lote does not exist for multiple clase?)
2) you want to see all of the id_pixels that correspond to each id_lote, all in one data.frame
I've broken this down into multiple steps for illustration, not because it is the fastest / prettiest.
raw data: (couldn't read your data into R.)
df<-data.frame(id_pix = c(1:200),
id_lote = sample(1:20,200, replace = TRUE),
clase = sample(letters[seq_along(1:10)], 200, replace = TRUE),
f1 = sample(1000:2000,200, replace = TRUE),
f2 = sample(2000:3000,200, replace = TRUE))
1) Figure out which id_lote correspond to which clase. For this we use the dplyr summarise function and store the result in a variable:
summary <- df %>%
  ungroup() %>%
  group_by(clase, id_lote) %>%
  summarise()
Source: local data frame [125 x 2]
Groups: clase
clase id_lote
1 a 1
2 a 2
3 a 4
4 a 5
5 a 6
6 a 7
7 a 8
8 a 9
9 a 11
10 a 12
.. ... ...
then we sample to get the 30% of the id_lote for each clase..
sampled_summary <- summary %>%
group_by(clase) %>%
sample_frac(.3,replace = FALSE)
so the result of this is a data table with two columns, (clase and id_lote) with 30% of the id_lotes shown for each clase.
2) OK, so now we have the id_lote randomly selected from each clase, but not the id_pix that are associated with them. To accomplish this we do a join to get the corresponding full data set including the id_pix, etc.
result <- sampled_summary %>%
  left_join(df, by = c("clase", "id_lote"))
The above makes several copies of the data set, so if you have a substantial data set you could just do it all in one go:
result <- df %>%
  ungroup() %>%
  group_by(clase, id_lote) %>%
  summarise() %>%
  group_by(clase) %>%
  sample_frac(.5, replace = FALSE) %>%
  left_join(df, by = c("clase", "id_lote"))
if this doesn't get you what you want, let me know and we'll take another crack at it.
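For reference, here is a compact sketch of the two-step idea run on the reproducible example from the question, which also keeps the not-selected id_lote. It assumes dplyr 1.0 or later and swaps in slice_sample(), semi_join() and anti_join() for the sample_frac() and left_join() calls discussed above; the names frac, picked, selected and not_selected are made up for illustration.

```r
library(dplyr)

# The 13-row reproducible example from the question
df <- data.frame(
  id_pix  = c(1, 2, 3, 30, 31, 71, 72, 75, 218, 219, 220, 1155, 1156),
  id_lote = c(1, 1, 1, 2, 2, 9, 9, 11, 21, 21, 21, 126, 126),
  clase   = c(rep("S", 8), rep("Az", 5)),
  f1 = c(2909, 2515, 2628, 3021, 3020, 4725, 4759, 2728, 4830, 4574, 5441, 7209, 7035),
  f2 = c(2381, 2663, 3249, 2985, 2596, 4404, 4943, 2225, 3007, 2761, 3092, 2449, 2932)
)

frac <- 0.5   # share of id_lote to draw within each clase
set.seed(1)   # only to make the random draw repeatable

# Step 1: one row per (clase, id_lote), then sample lotes within each clase
picked <- df %>%
  distinct(clase, id_lote) %>%
  group_by(clase) %>%
  slice_sample(prop = frac) %>%
  ungroup()

# Step 2: keep every pixel of the picked lotes, and the complement
selected     <- semi_join(df, picked, by = c("clase", "id_lote"))
not_selected <- anti_join(df, picked, by = c("clase", "id_lote"))
```

slice_sample(prop = ...) rounds the group size down, so a clase with very few id_lote can end up with none selected at small fractions; changing frac to 0.3 changes the selection size for every clase at once.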

Combining Two Rows with Different Levels according to Some Conditions into One in R

This is a part of my data: (The actual data contains about 10,000 observations with about 500 levels of SalesItem)
S<-data.frame(SalesItem=factor(s1), Sales=s2)
> str(S)
'data.frame': 10 obs. of 2 variables:
$ SalesItem: Factor w/ 10 levels "1008","1009",..: 1 2 3 4 5 6 7 8 9 10
$ Sales : num 155 153 154 150 176 165 159 143 179 150
What I want to do is: if diff(SalesItem) = 1, combine the levels of SalesItem into one. For example, the difference between SalesItem 1008 and 1009 is one, so I want to rename SalesItem 1009 to 1008. Later I can then compute the sum of Sales for this SalesItem as one. Because my actual data has about 10,000 observations, it is quite hard for me to do this one by one.
Is there a simple way for me to do that?
Clearly, the fact that you have converted the first column to a factor indicates that you might need those factors somewhere. So I would suggest that, instead of changing any of the columns, you add a third column to your data frame which will help you maintain the SalesItem relevant to that value. Here are the steps for it:
> s1<-c('1008','1009','1012','1013','1016','1017','1018','1019','1054','1055')
> s2<-c(155,153,154,150,176,165,159,143,179,150)
> s1 = as.integer(s1)
> s3 = ifelse((s1-1) %in% s1, s1-1, s1)
> S <- data.frame(SalesItem=s1, Sales=s2, ItemId=s3)
then you can just count on the basis of the ItemId column.
This is not a terribly efficient solution, but since your data only contains 10000 records, it is not going to be a big problem.
Set up provided example data, but convert the SalesItem field to an integer so that the diff() operation makes sense.
> s1<-c('1008','1009','1012','1013','1016','1017','1018','1019','1054','1055')
> s2<-c(155,153,154,150,176,165,159,143,179,150)
> s1 = as.integer(s1)
> S<-data.frame(SalesItem=s1, Sales=s2)
Reorder data frame so that the SalesItem field is in ascending order (not necessary for current data set, but required for solution) then find the differences.
> S = S[order(S$SalesItem),]
> d = c(0, diff(S$SalesItem))
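Duplicate the SalesItem data (taken from the sorted data frame, so the order matches d) and then filter based on the values of the differences.
> labels = S$SalesItem
> # carry the previous label forward whenever the gap to the previous item is exactly 1
> for (n in 1:nrow(S)) {if (d[n] == 1) labels[n] = labels[n-1]}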
> S$labels = labels
The (temporary) labels field now has the required new values for the SalesItem field. Once you are happy that this is doing the right thing, you can modify last line in above code to simply over-write the existing SalesItem field.
> S
SalesItem Sales labels
1 1008 155 1008
2 1009 153 1008
3 1012 154 1012
4 1013 150 1012
5 1016 176 1016
6 1017 165 1016
7 1018 159 1016
8 1019 143 1016
9 1054 179 1054
10 1055 150 1054
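The same labelling can also be done without an explicit loop. The sketch below is one possible vectorised variant, not the approach from either answer: it starts a new run wherever the gap to the previous item is not exactly 1, labels each run with its smallest SalesItem, and then sums Sales per label. The names run_id and totals are made up for illustration.

```r
s1 <- c(1008, 1009, 1012, 1013, 1016, 1017, 1018, 1019, 1054, 1055)
s2 <- c(155, 153, 154, 150, 176, 165, 159, 143, 179, 150)
S <- data.frame(SalesItem = s1, Sales = s2)

# Sort first so that diff() compares neighbouring item numbers
S <- S[order(S$SalesItem), ]

# A new run starts at the first row and wherever the gap is not exactly 1
run_id <- cumsum(c(TRUE, diff(S$SalesItem) != 1))

# Label every row with the smallest SalesItem in its run
S$labels <- ave(S$SalesItem, run_id, FUN = min)

# Sum of Sales per combined item
totals <- aggregate(Sales ~ labels, data = S, FUN = sum)
```

Like the loop, this chains consecutive items, so 1016 to 1019 all end up under 1016.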

Find the non zero values and frequency of those values in R

I have data with two parameters: date/time and flow. The flow is intermittent: at times there is zero flow, then the flow suddenly starts, there are non-zero values for some time, and then the flow goes back to zero. I want to understand when the non-zero values occur and how long each non-zero flow lasts. I have attached the sample dataset at this location
The data is 1 minute data.
I was able to import the data into R as follows:
flow <- read.csv("sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow$Date <- strptime(flow$Date, format="%m/%d/%Y %H:%M")
plot(flow$Date, flow$discharge,type="l")
I made a plot to see the distribution but couldn't get a clue where to start to get the frequency of the non-zero values. I would like to see an output table as follows:
Date Duration in Minutes
Please let me know if I am not clear here. Thanks.
Additional Info:
I think we need to check for a non-zero value first and then find how many non-zero values follow continuously before the flow reaches zero again. What I want to understand is the flow release durations. For example, in one day there might be multiple releases, and I want to note at what time each release started and how long it continued before returning to zero. I hope this explains the problem a little better.
The first point is that you have a lot of NA values in your data, in case you want to look into that.
If I understand correctly, you require the count of continuous 0's followed by continuous non-zeros, zeros, non-zeros etc. for each date.
This can be achieved with rle of course, as also mentioned by @mnel in the comments. But there are quite a few catches.
First, I'll set up the data with non-NA entries:
flow <- read.csv("~/Downloads/sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow <- flow[1:33119, ] # remove NA entries
# format Date to POSIXct to play nice with data.table
flow$Date <- as.POSIXct(flow$Date, format="%m/%d/%Y %H:%M")
Next, I'll create a Date column:
flow$g1 <- as.Date(flow$Date)
Finally, I prefer using data.table. So here's a solution using it.
# load package, get data as data.table and set key
library(data.table)
flow.dt <- data.table(flow)
# set key to both "Date" and "g1" (even though we'll just use g1)
# to make sure that the order of rows is not changed (during sort)
setkey(flow.dt, "Date", "g1")
# group by g1 and set data to TRUE/FALSE by equating to 0 and get rle lengths
out <- flow.dt[, list(duration = rle(discharge == 0)$lengths,
val = rle(discharge == 0)$values + 1), by=g1][val == 2, val := 0]
> out # just to show a few first and last entries
# g1 duration val
# 1: 2010-05-31 120 0
# 2: 2010-06-01 722 0
# 3: 2010-06-01 138 1
# 4: 2010-06-01 32 0
# 5: 2010-06-01 79 1
# ---
# 98: 2010-06-22 291 1
# 99: 2010-06-22 423 0
# 100: 2010-06-23 664 0
# 101: 2010-06-23 278 1
# 102: 2010-06-23 379 0
So, for example, for 2010-06-01, there are 722 0's followed by 138 non-zeros, followed by 32 0's followed by 79 non-zeros and so on...
I looked at a small sample of the first two days
> do.call( cbind, tapply(flow$discharge, as.Date(flow$Date), function(x) table(x > 0) ) )
2010-06-01 2010-06-02
FALSE 1223 911
TRUE 217 529 # these are the cumulative daily durations of positive flow.
You may want this transposed in which case the t() function should succeed. Or you could use rbind.
If you just wanted the number of flow-positive minutes, this would also work:
tapply(flow$discharge, as.Date(flow$Date), function(x) sum(x > 0, na.rm=TRUE) )
2010-06-01 2010-06-02 2010-06-03 2010-06-04 2010-06-05 2010-06-06 2010-06-07 2010-06-08
217 529 417 463 0 0 263 220
2010-06-09 2010-06-10 2010-06-11 2010-06-12 2010-06-13 2010-06-14 2010-06-15 2010-06-16
244 219 287 234 31 245 311 324
2010-06-17 2010-06-18 2010-06-19 2010-06-20 2010-06-21 2010-06-22 2010-06-23 2010-06-24
299 305 124 129 295 296 278 0
To get the lengths of intervals with discharge values greater than zero:
tapply(flow$discharge, as.Date(flow$Date), function(x) rle(x>0)$lengths[rle(x>0)$values] )
[1] 138 79
[1] 95 195 239
[1] 57 360
[1] 6 457
... Snipped output
If you want to look at the distribution of these durations you will need to unlist that result. (And remember that the durations which were split at midnight may have influenced the counts and durations.) If you just wanted durations without dates, then use this:
flowrle <- rle(flow$discharge>0)
flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
[1] 138 79 95 195 296 360 6 457 263 17 203 79 80 85 30 189 17 270 127 107 31 1
[23] 2 1 241 311 229 13 82 299 305 3 121 129 295 3 2 291 278
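To get closer to the table asked for in the question (start time plus duration in minutes), the run lengths from rle() can be turned into start positions with cumsum(). This is a minimal sketch on an invented 12-minute series, not the asker's file. It assumes the series has no NA values and one row per minute, and it does not split runs at midnight; the names runs, ends, starts and releases are made up for illustration.

```r
# Hypothetical 1-minute series with two releases (values invented)
flow <- data.frame(
  Date = as.POSIXct("2010-06-01 00:00", tz = "UTC") + 60 * (0:11),
  discharge = c(0, 0, 5, 7, 6, 0, 0, 0, 3, 4, 0, 0)
)

# Runs of zero flow (FALSE) and positive flow (TRUE)
runs <- rle(flow$discharge > 0)

# Row index where each run ends and starts
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Keep only the positive-flow runs: start time and duration in minutes
releases <- data.frame(
  Date = flow$Date[starts[runs$values]],
  DurationMinutes = runs$lengths[runs$values]
)
```

With real data the NA rows would need to be dropped or handled first, as in the answers above, because rle() treats each NA as its own run.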
