dataframe subsetting using vector - r

I have a Data Frame
head(readDF1)
Date sulfate nitrate ID
279 2003-10-06 7.21 0.651 1
285 2003-10-12 5.99 0.428 10
291 2003-10-18 4.68 1.040 100
297 2003-10-24 3.47 0.363 200
303 2003-10-30 2.42 0.507 300
315 2003-11-11 1.43 0.474 332
If I subset using the code below, it works correctly:
readDF1[readDF1$ID==331]
but if I use
readDF1[readDF1$ID==1:300]
it does not work. I want to subset a data frame which has values of the column ID from 1 to 300 (assume that ID contains values from 1 to 1000 and that they occur multiple times).

== is the wrong operator here. You aren't asking 'which ID is equal to the sequence 1:300'.
You want %in% (i.e. which ID values can be found in 1:300):
readDF1$ID[readDF1$ID %in% 1:300]
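If the goal is to keep whole rows of the data frame (not just the ID values), the same %in% test goes before the comma; a minimal sketch, assuming readDF1 as above:
# rows whose ID falls anywhere in 1:300
readDF1_sub <- readDF1[readDF1$ID %in% 1:300, ]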

Read in, transpose, and merge dataframe from MANY two column data frames with first row in common

I have thousands of comma separated .txt files with two columns, where one column has "wavelength" for the column name and the same wavelength values ("x" values) for all files, and the other column has the file name as the column name and response values (various observed "y" values).
If I read in a single file by readr, the format appears like this:
# A tibble: 2,151 x 2
   Wavelength a1lm_00000.asd.ref.sco.txt    ### [filename]
        <dbl>                      <dbl>
 1        350                     0.0542
 2        351                     0.0661
 3        352                     0.0686
 4        353                     0.0608
 5        354                     0.0545
 6        355                     0.0589
 7        356                     0.0644
 8        357                     0.0587
 9        358                     0.0556
10        359                     0.0519
...etc.
The end format I need is:
Filename "350" "351" "352" "353" etc.
a1lm_00000.asd.ref.sco.txt 0.0542 0.0661 0.0686 0.0608 etc.
a1lm_00001.asd.ref.sco.txt 0.0567 0.0680 0.0704 0.0627 etc.
...etc.
In other words, I need the first column as the file identifier, and each following column a spectral response with the associated spectral wavelength as the column name.
So, I need to read all these files in from a directory, and either:
a.) Create a third column containing the file name, rename every second column to something like "response", apply bind_rows to all files, then use "spread" from the tidyr package.
b.) Transpose each file as soon as it is read, so that the wavelength values become the column names, the response column's name (the file name) goes into a first column of row identifiers, and the resulting one-row frames are bound together.
Option b. seems preferable (a sketch of it follows below). Either option seems like it will need lapply and possibly bind_rows or bind_cols, but I'm not sure how best to do this. There is a lot of data, and a few of the methods I've tried have caused my machine to run out of memory, so the more memory-efficient I can make it the better.
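A sketch of option (b), under assumptions: the files sit in a placeholder directory "path/to/spectra", are readable with readr::read_csv (the question says they are comma separated), and each has a Wavelength column plus one response column named after the file. Each file is pivoted to a single row keyed by that column name, and the rows are bound together:
library(tidyverse)

read_one <- function(path) {
  d <- read_csv(path)                  # columns: Wavelength, <filename>
  tibble(Filename = names(d)[2],       # response column is named after the file
         Wavelength = d$Wavelength,
         value = d[[2]]) %>%
    spread(Wavelength, value)          # one row per file, one column per wavelength
}

wide <- map_dfr(list.files("path/to/spectra", full.names = TRUE), read_one)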
I recommend storing all data.frames in a list. Then it becomes a simple matter of merging data.frames, converting data from wide to long, and back to wide with a different key.
library(tidyverse)
reduce(lst, full_join) %>%
  gather(file, value, -Wavelength) %>%
  spread(Wavelength, value)
# file 350 351 352 353 354 355 356
#1 a1lm_00000.asd.ref.sco.txt 0.0542 0.0661 0.0686 0.0608 0.0545 0.0589 0.0644
#2 a1lm_00001.asd.ref.sco.txt 0.0542 0.0661 0.0686 0.0608 0.0545 0.0589 0.0644
# 357 358 359
#1 0.0587 0.0556 0.0519
#2 0.0587 0.0556 0.0519
Two more comments:
To store the data.frames in a list, I would do something along the lines of map(file_names, ~ read_csv2(.x)) (or in base R lapply(file_names, function(x) read.csv(x))); a fuller sketch follows these comments. Adjust file_names and the read_csv2/read.csv parameters as necessary.
More generally, I would probably advise against such a format. It seems much easier to keep the data in a list of long (and tidy) data.frames.
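Putting those pieces together, a sketch of the full read-in feeding the pipeline above; the directory path is a placeholder, and read_csv is used here on the assumption that the files are comma separated as the question states:
library(tidyverse)

file_names <- list.files("path/to/spectra", pattern = "\\.txt$", full.names = TRUE)
lst <- map(file_names, read_csv)

result <- reduce(lst, full_join) %>%    # joins on the shared Wavelength column
  gather(file, value, -Wavelength) %>%
  spread(Wavelength, value)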
For completeness, the same can be achieved in base R using Reduce+merge to join data, and stack+reshape to convert from wide to long to wide.
df <- Reduce(merge, lst)
reshape(
  cbind(stack(df, select = -Wavelength), Wavelength = df$Wavelength),
  idvar = "ind", timevar = "Wavelength", direction = "wide")
# ind values.350 values.351 values.352 values.353
#1 a1lm_00000.asd.ref.sco.txt 0.0542 0.0661 0.0686 0.0608
#11 a1lm_00001.asd.ref.sco.txt 0.0542 0.0661 0.0686 0.0608
# values.354 values.355 values.356 values.357 values.358 values.359
#1 0.0545 0.0589 0.0644 0.0587 0.0556 0.0519
#11 0.0545 0.0589 0.0644 0.0587 0.0556 0.0519
Sample data
df1 <- read.table(text =
"Wavelength a1lm_00000.asd.ref.sco.txt
1 350 0.0542
2 351 0.0661
3 352 0.0686
4 353 0.0608
5 354 0.0545
6 355 0.0589
7 356 0.0644
8 357 0.0587
9 358 0.0556
10 359 0.0519", header = T)
df2 <- read.table(text =
"Wavelength a1lm_00001.asd.ref.sco.txt
1 350 0.0542
2 351 0.0661
3 352 0.0686
4 353 0.0608
5 354 0.0545
6 355 0.0589
7 356 0.0644
8 357 0.0587
9 358 0.0556
10 359 0.0519", header = T)
lst <- list(df1, df2)

How to apply a function from a package to a dataframe

How can I apply a package function to a data frame?
I have a data set (df) with two columns (total and n) to which I would like to apply the pois.exact function (pois.exact(x, pt = 1, conf.level = 0.95)) from the epitools package, with x = df$n and pt = df$total, and get a "new" data frame (new_df) with 3 more columns containing the corresponding rounded rates and lower and upper confidence limits.
df <- data.frame("total" = c(35725302,35627717,34565295,36170648,38957933,36579643,29628394,18212075,39562754,1265055), "n" = c(24,66,166,461,898,1416,1781,1284,329,12))
> df
total n
1 35725302 24
2 35627717 66
3 34565295 166
4 36170648 461
5 38957933 898
6 36579643 1416
7 29628394 1781
8 18212075 1284
9 9562754 329
In fact, the real data frame is much longer.
For example, for the first row the desired results are:
require(epitools)
round(pois.exact(24, pt = 35725302, conf.level = 0.95) * 100000, 2)[3:5]
rate lower upper
1 0.07 0.04 0.1
The new data frame with the results added by applying the pois.exact function should look like this:
> new_df
total n incidence lower_95IC upper_95IC
1 35725302 24 0.07 0.04 0.10
2 35627717 66 0.19 0.14 0.24
3 34565295 166 0.48 0.41 0.56
4 36170648 461 1.27 1.16 1.40
5 38957933 898 2.31 2.16 2.46
6 36579643 1416 3.87 3.67 4.08
7 29628394 1781 6.01 5.74 6.03
8 18212075 1284 7.05 6.67 7.45
9 9562754 329 3.44 3.08 3.83
Thanks.
library(epitools)
library(dplyr)

df %>%
  cbind(pois.exact(df$n, df$total)) %>%
  dplyr::select(total, n, rate, lower, upper)
# total n rate lower upper
# 1 35725302 24 1488554.25 1488066.17 1489042.45
# 2 35627717 66 539813.89 539636.65 539991.18
# 3 34565295 166 208224.67 208155.26 208294.10
# 4 36170648 461 78461.28 78435.71 78486.85
# 5 38957933 898 43383.00 43369.38 43396.62
# 6 36579643 1416 25833.08 25824.71 25841.45
# 7 29628394 1781 16635.82 16629.83 16641.81
# 8 18212075 1284 14183.86 14177.35 14190.37
# 9 39562754 329 120251.53 120214.06 120289.01
# 10 1265055 12 105421.25 105237.62 105605.12
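To get the incidence per 100,000 with the column names from the question (pois.exact returns the raw rate x/pt, so it still needs the scaling and rounding), a sketch along the lines of the question's own one-row example; the names incidence/lower_95IC/upper_95IC simply mirror the desired new_df:
library(epitools)

res <- round(pois.exact(df$n, pt = df$total) * 100000, 2)[3:5]   # rate, lower, upper
names(res) <- c("incidence", "lower_95IC", "upper_95IC")
new_df <- cbind(df, res)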

Unable to apply ddply-summarise in R correctly

new here and new to R, so bear with me, please.
I have a data.frame similar to this:
time. variable TEER
1 0.07 cntrl 234.2795
2 1.07 cntrl 602.8245
3 2.07 cntrl 703.6844
4 3.07 cntrl 699.4538
...
48 0.07 cntrl 234.2795
49 1.07 cntrl 602.8245
50 2.07 cntrl 703.6844
51 3.07 cntrl 699.4538
...
471 0.07 agr1111 251.9119
472 1.07 agr1111 480.1573
473 2.07 agr1111 629.3744
474 3.07 agr1111 676.6782
...
518 0.07 agr1111 251.9119
519 1.07 agr1111 480.1573
520 2.07 agr1111 629.3744
521 3.07 agr1111 676.6782
...
753 0.07 agr2222 350.1049
754 1.07 agr2222 306.6072
755 2.07 agr2222 346.0387
756 3.07 agr2222 447.0137
757 4.07 agr2222 530.2433
...
802 2.07 agr2222 346.0387
803 3.07 agr2222 447.0137
804 4.07 agr2222 530.2433
805 5.07 agr2222 591.2122
I'm trying to apply ddply() to this data frame to get a new data frame with means and standard error (to plot later) like so:
> ddply(data_melt, c("time.", "variable"), summarise,
mean = mean(TEER), sd = sd(TEER),
sem = sd(TEER)/sqrt(length(TEER)))
What I get as output is a data frame with the same values of TEER in the mean column as in the first rows of the original data frame, and zeroes in the sd and sem columns. I also get a warning:
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else
  paste0(labels, : duplicated levels in factors are deprecated
It looks like the function only goes through the first part of the data frame and doesn't look at the later duplicates of each time./variable group.
I already tried looking at the solutions to similar problems here but nothing seems to work. Am I missing something or is this a legitimate problem?
Any help / tips appreciated.
P.S Let me know if I'm not explaining the problem coherently enough and I'll try to go into more detail.
I think I've found a way around my problem.
Initially, when I load the data frame, each of the variables ("cntrl", "agr1111", "agr2222") has a unique letter-number tag attached ("A1", "A2", "B1", "B2"), so the values look like "cntrl.A1" or "agr1111.B2". Instead of stripping the letter-number tag from each of them with gsub, I tried using filter with grepl to isolate the rows I need and summarise them.
Here's the code:
library(dplyr)
dt_11 <- dt %>%
  group_by(time.) %>%
  filter(grepl("agr1111", variable)) %>%
  summarise(avg_11 = mean(TEER),
            sd_11 = sd(TEER),
            sem_11 = sd(TEER) / sqrt(length(TEER)))
This only gives me a data frame with one group of variables ("agr1111"), and I'll have to do this two more times, for "cntrl" and "agr2222", resulting in 3 data frames. But I'm sure I'll be able to either merge the data frames or plot them on the same graph separately.
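A possible single-pass alternative (a sketch, assuming the letter-number tag always follows a dot and that the measurement column is TEER as shown above): strip the tag first, then group by both time. and the cleaned variable, so all three groups are summarised at once.
library(dplyr)

dt_all <- dt %>%
  mutate(variable = sub("\\..*$", "", variable)) %>%   # "cntrl.A1" -> "cntrl"
  group_by(time., variable) %>%
  summarise(avg = mean(TEER),
            sd  = sd(TEER),
            sem = sd(TEER) / sqrt(n()))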
This doesn't quite fit as an answer, but it's too long for a comment:
I ran your exact code and everything works fine!
> ddply(dt, c("time.", "variable"), summarise,
+ mean = mean(TEER), sd = sd(TEER),
+ sem = sd(TEER)/sqrt(length(TEER)), count = length(TEER))
#time. variable mean sd sem count
# 0.07 agr1111 251.9119 0 0 2
# 0.07 agr2222 350.1049 NA NA 1
# 0.07 cntrl 234.2795 0 0 2
# 1.07 agr1111 480.1573 0 0 2
# 1.07 agr2222 306.6072 NA NA 1
# 1.07 cntrl 602.8245 0 0 2
# 2.07 agr1111 629.3744 0 0 2
# 2.07 agr2222 346.0387 0 0 2
# 2.07 cntrl 703.6844 0 0 2
# 3.07 agr1111 676.6782 0 0 2
# 3.07 agr2222 447.0137 0 0 2
# 3.07 cntrl 699.4538 0 0 2
# 4.07 agr2222 530.2433 0 0 2
# 5.07 agr2222 591.2122 NA NA 1
> sessionInfo()
#other attached packages:
#[1] plyr_1.8.4
Could you update to the latest version of the packages? I am not sure of the cause of your problem. I hope you understand how sd is actually calculated and why NA values appear (hint: look at the count column).
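To spell the hint out: sd needs at least two observations, and identical replicates give zero spread. Using values from the data above:
sd(350.1049)                  # NA: a single observation has no sample standard deviation
sd(c(251.9119, 251.9119))     # 0: two identical values have zero spread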

Reshaping dataframe (converting rows to columns)

I have the following dataset:
Name TOR_A Success_rate_A Realizable_Prod_A Assist_Rate_A Task_Count_A Date
1 BVG1 2.00 85 4.20 0.44 458 31/01/2014
2 BVG2 3.99 90 3.98 0.51 191 31/01/2014
3 BVG3 4.00 81 8.95 0.35 1260 31/01/2014
4 BVG4 3.50 82 2.44 4.92 6994 31/01/2014
5 BVG1 2.75 85 4.00 2.77 7954 07/02/2014
6 BVG2 4.00 91 3.50 1.50 757 07/02/2014
7 BVG3 3.80 82 7.00 1.67 7898 07/02/2014
8 BVG4 3.60 83 3.50 4.87 7000 07/02/2014
I wish to plot a ggplot line graph with Date on x-axis and TOR_A, Success_rate_A etc. on y-axis. I would also like to see it by the Name column. How can I prepare this dataset to achieve this objective?
I tried reshape in R but couldn't make it work.
UPDATE
Done using the reshape2::recast method as shown below:
data_weekly = recast(data_frame_to_be_reshaped, variable + Name ~ Date, id.var = c("Name", "Date"))
You can use Hadley Wickham's tidyr package.
df_reshaped <- gather(df_original, key = Variable, value = Value, TOR_A:Task_Count_A)
As you can see, the first argument of the gather() function is the original data frame. Then you say what to call the column that will hold the names of your original variables, and what to call the column that will hold their values. Finally, you specify which columns you want to reshape. All columns not listed (in this example, Date and Name) remain as they were in the original data frame.
There is a nice tutorial on tidyr published by Brad Boehmke in case you would need more information.
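With the data in that long format, one way to draw the plot the question describes (a sketch; the date format string and the choice of colouring by Name and facetting by measure are assumptions):
library(ggplot2)

ggplot(df_reshaped,
       aes(x = as.Date(Date, format = "%d/%m/%Y"), y = Value, colour = Name)) +
  geom_line() +
  facet_wrap(~ Variable, scales = "free_y") +
  labs(x = "Date", y = "Value")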

remove character columns from a numeric data frame

I have a data frame like the one you see here.
DRSi TP DOC DN date Turbidity Anions
158 5.9 3371 264 14/8/06 5.83 2246.02
217 4.7 2060 428 16/8/06 6.04 1632.29
181 10.6 1828 219 16/8/06 6.11 1005.00
397 5.3 1027 439 16/8/06 5.74 314.19
2204 81.2 11770 1827 15/8/06 9.64 2635.39
307 2.9 1954 589 15/8/06 6.12 2762.02
136 7.1 2712 157 14/8/06 5.83 2049.86
1502 15.3 4123 959 15/8/06 6.48 2648.12
1113 1.5 819 195 17/8/06 5.83 804.42
329 4.1 2264 434 16/8/06 6.19 2214.89
193 3.5 5691 251 17/8/06 5.64 1299.25
1152 3.5 2865 1075 15/8/06 5.66 2573.78
357 4.1 5664 509 16/8/06 6.06 1982.08
513 7.1 2485 586 15/8/06 6.24 2608.35
1645 6.5 4878 208 17/8/06 5.96 969.32
Before getting here I used the following code to remove those columns that had no values at all or contained some NAs:
rem <- NULL
for (col.nr in 1:dim(E.3)[2]) {
  if (sum(is.na(E.3[, col.nr])) > 0 | all(is.na(E.3[, col.nr]))) {
    rem <- c(rem, col.nr)
  }
}
E.4 <- E.3[, -rem]
Now I need to remove the "date" column, not based on its column name but based on the fact that it's a character string.
I've already seen here (Remove an entire column from a data.frame in R) how to simply set it to NULL, among other options, but I want to use a different approach.
First use is.character to find all columns with class character. However, make sure that your date is really a character, not a Date or a factor. Otherwise use is.Date or is.factor instead of is.character.
Then just subset the columns that are not characters in the data.frame, e.g.
df[, !sapply(df, is.character)]
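For example, on a small made-up data frame (stringsAsFactors = FALSE keeps the date as a character, which is what the test assumes):
toy <- data.frame(DRSi = c(158, 217),
                  TP = c(5.9, 4.7),
                  date = c("14/8/06", "16/8/06"),
                  stringsAsFactors = FALSE)

toy[, !sapply(toy, is.character)]   # drops "date", keeps the numeric columns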
I was having a similar problem, but the answer above doesn't solve it for Date columns (which is what I needed), so I found another solution:
df[, -grep("Date|factor|character", sapply(df, class))]
This returns your df without the Date, character and factor columns.
