Adding data frame below another data frame [duplicate] - r

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
(14 answers)
Closed 5 years ago.
I want to do the following:
I have an Actual Sales data frame:
Dates Actual
24/04/2017 58
25/04/2017 59
26/04/2017 58
27/04/2017 154
28/04/2017 117
29/04/2017 127
30/04/2017 178
Another data frame of Predicted values
Dates Predicted
01/05/2017 68.54159
02/05/2017 90.7313
03/05/2017 82.76875
04/05/2017 117.48913
05/05/2017 110.3809
06/05/2017 156.53363
07/05/2017 198.14819
I want to add the Predicted data frame below the Actual data frame in the following manner:
Dates       Actual  Predicted
24/04/2017  58
25/04/2017  59
26/04/2017  58
27/04/2017  154
28/04/2017  117
29/04/2017  127
30/04/2017  178
01/05/2017          68.54159
02/05/2017          90.7313
03/05/2017          82.76875
04/05/2017          117.48913
05/05/2017          110.3809
06/05/2017          156.53363
07/05/2017          198.14819

With:
library(dplyr)
bind_rows(d1, d2)
you get:
Dates Actual Predicted
1 24/04/2017 58 NA
2 25/04/2017 59 NA
3 26/04/2017 58 NA
4 27/04/2017 154 NA
5 28/04/2017 117 NA
6 29/04/2017 127 NA
7 30/04/2017 178 NA
8 01/05/2017 NA 68.54159
9 02/05/2017 NA 90.73130
10 03/05/2017 NA 82.76875
11 04/05/2017 NA 117.48913
12 05/05/2017 NA 110.38090
13 06/05/2017 NA 156.53363
14 07/05/2017 NA 198.14819
Or with:
library(data.table)
rbindlist(list(d1,d2), fill = TRUE)
Or with:
library(plyr)
rbind.fill(d1,d2)
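If you would rather not load a package, the same result can be sketched in base R. The two small data frames below are a shortened, hypothetical reconstruction of d1 and d2 from the question; the idea is to give each frame the column it lacks (filled with NA) and then stack them.

```r
# Shortened stand-ins for the question's data frames (assumed names d1 and d2)
d1 <- data.frame(Dates = c("24/04/2017", "25/04/2017"), Actual = c(58, 59))
d2 <- data.frame(Dates = c("01/05/2017", "02/05/2017"), Predicted = c(68.54159, 90.7313))

# Give each frame the column it is missing, filled with NA
d1$Predicted <- NA_real_
d2$Actual <- NA_real_

# Reorder d2's columns to match d1, then stack the rows
combined <- rbind(d1, d2[names(d1)])
combined
```

This is essentially what the fill-style functions above do for you automatically.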

Related

Linear regression on 415 files, output just filename, regression coefficient, significance

I am a beginner in R and am learning the basics to analyse some biological data. I have 415 .csv files, one per fungal species. Each file has 5 columns (YEAR, FFD, LFD, RAN, MEAN):
YEAR FFD LFD RAN MEAN
1 1950 NA NA NA NA
2 1951 NA NA NA NA
3 1952 NA NA NA NA
4 1953 NA NA NA NA
5 1954 NA NA NA NA
6 1955 NA NA NA NA
7 1956 NA NA NA NA
8 1957 NA NA NA NA
9 1958 NA NA NA NA
10 1959 140 141 1 140
11 1960 NA NA NA NA
12 1961 NA NA NA NA
13 1962 NA NA NA NA
14 1963 NA NA NA NA
15 1964 NA NA NA NA
16 1965 155 156 1 155
17 1966 NA NA NA NA
18 1967 NA NA NA NA
19 1968 152 153 1 152
20 1969 NA NA NA NA
21 1970 NA NA NA NA
22 1971 161 162 1 161
23 1972 NA NA NA NA
24 1973 143 144 1 143
25 1974 NA NA NA NA
26 1975 NA NA NA NA
27 1976 NA NA NA NA
28 1977 NA NA NA NA
29 1978 NA NA NA NA
30 1979 NA NA NA NA
31 1980 NA NA NA NA
32 1981 NA NA NA NA
33 1982 155 156 1 155
34 1983 NA NA NA NA
35 1984 NA NA NA NA
36 1985 157 158 1 157
37 1986 170 310 140 240
38 1987 173 274 101 232
39 1988 192 236 44 214
40 1989 234 320 86 277
41 1990 172 287 115 213
42 1991 148 287 139 205
43 1992 140 278 138 206
44 1993 152 273 121 216
45 1994 142 319 177 228
46 1995 261 318 57 287
47 1996 247 315 68 285
48 1997 164 270 106 230
49 1998 186 187 1 186
50 1999 235 236 1 235
51 2000 NA NA NA NA
52 2001 309 310 1 309
53 2002 203 308 105 256
54 2003 140 238 98 189
55 2004 204 313 109 267
56 2005 253 313 60 287
57 2006 247 300 53 279
58 2007 185 295 110 225
59 2008 259 260 1 259
60 2009 296 315 19 309
61 2010 230 303 73 275
62 2011 247 248 1 247
63 2012 206 207 1 206
64 2013 NA NA NA NA
65 2014 250 317 67 271
First I would like to see the regression coefficient (slope of the line) for each file, and the significance (p-value) for all of the files.
I can do it individually with:
fruit<-read.csv(file.choose(),header=TRUE)
yr<-fruit[,1]
ffd<-fruit[,2]
res<-lm(ffd~yr)
summary(res)
When I do this for the data above, I get:
Call:
lm(formula = ffd ~ yr)
Residuals:
Min 1Q Median 3Q Max
-77.358 -20.858 -5.714 22.494 96.015
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4162.0710 950.1439 -4.380 0.000119 ***
yr 2.1864 0.4765 4.588 6.55e-05 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 38.75 on 32 degrees of freedom
(31 observations deleted due to missingness)
Multiple R-squared: 0.3968, Adjusted R-squared: 0.378
F-statistic: 21.05 on 1 and 32 DF, p-value: 6.549e-05
The only information I need from this at the moment is the regression coefficient (2.1864) and the p-value (6.549e-05)
The perfect output would be if I could get R to cycle through the 415 files, and give an output in the form of a table with 3 columns: filename, regression coefficient, and significance. There would be 415 rows, one for each file.
I would then like to run the same regression with LFD, RANGE, and MEAN as the response (LFD ~ YEAR, and so on). I am hoping that I can easily edit the code for FFD ~ YEAR and run it for the other 3 regressions.
The following code will probably work.
I have tested it with your data in two files. The functions that do all the work are these ones:
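# Regress the second column (the response, e.g. FFD) on the first (YEAR),
# matching lm(ffd ~ yr) in the question; return the slope and its p-value.
regrFun <- function(DF){
  fit <- lm(DF[[2]] ~ DF[[1]])
  coef(summary(fit))[2, c(1, 4)]
}
# Run regrFun on column iv of every data frame in L; one result row per file.
regrList <- function(iv, L){
  res <- lapply(seq_along(L), function(i){
    dftmp <- L[[i]]
    cfs <- regrFun(dftmp[c(1, iv)])
    data.frame(file = names(L)[i], Estimate = cfs[1], p.value = cfs[2])
  })
  res <- do.call(rbind, res)
  row.names(res) <- NULL
  res
}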
Now read in the data files. In the following code line, substitute a common filename part for "pattern" in the obvious place.
filenames <- list.files(pattern = "pattern")
df_list <- lapply(filenames, read.csv)
names(df_list) <- filenames
And compute the values you want.
results_list <- lapply(2:ncol(df_list[[1]]), regrList, df_list)
names(results_list) <- names(df_list[[1]][-1])
First, I simulate 5 CSV files with columns that look like yours:
for(i in 1:5){
  tab <- data.frame(
    YEAR = 1950:2014,
    FFD  = rpois(65, 100),
    LFD  = rnorm(65, 100, 10),
    RAN  = rnbinom(65, mu = 100, size = 1),
    MEAN = runif(65, min = 50, max = 150)
  )
  write.csv(tab, paste0("data", i, ".csv"))
}
Now we need a vector of all the files in your directory. The file names will be different for you, but you can build the vector with the pattern argument (a regular expression):
csvfiles = dir(pattern="^data[0-9]+\\.csv$")
The code below uses three packages (dplyr, purrr, and broom). Assuming each CSV file is not too large, it reads all the files into one data frame, groups the rows by source file, and fits the regression per group. Note that you can refer to the columns by name inside lm(), so there is no need to rename them:
library(dplyr)
library(purrr)
library(broom)
csvfiles %>%
  map_df(function(i) { df <- read.csv(i); df$data <- i; df }) %>%
  group_by(data) %>%
  do(tidy(lm(FFD ~ YEAR, data = .))) %>%
  filter(term != "(Intercept)")
# A tibble: 5 x 6
# Groups: data [5]
data term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 data1.csv YEAR -0.0228 0.0731 -0.311 0.756
2 data2.csv YEAR -0.139 0.0573 -2.42 0.0182
3 data3.csv YEAR -0.175 0.0650 -2.70 0.00901
4 data4.csv YEAR -0.0478 0.0628 -0.762 0.449
5 data5.csv YEAR 0.0204 0.0648 0.315 0.754
To get the other regressions, change the formula inside lm(FFD ~ YEAR, data = .), for example to LFD ~ YEAR.
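If you want to avoid editing the formula by hand for each response, one option is to build it from the column name. The sketch below uses base R only; slope_for is a hypothetical helper name, and the small data frame is made up purely to have something to run against.

```r
# Hypothetical helper: slope and p-value of <response> ~ YEAR for one data frame
slope_for <- function(df, response) {
  fit <- lm(reformulate("YEAR", response = response), data = df)
  coef(summary(fit))["YEAR", c("Estimate", "Pr(>|t|)")]
}

# Made-up data with the same column names as in the question
df <- data.frame(
  YEAR = 1950:1959,
  FFD  = c(140, 143, 141, 150, 149, 155, 152, 160, 158, 163),
  LFD  = c(300, 298, 305, 301, 310, 307, 312, 309, 315, 318)
)

# One row per response variable, columns Estimate and Pr(>|t|)
res <- t(sapply(c("FFD", "LFD"), slope_for, df = df))
res
```

The same helper could be called once per file inside any of the loops shown in the other answers.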
data.table version, using StupidWolf's csv files layout and names, featuring the requested fields:
library(data.table)
input.dir = "/home/user/Desktop/My Folder/" # adjust to your needs
csvfiles <- list.files(path=input.dir, full.names=TRUE, pattern=".*data(.*)\\.csv") # adjust pattern
Above, I used a more specific regex pattern, but you could use pattern="\\.csv$" if you want to process all CSV files in that folder (the pattern argument is a regular expression, not a shell glob).
# order the files numerically by the number in the file name
csvfiles <- csvfiles[order(as.numeric(gsub(".*data(.*)\\.csv", "\\1", csvfiles)))]
# function to read a file and return the requested columns
# (note: c() mixes a string with numbers, so the estimates come back as character)
regrFun <- function(x){
  DT <- fread(x)
  fit <- lm(FFD ~ YEAR, data=DT)
  return(as.list(c(filename=basename(x), coef(summary(fit))[2, c(1, 4)])))
}
# apply function and rename columns
DT <- rbindlist(lapply(csvfiles, regrFun))
setnames(DT, c("filename", "regression coefficient", "significance"))
DT
Result:
filename regression coefficient significance
1: data1.csv -0.113286713286712 0.0874762832713643
2: data2.csv -0.044449300699302 0.457096760642717
3: data3.csv 0.0464597902097902 0.499618510612891
4: data4.csv -0.032473776223776 0.638494798460044
5: data5.csv 0.0562062937062939 0.452955919860998
---
411: data411.csv 0.0381555944055959 0.544185411150829
412: data412.csv -0.0672202797202807 0.314346452751388
413: data413.csv 0.116564685314687 0.0694785724198052
414: data414.csv -0.0908216783216786 0.110811677724832
415: data415.csv -0.0282779720279721 0.638766712090455
You could write an R script that runs on a single file, then run it on every file from the terminal.
The script is simply a .R file with code inside.
To run it on every file, you would execute something along these lines in your terminal (using bash):
for file in yourDataDirectory/*.csv; do
  Rscript yourScriptFile.R "$file" >> finalOutput
done
This runs the script in yourScriptFile.R on every CSV file in yourDataDirectory and appends the output to finalOutput.
The script itself would be very similar to the code you already wrote, but instead of file.choose() you would use the argument passed on the command line (available through commandArgs(trailingOnly = TRUE)), and you would print only the information you are interested in instead of the full output of summary().
finalOutput could even be a CSV file, if you format the script output correctly.
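A minimal sketch of what such a script might contain, assuming the columns are named YEAR and FFD as in the question. The function name slope_and_p and the comma-separated output format are my own choices, not anything prescribed.

```r
# yourScriptFile.R (hypothetical): print "filename,slope,p-value" for one CSV file
slope_and_p <- function(path) {
  fruit <- read.csv(path, header = TRUE)
  fit <- lm(FFD ~ YEAR, data = fruit)      # rows with NA are dropped by default
  cf <- coef(summary(fit))["YEAR", c("Estimate", "Pr(>|t|)")]
  paste(basename(path), cf[[1]], cf[[2]], sep = ",")
}

# Arguments given after the script name on the command line
args <- commandArgs(trailingOnly = TRUE)
if (length(args) >= 1) {
  cat(slope_and_p(args[1]), "\n", sep = "")
}
```

Because each run prints exactly one comma-separated line, the appended finalOutput can be read back with read.csv(header = FALSE).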

combine two similar columns in r

I'm trying to combine two columns of data that essentially contain the same information, but each column is missing some values that the other one has. Column "wasiIQw1" holds the data for half of the group while column "w1iq" holds the data for the other half of the group.
select(gadd.us,nidaid,wasiIQw1,w1iq)[1:10,]
nidaid wasiIQw1 w1iq
1 45-D11150341 104 NA
2 45-D11180321 82 NA
3 45-D11220022 93 93
4 45-D11240432 118 NA
5 45-D11270422 99 NA
6 45-D11290422 82 82
7 45-D11320321 99 99
8 45-D11500021 99 99
9 45-D11500311 95 95
10 45-D11520011 111 111
select(gadd.us,nidaid,wasiIQw1,w1iq)[384:394,]
nidaid wasiIQw1 w1iq
384 H1900442S NA 62
385 H1930422S NA 83
386 H1960012S NA 89
387 H1960321S NA 90
388 H2020011S NA 96
389 H2020422S NA 102
390 H2040011S NA 102
391 H2040331S NA 94
392 H2040422S NA 103
393 H2050051S NA 86
394 H2050341S NA 98
With the following code I joined df.a (a df with the id and wasiIQw1) with df.b (a df with the id and w1iq) and get the following results.
df.join <- semi_join(df.a,
df.b,
by = "nidaid")
nidaid w1iq
1 45-D11150341 NA
2 45-D11180321 NA
3 45-D11220022 93
4 45-D11240432 NA
5 45-D11270422 NA
6 45-D11290422 82
7 45-D11320321 99
8 45-D11500021 99
9 45-D11500311 95
10 45-D11520011 111
nidaid w1iq
384 H1900442S 62
385 H1930422S 83
386 H1960012S 89
387 H1960321S 90
388 H2020011S 96
389 H2020422S 102
390 H2040011S 102
391 H2040331S 94
392 H2040422S 103
393 H2050051S 86
394 H2050341S 98
All of this works except for the first four NAs, which won't merge. Other "_join" functions from dplyr have not worked either. Do you have any tips for combining these two columns so that no data is lost and every NA is filled in whenever the other column has a value?
You can use coalesce() here, which returns the first non-missing value at each position.
library(dplyr)
gadd.us %>% mutate(w1iq = coalesce(w1iq, wasiIQw1))
This takes the value from w1iq when it is present and falls back to wasiIQw1 when w1iq is NA. Swap the two arguments if you want wasiIQw1 to take priority.
Here is a way to do it with base R (no packages).
Create reproducible data:
dat <- data.frame(nidaid=paste0("H",c(1:5)), wasiIQw1=c(NA,NA,NA,75,9), w1iq=c(44,21,46,75,NA))
dat
  nidaid wasiIQw1 w1iq
1     H1       NA   44
2     H2       NA   21
3     H3       NA   46
4     H4       75   75
5     H5        9   NA
Create a new column named new to combine the two. With this ifelse statement, we say: if the first column wasiIQw1 is not (!) an NA (is.na()), then grab it, otherwise grab the second column. Similar to Ronak's answer, you can switch the column names here to give one preference over the other.
dat$new <- ifelse(!is.na(dat$wasiIQw1), dat$wasiIQw1, dat$w1iq)
dat
  nidaid wasiIQw1 w1iq new
1     H1       NA   44  44
2     H2       NA   21  21
3     H3       NA   46  46
4     H4       75   75  75
5     H5        9   NA   9
Using base R, we can do
gadd.us$w1iq <- with(gadd.us, pmax(w1iq, wasiIQw1, na.rm = TRUE))
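One thing to keep in mind: the approaches agree whenever the two columns hold the same value or only one of them is present, but they differ if both are present and disagree. A small sketch with made-up numbers (the column names a and b are placeholders) shows the difference.

```r
# Made-up data: the last row has two different non-missing values
dat <- data.frame(a = c(NA, 75, 9, 80), b = c(44, 75, NA, 90))

# Take a when it is present, otherwise b (same idea as coalesce(a, b))
first_non_na <- ifelse(!is.na(dat$a), dat$a, dat$b)   # 44 75 9 80

# Take the larger of the two, ignoring NA
larger <- pmax(dat$a, dat$b, na.rm = TRUE)            # 44 75 9 90
```

In the question's data the overlapping rows appear to carry identical values, in which case either approach gives the same column.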

Convert column of class factor into NA [duplicate]

This question already has answers here:
How to convert a factor to integer\numeric without loss of information?
(12 answers)
Closed 3 years ago.
How to convert a column of class factor into numeric without disturbing the NAs present in it?
I do not want to convert it to 0!!
>Conceded
[1] 665 515 NA NA NA 67 98 15 31 NA NA NA NA NA 2195 2525 1756 6366 3143
[20] 7857 5926 2254 3199 4297 4568 2246 1506 2291
21 Levels: 15 1506 1756 2195 2246 2254 2291 2525 31 3143 3199 4297 4568 515 5926 6366 665 67 ... NA
>class(Conceded)
[1] "factor"
>as.numeric(Conceded)
[1] 17 14 21 21 21 18 20 1 9 21 21 21 21 21 4 8 3 16 10 19 15 6 11 12 13 5 2 7
1) How can I retain the NA values while converting a factor vector into a numeric vector?
2) Also, what are the values that appear as a result of this conversion?
3) Why do I need to convert to a character vector first and then to a numeric vector?
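Convert to character first, and then to numeric. Calling as.numeric() directly on a factor returns the internal integer codes (each value's position among the levels) rather than the numbers the labels spell out; those codes are the values you see in your output. Missing values remain NA after the two-step conversion; they are not turned into 0.
Example: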
x <- factor(c(23,4,7,16, 10, NA))
as.numeric(x) # wrong values
as.numeric(as.character(x)) # correct values
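To make the difference concrete, here is the same example with the results spelled out, plus an equivalent form that converts the levels once and then indexes them by the factor. The variable names codes, values, and values2 are just for illustration.

```r
x <- factor(c(23, 4, 7, 16, 10, NA))

levels(x)                                  # "4" "7" "10" "16" "23"
codes   <- as.numeric(x)                   # 5 1 2 4 3 NA  (positions in levels)
values  <- as.numeric(as.character(x))     # 23 4 7 16 10 NA
values2 <- as.numeric(levels(x))[x]        # same result, converts each level once
```

In all three cases the missing entry stays NA.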

Challenge in data manipulation

I am trying to process a database of BLAST output to generate a data frame containing values for a given gene and given sample. When a gene is identified within a sample I would like the scaffold on which it was identified to be reported. If a given gene is NOT identified within a given sample I would like the cell to be filled with N/A.
sample_name scaffold gene_title match_(%)
P24_ST48 64 aadA12 94.56
401B_ST5223 381 blaTEM-163 99.65
P32_ST218 91 aadA24 90.41
HOS66_ST73 9 blaACT-5 72.31
HOS16_ST38 70 blaTEM-146 99.42
HOS56_ST131 48 aadA21 91.39
Ecoli_2009_1_ST131 41 sul1 99.88
PH152_ST95 37 dfrA33 83.94
Ecoli_2009_32_STNT 16 aac(3)-Ib 100.00
PH231_ST38 59 mph(D) 89.83
P44_STNT 135 blaTEM-105 99.88
Ecoli_2011_89_ST127 29 blaTEM-158 99.65
405C_ST1178 120 aadA1 99.75
P3_STNT 15 blaTEM-68 99.19
5A_ST34 174 blaTEM-127 99.88
P27_ST10 211 aph(3')-Ia 100.00
4D_ST767 393 blaTEM-152 98.95
P10_STNT 23 blaTEM-17 99.07
Ecoli_2014_27_ST131 49 sul2_15 99.88
Ecoli_2013_10_ST73 23 blaTEM-2 99.19
The output table would look something like:
Sample aadA1 aadA12 aadA24 blaTEM-163 ...
P24_ST48 N/A 64 N/A N/A
401B_ST5223 N/A N/A N/A 381
...
In excel I have concatenated the sample name and gene titles and reported the scaffold number on row where this string is identified using VLOOKUP - I have tried many different ways in R and am going around in circles.
Now trying to process +700 genes and +450 samples, the list of gene-sample combinations is getting somewhat laborious for excel to manage and I must find another solution with my collection of samples growing increasingly large.
Any help would be greatly appreciated.
Cheers,
Max
Here's how to do that with spread from tidyr (newer tidyr versions recommend pivot_wider for new code, but spread still works):
library(tidyr)
df1 %>%
  spread(key = gene_title, value = scaffold)
sample_name match_... aac(3)-Ib aadA1 aadA12 ...
1 401B_ST5223 99.65 NA NA NA
2 405C_ST1178 99.75 NA 120 NA
3 4D_ST767 98.95 NA NA NA
4 5A_ST34 99.88 NA NA NA
5 Ecoli_2009_1_ST131 99.88 NA NA NA
...
Data
df1 <- read.table(text="sample_name scaffold gene_title match_(%)
P24_ST48 64 aadA12 94.56
401B_ST5223 381 blaTEM-163 99.65
P32_ST218 91 aadA24 90.41
HOS66_ST73 9 blaACT-5 72.31
HOS16_ST38 70 blaTEM-146 99.42
HOS56_ST131 48 aadA21 91.39
Ecoli_2009_1_ST131 41 sul1 99.88
PH152_ST95 37 dfrA33 83.94
Ecoli_2009_32_STNT 16 aac(3)-Ib 100.00
PH231_ST38 59 mph(D) 89.83
P44_STNT 135 blaTEM-105 99.88
Ecoli_2011_89_ST127 29 blaTEM-158 99.65
405C_ST1178 120 aadA1 99.75
P3_STNT 15 blaTEM-68 99.19
5A_ST34 174 blaTEM-127 99.88
P27_ST10 211 aph(3')-Ia 100.00
4D_ST767 393 blaTEM-152 98.95
P10_STNT 23 blaTEM-17 99.07
Ecoli_2014_27_ST131 49 sul2_15 99.88
Ecoli_2013_10_ST73 23 blaTEM-2 99.19",
header=TRUE, stringsAsFactors=FALSE, quote="") # quote="" stops the apostrophe in aph(3')-Ia from being read as a quote character
We can use dcast from data.table
library(data.table)
dcast(setDT(df1), sample_name + match_... ~ gene_title, value.var = 'scaffold')
# sample_name match_... aac(3)-Ib aadA1 aadA12 ...
#1: 401B_ST5223 99.65 NA NA
#2: 405C_ST1178 99.75 NA 120
#3: 4D_ST767 98.95 NA NA
#4: 5A_ST34 99.88 NA NA
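Both answers above keep the match column, so a sample with several genes would still occupy several rows. If you want exactly one row per sample, as in the desired output, a base R sketch with tapply could look like this. The data frame blast is a tiny stand-in, and its second row is invented so that one sample has two genes.

```r
# Tiny stand-in for the BLAST table; the "sul1" row for P24_ST48 is made up
blast <- data.frame(
  sample_name = c("P24_ST48", "P24_ST48", "401B_ST5223"),
  scaffold    = c(64, 12, 381),
  gene_title  = c("aadA12", "sul1", "blaTEM-163"),
  stringsAsFactors = FALSE
)

# Rows are samples, columns are genes; cells with no hit are NA.
# If a sample/gene pair occurs more than once, the first scaffold is kept.
wide <- with(blast, tapply(scaffold, list(sample_name, gene_title), function(v) v[1]))
wide
```

The result is a matrix; as.data.frame(wide) turns it into a data frame if you need one.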

Transforming long format data to short format by segmenting dates that include redundant observations

I have a data set that is long format and includes exact date/time measurements of 3 scores on a single test administered between 3 and 5 times per year.
ID Date Fl Er Cmp
1 9/24/2010 11:38 15 2 17
1 1/11/2011 11:53 39 11 25
1 1/15/2011 11:36 39 11 39
1 3/7/2011 11:28 95 58 2
2 10/4/2010 14:35 35 9 6
2 1/7/2011 13:11 32 7 8
2 3/7/2011 13:11 79 42 30
3 10/12/2011 13:22 17 3 18
3 1/19/2012 14:14 45 15 36
3 5/8/2012 11:55 29 6 11
3 6/8/2012 11:55 74 37 7
4 9/14/2012 9:15 62 28 18
4 1/24/2013 9:51 82 45 9
4 5/21/2013 14:04 135 87 17
5 9/12/2011 11:30 98 61 18
5 9/15/2011 13:23 55 22 9
5 11/15/2011 11:34 98 61 17
5 1/9/2012 11:32 55 22 17
5 4/20/2012 11:30 23 4 17
I need to transform this data to short format with time bands based on month (i.e. Fall=August-October; Winter=January-February; Spring=March-May). Some bands will include more than one observation per participant, and as such, will need a "spill over" band. An example transformation for the Fl scores below.
ID Fall1Fl Fall2Fl Winter1Fl Winter2Fl Spring1Fl Spring2Fl
1 15 NA 39 39 95 NA
2 35 NA 32 NA 79 NA
3 17 NA 45 NA 28 74
4 62 NA 82 NA 135 NA
5 98 55 55 NA 23 NA
Notice that dates which are "redundant" (i.e. more than 1 Aug-Oct observation) spill over into the Fall2Fl column. Dates that occur outside of the desired bands (i.e. November, December, June, July) should be deleted. The final data set should have additional columns that cover Fl, Er, and Cmp.
Any help would be appreciated!
(Link to .csv file with long data http://mentor.coe.uh.edu/Data_Example_Long.csv )
This seems to do what you are looking for, but doesn't exactly match your desired output. I haven't looked at your sample data to see whether the problem lies with your sample desired output or the transformations I've done, but you should be able to follow along with the code to see how the transformations were made.
## month() and year() below are assumed to come from lubridate; dcast from reshape2
library(lubridate)
library(reshape2)
## Convert dates to actual date formats
mydf$Date <- strptime(gsub("/", "-", mydf$Date), format="%m-%d-%Y %H:%M")
## Factor the months so we can get the "seasons" that you want
Months <- factor(month(mydf$Date), levels=1:12)
levels(Months) <- list(Fall = c(8:10),
                       Winter = c(1:2),
                       Spring = c(3:5),
                       Other = c(6, 7, 11, 12))
mydf$Seasons <- Months
## Drop the "Other" seasons
mydf <- mydf[mydf$Seasons != "Other", ]
## Add a "Year" column
mydf$Year <- year(mydf$Date)
## Add a "Times" column (running count within each ID and calendar year)
mydf$Times <- as.numeric(ave(as.character(mydf$Seasons),
                             mydf$ID, mydf$Year, FUN = seq_along))
## Use `dcast` on just one variable.
## Repeat for other variables by changing the "value.var"
## ("Fluency" is the column name used here; the sample in the question calls it Fl)
dcast(mydf, ID ~ Seasons + Times, value.var="Fluency")
# ID Fall_1 Fall_2 Winter_1 Winter_2 Spring_2 Spring_3
# 1 1 15 NA 39 39 NA 95
# 2 2 35 NA 32 NA 79 NA
# 3 3 17 NA 45 NA 29 NA
# 4 4 62 NA 82 NA 135 NA
# 5 5 98 55 55 NA 23 NA
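The Spring_2 and Spring_3 columns appear because the running count restarts per ID and calendar year rather than per ID and season. A base R sketch that counts within each ID and season instead is shown below. It uses a few rows copied from the question and assumes the rows are already in date order and that each ID covers a single school year; the names long, Band, and wide are my own.

```r
# A few rows copied from the question (only the Fl score is kept)
long <- data.frame(
  ID   = c(1, 1, 1, 1, 5, 5, 5),
  Date = c("9/24/2010 11:38", "1/11/2011 11:53", "1/15/2011 11:36", "3/7/2011 11:28",
           "9/12/2011 11:30", "9/15/2011 13:23", "11/15/2011 11:34"),
  Fl   = c(15, 39, 39, 95, 98, 55, 98),
  stringsAsFactors = FALSE
)

# Month number from the date string, then the season band for each month
mon <- as.integer(format(as.Date(long$Date, format = "%m/%d/%Y"), "%m"))
long$Season <- c("Winter", "Winter", "Spring", "Spring", "Spring", NA, NA,
                 "Fall", "Fall", "Fall", NA, NA)[mon]
long <- long[!is.na(long$Season), ]      # drop June, July, November, December

# Running count within each ID and season gives the "spill over" index
long$Times <- ave(seq_along(long$ID), long$ID, long$Season, FUN = seq_along)
long$Band  <- paste0(long$Season, long$Times, "Fl")

# One row per ID, one column per band; empty cells are NA
wide <- with(long, tapply(Fl, list(ID, Band), function(v) v[1]))
wide
```

Repeating the last two steps with Er and Cmp in place of Fl would give the remaining columns.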
