merge .csvs based on common column but of inconsistent length - r

Afternoon (or morning, evening)
I am trying to merge several .csv files that share a layout: a class in one column (character) and an abundance (num) in another.
When imported as a data.frame, an example looks like this:
print(one[1:5,])
X Class Abundance_inds
1 1 Chaetognath 2
2 2 Copepod_Calanoid_Acartia_spp 9
3 3 Copepod_Calanoid_Centropages_spp 4
4 4 Copepod_Calanoid_Temora_spp 1
5 5 Copepod_Calanoid_Unknown 55
The Class column (its number of rows and their order) changes in every CSV depending on what was found. I want to combine several (30+) CSVs by the Class column. I had the following, which I am sure was working a while ago:
DensityFiles <- list.files(CSVdirectory,
                           pattern = '\\.csv$',  # regex: file names ending in .csv
                           full.names = TRUE)
Combined <- rbindlist(
  lapply(DensityFiles, fread),
  fill = TRUE,
  use.names = TRUE)
This produces the following:
str(Combined)
Classes ‘data.table’ and 'data.frame': 461 obs. of 3 variables:
Not quite what I was after! I am looking for the following:
> print(example)
X Class CSV.NAME CSV.NAME.1
1 1 Bivalve_Larvae 1 3
2 2 Bryozoa_Larvae 4 6
3 3 Chaetognath NA 7
4 4 Cnidaria 1 8
5 5 Copepod_Calanoid_Acartia_spp 22 NA
6 6 Copepod_Calanoid_Calanus_spp 24 4
7 7 Copepod_Calanoid_Candacia_sp 5 3
8 8 Copepod_Calanoid_Centropages_spp 41 2
9 9 Copepod_Calanoid_Temora_spp 39 8
10 10 Copepod_Calanoid_Unknown 458 NA
11 11 Copepod_Cyclopoid_Corycaeus_spp 46 NA
12 12 Copepod_Cyclopoid_Oithona_spp NA 4
13 13 Copepod_Cyclopoid_Oncaea_spp NA 7
14 14 Copepod_Harpacticoid 36 NA
15 15 Copepod_Nauplii 12 9
I can record the CSV name for each row using idcol = "origin" in data.table's rbindlist, but I am not sure whether this works for all solutions.
I have had a good hunt around, but most examples seem to deal with a consistent number of rows.
Any help would be greatly appreciated!
Jim

You can use read_csv from readr together with bind_rows from dplyr:
library(dplyr)
library(readr)
df <- do.call(bind_rows, lapply(DensityFiles,read_csv))
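Note that bind_rows stacks the files on top of each other, so the result is long rather than the wide table shown in the question. One possible way to get from the long form to the wide form, sketched here with data.table (which the question already uses), is to record each file's name with idcol and then reshape with dcast. The directory, the file names (sampleA.csv, sampleB.csv) and the values are made up for illustration; only the column names Class and Abundance_inds come from the question.

```r
library(data.table)

# Hypothetical demo data: two small CSVs written to a temporary directory
csv_dir <- file.path(tempdir(), "density_demo")
dir.create(csv_dir, showWarnings = FALSE)
fwrite(data.table(Class = c("Chaetognath", "Cnidaria"),
                  Abundance_inds = c(2, 1)),
       file.path(csv_dir, "sampleA.csv"))
fwrite(data.table(Class = c("Cnidaria", "Copepod_Nauplii"),
                  Abundance_inds = c(8, 9)),
       file.path(csv_dir, "sampleB.csv"))

files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
# Name each path after its file so that idcol can carry the name along
names(files) <- tools::file_path_sans_ext(basename(files))

# Long form: one row per (file, Class) pair, file name in column "origin"
long <- rbindlist(lapply(files, fread),
                  idcol = "origin", use.names = TRUE, fill = TRUE)

# Wide form: one row per Class, one column per file, NA where a class is absent
wide <- dcast(long, Class ~ origin, value.var = "Abundance_inds")
wide
```

This assumes each Class appears at most once per file; if it can repeat, dcast needs an aggregation function such as fun.aggregate = sum.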

Related

How to combine/concatenate two dataframes one after the other but not merging common columns in R?

Suppose there are two dataframes as follows with same column names and I want to combine/concatenate one after the other without merging the common columns. There is a way of assigning it columnwise like df1[3]<-df2[1] but would like to know if there's some other way.
df1<-data.frame(A=c(1:10), B=c(2:5, rep(NA,6)))
df2<-data.frame(A=c(12:20), B=c(32:40))
Expected Output:
A B A.1 B.1
1 2 12 32
2 3 13 33
3 4 14 34
4 5 15 35
5 NA 16 36
6 NA 17 37
7 NA 18 38
8 NA 19 39
9 NA 20 40
10 NA NA NA
I tend to work with multiple frames like this as a list of frames. Try this:
LOF <- list(df1, df2)
maxrows <- max(sapply(LOF, nrow))
out <- do.call(cbind, lapply(LOF, function(z) z[seq_len(maxrows),]))
names(out) <- make.names(names(out), unique = TRUE)
out
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
One advantage of this is that it allows you to work with an arbitrary number of frames, not just two.
One base R way could be
setNames(Reduce(cbind.data.frame,
Map(`length<-`, c(df1, df2), max(nrow(df1), nrow(df2)))),
paste0(names(df1), rep(c('', '.1'), each=2)))
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
Another option is to use the merge function. The documentation can be a bit cryptic, so here is a short explanation of the arguments:
by -- "the name "row.names" or the number 0 specifies the row names", so by = 0 merges on the row names
all = TRUE -- keeps all rows from both data frames, filling the gaps with NA
suffixes -- specifies how the duplicated column names are distinguished
sort = FALSE -- does not re-sort the result on the by column
merge(df1, df2, by = 0, all = TRUE, suffixes = c('', '.1'), sort = FALSE)
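Merging on row names adds a Row.names column to the result, and the row order is not guaranteed to be 1, 2, ..., 10. A minimal sketch of one way to tidy that up, using df1 and df2 from the question (the clean-up steps are an addition, not part of the answer above):

```r
df1 <- data.frame(A = 1:10, B = c(2:5, rep(NA, 6)))
df2 <- data.frame(A = 12:20, B = 32:40)

m <- merge(df1, df2, by = 0, all = TRUE, suffixes = c("", ".1"))

# Row.names holds the row names as text, so order on their numeric value
m <- m[order(as.integer(as.character(m$Row.names))), ]

# Drop the helper column and reset the row names
m$Row.names <- NULL
rownames(m) <- NULL
m
```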
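One way, padding df2 with all-NA rows up to the length of df1 (this assumes df1 has at least as many rows as df2), would be
cbind(
  df1,
  rbind(
    df2,
    # indexing with NA yields one all-NA row per missing row
    df2[rep(NA_integer_, nrow(df1) - nrow(df2)), ]
  )
)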

Sum each four rows in a column

I'm a beginner in R and I really need your help. I'm trying to get a new dataframe that stores the sum of each five rows in my columns.
For example, I have a dataframe (delta) with two columns (A,B)
A B
2 3
1 2
3 2
4 5
3 7
5 6
2 5
and the output I'm looking for is
AA BB
13 19
16 22
17 25
where
13 = row1+row2+row3+row4+row5
16 = row2+row3+row4+row5+row6
and so on ...
I have no idea where to start. Thanks a lot for your help guys.
The subject refers to 4 rows but the example in the question refers to 5. We have used 5 below but if you intended 4 just replace 5 with 4 in the code.
1) rollsum. Using the reproducible input in the Note at the end, use rollsum from zoo. Omit the as.data.frame if a matrix is acceptable as output.
library(zoo)
as.data.frame(rollsum(DF, 5))
## A B
## 1 13 19
## 2 16 22
## 3 17 25
2) filter. filter in base R works too. Note that if you have dplyr loaded, it masks filter, so in that case use stats::filter in place of filter to ensure you get the correct version.
setNames(as.data.frame(na.omit(filter(DF, rep(1, 5)))), names(DF))
## A B
## 1 13 19
## 2 16 22
## 3 17 25
Note
Lines <- "
A B
2 3
1 2
3 2
4 5
3 7
5 6
2 5"
DF <- read.table(text = Lines, header = TRUE)
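To spell out what these rolling sums compute, here is a base R sketch that forms each window explicitly. It is slower than rollsum or filter and is meant only as an illustration; the names k, starts and roll are made up for this example.

```r
DF <- data.frame(A = c(2, 1, 3, 4, 3, 5, 2),
                 B = c(3, 2, 2, 5, 7, 6, 5))

k <- 5                                # window width
starts <- seq_len(nrow(DF) - k + 1)   # first row of each window: 1, 2, 3

# For each start row, sum the k consecutive rows of every column
roll <- as.data.frame(t(sapply(starts, function(i) {
  colSums(DF[i:(i + k - 1), ])
})))
roll
##    A  B
## 1 13 19
## 2 16 22
## 3 17 25
```

This sketch assumes there are at least two complete windows; with a single window sapply returns a plain vector and the transpose would need adjusting.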
Here is a data.table option using frollsum, where df is the input data frame, e.g.,
> na.omit(setDT(df)[,lapply(.SD,frollsum,5)])
A B
1: 13 19
2: 16 22
3: 17 25
or
> na.omit(setDT(df)[,setNames(frollsum(.SD,5),names(.SD))])
A B
1: 13 19
2: 16 22
3: 17 25

R, How to set row names attribute as numeric from character?

I'm new to R and would like to know how I can make the row names attribute numeric.
I was trying to sort a data frame by its row names with
df[ order(row.names(df)),]
and the result was like
A B C D E
1 13 6 4 4 3
10 16 5 3 8 3
100 6 4 12 14 5
101 2 14 15 3 10
102 5 2 2 9 5
103 9 1 12 3 15
104 15 1 1 8 2
105 2 10 14 7 4
106 6 2 10 2 9
107 3 1 1 3 22
108 11 4 1 6 15
109 4 29 2 6 2
11 6 29 1 4 1
I have tried
row.names(df) <- attr(df, "row.names")
row.names(df) <- as.numeric(row.names(df))
But when I check the row names again, they come back as
[1] "character" "vector" "data.frameRowLabels" "SuperClassMethod"
I don't know what to do.. Please help me
From R's help on ?row.names:
All data frames have a row names attribute, a character vector of length the number of rows with no duplicates nor missing values.
This means that row.names will always return a character vector. You would need to use workarounds, as suggested in the comments, to make the row names "usable" as integers, which basically means coercing them every time. One suggestion is to create an id column of class integer and not use row.names as the id:
df$id <- as.integer(row.names(df))
df[order(df$id), ]
Dropping row names also seems to be the direction taken by popular reimplementations of the data frame such as data.table and tibble, neither of which uses row names.
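If the row names have to stay, the coercion can also be done inside order. A small sketch with made-up values (only the row names mirror the question):

```r
# Hypothetical data; row names are stored as text, as in the question
df <- data.frame(A = c(13, 16, 6, 6, 2),
                 B = c(6, 5, 4, 29, 14),
                 row.names = c("1", "10", "100", "11", "2"))

# Coerce the row names to numbers only for the purpose of ordering
sorted <- df[order(as.numeric(row.names(df))), , drop = FALSE]
row.names(sorted)
# [1] "1"   "2"   "10"  "11"  "100"
```

The row names themselves are still character; only the sort key is numeric.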

How to give a "/" in a column name to a dataframe in R?

I wish to put a "/" (forward slash) in a column name of a data frame. Any idea how?
I tried the following, to no avail:
tmp1 <- data.frame("Cost/Day"=1:10,"Days"=11:20)
tmp1
Cost.Day Days
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
I then tried this, and it worked:
tmp <- data.frame(1:10,11:20)
colnames(tmp) <- c("Cost/Day","Days")
tmp
Cost/Day Days
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
I would prefer to give the name while constructing the data frame itself. I tried escaping the slash, but it still didn't work.
tmp2 <- data.frame("Cost\\/Day"=1:10,"Days"=11:20)
tmp2
You can use check.names=FALSE in data.frame. It is TRUE by default, and when it is TRUE, the function make.names changes the column names, i.e.
make.names('Cost/Day')
#[1] "Cost.Day"
So, try
dat <- data.frame("Cost/Day"=1:10,"Days"=11:20, check.names=FALSE)
head(dat,2)
# Cost/Day Days
#1 1 11
#2 2 12
The specific lines in the data.frame function that change the column names are
--------
if (check.names)
vnames <- make.names(vnames, unique = TRUE)
names(value) <- vnames
--------
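Once a column name contains a "/", it is no longer a syntactic R name, so it has to be quoted with backticks or passed as a string when it is used. A short sketch (the three-row data frame is made up for illustration):

```r
dat <- data.frame("Cost/Day" = 1:3, "Days" = 11:13, check.names = FALSE)

dat$`Cost/Day`                      # backticks with $
dat[["Cost/Day"]]                   # a plain string with [[ ]]
total <- with(dat, `Cost/Day` * Days)
total
# [1] 11 24 39
```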

How to merge data correctly

I'm trying to merge 7 complete data frames into one large wide data frame. I figured I have to do this stepwise: merge 2 frames into 1, then merge that frame with another, and so forth until all 7 original frames become one.
fil2005: "ID" "abr_2005" "lop_2005" "ins_2005"
fil2006: "ID" "abr_2006" "lop_2006" "ins_2006"
The variables "abr_2006" "lop_2006" "ins_2006" and their 2005 counterparts are all either 0 or 1.
Now the thing is, I want to either merge or do a dcast of some sort (I think) to turn these two long data frames into one wide data frame where both "abr_2005" "lop_2005" "ins_2005" and "abr_2006" "lop_2006" "ins_2006" are in the final file.
When I try
fil_2006.1 <- merge(x=fil_2005, y=fil_2006, by="ID__", all.y=T)
all the variables ending in _2005 are saved to fil_2006.1, but the variables ending in _2006 are not.
I'm apparently doing something wrong. Any idea?
Is there a reason you put those underscores after ID in by="ID__"? Otherwise, the code you provided will work.
An example:
# create data frames of differing lengths (the 10 IDs in dat2 are recycled to fill 20 rows)
dat1 <- data.frame("ID"=seq(1,20,by=2),"varx2005"=1:10, "vary2005"=2:11)
dat2 <- data.frame("ID"=5:14,"varx2006"=1:20, "vary2006"=21:40)
head(dat1)
ID varx2005 vary2005
1 1 1 2
2 3 2 3
3 5 3 4
4 7 4 5
5 9 5 6
6 11 6 7
head(dat2)
ID varx2006 vary2006
1 5 1 21
2 6 2 22
3 7 3 23
4 8 4 24
5 9 5 25
6 10 6 26
merged <- merge(dat2, dat1, by="ID", all.y=TRUE)  # all.y keeps every row of dat1
head(merged)
ID varx2006 vary2006 varx2005 vary2005
1 1 NA NA 1 2
2 3 NA NA 2 3
3 5 1 21 3 4
4 5 11 31 3 4
5 7 13 33 4 5
6 7 3 23 4 5
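The stepwise merging described in the question (2 frames into 1, then the result with the next, and so on) can be written in one call with Reduce. A minimal sketch under stated assumptions: the three small frames and their values are made up, the key column is assumed to be named ID in every frame, and all = TRUE is used so that no ID is dropped.

```r
# Hypothetical yearly frames sharing the key column ID
fil2005 <- data.frame(ID = 1:3,        abr_2005 = c(0, 1, 0))
fil2006 <- data.frame(ID = 2:4,        abr_2006 = c(1, 1, 0))
fil2007 <- data.frame(ID = c(1L, 4L),  abr_2007 = c(0, 1))

frames <- list(fil2005, fil2006, fil2007)

# Reduce merges the first two frames, then merges that result with the third, etc.
wide <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), frames)
wide
```

With 7 frames, only the list changes; the Reduce call stays the same.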
