How to merge tables with different column headers in a loop [duplicate]

This question already has answers here: Combine two data frames by rows (rbind) when they have different sets of columns.
I have a for loop that goes through a specific column in several CSV files (each file is a different run of the same class) and retrieves the count of each value. For example, in the first file (first run):
0 1 67
101 622 277
In the second run:
0 1 67 68
109 592 297 2
In the third run:
0 1 67
114 640 246
Note that each run may produce a different set of values (the second run includes one extra value, 68). I would like to merge all these results into one table and then write it to a CSV file. To do that, I did the following:
files <- list.files("/home/adam/Desktop/runs", pattern = "*.csv",
                    recursive = TRUE, full.names = TRUE, include.dirs = TRUE)
all <- list()
col <- 14
for (j in 1:length(files)) {
  dataset <- read.csv(files[j])
  uniqueValues <- table(dataset[, col]) # this generates the examples shown above
  all <- rbind(uniqueValues)            # note: this overwrites 'all' on every pass
}
write.table(all, "all.csv", col.names = TRUE, sep = ",")
The result in all is then only the counts from the last run:
0 1 67
114 640 246
How can I solve this?
The expected result is:
0 1 67 68
101 622 277 0
109 592 297 2
114 640 246 0

This was marked as a potential duplicate (see the link at the top). Following the approach from the linked question, plyr::rbind.fill() handles the differing columns:
library(plyr)
df1 <- data.frame(A0 = 101,
                  A1 = 622,
                  A67 = 277)
df2 <- data.frame(A0 = 109,
                  A1 = 592,
                  A67 = 297,
                  A68 = 2)
df3 <- data.frame(A0 = 114,
                  A1 = 640,
                  A67 = 246)
newds <- rbind.fill(df1, df2, df3)
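Applied to the original loop, the same idea looks roughly like this (a sketch, reusing the path and column index from the question): each table() result becomes a one-row data frame, rbind.fill() aligns the differing columns, and the NAs it introduces are replaced with 0 to match the expected output.
library(plyr)

files <- list.files("/home/adam/Desktop/runs", pattern = "*.csv",
                    recursive = TRUE, full.names = TRUE)
col <- 14
rows <- lapply(files, function(f) {
  counts <- table(read.csv(f)[, col])
  # one-row data frame whose column names are the observed values
  setNames(as.data.frame(t(as.vector(counts))), names(counts))
})
result <- rbind.fill(rows)    # rbind.fill() also accepts a list of data frames
result[is.na(result)] <- 0    # a missing value means a count of zero
write.table(result, "all.csv", col.names = TRUE, sep = ",")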


for loop question in R with rbind or do.call

A VERY simplified example of my dataset:
HUC8 YEAR RO_MM
1: 10010001 1961 78.2
2: 10010001 1962 84.0
3: 10010001 1963 70.2
4: 10010001 1964 130.5
5: 10010001 1965 54.3
I found this code online which sort of, but not quite, does what I want:
library(data.table)

# create a list of the files from your target directory
file_list <- list.files(path = "~/Desktop/Rprojects")

# initiate a blank data frame; each iteration of the loop will append
# the data from the given file to this variable
allHUCS <- data.frame()

# I want to read each .csv from a folder named "Rprojects" on my desktop
# into one huge dataframe for further use.
for (i in 1:length(file_list)) {
  temp_data <- fread(file_list[i], stringsAsFactors = F)
  allHUCS <- rbindlist(list(allHUCS, temp_data), use.names = T)
}
Question: I have read that one should not use rbindlist for a large dataset:
"You should never ever ever iteratively rbind within a loop: performance might be okay in the beginning, but with each call to rbind it makes a complete copy of the data, so with each pass the total data to copy increases. It scales horribly. Consider do.call(rbind.data.frame, file_list)." – #r2evans
I know this may seem simple but I'm unclear about how to use his directive. Would I write this for the last line?
allHUCS <- do.call(rbind.data.frame(allHUCS, temp_data), use.names = T)
Or something else? In my actual data, each .csv has 2099 objects with 3 variables (but I only care about the last two.) The total dataframe should contain 47,000,000+ objects of 2 variables. When I ran the original code I got these errors:
Error in rbindlist(list(allHUCS, temp_data), use.names = T) :
  Item 2 has 2 columns, inconsistent with item 1 which has 3 columns.
  To fill missing columns use fill=TRUE.
In addition: Warning messages:
1: In fread(file_list[i], stringsAsFactors = F) :
  Detected 1 column names but the data has 2 columns (i.e. invalid file).
  Added 1 extra default column name for the first column which is guessed
  to be row names or an index. Use setnames() afterwards if this guess is
  not correct, or fix the file write command that created the file to
  create a valid file.
2: In fread(file_list[i], stringsAsFactors = F) :
  Stopped early on line 20. Expected 2 fields but found 3. Consider
  fill=TRUE and comment.char=. First discarded non-empty line:
  <<# mv *.csv .. ; >>
Except for the setnames() suggestion, I don't understand what I'm being told. I know it says it stopped early, but I don't even know how to see the entire dataset or to tell where it stopped.
I'm now reading that rbindlist() and rbind() are two different things, and that rbindlist() is faster than do.call(rbind, data). But the suggestion was do.call(rbind.data.frame, file_list). Which is going to be fastest?
Since the original post does not include a reproducible example, here is one that reads data from the Pokémon Stats data that I maintain on Github.
First, we download a zip file containing one CSV file for each generation of Pokémon, and unzip it to the ./pokemonData subdirectory of the R working directory.
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/PokemonData.zip",
              "pokemonData.zip",
              method = "curl", mode = "wb")
unzip("pokemonData.zip", exdir = "./pokemonData")
Next, we obtain a list of files in the directory to which we unzipped the CSV files.
thePokemonFiles <- list.files("./pokemonData",
                              full.names = TRUE)
Finally, we load the data.table package, use lapply() with data.table::fread() to read the files, combine the resulting list of data tables with do.call(), and print the head() and tail() of the resulting data frame with all 8 generations of Pokémon stats.
library(data.table)
data <- do.call(rbind, lapply(thePokemonFiles, fread))
head(data)
tail(data)
...and the output:
> head(data)
ID Name Form Type1 Type2 Total HP Attack Defense Sp. Atk Sp. Def Speed
1: 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
2: 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
3: 3 Venusaur Grass Poison 525 80 82 83 100 100 80
4: 4 Charmander Fire 309 39 52 43 60 50 65
5: 5 Charmeleon Fire 405 58 64 58 80 65 80
6: 6 Charizard Fire Flying 534 78 84 78 109 85 100
Generation
1: 1
2: 1
3: 1
4: 1
5: 1
6: 1
> tail(data)
ID Name Form Type1 Type2 Total HP Attack Defense Sp. Atk
1: 895 Regidrago Dragon 580 200 100 50 100
2: 896 Glastrier Ice 580 100 145 130 65
3: 897 Spectrier Ghost 580 100 65 60 145
4: 898 Calyrex Psychic Grass 500 100 80 80 80
5: 898 Calyrex Ice Rider Psychic Ice 680 100 165 150 85
6: 898 Calyrex Shadow Rider Psychic Ghost 680 100 85 80 165
Sp. Def Speed Generation
1: 50 80 8
2: 110 30 8
3: 80 130 8
4: 80 80 8
5: 130 50 8
6: 100 150 8
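For files like those in the original question, whose column sets differ, the same read-into-a-list pattern works with data.table::rbindlist() and fill = TRUE; a sketch, reusing the path from the question:
library(data.table)

file_list <- list.files(path = "~/Desktop/Rprojects", pattern = "\\.csv$",
                        full.names = TRUE)

# read every file into a list first...
tables <- lapply(file_list, fread, stringsAsFactors = FALSE)

# ...then combine once at the end; fill = TRUE pads columns that are
# missing from some files, and nothing is copied repeatedly in a loop
allHUCS <- rbindlist(tables, use.names = TRUE, fill = TRUE)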

Loop only iterates the first 4 rows

My for loop only iterates over the first 4 rows of the R dataframe. I read several similar postings and tried the suggested approaches, but none worked. Any help is appreciated.
df_total <- list()
for (i in 1:length(df_test)) {
  df <- recover(df_test[i, ], "PI", 1)
  df$i <- i
  df_total[[i]] <- df
}
big_data <- do.call(rbind, df_total)
row_1 row_2 correct incorrect newrow1 newrow2
56245270 8549 9949 71 3 8550 9950
9332380 896 9949 71 1 897 9950
14783792 1460 4943 70 2 1461 4944
41437670 4943 10388 70 0 4944 10389
9323891 896 1460 70 2 897 1461
Note that length(df) gives you the number of columns of a data.frame, not the number of rows, which is why your loop runs for only a few iterations. If you want the number of rows, use nrow(df). Ideally you would use seq_len(nrow(df)) to generate the index for a for loop over the rows of a data.frame.
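A corrected sketch of the loop (df_test and recover() are taken from the question): index the rows with seq_len(nrow(df_test)) instead of 1:length(df_test).
df_total <- list()
for (i in seq_len(nrow(df_test))) {
  df <- recover(df_test[i, ], "PI", 1)
  df$i <- i
  df_total[[i]] <- df
}
big_data <- do.call(rbind, df_total)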

How to lapply grep() data by id

I have a df RawDat with two columns, ID and data. I want to grep() the data by ID, using e.g. lapply(), to generate a new df where the data is sorted into columns by ID.
My df looks like this, except that I have >80,000 rows and 75 ids:
ID data
abl 564
dlh 78
vho 354
mez 15
abl 662
dlh 69
vho 333
mez 9
.
.
.
I can manually extract the data using the grep() function:
ExtRawDat = as.data.frame(RawDat[grep("abl",RawDat$ID),])
However, I would not want to do that 75 times and cbind() them. Rather, I would like to use the lapply() function to automate it. I have tried several variations of the following code, but I don't get a script that provides the desired output.
I have a vector ProLisV with the 75 ids to loop over:
ExtRawDat <- as.data.frame(lapply(ProLisV[1:75], function(x) {
  # The issue is here: the pattern is not properly defined by the x input
  # (is it detrimental that some of the names in the list have spaces etc.?)
  Temp1 <- RawDat[grep(x, RawDat$ID), ]
  Values <- as.data.frame(Temp1$data)
  list(Values$data)
}))
The desired output looks like this:
abl dlh vho mez ...
564 78 354 15
662 69 333 9
.
.
.
How do I adjust that function to provide the desired output? Thank you.
It looks like what you are trying to do is to convert your data from long form to wide form. One way to do this easily is to use the spread function from the tidyr package. To use it, we need a column to remove duplicate identifiers, so we'll first add a grouping variable:
n.ids <- 4  # with your full data this should be 75
# one group label per consecutive block of n.ids rows
df$group <- rep(seq_len(ceiling(nrow(df) / n.ids)), each = n.ids,
                length.out = nrow(df))
tidyr::spread(df, ID, data)
# group abl dlh mez vho
# 1 1 564 78 15 354
# 2 2 662 69 9 333
If you don't want the group column at the end, just do df$group <- NULL.
Data
df <- read.table(text = "
ID data
abl 564
dlh 78
vho 354
mez 15
abl 662
dlh 69
vho 333
mez 9", header = T)

Import multiple txt files into one data frame and use part of file names as "id"

I have a directory of text files named using the following convention: "Location[A-Z]_House[0-15]_Day[0-15].txt", so an example is LA_H05_D14.txt. Is there a way of splitting the names such that they can be made a factor? More specifically, I would like to use the letter [A-Z] that comes after "Location": e.g. LB_H01_D01.txt would be location "B", and all data belonging to location B would be labelled "B".
I have imported all the data from the files into one data frame:
l <- list.files(pattern = "txt$", full.names = TRUE)
library(dplyr)
Df <- bind_rows(lapply(l, function(i) {
  temp <- read.table(i, stringsAsFactors = FALSE, sep = ";")
  setNames(temp, c("Date", "Time", "Timestamp", "PM2_5(ug/m3)", "AQI(US)",
                   "AQI(CN)", "PM10(ug/m3)", "Outdoor AQI(US)", "Outdoor AQI(CN)",
                   "Temperature(C)", "Temperature(F)", "Humidity(%RH)",
                   "CO2(ppm)", "VOC(ppb)"))
}), .id = "id")
The data looks like this with an "id" column:
head(Df)
id Date Time Timestamp PM2_5(ug/m3) AQI(US) AQI(CN) PM10(ug/m3) Outdoor AQI(US) Outdoor AQI(CN) Temperature(C) Temperature(F)
1 1 2017/10/17 20:31:38 1508272298 102.5 175 135 512 0 0 30 86.1
2 1 2017/10/17 20:31:48 1508272308 93.6 171 124 477 0 0 30 86.1
3 1 2017/10/17 20:31:58 1508272318 98.0 173 129 397 0 0 30 86.0
4 1 2017/10/17 20:32:08 1508272328 98.0 173 129 422 0 0 30 86.0
5 1 2017/10/17 20:32:18 1508272338 104.3 176 137 466 0 0 30 86.0
6 1 2017/10/17 20:32:28 1508272348 101.6 175 134 528 0 0 30 86.0
Humidity(%RH) CO2(ppm) VOC(ppb)
1 43 466 -1
2 43 467 -1
3 42 468 -1
4 42 469 -1
5 42 471 -1
6 42 471 -1
Independent of the issue concerning the content of the id column you might use the following code to extract the information from the filenames:
# you may use the original filenames
filenames <- basename(l)
# or the content of the id column (if you have read the filenames into the Df)
filenames <- as.character(Df$id)
# for demonstration, here a definition of exemplary filenames
filenames <- c("LA_H01_D01.txt",
               "LA_H02_D02.txt",
               "LD_H01_D14.txt",
               "LD_H01_D15.txt")

filenames <- gsub("_H|_D", "_", filenames)
filenames <- gsub(".txt|^L", "", filenames)
fileinfo <- as.data.frame(do.call(rbind, strsplit(filenames, "_")))
colnames(fileinfo) <- c("Location", "House", "Day")
fileinfo[, c("House", "Day")] <- apply(fileinfo[, c("House", "Day")], 2, as.numeric)
#   Location House Day
# 1        A     1   1
# 2        A     2   2
# 3        D     1  14
# 4        D     1  15

# add the information to your Df as new columns
Df <- cbind(Df, fileinfo)
# the whole thing as a function used in your data import
add_fileinfo <- function(df, filename) {
  filename <- gsub("_H|_D", "_", filename)
  filename <- gsub(".txt|^L", "", filename)
  fileinfo <- as.data.frame(do.call(rbind, strsplit(filename, "_")))
  colnames(fileinfo) <- c("Location", "House", "Day")
  fileinfo[, c("House", "Day")] <- apply(fileinfo[, c("House", "Day")], 2, as.numeric)
  cbind(df, fileinfo[rep(seq_len(nrow(fileinfo)), each = nrow(df)), ])
}

Df <- bind_rows(lapply(l, function(i) {
  temp <- read.table(i, stringsAsFactors = FALSE, sep = ";")
  temp <- setNames(temp, c("Date", "Time", "Timestamp", "PM2_5(ug/m3)", "AQI(US)",
                           "AQI(CN)", "PM10(ug/m3)", "Outdoor AQI(US)",
                           "Outdoor AQI(CN)", "Temperature(C)", "Temperature(F)",
                           "Humidity(%RH)", "CO2(ppm)", "VOC(ppb)"))
  add_fileinfo(temp, i)   # note: the setNames() result must be assigned back
}), .id = "id")
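An alternative sketch (not part of the original answer) extracts just the location letter with a capture group on the basenames and uses the id column that bind_rows() created to map each row back to its file:
# "^L([A-Z])_H\\d+_D\\d+\\.txt$" captures the letter after "Location";
# Df$id is the list index from bind_rows(), so it indexes into l
loc <- sub("^L([A-Z])_H\\d+_D\\d+\\.txt$", "\\1", basename(l))
Df$Location <- factor(loc[as.integer(Df$id)])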
Something like this (generic) solution should get you going.
mydata1 = read.csv(path1, header=T)
mydata2 = read.csv(path2, header=T)
Then, merge
myfulldata = merge(mydata1, mydata2)
As long as mydata1 and mydata2 have at least one common column with an identical name (that allows matching observations in mydata1 to observations in mydata2), this will work like a charm. It also takes three lines.
What if I have 20 files with data that I want to match observation-to-observation? Assuming they all have a common column that allows merging, I would still have to read 20 files in (20 lines of code) and merge() works two-by-two… so I could merge the 20 data frames together with 19 merge statements like this:
mytempdata = merge(mydata1, mydata2)
mytempdata = merge(mytempdata, mydata3)
.
.
.
mytempdata = merge(mytempdata, mydata20)
That’s tedious. You may be looking for a simpler way. If you are, I wrote a function to solve your woes called multmerge(). Here’s the code to define the function:
multmerge <- function(mypath) {
  filenames <- list.files(path = mypath, full.names = TRUE)
  datalist <- lapply(filenames, function(x) {read.csv(file = x, header = TRUE)})
  Reduce(function(x, y) {merge(x, y)}, datalist)
}
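After running the definition, usage is one line; the directory path here is a made-up example:
mymergeddata <- multmerge("~/Desktop/data")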
Here is a good resource that should help you out.
https://stats.idre.ucla.edu/r/codefragments/read_multiple/

Subset Columns based on partial matching of column names in the same data frame

I would like to understand how to subset multiple columns from the same data frame by matching the first 5 letters of the column names and, if they are equal, storing the matching columns in a new variable.
Here is a small example of my required output.
Let's say the data frame is eatable:
fruits_area fruits_production vegetables_area vegetable_production
         12               100              26                  324
         33               250              40                  580
        660               510              43                  581
eatable <- data.frame(c(12, 33, 660), c(100, 250, 510), c(26, 40, 43),
                      c(324, 580, 581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
                    "vegetable_production")
I was trying to write a function that matches the column names in a loop and stores the subset of columns whose first 5 letters match.
checkExpression <- function(dataset, str) {
  dataset[grepl((str), names(dataset), ignore.case = TRUE)]
}
checkExpression(eatable, "your_string")
The above function checks the string correctly, but I am confused about how to do the matching among the column names in the dataset.
Edit: I think regular expressions would work here.
You could try:
v <- unique(substr(names(eatable), 0, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])
Or using map() + select_()
library(tidyverse)
map(v, ~select_(eatable, ~matches(.)))
Which gives:
#[[1]]
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
#
#[[2]]
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
Should you want to make it into a function:
checkExpression <- function(df, l = 5) {
  v <- unique(substr(names(df), 0, l))
  lapply(v, function(x) df[grepl(x, names(df))])
}
Then simply use:
checkExpression(eatable, 5)
I believe this may address your needs:
checkExpression <- function(dataset, str) {
  cols <- grepl(paste0("^", str), colnames(dataset), ignore.case = TRUE)
  subset(dataset, select = colnames(dataset)[cols])
}
Note the addition of "^" to the pattern used in grepl.
Using your data:
checkExpression(eatable,"fruit")
## fruits_area fruits_production
##1 12 100
##2 33 250
##3 660 510
checkExpression(eatable,"veget")
## vegetables_area vegetable_production
##1 26 324
##2 40 580
##3 43 581
Your function does exactly what you want, but there was a small error:
checkExpression <- function(dataset, str) {
  dataset[grepl((str), names(dataset), ignore.case = TRUE)]
}
Change the name of the object you are subsetting from obje to dataset, so that it matches the function's argument.
checkExpression(eatable,"fr")
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
checkExpression(eatable,"veg")
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
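For completeness, a sketch of the same prefix-based selection using dplyr's starts_with() helper (not part of the original answers):
library(dplyr)
select(eatable, starts_with("fruit"))
select(eatable, starts_with("veget"))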
