Comparing column headers of two files to fetch data in R - r

I have a large CSV file, say INPUT, with 500+ columns. I also have a dataframe DF that contains a subset of the column headers of INPUT, and this subset changes at every iteration.
I have to fetch the data from only those columns of INPUT that are present in the dataframe DF and write it into another CSV file, say OUTPUT.
In short,
INPUT.csv:
ID,Col_A,Col_B,Col_C,Col_D,Col_E,Col_F,,,,,,,,,,,,,Col_S,,,,,,,,,,,,,,,,Col_Z
1,009,abcd,67,xvz,33,50,,,,,,,,,,,,,,,,,,,,,,,,,,,,oup,,,,,,,,,,,,,,,,,,90
2,007,efgh,87,wuy,56,67,,,,,,,,,,,,,,,,,,,,,,,,,,,,ghj,,,,,,,,,,,,,,,,,,,,888
print(DF):
[1] "Col_D" "Col_Z"
[3] "Col_F" "Col_S"
OUTPUT.csv
ID,Col_D,Col_Z,Col_F,Col_S
1,xvz,90,50,oup
2,wuy,888,67,ghj
I'm a beginner when it comes to R. I would prefer the matching of the dataframe against the INPUT file to be automated, because I don't want to do this by hand every day when the dataframe gets updated.

I'm not sure whether this is the answer :
input <- read.table(...)
input[colnames(input) %in% colnames(DF)]

If I understand it correctly, you need to import the INPUT.csv file into R and then match the columns of your DF against the columns of your INPUT. Is that correct?
You can either use the match function, or import the INPUT.csv file in RStudio via the "Import Dataset" button and subset it. Subsetting imported dataframes is fairly easy.
If you import your dataset as INPUT, you can subset columns by position in the following way: INPUT[,c(1,2,4)]
That will get you the first, second and fourth columns of the INPUT dataset.
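Selecting by name rather than by position avoids having to look up column numbers each time DF changes. The following is only a minimal sketch on toy data: the data frame and the vector name wanted are assumptions for illustration, with wanted standing in for the headers held in DF.

```r
# Toy stand-in for INPUT; the real file has 500+ columns.
INPUT <- data.frame(
  ID    = 1:2,
  Col_A = c("009", "007"),
  Col_D = c("xvz", "wuy"),
  Col_F = c(50, 67),
  Col_S = c("oup", "ghj"),
  Col_Z = c(90, 888),
  stringsAsFactors = FALSE
)

# Hypothetical character vector holding the wanted headers (the contents of DF).
wanted <- c("Col_D", "Col_Z", "Col_F", "Col_S")

# %in% gives a logical mask, so the columns come back in INPUT's order.
by_mask <- INPUT[, colnames(INPUT) %in% wanted]

# match() gives positions in the order of `wanted`, so DF's order is kept.
by_match <- INPUT[, match(wanted, colnames(INPUT))]

colnames(by_mask)   # "Col_D" "Col_F" "Col_S" "Col_Z"
colnames(by_match)  # "Col_D" "Col_Z" "Col_F" "Col_S"
```

Note that match() returns NA for a header that does not exist in INPUT, which makes the subsetting fail, so it may be worth checking the names first.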

First, reading the CSV into R is simple:
dataframe_read <- read.csv('/Path/to/csv/')
If I understand correctly that one data frame's columns are always a subset of the other's, the code is as follows:
### Example Dataframes
df1 <- data.frame(one = c(1,3,4), two = c(1,2,3), three = c(1,2,3))
df2 <- data.frame(one = c(1,3,4), three = c(1,2,3))
### Make new data frame
df3 <- df1[,colnames(df2)]
### Write new dataframe
write.csv(df3, 'hello.csv')
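Putting the pieces together for the question's case, a rough end-to-end sketch might look like the following. The temporary files, the vector wanted and the extra header "Col_Q" are all made up for illustration; the real paths and the real contents of DF would replace them.

```r
# Write a small stand-in for INPUT.csv to a temporary file.
in_path  <- tempfile(fileext = ".csv")
out_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Col_A,Col_D,Col_F,Col_S,Col_Z",
  "1,009,xvz,50,oup,90",
  "2,007,wuy,67,ghj,888"
), in_path)

# Hypothetical vector of wanted headers; "Col_Q" is deliberately not in the file.
wanted <- c("Col_D", "Col_Z", "Col_F", "Col_S", "Col_Q")

# check.names = FALSE keeps the headers exactly as they appear in the file.
input <- read.csv(in_path, check.names = FALSE, stringsAsFactors = FALSE)

# Keep only headers that really exist in the file, and put ID first.
keep   <- c("ID", intersect(wanted, colnames(input)))
output <- input[, keep, drop = FALSE]

# row.names = FALSE stops write.csv from adding an extra index column.
write.csv(output, out_path, row.names = FALSE)
readLines(out_path)
```

Using intersect() means a header that is missing from the file is silently skipped instead of raising an error; whether that is the desired behaviour depends on the use case.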

Related

Combining a list of data frames into a new data frame in R

This is a 3rd edit to the question (leaving below thread just in case):
The following code makes some sample data frames, selects those with "_areaX" in the title and makes a list of them. The goal is then to combine the data frames in the list into 1 data frame. It almost works...
Area1 <- 100
Area2 <- 200
Area3 <- 300
Zone <- 3
a1_areaX <- data.frame(Area1)
a2_areaX <- data.frame(Area2)
a3_areaX <- data.frame(Area3)
a_zoneX <- data.frame(Zone)
library(dplyr)
pattern = "_areaX"
df_list <- mget(ls(envir = globalenv(), pattern = pattern))
big_data = bind_rows(df_list, .id = "FileName")
The problem is the newly created data frame looks like this:
And I need it to look like this:
File Name    Area measurement
a1_areaX     100
a2_areaX     200
a3_areaX     300
Below are my earlier attempts at asking this question. Edited from first version:
I have csv files imported into R Global Env that look like this (I'd share the actual file(s) but there doesn't seem to be a way to do this here):
They all have a name, the above one is called "s6_section_area". There are many of them (with different names) and I've put them all together into a list using this code:
pattern = "section_area"
section_area_list <- list(mget(grep(pattern,ls(globalenv()), value = TRUE), globalenv()))
Now I want a new data frame that looks like this, put together from the data frames in the above made list.
File Name          Area measurement
a1_section_area    a number
a2_section_area    another number
many more          more numbers
So, the first column should list the name of the original file and the second column the measurement provided in that file.
Hope this is clearer - Not sure how else to provide reproducible example without sharing the actual files (which doesn't seem to be an option).
addition to edit: Using this code
section_area_data <- bind_rows(section_area_list, .id = "FileName")
I get (it goes on and on to the right)
I'm after a table that looks like the sample above, left column is file name with a list of file names going down. Right column is the measurement for that file name (taken from original file).
Note that in your list of dataframes (df_list) all the columns have different names (Area1, Area2, Area3) whereas in your output dataframe they all have been combined into one single column. So for that you need to change the different column names to the same one and bind the dataframes together.
library(dplyr)
library(purrr)
result <- map_df(df_list,
                 ~ .x %>% rename_with(~ "Area", contains('Area')),
                 .id = 'FileName')
result
#   FileName Area
# 1 a1_areaX  100
# 2 a2_areaX  200
# 3 a3_areaX  300
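For comparison, the same idea can be sketched in base R without dplyr or purrr. This assumes every data frame in the list holds exactly one column and one row, as in the sample data; the helper name one_row is made up for this illustration.

```r
# Rebuild the sample list from the question.
df_list <- list(
  a1_areaX = data.frame(Area1 = 100),
  a2_areaX = data.frame(Area2 = 200),
  a3_areaX = data.frame(Area3 = 300)
)

# Hypothetical helper: turn one list element into a FileName/Area row.
# [[1]] takes the first (and only) column, whatever it is called.
one_row <- function(name) {
  data.frame(FileName = name, Area = df_list[[name]][[1]],
             stringsAsFactors = FALSE)
}

# Build one row per list element, then stack the rows.
result <- do.call(rbind, lapply(names(df_list), one_row))
result
```

Because the value is pulled out by position, the differing column names (Area1, Area2, Area3) no longer matter.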
Thanks everyone for your suggestions. In the end, I was able to combine the suggestions and some more thinking and came up with this, which works perfectly.
library("dplyr")
pattern = "section_area"
section_area_list <- mget(ls(envir = globalenv(), pattern = pattern))
section_area_data <- bind_rows(section_area_list, .id = "FileName") %>%
select(-V1)
So, a bunch of csv files were imported into the R Global Env. A list of all files with a name ending in "section_area" was made. Those files were then bound into one big data frame, with the file names as one column and the value (area measurement in this case) in the other column. (There was a pointless column in the original csv files called "V1", which I deleted.)
This is what one of the many csv files looks like:
[image: sample csv file]
And this is the layout of the final data frame (it goes on for about 150 rows):
[image: final data frame]

Appending two excel files into one dataframe

I am trying to append two Excel files into one data frame in R.
I am using the following code to do so:
rm(list = ls(all.names = TRUE))
library(rio) #this is for the excel appending
library("dplyr") #this is required for filtering and selecting
library(tidyverse)
library(openxlsx)
path1 <- "A:/Users/Desktop/Test1.xlsx"
path2 <- "A:/Users/Desktop/Test2.xlsx"
dat = bind_rows(path1,path2)
Output
> dat = bind_rows(path1,path2)
Error: Argument 1 must have names.
Run `rlang::last_error()` to see where the error occurred
I appreciate that this is more for combining rows together, but can someone help me with combining different workbooks into one dataframe in RStudio?
bind_rows() works with data frames AFTER they have been loaded into the R environment. Here you are merely trying to "bind" two character strings (the file paths) together, hence the error. First you need to import the data from Excel, which you could do with something like:
test_df1 <- readxl::read_xlsx(path1)
test_df2 <- readxl::read_xlsx(path2)
and then you should be able to run:
test_df <- bind_rows(test_df1, test_df2)
A quicker way would be to iterate the process using the map function from purrr:
test_df <- map_df(c(path1, path2), readxl::read_xlsx)
If you want to append one under the other, meaning that both Excel files have the same columns, I would:
1) select the rows I wanted from the first Excel file and create a dataframe
2) select the rows from the second Excel file and create a second dataframe
3) append them with rbind().
On the other hand, if you want to append one next to the other, I would select the columns needed from the first and second Excel files into two dataframes and then go with cbind().
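The difference between the two can be sketched on small stand-in data frames. The data below is invented for illustration; in practice the two data frames would be whatever was read from the Excel files.

```r
# Two stand-ins for data frames already read from the Excel files.
first  <- data.frame(id = 1:2, value = c(10, 20))
second <- data.frame(id = 3:4, value = c(30, 40))

# Same columns: rbind() stacks the rows of one under the other.
stacked <- rbind(first, second)

# Same number of rows: cbind() places the columns side by side.
extra <- data.frame(label = c("a", "b"))
side_by_side <- cbind(first, extra)

dim(stacked)       # 4 rows, 2 columns
dim(side_by_side)  # 2 rows, 3 columns
```

rbind() requires matching column names, and cbind() requires matching row counts, so each fits only one of the two situations described above.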

How to enable check strings = F in read.Alteryx WITHIN ALTERYX DESIGNER

When I import a csv data frame into R I can do
read.csv("some.csv", check.names=F)
This will keep column names with spaces in them. For example the column name
some data column will be read in as some data column.
The problem is when I use the R developer tool in Alteryx to read in a csv.
read.Alteryx("#1", mode="data.frame")
The column name some data column turns into some.data.column.
Now I realize I could use regular expressions and other parsing tools to rename the columns to what they were originally but I am hoping there is an alternative.
I believe something like the following will work:
df1 = read.Alteryx("#1", mode="data.frame")
df1metadata <- read.AlteryxMetaInfo("#1")
colnames(df1) <- df1metadata$Name
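Since read.Alteryx is only available inside the Alteryx R tool, the effect of that last assignment can be sketched on stand-in objects. Both data frames below are invented for illustration; only the use of a Name column in the metadata is taken from the snippet above.

```r
# Stand-in for what read.Alteryx returns: names already mangled with dots.
df1 <- data.frame(some.data.column = 1:3, other.column = c("x", "y", "z"))

# Stand-in for the metadata, with the original headers in a Name column.
df1metadata <- data.frame(Name = c("some data column", "other column"),
                          stringsAsFactors = FALSE)

# Only overwrite the names when the counts line up.
stopifnot(length(df1metadata$Name) == ncol(df1))
colnames(df1) <- df1metadata$Name

colnames(df1)  # "some data column" "other column"
```

Columns with spaces in their names then need backticks or [[ ]] to be accessed, for example df1[["some data column"]].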

read multiple csv files with the same column headings and find the mean

Is it possible to read multiple CSV files (exported from Excel) into R? All of the CSV files have the same 4 columns: the first is character, the second and third are numeric, and the fourth is integer. I want to combine the data in each numeric column and find the mean.
I can get the csv files into R with
data <- list.files(directory)
myFiles <- paste(directory,data[id],sep="/")
I am unable to get the numbers from the individual columns, add them, and find the mean.
I am completely new to R and any advice is appreciated.
Here is a simple method:
Prep: Generate dummy data: (You already have this)
dummy <- data.frame(names=rep("a",4), a=1:4,b=5:8)
write.csv(dummy,file="data01.csv",row.names=F)
write.csv(dummy,file="data02.csv",row.names=F)
write.csv(dummy,file="data03.csv",row.names=F)
Step0: Load the file names: (just like you are doing)
data <- dir(getwd(), pattern="\\.csv$")  # pattern is a regular expression; this matches names ending in ".csv"
Step1: Read and combine:
DF <- do.call(rbind,lapply(data,function(fn) read.csv(file=fn,header=T)))
DF
Step2: Find mean of appropriate columns:
apply(DF[,2:3],2,mean)
Hope that helps!!
EDIT: If you are having trouble with file path, try ?file.path.
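The steps above can also be sketched so that they run from any working directory, using file.path and full file names. The folder name csv_mean_demo and the variable is_num are assumptions for illustration.

```r
# Work in a fresh temporary folder so no other CSV files are picked up.
directory <- file.path(tempdir(), "csv_mean_demo")
dir.create(directory, showWarnings = FALSE)

# Same dummy data as above, written to three files.
dummy <- data.frame(names = rep("a", 4), a = 1:4, b = 5:8)
for (f in c("data01.csv", "data02.csv", "data03.csv")) {
  write.csv(dummy, file = file.path(directory, f), row.names = FALSE)
}

# full.names = TRUE returns paths that can be passed straight to read.csv.
myFiles <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)

# Read every file and stack them into one data frame.
DF <- do.call(rbind, lapply(myFiles, read.csv))

# Pick the numeric columns by type rather than by position.
is_num <- vapply(DF, is.numeric, logical(1))
means  <- colMeans(DF[, is_num, drop = FALSE])
means
```

Selecting columns by type avoids hard-coding 2:3, at the cost of also averaging any other numeric or integer column that happens to be present.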

R: row.names and data manipulation / export

I am having some issues understanding what row.names is and how it works. And, how I can get my data to do stuff the row.names allows one to do.
For example, I am creating some clusters with the code below (my data). I want to export the results which is what the sapply line does, but only to the screen for now. The first column (path_country) of my data frame are country names and the other columns are other variables (integers). I don't see an easy way to export these clusters to a table or list of countries and their group membership.
I tried to make a dummy example using example data sets in R. For example, mtcars, it was then that I noticed the first column was denoted as row.names. With mtcars I can create clusters, cutree to the specified number of groups and then save as a data frame. With this approach I have the 'car names' in the first column and the group number in the second column (more or less, could be cleaned up to look nicer, but is essentially what I am after), which is what I would like to happen with my data.
Any thoughts on this would be appreciated.
# my data
path_country <- read.csv("C:/path_country.csv")
patho <- subset(path_country, select=c(2:188))
patho.d <- dist(patho)
patho.hclust <- hclust(patho.d)
patho.hclust.groups11 = cutree(patho.hclust,11)
sapply(unique(patho.hclust.groups11),function(g)path_country$Country[patho.hclust.groups11 == g])
# mtcars data
car.d <- dist(mtcars)
car.h <- hclust(car.d)
car.h.11 <- cutree(car.h, 11)
nice_result <- as.data.frame(car.h.11)
write.table(nice_result, "test.txt", sep="\t")
1) You can create a data.frame with row.names directly from the CSV file:
# Names in the first column (sep="," is needed because read.table splits on whitespace by default)
path_country <- read.table("C:/path_country.csv", sep=",", row.names=1)
# Names in column "Country"
path_country <- read.table("C:/path_country.csv", sep=",", row.names="Country", head=TRUE)
Note that in the second case you should specify head=TRUE so that the column names are used.
Now rownames(path_country) should give you a vector with the row names, and as.data.frame(patho.hclust.groups11) gives a nice result for export.
2) At any time you can set the row names of your data.frame with the command:
rownames(path_country) <- names.vector
where names.vector is a vector of unique names whose length equals the number of rows in the data.frame. In your example, patho.hclust.groups11 is a plain vector rather than a data.frame, so use names() instead of rownames():
names(patho.hclust.groups11) <- path_country$Country
Note that if you are using the first approach you don't need this command.
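As a minimal sketch of the export step, here is the mtcars example from the question turned into a table of names and group numbers. The variable names groups, membership and out_path are made up for illustration.

```r
# Cluster the built-in mtcars data, as in the question.
car.h  <- hclust(dist(mtcars))
groups <- cutree(car.h, 11)   # named integer vector; the names are the row names

# Put the names into an ordinary column next to the group number.
membership <- data.frame(car = names(groups), group = unname(groups),
                         stringsAsFactors = FALSE)

# List the members of each group on screen.
split(membership$car, membership$group)

# Export as a tab-separated table without the row-name column.
out_path <- tempfile(fileext = ".txt")
write.table(membership, out_path, sep = "\t", row.names = FALSE, quote = FALSE)
```

The same pattern should apply to the country data once the country names are attached as row names or as names of the cutree result.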

Resources