R data compilation (sorting out or data manipulation)

Here is my CSV data, exported from Excel (test.csv):
type,com,year,month,value
A,CH,2015,1,1000
A,CH,2015,2,5000
A,CH,2016,1,1500
A,MI,2015,1,1300
A,MI,2016,1,5006
B,CH,2015,1,7651
B,CH,2015,2,8684
B,MI,2016,1,2321
B,ZU,2015,1,6842
C,CH,2015,1,1562
C,CH,2016,2,6452
C,CH,2016,3,1562
C,MI,2016,1,6425
C,MI,2016,2,2682
C,ZU,2015,1,8543
C,ZU,2015,2,7531
How can I extract each type into its own data frame with R?
To be more precise, I want to build 3 new data frames (typeA, typeB and typeC). Also, how can I combine year and month into one column so I can plot the data with ggplot2?
An additional question: where can I find reference material on this kind of data manipulation?

More generally, you can split the data frame by type:
df <- read.csv("test.csv", header = TRUE)
df_list <- split(df, factor(df$type))
Every entry in df_list is now a data.frame holding one type, e.g. df_list[[1]] or df_list$A.

Try:
data =read.csv("test.csv", header=T)
dataA = data[which(data$type =="A"),]
dataB = data[which(data$type =="B"),]
dataC = data[which(data$type =="C"),]
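Neither answer covers combining year and month. A minimal sketch of one way to do it, assuming the first day of each month is an acceptable stand-in date; the column name date and the inline subset of test.csv are choices made for this example only:

```r
# A few rows of test.csv, read inline so the example is self-contained
csv_text <- "type,com,year,month,value
A,CH,2015,1,1000
A,CH,2015,2,5000
A,CH,2016,1,1500
B,CH,2015,1,7651
B,MI,2016,1,2321
C,CH,2015,1,1562"
df <- read.csv(text = csv_text, header = TRUE)

# Combine year and month into a Date (assumption: day fixed to the 1st)
df$date <- as.Date(sprintf("%d-%02d-01", df$year, df$month))

# One data frame per type, collected in a named list
df_list <- split(df, df$type)
typeA <- df_list$A
typeB <- df_list$B
typeC <- df_list$C
```

The date column can then be mapped to the x axis in ggplot2, for example with aes(x = date, y = value).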

Related

How to extract the original name of a dataframe that has been renamed to a string variable

I have been looking through other posts and cannot seem to find recommendations on how to deal with the following problem, and it's driving me crazy.
# Create the variables for this example
a <- c(10,20,30,40)
b <- c('book', 'pen', 'textbook', 'pencil_case')
c <- c(TRUE,FALSE,TRUE,FALSE)
d <- c(2.5, 8, 10, 7)
# Join the variables to create a data frame
orgdf <- data.frame(a,b,c,d)
# Rename the data frame
newdf <- orgdf
# Plot the variables of the new data frame
plot(newdf$a, newdf$d, main = "orgdf")
This works if I hard code "orgdf" into the plot title. I would like to extract the name somehow without ever having to hard code it. Any suggestions? Thanks!
I'm not sure if this might help; assuming you want multiple renamed data frames to plot using the title of the original data frame you could create your own plotting function, so you only have to hard code the name of the original data frame once:
f_plot <- function(df){
plot(df$a, df$d, main = "orgdf")
}
f_plot(newdf)
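R does not record that newdf was copied from orgdf, so the original name cannot be recovered from the copy alone. One possible workaround, sketched here under that assumption, is to store the name as an attribute before copying. The attribute name orig_name and the helper plot_with_name are made up for this example:

```r
a <- c(10, 20, 30, 40)
d <- c(2.5, 8, 10, 7)
orgdf <- data.frame(a, d)

# Record the name once, as an attribute ("orig_name" is a hypothetical name)
attr(orgdf, "orig_name") <- "orgdf"

# A plain copy carries the attribute along
newdf <- orgdf

# Hypothetical helper: use the stored name as the plot title
plot_with_name <- function(df) {
  plot(df$a, df$d, main = attr(df, "orig_name"))
}
plot_with_name(newdf)
```

Operations that build a new data frame, such as merge, may drop custom attributes, so this only helps for straightforward copies.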
Created on 2020-07-19 by the reprex package (v0.3.0)

Merge in R only shows Header

I have 3 large Excel databases converted to CSV. I wish to combine these into one using R.
I have loaded the 3 files as dat1, dat2 and dat3 respectively. I tried to merge dat1 and dat2 into myfulldata, and then merge myfulldata with dat3, saved as myfulldata2.
When I did this, though, only the headers remained in the combination; none of the contents of the databases were visible. The number of "obs" in myfulldata and myfulldata2 is shown as 0, despite the respective obs for each individual component being very large. Can anyone advise how to resolve this?
Code:
dat1 <- read.csv("PS 2014.csv", header=T)
dat2 <- read.csv("PS 2015.csv", header=T)
dat3 <- read.csv("PS 2016.csv", header=T)
myfulldata = merge(dat1, dat2)
myfulldata2 = merge(myfulldata, dat3)
save(myfulldata2, file = "Palisis.RData")
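Doing a merge in R is analogous to doing a join between two tables in a database. By default, merge() joins on all columns the two data frames have in common and keeps only the rows that match, so when no rows match the result has the headers but 0 observations. I suspect what you want is to stack your three CSV files row-wise (i.e. union them). In this case, you can try using rbind instead: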
myfulldata <- rbind(dat1, dat2)
myfulldata <- rbind(myfulldata, dat3)
save(myfulldata, file = "Palisis.RData")
Note that this assumes the data frames read from each CSV have the same column names, and ideally the same column types (cf. doing a UNION in SQL).
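A small illustration of the difference, using two made-up data frames (the columns id and sales are hypothetical, not taken from the asker's files):

```r
# Two tables with the same columns but no rows in common
dat1 <- data.frame(id = c(1, 2), sales = c(100, 200))
dat2 <- data.frame(id = c(3, 4), sales = c(300, 400))

# merge() joins on every shared column (id and sales); nothing matches
joined <- merge(dat1, dat2)
nrow(joined)

# rbind() stacks the rows instead
stacked <- rbind(dat1, dat2)
nrow(stacked)
```

Here joined keeps the column headers but has no rows, which matches the symptom described in the question, while stacked contains all four rows.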

Combination of match and lapply in R

Here is my problem.
I have 8 * 3 data frames: 8 years (2005 to 2012), and for each year three data frames corresponding to ecology, flowerdistrib and location. The names of the csv files follow the same pattern (flowerdistrib_2005.csv, ecology_2005.csv, ...).
I would like to build, for each year, a data frame which contains all the columns of the "flowerdistrib" file and part of the "ecology" and "location" ones.
I imported all of them thanks to this script:
listflower = list.files(path = "C:/Directory/.../", pattern = "flowerdistrib_")
for (i in listflower) {
filepath1 <- file.path("C:/Directory/.../",paste(i))
assign(i,read.csv(filepath1, sep=";", dec=",", header=TRUE))
}
Same for ecology and location.
Then I want to do a vlookup for each year with the three files with some specific columns.
In each year, the csv files ecology, location and flowerdistrib have a column named "idp" in common.
I know how to do for one year. I use the following script:
2005 example, extraction of the column named "xl93" present in the file location_2005.csv:
flowerdistrib_2005[, "xl93"] = location_2005$"xl93"[match(flowerdistrib_2005$"idp", location_2005$"idp")]
But I don't know how to do this in one go for all the years. I was thinking of using a for loop combined with the lapply function, but I don't handle it very well as I am an R beginner.
I would appreciate any and all help.
Thanks a lot.
PS: I am not a native English speaker; apologies for any misunderstandings and language mistakes.
This is a bit of a reorganization of your read.csv procedure, but you could use something like the script below to do what you need. It creates a list, data, which contains one data frame per year. You can also combine all those data frames into one if the input tables all have the very same structure.
Hope this helps. I am not sure the code below works as-is once you copy it and update the paths, but something very similar should work for you.
# Prepare empty list
data <- list()
# Loop through all years
for(year in 2005:2012){
  # Load data for this year
  flowers <- read.csv(paste('C:/Directory/.../a/flowerdistrib_', year, '.csv', sep=''), sep=";", dec=",", header=TRUE)
  ecology <- read.csv(paste('C:/Directory/.../a/ecology_', year, '.csv', sep=''), sep=";", dec=",", header=TRUE)
  location <- read.csv(paste('C:/Directory/.../a/location_', year, '.csv', sep=''), sep=";", dec=",", header=TRUE)
  # Merge data for this specific year, using idp as identifier
  all <- merge(flowers, ecology, by = "idp", all = TRUE)
  all <- merge(all, location, by = "idp", all = TRUE)
  # Add a year column with constant year value to data
  all$year <- year
  # Drop unused columns
  dropnames = c('column_x', 'column_y')
  all <- all[,!(names(all) %in% dropnames)]
  # Or alternatively, only keep wanted columns
  keepnames = c('idp', 'year', 'column_z', 'column_v')
  all <- all[keepnames]
  # Append data to list
  data[[as.character(year)]] <- all
}
# At this point, data should be a list of dataframes with all data for each year
# so this should print the summary of the data for 2007
summary(data[['2007']])
# If all years have the very same column structure,
# you can use rbind to combine all years into one big dataframe
data <- do.call(rbind, data)
# This would summarize the data frame with all data combined
summary(data)
Here is a shorter version using some functional programming concepts. First, we write a function read_and_merge that accepts a year as an argument, builds the list of files for that year, and reads them into data_, a list of three data frames. The final trick is to use the Reduce function, which merges the three data frames one after another. I am assuming that the only common column is idp.
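read_and_merge <- function(year, mydir = "C:/Directory/.../a/"){
  # pattern is a regular expression; paste0 avoids the spaces that paste() would insert
  # full.names = TRUE returns paths that read.csv can open from any working directory
  files_ = list.files(mydir, pattern = paste0("_", year, "\\.csv$"), full.names = TRUE)
  data_ = lapply(files_, read.csv, sep = ";", dec = ",", header = TRUE)
  Reduce(merge, data_)
}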
The second step is to create a list of the years and use lapply to create datasets for each year.
mydata = lapply(2005:2012, read_and_merge)
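The match() lookup from the question can also be applied to every year at once. This is a minimal sketch, assuming the yearly data frames are held in two lists named by year; the list names, the helper add_xl93 and the toy values are all invented for illustration:

```r
# Toy stand-ins for the yearly files, stored in lists named by year
flowerdistrib <- list(
  "2005" = data.frame(idp = c(1, 2, 3), n = c(5, 6, 7)),
  "2006" = data.frame(idp = c(10, 11), n = c(8, 9))
)
location <- list(
  "2005" = data.frame(idp = c(3, 1, 2), xl93 = c("c", "a", "b")),
  "2006" = data.frame(idp = c(11), xl93 = c("z"))
)

# Hypothetical helper: look up xl93 by idp, as in the single-year example
add_xl93 <- function(flowers, loc) {
  flowers$xl93 <- loc$xl93[match(flowers$idp, loc$idp)]
  flowers
}

# Map() walks both lists in parallel, one year at a time
combined <- Map(add_xl93, flowerdistrib, location)
combined[["2005"]]
```

Rows whose idp has no counterpart in the location table get NA, which is the same behaviour as the single-year match() line.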

Set a Data Frame Column as the Index of R data.frame object

Using R, how do I make a column of a data frame the data frame's index? Let's assume I read in my data from a .csv file. One of the columns is called 'Date' and I want to make that column the index of my data frame.
For example, in Python with pandas I would do the following:
df = pd.read_csv('/mydata.csv')
d = df.set_index('Date')
Now how do I do that in R?
I tried in R:
df <- read.csv("/mydata.csv")
d <- data.frame(V1=df['Date'])
# or
d <- data.frame(Index=df['Date'])
# but these just make a new dataframe with one 'Date' column.
#The Index is still 0,1,2,3... and not my Dates.
I assume that by "Index" you mean row names. You can assign to the row names vector:
rownames(df) <- df$Date
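A short sketch of what that looks like on a made-up data frame (the value column and the dates are invented). Note that row names must be unique, so this only works if no date is repeated:

```r
df <- data.frame(
  Date  = c("2020-01-01", "2020-01-02", "2020-01-03"),
  value = c(10, 20, 30)
)

# Use the Date column as row names
rownames(df) <- df$Date

# Optionally drop the now-redundant column
df$Date <- NULL

# Rows can then be looked up by their date label
df["2020-01-02", "value"]
```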
The index can be set while reading the data, in both pandas and R.
In pandas:
import pandas as pd
df = pd.read_csv('/mydata.csv', index_col="Date")
In R:
df <- read.csv("/mydata.csv", header=TRUE, row.names="Date")
The tidyverse solution:
library(tidyverse)
df %>% column_to_rownames(., var = "Date")
When saving the data frame, use row.names = FALSE to keep the row names out of the file,
e.g. write.csv(prediction.df, "my_file.csv", row.names = FALSE)

zoo merge() and merged column names

I am relatively new to R. I am merging data contained in multiple csv files into a single zoo object.
Here is a snippet of the code in my for loop:
temp <- read.csv(filename, stringsAsFactors=F)
temp_dates <- as.Date(temp[,2])
temp <- zoo(temp[,17], temp_dates)
dataset <- temp[seq_specified_dates]
# merge data into output
if (length(output) == 0)
  output <- dataset
else
  output <- merge(output, dataset, all=FALSE)
When I run head() on the output zoo object, I notice bizarre column names like 'dataset.output.output.output'. How can I assign more meaningful names to the merged columns?
Also, how do I reference a particular column in a zoo object? For example, if output were a data frame, I could reference the 'Patient_A' column as output$Patient_A. How do I reference a specific column in a merged zoo object?
I think this would work regardless of whether the date is a zoo class. If you provide an example I may be able to fix the details, but this should be a good starting point.
# 1- Put your multiple csv files in one folder
setwd("your/path") # placeholder: replace with your own folder
listnames = list.files(pattern = "\\.csv$")
# 2- Use packages plyr and zoo
library(plyr)
library(zoo)
pp1 = ldply(listnames, read.csv, header = TRUE) # put all the files in one data.frame
names(pp1) = c('name1','name2','name3',...) # placeholder: list your own column names here
pp1$date = zoo(pp1$date)
# Reshape data frame so it gets organized by date
pp2 = reshape(pp1, timevar = 'name1', idvar = 'date', direction = 'wide')
read.zoo is able to read and merge multiple files. For example:
idx <- seq(as.Date('2012-01-01'), by = 'day', length = 30)
dat1<- data.frame(date = idx, x = rnorm(30))
dat2<- data.frame(date = idx, x = rnorm(30))
dat3<- data.frame(date = idx, x = rnorm(30))
write.table(dat1, file = 'ex1.csv')
write.table(dat2, file = 'ex2.csv')
write.table(dat3, file = 'ex3.csv')
datMerged <- read.zoo(c('ex1.csv', 'ex2.csv', 'ex3.csv'))
If you want to access a particular column you can use the $ method:
datMerged$ex1.csv
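On the question of meaningful column names: a minimal sketch, assuming the zoo package is installed, of naming the columns at merge time by passing named arguments to merge(). The series, dates and values below are invented; only the Patient_A name comes from the question:

```r
library(zoo)

# Two made-up series on the same dates
dates <- as.Date("2012-01-01") + 0:4
patient_a <- zoo(c(1.1, 2.2, 3.3, 4.4, 5.5), dates)
patient_b <- zoo(c(10, 20, 30, 40, 50), dates)

# Named arguments become the column names of the merged object
output <- merge(Patient_A = patient_a, Patient_B = patient_b, all = FALSE)
colnames(output)

# Columns can be referenced with $ or by name
output$Patient_A
output[, "Patient_B"]

# Names can also be replaced after the fact
colnames(output) <- c("A", "B")
```

Inside a loop, assigning colnames(output) once after the last merge is probably the simplest way to avoid the accumulated 'dataset.output.output' names.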
EDITED:
You can extract a time period with the window method:
window(datMerged, start='2012-01-28', end='2012-01-30')
The xts package includes more extraction methods:
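library(xts)
# Convert the merged zoo object to xts before using xts-style date subsetting
datMergedx <- as.xts(datMerged)
datMergedx['2012-01-03']
datMergedx['2012-01-28/2012-01-30']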
