Related
I have a folder with a couple hundred .csv files that I'd like to import and merge.
Each file contains two columns of data, but there are different numbers of rows, and the rows have different names. The columns don't have names (For this, let's say they're named x and y).
How can I merge these all together? I'd like to just stick the x columns together, side-by-side, rather than matching on any criteria so that the first row is matched across all data sets and empty rows are given NA.
I'd like column x to go away.
Although, the rows should stay in the order they were originally in from the csv.
Here's an example:
Data frame 112_c1.csv:
x y
1 -0.5604
3 -0.2301
4 1.5587
5 0.0705
6 0.1292
Dataframe 112_c2.csv:
x y
2 -0.83476
3 -0.82764
8 1.32225
9 0.36363
13 0.9373
42 -1.5567
50 -0.12237
51 -0.4837
Dataframe 113_c1.csv:
x y
5 1.5783
6 0.7736
9 0.28273
15 1.44565
23 0.999878
29 -0.223756
=
Desired result
112_c1.y 112_c2.y 113_c1.y
-0.5604 -0.83476 1.5783
-0.2301 -0.82764 0.7736
1.5587 1.32225 0.28273
0.0705 0.36363 1.44565
0.1292 0.9373 0.999878
NA -1.5567 -0.223756
NA -0.12237 -0.223756
NA -0.12237 NA
NA -0.4837 NA
I've tried a few things, and looked through many other threads. But code like the following simply produces NAs for any following columns:
df <- do.call(rbind.fill, lapply(list.files(pattern = "*.csv"), read.csv))
Plus, if I use rbind instead of rbind.fill I get the error that names do not match previous names and I'm unsure of how to circumvent this matching criteria.
Suggested solution using a function to calculate summary statistics right when loading data:
readCalc <- function(file_path) {
df <- read.csv(file_path)
return(data.frame(file=file_path,
column = names(df),
averages = apply(df, 2, mean),
N = apply(df, 2, length),
min = apply(df, 2, min),
stringsAsFactors = FALSE, row.names = NULL))
}
df <- do.call(rbind, lapply(list.files(pattern = "*.csv"), readCalc))
If we need the first or last value we could use dplyr::first, dplyr::last. We might even want to store the whole vector in a list somewhere, but if we only need the summary stats we might not even need it.
Here's a solution to read all your csv files from a folder called "data" and merge the y columns into a single dataframe. This assigns the file name as the column header.
library(tidyverse)
# store csv file paths
data_path <- "data" # path to the data
files <- dir(data_path, pattern = "*.csv") # get file names
files <- paste(data_path, '/', files, sep="")
# read csv files and combine into a single dataframe
compiled_data = tibble::tibble(File = files) %>% #create a tibble called compiled_data
tidyr::extract(File, "name", "(?<=data/)(.*)(?=[.]csv)", remove = FALSE) %>% #extract the file names
mutate(Data = lapply(File, readr::read_csv, col_names = F)) %>% #create a column called Data that stores the file names
tidyr::unnest(Data) %>% #unnest the Data column into multiple columns
select(-File) %>% #remove the File column
na.omit() %>% #remove the NA rows
spread(name, X2) %>% #reshape the dataframe from long to wide
select(-X1) %>% #remove the x column
mutate_all(funs(.[order(is.na(.))])) #reorganize dataframe to collapse the NA rows
Taken from here: cbind a dataframe with an empty dataframe - cbind.fill?
x <- c(1:6)
y <- c(1:3)
z <- c(1:10)
cbind.fill <- function(...){
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x)
rbind(x, matrix(, n-nrow(x), ncol(x)))))
}
df <- as.data.frame(cbind.fill(x,y,z))
colnames(df) <- c("112_c1.y", "112_c2.y", "113_c1.y")
112_c1.y 112_c2.y 113_c1.y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 NA 4
5 5 NA 5
6 6 NA 6
7 NA NA 7
8 NA NA 8
9 NA NA 9
10 NA NA 10
After scraping some review data from a website, I am having difficulty organizing the data into a useful structure for analysis. The problem is that the data is dynamic, in that each reviewer gave ratings on anywhere between 0 and 3 subcategories (denoted as subcategories "a", "b" and "c"). I would like to organize the reviews so that each row is a different reviewer, and each column is a subcategory that was rated. Where reviewers chose not to rate a subcategory, I would like that missing data to be 'NA'. Here is a simplified sample of the data:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
The vec contains the information of the subcategories that were scored, and the "stop" is the end of each reviewers rating. As such, I would like to organize the result into a data frame with this structure. Expected Output
I would greatly appreciate any help on this, because I've been working on this issue for far longer than it should take me..
#alexis_laz provided what I believe is the best answer:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
stops <- vec == "stop"
i = cumsum(stops)[!stops] + 1L
j = vec[!stops]
tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity) # although mean/sum work
# a b c
#[1,] 2 5 1
#[2,] 1 3 NA
#[3,] NA NA NA
#[4,] NA NA 2
base R, but I'm using a for loop...
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
categories <- unique(vec)[unique(vec)!="stop"]
row = 1
df = data.frame(lapply(categories, function(x){NA_integer_}))
colnames(df) <- categories
rating = 1
for(i in vec) {
if(i=='stop') {row <- row+1
} else { df[row,i] <- ratings[[rating]]; rating <- rating+1}
}
Here is one option
library(data.table)
library(reshape2)
d1 <- as.data.table(melt(split(vec, c(1, head(cumsum(vec == "stop")+1,
-1)))))[value != 'stop', ratings := ratings
][value != 'stop'][, value := as.character(value)][, L1 := as.integer(L1)]
dcast( d1[CJ(value = value, L1 = seq_len(max(L1)), unique = TRUE), on = .(value, L1)],
L1 ~value, value.var = 'ratings')[, L1 := NULL][]
# a b c
#1: 2 5 1
#2: 1 3 NA
#3: NA NA NA
#4: NA NA 2
Using base R functions and rbind.fill from plyr or rbindlist from data.table to produce the final object, we can do
# convert vec into a list, split by "stop", dropping final element
temp <- head(strsplit(readLines(textConnection(paste(gsub("stop", "\n", vec, fixed=TRUE),
collapse=" "))), split=" "), -1)
# remove empty strings, but maintain empty list elements
temp <- lapply(temp, function(x) x[nchar(x) > 0])
# match up appropriate names to the individual elements in the list with setNames
# convert vectors to single row data.frames
temp <- Map(function(x, y) setNames(as.data.frame.list(x), y),
relist(ratings, skeleton = temp), temp)
# add silly data.frame (single row, single column) for any empty data.frames in list
temp <- lapply(temp, function(x) if(nrow(x) > 0) x else setNames(data.frame(NA), vec[1]))
Now, you can produce the single data.frame (data.table) with either plyr or data.table
# with plyr, returns data.frame
library(plyr)
do.call(rbind.fill, temp)
a b c
1 2 5 1
2 1 3 NA
3 NA NA NA
4 NA NA 2
# with data.table, returns data.table
rbindlist(temp, fill=TRUE)
a b c
1: 2 5 1
2: 1 3 NA
3: NA NA NA
4: NA NA 2
Note that the line prior to the rbinding can be replaced with
temp[lengths(temp) == 0] <- replicate(sum(lengths(temp) == 0),
setNames(data.frame(NA), vec[1]), simplify=FALSE)
where the list items that are empty data frames are replaced using subsetting instead of an lapply over the entire list.
dfOrig <- data.frame(rbind("1",
"C",
"531404",
"3",
"B",
"477644"))
setnames(dfOrig, "Value")
I have a single column vector, which actually comprises two observations of three variables. How do I convert it to a data.frame with the following structure:
ID Code Tag
"1" "C" "531404"
"3" "B" "477644"
Obviously, this is just a toy example to illustrate a real-world problem with many more observations and variables.
Here's another approach - it does rely on the dfOrig column being ordered 1,2,3,1,2,3 etc.
x <- c("ID", "Code", "Tag") # new column names
n <- length(x) # number of columns
res <- data.frame(lapply(split(as.character(dfOrig$Value), rep(x, nrow(dfOrig)/n)),
type.convert))
The resulting data is:
> str(res)
#'data.frame': 2 obs. of 3 variables:
# $ Code: Factor w/ 2 levels "B","C": 2 1
# $ ID : int 1 3
# $ Tag : int 531404 477644
As you can see, the column classes have been converted. In case you want the Code column to be character instead of factor you can specify stringsAsFactors = FALSE in the data.frame call.
And it looks like this:
> res
# Code ID Tag
#1 C 1 531404
#2 B 3 477644
Note: You have to get the column name order in x in line with the order of the entries in dfOrig$Value.
If you want to get the column order of res as specified in x, you can use the following:
res <- res[, match(x, names(res))]
Maybe convert to matrix with nrow:
# set number of columns
myNcol <- 3
# convert to matrix, then dataframe
res <- data.frame(matrix(dfOrig$Value, ncol = myNcol, byrow = TRUE),
stringsAsFactors = FALSE)
# convert the type and add column names
res <- as.data.frame(lapply(res, type.convert),
col.names = c("resID", "Code", "Tag"))
res
# resID Code Tag
# 1 1 C 531404
# 2 3 B 477644
You can create a sequence of numbers
x <- seq(1:nrow(dfOrig)) %% 3 #you can change this 3 to number of columns you need
data.frame(ID = dfOrig$Value[x == 1],
Code = dfOrig$Value[x == 2],
Tag = dfOrig$Value[x == 0])
#ID Code Tag
#1 1 C 531404
#2 3 B 477644
Another approach would be splitting the dataframe according to the sequence generated above and then binding the columns using do.call
x <- seq(1:nrow(dfOrig))%%3
res <- do.call("cbind", split(dfOrig,x))
You can definitely change the column names
colnames(res) <- c("Tag", "Id", "Code")
# Tag Id Code
#3 531404 1 C
#6 477644 3 B
How do I create a fixed size data frame of size [40 2], declare the first column with unique strings, and populate the other with specific values? Again, I want the first column to be the list of strings; I don't
want a row of headers.
(Someone please give me some pointers. I haven't program in R for a while and my R skills are terrible to
begin with.)
Two approaches:
# sequential strings
library(stringr)
df.1 <- data.frame(id=paste0("X",str_pad(1:40,2,"left","0")),value=NA)
head(df.1)
# id value
# 1 X01 NA
# 2 X02 NA
# 3 X03 NA
# 4 X04 NA
# 5 X05 NA
# 6 X06 NA
Second Approach:
# random strings
rstr <- function(n,k){
sapply(1:n,function(i){do.call(paste0,as.list(sample(letters,k,replace=T)))})
}
set.seed(1)
df.2 <- data.frame(id=rstr(40,5),value=NA)
head(df.2)
# id value
# 1 gjoxf NA
# 2 xyrqb NA
# 3 ferju NA
# 4 mszju NA
# 5 yfqdg NA
# 6 kajwi NA
The function rstr(n,k) produces a vector of length n with each element being a string of random characters of length k. rstr(...) does not guarantee that all strings are unique, but the probability of duplication is O(n/26^k).
Create the data.frame and define it's columns with the values
The reciclying rule, repeats the strings to match the 40 rows defined by the second column
df <- data.frame(x = c("unique_string 1", "unique_string 2"), y = rpois(40, 2))
# Change column names
names(df) <- c("string_col", "num_col")
I found this way of creating dataframes in R extremely productive and easy,
Create a raw array of values , then convert into matrix of required dimenions and finally name the columns and rows
dataframe.values = c(value1, value2,.......)
dataframe = matrix(dataframe.values,nrow=number of rows ,byrow = T)
colnames(dataframe) = c("column1","column2",........)
row.names(dataframe) = c("row1", "row2",............)
exampledf <- data.frame(columnofstrings=c("a string", "another", "yetanother"),
columnofvalues=c(2,3,5) )
gives
> exampledf
columnofstrings columnofvalues
1 a string 2
2 another 3
3 yetanother 5
I am wondering if it is possible to create a new dataframe with certain cells from each file from the working directory. for example say If I have 2 data frame like this (please ignore the numbers as they are random):
Say in each dataset, row 4 is the sum of my value and Row 5 is number of missing values. If I represent number of missing values as "M" and Sum of coloumns as "N", what I am trying to acheive is the following table:
So each file 'N' and 'M' are in 1 single row.
I have many files in the directory so I have read them in a list, but not sure what would be the best way to perform such task on a list of files.
this is my sample code for the tables I have shown and how I read them in list:
##Create sample data
df = data.frame(Type = 'wind', v1=c(1,2,3,100,50), v2=c(4,5,6,200,60), v3=c(6,7,8,300,70))
df2 =data.frame(Type = 'test', v1=c(3,2,1,400,40), v2=c(2,3,4,500,30), v3=c(6,7,8,600,20))
# write to directory
write.csv(df, file = "sample1.csv", row.names = F)
write.csv(df2, file = "sample2.csv", row.names = F)
# read to list
mycsv = dir(pattern=".csv")
n <- length(mycsv)
mylist <- vector("list", n)
for(i in 1:n) mylist[[i]] <- read.csv(mycsv[i],header = TRUE)
I would be really greatful if you could give me some suggestion about if this possible and how I should approch?
Many thanks,
Ayan
This should work:
processFile <- function(File) {
d <- read.csv(File, skip = 4, nrows = 2, header = FALSE,
stringsAsFactors = FALSE)
dd <- data.frame(d[1,1], t(unlist(d[-1])))
names(dd) <- c("ID", "v1N", "V1M", "v2N", "V2M", "v3N", "V3M")
return(dd)
}
ll <- lapply(mycsv, processFile)
do.call(rbind, ll)
# ID v1N V1M v2N V2M v3N V3M
# 1 wind 100 50 200 60 300 70
# 2 test 400 40 500 30 600 20
(The one slightly tricky/unusual bit comes in that third line of processFile(). Here's a code snippet that should help you see how it accomplishes what it does.)
(d <- data.frame(a="wind", b=1:2, c=3:4))
# a b c
# 1 wind 1 3
# 2 wind 2 4
t(unlist(d[-1]))
# b1 b2 c1 c2
# [1,] 1 2 3 4
CAVEAT: I'm not sure I fully understand what you want. I think you're reading in a list and want to select certain dataframes from that list with the same rows from that list. Then you want to create a data frame of those rows and go from long to wide format.
LIST <- lapply(2:3, function(i) {
x <- mylist[[i]][4:5, ]
x <- data.frame(x, row = factor(rownames(x)))
return(x)
}
)
DF <- do.call("rbind", LIST) #lets you bind an unknown number of rows from a list
levels(DF$row) <- list(M =4, N = 5) #recodes rows 4 and 5 with M and N
wide <- reshape(DF, v.names=c("v1", "v2", "v3"), idvar=c("Type"),
timevar="row", direction="wide") #reshape from long to wide
rownames(wide) <- 1:nrow(wide) #give proper row names
wide
This yields:
Type v1.M v2.M v3.M v1.N v2.N v3.N
1 wind 100 200 300 50 60 70
2 test 400 500 600 40 30 20