Split a data frame by rows and save as csv - r

I just have a data frame and want to split the data frame by rows, assign the several new data frames to new variables and save them as csv files.
a <- rep(1:5,each=3)
b <-rep(1:3,each=5)
c <- data.frame(a,b)
# a b
1 1 1
2 1 1
3 1 1
4 2 1
5 2 1
6 2 2
7 3 2
8 3 2
9 3 2
10 4 2
11 4 3
12 4 3
13 5 3
14 5 3
15 5 3
I want to split c by column a, i.e. all rows with 1 in column a are split from c, assigned to A, and saved as A.csv. The same goes for B.csv with all rows that have 2 in column a.
What I can do is
A<-c[c$a%in%1,]
write.csv (A, "A.csv")
B<-c[c$a%in%2,]
write.csv (B, "B.csv")
...
If I have 1000 rows and there will be lots of subsets, I just wonder if there is a simple way to do this by using for loop?

The split() function is very useful for splitting a data frame. You can also use lapply() here; it is more concise than an explicit loop, though not necessarily faster.
dfs <- split(c, c$a) # list of dfs
# use numbers as file names
lapply(names(dfs),
function(x){write.csv(dfs[[x]], paste0(x,".csv"),
row.names = FALSE)})
# or use letters (max 26!) as file names
names(dfs) <- LETTERS[1:length(dfs)]
lapply(names(dfs),
function(x){write.csv(dfs[[x]],
file = paste0(x,".csv"),
row.names = FALSE)})

vals <- unique(c$a) # compare against the actual values, which need not be 1, 2, 3, ...
for(i in seq_along(vals)){
write.csv(c[c$a == vals[i],], paste0(LETTERS[i], ".csv"))}
You should consider, however, what happens if you have more than 26 subsets. What will those files be named?
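One way to sidestep the 26-letter limit raised above is to build the file names from the group values themselves. The following is a minimal sketch under assumptions: the "subset_" prefix and the use of a temporary directory are illustrative choices, not part of the answers above, and the data frame is called dat here to avoid masking c().

```r
# Example data from the question (renamed to avoid masking base::c)
a <- rep(1:5, each = 3)
b <- rep(1:3, each = 5)
dat <- data.frame(a, b)

# Illustrative output location; replace with your own directory
out_dir <- tempdir()

# One data frame per distinct value of column a
pieces <- split(dat, dat$a)

# File names derived from the group values, so any number of groups works
paths <- file.path(out_dir, paste0("subset_", names(pieces), ".csv"))

# Write each piece to its own file
invisible(Map(function(piece, path) write.csv(piece, path, row.names = FALSE),
              pieces, paths))

basename(paths)
```

Because the names come from names(pieces), the same code works for 5 groups or 500.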

Related

R - How to create multiple datasets based on levels of factor in multiple columns?

I'm kinda new to R and still looking for ways to make my code more elegant. I want to create multiple datasets in a more efficient way, each based on a particular value over different columns.
This is my dataset:
df<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
And I need each column to be a factor:
df[colnames(df)] <- lapply(df[colnames(df)], factor)
Now, what I want to obtain is one dataframe called "Likert_rank_yes" that contains all the observations that in the column "dummy1" have "yes", one dataframe called "Likert_rank_no" that contains all the observations that in the column "dummy1" have "no", one dataframe called "Likert_rank_high" that contains all the observations that in the column "dummy2" have "high" and so on for all my other dummies.
I want to loop or streamline the process in some way, so that there are few commands to run to get all the datasets I need.
The first two dataframes should look something like this:
Dataframe called "Likert_rank_yes" that contains all the observations that in the column "dummy1" have "yes"
Dataframe called "Likert_rank_no" that contains all the observations that in the column "dummy1" have "no"
I have to do this with several dummies with multiple levels and would like to automate/loop the process or make it more efficient, so that I don't have to subset and rename every dataframe for each dummy level. Ideally I would also need to drop the last column in each df created (the one containing the dummy considered).
I tried splitting like below but it seems it is not possible using multiple values, I just get 4 dfs (yes AND high observations, yes AND low obs, no AND high obs etc.) like so:
# Splitting with a list of columns doesn't work
list_df <- split(df[c(1:5)], list(df$dummy1,df$dummy2), sep=".")
Can you help? Thanks in advance!
You need two lapplys:
vals <- colnames(df)[1:5]
dummies <- colnames(df)[-(1:5)]
step1 <- lapply(dummies, function(x) df[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
step2
# $dummy1
# $dummy1$no
# A B C D E dummy1
# 3 2 2 3 5 2 no
# 4 3 3 3 5 3 no
# 5 4 4 4 5 4 no
# 6 5 2 2 4 2 no
# 8 1 5 1 5 1 no
#
# $dummy1$yes
# A B C D E dummy1
# 1 1 4 3 1 1 yes
# 2 2 4 3 2 4 yes
# 7 1 1 5 5 5 yes
# 9 2 2 2 2 2 yes
# 10 3 2 3 3 3 yes
#
#
# $dummy2
# $dummy2$high
# A B C D E dummy2
# 1 1 4 3 1 1 high
# 5 4 4 4 5 4 high
# 6 5 2 2 4 2 high
# 7 1 1 5 5 5 high
# 10 3 2 3 3 3 high
#
# $dummy2$low
# A B C D E dummy2
# 2 2 4 3 2 4 low
# 3 2 2 3 5 2 low
# 4 3 3 3 5 3 low
# 8 1 5 1 5 1 low
# 9 2 2 2 2 2 low
For the first data set ("dummy1" and "no") use step2$dummy1$no or step2[[1]][[1]] or step2[["dummy1"]][["no"]].
For programming purposes it is usually better to keep the list intact since it makes it simple to write code that processes all of the data frames in the list without having to specify them individually.
You are very close:
tbls <- unlist(step2, recursive=FALSE)
list2env(tbls, envir=.GlobalEnv)
ls()
# [1] "df" "dummies" "dummy1.no" "dummy1.yes" "dummy2.high" "dummy2.low" "step1" "step2" "tbls" "vals"
This will create the same set of tables.
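To get the names asked for in the question (Likert_rank_yes, Likert_rank_no, and so on) and to drop the dummy column from each piece, the same idea can be condensed. This is only a sketch: it assumes that level names do not repeat across dummies, since otherwise the generated names would collide.

```r
# Data from the question
df <- data.frame(A = c(1,2,2,3,4,5,1,1,2,3),
                 B = c(4,4,2,3,4,2,1,5,2,2),
                 C = c(3,3,3,3,4,2,5,1,2,3),
                 D = c(1,2,5,5,5,4,5,5,2,3),
                 E = c(1,4,2,3,4,2,5,1,2,3),
                 dummy1 = c("yes","yes","no","no","no","no","yes","no","yes","yes"),
                 dummy2 = c("high","low","low","low","high","high","high","low","low","high"))
df[] <- lapply(df, factor)

vals    <- colnames(df)[1:5]     # the Likert columns
dummies <- colnames(df)[-(1:5)]  # the grouping columns

# Split only the value columns, so the dummy column is dropped automatically
pieces <- lapply(dummies, function(d) split(df[vals], df[[d]]))

# Flatten to a single list and prefix the level names
tbls <- unlist(pieces, recursive = FALSE)
names(tbls) <- paste0("Likert_rank_", names(tbls))

names(tbls)
```

The data frames can then be used as tbls$Likert_rank_yes, or exported to the global environment with list2env(tbls, envir = .GlobalEnv) as shown above.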

Custom Data Set/Frame From List

Sample
A=data.frame("id"=c(1:10))
B=data.frame("id"=c(7:16))
C=data.frame("id"=c(-10:-1))
mylist=list(A=A,B=B,C=C)
What I want is to combine these three data.frames into a single one:
WANT = data.frame("id"=c(1:10,7:16,-10:-1),
dataID=c(rep("A",10),rep("B",10),rep("C",10)))
Suppose I have a list which contains a bunch of data frames (this is how I am given the data). I want to put them into one really big data frame like "WANT" that uses the names of the data frames in the list for dataID. I am able to do this with just a few, for example A, B, C, but I have about a hundred and am wondering how to pull the data frames out of the list and make a tall file like the "WANT" example.
You can add the dataID to each of the data frames and then bind them together:
EDIT: after some clarification, here is a new approach
listNAMES = letters[1:3]
library(tidyverse)
tibble(mydata = list(A, B, C),
dataID = listNAMES) %>%
unnest(cols = mydata) # tidyr >= 1.0.0 expects the list-column to be named
# A tibble: 30 x 2
names id
<chr> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
# ... with 20 more rows
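For comparison, the first suggestion (add the dataID to each data frame, then bind them) can be sketched in base R. This assumes the list is named, i.e. built with list(A = A, ...) rather than c(A, B, C); the name tall is only illustrative.

```r
A <- data.frame(id = 1:10)
B <- data.frame(id = 7:16)
C <- data.frame(id = -10:-1)

# A named list of data frames; the names become the dataID values
mylist <- list(A = A, B = B, C = C)

# Add the list name as a column to each data frame
tagged <- Map(function(d, nm) {
  d$dataID <- nm
  d
}, mylist, names(mylist))

# Stack everything into one tall data frame
tall <- do.call(rbind, tagged)
rownames(tall) <- NULL

head(tall)
```

With a hundred data frames the code stays the same, because nothing refers to A, B or C individually after the list is built.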

populate Data Frame based on lookup data frame in R

How does one go about replacing the columns of one data frame with columns from another, using a lookup table that maps column names between the two tables?
Orig
A B C
1 2 3
2 2 2
4 5 6
Ret
D E
7 8
8 9
2 4
lookup <- data.frame(Orig=c('A','B','C'),Ret=c('D','D','E'))
Orig Ret
1 A D
2 B D
3 C E
So that the final data frame would be
A B C
7 7 8
8 8 9
2 2 4
We can match the 'Orig' column in 'lookup' with the column names of 'Orig' to find the numeric index (although, it is in the same order, it could be different in other cases), get the corresponding 'Ret' elements based on that. We use that to subset the 'Ret' dataset and assign the output back to the original dataset. Here I made a copy of "Orig".
OrigN <- Orig
OrigN[] <- Ret[as.character(lookup$Ret[match(colnames(Orig),
as.character(lookup$Orig))])] # look up each column name of Orig in lookup$Orig
OrigN
# A B C
#1 7 7 8
#2 8 8 9
#3 2 2 4
NOTE: as.character was used as the columns in 'lookup' were factor class.
I believe that the following will work as well.
OrigN <- Orig
OrigN[, as.character(lookup$Orig)] <- Ret[, as.character(lookup$Ret)]
This method applies a column shuffle to Orig (actually a copy OrigN, following @akrun) and then fills these columns with the appropriately ordered columns of Ret using the lookup.
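The case where the columns are not in the same order as the lookup rows can be checked with a small sketch. The column order of Orig is deliberately shuffled here, and the named vector ret_for is a helper introduced only for illustration.

```r
# Orig with its columns in a different order than the lookup table
Orig <- data.frame(C = c(3, 2, 6), A = c(1, 2, 4), B = c(2, 2, 5))
Ret  <- data.frame(D = c(7, 8, 2), E = c(8, 9, 4))
lookup <- data.frame(Orig = c("A", "B", "C"),
                     Ret  = c("D", "D", "E"),
                     stringsAsFactors = FALSE)

# Named vector: names are Orig column names, values are Ret column names
ret_for <- setNames(lookup$Ret, lookup$Orig)

OrigN <- Orig
# For every column of Orig, pick the matching column of Ret
OrigN[] <- Ret[unname(ret_for[colnames(Orig)])]

OrigN
```

Indexing ret_for by colnames(Orig) makes the direction of the lookup explicit: one Ret column is chosen per Orig column, whatever their order.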

Import multiple data frames CSV - column separation

I have a csv file with multiple data frames that are all separated by a column (So 4 columns of data, empty column, 4 columns of data, etc.). Is there a nice way to read in the file and have R create a separate df for each of those contiguous sets of columns? Then I would be able to use lapply across all of these dfs.
Thanks for your help.
Read in the whole csv file, then use lapply to separately capture each four-column data frame into a list. Then use rbind to stack all the data frames into a single data frame.
dat = read.csv("YourFile.csv")
# Set this based on how many separate data frames are in your csv file
num.df = ceiling(ncol(dat)/5) # Per @zx8754's comment; ceiling() covers a file without a trailing empty column
# This will tell the function the column numbers where
# each data frame starts
start.cols = seq(1, 1 + 5*(num.df-1), 5)
df.list = lapply(start.cols, function(x) {
# Capture the next 4 columns
df = dat[, x:(x+3)]
# Use whatever names are appropriate here. This is just
# to make sure all of the data frames have the same column names
# so that rbind won't throw an error
names(df) = c(paste0("col", 1:4))
return(df)
})
# rbind all the data frames into a single data frame
df = do.call(rbind, df.list)
You can take advantage of colClasses:
Example data:
h1 h2 h3 h1.1 h2.1 h3.1 h1.2 h2.2 h3.2
1 1 6 3 1 8 8 1 5 2
2 2 1 1 6 5 8 1 3 1
3 3 2 6 1 2 3 1 2 5
Then you can loop over the number of dataframes you want and read the file:
ngroups <- 3 #number of dataframes to read
datacols <- 3 #number of columns to read
fulldata <- list()
for (i in 1:ngroups) {
nskip <- (datacols+1)*(i-1)
cols.to.read <- c(rep("NULL", nskip), rep(NA, datacols), rep("NULL", (datacols+1)*(ngroups-i+1)-1)) #creates a list of NULLs and NAs. NULLs = don't read, NA = read
fulldata[[i]] <- read.csv("test.csv", colClasses=cols.to.read)
}
Result:
fulldata
[[1]]
h1 h2 h3
1 1 6 3
2 2 1 1
3 3 2 6
[[2]]
h1.1 h2.1 h3.1
1 1 8 8
2 6 5 8
3 1 2 3
[[3]]
h1.2 h2.2 h3.2
1 1 5 2
2 1 3 1
3 1 2 5
This works, but I believe the answers reading the file only once would be faster, since reading the same file over and over again doesn't sound like the optimal procedure.
First read all your data into one large dataframe:
maindf <- read.csv(yourfile)
Let's say n is the number of dataframes inside your csv file:
for (i in 0:(n-1)){ # parentheses needed: 0:n-1 would evaluate as (0:n)-1
# each block is 4 data columns plus 1 empty separator column, so the stride is 5
assign(paste0("df",i+1),maindf[,(1+5*i):(4+5*i)])
}
The result should be n dataframes that can be accessed like this: df1, df2,...dfn.
I didn't test it, because no sample data was provided.
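If the block width is not known in advance, the empty separator columns themselves can be used to find the groups. The following is a sketch with made-up inline data (csv_text is illustrative, not the asker's file), and it assumes that no real data column is entirely NA.

```r
# Two 4-column blocks separated by one empty column
csv_text <- "a,b,c,d,,a,b,c,d
1,2,3,4,,5,6,7,8
9,10,11,12,,13,14,15,16"

dat <- read.csv(text = csv_text, header = TRUE)

# A separator column is one that contains nothing but NA
is_gap <- vapply(dat, function(col) all(is.na(col)), logical(1))

# Number the blocks: the counter increases at every separator column
group_id <- cumsum(is_gap)[!is_gap]

# split.default() splits a data frame by columns instead of rows
df.list <- split.default(dat[!is_gap], group_id)

lapply(df.list, names)
```

The result is a list of data frames, so lapply() can be used across them as the question intended.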

Using loop variables

I would like to rename a large number of columns (column headers) to have numerical names rather than combined letter+number names. Because of the way the data is stored in raw format, I cannot just access the correct column numbers by using data[[152]] if I want to interact with a specific column of data (because random questions are filtered completely out of the data due to being long answer comments), but I'd like to be able to access them by data$152. Additionally, approximately half the column names in my data have loaded with class(data$152) = NULL but class(data[[152]]) = integer (and if I rename the data[[152]] column it appropriately allows me to see class(data$152) as integer).
Thus, is there a way to use the loop iteration number as a column name (something like below)
for (n in 1:415) {
names(data)[n] <-"n" # name nth column after number 'n'
}
That will reassign all my column headers and ensure that I do not run into question classes resulting in null?
As additional background info, my data is imported from a comma delimited .csv file with the value 99 assigned to answers of NA with the first row being the column names/headers
data <- read.table("rawdata.csv", header=TRUE, sep=",", na.strings = "99")
There are 415 columns with headers in format Q001, Q002, etc
There are approximately 200 rows with no row labels/no label column
You can do this without a loop, as follows:
names(data) <- 1:415
Let me illustrate with an example:
dat <- data.frame(a=1:4, b=2:5, c=3:6, d=4:7)
dat
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Now rename the columns:
names(dat) <- 1:4
dat
1 2 3 4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
EDIT : How to access your new data
@Ramnath points out very accurately that you won't be able to access your data using dat$1:
dat$1
Error: unexpected numeric constant in "dat$1"
Instead, you will have to wrap the column names in backticks:
dat$`1`
[1] 1 2 3 4
Alternatively, you can use a combination of character and numeric data to rename your columns. This could be a much more convenient way of dealing with your problem:
names(dat) <- paste("x", 1:4, sep="")
dat
x1 x2 x3 x4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
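Since the question mentions that some questions are filtered out of the file, the column position may not equal the question number. A possible variation, sketched here on a small made-up data frame (survey and its columns are illustrative), derives the new names from the existing Q001-style headers instead of from the position.

```r
# Small stand-in for the real data; note the gap between Q002 and Q152
survey <- data.frame(Q001 = 1:3, Q002 = 4:6, Q152 = 7:9)

# Strip the leading "Q" and any zero padding, keeping the question number
names(survey) <- sub("^Q0*", "", names(survey))

names(survey)

# Numeric-looking names need backticks with $, or a string with [[ ]]
survey$`152`
survey[["152"]]
```

Note that survey[["152"]] (a string) selects by name, while survey[[152]] (a number) would still select by position.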

Resources