Let's say that I have a list which contains fourteen data frames. Each data frame has a final column, zone, which contains a value for the city. So it would look like:
[[1]]
Row.Labels brz zone
1 3/31/09 NA SNE
2 4/30/09 NA SNE
3 5/31/09 NA SNE
[[2]]
Row.Labels brz zone
1 3/31/09 NA FED
2 4/30/09 NA FED
3 5/31/09 NA FED
...
What I want to do is name each data frame within the list with the value from the zone column. I figured a quick for loop would do the trick but I can't seem to find a solution to this problem.
dataset <- do.call("list", lapply(file_list, FUN = function(files){
read.csv(files, header=TRUE, stringsAsFactors=FALSE)
}))
# doesn't work
for( j in 1:length(dataset) ) {
names(dataset[j]) <- unique(dataset[[j]][,"zone"])
}
So the desired result is to name the first list element SNE, the second list element FED, and so forth. But I don't want to do it manually.
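A minimal sketch of one way to get there, assuming every data frame holds a single distinct value in its zone column. The two small data frames below are made-up stand-ins for the files read from file_list. The loop in the question most likely fails because names(dataset[j]) <- ... renames a one-element sub-list and then assigns it back by position, which does not carry the new name over; assigning to names(dataset) as a whole (or to names(dataset)[j]) avoids that.

```r
# Hypothetical stand-in for the list produced by lapply(file_list, read.csv, ...)
dataset <- list(
  data.frame(Row.Labels = c("3/31/09", "4/30/09"), brz = NA, zone = "SNE",
             stringsAsFactors = FALSE),
  data.frame(Row.Labels = c("3/31/09", "4/30/09"), brz = NA, zone = "FED",
             stringsAsFactors = FALSE)
)

# Take the first zone value of each data frame and use it as the element name
names(dataset) <- vapply(dataset,
                         function(d) as.character(d$zone[1]),
                         character(1))

names(dataset)
```

If you prefer to keep the loop, names(dataset)[j] <- unique(dataset[[j]][, "zone"]) should work for the same reason.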
Please have a close look at the example data sets and desired outcome to see the purpose of this question. I am not looking for a solution that merges data sets, so I could not find the answer here: How to join (merge) data frames (inner, outer, left, right)?, nor here: Use apply() to assign value to new column. Rather, I am looking for a way to assign values to new columns when a condition is met.
Here is a reproducible illustration of what I would like to do:
Email <- as.factor(c("1#1.com", "2#2.com", "3#3.com","4#4.com", "5#5.com"))
dataset1 <- data.frame(Email)
Code <- as.factor(c("Z001", "Z002", "Z003","Z004","Z005"))
Email <- as.factor(c("x#x.com", "2#2.com", "y#y.com", "1#1.com","z#z.com"))
dataset2 <- data.frame(Code, Email)
This results in the following example datasets:
Email
1 1#1.com
2 2#2.com
3 3#3.com
4 4#4.com
5 5#5.com
Code Email
1 Z001 x#x.com
2 Z002 2#2.com
3 Z003 y#y.com
4 Z004 1#1.com
5 Z005 z#z.com
Desired output:
Email Z002 Z004
1 1#1.com NA 1
2 2#2.com 1 NA
3 3#3.com NA NA
4 4#4.com NA NA
5 5#5.com NA NA
So I would like to write a loop that checks whether the Email of dataset2 occurs in dataset1, and if this condition is true, that the Code associated with the Email in dataset2, is assigned as a new column name to dataset1 with a 1 as cell value for this observation. My attempt to get this done and an example of the desired output clarifies the question.
My own attempt to fix it (I know it is wrong, but shows my intention):
for(i in 1:nrow(dataset2)){
if(dataset2$Email[i] %in% dataset1$Email)
dataset1[,dataset2$Code[i]] <- dataset2$Code[i]
dataset1[,dataset2$Code[i]][i] <- 1
}
Would be great if anyone could help me out.
Your dataset2 is in "long" format - changing the Code column into multiple columns is changing it to "wide" format. So in addition to the join, we also need to convert from long to wide - this R-FAQ is a good read on that. Combining these two operations, we do this:
dat = merge(dataset1, dataset2, all.x = TRUE) ## left join
dat$value = 1 ## add the value we want in the result
## convert long to wide
result = reshape2::dcast(dat, Email ~ Code, value.var = "value", drop = TRUE)
result["NA"] = NULL ## remove the NA column that is added
result
# Email Z002 Z004
# 1 1#1.com NA 1
# 2 2#2.com 1 NA
# 3 3#3.com NA NA
# 4 4#4.com NA NA
# 5 5#5.com NA NA
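If you would rather avoid the reshape2 dependency, the same table can be sketched in base R. This is only an illustration under the assumption that the columns are read as character vectors rather than factors; the names hits and matched are introduced here and are not part of the original code.

```r
dataset1 <- data.frame(
  Email = c("1#1.com", "2#2.com", "3#3.com", "4#4.com", "5#5.com"),
  stringsAsFactors = FALSE
)
dataset2 <- data.frame(
  Code  = c("Z001", "Z002", "Z003", "Z004", "Z005"),
  Email = c("x#x.com", "2#2.com", "y#y.com", "1#1.com", "z#z.com"),
  stringsAsFactors = FALSE
)

# Keep only the rows of dataset2 whose Email also occurs in dataset1
hits <- dataset2[dataset2$Email %in% dataset1$Email, ]

# Add one column per matching Code: 1 where the Email matches, NA elsewhere
result <- dataset1
for (code in unique(hits$Code)) {
  matched <- result$Email %in% hits$Email[hits$Code == code]
  result[[code]] <- ifelse(matched, 1, NA)
}

result
```

The loop runs once per matching code rather than once per row, so it stays short even when dataset1 is large.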
Suppose I have several data frames:
dfx01=data.frame(city=c("a","b","c","d"),yr=c(2000,2001,2003,2002))
dfx02=data.frame(city=c("a","e","c","d"),yr=c(2000,2001,2005,2002))
dfx012=data.frame(city=c("f","b","c","d"),yr=c(2000,2000,2001,2002))
dfx022=data.frame(city=c("f","b","c","g"),yr=c(2002,2000,2003,2001))
How should I output corresponding data frames x01, x02, x012, x022 that keep only the rows where yr is 2001?
I attempted lapply:
dflist=list(dfx01,dfx02,dfx012,dfx022)
lapply(dflist, fun(x){subset(x,startyr=2000)})
But how do I name the 4 new data frames x01, x02, x012, x022? Thanks.
Your call just needs to be changed a little. Try
lapply(dflist, subset, yr == 2000)
But I prefer [ subsetting, because subset can have unintended results. Here's how to do that, and add new names at the same time. To set names similar to your data frame names, it's best to build the list by name with mget(), so each element carries the name of its own data frame. Avoid naming an existing list from ls(): ls() returns names in sorted order, which need not match the order of the list.
> dflist <- mget(grep("dfx0", ls(), value = TRUE))
> setNames(lapply(dflist, function(x) x[x$yr==2001, ]),
gsub("df", "", names(dflist)))
# $x01
# city yr
# 2 b 2001
#
# $x012
# city yr
# 3 c 2001
#
# $x02
# city yr
# 2 e 2001
#
# $x022
# city yr
# 4 g 2001
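As a self-contained sketch of the same idea, assuming you are happy to type the four names once: naming the list explicitly when it is created keeps each name attached to the right data frame, and the result can optionally be unpacked into separate objects.

```r
dfx01  <- data.frame(city = c("a", "b", "c", "d"), yr = c(2000, 2001, 2003, 2002))
dfx02  <- data.frame(city = c("a", "e", "c", "d"), yr = c(2000, 2001, 2005, 2002))
dfx012 <- data.frame(city = c("f", "b", "c", "d"), yr = c(2000, 2000, 2001, 2002))
dfx022 <- data.frame(city = c("f", "b", "c", "g"), yr = c(2002, 2000, 2003, 2001))

# Name each element at construction time so names and data cannot drift apart
dflist <- list(x01 = dfx01, x02 = dfx02, x012 = dfx012, x022 = dfx022)

# Keep only the rows with yr == 2001; lapply preserves the list names
subsets <- lapply(dflist, function(x) x[x$yr == 2001, ])

# Optional: create x01, x02, x012, x022 as separate objects in the workspace
# list2env(subsets, envir = globalenv())
```

Keeping the results in the named list (subsets$x01 and so on) is usually easier to work with than creating separate objects.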
So, I created a list of CSV files:
tbl = list.files(pattern="*.csv")
Then I separated them into two different lists:
tbl1 <- tbl[c(1,3:7,10:12,14:18,20)]
tbl2 <- tbl[c(2,19,8:9,13)]
Then loaded them:
list_of_data1 = lapply(tbl1, read.csv)
list_of_data2 = lapply(tbl2, read.csv)
And now I want to create a master file. I just want to select some data from each CSV file and store it in one table. To do that I created this loop:
gdata1 = lapply(list_of_data1,function(x) x[3:nrow(x),10:13])
for( i in 1:length(list_of_data1)){
rownames(gdata1[[i]]) = list_of_data1[[i]][3:nrow(list_of_data1[[i]]),1]
}
tmp = lapply(gdata1,function(x) matrix(as.numeric(x),ncol=4))
final.table1=c()
for(i in 1:length(gnames)){
print(i)
tmp=gnames[i]
f1 = function(x) {x[tmp,]}
tmp2 = lapply(gdata1,f1)
tmp3 = c()
for(j in 1:length(tmp2)){
tmp3=rbind(tmp3,tmp2[[j]])
}
tmp4 = as.vector(t(tmp3))
final.table1 = rbind(final.table1,tmp4)
}
rownames(final.table1) = gnames
I created two different lists of data because in the first one, list_of_data1, there are four columns of interest to me (10:13), and in the other one, list_of_data2, there are only three (10:12). I want to put all of the data in one table. Is there any way to do it in one loop?
I have an idea how to solve that problem: I could create a new loop for list_of_data2 and then bind both results using cbind. I would like to do it in a more elegant way, which is why I came here!
I would suggest looking into do.call. You can rbind your first list of tables, then rbind your second list of tables, and then cbind them as you stated. Below is a trivial use of do.call:
#creating a list of tables that we are interested in appending
#together in one master dataframe
ts<-lapply(c(1,2,3),function(x) data.frame(c1=rep(c("a","b"),2),c2=(1:4)*x,c3=rnorm(4)))
#you could of course subset ts to the set of columns
#you find of interest ts[,colsOfInterest]
master<-do.call(rbind,ts)
After seeing your complication of various rows/columns of interest in each file, I think you could do something like this. It seems a bit hackerish but could get the job done. I assume you merge the files based on a column named id; you could of course generalize this to multiple columns.
#creating a series of data frames for which we only want a subset of row/cols
> df1<-data.frame(id=1:10,val1=rnorm(10),val2=rnorm(10))
> df2<-data.frame(id=5:10,val3=rnorm(6))
> df3<-data.frame(id=1:3,val4=rnorm(3), val5=rnorm(3), val6=rnorm(3))
#specifying which rows/cols we are interested in
#i assume you have some way of doing this programmatically or you defined elsewhere
> colsofinterest<-list(df1=c("id","val1"),df2=c("id","val3"),df3=c("id","val5","val6"))
> rowsofinterest<-list(df1=1:5,df2=5:8,df3=2:3)
#create a list of data frames where each has only the row/cols combination we want
> ts<-lapply(c("df1","df2","df3"),
function(x) get(x)[rowsofinterest[[x]],colsofinterest[[x]]])
> ts
[[1]]
id val1
1 1 0.24083489
2 2 -0.50140019
3 3 -0.24509033
4 4 1.41865350
5 5 -0.08123618
[[2]]
id val3
5 9 -0.1862852
6 10 0.5117775
NA NA NA
NA.1 NA NA
[[3]]
id val5 val6
2 2 0.2056010 -0.6788145
3 3 0.2057397 0.8416528
#now merge these based on a key column "id", and we want to keep all.
> final<-Reduce(function(x,y) merge(x,y,by="id",all=T), ts)
> head(final)
id val1 val3 val5 val6
1 1 0.24083489 NA NA NA
2 2 -0.50140019 NA 0.2056010 -0.6788145
3 3 -0.24509033 NA 0.2057397 0.8416528
4 4 1.41865350 NA NA NA
5 5 -0.08123618 NA NA NA
6 9 NA -0.1862852 NA NA
Is this what you are thinking about or did I misinterpret?
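For reference, here is a self-contained sketch of the same select-then-merge idea that keeps the data frames in a named list and uses Map() instead of get(). The names dfs, rows, cols and pieces are introduced only for this illustration, and the row selections are chosen to stay within each data frame's range so that no all-NA rows are produced.

```r
set.seed(42)
dfs <- list(
  df1 = data.frame(id = 1:10, val1 = rnorm(10), val2 = rnorm(10)),
  df2 = data.frame(id = 5:10, val3 = rnorm(6)),
  df3 = data.frame(id = 1:3, val4 = rnorm(3), val5 = rnorm(3), val6 = rnorm(3))
)

# Rows and columns to keep from each data frame (assumed to be known up front)
cols <- list(df1 = c("id", "val1"), df2 = c("id", "val3"), df3 = c("id", "val5", "val6"))
rows <- list(df1 = 1:5, df2 = 1:4, df3 = 2:3)

# Subset every data frame with its own row/column selection
pieces <- Map(function(d, r, k) d[r, k, drop = FALSE], dfs, rows, cols)

# Full outer join of all pieces on the key column "id"
final <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE), pieces)
final
```

Because each selection travels with its data frame, adding a file with a different set of columns only means adding one entry to each list.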
Note that ldply() functions in the same way as do.call() in JPC's answer. I just happen to use plyr more; if you are looking at manipulating R data structures in a vectorised way, there is lots of useful stuff in there.
library(plyr)
d1 <- ldply(list_of_data1, rbind)
d2 <- ldply(list_of_data2, rbind)
Then select the columns of interest from d1 and d2:
d1 <- d1[,c(10:13)]
d2 <- d2[,c(10:12)]
final.df <- cbind(d1,d2)
How do I create a fixed-size data frame of size [40 2], fill the first column with unique strings, and populate the other with specific values? Again, I want the first column to be the list of strings; I don't want a row of headers.
(Someone please give me some pointers. I haven't programmed in R for a while and my R skills are terrible to begin with.)
Two approaches:
# sequential strings
library(stringr)
df.1 <- data.frame(id=paste0("X",str_pad(1:40,2,"left","0")),value=NA)
head(df.1)
# id value
# 1 X01 NA
# 2 X02 NA
# 3 X03 NA
# 4 X04 NA
# 5 X05 NA
# 6 X06 NA
Second Approach:
# random strings
rstr <- function(n,k){
sapply(1:n,function(i){do.call(paste0,as.list(sample(letters,k,replace=T)))})
}
set.seed(1)
df.2 <- data.frame(id=rstr(40,5),value=NA)
head(df.2)
# id value
# 1 gjoxf NA
# 2 xyrqb NA
# 3 ferju NA
# 4 mszju NA
# 5 yfqdg NA
# 6 kajwi NA
The function rstr(n,k) produces a vector of length n with each element being a string of random characters of length k. rstr(...) does not guarantee that all strings are unique, but for n much smaller than 26^k the probability of any duplicate is small, roughly n^2/(2*26^k) (about 7e-5 for n = 40, k = 5).
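If uniqueness has to be guaranteed rather than merely very likely, one option is to keep drawing until enough distinct strings have been collected. The helper unique_rstr() below is a hypothetical addition for illustration, not part of the approach above, and it uses a slightly simplified rstr().

```r
# Random strings of k lowercase letters (same idea as rstr above)
rstr <- function(n, k) {
  vapply(seq_len(n),
         function(i) paste0(sample(letters, k, replace = TRUE), collapse = ""),
         character(1))
}

# Hypothetical helper: redraw until n distinct strings have been collected
unique_rstr <- function(n, k) {
  stopifnot(n <= 26^k)  # otherwise n distinct strings cannot exist
  out <- character(0)
  while (length(out) < n) {
    out <- unique(c(out, rstr(n - length(out), k)))
  }
  out
}

set.seed(1)
df.3 <- data.frame(id = unique_rstr(40, 5), value = NA, stringsAsFactors = FALSE)
head(df.3)
```

For n = 40 and k = 5 the loop will almost always finish on the first pass.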
Create the data.frame and define its columns with the values.
The recycling rule repeats the two strings to fill the 40 rows defined by the second column:
df <- data.frame(x = c("unique_string 1", "unique_string 2"), y = rpois(40, 2))
# Change column names
names(df) <- c("string_col", "num_col")
I found this way of creating data frames in R extremely productive and easy:
create a raw vector of values, convert it into a matrix of the required dimensions, and finally name the columns and rows.
dataframe.values = c(value1, value2,.......)
dataframe = matrix(dataframe.values,nrow=number of rows ,byrow = T)
colnames(dataframe) = c("column1","column2",........)
row.names(dataframe) = c("row1", "row2",............)
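A small runnable instance of the template above, with made-up values. One caveat worth knowing: a matrix holds a single type, so mixing strings and numbers turns everything into character, and the result is still a matrix until it is converted with as.data.frame().

```r
# Made-up values, listed row by row: a string followed by a number
values <- c("a", 1, "b", 2, "c", 3)

m <- matrix(values, nrow = 3, byrow = TRUE)
colnames(m) <- c("name", "value")

# Convert the character matrix to a data frame and restore the numeric column
df <- as.data.frame(m, stringsAsFactors = FALSE)
df$value <- as.numeric(df$value)

str(df)
```

When the columns have different types, building the data frame directly with data.frame() avoids the conversion step.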
exampledf <- data.frame(columnofstrings=c("a string", "another", "yetanother"),
columnofvalues=c(2,3,5) )
gives
> exampledf
columnofstrings columnofvalues
1 a string 2
2 another 3
3 yetanother 5
Here is what my data look like.
id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{
As you can see, there can be multiple codes concatenated into a single column, separated by {. It is also possible for a row to have no interest_string values at all.
How can I manipulate this data frame to extract the values into a format like this:
id interest
1 YI
1 Z0
1 ZI
2 ZO
3 <NA>
4 ZT
I need to complete this task with R.
Thanks in advance.
This is one solution
out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))
out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
interest = unlist(out, use.names = FALSE))
Giving:
R> out
id interest
1 1 YI
2 1 Z0
3 1 ZI
4 2 ZO
5 3 <NA>
6 4 ZT
Explanation
The first line of solution simply splits each element of the interest_string factor in data object dat, using \\{ as the split indicator. This indicator has to be escaped and in R that requires two \. (Actually it doesn't if you use fixed = TRUE in the call to strsplit.) The resulting object is a list, which looks like this for the example data
R> out
[[1]]
[1] "YI" "Z0" "ZI"
[[2]]
[1] "ZO"
[[3]]
[1] "<NA>"
[[4]]
[1] "ZT"
We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.
Hence, in the second line, we bind, column-wise (specifying the data frame method so we get a data frame returned) the original id values, each one repeated the required number of times, to the strsplit list (out). By unlisting this list, we unwrap it to a vector which is of the required length as given by your expected output. We get the number of times we need to replicate each id value from the lengths of the components of the list returned by strsplit.
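Putting the two steps together, here is a self-contained sketch that can be pasted into a fresh session. It assumes interest_string is stored as character and that the missing entry is a real NA; the names parts and long are introduced for this illustration.

```r
dat <- data.frame(
  id = 1:4,
  interest_string = c("YI{Z0{ZI{", "ZO{", NA, "ZT{"),
  stringsAsFactors = FALSE
)

# Step 1: split each string on the literal "{"; an NA input yields a single NA
parts <- strsplit(dat$interest_string, "{", fixed = TRUE)

# Step 2: repeat each id once per extracted code and flatten the list
long <- data.frame(
  id = rep(dat$id, times = lengths(parts)),
  interest = unlist(parts, use.names = FALSE),
  stringsAsFactors = FALSE
)

long
```

lengths(parts) is a shorthand for sapply(parts, length) and is available in R 3.2.0 and later.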
A nice and tidy data.table solution:
library(data.table)
DT <- data.table( read.table( textConnection("id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{"), header=TRUE))
DT$interest_string <- as.character(DT$interest_string)
DT[, {
list(interest=unlist(strsplit( interest_string, "{", fixed=TRUE )))
}, by=id]
gives me
id interest
1: 1 YI
2: 1 Z0
3: 1 ZI
4: 2 ZO
5: 3 <NA>
6: 4 ZT