Please have a close look at the example data sets and the desired outcome to see the purpose of this question. It is not a solution for merging data sets that I am looking for, so I could find the answer neither here: How to join (merge) data frames (inner, outer, left, right)?, nor here: Use apply() to assign value to new column. Rather, it concerns assigning values to new column names when a condition is met.
Here is a reproducible illustration of what I would like to do:
Email <- as.factor(c("1#1.com", "2#2.com", "3#3.com","4#4.com", "5#5.com"))
dataset1 <- data.frame(Email)
Code <- as.factor(c("Z001", "Z002", "Z003","Z004","Z005"))
Email <- as.factor(c("x#x.com", "2#2.com", "y#y.com", "1#1.com","z#z.com"))
dataset2 <- data.frame(Code, Email)
This results in the following example datasets:
Email
1 1#1.com
2 2#2.com
3 3#3.com
4 4#4.com
5 5#5.com
Code Email
1 Z001 x#x.com
2 Z002 2#2.com
3 Z003 y#y.com
4 Z004 1#1.com
5 Z005 z#z.com
Desired output:
Email Z002 Z004
1 1#1.com NA 1
2 2#2.com 1 NA
3 3#3.com NA NA
4 4#4.com NA NA
5 5#5.com NA NA
So I would like to write a loop that checks whether an Email of dataset2 occurs in dataset1 and, if this condition is true, assigns the Code associated with that Email in dataset2 as a new column name to dataset1, with a 1 as the cell value for the matching observation. My attempt below and the example of the desired output above should clarify the question.
My own attempt (I know it is wrong, but it shows my intention):
for(i in 1:nrow(dataset2)){
  if(dataset2$Email[i] %in% dataset1$Email)
    dataset1[, dataset2$Code[i]] <- dataset2$Code[i]
  dataset1[, dataset2$Code[i]][i] <- 1
}
It would be great if anyone could help me out.
Your dataset2 is in "long" format; changing the Code column into multiple columns means converting it to "wide" format. So in addition to the join, we also need to convert from long to wide; this R-FAQ is a good read on that. Combining these two operations, we do this:
dat = merge(dataset1, dataset2, all.x = TRUE)  ## left join
dat$value = 1  ## add the value we want in the result
## convert long to wide
result = reshape2::dcast(dat, Email ~ Code, value.var = "value", drop = TRUE)
result["NA"] = NULL  ## remove the NA column added for the unmatched emails
result
# Email Z002 Z004
# 1 1#1.com NA 1
# 2 2#2.com 1 NA
# 3 3#3.com NA NA
# 4 4#4.com NA NA
# 5 5#5.com NA NA
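For comparison, here is a minimal sketch of the same join-then-widen using tidyr instead of reshape2 (assuming tidyr >= 1.0; like dcast, pivot_wider produces an "NA" column for the unmatched emails, which we drop the same way):
library(tidyr)
dat <- merge(dataset1, dataset2, all.x = TRUE)  # left join, as above
dat$value <- 1
result2 <- pivot_wider(dat, names_from = Code, values_from = value)
result2[["NA"]] <- NULL  # drop the column created for unmatched emails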
I am working in R with a long table, stored as a data.table, containing values recorded at each change for variables of numeric and character type. When I want to perform some operations, like correlations or regressions, I have to convert the table into wide format and homogenise the timestamp frequency.
I found a way to convert the long table to wide, but I think it is not really efficient, and I would like to know if there is a better, more data.table-native approach.
In the reproducible example below, I include the two options I found to perform the long-to-wide transformation, and in the comments I indicate which parts I believe are not optimal.
library(zoo)
library(data.table)
dt <- data.table(time = 1:6, variable = factor(letters[1:6]),
                 numeric = c(1:3, rep(NA, 3)),
                 character = c(rep(NA, 3), letters[1:3]), key = "time")
print(dt)
print(dt[,lapply(.SD,typeof)])
#option 1
casted<-dcast(dt,time~variable,value.var=c("numeric","character"))
# types are correct, but I got NA filled columns,
# is there an option like drop
# available for columns instead of rows?
print(casted)
print(casted[,lapply(.SD,typeof)])
# This drop looks ugly but I did not figure out a better way to perform it
casted[,names(casted)[unlist(casted[,lapply(lapply(.SD,is.na),all)])]:=NULL]
# I perform a LOCF, I do not know if I could benefit of
# data.table's roll option somehow and avoid
# the temporal memory copy of my dataset (this would be the second
# and minor issue)
casted<-na.locf(casted)
#option2
# taken from http://stackoverflow.com/questions/19253820/how-to-implement-coalesce-efficiently-in-r
coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  }, list(...))
}
casted2<-dcast(dt[,coalesce2(numeric,character),by=c("time","variable")],
time~variable,value.var="V1")
# There are no NA columns, but the types are incorrect,
# and it takes more space in a real table (more observations, fewer variables)
print(casted2)
print(casted2[,lapply(.SD,typeof)])
# Again, I am pretty sure there is a prettier way to do this
numericvars<-names(casted2)[!unlist(casted2[,lapply(
lapply(lapply(.SD,as.numeric),is.na),all)])]
casted2[,eval(numericvars):=lapply(.SD,as.numeric),.SDcols=numericvars]
# same as option 1, is there a data.table native way to do it?
casted2<-na.locf(casted2)
Any advice/improvement in the process is welcome.
I'd maybe do the char and num tables separately and then rbind:
k = "time"
typecols = c("numeric", "character")
res = rbindlist(fill = TRUE,
lapply(typecols, function(tc){
cols = c(k, tc, "variable")
dt[!is.na(get(tc)), ..cols][, dcast(.SD, ... ~ variable, value.var=tc)]
})
)
setorderv(res, k)
res[, setdiff(names(res), k) := lapply(.SD, zoo::na.locf, na.rm = FALSE), .SDcols=!k]
which gives
time a b c d e f
1: 1 1 NA NA NA NA NA
2: 2 1 2 NA NA NA NA
3: 3 1 2 3 NA NA NA
4: 4 1 2 3 a NA NA
5: 5 1 2 3 a b NA
6: 6 1 2 3 a b c
Note that the OP's final result, casted2, differs in that it has all columns as character.
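As a quick check (a sketch, reusing the typeof() printing from the question), the per-type cast above keeps the original column types:
print(res[, lapply(.SD, typeof)])
# time stays integer, a..c stay double, d..f stay character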
I have a data frame with three variables whose valid values are 1 through 7. If no numeric value is assigned to a variable, it shows NA. The data frame a looks like below:
ak_eth co_eth pa_eth
1 NA 1 NA
2 NA NA 1
3 NA NA NA
4 2 NA NA
5 NA NA 4
6 NA NA NA
Each row could have NA across all three variables or have only one value in one of the three variables. I want to create a new variable called recode that takes values from the existing three variables. If all three existing variables are NA, the new value is NA; if one of the three existing variables has a value, then take that value for the new variable.
I've tried this, but it didn't seem to work for me:
a$recode[is.na(a$ak_eth) & is.na(a$co_eth) & is.na(a$pa_eth)] <- "NA"
library(car)
a$recode <- recode(a$ak_eth, "1=1;2=2;3=3;4=4;5=5;6=6;7=7")
a$recode <- recode(a$co_eth, "1=1;2=2;3=3;4=4;5=5;6=6;7=7")
a$recode <- recode(a$pa_eth, "1=1;2=2;3=3;4=4;5=5;6=6;7=7")
Any suggestions will be appreciated. Thanks!
We can use pmax
a$Recode_Var <- do.call(pmax, c(a, na.rm = TRUE))
Or use pmin
a$Recode_Var <- do.call(pmin, c(a, na.rm = TRUE))
Or another option is rowSums
r1 <- rowSums(a, na.rm = TRUE)
a$Recode_Var <- replace(r1, r1==0, NA)
NOTE: this works because, according to the OP's post, "Each row could have NA across all three variables or have only one value in one of the three variables".
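Another option, shown here as a sketch assuming the dplyr package is available, is coalesce(), which takes the first non-NA value across the three columns and is equivalent here because each row has at most one value:
library(dplyr)
a$Recode_Var <- coalesce(a$ak_eth, a$co_eth, a$pa_eth)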
So, I created a list of csv files:
tbl = list.files(pattern="*.csv")
Then I separated them into two different lists:
tbl1 <- tbl[c(1,3:7,10:12,14:18,20)]
tbl2 <- tbl[c(2,19,8:9,13)]
Then loaded them:
list_of_data1 = lapply(tbl1, read.csv)
list_of_data2 = lapply(tbl2, read.csv)
And now I want to create a master file. I just want to select some data from each csv file and store it in one table. To do that I created the following loop:
gdata1 = lapply(list_of_data1, function(x) x[3:nrow(x), 10:13])
for(i in 1:length(list_of_data1)){
  rownames(gdata1[[i]]) = list_of_data1[[i]][3:nrow(list_of_data1[[i]]), 1]
}
tmp = lapply(gdata1, function(x) matrix(as.numeric(x), ncol = 4))
final.table1 = c()
for(i in 1:length(gnames)){
  print(i)
  tmp = gnames[i]
  f1 = function(x) {x[tmp, ]}
  tmp2 = lapply(gdata1, f1)
  tmp3 = c()
  for(j in 1:length(tmp2)){
    tmp3 = rbind(tmp3, tmp2[[j]])
  }
  tmp4 = as.vector(t(tmp3))
  final.table1 = rbind(final.table1, tmp4)
}
rownames(final.table1) = gnames
I created two different lists of data because in the first one, list_of_data1, there are four columns of interest for me (10:13), while in the other one, list_of_data2, there are only three (10:12). I want to put all of the data in one table. Is there any way to do it in one loop?
I have an idea how to solve the problem: I could create a new loop for list_of_data2 and after that bind both results using cbind. But I want to do it in a more elegant way, so that's why I came here!
I would suggest looking into do.call: you can rbind your first list of tables, then rbind your second list of tables, and then cbind as you stated. Below is a trivial use of do.call:
# creating a list of tables that we are interested in appending
# together in one master data frame
ts <- lapply(c(1, 2, 3), function(x)
  data.frame(c1 = rep(c("a", "b"), 2), c2 = (1:4) * x, c3 = rnorm(4)))
# you could of course subset ts to the set of columns
# you find of interest: ts[, colsOfInterest]
master <- do.call(rbind, ts)
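To illustrate, master stacks the three 4-row frames into one 12-row data frame:
dim(master)
# [1] 12  3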
After seeing your complication of various rows/columns of interest in each file, I think you could do something like this. It seems a bit hackish but should get the job done. I assume you merge the files based on a column named id; you could of course generalize this to multiple columns, etc.
#creating a series of data frames for which we only want a subset of row/cols
> df1<-data.frame(id=1:10,val1=rnorm(10),val2=rnorm(10))
> df2<-data.frame(id=5:10,val3=rnorm(6))
> df3<-data.frame(id=1:3,val4=rnorm(3), val5=rnorm(3), val6=rnorm(3))
#specifying which rows/cols we are interested in
#i assume you have some way of doing this programmatically or you defined elsewhere
> colsofinterest<-list(df1=c("id","val1"),df2=c("id","val3"),df3=c("id","val5","val6"))
> rowsofinterest<-list(df1=1:5,df2=5:8,df3=2:3)
#create a list of data frames where each has only the row/cols combination we want
> ts<-lapply(c("df1","df2","df3"),
function(x) get(x)[rowsofinterest[[x]],colsofinterest[[x]]])
> ts
[[1]]
id val1
1 1 0.24083489
2 2 -0.50140019
3 3 -0.24509033
4 4 1.41865350
5 5 -0.08123618
[[2]]
id val3
5 9 -0.1862852
6 10 0.5117775
NA NA NA
NA.1 NA NA
[[3]]
id val5 val6
2 2 0.2056010 -0.6788145
3 3 0.2057397 0.8416528
#now merge these based on a key column "id", and we want to keep all.
> final<-Reduce(function(x,y) merge(x,y,by="id",all=T), ts)
> head(final)
id val1 val3 val5 val6
1 1 0.24083489 NA NA NA
2 2 -0.50140019 NA 0.2056010 -0.6788145
3 3 -0.24509033 NA 0.2057397 0.8416528
4 4 1.41865350 NA NA NA
5 5 -0.08123618 NA NA NA
6 9 NA -0.1862852 NA NA
Is this what you are thinking about or did I misinterpret?
Note: ldply() functions in the same way as do.call() in JPC's answer; I just happen to use plyr more. If you are looking at manipulating R data structures in a vectorised way, there is lots of useful stuff in there.
library(plyr)
d1 <- ldply(list_of_data1, rbind)
d2 <- ldply(list_of_data2, rbind)
# select the columns of interest from d1 and d2
d1 <- d1[, c(10:13)]
d2 <- d2[, c(10:12)]
final.df <- cbind(d1, d2)
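For reference, a minimal sketch of the same row-bind step without plyr (assuming all frames in each list share the same column names):
d1 <- do.call(rbind, list_of_data1)
# or with dplyr, which also fills mismatched columns with NA:
# d1 <- dplyr::bind_rows(list_of_data1)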
How do I create a fixed-size data frame of dimensions [40 x 2], declare the first column with unique strings, and populate the other with specific values? Again, I want the first column to be the list of strings; I don't want a row of headers.
(Someone please give me some pointers. I haven't programmed in R for a while and my R skills are terrible to begin with.)
Two approaches:
# sequential strings
library(stringr)
df.1 <- data.frame(id=paste0("X",str_pad(1:40,2,"left","0")),value=NA)
head(df.1)
# id value
# 1 X01 NA
# 2 X02 NA
# 3 X03 NA
# 4 X04 NA
# 5 X05 NA
# 6 X06 NA
Second Approach:
# random strings
rstr <- function(n, k){
  sapply(1:n, function(i){
    do.call(paste0, as.list(sample(letters, k, replace = TRUE)))
  })
}
set.seed(1)
df.2 <- data.frame(id=rstr(40,5),value=NA)
head(df.2)
# id value
# 1 gjoxf NA
# 2 xyrqb NA
# 3 ferju NA
# 4 mszju NA
# 5 yfqdg NA
# 6 kajwi NA
The function rstr(n, k) produces a vector of length n, with each element being a string of k random characters. rstr(...) does not guarantee that all strings are unique, but the probability of duplication is O(n/26^k).
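If uniqueness matters, a quick base R check is:
anyDuplicated(df.2$id) == 0  # TRUE if all 40 generated ids are distinct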
Create the data.frame and define its columns with the values. The recycling rule repeats the two strings to match the 40 rows defined by the second column:
df <- data.frame(x = c("unique_string 1", "unique_string 2"), y = rpois(40, 2))
# Change column names
names(df) <- c("string_col", "num_col")
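Because two strings are recycled into 40 rows, each appears exactly 20 times:
table(df$string_col)
# unique_string 1 unique_string 2
#              20              20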
I found this way of creating data frames in R extremely productive and easy: create a raw vector of values, then convert it into a matrix of the required dimensions, and finally name the columns and rows.
dataframe.values <- c(value1, value2, .......)  # raw vector of values
dataframe <- matrix(dataframe.values, nrow = number_of_rows, byrow = TRUE)
colnames(dataframe) <- c("column1", "column2", ........)
row.names(dataframe) <- c("row1", "row2", ............)
dataframe <- as.data.frame(dataframe)  # the matrix must still be converted to a data frame
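A concrete instance of that template (values chosen arbitrarily for illustration):
values <- c(10, 20, 30, 40)
df <- as.data.frame(matrix(values, nrow = 2, byrow = TRUE))
colnames(df) <- c("column1", "column2")
rownames(df) <- c("row1", "row2")
df
#      column1 column2
# row1      10      20
# row2      30      40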
exampledf <- data.frame(columnofstrings=c("a string", "another", "yetanother"),
columnofvalues=c(2,3,5) )
gives
> exampledf
columnofstrings columnofvalues
1 a string 2
2 another 3
3 yetanother 5
I am trying to calculate a correlation for each group in my data frame df3, which looks like this:
group a b
1 01_01-102_PRT 0.5857299 1.0915944
2 01_1014_EMH -0.8875033 0.9982261
3 02_02-012_ABT 1.5402289 1.0095046
4 02_02-028B_TMA -0.2635421 0.9533909
5 02_097A_KMG 0.1529145 1.0452099
6 02_116_DMC 0.7375643 0.9927591
My code:
require(plyr)
func <- function(df3) {
  return(data.frame(COR = cor(df3$a, df3$b)))
}
too <- ddply(df3, .(group), func)
My output:
group COR
1 01_01-102_PRT NA
2 01_1014_EMH NA
3 02_02-012_ABT NA
4 02_02-028B_TMA NA
5 02_097A_KMG NA
....
I have also tried other approaches given here: https://stats.stackexchange.com/questions/4040/r-compute-correlation-by-group, but I always get NAs.
Help please. Thanks, Jason
It appears that each group consists of exactly one row, and therefore of a single a and a single b value. You cannot calculate a correlation if there is no variation in the data; hence, you need at least two observations per group, with different values for both a and b.
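A minimal sketch of the problem and one possible fix (the batch column and the sub() call are hypothetical, assuming the prefix before the first "_" marks a coarser grouping with several rows each):
cor(1, 2)  # NA, with a warning: the standard deviation is zero
df3$batch <- sub("_.*", "", df3$group)  # e.g. "01_01-102_PRT" -> "01"
too <- ddply(df3, .(batch), func)       # correlations are now computable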