How can I reshape my dataframe? - r

I have a huge data frame; a simplified version looks like this:
trials=c("1","2","3","4","5","6","7","8","9","10")
co =c(rep ("1",10))
stim=c("8","9","11","2","4","7","8","1","12","16")
ansbin=c("1","0","1","0","0","1","0","1","1","0")
stim.1=c("11","2","11","7","4","3","9","1","4","16")
ansbin.1=c("0","0","1","0","0","1","0","1","1","1")
trials.1=c("1","2","3","4","5","6","7","8","9","10")
co.1 =c(rep ("2",10))
stim1.1=c("11","2","11","2","5","7","8","15","17","10")
ansbin1.1=c("1","1","1","0","0","1","1","1","0","1")
stim2.1=c("11","2","14","1","4","8","9","10","4","12")
ansbin2.1=c("0","1","1","0","0","1","0","0","1","0")
ID<- data.frame(trials,co,stim,ansbin,stim.1,ansbin.1,trials.1,co.1,stim1.1,ansbin1.1,stim2.1,ansbin2.1)
View(ID)
Now I would like to build a new data.frame in which "stim", "stim.1", "stim1.1" and "stim2.1" sit under a single column called "stimulus", and likewise for the answers: "ansbin", "ansbin.1", "ansbin1.1" and "ansbin2.1" should all sit under a single column called "answers".
At the same time, "trials" and "trials.1" should go under a single column, with the "co" column telling them apart.
I tried to use "reshape" like this:
df<-reshape(ID, direction="long",
  idvar=c("trials", "co"),
  varying= c("stim","stim.1", "stim1.1","stim2.1","ansbin","ansbin.1","ansbin1.1","ansbin2.1"),
  v.names=c("stimulus","answer"),
  timevar="num"
)
but I get errors and warnings every time. I think the problem is related to the column names.
Can you help me?
Thank you in advance! :)

Here's the approach I would take:
library(data.table)
melt(
rbindlist(split.default(ID, cumsum(grepl("^trials", names(ID))))),
measure.vars = patterns("^stim", "^ansbin"), value.name = c("stim", "ansbin"))
#     trials co variable stim ansbin
#  1:      1  1        1    8      1
#  2:      2  1        1    9      0
#  3:      3  1        1   11      1
#  4:      4  1        1    2      0
#  5:      5  1        1    4      0
# ---
# 36:      6  2        2    8      1
# 37:      7  2        2    9      0
# 38:      8  2        2   10      0
# 39:      9  2        2    4      1
# 40:     10  2        2   12      0
Basically, it sounds like you're looking at two rounds of "reshaping":
1. Stacking the columns from "trials" to the second set of "ansbin" on top of each other. I've done that with the rbindlist(split.default(...)) part of my answer.
2. Stacking each resulting pair of "stim" and "ansbin" columns on top of each other. I've done that with the melt(...) part of my answer.
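The same two rounds can also be sketched in base R, which may help to see what each step does. This is only an illustration, not the answer's method: it uses a three-trial cut of the example data, and the intermediate names (blocks, stacked, long) and the renamed columns stim.1 / stim.2 are assumptions made for the sketch.

```r
# Three-trial cut of the question's data, to keep the sketch short
ID <- data.frame(
  trials = c("1", "2", "3"), co = "1",
  stim = c("8", "9", "11"), ansbin = c("1", "0", "1"),
  stim.1 = c("11", "2", "11"), ansbin.1 = c("0", "0", "1"),
  trials.1 = c("1", "2", "3"), co.1 = "2",
  stim1.1 = c("11", "2", "11"), ansbin1.1 = c("1", "1", "1"),
  stim2.1 = c("11", "2", "14"), ansbin2.1 = c("0", "1", "1")
)

# Round 1: cut the columns into blocks that each start at a "trials" column,
# give both blocks the same names, and stack them
blocks <- split.default(ID, cumsum(grepl("^trials", names(ID))))
blocks <- lapply(blocks, setNames,
                 c("trials", "co", "stim.1", "ansbin.1", "stim.2", "ansbin.2"))
stacked <- do.call(rbind, blocks)

# Round 2: stack the two stim/ansbin pairs; reshape() splits the names at "."
long <- reshape(stacked, direction = "long",
                varying = 3:6, sep = ".",
                timevar = "variable", idvar = c("trials", "co"))
rownames(long) <- NULL
long
```

Renaming the columns to a common name.number pattern is what lets reshape() work out the pairs by itself, which is also why the original column names caused trouble.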

Consider building a list of reshaped data frames, one for each set of columns (co, trials, stimulus, and answers), then merging them together. However, because co and trials each span only two columns while the latter two span four, consider duplicating those columns before reshaping:
ID$co2 <- ID$co
ID$co3 <- ID$co.1
ID$trials.2 <- ID$trials
ID$trials.3 <- ID$trials.1
df_list <- lapply(c("co", "trials", "stim", "ans"), function(s)
reshape(ID, direction="long",
varying= grep(s, names(ID)),
v.names=c(s),
drop = grep(paste0("^", s), names(ID), invert=TRUE),
timevar="num",
new.row.names = 1:1000)
)
# CHAIN MERGE
finaldf <- Reduce(function(x, y) merge(x, y, by=c('id', 'num')), df_list)
finaldf <- with(finaldf, finaldf[order(num, id),]) # SORT DATAFRAME
rownames(finaldf) <- NULL # RESET ROWNAMES
head(finaldf)
#   id num co trials stim ans
# 1  1   1  1      1    8   1
# 2  2   1  1      2    9   0
# 3  3   1  1      3   11   1
# 4  4   1  1      4    2   0
# 5  5   1  1      5    4   0
# 6  6   1  1      6    7   1

Related

How to shift data in only one column up and down in R?

I have a data frame that looks as follows:
ID  Count
1   3
2   5
3   2
4   0
5   1
And I am trying to shift ONLY the values in the "Count" column down one so that it looks as follows:
ID  Count
1   NA
2   3
3   5
4   2
5   0
I will also need to eventually shift the same data up one:
ID  Count
1   5
2   2
3   0
4   1
5   NA
I've tried the following code:
shift <- function(x, n){
c(x[-(seq(n))], rep(NA, n))
}
df$Count <- shift(df$Count, 1)
But it ended up duplicating the titles and shifting the data down, like as follows:
ID  Count
ID  Count
1   3
2   5
3   2
4   0
Is there an easy way for me to accomplish this? Thank you!!
library(data.table)
# set as data.table
setDT(df)
# shift the Count column down by one
df[, Count := shift(Count, 1)]
# base R: prepend NA and drop the last value
df$Count=c(NA, df$Count[1:(nrow(df)-1)])
1) dplyr Using DF shown reproducibly in the Note at the end, use lag and lead from dplyr
library(dplyr)
DF %>% mutate(CountLag = lag(Count), CountLead = lead(Count))
##   ID Count CountLag CountLead
## 1  1     3       NA         5
## 2  2     5        3         2
## 3  3     2        5         0
## 4  4     0        2         1
## 5  5     1        0        NA
2) zoo This creates a zoo object using zoo's vectorized lag. Optionally use fortify.zoo(z) or as.ts(z) to convert it back to a data frame or ts object.
Note that dplyr clobbers lag with its own lag so we used stats::lag to ensure it does not interfere. The stats:: can optionally be omitted if dplyr is not loaded.
library(zoo)
z <- stats::lag(read.zoo(DF), seq(-1, 1)); z
##   Index lag-1 lag0 lag1
## 1     1    NA    3    5
## 2     2     3    5    2
## 3     3     5    2    0
## 4     4     2    0    1
## 5     5     0    1   NA
3) collapse flag from the collapse package is also vectorized over its second argument.
library(collapse)
with(DF, data.frame(ID, Count = flag(Count, seq(-1, 1))))
##   ID Count.F1 Count... Count.L1
## 1  1        5        3       NA
## 2  2        2        5        3
## 3  3        0        2        5
## 4  4        1        0        2
## 5  5       NA        1        0
Note
DF <- data.frame(ID = 1:5, Count = c(3, 5, 2, 0, 1))
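For completeness, the same shift can be sketched in base R without any packages. shift_by is a hypothetical helper name, not something taken from the answers above; in this sketch a positive n shifts the values down and a negative n shifts them up, and it is meant for plain numeric or character vectors.

```r
DF <- data.frame(ID = 1:5, Count = c(3, 5, 2, 0, 1))

# Hypothetical helper: pad with NA on one side, drop values on the other
shift_by <- function(x, n = 1L) {
  k <- abs(n)
  if (k == 0L) return(x)
  # shifting by the full length or more leaves nothing but NA
  if (k >= length(x)) return(rep(x[NA_integer_], length(x)))
  pad <- rep(NA, k)
  if (n > 0) c(pad, head(x, -k)) else c(tail(x, -k), pad)
}

DF$CountDown <- shift_by(DF$Count, 1)   # values move down one row
DF$CountUp   <- shift_by(DF$Count, -1)  # values move up one row
DF
```

Only the Count values move; the ID column is never touched because the helper works on a single vector.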

Merge multiple data frames with partially matching rows

I have several data frames, each holding a list of elements such as NAMES. The names differ somewhat between data frames, but most of them match. I'd like to combine them all into one table that shows which names are missing from which df.
DATA sample for df1:
  X                    x
1 1 rh_Structure/Focus_S
2 2 rh_Structure/Focus_C
3 3 lh_Structure/Focus_S
4 4 lh_Structure/Focus_C
5 5   RH_Type-Function-S
6 6        RH_REFERENT-S
and for df2
  X                    x
1 1 rh_Structure/Focus_S
2 2 rh_Structure/Focus_C
3 3 lh_Structure/Focus_S
4 4 lh_Structure/Focus_C
5 5            UCZESTNIK
6 6                COACH
and expected result would be:
  NAME.                 df1  df2
1 COACH                 NA   6
2 lh_Structure/Focus_C  4    4
3 lh_Structure/Focus_S  3    3
4 RH_REFERENT-S         6    NA
5 rh_Structure/Focus_C  2    2
6 rh_Structure/Focus_S  1    1
7 RH_Type-Function-S    5    NA
8 UCZESTNIK             NA   5
I can do that for two data frames with merge.data.frame(df1,df2,by = "x", all=T),
but I can't work out how to do it for more data frames with a similar structure. Any help would be appreciated.
It might be easier to work with this in a long form. Just rbind all the datasets below one another with a flag for which dataset they came from. Then it's relatively straightforward to get a tabulation of all the missing values (and as an added bonus, you can see if you have any duplicates in any of the source datasets):
dfs <- c("df1","df2")
dfall <- do.call(rbind, Map(cbind, mget(dfs), src=dfs))
table(dfall$x, dfall$src)
#                        df1 df2
#   COACH                  0   1
#   lh_Structure/Focus_C   1   1
#   lh_Structure/Focus_S   1   1
#   RH_REFERENT-S          1   0
#   rh_Structure/Focus_C   1   1
#   rh_Structure/Focus_S   1   1
#   RH_Type-Function-S     1   0
#   UCZESTNIK              0   1
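If the wide layout from the question is what's needed (the position of each name per data frame, NA where it is missing), one possible sketch is to chain merge over a list with Reduce. This is an alternative to the long-form tabulation above, not part of that answer; the list name dfs and the intermediate name renamed are assumptions, and further data frames would simply be added to the list.

```r
df1 <- data.frame(X = 1:6,
                  x = c("rh_Structure/Focus_S", "rh_Structure/Focus_C",
                        "lh_Structure/Focus_S", "lh_Structure/Focus_C",
                        "RH_Type-Function-S", "RH_REFERENT-S"))
df2 <- data.frame(X = 1:6,
                  x = c("rh_Structure/Focus_S", "rh_Structure/Focus_C",
                        "lh_Structure/Focus_S", "lh_Structure/Focus_C",
                        "UCZESTNIK", "COACH"))

dfs <- list(df1 = df1, df2 = df2)  # add df3, df4, ... here

# Rename X to the source name so every data frame contributes its own column
renamed <- Map(function(d, nm) setNames(d[c("x", "X")], c("x", nm)),
               dfs, names(dfs))

# Full outer join of all data frames on the name column
wide <- Reduce(function(a, b) merge(a, b, by = "x", all = TRUE), renamed)
wide
```

Because each X column is renamed before merging, the duplicated-column-name warnings that repeated merge calls tend to produce should not come up.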

concatenating only vector values from a row

I have a problem with my R code. I have a data frame (df) with one column that contains both single numerical values and vectors, and those vectors also hold numerical values. This is an example of some rows of the data frame:
1. 60011000
2. 60523000
4. 60490000
5. 60599000
6. c("60741000", "60740000", "60742000")
7. 60647000
8. c("60766000", "60767000")
9. c("60563000", "60652000")
In the list you can see there are some rows (6, 8 & 9) containing vector elements. I want to concatenate the elements in the vectors to only one element.
For example the result from the vector of line 6 should look like this:
607410006074000060742000
And the result of line 8 should look like this
6076600060767000
My dataframe has more than 30,000 rows so it is impossible for me to do it manually.
Can you help me to solve my problem? It is important that the number of rows does not change.
Thank you very much, and please excuse any mistakes I made; I am not a native speaker.
The data:
dat <- read.table(text='60011000
60523000
60490000
60599000
c("60741000", "60740000", "60742000")
60647000
c("60766000", "60767000")
c("60563000", "60652000")', sep = "\t")
dat
# V1
# 1 60011000
# 2 60523000
# 3 60490000
# 4 60599000
# 5 c(60741000, 60740000, 60742000)
# 6 60647000
# 7 c(60766000, 60767000)
# 8 c(60563000, 60652000)
You can use gsub to replace all non-digit characters with the empty string.
dat$V1 <- gsub("[^0-9]+", "", dat$V1)
dat
# V1
# 1 60011000
# 2 60523000
# 3 60490000
# 4 60599000
# 5 607410006074000060742000
# 6 60647000
# 7 6076600060767000
# 8 6056300060652000
You could do:
df=data.frame(a=c(1,2,3,4,'c("60741000", "60740000", "60742000")'),
b=c(1,2,3,4,5),
stringsAsFactors = F)
> df
a b
1 1 1
2 2 2
3 3 3
4 4 4
5 c("60741000", "60740000", "60742000") 5
df[,"a"]=sapply(df[,"a"],function(x) paste(eval(parse(text=x)),collapse = ""))
> df
a b
1 1 1
2 2 2
3 3 3
4 4 4
5 607410006074000060742000 5
Here you go (looks like someone beat me to the punch):
df <- read.table("df.txt", header = FALSE)
df
# V1
# 1 123
# 2 12
# 3 c("1","55","6")
# 4 356
# 5 c("99","55","3")
df[,1] <- as.numeric(as.character(gsub("[^0-9]","",df[,1])))
df
# V1
# 1 123
# 2 12
# 3 1556
# 4 356
# 5 99553

Merge, cbind: How to merge better? [duplicate]

I want to merge multiple vectors to a data frame. There are two variables, city and id that are going to be used for matching vectors to data frame.
df <- data.frame(array(NA, dim =c(10*50, 2)))
names(df)<-c("city", "id")
df[,1]<-rep(1:50, each=10)
df[,2]<-rep(1:10, 50)
I created a data frame like this. To this data frame, I want to merge 50 vectors that each corresponds to 50 cities. The problem is that each city only has 6 obs. Each city will have 4 NAs.
To give you an example, city 1 data looks like this:
set.seed(1234)
cbind(city=1,id=sample(1:10,6),obs=rnorm(6))
I have 50 cities data and I want to merge them to one column in df. I have tried the following code:
for(i in 1:50){
citydata<-cbind(city=i,id=sample(1:10,6),obs=rnorm(6)) # each city data
df<-merge(df,citydata, by=c("city", "id"), all=TRUE)} # merge to df
But if I run this, the loop will show warnings like this:
In merge.data.frame(df, citydata, by = c("city", "id"), ... :
column names ‘obs.x’, ‘obs.y’ are duplicated in the result
and it will create 50 columns, instead of one long column.
How can I merge cbind(city=i,id=sample(1:10,6),obs=rnorm(6)) to df in a one nice and long column? It seems both cbind and merge are not ways to go.
If I had all 50 citydata sets (each with 6 rows), I could rbind them into one long data set and use the data.table approach or the expand.grid + merge approach that Philip and Jaap suggested.
I wonder whether I can merge each citydata one by one in a loop, instead of rbind-ing them first and then merging the result into df.
data.table is good for this:
library(data.table)
df <- data.table(df)
> df
city id
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 1 5
---
496: 50 6
497: 50 7
498: 50 8
499: 50 9
500: 50 10
I'm using CJ instead of your for loop to make some dummy data. CJ cross-joins each column against each value of each other column, so it makes a two-column table with each possible pair of values of city and id. The [,obs:=rnorm(.N)] command adds a third column that draws random values (without recycling them as it would if it were inside the CJ)--.N means "# rows of this table" in this context.
citydata <- CJ(city=1:50,id=1:6)[,obs:=rnorm(.N)]
> citydata
city id obs
1: 1 1 0.19168335
2: 1 2 0.35753229
3: 1 3 1.35707865
4: 1 4 1.91871907
5: 1 5 -0.56961647
---
296: 50 2 0.30592659
297: 50 3 -0.44989646
298: 50 4 0.05359738
299: 50 5 -0.57494269
300: 50 6 0.09565473
setkey(df,city,id)
setkey(citydata,city,id)
As these two tables have the same key columns the following looks up rows of df by the key columns in citydata, then defines obs in df by looking up the value in citydata. Therefore the resulting object is the original df but with obs defined wherever it was defined in citydata:
df[citydata,obs:=i.obs]
> df
city id obs
1: 1 1 0.19168335
2: 1 2 0.35753229
3: 1 3 1.35707865
4: 1 4 1.91871907
5: 1 5 -0.56961647
---
496: 50 6 0.09565473
497: 50 7 NA
498: 50 8 NA
499: 50 9 NA
500: 50 10 NA
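A smaller, self-contained version of the same update join may make the mechanics easier to follow. It uses the on= argument instead of setkey, which is an alternative to the keyed join above rather than what the answer does; the table sizes and the obs values are made up for illustration.

```r
library(data.table)

# Skeleton: 2 cities x 4 ids, no obs column yet
df <- CJ(city = 1:2, id = 1:4)

# Made-up observations for ids 1 and 2 of each city
citydata <- CJ(city = 1:2, id = 1:2)[, obs := c(10, 20, 30, 40)]

# For each row of citydata, find the matching (city, id) row in df and
# copy obs across; i.obs refers to the obs column of citydata
df[citydata, on = .(city, id), obs := i.obs]
df
```

Rows of df without a match in citydata are left untouched, so their obs stays NA, and df is modified in place without being copied.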
In base R you can do this with a combination of expand.grid and merge:
citydata <- expand.grid(city=1:50,id=1:6)
citydata$obs <- rnorm(nrow(citydata))
res <- merge(df, citydata, by = c("city","id"), all.x = TRUE)
which gives:
> head(res,12)
city id obs
1: 1 1 -0.3121133
2: 1 2 -1.3554576
3: 1 3 -0.9056468
4: 1 4 -0.6511869
5: 1 5 -1.0447499
6: 1 6 1.5939187
7: 1 7 NA
8: 1 8 NA
9: 1 9 NA
10: 1 10 NA
11: 2 1 0.5423479
12: 2 2 -2.3663335
A similar approach with dplyr and tidyr:
library(dplyr)
library(tidyr)
res <- crossing(city=1:50,id=1:6) %>%
mutate(obs = rnorm(n())) %>%
right_join(., df, by = c("city","id"))
which gives:
> res
Source: local data frame [500 x 3]
city id obs
(int) (int) (dbl)
1 1 1 -0.5335660
2 1 2 1.0582001
3 1 3 -1.3888310
4 1 4 1.8519262
5 1 5 -0.9971686
6 1 6 1.3508046
7 1 7 NA
8 1 8 NA
9 1 9 NA
10 1 10 NA
.. ... ... ...
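As for filling df one city at a time inside the loop, a possible base R sketch is to create the obs column once and then assign into the matching rows with match, rather than calling merge on every pass. This is an assumption about what the asker wants, scaled down to 3 cities; the composite key built with paste is just one way to locate the rows.

```r
# Skeleton with every city/id pair (3 cities instead of 50, for brevity)
df <- expand.grid(id = 1:10, city = 1:3)[, c("city", "id")]
df$obs <- NA_real_  # one long column, filled in as we go

set.seed(1234)
for (i in 1:3) {
  # per-city data as in the question: 6 of the 10 ids are observed
  citydata <- data.frame(city = i, id = sample(1:10, 6), obs = rnorm(6))
  # locate each citydata row in df by its (city, id) key
  idx <- match(paste(citydata$city, citydata$id), paste(df$city, df$id))
  df$obs[idx] <- citydata$obs
}

# number of filled and missing rows per city
table(df$city, is.na(df$obs))
```

Since obs already exists in df, nothing is ever joined column-wise, so the obs.x / obs.y duplication from repeated merge calls cannot occur.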

Removing rows after a certain value in R

I have a data frame in R,
df <- data.frame(a=c(1,1,1,2,2,5,5,5,5,5,6,6), b=c(0,1,0,0,0,0,0,1,0,0,0,1))
Within each run of identical values of a, I want to remove the rows where b equals 0 that come after a row where b equals 1.
So the output I am looking for is,
df.out <- data.frame(a=c(1,1,2,2,5,5,5,6,6), b=c(0,1,0,0,0,0,1,0,1))
Is there a way to do this in R?
This should do the trick?
ind = intersect(which(df$b==0), which(df$b==1)+1)
df.out = df[-ind,]
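which(df$b==1) returns the row indices where b==1. Adding one to these and intersecting with the indices where b==0 gives the zeros that directly follow a one. Note that this only catches a 0 immediately after a 1 and ignores the grouping by a, so on the example data row 10 is kept; also, if ind turns out empty, df[-ind,] returns no rows at all.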
How about
df[ ave(df$b, df$a, FUN=function(x) x>=cummax(x))==1, ]
#    a b
# 1  1 0
# 2  1 1
# 4  2 0
# 5  2 0
# 6  5 0
# 7  5 0
# 8  5 1
# 11 6 0
# 12 6 1
Here we use ave to look within each level of a and we test to see if we've seen a 1 yet with cummax.
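To see why the cummax test works, it may help to spell the steps out and compare the result with the df.out from the question. The names keep, res and expected are only for this sketch.

```r
df <- data.frame(a = c(1, 1, 1, 2, 2, 5, 5, 5, 5, 5, 6, 6),
                 b = c(0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1))

# Within each value of a, cummax(b) switches to 1 at the first 1 and stays
# there, so b >= cummax(b) is FALSE exactly for the zeros after a 1
keep <- ave(df$b, df$a, FUN = function(x) x >= cummax(x)) == 1

res <- df[keep, ]
rownames(res) <- NULL  # renumber rows so they line up with df.out

expected <- data.frame(a = c(1, 1, 2, 2, 5, 5, 5, 6, 6),
                       b = c(0, 1, 0, 0, 0, 0, 1, 0, 1))
all.equal(res, expected)
```

ave returns a numeric vector here, so the logical result of the comparison comes back as 0/1, which is why it is compared with 1 before subsetting.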
