Select/extract rows on the basis of row name in R [duplicate]

This question already has answers here:
subsetting data frame based on search pattern in vector
(2 answers)
subset with pattern
(3 answers)
Closed 6 years ago.
I want to extract some rows from my data in R based on a specific identifier in the column ids. My data looks like this:
ids A1 B1 C1 D1 E1 ...
asd.wd.01 12 23 27 32 76
qsd.yh.02 54 32 32 11 22
gsd.kj.01 22 21 67 88 22
hnd.gd.02 22 88 42 41 93
sjd.td.01 52 31 72 19 31
And I want the following output (rows whose id ends in 01, e.g. xxx.xx.01):
ids A1 B1 C1 D1 E1 ...
asd.wd.01 12 23 27 32 76
gsd.kj.01 22 21 67 88 22
sjd.td.01 52 31 72 19 31

You can use string matching. For example:
Index <- grep("\\.01$", df$ids) ## indices of the rows whose ids end in .01
df <- df[Index, ] ## subset the data frame

You can extract the rows by using grepl:
df <- subset(df, grepl("\\.01$", ids))
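To compare the two answers above, here is a minimal sketch that rebuilds the question's sample data (only the rows and columns shown in the question; the name df is taken from the answers) and checks that indexing with grep and with grepl picks the same rows:

```r
# Sample data copied from the question (columns after E1 are omitted)
df <- data.frame(
  ids = c("asd.wd.01", "qsd.yh.02", "gsd.kj.01", "hnd.gd.02", "sjd.td.01"),
  A1 = c(12, 54, 22, 22, 52),
  B1 = c(23, 32, 21, 88, 31),
  C1 = c(27, 32, 67, 42, 72),
  D1 = c(32, 11, 88, 41, 19),
  E1 = c(76, 22, 22, 93, 31),
  stringsAsFactors = FALSE
)

# grep() returns the positions of the matching ids ...
by_index <- df[grep("\\.01$", df$ids), ]

# ... while grepl() returns a logical vector as long as ids
by_flag <- df[grepl("\\.01$", df$ids), ]

# Both approaches keep the same three ids
identical(by_index$ids, by_flag$ids)
by_index$ids
```

The escaped dot and the $ anchor both matter here: without the anchor, an id such as "asd.01.02" would also be selected.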

Use %>% (the pipe operator) and filter() from the dplyr package together with %like% from the data.table package. The example below extracts the rows where Name ends with .1; you can apply the same kind of pattern to your own data.
> library(dplyr)
> library(data.table)
> df <- data.frame(Name=c("A.1","B.1","A.3","B.2","C.1"),A=1:5,B=5:9,C=10:14)
> df
Name A B C
1 A.1 1 5 10
2 B.1 2 6 11
3 A.3 3 7 12
4 B.2 4 8 13
5 C.1 5 9 14
> df %>% filter(Name %like% "\\.1$")
Name A B C
1 A.1 1 5 10
2 B.1 2 6 11
3 C.1 5 9 14
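One caveat on the pattern: as far as I know, %like% hands its right-hand side to grepl as a regular expression, so an unescaped dot matches any character. A small base-R sketch with grepl (the names below are hypothetical and chosen to include near-misses) shows the difference between ".1" and "\\.1$":

```r
# Hypothetical names, chosen to include near-misses
Name <- c("A.1", "B.1", "A.3", "A11", "C.10")

# Unescaped dot: "any character followed by 1", anywhere in the string
loose <- grepl(".1", Name)

# Escaped dot plus end anchor: a literal ".1" at the end of the string
strict <- grepl("\\.1$", Name)

data.frame(Name, loose, strict)
```

Only the strict pattern rejects "A11" and "C.10", which is what "ends with .1" is meant to express.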

Related

How can I change the order of two columns in R? [duplicate]

This question already has answers here:
How does one reorder columns in a data frame?
(12 answers)
Closed 3 years ago.
How can I swap two columns of a data set in R? For example, I have
1 56
2 43
3 42
4 32
and I want to have
56 1
43 2
42 3
32 4
We can reverse the column sequence (the general case):
df2 <- df1[ncol(df1):1]
or, for two columns:
df1[2:1]
If the OP wants to move one particular column (here the sixth) to the front:
df2 <- df1[c(6, 1:5)]
With tidyverse:
library(dplyr)
df2 <- df1 %>%
select(6, everything())
You can choose an arbitrary order if you would like.
library(tidyverse)
df %>%
select(col3,col4,col2,col1)
df <- data.frame(c1 = 1:4, c2 = c(56, 43, 42, 32))
df
# c1 c2
#1 1 56
#2 2 43
#3 3 42
#4 4 32
df[c(2,1)]
# c2 c1
#1 56 1
#2 43 2
#3 42 3
#4 32 4
You can swap by changing the locations within c (combine):
df <- data.frame(c1=1:4, c2=c(56,43,42,32), c3=c(12,13,14,15));df
# c1 c2 c3
#1 1 56 12
#2 2 43 13
#3 3 42 14
#4 4 32 15
df[c(3,1,2)]
# c3 c1 c2
#1 12 1 56
#2 13 2 43
#3 14 3 42
#4 15 4 32
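Positional indices like c(2,1) stop being correct as soon as a column is added or removed, so it can be safer to reorder by name. A minimal sketch, assuming the three-column df from the last answer (swap_cols is a hypothetical helper written for this illustration, not a base R function):

```r
df <- data.frame(c1 = 1:4, c2 = c(56, 43, 42, 32), c3 = c(12, 13, 14, 15))

# Hypothetical helper: swap two columns by name, leaving the others in place
swap_cols <- function(d, a, b) {
  nms <- names(d)
  i <- match(a, nms)
  j <- match(b, nms)
  nms[c(i, j)] <- nms[c(j, i)]
  d[nms]
}

swapped <- swap_cols(df, "c1", "c2")
names(swapped)

# Move one column to the front, keeping the rest in their original order
front <- df[c("c3", setdiff(names(df), "c3"))]
names(front)
```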

Modify data frames in list to have same # of rows

I'm trying to combine data frames (hundreds of them), but they have different numbers of rows.
df1 <- data.frame(c(7,5,3,4,5), c(43,56,23,78,89))
df2 <- data.frame(c(7,5,3,4,5,8,5), c(43,56,23,78,89,45,78))
df3 <- data.frame(c(7,5,3,4,5,8,5,6,7), c(43,56,23,78,89,45,78,56,67))
colnames(df1) <- c("xVar1","xVar2")
colnames(df2) <- c("yVar1","yVar2")
colnames(df3) <- c("zVar1","zVar2")
a1 <- list(df1,df2,df3)
a1 is what my initial data actually looks like when I get it.
Now if I do:
b1 <- as.data.frame(a1)
I get an error, because the # of rows is not the same in the data (this would work fine if the # of rows was the same).
How do I make the # of rows equal or work around this issue?
I would like to be able to merge the data in this way (here is a working example with the same # of rows):
df1b <- data.frame(c(7,5,3,4,5), c(43,56,23,78,89))
df2b <- data.frame(c(7,5,3,4,6), c(43,56,24,48,89))
df3b <- data.frame(c(7,5,3,4,5), c(43,56,23,78,89))
colnames(df1b) <- c("xVar1","xVar2")
colnames(df2b) <- c("yVar1","yVar2")
colnames(df3b) <- c("zVar1","zVar2")
a2 <- list(df1b,df2b,df3b)
b2 <- as.data.frame(a2)
Thanks!
cbind.fill from rowr provides functionality for this and fills missing elements with NA:
library(purrr)
library(rowr)
b1 <- purrr::reduce(a1,cbind.fill,fill=NA)
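If you would rather not depend on an extra package, the same padding can be sketched in base R. This relies on the fact that indexing a data frame past its last row yields rows of NA; pad_rows is a hypothetical helper name, and the data are the question's:

```r
df1 <- data.frame(xVar1 = c(7, 5, 3, 4, 5),
                  xVar2 = c(43, 56, 23, 78, 89))
df2 <- data.frame(yVar1 = c(7, 5, 3, 4, 5, 8, 5),
                  yVar2 = c(43, 56, 23, 78, 89, 45, 78))
df3 <- data.frame(zVar1 = c(7, 5, 3, 4, 5, 8, 5, 6, 7),
                  zVar2 = c(43, 56, 23, 78, 89, 45, 78, 56, 67))
a1 <- list(df1, df2, df3)

# Number of rows in the longest data frame
n_max <- max(vapply(a1, nrow, integer(1)))

# Hypothetical helper: extend a data frame to n rows, filling with NA
pad_rows <- function(d, n) {
  padded <- d[seq_len(n), , drop = FALSE]
  rownames(padded) <- NULL  # drop the "NA", "NA.1", ... row names
  padded
}

# All pieces now have n_max rows, so they can be bound column-wise
b1 <- do.call(cbind, lapply(a1, pad_rows, n = n_max))
dim(b1)
```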
One can add a key (here, the row number) to each data frame and then merge by that key.
# collect the data frames into a list (it would be better to import the data into a list directly)
list_df <- mget(ls(pattern = "df[0-9]"))
# add a new column, "key", holding the row number
list_df <- lapply(list_df, function(df, newcol) {
  df[[newcol]] <- seq(nrow(df))
  return(df)
}, "key")
# full outer merge of two data frames on "key"
MergeAllf <- function(x, y) {
  merge(x, y, by = "key", all.x = TRUE, all.y = TRUE)
}
# fold the merge function over the list
library(tidyverse)
data <- Reduce(MergeAllf, list_df) %>%
  select(key, everything()) # reorder; "key" could also be dropped
data
key xVar1 xVar2 yVar1 yVar2 zVar1 zVar2
1 1 7 43 7 43 7 43
2 2 5 56 5 56 5 56
3 3 3 23 3 23 3 23
4 4 4 78 4 78 4 78
5 5 5 89 5 89 5 89
6 6 NA NA 8 45 8 45
7 7 NA NA 5 78 5 78
8 8 NA NA NA NA 6 56
9 9 NA NA NA NA 7 67
Solution 1
You can achieve this with rbindlist(). Note that the column names will be the column names of the first data frame in the list:
library(data.table)
b1 = data.frame(rbindlist(a1))
> b1
xVar1 xVar2
1 7 43
2 5 56
3 3 23
4 4 78
5 5 89
6 7 43
7 5 56
8 3 23
9 4 78
10 5 89
11 8 45
12 5 78
13 7 43
14 5 56
15 3 23
16 4 78
17 5 89
18 8 45
19 5 78
20 6 56
21 7 67
Solution 2
Alternatively, you can give all the data frames the same column names, then bind by row:
b1 = lapply(a1, setNames, c("Var1","Var2"))
Now you can bind by rows:
b1 = do.call(dplyr::bind_rows, b1)
> b1
Var1 Var2
1 7 43
2 5 56
3 3 23
4 4 78
5 5 89
6 7 43
7 5 56
8 3 23
9 4 78
10 5 89
11 8 45
12 5 78
13 7 43
14 5 56
15 3 23
16 4 78
17 5 89
18 8 45
19 5 78
20 6 56
21 7 67
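Both solutions stack the data frames by row, so the result no longer records which data frame a row came from. If that matters, a base-R sketch along the same lines can add an identifier column before binding (the column name source is my own choice, not something from the question):

```r
df1 <- data.frame(xVar1 = c(7, 5, 3, 4, 5),
                  xVar2 = c(43, 56, 23, 78, 89))
df2 <- data.frame(yVar1 = c(7, 5, 3, 4, 5, 8, 5),
                  yVar2 = c(43, 56, 23, 78, 89, 45, 78))
df3 <- data.frame(zVar1 = c(7, 5, 3, 4, 5, 8, 5, 6, 7),
                  zVar2 = c(43, 56, 23, 78, 89, 45, 78, 56, 67))
a1 <- list(df1, df2, df3)

# Rename the columns consistently and tag each row with its list position
tagged <- Map(function(d, i) {
  d <- setNames(d, c("Var1", "Var2"))
  data.frame(source = i, d)
}, a1, seq_along(a1))

# Stack everything by row
stacked <- do.call(rbind, tagged)
rownames(stacked) <- NULL

# Rows contributed by each original data frame
table(stacked$source)
```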

How to reshape a vector with comma separated records into a longitudinal dataframe? [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 4 years ago.
This is the dataframe I have:
data_frame(id= c(1,2,3),
x=c('19,22,77', '49,67', '28,19,45,23'),
y=c('19,22,77', '49,67', '28,19,45,23'),
t=c('10,20,30', '49,67', '28,19,45,23'))
The comma-separated values are different observations over time for the same id, so I would like to strsplit on the comma and reshape into a longitudinal format while preserving the association with id.
For example, the output for the id=1 only should be:
# A tibble: 3 × 4
id x y t
<dbl> <dbl> <dbl> <dbl>
1 1 19 19 10
2 1 22 22 20
3 1 77 77 30
In addition, you can use tidyr:
library(tidyr)
separate_rows(df,x,y,t, sep = ",")
Here is one method with data.table.
library(data.table)
setDT(df)[, lapply(.SD, tstrsplit, split=","), by=id]
id x y t
1: 1 19 19 10
2: 1 22 22 20
3: 1 77 77 30
4: 2 49 49 49
5: 2 67 67 67
6: 3 28 28 28
7: 3 19 19 19
8: 3 45 45 45
9: 3 23 23 23
For each id, we lapply through the variables and apply tstrsplit (transpose string split), splitting on the comma.
data
df <- data.frame(id= c(1,2,3),
x=c('19,22,77', '49,67', '28,19,45,23'),
y=c('19,22,77', '49,67', '28,19,45,23'),
t=c('10,20,30', '49,67', '28,19,45,23'))
An alternative solution using the hadleyverse:
library(magrittr)
dplyr::data_frame(id= c(1,2,3),
x=c('19,22,77', '49,67', '28,19,45,23'),
y=c('19,22,77', '49,67', '28,19,45,23'),
t=c('10,20,30', '49,67', '28,19,45,23')) %>%
dplyr::mutate_if(is.character, stringr::str_split, pattern=',') %>%
tidyr::unnest()
# A tibble: 9 × 4
id x y t
<dbl> <chr> <chr> <chr>
1 1 19 19 10
2 1 22 22 20
3 1 77 77 30
4 2 49 49 49
5 2 67 67 67
6 3 28 28 28
7 3 19 19 19
8 3 45 45 45
9 3 23 23 23
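For comparison, the same reshaping can be sketched in base R with strsplit and rep. This assumes, as in the question's data, that x, y and t contain the same number of comma-separated pieces in every row; the names parts, n_obs and long are my own:

```r
df <- data.frame(id = c(1, 2, 3),
                 x = c("19,22,77", "49,67", "28,19,45,23"),
                 y = c("19,22,77", "49,67", "28,19,45,23"),
                 t = c("10,20,30", "49,67", "28,19,45,23"),
                 stringsAsFactors = FALSE)

# Split every character column on the comma: one list of pieces per column
parts <- lapply(df[c("x", "y", "t")], strsplit, split = ",", fixed = TRUE)

# Number of observations per id (assumed equal across x, y and t)
n_obs <- lengths(parts$x)

# Repeat each id once per observation and flatten the pieces to numeric
long <- data.frame(
  id = rep(df$id, times = n_obs),
  lapply(parts, function(p) as.numeric(unlist(p)))
)
long
```

Unlike the unnest() output above, the columns come out as numeric rather than character because of the explicit as.numeric.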

How to perform complex multicolumn match in R

I wish to match two data frames based on conditions on more than one column, but cannot figure out how. So if these are my data sets:
df1 <- data.frame(lower=c(0,5,10,15,20), upper=c(4,9,14,19,24), x=c(12,45,67,89,10))
df2 <- data.frame(age=c(12, 14, 5, 2, 9, 19, 22, 18, 23))
I wish to match each age in df2 to the range between lower and upper in df1 that contains it, with the aim of adding an extra column to df2 containing the value of x from that row of df1. That is, I want df2 to look like:
age x
12 67
14 67
5 45
....etc.
How can I achieve such a match ?
I would go with a simple sapply and an "and"-ed condition in the df1$x selection, like this:
df2$x <- sapply( df2$age, function(x) { df1$x[ x >= df1$lower & x <= df1$upper ] })
which gives:
> df2
age x
1 12 67
2 14 67
3 5 45
4 2 12
5 9 45
6 19 89
7 22 10
8 18 89
9 23 10
For age 12 for example the selection inside the brackets gives:
> 12 >= df1$lower & 12 <= df1$upper
[1] FALSE FALSE TRUE FALSE FALSE
So getting df1$x with this logical vector is easy, as your ranges don't overlap.
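One thing to watch with the sapply approach: if some age falls in no range, the selection has length zero for that age and sapply no longer returns a plain numeric vector. A hedged variant using vapply with an explicit NA fallback could look like this (lookup_x is a hypothetical helper, and the age 30 is added here only to trigger the no-match case):

```r
df1 <- data.frame(lower = c(0, 5, 10, 15, 20),
                  upper = c(4, 9, 14, 19, 24),
                  x = c(12, 45, 67, 89, 10))
# 30 is an extra, hypothetical age that lies outside every range
df2 <- data.frame(age = c(12, 14, 5, 2, 9, 19, 22, 18, 23, 30))

# Hypothetical helper: x for the single matching range, otherwise NA
lookup_x <- function(age) {
  hit <- df1$x[age >= df1$lower & age <= df1$upper]
  if (length(hit) == 1) hit else NA_real_
}

# vapply guarantees one numeric value per age
df2$x <- vapply(df2$age, lookup_x, numeric(1))
df2
```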
Using foverlaps from data.table is what you are looking for:
library(data.table)
setDT(df1)
setDT(df2)[,age2:=age]
setkey(df1,lower,upper)
foverlaps(df2, df1, by.x = names(df2),by.y=c("lower","upper"))[,list(age,x)]
# age x
# 1: 12 67
# 2: 14 67
# 3: 5 45
# 4: 2 12
# 5: 9 45
# 6: 19 89
# 7: 22 10
# 8: 18 89
# 9: 23 10
Here's another vectorized approach using findInterval on a melted data set
library(data.table)
df2$x <- melt(setDT(df1), "x")[order(value), x[findInterval(df2$age, value)]]
# age x
# 1 12 67
# 2 14 67
# 3 5 45
# 4 2 12
# 5 9 45
# 6 19 89
# 7 22 10
# 8 18 89
# 9 23 10
The idea here is to:
1. First, tidy up your data so that lower and upper are in the same column and x holds the corresponding values for that new column.
2. Then, sort the data according to these ranges (necessary for findInterval).
3. Finally, run findInterval on the ages and use the result to index the x column in order to pick out the matching values.
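When the ranges are contiguous and already sorted, as they are in the question's df1, the melting step can arguably be skipped and findInterval run on the lower bounds alone. A minimal base-R sketch under that assumption (it would assign an age lying in a gap between two ranges to the lower one, and it does not handle ages below the first lower bound):

```r
df1 <- data.frame(lower = c(0, 5, 10, 15, 20),
                  upper = c(4, 9, 14, 19, 24),
                  x = c(12, 45, 67, 89, 10))
df2 <- data.frame(age = c(12, 14, 5, 2, 9, 19, 22, 18, 23))

# For each age, the position of the last lower bound that is <= age;
# df1$lower must be sorted in increasing order for findInterval()
idx <- findInterval(df2$age, df1$lower)

# Use those positions to look up x
df2$x <- df1$x[idx]
df2
```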
And here's a possible dplyr/tidyr version
library(tidyr)
library(dplyr)
df1 %>%
gather(variable, value, -x) %>%
arrange(value) %>%
do(data.frame(x = .$x[findInterval(df2$age, .$value)])) %>%
cbind(df2, .)
# age x
# 1 12 67
# 2 14 67
# 3 5 45
# 4 2 12
# 5 9 45
# 6 19 89
# 7 22 10
# 8 18 89
# 9 23 10

Forcing unique values before casting (pivoting) in R

I have a data frame as follows
Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30
I’d like to re-cast it with a single row for each Identifier and one column for each value in the current location column. I don’t care about the data in V1 but I need the data in V2 and these will become the values in the new columns.
Note that for the Location column there are repeated values for Identifiers 2 and 3.
I ASSUME that the first task is to make the values in the Location column unique.
I used the following (the data frame is called “Test”)
L<-length(Test$Identifier)
for (i in 1:L)
{
temp<-Test$Location[Test$Identifier==i]
temp1<-make.unique(as.character(temp), sep="-")
levels(Test$Location)=c(levels(Test$Location),temp1)
Test$Location[Test$Identifier==i]=temp1
}
This produces
Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B-1 23
3 43 A 10
3 43 B 17
3 43 A-1 18
3 43 B-1 20
3 43 C 25
3 43 A-2 30
Then using
cast(Test, Identifier ~ Location)
gives
Identifier A B C B-1 A-1 A-2
1 21 24 NA NA NA NA
2 NA 15 18 23 NA NA
3 10 17 25 20 18 30
And this is more or less what I want.
My questions are:
1. Is this the right way to handle the problem?
2. I know R people don’t use the “for” construction, so is there a more R-elegant (relegant?) way to do this? I should mention that the real data set has over 160,000 rows and starts with over 50 unique values in the Location vector, and the function takes just over an hour to run. Anything quicker would be good. I should also mention that the cast function had to be run on 20-30k rows of the output at a time despite increasing the memory limit. All the cast outputs were then merged.
3. Is there a way to sort the columns in the output so that (here) they are A, A-1, A-2, B, B-1, C?
Please be gentle with your replies!
Usually your original format is much better than your desired result. However, you can do this easily using the split-apply-combine approach, e.g., with package plyr:
DF <- read.table(text="Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30", header=TRUE, stringsAsFactors=FALSE)
#note that I make sure that there are only characters and not factors
#use as.character if you have factors
library(plyr)
DF <- ddply(DF, .(Identifier), transform, Loc2 = make.unique(Location, sep="-"))
library(reshape2)
DFwide <- dcast(DF, Identifier ~Loc2, value.var="V2")
# Identifier A B B-1 C A-1 A-2
#1 1 21 24 NA NA NA NA
#2 2 NA 15 23 18 NA NA
#3 3 10 17 20 25 18 30
If column order is important to you (usually it isn't):
DFwide[, c(1, order(names(DFwide)[-1])+1)]
# Identifier A A-1 A-2 B B-1 C
#1 1 21 NA NA 24 NA NA
#2 2 NA NA NA 15 23 18
#3 3 10 18 30 17 20 25
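To make the split-apply-combine idea concrete without any packages, here is a small base-R sketch of just the make.unique step, using the Identifier and Location values from the question (the intermediate name pieces is my own):

```r
Identifier <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3)
Location <- c("A", "B", "B", "C", "B", "A", "B", "A", "B", "C", "A")

# split: one character vector of locations per identifier
pieces <- split(Location, Identifier)

# apply: de-duplicate the locations within each piece
pieces <- lapply(pieces, make.unique, sep = "-")

# combine: put the pieces back in the original row order
Loc2 <- unsplit(pieces, Identifier)
Loc2
```

Because make.unique runs once per identifier rather than once per row, this avoids the repeated subsetting of the original loop.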
For reference, here's the equivalent of @Roland's answer in base R.
Use ave to create the unique "Location" columns....
DF$Location <- with(DF, ave(Location, Identifier,
FUN = function(x) make.unique(x, sep = "-")))
... and reshape to change the structure of your data.
## If you want both V1 and V2 in your "wide" dataset
## "dcast" can't directly do this--you'll need `recast` if you
## wanted both columns, which first `melt`s and then `dcast`s....
reshape(DF, direction = "wide", idvar = "Identifier", timevar = "Location")
## If you only want V2, as you indicate in your question
reshape(DF, direction = "wide", idvar = "Identifier",
timevar = "Location", drop = "V1")
# Identifier V2.A V2.B V2.C V2.B-1 V2.A-1 V2.A-2
# 1 1 21 24 NA NA NA NA
# 3 2 NA 15 18 23 NA NA
# 6 3 10 17 25 20 18 30
Reordering the columns can be done the same way that @Roland suggested.
