How to convert a data frame in R? [duplicate] - r

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 2 years ago.
I've got an R data frame in the form of
my.data1 = data.frame(sex = c("m", "f"),
A = c(1, 2),
B = c(3, 4))
However, I'd like my data to be in the form of
my.data2 = data.frame(value = c(1, 2, 3, 4),
group = c("A", "A", "B", "B"),
sex = c("m", "f", "m", "f"))
So basically, I wanna turn some former columns ("A" and "B") into table cells under the new column "group" and simultaneously collect all former table cells under one new column "value".
What is the easiest way to convert the data accordingly?
Thanks in advance!

You can get close to what you want with melt() from the reshape2 package. Declare sex as the id variable and the remaining columns are stacked into variable/value pairs:
library(reshape2)
#Data
my.data1 = data.frame(sex = c("m", "f"),
A = c(1, 2),
B = c(3, 4))
#Reshape
my.data2 <- melt(my.data1,id.vars = 'sex')
Output:
sex variable value
1 m A 1
2 f A 2
3 m B 3
4 f B 4
Alternatively, you can use a tidyverse approach with pivot_longer(). Its cols argument selects the columns to pivot, so cols = -sex pivots everything except sex, which is kept as the id column:
library(tidyverse)
my.data1 %>% pivot_longer(cols = -sex)
Output:
# A tibble: 4 x 3
sex name value
<fct> <chr> <dbl>
1 m A 1
2 m B 3
3 f A 2
4 f B 4
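If the exact column names and row order from the question matter, one possible sketch is to name the new columns inside pivot_longer() and then sort. The names_to and values_to arguments belong to tidyr; the final reordering step is only an assumption about the layout the asker wants (it mirrors my.data2 from the question).

```r
library(tidyr)

my.data1 <- data.frame(sex = c("m", "f"),
                       A = c(1, 2),
                       B = c(3, 4),
                       stringsAsFactors = FALSE)

# Pivot A and B: the old column names go to "group", the cells to "value"
long <- pivot_longer(my.data1, cols = c(A, B),
                     names_to = "group", values_to = "value")

# Sort by group and put the columns in the order used in the question
my.data2 <- as.data.frame(long[order(long$group), c("value", "group", "sex")])
my.data2
```

Sorting by group keeps the original m/f order within each group, because order() leaves ties in their existing order.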

Related

tidyverse alternative to left_join & rows_update when two data frames differ in columns and rows

There might be a *_join version for this I'm missing here, but I have two data frames, where
1. The merging should happen in the first data frame, hence left_join
2. I not only want to add columns, but also update existing columns in the first data frame, more specifically: replace NA's in the first data frame by values in the second data frame
3. The second data frame contains more rows than the first one.
Condition #1 and #2 make left_join fail. Condition #3 makes rows_update fail. So I need to do some steps in between and am wondering if there's an easier solution to get the desired output.
x <- data.frame(id = c(1, 2, 3),
a = c("A", "B", NA))
id a
1 1 A
2 2 B
3 3 <NA>
y <- data.frame(id = c(1, 2, 3, 4),
a = c("A", "B", "C", "D"),
q = c("u", "v", "w", "x"))
id a q
1 1 A u
2 2 B v
3 3 C w
4 4 D x
and the desired output would be:
id a q
1 1 A u
2 2 B v
3 3 C w
I know I can achieve this with the following code, but it looks unnecessarily complicated to me. So is there maybe a more direct approach without having to do the intermediate pipes in the two commands below?
library(tidyverse)
x %>%
left_join(., y %>% select(id, q), by = c("id")) %>%
rows_update(., y %>% filter(id %in% x$id), by = "id")
You can left_join and use coalesce to replace missing values.
library(dplyr)
x %>%
left_join(y, by = 'id') %>%
transmute(id, a = coalesce(a.x, a.y), q)
# id a q
#1 1 A u
#2 2 B v
#3 3 C w
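For comparison, the same join-then-fill idea can be sketched in base R without dplyr. This is only an illustration of the technique, not part of the answer above: merge() with all.x = TRUE plays the role of left_join(), and ifelse() on is.na() stands in for coalesce(). The intermediate name m is hypothetical.

```r
x <- data.frame(id = c(1, 2, 3),
                a = c("A", "B", NA),
                stringsAsFactors = FALSE)
y <- data.frame(id = c(1, 2, 3, 4),
                a = c("A", "B", "C", "D"),
                q = c("u", "v", "w", "x"),
                stringsAsFactors = FALSE)

# Left join on id; the clashing column a becomes a.x (from x) and a.y (from y)
m <- merge(x, y, by = "id", all.x = TRUE, suffixes = c(".x", ".y"))

# Keep the value from x unless it is NA, in which case take the one from y
m$a <- ifelse(is.na(m$a.x), m$a.y, m$a.x)

res <- m[, c("id", "a", "q")]
res
```

stringsAsFactors = FALSE is set explicitly so that ifelse() works on character values in older R versions as well.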

R merge rows with same row names and column name [duplicate]

This question already has answers here:
Transpose / reshape dataframe without "timevar" from long to wide format
(9 answers)
Closed 2 years ago.
I have the data like
How can I reshape the data by merging the rows with the same row name and column name, like this:
I assume the allele information is missing from your data.
If you add it to the data as follows:
data['allele']=c('a1','a2','a1','a2')
then the following solves the problem:
Basically, go from wide to long, unite the snp and allele columns, and then go wide again.
library(tidyr)
long=data %>% gather(snp, value, -c(Pedigree,allele))
long_joined=unite(long, snp, c(snp, allele), remove=TRUE)
spread(long_joined, key = snp, value = value)
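In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(), and the three steps above can likely be collapsed into a single pivot_wider() call. The sketch below assumes a data frame shaped like the sample data used elsewhere in this thread (the original data was not shown), with the allele column already added.

```r
library(tidyr)

# Assumed input: modeled on the sample data from another answer, plus allele
data <- data.frame(
  Pedigree = c("Individual 1", "Individual 1", "Individual 2", "Individual 2"),
  SNP1 = c("C", "C", "C", "C"),
  SNP2 = c("G", "G", "T", "T"),
  allele = c("a1", "a2", "a1", "a2"),
  stringsAsFactors = FALSE
)

# One row per Pedigree; each SNP column is split by allele
wide <- pivot_wider(data, names_from = allele, values_from = c(SNP1, SNP2))
wide
```

With several values_from columns, pivot_wider() builds the new names as value column, underscore, allele (for example SNP1_a1).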
Maybe you can try aggregate with unlist:
> aggregate(.~P,df,unlist)
P S1.1 S1.2 S2.1 S2.2
1 a C C G G
2 b C C T T
Data
> dput(df)
structure(list(P = c("a", "a", "b", "b"), S1 = c("C", "C", "C",
"C"), S2 = c("G", "G", "T", "T")), class = "data.frame", row.names = c(NA,
-4L))
A solution using dplyr and tidyr, both part of the tidyverse collection of R packages.
library(dplyr)
library(tidyr) # pivot_wider() comes from tidyr
Data:
bar <- "Pedigree SNP1 SNP2
'Individual 1' C G
'Individual 1' C G
'Individual 2' C T
'Individual 2' C T"
foo <- read.table(text=bar, header = TRUE)
Code:
foo %>%
group_by(Pedigree) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = SNP1:SNP2, names_prefix = ".a")
Output:
#> # A tibble: 2 x 5
#> # Groups: Pedigree [2]
#> Pedigree SNP1_.a1 SNP1_.a2 SNP2_.a1 SNP2_.a2
#> <fct> <fct> <fct> <fct> <fct>
#> 1 Individual 1 C C G G
#> 2 Individual 2 C C T T
Created on 2020-07-26 by the reprex package (v0.3.0)

stacking rows as columns in R [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 3 years ago.
I'm trying to stack rows of data into columns so that the variables in another column will repeat. I would like to turn something like this
tib <- tribble(~x, ~y, ~z, "a", 1,2, "b", 3,4)
> tib
# A tibble: 2 x 3
x y z
<chr> <dbl> <dbl>
1 a 1 2
2 b 3 4
into
t <- tribble(~X, ~Y, "a", 1, "a", 2, "b", 3, "b", 4)
> t
# A tibble: 4 x 2
X Y
<chr> <dbl>
1 a 1
2 a 2
3 b 3
4 b 4
Thanks for your help and sorry if I've missed this solution somewhere. I did a search, and tried applying gather(), spread(), but couldn't get it to work out.
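Here is an example using data.table::melt():
# Assuming your data is a data.frame
xyz <- data.frame(
x = c("a", "b"),
y = c(1, 3),
z = c(2, 4)
)
library(data.table)
# data.table's melt() method works on data.tables, so convert first;
# the leading comma makes c(1, 3) select columns rather than rows
melt(as.data.table(xyz), id.vars = "x")[, c(1, 3)]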
x value
1 a 1
2 b 3
3 a 2
4 b 4
This can be done with many packages. One possibility is the gather() function from tidyr.
EDIT
Using @sindri_baldur's data:
library(tidyr)
xyz %>%
gather(class, measurement, -x)
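Since gather() is superseded in current tidyr, a pivot_longer() version may be preferable. The sketch below also renames the columns to X and Y to match the target tibble in the question; that final renaming step is an assumption about the desired result, and the name long is hypothetical.

```r
library(tidyr)

xyz <- data.frame(
  x = c("a", "b"),
  y = c(1, 3),
  z = c(2, 4),
  stringsAsFactors = FALSE
)

# pivot_longer() works row by row, so both values for "a" come before "b"
long <- pivot_longer(xyz, cols = c(y, z), names_to = "name", values_to = "Y")

# Drop the helper column and rename x to X
res <- data.frame(X = long$x, Y = long$Y, stringsAsFactors = FALSE)
res
```

Unlike melt() and gather(), which stack one column after another, this keeps the rows of each x value together, which is the order the question asks for.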

Find out the row with different value with in same name [duplicate]

This question already has answers here:
How to remove rows that have only 1 combination for a given ID
(4 answers)
Selecting & grouping dual-category data from a data frame
(4 answers)
Closed 5 years ago.
I have a df that looks like
df <- data.frame(Name = c("A", "A","A","B", "B", "C", "D", "E", "E"),
Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
Basically, A is 1, B is 2, C is 3 and so on.
However, as you can see, B has both "2" and "15". "15" is the wrong value and should not be there.
I would like to find the rows where Value does not match within the same Name.
The ideal output would look like
B 2
B 15
You can use tidyverse functions like:
library(dplyr)
df %>%
group_by(Name, Value) %>%
unique()
giving:
Name Value
1 A 1
2 B 2
3 B 15
4 C 3
5 D 4
6 E 5
then, to keep only the Names with more than one distinct Value, count rows per Name on the deduplicated data rather than on the original df:
df %>%
distinct(Name, Value) %>%
group_by(Name) %>%
filter(n() > 1)
Something like this? This searches for Names that are associated to more than 1 value and outputs one copy of each pair {Name - Value}.
df <- data.frame(Name = c("A", "A","A","B", "B", "C", "D", "E", "E"),
Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
res <- do.call(rbind, lapply(unique(df$Name), (function(i){
if (length(unique(df[df$Name == i,]$Value)) > 1 ) {
out <- df[df$Name == i,]
out[!duplicated(out$Value), ]
}
})))
res
Result as expected
Name Value
4 B 2
5 B 15
Filter(function(x)nrow(unique(x))!=1,split(df,df$Name))
$B
Name Value
4 B 2
5 B 15
Or:
Reduce(rbind,by(df,df$Name,function(x) if(nrow(unique(x))>1) x))
Name Value
4 B 2
5 B 15
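Another way to express the same check, sketched here with dplyr, is to count distinct values per group with n_distinct() and keep only groups where that count exceeds one. The name mismatched is hypothetical. Note that, unlike the approaches above, this keeps every row of an offending group, including repeated Name/Value pairs if there are any.

```r
library(dplyr)

df <- data.frame(Name = c("A", "A", "A", "B", "B", "C", "D", "E", "E"),
                 Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5),
                 stringsAsFactors = FALSE)

# Keep only the groups whose Value column holds more than one distinct value
mismatched <- df %>%
  group_by(Name) %>%
  filter(n_distinct(Value) > 1) %>%
  ungroup()

mismatched
```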

Select minimum data of grouped data - keeping all columns [duplicate]

This question already has an answer here:
R: Uniques (or dplyr distinct) + most recent date
(1 answer)
Closed 7 years ago.
I am running into a wall here.
I have a dataframe, many rows.
Here is schematic example.
#myDf
ID c1 c2 myDate
A 1 1 01.01.2015
A 2 2 02.02.2014
A 3 3 03.01.2014
B 4 4 09.09.2009
B 5 5 10.10.2010
C 6 6 06.06.2011
....
I need to group my dataframe by ID, then select the row with the oldest date, and write the output into a new dataframe, keeping all columns.
ID c1 c2 myDate
A 3 3 03.01.2014
B 4 4 09.09.2009
C 6 6 06.06.2011
....
That is how I approach it:
test <- myDf %>%
group_by(ID) %>%
mutate(date == as.Date(myDate, format = "%d.%m.%Y")) %>%
filter(date == min(b2))
To verify: the number of rows of my resulting dataframe should equal the number of unique IDs.
unique(myDf$ID) %>% length == nrow(test)
FALSE
Does not work. I tried this:
newDf <- ddply(.data = myDf,
.variables = "ID",
.fun = function(piece){
take.this.row <- piece$myDate %>% as.Date(format="%d.%m.%Y") %>% which.min
piece[take.this.row,]
})
That does run forever. I terminated it.
Why is the first approach not working and what would be a good way to approach the problem?
Since you have a fairly large dataset, data.table may be a better fit. Here is a data.table version that solves your problem; it is likely to be faster than the dplyr approach on large data:
library(data.table)
df <- data.table(ID=c("A","A","A","B","B","C"),c1=1:6,c2=1:6,
myDate=c("01.01.2015","02.02.2014",
"03.01.2014","09.09.2009","10.10.2010","06.06.2011"))
df[,myDate:=as.Date(myDate, '%d.%m.%Y')]
df_new <- df[ df[, .I[myDate == min(myDate)], by=ID]$V1 ]
df_new
ID c1 c2 myDate
1: A 3 3 2014-01-03
2: B 4 4 2009-09-09
3: C 6 6 2011-06-06
PS: you can use setDT(myDf) to convert a data.frame to a data.table.
After grouping by 'ID', we can use which.min to get the index of 'myDate' (after converting to Date class), and we extract the rows with slice.
library(dplyr)
df1 %>%
group_by(ID) %>%
slice(which.min(as.Date(myDate, '%d.%m.%Y')))
# ID c1 c2 myDate
# (chr) (int) (int) (chr)
#1 A 3 3 03.01.2014
#2 B 4 4 09.09.2009
#3 C 6 6 06.06.2011
data
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6,
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
"09.09.2009", "10.10.2010", "06.06.2011")), .Names = c("ID",
"c1", "c2", "myDate"), class = "data.frame", row.names = c(NA,
-6L))
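As for why the first approach in the question does not work as written: inside mutate(), == is a comparison, not an assignment, so no date column is created, and the filter() step refers to b2, which does not appear anywhere in the code shown. A minimal corrected sketch, assuming the sample data above stands in for myDf, could look like this:

```r
library(dplyr)

# Assumed stand-in for the asker's myDf, built from the sample rows
myDf <- data.frame(
  ID = c("A", "A", "A", "B", "B", "C"),
  c1 = 1:6,
  c2 = 1:6,
  myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
             "09.09.2009", "10.10.2010", "06.06.2011"),
  stringsAsFactors = FALSE
)

test <- myDf %>%
  mutate(date = as.Date(myDate, format = "%d.%m.%Y")) %>%  # = assigns, == compares
  group_by(ID) %>%
  filter(date == min(date)) %>%                            # compare within each ID
  ungroup()

# One row per ID, as long as no ID has a tie for its oldest date
nrow(test) == length(unique(myDf$ID))
```

If two rows of the same ID share the oldest date, filter() keeps both, whereas slice(which.min(...)) keeps only the first.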
If you wanted to just use the base functions you can also go with the aggregate and merge functions.
# data (from response above)
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6,
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
"09.09.2009", "10.10.2010", "06.06.2011")),
.Names = c("ID","c1", "c2", "myDate"),
class = "data.frame", row.names = c(NA,-6L))
# convert your date column to POSIXct object
df1$myDate = as.POSIXct(df1$myDate,format="%d.%m.%Y")
# Use the aggregate function to look for the minimum dates by group.
# In this case our variable of interest in the myDate column and the
# group to sort by is the "ID" column.
# The function will sort out the minimum date and create a new data frame
# with names "myDate" and "ID"
df2 = aggregate(list(myDate = df1$myDate),list(ID = df1$ID),
function(x){x[which(x == min(x))]})
df2
# Use the merge function to merge your original data frame with the
# data from the aggregate function
merge(df1,df2)
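A shorter base R variant, offered only as a sketch, avoids the aggregate/merge round trip: sort by ID and date, then keep the first row of each ID with !duplicated(). The names d, sorted and oldest are hypothetical.

```r
df1 <- data.frame(
  ID = c("A", "A", "A", "B", "B", "C"),
  c1 = 1:6,
  c2 = 1:6,
  myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
             "09.09.2009", "10.10.2010", "06.06.2011"),
  stringsAsFactors = FALSE
)

# Parse the dates once; the original character column is left untouched
d <- as.Date(df1$myDate, format = "%d.%m.%Y")

# Sort by ID, then by date, so the oldest row of each ID comes first
sorted <- df1[order(df1$ID, d), ]

# Keep only the first row seen for each ID
oldest <- sorted[!duplicated(sorted$ID), ]
oldest
```

This always returns exactly one row per ID, even when the oldest date is tied within a group.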

Resources