How can I melt a data frame row by row?
I found a really similar question on the forum, but I still can't solve my problem because I don't have a distinct id variable (V1 is the same in every row).
This is my data set:
V1 V2 V3 V4 V5
51 20 29 12 20
51 22 51 NA NA
51 14 NA NA NA
51 75 NA NA NA
And I want to melt it into:
V1 variable value
51 V2 20
51 V3 29
51 V4 12
51 V5 20
51 V2 22
51 V3 51
51 V2 14
51 V2 75
Currently my approach is to melt it row by row in a for loop and then rbind the pieces together.
library(reshape)
df <- read.table(text = "V1 V2 V3 V4 V5
51 20 29 12 20
51 22 51 NA NA
51 14 NA NA NA
51 75 NA NA NA", header = TRUE)
dfall <- NULL
for (i in 1:NROW(df)) {
  dfmelt <- melt(df[i, ], id = "V1", na.rm = TRUE)
  dfall <- rbind(dfall, dfmelt)
}
Just wondering if there is any way to do this faster? Thanks!
We can build this in base R (df1 is the OP's df): transposing the dataset without the first column flattens the values row by row; row() and col() of that transposed block then give, for each value, its original column (which indexes the variable names) and its original row (which indexes 'V1'), and na.omit() drops the missing entries.
na.omit(data.frame(V1 = df1[, 1][col(t(df1[-1]))],
                   variable = names(df1)[-1][row(t(df1[-1]))],
                   value = c(t(df1[-1]))))
# V1 variable value
#1 51 V2 20
#2 51 V3 29
#3 51 V4 12
#4 51 V5 20
#5 51 V2 22
#6 51 V3 51
#9 51 V2 14
#13 51 V2 75
NOTE: No additional packages used.
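To see why the indexing lines up, here is a small sketch of how each piece flattens (m is just an illustrative name for the value columns):
m <- as.matrix(df1[-1])  # the value columns V2:V5
c(t(m))       # the values flattened row by row: row 1's V2..V5, then row 2's, ...
c(col(t(m)))  # 1 1 1 1 2 2 2 2 ... -> the original row of each value (indexes V1)
c(row(t(m)))  # 1 2 3 4 1 2 3 4 ... -> the original column of each value (indexes names(df1)[-1])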
Or we can use gather (from tidyr) to convert from 'wide' to 'long' format after creating a row id column (add_rownames from dplyr, since superseded by tibble::rownames_to_column) and then arranging the rows.
library(dplyr)
library(tidyr)
add_rownames(df1) %>%
  gather(variable, value, V2:V5, na.rm = TRUE) %>%
  arrange(rowname, V1) %>%
  select(-rowname)
# V1 variable value
# (int) (chr) (int)
#1 51 V2 20
#2 51 V3 29
#3 51 V4 12
#4 51 V5 20
#5 51 V2 22
#6 51 V3 51
#7 51 V2 14
#8 51 V2 75
Or with data.table
library(data.table)
melt(setDT(df1, keep.rownames = TRUE),
     id.var = c("rn", "V1"), na.rm = TRUE)[
       order(rn, V1)][, rn := NULL][]
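Note that setDT() converts df1 to a data.table by reference, so df1 itself is modified in place; the ordered result should match the eight rows shown for the tidyr approach above.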
You can make a column with a unique ID for each row so that you can sort on it after melting. Using reshape2 with dplyr:
library(reshape2)
library(dplyr)
df %>% mutate(id = seq_len(n())) %>%
  melt(id.var = c('V1', 'id'), na.rm = TRUE) %>%
  arrange(V1, id, variable) %>%
  select(-id)
# V1 variable value
# 1 51 V2 20
# 2 51 V3 29
# 3 51 V4 12
# 4 51 V5 20
# 5 51 V2 22
# 6 51 V3 51
# 7 51 V2 14
# 8 51 V2 75
...or with reshape2 and base R:
library(reshape2)
df$id <- seq_along(df$V1)
df2 <- melt(df, id.var = c('V1', 'id'), na.rm = TRUE)
df2[order(df2$V1, df2$id, df2$variable),-2]
I would like to sum the values of Var1 and Var2 for each row and produce a new column titled Vars which gives the total of Var1 and Var2. I would then like to do the same for Col1 and Col2 and have a new column titled Cols which gives the sum of Col1 and Col2. How do I write the code for this? Thanks in advance.
df
ID Var1 Var2 Col1 Col2
1 34 22 34 24
2 3 25 54 65
3 87 68 14 78
4 66 98 98 100
5 55 13 77 2
Expected outcome would be the following:
df
ID Var1 Var2 Col1 Col2 Vars Cols
1 34 22 34 24 56 58
2 3 25 54 65 28 119
3 87 68 14 78 155 92
4 66 98 98 100 164 198
5 55 13 77 2 68 79
Assuming that column ID is irrelevant (no groups) and you are happy to specify column names (solution hard-coded, not generic).
A base R solution:
df$Vars <- rowSums(df[, c("Var1", "Var2")])
df$Cols <- rowSums(df[, c("Col1", "Col2")])
A tidyverse solution:
library(dplyr)
library(purrr)
df %>% mutate(Vars = map2_int(Var1, Var2, sum),
              Cols = map2_int(Col1, Col2, sum))
# or just
df %>% mutate(Vars = Var1 + Var2,
              Cols = Col1 + Col2)
There are many different ways to do this. With dplyr:
library(dplyr)
df = df %>%              # input data frame
  group_by(ID) %>%       # do it for every ID, so every row
  mutate(                # add columns to the data frame
    Vars = Var1 + Var2,  # do the calculation
    Cols = Col1 + Col2
  )
But there are many other ways, e.g. with apply-type functions (see the sketch below). I suggest reading about the tidyverse.
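For instance, a minimal base R sketch with apply over rows (rowSums, as above, is usually faster; this just shows the apply pattern):
# apply over rows (MARGIN = 1), summing the selected columns per row
df$Vars <- apply(df[, c("Var1", "Var2")], 1, sum)
df$Cols <- apply(df[, c("Col1", "Col2")], 1, sum)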
Another dplyr way is to use the helper function starts_with to select the columns and then rowSums to sum them.
library(dplyr)
df$Vars <- df %>% select(starts_with("Var")) %>% rowSums()
df$Cols <- df %>% select(starts_with("Col")) %>% rowSums()
df
# ID Var1 Var2 Col1 Col2 Vars Cols
#1 1 34 22 34 24 56 58
#2 2 3 25 54 65 28 119
#3 3 87 68 14 78 155 92
#4 4 66 98 98 100 164 198
#5 5 55 13 77 2 68 79
A base R solution using gsub that sums all columns sharing the same name once the trailing digits are stripped:
tt <- paste0(gsub('[[:digit:]]+', '', names(df)[-1]),"s")
df <- cbind(df, sapply(unique(tt), function(x) {rowSums(df[grep(x, tt)+1])}))
df
# ID Var1 Var2 Col1 Col2 Vars Cols
#1 1 34 22 34 24 56 58
#2 2 3 25 54 65 28 119
#3 3 87 68 14 78 155 92
#4 4 66 98 98 100 164 198
#5 5 55 13 77 2 68 79
Or an even more general solution:
idx <- grep('[[:digit:]]', names(df))
tt <- paste0(gsub('[[:digit:]]+', '', names(df)[idx]),"s")
df <- cbind(df, sapply(unique(tt), function(x) {rowSums(df[idx[grep(x, tt)]])}))
This is my dataframe:
set.seed(1)
df <- data.frame(A = 1:50, B = 11:60, c = 21:70)
head(df)
df.final <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
I want to delete the columns whose last 5 values contain NA. That is, only the columns that have values in rows 46 to 50 should remain; any column with one or more NAs among its last 5 values will be deleted.
Is it possible to do this with dplyr?
Any help is appreciated.
dplyr::select() accepts integer column positions. We can use that to achieve this:
result <- df.final %>% select(., which(!is.na(colSums(tail(., 5)))))
head(result)
A B
1 1 11
2 2 NA
3 3 13
4 NA 14
5 5 15
6 NA 16
Shree beat me to it, but it might come in handy:
> df.final %>% tail
A B c
45 45 55 65
46 46 NA 66
47 47 57 67
48 NA 58 68
49 NA 59 69
50 NA 60 NA
> df.final %>%
+ select_if(~ !any(is.na(tail(., n = 1)))) %>%
+ tail()
B
45 55
46 NA
47 57
48 58
49 59
50 60
Just change n above to however many trailing rows you want to check; a sketch for the OP's last-5-rows requirement follows below.
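The same pattern for the last 5 rows (with this particular data it may drop every column, since each column can have an NA somewhere in rows 46 to 50):
df.final %>%
  select_if(~ !any(is.na(tail(., n = 5))))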
I have a dataset with 8 variables; when I run the dplyr code below, my output data frame only has the variables used in the code, while I want all the variables.
ShowID <- MyData %>%
  group_by(ID) %>%
  summarize(count = n()) %>%
  filter(count == min(count))
ShowID
So my output only has two variables, ID and count. How do I get the rest of my variables into the new data frame? Why is this happening, and what am I missing here?
> ncol(ShowID)
[1] 2
> ncol(MyData)
[1] 8
MYDATA
key ID v1 v2 v3 v4 v5 v6
0-0-70cf97 1 89 20 30 45 55 65
3ad4893b8c 1 4 5 45 45 55 65
0-0-70cf97d7 2 848 20 52 66 56 56
0-0-70cf 2 54 4 846 65 5 5
0-0-793b8c 3 56454 28 6 4 5 65
0-0-70cf98 2 8 4654 30 65 6 21
3ad4893b8c 2 89 66 518 156 16 65
0-0-70cf97d8 3 89 20 161 1 55 45465
0-0-70cf 5 89 79 48 45 55 456
0-0-793b8c 5 89 20 48 545 654 4
0-0-70cf99 6 9 20 30 45 55 65
DESIRED
key ID count v1 v2 v3 v4 v5 v6
0-0-70cf99 6 1 9 20 30 45 55 65
RESULT FROM CODE
ID count
6 1
You can use the base R function ave to calculate the number of rows in each group (ID) and then subset the groups that have the minimum number of rows.
num_rows <- ave(MyData$v1, MyData$ID, FUN = length)
MyData[which(num_rows == min(num_rows)), ]
# key ID v1 v2 v3 v4 v5 v6
#11 0-0-70cf99 6 9 20 30 45 55 65
You could also use which.min in this case to avoid a step; however, which.min returns only the first minimum, so it would fail when there are multiple minimum values. Hence I have used which; the sketch below shows the caveat.
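For illustration, the which.min variant (a sketch: it keeps only a single row, even if several rows tie for the minimum):
# which.min returns the index of the first minimum only
MyData[which.min(num_rows), ]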
No need to summarize:
ShowID <- MyData %>%
  group_by(ID) %>%
  mutate(count = n()) %>%
  ungroup() %>%
  filter(count == min(count))
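A more compact variant would be dplyr's add_count, which appends the group size as a column n (a sketch):
MyData %>%
  add_count(ID) %>%          # adds n = number of rows per ID
  filter(n == min(n)) %>%    # keep the smallest group(s)
  rename(count = n)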
Given a data frame with numeric values in all columns except the last one, how can I compute the mean across each row?
In this example I am using all columns, including the name column, which I would like to omit.
df <- as.data.frame(matrix(1:40, ncol=10)) %>%
mutate(name=LETTERS[1:4]) %>%
mutate(mean=rowMeans(.))
Desired data frame output:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 mean name
1 1 5 9 13 17 21 25 29 33 37 19 A
2 2 6 10 14 18 22 26 30 34 38 20 B
3 3 7 11 15 19 23 27 31 35 39 21 C
4 4 8 12 16 20 24 28 32 36 40 22 D
You could try:
df %>%
mutate(mean = select(., -matches("name")) %>% rowMeans(.))
In your setting, you could use
df <- as.data.frame(matrix(1:40, ncol=10)) %>%
mutate(name=LETTERS[1:4]) %>%
mutate(mean=rowMeans(.[,1:10]))
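If you would rather not hard-code the column positions, a sketch that averages every numeric column, so the character name column is skipped automatically:
df <- as.data.frame(matrix(1:40, ncol=10)) %>%
  mutate(name=LETTERS[1:4]) %>%
  mutate(mean=rowMeans(select_if(., is.numeric)))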
I am new to R, so some of my concepts may not be fully correct...
I have a set of files that I read into a list (showing just the first 3 lines of each):
myfiles<-lapply(list.files(".",pattern="tab",full.names=T),read.table,skip="#")
myfiles
[[1]]
V1 V2 V3
1 10001 33 -0.0499469
2 30001 65 0.0991478
3 50001 54 0.1564400
[[2]]
V1 V2 V3
1 10001 62 0.0855260
2 30001 74 0.1536640
3 50001 71 0.1020960
[[3]]
V1 V2 V3
1 10001 49 -0.04661360
2 30001 65 0.16961500
3 50001 61 0.07089600
I want to apply an ifelse condition in order to substitute values in columns and then return exactly the same list. However, when I do this:
myfiles<-lapply(myfiles,function(x) ifelse(x$V2>50, x$V3, NA))
myfiles
[[1]]
[1] NA 0.0991478 0.1564400
[[2]]
[1] 0.0855260 0.1536640 0.1020960
[[3]]
[1] NA 0.16961500 0.07089600
it does in fact what I want, but it returns only the column the function was applied to, and I want it to return the same list as before, with all 3 columns (but with the substitutions).
I guess there should be an easy way to do this with some variant of apply, but I was not able to find it.
Thanks
You can use lapply and transform/within. There are three possibilities:
a) ifelse
lapply(myfiles, transform, V3 = ifelse(V2 > 50, V3, NA))
b) mathematical operators (potentially more efficient: NA ^ 0 is 1 and NA ^ 1 is NA, so the product is V3 where V2 > 50 and NA otherwise)
lapply(myfiles, transform, V3 = NA ^ (V2 <= 50) * V3)
c) is.na<-
lapply(myfiles, within, is.na(V3) <- V2 <= 50)
The result
[[1]]
V1 V2 V3
1 10001 33 NA
2 30001 65 0.0991478
3 50001 54 0.1564400
[[2]]
V1 V2 V3
1 10001 62 0.085526
2 30001 74 0.153664
3 50001 71 0.102096
[[3]]
V1 V2 V3
1 10001 49 NA
2 30001 65 0.169615
3 50001 61 0.070896
Perhaps this helps
lapply(myfiles, within, V3 <- ifelse(V2 > 50, V3, NA))
#[[1]]
# V1 V2 V3
#1 10001 33 NA
#2 30001 65 0.0991478
#3 50001 54 0.1564400
#[[2]]
# V1 V2 V3
#1 10001 62 0.085526
#2 30001 74 0.153664
#3 50001 71 0.102096
#[[3]]
# V1 V2 V3
#1 10001 49 NA
#2 30001 65 0.169615
#3 50001 61 0.070896
Update
Another option would be to read the files using fread from data.table, which is fast:
library(data.table)
files <- list.files(pattern='tab')
lapply(files, function(x) fread(x)[V2 <= 50, V3 := NA])
#[[1]]
# V1 V2 V3
#1: 10001 33 NA
#2: 30001 65 0.0991478
#3: 50001 54 0.1564400
#[[2]]
# V1 V2 V3
#1: 10001 62 0.085526
#2: 30001 74 0.153664
#3: 50001 71 0.102096
#[[3]]
# V1 V2 V3
#1: 10001 49 NA
#2: 30001 65 0.169615
#3: 50001 61 0.070896
Or, as @Richie Cotton mentioned, you could bind the datasets together using rbindlist and then do the operation in one step.
library(tools)
dt1 <- rbindlist(lapply(files, function(x)
  fread(x)[, id := basename(file_path_sans_ext(x))]))[V2 <= 50, V3 := NA]
dt1
# V1 V2 V3 id
#1: 10001 33 NA tab1
#2: 30001 65 0.0991478 tab1
#3: 50001 54 0.1564400 tab1
#4: 10001 62 0.0855260 tab2
#5: 30001 74 0.1536640 tab2
#6: 50001 71 0.1020960 tab2
#7: 10001 49 NA tab3
#8: 30001 65 0.1696150 tab3
#9: 50001 61 0.0708960 tab3
This seems harder than it should be because you are working with a list of data frames rather than a single data frame. You can combine all the data frames into a single one using rbind_all in dplyr.
library(dplyr)
# Some variable renaming for clarity:
# myfiles now refers to the file names; mydata now contains the data
myfiles <- list.files(pattern="tab", full.names=TRUE)
mydata <- lapply(myfiles, read.table, skip="#")
# Get the number of rows in each data frame
n_rows <- vapply(mydata, nrow, integer(1))
# Combine the list of data frames into a single data frame
all_mydata <- rbind_all(mydata)
# Add an identifier to see which data frame the row came from.
all_mydata$file <- rep(myfiles, each = n_rows)
# Now update column 3
is.na(all_mydata$V3) <- all_mydata$V2 <= 50
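If you then need the per-file list back, a sketch that splits on the file column added above (mydata_updated is just an illustrative name):
# split the combined data frame back into one data frame per source file
mydata_updated <- split(all_mydata[names(all_mydata) != "file"], all_mydata$file)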
Try adding an id column for each df and binding them together:
for(i in 1:3) myfiles[[i]]$id = i
ddf = myfiles[[1]]
for(i in 2:3) ddf = rbind(ddf, myfiles[[i]])
Then apply changes on composite df and split it back again:
ddf$V3 = ifelse(ddf$V2>50, ddf$V3, NA)
myfiles = lapply(split(ddf, ddf$id), function(x) x[1:3])
myfiles
$`1`
V1 V2 V3
1 10001 33 NA
2 30001 65 0.0991478
3 50001 54 0.1564400
$`2`
V1 V2 V3
11 10001 62 0.085526
21 30001 74 0.153664
31 50001 71 0.102096
$`3`
V1 V2 V3
12 10001 49 NA
22 30001 65 0.169615
32 50001 61 0.070896