Transform a dataframe to use first column values as column names - r

I have a dataframe with 2 columns:
.id vals
1 A 10
2 B 20
3 C 30
4 A 100
5 B 200
6 C 300
dput(tst_df)
structure(list(.id = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor"), vals = c(10, 20, 30, 100, 200,
300)), .Names = c(".id", "vals"), row.names = c(NA, -6L), class = "data.frame")
Now I want the .id column to become my column names, with the vals spread over 2 rows.
Like this:
A B C
10 20 30
100 200 300
Basically, .id is my grouping variable and I want to have all values belonging to one group as a row. I expected something simple like melt and transform, but after many tries I still have not succeeded. Is anyone familiar with a function that will accomplish this?

You can do this in base R with unstack:
unstack(df, form=vals~.id)
A B C
1 10 20 30
2 100 200 300
The first argument is the data.frame (df here is the question's tst_df) and the second is a formula of the form values ~ grouping, which determines the unstacked structure.
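As a self-contained sketch, the same call can be run on the data from the question. The data frame is rebuilt by hand from the dput() output, and the name wide is only illustrative:

```r
# Data rebuilt from the question's dput() output
tst_df <- data.frame(
  .id  = factor(c("A", "B", "C", "A", "B", "C")),
  vals = c(10, 20, 30, 100, 200, 300)
)

# Left side of the formula: the values; right side: the grouping column
wide <- unstack(tst_df, vals ~ .id)
print(wide)
```

Note that unstack only returns a data.frame when every group has the same number of values; with unequal group sizes it returns a list instead.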

You can also use tapply,
do.call(cbind, tapply(df$vals, df$.id, I))
# A B C
#[1,] 10 20 30
#[2,] 100 200 300
or wrap it in data frame, i.e.
as.data.frame(do.call(cbind, tapply(df$vals, df$.id, I)))


Fill in missing rows in data in R

Suppose I have a data frame like this:
1 8
2 12
3 2
5 -6
6 1
8 5
I want to add a row in the places where the 4 and 7 would have gone in the first column and have the second column for these new rows be 0, so adding these rows:
4 0
7 0
I have no idea how to do this in R.
In excel, I could use a vlookup inside an iferror. Is there a similar combo of functions in R to make this happen?
Edit: also, suppose that row 1 was missing and needed to be filled in similarly. Would this require another solution? What if I wanted to add rows until I reached ten rows?
Use tidyr::complete to fill in the missing sequence between min and max values.
library(tidyr)
library(rlang)
complete(df, V1 = min(V1):max(V1), fill = list(V2 = 0))
#Or using `seq`
#complete(df, V1 = seq(min(V1), max(V1)), fill = list(V2 = 0))
# V1 V2
# <int> <dbl>
#1 1 8
#2 2 12
#3 3 2
#4 4 0
#5 5 -6
#6 6 1
#7 7 0
#8 8 5
If we already know the range we want, we can use it directly. For example, to get V1 from 1 to 10:
complete(df, V1 = 1:10, fill = list(V2 = 0))
If we don't know the column names beforehand, we can do something like:
col1 <- names(df)[1]
col2 <- names(df)[2]
complete(df, !!sym(col1) := 1:10, fill = as.list(setNames(0, col2)))
data
df <- structure(list(V1 = c(1L, 2L, 3L, 5L, 6L, 8L), V2 = c(8L, 12L,
2L, -6L, 1L, 5L)), class = "data.frame", row.names = c(NA, -6L))
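For comparison, here is a base R sketch that is closer to the vlookup-inside-iferror idea from the question. It is a different technique from tidyr::complete: it left-joins the data onto the full sequence with merge and then replaces the resulting NA values with 0. The names full and filled are illustrative only.

```r
df <- data.frame(V1 = c(1L, 2L, 3L, 5L, 6L, 8L),
                 V2 = c(8L, 12L, 2L, -6L, 1L, 5L))

# Every key we want in the result (here 1 to 10)
full <- data.frame(V1 = 1:10)

# Left join: keep all rows of `full`; keys missing from df get NA in V2
filled <- merge(full, df, by = "V1", all.x = TRUE)

# Replace the NAs created by the join with 0
filled$V2[is.na(filled$V2)] <- 0
print(filled)
```

This assumes V2 has no genuine NA values of its own, since those would be overwritten with 0 as well.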

R lapply need to use a different input depending on which value is being evaluated

Say I have a list c of three data frames:
> c
$first
a b
1 1 2
2 2 3
3 3 4
$second
a b
1 2 4
2 4 6
3 6 8
$third
a b
1 3 6
2 6 9
3 9 12
I want to run an lapply on c that will do a custom function on each data frame.
The custom function depends on three numbers and I want the function to use a different number depending on which data frame it's evaluating.
I was thinking of utilizing the names 'first', 'second', and 'third', but I'm unsure how to get those names once they're inside the lapply function. It would look something like this:
lapply(c, function(list, num1 = 1, num2 = -1, num3 = 0) {num <- ifelse(names(list) == "first", num1, ifelse(names(list) == "second", num2, num3)); return(list*num)})
So the result I would want would be first multiplied by 1, second multiplied by -1, and third multiplied by 0.
The names function gives the values a and b (the column names) instead of the name of the data frame itself, so that doesn't work. Is there a function that would be able to give me the 'first', 'second', and 'third' values I need?
Or alternatively, is there a better way of doing this in a lapply function?
Maybe it would be easier with Map. We pass the numbers of interest in the order we want and do a simple multiplication:
Map(`*`, lst1, c(1, -1, 0))
If the numbers are named
num1 <- setNames(c(1, -1, 0), c("first", "third", "second"))
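then match them by the names of the list. Note that num1 here assigns -1 to third and 0 to second, so the output below differs from the mapping in the question; the point is that the lookup is by name, not by position.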
Map(`*`, lst1, num1[names(lst1)])
#$first
# a b
#1 1 2
#2 2 3
#3 3 4
#$second
# a b
#1 0 0
#2 0 0
#3 0 0
#$third
# a b
#1 -3 -6
#2 -6 -9
#3 -9 -12
Or, if we decide to go with lapply, loop over the names of the list and use each name to extract both the list element and the corresponding element of the named vector (the result of this lapply call is unnamed)
lapply(names(lst1), function(nm) lst1[[nm]] * num1[nm])
Or with sapply
sapply(names(lst1), function(nm) lst1[[nm]] * num1[nm], simplify = FALSE)
Or another option is map2 from purrr
library(purrr)
map2(lst1, num1[names(lst1)], `*`)
Note: c is the name of a base R function, and it is not recommended to give objects the same name as a function
data
lst1 <- list(first = structure(list(a = 1:3, b = 2:4), class = "data.frame",
row.names = c("1",
"2", "3")), second = structure(list(a = c(2L, 4L, 6L), b = c(4L,
6L, 8L)), class = "data.frame", row.names = c("1", "2", "3")),
third = structure(list(a = c(3L, 6L, 9L), b = c(6L, 9L, 12L
)), class = "data.frame", row.names = c("1", "2", "3")))
Besides the solutions by @akrun, you can also try the following code
mapply(`*`, lst1, c(1, -1, 0), SIMPLIFY = FALSE)
or
lapply(seq_along(lst1), function(k) lst1[[k]]*c(1,-1,0)[k])
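As a rough sketch tying the answers together, the example below uses the mapping the question asks for (first by 1, second by -1, third by 0) and looks up each multiplier by list name. The names mult, by_map and by_lapply are illustrative and not from the answers above.

```r
lst1 <- list(
  first  = data.frame(a = 1:3, b = 2:4),
  second = data.frame(a = c(2L, 4L, 6L), b = c(4L, 6L, 8L)),
  third  = data.frame(a = c(3L, 6L, 9L), b = c(6L, 9L, 12L))
)

# One multiplier per list element, keyed by name (order does not matter)
mult <- c(third = 0, first = 1, second = -1)

# Map keeps the list names; indexing by names(lst1) aligns the multipliers
by_map <- Map(`*`, lst1, mult[names(lst1)])

# Equivalent lapply over the names; setNames restores the names lapply drops
by_lapply <- setNames(
  lapply(names(lst1), function(nm) lst1[[nm]] * mult[[nm]]),
  names(lst1)
)

print(by_map$second)
```

Keying the multipliers by name avoids depending on the order of the list, which is the main risk with the purely positional versions.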

Summation of the corresponding number of values which are in different columns

My data frame looks like below:
df<-data.frame(alphabets1=c("A","B","C","B","C"," ","NA"),alphabets2=c("B","A","D","D"," ","E","NA"),alphabets3=c("C","F","G"," "," "," ","NA"), number = c("1","2","3","1","4","1","2"))
alphabets1 alphabets2 alphabets3 number
1 A B C 1
2 B A F 2
3 C D G 3
4 B D 1
5 C 4
6 E 1
7 NA NA NA 2
NOTE1: within the row all the values are unique, that is, below shown is not possible.
alphabets1 alphabets2 alphabets3 number
1 A A C 1
NOTE2: the data frame may contain NA values or blanks
I am struggling to get the output below: a data frame holding each alphabet and the sum of its corresponding numbers. For example, A appears in the 1st and 2nd rows, so its sum is 1+2, i.e. 3; B appears in the 1st, 2nd and 4th rows, so its sum is 1+2+1, i.e. 4.
output <-data.frame(alphabets1=c("A","B","C","D","E","F","G"), number = c("3","4","8","4","1","2","3"))
output
alphabets number
1 A 3
2 B 4
3 C 8
4 D 4
5 E 1
6 F 2
7 G 3
NOTE3: output may or may not have the NA or blanks (it doesn't matter!)
We can reshape it to 'long' format and do a group by operation
library(data.table)
melt(setDT(df), id.vars = "number", na.rm = TRUE, value.name = "alphabets1")[
!grepl("^\\s*$", alphabets1), .(number = sum(as.integer(as.character(number)))),
alphabets1]
# alphabets1 number
#1: A 3
#2: B 4
#3: C 8
#4: D 4
#5: E 1
#6: F 2
#7: G 3
Or we can use xtabs from base R
xtabs(number~alphabets1, data.frame(alphabets1 = unlist(df[-4]),
number = as.numeric(as.character(df[,4]))))
NOTE: In the OP's dataset, the missing values were "NA", and not real NA and the 'number' column is factor (which was changed by converting to integer for doing the sum)
data
df <- data.frame(alphabets1=c("A","B","C","B","C"," ",NA),
alphabets2=c("B","A","D","D"," ","E",NA),
alphabets3=c("C","F","G"," "," "," ",NA),
number = c("1","2","3","1","4","1","2"))
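The same reshape-then-group idea can be sketched in base R without any packages. This is an illustration under two assumptions: number is already numeric, and a valid entry is a single capital letter. The variable names are hypothetical.

```r
df <- data.frame(
  alphabets1 = c("A", "B", "C", "B", "C", " ", NA),
  alphabets2 = c("B", "A", "D", "D", " ", "E", NA),
  alphabets3 = c("C", "F", "G", " ", " ", " ", NA),
  number     = c(1, 2, 3, 1, 4, 1, 2),
  stringsAsFactors = FALSE
)

letter_cols <- df[names(df) != "number"]

# "Long" format: stack the letter columns (column by column) and
# repeat the numbers once per stacked column so the rows stay aligned
alphabet <- unlist(letter_cols, use.names = FALSE)
number   <- rep(df$number, times = ncol(letter_cols))

# Drop NA and blank entries, then sum the numbers per letter
keep <- !is.na(alphabet) & grepl("^[A-Z]$", alphabet)
sums <- tapply(number[keep], alphabet[keep], sum)
print(sums)
```

The result is a named vector; data.frame(alphabets = names(sums), number = as.vector(sums)) would turn it into the two-column layout shown in the question.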
Here is a base R method using sapply and table. I first converted df$number into a numeric. See data section below.
data.frame(table(sapply(df[-length(df)], function(i) rep(i, df$number))))
Var1 Freq
1 11
2 A 3
3 B 4
4 C 8
5 D 4
6 E 1
7 F 2
8 G 3
9 NA 6
To make the output a little bit nicer, we could wrap a few more functions and perform a subsetting within sapply.
data.frame(table(droplevels(unlist(sapply(df[-length(df)],
function(i) rep(i[i %in% LETTERS],
df$number[i %in% LETTERS])),
use.names=FALSE))))
Var1 Freq
1 A 3
2 B 4
3 C 8
4 D 4
5 E 1
6 F 2
7 G 3
It may be easier to do this afterward, though.
data
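I ran
df$number <- as.numeric(as.character(df$number))
on the OP's data (going through as.character so that the factor's labels, not its internal level codes, are converted), resulting in this.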
df <-
structure(list(alphabets1 = structure(c(2L, 3L, 4L, 3L, 4L, 1L,
5L), .Label = c(" ", "A", "B", "C", "NA"), class = "factor"),
alphabets2 = structure(c(3L, 2L, 4L, 4L, 1L, 5L, 6L), .Label = c(" ",
"A", "B", "D", "E", "NA"), class = "factor"), alphabets3 = structure(c(2L,
3L, 4L, 1L, 1L, 1L, 5L), .Label = c(" ", "C", "F", "G", "NA"
), class = "factor"), number = c(1, 2, 3, 1, 4, 1, 2)), .Names = c("alphabets1",
"alphabets2", "alphabets3", "number"), row.names = c(NA, -7L), class = "data.frame")

How to test whether a sequence of sections have gaps in it?

I have several (ice-)core section samples in a dataset (ID in the example below). Some cores have missing sections (i.e. gaps), but I do not know which ones. How to find this out using R?
Example:
dt <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), Sec.start = c(0,
5, 10, 20, 50, 100, 200, 0, 5, 10, 30), Sec.end = c(5, 10, 20,
30, 100, 200, 400, 5, 10, 20, 50), Section = c("0-5", "5-10",
"10-20", "20-30", "50-100", "100-200", "200-400", "0-5", "5-10",
"10-20", "30-50")), .Names = c("ID", "Sec.start", "Sec.end",
"Section"), row.names = c(NA, -11L), class = "data.frame")
dt
ID Sec.start Sec.end Section
1 a 0 5 0-5
2 a 5 10 5-10
3 a 10 20 10-20
4 a 20 30 20-30
5 b 50 100 50-100
6 b 100 200 100-200
7 b 200 400 200-400
8 c 0 5 0-5
9 c 5 10 5-10
10 c 10 20 10-20
11 c 30 50 30-50
"a" and "b" do not have gaps, whereas "c" does (the piece between 20 and 30 is missing), so I am after the following result:
$a
[1] TRUE
$b
[1] TRUE
$c
[1] FALSE
You can try:
lapply(split(dt,dt$ID),function(x) all(x[-1,2]==x[-nrow(x),3]))
#$a
#[1] TRUE
#$b
#[1] TRUE
#$c
#[1] FALSE
Here's a dplyr approach:
library(dplyr)
dt %>%
group_by(ID) %>%
summarise(check = all(Sec.end == lead(Sec.start, default = last(Sec.end))))
#Source: local data table [3 x 2]
#
# ID check
# (fctr) (lgl)
#1 a TRUE
#2 b TRUE
#3 c FALSE
Or the same using data.table:
library(data.table)
setDT(dt)[, .(check = all(Sec.end == shift(Sec.start, 1L, type = 'lead', fill = last(Sec.end)))),
by=ID]
# ID check
#1: a TRUE
#2: b TRUE
#3: c FALSE
Both approaches make use of lag/lead functions (in data.table called shift) to compare each Sec.end value to the next row's Sec.start value. In the last row, where there's no leading Sec.start value, we supply a default value which is the last row's Sec.end - this means the last row (per ID) is always TRUE. We use all to check if all of the comparisons are TRUE per ID.
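The comparison described above can also be sketched in base R with a small helper. The function name has_no_gaps is hypothetical, and the sketch assumes the rows are already sorted by Sec.start within each ID.

```r
dt <- data.frame(
  ID        = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c"),
  Sec.start = c(0, 5, 10, 20, 50, 100, 200, 0, 5, 10, 30),
  Sec.end   = c(5, 10, 20, 30, 100, 200, 400, 5, 10, 20, 50)
)

# TRUE when every section ends exactly where the next one starts
has_no_gaps <- function(start, end) {
  n <- length(start)
  if (n < 2) return(TRUE)      # a single section cannot have a gap
  all(end[-n] == start[-1])    # end of row i versus start of row i + 1
}

res <- sapply(split(dt, dt$ID),
              function(x) has_no_gaps(x$Sec.start, x$Sec.end))
print(res)
```

Dropping the last end and the first start plays the same role as the default value in the lead/shift versions: the last row of each ID has nothing to be compared against.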

average column values across all rows of a data frame

I've got a data frame that I read from a file like this:
name, points, wins, losses, margin
joe, 1, 1, 0, 1
bill, 2, 3, 0, 4
joe, 5, 2, 5, -2
cindy, 10, 2, 3, -2.5
etc.
I want to average the column values across the rows of this data. Is there an easy way to do this in R?
For example, I want to get the average column values for all "Joe's", coming out with something like
joe, 3, 1.5, 2.5, -.5
After loading your data:
df <- structure(list(name = structure(c(3L, 1L, 3L, 2L), .Label = c("bill", "cindy", "joe"), class = "factor"), points = c(1L, 2L, 5L, 10L), wins = c(1L, 3L, 2L, 2L), losses = c(0L, 0L, 5L, 3L), margin = c(1, 4, -2, -2.5)), .Names = c("name", "points", "wins", "losses", "margin"), class = "data.frame", row.names = c(NA, -4L))
Just use the aggregate function:
> aggregate(. ~ name, data = df, mean)
name points wins losses margin
1 bill 2 3.0 0.0 4.0
2 cindy 10 2.0 3.0 -2.5
3 joe 3 1.5 2.5 -0.5
Obligatory plyr and reshape solutions:
library(plyr)
ddply(df, "name", numcolwise(mean))
library(reshape)
cast(melt(df), name ~ ..., mean)
And a data.table solution for easy syntax and memory efficiency
library(data.table)
DT <- data.table(df)
DT[,lapply(.SD, mean), by = name]
Here is yet another way, shown on a different example.
If we have matrix xt as:
a b c d
A 1 2 3 4
A 5 6 7 8
A 9 10 11 12
A 13 14 15 16
B 17 18 19 20
B 21 22 23 24
B 25 26 27 28
B 29 30 31 32
C 33 34 35 36
C 37 38 39 40
C 41 42 43 44
C 45 46 47 48
One can compute the mean over rows that share a row name in a few steps:
1. Compute the means with the aggregate function.
2. aggregate writes the group names into a new first column, so set them back as the row names...
3. ...and remove that column by selecting columns 2 through the number of columns of xa.
xa=aggregate(xt,by=list(rownames(xt)),FUN=mean)
rownames(xa)=xa[,1]
xa=xa[,2:ncol(xa)]
After that we get:
a b c d
A 7 8 9 10
B 23 24 25 26
C 39 40 41 42
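An alternative sketch for the same matrix uses base R's rowsum, which sums rows by group and keeps the result as a matrix, so no row-name bookkeeping is needed. This is a different technique from the aggregate steps above; the construction of xt is reconstructed from the printed example, and the names sums, counts and means are illustrative.

```r
# 12 x 4 matrix with repeated row names, as in the example above
xt <- matrix(1:48, ncol = 4, byrow = TRUE,
             dimnames = list(rep(c("A", "B", "C"), each = 4),
                             c("a", "b", "c", "d")))

# Sum the rows within each row-name group
sums <- rowsum(xt, group = rownames(xt))

# Number of rows per group, aligned to the row order of `sums`
counts <- as.vector(table(rownames(xt))[rownames(sums)])

# Dividing a matrix by a vector recycles down the columns,
# so element i of `counts` divides row i of `sums`
means <- sums / counts
print(means)
```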
You can simply use functions from the tidyverse to group your data by name, and then summarise all remaining columns with a given function (e.g. mean):
library(dplyr)
df <- tibble(name=c("joe","bill","joe","cindy"),
points=c(1,2,5,10), wins=c(1,3,2,2),
losses=c(0,0,5,3),
margin=c(1,4,-2, -2.5))
df %>% dplyr::group_by(name) %>% dplyr::summarise_all(mean)
