Repeat row in a column of a data table - r

I have a data table which includes NAs in some cells as below.
Data table: [image not available]
However, I want to repeat the value in the 1st row of the column called "Category" in the following two rows, which contain "NA", without changing the other columns, "Numeric" and "Numeric.null". The same goes for the 4th row in Category: repeat it in the 5th and 6th rows, with no change to the other columns.
Desired output: [image not available]
I'm just learning R programming. I have tried the rep function, but I couldn't get it to work. Please help me.

We can use fill from tidyr
library(dplyr)
library(tidyr)
df1 <- df1 %>%
fill(Category)
df1
# Category Numeric Numeric.null
#1 A 1 1
#2 A 2 2
#3 A 3 4
#4 D 4 7
#5 D 5 6
#6 D 6 8
#7 E 7 11
Or using data.table with na.locf0 from zoo
library(data.table)
library(zoo)
setDT(df1)[, Category := na.locf0(Category)][]
data
df1 <- structure(list(Category = c("A", NA, NA, "D", NA, NA, "E"), Numeric = 1:7,
Numeric.null = c(1L, 2L, 4L, 7L, 6L, 8L, 11L)),
class = "data.frame", row.names = c(NA,
-7L))
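For comparison, the same forward fill can be sketched in base R without any packages. This is only an illustrative alternative to the answers above, not something from the original answer, and it assumes the first row of Category is not NA (a leading NA would break the indexing).

```r
# Same example data as above, built with data.frame() for readability
df1 <- data.frame(
  Category = c("A", NA, NA, "D", NA, NA, "E"),
  Numeric = 1:7,
  Numeric.null = c(1L, 2L, 4L, 7L, 6L, 8L, 11L)
)

# TRUE where Category already has a value
has_value <- !is.na(df1$Category)

# cumsum() counts how many non-missing values have been seen so far,
# so it points each row at the most recent non-missing value.
# Assumption: the first row is not NA.
df1$Category <- df1$Category[has_value][cumsum(has_value)]

print(df1)
```

Only the Category column is touched; Numeric and Numeric.null are left unchanged.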


How to merge two data frame which has jumbled column names

I have 2 data frames, df1 and df2, with the same column names but in a different column order. How can I combine them into df3 without creating additional columns or rows?
df1
a b c
1 3 6
df2
b c a
5 6 1
expected df3
a b c
1 3 6
1 5 6
I tried the code below, but it did not work:
df3=merge(df1, df2, by = "col.names")
We may use bind_rows, which matches columns by name; if a column is missing from one of the data frames, it fills that column with NA for those rows. The column order follows the first data frame passed to bind_rows, i.e. df1.
library(dplyr)
bind_rows(df1, df2)
-output
a b c
1 1 3 6
2 1 5 6
data
df1 <- structure(list(a = 1L, b = 3L, c = 6L), class = "data.frame", row.names = c(NA,
-1L))
df2 <- structure(list(b = 5L, c = 6L, a = 1L), class = "data.frame", row.names = c(NA,
-1L))
Rearrange the columns of one data frame to match the other, so that both have the same column order, and then use rbind.
rbind(df1, df2[names(df1)])
# a b c
#1 1 3 6
#2 1 5 6
In this case, rbind(df1, df2) should work too, because rbind on data frames matches columns by name.
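A minimal base R sketch of that last point, using the same example data. The setequal() check is an added safeguard for illustration, not part of the original answer.

```r
df1 <- data.frame(a = 1L, b = 3L, c = 6L)
df2 <- data.frame(b = 5L, c = 6L, a = 1L)

# Sanity check (added for illustration): both data frames must have
# the same set of column names for rbind() to succeed.
stopifnot(setequal(names(df1), names(df2)))

# rbind() on data frames matches columns by name, not by position,
# and keeps the column order of the first data frame.
df3 <- rbind(df1, df2)
print(df3)
```

If the two data frames did not share exactly the same column names, rbind would stop with an error, whereas bind_rows would fill the gaps with NA.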

R group_by one variable or (not and) another [duplicate]

This question already has an answer here:
Create a group index for values connected directly and indirectly
(1 answer)
Closed 2 years ago.
I have a dataset with two variables. As a simple example:
df <- rbind(c("A",1),c("B",2),c("C",2),c("C",3),c("D",4),c("D",5),c("E",1))
I would like to group them by the first component or the second, the desired output would be a third column with the following values:
c(1,2,2,2,3,3,1)
If I use dplyr with group_by and cur_group_id(), I get groups defined by the first and the second component together, which gives
c(1,2,3,4,5,6,7)
Can anyone suggest an easy way to obtain the desired groups? Base R, dplyr or data.table would all be fine.
Thank you
Perhaps igraph could be a helpful tool for you
library(igraph)
df$grp <- membership(components(graph_from_data_frame(df, directed = FALSE)))[df$X1]
which gives
> df
X1 X2 grp
1 A 1 1
2 B 2 2
3 C 2 2
4 C 3 2
5 D 4 3
6 D 5 3
7 E 1 1
Data
> dput(df)
structure(list(X1 = c("A", "B", "C", "C", "D", "D", "E"), X2 = c(1L,
2L, 2L, 3L, 4L, 5L, 1L)), row.names = c(NA, -7L), class = "data.frame")
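If igraph is not available, the same idea (rows belong together when they are linked through a shared X1 or a shared X2, directly or indirectly) can be sketched in base R by repeatedly propagating the smallest label within each X1 group and each X2 group until nothing changes. This is an illustrative alternative rather than the method from the answer above, and it may be slow on large data.

```r
df <- data.frame(
  X1 = c("A", "B", "C", "C", "D", "D", "E"),
  X2 = c(1L, 2L, 2L, 3L, 4L, 5L, 1L)
)

# Start with every row in its own group
grp <- seq_len(nrow(df))

repeat {
  # Rows sharing X1 take the smallest label among them,
  # then rows sharing X2 do the same.
  new_grp <- ave(grp, df$X1, FUN = min)
  new_grp <- ave(new_grp, df$X2, FUN = min)
  if (identical(new_grp, grp)) break  # labels are stable: done
  grp <- new_grp
}

# Renumber the surviving labels as 1, 2, 3, ... in order of appearance
df$grp <- match(grp, unique(grp))
print(df)
```

Each pass can only lower a label, so the loop is guaranteed to stop; the final renumbering just makes the group ids consecutive.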

Fill in missing rows in data in R

Suppose I have a data frame like this:
1 8
2 12
3 2
5 -6
6 1
8 5
I want to add a row in the places where the 4 and 7 would have gone in the first column and have the second column for these new rows be 0, so adding these rows:
4 0
7 0
I have no idea how to do this in R.
In Excel, I could use a VLOOKUP inside an IFERROR. Is there a similar combination of functions in R to make this happen?
Edit: also, suppose that row 1 was missing and needed to be filled in similarly. Would this require another solution? What if I wanted to add rows until I reached ten rows?
Use tidyr::complete to fill in the missing sequence between min and max values.
library(tidyr)
library(rlang)
complete(df, V1 = min(V1):max(V1), fill = list(V2 = 0))
#Or using `seq`
#complete(df, V1 = seq(min(V1), max(V1)), fill = list(V2 = 0))
# V1 V2
# <int> <dbl>
#1 1 8
#2 2 12
#3 3 2
#4 4 0
#5 5 -6
#6 6 1
#7 7 0
#8 8 5
If we already know the range we want, we can pass it directly. For example, to get rows for V1 = 1 to 10:
complete(df, V1 = 1:10, fill = list(V2 = 0))
If we don't know the column names beforehand, we can do something like :
col1 <- names(df)[1]
col2 <- names(df)[2]
complete(df, !!sym(col1) := 1:10, fill = as.list(setNames(0, col2)))
data
df <- structure(list(V1 = c(1L, 2L, 3L, 5L, 6L, 8L), V2 = c(8L, 12L,
2L, -6L, 1L, 5L)), class = "data.frame", row.names = c(NA, -6L))
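Closer to the VLOOKUP-inside-IFERROR idea from the question, a base R sketch is to merge the full sequence with the data and then replace the resulting NA values. This is an illustrative alternative to complete, not part of the original answer; the name `out` is made up for the example.

```r
df <- data.frame(
  V1 = c(1L, 2L, 3L, 5L, 6L, 8L),
  V2 = c(8L, 12L, 2L, -6L, 1L, 5L)
)

# Full sequence of keys; replace with 1:10 to pad out to ten rows
all_keys <- data.frame(V1 = min(df$V1):max(df$V1))

# Left join: keep every key, look up V2 where it exists (like VLOOKUP)
out <- merge(all_keys, df, by = "V1", all.x = TRUE)

# Keys with no match come back as NA; set them to 0 (like IFERROR)
out$V2[is.na(out$V2)] <- 0L

print(out)
```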

Summing row values by specific columns using mutate_at and sum function?

I have a data table with questionnaire data: the first column holds participant IDs, followed by columns for each questionnaire, headed by the individual questions. For example, the data table would look like this, where A is one questionnaire and B is a different one:
ID A1 A2 A3 B1 B2
1 3 5 3 4 2
2 2 5 2 2 1
3 4 1 3 4 1
4 3 2 3 3 2
I want to code this using dplyr functions. I'm having trouble using mutate_at from dplyr to compute the summary score of each questionnaire for each ID. I want to find the sum for questionnaire A (from A1, A2, and A3), for B, and so on. But my data table has many questionnaires in it (A, B, C, D, etc.), so my code right now looks like:
data %>%
group_by(ID) %>%
mutate_at(vars(contains("A")), funs(sum)) %>%
ungroup()
However running this always gives me an error of
Error: invalid 'type' (character) of argument
and I can't understand why. Same thing happens when I try mutate_each. How can I solve this?
Here is one way. I can see why you want to work with the wide-format data using mutate_at, but long format makes this easier. You can use melt (from reshape2) or gather (from tidyr) to reshape the data to long format. Then strip the digits from the variable column so that only the questionnaire letter remains. Finally, group the data by ID and variable and take the sum.
library(reshape2) # for melt
library(dplyr)
melt(mydf, id.var = "ID") %>%
mutate(variable = gsub(pattern = "[0-9]+", replacement = "", x = variable)) %>%
group_by(ID, variable) %>%
summarise(total = sum(value))
# ID variable total
# <int> <chr> <int>
#1 1 A 11
#2 1 B 6
#3 2 A 9
#4 2 B 3
#5 3 A 8
#6 3 B 5
#7 4 A 8
#8 4 B 5
DATA
mydf <- structure(list(ID = 1:4, A1 = c(3L, 2L, 4L, 3L), A2 = c(5L, 5L,
1L, 2L), A3 = c(3L, 2L, 3L, 3L), B1 = c(4L, 2L, 4L, 3L), B2 = c(2L,
1L, 1L, 2L)), .Names = c("ID", "A1", "A2", "A3", "B1", "B2"), class = "data.frame", row.names = c(NA,
-4L))
The reason it's difficult to do is that you haven't explicitly coded the questionnaire type and number and the data are therefore not "tidy". Jazzurro's approach is right but here I've used the tidyr package to do this with gather and separate.
library(tidyr)
library(dplyr)
data %>%
gather(test, tot, A1:B2) %>%
separate(test, into=c("Q", "No"), sep=1) %>%
group_by(ID, Q) %>%
summarise(totals=sum(tot))
This avoids having to use gsub and the like.
Also, you can add %>% spread(Q, totals) to the end of the pipeline if you want A and B in separate columns.
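If staying in wide format is preferred, a base R sketch is to collect the questionnaire prefixes from the column names and take rowSums over each set of matching columns. This is an illustrative alternative, not one of the original answers; it assumes every question column is named as a prefix followed by digits (A1, B2, ...) and that the ID column is called ID.

```r
mydf <- data.frame(
  ID = 1:4,
  A1 = c(3L, 2L, 4L, 3L), A2 = c(5L, 5L, 1L, 2L), A3 = c(3L, 2L, 3L, 3L),
  B1 = c(4L, 2L, 4L, 3L), B2 = c(2L, 1L, 1L, 2L)
)

# Questionnaire prefixes: column names with the trailing digits removed
prefixes <- unique(gsub("[0-9]+", "", setdiff(names(mydf), "ID")))

# For each prefix, sum its question columns row by row
totals <- sapply(prefixes, function(p) {
  cols <- grep(paste0("^", p, "[0-9]+$"), names(mydf))
  rowSums(mydf[cols])
})

# One row per ID, one total column per questionnaire
result <- cbind(mydf["ID"], totals)
print(result)
```

Anchoring the pattern with ^ and $ avoids accidentally matching other columns that merely contain the letter, which contains("A") would do.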

Fill a column's blank spaces contingent on a second column in R

I'd appreciate some help with this one. I have something similar to the data below.
df$A df$B
1 .
1 .
1 .
1 6
2 .
2 .
2 7
What I need to do is fill in df$B with each value that corresponds to the end of the run of values in df$A. Example below.
df$A df$B
1 6
1 6
1 6
1 6
2 7
2 7
2 7
Any help would be welcome.
It seems that the missing values are denoted by ".". It is better to read the dataset with na.strings="." so that the missing values become NA. For the current dataset, the 'B' column would be of character or factor class, depending on whether you used stringsAsFactors=FALSE or TRUE in read.table/read.csv (TRUE was the default before R 4.0.0).
Using data.table, we convert the data.frame to data.table (setDT(df1)), change the 'character' class to 'numeric' (B:= as.numeric(B)). This will also result in coercing the . to NA (a warning will appear). Grouped by "A", we change the "B" values to the last element (B:= B[.N])
library(data.table)
setDT(df1)[,B:= as.numeric(B)][,B:=B[.N] , by = A]
# A B
#1: 1 6
#2: 1 6
#3: 1 6
#4: 1 6
#5: 2 7
#6: 2 7
#7: 2 7
Or with dplyr
library(dplyr)
df1 %>%
group_by(A) %>%
mutate(B= as.numeric(tail(B,1)))
Or using ave from base R
df1$B <- with(df1, as.numeric(ave(B, A, FUN=function(x) tail(x,1))))
data
df1 <- structure(list(A = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), B = c(".",
".", ".", "6", ".", ".", "7")), .Names = c("A", "B"),
class = "data.frame", row.names = c(NA, -7L))
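The na.strings suggestion above can be sketched as follows. The inline text= input stands in for the real file, which is not shown in the question, so treat the reading step as an assumption about how the data is stored.

```r
# Read the data with "." treated as missing, so B is numeric from the start.
# text = ... is a stand-in for the actual file path.
df1 <- read.table(text = "A B
1 .
1 .
1 .
1 6
2 .
2 .
2 7", header = TRUE, na.strings = ".")

# Within each run of A, replace B with the last value of the group
df1$B <- ave(df1$B, df1$A, FUN = function(x) tail(x, 1))

print(df1)
```

Because B is read as numeric, no as.numeric conversion (and no coercion warning) is needed.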
