conditionally remove rows that are not NA

I need to remove the rows that do not have NA values in the final two columns. Any ideas?
# A tibble: 640 x 4
`7 (very included)` `7 (very included)__1` X__1 X__2
<chr> <chr> <chr> <chr>
1 NA NA NA NA
2 7 (very included) 5 NA NA
3 NA NA NA NA
4 7 (very included) 7 (very included) 7 (very included) 7 (very included)
5 NA NA NA NA
6 NA NA NA NA
7 NA NA NA NA
8 5 4 NA NA
9 NA NA NA NA
10 7 (very included) 7 (very included) 7 (very included) NA
# ... with 630 more rows

Assuming your dataframe object is df, you can filter as below:
library(dplyr)
df %>%
filter(!is.na(`X__1`) & !is.na(`X__2`))
Or, in base R:
df[!is.na(df$`X__1`) & !is.na(df$`X__2`), ]
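The question can be read two ways, so here is a minimal sketch showing both directions side by side. The small tibble is a made-up stand-in for the asker's data (only the column names X__1 and X__2 come from the question), and both_filled / both_missing are illustrative names.

```r
library(dplyr)

# Hypothetical stand-in for the question's tibble; only the last two columns matter
df <- tibble(
  a    = c(NA, "7", "7", "5"),
  X__1 = c(NA, NA, "7", "7"),
  X__2 = c(NA, NA, "7", NA)
)

# Keep rows where both columns are filled (what the answer above does)
both_filled <- df %>% filter(!is.na(X__1) & !is.na(X__2))

# Keep rows where both columns are NA (i.e. drop rows that have a value there)
both_missing <- df %>% filter(is.na(X__1) & is.na(X__2))
```

If a row with a value in only one of the two columns should also be dropped or kept, swap & for | in the relevant condition.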

Related

count occurrences across columns and match to ID column

I have a df of 100+ columns and not all are filled
> head(othertopics,20)
# A tibble: 20 x 118
Q6 Q10.1 Q10.2 Q10.3 Q10.4 Q10.5 Q10.6 Q10.7 Q10.8 Q10.9 Q10.10 Q10.11 Q10.12 Q10.13
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 52 NA NA NA NA NA NA NA NA NA NA NA NA NA
2 294 NA NA NA NA NA NA NA NA NA NA NA NA NA
3 103 NA NA NA NA NA NA NA NA NA NA NA NA NA
4 52 NA NA NA NA NA NA NA NA NA NA NA NA NA
5 87 NA NA NA NA NA NA NA NA NA NA NA NA NA
6 52 NA NA NA NA NA NA NA NA NA NA NA NA NA
7 136 NA NA NA NA NA NA NA NA NA NA NA NA NA
8 19 NA NA NA NA NA NA NA NA NA NA NA NA NA
9 19 NA NA NA NA NA NA NA NA NA NA NA NA NA
10 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
11 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
12 19 NA NA NA 4 NA NA NA NA NA NA NA NA NA
13 52 NA NA NA NA NA NA NA NA NA NA NA NA NA
14 108 NA NA NA NA NA NA NA NA NA NA NA NA NA
Q6 is an ID.
Across Q10.1 to Q10.117, different values are assigned for each ID (see row 12).
Using unlist, I managed to get the frequency with which each value was mentioned among the 117 columns, but I need to match the values to their respective IDs.
So basically I need to match the ID column against the 117 columns and get the frequency of each column.
othertopics<-data.frame(table(unlist(TableTopic2[,22:138])))
Var1 Freq
10 1
100 4
101 1
102 12
103 7
104 21
105 36
106 1
So, for example, variable 105 appeared 36 times across 17 ID values in column Q6 (I counted this number in Excel).
So far I only have the first half of my solution, as I still need to know which IDs are associated with each variable (i.e. the 17 values I counted).
Also note that each variable column contains the number of its variable. For example, column Q10.105 is for variable 105, which has a frequency of 36.
I hope I was able to make it clear.
Thanks!

This question is not particularly clear, but I'll do my best. I think the way to tidy this data is to pivot all of the non-id columns into one column (I call it 'col_name') and have another column with all of the values (mostly NAs; I call it 'numbered_var' for numbered variable). Then you can aggregate based on the numbered_var column.
This example is obviously not reproducible, so I constructed a simplified version of your data (I think):
library(dplyr)
library(tidyr)
df <- tibble(
id = 1:5,
Q1 = c(NA_integer_, 10L, NA_integer_, 10L, NA_integer_),
Q2 = c(NA_integer_, NA_integer_, 11L, NA_integer_, 11)
)
It looks like this:
# A tibble: 5 × 3
id Q1 Q2
<int> <int> <dbl>
1 1 NA NA
2 2 10 NA
3 3 NA 11
4 4 10 NA
5 5 NA 11
Next, I use tidyr::pivot_longer() to put the column names containing Q into a column, with their associated value in another column:
df <- pivot_longer(
df,
cols = contains("Q"), # you will want to use this, but first remove the Q from the id column name in your data
names_to = "col_name",
values_to = "numbered_var"
)
This makes the data long:
# A tibble: 10 × 3
id col_name numbered_var
<int> <chr> <dbl>
1 1 Q1 NA
2 1 Q2 NA
3 2 Q1 10
4 2 Q2 NA
5 3 Q1 NA
6 3 Q2 11
7 4 Q1 10
8 4 Q2 NA
9 5 Q1 NA
10 5 Q2 11
With your data you should still end up with three columns, but each id would repeat once per pivoted column, just as each id repeats twice for the two columns here.
Next, I would group by numbered_var, which seems to be the variable of interest, and collect the unique ids for each value in a new list-column:
df <- group_by(df, numbered_var)
df <- summarize(
df,
var_appearances = n(),
ids = list(unique(id))
)
Now, the data frame looks like this:
# A tibble: 3 × 3
numbered_var var_appearances ids
<dbl> <int> <list>
1 10 2 <int [2]>
2 11 2 <int [2]>
3 NA 6 <int [5]>
That ids column is a list-column with a vector of ids:
print(df$ids)
[[1]]
[1] 2 4
[[2]]
[1] 3 5
[[3]]
[1] 1 2 3 4 5
I'm not sure this is exactly what you'll be able to do, but hopefully it sets you in the right direction.
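As a possible refinement of the steps above, the NA group can be left out by dropping missing values during the pivot. This is only a sketch on the same toy data: values_drop_na is an argument of tidyr::pivot_longer(), starts_with("Q") assumes the id column does not start with Q, and result is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Same toy data as above (all values kept as integers here)
df <- tibble(
  id = 1:5,
  Q1 = c(NA, 10L, NA, 10L, NA),
  Q2 = c(NA, NA, 11L, NA, 11L)
)

result <- df %>%
  pivot_longer(
    cols = starts_with("Q"),
    names_to = "col_name",
    values_to = "numbered_var",
    values_drop_na = TRUE        # discard the empty cells right away
  ) %>%
  group_by(numbered_var) %>%
  summarize(
    var_appearances = n(),       # how often the value occurs
    ids = list(unique(id))       # which ids it occurs for
  )
```

Each remaining row of result is one value, with its frequency and the ids it belongs to.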

Append several tables into one with one different column in R

I have a data frame from which I made several tables. I would like to combine all of these tables into one data frame in order to export it to Excel. The only issue is that the first variable is different in each table, so bind_rows will not work.
Dummy sample data:
df1 = data.frame(Id = c(11:16), date = seq(as.Date("2015-01-01"),as.Date("2015-01-6"),1))
df2 = data.frame(HH_size = c(1:6 ), date = seq(as.Date("2015-01-01"),as.Date("2015-01-6"),1) )
let's say I made these tables
df11<- df1 %>%
dplyr::group_by(date) %>%
count(Id) %>%
tidyr::spread(date,n)
df22<- df2 %>%
dplyr::group_by(date) %>%
count(HH_size) %>%
tidyr::spread(date,n)
df11
Id `2015-01-01` `2015-01-02` `2015-01-03` `2015-01-04` `2015-01-05` `2015-01-06`
<int> <int> <int> <int> <int> <int> <int>
1 11 1 NA NA NA NA NA
2 12 NA 1 NA NA NA NA
3 13 NA NA 1 NA NA NA
4 14 NA NA NA 1 NA NA
5 15 NA NA NA NA 1 NA
6 16 NA NA NA NA NA 1
This will not work
list <- c("df11" , "df22")
list %>% map_df(bind_rows)
Error: Argument 1 must have names
here is my desired output:
label cat `2015-01-01` `2015-01-02` `2015-01-03` `2015-01-04` `2015-01-05` `2015-01-06`
Id 11 1 NA NA NA NA NA
Id 12 NA 1 NA NA NA NA
Id 13 NA NA 1 NA NA NA
Id 14 NA NA NA 1 NA NA
Id 15 NA NA NA NA 1 NA
Id 16 NA NA NA NA NA 1
HH_size 1 1 NA NA NA NA NA
HH_size 2 NA 1 NA NA NA NA
HH_size 3 NA NA 1 NA NA NA
HH_size 4 NA NA NA 1 NA NA
HH_size 5 NA NA NA NA 1 NA
HH_size 6 NA NA NA NA NA 1

This will serve your purpose. Step by step:
In dplyr/magrittr, . refers to the result passed in from the previous pipe step, so names(.)[1] extracts the name of the first column, which mutate() stores in a new column named label.
The values of the first column are needed as well, so a column cat is created from .x[[1]], the first column of each table passed to the function. I think you may also use . instead of .x, since the value just before the pipe is .x.
select(-1) then drops the original first column.
The final select() rearranges the columns so that label and cat come first.
map_df(list(df11, df22), ~.x %>%
mutate(label = names(.)[1],
cat = .x[[1]]) %>%
select(-1) %>%
select(label, cat, everything()))
# A tibble: 12 x 8
label cat `2015-01-01` `2015-01-02` `2015-01-03` `2015-01-04` `2015-01-05` `2015-01-06`
<chr> <int> <int> <int> <int> <int> <int> <int>
1 Id 11 1 NA NA NA NA NA
2 Id 12 NA 1 NA NA NA NA
3 Id 13 NA NA 1 NA NA NA
4 Id 14 NA NA NA 1 NA NA
5 Id 15 NA NA NA NA 1 NA
6 Id 16 NA NA NA NA NA 1
7 HH_size 1 1 NA NA NA NA NA
8 HH_size 2 NA 1 NA NA NA NA
9 HH_size 3 NA NA 1 NA NA NA
10 HH_size 4 NA NA NA 1 NA NA
11 HH_size 5 NA NA NA NA 1 NA
12 HH_size 6 NA NA NA NA NA 1

Put all the data frames in a list, and then you can do:
library(tidyverse)
list_df <- lst(df1, df2)
map_df(list_df, ~{
col <- names(.x)[1]
.x %>%
count(.data[[col]], date) %>%
pivot_wider(names_from = date, values_from = n) %>%
mutate(label = col) %>%
rename_with(~'cat', 1)
})
# cat `2015-01-01` `2015-01-02` `2015-01-03` `2015-01-04` `2015-01-05` `2015-01-06` label
# <int> <int> <int> <int> <int> <int> <int> <chr>
# 1 11 1 NA NA NA NA NA Id
# 2 12 NA 1 NA NA NA NA Id
# 3 13 NA NA 1 NA NA NA Id
# 4 14 NA NA NA 1 NA NA Id
# 5 15 NA NA NA NA 1 NA Id
# 6 16 NA NA NA NA NA 1 Id
# 7 1 1 NA NA NA NA NA HH_size
# 8 2 NA 1 NA NA NA NA HH_size
# 9 3 NA NA 1 NA NA NA HH_size
#10 4 NA NA NA 1 NA NA HH_size
#11 5 NA NA NA NA 1 NA HH_size
#12 6 NA NA NA NA NA 1 HH_size
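Another way to get the label column, sketched here under the assumption that the wide tables already exist, is to rename the first column to a common name and let bind_rows() add the label through its .id argument. The names tables and combined are illustrative; the list names (Id, HH_size) are typed by hand and become the labels.

```r
library(dplyr)
library(tidyr)

dates <- seq(as.Date("2015-01-01"), as.Date("2015-01-06"), 1)
df1 <- data.frame(Id = 11:16, date = dates)
df2 <- data.frame(HH_size = 1:6, date = dates)

# Wide tables equivalent to df11 and df22 from the question
df11 <- df1 %>% count(Id, date) %>% pivot_wider(names_from = date, values_from = n)
df22 <- df2 %>% count(HH_size, date) %>% pivot_wider(names_from = date, values_from = n)

# Named list: the names end up in the label column
tables <- list(Id = df11, HH_size = df22)

combined <- tables %>%
  lapply(function(tbl) rename(tbl, cat = 1)) %>%  # first column -> cat
  bind_rows(.id = "label")                        # list names -> label
```

Here label and cat come out as the first two columns without a separate select() step.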

Remove variables duplicates from data.table

I have data with over 6k columns. Each result has columns with data that are always the same.
XCODE Age Sex ResultA Sex ResultB
1 X001 12 2 2 2 4
2 X002 23 2 4 2 66
3 X003 NA NA NA NA NA
4 X004 32 1 1 1 3
5 X005 NA NA NA NA NA
6 X001 NA NA NA NA NA
7 X002 NA NA NA NA NA
8 X003 33 1 8 1 6
9 X004 NA NA NA NA NA
10 X005 55 2 8 2 8
I would like to remove the duplicated variables, e.g. the second Sex column. Is it possible to do that with data.table?

You can use match if you need to check for equality of all values:
df2 <- df[, unique(match(df, df)), with = FALSE]
df2
# XCODE Age Sex ResultA ResultB
# 1 X001 12 2 2 4
# 2 X002 23 2 4 66
# 3 X003 NA NA NA NA
# 4 X004 32 1 1 3
# 5 X005 NA NA NA NA
# 6 X001 NA NA NA NA
# 7 X002 NA NA NA NA
# 8 X003 33 1 8 6
# 9 X004 NA NA NA NA
# 10 X005 55 2 8 8
Data used:
df <- fread('
XCODE Age Sex ResultA Sex ResultB
1 X001 12 2 2 2 4
2 X002 23 2 4 2 66
3 X003 NA NA NA NA NA
4 X004 32 1 1 1 3
5 X005 NA NA NA NA NA
6 X001 NA NA NA NA NA
7 X002 NA NA NA NA NA
8 X003 33 1 8 1 6
9 X004 NA NA NA NA NA
10 X005 55 2 8 2 8
')[, -'V1']

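Try this:
df[, unique(colnames(df)), with = FALSE]
One caveat: this selects columns by name only, so for every duplicated name it keeps just the first column. In your case, the second Sex column is dropped even if the two columns have the same name but different content. (with = FALSE is needed for a data.table; without it, j is evaluated as an expression and simply returns the vector of names.)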

If you have duplicated columns with different names, you can transpose your data frame, which lets you use the unique function to solve your problem. You then transpose it back and convert it to a data frame again (because it became a matrix when you transposed it).
df = data.frame(c = 1:5, a = c("A", "B","C","D","E"), b = 1:5)
df = t(df)
df = unique(df)
df = t(df)
df = data.frame(df)
Edit: as markus points out, this is probably not a good option if you have columns of multiple types, because when t() coerces your data frame to a matrix it also coerces all your variables to the same type.
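If keeping the column types matters, one alternative sketch is to compare the columns as list elements with duplicated(), which avoids the matrix round trip. This is shown for a plain data.frame using the same toy data; deduped is an illustrative name, and this is a swapped-in technique rather than the transpose approach itself.

```r
# Same toy data as above: columns c and b have identical content
df <- data.frame(c = 1:5, a = c("A", "B", "C", "D", "E"), b = 1:5)

# duplicated() on a list flags elements that repeat an earlier element,
# so columns with identical content are dropped regardless of their names
deduped <- df[!duplicated(as.list(df))]
```

Because nothing is coerced to a matrix, c stays an integer column and a stays a character column.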

How to extract values of existing variable and paste them in top rows of dataframe (using R)

Probably there's a very easy solution to this, but I can't figure it out for some reason. This is what my data (in R) look like (except for value_new, which shows exactly what I need!):
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9),
"value_new"=c(7,NA,4,1,9,rep(NA,5)))
I hope that this is self-explanatory. What I need is to take the entries of "value" from the rows where id is NA (i.e. the bottom five rows) and paste them into the first five rows (i.e. where value is NA) of a new variable I'd like to call "value_new".
What is an easy way of doing this? I'd basically need to cut out the bottom half and paste it as new variable(s) in the top section of the data frame. Hope this makes sense.

dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9))
dat$value_new = NA
dat$value_new[!is.na(dat$id)] = dat$value[is.na(dat$id)]
dat
# id value value_new
# 1 1 NA 7
# 2 2 NA NA
# 3 3 NA 4
# 4 4 NA 1
# 5 5 NA 9
# 6 NA 7 NA
# 7 NA NA NA
# 8 NA 4 NA
# 9 NA 1 NA
# 10 NA 9 NA
In case you have more rows with a non-NA id compared to NA id you can use:
dat<-data.frame("id"=c(1,2,3,4,5,6,NA,NA,NA,NA,NA),
"value"=c(rep(NA,6),7,NA,4,1,9))
k = sum(is.na(dat$id))
dat$value_new = NA
dat$value_new[!is.na(dat$id)][1:k] = dat$value[is.na(dat$id)]
dat
# id value value_new
# 1 1 NA 7
# 2 2 NA NA
# 3 3 NA 4
# 4 4 NA 1
# 5 5 NA 9
# 6 6 NA NA
# 7 NA 7 NA
# 8 NA NA NA
# 9 NA 4 NA
# 10 NA 1 NA
# 11 NA 9 NA
where k is the number of values you'll replace in the top part of your new column.

dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9),
"value_new"=c(7,NA,4,1,9,rep(NA,5)))
ind <- which(!is.na(dat$value))[1]
newcol <- `length<-`(dat$value[ind:nrow(dat)], nrow(dat))
dat$value_new2 <- newcol
# id value value_new value_new2
#1 1 NA 7 7
#2 2 NA NA NA
#3 3 NA 4 4
#4 4 NA 1 1
#5 5 NA 9 9
#6 NA 7 NA NA
#7 NA NA NA NA
#8 NA 4 NA NA
#9 NA 1 NA NA
#10 NA 9 NA NA
Short version:
dat$value_new2 <- `length<-`(dat$value[which(!is.na(dat$value))[1]:nrow(dat)], nrow(dat))
This removes the leading run of NAs from value and pads the end with NAs to restore the original length. The ids are not considered here.

R: In dataframe: set first non-NA value in column to NA

I have a large dataframe, 300+ columns (time series) with about 2600 observations. The columns are filled with a lot of NA's and then a short time series, and then typically NA's again. I would like to find the first non-NA value in each column and replace it with NA.
This is what I'm hoping to achieve, only with a much bigger dataframe:
Before:
x1 x2 x3 x4
1 NA NA NA NA
2 NA NA NA NA
3 1 1 NA NA
4 2 2 1 1
5 3 3 2 2
6 4 4 3 3
7 5 5 4 4
8 6 6 5 5
9 7 7 6 6
10 8 8 7 7
11 9 9 NA NA
12 10 10 NA NA
13 NA NA NA NA
14 NA NA NA NA
After:
x1 x2 x3 x4
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 2 2 NA NA
5 3 3 2 2
6 4 4 3 3
7 5 5 4 4
8 6 6 5 5
9 7 7 6 6
10 8 8 7 7
11 9 9 NA NA
12 10 10 NA NA
13 NA NA NA NA
14 NA NA NA NA
I've searched around and found a way to do this for a single column, but my efforts to apply it to the whole dataframe have proven difficult.
I have created an example dataframe to reproduce my original dataframe:
#Dataframe with NA
x1=x2=c(NA,NA,1:10,NA,NA)
x3=x4=c(NA,NA,NA,1:7,NA,NA,NA,NA)
df=data.frame(x1,x2,x3,x4)
I have used this to replace the first non-NA value with NA in one column (provided by @Joshua Ulrich); however, I would like to apply it to all columns without manually editing the code 300+ times:
NonNAindex <- which(!is.na(df[,1]))
firstNonNA <- min(NonNAindex)
is.na(df[,1]) <- seq(firstNonNA, length.out=1)
I have tried to set the above as a function and run it for all columns with apply/lapply, as well as a for loop, but haven't really figured out how to apply the changes to my dataframe. I'm sure there is something I've completely overlooked as I'm just taking my first small steps in R.
All suggestions would be highly appreciated!

We can use base R (df1 here is the data frame called df in the question):
df1[] <- lapply(df1, function(x) replace(x, which(!is.na(x))[1], NA))
df1
# x1 x2 x3 x4
#1 NA NA NA NA
#2 NA NA NA NA
#3 NA NA NA NA
#4 2 2 NA NA
#5 3 3 2 2
#6 4 4 3 3
#7 5 5 4 4
#8 6 6 5 5
#9 7 7 6 6
#10 8 8 7 7
#11 9 9 NA NA
#12 10 10 NA NA
#13 NA NA NA NA
#14 NA NA NA NA
Or, as @thelatemail suggested:
df1[] <- lapply(df1, function(x) replace(x, Position(Negate(is.na), x), NA))

Since you would like to do this for all columns, you could use the mutate_all function from dplyr. See http://dplyr.tidyverse.org/ for more information and examples.
library(dplyr)
mutate_all(df, funs(if_else(row_number() == min(which(!is.na(.))), NA_integer_, .)))
#> x1 x2 x3 x4
#> 1 NA NA NA NA
#> 2 NA NA NA NA
#> 3 NA NA NA NA
#> 4 2 2 NA NA
#> 5 3 3 2 2
#> 6 4 4 3 3
#> 7 5 5 4 4
#> 8 6 6 5 5
#> 9 7 7 6 6
#> 10 8 8 7 7
#> 11 9 9 NA NA
#> 12 10 10 NA NA
#> 13 NA NA NA NA
#> 14 NA NA NA NA
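In current dplyr releases funs() is deprecated and mutate_all() has been superseded by across(), so the same idea might be written as below. This is a sketch on the question's example data that reuses the replace() logic from the base R answer; result is an illustrative name.

```r
library(dplyr)

# Example data from the question
x1 <- x2 <- c(NA, NA, 1:10, NA, NA)
x3 <- x4 <- c(NA, NA, NA, 1:7, NA, NA, NA, NA)
df <- data.frame(x1, x2, x3, x4)

result <- df %>%
  mutate(across(
    everything(),
    # which(!is.na(.x))[1] is the position of the first non-NA value
    ~ replace(.x, which(!is.na(.x))[1], NA)
  ))
```

To restrict the change to some columns, everything() can be swapped for a selection such as starts_with("x").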
