R Count elements of data frame and add a row - r

I am reading data into two data frames using this code:
IdaEmpA <- data.frame(table(unlist(DadosA$idade)))
IdaEmpB <- data.frame(table(unlist(DadosB$idade)))
Then I want to add a row with the number of NAs to those data frames. I tried this:
IdaEmpA = rbind(IdaEmpA,c(7,sum(is.na(DadosA$idade))))
IdaEmpB = rbind(IdaEmpB,c(7,sum(is.na(DadosB$idade))))
The resulting data is:
> IdaEmpA
RespA QuantA
1 1 11
2 2 13
3 3 15
4 4 3
5 5 18
6 6 1
> IdaEmpB
RespB QuantB
1 1 18
2 2 14
3 3 21
4 4 2
5 6 13
But I am getting a warning and the value is not being added to the first column:
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = 7) :
invalid factor level, NA generated
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = 7) :
invalid factor level, NA generated
Results after the warning:
> IdaEmpA
RespA QuantA
1 1 11
2 2 13
3 3 15
4 4 3
5 5 18
6 6 1
7 <NA> 1
> IdaEmpB
RespB QuantB
1 1 18
2 2 14
3 3 21
4 4 2
5 6 13
6 <NA> 3
How can I get the value 7 instead of NA?
Any clues would help, thanks!

This occurs when the column is a factor. If the value in the new row for that column is not among the levels of the factor, you get this warning. For example, if both columns are "numeric", there is no warning.
rbind(IdaEmpA,c(7,5))
# RespA QuantA
#1 1 11
#2 2 13
#3 3 15
#4 4 3
#5 5 18
#6 6 1
#7 7 5
If one of the columns is a factor
IdaEmpA$RespA <- factor(IdaEmpA$RespA)
rbind(IdaEmpA,c(7,5))
# RespA QuantA
#1 1 11
#2 2 13
#3 3 15
#4 4 3
#5 5 18
#6 6 1
#7 <NA> 5
#Warning message:
#In `[<-.factor`(`*tmp*`, ri, value = 7) :
# invalid factor level, NA generated
Because the column in "IdaEmpA" appears to hold numeric values, we can convert it to numeric before doing the rbind. Go through the levels rather than calling as.numeric() on the factor directly, which would return the internal integer codes:
IdaEmpA$RespA <- with(IdaEmpA, as.numeric(levels(RespA))[RespA])
If there are multiple columns that need to be converted back to numeric:
indx <- sapply(IdaEmpA, is.factor)
IdaEmpA[indx] <- lapply(IdaEmpA[indx], function(x)
as.numeric(levels(x))[x])
This can all be avoided when reading the dataset with read.table/read.csv: use stringsAsFactors=FALSE so that columns of "character" class are not converted to "factor".
Once you have corrected the rbind step, it will be easier to do the merge.
data
IdaEmpA <- structure(list(RespA = 1:6, QuantA = c(11L, 13L, 15L, 3L, 18L,
1L)), .Names = c("RespA", "QuantA"), class = "data.frame", row.names =
c("1", "2", "3", "4", "5", "6"))
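The conversion described above can be sketched end to end as follows. The vector `idade` is made-up stand-in data (the real DadosA$idade is not shown in the question), and the column names are set by hand to match the printed output, so treat this as an illustration under those assumptions.

```r
# Hypothetical stand-in for DadosA$idade (the real data is not shown)
idade <- c(1, 2, 2, 3, NA, 6, 6, 6)

# table() drops NAs by default; data.frame() stores the categories as a factor
IdaEmpA <- data.frame(table(idade))
names(IdaEmpA) <- c("RespA", "QuantA")

# as.numeric() on the factor itself returns the internal codes, not the values
codes <- as.numeric(IdaEmpA$RespA)

# Going through the levels recovers the original values
IdaEmpA$RespA <- as.numeric(levels(IdaEmpA$RespA))[IdaEmpA$RespA]

# With a numeric column, the new row can be appended without the warning
IdaEmpA <- rbind(IdaEmpA, c(7, sum(is.na(idade))))
print(IdaEmpA)
```

The same steps would apply to IdaEmpB; only the input vector changes.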

Related

Shift select columns one to the right and replace empty space with the row number - 1

I have a data frame that looks similar to this (I've cut some out for easier reference, data has 93 rows):
Rank 1 A B C D
34 (TPE) 2 4 6 12
35 (TUR) 2 2 9 13
36 (GRE) 2 1 1 4
(UGA) 2 1 1 4 <NA>
I need to have the columns line up, but some of the data in "Rank" is offset to the left one column. I have assigned the rows with this problem to a vector:
off.set.rows <- c(which(is.na(df[ , 6])))
I need to have all rows in that vector shift one column to the right and replace the empty space it leaves in column 1 with the number in column 1 in the row previous to it. It should look like this:
Rank 1 A B C D
34 (TPE) 2 4 6 12
35 (TUR) 2 2 9 13
36 (GRE) 2 1 1 4
36 (UGA) 2 1 1 4
I've tried this: df[off.set.rows, 1:(ncol(df))] <- df[off.set.rows, 2:(ncol(df))], but it shifts everything in the row one column to the left, so (UGA) disappears. The NA moves to column 5, and the first value of the shifted row is repeated in column 6, like this:
Rank 1 A B C D
34 (TPE) 2 4 6 12
35 (TUR) 2 2 9 13
36 (GRE) 2 1 1 4
2 1 1 4 <NA> 2
Help is much appreciated!!
A mostly base R solution (lag() here is dplyr's lag(), so dplyr must be loaded). How it works:
Subset df to only those rows that meet the criteria defined in off.set.rows, and store them in x
Add a new column at the start of x
Copy the column names from df to x
Bind the rows of df and x together
Remove the rows that meet the criteria defined in off.set.rows
Use lag() to fill Rank with the value from the row above
off.set.rows <- c(which(is.na(df[ , 6])))
x <- subset(df, rownames(df) %in% off.set.rows)
x <- cbind(new=0, x)
colnames(x) <- colnames(df)
df <- rbind(df, x[1:6])
df <- subset(df, !rownames(df) %in% off.set.rows)
df$Rank <- ifelse(df$Rank==0, dplyr::lag(df$Rank), df$Rank)
Rank X1 A B C D
1 34 (TPE) 2 4 6 12
2 35 (TUR) 2 2 9 13
3 36 (GRE) 2 1 1 4
41 36 (UGA) 2 1 1 4
data:
df <- structure(list(Rank = c("34", "35", "36", "(UGA)"), X1 = c("(TPE)",
"(TUR)", "(GRE)", "2"), A = c(2L, 2L, 2L, 1L), B = c(4L, 2L,
1L, 1L), C = c(6L, 9L, 1L, 4L), D = c(12L, 13L, 4L, NA)), class = "data.frame", row.names = c(NA,
-4L))
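One caveat about the last step: without dplyr loaded, lag() resolves to stats::lag(), which only shifts the time base of a series and leaves the values where they are. A minimal sketch of a base R replacement, using a made-up numeric vector in place of the Rank column:

```r
# Illustrative stand-in for the Rank column, with 0 as the placeholder
x <- c(34, 35, 36, 0)

# stats::lag() changes the "tsp" attribute only; the values do not move
same <- as.vector(stats::lag(x))

# Base R equivalent of dplyr::lag(): prepend NA and drop the last element
shifted <- c(NA, head(x, -1))

# Replace the 0 placeholder with the value from the row above
filled <- ifelse(x == 0, shifted, x)
print(filled)
```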
I think the attempt in the original post was close. The column numbers to replace should be from 2 to ncol(df) (2-6). And those are replaced by columns 1 through ncol(df) - 1 (1-5).
After moving values for those particular rows, I might consider replacing the first column in those rows with NA, then use fill from tidyr to replace them with the last non-missing value. This will also take care of situations when you may have consecutive offset rows (if that is a possibility).
library(tidyr)
off.set.rows <- c(which(is.na(df[ , ncol(df)])))
df[off.set.rows, 2:ncol(df)] <- df[off.set.rows, 1:(ncol(df)-1)]
df[off.set.rows, 1] <- NA
fill(df, 1, .direction = "down")
Output
Rank X1 A B C D
1 34 (TPE) 2 4 6 12
2 35 (TUR) 2 2 9 13
3 36 (GRE) 2 1 1 4
4 36 (UGA) 2 1 1 4
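Two details of this approach can be checked in isolation. The sketch below first shows why the column range needs parentheses (`:` binds more tightly than `-`), then fills NAs downward in base R as a possible alternative to tidyr::fill(). The vector `rank` is illustrative, and the fill-down trick assumes the first element is not NA.

```r
n <- 6
a <- 1:n - 1      # parsed as (1:n) - 1, which gives 0 1 2 3 4 5
b <- 1:(n - 1)    # the intended range 1 2 3 4 5

# Base R fill-down: for each position, the index of the last non-NA value so far
rank <- c("34", "35", "36", NA)
idx <- cummax(seq_along(rank) * !is.na(rank))
filled <- rank[idx]
print(filled)
```

Indexing with `0:5` happens to select the same columns as `1:5` because a zero index is silently dropped, but the explicit form states the intent.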

Extracting only numeric parts of a string column with apply and sub breaks

I have a data frame dat in R, which looks like this:
id x y z
1 0 4 California 15 MSG 2017/11
2 0 1 Nationally Representative 11 MSG 2016/04
3 1 1 Nationally Representative 8 MSG 2016/01
4 0 1 Nationally Representative 1 ASDE 2014/01
5 2 1 Nationally Representative 8 MSG 2016/01
6 0 1 Nationally Representative 5 MSG 2015/07
Now I want to loop through each column and keep only the numeric part at the beginning, e.g. in the first row I want to keep the "4" for variable y, the "15" for variable z, and so on.
I tried the following (i.e. searching for a space in each column and deleting it plus everything after it):
dat_new = apply(dat, 2, function(x) sub(" .+", "", x)) # searches for a space and deletes the space + everything after it
dat_new = as.data.frame(apply(dat_new, 2, as.numeric))
However, what works for a small subset of the data, e.g. the first six rows, eventually breaks: my full data frame has 5100 rows, and applying the functions above leaves the first column ("id") empty, and the same happens to some other columns. I have found a workaround using an actual for loop, but I would still like to know what is wrong with my code and whether there is a more elegant solution.
Data types of dat are:
'data.frame': 5109 obs. of 4 variables:
$ id: int 1 2 3 4 5 6 7 8 9 10 ...
$ x : int 0 0 1 0 2 0 1 1 0 0 ...
$ y : Factor w/ 4 levels "1 Nationally Representative",..: 4 1 1 1 1 1 1 4 1 3 ...
$ z : Factor w/ 16 levels "1 ASDE 2014",..: 7 3 15 1 15 12 12 8 13 5 ...
Using base R we can lapply over the selected columns and extract the leading numeric part
cols <- c("y", "z")
dat[cols] <- lapply(dat[cols], function(x) as.numeric(sub("(^\\d+).*", "\\1", x)))
dat
# id x y z
#1 1 0 4 15
#2 2 0 1 11
#3 3 1 1 8
#4 4 0 1 1
#5 5 2 1 8
#6 6 0 1 5
We can use parse_number from readr on columns 'y' and 'z' to extract the first numeric substring
library(dplyr)
library(readr)
dat %>%
mutate_at(vars(y:z), list(~ parse_number(as.character(.))))
# d x y z
#1 1 0 4 15
#2 2 0 1 11
#3 3 1 1 8
#4 4 0 1 1
#5 5 2 1 8
#6 6 0 1 5
Or another option is to remove the substring from the space and then convert to numeric
library(stringr)
dat %>%
mutate_at(vars(y:z), list(~ as.numeric(str_remove(., "\\s+.*"))))
Or using base R, we remove the space followed by other characters and convert to numeric for columns other than the first
dat[-1] <- lapply(dat[-1], function(x) as.numeric(sub("\\s+.*", "", x)))
data
dat <- structure(list(d = 1:6, x = c(0L, 0L, 1L, 0L, 2L, 0L), y = structure(c(2L,
1L, 1L, 1L, 1L, 1L), .Label = c("1 Nationally Representative",
"4 California"), class = "factor"), z = structure(c(3L, 2L, 5L,
1L, 5L, 4L), .Label = c("1 ASDE 2014/01", "11 MSG 2016/04", "15 MSG 2017/11",
"5 MSG 2015/07", "8 MSG 2016/01"), class = "factor")), row.names = c(NA,
-6L), class = "data.frame")
An apply implementation (might be slow):
as.data.frame(apply(dat,2,function(x) gsub("[A-Z].*","",x)))
d x y z
1 1 0 4 15
2 2 0 1 11
3 3 1 1 8
4 4 0 1 1
5 5 2 1 8
6 6 0 1 5
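As for why the original apply() call breaks on the full data: apply() first converts the data frame with as.matrix(), and when the data frame has non-numeric columns, the numeric columns go through format(), which pads them to a common width. Once the ids have different numbers of digits, the shorter ones get a leading space, and sub(" .+", "", x) then deletes the whole value. A small sketch with made-up data that reproduces this, alongside an lapply() version that avoids the matrix conversion:

```r
# Made-up two-row data frame; ids of different widths trigger the padding
dat <- data.frame(id = c(1L, 10L),
                  y = c("4 California", "1 Nationally Representative"),
                  stringsAsFactors = FALSE)

# as.matrix() pads the integer column to a common width: " 1" and "10"
m <- as.matrix(dat)

# The leading space matches " .+", so the id 1 is wiped out
broken <- apply(dat, 2, function(x) sub(" .+", "", x))

# lapply() works column by column, so no padding is introduced
fixed <- dat
fixed[] <- lapply(dat, function(x) as.numeric(sub(" .+", "", x)))
print(fixed)
```

This would explain why the first six rows work: all their ids have one digit, so nothing is padded.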

Newly created data frame loses the labels for the categories of its vectors

I have a data frame like this:
> str(dynamics)
'data.frame': 3517 obs. of 3 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ y2015: int 245 129 301 162 123 125 115 47 46 135 ...
$ y2016: int NA 385 420 205 215 295 130 NA NA 380 ...
I take out the 3 vectors and name them differently,
Column 1:
> plantid <- dynamics$id
> head(plantid)
[1] 1 2 3 4 5 6
Column 2:
(I divide it into different classes and label them 2,3,4 and 5)
> y15 <- dynamics$y2015
> year15 <- cut(y15, breaks = c(-Inf, 50, 100, 150, Inf), labels = c("2", "3", "4", "5"))
> str(year15)
Factor w/ 4 levels "2","3","4","5": 4 3 4 4 3 3 3 1 1 3 ...
> head(year15)
[1] 5 4 5 5 4 4
Levels: 2 3 4 5
Column 3:
(Same here)
> y16 <- dynamics$y2016
> year16 <- cut(y16, breaks = c(-Inf, 50, 100, 150, Inf), labels = c("2", "3", "4", "5"))
> str(year16)
Factor w/ 4 levels "2","3","4","5": NA 4 4 4 4 4 3 NA NA 4 ...
> head(year16)
[1] <NA> 5 5 5 5 5
Levels: 2 3 4 5
So far so good!
The problem arises when I combine the three vectors above with cbind() to form a new data frame: the factor levels of the new columns are gone.
Here is my code:
SD1 = data.frame(cbind(plantid, year15, year16))
head(SD1)
and I get a data frame like this:
> head(SD1)
plantid year15 year16
1 1 4 NA
2 2 3 4
3 3 4 4
4 4 4 4
5 5 3 4
6 6 3 4
As you can see, the values in the 2nd and 3rd columns have changed from 2, 3, 4, 5 back to 1, 2, 3, 4.
How do I fix that?
cbind is most commonly used to combine objects into matrices. It strips special attributes from its inputs so that they can be combined into a single object. This means that data types that rely on such attributes (such as the levels and class attributes of factors, or the class attribute of Dates) are reduced to their underlying numeric representations. This is why cbind turns your factors into their integer codes.
Conversely, data.frame() by itself will preserve the individual object attributes. In this case, your use of cbind is unnecessary. To preserve your factor levels, simply use:
SD1 <- data.frame(plantid, year15, year16)
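The difference can be seen on a three-row example. The input values below are the first, second and eighth y2015 values from the question; everything else is an illustrative sketch.

```r
plantid <- 1:3
year15 <- cut(c(245, 129, 47),
              breaks = c(-Inf, 50, 100, 150, Inf),
              labels = c("2", "3", "4", "5"))

# cbind() builds a matrix, so the factor is reduced to its integer codes
m <- cbind(plantid, year15)

# data.frame() keeps the factor and its labels
SD1 <- data.frame(plantid, year15)

# If numbers are needed later, convert through the labels, not the codes
as_numbers <- as.numeric(as.character(SD1$year15))
print(SD1)
```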

How to remove rows from a R data frame that have NA in two columns (NA in both columns NOT either one)? [duplicate]

I have an R data frame df, shown below
a b c
1 6 NA
2 NA 4
3 7 NA
NA 8 1
4 9 10
NA NA 7
5 10 8
I want to remove the row which has NA in BOTH a & b
My desired output will be
a b c
1 6 NA
2 NA 4
3 7 NA
NA 8 1
4 9 10
5 10 8
I tried something like this:
df1<-df[(is.na(df$a)==FALSE & is.na(df$b)==FALSE),]
but this removes every row that has an NA in either column (in effect an OR on the NAs). I need an AND operation here.
How do I do it?
You can try:
df1<-df[!(is.na(df$a) & is.na(df$b)), ]
Or using rowSums (this assumes df contains only the columns a and b):
df[!rowSums(is.na(df))==2,]
A slightly shorter version that saves a character[1]:
df[rowSums(is.na(df))!=2,]
output:
a b
1 1 6
2 2 NA
3 3 7
4 NA 8
5 4 9
7 5 10
This can be generalized using ncol:
df[!rowSums(is.na(df))==ncol(df),]
[1] Credits: alistaire
We can use rowSums on a logical matrix (is.na(df1)) and convert that to a logical vector (rowSums(...) < ncol(df1)) to subset the rows.
df1[rowSums(is.na(df1)) < ncol(df1),]
Or another option is Reduce with lapply
df1[!Reduce(`&`, lapply(df1, is.na)),]
Another approach
df[!apply(is.na(df),1,all),]
# a b
#1 1 6
#2 2 NA
#3 3 7
#4 NA 8
#5 4 9
#7 5 10
Data
df <- structure(list(a = c(1L, 2L, 3L, NA, 4L, NA, 5L), b = c(6L, NA,
7L, 8L, 9L, NA, 10L)), .Names = c("a", "b"), class = "data.frame", row.names = c(NA,
-7L))
This will also work:
df[apply(df, 1, function(x) sum(is.na(x)) != ncol(df)),]
a b
1 1 6
2 2 NA
3 3 7
4 NA 8
5 4 9
7 5 10
My idea is basically the same as in the other replies.
For any dataset, a row consisting entirely of NAs has a sum of !is.na(ROW) equal to zero, so you just have to drop the rows where that sum is zero.
A logical index is safer here than -which(...), which would return zero rows if no row were all NA:
df1 = df[rowSums(!is.na(df)) != 0,]
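Most of the answers above use a data frame with only columns a and b, while the question's data also has a column c. A sketch that restricts the NA check to the two columns of interest on the question's three-column data (the `cols` and `keep` names are just for illustration), plus a quick look at the -which() pitfall:

```r
# The data from the question, including column c
df <- data.frame(a = c(1, 2, 3, NA, 4, NA, 5),
                 b = c(6, NA, 7, 8, 9, NA, 10),
                 c = c(NA, 4, NA, 1, 10, 7, 8))

# Check only the columns of interest, so NAs in c are not counted
cols <- c("a", "b")
keep <- rowSums(is.na(df[cols])) < length(cols)
df1 <- df[keep, ]
print(df1)

# Pitfall: when nothing matches, which() returns integer(0) and
# negative indexing with it drops every row instead of none
d <- data.frame(a = 1:2, b = 3:4)
emptied <- d[-which(rowSums(!is.na(d)) == 0), ]
```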

Converting data to a 2 column format?

If I have a dataset like the following:
LA NY MA
1  2  3
4  5  6
3  5
   4
(In other words, each row has a different structure. LA has 3 values, NY has 4 values, etc.)
I am trying to use lm to perform an ANOVA test (to decide whether the mean number is the same in each state), and it keeps showing "an error occurred" because rows do not match. One idea I got was to convert data to a 2-column format. Which command/package should I use to perform that task?
Edit: the data is from the txt file.
Another option, after reading the file, to convert it to a 2-column format would be
df <- read.table("Betty.txt", header=TRUE, fill=TRUE, sep="\t")
## (as @Richard Scriven mentioned in the comment)
na.omit(stack(df))
# values ind
#1 1 LA
#2 4 LA
#3 3 LA
#5 2 NY
#6 5 NY
#7 5 NY
#8 4 NY
#9 3 MA
#10 6 MA
Update
I got the result above after transforming the data to use a \t delimiter. But if the file is copied and pasted directly from the OP's post without any change (making sure that there are spaces in the 3rd and 4th rows after the 2nd column):
lines <- readLines('Betty1.txt')
lines2 <- gsub("(?<=[^ ]) +|^[ ]+(?<=[ ])(?=[^ ])", ",", lines, perl=TRUE)
lines2
#[1] "LA,NY,MA" "1,2,3" "4,5,6" "3,5," ",4,"
df1 <- read.table(text=lines2, sep=',', header=TRUE)
df1
# LA NY MA
#1 1 2 3
#2 4 5 6
#3 3 5 NA
#4 NA 4 NA
and then do
na.omit(stack(df1))
Update2
Another option if you have fixed width columns is to use read.fwf
df <- read.fwf('Betty1.txt', widths=c(3,3,3), skip=1)
colnames(df) <- scan('Betty1.txt', nlines=1, what="", quiet=TRUE)
df
# LA NY MA
#1 1 2 3
#2 4 5 6
#3 3 5 NA
#4 NA 4 NA
library(tidyr)
gather(df, Var, Val, LA:MA, na.rm=TRUE)
# Var Val
#1 LA 1
#2 LA 4
#3 LA 3
#4 NY 2
#5 NY 5
#6 NY 5
#7 NY 4
#8 MA 3
#9 MA 6
Just add an 'NA' to the 4th line of your text and try:
> ddf = read.table(text="
+ LA NY MA
+ 1 2 3
+ 4 5 6
+ 3 5
+ NA 4
+ ", header=T, fill=T)
>
> ddf
LA NY MA
1 1 2 3
2 4 5 6
3 3 5 NA
4 NA 4 NA
>
> dput(ddf)
structure(list(LA = c(1L, 4L, 3L, NA), NY = c(2L, 5L, 5L, 4L),
MA = c(3L, 6L, NA, NA)), .Names = c("LA", "NY", "MA"), class = "data.frame", row.names = c(NA,
-4L))
>
> library(reshape2)
> mm = melt(ddf)
No id variables; using all as measure variables
>
> mm
variable value
1 LA 1
2 LA 4
3 LA 3
4 LA NA
5 NY 2
6 NY 5
7 NY 5
8 NY 4
9 MA 3
10 MA 6
11 MA NA
12 MA NA
>
> with(mm, aov(value~variable))
Call:
aov(formula = value ~ variable)
Terms:
variable Residuals
Sum of Squares 4.833333 15.166667
Deg. of Freedom 2 6
Residual standard error: 1.589899
Estimated effects may be unbalanced
3 observations deleted due to missingness
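The same analysis can also be sketched with base R only, using stack() for the two-column format and aov() for the test. The data frame is typed in by hand from the values shown above; `values` and `ind` are the column names that stack() produces.

```r
# The data from the question, entered by hand with NA for the empty cells
ddf <- data.frame(LA = c(1, 4, 3, NA),
                  NY = c(2, 5, 5, 4),
                  MA = c(3, 6, NA, NA))

# Two-column (long) format; drop the rows that only hold padding NAs
long <- na.omit(stack(ddf))

# One-way ANOVA of the values by state
fit <- aov(values ~ ind, data = long)
print(summary(fit))
```

Dropping the NAs before fitting gives the same model as letting aov() discard them, since it omits incomplete rows by default.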
