Converting data to a 2 column format? - r

If I have a dataset like the following:
LA NY MA
1 2 3
4 5 6
3 5
4
(In other words, each row has a different structure. LA has 3 values, NY has 4 values, etc.)
I am trying to use lm to perform an ANOVA test (to decide whether the mean number is the same in each state), and it keeps showing "an error occurred" because rows do not match. One idea I got was to convert data to a 2-column format. Which command/package should I use to perform that task?
Edit: the data is from the txt file.

Another option after you read the file to convert to a 2-column format would be
df <- read.table("Betty.txt", header=TRUE, fill=TRUE, sep="\t")
## (as @Richard Scriven mentioned in the comments)
na.omit(stack(df))
# values ind
#1 1 LA
#2 4 LA
#3 3 LA
#5 2 NY
#6 5 NY
#7 5 NY
#8 4 NY
#9 3 MA
#10 6 MA
Update
I got the output above after converting the file to use a \t delimiter. If the file is instead copy/pasted directly from the OP's post without any change (making sure that the 3rd and 4th rows keep their spaces after the 2nd column), the delimiters can be rebuilt first:
lines <- readLines('Betty1.txt')
lines2 <- gsub("(?<=[^ ]) +|^[ ]+(?<=[ ])(?=[^ ])", ",", lines, perl=TRUE)
lines2
#[1] "LA,NY,MA" "1,2,3" "4,5,6" "3,5," ",4,"
df1 <- read.table(text=lines2, sep=',', header=TRUE)
df1
# LA NY MA
#1 1 2 3
#2 4 5 6
#3 3 5 NA
#4 NA 4 NA
and then do
na.omit(stack(df1))
Update2
Another option if you have fixed width columns is to use read.fwf
df <- read.fwf('Betty1.txt', widths=c(3,3,3), skip=1)
colnames(df) <- scan('Betty1.txt', nlines=1, what="", quiet=TRUE)
df
# LA NY MA
#1 1 2 3
#2 4 5 6
#3 3 5 NA
#4 NA 4 NA
library(tidyr)
gather(df, Var, Val, LA:MA, na.rm=TRUE)
# Var Val
#1 LA 1
#2 LA 4
#3 LA 3
#4 NY 2
#5 NY 5
#6 NY 5
#7 NY 4
#8 MA 3
#9 MA 6

Just add an 'NA' to the 4th line of your text and try:
> ddf = read.table(text="
+ LA NY MA
+ 1 2 3
+ 4 5 6
+ 3 5
+ NA 4
+ ", header=T, fill=T)
>
> ddf
LA NY MA
1 1 2 3
2 4 5 6
3 3 5 NA
4 NA 4 NA
>
> dput(ddf)
structure(list(LA = c(1L, 4L, 3L, NA), NY = c(2L, 5L, 5L, 4L),
MA = c(3L, 6L, NA, NA)), .Names = c("LA", "NY", "MA"), class = "data.frame", row.names = c(NA,
-4L))
>
> library(reshape2)  # provides melt()
> mm = melt(ddf)
No id variables; using all as measure variables
>
> mm
variable value
1 LA 1
2 LA 4
3 LA 3
4 LA NA
5 NY 2
6 NY 5
7 NY 5
8 NY 4
9 MA 3
10 MA 6
11 MA NA
12 MA NA
>
> with(mm, aov(value~variable))
Call:
aov(formula = value ~ variable)
Terms:
variable Residuals
Sum of Squares 4.833333 15.166667
Deg. of Freedom 2 6
Residual standard error: 1.589899
Estimated effects may be unbalanced
3 observations deleted due to missingness
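
For reference, here is a minimal base R sketch of the whole path from the wide table to the ANOVA through lm, as asked in the question. The data frame is typed in by hand (an assumption, standing in for the text file), and the names long and fit are only illustrative.

```r
# Wide data with NA padding, typed in by hand for illustration
df <- data.frame(LA = c(1, 4, 3, NA),
                 NY = c(2, 5, 5, 4),
                 MA = c(3, 6, NA, NA))

# Two-column (long) format: 'values' holds the numbers, 'ind' the state
long <- na.omit(stack(df))

# One-way ANOVA fitted through lm
fit <- lm(values ~ ind, data = long)
anova(fit)
```

Since this fits the same model as the aov call above, the sums of squares and degrees of freedom reported by anova(fit) should agree with that output.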


Shift select columns one to the right and replace empty space with the row number - 1

I have a data frame that looks similar to this (I've cut some out for easier reference, data has 93 rows):
Rank 1 A B C D
34 (TPE) 2 4 6 12
35 (TUR) 2 2 9 13
36 (GRE) 2 1 1 4
(UGA) 2 1 1 4 <NA>
I need to have the columns line up, but some of the data in "Rank" is offset to the left one column. I have assigned the rows with this problem to a vector:
off.set.rows <- c(which(is.na(df[ , 6])))
I need to have all rows in that vector shift one column to the right and replace the empty space it leaves in column 1 with the number in column 1 in the row previous to it. It should look like this:
Rank 1 A B C D
34 (TPE) 2 4 6 12
35 (TUR) 2 2 9 13
36 (GRE) 2 1 1 4
36 (UGA) 2 1 1 4
I've tried this: df[off.set.rows, 1:(ncol(df))] <- df[off.set.rows, 2:(ncol(df))], but it shifts everything in the row left one column and (UGA) disappears. The <NA> moves to column 5, and the value that moved into column 2 is repeated in column 6, like this:
Rank 1 A B C D
34 (TPE) 2 4 6 12
35 (TUR) 2 2 9 13
36 (GRE) 2 1 1 4
2 1 1 4 <NA> 2
Help is much appreciated!!
Mostly base R solution (lag() comes from dplyr). How it works:
Subset df to only the rows listed in off.set.rows and store them in x
Add a new column at the start of x
Copy the column names from df to x
Bind the rows of df and x together
Remove the original rows listed in off.set.rows
Use dplyr's lag() to fill Rank with the value from the row above
off.set.rows <- c(which(is.na(df[ , 6])))
x <- subset(df, rownames(df) %in% off.set.rows)
x <- cbind(new=0, x)
colnames(x) <- colnames(df)
df <- rbind(df, x[1:6])
df <- subset(df, !rownames(df) %in% off.set.rows)
# dplyr::lag shifts the vector by one position; stats::lag would not
df$Rank <- ifelse(df$Rank==0, dplyr::lag(df$Rank), df$Rank)
Rank X1 A B C D
1 34 (TPE) 2 4 6 12
2 35 (TUR) 2 2 9 13
3 36 (GRE) 2 1 1 4
41 36 (UGA) 2 1 1 4
data:
df <- structure(list(Rank = c("34", "35", "36", "(UGA)"), X1 = c("(TPE)",
"(TUR)", "(GRE)", "2"), A = c(2L, 2L, 2L, 1L), B = c(4L, 2L,
1L, 1L), C = c(6L, 9L, 1L, 4L), D = c(12L, 13L, 4L, NA)), class = "data.frame", row.names = c(NA,
-4L))
I think the attempt in the original post was close. The column numbers to replace should be from 2 to ncol(df) (2-6). And those are replaced by columns 1 through ncol(df) - 1 (1-5).
After moving values for those particular rows, I might consider replacing the first column in those rows with NA, then use fill from tidyr to replace them with the last non-missing value. This will also take care of situations when you may have consecutive offset rows (if that is a possibility).
library(tidyr)
off.set.rows <- c(which(is.na(df[ , ncol(df)])))
# parentheses matter: 1:ncol(df)-1 is evaluated as (1:ncol(df)) - 1
df[off.set.rows, 2:ncol(df)] <- df[off.set.rows, 1:(ncol(df) - 1)]
df[off.set.rows, 1] <- NA
fill(df, 1, .direction = "down")
Output
Rank X1 A B C D
1 34 (TPE) 2 4 6 12
2 35 (TUR) 2 2 9 13
3 36 (GRE) 2 1 1 4
4 36 (UGA) 2 1 1 4
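
A base R sketch of the same shift-then-fill idea, in case tidyr is not available. It assumes the first row is never an offset row, and it converts everything to character before shifting so that mixed column types cannot get in the way. The names chr, idx and fixed are only illustrative.

```r
df <- data.frame(
  Rank = c("34", "35", "36", "(UGA)"),
  X1   = c("(TPE)", "(TUR)", "(GRE)", "2"),
  A = c(2L, 2L, 2L, 1L), B = c(4L, 2L, 1L, 1L),
  C = c(6L, 9L, 1L, 4L), D = c(12L, 13L, 4L, NA),
  stringsAsFactors = FALSE
)

n   <- ncol(df)
off <- which(is.na(df[[n]]))          # rows that are shifted one column left

# Work on character columns so the shift does not mix types
chr <- as.data.frame(lapply(df, as.character), stringsAsFactors = FALSE)
chr[off, 2:n] <- chr[off, 1:(n - 1)]  # move the offset rows one column right
chr[off, 1]   <- NA                   # blank out the vacated first column

# Carry the last non-missing Rank forward (assumes row 1 is not missing)
idx <- cummax(seq_len(nrow(chr)) * !is.na(chr$Rank))
chr$Rank <- chr$Rank[idx]

# Let R guess the column types again
fixed <- type.convert(chr, as.is = TRUE)
fixed
```

Because of the cummax step, consecutive offset rows would all receive the Rank of the last good row above them.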

Replace whole word or words with partial match in R

I have a data frame with thousands of misspelled city names. I need to correct these and can't find the solution though I've searched extensively. I've tried several functions and approaches
This is a miniature sample of the data:
citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
"city" = c("BORNE","BOERNAE","BARNE","BOERNE",
"GALDEN","GELDON","GOELDEN","GOLDEN"))
num city
1 1 BORNE
2 2 BOERNAE
3 3 BARNE
4 4 BOERNE
5 5 GALDEN
6 6 GELDON
7 7 GOELDEN
8 8 GOLDEN
These are some of the functions I've tried (I also tried many more, including str_replace and str_detect):
cit <- function(x){
ifelse(x %in% grepl(c("BOR","BOE","BAR")),"BOERNE",
ifelse(x %in% grepl(c("GAL","GEL","GOE")), "GOLDEN", "OTHER"))
}
Or
cit <- function(x){
ifelse(x %in% c("BOR","BOE","BAR"),"BOERNE",
ifelse(x %in% c("GAL","GEL","GOE"), "GOLDEN", "OTHER"))
}
Run code:
citA$city2 <- cit(citA$city)
Incorrect result:
num city city2
1 1 BORNE OTHER
2 2 BOERNAE OTHER
3 3 BARNE OTHER
4 4 BOERNE OTHER
5 5 GALDEN OTHER
6 6 GELDON OTHER
7 7 GOELDEN OTHER
8 8 GOLDEN OTHER
Also tried:
citA$city[grepl(c("BOR","BOE","BAR"),citA$city)] <- "BOERNE"
But that gives a warning, and only the first pattern is used:
Warning message:
In grepl(c("BOR", "BOE", "BAR"), citA$city) :
argument 'pattern' has length > 1 and only the first element will be used
Your ideas would be greatly helpful!
If you have many such patterns, you can use case_when from dplyr:
library(dplyr)
library(stringr)
citA %>%
mutate(city2 = case_when(str_detect(city, 'BOR|BOE|BAR') ~ 'BOERNE',
str_detect(city, 'GAL|GEL|GOE|GOL') ~ 'GOLDEN',
TRUE ~ 'OTHER'))
# num city city2
#1 1 BORNE BOERNE
#2 2 BOERNAE BOERNE
#3 3 BARNE BOERNE
#4 4 BOERNE BOERNE
#5 5 GALDEN GOLDEN
#6 6 GELDON GOLDEN
#7 7 GOELDEN GOLDEN
#8 8 GOLDEN GOLDEN
The pattern argument of grep/grepl is not vectorized, i.e. it takes only a single element. We can paste the patterns into a single string separated by | (meaning OR):
citA$city[grepl(paste(c("BOR","BOE","BAR"), collapse="|"),citA$city)] <- "BOERNE"
citA
# num city
#1 1 BOERNE
#2 2 BOERNE
#3 3 BOERNE
#4 4 BOERNE
#5 5 GALDEN
#6 6 GELDON
#7 7 GOELDEN
#8 8 GOLDEN
NOTE: In R versions before 4.0.0, data.frame() creates the column 'city' as a factor by default. It should be of class character, which you get with stringsAsFactors = FALSE (the default since R 4.0.0).
data
citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
"city" = c("BORNE","BOERNAE","BARNE","BOERNE",
"GALDEN","GELDON","GOELDEN","GOLDEN"),
stringsAsFactors = FALSE)
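
With thousands of misspellings it may be easier to keep the corrections in one lookup and loop over it. A minimal base R sketch, assuming the misspellings can be recognised by their first three letters as in the sample; the list name fixes is hypothetical.

```r
citA <- data.frame(num = 1:8,
                   city = c("BORNE", "BOERNAE", "BARNE", "BOERNE",
                            "GALDEN", "GELDON", "GOELDEN", "GOLDEN"),
                   stringsAsFactors = FALSE)

# Hypothetical lookup: correct name -> prefixes that identify it
fixes <- list(BOERNE = c("BOR", "BOE", "BAR"),
              GOLDEN = c("GAL", "GEL", "GOE", "GOL"))

citA$city2 <- "OTHER"
for (target in names(fixes)) {
  # Collapse the prefixes into one regex, anchored at the start of the name
  pattern <- paste0("^(", paste(fixes[[target]], collapse = "|"), ")")
  citA$city2[grepl(pattern, citA$city)] <- target
}
citA
```

Anchoring with ^ is a design choice here: it avoids matching the three letters in the middle of some other city name.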
I've got a package on GitHub that may help; it allows recoding of factor levels with regex matching. Install the package with
devtools::install_github("jwilliman/xfactor")
citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
"city" = c("BORNE","BOERNAE","BARNE","BOERNE",
"GALDEN","GELDON","GOELDEN","GOLDEN"))
citA$city2 <- xfactor::xfactor(citA$city, levels = c(BOERNE = "BOR|BOE|BAR", GOLDEN = "GAL|GEL|GOE|GOL"))
citA
#> num city city2
#> 1 1 BORNE BOERNE
#> 2 2 BOERNAE BOERNE
#> 3 3 BARNE BOERNE
#> 4 4 BOERNE BOERNE
#> 5 5 GALDEN GOLDEN
#> 6 6 GELDON GOLDEN
#> 7 7 GOELDEN GOLDEN
#> 8 8 GOLDEN GOLDEN
Created on 2020-04-20 by the reprex package (v0.3.0)
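Otherwise, you could use the following function to clean/update the factor levels; it uses a similar syntax. (city has to be a factor here, hence stringsAsFactors = TRUE, which is no longer the default since R 4.0.0.)
citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
"city" = c("BORNE","BOERNAE","BARNE","BOERNE",
"GALDEN","GELDON","GOELDEN","GOLDEN"),
stringsAsFactors = TRUE)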
make_levels <- function(.f, patterns, replacement = NULL, ignore.case = FALSE) {
lvls <- levels(.f)
# Replacements can be listed in the replacement argument, taken as names in patterns, or the patterns themselves.
if(is.null(replacement)) {
if(is.null(names(patterns)))
replacement <- patterns
else
replacement <- names(patterns)
}
# Find matching levels
lvl_match <- setNames(vector("list", length = length(patterns)), replacement)
for(i in seq_along(patterns))
lvl_match[[replacement[i]]] <- grep(patterns[i], lvls, ignore.case = ignore.case, value = TRUE)
# Append other non-matching levels
lvl_other <- setdiff(lvls, unlist(lvl_match))
lvl_all <- append(
lvl_match,
setNames(as.list(lvl_other), lvl_other)
)
return(lvl_all)
}
levels(citA$city) <- make_levels(citA$city, c(BOERNE = "BOR|BOE|BAR", GOLDEN = "GAL|GEL|GOE|GOL"))
citA
#> num city
#> 1 1 BOERNE
#> 2 2 BOERNE
#> 3 3 BOERNE
#> 4 4 BOERNE
#> 5 5 GOLDEN
#> 6 6 GOLDEN
#> 7 7 GOLDEN
#> 8 8 GOLDEN
Created on 2020-04-20 by the reprex package (v0.3.0)

R - Replace values in a specific even column based on values from a odd specific column - Application to the whole dataframe

My data frame:
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1), B = c(2,5,6,8,4), qB = c(2,2,1,3,1))
For the pair A and qA (= quality of A): I want the values of A whose quality value is 1 or 3 to be replaced by NA.
The same applies to the pair B and qB.
The final data has to look like this:
desired_data <- data.frame(A = c(NA,5,6,NA,NA), qA = c(1,2,2,3,1), B = c(2,5,NA,NA,NA), qB = c(2,2,1,3,1))
My question is how to perform that?
I have a big dataframe with about 90 columns, so I need code which doesn't require the column names to work properly.
To help, I have this piece of code, which selects the columns starting with the letter "q":
data[,grep("^[q]", colnames(data))]
You could just do this...
data[,seq(1,ncol(data),2)][(data[,seq(2,ncol(data),2)]==1)|
(data[,seq(2,ncol(data),2)]==3)] <- NA
data
A qA B qB
1 NA 1 2 2
2 5 2 5 2
3 6 2 NA 1
4 NA 3 NA 3
5 NA 1 NA 1
One solution is to separate in two tables and use vectorisation in base R
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1), B = c(2,5,6,8,4), qB = c(2,2,1,3,1))
data
#> A qA B qB
#> 1 1 1 2 2
#> 2 5 2 5 2
#> 3 6 2 6 1
#> 4 8 3 8 3
#> 5 7 1 4 1
quality <- data[,grep("^[q]", colnames(data))]
data2 <- data[,setdiff(colnames(data), names(quality))]
data2[quality == 1 | quality == 3] <- NA
data2
#> A B
#> 1 NA 2
#> 2 5 5
#> 3 6 NA
#> 4 NA NA
#> 5 NA NA
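
If the columns are not guaranteed to alternate strictly, the pairs can also be matched by name. A minimal sketch, assuming every quality column is called "q" followed by the name of its value column (as in the sample data); the names q_cols, v_cols and bad are only illustrative.

```r
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1),
                   B = c(2,5,6,8,4), qB = c(2,2,1,3,1))

q_cols <- grep("^q", names(data), value = TRUE)  # "qA", "qB"
v_cols <- sub("^q", "", q_cols)                  # "A",  "B"
bad    <- c(1, 3)                                # quality codes to blank out

for (i in seq_along(q_cols)) {
  # Set the value to NA wherever its quality flag is one of the bad codes
  data[[v_cols[i]]][data[[q_cols[i]]] %in% bad] <- NA
}
data
```

This keeps the quality columns in place and does not depend on column order.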

Delete consecutive empty rows in R

df presents possible name matches. Each pair of matches should be divided by an empty row. However, in some cases my output includes several empty rows between the matching pairs:
> df <- data.frame(id = c(1,2,NA,3,4,NA,NA,NA,5,6,NA), name = c("john jones", "john joners",
NA, "clara prat", "klara prat", NA, NA, NA, "alan turing", "allan turing",
NA), stringsAsFactors = F)
> df
id name
1 1 john jones
2 2 john joners
3 NA <NA>
4 3 clara prat
5 4 klara prat
6 NA <NA>
7 NA <NA>
8 NA <NA>
9 5 alan turing
10 6 allan turing
11 NA <NA>
The desired output is:
> df
id name
1 1 john jones
2 2 john joners
3 NA <NA>
4 3 clara prat
5 4 klara prat
6 NA <NA>
7 5 alan turing
8 6 allan turing
9 NA <NA>
I can do this with a for loop, which I understand is less than optimal.
Perhaps this helps
v1 <- rowSums(!is.na(df))
df[unlist(lapply(split(seq_along(v1),
cumsum(c(1, diff(!v1))<0)), function(i)
i[seq(which.max(v1[i]==0))])),]
# id name
#1 1 john jones
#2 2 john joners
#3 NA <NA>
#4 3 clara prat
#5 4 klara prat
#6 NA <NA>
#9 5 alan turing
#10 6 allan turing
#11 NA <NA>
Here is another approach, using rle to look for runs of missing rows:
miss <- rowSums(is.na(df))
# get runs of missing
r <- rle(miss)
r$values <- seq_along(r$values)
# subset data, removing rows when all columns are missing
# and rows sequentially missing
df[!(miss == ncol(df) & duplicated(inverse.rle(r))), ]
# id name
# 1 1 john jones
# 2 2 john joners
# 3 NA <NA>
# 4 3 clara prat
# 5 4 klara prat
# 6 NA <NA>
# 9 5 alan turing
# 10 6 allan turing
# 11 NA <NA>
As mentioned by Akrun, you can use data.table::rleid to avoid some of the explicit rle calculations
df[!(rowSums(is.na(df)) == ncol(df) & duplicated(data.table::rleid(is.na(df[[1]])))) , ]
Using the IRanges package.
df <- data.frame(id = c(1,2,NA,3,4,NA,NA,NA,5,6,NA), name = c("john jones", "john joners",
NA, "clara prat", "klara prat", NA, NA, NA, "alan turing", "allan turing",
NA), stringsAsFactors = F)
library(IRanges)
na.rs <- which(is.na(df$id) & is.na(df$name))
na.rs.re <- reduce(IRanges(na.rs, na.rs))
na.rs.rm <- na.rs.re[width(na.rs.re)>1]
start(na.rs.rm) <- start(na.rs.rm) + 1
df[-as.integer(na.rs.rm), ]
# id name
# 1 1 john jones
# 2 2 john joners
# 3 NA <NA>
# 4 3 clara prat
# 5 4 klara prat
# 6 NA <NA>
# 9 5 alan turing
# 10 6 allan turing
# 11 NA <NA>
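Surely not the best solution, but easy to follow:
# TRUE for rows where every column is NA
miss <- rowSums(is.na(df)) == ncol(df)
# number of runs of empty rows = number of empty rows to keep
r <- sum(rle(miss)$values)
for(i in 2:length(df$id)){
while(is.na(df$id[i-1]) && is.na(df$id[i])){
df <- df[-(i),]
if(sum(is.na(df$id)) == r) break
}
}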
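
For comparison, a short vectorized base R sketch of the same idea: drop an empty row whenever the row directly above it is empty too. The names empty, keep and res are only illustrative.

```r
df <- data.frame(id = c(1,2,NA,3,4,NA,NA,NA,5,6,NA),
                 name = c("john jones", "john joners", NA, "clara prat",
                          "klara prat", NA, NA, NA, "alan turing",
                          "allan turing", NA),
                 stringsAsFactors = FALSE)

# TRUE for rows in which every column is NA
empty <- rowSums(!is.na(df)) == 0

# A row is dropped if it is empty and the previous row was empty as well
keep <- !(empty & c(FALSE, head(empty, -1)))

res <- df[keep, ]
res
```

As in the other answers, the original row names (1 to 6, then 9 to 11) are kept; rownames(res) <- NULL would renumber them.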

R Count elements of data frame and add a row

I am reading data into two data frames, using this code:
IdaEmpA <- data.frame(table(unlist(DadosA$idade)))
IdaEmpB <- data.frame(table(unlist(DadosB$idade)))
Then I want to add a row holding the number of NAs to each of those data frames. I tried it like this:
IdaEmpA = rbind(IdaEmpA,c(7,sum(is.na(DadosA$idade))))
IdaEmpB = rbind(IdaEmpB,c(7,sum(is.na(DadosB$idade))))
The resulting data is:
> IdaEmpA
RespA QuantA
1 1 11
2 2 13
3 3 15
4 4 3
5 5 18
6 6 1
> IdaEmpB
RespB QuantB
1 1 18
2 2 14
3 3 21
4 4 2
5 6 13
But I am getting a warning and the value is not being added to the first column:
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = 7) :
invalid factor level, NA generated
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = 7) :
invalid factor level, NA generated
Results after the warning:
> IdaEmpA
RespA QuantA
1 1 11
2 2 13
3 3 15
4 4 3
5 5 18
6 6 1
7 <NA> 1
> IdaEmpB
RespB QuantB
1 1 18
2 2 14
3 3 21
4 4 2
5 6 13
6 <NA> 3
How can I get the value 7 instead of NA?
Any clues might help me out, thanks!
This occurs when there is a factor column. If the value in the new row for that column is not among the levels of the factor, you will get this warning. For example, if both columns are "numeric", there is no warning:
rbind(IdaEmpA,c(7,5))
# RespA QuantA
#1 1 11
#2 2 13
#3 3 15
#4 4 3
#5 5 18
#6 6 1
#7 7 5
If one of the columns is a factor:
IdaEmpA$RespA <- factor(IdaEmpA$RespA)
rbind(IdaEmpA,c(7,5))
# RespA QuantA
#1 1 11
#2 2 13
#3 3 15
#4 4 3
#5 5 18
#6 6 1
#7 <NA> 5
#Warning message:
#In `[<-.factor`(`*tmp*`, ri, value = 7) :
# invalid factor level, NA generated
Because the column in "IdaEmpA" appears to hold numeric values, we can convert it to numeric before doing the rbind:
IdaEmpA$RespA <- with(IdaEmpA, as.numeric(levels(RespA))[RespA])
If there are multiple columns that need to be converted back to numeric:
indx <- sapply(IdaEmpA, is.factor)
IdaEmpA[indx] <- lapply(IdaEmpA[indx], function(x)
as.numeric(levels(x))[x])
All of this could be avoided when reading the dataset with read.table/read.csv: use stringsAsFactors=FALSE (the default since R 4.0.0) so that "character" columns are not converted to "factor".
Once you have corrected the rbind step, it would be easier to do merge.
data
IdaEmpA <- structure(list(RespA = 1:6, QuantA = c(11L, 13L, 15L, 3L, 18L,
1L)), .Names = c("RespA", "QuantA"), class = "data.frame", row.names =
c("1", "2", "3", "4", "5", "6"))
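
Putting the pieces together, here is a minimal sketch of counting the values and appending the NA count without ever creating a factor. The vector idade is made up for illustration (it stands in for DadosA$idade), and the column names Resp and Quant are assumptions.

```r
# Hypothetical stand-in for DadosA$idade
idade <- c(1, 1, 2, 3, 3, 3, NA, 6, NA)

# stringsAsFactors = FALSE keeps the category column as character
counts <- as.data.frame(table(idade), stringsAsFactors = FALSE)
names(counts) <- c("Resp", "Quant")
counts$Resp <- as.numeric(counts$Resp)

# Append the NA count under code 7; no factor levels are involved
counts <- rbind(counts, data.frame(Resp = 7, Quant = sum(is.na(idade))))
counts
```

Another option would be table(idade, useNA = "always"), which counts the NAs directly, although the category is then labelled NA rather than 7.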
