I'm quite new to manipulating data frames in R. I need to create a data frame by joining several other ones, each containing some data.
I've succeeded in joining them, but this is what I got:
https://i.stack.imgur.com/SkFDg.png
What I want is a clean data frame, so I would like to remove the comma, quotation mark and $ characters in order to obtain a "real" data frame. Can you help me with that? Many thanks!
PS: I'm using the dplyr and statsr libraries; I don't know if this information is useful though...
Your data looks like comma-separated values (CSV). The simplest way would probably be to save it as plain text and read the file back in with a CSV reader such as csv.get or base R's read.csv.
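For example, with base R's read.csv (just a sketch; the file name countries.csv is made up here):
# Hypothetical file name; point this at wherever the joined data was saved as text.
countries <- read.csv("countries.csv", stringsAsFactors = FALSE)

# If the leading comma in each row produced an empty first column, simply drop it:
countries <- countries[, -1]

str(countries)   # quotes are stripped, and numeric-looking columns come back as numbers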
As noted by @Jan, the best way is to read the data in more appropriately. If, for some reason, that is not a viable option, then this might work:
First off, some illustrative data:
v1 <- c(',"Name","Area","Population"')
v2 <- c(',"Afghanistan",652230,32564342')
v3 <- c(',"Akrotiri",123,NA"')
v4 <- c(',"Albania",28748,3029278')
df1 <- as.data.frame(rbind(v1,v2,v3,v4))
df1
V1
v1 ,"Name","Area","Population"
v2 ,"Afghanistan",652230,32564342
v3 ,"Akrotiri",123,NA"
v4 ,"Albania",28748,3029278
The first step is (i) to get rid of the leading comma and the quote marks using gsub, (ii) to split the rows at the commas using strsplit, (iii) to save the result as a data frame using as.data.frame, and (iv) to transpose it using t:
df2 <- t(as.data.frame(apply(df1, 2, function(x) strsplit(trimws(gsub('^,|"', '', x)),","))))
The rest is rather cosmetic: first remove the row names, then add the correct column names, and finally remove the first row (which contains the names too):
rownames(df2) <- NULL
colnames(df2) <- df2[1,]
df3 <- as.data.frame(df2[-1,])
The result is a neat and clean structure:
df3
Name Area Population
1 Afghanistan 652230 32564342
2 Akrotiri 123 NA
3 Albania 28748 3029278
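One follow-up worth noting: after this string-based cleanup, Area and Population still hold text (character or factor, depending on the R version), so convert them before doing any arithmetic:
# as.character() first guards against the factor case on older R versions
df3$Area <- as.numeric(as.character(df3$Area))
df3$Population <- as.numeric(as.character(df3$Population))
str(df3)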
I would like to subset a data frame (Data) by column names. I have a character vector with column name IDs I want to exclude (IDnames).
What I do normally is something like this:
Data[ ,!colnames(Data) %in% IDnames]
However, I am facing the problem that there is a column named "X-360" and another one named "X-360.1". I only want to exclude "X-360" (which is in the character vector), but not "X-360.1" (which is not in the character vector, yet gets extracted anyway). So I want only exact matches, and it seems like this does not work with %in%.
It seems like such a simple problem, but I just cannot find a solution...
Update:
Indeed, the problem was that I had duplicated names in my data.frame! It took me a while to figure this out, because when I looked at the subsetted columns with
Data[ ,colnames(Data) %in% IDnames]
it showed "X-360" and "X-360.1" among the names, as stated above.
But it seems the ".1" suffix only appeared when subsetting the data; before that, there were simply two columns with the same name ("X-360"), which happened because the data frame was set up from matrices with cbind.
Here is a demonstration of what happened:
D1 <-matrix(rnorm(36),nrow=6)
colnames(D1) <- c("X-360", "X-400", "X-401", "X-300", "X-302", "X-500")
D2 <-matrix(rnorm(36),nrow=6)
colnames(D2) <- c("X-360", "X-406", "X-403", "X-300", "X-305", "X-501")
D <- cbind(D1, D2)
Data <- as.data.frame(D)
IDnames <- c("X-360", "X-302", "X-501")
Data[ ,colnames(Data) %in% IDnames]
X-360 X-302 X-360.1 X-501
1 -0.3658194 -1.7046575 2.1009329 0.8167357
2 -2.1987411 -1.3783129 1.5473554 -1.7639961
3 0.5548391 0.4022660 -1.2204003 -1.9454138
4 0.4010191 -2.1751914 0.8479660 0.2800923
5 -0.2790987 0.1859162 0.8349893 0.5285602
6 0.3189967 1.5910424 0.8438429 0.1142751
Learned another thing to be careful about when working with such data in the future...
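A quick check like the following would have flagged the duplicated names up front (a small sketch using base functions):
# Returns 0 if all column names are unique, otherwise the index of the first duplicate
anyDuplicated(colnames(Data))

# Show which names occur more than once
colnames(Data)[duplicated(colnames(Data))]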
One regex-based solution here would be to form an alternation of exact keyword matches (note that grepl needs perl = TRUE for the (?: ) group):
regex <- paste0("^(?:", paste(IDnames, collapse="|"), ")$")
Data[ , !grepl(regex, colnames(Data), perl = TRUE)]
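For reference, with the IDnames from the question the constructed pattern is an exact-match alternation (a small sketch; if the IDs ever contain regex metacharacters, they would need to be escaped before pasting):
IDnames <- c("X-360", "X-302", "X-501")
paste0("^(?:", paste(IDnames, collapse = "|"), ")$")
# [1] "^(?:X-360|X-302|X-501)$"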
I am new to R and have spent the last two months on this website trying to learn more. I want to pull the entries from a dataset that contain a specific keyword, and for those that DO contain it, extract the 5 words before and after that keyword. Then I want to know what number(s) appear near the keyword in that same sentence.
To explain the "why": I have a list of tickets and I want to pull all of their Titles. From that list I then want to know which tickets are requesting additional Storage. If they are, I want to know how MUCH storage they are asking for, and then later I will create actions depending on that amount (but that is later).
Example of the code I have completed so far (it's a bit messy; I am still working on a better/cleaner way, as I'm very new to R).
The keyword I'm searching for: Storage
Dataframe referenced as: DF, DF2, DF3 etc.
Column from DF: Title
#Check for keyword#
grep("storage", DF$Title, ignore.case=true)
#Pull words before and after the keyword. This is case sensitive for some reason, so I have to do it twice and merge the data frames; it also creates a list instead of a data frame, so I have to convert it into a data frame... Messy, I know#
DF2 <- stringr::str_extract_all(DF$Title, "([^\\s]+\\s){0,5}Storage(\\s[^\\s]+){0,5}")
#Turn list into dataframe#
DF3 <- do.call(rbind.data.frame, DF2)
#Pull words before and after but in lower case, same as step two#
DF4 <- stringr::str_extract_all(DF$Title, "([^\\s]+\\s){0,5}storage(\\s[^\\s]+){0,5}")
#Turn list into dataframe#
DF5 <- do.call(rbind.data.frame, DF4)
#Change column names (I have to do this to merge them via rbind)#
DF6 <- setnames(DF3, c("Keyword"))
DF7 <- setnames(DF5, c("Keyword"))
#Merge both data frames together#
DF6 <- rbind(DF6, DF7)
Now I want to check the amount of storage being requested, so I'm trying to look for a number referencing GB or TB, etc. I've tried numerous pieces of code, but most only pull the number(s) right after a keyword, not all numbers in the sentence.
Example of what I've tried that is not working:
DFTest <- as.integer(str_match(DF6$Keyword, "(?i)\\bGB:?\\s*(\\d+)")[,2])
The following approach will extract all numbers before or after a specific keyword (in this case I used AND). You can change the keyword in the regex pattern.
library(tidyverse)
df <- data.frame(obs = 1:5, COL_D = c("2019AND", "AND1999", "101AND", "AND12", "20AND1999999"))
df2 <- df %>%
  mutate(Extracted_Num = str_extract_all(COL_D, regex("\\d+(?=AND)|(?<=AND)\\d+")))
# obs COL_D Extracted_Num
# 1 1 2019AND 2019
# 2 2 AND1999 1999
# 3 3 101AND 101
# 4 4 AND12 12
# 5 5 20AND1999999 20, 1999999
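Adapting the same idea to the storage question above, a hedged sketch below pulls every number that is directly followed by GB or TB in the Title column (the column name is taken from the question; other units would need to be added to the alternation):
library(stringr)

# Numbers followed (possibly after spaces) by GB or TB, case-insensitive,
# so "500GB", "2 TB" and "1tb" are all caught; the result is a list column.
DF$storage_amounts <- str_extract_all(
  DF$Title,
  regex("\\d+(?=\\s*(?:GB|TB)\\b)", ignore_case = TRUE)
)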
I am new to this community. I am currently working on an R project in which I need to find each of the comma-separated elements in one data frame in any of the columns of another data frame. Here is an example below:
#DataFrame1
a=c("AA,BB","BB,CC,FF","CC,DD,GG,FF","GG","")
df1=as.data.frame(a)
#DataFrame2
x=c("AA","XX","BB","YY","ZZ","MM","YY","CC")
y=c("DD""VV","NN","XX","CC","AA","WW","FF")
z=c("CC","AA","YY","GG","HH","OO","PP","QQ")
df2=data.frame(x,y,z)
What I need to do is check whether any of the elements is present in any of the columns (x, y, z) of df2. Take for example "AA,BB" (the first cell in column a of df1): "AA" is one element and "BB" is another. If a match is found, I need to identify that row or rows; there may also be more than one match among the rows of df2.
I hope I was able to explain this problem well. Experts, please help!
Here is a solution in 2 steps:
# load tidyverse
library(tidyverse)
Step 1: Split the comma-separated elements of df1 into a new data frame new_df
1a) To do this, we first identify the number of columns to be generated
(the maximum number of comma-separated elements per row, i.e. the maximum number of commas + 1)
number_new_columns <- max(sapply(df1$a, function(x) str_count(x, ","))) + 1
1b) Generate the new data frame new_df
new_df <- df1 %>%
  separate(a, c(as.character(seq_len(number_new_columns)))) # missing pieces will be filled with NA
# Above, we used c(as.character(seq_len(number_new_columns))) to generate column names as numbers -- not very creative :)
Step 2: Identify the position of each unique element from new_df in df2
(hope I understood correctly this second part of the question)
2a) Get the unique elements (from new_df)
unique_elements <- unlist(new_df) %>%
  unique()
2b) Get a list whose components contain the positions of each unique element within df2
output <- lapply(unique_elements, function(x) {
  which(df2 == x, arr.ind = TRUE)
})
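As a small follow-up (assuming the objects created above), naming the list makes it easy to look up the positions of a particular element, and the NA / empty-string entries that come from the shorter rows can be dropped:
# Attach the element values as names, then drop the NA / empty-string entries
names(output) <- unique_elements
output <- output[!is.na(unique_elements) & unique_elements != ""]

output[["AA"]]   # row/column positions of "AA" within df2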
I wanted to add an additional column to an existing data frame, where the value of newColumn is based on a capture group of a regex applied to another value in the same row. The only thing I came up with that worked so far is the (probably not very R-like) standard approach of looping, but it is awfully slow (for a data frame of around 1.5 million rows).
Dataframe with Columns:
ID Text NewColumn
At the moment I work with this:
df$newColumn <- rep("", nrow(df));
for (row in 1:nrow(df)) {
  df$newColumn[row] <- str_match(df$Text[row], regex)[1,2];
}
I tried using apply/lapply after reading several posts but none of my approaches created the expected result. Is this even possible with a function of the apply-family, and if yes: how?
Example:
for
regex <- "^[0-9]*([a-zA-Z]*)$";
and a table like the following:
ID Text
------------------
1 231Ben
2 112Claudine
3 538Julia
I would expect:
ID Text NewColumn
----------------------------
1 231Ben Ben
2 112Claudine Claudine
3 538Julia Julia
str_match and gsub/sub etc. are vectorized, so we don't have to loop through the rows if the pattern is the same:
df1$NewColumn <- gsub("\\d+", "", df1$Text)
Or with stringr functions
library(stringr)
df1$NewColumn <- str_match(df1$Text, "([A-Za-z]+)")[,1]
str_extract(df1$Text, "[A-Za-z]+")
#[1] "Ben" "Claudine" "Julia"
I have some trouble with a script which uses cbind to add columns to a data frame. I select these columns by regular expression, and I love that cbind automatically provides a prefix if you add more than one column. But this does not work if you append just one column, even if I cast that column as a data frame...
Is there a way to get around this behaviour?
In my example, it works fine for the columns starting with a, but not for the b1 column.
df <- data.frame(a1=c(1,2,3),a2=c(3,4,5),b1=c(6,7,8))
cbind(df, log=log(df[grep('^a', names(df))]))
cbind(df, log=log(df[grep('^b', names(df))]))
cbind(df, log=as.data.frame(log(df[grep('^b', names(df))])))
A solution would be to create an intermediate data frame with the log values and rename the columns:
logb = log(df[grep('^b', names(df))])
colnames(logb) = paste0('log.',names(logb))
cbind(df, logb)
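Wrapped up as a small helper (the name add_log_cols is made up for this sketch), the same renaming trick applies the prefix consistently no matter how many columns the pattern matches:
add_log_cols <- function(df, pattern, prefix = "log.") {
  sel <- df[grep(pattern, names(df))]           # columns matching the pattern
  logged <- log(sel)
  names(logged) <- paste0(prefix, names(sel))   # force the prefix, even for one column
  cbind(df, logged)
}

add_log_cols(df, "^b")   # now yields a log.b1 column
add_log_cols(df, "^a")   # log.a1, log.a2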
What about
cbw <- c("a","b") # columns beginning with
cbw_pattern <- paste0("^",cbw, collapse = "|")
cbind(df, log=log(df[grep(cbw_pattern, names(df))]))
This way you select both patterns at once (all three columns).
The column names only fail to get the prefix when a single column is selected.
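For completeness, a tidyverse alternative that always applies the prefix, however many columns match (a sketch assuming dplyr >= 1.0 for across()):
library(dplyr)

# .names controls the output column names, so the prefix is kept even for a single match
df %>% mutate(across(matches("^b"), log, .names = "log.{.col}"))
df %>% mutate(across(matches("^a"), log, .names = "log.{.col}"))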