Remove thousand's separator [duplicate] - r

I imported an Excel file and got a data frame like this:
structure(list(A = structure(1:3, .Label = c("1.100", "2.300",
"5.400"), class = "factor"), B = structure(c(3L, 2L, 1L), .Label = c("1.000.000",
"500", "7.800"), class = "factor"), C = structure(1:3, .Label = c("200",
"3.100", "4.500"), class = "factor")), .Names = c("A", "B", "C"
), row.names = c(NA, -3L), class = "data.frame")
I would now like to convert these values to numeric or even integer. However, the dot character (.) is not a decimal mark but a thousands separator (the file uses German number formatting).
How would I convert the data frame properly?
I tried this:
df2 <- as.data.frame(apply(df1, 2, gsub, pattern = "([0-9])\\.([0-9])", replacement= "\\1\\2"))
df3 <- as.data.frame(data.matrix(df2))
However, apply seems to convert each column to a list of factors. Can I prevent apply from doing so?

You can use this:
sapply(df, function(v) {as.numeric(gsub("\\.","", as.character(v)))})
Which gives:
A B C
[1,] 1100 7800 200
[2,] 2300 500 3100
[3,] 5400 1000000 4500
This will give you a matrix object, but you can wrap it in data.frame() if you wish.
Note that the columns in your original data are not characters but factors.
Edit: Alternatively, instead of wrapping it in data.frame(), you can do this to get the result directly as a data.frame:
# the as.character(.) is just in case it's loaded as a factor
df[] <- lapply(df, function(x) as.numeric(gsub("\\.", "", as.character(x))))
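The question also asks about integers. A minimal sketch of that variant, assuming the df1 built below reproduces the factor columns from the question and that every value fits into R's 32-bit integer range. Here fixed = TRUE treats the dot literally, so it needs no escaping:

```r
# Rebuild the question's data as factors (as its dput output shows).
df1 <- data.frame(
  A = factor(c("1.100", "2.300", "5.400")),
  B = factor(c("7.800", "500", "1.000.000")),
  C = factor(c("200", "3.100", "4.500"))
)

# as.character() comes first: as.integer() on a factor would return the level codes.
# fixed = TRUE matches "." literally instead of as a regex wildcard.
df1[] <- lapply(df1, function(x) {
  as.integer(gsub(".", "", as.character(x), fixed = TRUE))
})

str(df1)
```

Assigning into df1[] keeps the object a data frame, so no extra data.frame() call is needed.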

I think I just found another solution:
It's necessary to use stringsAsFactors = FALSE.
Like this:
df2 <- as.data.frame(apply(df1, 2, gsub, pattern = "([0-9])\\.([0-9])", replacement= "\\1\\2"), stringsAsFactors = FALSE)
df3 <- as.data.frame(data.matrix(df2))
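One caveat with this route, as far as I can tell from the documentation of data.matrix(): character columns are first converted to factors and then to their integer codes, so df3 may end up holding 1, 2, 3 rather than the actual numbers. A minimal sketch that keeps the apply() step but converts each column with as.numeric() instead (df1 is rebuilt here only so the snippet runs on its own):

```r
df1 <- data.frame(
  A = factor(c("1.100", "2.300", "5.400")),
  B = factor(c("7.800", "500", "1.000.000")),
  C = factor(c("200", "3.100", "4.500"))
)

# apply() coerces the data frame to a character matrix before calling gsub().
m <- apply(df1, 2, gsub, pattern = ".", replacement = "", fixed = TRUE)

# Keep the strings as strings, then convert column by column.
df2 <- as.data.frame(m, stringsAsFactors = FALSE)
df3 <- as.data.frame(lapply(df2, as.numeric))

str(df3)
```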


How to extract specific words from a string in R?

How can I extract "7-9", "2-5" and "2-8", then paste them into a new column named event_time?
event_details
2.9(S) 7-9 street【Train】#2097
2.1(S) 2-5 street【Train】#2012
2.2(S) 2-8A TBC【Train】#202
You haven't really shared the logic for extracting the numbers, but based on the limited data you have shared, we can do:
df$new_col <- sub('.*(\\d+-\\d+).*', '\\1', df$event_details)
df
# event_details new_col
#1 2.9(S) 7-9 street【Train】 7-9
#2 2.1(S) 2-5 street【Train】 2-5
#3 2.2(S) 2-8A TBC【Train】 2-8
Or the same using str_extract:
df$new_col <- stringr::str_extract(df$event_details, "\\d+-\\d+")
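For completeness, a base R sketch of the same extraction with regexpr() and regmatches(), assuming the column holds plain character strings as rebuilt below. It pulls out the match itself rather than deleting the text around it, so the greedy .* in the sub() call cannot swallow leading digits (with a value such as 12-15, that pattern can return 2-15). The column name event_time is taken from the question:

```r
event_details <- c(
  "2.9(S) 7-9 street【Train】#2097",
  "2.1(S) 2-5 street【Train】#2012",
  "2.2(S) 2-8A TBC【Train】#202"
)
df <- data.frame(event_details = event_details, stringsAsFactors = FALSE)

# regexpr() locates the first "digits-digits" run in each string (-1 if none).
hits <- regexpr("[0-9]+-[0-9]+", df$event_details)
matched <- as.vector(hits) > 0

# regmatches() returns only the matched pieces, so fill just those rows;
# rows without a match stay NA.
df$event_time <- NA_character_
df$event_time[matched] <- regmatches(df$event_details, hits)

df$event_time
```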
data
df <- structure(list(event_details = structure(c(3L, 1L, 2L),
.Label = c("2.1(S) 2-5 street【Train】",
"2.2(S) 2-8A TBC【Train】", "2.9(S) 7-9 street【Train】"), class =
"factor")), class = "data.frame", row.names = c(NA, -3L))

Replace value in a column with 'blank' if it matches the value in another column [duplicate]

I have the following data set as a data frame in R:
article_number 1st_cutoff_date 2nd_cutoff_date
abc 12/01/2019 01/14/2020
def 02/10/2020 02/10/2020
What I want to do is, in cases where 1st_cutoff_date == 2nd_cutoff_date, replace 2nd_cutoff_date with a blank value " ". So in the second case, 'def', 2nd_cutoff_date would be blank " ".
The data frame consists of factors and there are NAs. I have converted the columns to character and tried the following:
AAR_FTW_Final_w_LL[AAR_FTW_Final_w_LL$`1st_Booking_Deadline` == AAR_FTW_Final_w_LL$`2nd_Booking_Deadline`, c("2nd_Booking_Deadline")] <- " "
&
ind<- AAR_FTW_Final_w_LL$`1st_Booking_Deadline` == AAR_FTW_Final_w_LL[`2nd_Booking_Deadlilne`]
AAR_FTW_Final_w_LL[ind, c("2nd_Booking_Deadline")] <- " "
Both return the error:
Error in AAR_FTW_Final_w_LL$`1st_Booking_Deadline` :
$ operator is invalid for atomic vectors
I have tried replacing the $ with [] but then I get an error that one of the columns is missing. Is there an easier way to do this task?
Convert from factors to characters:
df[] <- lapply(df, as.character)
Then use replace:
transform(df, `2nd_cutoff_date` = replace(`2nd_cutoff_date`,
`1st_cutoff_date` == `2nd_cutoff_date`, ''))
# article_number X1st_cutoff_date X2nd_cutoff_date
#1 abc 12/01/2019 01/14/2020
#2 def 02/10/2020
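It adds X to the column names because transform() rebuilds the result with data.frame(), whose default check.names = TRUE alters names that are not syntactically valid in R, such as names starting with a digit.
Another approach, after you convert the data to characters, would be: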
df$`2nd_cutoff_date`[df$`1st_cutoff_date` == df$`2nd_cutoff_date`] <- ""
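Since the question mentions NAs, here is a minimal sketch of the same assignment that tolerates missing dates. The third row ("ghi") is a hypothetical addition to show the NA case, and check.names = FALSE is used so the column names keep their leading digits:

```r
df <- data.frame(
  article_number    = c("abc", "def", "ghi"),
  `1st_cutoff_date` = c("12/01/2019", "02/10/2020", NA),
  `2nd_cutoff_date` = c("01/14/2020", "02/10/2020", "03/01/2020"),
  check.names = FALSE,       # keep the names that start with a digit
  stringsAsFactors = FALSE   # keep the dates as character strings
)

# which() drops the NA comparisons, so only rows known to be equal are changed.
same <- which(df$`1st_cutoff_date` == df$`2nd_cutoff_date`)
df$`2nd_cutoff_date`[same] <- ""

df
```

Indexing with which() rather than the raw logical vector avoids passing NA subscripts to the assignment.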
data
df <- structure(list(article_number = structure(1:2, .Label = c("abc",
"def"), class = "factor"), `1st_cutoff_date` = structure(2:1,
.Label = c("02/10/2020", "12/01/2019"), class = "factor"),
`2nd_cutoff_date` = structure(1:2, .Label = c("01/14/2020",
"02/10/2020"), class = "factor")), class = "data.frame", row.names = c(NA, -2L))

Subset a data frame according to a nested list when the values contain white space

I have a data frame and I would like to subset it according to specific values. When I tried to do it, I ran into a problem because of the white space inside the values in sample_df$mentions.
I used this script for subsetting the data frame:
sample_list <- list()
for (i in colnames(sample_name)){
sample_list <- sapply(sample_df$mentions, function(x)any(x %in% sample_name[[i]]))
new_sample_df <- sample_df[sample_list,]
}
I have tried the strsplit function to get rid of the space, but it created other problems.
sample_df$mentions <- strsplit(as.character(sample_df$mentions),"[[:space:]]")
Thank you for your help in advance.
My expected outcome should be like this:
mentions screen_name
5 islambey1453, hamzayerlikaya, tahaayhan, hidoturkoglu15 ak_Furkan54
10 nurhandnci, SSSBBL777, serkanacar007, Chequevera06, kubilayy81 tanrica_gaia
sample_name reproducible data:
sample_name <- structure(list(Name = structure(2:1, .Label = c("hamzayerlikaya",
"SSSBBL777"), class = "factor")), row.names = c(NA, -2L), class = "data.frame")
sample_df reproducible data:
sample_df <- structure(list(mentions = list(character(0), "srgnsnmz92", character(0),
"Berivan_Aslan_", c("islambey1453", " hamzayerlikaya", " tahaayhan",
" hidoturkoglu15"), character(0), "themarginale", character(0),
character(0), c("nurhandnci", " SSSBBL777", " serkanacar007",
" Chequevera06", " kubilayy81")), screen_name = c("SaadetYakar",
"beraydogru", "EL_Turco_DLC", "hebunagel", "ak_Furkan54", "zaferakyol011",
"melmitem", "mobbingabla", "BekarKronik", "tanrica_gaia")), row.names = c(NA,
10L), class = "data.frame")
We can loop through 'Name', use each value as the pattern in grepl, Reduce the results to a single logical vector with `|`, and subset the rows of 'sample_df':
sample_df[Reduce(`|`, lapply(as.character(sample_name$Name),
grepl, x = sample_df$mentions)),]
# mentions screen_name
#5 islambey1453, hamzayerlikaya, tahaayhan, hidoturkoglu15 ak_Furkan54
#10 nurhandnci, SSSBBL777, serkanacar007, Chequevera06, kubilayy81 tanrica_gaia
NOTE: This works for a 'Name' column of any length.
Another option is regex_inner_join:
library(fuzzyjoin)
library(tidyverse)
regex_inner_join(sample_df, sample_name, by = c("mentions" = "Name")) %>%
select(mentions, screen_name)
# mentions screen_name
#1 islambey1453, hamzayerlikaya, tahaayhan, hidoturkoglu15 ak_Furkan54
#2 nurhandnci, SSSBBL777, serkanacar007, Chequevera06, kubilayy81 tanrica_gaia
Since mentions is a list, we can use sapply and select only those rows in sample_df where any of the mentions contains a Name.
sample_df[sapply(sample_df$mentions, function(x) any(grepl(pattern, x))), ]
# mentions screen_name
#5 islambey1453, hamzayerlikaya, tahaayhan, hidoturkoglu15 ak_Furkan54
#10 nurhandnci, SSSBBL777, serkanacar007, Chequevera06, kubilayy81 tanrica_gaia
where pattern is
pattern = paste0("\\b", sample_name$Name, "\\b", collapse = "|")
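The white space itself can also be handled directly. A minimal sketch, assuming exact name matches are wanted: trimws() strips the leading spaces and %in% then compares whole names, so no regular expression is needed. The data below is a shortened copy of the question's sample_df, and sample_name is rebuilt as plain characters:

```r
sample_name <- data.frame(Name = c("SSSBBL777", "hamzayerlikaya"),
                          stringsAsFactors = FALSE)

# A shortened version of sample_df; the list column is added after creation.
sample_df <- data.frame(
  screen_name = c("beraydogru", "ak_Furkan54", "zaferakyol011", "tanrica_gaia"),
  stringsAsFactors = FALSE
)
sample_df$mentions <- list(
  "srgnsnmz92",
  c("islambey1453", " hamzayerlikaya", " tahaayhan", " hidoturkoglu15"),
  character(0),
  c("nurhandnci", " SSSBBL777", " serkanacar007", " Chequevera06", " kubilayy81")
)

# trimws() removes the leading space, so %in% can compare whole names exactly.
# Rows with no mentions (character(0)) yield FALSE.
keep <- vapply(sample_df$mentions,
               function(m) any(trimws(m) %in% sample_name$Name),
               logical(1))
new_sample_df <- sample_df[keep, ]

new_sample_df$screen_name
```

Unlike grepl, this does not match a name that only appears as part of a longer one.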

remove double quotes from factors in a dataframe

I got a dataframe to work on where a bunch of variables are factors wrapped in quotation marks, like ""x1"".
str(df) gives me something like this:
$ x : Factor w/ 10 levels "\"\"x1\"\"",..: 1 7 9 ...
I tried to get rid of the quotation marks with the gsub() function but that didn't work, probably because I don't know what to insert as the pattern. It would be great if somebody could solve this puzzle and maybe explain to me whether "\"\"x1\"\"" is the pattern I need.
An example for the dataframe would look like this:
structure(list(Sent = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("\"\"Opted out\"\"",
"\"\"Yes\"\""), class = "factor"), Responded = structure(c(2L,
2L, 2L, 2L, 2L), .Label = c("\"\"Complete\"\"", "\"\"No\"\"",
"\"\"Partial\"\""), class = "factor")), row.names = c(NA, -5L
), class = c("tbl_df", "tbl", "data.frame"), .Names = c("Sent",
"Responded"))
Thanks in advance!
vec = c('""x1""', '""x2""', '""x3""')
vec = factor(vec)
levels(vec) <- gsub('["\\]', "", levels(vec))
#> vec
#[1] x1 x2 x3
#Levels: x1 x2 x3
Note how I use ' as the string delimiter when I want to use " inside the string.
Another reason it didn't work for you is probably that you applied gsub to the factor variable itself rather than to its levels attribute.
Factor variables are stored internally as integer codes (1, 2, 3, ...) that index the levels.
Now that you have provided data, you can use the following (df1 being your data with the factor columns):
df1[] <- lapply(df1, function(vec){ levels(vec) <- gsub('["\\]',"",levels(vec)); vec})
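A small sketch of the same idea on a single factor, assuming values like those in the question's Sent column. The backslashes that str() prints are only escapes for display; the levels themselves contain plain double quotes, so a literal match with fixed = TRUE is enough:

```r
f <- factor(c('""Yes""', '""Opted out""', '""Yes""'))

# Edit the levels, not the factor: the integer codes stay untouched
# and every value picks up the cleaned label.
levels(f) <- gsub('"', "", levels(f), fixed = TRUE)

levels(f)
as.character(f)
```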

R - How to get a specific string value in a data frame header [duplicate]

I have a data frame df. How can I return the part of a string that comes after the first occurrence of some character or number?
My data frame:
ID Name Name.FirstName Name.Last.Name Age
1 rosy ton P 23
2 Jhon peter N 22
My expected data frame:
ID Name FirstName Last.Name Age
1 rosy ton P 23
2 Jhon peter N 22
I need to remove everything up to and including the first dot (.), in the data frame header only.
dput
structure(list(ID = c(1, 2), Name = structure(c(2L, 1L), .Label = c("Jhon",
"rosy"), class = "factor"), Name.FirstName = structure(c(2L,
1L), .Label = c("peter", "ton"), class = "factor"), Name.Last.Name = structure(c(2L,
1L), .Label = c("N", "P"), class = "factor"), Age = c(23, 22)), .Names = c("ID",
"Name", "Name.FirstName", "Name.Last.Name", "Age"), row.names = c(NA,
-2L), class = "data.frame")
We can use sub to match zero or more characters that are not a . ([^.]*) from the start (^) of the column name, followed by a literal ., and replace the match with an empty string (""). As . is a metacharacter, i.e. it matches any character, we escape it (\\.) to get its literal meaning.
names(df) <- sub("^[^.]*\\.", "", names(df))
names(df)
#[1] "ID" "Name" "FirstName" "Last.Name" "Age"
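To see why [^.]* matters here, a short sketch on the question's column names (copied into a plain vector, old_names, for illustration) compares it with a greedy .* pattern, which strips up to the last dot instead of the first:

```r
old_names <- c("ID", "Name", "Name.FirstName", "Name.Last.Name", "Age")

# [^.]* cannot cross a dot, so only the prefix before the FIRST dot is removed;
# names without a dot do not match and are returned unchanged.
new_names <- sub("^[^.]*\\.", "", old_names)

# For comparison: a greedy ".*" runs up to the LAST dot.
greedy <- sub("^.*\\.", "", old_names)

new_names
greedy
```

With the greedy pattern, "Name.Last.Name" would collapse to "Name" and collide with the existing Name column.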
