Convert a column from string to Int in Dataframe - julia

Imagine I have a dataframe
df=DataFrame(A=rand(5),B=["8", "9", "4", "3", "12"])
What I want to do is to convert column B to Int type, so I used
df[!,:B] = convert.(Int64,df[!,:B])
But I got an error:
'Cannot Convert an object of type string to an object of type Int64'
Could you please tell me what I did wrong?

What you are looking for is the parse function, broadcast over the elements in the column with dot notation:
df = DataFrame(A = rand(5), B = ["8", "9", "4", "3", "12"])
df[!, :B] = parse.(Int64, df[!, :B])

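Convert is only defined between types where the conversion is lossless and unambiguous, so it cannot turn an arbitrary string into an Int; that is why convert.(Int64, df[!,:B]) fails. Note that the constructor form Int64.(df[!,:B]) does not accept strings either, so what you want is df[!,:B] = parse.(Int64, df[!,:B]). If some entries might not be valid integers, tryparse.(Int64, df[!,:B]) returns nothing for those entries instead of throwing.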

Related

R stringr unexpected behavior with str_replace & str_pad. Bug or Layer-8 problem?

I am using R 4.1.3 and the stringr-package 1.4.0 and get some unexpected results from this code:
stringr::str_replace(string = "5",
pattern = "([0-9]+)",
replacement = stringr::str_pad(string = "\\1", width = 3, side = "left", pad = "0"))
Expected: "005"; Result: "05".
All the parts generate the expected results:
(1) The padding
stringr::str_pad(string = "5", width = 3, side = "left", pad = "0")
Returns "005"
(2) The regex match
stringr::str_replace(string = "5", pattern = "([0-9]+)", replacement = "\\1")
Returns "5".
Only the combination of these two leads to unexpected behavior.
For clarification, I already have working code and several solutions to choose from to achieve what I want to do, e.g. using an anonymous function:
stringr::str_replace(string = "5",
pattern = "([0-9]+)",
replacement = {\(x) stringr::str_pad(string = x, width = 3, side = "left", pad = "0")})
The intention of the post is to clarify why the code at the top does not work.
Thanks in advance for any helpful input.
Edit:
It seems that "\\1" refers to the content of the capture group, but the character length is determined from the literal "\\1".
stringr::str_replace(string = "5",
pattern = "([0-9]+)",
replacement = {\(x) as.character(nchar(x))})
stringr::str_replace(string = "5",
pattern = "([0-9]+)",
replacement = as.character(nchar("\\1")))
These return "1" and "2", respectively. The second example always returns "2" as the replacement for the captured group, independent of its content.

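The problem here is that R evaluates the inner call to str_pad() before str_replace() ever runs. Inside str_pad(), "\\1" does not mean the first capture group; it is just a two-character string (a backslash followed by the digit 1). Padding it to width 3 adds a single "0", so the replacement that str_replace() receives is "0\\1", which yields "05". Instead, consider this version as a workaround: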
x <- c("5", "12", "345", "1234")
output <- sub("^0{1,3}(\\d{3,})", "\\1", paste0("000", x))
output
[1] "005" "012" "345" "1234"
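The evaluation order described above can be made visible by computing the replacement separately. This is only a small sketch, assuming the stringr package is installed; the variable names replacement, broken and padded are made up for illustration.

```r
library(stringr)

# R evaluates the argument first: str_pad() only sees the two-character
# string made of a backslash and "1", not the text matched by the regex.
replacement <- str_pad("\\1", width = 3, side = "left", pad = "0")
nchar("\\1")      # 2, so only one "0" is needed to reach width 3
replacement       # the three characters 0, backslash and 1

# str_replace() then substitutes the capture group into that string.
broken <- str_replace("5", "([0-9]+)", replacement)

# If the whole value should be padded, str_pad() can be applied directly
# to the vector, with no backreference involved.
x <- c("5", "12", "345", "1234")
padded <- str_pad(x, width = 3, side = "left", pad = "0")
```

Here broken is "05", matching the result in the question, while padded holds "005", "012", "345" and "1234".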

Dealing with data from the same user twice when moving through a loop in R

Let's say I have a dataframe that looks a bit like this:
df <- tribble(
~person_id, ~timestamp,
"1", "02:26:10.000000",
"1", "03:45:37.000000",
"2", "22:03:39.000000",
"3", "11:46:24.000000",
"4", "18:26:55.000000",
"5", "17:01:20.000000",
"5", "03:10:17.000000",
"6", "23:16:05.000000",
)
df
Now let's say I import individual .csv files that each match the person_id like so:
user_files <- list.files(pattern = "\\.csv$", path = here("data"),
full.names = TRUE)
user_files <- user_files[sub("\\.csv$", "", basename(user_files)) %in% df$person_id]
There will naturally be fewer .csv files than the length of df$person_id because persons "1" and "5" appear twice in df$person_id.
I would now like to run a for loop that runs a program on each csv file. HOWEVER, where there are more than one of the same person_id, I would like to re-run the loop using the same csv file (since it's the same person again but a different timestamp so will yield different results).
This is what the loops look like
for(i in seq(1:length(user_files))) {
user_file <- read_csv(user_files[i])
#Run lots of analysis on the CSV file
}
Now I need something in the loop that says "if df$person_id occurs more than once, repeat the loop using the same CSV file". Thanks in advance for any assistance.

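If the user_files are unique, match df$person_id against the file names (with the .csv extension removed) and use the resulting index to subset user_files, so that each file is repeated once for every row that refers to it: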
v1 <- sub("\\.csv$", "", basename(user_files))
user_files2 <- na.omit(user_files[match(df$person_id, v1)])
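Now loop over user_files2.
Alternatively, a better approach may be to join the file paths onto the original dataset and loop over the user_files column of the joined data, which keeps each timestamp next to its file:
library(dplyr)
# one row per (person_id, timestamp) pair that has a matching file
df1 <- inner_join(df, tibble(user_files,
person_id = sub("\\.csv$", "", basename(user_files))),
by = "person_id")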
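A loop built on the match() idea might look like the sketch below. It uses base R only; the file paths and the shortened timestamps are hypothetical stand-ins for the real data, and the actual read_csv() call and analysis are left as a comment.

```r
# Hypothetical inputs: person "3" has no file, persons "1" and "5" appear twice.
person_id  <- c("1", "1", "2", "3", "5", "5")
timestamp  <- c("02:26:10", "03:45:37", "22:03:39",
                "11:46:24", "17:01:20", "03:10:17")
user_files <- c("data/1.csv", "data/2.csv", "data/5.csv")

# For every row, find the file whose base name equals the person_id.
file_ids     <- sub("\\.csv$", "", basename(user_files))
file_for_row <- user_files[match(person_id, file_ids)]

processed <- character(0)
for (i in seq_along(person_id)) {
  if (is.na(file_for_row[i])) next   # no file for this person
  # read_csv(file_for_row[i]) and the analysis for timestamp[i] would go here
  processed <- c(processed, paste(file_for_row[i], timestamp[i]))
}
```

Looping over rows rather than over files means a file is visited once per timestamp, which is what the question asks for.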

Decoding to Chinese characters in R

I accidentally converted the columns of Chinese characters in a tab delimited text file to encoded characters. The records are encoded to look like this:
<U+5ECA><U+574A><U+5E02>
How do I convert that to this?
廊坊市
You can recreate the first 6 lines of my data frame in R with this code:
structure(list(City_Code = c(110000L, 110000L, 110000L, 110000L, 110000L, 110000L), Origin_City = c("<U+5ECA><U+574A><U+5E02>", "<U+4FDD><U+5B9A><U+5E02>", "<U+5929><U+6D25><U+5E02>", "<U+5F20><U+5BB6><U+53E3><U+5E02>", "<U+627F><U+5FB7><U+5E02>", "<U+90AF><U+90F8><U+5E02>"), Origin_Province = c("<U+6CB3><U+5317><U+7701>", "<U+6CB3><U+5317><U+7701>", "<U+5929><U+6D25><U+5E02>", "<U+6CB3><U+5317><U+7701>", "<U+6CB3><U+5317><U+7701>", "<U+6CB3><U+5317><U+7701>"), Destination_City = c("<U+5317><U+4EAC>", "<U+5317><U+4EAC>", "<U+5317><U+4EAC>", "<U+5317><U+4EAC>", "<U+5317><U+4EAC>", "<U+5317><U+4EAC>"), Percentage = c("28.08%", "6.86%", "5.70%", "3.38%", "3.05%", "2.76%"), Date = c("2020-03-13", "2020-03-13", "2020-03-13", "2020-03-13", "2020-03-13", "2020-03-13")), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")

This code will convert the string to the appropriate Chinese characters:
library(stringi)
string <- '<U+5ECA><U+574A><U+5E02>'
cat(stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", string)))
# Output: 廊坊市
Source: Convert unicode to readable characters in R
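To apply the same conversion to whole columns of the data frame, one option is to wrap it in a small helper. This is a sketch, assuming the stringi package is installed; decode_u is a made-up helper name, and the two-row data frame is a cut-down stand-in for the data in the question.

```r
library(stringi)

# Hypothetical helper: turn "<U+XXXX>" markers into "\uXXXX" escapes,
# then let stringi convert the escapes into real characters.
decode_u <- function(x) {
  stri_unescape_unicode(gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", x))
}

cities <- data.frame(
  Origin_City = c("<U+5ECA><U+574A><U+5E02>", "<U+4FDD><U+5B9A><U+5E02>"),
  Percentage  = c("28.08%", "6.86%")
)

# Decode only the columns that hold encoded text.
cities$Origin_City <- decode_u(cities$Origin_City)
```

Because gsub() and stri_unescape_unicode() are both vectorized, the helper works on a full column at once; columns without markers, such as Percentage, would pass through unchanged.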

Invalid subscript type list, not sure why

I'm beginning to learn R, and I'm writing a script, but I'm getting a weird error. I have a data frame, and I'd like to take a subset of the columns. I created a variable called meansAndStdevs, which is a logical vector. I want to use this logical vector to subset the columns in my data frame. Here's the code I have:
features <- read.table("./features.txt")$V2;
meanAndStdevRegEx <- "(-mean\\(\\))|(-std\\(\\))";
meansAndStdevs <- as.logical(sapply(features, function(f) { grep(meanAndStdevRegEx, f); }));
fileData <- read.table(filePath);
fileDataSubset <- fileData[, meansAndStdevs]
However, I end up getting the error Error in .subset(x, j) : invalid subscript type 'list', and I'm not sure why! I think it might have something to do with my meansAndStdevs list having NAs in place of FALSEs. Hoping for some guidance.
Here are the first few items in the features list (its class is actually "factor"):
features <- c("tBodyAcc-mean()-X", "tBodyAcc-mean()-Y", "tBodyAcc-mean()-Z",
"tBodyAcc-std()-X", "tBodyAcc-std()-Y", "tBodyAcc-std()-Z", "tBodyAcc-mad()-X",
"tBodyAcc-mad()-Y", "tBodyAcc-mad()-Z", "tBodyAcc-max()-X", "tBodyAcc-max()-Y",
"tBodyAcc-max()-Z", "tBodyAcc-min()-X", "tBodyAcc-min()-Y")
Here is the data in fileData: https://raw.githubusercontent.com/MDSilber/CourseProject/master/Dataset/test/X_test.txt
It's pretty large though, so here's some more info on it:
dput(fileData[1:5, 1:3])
structure(list(V1 = c(0.25717778, 0.28602671, 0.27548482, 0.27029822,
0.27483295), V2 = c(-0.02328523, -0.013163359, -0.02605042, -0.032613869,
-0.027847788), V3 = c(-0.014653762, -0.11908252, -0.11815167,
-0.11752018, -0.12952716)), .Names = c("V1", "V2", "V3"), row.names = c("1",
"2", "3", "4", "5"), class = "data.frame")
This is a table of 561 columns. I'm trying to extract the columns that correspond to the TRUE values of the meansAndStdevs vector and create a new data frame in fileDataSubset from that.
Thanks in advance!

I figured out why it wasn't working. I should have been using grepl instead of grep: grepl returns a logical vector (which is what I wanted), whereas grep returns the indices of the matching elements. Thanks for all your help!
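The difference between the two functions can be seen in a short example. This is a sketch with a shortened features vector and a small made-up data frame standing in for fileData; since grepl() is vectorized, no sapply() is needed.

```r
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-Y",
              "tBodyAcc-mad()-Z", "tBodyAcc-max()-X")
meanAndStdevRegEx <- "(-mean\\(\\))|(-std\\(\\))"

# grep() returns the positions of the matches ...
matchPositions <- grep(meanAndStdevRegEx, features)

# ... while grepl() returns one TRUE/FALSE per element.
meansAndStdevs <- grepl(meanAndStdevRegEx, features)

# Made-up data with one column per feature.
fileData <- data.frame(V1 = 1:2, V2 = 3:4, V3 = 5:6, V4 = 7:8)

# drop = FALSE keeps a data frame even if only one column is selected.
fileDataSubset <- fileData[, meansAndStdevs, drop = FALSE]
```

With these inputs, matchPositions is 1 2, meansAndStdevs is TRUE TRUE FALSE FALSE, and fileDataSubset keeps the columns V1 and V2.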

When I run fileDataSubset <- fileData[, meansAndStdevs] on the sample data, I get an invalid columns error. This is because the logical vector meansAndStdevs has more elements than fileData has columns. You can take the part of meansAndStdevs that corresponds to your data, then subset fileData on that basis:
datacols <- meansAndStdevs[1:ncol(fileData)]
fileDataSubset <- fileData[, datacols]
I am assuming the following setup (showing for clarity because your post has them out of order):
fileData <- structure(list(V1 = c(0.25717778, 0.28602671, 0.27548482, 0.27029822,
0.27483295), V2 = c(-0.02328523, -0.013163359, -0.02605042, -0.032613869,
-0.027847788), V3 = c(-0.014653762, -0.11908252, -0.11815167,
-0.11752018, -0.12952716)), .Names = c("V1", "V2", "V3"), row.names = c("1",
"2", "3", "4", "5"), class = "data.frame")
features <- c("tBodyAcc-mean()-X", "tBodyAcc-mean()-Y", "tBodyAcc-mean()-Z",
"tBodyAcc-std()-X", "tBodyAcc-std()-Y", "tBodyAcc-std()-Z", "tBodyAcc-mad()-X",
"tBodyAcc-mad()-Y", "tBodyAcc-mad()-Z", "tBodyAcc-max()-X", "tBodyAcc-max()-Y",
"tBodyAcc-max()-Z", "tBodyAcc-min()-X", "tBodyAcc-min()-Y")
meanAndStdevRegEx <- "(-mean\\(\\))|(-std\\(\\))";
meansAndStdevs <- as.logical(sapply(features, function(f) { grep(meanAndStdevRegEx, f); }));
You can then see that the length of meansAndStdevs and the number of columns in fileData are different:
> length(meansAndStdevs)
[1] 14
> ncol(fileData)
[1] 3
This is why you need to subset meansAndStdevs before using it as a column index.

R: Rename or copy dataframe and naming it as defined in a vector

I want to create a new dataframe from an existing one and name it as defined in a vector:
I have a dataset with many different questions, and to go through the dataset a bit quicker, I have developed a list of generic functions that can be called upon. For each question, I define the specific values, such as can be seen below. In the second part, I more or less create a clean dataset for the question, which is saved as a dataframe called 'questionid'. Because that variable is overwritten with each question, I want to create a duplicate of this dataframe and call it as specified under 'questionname' (in this case "A1"). I find it very difficult to find easy ways to do that. I hope someone can help me.
# Specify vectors and variables
question <- "Would you recommend edX to a friend of you?"
questionname <- "A1"
edXid <- "i4x-DelftX-ET3034TUx-problem-b3d30df864ca41ffa0170e790f01a783_2_1"
clevels <- c("0 - Not at all likely", "1", "2", "3", "4", "5 - Neutral", "6", "7", "8", "9", "10 - Extremely likely")
csvname <- paste(questionname, ".csv", sep="")
pngname <- paste(questionname, ".png", sep="")
# Run code
questionid <- subset(allDatasolar, allDatasolar[,3]==edXid, select = -c(X,question))
questionid <- questionid[-grep("dummy", questionid$answer), ]
questionid <- droplevels(questionid)
# as.name(questionname) <- as.data.frame(questionid) # does not work
questionid$answer <- factor(questionid$answer, ordered=TRUE, levels=clevels)
write.csv(data.frame(summary(questionid$answer)), file = csvname)
png(file = pngname, width = 640)
barchart(questionid$answer, main = question, xlab = "", col='lightblue')
dev.off()

You're looking for assign
>question = "What do you need?"
>questionname = "A1"
>
>questionid = data.frame(question, x="minimal working example")
>
>assign(questionname, questionid)
>
>A1
           question                       x
1 What do you need? minimal working example
assign() takes a string (here, the value of a character variable) as its first argument and creates an object with that name holding a copy of whatever is passed as the second argument. You can therefore keep overwriting the questionid data frame while keeping one copy per question, each named after the current value of questionname.
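As a quick check of that behaviour, the sketch below uses a tiny stand-in data frame. It also shows a named list as one possible alternative for keeping many per-question data frames together; the name results is made up for illustration.

```r
questionname <- "A1"
questionid   <- data.frame(answer = c("1", "2"))  # stand-in for the cleaned data

# Create an object called A1 that holds a copy of questionid.
assign(questionname, questionid)

# get() retrieves an object by its name given as a string.
sameData <- identical(get(questionname), questionid)

# Alternative: store each question's data frame in a named list.
results <- list()
results[[questionname]] <- questionid
```

After this runs, A1 exists as its own data frame, sameData is TRUE, and results[["A1"]] holds the same data without creating a separate variable per question.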
