Scripts for publishing. Contribute to jobgemws/Publications development by creating an account on GitHub.
Sooner or later, any information system gets a database, often – more than one. With time, that database gathers very much data, from several GBs to dozens of TBs. To understand how the functionals will perform with the data volumes increasing, we need to generate the data to fill that database.
All scripts presented and implemented will execute on the JobEmplDB database of a recruiting service. The database realization is available here.
The database development and testing involve two primary approaches to filling in data:
However, this approach does not let you increase the database itself many times. It becomes problematic to predict changes in the entire information system’s functionality in the future.
On the other way, you can analyze impersonal sanitized data taken from the production database. Basing on them, you can define how to generate the test data that would be like the real data by their looks, properties, and interrelations. This way, Approach 1 produces Approach 2.
Now, let’s review in detail both the approaches to the data filling in databases for testing and development.
First, let’s define the general algorithm of copying and changing the data from the production environment.
The general algorithm is as follows:
After that stage, you can return to step 5 again for the tested database in the protected non-production environment. Then, you forward the tested backup to the necessary environments to restore it and use it for development and testing.
We’ve presented the general algorithm of the production database’s data copying and changing. Let’s describe how to implement it.
The database is named JobEmplDB_Test. It has three filegroups:
Copying of the scheme and the necessary data from the production database into a newly created database To copy the scheme and the necessary data from the production database into the new one, you can use several tools. First, it is the Visual Studio (SSDT). Or, you can use third-party utilities like:
Making scripts for data changes Essential requirements for the data changes’ scripts
It must be impossible to restore the real data using that script. e.g., The lines’ inversion won’t suit, as it allows us to restore the real data. Usually, the method is to replace each character or byte by a pseudorandom character or byte. The same applies to the date and time.
The data changing must not alter the selectivity of their values. It won’t work to assign NULL to the table’s field. Instead, you have to ensure that the same values in the real data would remain the same in the altered data. For instance, in real data, you have a value of 103785 found 12 times in the table. When you change this value in the altered data, the new value must remain 12 times in the same fields of the table.
The size and length of the values should not differ significantly in the altered data. E.g., you replace each byte or character with a pseudorandom byte or character. The initial string remains the same in size and length.
Interrelations in the data must not be broken after the changes. It relates to the external keys and all other cases where you refer to the altered data. Changed data must stay in the same relations as the real data were.
Data changes scripts implementation.
Now, let’s review the particular case of the data changing to depersonalize and hide the secret information. The sample is the recruitment database.
The sample database includes the following personal data that you need to depersonalize:
Last and first name;
Date of birth;
The ID-card issue date;
The remote access certificate as the bytes sequence;
The service fee for resume promotion. First, we’ll check simple examples for each type of the altered data:
Date and time change;
Numerical value change;
Changing the byte sequences;
Character data change.
Date and time change You can get a random date and time using the following script:
DECLARE @dt DATETIME; SET @dt = CAST(CAST(@StartDate AS FLOAT) + (CAST(@FinishDate AS FLOAT) - CAST(@StartDate AS FLOAT)) * RAND(CHECKSUM(NEWID())) AS DATETIME);
Here, @StartDate and @FinishDate are the start and end values of the range. They correlate respectively for the pseudorandom date and time generation.
UPDATE [dbo].[Employee] SET [DocDate] = CAST(CAST(CAST(CAST([BirthDate] AS DATETIME) AS FLOAT) + (CAST(GETDATE() AS FLOAT) - CAST(CAST([BirthDate] AS DATETIME) AS FLOAT)) * RAND(CHECKSUM(NEWID())) AS DATETIME) AS DATE);
The [DocDate] field stands for the document issue date. We replace it with a pseudorandom date, keeping in mind the dates’ ranges and their limitations.
The “bottom” limit is the candidate’s date of birth. The “upper” edge is the current date. We don’t need the time here, so the time and date format transformation to the necessary date comes in the end. You can get pseudorandom values for any part of the date and time in the same way.
Numerical value change
You can get a random integer with the help of the following script:
DECLARE @Int INT; SET @Int = CAST(((@MaxVal + 1) - @MinVal) * RAND(CHECKSUM(NEWID())) + @MinVal AS INT);
@MinVal и @MaxVal are the start and end range’s values for the pseudorandom number generation. We generate it using the system functions RAND, CHECKSUM, and NEWID.
UPDATE [dbo].[Employee] SET [CountRequest] = CAST(((@MaxVal + 1) - @MinVal) * RAND(CHECKSUM(NEWID())) + @MinVal AS INT);
The [CountRequest] field stands for the number of requests that companies make for this candidate’s resume.
Similarly, you can get pseudorandom values for any numerical value. E.g., have a look at the random number of the decimal type (18,2) generation:
DECLARE @Dec DECIMAL(18,2); SET @Dec=CAST(((@MaxVal + 1) - @MinVal) * RAND(CHECKSUM(NEWID())) + @MinVal AS DECIMAL(18,2));
Thus, you can update the resume promotion service fee in the following way:
UPDATE [dbo].[Employee] SET [PaymentAmount] = CAST(((@MaxVal + 1) - @MinVal) * RAND(CHECKSUM(NEWID())) + @MinVal AS DECIMAL(18,2));
Changing the byte sequences You can get a random byte sequence using the following script:
DECLARE @res VARBINARY(MAX); SET @res = CRYPT_GEN_RANDOM(@Length, CAST(NEWID() AS VARBINARY(16)));
@Length stands for the sequence’s length. It defines the number of bytes returned. Here, @Length must not be larger than 16.
Generation is done with the help of the system functions CRYPT_GEN_RANDOM and NEWID. E.g., you can update the remote access certificate for each candidate in the following way:
UPDATE [dbo].[Employee] SET [RemoteAccessCertificate] = CRYPT_GEN_RANDOM(CAST(LEN([RemoteAccessCertificate]) AS INT), CAST(NEWID() AS VARBINARY(16)));
We generate a pseudorandom byte sequence of the same length present in the [RemoteAccessCertificate] field at the time of the change. We suppose that the byte sequence length does not exceed 16.
Similarly, we can create our function that will return pseudorandom byte sequences of any length. It will put the results of the system function CRYPT_GEN_RANDOM work together using the simple “+” addition operator. But 16 bytes are usually enough in practice.
Let’s make a sample function returning the pseudorandom byte sequence of the definite length, where it will be possible to set the length of more than 16 bytes. For this, make the following presentation:
CREATE VIEW [test].[GetNewID] AS SELECT NEWID() AS [NewID]; GO
We need it to evade the limitation forbidding to use NEWID within the function.
In the same way, create the next presentation for the same purpose:
CREATE VIEW [test].[GetRand] AS SELECT RAND(CHECKSUM((SELECT TOP(1) [NewID] FROM [test].[GetNewID]))) AS [Value]; GO
Create one more presentation:
CREATE VIEW [test].[GetRandVarbinary16] AS SELECT CRYPT_GEN_RANDOM(16, CAST((SELECT TOP(1) [NewID] FROM [test].[GetNewID]) AS VARBINARY(16))) AS [Value]; GO
First, we define if the necessary function is present. If not – we create a stud first. In any case, the code involves changing the function’s definition appropriately. In the end, we add the function’s description via the extended properties. More details on the database’s documentation are in this article.
To update the remote access certificate for each candidate, you can do as follows:
UPDATE [dbo].[Employee] SET [RemoteAccessCertificate] = [test].[GetRandVarbinary](CAST(LEN([RemoteAccessCertificate]) AS INT));
As you see, there are no limitations to the byte sequence length here.
Originally posted at https://codingsight.com/synthetic-data-generation-part-1-data-copying/ Author Evgeniy Gribkov
SQL stands for Structured Query Language. SQL is a scripting language expected to store, control, and inquiry information put away in social databases. The main manifestation of SQL showed up in 1974, when a gathering in IBM built up the principal model of a social database. The primary business social database was discharged by Relational Software later turning out to be Oracle.
This is part 3 of “MS SQL Server- Zero to Hero” and in this article, we will be discussing about the SCHEMAS in SQL SERVER. Before getting into this article, please consider to visit previous articles in this series from below.
Understanding various SQL Server system databases and their roles is an excellent aid for managing your SQL Server instance.
The article provides a thorough guide to SQL Server graph databases with their advantages and downsides, cases of use and useful comparisons. Would SQL Server graph database features fit your next project? While you may not know the definitive answer right now, you might be wondering, “What problems does it solve?”.
It is crucial to back up the SQL Server master database daily. Read to know how to restore the master database to the instance.